bash: how to parse multi-line records (using awk?)

Question

Asked by Six

I'm trying to figure out how to extract particular fields from multi line records separated by \n\n.

In this instance, it happens to be output from apt-cache, akin to DEBIAN control files. See the output of apt-cache show "$package":

Package: caffeine
Priority: optional
Section: misc
Installed-Size: 641
Maintainer: Reuben Thomas <[email protected]>
Architecture: all
Version: 2.8.3
Depends: python3:any (>= 3.3.2-2~), python3, gir1.2-gtk-3.0, gir1.2-appindicator3-0.1, python3-xlib, python3-pkg-resources, libnet-dbus-perl
Filename: pool/main/c/caffeine/caffeine_2.8.3_all.deb
Size: 58774
MD5sum: 4438db3f6d1cf43a4f4b49cc7f24cda0
SHA1: e748370ac5ccd7de6fc9466ce0451d2e90d179d4
SHA256: ae303b4e32949cc1e1af80df7217e3406291679e3f18fa8f78a5bbb97504c4f6
Description-en: Prevent the desktop becoming idle in full-screen mode
 Caffeine stops the desktop becoming idle when an application
 is running full-screen. A desktop indicator ‘caffeine-indicator'
 supplies a manual toggle, and the command ‘caffeinate' can be used
 to prevent idleness for the duration of any command.
Description-md5: 7c14f8adc007b10f6ecafed36260bedb

Package: caffeine
Priority: optional
Section: misc
Installed-Size: 655
Maintainer: Reuben Thomas <[email protected]>
Architecture: all
Version: 2.6+555~ubuntu14.04.1
Depends: python:any (<< 2.8), python:any (>= 2.7.5-5~), python, gir1.2-gtk-2.0, gir1.2-appindicator3-0.1, x11-utils, python-dbus
Filename: pool/main/c/caffeine/caffeine_2.6+555~ubuntu14.04.1_all.deb
Size: 58604
MD5sum: 1051c3f7d40d344f986bb632d7436849
SHA1: 5e5f622595e8cbba8fb7468b3cffe2914b0ba110
SHA256: 11c5bbf2d28dcda6a7b82872195f740f1f79521b60d3c9acea3037bf0ab3a60e
Description: Prevent the desktop becoming idle
 Caffeine allows the user to prevent the desktop becoming idle,
 either manually or when certain applications are run. This
 prevents screen-blanking, locking, suspending, and so on.
Description-md5: 738866350e5086e77408d7a9c7ffa59b

Package: caffeine
Status: install ok installed
Priority: optional
Section: misc
Installed-Size: 794
Maintainer: Isaiah Heyer <[email protected]>
Architecture: all
Version: 2.4.1+478~raring1
Depends: dconf-gsettings-backend | gsettings-backend, python (>= 2.6), python-central (>= 0.6.11), python-xlib, python-appindicator, python-xdg, python-notify, python-kaa-metadata
Description: Caffeine
 A status bar application able to temporarily prevent the activation
 of both the screensaver and the "sleep" powersaving mode.
Description-md5: 1c29acf1ab0f2e6636db29fbde1d14a3
Homepage: https://launchpad.net/caffeine
Python-Version: >= 2.6

My desired output is one line per record in the format apt-get download $pkg=$ver -a=$arch. Basically a list of the installation commands for available packages...

So far what I've got is:

apt-cache show "$package" | awk '/^Package: / { print $2 } /^Version: / { print $2 } /^Architecture: / { print $2 }' | xargs -n3 | awk '{printf "apt-get download %s=%s -a=%s\n", $1, $3, $2}'

This is the actual output:

apt-get download caffeine=2.8.3 -a=all
apt-get download caffeine=2.6+555~ubuntu14.04.1 -a=all
apt-get download caffeine=2.4.1+478~raring1 -a=all

This is as desired, but it appears to be a fluke: it works only because the order of the fields is consistent in this example. It would break if the order of the fields were different.

I can do parsing like this using object orientation in Python, but I'm having difficulty getting it done in one awk command. The only way I can see to do this correctly would be to split each record into individual tmp files (using split or something along those lines) and then parse each file individually (which is straightforward). Obviously I'd really like to avoid unnecessary I/O, as this seems like something awk is well equipped for. Do any awk pros know how to solve this? I'd even be open to a Perl one-liner or to using bash, but I'm really interested in learning how to better leverage awk.

Answer 1

Accepted answer by John1024

$ package=sed
$ apt-cache show "$package" | awk '/^Package: /{p=$2} /^Version: /{v=$2} /^Architecture: /{a=$2} /^$/{print "apt-get download "p"="v" -a="a}'
apt-get download sed=4.2.1-10 -a=amd64

How it works

/^Package: /{p=$2}
Save the package information in variable p.
/^Version: /{v=$2}
Save the version information in variable v.
/^Architecture: /{a=$2}
Save the architecture information in variable a.
/^$/{print "apt-get download "p"="v" -a="a}
When we reach a blank line, print out the information in the desired form.
My version of apt-cache always outputs a blank line after each package. Your sample output is missing the last blank line. If your apt-cache genuinely does not produce that last blank line, then we will need to add a little bit more code to compensate.
As a matter of style, some may prefer printf to print. In which case, replace the above with:
```
/^$/{printf "apt-get download %s=%s -a=%s\n",p,v,a}
```
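
If the last blank line really is missing, one possible way to compensate is sketched below. It is not part of the original answer: it adds an END rule that flushes the final record, and clears the variables after each record so that nothing is printed twice. The function name emit_downloads is hypothetical and used only for illustration.

```shell
# Hypothetical helper: reads control-file style records on stdin and
# prints one "apt-get download" line per record.
emit_downloads() {
    awk '
        /^Package: /      { p = $2 }
        /^Version: /      { v = $2 }
        /^Architecture: / { a = $2 }
        # A blank line ends a record: print it if one was seen, then reset.
        /^$/ {
            if (p != "") print "apt-get download " p "=" v " -a=" a
            p = v = a = ""
        }
        # Input ended without a trailing blank line: flush the last record.
        END { if (p != "") print "apt-get download " p "=" v " -a=" a }
    '
}

# Usage: apt-cache show "$package" | emit_downloads
```

Because each field is stored by name until the record ends, this does not depend on the order in which Package, Version, and Architecture appear.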

Answer 2

Answered by Ed Morton

I find the best way to deal with data that contains name to value pairings is to create an array of those pairs and then just access the values by their names:

$ cat tst.awk
BEGIN { RS=""; FS="\n" }
{
    delete n2v
    for (i=1;i<=NF;i++) {
        if ($i !~ /^ /) {
            name = gensub(/:.*/,"","",$i)
            value = gensub(/[^:]+:\s+/,"","",$i)
            n2v[name] = value
        }
    }
    printf "apt-get download %s=%s -a=%s\n",
        n2v["Package"], n2v["Version"], n2v["Architecture"]
}

$ awk -f tst.awk file
apt-get download caffeine=2.8.3 -a=all
apt-get download caffeine=2.6+555~ubuntu14.04.1 -a=all
apt-get download caffeine=2.4.1+478~raring1 -a=all

The above uses a couple of gawk extensions but is easily adapted to any awk if necessary.
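
As a rough sketch of such an adaptation (my assumption of what it could look like, not code from the original answer), the gawk-only pieces can be swapped for portable ones: split("", n2v) clears the array instead of delete n2v, index() and substr() replace gensub(), and [ \t] replaces \s. The wrapper name parse_records is hypothetical.

```shell
# Hypothetical wrapper around a portable variant of tst.awk.
parse_records() {
    awk '
        BEGIN { RS = ""; FS = "\n" }            # paragraph mode: one record per block, one field per line
        {
            split("", n2v)                      # portable way to empty the array
            for (i = 1; i <= NF; i++) {
                if ($i !~ /^ /) {               # skip indented continuation lines
                    pos = index($i, ":")        # position of the first colon
                    if (pos == 0) continue      # not a "Name: value" line
                    name  = substr($i, 1, pos - 1)
                    value = substr($i, pos + 1)
                    sub(/^[ \t]+/, "", value)   # drop whitespace after the colon
                    n2v[name] = value
                }
            }
            printf "apt-get download %s=%s -a=%s\n",
                n2v["Package"], n2v["Version"], n2v["Architecture"]
        }
    '
}

# Usage: apt-cache show "$package" | parse_records
```

Splitting on the first colon only keeps values such as the Homepage URL intact, matching the behaviour of the gensub() calls above.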
