Conditional awk hashmap match lookup in bash

Note: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/13569913/


Conditional Awk hashmap match lookup

Tags: linux, perl, bash, unix, awk

Asked by user836087

I have 2 tabular files. One file, lookup_file.txt, contains a mapping of only 50 key values. The other file, data.txt, has the actual tabular data with 30 columns and millions of rows. I would like to replace the id column of the second file with the values from lookup_file.txt.


How can I do this? I would prefer using awk in a bash script. Also, is there a hashmap data structure I can use in bash for storing the 50 key/values rather than another file?

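As to the last part of the question: bash 4 and later do have associative arrays, which can act as an in-script hashmap for the 50 key/value pairs. A minimal sketch, assuming lookup_file.txt holds comma-separated key,value lines (the format is an assumption; none of the answers below take this route):

declare -A map                      # bash 4+ associative array (hashmap)
while IFS=, read -r key value; do
    map["$key"]="$value"            # store each key -> value pair
done < lookup_file.txt

echo "${map[3]}"                    # look up a single id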

Accepted answer by Ed Morton

Assuming your files have comma-separated fields and the "id column" is field 3:


awk '
BEGIN{ FS=OFS="," }
NR==FNR { map[$1] = $2; next }   # first file: build the id -> value map
{ $3 = map[$3]; print }          # second file: swap the id column (field 3)
' lookup_file.txt data.txt

If any of those assumptions are wrong, clue us in if the fix isn't obvious...

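To make the data flow concrete, here is a small made-up example with comma-separated rows (these sample lines are not from the question; the id sits in column 3 of data.txt):

lookup_file.txt:
1,homer
2,marge

data.txt:
snowball,dog,1,4
frosty,yeti,2,245

The first (NR==FNR) pass fills map["1"]="homer" and map["2"]="marge"; the second pass replaces field 3 of each data.txt line and prints:

snowball,dog,homer,4
frosty,yeti,marge,245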

EDIT: and if you want to avoid the (IMHO negligible) performance impact of the NR==FNR test, this would be one of those very rare cases where use of getline is appropriate:


awk '
BEGIN{
   FS=OFS=","
   # read the whole lookup file up front, building the id -> value map
   while ( (getline line < "lookup_file.txt") > 0 ) {
      split(line,f)
      map[f[1]] = f[2]
   }
}
{ $3 = map[$3]; print }
' data.txt

Answer by tunagami

You could use a mix of "sort" and "join" via bash instead of having to write it in awk/sed, and it is likely to be even faster:


key.cvs (id, name)


1,homer
2,marge
3,bart
4,lisa
5,maggie

data.cvs (name,animal,owner,age)


snowball,dog,3,1
frosty,yeti,1,245
cujo,dog,5,4

Now, you need to sort both files first on the user id columns:


cat key.cvs | sort -t, -k1,1 > sorted_keys.cvs
cat data.cvs | sort -t, -k3,3 > sorted_data.cvs

Now join the 2 files:


join -1 1 -2 3 -o "2.1 2.2 1.2 2.4" -t , sorted_keys.cvs sorted_data.cvs > replaced_data.cvs

This should produce:


frosty,yeti,homer,245
snowball,dog,bart,1
cujo,dog,maggie,4

This:


-o "2.1 2.2 1.2 2.4"

is telling join which columns from the 2 files you want in the final output, in file.field form (2.1 is the first field of sorted_data.cvs, 1.2 is the second field of sorted_keys.cvs, and so on).

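Putting the whole thing together, a minimal sketch you could paste into a shell (it just recreates the sample key.cvs and data.cvs shown above and runs the commands already given):

printf '1,homer\n2,marge\n3,bart\n4,lisa\n5,maggie\n' > key.cvs
printf 'snowball,dog,3,1\nfrosty,yeti,1,245\ncujo,dog,5,4\n' > data.cvs

sort -t, -k1,1 key.cvs  > sorted_keys.cvs
sort -t, -k3,3 data.cvs > sorted_data.cvs
join -1 1 -2 3 -o "2.1 2.2 1.2 2.4" -t , sorted_keys.cvs sorted_data.cvs > replaced_data.cvs

cat replaced_data.cvs   # rows come out in join-key order (1, 3, 5)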

It is pretty fast for finding and replacing multiple gigabytes of data compared to other scripting languages. I haven't done a direct comparison to sed/awk, but it is much easier to write a bash script wrapping this than to write it in sed/awk (for me at least).


Also, you can speed up the sort by using a recent version of GNU coreutils so that you can run the sort in parallel:


cat data.cvs | sort --parallel=4 -t, -k3,3 > sorted_data.cvs

4 being how many threads you want to run it with. The recommendation I was given is that 2 threads per machine core will usually max out the machine, but if the machine is dedicated just to this, that is fine.


Answer by matchew

There are several ways to do this. But if you want an easy one-liner, without much in the way of validation, I would go with an awk/sed solution.


Assume the following:


  1. the files are tab delimited

  2. you are using bash shell

  3. the id in the data file is in the first column

  4. your files look like this:


lookup


1   one
2   two
3   three
4   four
5   five

data


1   col2    col3    col4    col5
2   col2    col3    col4    col5
3   col2    col3    col4    col5
4   col2    col3    col4    col5
5   col2    col3    col4    col5

I would use awk and sed to accomplish this task, like this:


awk '{print "sed -i s/^"$1"/"$2"/ data"}' lookup | bash

What this is doing is going through each line of lookup and writing the following to stdout:


sed -i s/^1/one/ data


sed -i s/^2/two/ data


and so on.


It next pipes each line to the shell (| bash), which will execute the sed expression. -i is for in-place editing; you may want -i.bak to create a backup file. Note you can change the extension to whatever you would like. The sed is looking for the id at the start of the line, as indicated by the ^. You don't want to be replacing an 'id' in a column that might not contain an id.


Your output would look like the following:


one     col2    col3    col4    col5
two     col2    col3    col4    col5
three   col2    col3    col4    col5
four    col2    col3    col4    col5
five    col2    col3    col4    col5

Of course, your ids are probably not simply 1 to one, 2 to two, etc., but this might get you started in the right direction. And I use the term "right" very loosely.

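If you want to try this end to end, a minimal sketch (it recreates the tab-separated lookup and data samples above; GNU sed is assumed, since -i without a suffix is a GNU extension):

printf '1\tone\n2\ttwo\n3\tthree\n4\tfour\n5\tfive\n' > lookup
printf '%s\tcol2\tcol3\tcol4\tcol5\n' 1 2 3 4 5 > data

awk '{print "sed -i s/^"$1"/"$2"/ data"}' lookup | bash   # one sed command per lookup line

cat data   # column 1 now reads one, two, three, four, five

These simple keys and values contain no characters that are special to sed or the shell; real ids and names may need escaping or quoting before being spliced into a sed command like this.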

Answer by rici

The way I'd do this is to use awk to write an awk program to process the larger file:


awk -f <(awk '
   BEGIN{print " BEGIN{"}
        {printf "      a[\"%s\"]=\"%s\";",$1,$2}
   END  {print "      }";
         print "      {$1=a[$1];print $0}"}
   ' lookup_file.txt
) data.txt

That assumes that the id column is column 1; if not, you need to change both instances of $1 in $1=a[$1].

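To see what the outer awk actually executes, suppose lookup_file.txt contains whitespace-separated pairs such as "1 one" and "2 two" (a made-up sample). The inner awk would then emit a program roughly like the following, which -f picks up through the process substitution:

 BEGIN{
      a["1"]="one";      a["2"]="two";      }
      {$1=a[$1];print $0}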