Python: combining 2 .csv files by a common column

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must likewise follow CC BY-SA and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/826812/

Date: 2020-11-03 20:56:04

Combining 2 .csv files by common column

python, shell, join, csv, debian

Asked by S.Lott

I have two .csv files where the first line in file 1 is:

MPID,Title,Description,Model,Category ID,Category Description,Subcategory ID,Subcategory Description,Manufacturer ID,Manufacturer Description,URL,Manufacturer (Brand) URL,Image URL,AR Price,Price,Ship Price,Stock,Condition

The first line from file 2:

Regular Price,Sale Price,Manufacturer Name,Model Number,Retailer Category,Buy URL,Product Name,Availability,Shipping Cost,Condition,MPID,Image URL,UPC,Description

and the rest of each file is filled with data rows.

As you can see, both files have a common field called MPID (file 1: col 1, file 2: col 11, where the first col is col 1).

I would like to create a new file which will combine these two files by looking at this column (as in: if there is an MPID that is in both files, then in the new file this MPID will appear with both its row from file 1 and its row from file 2). IF one MPID appears only in one file then it should also go into this combined file.

The files are not sorted in any way.

How do I do this on a Debian machine with either a shell script or Python?

Thanks.

EDIT: Neither file has commas other than the ones separating the fields.

Answer by Pierre

sort -t , -k index1,index1 file1 > sorted1
sort -t , -k index2,index2 file2 > sorted2
join -t , -1 index1 -2 index2 -a 1 -a 2 sorted1 sorted2
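As a self-contained sketch of this approach, with made-up, cut-down two-column samples standing in for the real files (for the actual files the key fields would be 1 and 11):

```shell
# Hypothetical two-column samples standing in for the real files
printf 'MPID,Title\nA1,Widget\nB2,Gadget\n' > file1.csv
printf 'Price,MPID\n9.99,A1\n5.00,C3\n' > file2.csv

# Strip the header rows and sort each file on its MPID field
tail -n +2 file1.csv | sort -t , -k 1,1 > sorted1
tail -n +2 file2.csv | sort -t , -k 2,2 > sorted2

# -a 1 -a 2 keeps unpairable rows from both sides (a full outer join);
# output is the key, then file 1's remaining fields, then file 2's
join -t , -1 1 -2 2 -a 1 -a 2 sorted1 sorted2
```

This only works because neither file contains embedded commas, as the asker notes in the EDIT.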

Answer by S.Lott

This is the classical "relational join" problem.

You have several algorithms.

  • Nested Loops. You read from one file to pick a "master" record. You read the entire other file locating all "detail" records that match the master. This is a bad idea.

  • Sort-Merge. You sort each file into a temporary copy based on the common key. You then merge both files by reading from the master and then reading all matching rows from the detail and writing the merged records.

  • Lookup. You read one of the files entirely into a dictionary in memory, indexed by the key field. This can be tricky for the detail file, where you'll have multiple children per key. Then you read the other file and look up the matching records in the dictionary.

Of these, sort-merge is often the fastest. This is done entirely using the Unix sort command.

Lookup Implementation

import csv
import collections

# Index every row of the first file by its MPID value
index = collections.defaultdict(list)

with open("someFile", newline="") as file1:
    for row in csv.DictReader(file1):
        index[row['MPID']].append(row)

# Walk the second file and look up the matching rows from the first
with open("anotherFile", newline="") as file2:
    for row in csv.DictReader(file2):
        print(row, index[row['MPID']])

Answer by Jonathan Leffler

You'll need to look at the join command in the shell. You will also need to sort the data, and probably lose the first lines. The whole process will fall flat if any of the data contains commas. Or you will need to process the data with a CSV-sensitive process that introduces a different field separator (perhaps control-A) that you can use to split fields unambiguously.

The alternative, using Python, reads the two files into a pair of dictionaries (keyed on the common column(s)) and then uses a loop to cover all the elements in the smaller of the two dictionaries, looking for matching values in the other. (This is basic nested-loop query processing.)
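A minimal sketch of that dictionary approach, with cut-down sample columns standing in for the real headers (it assumes MPID is unique within each file, and keeps rows that appear in only one file):

```python
import csv

# Hypothetical cut-down samples standing in for the real files
with open("file1.csv", "w", newline="") as f:
    f.write("MPID,Title\nA1,Widget\nB2,Gadget\n")
with open("file2.csv", "w", newline="") as f:
    f.write("Price,MPID\n9.99,A1\n5.00,C3\n")

def load(path, key):
    """Read a CSV file into (header fields, {key value: row dict})."""
    with open(path, newline="") as f:
        rdr = csv.DictReader(f)
        return rdr.fieldnames, {row[key]: row for row in rdr}

fields1, rows1 = load("file1.csv", "MPID")
fields2, rows2 = load("file2.csv", "MPID")

# Combined header: file 1's columns, then file 2's columns minus the key
out_fields = fields1 + [name for name in fields2 if name != "MPID"]

with open("combined.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, out_fields)   # missing fields default to ""
    writer.writeheader()
    for mpid in set(rows1) | set(rows2):     # MPIDs seen in either file
        merged = {"MPID": mpid}
        merged.update(rows2.get(mpid, {}))
        merged.update(rows1.get(mpid, {}))
        writer.writerow(merged)
```

If an MPID can repeat within a file, the dictionaries would need lists of rows per key, as in the lookup implementation above.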

Answer by galets

It seems that you're trying to do this in a shell script, a job which is more commonly done with an SQL server. Is it possible to use SQL for that task? For example, you could import both files into MySQL, then create a join, then export it back to CSV.
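A sketch of that idea using Python's built-in sqlite3 in place of a MySQL server (the table layouts and sample rows are made up; the union of two SELECTs emulates a full outer join, since older SQLite versions have no FULL OUTER JOIN):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE f1 (mpid TEXT, title TEXT)")
conn.execute("CREATE TABLE f2 (mpid TEXT, price TEXT)")

# Hypothetical rows standing in for the two imported CSV files
conn.executemany("INSERT INTO f1 VALUES (?, ?)",
                 [("A1", "Widget"), ("B2", "Gadget")])
conn.executemany("INSERT INTO f2 VALUES (?, ?)",
                 [("A1", "9.99"), ("C3", "5.00")])

# Left join, plus the rows that exist only in the second table
rows = conn.execute("""
    SELECT f1.mpid, f1.title, f2.price
      FROM f1 LEFT JOIN f2 ON f1.mpid = f2.mpid
    UNION ALL
    SELECT f2.mpid, NULL, f2.price
      FROM f2 WHERE f2.mpid NOT IN (SELECT mpid FROM f1)
""").fetchall()

for mpid, title, price in rows:
    print(mpid, title, price)
```

Exporting the result back to CSV is then one csv.writer.writerows(rows) call away.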

Answer by galets

You could take a look at my FOSS project CSVfix, which is a stream editor for manipulating CSV files. It supports joins, among its other features, and requires no scripting to use.

Answer by Felix Francis

For merging multiple files (even more than 2) based on one or more common columns, one of the more efficient approaches in Python is to use "brewery". You can even specify which fields need to be considered for merging and which fields need to be saved.

import brewery
from brewery import ds
import sys

sources = [
    {"file": "grants_2008.csv",
     "fields": ["receiver", "amount", "date"]},
    {"file": "grants_2009.csv",
     "fields": ["id", "receiver", "amount", "contract_number", "date"]},
    {"file": "grants_2010.csv",
     "fields": ["receiver", "subject", "requested_amount", "amount", "date"]}
]

Create a list of all the fields, and add a file name field to record the origin of each data record. Go through the source definitions and collect the fields:

all_fields = ["file"]

for source in sources:
    for field in source["fields"]:
        if field not in all_fields:
            all_fields.append(field)

out = ds.CSVDataTarget("merged.csv")
out.fields = brewery.FieldList(all_fields)
out.initialize()

for source in sources:
    path = source["file"]

    # Initialize the data source and skip the header row: we ignore the
    # fields in the header because we set the fields up previously.
    # (Use XLSDataSource for XLS files.)
    src = ds.CSVDataSource(path, read_header=False, skip_rows=1)
    src.fields = ds.FieldList(source["fields"])
    src.initialize()

    for record in src.records():
        # Add a file reference to the output, to know where the row came from
        record["file"] = path
        out.append(record)

    # Close the source stream
    src.finalize()


cat merged.csv | brewery pipe pretty_printer