Python: reading and parsing a TSV file, then manipulating it for saving as CSV (*efficiently*)

Disclaimer: this page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/13992971/

Time: 2020-08-18 10:07:21  Source: igfitidea

reading and parsing a TSV file, then manipulating it for saving as CSV (*efficiently*)

Tags: python, file, csv, tab-delimited-text

Asked by CJH

My source data is in a TSV file, 6 columns and greater than 2 million rows.

Here's what I'm trying to accomplish:

  1. I need to read the data in 3 of the columns (3, 4, 5) in this source file
  2. The fifth column is an integer. I need to use this integer value to decide how many times to repeat a row entry built from the data in the third and fourth columns (one copy per unit of the integer).
  3. I want to write the output of #2 to an output file in CSV format.

Below is what I came up with.

My question: is this an efficient way to do it? It seems like it might be resource-intensive when run on 2 million rows.

First, I made a sample tab-separated file to work with, and called it 'sample.txt'. It's basic and only has four rows:

Row1_Column1    Row1-Column2    Row1-Column3    Row1-Column4    2   Row1-Column6
Row2_Column1    Row2-Column2    Row2-Column3    Row2-Column4    3   Row2-Column6
Row3_Column1    Row3-Column2    Row3-Column3    Row3-Column4    1   Row3-Column6
Row4_Column1    Row4-Column2    Row4-Column3    Row4-Column4    2   Row4-Column6

Then I have this code:

import csv 

with open('sample.txt','r') as tsv:
    AoA = [line.strip().split('\t') for line in tsv]

for a in AoA:
    count = int(a[4])
    while count > 0:
        with open('sample_new.csv', 'a', newline='') as csvfile:
            csvwriter = csv.writer(csvfile, delimiter=',')
            csvwriter.writerow([a[2], a[3]])
        count = count - 1

Accepted answer by Martijn Pieters

You should use the csv module to read the tab-separated value file. Do not read it into memory in one go. After all, each row you read has all the information you need to write rows to the output CSV file. Keep the output file open throughout.

import csv

with open('sample.txt', newline='') as tsvin, open('new.csv', 'w', newline='') as csvout:
    tsvin = csv.reader(tsvin, delimiter='\t')
    csvout = csv.writer(csvout)

    for row in tsvin:
        count = int(row[4])
        if count > 0:
            csvout.writerows([row[2:4] for _ in range(count)])
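
As a small variation on the block above (a sketch, not part of the original answer): writerows() accepts any iterable of rows, so a generator expression can stand in for the temporary list. The function name expand_rows and the in-memory io.StringIO buffers are assumptions made here so the snippet runs without files on disk.

```python
import csv
import io


def expand_rows(tsv_file, csv_file):
    """Write columns 3 and 4 of each TSV row to csv_file, once per unit of column 5."""
    reader = csv.reader(tsv_file, delimiter='\t')
    writer = csv.writer(csv_file)
    for row in reader:
        count = int(row[4])
        # range(count) is empty for count <= 0, so such rows write nothing
        writer.writerows(row[2:4] for _ in range(count))


# Two of the sample rows from the question, with real tab characters
sample = io.StringIO(
    "Row1_Column1\tRow1-Column2\tRow1-Column3\tRow1-Column4\t2\tRow1-Column6\n"
    "Row2_Column1\tRow2-Column2\tRow2-Column3\tRow2-Column4\t3\tRow2-Column6\n",
    newline='',
)
out = io.StringIO(newline='')
expand_rows(sample, out)
result = out.getvalue()
print(result)
```

Only one input row is held in memory at a time, which is the point of the streaming approach for a file with millions of rows.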

Or, using the itertools module to do the repeating with itertools.repeat():

from itertools import repeat
import csv

with open('sample.txt', newline='') as tsvin, open('new.csv', 'w', newline='') as csvout:
    tsvin = csv.reader(tsvin, delimiter='\t')
    csvout = csv.writer(csvout)

    for row in tsvin:
        count = int(row[4])
        if count > 0:
            csvout.writerows(repeat(row[2:4], count))
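
To check this approach against the four-row sample from the question, something like the following sketch can be used. The function name tsv_to_csv, the temporary directory, and the way the sample file is regenerated are assumptions for illustration; the core loop is the itertools.repeat() version shown above.

```python
import csv
import os
import tempfile
from itertools import repeat


def tsv_to_csv(src_path, dst_path):
    """Stream a TSV file to CSV, repeating columns 3-4 by the integer in column 5."""
    with open(src_path, newline='') as tsvin, open(dst_path, 'w', newline='') as csvout:
        writer = csv.writer(csvout)
        for row in csv.reader(tsvin, delimiter='\t'):
            count = int(row[4])
            if count > 0:
                writer.writerows(repeat(row[2:4], count))


# Repeat counts taken from the fifth column of the question's sample rows
counts = [2, 3, 1, 2]

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, 'sample.txt')
    dst = os.path.join(tmp, 'new.csv')

    # Recreate the sample file with real tab separators
    with open(src, 'w', newline='') as f:
        for i, n in enumerate(counts, start=1):
            fields = [f'Row{i}_Column1', f'Row{i}-Column2', f'Row{i}-Column3',
                      f'Row{i}-Column4', str(n), f'Row{i}-Column6']
            f.write('\t'.join(fields) + '\n')

    tsv_to_csv(src, dst)

    # Read the result back to inspect it
    with open(dst, newline='') as f:
        output_rows = list(csv.reader(f))

print(len(output_rows))  # 8 rows: 2 + 3 + 1 + 2
```

Each output row holds only the third and fourth columns, so the first row read back is ['Row1-Column3', 'Row1-Column4'].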