Python: read and parse a TSV file, then manipulate it for saving as CSV (efficiently)

Question

Asked by CJH

My source data is in a TSV file, 6 columns and greater than 2 million rows.

Here's what I'm trying to accomplish:

1. I need to read the data in 3 of the columns (3, 4, 5) in this source file.
2. The fifth column is an integer. I need to use this integer value to duplicate a row entry built from the data in the third and fourth columns, once per unit of the integer (a value of 3 produces three copies).
3. I want to write the output of #2 to an output file in CSV format.

Below is what I came up with.

My question: is this an efficient way to do it? It seems like it might be intensive when attempted on 2 million rows.

First, I made a sample tab-separated file to work with and called it 'sample.txt'. It's basic and only has four rows:

Row1_Column1    Row1-Column2    Row1-Column3    Row1-Column4    2   Row1-Column6
Row2_Column1    Row2-Column2    Row2-Column3    Row2-Column4    3   Row2-Column6
Row3_Column1    Row3-Column2    Row3-Column3    Row3-Column4    1   Row3-Column6
Row4_Column1    Row4-Column2    Row4-Column3    Row4-Column4    2   Row4-Column6
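
The listing above shows the columns separated by spaces rather than tab characters. For anyone following along, here is a minimal sketch that recreates 'sample.txt' with real tabs. It assumes the four-row, six-column layout shown, with the repeat count in the fifth column; `build_rows` and `write_sample` are illustrative names, not part of the original post.

```python
import csv

# Assumption: repeat counts taken from the fifth column of the listing above.
counts = [2, 3, 1, 2]

def build_rows(counts):
    """Build the sample rows: six columns, repeat count in the fifth."""
    rows = []
    for i, count in enumerate(counts, start=1):
        row = [f"Row{i}-Column{c}" for c in range(1, 7)]
        row[0] = f"Row{i}_Column1"  # the listing uses an underscore in column 1
        row[4] = str(count)
        rows.append(row)
    return rows

def write_sample(path="sample.txt"):
    """Write the sample rows to `path` as tab-separated values."""
    with open(path, "w", newline="") as f:
        csv.writer(f, delimiter="\t").writerows(build_rows(counts))

if __name__ == "__main__":
    write_sample()
```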

then I have this code:

import csv 

with open('sample.txt','r') as tsv:
    AoA = [line.strip().split('\t') for line in tsv]

for a in AoA:
    count = int(a[4])
    while count > 0:
        with open('sample_new.csv', 'a', newline='') as csvfile:
            csvwriter = csv.writer(csvfile, delimiter=',')
            csvwriter.writerow([a[2], a[3]])
        count = count - 1

Answer 1

Accepted answer by Martijn Pieters

You should use the csv module to read the tab-separated value file. Do not read it into memory in one go. Each row you read has all the information you need to write rows to the output CSV file, after all. Keep the output file open throughout.

import csv

with open('sample.txt', newline='') as tsvin, open('new.csv', 'w', newline='') as csvout:
    tsvin = csv.reader(tsvin, delimiter='\t')
    csvout = csv.writer(csvout)

    for row in tsvin:
        count = int(row[4])
        if count > 0:
            csvout.writerows([row[2:4] for _ in range(count)])

Or, use the itertools module to do the repeating with itertools.repeat():

from itertools import repeat
import csv

with open('sample.txt', newline='') as tsvin, open('new.csv', 'w', newline='') as csvout:
    tsvin = csv.reader(tsvin, delimiter='\t')
    csvout = csv.writer(csvout)

    for row in tsvin:
        count = int(row[4])
        if count > 0:
            csvout.writerows(repeat(row[2:4], count))
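
To try the streaming approach without touching the disk, one option is to wrap the loop in a function and feed it in-memory files. This is only a sketch: the `expand_rows` helper and the two-row in-memory sample are assumptions made for illustration, not part of the original answer.

```python
import csv
import io
from itertools import repeat

def expand_rows(tsv_file, csv_file):
    """Stream rows from tsv_file; write columns 3 and 4 `count` times each."""
    reader = csv.reader(tsv_file, delimiter='\t')
    writer = csv.writer(csv_file)
    for row in reader:
        count = int(row[4])  # fifth column holds the repeat count
        if count > 0:
            writer.writerows(repeat(row[2:4], count))

# Assumption: the first two rows of the sample, with real tab characters.
sample = (
    "Row1_Column1\tRow1-Column2\tRow1-Column3\tRow1-Column4\t2\tRow1-Column6\n"
    "Row2_Column1\tRow2-Column2\tRow2-Column3\tRow2-Column4\t3\tRow2-Column6\n"
)

source = io.StringIO(sample, newline='')
out = io.StringIO(newline='')
expand_rows(source, out)
print(out.getvalue())
```

Because the function only needs file-like objects, the same call should work with the objects returned by open() for the real 2-million-row file, and only one input row is held in memory at a time.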
