Notice: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow. Original URL: http://stackoverflow.com/questions/16248513/

Time: 2020-08-18 22:06:50  Source: igfitidea

Parse a plain text file into a CSV file using Python

Tags: python, csv

Question

I have a series of HTML files that are parsed into a single text file using Beautiful Soup. The HTML files are formatted such that their output is always three lines within the text file, so the output will look something like:

Hello!
How are you?
Well, Bye!

But it could just as easily be:

83957
And I ain't coming back!
hgu39hgd

In other words, the contents of the HTML files are not really standard across each of them, but they do always produce three lines.

So, I was wondering where I should start if I want to then take the text file that is produced from Beautiful Soup and parse that into a CSV file with columns such as (using the above examples):

Title     Intro                       Tagline
Hello!    How are you?                Well, Bye!
83957     And I ain't coming back!    hgu39hgd

The Python code that strips the HTML from the files and writes the text file is this:

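import os
import glob
import codecs
import csv
from bs4 import BeautifulSoup

# Raw string, so the backslashes are not treated as escape sequences.
path = r"c:\users\me\downloads"

for infile in glob.glob(os.path.join(path, "*.html")):
    # Read each HTML file as UTF-8 and close it when done.
    with codecs.open(infile, "r", "utf-8") as html_file:
        # Name the parser explicitly instead of relying on the default.
        soup = BeautifulSoup(html_file.read(), "html.parser")
    # Append the extracted text to a single UTF-8 output file.
    with codecs.open("extracted.txt", "a", "utf-8") as myfile:
        myfile.write(soup.get_text())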

And I gather I can use this to set up the columns in my CSV file:

csv.put_HasColumnNames(True)

csv.SetColumnName(0,"title")
csv.SetColumnName(1,"intro")
csv.SetColumnName(2,"tagline")

Where I'm drawing a blank is how to iterate through the text file (extracted.txt) one line at a time and, as I get to a new line, set it to the correct cell in the CSV file. The first several lines of the file are blank, and there are many blank lines between each grouping of text. So, first I would need to open the file and read it:

# Iterating over the file object reads it one line at a time.
with open("extracted.txt") as file:
    for line in file:
        pass # csv.SetCell(0, 0, X) (obviously, I don't know what to put in X)

Also, I don't know how to tell Python to just keep reading the file and adding to the CSV file until it's finished. In other words, there's no way to know exactly how many total lines will be in the HTML files, so I can't just go from csv.SetCell(0,0) to csv.SetCell(999,999).

Accepted answer by icktoofay

I'm not entirely sure what CSV library you're using, but it doesn't look like Python's built-in one. Anyway, here's how I'd do it:

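import csv

# Python 3 version: the built-in zip() is lazy, so itertools.izip
# (Python 2 only) is not needed.
with open('extracted.txt', 'r') as in_file:
    stripped = (line.strip() for line in in_file)
    lines = (line for line in stripped if line)
    # The same iterator passed three times makes zip() pull three
    # consecutive lines into each tuple.
    grouped = zip(*[lines] * 3)
    # newline='' lets the csv module control the line endings.
    with open('extracted.csv', 'w', newline='') as out_file:
        writer = csv.writer(out_file)
        writer.writerow(('title', 'intro', 'tagline'))
        writer.writerows(grouped)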

This makes a sort of pipeline. It first reads lines from the file, then strips leading and trailing whitespace from each line, then drops any empty lines, then groups the remaining lines into tuples of three, and finally (after writing the CSV header) writes those groups to the CSV file.
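
To see the grouping step in isolation, here is a minimal Python 3 sketch that runs the same pipeline on in-memory text instead of files. The sample text and the helper name group_lines are made up for illustration. Note that zip() silently drops a trailing group with fewer than three lines.

```python
import csv
import io


def group_lines(text_lines, size=3):
    """Return an iterator of tuples of `size` consecutive non-blank lines."""
    stripped = (line.strip() for line in text_lines)
    non_blank = (line for line in stripped if line)
    # The same iterator repeated `size` times: each tuple zip() builds
    # consumes `size` consecutive items from it.
    return zip(*[non_blank] * size)


# Hypothetical sample resembling extracted.txt (blank lines included).
sample = (
    "\n\nHello!\nHow are you?\n\n\nWell, Bye!\n\n"
    "83957\nAnd I ain't coming back!\nhgu39hgd\n"
)

rows = list(group_lines(io.StringIO(sample)))

# Write to an in-memory buffer instead of extracted.csv.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(("title", "intro", "tagline"))
writer.writerows(rows)
csv_text = out.getvalue()
print(csv_text)
```

Because "Well, Bye!" contains a comma, the csv writer quotes that field on its own; no manual escaping is needed.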

To combine the last two columns as you mentioned in the comments, you could change the writerow call in the obvious way and the writerows call to:

writer.writerows((title, intro + tagline) for title, intro, tagline in grouped)
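
Note that intro + tagline joins the two strings with nothing between them. If a separator is wanted, something like the following might do. This is only a sketch: the single-space separator, the "text" header name, and the sample rows are assumptions, not part of the answer.

```python
import csv
import io

# Hypothetical rows, shaped like the groups of three from the pipeline.
grouped = [
    ("Hello!", "How are you?", "Well, Bye!"),
    ("83957", "And I ain't coming back!", "hgu39hgd"),
]

out = io.StringIO()
writer = csv.writer(out)
# Two-column header to match the merged rows ("text" is a made-up name).
writer.writerow(("title", "text"))
# Assumption: intro and tagline are joined with a single space.
writer.writerows(
    (title, " ".join((intro, tagline))) for title, intro, tagline in grouped
)
merged_csv = out.getvalue()
print(merged_csv)
```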

Answer by Oscar Mederos

Perhaps I didn't understand you correctly, but you can do:

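with open("extracted.txt") as file:
    # Strip each line once, then keep only the non-empty ones.
    stripped = (line.strip() for line in file)
    lines = [line for line in stripped if line]

for i, line in enumerate(lines):
    # Assumes SetCell takes (row, column, value), as the question's
    # csv.SetCell(0,0 X) suggests: i // 3 is the row, i % 3 the column.
    csv.SetCell(i // 3, i % 3, line)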
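
If Python's standard csv module is used instead of a SetCell-style API, the same index arithmetic can be sketched by slicing the list of lines into rows of three. This is an illustration under that assumption; chunk_rows and the sample lines are hypothetical.

```python
import csv
import io


def chunk_rows(lines, width=3):
    """Split a list into consecutive rows of `width` items.

    Item i lands in row i // width, column i % width. A short final
    row is kept rather than dropped.
    """
    return [lines[i:i + width] for i in range(0, len(lines), width)]


# Hypothetical non-blank, already stripped lines.
lines = [
    "Hello!", "How are you?", "Well, Bye!",
    "83957", "And I ain't coming back!", "hgu39hgd",
]

rows = chunk_rows(lines)

# Write to an in-memory buffer; a real script would open a file
# with newline='' instead.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["title", "intro", "tagline"])
writer.writerows(rows)
result = out.getvalue()
print(result)
```

Unlike grouping with zip(), slicing keeps an incomplete last row, which makes a stray extra line in the input easier to spot in the output.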