pandas: Python read txt files into a dataframe
Original source: http://stackoverflow.com/questions/33912773/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Python read txt files into a dataframe
Asked by OAK
I am attempting to ingest txt files (an entire directory) into a pandas dataframe such that each row in the data frame has the content of one file.
The text files, as far as I can tell, are not delimited; they are the bodies of email messages. All files but one are split into many rows, so instead of having 20-something rows (one per file) I have over 500. I cannot tell how the one file differs from the rest. They are all plain text.
The code I am using is:
import csv
import pandas as pd
# files: list of file names in the directory (defined elsewhere)
list_ = []
for i in files:
    list_.append(pd.read_csv('//directory'+i, sep="\t", quoting=csv.QUOTE_NONE, header=None, names=["message", "label"]))
I've set the separator to tab, as I think it will not affect the ingestion of the text at all. Any ideas what the problem is here?
Answered by paul-g
You are reading the emails as CSV files, so the file contents will be:
Split at every tab separator to create a column; whatever separator you chose, I suspect it will be a bad choice, since any character is likely to appear in the body of your email;
Every newline in the email will create a new row (which probably explains your 500 rows)
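To make the two effects above concrete, here is a minimal sketch. It feeds a short made-up email body (sample_email is an assumption, not data from the question) through the same read_csv options used in the question, then compares the result with keeping the text as a single string.

```python
import csv
import io

import pandas as pd

# Hypothetical email body: three lines of text, with one tab inside the second line.
sample_email = "Hi Bob,\nSee you at noon.\tBring the report.\nThanks\n"

# Same options as in the question: tab separator, no quote handling.
as_csv = pd.read_csv(
    io.StringIO(sample_email),
    sep="\t",
    quoting=csv.QUOTE_NONE,
    header=None,
    names=["message", "label"],
)
print(len(as_csv))  # one row per line of the email, not one row per file
print(as_csv)       # the tab pushed part of line two into the "label" column

# Keeping the whole text as a single string puts it in one cell.
as_text = pd.DataFrame({"message": [sample_email]})
print(len(as_text))
```

Under these assumptions the parsed frame has one row per line of the message, which is consistent with roughly 20 files turning into more than 500 rows.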
Since emails are not CSV files, why not write your own function to read each file individually into a string, then create a data frame out of all of these strings? For example, to read all the files in the current directory as strings:
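import os
import pandas as pd

data = []
path = '.'
# Keep only regular files; joining with path makes this work for any directory
files = [f for f in os.listdir(path) if os.path.isfile(os.path.join(path, f))]
for f in files:
    with open(os.path.join(path, f), "r") as myfile:
        data.append(myfile.read())  # the whole file becomes one string
df = pd.DataFrame(data)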
Here is an example of this in action:
$ ls .
test1.txt test2.txt load_files.py
$ cat load_files.py
import pandas as pd
import os

data = []
path = '.'
files = [f for f in os.listdir(path) if os.path.isfile(f)]
for f in files:
    with open(f, "r") as myfile:
        data.append(myfile.read())
df = pd.DataFrame(data)
print(df)
$ cat test1.txt
asdasd
ada
adasd
$ cat test2.txt
sasdad
asd
dadaadad
$ python load_files.py
0
0 asdasd\nada\nadasd\n
1 sasdad\nasd\ndadaadad\n\n
2 import pandas as pd\nimport os\n\ndata = []\np...
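Row 2 in the output above is the script itself, because every regular file in the directory is loaded. A minimal sketch of one way to avoid that, assuming the emails all carry a .txt extension, is to select files by pattern. The function name read_txt_files, the column name message, and the errors="replace" setting are illustrative choices, not part of the original answer.

```python
import tempfile
from pathlib import Path

import pandas as pd


def read_txt_files(directory):
    """Return a one-column DataFrame with one row per .txt file in directory."""
    # sorted() makes the row order deterministic; glob order is not guaranteed.
    paths = sorted(Path(directory).glob("*.txt"))
    # errors="replace" is an assumption: it avoids crashes on undecodable bytes.
    texts = [p.read_text(encoding="utf-8", errors="replace") for p in paths]
    return pd.DataFrame({"message": texts})


# Usage on a throwaway directory that mimics the listing above.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "test1.txt").write_text("asdasd\nada\nadasd\n", encoding="utf-8")
    Path(tmp, "test2.txt").write_text("sasdad\nasd\ndadaadad\n\n", encoding="utf-8")
    Path(tmp, "load_files.py").write_text("print('not an email')\n", encoding="utf-8")
    df = read_txt_files(tmp)

print(df.shape)  # only the two .txt files are loaded
```

Filtering by extension is only as reliable as the naming of the files; if the emails have no common suffix, an explicit list of file names may be the safer choice.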
Answered by bradchattergoon
After reading the answer by @paul-g, I decided to go about it a little differently. For context, my application is an NLP project. My files had unique identifiers, so the list approach wasn't quite what I was looking for, and I used a dictionary instead, with the file name as the unique identifier. Note that you may have to do additional cleaning if your directory has files beyond the ones you want to load. My directory had only my text files. Unlike the ls example in @paul-g's answer, my Python files were in a different directory, so no Python file was included in my data frame.
import pandas as pd
import os

folder = '<folder file path here>'
file_names = os.listdir(folder)

# Create Dictionary for File Name and Text
file_name_and_text = {}
for file in file_names:
    # os.path.join adds the path separator if the folder path lacks one
    with open(os.path.join(folder, file), "r") as target_file:
        file_name_and_text[file] = target_file.read()

file_data = (pd.DataFrame.from_dict(file_name_and_text, orient='index')
             .reset_index().rename(index = str, columns = {'index': 'file_name', 0: 'text'}))
This will give you a data frame as follows:
index  file_name  text
0      file1.txt  This is text from file 1
1      file2.txt  This is text from file 2
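Since the file name is the unique identifier here, it can also serve as the index for lookups. The following is a minimal sketch, with a hand-written dictionary standing in for the one built from real files (its contents are made up), that builds the same two columns directly from the (name, text) pairs and then fetches one text by file name.

```python
import pandas as pd

# Stand-in for the dictionary built in the loop above (contents are made up).
file_name_and_text = {
    "file1.txt": "This is text from file 1",
    "file2.txt": "This is text from file 2",
}

# Build the two named columns directly from the (name, text) pairs.
file_data = pd.DataFrame(
    list(file_name_and_text.items()), columns=["file_name", "text"]
)

# With file_name as the index, a text can be fetched by its identifier.
by_name = file_data.set_index("file_name")
print(by_name.loc["file2.txt", "text"])
```

Whether to keep file_name as a column or as the index is a design choice: a column is convenient for joins and exports, while an index makes per-file lookups shorter.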