Not reading all rows while importing csv into pandas dataframe

Note: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/33161769/

Time: 2020-09-14 00:03:22  Source: igfitidea

Tags: python-3.x, csv, pandas, machine-learning, kaggle

Asked by imba22

I am trying the Kaggle challenge here, and unfortunately I am stuck at a very basic step. My limited Python knowledge is to blame. I am trying to read the dataset into a pandas dataframe by executing the following command:

test = pd.DataFrame.from_csv("C:/Name/DataMining/hillary/data/output/emails.csv")

The problem is that this file, as you will find out, has over 300,000 records, but I am reading only 7,945 rows and 21 columns.

print (test.shape)
(7945, 21)

I have double-checked the file and cannot find anything special about line number 7945. Any pointers as to why this could be happening? It seems like a very ordinary situation, and I hope some of you who have run into this error can help me out.

Answered by jezrael

I think it is better to use the function read_csv with the parameters quoting=csv.QUOTE_NONE and error_bad_lines=False:

import pandas as pd
import csv

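# On pandas >= 1.3, use on_bad_lines="skip" instead of error_bad_lines=False
# (error_bad_lines was deprecated in 1.3 and removed in 2.0).
test = pd.read_csv("output/Emails.csv", quoting=csv.QUOTE_NONE, error_bad_lines=False)

print(test.shape)
# (381422, 22)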
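
To get a feel for what quoting=csv.QUOTE_NONE changes, here is a minimal sketch using only the standard-library csv module on a made-up three-column sample (the sample text and variable names are illustrative assumptions, not taken from the Kaggle file). With default quoting, a quoted field that contains line breaks stays inside one record; with QUOTE_NONE, every physical line becomes its own row.

```python
import csv
import io

# Made-up sample: the Body field of the first record spans three physical lines.
sample = (
    "Id,Subject,Body\n"
    '1,Hello,"line one\nline two\nline three"\n'
    "2,Bye,short\n"
)

# Number of physical lines in the text (header included).
physical_lines = sample.splitlines()

# Default quoting: the quoted multi-line field stays inside a single record.
records_default = list(csv.reader(io.StringIO(sample, newline="")))

# QUOTE_NONE: quote characters are treated as ordinary text,
# so each physical line is parsed as a separate row.
rows_quote_none = list(
    csv.reader(io.StringIO(sample, newline=""), quoting=csv.QUOTE_NONE)
)

print(len(physical_lines), len(records_default), len(rows_quote_none))
```

If the real file also contains quoted multi-line fields, its physical line count and its record count would differ in the same way, which is worth checking before deciding which row count is the right one.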

Note that the problematic lines will be skipped, so some of the data is lost.
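
As a rough illustration of which lines get skipped, the sketch below reads a small made-up CSV twice. It assumes pandas 1.3 or newer, where on_bad_lines="skip" takes the place of error_bad_lines=False; the sample text and variable names are hypothetical. A line with more fields than the header is dropped, while a line with fewer fields is kept and padded with NaN.

```python
import csv
import io

import pandas as pd

# Made-up sample: one quoted body spans three physical lines,
# and the middle line contains extra commas.
sample = (
    "Id,Subject,Body\n"
    '1,Hello,"first line\n'
    "second, with, extra, commas\n"
    'third line"\n'
    "2,Bye,short\n"
)

# Default quoting: the quoted body is kept as one field, giving two records.
strict = pd.read_csv(io.StringIO(sample))

# QUOTE_NONE splits on every physical line; the line with four fields
# exceeds the three header columns and is skipped as a bad line.
# (Assumption: pandas >= 1.3; older versions use error_bad_lines=False.)
loose = pd.read_csv(
    io.StringIO(sample),
    quoting=csv.QUOTE_NONE,
    on_bad_lines="skip",
)

print(strict.shape)
print(loose.shape)
```

In this sketch the second read yields more rows than the first, but one physical line is lost and the body text is scattered across rows, so a larger row count does not by itself mean more data was recovered.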

If you want to skip the email body data, you can use:

import pandas as pd
import csv

# With header=None, the file's own header line is read as data; names= supplies the column labels.
# On pandas >= 1.3, use on_bad_lines="skip" instead of error_bad_lines=False.
names = [
    "Id", "DocNumber", "MetadataSubject", "MetadataTo", "MetadataFrom", "SenderPersonId",
    "MetadataDateSent", "MetadataDateReleased", "MetadataPdfLink", "MetadataCaseNumber",
    "MetadataDocumentClass", "ExtractedSubject", "ExtractedTo", "ExtractedFrom", "ExtractedCc",
    "ExtractedDateSent", "ExtractedCaseNumber", "ExtractedDocNumber", "ExtractedDateReleased",
    "ExtractedReleaseInPartOrFull", "ExtractedBodyText", "RawText",
]
test = pd.read_csv("output/Emails.csv", quoting=csv.QUOTE_NONE, sep=",",
                   error_bad_lines=False, header=None, names=names)

print(test.shape)

# Drop rows with NaN in column MetadataFrom
test = test.dropna(subset=["MetadataFrom"])
# Drop the header row that was read in as data
test = test[test.MetadataFrom != "MetadataFrom"]
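
The two cleanup steps can be tried on a tiny made-up frame. This is only a sketch: the two-column sample and the variable names are assumptions for illustration. Because header=None is passed together with names=, the file's own header line arrives as an ordinary data row, which is why the comparison against the string 'MetadataFrom' is needed.

```python
import io

import pandas as pd

# Made-up sample: a header line, two usable rows, and one row with no sender.
raw = (
    "Id,MetadataFrom\n"
    "1,H\n"
    "2,\n"
    "3,abedin\n"
)

# header=None makes pandas treat the header line as data;
# names= supplies the column labels instead.
frame = pd.read_csv(io.StringIO(raw), header=None, names=["Id", "MetadataFrom"])

# Drop rows where MetadataFrom is missing (the empty field is read as NaN).
without_nan = frame.dropna(subset=["MetadataFrom"])

# Drop the header line that was read in as a data row.
cleaned = without_nan[without_nan.MetadataFrom != "MetadataFrom"]

print(frame.shape)
print(cleaned)
```

Since the header row puts a string into every column, the Id column is read as text here; converting it back to numbers would be a separate step.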