pandas 熊猫,阅读 CSV 忽略额外的逗号
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/48668125/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Pandas, read CSV ignoring extra commas
提问by MikeS159
I am reading a CSV file with 8 columns into Pandas data frame. The final column contains an error message, some of which contain commas. This causes the file read to fail with the error ParserError: Error tokenizing data. C error: Expected 8 fields in line 21922, saw 9
我正在将一个包含 8 列的 CSV 文件读入 Pandas 数据框中。最后一列包含错误消息,其中一些包含逗号。这会导致文件读取失败并出现错误ParserError: Error tokenizing data. C error: Expected 8 fields in line 21922, saw 9
Is there a way to ignore all commas after the 8th field, rather than having to go through the file and remove excess commas?
有没有办法忽略第 8 个字段之后的所有逗号,而不必遍历文件并删除多余的逗号?
Code to read file:
读取文件的代码:
import pandas as pd
df = pd.read_csv('C:\somepath\output.csv')
Line that works:
有效的行:
061AE,Active,001,2017_02_24 15_18_01,00006,1,00013,some message
Line that fails:
失败的行:
061AE,Active,001,2017_02_24 15_18_01,00006,1,00013,longer message, with commas
采纳答案by Bill Bell
You can use re.sub
to replace the first few commas with, say, the '|', save the intermediate results in a StringIO
then process that.
您可以使用re.sub
“|”来替换前几个逗号,StringIO
然后将中间结果保存在然后处理中。
import pandas as pd
from io import StringIO
import re
for_pd = StringIO()
with open('MikeS159.csv') as mike:
for line in mike:
new_line = re.sub(r',', '|', line.rstrip(), count=7)
print (new_line, file=for_pd)
for_pd.seek(0)
df = pd.read_csv(for_pd, sep='|', header=None)
print (df)
I put the two lines from your question into a file to get this output.
我将您问题中的两行放入一个文件中以获取此输出。
0 1 2 3 4 5 6 \
0 061AE Active 1 2017_02_24 15_18_01 6 1 13
1 061AE Active 1 2017_02_24 15_18_01 6 1 13
7
0 some message
1 longer message, with commas
回答by Blazina
You can use the parameter usecols in the read_csv function to limit what columns you read in. For example:
您可以在 read_csv 函数中使用参数 usecols 来限制您读取的列。例如:
import pandas as pd
pd.read_csv(path, usecols=range(8))
if you only want to read the first 8 columns.
如果您只想阅读前 8 列。
回答by FatihAkici
You can take a shot at this roundabout posted on the Pandas issues page:
您可以在Pandas 问题页面上发布的这个回旋处拍照:
import csv
import pandas as pd
import numpy as np
df = pd.read_csv('filename.csv', parse_dates=True, dtype=Object, delimiter="\t", quoting=csv.QUOTE_NONE, encoding='utf-8')
You can also preprocess the data, basically changing all first 7 (0th to 6th, both inclusive) commas to semicolons, and leaving the ones after that as commas* using something like:
您还可以对数据进行预处理,基本上将所有前 7 个(第 0 到第 6 个,都包括在内)逗号更改为分号,然后使用以下内容将其后的逗号保留为逗号 *:
to_write = []
counter = 0
with open("sampleCSV.csv", "r") as f:
for line in f:
while counter < 7:
line = list(line)
line[line.index(",")] = ";"
counter += 1
counter = 0
to_write.append("".join(line))
You can now read this to_write
list as a Pandas object like
您现在可以将此to_write
列表作为 Pandas 对象阅读,例如
data = pd.DataFrame(to_write)
data = pd.DataFrame(data[0].str.split(";").values.tolist()),
or write it back into a csv and read using pandas with a semicolon delimiter such as read_csv(csv_path, sep=';')
.
或将其写回 csv 并使用带有分号分隔符的 Pandas 读取,例如read_csv(csv_path, sep=';')
.
I kinda drafted this quickly without rigorous testing, but should give you some ideas to try. Please comment if it does or doesn't help, and I'll edit it.
我在没有经过严格测试的情况下很快就起草了这个,但应该给你一些尝试的想法。如果它有帮助或没有帮助,请发表评论,我会编辑它。
*Another option is to delete all commas after 7th, and keep using the comma separator. Either way the point is to differentiate the first 7 delimiters from the subsequent punctuation.
*另一种选择是删除第 7 个之后的所有逗号,并继续使用逗号分隔符。无论哪种方式,重点都是将前 7 个定界符与随后的标点符号区分开来。
回答by DINA TAKLIT
to join @Tblaz answer If you use GoogleColabyou can use this solution, in my case the extra comma was on the column 24 so I have only to read 23 columns:
加入@Tblaz 回答如果您使用GoogleColab,您可以使用此解决方案,在我的情况下,额外的逗号位于第 24 列,因此我只需阅读 23 列:
import pandas as pd
from google.colab import files
import io
uploaded = files.upload()
x_train = pd.read_csv(io.StringIO(uploaded['x_train.csv'].decode('utf-8')), skiprows=1, usecols=range(23) ,header=None)