pandas 熊猫,阅读 CSV 忽略额外的逗号

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/48668125/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-14 05:08:35  来源:igfitidea点击:

Pandas, read CSV ignoring extra commas

pythonpandas

提问by MikeS159

I am reading a CSV file with 8 columns into Pandas data frame. The final column contains an error message, some of which contain commas. This causes the file read to fail with the error ParserError: Error tokenizing data. C error: Expected 8 fields in line 21922, saw 9

我正在将一个包含 8 列的 CSV 文件读入 Pandas 数据框中。最后一列包含错误消息,其中一些包含逗号。这会导致文件读取失败并出现错误ParserError: Error tokenizing data. C error: Expected 8 fields in line 21922, saw 9

Is there a way to ignore all commas after the 8th field, rather than having to go through the file and remove excess commas?

有没有办法忽略第 8 个字段之后的所有逗号,而不必遍历文件并删除多余的逗号?

Code to read file:

读取文件的代码:

import pandas as pd
df = pd.read_csv('C:\somepath\output.csv')

Line that works:

有效的行:

061AE,Active,001,2017_02_24 15_18_01,00006,1,00013,some message

Line that fails:

失败的行:

061AE,Active,001,2017_02_24 15_18_01,00006,1,00013,longer message, with commas

采纳答案by Bill Bell

You can use re.subto replace the first few commas with, say, the '|', save the intermediate results in a StringIOthen process that.

您可以使用re.sub“|”来替换前几个逗号,StringIO然后将中间结果保存在然后处理中。

import pandas as pd
from io import StringIO
import re

for_pd = StringIO()
with open('MikeS159.csv') as mike:
    for line in mike:
        new_line = re.sub(r',', '|', line.rstrip(), count=7)
        print (new_line, file=for_pd)

for_pd.seek(0)

df = pd.read_csv(for_pd, sep='|', header=None)
print (df)

I put the two lines from your question into a file to get this output.

我将您问题中的两行放入一个文件中以获取此输出。

       0       1  2                    3  4  5   6  \
0  061AE  Active  1  2017_02_24 15_18_01  6  1  13   
1  061AE  Active  1  2017_02_24 15_18_01  6  1  13   

                             7  
0                 some message  
1  longer message, with commas  

回答by Blazina

You can use the parameter usecols in the read_csv function to limit what columns you read in. For example:

您可以在 read_csv 函数中使用参数 usecols 来限制您读取的列。例如:

import pandas as pd
pd.read_csv(path, usecols=range(8))

if you only want to read the first 8 columns.

如果您只想阅读前 8 列。

回答by FatihAkici

You can take a shot at this roundabout posted on the Pandas issues page:

您可以在Pandas 问题页面上发布的这个回旋处拍照:

import csv
import pandas as pd
import numpy as np

df = pd.read_csv('filename.csv', parse_dates=True, dtype=Object, delimiter="\t", quoting=csv.QUOTE_NONE, encoding='utf-8')

You can also preprocess the data, basically changing all first 7 (0th to 6th, both inclusive) commas to semicolons, and leaving the ones after that as commas* using something like:

您还可以对数据进行预处理,基本上将所有前 7 个(第 0 到第 6 个,都包括在内)逗号更改为分号,然后使用以下内容将其后的逗号保留为逗号 *:

to_write = []
counter = 0
with open("sampleCSV.csv", "r") as f:
    for line in f:
        while counter < 7:
            line = list(line)
            line[line.index(",")] = ";"
            counter += 1
        counter = 0
        to_write.append("".join(line))

You can now read this to_writelist as a Pandas object like

您现在可以将此to_write列表作为 Pandas 对象阅读,例如

data = pd.DataFrame(to_write)
data = pd.DataFrame(data[0].str.split(";").values.tolist()),

or write it back into a csv and read using pandas with a semicolon delimiter such as read_csv(csv_path, sep=';').

或将其写回 csv 并使用带有分号分隔符的 Pandas 读取,例如read_csv(csv_path, sep=';').

I kinda drafted this quickly without rigorous testing, but should give you some ideas to try. Please comment if it does or doesn't help, and I'll edit it.

我在没有经过严格测试的情况下很快就起草了这个,但应该给你一些尝试的想法。如果它有帮助或没有帮助,请发表评论,我会编辑它。

*Another option is to delete all commas after 7th, and keep using the comma separator. Either way the point is to differentiate the first 7 delimiters from the subsequent punctuation.

*另一种选择是删除第 7 个之后的所有逗号,并继续使用逗号分隔符。无论哪种方式,重点都是将前 7 个定界符与随后的标点符号区分开来。

回答by DINA TAKLIT

to join @Tblaz answer If you use GoogleColabyou can use this solution, in my case the extra comma was on the column 24 so I have only to read 23 columns:

加入@Tblaz 回答如果您使用GoogleColab,您可以使用此解决方案,在我的情况下,额外的逗号位于第 24 列,因此我只需阅读 23 列:

import pandas as pd
from google.colab import files
import io
uploaded = files.upload()
x_train = pd.read_csv(io.StringIO(uploaded['x_train.csv'].decode('utf-8')), skiprows=1, usecols=range(23) ,header=None)