pandas 熊猫，阅读 CSV 忽略额外的逗号

Question

提问by MikeS159

I am reading a CSV file with 8 columns into Pandas data frame. The final column contains an error message, some of which contain commas. This causes the file read to fail with the error ParserError: Error tokenizing data. C error: Expected 8 fields in line 21922, saw 9

我正在将一个包含 8 列的 CSV 文件读入 Pandas 数据框中。最后一列包含错误消息，其中一些包含逗号。这会导致文件读取失败并出现错误ParserError: Error tokenizing data. C error: Expected 8 fields in line 21922, saw 9

Is there a way to ignore all commas after the 8th field, rather than having to go through the file and remove excess commas?

有没有办法忽略第 8 个字段之后的所有逗号，而不必遍历文件并删除多余的逗号？

Code to read file:

读取文件的代码：

import pandas as pd
df = pd.read_csv('C:\somepath\output.csv')

Line that works:

有效的行：

061AE,Active,001,2017_02_24 15_18_01,00006,1,00013,some message

Line that fails:

失败的行：

061AE,Active,001,2017_02_24 15_18_01,00006,1,00013,longer message, with commas

Answer 1

采纳答案by Bill Bell

You can use re.subto replace the first few commas with, say, the '|', save the intermediate results in a StringIOthen process that.

您可以使用re.sub“|”来替换前几个逗号，StringIO然后将中间结果保存在然后处理中。

import pandas as pd
from io import StringIO
import re

for_pd = StringIO()
with open('MikeS159.csv') as mike:
    for line in mike:
        new_line = re.sub(r',', '|', line.rstrip(), count=7)
        print (new_line, file=for_pd)

for_pd.seek(0)

df = pd.read_csv(for_pd, sep='|', header=None)
print (df)

I put the two lines from your question into a file to get this output.

我将您问题中的两行放入一个文件中以获取此输出。

       0       1  2                    3  4  5   6  \
0  061AE  Active  1  2017_02_24 15_18_01  6  1  13   
1  061AE  Active  1  2017_02_24 15_18_01  6  1  13   

                             7  
0                 some message  
1  longer message, with commas

Answer 2

回答by Blazina

You can use the parameter usecols in the read_csv function to limit what columns you read in. For example:

您可以在 read_csv 函数中使用参数 usecols 来限制您读取的列。例如：

import pandas as pd
pd.read_csv(path, usecols=range(8))

if you only want to read the first 8 columns.

如果您只想阅读前 8 列。

Answer 3

回答by FatihAkici

You can take a shot at this roundabout posted on the Pandas issues page:

您可以在Pandas 问题页面上发布的这个回旋处拍照：

import csv
import pandas as pd
import numpy as np

df = pd.read_csv('filename.csv', parse_dates=True, dtype=Object, delimiter="\t", quoting=csv.QUOTE_NONE, encoding='utf-8')

You can also preprocess the data, basically changing all first 7 (0th to 6th, both inclusive) commas to semicolons, and leaving the ones after that as commas* using something like:

您还可以对数据进行预处理，基本上将所有前 7 个（第 0 到第 6 个，都包括在内）逗号更改为分号，然后使用以下内容将其后的逗号保留为逗号 *：

to_write = []
counter = 0
with open("sampleCSV.csv", "r") as f:
    for line in f:
        while counter < 7:
            line = list(line)
            line[line.index(",")] = ";"
            counter += 1
        counter = 0
        to_write.append("".join(line))

You can now read this to_writelist as a Pandas object like

您现在可以将此to_write列表作为 Pandas 对象阅读，例如

data = pd.DataFrame(to_write)
data = pd.DataFrame(data[0].str.split(";").values.tolist()),

or write it back into a csv and read using pandas with a semicolon delimiter such as read_csv(csv_path, sep=';').

或将其写回 csv 并使用带有分号分隔符的 Pandas 读取，例如read_csv(csv_path, sep=';').

I kinda drafted this quickly without rigorous testing, but should give you some ideas to try. Please comment if it does or doesn't help, and I'll edit it.

我在没有经过严格测试的情况下很快就起草了这个，但应该给你一些尝试的想法。如果它有帮助或没有帮助，请发表评论，我会编辑它。

*Another option is to delete all commas after 7th, and keep using the comma separator. Either way the point is to differentiate the first 7 delimiters from the subsequent punctuation.

*另一种选择是删除第 7 个之后的所有逗号，并继续使用逗号分隔符。无论哪种方式，重点都是将前 7 个定界符与随后的标点符号区分开来。

Answer 4

回答by DINA TAKLIT

to join @Tblaz answer If you use GoogleColabyou can use this solution, in my case the extra comma was on the column 24 so I have only to read 23 columns:

加入@Tblaz 回答如果您使用GoogleColab，您可以使用此解决方案，在我的情况下，额外的逗号位于第 24 列，因此我只需阅读 23 列：

import pandas as pd
from google.colab import files
import io
uploaded = files.upload()
x_train = pd.read_csv(io.StringIO(uploaded['x_train.csv'].decode('utf-8')), skiprows=1, usecols=range(23) ,header=None)

pandas 熊猫，阅读 CSV 忽略额外的逗号

提问by MikeS159

采纳答案by Bill Bell

回答by Blazina

回答by FatihAkici

回答by DINA TAKLIT

相关推荐

最近更新

标签

pandas 熊猫，阅读 CSV 忽略额外的逗号

提问by MikeS159

采纳答案by Bill Bell

回答by Blazina

回答by FatihAkici

回答by DINA TAKLIT

相关推荐

如何使用 Pandas 数据框打开 csv 文件

pandas 将数组附加到数据帧（python）

在 Pandas/Python 中使用 loc 和仅使用方括号过滤列有什么区别？

pandas 从数据框熊猫中获取单个值

相关推荐

最近更新

标签