Python pandas: reading a csv with extra commas in a column

Question

Asked by David

I'm reading a basic csv file where the columns are separated by commas with these column names:

userid, username, body

However, the body column is a string that may contain commas. This causes a problem, and pandas raises an error:

CParserError: Error tokenizing data. C error: Expected 3 fields in line 3, saw 8

Is there a way to tell pandas to ignore commas in a specific column or a way to go around this problem?

Answer 1

Suppose your csv file is called comma.csv and looks like this:

userid, username, body
01, n1, 'string1, string2'

One thing you can do is tell pandas which quote character encloses the strings in that column, using the quotechar parameter:

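import pandas as pd
df = pd.read_csv('comma.csv', quotechar="'", skipinitialspace=True)  # skipinitialspace: skip the space after each comma so ' opens the field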

In this case, a string enclosed in ' is treated as a single field, regardless of any commas inside it.
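
A minimal, self-contained sketch of the quotechar approach, using an in-memory copy of the sample so it runs without a file on disk. It assumes the data has a space after each comma, exactly as in the sample; because of that, skipinitialspace=True is added here (it is not in the original one-liner) so that the quote character is seen at the start of the field.

```python
import io

import pandas as pd

# In-memory stand-in for comma.csv, matching the sample shown in this answer.
raw = "userid, username, body\n01, n1, 'string1, string2'\n"

df = pd.read_csv(
    io.StringIO(raw),
    quotechar="'",          # text wrapped in ' is read as one field
    skipinitialspace=True,  # assumption: drop the space that follows each comma
)

print(df.columns.tolist())
print(df.loc[0, "body"])
```

If the file has no space after the commas, skipinitialspace can be left out.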

Answer 2

Add usecols and lineterminator to your read_csv() call, where n is the number of columns.

In my case:

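import pandas as pd

n = 5  # number of columns to read; set this for your file
df = pd.read_csv(file,                  # file: path to your csv
                 usecols=range(n),      # read only the first n fields of each row
                 lineterminator='\n',
                 header=None)           # the first row is treated as data, not as column names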
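
One caveat: with usecols=range(n), pandas appears to keep only the first n fields of each row, so any text after an extra comma is dropped rather than kept in the last column. If the body text is not quoted and you need all of it, one possible workaround (not part of the answer above) is to split each line yourself on the first n - 1 commas only. The sketch below assumes the free-text column is the last one and that no value spans multiple lines; read_trailing_text_csv is a hypothetical helper name, not a pandas function.

```python
import io

import pandas as pd


def read_trailing_text_csv(handle, n_cols):
    """Hypothetical helper: split each line on the first n_cols - 1 commas only."""
    header = [name.strip() for name in handle.readline().rstrip("\n").split(",")]
    rows = [
        line.rstrip("\n").split(",", n_cols - 1)  # remaining commas stay in the last field
        for line in handle
        if line.strip()  # skip blank lines
    ]
    return pd.DataFrame(rows, columns=header)


# Made-up sample data where the unquoted body column contains commas.
raw = (
    "userid,username,body\n"
    "01,n1,string1, string2, string3\n"
    "02,n2,no commas here\n"
)

df = read_trailing_text_csv(io.StringIO(raw), n_cols=3)
print(df)
```

All columns come back as strings here, so convert types afterwards (for example with astype) if you need numbers.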