pandas 熊猫 read_csv 的解析器错误

Question

提问by lanadaquenada

I'm trying to read a txt file that as different number of columns per row. Here's the beginning of my file:

我正在尝试读取一个 txt 文件，该文件每行的列数不同。这是我的文件的开头：

60381 6
1 0.270 0.30 0.30 0.70 0.70
4.988 4.988 4.988 4.988 4.988 4.988 4.988 4.988 4.988 4.988 4.988 4.988
2 0.078 0.30 0.30 0.70 0.70
5.387 5.312 5.338 4.463 4.675 4.275 4.238 3.562 3.175 3.925 4.950 4.762
6 0.241 0.30 0.60 0.70 0.40
3.700 3.200 2.738 2.325 1.250 0.975 1.175 1.950 2.488 3.613 3.987 3.950
7 0.357 0.30 0.60 0.70 0.40
1.212 1.125 1.050 0.950 0.663 0.488 0.425 0.512 0.637 0.900 1.112 1.188
8 0.031 0.30 0.70 0.70 0.30
0.225 0.213 0.200 0.175 0.200 0.213 0.375 0.887 0.975 0.512 0.262 0.262
10 0.022 0.30 0.80 0.70 0.20
0.712 0.700 0.738 0.550 0.513 0.688 0.613 0.600 0.850 0.812 0.800 0.775
60382 5
6 0.197 0.30 0.60 0.70 0.40
3.700 3.200 2.738 2.325 1.250 0.975 1.175 1.950 2.488 3.613 3.987 3.950
7 0.413 0.30 0.60 0.70 0.40
1.212 1.125 1.050 0.950 0.663 0.488 0.425 0.512 0.637 0.900 1.112 1.188
8 0.016 0.30 0.70 0.70 0.30
0.225 0.213 0.200 0.175 0.200 0.213 0.375 0.887 0.975 0.512 0.262 0.262
10 0.111 0.30 0.80 0.70 0.20
0.712 0.700 0.738 0.550 0.513 0.688 0.613 0.600 0.850 0.812 0.800 0.775
11 0.263 0.30 0.50 0.70 0.50
1.812 1.388 1.087 0.825 0.538 0.400 0.338 0.400 0.500 0.925 0.962 1.100

I've tried using pandas read_csv to read it:

我试过使用 pandas read_csv 来阅读它：

import pandas as pd
data = pd.read_csv('./myfile.txt',header=None,sep='\s')

Which gives the following error:

这给出了以下错误：

ParserError: Expected 6 fields in line 3, saw 12. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.

So my file doesn't have a multi-char delimiter or quotation marks. I've tried a solution for this I found in this forum, which suggested using:

所以我的文件没有多字符分隔符或引号。我已经在这个论坛中找到了一个解决方案，它建议使用：

data = pd.read_csv(open('./myfile.txt','r'), header=None,encoding='utf-8', engine='c')

Although this solves the error above, the array I'm presented with does not use space as a delimiter of columns, and the output has only 1 column:

虽然这解决了上面的错误，但我看到的数组没有使用空格作为列的分隔符，并且输出只有 1 列：

How should I read the file in order to get a column for each value? I don't mind if there are nan values that fill the rest.

我应该如何读取文件以获得每个值的列？我不介意是否有填充其余部分的 nan 值。

Answer 1

采纳答案by Unatiel

If you've managed to get the data in a single column, you can use Series.str.split()to workaround this issue.

如果您已设法在单个列中获取数据，则可以使用Series.str.split()来解决此问题。

Here is an example with some sample data you provided (you can use any string or a regex as a delimiter in split()) :

这是您提供的一些示例数据的示例（您可以使用任何字符串或正则表达式作为中的分隔符split()）：

df[0].str.split(' ', expand=True)

     0      1      2      3      4      5      6      7      8      9   \
0  0.270   0.30   0.30   0.70   0.70   None   None   None   None   None
1  4.988  4.988  4.988  4.988  4.988  4.988  4.988  4.988  4.988  4.988

If you do this, then you might as well create the dataframe with pd.DataFrame(open(...).readlines())or something like that, since you don't benefit at all from read_csv(), and your file isn't exactly a standard csv file.

如果您这样做，那么您不妨使用pd.DataFrame(open(...).readlines())或类似的东西创建数据框，因为您根本无法从中受益read_csv()，并且您的文件不完全是标准的 csv 文件。

# f is a StringIO with some of your sample data to simulate a file
df = pd.DataFrame(line.strip().split(' ') for line in f)

       0      1      2      3      4      5      6      7      8      9   \
0   60381      6   None   None   None   None   None   None   None   None
1       1  0.270   0.30   0.30   0.70   0.70   None   None   None   None
2   4.988  4.988  4.988  4.988  4.988  4.988  4.988  4.988  4.988  4.988
3       2  0.078   0.30   0.30   0.70   0.70   None   None   None   None
4   5.387  5.312  5.338  4.463  4.675  4.275  4.238  3.562  3.175  3.925

Of course, you can also fix the input file by making sure every line contains the same number of columns, which will solve the ParserErrorissue.

当然，您也可以通过确保每行包含相同数量的列来修复输入文件，这将解决ParserError问题。

Answer 2

回答by Parfait

Consider iteratively rebuilding the text file and then pd.read_table()on cleaned text version:

考虑迭代重建文本文件，然后pd.read_table()在清理文本版本：

from io import StringIO

txt="""
60381 6
1 0.270 0.30 0.30 0.70 0.70
4.988 4.988 4.988 4.988 4.988 4.988 4.988 4.988 4.988 4.988 4.988 4.988
2 0.078 0.30 0.30 0.70 0.70
5.387 5.312 5.338 4.463 4.675 4.275 4.238 3.562 3.175 3.925 4.950 4.762
6 0.241 0.30 0.60 0.70 0.40
3.700 3.200 2.738 2.325 1.250 0.975 1.175 1.950 2.488 3.613 3.987 3.950
7 0.357 0.30 0.60 0.70 0.40
1.212 1.125 1.050 0.950 0.663 0.488 0.425 0.512 0.637 0.900 1.112 1.188
8 0.031 0.30 0.70 0.70 0.30
0.225 0.213 0.200 0.175 0.200 0.213 0.375 0.887 0.975 0.512 0.262 0.262
10 0.022 0.30 0.80 0.70 0.20
0.712 0.700 0.738 0.550 0.513 0.688 0.613 0.600 0.850 0.812 0.800 0.775
60382 5
6 0.197 0.30 0.60 0.70 0.40
3.700 3.200 2.738 2.325 1.250 0.975 1.175 1.950 2.488 3.613 3.987 3.950
7 0.413 0.30 0.60 0.70 0.40
1.212 1.125 1.050 0.950 0.663 0.488 0.425 0.512 0.637 0.900 1.112 1.188
8 0.016 0.30 0.70 0.70 0.30
0.225 0.213 0.200 0.175 0.200 0.213 0.375 0.887 0.975 0.512 0.262 0.262
10 0.111 0.30 0.80 0.70 0.20
0.712 0.700 0.738 0.550 0.513 0.688 0.613 0.600 0.850 0.812 0.800 0.775
11 0.263 0.30 0.50 0.70 0.50
1.812 1.388 1.087 0.825 0.538 0.400 0.338 0.400 0.500 0.925 0.962 1.100
"""

for line in StringIO(txt):

    if len(line) == 8:
        header = line
        next
    if len(line) == 28 or len(line) == 29:
        firstdata = line
        next
    if len(line) == 72:
        seconddata = header + ' ' + firstdata + ' ' + line
        print(seconddata.replace("\n", ""))

Output

输出

# 60381 6 1 0.270 0.30 0.30 0.70 0.70 4.988 4.988 4.988 4.988 4.988 4.988 4.988 4.988 4.988 4.988 4.988 4.988
# 60381 6 2 0.078 0.30 0.30 0.70 0.70 5.387 5.312 5.338 4.463 4.675 4.275 4.238 3.562 3.175 3.925 4.950 4.762
# 60381 6 6 0.241 0.30 0.60 0.70 0.40 3.700 3.200 2.738 2.325 1.250 0.975 1.175 1.950 2.488 3.613 3.987 3.950
# 60381 6 7 0.357 0.30 0.60 0.70 0.40 1.212 1.125 1.050 0.950 0.663 0.488 0.425 0.512 0.637 0.900 1.112 1.188
# 60381 6 8 0.031 0.30 0.70 0.70 0.30 0.225 0.213 0.200 0.175 0.200 0.213 0.375 0.887 0.975 0.512 0.262 0.262
# 60381 6 10 0.022 0.30 0.80 0.70 0.20 0.712 0.700 0.738 0.550 0.513 0.688 0.613 0.600 0.850 0.812 0.800 0.775
# 60382 5 6 0.197 0.30 0.60 0.70 0.40 3.700 3.200 2.738 2.325 1.250 0.975 1.175 1.950 2.488 3.613 3.987 3.950
# 60382 5 7 0.413 0.30 0.60 0.70 0.40 1.212 1.125 1.050 0.950 0.663 0.488 0.425 0.512 0.637 0.900 1.112 1.188
# 60382 5 8 0.016 0.30 0.70 0.70 0.30 0.225 0.213 0.200 0.175 0.200 0.213 0.375 0.887 0.975 0.512 0.262 0.262
# 60382 5 10 0.111 0.30 0.80 0.70 0.20 0.712 0.700 0.738 0.550 0.513 0.688 0.613 0.600 0.850 0.812 0.800 0.775
# 60382 5 11 0.263 0.30 0.50 0.70 0.50 1.812 1.388 1.087 0.825 0.538 0.400 0.338 0.400 0.500 0.925 0.962 1.100

Now for file import into pandas, consider writing new output to text file:

现在要将文件导入到 Pandas，请考虑将新输出写入文本文件：

with open('input.txt', 'r') as f1, open('output.txt', 'w') as f2:
   for line in f1:
      if len(line) == 8:
          header = line
          next
      if len(line) == 28 or len(line) == 29:
          firstdata = line
          next
      if len(line) == 72:
          seconddata = header + ' ' + firstdata + ' ' + line
          f2.write(seconddata.replace("\n", ""))

df = read.table('output.txt', header=None, sep="\s+")

pandas 熊猫 read_csv 的解析器错误

提问by lanadaquenada

采纳答案by Unatiel

回答by Parfait

相关推荐

最近更新

标签

pandas 熊猫 read_csv 的解析器错误

提问by lanadaquenada

采纳答案by Unatiel

回答by Parfait

相关推荐

pandas 替换熊猫数据框中的特殊字符

Python Pandas - 突出显示列中的最大值

pandas 如何使用 matplotlib 正确绘制数据框

pandas 在python中基于pandas索引在新列中添加值

相关推荐

最近更新

标签