Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/20154303/

Date: 2020-08-18 19:41:17  Source: igfitidea

Pandas read_csv expects wrong number of columns, with ragged csv file

python csv pandas ragged

Asked by chrisfs

I have a csv file that has a few hundred rows and 26 columns, but the last few columns only have a value in a few rows and they are towards the middle or end of the file. When I try to read it in using read_csv() I get the following error. "ValueError: Expecting 23 columns, got 26 in row 64"

I can't see where to explicitly state the number of columns in the file, or how it determines how many columns it thinks the file should have. The traceback is below.

In [3]:

infile =open(easygui.fileopenbox(),"r")
pledge = read_csv(infile,parse_dates='true')


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-b35e7a16b389> in <module>()
      1 infile =open(easygui.fileopenbox(),"r")
      2 
----> 3 pledge = read_csv(infile,parse_dates='true')


C:\Python27\lib\site-packages\pandas-0.8.1-py2.7-win32.egg\pandas\io\parsers.pyc in read_csv(filepath_or_buffer, sep, dialect, header, index_col, names, skiprows, na_values, thousands, comment, parse_dates, keep_date_col, dayfirst, date_parser, nrows, iterator, chunksize, skip_footer, converters, verbose, delimiter, encoding, squeeze)
    234         kwds['delimiter'] = sep
    235 
--> 236     return _read(TextParser, filepath_or_buffer, kwds)
    237 
    238 @Appender(_read_table_doc)

C:\Python27\lib\site-packages\pandas-0.8.1-py2.7-win32.egg\pandas\io\parsers.pyc in _read(cls, filepath_or_buffer, kwds)
    189         return parser
    190 
--> 191     return parser.get_chunk()
    192 
    193 @Appender(_read_csv_doc)

C:\Python27\lib\site-packages\pandas-0.8.1-py2.7-win32.egg\pandas\io\parsers.pyc in get_chunk(self, rows)
    779             msg = ('Expecting %d columns, got %d in row %d' %
    780                    (col_len, zip_len, row_num))
--> 781             raise ValueError(msg)
    782 
    783         data = dict((k, v) for k, v in izip(self.columns, zipped_content))

ValueError: Expecting 23 columns, got 26 in row 64

Accepted answer by Roman Pekar

You can use the names parameter. For example, if you have a csv file like this:

1,2,1
2,3,4,2,3
1,2,3,3
1,2,3,4,5,6

and try to read it, you'll receive an error:

>>> pd.read_csv(r'D:/Temp/tt.csv')
Traceback (most recent call last):
...
Expected 5 fields in line 4, saw 6

But if you pass the names parameter, with at least as many names as the widest row has fields, you'll get the result:

>>> pd.read_csv(r'D:/Temp/tt.csv', names=list('abcdef'))
   a  b  c   d   e   f
0  1  2  1 NaN NaN NaN
1  2  3  4   2   3 NaN
2  1  2  3   3 NaN NaN
3  1  2  3   4   5   6
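
To try this without a file on disk, here is a minimal sketch that feeds the same four rows through io.StringIO. The variable names raw and df are only illustrative. For the file in the question, which is said to have 26 columns, 26 names would be needed instead of six.

```python
import io

import pandas as pd

# Hypothetical in-memory copy of the ragged file shown above.
raw = "1,2,1\n2,3,4,2,3\n1,2,3,3\n1,2,3,4,5,6\n"

# Six names cover the widest row; shorter rows are padded with NaN.
df = pd.read_csv(io.StringIO(raw), names=list("abcdef"))

print(df.shape)                 # (4, 6)
print(df["f"].isna().tolist())  # [True, True, True, False]
```

Because the padded columns contain NaN, pandas will typically store them as floats rather than integers.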

Hope it helps.

Answer by yemu

Suppose you have a file like this:

a,b,c
1,2,3
1,2,3,4

You could use csv.reader to clean the file first,

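import csv
with open('file.csv') as f:
    lines = list(csv.reader(f))
header, values = lines[0], lines[1:]
# zip stops at the shortest input, so fields beyond the header are dropped
data = {h: v for h, v in zip(header, zip(*values))}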

and get:

并得到:

{'a' : ('1','1'), 'b': ('2','2'), 'c': ('3', '3')}

If you don't have a header, you could use:

data = {h: v for h, v in zip(map(str, range(number_of_columns)), zip(*values))}

and then you can convert the dictionary to a DataFrame with:

import pandas as pd
df = pd.DataFrame.from_dict(data)
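
Note that zip truncates, so the approach above silently drops the extra "4". If you would rather keep the extra values, one possible variant (not part of the original answer) is to pad short rows with itertools.zip_longest. The sketch below assumes the sample file from this answer, held in memory, and invents names of the form extra_N for columns beyond the header.

```python
import csv
import io
from itertools import zip_longest

import pandas as pd

# Hypothetical in-memory stand-in for file.csv from the answer above.
raw = io.StringIO("a,b,c\n1,2,3\n1,2,3,4\n")

lines = list(csv.reader(raw))
header, values = lines[0], lines[1:]

# The widest row decides how many columns are needed.
width = max(len(row) for row in lines)
# Assumption: columns beyond the header get made-up names "extra_N".
columns = header + ["extra_%d" % i for i in range(len(header), width)]

# zip_longest transposes rows into columns, filling short rows with None.
data = dict(zip(columns, zip_longest(*values, fillvalue=None)))
df = pd.DataFrame(data, columns=columns)
```

As with the original snippet, every value is still a string, so numeric columns need converting afterwards.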

Answer by arjepak

You can also load the CSV with a separator such as '^' that does not occur in the data, so that each whole line is loaded as a single string column, then use split to break the string on the real delimiter. After that, use concat to merge the result with the original dataframe (if needed).

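import pandas as pd
# header=None labels the single column 0 (the older prefix= argument was removed in pandas 2.0)
temp = pd.read_csv('test.csv', sep='^', header=None)
# split on the real delimiter; expand=True pads short rows with missing values
temp2 = temp[0].str.split(',', expand=True)
del temp[0]
temp = pd.concat([temp, temp2], axis=1)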
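
A small self-contained sketch of the same idea, using io.StringIO in place of test.csv and reusing the ragged rows from the accepted answer. It assumes '^' never appears in the data. The final to_numeric step is an addition of mine, since split leaves every piece as a string.

```python
import io

import pandas as pd

# Hypothetical ragged data standing in for test.csv.
raw = io.StringIO("1,2,1\n2,3,4,2,3\n1,2,3,3\n1,2,3,4,5,6\n")

# Read each line as one string column; '^' is assumed absent from the data.
whole = pd.read_csv(raw, sep="^", header=None)

# Split on the real delimiter; short rows are padded with missing values.
parts = whole[0].str.split(",", expand=True)

# The pieces are strings, so convert each column to numbers explicitly.
numeric = parts.apply(pd.to_numeric)
```

Unlike the names approach, this does not require knowing the maximum number of columns in advance, at the cost of parsing the values yourself.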

Answer by Abhinav Yadav

The problem with the given solution is that you have to know the maximum number of columns in advance. I couldn't find a built-in function for this, but you can write a function that will:

  1. read all the lines
  2. split it
  3. count the number of words/elements in each row
  4. store the max number of words/elements
  5. place that max value in the names option (as suggested by Roman Pekar)

Here is the def (function) I wrote for my files:

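import pandas as pd

def ragged_csv(filename, sep=' '):
    # My files are space-separated; pass sep=',' for comma-separated data.
    max_n = 0
    # First pass: find the largest number of fields on any line.
    with open(filename) as f:
        for line in f:
            words = len(line.split(sep))
            if words > max_n:
                max_n = words
    # Second pass: give read_csv that many column names so short rows are padded.
    lines = pd.read_csv(filename, sep=sep, names=list(range(max_n)))
    return lines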
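
One caveat with counting fields via a plain split is that a quoted field containing the delimiter gets counted more than once. A possible variant, sketched below under the assumption of comma-separated input held in a string, counts fields with csv.reader instead. The helper name read_ragged and the sample data are made up for illustration.

```python
import csv
import io

import pandas as pd


def read_ragged(text, delimiter=","):
    """Read ragged delimited text into a DataFrame (hypothetical helper)."""
    # First pass: csv.reader respects quoting, so "3,4" in quotes is one field.
    rows = csv.reader(io.StringIO(text), delimiter=delimiter)
    width = max(len(row) for row in rows)
    # Second pass: enough column names for the widest row; short rows get NaN.
    return pd.read_csv(
        io.StringIO(text), sep=delimiter, header=None, names=list(range(width))
    )


sample = '1,2,1\n2,"3,4",2,3\n1,2,3,4,5\n'
df = read_ragged(sample)
```

This sketch raises a ValueError on empty input, because max() has nothing to compare.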