Notice: this page is a translated copy of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/15242746/

Date: 2020-08-18 19:36:33  Source: igfitidea

Handling Variable Number of Columns with Pandas - Python

Tags: python, pandas

Asked by Hymanie Shephard

I have a data set that looks like this (at most 5 columns, but there can be fewer):

1,2,3
1,2,3,4
1,2,3,4,5
1,2
1,2,3,4
....

I am trying to use pandas read_table to read this into a 5 column data frame. I would like to read this in without additional massaging.

If I try

import pandas as pd
my_cols = ['A', 'B', 'C', 'D', 'E']
my_df = pd.read_table(path, sep=',', header=None, names=my_cols)

I get an error - "column names have 5 fields, data has 3 fields".

Is there any way to make pandas fill in NaN for the missing columns while reading the data?

Accepted answer by DSM

One way which seems to work (at least in 0.10.1 and 0.11.0.dev-fc8de6d):

>>> !cat ragged.csv
1,2,3
1,2,3,4
1,2,3,4,5
1,2
1,2,3,4
>>> my_cols = ["A", "B", "C", "D", "E"]
>>> pd.read_csv("ragged.csv", names=my_cols, engine='python')
   A  B   C   D   E
0  1  2   3 NaN NaN
1  1  2   3   4 NaN
2  1  2   3   4   5
3  1  2 NaN NaN NaN
4  1  2   3   4 NaN

Note that this approach requires you to name the columns you want, though. It is not as general as some other ways, but it works well enough when it applies.
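
If the maximum number of columns is not known in advance, one possible workaround (not part of the original answer) is to scan the data once for the widest row and then pass that many generated names to read_csv. The sketch below is only an illustration under some assumptions: the string `raw` stands in for the contents of ragged.csv shown above, the integer column names are arbitrary, and the naive split on commas assumes there are no quoted fields containing commas.

```python
import io

import pandas as pd

# Stand-in for the contents of ragged.csv (assumption: simple, unquoted fields)
raw = "1,2,3\n1,2,3,4\n1,2,3,4,5\n1,2\n1,2,3,4\n"

# First pass: count the fields in each line and keep the largest count
max_fields = max(len(line.split(',')) for line in raw.splitlines())

# Second pass: supply that many (arbitrary, integer) column names so that
# shorter rows are padded with NaN, as in the answer above
df = pd.read_csv(
    io.StringIO(raw),
    header=None,
    names=list(range(max_fields)),
    engine='python',
)
print(df)
```

With a real file, the first pass would iterate over the open file object instead of `raw.splitlines()`, at the cost of reading the file twice.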

Answer by herrfz

I'd also be interested to know if this is possible; from the docs, it doesn't seem to be. What you could probably do is read the file line by line and append each parsed line to a DataFrame:

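import pandas as pd

df = pd.DataFrame()

# filepath is the path to the comma-separated input file
with open(filepath, 'r') as f:
    for line in f:
        # Each line becomes a one-row frame; concat aligns columns and fills gaps with NaN
        row = pd.DataFrame([tuple(line.strip().split(','))])
        df = pd.concat([df, row], ignore_index=True)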

It works but not in the most elegant way, I guess...
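
A variation on the same idea, offered only as a sketch and not as the answerer's method, is to collect all rows first and build the DataFrame once, which avoids calling pd.concat on every line. It assumes the file is small enough to hold in memory; `sample` is a hypothetical in-memory stand-in for the real file, and the values stay as strings, just as in the loop above.

```python
import csv
import io

import pandas as pd

# Hypothetical stand-in for the file; with real data use open(filepath, newline='')
sample = io.StringIO("1,2,3\n1,2,3,4\n1,2,3,4,5\n1,2\n1,2,3,4\n")

# Parse every line into a list of string fields
rows = list(csv.reader(sample))

# Pad each row with None up to the width of the longest row
width = max(len(row) for row in rows)
padded = [row + [None] * (width - len(row)) for row in rows]

# Build the frame in a single step; the padded cells show up as missing values
df = pd.DataFrame(padded)
print(df)
```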

Answer by Hymanie Shephard

OK. I'm not sure how efficient this is, but here is what I have done. I would love to hear if there is a better way to do this. Thanks!

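from pandas import DataFrame

labels = ['A', 'B', 'C', 'D', 'E']
list_of_dicts = []
with open(path) as file:  # path: location of the comma-separated input file
    for line in file:
        line = line.rstrip('\n')
        # zip stops at the shorter sequence, so short rows just omit the trailing labels
        list_of_dicts.append(dict(zip(labels, line.split(','))))
# Labels missing from a row's dict become NaN in the resulting frame
frame = DataFrame(list_of_dicts)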
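
One thing to be aware of with this approach is that every value is still a string. A possible follow-up, shown here only as a sketch, is to pass the labels as `columns` (so all five columns exist even if no row is that wide) and then convert each column with pd.to_numeric. The `sample` object is a hypothetical stand-in for the real file.

```python
import io

import pandas as pd

labels = ['A', 'B', 'C', 'D', 'E']

# Hypothetical stand-in for the open file used in the answer above
sample = io.StringIO("1,2,3\n1,2,3,4\n1,2,3,4,5\n1,2\n1,2,3,4\n")

# Same dict-per-line idea: short rows simply lack the trailing labels
records = [dict(zip(labels, line.rstrip('\n').split(','))) for line in sample]

# columns=labels fixes the column set and order regardless of the data
frame = pd.DataFrame(records, columns=labels)

# Convert the string values column by column; missing cells stay NaN
numeric = frame.apply(pd.to_numeric)
print(numeric)
```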