Handling Variable Number of Columns with Pandas - Python
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors: StackOverflow
Original: http://stackoverflow.com/questions/15242746/
Asked by Hymanie Shephard
I have a data set that looks like this (at most 5 columns, but there can be fewer):
1,2,3
1,2,3,4
1,2,3,4,5
1,2
1,2,3,4
....
I am trying to use pandas read_table to read this into a 5-column data frame, without any additional preprocessing.
If I try:
import pandas as pd
my_cols=['A','B','C','D','E']
my_df=pd.read_table(path,sep=',',header=None,names=my_cols)
I get an error: "column names have 5 fields, data has 3 fields".
Is there any way to make pandas fill in NaN for the missing columns while reading the data?
Accepted answer by DSM
One way that seems to work (at least in pandas 0.10.1 and 0.11.0.dev-fc8de6d):
>>> !cat ragged.csv
1,2,3
1,2,3,4
1,2,3,4,5
1,2
1,2,3,4
>>> my_cols = ["A", "B", "C", "D", "E"]
>>> pd.read_csv("ragged.csv", names=my_cols, engine='python')
   A  B   C   D   E
0  1  2   3 NaN NaN
1  1  2   3   4 NaN
2  1  2   3   4   5
3  1  2 NaN NaN NaN
4  1  2   3   4 NaN
Note that this approach requires you to name the columns you want. It is not as general as some other ways, but it works well enough when it applies.
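For readers who want to try this without creating a file, here is a minimal sketch of the same call. It assumes a reasonably recent pandas, and the inline `io.StringIO` buffer is only a stand-in for ragged.csv, not part of the original answer.

```python
import io

import pandas as pd

# Inline stand-in for ragged.csv (an assumption for illustration;
# a real file path works the same way).
ragged = io.StringIO("1,2,3\n1,2,3,4\n1,2,3,4,5\n1,2\n1,2,3,4\n")

my_cols = ["A", "B", "C", "D", "E"]

# Passing names= tells the parser to expect five fields per row;
# rows with fewer fields are padded on the right with NaN.
df = pd.read_csv(ragged, names=my_cols, engine="python")

print(df)
# Count of missing values per column
print(df.isna().sum())
```

Columns that contain NaN come back as floating point, since NaN has no integer representation in a plain integer column.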
Answer by herrfz
I'd also be interested to know if this is possible; from the docs, it doesn't seem to be. What you could do instead is read the file line by line and concatenate each row onto a DataFrame:
import pandas as pd

df = pd.DataFrame()
with open(filepath, 'r') as f:
    for line in f:
        # Each line becomes a one-row frame; concat aligns columns and fills gaps with NaN
        row = pd.DataFrame([tuple(line.strip().split(','))])
        df = pd.concat([df, row], ignore_index=True)
It works, though I guess it's not the most elegant way...
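Calling concat once per line copies the growing frame every time, which can get slow on large files. A possible variant, sketched here under the assumption that the maximum width (5) is known up front, collects padded rows in a plain list and builds the DataFrame once. The `N_COLS` constant, the column labels, and the inline buffer are all illustrative, not from the answer above.

```python
import csv
import io

import pandas as pd

N_COLS = 5  # assumed maximum number of fields per row
labels = ["A", "B", "C", "D", "E"]

# Inline stand-in for the input file (hypothetical data)
source = io.StringIO("1,2,3\n1,2,3,4\n1,2,3,4,5\n1,2\n1,2,3,4\n")

rows = []
for fields in csv.reader(source):
    if not fields:
        continue  # skip blank lines
    # Pad short rows with None so every row has exactly N_COLS entries
    rows.append(fields + [None] * (N_COLS - len(fields)))

# Build the frame in a single step instead of concatenating per line
df = pd.DataFrame(rows, columns=labels)
print(df)
```

As with the line-by-line version, the values here are still strings; they are not converted to numbers.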
Answer by Hymanie Shephard
OK. I'm not sure how efficient this is, but here is what I have done. I would love to hear if there is a better way to do this. Thanks!
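from pandas import DataFrame

list_of_dicts = []
labels = ['A', 'B', 'C', 'D', 'E']
for line in file:  # file is an already-open file object
    line = line.rstrip('\n')
    # zip stops at the shorter input, so short rows simply omit the trailing labels
    list_of_dicts.append(dict(zip(labels, line.split(','))))
# Labels missing from a row's dict become NaN in the frame
frame = DataFrame(list_of_dicts)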
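One thing to keep in mind with the dict-based approach is that the values stay as strings. The sketch below repeats the same idea on an inline buffer (a stand-in for the open file, assumed for illustration) and then converts each column with `pd.to_numeric`. The conversion step is an addition, not part of the answer above.

```python
import io

import pandas as pd

labels = ["A", "B", "C", "D", "E"]

# Inline stand-in for the open file object (hypothetical data)
source = io.StringIO("1,2,3\n1,2,3,4\n1,2,3,4,5\n1,2\n1,2,3,4\n")

# One dict per line; short rows simply lack the trailing labels
records = [dict(zip(labels, line.rstrip("\n").split(","))) for line in source]

# columns=labels fixes the column order explicitly
frame = pd.DataFrame(records, columns=labels)

# Values are strings at this point; convert column by column
numeric = frame.apply(pd.to_numeric)
print(numeric.dtypes)
print(numeric)
```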

