Handling a variable number of columns with Pandas - Python

Question

Asked by Hymanie Shephard

I have a data set that looks like this (at most 5 columns, but a row can have fewer):

1,2,3
1,2,3,4
1,2,3,4,5
1,2
1,2,3,4
....

I am trying to use pandas read_table to read this into a 5 column data frame. I would like to read this in without additional massaging.

If I try

import pandas as pd
my_cols=['A','B','C','D','E']
my_df=pd.read_table(path,sep=',',header=None,names=my_cols)

I get an error - "column names have 5 fields, data has 3 fields".

Is there any way to make pandas fill in NaN for the missing columns while reading the data?

Answer 1

Accepted answer by DSM

One way which seems to work (at least in 0.10.1 and 0.11.0.dev-fc8de6d):

>>> !cat ragged.csv
1,2,3
1,2,3,4
1,2,3,4,5
1,2
1,2,3,4
>>> my_cols = ["A", "B", "C", "D", "E"]
>>> pd.read_csv("ragged.csv", names=my_cols, engine='python')
   A  B   C   D   E
0  1  2   3 NaN NaN
1  1  2   3   4 NaN
2  1  2   3   4   5
3  1  2 NaN NaN NaN
4  1  2   3   4 NaN

Note that this approach requires that you give names to the columns you want, though. Not as general as some other ways, but works well enough when it applies.
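For anyone who wants to try this without creating a file, here is a minimal sketch of the same call. It assumes the sample data shown above and uses `io.StringIO` as a stand-in for `ragged.csv`; the `missing_per_column` name is just for illustration.

```python
import io
import pandas as pd

# Same sample data as ragged.csv above; io.StringIO stands in for the file
# (an assumption made only to keep the sketch self-contained).
ragged = io.StringIO("1,2,3\n1,2,3,4\n1,2,3,4,5\n1,2\n1,2,3,4\n")

my_cols = ["A", "B", "C", "D", "E"]

# Supplying five names tells the parser how wide the frame should be;
# rows with fewer fields are padded with NaN.
df = pd.read_csv(ragged, names=my_cols, engine="python")

# Count the missing cells per column to confirm the padding.
missing_per_column = df.isna().sum()
print(missing_per_column)
```

Columns that contain NaN come back as floats, so a cast may be needed afterwards if integer columns matter.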

Answer 2

Answered by herrfz

I'd also be interested to know if this is possible; from the docs it doesn't seem to be the case. What you could probably do is read the file line by line, and concatenate each line to a DataFrame:

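import pandas as pd

filepath = "ragged.csv"  # assumed path to the data file

df = pd.DataFrame()

with open(filepath, 'r') as f:
    for line in f:
        # Turn each line into a one-row frame; concat aligns the columns and fills gaps with NaN
        row = pd.DataFrame([tuple(line.strip().split(','))])
        df = pd.concat([df, row], ignore_index=True)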

It works but not in the most elegant way, I guess...
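If the repeated `concat` gets slow on a large file, one possible variation is to collect the split rows in a list and build the frame once. This is only a sketch under assumptions: `io.StringIO` stands in for the real file, and the names `sample`, `rows`, `width` and `padded` are made up for illustration.

```python
import io
import pandas as pd

# Stand-in for the real file; io.StringIO is used only to keep the sketch self-contained.
sample = io.StringIO("1,2,3\n1,2,3,4\n1,2,3,4,5\n1,2\n1,2,3,4\n")

# Collect the split rows first instead of concatenating one-row frames.
rows = [line.strip().split(",") for line in sample if line.strip()]

# Pad every row to the width of the longest row so the data is rectangular.
width = max(len(row) for row in rows)
padded = [row + [None] * (width - len(row)) for row in rows]

df = pd.DataFrame(padded)

# The cells are still strings at this point; convert each column to numbers,
# which turns the None padding into NaN.
df = df.apply(pd.to_numeric)
```

Building the frame once avoids copying the accumulated data on every line, which is what makes the loop above expensive as the file grows.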

Answer 3

Answered by Hymanie Shephard

OK. Not sure how efficient this is, but here is what I have done. Would love to hear if there is a better way to do this. Thanks!

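from pandas import DataFrame

list_of_dicts = []
labels = ['A', 'B', 'C', 'D', 'E']
with open(path) as file:  # path: the data file, as in the question
    for line in file:
        line = line.rstrip('\n')
        # zip stops at the shorter sequence, so a short row simply omits the trailing labels
        list_of_dicts.append(dict(zip(labels, line.split(','))))
# Labels missing from a row become NaN in the frame; the values themselves stay strings
frame = DataFrame(list_of_dicts)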
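To see why the `dict(zip(...))` trick pads short rows, here is a small sketch on two hand-written lines. The names `short_row` and `full_row` are illustrative, and the numeric conversion at the end is an optional extra step, not part of the approach above.

```python
import pandas as pd

labels = ["A", "B", "C", "D", "E"]

# zip stops at the shorter input, so a two-field line yields only keys A and B.
short_row = dict(zip(labels, "1,2".split(",")))
full_row = dict(zip(labels, "1,2,3,4,5".split(",")))

# Keys absent from a dict become NaN when the frame is built;
# passing columns=labels fixes the column order.
frame = pd.DataFrame([short_row, full_row], columns=labels)

# The parsed values are strings; convert them column by column.
frame = frame.apply(pd.to_numeric)
```

One thing to keep in mind: because `zip` truncates, any fields beyond the fifth would be dropped silently rather than raising an error.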
