Original source: http://stackoverflow.com/questions/55129640/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on Stack Overflow.
Read CSV into a dataFrame with varying row lengths using Pandas
Asked by caaax
So I have a CSV that looks a bit like this:
1 | 01-01-2019 | 724
2 | 01-01-2019 | 233 | 436
3 | 01-01-2019 | 345
4 | 01-01-2019 | 803 | 933 | 943 | 923 | 954
5 | 01-01-2019 | 454
...
When I try to generate a DataFrame with the following code:
df = pd.read_csv('data.csv', header=0, engine='c', error_bad_lines=False)
it only adds rows with 3 columns to the df (rows 1, 3 and 5 from above).
The rest are considered 'bad lines', giving me the following error:
Skipping line 17467: expected 3 fields, saw 9
How do I create a data frame that includes all the data in my CSV, possibly just filling in the empty cells with null? Or do I have to declare the max row length prior to adding to the df?
Thanks!
Answered by ALollz
If using only pandas, read in the lines first and deal with the separator afterwards.
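import pandas as pd
# Read each line as a single column, then split it on the " | " separator.
# Note: recent pandas versions may raise a ValueError for sep='\n'.
df = pd.read_csv('data.csv', header=None, sep='\n')
df = df[0].str.split(r'\s\|\s', expand=True)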
0 1 2 3 4 5 6
0 1 01-01-2019 724 None None None None
1 2 01-01-2019 233 436 None None None
2 3 01-01-2019 345 None None None None
3 4 01-01-2019 803 933 943 923 954
4 5 01-01-2019 454 None None None None
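In case read_csv refuses a newline as the separator, the same idea can be sketched by collecting the lines with plain Python and letting pandas do the split. This is only a rough sketch: read_ragged is a hypothetical helper name, io.StringIO stands in for open('data.csv'), and the pattern assumes the pipe is the only separator.

```python
import io
import pandas as pd

# Sample data from the question; io.StringIO stands in for the real file.
raw = """1 | 01-01-2019 | 724
2 | 01-01-2019 | 233 | 436
3 | 01-01-2019 | 345
4 | 01-01-2019 | 803 | 933 | 943 | 923 | 954
5 | 01-01-2019 | 454"""


def read_ragged(handle, pattern=r"\s*\|\s*"):
    """Hypothetical helper: one Series entry per non-blank line, then split."""
    lines = pd.Series([line.strip() for line in handle if line.strip()])
    # A multi-character pattern is treated as a regular expression.
    return lines.str.split(pattern, expand=True)


df = read_ragged(io.StringIO(raw))
print(df)
```

Every value comes back as a string here, so numeric columns would still need converting afterwards (for example with pd.to_numeric).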
Answered by unutbu
If you know that the data contains N columns, you can tell Pandas in advance how many columns to expect via the names parameter:
import pandas as pd
df = pd.read_csv('data', delimiter='|', names=list(range(7)))
print(df)
yields
0 1 2 3 4 5 6
0 1 01-01-2019 724 NaN NaN NaN NaN
1 2 01-01-2019 233 436.0 NaN NaN NaN
2 3 01-01-2019 345 NaN NaN NaN NaN
3 4 01-01-2019 803 933.0 943.0 923.0 954.0
4 5 01-01-2019 454 NaN NaN NaN NaN
If you have an upper limit, N, on the number of columns, then you can have Pandas read N columns and then use dropna to drop completely empty columns:
import pandas as pd
df = pd.read_csv('data', delimiter='|', names=list(range(20))).dropna(axis='columns', how='all')
print(df)
yields
0 1 2 3 4 5 6
0 1 01-01-2019 724 NaN NaN NaN NaN
1 2 01-01-2019 233 436.0 NaN NaN NaN
2 3 01-01-2019 345 NaN NaN NaN NaN
3 4 01-01-2019 803 933.0 943.0 923.0 954.0
4 5 01-01-2019 454 NaN NaN NaN NaN
Note that this could drop columns from the middle of the data set (not just columns from the right-hand side) if they are completely empty.
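If no upper limit is known ahead of time, one option is to scan the file once to find the widest row and pass that count to names. The following is a minimal sketch under that assumption: max_field_count is a hypothetical helper, io.StringIO stands in for the real file, and counting delimiters this way would miscount if a quoted field contained a '|'.

```python
import io
import pandas as pd

# Sample data from the question; io.StringIO stands in for the real file.
raw = """1 | 01-01-2019 | 724
2 | 01-01-2019 | 233 | 436
3 | 01-01-2019 | 345
4 | 01-01-2019 | 803 | 933 | 943 | 923 | 954
5 | 01-01-2019 | 454"""


def max_field_count(handle, delimiter="|"):
    """Hypothetical helper: largest number of delimited fields on any line."""
    return max(line.count(delimiter) + 1 for line in handle if line.strip())


buf = io.StringIO(raw)
n_cols = max_field_count(buf)  # first pass: count the fields per line
buf.seek(0)                    # rewind before the actual read
df = pd.read_csv(buf, delimiter="|", names=list(range(n_cols)))
print(n_cols, df.shape)
```

This costs an extra pass over the file, but avoids both guessing N and dropping empty columns from the middle of the data.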
Answered by Yo_Chris
Reading the file as fixed-width should work:
from io import StringIO
import pandas as pd
s = '''1 01-01-2019 724
2 01-01-2019 233 436
3 01-01-2019 345
4 01-01-2019 803 933 943 923 954
5 01-01-2019 454'''
pd.read_fwf(StringIO(s), header=None)
0 1 2 3 4 5 6
0 1 01-01-2019 724 NaN NaN NaN NaN
1 2 01-01-2019 233 436.0 NaN NaN NaN
2 3 01-01-2019 345 NaN NaN NaN NaN
3 4 01-01-2019 803 933.0 943.0 923.0 954.0
4 5 01-01-2019 454 NaN NaN NaN NaN
or with a delimiter parameter:
s = '''1 | 01-01-2019 | 724
2 | 01-01-2019 | 233 | 436
3 | 01-01-2019 | 345
4 | 01-01-2019 | 803 | 933 | 943 | 923 | 954
5 | 01-01-2019 | 454'''
pd.read_fwf(StringIO(s), header=None, delimiter='|')
0 1 2 3 4 5 6
0 1 01-01-2019 724 NaN NaN NaN NaN
1 2 01-01-2019 233 436.0 NaN NaN NaN
2 3 01-01-2019 345 NaN NaN NaN NaN
3 4 01-01-2019 803 933.0 943.0 923.0 954.0
4 5 01-01-2019 454 NaN NaN NaN NaN
Note that for your actual file you will not use StringIO; just replace it with your file path: pd.read_fwf('data.csv', delimiter='|', header=None)
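One thing that may matter on a large file: read_fwf infers the column boundaries from the first infer_nrows rows (100 by default), so a longer row that first appears later could be cut off. The sketch below only illustrates that effect on the small sample, using infer_nrows=1 to exaggerate it; for a real file the assumption would be that infer_nrows needs to cover the widest row.

```python
from io import StringIO
import pandas as pd

s = '''1 01-01-2019 724
2 01-01-2019 233 436
3 01-01-2019 345
4 01-01-2019 803 933 943 923 954
5 01-01-2019 454'''

# Boundaries inferred from the first row only: longer rows lose their extra fields.
narrow = pd.read_fwf(StringIO(s), header=None, infer_nrows=1)

# Boundaries inferred from every row of this small sample.
wide = pd.read_fwf(StringIO(s), header=None, infer_nrows=5)

print(narrow.shape, wide.shape)
```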
Answered by jfaccioni
Add extra columns (empty or otherwise) to the top of your CSV file. Pandas will take the first row as the default size, and any shorter row below it will be padded with NaN values. Example:
file.csv:
a,b,c,d,e
1,2,3
3
2,3,4
code:
>>> import pandas as pd
>>> pd.read_csv('file.csv')
a b c d e
0 1 2.0 3.0 NaN NaN
1 3 NaN NaN NaN NaN
2 2 3.0 4.0 NaN NaN
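If editing the file on disk is not convenient, the same trick could be applied in memory by prepending a header line before handing the text to read_csv. This is just a sketch: the body string stands in for the contents of a header-less file, and the five column names assume that no row has more than five fields.

```python
import io
import pandas as pd

# Stands in for the contents of a header-less file with ragged rows.
body = "1,2,3\n3\n2,3,4\n"

# Assumed header: five names, i.e. at least as wide as the widest row.
header = ",".join("abcde") + "\n"

df = pd.read_csv(io.StringIO(header + body))
print(df)
```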
Answered by Rich Andrews
Consider using Python's csv module to do the lifting for importing data and format grooming. You can implement a custom dialect to handle varying csv-ness.
import csv
import pandas as pd
csv_data = """1 | 01-01-2019 | 724
2 | 01-01-2019 | 233 | 436
3 | 01-01-2019 | 345
4 | 01-01-2019 | 803 | 933 | 943 | 923 | 954
5 | 01-01-2019 | 454"""
with open('test1.csv', 'w') as f:
    f.write(csv_data)
# Register a dialect that splits fields on the pipe character.
csv.register_dialect('PipeDialect', delimiter='|')
with open('test1.csv') as csvfile:
    data = [row for row in csv.reader(csvfile, 'PipeDialect')]
df = pd.DataFrame(data=data)
This gives you a csv import dialect and the following DataFrame:
0 1 2 3 4 5 6
0 1 01-01-2019 724 None None None None
1 2 01-01-2019 233 436 None None None
2 3 01-01-2019 345 None None None None
3 4 01-01-2019 803 933 943 923 954
4 5 01-01-2019 454 None None None None
Left as an exercise is handling the whitespace padding in the input file.
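One possible way to handle the whitespace padding is to strip each field as the rows come out of csv.reader. A minimal sketch, assuming the padding is plain spaces around the pipe and using io.StringIO in place of the test file:

```python
import csv
import io
import pandas as pd

csv_data = """1 | 01-01-2019 | 724
2 | 01-01-2019 | 233 | 436
3 | 01-01-2019 | 345
4 | 01-01-2019 | 803 | 933 | 943 | 923 | 954
5 | 01-01-2019 | 454"""

reader = csv.reader(io.StringIO(csv_data), delimiter="|")

# Strip the padding from every field; short rows are padded by DataFrame.
rows = [[field.strip() for field in row] for row in reader]
df = pd.DataFrame(rows)
print(df)
```

The values are still strings at this point, so a numeric conversion step would be needed before doing arithmetic on them.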
Answered by phoenixio
import pandas as pd
colnames = [str(i) for i in range(9)]
df = pd.read_table('data.csv', header=None, sep=',', names=colnames)
Change 9 in colnames to the number x if the code gives the error:
Skipping line 17467: expected 3 fields, saw x