Python: import csv with different number of columns per row using Pandas
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow
Original URL: http://stackoverflow.com/questions/27020216/
Asked by Erich
What is the best approach for importing a CSV file in which each row has a different number of columns into a Pandas DataFrame, using either Pandas or the csv module?
"H","BBB","D","Ajxxx Dxxxs"
"R","1","QH","DTR"," "," ","spxxt rixxls, raxxxd","1"
Using this code:
import pandas as pd
data = pd.read_csv("smallsample.txt",header = None)
the following error is generated:
Error tokenizing data. C error: Expected 4 fields in line 2, saw 8
Accepted answer by Bob Haffner
Supplying a list of column names to read_csv() should do the trick. The list should have one name for each column of the widest row.
ex: names=['a', 'b', 'c', 'd', 'e'] (for a file whose widest row has five fields)
https://github.com/pydata/pandas/issues/2981
Edit: if you don't want to supply column names, then do what Nicholas suggested.
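As a minimal sketch of the names approach, assuming the two sample rows from the question are held in an in-memory string rather than in smallsample.txt (the `sample` variable and the letter names are illustrative and not part of the original answer):

```python
import io

import pandas as pd

# The two rows from the question, held in memory instead of smallsample.txt
sample = (
    '"H","BBB","D","Ajxxx Dxxxs"\n'
    '"R","1","QH","DTR"," "," ","spxxt rixxls, raxxxd","1"\n'
)

# One name per column of the widest row (8 here); the names are arbitrary
names = list("abcdefgh")

df = pd.read_csv(io.StringIO(sample), header=None, names=names)

# The short first row is padded with NaN in columns e through h
print(df)
```

The quoted comma in "spxxt rixxls, raxxxd" stays inside one field because read_csv honors the quotes.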
Answer by kavin
We could even use the pd.read_table() method to read the csv file into a single-column DataFrame, whose rows can then be split on ','.
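A rough sketch of this idea, assuming an unquoted, tab-free sample held in memory (the `raw`, `lines`, and `wide` names and the sample rows are hypothetical). The `quoting` and `dtype` arguments are additions here to keep each line as literal text:

```python
import csv
import io

import pandas as pd

# Hypothetical unquoted sample; real data would come from a file path
raw = io.StringIO("H,BBB,D\nR,1,QH,DTR,X\n")

# read_table splits on tabs by default; the sample has none, so every
# line lands in a single column. QUOTE_NONE keeps any quotes literal.
lines = pd.read_table(raw, header=None, quoting=csv.QUOTE_NONE, dtype=str)

# Split that single column on commas into as many columns as needed
wide = lines[0].str.split(",", expand=True)
print(wide)
```

Note that a plain string split does not honor quoting, so a quoted field that contains a comma (as in the question's sample) would be split in two.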
Answer by P-S
You can dynamically generate column names as simple counters (0, 1, 2, etc).
Dynamically generate column names
import pandas as pd

# Input
data_file = "smallsample.txt"
# Delimiter
data_file_delimiter = ','
# The max column count a line in the file could have
largest_column_count = 0
# Loop the data lines (the with block closes the file automatically)
with open(data_file, 'r') as temp_f:
    for line in temp_f:
        # Count the columns in the current line. This is a plain split, so a
        # delimiter inside a quoted field is counted too (it can only overcount).
        column_count = len(line.split(data_file_delimiter))
        # Keep the largest column count seen so far
        largest_column_count = max(largest_column_count, column_count)
# Generate column names (will be 0, 1, 2, ..., largest_column_count - 1)
column_names = list(range(largest_column_count))
# Read csv
df = pd.read_csv(data_file, header=None, delimiter=data_file_delimiter, names=column_names)
# print(df)
Missing values will be assigned to the columns that a CSV line has no value for.
Answer by shantanu pathak
A polished version of P-S's answer is as follows. It works. Remember that this inserts a lot of missing values into the DataFrame.
import pandas as pd

### Loop the data lines
with open("smallsample.txt", 'r') as temp_f:
    # get the number of columns in each line
    col_count = [len(l.split(",")) for l in temp_f.readlines()]
### Generate column names (names will be 0, 1, 2, ..., maximum columns - 1)
column_names = [i for i in range(0, max(col_count))]
### Read csv
df = pd.read_csv("smallsample.txt", header=None, delimiter=",", names=column_names)
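Counting columns with a plain split treats a comma inside a quoted field as a delimiter. The sketch below swaps in Python's csv module for the counting step, which is not what the answers above do; it assumes the question's two rows are held in an in-memory string, and the variable names are illustrative:

```python
import csv
import io

import pandas as pd

sample = (
    '"H","BBB","D","Ajxxx Dxxxs"\n'
    '"R","1","QH","DTR"," "," ","spxxt rixxls, raxxxd","1"\n'
)

# A plain split counts the comma inside "spxxt rixxls, raxxxd" as a delimiter
naive_widest = max(len(line.split(",")) for line in sample.splitlines())

# csv.reader parses quoted fields, so the embedded comma is not counted
parsed_widest = max(len(row) for row in csv.reader(io.StringIO(sample)))

df = pd.read_csv(
    io.StringIO(sample), header=None, names=list(range(parsed_widest))
)
print(naive_widest, parsed_widest, df.shape)
```

Overcounting with the plain split is harmless apart from an extra all-NaN column; the csv-based count avoids that column.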
Answer by Tanvir Islam
If you want something really concise without explicitly giving column names, you could do this:
- Make a one-column DataFrame with each row being a line in the .csv file
- Split each row on commas and expand the DataFrame
df = pd.read_fwf('<filename>.csv', header=None)
df[0].str.split(',', expand=True)
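One caveat: read_fwf is meant for fixed-width files and by default infers column boundaries from whitespace, so lines that contain spaces may not end up in a single column. A sketch of the same two steps that builds the one-column DataFrame from the raw lines instead, assuming a hypothetical in-memory sample (the `raw`, `lines`, and `wide` names are illustrative):

```python
import io

import pandas as pd

# Hypothetical sample with a space inside one field
raw = io.StringIO("H,BBB,Ajxxx Dxxxs\nR,1,QH,DTR\n")

# Step 1: a one-column DataFrame with one row per line of text
lines = pd.DataFrame({0: raw.read().splitlines()})

# Step 2: split each row on commas and expand into columns
wide = lines[0].str.split(",", expand=True)
print(wide)
```

As with any plain split, quoted fields that contain commas are not handled.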
Answer by amran hossen
Error tokenizing data. C error: Expected 4 fields in line 2, saw 8
The error gives a clue to solving the problem: "Expected 4 fields in line 2, saw 8" means the second row has 8 fields, while the first row has 4.
import pandas as pd
# Pass range() the largest field count, i.e. the "saw 8" value in
# "Expected 4 fields in line 2, saw 8"; here it is 8
data = pd.read_csv("smallsample.txt",header = None,names=range(8))
Use range instead of setting the names manually, which becomes cumbersome when you have many columns.
You can use shantanu pathak's method to find the longest row length in your data.
Additionally, you can fill the NaN values with 0 if you need rows of uniform length, e.g. for clustering (k-means).
new_data = data.fillna(0)
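A small sketch of the padding and fill step, assuming a hypothetical all-numeric sample held in memory (the sample rows are made up for illustration):

```python
import io

import pandas as pd

# Hypothetical numeric sample with ragged rows
sample = "1,2\n3,4,5,6\n"

# The widest row has 4 fields, so generate the names 0..3
data = pd.read_csv(io.StringIO(sample), header=None, names=range(4))

# The short first row is padded with NaN; replace those with 0
new_data = data.fillna(0)
print(new_data)
```

Keep in mind that after filling, a missing field is indistinguishable from a genuine 0, which may matter for distance-based methods such as k-means.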