Python: using Pandas to import a CSV with a different number of columns in each row

Question

Asked by Erich

What is the best approach for importing a CSV that has a different number of columns in each row into a Pandas DataFrame, using either Pandas or the csv module?

"H","BBB","D","Ajxxx Dxxxs"
"R","1","QH","DTR"," "," ","spxxt rixxls, raxxxd","1"

Using this code:

import pandas as pd
data = pd.read_csv("smallsample.txt",header = None)

the following error is generated:

Error tokenizing data. C error: Expected 4 fields in line 2, saw 8

Answer 1

Accepted answer by Bob Haffner

Supplying a list of column names to read_csv() should do the trick.

For example: names=['a', 'b', 'c', 'd', 'e']

https://github.com/pydata/pandas/issues/2981
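A minimal sketch of this suggestion, applied to the two sample rows from the question. The in-memory `sample` string stands in for smallsample.txt, and the eight column names are hypothetical. The list is assumed to have at least as many names as the widest row has fields.

```python
import io

import pandas as pd

# The two sample rows from the question (4 fields, then 8 fields).
sample = (
    '"H","BBB","D","Ajxxx Dxxxs"\n'
    '"R","1","QH","DTR"," "," ","spxxt rixxls, raxxxd","1"\n'
)

# Hypothetical column names: one per field of the widest row.
names = ["a", "b", "c", "d", "e", "f", "g", "h"]

# Rows with fewer fields than names are padded with NaN on the right.
df = pd.read_csv(io.StringIO(sample), header=None, names=names)
print(df)
```

Here the first row should end up with NaN in columns e through h, since it only has four fields.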

Edit: if you don't want to supply column names, do what Nicholas suggested.

Answer 2

Answered by kavin

We could even use the pd.read_table() method to read the CSV file. It loads each line into a single-column DataFrame, which can then be split on ','.
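One way this could look is sketched below. It assumes the file contains no tab characters (the default separator of pd.read_table), so each line lands in one column. The unquoted `sample` string is made up for illustration, and passing quoting=csv.QUOTE_NONE is an addition here so that quote characters, if present, are left untouched rather than interpreted by the parser.

```python
import csv
import io

import pandas as pd

# Made-up, unquoted sample standing in for the file: 4 fields, then 8 fields.
sample = "H,BBB,D,Ajxxx Dxxxs\nR,1,QH,DTR,,,spxxt,1\n"

# No tabs in the data, so every line becomes a single string column (column 0).
raw = pd.read_table(io.StringIO(sample), header=None, quoting=csv.QUOTE_NONE)

# Split that column on commas; shorter rows are padded with missing values.
df = raw[0].str.split(",", expand=True)
print(df)
```

Note that a plain string split also splits on commas inside quoted fields and leaves every value as a string, so this is best suited to simple files.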

Answer 3

Answered by P-S

You can dynamically generate column names as simple counters (0, 1, 2, etc.).

Dynamically generate column names

import pandas as pd

# Input
data_file = "smallsample.txt"

# Delimiter
data_file_delimiter = ','

# The max column count a line in the file could have
largest_column_count = 0

# Loop the data lines (the with block closes the file automatically)
with open(data_file, 'r') as temp_f:
    for line in temp_f:
        # Count the columns in the current line
        column_count = len(line.split(data_file_delimiter))

        # Keep the largest column count seen so far
        largest_column_count = max(largest_column_count, column_count)

# Generate column names (will be 0, 1, 2, ..., largest_column_count - 1)
column_names = list(range(largest_column_count))

# Read csv
df = pd.read_csv(data_file, header=None, delimiter=data_file_delimiter, names=column_names)
# print(df)

Missing values will be assigned to the columns for which a CSV line has no value.

Answer 4

Answered by shantanu pathak

A polished version of P-S's answer follows. It works, but remember that it inserts a lot of missing values into the DataFrame.

import pandas as pd

# Loop the data lines
with open("smallsample.txt", 'r') as temp_f:
    # Get the number of columns in each line
    col_count = [len(l.split(",")) for l in temp_f.readlines()]

# Generate column names (names will be 0, 1, 2, ..., maximum columns - 1)
column_names = list(range(max(col_count)))

# Read csv
df = pd.read_csv("smallsample.txt", header=None, delimiter=",", names=column_names)
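One caveat with counting fields via split(","): a comma inside a quoted field, like "spxxt rixxls, raxxxd" in the question's sample, is counted as a separator, so the widest row is overestimated and an extra all-NaN column appears. A possible variation, sketched here rather than taken from the answers, counts fields with the standard csv module instead. The in-memory `sample` string stands in for smallsample.txt.

```python
import csv
import io

import pandas as pd

# The two sample rows from the question; row 2 has a comma inside quotes.
sample = (
    '"H","BBB","D","Ajxxx Dxxxs"\n'
    '"R","1","QH","DTR"," "," ","spxxt rixxls, raxxxd","1"\n'
)

# Naive count: splits on every comma, including the quoted one.
naive_max = max(len(line.split(",")) for line in sample.splitlines())

# Quote-aware count: csv.reader treats the quoted comma as data.
parsed_max = max(len(row) for row in csv.reader(io.StringIO(sample)))

print(naive_max, parsed_max)

# Use the quote-aware width to generate the counter column names.
df = pd.read_csv(io.StringIO(sample), header=None, names=range(parsed_max))
print(df)
```

For a real file, the same counting loop would run over `csv.reader(open(...))` before the call to read_csv.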

Answer 5

Answered by Tanvir Islam

If you want something really concise without explicitly giving column names, you could do this:

Make a one-column DataFrame with each row being a line in the .csv file
Split each row on commas and expand the DataFrame

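import pandas as pd
# One-column DataFrame: each row is one line of the file
df = pd.read_fwf('<filename>.csv', header=None)
# Split on commas and expand into one column per field
df = df[0].str.split(',', expand=True)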

Answer 6

Answered by amran hossen

Error tokenizing data. C error: Expected 4 fields in line 2, saw 8

The error message itself gives a clue to the solution: "Expected 4 fields in line 2, saw 8" means the first row has 4 fields and the second row has 8.

import pandas as pd
# inside range set the maximum value you can see in "Expected 4 fields in line 2, saw 8"
# here will be 8 
data = pd.read_csv("smallsample.txt",header = None,names=range(8))

Use range instead of setting names manually, which becomes cumbersome when you have many columns.

You can use shantanu pathak's method to find the longest row length in your data.

Additionally, you can fill the NaN values with 0 if you need every row to have the same length, e.g. for clustering (k-means).

new_data = data.fillna(0)
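As a small illustration of the padding and fill steps together, the sketch below uses a made-up, purely numeric sample (not the question's data), since filling with 0 is most meaningful when the columns are numeric.

```python
import io

import pandas as pd

# Made-up numeric rows of different lengths: 2 fields, then 4 fields.
sample = "1,2\n3,4,5,6\n"

# names=range(4) pads the short row with NaN on the right.
data = pd.read_csv(io.StringIO(sample), header=None, names=range(4))

# Replace every NaN with 0 so all rows have the same number of values.
new_data = data.fillna(0)
print(new_data)
```

For columns that hold strings, filling with 0 would mix numbers and text in the same column, so a different fill value may be more appropriate there.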
