How to read only the header row of a CSV file using Python?

Question

Asked by Andy

I am looking for a way to read just the header row of a large number of large CSV files.

Using Pandas, I have this method available, for each csv file:


>>> df = pd.read_csv(PATH_TO_CSV)
>>> df.columns

I could do this with just the csv module:


>>> reader = csv.DictReader(open(PATH_TO_CSV))
>>> reader.fieldnames

The problem with these is that each CSV file is 500MB+ in size, and it seems a gigantic waste to read in each entire file just to pull out the header line.

My end goal of all of this is to pull out unique column names. I can do that once I have a list of column headers that are in each of these files.


How can I extract only the header row of a CSV file, quickly?


Answer 1

Accepted answer by Jon Clements

I've used iglob as an example to search for the .csv files, but one way is to use a set, then adjust as necessary, e.g.:

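import csv
from glob import iglob

unique_headers = set()
for filename in iglob('*.csv'):
    # Python 3: text mode with newline='' (the original 'rb' mode only works on Python 2)
    with open(filename, newline='') as fin:
        csvin = csv.reader(fin)
        # next() reads only the first row; [] is the fallback for an empty file
        unique_headers.update(next(csvin, []))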
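
The loop above can be wrapped in a small helper for reuse. This is only a minimal sketch, assuming Python 3; the function name collect_unique_headers and its pattern parameter are made up for illustration and are not part of the original answer.

```python
import csv
from glob import iglob


def collect_unique_headers(pattern):
    """Return the set of column names found in the first row of each matching file.

    `pattern` is a glob pattern such as '*.csv' (hypothetical parameter name).
    """
    unique_headers = set()
    for filename in iglob(pattern):
        # newline='' is what the csv module documentation recommends for csv.reader
        with open(filename, newline='') as fin:
            # next() pulls only the first row; the [] default covers empty files
            unique_headers.update(next(csv.reader(fin), []))
    return unique_headers
```

Calling collect_unique_headers('*.csv') would then give the same set that the loop builds in unique_headers.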

Answer 2

Answered by Jeff

Here's one way: pass nrows=1 to read_csv, and you get back a single row.

In [8]: import numpy as np; from pandas import DataFrame, read_csv

In [9]: DataFrame(np.random.randn(10,4),columns=list('abcd')).to_csv('test.csv',mode='w')

In [10]: read_csv('test.csv',index_col=0,nrows=1)
Out[10]: 
          a         b         c         d
0  0.365453  0.633631 -1.917368 -1.996505

Answer 3

Answered by Tyler

I might be a little late to the party, but here's one way to do it using just the Python standard library. When dealing with text data, I prefer to use Python 3 because of its Unicode handling. This is very close to your original suggestion, except that I'm only reading in one row rather than the whole file.

import csv

# fpath is the path to one CSV file
with open(fpath, 'r', newline='') as infile:
    reader = csv.DictReader(infile)
    fieldnames = reader.fieldnames  # reads just the header row

Hopefully that helps!
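
To check the claim that only one row is consumed, here is a small sketch that feeds DictReader from a generator which counts the lines requested from it. The helper counting_lines and the sample data are made up for illustration; a real file object will still buffer a chunk from disk, but not the whole file.

```python
import csv


def counting_lines(lines, counter):
    """Yield lines unchanged while recording how many have been requested."""
    for line in lines:
        counter[0] += 1
        yield line


consumed = [0]
# Made-up data standing in for a large CSV file
sample = ["a,b,c\n", "1,2,3\n", "4,5,6\n"]

reader = csv.DictReader(counting_lines(sample, consumed))
fieldnames = reader.fieldnames  # triggers a read of the header row only

print(fieldnames)   # the column names from the first line
print(consumed[0])  # number of lines pulled from the source so far
```

With this sample, the counter stays at 1 after fieldnames is accessed, so the data rows are left untouched until you start iterating over the reader.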


Answer 4

Answered by mdubez

What about:


pandas.read_csv(PATH_TO_CSV, nrows=1).columns

That'll read only the header and the first data row, and return the columns found.

Answer 5

Answered by Muhieddine Alkousy

It depends on what the header will be used for. If you need the headers for comparison purposes only (my case), this code is simple and very fast: it reads the whole header as one string. You can then transform the collected strings according to your needs:

import glob
import os

for filename in glob.glob(os.path.join(files_path, "*.csv")):
    with open(filename) as f:
        first_line = f.readline()  # the header line as one raw string
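
As one possible way to use those raw header strings for comparison, the sketch below groups files by their first line. The function name group_files_by_header is hypothetical, and files_path is assumed to be a directory containing the CSV files.

```python
import glob
import os
from collections import defaultdict


def group_files_by_header(files_path):
    """Map each distinct raw header line to the list of files that start with it."""
    groups = defaultdict(list)
    for filename in sorted(glob.glob(os.path.join(files_path, "*.csv"))):
        with open(filename) as f:
            # Strip the line ending so it does not take part in the comparison
            first_line = f.readline().rstrip("\r\n")
        groups[first_line].append(filename)
    return dict(groups)
```

If the returned dictionary has a single key, all files share the same header. Note that this compares the header text exactly, so differences in quoting or column order count as different headers.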

Answer 6

Answered by Jarno

Expanding on the answer given by Jeff: it is now possible to use pandas without actually reading any rows.

In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: pd.DataFrame(np.random.randn(10, 4), columns=list('abcd')).to_csv('test.csv', mode='w')

In [4]: pd.read_csv('test.csv', index_col=0, nrows=0).columns.tolist()
Out[4]: ['a', 'b', 'c', 'd']

pandas can have the advantage that it deals more gracefully with CSV encodings.

Answer 7

Answered by Aaksh Kumar

import pandas as pd

get_col = list(pd.read_csv("first_test_pipe.csv",sep="|",nrows=1).columns)
print(get_col)

Answer 8

Answered by Saurabh Chandra Patel

You have missed the nrows=1 param to read_csv:

>>> df = pd.read_csv(PATH_TO_CSV, nrows=1)
>>> df.columns
