How can I read only the header row of a CSV file using Python?
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/24962908/
How can I read only the header column of a CSV file using Python?
Asked by Andy
I am looking for a way to read just the header row of a large number of large CSV files.
Using Pandas, I have this method available, for each csv file:
>>> df = pd.read_csv(PATH_TO_CSV)
>>> df.columns
I could do this with just the csv module:
>>> reader = csv.DictReader(open(PATH_TO_CSV))
>>> reader.fieldnames
The problem with these is that each CSV file is 500MB+ in size, and it seems to be a gigantic waste to read in the entire file of each just to pull the header lines.
My end goal of all of this is to pull out unique column names. I can do that once I have a list of column headers that are in each of these files.
How can I extract only the header row of a CSV file, quickly?
Accepted answer by Jon Clements
I've used iglob as an example to search for the .csv files, but one way is to use a set, then adjust as necessary, e.g.:
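import csv
from glob import iglob

unique_headers = set()
for filename in iglob('*.csv'):
    # Python 3: open in text mode with newline='' (on Python 2, use 'rb' instead)
    with open(filename, newline='') as fin:
        csvin = csv.reader(fin)
        # next() pulls only the first row; [] covers an empty file
        unique_headers.update(next(csvin, []))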
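As a self-contained illustration of the set-based approach above, here is a minimal Python 3 sketch. The function name collect_unique_headers and its pattern argument are hypothetical, and it assumes every file keeps its header on the first row.

```python
import csv
from glob import iglob


def collect_unique_headers(pattern):
    """Return the set of column names found in the first row of each matching file.

    Hypothetical helper: assumes each file's first row is its header.
    """
    unique_headers = set()
    for filename in iglob(pattern):
        # newline='' lets the csv module handle line endings itself
        with open(filename, newline='') as fin:
            # next() consumes only the first row; [] covers an empty file
            unique_headers.update(next(csv.reader(fin), []))
    return unique_headers
```

Calling collect_unique_headers('*.csv') would then give the union of the column names across all files in the current directory. Only the first row of each file is parsed.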
Answer by Jeff
Here's one way. You get 1 row.
In [9]: DataFrame(np.random.randn(10,4),columns=list('abcd')).to_csv('test.csv',mode='w')
In [10]: read_csv('test.csv',index_col=0,nrows=1)
Out[10]:
          a         b         c         d
0  0.365453  0.633631 -1.917368 -1.996505
Answer by Tyler
I might be a little late to the party but here's one way to do it using just the Python standard library. When dealing with text data, I prefer to use Python 3 because unicode. So this is very close to your original suggestion except I'm only reading in one row rather than the whole file.
import csv

with open(fpath, 'r') as infile:
    reader = csv.DictReader(infile)
    # Accessing fieldnames reads only the first row of the file
    fieldnames = reader.fieldnames
Hopefully that helps!
Answer by mdubez
What about:
pandas.read_csv(PATH_TO_CSV, nrows=1).columns
That'll read only the header and the first data row, and return the columns found.
Answer by Muhieddine Alkousy
It depends on what the header will be used for. If you need the headers for comparison purposes only (my case), this code is simple and very fast: it reads the whole header as one string. You can then transform the collected strings according to your needs:
import glob
import os

for filename in glob.glob(os.path.join(files_path, "*.csv")):
    with open(filename) as f:
        first_line = f.readline()  # the header row as a single string
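If the header string later needs to be split into column names, a plain split(',') breaks on quoted names that contain commas. A minimal sketch of one way around that, assuming a comma-delimited file whose header fits on one physical line; the helper name split_header is hypothetical:

```python
import csv


def split_header(first_line):
    """Split one raw header line into column names, honouring CSV quoting."""
    # csv.reader accepts any iterable of strings, so wrap the line in a list
    return next(csv.reader([first_line]), [])


print(split_header('id,"last, first",email\n'))
# → ['id', 'last, first', 'email']
```

This keeps the speed of readline() while leaving the quoting rules to the csv module.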
Answer by Jarno
Expanding on the answer given by Jeff: it is now possible to use pandas without actually reading any rows.
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: pd.DataFrame(np.random.randn(10, 4), columns=list('abcd')).to_csv('test.csv', mode='w')
In [4]: pd.read_csv('test.csv', index_col=0, nrows=0).columns.tolist()
Out[4]: ['a', 'b', 'c', 'd']
pandas can have the advantage that it deals more gracefully with CSV encodings.
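For comparison, with the standard library a common encoding surprise is a UTF-8 byte order mark (BOM) ending up inside the first column name. A minimal sketch of one way to avoid it, assuming UTF-8 input; read_header is a hypothetical helper name:

```python
import csv


def read_header(path, encoding='utf-8-sig'):
    """Return the first row of a CSV file as a list of column names.

    'utf-8-sig' decodes UTF-8 and drops a leading BOM if one is present.
    """
    with open(path, newline='', encoding=encoding) as infile:
        # Only the first row is parsed; [] covers an empty file
        return next(csv.reader(infile), [])
```

Files in other encodings would need the matching encoding name passed in explicitly.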
Answer by Aaksh Kumar
import pandas as pd
get_col = list(pd.read_csv("first_test_pipe.csv",sep="|",nrows=1).columns)
print(get_col)
Answer by Saurabh Chandra Patel
You have missed the nrows=1 param to read_csv:
>>> df= pd.read_csv(PATH_TO_CSV, nrows=1)
>>> df.columns