Python: Read specific columns from a csv file with the csv module?

Disclaimer: this page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/16503560/

Time: 2020-08-18 22:49:57  Source: igfitidea

Read specific columns from a csv file with csv module?

Tags: python, csv

Asked by frankV

I'm trying to parse through a csv file and extract the data from only specific columns.

Example csv:

ID | Name | Address | City | State | Zip | Phone | OPEID | IPEDS |
10 | C... | 130 W.. | Mo.. | AL... | 3.. | 334.. | 01023 | 10063 |

I'm trying to capture only specific columns, say ID, Name, Zip and Phone.

Code I've looked at has led me to believe I can call a specific column by its corresponding number, i.e. Name would correspond to 2, and iterating through each row using row[2] would produce all the items in column 2. Only it doesn't.

Here's what I've done so far:

import sys, argparse, csv
from settings import *

# command arguments
parser = argparse.ArgumentParser(description='csv to postgres',\
 fromfile_prefix_chars="@" )
parser.add_argument('file', help='csv file to import', action='store')
args = parser.parse_args()
csv_file = args.file

# open csv file
with open(csv_file, 'rb') as csvfile:

    # get number of columns
    for line in csvfile.readlines():
        array = line.split(',')
        first_item = array[0]

    num_columns = len(array)
    csvfile.seek(0)

    reader = csv.reader(csvfile, delimiter=' ')
        included_cols = [1, 2, 6, 7]

    for row in reader:
            content = list(row[i] for i in included_cols)
            print content

I'm expecting this to print out only the specific columns I want for each row, except it doesn't: I get the last column only.

Accepted answer by Ryan Saxe

The only way you would be getting the last column from this code is if you don't include your print statement in your for loop.

This is most likely the end of your code:

for row in reader:
    content = list(row[i] for i in included_cols)
print content

You want it to be this:

for row in reader:
    content = list(row[i] for i in included_cols)
    print content
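
A minimal, self-contained sketch of the corrected loop, assuming Python 3 and a pipe-delimited file like the example in the question. The sample rows below are made up, and select_columns is a hypothetical helper, not part of the csv module. Note that list indexes are zero-based, so ID, Name, Zip and Phone sit at positions 0, 1, 5 and 6 rather than 1, 2, 6 and 7:

```python
import csv
import io

# Hypothetical in-memory stand-in for the question's file; the values are made up.
sample = (
    "ID|Name|Address|City|State|Zip|Phone|OPEID|IPEDS\n"
    "10|Adam|130 W St|Mobile|AL|36000|3345550100|01023|10063\n"
    "11|Carl|12 E St|Mobile|AL|36001|3345550101|01024|10064\n"
)

# Zero-based positions: ID=0, Name=1, Zip=5, Phone=6.
included_cols = [0, 1, 5, 6]


def select_columns(text, cols, delimiter="|"):
    """Return, for every row, only the fields at the given positions."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return [[row[i] for i in cols] for row in reader]


selected = select_columns(sample, included_cols)
for content in selected:
    # The print call is inside the loop, so every row is shown.
    print(content)
```

With a real file, io.StringIO(text) would be replaced by the file object returned from open(csv_file, newline='').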

Now that we have covered your mistake, I would like to take this time to introduce you to the pandas module.

Pandas is spectacular for dealing with csv files, and the following code would be all you need to read a csv and save an entire column into a variable:

import pandas as pd
df = pd.read_csv(csv_file)
saved_column = df.column_name #you can also use df['column_name']

So if you wanted to save all of the info in your column Names into a variable, this is all you need to do:

names = df.Names

It's a great module and I suggest you look into it. If for some reason your print statement was inside the for loop and it was still only printing out the last column, which shouldn't happen, let me know, because then my assumption was wrong. Your posted code has a lot of indentation errors, so it was hard to know what was supposed to be where. Hope this was helpful!

Answer by HennyH

import csv
from collections import defaultdict

columns = defaultdict(list) # each value in each column is appended to a list

with open('file.txt') as f:
    reader = csv.DictReader(f) # read rows into a dictionary format
    for row in reader: # read a row as {column1: value1, column2: value2,...}
        for (k,v) in row.items(): # go over each column name and value 
            columns[k].append(v) # append the value into the appropriate list
                                 # based on column name k

print(columns['name'])
print(columns['phone'])
print(columns['street'])

With a file like

name,phone,street
Bob,0893,32 Silly
James,000,400 McHilly
Smithers,4442,23 Looped St.

Will output

>>> 
['Bob', 'James', 'Smithers']
['0893', '000', '4442']
['32 Silly', '400 McHilly', '23 Looped St.']

Or alternatively if you want numerical indexing for the columns:

with open('file.txt') as f:
    reader = csv.reader(f)
    next(reader) # skip the header row (works in Python 2.6+ and 3)
    for row in reader:
        for (i,v) in enumerate(row):
            columns[i].append(v)
print(columns[0])

>>> 
['Bob', 'James', 'Smithers']

To change the delimiter, add delimiter=" " to the appropriate instantiation, i.e. reader = csv.reader(f, delimiter=" ")
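
As an illustration of the delimiter argument, here is a small sketch that applies the DictReader approach to a pipe-separated layout like the one in the question. It assumes Python 3; the sample rows are made up, and stripping the padding around names and values is an added assumption about how the file is formatted:

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in the question's pipe-separated layout; values are made up.
sample = (
    "ID | Name | Zip | Phone\n"
    "10 | Adam | 36000 | 3345550100\n"
    "11 | Carl | 36001 | 3345550101\n"
)

columns = defaultdict(list)
reader = csv.DictReader(io.StringIO(sample), delimiter="|")
for row in reader:
    for key, value in row.items():
        # Strip the padding around both the header name and the value.
        columns[key.strip()].append(value.strip())

print(columns["Name"])
print(columns["Phone"])
```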

Answer by G M

You can use numpy.loadtxt(filename). For example, if this is your database .csv:

ID | Name | Address | City | State | Zip | Phone | OPEID | IPEDS |
10 | Adam | 130 W.. | Mo.. | AL... | 3.. | 334.. | 01023 | 10063 |
10 | Carl | 130 W.. | Mo.. | AL... | 3.. | 334.. | 01023 | 10063 |
10 | Adolf | 130 W.. | Mo.. | AL... | 3.. | 334.. | 01023 | 10063 |
10 | Den | 130 W.. | Mo.. | AL... | 3.. | 334.. | 01023 | 10063 |

And you want the Name column:

import numpy as np 
b=np.loadtxt(r'filepath\name.csv',dtype=str,delimiter='|',skiprows=1,usecols=(1,))

>>> b
array([' Adam ', ' Carl ', ' Adolf ', ' Den '], 
      dtype='|S7')

Even more easily, you can use genfromtxt:

b = np.genfromtxt(r'filepath\name.csv', delimiter='|', names=True,dtype=None)
>>> b['Name']
array([' Adam ', ' Carl ', ' Adolf ', ' Den '], 
      dtype='|S7')

Answer by PeteBeat

Context: For this type of work you should use the amazing Python petl library. It will save you a lot of work and potential frustration compared with doing things 'manually' with the standard csv module. AFAIK, the only people who still use the csv module are those who have not yet discovered better tools for working with tabular data (pandas, petl, etc.), which is fine, but if you plan to work with a lot of data from various strange sources in your career, learning something like petl is one of the best investments you can make. Getting started should take only 30 minutes after you've run pip install petl. The documentation is excellent.

Answer: Let's say you have the first table in a csv file (you can also load directly from the database using petl). Then you would simply load it and do the following.

from petl import fromcsv, look, cut, tocsv

# Load the table
table1 = fromcsv('table1.csv')
# Keep only the wanted columns
table2 = cut(table1, 'Song_Name', 'Artist_ID')
# Have a quick look to make sure things are OK. Prints a nicely formatted table to your console
print(look(table2))
# Save to a new file
tocsv(table2, 'new.csv')

Answer by ayhan

With pandas you can use read_csv with the usecols parameter:

df = pd.read_csv(filename, usecols=['col1', 'col3', 'col7'])

Example:

import pandas as pd
import io

s = '''
total_bill,tip,sex,smoker,day,time,size
16.99,1.01,Female,No,Sun,Dinner,2
10.34,1.66,Male,No,Sun,Dinner,3
21.01,3.5,Male,No,Sun,Dinner,3
'''

df = pd.read_csv(io.StringIO(s), usecols=['total_bill', 'day', 'size'])
print(df)

   total_bill  day  size
0       16.99  Sun     2
1       10.34  Sun     3
2       21.01  Sun     3

Answer by Suren

To fetch the column names, it is better to use readline() instead of readlines(): this avoids looping, reading the complete file, and storing it in an array.

with open(csv_file, 'r') as csvfile:

    # read only the header line to get the column names
    line = csvfile.readline()

    column_names = line.strip().split(',')

    # get number of columns
    num_columns = len(column_names)
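
The same idea can be sketched with csv.reader, which also handles quoted fields that a plain split(',') would break. This assumes Python 3 and a comma-separated file with a header row; the sample data and the wanted list are hypothetical:

```python
import csv
import io

# Hypothetical header and row; in practice csvfile would be an open file.
csvfile = io.StringIO(
    "ID,Name,Address,City,State,Zip,Phone\n"
    "10,Adam,130 W St,Mobile,AL,36000,3345550100\n"
)

reader = csv.reader(csvfile)
header = next(reader)  # consumes only the first row
num_columns = len(header)

# Map each wanted column name to its zero-based position.
wanted = ["ID", "Name", "Zip", "Phone"]
included_cols = [header.index(name) for name in wanted]

# The reader continues from the second row, so no seek(0) is needed.
rows = [[row[i] for i in included_cols] for row in reader]
print(num_columns, included_cols, rows)
```

Looking positions up by name means the code keeps working if the column order in the file changes; header.index raises ValueError if a wanted name is missing.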

Answer by VasiliNovikov

Use pandas:

import pandas as pd
my_csv = pd.read_csv(filename)
column = my_csv.column_name
# you can also use my_csv['column_name']

Discard unneeded columns at parse time:

my_filtered_csv = pd.read_csv(filename, usecols=['col1', 'col3', 'col7'])

P.S. I'm just aggregating what others have said in a simple manner. Actual answers are taken from here and here.

Answer by vestland

Thanks to the way you can index and subset a pandas dataframe, a very easy way to extract a single column from a csv file into a variable is:

myVar = pd.read_csv('YourPath', sep = ",")['ColumnName']


A few things to consider:

The snippet above will produce a pandas Series and not a DataFrame. The suggestion from ayhan with usecols will also be faster if speed is an issue. Testing the two approaches using %timeit on a 2122 KB csv file yields 22.8 ms for the usecols approach and 53 ms for my suggested approach.
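
The Series versus DataFrame difference can be checked with a small sketch. It assumes pandas is installed; the CSV text is a made-up sample, and the variable names are only illustrative:

```python
import io

import pandas as pd

# Made-up sample data, read from memory instead of a file on disk.
csv_text = "total_bill,tip,day\n16.99,1.01,Sun\n10.34,1.66,Sun\n"

# usecols keeps the result a DataFrame, here with a single column.
as_frame = pd.read_csv(io.StringIO(csv_text), usecols=["day"])

# Indexing the parsed frame with one column name yields a Series.
as_series = pd.read_csv(io.StringIO(csv_text))["day"]

print(type(as_frame).__name__)
print(type(as_series).__name__)

# The values are identical either way.
same = as_frame["day"].equals(as_series)
print(same)
```

The two approaches can also be combined: passing usecols and then indexing with the column name parses only the needed column and still returns a Series.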

And don't forget import pandas as pd

Answer by Robert Jensen

If you need to process the columns separately, I like to destructure the columns with the zip(*iterable) pattern (effectively "unzip"). So for your example:

ids, names, zips, phones = zip(*(
  (row[1], row[2], row[6], row[7])
  for row in reader
))
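
A self-contained sketch of the same pattern, assuming Python 3, a comma-separated file with a header row, and made-up sample data. It uses zero-based positions (0, 1, 5, 6) for ID, Name, Zip and Phone, which is an assumption about the column layout rather than the indexes shown above:

```python
import csv
import io

# Made-up sample data standing in for the real file.
sample = (
    "ID,Name,Address,City,State,Zip,Phone\n"
    "10,Adam,130 W St,Mobile,AL,36000,3345550100\n"
    "11,Carl,12 E St,Mobile,AL,36001,3345550101\n"
)

reader = csv.reader(io.StringIO(sample))
next(reader)  # skip the header row

# zip(*rows) turns a sequence of row tuples into one tuple per column.
ids, names, zips, phones = zip(*(
    (row[0], row[1], row[5], row[6])
    for row in reader
))

print(ids)
print(names)
```

If the file has no data rows, zip(*...) yields nothing and the four-way unpacking raises ValueError, so an empty file needs separate handling.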

Answer by Hari K

import pandas as pd 
csv_file = pd.read_csv("file.csv") 
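column_val_list = csv_file.column_name.values  # NumPy array of the column's values (the private _ndarray_values attribute is not available in newer pandas)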