Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original: http://stackoverflow.com/questions/47917943/

Time: 2020-09-14 04:56:02  Source: igfitidea

How to select several rows when reading a csv file using pandas?

Tags: python, pandas, csv

Asked by huier

I have a very large CSV file with millions of rows, and a list of the row numbers that I need, like this:

rownumberList = [1,2,5,6,8,9,20,22]

I know there is a parameter called skiprows that skips several rows when reading a CSV file, like this:

df = pd.read_csv('myfile.csv', skiprows=skiplist)
# skiplist would contain every row number in the file except those in rownumberList

However, since the CSV file is very large, directly selecting the rows I need could be more efficient. So I was wondering: is there a way to select rows while using read_csv, rather than selecting them from the DataFrame afterwards? I am trying to minimize the time spent reading the file. Thanks.

Answered by Bharath

There is a parameter called nrows (int, default None): the number of rows of the file to read. It is useful for reading pieces of large files (Docs).

pd.read_csv(file_name, nrows=n)  # n is an int: how many data rows to read

If you need a part from the middle of the file, use both skiprows and nrows in read_csv: skiprows gives the number of lines to skip at the start, and nrows gives the number of rows to read after skipping.

Example:

pd.read_csv('../input/sample_submission.csv',skiprows=5,nrows=10)

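With an integer skiprows, the first 5 lines of the file are skipped, header line included, so the 6th line is used as the header and the next 10 lines (the 7th through the 16th) are read as data.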
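A small sketch of how an integer skiprows interacts with the header, using an in-memory CSV. The column name `value` and the data are made up for illustration. It compares an integer skiprows with a range that starts at 1 and therefore leaves the header on line 0 in place.

```python
import io

import pandas as pd

# Hypothetical data: a header line followed by 20 data lines (values 1..20).
csv_text = "value\n" + "\n".join(str(i) for i in range(1, 21)) + "\n"

# Integer skiprows: file lines 0-4 are skipped (header included), so the
# 6th line of the file is consumed as the header and the next 10 lines are data.
a = pd.read_csv(io.StringIO(csv_text), skiprows=5, nrows=10)

# List-like skiprows starting at 1: the header on line 0 is kept and
# file lines 1-5 are skipped instead.
b = pd.read_csv(io.StringIO(csv_text), skiprows=range(1, 6), nrows=10)

print(list(a.columns), a.iloc[0, 0], a.iloc[-1, 0])
print(list(b.columns), b.iloc[0, 0], b.iloc[-1, 0])
```

Both calls return the same ten data values, but only the second keeps the original column name.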

Edit based on comment:

Since you have a list of row numbers, something like this might help:

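import pandas as pd

li = [1, 2, 3, 5, 9]  # file line numbers to keep; line 0 is assumed to be the header
# Skip every line up to max(li) that is not wanted, but never the header (line 0).
r = [i for i in range(1, max(li) + 1) if i not in li]
# nrows=len(li) stops parsing once all the wanted rows have been read.
df = pd.read_csv('../input/sample_submission.csv', skiprows=r, nrows=len(li))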

Answered by alecxe

I am not sure about read_csv() from pandas (though there is a way to use an iterator to read a large file in chunks), but you can read the file line by line (lazily, without loading the whole file into memory) with csv.reader (or csv.DictReader), keeping only the desired rows with the help of enumerate():

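import csv

import pandas as pd

# Row indices to keep; enumerate() counts from 0, so a header line would be row 0.
DESIRED_ROWS = {1, 17, 28}
with open("input.csv", newline="") as input_file:
    reader = csv.reader(input_file)

    # Rows are read lazily, one at a time; only the desired ones are kept.
    desired_rows = [row for row_number, row in enumerate(reader)
                    if row_number in DESIRED_ROWS]

df = pd.DataFrame(desired_rows)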

(This assumes you want to pick scattered, discontinuous rows and not a continuous chunk from somewhere in the middle. In that case, @James's idea of using "start" and "stop" would generally work better.)
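The chunked iterator mentioned above could look roughly like this. It is a sketch on made-up in-memory data, not the answer author's code: the `value` column, the chunk size of 10, and the `wanted` positions are all assumptions. It relies on each chunk carrying a continuing default row index, so the filter works on global positions.

```python
import io

import pandas as pd

# Hypothetical data: a header plus 100 data rows holding the values 0..99.
csv_text = "value\n" + "\n".join(str(i) for i in range(100)) + "\n"
wanted = [1, 17, 28]  # 0-based positions among the data rows (assumption)

pieces = []
# chunksize makes read_csv return an iterator of DataFrames instead of one frame.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=10):
    # Keep only the rows of this chunk whose running index is wanted.
    pieces.append(chunk[chunk.index.isin(wanted)])
    if chunk.index[-1] >= max(wanted):
        break  # nothing wanted lies beyond this chunk

df = pd.concat(pieces)
print(df)
```

Only one chunk is held in memory at a time, and reading stops after the chunk that contains the last wanted row.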

Answered by J. Weikert

import pandas as pd

df = pd.read_csv('Data.csv')

df.iloc[3:6] 

Returns rows 3 through 5 and all columns.

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.iloc.html

Answered by Traxidus Wolf

From the documentation you can see that skiprows can take an integer or a list as its value to skip some lines.

So basically you can tell it to skip all lines except the ones you want. For this you first need to know the number of lines in the file (best if you know it beforehand), which you can get by opening the file and counting:

with open('myfile.csv') as f:
    row_count = sum(1 for row in f)

Now you need to create the complementary collection (sets are used here instead of lists, and they work as well). First build the set of line numbers from 1 to the number of rows, then subtract the numbers of the rows you want to read.

skiplist = set(range(1, row_count+1)) - set(rownumberList)

Finally you can read the csv as normal.

df = pd.read_csv('myfile.csv',skiprows = skiplist)

Here is the full code:

import pandas as pd

with open('myfile.csv') as f:
    row_count = sum(1 for row in f)

rownumberList = [1,2,5,6,8,9,20,22]
skiplist = set(range(1, row_count+1)) - set(rownumberList)

df = pd.read_csv('myfile.csv', skiprows=skiplist)
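The set arithmetic can be checked in isolation. This is a minimal sketch: `build_skiplist` is a hypothetical helper name and the numbers are made up. It assumes line 0 is the header and stops the range at row_count, since valid line indices run from 0 to row_count - 1 (the extra index produced by range(1, row_count+1) should be harmless, because no such line exists).

```python
def build_skiplist(row_count, wanted):
    """Return the line indices to skip: all lines except the header (line 0)
    and the wanted lines."""
    # Valid line indices are 0 .. row_count - 1, so the range stops at row_count.
    return set(range(1, row_count)) - set(wanted)


# Hypothetical file with 10 lines (header + 9 data lines), keeping lines 1, 2 and 5.
skip = build_skiplist(10, [1, 2, 5])
print(sorted(skip))
```

The resulting set can be passed as skiprows to read_csv in the same way as skiplist above.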

Answered by daniel zoulla

You could try this:

import pandas as pd

# Make a DataFrame from a CSV file
data = pd.read_csv("your_csv_file.csv", index_col="What_you_want")
# Retrieve multiple rows by position with iloc
rows = data.iloc[[1, 2, 5, 6, 8, 9, 20, 22]]

Answered by gaozhidf

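import pandas as pd

rownumberList = [1,2,5,6,8,9,20,22]
rows_to_keep = set(rownumberList)  # set lookups stay fast over millions of lines
# The callable is called with every line index, including 0. Keep line 0 so the
# header is not skipped; drop the "x != 0" check if the file has no header.
df = pd.read_csv('myfile.csv', skiprows=lambda x: x != 0 and x not in rows_to_keep)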

With pandas 0.25.1, you can pass a callable to the skiprows parameter of read_csv.
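One detail worth checking with a callable: it receives every line index, including 0, so a header on line 0 is skipped unless the callable excludes it. The sketch below uses made-up in-memory data (the `value` column and `wanted` set are assumptions) and also passes nrows, an optional extra so that parsing can stop once all the wanted rows have been collected.

```python
import io

import pandas as pd

# Hypothetical data: header on line 0, then file lines 1..30 holding the values 1..30.
csv_text = "value\n" + "\n".join(str(i) for i in range(1, 31)) + "\n"
wanted = {1, 2, 5, 6, 8, 9, 20, 22}  # file line numbers to keep

# Line 0 is never skipped, so the header survives.
df = pd.read_csv(
    io.StringIO(csv_text),
    skiprows=lambda i: i != 0 and i not in wanted,
    nrows=len(wanted),
)
print(df["value"].tolist())
```

Because each data line here holds its own line number, the printed values show directly which lines were kept.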

Answered by James

You will not be able to circumvent the read time when accessing a large file. If you have a very large CSV file, any program will need to read through it at least up to the point where you want to begin extracting rows. Really, that is what databases are designed for.

However, if you want to extract rows 300,000 to 300,123 from a 10,000,000-row CSV file, you are better off reading just the data you need into Python before converting it to a data frame in pandas. For this you can use the csv module.

import csv
import pandas as pd

start = 300000
stop = start + 123
data = []
with open('/very/large.csv', 'r', newline='') as fp:
    reader = csv.reader(fp)
    for i, line in enumerate(reader):
        if i > stop:
            break  # past the last wanted row: stop reading
        if i >= start:
            data.append(line)  # keeps rows start..stop inclusive

df = pd.DataFrame(data)
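The same start/stop idea can also be written with itertools.islice, which skips and stops lazily. This is a sketch under assumptions: `read_slice` is a hypothetical helper, the data is made up, and unlike the loop above it reads the header separately, so start and stop count data rows only (0-based, inclusive).

```python
import csv
import io
from itertools import islice


def read_slice(fp, start, stop):
    """Return (header, rows) for data rows start..stop inclusive.

    Rows are counted from 0 and the header line is not counted.
    """
    reader = csv.reader(fp)
    header = next(reader)  # consume the header line first
    # islice discards the first `start` rows and stops after row `stop`.
    rows = list(islice(reader, start, stop + 1))
    return header, rows


# Hypothetical in-memory file: a header plus 10 data rows.
fp = io.StringIO("id,name\n" + "".join(f"{i},n{i}\n" for i in range(10)))
header, rows = read_slice(fp, 3, 5)
print(header, rows)
```

The result can then be handed to pandas with pd.DataFrame(rows, columns=header), which also gives the columns their names.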

Answered by Nacho Monsalve

for i in range(1, 20):  # i runs from 1 to 19; the stop value 20 is excluded
    ...  # process row i here

The first argument is the first row; the second is one past the last row, because range excludes its stop value.