How to select several rows when reading a csv file using pandas?
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must keep it under the same license, cite the original URL and authors, and attribute it to the original authors (not me): StackOverflow
Original source: http://stackoverflow.com/questions/47917943/
Asked by huier
I have a very large csv file with millions of rows and a list of the row numbers that I need, like this:
rownumberList = [1,2,5,6,8,9,20,22]
I know there is a parameter called skiprows that helps to skip several rows when reading a csv file, like this:
df = pd.read_csv('myfile.csv', skiprows=skiplist)
# skiplist would contain every row number that is not in rownumberList
However, since the csv file is very large, directly selecting the rows that I need could be more efficient. So I was wondering: is there any way to select rows while calling read_csv, rather than selecting rows from the dataframe afterwards? I am trying to minimize the time spent reading the file. Thanks.
Answer by Bharath
There is a parameter called nrows (int, default None): the number of rows of the file to read. It is useful for reading pieces of large files (see the docs).
pd.read_csv(file_name,nrows=int)
In case you need some part in the middle, use both skiprows and nrows in read_csv: skiprows indicates how many rows to skip at the beginning, and nrows indicates how many rows to read after skipping.
Example:
pd.read_csv('../input/sample_submission.csv',skiprows=5,nrows=10)
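This skips the first 5 lines of the file and reads the next 10 rows. Note that the first line after the skipped ones is used as the header unless header=None is passed.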
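To make the row arithmetic concrete, here is a minimal sketch on an in-memory CSV. The sample data and the column names `id` and `value` are made up for illustration. Passing a range that starts at 1 to skiprows keeps the header line, which a plain integer would skip.

```python
import io
import pandas as pd

# Hypothetical sample data: a header line followed by 30 data rows.
csv_text = "id,value\n" + "\n".join(f"{i},{i * 10}" for i in range(1, 31))

# Skip file lines 1-5 (line 0 is the header and is kept), then read 10 rows.
df = pd.read_csv(io.StringIO(csv_text), skiprows=range(1, 6), nrows=10)

first_id = int(df["id"].iloc[0])
last_id = int(df["id"].iloc[-1])
```

With this layout, the frame should hold the ten rows that follow the skipped ones, with the original column names intact.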
Edit based on comment:
Since there is a list, this might help:
li = [1,2,3,5,9]
# Rows to skip: every line from 1 to max(li) that is not wanted (line 0, the header, is kept).
r = [i for i in range(1, max(li) + 1) if i not in li]
df = pd.read_csv('../input/sample_submission.csv', skiprows=r, nrows=len(li))
# This skips the rows you don't want and stops reading once all rows in the list have been read.
Answer by alecxe
I am not sure about read_csv() from Pandas (although there is a way to use an iterator for reading a large file in chunks), but you can read the file line by line (lazy loading, without reading the whole file into memory) with csv.reader (or csv.DictReader), keeping only the desired rows with the help of enumerate():
import csv
import pandas as pd
DESIRED_ROWS = {1, 17, 28}
with open("input.csv") as input_file:
    reader = csv.reader(input_file)
    desired_rows = [row for row_number, row in enumerate(reader)
                    if row_number in DESIRED_ROWS]
df = pd.DataFrame(desired_rows)
(This assumes you would like to pick random/discontinuous rows and not a "continuous chunk" from somewhere in the middle. In that case @James's idea of having a "start" and a "stop" would generally work better.)
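The chunked iterator mentioned above could look roughly like this. It is a sketch only: the chunk size and the sample data are arbitrary assumptions, and the wanted rows are matched against the default integer index, where 0 is the first data row after the header.

```python
import io
import pandas as pd

# Hypothetical sample data standing in for a large file.
csv_text = "id,value\n" + "\n".join(f"{i},{i * 10}" for i in range(100))

wanted = {1, 17, 28}  # positions of data rows, counted from 0

pieces = []
# Passing chunksize makes read_csv return an iterator of DataFrames.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=10):
    # The default index keeps counting across chunks, so isin() works directly.
    selected = chunk[chunk.index.isin(wanted)]
    if not selected.empty:
        pieces.append(selected)

df = pd.concat(pieces)
```

Only one chunk is held in memory at a time, at the cost of still parsing every line of the file.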
Answer by J. Weikert
import pandas as pd
df = pd.read_csv('Data.csv')
df.iloc[3:6]
Returns the rows at positions 3 through 5 (zero-based) and all columns.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.iloc.html
Answer by Traxidus Wolf
From the documentation you can see that skiprows can take an integer or a list as its value to remove some lines.
So basically you can tell it to remove all lines except those you want. For this you first need to know the number of lines in the file (best if you know it beforehand), which you can get by opening it and counting as follows:
with open('myfile.csv') as f:
    row_count = sum(1 for row in f)
Now you need to create the complementary list (a set is used here, which read_csv accepts as well). First you create the set of numbers from 1 to the number of rows, and then you subtract the numbers of the rows you want to read.
skiplist = set(range(1, row_count+1)) - set(rownumberList)
Finally, you can read the csv as normal.
df = pd.read_csv('myfile.csv', skiprows=skiplist)
Here is the full code:
import pandas as pd
with open('myfile.csv') as f:
    row_count = sum(1 for row in f)
rownumberList = [1,2,5,6,8,9,20,22]
skiplist = set(range(1, row_count+1)) - set(rownumberList)
df = pd.read_csv('myfile.csv', skiprows=skiplist)
Answer by daniel zoulla
You could try this:
import pandas as pd
# make a data frame from a csv file
data = pd.read_csv("your_csv_file.csv", index_col="What_you_want")
# retrieve multiple rows by position with iloc
rows = data.iloc[[1, 2, 5, 6, 8, 9, 20, 22]]
Answer by gaozhidf
import pandas as pd
rownumberList = [1,2,5,6,8,9,20,22]
df = pd.read_csv('myfile.csv',skiprows=lambda x: x not in rownumberList)
With pandas 0.25.1, you can pass a callable to the skiprows parameter of read_csv.
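One thing to watch with the callable form: the function is also evaluated for line 0, so the header is skipped unless 0 is kept explicitly. The sketch below is one way to handle that, run on made-up in-memory data. The name `keep` and the column names are assumptions for illustration; a set is used so each membership test is cheap on a file with millions of rows.

```python
import io
import pandas as pd

# Hypothetical sample data: a header line followed by 30 data rows.
csv_text = "id,value\n" + "\n".join(f"{i},{i * 10}" for i in range(1, 31))

rownumberList = [1, 2, 5, 6, 8, 9, 20, 22]
keep = set(rownumberList) | {0}  # line 0 is the header line

# The callable receives each line index and returns True for lines to skip.
df = pd.read_csv(io.StringIO(csv_text), skiprows=lambda x: x not in keep)
```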
Answer by James
You will not be able to circumvent the read time when accessing a large file. If you have a very large CSV file, any program will need to read through it at least up to the point where you want to begin extracting rows. Really, that is what databases are designed for.
However, if you want to extract rows 300,000 to 300,123 from a 10,000,000 row CSV file, you are better off reading just the data you need into Python before converting it to a data frame in Pandas. For this you can use the csv module.
import csv
import pandas as pd
start = 300000
stop = start + 123
data = []
with open('/very/large.csv', 'r') as fp:
    reader = csv.reader(fp)
    for i, line in enumerate(reader):
        # stop reading as soon as we are past the last wanted row
        if i > stop:
            break
        # keep rows start..stop (inclusive)
        if i >= start:
            data.append(line)
df = pd.DataFrame(data)
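The same start/stop idea can also be written with itertools.islice, which stops consuming the reader once the slice is exhausted. This is a sketch on made-up in-memory data with smaller numbers than the example above; the values of `start` and `stop` are arbitrary.

```python
import csv
import io
from itertools import islice

import pandas as pd

# Hypothetical sample data: 1000 lines, no header.
csv_text = "\n".join(f"{i},{i * 10}" for i in range(1000))

start, stop = 300, 323

with io.StringIO(csv_text) as fp:
    reader = csv.reader(fp)
    # islice yields lines start..stop inclusive and reads nothing beyond them.
    rows = list(islice(reader, start, stop + 1))

df = pd.DataFrame(rows)
```

As with the loop version, csv.reader returns every field as a string, so numeric columns need converting afterwards.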
Answer by Nacho Monsalve
for i in range(1, 20):
The first argument is the first row and the second argument is one past the last row, so this covers rows 1 to 19.