Original question: http://stackoverflow.com/questions/26806581/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Python pandas DataFrame from first and last row of csv
Asked by wrcobb
All -
I am looking to create a pandas DataFrame from only the first and last lines of a very large csv. The purpose of this exercise is to be able to easily grab some attributes from the first and last entries in these csv files. I have no problem grabbing the first line of the csv using:
pd.read_csv(filename, nrows=1)
I also have no problem grabbing the last row of a text file in various ways, such as:
with open(filename) as f:
    last_line = f.readlines()[-1]
However, getting these two things into a single DataFrame has thrown me for a loop. Any insight into how best to achieve this goal?
EDIT NOTE: I am trying to achieve this task without loading all of the data into a single DataFrame first as I am dealing with pretty large (>15MM rows) csv files.
Thanks!
Answered by Jerome Montino
Just use head, tail, and concat. You can even adjust the number of rows.
import pandas as pd
df = pd.read_csv("flu.csv")
top = df.head(1)
bottom = df.tail(1)
concatenated = pd.concat([top,bottom])
print(concatenated)
Result:
           Date  Cases
0      9/1/2014     45
121  12/31/2014     97
Adjusting head and tail to take 5 rows from the top and 10 from the bottom...
           Date  Cases
0      9/1/2014     45
1      9/2/2014    104
2      9/3/2014     47
3      9/4/2014    108
4      9/5/2014     49
112  12/22/2014     30
113  12/23/2014     81
114  12/24/2014     99
115  12/25/2014     85
116  12/26/2014     55
117  12/27/2014     91
118  12/28/2014     68
119  12/29/2014    109
120  12/30/2014     55
121  12/31/2014     97
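The call that produces the output above is not shown. A minimal sketch of it, assuming the same two-column layout and using a small in-memory stand-in for flu.csv (the stand-in data is hypothetical), might look like this:

```python
import io
import pandas as pd

# Hypothetical stand-in for flu.csv: 20 rows with the same two columns.
csv_text = "Date,Cases\n" + "".join(f"day{i},{i * 10}\n" for i in range(20))
df = pd.read_csv(io.StringIO(csv_text))

# 5 rows from the top, 10 from the bottom; the original index labels are kept.
concatenated = pd.concat([df.head(5), df.tail(10)])
print(concatenated)
```

Note that this still parses the whole file first, so it only suits files that fit comfortably in memory.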
One possible approach that can be used if you don't want to load the whole CSV file as a dataframe is to process them as CSVs alone. The following code is similar to your approach.
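import pandas as pd
import csv

# Read only the header and the first data row.
top = pd.read_csv("flu.csv", nrows=1)
headers = top.columns.values

# Write the header and the last line to a small helper file, flu2.csv.
# newline="" stops the csv module from emitting blank lines on Windows.
with open("flu.csv", "r") as f, open("flu2.csv", "w", newline="") as g:
    last_line = f.readlines()[-1].strip().split(",")
    c = csv.writer(g)
    c.writerow(headers)
    c.writerow(last_line)

bottom = pd.read_csv("flu2.csv")
concatenated = pd.concat([top, bottom])
concatenated.reset_index(inplace=True, drop=True)
print(concatenated)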
The result is the same, except for the index. I tested it against a million rows and it was processed in about a second.
        Date  Cases
0   9/1/2014     45
1  7/25/4885     99
[Finished in 0.9s]
How it scales to 15 million rows is the real question for you, so I decided to test it against exactly 15,728,626 rows, and the results seem good enough.
        Date  Cases
0   9/1/2014     45
1  7/25/4885     99
[Finished in 3.3s]
Answered by JD Long
The way to do this without first reading the whole file into Python is to grab the first line, then iterate through the file to the last line. Then use StringIO to pull both lines into pandas. Maybe something like this:
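import pandas as pd
from io import StringIO  # Python 3; the original used the Python 2 StringIO module

with open('tst.csv') as f:
    first_line = f.readline()
    last_line = first_line  # fallback for a single-line file
    for line in f:
        last_line = line  # iterate to the end, keeping only the latest line

# DataFrame.append was removed in pandas 2.0, so combine with pd.concat instead
mydf = pd.concat([
    pd.read_csv(StringIO(first_line), header=None),
    pd.read_csv(StringIO(last_line), header=None),
])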
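A variation on the same idea, offered only as a sketch: collections.deque with maxlen=1 can consume the rest of the file and keep just the final line. This is a swapped-in technique, not the answer's own code. It assumes the file has a header row plus at least two data rows, and it uses an in-memory file object with made-up data in place of tst.csv.

```python
import io
from collections import deque
import pandas as pd

# Stand-in for open('tst.csv'); a real file object behaves the same way here.
f = io.StringIO("Date,Cases\n9/1/2014,45\n9/2/2014,104\n12/31/2014,97\n")

header = f.readline()
first_row = f.readline()
# deque with maxlen=1 consumes the remaining lines and keeps only the last one.
last_row = deque(f, maxlen=1)[0]

# Keeping the header line lets read_csv name the columns.
df = pd.read_csv(io.StringIO(header + first_row + last_row))
print(df)
```

The file is still scanned from start to end, but only one line is held in memory at a time.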
Answered by allen-smithee
You want this answer: https://stackoverflow.com/a/18603065/4226476. It is not the accepted answer, but it is the best one because it seeks backwards for the first newline instead of guessing.
Then wrap the two lines in a StringIO:
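from io import StringIO  # Python 3 replacement for cStringIO
import pandas as pd

# grab the lines as per the first-and-last-line answer linked above;
# the_two_lines is a placeholder for that text
truncated_input = StringIO(the_two_lines)
truncated_input.seek(0)  # rewind (harmless; a new StringIO already starts at position 0)
df = pd.read_csv(truncated_input)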
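To make the seek-backwards idea concrete, here is a rough sketch. The helper name first_and_last_lines and the demo data are hypothetical, not taken from the linked answer. It assumes a UTF-8 file with "\n" line endings and more lines than n_head, and it keeps the header line as well so that read_csv can name the columns.

```python
import io
import os
import tempfile
import pandas as pd

def first_and_last_lines(path, n_head=2):
    """Hypothetical helper: return the first n_head lines plus the last line.

    Assumes the file has more than n_head lines, uses "\n" line endings,
    and is UTF-8 encoded.
    """
    with open(path, "rb") as f:
        head = [f.readline() for _ in range(n_head)]  # header + first data row
        f.seek(-2, os.SEEK_END)      # start just before a possible trailing newline
        while f.read(1) != b"\n":    # step backwards until the previous newline
            f.seek(-2, os.SEEK_CUR)
        last = f.readline()          # everything after that newline is the last line
    return (b"".join(head) + last).decode("utf-8")

# Demo on a small temporary file standing in for a large CSV.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    tmp.write("Date,Cases\n9/1/2014,45\n9/2/2014,104\n12/31/2014,97\n")
    path = tmp.name

df = pd.read_csv(io.StringIO(first_and_last_lines(path)))
os.remove(path)
print(df)
```

Because only the start and the tail of the file are read, the cost should not grow with the number of rows in between.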
Answered by Stefan Manole
This is the best solution I found:
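import pandas as pd
with open(filename) as f:
    count = sum(1 for _ in f)  # count lines without holding them all in memory
# skip everything except line 0 (header), line 1 (first row) and the last line
df = pd.read_csv(filename, skiprows=range(2, count - 1), header=0)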
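To see which lines the skiprows range keeps, here is a small sketch on made-up in-memory data (the sample rows are an assumption, not from the original file):

```python
import io
import pandas as pd

csv_text = "Date,Cases\n9/1/2014,45\n9/2/2014,104\n9/3/2014,47\n12/31/2014,97\n"
count = len(csv_text.splitlines())  # 5 lines: header + 4 data rows

# Lines 2 .. count-2 are skipped, leaving the header (line 0),
# the first row (line 1) and the last row (line count-1).
df = pd.read_csv(io.StringIO(csv_text), skiprows=range(2, count - 1), header=0)
print(df)
```

The file is still read twice, once to count lines and once to parse, but only two data rows are ever turned into a DataFrame.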

