Python pandas DataFrame from first and last row of csv

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/26806581/

Date: 2020-09-13 22:38:43  Source: igfitidea

Tags: python, csv, pandas, dataframe

Asked by wrcobb

All -

I am looking to create a pandas DataFrame from only the first and last lines of a very large csv. The purpose of this exercise is to be able to easily grab some attributes from the first and last entries in these csv files. I have no problem grabbing the first line of the csv using:

pd.read_csv(filename, nrows=1)

I also have no problem grabbing the last row of a text file in various ways, such as:

with open(filename) as f:
    last_line = f.readlines()[-1]

However, getting these two things into a single DataFrame has thrown me for a loop. Any insight into how best to achieve this goal?

EDIT NOTE: I am trying to achieve this task without loading all of the data into a single DataFrame first as I am dealing with pretty large (>15MM rows) csv files.

Thanks!

Answered by Jerome Montino

Just use head, tail, and concat. You can even adjust the number of rows. Note that this loads the whole file into a DataFrame first.

import pandas as pd

df = pd.read_csv("flu.csv")
top = df.head(1)
bottom = df.tail(1)
concatenated = pd.concat([top,bottom])

print(concatenated)

Result:

           Date  Cases
0      9/1/2014     45
121  12/31/2014     97

Adjusting head and tail to take 5 rows from the top and 10 from the bottom:

           Date  Cases
0      9/1/2014     45
1      9/2/2014    104
2      9/3/2014     47
3      9/4/2014    108
4      9/5/2014     49
112  12/22/2014     30
113  12/23/2014     81
114  12/24/2014     99
115  12/25/2014     85
116  12/26/2014     55
117  12/27/2014     91
118  12/28/2014     68
119  12/29/2014    109
120  12/30/2014     55
121  12/31/2014     97

If you don't want to load the whole CSV file as a DataFrame, one possible approach is to handle the last line as plain CSV text instead. The following code is similar to your approach.

import pandas as pd
import csv

top = pd.read_csv("flu.csv", nrows=1)
headers = top.columns.values

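with open("flu.csv", "r") as f, open("flu2.csv", "w", newline="") as g:
    # Note: readlines() still reads every line into memory as plain strings.
    # The naive split(",") assumes no quoted fields that contain commas.
    last_line = f.readlines()[-1].strip().split(",")
    c = csv.writer(g)
    c.writerow(headers)
    c.writerow(last_line)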

bottom = pd.read_csv("flu2.csv")
concatenated = pd.concat([top, bottom])
concatenated.reset_index(inplace=True, drop=True)

print(concatenated)

The result is the same, except for the index. I tested it against a million rows and it was processed in about a second.

        Date  Cases
0   9/1/2014     45
1  7/25/4885     99
[Finished in 0.9s]

How it scales to 15 million rows is the real question for your case, so I decided to test it against exactly 15,728,626 rows, and the results seem good enough.

        Date  Cases
0   9/1/2014     45
1  7/25/4885     99
[Finished in 3.3s]

Answered by JD Long

The way to do this without first reading the whole file into Python is to grab the first line, then iterate through the file to the last line. Then use StringIO to load them into pandas. Maybe something like this:

import pandas as pd
from io import StringIO  # Python 3 replacement for the StringIO module

with open('tst.csv') as f:
    first_line = f.readline()  # this is the header line if the file has one
    for line in f:
        pass  # iterate to the end
    last_line = line

# DataFrame.append was removed in pandas 2.0, so use pd.concat instead
mydf = pd.concat([
    pd.read_csv(StringIO(first_line), header=None),
    pd.read_csv(StringIO(last_line), header=None),
])
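
A variant of the same scan, as a minimal sketch: `collections.deque` with `maxlen=1` keeps only the most recent line while iterating, and the header, first row, and last row are parsed together so the column names carry over. The function name `first_and_last_by_scan` is hypothetical, and the sketch assumes the file has a header row and no trailing blank lines.

```python
import io
from collections import deque

import pandas as pd


def first_and_last_by_scan(path):
    """Scan the file once and return a DataFrame of its first and last rows."""
    with open(path, newline="") as f:
        header = f.readline()
        first = f.readline()
        # A deque with maxlen=1 discards every line except the last one seen.
        tail = deque(f, maxlen=1)
    # If the file has only one data row, tail is empty and one row is returned.
    text = "".join([header, first] + list(tail))
    return pd.read_csv(io.StringIO(text))
```

This still reads every byte of the file, but it never holds more than a few lines in memory at once.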

Answered by allen-smithee

You want this answer: https://stackoverflow.com/a/18603065/4226476. It is not the accepted answer, but it is the best one because it seeks backwards for the first newline instead of guessing.

Then wrap the two lines in a StringIO:

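from io import StringIO  # cStringIO exists only in Python 2
import pandas as pd

# grab the lines as per the first-and-last-line question;
# the_two_lines is a placeholder for that text
truncated_input = StringIO(the_two_lines)
truncated_input.seek(0)  # rewind; harmless here, since a new StringIO starts at position 0
df = pd.read_csv(truncated_input)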
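
As a rough sketch of the seek-backwards idea described above (this is not the code from the linked answer): open the file in binary mode, jump to the end, and step back byte by byte to the newline that precedes the last line. The names `last_line` and `first_and_last_rows` are hypothetical. The sketch assumes a header row, at least two data rows, and no quoted fields that contain embedded newlines.

```python
import io
import os

import pandas as pd


def last_line(f):
    """Return the last non-empty line of a binary file object as bytes."""
    f.seek(0, os.SEEK_END)
    end = f.tell()
    # Ignore newline characters at the very end of the file.
    while end > 0:
        f.seek(end - 1)
        if f.read(1) not in (b"\n", b"\r"):
            break
        end -= 1
    # Step backwards until the newline that precedes the last line.
    start = end
    while start > 0:
        f.seek(start - 1)
        if f.read(1) == b"\n":
            break
        start -= 1
    f.seek(start)
    return f.read(end - start)


def first_and_last_rows(path, encoding="utf-8"):
    """Build a DataFrame from the header, first data row, and last data row."""
    with open(path, "rb") as f:
        header = f.readline()
        first = f.readline()
        last = last_line(f)
    text = b"".join([header, first, last]).decode(encoding)
    return pd.read_csv(io.StringIO(text))
```

Because only the start and the end of the file are touched, the cost does not grow with the number of rows in between.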

Answered by Stefan Manole

This is the best solution I found:

import pandas as pd

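# count the lines without holding the whole file in memory
with open(filename) as f:
    count = sum(1 for _ in f)

# keep line 0 (header), line 1 (first row), and line count-1 (last row)
df = pd.read_csv(filename, skiprows=range(2, count - 1), header=0)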