Python pandas DataFrame from first and last row of csv

Question

Asked by wrcobb

All -

I am looking to create a pandas DataFrame from only the first and last lines of a very large csv. The purpose of this exercise is to be able to easily grab some attributes from the first and last entries in these csv files. I have no problem grabbing the first line of the csv using:

pd.read_csv(filename, nrows=1)

I also have no problem grabbing the last row of a text file in various ways, such as:

with open(filename) as f:
    last_line = f.readlines()[-1]

However, getting these two things into a single DataFrame has thrown me for a loop. Any insight into how best to achieve this goal?

EDIT NOTE: I am trying to achieve this task without loading all of the data into a single DataFrame first as I am dealing with pretty large (>15MM rows) csv files.

Thanks!

Answer 1

Answered by Jerome Montino

Just use head, tail, and concat. You can even adjust the number of rows.

import pandas as pd

df = pd.read_csv("flu.csv")
top = df.head(1)
bottom = df.tail(1)
concatenated = pd.concat([top,bottom])

print(concatenated)

Result:

           Date  Cases
0      9/1/2014     45
121  12/31/2014     97

Adjusting head and tail to take in 5 rows from the top and 10 from the bottom...

           Date  Cases
0      9/1/2014     45
1      9/2/2014    104
2      9/3/2014     47
3      9/4/2014    108
4      9/5/2014     49
112  12/22/2014     30
113  12/23/2014     81
114  12/24/2014     99
115  12/25/2014     85
116  12/26/2014     55
117  12/27/2014     91
118  12/28/2014     68
119  12/29/2014    109
120  12/30/2014     55
121  12/31/2014     97
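
The adjusted call itself is not shown above. A minimal sketch of what it might look like, assuming a hypothetical stand-in DataFrame with 122 rows in place of the real flu.csv:

```python
import pandas as pd

# Hypothetical stand-in for flu.csv: 122 rows with a default integer index.
df = pd.DataFrame({"Cases": range(122)})

# Take the first 5 rows and the last 10 rows, then stack them in file order.
concatenated = pd.concat([df.head(5), df.tail(10)])

print(concatenated)
```

The original index labels are kept, which is why the output jumps from 4 to 112.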

One possible approach that can be used if you don't want to load the whole CSV file as a dataframe is to process them as CSVs alone. The following code is similar to your approach.

import pandas as pd
import csv

# Read only the header and the first data row.
top = pd.read_csv("flu.csv", nrows=1)
headers = top.columns.values

# Write the header plus the last line of the source file to a small second CSV.
with open("flu.csv", "r") as f, open("flu2.csv", "w", newline="") as g:
    last_line = f.readlines()[-1].strip().split(",")
    c = csv.writer(g)
    c.writerow(headers)
    c.writerow(last_line)

bottom = pd.read_csv("flu2.csv")
concatenated = pd.concat([top, bottom])
concatenated.reset_index(inplace=True, drop=True)

print(concatenated)

The result is the same, except for the index. I tested it against a million rows and it was processed in about a second.

        Date  Cases
0   9/1/2014     45
1  7/25/4885     99
[Finished in 0.9s]

~~How it scales versus 15 million rows, maybe that's your ballgame now.~~ So I decided to test it against exactly 15,728,626 rows and the results seem good enough.

        Date  Cases
0   9/1/2014     45
1  7/25/4885     99
[Finished in 3.3s]

Answer 2

Answered by JD Long

So the way to do this without first reading the whole file into Python is to grab the first line, then iterate through the file to the last line. Then use StringIO to suck them into Pandas. Maybe something like this:

import pandas as pd
from io import StringIO

with open('tst.csv') as f:
    first_line = f.readline()
    last_line = first_line  # fallback if the file has only one line
    for line in f:
        last_line = line  # iterate to the end, keeping the most recent line

# DataFrame.append was removed in pandas 2.0, so combine the pieces with pd.concat
mydf = pd.concat([
    pd.read_csv(StringIO(first_line), header=None),
    pd.read_csv(StringIO(last_line), header=None),
])
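
Because both lines are parsed with header=None, the column names are lost and the file's first line (often the header) is treated as data. A rough sketch of a variant that keeps the header, assuming the first line is a header row and no quoted field contains a newline; `first_and_last_rows` is a hypothetical helper name, not something from pandas:

```python
import io

import pandas as pd


def first_and_last_rows(path):
    """Return the first and last data rows of a CSV as a DataFrame.

    Assumes line 1 is a header and no quoted field spans several lines.
    """
    with open(path) as f:
        header = f.readline()
        first = f.readline()
        last = first
        for line in f:        # stream through the file, one line at a time
            if line.strip():  # ignore trailing blank lines
                last = line
    # With a single data row, `last` is still the same object as `first`.
    lines = [header, first] if last is first else [header, first, last]
    return pd.read_csv(io.StringIO("".join(lines)))
```

Only one line is held in memory at a time, and pandas parses just two or three lines at the end.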

Answer 3

Answered by allen-smithee

You want this answer: https://stackoverflow.com/a/18603065/4226476. It is not the accepted answer, but it is the best one because it seeks backwards for the first newline instead of guessing.

Then wrap the two lines in a StringIO:

from io import StringIO  # cStringIO exists only in Python 2
import pandas as pd

# grab the lines as per first-and-last-line question
truncated_input = StringIO(the_two_lines)
truncated_input.seek(0) # need to rewind
df = pd.read_csv(truncated_input)
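
The snippet above leaves the line-grabbing step out. A possible sketch of the whole flow, seeking backwards from the end of the file in fixed-size chunks; this is not the code from the linked answer. The helper names `read_last_line` and `first_and_last`, the 4096-byte chunk size, and UTF-8 encoding are all assumptions made for illustration, and the file is assumed to have a header plus at least two data rows:

```python
import io
import os

import pandas as pd


def read_last_line(path, chunk_size=4096):
    """Return the last non-empty line of a file as bytes, reading from the end."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        position = f.tell()
        tail = b""
        while position > 0:
            step = min(chunk_size, position)
            position -= step
            f.seek(position)
            tail = f.read(step) + tail
            # Stop once a newline precedes the final line in the buffer.
            if b"\n" in tail.rstrip(b"\r\n"):
                break
    stripped = tail.rstrip(b"\r\n")
    return stripped[stripped.rfind(b"\n") + 1:]


def first_and_last(path):
    """Build a DataFrame from the header, first data row, and last data row."""
    with open(path, "rb") as f:
        header = f.readline()
        first = f.readline()
    the_two_lines = (header + first + read_last_line(path)).decode("utf-8")
    return pd.read_csv(io.StringIO(the_two_lines))
```

Only the start of the file and a few chunks from its end are read, so the cost should not grow with the number of rows in between.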

Answer 4

Answered by Stefan Manole

This is the best solution I found:

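import pandas as pd

# Count the lines without holding the whole file in memory, and close the file afterwards.
with open(filename) as f:
    count = sum(1 for _ in f)

# Keep line 0 (header), line 1 (first row) and line count-1 (last row); skip everything between.
df = pd.read_csv(filename, skiprows=range(2, count - 1), header=0)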
