Python: Creating an empty Pandas DataFrame, then filling it?

Notice: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/13784192/

Time: 2020-08-18 09:36:11  Source: igfitidea

Creating an empty Pandas DataFrame, then filling it?

Tags: python, dataframe, pandas

Asked by Matthias Kauer

I'm starting from the pandas DataFrame docs here: http://pandas.pydata.org/pandas-docs/stable/dsintro.html

I'd like to iteratively fill the DataFrame with values in a time series kind of calculation. So basically, I'd like to initialize the DataFrame with columns A, B and timestamp rows, all 0 or all NaN.

I'd then add initial values and go over this data calculating the new row from the row before, say row[A][t] = row[A][t-1]+1 or so.

I'm currently using the code as below, but I feel it's kind of ugly and there must be a way to do this with a DataFrame directly, or just a better way in general. Note: I'm using Python 2.7.

import datetime as dt
import pandas as pd
import numpy as np

if __name__ == '__main__':
    base = dt.datetime.today().date()
    dates = [ base - dt.timedelta(days=x) for x in range(0,10) ]
    dates.sort()

    valdict = {}
    symbols = ['A','B', 'C']
    for symb in symbols:
        valdict[symb] = pd.Series( np.zeros( len(dates)), dates )

    for thedate in dates:
        if thedate > dates[0]:
            for symb in valdict:
                valdict[symb][thedate] = 1+valdict[symb][thedate - dt.timedelta(days=1)]

    print(valdict)

Accepted answer by Andy Hayden

Here's a couple of suggestions:

Use date_range for the index:

import datetime
import pandas as pd
import numpy as np

todays_date = datetime.datetime.now().date()
index = pd.date_range(todays_date-datetime.timedelta(10), periods=10, freq='D')

columns = ['A','B', 'C']

Note: we could create an empty DataFrame (with NaNs) simply by writing:

df_ = pd.DataFrame(index=index, columns=columns)
df_ = df_.fillna(0) # with 0s rather than NaNs

To do these types of calculations for the data, use a numpy array:

data = np.array([np.arange(10)]*3).T

Hence we can create the DataFrame:

In [10]: df = pd.DataFrame(data, index=index, columns=columns)

In [11]: df
Out[11]: 
            A  B  C
2012-11-29  0  0  0
2012-11-30  1  1  1
2012-12-01  2  2  2
2012-12-02  3  3  3
2012-12-03  4  4  4
2012-12-04  5  5  5
2012-12-05  6  6  6
2012-12-06  7  7  7
2012-12-07  8  8  8
2012-12-08  9  9  9
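
For the recurrence described in the question (each row computed from the row before it), one option is to run the loop over a plain NumPy array and wrap the result in a DataFrame once at the end. This is only a minimal sketch: the fixed start date and the starting value of 0 are assumptions made here for illustration.

```python
import numpy as np
import pandas as pd

# Assumed for illustration: a fixed start date instead of "today".
index = pd.date_range("2012-11-29", periods=10, freq="D")
columns = ["A", "B", "C"]

# Allocate the full array once, then fill it row by row.
values = np.zeros((len(index), len(columns)))
for t in range(1, len(index)):
    values[t] = values[t - 1] + 1  # row[t] = row[t-1] + 1

# Build the DataFrame a single time from the finished array.
df = pd.DataFrame(values, index=index, columns=columns)
print(df.tail(3))
```

The loop touches only the array, so the DataFrame is constructed exactly once.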

Answer by geekidharsh

If you simply want to create an empty data frame and fill it with some incoming data frames later, try this:

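newDF = pd.DataFrame()  # creates a new dataframe that's empty
newDF = newDF.append(oldDF, ignore_index=True)  # ignoring index is optional
# Note: DataFrame.append was removed in pandas 2.0; on newer versions use
# newDF = pd.concat([newDF, oldDF], ignore_index=True)
# try printing some data from newDF
print(newDF.head())  # again optional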

In this example I am using this pandas doc to create a new data frame and then using append to write to the newDF with data from oldDF.

If I have to keep appending new data into this newDF from more than one oldDF, I just use a for loop to iterate over pandas.DataFrame.append().
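
When several frames arrive one after another, an alternative to appending inside the loop is to collect them in a list and concatenate once. The sketch below uses pd.concat rather than DataFrame.append, and the two small frames are hypothetical stand-ins for the incoming oldDF objects.

```python
import pandas as pd

# Hypothetical stand-ins for the incoming "oldDF" frames.
old_frames = [
    pd.DataFrame({"A": [1, 2], "B": [3.0, 4.0]}),
    pd.DataFrame({"A": [5], "B": [6.0]}),
]

collected = []
for old_df in old_frames:
    collected.append(old_df)  # list append only stores a reference

# A single concatenation at the end builds the combined frame.
new_df = pd.concat(collected, ignore_index=True)
print(new_df)
```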

Answer by Afshin Amiri

Initialize empty frame with column names

import pandas as pd

col_names =  ['A', 'B', 'C']
my_df  = pd.DataFrame(columns = col_names)
my_df

Add a new record to a frame

my_df.loc[len(my_df)] = [2, 4, 5]

You also might want to pass a dictionary:

my_dic = {'A':2, 'B':4, 'C':5}
my_df.loc[len(my_df)] = my_dic 

Append another frame to your existing frame

col_names =  ['A', 'B', 'C']
my_df2  = pd.DataFrame(columns = col_names)
my_df = my_df.append(my_df2)  # pandas < 2.0; on 2.0+ use pd.concat([my_df, my_df2])


Performance considerations

If you are adding rows inside a loop, consider performance issues. For around the first 1000 records, "my_df.loc" performs better, but it gradually becomes slower as the number of records in the loop increases.

If you plan to do this inside a big loop (say 10M records or so), you are better off using a mixture of these two: fill a temporary dataframe with loc until its size gets to around 1000, then append it to the original dataframe, and empty the temporary dataframe. This would boost your performance by around 10 times.
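
The batching idea can be sketched as follows. Note that this is a variation, not the exact method described above: it buffers rows in a plain Python list instead of a temporary dataframe, and it joins the chunks with pd.concat. The generate_rows function and the total of 2500 rows are hypothetical and exist only to make the example runnable.

```python
import pandas as pd

def generate_rows(n):
    """Hypothetical data source yielding (A, B, C) tuples."""
    for i in range(n):
        yield i, i * 2, i * 3

CHUNK_SIZE = 1000  # buffer size of roughly 1000 rows, as suggested above
chunks = []        # finished DataFrame chunks
buffer = []        # rows waiting to be turned into a chunk

for row in generate_rows(2500):
    buffer.append(row)
    if len(buffer) == CHUNK_SIZE:
        chunks.append(pd.DataFrame(buffer, columns=["A", "B", "C"]))
        buffer = []  # empty the temporary buffer

if buffer:  # flush whatever is left after the loop
    chunks.append(pd.DataFrame(buffer, columns=["A", "B", "C"]))

my_df = pd.concat(chunks, ignore_index=True)
print(my_df.shape)
```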

Answer by Ajay Ohri

Assume a dataframe with 19 rows

index=range(0,19)
index

columns=['A']
test = pd.DataFrame(index=index, columns=columns)

Keeping Column A as a constant

test['A']=10

Keeping column b as a variable given by a loop

for x in range(0,19):
    test.loc[[x], 'b'] = pd.Series([x], index = [x])

You can replace the first x in pd.Series([x], index = [x]) with any value.
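
If the values for column b are all known up front, the same result can usually be reached without a loop by assigning the whole column at once. A minimal sketch, reusing the names from the example above:

```python
import pandas as pd

test = pd.DataFrame(index=range(0, 19), columns=["A"])
test["A"] = 10                   # constant column
test["b"] = list(range(0, 19))   # one value per row, assigned in a single step
print(test.head())
```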

Answer by cs95

The Right Way™ to Create a DataFrame

TLDR; (just read the bold text)

Most answers here will tell you how to create an empty DataFrame and fill it out, but no one will tell you that it is a bad thing to do.

Here is my advice: Wait until you are sure you have all the data you need to work with. Use a list to collect your data, then initialise a DataFrame when you are ready.

data = []
for a, b, c in some_function_that_yields_data():
    data.append([a, b, c])

df = pd.DataFrame(data, columns=['A', 'B', 'C'])

It is always cheaper to append to a list and create a DataFrame in one go than it is to create an empty DataFrame (or one of NaNs) and append to it over and over again. Lists also take up less memory and are a much lighter data structure to work with, append to, and remove from (if needed).

The other advantage of this method is that dtypes are automatically inferred (rather than assigning object to all of them).

The last advantage is that a RangeIndex is automatically created for your data, so it is one less thing to worry about (take a look at the poor append and loc methods below; you will see elements in both that require handling the index appropriately).
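
The dtype inference and the automatic RangeIndex can be checked with a small, self-contained run of the list approach. The generator below is a hypothetical stand-in for some_function_that_yields_data, with made-up values.

```python
import pandas as pd

def some_function_that_yields_data():
    """Hypothetical generator standing in for a real data source."""
    yield 1, 12.3, "xyz"
    yield 2, 45.6, "abc"

data = []
for a, b, c in some_function_that_yields_data():
    data.append([a, b, c])

df = pd.DataFrame(data, columns=["A", "B", "C"])
print(df.dtypes)  # numeric columns get numeric dtypes, not object
print(df.index)   # a RangeIndex is created automatically
```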



Things you should NOT do

append or concat inside a loop

Here is the biggest mistake I've seen from beginners:

df = pd.DataFrame(columns=['A', 'B', 'C'])
for a, b, c in some_function_that_yields_data():
    df = df.append({'A': a, 'B': b, 'C': c}, ignore_index=True) # yuck
    # or similarly,
    # df = pd.concat([df, pd.DataFrame([{'A': a, 'B': b, 'C': c}])], ignore_index=True)

Memory is re-allocated for every append or concat operation you have. Couple this with a loop and you have a quadratic complexity operation. From the df.append doc page:

Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once.

The other mistake associated with df.append is that users tend to forget append is not an in-place function, so the result must be assigned back. You also have to worry about the dtypes:

df = pd.DataFrame(columns=['A', 'B', 'C'])
df = df.append({'A': 1, 'B': 12.3, 'C': 'xyz'}, ignore_index=True)

df.dtypes
A     object   # yuck!
B    float64
C     object
dtype: object

Dealing with object columns is never a good thing, because pandas cannot vectorize operations on those columns. You will need to do this to fix it:

df.infer_objects().dtypes
A      int64
B    float64
C     object
dtype: object
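
The effect of infer_objects can be reproduced without append. In this sketch the object columns are created deliberately by passing dtype=object, which is an assumption made here to mimic the result of appending to an empty frame.

```python
import pandas as pd

# Force object columns on purpose to mimic the situation described above.
df = pd.DataFrame({"A": [1, 2], "B": [12.3, 4.5]}, dtype=object)
print(df.dtypes)     # both columns are object

# infer_objects() returns a new frame; df itself is left unchanged.
fixed = df.infer_objects()
print(fixed.dtypes)  # numeric dtypes are recovered
```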

loc inside a loop

I have also seen loc used to append to a DataFrame that was created empty:

df = pd.DataFrame(columns=['A', 'B', 'C'])
for a, b, c in some_function_that_yields_data():
    df.loc[len(df)] = [a, b, c]

As before, you have not pre-allocated the amount of memory you need each time, so the memory is re-grown each time you create a new row. It's just as bad as append, and even more ugly.

Empty DataFrame of NaNs

And then, there's creating a DataFrame of NaNs, and all the caveats associated therewith.

df = pd.DataFrame(columns=['A', 'B', 'C'], index=range(5))
df
     A    B    C
0  NaN  NaN  NaN
1  NaN  NaN  NaN
2  NaN  NaN  NaN
3  NaN  NaN  NaN
4  NaN  NaN  NaN

It creates a DataFrame of object columns, like the others.

df.dtypes
A    object  # you DON'T want this
B    object
C    object
dtype: object

Appending still has all the same issues as the methods above.

for i, (a, b, c) in enumerate(some_function_that_yields_data()):
    df.iloc[i] = [a, b, c]
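
If the number of rows really is known in advance and the data is numeric, one way to avoid the object columns is to pre-allocate a NumPy array of NaNs, fill that, and build the DataFrame at the end. This is a sketch under those assumptions; the generator and n_rows are hypothetical.

```python
import numpy as np
import pandas as pd

def some_function_that_yields_data():
    """Hypothetical generator; assumed to yield exactly n_rows numeric rows."""
    for i in range(5):
        yield i, i + 0.5, i * 10

n_rows = 5  # assumed to be known in advance
values = np.full((n_rows, 3), np.nan)  # float64 buffer, not object

for i, row in enumerate(some_function_that_yields_data()):
    values[i] = row  # fill one pre-allocated row

df = pd.DataFrame(values, columns=["A", "B", "C"])
print(df.dtypes)
```

All three columns come out as float64 here because the buffer is a single float array; mixed column types would need a different container.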


The Proof is in the Pudding

Timing these methods is the fastest way to see just how much they differ in terms of their memory and utility.

Benchmarking code for reference.

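
The original benchmarking code is not reproduced on this page. As a rough sketch of how such a comparison could be set up, the snippet below times the list approach against loc inside a loop using timeit. The data source, the row count of 500, and the function names are all assumptions made for illustration, and actual timings will vary by machine and pandas version.

```python
import timeit
import pandas as pd

def some_function_that_yields_data(n=500):
    """Hypothetical data source used only for this comparison."""
    for i in range(n):
        yield i, i * 0.5, str(i)

def via_list():
    # Collect rows in a list, then build the DataFrame once.
    data = [[a, b, c] for a, b, c in some_function_that_yields_data()]
    return pd.DataFrame(data, columns=["A", "B", "C"])

def via_loc():
    # Grow an initially empty DataFrame one row at a time.
    df = pd.DataFrame(columns=["A", "B", "C"])
    for a, b, c in some_function_that_yields_data():
        df.loc[len(df)] = [a, b, c]
    return df

t_list = timeit.timeit(via_list, number=3)
t_loc = timeit.timeit(via_loc, number=3)
print(f"list then DataFrame: {t_list:.4f}s, loc in loop: {t_loc:.4f}s")
```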