How to specify the number of rows a pandas dataframe will have?

Disclaimer: This page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you reuse it, you must follow the same CC BY-SA license and attribute it to the original authors (not this site). Original question: http://stackoverflow.com/questions/29609118/

How to specify the number of rows a pandas dataframe will have?

python pandas dataframe data-analysis real-time-data

Asked by ps.george

I have a Pandas dataframe and I am continually appending a row of data each second as below.

df.loc[time.strftime("%Y-%m-%d %H:%M:%S")] = [reading1, reading2, reading3]
>>>df
                     sensor1 sensor2 sensor3
2015-04-14 08:50:23    5.4     5.6     5.7
2015-04-14 08:50:24    5.5     5.6     5.8
2015-04-14 08:50:26    5.2     5.3     5.4

If I continue this, eventually I am going to start experiencing memory issues (Each time it will call the whole DataFrame).

I only need to keep X rows of the data. i.e. after the operation, it will be:

>>>df
                     sensor1 sensor2 sensor3
(this row is gone)
2015-04-14 08:50:24    5.5     5.6     5.8
2015-04-14 08:50:26    5.2     5.3     5.4
2015-04-14 08:50:27    5.2     5.4     5.6

Is there a way I can specify a maximum number of rows, so that when any subsequent rows are added, the oldest row is deleted at the same time WITHOUT a "Check length of DataFrame, If length of DataFrame > X, Remove first row, Append new row"?

Like this, but for a Pandas DataFrame: https://stackoverflow.com/a/10155753/4783578

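For context, the explicit truncate-then-append the question wants to avoid would look roughly like this (a minimal sketch; X and append_reading are just illustrative names):

import time

import pandas as pd

X = 1000  # maximum number of rows to keep

df = pd.DataFrame(columns=['sensor1', 'sensor2', 'sensor3'])

def append_reading(df, reading1, reading2, reading3):
    # If the frame is already full, drop the oldest row before appending
    if len(df) >= X:
        df = df.iloc[1:].copy()
    df.loc[time.strftime("%Y-%m-%d %H:%M:%S")] = [reading1, reading2, reading3]
    return df

df = append_reading(df, 5.4, 5.6, 5.7)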

Answered by S Anand

One way would be to pre-allocate the rows, and replace the values cyclically.

import numpy as np
import pandas as pd

# Say we want to limit the data to a thousand rows
N = 1000

# Create the DataFrame with N rows and 5 columns -- all NaNs
data = pd.DataFrame(np.full((N, 5), np.nan))

# To check how many rows are actually filled, we'll need to .dropna().
len(data.dropna())              # Returns 0

# Keep a running counter of the next index to insert into
counter = 0

# Insertion always happens at that counter
data.loc[counter, :] = np.random.rand(5)

# ... and increment the counter, wrapping back to 0 when it reaches N
counter = (counter + 1) % N

# Now, the DataFrame contains one row
len(data.dropna())              # Returns 1

# We can add several rows one after another. Let's add twice as many as N
for row in np.random.rand(2 * N, 5):
    data.loc[counter, :] = row
    counter = (counter + 1) % N

# Now that we added them, we still have only the last N rows
len(data)                       # Returns N

This avoids the need to modify the data in any way, and would be a fast approach to appending the data. However, reading from the data can be slower if:

  • The order of the data matters. If you need the data in its original order, you need to slice data using counter to extract that order (see the sketch after this list).
  • The number of rows is small. If you end up appending fewer rows than N, you'll need .dropna() (or to count the total inserted rows) to remove the unused ones.
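
For example, here is a minimal sketch of recovering the insertion order from the circular buffer above, assuming the buffer has already wrapped at least once so that every slot is filled (the helper name in_original_order is just illustrative):

import numpy as np
import pandas as pd

def in_original_order(data, counter):
    # `counter` points at the next slot to overwrite, which -- once the
    # buffer has wrapped -- is also the oldest row still present.
    ordered = pd.concat([data.iloc[counter:], data.iloc[:counter]])
    return ordered.reset_index(drop=True)

# Tiny demonstration with N = 3: insert 4 values, so the buffer wraps once
N = 3
data = pd.DataFrame(np.full((N, 1), np.nan))
counter = 0
for value in [1.0, 2.0, 3.0, 4.0]:
    data.loc[counter, :] = value
    counter = (counter + 1) % N

print(in_original_order(data, counter))   # rows 2.0, 3.0, 4.0 -- oldest first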

In most of the scenarios I deal with where truncated-append performance matters, neither of the above is true, but your scenario may be different. In that case, @Alexander has a good solution involving .shift().

Answered by TheBlackCat

pandas stores data in arrays. Doing the sort of operation you want inherently requires a copy for an array data structure. Since data is stored in contiguous (or strided) memory, adding something to the end and removing something from the beginning requires copying everything to a new region of memory. There is no way around this. You need to use a different data structure.

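A quick way to see that copy happening is to check whether the rolled result still shares memory with the original frame (a minimal sketch; the column names are just placeholders):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(1000, 3),
                  columns=['sensor1', 'sensor2', 'sensor3'])
new_row = pd.DataFrame([[5.1, 5.2, 5.3]], columns=df.columns)

# Drop the oldest row and append the newest one
rolled = pd.concat([df.iloc[1:], new_row], ignore_index=True)

# The result lives in a freshly allocated buffer -- the old data was copied
print(np.shares_memory(df.values, rolled.values))   # False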

Edit: Thinking about this a bit more, I see two approaches to do this.

The simplest and most straightforward would be to use a collections.deque of tuples. You can just append a new tuple to the end, and if it gets too full it will drop the corresponding item from the beginning. At the end, you can just convert them into a DataFrame. I am just using the for loop as an example; I gather you get your data in a different way. It wouldn't matter:

import time
from collections import deque

import pandas as pd

maxlen = 1000

dq = deque(maxlen=maxlen)

# `readings` stands in for however you actually receive your data
for reading1, reading2, reading3 in readings:
    dq.append(pd.Series([reading1, reading2, reading3],
                        index=['sensor1', 'sensor2', 'sensor3'],
                        name=time.strftime("%Y-%m-%d %H:%M:%S")))

df = pd.concat(dq, axis=1).T

The second approach is to use a DataFrame of a fixed size, and use the modulo of the maximum length to choose the place to overwrite, while also keeping the item number in the DataFrame. Then you can sort by item number. In your case, you could conceivably sort by time, but this approach is more general. As with the previous example, I will use a for loop to demonstrate, but you probably don't have one. Further, I will also assume that you don't have a real iterable you can enumerate; if you do, then you don't have to keep track of the index number as I do here:

import time

import numpy as np
import pandas as pd

maxlen = 1000

df = pd.DataFrame(np.full((maxlen, 5), np.nan),
                  columns=['index', 'time',
                           'sensor1', 'sensor2', 'sensor3'])

i = 0
# `readings` stands in for however you actually receive your data
for reading1, reading2, reading3 in readings:
    df.loc[i % maxlen, :] = [i, time.strftime("%Y-%m-%d %H:%M:%S"),
                             reading1, reading2, reading3]
    i += 1

df.sort_values('index', inplace=True)
del df['index']
df.set_index('time', drop=True, inplace=True)

Answered by Alexander

This example initializes a DataFrame equal to the max size and fills it with Nones. It then iterates over a list of new rows, first shifting the original DataFrame and then appending the new row to the end. You didn't specify how you wanted to treat the index, so I ignored it.

import numpy as np
import pandas as pd

max_rows = 5
cols = list('AB')

# Initialize empty DataFrame
df = pd.DataFrame({c: np.repeat([None], [max_rows]) for c in cols})

new_rows = [pd.DataFrame({'A': [1], 'B': [10]}), 
            pd.DataFrame({'A': [2], 'B': [11]}),
            pd.DataFrame({'A': [3], 'B': [12]}),
            pd.DataFrame({'A': [4], 'B': [13]}),
            pd.DataFrame({'A': [5], 'B': [14]}),
            pd.DataFrame({'A': [6], 'B': [15]}),
            pd.DataFrame({'A': [7], 'B': [16]})]

for row in new_rows:
    df = df.shift(-1)
    df.iloc[-1, :] = row.values[0]   # each row is a one-row DataFrame, so take its single row of values

>>> df
   A   B
0  3  12
1  4  13
2  5  14
3  6  15
4  7  16

Let's use a real example with one year of stock prices for AAPL.

from datetime import timedelta

from pandas_datareader.data import DataReader  # requires the pandas-datareader package

aapl = DataReader("AAPL", data_source="yahoo", start="2014-1-1", end="2015-1-1")
cols = aapl.columns
df = pd.DataFrame({c: np.repeat([None], [max_rows]) for c in aapl.columns})[cols]
# Initialize a datetime index ending the day before the first quote
df.index = pd.date_range(end=aapl.index[0] + timedelta(days=-1), periods=max_rows, freq='D')

for timestamp, row in aapl.iterrows():
    df = df.shift(-1)
    df.iloc[-1, :] = row.values
    idx = df.index[:-1].tolist()
    idx.append(timestamp)
    df.index = idx

>>> df
              Open    High     Low   Close       Volume Adj Close
2013-12-28  112.58  112.71  112.01  112.01  1.44796e+07    111.57
2013-12-29   112.1  114.52  112.01  113.99   3.3721e+07    113.54
2013-12-30  113.79  114.77   113.7  113.91  2.75989e+07    113.46
2013-12-31  113.64  113.92  112.11  112.52  2.98815e+07    112.08
2014-12-31  112.82  113.13  110.21  110.38  4.14034e+07    109.95