Pandas dataframe resample at every nth row
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/14590638/
Asked by nom-mon-ir
I have a script that reads system log files into pandas dataframes and produces charts from them. The charts are fine for small data sets, but when I face larger data sets, due to the longer timeframe of data gathering, the charts become too crowded to read.
I am planning to resample the dataframe so that if the dataset exceeds a certain size, it is reduced to at most SIZE_LIMIT rows. This means every n = actual_size/SIZE_LIMIT rows of the original dataframe should be aggregated into a single row of the new dataframe. The aggregation can be either the average value or simply the nth row taken as is.
I am not fully versed in pandas, so I may have missed some obvious means.
Answered by heltonbiker
Actually, I think you should not modify the data itself, but rather take a view of the data at the desired interval to plot. This view contains the actual datapoints to be plotted.
A naive approach, for a computer screen for example, is to calculate how many points are in your interval and how many pixels you have available. Thus, to plot a dataframe with 10000 points in a window 1000 pixels wide, you take a slice with a step of 10, using this syntax (whole_data is a 1D array just for the example):
data_to_plot = whole_data[::10]
This might have undesired effects, specifically masking short peaks that might slip through the slicing operation unseen. An alternative is to split your data into bins and then calculate one datapoint (the maximum value, for example) for each bin. These operations should be fast thanks to numpy/pandas efficient array operations.
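A minimal sketch of the bin-max idea versus plain slicing; the variable name whole_data follows the answer, while the data and the spike position are invented for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical data: 10000 noisy points with one short spike between slice points
rng = np.random.default_rng(0)
whole_data = pd.Series(rng.normal(size=10_000))
whole_data.iloc[5003] = 50.0  # a short peak that plain slicing can miss

# Plain slicing with a step of 10 keeps rows 0, 10, 20, ... and skips row 5003
sliced = whole_data[::10]

# Bin-max: group every 10 consecutive rows and keep each bin's maximum,
# so the peak survives in the downsampled view
bin_max = whole_data.groupby(whole_data.index // 10).max()
```

Both results have 1000 points, but only bin_max still contains the spike.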
Hope this helps!
Answered by Zelazny7
You could use the pandas.qcut method on the index to divide it into equal quantiles. The number of quantiles you pass to qcut would be SIZE_LIMIT, so that each bin covers roughly actual_size/SIZE_LIMIT rows.
In [1]: from pandas import *
In [2]: df = DataFrame({'a':range(10000)})
In [3]: df.head()
Out[3]:
   a
0  0
1  1
2  2
3  3
4  4
Here, grouping the index by qcut(df.index,5) results in 5 equally sized bins. I then take the mean of each group.
In [4]: df.groupby(qcut(df.index,5)).mean()
Out[4]:
                       a
[0, 1999.8]        999.5
(1999.8, 3999.6]  2999.5
(3999.6, 5999.4]  4999.5
(5999.4, 7999.2]  6999.5
(7999.2, 9999]    8999.5
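For the question's use case the bin count would come from SIZE_LIMIT rather than being hard-coded; a minimal sketch under that assumption (SIZE_LIMIT and the size check are taken from the question; observed=True merely silences the categorical-grouping warning in recent pandas):

```python
import pandas as pd

SIZE_LIMIT = 5  # assumed target row count from the question
df = pd.DataFrame({'a': range(10000)})

if len(df) > SIZE_LIMIT:
    # qcut with q=SIZE_LIMIT splits the index into SIZE_LIMIT equal-sized bins,
    # so each bin holds roughly len(df) / SIZE_LIMIT consecutive rows
    bins = pd.qcut(df.index, SIZE_LIMIT)
    # Averaging each bin collapses the frame to SIZE_LIMIT rows
    reduced = df.groupby(bins, observed=True).mean()
```

For the example dataframe this reproduces the five mean values shown above.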

