原文地址: http://stackoverflow.com/questions/48704526/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
Split pandas dataframe into chunks of N
Asked by Henrik Poulsen
I'm currently trying to split a pandas dataframe into an unknown number of chunks, each containing N rows.
I have tried numpy.array_split(), but that function splits the dataframe into N chunks containing an unknown number of rows.
Is there a clever way to split a pandas dataframe into multiple dataframes, each containing a specific number of rows from the parent dataframe?
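To make the difference concrete, here is a small sketch that uses a plain NumPy array as a stand-in for a 10-row dataframe (the array and the variable names are illustrative only). It contrasts "3 chunks" with "chunks of 3 rows":

```python
import numpy as np

rows = np.arange(10)  # stand-in for a 10-row dataframe

# np.array_split(rows, 3) means "3 chunks", so the chunk sizes are derived.
three_chunks = np.array_split(rows, 3)
sizes_fixed_count = [len(c) for c in three_chunks]

# Slicing in steps of 3 means "chunks of 3 rows", so the chunk count is derived.
chunks_of_three = [rows[i:i + 3] for i in range(0, len(rows), 3)]
sizes_fixed_length = [len(c) for c in chunks_of_three]

print(sizes_fixed_count)   # [4, 3, 3]
print(sizes_fixed_length)  # [3, 3, 3, 1]
```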
Answered by James Schinner
You can try this:
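    def rolling(df, window, step):
        """Yield (offset, chunk) pairs; the last chunk may have fewer than `window` rows."""
        count = 0
        df_length = len(df)
        # Stop once the offset passes the end, so trailing rows are not dropped.
        while count < df_length:
            yield count, df[count:window + count]
            count += step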
Usage:
    for offset, window in rolling(df, 100, 100):
        # offset: the current offset index
        # window: the current chunk
        # first 100: how many rows in each chunk
        # second 100: how many rows to step at a time
        # your code here
        pass
There is also this simpler idea:
    def chunk(seq, size):
        return (seq[pos:pos + size] for pos in range(0, len(seq), size))
Usage:
    for df_chunk in chunk(df, 100):
        # 100 is the chunk size
        # your code here
        pass
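As a quick check of the generator approach, the sketch below runs it on a 10-row toy dataframe. It assumes positional slicing through `.iloc` (a variation on the code above, chosen so that index labels do not affect the result); the data is made up for illustration.

```python
import pandas as pd

def chunk(seq, size):
    # Positional slicing via .iloc, so the index labels do not matter.
    return (seq.iloc[pos:pos + size] for pos in range(0, len(seq), size))

df = pd.DataFrame({"value": range(10)})  # toy data, 10 rows
pieces = list(chunk(df, 3))

print([len(p) for p in pieces])  # [3, 3, 3, 1]
# Concatenating the pieces gives back the original frame.
print(pd.concat(pieces).equals(df))  # True
```

The last piece is shorter whenever the row count is not a multiple of the chunk size.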
By the way, all of this can be found on SO with a search.
Answered by nnnmmm
You can calculate the number of splits from N:
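    import numpy as np

    splits = len(df.index) // N  # number of full chunks of N rows
    chunks = np.split(df.iloc[:splits*N], splits) if splits else []
    if len(df.index) % N:  # keep leftover rows, but never append an empty chunk
        chunks.append(df.iloc[splits*N:])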
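The same arithmetic can be sketched without `np.split`, by rounding the chunk count up and slicing with `.iloc`. The helper name `split_every` and the toy data are hypothetical, and this is one possible variant rather than the answer's own code:

```python
import math
import pandas as pd

def split_every(df, n):
    # Hypothetical helper: ceil(len / n) chunks, the last one possibly shorter.
    count = math.ceil(len(df) / n)
    return [df.iloc[i * n:(i + 1) * n] for i in range(count)]

df = pd.DataFrame({"value": range(10)})
chunks = split_every(df, 4)
print(len(chunks))               # 3
print([len(c) for c in chunks])  # [4, 4, 2]
```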
Answered by Romain Jouin
Calculate the split indices:
    size_of_chunks = 3
    # Positional boundaries; len(df) works for any index, not only 0..n-1.
    index_for_chunks = list(range(0, len(df), size_of_chunks))
    index_for_chunks.append(len(df))
Use them to split the df:
    dfs = {}
    for i in range(len(index_for_chunks)-1):
        dfs[i] = df.iloc[index_for_chunks[i]:index_for_chunks[i+1]]
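A worked example may help show what the boundary list looks like. This sketch assumes the boundaries are taken from `len(df)` so that it also works with a non-numeric index; the 7-row dataframe and the name `bounds` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"value": range(7)}, index=list("abcdefg"))  # non-numeric index
size_of_chunks = 3

# Chunk start positions, plus the end of the frame as the final boundary.
bounds = list(range(0, len(df), size_of_chunks)) + [len(df)]

# Pair each boundary with the next one to get (start, stop) slices.
dfs = {i: df.iloc[start:stop]
       for i, (start, stop) in enumerate(zip(bounds[:-1], bounds[1:]))}

print(bounds)                                     # [0, 3, 6, 7]
print({i: len(part) for i, part in dfs.items()})  # {0: 3, 1: 3, 2: 1}
```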