Python: split a pandas DataFrame in two if it has more than 10 rows

Note: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same license and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/25290757/

Date: 2020-08-18 20:00:55  Source: igfitidea

Split pandas dataframe in two if it has more than 10 rows

Tags: python, pandas, split, dataframe

Asked by Boosted_d16

I have a huge CSV with many tables with many rows. I would like to simply split each dataframe into 2 if it contains more than 10 rows.

If so, I would like the first dataframe to contain the first 10 rows and the second dataframe to contain the rest.

Is there a convenient function for this? I've looked around but found nothing useful...

i.e. split_dataframe(df, 2(if > 10))?

Accepted answer by ely

This will return the split DataFrames if the condition is met; otherwise it returns the original and None (which you would then need to handle separately). Note that this assumes the split only has to happen once per df, and that it is OK for the second part to be longer than 10 rows (which happens when the original was longer than 20 rows).

df_new1, df_new2 = (df.iloc[:10, :], df.iloc[10:, :]) if len(df) > 10 else (df, None)
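
The one-liner above can be wrapped in a small helper that is closer to what the question asked for. This is only a minimal sketch: the name `split_if_longer` and its `n` parameter are made up for illustration and are not part of pandas.

```python
import pandas as pd


def split_if_longer(df, n=10):
    """Return (first n rows, remaining rows), or (df, None) if df has at most n rows.

    Hypothetical helper for illustration; not a pandas function.
    """
    if len(df) > n:
        # .iloc slices by integer position, regardless of the index labels
        return df.iloc[:n], df.iloc[n:]
    return df, None


long_df = pd.DataFrame({"x": range(25)})
short_df = pd.DataFrame({"x": range(4)})

head, rest = split_if_longer(long_df)
print(len(head), len(rest))   # 10 15

same, nothing = split_if_longer(short_df)
print(len(same), nothing)     # 4 None
```

The caller still has to check for `None` before using the second value, exactly as with the one-liner.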

Note that you can also use df.head(10) and df.tail(len(df) - 10) to get the front and back according to your needs. You can also use various indexing approaches: you can provide just the row slice if you want, such as df.iloc[:10] or df[:10] instead of df.iloc[:10, :] (though I like to be explicit in code about the dimensions being taken). Writing df[:10, :] without .iloc raises an error on a DataFrame, and df.ix, which older code used for similar indexing, was deprecated and then removed in pandas 1.0.

Be careful about using df.loc, however, since it is label-based and the input will never be interpreted as an integer position. .loc would only work "accidentally" when you happen to have index labels that are integers starting at 0 with no gaps, and even then its slices include the end label, so df.loc[:10] returns 11 rows.
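
To make the difference concrete, here is a small sketch comparing the two indexers. The frames `default_df` and `shifted_df` are made-up examples, not data from the question.

```python
import pandas as pd

# Default index: labels 0, 1, 2, 3 happen to match the positions
default_df = pd.DataFrame({"x": [10, 20, 30, 40]})
print(len(default_df.iloc[:2]))   # 2 (positions 0 and 1, end excluded)
print(len(default_df.loc[:2]))    # 3 (labels 0, 1 and 2, end included)

# Same data, but the labels no longer match the positions
shifted_df = pd.DataFrame({"x": [10, 20, 30, 40]}, index=[100, 101, 102, 103])
print(len(shifted_df.iloc[:2]))   # 2 (still the first two rows)
print(len(shifted_df.loc[:2]))    # 0 (no label is <= 2)
```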

But you should also consider the various options that pandas provides for dumping the contents of the DataFrame into HTML and possibly also LaTeX to make better designed tables for the presentation (instead of just copying and pasting). Simply Googling how to convert the DataFrame to these formats turns up lots of tutorials and advice for exactly this application.
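
As a rough illustration of the HTML option, the sketch below uses `DataFrame.to_html` on a made-up three-row frame. `DataFrame.to_latex` is the LaTeX counterpart; depending on the pandas version it may need the optional jinja2 dependency, so it is left out here.

```python
import pandas as pd

# Small made-up table, for illustration only
df = pd.DataFrame({"A": [2, 4, 6], "B": [10, -10, 0]})

# With no buffer argument, to_html returns the table as an HTML string
html = df.to_html(index=False)
print(html)
```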

Answer by EdChum

There is no specific convenience function.

You'd have to do something like:

import pandas as pd

# Both stay empty when df has 10 rows or fewer
first_ten = pd.DataFrame()
rest = pd.DataFrame()

if df.shape[0] > 10: # len(df) > 10 would also work
    first_ten = df[:10]
    rest = df[10:]

Answer by Tom Walker

You can use the DataFrame head and tail methods as syntactic sugar instead of slicing/loc here. I use a split size of 3; for your example use headSize=10

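import numpy as np
import pandas as pd

def split(df, headSize):
    # First headSize rows
    hd = df.head(headSize)
    # Remaining rows; max() avoids a negative count, which tail() would
    # interpret as "all rows except the first |n|"
    tl = df.tail(max(len(df) - headSize, 0))
    return hd, tl

df = pd.DataFrame({    'A':[2,4,6,8,10,2,4,6,8,10],
                       'B':[10,-10,0,20,-10,10,-10,0,20,-10],
                       'C':[4,12,8,0,0,4,12,8,0,0],
                       'D':[9,10,0,1,3,np.nan,np.nan,np.nan,np.nan,np.nan]})

# Split dataframe into top 3 rows (first) and the rest (second)
first, second = split(df, 3)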

Answer by cheevahagadog

If you have a large data frame and need to divide it into a variable number of sub data frames, for example so that each sub data frame has at most 4500 rows, this script could help:

max_rows = 4500
dataframes = []
while len(df) > max_rows:
    top = df[:max_rows]
    dataframes.append(top)
    df = df[max_rows:]
else:
    dataframes.append(df)

You could then save out these data frames:

for i, frame in enumerate(dataframes):
    frame.to_csv(f'{i}.csv', index=False)

Hope this helps someone!

Answer by webelo

A method based on np.split:

import numpy as np
import pandas as pd

df = pd.DataFrame({    'A':[2,4,6,8,10,2,4,6,8,10],
                       'B':[10,-10,0,20,-10,10,-10,0,20,-10],
                       'C':[4,12,8,0,0,4,12,8,0,0],
                       'D':[9,10,0,1,3,np.nan,np.nan,np.nan,np.nan,np.nan]})

# 10 rows split into 5 equal parts of 2 rows each
listOfDfs = [df.loc[idx] for idx in np.split(df.index,5)]

A small function that uses a modulo could take care of cases where the split is not even (e.g. np.split(df.index, 4) will throw an error).
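
One possible way to handle the uneven case without writing the modulo logic by hand is `np.array_split`, which accepts a section count that does not divide the length evenly. The sketch below is an alternative to the answer's approach, not the author's code; it splits row positions and selects them with `.iloc`, and the frame is a made-up example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": range(10)})

# Split the row positions (not the labels) into 4 parts of nearly equal size
position_chunks = np.array_split(np.arange(len(df)), 4)
list_of_dfs = [df.iloc[positions] for positions in position_chunks]

print([len(part) for part in list_of_dfs])   # [3, 3, 2, 2]
```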

(Yes, I am aware that the original question was somewhat more specific than this. However, this is supposed to answer the question in the title.)

Answer by Roei Bahumi

Below is a simple function that splits a DataFrame into chunks, followed by a few usage examples:

import pandas as pd

def split_dataframe_to_chunks(df, n):
    df_len = len(df)
    count = 0
    dfs = []

    while True:
        if count > df_len-1:
            break

        start = count
        count += n
        #print("%s : %s" % (start, count))
        dfs.append(df.iloc[start : count])
    return dfs


# Create a DataFrame with 10 rows
df = pd.DataFrame([i for i in range(10)])

# Split the DataFrame to chunks of maximum size 2
split_df_to_chunks_of_2 = split_dataframe_to_chunks(df, 2)
print([len(i) for i in split_df_to_chunks_of_2])
# prints: [2, 2, 2, 2, 2]

# Split the DataFrame to chunks of maximum size 3
split_df_to_chunks_of_3 = split_dataframe_to_chunks(df, 3)
print([len(i) for i in split_df_to_chunks_of_3])
# prints [3, 3, 3, 1]

Answer by agittarius

I used a list comprehension to cut a huge DataFrame into blocks of 100,000 rows:

size = 100000
# .iloc slices by position (end excluded), so this works for any index
list_of_dfs = [df.iloc[i:i+size, :] for i in range(0, len(df), size)]

Or as a generator:

list_of_dfs = (df.iloc[i:i+size, :] for i in range(0, len(df), size))

Answer by Ram Prajapati

This method uses a list comprehension with groupby: it stores all the split dataframes in a list, whose elements can then be accessed by index. Note that it splits by the values of a column rather than by row count.

Example:

# One DataFrame per distinct value of 'column_name' (DF is your DataFrame)
ans = [pd.DataFrame(y) for x, y in DF.groupby('column_name', as_index=False)]
ans[0]
ans[0].column_name
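
For a self-contained version of the same idea, here is a sketch on made-up data (the `city` and `sales` columns are assumptions, not from the answer). The second half shows one way to make groupby split by row count instead, by grouping on the integer-divided row position.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "city": ["Oslo", "Rome", "Oslo", "Rome", "Lima"],
    "sales": [1, 2, 3, 4, 5],
})

# Iterating over a groupby yields (key, sub-DataFrame) pairs, sorted by key
by_city = {key: group for key, group in df.groupby("city")}
print(list(by_city))                        # ['Lima', 'Oslo', 'Rome']
print(by_city["Oslo"]["sales"].tolist())    # [1, 3]

# Group on row position // chunk size to get chunks of at most 2 rows
chunks = [group for _, group in df.groupby(np.arange(len(df)) // 2)]
print([len(chunk) for chunk in chunks])     # [2, 2, 1]
```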

Answer by Romain Jouin

import os

def split_and_save_df(df, name, size, output_dir):
    """
    Split a df and save each chunk in a different csv file.

    Parameters:
        df : pandas df to be split
        name : name to give to the output file
        size : chunk size
        output_dir : directory where to write the divided df
    """
    for start in range(0, df.shape[0], size):
        # end is exclusive, so each chunk holds rows start .. end-1
        end    = min(start + size, df.shape[0])
        # .iloc slices by position and works whatever the index labels are
        subset = df.iloc[start:end]
        output_path = os.path.join(output_dir, f"{name}_{start}_{end}.csv")
        print(f"Going to write into {output_path}")
        subset.to_csv(output_path)
        output_size = os.stat(output_path).st_size
        print(f"Wrote {output_size} bytes")