Note: this page reproduces a popular StackOverflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/44722436/
Time: 2020-09-14 03:52:07  Source: igfitidea

Create Pandas DataFrames from Unique Values in one Column

python pandas

Asked by ylcnky

I have a Pandas dataframe with thousands of rows. Its Names column holds the customer names alongside their records. I want to create an individual dataframe for each customer, based on their unique names. I got the unique names into a list:

customerNames = DataFrame['customer name'].unique().tolist()

This gives the following list:

['Name1', 'Name2', 'Name3', 'Name4']

I tried a loop that takes each unique name from the list above, creates a dataframe for that name, and assigns the dataframe to the customer name. So, for example, when I write Name3, it should give Name3's data as a separate dataframe:

for x in customerNames:
    x = DataFrame.loc[DataFrame['customer name'] == x]
x

The lines above returned a dataframe only for Name4 and skipped the rest.

How can I solve this problem?

Answered by Hyman6e

Your current iteration overwrites x twice every time it runs: the for loop assigns a customer name to x, and then you assign a dataframe to it.
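As a minimal sketch of that rebinding, the snippet below runs the same kind of loop on a small, made-up dataframe (the `amount` column and the sample rows are assumptions for illustration, not data from the question). After the loop finishes, x refers only to the dataframe built in the last iteration.

```python
import pandas as pd

# Hypothetical sample data; only the 'customer name' column comes from the question.
df = pd.DataFrame({
    'customer name': ['Name1', 'Name2', 'Name1'],
    'amount': [1, 2, 3],
})

for x in ['Name1', 'Name2']:
    # At this point x is a string (the customer name) ...
    x = df.loc[df['customer name'] == x]
    # ... and now x is a dataframe; the next iteration rebinds it again.

# After the loop, x holds only the dataframe from the last iteration.
print(x)
```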

To be able to call each dataframe later by name, try storing them in a dictionary:

df_dict = {name: df.loc[df['customer name'] == name] for name in customerNames}

df_dict['Name3']
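
For a self-contained version of the dictionary approach, here is a small sketch. The sample rows and the `amount` column are invented for illustration; the column name 'customer name' follows the question.

```python
import pandas as pd

# Hypothetical sample data standing in for the real customer records.
df = pd.DataFrame({
    'customer name': ['Name1', 'Name2', 'Name1', 'Name3', 'Name2'],
    'amount': [10, 20, 30, 40, 50],
})

customer_names = df['customer name'].unique().tolist()

# One dataframe per customer, keyed by the customer's name.
df_dict = {name: df.loc[df['customer name'] == name] for name in customer_names}

# Look up a single customer's rows by name.
print(df_dict['Name1'])
```

Each value is a filtered view of the original rows, so the original index labels are preserved.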

Answered by Trenton McKinney

To create a dataframe for each unique value in a column, create a dict of dataframes, as follows.

  • Creates a dict, where each key is a unique value from the column of choice and each value is a dataframe.
  • Access each dataframe as you would a value in a standard dict (e.g. df_names['Name1']).
  • .groupby() returns an iterable of (key, dataframe) pairs, which can be unpacked.
    • k is a unique value in the column and v is the data associated with that k.

With a for-loop and .groupby:

df_names = dict()
for k, v in df.groupby('customer name'):
    df_names[k] = v
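
To see what the loop unpacks, the sketch below iterates a groupby over a small, made-up dataframe (the sample rows and the `amount` column are assumptions). It also shows that the same pairs can be passed to dict() through iter(), which is an alternative spelling rather than something from the original answer.

```python
import pandas as pd

# Hypothetical sample data; only the column name follows the question.
df = pd.DataFrame({
    'customer name': ['Name2', 'Name1', 'Name2', 'Name3'],
    'amount': [5, 15, 25, 35],
})

# Iterating a groupby yields (key, sub-dataframe) pairs, sorted by key by default.
for k, v in df.groupby('customer name'):
    print(k, v.shape)

# The same pairs can be fed to dict() via an explicit iterator.
df_names = dict(iter(df.groupby('customer name')))
print(df_names['Name2'])
```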

With a Python Dictionary Comprehension

Using .groupby

df_names = {k: v for (k, v) in df.groupby('customer name')}
  • This comes from a conversation with rafaelc, who pointed out that using .groupby is faster than .unique.
    • With 6 unique values in the column, .groupby is faster, at 104 ms compared to 392 ms.
    • With 26 unique values in the column, .groupby is faster, at 147 ms compared to 1.53 s.
  • Using a for-loop is slightly faster than a comprehension, particularly for more unique column values or lots of rows (e.g. 10M).

Using .unique:

df_names = {name: df[df['customer name'] == name] for name in df['customer name'].unique()}

Testing

  • The following data was used for testing
import pandas as pd
import string
import random

random.seed(365)

# 6 unique values
data = {'class': [random.choice(['1-5', '6-25', '26-100', '100-500', '500-1000', '>1000']) for _ in range(1000000)],
        'treatment': [random.choice(['Yes', 'No']) for _ in range(1000000)]}

# 26 unique values (this reassigns data; run only one of the two definitions per test)
data = {'class': [random.choice(list(string.ascii_lowercase)) for _ in range(1000000)],
        'treatment': [random.choice(['Yes', 'No']) for _ in range(1000000)]}

df = pd.DataFrame(data)
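
A rough way to reproduce the comparison is sketched below with timeit. The row count is reduced to 100,000 so the sketch runs quickly, and the helper names `with_groupby` and `with_unique` are made up for this example; absolute timings will differ by machine and pandas version, so no expected numbers are given.

```python
import random
import string
import timeit

import pandas as pd

random.seed(365)

# Fewer rows than the 1,000,000 used above, so the sketch finishes quickly.
n_rows = 100_000
df = pd.DataFrame({
    'class': [random.choice(string.ascii_lowercase) for _ in range(n_rows)],
    'treatment': [random.choice(['Yes', 'No']) for _ in range(n_rows)],
})


def with_groupby():
    # One pass over the data; groupby splits it into sub-dataframes.
    return {k: v for k, v in df.groupby('class')}


def with_unique():
    # One boolean-mask filter over the whole column per unique value.
    return {name: df[df['class'] == name] for name in df['class'].unique()}


t_groupby = timeit.timeit(with_groupby, number=5)
t_unique = timeit.timeit(with_unique, number=5)
print(f'groupby: {t_groupby:.3f} s, unique: {t_unique:.3f} s')
```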

Answered by N. P.

Maybe I have misunderstood you, but when

for x in customerNames:
    x = DataFrame.loc[DataFrame['customer name'] == x]
x

gives you the right output only for the last list entry, it is because your output is outside the indented body of the loop.

import pandas as pd

# DataFrame.from_items was removed in pandas 1.0; from_dict builds the same frame.
customer_df = pd.DataFrame.from_dict({'A': ['Jean', 'France'], 'B': ['James', 'USA']},
                                     orient='index', columns=['customer', 'country'])

customer_list = ['James', 'Jean']

for x in customer_list:
    x = customer_df.loc[customer_df['customer'] == x]
    print(x)
    print('now I could append the data to something new')

You get the output:

  customer country
B    James     USA
now I could append the data to something new
  customer country
A     Jean  France
now I could append the data to something new

Or, if you don't like loops, you could use isin:

import pandas as pd

# DataFrame.from_items was removed in pandas 1.0; from_dict builds the same frame.
customer_df = pd.DataFrame.from_dict({'A': ['Jean', 'France'], 'B': ['James', 'USA'], 'C': ['Hans', 'Germany']},
                                     orient='index', columns=['customer', 'country'])

customer_list = ['James', 'Jean']


print(customer_df[customer_df['customer'].isin(customer_list)])

Output:

  customer country
A     Jean  France
B    James     USA

df.isin is better explained under: How to implement 'in' and 'not in' for Pandas dataframe
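
As a brief sketch of the "in" and "not in" idea, the snippet below builds the same three-customer frame with a plain dict constructor (a substitution for illustration) and uses a boolean mask from isin, negated with ~ for the "not in" case. The variable names `mask`, `selected`, and `excluded` are made up for this example.

```python
import pandas as pd

customer_df = pd.DataFrame(
    {'customer': ['Jean', 'James', 'Hans'],
     'country': ['France', 'USA', 'Germany']},
    index=['A', 'B', 'C'],
)
customer_list = ['James', 'Jean']

# Boolean Series: True where the customer appears in customer_list.
mask = customer_df['customer'].isin(customer_list)

selected = customer_df[mask]    # rows whose customer is "in" the list
excluded = customer_df[~mask]   # rows whose customer is "not in" the list

print(selected)
print(excluded)
```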