Create Pandas DataFrames from Unique Values in one Column
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/44722436/
Asked by ylcnky
I have a Pandas dataframe with thousands of rows. Its Names column holds the customer names alongside their records. I want to create an individual dataframe for each customer, based on their unique names. I got the unique names into a list:
customerNames = DataFrame['customer name'].unique().tolist()
This gives the following list:
['Name1', 'Name2', 'Name3', 'Name4']
I tried a loop that takes each unique name from the list above, creates a dataframe for that name, and assigns the dataframe to the customer name. So, for example, when I write Name3, it should give Name3's data as a separate dataframe:
for x in customerNames:
    x = DataFrame.loc[DataFrame['customer name'] == x]
x
The lines above returned only the dataframe for Name4 and skipped the rest.
How can I solve this problem?
Answered by Hyman6e
Your current iteration overwrites x twice every time it runs: the for loop assigns a customer name to x, and then you assign a dataframe to it.
To be able to call each dataframe later by name, try storing them in a dictionary:
df_dict = {name: df.loc[df['customer name'] == name] for name in customerNames}
df_dict['Name3']
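As a minimal, self-contained sketch of the dictionary approach above: the sample rows and the 'amount' column are invented for illustration and are not from the question; only the 'customer name' column and the variable names mirror the original.

```python
import pandas as pd

# Hypothetical sample data; only the 'customer name' column mirrors the question.
df = pd.DataFrame({
    'customer name': ['Name1', 'Name2', 'Name1', 'Name3', 'Name2'],
    'amount': [10, 20, 30, 40, 50],
})

# Unique customer names, in order of first appearance.
customerNames = df['customer name'].unique().tolist()

# One dataframe per customer, keyed by the customer's name.
# Using a separate dict avoids overwriting the loop variable.
df_dict = {name: df.loc[df['customer name'] == name] for name in customerNames}

print(df_dict['Name1'])
```

Each value keeps the row index of the original dataframe, so rows can still be traced back to where they came from.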
Answered by Trenton McKinney
To create a dataframe for each of the unique values in a column, create a dict of dataframes, as follows.
- Creates a dict, where each key is a unique value from the column of choice and the value is a dataframe.
- Access each dataframe as you would a standard dict (e.g. df_names['Name1']).
- .groupby() creates a groupby object that yields (k, v) pairs when iterated, so it can be unpacked. k is a unique value in the column and v is the data associated with that k.
With a for-loop and .groupby:
df_names = dict()
for k, v in df.groupby('customer name'):
    df_names[k] = v
With a Python Dictionary Comprehension
Using .groupby:
df_names = {k: v for (k, v) in df.groupby('customer name')}
- This comes from a conversation with rafaelc, who pointed out that using .groupby is faster than .unique.
  - With 6 unique values in the column, .groupby is faster, at 104 ms compared to 392 ms.
  - With 26 unique values in the column, .groupby is faster, at 147 ms compared to 1.53 s.
- Using a for-loop is slightly faster than a comprehension, particularly for more unique column values or lots of rows (e.g. 10M).
Using .unique:
- Use Boolean indexing to match the unique values in the column of choice.
df_names = {name: df[df['customer name'] == name] for name in df['customer name'].unique()}
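The two comprehensions can be checked against each other on a small frame. The following is only a sketch with made-up rows (the 'amount' column is an assumption): it suggests both approaches produce the same per-customer dataframes, and that the key order differs because .groupby sorts the keys by default while .unique keeps the order of first appearance.

```python
import pandas as pd

# Hypothetical sample data, deliberately not in sorted order.
df = pd.DataFrame({
    'customer name': ['Name2', 'Name1', 'Name2', 'Name3'],
    'amount': [1, 2, 3, 4],
})

# Dict of dataframes built with .groupby (keys are sorted by default).
by_groupby = {k: v for k, v in df.groupby('customer name')}

# Dict of dataframes built with .unique and Boolean indexing
# (keys follow the order of first appearance).
by_unique = {name: df[df['customer name'] == name]
             for name in df['customer name'].unique()}

same_keys = set(by_groupby) == set(by_unique)
same_frames = all(by_groupby[k].equals(by_unique[k]) for k in by_groupby)

print(same_keys, same_frames)
print(list(by_groupby))
print(list(by_unique))
```

One difference to keep in mind: by default .groupby drops rows whose key is missing (NaN), so the two approaches can diverge on columns with missing values.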
Testing
- The following data was used for testing:
import pandas as pd
import string
import random
random.seed(365)
# 6 unique values
data = {'class': [random.choice(['1-5', '6-25', '26-100', '100-500', '500-1000', '>1000']) for _ in range(1000000)],
        'treatment': [random.choice(['Yes', 'No']) for _ in range(1000000)]}
# 26 unique values (alternative to the block above; running both replaces the first data)
data = {'class': [random.choice(list(string.ascii_lowercase)) for _ in range(1000000)],
        'treatment': [random.choice(['Yes', 'No']) for _ in range(1000000)]}
df = pd.DataFrame(data)
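The answer does not show how the timings were taken, so the following is only one possible way to compare the two approaches, using the standard library's timeit. The row count is reduced to 100,000 (an assumption, so the sketch runs quickly), and the function names with_groupby and with_unique are hypothetical. Absolute numbers will vary by machine and pandas version.

```python
import random
import string
import timeit

import pandas as pd

random.seed(365)
n_rows = 100_000  # assumption: fewer rows than the 1,000,000 used in the answer

# 26 unique values, same shape as the test data in the answer.
data = {'class': [random.choice(list(string.ascii_lowercase)) for _ in range(n_rows)],
        'treatment': [random.choice(['Yes', 'No']) for _ in range(n_rows)]}
df = pd.DataFrame(data)


def with_groupby():
    """Build the dict of dataframes with .groupby."""
    return {k: v for k, v in df.groupby('class')}


def with_unique():
    """Build the dict of dataframes with .unique and Boolean indexing."""
    return {name: df[df['class'] == name] for name in df['class'].unique()}


runs = 5
t_groupby = timeit.timeit(with_groupby, number=runs) / runs
t_unique = timeit.timeit(with_unique, number=runs) / runs

print(f'groupby: {t_groupby * 1000:.1f} ms per run')
print(f'unique:  {t_unique * 1000:.1f} ms per run')
```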
Answered by N. P.
Maybe I misunderstand you, but when
for x in customerNames:
    x = DataFrame.loc[DataFrame['customer name'] == x]
x
gives you the right output only for the last list entry, it's because your output statement is outside the indented body of the loop.
import pandas as pd
# DataFrame.from_items was removed in pandas 1.0; from_dict with orient='index' builds the same frame
customer_df = pd.DataFrame.from_dict({'A': ['Jean', 'France'], 'B': ['James', 'USA']},
                                     orient='index', columns=['customer', 'country'])
customer_list = ['James', 'Jean']
for x in customer_list:
    # select the rows for this customer and print them inside the loop
    x = customer_df.loc[customer_df['customer'] == x]
    print(x)
    print('now I could append the data to something new')
You get the output:
  customer country
B    James     USA
now I could append the data to something new
  customer country
A     Jean  France
now I could append the data to something new
Or, if you don't like loops, you could go with:
import pandas as pd
# built with from_dict because DataFrame.from_items was removed in pandas 1.0
customer_df = pd.DataFrame.from_dict({'A': ['Jean', 'France'], 'B': ['James', 'USA'], 'C': ['Hans', 'Germany']},
                                     orient='index', columns=['customer', 'country'])
customer_list = ['James', 'Jean']
# keep only the rows whose customer appears in customer_list
print(customer_df[customer_df['customer'].isin(customer_list)])
Output:
  customer country
A     Jean  France
B    James     USA
df.isin is explained in more detail in: How to implement 'in' and 'not in' for Pandas dataframe
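As a brief sketch of the 'in' and 'not in' idea mentioned above: the Boolean mask returned by .isin can be negated with ~ to select the rows that are not in the list. The variable names mask, selected and others are assumptions for illustration, and the frame is built with from_dict so that it runs on current pandas versions.

```python
import pandas as pd

customer_df = pd.DataFrame.from_dict(
    {'A': ['Jean', 'France'], 'B': ['James', 'USA'], 'C': ['Hans', 'Germany']},
    orient='index', columns=['customer', 'country'])
customer_list = ['James', 'Jean']

# True for rows whose customer is in customer_list.
mask = customer_df['customer'].isin(customer_list)

selected = customer_df[mask]    # "in"
others = customer_df[~mask]     # "not in"

print(selected)
print(others)
```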