pandas: duplicate rows based on the value in a different column
Note: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license, cite the original URL, and attribute it to the original authors (not me): Stack Overflow
Original source: http://stackoverflow.com/questions/32792263/
Duplicate row based on value in different column
Asked by MRA
I have a dataframe of transactions. Each row represents a transaction of two items (think of it as a transaction of 2 event tickets or something similar). I want to duplicate each row based on the quantity sold.
Here's example code:
import pandas as pd

# dictionary of transactions
d = {
    '1': ['20', 'NYC', '2'],
    '2': ['30', 'NYC', '2'],
    '3': ['5', 'NYC', '2'],
    '4': ['300', 'LA', '2'],
    '5': ['30', 'LA', '2'],
    '6': ['100', 'LA', '2']
}
columns = ['Price', 'City', 'Quantity']
# create dataframe and rename columns
df = pd.DataFrame.from_dict(
    data=d, orient='index'
)
df.columns = columns
This produces a dataframe that looks like this:
Price City Quantity
   20  NYC        2
   30  NYC        2
    5  NYC        2
  300   LA        2
   30   LA        2
  100   LA        2
So in the case above, each row will turn into two duplicate rows. If the value in the 'Quantity' column were 3, that row would turn into three duplicate rows.
Accepted answer by Alexander
First, I recreated your data using integers instead of text. I also varied the quantity so that one can more easily understand the problem.
d = {1: [20, 'NYC', 1], 2: [30, 'NYC', 2], 3: [5, 'SF', 3],
4: [300, 'LA', 1], 5: [30, 'LA', 2], 6: [100, 'SF', 3]}
columns=['Price', 'City', 'Quantity']
# create dataframe and rename columns
df = pd.DataFrame.from_dict(data=d, orient='index').sort_index()
df.columns = columns
>>> df
Price City Quantity
1 20 NYC 1
2 30 NYC 2
3 5 SF 3
4 300 LA 1
5 30 LA 2
6 100 SF 3
I created a new DataFrame by using a nested list comprehension structure.
# .ix was removed in pandas 1.0; .loc performs the same label-based lookup here
df_new = pd.DataFrame([df.loc[idx]
                       for idx in df.index
                       for _ in range(df.loc[idx, 'Quantity'])]).reset_index(drop=True)
>>> df_new
Price City Quantity
0 20 NYC 1
1 30 NYC 2
2 30 NYC 2
3 5 SF 3
4 5 SF 3
5 5 SF 3
6 300 LA 1
7 30 LA 2
8 30 LA 2
9 100 SF 3
10 100 SF 3
11 100 SF 3
Answer by YOBEN_S
An answer using repeat:
df.loc[df.index.repeat(df.Quantity)]
Out[448]:
Price City Quantity
1 20 NYC 2
1 20 NYC 2
2 30 NYC 2
2 30 NYC 2
3 5 NYC 2
3 5 NYC 2
4 300 LA 2
4 300 LA 2
5 30 LA 2
5 30 LA 2
6 100 LA 2
6 100 LA 2
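The one-liner above can be sketched as a self-contained script. This is only an illustration under a few assumptions: it reuses the first three rows of the accepted answer's sample data, the name `repeated` is made up for the example, and the `astype(int)` step is included because the question stores Quantity as text, which `repeat` cannot use as a count.

```python
import pandas as pd

# First three rows of the accepted answer's sample data (integer quantities)
d = {1: [20, 'NYC', 1], 2: [30, 'NYC', 2], 3: [5, 'SF', 3]}
df = pd.DataFrame.from_dict(d, orient='index',
                            columns=['Price', 'City', 'Quantity'])

# If Quantity were stored as text, as in the question, convert it first
df['Quantity'] = df['Quantity'].astype(int)

# Repeat each index label Quantity times, then select rows by those labels
repeated = df.loc[df.index.repeat(df['Quantity'])]

# Optional: replace the duplicated labels with a fresh 0..n-1 index
repeated = repeated.reset_index(drop=True)
print(repeated)
```

Without the final `reset_index(drop=True)`, the result keeps the original labels, so each label appears as many times as its row was repeated, as in the output shown above.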
Answer by Dickster
How about this approach? I changed your data slightly to include a sale of 4 tickets.
We use a suitably sized helper np.ones() array, and then the key line of code is: a[np.arange(a.shape[1])[:] > a[:,0,np.newaxis]] = 0
I was shown this technique here: numpy - update values using slicing given an array value
Then it's simply a call to .stack() and some basic filtering to finish.
import numpy as np
import pandas as pd

d = {'1': ['20', 'NYC', '2'], '2': ['30', 'NYC', '2'], '3': ['5', 'NYC', '2'],
     '4': ['300', 'LA', '2'], '5': ['30', 'LA', '4'], '6': ['100', 'LA', '2']}
columns = ['Price', 'City', 'Quantity']
df = pd.DataFrame.from_dict(data=d, orient='index')
df.columns = columns
df['Quantity'] = df['Quantity'].astype(int)
# make an array of ones: one row per transaction, one column per possible ticket
my_ones = np.ones(shape=(len(df), df['Quantity'].max()))
# turn my_ones into a dataframe with the same index as df so we can join it on the right-hand side; there are plenty of other ways to achieve the same outcome
df_my_ones = pd.DataFrame(data=my_ones, index=df.index)
df = df.join(df_my_ones)
which looks like:
Price City Quantity 0 1 2 3
1 20 NYC 2 1 1 1 1
3 5 NYC 2 1 1 1 1
2 30 NYC 2 1 1 1 1
5 30 LA 4 1 1 1 1
4 300 LA 2 1 1 1 1
Now get the Quantity column and the columns of ones into a NumPy array:
a = df.iloc[:,2:].values
This is the clever bit:
a[np.arange(a.shape[1])[:] > a[:,0,np.newaxis]] = 0
Then assign the result back to df:
df.iloc[:,2:] = a
Now df looks like the following. Notice how the ones beyond the number in Quantity have been set to zero:
Price City Quantity 0 1 2 3
1 20 NYC 2 1 1 0 0
3 5 NYC 2 1 1 0 0
2 30 NYC 2 1 1 0 0
5 30 LA 4 1 1 1 1
4 300 LA 2 1 1 0 0
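The masking step can also be tried in isolation on a small array. The following is a minimal sketch rather than the answerer's exact code: the `quantity` values and the use of `np.column_stack` to build `a` are assumptions made for the illustration. It relies on NumPy broadcasting a row of column positions against a column of quantities.

```python
import numpy as np

# Column 0 holds the quantity; the remaining columns start as ones
quantity = np.array([2, 4, 1])
a = np.column_stack([quantity, np.ones((3, quantity.max()), dtype=int)])

# Compare each column position (0..4) with each row's quantity.
# Broadcasting shape (5,) against shape (3, 1) gives a (3, 5) boolean mask
# that is True where the column position exceeds that row's quantity.
mask = np.arange(a.shape[1]) > a[:, 0, np.newaxis]
a[mask] = 0
print(a)
```

Position 0 is the quantity column itself and is never greater than a positive quantity, so it is left alone. Each row ends up with exactly as many ones as its quantity.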
df.set_index(['Price', 'City', 'Quantity'], inplace=True)
df = df.stack().to_frame()
df.columns = ['sale_flag']
df.reset_index(inplace=True)
# keep only the rows whose flag survived the masking step
print(df[['Price', 'City', 'Quantity']][df['sale_flag'] != 0])
print(df)
which produces:
Price City Quantity
0 20 NYC 2
1 20 NYC 2
4 5 NYC 2
5 5 NYC 2
8 30 NYC 2
9 30 NYC 2
12 30 LA 4
13 30 LA 4
14 30 LA 4
15 30 LA 4
16 300 LA 2
17 300 LA 2

