pandas: How to fill in rows with repeating data in pandas?

Question

Asked by Amyunimus

In R, when adding new data of unequal length to a data frame, the values repeat to fill the data frame:

df <- data.frame(first=c(1,2,3,4,5,6))
df$second <- c(1,2,3)

yielding:

  first second
1     1      1
2     2      2
3     3      3
4     4      1
5     5      2
6     6      3

However, pandas requires equal index lengths.

How do I "fill in" repeating data in pandas like I can in R?

Answer 1

Accepted answer by Yeqing Zhang

Seems there is no elegant way. This is the workaround I just figured out. Basically create a repeating list just bigger than original dataframe, and then left join them.

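import pandas
df = pandas.DataFrame(range(100), columns=['first'])
repeat_arr = [1, 2, 3]
# Build a list slightly longer than df (// keeps the repeat count an integer in Python 3);
# join is a left join on the index, so the surplus rows are dropped.
df = df.join(pandas.DataFrame(repeat_arr * (len(df) // len(repeat_arr) + 1),
    columns=['second']))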
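
A variation on the same idea, as a minimal sketch: instead of joining, the repeated list can be cut to the frame's length before it is assigned. The helper name `repeat_to_length` is made up for illustration and is not part of pandas.

```python
import math
import pandas as pd

def repeat_to_length(values, n):
    """Repeat `values` until there are at least n items, then cut to exactly n."""
    reps = math.ceil(n / len(values))
    return (list(values) * reps)[:n]

df = pd.DataFrame({'first': range(1, 8)})
# 7 rows, 3 values: the pattern is repeated 3 times and truncated to 7 items
df['second'] = repeat_to_length([1, 2, 3], len(df))
print(df)
```

This avoids the temporary DataFrame, at the cost of building the full Python list in memory.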

Answer 2

Answered by Meow

The cycle method from itertools is good for repeating a common pattern.

from itertools import cycle

seq = cycle([1, 2, 3])
df['Seq'] = [next(seq) for count in range(df.shape[0])]

Answer 3

Answered by unutbu

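import pandas as pd
import numpy as np

def put(df, column, values):
    # Fill a plain NumPy array first: np.put writes in place, and recent pandas
    # versions may not propagate in-place writes made through df[column].
    arr = np.zeros(len(df), dtype=np.asarray(values).dtype)
    # np.put repeats `values` when it is shorter than the index array.
    np.put(arr, np.arange(len(df)), values)
    df[column] = arr

df = pd.DataFrame({'first': range(1, 8)})
put(df, 'second', [1, 2, 3])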

yields

   first  second
0      1       1
1      2       2
2      3       3
3      4       1
4      5       2
5      6       3
6      7       1

Not particularly beautiful, but one "feature" it possesses is that you do not have to worry about whether the length of the DataFrame is a multiple of the length of the repeated values. np.put repeats the values as necessary.

My first answer was:

import itertools as IT
df['second'] = list(IT.islice(IT.cycle([1,2,3]), len(df)))

but it turns out this is significantly slower:

In [312]: df = pd.DataFrame({'first':range(10**6)})

In [313]: %timeit df['second'] = list(IT.islice(IT.cycle([1,2,3]), len(df)))
10 loops, best of 3: 143 ms per loop

In [316]: %timeit df['second'] = 0; np.put(df['second'], np.arange(N), [1,2,3])
10 loops, best of 3: 27.9 ms per loop
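
For comparison, NumPy's `np.resize` also fills a larger array with repeated copies of its input, so it can serve as a one-line sketch of the same repeat-as-needed behaviour without writing in place. This is an alternative, not the method timed above, and no timing is claimed for it.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'first': range(1, 8)})
# np.resize repeats the input as often as needed to reach the requested length,
# so the frame length does not have to be a multiple of 3
df['second'] = np.resize([1, 2, 3], len(df))
print(df)
```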

Answer 4

Answered by Paul H

How general of a solution are you looking for? I tried to make this a little less hard-coded:

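import numpy as np
import pandas

df = pandas.DataFrame(np.arange(1, 7), columns=['first'])

base = [1, 2, 3]
# Integer division (//) is required in Python 3; this only works when
# len(df) is an exact multiple of len(base).
df['second'] = base * (df.shape[0] // len(base))
print(df.to_string())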


   first  second
0      1       1
1      2       2
2      3       3
3      4       1
4      5       2
5      6       3

Answer 5

Answered by Daniele

In my case I needed to repeat the values without knowing the length of the sub-list, i.e. checking the length of every group. This was my solution:

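import numpy as np
import pandas

df = pandas.DataFrame(['a','a','a','b','b','b','b'], columns=['first'])

# One 0..n-1 counter per group; the variable is named `groups` because
# `list` would shadow the builtin. Assumes rows are already ordered by group.
groups = df.groupby('first').apply(lambda x: list(range(len(x)))).tolist()
loop = [val for sublist in groups for val in sublist]
df['second'] = loop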

df
  first  second
0     a       0
1     a       1
2     a       2
3     b       0
4     b       1
5     b       2
6     b       3
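
The same per-group counter can probably be obtained more directly with pandas' built-in `GroupBy.cumcount`. A minimal sketch, using the same sample data as above:

```python
import pandas as pd

df = pd.DataFrame({'first': ['a', 'a', 'a', 'b', 'b', 'b', 'b']})
# cumcount numbers the rows of each group 0, 1, 2, ... in their original order
df['second'] = df.groupby('first').cumcount()
print(df)
```

Because `cumcount` returns a Series aligned on the original index, this variant does not depend on the rows being sorted by group.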

Answer 6

Answered by SBM

Probably inefficient, but here's sort of a pure pandas solution.

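import numpy as np
import pandas as pd

base = [1,2,3]
df = pd.DataFrame(data = None,index = np.arange(10),columns = ["filler"])
# Use .loc rather than chained indexing so the assignment reaches df itself
df.loc[df.index[:len(base)], "filler"] = base

df["tmp"] = np.arange(len(df)) % len(base)
# A stable sort keeps the seeded rows first within each tmp group, so ffill copies them down.
# Assigning back realigns on the index, so .sort_index() is not needed.
df["filler"] = df.sort_values("tmp", kind="mergesort")["filler"].ffill()
print(df)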

Answer 7

Answered by JDenman6

You might want to try using the power of modulo (%). You can take the value (or index) of first and use the length of second as the modulus to get the value (or index) you're looking for. Something like:

import pandas

df = pandas.DataFrame([0,1,2,3,4,5], columns=['first'])
sec = [0,1,2]
df['second'] = df['first'].apply(lambda x: x % len(sec) )
print(df)
   first  second
0      0       0
1      1       1
2      2       2
3      3       0
4      4       1
5      5       2
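
The example above works on the values of first, which happen to equal the row positions. When they do not, a sketch of the index variant could take the row position modulo the pattern length and use it to look up the repeated values. The sample data and the name `positions` here are illustrative assumptions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'first': [10, 20, 30, 40, 50, 60, 70]})
sec = np.array(['x', 'y', 'z'])
# Row position modulo len(sec) picks which element of sec each row receives
positions = np.arange(len(df)) % len(sec)
df['second'] = sec[positions]
print(df)
```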

I hope that helps.
