Python Pandas: How to compile all lists in a column into one unique list

Question

Asked by kitchenprinzessin

I have a pandas dataframe as below:

How can I combine all the lists (in the 'val' column) into a unique list (set), e.g. [val1, val2, val33, val9, val6, val7]?

I can solve this with the following code, but I wonder whether there is an easier way to get all unique values from a column without iterating over the dataframe rows.

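import ast

def_contributors = []
for index, row in df.iterrows():
    # Each cell holds the string form of a list; parse it into a real list
    contri = ast.literal_eval(row['val'])
    def_contributors.extend(contri)
# Drop duplicates (set does not preserve order)
def_contributors = list(set(def_contributors))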

Answer 1

Answered by jezrael

Another solution: export the Series to a nested list with tolist(), flatten it with a list comprehension, and apply set to drop the duplicates:

df = pd.DataFrame({'id':['a','b', 'c'], 'val':[['val1','val2'],
                                               ['val33','val9','val6'],
                                               ['val2','val6','val7']]})

print (df)
  id                  val
0  a         [val1, val2]
1  b  [val33, val9, val6]
2  c   [val2, val6, val7]

print (type(df.val.iloc[0]))
<class 'list'>

print (df.val.tolist())
[['val1', 'val2'], ['val33', 'val9', 'val6'], ['val2', 'val6', 'val7']]

print (list(set([a for b in df.val.tolist() for a in b])))
['val7', 'val1', 'val6', 'val33', 'val2', 'val9']
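
Note that set does not keep any particular order, so the result above comes back in arbitrary order. If first-seen order matters (as in the example given in the question), a minimal sketch using itertools.chain.from_iterable and dict.fromkeys might look like the following. It assumes Python 3.7+ (where dicts keep insertion order) and that every cell in val is a real list; the variable names flat and unique_in_order are illustrative only.

```python
from itertools import chain

import pandas as pd

df = pd.DataFrame({'id': ['a', 'b', 'c'],
                   'val': [['val1', 'val2'],
                           ['val33', 'val9', 'val6'],
                           ['val2', 'val6', 'val7']]})

# Flatten the nested lists lazily, then drop duplicates while keeping
# the order in which each value first appears.
flat = chain.from_iterable(df['val'].tolist())
unique_in_order = list(dict.fromkeys(flat))

print(unique_in_order)
# ['val1', 'val2', 'val33', 'val9', 'val6', 'val7']
```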

Timings:

df = pd.concat([df]*1000).reset_index(drop=True)

In [307]: %timeit (df['val'].apply(pd.Series).stack().unique()).tolist()
1 loop, best of 3: 410 ms per loop

In [355]: %timeit (pd.Series(sum(df.val.tolist(),[])).unique().tolist())
10 loops, best of 3: 31.9 ms per loop

In [308]: %timeit np.unique(np.hstack(df.val)).tolist()
100 loops, best of 3: 10.7 ms per loop

In [309]: %timeit (list(set([a for b in df.val.tolist() for a in b])))
1000 loops, best of 3: 558 μs per loop

If the values are strings rather than lists, use str.strip and str.split:

df = pd.DataFrame({'id':['a','b', 'c'], 'val':["[val1,val2]",
                                               "[val33,val9,val6]",
                                               "[val2,val6,val7]"]})

print (df)
  id                val
0  a        [val1,val2]
1  b  [val33,val9,val6]
2  c   [val2,val6,val7]

print (type(df.val.iloc[0]))
<class 'str'>

print (df.val.str.strip('[]').str.split(','))
0           [val1, val2]
1    [val33, val9, val6]
2     [val2, val6, val7]
Name: val, dtype: object

print (list(set([a for b in df.val.str.strip('[]').str.split(',') for a in b])))
['val7', 'val1', 'val6', 'val33', 'val2', 'val9']
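
The strip/split approach above fits strings whose items are unquoted. If the strings are instead valid Python list literals with quoted items (which the ast.literal_eval call in the question suggests), a sketch along these lines may be closer to what is needed. The sample data here is made up for illustration, and sorted is used only to make the output deterministic.

```python
import ast

import pandas as pd

# Hypothetical data: each cell is the string form of a Python list.
df = pd.DataFrame({'id': ['a', 'b'],
                   'val': ["['val1', 'val2']", "['val2', 'val6']"]})

# Parse each string into a real list, then collect all items in a set.
parsed = df['val'].apply(ast.literal_eval)
unique_vals = sorted({item for sub in parsed for item in sub})

print(unique_vals)
# ['val1', 'val2', 'val6']
```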

Answer 2

Answered by ayhan

Convert that column into a DataFrame with .apply(pd.Series). If you stack the columns, you can call the unique method on the returned Series.

df
Out[123]: 
            val
0      [v1, v2]
1      [v3, v2]
2  [v4, v3, v2]

df['val'].apply(pd.Series).stack().unique()
Out[124]: array(['v1', 'v2', 'v3', 'v4'], dtype=object)
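
On newer pandas versions (0.25 or later, which is an assumption about your environment), Series.explode can play the same role as .apply(pd.Series).stack() by giving each list element its own row. This is a sketch of that alternative, not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame({'val': [['v1', 'v2'], ['v3', 'v2'], ['v4', 'v3', 'v2']]})

# explode() turns every list element into its own row;
# unique() then keeps the first occurrence of each value.
unique_vals = df['val'].explode().unique().tolist()

print(unique_vals)
# ['v1', 'v2', 'v3', 'v4']
```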

Answer 3

Answered by Nickil Maveli

You can use str.cat followed by some string manipulation to obtain the desired list (this assumes the values in the column are stored as strings).

In [60]: import re
    ...: from collections import OrderedDict

In [62]: s = df['val'].str.cat()

In [63]: L = re.sub('[[]|[]]',' ', s).strip().replace("  ",',').split(',')

In [64]: list(OrderedDict.fromkeys(L))
Out[64]: ['val1', 'val2', 'val33', 'val9', 'val6', 'val7']

Answer 4

Answered by Divakar

One way would be to extract those elements into a single array using np.hstack and then use np.unique to get an array of the unique elements, like so:

np.unique(np.hstack(df.val))

If you want a list as output, append .tolist():

np.unique(np.hstack(df.val)).tolist()
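
One thing to be aware of is that np.unique returns its result sorted, so the original order of appearance is not kept. A self-contained sketch of the same idea, using the sample data from the first answer and passing a plain list of lists to np.hstack (a small deviation from the one-liner above, made here only to keep the input type explicit):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': ['a', 'b', 'c'],
                   'val': [['val1', 'val2'],
                           ['val33', 'val9', 'val6'],
                           ['val2', 'val6', 'val7']]})

# Stack all per-row lists into one flat array of strings.
flat = np.hstack(df['val'].tolist())

# np.unique removes duplicates and sorts the remaining values.
result = np.unique(flat).tolist()

print(result)
# ['val1', 'val2', 'val33', 'val6', 'val7', 'val9']
```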
