Pandas: How to Compare Columns of Lists Row-wise in a DataFrame with Pandas (not for loop)?

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow original: http://stackoverflow.com/questions/35616058/

Time: 2020-09-14 00:45:36  Source: igfitidea

python pandas

Asked by Jarad

DataFrame

import pandas as pd

df = pd.DataFrame({
    'A': [['gener'], ['gener'], ['system'], ['system'], ['gutter'], ['gutter'], ['gutter'], ['gutter'], ['gutter'], ['gutter'], ['aluminum'], ['aluminum'], ['aluminum'], ['aluminum'], ['aluminum'], ['aluminum'], ['aluminum'], ['aluminum'], ['aluminum'], ['aluminum', 'toledo']],
    'B': [['gutter'], ['gutter'], ['gutter', 'system'], ['gutter', 'guard', 'system'], ['ohio', 'gutter'], ['gutter', 'toledo'], ['toledo', 'gutter'], ['gutter'], ['gutter'], ['gutter'], ['how', 'to', 'instal', 'aluminum', 'gutter'], ['aluminum', 'gutter'], ['aluminum', 'gutter', 'color'], ['aluminum', 'gutter'], ['aluminum', 'gutter', 'adrian', 'ohio'], ['aluminum', 'gutter', 'bowl', 'green', 'ohio'], ['aluminum', 'gutter', 'maume', 'ohio'], ['aluminum', 'gutter', 'perrysburg', 'ohio'], ['aluminum', 'gutter', 'tecumseh', 'ohio'], ['aluminum', 'gutter', 'toledo', 'ohio']],
}, columns=['A', 'B'])

What it Looks Like

I have a dataframe with two columns of lists.

                     A                                      B
0              [gener]                               [gutter]
1              [gener]                               [gutter]
2             [system]                       [gutter, system]
3             [system]                [gutter, guard, system]
4             [gutter]                         [ohio, gutter]
5             [gutter]                       [gutter, toledo]
6             [gutter]                       [toledo, gutter]
7             [gutter]                               [gutter]
8             [gutter]                               [gutter]
9             [gutter]                               [gutter]
10          [aluminum]    [how, to, instal, aluminum, gutter]
11          [aluminum]                     [aluminum, gutter]
12          [aluminum]              [aluminum, gutter, color]
13          [aluminum]                     [aluminum, gutter]
14          [aluminum]       [aluminum, gutter, adrian, ohio]
15          [aluminum]  [aluminum, gutter, bowl, green, ohio]
16          [aluminum]        [aluminum, gutter, maume, ohio]
17          [aluminum]   [aluminum, gutter, perrysburg, ohio]
18          [aluminum]     [aluminum, gutter, tecumseh, ohio]
19  [aluminum, toledo]       [aluminum, gutter, toledo, ohio]

Question

If I have columns of lists, is there a pandas function that lets me operate on the entire array of lists to check for intersection and return either a boolean or the intersecting values as a new series?

For example, I'd like pandas to have an equivalent of this:

def intersection(df, col1, col2, return_type='boolean'):
    if return_type == 'boolean':
        s = []
        # True if any item in col2's list also appears in col1's list
        for _, row in df.iterrows():
            s.append(any(phrase in row[col1] for phrase in row[col2]))
        return pd.Series(s)
    elif return_type == 'word':
        s = []
        # Comma-separated items common to both lists (set order is arbitrary)
        for _, row in df.iterrows():
            s.append(', '.join(set(row[col1]).intersection(row[col2])))
        return pd.Series(s)

# Create column C in df
df['C'] = intersection(df, 'A', 'B', 'word')

... without having to write my own function or resort to for loops. I feel like there must be an easier way to compare lists in two columns on the same row to see if they intersect.

I can do it with for loops, but it looks ugly to me.

A for loop to return a boolean series:

for _, row in df.iterrows():
    any(phrase in row['A'] for phrase in row['B'])

Produces:

False
False
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True

Or, finding the intersecting words using sets:

for _, row in df.iterrows():
    ', '.join(set(row['A']).intersection(row['B']))

''
''
'system'
'system'
'gutter'
'gutter'
'gutter'
'gutter'
'gutter'
'gutter'
'aluminum'
'aluminum'
'aluminum'
'aluminum'
'aluminum'
'aluminum'
'aluminum'
'aluminum'
'aluminum'
'toledo, aluminum'

Answered by Alexander

To check if every item in df.A is contained in df.B:

>>> df.apply(lambda row: all(i in row.B for i in row.A), axis=1)
# OR: ~(df['A'].apply(set) - df['B'].apply(set)).astype(bool)
0     False
1     False
2      True
3      True
4      True
5      True
6      True
7      True
8      True
9      True
10     True
11     True
12     True
13     True
14     True
15     True
16     True
17     True
18     True
19     True
dtype: bool
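
Note that the check above is a subset test (every item of A is in B), whereas the loops in the question test whether the two lists share at least one item. The two can give different answers. A minimal sketch of both, side by side, on a small made-up frame (the demo data and the column names subset and overlap are assumptions for illustration, not part of the original answer):

```python
import pandas as pd

# Small made-up frame for illustration; not the data from the question.
demo = pd.DataFrame({
    'A': [['gener'], ['system'], ['aluminum', 'toledo']],
    'B': [['gutter'], ['gutter', 'system'], ['aluminum', 'gutter']],
})

# Subset test: every item of A appears in B.
demo['subset'] = [set(a) <= set(b) for a, b in zip(demo['A'], demo['B'])]

# Overlap test: A and B share at least one item.
demo['overlap'] = [not set(a).isdisjoint(b) for a, b in zip(demo['A'], demo['B'])]
```

In the last row the lists overlap on 'aluminum', but 'toledo' is missing from B, so overlap is True while subset is False.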

To get the intersection:

df['intersection'] = [list(set(a).intersection(set(b))) for a, b in zip(df.A, df.B)]

>>> df
                     A                                      B        intersection
0              [gener]                               [gutter]                  []
1              [gener]                               [gutter]                  []
2             [system]                       [gutter, system]            [system]
3             [system]                [gutter, guard, system]            [system]
4             [gutter]                         [ohio, gutter]            [gutter]
5             [gutter]                       [gutter, toledo]            [gutter]
6             [gutter]                       [toledo, gutter]            [gutter]
7             [gutter]                               [gutter]            [gutter]
8             [gutter]                               [gutter]            [gutter]
9             [gutter]                               [gutter]            [gutter]
10          [aluminum]    [how, to, instal, aluminum, gutter]          [aluminum]
11          [aluminum]                     [aluminum, gutter]          [aluminum]
12          [aluminum]              [aluminum, gutter, color]          [aluminum]
13          [aluminum]                     [aluminum, gutter]          [aluminum]
14          [aluminum]       [aluminum, gutter, adrian, ohio]          [aluminum]
15          [aluminum]  [aluminum, gutter, bowl, green, ohio]          [aluminum]
16          [aluminum]        [aluminum, gutter, maume, ohio]          [aluminum]
17          [aluminum]   [aluminum, gutter, perrysburg, ohio]          [aluminum]
18          [aluminum]     [aluminum, gutter, tecumseh, ohio]          [aluminum]
19  [aluminum, toledo]       [aluminum, gutter, toledo, ohio]  [aluminum, toledo]
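
If a comma-separated string is wanted instead of a list, as in the 'word' mode of the question's function, the same comprehension can be adapted. This is only a sketch on made-up data; the demo frame and the column name C are assumptions for illustration:

```python
import pandas as pd

# Small made-up frame for illustration; not the data from the question.
demo = pd.DataFrame({
    'A': [['gener'], ['gutter'], ['aluminum', 'toledo']],
    'B': [['gutter'], ['ohio', 'gutter'], ['aluminum', 'gutter', 'toledo', 'ohio']],
})

# sorted() makes the word order deterministic; plain set iteration order is not.
demo['C'] = [', '.join(sorted(set(a) & set(b)))
             for a, b in zip(demo['A'], demo['B'])]
```

Rows with no common items get an empty string.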

Answered by ShellayLee

Just use the apply function supported by pandas; it's great.

Since you may have more than two columns to intersect, the auxiliary function can be prepared like this and then applied with the DataFrame.apply function (see http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html; note that the option axis=1 means "across the series", i.e. the function is called once per row, while axis=0 means "along the series", i.e. once per column, where one series is just one column in the data frame). Each row across the columns is then passed as an iterable Series object to the applied function.

def intersect(ss):
    ss = iter(ss)
    s = set(next(ss))
    for t in ss:
        s.intersection_update(t)  # t need not be a set here; a list or any iterable is OK
    return s

res = df.apply(intersect, axis=1)

>>> res
0                     {}
1                     {}
2               {system}
3               {system}
4               {gutter}
5               {gutter}
6               {gutter}
7               {gutter}
8               {gutter}
9               {gutter}
10            {aluminum}
11            {aluminum}
12            {aluminum}
13            {aluminum}
14            {aluminum}
15            {aluminum}
16            {aluminum}
17            {aluminum}
18            {aluminum}
19    {aluminum, toledo}
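
To illustrate the "more than two columns" case, here is a minimal sketch with a third column. The demo frame, the column C, and the helper name intersect_all are hypothetical; the helper is a compact variant using set.intersection rather than the loop above:

```python
import pandas as pd

# Small made-up frame for illustration; column 'C' is a hypothetical third column.
demo = pd.DataFrame({
    'A': [['gutter'], ['aluminum', 'toledo']],
    'B': [['ohio', 'gutter'], ['aluminum', 'gutter', 'toledo']],
    'C': [['gutter', 'guard'], ['toledo', 'ohio']],
})

def intersect_all(row):
    """Return the items common to every list in the row."""
    return set.intersection(*map(set, row))

# axis=1 passes each row (one list per column) to the helper.
res = demo.apply(intersect_all, axis=1)
```

Here res is a Series holding one set per row: the items present in all three lists of that row.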

You can apply further operations to the result of the auxiliary function, or write similar variations of it.

Hope this helps.