Pandas: How to Compare Columns of Lists Row-wise in a DataFrame with Pandas (not for loop)?
Source: http://stackoverflow.com/questions/35616058/
Note: this content comes from StackOverflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license and attribute it to the original authors (not me): StackOverflow
Asked by Jarad
DataFrame
import pandas as pd

df = pd.DataFrame({
    'A': [['gener'], ['gener'], ['system'], ['system'], ['gutter'], ['gutter'],
          ['gutter'], ['gutter'], ['gutter'], ['gutter'], ['aluminum'], ['aluminum'],
          ['aluminum'], ['aluminum'], ['aluminum'], ['aluminum'], ['aluminum'],
          ['aluminum'], ['aluminum'], ['aluminum', 'toledo']],
    'B': [['gutter'], ['gutter'], ['gutter', 'system'], ['gutter', 'guard', 'system'],
          ['ohio', 'gutter'], ['gutter', 'toledo'], ['toledo', 'gutter'], ['gutter'],
          ['gutter'], ['gutter'], ['how', 'to', 'instal', 'aluminum', 'gutter'],
          ['aluminum', 'gutter'], ['aluminum', 'gutter', 'color'], ['aluminum', 'gutter'],
          ['aluminum', 'gutter', 'adrian', 'ohio'],
          ['aluminum', 'gutter', 'bowl', 'green', 'ohio'],
          ['aluminum', 'gutter', 'maume', 'ohio'],
          ['aluminum', 'gutter', 'perrysburg', 'ohio'],
          ['aluminum', 'gutter', 'tecumseh', 'ohio'],
          ['aluminum', 'gutter', 'toledo', 'ohio']],
}, columns=['A', 'B'])
What it Looks Like
I have a dataframe with two columns of lists.
    A                   B
0   [gener]             [gutter]
1   [gener]             [gutter]
2   [system]            [gutter, system]
3   [system]            [gutter, guard, system]
4   [gutter]            [ohio, gutter]
5   [gutter]            [gutter, toledo]
6   [gutter]            [toledo, gutter]
7   [gutter]            [gutter]
8   [gutter]            [gutter]
9   [gutter]            [gutter]
10  [aluminum]          [how, to, instal, aluminum, gutter]
11  [aluminum]          [aluminum, gutter]
12  [aluminum]          [aluminum, gutter, color]
13  [aluminum]          [aluminum, gutter]
14  [aluminum]          [aluminum, gutter, adrian, ohio]
15  [aluminum]          [aluminum, gutter, bowl, green, ohio]
16  [aluminum]          [aluminum, gutter, maume, ohio]
17  [aluminum]          [aluminum, gutter, perrysburg, ohio]
18  [aluminum]          [aluminum, gutter, tecumseh, ohio]
19  [aluminum, toledo]  [aluminum, gutter, toledo, ohio]
Question
If I have columns of lists, is there a pandas function that lets me operate on the entire array of lists to check for intersection and return either a boolean or the intersecting values as a new series?
For example, I'd like pandas to have an equivalent of this:
def intersection(df, col1, col2, return_type='boolean'):
    if return_type == 'boolean':
        # True when any item of col2 also appears in col1 on the same row
        s = []
        for _, row in df.iterrows():
            s.append(any(phrase in row[col1] for phrase in row[col2]))
        return pd.Series(s, index=df.index)
    elif return_type == 'word':
        # Comma-separated string of the items shared by both lists
        s = []
        for _, row in df.iterrows():
            s.append(', '.join(set(row[col1]).intersection(row[col2])))
        return pd.Series(s, index=df.index)

# Create column C in df
df['C'] = intersection(df, 'A', 'B', 'word')
... without having to write my own function or resort to for loops. I feel like there must be an easier way to compare lists in two columns on the same row to see if they intersect.
I can do it with for loops, but it's ugly to me
A for loop to return a boolean series:
# Typed at the interactive prompt, where each result is echoed
for _, row in df.iterrows():
    any(phrase in row['A'] for phrase in row['B'])
Produces:
False
False
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
Or, finding the intersecting words using sets:
# Typed at the interactive prompt, where each result is echoed
for _, row in df.iterrows():
    ', '.join(set(row['A']).intersection(row['B']))
''
''
'system'
'system'
'gutter'
'gutter'
'gutter'
'gutter'
'gutter'
'gutter'
'aluminum'
'aluminum'
'aluminum'
'aluminum'
'aluminum'
'aluminum'
'aluminum'
'aluminum'
'aluminum'
'toledo, aluminum'
Answered by Alexander
To check if every item in df.A is contained in df.B:
>>> df.apply(lambda row: all(i in row.B for i in row.A), axis=1)
# OR: ~(df['A'].apply(set) - df['B'].apply(set)).astype(bool)
0 False
1 False
2 True
3 True
4 True
5 True
6 True
7 True
8 True
9 True
10 True
11 True
12 True
13 True
14 True
15 True
16 True
17 True
18 True
19 True
dtype: bool
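The check above requires every item of A to appear in B. If the goal is only to know whether the two lists share at least one item, as in the question's boolean example, a set-based variant is one option. The following is a minimal sketch on a small made-up frame (the name `small` and its values are assumptions for illustration, not the question's data); it relies on `set.isdisjoint` and `set.issubset`, both of which accept any iterable.

```python
import pandas as pd

# Hypothetical miniature frame, for illustration only
small = pd.DataFrame({
    'A': [['gener'], ['system'], ['aluminum', 'toledo']],
    'B': [['gutter'], ['gutter', 'system'], ['aluminum', 'gutter']],
})

# True when the two lists in a row share at least one item
any_common = pd.Series(
    [not set(a).isdisjoint(b) for a, b in zip(small['A'], small['B'])],
    index=small.index,
)

# True only when every item of A is also in B
all_contained = pd.Series(
    [set(a).issubset(b) for a, b in zip(small['A'], small['B'])],
    index=small.index,
)

print(any_common.tolist())     # [False, True, True]
print(all_contained.tolist())  # [False, True, False]
```

The two checks differ only on the last row, where 'toledo' is missing from B.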
To get the intersection:
df['intersection'] = [list(set(a).intersection(set(b))) for a, b in zip(df.A, df.B)]
>>> df
    A                   B                                      intersection
0   [gener]             [gutter]                               []
1   [gener]             [gutter]                               []
2   [system]            [gutter, system]                       [system]
3   [system]            [gutter, guard, system]                [system]
4   [gutter]            [ohio, gutter]                         [gutter]
5   [gutter]            [gutter, toledo]                       [gutter]
6   [gutter]            [toledo, gutter]                       [gutter]
7   [gutter]            [gutter]                               [gutter]
8   [gutter]            [gutter]                               [gutter]
9   [gutter]            [gutter]                               [gutter]
10  [aluminum]          [how, to, instal, aluminum, gutter]    [aluminum]
11  [aluminum]          [aluminum, gutter]                     [aluminum]
12  [aluminum]          [aluminum, gutter, color]              [aluminum]
13  [aluminum]          [aluminum, gutter]                     [aluminum]
14  [aluminum]          [aluminum, gutter, adrian, ohio]       [aluminum]
15  [aluminum]          [aluminum, gutter, bowl, green, ohio]  [aluminum]
16  [aluminum]          [aluminum, gutter, maume, ohio]        [aluminum]
17  [aluminum]          [aluminum, gutter, perrysburg, ohio]   [aluminum]
18  [aluminum]          [aluminum, gutter, tecumseh, ohio]     [aluminum]
19  [aluminum, toledo]  [aluminum, gutter, toledo, ohio]       [aluminum, toledo]
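The question also asked for the shared words as a comma-separated string and for a boolean flag. Both can be derived from the same row-wise set intersection. This is a sketch under stated assumptions: the frame `small` and the column names `words` and `has_common` are made up for illustration, and `sorted()` is added only because sets have no guaranteed order.

```python
import pandas as pd

# Hypothetical miniature frame, for illustration only
small = pd.DataFrame({
    'A': [['gener'], ['gutter'], ['aluminum', 'toledo']],
    'B': [['gutter'], ['ohio', 'gutter'], ['aluminum', 'gutter', 'toledo', 'ohio']],
})

# Intersect row by row; sorted() makes the item order deterministic
common = [sorted(set(a) & set(b)) for a, b in zip(small['A'], small['B'])]

# Comma-separated string of shared items, and a flag for "any shared item"
small['words'] = [', '.join(items) for items in common]
small['has_common'] = [bool(items) for items in common]

print(small['words'].tolist())       # ['', 'gutter', 'aluminum, toledo']
print(small['has_common'].tolist())  # [False, True, True]
```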
Answered by ShellayLee
Just use the apply function supported by pandas, it's great.
Since you may have more than two columns to intersect, the auxiliary function can be written as follows and then applied with the DataFrame.apply function (see http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html; note that the option axis=1 means "across the series" while axis=0 means "along the series", where one series is just one column in the data frame). With axis=1, each row across the columns is passed to the applied function as an iterable Series object.
def intersect(ss):
    ss = iter(ss)
    s = set(next(ss))  # start from the first column's list
    for t in ss:
        s.intersection_update(t)  # `t` need not be a `set` here; a `list` or any iterable is OK
    return s

res = df.apply(intersect, axis=1)
>>> res
0 {}
1 {}
2 {system}
3 {system}
4 {gutter}
5 {gutter}
6 {gutter}
7 {gutter}
8 {gutter}
9 {gutter}
10 {aluminum}
11 {aluminum}
12 {aluminum}
13 {aluminum}
14 {aluminum}
15 {aluminum}
16 {aluminum}
17 {aluminum}
18 {aluminum}
19 {aluminum, toledo}
You can apply further operations to the result of the auxiliary function, or write similar variations of it.
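As a rough illustration of both points, the sketch below applies the same auxiliary function to a frame with three list columns and then derives a count, a flag, and a joined string from the resulting sets. The frame `wide`, its values, and the names `counts`, `flags`, and `words` are assumptions made up for this example.

```python
import pandas as pd

def intersect(ss):
    # Same helper as above: intersect every list in the row
    ss = iter(ss)
    s = set(next(ss))
    for t in ss:
        s.intersection_update(t)
    return s

# Hypothetical frame with three columns of lists
wide = pd.DataFrame({
    'A': [['gutter'], ['aluminum', 'ohio']],
    'B': [['gutter', 'toledo'], ['aluminum', 'gutter', 'ohio']],
    'C': [['ohio'], ['ohio', 'aluminum', 'color']],
})

# axis=1 passes one row at a time to the function
res = wide.apply(intersect, axis=1)

# Further operations on the resulting Series of sets
counts = res.apply(len)                              # shared items per row
flags = res.apply(lambda s: len(s) > 0)              # any shared item?
words = res.apply(lambda s: ', '.join(sorted(s)))    # readable string

print(counts.tolist())  # [0, 2]
print(flags.tolist())   # [False, True]
print(words.tolist())   # ['', 'aluminum, ohio']
```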
Hope this helps.