Python/Pandas:如何将字符串列表与 DataFrame 列匹配
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/42683249/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Python/Pandas: How to Match List of Strings with a DataFrame column
提问by Riley Hun
I want to compare two columnn -- Description
and Employer
. I want to see if any keywords in Employer
are found in the Description
column. I have broken the Employer
column down to words and converted to a list. Now I want to see if any of those words are in the corresponding Description
column.
我想比较两个 columnn --Description
和Employer
. 我想看看Employer
在Description
列中是否找到任何关键字。我已将Employer
列分解为单词并转换为列表。现在我想看看这些词是否在相应的Description
列中。
Sample input:
样本输入:
print(df.head(25))
Date Description Amount AutoNumber \
0 3/17/2015 WW120 TFR?FR xxx8690 140.00 49246
2 3/13/2015 JX154 TFR?FR xxx8690 150.00 49246
5 3/6/2015 CANSEL SURVEY E PAY 1182.08 49246
9 3/2/2015 UE200 TFR?FR xxx8690 180.00 49246
10 2/27/2015 JH401 TFR?FR xxx8690 400.00 49246
11 2/27/2015 CANSEL SURVEY E PAY 555.62 49246
12 2/25/2015 HU204 TFR?FR xxx8690 200.00 49246
13 2/23/2015 UQ263 TFR?FR xxx8690 102.00 49246
14 2/23/2015 UT460 TFR?FR xxx8690 200.00 49246
15 2/20/2015 CANSEL SURVEY E PAY 1222.05 49246
17 2/17/2015 UO414 TFR?FR xxx8690 250.00 49246
19 2/11/2015 HI540 TFR?FR xxx8690 130.00 49246
20 2/11/2015 HQ010 TFR?FR xxx8690 177.00 49246
21 2/10/2015 WU455 TFR?FR xxx8690 200.00 49246
22 2/6/2015 JJ500 TFR?FR xxx8690 301.00 49246
23 2/6/2015 CANSEL SURVEY E PAY 1182.08 49246
24 2/5/2015 IR453 TFR?FR xxx8690 168.56 49246
26 2/2/2015 RQ574 TFR?FR xxx8690 500.00 49246
27 2/2/2015 UT022 TFR?FR xxx8690 850.00 49246
28 12/31/2014 HU521 TFR?FR xxx8690 950.17 49246
Employer
0 Cansel Survey Equipment
2 Cansel Survey Equipment
5 Cansel Survey Equipment
9 Cansel Survey Equipment
10 Cansel Survey Equipment
11 Cansel Survey Equipment
12 Cansel Survey Equipment
13 Cansel Survey Equipment
14 Cansel Survey Equipment
15 Cansel Survey Equipment
17 Cansel Survey Equipment
19 Cansel Survey Equipment
20 Cansel Survey Equipment
21 Cansel Survey Equipment
22 Cansel Survey Equipment
23 Cansel Survey Equipment
24 Cansel Survey Equipment
26 Cansel Survey Equipment
27 Cansel Survey Equipment
28 Cansel Survey Equipment
I tried something like this, but it doesn't seem to work.:
我试过这样的事情,但它似乎不起作用。:
df['Text_Search'] = df['Employer'].apply(lambda x: x.split(" "))
df['Match'] = np.where(df['Description'].str.contains("|".join(df['Text_Search'])), "Yes", "No")
My desired output would be as shown below:
我想要的输出如下所示:
Date Description Amount AutoNumber \
0 3/17/2015 WW120 TFR?FR xxx8690 140.00 49246
2 3/13/2015 JX154 TFR?FR xxx8690 150.00 49246
5 3/6/2015 CANSEL SURVEY E PAY 1182.08 49246
9 3/2/2015 UE200 TFR?FR xxx8690 180.00 49246
10 2/27/2015 JH401 TFR?FR xxx8690 400.00 49246
11 2/27/2015 CANSEL SURVEY E PAY 555.62 49246
12 2/25/2015 HU204 TFR?FR xxx8690 200.00 49246
13 2/23/2015 UQ263 TFR?FR xxx8690 102.00 49246
14 2/23/2015 UT460 TFR?FR xxx8690 200.00 49246
15 2/20/2015 CANSEL SURVEY E PAY 1222.05 49246
17 2/17/2015 UO414 TFR?FR xxx8690 250.00 49246
19 2/11/2015 HI540 TFR?FR xxx8690 130.00 49246
20 2/11/2015 HQ010 TFR?FR xxx8690 177.00 49246
21 2/10/2015 WU455 TFR?FR xxx8690 200.00 49246
22 2/6/2015 JJ500 TFR?FR xxx8690 301.00 49246
23 2/6/2015 CANSEL SURVEY E PAY 1182.08 49246
24 2/5/2015 IR453 TFR?FR xxx8690 168.56 49246
26 2/2/2015 RQ574 TFR?FR xxx8690 500.00 49246
27 2/2/2015 UT022 TFR?FR xxx8690 850.00 49246
28 12/31/2014 HU521 TFR?FR xxx8690 950.17 49246
29 12/30/2014 WZ553 TFR?FR xxx8690 200.00 49246
32 12/29/2014 JW173 TFR?FR xxx8690 300.00 49246
33 12/24/2014 CANSEL SURVEY E PAY 1219.21 49246
34 12/24/2014 CANSEL SURVEY E PAY 434.84 49246
36 12/23/2014 WT002 TFR?FR xxx8690 160.00 49246
Employer Text_Search Match
0 Cansel Survey Equipment [Cansel, Survey, Equipment] No
2 Cansel Survey Equipment [Cansel, Survey, Equipment] No
5 Cansel Survey Equipment [Cansel, Survey, Equipment] Yes
9 Cansel Survey Equipment [Cansel, Survey, Equipment] No
10 Cansel Survey Equipment [Cansel, Survey, Equipment] No
11 Cansel Survey Equipment [Cansel, Survey, Equipment] Yes
12 Cansel Survey Equipment [Cansel, Survey, Equipment] No
13 Cansel Survey Equipment [Cansel, Survey, Equipment] No
14 Cansel Survey Equipment [Cansel, Survey, Equipment] No
15 Cansel Survey Equipment [Cansel, Survey, Equipment] Yes
17 Cansel Survey Equipment [Cansel, Survey, Equipment] No
19 Cansel Survey Equipment [Cansel, Survey, Equipment] No
20 Cansel Survey Equipment [Cansel, Survey, Equipment] No
21 Cansel Survey Equipment [Cansel, Survey, Equipment] No
22 Cansel Survey Equipment [Cansel, Survey, Equipment] No
23 Cansel Survey Equipment [Cansel, Survey, Equipment] Yes
24 Cansel Survey Equipment [Cansel, Survey, Equipment] No
26 Cansel Survey Equipment [Cansel, Survey, Equipment] No
27 Cansel Survey Equipment [Cansel, Survey, Equipment] No
28 Cansel Survey Equipment [Cansel, Survey, Equipment] No
29 Cansel Survey Equipment [Cansel, Survey, Equipment] No
32 Cansel Survey Equipment [Cansel, Survey, Equipment] No
33 Cansel Survey Equipment [Cansel, Survey, Equipment] Yes
34 Cansel Survey Equipment [Cansel, Survey, Equipment] Yes
36 Cansel Survey Equipment [Cansel, Survey, Equipment] No
采纳答案by pansen
Here is a readable solution using an individual search_func
:
这是使用个人的可读解决方案search_func
:
def search_func(row):
matches = [test_value in row["Description"].lower()
for test_value in row["Text_Search"]]
if any(matches):
return "Yes"
else:
return "No"
This function is then applied row-wise:
然后按行应用此函数:
# create example data
df = pd.DataFrame({"Description": ["CANSEL SURVEY E PAY", "JX154 TFR?FR xxx8690"],
"Employer": ["Cansel Survey Equipment", "Cansel Survey Equipment"]})
print(df)
Description Employer
0 CANSEL SURVEY E PAY Cansel Survey Equipment
1 JX154 TFR?FR xxx8690 Cansel Survey Equipment
# create text searches and match column
df["Text_Search"] = df["Employer"].str.lower().str.split()
df["Match"] = df.apply(search_func, axis=1)
# show result
print(df)
Description Employer Text_Search Match
0 CANSEL SURVEY E PAY Cansel Survey Equipment [cansel, survey, equipment] Yes
1 JX154 TFR?FR xxx8690 Cansel Survey Equipment [cansel, survey, equipment] No
回答by MaxU
Here is fast and memory-saving vectrorized solution, which uses sklearn.feature_extraction.text.CountVectorizermethod:
这是快速且节省内存的矢量化解决方案,它使用sklearn.feature_extraction.text.CountVectorizer方法:
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(min_df=1, lowercase=True)
X = vect.fit_transform(df['Employer'])
cols_emp = vect.get_feature_names()
X = vect.fit_transform(df['Description'])
cols_desc = vect.get_feature_names()
common_cols_idx = [i for i,col in enumerate(cols_desc) if col in cols_emp]
df['Match'] = (X.toarray()[:, common_cols_idx] == 1).any(1)
Source DF:
来源DF:
In [259]: df
Out[259]:
Date Description Amount AutoNumber Employer
0 3/17/2015 WW120 TFR?FR xxx8690 140.00 49246 Cansel Survey Equipment
2 3/13/2015 JX154 TFR?FR xxx8690 150.00 49246 Cansel Survey Equipment
5 3/6/2015 CANSEL SURVEY E PAY 1182.08 49246 Cansel Survey Equipment
9 3/2/2015 UE200 TFR?FR xxx8690 180.00 49246 Cansel Survey Equipment
10 2/27/2015 JH401 TFR?FR xxx8690 400.00 49246 Cansel Survey Equipment
11 2/27/2015 CANSEL SURVEY E PAY 555.62 49246 Cansel Survey Equipment
12 2/25/2015 HU204 TFR?FR xxx8690 200.00 49246 Cansel Survey Equipment
13 2/23/2015 UQ263 TFR?FR xxx8690 102.00 49246 Cansel Survey Equipment
14 2/23/2015 UT460 TFR?FR xxx8690 200.00 49246 Cansel Survey Equipment
15 2/20/2015 CANSEL SURVEY E PAY 1222.05 49246 Cansel Survey Equipment
17 2/17/2015 UO414 TFR?FR xxx8690 250.00 49246 Cansel Survey Equipment
19 2/11/2015 HI540 TFR?FR xxx8690 130.00 49246 Cansel Survey Equipment
20 2/11/2015 HQ010 TFR?FR xxx8690 177.00 49246 Cansel Survey Equipment
21 2/10/2015 WU455 TFR?FR xxx8690 200.00 49246 Cansel Survey Equipment
22 2/6/2015 JJ500 TFR?FR xxx8690 301.00 49246 Cansel Survey Equipment
23 2/6/2015 CANSEL SURVEY E PAY 1182.08 49246 Cansel Survey Equipment
24 2/5/2015 IR453 TFR?FR xxx8690 168.56 49246 Cansel IR453
26 2/2/2015 RQ574 TFR?FR xxx8690 500.00 49246 Cansel Survey Equipment
27 2/2/2015 UT022 TFR?FR xxx8690 850.00 49246 Cansel Survey Equipment
28 12/31/2014 HU521 TFR?FR xxx8690 950.17 49246 Cansel Survey HU521
Result:
结果:
In [261]: df
Out[261]:
Date Description Amount AutoNumber Employer Match
0 3/17/2015 WW120 TFR?FR xxx8690 140.00 49246 Cansel Survey Equipment False
2 3/13/2015 JX154 TFR?FR xxx8690 150.00 49246 Cansel Survey Equipment False
5 3/6/2015 CANSEL SURVEY E PAY 1182.08 49246 Cansel Survey Equipment True
9 3/2/2015 UE200 TFR?FR xxx8690 180.00 49246 Cansel Survey Equipment False
10 2/27/2015 JH401 TFR?FR xxx8690 400.00 49246 Cansel Survey Equipment False
11 2/27/2015 CANSEL SURVEY E PAY 555.62 49246 Cansel Survey Equipment True
12 2/25/2015 HU204 TFR?FR xxx8690 200.00 49246 Cansel Survey Equipment False
13 2/23/2015 UQ263 TFR?FR xxx8690 102.00 49246 Cansel Survey Equipment False
14 2/23/2015 UT460 TFR?FR xxx8690 200.00 49246 Cansel Survey Equipment False
15 2/20/2015 CANSEL SURVEY E PAY 1222.05 49246 Cansel Survey Equipment True
17 2/17/2015 UO414 TFR?FR xxx8690 250.00 49246 Cansel Survey Equipment False
19 2/11/2015 HI540 TFR?FR xxx8690 130.00 49246 Cansel Survey Equipment False
20 2/11/2015 HQ010 TFR?FR xxx8690 177.00 49246 Cansel Survey Equipment False
21 2/10/2015 WU455 TFR?FR xxx8690 200.00 49246 Cansel Survey Equipment False
22 2/6/2015 JJ500 TFR?FR xxx8690 301.00 49246 Cansel Survey Equipment False
23 2/6/2015 CANSEL SURVEY E PAY 1182.08 49246 Cansel Survey Equipment True
24 2/5/2015 IR453 TFR?FR xxx8690 168.56 49246 Cansel IR453 True
26 2/2/2015 RQ574 TFR?FR xxx8690 500.00 49246 Cansel Survey Equipment False
27 2/2/2015 UT022 TFR?FR xxx8690 850.00 49246 Cansel Survey Equipment False
28 12/31/2014 HU521 TFR?FR xxx8690 950.17 49246 Cansel Survey HU521 True
Some explanations:
一些解释:
In [266]: cols_desc
Out[266]:
['cansel',
'fr',
'hi540',
'hq010',
'hu204',
'hu521',
'ir453',
'jh401',
'jj500',
'jx154',
'pay',
'rq574',
'survey',
'tfr',
'ue200',
'uo414',
'uq263',
'ut022',
'ut460',
'wu455',
'ww120',
'xxx8690']
In [267]: cols_emp
Out[267]: ['cansel', 'equipment', 'hu521', 'ir453', 'survey']
In [268]: common_cols_idx = [i for i,col in enumerate(cols_desc) if col in cols_emp]
In [269]: common_cols_idx
Out[269]: [0, 5, 6, 12]
In [270]: X.toarray()
Out[270]:
array([[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1],
[0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1],
[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1],
[0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1],
[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1],
[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1],
[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1],
[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1],
[0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1],
[0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1],
[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1],
[0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1],
[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1],
[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1],
[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1],
[0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1]], dtype=int64)
In [271]: X.toarray()[:, common_cols_idx]
Out[271]:
array([[0, 0, 0, 0],
[0, 0, 0, 0],
[1, 0, 0, 1],
[0, 0, 0, 0],
[0, 0, 0, 0],
[1, 0, 0, 1],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[1, 0, 0, 1],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[1, 0, 0, 1],
[0, 0, 1, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 1, 0, 0]], dtype=int64)
In [272]: X.toarray()[:, common_cols_idx] == 1
Out[272]:
array([[False, False, False, False],
[False, False, False, False],
[ True, False, False, True],
[False, False, False, False],
[False, False, False, False],
[ True, False, False, True],
[False, False, False, False],
[False, False, False, False],
[False, False, False, False],
[ True, False, False, True],
[False, False, False, False],
[False, False, False, False],
[False, False, False, False],
[False, False, False, False],
[False, False, False, False],
[ True, False, False, True],
[False, False, True, False],
[False, False, False, False],
[False, False, False, False],
[False, True, False, False]], dtype=bool)
In [273]: (X.toarray()[:, common_cols_idx] == 1).any(1)
Out[273]: array([False, False, True, False, False, True, False, False, False, True, False, False, False, False, False, True, True, Fals
e, False, True], dtype=bool)
回答by MaxU
Here is one solution, which splits text into lower-case sets and uses sets intersection for each row:
这是一个解决方案,它将文本拆分为小写集合并为每一行使用集合交集:
In [160]: x['Match'] = x.Description.str.lower().str.split().map(set).to_frame('desc') \
...: .apply(lambda r: (x.Employer.str.lower().str.split().map(set) & r.desc).any(),
...: axis=1)
...:
In [161]: x
Out[161]:
Date Description Amount AutoNumber Employer Match
0 3/17/2015 WW120 TFR?FR xxx8690 140.00 49246 Cansel Survey Equipment False
2 3/13/2015 JX154 TFR?FR xxx8690 150.00 49246 Cansel Survey Equipment False
5 3/6/2015 CANSEL SURVEY E PAY 1182.08 49246 Cansel Survey Equipment True
9 3/2/2015 UE200 TFR?FR xxx8690 180.00 49246 Cansel Survey Equipment False
10 2/27/2015 JH401 TFR?FR xxx8690 400.00 49246 Cansel Survey Equipment False
11 2/27/2015 CANSEL SURVEY E PAY 555.62 49246 Cansel Survey Equipment True
12 2/25/2015 HU204 TFR?FR xxx8690 200.00 49246 Cansel Survey Equipment False
13 2/23/2015 UQ263 TFR?FR xxx8690 102.00 49246 Cansel Survey Equipment False
14 2/23/2015 UT460 TFR?FR xxx8690 200.00 49246 Cansel Survey Equipment False
15 2/20/2015 CANSEL SURVEY E PAY 1222.05 49246 Cansel Survey Equipment True
17 2/17/2015 UO414 TFR?FR xxx8690 250.00 49246 Cansel Survey Equipment False
19 2/11/2015 HI540 TFR?FR xxx8690 130.00 49246 Cansel Survey Equipment False
20 2/11/2015 HQ010 TFR?FR xxx8690 177.00 49246 Cansel Survey Equipment False
21 2/10/2015 WU455 TFR?FR xxx8690 200.00 49246 Cansel Survey Equipment False
22 2/6/2015 JJ500 TFR?FR xxx8690 301.00 49246 Cansel Survey Equipment False
23 2/6/2015 CANSEL SURVEY E PAY 1182.08 49246 Cansel Survey Equipment True
24 2/5/2015 IR453 TFR?FR xxx8690 168.56 49246 Cansel Survey Equipment False
26 2/2/2015 RQ574 TFR?FR xxx8690 500.00 49246 Cansel Survey Equipment False
27 2/2/2015 UT022 TFR?FR xxx8690 850.00 49246 Cansel Survey Equipment False
28 12/31/2014 HU521 TFR?FR xxx8690 950.17 49246 Cansel Survey Equipment False
PS it's pretty slow as it's using not vectorized .apply(..., axis=1)
method
PS 它很慢,因为它使用的是非矢量化.apply(..., axis=1)
方法
回答by MaxU
Timing comparison for different solutions
不同解决方案的时序比较
Let's prepare a bit bigger DF - 2.000 rows:
让我们准备一个更大的 DF - 2.000 行:
In [3]: df = pd.concat([df] * 10**2, ignore_index=True)
In [4]: df.shape
Out[4]: (2000, 5)
Solution 1:df.apply(..., axis=1)
:
解决方案1df.apply(..., axis=1)
::
df["Text_Search"] = df.Employer.str.lower().str.split().map(set)
In [15]: %%timeit
...: df.Description.str.lower().str.split().map(set).to_frame('desc') \
...: .apply(lambda r: (df["Text_Search"] & r.desc).any(),
...: axis=1)
...:
1 loop, best of 3: 5.06 s per loop
Solution 2:CountVectorizer
解决方案2:CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(min_df=1, lowercase=True)
In [8]: %%timeit
...: X = vect.fit_transform(df['Employer'])
...: cols_emp = vect.get_feature_names()
...: X = vect.fit_transform(df['Description'])
...: cols_desc = vect.get_feature_names()
...: common_cols_idx = [i for i,col in enumerate(cols_desc) if col in cols_emp]
...: (X.toarray()[:, common_cols_idx] == 1).any(1)
...:
10 loops, best of 3: 88.2 ms per loop
Solution 3:df.apply(search_func, axis=1)
解决方案3:df.apply(search_func, axis=1)
df["Text_Search"] = df["Employer"].str.lower().str.split()
In [12]: %%timeit
...: df.apply(search_func, axis=1)
...:
1 loop, best of 3: 362 ms per loop
NOTE: Solution 1
is too slow, so i will not "timeit" this solution for bigger DFs
注意:Solution 1
太慢了,所以我不会为更大的 DF“计时”这个解决方案
Comparison of df.apply(search_func, axis=1)
and CountVectorizer
for 20.000 rows DF:
比较df.apply(search_func, axis=1)
和CountVectorizer
DF为20.000行:
In [16]: df = pd.concat([df] * 10, ignore_index=True)
In [17]: df.shape
Out[17]: (20000, 6)
In [20]: %%timeit
...: df.apply(search_func, axis=1)
...:
1 loop, best of 3: 3.66 s per loop
In [21]: %%timeit
...: X = vect.fit_transform(df['Employer'])
...: cols_emp = vect.get_feature_names()
...: X = vect.fit_transform(df['Description'])
...: cols_desc = vect.get_feature_names()
...: common_cols_idx = [i for i,col in enumerate(cols_desc) if col in cols_emp]
...: (X.toarray()[:, common_cols_idx] == 1).any(1)
...:
1 loop, best of 3: 825 ms per loop
Comparison of df.apply(search_func, axis=1)
and CountVectorizer
for 200.000 rows DF:
比较df.apply(search_func, axis=1)
和CountVectorizer
DF为200.000行:
In [22]: df = pd.concat([df] * 10, ignore_index=True)
In [23]: df.shape
Out[23]: (200000, 6)
In [24]: %%timeit
...: df.apply(search_func, axis=1)
...:
1 loop, best of 3: 36.8 s per loop
In [25]: %%timeit
...: X = vect.fit_transform(df['Employer'])
...: cols_emp = vect.get_feature_names()
...: X = vect.fit_transform(df['Description'])
...: cols_desc = vect.get_feature_names()
...: common_cols_idx = [i for i,col in enumerate(cols_desc) if col in cols_emp]
...: (X.toarray()[:, common_cols_idx] == 1).any(1)
...:
1 loop, best of 3: 8.28 s per loop
Conclusion:CountVectorized
solution is apporx. 4.44 times faster compared to df.apply(search_func, axis=1)
结论:CountVectorized
解决方案是大约。4.44 倍相比df.apply(search_func, axis=1)