Original question: http://stackoverflow.com/questions/42290076/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
How to findall() a sequence of regular expressions to a pandas dataframe?
Asked by J.Do
I am extracting some patterns with the pandas findall function. However, I have several regular expressions. How can I apply findall with N regular expressions in pandas?
For example, let's say that I would like to extract all the numbers and all the dates inside a specific column:
In:
import pandas as pd
dfs = pd.DataFrame(data={'c1': ['This dataset 11/12/98 contains 5,000 rows, which were sampled from a 500,000 11/12/12 row dataset spanning the same time period. Throughout these analyses',
'the number of events you count will be about 100 times smaller than they 11/12/78 actually were, but the 01/12/11 proportions of events will still generally be reflective that larger dataset. In this case, a sample is fine because our purpose is to learn methods of data analysis with Python, not to create 100% accurate recommendations to Watsi.']})
dfs
Out:
                                                  c1
0  This dataset 11/12/98 contains 5,000 rows, whi...
1  the number of events you count will be about 1...
I tried the following, but I am getting this error:
In:
dfs['patterns'] = dfs['c1'].str.findall([r'\d+',r'(\d+/\d+/\d+)']).apply(', '.join)
dfs
Out:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-64-af2969e06a61> in <module>()
----> 1 dfs['patterns'] = dfs['c1'].str.findall([r'\d+',r'(\d+/\d+/\d+)']).apply(', '.join)
2 dfs
/usr/local/lib/python3.5/site-packages/pandas/core/strings.py in wrapper2(self, pat, flags, **kwargs)
1268
1269 def wrapper2(self, pat, flags=0, **kwargs):
-> 1270 result = f(self._data, pat, flags=flags, **kwargs)
1271 return self._wrap_result(result)
1272
/usr/local/lib/python3.5/site-packages/pandas/core/strings.py in str_findall(arr, pat, flags)
827 extractall : returns DataFrame with one column per capture group
828 """
--> 829 regex = re.compile(pat, flags=flags)
830 return _na_map(regex.findall, arr)
831
/usr/local/Cellar/python3/3.5.2_2/Frameworks/Python.framework/Versions/3.5/lib/python3.5/re.py in compile(pattern, flags)
222 def compile(pattern, flags=0):
223 "Compile a regular expression pattern, returning a pattern object."
--> 224 return _compile(pattern, flags)
225
226 def purge():
/usr/local/Cellar/python3/3.5.2_2/Frameworks/Python.framework/Versions/3.5/lib/python3.5/re.py in _compile(pattern, flags)
279 # internal: compile pattern
280 try:
--> 281 p, loc = _cache[type(pattern), pattern, flags]
282 if loc is None or loc == _locale.setlocale(_locale.LC_CTYPE):
283 return p
TypeError: unhashable type: 'list'
So how can I "stack", "nest", or otherwise apply several regular expressions with the findall function? What I expect as output is the matches of each regular expression, separated by commas, in a single column:
col
0 '11/12/98', '5', '000', '500', '000', '11/12/12'
1 '100', '11/12/78', '01/12/11', '100'
UPDATE
I also tried:
dfs['patterns'] = dfs['c1'].str.map(findall(),[r'\d+',r'(\d+/\d+/\d+)']).apply(', '.join)
dfs
Accepted answer by su79eu7k
Your desired output is still not clear, but please check the code below.
dfs['patterns'] = dfs['c1'].str.findall(r'\d+\/\d+\/\d+|\d+')
print(dfs['patterns'].sum())
['11/12/98', '5', '000', '500', '000', '11/12/12', '100', '11/12/78', '01/12/11', '100']
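As a rough sketch of how the same idea might generalize to N patterns: findall takes a single pattern, so a list of regexes can be joined into one alternation first. The helper name combine_patterns and the shortened sample text are my own for illustration, not part of pandas. Note that the more specific date pattern has to come before the bare \d+, because alternation takes the first branch that matches, and that the patterns are written without capturing groups, since findall returns the group contents instead of the whole match when groups are present.

```python
import re

def combine_patterns(patterns):
    """Join several regexes into one alternation, tried left to right."""
    # Wrap each pattern in a non-capturing group so a '|' inside one
    # pattern cannot bleed into its neighbours.
    return '|'.join('(?:{})'.format(p) for p in patterns)

# Hypothetical pattern list: dates first, then plain runs of digits.
patterns = [r'\d+/\d+/\d+', r'\d+']
combined = combine_patterns(patterns)

# Shortened version of the first row from the question.
text = 'This dataset 11/12/98 contains 5,000 rows from a 500,000 11/12/12 row dataset'
matches = re.findall(combined, text)
print(', '.join(matches))

# With the dataframe from the question, the combined pattern could be used as:
# dfs['patterns'] = dfs['c1'].str.findall(combined).apply(', '.join)
```

The commented last line joins each row's matches into a single comma-separated string, which is closer to the one-column output described in the question than the flattened list above.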