
Note: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/38552688/

Time: 2020-09-14 01:39:24  Source: igfitidea

Pandas gives an error from str.extractall('#')

Tags: python, pandas

Asked by Sitz Blogz

I am trying to extract all the #keywords from the tweet text. I am using str.extractall() to pull out every keyword that starts with #. This is the first time I am filtering keywords from tweet text using pandas. The input, code, expected output and error are given below.

Input:

userID,tweetText 
01, home #sweet home
01, #happy #life 
02, #world peace
03, #all are one
04, world tour

And so on. The full data file is several GB of scraped tweets with several other columns, but I am interested in only these two columns.

Code:

import re
import pandas as pd

data = pd.read_csv('Text.csv', index_col=0, header=None, names=['userID', 'tweetText'])

fout = data['tweetText'].str.extractall('#')

print fout 

Expected Output:

userID,tweetText 
01,#sweet
01,#happy 
01,#life 
02,#world
03,#all

Error:

Traceback (most recent call last):
  File "keyword_split.py", line 7, in <module>
    fout = data['tweetText'].str.extractall('#')
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/strings.py", line 1621, in extractall
    return str_extractall(self._orig, pat, flags=flags)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/strings.py", line 694, in str_extractall
    raise ValueError("pattern contains no capture groups")
ValueError: pattern contains no capture groups

Thanks in advance for the help. What is the simplest way to filter keywords with respect to userID?

Output Update:

When I use only the following, the output is as shown above: s.name = "tweetText" followed by data_1 = data[~data['tweetText'].isnull()]

In this case the output contains empty [] entries with the userID still listed, and the rows that do have keywords hold an array of keywords rather than being in list form.

When I use only the following, the output is what I need, but with NaN:

s.name = "tweetText"
data_2 = data_1.drop('tweetText', axis=1).join(s)

The output here is in the correct format, but the rows with no keywords are still included and show NaN.

If possible, such userIDs should be skipped and not shown in the output at all. In the next stages I am trying to calculate the frequency of keywords, in which the NaN or empty [] would also be counted, and that frequency may compromise the later classification.


Accepted answer by Abdou

If you are not too tied to using extractall, you can try the following to get your final output:

from io import StringIO
import pandas as pd
import re


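data_text = """userID,tweetText
01, home #sweet home
01, #happy #life 
02, #world peace
03, #all are one
04, world tour
"""

data = pd.read_csv(StringIO(data_text), header=0)

# Collect every hashtag in each tweet as a list (empty list when there are none)
data['tweetText'] = data.tweetText.apply(lambda x: re.findall(r'#(?=\w+)\w+', x))
# Spread each list over its own rows, keeping the original row index
s = data.apply(lambda x: pd.Series(x['tweetText']), axis=1).stack().reset_index(level=1, drop=True)
s.name = "tweetText"
# Join the hashtags back onto userID; tweets without hashtags end up as NaN
data = data.drop('tweetText', axis=1).join(s)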

     userID tweetText
0       1    #sweet
1       1    #happy
1       1     #life
2       2    #world
3       3      #all
4       4       NaN

You can drop the rows where the tweetText column is NaN by doing the following:

data = data[~data['tweetText'].isnull()]

This should return:

   userID tweetText
0       1    #sweet
1       1    #happy
1       1     #life
2       2    #world
3       3      #all

I hope this helps.

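As a possible alternative to the apply/stack/join steps above, here is a minimal sketch that assumes a newer pandas (0.25 or later, where DataFrame.explode exists). The small DataFrame below is illustrative data modeled on the question's input, and the variable names exploded and result are my own, not from the original answer.

```python
import pandas as pd

# Illustrative data modeled on the question's input
data = pd.DataFrame({
    "userID": [1, 1, 2, 3, 4],
    "tweetText": [" home #sweet home", " #happy #life", " #world peace",
                  " #all are one", " world tour"],
})

# str.findall returns a list of hashtags per tweet (empty list when there are none)
data["tweetText"] = data["tweetText"].str.findall(r"#\w+")

# explode gives one row per list element; an empty list becomes a single NaN row
exploded = data.explode("tweetText")

# dropna removes the rows for tweets that had no hashtags
result = exploded.dropna(subset=["tweetText"])
print(result)
```

The original row index is kept, so userID 1 appears on three rows and userID 4 is gone from the result.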

Answer by Guillaume Ottavianoni

Put parentheses around the pattern so that it contains a capture group:

fout = data['tweetText'].str.extractall('(#)')

instead of

fout = data['tweetText'].str.extractall('#')

Hope that will work

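A small sketch of what the suggested pattern captures, assuming a recent pandas release. The sample Series is made up for illustration. The pattern '(#)' avoids the ValueError, but the group only contains the '#' character itself; widening the group to '(#\w+)' is one way to capture the whole hashtag.

```python
import pandas as pd

# Illustrative tweets; the last one has no hashtag
tweets = pd.Series([" home #sweet home", " #happy #life ", " world tour"])

# '(#)' has a capture group, so no ValueError, but it captures only '#'
only_hash = tweets.str.extractall(r"(#)")

# '(#\w+)' captures '#' together with the word characters that follow it
full_tags = tweets.str.extractall(r"(#\w+)")

print(only_hash[0].tolist())   # ['#', '#', '#']
print(full_tags[0].tolist())   # ['#sweet', '#happy', '#life']
```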

Answer by ???S???

The extractall function requires a regex pattern with capturing groups as its first argument, and you have provided just #, which contains none.

A possible argument could be (#\S+). The parentheses indicate a capture group, in other words what the extractall function needs to extract from each string.

Example:

data="""01, home #sweet home
01, #happy #life 
02, #world peace
03, #all are one
"""

import pandas as pd
from io import StringIO

df = pd.read_csv(StringIO(data), 
                 header=None, 
                 names=['col1', 'col2'],
                 index_col=0)

df['col2'].str.extractall(r'(#\S+)')

With the above code, the error ValueError: pattern contains no capture groups no longer appears (meaning the issue in the question is solved), but this hits a bug in the current version of pandas (I'm using '0.18.1').

The error returned is:

AssertionError: 1 columns passed, passed data had 6 columns

The issue is described here.

If you try df['col2'].str.extractall('#(\S)') (which gives you the first letter of every hashtag), you'll see that the extractall function works as long as the captured group contains only a single character (which matches the issue description). As the issue is closed, it should be fixed in an upcoming pandas release.

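For reference, a minimal sketch of how the capture-group approach might look end to end, assuming a pandas release newer than 0.18.1 in which the bug mentioned above no longer occurs. The names csv_text and tags are illustrative, and \w+ is used in place of \S+ so that trailing punctuation is not captured (a choice of mine, not the answer's).

```python
from io import StringIO
import pandas as pd

# Illustrative data modeled on the question's input
csv_text = """userID,tweetText
01, home #sweet home
01, #happy #life
02, #world peace
03, #all are one
04, world tour
"""

# Use userID as the index so it is carried through extractall as an index level
data = pd.read_csv(StringIO(csv_text), index_col="userID")

tags = (
    data["tweetText"]
    .str.extractall(r"(#\w+)")              # one row per match; tweets without a match are skipped
    .rename(columns={0: "tweetText"})       # the unnamed group becomes column 0
    .reset_index(level="match", drop=True)  # drop the per-tweet match counter
    .reset_index()                          # turn userID back into a column
)
print(tags)
```

Because extractall only emits rows for matches, userID 4 does not appear and no NaN filtering step is needed.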

Answer by Merlin

Try this:

Since this keeps only the words that start with '#', the NaN rows should not appear.

    data = pd.read_csv(StringIO(data_text),header=0, index_col=0 )
    data = data["tweetText"].str.split(' ', expand=True).stack().reset_index().rename(columns = {0:"tweetText"}).drop('level_1', axis=1)
    data = data[data['tweetText'].str[0] == "#"].reset_index(drop=True) 


     userID tweetText
0       1    #sweet
1       1    #happy
2       1     #life
3       2    #world
4       3      #all

@Abdou method:

def try1():
    data = pd.read_csv(StringIO(data_text),header=0)
    data['tweetText'] = data.tweetText.apply(lambda x: re.findall(r'#(?=\w+)\w+',x))
    s = data.apply(lambda x: pd.Series(x['tweetText']),axis=1).stack().reset_index(level=1, drop=True)
    s.name = "tweetText"
    data = data.drop('tweetText', axis=1).join(s)
    data = data[~data['tweetText'].isnull()]

%timeit try1()
 100 loops, best of 3: 7.71 ms per loop

@Merlin method

def try2():
    data = pd.read_csv(StringIO(data_text),header=0, index_col=0 )
    data = data["tweetText"].str.split(' ', expand=True).stack().reset_index().rename(columns = {'level_0':'userID',0:"tweetText"}).drop('level_1', axis=1)
    data = data[data['tweetText'].str[0] == "#"].reset_index(drop=True)

%timeit try2()
100 loops, best of 3: 5.36 ms per loop
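
The %timeit lines above only work inside IPython. As a rough sketch of how a similar comparison could be run in a plain script, the standard-library timeit module can be used. The function names split_and_filter and extract_all are illustrative, the extractall variant assumes a recent pandas, and the timings will differ by machine and pandas version, so none are shown here.

```python
import timeit
from io import StringIO
import pandas as pd

# Illustrative data modeled on the question's input
data_text = """userID,tweetText
01, home #sweet home
01, #happy #life
02, #world peace
03, #all are one
04, world tour
"""

def split_and_filter():
    # Split on spaces, stack the words into rows, keep those starting with '#'
    data = pd.read_csv(StringIO(data_text), header=0, index_col=0)
    words = data["tweetText"].str.split(" ", expand=True).stack().reset_index()
    words = words.rename(columns={0: "tweetText"}).drop("level_1", axis=1)
    return words[words["tweetText"].str[0] == "#"].reset_index(drop=True)

def extract_all():
    # Capture each hashtag with a regex group, one row per match
    data = pd.read_csv(StringIO(data_text), header=0, index_col=0)
    tags = data["tweetText"].str.extractall(r"(#\w+)")
    tags = tags.rename(columns={0: "tweetText"})
    return tags.reset_index(level=1, drop=True).reset_index()

for fn in (split_and_filter, extract_all):
    seconds = timeit.timeit(fn, number=100)
    print(f"{fn.__name__}: {seconds / 100 * 1e3:.2f} ms per call")
```

Both functions re-read the CSV on every call, as the functions above do, so the reported time includes parsing as well as the hashtag extraction.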