pandas: How can I get the first word from each string in a DataFrame using Python?

Notice: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Stack Overflow original: http://stackoverflow.com/questions/46231797/

Date: 2020-09-14 04:27:51  Source: igfitidea

How can I get the first word from each string in my Dataframe using Python?

pandas

Asked by twhale

I have a Pandas DataFrame called "data" with 2 columns and 50 rows filled with one or two lines of text each, imported from a .tsv file. Some of the questions may contain integers and floats, besides strings. I am trying to extract the first word of every sentence (in both columns), but consistently get this error: AttributeError: 'DataFrame' object has no attribute 'str'.

At first, I thought the error was due to my incorrect use of "data.str.split", but every change I found by searching Google failed. Then I thought the file might not consist entirely of strings, so I tried "data.astype(str)" on it, but the same error remained. Any suggestions? Thanks a lot!

Here is my code:

import pandas as pd
questions = "questions.tsv"
data = pd.read_csv(questions, usecols = [3], nrows = 50, header=1, sep="\t")
data = data.astype(str)
first_words = data.str.split(None, 1)[0]

Accepted answer by jezrael

Use:

first_words = data.apply(lambda x: x.str.split().str[0])

Or:

first_words = data.applymap(lambda x: x.split()[0])

Sample:

data = pd.DataFrame({'a':['aa ss ss','ee rre', 1, 'r'],
                   'b':[4,'rrt ee', 'ee www ee', 6]})
print (data)
          a          b
0  aa ss ss          4
1    ee rre     rrt ee
2         1  ee www ee
3         r          6

data = data.astype(str)
first_words = data.apply(lambda x: x.str.split().str[0])
print (first_words)
    a    b
0  aa    4
1  ee  rrt
2   1   ee
3   r    6


first_words = data.applymap(lambda x: x.split()[0])
print (first_words)
    a    b
0  aa    4
1  ee  rrt
2   1   ee
3   r    6
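
The two approaches above can be checked side by side. This is a minimal sketch, not part of the original answer: the variable names by_column, by_cell and elementwise are mine, and the hasattr fallback is an assumption to cover recent pandas releases (2.1 and later, to my knowledge), which deprecate applymap in favour of DataFrame.map.

```python
import pandas as pd

# Same sample data as above, cast to str so every cell supports .split().
data = pd.DataFrame({"a": ["aa ss ss", "ee rre", 1, "r"],
                     "b": [4, "rrt ee", "ee www ee", 6]}).astype(str)

# Column-wise: apply hands each column to the lambda as a Series,
# so the .str accessor is available.
by_column = data.apply(lambda col: col.str.split().str[0])

# Element-wise: use DataFrame.map where it exists, else the older applymap.
elementwise = data.map if hasattr(data, "map") else data.applymap
by_cell = elementwise(lambda s: s.split()[0])

print(by_column)
print(by_cell)
```

The column-wise form uses pandas' string methods on whole columns, while the element-wise form calls plain Python str.split on every cell.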

Answer by piRSquared

The problem is that you attempted to use the pd.Series.str string accessor on a pd.DataFrame. Unfortunately, it is a pd.Series-only attribute. That means you need to use it in a pd.Series context. You can accomplish this in several ways.
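
To make the distinction concrete, here is a small sketch (the two-row frame and the variable names are made up for illustration) showing that the accessor is missing on the frame but present on one of its columns:

```python
import pandas as pd

# Hypothetical two-column frame of strings.
df = pd.DataFrame({"Col1": ["this is a test", "test3"],
                   "Col2": ["hello world", "pandas123"]})

# A DataFrame has no .str attribute; a Series of strings does.
has_str_on_frame = hasattr(df, "str")
has_str_on_series = hasattr(df["Col1"], "str")

# Selecting a single column gives a Series, so the accessor works there.
first_col1 = df["Col1"].str.split().str[0]

print(has_str_on_frame, has_str_on_series)
print(first_col1.tolist())
```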

Setup
Assume your dataframe looks like this:

              Col1               Col2
0   this is a test        hello world
1  this is another          pandas123
2            test3       tommy trojan
3         etcetera  one more sentence
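
The answer does not show how df was built. One way to construct an equivalent frame, assuming plain string columns and the default integer index, would be:

```python
import pandas as pd

# Rebuild the setup frame from the table above.
df = pd.DataFrame({
    "Col1": ["this is a test", "this is another", "test3", "etcetera"],
    "Col2": ["hello world", "pandas123", "tommy trojan", "one more sentence"],
})

print(df)
```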


Option 1
Use stack to convert a 2-dimensional dataframe into a series, then use the string accessor.

#  Make a
#  Series
#  /----\    
df.stack().str.split(n=1).str[0].unstack()
#                                 \_____/
#                                 Turn it
#                                   Back

       Col1       Col2
0      this      hello
1      this  pandas123
2     test3      tommy
3  etcetera        one


Option 2
Or you can use pd.DataFrame.apply to use the pd.Series.str accessor on each column separately.
This is covered in @jezrael's answer.

df.apply(lambda x: x.str.split(n=1).str[0])

       Col1       Col2
0      this      hello
1      this  pandas123
2     test3      tommy
3  etcetera        one


Option 3
Use a comprehension

pd.DataFrame({c: df[c].str.split(n=1).str[0] for c in df})

       Col1       Col2
0      this      hello
1      this  pandas123
2     test3      tommy
3  etcetera        one


You'll notice that in all options, we used the str accessor on a pd.Series object and not a pd.DataFrame object.
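
As a quick check that the three options agree, the sketch below runs them on the setup frame and compares the results. The helper first_word and the names opt1 to opt3 are my own, introduced only for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Col1": ["this is a test", "this is another", "test3", "etcetera"],
    "Col2": ["hello world", "pandas123", "tommy trojan", "one more sentence"],
})

def first_word(series):
    # Split each string once on whitespace and keep the first piece.
    return series.str.split(n=1).str[0]

opt1 = first_word(df.stack()).unstack()                   # Option 1: stack / unstack
opt2 = df.apply(first_word)                               # Option 2: apply per column
opt3 = pd.DataFrame({c: first_word(df[c]) for c in df})   # Option 3: comprehension

# Compare the cell values column by column.
same = opt1.to_dict("list") == opt2.to_dict("list") == opt3.to_dict("list")
print(same)
```

In every case first_word receives a Series, which is why the str accessor is available.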
