pandas: How to get the first word from every string in a DataFrame using Python?

Question

Asked by twhale

I have a Pandas DataFrame called "data" with 2 columns and 50 rows filled with one or two lines of text each, imported from a .tsv file. Some of the questions may contain integers and floats, besides strings. I am trying to extract the first word of every sentence (in both columns), but consistently get this error: AttributeError: 'DataFrame' object has no attribute 'str'.

At first, I thought the error was due to my wrong use of "data.str.split", but every change I could find by Googling failed. Then I thought the file might not consist entirely of strings, so I tried "data.astype(str)" on the file, but the same error remained. Any suggestions? Thanks a lot!

Here is my code:

import pandas as pd
questions = "questions.tsv"
data = pd.read_csv(questions, usecols = [3], nrows = 50, header=1, sep="\t")
data = data.astype(str)
first_words = data.str.split(None, 1)[0]

Answer 1

Accepted answer by jezrael

Use:

first_words = data.apply(lambda x: x.str.split().str[0])

Or:

first_words = data.applymap(lambda x: x.split()[0])

Sample:

data = pd.DataFrame({'a':['aa ss ss','ee rre', 1, 'r'],
                   'b':[4,'rrt ee', 'ee www ee', 6]})
print (data)
          a          b
0  aa ss ss          4
1    ee rre     rrt ee
2         1  ee www ee
3         r          6

data = data.astype(str)
first_words = data.apply(lambda x: x.str.split().str[0])
print (first_words)
    a    b
0  aa    4
1  ee  rrt
2   1   ee
3   r    6

first_words = data.applymap(lambda x: x.split()[0])
print (first_words)
    a    b
0  aa    4
1  ee  rrt
2   1   ee
3   r    6
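
A note on the element-wise variant: `DataFrame.applymap` was deprecated in pandas 2.1 in favor of `DataFrame.map`, and `x.split()[0]` raises an IndexError on an empty or whitespace-only string. The following is a minimal sketch, not part of the original answer, that works around both. The helper name `first_word` and the choice to return an empty string for blank cells are assumptions made for illustration.

```python
import pandas as pd

data = pd.DataFrame({'a': ['aa ss ss', 'ee rre', 1, 'r'],
                     'b': [4, 'rrt ee', 'ee www ee', 6]})

def first_word(value):
    """Hypothetical helper: first whitespace-separated token, or '' if there is none."""
    parts = str(value).split()
    return parts[0] if parts else ''

# Prefer DataFrame.map when it exists; fall back to applymap on older pandas.
elementwise = getattr(data, 'map', None) or data.applymap
first_words = elementwise(first_word)
print(first_words)
```

Because the helper calls `str(value)` itself, the separate `data.astype(str)` step is not needed in this variant.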

Answer 2

Answered by piRSquared

The problem is that you attempted to use the pd.Series.str string accessor on a pd.DataFrame. Unfortunately, it is a pd.Series-only attribute. That means you need to use it in a pd.Series context. You can accomplish this in several ways.
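
To make the distinction concrete, here is a small sketch (the one-row frame is made up for illustration) showing that the accessor is missing on the DataFrame but present on a single column, which is a Series:

```python
import pandas as pd

df = pd.DataFrame({'Col1': ['this is a test'], 'Col2': ['hello world']})

# The DataFrame itself has no .str accessor, hence the AttributeError.
print(hasattr(df, 'str'))

# Selecting one column yields a Series, which does have it.
print(hasattr(df['Col1'], 'str'))

# So the accessor works once you are in a Series context.
first = df['Col1'].str.split().str[0]
print(first)
```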

Setup
Assume your DataFrame looks like this:

              Col1               Col2
0   this is a test        hello world
1  this is another          pandas123
2            test3       tommy trojan
3         etcetera  one more sentence

Option 1
Use stack to convert a 2-dimensional DataFrame into a Series, then use the string accessor.

#  Make a
#  Series
#  /----\    
df.stack().str.split(n=1).str[0].unstack()
#                                 \_____/
#                                 Turn it
#                                   Back

       Col1       Col2
0      this      hello
1      this  pandas123
2     test3      tommy
3  etcetera        one

Option 2
Or you can use pd.DataFrame.apply to use the pd.Series.str accessor on each column separately.
This is covered in @jezrael's answer.

df.apply(lambda x: x.str.split(n=1).str[0])

       Col1       Col2
0      this      hello
1      this  pandas123
2     test3      tommy
3  etcetera        one

Option 3
Use a dictionary comprehension

pd.DataFrame({c: df[c].str.split(n=1).str[0] for c in df})

       Col1       Col2
0      this      hello
1      this  pandas123
2     test3      tommy
3  etcetera        one

You'll notice that in all options, we used the str accessor on a pd.Series object and not on a pd.DataFrame object.
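
For anyone who wants to run the three options side by side, the sketch below rebuilds the setup frame shown above and applies each option to it. The `results` dictionary and its keys are only there for illustration. Depending on your pandas version, `stack` may emit a FutureWarning about its implementation changing.

```python
import pandas as pd

df = pd.DataFrame({
    'Col1': ['this is a test', 'this is another', 'test3', 'etcetera'],
    'Col2': ['hello world', 'pandas123', 'tommy trojan', 'one more sentence'],
})

results = {
    # Option 1: flatten to one Series, split, then restore the 2-D shape.
    'stack': df.stack().str.split(n=1).str[0].unstack(),
    # Option 2: apply the Series accessor column by column.
    'apply': df.apply(lambda col: col.str.split(n=1).str[0]),
    # Option 3: build a new frame from per-column Series.
    'comprehension': pd.DataFrame({c: df[c].str.split(n=1).str[0] for c in df}),
}

for name, res in results.items():
    print(name, res['Col1'].tolist(), res['Col2'].tolist())
```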
