pandas: How can I get the first word from each string in my DataFrame using Python?
Note: this content comes from a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow
Original URL: http://stackoverflow.com/questions/46231797/
Asked by twhale
I have a Pandas DataFrame called "data" with 2 columns and 50 rows filled with one or two lines of text each, imported from a .tsv file. Some of the questions may contain integers and floats, besides strings. I am trying to extract the first word of every sentence (in both columns), but consistently get this error: AttributeError: 'DataFrame' object has no attribute 'str'.
At first, I thought the error was due to my incorrect use of "data.str.split", but every change I found by Googling failed. Then I thought the file might not consist entirely of strings, so I tried "data.astype(str)" on it, but the same error remained. Any suggestions? Thanks a lot!
Here is my code:
import pandas as pd
questions = "questions.tsv"
data = pd.read_csv(questions, usecols = [3], nrows = 50, header=1, sep="\t")
data = data.astype(str)
first_words = data.str.split(None, 1)[0]
Accepted answer by jezrael
Use:
first_words = data.apply(lambda x: x.str.split().str[0])
Or:
first_words = data.applymap(lambda x: x.split()[0])
Sample:
data = pd.DataFrame({'a': ['aa ss ss', 'ee rre', 1, 'r'],
                     'b': [4, 'rrt ee', 'ee www ee', 6]})
print(data)
          a          b
0  aa ss ss          4
1    ee rre     rrt ee
2         1  ee www ee
3         r          6
data = data.astype(str)
first_words = data.apply(lambda x: x.str.split().str[0])
print(first_words)
    a    b
0  aa    4
1  ee  rrt
2   1   ee
3   r    6
first_words = data.applymap(lambda x: x.split()[0])
print(first_words)
    a    b
0  aa    4
1  ee  rrt
2   1   ee
3   r    6
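A side note on the applymap variant: in pandas 2.1 and later, DataFrame.applymap is deprecated in favor of DataFrame.map. The following is a minimal sketch, assuming the same sample data as above, that uses whichever method the installed pandas provides. The helper name first_word is hypothetical and not part of the original answer; it also converts each cell to a string itself, so the separate astype(str) step is not needed here.

```python
import pandas as pd

data = pd.DataFrame({'a': ['aa ss ss', 'ee rre', 1, 'r'],
                     'b': [4, 'rrt ee', 'ee www ee', 6]})


def first_word(value):
    # Hypothetical helper: convert any cell to str, then take the first
    # whitespace-separated token (empty string if the cell is blank).
    parts = str(value).split()
    return parts[0] if parts else ''


# DataFrame.map exists in pandas >= 2.1; fall back to applymap on older versions.
elementwise = data.map if hasattr(pd.DataFrame, 'map') else data.applymap
first_words = elementwise(first_word)
print(first_words)
```

Guarding against empty cells avoids the IndexError that x.split()[0] would raise on a blank string.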
Answer by piRSquared
The problem is that you attempted to use the pd.Series.str string accessor on a pd.DataFrame. Unfortunately, it is a pd.Series-only attribute. That means you need to use it in a pd.Series context. You can accomplish this in several ways.
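The difference can be checked directly. This is a small sketch on a made-up two-row frame (the data and the variable names are assumptions for illustration): accessing .str on the DataFrame raises AttributeError, while the same accessor works on a single column, which is a Series.

```python
import pandas as pd

df = pd.DataFrame({'Col1': ['this is a test', 'test3'],
                   'Col2': ['hello world', 'tommy trojan']})

# A DataFrame has no .str accessor, so this lookup raises AttributeError.
try:
    df.str
    dataframe_has_str = True
except AttributeError:
    dataframe_has_str = False

# A single column is a Series, where .str is available.
first_col1 = df['Col1'].str.split().str[0]

print(dataframe_has_str)
print(first_col1.tolist())
```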
Setup
Assume your dataframe looks like this:
              Col1               Col2
0   this is a test        hello world
1  this is another          pandas123
2            test3       tommy trojan
3         etcetera  one more sentence
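The answer does not show how this frame is built. One way to construct it for experimenting, assuming the values shown above, is:

```python
import pandas as pd

# Rebuild the example frame from the values listed in the Setup section.
df = pd.DataFrame({
    'Col1': ['this is a test', 'this is another', 'test3', 'etcetera'],
    'Col2': ['hello world', 'pandas123', 'tommy trojan', 'one more sentence'],
})
print(df)
```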
Option 1
Use stack to convert the 2-dimensional dataframe into a series, then use the string accessor.
#  Make a
#  Series
#  /----\
df.stack().str.split(n=1).str[0].unstack()
#                                \_____/
#                                Turn it
#                                  Back
       Col1       Col2
0      this      hello
1      this  pandas123
2     test3      tommy
3  etcetera        one
Option 2
Or you can use pd.DataFrame.apply to use the pd.Series.str accessor on each column separately. This is covered in @jezrael's answer.
df.apply(lambda x: x.str.split(n=1).str[0])
       Col1       Col2
0      this      hello
1      this  pandas123
2     test3      tommy
3  etcetera        one
Option 3
Use a comprehension
pd.DataFrame({c: df[c].str.split(n=1).str[0] for c in df})
       Col1       Col2
0      this      hello
1      this  pandas123
2     test3      tommy
3  etcetera        one
You'll notice that in all options, we used the str accessor on a pd.Series object and not a pd.DataFrame object.
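As a quick sanity check, the three options can be run side by side on the example frame. This is a sketch under the assumption that the frame holds only strings, as in the Setup section; the variable names stacked, applied, and comprehension are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Col1': ['this is a test', 'this is another', 'test3', 'etcetera'],
    'Col2': ['hello world', 'pandas123', 'tommy trojan', 'one more sentence'],
})

# Option 1: stack into one Series, split, then unstack back into a frame.
stacked = df.stack().str.split(n=1).str[0].unstack()

# Option 2: apply the Series string accessor column by column.
applied = df.apply(lambda x: x.str.split(n=1).str[0])

# Option 3: build a new frame from a dict comprehension over the columns.
comprehension = pd.DataFrame({c: df[c].str.split(n=1).str[0] for c in df})

# Compare the plain Python values of the three results.
same = (stacked.to_dict('list')
        == applied.to_dict('list')
        == comprehension.to_dict('list'))
print(same)
```

On data that mixes in integers or floats, as in the original question, converting with astype(str) first keeps the string accessor from returning missing values for the non-string cells.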