Python: How can I split a string into tokens?
Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you reuse it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow.
Original URL: http://stackoverflow.com/questions/18312447/
How can I split a string into tokens?
Asked by Martin Thetford
If I have a string
'x+13.5*10x-4e1'
how can I split it into the following list of tokens?
['x', '+', '13', '.', '5', '*', '10', 'x', '-', '4', 'e', '1']
Currently I'm using the shlex module:
import shlex

s = 'x+13.5*10x-4e1'  # don't shadow the built-in str
lexer = shlex.shlex(s)
tokenList = []
for token in lexer:
    tokenList.append(token)  # shlex tokens are already strings
But this returns:
['x', '+', '13', '.', '5', '*', '10x', '-', '4e1']
So I'm trying to split the letters from the numbers. I'm considering taking the strings that contain both letters and numbers then somehow splitting them, but not sure about how to do this or how to add them all back into the list with the others afterwards. It's important that the tokens stay in order, and I can't have nested lists.
In an ideal world, e and E would not be recognised as letters in the same way, so
'-4e1'
would become
['-', '4e1']
but
'-4x1'
would become
['-', '4', 'x', '1']
Can anybody help?
Accepted answer by Peter Varo
Use the regular expression module's split() function to split at '\d+' -- digits (number characters) -- and '\W+' -- non-word characters:
CODE:
import re
print([i for i in re.split(r'(\d+|\W+)', 'x+13.5*10x-4e1') if i])
OUTPUT:
['x', '+', '13', '.', '5', '*', '10', 'x', '-', '4', 'e', '1']
If you don't want to separate the dot (as a floating-point number in the expression) then you should use this:
'[\d.]+' -- digit or dot characters (although this also allows you to write 13.5.5):
CODE:
print([i for i in re.split(r'([\d.]+|\W+)', 'x+13.5*10x-4e1') if i])
OUTPUT:
['x', '+', '13.5', '*', '10', 'x', '-', '4', 'e', '1']
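For the question's "ideal world" case, where '4e1' should stay together but '4x1' should split, one option (my own sketch, not part of the accepted answer) is re.findall with an alternation that tries a full numeric literal first and falls back to single characters:

```python
import re

# Try a full numeric literal first (integer, decimal, optional exponent);
# otherwise emit any single non-whitespace character as its own token.
pattern = r'\d+(?:\.\d+)?(?:[eE][+-]?\d+)?|\S'
print(re.findall(pattern, 'x+13.5*10x-4e1'))
# ['x', '+', '13.5', '*', '10', 'x', '-', '4e1']
```

With this pattern, '-4x1' still splits into ['-', '4', 'x', '1'], because 'x' cannot start an exponent.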
Answered by Tigran Saluev
Well, the problem is not quite as simple as it seems. I think a good way to get a robust (but, unfortunately, not so short) solution is to use Python Lex-Yacc (PLY) to create a full-weight tokenizer. Lex-Yacc is a common practice (not only in Python), so ready-made grammars for a simple arithmetic tokenizer may already exist (like this one), and you only have to fit them to your specific needs.
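PLY may not be installed everywhere, so here is a dependency-free sketch of the same lexer idea (the token names and patterns below are my own assumptions, not a real PLY grammar): a master regex with named groups classifies tokens in a single pass.

```python
import re

# Hypothetical token specification in the spirit of a Lex grammar.
TOKEN_SPEC = [
    ('NUMBER', r'\d+(?:\.\d+)?'),  # integer or decimal literal
    ('NAME',   r'[A-Za-z]'),       # single-letter variable name
    ('OP',     r'[+\-*/()]'),      # single-character operator
    ('SKIP',   r'\s+'),            # whitespace, discarded
]
MASTER = re.compile('|'.join(f'(?P<{name}>{pat})' for name, pat in TOKEN_SPEC))

def tokenize(text):
    """Yield (kind, value) pairs, raising on unexpected characters."""
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise SyntaxError(f'unexpected character {text[pos]!r} at {pos}')
        pos = m.end()
        if m.lastgroup != 'SKIP':
            yield (m.lastgroup, m.group())

print(list(tokenize('x+13.5*10x-4e1')))
```

Because NAME matches a single letter here, '4e1' comes out as NUMBER '4', NAME 'e', NUMBER '1', matching the question's desired split; widening NAME to a multi-character identifier pattern would change that behaviour.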
Answered by redrubia
Another alternative not suggested here is to use the nltk.tokenize module.
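nltk.tokenize.wordpunct_tokenize is a regexp-based tokenizer; per NLTK's documentation its pattern is essentially r"\w+|[^\w\s]+", so its behaviour can be previewed without installing NLTK (this equivalent is my own sketch, not the answerer's code):

```python
import re

# Equivalent of nltk.tokenize.wordpunct_tokenize: runs of word characters,
# or runs of non-word, non-space characters.
def wordpunct_like(text):
    return re.findall(r"\w+|[^\w\s]+", text)

print(wordpunct_like('x+13.5*10x-4e1'))
# ['x', '+', '13', '.', '5', '*', '10x', '-', '4e1']
```

Note that, like shlex, it keeps '10x' and '4e1' together, since \w+ matches letters and digits alike, so on its own it does not solve the letter/digit split the question asks for.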