Python: How can I split a string into tokens?
Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you reuse it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow.
Original URL: http://stackoverflow.com/questions/18312447/
How can I split a string into tokens?
Asked by Martin Thetford
If I have a string
'x+13.5*10x-4e1'
how can I split it into the following list of tokens?
['x', '+', '13', '.', '5', '*', '10', 'x', '-', '4', 'e', '1']
Currently I'm using the shlex module:
import shlex

s = 'x+13.5*10x-4e1'  # don't shadow the built-in str
lexer = shlex.shlex(s)
tokenList = []
for token in lexer:
    tokenList.append(token)  # shlex tokens are already strings
But this returns:
['x', '+', '13', '.', '5', '*', '10x', '-', '4e1']
So I'm trying to split the letters from the numbers. I'm considering taking the strings that contain both letters and numbers then somehow splitting them, but not sure about how to do this or how to add them all back into the list with the others afterwards. It's important that the tokens stay in order, and I can't have nested lists.
In an ideal world, e and E would not be recognised as letters in the same way, so
'-4e1'
would become
['-', '4e1']
but
'-4x1'
would become
['-', '4', 'x', '1']
Can anybody help?
Accepted answer by Peter Varo
Use the regular expression module's split() function to split at '\d+' -- digits (number characters) -- and '\W+' -- non-word characters:
CODE:
import re
print([i for i in re.split(r'(\d+|\W+)', 'x+13.5*10x-4e1') if i])
OUTPUT:
['x', '+', '13', '.', '5', '*', '10', 'x', '-', '4', 'e', '1']
If you don't want to separate the dot (as a floating-point number in the expression) then you should use this:
'[\d.]+' -- digit or dot characters (although this also allows you to write 13.5.5):
CODE:
print([i for i in re.split(r'([\d.]+|\W+)', 'x+13.5*10x-4e1') if i])
OUTPUT:
['x', '+', '13.5', '*', '10', 'x', '-', '4', 'e', '1']
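For the question's "ideal world" case, where '4e1' should stay together but '4x1' should split, one option (my own sketch, not part of the accepted answer) is re.findall with an alternation that tries a full numeric literal first and falls back to single characters:

```python
import re

# Try a full numeric literal first (integer, decimal, optional exponent);
# otherwise emit any single non-whitespace character as its own token.
pattern = r'\d+(?:\.\d+)?(?:[eE][+-]?\d+)?|\S'
print(re.findall(pattern, 'x+13.5*10x-4e1'))
# ['x', '+', '13.5', '*', '10', 'x', '-', '4e1']
```

With this pattern, '-4x1' still splits into ['-', '4', 'x', '1'], because 'x' cannot start an exponent.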
Answered by Tigran Saluev
Well, the problem is not quite as simple as it seems. I think a good way to get a robust (but, unfortunately, not so short) solution is to use Python Lex-Yacc (PLY) to create a full-weight tokenizer. Lex-Yacc is a common practice (not only in Python), so ready-made grammars for a simple arithmetic tokenizer may already exist (like this one), and you only have to fit them to your specific needs.
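PLY may not be installed everywhere, so here is a dependency-free sketch of the same lexer idea (the token names and patterns below are my own assumptions, not a real PLY grammar): a master regex with named groups classifies tokens in a single pass.

```python
import re

# Hypothetical token specification in the spirit of a Lex grammar.
TOKEN_SPEC = [
    ('NUMBER', r'\d+(?:\.\d+)?'),  # integer or decimal literal
    ('NAME',   r'[A-Za-z]'),       # single-letter variable name
    ('OP',     r'[+\-*/()]'),      # single-character operator
    ('SKIP',   r'\s+'),            # whitespace, discarded
]
MASTER = re.compile('|'.join(f'(?P<{name}>{pat})' for name, pat in TOKEN_SPEC))

def tokenize(text):
    """Yield (kind, value) pairs, raising on unexpected characters."""
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise SyntaxError(f'unexpected character {text[pos]!r} at {pos}')
        pos = m.end()
        if m.lastgroup != 'SKIP':
            yield (m.lastgroup, m.group())

print(list(tokenize('x+13.5*10x-4e1')))
```

Because NAME matches a single letter here, '4e1' comes out as NUMBER '4', NAME 'e', NUMBER '1', matching the question's desired split; widening NAME to a multi-character identifier pattern would change that behaviour.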
Answered by redrubia
Another alternative not suggested here is to use the nltk.tokenize module.
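nltk.tokenize.wordpunct_tokenize is a regexp-based tokenizer; per NLTK's documentation its pattern is essentially r"\w+|[^\w\s]+", so its behaviour can be previewed without installing NLTK (this equivalent is my own sketch, not the answerer's code):

```python
import re

# Equivalent of nltk.tokenize.wordpunct_tokenize: runs of word characters,
# or runs of non-word, non-space characters.
def wordpunct_like(text):
    return re.findall(r"\w+|[^\w\s]+", text)

print(wordpunct_like('x+13.5*10x-4e1'))
# ['x', '+', '13', '.', '5', '*', '10x', '-', '4e1']
```

Note that, like shlex, it keeps '10x' and '4e1' together, since \w+ matches letters and digits alike, so on its own it does not solve the letter/digit split the question asks for.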