Original source: http://stackoverflow.com/questions/35231285/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Python - How to split a string by non alpha characters
Asked by nickeb96
I'm trying to use Python to parse lines of C++ source code. The only thing I am interested in is include directives.
#include "header.hpp"
I want it to be flexible and still work with poor coding styles like:
# include"header.hpp"
I have gotten to the point where I can read lines and trim whitespace before and after the #. However, I still need to find out which directive it is by reading the string until a non-alpha character is encountered, regardless of whether it is a space, quote, tab or angle bracket.
So basically my question is: How can I split a string starting with alphas until a non alpha is encountered?
I think I might be able to do this with regex, but I have not found anything in the documentation that looks like what I want.
Also, if anyone has advice on how I would get the file name inside the quotes or angle brackets, that would be a plus.
Accepted answer by kfx
You can do that with a regex. However, you can also use a simple while loop.
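def splitnonalpha(s):
    # Start at index 1 so the leading '#' is kept with the directive name.
    pos = 1
    # Advance while the current character is a letter.
    while pos < len(s) and s[pos].isalpha():
        pos += 1
    # Return the leading run and the remainder of the string.
    return (s[:pos], s[pos:])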
Test:
>>> splitnonalpha('#include"blah.hpp"')
('#include', '"blah.hpp"')
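The answer mentions that a regex would also work but only shows the loop. A minimal sketch of a regex equivalent follows; the function name split_nonalpha_re is hypothetical, and the [A-Za-z] class is an assumption that only ASCII letters matter (str.isalpha() in the loop above also accepts non-ASCII letters).

```python
import re

def split_nonalpha_re(s):
    # Hypothetical helper: keep the first character (the '#'),
    # then consume as many ASCII letters as possible.
    m = re.match(r'.[A-Za-z]*', s, re.DOTALL)
    if m is None:
        # Only happens for an empty string; mirror the loop version.
        return (s, '')
    return (s[:m.end()], s[m.end():])

print(split_nonalpha_re('#include"blah.hpp"'))  # ('#include', '"blah.hpp"')
```

Like the loop, this assumes whitespace around the # has already been trimmed.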
Answer by Daniyal Syed
import re
s = 'foo bar- blah/hm.lala'
print(re.findall(r"\w+",s))
Output: ['foo', 'bar', 'blah', 'hm', 'lala']
Answer by Patrick Carroll
You can use regex. The \W token will match all non-word characters (which is about the same as non-alphanumeric). Word characters are A-Z, a-z, 0-9, and _ (plus, for str patterns in Python 3, other Unicode letters and digits unless re.ASCII is set). If you want to match underscores as well you could just do [\W_].
>>> import re
>>> line = '# include"header.hpp" '
>>> m = re.match(r'^\s*#\s*include\W+([\w\.]+)\W*$', line)
>>> m.group(1)
'header.hpp'
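To make the difference between \W and [\W_] concrete, here is a small sketch. The sample file name and the variable names are made up for illustration and do not come from the answer.

```python
import re

text = 'my_header-v2.hpp'  # hypothetical file name

# With \W the underscore counts as a word character and is kept.
word_split = re.split(r'\W+', text)
print(word_split)        # ['my_header', 'v2', 'hpp']

# With [\W_] the underscore is treated as a separator as well.
underscore_split = re.split(r'[\W_]+', text)
print(underscore_split)  # ['my', 'header', 'v2', 'hpp']
```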
Answer by nlloyd
Your instinct on using regex is correct.
import re
re.split('[^a-zA-Z]', string_to_split)
The [^a-zA-Z] part means "not alphabetic characters".
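As a rough illustration of how this pattern behaves on the asker's kind of input, consider the sketch below. The sample line and variable names are assumptions, not part of the answer. Because the class matches one character at a time, adjacent separators produce empty strings, which can be filtered out afterwards.

```python
import re

line = '# include"header.hpp"'  # sample input, assumed for illustration

# One split per separator character: adjacent separators yield empty strings.
pieces = re.split('[^a-zA-Z]', line)
print(pieces)  # ['', '', 'include', 'header', 'hpp', '']

# Dropping the empty strings leaves only the alphabetic runs.
words = [p for p in pieces if p]
print(words)   # ['include', 'header', 'hpp']
```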
Answer by Garrett R
This works:
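import re
test_str = ' # include "header.hpp"'
# Allow optional whitespace around '#' and capture the quoted file name.
match = re.match(r'\s*#\s*include\s*("[\w.]*")', test_str)
if match:
    # The captured group includes the surrounding double quotes.
    print(match.group(1))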
Answer by Garrett R
While not exact, this pattern parses most header directives:
(?m)^\h*#\h*include\h*["<](\w[\w.]*)\h*[">]
Here (?m) is multi-line mode and \h is horizontal whitespace (equivalent to [^\S\r\n]).
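Python's re module has no \h escape, so the pattern above cannot be pasted into Python unchanged. One possible adaptation, sketched below, substitutes the equivalent class [^\S\r\n] and uses re.MULTILINE in place of (?m). The names H, INCLUDE_RE and source are illustrative, and the sample source text is made up.

```python
import re

# Assumed translation of the pattern: \h becomes [^\S\r\n],
# i.e. any whitespace character except carriage return and newline.
H = r'[^\S\r\n]'
INCLUDE_RE = re.compile(
    r'^' + H + r'*#' + H + r'*include' + H + r'*["<](\w[\w.]*)' + H + r'*[">]',
    re.MULTILINE,  # let ^ match at the start of every line
)

source = '''#include <vector>
  # include"header.hpp"
int main() { return 0; }
'''

# findall returns the captured file name for each matching line.
print(INCLUDE_RE.findall(source))  # ['vector', 'header.hpp']
```

As in the original pattern, file names containing a slash (for example a path inside angle brackets) are not matched, since the capture group only allows word characters and dots.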
Answer by Denis Drescher
The two options mentioned by others that are best in my opinion are re.split and re.findall:
>>> import re
>>> re.split(r'\W+', '#include "header.hpp"')
['', 'include', 'header', 'hpp', '']
>>> re.findall(r'\w+', '#include "header.hpp"')
['include', 'header', 'hpp']
A quick benchmark:
>>> import timeit
>>> setup = r"import re; word_pattern = re.compile(r'\w+'); sep_pattern = re.compile(r'\W+')"
>>> iterations = 10**6
>>> timeit.timeit(r"re.findall(r'\w+', '#header foo bar!')", setup=setup, number=iterations)
3.000092029571533
>>> timeit.timeit("word_pattern.findall('#header foo bar!')", setup=setup, number=iterations)
1.5247418880462646
>>> timeit.timeit(r"re.split(r'\W+', '#header foo bar!')", setup=setup, number=iterations)
3.786440134048462
>>> timeit.timeit("sep_pattern.split('#header foo bar!')", setup=setup, number=iterations)
2.256173849105835
The functional difference is that re.split keeps empty tokens. That's usually not useful for tokenization purposes, but the following should be identical to the re.findall solution:
>>> list(filter(bool, re.split(r'\W+', '#include "header.hpp"')))
['include', 'header', 'hpp']
Answer by user2902302
import re
re.split('[^a-zA-Z0-9]', string_to_split)
This splits on every non-alphanumeric character.