Original question: http://stackoverflow.com/questions/18914629/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Split string into sentences in javascript
Asked by Tobias Golbs
Currently I am working on an application that splits a long column into short ones. For that I split the entire text into words, but at the moment my regex splits numbers too.
What I do is this:
str = "This is a long string with some numbers [125.000,55 and 140.000] and an end. This is another sentence.";
sentences = str.replace(/\.+/g,'.|').replace(/\?/g,'?|').replace(/\!/g,'!|').split("|");
The result is:
Array [
"This is a long string with some numbers [125.",
"000,55 and 140.",
"000] and an end.",
" This is another sentence."
]
The desired result would be:
Array [
"This is a long string with some numbers [125.000, 140.000] and an end.",
"This is another sentence"
]
How do I have to change my regex to achieve this? Do I need to watch out for some problems I could run into? Or would it be good enough to search for ". ", "? " and "! "?
Answered by Roger Poon
str.replace(/([.?!])\s*(?=[A-Z])/g, "$1|").split("|")
Output:
[ 'This is a long string with some numbers [125.000,55 and 140.000] and an end.',
'This is another sentence.' ]
Breakdown:
([.?!]) = Capture either . or ? or !
\s* = Match 0 or more whitespace characters following the previous token ([.?!]). This accounts for spaces following a punctuation mark, which matches English grammar.
(?=[A-Z]) = The previous tokens only match if the next character is within the range A-Z (capital A to capital Z). Most English sentences start with a capital letter. None of the previous regexes take this into account.
The replace operation uses:
"$1|"
We used one "capturing group" ([.?!]) and we capture one of those characters, and replace it with $1 (the match) plus |. So if we captured ? then the replacement would be ?|.
Finally, we split on the pipes | and get our result.
So, essentially, what we are saying is this:
1) Find punctuation marks (one of . or ? or !) and capture them.
2) Punctuation marks can optionally be followed by spaces.
3) After a punctuation mark, I expect a capital letter.
Unlike the previous regular expressions provided, this would properly match English grammar.
From there:
4) We replace the captured punctuation marks by appending a pipe |
5) We split on the pipes to create an array of sentences.
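The steps above can be put together in a small function. This is only a minimal sketch: the name splitSentences is hypothetical, and it assumes the text contains no literal | character, since the pipe serves as the temporary delimiter.

```javascript
// Hypothetical helper illustrating the steps above.
// Assumption: the input never contains a literal "|".
function splitSentences(text) {
  return text
    // Keep the captured punctuation ($1), drop the whitespace after it,
    // and mark the sentence boundary with a pipe.
    .replace(/([.?!])\s*(?=[A-Z])/g, "$1|")
    .split("|");
}

const questionSample =
  "This is a long string with some numbers [125.000,55 and 140.000] and an end. This is another sentence.";
console.log(splitSentences(questionSample));
```

Because the lookahead demands a capital letter, the dots inside 125.000 and 140.000 are left alone, while a sentence that starts with a digit or a lowercase letter would not be split off.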
Answered by Antonín Slejška
str.replace(/(\.+|\:|\!|\?)(\"*|\'*|\)*|}*|]*)(\s|\n|\r|\r\n)/gm, "$1$2|").split("|")
The RegExp (see on Debuggex):
- (\.+|\:|\!|\?) = The sentence can end not only with ".", "!" or "?", but also with "..." or ":"
- (\"*|\'*|\)*|}*|]*) = The sentence can be surrounded by quotation marks or parentheses
- (\s|\n|\r|\r\n) = A sentence has to be followed by a space or an end of line
- g = global
- m = multiline
Remarks:
- If you use (?=[A-Z]), the RegExp will not work correctly in some languages. E.g. "ü" or "á" will not be recognised.
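One possible way around the limitation in the remark, on engines that support Unicode property escapes (added in ES2018), is to swap [A-Z] for \p{Lu} ("uppercase letter") and add the u flag. The sketch below applies that idea to the capital-letter lookahead from the other answer, not to the regex in this answer; the name splitSentencesUnicode is hypothetical, and the text is again assumed to contain no literal |.

```javascript
// Hypothetical helper: same idea as the (?=[A-Z]) lookahead, but any
// Unicode uppercase letter counts as a sentence start.
// Assumption: the runtime supports \p{...} with the "u" flag (ES2018).
function splitSentencesUnicode(text) {
  return text
    .replace(/([.?!])\s*(?=\p{Lu})/gu, "$1|")
    .split("|");
}

console.log(splitSentencesUnicode("Das ist ein Satz. Über den Wolken. Árbol grande."));
```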
Answered by tessi
You could exploit the fact that the next sentence begins with an uppercase letter or a number.
.*?(?:\.|!|\?)(?:(?= [A-Z0-9])|$)
It splits this text
This is a long string with some numbers [125.000,55 and 140.000] and an end. This is another sentence. Sencenes beginning with numbers work. 10 people like that.
into the sentences:
This is a long string with some numbers [125.000,55 and 140.000] and an end.
This is another sentence.
Sencenes beginning with numbers work.
10 people like that.
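The answer gives only the pattern, so here is one way it might be applied, as a sketch: it assumes the pattern is used with the g flag and String.prototype.match, and the name matchSentences is hypothetical. Every match after the first begins with the space that separated the sentences, so each one is trimmed.

```javascript
// Hypothetical helper: collect sentences by matching instead of splitting.
function matchSentences(text) {
  const pattern = /.*?(?:\.|!|\?)(?:(?= [A-Z0-9])|$)/g;
  // match() returns null when nothing matches, so fall back to an empty array.
  const found = text.match(pattern) || [];
  // Matches after the first start with the separating space; trim it off.
  return found.map((sentence) => sentence.trim());
}

console.log(matchSentences(
  "This is a long string with some numbers [125.000,55 and 140.000] and an end. This is another sentence. 10 people like that."
));
```

With this approach, trailing text that does not end in ., ! or ? is not matched and is therefore dropped from the result.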
Answered by anubhava
Use a lookahead so that a dot is replaced only when it is followed by whitespace and a word character:
sentences = str.replace(/\.(?=\s+\w)/g,'.|').replace(/\?/g,'?|').replace(/\!/g,'!|').split("|");
OUTPUT:
["This is a long string with some numbers [125.000,55 and 140.000] and an end.", " This is another sentence."]
Answered by Tibos
You're safer using a lookahead to make sure that what follows the dot is not a digit.
var str ="This is a long string with some numbers [125.000,55 and 140.000] and an end. This is another sentence."
var sentences = str.replace(/\.(?!\d)/g,'.|');
console.log(sentences);
If you want to be even safer you could also check whether the character before the dot is a digit, but since older JS engines don't support lookbehind, you need to capture the previous character and use it in the replace string.
var str ="This is another sentence.1 is a good number"
var sentences = str.replace(/\.(?!\d)|([^\d])\.(?=\d)/g,'$1.|');
console.log(sentences);
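On engines that implement the lookbehind assertions added in ES2018, the check on the preceding character could be written directly instead of through a capture group. A minimal sketch under that assumption; the name splitOutsideNumbers is hypothetical, and the text is assumed to contain no literal |.

```javascript
// Hypothetical helper. Assumption: the runtime supports lookbehind (ES2018).
// A dot counts as a sentence end unless it sits between two digits.
function splitOutsideNumbers(text) {
  return text
    .replace(/\.(?!\d)|(?<!\d)\.(?=\d)/g, ".|")
    .split("|")
    .map((sentence) => sentence.trim())
    // The final dot leaves an empty trailing piece; drop it.
    .filter((sentence) => sentence.length > 0);
}

console.log(splitOutsideNumbers("This is another sentence.1 is a good number"));
```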
An even simpler solution is to escape the dots inside numbers (replace them with $$$$ for example), do the split, and afterwards unescape the dots.
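A rough sketch of that escape, split, unescape idea follows. It uses a NUL character as the placeholder rather than $$$$, because in a JavaScript replacement string each $$ is turned into a single $. The placeholder choice, the name protectAndSplit, and the assumption that the input contains neither the placeholder nor a literal | are all mine, not the answer's.

```javascript
// Hypothetical placeholder; assumed never to occur in the input.
const PLACEHOLDER = "\u0000";

function protectAndSplit(text) {
  // 1) Escape dots that sit between two digits.
  const escaped = text.replace(/(\d)\.(?=\d)/g, "$1" + PLACEHOLDER);
  return escaped
    // 2) Mark every remaining sentence end and split on the marker.
    .replace(/([.?!])\s*/g, "$1|")
    .split("|")
    .filter((sentence) => sentence.length > 0)
    // 3) Unescape: put the protected dots back.
    .map((sentence) => sentence.split(PLACEHOLDER).join("."));
}

console.log(protectAndSplit(
  "This is a long string with some numbers [125.000,55 and 140.000] and an end. This is another sentence."
));
```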
Answered by yilmazburk
You forgot to put '\s' in your regexp.
Try this one:
var str = "This is a long string with some numbers [125.000,55 and 140.000] and an end. This is another sentence.";
var sentences = str.replace(/\.\s+/g,'.|').replace(/\?\s/g,'?|').replace(/\!\s/g,'!|').split("|");
console.log(sentences[0]);
console.log(sentences[1]);
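The three chained replace calls could probably be folded into one by capturing the punctuation mark. A sketch only; the name splitOnPunctuationAndSpace is hypothetical, and the text is assumed to contain no literal |.

```javascript
// Hypothetical helper: one pass instead of three replace() calls.
// A boundary is any of . ? ! followed by at least one whitespace character.
function splitOnPunctuationAndSpace(text) {
  return text.replace(/([.?!])\s+/g, "$1|").split("|");
}

console.log(splitOnPunctuationAndSpace(
  "This is a long string with some numbers [125.000,55 and 140.000] and an end. This is another sentence."
));
```

Requiring whitespace after the punctuation is what keeps 125.000 and 140.000 intact here, so an abbreviation such as "e.g. " followed by a space would still be treated as a sentence end.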
Answered by Beejee
I would just change the strings and put something between each sentence. You told me you have the right to change them, so it will be easier to do it this way.
\r\n
By doing this you have a string to search for and you won't need to use these complex regexes.
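If the stored text really can be changed, the delimiter approach might look like the sketch below. The helper names are hypothetical, and it assumes no sentence contains "\r\n" itself.

```javascript
// Hypothetical helpers for the delimiter approach.
// Assumption: no sentence contains the delimiter itself.
const DELIMITER = "\r\n";

// Store the sentences with an explicit delimiter between them...
function joinSentences(sentences) {
  return sentences.join(DELIMITER);
}

// ...so that reading them back is a plain string split, no regex needed.
function splitStored(text) {
  return text.split(DELIMITER);
}

const stored = joinSentences(["Prices rose to 125.000,55 today.", "This is another sentence."]);
console.log(splitStored(stored));
```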
If you want to do it the harder way, I would use a regex to look for ".", "?" or "!" followed by a capital letter, like tessi showed you.