How can I split multiple joined words?

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license, cite the original URL, and attribute it to the original authors (not me): StackOverflow

Original URL: http://stackoverflow.com/questions/195010/
Asked by Taptronic
I have an array of 1000 or so entries, with examples below:
wickedweather
liquidweather
driveourtrucks
gocompact
slimprojector
I would like to be able to split these into their respective words, as:
wicked weather
liquid weather
drive our trucks
go compact
slim projector
I was hoping a regular expression might do the trick. But since there is no boundary to stop on, nor any sort of capitalization that I could key on, I am thinking that some sort of reference to a dictionary might be necessary?
I suppose it could be done by hand, but why, when it can be done with code! =) But this has stumped me. Any ideas?
Accepted answer by SquareCog
Can a human do it?
farsidebag
far sidebag
farside bag
far side bag
Not only do you have to use a dictionary, you might have to use a statistical approach to figure out what's most likely (or, god forbid, an actual HMM for your human language of choice...)
For how to do statistics that might be helpful, I turn you to Dr. Peter Norvig, who addresses a different, but related problem of spell-checking in 21 lines of code: http://norvig.com/spell-correct.html
(He does cheat a bit by folding every for loop into a single line... but still.)
Update: This got stuck in my head, so I had to birth it today. This code does a similar split to the one described by Robert Gamble, but then it orders the results based on word frequency in the provided dictionary file (which is now expected to be some text representative of your domain or English in general. I used big.txt from Norvig, linked above, and catted a dictionary onto it, to cover missing words).
A combination of two words will most of the time beat a combination of 3 words, unless the frequency difference is enormous.
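To see why, here is a rough back-of-the-envelope check (the per-word probability is a made-up round number for illustration, not one measured from big.txt):

# Under a unigram model, a parse's score is the product of its words'
# probabilities, so every extra word multiplies in another factor < 1.
p = 1e-4                      # assumed probability of a typical word
two_word_score = p * p        # 1e-08
three_word_score = p * p * p  # 1e-12
print(two_word_score / three_word_score)  # 10000.0 -- the 2-word parse wins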
I posted this code, with some minor changes, on my blog: http://squarecog.wordpress.com/2008/10/19/splitting-words-joined-into-a-single-string/ and also wrote a little about the underflow bug in this code. I was tempted to just quietly fix it, but figured this may help some folks who haven't seen the log trick before: http://squarecog.wordpress.com/2009/01/10/dealing-with-underflow-in-joint-probability-calculations/
Output on your words, plus a few of my own -- notice what happens with "orcore":
perl splitwords.pl big.txt words

answerveal: 2 possibilities
 - answer veal
 - answer ve al

wickedweather: 4 possibilities
 - wicked weather
 - wicked we at her
 - wick ed weather
 - wick ed we at her

liquidweather: 6 possibilities
 - liquid weather
 - liquid we at her
 - li quid weather
 - li quid we at her
 - li qu id weather
 - li qu id we at her

driveourtrucks: 1 possibilities
 - drive our trucks

gocompact: 1 possibilities
 - go compact

slimprojector: 2 possibilities
 - slim projector
 - slim project or

orcore: 3 possibilities
 - or core
 - or co re
 - orc ore
Code:
#!/usr/bin/env perl

use strict;
use warnings;

sub find_matches($);
sub find_matches_rec($\@\@);
sub find_word_seq_score(@);
sub get_word_stats($);
sub print_results($@);
sub Usage();

our(%DICT,$TOTAL);
{
  my( $dict_file, $word_file ) = @ARGV;
  ($dict_file && $word_file) or die(Usage);

  {
    my $DICT;
    ($DICT, $TOTAL) = get_word_stats($dict_file);
    %DICT = %$DICT;
  }

  {
    open( my $WORDS, '<', $word_file ) or die "unable to open $word_file\n";

    foreach my $word (<$WORDS>) {
      chomp $word;
      my $arr = find_matches($word);

      local $_;
      # Schwartzian Transform
      my @sorted_arr =
        map  { $_->[0] }
        sort { $b->[1] <=> $a->[1] }
        map  {
          [ $_, find_word_seq_score(@$_) ]
        }
        @$arr;

      print_results( $word, @sorted_arr );
    }

    close $WORDS;
  }
}

sub find_matches($){
  my( $string ) = @_;

  my @found_parses;
  my @words;
  find_matches_rec( $string, @words, @found_parses );

  return @found_parses if wantarray;
  return \@found_parses;
}

sub find_matches_rec($\@\@){
  my( $string, $words_sofar, $found_parses ) = @_;
  my $length = length $string;

  unless( $length ){
    push @$found_parses, $words_sofar;
    return @$found_parses if wantarray;
    return $found_parses;
  }

  foreach my $i ( 2..$length ){
    my $prefix = substr($string, 0, $i);
    my $suffix = substr($string, $i, $length-$i);

    if( exists $DICT{$prefix} ){
      my @words = ( @$words_sofar, $prefix );
      find_matches_rec( $suffix, @words, @$found_parses );
    }
  }

  return @$found_parses if wantarray;
  return $found_parses;
}

## Just a simple joint probability
## assumes independence between words, which is obviously untrue
## that's why this is broken out -- feel free to add better brains
sub find_word_seq_score(@){
  my( @words ) = @_;
  local $_;

  my $score = 1;
  foreach ( @words ){
    $score = $score * $DICT{$_} / $TOTAL;
  }

  return $score;
}

sub get_word_stats($){
  my ($filename) = @_;

  open(my $DICT, '<', $filename) or die "unable to open $filename\n";

  local $/ = undef;
  local $_;
  my %dict;
  my $total = 0;

  while ( <$DICT> ){
    foreach ( split(/\b/, $_) ) {
      $dict{$_} += 1;
      $total++;
    }
  }

  close $DICT;

  return (\%dict, $total);
}

sub print_results($@){
  #( 'word', [qw'test one'], [qw'test two'], ... )
  my ($word, @combos) = @_;
  local $_;
  my $possible = scalar @combos;

  print "$word: $possible possibilities\n";
  foreach (@combos) {
    print ' - ', join(' ', @$_), "\n";
  }
  print "\n";
}
sub Usage(){
  return "$0 /path/to/dictionary /path/to/your_words\n";
}
Answered by Darius Bacon
The Viterbi algorithm is much faster. It computes the same scores as the recursive search in Dmitry's answer above, but in O(n) time. (Dmitry's search takes exponential time; Viterbi does it by dynamic programming.)
import re
from collections import Counter

def viterbi_segment(text):
    probs, lasts = [1.0], [0]
    for i in range(1, len(text) + 1):
        prob_k, k = max((probs[j] * word_prob(text[j:i]), j)
                        for j in range(max(0, i - max_word_length), i))
        probs.append(prob_k)
        lasts.append(k)
    words = []
    i = len(text)
    while 0 < i:
        words.append(text[lasts[i]:i])
        i = lasts[i]
    words.reverse()
    return words, probs[-1]

def word_prob(word): return dictionary[word] / total
def words(text): return re.findall('[a-z]+', text.lower())
dictionary = Counter(words(open('big.txt').read()))
max_word_length = max(map(len, dictionary))
total = float(sum(dictionary.values()))
Testing it:
>>> viterbi_segment('wickedweather')
(['wicked', 'weather'], 5.1518198982768158e-10)
>>> ' '.join(viterbi_segment('itseasyformetosplitlongruntogetherblocks')[0])
'its easy for me to split long run together blocks'
To be practical you'll likely want a couple refinements:
- Add logs of probabilities, don't multiply probabilities. This avoids floating-point underflow (see the sketch after this list).
- Your inputs will in general use words not in your corpus. These substrings must be assigned a nonzero probability as words, or you end up with no solution or a bad solution. (That's just as true for the above exponential search algorithm.) This probability has to be siphoned off the corpus words' probabilities and distributed plausibly among all other word candidates: the general topic is known as smoothing in statistical language models. (You can get away with some pretty rough hacks, though.) This is where the O(n) Viterbi algorithm blows away the search algorithm, because considering non-corpus words blows up the branching factor.
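A minimal sketch of both refinements, reusing the dictionary, total, and max_word_length globals from the Viterbi code above (the smoothing formula is a deliberately crude length penalty, not a principled method):

import math

def word_logprob(word):
    count = dictionary.get(word)
    if count:
        return math.log(count / total)
    # Crude smoothing: unseen strings get a tiny probability that shrinks
    # quickly with length, so known words are still strongly preferred.
    return math.log(10.0 / (total * 10 ** len(word)))

def viterbi_segment_log(text):
    # Same dynamic program as viterbi_segment, but adding log-probabilities
    # instead of multiplying probabilities, so long inputs cannot underflow.
    best, lasts = [0.0], [0]
    for i in range(1, len(text) + 1):
        score_k, k = max((best[j] + word_logprob(text[j:i]), j)
                         for j in range(max(0, i - max_word_length), i))
        best.append(score_k)
        lasts.append(k)
    words = []
    i = len(text)
    while i > 0:
        words.append(text[lasts[i]:i])
        i = lasts[i]
    words.reverse()
    return words, best[-1]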
Answered by Robert Gamble
The best tool for the job here is recursion, not regular expressions. The basic idea is to start from the beginning of the string looking for a word, then take the remainder of the string and look for another word, and so on until the end of the string is reached. A recursive solution is natural since backtracking needs to happen when a given remainder of the string cannot be broken into a set of words. The solution below uses a dictionary to determine what is a word and prints out solutions as it finds them (some strings can be broken out into multiple possible sets of words, for example wickedweather could be parsed as "wicked we at her"). If you just want one set of words you will need to determine the rules for selecting the best set, perhaps by selecting the solution with fewest number of words or by setting a minimum word length.
#!/usr/bin/perl

use strict;

my $WORD_FILE = '/usr/share/dict/words'; #Change as needed
my %words; # Hash of words in dictionary

# Open dictionary, load words into hash
open(WORDS, $WORD_FILE) or die "Failed to open dictionary: $!\n";
while (<WORDS>) {
  chomp;
  $words{lc($_)} = 1;
}
close(WORDS);

# Read one line at a time from stdin, break into words
while (<>) {
  chomp;
  my @words;
  find_words(lc($_));
}

sub find_words {
  # Print every way $string can be parsed into whole words
  my $string = shift;
  my @words = @_;
  my $length = length $string;

  foreach my $i ( 1 .. $length ) {
    my $word = substr $string, 0, $i;
    my $remainder = substr $string, $i, $length - $i;
    # Some dictionaries contain each letter as a word
    next if ($i == 1 && ($word ne "a" && $word ne "i"));
    if (defined($words{$word})) {
      push @words, $word;
      if ($remainder eq "") {
        print join(' ', @words), "\n";
        return;
      } else {
        find_words($remainder, @words);
      }
      pop @words;
    }
  }

  return;
}
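If you only want a single answer per input, a tie-breaker along the lines the answer suggests (fewest words wins) is easy to bolt on once you have collected all parses; a sketch in Python, with a made-up list of parses for illustration:

def best_parse(parses):
    # Prefer the parse with the fewest words; break ties by longer first word.
    return min(parses, key=lambda ws: (len(ws), -len(ws[0])))

parses = [['wicked', 'we', 'at', 'her'], ['wicked', 'weather'], ['wick', 'ed', 'weather']]
print(best_parse(parses))  # ['wicked', 'weather']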
Answered by kamran kausar
pip install wordninja
>>> import wordninja
>>> wordninja.split('bettergood')
['better', 'good']
Answered by Greg Hewgill
I think you're right in thinking that it's not really a job for a regular expression. I would approach this using the dictionary idea - look for the longest prefix that is a word in the dictionary. When you find that, chop it off and do the same with the remainder of the string.
The above method is subject to ambiguity, for example "drivereallyfast" would first find "driver" and then have trouble with "eallyfast". So you would also have to do some backtracking if you ran into this situation. Or, since you don't have that many strings to split, just do by hand the ones that fail the automated split.
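A minimal sketch of this longest-prefix-plus-backtracking idea (the function name and the tiny word set are made up for illustration):

def split_longest_prefix(s, words):
    # Try the longest dictionary prefix first; backtrack on dead ends.
    if not s:
        return []
    for i in range(len(s), 0, -1):
        prefix = s[:i]
        if prefix in words:
            rest = split_longest_prefix(s[i:], words)
            if rest is not None:
                return [prefix] + rest
    return None  # no way to parse this suffix into words

words = {'drive', 'driver', 'really', 'fast', 'our', 'trucks'}
print(split_longest_prefix('drivereallyfast', words))
# ['drive', 'really', 'fast'] -- backtracks past the greedy 'driver'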
Answered by mhucka
This is related to a problem known as identifier splitting or identifier name tokenization. In the OP's case, the inputs seem to be concatenations of ordinary words; in identifier splitting, the inputs are class names, function names or other identifiers from source code, and the problem is harder. I realize this is an old question and the OP has either solved their problem or moved on, but in case someone else comes across this question while looking for identifier splitters (like I was, not long ago), I would like to offer Spiral ("SPlitters for IdentifieRs: A Library"). It is written in Python but comes with a command-line utility that can read a file of identifiers (one per line) and split each one.
Splitting identifiers is deceptively difficult. Programmers commonly use abbreviations, acronyms and word fragments when naming things, and they don't always use consistent conventions. Even when identifiers do follow some convention such as camel case, ambiguities can arise.
Spiral implements numerous identifier splitting algorithms, including a novel algorithm called Ronin. It uses a variety of heuristic rules, English dictionaries, and tables of token frequencies obtained from mining source-code repositories. Ronin can split identifiers that do not use camel case or other naming conventions, including cases such as splitting J2SEProjectTypeProfiler into [J2SE, Project, Type, Profiler], which requires the reader to recognize J2SE as a unit. Here are some more examples of what Ronin can split:
# spiral mStartCData nonnegativedecimaltype getUtf8Octets GPSmodule savefileas nbrOfbugs
mStartCData: ['m', 'Start', 'C', 'Data']
nonnegativedecimaltype: ['nonnegative', 'decimal', 'type']
getUtf8Octets: ['get', 'Utf8', 'Octets']
GPSmodule: ['GPS', 'module']
savefileas: ['save', 'file', 'as']
nbrOfbugs: ['nbr', 'Of', 'bugs']
Using the examples from the OP's question:
# spiral wickedweather liquidweather driveourtrucks gocompact slimprojector
wickedweather: ['wicked', 'weather']
liquidweather: ['liquid', 'weather']
driveourtrucks: ['driveourtrucks']
gocompact: ['go', 'compact']
slimprojector: ['slim', 'projector']
As you can see, it is not perfect. It's worth noting that Ronin has a number of parameters, and adjusting them makes it possible to split driveourtrucks too, but at the cost of worsening performance on program identifiers.
More information can be found in the GitHub repo for Spiral.
Answered by Zoe Gagnon
Well, the problem itself is not solvable with just a regular expression. A solution (probably not the best) would be to get a dictionary and do a regular expression match for each word in the dictionary against each word in the list, adding the space whenever successful. Certainly this would not be terribly quick, but it would be easy to program and faster than doing it by hand.
Answered by Mitch Wheat
A dictionary-based solution would be required. This might be simplified somewhat if you have a limited dictionary of words that can occur; otherwise, words that form the prefix of other words are going to be a problem.
Answered by adam shamsudeen
There is a Python package released by Santhosh Thottingal called mlmorph which can be used for morphological analysis.
https://pypi.org/project/mlmorph/
Examples:
from mlmorph import Analyser
analyser = Analyser()
analyser.analyse("കേരളത്തിന്റെ")
Gives
[('കേരളം<np><genitive>', 179)]
He also wrote a blog on the topic https://thottingal.in/blog/2017/11/26/towards-a-malayalam-morphology-analyser/
Answered by Rabash
A simple solution with Python: install the wordsegment package: pip install wordsegment.

$ echo thisisatest | python -m wordsegment
this is a test
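If you would rather call it from Python than from the shell, the package also exposes a small API (in recent versions you must call load() once before segmenting; check the package docs if this differs in your version):

from wordsegment import load, segment

load()                         # load the bundled unigram/bigram frequency data
print(segment('thisisatest'))  # ['this', 'is', 'a', 'test']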