C#: How to split text into words?

Note: this page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep it under the same license and attribute it to the original authors (not me): StackOverFlow. Original URL: http://stackoverflow.com/questions/16725848/

Time: 2020-08-10 01:46:04  Source: igfitidea

c#, .net

Asked by Colonel Panic

How to split text into words?

Example text:

'Oh, you can't help that,' said the Cat: 'we're all mad here. I'm mad. You're mad.'

The words in that line are:

  1. Oh
  2. you
  3. can't
  4. help
  5. that
  6. said
  7. the
  8. Cat
  9. we're
  10. all
  11. mad
  12. here
  13. I'm
  14. mad
  15. You're
  16. mad

Accepted answer by Colonel Panic

Split text on whitespace, then trim punctuation.

var text = "'Oh, you can't help that,' said the Cat: 'we're all mad here. I'm mad. You're mad.'";
var punctuation = text.Where(Char.IsPunctuation).Distinct().ToArray();
var words = text.Split().Select(x => x.Trim(punctuation));

The result agrees exactly with the example.
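
As a self-contained sketch of the same idea: Trim only strips characters from the two ends of each token, which is why the apostrophe inside "can't" survives. The class name TrimSplitDemo, the method name Words and the extra filter for empty tokens are my own assumptions, not part of the answer above.

```csharp
using System;
using System.Linq;

public static class TrimSplitDemo
{
    // Hypothetical helper: split on whitespace, then trim punctuation from both ends.
    public static string[] Words(string text)
    {
        // Collect every punctuation character that actually occurs in the text.
        char[] punctuation = text.Where(c => char.IsPunctuation(c)).Distinct().ToArray();

        return text.Split()                            // no arguments: split on whitespace
                   .Select(x => x.Trim(punctuation))   // strips both ends, never the middle
                   .Where(x => x.Length > 0)           // assumption: drop tokens that were pure punctuation
                   .ToArray();
    }

    public static void Main()
    {
        // prints "Oh|you|can't|help|that|said|the|Cat|twice"
        Console.WriteLine(string.Join("|", Words("'Oh, you can't help that,' said the Cat - twice.")));
    }
}
```

Without the length filter, a stand-alone dash like the one in the demo string would come back as an empty string.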

Answer by Adam Tal

First, remove all special characters:

var fixedInput = Regex.Replace(input, "[^a-zA-Z0-9% ._]", string.Empty);
// This regex doesn't support apostrophe so the extension method is better

Then split it:

var split = fixedInput.Split(' ');

For a simpler C# solution for removing special characters (one that you can easily change), add this extension method (I added support for the apostrophe):

public static string RemoveSpecialCharacters(this string str) {
   var sb = new StringBuilder();
   foreach (char c in str) {
      if ((c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '\'' || c == ' ') {
         sb.Append(c);
      }
   }
   return sb.ToString();
}

Then use it like so:

var words = input.RemoveSpecialCharacters().Split(' ');

You'll be surprised to know that this extension method is very efficient (surely much more efficient than the Regex), so I suggest you use it ;)

Update

I agree that this is an English-only approach, but to make it Unicode-compatible all you have to do is replace:

(c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')

With:

char.IsLetter(c)

which supports Unicode. .NET also offers char.IsSymbol and char.IsLetterOrDigit for a variety of cases.
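
Put together, a Unicode-aware version of the extension method might look like the sketch below. The names UnicodeWordExtensions and RemoveSpecialCharactersUnicode are hypothetical, and using char.IsLetterOrDigit to cover both the letter and digit ranges is my assumption about how the pieces combine.

```csharp
using System;
using System.Text;

public static class UnicodeWordExtensions
{
    // Hypothetical name: keeps letters and digits from any script,
    // plus apostrophes and spaces, and drops everything else.
    public static string RemoveSpecialCharactersUnicode(this string str)
    {
        var sb = new StringBuilder(str.Length);
        foreach (char c in str)
        {
            if (char.IsLetterOrDigit(c) || c == '\'' || c == ' ')
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }
}

public static class UnicodeDemo
{
    public static void Main()
    {
        // RemoveEmptyEntries guards against runs of spaces left behind by removed characters.
        string[] words = "\u00FCber, it's 42!"
            .RemoveSpecialCharactersUnicode()
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        Console.WriteLine(string.Join("|", words));
    }
}
```

Note that this still checks one char at a time, so characters outside the Basic Multilingual Plane (which arrive as surrogate pairs) would be dropped by this check.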

Answer by Michael La Voie

Just to add a variation on @Adam Fridental's answer which is very good, you could try this Regex:

var text = "'Oh, you can't help that,' said the Cat: 'we're all mad here. I'm mad. You're mad.'";

var matches = Regex.Matches(text, @"\w+[^\s]*\w+|\w");

foreach (Match match in matches) {
    var word = match.Value;
}

I believe this is the shortest RegEx that will get all the words

\w+[^\s]*\w+|\w
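
To see what the pattern does: the first alternative matches a run that starts and ends with a word character and may contain any non-whitespace characters in between (so "can't" stays whole), and the second alternative picks up single-character words. A small sketch, with RegexWordDemo and Words as hypothetical names of my own:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class RegexWordDemo
{
    // Hypothetical helper that returns the matches as plain strings.
    public static string[] Words(string text)
    {
        return Regex.Matches(text, @"\w+[^\s]*\w+|\w")
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    public static void Main()
    {
        // prints "Oh|you|can't|help|that|said|the|Cat"
        Console.WriteLine(string.Join("|", Words("'Oh, you can't help that,' said the Cat")));

        // Punctuation with no whitespace around it stays inside one match:
        // prints "mad.You're|a|cat"
        Console.WriteLine(string.Join("|", Words("mad.You're a cat")));
    }
}
```

The second call shows the trade-off: because the middle part accepts any non-whitespace character, two words joined by a period without a space come back as a single match. \w also accepts digits and underscores.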

Answer by keyboardP

You could try using a regex to remove the apostrophes that aren't surrounded by letters (i.e. single quotes) and then use the Char static methods to strip all the other characters. By calling the regex first you can keep the contraction apostrophes (e.g. can't) but remove the single quotes, as in 'Oh.

string myText = "'Oh, you can't help that,' said the Cat: 'we're all mad here. I'm mad. You're mad.'";

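// Matches a quote that does not have a word character on both sides,
// so the apostrophe in "can't" is kept while the quotes around 'Oh ... mad.' are removed.
Regex reg = new Regex(@"(?<!\w)[""']|[""'](?!\w)");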
myText = reg.Replace(myText, "");

string[] listOfWords = RemoveCharacters(myText);

public string[] RemoveCharacters(string input)
{
    StringBuilder sb = new StringBuilder();
    foreach (char c in input)
    {
        if (Char.IsLetter(c) || Char.IsWhiteSpace(c) || c == '\'')
           sb.Append(c);
    }

    return sb.ToString().Split(' ');
}

Answer by mason

If you don't want to use a Regex object, you could do something like...

string mystring = "Oh, you can't help that,' said the Cat: 'we're all mad here. I'm mad. You're mad.";
// Split(' ') takes a char, which compiles on every .NET version.
List<string> words = mystring.Replace(",", "").Replace(":", "").Replace(".", "").Split(' ').ToList();

You'll still have to handle the trailing apostrophe at the end of "that,'"
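
One way to deal with those stray quotes, still without Regex, is to trim apostrophes from the ends of each word after splitting. This is only a sketch under my own assumptions (the names ReplaceSplitDemo and Words are hypothetical), and it would also strip a legitimate trailing apostrophe such as the one in a plural possessive.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ReplaceSplitDemo
{
    // Hypothetical helper: remove a fixed set of punctuation, split on spaces,
    // then trim apostrophes from both ends of each word.
    public static List<string> Words(string text)
    {
        return text.Replace(",", "").Replace(":", "").Replace(".", "")
                   .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
                   .Select(w => w.Trim('\''))   // ends only, so "can't" keeps its apostrophe
                   .Where(w => w.Length > 0)    // drop tokens that were only apostrophes
                   .ToList();
    }

    public static void Main()
    {
        // prints "Oh|you|can't|help|that|said|the|Cat|we're|all|mad|here"
        Console.WriteLine(string.Join("|",
            Words("Oh, you can't help that,' said the Cat: 'we're all mad here.")));
    }
}
```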

Answer by toannm

This is one solution; I don't use any helper classes or methods.

        public static List<string> ExtractChars(string inputString) {
            var result = new List<string>();
            int startIndex = -1;
            for (int i = 0; i < inputString.Length; i++) {
                var character = inputString[i];
                if ((character >= 'a' && character <= 'z') ||
                    (character >= 'A' && character <= 'Z')) {
                    if (startIndex == -1) {
                        startIndex = i;
                    }
                    if (i == inputString.Length - 1) {
                        result.Add(GetString(inputString, startIndex, i));
                    }
                    continue;
                }
                if (startIndex != -1) {
                    result.Add(GetString(inputString, startIndex, i - 1));
                    startIndex = -1;
                }
            }
            return result;
        }

        public static string GetString(string inputString, int startIndex, int endIndex) {
            string result = "";
            for (int i = startIndex; i <= endIndex; i++) {
                result += inputString[i];
            }
            return result;
        }

Answer by Francesco

If you want to use a loop to check each char and keep all the punctuation in the input string, I've created this class. The GetSplitSentence() method returns a list of SentenceSplitResult. This list holds all the words and all the punctuation and numbers; each punctuation mark or number is saved as its own item in the list. sentenceSplitResult.isAWord tells you whether an item is a word or not. [Sorry for my English]

public class SentenceSplitResult
{
    public string word;
    public bool isAWord;
}

public class StringsHelper
{

    private readonly List<SentenceSplitResult> outputList = new List<SentenceSplitResult>();

    private readonly string input;

    public StringsHelper(string input)
    {
        this.input = input;
    }

    public List<SentenceSplitResult> GetSplitSentence()
    {
        StringBuilder sb = new StringBuilder();

        try
        {
            if (String.IsNullOrEmpty(input)) {
                Logger.Log(new ArgumentNullException(), "GetSplitSentence - input is null or empty");
                return outputList;                    
            }

            bool isAletter = IsAValidLetter(input[0]);

            // Each char is checked to see whether it is part of a word.
            // If yes: store the char for later.
            // If no: save the pending word (if any), then save the punctuation.
            foreach (var _char in input)
            {
                isAletter = IsAValidLetter(_char);

                if (isAletter == true)
                {
                    sb.Append(_char);
                }
                else
                {
                    SaveWord(sb.ToString());
                    sb.Clear();
                    SaveANotWord(_char);                        
                }
            }

            SaveWord(sb.ToString());

        }
        catch (Exception ex)
        {
            Logger.Log(ex);
        }

        return outputList;

    }

    private static bool IsAValidLetter(char _char)
    {
        if ((Char.IsPunctuation(_char) == true) || (_char == ' ') || (Char.IsNumber(_char) == true))
        {
            return false;
        }
        return true;
    }

    private void SaveWord(string word)
    {
        if (String.IsNullOrEmpty(word) == false)
        {
            outputList.Add(new SentenceSplitResult()
            {
                isAWord = true,
                word = word
            });                
        }
    }

    private void SaveANotWord(char _char)
    {
        outputList.Add(new SentenceSplitResult()
        {
            isAWord = false,
            word = _char.ToString()
        });
    }
}