Parsing Meaning from Text in Python

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must follow the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/1140908/

Date: 2020-11-03 21:34:09 · Source: igfitidea

Parsing Meaning from Text

Tags: python, parsing, nlp, lexical-analysis

Asked by Tom

I realize this is a broad topic, but I'm looking for a good primer on parsing meaning from text, ideally in Python. As an example of what I'm looking to do, if a user makes a blog post like:

"Manny Ramirez makes his return for the Dodgers today against the Houston Astros",

what's a light-weight/ easy way of getting the nouns out of a sentence? To start, I think I'd limit it to proper nouns, but I wouldn't want to be limited to just that (and I don't want to rely on a simple regex that assumes anything Title Capped is a proper noun).

To make this question even worse, what are the things I'm not asking that I should be? Do I need a corpus of existing words to get started? What lexical analysis stuff do I need to know to make this work? I did come across one other questionon the topic and I'm digging through those resources now.

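
To make concrete why the Title-Capped shortcut is a dead end, here is a minimal sketch (standard-library Python only) of that naive approach applied to the example sentence; the second call shows the kind of false positive it produces:

```python
import re

# Naive baseline: treat any run of Title-Capped words as a proper noun.
# This is exactly the shortcut the question wants to move beyond.
TITLE_CAPPED = re.compile(r'\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b')

post = ("Manny Ramirez makes his return for the Dodgers today "
        "against the Houston Astros")
print(TITLE_CAPPED.findall(post))
# ['Manny Ramirez', 'Dodgers', 'Houston Astros']

# Sentence-initial words break the heuristic:
print(TITLE_CAPPED.findall("Today the Dodgers face Houston"))
# ['Today', 'Dodgers', 'Houston']
```

A sentence-initial adverb like "Today" looks exactly like a proper noun to this pattern, which is why the part-of-speech and chunking approaches in the answers below are the better starting point.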

Accepted answer by Bluu

Use the NLTK, in particular chapter 7 on Information Extraction.

You say you want to extract meaning, and there are modules for semantic analysis, but I think IE is all you need--and honestly one of the only areas of NLP computers can handle right now.

See sections 7.5 and 7.6 on the subtopics of Named Entity Recognition (to chunk and categorize Manny Ramirez as a person, the Dodgers as a sports organization, and the Houston Astros as another sports organization, or whatever suits your domain) and Relationship Extraction. There is a NER chunker that you can plug in once you have NLTK installed. From their examples, extracting a geo-political entity (GPE) and a person:

>>> sent = nltk.corpus.treebank.tagged_sents()[22]
>>> print(nltk.ne_chunk(sent))
(S
  The/DT
  (GPE U.S./NNP)
  is/VBZ
  one/CD
  ...
  according/VBG
  to/TO
  (PERSON Brooke/NNP T./NNP Mossman/NNP)
  ...)

Note you'll still need to know tokenization and tagging, as discussed in earlier chapters, to get your text in the right format for these IE tasks.

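
Once you have chunked output like the tree printed above, extracting the entities is just a walk over the subtrees. A small sketch, assuming only that NLTK is installed; the tree is built by hand here to mirror the `ne_chunk` output shown above, so no corpus downloads are needed:

```python
from nltk.tree import Tree

# Hand-built stand-in for the ne_chunk output printed above.
chunked = Tree('S', [
    ('The', 'DT'),
    Tree('GPE', [('U.S.', 'NNP')]),
    ('is', 'VBZ'),
    ('according', 'VBG'),
    ('to', 'TO'),
    Tree('PERSON', [('Brooke', 'NNP'), ('T.', 'NNP'), ('Mossman', 'NNP')]),
])

def extract_entities(tree):
    """Return (label, text) pairs for each named-entity subtree."""
    return [(sub.label(), ' '.join(word for word, tag in sub))
            for sub in tree if isinstance(sub, Tree)]

print(extract_entities(chunked))
# [('GPE', 'U.S.'), ('PERSON', 'Brooke T. Mossman')]
```

Against real text you would build `chunked` with `nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(text)))`, after downloading the relevant NLTK data.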

Answer by RichieHindle

You need to look at the Natural Language Toolkit, which is for exactly this sort of thing.

This section of the manual looks very relevant: Categorizing and Tagging Words. Here's an extract:

>>> text = nltk.word_tokenize("And now for something completely different")
>>> nltk.pos_tag(text)
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'),
('completely', 'RB'), ('different', 'JJ')]

Here we see that "and" is CC, a coordinating conjunction; "now" and "completely" are RB, or adverbs; "for" is IN, a preposition; "something" is NN, a noun; and "different" is JJ, an adjective.

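
That tag output is all the original question needs for "getting the nouns out": filter the (word, tag) pairs. A sketch using hand-written tags of the kind a Penn Treebank-style tagger would typically assign to the question's example sentence (hand-tagged here so it runs without any tagger data; the tags are illustrative):

```python
# (word, tag) pairs as a Penn Treebank-style tagger would typically
# produce for the example sentence (hand-written for illustration).
tagged = [('Manny', 'NNP'), ('Ramirez', 'NNP'), ('makes', 'VBZ'),
          ('his', 'PRP$'), ('return', 'NN'), ('for', 'IN'),
          ('the', 'DT'), ('Dodgers', 'NNPS'), ('today', 'NN'),
          ('against', 'IN'), ('the', 'DT'), ('Houston', 'NNP'),
          ('Astros', 'NNPS')]

# All noun tags start with NN (NN, NNS, NNP, NNPS); proper nouns with NNP.
nouns = [w for w, t in tagged if t.startswith('NN')]
proper = [w for w, t in tagged if t.startswith('NNP')]
print(nouns)    # ['Manny', 'Ramirez', 'return', 'Dodgers', 'today', 'Houston', 'Astros']
print(proper)   # ['Manny', 'Ramirez', 'Dodgers', 'Houston', 'Astros']
```

Note how "today" is caught as a common noun while sentence position plays no role, which is exactly what the capitalization regex cannot do.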

Answer by Stephan202

Natural Language Processing (NLP) is the name for parsing, well, natural language. Many algorithms and heuristics exist, and it's an active field of research. Whatever algorithm you will code, it will need to be trained on a corpus. Just like a human: we learn a language by reading text written by other people (and/or by listening to sentences uttered by other people).

In practical terms, have a look at the Natural Language Toolkit. For a theoretical underpinning of whatever you are going to code, you may want to check out Foundations of Statistical Natural Language Processing by Chris Manning and Hinrich Schütze.

Answer by zakovyrya

Here is the book I stumbled upon recently: Natural Language Processing with Python

Answer by Jay Kominek

What you want is called NP (noun phrase) chunking, or extraction.

Some links here

As pointed out, this is very problem-domain-specific stuff. The more you can narrow it down, the more effective it will be. And you're going to have to train your program on your specific domain.

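
With NLTK, NP chunking can be done with a hand-written tag pattern via `RegexpParser`. A minimal sketch, assuming NLTK is installed; the tagged sentence is supplied by hand (it is the NLTK book's classic example), so no tagger download is needed:

```python
import nltk

# Chunk grammar: an NP is an optional determiner, any adjectives,
# then one or more nouns (NN, NNS, NNP, NNPS).
grammar = "NP: {<DT>?<JJ>*<NN.*>+}"
chunker = nltk.RegexpParser(grammar)

tagged = [('the', 'DT'), ('little', 'JJ'), ('yellow', 'JJ'), ('dog', 'NN'),
          ('barked', 'VBD'), ('at', 'IN'), ('the', 'DT'), ('cat', 'NN')]
tree = chunker.parse(tagged)

# Collect the text of every NP subtree the pattern found.
nps = [' '.join(w for w, t in sub) for sub in tree.subtrees()
       if sub.label() == 'NP']
print(nps)   # ['the little yellow dog', 'the cat']
```

The grammar is where the domain-specific tuning mentioned above happens: you refine the tag pattern (or train a proper chunker) on text from your own domain.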

Answer by Paul Sonier

This is a really really complicated topic. Generally, this sort of stuff falls under the rubric of Natural Language Processing, and tends to be tricky at best. The difficulty of this sort of stuff is precisely why there still is no completely automated system for handling customer service and the like.

Generally, the approach to this stuff REALLY depends on precisely what your problem domain is. If you're able to winnow down the problem domain, you can gain some very serious benefits; to use your example, if you're able to determine that your problem domain is baseball, then that gives you a really strong head start. Even then, it's a LOT of work to get anything particularly useful going.

For what it's worth, yes, an existing corpus of words is going to be useful. More importantly, determining the functional complexity expected of the system is going to be critical; do you need to parse simple sentences, or is there a need for parsing complex behavior? Can you constrain the inputs to a relatively simple set?

Answer by Jesse Walters

Regular expressions can help in some scenarios. Here is a detailed example: What's the Most Mentioned Scanner on CNET Forum, which used a regular expression to find all mentioned scanners in CNET forum posts.

In that post, the following regular expression was used:

(?i)((?:\w+\s\w+\s(?:(?:(?:[0-9]+[a-z\-]|[a-z]+[0-9\-]|[0-9])[a-z0-9\-]*)|all-in-one|all in one)\s(\w+\s){0,1}(?:scanner|photo scanner|flatbed scanner|adf scanner|scanning|document scanner|printer scanner|portable scanner|handheld scanner|printer\/scanner))|(?:(?:scanner|photo scanner|flatbed scanner|adf scanner|scanning|document scanner|printer scanner|portable scanner|handheld scanner|printer\/scanner)\s(\w+\s){1,2}(?:(?:(?:[0-9]+[a-z\-]|[a-z]+[0-9\-]|[0-9])[a-z0-9\-]*)|all-in-one|all in one)))

in order to match either of the following:

  • two words, then model number (including all-in-one), then “scanner”
  • “scanner”, then one or two words, then model number (including all-in-one)
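
The full pattern is unwieldy, but the idea behind it is easy to try with a heavily trimmed-down variant (illustrative only; the simplified pattern and the sample text below are invented, not taken from the original post):

```python
import re

# Simplified stand-in for the pattern above: two context words, then a
# model number (letters mixed with digits), then a "scanner" keyword.
pattern = re.compile(
    r'(?i)\b(\w+\s\w+\s[a-z]*\d[\w\-]*\s(?:photo\s)?scanner)\b')

text = ("I bought a new Epson V700 scanner and also found a "
        "discontinued HP C9900A photo scanner.")
print(pattern.findall(text))
# ['new Epson V700 scanner', 'discontinued HP C9900A photo scanner']
```

This mirrors the first bullet above (context words, then model number, then "scanner"); the real pattern adds the reversed word order and many more keyword variants.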

As a result, the text extracted from the posts looked like:

  1. discontinued HP C9900A photo scanner
  2. scanning his old x-rays
  3. new Epson V700 scanner
  4. HP ScanJet 4850 scanner
  5. Epson Perfection 3170 scanner

Within its narrow domain, this regular expression solution worked reasonably well.
