Python or Java for text processing (text mining, information retrieval, natural language processing)
Notice: this page is taken from a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the original URL, and attribute it to the original authors (not me): Stack Overflow
Original question: http://stackoverflow.com/questions/6030291/
Asked by kga
I'm soon to start on a new project where I am going to do lots of text processing tasks like searching, categorization/classifying, clustering, and so on.
There's going to be a huge number of documents that need to be processed, probably millions. After the initial processing, the system also has to be updated daily with multiple new documents.
Can I use Python to do this, or is Python too slow? Is it best to use Java?
If possible, I would prefer Python since that's what I have been using lately. Plus, I would finish the coding part much faster. But it all depends on Python's speed. I have used Python for some small scale text processing tasks with only a couple of thousand documents, but I am not sure how well it scales up.
Answered by Chris
Both are good. Java has a lot of steam going into text processing. Stanford's text processing system, OpenNLP, UIMA, and GATE seem to be the big players (I know I am missing some). You can literally run the StanfordNLP module on a large corpus after a few minutes of playing with it. But it has major memory requirements (3 GB or so when I was using it).
NLTK, Gensim, Pattern, and many other Python modules are very good at text processing. Their memory usage and performance are very reasonable.
Python scales up because text processing is a very easily scalable problem. You can use multiprocessing very easily when parsing/tagging/chunking/extracting documents. Once you get your text into any sort of feature vector, you can use numpy arrays, and we all know how great numpy is...
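To make that concrete, here is a minimal sketch of the idea, not a definitive implementation: documents are tokenized in parallel with the standard library's multiprocessing.Pool, and the per-document term counts are then laid out as fixed-length count vectors. The function names (term_counts, build_vocabulary, to_vectors, vectorize), the regex tokenizer, and the chunk size are all assumptions made for illustration; a real project would more likely use a tokenizer from NLTK or a similar library.

```python
import re
from collections import Counter
from multiprocessing import Pool

# Hypothetical tokenizer: runs of lowercase ASCII letters and digits.
TOKEN_RE = re.compile(r"[a-z0-9]+")


def term_counts(text):
    """Lowercase one document and count its word tokens."""
    return Counter(TOKEN_RE.findall(text.lower()))


def build_vocabulary(counts_per_doc):
    """Return every distinct term across all documents, sorted."""
    return sorted(set().union(*counts_per_doc))


def to_vectors(counts_per_doc, vocabulary):
    """Turn each document's counts into a row aligned with the vocabulary."""
    return [[counts.get(term, 0) for term in vocabulary]
            for counts in counts_per_doc]


def vectorize(documents, processes=None, chunksize=1000):
    """Count terms in parallel, then build one count vector per document."""
    with Pool(processes) as pool:
        # Each worker process handles chunks of documents independently.
        counts_per_doc = pool.map(term_counts, documents, chunksize=chunksize)
    vocabulary = build_vocabulary(counts_per_doc)
    return vocabulary, to_vectors(counts_per_doc, vocabulary)


if __name__ == "__main__":
    docs = ["Python or Java for text processing",
            "Text mining with Python"]
    vocabulary, rows = vectorize(docs, processes=2)
    print(vocabulary)
    print(rows)
```

The rows are plain lists of integers, so they could be handed to numpy.array(rows) if numpy is installed. For millions of documents a dense matrix like this would probably be too large, and a sparse representation would be the more likely choice.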
I learned with NLTK, and Python has helped me greatly in reducing development time, so my opinion is that you should give that a shot first. They have a very helpful mailing list as well, which I suggest you join.
If you have custom scripts, you might want to check out how well they perform with PyPy.
Answered by StackExchange saddens dancek
It's very difficult to answer questions like this without trying. So why don't you
- Figure out what would be a difficult operation
- Implement that (and I mean the simplest, quickest hack that you can make work)
- Run it with a lot of data, and see how long it takes
- Figure out if it's too slow
I've done this in the past and it's really the way to see if something performs well enough for a given task.
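The steps in the list above can be sketched roughly as follows. This is only an illustration under stated assumptions: naive_word_count is a hypothetical stand-in for whatever the "difficult operation" turns out to be, the sample documents are placeholders, and the estimate assumes that running time grows linearly with the number of documents, which may not hold for every task.

```python
import time


def naive_word_count(text):
    """Hypothetical stand-in for the quick hack being evaluated."""
    return len(text.split())


def measure_throughput(process, documents):
    """Run `process` over every document and return documents per second."""
    start = time.perf_counter()
    for doc in documents:
        process(doc)
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very small samples.
    return len(documents) / elapsed if elapsed > 0 else float("inf")


def estimate_seconds(total_documents, docs_per_second):
    """Extrapolate linearly from the measured rate to the full corpus."""
    return total_documents / docs_per_second


if __name__ == "__main__":
    # Placeholder sample; use real documents for a meaningful number.
    sample = ["some placeholder text to process " * 50] * 10_000
    rate = measure_throughput(naive_word_count, sample)
    minutes = estimate_seconds(5_000_000, rate) / 60
    print(f"{rate:,.0f} docs/s, so 5,000,000 docs would take about {minutes:.1f} min")
```

Swapping naive_word_count for the real operation and the placeholder sample for a slice of the actual corpus gives a first answer to "is it too slow?" before committing to either language.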
Answered by Jakob Bowyer
Just write it. The biggest flaw programmers have is premature optimization. Work on the project, write it out, and get it working. Then go back, fix the bugs, and make sure it's optimized. There are going to be a number of people harping on about the speed of x vs. y and how y is better than x, but at the end of the day it's just a language. It's not what a language is but how it does it.
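One way to act on the "go back and optimize" step is to profile the working code before changing anything, so that effort goes where the time is actually spent. The sketch below uses the standard library's cProfile and pstats modules. The helper name profile_call and the toy slow_tokenize function are assumptions for illustration, not anything from the original answer.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, **kwargs):
    """Run func under cProfile; return its result and a text report."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Show the ten entries with the highest cumulative time.
    stats.sort_stats("cumulative").print_stats(10)
    return result, buffer.getvalue()


def slow_tokenize(text):
    """Toy function to profile: builds a token list one word at a time."""
    tokens = []
    for word in text.split():
        tokens = tokens + [word.lower()]  # deliberately wasteful copy
    return tokens


if __name__ == "__main__":
    tokens, report = profile_call(slow_tokenize, "Just write it " * 2000)
    print(len(tokens))
    print(report)
```

The report lists where the time went, which makes it easier to tell whether the bottleneck is in your own code, in a library, or in I/O.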
Answered by Denis Tulskiy
It's not the language you have to evaluate, but the frameworks and app servers for clustering, data storage/retrieval, etc. that are available for the language.
You can use Jython, which lets you use all the Java enterprise technologies for a high-load system and still do the text parsing with Python.