Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow original URL: http://stackoverflow.com/questions/9720894/

Time: 2020-08-16 06:41:55  Source: igfitidea

Large scale machine learning - Python or Java?

Tags: java, python, machine-learning, nltk, mahout

Asked by jeffreyveon

I am currently embarking on a project that will involve crawling and processing huge amounts of data (hundreds of gigabytes), and also mining it to extract structured data, with named entity recognition, deduplication, classification, etc.

I'm familiar with ML tools from both the Java and the Python world: LingPipe, Mahout, NLTK, etc. However, when it comes down to picking a platform for such a large-scale problem, I lack sufficient experience to decide between Java and Python.

I know this sounds like a vague question, but I am looking for general advice on picking either Java or Python. The JVM offers better performance(?) than Python, but do libraries like LingPipe match up with the Python ecosystem? If I went with Python, how easy would it be to scale it and manage it across multiple machines?

Which one should I go with and why?

Accepted answer by Yavar

Apache is going strong, producing excellent stuff like Lucene/Solr/Nutch for search, Mahout for big data machine learning, Hadoop for MapReduce, OpenNLP for NLP, and a lot of NoSQL stuff. The best part is the big "I", which stands for integration: these products integrate well with each other, and in most situations they complement each other.

Python is great too. However, if you consider the ASF projects above, then I would go with Java, like Sean Owen. Python will always be available for them, but mostly as add-ons rather than the actual stuff. For example, you can run Hadoop jobs in Python by using Hadoop Streaming.
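
To illustrate the Streaming route, here is a minimal word-count sketch in Python. It is not from the original answer. Hadoop Streaming pipes text lines through a mapper and a reducer over stdin/stdout, with tab-separated key/value records sorted by key between the two stages. The function names and the sample input are assumptions for illustration; in a real job each function would live in its own script that reads `sys.stdin`, and the scripts would be passed to the streaming jar through its `-mapper` and `-reducer` options.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record per token, as a Streaming mapper would."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(lines):
    """Sum the counts for each key; input must already be sorted by key."""
    records = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(records, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Local simulation: sorted() stands in for Hadoop's shuffle/sort phase.
sample = ["Java or Python", "python or java"]  # hypothetical input
for record in reducer(sorted(mapper(sample))):
    print(record)
```

Because both stages only consume and produce lines of text, the same logic can be checked locally before it is submitted to a cluster.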

I partially switched from C++ to Java in order to utilize some of the very popular Apache products like Lucene, Solr & OpenNLP and also other popular open source NoSQL Java products like Neo4j & OrientDB.

Answer by Sean Owen

I think one big thing Java has going for it is Hadoop. If you really mean large scale, you'll want to be able to use something like that. Generally speaking Java has the performance advantage, and more libraries available. So: Java.

Answer by subiet

If you are looking at NoSQL databases fit for ML tasks, then Neo4j is one of the more production-ready (relatively) options and is capable of handling big data. It is native to Java but comes with a beautiful REST API out of the box, and hence can be integrated with the platform of your choice. Java will give you a performance edge here.