Python: Best Data Mining Database

Disclaimer: this page is an English rendering of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/2577967/

Best Data Mining Database

Tags: python, database, nosql, data-mining

Asked by Eric

I am an occasional Python programmer who has so far only worked with MySQL or SQLite databases. I am the computer person for everything in a small company, and I have started a new project where I think it is about time to try a new database.

The sales department makes a CSV dump every week, and I need to make a small scripting application that allows people from other departments to mix the information, mostly by linking the records. I have all of this solved; my problem is speed. I am using plain text files for everything and, unsurprisingly, it is very slow.

I thought about using MySQL, but then I would need to install MySQL on every desktop; SQLite is easier, but it is very slow. I do not need a full relational database, just some way of playing with big amounts of data in a decent time.

Update: I think I was not being very detailed about my database usage and thus explained my problem badly. I am reading all the data, ~900 MB or more, from a CSV into a Python dictionary and then working with it. My problem is storing and, above all, reading the data quickly.

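For concreteness, here is a minimal sketch of the direction most of the answers below point in: loading the weekly CSV dump into an indexed SQLite (or MySQL) table once, and then querying it instead of rebuilding a 900 MB dictionary on every run. The file name, column names and query are hypothetical placeholders.

    import csv
    import sqlite3

    # Build (or rebuild) the database once per weekly dump.
    conn = sqlite3.connect("sales.db")                  # hypothetical file name
    conn.execute("DROP TABLE IF EXISTS sales")
    conn.execute("CREATE TABLE sales (customer TEXT, product TEXT, amount REAL)")

    with open("weekly_dump.csv", newline="") as f:      # hypothetical CSV layout
        rows = ((r["customer"], r["product"], float(r["amount"]))
                for r in csv.DictReader(f))
        conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

    # An index on the column used for linking records keeps lookups fast.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_customer ON sales (customer)")
    conn.commit()

    # Later scripts query the file directly instead of loading everything into a dict.
    for row in conn.execute("SELECT * FROM sales WHERE customer = ?", ("ACME",)):
        print(row)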

Many thanks!

Answered by Alfred

Quick Summary

  • You need enough memory (RAM) to solve your problem efficiently; I think you should upgrade your memory. When reading the excellent High Scalability blog you will notice that, in order to solve their problems efficiently, big sites keep the complete problem set in memory.
  • You do need a central database solution. I don't think doing this by hand with Python dictionaries alone will get the job done.
  • How to solve "your problem" depends on your queries. What I would try first is to put your data in Elasticsearch (see below) and query the database to see how it performs. I think this is the easiest way to tackle your problem, but as you can read below there are a lot of ways to tackle it.

We know:

  • You use Python as your programming language.
  • Your data set is ~900 MB (I think that's pretty large, but absolutely manageable).
  • You have loaded all the data into a Python dictionary. This is where I assume the problem lies. Python tries to keep the whole dictionary in memory (and Python dictionaries aren't the most memory-friendly structure), but if you don't have enough memory (how much memory do you have?) the data spills over into virtual memory. When you then read from the dictionary you are constantly swapping data between disk and memory, and this swapping causes "thrashing". I am assuming that your computer does not have enough RAM. If that is true, I would first upgrade it with at least 2 GB of extra RAM; once your problem set fits in memory, solving the problem is going to be a lot faster. My computer architecture book (the chapter on the memory hierarchy) says that main-memory access time is about 40-80 ns while disk access time is about 5 ms. That is a BIG difference. (A quick way to check your actual memory usage is sketched right after this list.)
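A rough way to check whether the dictionary really fits in RAM is to measure the peak allocation while building it. This is only a sketch and assumes Python 3's tracemalloc module; load_csv_into_dict() is a hypothetical stand-in for your existing loading code.

    import tracemalloc

    tracemalloc.start()

    data = load_csv_into_dict("weekly_dump.csv")   # hypothetical: your existing loading code

    current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    print("Peak memory while loading: %.1f MB" % (peak / 1e6))

    # If the peak is close to (or above) your physical RAM, the machine will thrash.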

Missing information

  • Do you have a central server? You should use/have one.
  • What operating system does your server run? Linux/Unix/Windows/Mac OS X? In my opinion your server should run Linux, Unix or Mac OS X.
  • How much memory does your server have?
  • Could you describe your data set (the CSV) a little better?
  • What kind of data mining are you doing? Do you need full-text-search capabilities? I am assuming you are not doing any complicated (SQL-style) queries; performing that task with only Python dictionaries would be a complicated problem. Could you formalize the queries that you would like to perform? For example:
    • "get all users who work for department x"
    • "get all sales from user x"

Database needed

I am the computer person for everything in a small company and I have started a new project where I think it is about time to try new databases.

You are surely right that you need a database to solve your problem. Doing it yourself using only Python dictionaries is difficult, especially when your problem set can't fit in memory.

MySQL

I thought about using MySQL, but then I would need to install MySQL on every desktop; SQLite is easier, but it is very slow. I do not need a full relational database, just some way of playing with big amounts of data in a decent time.

A centralized (client-server architecture) database is exactly what you need to solve your problem. Let all the users access the database on one PC which you manage. You can use MySQL to solve your problem.

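As a sketch of what that looks like from the other desktops: each user's script connects over the network to the single machine running MySQL, so nothing has to be installed locally except a Python driver. This assumes the MySQLdb (mysqlclient) driver; the host name, credentials, table and query are hypothetical placeholders.

    import MySQLdb  # assumes the MySQLdb / mysqlclient driver is installed

    # Every desktop talks to the one central server over the network.
    conn = MySQLdb.connect(host="sales-server",   # hypothetical host name
                           user="report_user",
                           passwd="secret",
                           db="sales")
    cur = conn.cursor()
    cur.execute("SELECT customer, SUM(amount) FROM sales GROUP BY customer")
    for customer, total in cur.fetchall():
        print(customer, total)
    conn.close()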

Tokyo Tyrant

You could also use Tokyo Tyrant to store all your data. Tokyo Tyrant is pretty fast, and the data does not have to fit in RAM; it handles retrieving data more efficiently than a Python dictionary. However, if your problem can fit completely in memory, I think you should have a look at Redis (below).

Redis:

You could, for example, use Redis (quick start in 5 minutes) to store all sales in memory; Redis is extremely fast. Redis is extremely powerful and can run these kinds of queries insanely fast. The only problem with Redis is that the data set has to fit completely in RAM, but I believe the author is working on that (the nightly build already supports it). Also, as I already said, solving your problem set completely from memory is how big sites solve their problems in a timely manner.

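A minimal sketch of what storing the weekly sales in Redis could look like with the redis-py client; the key layout and field names below are invented for illustration, and it assumes a Redis server running on localhost.

    import redis

    r = redis.Redis(host="localhost", port=6379)  # assumes a local Redis server

    # Store each sale as a hash keyed by a record id (hypothetical schema).
    r.hset("sale:1001", "customer", "ACME")
    r.hset("sale:1001", "amount", "250.0")

    # Keep a per-customer index so "all sales from customer x" is one set lookup.
    r.sadd("customer:ACME:sales", "sale:1001")

    for sale_key in r.smembers("customer:ACME:sales"):
        print(r.hgetall(sale_key))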

Document stores

This article tries to evaluate key-value stores alongside document stores like CouchDB, Riak and MongoDB. These stores are better at searching (a little slower than KV stores), but aren't good at full-text search.

Full-text-search

If you want to do full-text-search queries you could look at:

  • elasticsearch (videos): when I saw the video demonstration of Elasticsearch it looked pretty cool. You could try putting (POSTing simple JSON) your data into Elasticsearch and see how fast it is; a small sketch of that follows this list. I am following Elasticsearch on GitHub and the author is committing a lot of new code to it.
  • solr (tutorial): a lot of big companies are using Solr (GitHub, Digg) to power their search. They got a big boost going from MySQL full-text search to Solr.
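The "POST simple JSON" idea could look roughly like this using the requests library; the index name, document fields and exact endpoint paths vary with the Elasticsearch version, so treat all of them as placeholders.

    import requests

    ES = "http://localhost:9200"  # assumes a local Elasticsearch node

    # Index one sales record as a JSON document (hypothetical index/field names).
    doc = {"customer": "ACME", "product": "widget", "amount": 250.0}
    requests.post(ES + "/sales/_doc", json=doc)

    # Search over everything that was indexed.
    query = {"query": {"match": {"customer": "ACME"}}}
    resp = requests.post(ES + "/sales/_search", json=query)
    print(resp.json())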

Answered by High Performance Mark

You probably do need a full relational DBMS, if not right now then very soon. If you start now, while your problems and data are simple and straightforward, then by the time they become complex and difficult you will have plenty of experience with at least one DBMS to help you. You probably don't need MySQL on all desktops: you might, for example, install it on a server and feed data out over your network. But you perhaps need to provide more information about your requirements, toolset and equipment to get better suggestions.

And, while the other DBMSes have their strengths and weaknesses too, there's nothing wrong with MySQL for large and complex databases. I don't know enough about SQLite to comment knowledgeably about it.

EDIT: @Eric, from your comments on my answer and on the other answers, I hold even more strongly the view that it is time you moved to a database. I'm not surprised that trying to do database operations on a 900 MB Python dictionary is slow. I think you have to first convince yourself, then your management, that you have reached the limits of what your current toolset can cope with, and that future development is threatened unless you rethink matters.

If your network really can't support a server-based database then (a) you really need to make your network robust, reliable and performant enough for that purpose, but (b) if that is not an option, or not an early option, you should be thinking along the lines of a central database server handing out digests/extracts/reports to the other users, rather than a full RDBMS operating simultaneously in a client-server configuration.

The problems you are currently experiencing are problems of not having the right tools for the job. They are only going to get worse. I wish I could suggest a magic way in which this is not the case, but I can't and I don't think anyone else will.

Answered by High Performance Mark

Here is a performance benchmark of different database suites -> Database Speed Comparison

I'm not sure how objective the above comparison is, though, seeing as it's hosted on sqlite.org. SQLite only seems to be a bit slower when dropping tables; otherwise you shouldn't have any problems using it. Both SQLite and MySQL seem to have their own strengths and weaknesses: in some tests one is faster than the other, and in other tests the reverse is true.

If you've been experiencing lower than expected performance, perhaps it is not SQLite that is causing it. Have you done any profiling, or otherwise made sure that nothing else is causing your program to misbehave?

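Profiling the current script is cheap to do before switching storage engines. A minimal sketch with the standard-library cProfile module, where main() is a placeholder for whatever your existing entry point is:

    import cProfile
    import pstats

    # Profile the existing workload to see whether parsing, I/O, or lookups dominate.
    cProfile.run("main()", "load_profile.out")   # main() is a hypothetical entry point

    stats = pstats.Stats("load_profile.out")
    stats.sort_stats("cumulative").print_stats(15)   # top 15 most expensive calls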

EDIT: Updated with a link to a slightly more recent speed comparison.

Answered by Simon Callan

Have you done any benchmarking to confirm that it is the text files that are slowing you down? If you haven't, there's a good chance that tweaking some other part of the code will speed things up enough.

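A crude way to get that confirmation is to put a stopwatch around the suspect steps (assuming Python 3 for time.perf_counter); load_csv_into_dict() and link_records() are hypothetical names for the phases of the existing script.

    import time

    t0 = time.perf_counter()
    data = load_csv_into_dict("weekly_dump.csv")   # hypothetical: current parsing step
    t1 = time.perf_counter()
    result = link_records(data)                    # hypothetical: current processing step
    t2 = time.perf_counter()

    print("parsing:    %.1f s" % (t1 - t0))
    print("processing: %.1f s" % (t2 - t1))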

Answered by Eric

It has been a couple of months since I posted this question and I wanted to let you all know how I solved it. I am using Berkeley DB through the bsddb module instead of loading all the data into a Python dictionary. I am not fully happy, but my users are. My next step is trying to get a shared server with Redis, but unless users start complaining about speed, I doubt I will get it. Many thanks to everybody who helped here, and I hope this question and its answers are useful to somebody else.

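For reference, the Berkeley DB approach via bsddb looks roughly like the following. It assumes the Python 2-era bsddb module (the separate bsddb3 package on newer interpreters), and the record encoding is invented for illustration, since bsddb stores flat string keys and values on disk.

    import bsddb  # bsddb3 on newer Pythons

    # Open (or create) an on-disk hash database; records persist instead of living in a dict.
    db = bsddb.hashopen("sales.bdb", "c")

    # bsddb only stores flat strings, so structured records must be serialized somehow.
    db["sale:1001"] = "ACME|widget|250.0"   # hypothetical record encoding

    print(db["sale:1001"])
    db.close()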

Answered by Cerin

It sounds like each department has their own feudal database, and this implies a lot of unnecessary redundancy and inefficiency.

Instead of transferring hundreds of megabytes to everyone across your network, why not keep your data in MySQL and have the departments upload their data to the database, where it can be normalized and made accessible to everyone?

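A sketch of what such an upload script could look like for one department, again assuming the MySQLdb (mysqlclient) driver; the table layout, credentials and CSV columns are invented for illustration, and the central server would own the real schema.

    import csv
    import MySQLdb  # assumes the MySQLdb / mysqlclient driver is installed

    conn = MySQLdb.connect(host="sales-server", user="dept_user",
                           passwd="secret", db="company")    # hypothetical credentials
    cur = conn.cursor()

    with open("department_dump.csv") as f:                   # hypothetical per-department CSV
        rows = [(r["customer"], r["product"], float(r["amount"]))
                for r in csv.DictReader(f)]

    cur.executemany("INSERT INTO sales (customer, product, amount) VALUES (%s, %s, %s)", rows)
    conn.commit()
    conn.close()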

As your organization grows, having completely different departmental databases that are unaware of each other, and contain potentially redundant or conflicting data, is going to become very painful.

Answered by eric.christensen

Does the machine this process runs on have sufficient memory and bandwidth to handle this efficiently? Putting MySQL on a slow machine and recoding the tool to use MySQL rather than text files could potentially be far more costly than simply adding memory or upgrading the machine.

Answered by asdfsadf

Take a look at mongodb.

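A minimal sketch of the MongoDB route with the pymongo driver; the database, collection and field names are placeholders, and it assumes a MongoDB server reachable on localhost.

    from pymongo import MongoClient

    client = MongoClient("localhost", 27017)   # assumes a local MongoDB server
    sales = client.company.sales               # hypothetical database/collection names

    # Each CSV row becomes one document; no fixed schema is required.
    sales.insert_one({"customer": "ACME", "product": "widget", "amount": 250.0})

    # Records can be filtered/linked without loading everything into memory.
    for doc in sales.find({"customer": "ACME"}):
        print(doc)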

Answered by Khelben

If you have that problem with a CSV file, maybe you can just pickle the dictionary and generate a "binary" pickle file with the pickle.HIGHEST_PROTOCOL option. It can be faster to read and you get a smaller file. You can load the CSV file once and then generate the pickled file, allowing faster loading on later accesses.

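That approach is only a few lines; the file names and the load_csv_into_dict() helper below are placeholders for your existing code.

    import pickle

    # One-off step: parse the weekly CSV once and cache the result as a binary pickle.
    data = load_csv_into_dict("weekly_dump.csv")     # hypothetical: existing parsing code
    with open("weekly_dump.pkl", "wb") as f:
        pickle.dump(data, f, pickle.HIGHEST_PROTOCOL)

    # Every later run: loading the pickle is much faster than re-parsing the CSV.
    with open("weekly_dump.pkl", "rb") as f:
        data = pickle.load(f)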

Anyway, with 900 MB of information, you're going to spend some time loading it into memory. Another approach is not to load it all in one step, but to load only the information that is needed, perhaps splitting the data into different files by date or by some other category (company, type, etc.).

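A rough sketch of that partitioning idea, splitting the weekly dump into one pickle per category so that each script only loads the slice it needs; the "company" column and the file-naming scheme are invented for illustration.

    import csv
    import pickle
    from collections import defaultdict

    # One-off step: split the big CSV into one pickle per company (hypothetical key).
    by_company = defaultdict(list)
    with open("weekly_dump.csv") as f:
        for row in csv.DictReader(f):
            by_company[row["company"]].append(row)

    for company, rows in by_company.items():
        with open("sales_%s.pkl" % company, "wb") as f:
            pickle.dump(rows, f, pickle.HIGHEST_PROTOCOL)

    # Later, a script interested in one company loads only that slice.
    with open("sales_ACME.pkl", "rb") as f:
        acme_rows = pickle.load(f)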