database: How to store 7.3 billion rows of market data (optimized to be read)?
Original link: http://stackoverflow.com/questions/9815234/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share them, but you must attribute them to the original authors (not me): StackOverflow
How to store 7.3 billion rows of market data (optimized to be read)?
Asked by Victor
I have a dataset of 1-minute data for 1000 stocks since 1998, totalling around (2012-1998)*(365*24*60)*1000 = 7.3 billion rows.
Most (99.9%) of the time I will perform only read requests.
What is the best way to store this data in a db?
- 1 big table with 7.3B rows?
- 1000 tables (one for each stock symbol) with 7.3M rows each?
- any recommendation of a database engine? (I'm planning to use Amazon RDS' MySQL)
I'm not used to dealing with datasets this big, so this is an excellent opportunity for me to learn. I would greatly appreciate your help and advice.
Edit:
This is a sample row:
'XX', 20041208, 938, 43.7444, 43.7541, 43.735, 43.7444, 35116.7, 1, 0, 0
Column 1 is the stock symbol, column 2 is the date, column 3 is the minute, the rest are open-high-low-close prices, volume, and 3 integer columns.
Most of the queries will be like "Give me the prices of AAPL between April 12 2012 12:15 and April 13 2012 12:52"
About the hardware: I plan to use Amazon RDS, so I'm flexible on that.
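For a rough feel of the raw data volume, one can sketch the sample row as a fixed-width binary record; the field widths below are illustrative assumptions, not something specified in the question:

```c
#include <stdint.h>

/* Fixed-width record mirroring the sample row:
 * 'XX', 20041208, 938, 43.7444, 43.7541, 43.735, 43.7444, 35116.7, 1, 0, 0
 * Field widths here are assumptions for illustration. */
#pragma pack(push, 1)
typedef struct {
    char    symbol[2];    /* stock symbol, e.g. "XX" */
    int32_t date;         /* YYYYMMDD, e.g. 20041208 */
    int16_t minute;       /* minute of day, e.g. 938 */
    float   open, high, low, close;
    float   volume;
    int32_t extra[3];     /* the three integer columns */
} MinuteBar;
#pragma pack(pop)

/* Raw storage estimate in bytes, before any indexes or compression. */
uint64_t estimate_bytes(uint64_t rows) {
    return rows * sizeof(MinuteBar);
}
```

At 40 bytes per row, 7.3 billion rows come to roughly 290 GB of raw data, which frames the discussion in the answers below about whether a database is needed at all.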
Accepted answer by Charlie Martin
Tell us about the queries, and your hardware environment.
I would be very, very tempted to go NoSQL, using Hadoop or something similar, as long as you can take advantage of parallelism.
Update
Okay, why?
First of all, notice that I asked about the queries. You can't -- and we certainly can't -- answer these questions without knowing what the workload is like. (I'll coincidentally have an article about this appearing soon, but I can't link it today.) But the scale of the problem makes me think about moving away from a Big Old Database, because:
My experience with similar systems suggests the access will either be big sequential (computing some kind of time series analysis) or very, very flexible data mining (OLAP). Sequential data can be handled better and faster sequentially; OLAP means computing lots and lots of indices, which will take either lots of time or lots of space.
If you're doing what are effectively big runs against a lot of data in an OLAP world, however, a column-oriented approach might be best.
If you want to do random queries, especially making cross-comparisons, a Hadoop system might be effective. Why? Because:
- you can better exploit parallelism on relatively small commodity hardware;
- you can also better implement high reliability and redundancy;
- many of those problems lend themselves naturally to the MapReduce paradigm.
But the fact is, until we know about your workload, it's impossible to say anything definitive.
Answered by Andrew Tomazos
So databases are for situations where you have a large complicated schema that is constantly changing. You only have one "table" with a handful of simple numeric fields. I would do it this way:
Prepare a C/C++ struct to hold the record format:
struct StockPrice
{
    char ticker_code[2];
    double stock_price;
    timespec when;
    // ... remaining fields
};
Then calculate sizeof(StockPrice[N]), where N is the number of records. (On a 64-bit system) it should only be a few hundred gigs, and will fit on a $50 HDD.
Then truncate a file to that size and mmap it into memory (on Linux; or use CreateFileMapping on Windows):
// pseudo-code
file = open("my.data", WRITE_ONLY);
truncate(file, sizeof(StockPrice[N]));
void* p = mmap(file, WRITE_ONLY);
Cast the mmapped pointer to StockPrice*, and make a pass over your data, filling out the array. Close the mmap, and now you will have your data in one big binary array in a file that can be mmapped again later.
StockPrice* stocks = (StockPrice*) p;
for (size_t i = 0; i < N; i++)
{
    stocks[i] = ParseNextStock(stock_indata_file);
}
close(file);
You can now mmap it again read-only from any program and your data will be readily available:
file = open("my.data", READ_ONLY);
StockPrice* stocks = (StockPrice*) mmap(file, READ_ONLY);
// do stuff with stocks;
So now you can treat it just like an in-memory array of structs. You can create various kinds of index data structures depending on what your "queries" are. The kernel will deal with swapping the data to/from disk transparently so it will be insanely fast.
If you expect to have a certain access pattern (for example, contiguous dates) it is best to sort the array in that order so it will hit the disk sequentially.
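Pulling the steps above together, here is a minimal self-contained sketch of the read side on Linux; the field layout, file name, and the assumption that records are sorted by time are illustrative choices, not part of the answer:

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

typedef struct {
    char    ticker[8];     /* fixed-width symbol, e.g. "AAPL" */
    int64_t epoch_minute;  /* minutes since some epoch, sorted ascending */
    double  open, high, low, close, volume;
} StockPrice;

/* Map the data file read-only; returns the record array, or NULL on error. */
StockPrice *map_prices(const char *path, size_t *count) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return NULL; }
    *count = (size_t)st.st_size / sizeof(StockPrice);
    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  /* the mapping remains valid after the fd is closed */
    return p == MAP_FAILED ? NULL : (StockPrice *)p;
}

/* Index of the first record at or after minute t (binary search). */
size_t lower_bound_minute(const StockPrice *a, size_t n, int64_t t) {
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (a[mid].epoch_minute < t) lo = mid + 1; else hi = mid;
    }
    return lo;
}
```

A time-range query then reduces to two binary searches plus one sequential scan over the mapped region, which is exactly the access pattern the kernel's page cache handles well.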
Answered by Dan Dascalescu
I have a dataset of 1 minute data of 1000 stocks [...] most (99.9%) of the time I will perform only read requests.
Storing time-based numerical data once and reading it many times is a use case termed "time series". Other common time series are sensor data in the Internet of Things, server monitoring statistics, application events, etc.
This question was asked in 2012, and since then, several database engines have been developing features specifically for managing time series. I've had great results with InfluxDB, which is open source, written in Go, and MIT-licensed.
InfluxDB has been specifically optimized to store and query time series data. Much more so than Cassandra, which is often touted as great for storing time series:
Optimizing for time series involved certain tradeoffs. For example:
Updates to existing data are a rare occurrence and contentious updates never happen. Time series data is predominantly new data that is never updated.
Pro: Restricting access to updates allows for increased query and write performance
Con: Update functionality is significantly restricted
In open source benchmarks,
InfluxDB outperformed MongoDB in all three tests with 27x greater write throughput, while using 84x less disk space, and delivering relatively equal performance when it came to query speed.
Queries are also very simple. If your rows look like <symbol, timestamp, open, high, low, close, volume>, with InfluxDB you can store just that, then query easily. Say, for the sample query from the question:
SELECT open, close FROM market_data WHERE symbol = 'AAPL' AND time > '2012-04-12 12:15' AND time < '2012-04-13 12:52'
There are no IDs, no keys, and no joins to make. You can do a lot of interesting aggregations. You don't have to vertically partition the table as with PostgreSQL, or contort your schema into arrays of seconds as with MongoDB. Also, InfluxDB compresses really well, while PostgreSQL won't be able to perform any compression on the type of data you have.
Answered by Jon Skeet
Okay, so this is somewhat away from the other answers, but... it feels to me like if you have the data in a file system (one stock per file, perhaps) with a fixed record size, you can get at the data really easily: given a query for a particular stock and time range, you can seek to the right place, fetch all the data you need (you'll know exactly how many bytes), transform the data into the format you need (which could be very quick depending on your storage format), and you're away.
I don't know anything about Amazon storage, but if you don't have anything like direct file access, you could basically have blobs - you'd need to balance large blobs (fewer records, but probably reading more data than you need each time) with small blobs (more records giving more overhead and probably more requests to get at them, but less useless data returned each time).
Next you add caching - I'd suggest giving different servers different stocks to handle, for example - and you can pretty much just serve from memory. If you can afford enough memory on enough servers, bypass the "load on demand" part and just load all the files on start-up. That would simplify things, at the cost of slower start-up (which obviously impacts failover, unless you can afford to always have two servers for any particular stock, which would be helpful).
Note that you don't need to store the stock symbol, date or minute for each record - because they're implicit in the file you're loading and the position within the file. You should also consider what accuracy you need for each value, and how to store that efficiently - you've given 6 significant figures in your question, which you could store in 20 bits. Potentially store three 20-bit integers in 64 bits of storage: read it as a long (or whatever your 64-bit integer value will be) and use masking/shifting to get it back to three integers. You'll need to know what scale to use, of course - which you could probably encode in the spare 4 bits, if you can't make it constant.
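A sketch of the masking/shifting described above, assuming the prices are stored as scaled integers of at most 20 bits each, with the 4 spare bits holding a scale code (the exact bit layout is an illustrative choice):

```c
#include <stdint.h>

/* Pack three 20-bit values plus a 4-bit scale code into one 64-bit word:
 * bits 0-19 hold a, 20-39 hold b, 40-59 hold c, 60-63 hold the scale. */
uint64_t pack3(uint32_t a, uint32_t b, uint32_t c, uint32_t scale) {
    const uint64_t M20 = (1u << 20) - 1;
    return ((uint64_t)(a & M20))
         | ((uint64_t)(b & M20) << 20)
         | ((uint64_t)(c & M20) << 40)
         | ((uint64_t)(scale & 0xF) << 60);
}

void unpack3(uint64_t w, uint32_t *a, uint32_t *b, uint32_t *c, uint32_t *scale) {
    const uint64_t M20 = (1u << 20) - 1;
    *a = (uint32_t)(w & M20);
    *b = (uint32_t)((w >> 20) & M20);
    *c = (uint32_t)((w >> 40) & M20);
    *scale = (uint32_t)(w >> 60);
}
```

For instance, 43.7444 at four decimal places becomes 437444, which fits comfortably under the 20-bit limit of 1,048,575.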
You haven't said what the other three integer columns are like, but if you could get away with 64 bits for those three as well, you could store a whole record in 16 bytes. That's only ~110GB for the whole database, which isn't really very much...
EDIT: The other thing to consider is that presumably the stock doesn't change over the weekend - or indeed overnight. If the stock market is only open 8 hours per day, 5 days per week, then you only need 40 values per week instead of 168. At that point you could end up with only about 28GB of data in your files... which sounds a lot smaller than you were probably originally thinking. Having that much data in memory is veryreasonable.
EDIT: I think I've missed out the explanation of why this approach is a good fit here: you've got a very predictable aspect for a large part of your data - the stock ticker, date and time. By expressing the ticker once (as the filename) and leaving the date/time entirely implicit in the position of the data, you're removing a whole bunch of work. It's a bit like the difference between a String[] and a Map<Integer, String> - knowing that your array index always starts at 0 and goes up in increments of 1 up to the length of the array allows for quick access and more efficient storage.
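The "date/time implicit in the position" idea boils down to simple offset arithmetic. Assuming an 8-hour trading day (480 one-minute bars) and a 16-byte record - figures taken from this answer's estimates, not fixed facts - the byte offset within a per-symbol file could be computed as:

```c
#include <stdint.h>

enum { MINUTES_PER_DAY = 480, RECORD_SIZE = 16 };

/* Byte offset of the bar at 0-based trading-day index `day`
 * and minute index `minute` within that day. */
uint64_t record_offset(uint32_t day, uint32_t minute) {
    return ((uint64_t)day * MINUTES_PER_DAY + minute) * RECORD_SIZE;
}
```

With that, locating any bar is O(1): no index structure is needed at all, only a calendar mapping dates to trading-day indices.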
Answered by sarnold
Answered by Konstantin Tarkus
Here is an attempt to create a Market Data Server on top of the Microsoft SQL Server 2012 database, which should be good for OLAP analysis - a free open source project:
Answered by Richard
First, there aren't 365 trading days in the year: subtracting holidays and 52 weekends (104 days) leaves roughly 250 days, times the actual hours the market is open, as someone said. Also, using the symbol as the primary key is not a good idea, since symbols change; use a numeric k_equity_id mapped to the symbol (char), since symbols can be things like A or GAC-DB-B.TO. With that in your price-info data tables, your estimate of 7.3 billion is vastly over-calculated, since it's only about 1.7 million rows per symbol for 14 years.
k_equity_id k_date k_minute
and for the EOD table (which will be viewed 1000x more often than the other data)
k_equity_id k_date
Second, don't store your by-the-minute OHLC data in the same DB table as the EOD (end of day) table, since anyone wanting to look at a pnf or line chart over a year period has zero interest in the by-the-minute information.
Answered by aliasmrchips
Let me recommend that you take a look at Apache Solr, which I think would be ideal for your particular problem. Basically, you would first index your data (each row being a "document"). Solr is optimized for searching and natively supports range queries on dates. Your nominal query,
"Give me the prices of AAPL between April 12 2012 12:15 and April 13 2012 12:52"
"Give me the prices of AAPL between April 12 2012 12:15 and April 13 2012 12:52"
would translate to something like:
?q=stock:AAPL AND date:[2012-04-12T12:15:00Z TO 2012-04-13T12:52:00Z]
Assuming "stock" is the stock name and "date" is a "DateField" created from the "date" and "minute" columns of your input data on indexing. Solr is incredibly flexible and I really can't say enough good things about it. So, for example, if you needed to maintain the fields in the original data, you can probably find a way to dynamically create the "DateField" as part of the query (or filter).
Answered by Wolf5370
I think any major RDBMS would handle this. At the atomic level, one table with correct partitioning seems reasonable (partitioned based on your data usage, if fixed - this is likely to be either symbol or date).
You can also look into building aggregated tables for faster access above the atomic level. For example, if your data is at day level but you often query at the week or even month level, this can be pre-calculated in an aggregate table. In some databases this can be done through a cached view (various names for different DB solutions, but basically it's a view on the atomic data; once run, the view is cached/hardened into a fixed temp table that is queried for subsequent matching queries. This can be dropped at intervals to free up memory/disk space).
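As a sketch of such pre-aggregation (the struct and field names here are assumed for illustration), rolling minute bars up into a daily bar is a single pass: first open, max high, min low, last close, summed volume:

```c
#include <stddef.h>

typedef struct { double open, high, low, close, volume; } Bar;

/* Aggregate n minute bars (in time order, n >= 1) into one daily bar. */
Bar aggregate_day(const Bar *m, size_t n) {
    Bar d = { m[0].open, m[0].high, m[0].low, m[n - 1].close, 0.0 };
    for (size_t i = 0; i < n; i++) {
        if (m[i].high > d.high) d.high = m[i].high;
        if (m[i].low < d.low)   d.low = m[i].low;
        d.volume += m[i].volume;
    }
    return d;
}
```

Running this once per symbol per day shrinks the dataset queried by chart-style requests by a factor of several hundred, which is the whole point of the aggregate table.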
I guess we could help you more if we had some idea of the data usage.
Answered by Stephan Eggermont
You should compare the slow solutions with a simple, optimized in-memory model. Uncompressed, it fits in a 256 GB RAM server. A snapshot fits in 32 K, and you just index it positionally on datetime and stock. Then you can make specialized snapshots, as the open of one often equals the close of the previous.
[edit] Why do you think it makes sense to use a database at all (RDBMS or NoSQL)? This data doesn't change, and it fits in memory. That is not a use case where a DBMS can add value.


