Huge inserts, no updates, moderate reads with billions of rows: which datastore?

Our system needs 10 million inserts per day (structured data) and storage for up to 3 months, which adds up to roughly 900 million records, after which we can purge older records. No updates are required, and it should support simple queries (e.g. filters on particular columns, sorted by date). Which data storage solution is efficient for this case? We suspect an RDBMS will be slow with that many records.

MySQL might work, if you can optimize the way you insert.
http://dev.mysql.com/doc/refman/5.7/en/optimizing-innodb-bulk-data-loading.html
Of course, it depends on whether your 10M daily inserts are evenly spread or arrive in peaks.
If you have peaks of very high TPS, I would recommend Aerospike.
MongoDB can also work, but you will need to use sharding.
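As a minimal sketch of the bulk-loading settings that page describes (session-level toggles plus multi-row inserts; the events table and its columns are hypothetical):

-- Session-level switches from the InnoDB bulk-loading guide:
SET autocommit = 0;          -- group many inserts into one transaction
SET unique_checks = 0;       -- defer secondary unique-index checks during the load
SET foreign_key_checks = 0;  -- skip FK validation during the load

-- Multi-row inserts amortize round trips and statement parsing:
INSERT INTO events (device_id, recorded_at, payload) VALUES
  (1, '2018-01-01 00:00:00', 'a'),
  (2, '2018-01-01 00:00:01', 'b'),
  (3, '2018-01-01 00:00:02', 'c');

COMMIT;

-- Restore the defaults once the load finishes:
SET unique_checks = 1;
SET foreign_key_checks = 1;
SET autocommit = 1;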

Related

Partitioned tables in BigQuery

I was wondering what the point of using a partitioned table in BigQuery is. Most queries seem to take about the same time to finish regardless of size (ignoring extremes; I'm generalizing). Is this mainly a matter of reducing costs on the bytes processed, or what is the main use case of partitioning tables in BQ?
https://cloud.google.com/bigquery/docs/creating-column-partitions
There are multiple benefits, mainly cost savings.
By writing a query that reads only, say, 7 days of partitions instead of 7 years, you process fewer bytes and pay less (see the query sketch after this list).
Partitions you haven't modified for 90 days are billed at the cheaper long-term storage rate.
You can reload a single day's data much more easily, without workarounds.
Yearly tables (e.g. mytable_2018) are still recommended, but you are no longer required to keep daily tables (e.g. mytable_20180101). This leads to simpler queries, and you no longer run into the hard limit of 1,000 tables referenced per query.
When you modify the schema, you only need to alter a few tables, instead of scripting ALTERs across thousands of them.
It also means fewer bytes processed, which the platform can optimize better and serve with fewer resources.
By reorganizing data into partitioned tables, query times will benefit in the future: as customers move their data, the cloud engineering team will optimize the service for that usage.
The cost benefits are clearly visible once your existing data is at least a couple of terabytes.
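As a hedged sketch of the cost point above (dataset, table, and column names are made up; assume events is partitioned on its event_date column):

SELECT user_id, event_type
FROM `mydataset.events`
WHERE event_date BETWEEN DATE '2018-01-01' AND DATE '2018-01-07';
-- Only the 7 matching daily partitions are scanned and billed,
-- not the whole table.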

Amazon Redshift large table VACUUM REINDEX issue

My table is 500 GB with 8+ billion rows, INTERLEAVED SORTED on 4 keys.
One of the keys has a large skew (680+). Running VACUUM REINDEX is taking very long, about 5 hours per billion rows.
When I track the vacuum progress, it says the following:
SELECT * FROM svv_vacuum_progress;
 table_name    | status                                                                  | time_remaining_estimate
---------------+-------------------------------------------------------------------------+-------------------------
 my_table_name | Vacuum my_table_name sort (partition: 1761 remaining rows: 7330776383) | 0m 0s
I am wondering how long it will take to finish, as it is not giving any time estimate either. It's currently processing partition 1761... is it possible to know how many partitions a certain table has? (These appear to be lower-level storage partitions internal to Redshift.)
These days, the recommendation is not to use interleaved sorting.
The sort algorithm places a tremendous load on the VACUUM operation, and the benefits of interleaved sorts apply only to a narrow set of use cases.
I would recommend you change to a compound sort on the fields most commonly used in WHERE clauses.
The most efficient sorts are those involving date fields that are always incrementing. For example, imagine a situation where rows are added to the table with a transaction date. All new rows have a date greater than the previous rows. In this situation, a VACUUM is not actually required because the data is already sorted according to the Date field.
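A minimal sketch of that change (all names here are hypothetical; one common route is a deep copy into a new table, since the data then arrives already sorted and needs no VACUUM REINDEX):

-- Recreate the table with a compound sort key on the common filter columns:
CREATE TABLE my_table_name_new (
    daytime     TIMESTAMP,
    customer_id BIGINT,
    payload     VARCHAR(256)
)
COMPOUND SORTKEY (daytime, customer_id);

-- Deep copy in sort-key order, then swap the tables:
INSERT INTO my_table_name_new
SELECT daytime, customer_id, payload
FROM my_table_name
ORDER BY daytime, customer_id;

ALTER TABLE my_table_name RENAME TO my_table_name_old;
ALTER TABLE my_table_name_new RENAME TO my_table_name;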
Also, please realise that 500 GB is actually a LOT of data. Doing anything that rearranges that amount of data will take time.
If your vacuum is running slowly, you probably don't have enough free space on the cluster. I suggest temporarily doubling the number of nodes while you run the vacuum.
You might also want to think about changing how your schema is set up. It's worth going through this list of Redshift tips to see if you can change anything:
https://www.dativa.com/optimizing-amazon-redshift-predictive-data-analytics/
The way we recovered to the previous state was to drop the table and restore it from a backup snapshot taken before the VACUUM REINDEX.

Why Amazon Redshift UNLOAD performance is much better for fresh data?

I wonder why unloading from a big table (>100 billion rows), when selecting by a column which is NOT a sort key or part of a sort key, is immensely faster for newly added data. How does Redshift know when to stop the sequential scan in the second scenario?
Time the query spent executing: 39m 37.02s
UNLOAD ('SELECT * FROM production.some_table WHERE daytime BETWEEN \'2017-01-15\' AND \'2017-01-16\'') TO ...

vs.

Time the query spent executing: 23.01s
UNLOAD ('SELECT * FROM production.some_table WHERE daytime BETWEEN \'2017-06-24\' AND \'2017-06-25\'') TO ...
Thanks!
Amazon Redshift uses zone maps to identify the minimum and maximum value stored in each 1 MB block on disk. Each block stores data for a single column only (e.g. daytime).
If the SORTKEY is not set to daytime, then the data is unsorted and any particular date could appear in many different blocks. If SORTKEY is used, then a particular date will only appear in a minimum number of blocks.
Your second query possibly executes faster, even without a SORTKEY, because you are querying data that was probably added recently and is therefore all stored together in just a few blocks. The historical data might be spread in many blocks because a VACUUM probably reordered the data based upon the correct SORTKEY. In fact, if you did a VACUUM now, you might find that your second query becomes slower.
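You can inspect those zone maps directly. A hedged sketch against the STV_BLOCKLIST system table (assuming daytime is the first column of the table, i.e. col = 0; minvalue/maxvalue are raw bigint encodings, but the spread of values across blocks is still visible):

SELECT b.blocknum, b.minvalue, b.maxvalue
FROM stv_blocklist b
JOIN stv_tbl_perm p
  ON b.tbl = p.id AND b.slice = p.slice
WHERE p.name = 'some_table'
  AND b.col = 0          -- zero-based column index of daytime (assumed)
ORDER BY b.blocknum;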

CloudSearch performance with frequent updates of small batches

I have a use case where I need to upload small document batches (typically 1 to 10 documents of 1 KB each) to CloudSearch. Every 2 or 3 seconds a new batch is uploaded. The CloudSearch docs for bulk uploads say:
Make sure your batches are as close to the 5 MB limit as possible. Uploading a larger amount of smaller batches slows down the upload and indexing process.
A 30-second delay before the documents show up in search results is acceptable. Will my implementation keep working well as my document count increases, let's say to 500,000 docs?
Indexing time should be well under your 30 second SLA even with 500k docs, regardless of how or whether you batch your submissions.
I say this based on my own testing with an index of 300k docs and 38 index fields on an m1.small instance type, where it takes less than 3 seconds for a document to be searchable. There are a lot of variables that could affect your own situation, such as how many index fields you have, your instance size, etc., but I think my setup reflects fairly unfavorable conditions (m1.small instance with a complex indexing schema) and is still an order of magnitude faster than your SLA. It's anecdotal evidence of course, but you should be fine.

Storing Time Series in AWS DynamoDB

I would like to store 1M+ different time series in Amazon's DynamoDB. Each time series will have about 50K data points. A data point comprises a timestamp and a value.
The application will add new data points to time series frequently (all the time) and will retrieve time series (usually the whole series) from time to time, for analytics.
How should I structure the database? Should I create a separate table for each time series, or should I put all data points in one table?
Assuming your data is immutable and given the size, you may want to consider Amazon Redshift; it is designed for petabyte-scale reporting workloads.
In Dynamo, I can think of a few viable designs. In the first, you could use one table with a compound hash/range key (both strings). The hash key would be the time series name; the range key would be the timestamp as an ISO8601 string (which has the pleasant property that alphabetical ordering is also chronological ordering); and there would be an extra 'value' attribute on each item. This gives you the ability to select everything from a time series (Query on hashKey equality) and a subset of a time series (Query on hashKey equality and a rangeKey BETWEEN clause).
However, your main problem is the "hotspot" problem: internally, Dynamo partitions your data by hashKey and disperses your ProvisionedReadCapacity over all your partitions. So you may have 1000 KB of reads a second, but if you have 100 partitions, then you have only 10 KB a second for each partition, and reading all data from a single time series (a single hashKey) will only hit one partition. So you may think your 1000 KB of reads gives you 1 MB a second, but if you have 10 MB stored, it might take much longer to read, as your single partition will throttle you much more heavily.
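As a sketch of those two access patterns: DynamoDB's SQL-compatible PartiQL dialect arrived later than this design, but it expresses them compactly (table and attribute names here are hypothetical):

-- Whole series: Query on hash key equality
SELECT ts, val FROM "timeseries" WHERE series = 'sensor-42';

-- Subset of a series: hash key equality plus a range-key BETWEEN;
-- ISO8601 strings sort chronologically, so BETWEEN works on the range key
SELECT ts, val FROM "timeseries"
WHERE series = 'sensor-42'
  AND ts BETWEEN '2017-01-01T00:00:00Z' AND '2017-01-31T23:59:59Z';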
On the upside, DynamoDB has an extremely high but costly upper-bound on scaling; if you wanted you could pay for 100,000 Read Capacity units, and have sub-second response times on all of that data.
Another theoretical design would be to store every time series in a separate table, but I don't think DynamoDB is meant to scale to millions of tables, so this is probably a no-go.
You could try to spread your time series across 10 tables, where "highly read" data goes in table 1, "almost never read" data in table 10, and everything else somewhere in between. This would let you "game" the provisioned throughput / partition throttling rules, but at a high degree of design complexity. Overall, it's probably not worth it: where do you put new time series? How do you remember where they all are? How do you move a time series?
From my own experience, I think DynamoDB supports some internal "bursting" on these kinds of reads, so it's possible my numbers are off and you will get adequate performance. However, my verdict is to look into Redshift.
How about dumping each time series into JSON or similar and storing it in S3? At most you'd need a lookup table somewhere like Dynamo.
You may still need Redshift to process your inputs.