Aws RedShift sampling - amazon-web-services

For a data quality check I need to collect data in a specific interval.
Some tables are huge in size.
Is there any hack to do this without affecting the performance?
Like select 100 rows randomly.

How random do you need? The classic way to do this is with "WHERE RANDOM() < .001". If you need it to give you a repeatable "random" set then you can add a seed. The issue is that your tables are huge and this means reading (scanning) every row from disk just to throw most of them away and since table scan can take a significant time this isn't what you want to do.
So you may want to take advantage of Redshift "limited table scan" capabilities as part of your "random" sampling. (The fastest data to read from disk is the data you don't read from disk.) The issue here is that this solution will depend on your table sort keys and ordering which will push the solution into even "more pseudo" random territory (less of a true random sampling). In many cases this isn't a big deal but if the statistics really matter then this may not work for you.
This is done by sampling "blocks", not rows, based on the sort key(s). This sampling of blocks can be done randomly and each block of data will represent about 250K rows (based on sort key data type, compression etc. and COULD range anywhere from <100K rows to 2M rows). Doing this process will take a little inspection of STV_BLOCKLIST. The storage quanta for Redshift is the 1MB block and each and every block's metadata in the system can be referenced in STV_BLOCKLIST. This system table contains min and max values for each block. First find all the blocks for the sort key for the table in question. Next pick a random sample of these blocks (and if you are still dealing with a lot of data make sure that this sampling picks an even number from across all the slices to avoid execution skew).
Now the trick is to translate these min a max metadata values into a WHERE clause the performs the desired sampling. These min and max values are BIGINTs and are hashed from the data in the sort key column. This hash is data type dependent. If the data type is BIGINT then the has is quite simple - if the data type is timestamp then it is a bit more complex. But the ordering will be preserved across the hashing function for the data type involved. Reverse engineering this hash isn't hard - just perform a few experiments - but I can help if you tell me the type involved as I've done this for just about every data type at this point.
You can even do a random sampling rows on top of this random sampling of blocks. Or if you want you can just pick some narrow ranges of the sort key value and then randomly sample row and avoid all this reverse engineering business. The idea is to use Redshift "reduced scan" capability to greatly reduce the amount of data read from disk. To do this you need to be metadata aware in your choice of sampling windows which often means a sort key where clause. This is all about understanding how the database engine works and using its capabilities to your advantage.
I understand that this answer is based on some unstated information so please reach out in a comment if something isn't clear.

Related

Redshift Query Performance to reduce CPU utilisation

I want to take a general Idea of how I can optimise the query performance in redshift Database, I have Huge queries with lots of joins , I do understand using sort and Dist key it can be achieved but is there a method which we can follow in order to get some optimal results.
What to look in a table and how to approach query optimisation in redshift?
What are the necessary steps to look for or approach in order to have a certain plan for optimisation?
Any guidance will help a lot
Having improved many queries on Redshift there are a few things I can point you towards. First let me list a few tools / techniques to make sure you have these in your toolbox.
Ability to read and EXPLAIN plan and find expected costly points
Know where to find the query "actual" execution report
Know the system tables to find join, distribution, and disk io reports
So with those understood let's look at where many queries go sideways on Redshift. I will try to list these out in pareto order but any of these, or combos, can create significant issue.
#1 - Fat in the middle queries. When joining it is possible to expand the number of rows being operated upon many fold. Cross joining is a clear way this can happen but isn't how this usually happens. If the join on conditions create a many to many join pattern the number of rows can expand. When the table sizes are very large and the "multiplication" can make absurd data sizes. The explain plan can show this but not always - use of DISTINCT and GROUP BY can "hide" the true size of the dataset in play. Performing a SELECT COUNT(*) on your join tree can help show how big this is. You may also may need to look a pieces of the join tree if a later join is collapsing the rows (failure of the query optimizer?). Redshift is a columnar database and not well set up for the creation of data - this includes during the execution of query.
#2 - Distribution of large amounts of data. Redshift is a cluster and the node are connected together by ethernet cables and these connections are the slowest part of the cluster. A lot of work is done by the query optimizer to minimize the amount of data that needs to move around the network. However, it doesn't know your data as well as you do and doesn't always do this well. Look at the type of joins you are getting - is distribution needed? how much data is being distributed? Also, group by (and window functions) need to combine rows and therefore may need redistribution to complete. How big are the data sets entering your aggregation steps?
Moving a lot of data around the network will be slow. The difficulty is that it isn't always clear how to reduce this movement. Large join trees like you say you have can do "odd" things when it comes to the resulting distribution of the "joined" data. Joins are performed one at a time and the order these happen can matter. The query optimizer is making a number of decisions about the order of joins and how to organize the resulting data from each join. The choices it makes is based on what it sees in the table metadata so completeness of metadata matters. WHERE conditions can also impact the optimizer's choices. There are just way to many interactions to itemize them out here. Best advice is to look at the performance per step and see if data distribution is a factor. Then work to control how data is distributed in the query's execution. This may mean changing the join trees or even decomposing the query into several with temp table that have distribution set so that data movement is minimized.
#3 Excessive IO traffic - While not as slow as the networks, the disk IO subsystem is often a bottleneck. This shows up in a few ways. Are you reading more data from disk than is needed? (Metadata up to date?) Do you need a redundant WHERE clause to eliminate data? (Redundant WHERE clause is one that isn't needed functionally but is added so Redshift can perform the metadata comparisons that will reduce data read at scan.) Data spill is another way that disk IO can be strained (this goes back to #1). If data needs to spill to disk it can bring the disk IO performance down considerably. Use your metadata and Where clauses well.
Now these 3 areas often team up to kill your performance. Read too many rows from your tables, join all these extra rows together across the network while also making many new rows. This data doesn't fit in memory so now Redshift needs to spill to disk to complete the query. Things slow down real fast in these conditions.
Lastly these factors I've listed are cluster wide "resources" of Redshift. If one query take up a lot of one of these then there is less for other queries running at the same time. What often happens is that the query writers on a cluster follow similar patterns (good or bad) and when their pattern is costly on one axis then many of their queries are costly on the same axis. This shows up as queries that work "ok" when run in isolation but very badly when others are using the cluster. This generally means that many queries are contributing to pushing the cluster "over the edge" on some limited resource. There are system tables that you can look at to see aggregated IO or network traffic to see these effects.
Good queries are:
Don't make a lot of new "rows" during execution (not fat in the middle)
Keep large data sets "on node" and only redistribute data once the data has been pared down significantly
Don't read more data from disk than is necessary and don't spill
The problem is that doing all of these isn't always possible the trick is to not over subscribe the cluster resources you have.

How does Amazon Redshift reconstruct a row from columnar storage?

Amazon describes columnar storage like this:
So I guess this means in what PostgreSQL would call the "heap", blocks contain all the values for one column, then the next column, and so on.
Say I want to query for all people in their 30's, and I want to know their names. So columnar storage means less IO is required to read just the age of every row and find those that are 30-something, because all the other columns don't need to be read. Also maybe some efficient compression can be applied. That's neat, I guess.
Then what? This data structure alone doesn't explain how anything useful can happen after that. After determining what records are 30-something, how are the associated names found? What data structure is used? What are its performance characteristics?
If the Age column is the Sort Key, then the rows in the table will be stored in order of Age. This is great, because each 1MB storage block on disk keeps data for only one column, and it keeps note of the minimum and maximum values within the block.
Thus, searching for the rows that contain an Age of 30 means that Redshift can "skip over" blocks that do not contain Age=30. Since reading from disk is the slowest part of a database, this means it can operate much faster.
Once it has found the blocks that potentially contain Age=30, it reads those blocks from disk. Blocks are compressed, so they might contain much more data than the 1MB on disk. This means many rows can be read with fewer disk accesses.
Once those blocks are decompressed into memory, it finds the rows with Age=30 and then loads the corresponding blocks for the Name column. The compression ratio would be different for the Name column since it is text and is not sorted, so this might result in loading more blocks from disk for Name than for Age.
Redshift then assembles the data from Name and Age for the desired rows and performs any remaining operations.
These operations are also parallelized across multiple nodes based on the Distribution Key, which distributed data based on a given column (or replicates it between nodes for often-used tables). Data is typically distributed based upon a column that is frequently used in JOIN statements so that similar data is co-located on the same node. Each node returns its data to the Leader Node, which combines the data and provides the final results.
Bottom line: Minimise the amount of data read from disk and parallelize operations on separate nodes.
AFAIK every value in the columnar storage has an ID pointer (similar to CTID you mentioned), and to get the select results Redshift needs to find and combine the values with the same ID pointer for each column that's selected from the raw data. If memory allows it's stored in memory, unless it's spilling to disk. This process is called materialization (don't confuse with materialized view materialization). In your case there are 2 technically possible scenarios:
materialize all Age/Name pairs, then filter by Age=30, and output the result
filter Age column by Age=30, get IDs, get Name values with corresponding IDs, materialize pairs and output
I guess in this case #2 is what happens because materialization is more expensive than filtering. However, there is a plenty of scenarios where this is much less obvious (with complex queries and aggregations). It is the responsibility of the query optimizer to decide what's better. #1 is still better than the row oriented because it would still read just 2 columns.

Application for filtering database for the short period of time

I need to create an application that would allow me to get phone numbers of users with specific conditions as fast as possible. For example we've got 4 columns in sql table(region, income, age [and 4th with the phone number itself]). I want to get phone numbers from the table with specific region and income. Just make a sql query won't help because it takes significant amount of time. Database updates 1 time per day and I have some time to prepare data as I wish.
The question is: How would you make the process of getting phone numbers with specific conditions as fast as possible. O(1) in the best scenario. Consider storing values from sql table in RAM for the fastest access.
I came up with the following idea:
For each phone number create smth like a bitset. 0 if the particular condition is false and 1 if the condition is true. But I'm not sure I can implement it for columns with not boolean values.
Create a vector with phone numbers.
Create a vector with phone numbers' bitsets.
To get phone numbers - iterate for the 2nd vector and compare bitsets with required one.
It's not O(1) at all. And I still don't know what to do about not boolean columns. I thought maybe it's possible to do something good with std::unordered_map (all phone numbers are unique) or improve my idea with vector and masks.
P.s. SQL table consumes 4GB of memory and I can store up to 8GB in RAM. The're 500 columns.
I want to get phone numbers from the table with specific region and income.
You would create indexes in the database on (region, income). Let the database do the work.
If you really want it to be fast I think you should consider ElasticSearch. Think of every phone in the DB as a doc with properties (your columns).
You will need to reindex the table once a day (or in realtime) but when it's time to search you just use the filter of ElasticSearch to find the results.
Another option is to have an index for every column. In this case the engine will do an Index Merge to increase performance. I would also consider using MEMORY Tables. In case you write to this table - consider having a read replica just for reads.
To optimize your table - save your queries somewhere and add index(for multiple columns) just for the top X popular searches depends on your memory limitations.
You can use use NVME as your DB disk (if you can't load it to memory)

Google Cloud Datastore - Scalability of indexing a short list of enums

Is it an issue in Datastore to index a property that can only have 4-5 possible values? Would this lead to tablet hotspots?
I am thinking of a property with an enum of string values like "done", "working", "complete". The reason for indexing such a property would be so you can create a composite index that let's you query on all entities that are "done" for example.
Yes, it would be an issue if/when you have high rates of queries using these composite indexes you mentioned, listed in Indexes:
Do not index properties with monotonically increasing values (such as a NOW() timestamp). Maintaining such an index could lead to
hotspots that impact Cloud Datastore latency for applications with
high read and write rates. For further guidance on dealing with
monotonic properties, see High read/write rates for a narrow key
range below.
You would also have a tablet hotspot problem if/when you hit high rates of datastore writes for entities with the same property value (for example 100s of entities becoming done per second) - another facet of the same problem. It's this case mentioned in High read/write rates to a narrow key range:
You will also see this problem if you create new entities at a high rate with a monotonically increasing indexed property like a
timestamp, because these properties are the keys for rows in the index
tables in Bigtable.
TLDR: It scales so long as entity keys are scattered.
DR:
Lets first consider the index entries being written.
We have something like:
SomeKind\E1 -> FullEntityKey1
SomeKind\E2 -> FullEntityKey2
SomeKind\E2 -> FullEntityKey3
SomeKind\E3 -> FullEntityKey4
We note that each individual index entry points to some entity.
As far as the load sharding is concerned the value being sharded is like the following:
SomeKind\E1\FullEntityKey1
SomeKind\E2\FullEntityKey2
SomeKind\E2\FullEntityKey3
SomeKind\E3\FullEntityKey4
Now lets imagine we were using randomly allocated ids for the entity keys (range [0,2] to be simple) -- we assume even distribution of writes across the random entity ids.
SomeKind\E1\0\RestOfKey1
SomeKind\E2\0\RestOfKey2
SomeKind\E2\1\RestOfKey3
SomeKind\E3\2\RestOfKey4
And then we can note that there are clear split points for the load to shard across -- that is every of the [0,2] possible random ids is a shard and the system can scale indefinitely so long as the writes are evenly distributed across the enties in SomeKind written (make the random id longer for more split points/scaling)
So the is index enum value scaling/hotspotting is highly associated with the entity keys being indexed, which are generally constructed in ways that are shardable which means that the associated index entries also are.
This is not to say that it is impossible to create situations in which hotspots may occur (for example, if the entity keys had a monotonically increasing value (like a timestamp)), or by targeting a small section of keys for a very high write rate -- but that shouldn't happen by default with typical traffic patterns and entity keys.

Storing Time Series in AWS DynamoDb

I would like to store 1M+ different time series in Amazon's DynamoDb database. Each time series will have about 50K data points. A data point is comprised of a timestamp and a value.
The application will add new data points to time series frequently (all the time) and will retrieve (usually the whole time series) time series from time to time, for analytics.
How should I structure the database? Should I create a separate table for each timeseries? Or should I put all data points in one table?
Assuming your data is immutable and given the size, you may want to consider Amazon Redshift; it's written for petabyte-sized reporting solutions.
In Dynamo, I can think of a few viable designs. In the first, you could use one table, with a compound hash/range key (both strings). The hash key would be the time series name, the range key would be the timestamp as an ISO8601 string (which has the pleasant property that alphabetical ordering is also chronological ordering), and there would be an extra attribute on each item; a 'value'. This gives you the abilty to select everything from a time series (Query on hashKey equality) and a subset of a time series (Query on hashKey equality and rangeKey BETWEEN clause). However, your main problem is the "hotspot" problem: internally, Dynamo will partition your data by hashKey, and will disperse your ProvisionedReadCapacity over all your partitions. So you may have 1000 KB of reads a second, but if you have 100 partitions, then you have only 10 KB a second for each partition, and reading all data from a single time series (single hashKey) will only hit one partition. So you may think your 1000 KB of reads gives you 1 MB a second, but if you have 10 MB stored it might take you much longer to read it, as your single partition will throttle you much more heavily.
On the upside, DynamoDB has an extremely high but costly upper-bound on scaling; if you wanted you could pay for 100,000 Read Capacity units, and have sub-second response times on all of that data.
Another theoretical design would be to store every time series in a separate table, but I don't think DynamoDB is meant to scale to millions of tables, so this is probably a no-go.
You could try and spread out your time series across 10 tables where "highly read" data goes in table 1, "almost never read data" in table 10, and all other data somewhere in between. This would let you "game" the provisioned throughput / partition throttling rules, but at a high degree of complexity in your design. Overall, it's probably not worth it; where do you new time series? How do you remember where they all are? How do you move a time series?
I think DynamoDB supports some internal "bursting" on these kinds of reads from my own experience, and it's possible my numbers are off, and you will get adequete performance. However my verdict is to look into Redshift.
How about dripping each time series into JSON or similar and store in S3. At most you'd need a lookup from somewhere like Dynamo.
You still may need redshift to process your inputs.