I am using Amazon Redshift to store a relationship table connected to huge tables of logs.
The schema should look like:
CREATE TABLE public.my_table (
    id         INT IDENTITY(1,1),
    identifier INTEGER     NOT NULL ENCODE lzo DISTKEY,
    foreign_id VARCHAR(36) NOT NULL ENCODE runlength
)
SORTKEY(foreign_id);
My question is: Can I apply encoding to the column used as the DISTKEY (and, by extension, the SORTKEY) without breaking the logic behind the distribution and the sorting?
Does Redshift apply the DISTKEY and SORTKEY to the raw, unencoded values, or to the compressed values?
Yes, you can apply compression without fear of impacting the DISTKEY. Amazon Redshift will use the uncompressed values.
In fact, blocks are decompressed as soon as they are read from disk, so all operations are carried out on uncompressed data.
Just remember the golden rules:
Use DISTKEY on the column that is most often used in JOINs
Use SORTKEY on the columns that are most often used in WHERE clauses
Always compress data (fewer disk reads means faster access); automatic compression normally finds the best encoding method, as sketched below
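If you want Redshift itself to suggest encodings, a quick sketch (the S3 path and IAM role are placeholders):
-- Ask Redshift to recommend an encoding for each column of the table.
ANALYZE COMPRESSION public.my_table;
-- Or let COPY pick encodings automatically when loading into an empty table.
COPY public.my_table
FROM 's3://my-bucket/logs/'                                  -- placeholder S3 path
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'   -- placeholder IAM role
COMPUPDATE ON;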
After many days, I also managed to get a response from AWS staff on this subject:
1) Can you apply encoding to a column used as the DISTKEY (and, by extension, the SORTKEY) without breaking the logic behind the distribution and the sorting?
You can apply encoding to a distribution key column that is also a sort key. However, this is against our best practice recommendations, as we do not advise applying encoding to a sort key column. Since you mention that the DISTKEY is, by extension, also a SORTKEY, this would not be recommended. If the distribution key is not part of the sort key, then you may encode it.
2) Does it take the raw values without encoding into account when applying the DISTKEY and SORTKEY, or rather the compressed values?
The DISTKEY and SORTKEY algorithms are applied to the raw values. Compression happens only at the storage level, which means that during query execution it is one of the last steps when writing and one of the first steps before reading data. Looking at the example you have given, where you are using run length encoding on the SORT KEY, we state specifically in our guide that "We do not recommend applying runlength encoding on any column that is designated as a sort key." This is due to the fact that range-restricted scans might perform poorly if the sort key columns are compressed more highly than other columns. As a rule of thumb, we recommend that the sort key not be compressed, as this can result in sort key skew. If you have time, please have a look at one of our Redshift Deep Dive videos, where we discuss compression in detail and actually mention this rule of thumb.
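Based on that, a revised version of my schema would keep the SORTKEY column uncompressed (ENCODE raw) while leaving the other columns encoded; a sketch:
CREATE TABLE public.my_table (
    id         INT IDENTITY(1,1),
    identifier INTEGER     NOT NULL ENCODE lzo DISTKEY,
    foreign_id VARCHAR(36) NOT NULL ENCODE raw   -- leave the sort key uncompressed
)
SORTKEY(foreign_id);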
For a data quality check I need to collect data in a specific interval.
Some tables are huge in size.
Is there any hack to do this without affecting the performance?
For example, selecting 100 rows randomly.
How random do you need? The classic way to do this is with "WHERE RANDOM() < .001". If you need it to give you a repeatable "random" set then you can add a seed. The issue is that your tables are huge, and this means reading (scanning) every row from disk just to throw most of them away; since a table scan can take a significant time, this isn't what you want to do.
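A minimal sketch of that classic approach (the table name is a placeholder):
-- Roughly a 0.1% sample; still scans the whole table but returns few rows.
SELECT *
FROM big_log_table
WHERE RANDOM() < 0.001;
-- For a repeatable "random" sample, fix the seed for the session first.
SET SEED TO 0.25;
SELECT *
FROM big_log_table
WHERE RANDOM() < 0.001;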
So you may want to take advantage of Redshift's "limited table scan" capabilities as part of your "random" sampling. (The fastest data to read from disk is the data you don't read from disk.) The catch is that this solution will depend on your table's sort keys and ordering, which pushes the solution into even "more pseudo" random territory (less of a true random sampling). In many cases this isn't a big deal, but if the statistics really matter then this may not work for you.
This is done by sampling "blocks", not rows, based on the sort key(s). This sampling of blocks can be done randomly, and each block of data will represent about 250K rows (based on sort key data type, compression, etc., and COULD range anywhere from <100K rows to 2M rows). Doing this will take a little inspection of STV_BLOCKLIST. The storage quantum for Redshift is the 1MB block, and each and every block's metadata in the system can be referenced in STV_BLOCKLIST. This system table contains min and max values for each block. First find all the blocks for the sort key of the table in question. Next pick a random sample of these blocks (and if you are still dealing with a lot of data, make sure that this sampling picks an even number from across all the slices to avoid execution skew).
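A sketch of that inspection, assuming the table is called big_log_table and its sort key is the first column (ordinal 0; check PG_TABLE_DEF for the real ordinal):
-- One row per 1MB block of the sort key column, with that block's min/max metadata.
SELECT bl.slice, bl.blocknum, bl.minvalue, bl.maxvalue, bl.num_values
FROM stv_blocklist bl
JOIN stv_tbl_perm p
  ON bl.tbl = p.id
 AND bl.slice = p.slice
WHERE p.name = 'big_log_table'
  AND bl.col = 0              -- ordinal position of the sort key column
ORDER BY RANDOM()             -- pick a random sample of blocks
LIMIT 20;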
Now the trick is to translate these min and max metadata values into a WHERE clause that performs the desired sampling. These min and max values are BIGINTs and are hashed from the data in the sort key column. This hash is data type dependent. If the data type is BIGINT then the hash is quite simple - if the data type is timestamp then it is a bit more complex. But the ordering will be preserved across the hashing function for the data type involved. Reverse engineering this hash isn't hard - just perform a few experiments - but I can help if you tell me the type involved, as I've done this for just about every data type at this point.
You can even do a random sampling of rows on top of this random sampling of blocks. Or, if you want, you can just pick some narrow ranges of the sort key value, randomly sample rows within them, and avoid all this reverse engineering business. The idea is to use Redshift's "reduced scan" capability to greatly reduce the amount of data read from disk. To do this you need to be metadata aware in your choice of sampling windows, which often means a sort key WHERE clause. This is all about understanding how the database engine works and using its capabilities to your advantage.
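A simplified sketch of that variant, assuming a BIGINT sort key so the block metadata maps directly onto column values (the ranges below are placeholders standing in for sampled blocks):
-- Read only a few narrow sort-key ranges (so only a few blocks are scanned),
-- then sample rows within them.
SELECT *
FROM big_log_table
WHERE (sort_key_col BETWEEN 1000000 AND 1250000      -- range from one sampled block
    OR sort_key_col BETWEEN 8400000 AND 8650000)     -- range from another sampled block
  AND RANDOM() < 0.01;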
I understand that this answer is based on some unstated information so please reach out in a comment if something isn't clear.
Amazon describes columnar storage like this:
So I guess this means in what PostgreSQL would call the "heap", blocks contain all the values for one column, then the next column, and so on.
Say I want to query for all people in their 30's, and I want to know their names. So columnar storage means less IO is required to read just the age of every row and find those that are 30-something, because all the other columns don't need to be read. Also maybe some efficient compression can be applied. That's neat, I guess.
Then what? This data structure alone doesn't explain how anything useful can happen after that. After determining what records are 30-something, how are the associated names found? What data structure is used? What are its performance characteristics?
If the Age column is the Sort Key, then the rows in the table will be stored in order of Age. This is great, because each 1MB storage block on disk keeps data for only one column, and it keeps note of the minimum and maximum values within the block.
Thus, searching for the rows that contain an Age of 30 means that Redshift can "skip over" blocks that do not contain Age=30. Since reading from disk is the slowest part of a database, this means it can operate much faster.
Once it has found the blocks that potentially contain Age=30, it reads those blocks from disk. Blocks are compressed, so they might contain much more data than the 1MB on disk. This means many rows can be read with fewer disk accesses.
Once those blocks are decompressed into memory, it finds the rows with Age=30 and then loads the corresponding blocks for the Name column. The compression ratio would be different for the Name column since it is text and is not sorted, so this might result in loading more blocks from disk for Name than for Age.
Redshift then assembles the data from Name and Age for the desired rows and performs any remaining operations.
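As a concrete illustration of the walk-through above, with hypothetical table and column names:
-- With a sort key on age, the per-block min/max metadata lets Redshift skip
-- blocks that cannot contain ages 30-39, both when scanning the age column
-- and when fetching the corresponding name blocks.
SELECT name, age
FROM people
WHERE age BETWEEN 30 AND 39;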
These operations are also parallelized across multiple nodes based on the Distribution Key, which distributes data based on a given column (or replicates it between nodes for often-used tables). Data is typically distributed based upon a column that is frequently used in JOIN statements so that similar data is co-located on the same node. Each node returns its data to the Leader Node, which combines the data and provides the final results.
Bottom line: Minimise the amount of data read from disk and parallelize operations on separate nodes.
AFAIK every value in the columnar storage has an ID pointer (similar to the CTID you mentioned), and to get the SELECT results Redshift needs to find and combine the values with the same ID pointer for each column that's selected from the raw data. If memory allows, this is done in memory; otherwise it spills to disk. This process is called materialization (not to be confused with materialized view materialization). In your case there are two technically possible scenarios:
materialize all Age/Name pairs, then filter by Age=30, and output the result
filter Age column by Age=30, get IDs, get Name values with corresponding IDs, materialize pairs and output
I guess in this case #2 is what happens, because materialization is more expensive than filtering. However, there are plenty of scenarios where this is much less obvious (with complex queries and aggregations). It is the responsibility of the query optimizer to decide what's better. #1 is still better than the row-oriented approach because it would still read just 2 columns.
I was wondering if anyone is able to provide some insight into the chances of collisions when using FARM_FINGERPRINT in BigQuery to generate INT64 hashes to be used as Surrogate Keys on tables?
Going with a normal UUID increases storage of the key columns x4. I was thinking FARM_FINGERPRINT(GENERATE_UUID()) might provide an INT64 alternative. I know collisions are always a concern but reading the SMHasher output for FarmHash it looks like it could be an option as it is not showing any collision issues at present.
Other than the size, I have users concerned about join performance on STRING vs INT64 surrogate keys in BigQuery. I cannot find anything official that speaks to it to calm those fears. Hence I am considering this method to generate an INT64 hash.
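For reference, the approach I'm considering would look something like this (the dataset and table names are placeholders):
-- BigQuery: derive an INT64 surrogate key from a freshly generated UUID.
SELECT
  FARM_FINGERPRINT(GENERATE_UUID()) AS surrogate_key,
  s.*
FROM my_dataset.staging_table AS s;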
I have data I have to join at the record level. For example, data about users is coming in from different source systems, but there is no common primary key or user identifier.
Example Data
Source System 1:
{userid = 123, first_name="John", last_name="Smith", many other columns...}
Source System 2:
{userid = EFCBA-09DA0, fname="J.", lname="Smith", many other columns...}
There are about 100 rules I can use to compare one record to another to see if the customer in source system 1 is the same as the customer in source system 2.
Some rules may be able to infer record values and add data to a master record about a customer.
Because some rules may infer/add data to any particular record, the rules must be re-applied when a record changes.
We have millions of records per day that we'd have to unify.
Apache Beam / Dataflow implementation
An Apache Beam DAG is by definition acyclic, but I could just republish the data through PubSub to the same DAG to make it a cyclic algorithm.
I could create a PCollection of hashmaps that continuously does a self-join against all other elements, but this seems like an inefficient method.
Immutability of a PCollection is a problem if I want to be constantly modifying things as they go through the rules. This sounds like it would be more efficient with Flink Gelly or Spark GraphX.
Is there any way you know of to process such a problem efficiently in Dataflow?
Other thoughts
Prolog: I tried running on a subset of this data with a subset of the rules, but swi-prolog did not seem scalable, and I could not figure out how I would continuously emit the results to other processes.
JDrools/Jess/Rete: Forward chaining would be perfect for the inference and efficient partial application, but this algorithm is more about applying many many rules to individual records, rather than inferring record information from possibly related records.
Graph database: Something like neo4j or datomic would be nice, since joins are at the record level rather than row/column scans, but I don't know if it's possible in Beam to do something similar.
BigQuery or Spanner: Brute forcing these rules in SQL and doing full table scans per record is really slow. It would be much preferred to keep the graph of all records in memory and compute in-memory. We could also try to concat all columns and run multiple compare-and-update passes across all columns.
Or maybe there's a more standard way of solving this class of problems.
It is hard to say what solution works best for you from what I can read so far. I would try to split the problem further and try to tackle different aspects separately.
From what I understand, the goal is to combine together the matching records that represent the same thing in different sources:
records come from a number of sources:
it is logically the same data but formatted differently;
there are rules to tell if the records represent the same entity:
collection of rules is static;
So, the logic probably roughly goes like:
read a record;
try to find existing matching records;
if matching record found:
update it with new data;
otherwise save the record for future matching;
repeat;
To me this looks very high level and there's probably no single 'correct' solution at this level of detail.
I would probably try to approach this by first understanding it in more detail (maybe you already do); a few thoughts:
what are the properties of the data?
are there patterns? E.g. when one system publishes something, do you expect something else from other systems?
what are the requirements in general?
latency, consistency, availability, etc;
how is the data read from the sources?
can all the systems publish the records in batches in files, submit them into PubSub, does your solution need to poll them, etc?
can the data be read in parallel or is it a single stream?
then the main question, how you can efficiently match a record in general, will probably look different under different assumptions and requirements as well. For example, I would think about:
can you fit all data in memory;
are your rules dynamic? Do they change at all, and what happens when they do;
can you split the data into categories that can be stored separately and matched efficiently, e.g. if you know you can try to match some things by id field, some other things by hash of something, etc;
do you need to match against all of historical/existing data?
can you have some quick elimination logic to not do expensive checks?
what is the output of the solution? What are the requirements for the output?
I would like to use a UUID as a primary key in Cloud Spanner. What is the best way to read and write UUIDs? Is there a UUID type, or client library support?
The simplest solution is just to store it as a STRING in the standard RFC 4122 format. E.g.:
"d1a0ce61-b9dd-4169-96a8-d0d7789b61d9"
This will take 37 bytes to store (36 bytes plus a length byte). If you really want to save every possible byte, you could store your UUID as two INT64s. However, you would need to write your own libraries for serializing/deserializing the values, and they wouldn't appear very pretty in your SQL queries. In most cases, the extra ~21 bytes of savings per row is probably not worth it.
Note that some UUID generation algorithms generate the UUID sequentially based on a timestamp. If the UUID values generated by a machine are monotonically increasing, then this can lead to hot-spotting in Cloud Spanner (this is analogous to the anti-pattern of using timestamps as the beginning of a primary key), so it is best to avoid these variants (e.g. UUID version 1 is not recommended).
This Stack Overflow answer provides more details about the various UUID versions. (TL;DR: use version 4 with Cloud Spanner, since a pseudo-random number is used in the generation.)
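A minimal sketch of the STRING approach in GoogleSQL (the table and column names are hypothetical):
-- Store the UUID as a 36-character string primary key.
CREATE TABLE Users (
  UserId STRING(36) NOT NULL,
  Name   STRING(MAX),
) PRIMARY KEY (UserId);
-- Insert with a freshly generated version 4 UUID as the key.
INSERT INTO Users (UserId, Name)
VALUES (GENERATE_UUID(), 'Jane Example');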
As per Cloud Spanner documentation:
There are several ways to store the UUID as the primary key:
In a STRING(36) column.
In a pair of INT64 columns.
In a BYTES(16) column.