I was wondering if anyone is able to provide some insight into the chances of collisions when using FARM_FINGERPRINT in BigQuery to generate INT64 hashes to be used as Surrogate Keys on tables?
Going with a normal UUID increases the storage of the key columns roughly 4x. I was thinking FARM_FINGERPRINT(GENERATE_UUID()) might provide an INT64 alternative. I know collisions are always a concern, but the SMHasher results for FarmHash do not show any collision issues, so it looks like it could be an option.
Other than the size, I have users concerned about join performance on STRING vs INT64 surrogate keys in BigQuery, and I cannot find anything official that speaks to it to calm those fears. Hence I am considering this method to generate an INT64 hash.
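For a rough sense of the collision risk, here is a back-of-the-envelope birthday-bound estimate for a uniformly distributed 64-bit hash; it is a sketch, not BigQuery-specific, and the row counts below are made up for illustration:

import math

def collision_probability(n: int, bits: int = 64) -> float:
    # Birthday bound: probability of at least one collision among n
    # uniformly distributed `bits`-bit hashes is approximately n^2 / 2^(bits+1).
    return (n * (n - 1)) / (2 ** (bits + 1))

for n in (10_000_000, 100_000_000, 1_000_000_000):
    print(f"{n:>13,} keys -> p ~ {collision_probability(n):.2e}")

# Roughly: 10M keys -> ~2.7e-6, 100M keys -> ~2.7e-4, 1B keys -> ~2.7e-2.

At a billion keys the bound is around 2.7%, so whether this approach is acceptable depends on how many keys you expect over the table's lifetime.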
I am using Amazon Redshift to store a relationship table connected to huge tables of logs.
The schema should look like:
CREATE TABLE public.my_table (
id INT IDENTITY(1,1),
identifier INTEGER NOT NULL encode lzo DISTKEY,
foreign_id VARCHAR(36) NOT NULL encode runlength
)
SORTKEY(foreign_id);
My question is: can I apply encoding to the column used as DISTKEY (and, by extension, the SORTKEY) without breaking the logic behind the distribution and the sorting?
Does Redshift apply the DISTKEY and SORTKEY to the raw, unencoded values, or to the compressed values?
Yes, you can apply compression without fear of impacting the DISTKEY. Amazon Redshift will use the uncompressed values.
In fact, blocks are decompressed as soon as they are read from disk, so all operations are carried out on uncompressed data.
Just remember the golden rules:
Use DISTKEY on the column that is often used in a JOIN
Use SORTKEY on columns that are often used in WHERE
Always compress data (less disk reads means faster access) — and the automatic compression normally finds the best encoding method
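If you want to see which encodings Redshift itself would recommend for an existing table, one option is the ANALYZE COMPRESSION command. A minimal sketch using psycopg2; the connection details and table name are placeholders for your own cluster:

import psycopg2

# Placeholders: fill in your own cluster endpoint and credentials.
conn = psycopg2.connect(
    host="my-cluster.example.redshift.amazonaws.com",
    port=5439,
    dbname="mydb",
    user="myuser",
    password="mypassword",
)
conn.autocommit = True  # ANALYZE COMPRESSION cannot run inside a transaction block

with conn.cursor() as cur:
    # Sample rows and report the suggested encoding and estimated space
    # reduction for each column of public.my_table.
    cur.execute("ANALYZE COMPRESSION public.my_table COMPROWS 100000")
    for table, column, encoding, est_reduction_pct in cur.fetchall():
        print(f"{table}.{column}: suggested {encoding} ({est_reduction_pct}% smaller)")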
After many days, I also managed to get an AWS staff response on this subject:
1) Can you apply encoding to a column used as DISTKEY (and by extension SORTKEY) without breaking the logic behind the distribution and the sorting?
You can apply encoding to the distribution key column even when it is also a sort key. However, this is against our best-practice recommendations, as we do not advise applying encoding to a sort key column. Based on your question, since you mention that the DISTKEY could by extension also be a SORTKEY, that would not be recommended. If the distribution key is not part of the sort key, then you may encode it.
2) Does it take into account the raw values without encoding to apply the DISTKEY and SORTKEY or rather the compressed values?
The DISTKEY and SORTKEY algorithms are applied to the raw values. Compression happens only at the storage level, which means that during query execution it is one of the last steps when writing and one of the first steps when reading data. Looking at the example you have given, where you use run-length encoding on the SORTKEY, we state specifically in our guide that "We do not recommend applying runlength encoding on any column that is designated as a sort key." This is because range-restricted scans might perform poorly if the sort key columns are compressed more highly than other columns. As a rule of thumb, we recommend that the sort key not be compressed, as this can result in sort key skew. If you have time, please have a look at one of our Redshift Deep Dive videos, where we discuss compression in detail and actually mention this rule of thumb.
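Putting that guidance together, a sketch of the original DDL adjusted accordingly: compression stays on the DISTKEY (it is not part of the sort key here), while the SORTKEY column is left uncompressed (ENCODE RAW). The encodings are illustrative, not prescriptive:

# Revised DDL as a string; run it against Redshift with any connection,
# e.g. the psycopg2 connection from the sketch above.
REVISED_DDL = """
CREATE TABLE public.my_table (
    id         INT IDENTITY(1,1),
    identifier INTEGER NOT NULL ENCODE lzo DISTKEY,   -- dist key: encoding is fine, it is not a sort key
    foreign_id VARCHAR(36) NOT NULL ENCODE raw        -- sort key: left uncompressed per the guidance
)
SORTKEY(foreign_id);
"""

# with conn.cursor() as cur:
#     cur.execute(REVISED_DDL)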
I am referring to the read-only transaction with the min_read_timestamp option. As per the documentation, it executes all reads at a timestamp >= min_read_timestamp.
Assume I have a table with many millions of rows, but only a few rows have been written since a specific timestamp. What is the performance of such a read (given that only a few rows have been written since the timestamp)? Does Spanner internally maintain an index based on the write timestamp to improve the performance of such reads, or do I still have to rely on my own indexes to speed up the read?
As per the documentation, ReadOnlyTransaction with min_read_timestamp will pick a timestamp >= min_read_timestamp and will return all data which was updated at this timestamp or before.
So using it to get only the rows updated since min_read_timestamp is not the correct usage.
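To illustrate what min_read_timestamp actually does, a sketch with the Python client (the instance, database, table, and column names are made up); note that the snapshot returns all rows visible at the chosen timestamp, not just the recently changed ones:

import datetime
from google.cloud import spanner

client = spanner.Client()
database = client.instance("my-instance").database("my-database")

# Bounded staleness: Spanner picks some read timestamp >= min_read_timestamp.
min_ts = datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(minutes=10)

with database.snapshot(min_read_timestamp=min_ts) as snapshot:
    # This reads the WHOLE table as of the chosen timestamp --
    # it does not filter to rows written after min_read_timestamp.
    for row in snapshot.execute_sql("SELECT Id, Payload FROM Logs"):
        print(row)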
I would recommend using a query to select all rows where timestamp_column >= specific_timestamp and creating a composite index.
Cloud Spanner doesn't create an index to optimize this query type. You may want to create an index of your own, though I would recommend this read:
https://cloud.google.com/spanner/docs/schema-design#creating_indexes
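A sketch of the recommended approach with the Python client: add a secondary index on the update-timestamp column and query it with a parameter. All names here are hypothetical, and a single-column index is shown for brevity (use a composite index if your queries filter on more columns):

import datetime
from google.cloud import spanner
from google.cloud.spanner_v1 import param_types

client = spanner.Client()
database = client.instance("my-instance").database("my-database")

# One-time DDL: secondary index on the timestamp column (hypothetical names).
database.update_ddl(
    ["CREATE INDEX LogsByUpdateTime ON Logs(UpdateTime)"]
).result()  # wait for the long-running schema operation to finish

since = datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(hours=1)

with database.snapshot() as snapshot:
    results = snapshot.execute_sql(
        "SELECT Id, Payload FROM Logs WHERE UpdateTime >= @since",
        params={"since": since},
        param_types={"since": param_types.TIMESTAMP},
    )
    for row in results:
        print(row)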
I've been going through the AWS DynamoDB docs and, for the life of me, cannot figure out the core difference between batchGetItem() and Query(). Both retrieve items based on primary keys from tables and indexes. The only difference is in the size of the items retrieved, but that doesn't seem like a groundbreaking difference. Both also support conditional updates.
In what cases should I use batchGetItem over Query and vice-versa?
There’s an important distinction that is missing from the other answers:
Query requires a partition key
BatchGetItems requires a primary key
Query is only useful if the items you want to get happen to share a partition (hash) key, and you must provide this value. Furthermore, you have to provide the exact value; you can’t do any partial matching against the partition key. From there you can specify an additional (and potentially partial/conditional) value for the sort key to reduce the amount of data read, and further reduce the output with a FilterExpression. This is great, but it has the big limitation that you can’t get data that lives outside a single partition.
BatchGetItems is the flip side of this. You can get data across many partitions (and even across multiple tables), but you have to know the full and exact primary key: that is, both the partition (hash) key and any sort (range) key. It’s literally like calling GetItem multiple times in a single operation. You don’t have the partial-searching and filtering options of Query, but you’re not limited to a single partition either.
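To make the distinction concrete, a boto3 sketch (the table and key names are hypothetical): Query needs only the exact partition key and can match a range of sort keys, while BatchGetItem needs every full primary key up front but can span partitions.

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
orders = dynamodb.Table("Orders")  # hypothetical: partition key customer_id, sort key order_id

# Query: stays within ONE partition. The exact partition key is required;
# the sort-key condition (and a FilterExpression) are optional refinements.
recent = orders.query(
    KeyConditionExpression=Key("customer_id").eq("cust-42")
    & Key("order_id").begins_with("2024-"),
)["Items"]

# BatchGetItem: can span partitions (and tables), but every FULL primary key
# (partition key + sort key) has to be known up front.
batch = dynamodb.batch_get_item(
    RequestItems={
        "Orders": {
            "Keys": [
                {"customer_id": "cust-42", "order_id": "2024-01-07#0001"},
                {"customer_id": "cust-99", "order_id": "2023-12-24#0042"},
            ]
        }
    }
)["Responses"]["Orders"]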
As per the official documentation:
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/WorkingWithTables.html#CapacityUnitCalculations
For BatchGetItem, each item in the batch is read separately, so DynamoDB first rounds up the size of each item to the next 4 KB and then calculates the total size. The result is not necessarily the same as the total size of all the items. For example, if BatchGetItem reads a 1.5 KB item and a 6.5 KB item, DynamoDB will calculate the size as 12 KB (4 KB + 8 KB), not 8 KB (1.5 KB + 6.5 KB).
For Query, all items returned are treated as a single read operation. As a result, DynamoDB computes the total size of all items and then rounds up to the next 4 KB boundary. For example, suppose your query returns 10 items whose combined size is 40.8 KB. DynamoDB rounds the item size for the operation to 44 KB. If a query returns 1500 items of 64 bytes each, the cumulative size is 96 KB.
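A quick sketch of those two rounding rules, reproducing the documentation's numbers:

import math

RCU_BLOCK_KB = 4  # reads are metered in 4 KB increments

def batch_get_item_kb(item_sizes_kb):
    # Each item is rounded up to 4 KB individually, then the results are summed.
    return sum(math.ceil(s / RCU_BLOCK_KB) * RCU_BLOCK_KB for s in item_sizes_kb)

def query_kb(item_sizes_kb):
    # Item sizes are summed first, then the total is rounded up to 4 KB.
    return math.ceil(sum(item_sizes_kb) / RCU_BLOCK_KB) * RCU_BLOCK_KB

print(batch_get_item_kb([1.5, 6.5]))   # 12 (4 KB + 8 KB), not 8
print(query_kb([40.8 / 10] * 10))      # 44 (40.8 KB rounded up to the next 4 KB)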
You should use BatchGetItem if you need to retrieve many items with little HTTP overhead when compared to GetItem.
A BatchGetItem costs the same as calling GetItem for each individual item. However, it can be faster since you are making fewer network requests.
In a nutshell:
BatchGetItem works on tables and uses the hash key to identify the items you want to retrieve. You can get up to 16 MB or 100 items in a response.
Query works on tables, local secondary indexes and global secondary indexes. You can get at most 1 MB of data in a response. The biggest difference is that Query supports filter expressions, which means that you can request data and DDB will filter it server-side for you.
You can probably achieve the same thing with either if you really want to, but the rule of thumb is: do a BatchGet when you need to bulk-dump stuff from DDB, and Query when you need to narrow down what you want to retrieve (and you want Dynamo to do the heavy lifting of filtering the data for you).
DynamoDB stores values under two kinds of keys: a simple partition key, like "jupiter"; or a compound partition + range key, like "jupiter"/"planetInfo", "jupiter"/"moon001" and "jupiter"/"moon002".
A BatchGet helps you fetch the values for a large number of keys at the same time. This assumes that you know the full key(s) for each item you want to fetch. So you can do a BatchGet("jupiter", "saturn", "neptune") if you have only partition keys, or BatchGet(["jupiter","planetInfo"], ["saturn","planetInfo"], ["neptune","planetInfo"]) if you're using partition + range keys. Each item is charged independently and the cost is the same as individual gets; it's just that the results are batched and the call saves time (not money).
A Query, on the other hand, works only inside a partition + range key combo and helps you find items and keys that you don't necessarily know. If you wanted to count Jupiter's moons, you'd do a Query(select(COUNT), partitionKey: "jupiter", rangeKeyCondition: "startsWith:'moon'"). Or if you wanted to fetch moons no. 7 to 15, you'd do Query(select(ALL), partitionKey: "jupiter", rangeKeyCondition: "BETWEEN:'moon007'-'moon015'"). Here you're charged based on the size of the data items read by the query, irrespective of how many there are.
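A boto3 sketch of those two queries, assuming a hypothetical "Planets" table with partition key pk and sort key sk:

import boto3
from boto3.dynamodb.conditions import Key

planets = boto3.resource("dynamodb").Table("Planets")

# Count Jupiter's moons: Select="COUNT" returns only the count, not the items.
moon_count = planets.query(
    KeyConditionExpression=Key("pk").eq("jupiter") & Key("sk").begins_with("moon"),
    Select="COUNT",
)["Count"]

# Fetch moons 7 through 15 using a BETWEEN condition on the sort key.
moons_7_to_15 = planets.query(
    KeyConditionExpression=Key("pk").eq("jupiter")
    & Key("sk").between("moon007", "moon015"),
)["Items"]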
Adding an important difference: Query supports consistent reads, while BatchGetItem does not.
BatchGetItem can use consistent reads through TableKeysAndAttributes.
Thanks @colmlg for the information.
I have recently started looking into the Google Charts API for possible use within the product I'm working on. When constructing the URL for a given chart, the data points can be specified in three different formats: unencoded, simple encoding, and extended encoding (http://code.google.com/apis/chart/formats.html). However, there seems to be no way around the fact that the highest value possible for a data point is with extended encoding, and in that case it is 4095 (encoded as "..").
Am I missing something here or is this limit for real?
When using the Google Chart API, you will usually need to scale your data yourself so that it fits within the 0-4095 range required by the API.
For example, if you have data values from 0 to 1,000,000 then you could divide all your data by 245 so that it fits within the available range (1000000 / 245 = 4081).
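A sketch of that manual scaling, plus an extended encoder for the resulting 0-4095 values. The 64-character alphabet below follows the format documentation linked in the question; the sample data is made up:

# Extended-encoding alphabet: A-Z, a-z, 0-9, '-', '.' -> 64 characters,
# two characters per value, so the maximum is 64*64 - 1 = 4095 ("..").
EXTENDED_ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-."
)

def scale_to_chart_range(values, max_value):
    """Scale raw values into the 0-4095 range the API accepts."""
    return [int(v * 4095 / max_value) for v in values]

def extended_encode(values):
    """Encode 0-4095 integers as two extended-encoding characters each."""
    return "".join(
        EXTENDED_ALPHABET[v // 64] + EXTENDED_ALPHABET[v % 64] for v in values
    )

raw = [0, 250_000, 500_000, 1_000_000]
scaled = scale_to_chart_range(raw, max_value=1_000_000)
print(scaled)                  # [0, 1023, 2047, 4095]
print(extended_encode(scaled)) # "AAP.f..."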
Regarding data scaling, this may also help you:
http://code.google.com/apis/chart/formats.html#data_scaling
Note the chds parameter option.
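With text encoding plus chds you can skip the manual scaling entirely and let the API scale for you. A small sketch that builds such a URL (the chart type, size, and data are placeholders; chds applies to text-format chd data):

# chd=t:... is plain text encoding; chds=min,max tells the API how to scale it,
# so raw values like 1,000,000 can be passed as-is.
data = [0, 250000, 500000, 1000000]

url = (
    "http://chart.apis.google.com/chart"
    "?cht=lc"       # line chart
    "&chs=300x200"  # 300x200 pixels
    f"&chd=t:{','.join(str(v) for v in data)}"
    f"&chds=0,{max(data)}"  # custom scaling range for the text-encoded data
)
print(url)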
You may also wish to consider leveraging a wrapper API that abstracts away some of these ugly details. They are listed here:
http://groups.google.com/group/google-chart-api/web/useful-links-to-api-libraries
I wrote charts4j which has functionality to help you deal with data scaling.