Are Redshift system tables immutable and well ordered?

Redshift system tables only store a few days of logging data, so periodically backing up rows from these tables is a common practice to collect and maintain a proper history. To find the new rows added to the system logs, I need to check against my backup tables either on query (number) or on execution time.
According to an answer on How do I keep more than 5 day's worth of query logs?, we can simply select all rows with query > (select max(query) from log). That answer is unreferenced and assumes that query values are inserted sequentially.
My question, in two parts - hoping for references or code-as-proof - is:
are query (identifiers) expected to be inserted sequentially, and
are system tables, e.g. stl_query, immutable or unchanging?
Assuming that we can't verify or prove both the above, then what's the right strategy to backup the system tables?
I am wary of this because I fully expect long-running queries to complete after many other queries have started and completed.
I know the query identifier is generated at query submit time, because I can monitor in-progress queries. It is therefore completely expected that a long-running query=1 may complete after query=2. If the stl_query table is immutable, then query=1 will be inserted after query=2, and the max(query) logic is flawed.
Alternatively, if query=1 is inserted into stl_query at run time, then the row must be updated upon completion (with end time, duration, etc.). That would require me to do an upsert into the backup table.
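If the table does turn out to be updated in place, a delete-then-insert pattern would cover that upsert case. A minimal sketch, assuming a backup table named admin.query_history with the same columns as stl_query (the table name and transaction wrapper are illustrative):
-- Hypothetical upsert sketch: refresh rows that may have changed, then add anything new
begin;
-- drop any backup rows whose source rows might have been updated since the last run
delete from admin.query_history
using stl_query
where admin.query_history.query = stl_query.query;
-- re-insert the current state of everything still visible in the system table
insert into admin.query_history
select * from stl_query;
commit;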

I think the stl_query table is indeed immutable; it would seem that it is only written to after a query finishes.
Here is why I think that. First off, I ran this query on a cluster with running queries:
select count(*) from stl_query where endtime is null
This returns 0. My hunch is that you'll probably see the same thing on your side.
To be doubly sure, I also ran this query:
select count(*) from stv_inflight i
inner join stl_query q on q.query = i.query
This also returns zero (while I did have queries in flight), which seems to confirm that queries are only logged in stl_query once they have finished executing and are not updated afterwards.
That said, I would rewrite the query that inserts into your history table as follows:
insert into admin.query_history (
select * from stl_query
where query not in (select query from admin.query_history)
)
That way, you'll always insert any records you don't have in the history table.
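If the history table doesn't exist yet, one way to bootstrap it is to create it straight from the system table; a minimal sketch, assuming admin.query_history is whatever name you choose:
-- Hypothetical bootstrap: copy the current contents of stl_query into a new history table
create table admin.query_history as
select * from stl_query;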

Related

Schedule the creation of a partitioned table overwriting an existing table in BigQuery (GCP)

Yesterday I scheduled the daily overwriting of a table. The new table is partitioned, as is the one being overwritten... It did not run at the scheduled time, nor did it give an error... It just did not start.
My feeling is that it has to do with the partitioning option. Note that the casting of the field date_formatted, which will be used as the partition field, works fine.
As far as I know, when scheduling a query you can't use create or replace table T partitioned by column C as select... (see the sketch after the notes below).
You start from the select... clause, as shown in the image, and I don't know if the problem comes from here.
PS: I had no trouble scheduling the appending to a table partitioned by day with this same procedure.
The destination table is in the same dataset.
If the very same query is scheduled to deliver the results into a table with the same name, but in a different dataset (located in the same project), it works.
By the way, the input table and the output table were never the same.
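For reference, a sketch of the DDL form referred to above (everything except the date_formatted field is a hypothetical placeholder):
-- Hypothetical standalone DDL that replaces the table and partitions it in one statement;
-- the scheduled-query UI instead starts from the select... clause plus a destination table.
CREATE OR REPLACE TABLE my_dataset.my_table
PARTITION BY date_formatted
AS
SELECT
  CAST(raw_date AS DATE) AS date_formatted,
  * EXCEPT (raw_date)
FROM my_dataset.source_table;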

Patterns for replicating data to BigQuery

I'm asking for the best practice / industry standard for these types of jobs; this is what I've been doing:
The end goal is to have a replica of the data in BigQuery
Get the data from an API (incrementally, using the previous watermark on a field like updated_at)
Batch load into a native BigQuery table (the main table)
Run an update-ish query, like this:
select * except (_rn)
from (
  select *, row_number() over (partition by <id> order by updated_at desc) as _rn
  from main_table  -- i.e. the main table loaded in the previous step
)
where _rn = 1
Essentially, only keep the rows that are the most up-to-date. I'm opting for a table instead of a view to facilitate downstream usage.
This method works for small tables, but when the volume increases it runs into some issues:
The whole table is recreated, whether partitioned or not
If partitioned, I could easily run into quota limits
I've also looked at other methods, including loading into a staging table and then performing a merge operation between the two.
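As a rough sketch of that staging-plus-merge route (the table, id and column names here are placeholders for your actual schema):
-- Hypothetical MERGE from the incrementally loaded staging table into the main table
MERGE my_dataset.main_table t
USING my_dataset.staging_table s
ON t.id = s.id
WHEN MATCHED AND s.updated_at > t.updated_at THEN
  UPDATE SET updated_at = s.updated_at, payload = s.payload  -- list every replicated column
WHEN NOT MATCHED THEN
  INSERT ROW;
If the staging table can hold several versions of the same id, deduplicate it first (for example with the row_number pattern above) so that each target row matches at most one source row.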
So I'm asking for advice on what your preferred methods/patterns/architectures are to achieve the end goal.
Thanks

Using merge in Power Query while keeping native query

I'm trying to reduce my dataset of 1,000,000 records to only the subset I need (+/- 500) by creating an inner join to a different table. Unfortunately, it seems that Power Query drops the "native query" and loads the entire dataset before reducing it by merging it with the related table. I have no access to the database, unfortunately, otherwise I would have written the SQL myself. Is there a way to make the merge work with a native SQL query?
Thanks
I would first check that your "related table" query can run as a native query - right-click on its last step and check whether View Native Query is enabled.
If that's the case, then it may be due to the Join Kind in the Merge Queries step. I've noticed that against SQL Server data sources, Join Kinds other than the default Left Outer Join tend to kill the Native Query option.

Google Spanner - How do you copy data to another table?

Since Spanner does not have a DML feature like
insert into dest (select * from source_table)
how do we select a subset of a table and copy those rows into another table?
I am trying to write data to a temporary table and then move the data to an archive table at the end of the day. But the only solution I have found so far is to select rows from the source table and write them to the new table. This is done using the Java API, which does not have a ResultSet-to-Mutation converter, so I need to map every column of the table to the new table, even when they are exactly the same.
Another thing is updating just one column's data; there is no way of doing "update table_name set column = column - 1".
Again, to do that I need to read the row and map every field to an update Mutation, but this is not practical if I have many tables - I would need to code it for all of them. A ResultSet -> Mutation converter would be nice here too.
Is there any generic Mutation cloner and/or any other way to copy data between tables?
As of version 0.15, this open source JDBC driver supports bulk INSERT statements that can be used to copy data from one table to another. The INSERT syntax can also be used to perform bulk UPDATEs on data.
Bulk insert example:
INSERT INTO TABLE
(COL1, COL2, COL3)
SELECT C1, C2, C3
FROM OTHER_TABLE
WHERE C1>1000
Bulk update is done using an INSERT statement with the addition of ON DUPLICATE KEY UPDATE. You have to include the value of the primary key in your insert statement in order to 'force' a key violation, which in turn ensures that the existing rows will be updated:
INSERT INTO TABLE
(COL1, COL2, COL3)
SELECT COL1, COL2+1, COL3+COL2
FROM TABLE
WHERE COL2<1000
ON DUPLICATE KEY UPDATE
You can use the JDBC driver with for example SQuirreL to test it, or to do ad-hoc data manipulation.
Please note that the underlying limitations of Cloud Spanner still apply, meaning a maximum of 20,000 mutations in one transaction. The JDBC Driver can work around this limit by specifying the value AllowExtendedMode=true in your connection string or in the connection properties. When this mode is allowed, and you issue a bulk INSERT- or UPDATE-statement that will exceed the limits of one transaction, the driver will automatically open an extra connection and perform the bulk operation in batches on the new connection. This means that the bulk operation will NOT be performed atomically, and will be committed automatically after each successful batch, but at least it will be done automatically for you.
Have a look here for some more examples: http://www.googlecloudspanner.com/2018/02/data-manipulation-language-with-google.html
Another approach to performing the bulk copy is to use LIMIT & OFFSET:
insert into dest(c1,c2,c3)
(select c1,c2,c3 from source_table LIMIT 1000);
insert into dest(c1,c2,c3)
(select c1,c2,c3 from source_table LIMIT 1000 OFFSET 1000);
insert into dest(c1,c2,c3)
(select c1,c2,c3 from source_table LIMIT 1000 OFFSET 2000);
.
.
.
Continue increasing the OFFSET by the batch size until all the required rows are copied. Add an ORDER BY to the subselects if you need the batches to be deterministic.
PS: This is more of a trick, but it will definitely save you time.
Spanner supports expressions in the SET section of an UPDATE statement, which can be used to supply a subquery fetching data from another table, like this:
UPDATE target_table
SET target_field = (
-- use subquery as an expression (must return a single row)
SELECT source_table.source_field
FROM source_table
WHERE my_condition IS TRUE
) WHERE my_other_condition IS TRUE;
The generic syntax is:
UPDATE table SET column_name = { expression | DEFAULT } WHERE condition
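The simple arithmetic update from the question also works with this DML support; a minimal example (table and column names are placeholders):
-- A plain single-column arithmetic update; Spanner requires a WHERE clause on UPDATE
UPDATE table_name SET column_name = column_name - 1 WHERE TRUE;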

Redshift Query taking too much time

In Redshift, our queries are taking too much time to execute. Some queries keep on running or get aborted after some time.
I have very limited knowledge of Redshift and I am finding it difficult to understand the query plan well enough to optimise the query.
I'm sharing one of the queries that we run, along with its query plan.
The query is taking 20 seconds to execute.
Query
SELECT
    date_trunc('day', ti) AS date,
    count(distinct deviceID) AS count
FROM
    live_events
WHERE
    brandID = 3927
    AND ti >= '2017-08-02T00:00:00+00:00'
    AND ti <= '2017-09-02T00:00:00+00:00'
GROUP BY
    1
Primary key
brandID
Interleaved Sort Keys
We have set the following columns as interleaved sort keys:
brandID, ti, event_name
QUERY PLAN
You have 126 million rows in that table. It's going to take more than a second on a single dc1.large node.
Here's some ways you could improve the performance:
More nodes
Spreading data across more nodes allows more parallelization. Each node adds additional processing and storage. Even if your data volume only justifies one node, if you want more performance, add more nodes.
SORTKEY
For the right type of query, the SORTKEY can be the best way to improve query speed. Sorting data on disk allows Redshift to skip over blocks that it knows do not contain relevant data.
For example, your query has WHERE brandID = 3927, so having brandID as the SORTKEY would make this extremely efficient because very few disk blocks would contain data for one brand.
Interleaved sorting is rarely the best sorting method to use because it is less efficient than a single or compound sort key and takes a long time to VACUUM. If the query you have shown is typical of the type of queries you are running, then use a compound sort key of brandId, ti or ti, brandId. It will be much more efficient.
The SORTKEY is typically a date column, since dates are often used in WHERE clauses and the table stays sorted automatically if data is always appended in time order.
The Interleaved Sort would be causing Redshift to read many more disk blocks to find your data, thereby significantly increasing query time.
DISTKEY
The DISTKEY should typically be set to the field that is most used in a JOIN statement on the table. This is because data relating to the same DISTKEY value is stored on the same slice. This won't have such a large impact on a single node cluster, but it is still worth getting right.
Again, you have only shown one type of query, so it is hard to recommend a DISTKEY. Based on this query alone, I would recommend DISTSTYLE EVEN so that all slices participate in the query. (It is also the default distribution style if no DISTKEY is specified.) Alternatively, set the DISTKEY to a field not shown -- but certainly don't use brandId as the DISTKEY, otherwise only one slice will participate in the query shown.
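As an illustration only, putting the sort key and distribution advice together in DDL (the column types are guesses, since the real table definition isn't shown):
-- Hypothetical re-declaration of the table with a compound sort key and even distribution
CREATE TABLE live_events (
    brandID    BIGINT,
    deviceID   VARCHAR(64),
    event_name VARCHAR(256),
    ti         TIMESTAMP
)
DISTSTYLE EVEN
COMPOUND SORTKEY (brandID, ti);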
VACUUM
VACUUM your tables regularly so that the data is stored in SORTKEY order and deleted data is removed from storage.
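For example, on the table from the question (a full vacuum both re-sorts rows and reclaims space from deleted rows):
VACUUM FULL live_events;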
Experiment!
Optimal settings depend upon your data and the queries you typically run. Perform some tests to compare SORTKEY and DISTKEY values and choose the settings that perform best. Then test again in 3 months to see if your queries or data have changed enough to make other settings more efficient.
Sometimes the issue could be due to locks acquired by other processes. You can refer to: https://aws.amazon.com/premiumsupport/knowledge-center/prevent-locks-blocking-queries-redshift/
I'd also like to add that in your query you are performing date transformations. Date operations are expensive in Redshift.
-- This date operation is expensive
date_trunc('day', ti) as date
If you have the luxury, you should store the date in the format you need in an additional column.
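A minimal sketch of that idea, using a hypothetical ti_date column that is populated at load time (or backfilled once):
-- Hypothetical precomputed date column so the query no longer needs date_trunc
ALTER TABLE live_events ADD COLUMN ti_date DATE;
UPDATE live_events SET ti_date = TRUNC(ti);  -- one-off backfill; new loads should set it directly
SELECT ti_date AS date,
       COUNT(DISTINCT deviceID) AS count
FROM live_events
WHERE brandID = 3927
  AND ti_date BETWEEN '2017-08-02' AND '2017-09-02'
GROUP BY 1;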