Number of queries hitting a Redshift table in a time frame

How can I know the number of queries hitting a table in a particular time frame, and what are those queries?
Is it possible to get those stats for live tables hitting a redshift table?

This will give you the number of queries hitting a Redshift table in a certain time frame:
SELECT count(*)
FROM stl_wlm_query w
LEFT JOIN stl_query q
ON q.query = w.query
AND q.userid = w.userid
JOIN pg_user u ON u.usesysid = w.userid
-- Adjust your time frame accordingly
WHERE w.queue_start_time >= '2022-04-04 10:00:00.000000'
AND w.queue_start_time <= '2022-04-05 22:00:00.000000'
AND w.userid > 1 -- exclude internal system queries
-- Set the table name here:
AND q.querytxt LIKE '%my_main_table%';
If you need the text of the actual queries hitting the table in a certain time frame, plus the queue time, execution time, and user (remove what you don't need):
SELECT
u.usename,
q.querytxt,
w.queue_start_time,
w.total_queue_time / 1000000 AS queue_seconds,
w.total_exec_time / 1000000 AS exec_seconds
FROM stl_wlm_query w
LEFT JOIN stl_query q
ON q.query = w.query
AND q.userid = w.userid
JOIN pg_user u ON u.usesysid = w.userid
-- Adjust your time frame accordingly
WHERE w.queue_start_time >= '2022-04-04 10:00:00.000000'
AND w.queue_start_time <= '2022-04-05 22:00:00.000000'
AND w.userid > 1 -- exclude internal system queries
-- Set the table name here:
AND q.querytxt LIKE '%my_main_table%'
ORDER BY w.queue_start_time;

If by "hitting a table you mean scan then they system table stl_scan lists all the accesses to a table and lists the query number that causes this scan. By writing a query to aggregate the information in stl_scan you can look at it by time interval and/or originating query. If this isn't what you mean you will need to clarify.
I don't understand what is meant by 'stats for live tables hitting a redshift table?'. What is meant by a table hitting a table?

Related

Athena query returns empty result because of timing issues

I'm trying to create and query an Athena table based on data located in S3, and it seems that there are some timing issues.
How can I know when all the partitions have been loaded to the table?
The following code returns an empty result -
athena_client.start_query_execution(QueryString=app_query_create_table,
ResultConfiguration={'OutputLocation': output_location})
athena_client.start_query_execution(QueryString="MSCK REPAIR TABLE `{athena_db}`.`{athena_db_partition}`"
.format(athena_db=athena_db, athena_db_partition=athena_db_partition),
ResultConfiguration={'OutputLocation': output_location})
result = query.format(athena_db_partition=athena_db_partition, delta=delta, dt=dt)
But when I add some delay, it works great -
athena_client.start_query_execution(QueryString=app_query_create_table,
ResultConfiguration={'OutputLocation': output_location})
athena_client.start_query_execution(QueryString="MSCK REPAIR TABLE `{athena_db}`.`{athena_db_partition}`"
.format(athena_db=athena_db, athena_db_partition=athena_db_partition),
ResultConfiguration={'OutputLocation': output_location})
time.sleep(3)
result = query.format(athena_db_partition=athena_db_partition, delta=delta, dt=dt)
The following is the query for creating the table -
query_create_table = '''
CREATE EXTERNAL TABLE `{athena_db}`.`{athena_db_partition}` (
`time` string,
`user_advertiser_id` string,
`predictions` float
) PARTITIONED BY (
dt string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ','
) LOCATION 's3://{bucket}/path/'
'''
app_query_create_table = query_create_table.format(bucket=bucket,
athena_db=athena_db,
athena_db_partition=athena_db_partition)
I would love to get some help.
The start_query_execution call only starts the query; it does not wait for it to complete. You must call get_query_execution periodically until the execution reaches a terminal status such as SUCCEEDED or FAILED.
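A minimal polling sketch along those lines, reusing the boto3 client and output_location from the question (the table names are placeholders):
import time
import boto3

athena_client = boto3.client('athena')

def wait_for_query(execution_id, poll_seconds=1):
    # Poll get_query_execution until the query reaches a terminal state.
    while True:
        response = athena_client.get_query_execution(QueryExecutionId=execution_id)
        state = response['QueryExecution']['Status']['State']
        if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
            return state
        time.sleep(poll_seconds)

# Block until MSCK REPAIR TABLE has finished before querying the partitions.
execution = athena_client.start_query_execution(
    QueryString='MSCK REPAIR TABLE `my_db`.`my_table`',  # placeholder names
    ResultConfiguration={'OutputLocation': output_location})
if wait_for_query(execution['QueryExecutionId']) != 'SUCCEEDED':
    raise RuntimeError('MSCK REPAIR TABLE did not succeed')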
Not related to your problem per se, but if you create a table with CREATE TABLE … AS there is no need to add partitions with MSCK REPAIR TABLE … afterwards: a table created that way already contains all the partitions produced by the query, so there are no new partitions to discover.
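For illustration, a hedged CTAS sketch (the table, source, and location names are placeholders; the columns come from the question's DDL):
CREATE TABLE my_db.my_table_ctas
WITH (
external_location = 's3://my-bucket/ctas-path/',
partitioned_by = ARRAY['dt']
) AS
-- partition columns must come last in the SELECT list
SELECT "time", user_advertiser_id, predictions, dt
FROM my_db.source_table;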
Also, in general, avoid using MSCK REPAIR TABLE; it is slow and inefficient. There are many better ways to add partitions to a table, see https://athena.guide/articles/five-ways-to-add-partitions/
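One such alternative, sketched against the question's table (the database, table, date, and path are placeholders): add each new partition explicitly.
ALTER TABLE my_db.my_table ADD IF NOT EXISTS
PARTITION (dt = '2022-01-01')
LOCATION 's3://my-bucket/path/dt=2022-01-01/';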

BigQuery MERGE statement billing more bytes than editor shows

I have a very large (3.5B records) table that I want to update/insert (upsert) using the MERGE statement in BigQuery. The source table is a staging table that contains only the new data, and I need to check if the record with a corresponding ID is in the target table, updating the row if so or inserting if not.
The target table is partitioned by an integer field called IdParent, and the matching is done on IdParent and another integer field called IdChild. My merge statement/script looks like this:
declare parentList array<int64>;
set parentList = array(select distinct IdParent from dataset.Staging);
merge into dataset.Target t
using dataset.Staging s
on
-- target is partitioned by IdParent, do this for partition pruning
t.IdParent in unnest(parentList)
and t.IdParent = s.IdParent
and t.IdChild = s.IdChild
when matched and t.IdParent in unnest(parentList) then
update
set t.Column1 = s.Column1,
t.Column2 = s.Column2,
...<more columns>
when not matched and IdParent in unnest(parentList) then
insert (<all the fields>)
values (<all the fields>)
;
So I:
Pull the IdParent list from the staging table to know which partitions to prune
limit the partitions of the target table in the join predicate
also limit the partitions of the target table in the match/not matched conditions
The total size of dataset.Target is ~250GB. If I put this script in my BQ editor and remove all the IdParent in unnest(parentList) predicates, it shows ~250GB to bill in the editor (as expected, since there's no partition pruning). If I add the IdParent in unnest(parentList) predicates back in, so the script is exactly as you see it above, i.e. attempting to prune partitions, the editor shows ~97MB to bill. However, when I look at the query results, I see that it actually billed ~180GB.
The target table is also clustered on the two fields being matched, and I'm aware that the benefits of clustering are typically not shown in the editor's estimate. However, my understanding is that that should only make the bytes billed smaller... I can't think of any reason why this would happen.
Is this a BQ bug, or am I just missing something? BigQuery doesn't even say "the script is estimated to process XX MB", it says "This will process XX MB" and then it processes way more.
That's very interesting. What you did seems totally correct.
It seems the BQ query planner can interpret your SQL correctly and knows that partition pruning is available, but it fails to apply the pruning when the query actually executes.
Try removing t.IdParent in unnest(parentList) from both the WHEN MATCHED and WHEN NOT MATCHED clauses to see if the issue still happens, that is:
declare parentList array<int64>;
set parentList = array(select distinct IdParent from dataset.Staging);
merge into dataset.Target t
using dataset.Staging s
on
-- target is partitioned by IdParent, do this for partition pruning
t.IdParent in unnest(parentList)
and t.IdParent = s.IdParent
and t.IdChild = s.IdChild
when matched then
update
set t.Column1 = s.Column1,
t.Column2 = s.Column2,
...<more columns>
when not matched then
insert (<all the fields>)
values (<all the fields>)
;
It would be a good idea to submit a bug to BigQuery if it can't be resolved.

UPDATE with JOIN not SELECTing expected value

I'm trying to UPDATE a temporary transaction table with values from a holdings table. The fields I want to get are from the holding with the lowest date that is higher than the transaction date.
When I use below SELECT statement, the right values are shown:
SELECT h.*
FROM transaction_tmp tt
JOIN holdings h
ON tt.isin = h.isin
AND tt.portfolio = h.portfolio
WHERE h.start_date > tt.tr_date
ORDER BY h.start_date
LIMIT 1
However, when I use below UPDATE statement, incorrect values are selected/updated in transaction_tmp:
UPDATE transaction_tmp tt
JOIN holdings h
ON tt.isin = h.isin
AND tt.portfolio = h.portfolio
SET
tt.next_id = h.id,
tt.next_start_date = h.start_date
WHERE h.start_date > tt.tr_date
ORDER BY h.start_date
LIMIT 1
I'm thinking the WHERE statement is not working appropriately, but unfortunately I cannot figure out how to fix it.
Appreciate any help here!
-Joost
It should work using a subquery:
UPDATE transaction_tmp tt
JOIN (
SELECT h.*
FROM transaction_tmp tt
JOIN holdings h
ON tt.isin = h.isin
AND tt.portfolio = h.portfolio
WHERE h.start_date > tt.tr_date
ORDER BY h.start_date
LIMIT 1
) tx ON tt.isin = tx.isin
AND tt.portfolio = tx.portfolio
SET
tt.next_id = tx.id,
tt.next_start_date = tx.start_date
I'm surprised your syntax works. The MySQL documentation is pretty clear that LIMIT and ORDER BY are only allowed when there is a single table reference:
UPDATE [LOW_PRIORITY] [IGNORE] table_reference
SET assignment_list
[WHERE where_condition]
[ORDER BY ...]
[LIMIT row_count]
They are not allowed for the multiple table version of UPDATE:
UPDATE [LOW_PRIORITY] [IGNORE] table_references
SET assignment_list
[WHERE where_condition]
. . .
For the multiple-table syntax, UPDATE updates rows in each table named in table_references that satisfy the conditions. Each matching row is updated once, even if it matches the conditions multiple times. For multiple-table syntax, ORDER BY and LIMIT cannot be used.
I get an error if I try such syntax.
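If the goal is, for each transaction row, the holding with the lowest start_date after that row's tr_date, one way to avoid ORDER BY and LIMIT entirely is a correlated MIN() subquery. A sketch against the schema implied by the question (untested):
UPDATE transaction_tmp tt
JOIN holdings h
ON h.isin = tt.isin
AND h.portfolio = tt.portfolio
-- pick, per transaction, the earliest holding date after the transaction date
AND h.start_date = (
SELECT MIN(h2.start_date)
FROM holdings h2
WHERE h2.isin = tt.isin
AND h2.portfolio = tt.portfolio
AND h2.start_date > tt.tr_date)
SET
tt.next_id = h.id,
tt.next_start_date = h.start_date;
Note that if two holdings share that minimal start_date, which row supplies next_id is arbitrary.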

How to setup an AWS Athena query with multiple regex replacements?

I have been trying to make an AWS Athena query and got enough work done to get my data. However, I need to identify some patterns in the data and rewrite them in a uniform way in order to group similar values. So I'm trying to use a regex replacement, but how can I do multiple replacements on the same column in the same query?
Here's my query:
WITH q AS (
SELECT t.key,
t.otherid,
t.complexString,
minute(date_trunc('minute', from_iso8601_timestamp(t.time) AT TIME ZONE 'America/New_York')) AS minute,
hour(from_iso8601_timestamp(t.time) AT TIME ZONE 'America/New_York') AS hour,
day(from_iso8601_timestamp(t.time) AT TIME ZONE 'America/New_York') AS day
FROM requests0918 t
JOIN requests0918 t1 ON t.id = t1.id
WHERE t1.msg = 'response_written' AND t1.code = '200'
AND t.otherid IS NOT NULL
AND t.key IS NOT NULL
AND t.path IS NOT NULL
LIMIT 10)
SELECT q.key, q.otherid, REGEXP_REPLACE(q.complexString, '\/accounts\/[0-9]+\/balances', '/accounts/.../balances') AS path, q.minute, q.hour, q.day FROM q
So I'm successfully changing those strings to the new form, but I need to apply more patterns and keep the result under the same column name, and I'm looking for how to do that. I could add more layers of with q as {Query} to add more rules, but that sounds pretty wrong.
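For what it's worth, nesting REGEXP_REPLACE calls is the usual way to apply several patterns to the same column in a single query; a sketch based on the query above (the second pattern is made up for illustration):
SELECT q.key, q.otherid,
REGEXP_REPLACE(
REGEXP_REPLACE(q.complexString, '\/accounts\/[0-9]+\/balances', '/accounts/.../balances'),
-- hypothetical second pattern, applied to the result of the first
'\/users\/[0-9]+\/profile', '/users/.../profile'
) AS path,
q.minute, q.hour, q.day
FROM q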

Special character to query from latest timestamp sharded table in BigQuery

From
https://cloud.google.com/bigquery/docs/partitioned-tables:
you can shard tables using a time-based naming approach such as [PREFIX]_YYYYMMDD
This enables me to do:
SELECT count(*) FROM `xxx.xxx.xxx_*`
and query across all the shards. Is there a special notation that queries only the latest shard? For example, say I had:
xxx_20180726
xxx_20180801
could I do something along the lines of
SELECT count(*) FROM `xxx.xxx.xxx_{{ latest }}`
to query xxx_20180801?
Single query inspired by Mikhail Berlyant:
SELECT count(*) AS c
FROM `XXX.PREFIX_*`
WHERE _TABLE_SUFFIX IN (
SELECT SUBSTR(MAX(table_id), LENGTH('PREFIX_') + 1)
FROM `XXX.__TABLES_SUMMARY__`
WHERE table_id LIKE 'PREFIX_%')
If you do care about cost (meaning how many tables will be scanned by your query), the only way is to do it in two steps, like below.
First query
#standardSQL
SELECT SUBSTR(MAX(table_id), LENGTH('PREFIX_') + 1)
FROM `xxx.xxx.__TABLES_SUMMARY__`
WHERE table_id LIKE 'PREFIX_%'
Second Query
#standardSQL
SELECT COUNT(*)
FROM `xxx.xxx.PREFIX_*`
WHERE _TABLE_SUFFIX = '<result of first query>'
So, if the result of the first query is 20180801, the second query will obviously look like below:
#standardSQL
SELECT COUNT(*)
FROM `xxx.xxx.PREFIX_*`
WHERE _TABLE_SUFFIX = '20180801'
If you don't care about cost but just need the result, you can easily combine the above two queries into one. But again, remember: even though the result comes from the last table only, the cost will be as if you queried all tables that match xxx.xxx.PREFIX_*.
Forgot to mention (even though it should be obvious): of course, when you have only COUNT(1) in your SELECT, the cost will be 0 (zero) for both options. But in reality, you will most likely select something more valuable than just COUNT(1).
I know this is kind of an old thread, but I was surprised that no one offered an answer using variables.
Héctor Neri already mentioned this in the comments, but I thought it might be better to post an actual answer with sample code.
#standardSQL
DECLARE SHARD_DATE STRING;
SET SHARD_DATE=(
SELECT MAX(REPLACE(table_name,'{TABLE}_',''))
FROM `{PRJ}.{DATASET}.INFORMATION_SCHEMA.TABLES`
WHERE table_name LIKE '{TABLE}_20%'
);
SELECT * FROM `{PRJ}.{DATASET}.{TABLE}_*`
WHERE _TABLE_SUFFIX = SHARD_DATE
Make sure to replace {PRJ}, {DATASET}, and {TABLE} values with your table location.
If you run this on BigQuery Web UI, you will see this message:
WARNING: Could not compute bytes processed estimate for script.
But you can see that the variable properly reduces the table scan to the latest shard and does not cause any extra cost after running the script.