How to find average time to load data from S3 into Redshift - amazon-web-services

I have more than 8 schemas and 200+ tables, and data is loaded from CSV files into the different schemas.
I want to know the SQL script for how to find the average time to load the data from S3 into Redshift for all 200+ tables.

You can examine the STL System Tables for Logging to discover how long queries took to run.
You'd probably need to parse the Query text to discover which tables were loaded, but you could use the historical load times to calculate a typical load time for each table.
Some particularly useful tables are:
STL_QUERY_METRICS: Contains metrics information, such as the number of rows processed, CPU usage, input/output, and disk use, for queries that have completed running in user-defined query queues (service classes).
STL_QUERY: Returns execution information about a database query.
STL_LOAD_COMMITS: This table records the progress of each data file as it is loaded into a database table.

Run this query to find out how fast your COPY queries are working.
select q.starttime,
       s.query,
       substring(q.querytxt, 1, 120) as querytxt,
       s.n_files,
       s.size_mb,
       s.time_seconds,
       s.size_mb / decode(s.time_seconds, 0, 1, s.time_seconds) as mb_per_s
from (select query,
             count(*) as n_files,
             sum(transfer_size / (1024 * 1024)) as size_mb,
             (max(end_time) - min(start_time)) / 1000000 as time_seconds,
             max(end_time) as end_time
      from stl_s3client
      where http_method = 'GET'
        and query > 0
        and transfer_time > 0
      group by query) as s
left join stl_query as q on q.query = s.query
where s.end_time >= dateadd(day, -7, current_date)
order by s.time_seconds desc, s.size_mb desc, s.end_time desc
limit 50;
Once you know how many MB/s you're pushing through from S3, you can roughly estimate how long each load will take based on file size.
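If you want the per-table averages the question asks for rather than per-file throughput, a rough sketch along the following lines can group historical COPY durations by target table. It parses the table name out of the query text, which assumes your COPY statements follow the usual "COPY <schema.table> FROM ..." form:
-- Sketch: average COPY duration per target table over the last 7 days.
select trim(split_part(split_part(lower(querytxt), 'copy ', 2), ' ', 1)) as target_table,
       count(*) as copy_count,
       avg(datediff(second, starttime, endtime)) as avg_load_seconds,
       max(datediff(second, starttime, endtime)) as max_load_seconds
from stl_query
where lower(querytxt) like 'copy %'
  and starttime >= dateadd(day, -7, current_date)
group by 1
order by avg_load_seconds desc;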

There's a smarter way to do it if you have an ETL script that migrates data from S3 to Redshift.
Assuming that you have a shell script, just capture a timestamp before the ETL logic starts for that table (let's call that start), capture another timestamp after the ETL logic ends for that table (let's call that end), and take the difference towards the end of the script:
#!/bin/sh
# ...
start=$(date +%s)   # capture start time

# ETL logic:
#   [find the right CSV on S3]
#   [check for duplicates, whether the file has already been loaded, etc.]
#   [run your ETL logic, logging to make sure that file has been processed on S3]
#   [copy that table to Redshift, log again to make sure that table has been copied]
#   [error logging, trigger emails, SMS, Slack alerts, etc.]
#   [ ... ]

end=$(date +%s)             # capture end time
duration=$((end - start))   # difference (time taken by the script to execute)
echo "duration is $duration"
PS: The duration will be in seconds, and you can maintain a log file, write an entry to a DB table, etc. The timestamps are in epoch form, so depending on where you're logging you can convert them with functions like:
sec_to_time($duration) -- for MySQL
SELECT (TIMESTAMP 'epoch' + 1511680982 * INTERVAL '1 second') AS mytimestamp -- for Amazon Redshift (and then take the difference of the two epoch values)
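If you write those durations into Redshift itself, the per-table average the question asks for becomes a simple aggregate. A minimal sketch, where the log table and its columns are just assumptions for illustration:
-- Hypothetical log table the ETL script writes one row per load into.
create table if not exists etl_load_log (
    table_name   varchar(256),
    load_seconds integer,
    loaded_at    timestamp
);

-- Average load time per table.
select table_name, avg(load_seconds) as avg_load_seconds
from etl_load_log
group by table_name
order by avg_load_seconds desc;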

Related

AWS Athena partitioning - projected and otherwise - make queries run forever

I'm trying to introduce partitioning into an Athena table. By way of background, what I have is a Kinesis data stream receiving events that are JSON data, and a delivery stream that converts the JSON from text to ORC via a Glue table we'll call glue-table and writes them to s3://bucket-name/path/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/hour=!{timestamp:HH}/.
With this setup, and no partitioning properties or fields set up in the Glue table at all, there is automatically an external table in Athena named glue-table, and with ~50 sample events run through Kinesis I have data in the appropriate path in S3 (aws s3 ls s3://bucket-name/path/year=2022/month=08/day=31/hour=19, for example, lists results). I can execute SELECT * FROM glue-table in Athena and get ~50 results in ~900ms.
If I execute MSCK REPAIR TABLE glue_table the result is Partitions not in metastore: glue_table:year=2022/month=08/day=31/hour=19 which implies Athena is able to see that there ought to be partitions based on the templates it finds in the S3 path.
If I then do this to create a partitioned table using projections:
CREATE EXTERNAL TABLE partitioned_table (
  action string,
  createdat string
)
PARTITIONED BY (
  year integer,
  month integer,
  day integer,
  hour integer
)
STORED AS ORC
LOCATION 's3://bucket-name/path/'
TBLPROPERTIES (
  'classification'='orc',
  'projection.year.type'='integer',
  'projection.year.range'='2022,9999',
  'projection.month.type'='integer',
  'projection.month.range'='01,12',
  'projection.day.type'='integer',
  'projection.day.range'='01,31',
  'projection.hour.type'='integer',
  'projection.hour.range'='00,24',
  'projection.enabled'='true',
  'storage.location.template'='s3://bucket-name/path/year=${year}/month=${month}/day=${day}/hour=${hour}/');
and then execute SELECT action, year FROM partitioned_table WHERE year=2022; then Athena chews on the query for 3 minutes and then returns no results.
If I go modify the glue table so it has the same projection settings from TBLPROPERTIES in the above example set as Table Properties, and then execute SELECT action, year FROM partitioned_table WHERE year=2022; then Athena chews on the query for 3 minutes and then returns no results.
Interestingly, if I drop and recreate partitioned_table so that its location is s3://bucket-does-not-exist, then SELECT action, year FROM partitioned_table WHERE year=2022; also runs for 3+ minutes and then returns no results.
I'm not able to see what I'm doing wrong here after referring to multiple pages of AWS documentation and a few official AWS videos describing this process. What's wrong here?

How does Redshift Spectrum scan data?

Given a data source of 1.4 TB of Parquet data on S3 partitioned by a timestamp field (so partitions are year - month - day), I am querying a specific day of data (2.6 GB) and retrieving all available fields in the Parquet files via Redshift Spectrum with this query:
SELECT *
FROM my_external_schema.my_external_table
WHERE year = '2020' and month = '01' and day = '01'
The table is made available via a Glue Crawler that points at the top-level "folder" in S3; this creates a database, which I then link to a new external schema with this command:
create external schema my_external_schema from data catalog
database 'my_external_schema'
iam_role 'arn:aws:iam::123456789:role/my_role'
region 'my-region-9';
Analysing the table in my IDE I can see the table is generated by this statement:
create external table my_external_schema.my_external_table (
  id string,
  my_value string,
  my_nice_value string
)
partitioned by (year string, month string, day string)
row format serde 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
with serdeproperties ('serialization.format'='1')
stored as
  inputformat 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
  outputformat 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
location 's3://my-bucket/my/location/'
table properties ('CrawlerSchemaDeserializerVersion'='1.0', 'CrawlerSchemaSerializerVersion'='1.0', 'UPDATED_BY_CRAWLER'='my_crawler');
When I analyse the query from Redshift, I see that ~86 GB of data was scanned instead.
How is that possible? It's a concern because Redshift Spectrum bills based on the amount of data scanned, and it looks like the service is scanning around 40 times the amount of data actually in that partition.
I also tried to execute the same query in Athena and there I get only 2.55 GB of data scanned (definitely more reasonable).
I can't give too many details on the cluster size but assume that those 86GB of scanned data would fit in the cluster's Memory.
The problem seems to be in the AWS Redshift Console.
If I analyse the query from "query details" in the Redshift console, I can see that "Total data scanned" reports 86GB. As Vzarr mentioned, I ran the same query on Athena to compare the performance. The execution time was basically the same, but the amount of data scanned was completely different: 2.55GB.
I did the same comparison with other queries on the S3 external schema, with and without using partition columns: I saw that the total GB scanned differed in every test, sometimes by a lot (320MB in Redshift Spectrum vs 20GB in Athena).
I decided to look at the system tables in Redshift in order to understand how the query on the external schema was working. I did a very simple test using SVL_S3QUERY:
SELECT (cast(s3_scanned_bytes as double precision) / 1024 / 1024 / 1024) as gb_scanned,
s3_scanned_rows,
query
FROM SVL_S3QUERY
WHERE query = '<my-query-id>'
The result was completely different from what the AWS Redshift Console says for the same query. Not only was gb_scanned wrong, but s3_scanned_rows was too. The query returned a total of 2.55GB of data scanned, exactly the same as what Athena reported.
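As a further check on how Spectrum pruned the partitions for that query, a sketch like this against SVL_S3PARTITION (reusing the same query-id placeholder) shows how many partitions were qualified and assigned; MAX is used here only because the view reports per segment and node:
SELECT query,
       MAX(total_partitions)     AS total_partitions,
       MAX(qualified_partitions) AS qualified_partitions,
       MAX(assigned_partitions)  AS assigned_partitions
FROM SVL_S3PARTITION
WHERE query = '<my-query-id>'
GROUP BY query;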
To confirm the numbers in SVL_S3QUERY, I used AWS Cost Explorer to double-check the total GB scanned in a day against how much we paid for Redshift Spectrum: the numbers were basically the same.
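For reference, the daily figure to compare against Cost Explorer can be pulled with a rough aggregate along these lines (a sketch, assuming the SVL_S3QUERY_SUMMARY retention window covers the day you're checking):
SELECT TRUNC(starttime) AS scan_date,
       SUM(s3_scanned_bytes) / (1024.0 * 1024 * 1024) AS gb_scanned
FROM SVL_S3QUERY_SUMMARY
GROUP BY 1
ORDER BY 1;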
At this point I don't know where, or from which table, the AWS Redshift Console takes the query details, but they seem to be completely wrong.

Determine the table creation date in AWS Athena using information_schema catalog?

Does anyone know if it's possible to retrieve the creation date of a table in AWS Athena using SQL on the information_schema catalog? I know I can use show properties on an individual table basis, but I want to get the data for thousands of tables.
This can tell you when the data in a given table was populated.
1) list the s3 objects in the table
Some SQL to execute via Athena. I like knowing record counts as well, but really the "$PATH" thing is the important part.
select count(*) as record_cnt, "$PATH" as path
from myschema.mytable
group by "$PATH"
order by "$PATH"
record_cnt path
10000 s3://mybucket/data/foo/tables/01234567-890a-bcde/000000_00000-abcdef
..etc...
2) get the timestamp for the s3 objects
$ aws s3 ls s3://mybucket/data/foo/tables/01234567-890a-bcde/000000_00000-abcdef
2022-01-01 12:01:23 4456780 000000_00000-abcdef
$
You'll need to loop through the S3 objects and consolidate the timestamps.
I see +/- a few seconds on the backing S3 objects when I create tables, so you may want to round that to the nearest 5 minutes... or hour. Shrug: whatever fits the tempo of your table creation.
I've also seen tables where data is appended over time, so maybe the closest you could get to an actual table creation timestamp would be the oldest S3 object's timestamp.
PS: I haven't dug into the Glue table definition stuff enough to know if one can get useful metadata out of that.

How best to cache a BigQuery table for fast lookup of individual rows?

I have a raw data table in BigQuery that has hundreds of millions of rows. I run a scheduled query every 24 hours to produce some aggregations, which results in a table in the ballpark of 33 million rows (6GB) that may be expected to grow slowly to approximately double its current size.
I need a way to get quick, one-row-at-a-time lookup access by id to that aggregate table in a separate event-driven pipeline, i.e. a process is notified that person A just took an action; what do we know about this person's history from the aggregation table?
Clearly BigQuery is the right tool to produce the aggregate table, but not the right tool for the quick lookups. So I need to offload it to a secondary datastore like Firestore. But what is the best process to do so?
I can envision a couple strategies:
1) Schedule a dump of the agg table to GCS (see the export sketch after this list). Kick off a Dataflow job to stream the contents of the GCS dump to Pub/Sub. Create a serverless function that listens to the Pub/Sub topic and inserts rows into Firestore.
2) A long-running script on Compute Engine which just streams the table directly from BQ and runs inserts. (Seems slower than strategy 1.)
3) Schedule a dump of the agg table to GCS. Format it in such a way that it can be directly imported to Firestore via gcloud beta firestore import gs://[BUCKET_NAME]/[EXPORT_PREFIX]/
4) Maybe some kind of Dataflow job that performs lookups directly against the BigQuery table? I've not played with this approach before; no idea how costly / performant it would be.
5) Some other option I've not considered?
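For the "dump the agg table to GCS" step in strategies 1 and 3, a minimal sketch using BigQuery's EXPORT DATA statement (the bucket path, dataset, and column names here are assumptions) could look like:
-- Sketch: export the aggregate table to GCS as newline-delimited JSON.
EXPORT DATA OPTIONS(
  uri='gs://my-bucket/agg-export/*.json',
  format='JSON',
  overwrite=true) AS
SELECT id, history
FROM my_dataset.agg_table;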
The ideal solution would allow me quick access in milliseconds to an agg row, which would allow me to append data to the real-time event.
Is there a clear winner here in the strategy I should pursue?
Remember that you could also CLUSTER your table by id, making your lookup queries much faster and cheaper in terms of data scanned. They will still take more than a second to run, though.
https://medium.com/google-cloud/bigquery-optimized-cluster-your-tables-65e2f684594b
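As a rough illustration of that clustering suggestion (the dataset, table, and column names below are assumptions, not taken from the question), the aggregate table can be rebuilt clustered by id:
-- Sketch: recreate the aggregate table clustered by id so point lookups
-- only read the blocks containing that id.
CREATE OR REPLACE TABLE my_dataset.agg_table_clustered
CLUSTER BY id AS
SELECT * FROM my_dataset.agg_table;

-- A lookup then scans far less data (though still around a second or more):
SELECT * FROM my_dataset.agg_table_clustered WHERE id = 'person_a';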
You could also set up exports from BigQuery to Cloud SQL, for sub-second results:
https://medium.com/@gabidavila/how-to-serve-bigquery-results-from-mysql-with-cloud-sql-b7ddacc99299
And remember, BigQuery can now read straight out of Cloud SQL if you'd like it to be your source of truth for "hot" data:
https://medium.com/google-cloud/loading-mysql-backup-files-into-bigquery-straight-from-cloud-sql-d40a98281229
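For that "read straight out of Cloud SQL" option, the federated query goes through EXTERNAL_QUERY; a minimal sketch (the connection id and table name are assumptions) looks like this:
-- Sketch: BigQuery federated query against a Cloud SQL connection.
SELECT *
FROM EXTERNAL_QUERY(
  'my-project.us.my-cloudsql-connection',
  'SELECT id, history FROM agg_lookup LIMIT 10');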

Hive query on s3 partition is too slow

I have partitioned the data by date, and here is how it is stored in S3:
s3://dataset/date=2018-04-01
s3://dataset/date=2018-04-02
s3://dataset/date=2018-04-03
s3://dataset/date=2018-04-04
...
I created a Hive external table on top of this. I am executing this query:
select count(*) from dataset where `date` ='2018-04-02'
This partition has two Parquet files like this:
part1 -xxxx- .snappy.parquet
part2 -xxxx- .snappy.parquet
Each file is 297MB, so these aren't big files and there aren't many files to scan.
The query returns 12,201,724 records. However, it takes 3.5 minutes to return this; since a single partition takes this long, running even the count query on the whole dataset (7 years of data) takes hours to return results. Is there any way I can speed this up?
Amazon Athena is, effectively, a managed Presto service. It can query data stored in Amazon S3 without having to run any clusters.
It is charged based upon the amount of data scanned from S3, so it runs very efficiently when using partitions and Parquet files.
See: Analyzing Data in S3 using Amazon Athena | AWS Big Data Blog
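To make that concrete, a minimal Athena sketch over the same layout (the database, table, and column names are assumptions) might look like this:
-- Sketch: expose the partitioned Parquet data to Athena, then query one day.
CREATE EXTERNAL TABLE IF NOT EXISTS mydb.dataset (
  id string,
  payload string
)
PARTITIONED BY (`date` string)
STORED AS PARQUET
LOCATION 's3://dataset/';

MSCK REPAIR TABLE mydb.dataset;   -- pick up the date=YYYY-MM-DD partitions

SELECT count(*) FROM mydb.dataset WHERE "date" = '2018-04-02';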