When you load data into your Amazon Redshift tables, you can check the load status by querying the STV_LOAD_STATE system table.
Is there a way to achieve the same for the UNLOAD operation? In other words, is there a way to find out the current stage of an unload process?
Unlike loading data into Redshift, unloading has to run a SELECT statement first, so Redshift can't report a status the way it does during a load.
For example, if the SELECT has to join several tables and scan a lot of data to generate the output, the query itself may take a long time even though the actual unload to S3 is not the slow part.
So I usually check the query execution steps in the AWS console to get a rough idea of where the unload is.
I also check the S3 prefix I am unloading to, to see whether files have started arriving. They usually come in batches, so that gives you an idea as well.
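If you want to do that S3 check programmatically, here is a minimal sketch with boto3 (the bucket and prefix are placeholders -- use whatever you passed to UNLOAD's TO clause):

import boto3

s3 = boto3.client("s3")

# Placeholder bucket/prefix -- match them to the TO 's3://...' clause of your UNLOAD.
resp = s3.list_objects_v2(Bucket="my-unload-bucket", Prefix="exports/orders_")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"], obj["LastModified"])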
2021, and we have a solution
STL_UNLOAD_LOG
https://docs.aws.amazon.com/redshift/latest/dg/r_STL_UNLOAD_LOG.html
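A minimal sketch of how you might check it from Python (psycopg2 and all connection details and query ids below are placeholders; STL_UNLOAD_LOG records one row per file written by the UNLOAD):

import psycopg2

# Placeholder endpoint and credentials -- replace with your cluster's values.
conn = psycopg2.connect(host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
                        port=5439, dbname="dev", user="admin", password="...")
with conn, conn.cursor() as cur:
    # One row per output file for the given query id (find the id of the running
    # UNLOAD in STV_INFLIGHT).
    cur.execute("""
        SELECT query, path, line_count, transfer_size
        FROM stl_unload_log
        WHERE query = %s
        ORDER BY path
    """, (12345,))  # 12345 is a placeholder query id
    for query_id, path, line_count, transfer_size in cur.fetchall():
        print(path, line_count, transfer_size)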
Because of company policies, a lot of the information we need as input is inserted into a BigQuery table that we then have to SELECT from.
My problem is that running a SELECT directly against this table and processing the results (on a virtual machine, etc.) is prone to errors and rework. If my process stops, I need to run the query again and reprocess everything.
Is there a way to export data from BigQuery to a Kinesis-like stream (I'm more familiar with AWS)?
Dataflow + Pub/Sub seems to be the way to go for this kind of issue.
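If you'd rather not stand up a Dataflow pipeline, here is a minimal sketch of the same idea using the BigQuery and Pub/Sub client libraries directly (project, dataset, table, and topic names are placeholders, and application default credentials are assumed):

import json
from google.cloud import bigquery, pubsub_v1

bq = bigquery.Client(project="my-project")
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "bq-export-topic")  # placeholder topic

# Read the input table once, then publish each row as a message so the downstream
# consumer can ack/checkpoint instead of re-running the query after a failure.
rows = bq.query("SELECT * FROM `my-project.my_dataset.input_table`").result()
for row in rows:
    payload = json.dumps(dict(row), default=str).encode("utf-8")
    publisher.publish(topic_path, payload)

For large tables, a Dataflow pipeline doing the same read-and-publish is the more scalable route, which is what the answer above suggests.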
Thank you jamiet!
For one of my classes, I have to analyze a "big data" dataset. I found the following dataset on the AWS Registry of Open Data that seems interesting:
https://registry.opendata.aws/openaq/
How exactly can I create a connection and load this dataset into Databricks? I've tried the following:
df = spark.read.format("text").load("s3://openaq-fetches/")
However, I receive the following error:
java.lang.AssertionError: assertion failed: Conflicting directory structures detected. Suspicious paths:
Also, it seems that this dataset has multiple folders. How do I access a particular folder in Databricks, and if possible, can I focus on a particular time range? Let's say, from 2016 to 2020?
Ultimately, I would like to perform various SQL queries in order to analyze the dataset and perhaps create some visualizations as well. Thank you in advance.
If you browse the bucket, you'll see that there are multiple datasets there, in different formats, which require different access methods. So you need to point to the specific folder (and maybe a subfolder) to load the data. For example, to load the daily dataset you need to use the CSV format:
# The daily OpenAQ dataset is stored as headerless CSV files
df = spark.read.format("csv").option("inferSchema", "true")\
    .option("header", "false").load("s3://openaq-fetches/daily/")
To load only a subset of the data you can use path filters, for example; see the Spark documentation on loading data.
P.S. inferSchema isn't optimal from a performance standpoint (it needs an extra pass over the data), so it's better to provide an explicit schema when reading, as in the sketch below.
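A sketch of both points, with a hypothetical (and deliberately partial) schema and a path glob that narrows the load to 2016-2020; check the actual column layout and file naming under s3://openaq-fetches/daily/ before relying on either:

from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Hypothetical column list -- inspect one of the daily CSV files for the real layout.
daily_schema = StructType([
    StructField("location", StringType(), True),
    StructField("city", StringType(), True),
    StructField("country", StringType(), True),
    StructField("parameter", StringType(), True),
    StructField("value", DoubleType(), True),
    StructField("unit", StringType(), True),
])

df = (spark.read.format("csv")
      .option("header", "false")
      .schema(daily_schema)        # explicit schema instead of inferSchema
      .load(["s3://openaq-fetches/daily/201[6-9]*",   # globs assume files are named by date
             "s3://openaq-fetches/daily/2020*"]))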
I am running a BigQuery script to generate a table. The script assumes the existence of another table, performs some transformations, and places the transformed data into an output table. However, I want the script to terminate its execution (and possibly post a message) if the input table does not comply with some conditions. What is the best way of terminating a BigQuery script using a condition?
One way to achieve this without an external app that calls the BigQuery API and performs the requirement checks (which is otherwise the nicer way, easier to maintain and evolve) is to create a scheduled query. That option is well suited to recurring requests; if yours isn't recurring, code the check in your preferred language.
So, with BigQuery scheduled queries, you can run your query, define the destination table, and define a notification channel.
Set the Pub/Sub topic that you want. However, the message isn't customizable: you only get the status and the failure reason of the latest execution. You then need to dig in to understand exactly what happened during the query, and write more involved code to read the logs and find the root cause.
If your check only needs an OK/KO status, this solution is suitable; if not, prefer your own code, which gives you finer-grained error management.
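As a minimal sketch of the "read the status from Pub/Sub" part (the subscription name is a placeholder, and the field names are assumed from the Data Transfer Service run payload -- verify them against the messages you actually receive):

import json
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
sub_path = subscriber.subscription_path("my-project", "scheduled-query-status-sub")  # placeholder

def callback(message):
    run = json.loads(message.data)           # run metadata published on the notification topic
    # "state" / "errorStatus" are assumed field names; check your payloads.
    print(run.get("state"), run.get("errorStatus"))
    message.ack()

streaming_pull_future = subscriber.subscribe(sub_path, callback=callback)
try:
    streaming_pull_future.result(timeout=60)  # listen for a while; adjust to your scheduling
except Exception:
    streaming_pull_future.cancel()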
I have a fairly complex BigQuery query and it seems to cost more than I expect. It has 97 intermediate stages... are those charged?
You can see how much data will be scanned (and therefore charged) by your query using the --dry_run flag of the bq CLI, or by looking at the estimate shown at the right end of the query editor in the console where you set up and run your query.
BigQuery's on-demand pricing model is per byte read, so the intermediate stages themselves aren't billed separately. To my understanding, at the moment, if you reference a table in multiple CTEs you are charged for it once, but this might depend on how the query is written.
The best practice is always to use the dry-run feature, which is very accurate.
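The same estimate is available programmatically; a minimal sketch with the Python client (the query and project are placeholders, and credentials are assumed to be set up):

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

job = client.query(
    "SELECT * FROM `my-project.my_dataset.my_table`",  # placeholder query
    job_config=job_config,
)
# With dry_run=True nothing is executed or billed; the job only returns the estimate.
print(f"This query would process {job.total_bytes_processed} bytes")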
Using a Redshift cluster of under 3 nodes, we plan on doing 50-100 INSERTs every 10 seconds. Within that same 10-second window we will also run the equivalent of a Redshift upsert, as documented here https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-upsert.html, on about 50 to 100 rows as well.
I basically don't know whether a 10-second window is realistic, or whether a 10-minute window (or longer) is more appropriate for this kind of load. Should this be a daily batch? Should I try to re-architect to get rid of the upserts?
My question is essentially: can Redshift handle this load? I feel the upsert is happening too often. We are using Structured Streaming in Spark to handle all of this. If yes, what type of nodes should we be using? Does anyone who has done this have a ballpark estimate? If not, what are alternative architectures?
Essentially, what we're trying to do is load entity data to be joined with the events in Redshift. But we want the analytics to be as near real-time as possible, so we want to load as fast as we can.
There's probably no exact answer for this, so any explanation that helps me estimate the requirements based on this load would be helpful.
I do not think you will achieve the performance you seek.
Running large numbers of INSERT statements is not an optimal way to load data into Amazon Redshift.
The best way is to run COPY from data stored in Amazon S3. This loads data in parallel across all nodes.
Unless you have a very real need to get data immediately into Redshift, it would be better to batch the data in S3 over a period of time (the larger the batch, the better), then load it via COPY. This also works well with the staging-table approach to performing UPSERTs, sketched below.
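A minimal sketch of that batch-then-COPY flow with a staging-table upsert (the endpoint, credentials, IAM role, table names, and S3 prefix are all placeholders; the SQL follows the staging-table pattern from the linked docs):

import psycopg2

# Placeholder endpoint and credentials.
conn = psycopg2.connect(host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
                        port=5439, dbname="dev", user="admin", password="...")

with conn, conn.cursor() as cur:
    # 1. Stage the batched files from S3 -- COPY loads in parallel across the slices.
    cur.execute("CREATE TEMP TABLE stage (LIKE target_table);")
    cur.execute("""
        COPY stage
        FROM 's3://my-bucket/batches/2021-06-01/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-copy-role'
        FORMAT AS CSV;
    """)
    # 2. Upsert: delete the rows that are about to be replaced, then insert the new versions.
    cur.execute("DELETE FROM target_table USING stage WHERE target_table.id = stage.id;")
    cur.execute("INSERT INTO target_table SELECT * FROM stage;")
# Leaving the `with conn` block commits the transaction.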
The best way to discover whether Redshift will handle a particular load is to try it! Spin up another cluster and try the various methods, measuring the performance each time.
I would recommend using Kinesis Data Firehose to load data into Redshift. It buffers records by time and size and loads them accordingly.
We tried inserting manually in batches; it doesn't seem to be the cleanest way of handling this when an optimized cloud service exists for exactly that purpose.
https://docs.aws.amazon.com/ses/latest/DeveloperGuide/event-publishing-redshift-firehose-stream.html
Firehose collects the records into batches, compresses them, and loads them into Redshift (behind the scenes it stages the batches in S3 and issues COPY).
Upsert Process:
If you want upserts, this is how I would do them in a scalable way (a sketch of the Lambda step follows below):
DynamoDB Table (Update) --> DynamoDB Streams --> Lambda --> Firehose --> Redshift
Have a scheduled job to clean up any duplicate records based on created_timestamp.
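A minimal sketch of the Lambda step in that pipeline (the delivery stream name is a placeholder, and the record handling is simplified -- a real handler would unmarshal the DynamoDB attribute-value format into plain values):

import json
import boto3

firehose = boto3.client("firehose")
DELIVERY_STREAM = "entity-upserts-to-redshift"  # placeholder Firehose delivery stream name

def handler(event, context):
    # Forward DynamoDB stream records to Firehose, which batches and loads them into Redshift.
    records = []
    for rec in event.get("Records", []):
        new_image = rec.get("dynamodb", {}).get("NewImage")
        if new_image:  # skip REMOVE events
            records.append({"Data": (json.dumps(new_image) + "\n").encode("utf-8")})
    if records:
        # put_record_batch accepts up to 500 records per call; chunk if your batch size is larger.
        firehose.put_record_batch(DeliveryStreamName=DELIVERY_STREAM, Records=records)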
Hope it helps.