Does the BigQuery dataset copy feature use slots? - google-cloud-platform

According to the documentation, BigQuery Dataset Copy (see https://cloud.google.com/bigquery/docs/copying-datasets) uses the BigQuery Data Transfer Service. What is unclear to me is whether the BigQuery Data Transfer Service uses slots. I tried to find this in the documentation but failed. Slots are used for ingestion jobs, so I was curious whether this is also the case for the Data Transfer Service.

As per this video from Google Cloud Tech, PIPELINE type jobs (extract, load and copy jobs) do take up slots if they are assigned reservations. The dataset copy documentation states that each table in the dataset gets its own copy job, so it seems that a dataset copy does take up slots.
Quoting from the documentation:
"Copying a dataset requires one copy job for each table in the dataset."
If no slots are assigned to PIPELINE type jobs, they use the shared pool of free slots by default.
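One way to check this for your own project is to look at what BigQuery recorded for recent copy jobs. The sketch below only builds a query string against the `INFORMATION_SCHEMA.JOBS_BY_PROJECT` view; the `region-us` qualifier and the one-day lookback are assumptions, and copy jobs may well report an empty `total_slot_ms`, so treat the result as a hint rather than proof.

```python
def copy_job_slot_query(region: str = "region-us", lookback_days: int = 1) -> str:
    """Build a query listing recent copy jobs with their reservation and slot time."""
    if lookback_days < 1:
        raise ValueError("lookback_days must be at least 1")
    return (
        "SELECT job_id, job_type, reservation_id, total_slot_ms\n"
        # The region qualifier is an assumption; use the region of your datasets.
        f"FROM `{region}`.INFORMATION_SCHEMA.JOBS_BY_PROJECT\n"
        "WHERE job_type = 'COPY'\n"
        "  AND creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), "
        f"INTERVAL {lookback_days} DAY)\n"
        "ORDER BY creation_time DESC"
    )


# Paste the printed statement into the BigQuery console or pass it to a client.
print(copy_job_slot_query())
```

A non-empty `reservation_id` on a copy job would suggest that it ran against an assigned reservation rather than the shared pool.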

Related

Data Storage and Analytics on AWS

I have a data analytics requirement on AWS. I have limited knowledge of Big Data processing, but based on my analysis, I have figured out some options.
The requirement is to collect data by calling a provider API every 30 minutes (data ingestion).
The data is mainly structured.
This data needs to be stored somewhere (an S3 data lake or Redshift, I'm not sure), and various aggregations/dimensions of this data are to be provided through a REST API.
There is a future requirement to run ML algorithms on the original data, so the storage needs to be chosen accordingly. Based on this, can you suggest:
How to ingest data (a Lambda function that runs at a scheduled interval, pulls the data and stores it, OR any better way to pull data in AWS)
How to store it (S3 or Redshift)
Data analytics (currently some monthly and weekly aggregations): what tools can be used? What tools should I use if I am storing data in S3?
How to expose the analytics results through an API. (I hope I can use Lambda to query the analytics engine from the previous step.)
Ingestion is simple. If the retrieval is relatively quick, then scheduling an AWS Lambda function is a good idea.
However, the answers to your other questions really depend upon how you are going to use the data, so start from that and work backwards.
For Storage, Amazon S3 makes sense at least for the initial storage of the retrieved data, but might (or might not) be appropriate for the API and Analytics.
If you are going to provide an API, then you will need to consider how the API code (e.g. behind Amazon API Gateway) will retrieve the data. For example, is the response identical to the blob of data originally retrieved, or does it require complex transformations, or perhaps combining data from other locations and time intervals? This will help determine how the data should be stored so that it is easily retrieved.
Data Analytics needs will also drive how your data is stored. Consider whether an SQL database is sufficient. If there are millions or billions of rows, you could consider using Amazon Redshift. If the data is kept in Amazon S3, then you might be able to use Amazon Athena. The correct answer depends completely upon how you intend to access and process the data.
Bottom line: Consider first how you will use the data, then determine the most appropriate place to store it. There is no generic answer that we can provide.
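For the ingestion part, a scheduled Lambda function could look roughly like the sketch below. It is only an illustration: the `fetch` and `store` callables, the `raw/year=.../month=.../day=...` key layout and the helper names are all hypothetical, and in a real function `store` would wrap an S3 upload while the schedule itself would be configured outside the code.

```python
import json
from datetime import datetime, timezone
from typing import Callable


def build_object_key(now: datetime, prefix: str = "raw") -> str:
    """Partition raw files by date so later queries can prune by time range."""
    return (
        f"{prefix}/year={now:%Y}/month={now:%m}/day={now:%d}/"
        f"{now:%H%M%S}.json"
    )


def make_handler(
    fetch: Callable[[], list],
    store: Callable[[str, bytes], None],
    clock: Callable[[], datetime] = lambda: datetime.now(timezone.utc),
):
    """Return a Lambda-style handler; fetch, store and clock are injected."""

    def handler(event, context):
        records = fetch()  # call the provider API (hypothetical)
        key = build_object_key(clock())
        body = json.dumps(records).encode("utf-8")
        store(key, body)  # e.g. a thin wrapper around an S3 upload (assumption)
        return {"key": key, "count": len(records)}

    return handler


# Local dry run with in-memory stand-ins instead of real AWS calls.
saved = {}
handler = make_handler(
    fetch=lambda: [{"id": 1, "value": 10}],
    store=lambda key, body: saved.update({key: body}),
    clock=lambda: datetime(2024, 1, 15, 10, 30, 0, tzinfo=timezone.utc),
)
result = handler({}, None)
print(result)
```

Keeping the raw responses untouched in S3 leaves the later choice between Athena, Redshift or an SQL database open, which matches the "work backwards from usage" advice.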

Truncate load in Redshift daily

I would like some suggestions on loading data into Redshift.
Currently we have an EMR cluster where raw data is ingested regularly. We have a transformation job which runs daily and creates the final modeled object. However, we are following a truncate-and-load strategy in EMR. Due to business reasons, there is no way to figure out which data has changed.
We are planning to store this modeled object in Redshift.
My question is: if we follow the same truncate-and-load strategy in Redshift, will that work?
I was only able to find articles which say to use COPY for bulk loads and INSERT for small updates, but nothing on whether we can and should use Redshift when the data is overwritten daily.
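To make the question concrete, here is a minimal sketch of what a daily full reload could look like on the Redshift side. It assumes the modeled object is written to S3 as Parquet; the table name, S3 path and IAM role ARN are hypothetical placeholders. The sketch uses `DELETE` instead of `TRUNCATE` inside the transaction because `TRUNCATE` in Redshift commits the transaction it runs in, so readers could otherwise see an empty table while the load is still running.

```python
from typing import List


def full_reload_statements(table: str, s3_path: str, iam_role: str) -> List[str]:
    """Build the SQL statements for replacing a table's contents in one transaction."""
    return [
        "BEGIN;",
        # DELETE keeps the reload transactional; TRUNCATE would commit immediately.
        f"DELETE FROM {table};",
        f"COPY {table} FROM '{s3_path}' IAM_ROLE '{iam_role}' FORMAT AS PARQUET;",
        "COMMIT;",
    ]


# All three arguments are made-up example values.
statements = full_reload_statements(
    table="analytics.daily_model",
    s3_path="s3://example-bucket/modeled/latest/",
    iam_role="arn:aws:iam::123456789012:role/example-redshift-load",
)
print("\n".join(statements))
```

Whether this is acceptable depends on table size and how often the reload runs; the deleted rows still occupy space until the table is vacuumed.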

Move dataset to another region bigquery

I want to move a dataset to another region, but there are some Pub/Sub subscriptions with Dataflow templates loading the tables within the dataset. How can I do this without interrupting the Dataflow jobs, or while interrupting them as little as possible?
Is it possible to do this with the following steps?
Create a temporary dataset with a temporary name in the new region
Copy the original dataset to the temporary dataset
Delete the original dataset
Create a new dataset with the original name in the new region
Copy the temporary dataset to the new dataset with the original name
I'm open to suggestions :D
You can use the copy dataset feature, which is in preview for now. One interesting capability of the feature is cross-region copy.
You can perform the same process, but more easily!
As for your Dataflow pipeline, I think it won't keep working unchanged. The location is an important piece of information when you write to BigQuery. Have a try, but I'm pretty sure that you will have to update it.
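The five steps from the question can be written down as a sequence of `bq` commands. The sketch below only builds and prints the commands; it does not run them. The project, dataset names and location are hypothetical, and the flags reflect my reading of the `bq` CLI and dataset copy documentation, so verify them before use. A dataset copy runs asynchronously, so each copy has to finish before the next step starts.

```python
import json
import shlex
from typing import List


def copy_command(project: str, source: str, target: str) -> List[str]:
    """Command that creates a dataset copy transfer from source to target."""
    params = {
        "source_project_id": project,
        "source_dataset_id": source,
        "overwrite_destination_table": "true",
    }
    return [
        "bq", "mk", "--transfer_config",
        f"--project_id={project}",
        "--data_source=cross_region_copy",
        f"--target_dataset={target}",
        f"--display_name=copy {source} to {target}",
        f"--params={json.dumps(params)}",
    ]


def relocation_plan(project: str, dataset: str, tmp: str, location: str) -> List[List[str]]:
    """Ordered commands for moving a dataset via a temporary dataset."""
    return [
        ["bq", "mk", f"--location={location}", "--dataset", f"{project}:{tmp}"],
        copy_command(project, dataset, tmp),      # wait for this copy to finish
        ["bq", "rm", "-r", "-f", "--dataset", f"{project}:{dataset}"],
        ["bq", "mk", f"--location={location}", "--dataset", f"{project}:{dataset}"],
        copy_command(project, tmp, dataset),      # wait again before cleanup
    ]


# Example values are placeholders, not real resources.
plan = relocation_plan("my-project", "events", "events_tmp", "EU")
for step in plan:
    print(shlex.join(step))
```

Writes from the Dataflow jobs that arrive between the first copy and the recreation of the dataset are not covered by this plan, which is why the jobs most likely need to be stopped or updated around the move.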

Export Data from BigQuery to Google Cloud SQL using Create Job From SQL tab in DataFlow

I am working on a project that crunches data and does a lot of processing, so I chose BigQuery because it has good support for analytical queries. However, the final computed result is stored in a table that has to power my webpage (used as a transactional/OLTP store). My understanding is that BigQuery is not suitable for transactional queries. I looked into alternatives and realized I can use Dataflow to do the analytical processing and move the data to Cloud SQL (a relational database fits my purpose).
However, it is not as straightforward as it seems. First I have to create a pipeline to move the data to a Cloud Storage bucket and then move it to Cloud SQL.
Is there a better way to manage it? Can I use "Create Job from SQL" in Dataflow to do it? I haven't found any examples which use "Create Job From SQL" to process and move data to Cloud SQL.
Consider a simple example on Robinhood:
Compute the user's returns by looking at his portfolio and show the graph with the returns for every month.
There are other options besides using a pipeline, but in all cases you cannot export table data to a local file, to Sheets, or to Drive. The only supported export location is Cloud Storage, as stated on the Exporting table data documentation page.
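Because Cloud Storage is the only export target, one non-pipeline route is a BigQuery `EXPORT DATA` statement followed by a Cloud SQL CSV import. The sketch below only builds the statement and the `gcloud` command as text; the bucket, instance, database, table and query are hypothetical, and the exported file name has to be looked up in the bucket after the export has run.

```python
from typing import List


def export_statement(query: str, uri: str) -> str:
    """Build an EXPORT DATA statement that writes query results to Cloud Storage."""
    # The export URI is expected to contain exactly one wildcard.
    if not uri.startswith("gs://") or uri.count("*") != 1:
        raise ValueError("uri must be a gs:// path with a single '*' wildcard")
    return (
        "EXPORT DATA OPTIONS(\n"
        f"  uri='{uri}',\n"
        "  format='CSV',\n"
        "  overwrite=true,\n"
        "  header=false\n"  # data rows only, no column header line
        ") AS\n"
        f"{query}"
    )


def import_command(instance: str, file_uri: str, database: str, table: str) -> List[str]:
    """Build the gcloud command that loads one exported CSV file into Cloud SQL."""
    return [
        "gcloud", "sql", "import", "csv", instance, file_uri,
        f"--database={database}", f"--table={table}",
    ]


# Placeholder names for illustration only.
sql = export_statement(
    "SELECT user_id, month, monthly_return FROM reports.monthly_returns",
    "gs://example-bucket/exports/returns-*.csv",
)
print(sql)
print(" ".join(import_command(
    "example-instance",
    "gs://example-bucket/exports/returns-000000000000.csv",
    "webapp",
    "monthly_returns",
)))
```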

ETL Possible Between S3 and Redshift with Kinesis Firehose?

My team is attempting to use Redshift to consolidate information from several different databases. In our first attempt to implement this solution, we used Kinesis Firehose to write records of POSTs to our APIs to S3 then issued a COPY command to write the data being inserted to the correct tables in Redshift. However, this only allowed us to insert new data and did not let us transform data, update rows when altered, or delete rows.
What is the best way to maintain an updated data warehouse in Redshift without using batch transformation? Ideally, we would like updates to occur "automatically" (< 5min) whenever data is altered in our local databases.
Neither Firehose nor Redshift has triggers; however, you could potentially use Lambda with Firehose to pre-process the data before it gets inserted, as described here: https://blogs.aws.amazon.com/bigdata/post/Tx2MUQB5PRWU36K/Persist-Streaming-Data-to-Amazon-S3-using-Amazon-Kinesis-Firehose-and-AWS-Lambda
In your case, you could extend it to use Lambda on S3 as Firehose is creating new files, which would then execute COPY/SQL update.
Another alternative is just writing your own KCL client that would implement what Firehose does, and then executing the required updates after COPY of micro-batches (500-1000 rows).
I've done such an implementation (we needed to update old records based on new records) and it works alright from a consistency point of view, though I'd advise against such an architecture in general because Redshift performs badly on updates. Based on my experience, the key rule is to treat Redshift data as append-only: it is often faster to use filters to exclude unnecessary rows (with optional regular pruning, e.g. daily) than to delete/update those rows in real time.
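The append-only idea can be illustrated with a small sketch: every change is appended as a new row, and readers filter down to the newest version of each key. In Redshift this filter would live in a query or view; the Python below only shows the logic, and the column names `id`, `updated_at` and `deleted` are hypothetical.

```python
from typing import Dict, Iterable, List


def latest_rows(rows: Iterable[dict], key: str = "id", version: str = "updated_at") -> List[dict]:
    """Keep only the newest version of each key and hide rows marked as deleted."""
    latest: Dict[object, dict] = {}
    for row in rows:
        current = latest.get(row[key])
        if current is None or row[version] > current[version]:
            latest[row[key]] = row
    # A "deleted" flag acts as a tombstone instead of a real DELETE.
    return [row for row in latest.values() if not row.get("deleted", False)]


# Rows are only ever appended; updates and deletes are new rows.
appended = [
    {"id": 1, "updated_at": 1, "status": "new"},
    {"id": 2, "updated_at": 1, "status": "new"},
    {"id": 1, "updated_at": 2, "status": "paid"},
    {"id": 2, "updated_at": 3, "status": "new", "deleted": True},
]
print(latest_rows(appended))
# prints [{'id': 1, 'updated_at': 2, 'status': 'paid'}]
```

The regular pruning mentioned above would then physically remove superseded and tombstoned rows in a batch, outside the real-time path.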
Yet another alternative is to have Firehose dump data into staging table(s), and then have scheduled jobs take whatever is in those tables, process it, move the data, and rotate the tables.
As a general reference architecture for real-time inserts into Redshift, take a look at this: https://blogs.aws.amazon.com/bigdata/post/Tx2ANLN1PGELDJU/Best-Practices-for-Micro-Batch-Loading-on-Amazon-Redshift
This has been implemented multiple times, and works well.