I am trying to copy a table in the US region to the EU region. I tried the Google Cloud Data Transfer Service, but it copies at the dataset level.
My dataset has lots of huge tables, and I just need to copy one of them daily.
Question 1: Does the Google Cloud Data Transfer Service work for a single table rather than a whole dataset?
Question 2: What is the easiest way to do a cross-region transfer for a single table?
This is my dataset copy job using the Google Cloud Data Transfer Service: at the dataset level it works fine, but it does not let me specify a table name.
That is happening because the chosen source type is Dataset Copy, which copies the whole dataset. There is no other option in the source-type drop-down list for a table transfer, though.
I did ETL on our data and ran some simple aggregations on it in Athena. Our plan is to use our BI tool to access those tables from Athena. It works for now, but I'm worried that these tables are static, i.e. they only reflect the data as of when I last created the Athena table. When queried, are Athena tables automatically refreshed? If not, how do I make them update automatically when called by our BI tool?
My only solution so far for overwriting the tables is to run two different queries: one to drop the table and another to re-create it. Since they are two separate queries, I'm not sure they can be run together (at least in Athena, you can't run them both in one go).
Amazon Athena is a query engine, not a database.
When a query is sent to Amazon Athena, it looks at the location stored in the table's DDL. Athena then goes to the Amazon S3 location specified and scans the files for the requested data.
Therefore, every Athena query always reflects the data contained in the underlying Amazon S3 objects:
Want to add data to a table? Then store an additional object in that location.
Want to delete data from a table? Then delete the underlying object that contains that data.
There is no need to "drop a table, then re-create the table". The table will always reflect the current data stored in Amazon S3. In fact, the table doesn't actually exist -- rather, it is simply a definition of what the table should contain and where to find the data in S3.
The best use-case for Athena is querying large quantities of rarely-accessed data stored in Amazon S3. If the data is frequently accessed and updated, then a traditional database or data warehouse (e.g. Amazon Redshift) would be more appropriate.
Pointing a Business Intelligence tool to Athena is quite acceptable, but you need to have proper processes in place for updating the underlying data in Amazon S3.
I would also recommend storing the data in Snappy-compressed Parquet files, which will make Athena queries faster and cheaper (Athena charges are based on the amount of data scanned).
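As a rough illustration of both points above, here is a minimal Python sketch (the bucket, prefix, and column names are made-up placeholders): it writes a Snappy-compressed Parquet file and drops it under the table's S3 location, so the next Athena query automatically picks it up.

```python
# Minimal sketch of "adding data" to an Athena table: write a new
# Snappy-compressed Parquet object under the table's S3 LOCATION.
# Requires pandas, pyarrow, and boto3; all names below are placeholders.
import pandas as pd
import boto3

df = pd.DataFrame(
    {"order_id": [1001, 1002], "amount": [19.99, 5.50], "country": ["US", "DE"]}
)

# Parquet with Snappy compression (pandas' default codec with the pyarrow engine).
local_path = "/tmp/orders_20240101.parquet"
df.to_parquet(local_path, engine="pyarrow", compression="snappy")

# Upload under the location the table's DDL points at; the next Athena
# query against that table will automatically scan this new object.
s3 = boto3.client("s3")
s3.upload_file(local_path, "my-datalake-bucket", "curated/orders/orders_20240101.parquet")
```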
Can the Google BigQuery Data Transfer Service let me transfer data for specific apps automatically?
For example, I have 10 apps in my Google Play Console, but I only want to transfer data to BQ for 3 of them. Is it possible to make this work, or is there another approach?
Also, I just read the pricing doc: the monthly charge is $25 per unique package name in the Installs_country table.
I don't quite understand how to calculate my cost from that example.
Thank you.
For your requirement, you can download the reports of a specific app to Cloud Storage by selecting that app in the Google Play Console, and then send them to BigQuery using the BigQuery Data Transfer Service. As for the Google Play cost calculation: it is $25 per month per unique package name stored in the Installs_country table in BigQuery.
To select a specific app, follow the steps given below:
Go to the Play Console.
Click on Download Reports and select the type of report you want.
Under "Select an application," type and select the app for which you want to get the data.
Select the year and month for which you want to download the report.
If you are storing data in a Cloud Storage bucket, that will incur cost, and the pricing for transferring data from one storage bucket to another can be checked in this link. Since you are also storing and querying the data in BigQuery, that will be chargeable as well; for BigQuery pricing details you can check this documentation. You can use the Billing Calculator to estimate your costs.
I am using the BigQuery Data Transfer Service to migrate all data from Redshift to BigQuery.
After that, I want to perform backfilling for a specific time range if any data is missing, but I don't see any backfill option in the transfer job.
How can I achieve that in BigQuery?
Reading your question in light of your comments, I would proceed differently from what you describe. You reach the same goal, however :).
Using your ETL pipeline, the first step would be to accumulate the raw data in a data lake.
Let's take a storage service like S3 to do so. For this ETL pipeline, S3 is your data sink.
Note that your pipeline does nothing more than take raw data from A and put it into S3. Also, the location in S3 should be under a folder timestamped by day (e.g. yyyymmdd) so that you can sort and consume your data along the time dimension, as in the sketch below.
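A minimal sketch of that day-partitioned layout, assuming a hypothetical bucket name and raw JSON-lines records:

```python
# Sketch of the day-partitioned S3 layout described above; the bucket and
# prefix names are illustrative only.
from datetime import datetime, timezone
import json
import boto3

s3 = boto3.client("s3")
day = datetime.now(timezone.utc).strftime("%Y%m%d")   # e.g. "20200914"
records = [{"id": 1, "value": "raw event"}]           # raw data pulled from source A

# One object per batch under raw/yyyymmdd/ so a given day can be replayed later.
s3.put_object(
    Bucket="my-datalake-bucket",
    Key=f"raw/{day}/batch-0001.json",
    Body="\n".join(json.dumps(r) for r in records).encode("utf-8"),
)
```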
Obviously, the considered data is ahead in time of the data you already have in Redshift.
It may also have a different structure from what you already put in Redshift, due to the transformations you set in your initial pipeline.
If you loaded raw data directly into Redshift, then just export that data into the same S3 bucket under the prefix legacy/*. (If it was transformed, then you have to add a second S3 data sink to your pipeline with that intermediary transformation and keep the same S3 naming strategy.)
Let's take a break to understand what we have. We have filled an S3 bucket with raw data that we can now replay at will for any specific day, using a cron or an orchestration tool such as Apache Airflow. Moreover, you can freely modify the content of each timestamped folder in case you missed data, and replay the downstream pipelines => the backfill you want.
Speaking of which, S3 would act as the data source for those downstream pipelines, which would apply the desired transformations to the raw data from S3 and use BigQuery (and potentially Redshift) as data sinks. Now please take into consideration the price of these operations. The streaming API in BQ is expensive, on the order of $0.05 per GB ingested; use it only if you need real-time results.

If you can afford a latency of more than 5 minutes, a better strategy would be to set GCS as the data sink of your ETL and transfer the data from there into BQ (put the data under the same yyyymmdd naming pattern to enable potential backfills). This transfer is free if the GCS bucket and the BQ dataset are in the same region. You could trigger the transfer with GCS events, for instance by triggering a Cloud Function on blob creation that loads the data into BQ, as sketched below.
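Here is a minimal sketch of such a Cloud Function, assuming newline-delimited JSON files and placeholder dataset/table names; it uses a batch load job (free) rather than the streaming API:

```python
# Sketch of a Cloud Function triggered on object creation in the GCS bucket
# (google.storage.object.finalize event). Dataset/table names are placeholders.
from google.cloud import bigquery

bq = bigquery.Client()

def gcs_to_bq(event, context):
    """Loads the newly created GCS object into BigQuery as a batch load job."""
    uri = f"gs://{event['bucket']}/{event['name']}"
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
        autodetect=True,
    )
    load_job = bq.load_table_from_uri(uri, "my_dataset.my_table", job_config=job_config)
    load_job.result()  # wait for completion so errors surface in the function logs
```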
Last but not least, backfilling should be done wisely, especially in BQ, where row-level updates or inserts are not performant and are an open door to duplication. What you should consider instead is BigQuery partitioning, which you can set on a column that contains a timestamp, or on a hidden one if your data contains none. Which timestamp? Well, the one set in the GCS folder name!
Once again, you can modify the data in your GCS bucket per day and replay the transfer into BQ.
But each transfer for a given day must overwrite the partition the considered data belongs to (e.g. the data under 20200914 would overwrite the associated partition in BQ). Doing so, we abide by the concept of a pure task, which guarantees idempotency and non-duplication.
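A sketch of that replay step, assuming a day-partitioned table and placeholder names: loading with WRITE_TRUNCATE into the partition decorator table$yyyymmdd replaces only that day's partition.

```python
# Sketch of replaying one day into BigQuery so it overwrites only that day's
# partition (the table must be day-partitioned); names and paths are placeholders.
from google.cloud import bigquery

bq = bigquery.Client()
day = "20200914"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    # WRITE_TRUNCATE on a partition decorator replaces only that partition,
    # which keeps the backfill idempotent and duplicate-free.
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

load_job = bq.load_table_from_uri(
    f"gs://my-gcs-datalake/transformed/{day}/*.parquet",
    f"my_dataset.events${day}",          # partition decorator: table$yyyymmdd
    job_config=job_config,
)
load_job.result()
```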
Please read this article to have more insights.
Note: if you intend to get rid of Redshift, you can choose to do it directly and forget about S3 as the data sink of your first ETL. Choose GCS directly (ingress is free) and migrate your existing Redshift data into GCS, using S3 as an intermediary service and the Google transfer service from S3 to GCS.
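For that S3 -> GCS migration step, a rough sketch with the google-cloud-storage-transfer client could look like this (the project, bucket names, and AWS credentials are all placeholders):

```python
# Sketch of a one-off S3 -> GCS copy with Storage Transfer Service
# (google-cloud-storage-transfer); project, buckets and AWS keys are placeholders.
from google.cloud import storage_transfer

client = storage_transfer.StorageTransferServiceClient()

transfer_job = {
    "project_id": "my-gcp-project",
    "description": "Migrate Redshift exports from S3 to GCS",
    "status": storage_transfer.TransferJob.Status.ENABLED,
    "transfer_spec": {
        "aws_s3_data_source": {
            "bucket_name": "my-redshift-export-bucket",
            "aws_access_key": {
                "access_key_id": "AKIA...",   # placeholder
                "secret_access_key": "...",   # placeholder
            },
        },
        "gcs_data_sink": {"bucket_name": "my-gcs-datalake"},
    },
}

# A job without a schedule is started manually.
job = client.create_transfer_job({"transfer_job": transfer_job})
client.run_transfer_job({"job_name": job.name, "project_id": "my-gcp-project"})
```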
I have some datasets in my own BigQuery account (organization A) that need to be transferred to another BigQuery account in organization B. How do I do that?
I am aware of the Data Transfer Service and the REST API, but those seem to transfer data across projects and regions within the same organization.
Thanks!
You can use the BigQuery Copy Datasets feature (a beta feature at the moment) to copy datasets across projects/organizations and across regions (not all regions are supported). Cross-organization copy works as long as you don't have VPC Service Controls set. You can use COPY DATASET or TRANSFERS in the BQ web UI, or use the CLI. Using Transfers allows running the copy on a recurring schedule.
Usage: bq mk --transfer_config --project_id=[PROJECT_ID] --data_source=[DATA_SOURCE] --target_dataset=[DATASET] --display_name=[NAME] --params='[PARAMETERS]'
Use --params to specify the source dataset and other options.
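If you prefer the Python client over the CLI, a sketch of the same cross-project dataset copy config could look like this (the project IDs, dataset names, and schedule are placeholders):

```python
# Sketch of a recurring dataset copy via the BigQuery Data Transfer Service
# Python client; all project/dataset names are placeholders.
from google.cloud import bigquery_datatransfer

client = bigquery_datatransfer.DataTransferServiceClient()

transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="dest_dataset",
    display_name="Nightly copy from org A",
    data_source_id="cross_region_copy",   # the dataset-copy data source
    params={
        "source_project_id": "source-project-in-org-a",
        "source_dataset_id": "source_dataset",
    },
    schedule="every 24 hours",
)

config = client.create_transfer_config(
    parent=client.common_project_path("dest-project-in-org-b"),
    transfer_config=transfer_config,
)
print(f"Created transfer config: {config.name}")
```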
I have created a BigQuery table out of the data I have in my Cloud Storage bucket.
In my use case, I am periodically sending data to the same bucket, which is the backend of my BigQuery table (while creating the BigQuery table I used that same bucket).
Is it possible to get the updated data into BigQuery, given that I am pushing new data into the same bucket at regular intervals?
Just to mention: I am creating a native BigQuery table from the dedicated storage bucket mentioned above.
Your help will be much appreciated. Thanks in advance.
You can create an external (federated) table on the Google Cloud Storage bucket. In that case, whenever you query this table you will get the latest data.
If you just need to append data to a table (let's call it the target table) based on data from the bucket, I can imagine the following process:
Create a federated table on the GCS bucket
Set up a simple cron job that runs a bq command which just does select * from [federated_table] and appends the results to the target table, as in the sketch below (you may have a more complicated query that checks for duplicates in the target table and only appends new data).
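A rough Python sketch of both steps, with placeholder project, dataset, table, and bucket names (the same logic can of course be run with bq commands from a cron job):

```python
# Sketch: define an external (federated) table over the bucket, then append
# whatever it currently "sees" into the native target table. Names are placeholders.
from google.cloud import bigquery

bq = bigquery.Client()

# One-time setup: an external table definition pointing at the bucket.
external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://my-source-bucket/*.csv"]
external_config.autodetect = True
ext_table = bigquery.Table("my_project.my_dataset.federated_table")
ext_table.external_data_configuration = external_config
bq.create_table(ext_table, exists_ok=True)

# The recurring step: append the external table's rows to the target table
# (add de-duplication logic to the SQL if needed).
job_config = bigquery.QueryJobConfig(
    destination="my_project.my_dataset.target_table",
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)
bq.query(
    "SELECT * FROM `my_project.my_dataset.federated_table`",
    job_config=job_config,
).result()
```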
Alternative option:
Set up a trigger on your bucket that invokes a Cloud Function, and in the Cloud Function just load the newly added data into the target table.