GCP moving Bigtable data to BigQuery - google-cloud-platform

In GCP I would like to know if it is possible to transfer/move data from Bigtable to BigQuery. For example, let's say I want all data older than one year to be moved from Bigtable to BigQuery. Is this doable?
Can someone please help me with this?

You can query the Bigtable data from BigQuery thanks to the external table configuration.
Because you are able to query the data from BigQuery, you can perform an INSERT ... SELECT into a BigQuery table.
EDIT 1
You can't do it automatically. You must write custom code to copy only the old data and then delete it from the source.
You need a timestamp field for that. To copy the data from Bigtable to BigQuery, you can use an external table, but you can't delete the data through the BigQuery external connection.
To purge the Bigtable data, you can use the garbage collection feature.
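As a minimal sketch of that INSERT ... SELECT, assuming an external table mydataset.bigtable_ext has already been defined over the Bigtable table and exposes a timestamp column named event_ts (both names are hypothetical):

-- Copy rows older than one year from the Bigtable external table
-- into a native BigQuery table (names are placeholders).
INSERT INTO mydataset.bigtable_archive
SELECT *
FROM mydataset.bigtable_ext
WHERE event_ts < TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 365 DAY);

The matching rows still have to be removed on the Bigtable side, e.g. via an age-based garbage collection policy as noted above.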

Related

Data Warehouse and connection to Power Bi on AWS Structure

I work for a startup where I have created several dashboards in Power BI using tables that are stored in an AWS RDS that I connect to using MySQL. To create additional columns for visualizations, I created views of the tables in MySQL and used DAX to add some extra columns for the Power BI visualizations.
However, my team now wants to use the AWS structure and build a data lake to store the raw data and a data warehouse to store the transformed data. After researching, I believe I should create the additional columns in either Athena or Redshift. My question is, which option is best for our needs?
I think the solution is to connect to the RDS using AWS Glue to perform the necessary transformations and deposit the transformed data in either Athena or Redshift. Then, we can connect to the chosen platform using Power BI. Please let me know if I am misunderstanding anything.
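For example, in Athena I imagine the extra columns would end up in a view, much like the MySQL views today; a minimal sketch with made-up table and column names:

-- Hypothetical Athena view adding a derived column,
-- analogous to the extra columns currently added via DAX / MySQL views.
CREATE OR REPLACE VIEW sales_enriched AS
SELECT
  order_id,
  order_date,
  amount,
  amount * 0.21 AS vat_amount
FROM raw_sales;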
To give an approximate idea of the number of records I'm handling, the fact tables gain about 10 thousand new records every month.
Thank you in advance!

how to export oracle DB table with complex CLOB data into bigquery through batch upload?

We are currently using Apache Sqoop once daily to export an Oracle DB table containing a CLOB column into HDFS. As part of this we first map the CLOB column to a Java String (using --map-column-java) and have the imported data saved in Parquet format. We have this scheduled as an Oozie workflow.
There is a plan to move from Apache Hive to BigQuery. I am not able to find a way to get this table into BigQuery and would like help on the best approach to get this done.
If we go with real-time streaming from the Oracle DB into BigQuery using Google Datastream, can you tell me whether the CLOB column will be streamed correctly? It contains some malformed XML data (close to an XML structure, but with some discrepancies in obeying it).
Another option I read about was to extract the table as a CSV file, transfer it to GCS, and have the BigQuery table refer to it there. But since the data in the CLOB column is very large and wild, with multiple commas and special characters in between, I think there will be issues with parsing or exporting. Are there any options to do it in Parquet or ORC formats?
The preferred approach is a scheduled batch upload performed daily from Oracle to BigQuery. I appreciate any inputs on how to achieve this.
We can convert CLOB data from an Oracle DB to a desired format such as ORC, Parquet, TSV or Avro files through Enterprise Flexter.
Also, you can refer to this on how to ingest on-premises Oracle data with Google Cloud Dataflow via JDBC, using the Hybrid Data Pipeline On-Premises Connector.
For your other query about moving from Apache Hive to BigQuery:
The fastest way to import to BigQuery is to use GCP resources. Dataflow is a scalable solution for reading and writing. Dataproc is another, more flexible option where you can use more open-source stacks to read from the Hive cluster.
You can also use this Dataflow template, which requires a connection to be established directly between the Dataflow workers and the Apache Hive nodes.
There is also a plugin for moving data from Hive into BigQuery which utilises GCS as temporary storage and uses the BigQuery Storage API to move the data to BigQuery.
You can also use Cloud SQL to migrate your Hive data to BigQuery.
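As one hedged illustration of the GCS-staging route, assuming the daily extract already lands Parquet files in a GCS bucket (bucket, dataset and table names below are made up), BigQuery can read them through a Parquet external table and copy them into a native table:

-- External table over the staged Parquet files (schema is inferred from Parquet).
CREATE EXTERNAL TABLE staging.oracle_export_ext
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://my-staging-bucket/oracle_export/*.parquet']
);

-- Append the daily batch into the target table; the CLOB column arrives as a STRING.
INSERT INTO warehouse.oracle_table
SELECT * FROM staging.oracle_export_ext;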

GCS to BigQuery table using stored Proc

I want to create a stored procedure which can read data from a GCS bucket and store it into a table in BigQuery. I was able to do it using Python by connecting to GCS and creating a BigQuery client.
from google.oauth2 import service_account
from google.cloud import bigquery

credentials = service_account.Credentials.from_service_account_file(path_to_key)
bq_client = bigquery.Client(credentials=credentials, project=project_id)
Can we achieve the same using a stored procedure?
Have you looked into using an external table? You can query directly off of Google Cloud Storage without needing to load anything. Just define the schema of the expected data along with the GCS URIs, and once the data is in GCS it is accessible via SQL in BigQuery. Otherwise, no: there is no LOAD statement that you can execute via BigQuery SQL. See the docs here for all of the ways to load data into a table. You could have the external table and create a stored procedure that does an INSERT into another table using the data from the external table you created, if you are really hell-bent on having a stored procedure to "load" data into a normal BigQuery table. Otherwise, external tables are an excellent option that obviates the need to even load the data in the first place.
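A minimal sketch of that combination, with hypothetical dataset, table and bucket names (and assuming CSV files in GCS):

-- External table defined over files sitting in GCS.
CREATE EXTERNAL TABLE mydataset.gcs_staging (
  id INT64,
  name STRING,
  created_at TIMESTAMP
)
OPTIONS (
  format = 'CSV',
  uris = ['gs://my-bucket/exports/*.csv'],
  skip_leading_rows = 1
);

-- Stored procedure that copies the external data into a normal table.
CREATE OR REPLACE PROCEDURE mydataset.load_from_gcs()
BEGIN
  INSERT INTO mydataset.target_table
  SELECT id, name, created_at FROM mydataset.gcs_staging;
END;

The procedure can then be invoked with CALL mydataset.load_from_gcs(); on whatever schedule you need.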

Optimize data load from Azure Cosmos DB to Power BI

Currently we have a problem with loading data when refreshing the report against the DB, since it has too many records and it takes forever to load everything. The question is how I can load only the data from the last year, to avoid taking so long to load everything. As far as I can see, the Cosmos DB connector box allows me to enter a SQL query, but I don't know how to write one for this type of non-relational database.
Power BI has an incremental refresh feature. You should be able to refresh the current year only.
If that still doesn’t meet expectations I would look at a preview feature called Azure Synapse Link which automatically pulls all Cosmos DB updates out into analytical storage you can query much faster in Azure Synapse Analytics in order to refresh Power BI faster.
Depending on the volume of the data you will hit a number of issues. The first is that you may exceed your RU limit, slowing down the extraction of the data from Cosmos DB. The second issue will be transforming the data from JSON into a structured format.
I would try to write a query to specify the fields and items that you need. That will reduce the time of processing and getting the data.
For SQL queries it will be something like:
SELECT * FROM c WHERE c.partitionEntity = 'guid'
For more information on the CosmosDB SQL API syntax please see here to get you started.
You can use the query window in Azure to run the SQL commands, or Azure Storage Explorer to test the query, then move it to Power BI.
What is highly recommended is to extract the data into a place where it can be transformed into a structured format such as a table or CSV file.
For example, use Azure Databricks to extract the data, then turn the JSON into a table-formatted object.
You have the option of running Databricks notebook queries against Cosmos DB, or using Azure Databricks in its own instance. Another option would be to use the change feed and an Azure Function to send and shred the data into Blob Storage, and then query it from there using Power BI, Databricks, Azure SQL Database, etc.
In the Source of your Query, you can make a select based on the CosmosDB _ts system property, like:
Query ="SELECT * FROM XYZ AS t WHERE t._ts > 1609455599"
In this case, 1609455599 is the Unix timestamp corresponding to 31.12.2020, 23:59:59, so only data from 2021 onward will be selected.
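If you prefer a rolling one-year window instead of a hard-coded cutoff, the Cosmos DB SQL API also has GetCurrentTimestamp(), which returns milliseconds since the Unix epoch, so a sketch of the same filter could be:

-- _ts is in seconds, GetCurrentTimestamp() is in milliseconds;
-- 31536000 seconds = 365 days.
SELECT * FROM c WHERE c._ts > (GetCurrentTimestamp() / 1000) - 31536000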

Bigquery, save clusters of clustered table to cloud storage

I have a bigquery table that's clustered by several columns, let's call them client_id and attribute_id.
What I'd like is to submit one job or command that exports that table data to cloud storage, but saves each cluster (so each combination of client_id and attribute_id) to its own object. So the final uri's might be something like this:
gs://my_bucket/{client_id}/{attribute_id}/object.avro
I know I could pull this off by iterating over all the possible combinations of client_id and attribute_id, using a client library to query the relevant data into a BigQuery temp table, and then exporting that data to a correctly named object, and I could do so asynchronously.
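For reference, one iteration of that per-combination approach could be a single EXPORT DATA statement (bucket, dataset and values below are placeholders; the URI must contain a * wildcard):

EXPORT DATA OPTIONS (
  uri = 'gs://my_bucket/client_123/attr_456/object-*.avro',
  format = 'AVRO',
  overwrite = true
) AS
SELECT *
FROM mydataset.my_clustered_table
WHERE client_id = 'client_123' AND attribute_id = 'attr_456';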
But.... I imagine all the clustered data is already stored in a format somewhat like what I'm describing, and I'd love to avoid the unnecessary cost and headache of writing the script to do it myself.
Is there a way to accomplish this already without requesting a new feature to be added?
Thanks!