Migrate data from Hive to BigQuery - google-cloud-platform

I need to migrate 70 TB of data (2,400 tables) from on-premises Hive to BigQuery. The initial plan is to export ORC files from Hive to Cloud Storage and then load them into BigQuery tables.
What is a better way of achieving this through automation or another GCP service?

I would suggest leveraging data pipelines for this purpose.
Here's a reference on how to use them - https://cloud.google.com/architecture/dw2bq/dw-bq-data-pipelines#what-is-a-data-pipeline
You can also explore the different ways to transfer your on-premises data to BigQuery here - https://cloud.google.com/architecture/dw2bq/dw-bq-migration-overview
And please note that BigQuery can load ORC files directly from Cloud Storage (alongside Avro, Parquet, CSV, and JSON), so your ORC exports do not need to be converted to another format before loading.
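As a rough sketch of what loading one table could look like once the ORC files are staged in Cloud Storage (the dataset, table, and bucket names below are hypothetical), using BigQuery's LOAD DATA SQL statement:

-- Hypothetical names; adjust the dataset, table, and bucket to your environment.
LOAD DATA INTO hive_migration.orders
FROM FILES (
  format = 'ORC',
  uris = ['gs://my-migration-bucket/hive/orders/*.orc']
);

For 2,400 tables you would typically generate one such statement (or the equivalent bq load command) per table from the Hive metastore listing and run them in batches.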

Related

Export data from Google Cloud SQL, and append to BigQuery

We have production databases (postgresql and mysql) on Cloud SQL.
How could I export the data from the production databases and then append it to BigQuery datasets?
I DO NOT want to sync or replicate the data into BigQuery because we purge (after backing up) the production databases on regular basis.
The only method I could think of is:
1. Export to CSV and drop the files into Google Cloud Storage.
2. Run a Python script to append the data into BigQuery.
Are there any other more optimal ways?
BigQuery supports external data sources, specifically federated queries which allow you to read data directly from a Cloud SQL instance.
You can use this feature to select from all the relevant tables in your Postgres/MySQL instances and copy them into BigQuery without any extra ETL process. You can append the data to your existing tables, create a new table every time, or use some other organization that works for you.
BigQuery also supports scheduled queries so you can automate this.
The actual SQL will depend on your data sources, but with a Cloud SQL connection configured it's not much more than...
INSERT INTO `your_project.your_dataset.your_bq_table`
SELECT *
FROM EXTERNAL_QUERY(
  'your_project.us.your_cloudsql_connection',
  'SELECT * FROM tablename;'
);

AWS Athena - UPDATE table rows using SQL

I am a newbie to the AWS ecosystem. I am creating an application which queries data using AWS Athena. The data is transformed from JSON into Parquet using AWS Glue and stored in S3.
Now the use case is to update that Parquet data using SQL.
Can we update the underlying Parquet data using an AWS Athena SQL command?
No, it is not possible to use UPDATE in Amazon Athena.
Amazon Athena is a query engine, not a database. It performs queries on data that is stored in Amazon S3. It reads those files, but it does not modify or update those files. Therefore, it cannot 'update' a table.
The closest capability is using CREATE TABLE AS to create a new table. You can provide a SELECT query that uses data from other tables, so you could effectively modify information and store it in a new table, and tell it to use Parquet for that new table. In fact, this is an excellent way to convert data from other formats into Snappy-compressed Parquet files (with partitioning, if you wish).
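As a minimal sketch of that approach (the table, column, and bucket names are invented for illustration), an Athena CTAS statement rewriting JSON-backed data as partitioned, Snappy-compressed Parquet might look like:

-- Hypothetical source/target names and S3 path.
CREATE TABLE sales_parquet
WITH (
  format = 'PARQUET',
  parquet_compression = 'SNAPPY',
  external_location = 's3://my-bucket/sales_parquet/',
  partitioned_by = ARRAY['sale_date']   -- partition columns must come last in the SELECT
) AS
SELECT item_id, amount, sale_date
FROM sales_json;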
Depending on how your data is stored, you can update it using SQL UPDATE statements. See Updating Iceberg table data and Using governed tables.
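For example, on an Athena Iceberg table (table and column names here are hypothetical) an in-place change is a plain UPDATE statement:

-- Works only on tables created with 'table_type' = 'ICEBERG'.
UPDATE customer_events
SET status = 'inactive'
WHERE last_seen < DATE '2023-01-01';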

Amazon S3 to Amazon Athena to Tableau

I am working on a project to get data from an Amazon S3 bucket into Tableau.
The data needs to be reorganised and combined from multiple .CSV files. Is Amazon Athena capable of connecting the S3 data to Tableau directly, and is it relatively easy/cheap? Or should I instead look at another software package to achieve this?
I am looking to visualise the data and provide a forecast based on observed trends (I may need to incorporate functions to generate data to fit a linear regression).
It appears that Tableau can query data from Amazon Athena.
See: Connect to your S3 data with the Amazon Athena connector in Tableau 10.3 | Tableau Software
Amazon Athena can query multiple CSV files in a given path (directory) and run SQL against the data. So, it sounds like this is a feasible solution for you.
Yes, you can integrate Athena with Tableau to query your data in S3. There are plenty of resources online that describe how to do that, e.g. link 1, link 2, link 3. But obviously, the tables that define the metadata of your data have to be defined beforehand.
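As a sketch of that metadata step (the columns and S3 location are made up for illustration), a table over a folder of similarly shaped CSV files can be declared with DDL like:

-- Hypothetical columns and bucket path; the table only describes the files, it does not copy them.
CREATE EXTERNAL TABLE sales_raw (
  order_id   string,
  order_date string,
  region     string,
  amount     double
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/sales/'
TBLPROPERTIES ('skip.header.line.count' = '1');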
Amazon Athena pricing is based on the amount of data scanned by each query, i.e. $5 per TB of data scanned. So it all comes down to how much data you have and how it is structured, i.e. partitioning, bucketing, file format, etc. Here is a nice blog post that covers these aspects.
While you prototype a dashboard there is one thing to keep in mind. By default, each time you change the list of parameters, filters, etc., Tableau automatically sends a request to AWS Athena to execute your query. Luckily, you can disable automatic querying of the data source and refresh it manually.

When to use Amazon Redshift spectrum over AWS Glue ETL to query on Amazon S3 data

As an AWS Glue ETL job can be a Python script, it can be used to perform SQL queries through database interfaces, and data can be loaded from Amazon S3 into a DynamicFrame. I am trying to understand when it is advantageous to use Amazon Redshift Spectrum to query S3 data.
AWS Glue is used for gathering metadata (crawling) and for ETL. It is not for reporting or analytics. It can apply highly complex transformations (ideal for complex ETL requirements).
Redshift Spectrum is primarily used to produce reports and analysis against data stored in S3, usually combined with data stored in Redshift. However, it CAN also be used for simple ETL, and it is much simpler to set up and use than Glue if that is all you need.
There is one other option that you do not mention: Amazon Athena. It is a great tool to run queries directly against S3 data. It is similar to Redshift Spectrum but usually faster and cheaper, depending on your use case. It cannot, however, combine S3 data with Redshift data.
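To make the Redshift Spectrum option above concrete (the IAM role ARN, Glue database, and table names are placeholders), exposing S3 data to Redshift and joining it with a local table looks roughly like:

-- One-time setup: map a Glue Data Catalog database into Redshift as an external schema.
CREATE EXTERNAL SCHEMA spectrum
FROM DATA CATALOG
DATABASE 'my_glue_database'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Query S3-backed tables alongside regular Redshift tables.
SELECT c.customer_id, SUM(s.amount) AS total_spend
FROM spectrum.sales_s3 AS s
JOIN customers AS c ON c.customer_id = s.customer_id
GROUP BY c.customer_id;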

Confusion about Redshift datasets (structured, unstructured, semi-structured) and the format to be used

Can anyone explain clearly what kind of data Redshift can handle (structured, unstructured, or any other format)?
How can I copy CloudFront logs into Amazon Redshift, even though the logs are unstructured, without going through Amazon EMR?
How can I find the size of a database created in Amazon Redshift?
Could someone please explain all three questions above clearly? An example, sample code, or a reference source would be even better; it would be very helpful for my project.
Amazon Redshift provides a standard SQL interface (based on PostgreSQL). Therefore, it is best suited for structured data that is stored in Tables, Rows and Columns.
It is also possible to store JSON records within a field and access them via JSON functions.
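For instance (the table and field names are hypothetical), a JSON string stored in a VARCHAR column can be picked apart with Redshift's JSON functions:

-- event_payload is a VARCHAR column holding a JSON document such as {"user": {"id": "42"}}.
SELECT json_extract_path_text(event_payload, 'user', 'id') AS user_id
FROM raw_events;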
To load data into Amazon Redshift, it needs to be in a delimited file format, such as comma delimited, tab delimited, fixed-length fields or JSON format. Any data that is not in a suitable format will need to be pre-processed and converted to a suitable format. This could be done with tools such as Amazon Athena (Presto) or Amazon EMR (Hadoop).
Amazon CloudFront logs are in tab-delimited format and can be loaded directly into Amazon Redshift. For an example, see: Analyzing S3 and CloudFront Access Logs with AWS Redshift
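As a hedged sketch (the bucket, table, and IAM role are placeholders), a COPY for gzipped, tab-delimited CloudFront access logs could look like:

-- IGNOREHEADER 2 skips the #Version and #Fields comment rows at the top of each log file.
COPY cloudfront_logs
FROM 's3://my-log-bucket/cf-logs/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
DELIMITER '\t'
IGNOREHEADER 2
GZIP
MAXERROR 10;

The target table has to exist first, with one column per field in the CloudFront log format.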
Information about disk space consumed by tables can be obtained via the SVV_DISKUSAGE system view.
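For example, since each row in SVV_DISKUSAGE represents one allocated 1 MB disk block, a rough per-table size (and, summed, the overall database size) can be obtained with:

-- Visible only to superusers; counts allocated 1 MB blocks per table.
SELECT name AS table_name,
       COUNT(*) AS mb_used
FROM svv_diskusage
GROUP BY name
ORDER BY mb_used DESC;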