We are trying to replace Alteryx with GCP Dataflow for an ETL job. The requirement is to read data from PostgreSQL, join it, apply some formulas and a group-by to add the missing columns, and then write the result back to a PostgreSQL table for Qlik to consume and visualize.
I am new to Java. Can anyone point me to sample code for a similar use case? That would be really helpful. Thank you.
Related
I have a Cloud SQL instance with hundreds of databases, one for each customer. Each database has the same tables in it, but data only for the specific customer.
What I want to do is transform the data in various ways to get an overview table covering all of the customers. Unfortunately, I cannot find a tool that can iterate over all the databases in a Cloud SQL instance, execute queries, and then write the data to BigQuery.
I was really hoping that Dataflow would be the solution, but from what I have tried and found online, I cannot make it work. Since I have already spent a lot of time investigating Dataflow, I thought it best to ask here.
Currently I am looking at Data Fusion, Datastream, and Apache Airflow.
Any suggestions?
Why doesn't Dataflow fit your needs? You could run a query to find out the databases and tables, and then iteratively build the pipeline's JdbcIO sources and PCollections based on those results.
Beam has a Flatten transform that can merge those PCollections into a single one.
What you are trying to do is one of the use cases Dataflow Flex Templates were created for (dynamic DAG creation within Dataflow itself), but it can be pulled off without Flex Templates as well.
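A rough sketch of that idea with the Beam Python SDK, in case it helps. It assumes the Cloud SQL instance is PostgreSQL, and the host, the `orders` table name, the `customer_db` column, and the output table are all made up for illustration. `ReadFromJdbc` is a cross-language transform, so it needs Java available when the pipeline is built. The list of databases would come from a query run before the pipeline is constructed.

```python
from typing import Dict, List


def jdbc_url(host: str, database: str, port: int = 5432) -> str:
    """Build a PostgreSQL JDBC URL for one customer database."""
    return f"jdbc:postgresql://{host}:{port}/{database}"


def source_configs(host: str, databases: List[str], query: str) -> List[Dict[str, str]]:
    """Return one JDBC source configuration per customer database."""
    return [
        {"database": db, "jdbc_url": jdbc_url(host, db), "query": query}
        for db in databases
    ]


def build_pipeline(pipeline, configs, username, password, output_table):
    """Attach one JDBC read per database, flatten them, and write to BigQuery."""
    # Imported lazily so the helpers above work without Beam installed.
    import apache_beam as beam
    from apache_beam.io.jdbc import ReadFromJdbc

    per_database = []
    for cfg in configs:
        db = cfg["database"]
        rows = (
            pipeline
            | f"Read {db}" >> ReadFromJdbc(
                table_name="orders",  # hypothetical table name
                driver_class_name="org.postgresql.Driver",
                jdbc_url=cfg["jdbc_url"],
                username=username,
                password=password,
                query=cfg["query"],
            )
            # Tag each row with its source database. Assumes the rows are
            # NamedTuple-like, which is how schema'd JDBC rows usually arrive.
            | f"Tag {db}" >> beam.Map(
                lambda row, db=db: {**row._asdict(), "customer_db": db}
            )
        )
        per_database.append(rows)

    return (
        tuple(per_database)
        | "Flatten" >> beam.Flatten()  # merge all per-database PCollections
        | "Write" >> beam.io.WriteToBigQuery(
            output_table,  # assumed to exist already
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
        )
    )
```

With hundreds of databases this produces hundreds of read steps in one job graph, so it may be worth splitting the list into a few batches and running one job per batch.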
Airflow can be used for this sort of thing (essentially, you're doing the same task over and over, so with an appropriate operator and a for-loop you can certainly generate a DAG with hundreds of near-identical tasks that export each of your databases).
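For illustration, here is roughly what that for-loop could look like. This is only a sketch: the DAG id, bucket layout, and select query are hypothetical, and it uses `CloudSQLExportInstanceOperator` from the Google provider package as one possible "appropriate operator". The exports are chained one after another on the assumption that the instance should not run many export operations at once.

```python
from typing import Any, Dict, List


def export_body(database: str, bucket: str, select_query: str) -> Dict[str, Any]:
    """Request body for a Cloud SQL CSV export of one database to GCS."""
    return {
        "exportContext": {
            "fileType": "CSV",
            "uri": f"gs://{bucket}/exports/{database}.csv",  # hypothetical layout
            "databases": [database],
            "csvExportOptions": {"selectQuery": select_query},
        }
    }


def build_dag(databases: List[str], instance: str, bucket: str, select_query: str):
    """Create a DAG with one near-identical export task per customer database."""
    # Imported lazily so export_body can be used without Airflow installed.
    import datetime

    from airflow import DAG
    from airflow.providers.google.cloud.operators.cloud_sql import (
        CloudSQLExportInstanceOperator,
    )

    with DAG(
        dag_id="export_customer_databases",  # hypothetical name
        start_date=datetime.datetime(2024, 1, 1),
        schedule=None,  # Airflow 2.4+; older versions use schedule_interval
        catchup=False,
    ) as dag:
        previous = None
        for db in databases:
            task = CloudSQLExportInstanceOperator(
                task_id=f"export_{db}",
                instance=instance,
                body=export_body(db, bucket, select_query),
            )
            # Run the exports sequentially instead of all at once.
            if previous is not None:
                previous >> task
            previous = task
    return dag
```

In a DAG file you would call `dag = build_dag([...], ...)` at module level so the scheduler can discover it, and then load the exported CSV files into BigQuery in a follow-up task.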
However, I'd be remiss not to ask: should you?
There may be a really excellent reason why you've created hundreds of databases in one instance, rather than one database with a customer field on each table. But if the reason is security, a row-level security policy could add an additional element of safety without putting you in this difficult situation. And adding an index over the customer field would let you retrieve the appropriate sub-table swiftly (in return for a small speed cost when inserting new rows), so performance doesn't seem like a reason to do this either.
Given that it would then be pretty straightforward to get your data into BigQuery I would be moving heaven and earth to switch over to this setup, if I were you!
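To make the single-database setup a bit more concrete, here is a small sketch of the statements involved. It assumes PostgreSQL, an integer `customer_id` column, and a custom session setting called `app.customer_id`; all three are my own placeholders, not something from the question.

```python
from typing import List


def consolidation_statements(table: str, customer_column: str = "customer_id") -> List[str]:
    """PostgreSQL statements that index and protect a shared multi-customer table.

    The table and column names are interpolated directly, so they must be
    trusted constants, never user input.
    """
    return [
        # Index so that one customer's rows can be fetched quickly.
        f"CREATE INDEX {table}_{customer_column}_idx ON {table} ({customer_column})",
        # Turn on row-level security for the table.
        f"ALTER TABLE {table} ENABLE ROW LEVEL SECURITY",
        # Only expose rows whose customer matches the session setting
        # (app.customer_id is a hypothetical custom setting).
        f"CREATE POLICY customer_isolation ON {table} "
        f"USING ({customer_column} = current_setting('app.customer_id')::integer)",
    ]


def apply_statements(cursor, statements: List[str]) -> None:
    """Execute each statement on a DB-API cursor (for example from psycopg2)."""
    for statement in statements:
        cursor.execute(statement)
```

Each application session would then run something like `SET app.customer_id = '42'` before querying. Keep in mind that the table owner bypasses policies by default, so the application should connect as a different role.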
I wish to perform sentiment analysis using the Google Natural Language API.
I found documentation that performs sentiment analysis directly on a file located in Cloud Storage: https://cloud.google.com/natural-language/docs/analyzing-sentiment#language-sentiment-string-python.
However, the data I am working on is located in BigQuery instead. How do I read the data directly from a BigQuery table to run the sentiment analysis?
An example of the BigQuery table schema:
I wish to do NLP on the tweet column of the table.
I searched for documentation on this but could not find anything.
I would appreciate any help or references. Thank you.
You can take a look at BigQuery Remote Functions, which provide a direct integration with Cloud Functions and Cloud Run. The columns returned by a BigQuery SQL query can be passed to a remote function, which then runs whatever custom code you need. Please note that Remote Functions are still in preview and might not be suitable for production systems.
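As a sketch of what the Cloud Function side could look like: BigQuery sends a JSON body with a `calls` array (one entry per row, each holding that row's arguments) and expects a JSON object with a `replies` array of the same length. The function names below are hypothetical, and it assumes the function is called with the tweet text as its only argument.

```python
import json
from typing import Any, Callable, Dict, Optional


def build_replies(request_json: Dict[str, Any],
                  score_text: Callable[[str], float]) -> str:
    """Turn a BigQuery remote-function request into the JSON reply body."""
    replies = []
    for call in request_json["calls"]:
        text: Optional[str] = call[0]  # first (and only) argument: the tweet
        # Keep NULL inputs as NULL outputs so replies stay aligned with calls.
        replies.append(None if text is None else score_text(text))
    return json.dumps({"replies": replies})


def score_with_natural_language_api(text: str) -> float:
    """Return the document sentiment score from the Natural Language API."""
    # Imported lazily so build_replies can be used without the client library.
    from google.cloud import language_v1

    client = language_v1.LanguageServiceClient()
    document = language_v1.Document(
        content=text, type_=language_v1.Document.Type.PLAIN_TEXT
    )
    response = client.analyze_sentiment(request={"document": document})
    return response.document_sentiment.score


def tweet_sentiment(request):
    """HTTP entry point; `request` is the Flask request Cloud Functions passes in."""
    body = build_replies(request.get_json(), score_with_natural_language_api)
    return body, 200, {"Content-Type": "application/json"}
```

This makes one API call per row and creates a new client each time, which is fine for a trial but worth optimizing before running it over a large table.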
This should be fairly straightforward to do with Dataflow - you could write a pipeline that reads from BigQuery followed by a DoFn that uses Google's NLP Libraries, and then writes the results to BigQuery.
Some wrappers are already provided for you in https://github.com/apache/beam/blob/master/sdks/python/apache_beam/ml/gcp/naturallanguageml.py
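A minimal sketch of such a pipeline, using a plain DoFn with the `google-cloud-language` client rather than the wrappers linked above. The `id` and `tweet` column names, the output column names, and the output schema are assumptions, since the real table schema is not shown.

```python
from typing import Any, Dict


def sentiment_row(row: Dict[str, Any], score: float, magnitude: float) -> Dict[str, Any]:
    """Copy the input row and add the two sentiment values as new columns."""
    return {**row, "sentiment_score": score, "sentiment_magnitude": magnitude}


def build_sentiment_pipeline(pipeline, query: str, output_table: str):
    """Read tweets from BigQuery, score them, and write the results back."""
    # Imported lazily so sentiment_row can be used without Beam installed.
    import apache_beam as beam
    from google.cloud import language_v1

    # Defined inline to keep the sketch short; for a real Dataflow job it is
    # usually cleaner to put the DoFn at module level in an installable package.
    class AnalyzeSentiment(beam.DoFn):
        def setup(self):
            # One client per worker instance, not one per element.
            self.client = language_v1.LanguageServiceClient()

        def process(self, row):
            document = language_v1.Document(
                content=row["tweet"],  # assumed column name
                type_=language_v1.Document.Type.PLAIN_TEXT,
            )
            response = self.client.analyze_sentiment(request={"document": document})
            sentiment = response.document_sentiment
            yield sentiment_row(row, sentiment.score, sentiment.magnitude)

    return (
        pipeline
        | "Read" >> beam.io.ReadFromBigQuery(query=query, use_standard_sql=True)
        | "Analyze" >> beam.ParDo(AnalyzeSentiment())
        | "Write" >> beam.io.WriteToBigQuery(
            output_table,
            # Hypothetical schema; it must match the columns the query selects.
            schema="id:STRING,tweet:STRING,"
                   "sentiment_score:FLOAT,sentiment_magnitude:FLOAT",
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
        )
    )
```

You would call it inside `with beam.Pipeline(options=...) as p:` with a query such as `SELECT id, tweet FROM ...` pointing at your own table.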
Which concepts and topics do I need to learn in order to work on a BigQuery DWH project? Along with BigQuery, which programming languages (like Python) do I need to get acquainted with or become expert in? I am currently working as a data engineer with SSIS, Informatica, and Power BI skills, and strong SQL. Please give your valuable suggestions.
Thanks,
Ven.
BigQuery has an SQL interface so if you don't already know SQL, learn it.
See the query reference.
Also, you can interact with BigQuery from Bash, using the bq CLI that ships as a Google Cloud component alongside the gcloud CLI, or from Python, Go, Java, Node.js... (choose your favorite).
Actually, unless you are planning a long-term project or want to become a BigQuery expert, you don't need the more complex concepts. In case you want to know more about them, I link a pretty interesting blog.
To sum up:
Learn SQL
Take into account that BigQuery is optimized for reading and analysis; it is not a typical transactional database, so avoid write-heavy workloads
Most common languages have a BigQuery client, so you don't need to learn a new language.
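To illustrate the client-library point, here is a minimal Python sketch using the `google-cloud-bigquery` package. The helper name `run_query` is my own, and it assumes default credentials are configured when no client is passed in.

```python
from typing import Any, Dict, List


def run_query(sql: str, client=None) -> List[Dict[str, Any]]:
    """Run a SQL query in BigQuery and return the rows as plain dictionaries."""
    if client is None:
        # Imported lazily; uses the project and credentials from the environment.
        from google.cloud import bigquery

        client = bigquery.Client()
    # query() starts the job and result() waits for it to finish.
    return [dict(row) for row in client.query(sql).result()]
```

For example, `run_query("SELECT 1 AS n")` should return a list with a single row, so most of the real work stays in the SQL.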
I am trying to build a data warehouse using Redshift in AWS. I want to preprocess Salesforce data by moving it to RDS or S3 (using them as a staging area) before finally moving it to Redshift. I am trying to find out the different ways I can replicate Salesforce data into S3/RDS for this purpose. I have seen many third-party tools that can do this, but I am looking for something that can be built in-house. I would like to use this data for dimensional modeling.
Thanks for your help!
I am stuck at one point. We have 700 SAS scripts containing ETL logic that we have to migrate to Informatica. I understand the logic and it is quite easy to implement, but my client does not want it done manually and has asked us to find a way to automate it. By automation he means some way for the ETL code to be read automatically and the mappings to be designed automatically as well. My question is: is that possible? If yes, please share some insight; if not, is there an alternative approach?