I am stuck at one point. We have 700 SAS scripts in which some ETL logic is written, and we have to migrate them to Informatica. I understand the logic and it is quite easy to implement, but my client says not to do it manually and to find some way to automate it. By automation he means implementing some way by which the ETL code is read automatically and the mapping is then designed automatically. Now my question is: is that possible? If yes, please share some insight; if not, is there any alternative way?
Related
We are trying to replace Alteryx with GCP Dataflow for an ETL job. The requirement is to get data from a PostgreSQL table, join, add some formulas and a group-by to add missing columns, and then write back to a PostgreSQL table for Qlik to consume and generate visualizations.
I am new to Java. Can anyone point me to sample code for a similar use case? That would be really helpful. Thank you.
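A minimal Apache Beam (Java) sketch of that shape (read from PostgreSQL with JdbcIO, aggregate per key, write back) might look like the following. The connection details, query, and column names are placeholders, not from the original post.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.DoubleCoder;
import org.apache.beam.sdk.coders.KvCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.values.KV;

public class PostgresAggregatePipeline {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Placeholder connection details.
    JdbcIO.DataSourceConfiguration db = JdbcIO.DataSourceConfiguration
        .create("org.postgresql.Driver", "jdbc:postgresql://HOST:5432/mydb")
        .withUsername("user")
        .withPassword("password");

    p.apply("ReadOrders", JdbcIO.<KV<String, Double>>read()
            .withDataSourceConfiguration(db)
            // Joins and derived columns can also be pushed into this SQL.
            .withQuery("SELECT customer_id, amount FROM orders")
            .withRowMapper((JdbcIO.RowMapper<KV<String, Double>>) rs ->
                KV.of(rs.getString("customer_id"), rs.getDouble("amount")))
            .withCoder(KvCoder.of(StringUtf8Coder.of(), DoubleCoder.of())))
        // The group-by step: aggregate the amounts per customer.
        .apply("SumPerCustomer", Sum.doublesPerKey())
        // Write the aggregated rows back to PostgreSQL for Qlik to consume.
        .apply("WriteTotals", JdbcIO.<KV<String, Double>>write()
            .withDataSourceConfiguration(db)
            .withStatement("INSERT INTO customer_totals (customer_id, total) VALUES (?, ?)")
            .withPreparedStatementSetter((kv, stmt) -> {
              stmt.setString(1, kv.getKey());
              stmt.setDouble(2, kv.getValue());
            }));

    p.run().waitUntilFinish();
  }
}
```

This assumes beam-sdks-java-io-jdbc and the PostgreSQL JDBC driver are on the classpath; the same pipeline runs locally with the DirectRunner or on Dataflow with --runner=DataflowRunner.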
I have a Cloud SQL instance with hundreds of databases, one for each customer. Each database has the same tables in it, but data only for the specific customer.
What I want to do is transform the data in various ways so as to get an overview table covering all of the customers. Unfortunately, I cannot seem to find a tool that can iterate over all the databases a Cloud SQL instance has, execute queries, and then write that data to BigQuery.
I was really hoping that Dataflow would be the solution, but as far as I have tried and searched online, I cannot find a way to make it work. Since I have already spent a lot of time investigating Dataflow, I thought it might be best to ask here.
Currently I am looking at Data Fusion, Datastream, Apache Airflow.
Any suggestions?
Why doesn't Dataflow fit your needs? You could run a query to find out the databases/tables, and then iteratively build the pipeline's JdbcIO sources/PCollections based on those results.
Beam has a Flatten transform that can merge multiple PCollections into one.
What you are trying to do is one of the use cases Dataflow Flex Templates were created for (to allow dynamic DAG creation within Dataflow itself), but it can be pulled off without Flex Templates as well.
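As a rough sketch of that approach (not a definitive solution), the Java pipeline below builds one JdbcIO read per database, flattens them into a single PCollection, and writes it to BigQuery. The database list, connection string, query, and table names are all assumed placeholders.

```java
import java.util.ArrayList;
import java.util.List;
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Flatten;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;

public class AllCustomersToBigQuery {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // In practice this list would come from a metadata query against the instance.
    List<String> databases = List.of("customer_a", "customer_b", "customer_c");

    // One JdbcIO read per customer database.
    List<PCollection<TableRow>> perDb = new ArrayList<>();
    for (String db : databases) {
      perDb.add(p.apply("Read_" + db, JdbcIO.<TableRow>read()
          .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
                  "org.postgresql.Driver", "jdbc:postgresql://HOST:5432/" + db)
              .withUsername("user").withPassword("password"))
          .withQuery("SELECT id, name FROM accounts")
          .withRowMapper((JdbcIO.RowMapper<TableRow>) rs -> new TableRow()
              .set("customer_db", db)
              .set("id", rs.getLong("id"))
              .set("name", rs.getString("name")))
          .withCoder(TableRowJsonCoder.of())));
    }

    // Flatten the per-database PCollections into one and load it into BigQuery.
    PCollectionList.of(perDb)
        .apply(Flatten.pCollections())
        .apply(BigQueryIO.writeTableRows()
            .to("my-project:reporting.all_customers")
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

    p.run().waitUntilFinish();
  }
}
```

Note that the database list is resolved when the pipeline graph is constructed, which is exactly where Flex Templates help if that list has to be discovered at launch time rather than at build time.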
Airflow can be used for this sort of thing (essentially, you're doing the same task over and over, so with an appropriate operator and a for-loop you can certainly generate a DAG with hundreds of near-identical tasks that export each of your databases).
However, I'd be remiss not to ask: should you?
There may be an excellent reason why you've created hundreds of databases in one instance rather than one database with a customer field on each table. But if security is the concern, a row-level security policy could add that extra element of safety without putting you in this difficult situation, and an index on the customer field would let you retrieve the appropriate sub-table swiftly (in return for a small speed cost when inserting new rows), so performance doesn't seem like a reason to do it either.
Given that it would then be pretty straightforward to get your data into BigQuery, I would be moving heaven and earth to switch over to this setup if I were you!
I have one generic question; actually, I am hunting for a solution to a problem.
Currently we are generating the reports directly from the Oracle database. From a performance perspective, we now want to migrate the data from Oracle to some AWS service that could perform better, and then feed data from that AWS service to our reporting software.
Could you please advise which service would be ideal for this?
Thanks,
Vishwajeet
To answer well, additional info is needed:
How much data is needed to generate a report?
Are there any transformed/computed values needed?
What is good performance? 1 second? 30 seconds?
What is the current query time on Oracle, and what kind of query is it? Joins, aggregations, etc.
I wanted to know which concepts/topics I need to learn in order to work on a BigQuery DWH project. Along with BigQuery, what other programming languages do I need to get acquainted with or build expertise in (like Python)? I am currently working as a data engineer with SSIS, Informatica, and Power BI skills and strong SQL. Please give your valuable suggestions.
Thanks,
Ven.
BigQuery has an SQL interface so if you don't already know SQL, learn it.
See the query reference.
Also, you can interact with BigQuery from Bash with the bq CLI (provided as a Google Cloud component of the gcloud CLI), or with Python, Go, Java, Node.js... (choose your favorite).
Actually, if you are not planning a long-term project or aiming to become a BigQuery expert, the more complex concepts are not needed. In case you want to know more about them, I have linked a pretty interesting blog.
To sum up:
Learn SQL
Take into account that BigQuery is optimized for reading and analysis; it is not an ordinary transactional database (do not overdo writes)
Most common languages have a BigQuery client, so you don't need to learn any new language.
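As an illustration, a query through the Java client library takes only a few lines. The query below runs against a public sample dataset and is just an example, not something from the thread.

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FieldValueList;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableResult;

public class BigQueryHello {
  public static void main(String[] args) throws InterruptedException {
    // Uses Application Default Credentials and the default project.
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

    QueryJobConfiguration query = QueryJobConfiguration.newBuilder(
            "SELECT name, SUM(number) AS total "
                + "FROM `bigquery-public-data.usa_names.usa_1910_2013` "
                + "GROUP BY name ORDER BY total DESC LIMIT 10")
        .build();

    // Run the query and print the result rows.
    TableResult result = bigquery.query(query);
    for (FieldValueList row : result.iterateAll()) {
      System.out.println(row.get("name").getStringValue() + ": "
          + row.get("total").getLongValue());
    }
  }
}
```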
I am a newbie in ETL. I have a requirement to migrate IBM DataStage to Informatica.
a. Please specify the best tools for the above scenario.
b. Please help me with the steps required before doing the migration.
c. What are the drawbacks of this scenario?
There is no single solution that works for your problem, as migrating from one ETL tool to another is a nightmare. You have to consider a lot of points. Let me list a few:
Understand the current ETL process
Collect the list of source and target systems and their types (flat file, RDBMS, etc.)
Read the DataStage mappings and prepare a mapping specification (transformation logic, lookup logic, join conditions, expressions)
Understand the DataStage architecture: number of nodes, high availability, failover, etc.
List the DB objects used, such as stored procedures, functions, etc.
Once you understand these, you can think about how to implement it in Informatica.