Informatica performance issues with Snowflake target

I'm facing a performance issue with Informatica (HotFix 1) when loading a large volume of data, around 250 million rows, from Oracle sources into a Snowflake target. We have two versions of the mapping: one loads into Teradata and has no performance problem (3-4 hours); the other loads into Snowflake, and for that one the throughput (rows/sec) starts out fine at around 15K but drops to 2K, so the load takes far too long (more than 23 hours).
The mapping is a simple one: source (Oracle) --> Source Qualifier (simple SQL) --> Expression (no transformation logic) --> target (Snowflake).
I don't have any warning messages in the logs (there was a "waiting for buffer increase" message, which I already fixed).
Have you faced the same issue, or do you have any idea about it?
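Not an Informatica setting as such, but when throughput against a Snowflake target decays like this it is worth confirming, outside the ETL tool, whether the target can sustain bulk ingest at all. Below is a minimal sketch using the Snowflake Python connector; the account, warehouse, database and table names are placeholders, not taken from the question. It stages a file extract and loads it with a single COPY INTO, Snowflake's bulk path, instead of row-by-row inserts.

```python
# Rough bulk-load throughput check against the Snowflake target, outside Informatica.
# All connection details and object names below are placeholders.
import time
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",   # placeholder
    user="etl_user",        # placeholder
    password="***",
    warehouse="LOAD_WH",    # placeholder
    database="STAGE_DB",    # placeholder
    schema="PUBLIC",
)
cur = conn.cursor()

start = time.time()
# Stage a local extract (e.g. CSV files produced from the Oracle source) ...
cur.execute("PUT file:///tmp/extract_*.csv @%TARGET_TABLE AUTO_COMPRESS=TRUE")
# ... then load it in one bulk COPY instead of row-by-row inserts.
cur.execute(
    "COPY INTO TARGET_TABLE FROM @%TARGET_TABLE "
    "FILE_FORMAT=(TYPE=CSV FIELD_OPTIONALLY_ENCLOSED_BY='\"')"
)
print(f"bulk load took {time.time() - start:.1f}s")

cur.close()
conn.close()
```

If the COPY path sustains the expected throughput, the slowdown is more likely on the Informatica/ODBC writer side (e.g. row-wise inserts or commit interval) than in Snowflake itself.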

Related

Dataflow Teradata JDBC connection property string

I have nearly 3,310,000 rows of data in Teradata. When I want to transfer them into BigQuery via Dataflow, it takes nearly 20 minutes, and that is too long for the whole process.
To decrease this processing time, I have already tried to adjust prefetch and fetchsize, but I got an error: Invalid connection parameter name. What can I do in this situation? Thanks in advance,
Best Regards
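The "Invalid connection parameter name" error usually means the option was passed as a property the Teradata JDBC driver does not recognise; the Teradata driver takes its tuning options as comma-separated name=value pairs inside the JDBC URL itself. A minimal sketch for illustration, using jaydebeapi rather than Dataflow's JdbcIO; the host, database, credentials and jar paths are placeholders, and the exact parameter names accepted (e.g. TYPE=FASTEXPORT) depend on your driver version, so check the Teradata JDBC reference:

```python
# Illustrative only: Teradata JDBC tuning options go into the URL,
# not into a separate properties map. Host/db/credentials are placeholders.
import jaydebeapi

url = (
    "jdbc:teradata://td-host.example.com/"
    "DATABASE=my_db,CHARSET=UTF8,TYPE=FASTEXPORT"  # TYPE=FASTEXPORT speeds up large extracts
)

conn = jaydebeapi.connect(
    "com.teradata.jdbc.TeraDriver",             # Teradata JDBC driver class
    url,
    ["my_user", "my_password"],                 # placeholders
    jars=["terajdbc4.jar", "tdgssconfig.jar"],  # tdgssconfig only needed for older drivers
)

cur = conn.cursor()
cur.execute("SELECT COUNT(*) FROM my_db.my_table")  # placeholder query
print(cur.fetchall())
cur.close()
conn.close()
```

The JDBC fetch size itself is normally set on the statement (setFetchSize) rather than in the connection string, which may be why prefetch/fetchsize were rejected as connection parameters.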

Timeout value in Power BI service

I have a dataset that I have not been able to refresh on the Power BI service for 3 days, even though on Desktop it normally refreshes in 30 minutes. The dataset is fed from a SQL Server database via a data gateway; the gateway is up to date. Incremental refresh is enabled, and the dataset retrieves only the last 3 days of data on each refresh.
Here is the generated error message:
Data source error: The XML for Analysis request timed out before it was completed. Timeout value: 17998 sec.
Cluster URI: WABI-WEST-EUROPE-B-PRIMARY-redirect.analysis.windows.net
Activity ID: 680ec1d7-edea-4d7c-b87e-859ad2dee192
Application ID: fcde3b0f-874b-9321-6ee4-e506b0782dac
Time: 2020-12-24 19:03:30Z
What is the solution to this problem, please?
Thank you
What license are you using?
Without Premium capacity, the maximum dataset size is 1 GB. Maybe your dataset size has crossed this mark? If you are using shared capacity, you can check the workspace's utilized storage by clicking the ellipsis at the top right corner, then clicking Storage to see how much is used for that workspace.
Also note that in shared capacity there is a 10 GB uncompressed dataset limit at the gateway (but this should not be an issue, as you have only 3 days of data in the refresh).
Also check whether your Power Query query is foldable (on the final step you should be able to see the 'show native query' option). If it is not, incremental refresh does not work and each refresh ends up querying the entire data.
Also, note that 17998 sec is about 5 hours. What is your internet speed?

Use case for Dataflow (small SQL queries)

We're using Cloud Functions to transform our data in BigQuery:
- all the data is in BigQuery
- to transform the data, we only use SQL queries in BigQuery
- each query runs once a day
- our biggest SQL query runs for about 2 to 3 minutes, but most queries run for less than 30 seconds
- we have about 50 queries executed once a day, and this number is increasing
We tried at first to do the same thing (SQL queries in BigQuery) with Dataflow, but:
- it took about 10 to 15 minutes just to start Dataflow
- it is more complicated to code than our Cloud Functions
- at that time, Dataflow SQL was not implemented
Every time we talk with someone using GCP (users, trainers or auditors), they recommend using Dataflow.
So did we miss something "magic" with Dataflow, in our use case? Is there a way to make it start in seconds and not in minutes?
Also, if we use streaming in Dataflow, how are costs calculated? I understand that in batch we pay for what we use, but what if we use streaming? Is it counted as a full-time running service?
Thanks for your help
For the first part, BigQuery vs. Dataflow, I discussed this with Google weeks ago and their advice is clear:
When you can express your transformation in SQL, and you can reach your data with BigQuery (external tables), it's always quicker and cheaper with BigQuery, even if the query is complex.
For all the other use cases, Dataflow is the recommended option:
- for real time (a true need for real time, with metrics computed on the fly with windowing);
- when you need to reach an external API (ML, an external service, ...);
- when you need to sink into something other than BigQuery (Firestore, Bigtable, Cloud SQL, ...) or read from a source not reachable by BigQuery.
And yes, Dataflow takes about 3 minutes to start and another 3 minutes to stop. It's long... and you pay for that idle time.
For batch, as for streaming, you simply pay for the number (and the size) of the Compute Engine instances used by your pipeline. Dataflow scales automatically within the boundaries that you provide. Streaming pipelines don't scale to 0: if you have no messages in Pub/Sub, you still have at least 1 VM up and you pay for it.
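For reference, the pattern described in the question (a scheduled Cloud Function that just runs a SQL transformation in BigQuery) only needs the BigQuery client library, and with on-demand pricing you pay for the bytes the query scans rather than for any VM time. A minimal sketch; the dataset, table and function names are placeholders, not taken from the question:

```python
# Minimal "run a SQL transform in BigQuery" job, e.g. triggered daily by
# Cloud Scheduler through a Cloud Function. Dataset/table names are placeholders.
from google.cloud import bigquery

def run_daily_transform(event=None, context=None):
    client = bigquery.Client()

    sql = """
        CREATE OR REPLACE TABLE analytics.daily_summary AS
        SELECT customer_id, DATE(event_ts) AS day, COUNT(*) AS events
        FROM raw.events
        WHERE DATE(event_ts) = CURRENT_DATE()
        GROUP BY customer_id, day
    """

    job = client.query(sql)   # starts the query job
    job.result()              # waits for completion, raises on error
    print(f"Job {job.job_id} done, {job.total_bytes_processed} bytes processed")
```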

AWS Athena - Query over large external table generated from Glue crawler?

I have a large set of historical log files on AWS S3 that add up to billions of lines.
I used a Glue crawler with a Grok deserializer to generate an external table in Athena, but querying it has proven to be unfeasible.
My queries have timed out and I am trying to find another way of handling this data.
From what I understand, with Athena, external tables are not actual database tables but rather representations of the data in the files, and queries run over the files themselves, not over database tables.
How can I turn this large dataset into a query-friendly structure?
Edit 1: For clarification, I am not interested in reshaping the log files going forward; those are taken care of. Rather, I want a way to work with the current file base I have on S3. I need to query these old logs, and in their current state that is impossible.
I am looking for a way either to convert these files into an optimal format or to take advantage of the current external table to make my queries feasible.
Right now, by default of the crawler, the external table is only partitioned by day and instance. My Grok pattern explodes the formatted logs into a few more columns that I would love to repartition on, if possible, which I believe would make my queries easier to run.
Your WHERE condition should be on partitions (at least one condition). By raising a support ticket, you may be able to increase the Athena timeout. Alternatively, you may use Redshift Spectrum.
But you should seriously think about optimizing the query. The Athena query timeout is 30 minutes, which means your query ran for 30 minutes before it timed out.
By default Athena times out after 30 minutes. This timeout period can be increased by raising a support ticket with the AWS team. However, you should first optimize your data and query, as 30 minutes is enough time to execute most queries.
Here are a few tips for optimizing the data that will give a major boost to Athena performance (a conversion sketch follows the article link below):
- Use columnar formats like ORC/Parquet with compression to store your data.
- Partition your data. In your case you can partition your logs by year -> month -> day.
- Create fewer, larger files per partition instead of many small files.
The following AWS article gives detailed information on performance tuning in Amazon Athena:
Top 10 Performance Tuning Tips for Amazon Athena
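One common way to apply the first two tips to an existing external table is an Athena CTAS query, which rewrites the data as compressed, partitioned Parquet in a single statement. A rough sketch using boto3; the database, table, column and bucket names are placeholders, not taken from the question:

```python
# Convert the crawler-generated table into partitioned Parquet via an Athena CTAS.
# Database/table/column/bucket names below are placeholders.
import boto3

athena = boto3.client("athena", region_name="us-east-1")

ctas = """
CREATE TABLE logs_parquet
WITH (
    format = 'PARQUET',
    parquet_compression = 'SNAPPY',
    external_location = 's3://my-bucket/logs-parquet/',
    partitioned_by = ARRAY['year', 'month', 'day']
)
AS
SELECT message, level, instance, year, month, day   -- partition columns must come last
FROM raw_grok_logs
"""

resp = athena.start_query_execution(
    QueryString=ctas,
    QueryExecutionContext={"Database": "logs_db"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
print("Started CTAS:", resp["QueryExecutionId"])
```

Note that a single CTAS or INSERT INTO writes a limited number of partitions (100 at the time of writing), so a very large history may need to be converted in date-range chunks.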

Improve the response time of AWS DynamoDB

We have been using Couchbase for about two years, but we finally decided to switch to the Amazon DynamoDB service for many reasons.
I have now started the migration of data to DynamoDB. At first everything was fine and going as expected, but after some time the response time from DynamoDB got higher and higher and the migration process kept getting slower.
I tried to change my strategies, but with no luck.
What can I do to improve the response time?
Basically I am scanning a SQL table, getting 100 items per query, then asking Couchbase to retrieve the data I want about these 100 items. At first I was getting high response times (as shown in the image below).
The following info might help:
- I am running the migration code on an EC2 micro instance running Ubuntu 14.04 with Node v4.4.1.
- After looking at the graphs I started measuring the time for each DynamoDB request (so I don't know what the average was at first); the average response time is 800 ms over about 150,000 requests (get & put only, no batch commands or queries).
- I am storing the items in two tables, one with an integer hash key and the other with integer hash and sort keys.
- The second table is a huge one (about 4.4 million items and counting).
Thanks to @Vor for mentioning hot partition keys. I did some further reading and, following the Guidelines for Working with Tables, I applied what they recommend; distributing my requests into batch requests also made the difference.
Thank you all for your help.
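For anyone hitting the same wall: the two changes mentioned above (spreading the load across partition keys and batching the writes) map directly onto DynamoDB's BatchWriteItem. A small sketch using boto3 rather than the Node.js SDK from the question; the table and attribute names are placeholders:

```python
# Batched writes to DynamoDB instead of one PutItem per row.
# Table and attribute names are placeholders.
import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("migrated_items")

def write_chunk(items):
    """Write a chunk of items; batch_writer groups them into BatchWriteItem
    requests (25 items each) and resends any unprocessed items."""
    with table.batch_writer() as batch:
        for item in items:
            batch.put_item(Item={
                "id": item["id"],           # hash key: should be well distributed
                "created_at": item["ts"],   # sort key (as in the second table above)
                "payload": item["payload"],
            })

# Usage: feed it the 100-item pages read from the source.
# write_chunk(page_of_100_items)
```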