Which one is better performance wise in Informatica Powercenter? Use sorter transformation or add number of sorted ports on source qualifier? - informatica

I have a mapping in Informatica Powercenter which combines data from two sources. One source has around 22 million rows of data while the other have >389 million rows of data. Will it be better performance-wise if I add Sorter transformation or is it better to add number of sorted ports in the Source Qualifier?
Also, what factors that makes one way better than the other(in case of sorter transformation vs adding number of sorted ports in SQ)?

If both tables are from same DB, without a doubt - sort in the SQ using number of sorted ports.
Informatica sorter brings whole data into infa server and then sort it. So, sorting 300M resultant data is going to take lot of time + resource.
Now, joining 389 M and 22M table in source and sort the result in source itself will take less time and resource. Informatica doesnt have to bring any data into its server.
Now, if they are from different data bases, then, sorting them in source qualifier will give perf boost while joining. You have to join them using joiner to get whole data set. And i think data order will be same if your sort key is same as join key and you do not have to sort again using sorter. Issue is joining both will take time.

Related

Aws RedShift sampling

For a data quality check I need to collect data in a specific interval.
Some tables are huge in size.
Is there any hack to do this without affecting the performance?
Like select 100 rows randomly.
How random do you need? The classic way to do this is with "WHERE RANDOM() < .001". If you need it to give you a repeatable "random" set then you can add a seed. The issue is that your tables are huge and this means reading (scanning) every row from disk just to throw most of them away and since table scan can take a significant time this isn't what you want to do.
So you may want to take advantage of Redshift "limited table scan" capabilities as part of your "random" sampling. (The fastest data to read from disk is the data you don't read from disk.) The issue here is that this solution will depend on your table sort keys and ordering which will push the solution into even "more pseudo" random territory (less of a true random sampling). In many cases this isn't a big deal but if the statistics really matter then this may not work for you.
This is done by sampling "blocks", not rows, based on the sort key(s). This sampling of blocks can be done randomly and each block of data will represent about 250K rows (based on sort key data type, compression etc. and COULD range anywhere from <100K rows to 2M rows). Doing this process will take a little inspection of STV_BLOCKLIST. The storage quanta for Redshift is the 1MB block and each and every block's metadata in the system can be referenced in STV_BLOCKLIST. This system table contains min and max values for each block. First find all the blocks for the sort key for the table in question. Next pick a random sample of these blocks (and if you are still dealing with a lot of data make sure that this sampling picks an even number from across all the slices to avoid execution skew).
Now the trick is to translate these min a max metadata values into a WHERE clause the performs the desired sampling. These min and max values are BIGINTs and are hashed from the data in the sort key column. This hash is data type dependent. If the data type is BIGINT then the has is quite simple - if the data type is timestamp then it is a bit more complex. But the ordering will be preserved across the hashing function for the data type involved. Reverse engineering this hash isn't hard - just perform a few experiments - but I can help if you tell me the type involved as I've done this for just about every data type at this point.
You can even do a random sampling rows on top of this random sampling of blocks. Or if you want you can just pick some narrow ranges of the sort key value and then randomly sample row and avoid all this reverse engineering business. The idea is to use Redshift "reduced scan" capability to greatly reduce the amount of data read from disk. To do this you need to be metadata aware in your choice of sampling windows which often means a sort key where clause. This is all about understanding how the database engine works and using its capabilities to your advantage.
I understand that this answer is based on some unstated information so please reach out in a comment if something isn't clear.

Redshift Query Performance to reduce CPU utilisation

I want to take a general Idea of how I can optimise the query performance in redshift Database, I have Huge queries with lots of joins , I do understand using sort and Dist key it can be achieved but is there a method which we can follow in order to get some optimal results.
What to look in a table and how to approach query optimisation in redshift?
What are the necessary steps to look for or approach in order to have a certain plan for optimisation?
Any guidance will help a lot
Having improved many queries on Redshift there are a few things I can point you towards. First let me list a few tools / techniques to make sure you have these in your toolbox.
Ability to read and EXPLAIN plan and find expected costly points
Know where to find the query "actual" execution report
Know the system tables to find join, distribution, and disk io reports
So with those understood let's look at where many queries go sideways on Redshift. I will try to list these out in pareto order but any of these, or combos, can create significant issue.
#1 - Fat in the middle queries. When joining it is possible to expand the number of rows being operated upon many fold. Cross joining is a clear way this can happen but isn't how this usually happens. If the join on conditions create a many to many join pattern the number of rows can expand. When the table sizes are very large and the "multiplication" can make absurd data sizes. The explain plan can show this but not always - use of DISTINCT and GROUP BY can "hide" the true size of the dataset in play. Performing a SELECT COUNT(*) on your join tree can help show how big this is. You may also may need to look a pieces of the join tree if a later join is collapsing the rows (failure of the query optimizer?). Redshift is a columnar database and not well set up for the creation of data - this includes during the execution of query.
#2 - Distribution of large amounts of data. Redshift is a cluster and the node are connected together by ethernet cables and these connections are the slowest part of the cluster. A lot of work is done by the query optimizer to minimize the amount of data that needs to move around the network. However, it doesn't know your data as well as you do and doesn't always do this well. Look at the type of joins you are getting - is distribution needed? how much data is being distributed? Also, group by (and window functions) need to combine rows and therefore may need redistribution to complete. How big are the data sets entering your aggregation steps?
Moving a lot of data around the network will be slow. The difficulty is that it isn't always clear how to reduce this movement. Large join trees like you say you have can do "odd" things when it comes to the resulting distribution of the "joined" data. Joins are performed one at a time and the order these happen can matter. The query optimizer is making a number of decisions about the order of joins and how to organize the resulting data from each join. The choices it makes is based on what it sees in the table metadata so completeness of metadata matters. WHERE conditions can also impact the optimizer's choices. There are just way to many interactions to itemize them out here. Best advice is to look at the performance per step and see if data distribution is a factor. Then work to control how data is distributed in the query's execution. This may mean changing the join trees or even decomposing the query into several with temp table that have distribution set so that data movement is minimized.
#3 Excessive IO traffic - While not as slow as the networks, the disk IO subsystem is often a bottleneck. This shows up in a few ways. Are you reading more data from disk than is needed? (Metadata up to date?) Do you need a redundant WHERE clause to eliminate data? (Redundant WHERE clause is one that isn't needed functionally but is added so Redshift can perform the metadata comparisons that will reduce data read at scan.) Data spill is another way that disk IO can be strained (this goes back to #1). If data needs to spill to disk it can bring the disk IO performance down considerably. Use your metadata and Where clauses well.
Now these 3 areas often team up to kill your performance. Read too many rows from your tables, join all these extra rows together across the network while also making many new rows. This data doesn't fit in memory so now Redshift needs to spill to disk to complete the query. Things slow down real fast in these conditions.
Lastly these factors I've listed are cluster wide "resources" of Redshift. If one query take up a lot of one of these then there is less for other queries running at the same time. What often happens is that the query writers on a cluster follow similar patterns (good or bad) and when their pattern is costly on one axis then many of their queries are costly on the same axis. This shows up as queries that work "ok" when run in isolation but very badly when others are using the cluster. This generally means that many queries are contributing to pushing the cluster "over the edge" on some limited resource. There are system tables that you can look at to see aggregated IO or network traffic to see these effects.
Good queries are:
Don't make a lot of new "rows" during execution (not fat in the middle)
Keep large data sets "on node" and only redistribute data once the data has been pared down significantly
Don't read more data from disk than is necessary and don't spill
The problem is that doing all of these isn't always possible the trick is to not over subscribe the cluster resources you have.

Joining Solution using Co-Group by SideInput Apache Beam

I have 2 Tables to Join, its a Left Join. Below is the two Condition, how my pipeline is working.
The job is running in batch mode and its all User data and we want to process in Google Dataflow.
Day 1:
Table A: 5000000 Records. (Size 3TB)
Table B: 200 Records. (Size 1GB)
Both Tables Joined through SideInput where TableB Data was Taken as SideInput and it was working fine.
Day 2:
Table A: 5000010 Records. (Size 3.001TB)
Table B: 20000 Records. (Size 100GB)
On second day my pipeline is slowing down because SideInput uses cache and my cache size got exhausted, because of size of TableB got Increased.
So I tried Using Co-Group by, but Day 1 data processing was pretty slow with a Log: Having 10000 plus values on Single Key.
So is there any better performant way to perform the Joining when Hotkey get introduced.
It is true that the performance can drop precipitously once table B no longer fits into cache, and there aren't many good solutions. The slowdown in using CoGroupByKey is not solely due to having many values on a single key, but also the fact that you're now shuffling (aka grouping) Table A at all (which was avoided when using a side input).
Depending on the distribution of your keys, one possible mitigation could be to process your hot keys into a path that does the side-input joining as before, and your long-tail keys into a GoGBK. This could be done by producing a truncated TableB' as a side input, and your ParDo would attempt to look up the key emitting to one PCollection if it was found in TableB' and another if it was not [1]. One would then pass this second PCollection to a CoGroupByKey with all of TableB, and flatten the results.
[1] https://beam.apache.org/documentation/programming-guide/#additional-outputs

How would I merge related records in apache beam / dataflow, based on hundreds of rules?

I have data I have to join at the record level. For example data about users is coming in from different source systems but there is not a common primary key or user identifier
Example Data
Source System 1:
{userid = 123, first_name="John", last_name="Smith", many other columns...}
Source System 2:
{userid = EFCBA-09DA0, fname="J.", lname="Smith", many other columns...}
There are about 100 rules I can use to compare one record to another
to see if customer in source system 1 is the same as source system 2.
Some rules may be able to infer record values and add data to a master record about a customer.
Because some rules may infer/add data to any particular record, the rules must be re-applied again when a record changes.
We have millions of records per day we'd have to unify
Apache Beam / Dataflow implementation
Apache beam DAG is by definition acyclic but I could just republish the data through pubsub to the same DAG to make it a cyclic algorithm.
I could create a PCollection of hashmaps that continuously do a self join against all other elements but this seems it's probably an inefficient method
Immutability of a PCollection is a problem if I want to be constantly modifying things as it goes through the rules. This sounds like it would be more efficient with Flink Gelly or Spark GraphX
Is there any way you may know in dataflow to process such a problem efficiently?
Other thoughts
Prolog: I tried running on subset of this data with a subset of the rules but swi-prolog did not seem scalable, and I could not figure out how I would continuously emit the results to other processes.
JDrools/Jess/Rete: Forward chaining would be perfect for the inference and efficient partial application, but this algorithm is more about applying many many rules to individual records, rather than inferring record information from possibly related records.
Graph database: Something like neo4j or datomic would be nice since joins are at the record level rather than row/column scans, but I don't know if it's possible in beam to do something similar
BigQuery or Spanner: Brute forcing these rules in SQL and doing full table scans per record is really slow. It would be much preferred to keep the graph of all records in memory and compute in-memory. We could also try to concat all columns and run multiple compare and update across all columns
Or maybe there's a more standard way to solving these class of problems.
It is hard to say what solution works best for you from what I can read so far. I would try to split the problem further and try to tackle different aspects separately.
From what I understand, the goal is to combine together the matching records that represent the same thing in different sources:
records come from a number of sources:
it is logically the same data but formatted differently;
there are rules to tell if the records represent the same entity:
collection of rules is static;
So, the logic probably roughly goes like:
read a record;
try to find existing matching records;
if matching record found:
update it with new data;
otherwise save the record for future matching;
repeat;
To me this looks very high level and there's probably no single 'correct' solution at this level of detail.
I would probably try to approach this by first understanding it in more detail (maybe you already do), few thoughts:
what are the properties of the data?
are there patterns? E.g. when one system publishes something, do you expect something else from other systems?
what are the requirements in general?
latency, consistency, availability, etc;
how data is read from the sources?
can all the systems publish the records in batches in files, submit them into PubSub, does your solution need to poll them, etc?
can the data be read in parallel or is it a single stream?
then the main question of how can you efficiently match a record in general will probably look different under different assumptions and requirements as well. For example I would think about:
can you fit all data in memory;
are your rules dynamic. Do they change at all, what happens when they do;
can you split the data into categories that can be stored separately and matched efficiently, e.g. if you know you can try to match some things by id field, some other things by hash of something, etc;
do you need to match against all of historical/existing data?
can you have some quick elimination logic to not do expensive checks?
what is the output of the solution? What are the requirements for the output?

Best way to store mappings in a database

Suppose I have an employees table(with around a million employees) and a tasks table(with a few hundred tasks).
Now, I have a mechanism to predict how probable(percentage) an employee is to complete the task -- let's say I have four such mechanisms, and each of the mechanism outputs it's own probability.
Putting it all together, I now have n1(employees) times n2(tasks) times n3(mechanisms) results to store.
I was wondering what would be the best way to store these results.
I have a few options and thoughts:
Maintain a column(JSONField) in either of employees or tasks tables -- Concern: Have to update the whole column data if one of the values changes
Maintaining a third table predictions with foreign keys to employee and task with a column to store the predicted_probability -- Concern: Will have to store n1 * n2 * n3 records, I'm worried about scalability and performance
Thanks for any help.
PS: I'm using Django with postgres
The predictions table is the correct way to go. Depending on how you access the data, the size of the table won't matter. e.g. I would expect that reading the prediction for a single employee has a pretty constant performance. Large tables tend to be a problem only when you need to process all (or a large fraction) of the rows. If you hit a performance problem once you test this, you could e.g. partition that table by task or by task and mechanism (depending on how your queries are structured)
-Credits to #a_horse_with_no_name