SSIS Process Design Issue: Workaround for Variable Sharing in Concurrent Executions? - concurrency

Let met start by saying that I am in the preliminary stages of a large ETL project - ~80ish data sources, with 100s of millions of rows being transformed and loaded to a (finicky) ERP system. As such, I'm open to large process design solutions, along with potential solutions to this issue of variable sharing in concurrent Sequence Container execution.
In my example, I am logging three row counting variables to an SSMS ErrorLogging table. My intention is to use these three variables to help narrow down the 'stage' of the ETL process where data loss is/could be occurring. I am currently building this out in a very small test case. Below are the three variables and an image of the SSIS test case package.
varRowsSource: Control Flow task that counts of rows from the source data
varRowsTransformation: Data Flow task that counts the number of rows after data flow transformations
varRowsDestination: Control Flow task that counts the number of rows that wrote to the destination table -> select count(*).
SSIS TestCase package
In the image above, you'll see the range of Sequence Containers that are being used. These containers run concurrently.
I've highlighted the Control Flow tasks that overwrite each others RowsSource and RowsDestination variables entries, as well as circled the Data Flows that add rows to the variable.
It can be seen how, with only one data source, this requires 3 variables per container. We expect 30-50 data sources per package. I'm hoping to avoid having to create hundreds of variables.
SSMS ErrorLog table
Above is a snippet of the ErrorLog table in SSMS, should that help visualize how the variables are adding to, or overwriting each other. There are 4417 rows in the source data for Sequence Container CustomerInfoAddress and 4429 rows in the source data for Sequence Container CustomerTaxInfo.
In essence, I'm looking for a workaround for the issue of variable sharing in SSIS during concurrent processes and I would like to avoid creating hundreds of variables. I've consulted as many resources as I've been able to find about this issue, (including the below) without luck. I'm hoping someone here may have some insight into a process design workaround or any other type of solution.
social.msdn.microsoft.com/Forums/sqlserver/en-US/bed0d082-6c30-47a5-aaa7-0e132927bfe2/row-count-concurrency?forum=sqlintegrationservices

Related

Redshift Query Performance to reduce CPU utilisation

I want to take a general Idea of how I can optimise the query performance in redshift Database, I have Huge queries with lots of joins , I do understand using sort and Dist key it can be achieved but is there a method which we can follow in order to get some optimal results.
What to look in a table and how to approach query optimisation in redshift?
What are the necessary steps to look for or approach in order to have a certain plan for optimisation?
Any guidance will help a lot
Having improved many queries on Redshift there are a few things I can point you towards. First let me list a few tools / techniques to make sure you have these in your toolbox.
Ability to read and EXPLAIN plan and find expected costly points
Know where to find the query "actual" execution report
Know the system tables to find join, distribution, and disk io reports
So with those understood let's look at where many queries go sideways on Redshift. I will try to list these out in pareto order but any of these, or combos, can create significant issue.
#1 - Fat in the middle queries. When joining it is possible to expand the number of rows being operated upon many fold. Cross joining is a clear way this can happen but isn't how this usually happens. If the join on conditions create a many to many join pattern the number of rows can expand. When the table sizes are very large and the "multiplication" can make absurd data sizes. The explain plan can show this but not always - use of DISTINCT and GROUP BY can "hide" the true size of the dataset in play. Performing a SELECT COUNT(*) on your join tree can help show how big this is. You may also may need to look a pieces of the join tree if a later join is collapsing the rows (failure of the query optimizer?). Redshift is a columnar database and not well set up for the creation of data - this includes during the execution of query.
#2 - Distribution of large amounts of data. Redshift is a cluster and the node are connected together by ethernet cables and these connections are the slowest part of the cluster. A lot of work is done by the query optimizer to minimize the amount of data that needs to move around the network. However, it doesn't know your data as well as you do and doesn't always do this well. Look at the type of joins you are getting - is distribution needed? how much data is being distributed? Also, group by (and window functions) need to combine rows and therefore may need redistribution to complete. How big are the data sets entering your aggregation steps?
Moving a lot of data around the network will be slow. The difficulty is that it isn't always clear how to reduce this movement. Large join trees like you say you have can do "odd" things when it comes to the resulting distribution of the "joined" data. Joins are performed one at a time and the order these happen can matter. The query optimizer is making a number of decisions about the order of joins and how to organize the resulting data from each join. The choices it makes is based on what it sees in the table metadata so completeness of metadata matters. WHERE conditions can also impact the optimizer's choices. There are just way to many interactions to itemize them out here. Best advice is to look at the performance per step and see if data distribution is a factor. Then work to control how data is distributed in the query's execution. This may mean changing the join trees or even decomposing the query into several with temp table that have distribution set so that data movement is minimized.
#3 Excessive IO traffic - While not as slow as the networks, the disk IO subsystem is often a bottleneck. This shows up in a few ways. Are you reading more data from disk than is needed? (Metadata up to date?) Do you need a redundant WHERE clause to eliminate data? (Redundant WHERE clause is one that isn't needed functionally but is added so Redshift can perform the metadata comparisons that will reduce data read at scan.) Data spill is another way that disk IO can be strained (this goes back to #1). If data needs to spill to disk it can bring the disk IO performance down considerably. Use your metadata and Where clauses well.
Now these 3 areas often team up to kill your performance. Read too many rows from your tables, join all these extra rows together across the network while also making many new rows. This data doesn't fit in memory so now Redshift needs to spill to disk to complete the query. Things slow down real fast in these conditions.
Lastly these factors I've listed are cluster wide "resources" of Redshift. If one query take up a lot of one of these then there is less for other queries running at the same time. What often happens is that the query writers on a cluster follow similar patterns (good or bad) and when their pattern is costly on one axis then many of their queries are costly on the same axis. This shows up as queries that work "ok" when run in isolation but very badly when others are using the cluster. This generally means that many queries are contributing to pushing the cluster "over the edge" on some limited resource. There are system tables that you can look at to see aggregated IO or network traffic to see these effects.
Good queries are:
Don't make a lot of new "rows" during execution (not fat in the middle)
Keep large data sets "on node" and only redistribute data once the data has been pared down significantly
Don't read more data from disk than is necessary and don't spill
The problem is that doing all of these isn't always possible the trick is to not over subscribe the cluster resources you have.

Performant way to handle arrays in Athena/Quicksight

I currently have a large set of json data that I'd like to import into Amazon Athena for visualization in Amazon Quicksight. In each json, there are two fields: one is a comma separated string of ids (orderlist), and the other field is an array of strings(locations). Because Quicksight doesn't support array searching, I'm currently resorting to creating a view where I generate crossjoins across the two string arrays:
select id,
try_CAST(orderid AS bigint) orderid_targeting,
location
from advertising_json
CROSS JOIN UNNEST(split(orderlist, ',')) as x(orderid)
CROSS JOIN UNNEST(locations) t (location)
With two cross joins, this can explode out the data to 20x-30x the original size.
If I were working on individual queries on Athena, I could use Presto array functions to search through the arrays. Is there a better way to make these fields accessible for filtering on Quicksight?
You have two options: keep doing what you're doing or implement an ETL workflow where you periodically materialise the view, for example using CTAS. The latter has the added benefit that you can produce Parquet files, which could help speed up your queries.
On the other hand it's not as simple as it sounds. If you're in luck you can use INSERT INTO to transform partitions from your current table into an optimised table after a point in time when they will not change – but in my experience most of the time your most recent data gets updated during some window of time, but you still want to be able to query it during that window. In that situation the ETL process becomes much more complicated since you need to remove data from the optimised table to avoid ending up with duplicate data. It's not hard, it's just a lot of code and juggling S3 and Glue Data Catalog operations so that you never have tables that have duplicate data nor too little data.
Unless you feel like your current setup with the view is too slow, don't go implementing something big and complicated. Remember that you pay for bytes scanned in Athena, not the amount of time Athena spends crunching your query. You get quite a lot of compute power running your queries and in my experience there's rarely any point in micro-optimisation of queries, the gains you make are orders of magnitude lower than minimising the amount of data you process, either through clever partitioning or moving to columnar file formats. Most of the time the gains from small optimisations are not measurable because the error bars caused by Athena's query queue and waiting for S3 operations. You may get your query to run 50ms faster, but sometimes it gets queued for 500ms, and spends another 2000ms doing list operations on S3 so how can you tell?
If you decide to go down the materialisation route, first do it once using CTAS and run your QuickSight visualisation against the results. Don't implement the whole ETL workflow before you've checked that you get something that is significantly more performant.
If all you are worried about is that it's less performant to apply filters after the unnesting of your arrays than using array functions, write the two versions of the query and benchmark them against each other. I suspect array functions are going to be slightly faster – but for the same reasons I mentioned above, the gains may drown in the error bars caused by Athena's queuing and other operations.
Make sure to benchmark at different points during the day, and be especially conscious of the fact that top-of-the-hour behaviour in Athena is extremely different from other times (run queries at 10:00 and then at 10:10 – your total execution times will be very different because everyone's cron jobs run at the top of the hour).

How to use Apache beam to process Historic Time series data?

I have the Apache Beam model to process multiple time series in real time. Deployed on GCP DataFlow, it combines multiple time series into windows, and calculates the aggregate etc.
I now need to perform the same operations over historic data (the same (multiple) time series data) stretching all the way back to 2017. How can I achieve this using Apache beam?
I understand that I need to use the windowing property of Apache Beam to calculate the aggregates etc, but it should accept data from 2 years back onwards
Effectively, I need data as would have been available had I deployed the same pipeline 2 years. This is needed for testing/model training purposes
That sounds like a perfect use case of Beam's focus on event-time processing. You can run the pipeline against any legacy data and get correct results as long as events have timestamps. Without additional context I think you will need to have an explicit step in your pipeline to assign custom timestamps (from 2017) that you will need to extract from the data. To do this you can probably use either:
context.outputWithTimestamp() in your DoFn;
WithTimestamps PTransform;
You might need to have to configure allowed timestamp skew if you have the timestamp ordering issues.
See:
outputWithTimestamp example: https://github.com/apache/beam/blob/efcb20abd98da3b88579e0ace920c1c798fc959e/sdks/java/core/src/test/java/org/apache/beam/sdk/transforms/windowing/WindowingTest.java#L248
documentation for WithTimestamps: https://beam.apache.org/releases/javadoc/2.13.0/org/apache/beam/sdk/transforms/WithTimestamps.html#of-org.apache.beam.sdk.transforms.SerializableFunction-
similar question: Assigning to GenericRecord the timestamp from inner object
another question that may have helpful details: reading files and folders in order with apache beam

How would I merge related records in apache beam / dataflow, based on hundreds of rules?

I have data I have to join at the record level. For example data about users is coming in from different source systems but there is not a common primary key or user identifier
Example Data
Source System 1:
{userid = 123, first_name="John", last_name="Smith", many other columns...}
Source System 2:
{userid = EFCBA-09DA0, fname="J.", lname="Smith", many other columns...}
There are about 100 rules I can use to compare one record to another
to see if customer in source system 1 is the same as source system 2.
Some rules may be able to infer record values and add data to a master record about a customer.
Because some rules may infer/add data to any particular record, the rules must be re-applied again when a record changes.
We have millions of records per day we'd have to unify
Apache Beam / Dataflow implementation
Apache beam DAG is by definition acyclic but I could just republish the data through pubsub to the same DAG to make it a cyclic algorithm.
I could create a PCollection of hashmaps that continuously do a self join against all other elements but this seems it's probably an inefficient method
Immutability of a PCollection is a problem if I want to be constantly modifying things as it goes through the rules. This sounds like it would be more efficient with Flink Gelly or Spark GraphX
Is there any way you may know in dataflow to process such a problem efficiently?
Other thoughts
Prolog: I tried running on subset of this data with a subset of the rules but swi-prolog did not seem scalable, and I could not figure out how I would continuously emit the results to other processes.
JDrools/Jess/Rete: Forward chaining would be perfect for the inference and efficient partial application, but this algorithm is more about applying many many rules to individual records, rather than inferring record information from possibly related records.
Graph database: Something like neo4j or datomic would be nice since joins are at the record level rather than row/column scans, but I don't know if it's possible in beam to do something similar
BigQuery or Spanner: Brute forcing these rules in SQL and doing full table scans per record is really slow. It would be much preferred to keep the graph of all records in memory and compute in-memory. We could also try to concat all columns and run multiple compare and update across all columns
Or maybe there's a more standard way to solving these class of problems.
It is hard to say what solution works best for you from what I can read so far. I would try to split the problem further and try to tackle different aspects separately.
From what I understand, the goal is to combine together the matching records that represent the same thing in different sources:
records come from a number of sources:
it is logically the same data but formatted differently;
there are rules to tell if the records represent the same entity:
collection of rules is static;
So, the logic probably roughly goes like:
read a record;
try to find existing matching records;
if matching record found:
update it with new data;
otherwise save the record for future matching;
repeat;
To me this looks very high level and there's probably no single 'correct' solution at this level of detail.
I would probably try to approach this by first understanding it in more detail (maybe you already do), few thoughts:
what are the properties of the data?
are there patterns? E.g. when one system publishes something, do you expect something else from other systems?
what are the requirements in general?
latency, consistency, availability, etc;
how data is read from the sources?
can all the systems publish the records in batches in files, submit them into PubSub, does your solution need to poll them, etc?
can the data be read in parallel or is it a single stream?
then the main question of how can you efficiently match a record in general will probably look different under different assumptions and requirements as well. For example I would think about:
can you fit all data in memory;
are your rules dynamic. Do they change at all, what happens when they do;
can you split the data into categories that can be stored separately and matched efficiently, e.g. if you know you can try to match some things by id field, some other things by hash of something, etc;
do you need to match against all of historical/existing data?
can you have some quick elimination logic to not do expensive checks?
what is the output of the solution? What are the requirements for the output?

What can be inferred from two datasets using k-means or k-nn

I am wondering what you could infer using data mining from two large datasets that have similar properties. Say you have two datasets containing detailed information about schools in a country and each dataset belongs to a school stage for a particular year. What sorts of things can you do with these datasets using data mining?
I know how to use and apply the algorithms in pandas but I am having problems with getting the motivation behind the k-means especially.
I know you use the k-means to put the unlabeled data into clusters based on number of factors from the dataset and based on the property values of each data element, they are being placed in one of the clusters created. But then what do you do with these clusters? How can you use them for analysing the data? I read that it can even be used for cleaning the data or relating two datasets to each other, but I'm just having hard time to imagine how would you go about to do these things.
Any help is well appreciated. Thanks..
You could do many things with those datasets including:
See which students from a lower stage are more likely to be in which group (successful, unsuccessful etc.) when they reach a higher stage based on some factors
See what factors are affecting the success of the students at different stages (assuming the datasets contain this information)
You can do many different comparisons based on different factors
..and many more. The issue is that it is not really possible to say what can be inferred from your datasets without seeing what information they contain. My suggestion is you should look carefully in two datasets and see if they have some columns that are common and pick the ones that are interest you the most.