Unit testing data flow in an SSIS package - unit-testing

Is there a way to unit test the data flow in an SSIS package?
For example: testing the Sort - verifying that the sort is properly done.

There is a unit testing framework for SSIS - see SSISUnit.
This is worth looking at, but it may not solve your problem. It is possible to unit test individual components at the control flow level using this framework, but it is not possible to isolate an individual data flow transformation - you can only test the Data Flow task as a whole.
One approach you could take is to redesign your package and break your Data Flow task down into multiple Data Flow tasks that can be tested individually. However, that will impact the performance of your package, because you will have to persist the data somewhere in between the data flow tasks.
You can also take this approach using NUnit or a similar framework, using the SSIS API to load a package and execute an individual task.

SSISTester can tap the data flow between two components and save the data to a file. The output can then be accessed in a unit test. For more information, look at ssistester.bytesoftwo.com. An example of how to use SSISTester to achieve this is given below:
[UnitTest("DEMO", "CopyCustomers.dtsx", DisableLogging=true)]
[DataTap(#"\[CopyCustomers]\[DFT Convert customer names]\[RCNT Count customers]", #"\[CopyCustomers]\[DFT Convert customer names]\[DER Convert names to upper string]")]
[DataTap(#"\[CopyCustomers]\[DFT Convert customer names]\[DER Convert names to upper string]", #"\[CopyCustomers]\[DFT Convert customer names]\[FFD Customers converted]")]
public class CopyCustomersFileAll : BaseUnitTest
{
...
protected override void Verify(VerificationContext context)
{
ReadOnlyCollection<DataTap> dataTaps = context.DataTaps;
DataTap dataTap = dataTaps[0];
foreach (DataTapSnapshot snapshot in dataTap.Snapshots)
{
string data = snapshot.LoadData();
}
DataTap dataTap1 = dataTaps[1];
foreach (DataTapSnapshot snapshot in dataTap1.Snapshots)
{
string data = snapshot.LoadData();
}
}
}

Short answer - not easily. Longer answer: yes, but you'll need external tools to do it. One potential test would be to take a small sample of the data set, run it through your sort, and dump the output to an Excel file. Take the same data set, copy it into an Excel spreadsheet, and sort it manually. Run a binary diff tool on the dump from SSIS and your hand-sorted example. If everything checks out, it's right.
OTOH, unit testing the Sort in SSIS shouldn't be necessary, unless what you're really testing is the sort criteria selection. The sort itself should have been tested by Microsoft before it was shipped.

I would automate the testing by keeping a known-good output file for each representative input and comparing the actual output against it byte for byte with an external program.
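One way to sketch that comparison step is a tiny standalone checker (shown here in Java as an illustration; Files.mismatch requires Java 12+, and the file paths are hypothetical):

import java.nio.file.Files;
import java.nio.file.Path;

public class OutputComparer {
    public static void main(String[] args) throws Exception {
        // Hypothetical paths: the file produced by the package and the known-good baseline.
        Path actual = Path.of("output/sorted_customers.csv");
        Path expected = Path.of("baselines/sorted_customers.csv");

        // Files.mismatch returns -1 if the files are byte-for-byte identical,
        // otherwise the position of the first differing byte.
        long firstDiff = Files.mismatch(actual, expected);
        if (firstDiff == -1L) {
            System.out.println("PASS: output matches the known-good file");
        } else {
            System.out.println("FAIL: first difference at byte " + firstDiff);
            System.exit(1);
        }
    }
}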

I like to use data viewers when I need to see the data moving from component to component.

Related

How to use Apache Beam to process historic time series data?

I have an Apache Beam pipeline that processes multiple time series in real time. Deployed on GCP Dataflow, it combines the time series into windows and calculates aggregates, etc.
I now need to perform the same operations over historic data (the same (multiple) time series data) stretching all the way back to 2017. How can I achieve this using Apache Beam?
I understand that I need to use the windowing property of Apache Beam to calculate the aggregates, but it should accept data from two years back onwards.
Effectively, I need the data as it would have been available had I deployed the same pipeline two years ago. This is needed for testing/model-training purposes.
That sounds like a perfect use case for Beam's focus on event-time processing. You can run the pipeline against any legacy data and get correct results as long as the events have timestamps. Without additional context, I think you will need an explicit step in your pipeline that assigns custom timestamps (from 2017 onwards) which you extract from the data itself. To do this you can probably use either:
context.outputWithTimestamp() in your DoFn;
WithTimestamps PTransform;
You might also need to configure the allowed timestamp skew if you run into timestamp-ordering issues.
See:
outputWithTimestamp example: https://github.com/apache/beam/blob/efcb20abd98da3b88579e0ace920c1c798fc959e/sdks/java/core/src/test/java/org/apache/beam/sdk/transforms/windowing/WindowingTest.java#L248
documentation for WithTimestamps: https://beam.apache.org/releases/javadoc/2.13.0/org/apache/beam/sdk/transforms/WithTimestamps.html#of-org.apache.beam.sdk.transforms.SerializableFunction-
similar question: Assigning to GenericRecord the timestamp from inner object
another question that may have helpful details: reading files and folders in order with apache beam
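As a rough sketch with the Beam Java SDK (the source path, the CSV format, and the position of the timestamp in each record are all assumptions), assigning event-time timestamps from the historic data before windowing could look like this:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.WithTimestamps;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;
import org.joda.time.Instant;

public class HistoricBackfill {
    public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Hypothetical source: the archived time series going back to 2017.
        PCollection<String> records = p.apply(TextIO.read().from("gs://my-bucket/history/*.csv"));

        // Assign each element the event time taken from the record itself
        // (here assumed to be an ISO-8601 timestamp in the first CSV column),
        // instead of the processing time of the backfill run.
        PCollection<String> timestamped = records.apply(
                WithTimestamps.of((String line) -> Instant.parse(line.split(",")[0])));

        // The same windowing/aggregation as the real-time pipeline can now be applied.
        timestamped.apply(Window.into(FixedWindows.of(Duration.standardHours(1))));

        p.run().waitUntilFinish();
    }
}

Because the windows are driven by these event-time timestamps, the aggregates come out as they would have if the pipeline had been running since 2017.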

How would I merge related records in Apache Beam / Dataflow, based on hundreds of rules?

I have data that I have to join at the record level. For example, data about users comes in from different source systems, but there is no common primary key or user identifier.
Example Data
Source System 1:
{userid = 123, first_name="John", last_name="Smith", many other columns...}
Source System 2:
{userid = EFCBA-09DA0, fname="J.", lname="Smith", many other columns...}
There are about 100 rules I can use to compare one record to another
to see if the customer in source system 1 is the same as the customer in source system 2.
Some rules may be able to infer record values and add data to a master record about a customer.
Because some rules may infer/add data to any particular record, the rules must be re-applied when a record changes.
We have millions of records per day that we'd have to unify.
Apache Beam / Dataflow implementation
The Apache Beam DAG is by definition acyclic, but I could just republish the data through Pub/Sub to the same DAG to make it a cyclic algorithm.
I could create a PCollection of hashmaps that continuously does a self-join against all other elements, but this seems like an inefficient method.
The immutability of a PCollection is a problem if I want to be constantly modifying things as they go through the rules. This sounds like it would be more efficient with Flink Gelly or Spark GraphX.
Is there any way you know of in Dataflow to process such a problem efficiently?
Other thoughts
Prolog: I tried running a subset of this data with a subset of the rules, but SWI-Prolog did not seem scalable, and I could not figure out how I would continuously emit the results to other processes.
JDrools/Jess/Rete: Forward chaining would be perfect for the inference and efficient partial application, but this algorithm is more about applying many, many rules to individual records rather than inferring record information from possibly related records.
Graph database: Something like Neo4j or Datomic would be nice, since joins are at the record level rather than row/column scans, but I don't know if it's possible to do something similar in Beam.
BigQuery or Spanner: Brute-forcing these rules in SQL and doing full table scans per record is really slow. It would be much preferred to keep the graph of all records in memory and compute in-memory. We could also try to concatenate all columns and run multiple compare-and-update passes across all columns.
Or maybe there's a more standard way of solving this class of problems.
It is hard to say which solution works best for you from what I can read so far. I would try to split the problem further and tackle the different aspects separately.
From what I understand, the goal is to combine the matching records that represent the same thing in different sources:
records come from a number of sources:
it is logically the same data but formatted differently;
there are rules to tell if the records represent the same entity:
the collection of rules is static;
So, the logic probably roughly goes like this (a rough keyed-state sketch is given at the end of this answer):
read a record;
try to find existing matching records;
if a matching record is found:
update it with the new data;
otherwise save the record for future matching;
repeat;
To me this looks very high level and there's probably no single 'correct' solution at this level of detail.
I would probably try to approach this by first understanding it in more detail (maybe you already do); a few thoughts:
what are the properties of the data?
are there patterns? E.g. when one system publishes something, do you expect something else from the other systems?
what are the requirements in general?
latency, consistency, availability, etc;
how is the data read from the sources?
can all the systems publish the records in batches in files, submit them into Pub/Sub, does your solution need to poll them, etc?
can the data be read in parallel or is it a single stream?
Then the main question of how you can efficiently match a record will probably look different under different assumptions and requirements. For example, I would think about:
can you fit all the data in memory?
are your rules dynamic? Do they change at all, and what happens when they do?
can you split the data into categories that can be stored separately and matched efficiently, e.g. if you know you can try to match some things by an id field, other things by a hash of something, etc;
do you need to match against all of the historical/existing data?
can you have some quick elimination logic to avoid expensive checks?
what is the output of the solution? What are the requirements for the output?
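As a very rough sketch of the read/match/update loop above (Beam Java SDK; the blocking key, the serialized record format, and the matches()/merge() helpers are all hypothetical placeholders for your rules), a keyed, stateful DoFn could keep the previously seen records per key and compare new records against them:

import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.state.BagState;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

// Input is KV<blockingKey, serializedRecord>: records are grouped by some cheap
// candidate key (e.g. a normalized last name) so the expensive rules only run
// against a small set of previously seen records per key.
public class MatchRecordsFn extends DoFn<KV<String, String>, String> {

    @StateId("seen")
    private final StateSpec<BagState<String>> seenSpec = StateSpecs.bag(StringUtf8Coder.of());

    @ProcessElement
    public void process(ProcessContext c, @StateId("seen") BagState<String> seen) {
        String incoming = c.element().getValue();

        // "try to find existing matching records"
        for (String candidate : seen.read()) {
            if (matches(incoming, candidate)) {          // hypothetical: applies the ~100 rules
                c.output(merge(incoming, candidate));    // hypothetical: builds/updates the master record
                return;
            }
        }

        // "otherwise save the record for future matching"
        seen.add(incoming);
        c.output(incoming);
    }

    private static boolean matches(String a, String b) { /* rule engine goes here */ return false; }

    private static String merge(String a, String b) { /* field-level merge goes here */ return a; }
}

This only helps if a reasonable blocking key exists (state in Beam is kept per key and window), and re-applying the rules when a merged record changes later would still need something like the Pub/Sub feedback loop you mention.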

Training and Test Set in Weka Incompatible in Text Classification

I have two datasets regarding whether a sentence contains a mention of a drug adverse event or not. Both the training and test set have only two fields: the text and the label {Adverse Event, No Adverse Event}. I have used Weka with the StringToWordVector filter to build a Random Forest model on the training set.
I want to test the model by removing the class labels from the test dataset, applying the StringToWordVector filter to it, and testing the model with it. When I try to do that, I get an error saying the training and test set are not compatible, probably because the filter identifies a different set of attributes for the test dataset. How do I fix this and output the predictions for the test set?
The easiest way to do this for a one-off test is not to pre-filter the training set, but to use Weka's FilteredClassifier, configured with the StringToWordVector filter and your chosen classifier, to do the classification. This is explained well in this video from the More Data Mining with Weka online course.
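A minimal sketch of that setup with the Weka Java API (the ARFF file names are assumptions; the last attribute is assumed to be the class label):

import weka.classifiers.meta.FilteredClassifier;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class AdverseEventClassifier {
    public static void main(String[] args) throws Exception {
        // Load the raw (unfiltered) training data.
        Instances train = DataSource.read("train.arff");      // hypothetical file name
        train.setClassIndex(train.numAttributes() - 1);

        // FilteredClassifier applies StringToWordVector internally, so the exact
        // same dictionary/attributes are used when classifying new instances.
        FilteredClassifier fc = new FilteredClassifier();
        fc.setFilter(new StringToWordVector());
        fc.setClassifier(new RandomForest());
        fc.buildClassifier(train);

        // Classify raw test instances that have the same structure as the training data.
        Instances test = DataSource.read("test.arff");        // hypothetical file name
        test.setClassIndex(test.numAttributes() - 1);
        for (int i = 0; i < test.numInstances(); i++) {
            double pred = fc.classifyInstance(test.instance(i));
            System.out.println(test.classAttribute().value((int) pred));
        }
    }
}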
For a more general solution, if you want to build the model once and then evaluate it on different test sets in the future, you need to use InputMappedClassifier:
Wrapper classifier that addresses incompatible training and test data
by building a mapping between the training data that a classifier has
been built with and the incoming test instances' structure. Model
attributes that are not found in the incoming instances receive
missing values, so do incoming nominal attribute values that the
classifier has not seen before. A new classifier can be trained or an
existing one loaded from a file.
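A hedged sketch of wrapping the same filtered setup in an InputMappedClassifier (file names again hypothetical) might look like this:

import weka.classifiers.meta.FilteredClassifier;
import weka.classifiers.misc.InputMappedClassifier;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class MappedClassifierExample {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("train.arff");      // hypothetical file name
        train.setClassIndex(train.numAttributes() - 1);

        // Same filtered setup as in the previous answer.
        FilteredClassifier fc = new FilteredClassifier();
        fc.setFilter(new StringToWordVector());
        fc.setClassifier(new RandomForest());

        // InputMappedClassifier maps incoming test instances onto the structure the
        // wrapped classifier was trained with; attributes it has not seen before
        // receive missing values (as described in the quote above).
        InputMappedClassifier imc = new InputMappedClassifier();
        imc.setClassifier(fc);
        imc.buildClassifier(train);

        Instances test = DataSource.read("test.arff");        // hypothetical file name
        test.setClassIndex(test.numAttributes() - 1);
        for (int i = 0; i < test.numInstances(); i++) {
            System.out.println(imc.classifyInstance(test.instance(i)));
        }
    }
}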
Weka requires a label even for the test data. It uses the labels, or "ground truth", of the test data to compare the model's results against and measure the model's performance. How would you tell whether a model is performing well if you don't know whether its predictions are right or wrong? Thus, the test data needs to have the very same structure as the training data in Weka, including the labels. No worries, the labels are not used to help the model with its predictions.
The best way to go is to select cross-validation (e.g. 10-fold cross-validation), which will automatically split your data into 10 parts, using 9 for training and the remaining 1 for testing. This procedure is repeated 10 times, so that each of the 10 parts has been used as test data once. The final performance verdict is the average over all 10 rounds. Cross-validation gives you a quite realistic estimate of the model's performance on new, unseen data.
What you were trying to do, namely using the exact same data for training and testing, is a bad idea, because the measured performance you end up with is way too optimistic. This means you'll get very impressive figures like 98% accuracy during testing, but as soon as you use the model against new, unseen data your accuracy might drop to a much worse level.
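For reference, a small sketch of 10-fold cross-validation with the Weka API, using the raw training data and the FilteredClassifier setup from the other answer (file name hypothetical):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.FilteredClassifier;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class CrossValidationExample {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("train.arff");       // hypothetical file name
        data.setClassIndex(data.numAttributes() - 1);

        FilteredClassifier fc = new FilteredClassifier();
        fc.setFilter(new StringToWordVector());
        fc.setClassifier(new RandomForest());

        // 10-fold cross-validation: trains on 9 folds, tests on the remaining one,
        // and averages the performance over the 10 rounds.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(fc, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}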

Compare values with a static and global table in C++

I am working on statistical analyses in my field, using C++. I am implementing several tests, and some of them need to compare the calculated value with a table, say a distribution table for example, like this one.
I want my different functions in my different classes to be able to access a specific value, to evaluate the significance of my result, for example something like this:
float F = fisherTest(serie1, serie2);
auto tableValue = findValue(serie1.size(), serie2.size());
if (tableValue < F) {
    cout << "Not significant";
    return -1;
}
This is just an example, as this test actually makes no sense. But I just want to be able to read values from a predefined table.
Do you have an idea of how I can achieve this? Can I store this in a "resource file"?
I hope my question is clear! Thank you.
You can keep the values in data files and pass a configuration to the application at startup (e.g. on the command line) so it can find and read the files. The resulting data structure can then be fed to the test.
It is possible to get the predefined data from several sources:
Hard coded tables in your program.
One or more functions that can compute the data on demand.
Files on your local disk.
Data stored in a database server.
You and your team need to decide which makes the most sense for your application.

Repeating pattern matching in Eclipse regexp

I'm trying to turn XML files into SQL statements within Eclipse to save on the manual work of converting large files.
A line such as:
<TABLE_NAME COL_1="value1" COL_2="value2"/>
should be converted to:
insert into TABLE_NAME (COL_1, COL_2) values ("value1", "value2");
So far, I have managed to match and capture the table name and the first column/value pair with:
<(\w+)( \w+)=(".+?").*/>
The .* near the end is just there to test the first part of the pattern and should be removed once it is complete.
The following replace pattern yields the following result:
insert into $1 ($2) values ($3);
insert into TABLE_NAME ( COL_1) values ("value1");
The problem I'm having is that the number of columns is different for different tables, so I would like a generic pattern that would match n column/value pairs and repeatedly use the captured groups in the replace pattern. I haven't managed to understand how to do this yet although \G seems to be a good candidate.
Ideally, this would be solved in a single regexp statement, though I also wouldn't be against multiple statements that have to be run in sequence (though I really don't want to have to force the developer to execute once for every column/value pair).
Anyone have any ideas?
In the end I did not solve this with regular expressions as originally envisaged. With alternative solutions in mind, it was solved by refactoring test code to make it reusable and then co-opting DbUnit, using tests to write data to the different stages of the development databases.
I know that this is a misuse of tests, but it does make the whole thing much simpler, since the user gets green/red feedback on whether the data was inserted or not, and the same test data can be inserted for manual acceptance tests as is used for the automated component/integration tests (which use DbUnit with an in-memory database). As long as these fake test classes are not added to the test suite, they are only executed by hand.
The resulting fake tests ended up being extremely simple and lightweight, for instance:
public class MyComponentTestDataInserter extends DataInserter {
    @Override
    protected String getDeleteSql() {
        return "./input/component/delete.sql";
    }

    @Override
    protected String getInsertXml() {
        return "./input/component/TestData.xml";
    }

    @Test
    public void insertIntoDevStage() {
        insertData(Environment.DEV);
    }

    @Test
    public void insertIntoTestStage() {
        insertData(Environment.TEST);
    }
}
This also had the additional advantage that a developer can insert the data into a single environment by simply executing a single test method via the context menu in the IDE. At the same time, by running the whole test class, the exact same data can be deployed to all environments simultaneously.
Also, a failure at any point during data cleanup or insertion causes the whole transaction to be rolled back, preventing an inconsistent state.