How can I run a data transformation in parallel in Kettle?

I have a Kettle job that loads many CSV files and transforms them into other CSV formats. How can I make Kettle process these files in parallel? I tried changing the copies parameter from the default 1 to the number of processors minus 1, according to this doc, but it gives me an error saying: "Transformation is killing the other steps!"
How should I fix this?
Sample config is as follows:
<step>
<name>WriteSomeData</name>
<type>TextFileOutput</type>
<description/>
<distribute>Y</distribute>
<custom_distribution/>
<copies>50</copies>
...

Related

DataSink step to return each response with all the children

I learnt very recently how to use data-driven testing in Ready API and how to loop calls based on the data. My goal is to run the steps in a loop and, at the end, create an auto-export facility with DataSink so that the results get auto-exported.
Now when I go to the DataSink step, as I understand it I need to create column headers to store the corresponding child values.
This would work well if the SOAP response for each siteId had the same XML structure. But in my case, each of the 2000+ responses I get has a different number of children within
<return> </return>
For example, please take a look at Response 1 and Response 2. These two responses have different numbers of children.
Response 1
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
<soap:Body>
<ns2:getSiteInfoResponse xmlns:ns2="http://billing.xyz.cc/">
<return>
<address1>A</address1>
<city>B</city>
<closeDate>2018-10-15T00:00:00-05:00</closeDate>
<contact1/>
<contact2>TBD</contact2>
<country>X1</country>
<customerNbr>288</customerNbr>
<emailAddr1/>
<emailAddr2/>
<fax1>0</fax1>
<fax2>0</fax2>
<gps>C</gps>
<grouping2>Leased</grouping2>
<grouping4>D</grouping4>
<jobTitle1/>
<jobTitle2/>
<phone1>0</phone1>
<phone2>0</phone2>
<siteId>862578</siteId>
<siteName>D</siteName>
<squareFoot>0.0</squareFoot>
<state>E</state>
<weatherStation>D</weatherStation>
<zip4>4</zip4>
<zip5>F</zip5>
</return>
</ns2:getSiteInfoResponse>
</soap:Body>
</soap:Envelope>
Response 2
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
<soap:Body>
<ns2:getSiteInfoResponse xmlns:ns2="http://billing.xyz.cc/">
<return>
<address1>1202</address1>
<city>QA</city>
<contact1/>
<contact2>BL</contact2>
<country>A</country>
<customerNbr>288</customerNbr>
<emailAddr1/>
<emailAddr2/>
<fax1>0</fax1>
<fax2>0</fax2>
<gps>LTE</gps>
<grouping1>1345</grouping1>
<grouping2>Leased</grouping2>
<grouping3>ZX</grouping3>
<grouping4>AA</grouping4>
<grouping5>2000</grouping5>
<jobTitle1/>
<jobTitle2/>
<phone1>0</phone1>
<phone2>0</phone2>
<services>
<accountNbr>11099942</accountNbr>
<liveDt>2013-07-01T00:00:00-05:00</liveDt>
<service>2</service>
<serviceType>gas</serviceType>
<vendorAddr1/>
<vendorAddr2>M</vendorAddr2>
<vendorCity>N</vendorCity>
<vendorName>O</vendorName>
<vendorNbr>P</vendorNbr>
<vendorPhone>Q</vendorPhone>
<vendorState>R</vendorState>
<vendorZip>S</vendorZip>
</services>
<services>
<accountNbr>13064944</accountNbr>
<liveDt>2018-05-20T00:00:00-05:00</liveDt>
<service>2</service>
<serviceType>gas</serviceType>
<vendorAddr1/>
<vendorAddr2>A</vendorAddr2>
<vendorCity>B</vendorCity>
<vendorName>C</vendorName>
<vendorNbr>677</vendorNbr>
<vendorPhone>D</vendorPhone>
<vendorState>E</vendorState>
<vendorZip>F</vendorZip>
</services>
<siteId>101567</siteId>
<siteName>X</siteName>
<squareFoot>4226.0</squareFoot>
<state>Y</state>
<weatherStation>Z</weatherStation>
<zip4>0</zip4>
<zip5>L</zip5>
</return>
</ns2:getSiteInfoResponse>
</soap:Body>
</soap:Envelope>
Now I need to create a table from the whole response, to be used in business intelligence. If I have to create matching headers in the DataSink, I need to go through each and every response to make sure I have created a corresponding property in the DataSink. That is not humanly possible without compromising accuracy.
Is there any way to program Ready API either to store the individual XML response from each looping call in a file I specify (2000+ XML responses), or to store all the values by children of the response node, without me having to specify all the header names in the DataSink window? In either case I would be fine using a BI tool to create a corresponding table from there.
Thank you in advance.
As you point out, the differing number of children makes the linear data sink problematic.
That said, you can still use datasink to dump out all values in one go. In the datasink, create a single header and use 'get data' to select the root node of your response.
This will obviously generate a massive file, so you have two choices here: either dump everything into a single file, or create a new file per response.
If you're wondering about naming lots of little files, you can generate a filename on the fly for the data sink to use. To do this, create a groovy script inside the loop and make it return a path and file name. You could use some timestamp value, e.g. c:\temp\myResults\2020120218150102.txt, which is year, month, day, hour, minutes, seconds and ms. Then, in the DataSink step where you browse for the file name, use Get Data to 'grab' the result of the groovy script.
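For instance, a Groovy Script test step inside the loop along these lines could build such a name (the folder and date pattern are just illustrative; the script's return value is what Get Data picks up):
// Sketch: build a timestamp-based file path for the data sink to use
def stamp = new Date().format("yyyyMMddHHmmssSSS")   // year, month, day, hour, minute, second, ms
return "C:/temp/myResults/${stamp}.txt"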
@Chris Adams, thanks for your awesome idea. I could not put it into practice completely, but because of your Get Data idea I took a different route and got what I wanted.
So this is what I did. Instead of using DataSink, I used a Create File step. The idea is that whenever I schedule this task, Ready API runs the whole thing in a loop and writes the results to a static folder,
with the file name containing the siteId obtained via Get Data from the raw request's arg3:
${getSiteInfo#RawRequest#declare namespace bil='http://billing.xyz.cc/'; //bil:getSiteInfo[1]/arg3[1]}.xml
and the file content set to the whole response, taken from the root node:
${getSiteInfo#Response#declare namespace soap='http://schemas.xmlsoap.org/soap/envelope/'; //soap:Envelope[1]}
However, I am still interested in this, and I could not get this part to work:
That said, you can still use datasink to dump out all values in one go. In the datasink, create a single header and use 'get data' to select the root node of your response.

XSLT - filter out elements that are not x-referenced

I have developed a (semi-)identity transformation from which I need to filter out elements that are unused.
The source XML contains 2001 "zones". No more, no less.
It also contains any number of devices, which are placed in these zones.
One specific example source XML contains 8800 of these devices.
More than one device can be placed in the same zone.
Zone 0 is a "null zone", meaning that a device placed in this zone is currently unassigned.
This means that the number of real zones is 2000.
Simplified source XML:
<configuration>
<zones>
<zone id="0">
...
<zone id="2000"/>
</zones>
<devices>
<device addr="1">
<zone>1</zone>
</device>
...
<device addr="8800">
<zone>1</zone>
</device>
</devices>
</configuration>
The problem we have is that out of the 2000 usable zones, most often only roughly 200 of these contain one or more devices.
I need to whittle out unused zones. There are reasons for this which serve only to detract from the question at hand, so if you don't mind I will not elaborate here.
I currently have this problem solved, like so:
<xsl:for-each select="zones/zone[@id > 0]">
  <xsl:if test="/configuration/devices/device[zone = current()/@id]">
    <xsl:call-template name="Zone"/>
  </xsl:if>
</xsl:for-each>
And this works.
But on some of the larger projects the transformation takes absolute ages.
That is because in pseudo code this translates to:
for each <zone> in <zones>
find any <device> in <devices> with reference to <zone>
if found
apply zone template
endif
endfor
With 2000 zones to iterate over - and each iteration triggering up to 8800 searches for a qualifying device - you can imagine this taking a very long time.
And to compound problems, libxslt provides no API for progress reporting. This means that for a long time our application will appear frozen while it imports and converts the customer XML.
I do have the option to write every zone unconditionally, and upon the application bootstrapping from our (output) XML, remove or ignore any zones that have no devices placed in them.
And it may turn out that this may be the only option I have.
The downside to this is that my output XML then contains a lot of zones that are not referenced.
That makes it a bit difficult to consolidate what we have in our configuration and what parts of it the application is actually using.
My question to you is:
Have I got other options that ensure that the output XML only contains used zones?
I am not averse to performing a follow-up XSLT conversion.
I was for instance thinking that it may be possible(?) to write an attribute used="false" to each <Zone> element in my output.
Then as I go over the devices, I find the relevant zone in my output XML (providing it is assigned / zone is non-zero) and change this to used="true".
Then follow up with a quick second transformation to remove all zones which have used="false".
But, can I reference my own output elements during an XSLT transformation and change its contents?
You said you have a kind of identity transformation so I would use that as the starting point, plus a key:
<xsl:key name="zone-ref" match="device" use="zone"/>
and an empty template
<xsl:template match="zones/zone[not(key('zone-ref', #id))]"/>
that prevents unreferences zones from being copied.
Or, if there are other conditions, then e.g.
<xsl:template match="zones/zone[#id > 0 and not(key('zone-ref', #id))]"/>

Computing GroupBy once then passing it to multiple transformations in Google DataFlow (Python SDK)

I am using Python SDK for Apache Beam to run a feature extraction pipeline on Google DataFlow. I need to run multiple transformations all of which expect items to be grouped by key.
Based on the answer to this question, DataFlow is unable to automatically spot and reuse repeated transformations like GroupBy, so I hoped to run GroupBy first and then feed the result PCollection to other transformations (see sample code below).
I wonder if this is supposed to work efficiently in DataFlow. If not, what is a recommended workaround in Python SDK? Is there an efficient way to have multiple Map or Write transformations taking results of the same GroupBy? In my case, I observe DataFlow scale to the max number of workers at 5% utilization and make no progress at the steps following the GroupBy as described in this question.
Sample code. For simplicity, only 2 transformations are shown.
# Group by key once.
items_by_key = raw_items | GroupByKey()
# Write grouped items to a file.
(items_by_key | FlatMap(format_item) | WriteToText(path))
# Run another transformation over the same group.
features = (items_by_key | Map(extract_features))
Feeding the output of a single GroupByKey step into multiple transforms should work fine. But the amount of parallelization you can get depends on the total number of keys available in the original GroupByKey step. If any one of the downstream steps is high fanout, consider adding a Reshuffle step after those steps, which will allow Dataflow to further parallelize execution.
For example,
pipeline | Create([<list of globs>]) | ParDo(ExpandGlobDoFn()) | Reshuffle() | ParDo(MyReadDoFn()) | Reshuffle() | ParDo(MyProcessDoFn())
Here,
ExpandGlobDoFn: expands input globs and generates files
MyReadDoFn: reads a given file
MyProcessDoFn: processes an element read from a file
I used two Reshuffles here (note that Reshuffle has a GroupByKey in it) to allow (1) parallelizing reading of files from a given glob (2) parallelizing processing of elements from a given file.
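Applied to the pipeline in the question, the idea could look roughly like this; it is only a sketch with stand-in data, placeholder functions and illustrative step labels, not the original code:
import apache_beam as beam
from apache_beam.transforms.util import Reshuffle

def format_item(kv):
    key, values = kv
    for value in values:
        yield '%s,%s' % (key, value)

def extract_features(kv):
    key, values = kv
    return key, sum(values)  # placeholder feature computation

with beam.Pipeline() as pipeline:
    raw_items = pipeline | 'Create' >> beam.Create([('a', 1), ('a', 2), ('b', 3)])
    items_by_key = raw_items | 'Group' >> beam.GroupByKey()

    # Branch 1: format and write the grouped items. The Reshuffle after the
    # (potentially high-fanout) FlatMap lets Dataflow redistribute the records.
    (items_by_key
     | 'Format' >> beam.FlatMap(format_item)
     | 'Reshuffle formatted' >> Reshuffle()
     | 'Write' >> beam.io.WriteToText('/tmp/items'))

    # Branch 2: another transform over the same grouped PCollection.
    features = (items_by_key
                | 'Extract features' >> beam.Map(extract_features))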
Based on my experience in troubleshooting this SO question, reusing GroupBy output in more than one transformation can make your pipeline extremely slow. At least this was my experience with Apache Beam SDK 2.11.0 for Python.
Common sense told me that branching out from a single GroupBy in the execution graph should make my pipeline run faster. After 23 hours of running on 120+ workers, the pipeline was not able to make any significant progress. I tried adding Reshuffles, using a combiner where possible, and disabling the experimental shuffle service.
Nothing helped until I split the pipeline into two. The first pipeline computes the GroupBy and stores it in a file (I need to ingest it "as is" into the DB). The second reads the file with the GroupBy output, reads additional inputs and runs further transformations. The result: all transformations successfully finished in under 2 hours. I think that if I had simply duplicated the GroupBy in my original pipeline, I would probably have achieved the same results.
I wonder whether this is a bug in the DataFlow execution engine or the Python SDK, or whether it works as intended. If it is by design, then at least it should be documented, and a pipeline like this should not be accepted when submitted, or there should be a warning.
You can spot this issue by looking at the 2 branches coming out of the "Group keywords" step. It looks like the solution is to rerun GroupBy for each branch separately.

Hadoop MapReduce using 2 mappers and 1 reducer in C++

Following the instructions in this link, I implemented a wordcount program in C++ using a single mapper and a single reducer. Now I need to use two mappers and one reducer for the same problem.
Can someone please help me in this regard?
The number of mappers depends on the number of input splits created. The number of input splits depends on the size of the input, the size of a block, the number of input files (each input file creates at least one input split), whether the input files are splittable or not, etc. See also this post in SO.
You can set the number of reducers to as many as you wish. I guess in hadoop pipes you can do this by passing -D mapred.reduce.tasks=... when running hadoop. See this post in SO.
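For example, a hadoop pipes run of the wordcount binary might look roughly like this (the HDFS paths are illustrative):
hadoop pipes \
  -D hadoop.pipes.java.recordreader=true \
  -D hadoop.pipes.java.recordwriter=true \
  -D mapred.reduce.tasks=1 \
  -input /user/hduser/input \
  -output /user/hduser/output \
  -program /user/hduser/bin/wordcount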
If you want to quickly test how your program works with more than one mappers, you can simply put a new file in your input path. This will make hadoop create another input split and thus another map task.
PS: The link that you provide is not reachable.

XSLT Transformation requires counts

Hello, I am transforming a CSV file using XSLT files to pull training records out by employee.
What I now need to do is also output a footer at the bottom of the CSV file with the total record count, and somehow count each record that is transformed, so I can compare the two in the system I am importing into.
This is what the source file looks like -
TrainingRecord,,SP Training,,geoff.culbertson,,Trained,,IT
TrainingRecord,,SP Training,,jim.schultz,,Trained,,IT
RecordCount|2
So I need to transform the record count at the end of the file and also count each record that is transformed (in this example it would be 2), so that I can compare the two.
It seems very strange to use XSLT with a CSV as the input, but to answer your question: based on the XSLT you've provided us, I believe you could obtain the row count with:
<xsl:value-of select="count(//Line)" />
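For the footer itself, assuming the CSV rows really are exposed as Line elements in the intermediate XML (the element name is an assumption carried over from the expression above), a small fragment placed after the per-record output in your stylesheet could emit it:
<!-- Sketch: footer row with the total number of transformed records -->
<xsl:text>RecordCount|</xsl:text>
<xsl:value-of select="count(//Line)"/>
<xsl:text>&#10;</xsl:text>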