What is the difference between a Job and a Transformation? - kettle

When creating new objects in spoon there's two possibilities: Job and Transfromation. They've got a different set of possible components (although with some level of overlap) and the XML that is generated looks very similar. What's the difference between these two?

This is what I had most problems to understand when starting with Pentaho as well.
A job has one start place, and executes one step at a time, with one flow through the steps.
A transformation has many possible start places and all steps execute in parallel. If a step has a step before it, it will take the data in there, and use it.
In my use I usually schedule jobs, to run transformations in process to get and transform data.
This is a normal question so it's in the FAQ.

Turns out the answer is in the FAQ. From the Pentaho FAQ:
Q: In Spoon I can make jobs and transformations, what's the difference between the two?
A: Transformations are about moving and transforming rows from source to target. Jobs are more about high level flow control: executing transformations, sending mails on failure, transferring files via FTP, ...
Another key difference is that all the steps in a transformation execute in parallel, but the steps in a job execute in order.

Related

How to Limit Concurrency in fp-ts

Our team is starting to learn fp-ts, and we are starting with some basic async examples (mostly pulled from here). Running a set of Tasks in sequence is great, and it looks like array.sequence(task)(tasks)
The question is, what is the idiomatic way to restrict concurrency when executing parallel Tasks in fp-ts? For example, Promise.map (in bluebird) allows you to set a concurrency limit like {concurrency: 4}.
One solution might be to split the array into chunks, and then iterate the chunks, using sequence and flatMap. However, that would mean every Task in each chunk would have to complete before moving on to the next chunk - one long running task could hold up the whole operation.
There must be some abstraction that we're missing - we are all pretty new to FP, so hopefully someone here with more exp can help out.
I was able to find a resolution with the helpful folks over at the ts-fp git repo. Looks like wrapping p-map is the way to go
https://github.com/gcanti/fp-ts/issues/574#issuecomment-424658481

Apache beam / PubSub time delay before processing files

I need to delay processing or publishing filenames (files).
I am looking for the best option.
Currently I have two Apache Beam Dataflows and PubSub in between. First dataflow reads filenames from source and pushes those to PubSub topic. Another dataflow reads them and process them. However my use case is to start processing/reading actual files minimum 1 hour after they are being created in the source.
So I have two options:
1) Delay publishing a message in order to process it right away but in the good/expected moment
2) Delay processing of retrieved files
Like above mentioned I am looking for the best solution. I am not sure if guava retry mechanism should be used in Apache Beam ? Any other ideas?
You could likely achieve what you want via triggering/window configuration in the publishing job.
Then, you could define a windowing configuration where the trigger does not fire until after a 1 hour delay. Something like:
Window.<String>into(FixedWindows.of(Duration.standardMinutes(1))
.triggering(AfterProcessingTime.pastFirstElementInPane()
.plusDelayOf(Duration.standardHours(1)))
Keep in mind that you'll end up with a job that's simply sitting doing not much of anything except holding onto state for an hour. Also, the above is based solely on processing time, so it will wait an hour after job start even if the actual creation time of the files is old enough that it could emit the results immediately.
You could refine this to an event time trigger, but you would likely need to write your own code to assign timestamps to your records (the filenames). To my knowledge, Beam does not currently have built-in support for reading the creation time of files. When reading files via TextIO, for example, I have observed that the records are all assigned a default static timestamp. You should check the specifics of the transform you're using to read filenames to see if it perhaps does something more useful for your purposes. You can also use a WithTimestamps transform to assign timestamps on your own.

Process flow gets stuck on table creations

I'm trying to understand the Enterprise Guide process flow. As I understand it, the process flow is supposed to make it easy to run related steps in the order they need to be run to make a dependent action able to run and be up to date somewhere later in the flow.
Given that understanding, I'm getting stuck trying to make the process flow work in cases where the temporary data is purged. I'm warned when closing Enterprise Guide that the project has references to temporary data which must be the tables I created. That should be fine, the data is on the SAS server and I wrote code to import that data into SAS.
I would expect that the data can be regenerated when I try run an analysis that depends on that data again later, but instead I'm getting an error indicating that the input data does not exist. If I then run the code to import the data and/or join tables in each necessary place, the process flow seems to work as expected.
See the flow that I'm working with below:
I'm sure I must be missing something. Imagine I want to rerun the rightmost linear regression. Is there a way to make the process flow import the data without doing so manually for each individual table creation the first time round?
The general answer to your question is probably that you can't really do what you're wanting directly, but you can do it indirectly.
A process flow (of which you can have many per project, don't forget) is a single set of programs/tasks/etc. that you intend to run as a group. Typically, you will run whole process flows at once, rather than just individual pieces. If you have a point that you want to pause, look at things, then continue, then you have a few choices.
One is to have a process flow that goes to that point, then a second process flow that starts from that point. You can even take your 'import data' steps out of the process flow entirely, make an 'import data' process flow, always run that first, then run the other process flows individually as you need them. In fact, if you use the AUTOEXEC process flow, you could have the import data steps run whenever you open the project, and imported data ready and waiting for you.
A second is to use the UI and control+click or drag a box to select on the process flow to select a group of programs to run; select the first five, say, then run them, then select 'run branch from program...' option to run from that point on. You could also make separate 'branches' and run just the one branch at a time, making each branch dependent on the input streams.
A third option would be to have different starting points for different analysis tasks, and have the import data bit be after that starting point. It could be common to the starting points, and use macro variables and conditional execution to go different directions. For example, you could have a macro variable set in the first program that says which analysis program you're running, then the conditional from the last import step (which are in sequence, not in parallel like you have them) send you off to whatever analysis task the macro variable says. You could also have macro variables that indicate whether an import has been run once already in the current session that then would tell you not to rerun it via conditional steps.
Unfortunately, though, there's no direct way to run something and say 'run this and all of its dependencies', though.

SAS Enterprise Guide - process dependence and parallel execution

I am working on a project in SAS EG (7.1) which involves process dependence and parallel execution, as depicted below:
I have the following questions:
Is there a way to retrieve or set relations (i.e. process_C --> program_D) between the processes programmatically? The maintenance is becoming problematic with complex projects. Ideally, I would like to be able to re-create the links between processes from external table.
I start the whole process with the option “Run branch from <>” process. Let’s assume that we have only 2 processors available. Is there a way to set the order of execution between process_A, B, C? The critical path of the whole flow is “begin -> process_C -> process_D -> end” hence we would like it to start with process_C in order to ensure minimum execution time.
Thank you in advance.
For 1, I think the answer is "no", if you mean a well defined SAS programmatic method. At least for the relatively limited information and example you provide above, anyway. More might be possible with metadata server - not my area of expertise.
You can do some of this at least using scripting via Powershell or VBScript. EG's API is fairly wide open and not all that hard to use. I won't suggest how as my understanding of this is limited also, but it seems like it should be possible to do what you suggest, though probably not easy.
For your second point:
First off, EG typically runs "top to bottom" if it has no other information on how to process a particular choice. So put c->d above a/b to get it processed first.
Second, you could use conditional processing perhaps. There should be a macro variable that tells you how many cpus you have (&SYSNCPU on my machine, hopefully same on other versions). You could use that value to conditionally link to A then B as opposed to A+B simultaneously. I'm not sure how easy this would be to do in a flexible fashion, though.

When does an action not run on the driver in Apache Spark?

I have just started with Spark and was struggling with the concept of tasks.
Can any one please help me in understanding when does an action (say reduce) not run in the driver program.
From the spark tutorial,
"Aggregate the elements of the dataset using a function func (which
takes two arguments and returns one). The function should be
commutative and associative so that it can be computed correctly in
parallel. "
I'm currently experimenting with an application which reads a directory on 'n' files and counts the number of words.
From the web UI the number of tasks is equal to number of files. And all the reduce functions are taking place on the driver node.
Can you please tell a scenario where the reduce function won't execute at the driver. Does a task always include "transformation+action" or only "transformation"
All the actions are performed on the cluster and results of the actions may end up on the driver (depending on the action).
Generally speaking the spark code you write around your business logic is not the program that would actually run - rather spark uses it to create a plan which will execute your code in the cluster. The plan creates a task of all the actions that can be done on a partition without the need to shuffle data around. Every time spark needs the data arranged differently (e.g. after sorting) It will create a new task and a shuffle between the first and the latter tasks
Ill take a stab at this, although I may be missing part of the question. A task is indeed always transformation(s) and an action. The transformation's are lazy and would not submit anything, thus the need for an action. You can always call .toDebugString on your RDD to see where each job split will be; each level of indentation is a new stage. I think the reduce function showing on the driver is a bit of a misnomer as it will run first in parallel and then merge the results. So, I would expect that the task does indeed run on the workers as far as it can.