Pentaho Data Integration "Variable scope type" in Set Variables - kettle

I have a job running in PDI that transfers data between different sources and targets for a specific system. This job has a lot of child jobs. Let's call it MasterJob1.
We have the same system running for another purpose, so I want to copy that job in PDI and change just a few settings. Let's call the copy MasterJob2.
To make variables available throughout the entire job (including parent jobs, child jobs, and so on), we use "Set Variables". We have a lot of different variables; let's say one of them is called TestVar. At the moment, the "Variable scope type" of these variables in MasterJob1 is always set to "Valid in the Java Virtual Machine".
According to the PDI documentation (http://wiki.pentaho.com/display/EAI/Set+Variables), this means the variables are available everywhere in the virtual machine. As I understand it, if I copy the job and leave the "Variable scope type" as it is, the variable TestVar can be written by MasterJob1 but can also be overwritten by MasterJob2.
I definitely want to avoid MasterJob1 overwriting variables of MasterJob2 and vice versa. However, the variables set in MasterJob1 must be available everywhere within MasterJob1, and likewise for MasterJob2. So I continued reading the documentation. It says there is a "Variable scope type" called "Valid in the root job". Am I right that this is the scope type I need?
Unfortunately I don't have much experience with this, and setting up a test environment would take me a few days. So I hope you can give me either an easy "yes, go for it" or the right solution.

Your assumption is correct.
Avoid using Valid in the virtual machine for jobs on the server, although it is handy for debugging on your dev PC.
Use Valid in the parent job when a transformation (or job) has to return a value to the caller.
Use Valid in the grand-parent job very rarely, although I remember some special moments where it was useful.
Use Valid in the root job almost all the time.

Related

Does DuplicateHandle() do any interprocess communication (IPC) and if not why target params?

I am finding DuplicateHandle() very confusing. The third and fourth parameters, hTargetProcessHandle and lpTargetHandle, seem to imply that this API function does some form of interprocess communication. But what I have been reading online seems to imply (without saying so directly) that this function cannot actually communicate with anything outside its own process's address space, and that if you really do want to, say, copy the local process handle to another process, you have to do that yourself manually.
So can someone please take pity on me and tell me definitively whether or not this function does any IPC itself? And if it doesn't, what is the point of those two parameters? How can there be a "target" if no data is sent and the output of this function is not visible to other processes?
At first I thought I could call GetCurrentProcess() and then use DuplicateHandle() to copy the local process handle to another process, but then I started to realize that it probably isn't that easy.
The third parameter hTargetProcessHandle is documented as
A handle to the process that is to receive the duplicated handle.
That means that the handle (which underneath is just a numeric value) will become usable within the target process. However, how you get this value into the target process, and in what context it is used there, is outside the scope of this function. Also note that "is to receive" points to the future: it refers to the result of the call, so the handle only becomes usable after the call has finished.
As an analogy: you want to let a friend into your house, so you create a second key to your door. That alone doesn't mean your friend can now unlock your door, because you first have to give them the key, but it's a first step.

Are global variables constantly updated

I know global variables are bad, but I have a checksettings function which is run every tick. http://pastebin.com/54yp4vuW The pastebin contains part of the check-settings function. Before I added the GetPrivateProfileIntA calls, everything worked fine. Now when I run it, it lags like hell. I can only assume this is because it is constantly loading the files. So my question is: are global variables constantly updated? (I.e., if I put this in a global var, will it stop the lag?)
Thanks :)
Assuming I'm interpreting your question correctly: no, global variables are not constantly updated unless you explicitly do so in code. So yes, storing the results of those calls in global variables will get rid of the lag.
You haven't provided any details about the design but globals are visible across the entire application and get updated when they are written into.
Multiple processes/threads reading that global variable would then read the same updated value.
But synchronizing reads/writes requires the use of synchronization mechanisms such as mutexes, condition variables etc etc.
In your case you need to decide when to call GetPrivateProfileIntA() for all those settings.
Are all those settings constantly updated, or only a fraction of them? Identify the ones which need to be monitored periodically and only load those.
And if a setting is shared, meaning all objects of a class refer to a single copy of it, I would use static class members instead of plain global variables.
Alternately, you could make a just-in-time call to GetPrivateProfileIntA() where needed and not bother storing the setting in a global variable at all.

Consistently using the value of "now" throughout the transaction

I'm looking for guidelines to using a consistent value of the current date and time throughout a transaction.
By transaction I loosely mean an application service method, such methods usually execute a single SQL transaction, at least in my applications.
Ambient Context
One approach described in answers to this question is to put the current date in an ambient context, e.g. DateTimeProvider, and use that instead of DateTime.UtcNow everywhere.
However, the purpose of that approach is only to make the design unit-testable, whereas I also want to prevent errors caused by querying DateTime.UtcNow multiple times, for example:
// In an entity constructor:
this.CreatedAt = DateTime.UtcNow;
this.ModifiedAt = DateTime.UtcNow;
This code creates an entity with slightly differing created and modified dates, whereas one expects these properties to be equal right after the entity was created.
Also, an ambient context is difficult to implement correctly in a web application, so I've come up with an alternative approach:
Method Injection + DeterministicTimeProvider
The DeterministicTimeProvider class is registered as an "instance per lifetime scope" AKA "instance per HTTP request in a web app" dependency.
It is constructor-injected to an application service and passed into constructors and methods of entities.
The IDateTimeProvider.UtcNow property is used instead of the usual DateTime.UtcNow / DateTimeOffset.UtcNow everywhere to get the current date and time.
Here is the implementation:
/// <summary>
/// Provides the current date and time.
/// The provided value is fixed when it is requested for the first time.
/// </summary>
public class DeterministicTimeProvider : IDateTimeProvider
{
    private readonly Lazy<DateTimeOffset> _lazyUtcNow =
        new Lazy<DateTimeOffset>(() => DateTimeOffset.UtcNow);

    /// <summary>
    /// Gets the current date and time in the UTC time zone.
    /// </summary>
    public DateTimeOffset UtcNow => _lazyUtcNow.Value;
}
Is this a good approach? What are the disadvantages? Are there better alternatives?
Sorry for the logical fallacy of appeal to authority here, but this is rather interesting:
John Carmack once said:
"There are four principle inputs to a game: keystrokes, mouse moves, network packets, and time. (If you don't consider time an input value, think about it until you do -- it is an important concept)"
Source: John Carmack's .plan posts from 1998 (scribd)
(I have always found this quote highly amusing, because the suggestion that if something does not seem right to you, you should think of it really hard until it seems right, is something that only a major geek would say.)
So here is an idea: consider time as an input. It is probably not included in the XML that makes up the web service request (and you wouldn't want it to be), but in the handler where you convert the XML to an actual request object, obtain the current time and make it part of your request object.
So, as the request object is passed around your system during the course of processing the transaction, the time to be considered "the current time" can always be found within the request. It is not "the current time" anymore; it is the request time. (The fact that the two will be one and the same, or very close, is completely irrelevant.)
This way, testing also becomes even easier: you don't have to mock the time provider interface, the time is always in the input parameters.
Also, this way other fun things become possible, for example servicing requests retroactively, at a moment in time completely unrelated to the actual current moment. Think of the possibilities. (Picture of SpongeBob-with-a-rainbow goes here.)
Hmmm... this feels like a better question for CodeReview.SE than for StackOverflow, but sure, I'll bite.
Is this a good approach?
If used correctly, in the scenario you described, this approach is reasonable. It achieves the two stated goals:
Making your code more testable. This is a common pattern I call "Mock the Clock", and is found in many well-designed apps.
Locking the time to a single value. This is less common, but your code does achieve that goal.
What are the disadvantages?
Since you are creating another new object for each request, there will be a mild amount of additional memory usage and extra work for the garbage collector. This is somewhat moot, since this is how it usually goes for all objects with per-request lifetime, including the controllers.
There is a tiny bit of time added before the reading is taken from the clock, caused by the extra work of loading the object and the lazy initialization. It's negligible though, probably on the order of a few milliseconds.
Since the value is locked down, there's always the risk that you (or another developer who uses your code) might introduce a subtle bug by forgetting that the value won't change until the next request. You might consider a different naming convention: for example, instead of "now", call it "requestReceivedTime" or something like that.
Similar to the previous item, there's also the risk that your provider might be loaded with the wrong lifecycle. You might use it in a new project and forget to set the instancing, loading it up as a singleton; then the value is locked down for all requests. There's not much you can do to enforce this, so be sure to document it well. The <summary> tag is a good place.
You may find you need the current time in a scenario where constructor injection isn't possible, such as a static method. You'll either have to refactor to use instance methods, or pass either the time or the time provider as a parameter into the static method.
Are there better alternatives?
Yes, see Mike's answer.
You might also consider Noda Time, which has a similar concept built in, via the IClock interface, and the SystemClock and FakeClock implementations. However, both of those implementations are designed to be singletons. They help with testing, but they don't achieve your second goal of locking the time down to a single value per request. You could always write an implementation that does that though.
Code looks reasonable.
Drawback: most likely the lifetime of the object will be controlled by the DI container, so the user of the provider can't be sure that it will always be configured correctly (per-invocation, and not some longer lifetime like app/singleton).
If you have a type representing the "transaction", it may be better to put a "Started" time there instead.
This isn't something that can be enforced with a realtime clock and a query, or by testing; a developer may always figure out some obscure way of reaching the underlying library call...
So don't do that. Dependency injection also won't save you here; the issue is that you want a standard pattern for time at the start of the "session".
In my view, the fundamental problem is that you are expressing an idea and looking for a mechanism for it. The right mechanism is to name it, say what you mean in the name, and then set it only once. readonly is a good way to handle setting it only once in the constructor, and it lets the compiler and runtime enforce what you mean: that it is set only once.
// In an entity constructor: capture the time once, assign it to readonly state
var now = DateTime.UtcNow;
this.CreatedAt = now;
this.ModifiedAt = now;

Can SYSTEM_INFO::dwActiveProcessorMask change while my process is running?

I'm curious about something: can the dwActiveProcessorMask member of the SYSTEM_INFO struct change after my service starts running (on Windows)? If not, I'd like to cache it during initialization.
It is reasonable to assume that it could change. See, for example, this description of dealing with dynamic partitioning and how to code and test for correctness.
Of course not. dwActiveProcessorMask is set during the hardware-detection phase at boot; it can only change when the hardware changes. If you read the value during your application's initialization phase, you will always be good.

Is using clojure's stm as a global state considered a good practice?

In most of my Clojure programs, and a lot of other Clojure programs I see, there is some sort of global state kept in an atom:
(def *program-state*
  (atom {:name "Program"
         :var1 1
         :var2 "Another value"}))
And this state would be referred to occasionally in the code.
(defn program-name []
  (:name @*program-state*))
Reading this article (http://misko.hevery.com/2008/07/24/how-to-write-3v1l-untestable-code/) made me rethink global state, but even though I completely agree with the article, I somehow still think it's okay to use hash-maps in atoms, because it provides a common interface for manipulating global state data (analogous to using different databases to store your data).
I would like some other thoughts on this matter.
This kind of thing can be OK, but it is also often a design smell so I would approach with caution.
Things to think about:
Consistency - can one part of the code change the program name? If so, the program-name function will behave inconsistently from the perspective of other threads. Not good!
Testability - is this easy to test? Can one part of the test suite that changes the program name safely run concurrently with another test that reads the name?
Multiple instances - will you ever have two different parts of the application expecting to use a different program name at the same time? If so, this is a strong hint that your mutable state should not be global.
Alternatives to consider:
Using a ref instead of an atom, you can at least ensure consistency of mutable state within transactions.
Using binding, you can limit mutation to a per-thread basis. This solves most of the concurrency issues and can be helpful when your global variables are being used like a set of thread-local configuration parameters.
Using immutable global state wherever you can. Does it really need to be mutable?
I think having a single global state that is occasionally updated in commutative ways is fine. When you start having two global states that need to be updated and threads start using them for communication, then I start to worry.
Maintaining a count of current global users is fine:
Any thread can inc or dec it at any time without hurting another.
If it changes out from under your thread, nothing explodes.
Maintaining the log directory is questionable:
When it changes, will all threads stop writing to the old one?
If two threads change it, will they converge?
Using this as a message queue is even more dubious.
I think it is fine to have such a global state (and in many cases it is required), but I would be careful to keep the core logic of my application in functions that take the state as a parameter and return the updated state, rather than accessing the global state directly. Basically, I would prefer controlled access to the global state through a small set of functions, with everything else in the program going through those functions. That would allow me to abstract away the state implementation: initially I could start with an in-memory atom, then maybe move to some persistent storage.