Sequence Generator or variable port - Informatica

For sequence generation we can use either a Sequence Generator transformation or an Informatica variable port. I am curious which one is better in terms of performance and other aspects. Any reference on this from the Informatica documentation would be great.

If you go only by performance, I suspect Sequence Generator will perform better.
However, which approach you should take largely depends on the scenario. For example, if you need to generate a sequence starting from 0 (or any fixed number) every time the mapping runs, then Sequence Generator will do fine. However, say you need to generate sequence numbers after the max number present in the target table; in this case, you might do a lookup on the target table to get the max value and keep incrementing it in a variable port.
An example where you cannot use a variable port is when you need to generate unique sequence numbers across multiple mappings. You can, however, use a reusable Sequence Generator for this purpose.
Again, you would go for a variable port when you need more flexibility in generating sequence numbers, for example to generate a new sequence number only when a new unique value appears in some source column.

As far as I know, if you just want a continuous sequence you should use a variable port: it will give you better performance, and the Sequence Generator creates lots of problems during migration.
If you want some kind of loop or have some special requirement, then you should go for the Sequence Generator.

Apart from performance please keep code readability and maintenance in mind. Sequence Generator is clearly visible just at first glance. Variable ports are hidden and visible only while editing the appropriate expression.

Please consider these points:
1--> Sequence Generator is a separate transformation for generating sequence numbers. It also has a maximum value of 2,147,483,647; after that you would need another Sequence Generator to continue, otherwise the mapping fails.
2--> I haven't heard of an end value when generating sequence numbers with an Expression transformation.
3--> If you don't want more transformations in the mapping (to avoid mapping complexity), you can generate the numbers in the Expression transformation itself.
4--> If you consider ease of use, you may go with the Sequence Generator, since you don't need to define anything manually; it is all built in.
For complete guidance please refer to
https://chase4chance.blogspot.com/2022/09/sequence-generator-vs-variable-port-in.html

How can I create pseudo-random numbers depending on multiple integer parameters?

My overall goal is to create a round-based game. So I need some seemingly randomly generated numbers, for example in fights, though there are other occasions as well. I want those numbers to be reliably the same if certain parameters are the same.
Those parameters will be an integer seed (it will take the value of a generalRandomSeed that changes every round), but I need more parameters, like the IDs of attacker and defender. It would be very convenient to call this function with the parameters (maybe all combined in a vector), like getRandom(generalRandomSeed,id1,id2).
So, in the end I am hoping for a function that takes one or more ints as parameters (ideally, a vector), returning one single integer: int getRandom(std::vector<int> parameters);
I cannot quite figure out how to solve this; if it were only about one parameter, I might just create a new mt19937 every time with my seed generalRandomSeed.
To explain Maarten's issues with std::seed_seq (and to make sure I understand things correctly!) I'll attempt an answer.
The main issue I'd have with potentially suggesting using it (and it sounds like Maarten has the same one) is that the algorithm used by std::seed_seq isn't defined by the standard. Hence it's possible for other standard libraries (e.g. if you use a different compiler or even a different version of the same compiler) to change the implementation and you'd get different values back from the same inputs. Depending on your use case this lack of stability may, or may not, matter. That's a domain specific issue you'd have to decide on as you haven't specified it.
The cryptographic approach suggested would be to use something like an HKDF, where you use a cryptographic hash (like SHA256) to extract the entropy (i.e. "randomness") from your input values in a deterministic manner (e.g. taking care of endianness) and then use another cryptographic primitive to "stretch" this entropy out to produce your random output. If you're not familiar with the cryptographic world these things can be awkward as there's a lot of terminology.
As a minor point, I'd suggest against using the MT19937 PRNG as it's relatively expensive to seed and it sounds like you'd be doing this a lot. There are other algorithms that are much cheaper, I personally like Sebastiano's xoshiro256 family or you could use Melissa's PCG family. Melissa O'Neill's site is a very useful resource if you're new to PRNGs.
That said, if you're going the HKDF route you may as well just use the "expansion" step as a PRNG as it will directly produce uniform values. These can be transformed into bounded/uniform values easily: Melissa has a good review for ints here, or Vigna describes a conventional transform for binary IEEE-754 floats at the bottom of here (so is valid for float and double on most common CPUs).
update: The MT19937 would seem difficult to predict given its enormous state space, but in fact predicting it has been shown to be almost trivial. For example, searching for "predicting mt19937" leads to https://github.com/kmyk/mersenne-twister-predictor which will give you the state of the RNG from 624 consecutive 32-bit integer draws. Using a CSPRNG would protect you from this, and is what using the output of an HKDF would give you. PCG makes this more difficult than the Mersenne Twister, but given that it's optimised for speed it can't expend too much work doing this.
Apparently, it was indeed a seed_seq that I was looking for.
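For anyone who wants results that do not depend on a particular standard library at all, one option is to spell the mixing out by hand. Below is a minimal sketch, not the approach of any answer above: mix64 is a SplitMix64-style bit mixer, and the names mix64, combineSeeds and the extra bound parameter on getRandom are my own assumptions for illustration. It is not cryptographically secure.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// SplitMix64-style mixer: every step is written out, so the result does not
// depend on which standard library is in use.
inline std::uint64_t mix64(std::uint64_t x) {
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

// Fold all parameters into one 64-bit value. Order matters, so
// (seed, attacker, defender) differs from (seed, defender, attacker).
inline std::uint64_t combineSeeds(const std::vector<int>& parameters) {
    std::uint64_t h = 0;
    for (int p : parameters) {
        // Go through uint32_t so negative ints convert in a well-defined way.
        h = mix64(h ^ static_cast<std::uint32_t>(p));
    }
    return h;
}

// Deterministic value in [0, bound) for a given parameter list.
// (bound is a hypothetical extra parameter; modulo bias is ignored here.)
inline int getRandom(const std::vector<int>& parameters, int bound) {
    assert(bound > 0);
    return static_cast<int>(combineSeeds(parameters) % static_cast<std::uint64_t>(bound));
}
```

A call such as getRandom({generalRandomSeed, id1, id2}, 100) then returns the same number every time those three values repeat.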

How to derive pseudo-random values of a preset probability from a seed?

I have a bunch of classes that I'd like to instantiate using seeds. Such a class would only have one constructor; taking in a single argument, the seed. A very simple pseudo example could be:
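class Person {
public:
    // deriveAgeFromSeed is the function this question is asking how to write.
    explicit Person(std::uint32_t seed) : age(deriveAgeFromSeed(seed)) {}

private:
    int age;
};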
If I instantiate a Person with a random given seed, e.g. 123456789, it should evaluate to a Person with a specific age, e.g 30. The same seed will always generate the same person (same age).
To achieve this specific example, I could use a regular random-number-generator and use my seed as its seed to generate a random number between e.g. 0-100 for age.
However, I may not want it to be uniformly random. Maybe I'd want a 50% chance that the age is in the range of 30-40. I guess I could chain a bunch of "random" number operations with my logic, e.g. generating a number from 0 to 1 which would indicate which age range should be used, and then generating a new number to decide the specific age within that range. But this would be a very ugly chain of hard-coded logic, and very hard to adjust later.
I'd rather have a way to bundle an "option probability set" with the application. For instance, an XML file that would specify the probabilities for all variables. The file could be loaded into memory at launch to avoid reading the same file every time a person is instantiated. Unrealistic example to give an idea of what I mean:
"Person":
    "age":
        0-30: 25%,
        30-40: 50%,
        40-100: 25%
The application would use this information to automagically set an age based on the seed, with these given "probability parameters".
Having such a file would drastically decrease the workload for future adjustments of parameters, and would even let me change parameters without having to rebuild the application. But is it viable?
In addition to this, there may be cases where a second parameter could be dependent on the first. An example could be enum Occupation, where certain occupations are more common for certain ages (e.g. 'fast food employee' being more common at younger ages and CEO more common the older they get).
This type of logic seems rather common for certain types of video games, e.g. strategy games such as "Civilization", where the game seed would be used to create the map, place its resources and the player spawn locations. It appears to also be used for procedural sandbox games such as Minecraft, where there's a certain probability for biomes. The latter is probably more noise-based, but still, noise would only give an output between 0 and 1, and they somehow derive a certain biome probability from it.
(I will code in C++, but language doesn't matter for the question)
So:
1) What is the optimal/best practice procedure to derive many values at preset probabilities from a single seed?
2) Can the probabilities be imported from an external file?
3) Can probabilities be dependent on each other?
It seems you are thinking of a PRNG with state that you initialize and use exactly once to generate the age value. For this purpose a hash would also work: it too produces a pseudo-random number uniformly distributed over its output space. If you need to generate a sequence of random numbers from a single seed, a random number engine is more useful.
XML, JSON and other formats can be read using third-party libraries in C++. However, if all you need to store is age ranges and probabilities, it can be better for portability and dependency management to implement the parser yourself.
If I understand correctly, the age distribution is occupation-dependent. Once you derive the correct distribution for the occupation (either explicitly defined in a settings file, as in point 2, or calculated from a formula), you can get the right age distribution from a single pseudo-random number generated by the hash function.
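As a rough illustration of driving the choice from a probability table, here is a minimal sketch. It assumes the table has already been loaded into memory; the AgeBand struct, the half-open [lo, hi) reading of the ranges, and the two-draw scheme are my own assumptions, and it uses a seeded std::mt19937 rather than a hash for brevity.

```cpp
#include <cassert>
#include <cstdint>
#include <random>
#include <vector>

// One row of a hypothetical probability table: ages in [lo, hi) with a weight.
struct AgeBand {
    int lo;
    int hi;
    double weight;
};

// Pick an age for a seed. The mapping from raw engine output to an age is
// done by hand instead of through a library distribution class.
inline int deriveAgeFromSeed(std::uint32_t seed, const std::vector<AgeBand>& bands) {
    assert(!bands.empty());
    std::mt19937 gen(seed);

    double total = 0.0;
    for (const AgeBand& b : bands) total += b.weight;

    // First draw selects the band: u is uniform in [0, 1).
    const double u = gen() / 4294967296.0;
    const double target = u * total;
    const AgeBand* chosen = &bands.back();  // fallback guards against rounding
    double cumulative = 0.0;
    for (const AgeBand& b : bands) {
        cumulative += b.weight;
        if (target < cumulative) {
            chosen = &b;
            break;
        }
    }

    // Second draw selects the age inside the band (slight modulo bias,
    // acceptable for a sketch).
    assert(chosen->hi > chosen->lo);
    const std::uint32_t width = static_cast<std::uint32_t>(chosen->hi - chosen->lo);
    return chosen->lo + static_cast<int>(gen() % width);
}
```

With the table {{0, 30, 0.25}, {30, 40, 0.50}, {40, 100, 0.25}}, roughly half of all seeds should land in the 30-40 band. A dependent value such as occupation could be handled the same way by first deriving the age and then selecting which table to use for the next draw.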

C++ hash function: how is the original hasher, i.e. hash<int xkey>, implemented?

I am new to hashing in general and also to the STL world. I saw the new std::unordered_set and the SGI hash_set, both of which use the hasher hash. I understand that to get a good load factor you might need to write your own hash function, and I have been able to write one.
However, I am trying to dig deeper into how the original default hash functions are written.
My questions are:
1) How is the original default HashFcn written; more concretely, how is the hash generated?
Is it based on some pseudo-random number? Can anyone point me to a header file (I am a bit lost in the documentation) where I can look up how the hasher hash is implemented?
2) How does it guarantee that you will get the same key each time?
Please let me know if I can make my questions clearer in any way.
In the version of gcc that I happen to have installed here, the required hash functions are in /usr/lib/gcc/i686-pc-cygwin/4.7.3/include/c++/bits/functional_hash.h
The hashers for integer types are defined using the macro _Cxx_hashtable_define_trivial_hash. As you might expect from the name, this just casts the input value to size_t.
This is how gcc does it. If you're using gcc then you should have a similarly-named file somewhere. If you're using a different compiler then the source will be somewhere else. It is not required that every implementation uses a trivial hash for integer types, but I suspect that it is very common.
It's not based on a random number generator, and hopefully it's now pretty obvious to you how this function guarantees to return the same key for the same input every time! The reason for using a trivial hash is that it's as fast as it gets. If it gives a bad distribution for your data (because your values tend to collide modulo the number of buckets) then you can either use a different, slower hash function or a different number of buckets (std::unordered_set doesn't let you specify the exact number of buckets, but it does let you set a minimum). Since library implementers don't know anything about your data, I think they will tend not to introduce slower hash functions as the default.
A hash function must be deterministic -- i.e., the same input must always produce the same result.
Generally speaking, you want the hash function to produce all outputs with about equal probability for arbitrary inputs (but while desirable, this is not mandatory -- and for any given hash function, there will always be an arbitrary number of inputs that produce identical outputs).
Generally speaking, you want the hashing function to be fast, and to depend (to at least some degree) on the entirety of the input.
A fairly frequently seen pattern is: start with some semi-random input. Combine one byte of input with the current value. Do something that will move the bits around (multiplication, rotation, etc.) Repeat for all bytes of the input.
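One well-known instance of that pattern is the 64-bit FNV-1a hash, sketched below. This is only an example of the pattern, not what any particular standard library uses for std::hash; the helper bucketFor is a hypothetical name added to show how a hash value is reduced to a bucket index.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// 64-bit FNV-1a: start from a fixed value, then for each byte
// combine it with the current value and move the bits around.
inline std::uint64_t fnv1a64(const std::string& data) {
    std::uint64_t hash = 0xcbf29ce484222325ULL;  // FNV offset basis (fixed start value)
    for (unsigned char byte : data) {
        hash ^= byte;               // combine one byte of input
        hash *= 0x100000001b3ULL;   // multiply by the FNV prime to spread the bits
    }
    return hash;
}

// Reduce a hash to a bucket index, as a hash table would.
inline std::size_t bucketFor(std::uint64_t hash, std::size_t bucketCount) {
    assert(bucketCount > 0);
    return static_cast<std::size_t>(hash % bucketCount);
}
```

Because nothing in the function depends on time or on a random source, the same input always produces the same value, which is the determinism requirement described above.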

What's the correct way to generate random strings without duplicates

I'm thinking about generating random strings, without making any duplication.
My first thought was to use a binary tree: insert each string and check the tree for duplicates.
But this may not be very effective.
My second thought was to use an MD5-like hash method that creates messages based only on time, but this may introduce another problem: different machines have different clock accuracy.
And on a modern processor, more than one string could be created within a single timestamp.
Is there any better way to do this?
Generate N sequential strings, then do a random shuffle to pull them out in random order. If they need to be unique across separate generators, mix a unique generator ID into the string.
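A minimal sketch of this idea in C++, assuming the strings only need to be unique rather than unguessable. The function name shuffledIds, the "generatorId-counter" layout and the fixed seed are my own choices for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <string>
#include <vector>

// Build count sequential strings, each prefixed with a generator ID,
// then shuffle them so they come out in random order.
inline std::vector<std::string> shuffledIds(std::size_t count,
                                            const std::string& generatorId,
                                            std::uint32_t seed) {
    assert(!generatorId.empty());
    std::vector<std::string> ids;
    ids.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        // Sequential counters can never collide within one generator.
        ids.push_back(generatorId + "-" + std::to_string(i));
    }
    std::mt19937 gen(seed);
    std::shuffle(ids.begin(), ids.end(), gen);
    return ids;
}
```

Uniqueness here comes from the counter, not from the randomness, so no duplicate check is needed at all.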
Beware of MD5, there's no guarantee that two different Strings won't generate the same hash.
As for your problem, it depends on a number of constraints: are the strings short or long? Do they have to be meaningful? Etc... Two solutions from the top of my head:
1 Generate UUIDs then turn them into String with a binary representation or base 64 algorithm.
2 Simply generate random Strings and put them in a searchable structure (HashMap) so that you can find very quickly (O(1)-O(log n)) if a generated String already has a duplicate, in which case it is discarded.
A tree probably won't be the most efficient, especially for insertions - as it will have to constantly re-balance itself (somewhat of an "expensive" operation).
I'd recommend using a HashSet type data structure. The hashing algorithm should already be quite efficient (much more so than something like MD5), and all operations are constant-time. Insert all your Strings into the Set. If you create a new String, check to see if it already exists in the Set.
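In C++ the corresponding container is std::unordered_set. Here is a minimal sketch of the generate-and-check loop; the function name uniqueRandomStrings, the alphabet and the minimum length are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <string>
#include <unordered_set>
#include <vector>

// Generate count distinct random strings, discarding any duplicate candidates.
inline std::vector<std::string> uniqueRandomStrings(std::size_t count,
                                                    std::size_t length,
                                                    std::uint32_t seed) {
    // Keep the space of possible strings far larger than any realistic count,
    // so the retry loop below terminates quickly.
    assert(length >= 8);
    static const std::string alphabet = "abcdefghijklmnopqrstuvwxyz0123456789";
    std::mt19937 gen(seed);
    std::unordered_set<std::string> seen;
    std::vector<std::string> result;
    result.reserve(count);
    while (result.size() < count) {
        std::string candidate(length, ' ');
        for (char& c : candidate) {
            c = alphabet[gen() % alphabet.size()];
        }
        // insert() reports whether the string was new; duplicates are dropped.
        if (seen.insert(candidate).second) {
            result.push_back(candidate);
        }
    }
    return result;
}
```

The membership test and the insertion are a single average constant-time operation, which is the advantage over a tree described above.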
It sounds like you want to generate a uuid? See http://docs.python.org/library/uuid.html
>>> import uuid
>>> uuid.uuid4()
UUID('dafd3cb8-3163-4734-906b-a33671ce52fe')
You should specify in what programming language you're coding. For instance, in Java this will work nicely: UUID.randomUUID().toString(). UUIDs are unique in practice, as stated on Wikipedia:
The intent of UUIDs is to enable distributed systems to uniquely identify information without significant central coordination. In this context the word unique should be taken to mean "practically unique" rather than "guaranteed unique". Since the identifiers have a finite size it is possible for two differing items to share the same identifier. The identifier size and generation process need to be selected so as to make this sufficiently improbable in practice.
A binary tree is probably better than usual here - no rebalancing necessary, because your strings are random, and it's on random data that binary trees work their best. However, it's still O(log(n)) for lookup and addition.
But maybe more efficient, if you know in advance how many random strings you'll need and don't mind a little probability in the mix, is to use a bloom filter.
Bloom filters give an efficient, probabilistic set membership test with memory requirements as low as one bit per element saved in the set. Basically, a bloom filter can say with 100% certainty that a member does not belong to a set, but with a high but not quite 100% certainty that a member is in a set. In your case, throwing out an extra candidate or two shouldn't hurt at all, so the probabilistic nature shouldn't hurt a bit.
Bloom filters are also relatively unique in that they can test for set membership in constant time.
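To make the idea concrete, here is a minimal Bloom filter sketch in C++. It is not taken from the implementation linked in this answer. The class layout, the use of std::hash and the salted second hash are my own assumptions, and the bit count and hash count would need tuning for real data.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

class BloomFilter {
public:
    BloomFilter(std::size_t bitCount, std::size_t hashCount)
        : bits_(bitCount, false), hashCount_(hashCount) {
        assert(bitCount > 0 && hashCount > 0);
    }

    // Set one bit per hash function.
    void add(const std::string& item) {
        for (std::size_t i = 0; i < hashCount_; ++i) {
            bits_[index(item, i)] = true;
        }
    }

    // false: the item was definitely never added.
    // true: the item was probably added (false positives are possible).
    bool possiblyContains(const std::string& item) const {
        for (std::size_t i = 0; i < hashCount_; ++i) {
            if (!bits_[index(item, i)]) return false;
        }
        return true;
    }

private:
    // Derive the i-th probe position by double hashing: h1 + i * h2.
    std::size_t index(const std::string& item, std::size_t i) const {
        const std::size_t h1 = std::hash<std::string>{}(item);
        // Salted second hash (an assumption of this sketch), forced odd.
        const std::size_t h2 = std::hash<std::string>{}(item + '#') | 1;
        return (h1 + i * h2) % bits_.size();
    }

    std::vector<bool> bits_;
    std::size_t hashCount_;
};
```

For the duplicate-string problem, a candidate is kept only when possiblyContains returns false, and is then passed to add; an occasional false positive just means one extra candidate is thrown away.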
For a while, I listed treaps here, but that's silly - they do a lot of operations in O(log(n)) again, and would only be relevant if your data isn't truly random.
If you don't need your strings to be saved in order for some reason (and it sounds like you probably don't), a traditional hash table is a good way to go. They like to know how big your final dataset will be in advance (to avoid slow hash table resizes), but they too are constant time for insertion and lookup.
http://stromberg.dnsalias.org/svn/bloom-filter/trunk/

random_shuffle algorithm - are identical results produced without random generator function?

If a random generator function is not supplied to the random_shuffle algorithm in the standard library, will successive runs of the program produce the same random sequence if supplied with the same data?
For example, if
std::random_shuffle(filenames.begin(), filenames.end());
is performed on the same list of filenames from a directory in successive runs of the program, is the random sequence produced the same as that in the prior run?
If you use the same random generator, with the same seed, and the same starting sequence, the results will be the same. A computer is, after all, deterministic in its behavior (modulo threading issues and a few other odds and ends).
If you do not specify a generator, the default generator is implementation defined. Most implementations, I think, use std::rand() (which can cause problems, particularly when the number of elements in the sequence is larger than RAND_MAX). I would recommend getting a generator with known quality, and using it.
If you don't correctly seed the generator which is being used (another reason to not use the default, since how you seed it will depend on the implementation), then you'll get what you get. In the case of std::rand(), the default always uses the same seed. How you seed depends on the generator used. What you use to seed it should vary from one run to the other; for many applications, time(NULL) is sufficient; on a Unix platform, I'd recommend reading however many bytes it takes from /dev/random. Otherwise, hashing other information (IP address of the machine, process id, etc.) can also improve things: it means that two users starting the program at exactly the same second will still get different sequences. (But this is really only relevant if you're working in a networked environment.)
25.2.11 just says that the elements are shuffled with uniform distribution. It makes no guarantees as to which RNG is used behind the scenes (unless you pass one in) so you can't rely on any such behavior.
In order to guarantee the same shuffle outcome you'll need to provide your own RNG that provides those guarantees, but I suspect even then if you update your standard library the random_shuffle algorithm itself could change effects.
You may get an identical result on every run of the program. If this is a problem, you can pass a custom random number generator (which can be seeded from an external source) as an additional, third argument to std::random_shuffle. Some people recommend calling srand(unsigned(time(NULL))); before random_shuffle, but the results are often implementation defined (and unreliable).
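Note that std::random_shuffle was deprecated in C++14 and removed in C++17; its replacement, std::shuffle, takes a random engine as its third argument. If the goal is the same order for the same seed regardless of how the library implements the shuffle, one option is to write the Fisher-Yates loop by hand. A minimal sketch follows; the function name portableShuffle is hypothetical and the modulo step has a slight bias.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <string>
#include <utility>
#include <vector>

// Hand-written Fisher-Yates shuffle driven by a seeded std::mt19937.
// Only the engine's raw output is used, so the library's own shuffle
// algorithm does not influence the resulting order.
inline void portableShuffle(std::vector<std::string>& items, std::uint32_t seed) {
    // A single 32-bit draw cannot address more elements than this.
    assert(items.size() <= 0xFFFFFFFFu);
    std::mt19937 gen(seed);
    for (std::size_t i = items.size(); i > 1; --i) {
        const std::size_t j = gen() % i;  // index in [0, i)
        std::swap(items[i - 1], items[j]);
    }
}
```

Calling portableShuffle(filenames, 42) on the same list gives the same order on every run; to get a different order each run, pass a seed that varies, for example one read from std::random_device.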