Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 6 years ago.
I have a problem where I have to generate a unique number throughout the system. Say an application 'X' generates a value 'A' from some inputs, and this value 'A' is then used by some other application as input to generate some other value 'B'.
Both values 'A' and 'B' will later be saved in KDB. The purpose of doing this is to identify which value of 'A' triggered the generation of value 'B'. Values of 'A' are generated at very high speed, so I am looking for an algorithm that is fast and doesn't hamper the performance of application 'X'.
What you want is a UUID. See https://en.m.wikipedia.org/wiki/Universally_unique_identifier. They are generally based on things like MAC addresses, timestamps, hashes, and randomness, and their intent is to be globally unique. Depending on the platform, there are often built-in functions for generating them. I can expand on this more when I'm not on my phone if necessary, but start there.
You have likely run into them from time to time, https://www.uuidgenerator.net can give you some examples.
That said, if you're inserting them in a database, another strategy to investigate is using the database's auto-assigned primary key IDs. This is not always possible, since you have to store a row first to get an ID assigned, but philosophically it sounds correct for your application.
You could also roll your own, although there are many caveats: e.g. the timestamp of application startup concatenated with some internal counter. Just be aware of the collision risks, however unlikely, e.g. two applications starting at the same time, or an incorrect system clock. I wouldn't consider this approach for serious usage given the presence of other, more reliable strategies.
No matter what you use, I do recommend ultimately also using it as the primary key in your database. It will simplify things for you overall; having two unique IDs in a database (e.g. a UUID plus an auto-generated primary key) denormalizes your database a bit (https://en.m.wikipedia.org/wiki/Database_normalization).
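If your platform has no built-in generator, a random (version 4) UUID is easy to sketch. Here is a minimal, hypothetical C++ example using `<random>`; the function name is my own, and for serious use a vetted library (e.g. Boost.Uuid or libuuid) is preferable, since it handles seeding and uniqueness guarantees more carefully:

```cpp
#include <cstdint>
#include <cstdio>
#include <random>
#include <string>

// Minimal sketch of a version-4 (random) UUID generator.
// Not production code: prefer a vetted library such as Boost.Uuid or libuuid.
std::string make_uuid4() {
    static thread_local std::mt19937_64 rng{std::random_device{}()};
    std::uniform_int_distribution<std::uint64_t> dist;
    std::uint64_t hi = dist(rng), lo = dist(rng);

    // Set the version (4) and variant (10xx) bits per RFC 4122.
    hi = (hi & 0xFFFFFFFFFFFF0FFFULL) | 0x0000000000004000ULL;
    lo = (lo & 0x3FFFFFFFFFFFFFFFULL) | 0x8000000000000000ULL;

    char buf[37];
    std::snprintf(buf, sizeof buf, "%08llx-%04llx-%04llx-%04llx-%012llx",
                  (unsigned long long)(hi >> 32),
                  (unsigned long long)((hi >> 16) & 0xFFFF),
                  (unsigned long long)(hi & 0xFFFF),
                  (unsigned long long)(lo >> 48),
                  (unsigned long long)(lo & 0xFFFFFFFFFFFFULL));
    return std::string(buf);
}
```

With 122 random bits per ID, collisions are astronomically unlikely, which is what makes a UUID usable across independent applications without coordination.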
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 11 days ago.
I've got a service that somewhere in its internals does a validation on whether something is "allowed" or "not allowed" (to keep it simple), which is based on a regular expression match. In pseudo-code:
var pattern = regexp.MustCompile(`...`) // actual pattern elided

func isAllowed(s string) bool {
    return pattern.MatchString(s)
}
Now, I know that regex is slow, and even though Go has a slightly dumbed-down flavor of regex to meet its performance guarantees, it's still not going to be as fast as an exact string comparison. And I also know that my function is going to be called quite often with repeated values. So, I thought of making a cache:
var cache = make(map[string]bool)

func isAllowed(s string) bool {
    if result, found := cache[s]; found {
        return result
    }
    allowed := pattern.MatchString(s) // note: a plain map is not safe for concurrent use
    cache[s] = allowed
    return allowed
}
So now I can avoid the regex operation if the string is already in my cache. But...there are potentially going to be a lot, like thousands or 10,000s of values in this cache. So just to look up values in the cache I might have to do 10,000 string comparisons, rather than a single regex operation.
So, I guess my question is: how much faster is a string comparison than a Go regex match? Is caching going to help or hurt my efficiency?
This technique is called memoization.
A [hash]map lookup is O(1) [constant] time. The regular expressions in Go's regexp package are guaranteed to run in O(N) (linear) time, where N is the length of the input (see https://pkg.go.dev/regexp#pkg-overview, and https://swtch.com/~rsc/regexp/regexp1.html for details).
So that means you are trading space for time: TANSTAAFL.
As to how much faster a map lookup might be than the regular expression, the only way to find out is to run benchmarks on input representative of your actual workload.
Some questions you might want to consider:
Is the time spent in this authorization function actually significant from a performance perspective?
How often will you get a cache hit versus a cache miss?
If this is a long-running service/daemon, is the cache going to grow without limit and ultimately crash your service/daemon?
Might you want to use a more sophisticated cache where cache entries will expire or get evicted to keep growth within limits?
And finally,
If you're having to parse bits out of a string for authorization purposes, perhaps a better performance improvement might be to rethink your approach and maintain your authorization rules/flags as some sort of datatype (a structure or bit map) with associated functions for performing authorization tests.
For the record, I ran benchmarks and found that caching improved my performance by two orders of magnitude. So, I think I will pay the price in memory, and pocket the performance gain. :-)
Closed. This question is opinion-based. It is not currently accepting answers.
Closed 8 months ago.
I'm trying to quantify the randomness of a shuffled list. Given a list of distinct integers, I want to shuffle it with different random generators or methods and evaluate the quality of the resulting shuffle.
For now, I tried a kind of dice experiment. Given a list of size input_size, I select some bucket in the list to be the "observed" one, and then I shuffle the initial list num_runs * input_size times (always starting from a fresh copy). I then look at the frequencies of the elements that fell into the observed bucket and report the result on a plot. You can see the results below for three different methods (line plots of the frequencies; I tried histograms but they looked bad).
The dice experiment over three methods
Reporting plots alone is not formal enough; I would like to report some numbers. What are the best ways to do this (or the ones used in academic publications)?
Thanks in advance.
Quantifying randomness isn't trivial.
You generate a huge amount of random numbers and then test whether they have the properties exhibited by true random numbers. There are tons of properties you can test for; e.g., each bit should occur with 50% probability.
There are randomness test suites that combine a bunch of these tests and try to find statistical flaws in the pseudo-random numbers.
PractRand is to my knowledge currently the most sophisticated randomness test suite.
I'd suggest you write a program that uses your method to repeatedly shuffle an array of e.g. [0..255] and write the raw bytes to stdout (so the output is a pseudo-random bit stream). Then you can pipe that into PractRand, and it will quit once it finds statistical flaws: ./a.out | ./PractRand stdin.
TestU01's "Big Crush" is also a pretty good test suite, but it takes a very long time to run, and in my experience PractRand finds more statistical flaws.
I suggest not using the Diehard or the newer Dieharder test suites, because they aren't as powerful and produce false positives, even when fed CSPRNGs or true random number generators.
Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 9 months ago.
I have a list of unsigned shorts that act as local IDs for a database. I was wondering what the most memory-efficient way to store the set of allowed IDs is. For the lifetime of my project, the allowed-ID set will be dynamic, so it may have more or fewer allowed IDs as time goes on, ranging from none allowed to all allowed.
What would be the best method to store these? I've considered the following:
List of allowed IDs
Bool vector/array of true/false for allowed IDs
Byte array that can be iterated through, similar to the previous option
Let me know which of these would be best, or if another, better method exists.
Thanks
EDIT: If possible, can a vector have a value put at, say, index 1234 without storing all 1234 previous values, or would this suit a map or a similar type more?
I'm looking at using an Arduino with 2 KB of total RAM, using external storage to help manage a large block of data, but I'm exploring what my options are.
"Best" is opinion-based, unless you are aiming for memory efficiency at the expense of all other considerations. Is that really what you want?
First of all, I hope we're talking <vector> here, not <list> -- because a std::list< short > would be quite wasteful already.
What is the possible value range of those IDs? Do they use the full range of 0..USHRT_MAX, or is there e.g. a high bit you could use to indicate the allowed ones?
If that doesn't work, or you are willing to sacrifice a bit of space (no pun intended) for a somewhat cleaner implementation, go for a vector partitioned into allowed ones first, disallowed second. To check whether a given ID is allowed, find it in the vector and compare its position against the cut-off iterator (which you got from the partitioning). That would be the most memory-efficient standard container solution, and quite close to a memory-optimum solution either way. You would need to re-shuffle and update the cut-off iterator whenever the "allowedness" of an entry changes, though.
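The partitioned-vector idea above might look roughly like this (a sketch; the struct and member names are my own, and a cut-off index stands in for the cut-off iterator):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Sketch of a partitioned ID vector: allowed IDs occupy the front,
// disallowed ones the back, separated by a cut-off index.
struct IdSet {
    std::vector<std::uint16_t> ids;
    std::size_t allowed_count = 0; // ids[0..allowed_count) are allowed

    bool is_allowed(std::uint16_t id) const {
        auto end = ids.begin() + allowed_count;
        return std::find(ids.begin(), end, id) != end;
    }

    // Flip an ID's "allowedness" by swapping it across the boundary.
    void set_allowed(std::uint16_t id, bool allowed) {
        auto it = std::find(ids.begin(), ids.end(), id);
        if (it == ids.end()) return;
        std::size_t pos = static_cast<std::size_t>(it - ids.begin());
        if (allowed && pos >= allowed_count)
            std::swap(ids[pos], ids[allowed_count++]);
        else if (!allowed && pos < allowed_count)
            std::swap(ids[pos], ids[--allowed_count]);
    }
};
```

The memory cost is exactly one uint16_t per stored ID plus one index, and each "allowedness" change is a single swap, which matches the re-shuffling described above.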
One suitable data structure for your problem is a trie (prefix tree) that holds your allowed or disallowed IDs.
You can treat the ID's binary representation as the string. A trie is a memory-compact way to store the IDs, and lookup time is bounded by the longest ID length (which in your case is a constant 16 bits).
I'm not familiar with a C++ standard-library implementation, but if efficiency is crucial you can find an implementation or implement it yourself.
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 6 years ago.
I'm very new to checksums, and fairly new to programming. I have a fairly simple C++ program (measuring psi) that I'm transferring to an Arduino board. Would CRC-16 be OK, should I go with CRC-32, or would that be overkill?
A possible way to check that an executable has been correctly transmitted to your Arduino board could be to use a simple checksum like MD5, or something even simpler, like a crude hash function computing a 16-bit hash. See e.g. this answer for inspiration.
Checksums come up in the context of unreliable communication channels. A communication channel is an abstraction; bits go in one end and come out on the other end. An unreliable channel simply means that the bits which come out aren't the same as those which went in.
Now obviously the most extreme unreliable channel has random bits come out. That's unusable, so we focus on the channels where the input and output are correlated.
Still, we have many different corruption models. One common model is that each bit is P% likely to fail. Another common model considers that bit errors typically come in bursts: each bit then has a P% chance to start a burst of errors of length N, in which each bit is 50% likely to be wrong. Quite a few more models exist, depending on your problem; more advanced models also consider the chance of bits going completely missing.
The correct checksum has a very, very high likelihood of detecting the type of error predicted by the model, but might not work well for other types of errors.
For instance, I think with the Internet IP layer, the most common error is an entire IP packet going missing. That's why TCP uses sequence numbers to detect this particular error.
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 7 years ago.
Fact: my code is in C++.
Question:
I have a list of unsigned long long values, which I use as a representation for some objects; the position in the list is the ID of the object.
Each unsigned long long is a bitmask that marks whether an object has component j. So, say I have
OBJ = (1 << 1) | (1 << 3)
that means the object has components 1 and 3.
I want to be fast when I insert, erase, and retrieve objects from the list.
Retrieval is usually performed by components, so I will look for objects with the same components.
I was using std::vector, but as soon as I started thinking about performance it seemed not to be the best choice, since each time I delete an object from the list it will relocate all the objects (erasure can be really frequent); plus, as soon as the underlying array is full, the vector will create a new, bigger array and copy all the elements into it.
I was thinking of an "efficient array", which is a simple array, but each time an element is removed I will "mark" the position as free and store that position in a list of available positions; any time I have to add a new element, I will store it in the first available position.
This way all the elements are stored in a contiguous area, as with vector, but I avoid the "erase" problem.
However, I'm not avoiding the "resize" problem (maybe I can't), and objects with the same components will not necessarily be closer together.
Are there other ideas/structures which I can use in order to get better performance?
Am I wrong to say that I want "similar" objects to be closer?
Thanks!
EDIT
Sorry, maybe the title and the question were not written well. I know vector is efficient and I don't want to write a better vector. Since I'm learning, I would like to understand whether vector IN THIS CASE is good or bad and why; whether what I was thinking is bad and why; and whether there are better solutions and data structures (a tree? a map?), and if so, why. I also asked whether it is convenient to keep "similar" objects closer, and whether that MAYBE can influence things like branch prediction or something else (no answer about that), or whether it is just nonsense. I just want to learn; even a "wrong" answer can be useful for me and for others to learn something. But it seems it was a bad idea, as if I had asked *"my compiler works even if I write char ** which is wrong"* and I didn't understand why.
I recommend using either std::set or std::map. You want to know if an item exists in a container and both std::set and std::map have good performance for searches and "lookups".
You could use std::bitset and assign each object an ID that is a power of 2.
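As an illustration of the bitset suggestion (a sketch; the type alias, function name, and 64-bit width are my own, chosen to match the unsigned long long in the question), component masks can be stored and queried like this:

```cpp
#include <bitset>
#include <cstddef>
#include <vector>

// Sketch: each object's components as a std::bitset, plus a query that
// returns the IDs (vector positions) of objects holding every wanted bit.
using Mask = std::bitset<64>;

std::vector<std::size_t> with_components(const std::vector<Mask> &objs,
                                         Mask wanted) {
    std::vector<std::size_t> ids;
    for (std::size_t i = 0; i < objs.size(); ++i)
        if ((objs[i] & wanted) == wanted) // object has all wanted components
            ids.push_back(i);
    return ids;
}
```

The scan is linear, but because the masks sit contiguously in the vector it is exactly the cache-friendly access pattern discussed in the next answer; profile before reaching for anything fancier.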
In any case, you need to profile. Profile your code without changes. Profile your code using different data structures. Choose the data structure with the best performance.
Some timing for different structures can be read here.
The problem with lists is that you're always chasing the links, and each link is potentially a cache miss (and maybe a TLB miss on top).
The vector, on the other hand, will suffer few cache misses, and the hardware prefetcher works optimally for this data structure.
If the data were much larger, the results would not be so clear-cut.