What can be the best algorithm to generate a unique id in C++?
The ID should be a 32-bit unsigned integer.
Getting a unique 32-bit ID is intuitively simple: the next one. Works 4 billion times. Unique for 136 years if you need one a second. The devil is in the detail: what was the previous one? You need a reliable way to persist the last used value and an atomic way to update it.
How hard that will be depends on the scope of the ID. If it is one thread in one process then you only need a file. If it is multiple threads in one process then you need a file and a mutex. If it is multiple processes on one machine then you need a file and a named mutex. If it is multiple processes on multiple machines then you need to assign an authoritative ID provider, a single server that all machines talk to. A database engine is a common provider like that; they have this built in as a feature, an auto-increment column.
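For the "multiple threads in one process" case, the file-plus-mutex idea might look roughly like the sketch below. PersistentIdGenerator and the file layout (one decimal number in a text file) are made up for illustration. It does not flush to disk safely, does not handle wrap-around, and would need a named mutex or a file lock before a second process could share the file.

```cpp
#include <cstdint>
#include <fstream>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical helper: hands out increasing 32-bit IDs within one process
// and remembers the last one in a small text file.
class PersistentIdGenerator {
public:
    explicit PersistentIdGenerator(std::string path) : path_(std::move(path)) {
        std::ifstream in(path_);
        if (!(in >> last_)) {
            last_ = 0;  // no file yet, or nothing readable in it: start over
        }
    }

    // Returns the next ID. The mutex makes "increment and save" a single
    // step with respect to the other threads of this process.
    std::uint32_t next() {
        std::lock_guard<std::mutex> lock(mutex_);
        ++last_;
        std::ofstream out(path_, std::ios::trunc);
        out << last_ << '\n';
        return last_;
    }

private:
    std::string path_;
    std::mutex mutex_;
    std::uint32_t last_ = 0;
};
```

A second generator opened on the same file later (for example after a restart) continues from the saved value instead of starting at 1 again.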
The expense of getting the ID goes up progressively as the scope widens. When it becomes impractical, because the scope is the Internet or the provider is too slow or unavailable, you need to give up on a 32-bit value and switch to a random value. One that's random enough that the machine being struck by a meteor is at least a million times more likely than repeating the same ID. A goo-ID. It is only 4 times as large.
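As a rough illustration of the random approach, here is a sketch that fills 128 bits from std::random_device. RandomId and makeRandomId are names invented for this example, and the result is not a standards-conforming GUID/UUID (no version or variant bits). It also assumes std::random_device is backed by a good entropy source on your platform, which the C++ standard does not guarantee.

```cpp
#include <array>
#include <cstdint>
#include <random>

// 128 bits stored as four 32-bit words: four times the size of a 32-bit ID.
using RandomId = std::array<std::uint32_t, 4>;

// Assumption: std::random_device delivers real entropy on the target
// platform. The standard allows a deterministic fallback.
RandomId makeRandomId() {
    std::random_device device;
    RandomId id{};
    for (auto& word : id) {
        word = static_cast<std::uint32_t>(device());
    }
    return id;
}
```

In practice a platform UUID facility or a UUID library is usually the safer choice; the sketch only shows where the extra bits come from.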
Here's the simplest ID I can think of.
MyObject obj;
uint32_t id = static_cast<uint32_t>(reinterpret_cast<uintptr_t>(&obj)); // needs <cstdint>; keeps only the low 32 bits of the address
At any given time, this ID will be unique across the application. No other object will be located at the same address. Of course, if you restart the application, the object may be assigned a new ID. And once the object's lifetime ends, another object may be assigned the same ID.
And objects in different memory spaces (say, on different computers) may be assigned identical IDs.
And last but not least, if the pointer size is larger than 32 bits, the mapping will not be unique.
But since we know nothing about what kind of ID you want, and how unique it should be, this seems as good an answer as any.
You can see this. (Complete answer, I think, is on Stack Overflow.)
There are some notes on unique IDs in C++ on Linux on this site. You can also use uuid on Linux; see this man page and this sample.
If you use Windows and need the Windows APIs, see this MSDN page.
This Wikipedia page is also useful: http://en.wikipedia.org/wiki/Universally_Unique_Identifier.
DWORD uid = ::GetTickCount();
::Sleep(100);
If you can afford to use Boost, then there is a UUID library that should do the trick. It's very straightforward to use - check the documentation and this answer.
There is little context, but if you are looking for a unique ID for objects within your application you can always use a singleton approach similar to
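class IDGenerator {
public:
static IDGenerator * instance ();
uint32_t next () { return _id++; }
private:
IDGenerator () : _id(0) {}
static IDGenerator * only_copy;
uint32_t _id;
};
// The static member needs exactly one definition, in a single .cpp file.
IDGenerator * IDGenerator::only_copy = nullptr;
IDGenerator *
IDGenerator::instance () {
// Lazily creates the generator on first use (not thread-safe as written).
if (!only_copy) {
only_copy = new IDGenerator();
}
return only_copy;
}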
And now you can get a unique ID at any time by doing:
IDGenerator::instance()->next ()
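If several threads may ask for IDs, one possible variation is to drop the hand-written singleton and use an atomic counter instead. This is only a sketch; nextId is a name made up here, and like the class above it simply wraps around after 2^32 calls.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical free function replacing the singleton. The counter is a
// function-local static, so it is initialized exactly once (thread-safe
// since C++11), and fetch_add makes each increment atomic.
inline std::uint32_t nextId() {
    static std::atomic<std::uint32_t> counter{0};
    return counter.fetch_add(1, std::memory_order_relaxed);
}
```

Relaxed ordering is enough here because the only requirement is that no two callers receive the same value, not that the IDs order other memory operations.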
I need a unique identifier to distinguish entities, but in reality there are not many of these entities, and a uid can be reused once its entity is destroyed. Entities are created in a distributed system, and several can be created at the same time.
I am currently using a popular UUID library, but a UUID is a 128-bit number. According to the design of my system, an int type is more than enough. If uids can be recycled, 8 bytes should be OK. So I think there is a lot of room for optimization.
For example:
bool isEqual(const char *uid1, const char *uid2) {
return strcmp(uid1, uid2) == 0;
}
If I can make uid an integer instead of a string, then I don't need to use the string comparison function.
bool isEqual(int uid1, int uid2) {
return uid1 == uid2;
}
But I don't know whether there are mature libraries that meet my needs.
So I want to ask you:
How feasible is it to implement this myself?
What difficulties will I encounter?
What should I pay attention to?
Is there a library that already implements similar functions?
Worth it?
BTW, I can use C/C++/lua.
If you want a custom dedicated uid generation on a fully controlled distributed system, you have 3 possibilities:
A central system generates simply serial values and the other systems ask it for each new uid. Simple and fully deterministic, but the generator is a Single Point Of Failure
Each (logical) system receives an id and combines it with a local serial number. For example, if the number of systems fits in 16 bits (at most 65,536), you could use 16 bits for the system id and 48 bits for the serial. Fully deterministic, but requires an admin to give each system its id
Random. High quality random number generators that comply with crypto requirements should give you pseudo uid with a low probability of collision. But it is only probabilistic so a collision is still possible.
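The second option (system id plus local serial) can be sketched as plain bit packing into a 64-bit value. The 16/48 split follows the example above; makeUid, systemOf and serialOf are names made up for this illustration, and how each system obtains its id and persists its serial is left out.

```cpp
#include <cstdint>

constexpr unsigned kSerialBits = 48;
constexpr std::uint64_t kSerialMask = (std::uint64_t{1} << kSerialBits) - 1;

// Upper 16 bits: admin-assigned system id. Lower 48 bits: local serial.
constexpr std::uint64_t makeUid(std::uint16_t systemId, std::uint64_t serial) {
    return (static_cast<std::uint64_t>(systemId) << kSerialBits) |
           (serial & kSerialMask);
}

constexpr std::uint16_t systemOf(std::uint64_t uid) {
    return static_cast<std::uint16_t>(uid >> kSerialBits);
}

constexpr std::uint64_t serialOf(std::uint64_t uid) {
    return uid & kSerialMask;
}
```

Two systems can hand out the same serial without clashing, because the system id occupies separate bits of the result.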
Points to pay attention to:
race conditions. If more than one process can be client for a generator, you must ensure that the uid generation is correctly synchronized
uid recycling. If the whole system must be designed to live long enough to exhaust a serial generator, you will have to keep somewhere the list of still existing entities and their uid
for the probabilistic solution, the risk of collision grows with the maximum number of simultaneous entities (by the birthday bound, roughly with its square). You should carefully evaluate that probability and decide whether the risk can be accepted.
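To put a number on the probabilistic option, the usual birthday approximation can be coded in a few lines. This is a sketch that assumes the IDs are drawn uniformly and independently; collisionProbability is a name invented for this example.

```cpp
#include <cmath>

// Approximate probability that at least two of `count` IDs collide when each
// is drawn uniformly from 2^bits values: 1 - exp(-n(n-1) / (2 * 2^bits)).
double collisionProbability(double count, int bits) {
    const double space = std::ldexp(1.0, bits);  // 2^bits
    // expm1 keeps precision when the probability is tiny.
    return -std::expm1(-count * (count - 1.0) / (2.0 * space));
}
```

With 32-bit random IDs, about 77,000 simultaneous entities already give roughly an even chance of at least one collision, which is why the random approach is normally paired with wider IDs.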
Are such solutions already implemented?
Yes, in database systems that allow automatic id generation.
Worth it?
Only you can say...
There is a tiny, secure, unique string ID generator for Python, nanoid, which lets you reduce the ID length (at the cost of a higher collision probability) by passing the length as an argument. To use it in a Python environment:
pip install nanoid
from nanoid import generate
generate(size=10) => u'p1yS9T21Bf'
To see how the IDs are generated and their collision probability for a given length, visit https://zelark.github.io/nano-id-cc/
Refer: https://pypi.org/project/nanoid/
I have a question regarding the performance effects of two possible methods of 'getting' data from a given struct. Assume that the value of the 'name' variable is determined by the value of 'id'.
Assuming I have a struct and enum as follows,
enum GenericId { NONE, ONE, TWO };
struct GenericTypeDefinition {
GenericId id;
const char name[8];
...
};
Let's say I wanted to get the name of this struct. Quite easy, I could just refer to the instance of the GenericTypeDefinition struct and refer (or point) to the name member. Simple enough.
Now here is where my performance question becomes relevant. Say I need to create hundreds of these instances, all of which will be limited to a fixed set of names with a unique id each. These instances will be referred to as a collection of possible 'GenericTypeDefinition's throughout the program. Keep in mind, the value of 'name' is determined by the value of 'id'. My question is, would I be able to save some memory if I implemented a function like the following (and removed the name variable from the struct)?
struct GenericTypeDefinition { // 'name' is now removed.
GenericId id;
...
};
const char* Definition_ToString(GenericId e) {
switch (e) {
case NONE: return "Nothing";
case ONE: return "One";
...
}
}
I assume it would because I am freeing up the need to store the string (8 bytes in length) in each struct that I create.
If you would like any clarification please ask, as I have not been able to find much on this.
If I understand what you're asking, you are putting redundant data into your struct. Essentially, you are able to get the name of the struct from the id in the struct. But, you could also store the name directly in the struct.
So, you are right -- not storing the name will save memory, because you won't store the name with every item. The cost is a bit of time. You will need to call a function to give you the name from the id each time you need it. You will have to weigh these tradeoffs to determine which is more important.
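A small sketch of that tradeoff, using cut-down versions of the struct from the question: WithName, WithoutName and nameOf are names made up here, the elided members are left out, and the strings in the table are placeholders.

```cpp
enum GenericId { NONE, ONE, TWO };

// Variant 1: every instance carries its own copy of the name.
struct WithName {
    GenericId id;
    char name[8];
};

// Variant 2: only the id is stored; the name is looked up on demand.
struct WithoutName {
    GenericId id;
};

// One shared table of string literals, indexed by the enum value.
// Assumes the enumerators stay contiguous and start at 0.
inline const char* nameOf(GenericId id) {
    static const char* const kNames[] = {"None", "One", "Two"};
    return kNames[id];
}
```

Each WithoutName is smaller by at least the 8 bytes of the array, while the names exist only once in the program; the price is the extra call to nameOf whenever a name is needed.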
The devil is in the details. The answer depends on many things: for example, how often such a structure is allocated, how often it is used, and how often char name[8]; is used.
If you remove name from the structure, several scenarios are possible:
if you have many objects of this type and a good allocator, you will save space.
if you use those objects extensively during some computation and you use name only from time to time, you will save time thanks to better cache performance.
if you use name extensively for some computation and your function Definition_ToString is just a little bit more complex than the one in your example, you will lose performance.
However, in my estimation, optimizations like this can speed up a program only by a small factor. They may help in cases where you count in microseconds. If your program is desperately slow, look for an asymptotically better algorithm.
In most cases the compiler will do this job for you. It usually stores all the const string literals in the read-only (RO) section of the executable. Depending on the optimization level, it may even do away with the memory taken up by the char array in the struct. So your executable size will grow, but it won't affect the run-time memory.
However, since the name is tied to the ID, it logically makes sense to implement the second version, so that in future, if you want to add a new id, you don't need to do any redundant work.
In your first case, initializing the structs with the proper ID and NAME means that the program will, at the very beginning, copy the literals, that is, the strings (I assume you initialize the structs with strings written in the code), into another region of RAM: the char[] array inside each struct.
In the second case, the value is instead read from the program itself (the literals are hard-coded in a table somewhere in the compiled code), and the function returns a pointer to it (correct me if the returned const char* does not point into the program image but is stored as a variable), so you do save some memory.
My personal comment (which you may consider beyond the question's scope) is that even though the second alternative may save you some memory, it implies that the IDs and NAMEs are hard-coded, which rules out any expansion at runtime (e.g. adding more IDs that are received via a console...).
I am writing a database and I wish to assign every item of a specific type a unique ID (for internal data management purposes). However, the database is expected to run for a long (theoretically infinite) time and with a high turnover of entries (as in with entries being deleted and added on a regular basis).
If we model our unique ID as an unsigned int, and assume that there will always be fewer than 2^32 - 1 entries in the database (we cannot use 0 as a unique ID), we could do something like the following:
void GenerateUniqueID( Object* pObj )
{
static unsigned int iCurrUID = 1;
pObj->SetUniqueID( iCurrUID++ );
}
This is fine until entries start getting removed and others added in their place: there may still be fewer than 2^32 - 1 entries, but iCurrUID may overflow and we end up assigning "unique" IDs that are already in use.
One idea I had was to use a std::bitset<std::numeric_limits<unsigned int>::max() - 1> and traverse it to find the first free unique ID, but this would consume a lot of memory (about 512 MB, one bit per possible ID) and take linear time to find a free ID, so I'm looking for a better method, if one exists.
Thanks in advance!
I'm aware that changing the datatype to a 64-bit integer, instead of a 32-bit integer would resolve my problem; however, because I am working in the Win32 environment, and working with lists (with DWORD_PTR being 32-bits), I am looking for an alternative solution. Moreover, the data is sent over a network and I was trying to reduce bandwidth consumption by using a smaller size unique ID.
With a uint64_t (64-bit), it would take you well, well over 100 years, even if you insert somewhere close to 100k entries per second.
Over 100 years, you would insert somewhere around 315,360,000,000,000 records (not taking into account leap years, leap seconds, etc.). This number fits into 49 bits.
How long do you anticipate that application to run?
Over 100 years?
This is the common thing database administrators do when they have an autoincrement field that approaches the 32-bit limit. They change the column to the native 64-bit (or 128-bit) type for their DB system.
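The arithmetic behind those figures is easy to check in code. A minimal sketch, with helper names (recordsOver, bitsNeeded) invented for this example and the same simplification of 365-day years:

```cpp
#include <cstdint>

// Records created at a constant rate over a number of 365-day years.
constexpr std::uint64_t recordsOver(std::uint64_t perSecond, std::uint64_t years) {
    return perSecond * 60 * 60 * 24 * 365 * years;
}

// Number of bits needed to represent `value` (0 for value == 0).
inline int bitsNeeded(std::uint64_t value) {
    int bits = 0;
    while (value != 0) {
        ++bits;
        value >>= 1;
    }
    return bits;
}
```

recordsOver(100000, 100) gives 315,360,000,000,000 and bitsNeeded of that value is 49, which leaves 15 bits of headroom in a 64-bit counter.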
The real question is how many entries you can have before you are guaranteed that the first one has been deleted, and how often you create new entries. An unsigned long long is guaranteed to have a maximum value of at least 2^64 - 1, about 1.8x10^19. Even at one creation per microsecond, this will last for several thousand centuries (roughly 585,000 years). Realistically, you're not going to be able to create entries that fast (since disk speed won't allow it), and your program isn't going to run for hundreds of centuries (because the hardware won't last that long). If the unique ids are for something disk based, you're safe using unsigned long long for the id.
Otherwise, of course, generate as many bits as you think you might need. If you're really paranoid, it's trivial to use a 256-bit unsigned integer, or even longer. At some point, you'll be fine even if every atom in the universe creates a new entry every picosecond, until the end of the universe. (But realistically... unsigned long long should suffice.)
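If the ID really has to stay at 32 bits, as the question's edit says, one alternative to scanning a bitset is to remember released IDs and hand those out first. The class below is only a sketch of that idea, not something proposed in the answers above; RecyclingIdAllocator is a made-up name, it is not thread-safe, and it does not check for IDs released twice.

```cpp
#include <cstdint>
#include <vector>

class RecyclingIdAllocator {
public:
    // Returns a free ID, preferring previously released ones.
    // Returns 0 (never a valid ID) when all 2^32 - 1 IDs are in use.
    std::uint32_t acquire() {
        if (!free_.empty()) {
            const std::uint32_t id = free_.back();
            free_.pop_back();
            return id;
        }
        if (next_ == 0) {
            return 0;  // the counter wrapped: nothing fresh left
        }
        return next_++;
    }

    // Makes `id` available again once its entry has been deleted.
    void release(std::uint32_t id) { free_.push_back(id); }

private:
    std::vector<std::uint32_t> free_;  // IDs waiting to be reused
    std::uint32_t next_ = 1;           // next never-used ID; 0 is reserved
};
```

Both operations are constant time and the memory cost is proportional to the number of currently released IDs. The catch is that an ID may be reused right after its entry is deleted, so anything that still holds the old ID (for example a remote client) could mistake the new entry for the old one.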
For simple objects, it's usually easy to have a "state" attribute that's a string and storeable in a database. For example, imagine a User class. It may be in the states of inactive, unverified, and active. This could be tracked with two boolean values – "active" and "verified" – but it could also use a simple state machine to transition from inactive to unverified to active while storing the current state in that "state" attribute. Very common, right?
However, now imagine a class that has several more boolean attributes and, more importantly, could have lots of combinations of those. For example, a Thing that may be broken, missing, deactivated, outdated, etc. Now, tracking state in a single "state" attribute becomes more difficult. This, I guess, is a Nondeterministic Finite Automaton or State Machine. I don't really want to store states like "inactive_broken" and "active_missing_outdated", etc.
The best I've come up with is to have both the "state" attribute and store some sort of superstate – "available" vs "unavailable", in this case – and each of the booleans. That way I could have a guard-like method when transitioning.
Has anyone else run into this problem and come up with a good solution to tracking states?
Have you considered serializing the "state" to a bit mask and storing it in an integer column in a database? Let's say an entity can be active or inactive, available or unavailable, or working or broken in any combination.
You could store each state as a bit; either on or off. This way a value of 111 would be active, available, and working, while a value of 000 would be inactive, unavailable, and broken.
You could then query for specific combinations using the appropriate bit mask or deserialize the entity to a class with boolean values for each state you are wanting to track. It would also be relatively cheap to add states to an object and would not break already serialized objects.
Same as the answer above but more practical than theory:
Identify the possible number of Boolean attributes. The state of all these attributes can be represented by 1=true or 0=false
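Take an appropriately sized numeric data type: typically unsigned short = 16 bits, unsigned int = 32 bits, unsigned long long = 64 bits (the fixed-width types uint16_t, uint32_t and uint64_t from <cstdint> make the width explicit). If you need more bits than the biggest type offers, take an array of them; for instance, for 128 attributes take
uint64_t attr[2]; // two 64-bit values side by side
Each bit can be accessed with the following code (positions are 1-based):
bool GetBitAt(uint64_t attr, int position){
return (attr & (uint64_t(1) << (position-1))) != 0;
}
uint64_t SetBitAt(uint64_t attr, int position, bool value){
uint64_t mask = uint64_t(1) << (position-1);
return value ? (attr | mask) : (attr & ~mask); // set or clear, depending on value
}
Now have each bit position represent an attribute. E.g. bit 5 means Is Available?
bool IsAvailable(uint64_t attr){
return GetBitAt(attr, 5);
}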
benefits:
Saves space e.g. 64 attributes will only take 8 bytes.
Easy saving and reading: you simply have to read or write a short, int or long, which is just a simple variable.
Comparing sets of attributes is easy, as you simply compare the numeric value of one short, int or long with the other, e.g. if(Obj.attributes == Obj2.attributes){ }
I think you are describing an example of Orthogonal Regions. From that link, "Orthogonal regions address the frequent problem of a combinatorial increase in the number of states when the behavior of a system is fragmented into independent, concurrently active parts."
One way you might implement this is via object composition. For example, your super object contains several sub-objects. The sub-objects each maintain their associated state independently from one another. The super object's state is the combination of all its sub-object states.
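A minimal sketch of that composition idea, using the Thing example from the question. The enum and member names are invented here, and real sub-objects would carry their own transition rules rather than being bare enums.

```cpp
// Each independent aspect of the Thing gets its own small state type.
enum class Condition { Working, Broken };
enum class Presence { Present, Missing };
enum class Activation { Active, Deactivated };

struct Thing {
    Condition condition = Condition::Working;
    Presence presence = Presence::Present;
    Activation activation = Activation::Active;

    // The "superstate" is derived from the regions instead of being stored,
    // so it can never disagree with them.
    bool available() const {
        return condition == Condition::Working &&
               presence == Presence::Present &&
               activation == Activation::Active;
    }
};
```

Changing one region, say marking the Thing as broken, leaves the others untouched, so there is no need for combined states such as "active_missing_outdated".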
Search for "orthogonal states", "orthogonal regions", or "orthogonal components" for more ideas.
I needed a quick unique ID in one of my classes to differentiate one process from another. I decided to use the address of the instance to do so. I ended up with something like this (quintptr is a Qt defined type of integer to store addresses with the correct size, according to the platform):
Foo::Foo()
: _id(reinterpret_cast<quintptr>(this))
{
...
}
The idea is to compare the output of two different processes of the same exe. On Vista (my dev machine) there's no problem. But on XP, the value of _id is the same (!) in the two processes.
Can anyone explain why is that? and if it's a good idea to use pointers like that (I thought so, I'm not so sure anymore)?
Thanks.
Every process gets its own address space. On XP, they're all laid out the same way. Therefore it's very common to see what you saw: two objects that have the same address, but in two different address spaces.
It turns out that this contributes to security risks. Attackers were able to guess where vulnerable objects would be in memory, and exploit those. Vista randomizes address spaces (ASLR) which means that two processes are far more likely to put the same object at different addresses.
For your case, using pointers like that is not a smart idea. Just use the process ID.
The reason is that each process has its own address space, and if two processes do the same things they end up using the same virtual addresses; even heap allocations may land at the same virtual addresses.
You could call GetCurrentProcessId() once and store the result somewhere so that further retrieval is very fast. The process id persists and is unique for the lifetime of the process.
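Caching the process id could look like the sketch below. processId is a name made up for this example, and the getpid() branch is an addition for non-Windows builds that the original discussion does not cover.

```cpp
#include <cstdint>

#ifdef _WIN32
#include <windows.h>
#else
#include <unistd.h>
#endif

// Queries the operating system once and returns the cached value afterwards.
inline std::uint32_t processId() {
#ifdef _WIN32
    static const std::uint32_t id =
        static_cast<std::uint32_t>(::GetCurrentProcessId());
#else
    static const std::uint32_t id = static_cast<std::uint32_t>(::getpid());
#endif
    return id;
}
```

Foo's constructor could then initialize _id with processId() instead of the address of this.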
Each process gets its own address space. Unless something like ASLR kicks in, the memory layouts of two processes stemming from the same executable are likely to be very similar, if not identical.
So your idea is not a good one. Using the process ID sounds like a saner approach here, but keep in mind that those can be recycled too.