Persisting the state of an object that has many boolean attributes

Persisting the state of an object that has many boolean attributes - state

For simple objects, it's usually easy to have a "state" attribute that's a string and storeable in a database. For example, imagine a User class. It may be in the states of inactive, unverified, and active. This could be tracked with two boolean values – "active" and "verified" – but it could also use a simple state machine to transition from inactive to unverified to active while storing the current state in that "state" attribute. Very common, right?
However, now imagine a class that has several more boolean attributes and, more importantly, could have lots of combinations of those. For example, a Thing that may be broken, missing, deactivated, outdated, etc. Now, tracking state in a single "state" attribute becomes more difficult. This, I guess, is a Nondeterministic Finite Automaton or State Machine. I don't really want to store states like "inactive_broken" and "active_missing_outdated", etc.
The best I've come up with is to have both the "state" attribute and store some sort of superstate – "available" vs "unavailable", in this case – and each of the booleans. That way I could have a guard-like method when transitioning.
Has anyone else run into this problem and come up with a good solution to tracking states?

Have you considered serializing the "state" to a bit mask and storing it in an integer column in a database? Let's say an entity can be active or inactive, available or unavailable, or working or broken in any combination.
You could store each state as a bit; either on or off. This way a value of 111 would be active, available, and working, while a value of 000 would be inactive, unavailable, and broken.
You could then query for specific combinations using the appropriate bit mask or deserialize the entity to a class with boolean values for each state you are wanting to track. It would also be relatively cheap to add states to an object and would not break already serialized objects.

Same as the answer above but more practical than theory:
Identify the possible number of Boolean attributes. The state of all these attributes can be represented by 1=true or 0=false
Take a appropriate sized numeric datatype. unsigned short=16, unsigned int=32, unsigned long=64, if you have an even bigger type take an array of numeric: for instance for 128 attributes take
unsigned long[] attr= new long[2]; // two long side by side
each bit can be accessed with following code
bool GetBitAt(long attr, int position){
return (attr & (1<<(position-1)) >0;
}
long SetBitAt(long attr, int position, bool value){
return attr|=1<<(position-1);
}
Now have each bit position represent an attribute. E.g: bit 5 means Is Available?
bool IsAvailable(long attr){
return GetBitAt(attr, 5);
}
benefits:
Saves space e.g. 64 attributes will only take 8 bytes.
Easy saving and reading you simply have to read a short, int or long which is just a simple variable
Comparing a set of attributes is easy as you will simple compare a short, int or long 's numeric value with the other. e.g. if(Obj.attributes == Obj2.attributes){ }

I think you are describing an example of Orthogonal Regions. From that link, "Orthogonal regions address the frequent problem of a combinatorial increase in the number of states when the behavior of a system is fragmented into independent, concurrently active parts."
One way you might implement this is via object composition. For example, your super object contains several sub-objects. The sub-objects each maintain their associated state independently from one another. The super object's state is the combination of all its sub-object states.
Search for "orthogonal states", "orthogonal regions", or "orthogonal components" for more ideas.

Related

Possible to generate 8-byte Unique ID for an entity

I need a unique identifier to distinguish entities, but in reality these entities don't have a lot, and the uid can be repeated when the entity is destroyed. Entities are created in a distributed system and can be created multiple times at the same time.
Currently using a popular UUID library, but UUID is a 128-bit number. According to the design of my system, an int type is more than enough. If uid can be recycled, 8-byte should ok. So I think there is a lot of optimization space.
For example:
bool isEqual(const char *uid1, const char *uid2) {
return strcmp(uid1, uid2) == 0;
}
If I can make uid an integer instead of a string, then I don't need to use the string comparison function.
bool isEqual(int uid1, int uid2) {
return uid1 == uid2;
}
But I don't know now that there are mature libraries that meet my needs.
So I want to ask you:
How feasible if I implement it myself?
What difficulties will I encounter?
What should I pay attention to?
Is there a library that already implements similar functions?
Worth it?
BTW, I can use C/C++/lua.

If you want a custom dedicated uid generation on a fully controlled distributed system, you have 3 possibilities:
A central system generates simply serial values and the other systems ask it for each new uid. Simple and fully deterministic, but the generator is a Single Point Of Failure
Each (logical) system receives an id and combines it with a local serial number. For example if the number of systems is beyond 32000 you could use 16 bits for the system id and 48 bits for the serial. Fully deterministic but requires an admin to give each system its id
Random. High quality random number generators that comply with crypto requirements should give you pseudo uid with a low probability of collision. But it is only probabilistic so a collision is still possible.
Point to pay attention to:
race conditions. If more than one process can be client for a generator, you must ensure that the uid generation is correctly synchronized
uid recycling. If the whole system must be designed to live long enough to exhaust a serial generator, you will have to keep somewhere the list of still existing entities and their uid
for the probabilistic solution, the risk of collision is proportional to the maximum number of simultaneous entities. You should carefully evaluates that probability and evaluates whether the risk can be accepted.
Are such solutions already implemented?
Yes, in database systems that allow automatic id generation.
Worth it?
Only you can say...

We have a tiny, secure, unique string ID generator for Python, which allows you to reduce ID length (but increase collisions probability), you can pass the length as an argument. To use in python env :
pip install nanoid
from nanoid import generate
generate(size=10) => u'p1yS9T21Bf'
To check how the ID's are generated an their collision probablity for a given length visit https://zelark.github.io/nano-id-cc/
Refer: https://pypi.org/project/nanoid/

State machine representation

I want to implement the GUI as a state machine. I think there are some benefits and some drawbacks of doing this, but this is not the topic of this questions.
After some reading about this I found several ways of modeling a state machine in C++ and I stuck on 2, but I don't know what method may fit better for GUI modeling.
Represent the State Machine as a list of states with following methods:
OnEvent(...);
OnEnterState(...);
OnExitState(...);
From StateMachine::OnEvent(...) I forward the event to CurrentState::OnEvent(...) and here the decision to make a transition or not is made. On transition I call CurrentState::OnExitState(...), NewState::OnEnterState() and CurrentState = NewState;
With this approach the state will be tightly coupled with actions, but State might get complicated when from one state I can go to multiple states and I have to take different actions for different transitions.
Represent the state machine as list of transitions with following properties:
InitialState
FinalState
OnEvent(...)
DoTransition(...)
From StateMachine::OnEvent(...) I forward the event to all transitions where InitialState has same value as CurrentState in the state machine. If the transition condition is met the loop is stopped, DoTransition method is called and CurrentState set to Transition::FinalState.
With this approach Transition will be very simple, but the number of transition count might get very high. Also it will become harder to track what actions will be done when one state receives an event.
What approach do you think is better for GUI modeling. Do you know other representations that may be better for my problem?

Here is a third option:
Represent the state machine as a transition matrix
Matrix column index represents a state
Matrix row index represents a symbol (see below)
Matrix cell represents the state machihe should transit to. This could be both new state or the same state
Every state has OnEvent method which returns a symbol
From StateMachine::OnEvent(...) events are forwarded to State::OnEvent which returns a symbol - a result of execution. StateMachine then based on current state and returned symbol decides whether
Transition to different state must be made, or
Current state is preserved
Optionally, if transition is made, OnExitState and OnEnterState is called for a corresponsing states
Example matrix for 3 states and 3 symbols
0 1 2
1 2 0
2 0 1
In this example if if machine is in any od the states (0,1,2) and State::OnEvent returns symbol 0 (first row in the matrix) - it stays in the same state
Second row says, that if current state is 0 and returned symbol is 1 transition is made to state 1. For state 1 -> state 2 and for state 2 -> state 0.
Similary third row says that for symbol 2, state 0-> state 2, state 1 -> state 0, state 2 -> state 1
The point of this being:
Number of symbols will likely be much lower than that of states.
States are not aware of each other
All transition are controlled from one point, so the moment you want to handle symbol DB_ERROR differently to NETWORK_ERROR you just change the transition table and don't touch states implementation.

I don't know if this is the kind of answer you are expecting, but I use to deal with such state machines in a straightforward way.
Use a state variable of an enumerated type (the possible states). In every event handler of the GUI, test the state value, for instance using a switch statement. Do whatever processing there needs to be accordingly and set the next value of the state.
Lightweight and flexible. Keeping the code regular makes it readable and "formal".

I'd personally prefer the first method you said. I find the second one to be quite counter-intuitive and overly complicated. Having one class for each state is simple and easy, if then you set the correct event handlers in OnEnterState and remove them in OnExitState your code will be clean and everything will be self contained in the corresponding state, allowing for an easy read.
You will also avoid having huge switch statements to select the right event handler or procedure to call as everything a state does is perfectly visible inside the state itself thus making the state machine code short and simple.
Last but not least, this way of coding is an exact translation from the state machine draw to whatever language you'll use.

I prefer a really simple approach for this kind of code.
An enumeration of states.
Each event handler checks the current state before deciding what action to take. Actions are just composite blocks inside a switch statement or if chain, and set the next state.
When actions become more than a couple lines long or need to be reused, refactor as calls to separate helper methods.
This way there's no extra state machine management metadata structures and no code to manage that metadata. Just your business data and transition logic. And actions can directly inspect and modify all member variables, including the current state.
The downside is that you can't add additional data members localized to one single state. Which is not a real problem unless you have a really large number of states.
I find it also leads to more robust design if you always configure all UI attributes on entry to each state, instead of making assumptions about the previous setting and creating state-exit behaviors to restore invariants before state transitions. This applies regardless of what scheme you use for implementing transitions.

You can also consider modelling the desired behaviour using a Petri net. This would be preferable if you want to implement a more complex behaviour, since it allows you to determine exactly all possible scenarios and prevent deadlocks.
This library might be useful to implement a state machine to control your GUI: PTN Engine

Struct vs int64 value

Assume that this is for a 32-bit application. I have two ID values which together uniquely identify a particular object. Each ID is a plain 32-bit int, so in order to identify any particular object I need to keep both ID values together.
The first solution that comes to mind is to store them is as two separate values, and pass them around in a struct:
struct {
int id1;
int id2;
};
However, it would be nice if I could pass them around as a single value rather than a struct pair, as I did back when there was only a single 32-bit ID value.
So, another idea is to store them as the upper and lower halves of a uint64_t.
My question is, is there any real difference between the two methods? Either way, the same number of bytes are being passed around, but I don't know if there is any special overhead for int64, since the CPU isn't natively handling 64-bit values.

The best handling depends on how the entity is handled. If it is usually/mostly comparisons of the two sub-fields as a unit, then int64 has validity. If it is often inspecting the sub-fields independently, then the struct is more natural.

There is no reason not to use the struct.

If you can't make up your mind, use a union.
Microsoft for example defines a LARGE_INTEGER union.

How to handle changing data structures on program version update?

I do embedded software, but this isn't really an embedded question, I guess. I don't (can't for technical reasons) use a database like MySQL, just C or C++ structs.
Is there a generic philosophy of how to handle changes in the layout of these structs from version to version of the program?
Let's take an address book. From program version x to x+1, what if:
a field is deleted (seems simple enough) or added (ok if all can use some new default)?
a string gets longer or shorter? An int goes from 8 to 16 bits of signed / unsigned?
maybe I combine surname/forename, or split name into two fields?
These are just some simple examples; I am not looking for answers to those, but rather for a generic solution.
Obviously I need some hard coded logic to take care of each change.
What if someone doesn't upgrade from version x to x+1, but waits for x+2? Should I try to combine the changes, or just apply x -> x+ 1 followed by x+1 -> x+2?
What if version x+1 is buggy and we need to roll-back to a previous version of the s/w, but have already "upgraded" the data structures?
I am leaning towards TLV (http://en.wikipedia.org/wiki/Type-length-value) but can see a lot of potential headaches.
This is nothing new, so I just wondered how others do it....

I do have some code where a longer string is puzzled together from two shorter segments if necessary. Yuck. Here's my experience after 12 years of keeping some data compatible:
Define your goals - there are two:
new versions should be able to read what old versions write
old versions should be able to read what new versions write (harder)
Add version support to release 0 - At least write a version header. Together with keeping (potentially a lot of) old reader code around that can solve the first case primitively. If you don't want to implement case 2, start rejecting new data right now!
If you need only case 1, and and the expected changes over time are rather minor, you are set. Anyway, these two things done before the first release can save you many headaches later.
Convert during serialization - at run time, only keep the data in the "new format" in memory. Do necessary conversions and tests at persistence limits (convert to newest when reading, implement backward compatibility when writing). This isolates version problems in one place, helping to avoid hard-to-track-down bugs.
Keep a set of test data from all versions around.
Store a subset of available types - limit the actually serialized data to a few data types, such as int, string, double. In most cases, the extra storage size is made up by reduced code size supporting changes in these types. (That's not always a tradeoff you can make on an embedded system, though).
e.g. don't store integers shorter than the native width. (you might need to do that when you need to store long integer arrays).
add a breaker - store some key that allows you to intentionally make old code display an error message that this new data is incompatible. You can use a string that is part of the error message - then your old version could display an error message it doesn't know about - "you can import this data using the ConvertX tool from our web site" is not great in a localized application but still better than "Ungültiges Format".
Don't serialize structs directly - that's the logical / physical separation. We work with a mix of two, both having their pros and cons. None of these can be implemented without some runtime overhead, which can pretty much limit your choices in an embedded environment. At any rate, don't use fixed array/string lengths during persistence, that should already solve half of your troubles.
(A) a proper serialization mechanism - we use a bianry serializer that allows to start a "chunk" when storing, which has its own length header. When reading, extra data is skipped and missing data is default-initialized (which simplifies implementing "read old data" a lot in the serializationj code.) Chunks can be nested. That's all you need on the physical side, but needs some sugar-coating for common tasks.
(B) use a different in-memory representation - the in-memory reprentation could basically be a map<id, record> where id woukld likely be an integer, and record could be
empty (not stored)
a primitive type (string, integer, double - the less you use the easier it gets)
an array of primitive types
and array of records
I initially wrote that so the guys don't ask me for every format compatibility question, and while the implementation has many shortcomings (I wish I'd recognize the problem with the clarity of today...) it could solve
Querying a non existing value will by default return a default/zero initialized value. when you keep that in mind when accessing the data and when adding new data this helps a lot: Imagine version 1 would calculate "foo length" automatically, whereas in version 2 the user can overrride that setting. A value of zero - in the "calculation type" or "length" should mean "calculate automatically", and you are set.
The following are "change" scenarios you can expect:
a flag (yes/no) is extended to an enum ("yes/no/auto")
a setting splits up into two settings (e.g. "add border" could be split into "add border on even days" / "add border on odd days".)
a setting is added, overriding (or worse, extending) an existing setting.
For implementing case 2, you also need to consider:
no value may ever be remvoed or replaced by another one. (But in the new format, it could say "not supported", and a new item is added)
an enum may contain unknown values, other changes of valid range
phew. that was a lot. But it's not as complicated as it seems.

There's a huge concept that the relational database people use.
It's called breaking the architecture into "Logical" and "Physical" layers.
Your structs are both a logical and a physical layer mashed together into a hard-to-change thing.
You want your program to depend on a logical layer. You want your logical layer to -- in turn -- map to physical storage. That allows you to make changes without breaking things.
You don't need to reinvent SQL to accomplish this.
If your data lives entirely in memory, then think about this. Divorce the physical file representation from the in-memory representation. Write the data in some "generic", flexible, easy-to-parse format (like JSON or YAML). This allows you to read in a generic format and build your highly version-specific in-memory structures.
If your data is synchronized onto a filesystem, you have more work to do. Again, look at the RDBMS design idea.
Don't code a simple brainless struct. Create a "record" which maps field names to field values. It's a linked list of name-value pairs. This is easily extensible to add new fields or change the data type of the value.

Some simple guidelines if you're talking about a structure use as in a C API:
have a structure size field at the start of the struct - this way code using the struct can always ensure they're dealing only with valid data (for example, many of the structures the Windows API uses start with a cbCount field so these APIs can handle calls made by code compiled against old SDKs or even newer SDKs that had added fields
Never remove a field. If you don't need to use it anymore, that's one thing, but to keep things sane for dealing with code that uses an older version of the structure, don't remove the field.
it may be wise to include a version number field, but often the count field can be used for that purpose.
Here's an example - I have a bootloader that looks for a structure at a fixed offset in a program image for information about that image that may have been flashed into the device.
The loader has been revised, and it supports additional items in the struct for some enhancements. However, an older program image might be flashed, and that older image uses the old struct format. Since the rules above were followed from the start, the newer loader is fully able to deal with that. That's the easy part.
And if the struct is revised further and a new image uses the new struct format on a device with an older loader, that loader will be able to deal with it, too - it just won't do anything with the enhancements. But since no fields have been (or will be) removed, the older loader will be able to do whatever it was designed to do and do it with the newer image that has a configuration structure with newer information.
If you're talking about an actual database that has metadata about the fields, etc., then these guidelines don't really apply.

What you're looking for is forward-compatible data structures. There are several ways to do this. Here is the low-level approach.
struct address_book
{
unsigned int length; // total length of this struct in bytes
char items[0];
}
where 'items' is a variable length array of a structure that describes its own size and type
struct item
{
unsigned int size; // how long data[] is
unsigned int id; // first name, phone number, picture, ...
unsigned int type; // string, integer, jpeg, ...
char data[0];
}
In your code, you iterate through these items (address_book->length will tell you when you've hit the end) with some intelligent casting. If you hit an item whose ID you don't know or whose type you don't know how to handle, you just skip it by jumping over that data (from item->size) and continue on to the next one. That way, if someone invents a new data field in the next version or deletes one, your code is able to handle it. Your code should be able to handle conversions that make sense (if employee ID went from integer to string, it should probably handle it as a string), but you'll find that those cases are pretty rare and can often be handled with common code.

I have handled this in the past, in systems with very limited resources, by doing the translation on the PC as a part of the s/w upgrade process. Can you extract the old values, translate to the new values and then update the in-place db?
For a simplified embedded db I usually don't reference any structs directly, but do put a very light weight API around any parameters. This does allow for you to change the physical structure below the API without impacting the higher level application.

Lately I'm using bencoded data. It's the format that bittorrent uses. Simple, you can easily inspect it visually, so it's easier to debug than binary data and is tightly packed. I borrowed some code from the high quality C++ libtorrent. For your problem it's so simple as checking that the field exist when you read them back. And, for a gzip compressed file it's so simple as doing:
ogzstream os(meta_path_new.c_str(), ios_base::out | ios_base::trunc);
Bencode map(Bencode::TYPE_MAP);
map.insert_key("url", url.get());
map.insert_key("http", http_code);
os << map;
os.close();
To read it back:
igzstream is(metaf, ios_base::in | ios_base::binary);
is.exceptions(ios::eofbit | ios::failbit | ios::badbit);
try {
torrent::Bencode b;
is >> b;
if( b.has_key("url") )
d->url = b["url"].as_string();
} catch(...) {
}
I have used Sun's XDR format in the past, but I prefer this now. Also it's much easier to read with other languages such as perl, python, etc.

Embed a version number in the struct or, do as Win32 does and use a size parameter.
if the passed struct is not the latest version then fix up the struct.
About 10 years ago I wrote a similar system to the above for a computer game save game system. I actually stored the class data in a seperate class description file and if i spotted a version number mismatch then I coul run through the class description file, locate the class and then upgrade the binary class based on the description. This, obviously required default values to be filled in on new class member entries. It worked really well and it could be used to auto generate .h and .cpp files as well.

I agree with S.Lott in that the best solution is to separate the physical and logical layers of what you are trying to do. You are essentially combining your interface and your implementation into one object/struct, and in doing so you are missing out on some of the power of abstraction.
However if you must use a single struct for this, there are a few things you can do to help make things easier.
1) Some sort of version number field is practically required. If your structure is changing, you will need an easy way to look at it and know how to interpret it. Along these same lines, it is sometimes useful to have the total length of the struct stored in a structure field somewhere.
2) If you want to retain backwards compatibility, you will want to remember that code will internally reference structure fields as offsets from the structure's base address (from the "front" of the structure). If you want to avoid breaking old code, make sure to add all new fields to the back of the structure and leave all existing fields intact (even if you don't use them). That way, old code will be able to access the structure (but will be oblivious to the extra data at the end) and new code will have access to all of the data.
3) Since your structure may be changing sizes, don't rely on sizeof(struct myStruct) to always return accurate results. If you follow #2 above, then you can see that you must assume that a structure may grow larger in the future. Calls to sizeof() are calculated once (at compile time). Using a "structure length" field allows you to make sure that when you (for example) memcpy the struct you are copying the entire structure, including any extra fields at the end that you aren't aware of.
4) Never delete or shrink fields; if you don't need them, leave them blank. Don't change the size of an existing field; if you need more space, create a new field as a "long version" of the old field. This can lead to data duplication problems, so make sure to give your structure a lot of thought and try to plan fields so that they will be large enough to accommodate growth.
5) Don't store strings in the struct unless you know that it is safe to limit them to some fixed length. Instead, store only a pointer or array index and create a string storage object to hold the variable-length string data. This also helps protect against a string buffer overflow overwriting the rest of your structure's data.
Several embedded projects I have worked on have used this method to modify structures without breaking backwards/forwards compatibility. It works, but it is far from the most efficient method. Before long, you end up wasting space with obsolete/abandoned structure fields, duplicate data, data that is stored piecemeal (first word here, second word over there), etc etc. If you are forced to work within an existing framework then this might work for you. However, abstracting away your physical data representation using an interface will be much more powerful/flexible and less frustrating (if you have the design freedom to use such a technique).

You may want to take a look at how Boost Serialization library deals with that issue.

Algorithm for generating a unique ID in C++?

What can be the best algorithm to generate a unique id in C++?
The length ID should be a 32 bit unsigned integer.

Getting a unique 32-bit ID is intuitively simple: the next one. Works 4 billion times. Unique for 136 years if you need one a second. The devil is in the detail: what was the previous one? You need a reliable way to persist the last used value and an atomic way to update it.
How hard that will be depends on the scope of the ID. If it is one thread in one process then you only need a file. If it is multiple threads in one process then you need a file and a mutex. If is multiple processes on one machine then you need a file and a named mutex. If it is multiple processes on multiple machines then you need to assign a authoritative ID provider, a single server that all machines talk to. A database engine is a common provider like that, they have this built-in as a feature, an auto-increment column.
The expense of getting the ID goes progressively up as the scope widens. When it becomes impractical, scope is Internet or provider too slow or unavailable then you need to give up on a 32-bit value. Switch to a random value. One that's random enough to make the likelihood that the machine is struck by a meteor is at least a million times more likely than repeating the same ID. A goo-ID. It is only 4 times as large.

Here's the simplest ID I can think of.
MyObject obj;
uint32_t id = reinterpret_cast<uint32_t>(&obj);
At any given time, this ID will be unique across the application. No other object will be located at the same address. Of course, if you restart the application, the object may be assigned a new ID. And once the object's lifetime ends, another object may be assigned the same ID.
And objects in different memory spaces (say, on different computers) may be assigned identical IDs.
And last but not least, if the pointer size is larger than 32 bits, the mapping will not be unique.
But since we know nothing about what kind of ID you want, and how unique it should be, this seems as good an answer as any.

You can see this. (Complete answer, I think, is on Stack Overflow.)
Some note for unique id in C++ in Linux in this site. And you can use uuid in Linux, see this man page and sample for this.
If you use windows and need windows APIs, see this MSDN page.
This Wikipedia page is also useful: http://en.wikipedia.org/wiki/Universally_Unique_Identifier.

DWORD uid = ::GetTickCount();
::Sleep(100);

If you can afford to use Boost, then there is a UUID library that should do the trick. It's very straightforward to use - check the documentation and this answer.

There is little context, but if you are looking for a unique ID for objects within your application you can always use a singleton approach similar to
class IDGenerator {
public:
static IDGenerator * instance ();
uint32_t next () { return _id++; }
private:
IDGenerator () : _id(0) {}
static IDGenerator * only_copy;
uint32_t _id;
}
IDGenerator *
IDGenerator::instance () {
if (!only_copy) {
only_copy = new IDGenerator();
}
return only_copy;
}
And now you can get a unique ID at any time by doing:
IDGenerator::instance()->next ()

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js