Reusing Qt's QString COW / ref counting in a string registry - c++

I work on a project that is supposed to have a large object count (in the range of millions). Object names are not mandatory, but they are supported for user convenience. Having an empty string member, or even a string pointer, in every object would be considerable overhead, considering that named objects will be few and far between. It also happens that object names are very frequently reused.
So my solution was to create a name registry, basically a QHash<Object*, QString*>, which tracks every object that has a name assigned to it, and a separate string registry, a QHash<QString, uint>, which tracks the usage count for every string. Objects are registered in the name registry when they are named and unregistered when the name changes or the object is deleted. The name registry itself manages the string registry: it creates the strings, tracks their usage counts and, when necessary, removes entries that are no longer used.
I feel like the second registry may be redundant, since Qt already employs reference counting for its COW scheme. How can I use that existing functionality to eliminate the string registry and manage the strings' usage count and lifetime more elegantly?

user3735658, for some reason I tend to believe that a hash table does not carry the original key string in it, only the hash, so maybe your concern about a redundant QString object is not valid (?). That is a bit of an open question, though. Theoretically, there should not be an actual string there. So you could probably set the key to anything that is not valid in the context of your app, e.g. "UnnamedObj666", for an object not tied to a string, and still be able to find it via the hash table. (This is where it gets tricky, or maybe not, because collisions cannot be resolved by comparing against the original key.)
I may not be answering exactly what you asked, but this may work as well. How about:
QHash<QString, QSharedDataPointer<YourType1>> collection1;
QHash<QString, QSharedDataPointer<YourType2>> collection2;
QHash<QString, QSharedDataPointer<YourType3>> collection3;
or maybe just one
QHash<QString, QSharedDataPointer<YourBasicType>> collection;
Using QSharedDataPointer here appears to be the solution, as long as you derive YourType from QSharedData so that the reference counter is carried with the object itself. This gives proper tracking for all the "floating" references used anywhere in the program. Of course, you create the instance just once and then hand a const reference to the QSharedDataPointer to each consumer of your object.
There is one problem with the solution above, though: when the last named object is destroyed, its entry still remains in the hash table, although it can be reused.

Related

How can I prevent users deleting the pointer of a container member?

I have an API that returns a constant list of objects which represent non-copyable operating system resources. I want the list to be an absolute source of truth as to the state of the resources. If an application took a copy and manipulated the copy then the corresponding state change of the resource would not be reflected in the object stored in the list, so I want to prevent this, and also prevent removing resources from the list.
I have deleted various constructors and assignment operators and am so far happy that the user can only reference list members, and the list is itself a reference and I have satisfied myself that the list itself cannot be copied. I am also happy that pop_back() and similar calls fail because the list is returned as const.
However, this code compiles and potentially breaks the list.
const std::vector<MyClass> &list = MyClass::GetList();
const MyClass *test = &(list[0]);
delete test;
I know that you would have to be a muppet to do something like this, but in my 25 year career I have seen many such muppets earning good salaries (I suppose) as professional software engineers.
I'm pretty sure that this would cause a crash of some sort, or some other undefined and potentially application breaking behaviour that will, hopefully, be caught before the code goes into production. Hopefully. Hahahaha.
How can I make this code generate a compile error or otherwise smack said muppets over the back of the head with their own stupidity?
Thanks to HolyBlackCat for the spark of inspiration.
The answer is...extend the pimpl idiom...which I was already using.
In the implementation of MyClass have a static map of the available resources. Make it so MyClass only contains the key of its entry in said map. The implementation of the functions in the MyClass interface all look up the actual instance of the resource in the map and then call the equivalent API on that instance.
The list no longer needs to be protected at all. The lookup ID is private and read only. Callers can do whatever they wish with the list and the list members without damaging the underlying data at all.
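A minimal sketch of the handle idea described above, assuming integer keys and a hypothetical ResourceImpl struct standing in for the real operating system resource (neither appears in the original post). Each MyClass object stores only its key, and every accessor goes through the static map, so copying or modifying the list of handles never touches the underlying state.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for the real, non-copyable OS resource.
struct ResourceImpl {
    std::string name;
    int state;
};

class MyClass {
public:
    // Handles are cheap to copy; copying one never copies the resource.
    static std::vector<MyClass> GetList() {
        std::vector<MyClass> list;
        for (const auto& entry : registry()) list.push_back(MyClass(entry.first));
        return list;
    }

    // Every accessor looks the real instance up by key.
    int State() const { return lookup().state; }
    void SetState(int s) { lookup().state = s; }
    const std::string& Name() const { return lookup().name; }

private:
    explicit MyClass(int key) : key_(key) {}

    // Single source of truth, owned by the implementation.
    // The two entries are made-up sample data.
    static std::map<int, ResourceImpl>& registry() {
        static std::map<int, ResourceImpl> r = {
            {1, {"disk0", 0}},
            {2, {"disk1", 0}},
        };
        return r;
    }

    ResourceImpl& lookup() const { return registry().at(key_); }

    int key_;  // private; callers cannot read or change it
};
```

With this layout, a state change made through any copy of a handle is visible through every other copy, because all of them resolve to the same map entry.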

Using a map with key-string with redundant information?

I have a class myClass. A myClass object has a (human readable) name, and some more information.
class myClass
{
    std::string name;
    int attribute;
    int anotherAttribute;
};
They are stored inside a STL container, a vector, for example.
std::vector<myClass> myList;
When the client wants to access an element, it does this by name. That means, I have to iterate over the whole vector to find the correct object (the container contains about a few hundred objects).
So, I'm thinking of moving to std::map as the container, instead of vector. As far as I understand, a map should be the container of choice when accessing elements by name instead of by index.
However, then the name of the object is stored twice, once as the map key, and in the object itself. The memory overhead should be no problem, but I wonder if this is good practise. There may be the problem that the names run out of sync (for some mysterious reason). I even thought about dropping the name member of myClass.
To make it short: what container should I use, and why?
You should choose your container based on the way you store and access data in it, so for your use case you should definitely use an unordered_map.
You should keep the std::string name attribute in your class if and only if you use it from inside the class.
So the question is whether you will at any point have to get the name from the object, rather than the object from its name.
This happens when you get the object from the container, store it elsewhere, and use it later on.
Given the nature of a name (which is most likely not going to change), you don't really have to worry about the name attribute getting out of sync with the key you use in the unordered_map.
If memory isn't a primary concern, I'd advise you to keep it.
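One way to keep the name in the object while ruling out the "out of sync" worry is to derive the key from the object at insertion time and expose the name read-only. The sketch below assumes a hypothetical Registry wrapper and accessor names that are not in the original question.

```cpp
#include <string>
#include <unordered_map>
#include <utility>

class myClass {
public:
    myClass(std::string name, int attribute, int anotherAttribute)
        : name_(std::move(name)),
          attribute_(attribute),
          anotherAttribute_(anotherAttribute) {}

    // Read-only: with no setter, the name cannot drift away from the map key.
    const std::string& name() const { return name_; }
    int attribute() const { return attribute_; }
    int anotherAttribute() const { return anotherAttribute_; }

private:
    std::string name_;
    int attribute_;
    int anotherAttribute_;
};

// Hypothetical wrapper that owns the objects and indexes them by name.
class Registry {
public:
    // The key is always taken from the object itself.
    // Returns false if an object with the same name is already stored.
    bool add(myClass obj) {
        std::string key = obj.name();
        return objects_.emplace(std::move(key), std::move(obj)).second;
    }

    // Average O(1) lookup; returns nullptr if the name is unknown.
    const myClass* find(const std::string& name) const {
        auto it = objects_.find(name);
        return it == objects_.end() ? nullptr : &it->second;
    }

private:
    std::unordered_map<std::string, myClass> objects_;
};
```

For a few hundred objects the duplicated name costs little memory, and a pointer obtained from find() can still report its own name after it has been handed to other code.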

Which containers to store objects for access via different identifiers?

I have to access my objects (multiple instances from one class) via several different identifiers and don't know which is the best way to store the mapping from identifier to object.
I act as a kind of "connector" between two worlds and each has its own identifiers I have to use / support.
If possible I'd like to prevent using pointers.
The first idea was to put the objects in a List/Vector and then create a map for each type of identifier. I soon realized that the standard containers don't support storing references.
The next idea was to keep the objects inside the vector and just put the index in the map. The problem here is that I didn't find an index_of for vector and storing the index inside the object only works as long as nobody uses insert or erase.
The only identifier I have when creating the objects is a string, and for performance reasons I don't want to use this string as the identifier for a map.
Is this a problem solved best with pointers or does anybody have an idea how to deal with it?
Thanks
Using pointers seems reasonable. Here's a suggested API that you could implement:
class WidgetDatabase {
public:
    // Returns true if widget was inserted.
    // If there is a Widget in *this with the same name and/or id,
    // widget is not inserted.
    bool Insert(const std::string& name, int id, const Widget& widget);
    // Caller does NOT own returned pointer (do not delete it!).
    // nullptr is returned if there is no such Widget.
    const Widget* GetByName(const std::string& name) const;
    const Widget* GetById(int id) const;
private:
    // std::set never relocates its elements, so pointers into it stay valid
    // until the element is erased. Requires operator< on Widget.
    std::set<Widget> widgets_;
    // Elements of a std::set are const, so the indexes hold const pointers.
    std::map<std::string, const Widget*> widgets_by_name_;
    std::map<int, const Widget*> widgets_by_id_;
};
I think this should be pretty straightforward to implement. You just need to make sure to maintain the following invariant:
w is in widgets_ iff a pointer to it is in widgets_by_*
I think the main pitfall you'll encounter is making sure that name and id are not already in widgets_by_* when Insert is called.
It should be easy to make this thread safe: add a mutex member variable and take a std::lock_guard in each method. Optionally, take a shared (reader) lock in the Get* methods to avoid contention, e.g. std::shared_lock on a std::shared_mutex in C++17 or Boost's shared_lock_guard; this is especially helpful if your use case involves more reading than writing.
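A possible implementation of that API, as a sketch under a few assumptions: Widget is a hypothetical struct with a single field, the owning container is a std::list rather than the std::set of the declaration (so Widget needs no ordering, while element addresses still stay stable), and a plain std::mutex guards every method.

```cpp
#include <list>
#include <map>
#include <mutex>
#include <string>

// Hypothetical payload type; the original does not define Widget.
struct Widget {
    double mass_kg;
};

class WidgetDatabase {
public:
    // Returns true if widget was inserted, false if name or id is taken.
    bool Insert(const std::string& name, int id, const Widget& widget) {
        std::lock_guard<std::mutex> lock(mutex_);
        // Check both indexes first so that a rejected insert changes nothing.
        if (widgets_by_name_.count(name) != 0 || widgets_by_id_.count(id) != 0) {
            return false;
        }
        widgets_.push_back(widget);
        const Widget* stored = &widgets_.back();
        widgets_by_name_[name] = stored;
        widgets_by_id_[id] = stored;
        return true;
    }

    // Caller does NOT own the returned pointer; nullptr means "not found".
    const Widget* GetByName(const std::string& name) const {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = widgets_by_name_.find(name);
        return it == widgets_by_name_.end() ? nullptr : it->second;
    }

    const Widget* GetById(int id) const {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = widgets_by_id_.find(id);
        return it == widgets_by_id_.end() ? nullptr : it->second;
    }

private:
    mutable std::mutex mutex_;      // mutable so const getters can lock it
    std::list<Widget> widgets_;     // owns the widgets; nodes never move
    std::map<std::string, const Widget*> widgets_by_name_;
    std::map<int, const Widget*> widgets_by_id_;
};
```

Because nothing is ever erased in this sketch, a pointer returned by a getter stays valid for the lifetime of the database. Adding a Remove method would require erasing from both indexes and the list under the same lock to keep the invariant.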
Have you considered an in-memory SQLite database? SQL gives you many ways of accessing the same data. For example, your schema might look like this:
CREATE TABLE Widgets (
    -- Different ways of referring to the same thing.
    name TEXT,
    id INTEGER,
    -- Non-identifying characteristics.
    mass_kg FLOAT,
    length_m FLOAT,
    cost_cents INTEGER,
    hue INTEGER
);
Then you can query using different identifiers:
SELECT mass_kg from Widgets where name = $name
or
SELECT mass_kg from Widgets where id = $id
Of course, SQL allows you to do much more than this. This will allow you to easily extend your library's functionality in the future.
Another advantage is that SQL is declarative, which usually makes it more concise and readable.
In recent versions, SQLite supports concurrent access to the same database. The concurrency model has gotten stronger over time, so you'll have to make sure you understand the model that is offered by the version that you're using. The latest version of the docs can be found on sqlite's website.

Using a shared_ptr<string> into an unordered_set<string>

I'm trying to cut down on string copying (which has been measured to be a performance bottleneck in my application) by putting the strings into an unordered_set<string> and then passing around shared_ptr<string>'s. It's hard to know when all references to the string in the set have been removed, so I hope that the shared_ptr can help me. This is the untested code that illustrates how I hope to be able to write it:
unordered_set<string> string_pool;
:
shared_ptr<string> a = &(*string_pool.emplace("foo").first); // .first is an iterator
:
shared_ptr<string> b = &(*string_pool.emplace("foo").first);
In the above, only one instance of the string "foo" should be in string_pool; both a and b should point to it; and at such time that both a and b are destructed, "foo" should be erased from the string_pool.
The doc on emplace() suggests, but doesn't make clear to me, that pointer a can survive a rehashing caused by the allocation of pointer b. It also appears to guarantee that the second emplacement of "foo" will not cause any reallocation, because it is recognized as already present in the set.
Am I on the right track here? I need to keep the string_pool from growing endlessly, but there's no single point at which I can simply clear() it, nor is there any clear "owner" of the strings therein.
UPDATE 1
The history of this problem: this is a "traffic cop" app that reads from servers, parcels out data to other servers, receives their answers, parcels those out to others, receives, and finally assembles and returns a summary answer. It includes an application protocol stack that receives TCP messages, parses them into string scalar values, which the application then assembles into other TCP messages, sends, receives, etc. I originally wrote it using strings, vector<string>s, and string references, and valgrind reported a "high number" of string constructors (even compiled with -O3), and high CPU usage that was focused in library routines related to strings. I was asked to investigate ways to reduce string copying, and designed a "memref" class (char* and length pointing into an input buffer) that could be copied around in lieu of the string itself. Circumstances then arose requiring the input buffer to be reused while memrefs into it still needed to be valid, so I paid to copy each buffer substring into an internment area (an unordered_set<string>), and have the memref point there instead. Then I discovered it was difficult and inconvenient to find a spot in the process when the internment area could be cleared all at once (to prevent its growing without bound), and I began trying to redesign the internment area so that when all memrefs to an interned string were gone, the string would be removed from the pool. Hence the shared_ptr.
As I mentioned in my comment to @Peter R, I was even less comfortable with move semantics and containers and references than I am now, and it's quite possible I didn't code my simple, string-based solution to use all that C++11 can offer. By now I seem to have been traveling in a great circle.
The unordered_set owns the strings. When it goes out of scope your strings will be freed.
My first impression is that your approach does not sound like it will result in a positive experience with respect to maintainability or testability. Certainly this
shared_ptr<string> a = &(*string_pool.emplace("foo").first);
is wrong. You already have an owner for the string in your unordered_set. Trying to put another ownership layer on it with the shared_ptr is not going to work. You could have an unordered_set<shared_ptr<string>> but even that I would not recommend.
Without understanding the rest of your code base it's hard to recommend a 'solution' here. The combination of move semantics and passing const string& should handle most requirements at a low level. If there are still performance issues they may then be architectural. Certainly using only shared_ptr<string> may solve your life-time issues if there is no natural owner of the string, and they are cheap to copy, just don't use the unordered_set<string> in that case.
You've gone a bit wayward. shared_ptrs conceptually form a set of shared owners of an object... the first shared_ptr should be created with make_shared, then the other copies are created automatically (with "value" semantics) when that value is copied. What you're attempting to do is flawed in that:
the string_pool itself stores strings that don't partake in the shared ownership, nor is there any way in which the string_pool is notified or updated when the shared_ptr's reference count hits 0
the shared_ptrs have no relationship to each other (you're giving both of them raw pointers rather than copying one to make the other)
For your usage, you need to decide whether you'll pro-actively erase the string from the string_pool at some point in time, otherwise you may want to put a weak_ptr in the string_pool and check whether the shared string actually still exists before using it. You can google weak_ptr if you're not already familiar with the concept.
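The weak_ptr variant could look roughly like the sketch below. StringPool, intern and purge are hypothetical names. The pool holds only weak references, so it never keeps a string alive by itself, and purge() is the explicit clean-up step that removes entries whose strings have died. Note that this sketch stores each string twice (once as the key, once as the shared value), which a production version might want to avoid.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <unordered_map>

class StringPool {
public:
    // Returns a shared handle to the single pooled copy of s.
    std::shared_ptr<const std::string> intern(const std::string& s) {
        auto it = pool_.find(s);
        if (it != pool_.end()) {
            // The entry may be stale, so check that the string still exists.
            if (std::shared_ptr<const std::string> alive = it->second.lock()) {
                return alive;
            }
        }
        // First use, or all previous owners are gone: create a fresh copy.
        std::shared_ptr<const std::string> fresh = std::make_shared<std::string>(s);
        pool_[s] = fresh;
        return fresh;
    }

    // Drops entries whose strings have no remaining owners.
    void purge() {
        for (auto it = pool_.begin(); it != pool_.end();) {
            if (it->second.expired()) {
                it = pool_.erase(it);
            } else {
                ++it;
            }
        }
    }

    std::size_t size() const { return pool_.size(); }

private:
    std::unordered_map<std::string, std::weak_ptr<const std::string>> pool_;
};
```

Calling purge() periodically (for example after each processed message) bounds the growth of the pool without needing a point at which everything can be cleared at once.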
Separately, it's worth checking whether your current observation that string copying is a performance problem is due to inefficient coding. For example:
are your string variables passed around by reference where possible, e.g. const std::string& function parameters whenever you won't change them?
do you use static const strings rather than continual run-time recreation from string literals/character arrays?
are you compiling with a sensible level of optimisation (e.g. -O2, /O2)?
are there places where keeping a reference to a string, plus offsets within the string, would massively improve performance and reduce memory usage (the referenced string must be kept around as long as it's used, even indirectly)? It is very common to implement a "string_ref" or similar class for this in medium- and larger-sized C++ projects.
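To illustrate the last point, here is a minimal sketch of such a non-owning reference class and a splitter that produces fields without copying any characters. StringRef and split are hypothetical names; since C++17 the standard library's std::string_view fills the same role.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Non-owning view: a pointer and a length into a buffer that must
// outlive every StringRef created from it.
class StringRef {
public:
    StringRef(const char* data, std::size_t size) : data_(data), size_(size) {}
    const char* data() const { return data_; }
    std::size_t size() const { return size_; }
    // Makes a real copy, only when the caller explicitly asks for one.
    std::string str() const { return std::string(data_, size_); }

private:
    const char* data_;
    std::size_t size_;
};

// Splits buffer on sep. The returned views point into buffer, so buffer
// must not be modified or destroyed while they are in use.
std::vector<StringRef> split(const std::string& buffer, char sep) {
    std::vector<StringRef> fields;
    std::size_t start = 0;
    while (true) {
        const std::size_t end = buffer.find(sep, start);
        if (end == std::string::npos) {
            // Last field: everything from start to the end of the buffer.
            fields.emplace_back(buffer.data() + start, buffer.size() - start);
            return fields;
        }
        fields.emplace_back(buffer.data() + start, end - start);
        start = end + 1;
    }
}
```

Copying a StringRef copies two machine words regardless of the field length, which is what makes it attractive when a parser hands many substrings around.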

Cache design: flyweight of mutable entity objects based on an immutable key

A lot of different screens in my app refer to the same entity/business objects over and over again.
Currently, each screen refers to their own copy of each object.
Also, entity objects may themselves expose access to other entity objects, again new copies of objects are created.
I'm trying to find a caching solution.
I'm looking for something similar to boost::flyweight.
However, based on immutable key/mutable value and reference counted.
boost::flyweight<key_value<long, SomeObject>, tag<SomeObject> > object;
The above is almost perfect.
I'm looking for a similar container that will give mutable access to SomeObject
Edit:
I like the flyweight's syntax and semantics. However, flyweight only allows const SomeObject& access, no chance to modify the object.
Edit2: Code has to compile on MSVC++6
Any ideas?
As long as you are happy affecting intrinsic state, then from the internals in boost/flyweight/key_value.hpp it looks like you can get away with a const_cast. If you have your own key extractor you should ensure it doesn't vary with the operations that making x mutable will expose it to.
flyweight<key_value<long, SomeObject> > kvfw(2);
SomeObject &x = const_cast<SomeObject &>(static_cast<const SomeObject&>(kvfw));
I think if you make flyweights mutable, then they cannot properly be called flyweights. Imagine a situation where glyphs are represented as flyweights. What will happen if one function changes the codepoint of the glyph that represents the letter 'A'? Another function, which renders the glyphs on screen, will try to draw 'A' and the user might end up seeing 'B' or something else! I think you need immutable keys referring to mutable objects. Then, think of using a hash table coupled with some reference counting mechanism.
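A rough sketch of that last suggestion, written in C++11 for brevity (so it would not compile on MSVC++6 as is; an equivalent reference-counted smart pointer would be needed there). EntityCache and the fields of SomeObject are hypothetical. The key is immutable, the object behind it is mutable and shared, and the cache holds only weak references so that an entity is destroyed once the last screen releases it.

```cpp
#include <map>
#include <memory>
#include <string>

// Hypothetical entity; the original only names the type SomeObject.
struct SomeObject {
    explicit SomeObject(long id) : id(id) {}
    const long id;       // immutable key
    std::string title;   // mutable state shared by every holder
};

class EntityCache {
public:
    // Returns the one shared instance for id, creating it on first use.
    std::shared_ptr<SomeObject> get(long id) {
        std::weak_ptr<SomeObject>& slot = entries_[id];
        std::shared_ptr<SomeObject> obj = slot.lock();
        if (!obj) {
            obj = std::make_shared<SomeObject>(id);
            slot = obj;
        }
        return obj;
    }

    // True while at least one holder keeps the entity alive.
    bool cached(long id) const {
        auto it = entries_.find(id);
        return it != entries_.end() && !it->second.expired();
    }

private:
    // Weak references only: the reference count lives in shared_ptr.
    // Expired slots stay in the map and are reused by the next get().
    std::map<long, std::weak_ptr<SomeObject>> entries_;
};
```

Two screens that call get() with the same id receive the same object, so an edit made in one is immediately visible in the other.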