Designing a string class in C++ - c++

I need to design (and code) a "customized" string class in C++. I am seeking any documentation and pointers on design issues or potential pitfalls I should be aware of.
Links are very welcome, as are the identification of problems (if any) with current string libs (Qstring, std::string, and the others).

Despite the critics, I think this is a valid question.
The std::string is not a panacea. It looks like someone took the class from a pure-OO and dumped it in C++, which is probably the case.
Advice 1: Prefer non-member non-friend methods
Now that this is said, in this hour of internationalization, I would certainly advise you to design a class that would support Unicode. And I do say Unicode, not UTF-8 or UTF-16. It's ill-fitting (I think) to devise a class that would contain the data in a given encoding. You can provide methods to then output the information in various formats.
Advice 2: Support Unicode
Then, there is a number of points on the memory allocation schemes:
Small String Optimization: the class contains pre-allocated space for a few characters (a dozen or two), and thus avoid heap allocation for those
Copy On Write: the various strings share a buffer so that copy is cheap, when one string needs to modify its content, it copies the buffer if it's not the sole owner --> the issue is that multithreading introduces overhead here and it's been showed that for a general purpose technic this overhead could dwarf the actual copying cost
Immutability: "new" languages such as Java, C# or Python use immutable strings. Think of it as a pool of strings, all strings containing "Fooo" will point to the same buffer. Note that these languages support garbage collection, which rather helps here.
I would personally pick the "Small String Optimization" here (though it's not exclusive with the other two), simply because it's simple to implement and should actually benefit you (heap allocation cost, locality of reference issues).
The other two technics are somewhat complex in the face of multi-threading, and such are likely error-prone and unlikely to yield any real benefit unless carefully crafted.
And that brings my last advice:
Advice 3: Don't implement internal locking in an attempt of MultiThreading support
It will slow down the class when used in SingleThreaded context and will not yield as much benefit as you'd think when used in a MultiThreaded one.
Finally, you could perhaps find something suiting your tastes (or get some pointers) by browsing existing code. I don't promise to exhibit "smooth" interfaces though:
ICU UnicodeString: Unicode support, at least
std::string: over 100 member methods (counting the various overloads)
llvm StringRef: note how many algorithms are implemented as member methods :'(

Effective STL by Scott Meyers has some interesting discussion about possible std::string implementation techniques, though it covers rather advanced issues such as copy-on-write and reference counting.

Depending on what the "customization" is (e.g. a custom allocator), you may be able to do it via a template parameter of the std::basic_string class.

Herb Sutter gives a sample of a custom string class in the GotW #29. You could use it for the start.

From a general-purpose point of view a "new" string class ideally combined the good points of std::string, CString, QString and others. A few points in random order:
MFC CString supports using it in printf-like functions due to a very specific implementation. If you need or want this feature I recommend buying the book "MFC Internals" by George Sheperd. Although the book is from 1996(!) it's description of how CString is implemented should be worth it. http://www.amazon.com/MFC-Internals-Microsoft-Foundation-Architecture/dp/0201407213/ref=sr_1_1?ie=UTF8&s=books&qid=1283176951&sr=8-1
Check that your string class plays nicely with all interfaces you'll use it with (iostreams, Windows API, printf*, etc.)
Don't aim for full unicode support (as in: collation, grapheme clusters, ...) as that will mean your class will never be done, but consider making it a wchar_t class with conversion options.
Consider making the ctor/function that creates your string objects from char* always take the specific encoding of the character arrays. (Can be helpful in mixed UTF-8 / other character sets environments.)
Look at the full CString interface and at the full std:string interface and decide what you are going to need and what you can skip.
Look at QString to see what the other two miss.
Do not provide implicit conversion to neither char/wchar_t*
Consider adding convenient conversion functions to/from numeric types.
Don't write a string class without a full set of detailed Unit Tests!

The world doesn't need another string class. Is this homework? If not, use std::string.

The problem with std::string is.. that you can't change it. Sometimes you need the basics of a std::string, but disagree with the implementation of your c++ library.
As an example, thread-safe reference counting employed means lots of locking (or at least locked operations). Also, if most of your strings are short (because you know this will be the case), you might want a string class that is optimized for that use-case.
So even if you like the std::string API, or at least have learned to live with it, there is room for 'competing implementations' that are more or less workalikes.
PowerDNS would love to have one, as we currently pass many dns host names around, and a large majority of them would fit in a, say, 25 bytes fixed buffer, which would relieve a lot of new/delete pressure.

Related

Exporting C++ from dll - Domain and collections

There are several questions already on stack overflow regarding classes and functions across dll boundries.
Alot of them reference this article https://www.codeproject.com/Articles/28969/HowTo-Export-C-classes-from-a-DLL
I want to drill down a bit and ask some more specific questions.
The summary of most of those questions is
1) Exporting templates and the STL across dll boundries is bad in general
2) You cannot allocated on one side of a dll boundry and release in another.
3) Your options are
a) Export only pure C interfaces
b) Export abstract classes that are free to use templated and the STL in their internal implementation.
Ok. Well, I've never ever done this in 10 years of coding C++ and it has never nipped me in the butt. I've followed what the article calls "The naive approach." However, I resent being called naive a little bit :), because I've always had 100% of the source in every solution and always built 100% of it with the same compiler and settings, and have been mindful of such.
Recently, I've collected evidence that I have a problem with an object being allocated in on one side of a dll boundry and released in another. So, I may find myself having to change my ways. Perhaps, this is a result from now using third party libraries from Nuget, rather than compiling them from source.
So, I wonder, how can I employ these options when it comes to domain objects (Or in my case, objects with no functionality)
If I can never use std::string, then I assume everything exported has to use c-style strings and the amount of copying MBs of text just increased 500X in my solution, if I want to use std:string internally in my dlls.
Worse, I am pondering the question: How do you then have domain objects that contain a collection?
Consider a class in my "naive dll"
class CustomerList
{
public:
// where this is not trivial
void AddCustomer(const Customer & customer);
private:
std::vector<Customer> m_customers;
};
and a function or method in my dll
SubmitCustomers(const CustomerList & customers);
How do I make this safe?
Do I need to go reinvent vector in pure C without allocations?
Do I have to resort to arrays with sizes? yuck? and then do yet more copying to STL containers internally?

framework/library for property-tree-like data structure with generic get/set-implementation?

I'm looking for a data structure which behaves similar to boost::property_tree but (optionally) leaves the get/set implementation for each value item to the developer.
You should be able to do something like this:
std::function<int(void)> f_foo = ...;
my_property_tree tree;
tree.register<int>("some.path.to.key", f_foo);
auto v1 = tree.get<int>("some.path.to.key"); // <-- calls f_foo
auto v2 = tree.get<int>("some.other.path"); // <-- some fallback or throws exception
I guess you could abuse property_tree for this but I haven't looked into the implementation yet and I would have a bad feeling about this unless I knew that this is an intended use case.
Writing a class that handles requests like val = tree.get("some.path.to.key") by calling a provided function doesn't look too hard in the first place but I can imagine a lot of special cases which would make this quite a bulky library.
Some extra features might be:
subtree-handling: not only handle terminal keys but forward certain subtrees to separate implementations. E.g.
tree.register("some.path.config", some_handler);
// calls some_handler.get<int>("network.hostname")
v = tree.get<int>("some.path.config.network.hostname");
search among values / keys
automatic type casting (like in boost::property_tree)
"path overloading", e.g. defaulting to a property_tree-implementation for paths without registered callback.
Is there a library that comes close to what I'm looking for? Has anyone made experiences with using boost::property_tree for this purpose? (E.g. by subclassing or putting special objects into the tree like described here)
After years of coding my own container classes I ended up just adopting QVariantMap. This way it pretty much behaves (and is as flexible as) python. Just one interface. Not for performance code though.
If you care to know, I really caved in for Qt as my de facto STL because:
Industry standard - used even in avionics and satellite software
It has been around for decades with little interface change (think about long term support)
It has excellent performance, awesome documentation and enormous user base.
Extensive feature set, way beyond the STL
Would an std::map do the job you are interested in?
Have you tried this approach?
I don't quite understand what you are trying to do. So please provide a domain example.
Cheers.
I have some home-cooked code that lets you register custom callbacks for each type in GitHub. It is quite basic and still missing most of the features you would like to have. I'm working on the second version, though. I'm finishing a helper structure that will do most of the job of making callbacks. Tell me if you're interested. Also, you could implement some of those features yourself, as the code to register callbacks is already done. It shouldn't be so difficult.
Using only provided data structures:
First, getters and setters are not native features to c++ you need to call the method one way or another. To make such behaviour occur you can overload assignment operator. I assume you also want to store POD data in your data structure as well.
So without knowing the type of the data you're "get"ting, the only option I can think of is to use boost::variant. But still, you have some overloading to do, and you need at least one assignment.
You can check out the documentation. It's pretty straight-forward and easy to understand.
http://www.boost.org/doc/libs/1_61_0/doc/html/variant/tutorial.html
Making your own data structures:
Alternatively, as Dani mentioned, you can come up with your own implementation and keep a register of overloaded methods and so on.
Best

Datastructure Storage in Filesystem

I am trying to write a persistent datastructure in C++ , however I feel that I should be able to make it binary compatible with various other implementations of my datastructure readers, and hence, my current idea is to declare datastructure in the native memory without any abstraction.
For example, I would specify a linear block of memory as a datastructure (using new keyword) and then describe what the first byte means, what the second byte means and so on. I know I can do this using struct but then, the datastructure would be bound to one language and other languages will have to then use this structure. Also, the implementation might then change from compiler to compiler. I would instead like it as a memory standard.
Is what I am trying to do somewhat sensible? Or I am trying to over-simplify things and should really proceed with a struct data structure? Now onto the C++ part, if you believe that I should be using a struct data structure, then what are the disadvantages of using a full-fledged class?
(I am using a class anyway to wrap around the memory structure and provide functions to it since the datastructure is anyway persistent.)
EDIT
As justin as suggested, I do not need any such advanced interface wrapper around the memory structure, so my last point about class wrapper is not stated properly. What I mean is I would like to have a class interface for the memory representation, it does not necessarily have to be a wrapper.
Several file formats I have read/worked with do exactly that -- define a memory standard or layout, then typically back it up with a demonstration in C-like pseudo-structure. Sometimes they will provide struct or class representations, and some are completely abstracted by a library. Of course, these formats go on to document all fields, their sizes, the endianness of the data and so on.
I figure endian related issues, padding, complexity (e.g. introduced by variations in the data structures) and proper versioning are the biggest sources of errors. Another issue I find is the use of data structures of yesteryear and inconsistency of data structures used to represent similar functionalities -- You may receive a spec, and realize it contains several different string representations -- all of which are archaic, and somebody has to go on to support all of these (bidirectionally).
Proceeding that route:
You should not commit to a binary representation (or compilable program) if you don't want to support it (and attempts of long-lived formats fail/stumble along the way, as platforms and toolsets change). Just commit to a formal memory standard at first, then build on top of that with tests and example input files to verify the representation is properly serialized and deserialized correctly. A very basic test suite will help ensure your model is portable on all the systems you need, and can point out potential pitfalls or platform specific considerations you may not have been aware of.
If you really want to provide a compilable representation, I'd stick with a very compliant struct representation -- clients can take that (in memory) representation and turn it into any C++ abstraction/representation they like. That is to say, a serialized representation should probably not reflect that of a representation in memory, apart from trivially simple representations and the intermediate storage of such a representation (flattened and packed structs).
One of the important parts is that you should have tests which confirm your in memory object graph which you create with these structs are forward and backwards serializable and de-serializable, and support proper versioning -- so it often takes a bit of work to make a complex serialized representation compatible. So you see this approach just introduces one abstraction layer on top of another. In this regard, you may want give C++ abstraction the ability to create itself from the packed in memory representation, and to ensure that that representation can also correctly populate the packed structure without data loss.
Beyond that, is there any need to have a more advanced interface? If there is, then you may want to provide that information.
So yes, the memory standard is the part that you must get correct and stable, and to which all implementations should refer to and test against -- regardless of platform/architecture differences. IOW, you're on the right track ;)
In C++ there's no practical difference between struct and class (besides the default accessibility being public in struct). Traditionally, struct is used when a type only has (public) member variables and no member functions but this is only a convention, not a rule enforced by the compiler.
I'd certainly use a struct/class to describe the data. If someone wants to write a reader of your data structure, they can either import your header file or implement the data structure in their language of choice - in most programming languages this should be pretty simple.
I recommend you start your structure something like this:
typedef struct
{
int Version; // struct layout version
int ByteSize; // byte size of structure for validation
...
} MYDATA;
This way when your data structure is being passed around, your code can verify that the allocated structure size matches with how many bytes you'd expect for a given version of your structure. You could then easily introduce new versions of your structure by simply updating the version field and checking for the new size.
When you save your data to disk, make sure that you write it out field-by-field, rather than through a single write (using a pointer and sizeof() to ensure that other languages won't have to deal with potential padding that your C++ compiler may decide to put in. It's possible to manually lay out fields in the structure so that there's no padding but you have to be very, very careful while doing that and it's easy to make mistakes.

Extending libraries in C++

Is it possible to extend a class from a C++ library without the source code? Would having the header be enough to allow you to use inheritance? I am just learning C++ and am getting into the theory. I would test this but I don't know how.
Short answer
YES, definitively you can.
Long answer:
WARNING: THe following text may hurt children an sensitive OOP integralists. If you feel or retain to be one of such, stay away from this answer: mine your and everyone alse life will be more easier
Let me reveal a secret: STL code is just nothing more than regular C++ code that comes with headers and libraries, exactly like your code can -and most likely- do.
STL authors are just programmer LIKE YOU. They are no special at all respect to the compiler. Thay don't have any superpower towards it. They sits on their toilet exacly like you do on yours, to do exactly what you do. Don't over-mistify them.
STL code follows the exact same rules of your own written code: what is overridden will be called instead of the base: always if it is virtual, and only according to the static type of its referring pointer if it is not virtual, like every other piece of C++ code. No more no less.
The important thing is not to subvert design issues respecting the STL name convention and semantics, so that every further usage of your code will not confuse people expectation, including yourself, reading your code after 10 years, not remembering anymore certain decisions.
For example, overriding std::exception::what() must return an explanatory persistent C string, (like STL documentation say) and not add unexpected other fuzzy actions.
Also, overriding streams or streaming operators shold be done cosidering the entire design (do you really need to override the stream or just the streambuffer or just add a specific facet to the locale it imbued?): In other words, study not just "the class" but the design of all its "world" to properly understand how it works with what is around.
Last, but not least, one of the most controversial aspect are containers and everything not having virtual destructors.
My opinion is that the noise about the "classic OOP rule: Dont' derive what has no virtual destructor" is over-inflated: simply don't expect a cow to became an horse just because you place a saddle on it.
If you need (really really need) a class that manage a sequence of character with the exact same interface of std::string that is able to convert implicitly into an std::string and that has something more, you have two ways:
do what the good good girls do, embed std:string and rewrite all its 112 (yes: they are more than 100) methods with function that do nothing more than calling them and be sure you come still virgin to the marriage with another good good boy programmer's code, or ...
After discover that this takes about 30 years and you are risking to become 40 y.o. virgin no good good boy programmer is anymore interested in, be more practical, sacrifice your virginity and derive std::string. The only thing you will loose is your possibility to marry an integralist. And you can even discover it not necessarily a problem: you're are even staying away from the risk to be killed by him!
The only thing you have to take care is that, being std::string not polymorphic your derivation will mot make it as such, so don't expect and std::string* or std::string& referring yourstring to call your methods, including the destructor, that is no special respect every other method; it just follow the exact same rules.
But ... hey, if you embed and write a implicit conversion operator you will get exactly that result, no more no less!
The rule is easy: don't make yourself your destructor virtual and don't pretend "OOP substitution principle" to work with something that is not designed for OOP and everything will go right.
With all the OOP integralist requemscant in pacem their eternal sleep, your code will work, while they are still rewriting the 100+ std::string method just to embed it.
Yes, the declaration of the class is enough to derive from it.
The rest of the code will be picked up when you link against the library.
Yes you can extend classes in standard C++ library. Header file is enough for that.
Some examples:
extending std::exception class to create custom exception
extending streams library to create custom streams in your application
But one thing you should be aware is don't extend classes which does not have a virtual destructor. Examples are std::vector, std::string
Edit : I just found another SO question on this topic Extending the C++ Standard Library by inheritance?
Just having an header file is enough for inheriting from that class.
C++ programs are built in two stages:
Compilation
Compiler looks for definition of types and checks your program for language correctness.This generates object files.
Linking
The compiled object files are linked together to form a executable.
So as long as you have the header file(needed for compilation) and the library(needed for linking) You can derive from a class.
But note that one has to be careful whether that class is indeed meant for inheritance.
For example: If you have a class with non virtual destructor then that class is not meant for inheritance. Just like all standard library container classes.
So in short, Just having a interface of class is enough for derivation but the implementation and design semantics of the class do play an important role.

Design methods for multiple serialization targets/formats (not versions)

Whether as members, whether perhaps static, separate namespaces, via friend-s, via overloads even, or any other C++ language feature...
When facing the problem of supporting multiple/varying formats, maybe protocols or any other kind of targets for your types, what was the most flexible and maintainable approach?
Were there any conventions or clear cut winners?
A brief note why a particular approach helped would be great.
Thanks.
[ ProtoBufs like suggestions should not cut it for an upvote, no matter how flexible that particular impl might be :) ]
Reading through the already posted responses, I can only agree with a middle-tier approach.
Basically, in your original problem you have 2 distinct hierarchies:
n classes
m protocols
The naive use of a Visitor pattern (as much as I like it) will only lead to n*m methods... which is really gross and a gateway towards maintenance nightmare. I suppose you already noted it otherwise you would not ask!
The "obvious" target approach is to go for a n+m solution, where the 2 hierarchies are clearly separated. This of course introduces a middle-tier.
The idea is thus ObjectA -> MiddleTier -> Protocol1.
Basically, that's what Protocol Buffers does, though their problematic is different (from one language to another via a protocol).
It may be quite difficult to work out the middle-tier:
Performance issues: a "translation" phase add some overhead, and here you go from 1 to 2, this can be mitigated though, but you will have to work on it.
Compatibility issues: some protocols do not support recursion for example (xml or json do, edifact does not), so you may have to settle for a least-common approach or to work out ways of emulating such behaviors.
Personally, I would go for "reimplementing" the JSON language (which is extremely simple) into a C++ hierarchy:
int
strings
lists
dictionaries
Applying the Composite pattern to combine them.
Of course, that is the first step only. Now you have a framework, but you don't have your messages.
You should be able to specify a message in term of primitives (and really think about versionning right now, it's too late once you need another version). Note that the two approaches are valid:
In-code specification: your message is composed of primitives / other messages
Using a code generation script: this seems overkill there, but... for the sake of completion I thought I would mention it as I don't know how many messages you really need :)
On to the implementation:
Herb Sutter and Andrei Alexandrescu said in their C++ Coding Standards
Prefer non-member non-friend methods
This applies really well to the MiddleTier -> Protocol step > creates a Protocol1 class and then you can have:
Protocol1 myProtocol;
myProtocol << myMiddleTierMessage;
The use of operator<< for this kind of operation is well-known and very common. Furthermore, it gives you a very flexible approach: not all messages are required to implement all protocols.
The drawback is that it won't work for a dynamic choice of the output protocol. In this case, you might want to use a more flexible approach. After having tried various solutions, I settled for using a Strategy pattern with compile-time registration.
The idea is that I use a Singleton which holds a number of Functor objects. Each object is registered (in this case) for a particular Message - Protocol combination. This works pretty well in this situation.
Finally, for the BOM -> MiddleTier step, I would say that a particular instance of a Message should know how to build itself and should require the necessary objects as part of its constructor.
That of course only works if your messages are quite simple and may only be built from few combination of objects. If not, you might want a relatively empty constructor and various setters, but the first approach is usually sufficient.
Putting it all together.
// 1 - Your BOM
class Foo {};
class Bar {};
// 2 - Message class: GetUp
class GetUp
{
typedef enum {} State;
State m_state;
};
// 3 - Protocl class: SuperProt
class SuperProt: public Protocol
{
};
// 4 - GetUp to SuperProt serializer
class GetUp2SuperProt: public Serializer
{
};
// 5 - Let's use it
Foo foo;
Bar bar;
SuperProt sp;
GetUp getUp = GetUp(foo,bar);
MyMessage2ProtBase.serialize(sp, getUp); // use GetUp2SuperProt inside
If you need many output formats for many classes, I would try to make it a n + m problem instead of an n * m problem. The first way I come to think of is to have the classes reductible to some kind of dictionary, and then have a method to serlalize those dictionarys to each output formats.
Assuming you have full access to the classes that must be serialized. You need to add some form of reflection to the classes (probably including an abstract factory). There are two ways to do this: 1) a common base class or 2) a "traits" struct. Then you can write your encoders/decoders in relation to the base class/traits struct.
Alternatively, you could require that the class provide a function to export itself to a container of boost::any and provide a constructor that takes a boost::any container as its only parameter. It should be simple to write a serialization function to many different formats if your source data is stored in a map of boost::any objects.
That's two ways I might approach this. It would depend highly on the similarity of the classes to be serialized and on the diversity of target formats which of the above methods I would choose.
I used OpenH323 (famous enough for VoIP developers) library for long enough term to build number of application related to VoIP starting from low density answering machine and up to 32xE1 border controller. Of course it had major rework so I knew almost anything about this library that days.
Inside this library was tool (ASNparser) which converted ASN.1 definitions into container classes. Also there was framework which allowed serialization / de-serialization of these containers using higher layer abstractions. Note they are auto-generated. They supported several encoding protocols (BER,PER,XER) for ASN.1 with very complex ASN sntax and good-enough performance.
What was nice?
Auto-generated container classes which were suitable enough for clear logic implementation.
I managed to rework whole container layer under ASN objects hierarchy without almost any modification for upper layers.
It was relatively easy to do refactoring (performance) for debug features of that ASN classes (I understand, authors didn't intended to expect 20xE1 calls signalling to be logged online).
What was not suitable?
Non-STL library with lazy copy under this. Refactored by speed but I'd like to have STL compatibility there (at least that time).
You can find Wiki page of all the project here. You should focus only on PTlib component, ASN parser sources, ASN classes hierarchy / encoding / decoding policies hierarchy.
By the way,look around "Bridge" design pattern, it might be useful.
Feel free to comment questions if something seen to be strange / not enough / not that you requested actuall.