How can I do string interning in C or C++? - c++

Is there something like the intern() method in C or C++, as there is in Java? If there isn't, how can I carry out string interning in C or C++?

boost::flyweight< std::string > seems to be exactly what you're looking for.

Is there something like intern() method in C like we have in Java ?
Not in the standard C library.
If there isn't, how to carry out string interning in C?
With great difficulty, I fear. The first problem is that "string" is not a well-defined thing in C. Instead you have char *, which might point at a zero-terminated string, or might just denote a character position. Then you've got the problem that some strings are embedded in other things ... or are stored on the stack. Both of which make interning impossible and/or meaningless. Then, there is the problem that C string literals are not guaranteed to be interned ... in the way that Java guarantees it. Finally, there is the problem that interning is a storage leak waiting to happen ... if the language is not garbage collected.
Having said that, the way to (attempt to) implement interning in C would be to create a hash table to hold the interned strings. You'd need to make it a precondition that you cannot intern a string unless it is either a literal or a string allocated in its own heap node. To address the storage leak issue, you'd need a per-string reference count to detect when an interned string can be discarded.
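The hash-table-plus-reference-count idea described above can be sketched as follows. This is a minimal illustration written in C++ rather than C, and InternPool, intern and release are hypothetical names, not part of any standard library. It relies on std::unordered_map keeping element addresses stable across rehashing, so the returned pointer can serve as the string's identity.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical intern pool (illustrative names, not a library API).
class InternPool {
public:
    // Returns a pointer that is identical for all equal strings
    // and bumps the per-string reference count.
    const std::string* intern(const std::string& s) {
        auto it = table_.emplace(s, 0).first;  // inserts only if absent
        ++it->second;
        return &it->first;  // key address stays valid until erased
    }

    // Drops one reference; the entry is discarded when the count reaches zero.
    void release(const std::string* p) {
        auto it = table_.find(*p);
        if (it != table_.end() && --it->second == 0) {
            table_.erase(it);
        }
    }

    std::size_t size() const { return table_.size(); }

private:
    std::unordered_map<std::string, std::size_t> table_;
};
```

Two interned strings can then be compared by pointer alone. The caller is responsible for pairing every intern with a release, which is exactly the storage-leak risk mentioned above.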

What would string interning mean in a language which has value
semantics? Interning is a mechanism to force object identity for
references to strings with value identity. It's relevant in languages
which use reference semantics and use object identity as the default
comparison function. C++ uses value semantics by default, and types
like std::string don't have identity, so interning makes no sense.
Some implementations (e.g. g++) may use a form of reference semantics
for the string data, behind the scenes. Such an implementation could
offer some sort of interning of that data, as an extension. (G++
doesn't, as far as I know, but does automatically "intern" empty
strings.)
Most other implementations don't even use reference semantics
internally. How would you intern an implementation using the small
string optimization (like MS)? Where the data is literally in the class
in some cases, and there is no dynamically allocated memory.

Related

What is string_view?

string_view was a proposed feature within the C++ Library Fundamentals TS (N3921) that was added to C++17.
As far as I understand, it is a type that represents some kind of string "concept": a view of any type of container that could store something viewable as a string.
Is this right ?
Should the canonical
const std::string& parameter type become string_view ?
Is there another important point about string_view to take into consideration ?
The purpose of any and all kinds of "string reference" and "array reference" proposals is to avoid copying data which is already owned somewhere else and of which only a non-mutating view is required. The string_view in question is one such proposal; there were earlier ones called string_ref and array_ref, too.
The idea is always to store a pair of pointer-to-first-element and size of some existing data array or string.
Such a view-handle class could be passed around cheaply by value and would offer cheap substringing operations (which can be implemented as simple pointer increments and size adjustments).
Many uses of strings don't require actual owning of the strings, and the string in question will often already be owned by someone else. So there is a genuine potential for increasing the efficiency by avoiding unneeded copies (think of all the allocations and exceptions you can save).
The original C strings were suffering from the problem that the null terminator was part of the string APIs, and so you couldn't easily create substrings without mutating the underlying string (a la strtok). In C++, this is easily solved by storing the length separately and wrapping the pointer and the size into one class.
The one major obstacle and divergence from the C++ standard library philosophy that I can think of is that such "referential view" classes have completely different ownership semantics from the rest of the standard library. Basically, everything else in the standard library is unconditionally safe and correct (if it compiles, it's correct). With reference classes like this, that's no longer true. The correctness of your program depends on the ambient code that uses these classes. So that's harder to check and to teach.
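As a rough illustration of the pointer-plus-size idea, here is a small sketch using C++17 std::string_view. The helper first_token is a hypothetical name chosen for this example. It returns a substring without copying or mutating the underlying data, and the result is only valid while the original characters are still alive, which is the ownership caveat described above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: returns the text before the first delimiter.
// No characters are copied; the result points into the caller's data.
std::string_view first_token(std::string_view text, char delim) {
    const std::size_t pos = text.find(delim);
    // If delim is absent, pos is npos and substr returns the whole view.
    return text.substr(0, pos);
}
```

The same function accepts a string literal, a std::string, or a (pointer, length) pair, since all of them convert to std::string_view.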
(Educating myself in 2021)
From Microsoft's <string_view>:
The string_view family of template specializations provides an efficient way to pass a read-only, exception-safe, non-owning handle to the character data of any string-like objects with the first element of the sequence at position zero. (...)
From Microsoft's C++ Team Blog std::string_view: The Duct Tape of String Types from August 21st, 2018 (retrieved 2021 Apr 01):
string_view solves the “every platform and library has its own string type” problem for parameters. It can bind to any sequence of characters, so you can just write your function as accepting a string view:
void f(wstring_view); // string_view that uses wchar_t's
and call it without caring what stringlike type the calling code is using (and for (char*, length) argument pairs just add {} around them)
(...)
(...)
Today, the most common “lowest common denominator” used to pass string data around is the null-terminated string (or as the standard calls it, the Null-Terminated Character Type Sequence). This has been with us since long before C++, and provides clean “flat C” interoperability. However, char* and its support library are associated with exploitable code, because length information is an in-band property of the data and susceptible to tampering. Moreover, the null used to delimit the length prohibits embedded nulls and causes one of the most common string operations, asking for the length, to be linear in the length of the string.
(...)
Each programming domain makes up their own new string type, lifetime semantics, and interface, but a lot of text processing code out there doesn’t care about that. Allocating entire copies of the data to process just to make differing string types happy is suboptimal for performance and reliability.

C++: What is recommended way to distinguish between string which never had the value set and empty string?

We have a Java messaging API which we are translating to C++. The messages typically have simple data types, like string, int, double, etc. When a message is constructed, we initialize all the member variables to a default value which the API recognizes as a "null" value (i.e. never set to any value), e.g. Integer.MAX_VALUE for int types. Any fields which are considered null are not serialized and sent.
In Java, strings automatically initialize to null so it's easy to differentiate between a string field which is null versus a string which is empty string (which is a legal value to send in the message).
I'm not sure of the best way to handle this in C++, since the strings automatically initialize to an empty string, and empty string is a legal value to send over the API. We could default strings to some control character (which would not be a legal value in our API), but I'm wondering if there is a more conventional or better way to do this.
We're all new here to C++, so we may have overlooked some obvious approach.
The recommended way is to make sure that the object doesn't exist until it has a valid value. If a message with a null string isn't valid, why allow it?
You can't avoid it in Java, because a string can always be null.
But C++ gives you the tool to create a class which is guaranteed to always hold a string.
And it sounds like that's what you want.
For what you're asking for, the best approach is really to build into the class the invariant that objects of this class always have a string set. Instead of setting all the objects to some default value in the constructor, define the constructor to take the actual parameters and set the members to valid values.
However, if you want to specify an "optional" value, there are a couple of broad approaches:
either use a pointer (preferably a smart pointer). A pointer to a string can be null, or it can point to a valid string (which, again, may or may not be empty)
alternatively, use something like boost::optional from the Boost libraries. This is a clever little utility template which lets you define, well, optional values (the object may contain a string, or it may be null)
or you could simply add a bool flag (something like has_string, which, when not set, indicates that no string has been set, and the string value should be disregarded).
Personally, I'd prefer the last two approaches, but all three are fairly commonly used, and will work just fine. But the best approach is the one in which you design the class so that the compiler can guarantee that it'll always be valid. If you don't want messages with a null string, let the compiler ensure that messages will never have a null string.
To replicate Java "things can have values, or lack values", probably the most general way is to store boost::optional<T> or, since C++17, std::optional<T>.
You do have to throw in some * and -> if you want to read their values, and be careful about optional<bool> because its default conversion to bool is "am I initialized or not?", not the bool that is stored. But operator= does pretty much what you want it to when writing to it, it is just reading from it that can do unexpected things in a bool context.
To tell if an optional<T> is initialized, just evaluate it in a bool context like you might a pointer. To extract its value after you have confirmed it is initialized, use the unary * operator.
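A minimal sketch of the optional approach, assuming a C++17 compiler where std::optional is available (boost::optional is used in much the same way). Message, comment and should_serialize are hypothetical names made up for this example, not part of the API in the question.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical message type: the field distinguishes "never set" from "empty".
struct Message {
    std::optional<std::string> comment;  // std::nullopt means never set
};

// A field is sent only if it has been set, even if it was set to "".
bool should_serialize(const Message& m) {
    return m.comment.has_value();
}
```

Reading the value goes through * or ->, for example m.comment->empty(), and only after checking that the optional holds a value.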
boost is a relatively high quality library with a high rate of code migrating from it to the C++ standard in 5 to 10 years. It does contain some scary parts (like phoenix!), and in general you should make sure that whatever component you are using isn't already in the C++ standard library (having migrated there). boost::optional in particular is part of their header-only libraries, which are easier to use (as you don't have to build boost to use them).

When working with a C API, use character arrays or use strings and convert as needed?

I'm working with a C API and a lot of the functions take arguments that are character arrays. I've heard that using char arrays is now frowned upon. But on the other hand, using c_str() to convert the string to a char array over-and-over seems wasteful.
Are there any reasons to do it one way vs the other?
The c_str() call is quite likely to be inlined—it's very small in terms of the required code. I would use std::string if that's the only thing holding you back.
Of course, if you're very worried, this standard advice applies:
Profile it
Read the assembly
Also be aware that this is a micro-optimization; you're quite likely to be wasting development time worrying about something completely different from what you should be worrying about.
It depends on what you're doing, and what the interface functions are
doing. At one extreme, I don't think anyone would recommend converting
a string literal to an std::string, just so you can call c_str on
it. At the other: any code building up strings dynamically should use
std::string. Functions like strcpy and strcat are invitations to
buffer overflow. Between the two, it depends. I'd say that the criteria
should be ease and safety: anytime something is easier or safer to do
using std::string, use std::string. As long as what you're doing
doesn't require dynamic allocation of char[], and things like
operator+ on strings wouldn't be used, you can use char[].
As mentioned, c_str() is going to be inlined. What hasn't been mentioned, and what I think is one of the most important aspects of your question, is that std::string follows the principles of RAII. When using std::string, you won't need to remember to free the string or worry about exception safety. Just make sure each std::string instance is not destroyed until the C code is done with the string. That could especially be an issue if the std::string is a temporary created by the compiler.
If your C function writes back a string, you could use a vector<char> and set the size to your desired buffer size. That way you'll still follow C++ RAII principles.
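The vector<char> buffer idea might look like the sketch below. c_api_get_name is a stand-in defined here purely for illustration; a real C API will have its own name, signature and error convention. The buffer size of 64 is likewise an arbitrary assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Stand-in for a C function that writes a NUL-terminated string into buf.
// Hypothetical: not a real library call.
extern "C" int c_api_get_name(char* buf, std::size_t len) {
    if (buf == nullptr || len == 0) return -1;
    std::snprintf(buf, len, "%s", "example");
    return 0;
}

// Wraps the C call; the vector releases its memory automatically (RAII),
// even if an exception is thrown while building the result.
std::string get_name() {
    std::vector<char> buf(64);  // assumed maximum size for this sketch
    if (c_api_get_name(buf.data(), buf.size()) != 0) {
        return std::string();
    }
    return std::string(buf.data());  // copies up to the terminator
}
```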
Most implementations of std::string probably store the actual string as a C string anyway, so the c_str function is just an inline function that returns a pointer. So generally, I would say the proper way to go is with std::string.
Of course, if the string is intended to be modified by the function you call, then you can't use the std::string approach. Instead, you'll have to make a copy to your own buffer before calling the function, in which case using arrays may be the way to go.
There is a very simple reason to use string: it works.
Working with C-strings is a pain:
manual memory allocation and deallocation is error-prone
the interface itself is error-prone (lack of null character termination, off-by-one errors and buffer overflows are common)
the operations are inefficient (strlen, strcpy and strcat) because the length needs to be recomputed each time
I really see no good reason to ever work with C-strings.
It's such a pain that many platforms have provided their own specific operations and that a number of "better strings" have been proposed (oh, the joy of having multiple standards).

Does the implementation of std::string I use implement ref-counting or not?

I develop for iOS and use XCode 3.2.5, GCC 4.2.
UPD
This code works:
string s = "aaaa";
string s1 = s;
assert(s.data() == s1.data());
Does it mean ref-counting is used? Or '==' is overloaded for const char* somehow to compare contents, not addresses?
UPD
Okay, it does.
There are different ways of finding out, the first of which is plainly looking at the code. std::string is a typedef to an instantiation of the basic_string template, and being a template, all the code is available to you in the headers. Note that reading standard library headers can be both enlightening and hard. And yet, you don't even need to understand the code; you might get some good hints from a cursory look (such as the fact that basic_string contains a member _M_p with a _M_refcount sub-member).
If you don't want to read the code, you can approach the problem from a practical point of view and measure the effects that a copy-on-write implementation would have. You can, for example, create a long string [*], then copy it to a different string and compare the addresses returned by data(), which point to the actual contents.
[*] The reason for the long string is to avoid being misled by other implementation techniques, such as the small-object (small-string) optimization, in which a string contains a small internal buffer to avoid dynamic memory allocation for very short contents.
An easy way to find out would be to copy-construct or assign a string, and compare the results of their data() method - if their data area is at the same location in memory, they must be using some form of reference counting.
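The data() comparison described above can be written as a small probe. This is only a sketch: CopyProbe and probe_string_copy are hypothetical names, and the result for shared_after_copy depends entirely on the library in use. The requirements introduced in C++11 are generally understood to rule out copy-on-write for std::string, so a modern standard library is likely to report no sharing, while an older reference-counted one may report sharing. The probe reads data() through const access only, because a non-const access could itself force a shared buffer to be copied.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical result type for the probe.
struct CopyProbe {
    bool shared_after_copy;   // implementation-specific
    bool shared_after_write;  // expected to be false everywhere
};

// length should be large (and non-zero) to stay clear of any small-string buffer.
CopyProbe probe_string_copy(std::size_t length) {
    const std::string original(length, 'x');
    std::string copy = original;
    const std::string& const_copy = copy;  // const access only when probing

    CopyProbe result;
    result.shared_after_copy = (original.data() == const_copy.data());
    copy[0] = 'y';  // a write must give the copy its own buffer
    result.shared_after_write = (original.data() == const_copy.data());
    return result;
}
```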
One obvious answer is: it's unspecified. As far as I know, it's
not only unspecified in the standard, but in every
implementation. But for what it's worth, g++ uses a reference
counted implementation, at least through the latest version I've
looked at (4.4.2).
Ref counting is really useful (copy-on-write, etc.), and a lot of code relies on it to be efficient. It's probably a bad idea to abandon it. It's better to have a function that explicitly obtains a copy of a string, the way MS does it (lock buffer, etc.), if you're going to tinker with internals in an unsafe manner.

immutable strings vs std::string

I've recently been reading about immutable strings (Why can't strings be mutable in Java and .NET? and Why .NET String is immutable?) as well as some stuff about why D chose immutable strings. There seem to be many advantages.
trivially thread safe
more secure
more memory efficient in most use cases.
cheap substrings (tokenizing and slicing)
Not to mention most new languages have immutable strings, D2.0, Java, C#, Python, etc.
Would C++ benefit from immutable strings?
Is it possible to implement an immutable string class in c++ (or c++0x) that would have all of these advantages?
update:
There are two attempts at immutable strings, const_string and fix_str. Neither has been updated in half a decade. Are they even used? Why didn't const_string ever make it into boost?
I found that most people in this thread do not really understand what immutable_string is. It is not only about constness. The real power of immutable_string is in performance (even in a single-threaded program) and memory usage.
Imagine that all strings are immutable, and all strings are implemented like
class string {
    char* _head;
    size_t _len;
};
How can we implement a substring operation? We don't need to copy any characters. All we have to do is assign _head and _len. Then the substring shares the same memory segment with the source string.
Of course, we cannot really implement an immutable_string with only these two data members. A real implementation might need a reference-counted (or flyweighted) memory block, like this:
class immutable_string {
    boost::flyweight<std::string> _s;
    char* _head;
    size_t _len;
};
Both the memory and the performance would be better than the traditional string in most cases, especially when you know what you are doing.
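To make the shared-buffer substring idea concrete, here is a minimal sketch that swaps the flyweight for a std::shared_ptr<const std::string> so that it needs only the standard library (C++17 for std::string_view). ImmutableString and its members are hypothetical names for illustration; this is not how const_string or fix_str are implemented.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <string_view>
#include <utility>

// Hypothetical immutable string: substrings share the parent's storage.
class ImmutableString {
public:
    explicit ImmutableString(std::string s)
        : storage_(std::make_shared<const std::string>(std::move(s))),
          offset_(0),
          length_(storage_->size()) {}

    // No characters are copied; only the offset and length change.
    ImmutableString substr(std::size_t pos, std::size_t count) const {
        pos = std::min(pos, length_);
        count = std::min(count, length_ - pos);
        return ImmutableString(storage_, offset_ + pos, count);
    }

    std::string_view view() const {
        return std::string_view(storage_->data() + offset_, length_);
    }

private:
    ImmutableString(std::shared_ptr<const std::string> storage,
                    std::size_t offset, std::size_t length)
        : storage_(std::move(storage)), offset_(offset), length_(length) {}

    std::shared_ptr<const std::string> storage_;  // reference-counted block
    std::size_t offset_;
    std::size_t length_;
};
```

One trade-off of this design is that a short substring keeps the whole original buffer alive for as long as the substring exists.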
Of course C++ can benefit from immutable string, and it is nice to have one. I have checked the boost::const_string and the fix_str mentioned by Cubbi. Those should be what I am talking about.
As an opinion:
Yes, I'd quite like an immutable string library for C++.
No, I would not like std::string to be immutable.
Is it really worth doing (as a standard library feature)? I would say not. The use of const gives you locally immutable strings, and the basic nature of systems programming languages means that you really do need mutable strings.
My conclusion is that C++ does not require the immutable pattern because it has const semantics.
In Java, if you have a Person class and you return the String name of the person with the getName() method, your only protection is the immutable pattern. If it were not there, you would have to clone() your strings all night and day (as you have to do with data members that are not typical value objects but still need to be protected).
In C++ you have const std::string& getName() const. So you can write SomeFunction(person.getName()) where it is like void SomeFunction(const std::string& subject).
No copy happened
If anyone wants to copy he is free to do so
Technique applies to all data types, not just strings
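The const-reference pattern from this answer, written out as a small sketch. Person, getName and SomeFunction follow the names used above; the bodies are assumptions added for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

class Person {
public:
    explicit Person(std::string name) : name_(std::move(name)) {}

    // Callers get read-only access to the member; nothing is copied.
    const std::string& getName() const { return name_; }

private:
    std::string name_;
};

// Accepts the reference as-is; the body here is only a placeholder.
std::size_t SomeFunction(const std::string& subject) {
    return subject.size();
}
```

A caller who does want an independent copy simply writes std::string copy = person.getName();.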
You're certainly not the only person who thought that. In fact, there is a const_string library by Maxim Yegorushkin, which seems to have been written with inclusion into boost in mind. And here's a slightly newer library, fix_str by Roland Pibinger. I'm not sure how tricky full string interning at run time would be, but most of the advantages are achievable when necessary.
I don't think there's a definitive answer here. It's subjective—if not because personal taste then at least because of the type of code one most often deals with. (Still, a valuable question.)
Immutable strings are great when memory is cheap—this wasn't true when C++ was developed, and it isn't the case on all platforms targeted by C++. (OTOH on more limited platforms C seems much more common than C++, so that argument is weak.)
You can create an immutable string class in C++, and you can make it largely compatible with std::string—but you will still lose when comparing to a built-in string class with dedicated optimizations and language features.
std::string is the best standard string we get, so I wouldn't like to see any messing with it. I use it very rarely, though; std::string has too many drawbacks from my point of view.
const std::string
There you go. A string literal is also immutable, unless you want to get into undefined behavior.
Edit: Of course that's only half the story. A const string variable isn't useful because you can't make it reference a new string. A reference to a const string would do it, except that C++ won't allow you to reassign a reference as in other languages like Python. The closest thing would be a smart pointer to a dynamically allocated string.
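The smart-pointer variant mentioned in the edit could look like this sketch. ConstStringPtr and make_const_string are hypothetical names. The pointed-to string can never be modified through the pointer, but the pointer itself can be rebound to a different string, which is roughly what a string variable does in languages with immutable strings.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical alias: a rebindable handle to an immutable string.
using ConstStringPtr = std::shared_ptr<const std::string>;

ConstStringPtr make_const_string(std::string text) {
    return std::make_shared<const std::string>(std::move(text));
}
```

Assigning a new value to one handle leaves every other handle to the old string untouched.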
Immutable strings are great if, whenever it's necessary to create a new string, the memory manager will always be able to determine the whereabouts of every string reference. On most platforms, language support for such an ability could be provided at relatively modest cost, but on platforms without such language support built in, it's much harder.
If, for example, one wanted to design a Pascal implementation on x86 that supported immutable strings, it would be necessary for the string allocator to be able to walk the stack to find all string references; the only execution-time cost of that would be requiring a consistent function-call approach [e.g. not using tail calls, and having every non-leaf function maintain a frame pointer]. Each memory area allocated with new would need to have a bit to indicate whether it contained any strings and those that do contain strings would need to have an index to a memory-layout descriptor, but those costs would be pretty slight.
If a GC weren't able to walk the stack, then it would be necessary to have code use handles rather than pointers, and have code create string handles when local variables come into scope and destroy the handles when they go out of scope. That means much greater overhead.
Qt also uses immutable strings with copy-on-write.
There is some debate about how much performance it really buys you with decent compilers.
constant strings make little sense with value semantics, and sharing isn't one of C++'s greatest strengths...
Strings are mutable in Ruby.
$ irb
>> foo="hello"
=> "hello"
>> bar=foo
=> "hello"
>> foo << "world"
=> "helloworld"
>> print bar
helloworld=> nil
trivially thread safe
I would tend to forget safety arguments. If you want to be thread-safe, lock it, or don't touch it. C++ is not a convenient language, have your own conventions.
more secure
No. As soon as you have pointer arithmetic and unprotected access to the address space, forget about being secure. Safer against innocently bad coding, yes.
more memory efficient in most use cases.
Unless you implement CPU-intensive mechanisms, I don't see how.
cheap substrings (tokenizing and slicing)
That would be one very good point. Could be done by referring to a string with backreferences, where modifications to a string would cause a copy. Tokenizing and slicing become free, mutations become expensive.
C++ strings are thread safe. All immutable objects are guaranteed to be thread safe, but Java's StringBuffer is mutable, as a C++ string is, and both of them are thread safe. Why worry about speed? Declare your method or function parameters with the const keyword to tell the compiler the string will be immutable in that scope. Also, a string object could be made immutable on demand: when you append other strings to the main string, you keep a list of strings, and they are joined together only at the point where you actually need the whole string.
Immutable and mutable objects operate at the same speed, to my knowledge, except for their methods, which is a matter of pros and cons. Constant primitives and variable primitives move at different speeds because, at the machine level, variables are assigned to a register or a memory location, which requires a few binary operations, while constants are labels that don't require any of those and are thus faster (less work is done). This works only for primitives and not for objects.