String storage optimization - c++

I'm looking for a C++ library that would help optimize memory usage by storing similar (not just identical) strings in memory only once. This is not Flyweight or string interning, which can only store exact duplicates of objects/strings once. The library should be able to analyze the data and recognize that, for example, two particular strings of different lengths share the same first 100 characters, so that this substring is stored only once.
Example 1:
std::string str1 = "http://www.contoso.com/some/path/app.aspx?ch=test1";
std::string str2 = "http://www.contoso.com/some/path/app.aspx?ch=test2";
In this case it is obvious that the only difference between these two strings is the last character, so it would save a lot of memory to hold only one copy of "http://www.contoso.com/some/path/app.aspx?ch=test" and then two additional strings, "1" and "2".
Example 2:
std::string str1 = "http://www.contoso.com/some/path/app.aspx?ch1=test1";
std::string str2 = "http://www.contoso.com/some/path/app.aspx?ch2=test2";
This is a more complicated case with multiple identical substrings: one copy of "http://www.contoso.com/some/path/app.aspx?ch", then two strings "1" and "2", then one copy of "=test", and since the strings "1" and "2" are already stored, no additional strings are needed.
So, is there such a library? Is there something that could help to develop such a library relatively quickly? The strings are immutable, so there is no need to worry about updating indexes or about locks for thread safety.

If the strings have a common prefix, one solution is to use a radix tree (a space-optimized trie, http://en.wikipedia.org/wiki/Radix_tree) to represent them. You then only store a pointer to the tree node where the string ends, and recover the whole string by walking up to the tree root.
hello world
hello winter
hell
       [2]
      /
h-e-l-l-o-' '-w-o-r-l-d-[0]
               \
                i-n-t-e-r-[1]
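To make the "store a pointer, walk up to the root" idea concrete, here is a minimal sketch. It is a plain (uncompressed) trie rather than a real radix tree, and the names TrieNode and PrefixPool are made up for illustration. Note that with one heap node per character this version costs far more memory than it saves; a real radix tree would merge single-child chains into one edge.

```cpp
#include <algorithm>
#include <map>
#include <memory>
#include <string>

// Hypothetical names throughout: TrieNode and PrefixPool are illustrative only.
struct TrieNode {
    TrieNode* parent = nullptr;  // link used to walk back up to the root
    char ch = '\0';              // character on the edge from parent to this node
    std::map<char, std::unique_ptr<TrieNode>> children;
};

class PrefixPool {
public:
    // Inserts s (sharing any existing prefix) and returns a handle to its last node.
    const TrieNode* insert(const std::string& s) {
        TrieNode* node = &root_;
        for (char c : s) {
            auto& child = node->children[c];
            if (!child) {
                child = std::make_unique<TrieNode>();
                child->parent = node;
                child->ch = c;
            }
            node = child.get();
        }
        return node;
    }

    // Rebuilds the string by walking from the handle up to the root.
    static std::string to_string(const TrieNode* node) {
        std::string out;
        for (; node->parent != nullptr; node = node->parent) {
            out.push_back(node->ch);
        }
        std::reverse(out.begin(), out.end());
        return out;
    }

private:
    TrieNode root_;  // handles point into this tree, so the pool must not be moved
};
```

Inserting "hello world", "hello winter" and "hell" produces the shape drawn above: the three handles play the role of [0], [1] and [2].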
Here is one more solution: http://en.wikipedia.org/wiki/Rope_(data_structure)
libstdc++ implementation: https://gcc.gnu.org/onlinedocs/libstdc++/libstdc++-html-USERS-4.3/a00223.html
SGI documentation: http://www.sgi.com/tech/stl/Rope.html
But I think you need to construct your strings in a particular way for a rope to work well. Maybe find the longest common prefix and suffix that each new string shares with the previous string, and then express the new string as the concatenation of the previous string's prefix, the unique part, and the previous string's suffix.
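The prefix/unique part/suffix decomposition could look roughly like this. This is not a rope, only a sketch of the decomposition step; Delta, make_delta and expand are hypothetical names, and a real implementation would reference shared rope nodes instead of copying substrings back out.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical type: a string described relative to a previously stored one.
struct Delta {
    std::size_t prefix_len;  // characters shared with the start of the base string
    std::size_t suffix_len;  // characters shared with the end of the base string
    std::string middle;      // the only characters stored for the new string
};

Delta make_delta(const std::string& base, const std::string& s) {
    const std::size_t limit = std::min(base.size(), s.size());
    std::size_t p = 0;
    while (p < limit && base[p] == s[p]) ++p;
    std::size_t q = 0;  // the suffix must not overlap the prefix in either string
    while (q < limit - p && base[base.size() - 1 - q] == s[s.size() - 1 - q]) ++q;
    return {p, q, s.substr(p, s.size() - p - q)};
}

// Rebuilds the full string from the base string and the stored delta.
std::string expand(const std::string& base, const Delta& d) {
    return base.substr(0, d.prefix_len) + d.middle +
           base.substr(base.size() - d.suffix_len);
}
```

For the two URLs in example 1, the delta of str2 against str1 keeps only the single character "2" in middle.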

For example 1, what I can come up with is a radix tree, a space-optimized version of a trie. A quick Google search turned up quite a few implementations in C++.
For example 2, I am also curious about the answer!

First of all, note that std::string is not immutable, so you have to make sure that none of these strings is accidentally modified.
The right approach depends on the pattern of the strings. I suggest using hash tables (std::unordered_map in C++11). The exact details depend on how you are going to access these strings.
The two strings you have provided differ only after the "?ch" part. If you expect many strings to have long common prefixes of almost the same size, you can do the following:
Let's say the size of a prefix is 42 chars. Let s be a string. Then we can use s[0..41] as a key into the hash table and the rest of the string as the value.
For example, given "http://www.contoso.com/some/path/app.aspx?ch=test1" the key would be "http://www.contoso.com/some/path/app.aspx?" and "ch=test1" would be the value. If the key already exists in the hash table, you can just add the value to the collection of values associated with that key. Otherwise, add the key/value pair.
This is just an example; what the key is and what the value is depend on how you are going to access these strings.
Also, if all strings have "=test" in them, then you don't have to store this with every value. You can just store it once and then insert it when retrieving a string. So given the value "ch1=test1", what will be stored is just "ch11". Again, this depends on the pattern of the strings.
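A rough sketch of the prefix-keyed table, assuming a fixed split offset chosen by the caller (42 for the example URL). PrefixTable and its members are hypothetical names. Keep in mind that each small std::string tail carries its own overhead, so the actual saving depends on how long the shared prefix is.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical container: splits each string at a fixed offset and stores the
// leading part once, as a hash-table key.
class PrefixTable {
public:
    explicit PrefixTable(std::size_t prefix_len) : prefix_len_(prefix_len) {}

    void add(const std::string& s) {
        // substr clamps the count, so a string shorter than prefix_len_
        // becomes a key with an empty tail.
        std::string key = s.substr(0, prefix_len_);
        std::string tail = s.size() > prefix_len_ ? s.substr(prefix_len_) : std::string();
        table_[key].push_back(tail);
    }

    // Number of distinct prefixes actually stored.
    std::size_t prefix_count() const { return table_.size(); }

    // Tails stored under a prefix (nullptr if the prefix is unknown).
    const std::vector<std::string>* tails(const std::string& prefix) const {
        auto it = table_.find(prefix);
        return it == table_.end() ? nullptr : &it->second;
    }

private:
    std::size_t prefix_len_;
    std::unordered_map<std::string, std::vector<std::string>> table_;
};
```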

Related

Is there a way to restrict string manipulation e.g substring?

The problem is that I'm processing some UTF8 strings and I would like to design a class or a way to prevent string manipulations.
String manipulation is not desirable for strings of multibyte characters, as splitting the string at an arbitrary position (which is measured in bytes) may split a character halfway.
I have thought about using const std::string&, but the user/developer can still create a substring by calling std::string::substr.
Another way would be to create a wrapper around const std::string& and expose the string only through getters.
Is this even possible?
"Another way would be to create a wrapper around const std::string& and expose the string only through getters."
You need a class wrapping a std::string or std::u8string, not a reference to one. The class then owns the string and its contents, basically just using it as storage, and can provide whatever interface you see fit to operate on Unicode code points or characters instead of modifying the storage directly.
However, there is nothing in the standard library that will help you implement this. So a better approach would be to use a third party library that already does this for you. Operating on code points in a UTF-8 string is still reasonably simple and you can implement that part yourself, but if you want to operate on characters (in the sense of grapheme clusters or whatever else is suitable) implementation is going to be a project in itself.
I would use a wrapper where your external interface provides access to either code points, or to characters. So, foo.substr(3, 4) (for example) would skip the first 3 code points, and give you the next 4 code points. Alternatively, it would skip the first 3 characters, and give you the next 4 characters.
Either way, that would be independent of the number of bytes used to represent those code points or characters.
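As a sketch of the code-point flavour of such a wrapper: the class below (Utf8Text is a made-up name) owns the bytes and counts in code points, not bytes. It assumes the input is already valid UTF-8 and does no validation, and it knows nothing about grapheme clusters.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical wrapper: owns the bytes and only exposes code-point operations.
// Assumes the input is already valid UTF-8; no validation is performed.
class Utf8Text {
public:
    explicit Utf8Text(std::string bytes) : bytes_(std::move(bytes)) {}

    // Number of code points: every byte that is not a continuation byte
    // (bit pattern 10xxxxxx) starts a new one.
    std::size_t length() const {
        std::size_t n = 0;
        for (unsigned char c : bytes_) {
            if (!is_continuation(c)) ++n;
        }
        return n;
    }

    // Skips `pos` code points, then returns the next `count` code points.
    Utf8Text substr(std::size_t pos, std::size_t count) const {
        const std::size_t begin = advance(0, pos);
        const std::size_t end = advance(begin, count);
        return Utf8Text(bytes_.substr(begin, end - begin));
    }

    const std::string& bytes() const { return bytes_; }  // read-only access

private:
    static bool is_continuation(unsigned char c) { return (c & 0xC0) == 0x80; }

    // Byte offset reached after moving `n` code points forward from `offset`.
    std::size_t advance(std::size_t offset, std::size_t n) const {
        while (n > 0 && offset < bytes_.size()) {
            ++offset;  // step over the lead byte
            while (offset < bytes_.size() &&
                   is_continuation(static_cast<unsigned char>(bytes_[offset]))) {
                ++offset;
            }
            --n;
        }
        return offset;
    }

    std::string bytes_;
};
```

Because the only mutating access is through the constructor, a caller cannot cut a multibyte sequence in half; substr always lands on code-point boundaries.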
Quick aside on terminology for anybody unaccustomed to Unicode terminology: ISO 10646 is basically a long list of code points, each assigned a name and a number from 0 to (about) 2^20-1 (the exact upper limit is 0x10FFFF). UTF-8 encodes a code point number in a sequence of 1 to 4 bytes.
A character can consist of a (more or less) arbitrary number of code points. It will consist of a base character (e.g., a letter) followed by some number of combining diacritical marks. For example, à can be encoded as an a followed by a "combining grave accent" (U+0300); a precomposed form, U+00E0, also exists.
The a and the U+0300 are each a code point. When encoded in UTF-8, the a would be encoded in a single byte and the U+0300 would be encoded in two bytes (0xCC 0x80). So, it's one character composed of two code points encoded in 3 bytes.
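To check byte counts like these yourself, a small encoder is enough. This is a sketch of the standard UTF-8 bit layout; encode_utf8 is a hypothetical helper name, and it assumes the argument is a valid code point (at most 0x10FFFF, not a surrogate).

```cpp
#include <string>

// Encodes one code point as UTF-8. Assumes cp is a valid scalar value
// (at most 0x10FFFF and not a surrogate); no checking is done.
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {                 // 1 byte:  0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {         // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {       // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                         // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Calling it for U+0300 yields the two bytes 0xCC 0x80, so an a followed by U+0300 occupies three bytes in total.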
That's not quite all there is to characters (as opposed to code points) but it's sufficient for quite a few languages (especially, for the typical European languages like Spanish, German, French, and so on).
There are a fair number of other points that become non-trivial though. For example, German has a letter "ß". This is one character, but when you're doing string comparison, it should (at least normally) compare as equal to "ss". I believe there's been a move to change this but at least classically, it hasn't had an upper-case equivalent either, so both comparison and case conversion with it get just a little bit tricky.
And that's fairly mild compared to situations that arise in some of the more "exotic" languages. But it gives a general idea of the fact that yes, if you want to deal intelligently with Unicode strings, you basically have two choices: either have your code use ICU [1] to do most of the real work, or else resign yourself to this being a multi-year project in itself.
[1] In theory, you could use another suitable library, but in this case I'm not aware of such a thing existing.

Properties of a string class in C++

I am working through C++ Program Design by Cohoon and Davidson. This is what it says about string class attributes (3rd Edition, Page 123):
Characters that comprise the string
The number of characters in the string
My question is: if we know the characters in the string, doesn't that imply we already know the number of characters in the string? Why is there a need to explicitly specify the second attribute?
You are right, but the length is needed in many places, such as counting or knowing the length/end of malloc'ed memory, so it is better to store the length as an additional property to make your program run fast.
Consider what would happen if the program had to count the chars one by one just to tell you how many there are, especially when this is done frequently.
So storing the length simply saves time.
That is why practically all real implementations of string classes store the length of the string.
"If we know the characters in the string, doesn't that imply we already know the number of characters in the string?"
Well, in C we know the number of elements because we can count up to the null terminator. But think how expensive it is to get the length of a string that way: it takes walking the entire string. For such a common operation, why wouldn't we want this to be a constant-time operation?
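A small sketch of the difference (counted_length and stored_length are illustrative names only). It also shows a second reason for keeping the count: a std::string may contain '\0' characters, in which case counting up to the terminator gives the wrong answer.

```cpp
#include <cstddef>
#include <string>

// O(n): the length of a C string has to be rediscovered by scanning for '\0'.
std::size_t counted_length(const char* s) {
    std::size_t n = 0;
    while (s[n] != '\0') ++n;
    return n;
}

// O(1): std::string keeps the length alongside the characters.
std::size_t stored_length(const std::string& s) {
    return s.size();
}
```

For std::string s("a\0b", 3), stored_length(s) is 3 while counted_length(s.c_str()) stops at the embedded null and returns 1.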

C++ reinterpret_cast, making unique number

Recently I have been using some code to make a unique int for my classes.
I used reinterpret_cast<int>(my_unique_name) where my_unique_name is a char [] variable with unique value. Something like below:
const char my_unique_name[] = "test1234";
int generate_unique_id_from_string(const char *str)
{
    return reinterpret_cast<int>(str);
}
My question is: is the generated int really unique for all input strings?
No, it is not. You are casting the address of the string, not its contents.
To create a numeric value based on string input, use a hash function. However, this doesn't create a truly unique number, because of the so-called pigeonhole principle.
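For illustration, here is one well-known hash function, 64-bit FNV-1a, written out by hand; std::hash<std::string> would do the same job. The function name fnv1a is just a label chosen here. It derives the number from the characters, so equal strings in different buffers get the same value, but as noted, distinct strings can still collide.

```cpp
#include <cstdint>
#include <string>

// 64-bit FNV-1a: derives the value from the characters, not from the address.
std::uint64_t fnv1a(const std::string& s) {
    std::uint64_t h = 14695981039346656037ULL;  // FNV offset basis
    for (unsigned char c : s) {
        h ^= c;                                  // mix in one byte
        h *= 1099511628211ULL;                   // FNV prime (wraps modulo 2^64)
    }
    return h;
}
```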
It would depend. Here's my interpretation of your question:
You're trying to assign a different number to each string. And identical strings from different sources will have different IDs.
Case 1:
If str happens to be a reusable buffer that you use to read in those strings from wherever, then they'll all have the same base address. So no, it will not be unique.
Case 2:
str happens to be a heap-allocated string. Furthermore, all the strings that will be ID'ed have overlapping lifetimes. Then yes, the IDs will be unique because they all reside in memory at the same time at different addresses.
EDIT:
If you want to generate a unique ID, but you want identical strings to have the same ID, then look at Greg's answer for a hash function.
It may not always be unique:
id1 = generate_unique_id_from_string("test12344444444444444");
id2 = generate_unique_id_from_string("Test12344444444444444");
Also, I think this will depend on the endianness of the platform.

Write a program to count how many times each distinct word appears in its input

This is exercise 3-3 in Accelerated C++.
I am new to C++. I have thought about this for a long time, but I can't figure it out.
Could anyone solve this problem for me?
Please explain it in detail, as I am not very good at programming. Tell me the meaning of the variables you use.
The best data structure for this is something like a std::map<std::string,unsigned>, but you don't encounter maps until chapter 7.
Here are some hints based on the contents of chapter 3:
You can put strings in a vector, so you can have std::vector<std::string>
Strings can be compared, so std::sort works with std::vector<std::string>, and you can check if two strings are the same with s1==s2 just like for integers.
You saw in chapter 1 that std::cin >> s reads a word from std::cin into s if s is a std::string.
To provide maximal learning experience, I will not provide pastable code. That's an exercise. You have to do it yourself to learn as much as you can.
This is the perfect scenario for employing a kind of map that creates its value type upon accessing a non-existing key. Fortunately, C++ has such a map in its standard library: std::map<key_type,value_type> is exactly what you need.
So here are the jigsaw pieces:
you can read word by word from a stream into a string by using operator >>
you can store what you find in a map of words (strings) to occurrences (unsigned number type)
when you access an entry in the map through a non-existing key, the map will helpfully create a new default-constructed value under that key for you; if the value happens to be a number, default-construction will set it to 0 (zero)
Have fun putting this together!
Here's my hint. std::map will be your friend.
Here is an algorithm you could use. Try coding something and post your results here; people can then help you get further.
Scan down the string collecting each letter until you get to a word boundary (say space or . or , etc).
Take that word and compare it to the words you've already found. If it has already been found, add one to the count for that word. If it hasn't, add that word to the list of words found with a count of 1.
Carry on down the string.
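The steps above can be sketched using only vectors and strings. The names WordCounts, record and count_words_in are invented for this example, and the boundary set (space, '.' and ',') is a deliberately minimal assumption.

```cpp
#include <string>
#include <utility>
#include <vector>

// List of (word, count) pairs in order of first appearance.
using WordCounts = std::vector<std::pair<std::string, unsigned>>;

// Adds one occurrence of word, appending it if it has not been seen before.
void record(WordCounts& counts, const std::string& word) {
    for (auto& entry : counts) {
        if (entry.first == word) {
            ++entry.second;
            return;
        }
    }
    counts.push_back({word, 1u});
}

// Scans text, treating space, '.' and ',' as word boundaries.
WordCounts count_words_in(const std::string& text) {
    WordCounts counts;
    std::string current;  // letters collected since the last boundary
    for (char c : text) {
        if (c == ' ' || c == '.' || c == ',') {
            if (!current.empty()) {
                record(counts, current);
                current.clear();
            }
        } else {
            current += c;
        }
    }
    if (!current.empty()) record(counts, current);  // last word has no trailing boundary
    return counts;
}
```

The linear search in record makes this quadratic in the number of distinct words, which is fine for an exercise but is exactly what a map avoids.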
Well, you need a way of getting individual words from the input stream (perhaps something like an "input stream" method applied to the "standard input stream") and a way of storing those strings and counts in some sort of "collection".
My natural homework cynicism and general apathy towards life prevent me from adding more detail at the moment :-)
The meaning of any variables I use is fairly self-evident since I tend to use things like objectsRemaining or hasBeenOpened.

Most Efficient way to 'look up' Keywords

Alright so I am writing a function as part of a lexical analyzer that 'looks up' or searches for a match with a keyword. My lexer catches all the obvious tokens such as single and multi character operators (+ - * / > < = == etc) (also comments and whitespace are already taken out) so I call a function after I've collected a stream of only alphanumeric characters (including underscores) into a string, this string then needs to be matched as either a known keyword or an identifier.
So I was wondering how I might go about identifying it. I know I basically need to compare it to some list or array of all the built-in keywords, and if it matches one, return the corresponding enum value; otherwise, if there is no match, it must be a function or variable identifier. So how should I look for matches? I read somewhere that a binary search tree or a hash table is an efficient way to do it; the problem is that I've never used either, so I am not sure if it's the right way. Could I possibly use a MySQL database?
If your set of keywords is fixed, a perfect hash can be built for O(1) lookup. Check out gperf or cmph.
A "trie" will surely be the most efficient way.
Whatever implementation of std::map you have will probably be sufficient.
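A minimal sketch of the std::map route, with a made-up Token enum and four made-up keywords standing in for whatever the real language defines:

```cpp
#include <map>
#include <string>

// Hypothetical token kinds for an imaginary language.
enum class Token { If, Else, While, Return, Identifier };

// Returns the keyword's token, or Token::Identifier if word is not a keyword.
Token classify(const std::string& word) {
    static const std::map<std::string, Token> keywords = {
        {"if", Token::If},
        {"else", Token::Else},
        {"while", Token::While},
        {"return", Token::Return},
    };
    auto it = keywords.find(word);  // O(log n) comparisons
    return it == keywords.end() ? Token::Identifier : it->second;
}
```

With a few dozen keywords the logarithmic lookup is only a handful of string comparisons, which is why the choice of container rarely matters here.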
This is for a language, with a specific set of keywords that never change, and there aren't very many of them?
If so, it probably doesn't matter what you use. You will have bigger fish to fry.
However, since the list doesn't change, it would be hard to beat a hard coded search like this:
// search on first letter
switch (s[0]) {
    case 'a':
        // search on 2nd letter, etc.
        break;
    case 'b':
        // search on 2nd letter, etc.
        break;
    // ........
    case '_':
        // search on 2nd letter, etc.
        break;
}
For single-character keywords a lookup table would be perfect. For multi-character keywords (especially if the lengths differ): a hash table. If you need performance, you could even use source code generation to create the hash tables (using a simple hash function that may or may not ignore case, depending on your syntax).
So I'd implement it with a LUT and a hash table: first you check the first character with the LUT (if it's a simple operator, it would start with a non-alpha-numeric value), and, if not found, check the hash table.
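Roughly, the LUT-then-hash-table combination might look like the sketch below. The Tok enum, the four operators and the two keywords are placeholders invented for the example, not part of any real lexer.

```cpp
#include <array>
#include <string>
#include <unordered_map>

// Hypothetical token kinds; None marks "no entry" in the lookup table.
enum class Tok { None, Plus, Minus, Star, Slash, KwIf, KwWhile, Identifier };

// Builds a 256-entry table indexed by the character's unsigned value.
std::array<Tok, 256> make_operator_table() {
    std::array<Tok, 256> table;
    table.fill(Tok::None);
    table[static_cast<unsigned char>('+')] = Tok::Plus;
    table[static_cast<unsigned char>('-')] = Tok::Minus;
    table[static_cast<unsigned char>('*')] = Tok::Star;
    table[static_cast<unsigned char>('/')] = Tok::Slash;
    return table;
}

Tok lookup(const std::string& lexeme) {
    static const std::array<Tok, 256> operators = make_operator_table();
    static const std::unordered_map<std::string, Tok> keywords = {
        {"if", Tok::KwIf},
        {"while", Tok::KwWhile},
    };
    if (lexeme.empty()) return Tok::None;
    if (lexeme.size() == 1) {
        // Single characters: one array index, no hashing needed.
        Tok t = operators[static_cast<unsigned char>(lexeme[0])];
        if (t != Tok::None) return t;
    }
    // Everything else: average O(1) hash lookup, falling back to identifier.
    auto it = keywords.find(lexeme);
    return it == keywords.end() ? Tok::Identifier : it->second;
}
```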