Appending strings in C++ [duplicate]

This question already has answers here:
C++ equivalent of StringBuffer/StringBuilder?
(10 answers)
Closed 9 years ago.
Consider this piece of code:
public String joinWords(String[] words) {
    String sentence = "";
    for (String w : words) {
        sentence = sentence + w;
    }
    return sentence;
}
On each concatenation a new copy of the string is created, so that the overall complexity is O(n^2). Fortunately in Java we could solve this with a StringBuffer, which has O(1) complexity for each append, then the overall complexity would be O(n).
In C++, meanwhile, std::string::append() has O(n) complexity, and I'm not clear about the complexity of stringstream.
In C++, are there methods like those in StringBuffer with the same complexity?

C++ strings are mutable, and pretty much as dynamically sizable as a StringBuffer. Unlike its equivalent in Java, this code wouldn't create a new string each time; it just appends to the current one.
std::string joinWords(std::vector<std::string> const &words) {
    std::string result;
    for (auto &word : words) {
        result += word;
    }
    return result;
}
In practice this runs in amortized linear time, since the string grows geometrically; if you reserve the size you'll need beforehand, you avoid the intermediate reallocations entirely. The question is whether looping over the vector to get the sizes would be slower than letting the string auto-resize. That, I couldn't tell you. Time it. :)
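For reference, a minimal sketch of the reserve-first variant (whether it actually wins is exactly what you'd have to time):

std::string joinWords(std::vector<std::string> const &words) {
    std::size_t total = 0;
    for (auto &word : words)
        total += word.size();   // first pass: compute the final length
    std::string result;
    result.reserve(total);      // one allocation up front
    for (auto &word : words)
        result += word;         // these appends never reallocate
    return result;
}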
If you don't want to use std::string itself for some reason (and you should consider it; it's a perfectly respectable class), C++ also has string streams.
#include <sstream>
...
std::string joinWords(std::vector<std::string> const &words) {
    std::ostringstream oss;
    for (auto &word : words) {
        oss << word;
    }
    return oss.str();
}
It's probably not any more efficient than using std::string, but it's a bit more flexible in other cases -- you can stringify just about any primitive type with it, as well as any type that has an overloaded operator<<(std::ostream&, const its_type&).
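For instance, a trivial illustration mixing types that std::string::operator+= would not accept directly:

std::ostringstream oss;
oss << "id=" << 42 << " ratio=" << 2.5 << " ok=" << std::boolalpha << true;
std::string s = oss.str(); // "id=42 ratio=2.5 ok=true"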

This is somewhat tangential to your Question, but relevant nonetheless. (And too big for a comment!!)
On each concatenation a new copy of the string is created, so that the overall complexity is O(n^2).
In Java, the complexity of s1.concat(s2) or s1 + s2 is O(M1 + M2) where M1 and M2 are the respective String lengths. Turning that into the complexity of a sequence of concatenations is difficult in general. However, if you assume N concatenations of Strings of length M, then the complexity is indeed O(M * N * N) which matches what you said in the Question.
Fortunately in Java we could solve this with a StringBuffer, which has O(1) complexity for each append, then the overall complexity would be O(n).
In the StringBuilder case, the amortized complexity of N calls to sb.append(s) for strings of size M is O(M*N). The key word here is amortized. When you append characters to a StringBuilder, the implementation may need to expand its internal array. But the expansion strategy is to double the array's size. And if you do the math, you will see that each character in the buffer is going to be copied on average one extra time during the entire sequence of append calls. So the complexity of the entire sequence of appends still works out as O(M*N) ... and, as it happens M*N is the final string length.
So your end result is correct, but your statement about the complexity of a single call to append is not correct. (I understand what you mean, but the way you say it is facially incorrect.)
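Since this thread is about C++: the same geometric growth can be observed with std::string. Here's a small sketch that prints the reallocation points; the exact growth factor is implementation-defined, so the output will vary:

#include <iostream>
#include <string>

int main()
{
    std::string s;
    std::size_t last_cap = s.capacity();
    for (int i = 0; i < 1000; ++i) {
        s += 'x';
        if (s.capacity() != last_cap) {   // a reallocation just happened
            last_cap = s.capacity();
            std::cout << "size " << s.size() << " -> capacity " << last_cap << '\n';
        }
    }
}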
Finally, I'd note that in Java you should use StringBuilder rather than StringBuffer unless you need the buffer to be thread-safe.

As an example of a really simple structure that has O(n) complexity in C++11:
template<typename TChar>
struct StringAppender {
    std::vector<std::basic_string<TChar>> buff;
    StringAppender& operator+=( std::basic_string<TChar> v ) {
        buff.push_back(std::move(v));
        return *this;
    }
    explicit operator std::basic_string<TChar>() {
        std::basic_string<TChar> retval;
        std::size_t total = 0;
        for( auto&& s : buff )
            total += s.size();
        retval.reserve(total+1);
        for( auto&& s : buff )
            retval += std::move(s);
        return retval;
    }
};
use:
StringAppender<char> append;
append += s1;
append += s2;
std::string s3(append); // direct-initialization, because the conversion operator is explicit
This takes O(n), where n is the number of characters.
Finally, if you know how long all of the strings are, just doing a reserve with enough room makes append or += take a total of O(n) time. But I agree that is awkward.
Use of std::move with the above StringAppender (i.e., sa += std::move(s1)) will significantly increase performance for non-short strings, as will passing other xvalues.
I do not know the complexity of std::ostringstream, but ostringstream is for pretty-printing formatted output, or for cases where high performance is not important. I mean, they aren't bad, and they may even outperform scripted/interpreted/bytecode languages, but if you are in a rush, you need something else.
As usual, you need to profile, because constant factors are important.
An rvalue-reference-to-this operator+ might also be a good one, but few compilers implement rvalue references to *this.
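A sketch of what that might look like as an extra member of the StringAppender above (hypothetical extension; requires ref-qualifier support):

// Callable only on rvalues, so a chain of + on a temporary reuses one buffer.
StringAppender&& operator+(std::basic_string<TChar> v) && {
    buff.push_back(std::move(v));
    return std::move(*this);
}
// usage: std::string s(StringAppender<char>{} + s1 + s2);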

Related

Best way to concatenate and condense a std::vector<std::string>

Disclaimer: This problem is more of a theoretical, rather than a practical interest. I want to find out various different ways of doing this, with speed as icing on the new year cake.
The Problem
I want to be able to store a list of strings, and be able to quickly combine them into 1 if needed.
In short, I want to condense a structure (currently a std::vector<std::string>) that looks like
["Hello, ", "good ", "day ", " to", " you!"]
to
["Hello, good day to you!"]
Is there any idiomatic way to achieve this, à la Python's [ ''.join(list_of_strings) ]?
What is the best way to achieve this in C++, in terms of time?
Possible Approaches
The first idea I had is to
loop over the vector,
append each element to the first,
simultaneously delete the element.
We will be concatenating with += and reserve(). I assume that max_size() will not be reached.
Approach 1 (The Greedy Approach)
So called because it ignores conventions and operates in-place.
#if APPROACH == 'G'
// Greedy Approach
void condense(std::vector< std::string >& my_strings, int total_characters_in_list)
{
    // Reserve the size for all characters, less than max_size()
    my_strings[0].reserve(total_characters_in_list);
    // There are strings left, ...
    for(auto itr = my_strings.begin() + 1; itr != my_strings.end();)
    {
        // append, and...
        my_strings[0] += *itr;
        // delete, until...
        itr = my_strings.erase(itr);
    }
}
#endif
Now I know, you would say that this is risky and bad. So:
loop over the vector,
append each element to another std::string,
clear the vector and make the string first element of the vector.
Approach 2 (The "Safe" Haven)
So called because it does not modify the container while iterating over it.
#if APPROACH == 'H'
// Safe Haven Approach
void condense(std::vector< std::string >& my_strings, int total_characters_in_list)
{
    // Store the whole vector here
    std::string condensed_string;
    condensed_string.reserve(total_characters_in_list);
    // There are strings left...
    for(auto itr = my_strings.begin(); itr != my_strings.end(); ++itr)
    {
        // append, until...
        condensed_string += *itr;
    }
    // remove all elements except the first
    my_strings.resize(1);
    // and set it to condensed_string
    my_strings[0] = condensed_string;
}
#endif
Now for the standard algorithms...
Using std::accumulate from <algorithm>
Approach 3 (The Idiom?)
So called simply because it is a one-liner.
#if APPROACH == 'A'
// Accumulate Approach
void condense(std::vector< std::string >& my_strings, int total_characters_in_list)
{
    // Reserve the size for all characters, less than max_size()
    my_strings[0].reserve(total_characters_in_list);
    // Accumulate all the strings
    my_strings[0] = std::accumulate(my_strings.begin(), my_strings.end(), std::string(""));
    // And resize
    my_strings.resize(1);
}
#endif
Why not try to store it all in a stream?
Using std::stringstream from <sstream>.
Approach 4 (Stream of Strings)
So called due to the analogy of C++'s streams with flow of water.
#if APPROACH == 'S'
// Stringstream Approach
void condense(std::vector< std::string >& my_strings, int) // you can remove the int
{
    // Create out stream
    std::stringstream buffer(my_strings[0]);
    // There are strings left, ...
    for(auto itr = my_strings.begin(); itr != my_strings.end(); ++itr)
    {
        // add until...
        buffer << *itr;
    }
    // resize and assign
    my_strings.resize(1);
    my_strings[0] = buffer.str();
}
#endif
However, maybe we can use another container rather than std::vector?
In that case, what else?
(Possible) Approach 5 (The Great Indian "Rope" Trick)
I have heard about the rope data structure, but have no idea if (and how) it can be used here.
Benchmark and Verdict:
Ordered by their time efficiency (currently and surprisingly) is¹:

Approach             Vector Size: 40        Vector Size: 1600      Vector Size: 64000
SAFE_HAVEN:          0.1307962699997006     0.12057728999934625    0.14202970000042114
STREAM_OF_STRINGS:   0.12656566000077873    0.12249500000034459    0.14765803999907803
ACCUMULATE_WEALTH:   0.11375975999981165    0.12984520999889354    3.748660090001067
GREEDY_APPROACH:     0.12164988000004087    0.13558526000124402    22.6994204800023
timed with²:

NUM_OF_ITERATIONS = 100
test_cases = [ 'greedy_approach', 'safe_haven' ]
for approach in test_cases:
    time_taken = timeit.timeit(
        f'system("{approach + ".exe"}")',
        'from os import system',
        number = NUM_OF_ITERATIONS
    )
    print(approach + ": ", time_taken / NUM_OF_ITERATIONS)
Can we do better?
Update: I tested it with 4 approaches (so far), as many as I could manage in my limited time. More incoming soon. It would have been better to fold the code, so that more approaches could be added to this post, but it was declined.
¹ Note that these readings are only for a rough estimate. There are a lot of things that influence the execution time, and note that there are some inconsistencies here as well.
² This is the old code, used to test only the first two approaches. The current code is a good deal longer, and more integrated, so I am not sure I should add it here.
Conclusions:
Deleting elements is very costly.
You should just copy the strings somewhere, and resize the vector.
In fact, better to reserve enough space too, if copying to another string.
You could also try std::accumulate:
auto s = std::accumulate(my_strings.begin(), my_strings.end(), std::string());
Won't be any faster, but at least it's more compact.
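One caveat worth knowing: before C++20, std::accumulate copies the accumulator at every step, which keeps this quadratic. Since C++20 the accumulator is moved between steps, so a sketch like the following should be linear (assuming a C++20 standard library):

#include <numeric>
#include <string>
#include <vector>

std::string join(const std::vector<std::string>& v) {
    // C++20 std::accumulate moves `acc` from step to step
    return std::accumulate(v.begin(), v.end(), std::string(),
        [](std::string acc, const std::string& piece) {
            acc += piece;  // append into the moved-in buffer
            return acc;
        });
}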
With range-v3 (and soon with C++20 ranges), you might do:
std::vector<std::string> v{"Hello, ", "good ", "day ", " to", " you!"};
std::string s = v | ranges::view::join;
By default, I would use std::stringstream. Simply construct the stream, stream in all the strings from the vector, and then return the output string. It isn't very efficient, but it is clear what it does.
In most cases one doesn't need a fast method when dealing with strings and printing, so the "easy to understand and safe" methods are better. Plus, compilers nowadays are good at optimizing away inefficiencies in simple cases.
The most efficient way... it is a hard question. Some applications require efficiency on multiple fronts. In these cases you might need to utilize multithreading.
Personally, I'd construct a second vector to hold a single "condensed" string, construct the condensed string, and then swap vectors when done.
void Condense(std::vector<std::string> &strings)
{
    std::vector<std::string> condensed(1);        // one default constructed std::string
    std::string &constr = condensed.front();      // reference to first element of condensed
    for (const auto &str : strings)
        constr.append(str);
    std::swap(strings, condensed);                // swap newly constructed vector into original
}
If an exception is thrown for some reason, then the original vector is left unchanged, and cleanup occurs - i.e. this function gives a strong exception guarantee.
Optionally, to reduce resizing of the "condensed" string, after initialising constr in the above, one could do
// optional: compute the length of the condensed string and reserve
std::size_t total_characters_in_list = 0;
for (const auto &str : strings)
    total_characters_in_list += str.size();
constr.reserve(total_characters_in_list);
// end optional reservation
As to how efficient this is compared with alternatives, that depends. I'm also not sure it's relevant - if strings keep on being appended to the vector, and needing to be appended, there is a fair chance that the code that obtains the strings from somewhere (and appends them to the vector) will have a greater impact on program performance than the act of condensing them.

Redefine data area for faster access

New to C++. I've searched but am probably using the wrong terms.
I want to find which slot in an array of many slots holds a literal value a few bytes long. Currently I check each slot sequentially.
If I could use an internal function to scan the whole array as if it were one big string, I feel this would be much faster. (Old COBOL programmer.)
Any way I can do this, please?
I want to find which slot in an array of many slots holds a literal value a few bytes long. Currently I check each slot sequentially.
OK, I'm going to take a punt and infer that:
you want to store string literals of any length in some kind of container.
the container must be mutable (i.e. you can add literals at will)
there will not be duplicates in the container.
you want to know whether a string literal has been stored in the container previously, and what "position" it was at so that you can remove it if necessary.
the string literals will be inserted in random lexicographical order and need not be sorted.
The container that springs to mind is the std::unordered_set
#include <cassert>
#include <string>
#include <unordered_set>

std::unordered_set<std::string> tokens;

int main()
{
    tokens.emplace("foo");
    tokens.emplace("bar");
    auto it = tokens.find("baz");
    assert(it == tokens.end()); // not found
    it = tokens.find("bar");    // will be found
    assert(it != tokens.end());
    tokens.erase(it);           // remove the token
}
The search time complexity of this container is O(1) on average.
As you already found out by the comments, "scanning as one big string" is not the way to go in C++.
Typical in C++ when using C-style arrays, and normally fast enough, is a linear search:
auto myStr = "result";
auto it = std::find_if(std::begin(arr), std::end(arr),
                       [myStr](const char* const str) { return std::strcmp(myStr, str) == 0; });
Remember that the string compare functions stop at the first mismatching character.
More C++ style:
std::vector<std::string> vec = { "res1", "res2", "res3" };
std::string myStr = "res2";
auto it = std::find(vec.begin(), vec.end(), myStr);
If you are interested in very fast lookup in a large container, std::unordered_set is the way to go; the "slot" loses its meaning then, but in that case std::unordered_map can be used.
std::unordered_set<std::string> s= { "res1", "res2", "res3" };
std::string myStr = "res2";
auto it = s.find(myStr);
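If the slot index itself matters, a hedged sketch with std::unordered_map (the names and indices are illustrative):

std::unordered_map<std::string, std::size_t> slot_of = {
    { "res1", 0 }, { "res2", 1 }, { "res3", 2 }
};
auto found = slot_of.find("res2");
if (found != slot_of.end()) {
    std::size_t slot = found->second; // index of the slot holding the value
}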
All code is written as example, not compiled/tested

Need for Fast map between string and integers

I have a map of string and unsigned, in which I store a word to its frequency of the following form:
map<string,unsigned> mapWordFrequency; //contains 1 billion such mappings
Then I read a huge file (100GB), and retain only those words in the file which have a frequency greater than 1000. I check the frequency of the words in the file using mapWordFrequency[word] > 1000. However, since my mapWordFrequency has 1 billion mappings and my file is huge, trying to check mapWordFrequency[word] > 1000 for each and every word in the file is very slow and takes more than 2 days. Can someone please suggest how I can improve the efficiency of the above code?
map does not fit in my RAM and swapping is consuming a lot of time.
Would erasing all words having frequency < 1000 help using erase function of map?
I suggest you use an unordered_map as opposed to a map. As already discussed in the comments, the former will give you average O(1) insertion/retrieval time as opposed to O(log n) in a map.
As you have already said, memory swapping is consuming a lot of time. So how about tackling the problem incrementally: load as much of the data into an unordered_map as memory allows, process it, and continue. After one pass, you should have a lot of unordered_maps, which you can start to combine in subsequent passes.
You could improve the speed by doing this in a distributed manner: process the pieces of data on different computers, and then combine the data (which will be in the form of unordered maps). However, I have no prior experience in distributed computing, and so can't help beyond this.
Also, if implementing something like this is too cumbersome, I suggest you use external mergesort. It is a method of sorting a file too large to fit into memory by sorting smaller chunks and combining them. The reason I'm suggesting this is that external mergesort is a pretty common technique, and you might find already implemented solutions for your need. Even though the time complexity of sorting is higher than your idea of using a map, it will reduce the overhead in swapping as compared to a map. As pointed out in the comments, sort in linux implements external mergesort.
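A minimal sketch of the incremental idea from above (the size budget and stream handling are illustrative, not from the question):

#include <fstream>
#include <string>
#include <unordered_map>

// Count words until the map reaches a size budget; the caller can then
// flush the partial counts to disk and call again with a fresh map.
bool count_chunk(std::istream& in, std::size_t max_entries,
                 std::unordered_map<std::string, unsigned>& counts)
{
    std::string word;
    while (in >> word) {
        ++counts[word];
        if (counts.size() >= max_entries)
            return true;   // budget reached, more input remains
    }
    return false;          // input exhausted
}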
You can use a hash map where your hashed string will be the key and the occurrence count the value. It will be faster. You can choose a good string hash function based on your requirements. Here is a link to some good hashing functions:
http://eternallyconfuzzled.com/tuts/algorithms/jsw_tut_hashing.aspx
You can use some third-party libraries for this as well.
EDIT:
pseudo code
int mapWordFrequency[MAX_SIZE] = {0}; // if MAX_SIZE is large, go with dynamic allocation
int someHashMethod(string input);

loop: currString in ListOfString
    int key = someHashMethod(currString);
    ++mapWordFrequency[key];
    if (mapWordFrequency[key] > 1000)
        doSomeThing();
Update:
As #Jens pointed out, there can be cases when someHashMethod() will return the same int (hash) for two different strings. In that case we have to resolve the collision, and then the lookup time will be more than constant. Also, as the input size is very large, creating a single array of that size may not be possible. In that case we may use distributed computing concepts, but the actual lookup time will again go up compared to a single machine.
Depending on the statistical distribution of your words, it may be worth compressing each word before adding it to the map. As long as this is lossless compression you can recover the original words after filtering. The idea being you may be able to reduce the average word size (hence saving memory, and time comparing keys). Here's a simple compression/decompression procedure you could use:
#include <string>
#include <sstream>
#include <boost/iostreams/filtering_streambuf.hpp>
#include <boost/iostreams/filter/zlib.hpp>
#include <boost/iostreams/copy.hpp>

inline std::string compress(const std::string& data)
{
    std::stringstream decompressed {data};
    boost::iostreams::filtering_streambuf<boost::iostreams::input> stream;
    stream.push(boost::iostreams::zlib_compressor());
    stream.push(decompressed);
    std::stringstream compressed {};
    boost::iostreams::copy(stream, compressed);
    return compressed.str();
}

inline std::string decompress(const std::string& data)
{
    std::stringstream compressed {data};
    boost::iostreams::filtering_streambuf<boost::iostreams::input> stream;
    stream.push(boost::iostreams::zlib_decompressor());
    stream.push(compressed);
    std::stringstream decompressed;
    boost::iostreams::copy(stream, decompressed);
    return decompressed.str();
}
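Usage would be along these lines (a sketch; addWord and the map name are illustrative):

#include <map>

std::map<std::string, unsigned> mapWordFrequency;

void addWord(const std::string& word)
{
    ++mapWordFrequency[compress(word)]; // store under the compressed key
}
// when reporting, decompress(entry.first) recovers the original word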
In addition to using a std::unordered_map as others have suggested, you could also move any words that have already been seen more than 1000 times out of the map, and into a std::unordered_set. This would require also checking the set before the map, but you may see better hash performance by doing this. It may also be worth rehashing occasionally if you employ this strategy.
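A sketch of that two-container idea (the 1000 threshold is from the question; the names are made up):

#include <string>
#include <unordered_map>
#include <unordered_set>

std::unordered_set<std::string> frequent;           // words already past 1000
std::unordered_map<std::string, unsigned> counts;

void addWord(const std::string& word)
{
    if (frequent.count(word))
        return;                   // check the (smaller) set first
    if (++counts[word] > 1000) {
        frequent.insert(word);    // promote to the set ...
        counts.erase(word);       // ... and shrink the map
    }
}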
You need another approach to your problem, your data is too big to be processed all at once.
For example you could split your file into multiple files, let's say the easiest would be to logically splitting them by letters.
100 GB / 26 letters ≈ 3.85 GB
Now you'll have 26 files of roughly 3.85 GB each.
You know that the words in any of the files can not be part of any other file, this will help you, since you won't have to merge the results.
With a file of roughly 4 GB, it now gets easier to work in RAM.
std::map has a problem when you start using a lot of memory, since it fragments a lot. Try std::unordered_map, and if that's still not performing well, you may be able to load in memory the file and sort it. Counting occurrences will be then quite easy.
Assuming you have several duplicates, your map or unordered_map will have a significantly lower memory footprint.
Run your code in a loop, for each file, and append the results in another file.
You should be done quite quickly.
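A rough sketch of the splitting pass (assumes lowercase ASCII words; the file names are made up):

#include <fstream>
#include <string>

int main()
{
    std::ifstream in("input.txt");
    std::ofstream buckets[26];
    for (int i = 0; i < 26; ++i)
        buckets[i].open(std::string("bucket_") + char('a' + i) + ".txt");
    std::string word;
    while (in >> word) {
        char c = word[0];
        if (c >= 'a' && c <= 'z')   // assumes lowercased input
            buckets[c - 'a'] << word << '\n';
    }
}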
The main problem seems to be the memory footprint, so we are looking for a solution that uses up little memory. A way to save memory is to use sorted vectors instead of a map. Now, vector has a lookup time with ~log(n) comparisons and an average insert time of n/2, which is bad. The upside is you have basically no memory overhead, the memory to be moved is small due to separation of data and you get sequential memory (cache-friendliness) which can easily outperform a map. The required memory would be 2 (wordcount) + 4 (index) + 1 (\0-char) + x (length of word) bytes per word. To achieve that we need to get rid of the std::string, because it is just too big in this case.
You can split your map into a vector<char> that saves the strings one after another separated by \0-characters, a vector<unsigned int> for the index and a vector<short int> for the word count. The code would look something like this (tested):
#include <vector>
#include <algorithm>
#include <cstring>
#include <string>
#include <fstream>
#include <iostream>

std::vector<char> strings;
std::vector<unsigned int> indexes;
std::vector<short int> wordcount;
const int countlimit = 1000;

void insertWord(const std::string &str) {
    //find the word
    auto stringfinder = [](unsigned int lhs, const std::string &rhs) {
        return &strings[lhs] < rhs;
    };
    auto index = lower_bound(begin(indexes), end(indexes), str, stringfinder);
    //increment counter
    if (index == end(indexes) || strcmp(&strings[*index], str.c_str())) { //unknown word
        wordcount.insert(begin(wordcount) + (index - begin(indexes)), 1);
        indexes.insert(index, strings.size());
        strings.insert(end(strings), str.c_str(), str.c_str() + str.size() + 1);
    }
    else { //known word
        auto &count = wordcount[index - begin(indexes)];
        if (count < countlimit) //prevent overflow
            count++;
    }
}

int main() {
    std::ifstream f("input.txt");
    std::string s;
    while (f >> s) { //not a good way to read in words
        insertWord(s);
    }
    for (size_t i = 0; i < indexes.size(); ++i) {
        if (wordcount[i] > countlimit) {
            std::cout << &strings[indexes[i]] << ": " << wordcount[i] << '\n';
        }
    }
}
This approach still saves all words in memory. According to Wolfram Alpha the average word length in the English language is 5.1 characters. This gives you a total memory requirement of (5.1 + 7) * 1bn bytes = 12.1bn bytes = 12.1GB. Assuming you have a halfway modern computer with 16+GB of RAM you can fit it all into RAM.
If this fails (because you don't have English words and they don't fit in memory), the next approach would be memory mapped files. That way you can make indexes point to the memory mapped file instead of strings, so you can get rid of strings, but the access time would suffer.
If this fails due to low performance you should look into map-reduce which is very easy to apply to this case. It gives you as much performance as you have computers.
#TonyD Can you please give a little example with trie? – Rose Sharma
Here's an example of a trie approach to this problem:
#include <array>
#include <cstdint>
#include <iostream>
#include <limits>
#include <string>

class trie
{
  public:
    void insert(const std::string& s)
    {
        node_.insert(s.c_str());
    }

    friend std::ostream& operator<<(std::ostream& os, const trie& t)
    {
        return os << t.node_;
    }

  private:
    struct Node
    {
        Node() : freq_(0) { }
        uint16_t freq_;
        std::array<Node*, 26> next_letter_{};

        void insert(const char* p)
        {
            if (*p)
            {
                Node*& p_node = next_letter_[*p - 'a'];
                if (!p_node)
                    p_node = new Node;
                p_node->insert(++p);
            }
            else if (freq_ < std::numeric_limits<decltype(freq_)>::max())
                ++freq_;
        }
    } node_;

    friend std::ostream& operator<<(std::ostream& os, const Node& n)
    {
        os << '(';
        if (n.freq_) os << n.freq_ << ' ';
        for (size_t i = 0; i < 26; ++i)
            if (n.next_letter_[i])
                os << char('a' + i) << *(n.next_letter_[i]);
        return os << ')';
    }
};

int main()
{
    trie my_trie;
    my_trie.insert("abc");
    my_trie.insert("abcd");
    my_trie.insert("abc");
    my_trie.insert("bc");
    std::cout << my_trie << '\n';
}
Output:
(a(b(c(2 d(1 ))))b(c(1 )))
The output is a compressed/tree-like representation of your word-frequency histogram: abc appears 2 times, abcd 1, bc 1. The parentheses can be thought of as pushing and popping characters from a "stack" to form the current prefix or - when there's a number - word.
Whether it improves much on a map depends on the variations in the input words, but it's worth a try. A more memory-efficient implementation might use a vector or set - or even a string of, say, space-separated suffixes when there are few elements beneath the current prefix, then switch to the array-of-26-pointers when that's likely to need less memory.

Prepend std::string

What is the most efficient way to prepend std::string? Is it worth writing out an entire function to do so, or would it take only 1 - 2 lines? I'm not seeing anything related to an std::string::push_front.
There actually is a similar function to the non-existent std::string::push_front; see the example below.
Documentation of std::string::insert
#include <iostream>
#include <string>

int
main (int argc, char *argv[])
{
    std::string s1 (" world");
    std::string s2 ("ello");
    s1.insert (0, s2);      // insert the contents of s2 at offset 0 in s1
    s1.insert (0, 1, 'h');  // insert one (1) 'h' at offset 0 in s1
    std::cout << s1 << std::endl;
}
output:
hello world
Since prepending a string with data might require both reallocation and a copy/move of existing data, you can get some performance benefit by getting rid of the reallocation part using std::string::reserve (to allocate more memory beforehand).
The copy/move of data is sadly quite inevitable, unless you define your own custom made class that acts like std::string that allocates a large buffer and places the first content in the center of this memory buffer.
Then you can both prepend and append data without reallocation and moving data, if the buffer is large enough that is. Copying from source to destination is still, obviously, required though.
If you have a buffer in which you know you will prepend data more often than you append a good alternative is to store the string backwards, and reversing it when needed (if that is more rare).
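A sketch of that reversed-storage idea (a toy wrapper, not a drop-in std::string replacement):

#include <string>

// Keeps the characters reversed internally so prepends are cheap appends.
class prepend_buffer {
    std::string rev_;   // stored back-to-front
public:
    void prepend(const std::string& s) {
        rev_.append(s.rbegin(), s.rend());              // amortized O(s.size())
    }
    std::string str() const {
        return std::string(rev_.rbegin(), rev_.rend()); // reverse on demand
    }
};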
myString.insert(0, otherString);
Let the Standard Template Library writers worry about efficiency; make use of all their hours of work rather than reinventing the wheel.
This way does both of those.
As long as the STL implementation you are using was thought through you'll have efficient code. If you're using a badly written STL, you have bigger problems anyway :)
If you're using std::string::append, you should realize the following is equivalent:
std::string lhs1 = "hello ";
std::string lhs2 = "hello ";
std::string rhs = "world!";
lhs1.append(rhs);
lhs2 += rhs; // equivalent to above
// Also the same:
// lhs2 = lhs2 + rhs;
Similarly, a "prepend" would be equivalent to the following:
std::string result = "world";
result = "hello " + result;
// If prepend existed, this would be equivalent to
// result.prepend("hello");
You should note that it's rather inefficient to do the above though.
There is an overloaded string operator+(char lhs, const string& rhs);, so you can just do your_string = 'a' + your_string to mimic push_front.
This is not in-place but creates a new string, so don't expect it to be efficient, though. For a (probably) more efficient solution, use resize to gather space, std::copy_backward to shift the entire string back by one and insert the new character at the beginning.
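Spelled out, that in-place shift might look like this (a sketch for a single character):

#include <algorithm>
#include <string>

void push_front(std::string& s, char c)
{
    s.resize(s.size() + 1);                              // make room for one char
    std::copy_backward(s.begin(), s.end() - 1, s.end()); // shift everything right by one
    s[0] = c;                                            // write the new front
}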
The problem is efficiency: inserting to the beginning of the string is more expensive as it requires both reallocation and shifting of existing characters.
If you are only prepending to the string, the most efficient way is appending, and then either reversing the string or, even better, going through the string in reverse order.
string s;
for (auto c : string("foobar")) { // iterate a string, not the raw literal (which would include its trailing '\0')
    s.push_back(c);
}
for (auto it = s.rbegin(); it != s.rend(); it++) {
    // do something
}
If you need a mix of prepending and appending, I'd suggest using a deque, and then construct a string from it.
The double-ended queue supports O(1) insertion and deletion at the beginning and end.
deque<char> dq;
dq.push_front('f');
dq.push_back('o');
dq.push_front('o');
string s {dq.begin(), dq.end()};

Two short questions about std::vector

When a vector is created it has a default allocation size (probably this is not the right term to use, maybe step size?). When the number of elements reaches this size, the vector is resized. Is this size compiler specific? Can I control it? Is this a good idea?
Do repeated calls to vector::size() recount the number of elements (O(n) calculation) or is this value stored somewhere (O(1) lookup). For example, in the code below
// Split given string on whitespace
vector<string> split( const string& s )
{
    vector<string> tokens;
    string::size_type i, j;
    i = 0;
    while ( i != s.size() ) {
        // ignore leading blanks (check the bound before indexing)
        while ( i != s.size() && isspace(s[i]) ) {
            i++;
        }
        // found a word, now find its end
        j = i;
        while ( j != s.size() && !isspace(s[j]) ) {
            j++;
        }
        // if we found a word, add it to the vector
        if ( i != j ) {
            tokens.push_back( s.substr(i, j-i) );
            i = j;
        }
    }
    return tokens;
}
assuming s can be very large, should I call s.size() only once and store the result?
Thanks!
In most cases, you should leave the allocation alone unless you know the number of items ahead of time, so you can reserve the correct amount of space.
At least in every case of which I'm aware, std::vector::size() just returns a stored value, so it has constant complexity. In theory, the C++ standard allows it to do otherwise. There are reasons to allow otherwise for some other containers, primarily std::list, and rather than make a special case for those, they simply recommend constant time for all containers instead of requiring it for any. I can't quite imagine a vector::size that counted elements though -- I'm pretty sure no such thing has ever existed.
P.S., an easier way to do what your code above does, is something like this:
std::vector<std::string> split(std::string const &input) {
    std::vector<std::string> ret;
    std::istringstream buffer(input);
    std::copy(std::istream_iterator<std::string>(buffer),
              std::istream_iterator<std::string>(),
              std::back_inserter(ret));
    return ret;
}
Edit: IMO, The C++ Standard Library, by Nicolai Josuttis is an excellent reference on such things.
The actual size of the capacity increment is implementation-dependent, but it has to be (roughly) exponential to support the container's complexity requirements. As an example, the Visual C++ standard library will allocate exactly the space required for the first few elements (five, if I recall correctly), then increases the size exponentially after that.
The size has to be stored somehow in the vector, otherwise it doesn't know where the end of the sequence is! However, it may not necessarily be stored as an integer. The Visual C++ implementation (again, as an example) stores three pointers:
a pointer to the beginning of the underlying array,
a pointer to the current end of the sequence, and
a pointer to the end of the underlying array.
The size can be computed from (1) and (2); the capacity can be computed from (1) and (3).
Other implementations might store the information differently.
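As an illustration only (not any particular vendor's layout), the bookkeeping might look like:

#include <cstddef>

template <typename T>
struct vector_bookkeeping {   // illustrative, not a real implementation
    T* first;       // start of the allocated array
    T* last;        // one past the last constructed element
    T* end_of_cap;  // one past the end of the allocation
    std::size_t size()     const { return std::size_t(last - first); }
    std::size_t capacity() const { return std::size_t(end_of_cap - first); }
};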
It's library-specific. You might be able to control the incremental allocation, but you might not.
The size is stored, so it is very fast (constant time) to retrieve. How else could it work? C++ has no way of knowing in general whether a memory location holds "real data" or not.
The resizing mechanism is usually fixed. (Most implementations grow the capacity geometrically, for example by doubling, when the limit is reached.) The C++ standard specifies no way to control this behaviour.
The size is internally updated whenever you insert/remove elements and when you call size(), it's returned immediately. So yes, it's O(1).
Unrelated to your actual questions, but here's a more "STL" way of doing what you're doing:
vector<string> split(const string& s)
{
    istringstream stream(s);
    istream_iterator<string> iter(stream), eos;
    vector<string> tokens;
    copy(iter, eos, back_inserter(tokens));
    return tokens;
}
When the number of elements reaches this size, the vector is resized. Is this size compiler specific? Can I control it? Is this a good idea?
In general, this is library-specific behavior. A custom allocator controls where the memory comes from rather than the growth factor, so in practice reserve() is the main tool you have for controlling allocations.
Do repeated calls to vector::size() recount the number of elements (O(n) calculation) or is this value stored somewhere (O(1) lookup).
Most implementations store the size as a member. It's a single memory read.