Best way to concatenate and condense a std::vector<std::string>

Best way to concatenate and condense a std::vector<std::string> - c++

Disclaimer: This problem is more of a theoretical, rather than a practical interest. I want to find out various different ways of doing this, with speed as icing on the new year cake.
The Problem
I want to be able to store a list of strings, and be able to quickly combine them into 1 if needed.
In short, I want to condense a structure (currently a std::vector<std::string>) that looks like
["Hello, ", "good ", "day ", " to", " you!"]
to
["Hello, good day to you!"]
Is there any idiomatic way to achieve this, ala python's [ ''.join(list_of_strings) ]?
What is the best way to achieve this in C++, in terms of time?
Possible Approaches
The first idea I had is to
loop over the vector,
append each element to the first,
simultaneously delete the element.
We will be concatenating with += and reserve(). I assume that max_size() will not be reached.
Approach 1 (The Greedy Approach)
So called because it ignores conventions and operates in-place.
#if APPROACH == 'G'
// Greedy Approach
void condense(std::vector< std::string >& my_strings, int total_characters_in_list)
{
// Reserve the size for all characters, less than max_size()
my_strings[0].reserve(total_characters_in_list);
// There are strings left, ...
for(auto itr = my_strings.begin()+1; itr != my_strings.end();)
{
// append, and...
my_strings[0] += *itr;
// delete, until...
itr = my_strings.erase(itr);
}
}
#endif
Now I know, you would say that this is risky and bad. So:
loop over the vector,
append each element to another std::string,
clear the vector and make the string first element of the vector.
Approach 2 (The "Safe" Haven)
So called because it does not modify the container while iterating over it.
#if APPROACH == 'H'
// Safe Haven Approach
void condense(std::vector< std::string >& my_strings, int total_characters_in_list)
{
// Store the whole vector here
std::string condensed_string;
condensed_string.reserve(total_characters_in_list);
// There are strings left...
for(auto itr = my_strings.begin(); itr != my_strings.end(); ++itr)
{
// append, until...
condensed_string += *itr;
}
// remove all elements except the first
my_strings.resize(1);
// and set it to condensed_string
my_strings[0] = condensed_string;
}
#endif
Now for the standard algorithms...
Using std::accumulate from <algorithm>
Approach 3 (The Idiom?)
So called simply because it is a one-liner.
#if APPROACH == 'A'
// Accumulate Approach
void condense(std::vector< std::string >& my_strings, int total_characters_in_list)
{
// Reserve the size for all characters, less than max_size()
my_strings[0].reserve(total_characters_in_list);
// Accumulate all the strings
my_strings[0] = std::accumulate(my_strings.begin(), my_strings.end(), std::string(""));
// And resize
my_strings.resize(1);
}
#endif
Why not try to store it all in a stream?
Using std::stringstream from <sstream>.
Approach 4 (Stream of Strings)
So called due to the analogy of C++'s streams with flow of water.
#if APPROACH == 'S'
// Stringstream Approach
void condense(std::vector< std::string >& my_strings, int) // you can remove the int
{
// Create out stream
std::stringstream buffer(my_strings[0]);
// There are strings left, ...
for(auto itr = my_strings.begin(); itr != my_strings.end(); ++itr)
{
// add until...
buffer << *itr;
}
// resize and assign
my_strings.resize(1);
my_strings[0] = buffer.str();
}
#endif
However, maybe we can use another container rather than std::vector?
In that case, what else?
(Possible) Approach 5 (The Great Indian "Rope" Trick)
I have heard about the rope data structure, but have no idea if (and how) it can be used here.
Benchmark and Verdict:
Ordered by their time efficiency (currently and surprisingly) is1:
Approaches Vector Size: 40 Vector Size: 1600 Vector Size: 64000
SAFE_HAVEN: 0.1307962699997006 0.12057728999934625 0.14202970000042114
STREAM_OF_STRINGS: 0.12656566000077873 0.12249500000034459 0.14765803999907803
ACCUMULATE_WEALTH: 0.11375975999981165 0.12984520999889354 3.748660090001067
GREEDY_APPROACH: 0.12164988000004087 0.13558526000124402 22.6994204800023
timed with2:
NUM_OF_ITERATIONS = 100
test_cases = [ 'greedy_approach', 'safe_haven' ]
for approach in test_cases:
time_taken = timeit.timeit(
f'system("{approach + ".exe"}")',
'from os import system',
number = NUM_OF_ITERATIONS
)
print(approach + ": ", time_taken / NUM_OF_ITERATIONS)
Can we do better?
Update: I tested it with 4 approaches (so far), as I could manage in my little time. More incoming soon. It would have been better to fold the code, so that more approaches could be added to this post, but it was declined.
1 Note that these readings are only for a rough estimate. There are a lot of things that influence the execution time, and note that there are some inconsistencies here as well.
2 This is the old code, used to test only the first two approaches. The current code is a good deal longer, and more integrated, so I am not sure I should add it here.
Conclusions:
Deleting elements is very costly.
You should just copy the strings somewhere, and resize the vector.
Infact, better reserve enough space too, if copying to another string.

You could also try std::accumulate:
auto s = std::accumulate(my_strings.begin(), my_strings.end(), std::string());
Won't be any faster, but at least it's more compact.

With range-v3 (and soon with C++20 ranges), you might do:
std::vector<std::string> v{"Hello, ", "good ", "day ", " to", " you!"};
std::string s = v | ranges::view::join;
Demo

By default, I would use std::stringstream. Simply construct the steam, stream in all the strings from the vector, and then return the output string. It isn't very efficient but it is clear what it does.
In most cases, one doesn't need fast method when dealing with strings and printing - so the "easy to understand and safe" methods are better. Plus, compilers nowadays are good at optimizing inefficiencies in simple cases.
The most efficient way... it is a hard question. Some applications require efficiency on multiple fronts. In these cases you might need to utilize multithreading.

Personally, I'd construct a second vector to hold a single "condensed" string, construct the condensed string, and then swap vectors when done.
void Condense(std::vector<std::string> &strings)
{
std::vector<std::string> condensed(1); // one default constructed std::string
std::string &constr = &condensed.begin(); // reference to first element of condensed
for (const auto &str : strings)
constr.append(str);
std::swap(strings, condensed); // swap newly constructed vector into original
}
If an exception is thrown for some reason, then the original vector is left unchanged, and cleanup occurs - i.e. this function gives a strong exception guarantee.
Optionally, to reduce resizing of the "condensed" string, after initialising constr in the above, one could do
// optional: compute the length of the condensed string and reserve
std::size_t total_characters_in_list = 0;
for (const auto &str : strings)
total_characters_in_list += str.size();
constr.reserve(total_characters_in_list);
// end optional reservation
As to how efficient this is compared with alternatives, that depends. I'm also not sure it's relevant - if strings keep on being appended to the vector, and needing to be appended, there is a fair chance that the code that obtains the strings from somewhere (and appends them to the vector) will have a greater impact on program performance than the act of condensing them.

Related

Redefine data area for faster access

New to c++. I've searched but probably using wrong terms.
I want to find which slot in an array of many slots a few bytes long literal value is stored. Currently check each slot sequentially.
If I can use an internal function to scan the whole array as if it was one big string, I feel this would be much faster. (Old COBOL programmer).
Any way I can do this please?

I want to find which slot in an array of many slots a few bytes long literal value is stored. Currently check each slot sequentially.
OK, I'm going to take a punt and infer that:
you want to store string literals of any length in some kind of container.
the container must be mutable (i.e. you can add literals at will)
there will not be duplicates in the container.
you want to know whether a string literal as been stored in the container previously, and what "position" it was at so that you can remove it if necessary.
the string literals will be inserted in random lexicographical order and need not be sorted.
The container that springs to mind is the std::unordered_set
#include <unordered_set>
std::unordered_set<std::string> tokens;
int main()
{
tokens.emplace("foo");
tokens.emplace("bar");
auto it = tokens.find("baz");
assert(it == tokens.end()); // not found
it = tokens.find("bar"); // will be found
assert(it != tokens.end());
tokens.erase(it); // remove the token
}
The search time complexity of this container is O(1).

As you already found out by the comments, "scanning as one big string" is not the way to go in C++.
Typical in C++ when using C-style arrays and normally fast enough for linear search is
auto myStr = "result";
auto it = std::find_if(std::begin(arr), std::end(arr),
[myStr](const char* const str) { return std::strcmp(mystr,str) == 0; });
Remember that string compare function stop at the first wrong character.
More C++ style:
std::vector<std::string> vec = { "res1", "res2", "res3" };
std::string myStr = "res2";
auto it = std::find(vec.begin(), vec.end(), myStr);
If you are interested in very fast lookup for a large container, std::unordered_set is the way to go, but the "slot" has lost its meaning then, but maybe in that case std::unordered_map can be used.
std::unordered_set<std::string> s= { "res1", "res2", "res3" };
std::string myStr = "res2";
auto it = s.find(myStr);
All code is written as example, not compiled/tested

C++ Reverse sequence of elements in a vector

Hello I'm still new to C++ and I am writing a program to reverse the elements in a vector. I don't get any errors running the program but when I run it and I enter the numbers my program prints " Printing ... end of print" then it just closes on its own. I sure it may be a simple mistake.
using namespace std;
vector<int> reverse_a(const vector<int>&veca)
{
vector<int> vecb;
//size_t as the index type
size_t i = veca.size();
while ( i > 0 )
vecb.push_back(veca[--i]);
return vecb;
}
void print(const vector<int> vec)
{
cout << "printing " << endl;
for (size_t i = 0; i < vec.size(); ++i)
cout << vec[i] << ",";
cout << "\n" << "\n end of print.\n";
}
int main(void)
{
vector<int>veca;
vector<int>vecb;
int input;
while(cin >> input)
veca.push_back(input);
reverse_a(veca);
print(vecb);
}

Sort of off topic, but can't be explained in a comment. Aderis's answer is correct and πάντα ῥεῖ brings up an alternative for OP.
As with most intro to programming problems, the standard Library has done all of the work already. There is no need for any function because it already exists, in a somewhat twisted form:
std::copy(veca.rbegin(), veca.rend(), std::back_inserter(vecb));
std::copy does just what it sounds like it does: it copies. You specify where to start, where to stop, and where to put the results.
In this case we want to copy from veca, but we want to copy backwards, so rather than calling begin like we normally would, we call rbegin to get one of those reverse iterator thingys πάντα ῥεῖ was talking about. To define the end, we use rend which, rather than tearing things limb from limb marks the end of the reverse range of veca. Typically this is one before the beginning, veca[-1], if such a thing existed.
std::back_inserter tells std::copy how to place the the data from veca in vecb, at the back.
One could be tempted to skip all of this reverse nonsense and
std::copy(veca.begin(), veca.end(), std::front_inserter(vecb));
but no. For one thing, it would be hilariously slow. Consider veca = {1,2,3,4,5}. You'd insert 1 at the beginning of vecb, then copy it to the second slot to make room for 2. Then move 2 and 1 over one slot each to fit in 3. You'd get the nice reverse ordering, but the shuffling would be be murderous. The second reason you can't do it is because vector does not implement the push_front function required to make this work, again because it would be brutally slow.
Caveat:
This approach is simple, but slow. The back_inserter may force resizing of the vector's internal array, but this can be mitigated by preallocating vecb's storage.

It's just a simple mistake. You are forgetting to set vecb to the result of the reverse_a function in main. Instead of reverse_a(veca);, you should have vecb = reverse_a(veca);. The way you currently have it, vecb never gets set and therefore has a length of zero and nothing prints.

How to find the length of an array of std::strings?

I have a few questions related to portions of my code.
The first has to do with how I find the length of an array of arrays of strings. I'm using the following as a map for a Calculus tool I'm using.
std::string dMap[][10] = {{"x", "1"}, {"log(x)", "1/x"}, {"e^x", "e^x"}};
I'm wondering how to do the equivalent of
int arr[] = {1, 69, 2};
int arrlen = sizeof(arr)/sizeof(int);
with an array of elements of type std::string. Also, is there a better way of storing symbolic representations of (f(x), f'(x)) pairs? I'm trying to not use C++11.
My next question has to do with a procedure I wrote that isn't working. Here it is:
std::string CalculusWizard::composeFunction(const std::string & fx, const char & x, const std::string & gx)
{
/* Return fx compose gx, i.e. return a string that is gx with every instance of the character x replaced
by the equation gx.
E.g. fx="x^2", x="x", gx="sin(x)" ---> composeFunction(fx, x, gx) = "(sin(x))^2"
*/
std::string hx(""); // equation to return
std::string lastString("");
for (std::string::const_iterator it(fx.begin()), offend(fx.end()); it != offend; ++it)
{
if (*it == x)
{
hx += "(" + gx + ")";
lastString.erase(lastString.begin(), lastString.end());
}
else
{
lastString.push_back(*it);
}
}
return hx;
}
First of all, where's the bug in the procedure? It's not working when I test it out.
Second of all, when trying to make a string empty again, is it faster to do
lastString.erase(lastString.begin(), lastString.end());
or
lastString = "";
???
Thank you for your time.

Question 1) Understand that you can't, and really don't need to, calculate the size of a String this way. Just ask it how big it is and it will tell you.
// comparing size, length, capacity and max_size
#include <iostream>
#include <string>
int main ()
{
std::string str ("Test string");
std::cout << "size: " << str.size() << "\n";
std::cout << "length: " << str.length() << "\n";
std::cout << "capacity: " << str.capacity() << "\n";
std::cout << "max_size: " << str.max_size() << "\n";
return 0;
}
http://www.cplusplus.com/reference/string/string/capacity/
As for an array of strings, well go read this:
How to determine the size of an array of strings in C++?
Check out David Rodríguez's answer.
Question 2) The better way might be to define a FunctionPair class depending on what you're doing with them. Vector<FunctionPair> might come in handy.
If FunctionPair doesn't end up with any behavior (functions) associated with it then a struct might be enough: std::pair<std::string, std::string> could also be shoved into a vector.
You don't need a map unless your going to use one function string to look up the other.
http://www.cplusplus.com/reference/map/map/
Question 3) A little better description of what's not working would help. I notice lastString doesn't impact hx at all.
Question 4) "Second of all" Fastest is nothing to worry about at this point. Write what is easiest to look at until all the bugs are gone. "Premature optimization is the root of all evil", Donald Knuth.
Tip: Look into how the replace function might help you do the composition replacements:
http://www.cplusplus.com/reference/string/string/replace/

As the above commenter said, you shouldn't use c-style arrays even if you just want to make things 'easy'.
In reality doing things like that makes things harder.
c-style arrays aren't bounds checked. That means they are a source of bugs due to memory unsafety and can lead to all kinds of issues from segfaulting to corrupting data as you read random data from unrelated blocks of memory or even worse write to them.
#include <iostream>
int main() {
int nums[] = {1, 2, 3};
std::cout << nums[3] << std::endl;
}
.
# ./a.out
4196544
No programmer is perfect, every time you implement something like that there is a percentage chance you will be off by one in your bounds or something. Even if you are some programming god most people have to work on a team with people who aren't. In many cases no one will even notice since not every time will cause anything obvious. Memory can be randomly corrupted without causing anything to crash horribly. Until you make a totally unrelated change that causes the memory to be in a different order.
But when you do notice it will often effect something totally unrelated that you code sometime later. Given the fact that you will likely implement many such arrays in your programming lifetime you will likely make things much worse for yourself, you save yourself 10 minutes for each project but end up spending hours tracking down a bug in one.
If you really don't want C++11 then use std::vector<std::vector<std::string>>. It will use a little more memory so you might loose some performance , but most of the time when people are worried about performance they shouldn't be. Are you are calling this function 10,000 time a second? Even then you could gain more performance from threading the code or preallocating memory. Most of the time people think something has bad performance but in reality the computer is optimizing it away, or the CPU is. Is the performance from the memory allocation going to be worse than trying to find the array size every run?
This is also the case with raw pointers vs std::unique_ptr, std::shared_ptr.
If typing all those names looks like a pain, use a typedef to make it nice.
You can also look at using Boost's Array type, boost::array. Or whip up your own custom class.
That's not to say that you should never use that stuff. But you should only use it when you can justify it. The default should be the 'pure' C++ style code.
Performance (only when you have measured and see that you need it there).
C compatibility (but most of the time you can just wrap that stuff in the std classes anyway).
If you do feel you need it then. Make sure you unittest your code. And look at using the address and memory sanitizers that ship in current versions of gcc and clang. And quarantine the code as much as possible (ie in classe)s.
That all sounds like a lot of work, but once you have learned to do it, it becomes a habit and build it into your build system then it's just part of the development process. As easy as make test. And once you have it in one build system, you cut and paste it into everything else you do forever. You have expanded your programmers toolkit. That's all good habits to form even if you don't do that.
But here's the actual answer to your array size question:
std::string arr[][10] = {
{"xxx", "111"},
{"y", "222"},
{"hello", "goodbye"},
{"I like candy", "mmmm"},
{"Math goes here", "this is math"},
{"More random stuff", "adsfdsfasf"},
};
int size = sizeof(arr) / 10 / sizeof(std::string);
std::cout << size << endl; // Prints 6, as in 6 pairs of strings

Since the semantics is similar as Map ( you are mapping a function to it's differential), I guess most suitable data structure would be std::map, when you can easily get the differential using the function as index.
About the function, you are not appending lastString.
return hx+lastString;

Question 1 is actually quite straightforward:
std::string dMap[][10] = {{"x", "1"}, {"log(x)", "1/x"}, {"e^x", "e^x"}};
size_t tupleCount = sizeof(dMap)/sizeof(dMap[0]);
size_t maxTupleSize = sizeof(dMap[0])/sizeof(dMap[0][0]);
assert(tupleCount == 3);
assert(maxTupleSize == 10);
Note that you won't get the actual count of strings in a tuple this way. You only get the amount of std::strings that can fit into each tuple. Of course, you can search your tuples for the first default constructed std::string it contains. But the entire setup is an invitation for bugs, so you don't want to use it anyways (see below).
Question 2 can also be answered quite clearly. You should be using an std::unordered_map<>. Why?
You usecase is to map strings of one class to another. That is the semantics of either std::map<> or std::unordered_map<>.
From your question I gather that you don't need a notion of a next or previous mapping, your mapping pairs are essentially unrelated. In this case, std::unordered_map<> is simply faster than std::map<> because it uses a hash table internally. No matter how big your std::unordered_map<> gets, looking up its elements takes a constant amount of time. This is not true for std::map<>.

std::mutiset vs std::vector to read and write sorted strings to a file

I've a file say somefile.txt it contains names (single word) in sorted order.
I want to updated this file, after adding new name, in sorted order.
Which of the following will be most preferred way and why ?
Using a std::multiset
std::multiset<std::string> s;
std::copy(std::istream_iterator<std::string>(fin),//fin- object of std::fstream
std::istream_iterator<std::string>(),
std::inserter(s, s.begin()));
s.insert("new_name");
//Write s to the file
OR
Using a std::vector
std::vector<std::string> v;
std::copy(std::istream_iterator<std::string>(fin),
std::istream_iterator<std::string>(),
std::back_inserter(v));
v.push_back("new_name");
std::sort(v.begin(),v.end());
//Write v to the file.

The multiset is slower to insert objects than the vector, but they are held sorted.
The multiset is likely to take up more memory than the vector as it has to hold pointers to an internal tree structure. This may not always be the case as the vector may have some empty space.
I guess if you need the information to grow incrementally but always to be ready for immediate access in order then the multi set wins.
If you collect the data all at once without needing to access it in order, it is probably simpler to push it onto the vector and then sort. So how dynamic is the data to be stored is the real criterion.

std::string new_name = "new_name";
bool inserted = false;
std::string current;
while (std::cin >> current) {
if (!inserted && new_name < current) {
std::cout << new_name << '\n';
inserted = true;
}
std::cout << current << '\n';
}

Both options are basically equivalent.
In a performance-critical scenario, the vector approach will be faster, but your perf is largely going to be constrained by the disk in this case; which container you choose won't matter much.

Vectors are faster from what I could see from this guy's testing (http://fallabs.com/blog/promenade.cgi?id=34). I would suggest that you test it out and see for yourself. Performance is often related to platform and especially, in this case, datasets.
From his testing, he concluded that simple element works best with vector. For complex element (more than 4 strings for instance), multiset is faster.
Also, since vectors are big arrays, if you're adding lots of data, it may be worth looking into using another type of container (linked list for instance or a specialized boost container see Is there a sorted_vector class, which supports insert() etc.?).

Prepend std::string

What is the most efficient way to prepend std::string? Is it worth writing out an entire function to do so, or would it take only 1 - 2 lines? I'm not seeing anything related to an std::string::push_front.

There actually is a similar function to the non-existing std::string::push_front, see the below example.
Documentation of std::string::insert
#include <iostream>
#include <string>
int
main (int argc, char *argv[])
{
std::string s1 (" world");
std::string s2 ("ello");
s1.insert (0, s2); // insert the contents of s2 at offset 0 in s1
s1.insert (0, 1, 'h'); // insert one (1) 'h' at offset 0 in s1
std::cout << s1 << std::endl;
}
output:
hello world
Since prepending a string with data might require both reallocation and copy/move of existing data you can get some performance benefits by getting rid of the reallocation part by using std::string::reserve (to allocate more memory before hand).
The copy/move of data is sadly quite inevitable, unless you define your own custom made class that acts like std::string that allocates a large buffer and places the first content in the center of this memory buffer.
Then you can both prepend and append data without reallocation and moving data, if the buffer is large enough that is. Copying from source to destination is still, obviously, required though.
If you have a buffer in which you know you will prepend data more often than you append a good alternative is to store the string backwards, and reversing it when needed (if that is more rare).

myString.insert(0, otherString);
Let the Standard Template Library writers worry about efficiency; make use of all their hours of work rather than re-programming the wheel.
This way does both of those.
As long as the STL implementation you are using was thought through you'll have efficient code. If you're using a badly written STL, you have bigger problems anyway :)

If you're using std::string::append, you should realize the following is equivalent:
std::string lhs1 = "hello ";
std::string lhs2 = "hello ";
std::string rhs = "world!";
lhs1.append(rhs);
lhs2 += rhs; // equivalent to above
// Also the same:
// lhs2 = lhs2 + rhs;
Similarly, a "prepend" would be equivalent to the following:
std::string result = "world";
result = "hello " + result;
// If prepend existed, this would be equivalent to
// result.prepend("hello");
You should note that it's rather inefficient to do the above though.

There is an overloaded string operator+ (char lhs, const string& rhs);, so you can just do your_string 'a' + your_string to mimic push_front.
This is not in-place but creates a new string, so don't expect it to be efficient, though. For a (probably) more efficient solution, use resize to gather space, std::copy_backward to shift the entire string back by one and insert the new character at the beginning.

The problem is efficiency: inserting to the beginning of the string is more expensive as it requires both reallocation and shifting of existing characters.
If you are only prepending to the string, the most efficient way is appending, and then either reverse the string, or even better, go through the string in reverse order.
string s;
for (auto c: "foobar") {
s.push_back(c);
}
for (auto it=s.rbegin(); it!=s.rend(); it++) {
// do something
}
If you need a mix of prepending and appending, I'd suggest using a deque, and then construct a string from it.
The double-ended queue supports O(1) insertion and deletion at the beginning and end.
deque<char> dq;
dq.push_front('f');
dq.push_back('o');
dq.push_front('o');
string s {dq.begin(), dq.end()};

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Best way to concatenate and condense a std::vector<std::string> - c++

You could also try std::accumulate: auto s = std::accumulate(my_strings.begin(), my_strings.end(), std::string()); Won't be any faster, but at least it's more compact.

With range-v3 (and soon with C++20 ranges), you might do: std::vector<std::string> v{"Hello, ", "good ", "day ", " to", " you!"}; std::string s = v | ranges::view::join; Demo

Related

Redefine data area for faster access

C++ Reverse sequence of elements in a vector

How to find the length of an array of std::strings?

std::mutiset vs std::vector to read and write sorted strings to a file

Prepend std::string

Categories

Resources