String.length woes


Edit: Solutions must compile against Microsoft Visual Studio 2012.
I want to use a known string length to declare another string of the same length.
The reasoning is that the second string will act as a container for the operations done on the first string, which must itself stay unmodified.
e.g.
const string messy = "a bunch of letters";
string dostuff(string sentence) {
string organised NNN????? // Idk, just needs the same size.
for ( x = 0; x < NNN?; x++) {
organised[x] = sentence[x]++; // Doesn't matter what this does.
}
}
In both cases above, the declaration and the exit condition, the NNN? stands for the length of 'messy'.
How do I discover the length at compile time?

std::string has two constructors which could fit your purposes.
The first, a copy constructor:
string organised(sentence);
The second, a constructor which takes a character and a count. You could initialize a string with a temporary character.
string organised(sentence.length(), '_');
Alternatively, you can:
Use an empty string and append (+=) text to it as you go along, or
Use a std::stringstream for the same purpose.
The stringstream may be more efficient when there are many appends, though that is worth measuring.
Overall, I would prefer the copy constructor if the length is known.
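As a minimal sketch of the count-and-character constructor in use: the name shift_letters and the "add one to each character" body are assumptions standing in for the question's filler operation, not anything from the original code. The result is sized from the argument and then overwritten in place.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the question's dostuff(): returns a new string of
// the same length as `sentence` and leaves the caller's string untouched.
std::string shift_letters(const std::string& sentence) {
    // Size the result once, pre-filled with a placeholder character.
    std::string organised(sentence.length(), '_');
    for (std::string::size_type x = 0; x < sentence.length(); ++x) {
        // Filler operation: store the next character code.
        organised[x] = static_cast<char>(sentence[x] + 1);
    }
    assert(organised.length() == sentence.length());
    return organised;
}
```

Called as shift_letters(messy), this never needs the length as a compile-time constant; sentence.length() at run time is enough.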

std::string isn't a compile-time type (before C++20 it can't be constexpr at all), so you can't use it directly to determine the length at compile time.
You could initialize a constexpr char[] and then use sizeof on that:
constexpr char messychar[] = "a bunch of letters";
// - 1 to avoid including NUL terminator which std::string doesn't care about
constexpr size_t messylen = sizeof(messychar) / sizeof(messychar[0]) - 1;
const string messy(messychar);
and use that, but frankly, that's pretty ugly; the length would be known at compile time, but organised would still need the count-and-char constructor, which runs on every call, allocating and initializing the string only to have its contents replaced in the loop.
While it's not compile time, you'd avoid that initialization cost by just using reserve and += to build the new string, which with the constexpr length could be done in an ugly but likely efficient way as:
constexpr char messychar[] = "a bunch of letters";
constexpr size_t messylen = sizeof(messychar) / sizeof(messychar[0]) - 1;
// messy itself may not be needed, but if it is, it's initialized optimally
// by using the compile time calculated length, so there is no need to scan for
// NUL terminators, and it can reserve the necessary space in the initial alloc
const string messy(messychar, messylen);
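string dostuff(string sentence) {
string organised;
organised.reserve(messylen); // same spelling as the declaration above
for (size_t x = 0; x < messylen; x++) {
organised += sentence[x]++; // Doesn't matter what this does.
}
return organised; // a non-void function must return a value
}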
This avoids setting organised's values more than once, allocating more than once (well, possibly twice if initial construction performs it) per call, and only performs a single read/write pass of sentence, no full read followed by read/write or the like. It also makes the loop constraint a compile time value, so the compiler has the opportunity to unroll the loop (though there is no guarantee of this, and even if it happens, it may not be helpful).
Also note: In your example, you mutate sentence, but it's accepted by value, so you're mutating the local copy, not the caller copy. If mutation of the caller value is required, accept it by reference, and if mutation is not required, accept by const reference to avoid a copy on every call (I understand the example code was filler, just mentioning this).
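Since the question targets Visual Studio 2012, which as far as I know does not accept constexpr, here is a sketch of the same idea with plain const. A namespace-scope const std::size_t initialised from sizeof is still a constant expression, which the typedef below checks by using it as an array bound. The typedef name and the "add one" loop body are assumptions for illustration, and the parameter is taken by const reference as suggested in the note above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

const char messychar[] = "a bunch of letters";
// sizeof counts the NUL terminator, hence the - 1.
const std::size_t messylen = sizeof(messychar) - 1;
// This line only compiles if messylen is a compile-time constant.
typedef char messy_buffer[messylen + 1];

std::string dostuff(const std::string& sentence) {
    std::string organised;
    organised.reserve(sentence.size());   // at most one allocation, no fill pass
    for (std::string::size_type x = 0; x < sentence.size(); ++x) {
        organised += static_cast<char>(sentence[x] + 1);   // filler operation
    }
    assert(organised.size() == sentence.size());
    return organised;
}
```

The loop bound here is sentence.size() rather than messylen, so the function stays in bounds even when it is called with a string of a different length.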

Related

Why StringCopyFromLiteral is faster than StringCopyFromString?

The Quick C++ Benchmarks example:
static void StringCopyFromLiteral(benchmark::State& state) {
// Code inside this loop is measured repeatedly
for (auto _ : state) {
std::string from_literal("hello");
// Make sure the variable is not optimized away by compiler
benchmark::DoNotOptimize(from_literal);
}
}
// Register the function as a benchmark
BENCHMARK(StringCopyFromLiteral);
static void StringCopyFromString(benchmark::State& state) {
// Code before the loop is not measured
std::string x = "hello";
for (auto _ : state) {
std::string from_string(x);
}
}
// Register the function as a benchmark
BENCHMARK(StringCopyFromString);
http://quick-bench.com/IcZllt_14hTeMaB_sBZ0CQ8x2Ro
What if I understand assembly...
More results:
http://quick-bench.com/39fLTvRdpR5zdapKSj2ZzE3asCI

The answer is simple. In the case where you construct an std::string from a small string literal, the compiler optimizes this case by directly populating the contents of the string object using constants in assembly. This avoids expensive looping as well as tests to see whether small string optimization (SSO) can be applied. In this case it knows SSO can be applied so the code the compiler generates simply involves writing the string directly into the SSO buffer.
Note this assembly code in the StringCopyFromLiteral case:
// Populate SSO buffer (each set of 4 characters is backwards since
// x86 is little-endian)
19.63% movb $0x6f,0x4(%r15) // "o"
19.35% movl $0x6c6c6568,(%r15) // "lleh"
// Set size
20.26% movq $0x5,0x10(%rsp) // size = 5
// Probably set heap pointer. 0 (nullptr) = use SSO buffer
20.07% movb $0x0,0x1d(%rsp)
You're looking at the constant values right there. That's not very much code, and no loop is required. In fact, the std::string constructor doesn't even have to be invoked! The compiler is just putting stuff in memory in the same places where the std::string constructor would.
If the compiler cannot apply this optimization, the results are quite different -- in particular, if we "hide" the fact that the source is a string literal by first copying the literal into a char array, the results flip:
char x[] = "hello";
for (auto _ : state) {
std::string created_string(x);
benchmark::DoNotOptimize(created_string);
}
Now the "from-char-pointer" case takes twice as long! Why?
I suspect that this is because the "copy from char pointer" case cannot simply check how long the string is by looking at a value. It needs to know whether small string optimization can be performed. There are a few ways it could go about this:
Measure the length of the string first, make an allocation (if needed), then copy the source to the destination. In the case where SSO does apply (it almost certainly does here) I'd expect this to take twice as long since it has to walk the source twice -- once to measure, once to copy.
Copy from the source character-by-character, appending to the new string. This requires testing on each append operation whether the string is now too long for SSO and needs to be copied into a heap-allocated char array. If the string is currently in a heap-allocated array, it needs to instead test if the allocation needs to be resized. This would also take quite a bit longer since there is at least one test for each character in the source string.
Copy from the source in chunks to lower the number of tests that need to be performed and to avoid walking the source twice. This would be faster than the character-by-character approach both because the number of tests would be lower and, because the source is not being walked twice, the CPU memory cache is going to be more effective. This would only show significant speed improvements for long strings, which we don't have here. For short strings it would work about the same as the first approach (measure, then copy).
Contrast this to the case when it's copying from another string object: it can simply look at the size() of the other string and immediately know whether it can perform SSO, and if it can't perform SSO then it also knows exactly how much memory to allocate for the new string.
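To make the first two strategies concrete, here is a rough sketch written against the public std::string interface. It is not how any particular standard library implements its constructor, and the three function names are made up for illustration; it only shows where the extra walk or the per-character check comes from.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Strategy 1: measure first, then copy exactly `len` characters.
std::string copy_measure_first(const char* src) {
    assert(src != 0);
    const std::size_t len = std::strlen(src);   // first walk over the source
    return std::string(src, len);               // second walk copies len chars
}

// Strategy 2: append one character at a time; every push_back has to check
// whether the current storage still has room.
std::string copy_char_by_char(const char* src) {
    assert(src != 0);
    std::string out;
    for (; *src != '\0'; ++src) {
        out.push_back(*src);
    }
    return out;
}

// Copying from an existing std::string: the length is already stored in src,
// so no measuring pass is needed.
std::string copy_from_string(const std::string& src) {
    return std::string(src);
}
```

All three produce the same string; any difference is in how much work is done to find out the length.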

How to create a function that removes all of a selected character in a C-string?

I want to make a function that removes all the characters of ch in a c-string.
But I keep getting an access violation error.
Unhandled exception at 0x000f17ba in testassignments.exe: 0xC0000005: Access violation writing location 0x000f787e.
void removeAll(char* &s, const char ch)
{
int len=strlen(s);
int i,j;
for(i = 0; i < len; i++)
{
if(s[i] == ch)
{
for(j = i; j < len; j++)
{
s[j] = s[j + 1];
}
len--;
i--;
}
}
return;
}
I expected the c-string to not contain the character "ch", but instead, I get an access violation error.
In the debug I got the error on the line:
s[j] = s[j + 1];
I tried to modify the function but I keep getting this error.
Edit--
Sample inputs:
s="abmas$sachus#settes";
ch='e' Output->abmas$sachus#settes, becomes abmas$sachus#stts
ch='t' Output-> abmas$sachus#stts, becomes abmas$sachus#ss.
Instead of producing those outputs, I get the access violation error.
Edit 2:
If it's any help, I am using Microsoft Visual C++ 2010 Express.

Apart from the inefficiency of your function shifting the entire remainder of the string whenever encountering a single character to remove, there's actually not much wrong with it.
In the comments, people have assumed that you are reading off the end of the string with s[j+1], but that is untrue. They are forgetting that s[len] is completely valid because that is the string's null-terminator character.
So I'm using my crystal ball now, and I believe that the error is because you're actually running this on a string literal.
// This is NOT okay!
char* str = "abmas$sachus#settes";
removeAll(str, 'e');
This code above is (sort of) not legal. The string literal "abmas$sachus#settes" should not be stored as a non-const char*. But for backward compatibility with C where this is allowed (provided you don't attempt to modify the string) this is generally issued as a compiler warning instead of an error.
However, you are really not allowed to modify the string. And your program is crashing the moment you try.
If you were to use the correct approach with a char array (which you can modify), then you have a different problem:
// This will result in a compiler error
char str[] = "abmas$sachus#settes";
removeAll(str, 'e');
Results in
error: invalid initialization of non-const reference of type ‘char*&’ from an rvalue of type ‘char*’
So why is that? Well, your function takes a char*& type that forces the caller to use pointers. It's making a contract that states "I can modify your pointer if I want to", even if it never does.
There are two ways you can fix that error:
The TERRIBLE PLEASE DON'T DO THIS way:
// This compiles and works but it's not cool!
char str[] = "abmas$sachus#settes";
char *pstr = str;
removeAll(pstr, 'e');
The reason I say this is bad is because it sets a dangerous precedent. If the function actually did modify the pointer in a future "optimization", then you might break some code without realizing it.
Imagine that you want to output the string with characters removed later, but the first character was removed and your function decided to modify the pointer to start at the second character instead. Now if you output str, you'll get a different result from using pstr.
And this example is only assuming that you're storing the string in an array. Imagine if you actually allocated a pointer like this:
char *str = new char[strlen("abmas$sachus#settes") + 1];
strcpy(str, "abmas$sachus#settes");
removeAll(str, 'e');
Then if removeAll changes the pointer, you're going to have a BAD time when you later clean up this memory with:
delete[] str; //<-- BOOM!!!
The I ACKNOWLEDGE MY FUNCTION DEFINITION IS BROKEN way:
Real simply, your function definition should take a pointer, not a pointer reference:
void removeAll(char* s, const char ch)
This means you can call it on any modifiable block of memory, including an array. And you can be comforted by the fact that the caller's pointer will never be modified.
Now, the following will work:
// This is now 100% legit!
char str[] = "abmas$sachus#settes";
removeAll(str, 'e');
Now that my free crystal-ball reading is complete, and your problem has gone away, let's address the elephant in the room:
Your code is needlessly inefficient!
You do not need to do the first pass over the string (with strlen) to calculate its length
The inner loop effectively gives your algorithm a worst-case time complexity of O(N^2).
The little tricks modifying len and, worse than that, the loop variable i make your code more complex to read.
What if you could avoid all of these undesirable things!? Well, you can!
Think about what you're doing when removing characters. Essentially, the moment you have removed one character, then you need to start shuffling future characters to the left. But you do not need to shuffle one at a time. If, after some more characters you encounter a second character to remove, then you simply shunt future characters further to the left.
What I'm trying to say is that each character only needs to move once at most.
There is already an answer demonstrating this using pointers, but it comes with no explanation and you are also a beginner, so let's use indices because you understand those.
The first thing to do is get rid of strlen. Remember, your string is null-terminated. All strlen does is search through characters until it finds the null byte (otherwise known as 0 or '\0')...
[Note that real implementations of strlen are super smart (i.e. much more efficient than searching single characters at a time)... but of course, no call to strlen is faster]
All you need is for your loop to look for the null terminator, like this:
for(i = 0; s[i] != '\0'; i++)
Okay, and now to ditch the inner loop, you just need to know where to stick each new character. How about just keeping a variable new_size in which you are going to count up how long the final string is.
void removeAll(char* s, char ch)
{
int new_size = 0;
for(int i = 0; s[i] != '\0'; i++)
{
if(s[i] != ch)
{
s[new_size] = s[i];
new_size++;
}
}
// You must also null-terminate the string
s[new_size] = '\0';
}
If you look at this for a while, you may notice that it might do pointless "copies". That is, if i == new_size there is no point in copying characters. So, you can add that test if you want. I will say that it's likely to make little performance difference, and potentially reduce performance because of additional branching.
But I'll leave that as an exercise. And if you want to dream about really fast code and just how crazy it gets, then go and look at the source code for strlen in glibc. Prepare to have your mind blown.

You can make the logic simpler and more efficient by writing the function like this:
void removeAll(char * s, const char charToRemove)
{
const char * readPtr = s;
char * writePtr = s;
while (*readPtr) {
if (*readPtr != charToRemove) {
*writePtr++ = *readPtr;
}
readPtr++;
}
*writePtr = '\0';
}
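The same read-pointer/write-pointer idea is what std::remove from <algorithm> does, so the function can also be sketched on top of it. The name removeAllStd is an assumption to keep it apart from the versions above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstring>

// Hypothetical variant of removeAll built on std::remove.
void removeAllStd(char* s, char ch) {
    assert(s != 0);
    char* end = s + std::strlen(s);
    // std::remove moves the kept characters to the front, preserving their
    // order, and returns a pointer to the new logical end.
    char* newEnd = std::remove(s, end, ch);
    // newEnd never exceeds end, so this write stays inside the original buffer.
    *newEnd = '\0';
}
```

As with the other versions, it must be called on modifiable memory such as char str[] = "abmas$sachus#settes";, never on a string literal.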

How do I declare a new string the same length of a known const string?

I've been using:
string letters = THESAMELENGTH; // Assign for allocation purposes.
Reason being, if I:
string letters[THESAMELENGTH.length()];
I get a non constant expression complaint.
But if I:
string letters[12];
I'm at risk of needing to change every instance if the guide const string changes size.
But it seems foolish to assign a string when I won't use those entries, I only want my newly assigned string to be the same length as the previously assigned const string, then fill with different values.
How do you recommend I do this gracefully and safely?

You can
string letters(THESAMELENGTH.length(), ' '); // constructs the string with THESAMELENGTH.length() copies of character ' '
BTW: string letters[12]; doesn't mean the same as you expected. It declares a raw array of string containing 12 elements.

I only want my newly assigned string to be the same length as the previously assigned const string, then fill with different values.
Part of the reason the string class/type exists is so you don't have to worry about trying to manage its length. (The problem with arrays of char.)
If you have a const std::string tmp then you can't just assign anything to it after it has already been initialized. E.g.:
const std::string tmp = "A value"; // initialization
tmp = "Another value"; // compile error
How do you recommend I do this gracefully and safely?
If you really want to keep strings to a specific size, regardless of their contents, you could always resize your string variables. For example:
// in some constants.h file
const int MAX_STRING_LENGTH = 16;
// in other files
#include "constants.h"
// ...
std::string word = ... // some unknown string
word.resize(MAX_STRING_LENGTH);
Now your word string will have a length/size of MAX_STRING_LENGTH and anything beyond the end gets truncated.
This example is from C++ Reference
// resizing string
#include <iostream>
#include <string>
int main ()
{
std::string str ("I like to code in C");
std::cout << str << '\n';
unsigned sz = str.size();
str.resize (sz+2,'+');
std::cout << str << '\n';
str.resize (14);
std::cout << str << '\n';
return 0;
}
// program output
I like to code in C
I like to code in C++
I like to code
You can't just ask a string variable for its length at compile-time. By definition, it's impossible to know the value of a variable, or the state of any given program for that matter, while it's not running. This question only makes sense at run-time.
Others have mentioned this, but there seems to be an issue with your understanding of string letters[12];. That gives you an array of string types, i.e. you get space for 12 full strings (e.g. words/sentences/etc), not just letters.
In other words, you could do:
for(size_t i = 0; i < 12; ++i) // letters is a raw array, so it has no size() member
letters[i] = "Hello, world!";
So your letters variable should be renamed to something more accurate (e.g. words).
If you really want letters (e.g. the full alphabet on a single string), you could do something like this:
// constants.h
const std::string ALPHABET_LC = "abc...z";
const std::string ALPHABET_UC = "ABC...Z";
const int LETTER_A = 0;
const int LETTER_B = 1;
// ...
// main.cpp, etc.
char a = ALPHABET_LC[LETTER_A];
char B = ALPHABET_UC[LETTER_B];
// ...
It all depends on what you need to do, but this might be a good alternative.
Disclaimer: Note that it's not really my recommendation that you do this. You should let strings manage their own length. For example, if the string value is actually shorter than your limit, you're causing your variable to use more space/memory than needed, and if it's longer, you're still truncating it. Neither side-effect is good, IMHO.

The first thing you need to do is understand the difference between a string length and an array dimension.
std::string letters = "Hello";
creates a single string that contains the characters from "Hello", and has length 5.
In comparison
std::string letters[5];
creates an array of five distinct default-constructed objects of type std::string. It doesn't create a single string of 5 characters. The reason for the non-constant complaint when doing
std::string letters[THESAMELENGTH.length()];
is that construction of arrays in standard C++ is required to use a length known to the compiler, whereas the length of a std::string is determined at run time.
If you have a string, and want to create another string of the same length, you can do something like
std::string another_string(letters.length(), 'A');
which will create a single string containing the required number of letters 'A'.
It is largely pointless to do what you are seeking as a std::string can dynamically change its length anyway, as needed. There is also nothing stopping a std::string from allocating more than it needs (e.g. to make provision for multiple increases in its length).
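A small sketch of the distinction, with hypothetical helper names (array_element_count and same_length_as are not from the question): the first function measures an array dimension, the second builds one string whose length matches a guide string.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// An array dimension: five separate, default-constructed (empty) strings.
std::size_t array_element_count() {
    std::string letters[5];
    assert(letters[0].empty());
    return sizeof(letters) / sizeof(letters[0]);
}

// A string length: one std::string holding as many characters as `guide`.
std::string same_length_as(const std::string& guide, char fill) {
    return std::string(guide.length(), fill);
}
```

With this, std::string letters = same_length_as(THESAMELENGTH, ' '); follows the guide string automatically if its contents ever change.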

Modifying the length and contents of the string?

To change the contents of a string in a function such that it reflects in the main function we need to accept the string as reference as indicated below.
Changing contents of a std::string with a function
But in the above code we are also changing the size of the string (i.e., to more than what it can hold), so why is the program not crashing?
Program to convert decimal to binary, mind it, the code is not complete and I am just testing the 1st part of the code.
void dectobin(string & bin, int n)
{
int i=0;
while(n!=0)
{
bin[i++]= (n % 2) + '0';
n = n / 2;
}
cout << i << endl;
cout << bin.size() << endl;
cout << bin << endl;
}
int main()
{
string s = "1";
dectobin(s,55);
cout << s << endl;
return 0;
}
O/p: 6 1 1 and the program crashes in Code::Blocks, while the code in the link above works perfectly fine.
It only outputs the correct result when I initialize the string in main with 6 characters (i.e., the length of the number after it is converted from decimal to binary).
http://www.cplusplus.com/reference/string/string/capacity/
Notice that this capacity does not suppose a limit on the length of the string. When this capacity is exhausted and more is needed, it is automatically expanded by the object (reallocating it storage space). The theoretical limit on the length of a string is given by member max_size
If the string resizes itself automatically then why do we need the resize function and then why is my decimal to binary code not working?

Your premise is wrong. You are thinking 1) if I access a string out of bound then my program will crash, 2) my program doesn't crash therefore I can't be accessing a string out of bounds, 3) therefore my apparently out of bounds string accesses must actually resize the string.
1) is incorrect. Accessing a string out of bounds results in undefined behaviour. This means exactly what it says: your program might crash, but it might not; its behaviour is undefined.
And it's a fact that accessing a string never changes its size; that's why we have the resize function (and push_back etc.).
We must get questions like yours several times a week. Undefined behaviour is clearly a concept that newbies find surprising.
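If you want an out-of-range access to be reported instead of being undefined, std::string::at does a bounds check and throws std::out_of_range. The sketch below contrasts that with growing the string explicitly before indexing; both helper names are made up for illustration.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical helper: writes c at pos only if that position already exists.
bool write_checked(std::string& s, std::string::size_type pos, char c) {
    try {
        s.at(pos) = c;    // at() throws std::out_of_range when pos >= size()
        return true;
    } catch (const std::out_of_range&) {
        return false;     // the string is left unchanged
    }
}

// Hypothetical helper: grows the string first, padding with '0', then writes.
void grow_then_write(std::string& s, std::string::size_type pos, char c) {
    if (pos >= s.size()) {
        s.resize(pos + 1, '0');
    }
    assert(pos < s.size());   // indexing is now within bounds
    s[pos] = c;
}
```

Neither helper relies on operator[] resizing anything; the size only changes where resize is called.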

Check this link about std::string:
char& operator[] (size_t pos);
const char& operator[] (size_t pos) const;
If pos is not greater than the string length, the function never
throws exceptions (no-throw guarantee). Otherwise, it causes
undefined behavior.
In your while loop you are writing to bin at indices that are equal to or greater than bin.size(), which is undefined behavior.

You aren't changing the size of the string anywhere. If the string you pass into the function is of length one and you access it at indices larger than 0, i.e., at bin[1], bin[2], you are not modifying the string but some other memory locations after the string - there might be something else stored there. Corrupting memory in this way does not necessarily lead to a crash or an exception right away. It may only surface once those memory locations are used later on in your program.

Accepting a reference to a string makes it possible to change instances of strings from the calling code inside the called code:
void access(std::string & str) {
// str is the same instance as the function
// is called with.
// without the reference, a copy would be made,
// then there would be two distinct instances
}
// ...
std::string input = "test";
access(input);
// ...
So any function or operator that is called on a reference is effectively called on the referenced instance.
When, similar to your linked question, the code
str = " new contents";
is inside of the body of the access function, then operator= of the input instance is called.
This (copy assignment) operator is discarding the previous contents of the string, and then copying the characters of its argument into newly allocated storage, whose needed length is determined before.
On the other hand, when you have code like
str[1] = 'a';
inside the access function, then this calls operator[] on the input instance. This operator is only providing access to the underlying storage of the string, and not doing any resizing.
So your issues aren't related to the reference, but to misusing the index operator[]:
Calling that operator with an argument that's not less than the string's size/length leads to undefined behaviour.
To fix that, you could resize the string manually before using the index operator.
As a side note: IMO you should try to write your function in a more functional way:
std::string toOct(std::string const &);
That is, instead of modifying the passed string, create a new one.
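Applied to the decimal-to-binary conversion from the question, that style could look like the sketch below. The name to_binary, the unsigned parameter, and the most-significant-digit-first output are my assumptions; the question's code was unfinished and did not fix an output order.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Hypothetical function: returns the binary digits of n, most significant first.
std::string to_binary(unsigned int n) {
    if (n == 0) {
        return "0";
    }
    std::string bits;
    while (n != 0) {
        // push_back grows the string as needed; digits arrive least significant first.
        bits.push_back(static_cast<char>('0' + (n % 2)));
        n /= 2;
    }
    std::reverse(bits.begin(), bits.end());
    assert(!bits.empty());
    return bits;
}
```

The caller then writes std::string s = to_binary(55); and never has to pre-size anything.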

The bounds of the string are limited by its current content. That is why when you initialise the string with 6 characters you will stay inside bounds for conversion of 55 to binary and program runs without error.
The automatic expansion feature of strings can be utilised using
std::string::operator+=
to append characters at the end of current string. Changed code snippet will look like this:
void dectobin(string & bin, int n){
//...
bin += (n % 2) + '0';
//...
}
Plus you don't need to initialise the original string in main() and your program should now run for arbitrary decimals as well.
int main(){
//...
string s;
dectobin(s,55);
//...
}

C++ faster way to do string addition?

I'm finding standard string addition to be very slow so I'm looking for some tips/hacks that can speed up some code I have.
My code is basically structured as follows:
inline void add_to_string(string data, string &added_data) {
if(added_data.length()<1) added_data = added_data + "{";
added_data = added_data+data;
}
int main()
{
int some_int = 100;
float some_float = 100.0;
string some_string = "test";
string added_data;
added_data.reserve(1000*64);
for(int ii=0;ii<1000;ii++)
{
//variables manipulated here
some_int = ii;
some_float += ii;
some_string.assign(ii%20,'A');
//then we concatenate the strings!
stringstream fragment;
fragment<<some_int <<","<<some_float<<","<<some_string;
add_to_string(fragment.str(),added_data);
}
return 0;
}
Doing some basic profiling, I'm finding that a ton of time is being used in the for loop. Are there some things I can do that will significantly speed this up? Will it help to use c strings instead of c++ strings?

String addition is not the problem you are facing. std::stringstream is known to be slow due to its design. On every iteration of your for-loop the stringstream is responsible for at least 2 allocations and 2 deletions. The cost of each of these 4 operations is likely more than that of the string addition.
Profile the following and measure the difference:
std::string stringBuffer;
for(int ii=0;ii<1000;ii++)
{
//variables manipulated here
some_int = ii;
some_float += ii;
some_string.assign(ii%20,'A');
//then we concatenate the strings!
char buffer[128];
sprintf(buffer, "%i,%f,%s",some_int,some_float,some_string.c_str());
stringBuffer = buffer;
add_to_string(stringBuffer ,added_data);
}
Ideally, replace sprintf with snprintf (or _snprintf, or whatever bounded equivalent your compiler supports) so that the write cannot run past the end of buffer.
As a rule of thumb, use stringstream for formatting by default and switch to the faster and less safe functions like sprintf, itoa, etc. whenever performance matters.
Edit: that, and what didierc said: added_data += data;
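A sketch of the bounded variant, assuming a compiler that provides std::snprintf from <cstdio> (standard since C++11; older MSVC versions only offer _snprintf, which does not guarantee NUL termination on truncation). The helper name format_record is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: formats one "int,float,string" fragment.
std::string format_record(int some_int, float some_float,
                          const std::string& some_string) {
    char buffer[128];
    // snprintf returns the length the full output would have had.
    const int written = std::snprintf(buffer, sizeof(buffer), "%d,%f,%s",
                                      some_int, some_float, some_string.c_str());
    if (written < 0) {
        return std::string();           // formatting error
    }
    std::size_t len = static_cast<std::size_t>(written);
    if (len >= sizeof(buffer)) {
        len = sizeof(buffer) - 1;       // output was truncated to fit
    }
    assert(len < sizeof(buffer));
    return std::string(buffer, len);    // length is known, no second scan
}
```

In the loop this becomes add_to_string(format_record(some_int, some_float, some_string), added_data);, and an overlong some_string is truncated instead of overflowing the buffer.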

You can save lots of string operations if you do not call add_to_string in your loop.
I believe this does the same (although I am not a C++ expert and do not know exactly what stringstream does):
stringstream fragment;
for(int ii=0;ii<1000;ii++)
{
//variables manipulated here
some_int = ii;
some_float += ii;
some_string.assign(ii%20,'A');
//then we concatenate the strings!
fragment<<some_int<<","<<some_float<<","<<some_string;
}
// inlined add_to_string call without the if-statement ;)
added_data = "{" + fragment.str();

I see you used the reserve method on added_data, which should help by avoiding multiple reallocations of the string as it grows.
You should also use the += string operator where possible:
added_data += data;
I think that the above should save some significant time by avoiding unnecessary copies of added_data back and forth through a temporary string when doing the concatenation.
This += operator is a simpler version of the string::append method; it just copies data directly to the end of added_data. Since you made the reserve, that operation alone should be very fast (almost equivalent to a strcpy).
But why go through all this when you are already using a stringstream to handle input? Keep it all in there to begin with!
The stringstream class is indeed not very efficient.
You may have a look at the stringstream class for more information on how to use it, if necessary, but your solution of using a string as a buffer seems to avoid that class speed issue.
At any rate, stay away from any attempt at reimplementing the speed-critical code in pure C unless you really know what you are doing. Some other SO posts support the idea of doing it, but I think it's best (read: safer) to rely as much as possible on the standard library, which will be enhanced over time, and takes care of many corner cases you (or I) wouldn't think of. If your input data format is set in stone, then you might start thinking about taking that road, but otherwise it's premature optimization.
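Put together, the helper from the question might be rewritten roughly as follows. This is a sketch that keeps the original name and behaviour and only changes how the data is passed and appended.

```cpp
#include <cassert>
#include <string>

// Same behaviour as the question's add_to_string, but data is taken by const
// reference and appended in place, so no temporary copy of added_data is made.
void add_to_string(const std::string& data, std::string& added_data) {
    if (added_data.empty()) {
        added_data += '{';
    }
    added_data += data;
    assert(!added_data.empty());   // at least the opening brace is present
}
```

With added_data.reserve(...) done up front, as in the question, each call is then a single in-place append.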

If you start added_data with a "{", you would be able to remove the if from your add_to_string method: the if gets executed exactly once, when the string is empty, so you might as well make it non-empty right away.
In addition, your add_to_string makes a copy of the data; this is not necessary, because it does not get modified. Accepting the data by const reference should speed things up for you.
Finally, changing your added_data from a string to a std::stringstream should let you append to it in the loop, without the intermediary stringstream that gets created, copied, and thrown away on each iteration of the loop.
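A minimal sketch of that last suggestion, wrapping the question's loop in a hypothetical function build_added_data so it can be called and checked. It uses std::ostringstream since the stream is only written to, and the count parameter replaces the hard-coded 1000.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical wrapper around the question's loop: one stream accumulates
// everything, so no per-iteration stream or temporary string is created.
std::string build_added_data(int count) {
    assert(count >= 0);
    float some_float = 100.0f;
    std::string some_string = "test";
    std::ostringstream added_data;
    added_data << '{';                      // written once, so no if is needed
    for (int ii = 0; ii < count; ++ii) {
        some_float += ii;
        some_string.assign(ii % 20, 'A');
        added_data << ii << ',' << some_float << ',' << some_string;
    }
    return added_data.str();                // a single copy out at the end
}
```

Whether this beats the sprintf-style version is something to measure; it mainly removes the per-iteration construction and copy.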

Please have a look at Twine used in LLVM.
A Twine is a kind of rope, it represents a concatenated string using a
binary-tree, where the string is the preorder of the nodes. Since the
Twine can be efficiently rendered into a buffer when its result is used,
it avoids the cost of generating temporary values for intermediate string
results -- particularly in cases when the Twine result is never
required. By explicitly tracking the type of leaf nodes, we can also avoid
the creation of temporary strings for conversions operations (such as
appending an integer to a string).
It may be helpful in solving your problem.

How about this approach?
This is a DevPartner for MSVC 2010 report.

string newstring = stringA & stringB;
I don't think strings are slow; it's the conversions that can make it slow,
and maybe your compiler, which might check variable types for mismatches.
