Is using strings this way inefficient? - c++

I am a newbie to C++ and am running into problems with my teacher over using strings in my code. Though it is clear to me that I have to stop doing that in her class, I am curious as to why it is wrong. In this program the five strings I assigned were going to be reused no fewer than 4 to 5 times, so I put the text into strings. I was told to stop doing it because it is inefficient. Why? In C++, is literal text supposed to be typed out each time rather than stored in strings, and if so, why? Below is part of the program; please tell me why it is bad.
string Bry = "berries";
string Veg = "vegetables";
string Flr = "flowers";
string AllStr;
float Tmp1, Precip;
int Tmp, FlrW, VegW, BryW, x, Selct;
bool Cont = true;
AllStr = Flr + ", " + Bry + ", " + "and " + Veg;

Whether using strings is inefficient really depends on how you're using them.
First off, I would argue that you should be using C++ strings as a default - only going to raw C strings if you actually measure and find C++ strings to be too slow. The advantages (primarily for security) are just too great - it's all too easy to screw up buffer management with raw C strings. So I would disagree with your teacher that this is overly inefficient.
That said, it's important to understand the performance implications of using C++ strings. Since their contents are typically dynamically allocated, you may end up spending a lot of time copying and reallocating buffers. This is usually not a problem; usually there are other things which take up much more time. However, if you're doing this right in the middle of a loop that's critical to your program's performance, you may need to find another method.
In short, premature optimization is usually a bad idea. Write code that is obviously correct, even if it takes ever-so-slightly longer to run. But be aware of the costs and trade-offs you're making at the same time; that way, if it turns out that C++ strings are actually slowing your program down a lot, you'll know what to change to fix that.

Yes, it's fairly inefficient, for the following reasons:
When you construct a std::string object, it has to allocate storage for the string content (which may or may not be a separate dynamic memory allocation, depending on whether the small-string optimization kicks in) and copy the literal string passed to the constructor. For example, when you say string Bry = "berries", it allocates a separate memory block (potentially from dynamic memory) and then copies "berries" to that block.
So you potentially have an extra dynamic memory allocation (costing time),
have to perform the copy (costing more time),
and end up with two copies of the same string (costing space).
Using std::string::operator+ produces a new string that is the result of concatenation. So when you write several + operators in a row, you have several temporary concatenation results and a lot of unnecessary copying.
For your example, I recommend:
Using string literals unless you actually need the functionality only available in std::string.
Using std::stringstream to concatenate several strings together.
Normally, code readability is preferred over micro-optimizations of this sort, but luckily you can have both performance and readability in this case.
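For illustration, a minimal sketch of the stringstream approach using the question's variables:

#include <sstream>
#include <string>

int main() {
    std::string Bry = "berries", Veg = "vegetables", Flr = "flowers";

    // A chain of operator+ calls may produce several temporary strings;
    // an ostringstream appends into one growing buffer instead.
    std::ostringstream oss;
    oss << Flr << ", " << Bry << ", and " << Veg;
    std::string AllStr = oss.str();
    return 0;
}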

Your teacher is both right and wrong. S/he's right that building up strings from substrings at runtime is less CPU-efficient than simply providing the fully pre-built string in the code to start with -- but s/he's wrong in thinking that efficiency is necessarily an important factor to worry about in this case.
In a lot of cases, efficiency simply doesn't matter. At all. For example, if your code above is only going to be executed rarely (e.g. no more than once per second), then it's going to be literally impossible to measure any difference between the "most efficient version" and your not-so-efficient version. Given that, it's quite justifiable to decide that other factors (such as code readability and maintainability) are more important than maximizing efficiency.
Of course, if your program is going to be reconstructing these strings thousands or millions of times per second, then making sure your code is maximally efficient, even at the expense of readability/maintainability, is a good tradeoff to make. But I doubt that is the case here.

Your approach is almost perfect - try to declare everything only once. But if something is not used more than once, don't waste your fingers typing it :-) i.e. in a 10-line program.
The only change I would suggest is to make the strings const to help the compiler optimize your program.
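For example, a minimal sketch of the const version of the question's declarations (continuing the question's snippet):

const std::string Bry = "berries";
const std::string Veg = "vegetables";
const std::string Flr = "flowers";
// const documents that these never change and lets the compiler (and the reader) rely on that.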
If your instructor still disagrees - get a new instructor.

It is inefficient. Doing that last line right would be 4-5 times faster.
At the very least you should use +=.
+= means that you avoid creating new strings with the + operator.
The instructor knows that when you do string = string + string, C++ creates a new string that is immediately destroyed.
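As a rough illustration, the last line of the question's snippet rewritten with += (a sketch continuing the question's own variables):

AllStr = Flr;          // start from the first piece
AllStr += ", ";
AllStr += Bry;
AllStr += ", and ";
AllStr += Veg;         // each += appends in place rather than building a fresh temporary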

Efficiency is probably not a good argument against using string in school assignments, but yes, if I were a teacher and the topic were not some very high-level application, I wouldn't want my students using string.
The real reason is that string hides the low-level memory management. A student coming out of college should have basic memory-management skills. Nowadays, in a working environment, programmers don't deal with memory management most of the time, but there are always situations where you need to understand what's happening under the hood in order to reason about the problem you're encountering.

With the context given, it looks like you should just be able to declare AllStr as a const string or string literal without all the substrings and addition. Assuming there's more to it, declaring them as string objects allocates memory at runtime. (And, not that there is any practical impact here, but you should be aware that STL container objects sometimes allocate a default minimum of space that is larger than what is initially stored in them, as part of their optimizations in anticipation of later modifying operations. I'm not sure whether std::string does so on a declare/assign or not.) If you are only ever going to use them as literals, declaring them as const char* or in a #define is easier on both memory and runtime performance, and you can still use them as r-values in string operations. If you are using them in other ways in code you are not showing us, then whether they need to be strings depends on whether they ever need to be changed or manipulated.
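For illustration, a sketch of the literal-pointer version (the names mirror the question's variables):

const char* Bry = "berries";      // points straight at the string literal,
const char* Veg = "vegetables";   // no runtime allocation or copy
const char* Flr = "flowers";

// They can still be used as r-values in std::string operations:
std::string AllStr = std::string(Flr) + ", " + Bry + ", and " + Veg;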
If you are trying to learn coding, inefficiencies that don't matter in practice are still things you should be aware of and avoid if unnecessary. In production code, there are sometimes reasons to do something like this, but if it is not for any good reason, it's just sloppy. She's right to point it out, but what she should be doing is using that as a starting point for a conversation about the various tradeoffs involved - memory, speed, readability, maintainability, etc. If she's a teacher, she should be looking for "teaching moments" like this rather than just an opportunity to scold.

You can use std::string::append();
it's better than + and comparable to += for building up a string piece by piece.
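A minimal sketch with the question's variables:

AllStr.append(Flr).append(", ").append(Bry).append(", and ").append(Veg);
// append() returns a reference to the same string, so the calls chain without
// creating intermediate string objects.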

Related

Why doesn't std::string take a null pointer?

I recently passed a null pointer to a std::string constructor and got undefined behavior. I'm certain this is something that thousands or tens of thousands of programmers have done before me, and this same bug has no doubt crashed untold numbers of programs. It comes up a lot when converting from code using char* to code using std::string, and it's the kind of thing that is not catchable at compile time and can easily be missed in run time unit tests.
What I'm confused about is the reason for specifying std::string this way.
Why not just define std::string(NULL)==""?
The efficiency loss would be negligible; I doubt it's even measurable in a real program.
Does anyone know what the possible reason for making std::string(NULL) undefined is?
No good reason as far as I know.
Someone just proposed a change to this a month ago. I encourage you to support it.
std::string is not the best example of well-done standardization. The version initially standardized was impossible to implement; the requirements placed on it were not consistent with each other.
At some point that inconsistency was fixed.
In C++11 the rules were changed to prevent COW (copy-on-write) implementations, which broke the ABI of existing, reasonably compliant std::strings. This change may have been the point where the inconsistency was fixed; I do not recall.
Its API is different from the rest of std's containers because it didn't come from the same pre-std STL.
Treating this legacy behavior of std::string as some kind of reasoned decision that takes into account performance costs is not realistic. If any such testing was done, it was 20+ years ago on a non-standard compliant std::string (because none could exist, the standard was inconsistent).
It continues to be UB on passing (char const*)0 and nullptr due to inertia, and will continue to do so until someone makes a proposal and demonstrates that the cost is tiny while the benefit is not.
Constructing a std::string from a literal char const[N] is already a low performance solution; you already have the size of the string at compile time and you drop it on the ground and then at runtime walk the buffer to find the '\0' character (unless optimized around; and if so, the null check is equally optimizable). The high performance solution involves knowing the length and telling std::string about it instead of copying from a '\0' terminated buffer.
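For illustration, one way of "telling std::string the length" might look roughly like this (the helper name is made up for the example):

#include <cstddef>
#include <string>

// Deduces the literal's length at compile time, so no runtime scan for '\0' is needed.
template <std::size_t N>
std::string make_string(const char (&lit)[N]) {
    return std::string(lit, N - 1);   // N includes the trailing '\0'
}

// std::string s = make_string("berries");   // length 7 known at compile time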
The sole reason is: Runtime performance.
It would indeed be easy to define that std::string(NULL) results in the empty string. But it would cost an extra check at the construction of every std::string from a const char *, which can add up.
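If one wanted that behavior today, a small wrapper shows what the extra check amounts to (the helper name is invented for the example; this is not what the standard library itself does):

#include <string>

// Converts a possibly-null C string, mapping nullptr to the empty string.
inline std::string to_string_safe(const char* s) {
    return s ? std::string(s) : std::string();   // the one extra branch per construction
}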
On the balance between absolute maximum performance and convenience, C++ always goes for absolute maximum performance, even if this would mean to compromise the robustness of programs.
The most famous example is to not initialize POD member variables in classes by default: Even though in 99% of all cases programmers want all POD member variables to be initialized, C++ decides not to do so to allow the 1% of all classes to achieve slightly higher runtime performance. This pattern repeats itself all over the place in C++. Performance over everything else.
There is no "the performance impact would be negligible" in C++. :-)
(Note that I personally do not like this behavior in C++ either. I would have made it so that the default behavior is safe, and that the unchecked and uninitialized behavior has to be requested explicitly, for example with an extra keyword. Uninitialized variables are still a major problem in a lot of programs in 2018.)

Why is "char" a bad programming practice in struct types?

I see in the C++ programming language some recommendations like "don't use char arrays in struct types", for example:
struct Student {
    int stdnum, FieldCode, age;
    double average, marks, res[NumOfCourses];
    char Fname[20], Lname[20], cmp[20];
};
and that it is better to use:
struct Student {
    int stdnum, FieldCode, age;
    double average, marks, res[NumOfCourses];
    string Fname, Lname, cmp;
};
Any other recommendations on the matter would be welcome.
Thank you in advance.
Because string handling using C-level raw character arrays is hard to get right, prone to failure, and when it fails it can fail rather horribly.
With a higher-level string datatype, your code becomes easier to write, more likely to be correct. It's also often easier to get shorter code, since the string datatype does a lot of work for you that you otherwise have to do yourself.
The original question was tagged with the c tag, but of course string is a C++ construct so this answer doesn't apply to C. In C, you can choose to use some string library (glib's GString is nice) to gain similar benefits, but of course you'll never have overloaded operators like in C++.
unwind's answer is generally perfectly appropriate, unless it's really a sequence of 20 chars you're after. In that case a std::string might be overkill (performance-wise as well), but a std::array might still be better.
For your student class, you have first name, last name and "cmp" - I've no idea what that's supposed to be. Clearly a student could have a first and/or last name longer than 20 characters, so by hardcoding an array of 20 elements you've already created a system that:
has to bother to check all inputs to make sure no attempt is made to store more than 19 characters (leaving space for a NUL), and
can't reliably print out any formal documents (e.g. graduation certificates) that require students' exact names.
If you don't carefully check all your input handling when modifying arrays, your program could crash, corrupt data and/or malfunction.
With std::string, you can generally just let people type into whatever fields your User Interface has, or pick up data from files, databases or the network, and store whatever you're given, print whatever you're given, and add extra characters to it without worrying about crossing that threshold, etc. In extreme situations where you can't trust your input sources (e.g. accepting student data over untrusted network connections) you may still want to do some length and content checks, but you'd very rarely find it necessary to let those proliferate throughout all your code the way array-bounds checking often needs to.
There are some performance implications:
fixed length arrays have to be sized to your "worst supported case" size, so may waste considerable space for average content
strings have some extra data members, and if the textual content is larger than any internal Short String Optimisation buffer, then they may use further dynamic memory allocation (i.e. new[]), and the allocation routines might allocate more memory than you actually asked for, or be unable to effectively reuse delete[]d memory due to fragmentation
if a std::string implementation happens to share a buffer between std::string objects that are copied (just until one is or might be modified), then std::strings could reduce your memory usage when there are multiple copies of the student struct - but this probably only happens transiently and is only likely to help with very long strings
It's hard to conclude much from just reading about all these factors - if you care about potential performance impact in your application, you should always benchmark with your actual data and usage.
Separately, std::string is intuitive to use, supporting operators like +, +=, ==, !=, < etc., so you're more likely to be able to write correct, concise code that's easily understood and maintained.
I don't think it makes a difference in C either.
Check this previous Stack Overflow answer, which clearly explains that strings are more secure than char *.

NULL terminated string and its length

I have legacy code that receives some proprietary data, parses it, and creates a bunch of static char arrays (embedded in the class representing the message) to hold NUL-terminated strings. Afterwards pointers to the strings are passed all around and finally serialized to some buffer.
Profiling shows that str*() methods take a lot of time.
Therefore I would like to use memcpy() where it's possible. To achieve it I need a way to associate a length with a pointer to a NUL-terminated string. I thought about:
Using std::string looks less efficient, since it requires memory allocation and thread synchronization.
I can use std::pair<pointer to string, length>. But in this case I need to maintain length "manually".
What do you think?
Use std::string.
"Profiling shows that str*() methods take a lot of time"
Sure they do ... operating on any array takes a lot of time.
"Therefore I would like to use memcpy() where it's possible. To achieve it I need a way to associate a length with a pointer to a NUL-terminated string. I thought about:"
memcpy is not really any slower than strcpy. In fact if you perform a strlen to identify how much you are going to memcpy then strcpy is almost certainly faster.
"Using std::string looks less efficient, since it requires memory allocation and thread synchronization"
It may look less efficient, but there are a lot of better minds than yours or mine that have worked on it.
"I can use std::pair. But in this case I need to maintain length 'manually'."
That's one way to save yourself time on the length calculation. Obviously you need to maintain the length manually. This is how Windows BSTRs work, effectively (though the length is stored immediately prior, in memory, to the actual string data). std::string, for example, already does this ...
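A minimal sketch of that pointer-plus-length idea (essentially a hand-rolled string view; the struct name is invented for the example):

#include <cstddef>
#include <cstring>

struct StringRef {
    const char* data;   // points into an existing buffer; not owned
    std::size_t len;    // maintained alongside the pointer
};

// With the length at hand, copying never has to re-scan for the terminator:
//   std::memcpy(dst, ref.data, ref.len);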
What do you think?
I think your question is asked terribly. There is no real question asked, which makes answering next to impossible. I advise you to actually ask specific questions in the future.
Use std::string. It's advice already given, but let me explain why:
One, it uses a custom memory allocation scheme. Your char* strings are probably malloc'ed. That means they are worst-case aligned, which really isn't needed for a char[]. std::string doesn't suffer from needless alignment. Furthermore, common implementations of std::string use the "Small String Optimization", which eliminates a heap allocation altogether and improves locality of reference. The string size will be on the same cache line as the char[] itself.
Two, it keeps the string length, which is indeed a speed optimization. Most str* functions are slower because they don't have this information up front.
A second option would be a rope class, e.g. from SGI. This may be more efficient by eliminating some string copies.
Your post doesn't explain where the str*() function calls are coming from; passing around char * certainly doesn't invoke them. Identify the sites that actually do the string manipulation and then try to find out if they're doing so inefficiently. One common pitfall is that strcat first needs to scan the destination string for the terminating 0 character. If you call strcat several times in a row, you can end up with a O(N^2) algorithm, so be careful about this.
Replacing strcpy by memcpy doesn't make any significant difference; strcpy doesn't do an extra pass to find the length of the string, it's simply (conceptually!) a character-by-character copy that stops when it encounters the terminating 0. This is not much more expensive than memcpy, and always cheaper than strlen followed by memcpy.
The way to gain performance on string operations is to avoid copies where possible; don't worry about making the copying faster, instead try to copy less! And this holds for all string (and array) implementations, whether it be char *, std::string, std::vector<char>, or some custom string / array class.
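A sketch of that strcat pitfall and the usual fix (function names are made up; error handling omitted):

#include <cstddef>
#include <cstring>

// O(N^2): every strcat re-scans 'out' from the beginning to find its end.
void build_slow(char* out, const char* const* parts, int n) {
    out[0] = '\0';
    for (int i = 0; i < n; ++i)
        std::strcat(out, parts[i]);
}

// O(N): remember where the end is, so each piece is copied exactly once.
void build_fast(char* out, const char* const* parts, int n) {
    char* end = out;
    for (int i = 0; i < n; ++i) {
        std::size_t len = std::strlen(parts[i]);
        std::memcpy(end, parts[i], len);
        end += len;
    }
    *end = '\0';
}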
What do I think? I think that you should do what everyone else obsessed with pre-optimization does. You should find the most obscure, unmaintainable, yet intuitively (to you anyway) high-performance way you can and do it that way. Sounds like you're onto something with your pair<char*,len> with malloc/memcpy idea there.
Whatever you do, do NOT use pre-existing, optimized wheels that make maintenance easier. Being maintainable is simply the least important thing imaginable when you're obsessed with intuitively measured performance gains. Further, as you well know, you're quite a bit smarter than those who wrote your compiler and its standard library implementation. So much so that you'd be seriously silly to trust their judgment on anything; you should really consider rewriting the entire thing yourself because it would perform better.
And ... the very LAST thing you'll want to do is use a profiler to test your intuition. That would be too scientific and methodical, and we all know that science is a bunch of bunk that's never gotten us anything; we also know that personal intuition and revelation is never, ever wrong. Why waste the time measuring with an objective tool when you've already intuitively grasped the situation's seemingliness?
Keep in mind that I'm being 100% honest in my opinion here. I don't have a sarcastic bone in my body.

Are there any practical limitations to only using std::string instead of char arrays and std::vector/list instead of arrays in c++?

I use vectors, lists, strings and wstrings obsessively in my code. Are there any catch-22s involved that should make me more interested in using arrays from time to time, or chars and wchars instead?
Basically, if working in an environment which supports the standard template library is there any case using the primitive types is actually better?
For 99% of the time and for 99% of Standard Library implementations, you will find that std::vectors will be fast enough, and the convenience and safety you get from using them will more than outweigh any small performance cost.
For those very rare cases when you really need bare-metal code, you can treat a vector like a C-style array:
vector<int> v( 100 );
int* p = &v[0];
p[3] = 42;
The C++ standard guarantees that vectors are allocated contiguously, so this is guaranteed to work.
Regarding strings, the convenience factor becomes almost overwhelming, and the performance issues tend to go away. If you go back to C-style strings, you are also going back to the use of functions like strlen(), which are inherently very inefficient themselves.
As for lists, you should think twice, and probably thrice, before using them at all, whether your own implementation or the standard one. The vast majority of computing problems are better solved using a vector/array. The reason lists appear so often in the literature is in large part because they are a convenient data structure for textbook and training-course writers to use to explain pointers and dynamic allocation in one go. I speak here as an ex training-course writer.
I would stick to the STL classes (vectors, strings, etc.). They are safer, easier to use, more productive, with less chance of memory leaks, and, AFAIK, they do some additional run-time boundary checking, at least in DEBUG builds (Visual C++).
Then measure the performance. If you identify that the bottleneck(s) are in the STL classes, then move to C-style strings and arrays.
From my experience, the chances that the bottleneck is in vector or string usage are very low.
One problem is the overhead when accessing elements. Even with vector and string, when you access an element by index, the code must first retrieve the buffer address and then add the offset (you don't do it manually, but the compiler emits such code). With a raw array you already have the buffer address. This extra indirection can lead to significant overhead in certain cases and is something to profile when you want to improve performance.
If you don't need real time responses, stick with your approach. They are safer than chars.
You can occasionally encounter scenarios where you'll get better performance or memory usage from doing some stuff yourself (for example, std::string typically has about 24 bytes of overhead: 12 bytes for the pointers in the std::string itself, plus a header block on its dynamically allocated piece).
I have worked on projects where converting from std::string to const char* saved noticeable memory (tens of MB). I don't believe these projects are what you would call typical.
Oh, using STL will hurt your compile times, and at some point that may be an issue. When your project results in over a GB of object files being passed to the linker, you might want to consider how much of that is template bloat.
I've worked on several projects where the memory overhead for strings has become problematic.
It's worth considering in advance how your application needs to scale. If you need to be storing an unbounded number of strings, using const char*s into a globally managed string table can save you huge amounts of memory.
But generally, definitely use STL types unless there's a very good reason to do otherwise.
I believe the default memory allocation technique for vectors and strings is one that allocates double the amount of memory each time the currently allocated memory gets used up. This can be wasteful. You can provide a custom allocator, of course...
The other thing to consider is stack vs. heap. Statically sized arrays and strings can sit on the stack, or at least the compiler handles the memory management for you. Newer compilers will handle dynamically sized arrays for you too if they provide the relevant C99/C++0x feature. Vectors and strings will always use the heap, and this can introduce performance issues if you have really tight constraints.
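If the doubling growth or repeated reallocation is a concern, one common mitigation is to reserve capacity up front (a sketch; the sizes are arbitrary):

#include <string>
#include <vector>

void example() {
    std::vector<char> buf;
    buf.reserve(4096);   // one allocation now instead of repeated doubling later

    std::string s;
    s.reserve(1024);     // same idea for strings when the final size is roughly known
}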
As a rule of thumb, use what's already there unless it hurts your project with its speed/memory overhead... you'll probably find that for 99% of stuff the STL-provided classes save you time and effort with little to no impact on your application's performance. (i.e. "avoid premature optimisation")

C++ string memory management

Last week I wrote a few lines of code in C# to fire up a large text file (300,000 lines) into a Dictionary. It took ten minutes to write and it executed in less than a second.
Now I'm converting that piece of code into C++ (because I need it in an old C++ COM object). I've spent two days on it thus far. :-( Although the productivity difference is shocking on its own, it's the performance that I would need some advice on.
It takes seven seconds to load, and even worse: it takes just exactly that much time to free all the CStringWs afterwards. This is not acceptable, and I must find a way to increase the performance.
Is there any chance that I can allocate this many strings without seeing this horrible performance degradation?
My guess right now is that I'll have to stuff all the text into a large array and then let my hash table point to the beginning of each string within this array and drop the CStringW stuff.
But before that, any advice from you C++ experts out there?
EDIT: My answer to myself is given below. I realized that that is the fastest route for me, and also step in what I consider the right direction - towards more managed code.
This sounds very much like the Raymond Chen vs. Rico Mariani C++ vs. C# Chinese/English dictionary performance bake-off. It took Raymond several iterations to beat C#.
Perhaps there are ideas there that would help.
http://blogs.msdn.com/ricom/archive/2005/05/10/performance-quiz-6-chinese-english-dictionary-reader.aspx
You are stepping into the shoes of Raymond Chen. He did the exact same thing, writing a Chinese dictionary in unmanaged C++. Rico Mariani did too, writing it in C#. Mr. Mariani made one version. Mr. Chen wrote 6 versions, trying to match the perf of Mariani's version. He pretty much rewrote significant chunks of the C/C++ runtime library to get there.
Managed code got a lot more respect after that. The GC allocator is impossible to beat. Check this blog post for the links. This blog post might interest you too, instructive to see how the STL value semantics are part of the problem.
Yikes. Get rid of the CStrings...
Try a profiler as well.
Are you sure you were not just running debug code?
Use std::string instead.
EDIT:
I just did a simple test of ctor and dtor comparisons.
CStringW seems to take between 2 and 3 times the time to do a new/delete.
I iterated 1,000,000 times doing new/delete for each type. Nothing else - and a GetTickCount() call before and after each loop. I consistently get twice as long for CStringW.
That doesn't address your entire issue though I suspect.
EDIT:
I also don't think that using string or CStringW is the real problem - there is something else going on that is causing your issue.
(But for god's sake, use the STL anyway!)
You need to profile it. That is a disaster.
If it is a read-only dictionary then the following should work for you.
Use fseek/ftell functionality to find the size of the text file.
Allocate a chunk of memory of that size + 1 to hold it.
fread the entire text file into your memory chunk.
Iterate through the chunk:
push_back into a vector<const char *> the starting address of each line;
search for the line terminator using strchr;
when you find it, deposit a NUL, which turns the line into a string;
the next character is the start of the next line;
repeat until you do not find a line terminator, then insert a final NUL character.
You can now use the vector to get the pointer that will let you access the corresponding value.
When you are finished with your dictionary, deallocate the memory and let the vector die when going out of scope.
(A rough sketch of this approach is shown after the edit below.)
[EDIT]
This can be a little more complicated on the DOS/Windows platform, as the line terminator is CRLF.
In that case, use strstr to find it, and increment by 2 to find the start of the next line.
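A rough sketch of the approach described above (error handling omitted; assumes '\n' line endings; the function name is made up):

#include <cstdio>
#include <cstring>
#include <vector>

// Reads the whole file into one buffer and collects a pointer to each line.
// The caller keeps 'buffer' alive for as long as the pointers in 'lines' are used.
bool load_lines(const char* path, std::vector<char>& buffer,
                std::vector<const char*>& lines)
{
    std::FILE* f = std::fopen(path, "rb");
    if (!f) return false;

    std::fseek(f, 0, SEEK_END);
    long size = std::ftell(f);
    std::fseek(f, 0, SEEK_SET);

    buffer.resize(size + 1);
    std::fread(buffer.data(), 1, size, f);
    std::fclose(f);
    buffer[size] = '\0';                    // final NUL

    char* p = buffer.data();
    while (*p) {
        lines.push_back(p);                 // start of this line
        char* nl = std::strchr(p, '\n');
        if (!nl) break;                     // last line had no terminator
        *nl = '\0';                         // turn the line into a string
        p = nl + 1;                         // next line starts here
    }
    return true;
}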
What sort of container are you storing your strings in? If it's a std::vector of CStringW and you haven't reserve()d enough memory beforehand, you're bound to take a hit. A vector typically resizes once it reaches its limit (which is not very high) and then copies the entirety out to the new memory location, which can give you a big hit. As your vector grows exponentially (i.e. if the initial size is 1, it allocates 2 next time, then 4, and so on), the hit becomes less and less frequent.
It also helps to know how long the individual strings are. (At times :)
Thanks all of you for your insightful comments. Upvotes for you! :-)
I must admit I wasn't prepared for this at all - that C# would beat the living crap out of good old C++ in this way. Please don't read that as an offence to C++, but rather as a testament to the amazingly good memory manager that sits inside the .NET Framework.
I decided to take a step back and fight this battle in the InterOp arena instead! That is, I'll keep my C# code and let my old C++ code talk to the C# code over a COM interface.
A lot of questions were asked about my code and I'll try to answer some of them:
The compiler was Visual Studio 2008 and no, I wasn't running a debug build.
The file was read with a UTF-8 file reader which I downloaded from a Microsoft employee who published it on their site. It returned CStringW's, and about 30% of the time was actually spent there, just reading the file.
The container I stored the strings in was just a fixed size vector of pointers to CStringW's and it was never resized.
EDIT: I'm convinced that the suggestions I was given would indeed work, and that I probably could beat the C# code if I invested enough time in it. On the other hand, doing so would provide no customer value at all and the only reason to pull through with it would be just to prove that it could be done...
The problem is not in the CString, but rather that you are allocating a lot of small objects - the default memory allocator isn't optimized for this.
Write your own allocator - allocate a big chunk of memory and then just advance a pointer in it when allocating. This is actually what the .NET allocator does. When you are done, delete the whole buffer.
I think there was a sample of writing custom new/delete operators in (More) Effective C++.
Load the string to a single buffer, parse the text to replace line breaks with string terminators ('\0'), and use pointers into that buffer to add to the set.
Alternatively - e.g. if you have to do an ANSI/UNICODE conversion during load - use a chunk allocator, that sacrifices deleting individual elements.
class ChunkAlloc
{
    std::vector<BYTE> m_data;   // BYTE is the Windows typedef for unsigned char
    size_t m_fill;
public:
    ChunkAlloc(size_t chunkSize) : m_data(chunkSize), m_fill(0) {}
    void * Alloc(size_t size)
    {
        if (m_data.size() - m_fill < size)
        {
            // normally, you'd reserve a new chunk here
            return 0;
        }
        void * result = &(m_data[m_fill]);
        m_fill += size;
        return result;
    }
};
// All allocations from the chunk are freed when the chunk is destroyed.
I wouldn't hack that together in ten minutes, but 30 minutes and some testing sounds fine :)
When working with string classes, you should always look out for unnecessary operations; for example, don't use constructors, concatenation and similar operations too often, and especially avoid them in loops. I suppose there's some character-encoding reason you use CStringW, so you probably can't use something different; otherwise that would be another way to optimize your code.
It's no wonder that the CLR's memory management is better than the bunch of old and dirty tricks MFC is based on: it is at least two times younger than MFC itself, and it is pool-based. When I had to work on a similar project with string arrays and WinAPI/MFC, I just used std::basic_string instantiated with WinAPI's TCHAR and my own allocator based on Loki::SmallObjAllocator. You can also take a look at boost::pool in this case (if you want it to have an "std feel" or have to use a version of the VC++ compiler older than 7.1).