This is really just a question for my own interest, one I haven't been able to answer from the documentation.
I see on http://www.cplusplus.com/reference/string/string/ that append has complexity:
"Unspecified, but generally up to linear in the new string length."
while push_back() has complexity:
"Unspecified; Generally amortized constant, but up to linear in the new string length."
As a toy example, suppose I wanted to append the characters "foo" to a string. Would
myString.push_back('f');
myString.push_back('o');
myString.push_back('o');
and
myString.append("foo");
amount to exactly the same thing? Or is there some difference? You might figure that append would be more efficient, because the library would know up front how much memory is needed to extend the string by the specified number of characters, while push_back may need to allocate memory on each call.
In C++03 (for which most of "cplusplus.com"'s documentation is written), the complexities were unspecified because library implementers were allowed to do Copy-On-Write or "rope-style" internal representations for strings. For instance, a COW implementation might require copying the entire string if a character is modified and there is sharing going on.
In C++11, COW and rope implementations are banned. You should expect constant amortized time per character added or linear amortized time in the number of characters added for appending to a string at the end. Implementers may still do relatively crazy things with strings (in comparison to, say std::vector), but most implementations are going to be limited to things like the "small string optimization".
In comparing push_back and append, push_back deprives the underlying implementation of potentially useful length information which it might use to preallocate space. On the other hand, append requires that an implementation walk over the input twice in order to find that length, so the performance gain or loss is going to depend on a number of unknowable factors such as the length of the string before you attempt the append. That said, the difference is probably extremely Extremely EXTREMELY small. Go with append for this -- it is far more readable.
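If the length of what you are about to add really is known in advance, either style can be preceded by a reserve() call to hand that information back to the implementation yourself. A minimal sketch (the reservation size is purely illustrative):

// Hedged sketch: tell the implementation how much room we expect to need,
// then append however is most readable.
std::string myString;
myString.reserve(myString.size() + 3);  // illustrative figure; size it to your real workload
myString.append("foo");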
I had the same doubt, so I made a small test to check (g++ 4.8.5 in C++11 mode on Linux, Intel, 64-bit, under VMware Fusion).
And the result is interesting:
push :19
append :21
++++ :34
It could be that this is because of the (large) string length, but operator += is very expensive compared with push_back and append.
It is also interesting that when the operator receives only a character (not a string), it behaves very similarly to push_back.
To avoid depending on pre-allocated variables, each loop is defined in its own scope.
Note: vCounter simply uses gettimeofday to measure the differences.
TimeCounter vCounter;
{
    string vTest;
    vCounter.start();
    for (int vIdx = 0; vIdx < 1000000; vIdx++) {
        vTest.push_back('a');
        vTest.push_back('b');
        vTest.push_back('c');
    }
    vCounter.stop();
    cout << "push :" << vCounter.elapsed() << endl;
}
{
    string vTest;
    vCounter.start();
    for (int vIdx = 0; vIdx < 1000000; vIdx++) {
        vTest.append("abc");
    }
    vCounter.stop();
    cout << "append :" << vCounter.elapsed() << endl;
}
{
    string vTest;
    vCounter.start();
    for (int vIdx = 0; vIdx < 1000000; vIdx++) {
        vTest += 'a';
        vTest += 'b';
        vTest += 'c';
    }
    vCounter.stop();
    cout << "++++ :" << vCounter.elapsed() << endl;
}
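TimeCounter itself isn't shown; a minimal sketch of what it might look like, assuming the gettimeofday approach from the note above and millisecond units (both assumptions):

#include <sys/time.h>

// Illustrative timer only: wall-clock milliseconds via gettimeofday.
class TimeCounter {
    timeval m_start, m_stop;
public:
    void start() { gettimeofday(&m_start, 0); }
    void stop()  { gettimeofday(&m_stop, 0); }
    long elapsed() const {
        return (m_stop.tv_sec - m_start.tv_sec) * 1000
             + (m_stop.tv_usec - m_start.tv_usec) / 1000;
    }
};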
Add one more opinion here.
I personally consider it better to use push_back() when adding characters one by one from another string. For instance:
string FilterAlpha(const string& s) {
    string new_s;
    for (auto& it : s) {
        // cast to unsigned char: passing a negative char to isalpha is undefined behavior
        if (isalpha(static_cast<unsigned char>(it))) new_s.push_back(it);
    }
    return new_s;
}
If I used append() here, I would have to replace push_back(it) with append(1, it), which is not as readable to me.
Yes, I would also expect append() to perform better for the reasons you gave, and in a situation where you need to append a string, using append() (or operator+=) is certainly preferable (not least also because the code is much more readable).
But what the Standard specifies is the complexity of the operation. And that is generally linear even for append(), because ultimately each character of the string being appended (and possibly all existing characters, if reallocation occurs) needs to be copied (this is true even if memcpy or similar is used).
I'm finding standard string addition to be very slow so I'm looking for some tips/hacks that can speed up some code I have.
My code is basically structured as follows:
#include <sstream>
#include <string>
using namespace std;

inline void add_to_string(string data, string &added_data) {
    if (added_data.length() < 1) added_data = added_data + "{";
    added_data = added_data + data;
}

int main()
{
    int some_int = 100;
    float some_float = 100.0;
    string some_string = "test";
    string added_data;
    added_data.reserve(1000*64);
    for (int ii = 0; ii < 1000; ii++)
    {
        //variables manipulated here
        some_int = ii;
        some_float += ii;
        some_string.assign(ii%20, 'A');
        //then we concatenate the strings!
        stringstream fragment;
        fragment << some_int << "," << some_float << "," << some_string;
        add_to_string(fragment.str(), added_data);
    }
    return 0;
}
Doing some basic profiling, I'm finding that a ton of time is being spent in the for loop. Are there some things I can do that will significantly speed this up? Will it help to use C strings instead of C++ strings?
String addition is not the problem you are facing. std::stringstream is known to be slow due to its design. On every iteration of your for loop, the stringstream is responsible for at least two allocations and two deallocations; the cost of each of these four operations is likely higher than that of the string addition.
Profile the following and measure the difference:
std::string stringBuffer;
for (int ii = 0; ii < 1000; ii++)
{
    //variables manipulated here
    some_int = ii;
    some_float += ii;
    some_string.assign(ii%20, 'A');
    //then we concatenate the strings!
    char buffer[128];
    sprintf(buffer, "%i,%f,%s", some_int, some_float, some_string.c_str());
    stringBuffer = buffer;
    add_to_string(stringBuffer, added_data);
}
Ideally, replace sprintf with snprintf (or _snprintf, or whatever equivalent your compiler supports) so that the buffer cannot be overrun.
As a rule of thumb, use stringstream for formatting by default and switch to the faster and less safe functions like sprintf, itoa, etc. whenever performance matters.
Edit: that, and what didierc said: added_data += data;
You can save lots of string operations if you do not call add_to_string in your loop.
I believe this does the same (although I am not a C++ expert and do not know exactly what stringstream does):
stringstream fragment;
for (int ii = 0; ii < 1000; ii++)
{
    //variables manipulated here
    some_int = ii;
    some_float += ii;
    some_string.assign(ii%20, 'A');
    //then we concatenate the strings!
    fragment << some_int << "," << some_float << "," << some_string;
}
// inlined add_to_string call without the if-statement ;)
added_data = "{" + fragment.str();
I see you used the reserve method on added_data, which should help by avoiding multiple reallocations of the string as it grows.
You should also use the += string operator where possible:
added_data += data;
I think the above should save significant time by avoiding unnecessary copies of added_data back and forth through a temporary string when doing the concatenation.
This += operator is a simpler version of the string::append method: it just copies data directly onto the end of added_data. Since you called reserve, that operation alone should be very fast (almost equivalent to a strcpy).
But why go through all this, when you are already using a stringstream to build the data? Just keep everything in there to begin with!
The stringstream class is indeed not very efficient.
You may have a look at the stringstream class for more information on how to use it, if necessary, but your solution of using a string as a buffer seems to avoid that class's speed issue.
At any rate, stay away from any attempt at reimplementing the speed-critical code in pure C unless you really know what you are doing. Some other SO posts support the idea of doing it, but I think it's best (read: safer) to rely as much as possible on the standard library, which will be enhanced over time and takes care of many corner cases you (or I) wouldn't think of. If your input data format is set in stone, then you might start thinking about taking that road, but otherwise it's premature optimization.
If you start added_data with a "{", you would be able to remove the if from your add_to_string method: the if gets executed exactly once, when the string is empty, so you might as well make it non-empty right away.
In addition, your add_to_string makes a copy of the data; this is not necessary, because it does not get modified. Accepting the data by const reference should speed things up for you.
Finally, changing your added_data from a string to a stringstream should let you append to it in the loop, without the stringstream intermediary that gets created, copied, and thrown away on each iteration of the loop.
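Putting those suggestions together, a rough sketch of the reworked loop (sketch only; if add_to_string is kept at all, it would now take a const string& and append to a stringstream):

// Sketch: "{" is written exactly once up front, added_data is a stringstream,
// and no per-iteration fragment stream is created at all.
stringstream added_data;
added_data << "{";
for (int ii = 0; ii < 1000; ii++) {
    // variables manipulated here, as before
    added_data << some_int << "," << some_float << "," << some_string;
}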
Please have a look at Twine used in LLVM.
A Twine is a kind of rope, it represents a concatenated string using a
binary-tree, where the string is the preorder of the nodes. Since the
Twine can be efficiently rendered into a buffer when its result is used,
it avoids the cost of generating temporary values for intermediate string
results -- particularly in cases when the Twine result is never
required. By explicitly tracking the type of leaf nodes, we can also avoid
the creation of temporary strings for conversions operations (such as
appending an integer to a string).
It may be helpful in solving your problem.
How about this approach?
This is a DevPartner for MSVC 2010 report.
string newstring = stringA + stringB;
I don't think strings are slow; it's the conversions that can make it slow, and maybe your compiler checking variable types for mismatches.
So I was playing around with some code and wanted to see which method of converting a std::string to upper case was most efficient. I figured that the two would be somewhat similar performance-wise, but I was terribly wrong. Now I'd like to find out why.
The first method of converting the string works as follows: for each character in the string (save the length, iterate from 0 to length), if it's between 'a' and 'z', then shift it so that it's between 'A' and 'Z' instead.
The second method works as follows: for each character in the string (start from 0, keep going until we hit a null terminator), apply the built-in toupper() function.
Here's the code:
#include <cctype>
#include <iostream>
#include <string>

inline std::string ToUpper_Reg(std::string str)
{
    for (int pos = 0, sz = str.length(); pos < sz; ++pos)
    {
        if (str[pos] >= 'a' && str[pos] <= 'z') { str[pos] += ('A' - 'a'); }
    }
    return str;
}

inline std::string ToUpper_Alt(std::string str)
{
    for (int pos = 0; str[pos] != '\0'; ++pos) { str[pos] = toupper(str[pos]); }
    return str;
}

int main()
{
    std::string test = " abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789~!##$%^&*()_+=-`'{}[]\\|\";:<>,./?";
    for (size_t i = 0; i < 100000000; ++i) { ToUpper_Reg(test); /* ToUpper_Alt(test); */ }
    return 0;
}
The first method ToUpper_Reg took about 169 seconds per 100 million iterations.
The second method ToUpper_Alt took about 379 seconds per 100 million iterations.
What gives?
Edit: I changed the second method so that it iterates over the string the way the first one does (save the length, loop while less than length), and it's a bit faster, but still about twice as slow.
Edit 2: Thanks everybody for your submissions! The data I'll be using it on is guaranteed to be ASCII, so I think I'll be sticking with the first method for the time being. I'll keep in mind that toupper is locale-specific for when/if I need it.
std::toupper uses the current locale to do case conversions, which involves a function call and other abstractions. So naturally, it will be slower. But it will also work on non-ASCII text.
toupper() does more than just shift characters in the range [a-z]. For one thing it's locale dependent and can handle more than just ASCII.
toupper() takes the locale into account so it can handle (some) international characters and is much more complex than just handling the character range 'a'-'z'.
Well, ToUpper_Reg() doesn't work. For example, it doesn't turn my name into all uppercase characters. That said, ToUpper_Alt() doesn't work either, because toupper() gets passed a negative value on some platforms, i.e. it creates undefined behavior (normally a crash) when used with my name. This is easily fixed, though, by calling it correctly, something like this:
toupper(static_cast<unsigned char>(str[pos]))
That said, the two versions of the code are not equivalent: the version not using toupper() isn't writing the characters all the time, while the latter version is. Once everything is converted to uppercase, the former always takes the same branch after the test and then does nothing. You might want to change ToUpper_Alt() to look like this and retest:
inline std::string ToUpper_Alt(std::string str)
{
    for (int pos = 0; str[pos] != '\0'; ++pos) {
        if (islower(static_cast<unsigned char>(str[pos]))) {
            str[pos] = toupper(static_cast<unsigned char>(str[pos]));
        }
    }
    return str;
}
I would guess the difference is the writing: toupper() trades the comparison for an array look-up. The locale is quickly accessed and all toupper() does is get the current pointer and access the location at a given offset. With data in the cache this is probably as fast as the branch.
The second one involves a function call, and a function call is an expensive operation in an inner loop. toupper also consults the locale to determine how the character should be changed.
The advantage of the call is that it is standard and will work regardless of the character encoding on the host machine.
That said, I would highly recommend using the boost function:
boost::algorithm::to_upper
It is a template, so it is more than likely to be inlined; however, it does involve locales. I would still use it.
http://www.boost.org/doc/libs/1_40_0/doc/html/boost/algorithm/to_upper.html
I guess it's because the second one calls a C standard library function which, for one thing, isn't inlined, so you get the overhead of a function call. But even more important, this function probably does a lot more than just two comparisons, two jumps, and two integer additions. It performs additional checks on the character, takes the current locale into account, and all that stuff.
std::toupper uses the current locale and the reason why this is slower than the C function is that the current locale is shared and mutable from different threads, so it's necessary to lock the locale object when it's accessed to ensure it's not switched during the call. This happens once per call to toupper and introduces quite a large overhead (obtaining the lock might require a syscall depending on implementation). One workaround if you want to get the performance and respect the locale is to get the locale object first (creating a local copy) and then call the toupper facet on your copy, thus avoiding the need to lock for each toupper call. See the link below for an example.
http://www.cplusplus.com/reference/std/locale/ctype/toupper/
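A rough sketch of that workaround for the plain char case (names are illustrative; the point is that use_facet is called once, outside the loop):

#include <locale>
#include <string>

// Sketch: copy the current global locale once, fetch its ctype<char> facet,
// and reuse that facet for every character instead of locking per toupper() call.
std::string to_upper_with_facet(std::string str)
{
    std::locale loc;  // local copy of the current global locale
    const std::ctype<char>& ct = std::use_facet< std::ctype<char> >(loc);
    for (std::string::size_type pos = 0; pos < str.size(); ++pos)
        str[pos] = ct.toupper(str[pos]);
    return str;
}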
The question has already been answered, but as an aside, replacing the guts of your loop in the first method with:
std::string::value_type &c = str[pos];
if ('a' <= c && c <= 'z') { c += ('A' - 'a'); }
makes it even faster. Maybe my compiler just sucks.
For a specific example, consider that there is no atoi(const std::string &). This is very frustrating, since we as programmers need it so often.
A more general question is: why doesn't the C++ standard library reimplement the standard C libraries in terms of C++ string, C++ vector, and the other C++ standard components, rather than preserving the old C standard libraries and forcing us to use the old char * interfaces?
It is time-consuming, and the code that translates data between the two kinds of interface is hard to make elegant.
Is it for compatibility reasons, considering there was much more legacy C code than there is these days, and preserving the C standard interfaces makes porting C code to C++ much easier?
In addition, I have heard that many other libraries available for C++ add a lot of enhancements and extensions to the STL. Do those libraries support these functions?
PS: Considering how many answers address the first, specific question, I have edited the question a lot to clarify which questions I am most curious about.
Another, more general question is: why does the STL not reimplement all the standard C libraries?
Because the old C libraries do the trick. The C++ standard library only re-implements existing functionality if it can do so significantly better than the old version. And for some parts of the C library, the benefit of writing new C++ implementations just isn't big enough to justify the extra standardization work.
As for atoi and the like, there are new versions of these in the C++ standard library, in the std::stringstream class.
To convert from a type T to a type U:
T in;
U out;
std::stringstream sstr;
sstr << in;   // format the source value into the stream
sstr >> out;  // parse it back out as the destination type
As with the rest of the IOStream library, it's not perfect, it's pretty verbose, it's impressively slow and so on, but it works, and usually it is good enough. It can handle integers of all sizes, floating-point values, C and C++ strings and any other object which defines the operator <<.
EDIT: In addition, I have heard many other libraries available for C++ make a lot of enhancements and extensions to the STL. So do those libraries support these functions?
Boost has a boost::lexical_cast which wraps std::stringstream. With that function, you can write the above as:
U out = boost::lexical_cast<U>(in);
Even in C, using atoi isn't a good way to convert user input: it doesn't provide any error checking at all. Providing a C++ version of it wouldn't be all that useful - considering that it wouldn't throw or check anything, you can just pass .c_str() to it and use it as is.
Instead you should use strtol in C code, which does do error checking. In C++03, you can use stringstreams to do the same, but their use is error-prone: what exactly do you need to check for? .bad(), .fail(), or .eof()? How do you eat up remaining whitespace? What about formatting flags? Such questions shouldn't bother the average user, who just wants to convert a string. boost::lexical_cast does do a good job, but incidentally, C++0x adds utility functions to facilitate fast and safe conversions, through C++ wrappers that can throw if the conversion fails:
int stoi(const string& str, size_t *idx = 0, int base = 10);
long stol(const string& str, size_t *idx = 0, int base = 10);
unsigned long stoul(const string& str, size_t *idx = 0, int base = 10);
long long stoll(const string& str, size_t *idx = 0, int base = 10);
unsigned long long stoull(const string& str, size_t *idx = 0, int base = 10);
Effects: the first two functions call strtol(str.c_str(), ptr, base), and the last three functions
call strtoul(str.c_str(), ptr, base), strtoll(str.c_str(), ptr, base), and strtoull(str.c_str(), ptr, base), respectively. Each function returns the converted result, if any. The argument ptr designates a pointer to an object internal to the function that is used to determine what to store at *idx. If the function does not throw an exception and idx != 0, the function stores in *idx the index of the first unconverted element of str.
Returns: the converted result.
Throws: invalid_argument if strtol, strtoul, strtoll, or strtoull reports that no conversion could be performed. Throws out_of_range if the converted value is outside the range of representable values for the return type.
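A small usage example of the behaviour described above (the input strings are made up; note both the idx out-parameter and the exceptions):

#include <iostream>
#include <stdexcept>
#include <string>

int main()
{
    try {
        std::size_t idx = 0;
        int value = std::stoi("42abc", &idx);  // value == 42, idx == 2
        std::cout << value << " (stopped at index " << idx << ")\n";
        std::stoi("not a number");             // throws std::invalid_argument
    } catch (const std::invalid_argument&) {
        std::cout << "no conversion could be performed\n";
    } catch (const std::out_of_range&) {
        std::cout << "converted value out of range\n";
    }
    return 0;
}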
There's no good way to know if atoi fails. It always returns an integer. Is that integer a valid conversion? Or is the 0 or -1 or whatever indicating an error? Yes it could throw an exception, but that would change the original contract, and you'd have to update all your code to catch the exception (which is what the OP is complaining about).
If translation is too time consuming, write your own atoi:
int atoi(const std::string& str)
{
    std::istringstream stream(str);
    int ret = 0;
    stream >> ret;
    return ret;
}
I see that solutions are offered that use std::stringstream or std::istringstream.
This might be perfectly OK for single-threaded applications, but if an application has lots of threads and often calls an atoi(const std::string& str) implemented in this way, it will result in performance degradation.
Read this discussion for example: http://gcc.gnu.org/ml/gcc-bugs/2009-05/msg00798.html.
And see a backtrace of the constructor of std::istringstream:
#0 0x200000007eb77810:0 in pthread_mutex_unlock+0x10 ()
from /usr/lib/hpux32/libc.so.1
#1 0x200000007ef22590 in std::locale::locale (this=0x7fffeee8)
at gthr-default.h:704
#2 0x200000007ef29500 in std::ios_base::ios_base (this=<not available>)
at /tmp/gcc-4.3.1.tar.gz/gcc-4.3.1/libstdc++-v3/src/ios.cc:83
#3 0x200000007ee8cd70 in std::basic_istringstream<char,std::char_traits<char>,std::allocator<char> >::basic_istringstream (this=0x7fffee4c,
__str=#0x7fffee44, __mode=_S_in) at basic_ios.h:456
#4 0x4000f70:0 in main () at main.cpp:7
So every time you enter atoi() and create a local variable of type std::stringstream, you will lock a global mutex, and in a multithreaded application that is likely to result in waiting on this mutex.
So, in a multithreaded application it's better not to use std::stringstream. For example, simply call atoi(const char*):
inline int atoi(const std::string& str)
{
    return atoi(str.c_str());
}
For your example, you've got two options:
std::string mystring("4");
int myint = atoi(mystring.c_str());
Or something like:
std::string mystring("4");
std::istringstream buffer(mystring);
int myint = 0;
buffer >> myint;
The second option gives you better error management than the first.
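That error management comes straight from the stream state; a small sketch of how a failure would be detected (the input string is made up):

std::string mystring("not a number");
std::istringstream buffer(mystring);
int myint = 0;
if (!(buffer >> myint)) {
    // extraction failed: mystring did not begin with an integer
}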
You can write a more generic string-to-number converter like this:
#include <sstream>
#include <stdexcept>
#include <string>

template <class T>
T strToNum(const std::string &inputString,
           std::ios_base &(*f)(std::ios_base&) = std::dec)
{
    T t;
    std::istringstream stringStream(inputString);
    if ((stringStream >> f >> t).fail())
    {
        throw std::runtime_error("Invalid conversion");
    }
    return t;
}
// Example usage
unsigned long ulongValue = strToNum<unsigned long>(strValue);
int intValue = strToNum<int>(strValue);
int intValueFromHex = strToNum<int>(strHexValue,std::hex);
unsigned long ulOctValue = strToNum<unsigned long>(strOctVal, std::oct);
For conversions, I find it simplest to use boost's lexical_cast (except that it may check the validity of string-to-other-type conversions too rigorously).
It surely isn't very fast (it just uses std::stringstream under the hood), but it is significantly more convenient, and performance is often not needed where you convert values (e.g. to create error output messages and such). (If you do lots of these conversions and need extreme performance, chances are you are doing something wrong and shouldn't be performing any conversions at all.)
Because the old C libraries still work with standard C++ types, with very little adaptation. You can easily turn a const char * into a std::string with a constructor, and change back with std::string::c_str(). In your example, with std::string s, just call atoi(s.c_str()) and you're fine. As long as you can switch back and forth easily, there's no need to add new functionality.
I'm not coming up with C functions that work on arrays and not container classes, except for things like qsort() and bsearch(), and the STL has better ways to do such things. If you had specific examples, I could consider them.
C++ does need to support the old C libraries for compatibility purposes, but the tendency is to provide new techniques where warranted, and provide interfaces for the old functions when there isn't much of an improvement. For example, the Boost lexical_cast is an improvement over such functions as atoi() and strtol(), much as the standard C++ string is an improvement over the C way of doing things. (Sometimes this is subjective. While C++ streams have considerable advantages over the C I/O functions, there's times when I'd rather drop back to the C way of doing things. Some parts of the C++ standard library are excellent, and some parts, well, aren't.)
There are all sorts of ways to parse a number from a string; atoi can easily be used with a std::string via atoi(str.c_str()) if you really want, but atoi has a bad interface because there is no sure way to determine whether an error occurred during parsing.
Here's one slightly more modern C++ way to get an int from a std::string:
std::istringstream tmpstream(str);
if (tmpstream >> intvar)
{
    // ... success! ...
}
The tongue in cheek answer is: Because STL is a half-hearted attempt to show how powerful C++ templates could be. Before they got to each corner of the problem space, time was up.
There are two reasons: Creating an API takes time and effort. Creating a good API takes a lot of time and a huge effort. Creating a great API takes an insane amount of time and an incredible effort. When the STL was created, OO was still pretty new to the C++ crowd. They didn't yet have the ideas for how to make fluent and simple APIs. Today, we think iterators are so 1990, but at the time, people thought "Bloody hell, why would I need that? for (int i=0; i<...) has been good enough for three decades!"
So the STL didn't become the great, fluent API. This isn't all C++'s fault, because you can make good APIs with C++. But it was the first attempt to do that, and it shows. Before the API could mature, it was turned into a standard, and all the ugly shortcomings were set in stone. And on top of this, there was all this legacy code and all the libraries which could already do everything, so the pressure wasn't really there.
To solve your misery, give up on STL and have a look at the successors: Try boost and maybe Qt. Yeah, Qt is a UI library but it also has a pretty good standard library.
Since C++11, you can use std::stoi. It is like atoi but for std::string.
I have some code that I had to write to replace a function that was literally used thousands of times. The problem with the function was that it returned a pointer to a statically allocated buffer, which was ridiculously problematic. I was finally able to prove that intermittent high-load errors were caused by this bad practice.
The function I was replacing has the signatures char * paddandtruncate(char *, int), char * paddandtruncate(float, int), and char * paddandtruncate(int, int). Each overload returned a pointer to a statically allocated buffer which was overwritten on subsequent calls.
I had three constraints:
Code had to be replaceable with no impact on the callers.
Very little time to fix the issue.
Acceptable performance.
I wanted some opinion on the style and possible refactoring ideas.
The system is based upon fixed width fields padded with spaces, and has some architectural issues. These are not addressable since the size of the project is around 1,000,000 lines.
I was at first planning on allowing the data to be changed after creation, but thought that immutable objects offered a more secure solution.
#include <ostream>
#include <sstream>
#include <string>

using namespace std;

// SYSTEM_DECLSPEC is assumed to be a project-specific export macro defined elsewhere.
class SYSTEM_DECLSPEC CoreString
{
private:
    friend ostream & operator<<(ostream &os, CoreString &cs);
    stringstream m_SS;
    float        m_FltData;
    long         m_lngData;
    long         m_Width;
    string       m_strData;
    string       m_FormatedData;
    bool         m_Formated;
    stringstream SS;

public:
    CoreString(const string &InStr, long Width):
        m_Formated(false),
        m_Width(Width),
        m_strData(InStr)
    {
        long OldFlags = SS.flags();
        SS.fill(' ');
        SS.width(Width);
        SS.flags(ios::left);
        SS << InStr;
        m_FormatedData = SS.str();
    }

    CoreString(long longData, long Width):
        m_Formated(false),
        m_Width(Width),
        m_lngData(longData)
    {
        long OldFlags = SS.flags();
        SS.fill('0');
        SS.precision(0);
        SS.width(Width);
        SS.flags(ios::right);
        SS << longData;
        m_FormatedData = SS.str();
    }

    CoreString(float FltData, long width, long lPerprecision):
        m_Formated(false),
        m_Width(width),
        m_FltData(FltData)
    {
        long OldFlags = SS.flags();
        SS.fill('0');
        SS.precision(lPerprecision);
        SS.width(width);
        SS.flags(ios::right);
        SS << FltData;
        m_FormatedData = SS.str();
    }

    CoreString(const string &InStr):
        m_Formated(false),
        m_strData(InStr)
    {
        long OldFlags = SS.flags();
        SS.fill(' ');
        SS.width(32);
        SS.flags(ios::left);
        SS << InStr;
        m_FormatedData = SS.str();
    }

public:
    operator const char *() { return m_FormatedData.c_str(); }
    operator const string& () const { return m_FormatedData; }
    const string& str() const;
};

const string& CoreString::str() const
{
    return m_FormatedData;
}

ostream & operator<<(ostream &os, CoreString &cs)
{
    os << cs.m_FormatedData; // was m_Formated (the bool flag), which looks like a typo
    return os;
}
If you really do mean "no impact on the callers", your choices are very limited. You can't return anything that needs to be freed by the caller.
At the risk of replacing one bad solution with another, the quickest and easiest solution might be this: instead of using a single static buffer, use a pool of them and rotate through them with each call of your function. Make sure the code that chooses a buffer is thread safe.
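A minimal sketch of that idea (buffer count and size are arbitrary; the atomic index is what keeps the buffer selection thread-safe):

#include <atomic>

// Sketch only: a small pool of static buffers handed out round-robin.
// The atomic counter makes the *selection* thread-safe; callers can still
// trample each other if more than NUM_BUFFERS results are held live at once.
static const int NUM_BUFFERS = 32;
static const int BUFFER_SIZE = 64;

char *next_buffer()
{
    static char pool[NUM_BUFFERS][BUFFER_SIZE];
    static std::atomic<unsigned> counter(0);
    return pool[counter++ % NUM_BUFFERS];
}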
It sounds like the system is threaded, right? If it was simply a matter of it not being safe to call one of these functions again while you're still using the previous output, it should behave the same way every time.
Most compilers have a way to mark a variable as "thread-local data" so that it has a different address depending on which thread is accessing it. In gcc it's __thread, in VC++ it's __declspec(thread).
If you need to be able to call these functions multiple times from the same thread without overwriting the results, I don't see any complete solution but to force the caller to free the result. You could use a hybrid approach, where each thread has a fixed number of buffers, so that callers could make up to N calls without overwriting previous results, regardless of what other threads are doing.
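A sketch of that hybrid, written with the C++11 thread_local keyword for brevity (gcc's __thread or MSVC's __declspec(thread) would be used the same way; the function name and N are hypothetical):

// Sketch only: each thread owns a ring of N buffers, so up to N results
// per thread stay valid at once, with no locking and no cross-thread races.
char *padded_result_buffer()
{
    static const int N = 8;
    static const int BUFFER_SIZE = 64;
    thread_local char buffers[N][BUFFER_SIZE];
    thread_local int  next = 0;
    char *buf = buffers[next];
    next = (next + 1) % N;
    return buf;  // caller formats its padded/truncated result into buf
}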
The code you've posted has a one huge problem - if a caller assigns the return value to a const char *, the compiler will make a silent conversion and destroy your temporary CoreString object. Now your pointer will be invalid.
I don't know how the callers are going to be using this, but allocating buffers with new into auto_ptr<>s might work. It may satisfy criterion 1 (I can't tell without seeing the calling code), and could be a pretty fast fix. The big issue is that it uses dynamic memory a lot, and that will slow things down. There are things you can do, using placement new and the like, but that may not be quick to code.
If you can't use dynamic storage, you're limited to non-dynamic storage, and there really isn't much you can do without using a rotating pool of buffers or thread-local buffers or something like that.
The "intermittent high-load errors" are caused by race conditions where one thread tramples on the static buffer before another thread has finished using it, right?
So switch to using an output buffer per thread, using whatever thread-local storage mechanism your platform provides (Windows, I'm thinking).
There's no synchronisation contention, no interference between threads, and based on what you've said about the current implementation rotating buffers, almost certainly the calling code doesn't need to change at all. It can't be relying on the same buffer being used every time, if the current implementation uses multiple buffers.
I probably wouldn't design the API this way from scratch, but it implements your current API without changing it in a significant way, or affecting performance.