Missing std::string functionality - c++

Does anyone know why the C++ standard library's std::string class (or more generally the std::basic_string class template) lacks ordinary character string functionality such as uppercasing, substring replacement and trimming, compared to e.g. the QString class from Qt, or Python strings?

I'm going to take a stab at this...
The template std::basic_string was originally written and included in the STL, which represents the "abstract" parts of the Standard Library as we know it (containers, iterators, algorithms, allocators, etc.). This also included std::string.
Notice how there is absolutely no encoding, internationalization or locale dependent functionality in the STL. It wasn't a design goal.
Now my take on the happenings of previous generations: When C++ was standardized, there was need for a comprehensive Standard Library. The STL was a very good fit for this, and was taken over almost verbatim. Only later was stuff like <iostream> and <locale> added. The clunky and very incoherent interface differences between streams and strings only prove this "let's throw it all together" attitude.
As with many std facilities, interoperation between the components wasn't optimized. On top of that, the simplicity of a small C++ function wrapping existing C functionality (like toupper) has been used as a reason not to include this in the Standard Library.
By a next revision of the Standard (and thus the Library it comprises), backward compatibility prevented any useful and necessary changes (injecting locale into std::string functionality) from being added.
Note that this conjecture does not at all explain why for example a std::trim taking a string and locale object wasn't added. It does kind of attempt to explain the background process involved.
Now that all has been said, I wholly agree the C++ Standard Library is clunky and incomplete in its general usefulness.
UPDATE: I've been informed that my timeline is reversed: the Standard Library (and iostream) existed before the STL was added. The point above is still valid though: the STL was copy-pasted, with little to no integration (simple example: the until recently missing std::basic_ifstream<T>::open(const std::string&), which will be deprecated in the next iteration due to std::filesystem).

Can't answer about all missing features in the most general sense, but…
The two features mentioned, trimming and uppercasing, are locale-dependent. They aren't only functions of characters, but also the encoding and language being used.
std::string doesn't really handle that. Although in practice, everyone uses Unicode with whitespace as defined by ASCII, that's not general enough for the kind of standardization process that defines C++.
Such operations are obtained by streams (for example, read out of a std::stringstream to strip excessive space) and locale objects (for example, accessed through std::tolower).
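To make the stream and locale route concrete, here is a minimal sketch. The helper names to_upper and squeeze_spaces are made up for illustration; the only library pieces relied on are the std::toupper overload from <locale> and word extraction from std::istringstream. It works one char at a time, so it is only right for single-byte encodings.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical helper: uppercase each char via the given locale.
// Works per char, so it is NOT correct for multi-byte encodings such as UTF-8.
std::string to_upper(std::string s, const std::locale& loc = std::locale::classic()) {
    for (char& c : s) {
        c = std::toupper(c, loc);  // the <locale> overload, not the <cctype> one
    }
    return s;
}

// Hypothetical helper: strip excess whitespace by reading words out of a stream.
// operator>> skips whitespace as defined by the stream's locale.
std::string squeeze_spaces(const std::string& text) {
    std::istringstream in(text);
    std::string word;
    std::string out;
    while (in >> word) {
        if (!out.empty()) out += ' ';
        out += word;
    }
    return out;
}
```

Passing a different std::locale to to_upper changes which characters are mapped, which is the locale dependence described above.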

Poor functionality? std::string is considered to be one of the bloated components in the Standard Library. You have an entire set of algorithms that operate on std::strings: all the standard algorithms. Don't restrict yourself to member functions; there is much more in an interface than that...
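As a rough illustration of that point, trimming and uppercasing can be written as small free functions on top of std::find_if and std::transform. The names trim and upper_copy are hypothetical, and the sketch assumes ASCII-style whitespace and the default C locale.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper: strip leading and trailing whitespace with std::find_if.
std::string trim(const std::string& s) {
    auto not_space = [](unsigned char c) { return !std::isspace(c); };
    auto first = std::find_if(s.begin(), s.end(), not_space);
    // Searching from the back yields a reverse iterator; base() points one past
    // the last non-space character.
    auto last = std::find_if(s.rbegin(), s.rend(), not_space).base();
    return first < last ? std::string(first, last) : std::string();
}

// Hypothetical helper: uppercase a copy with std::transform.
std::string upper_copy(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::toupper(c)); });
    return s;
}
```

Substring replacement is already covered by the find and replace members, so it needs no algorithm at all.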

Related

Have I used STL? Or not?

I'm confused about some of the documentation at cplusplus.com. Have I used the standard template library with either of these two lines of code?
for (std::string::const_iterator it = str.begin(); l < data.size; ++it, ++l)
and
tile_tag(): distance(std::numeric_limits<int>::max()), dir(unknown_direction), used(false)
See:
http://www.cplusplus.com/reference/string/string/begin/
and
http://www.cplusplus.com/reference/std/limits/numeric_limits/
If I have used STL, how can I modify them to do what they do without STL?
Thanks!
EDIT: I guess this is OUR definition of STL: http://www.sgi.com/tech/stl/stl_index_cat.html
I don't see const_iterator...but I do see max. But is that the max I used?
Yes, you used it, since std::string alone is "part of the STL", but, more importantly, you are using the iterators, which are a distinctive trait of the STL.
Now, for the mandatory purists part: the STL was a library published by SGI and developed mainly by Alexander Stepanov. What I think you are referring to is "the part of the SGI STL which was later included into the standard with some minor modifications", i.e. all the generic containers/algorithms/iterators stuff.
The usage of the term "STL" for this stuff is tolerated; what is plain wrong, instead, is calling "STL" the whole C++ standard library (which includes a lot of other stuff, e.g. iostreams, locale, the library inherited from C, ...).
If I have used STL, how can I modify them to do what they do without STL?
Why would you want to do that? Anyhow, you have several options, ranging from rewriting such classes and algorithms from scratch to using simpler data structures (e.g. C strings) and reimplementing only the algorithms you need. Either way, both imply reinventing the wheel (and probably reinventing a square wheel), so I'd advise you against doing this unless you have a compelling reason.
EDIT: I guess this is OUR definition of STL: http://www.sgi.com/tech/stl/stl_index_cat.html
Are you sure? Almost no one today uses the SGI STL; generally you use the (more or less) equivalent portion of your C++ standard library (which, by the way, is what you are doing in your code, since you are getting everything from the std namespace).
I don't see const_iterator...
const_iterator is a typedef member of basic_string<char>, you find it in its page.
but I do see max. But is that the max I used?
No, that's not it: your max is not a global function, but a member of the std::numeric_limits class template. That class does not come from the STL.
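To show the difference between the two things called max, here is a small sketch. The wrapper names larger and largest_int are made up for illustration.

```cpp
#include <algorithm>
#include <limits>

// std::max from <algorithm> descends from the STL's max: it picks the
// larger of two values.
int larger(int a, int b) { return std::max(a, b); }

// std::numeric_limits<int>::max from <limits> is a static member function
// that takes no arguments and reports the largest representable int.
int largest_int() { return std::numeric_limits<int>::max(); }
```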
Does your code include the namespace std? Then yes, you have used the Standard library. How can you modify your code to not use the Standard library? Why on earth would you want to? Implementations must provide it - that's why it's called Standard. But if you're truly insane and feel the need to not use it, then ... look up the documentation on what those functions and classes do and write your own replacement.
You have used the STL std::string and std::numeric_limits. I don't know why you would want to avoid using STL (perhaps homework?). You could use old style C strings with a char* and the macro INT_MAX from the C limits header (<climits>).
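For what it's worth, a C-style version might look something like the sketch below. The macro in <climits> is spelled INT_MAX. Both count_char and tile_tag_c are hypothetical stand-ins, since the question only shows fragments of the real code.

```cpp
#include <climits>
#include <cstddef>
#include <cstring>

// Hypothetical helper: walk a NUL-terminated C string by index instead of
// using std::string::const_iterator.
std::size_t count_char(const char* str, char ch) {
    std::size_t n = 0;
    for (std::size_t i = 0, len = std::strlen(str); i < len; ++i) {
        if (str[i] == ch) ++n;
    }
    return n;
}

// Hypothetical stand-in for the question's tile_tag, initialised with
// INT_MAX instead of std::numeric_limits<int>::max().
struct tile_tag_c {
    int distance;
    bool used;
    tile_tag_c() : distance(INT_MAX), used(false) {}
};
```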
What is this "STL" you speak of?
There was this thing that SGI developed in the mid 90s. Yeah, that's 15+ years ago. Older than the previous C++ standard. Back in the days when compilers like Turbo C++ and Borland C++ were best-of-breed; when people used the phrase "C with classes" unironically and without derision; when the idea of compiling C++ primarily by first cross-compiling to C was finally being abandoned; when exceptions were (at least in this context!) a new idea.
They called it "STL" for "standard template library", and they were doing a bunch of neat things with templates and developing new idioms. For its time, it was pretty revolutionary. So much so, in fact, that almost all of its stuff got officially accepted into the standard library with the 1998 standardization of the language. (Similarly, lots of Boost stuff - although nowhere near all; Boost is huge - has been added to the std namespace in the new standard.)
And that's where it died.
Unless you are specifically referring to a few oddball things like std::iota or lexicographical_compare_3way, it is not appropriate to speak of "the STL", and it hasn't been appropriate to speak of "the STL" for twelve years. That's an eternity in computer science terms. (But, hell, I still seem to keep hearing about new installations of Visual C++ 6.0, and some people using even older compilers...)
You're using the standard library of the C++ language. I guess you could write "SC++L" if you like, but that's a bit awkward, isn't it? I mean, no sane Java programmer would ever refer to the SJL, nor any Pythonista to the SPL. Why would you need jargon to talk about using the standard library of the language you are using?
This is what you're supposed to do by default, after all. Could you imagine writing code in C without malloc, strlen et al.? Avoiding std::string in C++ would be just as silly. Sure, maybe you want to use C++ as a systems programming language. But really, the kinds of applications where you absolutely have to squeeze out every cycle (keep in mind that using std::string is often more efficient than naive string algorithms, and difficult to beat with hard work, because of the simple algorithmic benefits of caching the string length count and possibly over-allocating memory or keeping a cache for short strings) are generally not ones that involve string processing.
(Take it from me - I have several years' experience as a professional mobile game developer, dating to when phones were much, much less powerful: the standard approach to string processing is to redesign everything to need as little of it as possible. And even when you do have to assemble - and line-wrap - text, it was usually not worth optimizing anyway because the time the code spends on that is dwarfed by the time it takes to actually blit the characters to screen.)
"STL" can mean a number of things.
Formally, the name "STL" was only ever used for the library developed by Stepanov, and which is now hosted by SGI. The library consisted, roughly speaking, of containers (vector, list, deque, map, set and all those), iterators and algorithms (and a few other bits and pieces, such as allocators).
This library was adopted and adapted into the C++ standard library when the language was standardized in 1998. In this process, a number of changes were made: corners were cut and components were left out, in order to keep the library small and to make it easy to swallow for the committee members. A number of changes were made in order for it to better interoperate with the rest of the standard library, or with the language itself, and certain parts of the existing standard library were modified to become a bit more STL-like.
Today, it's really not very useful to discuss the original STL. So when most C++ developers say "STL", they mean "the part of the C++ standard library that was created when the STL was adopted into the standard". In other words, all the container classes, the iterators and the algorithms in the standard library, are commonly called "STL", simply because they look a lot like the "original" STL, and they're a lot more relevant today than the original library.
But this also means that the term has become rather fluffy and vague. The std::string class was not a part of the original STL. But it was modified to behave "STL-like". For example, it was given iterators, so that it can be used with the STL algorithms. Should it be considered a part of the STL? Maybe, maybe not.
And to further confuse matters, the STL became fashionable after it was adopted into the language. Suddenly, people started talking about "STL-style". The Boost libraries, for example, are generally written in a very STL-like way, and some Boost libs have been adopted into the standard library in C++11. So should they now be considered part of the STL?
Probably not, but you could make a pretty good case that they should: that they follow the spirit of the STL, and if Stepanov had thought of them when he wrote the original STL library, maybe he'd have included them.
Except for a few purists, who think that everyone is better off if the term "STL" is used exclusively to refer to something that no one ever actually needs to refer to (the SGI library), and if we have no way to refer to the standard library subset that it spawned, pretty much any C++ developer can agree that if you've used the standard library containers OR the standard library iterators OR the standard library algorithms, then you have used the STL.
In your code I see an iterator. So yes, you've used at least a small corner of the STL.
But why does it matter?

Do general purpose libraries contain any code which cannot be written by normal users?

Do libraries such as Boost, STL, ACE (which often make inclusions in namespace std) contain any special kind of coding techniques which cannot be coded or used by a usual programmer?
In a broader sense, do they leverage any compiler or implementation specific utilities which are not exposed to general programmers?
These are all written in the same code available to everyone. However, the code is often hard to read (at least for me) because they go to great lengths to ensure the generality of the libraries. Here is the sgi implementation of the STL. Browse through it and see for yourself.
Since the standard library is part of the C++ specification, your question is not well-founded.
For example, the implementation of std::fstream (or at least, std::filebuf) must use OS-dependent interfaces. Do those count as "implementation specific utilities"?
The bottom line is that the spec does not separate out the standard library from the rest of the language. It is all just part of the language, and its facilities are available to "usual programmers".
Boost is mostly written in standard C++, but they do take advantage of platform-specific features when that can yield performance improvements, and they occasionally need compiler-dependent extensions for features. The documentation will generally mention when a feature is not available on all platforms.
I do not know about ACE.
The STL (and the others) is written in 'pure C++'. See here for a very similar question.
C, on the other hand, has many system calls (unix/Windows/etc) in its standard library files to make everything work.
The C++0x STL also uses some compiler magic to make some new language features work.

why don't STL ifstream and ofstream classes take std::string as filenames?

This is a complaint about STL. Why do they take filename arguments as (char *) and not as std::string? This seems to make no sense.
There are two other questions on this topic:
How to open unicode filenames with STL
Windows Codepage interactions with C++
The issue is that I have a lot of code that looks like this:
std::ofstream f(fname.c_str());
When I would like it to look like this:
std::ofstream f(fname);
An additional issue mentioned in the above posts is UTF-16 vs. UTF-8. (UTF-16 might contain NULs, which would break the POSIX API.) But that's not really an issue, because the implementation could convert UTF-16 to UTF-8 before calling open().
But seriously, this makes no sense. Are there any plans to upgrade STL?
why don’t ifstream and ofstream classes take std::string as filenames?
I've seen a few sensible arguments for that (namely that this would create a dependency of the streams on strings), but frankly I believe the actual reason is that the streams are much older than the standard library and its strings.
Are there any plans to upgrade STL?
It's called C++11 and will be the new version of the standard. I don't know whether file streams changed. You could look at the final draft and find out for yourself.
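For reference, C++11 did add std::string overloads to the file stream constructors and to open(), so the form the question asks for compiles with a C++11 (or later) compiler. A minimal sketch, with write_line and read_first_line as made-up helper names:

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: write one line to the file named by a std::string.
// Returns false if the stream ended up in a failed state.
bool write_line(const std::string& fname, const std::string& line) {
    std::ofstream f(fname);             // C++11 and later: std::string overload
    // std::ofstream f(fname.c_str());  // C++03 spelling, still valid
    f << line << '\n';
    f.flush();
    return static_cast<bool>(f);
}

// Hypothetical helper: read the first line back.
std::string read_first_line(const std::string& fname) {
    std::ifstream f(fname);
    std::string line;
    std::getline(f, line);
    return line;
}
```

This does nothing for the Unicode filename problem on Windows; the overload only takes a narrow std::string.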
Note that STL is the name for a library of containers, algorithms, and iterators, incorporated into the standard library. Also part of the standard library are strings, streams and others.
In particular, streams are not part of the STL. They are siblings.
Historical reasons. The iostream library was developed separately from the string stuff. But why this wasn't integrated in the C++ Standard is anyone's guess. I've seen several questions on Usenet way back when (including the dependency theory), but never a really satisfactory explanation.
As I recall, it actually is (at least sort of) the situation with string vs. wstring. I can't find the post right now, but I'm reasonably certain I remember a Usenet post by Andrew Koenig saying that members of one of the national committees (Japan is what I seem to recall, but could easily be wrong) brought up the question of how they could deal with file names in various languages (especially since relatively few OSes at the time provided much support for that). Even though it had started out pretty simple, it quickly became apparent that the only way to avoid it turning into a huge mess was to cease all discussion of the idea in general.

Why is there a different string class in every C++ platform out there?

While I like programming in C++, I hate the idea of:
std::basic_string vs QString vs wxString vs .............
Doesn't the standard string class satisfy the needs for these frameworks? I mean what is wrong with the standard string class?!
Just to emphasize, that below is the important question:
Do you learn "the" string class of the framework in every framework you are going to work with? would you instead stick to the standard string class by trying to adapt it everywhere?
Thanks...
The reason for multiple string classes is that the C++ standard was finalized fairly late (in 1998); it then took some time until all systems actually provided a correct C++ library. By that time, all these competing string classes were already written.
In addition, in some cases, people want to inherit from a single base class, which std::string wouldn't do.
IMO, std::string isn't old enough to be widespread (Qt and wxWidgets are older than the STL, or at least older than widely available stable and working STLs). Also, std::string is sadly not the best string class there is for everyone, and other frameworks have other needs.
Note! The paragraph below is slightly incorrect, but kept to make sense of comments.
For instance, the C++ STL's string class is very resource constrained, whereas the Qt string class offers lots of goodies that a committee would never agree on, especially as some want it to be easily implementable on embedded systems and the like.
One of the main problems with std::string is the lack of Unicode support. Even with std::wstring you only get a container for Unicode code points, but would still have to implement the Unicode-aware functionality.
Also, QString for example is "implicitly shared". This makes it very easy to pass strings around your code in an efficient way. They are actually copied only on write.
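The idea behind implicit sharing can be sketched in a few lines with std::shared_ptr. To be clear, SharedString is a hypothetical toy class, not how QString is actually implemented, and this version makes no attempt at thread safety.

```cpp
#include <memory>
#include <string>
#include <utility>

// Hypothetical toy class: copies share one buffer until somebody writes.
class SharedString {
public:
    explicit SharedString(std::string text)
        : data_(std::make_shared<std::string>(std::move(text))) {}

    // The compiler-generated copy constructor only copies the pointer,
    // so copying is cheap regardless of the string length.

    const std::string& str() const { return *data_; }

    // Any write first detaches from the other owners (copy on write).
    void append(const std::string& more) {
        detach();
        data_->append(more);
    }

    bool shares_buffer_with(const SharedString& other) const {
        return data_ == other.data_;
    }

private:
    void detach() {
        if (data_.use_count() > 1) {
            data_ = std::make_shared<std::string>(*data_);  // deep copy now
        }
    }

    std::shared_ptr<std::string> data_;
};
```

Passing a SharedString by value costs one reference count bump; the deep copy only happens on the first append to a shared instance.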
One reasonable reason (versus unreasonable reasons like "I don't want to learn the Standard Library") is that some libraries wish to retain control over the binary layout, in order to achieve certain kinds of interoperability (such as binary compatibility across versions). An example of this is _bstr_t in the VC++ libraries; it is important for COM purposes that a _bstr_t is represented as a BSTR (since that is what COM needs), so a wrapper built on top of a BSTR is valuable to COM developers.
IIRC Bjarne Stroustrup deliberately omitted a String class from C++ as he considered it a "rite of passage". All those who learnt C++ were expected to write their own. Certainly at the start of C++ there were no standard libraries and I remember versions from AT&T (which was a preprocessor for C) and the NIH Classes from a very pioneering group at the National Institutes of Health in the US (which also included early collection classes).
std::string is great... Oh, except that it doesn't have a "Format()" call... And, it doesn't have Split() or Join()... Actually, it doesn't do a lot of things that users of strings in those "inferior" scripting languages get to take for granted...
If C++ had the ability to ADD to existing classes (like Objective-C or Ruby) then you probably wouldn't see this...
Also, consider that C++ generally does a better job (than things like Java) at letting you create objects that behave like real native types...
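Since members can't be added to std::string from the outside, the usual workaround is free functions. Here is a rough sketch of the missing Split and Join; the names split and join and their exact behaviour (empty fields are kept) are my own choices, not anything the standard defines.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical free function: split on a single delimiter character.
// Empty fields are kept, so "a,,b" yields three parts.
std::vector<std::string> split(const std::string& s, char delim) {
    std::vector<std::string> parts;
    std::string::size_type start = 0;
    for (;;) {
        std::string::size_type end = s.find(delim, start);
        if (end == std::string::npos) {
            parts.push_back(s.substr(start));
            break;
        }
        parts.push_back(s.substr(start, end - start));
        start = end + 1;
    }
    return parts;
}

// Hypothetical free function: join the parts with a separator between them.
std::string join(const std::vector<std::string>& parts, const std::string& sep) {
    std::string out;
    for (std::size_t i = 0; i < parts.size(); ++i) {
        if (i != 0) out += sep;
        out += parts[i];
    }
    return out;
}
```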
One of the tenets of C++ is "You don't pay for what you don't need." This means there does not need to be a one-size-fits-all string class that every C++ programmer MUST know and (more importantly) must USE. Maybe your project requires thread-safe strings. You can roll your own class. And you always have the option of using the existing std::string.
It just so happens that in most cases std::string is good enough. But when it isn't, aren't you glad you aren't locked into it. Try to roll your own String class in Java and see how long it takes until you are pulling your hair out.
As for your second point, if you are going to fight against a library you've added to your project, why did you add the library to your project in the first place? Part of the decision to use wxWidgets or Qt is the acknowledgment that you must embrace its string class in your project (or at least a sizable portion of that project). Just like the decision to use a "C" library means putting up with char* buffers and size parameters on all the functions.
So, yes, learn the alternate string class. If you are using a library (and wish to become proficient with it) you can't decide to ignore part of the library just because "it's another string class". That makes no sense.

Cross-Platform Generic Text Processing in C/C++

What's the current best practice for handling generic text in a platform independent way?
For example, on Windows there are the "A" and "W" versions of APIs. Down at the C layer we have the "_tcs" functions (like _tcscpy) which map to either "wcscpy" or "strcpy". And in the STL I've frequently used something like:
typedef std::basic_string<TCHAR> tstring;
What issues if any arise from these sorts of patterns on other systems?
There is no support for a generic (variable-width) character like TCHAR in standard C++. C++ does have wchar_t, but the encoding isn't guaranteed. C++1x will much improve things once we have char16_t and char32_t as well as UTF-{8,16,32} literals.
I personally am not a big fan of generic characters because they lead to some nasty problems (like conversion) and, what's more, if you are using a type (like TCHAR) that might ever have a maximum width of 8, you might as well code with char. If you really need that backwards-compatibility, just use UTF-8; it is specifically designed to be a strict superset of ASCII. You may have to use conversion APIs (especially on Windows, which for some bizarre reason is UTF-16), but at least it'll be consistent.
EDIT: To actually answer the original question, other platforms typically have no such construct. You will have to define your TCHAR on that platform, or else use a library that provides one (but as you should no doubt be able to guess, I'm not a big fan of that concept in libraries either).
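If you do go the "define your own TCHAR" route, a minimal sketch could look like this. It assumes that everything outside Windows uses plain char (typically UTF-8); unit_count is a made-up helper, and the fallback _T macro only mirrors the Windows name for illustration.

```cpp
#include <cstddef>
#include <string>

#ifdef _WIN32
  #include <tchar.h>   // provides TCHAR and _T(), switched by _UNICODE
#else
  typedef char TCHAR;  // assumption: narrow (UTF-8) text everywhere else
  #define _T(x) x      // mirrors the Windows macro; the name is reserved-style
#endif

typedef std::basic_string<TCHAR> tstring;

// Hypothetical helper written against tstring. Note that size() counts
// code units (char or wchar_t), not user-visible characters.
std::size_t unit_count(const tstring& s) { return s.size(); }
```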
One thing to be careful of is to make sure for all static libraries that you have, and modules that use these static libraries, that you use the same char format. Because otherwise your code will compile, but not link properly.
I typically create my own t types based on the STL types: tstring, tstringstream, and even down to Boost types like tpath_t.
Unicode character set + the encoding that makes the most sense for your data. I typically use UTF-8 because it's convenient with traditional C / C++ functions and the data I deal with doesn't cause too much bloat.
Some APIs (Windows) and cross language tools (Java) use UTF-16 so that might be a consideration.
One practice I wish we had been better at is to leave text as an array of bytes for doing low tech operations like copying, simple comparison, simple searching, etc. When you need the richer more character aware operations you can convert to some super string (ICU strings are nice -- but heavy) and define the layers / entry points that need to do this as opposed to naively doing it everywhere. The needless conversions kill our performance -- especially when combined with an XML DOM library which also uses the "super" strings.