Why is there both a System::String and a std::string in C++?
I couldn't find anything about this, except some topics about converting one to the other.
I noticed this when I wanted to put the contents of a textbox into a std::string variable, and had to do some odd converting to get it to work.
Why are there these two different strings when they actually do the same thing (hold a string value)?
std::string is a class from the C++ standard library (a specialization of the std::basic_string class template) that stores and manipulates strings. The data in a std::string is basically a sequence of bytes, i.e. it doesn't have an encoding. std::string supports the most basic set of operations you would expect from a string, such as substring search and replace.
System::String is a class from Microsoft's .NET Framework. It represents text as a sequence of Unicode (UTF-16) characters, and has some more specialized methods like StartsWith, EndsWith, Split, Trim, and so on.
Related
The problem is that I'm processing some UTF8 strings and I would like to design a class or a way to prevent string manipulations.
String manipulation is not desirable for strings of multibyte characters as splitting the string at a random position (which is measured in bytes) may split a character half way.
I have thought about using const std::string&, but the user/developer can still create a substring by calling std::string::substr.
Another way would be to create a wrapper around const std::string& and expose the string only through getters.
Is this even possible?
Another way would be to create a wrapper around const std::string& and expose the string only through getters.
You need a class wrapping a std::string or std::u8string, not a reference to one. The class then owns the string and its contents, basically just using it as storage, and can provide an interface as you see fit to operate on Unicode code points or characters instead of modifying the storage directly.
However, there is nothing in the standard library that will help you implement this. So a better approach would be to use a third party library that already does this for you. Operating on code points in a UTF-8 string is still reasonably simple and you can implement that part yourself, but if you want to operate on characters (in the sense of grapheme clusters or whatever else is suitable) implementation is going to be a project in itself.
I would use a wrapper where your external interface provides access to either code points, or to characters. So, foo.substr(3, 4) (for example) would skip the first 3 code points, and give you the next 4 code points. Alternatively, it would skip the first 3 characters, and give you the next 4 characters.
Either way, that would be independent of the number of bytes used to represent those code points or characters.
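A minimal sketch of such a wrapper, assuming UTF-8 content and operating only on code points (Utf8View and its members are hypothetical names for illustration, not anything standard):

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical Utf8View: owns a UTF-8 std::string and exposes
// substrings measured in code points, never in raw bytes.
class Utf8View {
    std::string data_;

    // A byte starts a code point unless it has the form 10xxxxxx.
    static bool is_lead_byte(unsigned char c) { return (c & 0xC0) != 0x80; }

    // Byte offset of the n-th code point (O(n) scan over the bytes).
    std::size_t byte_offset(std::size_t n) const {
        for (std::size_t i = 0; i < data_.size(); ++i) {
            if (is_lead_byte(static_cast<unsigned char>(data_[i]))) {
                if (n == 0) return i;
                --n;
            }
        }
        if (n == 0) return data_.size();
        throw std::out_of_range("code point index");
    }

public:
    explicit Utf8View(std::string s) : data_(std::move(s)) {}

    // Substring of `count` code points starting at code point `pos`,
    // guaranteed never to split a multi-byte code point in half.
    std::string substr(std::size_t pos, std::size_t count) const {
        std::size_t b = byte_offset(pos);
        std::size_t e = byte_offset(pos + count);
        return data_.substr(b, e - b);
    }
};
```

Note the wrapper never hands out mutable access to the underlying bytes; all slicing goes through the code-point-aware substr.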
Quick aside on terminology for anybody unaccustomed to Unicode terminology: ISO 10646 is basically a long list of code points, each assigned a name and a number from 0 to (about) 2²⁰−1. UTF-8 encodes a code point number in a sequence of 1 to 4 bytes.
A character can consist of a (more or less) arbitrary number of code points. It will consist of a base character (e.g., a letter) followed by some number of combining diacritical marks. For example, à would normally be encoded as an a followed by a "combining grave accent" (U+0300).
The a and the U+0300 are each a code point. When encoded in UTF-8, the a would be encoded in a single byte and the U+0300 would be encoded in two bytes. So, it's one character composed of two code points encoded in three bytes.
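This can be checked directly by counting bytes. The helpers below are purely illustrative; note that U+0300 encodes as the two bytes 0xCC 0x80 in UTF-8 (not three, as three-byte sequences only start at U+0800):

```cpp
#include <cassert>
#include <string>

// Decomposed à: 'a' (U+0061) followed by a combining grave accent
// (U+0300). 'a' is one UTF-8 byte; U+0300 is the two bytes 0xCC 0x80.
// One character, two code points, three bytes.
inline std::string a_grave_decomposed() { return "a\xCC\x80"; }

// The same character also exists precomposed as U+00E0 (UTF-8 bytes
// 0xC3 0xA0): one character, ONE code point, two bytes.
inline std::string a_grave_precomposed() { return "\xC3\xA0"; }
```

The two forms render identically but compare unequal byte-for-byte, which is exactly why Unicode normalization (mentioned further down this thread) exists.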
That's not quite all there is to characters (as opposed to code points) but it's sufficient for quite a few languages (especially, for the typical European languages like Spanish, German, French, and so on).
There are a fair number of other points that become non-trivial though. For example, German has a letter "ß". This is one character, but when you're doing string comparison, it should (at least normally) compare as equal to "ss". I believe there's been a move to change this but at least classically, it hasn't had an upper-case equivalent either, so both comparison and case conversion with it get just a little bit tricky.
And that's fairly mild compared to situations that arise in some of the more "exotic" languages. But it gives a general idea of the fact that yes, if you want to deal intelligently with Unicode strings, you basically have two choices: either have your code use ICU[1] to do most of the real work, or else resign yourself to this being a multi-year project in itself.
[1] In theory, you could use another suitable library, but in this case, I'm not aware of such a thing existing.
I like C; I have a C book ("C: The Complete Reference") and two C++ books. I find these languages fantastic because of their incredible power and performance, but I have had to abandon many of my projects because of these various string types.
I mean, why are there std::string, LPCSTR, System::String, TCHAR[], char s[] = "ss", and char* s?
This causes tremendous headaches, mainly in GUI applications: in the WinAPI, LPCSTR is not directly compatible with char or std::string, and in CLR applications System::String is a lot of headache to convert to std::string, char* s, or even char s[].
Why don't C/C++ have a single string type, like String in Java?
There are not "many types of string" in C++. Canonically there is one class template, std::basic_string, which is basically a container specialized for strings of different character types.
std::string is a convenience typedef onto std::basic_string<char>. There are more such typedefs for different underlying character types.
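These relationships can be verified at compile time; each of the standard convenience typedefs is just std::basic_string instantiated with a different character type:

```cpp
#include <string>
#include <type_traits>

// std::string and friends are all instantiations of one template.
static_assert(std::is_same<std::string,    std::basic_string<char>>::value,     "");
static_assert(std::is_same<std::wstring,   std::basic_string<wchar_t>>::value,  "");
static_assert(std::is_same<std::u16string, std::basic_string<char16_t>>::value, "");
static_assert(std::is_same<std::u32string, std::basic_string<char32_t>>::value, "");
```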
AFAIK, standard C also has only one officially recognized kind of string: the ANSI string, i.e. a null-terminated array of char.
All the others you mention are either equivalents of this (e.g. LPCSTR is a "long pointer to a constant string", i.e. const char*), or non-standard extensions provided by library vendors.
Your question is like asking why there are so many GUI libraries: because there is no standard way to do it (or the standard way is lacking in some respect), and it was a design decision to provide and support one's own equivalent type.
The bottom line is that, at the library or language level, it's a design decision between different trade-offs: simplicity, performance, character support, and so on. In general, storing text is hard.
Well, first we must answer the question: What is a string?
The C standard defines it as a contiguous sequence of characters terminated by and including the first null character.[1]
It also mentions varieties using wchar_t, char16_t, or char32_t instead of char.
It also provides many functions for string-manipulation, and string-literals for notational convenience.
So, a sequence of characters can be a string, a char[] might hold a string, and a char* might point to one.
LPCSTR is a windows typedef for const char* with the added semantics that it should point to a string or be NULL.
TCHAR is one of a number of preprocessor-defines used for transitioning windows code from char to wchar_t. Depending on what TCHAR is, a TCHAR[] might be able to hold a string, or a wide-string.
C++ mixes things up a bit because it adds a data type for handling strings. To reduce ambiguity, "string" on its own is used here only for the abstract concept; you have to rely on context to disambiguate, or be more explicit.
So the C string corresponds to the C++ null-terminated byte string, or NTBS.[2]
Yes, C++ also knows their wide varieties.
And C++ incorporates the C functions and adds some more.
In addition, C++ has std::basic_string<> for storing all kinds of counted strings, and some convenience-typedefs like std::string.
And now we get to yet a third language, namely C++/CLI.
It incorporates everything said above about C++, and adds the CLI type System::String into the mix.
System::String is an immutable UTF-16 counted-string.
Now the question of why C++ does not define one single concrete string type can be answered:
There are different types of string in C++ for interoperability, history, efficiency and convenience. Always use the right tool for the job.
Java and .Net do the same with byte-arrays, char-arrays, string-builders and the like.
Reference 1: C11 final draft, definition of string:
7. Library
7.1 Introduction
7.1.1 Definitions of terms
1 A string is a contiguous sequence of characters terminated by and including the first null character. The term multibyte string is sometimes used instead to emphasize special processing given to multibyte characters contained in the string or to avoid confusion with a wide string. A pointer to a string is a pointer to its initial (lowest addressed) character. The length of a string is the number of bytes preceding the null character and the value of a string is the sequence of the values of the contained characters, in order.
Reference 2: C++1z draft n4659 NTBS:
20.4.2.1.5.1 Byte strings [byte.strings]
1 A null-terminated byte string, or NTBS, is a character sequence whose highest-addressed element with defined content has the value zero (the terminating null character); no other element in the sequence has the value zero.
2 The length of an NTBS is the number of elements that precede the terminating null character. An empty ntbs has a length of zero.
3 The value of an NTBS is the sequence of values of the elements up to and including the terminating null character.
4 A static NTBS is an NTBS with static storage duration.
Each string type has its own purpose; std::string is the standard-library type and the most common.
As you say, C++ has power and performance, and these string types allow more flexibility. char[] and char* you can use for more generic, lower-level handling of strings.
In my project, I need to convert a string that contains a superscript, m², to the string m2.
My project accepts unit of measure which includes meter square (m²) or meter cube (m³). And I need to convert the superscripts to a normal integer or string in order to further process the input data.
However, at the moment I am unable to find anything in C++ that does this for me.
The application is in C++ and we are using CComBSTR to store the string.
The ideal output would be m2 for m² and m3 for m³ and so on...
Any suggestions?
A CComBSTR is just a wrapper for a BSTR. That in turn is a WCHAR*, which maps to C++ type wchar_t*. Since you're on Windows, you have to know that WCHAR is UTF-16.
That means you need to look for wchar_t(0x00B2) and wchar_t(0x00B3). std::find can do that, just pass it the begin and end of your BSTR.
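A sketch of this, assuming the BSTR's characters have been copied into a std::wstring first (that copy step is not shown). It goes one step beyond std::find and uses std::replace, since here one code unit is swapped for one code unit; flatten_superscripts is a made-up name:

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Replace superscript two/three (U+00B2 / U+00B3) in a wide string
// with the plain digits '2' and '3'.
std::wstring flatten_superscripts(std::wstring s) {
    std::replace(s.begin(), s.end(), wchar_t(0x00B2), L'2');
    std::replace(s.begin(), s.end(), wchar_t(0x00B3), L'3');
    return s;
}
```

This works because U+00B2 and U+00B3 are single UTF-16 code units, so the replacement never changes the string's length.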
I suspect the key point is the character encoding. A superscript 2 is not an ASCII char. Superscripts have dedicated Unicode chars (see http://en.wikipedia.org/wiki/Superscripts_and_Subscripts; most are in the range 0x207x, except superscript 2 and 3), and they are typically encoded with 16 bits. If you have a std::string, then the characters are probably encoded in UTF-8, which means more than one char per character (see http://en.wikipedia.org/wiki/UTF-8).
There is a lot to read, but in the end it's quite easy: you just need to search for a substring (the UTF-8 encoding of superscript 2 and 3).
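For example, a minimal byte-level sketch (the helper names below are made up): U+00B2 and U+00B3 encode in UTF-8 as the byte pairs 0xC2 0xB2 and 0xC2 0xB3, so a plain substring replace is enough:

```cpp
#include <cassert>
#include <string>

// Replace every occurrence of `from` with `to` in `s`.
void replace_all(std::string& s, const std::string& from, const std::string& to) {
    for (std::size_t pos = 0; (pos = s.find(from, pos)) != std::string::npos; pos += to.size())
        s.replace(pos, from.size(), to);
}

// "\xC2\xB2" and "\xC2\xB3" are the UTF-8 bytes of U+00B2 and U+00B3.
std::string flatten_superscripts_utf8(std::string s) {
    replace_all(s, "\xC2\xB2", "2");
    replace_all(s, "\xC2\xB3", "3");
    return s;
}
```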
This is my first attempt at dealing with multiple languages in a program. I would really appreciate if someone could provide me with some study material and how to approach this type of issue.
The question is about representing a string which contains multiple languages. For example, think of a string that has "Hello" in many languages, all comma-separated. What I want to do is separate these words. So my questions are:
Can I use std::string for this or should I use std::wstring?
If I want to tokenize each of the words in the string and put them into a char*, should I use wchar_t? But some encodings, such as the UTF encodings, can have characters bigger than what wchar_t can support.
Overall, what is the 'accepted' way of handling this type of case?
Thank you.
Can I use std::string for this or should I use std::wstring?
Both can be used. If you use std::string, the encoding should be UTF-8 so as to avoid null-bytes which you'd get if you were to use UTF-16, UCS-2 etc. If you use std::wstring, you can also use encodings that require larger numbers to represent the individual characters, i.e. UCS-2 and UCS-4 will typically be fine, but strictly speaking this is implementation-dependent. In C++11, there is also std::u16string (good for UTF-16 and UCS-2) and std::u32string (good for UCS-4).
So, which of these types to use depends on which encoding you prefer, not on the number or type of languages you want to represent.
As a rule of thumb, UTF-8 is great for storage of large texts, while UCS-4 is best if memory footprint does not matter so much, but you want character-level iterations and position-arithmetic to be convenient and fast. (Example: Skipping n characters in an UTF-8 string is an O(n) operation, while it is an O(1) operation in UCS-4.)
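The O(n) skip in UTF-8 can be sketched as follows (an illustrative helper, not a library function): each code point is one lead byte plus zero or more continuation bytes of the form 10xxxxxx, so skipping means walking the bytes and counting lead bytes. In UCS-4 (std::u32string) the same skip is plain index arithmetic, hence O(1):

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Return the byte offset reached after skipping n code points in a
// UTF-8 string: O(n) in the number of bytes scanned.
std::size_t skip_code_points_utf8(const std::string& s, std::size_t n) {
    std::size_t i = 0;
    while (i < s.size() && n > 0) {
        ++i; // consume the lead byte
        while (i < s.size() && (static_cast<unsigned char>(s[i]) & 0xC0) == 0x80)
            ++i; // consume continuation bytes (10xxxxxx)
        --n;
    }
    return i;
}
```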
If I want to tokenize each of the words in the string and put them into a char*, should I use wchar_t? But some encodings, such as the UTF encodings, can have characters bigger than what wchar_t can support.
I would use the same data type for the words as I would use for the text itself. I.e. words of a std::string text should also be std::string, and words from a std::wstring should be std::wstring.
(If there is really a good reason to switch from a string datatype to a character-pointer datatype, then of course char* is right for std::string and wchar_t* is right for std::wstring. Similarly for the C++11 types, there are char16_t* and char32_t*.)
Overall, what is the 'accepted' way of handling this type of case?
The first question you need to answer to yourself is which encoding you want to use for storage and processing. In highly international settings, only Unicode encodings are truly eligible, but there are still more than one to choose from: UTF-8, UCS-2 and UCS-4 are the most common ones. As described above, which one you choose has implications for memory footprint and processing speed, so think carefully about what types of operations you need to perform. It may be required to convert from one encoding to another at certain points in your program for optimal space and time behavior. Once you know which encoding you want to use in each part of the program, choose the data type accordingly.
Once encoding and data types have been decided, you might also need to look into Unicode normalization. In many languages, the same character (or character/diacritics combination) can be represented by more than one sequence of Unicode code points (esp. when combining characters are used). To deal with these cases properly, you may need to apply Unicode normalizations (such as NFKC) to the strings. Note that there is no built-in support for this in the C++ Standard Library.
The following code converts a string to a wstring which I need to call the stemming method I am using. However, the map in which I am storing the stemmed words is full of strings. I looked around at some of the solutions on SO and many of the conversions from wstring to string are circa a dozen lines of code. Is there any way to convert quickly (preferably inline or similar) from a string to a wstring and back?
string ANSIWord("documentation");
wstring wWord;
wchar_t* UnicodeTextBuffer = new wchar_t[ANSIWord.length() + 1];
wmemset(UnicodeTextBuffer, 0, ANSIWord.length() + 1);
mbstowcs(UnicodeTextBuffer, ANSIWord.c_str(), ANSIWord.length());
wWord = UnicodeTextBuffer;
delete[] UnicodeTextBuffer; // the original code leaked this buffer
Otherwise, I will look into converting my map and other methods to use wstring.
EDIT:
Epiphany: I decided to place the entire conversion, method call, and back-conversion in a method of its own, thereby reducing it to the desired one line. However, I would still like to know out of curiosity / for future reference.
Why don't you write a function that includes the circa 12 lines of code that do the conversion, and then call that function?
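A minimal sketch of such a helper pair, assuming the narrow string is in the current locale's multibyte encoding (which is what mbstowcs/wcstombs operate on; the function names widen/narrow are made up):

```cpp
#include <cassert>
#include <cstdlib>
#include <string>
#include <vector>

// One-line callers; the buffer management lives inside the helpers.
std::wstring widen(const std::string& s) {
    std::vector<wchar_t> buf(s.size() + 1, L'\0'); // +1 for the terminator
    std::mbstowcs(buf.data(), s.c_str(), s.size());
    return std::wstring(buf.data());
}

std::string narrow(const std::wstring& s) {
    std::vector<char> buf(s.size() * 4 + 1, '\0'); // worst-case multibyte size
    std::wcstombs(buf.data(), s.c_str(), buf.size() - 1);
    return std::string(buf.data());
}
```

With these in place the original snippet collapses to wWord = widen(ANSIWord); and the reverse direction is just as short.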