I see "CString" in MFC, and "QString" in QT.
what is the difference among string, CString, QString?
Why do not use "string" directly?
They're different variations on the idea of a string type.
std::string is the one from the ISO standard and probably preferred in situations where you want portability. It is required to be provided by all implementations claiming to conform with the standard.
CString is, as you say, from MFC (documented here) and will generally only work in that environment. If you're programming exclusively for Windows, you can probably use it. It may have extra features not provided by std::string.
Similarly, QString is the Qt variation, documented here, and is meant to represent strings in programs using Qt. Like CString, it's more tightly bound to its environment, so it may offer efficiencies over std::string.
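If you end up mixing them, the usual approach is to convert at the boundaries. A rough illustration follows (my own sketch, assuming an MFC Unicode build and Qt 5; the exact behaviour depends on your project settings and on what encoding your std::string holds):

// Illustrative conversions between the three types (sketch only).
#include <string>
#include <atlstr.h>    // CString (ATL/MFC)
#include <atlconv.h>   // CT2A conversion helper
#include <QString>     // QString (Qt)

void convert_examples()
{
    std::string s = "hello";

    // std::string (treated as UTF-8 by Qt 5) -> QString and back
    QString q = QString::fromStdString(s);
    std::string s2 = q.toStdString();

    // std::string -> CString (CString is CStringW in a Unicode build,
    // so the narrow text is converted using the current code page)
    CString c(s.c_str());

    // CString -> std::string via the CT2A conversion helper
    std::string s3 = std::string(CT2A(c));
}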
Looking around (doing your research for you basically) I found some stuff.
String: does NOT support character encoding; no special functionality compared with the others [1]
QString: plenty of useful functions, some better compatibility, supports character encoding, defaults to UTF-16 [1]
CString: plenty of useful functions, some better compatibility, and good for Unicode and ASCII compilation [2], [3]
There are also some more things that are not mentioned here; the sources are:
[1] http://blog.rburchell.com/2010/08/strings-and-qt.html
[2] http://forums.codeguru.com/showthread.php?319932-CString-vs-std-string
[3] Elsewhere
[4] Built to work better with its own framework
I hope I was helpful, as this is my first post.
How to arrange correct processing of Unicode strings using pure C++?
What I mean is, when you put your Unicode string into std::string and count its length, you sometimes get something like 10 characters for a 5-character string.
How do they do it in serious open-source programs? How do they do it in a cross-platform manner? How do you tie it to file i/o and stdin/stdout streams?
Thanks.
There's Boost.Locale, which is written in C++, wraps the ICU library, and provides a nice, non-alien interface to it.
For Unicode work, my first choice would be Boost.Locale, followed by ICU directly (if there is something that Boost.Locale doesn't wrap yet).
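A minimal sketch of what that can look like (assuming Boost.Locale is built with the ICU backend and your std::string holds UTF-8):

#include <boost/locale.hpp>
#include <boost/locale/encoding_utf.hpp>
#include <iostream>
#include <string>

int main()
{
    boost::locale::generator gen;
    std::locale loc = gen("en_US.UTF-8");   // locale backed by ICU

    std::string text = u8"Grüße";           // UTF-8 bytes in a std::string

    // Character-aware operation instead of a byte-level one:
    std::cout << boost::locale::to_upper(text, loc) << "\n";

    // Converting between UTF encodings when needed:
    std::wstring wide = boost::locale::conv::utf_to_utf<wchar_t>(text);
    (void)wide;
}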
std::[w]string, contrary to popular belief, has no Unicode support whatsoever; both variants operate only on [w]char[_t] units, in an encoding-agnostic way.
If you only need basic Unicode support in the form of length and conversions and encoding verification, there is utfcpp, which provides a beautiful C++ interface for these operations.
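For example (a small sketch, assuming the std::string holds UTF-8 and utfcpp's utf8.h is on the include path):

#include <string>
#include <iterator>
#include <cstddef>
#include "utf8.h"

void utf8_basics(const std::string& s)
{
    // Number of Unicode code points, not bytes:
    std::ptrdiff_t length = utf8::distance(s.begin(), s.end());

    // Encoding verification and conversion to UTF-16:
    bool valid = utf8::is_valid(s.begin(), s.end());
    std::u16string utf16;
    if (valid)
        utf8::utf8to16(s.begin(), s.end(), std::back_inserter(utf16));

    (void)length;
}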
Application frameworks like Qt and wxWidgets do provide their own string classes, which offer better Unicode support, but they often tie you to using the whole framework throughout your code.
Aside from that, there is ICU, which is the standard Unicode implementation around today.
A work in progress by one of the C++ masters on this website is ogonek. You can contact the author through the Lounge<C++> Stack Overflow chat room to ask for details on his progress.
This is how: http://www.utf8everywhere.org
Have you checked http://site.icu-project.org already?
ICU is currently the Unicode library. If you want cross-platform Unicode support, ICU is basically the only place to get it.
If only its interface wasn't more unfriendly than the wrong end of an automatic shotgun.
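That said, basic use isn't too painful. A small sketch (assumes the ICU headers and linking against the common library, e.g. -licuuc):

#include <string>
#include <unicode/unistr.h>

void icu_basics(const std::string& utf8)
{
    // Convert UTF-8 bytes into ICU's UTF-16 based string type:
    icu::UnicodeString us = icu::UnicodeString::fromUTF8(utf8);

    int32_t codeUnits  = us.length();        // UTF-16 code units
    int32_t codePoints = us.countChar32();   // actual code points

    // And back to UTF-8:
    std::string out;
    us.toUTF8String(out);

    (void)codeUnits; (void)codePoints;
}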
I've used wxWidgets to do this. It makes for easy conversion from std::string to their string type wxString. It's not ideal, but it works well, is simple and portable.
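Roughly like this (a sketch assuming wxWidgets 2.9+ and that the std::string holds UTF-8):

#include <string>
#include <wx/string.h>

void wx_convert(const std::string& utf8)
{
    wxString ws = wxString::FromUTF8(utf8.c_str());   // UTF-8 -> wxString
    std::string back(ws.mb_str(wxConvUTF8));          // wxString -> UTF-8 bytes
    (void)back;
}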
I'm targeting Windows but I don't see any reason why some API code I'm writing cannot use basic C++ types. What I want to do is expose methods that return strings and ints. In the C# world I'd just use string, and have a unicode string, but in VC++ I've got the option of using std::string, std::wstring, or MFC/ATL CStrings.
Should I just use std::wstring exclusively to support Unicode, or can I use std::string, which would be compiled to Unicode based on my build settings? I'm leaning toward the latter. I'd prefer to provide Get[Item]AsCString() methods on my objects for other string types.
Also, should I be using size_t instead of int?
The API is going to be used by me, and perhaps a future developer working on the C++ GUI. It is a way to separate concerns. My preferences:
Intuitiveness for other developers.
Forward compatibility with VC++
Compatibility with other C++ compilers
Performance (this is a lesser concern for me, but I need to keep startup time down for the rest of my app)
Any guidance would be appreciated.
You should probably stick to the STL string type. The MFC CString class is built on top of that nowadays anyway.
As has been noted before, using wstring is not a magic bullet to address Unicode issues since there are many Unicode characters that still require multiple wchars to encode.
Using UTF-8 instead has potential benefits (you don't have to worry about endianness, for example).
On Windows, all modern kernels are wchar_t-based, so there is a (minimal) performance overhead involved if you use the 8-bit char versions of the APIs.
In your situation it would take me a few hours or days to develop an opinion and decide. First of all, I much prefer a C API to a C++ API, even for C++ code. Then the answer would be char*, wchar*, or TCHAR*. Now, try to guess whether you REALLY expect the need for Unicode. The great majority of my projects (including those with GUIs) had no need for Unicode; the simplicity and familiarity of plain C arrays is often hard to beat.
In short, try to predict what your needs will be, don't try to look too far into the future (two years is a good mark), and then come up with the simplest solution that meets those needs.
Last: to answer your question more directly, I would start with std::string as my first choice to evaluate. Unless I found some big advantage in favor of the other choices, I would stay with it.
Using std::wstring/string instead of the MFC CString will allow you to port your code to other frameworks (e.g. Qt for Windows).
Even when using std::string you could encode the strings in UTF-8, so your API will still be able to return Unicode strings.
Keep in mind that on Windows even wstring is really UTF-16 and not full 32-bit Unicode (while on some other operating systems wstring is UTF-32).
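For example, something along these lines on Windows lets your API keep returning std::string while the contents stay Unicode as UTF-8 (a sketch with minimal error handling; to_utf8 is just an illustrative helper name):

#include <string>
#include <windows.h>

std::string to_utf8(const std::wstring& w)
{
    if (w.empty()) return std::string();

    // First call asks for the required size, second call does the conversion.
    int bytes = WideCharToMultiByte(CP_UTF8, 0, w.data(), (int)w.size(),
                                    nullptr, 0, nullptr, nullptr);
    std::string out(bytes, '\0');
    WideCharToMultiByte(CP_UTF8, 0, w.data(), (int)w.size(),
                        &out[0], bytes, nullptr, nullptr);
    return out;
}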
How do I write a std::codecvt facet? I'd like to write ones that go from UTF-16 to UTF-8, from UTF-16 to the system's current code page (Windows, so CP_ACP), and from UTF-16 to the system's OEM code page (Windows, so CP_OEMCP).
Cross-platform is preferred, but MSVC on Windows is fine too. Are there any kinds of tutorials or anything of that nature on how to correctly use this class?
I've written one based on iconv. It can be used on Windows or on any POSIX OS.
(You will need to link with iconv obviously).
Enjoy
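The facet itself is too long to repeat here, but the core of such an approach is just the iconv conversion underneath. A rough sketch of that part (my sketch, not the original code; assumes POSIX iconv or libiconv, linked with -liconv where needed):

#include <iconv.h>
#include <string>
#include <stdexcept>

std::string utf16le_to_utf8(const std::u16string& in)
{
    iconv_t cd = iconv_open("UTF-8", "UTF-16LE");
    if (cd == (iconv_t)-1) throw std::runtime_error("iconv_open failed");

    std::string out(in.size() * 4, '\0');          // generous worst-case size
    // Note: some platforms declare the input buffer as const char**.
    char*  src      = (char*)in.data();
    size_t src_left = in.size() * sizeof(char16_t);
    char*  dst      = &out[0];
    size_t dst_left = out.size();

    if (iconv(cd, &src, &src_left, &dst, &dst_left) == (size_t)-1) {
        iconv_close(cd);
        throw std::runtime_error("iconv conversion failed");
    }
    iconv_close(cd);
    out.resize(out.size() - dst_left);             // trim to what was written
    return out;
}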
The answer for the "how to" question is to follow the codecvt reference. I was not able to find any better instructions in the Internet two years ago.
Important notices
Theoretically there is no need for such work: codecvt_byname should be enough on any standard-conforming platform. But in reality some compilers don't support this class, or support it badly.
There are also differences in the interface of codecvt_byname across compilers.
My working example is implemented with the state template parameter of codecvt. Always use the standard mbstate_t type there, as this is the only way to use your codecvt with the standard iostream classes.
The std::mbstate_t type can't be used as a pointer on 64-bit platforms in a cross-platform way.
Stateless conversions work for short strings, but may fail if you try to convert a data chunk larger than the streambuf's internal buffer size (UTF is essentially a stateful encoding).
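For orientation, here is a bare skeleton (placeholder bodies only, not a working converter) showing which virtual functions a facet derived from std::codecvt has to override; the real UTF conversion logic would go into do_out and do_in:

#include <locale>
#include <cwchar>

// Skeleton only: the placeholder bodies report an error and convert nothing.
class my_codecvt : public std::codecvt<wchar_t, char, std::mbstate_t> {
public:
    explicit my_codecvt(std::size_t refs = 0)
        : std::codecvt<wchar_t, char, std::mbstate_t>(refs) {}
protected:
    // wchar_t -> external bytes (e.g. UTF-8 encoding would go here)
    result do_out(std::mbstate_t&, const wchar_t* from, const wchar_t*,
                  const wchar_t*& from_next, char* to, char*,
                  char*& to_next) const override
    { from_next = from; to_next = to; return error; }

    // external bytes -> wchar_t (e.g. UTF-8 decoding would go here)
    result do_in(std::mbstate_t&, const char* from, const char*,
                 const char*& from_next, wchar_t* to, wchar_t*,
                 wchar_t*& to_next) const override
    { from_next = from; to_next = to; return error; }

    result do_unshift(std::mbstate_t&, char* to, char*,
                      char*& to_next) const override
    { to_next = to; return ok; }

    int  do_encoding()      const noexcept override { return 0; }  // variable-width
    bool do_always_noconv() const noexcept override { return false; }
    int  do_max_length()    const noexcept override { return 4; }  // max external bytes per char
    int  do_length(std::mbstate_t&, const char* from,
                   const char* from_end, std::size_t max) const override
    { std::size_t n = std::size_t(from_end - from); return int(n < max ? n : max); }
};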
The problem with std::codecvt is that it's a solution looking for a problem. Or rather, the problem it's trying to solve is unsolvable, so anybody trying to use it as a solution is going to be very disappointed.
If you don't know which character set your input or output is, then std::codecvt isn't ever going to be able to help you. Conversely, if you do know which character sets you're using, then you can trivially convert between them with a single function call. Wrapping that function call in a complicated mess of templates doesn't change those fundamentals.
...and that's why nobody uses std::codecvt. I recommend you just do what everybody else does, and pretend it never happened.
What's the current best practice for handling generic text in a platform independent way?
For example, on Windows there are the "A" and "W" versions of APIs. Down at the C layer we have the "_tcs" functions (like _tcscpy) which map to either "wcscpy" or "strcpy". And in the STL I've frequently used something like:
typedef std::basic_string<TCHAR> tstring;
What issues if any arise from these sorts of patterns on other systems?
There is no support for a generic (variable-width) character like TCHAR in standard C++. C++ does have wchar_t, but the encoding isn't guaranteed. C++1x will improve things considerably once we have char16_t and char32_t as well as UTF-{8,16,32} literals.
I personally am not a big fan of generic characters because they lead to some nasty problems (like conversion) and, what's more, if you are using a type (like TCHAR) that might ever have a maximum width of 8, you might as well code with char. If you really need that backwards-compatibility, just use UTF-8; it is specifically designed to be a strict superset of ASCII. You may have to use conversion APIs (especially on Windows, which for some bizarre reason is UTF-16), but at least it'll be consistent.
EDIT: To actually answer the original question, other platforms typically have no such construct. You will have to define your TCHAR on that platform, or else use a library that provides one (but as you should no doubt be able to guess, I'm not a big fan of that concept in libraries either).
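On non-Windows platforms that can be as simple as something like this (a hedged sketch; the macro names mirror <tchar.h> but the non-Windows definitions are your own):

#include <string>
#include <cstring>

#if defined(_WIN32)
#  include <tchar.h>
#else
   typedef char TCHAR;          // narrow fallback on other platforms
#  define _T(x)      x
#  define _tcslen    std::strlen
#  define _tcscpy    std::strcpy
#endif

typedef std::basic_string<TCHAR> tstring;

// Usage is then the same everywhere:
// tstring s = _T("hello");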
One thing to be careful of is to make sure that all of the static libraries you have, and the modules that use those static libraries, use the same char format. Otherwise your code will compile but not link properly.
I typically create my own t-prefixed types based on the STL types: tstring, tstringstream, and even down to Boost types like tpath_t.
Unicode character set + the encoding that makes the most sense for your data. I typically use UTF-8 because it's convenient with traditional C / C++ functions and the data I deal with doesn't cause too much bloat.
Some APIs (Windows) and cross language tools (Java) use UTF-16 so that might be a consideration.
One practice I wish we had been better at is to leave text as an array of bytes for low-tech operations like copying, simple comparison, simple searching, etc. When you need the richer, more character-aware operations you can convert to some super string (ICU strings are nice, but heavy) and define the layers / entry points that need to do this, as opposed to naively doing it everywhere. The needless conversions kill our performance, especially when combined with an XML DOM library which also uses the "super" strings.
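Sketched out, the layering looks something like this (assumes ICU and UTF-8 bytes in std::string; the function names are just illustrative):

#include <string>
#include <unicode/unistr.h>

// Low-tech layer: pure byte operations, no conversion anywhere.
bool contains_tag(const std::string& utf8, const std::string& asciiTag)
{
    return utf8.find(asciiTag) != std::string::npos;
}

// Character-aware entry point: convert once, do the rich operation, convert back.
std::string case_folded(const std::string& utf8)
{
    icu::UnicodeString u = icu::UnicodeString::fromUTF8(utf8);
    u.foldCase();                    // Unicode-correct case folding
    std::string out;
    u.toUTF8String(out);
    return out;
}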
What is the best practice of Unicode processing in C++?
Use ICU for dealing with your data (or a similar library)
In your own data store, make sure everything is stored in the same encoding
Make sure you are always using your unicode library for mundane tasks like string length, capitalization status, etc. Never use standard library builtins like is_alpha unless that is the definition you want.
I can't say it enough: never iterate over the indices of a string if you care about correctness; always use your Unicode library for this (see the sketch below this list).
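As a concrete example of what that means in practice, here is a sketch that counts user-perceived characters (grapheme clusters) with ICU's BreakIterator instead of indexing bytes in a std::string (assumes ICU; grapheme_count is an illustrative name):

#include <memory>
#include <string>
#include <unicode/brkiter.h>
#include <unicode/unistr.h>

int grapheme_count(const std::string& utf8)
{
    icu::UnicodeString text = icu::UnicodeString::fromUTF8(utf8);

    UErrorCode status = U_ZERO_ERROR;
    std::unique_ptr<icu::BreakIterator> it(
        icu::BreakIterator::createCharacterInstance(icu::Locale::getDefault(), status));
    if (U_FAILURE(status)) return -1;

    it->setText(text);
    int count = 0;
    it->first();
    while (it->next() != icu::BreakIterator::DONE)
        ++count;                      // one boundary step per grapheme cluster
    return count;
}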
If you don't care about backward compatibility with previous C++ standards, the current C++11 standard has built-in Unicode support: http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2011/n3242.pdf
So the truly best practice for Unicode processing in C++ would be to use the built in facilities for it. That isn't always a possibility with older code bases though, with the standard being so new at present.
EDIT: To clarify, C++11 is Unicode aware in that it now has support for Unicode literals and Unicode strings. However, the standard library has only limited support for Unicode processing and conversion. For your current needs this may be enough. However, if you need to do a large amount of heavy lifting right now then you may still need to use something like ICU for more in-depth processing. There are some proposals currently in the works to include more robust support for text conversion between different encodings. My guess (and hope) is that this will be part of the next technical report.
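For reference, the built-in part looks like this (just the new literals and character types; no library calls involved):

#include <string>

const char*     u8lit  = u8"κόσμε";   // UTF-8 encoded, narrow char
const char16_t* u16lit = u"κόσμε";    // UTF-16 encoded, char16_t
const char32_t* u32lit = U"κόσμε";    // UTF-32 encoded, char32_t

std::u16string s16 = u"text";         // std::basic_string<char16_t>
std::u32string s32 = U"text";         // std::basic_string<char32_t>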
Our company (and others) use the open source International Components for Unicode (ICU) library originally developed by Taligent.
It handles strings, locales, conversions, dates/times, collation, transformations, et al.
Start with the ICU User Guide.
Here is a checklist for Windows programming (a short code sketch follows the list):
All strings enclosed in _T("my string")
strlen() etc. functions replaced with _tcslen() etc.
Use LPTSTR and LPCTSTR instead of char * and const char *
When starting new projects in Dev Studio, religiously make sure the Unicode option is selected in your project properties.
For C++ strings, use std::wstring instead of std::string
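Taken together, the checklist might look like this in code (a sketch only; assumes a Unicode build, i.e. UNICODE and _UNICODE defined, and the greet function is just an illustrative example):

#include <windows.h>
#include <tchar.h>
#include <string>

typedef std::basic_string<TCHAR> tstring;   // std::wstring in a Unicode build

void greet(LPCTSTR name)                    // LPCTSTR instead of const char*
{
    LPCTSTR greeting = _T("Hello, ");       // _T() around string literals
    tstring message(greeting);
    message += name;

    size_t len = _tcslen(message.c_str());  // _tcslen instead of strlen
    (void)len;

    MessageBox(NULL, message.c_str(), _T("Greeting"), MB_OK);
}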
Look at
Case insensitive string comparison in C++
That question has a link to the Microsoft documentation on Unicode: http://msdn.microsoft.com/en-us/library/cc194799.aspx
If you look on the left-hand navigation side on MSDN next to that article, you should find a lot of information pertaining to Unicode functions. It is part of a chapter on "Encoding Characters" (http://msdn.microsoft.com/en-us/library/cc194786.aspx)
It has the following subsections:
The Code-Page Model
Double-Byte Character Sets in Windows
Unicode
Compatibility Issues in Mixed Environments
Unicode Data Conversion
Migrating Windows-Based Programs to Unicode
Summary
Although this may not be best practice for everyone, you can write your own C++ UNICODE routines if you want!
I just finished doing it over a weekend. I learned a lot, and though I don't guarantee it's 100% bug-free, I did a lot of testing and it seems to work correctly.
My code is under the New BSD license and can be found here:
http://code.google.com/p/netwidecc/downloads/list
It is called WSUCONV and comes with a sample main() program that converts between UTF-8, UTF-16, and Standard ASCII. If you throw away the main code, you've got a nice library for reading / writing UNICODE.
As has been said above, a library is the best bet when working in a large system. However, sometimes you do want to handle things yourself (maybe because the library would use too many resources, as on a microcontroller). In this case you want a simple library that you can copy the parts out of for the things you actually need.
Willow Schlanger's example code seems like a good one (see his answer for more details).
I also found another one that has less code; it lacks full error checking and only handles UTF-8, but it was simpler to take parts out of.
Here's a list of the embedded libraries that seem decent.
Embedded libraries
http://code.google.com/p/netwidecc/downloads/list (UTF8, UTF16LE, UTF16BE, UTF32)
http://www.cprogramming.com/tutorial/unicode.html (UTF8)
http://utfcpp.sourceforge.net/ (Simple UTF8 library)
Use IBM's International Components for Unicode
Have a look at the recommendations of UTF-8 Everywhere