I want a lightweight C++ XML parser/DOM that:
Can take UTF-8 as input, and parse into UTF-16. Maybe it does this directly (ideal!), or perhaps it provides a hook for the conversion (such as taking a custom stream object that does the conversion before parsing).
Offers some XPath support.
I've been looking at RapidXML, the Kranf xmlParser, and pugiXML. The first two of those might permit requirement #1 by way of a hook. The third, pugiXML, supports the #2 requirement. But none of those three fulfill both requirements.
What is the smallest (free) library that can handle both requirements?
pugixml has a Unicode branch. I guess Unicode will be officially supported in the next version (0.6).
I'd really go for TinyXML + TinyXPath... Tiny, fully UTF-8 compliant and zlib/MIT licensed. If you want a more C++-like interface there's also TinyXML++.
I see "CString" in MFC, and "QString" in Qt.
What is the difference among string, CString, and QString?
Why not use "string" directly?
They're different variations on string types.
std::string is the one from the ISO standard and probably preferred in situations where you want portability. It is required to be provided by all implementations claiming to conform with the standard.
CString is, as you say, from MFC (documented here) and will generally only work in that environment. If you're programming exclusively to Windows, you can probably use that. It may have extra features not provided by std::string.
Similarly, QString is the Qt variation, documented here, and is meant to represent strings in programs using Qt. Like CString, it's more tightly bound to its environment so may offer efficiencies over std::string.
Looking around (doing your research for you basically) I found some stuff.
String: Does NOT support character encoding, and has no special functionality compared to the others(.)
QString: Plenty of useful functions, some better compatibilities, supports character encoding, default UTF-16(.)
CString: Plenty of useful functions, some better compatibilities, and good for Unicode and Ascii compilation(..), ...
There are also some more things that are not mentioned here, the sources are
. http://blog.rburchell.com/2010/08/strings-and-qt.html
.. http://forums.codeguru.com/showthread.php?319932-CString-vs-std-string
... Elsewhere
.... Built to work better with its own framework
I hope I was helpful, as this is my first post.
Does Standard ML support Unicode?
I believe it does not but cannot find any authoritative documentation for SML stating such.
A yes or no is all that is needed, but you must know for a fact. No guessing or I believe answers. An authoritative link would be better.
Not really. All there is in the standard for the time being is the ability to use \uXXXX escapes in character and string literals, and that it does at least allow Unicode as the underlying character encoding for char or the optional WideChar.char. But the standard basis library does not prescribe any support for additional Unicode-aware functionality.
Particular implementations may have additional support, and you may perhaps find some third-party unicode libraries, but that's about it (unfortunately, I have no pointers at hand).
It depends a lot what you mean by "Unicode", which is a collection of many standards for many things. I've not seen any language or system that supports Unicode fully, and I don't even know what that would mean in all details.
You can certainly work with UTF-8 in SML: that encoding was invented to make it easy for ASCII applications to support Unicode. This might result in a better and more efficient representation of Unicode than e.g. the UTF-16 seen in Java, which does "support Unicode" officially, but then there are many practical problems with it (like surrogate characters).
With UTF-8 in SML strings, one question is how to work with string literals. Systems like Poly/ML allow redefining the ML toplevel pretty printer for type string, and it is also feasible to wrap up the compiler to process string literals in a Unicode-friendly way. Both of these are done in Isabelle/ML, which is based on Poly/ML. So if you take that big theorem proving environment as your ML development platform, you have some kind of Unicode support built in (via so-called "Isabelle symbols").
How to arrange correct processing of Unicode strings using pure C++?
What I mean is, when you put a Unicode string into std::string and count its length, you sometimes get something like 10 characters for a 5-character string.
How do they do it in serious open-source programs? How do they do it in a cross-platform manner? How do you tie it to file i/o and stdin/stdout streams?
Thanks.
There's Boost.Locale, which is written in C++, wraps the ICU library, and provides a nice, non-alien interface to it.
For Unicode work, my first choice would be Boost.Locale, followed by ICU directly (if there is something that Boost.Locale doesn't wrap yet).
std::[w]string, contrary to popular belief, has no Unicode support whatsoever. They both operate only on [w]char[_t] units, in an encoding agnostic way.
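To illustrate the point about std::string counting units rather than characters, here is a minimal sketch that counts code points in a string assumed to already hold valid UTF-8. The function name utf8_code_points is made up for this example, and it counts code points, not user-perceived characters (grapheme clusters), which is where a real library like ICU or Boost.Locale comes in.

```cpp
#include <cstddef>
#include <string>

// Counts Unicode code points in a string assumed to hold valid UTF-8.
// Continuation bytes have the bit pattern 10xxxxxx; every other byte
// starts a new code point, so only those bytes are counted.
// Hypothetical helper for illustration; no validation is performed.
std::size_t utf8_code_points(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For a five-letter Cyrillic word stored as UTF-8, size() reports 10 (bytes) while this function reports 5, which is the mismatch described in the question.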
If you only need basic Unicode support in the form of length and conversions and encoding verification, there is utfcpp, which provides a beautiful C++ interface for these operations.
Application frameworks like Qt and wxWidgets do provide their own string classes, which offer better Unicode support, but they often tie you to using the whole framework throughout your code.
Aside from that, there is ICU, which is the standard Unicode implementation around today.
A work in progress by one of the C++ masters on this website is ogonek. You can surely contact the author through the Lounge<C++> StackOverflow chat room to ask for details on his progress.
This is how: http://www.utf8everywhere.org
Have you checked http://site.icu-project.org already?
ICU is currently the Unicode library. If you want cross-platform Unicode support, ICU is basically the only place to get it.
If only its interface wasn't more unfriendly than the wrong end of an automatic shotgun.
I've used wxWidgets to do this. It makes for easy conversion from std::string to their string type wxString. It's not ideal, but it works well, is simple and portable.
My source base is mostly using UTF8, but some older library has Windows Latin1 encoded strings hardcoded within it.
I was hoping Boost would have a clear conversion feature, but I did not find such. Do I really need to hand-code such a commonplace solution?
Looking for a portable solution, running on Linux.
(This Q is similar, but not quite the same)
Edit: ICU seems to be the right answer, but it's a bit overkill for my needs. I ended up doing string-replace for the known few extended chars that were used.
International Components for Unicode (ICU) does have the solutions you are looking for. Boost can be compiled with support for ICU, e.g. for Boost regular expressions, but precompiled versions of Boost usually don't include it.
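If ICU is more than you need, the Latin-1 case is small enough to hand-code. Here is a minimal sketch, assuming the input is strict ISO-8859-1, where every byte value equals its Unicode code point. Windows-1252 differs from that in the 0x80 to 0x9F range (the euro sign, curly quotes and so on), so those bytes would need a lookup table that this sketch does not include. The name latin1_to_utf8 is just illustrative.

```cpp
#include <string>

// Converts ISO-8859-1 (Latin-1) bytes to UTF-8.
// Assumption: strict Latin-1 input; Windows-1252 specifics in the
// 0x80-0x9F range are NOT handled here.
std::string latin1_to_utf8(const std::string& in) {
    std::string out;
    out.reserve(in.size() * 2);  // worst case: every byte becomes two
    for (unsigned char c : in) {
        if (c < 0x80) {
            // ASCII is encoded identically in UTF-8.
            out.push_back(static_cast<char>(c));
        } else {
            // U+0080..U+00FF need two bytes: 110xxxxx 10xxxxxx.
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return out;
}
```

ASCII text passes through unchanged, so this is safe to apply to strings that happen to contain no extended characters at all.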
What is the best practice of Unicode processing in C++?
Use ICU for dealing with your data (or a similar library)
In your own data store, make sure everything is stored in the same encoding
Make sure you are always using your Unicode library for mundane tasks like string length, capitalization status, etc. Never use standard library builtins like isalpha unless that is the definition you want.
I can't say it enough: never iterate over the indices of a string if you care about correctness, always use your unicode library for this.
If you don't care about backwards compatibility with previous C++ standards, the current C++11 standard has built in Unicode support: http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2011/n3242.pdf
So the truly best practice for Unicode processing in C++ would be to use the built in facilities for it. That isn't always a possibility with older code bases though, with the standard being so new at present.
EDIT: To clarify, C++11 is Unicode aware in that it now has support for Unicode literals and Unicode strings. However, the standard library has only limited support for Unicode processing and conversion. For your current needs this may be enough. However, if you need to do a large amount of heavy lifting right now then you may still need to use something like ICU for more in-depth processing. There are some proposals currently in the works to include more robust support for text conversion between different encodings. My guess (and hope) is that this will be part of the next technical report.
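As a small sketch of what "Unicode literals and Unicode strings" means in C++11: the u"" and U"" prefixes give you UTF-16 and UTF-32 strings (std::u16string and std::u32string), but size() still counts code units, not characters. The function names below are made up for the example.

```cpp
#include <cstddef>
#include <string>

// U+00E9 fits in one UTF-16 code unit; U+1F600 lies outside the
// Basic Multilingual Plane and needs a surrogate pair (two units).
std::size_t utf16_units() {
    std::u16string s = u"\u00E9\U0001F600";
    return s.size();
}

// In UTF-32 every code point is exactly one code unit.
std::size_t utf32_units() {
    std::u32string s = U"\u00E9\U0001F600";
    return s.size();
}
```

The same two code points take three units as UTF-16 and two as UTF-32, so even with these types you still need a library for anything that reasons about characters rather than storage.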
Our company (and others) use the open source International Components for Unicode (ICU) library originally developed by Taligent.
It handles strings, locales, conversions, date/times, collation, transformations, etc.
Start with the ICU User Guide
Here is a checklist for Windows programming:
All strings enclosed in _T("my string")
strlen() etc. functions replaced with _tcslen() etc.
Use LPTSTR and LPCTSTR instead of char * and const char *
When starting new projects in Dev Studio, religiously make sure the Unicode option is selected in your project properties.
For C++ strings, use std::wstring instead of std::string
Look at
Case insensitive string comparison in C++
That question has a link to the Microsoft documentation on Unicode: http://msdn.microsoft.com/en-us/library/cc194799.aspx
If you look on the left-hand navigation side on MSDN next to that article, you should find a lot of information pertaining to Unicode functions. It is part of a chapter on "Encoding Characters" (http://msdn.microsoft.com/en-us/library/cc194786.aspx)
It has the following subsections:
The Code-Page Model
Double-Byte Character Sets in Windows
Unicode
Compatibility Issues in Mixed Environments
Unicode Data Conversion
Migrating Windows-Based Programs to Unicode
Summary
Although this may not be best practice for everyone, you can write your own C++ UNICODE routines if you want!
I just finished doing it over a weekend and learned a lot. I don't guarantee it's 100% bug free, but I did a lot of testing and it seems to work correctly.
My code is under the New BSD license and can be found here:
http://code.google.com/p/netwidecc/downloads/list
It is called WSUCONV and comes with a sample main() program that converts between UTF-8, UTF-16, and Standard ASCII. If you throw away the main code, you've got a nice library for reading / writing UNICODE.
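To give an idea of what such a hand-written routine involves, here is a minimal UTF-8 to UTF-16 sketch. This is not the WSUCONV code, just an illustration with a made-up function name, and it is deliberately simplified: it rejects bad lead bytes, truncated sequences and bad continuation bytes, but it does not reject overlong encodings or UTF-8-encoded surrogates.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Decodes UTF-8 into UTF-16 (hypothetical helper for illustration).
// Throws std::runtime_error on malformed or truncated input.
// Simplification: overlong forms and encoded surrogates are accepted.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (extra > in.size() - i - 1)
            throw std::runtime_error("truncated UTF-8 sequence");

        // Each continuation byte contributes its low six bits.
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char cont = static_cast<unsigned char>(in[i + k]);
            if ((cont & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (cont & 0x3F);
        }
        i += extra + 1;

        if (cp > 0x10FFFF)
            throw std::runtime_error("code point out of range");

        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Split into a high/low surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

A production-quality converter needs the stricter checks mentioned above, which is a good part of why the other answers point at ICU or utfcpp.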
As has been said above, a library is the best bet when working on a large system. However, sometimes you do want to handle things yourself (maybe because the library would use too many resources, as on a microcontroller). In this case you want a simple library that you can copy the parts out of for the things you actually need.
Willow Schlanger's example code seems like a good one (see his answer for more details).
I also found another one that has smaller code, but lacks full error checking and only handles UTF-8 but was simpler to take parts out of.
Here's a list of the embedded libraries that seem decent.
Embedded libraries
http://code.google.com/p/netwidecc/downloads/list (UTF8, UTF16LE, UTF16BE, UTF32)
http://www.cprogramming.com/tutorial/unicode.html (UTF8)
http://utfcpp.sourceforge.net/ (Simple UTF8 library)
Use IBM's International Components for Unicode
Have a look at the recommendations of UTF-8 Everywhere