Does the string returned from the GetStringUTFChars() end with a null terminated character? Or do I need to determine the length using GetStringUTFLength and null terminate it myself?
Yes, GetStringUTFChars returns a null-terminated string. However, I don't think you should take my word for it; instead, you should find an authoritative online source that answers this question.
Let's start with the actual Java Native Interface Specification itself, where it says:
Returns a pointer to an array of bytes representing the string in modified UTF-8 encoding. This array is valid until it is released by ReleaseStringUTFChars().
Oh, surprisingly it doesn't say whether it's null-terminated or not. Boy, that seems like a huge oversight, and fortunately somebody was kind enough to log this bug on Sun's Java bug database back in 2008. The notes on the bug point you to a similar but different documentation bug (which was closed without action), which suggests that readers buy the book "The Java Native Interface: Programmer's Guide and Specification", since there is a suggestion that it become the new specification for JNI.
But we're looking for an authoritative online source, and this is neither authoritative (it's not yet the specification) nor online.
Fortunately, the reviews for said book on a certain popular online book retailer suggest that the book is freely available online from Sun, and that would at least satisfy the online portion. Sun's JNI web page has a link that looks tantalizingly close, but that link sadly doesn't go where it says it goes.
So I'm afraid I cannot point you to an authoritative online source for this, and you'll have to buy the book (it's actually a good book), where it will explain to you that:
UTF-8 strings are always terminated with the '\0' character, whereas Unicode strings are not. To find out how many bytes are needed to represent a jstring in the UTF-8 format, JNI programmers can either call the ANSI C function strlen on the result of GetStringUTFChars, or call the JNI function GetStringUTFLength on the jstring reference directly.
(Note that in the above sentence, "Unicode" means "UTF-16", or more accurately "the internal two-byte string representation used by Java", though finding proof of that is left as an exercise for the reader.)
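For completeness, here is a minimal sketch, assuming a JVM that behaves as the book describes (i.e. the modified-UTF-8 buffer carries a trailing '\0'); the function name is illustrative and not from the book:

#include <jni.h>
#include <cstring>

void InspectString(JNIEnv* env, jstring jstr) {
    const char* utf = env->GetStringUTFChars(jstr, nullptr);
    if (utf == nullptr) return;                         // OOM: a Java exception is already pending

    size_t len_via_strlen = std::strlen(utf);           // relies on the trailing '\0'
    jsize len_via_jni = env->GetStringUTFLength(jstr);  // no such reliance

    // Both lengths exclude the terminator; on JVMs that null-terminate, they agree.
    (void)len_via_strlen; (void)len_via_jni;

    env->ReleaseStringUTFChars(jstr, utf);              // always release the buffer
}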
All current answers to the question seem to be outdated (Edward Thomson's answer was last updated in 2015), or refer to the Android JNI documentation, which can be authoritative only in the Android world. The matter has been clarified in the recent (2017) official Oracle JNI documentation clean-up and updates, more specifically in this issue.
Now the JNI specification clearly states:
String Operations
This specification makes no assumptions on how a JVM represent Java strings internally. Strings returned from these operations:
GetStringChars()
GetStringUTFChars()
GetStringRegion()
GetStringUTFRegion()
GetStringCritical()
are therefore not required to be NULL terminated. Programmers are expected to determine buffer capacity requirements via GetStringLength() or GetStringUTFLength().
In the general case this means one should never assume JNI-returned strings are null terminated, not even UTF-8 strings. In a pragmatic world one can test the specific behavior of a list of supported JVM(s); a defensive copy that makes no termination assumption is sketched after the list below. In my experience, referring to JVMs I actually tested:
Oracle JVMs do null terminate both UTF-16 (with \u0000) and UTF-8 strings (with '\0');
Android JVMs do terminate UTF-8 strings but not UTF-16 ones.
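For example, a minimal defensive sketch (the helper name is mine, not from any JVM documentation) that copies the modified-UTF-8 bytes using GetStringUTFLength instead of relying on a terminator:

#include <jni.h>
#include <string>

std::string CopyUtf(JNIEnv* env, jstring jstr) {
    const jsize utf_len = env->GetStringUTFLength(jstr);    // bytes, excluding any terminator
    const char* utf = env->GetStringUTFChars(jstr, nullptr);
    if (utf == nullptr) {
        return std::string();                               // OOM: a Java exception is pending
    }
    std::string copy(utf, static_cast<size_t>(utf_len));    // std::string adds its own '\0'
    env->ReleaseStringUTFChars(jstr, utf);
    return copy;
}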
https://developer.android.com/training/articles/perf-jni says:
The Java programming language uses UTF-16. For convenience, JNI provides methods that work with Modified UTF-8 as well. The modified encoding is useful for C code because it encodes \u0000 as 0xc0 0x80 instead of 0x00. The nice thing about this is that you can count on having C-style zero-terminated strings, suitable for use with standard libc string functions. The down side is that you cannot pass arbitrary UTF-8 data to JNI and expect it to work correctly.
If possible, it's usually faster to operate with UTF-16 strings. Android currently does not require a copy in GetStringChars, whereas GetStringUTFChars requires an allocation and a conversion to UTF-8. Note that UTF-16 strings are not zero-terminated, and \u0000 is allowed, so you need to hang on to the string length as well as the jchar pointer.
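A corresponding sketch for the UTF-16 path the Android docs describe, keeping the explicit length around because the jchar buffer is not zero-terminated and may even contain embedded \u0000 (helper name illustrative):

#include <jni.h>
#include <string>

std::u16string CopyUtf16(JNIEnv* env, jstring jstr) {
    const jsize len = env->GetStringLength(jstr);         // UTF-16 code units, no terminator
    const jchar* chars = env->GetStringChars(jstr, nullptr);
    if (chars == nullptr) {
        return std::u16string();
    }
    std::u16string copy(chars, chars + len);              // jchar converts to char16_t
    env->ReleaseStringChars(jstr, chars);
    return copy;
}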
Yes, strings returned by GetStringUTFChars() are null-terminated. I use it in my application, so I have proved it experimentally, let's say. While Oracle's documentation sucks, alternative sources are more informative: Java Native Interface (JNI) Tutorial
I would like confirmation regarding my understanding of raw string literals and the (non-wide) execution character set on Windows.
Relevant paragraphs for which I desire specific confirmation are in BOLD. But first, some background.
BACKGROUND
(relevant questions are in the paragraphs below in bold)
As a result of the helpful discussion beneath @TheUndeadFish's answer to this question that I posted yesterday, I have attempted to understand the rules determining the character set and encoding used as the execution character set in MSVC on Windows (in the C++ specification sense of execution character set; see @DietmarKühl's posting).
I suspect that some might consider it a waste of time to even bother trying to understand the ANSI-related behavior of char * (i.e., non-wide) strings for non-ASCII characters in MSVC.
For example, consider @IInspectable's comment here:
You cannot throw a UTF-8 encoded string at the ANSI version of a Windows API and hope for anything sane to happen.
Please note that in my current i18n project on a Windows MFC-based application, I will be removing all calls to the non-wide (i.e., ANSI) versions of API calls, and I expect the compiler to generate execution wide-character set strings, NOT execution character set (non-wide) strings internally.
However, I want to understand the existing code, which already has some internationalization that uses the ANSI API functions. Even if some consider the behavior of the ANSI API on non-ASCII strings to be insane, I want to understand it.
I think, like others, I have found it difficult to locate clear documentation about the non-wide execution character set on Windows.
In particular, because the (non-wide) execution character set is defined by the C++ standard to be a sequence of char (as opposed to wchar_t), UTF-16 cannot be used internally to store characters in the non-wide execution character set. In this day and age, it makes sense that the Unicode character set, encoded via UTF-8 (a char-based encoding), would therefore be used as the character set and encoding of the execution character set. To my understanding, this is the case on Linux. However, sadly, this is not the case on Windows - even MSVC 2013.
This leads to the first of my two questions.
Question #1: Please confirm that I'm correct in the following paragraph.
With this background, here's my question. In MSVC, including VS 2013, it seems that the execution character set is one of the (many possible) ANSI character sets, using one of the (many possible) code pages corresponding to that particular given ANSI character set to define the encoding - rather than the Unicode character set with UTF-8 encoding. (Note that I am asking about the NON-WIDE execution character set.) Is this correct?
BACKGROUND, CONTINUED (assuming I'm correct in Question #1)
If I understand things correctly, then the above bolded paragraph is arguably a large part of the cause of the "insanity" of using the ANSI API on Windows.
Specifically, consider the "sane" case - in which Unicode and UTF-8 are used as the execution character set.
In this case, it does not matter what machine the code is compiled on, or when, and it does not matter what machine the code runs on, or when. The actual raw bytes of a string literal will always be internally represented in the Unicode character set with UTF-8 as the encoding, and the runtime system will always treat such strings, semantically, as UTF-8.
No such luck in the "insane" case (if I understand correctly), in which ANSI character sets and code page encodings are used as the execution character set. In this case (the Windows world), the runtime behavior may be affected by the machine that the code is compiled on, in comparison with the machine the code runs on.
Here, then, is Question #2: Again, please confirm that I'm correct in the following paragraph.
With this continued background in mind, I suspect that: Specifically, with MSVC, the execution character set and its encoding depends in some not-so-easy-to-understand way on the locale selected by the compiler on the machine the compiler is running on, at the time of compilation. This will determine the raw bytes for character literals that are 'burned into' the executable. And, at run-time, the MSVC C runtime library may be using a different execution character set and encoding to interpret the raw bytes of character literals that were burned into the executable. Am I correct?
(I may add examples into this question at some point.)
FINAL COMMENTS
Fundamentally, if I understand correctly, the above bolded paragraph explains the "insanity" of using the ANSI API on Windows. Due to the possible difference between the ANSI character set and encoding chosen by the compiler and the ANSI character set and encoding chosen by the C runtime, non-ASCII characters in string literals may not appear as expected in a running MSVC program when the ANSI API is used in the program.
(Note that the ANSI "insanity" really only applies to string literals, because according to the C++ standard the actual source code must be written in a subset of ASCII (and source code comments are discarded by the compiler).)
The description above is the best current understanding I have of the ANSI API on Windows in regards to string literals. I would like confirmation that my explanation is well-formed and that my understanding is correct.
A very long story, and I have problems finding a single clear question. However, I think I can resolve a number of misunderstandings that led to this.
First off, "ANSI" is a synonym for the (narrow) execution character set. UTF-16 is the execution wide-character set.
The compiler will NOT choose for you. If you use narrow char strings, they are ANSI as far as the compiler (runtime) is aware.
Yes, the particular "ANSI" character encoding can matter. If you compile an L"ä" literal on your PC, and your source code is in CP1252, then that ä character is compiled to a UTF-16 ä. However, the same byte could be another non-ASCII character in other encodings, which would result in a different UTF-16 character.
Note however that MSVC is perfectly capable of compiling both UTF-8 and UTF-16 source code, as long as it starts with a U+FEFF BOM. This makes the whole theoretical problem pretty much a non-issue.
[edit]
"Specifically, with MSVC, the execution character set and its encoding depends..."
No, MSVC has nothing to do with the execution character set, really. The meaning of char(0xE4) is determined by the OS. To see this, check the MinGW compiler. Executables produced by MinGW behave the same as those of MSVC, as both target the same OS.
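To illustrate that point, here is a minimal sketch (my own example, assuming a Windows build environment): the narrow byte 0xE4 only acquires a meaning when the OS interprets it against the current ANSI code page, e.g. via MultiByteToWideChar with CP_ACP.

#include <windows.h>
#include <cstdio>

int main() {
    const char narrow[] = "\xE4";                        // one raw byte, 0xE4
    wchar_t wide[4] = {0};
    // Ask the OS to interpret the byte using the system ANSI code page (CP_ACP).
    MultiByteToWideChar(CP_ACP, 0, narrow, -1, wide, 4);
    // On a CP1252 system this prints 0x00E4 (ä); on CP1251 it would be 0x0444 (ф), etc.
    std::printf("0x%04X\n", (unsigned)wide[0]);
    return 0;
}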
I've been doing a bit of reading around the subject of Unicode -- specifically, UTF-8 -- (non) support in C++11, and I was hoping the gurus on Stack Overflow could reassure me that my understanding is correct, or point out where I've misunderstood or missed something if that is the case.
A short summary
First, the good: you can define UTF-8, UTF-16 and UCS-4 literals in your source code. Also, the <locale> header contains several std::codecvt implementations which can convert between any of UTF-8, UTF-16, UCS-4 and the platform multibyte encoding (although the API seems, to put it mildly, less than straightforward). These codecvt implementations can be imbue()'d on streams to allow you to do conversion as you read or write a file (or other stream).
[EDIT: Cubbi points out in the comments that I neglected to mention the <codecvt> header, which provides std::codecvt implementations which do not depend on a locale. Also, the std::wstring_convert and wbuffer_convert facilities can use these codecvts to convert strings and buffers directly, not relying on streams.]
C++11 also includes the C99/C11 <uchar.h> header which contains functions to convert individual characters from the platform multibyte encoding (which may or may not be UTF-8) to and from UCS-2 and UCS-4.
However, that's about the extent of it. While you can of course store UTF-8 text in a std::string, there are no ways that I can see to do anything really useful with it. For example, other than defining a literal in your code, you can't validate an array of bytes as containing valid UTF-8, you can't find out the length (i.e. number of Unicode characters, for some definition of "character") of a UTF-8-containing std::string, and you can't iterate over a std::string in any way other than byte-by-byte.
Similarly, even the C++11 addition of std::u16string doesn't really support UTF-16, but only the older UCS-2 -- it has no support for surrogate pairs, leaving you with just the BMP.
Observations
Given that UTF-8 is the standard way of handling Unicode on pretty much every Unix-derived system (including Mac OS X* and Linux) and has largely become the de facto standard on the web, the lack of support in modern C++ seems like a pretty severe omission. Even on Windows, the fact that the new std::u16string doesn't really support UTF-16 seems somewhat regrettable.
* As pointed out in the comments and made clear here, the BSD-derived parts of Mac OS use UTF-8 while Cocoa uses UTF-16.
Questions
If you managed to read all that, thanks! Just a couple of quick questions, as this is Stack Overflow after all...
Is the above analysis correct, or are there any other Unicode-supporting facilities I'm missing?
The standards committee has done a fantastic job in the last couple of years moving C++ forward at a rapid pace. They're all smart people and I assume they're well aware of the above shortcomings. Is there a particular well-known reason that Unicode support remains so poor in C++?
Going forward, does anybody know of any proposals to rectify the situation? A quick search on isocpp.org didn't seem to reveal anything.
EDIT: Thanks everybody for your responses. I have to confess that I find them slightly disheartening -- it looks like the status quo is unlikely to change in the near future. If there is a consensus among the cognoscenti, it seems to be that complete Unicode support is just too hard, and that any solution must reimplement most of ICU to be considered useful.
I personally don't agree with this; I think there is valuable middle ground to be found. For example, the validation and normalisation algorithms for UTF-8 and UTF-16 are well specified by the Unicode consortium, and could be supplied by the standard library as free functions in, say, a std::unicode namespace. These alone would be a great help for C++ programs which need to interface with libraries expecting Unicode input. But based on the answer below (tinged, it must be said, with a hint of bitterness) it seems Puppy's proposal for just this sort of limited functionality was not well received.
Is the above analysis correct
Let's see.
you can't validate an array of bytes as containing valid UTF-8
Incorrect. std::codecvt_utf8<char32_t>::length(start, end, max_length) returns the number of valid bytes in the array.
you can't find out the length
Partially correct. One can convert to char32_t and find out the length of the result. There is no easy way to find out the length without doing the actual conversion (but see below). I must say that the need to count characters (in any sense) arises rather infrequently.
you can't iterate over a std::string in any way other than byte-by-byte
Incorrect. std::codecvt_utf8<char32_t>::length(start, end, 1) gives you a possibility to iterate over UTF-8 "characters" (Unicode code points), and of course determine their number (that's not an "easy" way to count the number of characters, but it's a way).
doesn't really support UTF-16
Incorrect. One can convert to and from UTF-16 with e.g. std::codecvt_utf8_utf16<char16_t>. A result of conversion to UTF-16 is, well, UTF-16. It is not restricted to BMP.
Demo that illustrates these points.
If I have missed some other "you can't", please point it out and I will address it.
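For instance, here is a minimal sketch of the validation and counting points above, using the <codecvt> machinery mentioned (deprecated since C++17); the byte string is an arbitrary example of mine:

#include <codecvt>
#include <locale>
#include <string>
#include <stdexcept>
#include <iostream>

int main() {
    // "zß水" followed by U+1F34C, written as explicit UTF-8 bytes
    const std::string utf8 = "z\xC3\x9F\xE6\xB0\xB4\xF0\x9F\x8D\x8C";

    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    try {
        std::u32string u32 = conv.from_bytes(utf8);     // throws on invalid UTF-8
        std::cout << utf8.size() << " bytes, " << u32.size() << " code points\n";  // 10 bytes, 4 code points
    } catch (const std::range_error&) {
        std::cout << "not valid UTF-8\n";
    }
}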
Important addendum. These facilities are deprecated in C++17. This probably means they will go away in some future version of C++. Use them at your own risk. Everything enumerated in the original question once again cannot (safely) be done using only the standard library.
Is the above analysis correct, or are there any other Unicode-supporting facilities I'm missing?
You're also missing the utter failure of UTF-8 literals. They don't have a type distinct from narrow-character literals, which may have a totally unrelated encoding (e.g. code pages). So not only did they not add any serious new facilities in C++11, they broke what little there was, because now you can't even assume that a char* is in the narrow-string encoding for your platform unless UTF-8 is the narrow-string encoding. So the new feature here is "We totally broke char-based strings on every platform where UTF-8 isn't the existing narrow string encoding".
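A tiny sketch of that point (my own illustration, valid for C++11 through C++17; C++20 later introduced char8_t, which changes this):

#include <type_traits>

int main() {
    const char* a = u8"text";   // UTF-8 encoded, but just a const char array
    const char* b = "text";     // narrow execution encoding, also a const char array
    // Nothing in the type system records the difference in encoding:
    static_assert(std::is_same<decltype(u8"x"), decltype("x")>::value,
                  "identical types in C++11/14/17");
    (void)a; (void)b;
}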
The standards committee has done a fantastic job in the last couple of years moving C++ forward at a rapid pace. They're all smart people and I assume they're well aware of the above shortcomings. Is there a particular well-known reason that Unicode support remains so poor in C++?
The Committee simply doesn't seem to give a shit about Unicode.
Also, many of the Unicode support algorithms are just that- algorithms. This means that to offer a decent interface, we need ranges. And we all know that the Committee can't figure out what they want w.r.t. ranges. The new Iterables thing from Eric Niebler may have a shot.
Going forward, does anybody know of any proposals to rectify the situation? A quick search on isocpp.org didn't seem to reveal anything.
There was N3572, which I authored. But when I went to Bristol and presented it, there were a number of problems.
Firstly, it turns out that the Committee don't bother to give feedback on non-Committee-member-authored proposals between meetings, resulting in months of lost work when you iterate on a design they don't want.
Secondly, it turns out that it's voted on by whoever happens to wander by at the time. This means that if your paper gets rescheduled, you have a relatively random bunch of people who may or may not know anything about the subject matter. Or indeed, anything at all.
Thirdly, for some reason they don't seem to view the current situation as a serious problem. You can get endless discussion about how exactly optional<T>'s comparison operations should be defined, but dealing with user input? Who cares about that?
Fourthly, each paper needs a champion, effectively, to present and maintain it. Given the previous issues, plus the fact that there's no way I could afford to travel to other meetings, it was certainly not going to be me, will not be me in the future unless you want to donate all my travel expenses and pay a salary on top, and nobody else seemed to care enough to put the effort in.
What are the disadvantages to not using Unicode on Windows?
By Unicode, I mean WCHAR and the wide API functions. (CreateWindowW, MessageBoxW, and so on)
What problems could I run into by not using this?
Your code won't be able to deal correctly with characters outside the currently selected codepage when dealing with system APIs.
Typical problems include unsupported characters being translated to question marks, inability to process text with special characters, in particular files with "strange characters" in their names/paths.
Also, several newer APIs are present only in the "wide" version.
Finally, each API call involving text will be marginally slower, since the "A" versions of APIs are normally just thin wrappers around the "W" APIs that convert the parameters to UTF-16 on the fly - so you have some overhead with respect to a "plain" W call.
Nothing stops you from working in a narrow-character Unicode encoding (=> UTF-8) inside your application, but Windows "A" APIs don't speak UTF-8, so you'd have to convert to UTF-16 and call the W versions anyway.
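As an illustration of that pattern, here is a minimal sketch (the helper name is mine): keep UTF-8 internally, widen at the boundary with MultiByteToWideChar(CP_UTF8, ...), and call the W function directly.

#include <windows.h>
#include <string>

std::wstring Widen(const std::string& utf8) {
    if (utf8.empty()) return std::wstring();
    int n = MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), nullptr, 0);
    std::wstring wide(n, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), &wide[0], n);
    return wide;
}

// At a call site: MessageBoxW(nullptr, Widen(someUtf8Text).c_str(), L"Info", MB_OK);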
I believe the gist of the original question was: "Should I compile all my Windows apps with #define _UNICODE, and what's the downside if I don't?"
My original reply was "Yeah, you should. We've moved past 8-bit ASCII, and _UNICODE is a reasonable default for any modern Windows code."
For Windows, I still believe that's reasonably good advice. But I've deleted my original reply, because I didn't realize until I re-read my own links that "UTF-16 is quite a sad state of affairs" (as Matteo Italia eloquently put it).
For example:
http://utf8everywhere.org/
Microsoft has ... mistakenly used ‘Unicode’ and ‘widechar’ as synonyms for ‘UCS-2’ and ‘UTF-16’. Furthermore, since UTF-8 cannot be set as the encoding for narrow string WinAPI, one must compile her code with _UNICODE rather than _MBCS. Windows C++ programmers are educated that Unicode must be done with ‘widechars’. As a result of this mess, they are now among the most confused ones about what is the right thing to do about text.
I heartily recommend these three links:
The Absolute Minimum Every Software Developer Should Know about Unicode
Should UTF-16 Be Considered Harmful?
UTF-8 Everywhere
IMHO...
In my current project I've been using wide chars (UTF-16). But since my only input from the user is going to be a URL, which has to end up as ASCII anyway, and one other string, I'm thinking about just switching the whole program to ASCII.
My question is, is there any benefit to converting the strings to utf16 before I pass them to a Windows API function?
After doing some research online, it seems like a lot of people recommend this if you're not working with UTF-16 on Windows.
In the Windows API, if you call a function like
int SomeFunctionA(const char*);
then it will automatically convert the string to UTF-16 and call the real, Unicode version of the function:
int SomeFunctionW(const wchar_t*);
The catch is, it converts the string to UTF-16 from the ANSI code page. That works OK if you actually have strings encoded in the ANSI code page. It doesn't work if you have strings encoded in UTF-8, which is increasingly common these days (e.g., nearly 70% of Web pages), and isn't supported as an ANSI code page.
Also, if you use the A API, you'll run into limitations like not (easily) being able to open files that have non-ANSI characters in their names (which can be arbitrary UTF-16 strings). And won't have access to some of Windows' newer features.
Which is why I always call the W functions, even though this means annoying explicit conversions (from the UTF-8 strings used in the non-Windows-specific parts of our software).
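For the conversion in the other direction, a minimal sketch (helper name mine): results coming back from W APIs as UTF-16 can be narrowed to UTF-8 with WideCharToMultiByte for the portable parts of the code.

#include <windows.h>
#include <string>

std::string Narrow(const std::wstring& utf16) {
    if (utf16.empty()) return std::string();
    int n = WideCharToMultiByte(CP_UTF8, 0, utf16.data(), (int)utf16.size(),
                                nullptr, 0, nullptr, nullptr);
    std::string utf8(n, '\0');
    WideCharToMultiByte(CP_UTF8, 0, utf16.data(), (int)utf16.size(),
                        &utf8[0], n, nullptr, nullptr);
    return utf8;
}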
The main point is that on Windows UTF-16 is the native encoding and all API functions that end in A are just wrappers around the W ones. The A functions are just carried around for compatibility with programs that were written for Windows 9x/ME and, indeed, no new program should ever use them (in my opinion).
Unless you're doing heavy processing of billions of large strings, I doubt there is any benefit to thinking about storing them in another (possibly more space-saving) encoding at all. Besides, even a URI can contain Unicode, if you think about IDN. So don't be too sure upfront about what data your users will pass to the program.
In my application I constantly have to convert strings between std::string and std::wstring due to different APIs (boost, Win32, ffmpeg, etc.). Especially with ffmpeg the strings end up UTF-8 -> UTF-16 -> UTF-8 -> UTF-16, just to open a file.
Since UTF-8 is backwards compatible with ASCII, I thought I would consistently store all my strings in UTF-8 std::string and only convert to std::wstring when I have to call certain unusual functions.
This worked reasonably well; I implemented to_lower, to_upper and iequals for UTF-8. However, I then hit several dead ends: std::regex and regular string comparisons. To make this usable I would need to implement a custom ustring class based on std::string, with re-implementations of all the corresponding algorithms (including regex).
Basically my conclusion is that UTF-8 is not very good for general usage, and the current std::string/std::wstring situation is a mess.
However, my question is why the default std::string and "" are not simply changed to use UTF-8, especially as UTF-8 is backward compatible? Is there possibly some compiler flag which can do this? Of course the STL implementation would need to be adapted automatically.
I've looked at ICU, but it is not very compatible with APIs assuming basic_string, e.g. no begin()/end()/c_str(), etc.
The main issue is the conflation of in-memory representation and encoding.
None of the Unicode encodings are really amenable to text processing. Users will in general care about graphemes (what's on the screen) while the encoding is defined in terms of code points... and some graphemes are composed of several code points.
As such, when one asks what the 5th character of "Hélène" (a French first name) is, the question is quite confusing:
In terms of graphemes, the answer is n.
In terms of code points... it depends on the representation of é and è (they can be represented either as a single code point or as a pair using diacritics...)
Depending on the source of the question (an end-user in front of her screen or an encoding routine) the response is completely different.
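A minimal sketch of that ambiguity (the strings are my own illustration): the same graphemes on screen can be spelled with different numbers of code points, so even a code-point count disagrees with what the user sees.

#include <string>
#include <iostream>

int main() {
    std::u32string precomposed = U"H\u00E9l\u00E8ne";   // é, è as single code points (U+00E9, U+00E8)
    std::u32string decomposed  = U"He\u0301le\u0300ne"; // e followed by combining accents (U+0301, U+0300)

    // Same graphemes when rendered, different code-point counts:
    std::cout << precomposed.size() << " vs " << decomposed.size() << '\n';  // 6 vs 8
}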
Therefore, I think that the real question is Why are we speaking about encodings here?
Today it does not make sense, and we would need two "views": Graphemes and Code Points.
Unfortunately the std::string and std::wstring interfaces were inherited from a time where people thought that ASCII was sufficient, and the progress made didn't really solve the issue.
I don't even understand why the in-memory representation should be specified, it is an implementation detail. All a user should want is:
to be able to read/write in UTF-* and ASCII
to be able to work on graphemes
to be able to edit a grapheme (to manage the diacritics)
... who cares how it is represented? I thought that good software was built on encapsulation?
Well, C cares, and we want interoperability... so I guess it will be fixed when C is.
You cannot; the primary reason for this is named Microsoft. They decided not to support Unicode as UTF-8, so support for UTF-8 under Windows is minimal.
Under Windows you cannot use UTF-8 as a code page, but you can convert from or to UTF-8.
There are two snags to using UTF-8 on Windows.
You cannot tell how many bytes a string will occupy - it depends on which characters are present, since some characters take 1 byte, some take 2, some take 3, and some take 4.
The Windows API uses UTF-16. Since most Windows programs make numerous calls to the Windows API, there is quite an overhead converting back and forth. (Note that you can do a "non-Unicode" build, which looks like it uses a narrow-character Windows API, but all that is happening is that the conversion back and forth on each call is hidden.)
The big snag with UTF-16 is that the binary representation of a string depends on the byte order in a word on the particular hardware the program is running on. This does not matter in most cases, except when strings are transmitted between computers where you cannot be sure that the other computer uses the same byte order.
So what to do? I use UTF-16 everywhere 'inside' all my programs. When string data has to be stored in a file, or transmitted from a socket, I first convert it to UTF-8.
This means that 95% of my code runs simply and most efficiently, and all the messy conversions between UTF-8 and UTF-16 can be isolated to routines responsible for I/O.
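A minimal sketch of that boundary pattern (my own example, assuming Windows where wchar_t holds UTF-16; std::codecvt_utf8_utf16 is deprecated since C++17 but still illustrates the conversion; the function name is illustrative):

#include <codecvt>
#include <locale>
#include <string>
#include <fstream>

void SaveAsUtf8(const std::wstring& utf16, const std::string& path) {
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> conv;
    std::string utf8 = conv.to_bytes(utf16);    // UTF-16 in memory -> UTF-8 on disk
    std::ofstream out(path, std::ios::binary);
    out.write(utf8.data(), static_cast<std::streamsize>(utf8.size()));
}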