Converting jbyteArray to a character array, and then printing to console - java-native-interface

I am writing a JNI program where my .cpp file gets a jbyteArray and I want to be able to print the jbyteArray with printf. For that to happen, I believe I have to convert the jbyteArray to a character array.
For background knowledge, the java side of my JNI converts a String to a byteArray, and then that byteArray is passed in as an argument to my JNI function.
What I've done so far prints out the String correctly, but it is followed by junk characters. I do not know how to get rid of them, or whether I am doing something wrong.
Here is what the String is:
dsa
and what prints to console:
dsa,�
The junk characters change depending on what the String is.
Here is the part of the code that is relevant:
.java file:
public class tcr extends javax.swing.JFrame{
static{
System.loadLibrary("tcr");
}
public native int print(byte file1[]);
.....
String filex1 = data1TextField.getText();//gets a filepath in the form of a String from a GUI jtextfield.
byte file1[]= filex1.getBytes();//convert file path from string to byte array
tcr t = new tcr();
t.print(file1);
}
.cpp code:
JNIEXPORT jint JNICALL Java_tcr_print(JNIEnv *env, jobject thisobj, jbyteArray file1){
jboolean isCopy;
jbyte* a = env->GetByteArrayElements(file1,&isCopy);
char* b;
b = (char*)a;
printf("%s\n",b);
}
Any help would be appreciated.

Look at what you are doing:
jbyte* a = env->GetByteArrayElements(file1,&isCopy);
a now points to a memory address where the byte contents of the string are stored. Let's assume that the array contains the string "Hello world". In UTF-8 encoding, that would be:
48 65 6c 6c 6f 20 77 6f 72 6c 64
char* b = (char*)a;
b now points to that memory region. It's a char pointer, so you probably want to use it as a C string. However, that won't work. C strings are defined as some bytes, ending with a zero byte. Now look up there and you'll see that there is no zero byte at the end of this string.
printf("%s\n",b);
Here it is. You are passing the char pointer to printf as %s which tells printf that it's a C string. However, it isn't a C string but printf still tries to print all characters until it reaches a zero byte. So what you see after dsa are actually bytes from your memory after the end of the byte array until there is (by coincidence) a zero byte. You can fix this by copying the bytes to a buffer that is one byte longer than the byte array and then setting the last element to zero.
UPDATE:
You can create the bigger buffer and append the null byte like this:
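jsize textLength = env->GetArrayLength(file1);
char* b = (char*)malloc(textLength + 1);
memcpy(b, a, textLength);
b[textLength] = '\0';
Now b is a valid null-terminated C string. Note that the length has to come from GetArrayLength: calling strlen on a would run past the end of the array for the same reason printf does. Also, don't forget the call to ReleaseByteArrayElements (you can do that right after the memcpy call) and to free(b) when you are done with it.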
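As a rough sketch of the copy-and-terminate idea without the JNI plumbing: the helper name toNullTerminated is made up for illustration, and in the real JNI function the pointer would come from GetByteArrayElements and the length from GetArrayLength.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper (not part of JNI): copies `length` raw bytes into a
// buffer that is one byte longer and ends with '\0', so the result can be
// passed to printf("%s") safely. jbyte is a signed 8-bit type, hence the
// signed char pointer.
std::vector<char> toNullTerminated(const signed char* bytes, std::size_t length) {
    std::vector<char> buffer(length + 1);   // one extra slot for the terminator
    if (length > 0) {
        std::memcpy(buffer.data(), bytes, length);
    }
    buffer[length] = '\0';                  // explicit terminator
    return buffer;
}
```

With this, printf("%s\n", toNullTerminated(a, len).data()) prints only the bytes that belong to the array, and the vector frees its memory on its own.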

A jbyteArray is actually a very good way to pass a Java String through JNI. It allows you to easily convert the string into the character set and encoding needed by the libraries and files/devices you are using on the C++ side.
Be sure you understand "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)"
Java String uses the Unicode character set and UTF-16 encoding (with a platform-dependent byte order).
String.getBytes() converts to the "platform's default charset". So, it is making an assumption about the character set and encoding you need, and what to do about characters that are not in the target character set. You can use other Java String.getBytes overloads or the Charset methods if you want to control these things explicitly.
In deciding which character set and encoding to use, consider that Unicode has been used for a couple decades as the primary string type in Java, .NET, VB, ...; in compiler source files for Java, ...; generally in the WWW. Of course, you might be limited by the things you want to interoperate with.
Now, it seems the problem you are facing is either that the target character set is missing characters that your Java String has and a substitute is being used, or the console you are using isn't displaying them properly.
The console (or any app with a UI), obviously, has to pick a typeface with which to render the characters. Typefaces generally don't support the million codepoints available in Unicode. You may be able to change the configuration of your console (or use another). For example, in Windows, you can use cmd.exe or ps (Windows PowerShell). You can change the font in Cmd.exe windows and use chcp to change the character set.
UPDATE:
As @main-- points out, if you use a function that expects a terminator appended to the string then you have to provide it, usually by copying the array since the JVM retains ownership of the array. This is the actual cause of the behavior in this case. But all of the above is relevant, too.

Related

Required to convert a String to UTF8 string

Problem Statement:
I am required to convert a generated string to UTF8 string, this generated string has extended ascii characters and I am on Linux system (2.6.32-358.el6.x86_64).
A POC is still in progress, so I can only provide small code samples; a complete solution can be posted only once it is ready.
Why I require UTF-8: I have extended ASCII characters to be stored in a string, which has to be UTF-8.
How I am proceeding:
Convert generated string to wchar_t string.
Please look at the below sample code
int main(){
char CharString[] = "Prova";
iconv_t cd;
wchar_t WcharString[255];
size_t size= mbstowcs(WcharString, CharString, strlen(CharString));
wprintf(L"%ls\n", WcharString);
wprintf(L"%s\n", WcharString);
printf("\n%zu\n",size);
}
One question here:
Output is
Prova?????
s
Why the size is not printed here ?
Why the second printf prints only one character.
If I print size before both printed string then only 5 is printed and both strings are missing from console.
Moving on to Second Part:
Now that I will have a wchar_t string I want to convert it to UTF8 string
For this I was surfing through and found iconv will help here.
Question here
These are the methods I found in manual
iconv_t iconv_open(const char *, const char *);
size_t iconv(iconv_t, char **, size_t *, char **, size_t *);
int iconv_close(iconv_t);
Do I need to convert the wchar_t array back to a char array before feeding it to iconv?
Please provide suggestions on the above issues.
Extended ascii I am talking about please see letters i in the marked snapshot below
For your first question (which I am interpreting as "why is all the output not what I expect"):
Where does the '?????' come from? In the call mbstowcs(WcharString, CharString, strlen(CharString)), the last argument (strlen(CharString)) is the length of the output buffer, not the length of the input string. mbstowcs will not write more than that number of wide characters, including the NUL terminator. Since the conversion requires 6 wide characters including the terminator, and you are only allowing it to write 5 wide characters, the resulting wide character string is not NUL terminated, and when you try to print it out you end up printing garbage after the end of the converted string. Hence the ?????. You should use the size of the output buffer in wchar_t's (255, in this case) instead.
Why does the second wprintf only print one character? When you call wprintf with a wide character string argument, you must use the %ls format code (or, more accurately, the %s conversion needs to be qualified with an l length modifier). If you use %s without the l, then wprintf will interpret the string as a char*, and it will convert each character to a wchar_t as it outputs it. However, since the argument is actually a wide character string, the first wchar_t in the string is L'P', which is the number 0x50 in some integer size. That means that the second byte of the wchar_t (counting from the end, since you have a little-endian architecture) is a 0, so if you treat the string as a string of characters, it will be terminated immediately after the P. So only one character is printed.
Why doesn't the last printf print anything? In C, an output stream can either be a wide stream or a byte stream, but you don't specify that when you open the stream. (And, in any case, standard output is already opened for you.) This is called the orientation of the stream. A newly opened stream is unoriented, and the orientation is fixed when you first output to the stream. If the first output call is a wide call, like wprintf, then the stream is a wide stream; otherwise, it is a byte stream. Once set, the orientation is fixed and you can't use output calls of the wrong orientation. So the printf is illegal, and it does nothing other than raise an error.
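To illustrate the first point (the third argument of mbstowcs is the capacity of the destination, not the length of the source), here is a minimal sketch. The helper name widen is an assumption for illustration, and the conversion uses whatever C locale is currently set.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: converts a multibyte string to a wide string using
// the current C locale. Returns an empty string if the input is not valid
// in that locale.
std::wstring widen(const char* narrow) {
    // A multibyte string never produces more wide characters than it has
    // bytes, so strlen + 1 slots always leave room for the terminator.
    std::vector<wchar_t> buffer(std::strlen(narrow) + 1);

    // Pass the capacity of the destination, counted in wchar_t elements.
    std::size_t written = std::mbstowcs(buffer.data(), narrow, buffer.size());
    if (written == static_cast<std::size_t>(-1)) {
        return std::wstring();              // invalid multibyte sequence
    }
    return std::wstring(buffer.data(), written);
}
```

Because the buffer has one more slot than the source has bytes, mbstowcs always has room to write the terminating NUL, which is exactly what the original call did not allow.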
Now, let's move on to your second question: What do I do about it?
The first thing is that you need to be clear about what format the input is in, and how you want to output it. On Linux, it is somewhat unlikely that you will want to use wchar_t at all. The most likely cases for the input string are that it is already UTF-8, or that it is in some ISO-8859-x encoding. And the most likely cases for the output are the same: either it is UTF-8, or it is some ISO-8859-x encoding.
Unfortunately, there is no way for your program to know what encoding the console is expecting. The output may not even be going to a console. Similarly, there is really no way for your program to know which ISO-8859-x encoding is being used in the input string. (If it is a string literal, the encoding might be specified when you invoke the compiler, but there is no standard way of providing the information.)
If you are having trouble viewing output because non-ascii characters aren't displaying properly, you should start by making sure that the console is configured to use the same encoding as the program is outputting. If the program is sending UTF-8 to a console which is displaying, say, ISO-8859-15, then the text will not display properly. In theory, your locale setting includes the encoding used by your console, but if you are using a remote console (say, through PuTTY from a Windows machine), then the console is not part of the Linux environment and the default locale may be incorrect. The simplest fix is to configure your console correctly, but it is also possible to change the Linux locale.
The fact that you are using mbstowcs from a byte string suggests that you believe that the original string is in UTF-8. So it seems unlikely that the problem is that you need to convert it to UTF-8.
You can certainly use iconv to convert a string from one encoding to another; you don't need to go through wchar_t to do so. But you do need to know the actual input encoding and the desired output encoding.
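A minimal sketch of such a direct conversion with the three iconv functions listed in the question. The helper name convertEncoding and the output sizing (four bytes per input byte plus a little slack) are assumptions made for illustration; which encoding names are available depends on the iconv implementation installed on the system.

```cpp
#include <iconv.h>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: converts `input` from encoding `from` to encoding
// `to`. Returns an empty string if the conversion cannot be set up or fails.
std::string convertEncoding(const std::string& input, const char* from, const char* to) {
    iconv_t cd = iconv_open(to, from);      // note the order: target first
    if (cd == (iconv_t)-1) {
        return std::string();
    }

    std::vector<char> in(input.begin(), input.end());
    std::vector<char> out(input.size() * 4 + 4);   // assumed generous bound
    char* inPtr = in.data();
    char* outPtr = out.data();
    std::size_t inLeft = in.size();
    std::size_t outLeft = out.size();

    // iconv advances the pointers and decrements the remaining counts.
    std::size_t rc = iconv(cd, &inPtr, &inLeft, &outPtr, &outLeft);
    iconv_close(cd);
    if (rc == static_cast<std::size_t>(-1)) {
        return std::string();
    }
    return std::string(out.data(), out.size() - outLeft);
}
```

For example, convertEncoding(text, "ISO-8859-1", "UTF-8") converts a Latin-1 byte string straight to UTF-8 without touching wchar_t.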
It's not a good idea to use iconv for UTF-8. Just implement the definition of UTF-8 yourself. That is quite easily done in C from the description at https://en.wikipedia.org/wiki/UTF-8.
You don't even need wchar_t, just use uint32_t for your characters.
You will learn a lot if you implement it yourself, and your program will gain speed from not using the mb or iconv functions.
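A hand-written encoder along these lines might look like the sketch below. It follows the bit layout in the UTF-8 definition; the function name appendUtf8 is made up for illustration.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: encodes one Unicode code point as UTF-8 and appends
// the bytes to `out`. Returns false (appending nothing) for surrogates and
// for values above U+10FFFF, which UTF-8 does not allow.
bool appendUtf8(std::uint32_t cp, std::string& out) {
    if (cp < 0x80) {                                   // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {                           // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {                         // 3 bytes
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return false;                              // surrogate range
        }
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {                       // 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        return false;                                  // beyond Unicode
    }
    return true;
}
```

If the input is ISO-8859-1, each input byte value is already its Unicode code point, so calling appendUtf8 once per byte is enough; other ISO-8859-x variants need a lookup table first.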

How to tell if LPCWSTR text is numeric?

The entire string needs to be made of digits, which as we know are 0123456789. I am trying the following function, but it doesn't seem to work:
bool isNumeric( const char* pszInput, int nNumberBase )
{
string base = "0123456789";
string input = pszInput;
return (::strspn(input.substr(0, nNumberBase).c_str(), base.c_str()) == input.length());
}
and the example of using it in code...
isdigit = (isNumeric((char*)text, 11));
It returns true even with text in the string
Presumably the issue is that text is actually LPCWSTR which is const wchar_t*. We have to infer this fact from the question title and the cast that you made.
Now, that cast is a problem. The compiler objected to you passing text. It said that text is not const char*. By casting you have not changed what text is, you simply lied to the compiler. And the compiler took its revenge.
What happens next is that you reinterpret the wide char buffer as being a narrow 8 bit buffer. If your wide char buffer has latin text, encoded as UTF-16, then every other byte will be zero. Hence the reinterpret cast that you do results in isNumeric thinking that the string is only 1 character long.
What you need to do is either:
Start using UTF-16 encoded wchar_t buffers in isNumeric.
Convert from UTF-16 to ANSI before calling isNumeric.
You should think about this carefully. It seems that at present you have a rather unholy mix of ANSI and UTF-16 in your program. You really ought to settle on a standard character encoding and use it consistently throughout. That is tenable internal to your program, but you will encounter external text that could use different encodings. Deal with that by converting at the boundary between your program and the outside world.
Personally I don't understand why you are using C strings at all. Surely you should be using std::wstring or std::string.
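For the first option, a wide-character version of the check could look like this sketch. The name isNumericW is an assumption, and the nNumberBase parameter from the question is dropped because its intent is not clear from the original code.

```cpp
#include <cwchar>

// Hypothetical wide-character counterpart of isNumeric: returns true only
// if the string is non-empty and consists entirely of the digits 0-9.
bool isNumericW(const wchar_t* text) {
    if (text == nullptr || *text == L'\0') {
        return false;
    }
    // wcsspn returns the length of the leading run made up of the given
    // characters; the string is numeric if that run covers all of it.
    return std::wcsspn(text, L"0123456789") == std::wcslen(text);
}
```

Since the function takes const wchar_t* directly, an LPCWSTR can be passed without any cast.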

Have a PCWSTR and need it to be a WCHAR[]

I am re-writing a C++ method from some code I downloaded. The method originally took a PCWSTR as a parameter and then prompted the user to enter a file name. I modified the method to take two parameters (both PCWSTR) and not to prompt the user. I am already generating the list of files somewhere else. I am attempting to call my new (modified) method with both parameters from my method that iterates the list of files.
The original method prompted the user for input using the StringCbGetsW function, like this:
HRESULT tst=S_OK; //these are at the top of the method
WCHAR fname[85] = {0}; //these are at the top of the method
tst = StringCbGetsW(fname,sizeof(fname));
The wchar fname gets passed to another iteration method further down. When I look at that method, it says it's a LPCWSTR type; I'm assuming it can take the WCHAR instead.
But what it can't do is take the PCWSTR that the method got handed. My ultimate goal is to avoid prompting the user for the file name and instead take the filename that was iterated earlier in another method.
tl;dr. I have a PCWSTR and it needs to get converted to a WCHAR. I don't know what a WCHAR [] is or how to do anything with it. Including to try to do a printf to see what it is.
PS...I know there are easier ways to move and copy around files, there is a reason I'm attempting to make this work using a program.
First, let's clarify some Windows-specific types.
WCHAR is a typedef for wchar_t.
On Windows with Microsoft Visual C++, it's a 16-bit character type (that can be used for Unicode UTF-16 strings).
PCWSTR and LPCWSTR are two different names for the same thing: they are basically typedefs for const wchar_t*.
The initial L in LPCWSTR is some legacy prefix that, read with the following P, stands for "long pointer". I've never programmed Windows in the 16-bit era (I started with Windows 95 and Win32), but my understanding is that in 16-bit Windows there were something like near pointers and far, or long pointers. Now we have just one type of pointers, so the L prefix can be omitted.
The P stands for "pointer".
The C stands for "constant".
The W stands for WCHAR/wchar_t, and last but not least, the STR part stands for "string".
So, decoding this kind of "Hungarian Notation", PCWSTR means const wchar_t*.
Basically, it's a pointer to a read-only NUL-terminated wchar_t Unicode UTF-16 string.
Is this information enough for you to solve your problem?
If you have a wchar_t string buffer, and a function that expects a PCWSTR, you can just pass the name of the buffer (corresponding to the address of its first character) to the function:
WCHAR buffer[100];
DoSomething(buffer, ...); // DoSomething(PCWSTR ....)
Sometimes - typically for output string parameters - you may also want to specify the size (i.e. "capacity") of the destination string buffer.
If this size is expressed using a count in characters (in this case, in wchar_ts), then the usual Win32 Hungarian Notation is cch ("count of characters"); else, if you want the size expressed in bytes, then the usual prefix is cb ("count of bytes").
So, if you have a function like StringCchCopy(), then from the Cch part you know the size is expressed in characters (wchar_ts).
Note that you can use _countof() to get the size of a buffer in wchar_ts.
e.g. in the above code snippet, _countof(buffer) == 100, since buffer is made of 100 wchar_ts; instead, sizeof(buffer) == 200, since each wchar_t is 2 bytes == 16 bits in size, so the total buffer size in bytes is 100 [wchar_t] * 2 [bytes/wchar_t] = 200 [bytes].
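Going the other way, from a PCWSTR to a WCHAR buffer, is a bounded copy. The sketch below uses only the standard library so that it is not tied to Windows (where StringCchCopy does a similar job); the helper name copyToBuffer is made up, and the array size N plays the role of the cch count.

```cpp
#include <cstddef>
#include <cwchar>

// Hypothetical helper: copies a null-terminated wide string into a
// fixed-size array, truncating if necessary. N is the capacity counted in
// wchar_t elements, not in bytes. Returns true if the whole string fit.
template <std::size_t N>
bool copyToBuffer(wchar_t (&dest)[N], const wchar_t* src) {
    static_assert(N > 0, "destination must hold at least the terminator");
    const std::size_t length = std::wcslen(src);
    const bool fits = length < N;
    const std::size_t count = fits ? length : N - 1;
    std::wmemcpy(dest, src, count);         // copy the characters
    dest[count] = L'\0';                    // always terminate
    return fits;
}
```

With WCHAR fname[85] = {0}; a call such as copyToBuffer(fname, incomingName) fills the buffer from the PCWSTR parameter, and fname can then be passed on wherever an LPCWSTR is expected.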

wchar_t* to char* conversion problems

I have a problem with wchar_t* to char* conversion.
I'm getting a wchar_t* string from the FILE_NOTIFY_INFORMATION structure, returned by the ReadDirectoryChangesW WinAPI function, so I assume that string is correct.
Assume that wchar string is "New Text File.txt"
In the Visual Studio debugger, when hovering over the variable it shows "N" and some unknown Chinese letters, though in the watch window the string is represented correctly.
When I try to convert wchar to char with wcstombs
wcstombs(pfileName, pwfileName, fileInfo.FileNameLength);
it converts just two letters to char* ("Ne") and then generates an error.
Some internal error in wcstombs.c at function _wcstombs_l_helper() at this block:
if (*pwcs > 255) /* validate high byte */
{
errno = EILSEQ;
return (size_t)-1; /* error */
}
It's not thrown up as exception.
What can be the problem?
In order to do what you're trying to do The Right Way, there are several nontrivial things that you need to take into account. I'll do my best to break them down for you here.
Let's start with the definition of the count parameter from the wcstombs() function's documentation on MSDN:
The maximum number of bytes that can be stored in the multibyte output string.
Note that this does NOT say anything about the number of wide characters in the wide character input string. Even though all of the wide characters in your example input string ("New Text File.txt") can be represented as single-byte ASCII characters, we cannot assume that each wide character in the input string will generate exactly one byte in the output string for every possible input string (if this statement confuses you, you should check out Joel's article on Unicode and character sets). So, if you pass wcstombs() the size of the output buffer, how does it know how long the input string is? The documentation states that the input string is expected to be null-terminated, as per the standard C language convention:
If wcstombs encounters the wide-character null character (L'\0') either before or when count occurs, it converts it to an 8-bit 0 and stops.
Though this isn't explicitly stated in the documentation, we can infer that if the input string isn't null-terminated, wcstombs() will keep reading wide characters until it has written count bytes to the output string. So if you're dealing with a wide character string that isn't null-terminated, it isn't enough to just know how long the input string is; you would have to somehow know exactly how many bytes the output string would need to be (which is impossible to determine without doing the conversion) and pass that as the count parameter to make wcstombs() do what you want it to do.
Why am I focusing so much on this null-termination issue? Because the FILE_NOTIFY_INFORMATION structure's documentation on MSDN has this to say about its FileName field:
A variable-length field that contains the file name relative to the directory handle. The file name is in the Unicode character format and is not null-terminated.
The fact that the FileName field isn't null-terminated explains why it has a bunch of "unknown Chinese letters" at the end of it when you look at it in the debugger. The FILE_NOTIFY_INFORMATION structure's documentation also contains another nugget of wisdom regarding the FileNameLength field:
The size of the file name portion of the record, in bytes.
Note that this says bytes, not characters. Therefore, even if you wanted to assume that each wide character in the input string will generate exactly one byte in the output string, you shouldn't be passing fileInfo.FileNameLength for count; you should be passing fileInfo.FileNameLength / sizeof(WCHAR) (or use a null-terminated input string, of course). Putting all of this information together, we can finally understand why your original call to wcstombs() was failing: it was reading past the end of the string and choking on invalid data (thereby triggering the EILSEQ error).
Now that we've elucidated the problem, it's time to talk about a possible solution. In order to do this The Right Way, the first thing you need to know is how big your output buffer needs to be. Luckily, there is one final tidbit in the documentation for wcstombs() that will help us out here:
If the mbstr argument is NULL, wcstombs returns the required size in bytes of the destination string.
So the idiomatic way to use the wcstombs() function is to call it twice: the first time to determine how big your output buffer needs to be, and the second time to actually do the conversion. The final thing to note is that as we stated previously, the wide character input string needs to be null-terminated for at least the first call to wcstombs().
Putting this all together, here is a snippet of code that does what you are trying to do:
size_t fileNameLengthInWChars = fileInfo.FileNameLength / sizeof(WCHAR); //get the length of the filename in characters
WCHAR *pwNullTerminatedFileName = new WCHAR[fileNameLengthInWChars + 1]; //allocate an intermediate buffer to hold a null-terminated version of fileInfo.FileName; +1 for null terminator
wcsncpy(pwNullTerminatedFileName, fileInfo.FileName, fileNameLengthInWChars); //copy the filename into the intermediate buffer
pwNullTerminatedFileName[fileNameLengthInWChars] = L'\0'; //null terminate the new buffer
size_t fileNameLengthInChars = wcstombs(NULL, pwNullTerminatedFileName, 0); //first call to wcstombs() determines how long the output buffer needs to be
char *pFileName = new char[fileNameLengthInChars + 1]; //allocate the final output buffer; +1 to leave room for null terminator
wcstombs(pFileName, pwNullTerminatedFileName, fileNameLengthInChars + 1); //finally do the conversion!
Of course, don't forget to call delete[] pwNullTerminatedFileName and delete[] pFileName when you're done with them to clean up.
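If you would rather not manage the two buffers by hand, the same steps can be sketched with std::wstring and std::string. The helper name narrowCounted is an assumption; it relies on the documented behavior quoted above that wcstombs returns the required size when the destination is NULL, and it converts using the current C locale.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>

// Hypothetical helper: `name` points to wide characters that are NOT
// null-terminated, and `lengthInBytes` is their size in bytes (as with
// FileName and FileNameLength). Returns an empty string on failure.
std::string narrowCounted(const wchar_t* name, std::size_t lengthInBytes) {
    // std::wstring copies exactly this many characters and supplies the
    // terminator that wcstombs needs.
    std::wstring terminated(name, lengthInBytes / sizeof(wchar_t));

    // First call: ask how many bytes the converted string needs.
    std::size_t needed = std::wcstombs(nullptr, terminated.c_str(), 0);
    if (needed == static_cast<std::size_t>(-1)) {
        return std::string();               // unconvertible character
    }

    // Second call: convert into a buffer with room for the terminator.
    std::string result(needed + 1, '\0');
    std::wcstombs(&result[0], terminated.c_str(), result.size());
    result.resize(needed);                  // drop the terminator slot
    return result;
}
```

Both strings release their memory automatically, so no delete[] calls are needed.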
ONE LAST THING
After writing this answer, I reread your question a bit more closely and thought of another mistake you may be making. You say that wcstombs() fails after just converting the first two letters ("Ne"), which means that it's hitting uninitialized data in the input string after the first two wide characters. Did you happen to use the assignment operator to copy one FILE_NOTIFY_INFORMATION variable to another? For example,
FILE_NOTIFY_INFORMATION fileInfo = someOtherFileInfo;
If you did this, it would only copy the first two wide characters of someOtherFileInfo.FileName to fileInfo.FileName. In order to understand why this is the case, consider the declaration of the FILE_NOTIFY_INFORMATION structure:
typedef struct _FILE_NOTIFY_INFORMATION {
DWORD NextEntryOffset;
DWORD Action;
DWORD FileNameLength;
WCHAR FileName[1];
} FILE_NOTIFY_INFORMATION, *PFILE_NOTIFY_INFORMATION;
When the compiler generates code for the assignment operation, it doesn't understand the trickery that is being pulled with FileName being a variable length field, so it just copies sizeof(FILE_NOTIFY_INFORMATION) bytes from someOtherFileInfo to fileInfo. Since FileName is declared as an array of one WCHAR, you would think that only one character would be copied, but the compiler pads the struct to be an extra two bytes long (so that its length is an integer multiple of the size of an int), which is why a second WCHAR is copied as well.
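The padding arithmetic can be checked with a stand-in struct. The sketch below is not the Windows definition: NotifyRecord and the two helper functions are made up for illustration, with fixed-width types standing in for DWORD and WCHAR, and the result of two characters assumes a typical platform where 32-bit integers are 4-byte aligned.

```cpp
#include <cstddef>
#include <cstdint>

// Stand-in with the same shape as FILE_NOTIFY_INFORMATION: three 32-bit
// fields followed by a one-element array of 16-bit characters.
struct NotifyRecord {
    std::uint32_t nextEntryOffset;
    std::uint32_t action;
    std::uint32_t fileNameLength;
    char16_t fileName[1];   // really the first element of a longer name
};

// Number of bytes from the start of fileName to the end of the struct,
// which is all a whole-struct copy can carry over from the name.
constexpr std::size_t copiedNameBytes() {
    return sizeof(NotifyRecord) - offsetof(NotifyRecord, fileName);
}

// The same amount expressed in 16-bit characters.
constexpr std::size_t copiedNameChars() {
    return copiedNameBytes() / sizeof(char16_t);
}
```

On such a platform the struct is 16 bytes with fileName at offset 12, so copiedNameChars() evaluates to 2, matching the "Ne" seen in the question. Copying such a record correctly means copying the fixed part plus FileNameLength bytes of name, not assigning the struct.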
My guess is that the wide string that you are passing is invalid or incorrectly defined.
How is pwFileName defined? It seems you have a FILE_NOTIFY_INFORMATION structure defined as fileInfo, so why are you not using fileInfo.FileName, as shown below?
wcstombs(pfileName, fileInfo.FileName, fileInfo.FileNameLength);
The error you get says it all: it found a character that it cannot convert to a multibyte character (because it has no representation in the multibyte character set). Source:
If wcstombs encounters a wide character it cannot convert to a multibyte character, it returns –1 cast to type size_t and sets errno to EILSEQ
In cases like this you should avoid 'assumed' input, and give an actual test case that fails.

C++ with wxWidgets, Unicode vs. ASCII, what's the difference?

I have a Code::Blocks 10.05 rev 0 and gcc 4.5.2 Linux/unicode 64bit and
WxWidgets version 2.8.12.0-0
I have a simple problem:
#define _TT(x) wxT(x)
string file_procstatus;
file_procstatus.assign("/PATH/TO/FILE");
printf("%s",file_procstatus.c_str());
wxLogVerbose(_TT("%s"),file_procstatus.c_str());
printf outputs "/PATH/TO/FILE" normally, while the wxLogVerbose output turns into crap. When I want to change std::string to wxString I have to do the following:
wxString buf;
buf = wxString::From8BitData(file_procstatus.c_str());
Somebody has an idea what might be wrong, why do I need to change from 8bit data?
This is to do with how the character data is stored in memory. Using "string" you produce a string of type char using the ASCII character set, whereas I would assume that the _TT macro expands to L"string", which creates a string of type wchar_t using a Unicode character set (UTF-32 on Linux, I believe).
The printf function is expecting a char string, whereas wxLogVerbose, I assume, is expecting a wchar_t string. This is where the need for conversion comes from. ASCII uses one byte per character (8-bit data) but wchar_t strings use multiple bytes per character, so the problem is down to the character encoding.
If you don't want to have to call this conversion function then do something like the following:
wstring file_procstatus = wxT("/PATH/TO/FILE");
wxLogVerbose(_TT("%s"),file_procstatus.c_str());
The following article gives the best explanation of the differences between the Unicode and ASCII character sets, how they are stored in memory, and how string functions work with them.
http://allaboutcharactersets.blogspot.in/