UTF-8 to UCS-2 conversion with the ICU library - C++

I'm currently hitting an issue converting a UTF-8 string to a UCS-2 string with the ICU library. There are several ways to do this in the library, but so far none of them seem to be working; given the popularity of this library, I assume I'm doing something wrong.
First off is the common code. In all cases I'm creating and passing a string in an object, but there is no manipulation until it reaches the conversion steps.
The current UTF-8 string being used is simply "ĩ".
For the sake of simplicity I'll represent the string being used as uniString in this code.
UErrorCode resultCode = U_ZERO_ERROR;
UConverter* m_pConv = ucnv_open("ISO-8859-1", &resultCode);
// Change the callback to error out instead of the default
const void* oldContext;
UConverterFromUCallback oldFromAction;
UConverterToUCallback oldToAction;
ucnv_setFromUCallBack(m_pConv, UCNV_FROM_U_CALLBACK_STOP, NULL, &oldFromAction, &oldContext, &resultCode);
ucnv_setToUCallBack(m_pConv, UCNV_TO_U_CALLBACK_STOP, NULL, &oldToAction, &oldContext, &resultCode);
int32_t outputLength = 0;
int bodySize = uniString.length();
int targetSize = bodySize * 4;
char* target = new char[targetSize];
printf("Body: %s\n", uniString.c_str());
if (U_SUCCESS(resultCode))
{
    // outputLength = ucnv_convert("ISO-8859-1", "UTF-8", target, targetSize, uniString.c_str(), bodySize, &resultCode);
    outputLength = ucnv_fromAlgorithmic(m_pConv, UCNV_UTF8, target, targetSize, uniString.c_str(),
                                        uniString.length(), &resultCode);
    ucnv_close(m_pConv);
}
printf("ISO-8859-1 DGF just tried to convert '%s' to '%s' with error '%i' and length '%i'", uniString.c_str(),
       outputLength ? target : "invalid_char", resultCode, outputLength);
if (resultCode == U_INVALID_CHAR_FOUND || resultCode == U_ILLEGAL_CHAR_FOUND || resultCode == U_TRUNCATED_CHAR_FOUND)
{
    if (resultCode == U_INVALID_CHAR_FOUND)
    {
        printf("Unmapped input character, cannot be converted to Latin1");
        m_pConv = ucnv_open("UCS-2", &resultCode);
        if (U_SUCCESS(resultCode))
        {
            // outputLength = ucnv_convert("UCS-2", "UTF-8", target, targetSize, uniString.c_str(), bodySize, &resultCode);
            outputLength = ucnv_fromAlgorithmic(m_pConv, UCNV_UTF8, target, targetSize, uniString.c_str(),
                                                uniString.length(), &resultCode);
            ucnv_close(m_pConv);
        }
        printf("UCS-2 DGF just tried to convert '%s' to '%s' with error '%i' and length '%i'", uniString.c_str(),
               outputLength ? target : "invalid_char", resultCode, outputLength);
        if (U_SUCCESS(resultCode))
        {
            pdus = SegmentText(target, pText, SEGMENT_SIZE_UNICODE_MAX, true);
        }
    }
    else
    {
        printf("DecodeText(): Text contents does not appear to be valid UTF-8");
    }
}
else
{
    printf("DecodeText(): Text successfully converted to Latin1");
    std::string newBody(target, outputLength);
    pdus = SegmentText(newBody, pPdu, SEGMENT_SIZE_MAX);
}
The problem is that ucnv_fromAlgorithmic is returning the error U_INVALID_CHAR_FOUND for the UCS-2 conversion. That makes sense for the ISO-8859-1 attempt, but not for UCS-2.
The other attempt was to use ucnv_convert, which you can see commented out. That function attempted the conversion, but it didn't fail on the ISO-8859-1 attempt as it should have.
So the question is: does anyone with experience with these functions see something incorrect, or is there something wrong with my assumption about how this character should convert?

You need to reset resultCode to U_ZERO_ERROR before calling ucnv_open. Quote from the manual:
"ICU functions that take a reference (C++) or a pointer (C) to a UErrorCode first test if(U_FAILURE(errorCode)) { return immediately; } so that in a chain of such functions the first one that sets an error code causes the following ones to not perform any operation"

Related

How to convert OID of a code-signing algorithm from CRYPT_ALGORITHM_IDENTIFIER to a human readable string?

When I'm retrieving a code-signing signature from an executable file on Windows, the CERT_CONTEXT of the certificate points to a CERT_INFO structure, whose CRYPT_ALGORITHM_IDENTIFIER SignatureAlgorithm member contains the algorithm used for signing.
How do I convert that to a human-readable form?
For instance, SignatureAlgorithm.pszObjId may be set to the string "1.2.840.113549.1.1.11", which is szOID_RSA_SHA256RSA according to this long list. I guess I could write a very long switch statement mapping it to "sha256", but I'd rather avoid that since I don't know what most of those values are. Is there an API that can do all that for me?
Use CryptFindOIDInfo to get information about an OID, including the display name and the CNG algorithm identifier string:
void PrintSigAlgoName(CRYPT_ALGORITHM_IDENTIFIER* pSigAlgo)
{
    if(pSigAlgo && pSigAlgo->pszObjId)
    {
        PCCRYPT_OID_INFO pCOI = CryptFindOIDInfo(CRYPT_OID_INFO_OID_KEY, pSigAlgo->pszObjId, 0);
        if(pCOI && pCOI->pwszName)
        {
            _tprintf(_T("%ls"), pCOI->pwszName);
        }
        else
        {
            _tprintf(_T("%hs"), pSigAlgo->pszObjId);
        }
    }
}
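For context, with the CERT_CONTEXT from the question (called pCertContext here, an assumed variable name), the call would be along these lines:
// pCertContext is the PCCERT_CONTEXT already retrieved from the signed file
PrintSigAlgoName(&pCertContext->pCertInfo->SignatureAlgorithm);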
Expanding on Anders' answer: you can also get this information from the result of a call to WinVerifyTrust(). It is deeply nested inside CRYPT_PROVIDER_DATA:
GUID policyGUID = WINTRUST_ACTION_GENERIC_VERIFY_V2;
WINTRUST_DATA trustData;
// omitted: prepare trustData
DWORD lStatus = ::WinVerifyTrust( NULL, &policyGUID, &trustData );
if( lStatus == ERROR_SUCCESS )
{
    CRYPT_PROVIDER_DATA* pData = ::WTHelperProvDataFromStateData( trustData.hWVTStateData );
    if( pData && pData->pPDSip && pData->pPDSip->psIndirectData &&
        pData->pPDSip->psIndirectData->DigestAlgorithm.pszObjId )
    {
        CRYPT_ALGORITHM_IDENTIFIER const& sigAlgo = pData->pPDSip->psIndirectData->DigestAlgorithm;
        PCCRYPT_OID_INFO pCOI = ::CryptFindOIDInfo( CRYPT_OID_INFO_OID_KEY, sigAlgo.pszObjId, 0 );
        if(pCOI && pCOI->pwszName)
        {
            _tprintf(_T("%ls"), pCOI->pwszName);
        }
        else
        {
            _tprintf(_T("%hs"), sigAlgo.pszObjId);
        }
    }
}
Note: Detailed error checking omitted for brevity!
Note2: From Win 8 onwards (and patched Win 7), WinVerifyTrust can be used to verify and get information about multiple signatures of a file, more info in this Q&A.
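For completeness, here is a rough sketch of the trustData preparation that is omitted above; the file path is an assumption, and dwStateAction must be WTD_STATEACTION_VERIFY so that hWVTStateData gets populated:
WINTRUST_FILE_INFO fileInfo = {};
fileInfo.cbStruct = sizeof(fileInfo);
fileInfo.pcwszFilePath = L"C:\\example\\signed.exe";   // assumed path

WINTRUST_DATA trustData = {};
trustData.cbStruct = sizeof(trustData);
trustData.dwUIChoice = WTD_UI_NONE;                    // no UI
trustData.fdwRevocationChecks = WTD_REVOKE_NONE;
trustData.dwUnionChoice = WTD_CHOICE_FILE;
trustData.pFile = &fileInfo;
trustData.dwStateAction = WTD_STATEACTION_VERIFY;      // keep state data for WTHelperProvDataFromStateData

// ... call WinVerifyTrust and inspect the provider data as shown above ...

// When done, release the state data:
trustData.dwStateAction = WTD_STATEACTION_CLOSE;
::WinVerifyTrust( NULL, &policyGUID, &trustData );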

How to check whether a string is JSON or not using Jansson?

I am using Jansson.
bool ConvertJsontoString(string inputText, string& OutText)
{
    /* Before doing anything I want to check
       if the inputText is a valid json string or not */
}
Why don't you read the documentation where it clearly states:
json_t *json_loads(const char *input, size_t flags, json_error_t *error)
Return value: New reference.
Decodes the JSON string input and returns the array or object it contains,
or NULL on error, in which case error is filled with information about the error.
flags is described above.
They even provide an example of how to use this:
root = json_loads(text, 0, &error);
free(text);
if(!root)
{
    fprintf(stderr, "error: on line %d: %s\n", error.line, error.text);
    return 1;
}
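Applied to the question, a minimal validity check could look like the sketch below (IsValidJson is an illustrative name; depending on your Jansson version you may also want the JSON_DECODE_ANY flag to accept top-level scalars):
#include <string>
#include <jansson.h>

// Returns true if inputText parses as JSON.
bool IsValidJson(const std::string& inputText)
{
    json_error_t error;
    json_t* root = json_loads(inputText.c_str(), 0, &error);
    if (!root)
        return false;      // error.line / error.text describe what went wrong
    json_decref(root);     // drop the reference returned by json_loads
    return true;
}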

Wrong encoding when getting a string from MySQL database with C++

I'm writing an MFC app in C++ with Visual Studio 2012. The app connects to a MySQL database and shows every row in a list box.
The words are in Russian, and the database encoding is cp1251. I've set the same character set using this code:
if (!mysql_set_character_set(mysql, "cp1251")) {
    statusBox.SetWindowText((CString)"CP1251 is set for MYSQL.");
}
But it doesn't help at all.
I display data using this code:
while ((row = mysql_fetch_row(result)) != NULL) {
    CString string = (CString)row[1];
    listBox.AddString(string);
}
This code also doesn't help:
mysql_query(mysql, "set names cp1251");
Please help. What should I do to display Cyrillic correctly?
When crossing system boundaries that use different character encodings you have to convert between them. In this case, the MySQL database uses CP 1251 while Windows (and CString) use UTF-16. The conversion might look like this:
#if !defined(_UNICODE)
#error Unicode configuration required
#endif

CString CPtoUnicode( const char* CPString, UINT CodePage ) {
    CString retValue;
    // Retrieve required string length
    int len = MultiByteToWideChar( CodePage, 0,
                                   CPString, -1,
                                   NULL, 0 );
    if ( len == 0 ) {
        // Error -> return empty string
        return retValue;
    }
    // Allocate CString's internal buffer
    LPWSTR buffer = retValue.GetBuffer( len );
    // Do the conversion
    MultiByteToWideChar( CodePage, 0,
                         CPString, -1,
                         buffer, len );
    // Return control of the buffer back to the CString object
    retValue.ReleaseBuffer();
    return retValue;
}
This should be used as follows:
while ( ( row = mysql_fetch_row( result ) ) != NULL ) {
    CString string = CPtoUnicode( row[1], 1251 );
    listBox.AddString( string );
}
Alternatively, you could use CString's built-in conversion support, which requires setting the thread's locale to the source encoding (CP 1251) and using the conversion constructor.
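A related option, shown here only as a sketch (it uses ATL's CA2W helper rather than the thread-locale approach), converts with an explicit code page and avoids touching the thread locale:
#include <atlconv.h>

while ( ( row = mysql_fetch_row( result ) ) != NULL ) {
    // Convert CP 1251 directly to UTF-16 via ATL's CA2W helper
    CString string( CA2W( row[1], 1251 ) );
    listBox.AddString( string );
}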

How to wrap UTF-8 encoded C++ std::strings with Swig in C#?

My question is nearly identical to this question, except that the linked question deals with char*, whereas I'm using std::string in my code. Like the linked question, I'm also using C# as my target language.
I have a class written in C++:
class MyClass
{
public:
const std::string get_value() const; // returns utf8-string
void set_value(const std::string &value); // sets utf8-string
private:
// ...
};
And this gets wrapped by SWIG in C# as follows:
public class MyClass
{
    public string get_value();
    public void set_value(string value);
}
SWIG does everything for me, except that it doesn't perform a UTF-8 to UTF-16 string conversion during calls to MyClass. My strings come through fine if they are representable in ASCII, but if I pass a string with non-ASCII characters on a round trip through set_value and get_value, I end up with unintelligible characters.
How can I make SWIG wrap UTF-8 encoded C++ strings in C#? n.b. I'm using std::string, not std::wstring, and not char*.
There's a partial solution on the SWIG sourceforge site, but it deals with char* not std::string, and it uses a (configurable) fixed length buffer.
With the help (read: genius!) of David Jeske in the linked Code Project article, I have finally been able to answer this question.
You'll need this class (from David Jeske's code) in your C# library.
using System;
using System.Runtime.InteropServices;
using System.Text;

public class UTF8Marshaler : ICustomMarshaler {
    static UTF8Marshaler static_instance;

    public IntPtr MarshalManagedToNative(object managedObj) {
        if (managedObj == null)
            return IntPtr.Zero;
        if (!(managedObj is string))
            throw new MarshalDirectiveException(
                "UTF8Marshaler must be used on a string.");
        // not null terminated
        byte[] strbuf = Encoding.UTF8.GetBytes((string)managedObj);
        IntPtr buffer = Marshal.AllocHGlobal(strbuf.Length + 1);
        Marshal.Copy(strbuf, 0, buffer, strbuf.Length);
        // write the terminating null
        Marshal.WriteByte(buffer + strbuf.Length, 0);
        return buffer;
    }

    public unsafe object MarshalNativeToManaged(IntPtr pNativeData) {
        byte* walk = (byte*)pNativeData;
        // find the end of the string
        while (*walk != 0) {
            walk++;
        }
        int length = (int)(walk - (byte*)pNativeData);
        // copy only the string bytes, excluding the trailing null
        byte[] strbuf = new byte[length];
        Marshal.Copy((IntPtr)pNativeData, strbuf, 0, length);
        string data = Encoding.UTF8.GetString(strbuf);
        return data;
    }

    public void CleanUpNativeData(IntPtr pNativeData) {
        Marshal.FreeHGlobal(pNativeData);
    }

    public void CleanUpManagedData(object managedObj) {
    }

    public int GetNativeDataSize() {
        return -1;
    }

    public static ICustomMarshaler GetInstance(string cookie) {
        if (static_instance == null) {
            return static_instance = new UTF8Marshaler();
        }
        return static_instance;
    }
}
Then, in SWIG's "std_string.i", on line 24 replace this line:
%typemap(imtype) string "string"
with this line:
%typemap(imtype, inattributes="[MarshalAs(UnmanagedType.CustomMarshaler, MarshalTypeRef = typeof(UTF8Marshaler))]", outattributes="[return: MarshalAs(UnmanagedType.CustomMarshaler, MarshalTypeRef = typeof(UTF8Marshaler))]") string "string"
and on line 61, replace this line:
%typemap(imtype) const string & "string"
with this line:
%typemap(imtype, inattributes="[MarshalAs(UnmanagedType.CustomMarshaler, MarshalTypeRef = typeof(UTF8Marshaler))]", outattributes="[return: MarshalAs(UnmanagedType.CustomMarshaler, MarshalTypeRef = typeof(UTF8Marshaler))]") string & "string"
Lo and behold, everything works. Read the linked article for a good understanding of how this works.

C++/CLI UTF-8 & JNI Not Converting Unicode String Properly

I have a Java class that returns a Unicode string. Java has the correct version of the string, but when it comes through a JNI wrapper as a jstring it must be converted to a C++ or C++/CLI string. Here is some test code I have, which actually works for most languages except the Asian character sets: Simplified Chinese and Japanese characters come out garbled and I can't figure out why. Here is the code snippet; I don't see anything wrong with either conversion method (the if statement checks the OS, since I have two VMs with different OSes, and runs the appropriate conversion method).
String^ JStringToCliString(const jstring string){
    String^ converted = gcnew String("");
    JNIEnv* envLoc = GetJniEnvHandle();
    std::wstring value;
    jboolean isCopy;
    if(string){
        try{
            jsize len = envLoc->GetStringLength(string);
            if(Environment::OSVersion->Version->Major >= 6) // 6 is post XP/2003
            {
                TraceLog::Log("Using GetStringChars() for string conversion");
                const jchar* raw = envLoc->GetStringChars(string, &isCopy);
                // todo add exception handling here for jvm
                if (raw != NULL) {
                    value.assign(raw, raw + len);
                    converted = gcnew String(value.c_str());
                    envLoc->ReleaseStringChars(string, raw);
                }
            }else{
                TraceLog::Log("Using GetStringUTFChars() for string conversion.");
                const char* raw = envLoc->GetStringUTFChars(string, &isCopy);
                if(raw) {
                    int bufSize = MultiByteToWideChar(CP_UTF8, 0, raw, -1, NULL, 0);
                    wchar_t* wstr = new wchar_t[bufSize];
                    MultiByteToWideChar(CP_UTF8, 0, raw, -1, wstr, bufSize);
                    String^ val = gcnew String(wstr);
                    delete[] wstr;
                    converted = val; // partially working
                    envLoc->ReleaseStringUTFChars(string, raw);
                }
            }
        }catch(Exception^ ex){
            TraceLog::Log(ex->Message);
        }
    }
    return converted;
}
The answer was to enable East Asian languages in Windows XP; Win 7 and later work fine. Super easy... a waste of an entire day, lol.