PathFileExistsA fails for UTF-8? - C++

I'm working with VS2017 and need to support UTF-8 paths due to an SDK that only supports UTF-8. Within my code, I'd like to test whether a UTF-8 path is valid, so I am using
PathFileExistsA( path );
but it fails for a path I know is valid. (It succeeds if "path" contains only ASCII characters -- no characters requiring multi-byte UTF-8 sequences.)
I realize the "A" in PathFileExistsA stands for ANSI (not ASCII), but does that exclude UTF-8? Its counterpart is PathFileExistsW, but I can't use wide chars.
All I'm after is a test to determine whether a UTF-8 path is valid, so I can use another function if one is more suitable.

Windows natively uses UTF-16 (wide chars) for its APIs. If you can't use wide chars for your input, then you can accept it as UTF-8, convert it to wide chars using MultiByteToWideChar(), and then call the wide-char version of the API function:
char* lpUtf8 = ...;
// Look up the size of the wide string; passing -1 as the source length
// means the terminating null is included in the count
int wideSize = ::MultiByteToWideChar( CP_UTF8, 0, lpUtf8, -1, 0, 0 );
// Allocate the buffer
wchar_t* lpWideString = new wchar_t[wideSize];
// Do the conversion
::MultiByteToWideChar( CP_UTF8, 0, lpUtf8, -1, lpWideString, wideSize );
// Call the wide function
::PathFileExistsW( lpWideString );
// Deallocate the buffer
delete[] lpWideString;
It's much cleaner if you use the STL string functions. This article is very good:
C++ - Unicode Encoding Conversions with STL Strings and Win32 APIs
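For illustration, a minimal sketch of that cleaner STL-based style (the helper names are mine, written along the lines of the linked article, not code taken from it):
#include <windows.h>
#include <shlwapi.h> // PathFileExistsW; link against Shlwapi.lib
#include <string>

// Widen a UTF-8 string, letting std::wstring manage the buffer.
std::wstring utf8_to_utf16(const std::string& utf8)
{
    int size = ::MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(),
                                     (int)utf8.length(), nullptr, 0);
    std::wstring utf16(size, L'\0');
    ::MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(),
                          (int)utf8.length(), &utf16[0], size);
    return utf16;
}

// Test whether a UTF-8 path exists by calling the wide API.
bool Utf8PathExists(const std::string& utf8Path)
{
    return ::PathFileExistsW(utf8_to_utf16(utf8Path).c_str()) != FALSE;
}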

Related

Convert ICU Unicode string to std::wstring (or wchar_t*)

Is there an ICU function to create a std::wstring from an icu::UnicodeString? I have been searching the ICU manual but haven't been able to find one.
(I know I can convert UnicodeString to UTF-8 and then convert to the platform-dependent wchar_t*, but I am looking for a single function on UnicodeString that can do this conversion.)
The C++ standard doesn't dictate any specific encoding for std::wstring. On Windows systems, wchar_t is 16-bit, and on Linux, macOS, and several other platforms, wchar_t is 32-bit. As far as C++'s std::wstring is concerned, it is just an arbitrary sequence of wchar_t in much the same way that std::string is just an arbitrary sequence of char.
It seems that icu::UnicodeString has no in-built way of creating a std::wstring, but if you really want to create a std::wstring anyway, you can use the C-based API u_strToWCS() like this:
icu::UnicodeString ustr = /* get from somewhere */;
std::wstring wstr;
int32_t requiredSize;
UErrorCode error = U_ZERO_ERROR;
// preflight: obtain the size of string we need
u_strToWCS(nullptr, 0, &requiredSize, ustr.getBuffer(), ustr.length(), &error);
// preflighting reports U_BUFFER_OVERFLOW_ERROR, so reset before the real call
error = U_ZERO_ERROR;
// resize accordingly (this will not include any terminating null character, but it doesn't need to either)
wstr.resize(requiredSize);
// copy the UnicodeString buffer to the std::wstring (&wstr[0] rather than
// wstr.data(), since non-const data() requires C++17)
u_strToWCS(&wstr[0], static_cast<int32_t>(wstr.size()), nullptr, ustr.getBuffer(), ustr.length(), &error);
Supposedly, u_strToWCS() will use the most efficient method for converting from UChar to wchar_t (if they are the same size, it is presumably just a straightforward copy).
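If the conversion is needed in more than one place, the snippet packs naturally into a helper (the function name is mine, not part of ICU; it returns an empty string on failure):
#include <unicode/unistr.h>
#include <unicode/ustring.h>
#include <string>

std::wstring to_wstring(const icu::UnicodeString& ustr)
{
    int32_t requiredSize = 0;
    UErrorCode error = U_ZERO_ERROR;
    // Preflight to learn the required size
    u_strToWCS(nullptr, 0, &requiredSize, ustr.getBuffer(), ustr.length(), &error);
    error = U_ZERO_ERROR; // preflighting sets U_BUFFER_OVERFLOW_ERROR
    std::wstring wstr(requiredSize, L'\0');
    u_strToWCS(&wstr[0], requiredSize, nullptr, ustr.getBuffer(), ustr.length(), &error);
    return U_FAILURE(error) ? std::wstring() : wstr;
}

// Usage: std::wstring w = to_wstring(icu::UnicodeString::fromUTF8("grüß dich"));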

Given a buffer (void*) and an encoding name, how can I find a string or character's index?

I have a void* to a buffer (and its length, in bytes). I also know the text encoding of the buffer, it might be plain ASCII, UTF-8, Shift-JIS, UTF-16 (both LE or BE), or something else. And I need to find out if "<<<" appears in the buffer.
I'm using C++11 (VS2013) and coming from a .NET background the problem wouldn't be that hard: create an Encoding object instance for the encoding, then pass it the data (as Byte[]) and convert it to a String instance (internally using UTF-16LE), and use the string functions IndexOf or a Regex.
However, the C++ equivalent doesn't necessarily work, as Microsoft's C++ runtime library, specifically its implementation of locale, does not support the UTF-8 multi-byte encoding (a complete mystery to me). I'm also not keen on the prospect of performing so many buffer copies (like how in .NET you would Marshal.Copy the void* into a Byte[], and then again into a String via the Encoding instance).
Here's the pseudocode high-level logic I'm after:
const char* needle = "<<<";

size_t IndexOf(void* buffer, size_t bufferCB, std::string encodingName, std::string needle) {
    char* needleEncoded = Convert( needle, encodingName );
    setlocale( encodingName );
    size_t index = strstr( buffer, bufferCB, needleEncoded );
    return index;
}
My versions of setlocale and strstr aren't in the standard, but exemplify the kind of behaviour I need.
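One way to realize this shape with Win32 (a sketch, assuming the encoding name has already been mapped to a Windows code page number such as 65001 for UTF-8 or 932 for Shift-JIS; a UTF-16LE buffer could be searched directly, and UTF-16BE would first need a byte swap rather than a MultiByteToWideChar call):
#include <windows.h>
#include <cstring>
#include <string>

// Returns an index in UTF-16 code units of the widened buffer (not a byte
// offset into the original), or std::wstring::npos if the needle is absent.
size_t IndexOf(const void* buffer, size_t bufferCB, UINT codePage, const char* needle)
{
    const char* bytes = static_cast<const char*>(buffer);

    // Widen the haystack so one search routine covers every code page.
    int wideSize = ::MultiByteToWideChar(codePage, 0, bytes, (int)bufferCB, nullptr, 0);
    std::wstring haystack(wideSize, L'\0');
    ::MultiByteToWideChar(codePage, 0, bytes, (int)bufferCB, &haystack[0], wideSize);

    // "<<<" is plain ASCII, so widening it character by character is safe.
    std::wstring wideNeedle(needle, needle + std::strlen(needle));
    return haystack.find(wideNeedle);
}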

Converting "normal" std::string to utf-8

Let's see if I can explain this without too many factual errors...
I'm writing a string class and I want it to use UTF-8 (stored in a std::string) as its internal storage.
I want it to be able to take both "normal" std::string and std::wstring as input and output.
Working with std::wstring is not a problem, I can use std::codecvt_utf8<wchar_t> to convert both from and to std::wstring.
However, after extensive googling and searching on SO, I have yet to find a way to convert between a "normal/default" C++ std::string (which I assume on Windows uses the system's locale/code page?) and a UTF-8 std::string.
I guess one option would be to first convert the std::string to a std::wstring using std::codecvt<wchar_t, char, std::mbstate_t> and then convert that to UTF-8 as above, but this seems quite inefficient given that at least the first 128 values of a char should translate straight over to UTF-8 without conversion, regardless of locale, if I understand correctly.
I found this similar question: C++: how to convert ASCII or ANSI to UTF8 and stores in std::string
Although I'm a bit skeptical towards that answer, as it's hard-coded to Latin-1 and I want this to work with all types of locales to be on the safe side.
No answers involving Boost, thanks. I don't want the headache of getting my codebase to work with it.
If your "normal string" is encoded using the system's code page and you want to convert it to UTF-8 then this should work:
std::string codepage_str;
// First hop: system code page -> UTF-16 (size the buffer, then convert)
int size = MultiByteToWideChar(CP_ACP, MB_COMPOSITE, codepage_str.c_str(),
                               (int)codepage_str.length(), nullptr, 0);
std::wstring utf16_str(size, L'\0');
MultiByteToWideChar(CP_ACP, MB_COMPOSITE, codepage_str.c_str(),
                    (int)codepage_str.length(), &utf16_str[0], size);
// Second hop: UTF-16 -> UTF-8
int utf8_size = WideCharToMultiByte(CP_UTF8, 0, utf16_str.c_str(),
                                    (int)utf16_str.length(), nullptr, 0,
                                    nullptr, nullptr);
std::string utf8_str(utf8_size, '\0');
WideCharToMultiByte(CP_UTF8, 0, utf16_str.c_str(),
                    (int)utf16_str.length(), &utf8_str[0], utf8_size,
                    nullptr, nullptr);

Converting string to wchar_t (wide character) C++ [duplicate]

Is there any method?
My computer is AMD64.
::std::string str;
BOOL loadU(const wchar_t* lpszPathName, int flag = 0);
When I used:
loadU(&str);
the VS2005 compiler says:
Error 7 error C2664: cannot convert parameter 1 from 'std::string *__w64' to 'const wchar_t *'
How can I do it?
First convert it to std::wstring:
std::wstring widestr = std::wstring(str.begin(), str.end());
Then get the C string:
const wchar_t* widecstr = widestr.c_str();
This only works for ASCII strings; it will fail if the underlying string is UTF-8 encoded. Using a conversion routine like MultiByteToWideChar() ensures that this scenario is handled properly.
If you have a std::wstring object, you can call c_str() on it to get a wchar_t*:
std::wstring name( L"Steve Nash" );
const wchar_t* szName = name.c_str();
Since you are operating on a narrow string, however, you would first need to widen it. There are various options here; one is to use Windows' built-in MultiByteToWideChar routine. That will give you an LPWSTR, which is equivalent to wchar_t*.
You can use the ATL text conversion macros to convert a narrow (char) string to a wide (wchar_t) one. For example, to convert a std::string:
#include <atlconv.h>
...
std::string str = "Hello, world!";
CA2W pszWide(str.c_str());
loadU(pszWide);
You can also specify a code page, so if your std::string contains UTF-8 chars you can use:
CA2W pszWide(str.c_str(), CP_UTF8);
Very useful but Windows only.
If you are on Linux/Unix, have a look at mbstowcs() and wcstombs(), defined in GNU libc (and in ISO C90).
mbs stands for "multibyte string" and is basically an ordinary zero-terminated C string.
wcs stands for "wide-character string" and is an array of wchar_t.
For more background detail on wide chars, have a look at the glibc documentation here.
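A minimal sketch of the mbstowcs() route (the sample text is mine): set the locale from the environment first, otherwise mbstowcs() runs in the default "C" locale and will reject multibyte sequences:
#include <clocale>
#include <cstdlib>
#include <cwchar>

int main()
{
    // Pick up the environment's locale (e.g. en_US.UTF-8) so the
    // multibyte encoding is interpreted correctly.
    std::setlocale(LC_ALL, "");

    const char* mbs = "gr\xc3\xbc\xc3\x9f dich"; // "grüß dich" in UTF-8
    wchar_t wcs[64];
    size_t n = std::mbstowcs(wcs, mbs, 64); // (size_t)-1 on an invalid sequence
    if (n != (size_t)-1)
        std::wprintf(L"%ls (%zu wide chars)\n", wcs, n);
    return 0;
}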
I need to pass a wchar_t string to a function, and first I need to be able to create that string from a literal string concatenated with an integer variable.
The original string looks like this, where 4 is the physical drive number, but I want that number to be changeable to match whatever drive I want to pass to the function:
auto TargetDrive = L"\\\\.\\PhysicalDrive4";
The following works
int a = 4;
std::string stddrivestring = "\\\\.\\PhysicalDrive" + std::to_string(a);
std::wstring widedrivestring = std::wstring(stddrivestring.begin(), stddrivestring.end());
const wchar_t* TargetDrive = widedrivestring.c_str();
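Since the text here is pure ASCII, the begin()/end() widening is safe; with C++11 the narrow round trip can also be skipped entirely (a small variation on the above, not the original code):
#include <string>

int a = 4;
// Build the wide string directly; std::to_wstring avoids the narrow detour.
std::wstring widedrivestring = L"\\\\.\\PhysicalDrive" + std::to_wstring(a);
const wchar_t* TargetDrive = widedrivestring.c_str(); // valid while widedrivestring lives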

CreateFileMapping() name

I'm creating a DLL that shares memory between different applications.
The code that creates the shared memory looks like this:
#define NAME_SIZE 4
HANDLE hSharedFile;
void create(char name[NAME_SIZE])
{
    hSharedFile = CreateFileMapping(INVALID_HANDLE_VALUE, NULL, PAGE_READWRITE, 0, 1024, (LPCSTR)name);
    (...) // Other stuff that maps the view of the file etc.
}
It does not work. However if I replace name with a string it works:
hSharedFile = CreateFileMapping(INVALID_HANDLE_VALUE, NULL, PAGE_READWRITE, 0, 1024, (LPCSTR)"MY_TEST_NAME");
How can I get this to work with the char array?
I have a Java background, where you would just use String all the time. What is an LPCSTR? And does this relate to whether my MS VC++ project is using the Unicode or Multi-Byte character set?
I suppose you should increase the NAME_SIZE value.
Do not forget that the array must hold at least the number of characters + 1, to leave room for the '\0' character at the end, which marks the end of the string.
LPCSTR is a pointer to a constant null-terminated string of 8-bit Windows (ANSI) characters, defined as:
typedef __nullterminated CONST CHAR *LPCSTR;
For example, even though the constant "Hello world" has 11 characters, it takes 12 bytes in memory.
If you are passing a string constant as an array, you must add '\0' at the end, like {'T','E','S','T','\0'}.
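For example (a sketch: NAME_SIZE must be 5 here, not 4, to leave room for the terminator):
#include <windows.h>

#define NAME_SIZE 5 // 4 characters + 1 byte for the terminating '\0'

char name[NAME_SIZE] = { 'T', 'E', 'S', 'T', '\0' };
// Using the explicit ANSI version so a char* is correct regardless of the
// project's character-set setting; no (LPCSTR) cast is needed.
HANDLE hSharedFile = CreateFileMappingA(INVALID_HANDLE_VALUE, NULL,
                                        PAGE_READWRITE, 0, 1024, name);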
If you look at the documentation, you'll find that most Win32 functions take an LPCTSTR, which represents a string of TCHAR. Depending on whether you use Unicode (the default) or ANSI, TCHAR will expand to either wchar_t or char. Also, LPCWSTR and LPCSTR explicitly represent Unicode and ANSI strings respectively.
When you're developing for Win32, in most cases, it's best to follow suit and use LPCTSTR wherever you need strings, instead of explicit char arrays/pointers. Also, use the TEXT("...") macro to create the correct kind of string literals instead of just "...".
In your case, though, I doubt this is causing the problem, since both your examples use only LPCSTR. You have also defined NAME_SIZE as 4; could it be that your array is too small to hold the string you want?
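A minimal sketch of that TCHAR-based style (it compiles under both the Unicode and Multi-Byte project settings):
#include <windows.h>

// TEXT("...") expands to L"..." under Unicode and to "..." under ANSI,
// matching whatever LPCTSTR resolves to.
LPCTSTR mappingName = TEXT("MY_TEST_NAME");
HANDLE hSharedFile = CreateFileMapping(INVALID_HANDLE_VALUE, NULL,
                                       PAGE_READWRITE, 0, 1024, mappingName);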