FILE_NOTIFY_INFORMATION doesn't support Utf-8 file name - c++

I am trying to watch a folder for changes and report the name of each added file. Here is my code:
bool FileWatcher::NotifyChange()
{
// Read the asynchronous result of the previous call to ReadDirectory
DWORD dwNumberbytes;
GetOverlappedResult(hDir, &overl, &dwNumberbytes, FALSE);
// Browse the list of FILE_NOTIFY_INFORMATION entries
FILE_NOTIFY_INFORMATION *pFileNotify = (FILE_NOTIFY_INFORMATION *)buffer[curBuffer];
// Switch the 2 buffers
curBuffer = (curBuffer + 1) % (sizeof(buffer)/(sizeof(buffer[0])));
SecureZeroMemory(buffer[curBuffer], sizeof(buffer[curBuffer]));
// start a new asynchronous call to ReadDirectory in the alternate buffer
ReadDirectoryChangesW(
hDir, /* handle to directory */
&buffer[curBuffer], /* read results buffer */
sizeof(buffer[curBuffer]), /* length of buffer */
FALSE, /* monitoring option */
FILE_NOTIFY_CHANGE_FILE_NAME ,
//FILE_NOTIFY_CHANGE_LAST_WRITE, /* filter conditions */
NULL, /* bytes returned */
&overl, /* overlapped buffer */
NULL); /* completion routine */
for (;;) {
if (pFileNotify->Action == FILE_ACTION_ADDED)
{
qDebug()<<"in NotifyChange if ";
char szAction[42];
char szFilename[MAX_PATH] ;
memset(szFilename,'\0',sizeof( szFilename));
strcpy(szAction,"added");
wcstombs( szFilename, pFileNotify->FileName, MAX_PATH);
qDebug()<<"pFileNotify->FileName : "<<QString::fromWCharArray(pFileNotify->FileName)<<"\nszFilename : "<<QString(szFilename);
}
// step to the next entry if there is one
if (!pFileNotify->NextEntryOffset)
return false;
pFileNotify = (FILE_NOTIFY_INFORMATION *)((PBYTE)pFileNotify + pFileNotify->NextEntryOffset);
}
pFileNotify=NULL;
return true;
}
It works fine unless a file with an Arabic name is added, in which case I get:
pFileNotify->FileName : "??? ???????.txt"
szFilename : ""
How can I support UTF-8 (Unicode) file names?
Any ideas, please?

Apart from FILE_NOTIFY_INFORMATION::FileName not being null-terminated, there's nothing wrong with it.
FileName:
A variable-length field that contains the file name relative to the directory handle. The file name is in the Unicode character format and is not null-terminated.
If there is both a short and long name for the file, the function will return one of these names, but it is unspecified which one.
FileNameLength: The size of the file name portion of the record, in bytes. Note that this value does not include the terminating null character.
You'll have to use FILE_NOTIFY_INFORMATION::FileNameLength / sizeof(WCHAR) to get the length of the string in wchars pointed to by FileName. So in your case, the proper way would be:
size_t cchFileNameLength = pFileNotify->FileNameLength / sizeof(WCHAR);
QString::fromWCharArray( pFileNotify->FileName, cchFileNameLength );
If you need to use a function that expects the string to be null-terminated (like wcstombs) you'd have to allocate a temporary buffer with the size of FILE_NOTIFY_INFORMATION::FileNameLength + sizeof(WCHAR) and null-terminate it yourself.
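As a minimal sketch of the length calculation above, the helper below copies a counted, non-terminated UTF-16 name into an owning string. `nameFromRecord` and its parameters are hypothetical names, and `char16_t` stands in for `WCHAR` so the snippet also compiles outside Windows:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: fileName points at UTF-16 code units that are NOT
// null-terminated; lengthInBytes plays the role of FileNameLength.
std::u16string nameFromRecord(const char16_t* fileName, std::uint32_t lengthInBytes)
{
    // The length is given in bytes, so divide to get the number of code units.
    const std::size_t cch = lengthInBytes / sizeof(char16_t);
    // The (pointer, count) constructor copies exactly cch units and never
    // searches for a terminator.
    return std::u16string(fileName, cch);
}
```

The returned string manages its own terminator, so its `c_str()` can be passed to functions that expect a null-terminated string without allocating a temporary buffer by hand.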
As for the empty szFilename and the question marks, that's just the result of converting a UTF-16 (NTFS) filename that contains unconvertible characters to ANSI. If no conversion is possible, wcstombs returns an error, and QDebug replaces every unconvertible character with ?.
If wcstombs encounters a wide character it cannot convert to a multibyte character, it returns –1 cast to type size_t and sets errno to EILSEQ.
So if you need to support Unicode filenames, do not convert them to ANSI; handle them exclusively with functions that support Unicode.
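If a byte-oriented representation is really needed, UTF-8 (unlike ANSI) can represent every UTF-16 file name. In practice you would probably use `QString::toUtf8()` or `WideCharToMultiByte` with `CP_UTF8`; the hand-written encoder below is only a portable sketch of what such a conversion does. The function name `utf16ToUtf8` and the choice to replace unpaired surrogates with U+FFFD are assumptions of this example:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Illustrative encoder: converts UTF-16 code units to UTF-8 bytes.
// Unpaired surrogates are replaced with U+FFFD (an assumption of this sketch).
std::string utf16ToUtf8(const std::u16string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // High surrogate: combine with the following low surrogate.
            if (i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                ++i;
            } else {
                cp = 0xFFFD;
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            cp = 0xFFFD; // stray low surrogate
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

Note that the resulting bytes are UTF-8, not ANSI, so they should only be passed to code that expects UTF-8.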

Related

How to process CSV lines with nul char in some elements?

When reading and parsing a line of a CSV file, I need to process the nul character that appears as the value of some row fields. It is complicated by the fact that the CSV file is sometimes in windows-1250 encoding, sometimes in UTF-8, and sometimes in UTF-16. Because of this, I started down one path and only later found the nul char problem -- see below.
Details: I need to clean CSV files from a third party into the form common to our data extractor (that is, the utility works as a filter -- converting one CSV form to another CSV form).
My initial approach was to open the CSV file in binary mode and check whether the first bytes form a BOM. I know that all the given Unicode files start with a BOM. If there is no BOM, I know that the file is in windows-1250 encoding.
The converted CSV file should use the windows-1250 encoding. So, after checking the input file, I open it using the related mode, like this:
// Open the file in binary mode first to see whether BOM is there or not.
FILE * fh{ nullptr };
errno_t err = fopen_s(&fh, fnameIn.string().c_str(), "rb"); // const fs::path & fnameIn
assert(err == 0);
vector<unsigned char> buf(4, 0); // unsigned, so the comparisons with 0xEF etc. below can succeed
fread(&buf[0], 1, 3, fh);
::fclose(fh);
// Set the isUnicode flag and open the file according to that.
string mode{ "r" }; // init
bool isUnicode = false; // pessimistic init
if (buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF) // UTF-8 BOM
{
mode += ", ccs=UTF-8";
isUnicode = true;
}
else if ((buf[0] == 0xFE && buf[1] == 0xFF) // UTF-16 BE BOM
|| (buf[0] == 0xFF && buf[1] == 0xFE)) // UTF-16 LE BOM
{
mode += ", ccs=UNICODE";
isUnicode = true;
}
// Open in the suitable mode.
err = fopen_s(&fh, fnameIn.string().c_str(), mode.c_str());
assert(err == 0);
After the successful open, the input line is read either via fgets or via fgetws, depending on whether Unicode was detected or not. Then the idea was to convert the buffer content from Unicode to 1250 if Unicode was detected earlier, or to leave the buffer as it is if it is already in 1250. The s variable should contain the string in the windows-1250 encoding. ATL::CW2A(buf, 1250) is used when conversion is needed:
const int bufsize = 4096;
wchar_t buf[bufsize];
// Read the line from the input according to the isUnicode flag.
while (isUnicode ? (fgetws(buf, bufsize, fh) != NULL)
: (fgets(reinterpret_cast<char*>(buf), bufsize, fh) != NULL))
{
// If the input is in Unicode, convert the buffer content
// to the string in cp1250. Otherwise, do not touch it.
string s;
if (isUnicode) s = ATL::CW2A(buf, 1250);
else s = reinterpret_cast<char*>(buf);
...
// Now processing the characters of the `s` to form the output file
}
It worked fine... until a file appeared with a nul character used as a value in a row. The problem is that when the s variable is assigned, the nul cuts off the rest of the line. In the observed case, it happened with a file that used the 1250 encoding, but it can probably also happen in the UTF-encoded files.
How to solve the problem?
The NUL character problem can be solved with either C++ or Windows functions, as long as they take an explicit length instead of relying on a terminator. In this case, the easiest solution is MultiByteToWideChar, which accepts an explicit string length precisely so that it doesn't stop at NUL.
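On the plain C++ side, a rough sketch could look like this. It assumes the whole file has already been read in binary mode into a `std::string` together with its real byte count (for example from the return value of `fread`); `splitLines` is a hypothetical helper name, not something from the question:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Splits raw file content into lines without ever treating '\0' as an end
// marker. Every string is built from an explicit position and length.
std::vector<std::string> splitLines(const std::string& data)
{
    std::vector<std::string> lines;
    std::size_t start = 0;
    while (start < data.size()) {
        std::size_t end = data.find('\n', start);
        if (end == std::string::npos) {
            end = data.size();
        }
        // Substring constructor (source, position, length): embedded NULs are kept.
        lines.emplace_back(data, start, end - start);
        // Binary mode leaves the '\r' of a Windows line ending in place; drop it.
        if (!lines.back().empty() && lines.back().back() == '\r') {
            lines.back().pop_back();
        }
        start = end + 1;
    }
    return lines;
}
```

The truncation in the question comes from assigning a `char*` to a `std::string`, which stops at the first NUL. Constructing the string from a pointer plus a length, or from a position plus a length as above, avoids that.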

do writefile function twice

bool sendMessageToGraphics(char* msg)
{
//char ea[] = "SSS";
char* chRequest = msg; // Client -> Server
DWORD cbBytesWritten, cbRequestBytes;
// Send one message to the pipe.
cbRequestBytes = sizeof(TCHAR) * (lstrlen(chRequest) + 1);
if (*msg - '8' == 0)
{
char new_msg[1024] = { 0 };
string answer = "0" + '\0';
copy(answer.begin(), answer.end(), new_msg);
char *request = new_msg;
WriteFile(hPipe, request, cbRequestBytes, &cbRequestBytes, NULL);
}
BOOL bResult = WriteFile( // Write to the pipe.
hPipe, // Handle of the pipe
chRequest, // Message to be written
cbRequestBytes, // Number of bytes to write
&cbBytesWritten, // Number of bytes written
NULL); // Not overlapped
if (!bResult/*Failed*/ || cbRequestBytes != cbBytesWritten/*Failed*/)
{
_tprintf(_T("WriteFile failed w/err 0x%08lx\n"), GetLastError());
return false;
}
_tprintf(_T("Sends %ld bytes; Message: \"%s\"\n"),
cbBytesWritten, chRequest);
return true;
}
After the first WriteFile has run (in the case of '8'), the other WriteFile call doesn't work correctly. Can someone explain why?
The function sendMessageToGraphics needs to send a move to the chess board.
There are 2 problems in your code:
First of all, there's a (minor) problem where you initialize a string in your conditional statement. You initialize it as so:
string answer = "0" + '\0';
This does not do what you think it does. It invokes the built-in operator+ with a const char* and a char as operands, which is pointer arithmetic: the value of '\0' is added to the address of the string literal. Since '\0' converts to the integer 0, nothing is added to the pointer, and answer is initialized from the unchanged literal "0". The string therefore does not contain the extra '\0' character you intended to append. You could solve this by changing the statement to:
string answer = std::string("0") + '\0';
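To make the difference visible, here is a small sketch that builds the string both ways; the two function names are made up for this illustration:

```cpp
#include <string>

// "0" + '\0' is pointer arithmetic: the literal decays to const char* and
// '\0' is promoted to the integer 0, so the result still points at "0".
std::string viaPointerArithmetic()
{
    const char* p = "0";
    return std::string(p + '\0');
}

// Appending to a std::string stores the NUL as a real character of the string.
std::string viaStringConcatenation()
{
    return std::string("0") + '\0';
}
```

The first version yields a string of size 1, the second a string of size 2 whose last character is the NUL.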
But the real problem lies in the way you use your size variables. You first initialize the size variable to the string length of your input variable (including the terminating '\0' character). Then in your conditional statement you create a new string which you pass to WriteFile, yet you still use the original size. This may cause a buffer overrun, which is undefined behavior. You also set your size variable to however many bytes you wrote to the file. Then later on you use this same value again in the next call. You never actually check this value, so this could cause problems.
The easiest way to fix this is to make sure each call uses the size of the buffer it actually sends. For example, instead of the first call, you could do this:
WriteFile(hPipe, request, answer.size(), &cbBytesWritten, NULL);
Then check the return value of WriteFile and the value of cbBytesWritten before you make the next call to WriteFile; that way you know your first call succeeded too.
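The idea of tying each size to the buffer being written and checking every write can be sketched without the pipe itself. In the snippet below, `Writer` is a hypothetical stand-in for a WriteFile-style call, and `sendMessage` and `sendWithPrefix` are illustrative names rather than anything from the question:

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical stand-in for a WriteFile-style call: writes up to `size` bytes
// from `data`, stores the count actually written, returns false on failure.
using Writer = std::function<bool(const char* data, std::size_t size, std::size_t* written)>;

// Sends one message including its terminating '\0'. The byte count is derived
// from the very buffer that is being sent, never from another message.
bool sendMessage(const Writer& write, const std::string& message)
{
    const std::size_t toWrite = message.size() + 1; // +1 for the '\0' from c_str()
    std::size_t written = 0;                        // separate from toWrite
    if (!write(message.c_str(), toWrite, &written)) {
        return false;
    }
    return written == toWrite;
}

// Sends an optional prefix first and continues only if that write succeeded.
bool sendWithPrefix(const Writer& write, const std::string& prefix, const std::string& message)
{
    if (!prefix.empty() && !sendMessage(write, prefix)) {
        return false;
    }
    return sendMessage(write, message);
}
```

In the real code the callable would wrap WriteFile on hPipe. The point is only that the requested size and the written size live in separate variables and are compared after every call.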
Also, do not forget to remove the sizeof(TCHAR) part from your size calculation. You are never using TCHAR in your code: your input is a regular char*, and so is the string you use in your conditional. For the same reason, I would advise replacing lstrlen with strlen (or lstrlenA) to show that you are working with narrow characters. WriteFile itself has no A/W variants, because it simply writes bytes.
Last of all, make sure your server is actually reading bytes from the handle you write to. If your server does not read from the handle, the WriteFile function will block until it can write to the handle again.

Why I get wrong char array when I use WideCharToMultiByte?

Windows 7, Visual Studio 2015.
#ifdef UNICODE
char *buffer = NULL;
int iBuffSize = WideCharToMultiByte(CP_ACP, 0, result_msg.c_str(),
result_msg.size(), buffer, 0, NULL, NULL);
buffer = static_cast<char*>(malloc(iBuffSize));
ZeroMemory(buffer, iBuffSize);
WideCharToMultiByte(CP_ACP, 0, result_msg.c_str(),
result_msg.size(), buffer, iBuffSize, NULL, NULL);
string result_msg2(buffer);
free(buffer);
throw runtime_error(result_msg2);
#else
throw runtime_error(result_msg);
#endif
result_msg is std::wstring for unicode, and std::string for the multi-byte character set.
For the multi-byte character set: (screenshot of the output, not preserved)
For the Unicode character set: (screenshot of the output, not preserved)
You specified the input string size as result_msg.size(), which does not include the terminating null character, and so the output won't be null-terminated either. But when you convert buffer to a string, you are not specifying the size of buffer, so the string constructor expects a null terminator. Without that terminator, it keeps grabbing data from surrounding memory until it encounters a null byte (or gets a memory access error).
Either use result_msg.size() + 1 for the input size, or specify -1 as the input size to let WideCharToMultiByte() determine the input size automatically. Either approach will include a null terminator in the output.
Or, keep using result_msg.size() as the input size, and use the value of iBuffSize when converting buffer to a string; then you don't need a null terminator:
string result_msg2(buffer, iBuffSize);

C++ char buffer pointer error

I'm using a function that reads a spooled file and sets a buffer with the output.
The function returns OK state and sets readBytes correctly. It also notifies that the reading operation has reached the end of the file.
char* splFileContent = new char[3000];
ULONG readBytes;
int z = cwbOBJ_ReadSplF(splFile, splFileContent, 500, &readBytes, 0);
//z value is REACHED END OF FILE or OK if read but didn't reach the end of the file.
The trouble comes when trying to convert the char buffer to string, I'm getting "4Ä" as string value...
I convert the char buffer to string this way:
stringstream s;
s << splFileContent;
string bufferContent = s.str();
What am I doing wrong?
It looks like splFileContent is binary content and not printable characters.
The start of the file may contain a BOM of some sort, e.g. a Unicode indicator. If it does, you should read in the BOM first and then the rest of the file.
Note: unless the file read function here adds a null terminator, be sure to append one as well.
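One way to check whether the buffer really holds binary data is to look at the first few bytes in hex, and to build the string from the byte count the read function reported instead of relying on a terminator. The helpers below are a sketch with made-up names (`hexPreview`, `toString`); `count` would be the readBytes value from the question:

```cpp
#include <cstddef>
#include <string>

// Returns the first `count` bytes as uppercase hex pairs separated by spaces,
// which makes a BOM or other non-printable content easy to spot.
std::string hexPreview(const char* data, std::size_t count)
{
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (std::size_t i = 0; i < count; ++i) {
        const unsigned char byte = static_cast<unsigned char>(data[i]);
        if (i != 0) {
            out += ' ';
        }
        out += digits[byte >> 4];
        out += digits[byte & 0x0F];
    }
    return out;
}

// Copies exactly `count` bytes; no terminator is needed, and embedded zero
// bytes are preserved.
std::string toString(const char* data, std::size_t count)
{
    return std::string(data, count);
}
```

Streaming a bare `char*` into a stringstream, as in the question, stops at the first zero byte and reads past the data if there is none, so the counted constructor is the safer choice.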

MultiByteToWideChar or WideCharToMultiByte and txt files

I'm trying to write a universal text editor which can open and display ANSI and Unicode in EditControl. Do I need to repeatedly call ReadFile() if I determine that the text is ANSI? Can't figure out how to perform this task. My attempt below does not work, it displays '?' characters in EditControl.
LARGE_INTEGER fSize;
GetFileSizeEx(hFile,&fSize);
int bufferLen = fSize.QuadPart/sizeof(TCHAR)+1;
TCHAR* buffer = new TCHAR[bufferLen];
buffer[0] = _T('\0');
DWORD wasRead = 0;
ReadFile(hFile,buffer,fSize.QuadPart,&wasRead,NULL);
buffer[wasRead/sizeof(TCHAR)] = _T('\0');
if(!IsTextUnicode(buffer,bufferLen,NULL))
{
CHAR* ansiBuffer = new CHAR[bufferLen];
ansiBuffer[0] = '\0';
WideCharToMultiByte(CP_ACP,0,buffer,bufferLen,ansiBuffer,bufferLen,NULL,NULL);
SetWindowTextA(edit,ansiBuffer);
delete[]ansiBuffer;
}
else
SetWindowText(edit,buffer);
CloseHandle(hFile);
delete[]buffer;
There are a few buffer length errors and oddities, but here's your big problem. You call WideCharToMultiByte incorrectly. That is meant to receive UTF-16 encoded text as input. But when IsTextUnicode returns false that means that the buffer is not UTF-16 encoded.
The following is basically what you need:
if(!IsTextUnicode(buffer,bufferLen*sizeof(TCHAR),NULL))
SetWindowTextA(edit,(char*)buffer);
Note that I've fixed the length parameter to IsTextUnicode.
For what it is worth, I think I'd read into a buffer of char. That would remove the need for the sizeof(TCHAR). In fact I'd stop using TCHAR altogether. This program should be Unicode all the way: TCHAR is what you use when you compile for both NT and 9x variants of Windows. You aren't compiling for 9x anymore, I imagine.
So I'd probably code it like this:
char* buffer = new char[filesize+2];//+2 for UTF-16 null terminator
DWORD wasRead = 0;
ReadFile(hFile, buffer, filesize, &wasRead, NULL);
//add error checking for ReadFile, including that wasRead == filesize
buffer[filesize] = '\0';
buffer[filesize+1] = '\0';
if (IsTextUnicode(buffer, filesize, NULL))
SetWindowText(edit, (wchar_t*)buffer);
else
SetWindowTextA(edit, buffer);
delete[] buffer;
Note also that this code makes no allowance for the possibility of receiving UTF-8 encoded text. If you want to handle that, you'd need to take your char buffer and send it through MultiByteToWideChar using CP_UTF8.
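To decide which path to take, one option is to look for a byte order mark before falling back to a heuristic such as IsTextUnicode. This is only a sketch under the assumption that the file starts with a BOM, which many UTF-8 files do not; `Bom`, `detectBom` and `bomLength` are hypothetical names, and UTF-32 is deliberately ignored:

```cpp
#include <cstddef>

enum class Bom { None, Utf8, Utf16LE, Utf16BE };

// Inspects the first bytes of a buffer. The bytes must be unsigned so that
// comparisons against values such as 0xEF behave as expected.
Bom detectBom(const unsigned char* data, std::size_t size)
{
    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF) {
        return Bom::Utf8;
    }
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE) {
        return Bom::Utf16LE;
    }
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF) {
        return Bom::Utf16BE;
    }
    return Bom::None;
}

// Number of bytes to skip before the actual text starts.
std::size_t bomLength(Bom bom)
{
    switch (bom) {
        case Bom::Utf8:    return 3;
        case Bom::Utf16LE: return 2;
        case Bom::Utf16BE: return 2;
        default:           return 0;
    }
}
```

With a result of `Bom::Utf8`, the bytes after the mark would go through MultiByteToWideChar with CP_UTF8 as described above; with `Bom::None`, the text could be ANSI or BOM-less UTF-8, so some other check is still needed.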