How can you read a file from a zip by opening the zip with a wide string file path? I only saw libraries and code examples with std::string or const char * file paths but I suppose they may fail on Windows with non-ASCII characters. I found this but I'm not using gzip.
Attempts
minizip:
const auto zip_file = unzOpen(jar_file_path.string().c_str()); // No wide string support
if (zip_file == nullptr)
{
throw std::runtime_error("unzOpen() failed");
}
libzippp:
libzippp::ZipArchive zip_archive(jar_file_path.string()); // No wide string support
const auto file_opened_successfully = zip_archive.open(libzippp::ZipArchive::ReadOnly);
if (!file_opened_successfully)
{
throw std::runtime_error("Failed to open the archive file");
}
Zipper does not seem to support wide strings either. Is there any way it can currently be done?
You might be in luck with minizip. I haven't tested this, but I found the following code in mz_strm_os_win32.c:
int32_t mz_stream_os_open(void *stream, const char *path, int32_t mode) {
...
path_wide = mz_os_unicode_string_create(path, MZ_ENCODING_UTF8);
if (path_wide == NULL)
return MZ_PARAM_ERROR;
#ifdef MZ_WINRT_API
win32->handle = CreateFile2(path_wide, desired_access, share_mode,
creation_disposition, NULL);
#else
win32->handle = CreateFileW(path_wide, desired_access, share_mode, NULL,
creation_disposition, flags_attribs, NULL);
#endif
mz_os_unicode_string_delete(&path_wide);
...
So it looks very much as if the author catered explicitly for Windows' lack of built-in UTF-8 support for the 'narrow string' file IO functions. It's worth a try at least, let's just hope that that function actually gets called when you try to open a zip file.
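If that code path is indeed taken, the path has to arrive as UTF-8, which is not what `jar_file_path.string()` produces on Windows (it uses the ANSI code page). A minimal sketch of the conversion, assuming `jar_file_path` is a `std::filesystem::path`; the helper name `to_utf8` is made up for illustration:

```cpp
#include <filesystem>
#include <string>

// Hypothetical helper: returns the path encoded as UTF-8 in a plain std::string.
// u8string() yields std::string in C++17 and std::u8string in C++20, so the
// bytes are copied through iterators to compile under both standards.
std::string to_utf8(const std::filesystem::path& path)
{
    const auto utf8 = path.u8string();
    return std::string(utf8.begin(), utf8.end());
}
```

The result could then be handed over as `to_utf8(jar_file_path).c_str()` to any API that documents its `const char *` path as UTF-8.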
Regarding the Minizip library: the API function unzOpen() handles UTF-8 paths correctly only on Unix systems; on Windows the path is interpreted in the current code page. To get full Unicode support, use the newer API functions unzOpen2_64() and zipOpen2_64(), which let you pass a structure containing a set of functions for working with the file system. Please see my answer with details in the similar question.
Related
I have a problem: a Windows application using SDL2 and SDL2_Image opens image files in order to save them later with modifications to the image data.
When it opens an image whose name has no special characters (like áéíóúñ), say "buenos aires.jpg", it works as intended. But if the name contains any of those characters (say, "córdoba.jpg"), SDL_Image generates an error saying "Couldn't open file". However, if I use a std::ifstream with the exact file name that I got from the CSV file (again, such as "córdoba.jpg" or "misiónes.jpg"), the ifstream works well. Is using the special characters an error? Do Unicode or UTF encodings have something to do with it?
A little information about the environment: Windows 10 (Spanish, Latin American), SDL2 and SDL2_Image (up-to-date versions), GCC compiler using Mingw64 7.1.0.
About the software I'm trying to make: it uses a CSV file with the names of various provinces of Argentina (I have already tried changing the encoding of the .CSV). It loads images based on the names found in the CSV, changes them, and saves them.
I know I may be missing something basic, but I have already exhausted my resources.
IMG_Load() forwards its file argument directly to SDL_RWFromFile():
// http://hg.libsdl.org/SDL_image/file/8fee51506499/IMG.c#l125
SDL_Surface *IMG_Load(const char *file)
{
SDL_RWops *src = SDL_RWFromFile(file, "rb");
const char *ext = SDL_strrchr(file, '.');
if(ext) {
ext++;
}
if(!src) {
/* The error message has been set in SDL_RWFromFile */
return NULL;
}
return IMG_LoadTyped_RW(src, 1, ext);
}
And SDL_RWFromFile()'s file argument should be a UTF-8 string:
SDL_RWops* SDL_RWFromFile(const char* file,
const char* mode)
Function Parameters:
file: a UTF-8 string representing the filename to open
mode: an ASCII string representing the mode to be used for opening the file; see Remarks for details
So pass UTF-8 paths into IMG_Load().
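C++11 has UTF-8 string literal support built in via the u8 prefix (from C++20 on, such literals have type const char8_t[] and need a cast to const char * before they can be passed here):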
IMG_Load( u8"córdoba.jpg" );
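A literal only helps when the name is known at compile time. In the question the names come from a CSV file, and the fact that std::ifstream accepts them suggests they are in the ANSI code page rather than UTF-8. Here is a rough sketch of the conversion, under the assumption that the CSV is Latin-1/Windows-1252 and only uses characters such as áéíóúñ (bytes 0xA0-0xFF, where both encodings match the Unicode code points). The name `latin1_to_utf8` is made up:

```cpp
#include <string>

// Hypothetical helper: converts ISO-8859-1 (Latin-1) bytes to UTF-8.
// Assumption: the input contains no bytes in 0x80-0x9F, where Windows-1252
// differs from Latin-1.
std::string latin1_to_utf8(const std::string& input)
{
    std::string output;
    output.reserve(input.size() * 2);
    for (const char ch : input) {
        const unsigned char byte = static_cast<unsigned char>(ch);
        if (byte < 0x80) {
            // ASCII is identical in both encodings.
            output.push_back(static_cast<char>(byte));
        } else {
            // Code points U+0080..U+00FF become a two-byte UTF-8 sequence.
            output.push_back(static_cast<char>(0xC0 | (byte >> 6)));
            output.push_back(static_cast<char>(0x80 | (byte & 0x3F)));
        }
    }
    return output;
}
```

With that, the call would look like `IMG_Load(latin1_to_utf8(name).c_str())`, where `name` is the string read from the CSV.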
I can't read a file correctly using CStdioFile.
I open notepad.exe, type àèìòùáéíóú and save the file twice: once with the encoding set to ANSI (which is really CP-1252) and once as UTF-8.
Then I try to read it from MFC with the following block of code
BOOL ReadAllFileContent(const CString &FilePath, CString *fileContent)
{
CString sLine;
BOOL isSuccess = false;
CStdioFile input;
isSuccess = input.Open(FilePath, CFile::modeRead);
if (isSuccess) {
while (input.ReadString(sLine)) {
fileContent->Append(sLine);
}
input.Close();
}
return isSuccess;
}
When I call it, with ANSI file I've got the expected result àèìòùáéíóú
but when I try to read the UTF8 encoded file I've got à èìòùáéÃóú
I would like my function to work with all files regardless of the encoding.
What do I need to implement?
.EDIT.
Unfortunately, in the real app the files come from an external app, so changing the file encoding isn't an option. I must be able to read both UTF-8 and CP-1252 files.
Any file is valid ANSI; what Notepad calls ANSI is really the Windows-1252 encoding.
I've figured out a way to read UTF-8 and CP-1252 correctly based on the example provided here. Although it works, I need to pass the file encoding, which I don't know in advance.
Thanks!
I personally use the class as advertised here:
https://www.codeproject.com/Articles/7958/CTextFileDocument
It has excellent support for reading and writing text files of various encodings including unicode in its various flavours.
I have not had a problem with it.
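If pulling in a class like that is not an option, a common heuristic for telling the two encodings apart is to check whether the raw bytes form well-formed UTF-8, since CP-1252 text with accented characters almost never does. This is a simplified sketch only: the function name is an assumption, and the check does not reject overlong forms or surrogates.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if `bytes` parses as UTF-8 sequences
// (a leading BOM passes as an ordinary three-byte sequence).
bool looks_like_utf8(const std::string& bytes)
{
    const std::size_t size = bytes.size();
    std::size_t i = 0;
    while (i < size) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t continuation_count = 0;
        if (lead < 0x80) {
            continuation_count = 0;                 // ASCII
        } else if (lead >= 0xC2 && lead <= 0xDF) {
            continuation_count = 1;                 // two-byte sequence
        } else if (lead >= 0xE0 && lead <= 0xEF) {
            continuation_count = 2;                 // three-byte sequence
        } else if (lead >= 0xF0 && lead <= 0xF4) {
            continuation_count = 3;                 // four-byte sequence
        } else {
            return false;                           // not a valid lead byte
        }
        if (continuation_count > size - i - 1) {
            return false;                           // sequence is cut off
        }
        for (std::size_t k = 1; k <= continuation_count; ++k) {
            const unsigned char next = static_cast<unsigned char>(bytes[i + k]);
            if ((next & 0xC0) != 0x80) {
                return false;                       // not a continuation byte
            }
        }
        i += continuation_count + 1;
    }
    return true;
}
```

The idea would be to read the file as raw bytes, run this check, and then convert with MultiByteToWideChar using CP_UTF8 or code page 1252 accordingly. A pure ASCII file passes the check, which is harmless because it decodes identically either way.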
I've looked at a lot of examples for WideCharToMultiByte, etc. This question is more about testing.
I downloaded another language set, Chinese, Simplified China to my machine. Then using the virtual keyboard I created a directory on C:\ with some Chinese characters in the path, and placed a file inside the directory.
I'm trying to see that I get the correct filename by testing _wfopen with my path. I also have the same file in another location for testing:
//setlocale(LC_ALL, "zh-CN");
//setlocale(LC_ALL, "Chinese_China.936");
setlocale(LC_ALL, "");
wchar_t* outfilename = L"C:\\特殊他\\和阿涛和润\\bracket3holes.sat";
//wchar_t* outfilename = L"C:\\heather\\bracket3holes.sat";
wchar_t w[] = L"r";
FILE* foo = _wfopen(outfilename, w);
First I tried without setting locale, then I tried various combinations of setting locale to the language I downloaded (therefore the language of the path).
_wfopen works fine with the C:\heather path, but always returns a NULL pointer with the unicode path.
What am I missing? Any insight would be greatly appreciated. Note my code must be compilable back to vc9.
--- Based on the feedback, I saved the file as UTF-8 with BOM, added const before the wchar_t declarations, and now in the debugger I do see the right string and the file pointer is no longer null.
Thank you for your help. I'm still trying to wrap my head around this all, we're trying to transition from const char* to unicode-friendly.
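The BOM fix works because the compiler then decodes the source file correctly. Another way to make the literal independent of the source file's encoding is to spell the characters as universal character names. Below is a sketch of the same path written that way. The constant name is arbitrary, and I have not checked this form on VC9 specifically.

```cpp
// The path from the question with each Chinese character written as a
// universal character name, so the compiler does not have to guess the
// encoding of the source file.
// First directory:  U+7279 U+6B8A U+4ED6
// Second directory: U+548C U+963F U+6D9B U+548C U+6DA6
const wchar_t kOutFilename[] =
    L"C:\\\u7279\u6B8A\u4ED6\\\u548C\u963F\u6D9B\u548C\u6DA6\\bracket3holes.sat";
```

It is less readable than the literal characters, but it survives editors and build machines that disagree about the file's encoding.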
When I use OpenCV's API cvLoadImage(const char *filename, int iscolor), it accepts a const char * as the file name. When the file name contains non-ASCII characters, I tried converting it to a UTF-8 string. It fails because the fopen() called inside cvLoadImage() cannot interpret the non-ASCII characters of the file name correctly. I could use _wfopen() if I were opening the file myself, but since fopen() is called inside a third-party library, is there any way to handle this problem?
Thank you.
Use GetShortPathName. It will return an old (8.3) name for the file, which you should be able to convert to char*, as it should not contain any non ASCII characters.
I've just tested it with some language specific characters and it worked as I described. I've successfully opened a file from C:\łęłęł\ąóąóą.tsttgbb using fopen.
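For reference, here is roughly how the short-name lookup could be wrapped. Treat it as a sketch: the helper name is made up, the non-Windows branch only exists so the code builds elsewhere, and on volumes where 8.3 name generation is disabled the call may simply hand back the long name, so the result still has to be checked.

```cpp
#include <string>
#ifdef _WIN32
#include <windows.h>
#include <vector>
#endif

// Hypothetical helper: returns the 8.3 short form of `long_path` converted to
// the ANSI code page, or an empty string if no short name could be obtained.
// On non-Windows builds it always returns an empty string.
std::string short_path_for_narrow_api(const std::wstring& long_path)
{
#ifdef _WIN32
    // First call: ask for the required buffer size (includes the terminator).
    const DWORD needed = GetShortPathNameW(long_path.c_str(), nullptr, 0);
    if (needed == 0) {
        return std::string();
    }
    std::vector<wchar_t> short_path(needed);
    const DWORD written =
        GetShortPathNameW(long_path.c_str(), short_path.data(), needed);
    if (written == 0 || written >= needed) {
        return std::string();
    }
    // Convert to the ANSI code page; short names are typically plain ASCII.
    const int bytes = WideCharToMultiByte(CP_ACP, 0, short_path.data(), -1,
                                          nullptr, 0, nullptr, nullptr);
    if (bytes <= 0) {
        return std::string();
    }
    std::string narrow(static_cast<std::size_t>(bytes), '\0');
    WideCharToMultiByte(CP_ACP, 0, short_path.data(), -1,
                        &narrow[0], bytes, nullptr, nullptr);
    narrow.resize(static_cast<std::size_t>(bytes) - 1);  // drop the terminator
    return narrow;
#else
    (void)long_path;
    return std::string();
#endif
}
```

The returned string is what would be passed on to cvLoadImage() or any other function that calls fopen() internally.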
setlocale(LC_ALL, ".65001");
fopen(u8"中文路径.txt", "rb"); // Windows 7 (Chinese), VS2017: OK
A quick search came up with nothing but people saying it can't be done. If you can't change cvLoadImage (which is reasonable, you don't want to mess with that), you can try to trick it.
You can create a link to the file, using the CreateSymbolicLink. I'm not sure it'll work, though, because the MKLINK command line utility requires administrative privileges.
If you can't create a symbolic link, you can always copy the file to a different location with an ASCII-only name.
If you really don't want to copy the file and symlinks don't work, you can create a file proxy - create a named pipe with an ASCII-only name, and translate each read from the pipe into a read from the file.
I would go with options 1 or 2, though - a lot simpler.
Here's a late contribution to this problem. I looked through the source of the runtime library (which Microsoft kindly supply) and found that I could replace the routine used by fopen to map an ANSI string with the following code (just link this into your exe and it will replace the routine in the runtime library).
The version listed works for Visual Studio 2017 using the v141_xp toolkit. I haven't tested it for other versions but I imagine some minor changes (such as the name of the routine itself) might be needed. It won't work of course if the offending library is a DLL. Make of it what you will.
#ifdef _DEBUG
#define _NORMAL_BLOCK 1
#define _CRT_BLOCK 2
#define _malloc_crt(s) (_malloc_dbg (s, _CRT_BLOCK, __FILE__, __LINE__))
#else
#define _malloc_crt _malloc_base
#endif
// A hack to make fopen et al accept UTF8 strings (as at Visual Studio 2017), see:
// D:\Program Files (x86)\Windows Kits\10\Source\10.0.10240.0\ucrt\internal\string_utilities.cpp
// D:\Program Files (x86)\Windows Kits\10\Source\10.0.10240.0\ucrt\inc\corecrt_internal_traits.h
extern "C" BOOL __cdecl __acrt_copy_path_to_wide_string (char const* const path, wchar_t** const result)
{
#if _MSC_VER != 1910
#define STRINGIZE_HELPER(x) #x
#define STRINGIZE(x) STRINGIZE_HELPER(x)
__pragma (message (__FILE__ "(" STRINGIZE (__LINE__) ") : Error: Code not tested for this version of Visual Studio"));
#endif
assert (path);
assert (result);
// Compute the required size of the wide character buffer:
int length = MultiByteToWideChar (CP_UTF8, 0, path, -1, nullptr, 0);
assert (length > 0);
*result = (wchar_t *) _malloc_crt (length * sizeof (wchar_t)); // buffer size in bytes, including the terminator
// Do the conversion:
length = MultiByteToWideChar (CP_UTF8, 0, path, -1, *result, length);
assert (length);
return TRUE;
}
I have two problems, the first has been solved.
Current problem
If I embed a file that needs a library to load it, such as a JPEG image or an MP3 file, I have to pass the file to that library as input. However, each library accepts its input differently: it may take a file name or a FILE* pointer (from libc's file interface).
I would like to know how to access an embedded file by name. Creating a temporary file would be inefficient; is there another way? Can I map a file name to memory? My platforms are Windows and Linux.
If show_file(const char* name) is a function from a library, I will need a string to open the file.
I have seen these questions:
How to get file descriptor of buffer in memory?
Getting Filename from file descriptor in C
and the following code is my solution. Is it a good solution? Is it inefficient?
# include <stdio.h>
# include <unistd.h>
extern char _binary_data_txt_start;
extern const void* _binary_data_txt_size;
const size_t len = (size_t)&_binary_data_txt_size;
void show_file(const char* name){
FILE* file = fopen(name, "r");
if (file == NULL){
printf("Error (show_file): %s\n", name);
return;
}
while (true){
char ch = fgetc(file);
if (feof(file) )
break;
putchar( ch );
}
printf("\n");
fclose(file);
}
int main(){
int fpipe[2];
pipe(fpipe);
if( !fork() ){
for( int buffsize = len, done = 0; buffsize>done; ){
done += write( fpipe[1], &_binary_data_txt_start + done, buffsize-done );
}
_exit(0);
}
close(fpipe[1]);
char name[200];
sprintf(name, "/proc/self/fd/%d", fpipe[0] );
show_file(name);
close(fpipe[0]);
}
The other problem (solved)
I tried to embed a file on Linux, with GCC, and it worked. However, I tried to do the same thing on Windows, with Mingw, and it did not compile.
The code is:
# include <stdio.h>
extern char _binary_data_txt_start;
extern char _binary_data_txt_end;
int main(){
for (char* my_file = &_binary_data_txt_start; my_file < &_binary_data_txt_end; my_file++) // _end points one past the last byte
putchar(*my_file);
printf("\n");
}
The compilation commands are:
objcopy --input-target binary --output-target elf32-i386 --binary-architecture i386 data.txt data.o
g++ main.cpp data.o -o test.exe
On Windows, I get the following compiler error:
undefined reference to `_binary_data_txt_start'
undefined reference to `_binary_data_txt_end'
I tried to replace elf32-i386 with i386-pc-mingw32, but I still get the same error.
I think that for this to work with MinGW you'll need to remove the leading underscore from the names in the .c file. See Embedding binary blobs using gcc mingw for some details.
See if using the following helps:
extern char binary_data_txt_start;
extern char binary_data_txt_end;
If you need the same source to work for Linux or MinGW builds, you might need to use the preprocessor to have the right name used in the different environments.
If you're using a library that requires a FILE* for reading data, then you can use fmemopen(3) to create a pseudofile out of a memory blob. This will avoid creating a temporary file on disk. Unfortunately, it comes from glibc and POSIX.1-2008 rather than standard C, so I don't know if it's available with MinGW (likely not).
However, most well-written libraries (such as libpng and the IJG's JPEG library) provide routines for opening a file from memory as opposed to from disk. libpng, in particular, even offers a streaming interface, where you can incrementally decode a PNG file before it's been completely read into memory. This is useful if, say, you're streaming an interlaced PNG from the network and you want to display the interlaced data as it loads for a better user experience.
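To make the fmemopen(3) suggestion concrete, here is a small sketch for POSIX systems. `read_all` is only a stand-in for a library routine that insists on a FILE*, and both function names are invented for the example:

```cpp
#include <stdio.h>

#include <cstddef>
#include <string>

// Stand-in for a library routine that only accepts a FILE*: reads the stream
// to the end and returns its contents. (Hypothetical, for illustration.)
std::string read_all(FILE* file)
{
    std::string contents;
    int ch;  // int rather than char, so EOF can be told apart from data
    while ((ch = fgetc(file)) != EOF) {
        contents.push_back(static_cast<char>(ch));
    }
    return contents;
}

// Opens an in-memory blob as a read-only FILE* with POSIX fmemopen() and
// hands it to read_all(). Returns an empty string if fmemopen() fails.
std::string read_blob_through_file(const char* data, std::size_t size)
{
    // fmemopen() takes a non-const pointer, but "r" mode never writes to it.
    FILE* file = fmemopen(const_cast<char*>(data), size, "r");
    if (file == nullptr) {
        return std::string();
    }
    const std::string contents = read_all(file);
    fclose(file);
    return contents;
}
```

With the embedded data from the question, the call would be `read_blob_through_file(&_binary_data_txt_start, len)`. No pipe, fork or temporary file is involved, but it does not help with libraries that want a file name.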
On Windows, you can embed a custom resource into the executable file. You would need a .RC file and a resource compiler. With the Visual Studio IDE you can do it without hassle.
In your code, you would use the FindResource, LoadResource and LockResource functions to load the contents into memory at runtime. Sample code that reads the resource as one long string:
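void GetResourceAsString(int nResourceID, CStringA &strResourceString)
{
    // Locate the resource of custom type "DATA" in the current module.
    HRSRC hResource = FindResourceW(NULL, MAKEINTRESOURCEW(nResourceID), L"DATA");
    if (hResource == NULL)
        return;
    HGLOBAL hResHandle = LoadResource(NULL, hResource);
    if (hResHandle == NULL)
        return;
    // LockResource returns a pointer to the resource bytes; no unlock or free is needed.
    const char* lpData = static_cast<const char*>(LockResource(hResHandle));
    strResourceString.SetString(lpData, static_cast<int>(SizeofResource(NULL, hResource)));
}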
Where nResourceID is the ID of resource under custom resource type DATA. DATA is just a name, you may choose another name. Other in-built resources are cursors, dialogs, string-tables etc.
I've created a small library called elfdataembed which provides a simple interface for extracting/referencing sections embedded using objcopy. This allows you to pass the offset/size to another tool, or reference it directly from the runtime using file descriptors. Hopefully this will help someone in the future.
It's worth mentioning this approach is more efficient than compiling to a symbol, as it allows external tools to reference the data without needing to be extracted, and it also doesn't require the entire binary to be loaded into memory in order to extract/reference it.
Use nm data.o to see what it named the symbols. It may be something as simple as the filesystem differences causing the filename-derived symbols to be different (eg filename capitalized).
Edit: Just saw your second question. If you are using threads you can make a pipe and pass that to the library (first using fdopen() if it wants a FILE *). If you are more specific about the API you need to talk to I can add more specific advice.