C++ URL decode with UTF-8 characters error

I can't find a solution to my problem with UTF-8 characters inside an encoded URL in C++ (Visual Studio).
I have this URL-encoded string:
//Encoded
%5C%CE%A4%CE%B5%CF%83%CF%84%5C
//Decoded
\Τεστ\
Any online encoder/decoder built on the PHP functions decodes the string above correctly, but every attempt I have made to decode it in C++ with Visual Studio has failed.
I use the Unicode Character Set in my project and I receive the encoded URL as follows (P.S. I can't change the way I receive it; it is an encoded URL in a std::string):
std::string EncURL = "%5C%CE%A4%CE%B5%CF%83%CF%84%5C";
I then tried many decoding functions from the internet to make it readable and usable, but they always return Chinese characters instead of the correct ones.
Below is one of the many functions I tried; it works only if the encoded URL contains no UTF-8 characters.
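string url_decode(string src){
    string ret;
    char ch;
    unsigned int ii;   // sscanf's "%x" expects a pointer to unsigned int
    for(size_t i=0; i<src.length(); i++){
        if(src[i] == '%'){   // '%' is ASCII 37
            // read the two hex digits that follow the '%'
            sscanf(src.substr(i+1,2).c_str(), "%x", &ii);
            ch = static_cast<char>(ii);
            ret += ch;
            i = i+2;
        }else{
            ret += src[i];
        }
    }
    return (ret);
}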
Can anyone suggest a good way to write a URL-decoding function that works properly even with UTF-8 characters inside?
It doesn't matter what type or approach is used after the std::string EncURL; I just need a proper URL decoder in C++ for the string I receive.
---------- Update
I need the conversion in order to open a URL or a file path (folder or file) from within C++, but the encoded URL string arrives already encoded from outside the application (database, web, chat, file, etc.).
So I need to decode it as UTF-8, because the non-Latin part must be correct for multi-language purposes, and then use the decoded UTF-8 string wherever it is needed.
To be clear, I'm not converting it just to print with wcout or cout but for this actual purpose, and nothing has worked as it should so far.
Thank you in advance

I will answer my own question since I found the solution; it may come in handy for anyone else who uses std::string and wants UTF-8 characters to come out correctly.
The solution is to convert the encoded URL from std::string to std::wstring and then use any URL decoder to decode the wstring.
The decoded wstring is always correct, matching the online PHP URL encoders and decoders.
So you can use the decoded wstring as you like.
For output, even in a Windows console application, you can use
MessageBox to see the correct output with the wstring.
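As an alternative sketch (not the method described above), the percent escapes can be decoded straight into raw bytes and the result treated as UTF-8. The names hex_value and url_decode_bytes are made up for illustration. On Windows, the resulting bytes could then be converted to a wstring with MultiByteToWideChar and CP_UTF8 before being displayed or used as a path; that step is left out here to keep the sketch portable.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: numeric value of one hex digit, or -1 if c is not one.
static int hex_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Percent-decodes src into raw bytes. If the URL was produced from UTF-8
// text, the returned std::string holds that UTF-8 text unchanged.
std::string url_decode_bytes(const std::string& src) {
    std::string out;
    out.reserve(src.size());
    for (std::size_t i = 0; i < src.size(); ++i) {
        if (src[i] == '%' && i + 2 < src.size()) {
            const int hi = hex_value(src[i + 1]);
            const int lo = hex_value(src[i + 2]);
            if (hi >= 0 && lo >= 0) {
                out += static_cast<char>(hi * 16 + lo);
                i += 2;  // skip the two hex digits just consumed
                continue;
            }
        }
        out += src[i];  // ordinary character, or a malformed escape copied as-is
    }
    return out;
}
```

With the string from the question, url_decode_bytes(EncURL) yields the ten bytes 5C CE A4 CE B5 CF 83 CF 84 5C, which is the UTF-8 encoding of \Τεστ\.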

Related

CStdioFile problems with encoding on read file

I can't read a file correctly using CStdioFile.
I open notepad.exe, type àèìòùáéíóú and save the file twice: once with the encoding set to ANSI (which is really CP-1252) and once as UTF-8.
Then I try to read it from MFC with the following block of code:
BOOL ReadAllFileContent(const CString &FilePath, CString *fileContent)
{
    CString sLine;
    BOOL isSuccess = false;
    CStdioFile input;
    isSuccess = input.Open(FilePath, CFile::modeRead);
    if (isSuccess) {
        while (input.ReadString(sLine)) {
            fileContent->Append(sLine);
        }
        input.Close();
    }
    return isSuccess;
}
When I call it with the ANSI file, I get the expected result àèìòùáéíóú,
but when I try to read the UTF-8 encoded file I get à èìòùáéíóú
I would like my function to work with all files regardless of their encoding.
What do I need to implement?
.EDIT.
Unfortunately, in the real app the files come from an external app, so changing the file encoding isn't an option. I must be able to read both UTF-8 and CP-1252 files.
Any file is valid ANSI; what Notepad calls ANSI is really the Windows-1252 encoding.
I've figured out a way to read UTF-8 and CP-1252 correctly based on the example provided here. Although it works, I need to pass in the file encoding, which I don't know in advance.
Thanks!
I personally use the class as advertised here:
https://www.codeproject.com/Articles/7958/CTextFileDocument
It has excellent support for reading and writing text files of various encodings including unicode in its various flavours.
I have not had a problem with it.
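If the only two candidates are UTF-8 and CP-1252, as in the question, one common heuristic is to check whether the raw bytes form well-formed UTF-8 and to fall back to CP-1252 otherwise. The following is only a sketch of that idea and is not part of CTextFileDocument; looks_like_utf8 is a hypothetical name, the file is assumed to have been read as raw bytes already, and the check is simplified (it does not reject every overlong or surrogate form).

```cpp
#include <cstddef>
#include <string>

// Returns true if data consists only of structurally valid UTF-8 sequences.
bool looks_like_utf8(const std::string& data) {
    const std::size_t n = data.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(data[i]);
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80) extra = 0;
        else if (lead >= 0xC2 && lead <= 0xDF) extra = 1;
        else if (lead >= 0xE0 && lead <= 0xEF) extra = 2;
        else if (lead >= 0xF0 && lead <= 0xF4) extra = 3;
        else return false;  // stray continuation byte or invalid lead byte
        if (extra > n - i - 1) return false;  // sequence is cut off
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(data[i + k]);
            if ((c & 0xC0) != 0x80) return false;  // must look like 10xxxxxx
        }
        i += extra + 1;
    }
    return true;
}
```

Pure ASCII passes the check, which is harmless because ASCII is identical in both encodings. CP-1252 text containing accented letters such as àèì almost never happens to be valid UTF-8, so it fails the check and can be read as CP-1252.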

urlDecode - php function in c++

I have a urlDecode function.
But when I decode a string like:
P%C4%99dz%C4%85cyJele%C5%84
I get output: PędzącyJeleń
Of course this is not the correct output. I think it's broken because of the Polish characters.
I tried setting in the compiler:
Use Unicode Character Set
or Use Multi-Byte Character Set
I tried to do it using wstrings but I got a lot of errors :|
I suppose I should use wstring instead of string, but could you tell me how? Isn't there an easier way to solve my problem? (I have heard a lot about wstring and string and understand little of it: wstring should not be used on Linux, but I am on Windows.)
//link to my functions at bottom
http://bogomip.net/blog/cpp-url-encoding-and-decoding/
//EDIT
When I change every string to wstring and fstream to wfstream,
the problem is still there. Look:
z%C5%82omiorz - this wstring (from a file) != złomiorz; instead the function prints L"z197130omiorz"
What is 197130? How do I fix that? :0

error about UTF_8 format while creating my xml using libxml and c++

I created an XML file using libxml and C++. What I want to do now is read text from a .txt file and put it between some specific tags.
I have tried the following code, which just reads from a file and writes the text between tags:
char * s ;
double d;
fichier>>i>>s>>d;
// fichier.close();
cout << s << endl ;
xmlNewChild(root_node, NULL, BAD_CAST "metadata",
BAD_CAST s );
While running this code, I get this error:
output error : string is not in UTF-8
So I guess that there is a format incompatibility between the input and output. Can you help me please? I don't know how to fix this.
You need to convert your input string into UTF-8 using one of the functions defined in the encoding module (or using any other encoding library you like, such as ICU). You can find details about the encoding module here: http://www.xmlsoft.org/html/libxml-encoding.html
My guess is that you want to preserve the bytes, so what you need is something like this (VERY untested and derived purely from the docs):
//Get the encoding
xmlCharEncodingHandlerPtr encoder = xmlGetCharEncodingHandler(XML_CHAR_ENCODING_ASCII);
// Each ascii byte should take up at most 2 utf-8 bytes IIRC so allocate enough space (plus one for the terminator).
unsigned char* buffer_utf8 = new unsigned char[length_of_s*2 + 1];
//Do the encoding
int consumed = length_of_s;           // in: bytes available in s, out: bytes consumed
int encoded_length = length_of_s*2;   // in: capacity of buffer_utf8, out: bytes written
int len = encoder->input(buffer_utf8, &encoded_length, (const unsigned char*)s, &consumed);
if( len<0 ) { .. error .. }
buffer_utf8[encoded_length]=0; // I'm not sure if this is automatically appended or not, so terminate explicitly.
//Now you can use buffer_utf8 rather than s.
If your input is in a different encoding supported by libxml, it should just be a matter of changing XML_CHAR_ENCODING_ASCII to the right constant, though you may also need to change the number of bytes allocated for buffer_utf8.

How to get a single character from a UTF-8 encoded Urdu string written in a file?

I am working on Urdu-Hindi translation/transliteration. My objective is to translate an Urdu sentence into Hindi and vice versa, using C++ in Visual C++ 2010. I have written an Urdu sentence in a text file saved in UTF-8 format. Now I want to read the characters from that file one at a time so that I can convert each one into its equivalent Hindi character. When I read a single character from the input file and write it to the output file, I get an unknown, ugly-looking character in the output file. Kindly help me with proper code. My code is as follows:
#include<iostream>
#include<fstream>
#include<cwchar>
#include<cstdlib>
using namespace std;
void main()
{
    wchar_t arry[50];
    wifstream inputfile("input.dat",ios::in);
    wofstream outputfile("output.dat");
    if(!inputfile)
    {
        cerr<<"File not open"<<endl;
        exit(1);
    }
    // I am using this while loop just to make sure the copy-paste
    // of the written Urdu text from one file to another works; when
    // I try to pick only one character from the file, it does not work.
    while (!inputfile.eof())
    { inputfile>>arry; }
    int i=0;
    // I want to get the Urdu character placed at each index so that
    // I can convert it into its equivalent Hindi character.
    while(arry[i] != '\0')
    {
        outputfile<<arry[i]<<endl;
        i++;
    }
    inputfile.close();
    outputfile.close();
    cout<<"Hello world"<<endl;
}
Assuming you are on Windows, the easiest way to get "useful" characters is to read a larger chunk of the file (for example a line, or the entire file) and convert it to UTF-16 using the MultiByteToWideChar function. Use the "pseudo"-codepage CP_UTF8. In many cases, decoding the UTF-16 isn't required, but I don't know about the languages you are referring to; if you expect non-BMP characters (with codes above 65535), you might want to consider decoding the UTF-16 (or decoding the UTF-8 yourself) to avoid having to deal with 2-word characters (surrogate pairs).
You can also write your own UTF-8 decoder, if you prefer. It's not complicated, and just requires some bit-juggling to extract the proper bits from the input bytes and assemble them into the final unicode value.
HINT: Windows also has a NormalizeString() function, which you can use to make sure the characters from the file are what you expect. This can be used to transform characters that have several representations in Unicode into their "canonical" representation.
EDIT: if you read up on UTF-8 encoding, you can easily see that you can read the first byte, figure out how many more bytes you need, read these as well, and pass the whole thing to MultiByteToWideChar or your own decoder (although your own decoder could just read from the file, of course). That way you could really do a "read one char at a time".
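The bit-juggling described above could look roughly like the following sketch. It is portable C++11 and does not call MultiByteToWideChar; the names next_code_point and decode_utf8 are hypothetical. It assumes the file has already been read as raw bytes into a std::string, and it performs only basic validation (overlong forms and surrogates are not rejected).

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Decodes the UTF-8 sequence starting at s[pos] (pos must be < s.size()),
// advances pos past it and returns the code point. On a malformed or
// truncated sequence it skips one byte and returns U+FFFD.
char32_t next_code_point(const std::string& s, std::size_t& pos) {
    const unsigned char lead = static_cast<unsigned char>(s[pos]);
    std::size_t extra = 0;  // continuation bytes announced by the lead byte
    char32_t cp = 0;
    if (lead < 0x80)                { extra = 0; cp = lead; }
    else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; }
    else { ++pos; return 0xFFFD; }
    if (extra > s.size() - pos - 1) { ++pos; return 0xFFFD; }
    for (std::size_t k = 1; k <= extra; ++k) {
        const unsigned char c = static_cast<unsigned char>(s[pos + k]);
        if ((c & 0xC0) != 0x80) { ++pos; return 0xFFFD; }
        cp = (cp << 6) | (c & 0x3F);  // append the low six payload bits
    }
    pos += extra + 1;
    return cp;
}

// Convenience wrapper: decodes a whole UTF-8 string into code points.
std::vector<char32_t> decode_utf8(const std::string& s) {
    std::vector<char32_t> out;
    std::size_t pos = 0;
    while (pos < s.size()) out.push_back(next_code_point(s, pos));
    return out;
}
```

Each element of the returned vector is one Unicode code point, so an Urdu letter such as U+0627 (stored as the two bytes D8 A7) arrives as a single value that can be looked up in a transliteration table.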
'w' classes do not read and write UTF-8. They read and write UTF-16. If your file is in UTF-8, reading it with this code will produce gibberish.
You will need to read it as bytes and then convert it, or write it in UTF-16 in the first place.

Can't read unicode (japanese) from a file

Hi, I have a file containing Japanese text, saved as a Unicode file.
I need to read from the file and display the information on standard output.
I am using Visual Studio 2008.
int main()
{
wstring line;
wifstream myfile("D:\sample.txt"); //file containing japanese characters, saved as unicode file
//myfile.imbue(locale("Japanese_Japan"));
if(!myfile)
cout<<"While opening a file an error is encountered"<<endl;
else
cout << "File is successfully opened" << endl;
//wcout.imbue (locale("Japanese_Japan"));
while ( myfile.good() )
{
getline(myfile,line);
wcout << line << endl;
}
myfile.close();
system("PAUSE");
return 0;
}
This program generates some random output and I don't see any Japanese text on the screen.
Oh boy. Welcome to the Fun, Fun world of character encodings.
The first thing you need to know is that your console is not unicode on windows. The only way you'll ever see Japanese characters in a console application is if you set your non-unicode (ANSI) locale to Japanese. Which will also make backslashes look like yen symbols and break paths containing european accented characters for programs using the ANSI Windows API (which was supposed to have been deprecated when Windows XP came around, but people still use to this day...)
So first thing you'll want to do is build a GUI program instead. But I'll leave that as an exercise to the interested reader.
Second, there are a lot of ways to represent text. You first need to figure out the encoding in use. Is it UTF-8? UTF-16 (and if so, little or big endian)? Shift-JIS? EUC-JP? You can only use a wstream to read directly if the file is in little-endian UTF-16. And even then you need to futz with its internal buffer. Anything other than UTF-16 and you'll get unreadable junk. And this is all only the case on Windows as well! Other OSes may have a different wstream representation. It's best not to use wstreams at all really.
So, let's assume it's not UTF-16 (for full generality). In this case you must read it as a char stream - not using a wstream. You must then convert this character string into UTF-16 (assuming you're using windows! Other OSes tend to use UTF-8 char*s). On windows this can be done with MultiByteToWideChar. Make sure you pass in the right code page value, and CP_ACP or CP_OEMCP are almost always the wrong answer.
Now, you may be wondering how to determine which code page (ie, character encoding) is correct. The short answer is you don't. There is no prima facie way of looking at a text string and saying which encoding it is. Sure, there may be hints - eg, if you see a byte order mark, chances are it's whatever variant of unicode makes that mark. But in general, you have to be told by the user, or make an attempt to guess, relying on the user to correct you if you're wrong, or you have to select a fixed character set and don't attempt to support any others.
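The byte order mark hint can be checked with a few comparisons on the first bytes of the file. A minimal sketch, in which the enum Bom and the function detect_bom are made-up names; a file without a BOM still needs one of the other strategies described above.

```cpp
#include <cstddef>
#include <string>

enum class Bom { None, Utf8, Utf16LE, Utf16BE };

// Inspects the first bytes of a file (passed as raw bytes in head) and
// reports which byte order mark, if any, they start with.
// Note: a UTF-32 little-endian BOM (FF FE 00 00) would be reported as
// Utf16LE here; UTF-32 is deliberately left out of this sketch.
Bom detect_bom(const std::string& head) {
    auto byte = [&head](std::size_t i) {
        return static_cast<unsigned char>(head[i]);
    };
    if (head.size() >= 3 && byte(0) == 0xEF && byte(1) == 0xBB && byte(2) == 0xBF)
        return Bom::Utf8;
    if (head.size() >= 2 && byte(0) == 0xFF && byte(1) == 0xFE)
        return Bom::Utf16LE;
    if (head.size() >= 2 && byte(0) == 0xFE && byte(1) == 0xFF)
        return Bom::Utf16BE;
    return Bom::None;
}
```

The result would typically select the conversion step, for example skipping three bytes and passing the rest to MultiByteToWideChar with CP_UTF8 when Bom::Utf8 is reported.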
Someone here had the same problem with Russian characters (he's using basic_ifstream<wchar_t>, which should be the same as wifstream according to this page). In the comments of that question they also link to this, which should help you further.
If I understood everything correctly, it seems that wifstream reads the characters correctly but your program tries to convert them to whatever locale your program is running in.
Two errors:
std::wifstream(L"D:\\sample.txt");
And do not mix cout and wcout.
Also check that your file is encoded in UTF-16, Little-Endian. If not so, you will be in trouble reading it.
wfstream uses wfilebuf for the actual reading and writing of the data. wfilebuf defaults to using a char buffer internally which means that the text in the file is assumed narrow, and converted to wide before you see it. Since the text was actually wide, you get a mess.
The solution is to replace the wfilebuf buffer with a wide one.
You probably also need to open the file as binary.
const size_t bufsize = 128;
wchar_t buffer[bufsize];
wifstream myfile("D:\\sample.txt", ios::binary);
myfile.rdbuf()->pubsetbuf(buffer, bufsize);
Make sure the buffer outlives the stream object!
See details here: http://msdn.microsoft.com/en-us/library/tzf8k3z8(v=VS.80).aspx