urlDecode - PHP-style function in C++

I have a urlDecode function, but when I decode a string like:
P%C4%99dz%C4%85cyJele%C5%84
I get mojibake output (something like PÄ™dzÄ…cyJeleÅ„) instead of the correct PędzącyJeleń. I think it breaks because of the Polish characters.
I tried setting the compiler to Use Unicode Character Set and to Use Multi-Byte Character Set, and I tried doing it with wstrings, but I got a lot of errors. :|
I suppose I should use wstring instead of string, but could you tell me how? Isn't there an easier way to solve my problem? (I have heard a lot about wstring vs. string and still don't quite understand it; people say wstring shouldn't be used on Linux, but I'm on Windows.)
// link to my functions at the bottom:
http://bogomip.net/blog/cpp-url-encoding-and-decoding/
// EDIT
When I change every string to wstring and fstream to wfstream, the problem remains. Look:
z%C5%82omiorz - this wstring (read from a file) should decode to złomiorz, but the function prints L"z197130omiorz".
What is 197130? How do I fix that?

Related

C++ URL decode with UTF-8 characters error

I can't find any solution to my problem with UTF-8 characters inside an encoded URL in C++ with Visual Studio.
I have this URL-encoded string:
//Encoded
%5C%CE%A4%CE%B5%CF%83%CF%84%5C
//Decoded
\Τεστ\
Any online encoder/decoder built on the PHP functions decodes the string above correctly, but in C++ with Visual Studio every attempt I made at decoding the URL codes failed.
I use Unicode Character Set in my project, and I receive the encoded URL like this (p.s. I can't change the way I receive it; it is an encoded URL in a std::string):
std::string EncURL = "%5C%CE%A4%CE%B5%CF%83%CF%84%5C";
I then tried many decoding functions from the internet to make it readable and usable, but they always return Chinese characters instead of the correct ones.
Below is one function among the many I tried; it works only if the encoded URL contains no UTF-8 characters.
string url_decode(string src) {
    string ret;
    for (size_t i = 0; i < src.length(); i++) {
        if (src[i] == '%') {
            int value = 0;
            sscanf(src.substr(i + 1, 2).c_str(), "%x", &value);
            ret += static_cast<char>(value);
            i += 2;
        } else {
            ret += src[i];
        }
    }
    return ret;
}
Will anyone give me a good way to write a URL-decoding function that works properly even with UTF-8 characters inside?
Any type or technique may be used after the std::string EncURL; I just need a proper URL decoder in C++ for the string I receive.
---------- Update
The reason I need to convert is obvious: opening a URL or a file path (folder or file) from within C++. But the encoded URL string arrives already encoded inside C++ from outside the application (database, web, chat, file, etc.).
So I need to decode it as UTF-8, since I need the non-Latin part correct for multi-language purposes, and then use the decoded UTF-8 string for whatever it is needed for.
This update may help clarify that the goal is not just conversion for wcout or cout, but the string's actual target purpose, which I really need and which nothing so far handles as it should.
Thank you in advance.
I will answer my own question, since I found the solution; it may come in handy for anyone else using std::string who wants UTF-8 characters handled correctly.
The solution is to convert the encoded URL std::string to a std::wstring and then use any URL decoder on the wstring.
The decoded wstring is then correct, matching the PHP online URL encode/decode results, so you can use it as you like.
For output, even in a Windows console application, you can use MessageBox to see the correct wstring.

Conversion from char* to wchar_t* does not work properly

I'm getting a string like "aña!a¡a¿a?a" from the server, so I decode it and then pass it to a function.
What I need to do with the message is load paths depending on the letters.
The header of my function is void SetInfo(int num, char *descr[4]), so it receives one number and an array of four C strings (sentences). To make it easier, let's say I just need to work with descr[0].
When I debug and reach SetInfo(), I see the exact message in the debug view: "aña!a¡a¿a?a", so up to here everything is OK.
Initially, the info I received in that function was a std::wstring, so all my code working with the message used wstrings and strings; but now what I receive is a char*, as shown in the header. The message arrives here fine, but when I try to work with it I can't, because when I debug and inspect each position of descr[0] I get:
descr[0][0] = 'a'; // OK
descr[0][1] = 'Ã'; // BAD
so I tried converting char* to wchar_t* with code found here:
size_t size = strlen(descr[0]) + 1;
wchar_t* wa = new wchar_t[size];
mbstowcs(wa,descr[0],size);
But then the debugger shows me that wa has:
wa wchar_t * 0x185d4be8 L"a-\uffffffff刯2e2e牵6365⽳6f73歯6f4c楲6553䈯736f獵6e6f档6946琯7361灭6569湰2e6f琀0067\021ᡰ9740슃b8\020\210=r"
which I suppose is incorrect (I assume I should see the same initial message, "aña!a¡a¿a?a"; if this result is actually fine, then I don't know how to get what I need...).
So my question is: how can I get descr[0][0] = 'a' and descr[0][1] = 'ñ'? I can't just pass char to wchar_t (you've already seen what I got). Am I doing it wrong, or is there another way? I'm really stuck on this, so any idea will be much appreciated.
Before, when I was working with wstrings (and it worked fine), I was doing something like this:
if (word[i]==L'\x00D1' or word[i]==L'\x00F1') // ñ or Ñ
path ="PathOfÑ";
where word[i] corresponds to descr[0][1] in this case, but with wstrings. That is how I knew word[i] was the letter 'ñ'. Maybe this helps to show what I'm doing.
(BTW, I'm working in Eclipse, on Linux.)
The mbstowcs function works on C-style strings, and one thing about C-style strings is that they end in a special terminating character, '\0'. You don't seem to be adding this terminator to the string, so mbstowcs runs past the end of the actual data, giving you undefined behavior.

Read special characters from file

For a few days I have been trying to solve an encoding issue in my C++ code.
I have a txt file (saved as UTF-8) with the following content: "québécois" (and I have only read rights over the file).
I am reading it with the ReadFile function into a std::string variable.
The first surprise is that strContent is not "québécois" but "quÃ©bÃ©cois" (double encoded??).
Why?
How can I read its content into a std::string variable so that I obtain "québécois" (UTF-8 to ANSI)?
Until now I have found two ways to get somewhere close to what I want, but neither satisfies me enough.
1. Convert from:
std::string to std::wstring (using MultiByteToWideChar)
create a CString from the wstring
use CT2CA to obtain a new std::string
and finally use MultiByteToWideChar again for a final std::wstring containing "québécois"
I think this method is unsafe and not very smart, and I end up with a wstring, not a string.
2. I found this article:
How to convert from UTF-8 to ANSI using standard c++
If I use the utf8_to_string function twice, I obtain what I want, but I am still not sure it's the right way.
Can anybody help me with a solution for this issue?

How do I convert wchar_t* to string?

I am new to C++, and I am trying to convert a wchar_t* to a string.
I cannot use wstring in this situation.
I have the code below:
wchar_t *wide = L"中文";
wstring ret = wstring( wide );
string str2( ret.begin(), ret.end() );
But str2 ends up containing strange characters.
What do I have to fix?
You're trying to do it backwards. Instead of truncating wide characters to chars (which is very lossy), expand your chars to wide characters; that is, transform your std::string into a std::wstring and concatenate the two std::wstrings.
I'm not sure what platform you're targeting. On Windows you can call the WideCharToMultiByte API function; refer to MSDN for the documentation.
On Linux, I think you can use the libiconv functions; try Google. Of course there is a port of libiconv for Windows as well.
In general this is quite a complex topic for a beginner; if you know nothing about character encodings, there is a lot of background knowledge to learn.

UCS-2LE text file parsing

I have a text file created with a Microsoft reporting tool. The file starts with the BOM 0xFFFE and then contains ASCII character output with NUL bytes between the characters (i.e. "F.i.e.l.d.1."). I can use iconv to convert this to UTF-8, with UCS-2LE as the input format and UTF-8 as the output format; that works great.
My problem is that I want to read lines from the UCS-2LE file into strings, parse out the field values, and then write them to an ASCII text file (i.e. "Field1 Field2"). I have tried both the string and wstring versions of getline; they read the string from the file, but functions like substr(start, length) interpret the string as 8-bit values, so the start and length values are off.
How do I read the UCS-2LE data into a C++ string and extract the data values? I have looked at Boost and ICU, and done numerous Google searches, but have not found anything that works. What am I missing? Please help!
My example code looks like this:
wifstream srcFile;
srcFile.open(argv[1], ios_base::in | ios_base::binary);
...
wstring srcBuf;
...
while (getline(srcFile, srcBuf))
{
    wstring field1;
    field1 = srcBuf.substr(12, 12);
    ...
}
So if, for example, srcBuf contains "W.e. t.h.i.n.k. i.n. g.e.n.e.r.a.l.i.t.i.e.s.", then the substr() above returns ".k. i.n. g.e" instead of "g.e.n.e.r.a.l.i.t.i.e.s.".
What I want is to read in the string and process it without having to worry about the multi-byte representation. Does anybody have an example of using Boost (or something else) to read these strings from the file and convert them to a fixed-width representation for internal use?
BTW, I am on a Mac, using Eclipse and gcc. Is it possible my STL does not understand wide character strings?
Thanks!
Having spent some good hours tackling this question, here are my conclusions:
Reading a UTF-16 (or UCS-2LE) file is manageable in C++11; see How do I write a UTF-8 encoded string to a file in Windows, in C++.
Since codecvt_utf16 is part of C++11, you can just use it (see the bullet below for eventual code samples).
However, on older compilers (e.g. MSVC 2008), you can use locale and a custom codecvt facet/"recipe", as very nicely exemplified in this answer to Writing UTF16 to file in binary mode.
Alternatively, you can also try this method of reading, though it did not work in my case: the output was missing lines, which were replaced by garbage characters.
I wasn't able to get this done with my pre-C++11 compiler and had to resort to scripting the task in Ruby and spawning a process (it's only used in a test, so I think that kind of complication is acceptable there).
Hope this spares others some time; happy to help.
substr works fine for me on Linux with g++ 4.3.3. The program
#include <string>
#include <iostream>
using namespace std;

int main()
{
    wstring s1 = L"Hello, world";
    wstring s2 = s1.substr(3, 5);
    wcout << s2 << endl;
}
prints "lo, w" as it should.
However, the file reading probably does something different from what you expect: it converts the file from the locale encoding to wchar_t, which causes each byte to become its own wchar_t. I don't think the standard library supports reading UTF-16 into wchar_t.