Polish diacritical marks in libcurl (C++)

I have a problem with text that contains Polish diacritical marks (e.g. ą, ć, ę, ł, ń, ó, ś, ź, ż) obtained via libcurl from a server. I'm trying to display this text correctly in a Windows C++ console application.
I solved a similar problem, printing something like this to the console:
cout << "ąćęźół";
by switching the codepage of my source file to DOS codepage 852 (Central Europe). Unfortunately, that doesn't work for text coming from libcurl; I think it only works for text written directly into the code. Could you give me some helpful pointers? I have no idea how to resolve this issue.

Well, I've written a temporary solution for my problem. It works fine, but I'm not content with this approach:
char* cpl(const char* input)
{
    size_t length = strlen(input);
    char* output = new char[length + 1];
    /* Order of the diacritics:
       Ą ą Ć ć Ę ę
       Ł ł Ń ń Ó ó
       Ś ś Ź ź Ż ż
    */
    // ISO-8859-2 (Latin-2) codes of the letters listed above
    const unsigned char pld_in[] = {
        0xA1, 0xB1, 0xC6, 0xE6, 0xCA, 0xEA,
        0xA3, 0xB3, 0xD1, 0xF1, 0xD3, 0xF3,
        0xA6, 0xB6, 0xAC, 0xBC, 0xAF, 0xBF
    };
    // The corresponding CP852 codes understood by the console
    const unsigned char pld_out[] = {
        0xA4, 0xA5, 0x8F, 0x86, 0xA8, 0xA9,
        0x9D, 0x88, 0xE3, 0xE4, 0xE0, 0xA2,
        0x97, 0x98, 0x8D, 0xAB, 0xBD, 0xBE
    };
    for (size_t i = 0; i < length; i++)
    {
        bool modified = false;
        for (size_t j = 0; j < sizeof(pld_in); j++)
        {
            // Compare as unsigned char to avoid sign-extension tricks
            if (static_cast<unsigned char>(input[i]) == pld_in[j])
            {
                output[i] = static_cast<char>(pld_out[j]);
                modified = true;
                break;
            }
        }
        if (!modified)
            output[i] = input[i];
    }
    output[length] = '\0';
    return output;
}
Could you propose a better solution to this problem, one that doesn't require converting the characters like this?

The content of the web page returned by libcurl will use the character set of the web page. What's likely happening here is that it's not the character set used by your codepage, which I presume is the MS-Windows term for a locale.
libcurl should let you look at the headers of the HTTP response that was received from the server. Look at the Content-Type: header, which will indicate which character set the returned text uses; then look up which codepage uses the same character set.
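For example, a minimal sketch of reading the server-reported content type, assuming an easy handle named curl on which curl_easy_perform() has already succeeded:
// Ask libcurl for the Content-Type of the response that was just received.
const char *contentType = NULL;
if (curl_easy_getinfo(curl, CURLINFO_CONTENT_TYPE, &contentType) == CURLE_OK && contentType != NULL)
{
    // Typically something like "text/html; charset=ISO-8859-2" or "text/html; charset=utf-8";
    // the charset parameter tells you which codepage the received bytes correspond to.
    std::cout << "Content-Type: " << contentType << std::endl;
}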

Related

How can I iterate over CString and compare its characters to int?

I'm writing a dialog-based MFC application in Visual Studio 2017 in C++. In the dialog I added a list control where the user can change the values of the cells.
After the user changes the values, I want to check whether those values are valid (so if they accidentally pressed the wrong button they will be notified). For this purpose I iterate over the cells of the list and extract the text of each cell into a CString variable. I want to check that this variable has exactly 8 characters, each of which is '1' or '0'. The problem with the code I've written is that I get weird values when I try to print the individual characters of the CString variable.
The code for checking the validity of the CString:
void CEditableListControlDlg::OnBnClickedButton4()
{
    // TODO: Add your control notification handler code here
    // Iterate over the different cells
    int bit7Col = 2;
    int bit6Col = 3;
    int bit5Col = 4;
    int bit4Col = 5;
    int bit3Col = 6;
    int bit2Col = 7;
    int bit1Col = 8;
    int bit0Col = 9;
    for (int i = 0; i < m_EditableList.GetItemCount(); ++i) {
        CString bit7 = m_EditableList.GetItemText(i, bit7Col);
        CString bit6 = m_EditableList.GetItemText(i, bit6Col);
        CString bit5 = m_EditableList.GetItemText(i, bit5Col);
        CString bit4 = m_EditableList.GetItemText(i, bit4Col);
        CString bit3 = m_EditableList.GetItemText(i, bit3Col);
        CString bit2 = m_EditableList.GetItemText(i, bit2Col);
        CString bit1 = m_EditableList.GetItemText(i, bit1Col);
        CString bit0 = m_EditableList.GetItemText(i, bit0Col);
        CString cvalue = bit7 + bit6 + bit5 + bit4 + bit3 + bit1 + bit0;
        std::string value((LPCSTR)cvalue);
        int length = value.length();
        if (length != 7) {
            MessageBox("Register Value Is Too Long", "Error");
            return;
        }
        for (int i = 0; i < length; i++) {
            if (value[i] != static_cast<char>(0) || value[i] != static_cast<char>(1)) {
                char c = value[i];
                MessageBox(&c, "value"); // this is where I try to print the value
                return;
            }
        }
    }
}
This is what gets printed in the message box when I try to print one character of the variable value: I expect to see '1', but instead the message box shows '1iiiiii'.
I've tried extracting the characters directly from the variable cvalue of type CString like this:
cvalue[i]
and I got its length by using
strlen(cvalue[i])
but I've got the same result. I've also tried accessing the characters in the variable cvalue of type CString as follows:
cvalue.GetAt(i)
and to get its length by using:
cvalue.GetLength()
But again, I've got the same results.
Could anyone advise me how I can check that the characters in the CString variable cvalue are '0' or '1'?
Thank you.
You don't need to use std::string to process your strings in this case: CString works fine.
Assuming that your CString cvalue is the string you want to check, you can write a simple loop like this:
// Check cvalue for characters different than '0' and '1'
for (int i = 0; i < cvalue.GetLength(); i++)
{
    TCHAR currChar = cvalue.GetAt(i);
    if ((currChar != _T('0')) && (currChar != _T('1')))
    {
        CString message;
        message.Format(_T("Invalid character at position %d : %c"), i, currChar);
        MessageBox(message, _T("Error"));
    }
}
The reason for the apparently weird output in your case is that you are passing a pointer to a character that is not followed by a null-terminator:
// Wrong code
char c = value[i];
MessageBox(&c, "value");
If you don't want to build a CString with a formatted message containing the offending character, like I did in the previous sample code, an alternative could be creating a simple raw char array storing the character you want to output followed by the null-terminator:
// This is an array storing two chars: value[i] followed by '\0' (null)
char s[2] = {value[i], '\0'};
MessageBox(s, "value");
P.S.
I used TCHAR in my code sample instead of char, to make the code more easily portable to Unicode builds. Think of TCHAR as a preprocessor macro that maps to char for ANSI/MBCS builds (which seems to be your current case), and to wchar_t for Unicode builds.
A Brief Note on Validating User Input
With the above answer, I tried to strictly address your specific problem with CString character validation. But, if you can take a look from a broader perspective, I would definitely consider validating the user input before storing it in the list-view control. For example, you could handle the LVN_ENDLABELEDIT notification from the list-view control, and reject invalid input values.
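A hedged sketch of that approach, assuming in-place editing is enabled and a handler is wired up with ON_NOTIFY(LVN_ENDLABELEDIT, IDC_LIST1, &CEditableListControlDlg::OnLvnEndlabeleditList); the handler name and control ID are only placeholders:
void CEditableListControlDlg::OnLvnEndlabeleditList(NMHDR* pNMHDR, LRESULT* pResult)
{
    NMLVDISPINFO* pDispInfo = reinterpret_cast<NMLVDISPINFO*>(pNMHDR);
    if (pDispInfo->item.pszText == nullptr)    // editing was cancelled
    {
        *pResult = FALSE;
        return;
    }
    CString newText = pDispInfo->item.pszText;
    // TRUE accepts the new label text, FALSE rejects it and keeps the old value.
    *pResult = (newText == _T("0") || newText == _T("1")) ? TRUE : FALSE;
}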
Or, considering that the only valid values for each bit are 0 and 1, you could let the user select them from a combo-box.
Doing that starting from MFC's CListCtrl is non-trivial work, so you may also consider using other open-source controls, like the CGridListCtrlEx control available on CodeProject.
As you write in your last paragraph, you want to "check that the characters in the variable cvalue of type CString are '0' or '1'".
That's exactly how to do it: '0' is the character zero. But your code checks for the integer 0, which is equal to the character '\0', the end-of-string character (and likewise for 1 versus '1').
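In other words, the inner test should compare against the character literals and combine the two comparisons with && rather than || (with ||, every character differs from at least one of the two values, so the condition is always true):
// Corrected inner check: character literals '0' and '1', joined with &&
if (value[i] != '0' && value[i] != '1') {
    // invalid character found: report it here
}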

Encoding Vietnamese characters from ISO88591, UTF8, UTF16BE, UTF16LE, UTF16 to Hex and vice versa using C++

I have edited my post. What I'm currently trying to do is encode an input string from the user and then convert it to hex. I can do it properly if it does not contain any Vietnamese characters, e.g. if my inputString is "Hello". But when I input a string such as "Tôi", I don't know how to do it.
enum Encodings { USASCII, ISO88591, UTF8, UTF16BE, UTF16LE, UTF16, BIN, OCT, HEX };

switch (Encodings)
{
case USASCII:
    ASCIIToHex(inputString, &ascii);         // hello output 48656C6C6F
    return new ByteField(ascii.c_str());
case ISO88591:
    ASCIIToHex(inputString, &ascii);         // hello output 48656C6C6F
                                             // tôi output 54F469
    return new ByteField(ascii.c_str());
case UTF8:
    ASCIIToHex(inputString, &ascii);         // hello output 48656C6C6F
                                             // tôi output 54C3B469
    return new ByteField(ascii.c_str());
case UTF16BE:
    ToUTF16(inputString, &ascii, Encodings); // hello output 00480065006C006C006F
                                             // tôi output 005400F40069
    return new ByteField(ascii.c_str());
case UTF16:
    ToUTF16(inputString, &ascii, Encodings); // hello output FEFF00480065006C006C006F
                                             // tôi output FEFF005400F40069
    return new ByteField(ascii.c_str());
case UTF16LE:
    ToUTF16(inputString, &ascii, Encodings); // hello output 480065006C006C006F00
                                             // tôi output 5400F4006900
    return new ByteField(ascii.c_str());
}
void StringUtilLib::ASCIIToHex(std::string s, std::string * result)
{
    int n = s.length();
    for (int i = 0; i < n; i++)
    {
        unsigned char c = s[i];
        long val = long(c);
        std::string bin = "";
        while (val > 0)
        {
            (val % 2) ? bin.push_back('1') : bin.push_back('0');
            val /= 2;
        }
        reverse(bin.begin(), bin.end());
        result->append(ConvertBinToHex(bin));
    }
}
std::string ToUTF16(std::string s, std::string * result, int encodings) {
    int n = s.length();
    if (encodings == UTF16) {
        result->append("FEFF");
    }
    for (int i = 0; i < n; i++)
    {
        int val = int(s[i]);
        std::string bin = "";
        while (val > 0)
        {
            (val % 2) ? bin.push_back('1') : bin.push_back('0');
            val /= 2;
        }
        reverse(bin.begin(), bin.end());
        if (encodings == UTF16 || encodings == UTF16BE) {
            result->append("00" + ConvertBinToHex(bin));
        }
        if (encodings == UTF16LE) {
            result->append(ConvertBinToHex(bin) + "00");
        }
    }
}
std::string ConvertBinToHex(std::string str) {
    long long temp = atoll(str.c_str());
    int dec_value = 0;
    int base = 1;
    int i = 0;
    while (temp) {
        int last_digit = temp % 10;
        temp = temp / 10;
        dec_value += last_digit * base;
        base = base * 2;
    }
    char hexaDeciNum[10];
    while (dec_value != 0)
    {
        int temp = 0;
        temp = dec_value % 16;
        if (temp < 10)
        {
            hexaDeciNum[i] = temp + 48;
            i++;
        }
        else
        {
            hexaDeciNum[i] = temp + 55;
            i++;
        }
        dec_value = dec_value / 16;
    }
    str.clear();
    for (int j = i - 1; j >= 0; j--) {
        str = str + hexaDeciNum[j];
    }
    return str;
}
The question is completely unclear. To encode something you need an input, right? So when you say "Encoding Vietnamese Character to UTF8, UTF16", what's your input string and what's its encoding before converting to UTF-8/16? How do you input it, from a file or the console?
And why on earth are you converting to binary and then to hex? You can print bytes directly in binary or hex; there's no need to go through a binary string first. Note that converting to binary like that is fine for testing but vastly inefficient in production code. I also don't know what you mean by "But what if my letter is "Á" or "À" which is a Vietnamese letter I cannot get the value of it". Please show a minimal, reproducible example along with the input/output.
But I think you just want to output the UTF-encoded bytes of a string literal in the source code like "ÁÀ". In that case it isn't called "encoding a string" but just "outputting a string".
Both Á and À in Unicode can be represented by precomposed characters (U+00C1 and U+00C0) or by combining characters (A + U+0301 ◌́ / U+0300 ◌̀); you can switch between them by selecting "Unicode dựng sẵn" or "Unicode tổ hợp" in Unikey. Suppose you have those characters in string-literal form; then std::string str = "ÁÀ" contains a series of bytes that corresponds to those letters in the source file's encoding. So depending on which encoding you save the *.cpp file as (CP1252, CP1258, UTF-8, ...), the output byte values will be different.
To force UTF-8/16/32 encoding you just need to use the u8, u and U prefix respectively, along with the correct type (char8_t, char16_t, char32_t or std::u8string/std::u16string/std::u32string):
std::u8string utf8 = u8"ÁÀ";
std::u16string utf16 = u"ÁÀ";
std::u32string utf32 = U"ÁÀ";
Then just use c_str() to get the underlying buffers and print the bytes. In C++14 std::u8string is not available yet so just save the file as UTF-8 and use std::string. Similarly you can read std::u*string directly from std::cin to print the encoding of a user-input string
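As a small illustration of that last point, a sketch that prints the UTF-8 bytes of a literal in hex (it compiles as shown up to C++17; from C++20 on, u8 literals are char8_t-based and would need std::u8string):
#include <cstdio>
#include <string>

int main()
{
    std::string utf8 = u8"ÁÀ";         // u8 forces UTF-8 encoding: bytes C3 81 C3 80
    for (unsigned char byte : utf8)
        std::printf("%02X", byte);     // prints C381C380
    std::printf("\n");
}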
Edit:
To convert between UTF encodings use the standard std::codecvt, std::wstring_convert, std::codecvt_utf8_utf16...
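For instance, a sketch using those facilities (available since C++11, deprecated from C++17 on but still present) to convert UTF-8 to UTF-16:
#include <codecvt>
#include <locale>
#include <string>

std::u16string Utf8ToUtf16(const std::string& utf8)
{
    // codecvt_utf8_utf16 converts between UTF-8 byte sequences and UTF-16 code units
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.from_bytes(utf8);
}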
Working with non-Unicode encodings is trickier and needs an external library like ICU, or OS-dependent APIs: WideCharToMultiByte and MultiByteToWideChar on Windows, or iconv on Linux.
Limiting to ISO-8859-1 makes it easier but you still need many lookup tables, and there's no way to convert other encodings to ASCII without loss of information
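For the specific direction ISO-8859-1 to UTF-8 no table is needed at all, because Latin-1 byte values coincide with the first 256 Unicode code points; a sketch of such a conversion:
#include <string>

std::string Latin1ToUtf8(const std::string& latin1)
{
    std::string utf8;
    for (unsigned char c : latin1) {
        if (c < 0x80) {
            utf8 += static_cast<char>(c);                 // ASCII maps to itself
        } else {
            utf8 += static_cast<char>(0xC0 | (c >> 6));   // lead byte of a 2-byte sequence
            utf8 += static_cast<char>(0x80 | (c & 0x3F)); // continuation byte
        }
    }
    return utf8;
}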
-64 is the correct representation of À if you are using signed char and CP1258. If you want a positive number you need to cast to unsigned char first.
If you are indeed using CP1258, you are probably on Windows. To convert your input string to UTF-16, you probably want to use a Windows platform API such as MultiByteToWideChar which accepts a code page parameter (of course you have to use the correct code page). Alternatively you may try a standard function like mbstowcs but you need to set up your locale correctly before using it.
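A hedged sketch of that route (Windows only; 1258 is the Windows code page identifier for Vietnamese, and you would substitute whatever code page your input actually uses):
#include <windows.h>
#include <string>

std::wstring Cp1258ToUtf16(const std::string& input)
{
    if (input.empty()) return std::wstring();
    // The first call computes the required length, the second performs the conversion.
    int wideLen = MultiByteToWideChar(1258, 0, input.c_str(), static_cast<int>(input.size()), nullptr, 0);
    std::wstring wide(static_cast<size_t>(wideLen), L'\0');
    MultiByteToWideChar(1258, 0, input.c_str(), static_cast<int>(input.size()), &wide[0], wideLen);
    return wide;
}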
You might find it easier to switch to wide characters throughout your application, and avoid most transcoding.
As a side note, converting an integer to binary only to convert that to hexadecimal is not an easy or efficient way to display a hexadecimal representation of an integer.

insert utf-8 data in openldap with c api

What is the correct method to insert UTF-8 data into an OpenLDAP database? I have data in a std::wstring which is UTF-encoded with:
std::wstring converted = boost::locale::conv::to_utf<wchar_t>(line, "Latin1");
When the string needs to be added to an LDAPMod structure, I use this function:
std::string str8(const std::wstring& s) {
    return boost::locale::conv::utf_to_utf<char>(s);
}
to convert from wstring to string. This is used in my function to create an LDAPMod:
LDAPMod ** y::ldap::server::createMods(dataset& values) {
    LDAPMod ** mods = new LDAPMod*[values.elms() + 1];
    mods[values.elms()] = NULL;
    for (int i = 0; i < values.elms(); i++) {
        mods[i] = new LDAPMod;
        data & d = values.get(i);
        switch (d.getType()) {
            case NEW:    mods[i]->mod_op = 0; break;
            case ADD:    mods[i]->mod_op = LDAP_MOD_ADD; break;
            case MODIFY: mods[i]->mod_op = LDAP_MOD_REPLACE; break;
            case DELETE: mods[i]->mod_op = LDAP_MOD_DELETE; break;
            default: assert(false);
        }
        std::string type = str8(d.getValue(L"type"));
        mods[i]->mod_type = new char[type.size() + 1];
        std::copy(type.begin(), type.end(), mods[i]->mod_type);
        mods[i]->mod_type[type.size()] = '\0';
        mods[i]->mod_vals.modv_strvals = new char*[d.elms(L"values") + 1];
        for (int j = 0; j < d.elms(L"values"); j++) {
            std::string value = str8(d.getValue(L"values", j));
            mods[i]->mod_vals.modv_strvals[j] = new char[value.size() + 1];
            std::copy(value.begin(), value.end(), mods[i]->mod_vals.modv_strvals[j]);
            mods[i]->mod_vals.modv_strvals[j][value.size()] = '\0';
        }
        mods[i]->mod_vals.modv_strvals[d.elms(L"values")] = NULL;
    }
    return mods;
}
The resulting LDAPMod is passed on to ldap_modify_ext_s and works as long as I only use ASCII characters. But if other characters are present in the string, I get an LDAP operations error.
I've also tried this with the function provided by the ldap library (ldap_x_wcs_to_utf8s) but the result is the same as with the boost conversion.
It's not the conversion itself that is wrong, because if I convert the modifications back to a std::wstring and show it in my program output, the encoding is still correct.
AFAIK OpenLDAP has supported UTF-8 for a long time, so I wonder if there's something else that must be done before this works.
I've looked into the OpenLDAP client/tools examples, but the UTF-8 functions provided by the library are never used there.
Update:
I noticed I can insert UTF-8 characters like é into LDAP with Apache Directory Studio, and I can retrieve these values from LDAP in my C++ program. But if I insert the same value again, without changing anything in that string, I get the LDAP operations error again.
It turns out that my code was not wrong at all. My modifications tried to store the full name in the 'displayName' attribute as well as in 'gecos', but apparently 'gecos' cannot handle UTF-8 data (its syntax in the standard NIS schema is IA5String, i.e. ASCII only).
We don't actually use gecos anymore. The value was only present because of some software we used years ago, so I removed it from the directory.
What made it hard to find was that even though the loglevel was set to 'parse', this error was still not in the logs.
Because libldap can be such a hard nut to crack, I'll include a link to the complete code of the project I'm working on. It might serve as a starting point for other programmers. (Most of the code in the tutorials I have found is outdated.)
https://github.com/yvanvds/yATools/tree/master/libadmintools/ldap

sprintf - 4 useless ascii characters before string

I'm working with Visual Studio 10, the Qt add-in, and the OpenCV library.
What I want to do is to load multiple files using a for-loop:
(I have ui.image_templates_comboBox->currentText() = "cat")
for (int i = 1; i <= 15; i++){
    string currentText = ui.image_templates_comboBox->currentText().toStdString();
    char name[40];
    sprintf(name, "Logos/cat/%s_%d.tif", &currentText, i);
    templ_img[i] = cv::imread( name );
So I thought this should work OK, but when I debug it and hover my mouse over "name", I notice that there are 4 non-English characters preceding the currentText value.
I ask 2 questions:
a) How is it possible to omit those 4 useless characters? (I typed them as "1234" because this site couldn't display them.)
name 0x003a7b04 "Logos/cat/1234cat_1.tif" char [40]
b) Is it possible to collapse those 4 lines into 1 using an expression inside imread()?
You cannot use the address of an std::string where a const char* is expected. They are not the same.
sprintf(name, "Logos/cat/%s_%d.tif", currentText.c_str(), i);
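Put together, the corrected loop might look like this (a sketch assuming the same ui member and templ_img array as in the question; std::snprintf is used instead of sprintf so the 40-byte buffer cannot be overrun):
for (int i = 1; i <= 15; i++) {
    std::string currentText = ui.image_templates_comboBox->currentText().toStdString();
    char name[40];
    std::snprintf(name, sizeof(name), "Logos/cat/%s_%d.tif", currentText.c_str(), i);
    templ_img[i] = cv::imread(name);
}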
You are mixing frameworks too much, and you do not understand how sprintf works. Fix it like this:
for (int i = 1; i <= 15; i++){
    QString fileName = QString("Logos/cat/%1_%2.tif")
        .arg(ui.image_templates_comboBox->currentText())
        .arg(i);
    templ_img[i] = cv::imread(fileName.toAscii().data()); // or: toLocal8Bit(), toLatin1(), toUtf8()
}

Replacing chars from string

My code is the following (reduced):
CComVariant* input is an input parameter
CString cstrPath(input->bstrVal);
const CHAR cInvalidChars[] = {"/*&#^°\"§$[]?´`\';|\0"};
for (unsigned int i = 0; i < strlen(cInvalidChars); i++)
{
    cstrPath.Replace(cInvalidChars[i], _T(''));
}
When debugging, the value of cstrPath is L"§" and the value of cInvalidChars[7] is -89 '§'.
I have tried to use .Remove() before, but the problem remains the same: when it comes to § or ´, the code table does not seem to match, the character does not get recognized properly, and it will not be removed. Using a TCHAR array for the invalid characters results in yet other problems ('§' -> 'ᄡ').
The problem seems to be that I am not using the correct code tables, but everything I have tried so far has not been successful.
I want to successfully replace/delete any occurring '§'.
I also have had a look at several "delete character from string"-Posts but I did not find anything that helped me.
executable code:
CComVariant* pccovaValue = new CComVariant();
pccovaValue->bstrVal = L"§§";
const CHAR cInvalidChars[] = {"§"};
CString cstrPath(pccovaValue->bstrVal);
for (unsigned int i = 0; i < strlen(cInvalidChars); i++)
{
    cstrPath.Remove(cInvalidChars[i]);
}
cstrPath = cstrPath;
Just set a breakpoint on cstrPath = cstrPath; and inspect the result.
According to the comments, you are mixing up Unicode and ANSI encodings. It seems that your application is targeting Unicode, which is good. You should stop using ANSI altogether.
Declare cInvalidChars like this:
CString cInvalidChars = L"/*&#^°\"§$[]?´`\';|";
The use of the L prefix means that the string literal is a wide character UTF-16 literal.
Then your loop can look like this:
for (int i = 0; i < cInvalidChars.GetLength(); i++)
    cstrPath.Remove(cInvalidChars[i]);