Extract character from QString and compare - C++

I am trying to compare a specific character in a QString, but getting odd results:
My QString named strModified contains: "[y]£trainstrip+[height]£trainstrip+8"
I convert the string to a standard string:
std::string stdstr = strModified.toStdString();
I can see in the debugger that 'stdstr' contains the correct contents, but when I attempt to extract a character:
char cCheck = stdstr.c_str()[3];
I get something completely different: I expected to see '£', but instead I get -62. I realise that '£' is outside of the ASCII character set and has a code of 156.
But what is it returning?
I've modified the original code to simplify it; now:
const QChar cCheck = strModified.at(intClB + 1);
if ( cCheck == mccAttrMacroDelimiter ) {
...
}
Where mccAttrMacroDelimiter is defined as:
const QChar clsXMLnode::mccAttrMacroDelimiter = '£';
In the debugger when looking at both definitions of what should be the same value, I get:
cCheck: -93 '£'
mccAttrMacroDelimiter: -93, with what looks like a Chinese character
The comparison fails...what is going on?
I've gone through my code changing all QChar references to unsigned char; now I get a warning:
large integer implicitly truncated to unsigned type [-Woverflow]
on:
const unsigned char clsXMLnode::mcucAttrMacroDelimiter = '£';
Again, why? According to a Google search, this may be a bogus message.

I am happy to say that this has fixed the problem. The solution: declare the check character as a char and use:
const char cCheck = strModified.at(intClB + 1).toLatin1();
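For illustration, a minimal sketch of the fixed comparison (assuming the same strModified, intClB and delimiter member as above; writing the delimiter as the Latin-1 escape '\xA3' avoids depending on the source file's encoding):
const char clsXMLnode::mccAttrMacroDelimiter = '\xA3'; // '£' in Latin-1
...
const char cCheck = strModified.at(intClB + 1).toLatin1(); // QChar -> Latin-1 char
if ( cCheck == clsXMLnode::mccAttrMacroDelimiter ) {
    // '£' found
}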

I think because '£' is not in the ASCII table, you will get weird behavior from char. The compiler in Xcode does not even let me compile
char c = '£'; // error: character too large for enclosing literal type
You could use Unicode, since '£' can be found in the Unicode character table:
£ : U+00A3 | Dec: 163.
The answer to this question heavily inspired the code I wrote to extract the decimal value for '£'.
#include <iostream>
#include <codecvt>
#include <locale>
#include <string>
using namespace std;

// Converts the UTF-8 string to UTF-32, prints the decimal value of the
// code point at [index], and returns that code point.
char32_t foo(std::string const & utf8str, int index)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    std::u32string utf32str = conv.from_bytes(utf8str);
    char32_t u = utf32str[index];
    cout << u << endl;
    return u;
}

int main(int argc, const char * argv[]) {
    string r = "[y]£trainstrip+[height]£trainstrip+8";
    // compare the characters at indices 3 and 23 since they are the same
    cout << (foo(r, 3) == foo(r, 23)) << endl;
    return 0;
}
You can use a for loop to get all of the characters in the string if you want, as in the sketch below. Hopefully this helps.
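For instance, a minimal sketch that converts once and prints the index and decimal code point of every character (std::wstring_convert is deprecated since C++17, but this mirrors the approach above):
#include <codecvt>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <locale>
#include <string>

int main() {
    std::string r = "[y]£trainstrip+[height]£trainstrip+8";
    // convert the whole UTF-8 string to UTF-32 once, then walk the code points
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    std::u32string u = conv.from_bytes(r);
    for (std::size_t i = 0; i < u.size(); ++i) {
        std::cout << i << ": " << static_cast<std::uint32_t>(u[i]) << '\n';
    }
    return 0;
}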

Related

Conversion of a UTF-8 string to a hex number: strange problem

I need a reliable way to convert a UTF-8 string to hex. (I work with LaTeX encoding.)
My initial Unicode string, as I see it in the VS debugger, is:
std::string symb = "⎧";
And I know that in the LaTeX compiler (and Adobe Illustrator) this thing corresponds to the hex representation "f8f1".
It seems to me I have a problem with signed and unsigned int somewhere.
What I do is as follows:
#include <codecvt>  // for std::codecvt_utf8
#include <locale>   // for std::wstring_convert
#include <sstream>  // for std::stringstream
#include <string>
#include <vector>

std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv_utf8_utf32;
std::u32string unicode_codepoints = conv_utf8_utf32.from_bytes(symb); // converts UTF-8 to Unicode code points
std::vector<std::string> array_of_symbols;
for (int i = 0; i < unicode_codepoints.length(); i++)
{
    int symb1 = conv_utf8_utf32.from_bytes(symb)[i];
    std::stringstream ss;
    ss << std::hex << symb1;
    std::string res(ss.str());
    std::string res1 = "0x" + res;
    array_of_symbols.push_back(res1);
}
But instead of obtaining f8f1, I get a different number.
In fact, the variable unicode_codepoints already holds the "wrong" char32_t value 9127 (yet the correct Unicode representation; see the red symbol in fig. 1).
The "correct" value yielding f8f1 would, it seems, have to be negative.
What is surprising is that this code works in almost all other situations. Could someone explain what is wrong here?
And finally, the reason I'm so sure that the correct representation is f8f1 is that the final render of this symbol in SVG looks like fig. 2, with the encoding shown in the upper right corner.
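For reference, here is a self-contained sketch of the conversion described above (the helper name codepoints_as_hex is mine); for "⎧" it prints 0x23a7, i.e. the decimal value 9127 mentioned in the question:
#include <codecvt>
#include <cstdint>
#include <iostream>
#include <locale>
#include <sstream>
#include <string>
#include <vector>

// Convert a UTF-8 string into one "0x..." hex string per Unicode code point.
std::vector<std::string> codepoints_as_hex(const std::string& utf8)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    std::u32string cps = conv.from_bytes(utf8);
    std::vector<std::string> out;
    for (char32_t cp : cps) {
        std::ostringstream ss;
        ss << "0x" << std::hex << static_cast<std::uint32_t>(cp);
        out.push_back(ss.str());
    }
    return out;
}

int main()
{
    // explicit UTF-8 bytes for ⎧ (U+23A7), independent of the source encoding
    for (const std::string& s : codepoints_as_hex("\xE2\x8E\xA7"))
        std::cout << s << '\n'; // prints 0x23a7
    return 0;
}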

How to get single characters from unicode string and compare, print them?

I am processing unicode strings in C with libunistring. Can't use another library. My goal is to read a single character from the unicode string at its index position, print it, and compare it to a fixed value. This should be really simple, but well ...
Here's my try (complete C program):
/* This file must be UTF-8 encoded in order to work */
#include <locale.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <unitypes.h>
#include <uniconv.h>
#include <unistdio.h>
#include <unistr.h>
#include <uniwidth.h>
int cmpchr(const char *label, const uint32_t charExpected, const uint32_t charActual) {
    int result = u32_cmp(&charExpected, &charActual, 1);
    if (result == 0) {
        printf("%s is recognized as '%lc', good!\n", label, charExpected);
    } else {
        printf("%s is NOT recognized as '%lc'.\n", label, charExpected);
    }
    return result;
}

int main() {
    setlocale(LC_ALL, ""); /* switch from default "C" encoding to system encoding */
    const char *enc = locale_charset();
    printf("Current locale charset: %s (should be UTF-8)\n\n", enc);

    const char *buf = "foo 楽あり bébé";
    const uint32_t *mbcs = u32_strconv_from_locale(buf);
    printf("%s\n", u32_strconv_to_locale(mbcs));

    uint32_t c0 = mbcs[0];
    uint32_t c5 = mbcs[5];
    uint32_t cLast = mbcs[u32_strlen(mbcs) - 1];
    printf(" - char 0: %lc\n", c0);
    printf(" - char 5: %lc\n", c5);
    printf(" - last : %lc\n", cLast);

    /* When this file is UTF-8-encoded, I'm passing a UTF-8 character
     * as a uint32_t, which should be wrong! */
    cmpchr("Char 0", 'f', c0);
    cmpchr("Char 5", 'あ', c5);
    cmpchr("Last char", 'é', cLast);

    return 0;
}
In order to run this program:
Save the program to a UTF-8 encoded file called ustridx.c
sudo apt-get install libunistring-dev
gcc -o ustridx.o -W -Wall -O -c ustridx.c ; gcc -o ustridx -lunistring ustridx.o
Make sure the terminal is set to a UTF-8 locale (locale)
Run it with ./ustridx
Output:
Current locale charset: UTF-8 (should be UTF-8)
foo 楽あり bébé
- char 0: f
- char 5: あ
- last : é
Char 0 is recognized as 'f', good!
Char 5 is NOT recognized as '�����'.
Last char is NOT recognized as '쎩'.
The desired behavior is that char 5 and last char are recognized correctly, and printed correctly in the last two lines of the output.
'あ' and 'é' are invalid character literals. Only characters from the basic source character set and escape sequences are allowed in character literals.
GCC, however, only emits a warning (see godbolt): warning: multi-character character constant. That warning is about a different case, character constants such as 'abc', which are multicharacter literals; it applies here because these characters are encoded as multiple bytes in UTF-8. According to cppreference, the value of such a literal is implementation-defined, so you can't rely on its value being the corresponding Unicode code point, and GCC specifically doesn't do this, as seen here.
Since C11 you can use UTF-32 character literals such as U'あ' which results in a char32_t value of the Unicode code point of the character. Although by my reading the standard doesn't allow using characters such as あ in literals, the examples on cppreference seem to suggest that it is common for compilers to allow this.
A standard-compliant portable solution is using Unicode escape sequences for the character literal, like U'\u3042' for あ, but this is hardly different from using an integer constant such as 0x3042.
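For illustration, a minimal sketch of the literal forms described above (written as C++ here; C11 offers the same U'' syntax for char32_t). Whether U'あ' is accepted depends on the compiler allowing the non-basic character, as noted above:
#include <iostream>

int main() {
    char32_t a1 = U'あ';      // UTF-32 character literal; value is the code point (if accepted)
    char32_t a2 = U'\u3042';  // same code point written as a universal-character-name
    char32_t a3 = 0x3042;     // plain integer constant with the same value
    std::cout << (a1 == a2) << (a2 == a3) << '\n'; // prints 11
    return 0;
}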
From libunistring's documentation:
Compares S1 and S2, each of length N, lexicographically. Returns a negative value if S1 compares smaller than S2, a positive value if S1 compares larger than S2, or 0 if they compare equal.
The comparison in the if statement was wrong. That was the reason for the mismatch. Of course, this reveals other, unrelated issues that also need to be fixed, but that's the reason for the puzzling result of the comparison.

I am not sure why I am getting output for this?

I was learning some string handling in C++, was doing hit and trial on some code, and surprisingly got output for the given code.
#include<bits/stdc++.h>
using namespace std;
int main(){
char str[12]={'\67','a','v','i'};
cout<<str;
return 0;
}
Surprisingly, I get 7avi printed.
But if I replace '\67' with '\68', the following error is shown on Repl.it (https://repl.it/languages/cpp):
#include<bits/stdc++.h>
using namespace std;
int main(){
char str[12]={'\68','a','v','i'};
cout<<str;
return 0;
}
main.cpp:6:19: warning: multi-character character constant [-Wmultichar]
char str[12]={'\68','a','v','i'};
^
main.cpp:6:19: error: constant expression evaluates to 1592 which cannot
be narrowed to type 'char' [-Wc++11-narrowing]
char str[12]={'\68','a','v','i'};
^~~~~
main.cpp:6:19: note: insert an explicit cast to silence this issue
char str[12]={'\68','a','v','i'};
^~~~~
static_cast<char>( )
main.cpp:6:19: warning: implicit conversion from 'int' to 'char' changes
value from 1592 to 56 [-Wconstant-conversion]
char str[12]={'\68','a','v','i'};
~^~~~~
2 warnings and 1 error generated.
compiler exit status 1
Could someone please explain this behavior?
The \nnn notation, where nnn are digits between 0 and 7, is octal (base 8) notation. So in \68, 68 is not a valid octal number (in octal, the number one more than 67 is 70). The compiler interprets the code as '\6' (character 6 in octal) followed by an additional '8' ASCII character inside your character literal - hence a multi-character constant, which cannot be stored in a char variable. You could store it in a "wide character":
wchar_t str[12]={'\68','a','v','i'};
But, there is no operator<< overload to display an array of wchar_t, so your cout << str line will match the void* overload and just display the memory address of the first element in the array, rather than any of the characters themselves.
You can fix that using:
wcout << str;
Separately, I recommend putting a newline after your output too. Without it, your output may be overwritten by the console prompt before you can see it, though that doesn't happen in the online REPL you're using. It should look like:
wcout << str << '\n';
I think you're trying to type in an ASCII character using either octal or hex (octal usually begins with a 0, hex with a 0x). Just don't put the ASCII code in quotes; instead, put the code straight into the array, like so:
char str[12] = {68, 'a', 'v', 'i'}; //decimal
char str[12] = {0x44, 'a', 'v', 'i'}; //hex
char str[12] = {0104, 'a', 'v', 'i'}; //octal
Side Note
Please don't use <bits/stdc++.h>. It's not standardized (see here for a more detailed explanation). Instead, include <iostream> for cout and the other requisite headers for your other needs.
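Putting the suggestions together, a minimal sketch of a corrected program (using the decimal form; 68 is the ASCII code for 'D'):
#include <iostream>

int main() {
    // remaining array elements are zero-initialized, so the string is null-terminated
    char str[12] = {68, 'a', 'v', 'i'}; // 68 == 'D' in ASCII
    std::cout << str << '\n';           // prints Davi
    return 0;
}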

Converting fractions (1/8, 3/8, 5/8, 7/8) to UTF-8 in C++

I have to display fractions using the Unicode fraction symbols, and I can't seem to be able to display these four.
using
char UCP_VULGAR_FRACTION_ONE_HALF_UTF8 = L'\u00BD';
char UCP_VULGAR_FRACTION_ONE_QUARTER_UTF8 = L'\u00BC';
char UCP_VULGAR_FRACTION_THREE_QUARTERS_UTF8 = L'\u00BE';
I can get 1/2, 1/4 and 3/4 to display just fine (cout<< (char)UCP_VULGAR_FRACTION_ONE_HALF_UTF8), but doing the same for those fractions:
char UCP_VULGAR_FRACTION_ONE_EIGHTH_UTF8 = L'\u215B';
char UCP_VULGAR_FRACTION_THREE_EIGHTHS_UTF8 = L'\u215C';
char UCP_VULGAR_FRACTION_FIVE_EIGHTHS_UTF8 = L'\u215D';
char UCP_VULGAR_FRACTION_SEVEN_EIGHTHS_UTF8 = L'\u215E';
Gets me [, \, ] and ^. What am I doing wrong? I tried g_unichar_to_utf8 with no success...
For UTF-8 you need to store multibyte characters - characters contained in one or more bytes. Typically these are stored in a std::string:
std::string const UCP_VULGAR_FRACTION_ONE_EIGHTH_UTF8 = u8"\u215B";
std::string const UCP_VULGAR_FRACTION_THREE_EIGHTHS_UTF8 = u8"\u215C";
std::string const UCP_VULGAR_FRACTION_FIVE_EIGHTHS_UTF8 = u8"\u215D";
std::string const UCP_VULGAR_FRACTION_SEVEN_EIGHTHS_UTF8 = u8"\u215E";
Or possibly a null terminated char array:
char const* UCP_VULGAR_FRACTION_ONE_EIGHTH_UTF8 = "\u215B";
char const* UCP_VULGAR_FRACTION_THREE_EIGHTHS_UTF8 = "\u215C";
char const* UCP_VULGAR_FRACTION_FIVE_EIGHTHS_UTF8 = "\u215D";
char const* UCP_VULGAR_FRACTION_SEVEN_EIGHTHS_UTF8 = "\u215E";
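For example, a usage sketch (my own; it assumes a UTF-8 capable terminal and writes the UTF-8 bytes explicitly so the result does not depend on the source or execution character set):
#include <iostream>
#include <string>

int main() {
    const std::string oneEighth    = "\xE2\x85\x9B"; // U+215B ⅛
    const std::string threeEighths = "\xE2\x85\x9C"; // U+215C ⅜
    const std::string fiveEighths  = "\xE2\x85\x9D"; // U+215D ⅝
    const std::string sevenEighths = "\xE2\x85\x9E"; // U+215E ⅞
    std::cout << oneEighth << ' ' << threeEighths << ' '
              << fiveEighths << ' ' << sevenEighths << '\n';
    return 0;
}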
Use wchar_t instead of char. Also be aware that you can't print a wchar_t using std::cout; you need to use the wide version of std::cout, which is std::wcout. Be aware too that mixing wcout and cout in the same program is likely to cause problems, so you may want to store these Unicode characters in a normal UTF-8 std::string instead of wchar_t, and print them using std::cout.

How to convert an ASCII char to its ASCII int value?

I would like to convert a char to its ASCII int value.
I could fill an array with all possible values and compare to that, but it doesn't seem right to me. I would like something like:
char mychar = "k"
public int ASCItranslate(char c)
return c
ASCItranslate(k) // >> Should return 107 as that is the ASCII value of 'k'.
The point is that atoi() won't work here, as it is for readable numbers only.
It won't do anything with spaces (ASCII 32).
Just do this:
int(k)
You're just converting the char to an int directly here, no need for a function call.
A char is already a number. It doesn't require any conversion, since ASCII is just a mapping from numbers to character representations.
You could use it directly as a number if you wish, or cast it.
In C++, you could also use static_cast<int>(k) to make the conversion explicit.
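A minimal sketch of that cast in use ('k' is 107 in ASCII):
#include <iostream>

int main() {
    char mychar = 'k';
    // the cast only makes the existing numeric value visible to the stream
    std::cout << static_cast<int>(mychar) << '\n'; // prints 107
    return 0;
}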
Do this:
char mychar = 'k';
//and then
int k = (int)mychar;
To convert from an ASCII character to its ASCII value:
char c='A';
cout<<int(c);
To convert from an ASCII value to its ASCII character:
int a=67;
cout<<char(a);
#include <iostream>
char mychar = 'k';
int ASCIItranslate(char ch) {
return ch;
}
int main() {
std::cout << ASCIItranslate(mychar);
return 0;
}
That's your original code with the various syntax errors fixed. Assuming you're using a compiler that uses ASCII (which is pretty much every one these days), it works. Why do you think it's wrong?