I'm trying to use extended Ascii codes in a console application using C++ and Code::Blocks (character codes greater than 128). http://www.asciitable.com/
The console shows a question mark inside a diamond.
I tried so far:
char myChar = 200;
cout << myChar;
cout << static_cast<char>(200);
char can't hold the whole character set
use unsigned char instead.
unsigned char myChar = 200;
cout << myChar << endl;
a char is generally a signed char.
it can hold values from -128 to 127. ASCII fits nicely in 0 to 127, so char is reasonable when working with ASCII.
For the non-ASCII characters 128 to 255, you need something bigger.
unsigned char can store values from 0 to 255. That covers the whole character set.
It's just what you need.
There are other things to research. You can read about unicode. But unsigned char should get you around your current issue.
Related
I want to convert Unicode characters (Persian) to int.
Based on this list, the Unicode number of 'آ' is U+0622.
Suppose i want to give U+0622 as integer value. I wrote this piece of code:
unsigned int Alef = (unsigned int)'آ';
std::cout << Alef << std::endl;
output:
63
Correct Answer is 1570 and as you see the output is wrong. I guess it only converts first byte of Unicode Character.
How do i convert that Unicode character to give correct answer?
Try expressing the character as a wchar literal:
unsigned int Alef = (unsigned int) L'آ';
std::cout << Alef << std::endl;
But make sure you're saving as Unicode, nano, for example, converts the 'آ' to a '?' before saving. As would Notepad on Windows I think?
Also to add to my answer, you should write Unicode characters to std::wcout not std::cout as cout is for single byte chars and wcout is for wchar types.
EDIT: Notepad does save as Unicode
I try to understand how to handle basic UTF-8 operations in C++.
Let's say we have this scenario: User inputs a name, it's limited to 10 letters (symbols in user's language, not bytes), it's being stored.
It can be done this way in ASCII.
// ASCII
char * input; // user's input
char buf[11] // 10 letters + zero
snprintf(buf,11,"%s",input); buf[10]=0;
int len= strlen(buf); // return 10 (correct)
Now, how to do it in UTF-8? Let's assume it's up to 4 bytes charset (like Chinese).
// UTF-8
char * input; // user's input
char buf[41] // 10 letters * 4 bytes + zero
snprintf(buf,41,"%s",input); //?? makes no sense, it limits by number of bytes not letters
int len= strlen(buf); // return number of bytes not letters (incorrect)
Can it be done with standard sprintf/strlen? Are there any replacements of those function to use with UTF-8 (in PHP there was mb_ prefix of such functions IIRC)? If not, do I need to write those myself? Or maybe do I need to approach it another way?
Note: I would prefer to avoid wide characters solution...
EDIT: Let's limit it to Basic Multilingual Plane only.
I would prefer to avoid wide characters solution...
Wide characters are just not enough, because if you need 4 bytes for a single glyph, then that glyph is likely to be outside the Basic Multilingual Plane, and it will not be represented by a single 16 bits wchar_t character (assuming wchar_t is 16 bits wide which is just the common size).
You will have to use a true unicode library to convert the input to a list of unicode characters in their Normal Form C (canonical composition) or the compatibility equivalent (NFKC)(*) depending on whether for example you want to count one or two characters for the ligature ff (U+FB00). AFAIK, you best bet should be ICU.
(*) Unicode allows multiple representation for the same glyph, notably the normal composed form (NFC) and normal decomposed form (NFD). For example the french é character can be represented in NFC as U+00E9 or LATIN SMALL LETTER E WITH ACUTE or as U+0065 U+0301 or LATIN SMALL LETTER E followed with COMBINING ACUTE ACCENT (also displayed as é).
References and other examples on Unicode equivalence
strlen only counts the bytes in the input string, until the terminating NUL.
On the other hand, you seem interested in the glyph count (what you called "symbols in user's language").
The process is complicated by UTF-8 being a variable length encoding (as is, in a kind of lesser extent, also UTF-16), so code points can be encoded using one up to four bytes. And there are also Unicode combining characters to consider.
To my knowledge, there's nothing like that in the standard C++ library. However, you may have better luck using third party libraries like ICU.
std::strlen indeed considers only one byte characters. To compute the length of a unicode NUL terminated string, one can use std::wcslen instead.
Example:
#include <iostream>
#include <cwchar>
#include <clocale>
int main()
{
const wchar_t* str = L"爆ぜろリアル!弾けろシナプス!パニッシュメントディス、ワールド!";
std::setlocale(LC_ALL, "en_US.utf8");
std::wcout.imbue(std::locale("en_US.utf8"));
std::wcout << "The length of \"" << str << "\" is " << std::wcslen(str) << '\n';
}
If you do not want to count utf-8 chars by yourself - you can use temporary conversion to widechar to cut your input string. You do not need to store the intermediate values
#include <iostream>
#include <codecvt>
#include <string>
#include <locale>
std::string cutString(const std::string& in, size_t len)
{
std::wstring_convert<std::codecvt_utf8<wchar_t>> cvt;
auto wstring = cvt.from_bytes(in);
if(len < wstring.length())
{
wstring = wstring.substr(0,len);
return cvt.to_bytes(wstring);
}
return in;
}
int main(){
std::string test = "你好世界這是演示樣本";
std::string res = cutString(test,5);
std::cout << test << '\n' << res << '\n';
return 0;
}
/****************
Output
$ ./test
你好世界這是演示樣本
你好世界這
*/
I would like to convert a char to its ASCII int value.
I could fill an array with all possible values and compare to that, but it doesn't seems right to me. I would like something like
char mychar = "k"
public int ASCItranslate(char c)
return c
ASCItranslate(k) // >> Should return 107 as that is the ASCII value of 'k'.
The point is atoi() won't work here as it is for readable numbers only.
It won't do anything with spaces (ASCII 32).
Just do this:
int(k)
You're just converting the char to an int directly here, no need for a function call.
A char is already a number. It doesn't require any conversion since the ASCII is just a mapping from numbers to character representation.
You could use it directly as a number if you wish, or cast it.
In C++, you could also use static_cast<int>(k) to make the conversion explicit.
Do this:-
char mychar = 'k';
//and then
int k = (int)mychar;
To Convert from an ASCII character to it's ASCII value:
char c='A';
cout<<int(c);
To Convert from an ASCII Value to it's ASCII Character:
int a=67;
cout<<char(a);
#include <iostream>
char mychar = 'k';
int ASCIItranslate(char ch) {
return ch;
}
int main() {
std::cout << ASCIItranslate(mychar);
return 0;
}
That's your original code with the various syntax errors fixed. Assuming you're using a compiler that uses ASCII (which is pretty much every one these days), it works. Why do you think it's wrong?
I am reading the string of data from the oracle database that may or may not contain the Unicode characters into a c++ program.Is there any way for checking the string extracted from the database contains an Unicode characters(UTF-8).if any Unicode characters are present they should be converted into hexadecimal format and need to displayed.
There are two aspects to this question.
Distinguish UTF-8-encoded characters from ordinary ASCII characters.
UTF-8 encodes any code point higher than 127 as a series of two or more bytes. Values at 127 and lower remain untouched. The resultant bytes from the encoding are also higher than 127, so it is sufficient to check a byte's high bit to see whether it qualifies.
Display the encoded characters in hexadecimal.
C++ has std::hex to tell streams to format numeric values in hexadecimal. You can use std::showbase to make the output look pretty. A char isn't treated as numeric, though; streams will just print the character. You'll have to force the value to another numeric type, such as int. Beware of sign-extension, though.
Here's some code to demonstrate:
#include <iostream>
void print_characters(char const* s)
{
std::cout << std::showbase << std::hex;
for (char const* pc = s; *pc; ++pc) {
if (*pc & 0x80)
std::cout << (*pc & 0xff);
else
std::cout << *pc;
std::cout << ' ';
}
std::cout << std::endl;
}
You could call it like this:
int main()
{
char const* test = "ab\xef\xbb\xbfhu";
print_characters(test);
return 0;
}
Output on Solaris 10 with Sun C++ 5.8:
$ ./a.out
a b 0xef 0xbb 0xbf h u
The code detects UTF-8-encoded characters, but it makes no effort to decode them; you didn't mention needing to do that.
I used *pc & 0xff to convert the expression to an integral type and to mask out the sign-extended bits. Without that, the output on my computer was 0xffffffbb, for instance.
I would convert the string to UTF-32 (you can use something like UTF CPP for that - it is very easy), and then loop through the resulting string, detect code points (characters) that are above 0x7F and print them as hex.
Friends
I want to integrate the following code into the main application code. The junk characters that come populated with the o/p string dumps the application
The following code snipette doesnt work..
void stringCheck(char*);
int main()
{
char some_str[] = "Common Application FE LBS Serverr is down";
stringCheck(some_str);
}
void stringCheck(char * newString)
{
for(int i=0;i<strlen(newString);i++)
{
if ((int)newString[i] >128)
{
TRACE(" JUNK Characters in Application Error message FROM DCE IS = "<<(char)newString[i]<<"++++++"<<(int)newString[i]);
}
}
}
Can someone please show me the better approaches to find junk characters in a string..
Many Thanks
Your char probably is represented signed. Cast it to unsigned char instead to avoid that it becomes a negative integer when casting to int:
if ((unsigned char)newString[i] >128)
Depending on your needs, isprint might do a better job, checking for a printable character, including space:
if (!isprint((unsigned char)newString[i]))
...
Note that you have to cast to unsigned char: input for isprint requires values between 0 and UCHAR_MAX as character values.