strlen() not working well with special characters - c++

When trying to determine the length of a low-level character string with the strlen function, I have noticed that it does not work properly when the string contains Spanish characters that do not exist in English, such as the opening exclamation mark ¡, accented vowels, or the letter ñ. All these elements are counted as two characters, a situation that is not fixed with locales.
#include <cstring>
#include <iostream>

int main() {
    const char* s1 = "Hola!";
    const char* s2 = "¡Hola!";
    std::cout << s1 << " has " << strlen(s1) << " elements, but " << s2
              << " has " << strlen(s2) << " instead of 6" << std::endl;
}
This is a university assignment on low-level strings, so it is not possible to use libraries such as std::string.

strlen gives you the number of non-zero char objects in the buffer pointed to by its argument, up to the first zero char. Your system is apparently using a character encoding (most likely UTF-8) where these problematic characters take up more than one byte (that is, more than one char object).
How to solve this depends on what you're trying to do. For certain operations (such as determining the size of a buffer needed to store the string), the result from strlen is 100% correct, as it's exactly what you need. For most other purposes, welcome to the vast world of character/byte/code-point/whatever nuances. You might want to read up on text encodings, Unicode etc. http://utf8everywhere.org/ might be a good site to start.
You've mentioned this is a university assignment: depending on what the teaching goal is, you might need to implement some form of UTF-8 encoding/decoding, or just steer clear of non-ASCII characters.
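If counting code points rather than bytes is what the assignment calls for, a minimal hand-rolled sketch (assuming valid UTF-8 input and no libraries beyond <cstring>) could look like this: every byte that is not a UTF-8 continuation byte starts a new code point.

#include <cstring>
#include <iostream>

// Count UTF-8 code points: continuation bytes have the bit
// pattern 10xxxxxx, so count every byte that does NOT match it.
std::size_t utf8_length(const char* s) {
    std::size_t count = 0;
    for (; *s != '\0'; ++s) {
        if ((static_cast<unsigned char>(*s) & 0xC0) != 0x80)
            ++count;
    }
    return count;
}

int main() {
    const char* s2 = "¡Hola!";
    std::cout << s2 << " has " << strlen(s2) << " bytes and "
              << utf8_length(s2) << " code points" << std::endl;
}

With the source saved as UTF-8, this should report 7 bytes and 6 code points for "¡Hola!".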

Related

Printing unicode Characters in C++

I'm trying to print an interface using these characters:
"╣║╗╝╚╔╩╦╠═╬"
but when I try to print it, I get something like this:
"ôöæËÈ"
interface.txt
#include <fstream>
#include <iostream>

unsigned char* tabuleiroImportado() {
    std::ifstream TABULEIRO;
    TABULEIRO.open("tabuleiro.txt");
    unsigned char tabu[36][256];
    for (unsigned char i = 0; i < 36; i++) {
        TABULEIRO >> tabu[i];
        std::cout << tabu[i] << std::endl;
    }
    return *tabu;
}
I'm using this function to import the interface.
Just like every other kind of data that lives in your computer, text must be represented by a sequence of bytes. Each byte can have just 256 possible values.
All the carbon-based life forms that live on the third planet from the sun use all sorts of different alphabets with all sorts of characters, whose total number is much, much more than 256.
A single byte by itself cannot, therefore, express all characters. The simplest way of handling this is to pick just 256 (or fewer) characters at a time, assign each possible byte value to one character in that small set, and call it your "character set".
Such is, apparently, your "tabuleiro.txt" file: its contents must be using some particular character set which includes the characters you expect to see there.
Your screen display, however, uses a different character set, hence the same values show different characters.
However, it's probably more complicated than that: modern operating system and modern terminals employ multi-byte character sequence, where a single character can be represented by specific sequences of more than just one byte. It's fairly likely that your terminal screen is based on multi-byte Unicode encoding.
In summary, you need to figure out three things:
1. Which character set your file uses
2. Which character set your terminal display uses
3. How to properly translate one to the other
It goes without saying that no one else could possibly tell you which character set your file uses and which character set your terminal display uses. That's something you'll need to figure out. And without knowing both, you can't do step 3.
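As an illustration of step 3 only, here is a sketch under unverified assumptions: that the file uses CP437 (a DOS character set common for box-drawing) and that the terminal expects UTF-8, with a UTF-8 execution character set (the default for GCC and Clang).

#include <iostream>
#include <string>
#include <unordered_map>

// Translate the CP437 box-drawing bytes from the question into
// UTF-8. A real translation would cover the whole code page.
std::string cp437_to_utf8(const std::string& in) {
    static const std::unordered_map<unsigned char, std::string> table = {
        {0xB9, "\u2563"}, {0xBA, "\u2551"}, {0xBB, "\u2557"}, // ╣ ║ ╗
        {0xBC, "\u255D"}, {0xC8, "\u255A"}, {0xC9, "\u2554"}, // ╝ ╚ ╔
        {0xCA, "\u2569"}, {0xCB, "\u2566"}, {0xCC, "\u2560"}, // ╩ ╦ ╠
        {0xCD, "\u2550"}, {0xCE, "\u256C"},                   // ═ ╬
    };
    std::string out;
    for (unsigned char c : in) {
        auto hit = table.find(c);
        if (hit != table.end())
            out += hit->second;          // translated box-drawing byte
        else
            out += static_cast<char>(c); // pass everything else through
    }
    return out;
}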
To print Unicode characters, you can write the Unicode code point with the \u prefix.
If the console does not support Unicode, then you cannot get the correct result.
Example:
#include <iostream>

int main() {
    std::cout << "Character: \u2563" << std::endl;
    std::cout << "Character: \u2551" << std::endl;
    std::cout << "Character: \u2560" << std::endl;
}
Output:
Character: ╣
Character: ║
Character: ╠
Another approach: declare an unsigned char and assign it the character's numeric code (unsigned char a = <code>; any name works in place of a), then print that variable instead of typing the character literally. I got garbled output like this myself when making a game engine for cmd, and this worked for me with GNU GCC under C++17, in 2021 and 2022 alike.

String handling with Nordic characters is difficult in C++

I have tried many ways to solve this problem. I just want to split a string or do stuff with each character. As soon as there are Nordic characters in the string, it's not possible to split it.
The length() function returns the right answer if we look at memory use, but that's not the same as the string length: "ABCÆØÅ" does not have length 6, it has length 9, one extra for each special character.
Anybody with a good answer?
The test below shows the problem: some letters and a lot of question marks. :-(
#include <algorithm>
#include <iostream>
#include <string>
using namespace std;

int main()
{
    string name = "some æøå string";
    for_each(name.begin(), name.end(), [] (char c) {
        cout << c;
        cout << endl;
    });
}
If your terminal supports UTF-8 encoding, there should be no problem using std::cout with the string you entered, but you need to tell the compiler that you typed in a UTF-8 string, like this:
#include <algorithm>
#include <iostream>
#include <string>
using namespace std;

int main()
{
    string name = u8"some æøå string";
    for_each(name.begin(), name.end(), [] (char c) {
        cout << c;
        cout << endl;
    });
    cout << name; // this will also work
    return 0;     // add this just to be tidy
}
You need to do that because a character in UTF-8 might need 1, 2, 3 or 4 bytes, depending on which character it is.
Then, depending on what you need to do, for example splitting between characters, you should create a function to detect how long each UTF-8 character is. Then you create a 'string' for each UTF-8 character and extract as many characters as needed from the original string.
There is a very good (and very compact) library, utf8proc, that lets you do such things.
utf8proc has helped me resolve these kinds of issues in many projects.
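For the detection step, a rough sketch without any library (assuming the string is valid UTF-8): the byte length of each character can be read off its lead byte, and the string can then be walked one whole character at a time.

#include <iostream>
#include <string>

// Number of bytes in the UTF-8 character that starts at 'lead'.
std::size_t utf8CharLength(unsigned char lead) {
    if (lead < 0x80)           return 1; // 0xxxxxxx: plain ASCII
    if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx
    return 1; // invalid lead byte; treat it as a single byte
}

int main() {
    std::string name = u8"some æøå string";
    for (std::size_t i = 0; i < name.size(); ) {
        std::size_t len = utf8CharLength(name[i]);
        std::cout << name.substr(i, len) << std::endl; // one whole character
        i += len;
    }
}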

How could I copy data that contain '\0' character

I'm trying to copy data that contains '\0'. I'm using C++.
When my research turned up nothing, I decided to write my own function to copy data from one char* to another char*. But it doesn't return the wanted result!
My attempt is the following:
#include <cstdlib>
#include <cstring>
#include <iostream>

char* my_strcpy(char* arr_out, const char* arr_in, int bloc)
{
    char* pc = arr_out;
    for (int i = 0; i < bloc; ++i)
    {
        *arr_out++ = *arr_in++;
    }
    *arr_out = '\0';
    return pc;
}

int main()
{
    char* out = new char[20];
    my_strcpy(out, "12345aa\0aaaaa AA", 20);
    std::cout << "output data: " << out << std::endl;
    std::cout << "the length of my output data: " << strlen(out) << std::endl;
    system("pause");
    return 0;
}
the result is:
output data: 12345aa
the length of my output data: 7
I don't understand what is wrong with my code.
Thank you for your help in advance.
Your my_strcpy is working fine; when you write a char* to cout or calculate its length with strlen, they stop at \0, as per C string behaviour. By the way, you can use memcpy to copy a block of chars regardless of any \0.
If you know the length of the 'string', then use memcpy. strcpy will halt its copy when it meets the string terminator, the \0. memcpy will not; it will copy the \0 and anything that follows.
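A minimal sketch of the memcpy approach (the size here is the literal's 16 data bytes plus its implicit terminator; std::cout.write is used because, unlike operator<<, it does not stop at '\0'):

#include <cstring>
#include <iostream>

int main() {
    const char src[] = "12345aa\0aaaaa AA"; // 17 bytes: 16 of data + trailing '\0'
    char dst[sizeof src];
    std::memcpy(dst, src, sizeof src);      // copies straight past the embedded '\0'
    std::cout.write(dst, sizeof src - 1);   // all 16 data bytes, embedded '\0' included
    std::cout << std::endl;
}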
(Note: For any readers who are unaware that \0 is a single-character byte with value zero in string literals in C and C++, not to be confused with the \\0 expression that results in a two-byte sequence of an actual backslash followed by an actual zero in the string... I will direct you to Dr. Rebmu's explanation of how to split a string in C for further misinformation.)
C++ strings can maintain their length independent of any embedded \0. They copy their contents based on this length. The only thing is that the constructor taking a C string and no length will be guided by the null terminator as to what you wanted the length to be.
To override this, you can pass in a length explicitly. Make sure the length is accurate, though. You have 16 bytes of data, and 17 if you want the null terminator in the string literal to make it into your string as part of the data.
#include <iostream>
using namespace std;

int main() {
    string str("12345aa\0aaaaa AA", 17);
    string str2 = str;
    cout << str << "\n";
    cout << str2 << "\n";
    return 0;
}
(Try not to hardcode such lengths if you can avoid it. Note that you didn't count it right, and when I corrected another answer here they got it wrong as well. It's error prone.)
On my terminal that outputs:
12345aaaaaaa AA
12345aaaaaaa AA
But note that what you're doing here is actually streaming a 0 byte to stdout. I'm not sure how formalized the behavior of different terminal standards is for dealing with that. Things outside the printable range can be used for all kinds of purposes depending on the kind of terminal you're running: positioning the cursor on the screen, changing the color, etc. I wouldn't write out strings with embedded zeros like that unless I knew what the semantics were going to be on the stream receiving them.
If what you're dealing with really is bytes, consider not confusing the issue and using a std::vector<char> instead. Many libraries offer alternatives, such as Qt's QByteArray.
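For instance, a hypothetical byte-buffer version of the same data; a vector carries its length explicitly, so the embedded 0 byte is a non-issue:

#include <iostream>
#include <vector>

int main() {
    // The same 16 bytes as the question's literal, embedded '\0' and all.
    std::vector<char> bytes{'1','2','3','4','5','a','a','\0',
                            'a','a','a','a','a',' ','A','A'};
    std::cout.write(bytes.data(), bytes.size()); // length comes from the vector
    std::cout << std::endl;
}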
Your function is fine (except that you should pass to it 17 instead of 20). If you need to output null characters, one way is to convert the data to std::string:
std::string outStr(out, out + 17);
std::cout << "output data: " << outStr << std::endl;
std::cout << "the length of my output data: " << outStr.length() << std::endl;
I don't understand what is wrong with my code.
my_strcpy(out,"12345aa\0aaaaa AA",20);
Your string contains the character '\', which starts an escape sequence. To prevent this you have to escape the backslash:
my_strcpy(out,"12345aa\\0aaaaa AA",20);
Test:
output data: 12345aa\0aaaaa AA
the length of my output data: 17
Your string is already terminated midway.
my_strcpy(out,"12345aa\0aaaaa AA",20);
Why do you intend to have \0 in between like that? Use some other delimiter if you so desire.
Otherwise, since std::cout and strlen interpret a \0 as a string terminator, you get surprises.
What I mean is: follow the convention, i.e. '\0' as the string terminator.

C++ How to get first letter of wstring

This sounds like a simple problem, but C++ is making it difficult (for me at least): I have a wstring and I would like to get the first letter as a wchar_t object and then remove this first letter from the string.
This here does not work for non-ASCII characters:
wchar_t currentLetter = word.at(0);
Because it returns two characters (in a loop) for characters such as German Umlauts.
This here does not work, either:
wchar_t currentLetter = word.substr(0,1);
error: no viable conversion from 'std::basic_string<wchar_t>' to 'wchar_t'
And neither does this:
wchar_t currentLetter = word.substr(0,1).c_str();
error: cannot initialize a variable of type 'wchar_t' with an rvalue of type 'const wchar_t *'
Any other ideas?
Cheers,
Martin
---- Update -----
Here is some executable code that should demonstrate the problem. This program will loop over all letters and output them one by one:
#include <cstdlib>
#include <iostream>
#include <string>
using namespace std;

int main() {
    wstring word = L"für";
    wcout << word << endl;
    wcout << word.at(1) << " " << word[1] << " " << word.substr(1, 1) << endl;
    wchar_t currentLetter;
    bool isLastLetter;
    do {
        isLastLetter = (word.length() == 1);
        currentLetter = word.at(0);
        wcout << L"Letter: " << currentLetter << endl;
        word = word.substr(1, word.length()); // remove first letter
    } while (word.length() > 0);
    return EXIT_SUCCESS;
}
However, the actual output I get is:
f?r
? ? ?
Letter: f
Letter: ?
Letter: r
The source file is encoded in UTF-8 and the console's encoding is also set to UTF-8.
Here's a solution provided by Sehe:
#include <algorithm>
#include <cstdlib>
#include <iostream>
#include <iterator>
#include <string>
#include <boost/regex/pending/unicode_iterator.hpp>
using namespace std;

template <typename C>
std::string to_utf8(C const& in)
{
    std::string result;
    auto out = std::back_inserter(result);
    auto utf8out = boost::utf8_output_iterator<decltype(out)>(out);
    std::copy(begin(in), end(in), utf8out);
    return result;
}

int main() {
    wstring word = L"für";
    bool isLastLetter;
    do {
        isLastLetter = (word.length() == 1);
        auto currentLetter = to_utf8(word.substr(0, 1));
        cout << "Letter: " << currentLetter << endl;
        word = word.substr(1, word.length()); // remove first letter
    } while (word.length() > 0);
    return EXIT_SUCCESS;
}
Output:
Letter: f
Letter: ü
Letter: r
Yes you need Boost, but it seems that you're going to need an external library anyway.
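For completeness, a Boost-free sketch under different assumptions: on a platform where wchar_t holds a whole code point (e.g. Linux, where it is 32 bits wide) and the environment locale is UTF-8, imbuing that locale lets wcout print each wchar_t directly.

#include <iostream>
#include <locale>
#include <string>

int main() {
    std::locale::global(std::locale("")); // adopt the environment's locale
    std::wcout.imbue(std::locale());
    std::wstring word = L"für";
    while (!word.empty()) {
        wchar_t currentLetter = word.at(0);
        std::wcout << L"Letter: " << currentLetter << std::endl;
        word.erase(0, 1);                 // remove the first letter
    }
}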
1
C++ has no idea of Unicode. Use an external library such as ICU (UnicodeString class) or Qt (QString class); both support Unicode, including UTF-8.
2
Since UTF-8 has variable length, all kinds of indexing will index in code units, not code points. It is not possible to do random access on code points in a UTF-8 sequence because of its variable-length nature. If you want random access, you need a fixed-length encoding like UTF-32. For that you can use the U prefix on strings (see the sketch after point 4).
3
The C++ language standard has no notion of explicit encodings. It only contains an opaque notion of a "system encoding", for which wchar_t is a "sufficiently large" type.
To convert from the opaque system encoding to an explicit external encoding, you must use an external library. The library of choice would be iconv() (from WCHAR_T to UTF-8), which is part of POSIX and available on many platforms, although on Windows the WideCharToMultiByte function is guaranteed to produce UTF-8.
C++11 adds new UTF-8 literals in the form of std::string s = u8"Hello World: \U0010FFFF";. Those are already in UTF-8, but they cannot interface with the opaque wstring other than through the way I described.
4 (about source files but still sorta relevant)
Encoding in C++ is quite a bit complicated. Here is my understanding of it.
Every implementation has to support characters from the basic source character set. These include the common characters listed in §2.2/1 (§2.3/1 in C++11), and they should all fit into one char. In addition, implementations have to support a way to name other characters, called universal character names, which look like \uffff or \Uffffffff and can be used to refer to Unicode characters. A subset of them are usable in identifiers (listed in Annex E).
This is all nice, but the mapping from characters in the file to source characters (used at compile time) is implementation-defined. This constitutes the encoding used.
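To make point 2 concrete, a small sketch using a UTF-32 string (the U prefix): every element is one full code point, so at(0) and substr behave the way the question expects. The code point is printed numerically here to stay independent of the terminal's encoding.

#include <cstdint>
#include <iostream>
#include <string>

int main() {
    std::u32string word = U"für";
    char32_t first = word.at(0); // exactly one code point
    word = word.substr(1);       // safely drop the first letter
    std::cout << "first letter: U+" << std::hex << std::uppercase
              << static_cast<std::uint32_t>(first)
              << ", remaining length: " << std::dec << word.length() << std::endl;
}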

Reverse string with non-ASCII characters

I want to reverse the order of the characters in a string that contains special characters, like this:
ZAŻÓŁĆ GĘŚLĄ JAŹŃ
to
ŃŹAJ ĄŁŚĘG ĆŁÓŻAZ
I tried to use std::reverse:
std::string text("ZAŻÓŁĆ GĘŚLĄ JAŹŃ!");
std::cout << text << std::endl;
std::reverse(text.rbegin(), text.rend());
std::cout << text << std::endl;
but the output show me that:
ZAŻÓŁĆ GĘŚLĄ JAŹŃ!
!\203Ź\305AJ \204\304L\232Ř\304G \206āœû\305AZ <- reversed string
So I tried to do this "manually":
std::string text1("ZAŻÓŁĆ GĘŚLĄ JAŹŃ!");
std::cout << text1 << std::endl;
int count = (int) floorf(text1.size() / 2.f);
std::cout << count << " " << text1.size() << std::endl;
unsigned int maxIndex = text1.size() - 1;
for (int i = 0; i < count; i++)
{
    char tmp = text1[i];
    text1[i] = text1[maxIndex];
    text1[maxIndex] = tmp;
    maxIndex--;
}
std::cout << text1 << std::endl;
But in this case I have a problem with text1.size(), because every special character is counted twice:
ZAŻÓŁĆ GĘŚLĄ JAŹŃ!
13 27 <- second number is text1.size()
!\203Ź\305AJ \204\304L\232Ř\304G \206āœû\305AZ
What is the proper way to reverse a string with special characters?
Your code really does correctly reverse the bytes in your string; there's nothing wrong there. The problem, however, is that your compiler stores your literal string "ZAŻÓŁĆ GĘŚLĄ JAŹŃ!" in the UTF-8 encoding.
UTF-8 stores all characters except those that match ASCII as variable-length sequences of bytes. This means that one char (one byte) is no longer one character, so reversing chars is no longer the same as reversing characters.
To achieve your goal you have at least two options:
Use a UTF-8 library that lets you iterate over characters instead of bytes. One example is http://utfcpp.sourceforge.net/
Somehow (and this depends a lot on the compiler and OS you are using) switch to the UTF-32 encoding, which has a constant character length, and get good old constant-character-size strings without all these crazy variable-character-size troubles.
UPD: A nice link for you: http://www.joelonsoftware.com/articles/Unicode.html
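One way to do that UTF-32 round trip is std::wstring_convert, sketched below (deprecated since C++17 but still widely available). Note that this reverses code points, which is still not grapheme-safe, as a later answer explains.

#include <algorithm>
#include <codecvt>
#include <iostream>
#include <locale>
#include <string>

int main() {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    std::u32string wide = conv.from_bytes("ZAŻÓŁĆ GĘŚLĄ JAŹŃ!");
    std::reverse(wide.begin(), wide.end()); // reverse whole code points
    std::cout << conv.to_bytes(wide) << std::endl;
}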
You might code a reverseUtf8 function yourself:
#include <string>

std::string getMultiByteReversed(char ch1, char ch2)
{
    // ch2 is the byte that precedes ch1 in the original string; if it
    // is a two-byte UTF-8 lead byte (110xxxxx), emit the pair in its
    // original order so the sequence stays intact.
    if ((static_cast<unsigned char>(ch2) & 0xE0) == 0xC0) {
        return std::string(1, ch2) + std::string(1, ch1);
    } else {
        return std::string(1, ch1);
    }
}

std::string reverseMultiByteString(const std::string &s)
{
    std::string result;
    for (std::string::const_reverse_iterator it = s.rbegin(); it != s.rend(); ++it) {
        std::string reversed;
        if ((it + 1) != s.rend() && (reversed = getMultiByteReversed(*it, *(it + 1))).size() == 2) {
            result += reversed;
            ++it; // the lead byte was consumed as well
        } else {
            result += *it;
        }
    }
    return result;
}
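Hypothetical usage (assuming a UTF-8 execution character set; only two-byte sequences are handled, which covers the characters in the question):

#include <iostream>

int main() {
    std::cout << reverseMultiByteString("ZAŻÓŁĆ GĘŚLĄ JAŹŃ!") << std::endl;
}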
You can look up the utf8 codes at: http://www.utf8-chartable.de/
There are a couple of issues here. The answer is complex and can depend on exactly what you're trying to do.
First is that (as other answers have stated) if your string is UTF-8 encoded, one Unicode code point may consist of multiple bytes. If you just reverse the bytes, you'll break the UTF-8 encoding. The simplest (though not necessarily the best) fix for this is to convert the string to UTF-32 and reverse the 32-bit code points rather than bytes.
The next problem is that a single grapheme might consist of multiple Unicode code points. For example, an "é" might be encoded as the two code points U+0065 followed by U+0301. If you reverse the order of these, that breaks it, as the combining character U+0301 will now be associated with a different base character. So "Pokémon" reversed this way becomes "noḿekoP", with the accent over the "m" instead of the "e".
Now you might think that you can get around this problem by normalizing the string into a composed form first. That has its own problems, however, because not every grapheme can be represented by a single code point. For example, the Canadian flag emoji (🇨🇦) is represented by the code point U+1F1E8 followed by the code point U+1F1E6. There is no single code point for it. If you reverse its code points, you get the flag for Ascension Island (🇦🇨) instead.
Then you have languages where characters change form based on context, and I don't yet know much about dealing with those.
It may be closer to what you want to reverse grapheme clusters instead. See UAX #29: Unicode Text Segmentation.
Have you tried swapping characters one by one?
For example, if the string length is odd, swap the first character with the last, the second with the second last, until only the middle character is left. If the string length is even, swap the 1st with the last, the 2nd with the 2nd last, until both middle characters are swapped. That way, the string will be reversed.