Reverse string with non-ASCII characters - c++

I want to reverse a string that contains special characters, like this:
ZAŻÓŁĆ GĘŚLĄ JAŹŃ
to
ŃŹAJ ĄLŚĘG ĆŁÓŻAZ
I tried to use std::reverse:
std::string text("ZAŻÓŁĆ GĘŚLĄ JAŹŃ!");
std::cout << text << std::endl;
std::reverse(text.rbegin(), text.rend());
std::cout << text << std::endl;
but the output is:
ZAŻÓŁĆ GĘŚLĄ JAŹŃ!
!\203Ź\305AJ \204\304L\232Ř\304G \206āœû\305AZ <- reversed string
So I tried doing it "manually":
std::string text1("ZAŻÓŁĆ GĘŚLĄ JAŹŃ!");
std::cout << text1 << std::endl;
int count = (int) floorf(text1.size() /2.f);
std::cout << count << " " << text1.size() << std::endl;
unsigned int maxIndex = text1.size() - 1;
for (int i = 0; i < count ; i++)
{
char tmp = text1[i];
text1[i] = text1[maxIndex];
text1[maxIndex] = tmp;
maxIndex--;
}
std::cout << text1 << std::endl;
But in this case I have a problem with text1.size(), because every special character is counted twice:
ZAŻÓŁĆ GĘŚLĄ JAŹŃ!
13 27 <- second number is text1.size()
!\203Ź\305AJ \204\304L\232Ř\304G \206āœû\305AZ
What is the proper way to reverse a string with special characters?

Your code really does reverse the bytes in your string correctly; there's nothing wrong there. The problem is that your compiler stores the string literal "ZAŻÓŁĆ GĘŚLĄ JAŹŃ!" in UTF-8 encoding.
UTF-8 stores every character outside ASCII as a variable-length sequence of bytes. This means that one char (one byte) is no longer one character, so reversing chars is no longer the same as reversing characters.
To achieve your goal you have at least two options:
Use a UTF-8 library that lets you iterate over characters instead of bytes. One example is http://utfcpp.sourceforge.net/
Switch to UTF-32 encoding (how to do that depends a lot on the compiler and OS you are using). It has a constant character length, so you get good old fixed-width strings without the variable-width trouble.
UPD: A nice link for you: http://www.joelonsoftware.com/articles/Unicode.html
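The second option can be sketched without any library: decode the UTF-8 bytes into UTF-32 code points, reverse those, and encode back. This is a minimal sketch that assumes the input is well-formed UTF-8 and does no validation; the names decodeUtf8, encodeUtf8 and reverseCodePoints are made up for this illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Decode UTF-8 into UTF-32 code points.
// Assumption: the input is well-formed UTF-8; nothing is validated.
std::u32string decodeUtf8(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        // Number of continuation bytes that follow the lead byte.
        int extra = lead < 0x80 ? 0 : lead < 0xE0 ? 1 : lead < 0xF0 ? 2 : 3;
        // Payload bits of the lead byte: 0x1F, 0x0F or 0x07 for 2, 3, 4 bytes.
        char32_t cp = extra == 0 ? static_cast<char32_t>(lead)
                                 : static_cast<char32_t>(lead & (0x3F >> extra));
        for (int k = 1; k <= extra && i + k < in.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        out.push_back(cp);
        i += static_cast<std::size_t>(extra) + 1;
    }
    return out;
}

// Encode UTF-32 code points back into UTF-8.
std::string encodeUtf8(const std::u32string& in) {
    std::string out;
    for (char32_t cp : in) {
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Reverse a UTF-8 string code point by code point.
std::string reverseCodePoints(const std::string& utf8) {
    std::u32string cps = decodeUtf8(utf8);
    std::reverse(cps.begin(), cps.end());
    return encodeUtf8(cps);
}
```

Calling reverseCodePoints(text) on the string from the question keeps each two-byte letter intact. It still reverses code points, not user-perceived characters, so combining marks would end up on the wrong letter.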

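You might code a reverseUtf8 function yourself. Checking only for the lead byte '\xc3' is not enough for the string in the question (Ż and Ł, for example, start with '\xc5'), so this version steps back over every continuation byte instead:
#include <string>

// True if c is a UTF-8 continuation byte (bit pattern 10xxxxxx).
static bool isContinuationByte(char c)
{
    return (static_cast<unsigned char>(c) & 0xC0) == 0x80;
}

// Reverses the order of the characters in a UTF-8 string while keeping the
// bytes of each multi-byte sequence in their original order.
// Assumes s is valid UTF-8.
std::string reverseMultiByteString(const std::string &s)
{
    std::string result;
    result.reserve(s.size());
    std::string::size_type end = s.size();
    while (end > 0) {
        std::string::size_type start = end - 1;
        // Walk back from the last byte of the sequence to its lead byte.
        while (start > 0 && isContinuationByte(s[start]))
            --start;
        result.append(s, start, end - start);
        end = start;
    }
    return result;
}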
You can look up the utf8 codes at: http://www.utf8-chartable.de/

There are a couple of issues here. The answer is complex and can depend on exactly what you're trying to do.
First is that (as other answers have stated) if your string is UTF-8 encoded, one Unicode code point may consist of multiple bytes. If you just reverse the bytes, you'll break the UTF-8 encoding. The simplest (though not necessarily the best) fix for this is to convert the string to UTF-32 and reverse the 32-bit code points rather than bytes.
The next problem is that a single grapheme might consist of multiple Unicode code points. For example, an "é" might be encoded as the two code points U+0065 followed by U+0301. If you reverse the order of these, you break it, because the combining character U+0301 will now be associated with a different base character. So "Pokémon" reversed this way would become "noḿekoP", with the accent over the "m" instead of the "e".
Now you might think that you can get around this problem by normalizing the string into a composed form first. That has its own problems, however, because not every grapheme can be represented by a single code point. For example, the Canadian flag emoji (🇨🇦) is represented by the code point U+1F1E8 followed by the code point U+1F1E6. There is no single code point for it. If you reverse its code points, you get the flag for Ascension Island (🇦🇨) instead.
Then you have languages where characters change form based on context, and I don't yet know much about dealing with those.
It may be closer to what you want to reverse grapheme clusters. See UAX29: Unicode text segmentation.
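To illustrate the idea of keeping a base character and its accents together, here is a deliberately rough sketch. It assumes the text is already decoded to UTF-32 and treats only the block U+0300..U+036F (Combining Diacritical Marks) as combining. That is far less than what UAX29 specifies, so flags, other combining blocks and emoji sequences are not handled. The names isCombiningMark and reverseKeepingMarks are made up for this example.

```cpp
#include <cstddef>
#include <string>

// Rough approximation: only U+0300..U+036F count as combining marks here.
bool isCombiningMark(char32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

// Reverse code points, but keep each base character together with the
// combining marks that follow it.
std::u32string reverseKeepingMarks(const std::u32string& in) {
    std::u32string out;
    out.reserve(in.size());
    std::size_t end = in.size();
    while (end > 0) {
        std::size_t start = end - 1;
        // Step back over combining marks until the base character is reached.
        while (start > 0 && isCombiningMark(in[start]))
            --start;
        out.append(in, start, end - start);
        end = start;
    }
    return out;
}
```

With this, the decomposed "Pokémon" (e followed by U+0301) reverses with the accent still on the "e". For real text a library that implements UAX29 segmentation is the safer choice.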

Have you tried swapping the characters one by one?
For example, if the string length is odd, swap the first character with the last, the second with the second-to-last, and so on until only the middle character is left. If the string length is even, swap the first with the last, the second with the second-to-last, and so on until the two middle characters have been swapped. That way, the string will be reversed.

Related

Printing unicode Characters in C++

I'm trying to print an interface using these characters:
"╣║╗╝╚╔╩╦╠═╬"
but when I try to print it, I get something like this:
"ôöæËÈ"
interface.txt
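#include <fstream>
#include <iostream>
#include <string>
#include <vector>

// Reads up to 36 whitespace-separated chunks from the board file and echoes
// each one. Returning the data by value avoids handing out a pointer to a
// local array, which would dangle once the function returns.
std::vector<std::string> tabuleiroImportado() {
    std::ifstream TABULEIRO("tabuleiro.txt");
    std::vector<std::string> tabu;
    std::string linha;
    while (tabu.size() < 36 && TABULEIRO >> linha) {
        std::cout << linha << std::endl;
        tabu.push_back(linha);
    }
    return tabu;
}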
I'm using this function to import the interface.
Just like every other possible kind of data that lives in your computer, it must be represented by a sequence of bytes. Each byte can have just 256 possible values.
All the carbon-based life forms that live on the third planet from the sun use all sorts of different alphabets with all sorts of characters, whose total number is much more than 256.
A single byte by itself cannot, therefore, express all characters. The simplest way of handling this is to pick just 256 (or fewer) characters at a time, assign each of them to one of the possible byte values, and call the result your "character set".
Such is, apparently, your "tabuleiro.txt" file: its contents must be using some particular character set which includes the characters you expect to see there.
Your screen display, however, uses a different character set, hence the same values show different characters.
However, it's probably more complicated than that: modern operating systems and modern terminals employ multi-byte character sequences, where a single character can be represented by a specific sequence of more than one byte. It's fairly likely that your terminal screen is based on a multi-byte Unicode encoding.
In summary, you need to do three things:
Figure out which character set your file uses
Figure out which character set your terminal display uses
Then write the code to properly translate one to the other
It goes without saying that no one else could possibly tell you which character set your file uses, and which character set your terminal display uses. That's something you'll need to figure out. And without knowing both, you can't do step 3.
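To make step 3 concrete, here is a sketch of one possible translation. It rests on two assumptions that may not match your setup: that the file is encoded in code page 437 (the classic DOS code page) and that the terminal expects UTF-8. Only the eleven double-line box characters from the question are mapped; cp437BoxToCodePoint and cp437BoxToUtf8 are names invented for this example.

```cpp
#include <string>

// Hypothetical helper: maps a CP437 byte to the Unicode code point of the
// matching double-line box-drawing character; returns 0 for anything else.
char32_t cp437BoxToCodePoint(unsigned char b) {
    switch (b) {
        case 0xB9: return 0x2563; // ╣
        case 0xBA: return 0x2551; // ║
        case 0xBB: return 0x2557; // ╗
        case 0xBC: return 0x255D; // ╝
        case 0xC8: return 0x255A; // ╚
        case 0xC9: return 0x2554; // ╔
        case 0xCA: return 0x2569; // ╩
        case 0xCB: return 0x2566; // ╦
        case 0xCC: return 0x2560; // ╠
        case 0xCD: return 0x2550; // ═
        case 0xCE: return 0x256C; // ╬
        default:   return 0;
    }
}

// Re-encodes CP437 input as UTF-8. ASCII bytes pass through unchanged;
// bytes outside the small table above become '?'.
std::string cp437BoxToUtf8(const std::string& in) {
    std::string out;
    for (char c : in) {
        unsigned char b = static_cast<unsigned char>(c);
        if (b < 0x80) { out += c; continue; }
        char32_t cp = cp437BoxToCodePoint(b);
        if (cp == 0) { out += '?'; continue; }
        // Every mapped code point lies in U+0800..U+FFFF: three UTF-8 bytes.
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

If the file turns out to use a different code page, the table changes but the structure stays the same. For anything beyond a handful of characters, a conversion library is less error-prone than a hand-written table.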
To print Unicode characters, you can write the character's code point in a string literal with the \u prefix.
If the console does not support Unicode, you will not get the correct result.
Example:
#include <iostream>
int main() {
std::cout << "Character: \u2563" << std::endl;
std::cout << "Character: \u2551" << std::endl;
std::cout << "Character: \u2560" << std::endl;
}
Output:
Character: ╣
Character: ║
Character: ╠
Another approach is to store the character's numeric code in an unsigned char, the same way you would with a char: assign the code number to a variable (named a here, but any name works).
I used this while writing a game engine for the Windows console (cmd), and it worked with GNU GCC in C++17, in both 2021 and 2022.

String handling with Nordic characters is difficult in C++

I have tried many ways to solve this problem. I just want to split a string or process each of its characters. As soon as there are Nordic characters in the string, it's not possible to split that string.
The length() function returns the right answer in terms of memory use, but that's not the same as the number of characters. "ABCÆØÅ" does not have a length of 6; it has 9, one extra for each special character.
Does anybody have a good answer?
The test below shows the problem: some letters and a lot of ? marks. :-(
int main()
{
string name = "some æøå string";
for_each(name.begin(), name.end(), [] (char c) {
cout << c;
cout << endl;
});
}
If your terminal supports UTF-8, there should be no problem using std::cout with the string you enter, but you need to tell the compiler that you typed a UTF-8 string, like this (note that since C++20 a u8 literal has type const char8_t[] and no longer initializes a std::string directly):
int main()
{
string name = u8"some æøå string";
for_each(name.begin(), name.end(), [] (char c) {
cout << c;
cout << endl;
});
cout<<name; //this will also work
return 0; //add this just to be tidy
}
You need to do that because a character in UTF-8 may take 1, 2, 3 or 4 bytes, depending on the character.
Then, depending on what you need to do (for example, splitting between characters), you should write a function that detects how long each UTF-8 character is. You can then create a 'string' for each UTF-8 character and extract as many characters as needed from the original string.
There is a very good (and very compact) library, utf8proc, that lets you do such things.
utf8proc has helped me resolve these kinds of issues in many projects.
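The "detect how long each UTF-8 character is" step can be sketched by looking at the lead byte of each sequence. This is a minimal illustration that assumes valid UTF-8 input; utf8SequenceLength and splitUtf8 are names made up here, not part of utf8proc.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Length in bytes of the UTF-8 sequence that starts with this lead byte.
// Assumption: valid UTF-8, so an unexpected byte is treated as length 1.
std::size_t utf8SequenceLength(unsigned char lead) {
    if (lead < 0x80) return 1;            // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 1;
}

// Split a UTF-8 string into one std::string per encoded character.
std::vector<std::string> splitUtf8(const std::string& s) {
    std::vector<std::string> chars;
    for (std::size_t i = 0; i < s.size(); ) {
        std::size_t len = utf8SequenceLength(static_cast<unsigned char>(s[i]));
        chars.push_back(s.substr(i, len));  // substr clamps at the end of s
        i += len;
    }
    return chars;
}
```

For "ABCÆØÅ" stored as UTF-8 this yields six elements, while length() on the original string reports nine bytes.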

Bit manipulation on character string

Can we apply bit manipulation on a character string?
If so, is it always possible to retrieve back a character string from the manipulated string?
I was hoping to use the XOR operator on two strings by converting them to binary and then back to character string.
I took up some code from another StackOverflow question but it only solves half the problem
std::string TextToBinaryString(string words)
{
string binaryString = "";
for (char& _char : words)
{
binaryString +=std::bitset<8>(_char).to_string();
}
return binaryString;
}
I don't know how to convert this string of ones and zeroes back to a string of characters.
I did see std::stoi suggested as a solution in some Google search results, but I was not able to understand it.
The manipulation that I wish to do is
std::string message("Hello World");
int n = message.size();
bin_string = TextToBinaryString(message)
std::string left,right;
bin_string.copy(left,n/2,0);
bin_string.copy(right,n,n/2);
std::string result = left^right;
I know I can hardcode this by picking up every entry and applying the operation but it is the conversion of the binary string back to characters that are making me scratch my head.
EDIT: I am trying to implement a cipher framework called a Feistel cipher (sorry, I should have made that clear before). It uses the property of XOR that XORing something with the same value twice cancels out, e.g. (A^B)^B=A. I wanted to output the ciphered gibberish in the middle. Hence the query.
Can we apply bit manipulation on a character string?
Yes.
A character is an integer type, so you can do anything to them you can do to any other integer. What happened when you tried?
If so, is it always possible to retrieve back a character string from the manipulated string?
No. It is sometimes possible to recover the original string, but some manipulations are not reversible.
XOR, the particular operation you asked about, is self-reversing, so it works in that case but not in general.
A cheesy example (depends on ASCII character set, don't do this in real code for converting case, etc. etc.)
#include <iostream>
#include <string>
int main() {
std::string s("a");
std::cout << "original: " << s << '\n';
s[0] ^= 0x20;
std::cout << "modified: " << s << '\n';
s[0] ^= 0x20;
std::cout << "restored: " << s << '\n';
}
shows (on an ASCII-compatible system)
original: a
modified: A
restored: a
Note that I'm not converting "a" into "1100001" first, then using XOR to (somehow) zero bit 5, giving "1000001", and then converting that back into "A". Why would I?
This part of your question suggests you don't understand the difference between values and representations: the character is always stored in binary. You can also always treat it as if it is stored in octal, or in decimal, or in hexadecimal - the choice of base only affects how we write (or print) the value, and not what the value is in itself.
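If you do want the textual "0101..." form for display, the reverse of the question's TextToBinaryString can be sketched with std::bitset as well. The name BinaryStringToText is made up here, and the sketch assumes the input consists only of '0' and '1' characters (std::bitset throws std::invalid_argument otherwise); trailing bits that do not fill a whole byte are ignored.

```cpp
#include <bitset>
#include <cstddef>
#include <string>

// Turn a string of '0'/'1' characters back into bytes, 8 bits per char.
std::string BinaryStringToText(const std::string& bits) {
    std::string text;
    for (std::size_t i = 0; i + 8 <= bits.size(); i += 8) {
        std::bitset<8> byte(bits.substr(i, 8));          // parse 8 binary digits
        text += static_cast<char>(byte.to_ulong());      // value back to a char
    }
    return text;
}
```

This is only needed for printing or parsing the binary notation. The XOR itself can be applied directly to the chars.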
Writing a Feistel cipher where the plaintext and key are the same length is trivial:
std::string feistel(std::string const &text, std::string const &key)
{
std::string result;
std::transform(text.begin(), text.end(), key.begin(),
std::back_inserter(result),
[](char a, char b) { return a^b; }
);
return result;
}
This doesn't work at all if the key is shorter, though - looping round the key appropriately is left as an exercise for the reader.
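One way to do that exercise, as a sketch: index the key modulo its length so it wraps around. The function name xorWithKey is made up, and returning the text unchanged for an empty key is an arbitrary choice made here to avoid a division by zero.

```cpp
#include <cstddef>
#include <string>

// XOR every byte of text with the key, repeating the key as often as needed.
std::string xorWithKey(const std::string& text, const std::string& key) {
    if (key.empty()) return text;  // nothing to XOR with
    std::string result = text;
    for (std::size_t i = 0; i < result.size(); ++i)
        result[i] = static_cast<char>(result[i] ^ key[i % key.size()]);
    return result;
}
```

Applying xorWithKey twice with the same key gives back the original text. The intermediate result may contain unprintable bytes, including '\0', so treat it as binary data and not as a C string.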
Oh, and printing the encoded string is unlikely to work nicely (unless your key is helpfully just a sequence of space characters, as above).
You probably want something like this:
#include <string>
#include <cassert>

// words is taken by value, so the caller's string is left untouched.
std::string someBitmanipulation(std::string words)
{
    for (char& thechar : words)
    {
        thechar ^= 0x5A; // xor with 0x5A
    }
    return words; // return the modified copy
}
int main()
{
    std::string original{ "ABC" };
    // xor each char of original with 0x5a and put the result into manipulated
    auto manipulated = someBitmanipulation(original);
    // check that manipulating the manipulated string gives back the original string
    assert(original == someBitmanipulation(manipulated));
}
You don't need std::bitset at all.
Now change thechar ^= 0x5A; to say thechar |= 0x5A; and see what happens.

Reading alphabetical characters only from file - c++

I am to read words from a text file. Word is defined as a consecutive sequence of letters. So for example in the following string:
"It’s a ver5y good #” idea of a line. You know it?"
the words are:
it s a ver y good idea of line you know
('it' and 'a' are doubled)
I was wondering, if there's any clever function that reads words until it finds a non-alphabetical character? Or the only way to do it is to read char by char and use push_back until we find non-alphabetical one?
When you read a string from a stream, the stream reads a contiguous run of non-white-space characters as the string. It then ignores any white-space characters. The next non-white-space character is the beginning of the next string it'll read. This is pretty much the behavior you want, with one more exception: you want everything other than letters to be treated like white-space.
Fortunately, the stream doesn't hard-code its idea of what's "white space". It uses a locale to tell it what's white space. A locale, in turn, is composed of pieces that deal with individual aspects ("facets") of localization. The facet that deals specifically with classifying characters is a ctype facet. So, if we write a ctype facet that classifies everything other than a letter as white space, we can read "words" from the stream quite easily.
Here's some code to do exactly that:
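#include <algorithm>
#include <locale>
#include <vector>

struct alpha_only: std::ctype<char> {
    alpha_only(): std::ctype<char>(get_table()) {}
    static std::ctype_base::mask const* get_table() {
        // Start with every character classified as white space...
        static std::vector<std::ctype_base::mask>
            rc(std::ctype<char>::table_size,std::ctype_base::space);
        // ...then mark the letters. std::fill excludes its end pointer,
        // so go one past 'z' and 'Z' to include them.
        std::fill(&rc['a'], &rc['z'] + 1, std::ctype_base::lower);
        std::fill(&rc['A'], &rc['Z'] + 1, std::ctype_base::upper);
        return &rc[0];
    }
};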
The char specialization of a ctype facet is (always) table driven. All we really have to do is create a table that classifies characters properly. In this case, that means alphabetical characters are classified as upper- or lower-case, and everything else is classified as white-space. We do that by filling the table with ctype_base::space, then for the alphabetical characters basically saying: "oops, no, that's not white-space, that's upper- or lower-case."
Technically, the way I've done that is slightly incorrect--it assumes that upper-case and lower-case letters are contiguous. This is true of any sane character set, but not of EBCDIC. If we wanted to be technically correct, instead of the two "std::fill" calls, we could write a loop something like this:
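// Needs <limits> and <cctype>; rc is the table from get_table() above.
auto max = std::numeric_limits<unsigned char>::max();
for (int i=0; i<=max; i++)   // <= so that the last entry (255) is covered too
    if (islower(i))
        rc[i] = std::ctype_base::lower;
    else if (isupper(i))
        rc[i] = std::ctype_base::upper;
    else
        rc[i] = std::ctype_base::space;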
Either way, the conclusion is fairly simple: upper case is upper case, lower case is lower case, everything else is "white space".
Once we've written that, we need to tell the stream to use that locale; then we can read our words really easily:
int main() {
std::istringstream infile("It’s a ver5y good #” idea of a line. You know it?");
// Tell the stream to use our character classifier:
infile.imbue(std::locale(std::locale(), new alpha_only));
std::string word;
while (infile >> word)
std::cout << word << "\n";
}
[I've put a new-line between each "word" so you can easily see what it's reading as a word.]
Result:
It
s
a
ver
y
good
idea
of
a
line
You
know
it
Based on your result in the question, you apparently also want each word to appear only once in the output. To do that, you'd typically insert each word into a set as it's read, and only write it to the output if the insertion was successful.
std::unordered_set<std::string> words;
std::string word;
while (infile >> word)
if (words.insert(word).second)
std::cout << word << "\n";
The insert for set and unordered_set returns a pair<iterator, bool>, where the bool indicates whether insertion was successful. If it was previously present, that will fail and return false, so based on that we decide whether to write the word out or not.
With this modification, it still appears in the output twice--the first instance has the i capitalized, and the second doesn't. To filter that out, you'll need to convert each string entirely to lower-case (or entirely to upper-case) before inserting it into the set.
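That last step could look roughly like this. It is a sketch that assumes the words contain only ASCII letters (which the alpha_only facet guarantees); toLower and uniqueLowercase are hypothetical helper names, and the words are passed in as a vector only to keep the example self-contained.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <unordered_set>
#include <vector>

// Lower-case a copy of s. The cast to unsigned char keeps std::tolower
// well-defined for every char value.
std::string toLower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// Keep the first occurrence of each word, compared case-insensitively.
std::vector<std::string> uniqueLowercase(const std::vector<std::string>& words) {
    std::unordered_set<std::string> seen;
    std::vector<std::string> out;
    for (const std::string& w : words) {
        std::string lower = toLower(w);
        if (seen.insert(lower).second)  // true only for the first occurrence
            out.push_back(lower);
    }
    return out;
}
```

In the reading loop, the same idea is a one-line change: insert toLower(word) into the set and print it only when the insertion succeeds.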

C++ Non ASCII letters

How do I loop through the letters of a string when it has non-ASCII characters?
This works on Windows:
for (int i = 0; i < text.length(); i++)
{
std::cout << text[i];
}
But on Linux, if I do:
std::string text = "á";
std::cout << text.length() << std::endl;
it tells me the string "á" has a length of 2, while on Windows it's only 1.
With ASCII letters it works fine.
In your windows system's code page, á is a single byte character, i.e. every char in the string is indeed a character. So you can just loop and print them.
On Linux, á is represented as the multibyte (2 bytes to be exact) utf-8 character 'C3 A1'. This means that in your string, the á actually consists of two chars, and printing those (or handling them in any way) separately yields nonsense. This will never happen with ASCII characters because the utf-8 representation of every ASCII character fits in a single byte.
Unfortunately, utf-8 is not really supported by C++ standard facilities. As long as you only handle the whole string and neither access individual chars from it nor assume the length of the string equals the number of actual characters in the string, std::string will most likely do fine.
If you need more utf-8 support, look for a good library that implements what you need.
You might also want to read this for a more detailed discussion on different character sets on different systems and advice regarding string vs. wstring.
Also have a look at this for information on how to handle different character encodings portably.
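If all you need is the number of characters and not the number of bytes, one small trick is to count every byte that is not a UTF-8 continuation byte. This sketch assumes the string is valid UTF-8 and counts code points, not user-perceived characters; utf8Length is a made-up name.

```cpp
#include <cstddef>
#include <string>

// Count code points in a UTF-8 string by skipping continuation bytes
// (those of the form 10xxxxxx).
std::size_t utf8Length(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char c : s)
        if ((c & 0xC0) != 0x80)
            ++count;
    return count;
}
```

For the "á" from the question, stored as the two bytes C3 A1, this returns 1 while text.length() returns 2.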
Try using std::wstring. As far as I know, the standard doesn't specify the encoding it uses, so I wouldn't save these contents to a file without a library that handles a specific format. It stores wide characters, so you can use letters and symbols that ASCII doesn't support.
#include <iostream>
#include <string>
int main()
{
std::wstring text = L"áéíóú";
for (int i = 0; i < text.length(); i++)
std::wcout << text[i];
std::wcout << text.length() << std::endl;
}