C++ Character Encoding

This is my C++ code, where I'm trying to encode the received file path as UTF-8.
#include <cstring>
#include <string>
#include <iostream>
using namespace std;
void latin1_to_utf8(unsigned char *in, unsigned char *out);
string encodeToUTF8(string _strToEncode);
int main(int argc,char* argv[])
{
// Code to receive fileName from Sockets
cout << "recvd ::: " << recvdFName << "\n";
string encStr = encodeToUTF8(recvdFName);
cout << "encoded :::" << encStr << "\n";
}
void latin1_to_utf8(unsigned char *in, unsigned char *out)
{
while (*in)
{
if (*in<128)
{
*out++=*in++;
}
else
{
*out++=0xc2+(*in>0xbf);
*out++=(*in++&0x3f)+0x80;
}
}
*out = '\0';
}
string encodeToUTF8(string _strToEncode)
{
int len= _strToEncode.length();
unsigned char* inpChar = new unsigned char[len+1];
unsigned char* outChar = new unsigned char[2*(len+1)];
memset(inpChar,'\0',len+1);
memset(outChar,'\0',2*(len+1));
memcpy(inpChar,_strToEncode.c_str(),len);
latin1_to_utf8(inpChar,outChar);
string _toRet = (const char*)(outChar);
delete[] inpChar;
delete[] outChar;
return _toRet;
}
And the output is:
recvd ::: /Users/zeus/ÄÈÊÑ.txt
encoded ::: /Users/zeus/AÌEÌEÌNÌ.txt
The function latin1_to_utf8 above was given as a solution in "Convert ISO-8859-1 strings to UTF-8 in C/C++", and that answer is accepted, so it appears to work. I think I must be making a mistake somewhere, but I can't identify it. Can someone please help me out?
I first posted this question on Code Review, but I'm not getting any answers there, so sorry for the duplication.

Do you use any platform, or do you build directly on top of std? I am sure many people need such conversions, so libraries for them exist. I strongly recommend using a library, because it is tested and usually implements the best-known approach.
One library I found that does this is Boost.Locale.
If you use Qt, I recommend Qt's conversion facilities for this (they are platform independent).
In case you want to do it yourself (to see how it works, or for any other reason):
1. Make sure that you allocate memory! This is very important in C and C++. Since you use iostream, use new to allocate memory and delete to release it (this matters too: C++ won't figure out when to release it for you. That is the developer's job here; C++ is hardcore :D).
2. Check that you allocate the right amount of memory. Expect the UTF-8 output to need more memory than the input: each Latin-1 character above 127 takes two bytes in UTF-8.
3. Read the input from somewhere (terminal or file), but write the output to a new file. Then, when you open that file in a text editor, make sure the editor's encoding is set to UTF-8 (your text editor has to know how to interpret the data).
I hope that helps.

You are first outputting the original Latin-1 string to a terminal expecting a certain encoding, probably Latin-1. You then transcode to UTF-8 and output it to the same terminal, which interprets it differently. Classic mojibake. Try the following with the output instead:
const char* bytes = reinterpret_cast<const char*>(outChar); // strlen expects const char*
for(size_t i=0, len=strlen(bytes); i!=len; ++i)
std::cout << static_cast<unsigned>(static_cast<unsigned char>(bytes[i])) << ' ';
Note that the two casts are needed: the first gets the unsigned byte value, and the second widens it to unsigned so the stream does not treat it as a char. Your char might already be unsigned, but that depends on the compiler.
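To check the conversion independently of any terminal, here is a minimal sketch that applies the same Latin-1 to UTF-8 mapping to a std::string and renders the resulting bytes as hex text. The names latin1_to_utf8_string and hex_dump are hypothetical helpers, not part of the question's code, and the input is assumed to really be ISO-8859-1.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: treats every input byte as one ISO-8859-1 character.
std::string latin1_to_utf8_string(const std::string& in) {
    std::string out;
    out.reserve(in.size() * 2);  // worst case: every byte becomes two bytes
    for (unsigned char c : in) {
        if (c < 0x80) {
            out.push_back(static_cast<char>(c));                  // ASCII is unchanged
        } else {
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));    // lead byte: 0xC2 or 0xC3
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));  // continuation byte
        }
    }
    return out;
}

// Hypothetical helper: renders bytes as space-separated two-digit hex values.
std::string hex_dump(const std::string& bytes) {
    static const char digits[] = "0123456789abcdef";
    std::string out;
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(bytes[i]);
        if (i != 0) out.push_back(' ');
        out.push_back(digits[c >> 4]);
        out.push_back(digits[c & 0x0F]);
    }
    return out;
}
```

For the single Latin-1 byte 0xC4 (Ä), hex_dump(latin1_to_utf8_string("\xC4")) yields "c3 84". If a dump of the received name already shows multi-byte sequences, the input was probably not Latin-1 in the first place, and converting it again would double-encode it.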

Related

Coding a path in unicode c++

I had a problem opening files whose paths contain non-ASCII characters (such as Cyrillic or accented Latin letters). I found a way to solve it with _wfopen, but it only worked when I wrote the characters by hand as escapes (\uxxxx).
Is there a function, macro, or anything else that takes the string (path) and returns the Unicode form?
Something like this:
https://www.branah.com/unicode-converter
I tried MultiByteToWideChar, but it returns some hex numbers that are not relevant.
Tried:
std::wstring s2ws(const std::string& s)
{
int len;
int slength = (int)s.length() + 1;
len = MultiByteToWideChar(CP_ACP, 0, s.c_str(), slength, 0, 0);
wchar_t* buf = new wchar_t[len];
MultiByteToWideChar(CP_ACP, 0, s.c_str(), slength, buf, len);
std::wstring r(buf);
delete[] buf;
return r;
}
std::wstring stemp = s2ws(x);
LPCWSTR result = stemp.c_str();
The result I get: 0055F7E8
Thank you in advance
Update:
I installed Boost, and now I am trying to do it with Boost. Can someone help me out with that?
So I have a path:
wchar_t path[100] = _T("čaćšžđ\\test.txt");
I need it converted to:
wchar_t s[100] = _T("\u010d\u0061\u0107\u0161\u017e\u0111\\test.txt");

Here's a way to convert between UTF-8 and UTF-16 on Windows, as well as showing the real values of the stored code units for both input and output:
#include <codecvt>
#include <iostream>
#include <iomanip>
#include <string>
int main() {
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> convert;
std::string s = "test";
std::cout << std::hex << std::setfill('0');
std::cout << "Input `char` data: ";
for (char c : s) {
std::cout << std::setw(2) << static_cast<unsigned>(static_cast<unsigned char>(c)) << ' ';
}
std::cout << '\n';
std::wstring ws = convert.from_bytes(s);
std::cout << "Output `wchar_t` data: ";
for (wchar_t wc : ws) {
std::cout << std::setw(4) << static_cast<unsigned>(wc) << ' ';
}
std::cout << '\n';
}
Understanding the real values of the input and output is important because otherwise you may not correctly understand the transformation that you really need. For example it looks to me like there may be some confusion as to how VC++ deals with encodings, and what \Uxxxxxxxx and \uxxxx actually do in C++ source code (e.g., they don't necessarily produce UTF-8 data).
Try using code like that shown above to see what your input data really is.
To emphasize what I've written above; there are strong indications that you may not correctly understand the processing that's being done on your input, and you need to thoroughly check it.
The above program does correctly transform the UTF-8 representation of ć (U+0107) into the single 16-bit code unit 0x0107, if you replace the test string with the following:
std::string s = "\xC4\x87"; // UTF-8 representation of U+0107
The output of the program, on Windows using Visual Studio, is then:
Input char data: c4 87
Output wchar_t data: 0107
This is in contrast to if you use test strings such as:
std::string s = "ć";
Or
std::string s = "\u0107";
Which may result in the following output:
Input char data: 3f
Output wchar_t data: 003f
The problem here is that Visual Studio does not use UTF-8 as the encoding for strings without some trickery, so your request to convert from UTF-8 probably isn't what you actually need; or you do need conversion from UTF-8, but you're testing potential conversion routines using input that differs from your real input.
So I have a path: wchar_t path[100] = _T("čaćšžđ\\test.txt");
I need it converted to:
wchar_t s[100] = _T("\u010d\u0061\u0107\u0161\u017e\u0111\\test.txt");
Okay, so if I understand correctly, your actual problem is that the following fails:
wchar_t path[100] = _T("čaćšžđ\\test.txt");
FILE *f = _wfopen(path, L"w");
But if you instead write the string like:
wchar_t path[100] = _T("\u010d\u0061\u0107\u0161\u017e\u0111\\test.txt");
Then the _wfopen call succeeds and opens the file you want.
First of all, this has absolutely nothing to do with UTF-8. I assume you found some workaround using a char string and converting that to wchar_t and you somehow interpreted this as involving UTF-8, or something.
What encoding are you saving the source code with? Is the string L"čaćšžđ\\test.txt" actually being saved properly? Try closing the source file and reopening it. If some characters show up replaced by ?, then part of your problem is the source file encoding. In particular this is true of the default encoding used by Windows in most of North America and Western Europe: "Western European (Windows) - Codepage 1252".
You can also check the output of the following program:
#include <iomanip>
#include <iostream>
int main() {
wchar_t path[16] = L"čaćšžđ\\test.txt";
std::cout << std::hex << std::setfill('0');
for (wchar_t wc : path) {
std::cout << std::setw(4) << static_cast<unsigned>(wc) << ' ';
}
std::cout << '\n';
wchar_t s[16] = L"\u010d\u0061\u0107\u0161\u017e\u0111\\test.txt";
for (wchar_t wc : s) {
std::cout << std::setw(4) << static_cast<unsigned>(wc) << ' ';
}
std::cout << '\n';
}
Another thing you need to understand is that the \uxxxx form of writing characters, called Universal Character Names or UCNs, is not a form that you can convert strings to and from in C++. By the time you've compiled the program and it's running, i.e. by the time any code you write could be attempting to produce strings containing \uxxxx, the time when UCNs are interpreted by the compiler as different characters is long past. The only UCNs that will work are ones that are written directly in the source file.
Also, you're using _T() incorrectly. IMO you shouldn't be using TCHAR and the related macros at all, but if you do, use them consistently: don't mix TCHAR APIs with explicit use of the *W APIs or wchar_t. The whole point of TCHAR is to let code switch between the wchar_t APIs and Microsoft's "ANSI" APIs, so using TCHAR and then hard-coding an assumption that TCHAR is wchar_t defeats the entire purpose.
You should just write:
wchar_t path[100] = L"čaćšžđ\\test.txt";
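For the case where the input really is UTF-8 at run time, the UTF-8 to UTF-16 step discussed above can also be written out by hand, which avoids std::wstring_convert (deprecated since C++17). This is only a sketch: utf8_to_utf16 is a hypothetical name, std::u16string is used so the code does not depend on the platform's wchar_t size, and it does not reject overlong forms or encoded surrogates.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decodes UTF-8 bytes into UTF-16 code units.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b0 = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (b0 < 0x80)                { cp = b0;        extra = 0; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; extra = 1; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; extra = 2; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (extra > in.size() - i - 1)
            throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (b & 0x3F);
        }
        i += extra + 1;

        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            cp -= 0x10000;  // code points above the BMP become a surrogate pair
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

With the bytes "\xC4\x87" from the answer above, the result is the single code unit 0x0107. On Windows, where wchar_t is 16 bits wide, those code units could be copied into a std::wstring before being passed to _wfopen.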

Your code is Windows-specific, and you're using Visual C++. So, just use wide literals. Visual C++ supports wide strings for file stream constructors.
It's as simple as that, when you don't require portability.
#include <fstream>
#include <iostream>
#include <stdlib.h>
using namespace std;
auto main() -> int
{
wchar_t const path[] = L"cacšžd/test.txt";
ifstream f( path );
int ch;
while( (ch = f.get()) != EOF )
{
cout.put( ch );
}
}
Note, however, that this code is Visual C++ specific. That's reasonable for Windows-specific code. Possibly with C++17 we will have Boost file system library adopted into the standard library, and then for conformance g++ will ideally offer the constructor used here.

The problem was that I was saving the .cpp file as ANSI; I had to convert it to UTF-8. I had tried this before posting, but VS 2015 turned it back into ANSI, so I had to change the encoding within VS to get it working.
I tried opening the .cpp file in Notepad++ and changing the encoding, but when I opened VS it reverted automatically. So I looked for an encoding option under Save As, but there is none. Finally I found it, in Visual Studio 2015:
File -> Advanced Save Options; in the Encoding dropdown, change it to Unicode.
One thing still seems strange to me: how did VS display the characters normally, when the same file opened in Notepad++ showed ? (as expected, because of ANSI)?

Handling Automatic Naming of Files In C++ Sprintf

I am currently writing a program in C++. I want to save a number of files continuously throughout the run of my program. The format of the filename is as such:
char fnameC[sizeof "C:\..._SitTurn_104_c2_00_00_000.bmp"];
- SitTurn is an experiment name
- 104 is an experiment number
These two will be changing after each different run of the program. Currently, my program works like this:
char fnameCVS[sizeof"C:\\Users\\Adam\\Desktop\\Skeleton_RGBDepth_DataAcquisition2013\\Skeleton_RGBDepth_DataAcquisition2013\\Skeleton_SitTurn_104_c2_02.csv"];
LARGE_INTEGER frequency;
LARGE_INTEGER t1, t2;
double elapsedTime;
SYSTEMTIME comptime;
int main(int argc, char *argv[])
{
GetSystemTime(&comptime);
sprintf_s(fnameCVS, "C:\\Users\\Adam\\Desktop\\Skeleton_RGBDepth_DataAcquisition2013\\Skeleton_RGBDepth_DataAcquisition2013\\Skeleton_SitTurn_104_c2_%02d.csv", comptime.wDay);
However, I tried this and I can't seem to get it to work. Can anyone help me?
...//rest of code set up
string expName = "SitStand";
string subjNumber = "101";
char fnameCVS[sizeof "C:\\Users\\Adam\\Desktop\\Skeleton_RGBDepth_DataAcquisition2013\\Skeleton_RGBDepth_DataAcquisition2013\\Skeleton_" + expName + "_" + subjNumber + "_c2_02.csv"];
LARGE_INTEGER frequency;
LARGE_INTEGER t1, t2;
double elapsedTime;
SYSTEMTIME comptime;
int main(int argc, char *argv[])
{
GetSystemTime(&comptime);
sprintf_s(fnameCVS, "C:\\Users\\Adam\\Desktop\\Skeleton_RGBDepth_DataAcquisition2013\\Skeleton_RGBDepth_DataAcquisition2013\\Skeleton_" + expName + "_" + subjNumber + "_c2_%02d.csv", comptime.wDay);
Since I am using this filename later in the program as well, I would like to be able to rename all files just by changing two strings: expName and subjNumber. Can someone explain how I can build my filenames from string inputs (e.g. expName and subjNumber), so that I only have to change those strings each time the experiment name or subject number changes? Thanks!

Try this:
char fnameCVS[MAX_PATH+1];
SYSTEMTIME comptime;
GetSystemTime(&comptime);
sprintf_s(fnameCVS, _countof(fnameCVS), "C:\\Users\\Adam\\Desktop\\Skeleton_RGBDepth_DataAcquisition2013\\Skeleton_RGBDepth_DataAcquisition2013\\Skeleton_SitTurn_104_c2_%02d.csv", comptime.wDay);
Or this:
#include <iomanip>
#include <string>
#include <sstream>
std::string expName = "SitStand";
std::string subjNumber = "101";
std::string fnameCVS;
SYSTEMTIME comptime;
GetSystemTime(&comptime);
std::ostringstream oss;
oss << "C:\\Users\\Adam\\Desktop\\Skeleton_RGBDepth_DataAcquisition2013\\Skeleton_RGBDepth_DataAcquisition2013\\Skeleton_" << expName << "_" << subjNumber << "_c2_" << std::setw(2) << std::setfill('0') << comptime.wDay << ".csv";
fnameCVS = oss.str();

You are mixing sprintf and std::string, which is never a good plan. You should pick one: either C's sprintf with char *, or C++'s std::string with std::stringstream.
Your fnameCVS array isn't going to be big enough: you'll take the sizeof of a std::string, which almost certainly will not be what you want.
Option 1: Use only sprintf. Allocate a big-enough string (e.g. char fnameCVS[256]) and use snprintf(fnameCVS, 256, "...Skeleton_%s_%d_c2_%02d.csv", ...).
Option 2: Use only string and use a std::stringstream to build your filename.
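As a rough sketch of Option 1, the snprintf call can be wrapped in a function that checks the return value, so a name that does not fit is reported instead of being silently truncated. The function name make_csv_name, its dir parameter, and the 260-byte buffer are assumptions for illustration; the subject number is taken as a string here, so it is formatted with %s.

```cpp
#include <cstddef>
#include <cstdio>
#include <stdexcept>
#include <string>

// Hypothetical helper: builds "<dir>Skeleton_<exp>_<subj>_c2_<DD>.csv".
std::string make_csv_name(const std::string& dir,
                          const std::string& expName,
                          const std::string& subjNumber,
                          int day) {
    char buf[260];  // assumed upper bound for the full path
    int n = std::snprintf(buf, sizeof buf, "%sSkeleton_%s_%s_c2_%02d.csv",
                          dir.c_str(), expName.c_str(), subjNumber.c_str(), day);
    // snprintf returns the length it wanted to write; >= size means truncation.
    if (n < 0 || static_cast<std::size_t>(n) >= sizeof buf)
        throw std::length_error("file name does not fit in the buffer");
    return std::string(buf, static_cast<std::size_t>(n));
}
```

Calling make_csv_name("C:\\data\\", "SitStand", "101", 7) gives "C:\\data\\Skeleton_SitStand_101_c2_07.csv" (the directory is a placeholder), so changing the experiment name or subject number only means changing the arguments.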

This is a really bad idea:
char fnameCVS[sizeof"C:\\Users\\Adam\\Desktop\\Skeleton_RGBDepth_DataAcquisition2013\\Skeleton_RGBDepth_DataAcquisition2013\\Skeleton_SitTurn_104_c2_02.csv"];
The main reason is that it is very difficult to visually inspect whether you have allocated the right number of bytes. Perhaps you make a slight change to the filename later in the sprintf line but then you forget to update this line or you make a typo. Boom, buffer overflow (which may go undetected until it is time to give a presentation).
A secondary bug is that when you use %02d in printf, the 2 is a minimum field width; if the number would require more than 2 digits then it outputs more than 2 digits, causing a buffer overflow. To be safe here you'd need to check that the number is between 0 and 99 before printing it.
Finally, sprintf_s is a non-standard function, there's really no reason to use it instead of sprintf or snprintf.
In C++ the equivalent formatting is a bit more wordy, but leaves no possibility of buffer overflows:
std::string fnameCVS;
// ...
std::ostringstream oss;
oss << "C:\\Users\\whatever...." << std::setw(2) << std::setfill('0')
<< comptime.wDay;
fnameCVS = oss.str();
If you really want to stick with the printf family plus a static char array (note: you can use printf and a dynamically-sized char container) then to make your code safe:
char const my_format[] = "C:\\Users\\whatever.....\\%02d.csv";
char fnameCVS[ sizeof my_format - 2 ]; // "NN" is two chars shorter than "%02d"
// ...
if ( comptime.wDay < 0 || comptime.wDay > 99 )
throw std::runtime_error("wDay out of range");
snprintf(fnameCVS, sizeof fnameCVS, my_format, comptime.wDay);
Your update indicates that you want to compute various other parts of the filename at runtime too; the C++ version that I suggest is easier to extend than the C-with-static-array version where you have to calculate the amount of memory you need by hand.

Quickly convert raw data to hex string in c++

I'm reading data from a file and trying to display the raw data as 2 digit hex strings.
I'm using the Qt framework, specifically the QTextEdit.
I've tried a bunch of different approaches and have almost accomplished what I want it to do, however it has some unexpected errors I don't know anything about.
Currently this is my implementation:
1) Read in the data:
ifstream file (filePath, ios::in|ios::binary|ios::ate);
if (file.is_open())
{
size = file.tellg();
memblock = new char [size+1];
file.seekg(0, ios::beg);
file.read(memblock, size);
file.close();
}
2) Create a single QString that will be used (because QTextEdit requires a QString):
QString s;
3) Loop through the array appending each successive character to the QString s.
int count = 0;
for(i=0;i<size;i++)
{
count++;
s.append(QString::number(memblock[i], 16).toUpper());
s.append("\t");
if (count == 16)
{
s.append("\n");
count -= 16;
}
}
Now this works fine, except when it reaches a character FF, it appears as FFFFFFFFFFFFFFFF
So my main questions are:
Why do only the 'FF' characters appear as 'FFFFFFFFFFFFFFFF' instead?
Is there a way to convert the char data to base 16 strings without using QString::number?
I want this implementation to be as fast as possible, so if something like sprintf could work, please let me know, as I would guess that might be faster that QString::number.

QString can't be used for binary data. You should use QByteArray instead. It can be easily created from char* buffer and can be easily converted to hex string using toHex.
QByteArray array(memblock, size);
textEdit->setText(QString(array.toHex()));

QString::number doesn't have an overload that takes a char, so your input is being promoted to an int; consequently you're seeing the effects of sign extension. You should be seeing similar behavior for any input greater than 0x7F.
Try casting the data prior to calling the function.
s.append(QString::number(static_cast<unsigned char>(memblock[i]), 16).toUpper());
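On the question of doing this without QString::number: one common approach is a small lookup table that emits two hex digits per byte straight into a pre-sized std::string. The sketch below is framework-independent; bytes_to_hex and its per_line parameter are hypothetical names, and the tab/newline layout simply mirrors the loop in the question. Whether it is actually faster in your program is something to measure rather than assume.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: two uppercase hex digits per byte, a tab after each
// byte, and a newline instead of the tab after every per_line bytes.
std::string bytes_to_hex(const char* data, std::size_t size,
                         std::size_t per_line = 16) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    out.reserve(size * 3);  // two digits plus one separator per byte
    for (std::size_t i = 0; i < size; ++i) {
        // Go through unsigned char so 0xFF is not sign-extended.
        unsigned char b = static_cast<unsigned char>(data[i]);
        out.push_back(digits[b >> 4]);
        out.push_back(digits[b & 0x0F]);
        bool end_of_line = per_line != 0 && (i + 1) % per_line == 0;
        out.push_back(end_of_line ? '\n' : '\t');
    }
    return out;
}
```

The result could then be handed to the text edit in one step, for example through QString::fromStdString, instead of appending to a QString once per byte.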

Writing chars as a byte in C++

I'm writing a Huffman encoding program in C++, and am using this website as a reference:
http://algs4.cs.princeton.edu/55compression/Huffman.java.html
I'm now at the writeTrie method, and here is my version:
// write bitstring-encoded tree to standard output
void writeTree(struct node *tempnode){
if(isLeaf(*tempnode)){
tempfile << "1";
fprintf(stderr, "writing 1 to file\n");
tempfile << tempnode->ch;
//tempfile.write(&tempnode->ch,1);
return;
}
else{
tempfile << "0";
fprintf(stderr, "writing 0 to file\n");
writeTree(tempnode->left);
writeTree(tempnode->right);
}
}
Look at the line commented - let's say I'm writing to a text file, but I want to write the bytes that make up the char at tempnode->ch (which is an unsigned char, btw). Any suggestions for how to go about doing this? The line commented gives an invalid conversion error from unsigned char* to const char*.
Thanks in advance!
EDIT: To clarify: I'd like my final text file to be in binary, that is, 1's and 0's only. If you look at the header of the link I provided, they give an example of "ABRACADABRA!" and the resulting compression. I'd like to take the char (such as 'A' in the example above), use its unsigned int value (A = 65), and write 65 in binary, as a byte.

A char is identical to a byte. The preceding line tempfile << tempnode->ch; already does exactly what you seem to want.
There is no overload of write for unsigned char, but if you want, you can do
tempfile.write(reinterpret_cast< char * >( &tempnode->ch ),1);
This is rather ugly, but it does exactly the same thing as tempfile << tempnode->ch.
EDIT: Oh, you want to write a sequence of 1 and 0 characters for the bits in the byte. C++ has an obscure trick for that:
#include <bitset>
tempfile << std::bitset< 8 >( tempnode->ch );
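To make the bitset approach concrete, here is a small sketch that works with any std::ostream, so it can be tried with a string stream before pointing it at a file. The helper names write_bits and byte_to_bits are hypothetical.

```cpp
#include <bitset>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical helper: writes the 8 bits of ch as '0'/'1' characters,
// most significant bit first.
void write_bits(std::ostream& os, unsigned char ch) {
    os << std::bitset<8>(ch);
}

// Convenience wrapper returning the same text as a string.
std::string byte_to_bits(unsigned char ch) {
    std::ostringstream oss;
    write_bits(oss, ch);
    return oss.str();
}
```

For 'A' (65) this produces "01000001". Note that the output is eight text characters per byte, which is readable but eight times larger than writing the raw byte.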

How I can print the wchar_t values to console?

Example:
#include <iostream>
using namespace std;
int main()
{
wchar_t en[] = L"Hello";
wchar_t ru[] = L"Привет"; //Russian language
cout << ru
<< endl
<< en;
return 0;
}
This code only prints hex values, like an address.
How to print the wchar_t string?

Edit: This doesn’t work if you are trying to write text that cannot be represented in your default locale. :-(
Use std::wcout instead of std::cout.
wcout << ru << endl << en;

Can I suggest std::wcout ?
So, something like this:
std::cout << "ASCII and ANSI" << std::endl;
std::wcout << L"INSERT MULTIBYTE WCHAR* HERE" << std::endl;
You might find more information in a related question here.

You cannot portably print wide strings using standard C++ facilities.
Instead you can use the open-source {fmt} library to portably print Unicode text. For example (https://godbolt.org/z/nccb6j):
#include <fmt/core.h>
int main() {
const char en[] = "Hello";
const char ru[] = "Привет";
fmt::print("{}\n{}\n", ru, en);
}
prints
Привет
Hello
This requires compiling with the /utf-8 compiler option in MSVC.
For comparison, writing to wcout on Linux:
wchar_t en[] = L"Hello";
wchar_t ru[] = L"Привет";
std::wcout << ru << std::endl << en;
may transliterate the Russian text into Latin (https://godbolt.org/z/za5zP8):
Privet
Hello
This particular issue can be fixed by switching to a locale that uses UTF-8 but a similar problem exists on Windows that cannot be fixed just with standard facilities.
Disclaimer: I'm the author of {fmt}.

The Windows documentation on this can be very confusing. It may help to learn the C/C++ concepts on Unix/Linux before programming on Windows.
On Windows, wchar_t is a 16-bit type holding UTF-16 code units (on Linux it is usually 32 bits). wprintf() and std::wcout often fail to print non-English wide characters correctly, because consoles do not take UTF-16 directly: Windows outputs in the current code page, while Unix/Linux usually outputs UTF-8, and both are multi-byte encodings. So you have to convert wide characters to multi-byte before printing. On Windows you can use WideCharToMultiByte() for this, since it lets you choose the code page explicitly (wcstombs() converts to the current C locale's encoding instead).
First you need to save the source file as UTF-8 using Notepad or another editor. Then install a font in the Command Prompt console that covers your language, and change the console code page to UTF-8 by typing "chcp 65001" in the Command Prompt so that it displays correctly; Cygwin already defaults to UTF-8. Here is what I did for Thai.
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>
int main()
{
const wchar_t* in=L"ทดสอบ"; // Thai text
char* out=(char *)malloc(15);
// -1: the input is null-terminated, so the terminator is converted too
WideCharToMultiByte(874, 0, in, -1, out, 15, NULL, NULL);
printf("%s", out); // result is correctly in Thai although not neat
free(out);
}
Note that
874 = the Thai code page in the operating system, 15 = size of the output buffer in bytes
My suggestion is to avoid printing non-English wide characters to the console unless necessary, because it is not easy.

#include <clocale>
#include <iostream>
using namespace std;
int main()
{
setlocale(LC_ALL, "Russian");
cout << "\tДОБРО ПОЖАЛОВАТЬ В КИНО!\n";
}

The way to do it is to convert UTF-16 LE (the default Windows encoding) into UTF-8, and then print to the console (run chcp 65001 first, to switch the code page to UTF-8).
It's pretty trivial to convert UTF-16 to UTF-8. Use this page as a guide if you need characters longer than 2 bytes.
// cmd is assumed to point to a null-terminated UTF-16 string
const unsigned short* cmd_s = (const unsigned short*)cmd;
size_t i = 0;
while(cmd_s[i] != 0)
{
unsigned short u16 = cmd_s[i++]; // unsigned, so values >= 0x8000 do not turn negative
if(u16 > 0x7F)
{
// Two-byte form: only correct for U+0080..U+07FF
unsigned char c0 = (u16 & 0x3F) | 0x80; // Least significant
unsigned char c1 = ((u16 >> 6) & 0x1F) | 0xC0; // Most significant
cout << c1 << c0; // Lead byte first, then continuation byte
}
else
{
unsigned char c0 = (unsigned char)u16;
cout << c0;
}
}
Of course, you can put it in a function and extend it to handle wider characters (for Cyrillic, the two-byte form is enough), but I wanted to show the basic algorithm and to prove that it's not hard at all: you don't need any libraries, just a few lines of code.
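As a sketch of what that extension might look like, the function below also covers three-byte sequences and surrogate pairs. It takes a std::u16string rather than a Windows wide string so that it compiles anywhere; utf16_to_utf8 is a hypothetical name, and replacing unpaired surrogates with U+FFFD is one possible policy, not the only one.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: encodes UTF-16 code units as UTF-8 bytes.
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // High surrogate: combine with the following low surrogate.
            if (i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                ++i;
            } else {
                cp = 0xFFFD;  // unpaired high surrogate
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            cp = 0xFFFD;      // stray low surrogate
        }

        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```

For example, the Cyrillic letter U+041F comes out as the two bytes D0 9F, and the Thai letter U+0E17 as the three bytes E0 B8 97.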

You could use a normal char array that is actually filled with UTF-8 characters. This should allow mixing characters across languages.

You can print wide characters with wprintf.
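#include <clocale>
#include <cwchar>
int main()
{
setlocale(LC_ALL, ""); // use the environment's locale so non-ASCII text can be converted
wchar_t en[] = L"Hello";
wchar_t ru[] = L"Привет"; //Russian language
wprintf(L"%ls\n", en); // pass the text as an argument, not as the format string
wprintf(L"%ls\n", ru);
return 0;
}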
Output:
Hello
Привет
