I've built an EEPROM programmer (an EEPROM is kind of like a very low-tech USB stick), and I'm writing a program which reads text from a .txt file and converts it into bin/hex that the programmer can use to write the data onto the EEPROM. I've got everything except for a function that converts the string into hex. I've tried using this code, which works somewhat well:
string Text = "This is a string.";
for(int i = 0; i < Text.size(); i++) {
cout << uppercase << hex << (int)Text[i] << " ";
}
This will output this:
54 68 69 73 20 69 73 20 61 20 73 74 72 69 6E 67 2E
But when giving it:
Thïs ìs â stríng.
It will return this:
54 68 FFFFFFC3 FFFFFFAF 73 20 FFFFFFC3 FFFFFFAC 73 20 FFFFFFC3 FFFFFFA2 20 73 74 72 FFFFFFC3 FFFFFFAD 6E 67 2E
This doesn't look right to me. My best guess is that normal chars are converted to ASCII and the special ones are converted to some form of Unicode.
Is there a way to make everything Unicode?
Side note: the EEPROM can only hold 2k bytes, so the more space-efficient the better.
So my end goal is:
Make a function that turns a string into its hex equivalent.
With the end result being space efficient and supporting diacritics.
Make another function that could read the hex and turn it into a string, also with support for diacritics.
If that is not possible I'm willing to use a custom formatting that would store an 'ê' like "|e^" for example. With an equivalent of "|" as a way for me to intercept a special character.
Thanks for your help!
A double cast is needed here: first cast the character to (unsigned char), then cast to (int):
(int)(unsigned char)Text[i]
This is necessary because casting directly to (unsigned int) does not work as you might expect. The signed char value is widened first and the cast is applied afterwards, but at that point the sign extension has already been performed, which is where the leading FFFFFF comes from.
See this on https://godbolt.org/z/GhYz3T8v6
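As a rough sketch of the two functions the question asks for, the double cast can be wrapped in a pair of helpers. The names toHex and fromHex are made up for this illustration. The sketch assumes the input string is already UTF-8 (as the C3 AF, C3 AC bytes in the output above suggest) and simply stores those bytes unchanged, so plain ASCII characters cost one byte each and the accented letters shown in the question cost two.

```cpp
#include <cstddef>
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: encode every byte of `text` as two uppercase hex
// digits, separated by single spaces.
std::string toHex(const std::string& text)
{
    std::ostringstream out;
    out << std::uppercase << std::hex << std::setfill('0');
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (i != 0)
            out << ' ';
        // Cast to unsigned char first so bytes >= 0x80 are not sign-extended.
        out << std::setw(2)
            << static_cast<int>(static_cast<unsigned char>(text[i]));
    }
    return out.str();
}

// Hypothetical inverse: parse whitespace-separated hex bytes back into a string.
// Parsing stops at the first token that is not a hex number.
std::string fromHex(const std::string& hex)
{
    std::istringstream in(hex);
    std::string result;
    unsigned int byte = 0;
    while (in >> std::hex >> byte)
        result.push_back(static_cast<char>(byte));
    return result;
}
```

With these helpers, toHex("This is a string.") gives the same 17 byte values as the first output in the question, and fromHex(toHex(s)) returns s unchanged, diacritics included, because the UTF-8 bytes are never reinterpreted.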
Related
I am currently trying to write a text renderer. I have code that procedurally generates an atlas map based on a font up to some number of UTF characters (currently up to 0xFF for testing). It also skips the ranges that don't represent characters (for obvious reasons).
The goal now is to calculate the UV coordinates of a character for each character in a string, for which I need to take an arbitrary string and iterate through every character, obtaining the utf index of that character (which I can then map to the index in the atlas).
However, although the conversion works just fine for ASCII characters, UTF characters are all over the place.
If I create a regular string and then iterate over the characters, the UTF characters explode into negative representations, in spite of the string being printed correctly by cout.
If instead I convert the string to a u32string the code iterates each utf character as 2 characters.
Essentially this:
string str = " ABCDEFGHKabcdefgz¡¥";
std::u32string input;
input.clear();
for(unsigned char c : str) input += c;
std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> converter;
for(char32_t c : input)
{
std::cout << converter.to_bytes(c) << std::endl;
std::cout << c << std::endl;
}
Prints:
32
A
65
B
66
C
67
D
68
E
69
F
70
G
71
H
72
K
75
a
97
b
98
c
99
d
100
e
101
f
102
g
103
z
122
Â
194
¡
161
Â
194
¥
165
Is there a way in C++ to iterate through every UTF character (standalone) and convert it to an unsigned integer? Ideally I would like to be able to print the character I am parsing to the terminal as well as the code (for debugging).
There's lots of questions on this topic on SO, but none I read described how to convert the characters to indices dynamically.
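The output above shows each byte of the UTF-8 string being widened separately (194 then 161 for ¡), which is why every non-ASCII character appears as two entries. One possible approach, not taken from the thread, is to decode the UTF-8 byte sequences by hand. The sketch below assumes the std::string holds UTF-8; decodeUtf8 is a hypothetical name, and the decoder is deliberately minimal: it does not reject overlong encodings or surrogate values.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decode UTF-8 bytes into UTF-32 code points.
// Invalid lead bytes, bad continuation bytes and truncated sequences
// are replaced with U+FFFD.
std::u32string decodeUtf8(const std::string& bytes)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else { out.push_back(0xFFFD); ++i; continue; }  // not a valid lead byte

        if (extra >= bytes.size() - i) {  // sequence runs past the end
            out.push_back(0xFFFD);
            break;
        }
        bool ok = true;
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) { ok = false; break; }
            cp = (cp << 6) | (c & 0x3F);  // append the 6 payload bits
        }
        if (!ok) { out.push_back(0xFFFD); ++i; continue; }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

Each element of the result can be used directly as an atlas index. For debugging, printing static_cast<unsigned long>(cp) shows the numeric code point, while the original bytes of the sequence can still be written to cout unchanged to show the character itself.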
So I am writing a program to turn a Chinese-English definition .txt file into a vocab trainer that runs through the CLI. However, on Windows, when I compile this in VS2017 and run it, the output turns into gibberish and I'm not sure why. I think it was working OK on Linux, but Windows seems to mess it up quite a bit. Does this have something to do with the encoding table in Windows? Am I missing something? I wrote the code and the input file on Linux, but I also tried writing the characters using the Windows IME and got the same result. I think the picture speaks best for itself. Thanks
Note: Added sample of input/output as it appears in Windows, as requested. Also, input is UTF-8.
Sample of input
人(rén),person
刀(dāo),knife
力(lì),power
又(yòu),right hand; again
口(kǒu),mouth
Sample of output
人(rén),person
刀(dāo),knife
力(lì),power
又(yòu),right hand; again
口(kǒu),mouth
土(tǔ),earth
Picture of Input file & Output
TL;DR: The Windows terminal hates Unicode. You can work around it, but it's not pretty.
Your issues here are unrelated to "char versus wchar_t". In fact, there's nothing wrong with your program! The problems only arise when the text leaves through cout and arrives at the terminal.
You're probably used to thinking of a char as a "character"; this is a common (but understandable) misconception. In C/C++, the char type is usually synonymous with an 8-bit integer, and thus is more accurately described as a byte.
Your text file chineseVocab.txt is encoded as UTF-8. When you read this file via fstream, what you get is a string of UTF-8-encoded bytes.
There is no such thing as a "character" in I/O; you're always transmitting bytes in a particular encoding. In your example, you are reading UTF-8-encoded bytes from a file handle (fin).
Try running this, and you should see identical results on both platforms (Windows and Linux):
#include <fstream>
#include <iostream>
#include <string>
using namespace std;
int main()
{
fstream fin("chineseVocab.txt");
string line;
while (getline(fin, line))
{
cout << "Number of bytes in the line: " << dec << line.length() << endl;
cout << " ";
for (char c : line)
{
// Here we need to trick the compiler into displaying this "char" as an integer:
unsigned int byte = (unsigned char)c;
cout << hex << byte << " ";
}
cout << endl;
cout << endl;
}
return 0;
}
Here's what I see in mine (Windows):
Number of bytes in the line: 16
e4 ba ba 28 72 c3 a9 6e 29 2c 70 65 72 73 6f 6e
Number of bytes in the line: 15
e5 88 80 28 64 c4 81 6f 29 2c 6b 6e 69 66 65
Number of bytes in the line: 14
e5 8a 9b 28 6c c3 ac 29 2c 70 6f 77 65 72
Number of bytes in the line: 27
e5 8f 88 28 79 c3 b2 75 29 2c 72 69 67 68 74 20 68 61 6e 64 3b 20 61 67 61 69 6e
Number of bytes in the line: 15
e5 8f a3 28 6b c7 92 75 29 2c 6d 6f 75 74 68
So far, so good.
The problem starts now: you want to write those same UTF-8-encoded bytes to another file handle (cout).
The cout file handle is connected to your CLI (the "terminal", the "console", the "shell", whatever you wanna call it). The CLI reads bytes from cout and decodes them into characters so they can be displayed.
Linux terminals are usually configured to use a UTF-8 decoder. Good news! Your bytes are UTF-8-encoded, so your Linux terminal's decoder matches the text file's encoding. That's why everything looks good in the terminal.
Windows terminals, on the other hand, are usually configured to use a system-dependent decoder (yours appears to be DOS codepage 437). Bad news! Your bytes are UTF-8-encoded, so your Windows terminal's decoder does not match the text file's encoding. That's why everything looks garbled in the terminal.
OK, so how do you solve this? Unfortunately, I couldn't find any portable way to do it... You will need to fork your program into a Linux version and a Windows version. In the Windows version:
Convert your UTF-8 bytes into UTF-16 code units.
Set standard output to UTF-16 mode.
Write to wcout instead of cout
Tell your users to change their terminals to a font that supports Chinese characters.
Here's the code:
#include <fstream>
#include <iostream>
#include <string>
#include <windows.h>
#include <fcntl.h>
#include <io.h>
#include <stdio.h>
using namespace std;
// Based on this article:
// https://msdn.microsoft.com/magazine/mt763237?f=255&MSPPError=-2147217396
wstring utf16FromUtf8(const string & utf8)
{
std::wstring utf16;
// Empty input --> empty output
if (utf8.length() == 0)
return utf16;
// Reject the string if its bytes do not constitute valid UTF-8
constexpr DWORD kFlags = MB_ERR_INVALID_CHARS;
// Compute how many 16-bit code units are needed to store this string:
const int nCodeUnits = ::MultiByteToWideChar(
CP_UTF8, // Source string is in UTF-8
kFlags, // Conversion flags
utf8.data(), // Source UTF-8 string pointer
utf8.length(), // Length of the source UTF-8 string, in bytes
nullptr, // Unused - no conversion done in this step
0 // Request size of destination buffer, in wchar_ts
);
// Invalid UTF-8 detected? Return empty string:
if (!nCodeUnits)
return utf16;
// Allocate space for the UTF-16 code units:
utf16.resize(nCodeUnits);
// Convert from UTF-8 to UTF-16
int result = ::MultiByteToWideChar(
CP_UTF8, // Source string is in UTF-8
kFlags, // Conversion flags
utf8.data(), // Source UTF-8 string pointer
utf8.length(), // Length of source UTF-8 string, in bytes
&utf16[0], // Pointer to destination buffer
nCodeUnits // Size of destination buffer, in code units
);
// Conversion failed? Return empty string:
if (!result)
utf16.clear();
return utf16;
}
int main()
{
// Based on this article:
// https://blogs.msmvps.com/gdicanio/2017/08/22/printing-utf-8-text-to-the-windows-console/
_setmode(_fileno(stdout), _O_U16TEXT);
fstream fin("chineseVocab.txt");
string line;
while (getline(fin, line))
wcout << utf16FromUtf8(line) << endl;
return 0;
}
In my terminal, it mostly looks OK after I change the font to MS Gothic:
Some characters are still messed up, but that's due to the font not supporting them.
#include <iostream>
#include <fstream>
using namespace std;
struct example
{
int num1;
char abc[10];
}obj;
int main ()
{
ofstream myfile1 , myfile2;
myfile1.open ("example1.txt");
myfile2.open ("example2.txt");
myfile1 << obj.num1<<obj.abc; //instruction 1
myfile2.write((char*)&obj, sizeof(obj)); //instruction 2
myfile1.close();
myfile2.close();
return 0;
}
In this example, will both files contain identical data or not? Are instruction 1 and instruction 2 the same?
There's a massive difference.
Approach 1) writes the number using ASCII encoding, so there's an ASCII-encoded byte for each digit in the number. For example, the number 28 is encoded as one byte containing ASCII '2' (value 50 decimal, 32 hex) and another for '8' (56 / 0x38). If you look at the file in a program like less you'll be able to see the 2 and the 8 in there as human-readable text. Then << obj.abc writes the characters in abc up until (but excluding) the first NUL (0-value byte): if there's no NUL you run off the end of the buffer and have undefined behaviour: your program may or may not crash, it may print nothing or garbage, all bets are off. If your file is in text mode, it might translate any newline and/or carriage return characters in abc to some other standard representation of line breaks your operating system uses (e.g. it might automatically place a carriage return before every newline you write).
Approach 2) writes the sizeof(obj) bytes in memory: that's a constant number of bytes regardless of their content. The number will be stored in binary, so a program like less won't show you the human-readable number from num1.
Depending on the way your CPU stores numbers in memory, you might have the bytes in the number stored in different orders in the file (something called endianness). There'll then always be 10 characters from abc even if there's a NUL in there somewhere. Writing out binary blocks like this is normally substantially faster than converting numbers to ASCII text and the computer having to worry about if/where there are NULs. Not that you normally have to care, but not all the bytes written necessarily contribute to the logical value of obj: some may be padding.
A more subtle difference is that for approach 1) there are ostensibly multiple object states that could produce the same output. Consider {123, "45"} and {12345, ""} -> either way you'd print "12345". So, you couldn't later open and read from the file and be sure to set num1 and abc to what they used to be. I say "ostensibly" above because you might happen to have some knowledge we don't, such as that the abc field will always start with a letter. Another problem is knowing where abc finishes, as its length can vary. If these issues are relevant to your actual use (e.g. abc could start with a digit), you could for example write << obj.num1 << ' ' << obj.abc << '\n' so the space and newline would tell you where the fields end (assuming abc won't contain newlines: if it could, consider another delimiter character or an escaping/quoting convention). With the space/newline delimiters, you can read the file back by changing the type of abc to std::string to protect against overruns by corrupt or tampered-with files, then using if (inputStream >> obj.num1 && getline(inputStream, obj.abc)) ...process obj.... getline can cope with embedded spaces and will read until a newline.
Example: {258, "hello\0\0\0\0\0"} on a big-endian system where int is 32 bits (sizeof(int) is 4) and the structure's padded out to 16 bytes would print (offsets and byte values shown in hexadecimal):
bytes in file at offset...
            00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f
approach 1) 32 35 38 68 65 6c 6c 6f
            '2' '5' '8' 'h' 'e' 'l' 'l' 'o'
approach 2) 00 00 01 02 68 65 6c 6c 6f 00 00 00 00 00 00 00
            [-32 bit 258-] 'h' 'e' 'l' 'l' 'o''\0''\0''\0''\0''\0' pad pad
Notes: for approach 2, 00 00 01 02 encodes 100000010 binary which is 258 decimal; a little-endian system would store the same number as 02 01 00 00. (Search for "binary encoding" to learn more about this).
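To make the byte-order point concrete, here is a minimal sketch that splits a 32-bit value into bytes in an explicitly chosen order instead of relying on whatever the CPU uses in memory. The helper names toBigEndian and toLittleEndian are invented for this example.

```cpp
#include <array>
#include <cstdint>

// Hypothetical helper: bytes of `value`, most significant byte first.
std::array<std::uint8_t, 4> toBigEndian(std::uint32_t value)
{
    return {{
        static_cast<std::uint8_t>(value >> 24),
        static_cast<std::uint8_t>(value >> 16),
        static_cast<std::uint8_t>(value >> 8),
        static_cast<std::uint8_t>(value)
    }};
}

// Hypothetical helper: bytes of `value`, least significant byte first.
std::array<std::uint8_t, 4> toLittleEndian(std::uint32_t value)
{
    return {{
        static_cast<std::uint8_t>(value),
        static_cast<std::uint8_t>(value >> 8),
        static_cast<std::uint8_t>(value >> 16),
        static_cast<std::uint8_t>(value >> 24)
    }};
}
```

For 258 (0x00000102) the first helper yields 00 00 01 02 and the second yields 02 01 00 00. Writing one of these arrays to the file, rather than the raw memory of the int, gives the same file contents on every platform.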
I would like to ask if there is a way in C++ to turn a .txt file containing hex digits, for example
0E 1F BA 0E 00 B4 09 CD 21 B8 01 4C CD 21 54 68 69 73 20 70 72
into a new .txt file that looks like this:
\x0E\x1F\xBA\x0E\x00\xB4\x09\xCD\x21\xB8\x01\x4C\xCD\x21\x54\x68\x69\x73\x20\x70\x72"
I searched Google for an answer but found nothing, so I tried writing a program in C++, but it does not compile and gives the error message "24 11 \x used with no following hex digits":
#include <iostream>
#include <fstream>
#include<vector>
using namespace std;
int main()
{
string hexaEnter;
ifstream read;
ofstream write;
write.open ("newhexa.txt",std::ios_base::app);
read.open("hexa.txt");
while (!read.eof() )
{
read >> hexaEnter;
write << "\x" + hexaEnter;
}
write.close();
read.close();
system("pause");
return 1;
}
write << "\x" + hexaEnter;
// ^^
Here, C++ sees the beginning of a hex escape sequence, like \x0E or \x1F, but it can't find the actual hex values because you didn't provide any.
That's because what you intended to do was literally write the character \ and the character x, so escape the backslash to make that happen:
write << "\\x" + hexaEnter;
// ^^^
As an aside, your loop condition is wrong.
The HMACSHA1 code below works for converting "Password" and "Message" to AFF791FA574D564C83F6456CC198CBD316949DC9, as evidenced by http://buchananweb.co.uk/security01.aspx.
My question is: is it possible to have:
BYTE HMAC[] = {0x50,0x61,0x73,0x73,0x77,0x6F,0x72,0x64};
BYTE data2[] = {0x4D,0x65,0x73,0x73,0x61,0x67,0x65};
And still get the same value: AFF791FA574D564C83F6456CC198CBD316949DC9.
For example, if I was on a server and received the packet:
[HEADER] 08 50 61 73 73 77 6F 72 64 00
[HEADER] 07 4D 65 73 73 61 67 65 00
Say I rip 50 61 73 73 77 6F 72 64 and 4D 65 73 73 61 67 65 from the packet and use them for my HMACSHA1. How would I go about doing that to get the correct HMACSHA1 value?
BYTE HMAC[] = "Password";
BYTE data2[] = "Message";
//BYTE HMAC[] = {0x50,0x61,0x73,0x73,0x77,0x6F,0x72,0x64};
//BYTE data2[] = {0x4D,0x65,0x73,0x73,0x61,0x67,0x65};
HMAC_CTX ctx;
result = (unsigned char*) malloc(sizeof(char) * result_len);
ENGINE_load_builtin_engines();
ENGINE_register_all_complete();
HMAC_CTX_init(&ctx);
HMAC_Init_ex(&ctx, HMAC, strlen((const char*)HMAC), EVP_sha1(), NULL);
HMAC_Update(&ctx, data2, strlen((const char*)(data2)));
HMAC_Final(&ctx, result, &result_len);
HMAC_CTX_cleanup(&ctx);
std::cout << "\n\n";
for(int i=0;i<result_len;i++)
std::cout << setfill('0') << setw(2) << hex << (int)result[i];
int asd;
std::cin >> asd;
// AFF791FA574D564C83F6456CC198CBD316949DC9
EDIT:
It works by doing this:
BYTE HMAC[] = {0x50,0x61,0x73,0x73,0x77,0x6F,0x72,0x64, 0x00};
BYTE data2[] = {0x4D,0x65,0x73,0x73,0x61,0x67,0x65, 0x00};
By adding 0x00, at the end. But, my question is more towards ripping it from data, and using it... would it still be fine?
The issue is the relationship between arrays, strings, and the null char.
When you declare "Password", the compiler logically treats the string literal as a nine byte array, {0x50,0x61,0x73,0x73,0x77,0x6F,0x72,0x64, 0x00}. When you call strlen, it will count the number of bytes until it encounters the first 0x00. strlen("Password") will return 8 even though there are technically nine characters in the array of characters.
So when you declare an array of 8 bytes as follows without a trailing null byte:
BYTE HMAC[] = {0x50,0x61,0x73,0x73,0x77,0x6F,0x72,0x64};
The problem is that strlen((const char*)HMAC) will count at least 8 bytes, and keep counting while traversing memory outside the array until it finally (if ever) hits a byte that is zero. At best, you might get lucky because the stack memory happens to have a zero byte right after your array. More likely it will return a value much larger than 8. Maybe it will crash.
So when you parse the HMAC and MESSAGE fields from your protocol packet, count the number of bytes actually parsed (not including the terminating null), and pass that count into the HMAC functions to indicate the size of your data.
I don't know your protocol code, but I hope you aren't using strlen to parse the packet to figure out where the string inside the packet ends. A clever attacker could inject a packet with no null terminator and cause your code to do bad things. I hope you are parsing securely and carefully. Typical protocol code doesn't include the null terminating byte in the strings packed inside. Usually the "length" is encoded as an integer field followed by the string bytes. That makes it easier to parse and to determine whether the length would exceed the size of the packet read in.
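The difference between the two declarations can be checked directly with sizeof, which knows the real array size and does not depend on a terminator. This is only an illustrative sketch; the names kLiteral, kRawBytes and literalLength are made up here.

```cpp
#include <cstddef>
#include <cstring>

// A string literal carries an implicit trailing 0x00 ...
const unsigned char kLiteral[] = "Password";
// ... while a brace-initialized array holds exactly the bytes listed.
const unsigned char kRawBytes[] = {0x50, 0x61, 0x73, 0x73, 0x77, 0x6F, 0x72, 0x64};

// Length of the literal as strlen sees it: counting stops at the terminator.
std::size_t literalLength()
{
    return std::strlen(reinterpret_cast<const char*>(kLiteral));
}
```

Here sizeof(kLiteral) is 9 and sizeof(kRawBytes) is 8, while literalLength() returns 8. For the raw array, passing sizeof(kRawBytes) as the length avoids calling strlen on data that has no terminator.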
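A bounds-checked parse along those lines might look like the sketch below. The field layout is an assumption read off the packet dump in the question (one length byte, then that many payload bytes, then a 0x00 terminator), and parseField is a hypothetical helper, not part of any real protocol library.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical field layout, assumed from the question's packet dump:
// [length byte][payload bytes][0x00].
// Returns true and fills `payload` only if the whole field fits in the packet.
bool parseField(const std::vector<std::uint8_t>& packet,
                std::size_t offset,
                std::vector<std::uint8_t>& payload)
{
    if (offset >= packet.size())
        return false;  // no room for the length byte
    const std::size_t length = packet[offset];
    const std::size_t begin = offset + 1;
    // Need `length` payload bytes plus the terminator after `begin`.
    if (packet.size() - begin < length + 1)
        return false;
    if (packet[begin + length] != 0x00)
        return false;  // terminator missing where the layout expects it
    payload.assign(packet.data() + begin, packet.data() + begin + length);
    return true;
}
```

The parsed bytes and their count (payload.data() and payload.size()) can then be passed to HMAC_Init_ex and HMAC_Update in place of the strlen calls, so no terminator is needed in the key or message buffers.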