C/C++: How to convert 6-bit ASCII to 7-bit ASCII

I have a set of 6 bits that represent a 7-bit ASCII character. How can I get the correct 7-bit ASCII code out of the 6 bits I have? Just append a zero and do a bitwise OR?
Thanks for your help.
Lennart

ASCII is inherently a 7-bit character set, so what you have is not "6-bit ASCII". What characters make up your character set? The simplest decoding approach is probably something like:
char From6Bit( char c6 ) {
    // array of all 64 characters that appear in your 6-bit set
    static const char SixBitSet[] = { 'A', 'B', ... };
    return SixBitSet[ c6 ];
}
A footnote: 6-bit character sets were quite popular on old DEC hardware, some of which, like the DEC-10, had a 36-bit architecture where 6-bit characters made some sense.

You must tell us how your 6-bit character set looks; I don't think there is a standard.
The easiest way to do the reverse mapping would probably be to just use a lookup table, like so:
static const char sixToSeven[] = { ' ', 'A', 'B', ... };
This assumes that space is encoded as (binary) 000000, capital A as 000001, and so on.
You index into sixToSeven with one of your six-bit characters and get the corresponding 7-bit character back.

I can't imagine why you'd be getting old DEC-10/20 SIXBIT, but if that's what it is, then just add 32 (decimal). SIXBIT took the ASCII characters starting with space (32), so just add 32 to the SIXBIT character to get the ASCII character.
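If it really is DEC SIXBIT, the decoding is just that offset; a minimal sketch (the function name is mine):

```cpp
#include <cassert>

// DEC SIXBIT maps ASCII 32..95 (space through '_') down to 0..63,
// so decoding is a single offset of 32.
inline char sixbitToAscii(unsigned char c6) {
    return static_cast<char>(c6 + 32);
}
```

For example, SIXBIT 0 decodes to a space and SIXBIT 33 decodes to 'A'.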

The only recent 6-bit code I'm aware of is base64. This uses four 6-bit printable characters to store three 8-bit values (6x4 = 8x3 = 24 bits).
The 6-bit values are drawn from the characters:
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/
which are the values 0 thru 63. Four of these (say UGF4) are used to represent three 8-bit values.
UGF4 = 010100 000110 000101 111000
= 01010000 01100001 01111000
= Pax
If this is how your data is encoded, there are plenty of snippets around that will tell you how to decode it (and many languages have the encoder and decoder built in, or in an included library). Wikipedia has a good article for it here.
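As an illustration only (not a production decoder), here is a sketch that decodes one 4-character base64 group into 3 bytes, matching the worked example above:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Decode a single 4-character base64 group into 3 bytes.
// No handling of '=' padding or whitespace; a real decoder needs both.
std::vector<uint8_t> decodeBase64Group(const std::string& group) {
    static const std::string table =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    uint32_t bits = 0;
    for (char c : group)                 // pack four 6-bit values into 24 bits
        bits = (bits << 6) | static_cast<uint32_t>(table.find(c));
    return { static_cast<uint8_t>(bits >> 16),
             static_cast<uint8_t>(bits >> 8),
             static_cast<uint8_t>(bits) };
}
```

Decoding "UGF4" this way yields the three bytes 'P', 'a', 'x', as in the example.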
If it's not base64, then you'll need to find out the encoding scheme. Some older schemes used shift-in/shift-out (SI/SO) codes for choosing a page within character sets, but I think that was more for selecting extended (e.g., Japanese DBCS) characters than for normal ASCII characters.

If I were to give you the value of a single bit, and I claimed it was taken from Windows XP, could you reconstruct the entire OS?
You can't. You've lost information. There is no way to reconstruct that, unless you have some knowledge about what was lost. If you know that, say, the most significant bit was chopped off, then you can set that to zero, and you've reconstructed at least half the characters correctly.
If you know how 'a' and 'z' are represented in your 6-bit encoding, you might be able to guess at what was removed by comparing them to their 7-bit representations.

Loop through Unicode string as character

With the following string, the size is incorrectly output. Why is this, and how can I fix it?
string str = " ██████";
cout << str.size();
// outputs 19 rather than 7
I'm trying to loop through str character by character so I can read it into a vector<string>, which should have a size of 7, but I can't do this since the above code outputs 19.
TL;DR
The size() and length() members of basic_string return the size in code units of the underlying string type, not the number of visible characters. To get the expected number:
Use UTF-16 with the u prefix for very simple strings that contain no non-BMP characters, no combining characters, and no joining characters
Use UTF-32 with the U prefix for very simple strings that don't contain any combining or joining characters
Normalize the string and count for arbitrary Unicode strings
" ██████" is a space followed by a series of 6 U+2588 characters. Your compiler seems to be using UTF-8 for std::string. UTF-8 is a variable-length encoding and many letters are encoded using multiple bytes (because obviously you can't encode more than 256 characters with just one byte). In UTF-8 code points between U+0800 and U+FFFF are encoded by 3 bytes. Therefore the length of the the string in UTF-8 is 1 + 6*3 = 19 bytes.
You can check with any Unicode converter like this one and see that the string is encoded as 20 E2 96 88 E2 96 88 E2 96 88 E2 96 88 E2 96 88 E2 96 88 in UTF-8; you can also loop through each byte of your string to check.
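Since every UTF-8 continuation byte has the form 10xxxxxx, you can count code points (not visible characters) by skipping those bytes; a sketch:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Count UTF-8 code points by counting the bytes that are NOT
// continuation bytes (continuation bytes look like 10xxxxxx).
// Note: this counts code points, not grapheme clusters.
std::size_t countCodePoints(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char b : utf8)
        if ((b & 0xC0) != 0x80) ++n;
    return n;
}
```

For the string above (one space plus six 3-byte U+2588 sequences) this yields 7.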
If you want the total number of visible characters in the string then it's a lot trickier, and churill's solution doesn't work. Read the example from Twitter:
If you use anything beyond the most basic letters, numbers, and punctuation the situation gets more confusing. While many people use multi-byte Kanji characters to exemplify these issues, Twitter has found that accented vowels cause the most confusion because English speakers simply expect them to work. Take the following example: the word “café”. It turns out there are two byte sequences that look exactly the same, but use a different number of bytes:
café 0x63 0x61 0x66 0xC3 0xA9 Using the “é” character, called the “composed character”.
café 0x63 0x61 0x66 0x65 0xCC 0x81 Using the combining diacritical, which overlaps the “e”
You need a Unicode library like ICU to normalize the string and count. Twitter for example uses Normalization Form C
Edit:
Since you're only interested in box-drawing characters, which don't lie outside the BMP and don't involve any combining characters, UTF-16 and UTF-32 will both work. Like std::string, std::wstring is also a basic_string and doesn't have a mandatory encoding. In most implementations it's either UTF-16 (Windows) or UTF-32 (*nix), so you may use it, but that's unreliable and depends on the source code encoding. The better way is to use std::u16string (std::basic_string<char16_t>) and std::u32string (std::basic_string<char32_t>). They'll work regardless of the system and the encoding of the source file:
std::wstring wstr = L" ██████";
std::u16string u16str = u" ██████";
std::u32string u32str = U" ██████";
std::cout << wstr.size(); // may work, returns the number of wchar_t units
std::cout << u16str.size(); // always returns the number of UTF-16 code units
std::cout << u32str.size(); // always returns the number of UTF-32 code units
If you're interested in how to handle this for arbitrary Unicode characters, continue reading below.
The “café” issue mentioned above raises the question of how you count the characters in the Tweet string “café”. To the human eye the length is clearly four characters. Depending on how the data is represented this could be either five or six UTF-8 bytes. Twitter does not want to penalize a user for the fact we use UTF-8 or for the fact that the API client in question used the longer representation. Therefore, Twitter does count “café” as four characters, no matter which representation is sent.
[...]
Twitter counts the length of a Tweet using the Normalization Form C (NFC) version of the text. This type of normalization favors the use of a fully combined character (0xC3 0xA9 from the café example) over the long-form version (0x65 0xCC 0x81). Twitter also counts the number of codepoints in the text rather than UTF-8 bytes. The 0xC3 0xA9 from the café example is one codepoint (U+00E9) that is encoded as two bytes in UTF-8, whereas 0x65 0xCC 0x81 is two codepoints encoded as three bytes
Twitter - Counting characters
See also
When "Zoë" !== "Zoë". Or why you need to normalize Unicode strings
Getting Twitter characters count
Why is the length of this string longer than the number of characters in it?
std::string contains only 1-byte chars (usually 8 bits each, holding one UTF-8 code unit); you need wchar_t and std::wstring to achieve what you want:
std::wstring str = L" ██████";
std::cout << str.size();
This prints 7 (one space and 6 Unicode characters). Notice the L before the string literal, so it is interpreted as a wide string.

C++ - A few questions about my textbook's ASCII table

In my c++ textbook, there is an "ASCII Table of Printable Characters."
I noticed a few odd things that I would appreciate some clarification on:
Why do the values start with 32? I tested a simple program with the following lines of code: char ch = 1; std::cout << ch << "\n"; and nothing printed out. So I am curious as to why the values start at 32.
I noticed the last value, 127, was "Delete." What is this for, and what does it do?
I thought char can store 256 values, why is there only 127? (Please let me know if I have this wrong.)
Thanks in advance!
The printable characters start at 32. Below 32 there are non-printable characters (or control characters), such as BELL, TAB, NEWLINE etc.
DEL (127) is also a non-printable control character; historically it was used to mark a character as deleted.
char can indeed store 256 values, but its signedness is implementation-defined. If you need to store values from 0 to 255 then you need to explicitly specify unsigned char. Similarly, for -128 to 127 you have to specify signed char.
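A small sketch showing the three distinct character types (the exact ranges assume the usual 8-bit char):

```cpp
#include <cassert>
#include <limits>

// unsigned char always covers 0..255 (on 8-bit-byte platforms);
// signed char covers -128..127; plain char behaves like one of the
// two, and which one is implementation-defined.
constexpr int  kUnsignedMax       = std::numeric_limits<unsigned char>::max();
constexpr int  kSignedMin         = std::numeric_limits<signed char>::min();
constexpr bool kPlainCharIsSigned = std::numeric_limits<char>::is_signed;
```

kPlainCharIsSigned is true on some compilers (e.g. typical x86 targets) and false on others, which is exactly why you should be explicit when the range matters.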
EDIT
The so called extended ASCII characters with codes >127 are not part of the ASCII standard. Their representation depends on the so called "code page" chosen by the operating system. For example, MS-DOS used to use such extended ASCII characters for drawing directory trees, window borders etc. If you changed the code page, you could have also used to display non-English characters etc.
It's a mapping between integers and characters, plus "control characters" like line feed and carriage return that are interpreted by display devices (possibly virtual). As such the assignment is arbitrary, but the codes are organized by their binary values.
32 is a power of 2, and the printable characters start there.
Delete is the signal from your keyboard delete key.
At the time the code was designed only 7 bits were standard; not all bytes (or parts of words) were 8 bits.

How to find whether byte read is japanese or english?

I have an array which contains Japanese and ASCII characters.
I am trying to find whether a character read is an English character or a Japanese character.
In order to solve this I proceeded as follows:
Read the first byte; if the multi-character width is not equal to one, move the pointer to the next byte.
Then display the two bytes together and report that a Japanese character has been read.
If the multi-character width is equal to one, display the byte and report that an English character has been read.
The above algorithm works fine but fails for the halfwidth forms of Japanese, e.g. シ, ァ, etc., as they are only one byte.
How can I find out whether the characters are Japanese or English?
**Note:** What I tried
I read on the web that the first byte tells whether a character is Japanese or not, which I covered in step 1 of my algorithm, but it doesn't work for halfwidth forms.
EDIT:
In the problem I was solving, I include a control character 0x80 at the start and end of my characters to identify the string of characters:
cntlchar.....(my characters, can be Japanese).....cntlchar
I wrote the following to identify the end control character:
if ((buf[*p+1] & 0x80) && (mbMBCS_charWidth(&buf[*p]) == 1))
    // end of control characters reached
else
    // *p++
It worked fine for English but didn't work for Japanese halfwidth forms.
How can I handle this?
Your data must be using Windows Codepage 932. That is a guess, but examining the codepoints shows what you are describing.
The codepage shows that characters in the range 00 to 7F are "English" (a better description is "7-bit ASCII"), the characters in the ranges 81 to 9F and E0 to FF are the first byte of a multibyte code, and everything between A1 and DF are half-width Kana characters.
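Under that guess, a byte-classification sketch following the ranges described above (the answer's 0xE0..0xFF upper range is used as stated; actual CP932 lead bytes stop at 0xFC):

```cpp
#include <cassert>

// Classify one byte of (assumed) Windows codepage 932 text,
// following the ranges described in the answer above.
enum class Cp932Byte { Ascii, MultibyteLead, HalfwidthKana, Other };

Cp932Byte classifyCp932(unsigned char b) {
    if (b <= 0x7F)                              return Cp932Byte::Ascii;
    if ((b >= 0x81 && b <= 0x9F) || b >= 0xE0)  return Cp932Byte::MultibyteLead;
    if (b >= 0xA1 && b <= 0xDF)                 return Cp932Byte::HalfwidthKana;
    return Cp932Byte::Other;  // 0x80 and 0xA0 fall outside the listed ranges
}
```

Note that classifying a lead byte only tells you a multibyte sequence starts there; you still need to consume its trail byte before examining the next character.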
For individual bytes this is impractical to impossible. For larger sets of data you could do statistical analysis on the bytes and see if it matches known English or Japanese patterns. For example, vowels are very common in English text but different Japanese letters would have similar frequency patterns.
Things get more complicated than testing bits if your data includes accented characters.
If you're dealing with Shift-JIS data and Windows-1252 encoded text, ideally you just remap it to UTF-8. There's no standard way to identify text encoding within a text file, although things like MIME can help if added on externally as metadata.

UTF 8 encoded Japanese string in XML

I am trying to create a SOAP call with a Japanese string. The problem I faced is that when I encode this string as UTF-8, it has many control characters in it (e.g. 0x1B (Esc)). If I remove all such control characters to make it a valid SOAP call, then the Japanese content appears as garbage on the server side.
How can I create a valid SOAP request for Japanese characters? Any suggestion is highly appreciated.
I am using C++ with MS-DOM.
With Best Regards.
If I remember correctly, the first 32 Unicode code points (apart from tab, CR, and LF) are not allowed as characters in XML documents, even escaped with &#. Not sure whether they're allowed in HTML or not, but certainly the server thinks they're not allowed in your requests, and it gets the only meaningful vote.
I notice that your document claims to be encoded in iso-2022-jp, not utf-8. And indeed, the sequence of characters ESC $ B that appears in your document is valid iso-2022-jp. It indicates that the data is switching encodings (from ASCII to a 2-byte Japanese encoding called JIS X 0208-1983).
But somewhere in the process of constructing your request, something has seen that 0x1B byte and interpreted it as a character U+001B, not realising that it's intended as one byte in data that's already encoded in the document encoding. So, it has XML-escaped it as a "best effort", even though that's not valid XML.
Probably, whatever is serializing your XML document doesn't know that the encoding is supposed to be iso-2022-jp. I imagine it thinks it's supposed to be serializing the document as ASCII, ISO-Latin-1, or UTF-8, and the <meta> element means nothing to it (that's an HTML way of specifying the encoding anyway, it has no particular significance in XML). But I don't know MS-DOM, so I don't know how to correct that.
If you just remove the ESC characters from iso-2022-jp data, then you conceal the fact that the data has switched encodings, and so the decoder will continue to interpret all that 7nMK stuff as ASCII, when it's supposed to be interpreted as JIS X 0208-1983. Hence, garbage.
Something else strange -- the iso-2022-jp code to switch back to ASCII is ESC ( B, but I see |(B</font> in your data, when I'd expect the same thing to happen to the second ESC character as happened to the first: &#0x1B(B</font>. Similarly, $B#M#S(B and $BL#D+(B are mangled attempts to switch from ASCII to JIS X 0208-1983 and back, and again the ESC characters have just disappeared rather than being escaped.
I have no explanation for why some ESC characters have disappeared and one has been escaped, but it cannot be coincidence that what you generate looks almost, but not quite, like valid iso-2022-jp. iso-2022-jp is a 7-bit encoding, so part of the problem might be that you've taken iso-2022-jp data and run it through a function that converts ISO-Latin-1 (or some other 8-bit encoding whose lower half matches ASCII, for example any Windows code page) to UTF-8. If so, that function leaves 7-bit data unchanged; it won't convert it to UTF-8. Then, when interpreted as UTF-8, the data still has ESC characters in it.
If you want to send the data as UTF-8, then first of all you need to actually convert it out of iso-2022-jp (to wide characters or to UTF-8, whichever your SOAP or XML library expects). Secondly you need to label it as UTF-8, not as iso-2022-jp. Finally you need to serialize the whole document as UTF-8, although as I've said you might already be doing that.
As pointed out by Steve Jessop, it looks like you have encoded the text as iso-2022-jp, not UTF-8. So the first thing to do is to check that and ensure that you have proper UTF-8.
If the problem still persists, consider encoding the text.
The simplest option is "hex encoding" where you just write the hex value of each byte as ASCII digits. e.g. the 0x1B byte becomes "1B", i.e. 0x31, 0x42.
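A minimal hex-encoding sketch (function name is mine):

```cpp
#include <cassert>
#include <string>

// Hex-encode each byte as two ASCII digits, e.g. byte 0x1B -> "1B".
std::string hexEncode(const std::string& data) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char b : data) {
        out += digits[b >> 4];   // high nibble
        out += digits[b & 0x0F]; // low nibble
    }
    return out;
}
```

This doubles the payload size, which is why base64 (4 output chars per 3 input bytes) is usually preferred when size matters.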
If you want to be fancy you could use MIME or even UUENCODE.

Are there delimiter bytes for UTF8 characters?

If I have a byte array that contains UTF8 content, how would I go about parsing it? Are there delimiter bytes that I can split off to get each character?
Take a look here...
http://en.wikipedia.org/wiki/UTF-8
If you're looking to identify the boundary between characters, what you need is in the table in "Description".
The only way to get a high bit of zero is the ASCII subset 0..127, encoded in a single byte. All the non-ASCII code points have "10" in the highest two bits of the second byte onwards. The leading byte of a code point never has that; its high bits indicate the number of bytes, but there's some redundancy: you could equally watch for the next byte that doesn't have the "10" to find the start of the next code point.
0xxxxxxx : ASCII
10xxxxxx : 2nd, 3rd or 4th byte of code
11xxxxxx : 1st byte of code, further high bits indicating number of bytes
A codepoint in unicode isn't necessarily the same as a character. There are modifier codepoints (such as accents), for instance.
Bytes that have their first (most significant) bit set to 0 are plain ASCII characters. Bytes that have their first bit set to 1 are part of a multi-byte UTF-8 character.
The first byte of every multi-byte UTF-8 character also has its second bit set to 1, so that byte has the most significant bits 11. Each following byte that belongs to the same UTF-8 character starts with 10 instead.
The first byte of each UTF-8 character additionally indicates how many of the following bytes belong to the character, depending on the number of bits that are set to 1 in the most significant bits of that byte.
For more details, see the Wikipedia page for UTF-8.
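The rules above can be summarized in a small sketch that derives the sequence length from the first byte:

```cpp
#include <cassert>

// Length in bytes of the UTF-8 sequence starting with 'lead'.
// Returns 0 for a continuation byte (10xxxxxx), which never
// starts a sequence, and for invalid lead bytes.
int utf8SequenceLength(unsigned char lead) {
    if (lead < 0x80)           return 1;  // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;                             // continuation or invalid byte
}
```

To walk a byte array character by character, read the lead byte, advance by the returned length, and repeat; a robust parser should also verify that each skipped byte really has the 10xxxxxx form.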