Unsigned byte for 10 conflicts with newline - c++

Is there a way to differentiate the first value (which is a number 10 saved as an unsigned char) from the newline character in the following demo code?
#include <iostream>

int main() {
    unsigned char ch1(10), ch2('\n');
    std::cout << (int)ch1 << " " << (int)ch2 << std::endl;
}
The output is
10 10
I want to write such characters to a file as unsigned bytes, but I also want the newline character to be distinguishable from the number 10 when the file is read back later.
Any suggestions?
regards,
Nikhil

There is no way: you write the same byte and preserve no other information.
You need to think of another way of encoding your values, or reserve one value as a sentinel (such as 255 or 0). Of course, you need to be sure that this value never appears in your input.
Another possibility is to use one byte value as a 'special' escape character for your control codes, similar to how '\' gives special meaning to 'n' in '\n'. This makes parsing more complicated, though, since your values may now be one or two bytes long. Unless you are under tight memory pressure, I would advise storing values as their string representation; that is usually more readable.
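For completeness, here is a minimal sketch of the escape-byte idea; the choice of 0x1B as the escape byte, the XOR trick, and the encode/decode names are illustrative assumptions of mine, not from the answer:

#include <cstdint>
#include <string>
#include <vector>

// Reserve 0x1B (ESC) as an escape byte. A data byte equal to ESC or '\n'
// is written as ESC followed by the byte XORed with 0x20, so a literal 10
// in the data never collides with the '\n' used as a record separator.
std::string encode(const std::vector<std::uint8_t>& data) {
    std::string out;
    for (std::uint8_t b : data) {
        if (b == 0x1B || b == '\n') {
            out.push_back(0x1B);
            out.push_back(static_cast<char>(b ^ 0x20));
        } else {
            out.push_back(static_cast<char>(b));
        }
    }
    out.push_back('\n'); // record separator, now unambiguous
    return out;
}

std::vector<std::uint8_t> decode(const std::string& record) {
    std::vector<std::uint8_t> out;
    for (std::size_t i = 0; i < record.size(); ++i) {
        std::uint8_t b = static_cast<std::uint8_t>(record[i]);
        if (b == '\n') break;                      // end of record
        if (b == 0x1B && i + 1 < record.size()) {  // escaped data byte
            out.push_back(static_cast<std::uint8_t>(record[++i]) ^ 0x20);
        } else {
            out.push_back(b);
        }
    }
    return out;
}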

No, a char of value 10 is a newline. Take a look at an ASCII table. You'll see that the text "10" would be two separate chars: '1' (49) and '0' (48).

Error in getting ASCII of character in C++

I saw this question : How to convert an ASCII char to its ASCII int value?
The most voted answer (https://stackoverflow.com/a/15999291/14911094) states the solution as :
Just do this:
int(k)
But i am having issues with this.
My code is :
std::cout << char(144) << std::endl;
std::cout << (int)(char(144)) << std::endl;
std::cout << int('É') << std::endl;
Now the output comes as :
É
-112
-55
Now I can understand the first line, but what is happening on the second and the third lines?
Firstly, how can an ASCII value be negative, and secondly, how can it be different for the same character?
Also, as far as I have tested, this is not some random garbage from memory, as it stays the same every time I run the program. Also:
If I change it to 145:
æ
-111
The output also changes by 1, so my guess is that this may be due to some kind of overflow.
But I cannot pin it down exactly, since I am converting to int, and that should be enough (4 bytes) to store the result.
Can anyone suggest a solution?
If your platform is using ASCII for the character encoding (most do these days), then bear in mind that ASCII is only a 7 bit encoding.
It so happens that char is a signed type on your platform. (The signedness or otherwise of char doesn't matter for ASCII as only the first 7 bits are required.)
Hence char(144) gives you a char with a value of -112. (You have a 2's complement char type on your platform: from C++14 you can assume that, but you can't in C).
The third line implies that that character (which is not in the ASCII set) has a value of -55.
int(unsigned char('É'))
would force it to a positive value on all but the most exotic of platforms.
The C++ standard only guarantees that characters in the basic execution character set¹ have non-negative encodings. Characters outside that basic set may have negative encodings; it depends on the locale.
¹ Upper- and lowercase Latin alphabet, decimal digits, most punctuation, and control characters like tab, newline, form feed, etc.
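To make the difference concrete, here is a small check of my own (it assumes a platform where plain char is signed and 8 bits wide):

#include <iostream>

int main() {
    char c = char(144);
    // Promoting the signed char directly keeps the negative value.
    std::cout << int(c) << std::endl;                              // -112 on such a platform
    // Going through unsigned char first yields the non-negative byte value.
    std::cout << int(static_cast<unsigned char>(c)) << std::endl;  // 144
}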

what is the correct way to define a non-alphanumeric char in c++?

Sometimes I need to define a char which represents a non-alphanumeric char.
What is the correct way to define its value in C++?
Is using EOF or char_traits<char>::eof() a good choice?
You're reading too much into the word char.
At the end of the day, it is little more than a size. In this case, 8 bits. Shorts are 16 (and you can wear them on the beach), ints can be 32 or something else, and longs can be 64 (or ints, or a quick conversation with the relevant authorities on the beach as to why you lost both pairs of shorts).
The correct way to define a value in C++ basically comes down to the maximum value that can be held. char_traits<char>::eof() is indeed a good constant, but out of context it means very little.
EOF is not a char value; it's an int value that's returned by some functions to indicate that no valid character data could be obtained. If you're looking for a value to store in a char object, EOF is definitely not a good choice.
If your only requirement is to store some non-alphanumeric value in a char object (and you don't care which), just choose something. Any punctuation character will do.
char example = '*';
char another_example = '?';
char yet_another_example = '\b'; // backspace
This assumes I'm understanding your question correctly. As stated:
Sometimes I need to define a char which represents a non-alphanumeric char.
it's not at all clear what you mean. What exactly do you mean by "represents"? If you're looking for some arbitrary non-alphanumeric character, see above. If you're looking for some arbitrary value that merely indicates that you should have a non-alphanumeric character in some particular place, you can pick anything you like, as long as you use it consistently.
For example, "DD-DD" might be template representing two decimal digits, followed by a hyphen, followed by two more decimal digits -- but only if you establish and follow a convention that says that's what it means.
Please update your question to make it clear what you're asking.
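If it helps, here is a quick way to sanity-check that a chosen character really is non-alphanumeric (a small sketch of my own, not part of the answer above):

#include <cctype>
#include <iostream>

int main() {
    char candidate = '*';
    // std::isalnum expects a value representable as unsigned char (or EOF).
    if (!std::isalnum(static_cast<unsigned char>(candidate))) {
        std::cout << candidate << " is non-alphanumeric" << std::endl;
    }
}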

Extract (first) UTF-8 character from a std::string

I need to use a C++ implementation of PHP's mb_strtoupper function to imitate Wikipedia's behavior.
My problem is that I want to feed only a single UTF-8 character to the function, namely the first one of a std::string.
std::string s("äbcdefg");
mb_strtoupper(s[0]); // this obviously can't work with multi-byte characters
mb_strtoupper('ä'); // works
Is there an efficient way to detect/return only the first UTF-8 character of a string?
In UTF-8, the high bits of the first byte tell you how many subsequent bytes are part of the same code point.
0b0xxxxxxx: this byte is the entire code point
0b10xxxxxx: this byte is a continuation byte - this shouldn't occur at the start of a string
0b110xxxxx: this byte plus the next (which must be a continuation byte) form the code point
0b1110xxxx: this byte plus the next two form the code point
0b11110xxx: this byte plus the next three form the code point
The pattern can be assumed to continue, but I don't think valid UTF-8 ever uses more than four bytes to represent a single code point.
If you write a function that counts the number of leading bits set to 1, then you can use it to figure out where to split the byte sequence in order to isolate the first logical code point, assuming the input is valid UTF-8. If you want to harden against invalid UTF-8, you'd have to write a bit more code.
Another way to do it is to take advantage of the fact that continuation bytes always match the pattern 0b10xxxxxx, so you take the first byte, and then keep taking bytes as long as the next byte matches that pattern.
std::size_t GetFirst(const std::string &text) {
    if (text.empty()) return 0;
    // Count the leading byte plus any continuation bytes (0b10xxxxxx).
    std::size_t length = 1;
    while (length < text.size() &&
           (text[length] & 0b11000000) == 0b10000000) {
        ++length;
    }
    return length;
}
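For example, combined with the GetFirst helper above, extracting the bytes of the first code point might look like this (my own usage sketch; it assumes the source file and the output encoding are both UTF-8):

#include <iostream>
#include <string>

int main() {
    std::string s("äbcdefg");
    std::string first = s.substr(0, GetFirst(s)); // bytes of the first code point
    std::cout << first << std::endl;              // prints "ä"
}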
For many languages, a single code point usually maps to a single character. But what people think of as single characters may be closer to what Unicode calls a grapheme cluster, which is one or more code points that combine to produce a glyph.
In your example, the ä can be represented in different ways: It could be the single code point U+00E4 LATIN SMALL LETTER A WITH DIAERESIS or it could be a combination of U+0061 LATIN SMALL LETTER A and U+0308 COMBINING DIAERESIS. Fortunately, just picking the first code point should work for your goal to capitalize the first letter.
If you really need the first grapheme cluster, you have to look beyond the first code point to see if the next one(s) combine with it. For many languages, it's enough to know which code points are "non-spacing" or "combining" or variant selectors. For some complex scripts (e.g., Hangul?), you might need to turn to this Unicode Consortium technical report.
Library str.h:
#include <iostream>
#include <string>
#include "str.h"

int main() {
    std::string text = "äbcdefg";
    std::string first = str::substr(text, 0, 1); // returns "ä"
    std::cout << first << std::endl;
}

C++ Converting/casting String letters to integer values (ASCII)

I'm currently doing a project where I need to create two data structures that will be used to contain strings. One of them has to be a form of linked list, and I've been advised to separate the words out into separate lists inside of it for each letter of the alphabet. I'm required to think about efficiency, so I have an array of head pointers of size 26, and I want to convert the first character of the word given into an integer so I can use it as the subscript, such as:
//a string called s is passed as a parameter to the function
int i = /*some magic happens here*/ s.substr(0, 1);
currentPointer = heads[i]; //then I start iterating through the list
I've been searching around and all I seem to have found is how to convert number characters that are in strings into integers, not letter characters, and I am wondering how on earth I can get this working without resorting to a huge and ugly set of if statements.
When you set i to the value of the first character, you get its ASCII value.
So i is outside your 0-25 range: see man ascii.
You can reduce it by subtracting the ASCII value of the first letter of the alphabet. (Be careful with the case.)
std::string s("zoro");
int i = s[0];
std::cout << "Ascii value : " << i << " Reduced : " << i - 'a' << std::endl;
This produces the ASCII value 'z' = 122 and 25 for the reduced value, as expected.
I think you are confusing values with representations. "Ten" "10" and "1 1 1 1 1 1 1 1 1 1" are all the same value, just represented differently.
I've been searching around and all I seem to have found is how to convert number characters that are in strings into integers, not letter characters
There's no difference. Characters are always represented by integers anyway. It's just a matter of representation. Just present the value the way you want.
By the way, this is a key concept programmers have to understand. So it's worth spending some time thinking about it.
A classic example of this misunderstanding is a question like "I have a variable i that has some value in decimal. How can I make it store a value in hex?" Of course that makes no sense: it stores values, and hex and decimal are representations. If you have ten cars, you have ten cars, not ten in decimal or hex. If i has the value ten, then the value ten is in i, not a decimal or hex representation of ten.
Of course, when you display the value stored in i, you have to choose how to represent it. You can display it as ten, or 10, or | | | | | | | | | |, or whatever.
And you might have a string that has a representation of the value "ten" in hex, and you might need to assign that value to a variable. That requires converting from a representation to the value it represents.
There are input and output functions that read and write values in various representations.
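For instance, the same value can be written out or parsed in different representations (a small illustration of my own):

#include <iostream>
#include <string>

int main() {
    int i = 10;                                   // the value ten; no representation attached
    std::cout << i << std::endl;                  // "10" (decimal representation)
    std::cout << std::hex << i << std::endl;      // "a"  (hex representation of the same value)

    std::string s = "a";                          // a hex representation stored as text
    int parsed = std::stoi(s, nullptr, 16);       // representation -> value
    std::cout << std::dec << parsed << std::endl; // 10
}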
I suspect you want to convert digits stored in strings as characters to integers, e.g. character '9' to integer 9.
In order to do so:
char c = '9';
int x = c - '0';
That will work regardless of whether you have a computer using ASCII or EBCDIC...
In this case, you don't seem to need atoi or itoa (neither is going to do anything very useful with, for example, J). You just want something like:
int i = tolower(s[0])-'a';
In theory that's not portable: if there's any chance of the code being used on a machine that uses EBCDIC (i.e., an IBM or compatible mainframe), you'll want to use something like 'z'-'a'+1 as the size of your array, since it won't be exactly 26 (EBCDIC includes some other characters inserted between some letters, so the letters are in order but not contiguous).
Probably more importantly, if you want to support languages other than English, things change entirely in a hurry -- you might have a different number of letters than 26, they might not all be contiguous, etc. For such a case your basic design is really the problem. Rather than fixing that one line of code, you probably need to redesign almost completely.
As an aside, however, there's a pretty good chance that a linked list isn't a very good choice here.
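A possible shape for the lookup described above (the function name, the bounds handling, and returning -1 for non-letters are my own assumptions, not part of the answers):

#include <cctype>
#include <string>

// Map a word to a bucket index in the range 0-25 based on its first letter,
// or return -1 if the word is empty or doesn't start with a Latin letter.
int bucketIndex(const std::string& s) {
    if (s.empty()) return -1;
    unsigned char c = static_cast<unsigned char>(s[0]);
    if (!std::isalpha(c)) return -1;
    return std::tolower(c) - 'a'; // assumes a contiguous, ASCII-style encoding
}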

C++ pwrite(), pread() data to a text file

I have to pwrite() characters to a text file with each character representing 1 byte.
Also, I need to write integers to the text file, so 12 has to be one byte also, not 2 bytes (even though two characters).
I am using a char *pointer for the characters and integers, but I am getting stuck since the text file shows jumbled values for the integers (#'s, upside-down ?'s, etc.). For example, when I pwrite() pointer[0] = 105; the 105 becomes 'i' in the text.txt file (and pread() reads it as 'i'). Somehow the 105 is lost in translation.
Any ideas how to pwrite()/pread() correctly?
ofstream file; file.open("text.txt");
char *characters = new char;
characters[0] = 105;
cout << pwrite(3, characters, 1, 0);
Also, the 3 is the filedes, which I guessed :-P Don't know how to actually find it.
The text.txt file then has 'i' in it (ASCII 105, I'm assuming). When I pread() later, how will I know whether it was originally an 'i' or 105?
Breaking this down in chunks:
"I have to pwrite() characters to a text file with each character representing 1 byte"
By definition, each ASCII character is one byte, and you make no mention of needing to write locale-aware multi-byte characters or Unicode derivatives, so on this one you're probably covered.
"Also, I need to write integers to the text file, so 12 has to be one byte also, not 2 bytes (even though two characters)"
You're describing a binary write of your integer data. However, keep in mind that "integers" as a numeric representation can be larger than just a number represented by "one byte". If you want to write an integer that can be represented in a single byte, your options are:
For signed data, values can range from [-128,127]
For unsigned data, values can range from [0, 255]
These are the limitations of an integer value in a single octet.
"I am using char *pointer for the characters and integers, but I am getting stuck since the text fill prints jumbled values for the integers (#'s, upside-down ?'s, etc.)"
The char pointer for characters we covered before, and it will likely be fine. The integers will NOT be. Your resulting file, per your description, will not be a "text" file in the literal sense. It will contain both character data (your char buffers) and binary data (your integers). Please remember that an integer stored in a single byte with the value 0x01 will be just that: a single octet with the lowest bit set. A byte representing the ASCII character '1' will have the value 0x31 (see any ASCII chart), and the value 0xF1 for EBCDIC (don't ask). Using your example, you cannot write the value 12 in a single byte and have it be displayable "text" (character) data in your file. The single-byte integer value 12 will be represented in your file as a single byte with the value 0x0C. Trying to view this as "text" will not work; it is not printable ASCII. In fact, the ASCII value 0x0C is actually a form-feed control character.
Bottom line: if you don't know the difference between ASCII characters and integer bytes, explaining how pwrite() works will do little good but to confuse you more.
"Like when I pwrite() pointer[0] = 105; The 105 translates 'i' in the text.txt file (and pread() reads as 'i') Somehow the 105 is lost in translation"
Refer to an ASCII chart. The byte value 105 is, in fact, the ASCII value of the character 'i'. The 105 isn't lost; it's being displayed as the character it represents.
Finally, pwrite() is a POSIX system call for Linux, BSD, and anyone else that chooses to expose it. It is not part of the C or C++ standards. That said, your first argument to pwrite() should be obtained from a system call, open(). You should never piggyback on a file descriptor assumed to be opened by a different API call unless you go through a supported API to do so. The code in this question does not.
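To tie the pieces together, here is a minimal sketch of the POSIX calls the answer describes, writing and then reading back a single raw byte (the file name and the error handling are simplified, illustrative choices of mine):

#include <fcntl.h>   // open
#include <unistd.h>  // pwrite, pread, close
#include <cstdio>    // perror, printf

int main() {
    // Obtain the file descriptor from open() instead of guessing it.
    int fd = open("data.bin", O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    unsigned char out = 105;                  // the raw byte value 105
    if (pwrite(fd, &out, 1, 0) != 1) { perror("pwrite"); return 1; }

    unsigned char in = 0;
    if (pread(fd, &in, 1, 0) != 1) { perror("pread"); return 1; }

    std::printf("%d\n", in);  // prints 105: the byte is unchanged; a text
                              // viewer merely shows it as the character 'i'
    close(fd);
}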