C++ pwrite(), pread() data to a text file

I have to pwrite() characters to a text file with each character representing 1 byte.
Also, I need to write integers to the text file, so 12 has to be one byte as well, not two bytes (even though it is two characters).
I am using a char *pointer for the characters and integers, but I am getting stuck since the text file prints jumbled values for the integers (#'s, upside-down ?'s, etc.). For example, when I pwrite() pointer[0] = 105;, the 105 shows up as 'i' in the text.txt file (and pread() reads it back as 'i'). Somehow the 105 is lost in translation.
Any ideas how to pwrite()/pread() correctly?
ofstream file;
file.open("text.txt");
char *characters = new char;
characters[0] = 105;
cout << pwrite(3, characters, 1, 0);
Also, the 3 is the filedes, which I just guessed :-P I don't know how to actually find it.
The text.txt file then has 'i' in it (ASCII 105, I'm assuming). When I pread() it later, how will I know whether it was originally an 'i' or a 105?

Breaking this down in chunks:
"I have to pwrite() characters to a text file with each character representing 1 byte"
By definition, each ASCII character is one byte, and you make no mention of needing to write locale-aware multi-byte characters or Unicode derivatives, so on this one you're probably covered.
"Also, I need to write integers to the text file, so 12 has to be one byte also, not 2 bytes (even though two characters)"
You're describing a binary write of your integer data. However, keep in mind that an "integer" as a numeric type can hold values far larger than what fits in one byte. If you want to write an integer that can be represented in a single byte, your options are (see the sketch after this list):
For signed data, values can range from [-128,127]
For unsigned data, values can range from [0, 255]
These are the limitations of an integer value in a single octet.
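A quick way to see those limits on your own machine is the following minimal sketch (it assumes the usual 8-bit char, i.e. CHAR_BIT == 8):
#include <climits>
#include <iostream>

int main() {
    // Range of an integer value that fits in a single byte.
    std::cout << "signed:   " << SCHAR_MIN << " to " << SCHAR_MAX << '\n'  // -128 to 127
              << "unsigned: 0 to " << UCHAR_MAX << '\n';                   // 0 to 255
    return 0;
}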
"I am using char *pointer for the characters and integers, but I am getting stuck since the text fill prints jumbled values for the integers (#'s, upside-down ?'s, etc.)"
The char pointer for characters we covered before, and it will likely be fine. The integers will NOT be. Your resulting file, per your description, will not be a "text" file in the literal sense. It will contain both character data (your char buffers) and binary data (your integers). Please remember that an integer stored in a single byte with the value 0x01 will be just that: a single octet with only the lowest bit set. A byte representing the ASCII character '1' will have the value 0x31 (see any ASCII chart), or 0xF1 in EBCDIC (don't ask). Using your example, you cannot write the value 12 in a single byte and have it be displayable "text" (character) data in your file. The single-byte integer value 12 will be represented in your file as the single byte 0x0C. Trying to view this as "text" will not work; it is not printable ASCII. In fact, the ASCII value 0x0C is the form-feed control character.
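To make the difference concrete, here is a minimal sketch (file I/O omitted) contrasting the single-byte integer 12 with the two-character text "12":
#include <cstdio>

int main() {
    unsigned char binary12 = 12;   // one byte, value 0x0C (the form-feed control character)
    const char text12[] = "12";    // two bytes: 0x31 ('1') and 0x32 ('2')

    std::printf("binary: 0x%02X\n", static_cast<unsigned>(binary12));
    std::printf("text:   0x%02X 0x%02X\n",
                static_cast<unsigned>(static_cast<unsigned char>(text12[0])),
                static_cast<unsigned>(static_cast<unsigned char>(text12[1])));
    return 0;
}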
Bottom line: if you don't know the difference between ASCII characters and integer bytes, explaining how pwrite() works will do little good other than confuse you more.
"Like when I pwrite() pointer[0] = 105; The 105 translates 'i' in the text.txt file (and pread() reads as 'i') Somehow the 105 is lost in translation"
Refer to any ASCII chart. The byte value 105 is, in fact, the ASCII code of the character 'i'. The 105 isn't lost; it's being displayed as the character it represents.
Finally, pwrite() is a POSIX system call available on Linux, BSD, and any other system that chooses to expose it. It is not part of the C or C++ standards. That said, the first argument to pwrite() should be a descriptor obtained from the system call open(). You should never piggyback on a file descriptor you assume was opened by a different API call unless you go through a supported API to do so. The code in this question does not.
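For reference, here is a minimal sketch of the intended round trip using the POSIX calls directly (the file name data.bin is just an example); the descriptor comes from open(), never from a guess:
#include <fcntl.h>    // open
#include <unistd.h>   // pwrite, pread, close
#include <cstdio>     // perror, printf

int main() {
    int fd = open("data.bin", O_RDWR | O_CREAT, 0644);   // obtain the descriptor properly
    if (fd == -1) { perror("open"); return 1; }

    unsigned char out = 105;                              // the byte value 105
    if (pwrite(fd, &out, 1, 0) != 1) { perror("pwrite"); close(fd); return 1; }

    unsigned char in = 0;
    if (pread(fd, &in, 1, 0) != 1) { perror("pread"); close(fd); return 1; }

    // The same byte prints as 105 when treated as a number and as 'i' when treated as a character.
    std::printf("as int: %d, as char: %c\n", in, in);
    close(fd);
    return 0;
}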

Related

How can I force the user/OS to input an ASCII string

This is a follow-up to this question: Is std::string supposed to have only ASCII characters?
I want to build a simple console application that takes input from the user as a set of characters. Those characters include the digits 0-9 and the letters a-z.
I am handling the input assuming it is ASCII. For example, I use something like static_cast<unsigned int>(my_char - '0') to get the numeric value as an unsigned int.
How can I make this code cross-platform? How can I tell that I want the input to always be ASCII? Or have I missed a lot of concepts, and is static_cast<unsigned int>(my_char - '0') just a bad approach?
P.S. In ASCII (at least) the digits are contiguous and in order. I do not know whether that holds in other encodings. (I am pretty sure it does, but there is no guarantee, right?)
How can I force the user/OS to input an ASCII string
You cannot, unless you let the user specify the numeric values of such ASCII input.
It all depends on how the terminal implementation that feeds std::cin translates a keystroke like 0 into a specific number, and on whether your toolchain matches that number with its own translation for '0'.
You simply shouldn't rely on explicit ASCII values (i.e. magic numbers); use char literals instead to get portable code. The assumption that my_char - '0' yields the actual digit's value does hold for every conforming character set. The C++ standard states in [lex.charset]/3 that
The basic execution character set and the basic execution wide-character set shall each contain all the members of the basic source character set, plus control characters representing alert, backspace, and carriage return, plus a null character (respectively, null wide character), whose representation has all zero bits. For each basic execution character set, the values of the members shall be non-negative and distinct from one another. In both the source and execution basic character sets, the value of each character after 0 in the above list of decimal digits shall be one greater than the value of the previous.[...]
(emphasis mine, on the final sentence about the decimal digits)
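A minimal sketch of the portable digit conversion (the variable name my_char is taken from the question; the std::isdigit check is just defensive):
#include <cctype>
#include <iostream>

int main() {
    char my_char = '7';
    if (std::isdigit(static_cast<unsigned char>(my_char))) {
        unsigned int value = static_cast<unsigned int>(my_char - '0');
        std::cout << value << '\n';   // prints 7 on any conforming implementation
    }
    return 0;
}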
You can't force or even verify that beforehand. An "evil user" can always sneak a UTF-8 encoded string into your application with no characters above U+7F, and such a string also happens to be valid ASCII.
Also, whatever platform-specific measures you take, the user can pipe in a UTF-16LE encoded file. Or /dev/urandom.
Your mistake is treating string encoding as some magic property of an input stream, which it is not. It is, well, an encoding, like JPEG or AVI, and must be handled exactly the same way: read the input, match it against the format, and report errors on parsing failure.
For your case, if you want to accept only ASCII, read the input stream byte by byte and throw/exit with an error if you ever encounter a byte outside the ASCII range, as in the sketch below.
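A minimal sketch of that byte-by-byte check, assuming you simply want to reject non-ASCII input (the function name readAsciiOnly is made up for illustration):
#include <cstdio>     // EOF
#include <iostream>
#include <stdexcept>
#include <string>

// Read every byte from the stream; throw as soon as one falls outside the ASCII range 0x00-0x7F.
std::string readAsciiOnly(std::istream& in) {
    std::string result;
    int c;
    while ((c = in.get()) != EOF) {
        if (c > 0x7F)
            throw std::runtime_error("non-ASCII byte in input");
        result.push_back(static_cast<char>(c));
    }
    return result;
}
Typical use would be std::string s = readAsciiOnly(std::cin);.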
However, if you later encounter a terminal providing data in some incompatible encoding, like UTF-16LE, you have no choice but to write detection (based on the byte order mark) and conversion routines.

Represent a buffer efficiently as a Unicode string

I have a random buffer.
I need to encode it into a Unicode string (UTF-16 LE, as used by the Windows wide-char APIs) so it can be used as a PWSTR, for example when calling StringCchPrintfW.
A possible solution could be base64, but in order to make it a Unicode string I would have to add a zero byte after every character, which is space-inefficient.
And if I just print the buffer directly, it might contain '\0', which would terminate the string, or '%', which would affect the formatting (maybe that can be escaped), or other Unicode characters that would prevent it from being used in a format string.
The code that generates the string and eventually parses it will be written in C#, but the buffer will be used from Windows C++, where it goes into a format string and is then written to a file.
Here are two methods I can think of:
The easy one: convert each of your bytes into a UTF-16 wchar_t by adding 0x8000 to its value (i.e. in UTF-16 LE you append a 0x80 byte after each data byte). The efficiency is only 50%, but at least you avoid the base64 conversion, which would lower the efficiency to 37.5%.
The efficient but complicated one: read your data in 15-bit chunks (if your total number of bits is not a multiple of 15, pad with zero bits at the end). Convert each chunk into a UTF-16 character by adding 0x4000 to its value. Then add a final wchar_t of value 0xC000 + n, where n (0 <= n <= 14) is the number of padding bits in the final chunk. In exchange for a much more complicated algorithm, you get a very good efficiency: 93.75%.
Both methods avoid all the perils of embedding binary data in a UTF-16 format string: no null bytes, no '%' characters, no surrogate pairs, only printable characters (most of which are Chinese ideograms).
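A minimal sketch of the first (easy) method, assuming a platform where wchar_t is 16 bits, as on Windows (the function names encodeBuffer/decodeBuffer are made up; the 15-bit variant follows the same idea with more bookkeeping):
#include <cstdint>
#include <string>
#include <vector>

// Map each data byte b to the code unit 0x8000 + b. The results are printable BMP characters
// and are never 0x0000, '%', or surrogates, so the string is safe to pass to wide format APIs.
std::wstring encodeBuffer(const std::vector<std::uint8_t>& data) {
    std::wstring out;
    out.reserve(data.size());
    for (std::uint8_t b : data)
        out.push_back(static_cast<wchar_t>(0x8000 + b));
    return out;
}

std::vector<std::uint8_t> decodeBuffer(const std::wstring& text) {
    std::vector<std::uint8_t> out;
    out.reserve(text.size());
    for (wchar_t w : text)
        out.push_back(static_cast<std::uint8_t>(w - 0x8000));
    return out;
}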

Split a UTF-8 encoded string on blank characters without knowing about UTF-8 encoding

I would like to split a string at every blank character (' ', '\n', '\r', '\t', '\v', '\f')
The string is stored UTF-8 encoded in a byte array (a char*, vector, or string, for instance).
Can I just split the byte array at each splitting character? Said otherwise, am I sure that the byte values corresponding to these characters cannot occur inside a multi-byte character? Looking at the UTF-8 spec, it seems all bytes of multi-byte characters have values of 128 or above.
Thanks
Yes, you can.
Multibyte sequences necessarily consist of one lead byte (its two MSBs equal to 11) and one or more continuation bytes (two MSBs equal to 10). The total length of the multibyte sequence (lead byte + continuation bytes) is equal to the number of leading 1 bits in the lead byte, before the first 0 bit appears (e.g. if the lead byte is 110xxxxx, exactly one continuation byte should follow; if it is 11110xxx, there should be exactly three continuation bytes).
So, if you find truncated multi-byte sequences or stray continuation bytes without a lead byte, your string was probably invalid already, and your split procedure won't damage it any further than it already was.
But there is something you might want to note: Unicode introduces other “blank” symbols in the upper, non-ASCII compatible ranges. You might want to treat them accordingly.
If you limit yourself to the set of whitespace characters you mention, the answer is definitely "yes".
Of course, there is always an issue of checking whether your text is valid UTF-8 in the first place...
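A minimal sketch of such a byte-level split (the function name splitOnBlanks is just for illustration); it relies on the fact that every byte of a UTF-8 multi-byte sequence has its high bit set and so can never collide with the ASCII whitespace bytes:
#include <string>
#include <vector>

std::vector<std::string> splitOnBlanks(const std::string& utf8) {
    std::vector<std::string> tokens;
    std::string current;
    for (unsigned char byte : utf8) {
        // Only plain ASCII bytes can be whitespace; bytes >= 0x80 belong to multi-byte characters.
        bool blank = byte == ' ' || byte == '\n' || byte == '\r' ||
                     byte == '\t' || byte == '\v' || byte == '\f';
        if (blank) {
            if (!current.empty()) { tokens.push_back(current); current.clear(); }
        } else {
            current.push_back(static_cast<char>(byte));
        }
    }
    if (!current.empty()) tokens.push_back(current);
    return tokens;
}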

Unsigned byte for 10 conflicts with newline

Is there a way to differentiate the first value (which is a number 10 saved as an unsigned char) from the newline character in the following demo code?
#include <iostream>

int main() {
    unsigned char ch1(10), ch2('\n');
    std::cout << (int)ch1 << " " << (int)ch2 << std::endl;
}
The output is
10 10
I want to write such characters to a file as unsigned bytes, but I also want the newline character to be distinguishable from the number 10 when it is read back later.
Any suggestions?
regards,
Nikhil
There is no way. You write the same byte, and preserve no other information.
You need to think of another way of encoding your values, or reserve one value as a sentinel (like 255 or 0). Of course, you then need to be sure that this value never appears in your input.
Another possibility is to use one byte value as a 'special' escape character for your control codes, similar to how '\' gives special meaning to 'n' in '\n' (see the sketch below). But it makes all parsing more complicated, as your values may now be one or two bytes long. Unless you are under tight memory pressure, I would advise storing values as their string representation; this is usually more readable.
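A minimal sketch of the escape idea (the escape value 0xFF is an arbitrary choice for illustration): a bare 10 in the output always means a newline separator, while a data byte of 10 is written as the escape byte followed by 10.
#include <cstdint>
#include <string>

const std::uint8_t ESC = 0xFF;   // arbitrary escape byte; never emitted unescaped

// Encode one data byte so it can never be mistaken for a record-separating newline.
std::string encodeByte(std::uint8_t value) {
    std::string out;
    if (value == '\n' || value == ESC)
        out.push_back(static_cast<char>(ESC));   // escape values that collide with the markers
    out.push_back(static_cast<char>(value));
    return out;
}
The reader reverses this: on seeing the escape byte it takes the next byte literally, while a bare newline ends the record.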
No, a char with the value 10 is a newline. Take a look at an ASCII table: the text "10" would be two different chars, '1' and '0' (values 49 and 48, respectively).

How to identify the file content as ASCII or binary

How do you identify the file content as being in ASCII or binary using C++?
If a file contains only the decimal bytes 9–13, 32–126, it's probably a pure ASCII text file. Otherwise, it's not. However, it may still be text in another encoding.
If, in addition to the above bytes, the file contains only the decimal bytes 128–255, it's probably a text file in an 8-bit or variable-length ASCII-based encoding such as ISO-8859-1, UTF-8 or ASCII+Big5. If not, for some purposes you may be able to stop here and consider the file to be binary. However, it may still be text in a 16- or 32-bit encoding.
If a file doesn't meet the above constraints, examine the first 2–4 bytes of the file for a byte-order mark (a code sketch of this step follows at the end of this answer):
If the first two bytes are hex FE FF, the file is tentatively UTF-16 BE.
If the first two bytes are hex FF FE, and the following two bytes are not hex 00 00, the file is tentatively UTF-16 LE.
If the first four bytes are hex 00 00 FE FF, the file is tentatively UTF-32 BE.
If the first four bytes are hex FF FE 00 00, the file is tentatively UTF-32 LE.
If, through the above checks, you have determined a tentative encoding, then check only for the corresponding encoding below, to ensure that the file is not a binary file which happens to match a byte-order mark.
If you have not determined a tentative encoding, the file might still be a text file in one of these encodings, since the byte-order mark is not mandatory, so check for all encodings in the following list:
If the file contains only big-endian two-byte words with the decimal values 9–13, 32–126, and 128 or above, the file is probably UTF-16 BE.
If the file contains only little-endian two-byte words with the decimal values 9–13, 32–126, and 128 or above, the file is probably UTF-16 LE.
If the file contains only big-endian four-byte words with the decimal values 9–13, 32–126, and 128 or above, the file is probably UTF-32 BE.
If the file contains only little-endian four-byte words with the decimal values 9–13, 32–126, and 128 or above, the file is probably UTF-32 LE.
If, after all these checks, you still haven't determined an encoding, the file isn't a text file in any ASCII-based encoding I know about, so for most purposes you can probably consider it to be binary (it might still be a text file in a non-ASCII encoding such as EBCDIC, but I suspect that's well outside the scope of your concern).
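A minimal sketch of the byte-order-mark step described above (the function name detectBom is made up; it returns an empty string when no BOM is present):
#include <cstddef>
#include <fstream>
#include <string>

std::string detectBom(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    unsigned char b[4] = {0, 0, 0, 0};
    in.read(reinterpret_cast<char*>(b), 4);
    std::size_t n = static_cast<std::size_t>(in.gcount());

    // Check the longer marks first so that FF FE 00 00 is not misread as UTF-16 LE.
    if (n >= 4 && b[0] == 0x00 && b[1] == 0x00 && b[2] == 0xFE && b[3] == 0xFF) return "UTF-32 BE";
    if (n >= 4 && b[0] == 0xFF && b[1] == 0xFE && b[2] == 0x00 && b[3] == 0x00) return "UTF-32 LE";
    if (n >= 2 && b[0] == 0xFE && b[1] == 0xFF) return "UTF-16 BE";
    if (n >= 2 && b[0] == 0xFF && b[1] == 0xFE) return "UTF-16 LE";
    return "";
}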
You iterate through it using a normal loop with stream.get(), and check whether the byte values you read are <= 127. One of many ways to do it:
#include <cstdio>    // EOF
#include <fstream>

int c;
std::ifstream a("file.txt");
while ((c = a.get()) != EOF && c <= 127)
    ;
if (c == EOF) {
    /* file is all ASCII */
}
However, as someone mentioned, all files are binary files after all. Additionally, it's not clear what you mean by "ASCII". If you mean the character code, then this is indeed the way to go. But if you mean only alphanumeric values, you would need another approach.
My text editor decides based on the presence of null bytes. In practice, that works really well: a binary file with no null bytes is extremely rare.
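A minimal sketch of that heuristic (the name looksBinary is made up): scan for any 0x00 byte and treat the file as binary if one is found.
#include <cstdio>    // EOF
#include <fstream>

bool looksBinary(const char* path) {
    std::ifstream in(path, std::ios::binary);
    int c;
    while ((c = in.get()) != EOF)
        if (c == 0)
            return true;    // a null byte almost always means a binary file
    return false;
}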
The contents of every file is binary. So, knowing nothing else, you can't be sure.
ASCII is a matter of interpretation. If you open a binary file in a text editor, you'll see what I mean.
Most binary files contain a fixed header (per type) you can look for, or you can take the file extension as a hint. You can look for byte order marks if you expect UTF-encoded files, but they are optional as well.
Unless you define your question more closely, there can't be a definitive answer.
Have a look at how the file command works; it has three strategies to determine the type of a file:
filesystem tests
magic number tests
and language tests
Depending on your platform, and the possible files you're interested in, you can look at its implementation, or even invoke it.
If the question is genuinely how to detect just ASCII, then litb's answer is spot on. However, if san was after knowing how to determine whether the file contains text or not, then the issue becomes far more complex. ASCII is just one - increasingly unpopular - way of representing text. The Unicode encodings - UTF-16, UTF-32 and UTF-8 - have grown in popularity. In theory, they can easily be tested for by checking whether the first two bytes are the Unicode byte order mark (BOM) 0xFEFF (or 0xFFFE if the byte order is reversed). However, as those two bytes confuse many file formats on Linux systems, they cannot be guaranteed to be there. Further, a binary file might happen to start with 0xFEFF.
Looking for 0x00 bytes (or other control characters) won't help either if the file is Unicode. If the file is UTF-16, say, and contains English text, then every other byte will be 0x00.
If you know the language that the text file will be written in, then it is possible to analyse the bytes and statistically determine whether it contains text or not. For example, the most common letter in English is E, followed by T. So if the file contains many more E's and T's than Z's and X's, it's likely text (see the sketch below). Of course, it would be necessary to test this as ASCII and as the various Unicode encodings to make sure.
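A very rough sketch of that frequency idea, assuming one-byte characters (the function name and thresholds are arbitrary illustrations, not tuned values):
#include <cctype>
#include <cstddef>
#include <string>

bool looksLikeEnglishText(const std::string& bytes) {
    std::size_t common = 0, rare = 0;
    for (unsigned char c : bytes) {
        int l = std::tolower(c);
        if (l == 'e' || l == 't') ++common;
        if (l == 'z' || l == 'x') ++rare;
    }
    // In English text, 'e' and 't' should dwarf 'z' and 'x'; random binary data tends not to.
    return common > 0 && common > 4 * rare;
}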
If the file isn't written in English - or you want to support multiple languages - then the only two options left are to look at the file extension on Windows and to check the first four bytes against a database of "magic file" codes to determine the file's type and thus whether it contains text or not.
Well, this depends on your definition of ASCII. You can either check for values with ASCII code <128 or for some charset you define (e.g. 'a'-'z','A'-'Z','0'-'9'...) and treat the file as binary if it contains some other characters.
You could also check for regular line breaks (0x0A, or the 0x0D 0x0A pair) to detect text files.
To check, you must open the file as binary. You can't open the file as text. ASCII is effectively a subset of binary.
After that, you must check the byte values. ASCII has byte values 0-127, but 0-31 are control characters. TAB, CR and LF are the only common control characters.
You can't (portably) use 'A' and 'Z'; there's no guarantee those are in ASCII (!).
If you need them, you'll have to define them yourself:
const unsigned char ASCII_A = 0x41; // NOT 'A'
const unsigned char ASCII_Z = ASCII_A + 25;
This question really has no right or wrong answer to it, just complex solutions that will not work for all possible text files.
Here is a link to a The Old New Thing article on how Notepad detects the encoding of a text file. It's not perfect, but it's interesting to see how Microsoft handles it.
GitHub's linguist uses the charlock_holmes library to detect binary files, which in turn uses ICU's charset detection.
The ICU library is available for many programming languages, including C and Java.
#include <cstdio>    // EOF
#include <fstream>
#include <string>

bool checkFileASCIIFormat(const std::string& fileName)
{
    // Open in binary mode and inspect every byte; ASCII codes only go up to 127.
    std::ifstream read(fileName, std::ios::binary);
    int byte;
    while ((byte = read.get()) != EOF) {
        if (byte > 127) {
            return false;
        }
    }
    return true;
}