How to identify the file content as ASCII or binary - c++

How do you identify the file content as being in ASCII or binary using C++?

If a file contains only the decimal bytes 9–13, 32–126, it's probably a pure ASCII text file. Otherwise, it's not. However, it may still be text in another encoding.
If, in addition to the above bytes, the file contains only the decimal bytes 128–255, it's probably a text file in an 8-bit or variable-length ASCII-based encoding such as ISO-8859-1, UTF-8 or ASCII+Big5. If not, for some purposes you may be able to stop here and consider the file to be binary. However, it may still be text in a 16- or 32-bit encoding.
If a file doesn't meet the above constraints, examine the first 2–4 bytes of the file for a byte-order mark:
If the first two bytes are hex FE FF, the file is tentatively UTF-16 BE.
If the first two bytes are hex FF FE, and the following two bytes are not hex 00 00, the file is tentatively UTF-16 LE.
If the first four bytes are hex 00 00 FE FF, the file is tentatively UTF-32 BE.
If the first four bytes are hex FF FE 00 00, the file is tentatively UTF-32 LE.
If, through the above checks, you have determined a tentative encoding, then check only for the corresponding encoding below, to ensure that the file is not a binary file which happens to match a byte-order mark.
If you have not determined a tentative encoding, the file might still be a text file in one of these encodings, since the byte-order mark is not mandatory, so check for all encodings in the following list:
If the file contains only big-endian two-byte words with the decimal values 9–13, 32–126, and 128 or above, the file is probably UTF-16 BE.
If the file contains only little-endian two-byte words with the decimal values 9–13, 32–126, and 128 or above, the file is probably UTF-16 LE.
If the file contains only big-endian four-byte words with the decimal values 9–13, 32–126, and 128 or above, the file is probably UTF-32 BE.
If the file contains only little-endian four-byte words with the decimal values 9–13, 32–126, and 128 or above, the file is probably UTF-32 LE.
If, after all these checks, you still haven't determined an encoding, the file isn't a text file in any ASCII-based encoding I know about, so for most purposes you can probably consider it to be binary (it might still be a text file in a non-ASCII encoding such as EBCDIC, but I suspect that's well outside the scope of your concern).
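As a minimal sketch of the byte-order-mark step described above (assuming the file can simply be opened and its first four bytes read; the follow-up whole-file word scans are omitted):

#include <fstream>
#include <string>

// Returns a tentative encoding name based only on a byte-order mark,
// or an empty string if no BOM is present.
std::string tentativeEncodingFromBOM(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    unsigned char b[4] = {0, 0, 0, 0};
    in.read(reinterpret_cast<char*>(b), 4);

    // Check UTF-32 LE before UTF-16 LE: both start with FF FE, and the
    // UTF-16 LE rule above requires that the next two bytes are not 00 00.
    if (b[0] == 0x00 && b[1] == 0x00 && b[2] == 0xFE && b[3] == 0xFF) return "UTF-32 BE";
    if (b[0] == 0xFF && b[1] == 0xFE && b[2] == 0x00 && b[3] == 0x00) return "UTF-32 LE";
    if (b[0] == 0xFE && b[1] == 0xFF)                                 return "UTF-16 BE";
    if (b[0] == 0xFF && b[1] == 0xFE)                                 return "UTF-16 LE";
    return ""; // no BOM; fall back to the whole-file scans described above
}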

You iterate through it with an ordinary loop using stream.get() and check whether every byte value you read is <= 127. One of many ways to do it:

#include <fstream>

int c;
std::ifstream a("file.txt", std::ios::binary); // binary mode, so nothing is translated
while ((c = a.get()) != EOF && c <= 127)
    ;
if (c == EOF) {
    /* file is all ASCII */
}
However, as someone mentioned, all files are binary files after all. Additionally, it's not clear what you mean by "ASCII". If you mean the character code, then this is indeed the way to go. But if you mean only alphanumeric values, you would need a different approach.

My text editor decides based on the presence of null bytes. In practice, that works really well: a binary file with no null bytes is extremely rare.

The contents of every file are binary. So, knowing nothing else, you can't be sure.
ASCII is a matter of interpretation. If you open a binary file in a text editor, you see what I mean.
Most binary files contain a fixed header (per type) you can look for, or you can take the file extension as a hint. You can look for byte order marks if you expect UTF-encoded files, but they are optional as well.
Unless you define your question more closely, there can't be a definitive answer.

Have a look at how the file command works; it has three strategies to determine the type of a file:
filesystem tests
magic number tests
and language tests
Depending on your platform, and the possible files you're interested in, you can look at its implementation, or even invoke it.
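If invoking the tool is acceptable, a rough sketch using popen and the file utility's --brief/--mime-encoding options might look like this (assuming a POSIX system with file on the PATH; the quoting here is naive and real code should escape the path):

#include <cstdio>
#include <string>

// Ask the external `file` utility for the encoding of `path`.
// Returns e.g. "us-ascii", "utf-8" or "binary"; empty string on failure.
std::string detectEncodingWithFile(const std::string& path)
{
    std::string cmd = "file --brief --mime-encoding '" + path + "'";
    FILE* pipe = popen(cmd.c_str(), "r");
    if (!pipe) return "";

    char buf[256] = {0};
    std::string result;
    if (fgets(buf, sizeof buf, pipe) != nullptr)
        result = buf;
    pclose(pipe);

    // Trim the trailing newline that `file` prints.
    if (!result.empty() && result.back() == '\n')
        result.pop_back();
    return result;
}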

If the question is genuinely how to detect just ASCII, then litb's answer is spot on. However, if san was after knowing how to determine whether the file contains text or not, then the issue becomes far more complex. ASCII is just one - increasingly unpopular - way of representing text. The Unicode encodings - UTF-16, UTF-32 and UTF-8 - have grown in popularity. In theory, they can easily be tested for by checking whether the first two bytes are the Unicode byte-order mark (BOM) 0xFEFF (or 0xFFFE if the byte order is reversed). However, as those two bytes screw up many file formats on Linux systems, they cannot be guaranteed to be there. Further, a binary file might start with 0xFEFF.
Looking for 0x00's (or other control characters) won't help either if the file is Unicode. If the file is UTF-16, say, and contains English text, then every other byte will be 0x00.
If you know the language that the text file will be written in, then it is possible to analyse the bytes and statistically determine whether it contains text. For example, the most common letter in English is E, followed by T. So if the file contains many more E's and T's than Z's and X's, it's likely text (a toy sketch of this idea follows below). Of course, it would be necessary to test this as ASCII and as the various Unicode encodings to make sure.
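Here is a toy illustration of that statistical idea, assuming single-byte English text; the function name and thresholds are made up for the example and would need tuning for real use:

#include <cctype>
#include <cstddef>

// Very rough heuristic: in English text, 'e' and 't' should clearly
// outnumber 'z' and 'x'. The thresholds are arbitrary and only illustrative.
bool looksLikeEnglishText(const unsigned char* data, std::size_t size)
{
    std::size_t common = 0, rare = 0;
    for (std::size_t i = 0; i < size; ++i) {
        int c = std::tolower(data[i]);
        if (c == 'e' || c == 't') ++common;
        if (c == 'z' || c == 'x') ++rare;
    }
    return common > 10 * rare && common > size / 50;
}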
If the file isn't written in English - or you want to support multiple languages - then the only two options left are to look at the file extension on Windows and to check the first four bytes against a database of "magic file" codes to determine the file's type and thus whether it contains text or not.

Well, this depends on your definition of ASCII. You can either check for values with ASCII code < 128, or for some character set you define (e.g. 'a'-'z', 'A'-'Z', '0'-'9', ...), and treat the file as binary if it contains some other characters.
You could also check for regular line breaks (0x0A, or 0x0D followed by 0x0A) to detect text files.

To check, you must open the file as binary. You can't open the file as text. ASCII is effectively a subset of binary.
After that, you must check the byte values. ASCII has byte values 0-127, but 0-31 are control characters. TAB, CR and LF are the only common control characters.
You can't (portably) use 'A' and 'Z'; there's no guarantee those are in ASCII (!).
If you need them, you'll have to define them yourself:
const unsigned char ASCII_A = 0x41; // NOT 'A'
const unsigned char ASCII_Z = ASCII_A + 25;
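Putting that together, one possible predicate for "plausible ASCII text byte", in the same explicit-value style (what counts as "text" beyond printable ASCII plus TAB/CR/LF is a policy decision, not a rule):

const unsigned char ASCII_TAB = 0x09;
const unsigned char ASCII_LF  = 0x0A;
const unsigned char ASCII_CR  = 0x0D;

// True for printable ASCII (0x20-0x7E) plus the common control characters.
bool isPlausibleAsciiTextByte(unsigned char b)
{
    return (b >= 0x20 && b <= 0x7E) || b == ASCII_TAB || b == ASCII_LF || b == ASCII_CR;
}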

This question really has no right or wrong answer to it, just complex solutions that will not work for all possible text files.
Here is a link to a The Old New Thing article on how Notepad detects the encoding of a text file. It's not perfect, but it's interesting to see how Microsoft handles it.

GitHub's Linguist uses the charlock_holmes library to detect binary files, which in turn uses ICU's charset detection.
The ICU library is available for many programming languages, including C and Java.
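For reference, a hedged sketch of ICU's C charset-detection API (unicode/ucsdet.h); check the ICU documentation for the exact calls available in your version:

#include <unicode/ucsdet.h>
#include <string>

// Returns ICU's best guess at the charset name (e.g. "UTF-8", "ISO-8859-1"),
// or an empty string if detection fails. `data` is the raw file content.
std::string guessCharsetWithICU(const std::string& data)
{
    UErrorCode status = U_ZERO_ERROR;
    UCharsetDetector* det = ucsdet_open(&status);
    if (U_FAILURE(status)) return "";

    std::string name;
    ucsdet_setText(det, data.data(), static_cast<int32_t>(data.size()), &status);
    const UCharsetMatch* match = ucsdet_detect(det, &status);
    if (U_SUCCESS(status) && match != nullptr) {
        const char* n = ucsdet_getName(match, &status);
        if (U_SUCCESS(status) && n != nullptr)
            name = n;
    }
    ucsdet_close(det);
    return name;
}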

#include <fstream>
#include <string>

bool checkFileASCIIFormat(const std::string& fileName)
{
    std::ifstream read(fileName, std::ios::binary);
    int c;
    while ((c = read.get()) != EOF) {
        if (c > 127) {
            // ASCII codes only go up to 127
            return false;
        }
    }
    return true;
}

Related

Detecting Unicode in files in Windows 10

Windows 10 Notepad no longer requires Unicode files to have the BOM header, and it no longer writes the header by default. This breaks existing code that checks the header to determine whether a file is Unicode. How can I now tell in C++ if a file is in Unicode?
Source: https://www.bleepingcomputer.com/news/microsoft/windows-10-notepad-is-getting-better-utf-8-encoding-support/
The code we have to determine Unicode:
int IsUnicode(const BYTE p2bytes[3])
{
    if (p2bytes[0] == 0xEF && p2bytes[1] == 0xBB && p2bytes[2] == 0xBF)
        return 1; // UTF-8 (BOM is three bytes)
    if (p2bytes[0] == 0xFE && p2bytes[1] == 0xFF)
        return 2; // UTF-16 (BE)
    if (p2bytes[0] == 0xFF && p2bytes[1] == 0xFE)
        return 3; // UTF-16 (LE)
    return 0;
}
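A possible way to feed it (this assumes BYTE is unsigned char, as in <windows.h>, and zero-fills the buffer so files shorter than three bytes are handled; the helper name is made up):

#include <fstream>

typedef unsigned char BYTE;            // remove if <windows.h> already defines BYTE
int IsUnicode(const BYTE p2bytes[3]);  // declared above

int DetectFileBOM(const char* path)
{
    BYTE header[3] = {0, 0, 0};
    std::ifstream in(path, std::ios::binary);
    in.read(reinterpret_cast<char*>(header), 3);
    return IsUnicode(header); // 0 = no BOM found
}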
If it's so much pain, why isn't there a standard function to determine the encoding?
You should use the W3C method, which goes something like this:
if you know the encoding, use that
if there is a BOM, use it to determine the encoding
decode as UTF-8. UTF-8 has strict byte-sequence rules (that is the whole point of UTF-8: being able to find the first byte of a character). So if the file is not UTF-8, the decoding will very probably fail: in ANSI (cp-1252) it is uncommon for an accented letter to be followed by a symbol, and very unlikely for that to happen every time such a byte appears; in Latin-1 you may get C1 control characters instead of symbols, and it is equally rare for those to appear only after accented letters. (A sketch of such a structural UTF-8 check follows below.)
if decoding fails (you may test just the first 4096 bytes, or the first 10 bytes with values above 127), fall back to the standard 8-bit encoding of the OS (probably cp-1252 on Windows).
This method should work very well. It is biased toward UTF-8, but the world moved in that direction long ago. Determining which specific code page was used is much more difficult.
You may add a step before the last one: if there are many 00 bytes, the file may be in a UTF-16 or UTF-32 form. Unicode requires that you know which form is used (e.g. from a side channel); otherwise the file should have a BOM. But you can guess the form (UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE) from the positions of the 00 bytes in the file (newlines and some other ASCII characters are "common script" characters used by many scripts, so you should see plenty of 00 bytes).
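The "decode as UTF-8" step can be approximated without a full decoder by checking the byte-sequence rules. A sketch (for brevity it ignores the overlong-encoding and encoded-surrogate restrictions, which a strict validator should also reject):

#include <cstddef>

// Checks UTF-8 byte-sequence structure only.
bool looksLikeValidUtf8(const unsigned char* data, std::size_t size)
{
    std::size_t i = 0;
    while (i < size) {
        unsigned char b = data[i];
        std::size_t extra;
        if      (b <= 0x7F)          extra = 0; // ASCII
        else if ((b & 0xE0) == 0xC0) extra = 1; // 110xxxxx
        else if ((b & 0xF0) == 0xE0) extra = 2; // 1110xxxx
        else if ((b & 0xF8) == 0xF0) extra = 3; // 11110xxx
        else return false;                      // stray continuation or invalid lead byte
        if (extra > size - 1 - i) return false; // truncated sequence at end of buffer
        for (std::size_t k = 1; k <= extra; ++k)
            if ((data[i + k] & 0xC0) != 0x80) return false; // not a continuation byte
        i += extra + 1;
    }
    return true;
}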
Now Windows 10 does not require unicode files to have the BOM header.
Windows never had this requirement. Every program can read text files like it wants to.
Maybe interesting: a BOM may not be desirable for UTF-8 because it breaks ASCII compatibility.
This does break the existing code that checks the header to determine Unicode in files.
This is a misunderstanding. Other code has likely had Unicode support for longer than Windows Notepad.
How can I now tell in C++ if a file is in unicode?
Typically you would check for the presence of a BOM and then use that information of course.
Next you can try to read (the beginning of) the file with all possible encodings. The ones that throw an exception are obviously not suitable.
From the remaining encodings, you could use a heuristic to determine the encoding.
And if it still was the wrong choice, give the user an option to change the encoding manually. That's how it is done in many editors, like Notepad++.

Read file with unknown character type

I need to read text from a file that could use any character type (char, char8_t, wchar_t, etc.). How can I determine which character type is used and create an instance of basic_ifstream<char_type> depending on that type?
So I guess you want to auto-detect the encoding of an unknown text file.
This is impossible to do in a 100% reliable way. However, my experience shows that you can achieve very high reliability (> 99.99%) in most practical situations. The bigger the file, the more reliable the guess: a few tens of bytes are usually already enough to be confident.
A valid Unicode code point is a value from U+0001 to U+10FFFF inclusive, excluding the surrogate range U+D800 to U+DFFF. Code point U+0000 is actually valid, but excluding it greatly reduces the number of false positives (NUL bytes should never appear in any practical text file). For an even better guess, you can exclude a few more very rare control characters.
Here is the algorithm I would propose:
If the file begins with a valid BOM (UTF-8, UTF-16BE/LE, UTF-32BE/LE), trust that BOM.
If the file contains only ASCII characters (non-null bytes below 128), treat it as ASCII (use char).
If the file is valid UTF-8, then assume it is UTF-8 (use char8_t, but char will work also). Note that ASCII is a subset of UTF-8, so the previous check could be bypassed.
If the file is valid UTF-32 (check both little and big endian versions), then assume UTF-32 (char32_t, possibly also wchar_t on Linux or macOS). Swap the bytes if needed.
If the file is valid UTF-16 (check both little and big endian versions), including restrictions on surrogate pairs, and there is a higher correlation between even or odd bytes than between all bytes together, assume UTF-16 (char16_t, possibly also wchar_t on Windows). Swap the bytes if needed.
Otherwise, the file is probably not in a Unicode encoding and may use an old code page. Good luck auto-detecting which one; the most common by far is ISO-8859-1 (Latin-1), so use char. It may also be raw binary data.
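As an illustration of the even/odd byte-correlation idea in the UTF-16 step above, here is a crude sketch that only looks at where the zero bytes fall. It assumes mostly-Latin text, where one byte of each UTF-16 code unit is zero; the thresholds are arbitrary:

#include <cstddef>
#include <string>

// Returns "UTF-16LE", "UTF-16BE", or "" if the zero-byte pattern is inconclusive.
std::string guessUtf16Endianness(const unsigned char* data, std::size_t size)
{
    std::size_t zerosAtEven = 0, zerosAtOdd = 0;
    for (std::size_t i = 0; i < size; ++i) {
        if (data[i] == 0x00) {
            if (i % 2 == 0) ++zerosAtEven; // high byte first -> big endian for Latin text
            else            ++zerosAtOdd;  // high byte second -> little endian
        }
    }
    std::size_t total = zerosAtEven + zerosAtOdd;
    if (total < size / 4) return "";       // too few zero bytes to call it UTF-16
    if (zerosAtOdd  > 4 * zerosAtEven) return "UTF-16LE";
    if (zerosAtEven > 4 * zerosAtOdd)  return "UTF-16BE";
    return "";
}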
It's impossible to know for sure. You have to be told what the character type is. Frequently text files will begin with a Byte-Order-Mark to clue you in, but even that's not entirely foolproof.
You can make reasonable guesses as to the file contents, for example, if you "know" that most of it is ASCII-range text, it should be easy to figure out if the file is full of char or wchar_t characters. Even this relies on assumptions and should not be considered bulletproof.

How do Binary Files works? (From c++'s point of view) [closed]

I have some misunderstandings about binary files. I don't understand what a binary file is. I know text files are also binary files, but they need to be parsed in order to extract information. Unlike text files, binary files with the same contents look different: for example, when storing my name "Rishabh" in a binary file, it stores not only "Rishabh" but also some extra unreadable characters. What are they? Why doesn't it store only the characters, like a text file? And what are binary file formats, e.g. .3d, .zip, .mp3 etc.? From what I know about text files, the extension specifies the format, or how to process the file, like .dae, .xml, .htm etc., which contain tags to store data. But what about binary files? They don't need any tags, because the data is stored like variables in the file, from which we copy the contents into the program's variables (I mean it's like being stored in memory). So why are there different binary file formats; why can't a single program read the contents of any file? And what is binary file format cracking?
All files have some kind of pre-determined encoding, since computers can't store anything but bit patterns in bytes on disk. A text file contains only the encodings for printable characters plus space, and a few other encodings to end a line, tab, and maybe form-feed and a few others related to character display on a device. Because the encoding in a text file is a well-known standard, and is quite common, there are functions in most, if not all, languages to deal specifically with that type of file. Most importantly, they know how to read a line at a time - they recognize the line-terminator character(s).
If, however, you type the characters of your name in some other program besides a text editor - say the text tool in Gimp or Microsoft Paint - and then save it, the program has to save more information than just your name. Your name has a position on a canvas that must be saved. It also has a font and a size, and whether it is bold or italic or underlined, all of which need to be saved. The size of the canvas needs to be saved. The color being used, even if white and black, needs to be saved. This encoding will be different from the encoding used to save the letters of your name. So if you edit the file with a text editor, you will see some gibberish, since the text editor is expecting character encoding and knows nothing about the encoding Gimp uses for fonts, font sizes, x,y positions, etc.
C++ compilers are not written with routines to understand any binary file encodings. The routines for reading/writing binary files in C++ will just read and write sequences of bytes. Although, since the fundamental type that holds a byte of data in C++ is a char (or unsigned char), you will see binary prototypes like
write ( char * buffer, streamsize size );
read ( char * buffer, streamsize size );
But the char pointer in this case should be considered as a "byte *" since the read/write functions are just moving bytes of data from/to disk or memory without any regard for character encodings.
C++ read/write routines don't know, or care what the format or encoding is for the bytes they are moving. So it is left up to the programmer to write code to process or handle these bytes according to the pre-defined format for the file. However, the routines written to process a specific format of binary file can be compiled into a library that can then be shared or sold, and used by many C++ programmers. For example, LibXL can be used to read the binary format of Excel files from a C++ program.
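A small sketch of that kind of raw byte I/O (the byte values and file name are only for illustration; real formats define their layout byte by byte and usually avoid dumping structs directly because of padding and endianness):

#include <cstdint>
#include <fstream>

int main()
{
    // Some raw bytes we decide to call a "record"; the format is ours to define.
    std::uint8_t record[8] = {0x52, 0x69, 0x73, 0x68, 0x61, 0x62, 0x68, 0x00};

    // write() copies bytes verbatim; the stream neither knows nor cares what they mean.
    std::ofstream out("example.bin", std::ios::binary);
    out.write(reinterpret_cast<const char*>(record), sizeof record);
    out.close();

    // read() gives the same bytes back; interpreting them is our job.
    std::uint8_t back[8] = {};
    std::ifstream in("example.bin", std::ios::binary);
    in.read(reinterpret_cast<char*>(back), sizeof back);
    return 0;
}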
From the perspective of C/C++, the only difference between text and binary files is how line endings are handled.
If you open a file in binary mode, then read reads exactly the bytes in the file, and write writes exactly the bytes which are in memory.
If you open a file in text mode, then whatever character or character sequence is conventionally used to represent the end of a line in a file is transformed into some single character (which is written in the source code as '\n', although it is only one character) when the file is read, and the \n is transformed into the conventional end-of-line character or sequence when the file is written to. Also, it is not technically legal for the file to not end with an end-of-line sequence, and there may be a limit to the length of a line.
In Unix, the two modes are identical, because \n is a representation of the character code 10 (0A in hex), and that is precisely the conventional line-ending character. In Windows, by contrast, the conventional line-ending sequence is two bytes long -- {13,10} or {0D,0A}. \n is still 0A, so effectively the 0D preceding the 0A is deleted from the data read from the file, and an 0D is inserted before every 0A when data is written to the file.
Some (much) older operating systems had no conventional line-ending character. Instead, all lines were padded with space characters to the exact same length, making it possible to directly seek to a specific line number. C libraries working in text mode would typically read exactly the line length, and then delete the trailing spaces (if any) and finally add the code corresponding to \n (some such systems used EBCDIC instead of ASCII, so \n was a different integer value). Writing the data out, the \n would be deleted and replaced with exactly the correct number of spaces to bring the line to the standard length. Fortunately, those of us who don't work in a computing museum don't have to deal with that stuff any more, and Apple abandoned its use of 0D as the line-end character with the advent of OSX, so the text/binary difference is now limited to Windows.
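A quick way to see the text/binary mode difference in practice (the file name is arbitrary; on Unix both reads return identical bytes, on Windows the text-mode read is one byte shorter because CR LF collapses to '\n'):

#include <fstream>
#include <iostream>
#include <iterator>
#include <string>

int main()
{
    // Write "a\nb" in text mode: on Windows the '\n' becomes CR LF on disk.
    std::ofstream("lines.txt") << "a\nb";

    auto slurp = [](std::ios::openmode mode) {
        std::ifstream in("lines.txt", mode);
        return std::string(std::istreambuf_iterator<char>(in),
                           std::istreambuf_iterator<char>());
    };
    std::string textRead   = slurp(std::ios::in);                    // line endings translated
    std::string binaryRead = slurp(std::ios::in | std::ios::binary); // bytes exactly as stored

    std::cout << textRead.size() << " vs " << binaryRead.size() << '\n';
}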
Technically text files are binary, as all files are binary files really. Text files tend to store only text characters, while binary files can store any conceivable value - numbers, images, text, etc. Numbers, for example, are not stored in decimal notation like "1234"; they are stored in binary using 0s and 1s only. There are a few ways to do this (depending on your operating system), so the same number could look like a different set of 0s and 1s, e.g. 0001110101011. If you open a binary file in Notepad, it tries to display everything as text, and what you see is garbage instead - the other data represented in binary.
Cracking a binary file format means knowing exactly what information is stored in each byte of the file... sometimes text, numbers, arrays, classes, structures... anything really. With experience one can slowly work out what is what, but that's pretty advanced stuff!
Sometimes the information (the format) is freely available and easy to follow, or a nightmare to follow, like the format of an MS Word document. (The MS Word format is freely available, but reputed to be insanely complicated due to backwards compatibility. Nonetheless, having the format documentation allows you to 'crack' the binary file format and know exactly what all the binary represents.)
It's one of the fundamentals of a computer system.
Probably a great explanation in this link
http://www.cs.umd.edu/class/sum2003/cmsc311/Notes/BitOp/asciiBin.html
Some text quoted:
Although ASCII files are binary files, some people treat them as
different kinds of files. I like to think of ASCII files as special
kinds of binary files. They're binary files where each byte is written
in ASCII code.
A full, general binary file has no such restrictions. Any of the 256
bit patterns can be used in any byte of a binary file.
We work with binary files all the time. Executables, object files,
image files, sound files, and many file formats are binary files. What
makes them binary is merely the fact that each byte of a binary file
can be one of 256 bit patterns. They're not restricted to the ASCII
codes.

Reading text files of unknown encoding in C++

What should I use to read text files for which I don't know their encoding (ASCII or Unicode)?
Is there some class that auto-detects the encoding?
I can only give a negative answer here: there is no universally correct way to determine the encoding of a file. An ASCII file can be read as ISO-8859-15, because ASCII is a subset of it. Worse, other files may be valid in two different encodings, with different meanings in each. So you need to get this information by some other means. In many cases it is a good approach to just assume that everything is UTF-8. If you are working in a *NIX environment, the LC_CTYPE variable may be helpful. If you do not care about the encoding (e.g. you do not change or process the content), you can open files as binary.
This is impossible in the general case. If the file contains exactly the bytes I'm typing here, it is equally valid as ASCII, UTF-8 or any of the ISO 8859 variants. Several heuristics can be used as a guess, however: read the first "page" (512 bytes or so), then, in the following order:
See if the block starts with a BOM in one of the Unicode formats.
Look at the first four bytes. If they contain '\0', you're probably dealing with some form of UTF-16 or UTF-32, according to the following pattern:
'\0', other, '\0', other  ->  UTF-16BE
other, '\0', other, '\0'  ->  UTF-16LE
'\0', '\0', '\0', other   ->  UTF-32BE
other, '\0', '\0', '\0'   ->  UTF-32LE
Look for a byte with the top bit set. If it's the start of a legal UTF-8 character, then the file is probably in UTF-8. Otherwise... in the regions where I've worked, ISO 8859-1 is generally the best guess.
Otherwise, you more or less have to assume ASCII, until you encounter a byte with the top bit set (at which point, you use the previous heuristic).
But as I said, it's not 100% sure.
One brute-force way of doing it can be (see the sketch below for a rough C++ version):
Build a list of suitable encodings (only ISO code pages and Unicode)
Iterate over all considered encodings
Decode the text using this encoding
Encode it back to Unicode
Compare the results for errors
If there are no errors, remember the encoding that produced the fewest bytes
Reference: http://www.codeproject.com/KB/recipes/DetectEncoding.aspx
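A rough C++ analogue of that brute-force loop using POSIX iconv (the candidate list and buffer size are arbitrary; note that single-byte encodings accept almost any input, so the stricter candidates are listed first, and Windows code would use MultiByteToWideChar instead):

#include <iconv.h>
#include <string>
#include <vector>

// Try converting `data` from each candidate encoding to UTF-8 and keep the
// first candidate that converts without errors. Returns "" if none do.
std::string guessByRoundTrip(const std::string& data)
{
    const std::vector<std::string> candidates = {
        "UTF-8", "UTF-16LE", "UTF-16BE", "ISO-8859-1", "CP1252"
    };
    for (const std::string& enc : candidates) {
        iconv_t cd = iconv_open("UTF-8", enc.c_str());
        if (cd == (iconv_t)-1) continue; // encoding not supported on this system

        std::vector<char> out(data.size() * 4 + 16);
        char* inbuf   = const_cast<char*>(data.data());
        char* outbuf  = out.data();
        size_t inleft = data.size(), outleft = out.size();

        size_t rc = iconv(cd, &inbuf, &inleft, &outbuf, &outleft);
        iconv_close(cd);
        if (rc != (size_t)-1 && inleft == 0)
            return enc; // converted cleanly
    }
    return "";
}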
If you are sure that your incoming encoding is ANSI or Unicode, you can also check for a byte-order mark. But let me tell you that it is not foolproof.

C++ ifstream UTF8 first characters

Why does a file saved as UTF8 (in Notepad++) have this character at the beginning of the fstream I opened to it in my C++ program?
´╗┐
I have no idea what it is, I just know that it's not there when I save to ASCII.
UPDATE: If I save it to UTF8 (without BOM) it's not there.
How can I check the encoding of a file (ASCII or UTF8, everything else will be rejected ;) ) in C++? Is it exactly these characters?
Thanks!
When you save a file as UTF-16, each value is two bytes. Different computers use different byte orders. Some put the most significant byte first, some put the least significant byte first. Unicode reserves a special codepoint (U+FEFF) called a byte-order mark (BOM). When a program writes a file in UTF-16, it puts this special codepoint at the beginning of the file. When another program reads a UTF-16 file, it knows there should be a BOM there. By comparing the actual bytes to the expected BOM, it can tell if the reader uses the same byte order as the writer, or if all the bytes have to be swapped.
When you save a UTF-8 file, there's no ambiguity in byte order. But some programs, especially ones written for Windows, still add a BOM, encoded as UTF-8. When you encode the BOM codepoint as UTF-8, you get three bytes, 0xEF 0xBB 0xBF. Those bytes correspond to box-drawing characters in most OEM code pages (which is the default for a console window on Windows).
The argument in favor of doing this is that it marks the files as truly UTF-8, as opposed to some other native encoding. For example, lots of text files on western Windows are in codepage 1252. Tagging the file with the UTF-8-encoded BOM makes it easier to tell the difference.
The argument against doing this is that lots of programs expect ASCII or UTF-8 regardless, and don't know how to handle the extra three bytes.
If I were writing a program that reads UTF-8, I would check for exactly these three bytes at the beginning. If they're there, skip them.
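A small sketch of that check-and-skip, assuming the stream was just opened in binary mode and is seekable:

#include <fstream>

// If the stream starts with the UTF-8 encoded BOM (EF BB BF), consume it;
// otherwise rewind so the caller reads the file from the very beginning.
void skipUtf8BomIfPresent(std::ifstream& in)
{
    char bom[3] = {0, 0, 0};
    in.read(bom, 3);
    bool hasBom = in.gcount() == 3 &&
                  static_cast<unsigned char>(bom[0]) == 0xEF &&
                  static_cast<unsigned char>(bom[1]) == 0xBB &&
                  static_cast<unsigned char>(bom[2]) == 0xBF;
    if (!hasBom) {
        in.clear();  // reading 3 bytes may have hit EOF on tiny files
        in.seekg(0); // rewind; there was no BOM to skip
    }
}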
Update: You can convert U+FEFF ZERO WIDTH NO-BREAK SPACE characters into U+2060 WORD JOINER except at the beginning of a file [Gillam, Richard, Unicode Demystified, Addison-Wesley, 2003, p. 108]. My personal code does this. If, when decoding UTF-8, I see 0xEF 0xBB 0xBF at the beginning of the file, I take it as a happy sign that I indeed have UTF-8. If the file doesn't begin with those bytes, I just proceed decoding normally. If, while decoding later in the file, I encounter a U+FEFF, I emit U+2060 and proceed. This means U+FEFF is used only as a BOM and not with its deprecated meaning.
Without knowing what those characters really are (i.e., without a hex dump) it's only a guess, but my immediate guess would be that what you're seeing is the result of taking a byte order mark (BOM) and (sort of) encoding it as UTF-8. Technically, you're not allowed to/supposed to do that, but in practice it's actually fairly common.
Just to clarify, you should realize that this not really a byte-order mark. The basic idea of a byte-order mark simply doesn't apply to UTF-8. Theoretically, UTF-8 encoding is never supposed to be applied to a BOM -- but you can ignore that, and apply the normal UTF-8 encoding rules to the values that make up a BOM anyway, if you want to.
Why does a file saved as UTF8 have this character in the beginning [...] I have no idea what it is, I just know that it's not there when I save to ASCII.
I suppose you are referring to the Byte Order Mark (BOM) U+FEFF, a zero-width, non-breaking space character. Here (Notepad++ 5.4.3) a file saved as UTF-8 has the bytes EF BB BF at the beginning. I suppose that's the BOM encoded in UTF-8.
How can I check the encoding of a file
You cannot. You have to know what encoding your file was written in. While Unicode-encoded files might start with a BOM, I don't think there's a requirement that they do so.
Regarding your second point, every valid ASCII string is also a valid UTF-8 string, so you don't have to check for ASCII explicitly. Simply read the file using UTF-8; if the file doesn't contain a valid UTF-8 string, you will get an error.
I'm guessing you meant to ask, why does it have those characters. Those characters are probably the byte order mark, which according to that link in UTF-8 are the bytes EF BB BF.
As for knowing what encoding a file is in, you cannot derive that from the file itself. You have to know it ahead of time (or ask the user who supplies you with the file). For a better understanding of encoding without having to do a lot of reading, I highly recommend Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)