Having trouble understanding the semantics of u8 literals, or rather, understanding the result on g++ 4.8.1
This is my expectation:
const std::string utf8 = u8"åäö"; // or some other extended ASCII characters
assert( utf8.size() > 3);
This is the result on g++ 4.8.1
const std::string utf8 = u8"åäö"; // or some other extended ASCII characters
assert( utf8.size() == 3);
The source file is ISO-8859(-1)
We use these compiler flags: -m64 -std=c++11 -pthread -O3 -fpic
In my world, regardless of the encoding of the source file the resulting utf8 string should be longer than 3.
Or, have I totally misunderstood the semantics of u8, and the use-case it targets? Please enlighten me.
Update
If I explicitly tell the compiler what encoding the source file is, as many suggested, I got the expected behavior for u8 literals. But regular literals also get encoded to UTF-8.
That is:
const std::string utf8 = u8"åäö"; // or some other extended ASCII characters
assert( utf8.size() > 3);
assert( utf8 == "åäö");
Compiler invocation: g++ -m64 -std=c++11 -pthread -O3 -finput-charset=ISO8859-1
I tried a few other charset names defined by iconv, e.g. ISO_8859-1, and so on...
I'm even more confused now than before...
The u8 prefix really just means "when compiling this code, generate a UTF-8 string from this literal". It says nothing about how the literal in the source file should be interpreted by the compiler.
So you have several factors at play:
which encoding the source file is written in (in your case, apparently ISO-8859). According to this encoding, the string literal is "åäö" (3 bytes, containing the values 0xc5, 0xe4, 0xf6)
which encoding does the compiler assume when reading the source file? (I suspect that GCC defaults to UTF-8, but I could be wrong.)
the encoding that the compiler uses for the generated string in the object file. You specify this to be UTF-8 via the u8 prefix.
Most likely, #2 is where this goes wrong. If the compiler interprets the source file as ISO-8859, then it will read the three characters, convert them to UTF-8, and write those, giving you a 6-byte string as a result (I think each of those characters encodes to 2 bytes in UTF-8).
However, if it assumes the source file to be UTF-8, then it won't need to do a conversion at all: it reads 3 bytes, which it assumes are UTF-8 (even though they're invalid garbage values for UTF-8), and since you asked for the output string to be UTF-8 as well, it just outputs those same 3 bytes.
You can tell GCC which source encoding to assume with -finput-charset, or you can encode the source as UTF-8, or you can use the \uXXXX escape sequences in the string literal (\u00E5 instead of å, for example).
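For example, here is a minimal sketch of the escape-sequence approach applied to the question's snippet (assuming -std=c++11 as in the question; the expected size is 6 because each of these three characters encodes to two bytes in UTF-8):

#include <cassert>
#include <string>

int main() {
    // \u00E5 = å, \u00E4 = ä, \u00F6 = ö -- independent of the source file's encoding
    const std::string utf8 = u8"\u00e5\u00e4\u00f6";
    assert(utf8.size() == 6);
}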
Edit:
To clarify a bit, when you specify a string literal with the u8 prefix in your source code, then you are telling the compiler that "regardless of which encoding you used when reading the source text, please convert it to UTF-8 when writing it out to the object file". You are saying nothing about how the source text should be interpreted. That is up to the compiler to decide (perhaps based on which flags you passed to it, perhaps based on the process' environment, or perhaps just using a hardcoded default)
If the string in your source text contains the bytes 0xc5, 0xe4, 0xf6, and you tell it that "the source text is encoded as ISO-8859", then the compiler will recognize that the string consists of the characters "åäö". It will see the u8 prefix, and convert these characters to UTF-8, writing the byte sequence 0xc3, 0xa5, 0xc3, 0xa4, 0xc3, 0xb6 to the object file. In this case, you end up with a valid UTF-8 encoded text string containing the UTF-8 representation of the characters "åäö".
However, if the string in your source text contains the same bytes, and you make the compiler believe that the source text is encoded as UTF-8, then there are two things the compiler may do (depending on the implementation):
it might try to parse the bytes as UTF-8, in which case it will recognize that "this is not a valid UTF-8 sequence", and issue an error. This is what Clang does.
alternatively, it might say "ok, I have 3 bytes here, I am told to assume that they form a valid UTF-8 string. I'll hold on to them and see what happens". Then, when it is supposed to write the string to the object file, it goes "ok, I have these 3 bytes from before, which are marked as being UTF-8. The u8 prefix here means that I am supposed to write this string as UTF-8. Cool, no need to do a conversion then. I'll just write these 3 bytes and I'm done". This is what GCC does.
Both are valid. The C++ language doesn't state that the compiler is required to check the validity of the string literals you pass to it.
But in both cases, note that the u8 prefix has nothing to do with your problem. That just tells the compiler to convert from "whatever encoding the string had when you read it, to UTF-8". But even before this conversion, the string was already garbled, because the bytes corresponded to ISO-8859 character data, but the compiler believed them to be UTF-8 (because you didn't tell it otherwise).
The problem you are seeing is simply that the compiler didn't know which encoding to use when reading the string literal from your source file.
The other thing you are noticing is that a "traditional" string literal, with no prefix, is going to be encoded with whatever encoding the compiler likes. The u8 prefix (and the corresponding UTF-16 and UTF-32 prefixes) were introduced precisely to allow you to specify which encoding you wanted the compiler to write the output in. The plain prefix-less literals do not specify an encoding at all, leaving it up to the compiler to decide on one.
In order to illustrate this discussion, here are some examples. Let's consider the code:
#include <iostream>

int main() {
    std::cout << "åäö\n";
}
1) Compiling this with g++ -std=c++11 encoding.cpp will produce an executable that yields:
% ./a.out | od -txC
0000000 c3 a5 c3 a4 c3 b6 0a
In other words, two bytes per "grapheme cluster" (according to unicode jargon, i.e. in this case, per character), plus the final newline (0a). This is because my file is encoded in utf-8, the input-charset is assumed to be utf-8 by cpp, and the exec-charset is utf-8 by default in gcc (see https://gcc.gnu.org/onlinedocs/cpp/Character-sets.html). Good.
2) Now if I convert my file to iso-8859-1 and compile again using the same command, I get:
% ./a.out | od -txC
0000000 e5 e4 f6 0a
i.e. the three characters are now encoded using iso-8859-1. I am not sure about the magic going on here, as this time it seems that cpp correctly guessed that the file was iso-8859-1 (without any hint), converted it to utf-8 internally (according to the link above) but the compiler still stored the iso-8859-1 string in the binary. This we can check by looking at the .rodata section of the binary:
% objdump -s -j .rodata a.out
a.out: file format elf64-x86-64
Contents of section .rodata:
400870 01000200 00e5e4f6 0a00 ..........
(Note the "e5e4f6" sequence of bytes).
This makes perfect sense as a programmer who uses latin-1 literals does not expect them to come out as utf-8 strings in his program's output.
3) Now if I keep the same iso-8859-1-encoded file, but compile with g++ -std=c++11 -finput-charset=iso-8859-1 encoding.cpp, then I get a binary that outputs utf-8 data:
% ./a.out | od -txC
0000000 c3 a5 c3 a4 c3 b6 0a
I find this weird: the source encoding has not changed, I explicitly tell gcc it is latin-1, and I get utf-8 as a result! Note that this can be overridden if I explicitly request the exec-charset with g++ -std=c++11 -finput-charset=iso-8859-1 -fexec-charset=iso-8859-1 encoding.cpp:
% ./a.out | od -txC
0000000 e5 e4 f6 0a
It is not clear to me how these two options interact...
4) Now let's add the "u8" prefix into the mix:
#include <iostream>

int main() {
    std::cout << u8"åäö\n";
}
If the file is utf-8-encoded, then unsurprisingly, compiling with the default charsets (g++ -std=c++11 encoding.cpp), the output is utf-8 as well. If I request the compiler to use iso-8859-1 internally instead (g++ -std=c++11 -fexec-charset=iso-8859-1 encoding.cpp), the output is still utf-8:
% ./a.out | od -txC
0000000 c3 a5 c3 a4 c3 b6 0a
So it looks like the prefix "u8" prevented the compiler from converting the literal to the execution character set. Even better, if I convert the same source file to iso-8859-1, and compile with g++ -std=c++11 -finput-charset=iso-8859-1 -fexec-charset=iso-8859-1 encoding.cpp, then I still get utf-8 output:
% ./a.out | od -txC
0000000 c3 a5 c3 a4 c3 b6 0a
So it seems that "u8" actually acts as an "operator" that tells the compiler "convert this literal to utf-8".
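As a final check that doesn't depend on the console's own encoding, here is a small sketch (again assuming -std=c++11) that dumps the bytes of the u8 literal directly, so you can verify the output encoding without od:

#include <cstdio>
#include <string>

int main() {
    const std::string s = u8"\u00e5\u00e4\u00f6";   // "åäö" via UCNs, immune to -finput-charset
    for (unsigned char c : s)
        std::printf("%02x ", c);                    // expect: c3 a5 c3 a4 c3 b6
    std::printf("\n");
}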
Related
I am trying to figure out wide characters in C. For example, I test a string that contains a single letter "Ē", which is encoded as 0xC4 0x92 in UTF-8.
char* T1 = "Ē";
//This is the resulting array { 0xc4, 0x92, 0x00 }
wchar_t* T2 = L"Ē";
//This is the resulting array { 0x00c4, 0x2019, 0x0000 }
I expected that the second array would be {0xc492, 0x0000}, instead it contains an extra character that just wastes space in my opinion. Can anyone help me understand what is going on with this?
What you've managed to do here is mojibake. Your source code is written in UTF-8 but it was interpreted in Windows codepage 1252 (i.e. the compiler source character set was CP1252).
The wide string contents are the Windows codepage 1252 characters of the UTF-8 bytes 0xC4 0x92 converted to UCS-2. The easiest way out is to just use an escape instead:
wchar_t* T2 = L"\x112";
or
wchar_t* T2 = L"\u0112";
The larger problem is that, to my knowledge, neither C nor C++ has a mechanism for specifying the source character set within the code itself, so it is always a setting or option external to the code, not something that you can easily copy-paste.
Your compiler is misinterpreting your source code file (which is saved as UTF-8) as Windows-1252 (commonly called ANSI). It does not interpret the byte sequence C4 92 as the one-character UTF-8 string "Ē", but as the two-character Windows-1252 string "Ä’". The unicode codepoint of "Ä" is U+00C4, and the unicode codepoint of "’" is U+2019. This is exactly what you see in your wide character string.
The 8-bit string only works, because the misinterpretation of the string does not matter, as it is not converted during compilation. The compiler reads the string as Windows-1252 and emits the string as Windows-1252 (so it does not need to convert anything, and considers both to be "Ä’"). You interpret the source code and the data in the binary as UTF-8, so you consider both to be "Ē".
To have the compiler treat your source code as UTF-8, use the switch /utf-8.
BTW: The correct UTF-16 encoding (which is the encoding MSVC uses for wide character strings) to be observed in a wide-character string is not {0xc492, 0x0000}, but {0x0112, 0x0000}, because "Ē" is U+0112.
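If you want to check the resulting code units without a debugger, a small sketch along these lines should do; the expected output is 0x0112 on both MSVC and GCC, since the UCN is independent of the source encoding:

#include <cstdio>

int main() {
    const wchar_t* t2 = L"\u0112";                            // "Ē" written as a universal-character-name
    for (const wchar_t* p = t2; *p != L'\0'; ++p)
        std::printf("0x%04X ", static_cast<unsigned>(*p));    // expect: 0x0112
    std::printf("\n");
}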
While debugging on gcc, I found that the Unicode literal u"万不得已" was represented as u"\007\116\015\116\227\137\362\135". Which makes sense -- 万 is 0x4E07, and 0x4E in octal is 116.
Now on Apple LLVM 9.1.0 on an Intel-powered Macbook, I find that that same literal is not handled as the same string, ie:
u16string{u"万不得已"} == u16string{u"\007\116\015\116\227\137\362\135"}
goes from true to false. I'm still on a little-endian system, so I don't understand what's happening.
NB. I'm not trying to use the correspondence u"万不得已" == u"\007\116\015\116\227\137\362\135". I just want to understand what's happening.
I found that the Unicode literal u"万不得已" was represented as u"\007\116\015\116\227\137\362\135"
No, actually it is not. And here's why...
u"..." string literals are encoded as a char16_t-based UTF-16 encoded string on all platforms (that is what the u prefix is specifically meant for).
u"万不得已" is represented by this UTF-16 codeunit sequence:
4E07 4E0D 5F97 5DF2
On a little-endian system, that UTF-16 sequence is represented by this raw byte sequence:
07 4E 0D 4E 97 5F F2 5D
In octal, that would be represented by "\007\116\015\116\227\137\362\135" ONLY WHEN using a char-based string (note the lack of a string prefix, or u8 would also work for this example).
u"\007\116\015\116\227\137\362\135" is NOT a char-based string! It is a char16_t-based string, where each octal number represents a separate UTF-16 codeunit. Thus, this string actually represents this UTF-16 codeunit sequence:
0007 004E 000D 004E 0097 005F 00F2 005D
That is why your two u16string objects are not comparing as the same string value. Because they are really not equal.
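To see this in action, here is a small sketch (it assumes the source file itself is saved as UTF-8, so that the u"万不得已" literal is read correctly):

#include <cassert>
#include <string>

int main() {
    std::u16string a = u"万不得已";                          // 4 UTF-16 code units: 4E07 4E0D 5F97 5DF2
    std::u16string b = u"\x4E07\x4E0D\x5F97\x5DF2";          // the same 4 code units, written as hex escapes
    std::u16string c = u"\007\116\015\116\227\137\362\135";  // 8 code units, one per octal escape
    assert(a == b);
    assert(a != c);
    assert(a.size() == 4 && c.size() == 8);
}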
I know that only positive character ASCII values are guaranteed cross platform support.
In Visual Studio 2015, I can do:
cout << '\xBA';
And it prints:
║
When I try that on http://ideone.com, it prints nothing.
If I try to directly print this using the literal character:
cout << '║';
Visual Studio gives the warning:
warning C4566: character represented by universal-character-name '\u2551' cannot be represented in the current code page (1252)
And then prints:
?
When this command is run on http://ideone.com I get:
14849425
I've read that wchars may provide a cross platform approach to this. Is that true? Or am I simply out of luck on extended ASCII?
There are two separate concepts in play here.
The first one is one of a locale, which is often called a "code page" in Microsoft-ese. A locale defines which visual characters are represented by which byte sequence. In your first example, whatever locale your program is executed in shows the "║" character in response to the byte 0xBA.
Other locales, or code pages, will display different characters for the same bytes. Many locales are multibyte locales, where it can take several bytes to display a single character. In the UTF-8 locale, for example, the same character, ║, takes three bytes to display: 0xE2 0x95 0x91.
The second concept here is one of the source code character set, which comes from the locale in which the source code is edited, before it gets compiled. When you enter the ║ character in your source code, it may get represented, I suppose, either as the 0xBA character, or maybe 0xE2 0x95 0x91 sequence, if your editor uses the UTF-8 locale. The compiler, when it reads the source code, just sees the actual byte sequence. Everything gets reduced to bytes.
Fortunately, all C++ keywords use US-ASCII, so it doesn't matter what character set is used to write C++ code, until you start using non-Latin characters. These result in a compiler warning, informing you, basically, that you're using stuff that may or may not work, depending on the eventual locale the resulting program runs in.
First, your input source file has its own encoding. Your compiler needs to be able to read this encoding (maybe with the help of flags/settings).
With a simple string, the compiler is free to do what it wants, but it must yield a const char[]. Usually, the compiler keeps the source encoding when it can, so the string stored in your program will have the encoding of your input file. There are cases when the compiler will do a conversion, for example if your file is UTF-16 (you can't fit UTF-16 characters in chars).
When you use '\xBA', you write a raw character, and you chose yourself your encoding, so there is no encoding from the compiler.
When you use '║', the type of '║' is not necessarily char. If the character is not representable as a single byte in the compiler character set, its type will be int. In the case of Visual Studio with the Windows-1252 source file, '║' doesn't fit, so it will be of type int and printed as such by cout <<.
You can force an encoding with prefixes on string literals. u8"" will force UTF-8, u"" UTF-16 and U"" UTF-32. Note that the L"" prefix will give you a wide-char wchar_t string, but its encoding is still implementation dependent: wide strings are UTF-16 (2 bytes per code unit) on Windows, but UTF-32 (4 bytes per character) on Linux.
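As a hedged illustration of those prefixes, here is a small sketch using static_asserts (it assumes a C++11-or-later compiler; the sizes count the terminating NUL):

// "é" is U+00E9: 2 code units in UTF-8, 1 in UTF-16, 1 in UTF-32
static_assert(sizeof(u8"\u00e9") == 3 * sizeof(char),     "2 UTF-8 code units + NUL");
static_assert(sizeof(u"\u00e9")  == 2 * sizeof(char16_t), "1 UTF-16 code unit + NUL");
static_assert(sizeof(U"\u00e9")  == 2 * sizeof(char32_t), "1 UTF-32 code unit + NUL");

int main() {}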
Printing to the console only depends on the type of the variable. cout << is overloaded with all common types, so what it does depends on the type. cout << will usually feed char strings as is to the console (actually stdout), and wcout << will usually feed wchar_t strings as is. Other combinations may have conversions or interpretations (like feeding an int). UTF-8 strings are char strings, so cout << should always feed them correctly.
Next, there is the console itself. A console is a totally independent piece of software. You feed it some bytes, it displays them. It doesn't care one bit about your program. It uses its own encoding, and tries to print the bytes you feed it using this encoding.
The default console encoding on Windows is code page 850 (not sure if that is always the case). In your case, your file is CP 1252 and your console is CP 850, which is why you can't print '║' directly (CP 1252 doesn't contain '║'), but you can with a raw character. You can change the console's output encoding on Windows with SetConsoleOutputCP().
On Linux, the default encoding is UTF-8, which is more convenient because it supports the whole Unicode range. Ideone uses Linux, so it will use UTF-8. Note that there is the added layer of HTTP and HTML, but they also use UTF-8 for that.
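For completeness, here is a minimal Windows-only sketch of the SetConsoleOutputCP() route mentioned above (CP_UTF8 is code page 65001; the literal below is simply the raw UTF-8 byte sequence for U+2551, so it works regardless of the language standard):

#include <windows.h>
#include <iostream>

int main() {
    SetConsoleOutputCP(CP_UTF8);     // make the console decode our output bytes as UTF-8
    std::cout << "\xE2\x95\x91\n";   // the UTF-8 bytes for '║' (U+2551)
}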
I am writing a Lexer in MSVC and I need a way to represent an exact character match for all 128 Basic Latin unicode characters.
However, according to this MSDN article, "With the exception of 0x24 and 0x40, characters in the range of 0 to 0x20 and 0x7f to 0x9f cannot be represented with a universal character name (UCN)."
...Which basically means I am not allowed to declare something like wchar_t c = '\u0000';, let alone use a switch statement on this 'disallowed' range of characters. Also, for '\n' and '\r', it is my understanding that the actual values/lengths vary between compilers/target OSes...
(i.e. Windows uses '\r\n', while Unix simply uses '\n' and older versions of MacOS use '\r')
...and so I have made a workaround for this using universal characters in order to ensure proper encoding schemes and byte lengths are detected and properly used.
But this C3850 compiler error simply refuses to allow me to do things my way...
So how can this be solved in a manner that ensures proper encoding schemes & character matches given ANY source input?
In C++11 the restrictions on what characters you may represent with universal character names do not apply inside character and string literals.
C++11 2.3/2
Additionally, if the hexadecimal value for a universal-character-name outside the c-char-sequence, s-char-sequence, or r-char-sequence of a character or string literal corresponds to a control character (in either of the ranges 0x00–0x1F or 0x7F–0x9F, both inclusive) or to a character in the basic source character set, the program is ill-formed.
That means that those restrictions on UCNs don't apply inside character and string literals:
wchar_t c = L'\u0000'; // perfectly okay
switch (c) {
    case L'\u0000':
        ;
}
This was different in C++03 and I assume from your question that Microsoft has not yet updated their compiler to allow this. However I don't think this matters because using UCNs does not solve the problem you're trying to solve.
and so I have made a workaround for this using universal characters in order to ensure proper encoding schemes and byte lengths are detected and properly used
Using UCNs does not do anything to determine the encoding scheme used. A UCN is a source encoding independent method of including a particular character in your source, but the compiler is required to treat it exactly the same as if that character had been written literally in the source.
For example, take the code:
#include <iostream>

int main() {
    unsigned char c = 'µ';
    std::cout << (int)c << '\n';
}
If you save the source as UTF-16 and build this with Microsoft's compiler on a Windows system configured to use code page 1252 then the compiler will convert the UTF-16 representation of 'µ' to the CP1252 representation. If you build this source on a system configured with a different code page, one that does not contain the character, then the compiler will give a warning/error when it fails to convert the character to that code page.
Similarly, if you save the source code as UTF-8 (with the so-called 'BOM', so that the compiler knows the encoding is UTF-8) then it will convert the UTF-8 source representation of the character to the system's code page if possible, whatever that is.
And if you replace 'µ' with a UCN, '\u00B5', the compiler will still do exactly the same thing; it will convert the UCN to the system's code page representation of U+00B5 MICRO SIGN, if possible.
So how can this be solved in a manner that ensures proper encoding schemes & character matches given ANY source input?
I'm not sure what you're asking. I'm guessing you want to ensure that the integral values of char or wchar_t variables/literals are consistent with a certain encoding scheme (probably ASCII since you're only asking about characters in the ASCII range), but what is the 'source input'? The encoding of your lexer's source files? The encoding of the input to your lexer? How do you expect the 'source input' to vary?
Also, for '\n' and '\r', it is my understanding that the actual values/lengths vary between compilers/target OSes...
(i.e. Windows uses '\r\n', while Unix simply uses '\n' and older versions of MacOS use '\r')
This is a misunderstanding of text mode I/O. When you write the character '\n' to a text mode file the OS can replace the '\n' character with some platform specific representation of a new line. However this does not mean that the actual value of '\n' is any different. The change is made purely within the library for writing files.
For example you can open a file in text mode, write '\n', then open the file in binary mode and compare the written data to '\n', and the written data can differ from '\n':
#include <fstream>
#include <iostream>

int main() {
    char const * filename = "test.txt";
    {
        std::ofstream fout(filename);
        fout << '\n';
    }
    {
        std::ifstream fin(filename, std::ios::binary);
        char buf[100] = {};
        fin.read(buf, sizeof(buf));
        if (sizeof('\n') == fin.gcount() && buf[0] == '\n') {
            std::cout << "text mode written '\\n' matches value of '\\n'\n";
        } else {
            // This will be executed on Windows
            std::cout << "text mode written '\\n' does not match value of '\\n'\n";
        }
    }
}
This also doesn't depend on using the '\n' syntax; you can rewrite the above using 0xA, the ASCII newline character, and the results will be the same on Windows. (I.e., when you write the byte 0xA to a text mode file Windows will actually write the two bytes 0xD 0xA.)
I found that omitting the string literal and simply using the hexadecimal value of the character allows everything to compile just fine.
For example, you would change the following line:
wchar_t c = L'\u0000';
...to:
wchar_t c = 0x0000;
Though, I'm still not sure if this actually holds the same independent values provided by a UCN.
How do you identify the file content as being in ASCII or binary using C++?
If a file contains only the decimal bytes 9–13, 32–126, it's probably a pure ASCII text file. Otherwise, it's not. However, it may still be text in another encoding.
If, in addition to the above bytes, the file contains only the decimal bytes 128–255, it's probably a text file in an 8-bit or variable-length ASCII-based encoding such as ISO-8859-1, UTF-8 or ASCII+Big5. If not, for some purposes you may be able to stop here and consider the file to be binary. However, it may still be text in a 16- or 32-bit encoding.
If a file doesn't meet the above constraints, examine the first 2–4 bytes of the file for a byte-order mark:
If the first two bytes are hex FE FF, the file is tentatively UTF-16 BE.
If the first two bytes are hex FF FE, and the following two bytes are not hex 00 00 , the file is tentatively UTF-16 LE.
If the first four bytes are hex 00 00 FE FF, the file is tentatively UTF-32 BE.
If the first four bytes are hex FF FE 00 00, the file is tentatively UTF-32 LE.
If, through the above checks, you have determined a tentative encoding, then check only for the corresponding encoding below, to ensure that the file is not a binary file which happens to match a byte-order mark.
If you have not determined a tentative encoding, the file might still be a text file in one of these encodings, since the byte-order mark is not mandatory, so check for all encodings in the following list:
If the file contains only big-endian two-byte words with the decimal values 9–13, 32–126, and 128 or above, the file is probably UTF-16 BE.
If the file contains only little-endian two-byte words with the decimal values 9–13, 32–126, and 128 or above, the file is probably UTF-16 LE.
If the file contains only big-endian four-byte words with the decimal values 9–13, 32–126, and 128 or above, the file is probably UTF-32 BE.
If the file contains only little-endian four-byte words with the decimal values 9–13, 32–126, and 128 or above, the file is probably UTF-32 LE.
If, after all these checks, you still haven't determined an encoding, the file isn't a text file in any ASCII-based encoding I know about, so for most purposes you can probably consider it to be binary (it might still be a text file in a non-ASCII encoding such as EBCDIC, but I suspect that's well outside the scope of your concern).
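Here is a hedged sketch of just the byte-order-mark step from the list above (note that the UTF-32 LE check must come before the UTF-16 LE check, for the reason given above; a missing BOM still means falling back to the content heuristics):

#include <fstream>
#include <string>

enum class Bom { None, Utf16BE, Utf16LE, Utf32BE, Utf32LE };

Bom detectBom(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    unsigned char b[4] = {0, 0, 0, 0};
    in.read(reinterpret_cast<char*>(b), sizeof b);
    std::streamsize n = in.gcount();
    if (n >= 4 && b[0] == 0x00 && b[1] == 0x00 && b[2] == 0xFE && b[3] == 0xFF) return Bom::Utf32BE;
    if (n >= 4 && b[0] == 0xFF && b[1] == 0xFE && b[2] == 0x00 && b[3] == 0x00) return Bom::Utf32LE;
    if (n >= 2 && b[0] == 0xFE && b[1] == 0xFF)                                 return Bom::Utf16BE;
    if (n >= 2 && b[0] == 0xFF && b[1] == 0xFE)                                 return Bom::Utf16LE;
    return Bom::None;   // no BOM: fall back to the word-pattern checks described above
}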
You iterate through it using a normal loop with stream.get(), and check whether the byte values you read are <= 127. One of many ways to do it:
#include <cstdio>    // for EOF
#include <fstream>

int c;
std::ifstream a("file.txt");
while ((c = a.get()) != EOF && c <= 127)
    ;
if (c == EOF) {
    /* file is all ASCII */
}
However, as someone mentioned, all files are binary files after all. Additionally, it's not clear what you mean by "ASCII". If you mean the character code, then this is indeed the way to go. But if you mean only alphanumeric values, you would need a different approach.
My text editor decides on the presence of null bytes. In practice, that works really well: a binary file with no null bytes is extremely rare.
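That heuristic is easy to sketch (a minimal version; it reads the whole file and treats any NUL byte as evidence of binary content):

#include <fstream>

bool looksBinary(const char* path) {
    std::ifstream in(path, std::ios::binary);
    char c;
    while (in.get(c))
        if (c == '\0')
            return true;    // NUL byte found: almost certainly not a text file
    return false;
}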
The contents of every file are binary. So, knowing nothing else, you can't be sure.
ASCII is a matter of interpretation. If you open a binary file in a text editor, you see what I mean.
Most binary files contain a fixed header (per type) you can look for, or you can take the file extension as a hint. You can look for byte order marks if you expect UTF-encoded files, but they are optional as well.
Unless you define your question more closely, there can't be a definitive answer.
Have a look at how the file command works; it has three strategies to determine the type of a file:
filesystem tests
magic number tests
and language tests
Depending on your platform, and the possible files you're interested in, you can look at its implementation, or even invoke it.
If the question is genuinely how to detect just ASCII, then litb's answer is spot on. However if san was after knowing how to determine whether the file contains text or not, then the issue becomes way more complex. ASCII is just one - increasingly unpopular - way of representing text. Unicode systems - UTF-16, UTF-32 and UTF-8 - have grown in popularity. In theory, they can be easily tested for by checking if the first two bytes are the Unicode byte order mark (BOM) 0xFEFF (or 0xFFFE if the byte order is reversed). However as those two bytes screw up many file formats for Linux systems, they cannot be guaranteed to be there. Further, a binary file might start with 0xFEFF.
Looking for 0x00's (or other control characters) won't help either if the file is Unicode. If the file is UTF-16, say, and the file contains English text, then every other character will be 0x00.
If you know the language that the text file will be written in, then it would be possible to analyse the bytes and statistically determine if it contains text or not. For example, the most common letter in English is E followed by T. So if the file contains lots more E's and T's than Z's and X's, it's likely text. Of course it would be necessary to test this as ASCII and the various unicodes to make sure.
If the file isn't written in English - or you want to support multiple languages - then the only two options left are to look at the file extension on Windows and to check the first four bytes against a database of "magic file" codes to determine the file's type and thus whether it contains text or not.
Well, this depends on your definition of ASCII. You can either check for values with ASCII code <128 or for some charset you define (e.g. 'a'-'z','A'-'Z','0'-'9'...) and treat the file as binary if it contains some other characters.
You could also check for regular line breaks (0x0A, or 0x0D 0x0A) to detect text files.
To check, you must open the file as binary. You can't open the file as text. ASCII is effectively a subset of binary.
After that, you must check the byte values. ASCII has byte values 0-127, but 0-31 are control characters. TAB, CR and LF are the only common control characters.
You can't (portably) use 'A' and 'Z'; there's no guarantee those are in ASCII (!).
If you need them, you'll have to define them yourself:
const unsigned char ASCII_A = 0x41; // NOT 'A'
const unsigned char ASCII_Z = ASCII_A + 25;
This question really has no right or wrong answer to it, just complex solutions that will not work for all possible text files.
Here is a link to a The Old New Thing article on how Notepad detects the type of an ASCII file. It's not perfect, but it's interesting to see how Microsoft handles it.
GitHub's linguist uses the charlock_holmes library to detect binary files, which in turn uses ICU's charset detection.
The ICU library is available for many programming languages, including C and Java.
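A hedged sketch of what using ICU's detector might look like in C++ (the ucsdet_* calls are from ICU's unicode/ucsdet.h as I recall them, so check your ICU version; you need to link against ICU, typically -licui18n -licuuc, and feed the detector a sample of the file's bytes):

#include <unicode/ucsdet.h>
#include <cstdio>

int main() {
    const char sample[] = "bytes read from the file you want to classify";  // placeholder input
    UErrorCode status = U_ZERO_ERROR;
    UCharsetDetector* det = ucsdet_open(&status);
    ucsdet_setText(det, sample, static_cast<int32_t>(sizeof sample - 1), &status);
    const UCharsetMatch* match = ucsdet_detect(det, &status);   // best guess, or NULL
    if (!U_FAILURE(status) && match != NULL)
        std::printf("guess: %s (confidence %d)\n",
                    ucsdet_getName(match, &status),
                    ucsdet_getConfidence(match, &status));
    ucsdet_close(det);
}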
#include <fstream>
#include <string>

bool checkFileASCIIFormat(const std::string& fileName)
{
    bool ascii = true;
    // Open in binary mode so text-mode translation doesn't interfere with the check
    std::ifstream read(fileName, std::ios::binary);
    int c;
    while (ascii && !read.eof()) {
        c = read.get();        // returns EOF (-1) at end of file, which passes the test below
        if (c > 127) {
            // ASCII codes only go up to 127
            ascii = false;
        }
    }
    return ascii;
}