Understanding the concept of 'File encoding'

Understanding the concept of 'File encoding' - c++

I have already gone through some stuff on the web and SOF explaining 'file encoding' but I still have questions. File is a group of related records and on disk, its contents are just stored as '1's and '0's. Every time, a running program wants to read in a file or write to the file, the file is brought into the RAM and put into the address space of the running program (aka process). Now what determines how the bits (or bytes) in the file should be decoded/encoded and read and displayed/written?
There is one explanation on SOF which reads 'At the storage level, a file contains an array of bytes. On top of this you have the encoding layer for text files. The format layer comes last, on top of the encoding layer for text files or on top of the array of bytes for all the other binary files'. I am sort of fine with this but would like to know if it is 100% correct.
The question basically came up when understanding file opening modes in C++.

I think the description of the orders of layers is confusing here. I would consider formats and encodings to be related but not tied together so tightly. Let's try to define it formally.
A file is a contiguous sequence of bytes. A byte is a contiguous sequence of bits.
A symbol is a unit of data. Bytes are one kind of symbol. There are other symbols that are not bytes. Consider the number 6 - it is a symbol but not a byte. It can however be encoded as a byte, commonly as 00000110 (this is the two's complement encoding of 6).
An encoding maps a set of symbols to another set of symbols. Most commonly, it maps from a set of non-byte symbols to bytes, which when applied to an entire file makes it a file encoding. Two's complement gives a representation of the numeric values. On the other hand, ASCII, for example, gives a representation of the Latin alphabet and related characters in bytes. If you take ASCII and apply it to a string of text, say "Hello, World!", you get a sequence of bytes. If you store this sequence of bytes as a file, you have a file encoded as ASCII.
A format describes a set of valid sequences of symbols. When applied to the bytes of a file, it is a file format. An example is the BMP file format for storing raster graphics. It specifies that there must be a few bytes at the beginning that identify the file format as BMP, followed by a few bytes to describe the size and depth of the image, and so on. An example of a format that is not a file format would be how we write decimal numbers in English. The basic format is a sequence of numerical characters followed by an optional decimal point with more numerical characters.
Text Files
A text file is a kind of file that has a very simple format. It's format is very simple because it has no structure. It immediately begins with some encoding of a character and ends with the encoding of the final character. There's usually no header or footer or metadata or anything like that. You just start interpreting the bytes as characters right from the beginning.
But how do you interpret the characters in the file? That's where the encoding comes in. If the file is encoded as ASCII, the byte 01000001 represents the Latin letter A. There are much more complicated encodings, such as UTF-8. In UTF-8, a character cannot necessarily be represented in a single byte. Some can, some can't. You determine the number of bytes to interpret as a character from the first few bits of the first byte.
When you open a file in your favourite text editor, how does it know how to interpret the bytes? Well that's an interesting problem. The text editor has to determine the encoding of the file. It can attempt to do this in many ways. Sometimes the file name gives a hint through its extension (.txt is likely to be at least ASCII compatible). Sometimes the first character of the file gives a good hint as to what the encoding is. Most text editors will, however, give you the option to specify which encoding to treat the file as.
A text file can have a format. Often the format is entirely independent of the encoding of the text. That is, the format doesn't describe the valid sequences of bytes at all. It instead describes the valid sequences of characters. For example, HTML is a format for text files for marking up documents. It describes the sequences of characters that determine the contents of a document (note: not the sequence of bytes). As an example, it says that the sequence of characters <html> are an opening tag and must be followed at some point by the closing tag </html>. Of course, the format is much more detailed than this.
Binary file
A binary file is a file with meaning determined by its file format. The file format describes the valid sequences of bytes within the file and the meaning that that sequence has. It is not some interpretation of the bytes that matters at the file format level - it is the order and arrangement of bytes.
As described above, the BMP file format gives a way of storing raster graphics. It says that the first two bytes must be 01000010 01001101, the next four bytes must give the size of the file as a count of the number of bytes, and so on, leading up to the actual pixel data.
A binary file can have encodings within it. To illustrate this, consider the previous example. I said that the four bytes following the first two in a BMP file give the size of the file in bytes. How are those bytes interpreted? The BMP file format states that those bytes give the size as an unsigned integer. This is the encoding of those bytes.
So when you browse the directories on your computer for a BMP file and open it, how does your system know how to open it? How does it know which program to use to view it? The format of a binary file is much more strongly hinted by the file extension than the encoding of a text file. If the filename has .bmp at the end, your system will likely consider it to be a BMP file and just open it in whatever graphics program you have. It may also look at the first few bytes and see what they suggest.
Summary
The first level of understanding the meaning of bytes in a file is that file's format. A text file has an extremely simple format - start at the beginning, interpreting characters until you reach the end. How you interpret the characters depends on that text file's character encoding. Most formats are more complicated, however, and will likely have encodings nested within them. At some level you have to start extracting abstract information from your bytes and that's where the encodings kick in. But then whatever is being encoded can also have a format that is applied to it. You have a chain of formats and encodings until you get the information that you want.

Let's see if this helps...
A Unix file is just an array of bits (1/0) the current minimum number of bits in a file is 8, i.e. 1 byte. All file interaction is done at no less than the byte level. On most systems now a days, you don't really have to concern your self with the maximum size of a file. There are still some small variences in Operating Systems, but very few if none have maximum sizes of less that 1 GB.
The encoding or format of a file is only dependent on the applications that use it.
There are many common file formats, such as 'unix ASCII text' and PDF. Most of the files you will come accross will have a documented format specification somewhere on the net. For example the specification of a 'Unix ASCII text file' is:
A collection of ascii characters where each line is terminated by a end of line character. The end of line character is specificed in c++ as std::endl' or the quoted "\n". Unix specifies this character as the binary value - 012(oct) or 00001010.
Hope this helps :)

The determination of how to encode/display something is entirely up to the designer of the program. Of course, there are standards for certain types of files - a PDF or JPG file has a standard format for its content. The definition of both PDF and JPG is quite complex.
Text files have at least somewhat of a standard - but how to interpret or use the contents of a text-file may be just as complex and confusing as JPEG - the only difference is that the content is (some sort of) text, so you can load it up in a text editor and try to make sense of it. But see below for an example line of "text in a database type application".
In C and C++, there is essentially just one distinction, files are either "binary" or "text" ("not-binary"). The difference is about the treatment of "special bits", mostly to do with "endings" - a text file will contain end of line markers, or newlines ('\n') [more in a bit about newlines], and in some operating systems , also contain "end of file marker(s)" - for example in old CP/M, the file was sized in blocks of 128 or 256 bytes. So if we had "Hello, World!\n" in a text file, that file would be 128 bytes long, and the remaining 114 bytes would be "end-of-file" markers. Most modern operating systems track filesize in bytes, so there's no need to have a end-of-file marker in the file. But C supports many operating systems, both new and old, so the language has an allowance for this. End of file is typically CTRL-Z (DOS, Windows, etc) or CTRL-D (Unix - Linux, etc). When the C runtime library hits the end of file character, it will stop reading and give the error code/behaviour, same as if "there is no more file to read here".
Line endings or newlines need special treatment because they are not always the same in the OS that the file is living on. For example, Windows and DOS uses "Carriage Return, Line Feed" (CR, LF - CTRL-M, CTRL-J, ASCII 13 and 10 respectively) as the end of line. In the various forms of Unix, (Linux, MacOS X and BSD for example), the line ending is "Line Feed" (LF, CTRL-J) alone. In older MacOS, the line ending is ONLY "carriage Return." So that you as a programmer don't have to worry about exactly how lines end, the C runtime library will do translation of the "native" line-ending to a standardized line-ending of '\n' (which translates to "Line Feed" or character value 10). Of course, this means that the C runtime library needs to know that "if there is a CR followed by LF, we should just give out an LF character."
For binary files, we really DO NOT want any translation of the data, just because our pixels happen to be the values 13 and 10 next to each other, doesn't mean we want it merged to a single 10 byte, right? And if the code reads a byte of the value 26 (CTRL-Z) or 4 (CTRL-D), we certainly don't want the input to stop there...
Now, if I have a database text file that contains:
10 01353-897617 14000 Mats
You probably have very little idea what that means - I mean you can probably figure out that "Mats" is my name - but it could also be those little cardboard things to go under glasses (aka "Beer-mats") or something to go on the floor, e.g. "Prayer Mats" for Muslims.
The number 10 could be a customer number, article number, "row number" or something like that. 01353-896617 could be just about anything - perhaps my telephone number [no it isn't, but it does indeed resemble it] - but it could also be a "manufacturers part number" or some form of serial number or some such. 14000? Price per item, number of units in stock, my salary [I hope not!], distance in miles from my address to Sydney in Australia [roughly, I think].
I'm sure someone else, not given anything else could come up with hundreds of other answers.
[The truth is that it's just made up nonsense for the purpose of this answer, except for the bit at the beginning of the "phone number", which is a valid UK area code - the point is to explain that "the meaning of a set of fields in a text-file can only be understood if there is something describing the meaning of the fields"]
Of course the same applies to binary files, except that it's often even harder to figure out what the content is, because of the lack of separators - if you didn't have spaces and dashes in the text above, it would be much harder to know what belongs where, right? There are typically no 'spaces' and other such things in a binary file. It's all down to someone's description or definition in some code somewhere, or something like that.
I hope my ramblings here have given you some idea.

Now what determines how the bits (or bytes) in the file should be decoded/encoded and read and displayed/written?
The format of the file, obviously. If you are reading a BMP file, you have to first read the header, then height*width pixel data. If you are reading .txt, just read the characters as-is. Text files can have different encodings, such as Unicode.
Some formats, like .png, are compressed, meaning that their raw data takes more space in memory that the file on disk.
The particular algorithm is chosen depending on various factors. On Windows, it's usually the extension that matters. In web, the content type is dominant.
In general, if you try to read the file in other format, you will usually get garbage. That can be forced sometimes : try opening a .bmp file in your text editor, for example.

So basically we're talking about text files mainly, right?
Now to the point: when your text editor loads the file into memory, from some information it deduces its file encoding (either you tell it or it has a special file format marker among the first few bytes of the file, or whatever). Then it's the program itself that decides how it treats the raw bytes.
For example, if you tell your text editor to open a file as ASCII, it will treat each byte as an individual character, and it will display the character A whenever encounters the number 65 as the current byte to show, etc (because 65 is the ASCII character code for A).
However, if you tell it to open your file as UTF-16, then it will grab two bytes (well, more precisely, two octets) at a time, it will use this so-called "word" as the numeric value to be looked up, and it will, for example, display a ç character when the two bytes it read were corresponding to 231, the Unicode character code of ç.

Related

C++: Problem of Korean alphabet encoding in text file write process with std::ofstream

I have a code for save the log as a text file.
It usually works well, but I found a case where doesn't work:
{Id": "testman", "ip": "192.168.1.1", "target": "?뚯뒪??exe", "desc": "?덈뀞諛⑷??뚯슂"}
My code is a simple logic that saves the log string as a text file.
My code was works well when log is English, but there is a problem when log is Korean language.
After checking through various experiments, it was confirmed that Korean language would not problem if the file could be saved as utf-8 format.
I think, if Korean language is included in log string, c++ is basically saved as ANSI format.
This is my c++ code:
string logfilePath = {path};
log = "{\Id\": \"testman\", \"ip\": \"192.168.1.1\", \"target\": \"테스트.exe\", \"desc\": \"안녕방가워요\"}";
ofstream output(logFilePath, ios::app);
output << log << endl;
output.close();
Is there a way to save log files as uft-8 or any other good way?
Please give me some advice.

You could set UTF-8 in File->Advanced Save Options.
If you do not find it, you could add Advanced Save Options in Tools->Customize->Commands->Add Command..->File.

TDLR: write 0xefbbbf (3-bytes UTF-8 BOM) in the beginning of the file before writing out your string.
One of the hints that text viewer software use to determine if the file should be shown in the Unicode format is something called the Byte Order Marker (or BOM for short). It is basically a series of bytes in the beginning of a stream of text that specifies the encoding and endianness of the text string. For UTF-8 it is these three bytes 0xEF 0xBB 0xBF.
You can experiment with this by opening notepad, writing a single character and saving file in the ANSI format. Then look at the size of file in bytes. It will be 1 byte. Now open the file and save it in UTF-8 and look at the size of file again. It will 4 bytes that is three bytes for the BOM and one byte for the single character you put in there. You can confirm this by viewing both files in some hex editor.
That being said, you may need to insert these bytes to your files before writing your string to them. So why UTF-8? you may ask, well, it depends on the encoding the original string is encoded in (your std::string log) which in this case it is an string literal written in a source file whose encoding is (most likely) UTF-8. Therefor the bytes that build up the string are made according to this encoding and are put into your executable.
note that std::string can contain Unicode string, it just can't make sense of it. For example it reports its length wrong. But it can be used to carry Unicode string around fine.

How does compiling C++ code produce machine code?

I'm studying C++ using the website learncpp.com. Chapter 0.5 states that the purpose of a compiler is to translate human-readable source code to machine-readable machine code, consisting of 1's and 0's.
I've written a short hello-world program and used g++ hello-world.cpp to compile it (I'm using macOS). The result is a.out. It does print "Hello World" just fine, however, when I try to look at a.out in vim/less/Atom/..., I don't see 1's and 0', but rather a lot of this:
H�E�H��X�����H�E�H�}���H��X���H9��
Why are the contents of a.out not just 1's and 0's, as would be expected from machine code?

They are binary bits (1s and 0s) but whatever piece of software you are using to view the file's contents is trying to read them as human readable characters, not as machine code.
If you think about it, everything that you open in a text editor is comprised of binary bits stored on bare metal. Those 1s and 0s can be interpreted in many many different ways, and most text editors will attempt to read them in as characters. Take the character 'A' for example. It's ASCII code is 65 which is 01000001 in binary. When a text editor reads through the file on your computer it is processing those bits as characters rather than machine instructions, and therefore it reads in 8 bits (byte) in the pattern 01000001 it knows that it has just read an 'A'.
This process results in that jumble of symbols you see in the executable file. While some of the content happens to be in the right pattern to make human readable characters, the majority of them will likely be outside of what either the character encoding considers valid or knows how to print, resulting in the '�' that you see.
I won't go into the intricacies of how character encodings work here, but read Character Encodings for Beginners for a bit more info.

What is Eol in text file and normal file?

Now I am quite confused about the end of line character I am working with c++ and I know that text files have a end of line marker which sets the limit for reading a line which a single shifing operator(>>).Data is read continously untill eol character does not apprears and while opening a file in text mode carriage return(CR) is converted into CRLF which is eol marker so if i add white spaces in my text then would it act as eol maker cause it does.
Now i created a normal file i.e. a file without .txt
eg
ifstream("test"); // No .txt
Now what is eol marker in this case

The ".txt" at the end of the filename is just a convention. It's just part of the filename.
It does not signify any magical property of the file, and it certainly doesn't change how the file is handled by your operating system kernel or file system driver.
So, in short, what difference is there? None.
I know that text files have a end of line marker which sets the limit for reading a line which a single shifing operator(>>)
That is incorrect.
Data is read continously untill eol character does not apprears
Also incorrect. Some operating systems (e.g. Windows IIRC) inject an EOF (not EOL!) character into the stream to signify to calling applications that there is no more data to read. Other operating systems don't even do that. But in neither case is there an actual EOF character at the end of the actual file.
while opening a file in text mode carriage return(CR) is converted into CRLF which is eol marker
That conversion may or may not happen and, either way, EOL is not EOF.
if i add white spaces in my text then would it act as eol maker cause it does.
That's a negative, star command.
I'm not sure where you're getting all this stuff from, but you've been heavily mistaught. I suggest a good, peer-reviewed, well-recommended book from Amazon about how computer operating systems work.

When reading strings in C++ using the extraction operator >>, the default is to skip spaces.
If you want the entire line verbatim, use std::getline.
A typical input loop is:
int main(void)
{
std::string text_from_file;
std::ifstream input_file("My_data.txt");
if (!input_file)
{
cerr << "Error opening My_data.txt for reading.\n";
return EXIT_FAILURE;
}
while (input_file >> text_from_file)
{
// Process the variable text_from_file.
}
return EXIT_SUCCESS;
}

A lot of old and mainframe operating systems required a record structure of all data files which, for text files, originated with a Hollerith (punch) card of 80 columns and was faithfully preserved through disk file records, magnetic tapes, output punch card decks, and line printer lines. No line ending was used because the record structure required that every record have 80 columns (and were typically filled with spaces). In later years (1960s+), having variable length records with an 80 column maximum became popular. Today, even OpenVMS still requires the file creator to specify a file format (sequential, indexed, or "stream") and record size (fixed, variable) where the maximum record size must be specified in advance.
In the modern era of computing (which effectively began with Unix) it is widely considered a bad idea to force a structure on data files. Any programmer is free to do that to themselves and there are plenty of record-oriented data formats like compiler/linker object files (.obj, .so, .o, .lib, .exe, etc.), and most media formats (.gif, .tiff, .flv, .mov, mp3, etc.)
For communicating text lines, the paradigm is to target a terminal or printer and for that, line endings should be indicated. Most operating systems environments (except MSDOS and Windows) use the \n character which is encoded in ASCII as a linefeed (ASCII 10) code. MSDOS and ilk use '\r\n' which are encoded as carriage return then linefeed (ASCII 13, 10). There are advantages and disadvantages to both schemes. But text files may also contain other controls, most commonly the ANSI escape sequences which control devices in specific ways:
clear the screen, either in part or all of it
eject a printer page, skip some lines, reverse feed, and other little-used features
establish a scrolling region
change the text color
selecting a font, text weight, page size, etc.
For these operations, line endings are not a concern.
Also, data files encoded in ASCII such as JSON and XML (especially HTML with embedded Javascript), might not have any line endings, especially when the data is obfuscated or compressed.
To answer your questions:
I am quite confused about the end of line character I am working with c++ and I know that text files have a end of line marker
Maybe. Maybe not. From a C or C++ program's viewpoint, writing \n indicates to the runtime environment the end of a line. What the system does with that varies by runtime operating environment. For Unix and Linux, no translation occurs (though writing to a terminal-like device converts to \r\n). In MSDOS, '\n' is translated to \r\n. In OpenVMS, '\n' is removed and that record's size is set. Reading does the inverse translation.
which sets the limit for reading a line which a single shifing operator(>>).
There is no such limit: A program can choose to read data byte-by-byte if it wants as well as ignore the line boundaries.
The "shifting operators" are overloaded for filestreams to input or output data but are not related to bit twiddling shifts. These operators were chosen for visual approximation of input/output and due to their low operator precedence.
Data is read continously untill eol character does not apprears
This bit is confusing: I think you meant until eol character appears, which is indeed how the line-oriented functions gets() and fgets() work.
and while opening a file in text mode carriage return(CR) is converted into CRLF which is eol marker so if i add white spaces in my text then would it act as eol maker cause it does.
Opening the file does not convert anything, but reading from a file might. However, no environment (that I know of) converts input to CR LF. MSDOS converts CR LF on input to \n.
Adding spaces has no effect on end of lines, end of file, or anything. Spaces are just data. However, the C++ streaming operations reading/writing numbers and some other datatypes use whitespace (a sequence of spaces, horizontal tabs, vertical tabs, form feed, and maybe some others) as a delimiter. This convenience feature may cause some confusion.
Now i created a normal file i.e. a file without .txt eg
ifstream("test"); \No .txt
Now what is eol marker in this case
The filename does not determine the file type. In fact, file.txt may not be a text file at all. Using a particular file extension is convenient for humans to communicate a file's purpose, but it is not obligatory.

Converting WAV file audio input into plain ASCII characters

I am working on a project where we need to convert WAV file audio input into plain ASCII characters. The input WAV file will contain a single short alphanumeric code e.g. asdrty543 and each character will be pronounced one by one when you play the WAV file. Our requirement is that when a single character code is pronounced we need to convert it into it's equivalent ASCII code. The implementation will be done in C/C++ as un-managed Win32 DLL. We are open to use third party libraries. I am already googling for directions. However, I will really appreciate it if I can get directions/pointers from an experienced programmer who has already worked on similar requirement. Thank you in advance for your help.

ASCII characters like Az09 are only a portion of the ASCII Table. WAV files like any other file is stored and accessed in bytes.
1 byte has 256 different values. Therefore one can't simply convert bytes into Az09 since there are not enough Az09 characters.
You'll have to find a library which opens WAV files and creates the wave format for you. In relation to the wave's intensity and length, a chain of Az or Az09 characters can be produced.
I believe you're trying to convert the wave to a series of notes. That's possible too, using the same approach.

How to identify compressed/uncompressed bit groups?

I'm using a static dictionary file with some words and values for this words. This values are not fixed sized, for example the is 1, love is 01, kill is 101 etc. When I try to compress a group of words, I traverse every word and look up to dictionary if a value exists for that word. If one exists I change the word with the value, if it doesn't exist I encode the word as bytes. After compression I got a chunk of bits, and because these dictionary values and uncompressed words are not fixed sized I can not group the bits and decode them.
I have thought about using 1 bit flag for every group of bits to determine it is compressed or uncompressed, but I can't detect the flag bit because of this unknown length of a codeword or regular word.
If I use a 1 byte delimiter, it still has problems. Let's say my delimiter is 00000000, and before the delimiter I have 100 and after delimiter I have 001, so we have 10000000000001, how am I supposed to know that which group of these bits are my delimiter?
Can I use some other method to group these compressed/uncompressed bits to decode them? Thank you.

First off,what language and system are you intending to deploy this? Many languages provide their own libraries and tools for compression and may suite your needs without major low-level design effors.
The answer here is to establish some more rigorous bookkeeping and file formatting to be able to undo the compression. Most compression systems have some amount of overhead in their file format which is why when you compress something twice you don't necessarily save anything and can actually increase the size of the file.
Often files take advantage of header at the start of a file to provide key information. which would be a good place to define any rules that are specific to the compressed file.
create fixed size delimiter to use between code words only. This can be determined after analyzing the file but before actually writing out the compressed data.
If you generate your delimiter rather than a fixed known value, include this as one of your header items.
keep your header a simple ascii format so that you can easily extract it with standard tools like sscanf and fscanf.
if you want to have a header that can contain extra information you may need a consistent way to tell where the header ends and the data begins. Including something to the effect of "ENDHEADER" should be enough and still easily identifiable.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js