The expression "binary=True" in In embedding using word2vec - word2vec

What does the expression "binary=True" mean, and what is it used for, in the following line of code:
w2vmodel = gensim.models.KeyedVectors.load_word2vec_format(
    'models/GoogleNews-vectors-negative300.bin.gz',
    binary=True  # <-- this
)

The format written by Google's original word2vec.c program had an option to write in either plain text or binary. (Essentially, one wrote the floating-point values as human-readable decimal strings, and the other as packed 4-byte binary representations, which look like line noise or strange characters if viewed as text.)
If you want to read such a file that was written in binary mode, you need to specify binary=True, or else the file format will be misinterpreted, likely failing with errors. There are no other differences in later behavior once the data has been successfully read.
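For example, a quick sketch of both cases (the file path here is just illustrative; point it at wherever your copy of the vectors lives):

from gensim.models import KeyedVectors

# The GoogleNews vectors are distributed in the binary word2vec format,
# so binary=True is required when loading them.
w2v = KeyedVectors.load_word2vec_format(
    'models/GoogleNews-vectors-negative300.bin.gz',
    binary=True,
)

# A plain-text word2vec file (e.g. one written with binary=False) would
# instead be loaded with binary=False:
# w2v_text = KeyedVectors.load_word2vec_format('vectors.txt', binary=False)

print(w2v['king'][:5])  # first few dimensions of the vector for "king"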

Related

How does compiling C++ code produce machine code?

I'm studying C++ using the website learncpp.com. Chapter 0.5 states that the purpose of a compiler is to translate human-readable source code to machine-readable machine code, consisting of 1's and 0's.
I've written a short hello-world program and used g++ hello-world.cpp to compile it (I'm using macOS). The result is a.out. It does print "Hello World" just fine; however, when I try to look at a.out in vim/less/Atom/..., I don't see 1's and 0's, but rather a lot of this:
H�E�H��X�����H�E�H�}���H��X���H9��
Why are the contents of a.out not just 1's and 0's, as would be expected from machine code?
They are binary bits (1s and 0s) but whatever piece of software you are using to view the file's contents is trying to read them as human readable characters, not as machine code.
If you think about it, everything that you open in a text editor is comprised of binary bits stored on bare metal. Those 1s and 0s can be interpreted in many, many different ways, and most text editors will attempt to read them in as characters. Take the character 'A' for example. Its ASCII code is 65, which is 01000001 in binary. When a text editor reads through the file on your computer, it processes those bits as characters rather than machine instructions, so when it reads in 8 bits (a byte) in the pattern 01000001 it knows that it has just read an 'A'.
This process results in the jumble of symbols you see in the executable file. While some of the content happens to be in the right pattern to make human-readable characters, most of it will likely fall outside what the character encoding considers valid or knows how to print, resulting in the '�' that you see.
I won't go into the intricacies of how character encodings work here, but read Character Encodings for Beginners for a bit more info.
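If you want to see the actual numbers instead of the editor's attempt to render them as characters, you can dump the first few bytes yourself. A small Python sketch (a.out is assumed to be the executable produced by g++ above):

# Read the first bytes of the compiled executable and show them three ways.
with open('a.out', 'rb') as f:
    chunk = f.read(16)

print(list(chunk))   # the same bits, shown as decimal byte values
print(chunk.hex())   # ...and as hexadecimal
print(chunk.decode('utf-8', errors='replace'))  # roughly what a text viewer does; bytes that aren't valid UTF-8 become '�'

On macOS the first few bytes are typically the Mach-O magic number rather than anything resembling text, which is one reason the editor shows gibberish.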

Writing a mix of ASCII and binary data in Fortran

I'm trying to write a mix of ASCII and binary data, as given below, for the VTK file format.
I understand that the binary-or-ASCII distinction must be made in the file OPEN statement (with FORM='BINARY', or preferably ACCESS='STREAM'). What I don't understand is how to write the file in the format I require.
What I'm trying to output:
ascii keyword
ascii keyword
ascii keyword
ascii keyword
ascii keywords "variable value in ascii" ascii keywords
.....SOME BINARY DATA ....
.....................
What I'm using:
write(fl) "# vtk DataFile Version 3.0"//CHAR(13)//CHAR(10)
write(fl)"Flow Field"//CHAR(13)//CHAR(10)
write(fl)"BINARY"//CHAR(13)//CHAR(10)
write(fl)"DATASET UNSTRUCTURED_GRID"//CHAR(13)//CHAR(10)
write(fl)"POINTS",npoints,"float" -------------> gives value of npoints(example:8) in binary format
What the output should be:
# vtk DataFile Version 3.0
Flow Field
BINARY
DATASET UNSTRUCTURED_GRID
POINTS 8 Float
.....SOME BINARY DATA ....
.....................
What the output is:
# vtk DataFile Version 3.0
Flow Field
BINARY
DATASET UNSTRUCTURED_GRID
POINTSÒ^O^#^#float
.....SOME BINARY DATA ....
...................
Firstly, you will find examples of writing VTK files on the internet, for example in the questions binary vtk for Rectilinear_grid from fortran code can not worked by paraview and Binary VTK for RECTILINEAR_GRID from fortran code, in various open source research codes like https://bitbucket.org/LadaF/elmm/src/866794b5f95ec93351b0edea47e52af8eadeceb5/src/simplevtk.f90?at=master&fileviewer=file-view-default (this one is my simplified example; there are many more), or in dedicated libraries like http://people.sc.fsu.edu/~jburkardt/f_src/vtk_io/vtk_io.html (there is also the VTKFortran library for the XML VTK files).
Secondly, even though you are on Windows, you should not use the Windows line-ending convention in VTK binary files. End your lines with just achar(10) (or the value returned by the new_line() intrinsic). And don't forget that the binary data must be big-endian. There are examples of how to deal with that in the links above.
Thirdly, on how to put an integer number into a string, we have a huge number of duplicate questions. I mean really huge. Start with Convert integers to strings to create output filenames at run time, and I will shamelessly recommend my itoa function there, because it will simplify your code a lot:
write(fl)"POINTS ",itoa(npoints)," float"
I would replace
write(fl)"POINTS",npoints,"float"
with
BLOCK
  integer, parameter :: big_enough = 132 ! Or whatever
  character(big_enough) :: line
  write(line,'(*(g0))') "POINTS ", npoints, " Float"//achar(13)//achar(10)
  write(fl) trim(line)
END BLOCK

In Python 2 this is OK, but in Python 3 it doesn't work

#!/usr/bin/env python3
f = open('dv.bmp', mode='rb')
slika = f.read()
f.close()
pic = slika[:28]
slika = slika[54:]
# dimensions of the original bitmap
pic_w = ord(pic[18]) + ord(pic[19])*256
pic_h = ord(pic[22]) + ord(pic[23])*256
print(pic_w, pic_h)
Why does this code not work in Python 3 (in Python 2 it works fine)? Or:
how do I read a binary file into a string type in Python 3?
In Python 2.x, binary mode (e.g. 'rb') only affects how Python interprets end-of-line characters:
On Windows, 'b' appended to the mode opens the file in binary mode, so
there are also modes like 'rb', 'wb', and 'r+b'. Python on Windows
makes a distinction between text and binary files; the end-of-line
characters in text files are automatically altered slightly when data
is read or written. This behind-the-scenes modification to file data
is fine for ASCII text files, but it’ll corrupt binary data like that
in JPEG or EXE files. Be very careful to use binary mode when reading
and writing such files. On Unix, it doesn’t hurt to append a 'b' to
the mode, so you can use it platform-independently for all binary
files.
However in Python 3.x, binary mode also changes the type of the resulting data:
Normally, files are opened in text mode, that means, you read and
write strings from and to the file, which are encoded in a specific
encoding. If encoding is not specified, the default is platform
dependent (see open()). 'b' appended to the mode opens the file in
binary mode: now the data is read and written in the form of bytes
objects. This mode should be used for all files that don’t contain
text.
Since the read results in a bytes object, indexing it results in an integer, not a one-character string as in Python 2. Passing that integer to the ord() function raises the error mentioned in your comment.
The solution is just to omit the ord() call in Python 3, since the integer you get from indexing the bytes object is the same as what you'd get from calling ord() on the string equivalent.
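In other words, a minimal Python 3 version of the snippet from the question (same dv.bmp file, same header offsets):

# Python 3: indexing a bytes object already gives an int, so ord() is unnecessary.
with open('dv.bmp', mode='rb') as f:
    slika = f.read()

pic = slika[:28]
pic_w = pic[18] + pic[19] * 256
pic_h = pic[22] + pic[23] * 256
print(pic_w, pic_h)

(The width and height fields in a BMP header are actually four bytes each; int.from_bytes(slika[18:22], 'little') would read the whole field, but the two-byte arithmetic above mirrors the original code.)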

How can I know if I'm reading a binary or a text file

My program has different options: you can read a binary file or a text file, but you can select the binary file option and then choose a text file... How can I detect that an incorrect file has been supplied while I'm doing this:
while(fich.read((char *)&struct,sizeof(struct)))
How can I detect that an incorrect file has been supplied while I'm doing this?
The simple answer is: You cannot.
It's impossible to reliably distinguish plain (let's say ASCII-encoded) text files from binary files.
Any initial byte sequence read from the file might be valid for both.
The silly but common solutions for this problem are:
give your file name an extension that implies a particular format
let your file have a magic byte sequence (1-2 bytes) at the beginning that implies a particular format (see the sketch below)
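As a rough illustration of the second idea, a Python sketch (the two-byte marker here is invented for the example, not any real standard):

MAGIC = b'\xab\xcd'  # made-up marker that we write at the start of our own binary files

def looks_like_our_binary(path):
    # True only if the file starts with our magic marker. A text file is very
    # unlikely to begin with these two bytes, but this is a convention, not a guarantee.
    with open(path, 'rb') as f:
        return f.read(len(MAGIC)) == MAGIC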

Understanding the concept of 'File encoding'

I have already gone through some material on the web and on SOF explaining 'file encoding', but I still have questions. A file is a group of related records, and on disk its contents are stored as just 1's and 0's. Every time a running program wants to read from or write to a file, the file is brought into RAM and placed into the address space of the running program (aka process). Now what determines how the bits (or bytes) in the file should be decoded/encoded and read and displayed/written?
There is one explanation on SOF which reads 'At the storage level, a file contains an array of bytes. On top of this you have the encoding layer for text files. The format layer comes last, on top of the encoding layer for text files or on top of the array of bytes for all the other binary files'. I am sort of fine with this but would like to know if it is 100% correct.
The question basically came up when understanding file opening modes in C++.
I think the description of the order of the layers is confusing here. I would consider formats and encodings to be related, but not tied together so tightly. Let's try to define it formally.
A file is a contiguous sequence of bytes. A byte is a contiguous sequence of bits.
A symbol is a unit of data. Bytes are one kind of symbol. There are other symbols that are not bytes. Consider the number 6 - it is a symbol but not a byte. It can however be encoded as a byte, commonly as 00000110 (this is the two's complement encoding of 6).
An encoding maps a set of symbols to another set of symbols. Most commonly, it maps from a set of non-byte symbols to bytes, which when applied to an entire file makes it a file encoding. Two's complement gives a representation of the numeric values. On the other hand, ASCII, for example, gives a representation of the Latin alphabet and related characters in bytes. If you take ASCII and apply it to a string of text, say "Hello, World!", you get a sequence of bytes. If you store this sequence of bytes as a file, you have a file encoded as ASCII.
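To make that concrete, a quick check in Python:

data = "Hello, World!".encode("ascii")   # apply the ASCII encoding to the text
print(list(data))            # [72, 101, 108, 108, 111, 44, 32, 87, 111, 114, 108, 100, 33]
print(data.decode("ascii"))  # decoding the same bytes recovers the text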
A format describes a set of valid sequences of symbols. When applied to the bytes of a file, it is a file format. An example is the BMP file format for storing raster graphics. It specifies that there must be a few bytes at the beginning that identify the file format as BMP, followed by a few bytes to describe the size and depth of the image, and so on. An example of a format that is not a file format would be how we write decimal numbers in English. The basic format is a sequence of numerical characters followed by an optional decimal point with more numerical characters.
Text Files
A text file is a kind of file that has a very simple format. Its format is very simple because it has no structure. It immediately begins with some encoding of a character and ends with the encoding of the final character. There's usually no header or footer or metadata or anything like that. You just start interpreting the bytes as characters right from the beginning.
But how do you interpret the characters in the file? That's where the encoding comes in. If the file is encoded as ASCII, the byte 01000001 represents the Latin letter A. There are much more complicated encodings, such as UTF-8. In UTF-8, a character cannot necessarily be represented in a single byte. Some can, some can't. You determine the number of bytes to interpret as a character from the first few bits of the first byte.
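For instance, a small check you can run yourself:

print(len("A".encode("utf-8")))   # 1 byte: ASCII characters keep their single-byte form in UTF-8
print(len("é".encode("utf-8")))   # 2 bytes
print(len("€".encode("utf-8")))   # 3 bytes: the leading bits of the first byte say how many bytes follow
print("é".encode("utf-8"))        # b'\xc3\xa9'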
When you open a file in your favourite text editor, how does it know how to interpret the bytes? Well that's an interesting problem. The text editor has to determine the encoding of the file. It can attempt to do this in many ways. Sometimes the file name gives a hint through its extension (.txt is likely to be at least ASCII compatible). Sometimes the first character of the file gives a good hint as to what the encoding is. Most text editors will, however, give you the option to specify which encoding to treat the file as.
A text file can have a format. Often the format is entirely independent of the encoding of the text. That is, the format doesn't describe the valid sequences of bytes at all. It instead describes the valid sequences of characters. For example, HTML is a format for text files for marking up documents. It describes the sequences of characters that determine the contents of a document (note: not the sequences of bytes). As an example, it says that the sequence of characters <html> is an opening tag and must be followed at some point by the closing tag </html>. Of course, the format is much more detailed than this.
Binary Files
A binary file is a file with meaning determined by its file format. The file format describes the valid sequences of bytes within the file and the meaning that that sequence has. It is not some interpretation of the bytes that matters at the file format level - it is the order and arrangement of bytes.
As described above, the BMP file format gives a way of storing raster graphics. It says that the first two bytes must be 01000010 01001101, the next four bytes must give the size of the file as a count of the number of bytes, and so on, leading up to the actual pixel data.
A binary file can have encodings within it. To illustrate this, consider the previous example. I said that the four bytes following the first two in a BMP file give the size of the file in bytes. How are those bytes interpreted? The BMP file format states that those bytes give the size as an unsigned integer. This is the encoding of those bytes.
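As a rough sketch of reading those first fields (any real BMP file will do; image.bmp is just a placeholder name):

import struct

with open('image.bmp', 'rb') as f:
    header = f.read(6)

magic = header[:2]                        # the format says these must be b'BM' (01000010 01001101)
size, = struct.unpack('<I', header[2:6])  # next four bytes: file size, encoded as an unsigned little-endian integer
print(magic, size)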
So when you browse the directories on your computer for a BMP file and open it, how does your system know how to open it? How does it know which program to use to view it? The format of a binary file is much more strongly hinted by the file extension than the encoding of a text file. If the filename has .bmp at the end, your system will likely consider it to be a BMP file and just open it in whatever graphics program you have. It may also look at the first few bytes and see what they suggest.
Summary
The first level of understanding the meaning of bytes in a file is that file's format. A text file has an extremely simple format - start at the beginning, interpreting characters until you reach the end. How you interpret the characters depends on that text file's character encoding. Most formats are more complicated, however, and will likely have encodings nested within them. At some level you have to start extracting abstract information from your bytes and that's where the encodings kick in. But then whatever is being encoded can also have a format that is applied to it. You have a chain of formats and encodings until you get the information that you want.
Let's see if this helps...
A Unix file is just an array of bits (1/0); the minimum number of bits in a file is 8, i.e. 1 byte. All file interaction is done at no less than the byte level. On most systems nowadays, you don't really have to concern yourself with the maximum size of a file. There are still some small variances between operating systems, but few if any have maximum file sizes of less than 1 GB.
The encoding or format of a file is only dependent on the applications that use it.
There are many common file formats, such as 'Unix ASCII text' and PDF. Most of the files you will come across will have a documented format specification somewhere on the net. For example, the specification of a 'Unix ASCII text file' is:
A collection of ASCII characters where each line is terminated by an end-of-line character. The end-of-line character is specified in C++ as std::endl or the quoted "\n". Unix specifies this character as the binary value 012 (octal), or 00001010.
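A quick check of that value (0o12 is just the octal spelling of 10):

print(ord('\n'))         # 10
print(oct(ord('\n')))    # 0o12
print(bin(ord('\n')))    # 0b1010, i.e. 00001010 as a byte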
Hope this helps :)
The determination of how to encode/display something is entirely up to the designer of the program. Of course, there are standards for certain types of files - a PDF or JPG file has a standard format for its content. The definition of both PDF and JPG is quite complex.
Text files have at least somewhat of a standard - but how to interpret or use the contents of a text-file may be just as complex and confusing as JPEG - the only difference is that the content is (some sort of) text, so you can load it up in a text editor and try to make sense of it. But see below for an example line of "text in a database type application".
In C and C++, there is essentially just one distinction: files are either "binary" or "text" ("not binary"). The difference is about the treatment of "special bytes", mostly to do with "endings" - a text file will contain end-of-line markers, or newlines ('\n') [more in a bit about newlines], and in some operating systems also "end-of-file marker(s)" - for example in old CP/M, the file was sized in blocks of 128 or 256 bytes. So if we had "Hello, World!\n" in a text file, that file would be 128 bytes long, and the remaining 114 bytes would be "end-of-file" markers. Most modern operating systems track file size in bytes, so there's no need to have an end-of-file marker in the file. But C supports many operating systems, both new and old, so the language makes an allowance for this. End-of-file is typically CTRL-Z (DOS, Windows, etc.) or CTRL-D (Unix - Linux, etc.). When the C runtime library hits the end-of-file character, it stops reading and gives the same error code/behaviour as if "there is no more file to read here".
Line endings or newlines need special treatment because they are not always the same in the OS that the file lives on. For example, Windows and DOS use "Carriage Return, Line Feed" (CR, LF - CTRL-M, CTRL-J, ASCII 13 and 10 respectively) as the end of line. In the various forms of Unix (Linux, MacOS X and BSD for example), the line ending is "Line Feed" (LF, CTRL-J) alone. In older MacOS, the line ending is ONLY "Carriage Return". So that you as a programmer don't have to worry about exactly how lines end, the C runtime library will translate the "native" line ending to a standardized line ending of '\n' (which translates to "Line Feed", or character value 10). Of course, this means that the C runtime library needs to know that "if there is a CR followed by an LF, we should just give out an LF character".
For binary files, we really DO NOT want any translation of the data: just because our pixels happen to have the values 13 and 10 next to each other doesn't mean we want them merged into a single 10 byte, right? And if the code reads a byte of the value 26 (CTRL-Z) or 4 (CTRL-D), we certainly don't want the input to stop there...
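Python's I/O layer makes the same text/binary distinction, which makes the translation easy to see. A small sketch (demo.txt is a throwaway file name):

# Write Windows-style line endings, then read them back in text and binary mode.
with open('demo.txt', 'wb') as f:
    f.write(b'Hello, World!\r\nsecond line\r\n')

with open('demo.txt', 'r') as f:   # text mode: CR LF is translated to '\n' on read
    print(repr(f.read()))          # 'Hello, World!\nsecond line\n'

with open('demo.txt', 'rb') as f:  # binary mode: the bytes come back untouched
    print(f.read())                # b'Hello, World!\r\nsecond line\r\n'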
Now, if I have a database text file that contains:
10 01353-897617 14000 Mats
You probably have very little idea what that means - I mean you can probably figure out that "Mats" is my name - but it could also be those little cardboard things to go under glasses (aka "Beer-mats") or something to go on the floor, e.g. "Prayer Mats" for Muslims.
The number 10 could be a customer number, article number, "row number" or something like that. 01353-897617 could be just about anything - perhaps my telephone number [no it isn't, but it does indeed resemble it] - but it could also be a "manufacturer's part number" or some form of serial number or some such. 14000? Price per item, number of units in stock, my salary [I hope not!], distance in miles from my address to Sydney in Australia [roughly, I think].
I'm sure someone else, not given anything else could come up with hundreds of other answers.
[The truth is that it's just made up nonsense for the purpose of this answer, except for the bit at the beginning of the "phone number", which is a valid UK area code - the point is to explain that "the meaning of a set of fields in a text-file can only be understood if there is something describing the meaning of the fields"]
Of course the same applies to binary files, except that it's often even harder to figure out what the content is, because of the lack of separators - if you didn't have spaces and dashes in the text above, it would be much harder to know what belongs where, right? There are typically no 'spaces' and other such things in a binary file. It's all down to someone's description or definition in some code somewhere, or something like that.
I hope my ramblings here have given you some idea.
Now what determines how the bits (or bytes) in the file should be decoded/encoded and read and displayed/written?
The format of the file, obviously. If you are reading a BMP file, you have to read the header first, then height*width pixel data. If you are reading a .txt file, just read the characters as-is. Text files can use different character encodings, such as the Unicode encodings.
Some formats, like .png, are compressed, meaning that their raw data takes more space in memory than the file takes on disk.
The particular algorithm is chosen depending on various factors. On Windows, it's usually the file extension that matters. On the web, the content type is dominant.
In general, if you try to read a file in the wrong format, you will usually get garbage. That can sometimes be forced: try opening a .bmp file in your text editor, for example.
So basically we're talking about text files mainly, right?
Now to the point: when your text editor loads the file into memory, it deduces the file's encoding from some information (either you tell it, or there is a special marker among the first few bytes of the file, or whatever). Then it's the program itself that decides how it treats the raw bytes.
For example, if you tell your text editor to open a file as ASCII, it will treat each byte as an individual character, and it will display the character A whenever it encounters the number 65 as the current byte to show, etc. (because 65 is the ASCII character code for A).
However, if you tell it to open your file as UTF-16, then it will grab two bytes (well, more precisely, two octets) at a time, use this so-called "word" as the numeric value to be looked up, and it will, for example, display a ç character when the two bytes it read corresponded to 231, the Unicode character code of ç.
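To see both interpretations side by side, a small Python check (the byte values match the examples above):

print(bytes([65]).decode('ascii'))              # 'A'  - one byte, ASCII code 65
print(bytes([0xE7, 0x00]).decode('utf-16-le'))  # 'ç'  - two octets read as one 16-bit value, 231
print(int.from_bytes(bytes([0xE7, 0x00]), 'little'))  # 231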