Word Length as defined in the .ZIP format specification - compression

So I've been reading through PKWARE's specification of the .zip file format and have noticed that in several places they give block sizes in terms of words (the processor word, not the dictionary word :-) ).
Now, the way I understand it, the byte size of a word is specific to a certain processor family. So if a file was zipped on an i386 and then unzipped on an x86-64 machine, the two architectures would have different definitions of a word (4 bytes vs. 8 bytes) and would therefore interpret the block data differently.
Am I missing something here? Or do the folks at PKWARE simply assume that 1 word = 4 bytes? That seems like the most likely option to me - I've checked some zip files with a hex editor and the 4-byte definition would fit, but I'd like some confirmation because it's not like I have a whole bunch of different processors to test with :)
Thanks in advance, and sorry if the question already exists - I did try searching but it's a little difficult because the word "word" is so ambiguous (see what I mean?)

Where the specification says "word" for a stored block in the deflate format, it means two bytes (in LSB order).
For zip decryption (where said encryption should not be used since it's so weak), again a word means two bytes.
When it talks about a general purpose flag word under imploding, it again means two bytes.
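For instance, the LEN and NLEN fields of a stored (uncompressed) deflate block are exactly such two-byte, LSB-first words. A minimal sketch of reading one (the function name is just for illustration):

#include <cstdint>

// Read a two-byte "word" stored least-significant byte first,
// e.g. the LEN/NLEN fields of a stored deflate block.
std::uint16_t read_word_lsb(const unsigned char* p)
{
    return static_cast<std::uint16_t>(p[0] | (p[1] << 8));
}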

Related

What's the best way to store binary

I've recently implemented Huffman compression in C++. If I were to store the results as binary, it would take up a lot more space, as each 1 and 0 is a character. Alternatively, I was thinking maybe I could break the binary into sections of 8 and put characters in the text file, but that would be kinda annoying (so hopefully that can be avoided). My question here is: what is the best way to store binary in a text file in terms of character efficiency?
[To recap the comments...]
My question here is what is the best way to store binary in a text file in terms of character efficiency?
If you can store the data as-is, then do so (in other words, do not use any encoding; simply save the raw bytes).
If you need to store the data within a text file (for instance as a paragraph or as a quoted string), then you have many ways of doing so. For instance, base64 is a very common one, but there are many others.
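If the Huffman output currently exists as a string of '0'/'1' characters, packing each group of eight bits into one byte (the "sections of 8" idea from the question) is straightforward. A rough sketch, assuming MSB-first bit order and a zero-padded final byte (a real format would also need to record the total bit count somewhere):

#include <cstdint>
#include <string>
#include <vector>

// Pack a string of '0'/'1' characters into raw bytes, most significant bit first.
std::vector<std::uint8_t> pack_bits(const std::string& bits)
{
    std::vector<std::uint8_t> out((bits.size() + 7) / 8, 0);
    for (std::size_t i = 0; i < bits.size(); ++i)
        if (bits[i] == '1')
            out[i / 8] |= static_cast<std::uint8_t>(0x80 >> (i % 8));
    return out;
}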

How do Binary Files work? (From C++'s point of view) [closed]

I have some misunderstandings about binary files. I don't understand what a binary file is. I know text files are also binary files, but they need to be parsed in order to extract information. Unlike text files, binary files with the same contents look different: for example, when storing my name, "Rishabh", in a binary file, it not only stores Rishabh in that file but adds some extra unreadable characters. What are they? Why doesn't it only store characters like a text file? And what are binary file formats, e.g. .3d, .zip, .mp3, etc.? From my knowledge of text files, the format extension specifies what the format is or how to process that file, like .dae, .xml, .htm, etc. These contain tags to store data. But what about binary files? They don't need any tags, because the data is stored like variables in the file, from which we have to copy the contents into the program's variables (I mean to say it's like it is stored in memory). So why are these binary file formats different? Why can't a single program read all the contents of any file, which are unknown to the world and to me? And what is binary file format cracking?
All files have some kind of pre-determined encoding, since computers can't store anything but bit patterns in bytes on disk. A text file contains only the encodings for printable characters plus space, and a few other encodings for end-of-line, tab, and maybe form feed and a few others related to character display on a device. Because the encoding in a text file is a well-known standard, and is quite common, there are functions in most, if not all, languages to deal specifically with that type of file. Most importantly, they know how to read a line at a time - they recognize line-terminator character(s).
If, however, you type the characters of your name in some other program besides a text editor - say you write them using the text tool in Gimp or Microsoft Paint - and then save it, the program has to save more information than just your name. Your name has a position on a canvas that must be saved. It also has a font and a size, and whether it is bold or italic or underlined, that need to be saved. The size of the canvas needs to be saved. The color being used, even if just black and white, needs to be saved. This encoding will be different from the encoding used to save the letters of your name. So if you edit the file with a text editor, you will see some gibberish, since the text editor is expecting character encoding and knows nothing about the encoding Gimp uses for fonts, font sizes, x,y positions, etc.
C++ compilers are not written with routines to understand any binary file encodings. The routines for reading/writing binary files in C++ will just read and write sequences of bytes. Although, since the fundamental type that holds a byte of data in C++ is a char (or unsigned char), you will see binary prototypes like
ostream& write ( const char * buffer, streamsize size );
istream& read ( char * buffer, streamsize size );
But the char pointer in this case should be considered as a "byte *" since the read/write functions are just moving bytes of data from/to disk or memory without any regard for character encodings.
C++ read/write routines don't know, or care what the format or encoding is for the bytes they are moving. So it is left up to the programmer to write code to process or handle these bytes according to the pre-defined format for the file. However, the routines written to process a specific format of binary file can be compiled into a library that can then be shared or sold, and used by many C++ programmers. For example, LibXL can be used to read the binary format of Excel files from a C++ program.
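As a minimal sketch of what those routines do (the file name is arbitrary), writing a handful of raw bytes and reading them back looks like this - the stream neither knows nor cares what the bytes mean:

#include <fstream>
#include <vector>

int main()
{
    std::vector<char> bytes = {0x01, 0x02, 0x7F};

    // Write the bytes exactly as they are in memory.
    std::ofstream out("data.bin", std::ios::binary);
    out.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
    out.close();

    // Read them back into another buffer, byte for byte.
    std::vector<char> back(bytes.size());
    std::ifstream in("data.bin", std::ios::binary);
    in.read(back.data(), static_cast<std::streamsize>(back.size()));
}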
From the perspective of C/C++, the only difference between text and binary files is how line endings are handled.
If you open a file in binary mode, then read reads exactly the bytes in the file, and write writes exactly the bytes which are in memory.
If you open a file in text mode, then whatever character or character sequence is conventionally used to represent the end of a line in a file is transformed into some single character (which is written in the source code as '\n', although it is only one character) when the file is read, and the \n is transformed into the conventional end-of-line character or sequence when the file is written to. Also, it is not technically legal for the file to not end with an end-of-line sequence, and there may be a limit to the length of a line.
In Unix, the two modes are identical, because \n is a representation of the character code 10 (0A in hex), and that is precisely the conventional line-ending character. In Windows, by contrast, the conventional line-ending sequence is two bytes long -- {13,10} or {0D,0A}. \n is still 0A, so effectively the 0D preceding the 0A is deleted from the data read from the file, and an 0D is inserted before every 0A when data is written to the file.
Some (much) older operating systems had no conventional line-ending character. Instead, all lines were padded with space characters to the exact same length, making it possible to directly seek to a specific line number. C libraries working in text mode would typically read exactly the line length, and then delete the trailing spaces (if any) and finally add the code corresponding to \n (some such systems used EBCDIC instead of ASCII, so \n was a different integer value). Writing the data out, the \n would be deleted and replaced with exactly the correct number of spaces to bring the line to the standard length. Fortunately, those of us who don't work in a computing museum don't have to deal with that stuff any more, and Apple abandoned its use of 0D as the line-end character with the advent of OSX, so the text/binary difference is now limited to Windows.
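As an illustration of that translation (this sketch only shows a difference on Windows; on Unix both files come out identical at three bytes): writing the same three characters in text mode and in binary mode gives a four-byte file in the first case, because '\n' is expanded to the CR LF sequence.

#include <fstream>

int main()
{
    std::ofstream text("text.txt");                  // text mode (the default)
    text << "a\nb";                                  // on Windows: 'a', 0x0D, 0x0A, 'b'

    std::ofstream bin("bin.txt", std::ios::binary);  // binary mode
    bin << "a\nb";                                   // always: 'a', 0x0A, 'b'
}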
Technically, text files are binary, as all files are binary files really. Text files tend to store only the text characters, while binary files can store any conceivable value - numbers, images, text, etc. Numbers, for example, are not stored in decimal notation like "1234"; they are stored in binary, using 0s and 1s only. There are a few ways to do this (depending on your operating system), so the same number could look like a different set of 0s and 1s, e.g. 0001110101011. If you open a binary file in Notepad, it tries to display everything as text, and what you see is some garbage, which is the other data represented in binary.
Cracking a binary file format means knowing exactly what information is stored in each byte of the file... sometimes text, numbers, arrays, classes, structures... anything really. Given experience, one could slowly work out what is what, but that's pretty advanced stuff!
Sometimes the information (the format) is freely available and easy to follow, or a nightmare to follow, like the format for an MS Word document. (The MS Word format is freely available, but reputed to be insanely complicated due to backwards compatibility... Nonetheless, having the format documentation allows you to 'crack' the binary file format and know exactly what all the binary represents.)
It's one of the fundamentals of a computer system.
Probably a great explanation in this link
http://www.cs.umd.edu/class/sum2003/cmsc311/Notes/BitOp/asciiBin.html
Some text quoted:
Although ASCII files are binary files, some people treat them as different kinds of files. I like to think of ASCII files as special kinds of binary files. They're binary files where each byte is written in ASCII code.
A full, general binary file has no such restrictions. Any of the 256 bit patterns can be used in any byte of a binary file.
We work with binary files all the time. Executables, object files, image files, sound files, and many file formats are binary files. What makes them binary is merely the fact that each byte of a binary file can be one of 256 bit patterns. They're not restricted to the ASCII codes.

Encoding binary data using string class

I am going through one of the requirements for string implementations as part of a study project.
Let us assume that the standard library did not exist and we were
forced to design our own string class. What functionality would it
support and what limitations would we improve? Let us consider the
following factors.
Does binary data need to be encoded?
Is multi-byte character encoding acceptable or is unicode necessary?
Can C-style functions be used to provide some of the needed functionality?
What kind of insertion and extraction operations are required?
My questions on the above text:
What does the author mean by "Does binary data need to be encoded?" Please explain with an example, and how we can implement it.
What does the author mean by point 2? Please explain with an example, and how we can implement it.
Thanks for your time and help.
Regarding point one, "Binary data" refers to sequences of bytes, where "bytes" almost always means eight-bit words. In the olden days, most systems were based on ASCII, which requires seven bits (or eight, depending on who you ask). There was, therefore, no need to distinguish between bytes and characters. These days, we're more friendly to non-English speakers, and so we have to deal with Unicode (among other codesets). This raises the problem that string types need to deal with the fact that bytes and characters are no longer the same thing.
This segues into point two, which is about how you represent strings of characters in a program. UTF-8 uses a variable-length encoding, which has the remarkable property that it encodes the entire ASCII character set using exactly the same bytes that ASCII encoding uses. However, it makes it more difficult to, e.g., count the number of characters in a string. For pure ASCII, the answer is simple: characters = bytes. But if your string might have non-ASCII characters, you now have to walk the string, decoding characters, in order to find out how many there are.[1]
These are the kinds of issues you need to think about when designing your string class.
[1] This isn't as difficult as it might seem, since the first byte of each character is guaranteed not to have 10 in its two high bits. So you can simply count the bytes that satisfy (c & 0xC0) != 0x80. Nonetheless, it is still expensive relative to just treating the length of a string buffer as its character count.
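A sketch of that counting approach (the function name is made up, and it assumes the input is valid UTF-8):

#include <cstddef>
#include <string>

// Count code points by counting every byte that is not a continuation byte;
// continuation bytes are the only bytes of the form 10xxxxxx.
std::size_t utf8_length(const std::string& s)
{
    std::size_t count = 0;
    for (unsigned char c : s)
        if ((c & 0xC0) != 0x80)
            ++count;
    return count;
}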
The question here is "can we store ANY old data in the string, or do certain byte values need to be encoded in some special way?" An example of that would be in the standard C language: if you want to use a newline character, it is "encoded" as \n to make it more readable and clear - of course, in this example I'm talking about the source code. In the case of binary data stored in the string, how would you deal with "strange" data, e.g. what about zero bytes? Will they need special treatment?
The values guaranteed to fit in a char are the ASCII characters and a few others (a total of 256 different characters in a typical implementation, but char is not GUARANTEED by the standard to be 8 bits). But if we take non-European languages, such as Chinese or Japanese, they consist of vastly more characters than can fit in a single char. Unicode allows for over a million different characters, so any character from any European, Chinese, Japanese, Thai, Arabic, Mayan, or ancient hieroglyphic script can be represented in one "unit". This is done by using a wider character - for the full range, we need 32 bits. The drawback here is that most of the time we don't actually use that many different characters, so it is a bit wasteful to use 32 bits for each character, only to have zeros in the upper 24 bits nearly all the time.
A multibyte character encoding is a compromise, where "common" characters (common in the European languages) are stored as one char, but less common characters are encoded with multiple char values, using a special range of characters to indicate "there is more data in the next char to combine into a single unit". (Or, one could decide to use 2, 3, or 4 chars each time to encode a single character.)

protocol buffers : no notation for fixed size buffers?

Since I am not getting an answer on this question, I've got to prototype and check for myself. As my dataset headers need to be fixed size, I need fixed-size strings. So, is it possible to specify fixed-size strings or byte arrays in protocol buffers? It is not readily apparent here, and I kinda feel bad about forcing fixed-size strings into the header message - i.e., std::string(128, '\0');
If not, I'd rather use a #pragma pack(1) struct header {...};
edit
Question indirectly answered here. Will answer and accept.
protobuf does not have such a concept in the protocol, nor in the .proto schema language. In strings and blobs, the data is always technically variable length using a length prefix (which itself uses varint encoding, so even the length is variable length).
Of course, if you only ever store data of a particular length, then it will line up. Note also that since strings in protobuf are unicode using UTF-8 encoding, the length of the encoded data is not as simple as the number of characters (unless you are using only ASCII characters).
This is a slight clarification to the previous answer. Protocol Buffers does not encode strings as UTF-8; it encodes them as regular bytes. The on-wire format is the number of bytes consumed followed by the actual bytes. See https://developers.google.com/protocol-buffers/docs/encoding/.
While the on-wire format is always the same, protocol buffers provides two interfaces for developers to use, string and bytes, with the primary difference being that the former will generally try to provide string types to the developer whereas the latter will try to provide byte types (e.g., Java would provide String for string and ByteString for bytes).
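To make the length prefix concrete: a length-delimited protobuf field (wire type 2, which is what both string and bytes use) is written as a varint key, then a varint byte count, then the raw bytes. The following is only a hand-rolled sketch of that framing for illustration - it does not use the protobuf library, and the function names are made up:

#include <cstdint>
#include <string>
#include <vector>

// Append an unsigned value as a base-128 varint (7 bits per byte, least-significant group first).
void append_varint(std::vector<std::uint8_t>& out, std::uint64_t v)
{
    while (v >= 0x80) {
        out.push_back(static_cast<std::uint8_t>(v) | 0x80);
        v >>= 7;
    }
    out.push_back(static_cast<std::uint8_t>(v));
}

// Lay out one length-delimited field: key = (field_number << 3) | 2, then the length, then the bytes.
std::vector<std::uint8_t> encode_length_delimited(int field_number, const std::string& data)
{
    std::vector<std::uint8_t> out;
    append_varint(out, (static_cast<std::uint64_t>(field_number) << 3) | 2);
    append_varint(out, data.size());
    out.insert(out.end(), data.begin(), data.end());
    return out;
}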

How to find special values in large file using C++ or C

I have some values I want to find in a large (> 500 MB) text file using C++ or C. I know that a possible matching value can only exist at the very beginning of each line and its length is exactly ten characters. Okay, I could read the whole file line by line, searching for the value with substr(), or use a regexp, but that is a little bit ugly and very slow. I have considered using an embedded database (e.g. Berkeley DB), but the file I want to search is very dynamic and I see a problem with importing it into the database every time. Due to memory limits it is not possible to load the whole file into memory at once. Many thanks in advance.
This doesn't seem especially well suited to C/C++. Since the problem is defined by the need to parse whole lines of text and perform pattern matching on the first 10 chars, something interpreted, such as Python or Perl, would seem simpler.
How about:
pattern = '0123456789'  # <-- replace with your pattern

with open('myfile.txt') as f:
    for line in f:
        if line.startswith(pattern):
            print('Eureka!')
I don't see how you're going to do this faster than using the stdio library, reading each line in turn into a buffer, and using strchr, strcmp, strncmp or some such. Given the description of your problem, that's already fairly optimal. There's no magic that will avoid the need to go through the file line by line looking for your pattern.
That said, regular expressions are almost certainly not needed here if you're dealing with a fixed pattern of exactly ten characters at the start of a line -- that would be needlessly slow and I wouldn't use the regex library.
If you really, really need to squeeze the last few microseconds out of this, and the pattern is literally constant and at the start of a line, you might be able to do a memchr on read-in buffers looking for "\npattern" or some such (that is, including the newline character in your search), but you make it sound like the pattern is not precisely constant. Assuming it is not, the straightforward method (see first paragraph) is the most sensible thing to do.
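A sketch of that straightforward stdio approach (the buffer size, file name, and pattern are placeholders; it assumes every line fits in the buffer):

#include <cstdio>
#include <cstring>

int main()
{
    const char* pattern = "0123456789";   // the 10-character value to look for
    char line[4096];                      // assumes no line is longer than this

    std::FILE* f = std::fopen("myfile.txt", "r");
    if (!f)
        return 1;

    // Compare only the first 10 characters of each line against the pattern.
    while (std::fgets(line, sizeof line, f))
        if (std::strncmp(line, pattern, 10) == 0)
            std::puts("Eureka!");

    std::fclose(f);
    return 0;
}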
If you have a large number of values that you are looking for then you want to use Aho-Corasick. This algorithm allows you to create a single finite state machine that can search for all occurrences of any string in a set simultaneously. This means that you can search through your file a single time and find all matches of every value you are looking for. The wikipedia link above has a link to a C implementation of Aho-Corasick. If you want to look at a Go implementation that I've written you can look here.
If you are looking for a single or a very small number of values then you'd be better off using Boyer-Moore. Although in this case you might want to just use grep, which will probably be just as fast as anything you write for this application.
How about using memory-mapped files for the search?
http://beej.us/guide/bgipc/output/html/multipage/mmap.html
One way may be to load and search, say, the first 64 MB in memory, unload it, then load the next 64 MB, and so on (in multiples of 4 KB, and with a small overlap between blocks so that you are not overlooking any text that might be split across a block boundary).
Also view Boyer Moore String Search
http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm
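A rough POSIX sketch of the memory-mapped approach (error handling mostly omitted; the file name and pattern are placeholders, and this will not build on Windows as-is):

#include <cstdio>
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main()
{
    const char* pattern = "0123456789";          // the 10-character value to find
    const std::size_t plen = std::strlen(pattern);

    int fd = open("myfile.txt", O_RDONLY);
    if (fd < 0)
        return 1;

    struct stat st;
    fstat(fd, &st);

    // Map the whole file; the OS pages it in on demand, so it need not fit in RAM.
    const char* data = static_cast<const char*>(
        mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0));
    const char* end = data + st.st_size;

    // A match can only start at the beginning of a line, so hop from line to line.
    for (const char* p = data; p < end; ) {
        if (static_cast<std::size_t>(end - p) >= plen && std::memcmp(p, pattern, plen) == 0)
            std::puts("Eureka!");
        const char* nl = static_cast<const char*>(std::memchr(p, '\n', end - p));
        if (!nl)
            break;
        p = nl + 1;
    }

    munmap(const_cast<char*>(data), st.st_size);
    close(fd);
    return 0;
}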
Yes this can be done fast. Been there. Done that. It is easy to introduce bugs, however.
The trick is in managing end of buffer, since you will read a buffer full of data, search that buffer, and then go on to the next. Since the pattern could span the boundary between two buffers, you wind up writing most of your code to cover that case.
At any rate, outside of the boundary case, you have a loop that looks like the following:
unsigned short *p = buffer;
while( (p < EOB) && ( !patterns[*p] ) ) ++p;   // skip positions that cannot start a match
This assumes that EOB has been appropriately initialized, and that patterns[] is an array of 65536 values which are 0 if you can't be at the start of your pattern and 1 if you can.
Depending on your CR/LF and byte order conventions, patterns to set to 1 might include \nx or \rx where x is the first character in your 10 character pattern. Or x\n or x\r for the other byte order. And if you don't know the byte order or convention you can include all four.
Once you have a candidate location (EOL followed by the first byte) you do the work of checking the remaining 9 bytes. Building the patterns array is done offline, ahead of time. Two byte patterns fit in a small enough array that you don't have too much memory thrashing when doing the indexing, but you get to zip through the data twice as fast as if you did single byte.
There is one crazy optimization you can add into this, and that is to write a sentinel at the end of buffer, and put it in your patterns array. But that sentinel must be something that couldn't appear in the file otherwise. It gets the loop down to one test, one lookup and one increment, though.