By using modified RLE, there must at least one "compressed" file larger than original one? - compression

I have read this.
The mathmatical proof shown that for all lossless algorithms, there will be at least one "compressed" file getting larger than before.
For RLE, the compressed text file will be larger if all characters are different.
e.g. ABC -> 1A1B1C
But if i modified the rule of RLE:
(1) For 1,2 length characters, no number will be added
e.g. ABCCDDDEFFFF -> ABCC3DE4F
It seems okay and no compressed file will be getting larger.
However, it contradicts the mathematical proof.

You fail to support decompression, because your compression is not unique. The problem is that your input may contain numbers as well. Now the input "1A" in RLE transforms to "11A" and back to 1A. In your scheme, "1A" and "A" both compress to "1A".

Related

Meaning of 3F7.1 in Fortran data format

I am trying to create an MDM file using HLM 7 Student version, but since I don't have access to SPSS I am trying to import my data using ASCII input. As part of this process I am required to input the data format Fortran style. Try as I might I have not been able to understand this step. Could someone familiar with Fortran (or even better HLM itself) explain to me how this works? Here is my current understanding
From the example EG3.DAT they give
(A4,1X,3F7.1)
I think
A4 signifies that the ID is 4 characters long.
1X means skip a space.
F.1 means that it should read 1 decimal places.
I am very confused about what 3F7 might mean.
EG3.DAT
2020 380.0 40.3 12.5
2040 502.0 83.1 18.6
2180 777.0 96.6 44.4
Below are examples from the help documents.
Rules for format statement
Format statement example
EG1 data format
EG2 data format
EG3 data format
One similar question is Explaining Fortran Write Format. Unfortunately it does not explicitly treat the F descriptor.
3F7.1 means 3 floating point numbers, each printed over 7 characters, each with one decimal number behind the decimal point. Leading characters are blanks.
For reading you don't need the .1 info at all, just read a floating point number from those 7 characters.
You guessed the meaning of A4 (string of four characters) and 1X (one blank) correctly.
In Fortran, so-called data edit descriptors (which format the input or output of data) may have repeat specifications.
In the format (A4,1X,3F7.1) the data edit descriptors are A4 and F7.1. Only F7.1 has a repeat specification (the number before the F). This simply means that the format is as though the descriptor appeared repeated: like F7.1, F7.1, F7.1. With a repeat specification of 1, or not given, there is just the single appearance.
The format of the question, then, is like
(A4,1X,F7.1,F7.1,F7.1)
This format is one that is covered by the rules provided in one of the images of the question. In particular, the aspect of repeat specification is given in rule 2 with the corresponding example of rule 3.
Further, in Fortran proper, a repeat count specifier may also be * as special case: that's like an exceptionally large repeat count. *(F7.1) would be like F7.1, F7.1, F7.1, .... I see no indication that this is supported by HLM but if this is needed a very large repeat count may be given instead.
In 1X the 1 isn't a repeat specification but an integral, and necessary, part of the position edit descriptor.
Procedure for making MDM file from excel for HLM:
-Make sure ALL the characters in ALL the columns line up
Select a column, then right click and select Format Cells
Then click on 'Custom' and go to the 'Type' box and enter the number
of 0s you need to line everything up
-Remove all the tabs from the document and replace them with spaces.
Open the document in word and use find and replace
-To save the document as .dat
First save it as .txt
Then open it in Notepad and save it as .dat
To enter the data format (FORTRAN-Style)
The program wants to read the data file space by space, so you have to specify it perfectly so that it reads the whole set properly.
If something is off, even by a single space, then your descriptive stats will be wonky compared to if you check them in another program.
Enclose the code with brackets ()
Divide the entries with commas ,
-Need ID column for all levels
ID column needs to be sorted so that it is in order from smallest to
largest
Use A# with # being the number of characters in the ID
Use an X1 to
move from the ID to the next column
-Need to say how many characters are needed in each column
Use F
After F is the number of characters needed for that column -Use F# (#= number)
There need to be enough character spaces to provide one 'gap' space
between each column
There need to be enough to character spaces to allow for the decimal
As part of the F you need to specify the number of decimal places
You do this by adding a decimal point after the F number and then a
number to represent the spaces you need -F#.#
You can use a number in front of the F so as to 'repeat' it. Not
necessary though. -#F#.#
All in all, it should look something like this:
(A4,X1,F4.0,F5.1)
Helpful links:
https://books.google.de/books?id=VdmVtz6Wtc0C&pg=PA78&lpg=PA78&dq=data+format+fortran+style+hlm&source=bl&ots=kURJ6USN5e&sig=fdtsmTGSKFxn04wkxvRc2Vw1l5Q&hl=en&sa=X&ved=0ahUKEwi_yPurjYrYAhWIJuwKHa0uCuAQ6AEIPzAC#v=onepage&q&f=false
http://www.ssicentral.com/hlm/help6/error/Problems_creating_MDM_files.pdf
http://www.ssicentral.com/hlm/help7/faq/FAQ_Format_specifications_for_ASCII_data.pdf

Where can I get sample Reed-Solomon encoded data?

I want to write a Reed-Solomon decoder and experiment with performance improvements. Where could I find sample data with appended Reed-Solomon parity bytes?
I am aware that Reed-Solomon is used in all kinds of 1D and 2D bar codes, but I would like to have the raw data (an array of bytes) with clear separation of payload and parity bytes.
Any help is appreciated.
Basically, a Reed-Solomon code will be composed of characters with a value between 0 and (m-1), where m is the exposant of the Galois Field used to generate the RS code. For example, in GF(2^8) (2^8 = 256), you will get a RS code composed of characters between 0 and 255 (compatible with ASCII, UTF-8 and usual binary encoding). In GF(2^16), you will get characters encoded between 0 and 65535 (compatible with UTF-16 or with UTF-8 if you encode 2 characters as one).
Other than the range of values of each character of a RS code, all the rest can be basically considered random from an external POV (it's not if you have the generator polynomial and Galois field, but for your purpose of getting a sample, you can assume a random distribution of values in the valid range).
If you want to generate real samples of a RS code with the corresponding data block, you can use the Python library pyFileFixity (disclaimer, I am the author). By default, each ecc block is separated by a md5 digest, so that you can clearly separate them. The original data is not stored, but you can easily do that by modifying the script structural_adaptive_ecc.py or header_ecc.py (the latter will be easier to modify) to also store the original data (it's just a file.write() to edit). If Python is not your thing, you can probably find a Reed-Solomon library for your language of choice, and just do a slight modification to print or save into a file the original data along the ecc blocks.

How to read output of hexdump of a file?

I wrote a program in C++ that compresses a file.
Now I want to see the contents of the compressed file.
I used hexdump but I dont know what the hex numbers mean.
For example I have:
0000000 00f8
0000001
How can I convert that back to something that I can compare with the original file contents?
If you implemented a well-known compression algorithm you should be able to find a tool that performs the same kind of compression and compare its results with yours. Otherwise you need to implement an uncompressor for your format and check that the result of compressing and then uncompressing is identical to your original data.
That looks like a file containing the single byte 0xf8. I say that since it appears to have the same behaviour as od under UNIX-like operating systems, with the last line containing the length and the contents padded to a word boundary (you can use od -t x1 to get rid of the padding, assuming your od is advanced enough).
As to how to recreate it, you need to run it through a decryption process that matches the encryption used.
Given that the encrypted file is that short, you either started with a very small file, your encryption process is broken, or it's incredibly efficient.

Understanding the concept of 'File encoding'

I have already gone through some stuff on the web and SOF explaining 'file encoding' but I still have questions. File is a group of related records and on disk, its contents are just stored as '1's and '0's. Every time, a running program wants to read in a file or write to the file, the file is brought into the RAM and put into the address space of the running program (aka process). Now what determines how the bits (or bytes) in the file should be decoded/encoded and read and displayed/written?
There is one explanation on SOF which reads 'At the storage level, a file contains an array of bytes. On top of this you have the encoding layer for text files. The format layer comes last, on top of the encoding layer for text files or on top of the array of bytes for all the other binary files'. I am sort of fine with this but would like to know if it is 100% correct.
The question basically came up when understanding file opening modes in C++.
I think the description of the orders of layers is confusing here. I would consider formats and encodings to be related but not tied together so tightly. Let's try to define it formally.
A file is a contiguous sequence of bytes. A byte is a contiguous sequence of bits.
A symbol is a unit of data. Bytes are one kind of symbol. There are other symbols that are not bytes. Consider the number 6 - it is a symbol but not a byte. It can however be encoded as a byte, commonly as 00000110 (this is the two's complement encoding of 6).
An encoding maps a set of symbols to another set of symbols. Most commonly, it maps from a set of non-byte symbols to bytes, which when applied to an entire file makes it a file encoding. Two's complement gives a representation of the numeric values. On the other hand, ASCII, for example, gives a representation of the Latin alphabet and related characters in bytes. If you take ASCII and apply it to a string of text, say "Hello, World!", you get a sequence of bytes. If you store this sequence of bytes as a file, you have a file encoded as ASCII.
A format describes a set of valid sequences of symbols. When applied to the bytes of a file, it is a file format. An example is the BMP file format for storing raster graphics. It specifies that there must be a few bytes at the beginning that identify the file format as BMP, followed by a few bytes to describe the size and depth of the image, and so on. An example of a format that is not a file format would be how we write decimal numbers in English. The basic format is a sequence of numerical characters followed by an optional decimal point with more numerical characters.
Text Files
A text file is a kind of file that has a very simple format. It's format is very simple because it has no structure. It immediately begins with some encoding of a character and ends with the encoding of the final character. There's usually no header or footer or metadata or anything like that. You just start interpreting the bytes as characters right from the beginning.
But how do you interpret the characters in the file? That's where the encoding comes in. If the file is encoded as ASCII, the byte 01000001 represents the Latin letter A. There are much more complicated encodings, such as UTF-8. In UTF-8, a character cannot necessarily be represented in a single byte. Some can, some can't. You determine the number of bytes to interpret as a character from the first few bits of the first byte.
When you open a file in your favourite text editor, how does it know how to interpret the bytes? Well that's an interesting problem. The text editor has to determine the encoding of the file. It can attempt to do this in many ways. Sometimes the file name gives a hint through its extension (.txt is likely to be at least ASCII compatible). Sometimes the first character of the file gives a good hint as to what the encoding is. Most text editors will, however, give you the option to specify which encoding to treat the file as.
A text file can have a format. Often the format is entirely independent of the encoding of the text. That is, the format doesn't describe the valid sequences of bytes at all. It instead describes the valid sequences of characters. For example, HTML is a format for text files for marking up documents. It describes the sequences of characters that determine the contents of a document (note: not the sequence of bytes). As an example, it says that the sequence of characters <html> are an opening tag and must be followed at some point by the closing tag </html>. Of course, the format is much more detailed than this.
Binary file
A binary file is a file with meaning determined by its file format. The file format describes the valid sequences of bytes within the file and the meaning that that sequence has. It is not some interpretation of the bytes that matters at the file format level - it is the order and arrangement of bytes.
As described above, the BMP file format gives a way of storing raster graphics. It says that the first two bytes must be 01000010 01001101, the next four bytes must give the size of the file as a count of the number of bytes, and so on, leading up to the actual pixel data.
A binary file can have encodings within it. To illustrate this, consider the previous example. I said that the four bytes following the first two in a BMP file give the size of the file in bytes. How are those bytes interpreted? The BMP file format states that those bytes give the size as an unsigned integer. This is the encoding of those bytes.
So when you browse the directories on your computer for a BMP file and open it, how does your system know how to open it? How does it know which program to use to view it? The format of a binary file is much more strongly hinted by the file extension than the encoding of a text file. If the filename has .bmp at the end, your system will likely consider it to be a BMP file and just open it in whatever graphics program you have. It may also look at the first few bytes and see what they suggest.
Summary
The first level of understanding the meaning of bytes in a file is that file's format. A text file has an extremely simple format - start at the beginning, interpreting characters until you reach the end. How you interpret the characters depends on that text file's character encoding. Most formats are more complicated, however, and will likely have encodings nested within them. At some level you have to start extracting abstract information from your bytes and that's where the encodings kick in. But then whatever is being encoded can also have a format that is applied to it. You have a chain of formats and encodings until you get the information that you want.
Let's see if this helps...
A Unix file is just an array of bits (1/0) the current minimum number of bits in a file is 8, i.e. 1 byte. All file interaction is done at no less than the byte level. On most systems now a days, you don't really have to concern your self with the maximum size of a file. There are still some small variences in Operating Systems, but very few if none have maximum sizes of less that 1 GB.
The encoding or format of a file is only dependent on the applications that use it.
There are many common file formats, such as 'unix ASCII text' and PDF. Most of the files you will come accross will have a documented format specification somewhere on the net. For example the specification of a 'Unix ASCII text file' is:
A collection of ascii characters where each line is terminated by a end of line character. The end of line character is specificed in c++ as std::endl' or the quoted "\n". Unix specifies this character as the binary value - 012(oct) or 00001010.
Hope this helps :)
The determination of how to encode/display something is entirely up to the designer of the program. Of course, there are standards for certain types of files - a PDF or JPG file has a standard format for its content. The definition of both PDF and JPG is quite complex.
Text files have at least somewhat of a standard - but how to interpret or use the contents of a text-file may be just as complex and confusing as JPEG - the only difference is that the content is (some sort of) text, so you can load it up in a text editor and try to make sense of it. But see below for an example line of "text in a database type application".
In C and C++, there is essentially just one distinction, files are either "binary" or "text" ("not-binary"). The difference is about the treatment of "special bits", mostly to do with "endings" - a text file will contain end of line markers, or newlines ('\n') [more in a bit about newlines], and in some operating systems , also contain "end of file marker(s)" - for example in old CP/M, the file was sized in blocks of 128 or 256 bytes. So if we had "Hello, World!\n" in a text file, that file would be 128 bytes long, and the remaining 114 bytes would be "end-of-file" markers. Most modern operating systems track filesize in bytes, so there's no need to have a end-of-file marker in the file. But C supports many operating systems, both new and old, so the language has an allowance for this. End of file is typically CTRL-Z (DOS, Windows, etc) or CTRL-D (Unix - Linux, etc). When the C runtime library hits the end of file character, it will stop reading and give the error code/behaviour, same as if "there is no more file to read here".
Line endings or newlines need special treatment because they are not always the same in the OS that the file is living on. For example, Windows and DOS uses "Carriage Return, Line Feed" (CR, LF - CTRL-M, CTRL-J, ASCII 13 and 10 respectively) as the end of line. In the various forms of Unix, (Linux, MacOS X and BSD for example), the line ending is "Line Feed" (LF, CTRL-J) alone. In older MacOS, the line ending is ONLY "carriage Return." So that you as a programmer don't have to worry about exactly how lines end, the C runtime library will do translation of the "native" line-ending to a standardized line-ending of '\n' (which translates to "Line Feed" or character value 10). Of course, this means that the C runtime library needs to know that "if there is a CR followed by LF, we should just give out an LF character."
For binary files, we really DO NOT want any translation of the data, just because our pixels happen to be the values 13 and 10 next to each other, doesn't mean we want it merged to a single 10 byte, right? And if the code reads a byte of the value 26 (CTRL-Z) or 4 (CTRL-D), we certainly don't want the input to stop there...
Now, if I have a database text file that contains:
10 01353-897617 14000 Mats
You probably have very little idea what that means - I mean you can probably figure out that "Mats" is my name - but it could also be those little cardboard things to go under glasses (aka "Beer-mats") or something to go on the floor, e.g. "Prayer Mats" for Muslims.
The number 10 could be a customer number, article number, "row number" or something like that. 01353-896617 could be just about anything - perhaps my telephone number [no it isn't, but it does indeed resemble it] - but it could also be a "manufacturers part number" or some form of serial number or some such. 14000? Price per item, number of units in stock, my salary [I hope not!], distance in miles from my address to Sydney in Australia [roughly, I think].
I'm sure someone else, not given anything else could come up with hundreds of other answers.
[The truth is that it's just made up nonsense for the purpose of this answer, except for the bit at the beginning of the "phone number", which is a valid UK area code - the point is to explain that "the meaning of a set of fields in a text-file can only be understood if there is something describing the meaning of the fields"]
Of course the same applies to binary files, except that it's often even harder to figure out what the content is, because of the lack of separators - if you didn't have spaces and dashes in the text above, it would be much harder to know what belongs where, right? There are typically no 'spaces' and other such things in a binary file. It's all down to someone's description or definition in some code somewhere, or something like that.
I hope my ramblings here have given you some idea.
Now what determines how the bits (or bytes) in the file should be decoded/encoded and read and displayed/written?
The format of the file, obviously. If you are reading a BMP file, you have to first read the header, then height*width pixel data. If you are reading .txt, just read the characters as-is. Text files can have different encodings, such as Unicode.
Some formats, like .png, are compressed, meaning that their raw data takes more space in memory that the file on disk.
The particular algorithm is chosen depending on various factors. On Windows, it's usually the extension that matters. In web, the content type is dominant.
In general, if you try to read the file in other format, you will usually get garbage. That can be forced sometimes : try opening a .bmp file in your text editor, for example.
So basically we're talking about text files mainly, right?
Now to the point: when your text editor loads the file into memory, from some information it deduces its file encoding (either you tell it or it has a special file format marker among the first few bytes of the file, or whatever). Then it's the program itself that decides how it treats the raw bytes.
For example, if you tell your text editor to open a file as ASCII, it will treat each byte as an individual character, and it will display the character A whenever encounters the number 65 as the current byte to show, etc (because 65 is the ASCII character code for A).
However, if you tell it to open your file as UTF-16, then it will grab two bytes (well, more precisely, two octets) at a time, it will use this so-called "word" as the numeric value to be looked up, and it will, for example, display a ç character when the two bytes it read were corresponding to 231, the Unicode character code of ç.

How to identify compressed/uncompressed bit groups?

I'm using a static dictionary file with some words and values for this words. This values are not fixed sized, for example the is 1, love is 01, kill is 101 etc. When I try to compress a group of words, I traverse every word and look up to dictionary if a value exists for that word. If one exists I change the word with the value, if it doesn't exist I encode the word as bytes. After compression I got a chunk of bits, and because these dictionary values and uncompressed words are not fixed sized I can not group the bits and decode them.
I have thought about using 1 bit flag for every group of bits to determine it is compressed or uncompressed, but I can't detect the flag bit because of this unknown length of a codeword or regular word.
If I use a 1 byte delimiter, it still has problems. Let's say my delimiter is 00000000, and before the delimiter I have 100 and after delimiter I have 001, so we have 10000000000001, how am I supposed to know that which group of these bits are my delimiter?
Can I use some other method to group these compressed/uncompressed bits to decode them? Thank you.
First off,what language and system are you intending to deploy this? Many languages provide their own libraries and tools for compression and may suite your needs without major low-level design effors.
The answer here is to establish some more rigorous bookkeeping and file formatting to be able to undo the compression. Most compression systems have some amount of overhead in their file format which is why when you compress something twice you don't necessarily save anything and can actually increase the size of the file.
Often files take advantage of header at the start of a file to provide key information. which would be a good place to define any rules that are specific to the compressed file.
create fixed size delimiter to use between code words only. This can be determined after analyzing the file but before actually writing out the compressed data.
If you generate your delimiter rather than a fixed known value, include this as one of your header items.
keep your header a simple ascii format so that you can easily extract it with standard tools like sscanf and fscanf.
if you want to have a header that can contain extra information you may need a consistent way to tell where the header ends and the data begins. Including something to the effect of "ENDHEADER" should be enough and still easily identifiable.