How to identify compressed/uncompressed bit groups? - compression

I'm using a static dictionary file with some words and values for these words. The values are not fixed-size; for example, "the" is 1, "love" is 01, "kill" is 101, etc. When I compress a group of words, I traverse every word and look it up in the dictionary to see whether a value exists for it. If one exists, I replace the word with the value; if it doesn't, I encode the word as raw bytes. After compression I get a chunk of bits, and because the dictionary values and uncompressed words are not fixed-size, I cannot group the bits and decode them.
I have thought about using a 1-bit flag in front of every group of bits to mark it as compressed or uncompressed, but on decode I can't locate the flag bits, again because the length of a codeword or regular word is unknown.
If I use a 1-byte delimiter, there are still problems. Say my delimiter is 00000000, the bits before it are 100, and the bits after it are 001; then I have 10000000000001. How am I supposed to know which of these bits are my delimiter?
Can I use some other method to group these compressed/uncompressed bits to decode them? Thank you.

First off, what language and system are you intending to deploy this on? Many languages provide their own libraries and tools for compression and may suit your needs without major low-level design effort.
The answer here is to establish some more rigorous bookkeeping and file formatting to be able to undo the compression. Most compression systems have some amount of overhead in their file format which is why when you compress something twice you don't necessarily save anything and can actually increase the size of the file.
Files often take advantage of a header at the start to provide key information, which would be a good place to define any rules that are specific to the compressed file.
Create a fixed-size delimiter to use between code words only. This can be determined after analyzing the file but before actually writing out the compressed data.
If you generate your delimiter rather than using a fixed known value, include it as one of your header items.
Keep your header in a simple ASCII format so that you can easily extract it with standard tools like sscanf and fscanf.
If you want a header that can contain extra information, you need a consistent way to tell where the header ends and the data begins. Including something to the effect of "ENDHEADER" should be enough and still be easily identifiable.
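As a rough sketch of such a header (the MYCOMP magic, the DELIM and WORDS fields, and the function names here are made up for illustration, not a standard format):

#include <cstdio>
#include <cstring>

// Write a simple ASCII header in front of the binary payload.
void write_header(FILE *out, unsigned delimiter, unsigned word_count) {
    std::fprintf(out, "MYCOMP 1\n");              // magic + format version
    std::fprintf(out, "DELIM %u\n", delimiter);   // the generated delimiter value
    std::fprintf(out, "WORDS %u\n", word_count);  // extra bookkeeping as needed
    std::fprintf(out, "ENDHEADER\n");             // binary data starts after this
}

// Read the header back with standard tools; returns true when ENDHEADER was
// found, leaving the stream positioned at the start of the binary data.
bool read_header(FILE *in, unsigned *delimiter, unsigned *word_count) {
    char line[128];
    while (std::fgets(line, sizeof line, in)) {
        if (std::strncmp(line, "ENDHEADER", 9) == 0)
            return true;
        std::sscanf(line, "DELIM %u", delimiter);
        std::sscanf(line, "WORDS %u", word_count);
    }
    return false;                                 // malformed: no ENDHEADER
}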

Related

Is it possible to make an index search by regex in PDF?

I want to search for all lines that match this regex
^([0-9IVX]\.)*.*\R
and report with the page number they are at. The output would be something like:
1. Heading/page number
1.1 Subheading/page number
1.1.1. Subsubheading/page number
Is this possible to do in PDF? I suppose that would require Ghostscript, but searching the How to Use Ghostscript page for regex I find nothing.
I can't think why you would expect Ghostscript to do the searching for you.
I'm not sure if you are hoping to get the data type 'heading, page number' etc from the PDF file, or if you are going to work that out yourself based on the data you find.
If it's the former then the first problem is that, in general, PDF files don't have the kind of structure information you are looking for. There is nothing in most PDF files which says 'this is a heading', 'this is a page number' etc.
There are such things as 'tagged PDF' which adds non-printing elements to a PDF file which do carry that kind of data around with them. This is an entirely optional feature, the vast majority of PDF files don't contain it, and Ghostscript completely ignores it.
Since most PDF files don't have that information, you can't rely on it, unless you are in the happy position of knowing where your PDF files are being generated and that they contain this kind of information. In which case there are numerous tools around which will extract it for you, or enable you to write code to do so.
The problem with just searching for the text is that firstly the text need not be written as a contiguous stream. So if you are looking for '1.1' that might be written as:
(1.1) Tj
(1) Tj
(.) Tj
(1) Tj
[(1) -0.1 (.) 0.1 (1)] TJ
or any combination of those. The individual character codes need not even appear in order or in the same content stream.
Secondly the character code in a PDF content stream need not be (and often is not) a Unicode code point. Or ASCII, or any other standard coding scheme, it can be totally arbitrary.
Some PDF files carry a ToUnicode CMap around which maps the character codes to Unicode code points, but not all do. Some fonts may use a standard (that's PDF standard) Encoding, in which case it's possible to infer the Unicode code points. Some Encodings may contain glyph names, from which it's again possible to infer Unicode code points.
In the end though, some PDF files are simply impossible to extract text from without using OCR.
Your best bet is probably to write code to extract the text, and Ghostscript will do that. It even goes through the hierarchy of fallbacks listed above to try to find a Unicode code point. If all else fails it just uses the character code and hopes that's good enough.
If you use Ghostscript's txtwrite device it will produce either a faked up text page (the default) which attempts, as far as possible, to mimic the text layout in the original PDF file, including merging bits of text that aren't contiguous in the PDF file but are next to each other on the page. Or an 'XML-like' output which will tell you which Unicode code points, or character codes, were encountered and what their position is on the original page. If you don't like txtwrite's attempts to figure out which text goes with what, then you can use this to write your own.
I suspect the text page is probably good enough for your purposes. You can have the txtwrite device produce one file per page, so you can get the page number from the filename. Then you can write your own regex expression(s) to search the files and find your matches.
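For example (the page%03d.txt naming and the simplified regex are assumptions; std::regex has no \R, so the question's pattern is adapted):

// First extract one text file per page:
//   gs -sDEVICE=txtwrite -o page%03d.txt input.pdf
#include <cstdio>
#include <fstream>
#include <iostream>
#include <regex>
#include <string>

int main() {
    std::regex heading(R"(^([0-9IVX]+\.)+.*)");   // heading-like lines
    for (int page = 1; ; ++page) {
        char name[32];
        std::snprintf(name, sizeof name, "page%03d.txt", page);
        std::ifstream in(name);
        if (!in) break;                           // ran out of pages
        std::string line;
        while (std::getline(in, line))
            if (std::regex_search(line, heading))
                std::cout << line << " / page " << page << '\n';
    }
}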

Converting WAV file audio input into plain ASCII characters

I am working on a project where we need to convert WAV file audio input into plain ASCII characters. The input WAV file will contain a single short alphanumeric code, e.g. asdrty543, and each character will be pronounced one by one when you play the WAV file. Our requirement is that when a single character is pronounced, we need to convert it into its equivalent ASCII code. The implementation will be done in C/C++ as an unmanaged Win32 DLL. We are open to using third-party libraries. I am already googling for directions. However, I will really appreciate it if I can get directions/pointers from an experienced programmer who has already worked on a similar requirement. Thank you in advance for your help.
ASCII characters like A-Z and 0-9 are only a portion of the ASCII table. WAV files, like any other file, are stored and accessed in bytes.
One byte has 256 different values, so you can't simply map bytes onto A-Z/0-9 characters; there aren't enough of those characters to go around.
You'll have to find a library which opens WAV files and decodes the waveform for you. From the wave's intensity and length, a chain of A-Z or A-Z/0-9 characters can then be produced.
I believe you're trying to convert the wave to a series of notes. That's possible too, using the same approach.
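Whichever interpretation applies, step one is getting at the raw samples inside the WAV container. A minimal sketch of reading the canonical 44-byte header (assumes a plain little-endian PCM file named input.wav with no extra chunks; a real reader must walk the RIFF chunks properly):

#include <cstdint>
#include <cstdio>

int main() {
    FILE *f = std::fopen("input.wav", "rb");       // hypothetical file name
    if (!f) return 1;
    char riff[4], wave[4], fmt[4];
    std::uint32_t riff_size, fmt_size, sample_rate, byte_rate;
    std::uint16_t format, channels, block_align, bits;
    std::fread(riff, 1, 4, f);                     // "RIFF"
    std::fread(&riff_size, 4, 1, f);
    std::fread(wave, 1, 4, f);                     // "WAVE"
    std::fread(fmt, 1, 4, f);                      // "fmt "
    std::fread(&fmt_size, 4, 1, f);
    std::fread(&format, 2, 1, f);                  // 1 = uncompressed PCM
    std::fread(&channels, 2, 1, f);
    std::fread(&sample_rate, 4, 1, f);
    std::fread(&byte_rate, 4, 1, f);
    std::fread(&block_align, 2, 1, f);
    std::fread(&bits, 2, 1, f);
    std::printf("%u Hz, %u channel(s), %u bits/sample\n",
                sample_rate, channels, bits);
    std::fclose(f);
}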

How to read output of hexdump of a file?

I wrote a program in C++ that compresses a file.
Now I want to see the contents of the compressed file.
I used hexdump but I don't know what the hex numbers mean.
For example I have:
0000000 00f8
0000001
How can I convert that back to something that I can compare with the original file contents?
If you implemented a well-known compression algorithm you should be able to find a tool that performs the same kind of compression and compare its results with yours. Otherwise you need to implement an uncompressor for your format and check that the result of compressing and then uncompressing is identical to your original data.
That looks like a file containing the single byte 0xf8. I say that since it appears to have the same behaviour as od under UNIX-like operating systems, with the last line containing the length and the contents padded to a word boundary (you can use od -t x1 to get rid of the padding, assuming your od is advanced enough).
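If you want a dump you can diff against the original, a minimal od -t x1 style program is only a few lines (a sketch):

#include <cstdio>

int main(int argc, char **argv) {
    if (argc < 2) return 1;
    FILE *f = std::fopen(argv[1], "rb");
    if (!f) return 1;
    int c;
    long offset = 0;
    while ((c = std::fgetc(f)) != EOF) {
        if (offset % 16 == 0)
            std::printf("%s%07lx", offset ? "\n" : "", offset);  // address column
        std::printf(" %02x", c);                                 // one byte, no padding
        ++offset;
    }
    std::printf("\n%07lx\n", offset);                            // final length, od-style
    std::fclose(f);
}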
As to how to recreate the original, you need to run the file through a decompression process that matches the compression used.
Given that the compressed file is that short, you either started with a very small file, your compression process is broken, or it's incredibly efficient.

Understanding the concept of 'File encoding'

I have already gone through some material on the web and SOF explaining 'file encoding', but I still have questions. A file is a group of related records, and on disk its contents are just stored as 1s and 0s. Every time a running program wants to read a file or write to it, the file is brought into RAM and put into the address space of the running program (aka process). Now what determines how the bits (or bytes) in the file should be decoded/encoded and read and displayed/written?
There is one explanation on SOF which reads 'At the storage level, a file contains an array of bytes. On top of this you have the encoding layer for text files. The format layer comes last, on top of the encoding layer for text files or on top of the array of bytes for all the other binary files'. I am sort of fine with this but would like to know if it is 100% correct.
The question basically came up when understanding file opening modes in C++.
I think the description of the ordering of the layers is confusing here. I would consider formats and encodings to be related but not tied together so tightly. Let's try to define it formally.
A file is a contiguous sequence of bytes. A byte is a contiguous sequence of bits.
A symbol is a unit of data. Bytes are one kind of symbol. There are other symbols that are not bytes. Consider the number 6 - it is a symbol but not a byte. It can however be encoded as a byte, commonly as 00000110 (this is the two's complement encoding of 6).
An encoding maps a set of symbols to another set of symbols. Most commonly, it maps from a set of non-byte symbols to bytes, which when applied to an entire file makes it a file encoding. Two's complement gives a representation of the numeric values. On the other hand, ASCII, for example, gives a representation of the Latin alphabet and related characters in bytes. If you take ASCII and apply it to a string of text, say "Hello, World!", you get a sequence of bytes. If you store this sequence of bytes as a file, you have a file encoded as ASCII.
A format describes a set of valid sequences of symbols. When applied to the bytes of a file, it is a file format. An example is the BMP file format for storing raster graphics. It specifies that there must be a few bytes at the beginning that identify the file format as BMP, followed by a few bytes to describe the size and depth of the image, and so on. An example of a format that is not a file format would be how we write decimal numbers in English. The basic format is a sequence of numerical characters followed by an optional decimal point with more numerical characters.
Text Files
A text file is a kind of file that has a very simple format. Its format is very simple because it has no structure. It immediately begins with some encoding of a character and ends with the encoding of the final character. There's usually no header or footer or metadata or anything like that. You just start interpreting the bytes as characters right from the beginning.
But how do you interpret the characters in the file? That's where the encoding comes in. If the file is encoded as ASCII, the byte 01000001 represents the Latin letter A. There are much more complicated encodings, such as UTF-8. In UTF-8, a character cannot necessarily be represented in a single byte. Some can, some can't. You determine the number of bytes to interpret as a character from the first few bits of the first byte.
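The length determination is simple enough to sketch (these are the standard UTF-8 lead-byte rules, not tied to any particular library):

// How many bytes does the UTF-8 sequence starting with this byte occupy?
int utf8_sequence_length(unsigned char first) {
    if (first < 0x80)           return 1;  // 0xxxxxxx: plain ASCII
    if ((first & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((first & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((first & 0xF8) == 0xF0) return 4;  // 11110xxx
    return -1;                             // continuation or invalid lead byte
}
// utf8_sequence_length('A') == 1; utf8_sequence_length(0xE2) == 3
// (0xE2 starts, e.g., the three-byte Euro sign U+20AC).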
When you open a file in your favourite text editor, how does it know how to interpret the bytes? Well that's an interesting problem. The text editor has to determine the encoding of the file. It can attempt to do this in many ways. Sometimes the file name gives a hint through its extension (.txt is likely to be at least ASCII compatible). Sometimes the first character of the file gives a good hint as to what the encoding is. Most text editors will, however, give you the option to specify which encoding to treat the file as.
A text file can have a format. Often the format is entirely independent of the encoding of the text. That is, the format doesn't describe the valid sequences of bytes at all. It instead describes the valid sequences of characters. For example, HTML is a format for text files for marking up documents. It describes the sequences of characters that determine the contents of a document (note: not the sequence of bytes). As an example, it says that the sequence of characters <html> are an opening tag and must be followed at some point by the closing tag </html>. Of course, the format is much more detailed than this.
Binary file
A binary file is a file with meaning determined by its file format. The file format describes the valid sequences of bytes within the file and the meaning that that sequence has. It is not some interpretation of the bytes that matters at the file format level - it is the order and arrangement of bytes.
As described above, the BMP file format gives a way of storing raster graphics. It says that the first two bytes must be 01000010 01001101, the next four bytes must give the size of the file as a count of the number of bytes, and so on, leading up to the actual pixel data.
A binary file can have encodings within it. To illustrate this, consider the previous example. I said that the four bytes following the first two in a BMP file give the size of the file in bytes. How are those bytes interpreted? The BMP file format states that those bytes give the size as an unsigned integer. This is the encoding of those bytes.
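To make that concrete, here is a sketch of pulling that size field out by hand (image.bmp is a hypothetical file name):

#include <cstdint>
#include <cstdio>

int main() {
    FILE *f = std::fopen("image.bmp", "rb");
    if (!f) return 1;
    unsigned char h[6];                     // 'B', 'M', then 4 size bytes
    if (std::fread(h, 1, 6, f) == 6 && h[0] == 'B' && h[1] == 'M') {
        // The format says little-endian unsigned integer, so assemble the
        // bytes explicitly rather than trusting the host's byte order.
        std::uint32_t size = h[2] | (h[3] << 8)
                           | ((std::uint32_t)h[4] << 16)
                           | ((std::uint32_t)h[5] << 24);
        std::printf("file size: %u bytes\n", size);
    }
    std::fclose(f);
}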
So when you browse the directories on your computer for a BMP file and open it, how does your system know how to open it? How does it know which program to use to view it? The format of a binary file is much more strongly hinted by the file extension than the encoding of a text file. If the filename has .bmp at the end, your system will likely consider it to be a BMP file and just open it in whatever graphics program you have. It may also look at the first few bytes and see what they suggest.
Summary
The first level of understanding the meaning of bytes in a file is that file's format. A text file has an extremely simple format - start at the beginning, interpreting characters until you reach the end. How you interpret the characters depends on that text file's character encoding. Most formats are more complicated, however, and will likely have encodings nested within them. At some level you have to start extracting abstract information from your bytes and that's where the encodings kick in. But then whatever is being encoded can also have a format that is applied to it. You have a chain of formats and encodings until you get the information that you want.
Let's see if this helps...
A Unix file is just an array of bits (1/0); the minimum unit in a file is 8 bits, i.e. 1 byte. All file interaction is done at no less than the byte level. On most systems nowadays, you don't really have to concern yourself with the maximum size of a file. There are still some small variances between operating systems, but few if any have maximum file sizes of less than 1 GB.
The encoding or format of a file is only dependent on the applications that use it.
There are many common file formats, such as 'Unix ASCII text' and PDF. Most of the files you will come across will have a documented format specification somewhere on the net. For example, the specification of a 'Unix ASCII text file' is:
A collection of ASCII characters where each line is terminated by an end-of-line character. The end-of-line character is written in C++ as std::endl or the quoted "\n". Unix specifies this character as the binary value 012 (octal), i.e. 00001010.
Hope this helps :)
The determination of how to encode/display something is entirely up to the designer of the program. Of course, there are standards for certain types of files - a PDF or JPG file has a standard format for its content. The definition of both PDF and JPG is quite complex.
Text files have at least somewhat of a standard - but how to interpret or use the contents of a text file may be just as complex and confusing as JPEG. The only difference is that the content is (some sort of) text, so you can load it into a text editor and try to make sense of it. But see below for an example line of "text in a database-type application".
In C and C++, there is essentially just one distinction: files are either "binary" or "text" ("not binary"). The difference is in the treatment of "special bits", mostly to do with "endings". A text file will contain end-of-line markers, or newlines ('\n') [more in a bit about newlines], and on some operating systems also "end-of-file marker(s)". For example, in old CP/M the file was sized in blocks of 128 or 256 bytes, so if we had "Hello, World!\n" in a text file, that file would be 128 bytes long, and the remaining 114 bytes would be end-of-file markers. Most modern operating systems track file size in bytes, so there's no need for an end-of-file marker in the file, but C supports many operating systems, both new and old, so the language makes allowance for this. End of file is typically CTRL-Z (DOS, Windows, etc.) or CTRL-D (Unix: Linux, etc.). When the C runtime library hits the end-of-file character, it stops reading and gives the same error code/behaviour as if there were no more file to read.
Line endings or newlines need special treatment because they are not always the same on the OS the file lives on. For example, Windows and DOS use "Carriage Return, Line Feed" (CR, LF - CTRL-M, CTRL-J, ASCII 13 and 10 respectively) as the end of line. In the various forms of Unix (Linux, MacOS X and BSD for example), the line ending is "Line Feed" (LF, CTRL-J) alone. In older MacOS, the line ending was ONLY "Carriage Return". So that you as a programmer don't have to worry about exactly how lines end, the C runtime library translates the "native" line ending to a standardized line ending of '\n' (which translates to "Line Feed" or character value 10). Of course, this means that the C runtime library needs to know that "if there is a CR followed by LF, we should just give out an LF character."
For binary files, we really DO NOT want any translation of the data, just because our pixels happen to be the values 13 and 10 next to each other, doesn't mean we want it merged to a single 10 byte, right? And if the code reads a byte of the value 26 (CTRL-Z) or 4 (CTRL-D), we certainly don't want the input to stop there...
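A small illustration of that text/binary distinction (data.txt is a hypothetical file; the difference only shows up on systems that translate line endings, such as Windows):

#include <cstdio>

int main() {
    FILE *text = std::fopen("data.txt", "r");   // text mode: CR,LF comes back as '\n'
    FILE *raw  = std::fopen("data.txt", "rb");  // binary mode: raw bytes, untouched
    if (!text || !raw) return 1;
    long n_text = 0, n_raw = 0;
    while (std::fgetc(text) != EOF) ++n_text;
    while (std::fgetc(raw)  != EOF) ++n_raw;
    // On Windows, n_text is smaller by one for every CR,LF pair that was
    // collapsed to '\n'; on Unix the two counts come out identical.
    std::printf("text mode read %ld chars, binary mode read %ld bytes\n",
                n_text, n_raw);
    std::fclose(text);
    std::fclose(raw);
}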
Now, if I have a database text file that contains:
10 01353-897617 14000 Mats
You probably have very little idea what that means - I mean you can probably figure out that "Mats" is my name - but it could also be those little cardboard things to go under glasses (aka "Beer-mats") or something to go on the floor, e.g. "Prayer Mats" for Muslims.
The number 10 could be a customer number, article number, "row number" or something like that. 01353-897617 could be just about anything - perhaps my telephone number [no it isn't, but it does indeed resemble it] - but it could also be a "manufacturer's part number" or some form of serial number or some such. 14000? Price per item, number of units in stock, my salary [I hope not!], distance in miles from my address to Sydney in Australia [roughly, I think].
I'm sure someone else, not given anything else could come up with hundreds of other answers.
[The truth is that it's just made up nonsense for the purpose of this answer, except for the bit at the beginning of the "phone number", which is a valid UK area code - the point is to explain that "the meaning of a set of fields in a text-file can only be understood if there is something describing the meaning of the fields"]
Of course the same applies to binary files, except that it's often even harder to figure out what the content is, because of the lack of separators - if you didn't have spaces and dashes in the text above, it would be much harder to know what belongs where, right? There are typically no 'spaces' and other such things in a binary file. It's all down to someone's description or definition in some code somewhere, or something like that.
I hope my ramblings here have given you some idea.
Now what determines how the bits (or bytes) in the file should be decoded/encoded and read and displayed/written?
The format of the file, obviously. If you are reading a BMP file, you have to read the header first, then height*width pixel data. If you are reading a .txt file, just read the characters as-is. Text files can have different encodings, such as UTF-8 or UTF-16.
Some formats, like .png, are compressed, meaning that their raw data takes more space in memory than the file does on disk.
The particular algorithm is chosen depending on various factors. On Windows, it's usually the file extension that matters. On the web, the content type is dominant.
In general, if you try to read a file as the wrong format, you will usually get garbage. You can sometimes force this: try opening a .bmp file in your text editor, for example.
So basically we're talking about text files mainly, right?
Now to the point: when your text editor loads the file into memory, from some information it deduces its file encoding (either you tell it or it has a special file format marker among the first few bytes of the file, or whatever). Then it's the program itself that decides how it treats the raw bytes.
For example, if you tell your text editor to open a file as ASCII, it will treat each byte as an individual character, and it will display the character A whenever encounters the number 65 as the current byte to show, etc (because 65 is the ASCII character code for A).
However, if you tell it to open your file as UTF-16, then it will grab two bytes (more precisely, two octets) at a time, use this so-called "word" as the numeric value to look up, and, for example, display the character ç when the two bytes it read correspond to 231, the Unicode character code of ç.
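In code, the difference is just how many bytes you consume per character (a sketch; the byte values are chosen to match the ç example above):

#include <cstdint>
#include <cstdio>

int main() {
    unsigned char bytes[2] = {0xE7, 0x00};
    // One byte per character (single-byte interpretation):
    std::printf("one byte at a time: %u, %u\n", bytes[0], bytes[1]);
    // Two bytes per character (UTF-16, little-endian): 231 = U+00E7 = 'ç'
    std::uint16_t unit = (std::uint16_t)(bytes[0] | (bytes[1] << 8));
    std::printf("as a UTF-16LE unit: %u (U+%04X)\n", unit, unit);
}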

How to read partial data from large text file in C++

I have a big text file with more than 200,000 lines, and I need to read just a few of them. For instance: lines 10,000 to 20,000.
Important: I don't want to open and scan the full file to extract these lines, because of performance concerns.
Is this possible?
If the lines are fixed length, then it would be possible to seek to a specific byte position and load just the lines you want. If lines are variable length, the only way to find the lines you're looking for is to parse the file and count the number of end-of-line markers. If the file changes infrequently, you might be able to get sufficient performance by performing this parsing once and then keeping an index of the byte positions of each line to speed future accesses (perhaps writing that index to disk so it doesn't need to be done every time your program is run).
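A sketch of the fixed-length case (line_length, the file name and the range are assumptions; it works because every record starts at a predictable byte offset):

#include <fstream>
#include <iostream>
#include <string>

int main() {
    const std::streamoff line_length = 80;       // assumed fixed size, newline included
    const std::streamoff first_line  = 10000;    // 0-based index of first wanted line
    std::ifstream in("big.txt", std::ios::binary);
    if (!in) return 1;
    in.seekg(first_line * line_length);          // O(1) jump, no scanning
    std::string line;
    for (int i = 0; i < 10000 && std::getline(in, line); ++i)
        std::cout << line << '\n';
}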
You will have to search through the file to count the newlines, unless you know that all lines are the same length (in which case you could seek to the offset = line_number * line_size_in_bytes, where line_number counts from zero and line_size_in_bytes includes all characters in the line).
If the lines are variable / unknown length then while reading through it once you could index the beginning offset of each line so that subsequent reads could seek to the start of a given line.
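Building that index is a single pass (a sketch; keep the offsets vector around and reuse it for every subsequent lookup):

#include <fstream>
#include <string>
#include <vector>

// Record the byte offset at which each line starts.
std::vector<std::streamoff> build_line_index(std::ifstream &in) {
    std::vector<std::streamoff> offsets{0};      // line 0 starts at byte 0
    std::string line;
    while (std::getline(in, line))
        offsets.push_back(in.tellg());           // start of the following line
    offsets.pop_back();                          // drop the end-of-file entry
    return offsets;
}

// Later: in.clear(); in.seekg(offsets[n]); std::getline(in, line);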
If these lines are all the same length you could compute an offset for a given line and read just those bytes.
If the lines are varying length then you really have to read the entire file to count how many lines there are. Line terminating characters are just arbitrary bytes in the file.
If the line are fixed length then you just compute the offset, no problem.
If they're not (i.e. a regular CSV file) then you'll need to go through the file, either to build an index or to just read the lines you need. To make the file reading a little faster a good idea would be to use memory mapped files (see the implementation that's part of the Boost iostreams: http://www.boost.org/doc/libs/1_39_0/libs/iostreams/doc/classes/mapped_file.html).
As others noted, if the lines are not of fixed width, it is impossible to do without building an index. However, if you are in control of the file's format, you can get ~O(log(size)) instead of O(size) performance in finding the start line, if you store the line number itself at the beginning of each line, i.e. make the file contents look something like this:
1: val1, val2, val3
2: val4
3: val5, val6
4: val7, val8, val9, val10
With this format of the file, you can quickly find the needed line by binary search: start by seeking to the middle of the file. Read to the next newline. Then read the line and parse the number. If the number is bigger than the target, repeat the algorithm on the first half of the file; if it is smaller than the target line number, repeat it on the second half of the file.
You'd need to be careful about the corner cases (e.g.: your "beginning" of the range and "end" of the range are on the same line, etc.), but for me this approach worked excellently in the past for parsing the logfiles which had the date in it (and I needed to find the lines that are between the certain timestamps).
Of course, this still does not beat the performance of the explicitly built index or the fixed-size records.
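For completeness, a rough sketch of that binary search (it assumes every line starts with "N:" as in the example above; the corner cases mentioned earlier are deliberately left out):

#include <cstdio>
#include <fstream>
#include <string>

// Returns a byte offset that lands at or just before the target line;
// seek there, skip one (possibly partial) line, and start reading.
std::streamoff find_line(std::ifstream &in, long target, std::streamoff size) {
    std::streamoff lo = 0, hi = size;
    while (lo < hi) {
        std::streamoff mid = lo + (hi - lo) / 2;
        in.clear();
        in.seekg(mid);
        std::string partial, line;
        std::getline(in, partial);               // discard the cut-off line
        if (!std::getline(in, line)) { hi = mid; continue; }
        long number = 0;
        std::sscanf(line.c_str(), "%ld:", &number);
        if (number < target) lo = mid + 1;
        else                 hi = mid;
    }
    return lo;
}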