How to detect codeword length for LZW decoding - C++

I'm writing a general LZW decoder in C++ and I'm having trouble finding documentation on the length (in bits) of the codewords used. Some articles I've found say that codewords are 12 bits long, others say 16 bits, and still others say that a variable bit length is used. So which is it? It would make sense to me that the bit length is variable, since that would give the best compression (i.e. initially start with 9 bits, then move to 10 bits when necessary, then to 11, etc.). But I can't find any "official" documentation on what the industry standard is.
For example, say I open up Microsoft Paint, create a simple 100x100-pixel all-black image, and save it as a TIFF. The image is stored in the TIFF using LZW compression. In this scenario, when I'm parsing the LZW codewords, should I read 9 bits, 12 bits, or 16 bits for the first codeword? And how would I know which to use?
Thanks for any help you can provide.

LZW can be done in any of these ways. By far the most common (at least in my experience) is to start with 9-bit codes, then, when the dictionary gets full, move to 10-bit codes, and so on up to some maximum size.
From there, you typically have a couple of choices. One is to clear the dictionary and start over. Another is to continue using the current dictionary, without adding new entries. In the latter case, you typically track the compression rate, and if it drops too far, then you clear the dictionary and start over.
I'd have to dig through docs to be sure, but if I'm not mistaken, the specific implementation of LZW used in TIFF starts at 9 and goes up to 12 bits (when it was being designed, MS-DOS was a major target, and the dictionary for 12-bit codes used most of the available 640K of RAM). If memory serves, it clears the table as soon as the last 12-bit code has been used.
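To make the variable-width scheme concrete, here is a minimal decoder sketch, assuming TIFF's conventions: MSB-first bit packing, codes 256 and 257 reserved as Clear and EndOfInformation, and widths growing from 9 to 12 bits. The "one entry early" widening reflects TIFF's well-known off-by-one convention; other LZW variants widen one code later. The names are illustrative, not from any particular library.

#include <cstdint>
#include <string>
#include <vector>

class BitReader {
public:
    BitReader(const uint8_t* data, size_t size) : data_(data), size_(size) {}

    // Read 'width' bits, MSB-first (the packing order TIFF uses).
    // Returns 257 (EndOfInformation) if the input is exhausted.
    uint32_t read(int width) {
        uint32_t code = 0;
        for (int i = 0; i < width; ++i) {
            size_t byte = bitpos_ / 8;
            if (byte >= size_) return 257;
            code = (code << 1) | ((data_[byte] >> (7 - bitpos_ % 8)) & 1u);
            ++bitpos_;
        }
        return code;
    }
private:
    const uint8_t* data_;
    size_t size_;
    size_t bitpos_ = 0;
};

std::string lzw_decode(const uint8_t* data, size_t size) {
    const uint32_t kClear = 256, kEOI = 257;
    BitReader in(data, size);
    std::vector<std::string> table;
    std::string out, prev;
    int width = 9;

    auto reset = [&] {                       // (re)build the initial dictionary
        table.assign(258, "");               // 256/257 stay as placeholders
        for (int i = 0; i < 256; ++i) table[i] = std::string(1, char(i));
        width = 9;
        prev.clear();
    };
    reset();

    for (;;) {
        uint32_t code = in.read(width);
        if (code == kEOI) break;
        if (code == kClear) { reset(); continue; }

        std::string entry;
        if (code < table.size())  entry = table[code];
        else if (!prev.empty())   entry = prev + prev[0];  // the classic KwKwK case
        else break;                                        // corrupt stream
        out += entry;

        if (!prev.empty()) table.push_back(prev + entry[0]);
        prev = entry;

        // TIFF widens the code one entry "early"; other LZW variants
        // widen when the table size reaches 1 << width.
        if (table.size() == (1u << width) - 1 && width < 12) ++width;
    }
    return out;
}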

Related

C++: how to make a GIF from a BMP

I need to generate a GIF from BMP images to animate the Abelian sandpile model, using only the C++ standard library.
Ideally, your starting points would be specifications for GIF and BMP.
The GIF specification is a pretty easy thing to find.
Unfortunately (at least to the best of my knowledge), Microsoft has never brought all the information about the BMP format together into a single document to act as a specification. There's a lot of documentation in various places, but no one place that has all of it together and completely organized (at least none of which I'm aware).
That means you're kind of stuck with a piecemeal approach. Fortunately, you probably don't need to read every possible legitimate BMP file--it's been around a long time, so there are lots of variations, many of which are rarely used any more (e.g., 16-color bitmaps).
At a guess, you probably only need to deal with one or two specific variants (e.g., 24 or 32-bits per pixel), which makes life a great deal easier. Here's a page that gives at least a starting point for documentation on how BMP files are formatted.
You'll probably need to consider at least a few ancillary problems, though. Unless your input BMP files use 8 bits per pixel, with a palette defining the color associated with each of those 256 values, you're going to have at least one other problem: you'll most likely be starting with a file that has lots of colors (e.g., as noted above, 24 or 32 bits per pixel), but for a GIF file you need to reduce that to at most 8 bits per pixel. So you'll need to choose the 256 colors that best represent those in the pictures you care about, and then, for each input pixel, pick one of those 256 colors to represent that pixel as well as possible.
Depending on how much you care about color fidelity vs. spatial resolution, there are multitudes of ways of doing this job, varying from fairly simple (but with results that may be rather mediocre) to extremely complex (with results that may be somewhat better, but will probably still be fairly mediocre).
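As a concrete starting point for that last step, here is a minimal sketch of mapping each high-color pixel to the nearest entry of an already-chosen 256-color palette, using squared Euclidean distance in RGB space. Choosing the palette itself (e.g. by median cut or an octree) is the harder part and is not shown; the types and names here are illustrative.

#include <array>
#include <cstdint>
#include <vector>

struct RGB { uint8_t r, g, b; };

// Index of the palette entry closest to 'px' in RGB space.
uint8_t nearest_palette_index(RGB px, const std::array<RGB, 256>& palette) {
    int best = 0;
    long best_d = 1L << 30;
    for (int i = 0; i < 256; ++i) {
        long dr = px.r - palette[i].r;
        long dg = px.g - palette[i].g;
        long db = px.b - palette[i].b;
        long d = dr * dr + dg * dg + db * db;
        if (d < best_d) { best_d = d; best = i; }
    }
    return static_cast<uint8_t>(best);
}

// Convert a whole image: one palette index per input pixel.
std::vector<uint8_t> quantize(const std::vector<RGB>& pixels,
                              const std::array<RGB, 256>& palette) {
    std::vector<uint8_t> out;
    out.reserve(pixels.size());
    for (RGB px : pixels) out.push_back(nearest_palette_index(px, palette));
    return out;
}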

Compress many versions of a text with fast access to each

Let's say I store many versions of a source code file in a source code repository - maybe 500 historic versions of a 50k source file. So storing the versions directly would take about 12.5 MB (assuming the file grew linearly over time). Naturally though, there is ample room for compression as there will only be slight differences between most successive versions.
What I want is compact storage as well as reasonably quick extraction of any of the versions at any time.
So we would probably store a list of oft-occurring text chunks, and each version would just contain pointers to the chunks it is made of. To make this really compact, text chunks could themselves be defined as concatenations of other chunks.
Is there a well-established compression algorithm that produces this kind of structure? I was not sure what term to search for.
(Bonus points if adding a new version is faster than recompressing the whole set of versions.)
What you want is called "git". In fact, that is exactly what you want. Including bonus points.
Seeing as there were no usable answers, I came up with my own format today to demonstrate what I mean. I am storing 850 versions of a source file about 20k in size. Usually from one version to the next just one line was added (but there were other changes as well).
If I store these 850 versions in a .zip, it is 4.2 MB big. I want less than that, way less.
My format is line-based. Basically each file version is stored as a list of pointers into a table. Each table entry is either:
a literal line,
or a pair of pointers into the table.
In the second case, in decompression, the two pointers have to be followed successively.
Not sure if this description makes sense to you right away, but the thing works.
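For illustration, here is a minimal sketch of what the decompression side of such a format could look like, under my reading of the description above: a table whose entries are either literal lines or pairs of table indices, expanded recursively. The names and exact layout are my own assumptions, not the author's actual code.

#include <iostream>
#include <string>
#include <vector>

struct Entry {
    bool is_literal;
    std::string line;        // used if is_literal
    size_t first, second;    // indices into the table otherwise
};

// Append the expansion of entry 'i' to 'out', following pairs
// of pointers recursively.
void expand(const std::vector<Entry>& table, size_t i, std::string& out) {
    const Entry& e = table[i];
    if (e.is_literal) {
        out += e.line;
        out += '\n';
    } else {
        expand(table, e.first, out);
        expand(table, e.second, out);
    }
}

// A file version is just a list of pointers into the table.
std::string extract_version(const std::vector<Entry>& table,
                            const std::vector<size_t>& version) {
    std::string out;
    for (size_t i : version) expand(table, i, out);
    return out;
}

int main() {
    std::vector<Entry> table = {
        {true, "int x = 1;", 0, 0},   // 0: a literal line
        {true, "int y = 2;", 0, 0},   // 1: another literal line
        {false, "", 0, 1},            // 2: lines 0 and 1 concatenated
    };
    std::vector<size_t> version1 = {2};   // this version is both lines
    std::cout << extract_version(table, version1);
}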
The compressor generates a single text file from which each of the 850 versions can be extracted instantly. This text file has a size of 45k.
Finally, we can simply gzip this file, which gets us down to 18.5k. Quite an improvement over 4.2 MB!
The compressor uses a very simple but effective way to find repeating combinations of lines.
So the answer to the initial question is that there is an algorithm that combines inter-file compression (like .tar.gz) with instant extraction of any contained file (like .zip).
I still don't know what this class of compression algorithms is called.

Can we move the pointer of ofstream back and forth for output to file? [closed]

Closed. This question needs details or clarity. It is not currently accepting answers. Closed 4 years ago.
I need to output results to a file that has a predefined format. The columns look like this:
TIME COL1 COL2 COL3 COL4 ...
I am using ofstream. I have to output the results line by line. However, results for certain columns may not be available at a given time. The results may also not arrive in sorted order.
I can control the spacing between the columns while initially specifying the headers.
I guess my question is: is it possible to move the ofstream pointer back and forth horizontally within a line?
What I tried so far:
1) Find the current position of the ofstream pointer:
long pos = fout.tellp();
2) Calculate the position to shift to, based on the spacing:
long spacing = column_spacing * column_number;
long newpos = pos + spacing;
3) Then use seekp() to move the pointer:
fout.seekp(newpos);
4) Provide the output:
fout << "output";
This does not work. Basically, the pointer does not move. The idea is to make my ofstream fout move back and forth if possible. I would appreciate any suggestions on how to control it.
Some information about the output: I am computing the elevation angle of GPS satellites in the sky over time. Hence there are 32 columns, one per GPS satellite. At any point in time not all satellites are visible, hence the need to skip some satellites/columns. Also, the list of satellite elevations may not be arranged in ascending order, due to the limitations of the observation file. I hope that helps in drawing the situation.
An example of the desired output. The header (TIME, SAT1, ... SAT32) is written prior to the output of the results and is not part of the question here. The spacing between the columns is controlled when the headers are defined (let's say 15 spaces between columns). The output can be truncated to 1 decimal place. A new line starts once all results at the current time t are written. Then I process the observations for time t+1 and write the outputs again, and so on; the writing occurs in an epoch-wise manner. Satellite elevations are stored in a vector<double> and satellite numbers in a vector<int>; both vectors have the same length. I just need to write them to a file. For the example below, time is in seconds and satellite elevation is in degrees:
TIME SAT1 SAT2 SAT3 ... SAT12 SAT13 ... SAT32
1 34.3 23.2 12.2 78.2
2 34.2 23.1 12.3 78.2
3 34.1 11.3 23.0 78.3
And so on... As you may notice, satellite elevations may or may not be available; it all depends on the observations. Let's also assume that the size of the output and efficiency are not priorities here. Based on 24 hours of observations, the output file size can reach up to 5-10 MB.
Thanks for your time in advance!
Can we move the pointer of ofstream back and forth for output to file?
No. You probably don't want to do that (even if it might be doable in principle, it would be inefficient, very brittle to code, and nearly impossible to debug), in particular for textual output whose width is variable (I am guessing that your COLi fields could have variable width, as is usual in most textual formats). It looks like your approach is wrong.
The general way is to build in memory, as some graph of "objects" or "data structure", the entire representation of your output file. This is generally enough, unless you really need to output something huge.
If your typical textual output is of reasonable size (a few gigabytes at most) then representing the data as some internal data structure is worthwhile and it is very common practice.
If your textual output is huge (dozens of gigabytes or terabytes, which is really unlikely), then you won't be able to represent it in memory (unless you have a costly computer with a terabyte of RAM). However, you could use some database (perhaps sqlite) to serve as internal representation.
In practice, textual formats are always output in sequence (from some internal representation), and textual files in those formats have a reasonable size (it is uncommon to have a textual file of many gigabytes today; in such cases, databases, or splitting the output file into several pieces in some directory, are better).
Without specifying your textual format precisely (e.g. using EBNF notation), giving an example, and providing some estimate of the output size, your question is too broad, and you can only get hints like the above.
the output file size can reach upto 5-10 MB's
This is really tiny on current computers (even a cheap smartphone has a gigabyte of RAM). So build the data structure in memory, and output it at once when it is completed.
Which data structures you should use depends on your actual problem (the inputs your program gets, and the precise output you want it to produce). Since you don't specify your program in your question, we cannot help much. Probably C++ standard containers and smart pointers could be useful (but this is just a guess).
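As a concrete illustration of the "build it in memory, write it sequentially" approach, here is a minimal sketch using the asker's satellite example: accumulate a map from satellite number to elevation for each epoch, then write fixed-width columns with std::setw, leaving blanks for missing satellites. The 15-character column width and the file name are assumptions taken from the question; note that a std::map also keeps the satellites sorted automatically, which sidesteps the unsorted-input issue.

#include <fstream>
#include <iomanip>
#include <map>
#include <string>
#include <utility>
#include <vector>

int main() {
    const int kWidth = 15;                    // assumed column spacing
    const int kNumSats = 32;

    // One entry per epoch: time, plus {satellite number -> elevation}.
    // In the real program these would be filled in as epochs are processed.
    std::vector<std::pair<int, std::map<int, double>>> epochs = {
        {1, {{1, 34.3}, {2, 23.2}, {3, 12.2}, {13, 78.2}}},
        {2, {{1, 34.2}, {2, 23.1}, {3, 12.3}, {13, 78.2}}},
    };

    std::ofstream fout("elevations.txt");     // hypothetical file name
    fout << std::fixed << std::setprecision(1);

    // Header row: TIME, SAT1 .. SAT32, each right-aligned in kWidth columns.
    fout << std::setw(kWidth) << "TIME";
    for (int s = 1; s <= kNumSats; ++s)
        fout << std::setw(kWidth) << ("SAT" + std::to_string(s));
    fout << '\n';

    // Data rows: a blank field where a satellite was not observed.
    for (const auto& [time, elev] : epochs) {
        fout << std::setw(kWidth) << time;
        for (int s = 1; s <= kNumSats; ++s) {
            auto it = elev.find(s);
            if (it != elev.end()) fout << std::setw(kWidth) << it->second;
            else                  fout << std::setw(kWidth) << "";
        }
        fout << '\n';
    }
}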
You should read some introduction to programming (like SICP), then some good C++ programming book and read some good Introduction to Algorithms. You probably need to read something about compilation techniques (since they include parsing and outputting structured data), like the Dragon Book. Learning to program takes a lot of time.
C++ is really a very difficult programming language, and I believe it is not the best way to learn programming. Once you have learned a bit how to program, invest your time in learning C++. Your issue is not with std::ostream or C++ but with designing your program and its architecture correctly.
BTW, if the output of your program is feeding some other program (and is not only or mostly for human consumption), you might use some established textual format, perhaps JSON, YAML, CSV, or XML.
2 34.2 23.1 12.3 78.2
How significant are the spaces in the above line (what would happen if a space were inserted after the first 2 and another space removed after 12.3)? Can a wide number like 3.14159265358979323846264 appear in your output? Or how many digits do you want? That should be documented precisely somewhere! Are you allowed to improve the output format above? You might perhaps use some sign like ? for missing numbers; that would make the output less ambiguous, more readable for humans, and easier to parse by other programs.
You need to define precisely (in English) the behavior of your program, including its input and output formats. An example of input and output is not a specification (it is just an example).
BTW, you may also want to code your program to provide several different output formats. For example, you could decide to provide CSV format for use in spreadsheets, JSON format for other data processing, gnuplot output to get nice figures, LaTeX output to be able to insert your output into some technical report, HTML output to be usable through a browser, etc. Once you have a good internal representation (as convenient data structures) of your computed data, outputting it in various formats is easy and very convenient.
Probably your domain (satellite processing) has defined some widely used data formats. Study them in detail (at least for inspiration on specifying your own output format). I am not at all an expert on satellite data, but with Google I quickly found examples like the GEOSCIENCE AUSTRALIA (CCRS) LANDSAT THEMATIC MAPPER DIGITAL DATA FORMAT DESCRIPTION (more than a hundred pages). You should specify your output format as precisely as they do (perhaps several dozen pages of English, with a few pages of EBNF); EBNF is a convenient notation for that (with a lot of additional explanation in English).
Look also for inspiration into other output data format descriptions.
You probably should, if you invent your output format, publish its specification (in English) so that other people could code programs taking your output as input to their code.
In many domains, data is much more valuable (i.e. costs much more, in € or US$) than the code processing it. This is why its format should be precisely documented. You need to specify the format so that a future programmer in 2030 could easily write a parser for it. So details matter a great deal. Specify your output format unambiguously and in great detail (in some English document).
Once you have specified that output format, coding the output routines from some good-enough internal data representation is easy work (and doesn't require insane tricks like moving the file offset of the output). A good specification of the output format is also a guideline for designing your internal data representations.
Is it possible to move the ofstream pointer back and forth horizontally per line?
It might be doable, but it is so inefficient and error-prone (and nearly impossible to debug) that in practice you should never do it. Instead, specify your output in detail and code simple sequential output routines, as all software dealing with textual formats does.
BTW, today we use UTF-8 everywhere in textual files, and a single UTF-8 encoded Unicode character might span one byte (e.g. for a digit like 0 or a Latin letter like E) or several bytes (e.g. for accented letters like é, Cyrillic letters like я, or symbols like ∀), so replacing a single UTF-8 character with another one could mean inserting or deleting bytes.
Notice that current file systems do not allow inserting or deleting a span of bytes in the middle of a file (for example, on Linux there is no such syscall; see syscalls(2)) and do not really know about lines (the end of line is just a convention, e.g. the \n byte on Linux). Programs that appear to do this (like your favorite source code editor) always represent the data in memory. Today a file is a sequence of bytes; from the operating system's point of view, you can only append bytes at its end or overwrite bytes in the middle, but inserting or deleting a span of bytes in the middle of a file is not possible. That is why a textual file is, in practice, always written sequentially, from start to end, without moving the current file offset (other than appending bytes at the end).
(if this is homework for some CS college or undergraduate course, I guess that your teacher is expecting you to define and document your output format)

Choosing Symbols in an efficient way, in Arithmetic Code Algorithm, C++

I'm trying to implement the arithmetic coding algorithm to compress binary images (JPG images transformed to a binary base using OpenCV). The problem is that I have to save, in the compressed file, the encoded string as well as the symbols I used to generate it and their frequencies, so that I am able to decode it. The symbols take a lot of space even when I transform them to ASCII, and if I try to use fewer characters per symbol, the encoded string becomes bigger. So I wonder if there's an efficient way to save the symbols in the compressed file with the minimum possible size. I would also like to know the most efficient way to choose the symbols from the original file.
Thanks in advance :)
325,592,005 bytes is 310 megabytes. You managed to compress this image into 2.8 + 6.1 = 8.9 megabytes, so you decreased the size by 97%. That's a good result and I wouldn't worry here. Besides, 6.1 megabytes of 64-bit symbols means you have around 800K of them, which is much less than the maximum possible number of distinct 64-bit symbols, i.e. 2^64. That is again a good result.
As to your question about using multiple compression algorithms: first, you have to know that in the case of lossless compression, the optimal number of bits per symbol is equal to the entropy. And arithmetic coding is close to optimal (see this, this or this). It means that there is not much sense in using more than one algorithm in sequence, if one of them is arithmetic coding.
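To make the entropy bound concrete, here is a small sketch computing the Shannon entropy H = -Σ p_i log2(p_i) from symbol frequencies; H times the number of symbols approximates the best achievable lossless size in bits, before adding the cost of storing the model (the symbols and their frequencies). The example frequencies are made up.

#include <cmath>
#include <cstdio>
#include <vector>

// Shannon entropy in bits per symbol, given raw symbol counts.
double entropy_bits_per_symbol(const std::vector<long long>& freq) {
    long long total = 0;
    for (long long f : freq) total += f;
    double h = 0.0;
    for (long long f : freq) {
        if (f == 0) continue;
        double p = static_cast<double>(f) / total;
        h -= p * std::log2(p);
    }
    return h;
}

int main() {
    // A heavily skewed two-symbol source compresses far below
    // 1 bit per symbol; arithmetic coding gets close to this bound.
    std::vector<long long> freq = {950, 50};
    std::printf("H = %.4f bits/symbol\n", entropy_bits_per_symbol(freq));
}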
As to arithmetic coding vs. Huffman codes: the latter is actually a special case of the former, and as far as I know arithmetic coding is always at least as good as Huffman codes.
It is also worth adding one thing: if you can accept lossy compression, there is actually no limit on the compression ratio. In other words, you can compress the data as much as you want, as long as the quality loss is still acceptable to you. However, even in this case, using multiple compression algorithms is not required; one is enough.

How to check if a char is valid in C++

I need a program that reads the contents of a file and writes them into another file, but only the characters that are valid UTF-8. The problem is that the file may come in any encoding, and the contents of the file may or may not correspond to that encoding.
I know it's a mess, but that's the data I get to work with. The files I need to "clean" can be as big as a couple of terabytes, so I need the program to be as efficient as humanly possible. Currently I'm using a program I wrote in Python, but it takes as long as a week to clean 100 GB.
I was thinking of reading the characters with the wide-character (wchar_t) functions, then managing them as integers and discarding all the numbers that are not in some valid range. Is this the optimal solution?
Also, what's the most efficient way to read and write in C/C++?
EDIT: The problem is not the IO operations; that part of the question is intended as extra help to get an even quicker program. The real issue is how to identify non-UTF-8 characters quickly. Also, I have already tried parallelization and RAM disks.
UTF-8 is just a nice way of encoding characters and has a very clearly defined structure, so fundamentally it is reasonably simple to read a chunk of memory and validate that it contains UTF-8. Mostly this consists of verifying that certain bit patterns do NOT occur, such as the bytes C0, C1, and F5 to FF (depending on position).
It is reasonably simple in C (sorry, I don't speak Python) to code something that is a simple fopen/fread and check the bit patterns of each byte, although I would recommend finding some code to cut/paste (e.g. http://utfcpp.sourceforge.net/, though I haven't used those exact routines), as there are some caveats and special cases to handle. Just treat the input bytes as unsigned char and bitmask them directly. I would paste what I use, but I'm not in the office.
A C program will rapidly become IO-bound, so the suggestions about IO will apply if you want ultimate performance. However, direct byte inspection like this will be hard to beat in performance if you do it right. UTF-8 is nice in that you can find sequence boundaries even if you start in the middle of the file, so this lends itself nicely to parallel algorithms.
If you build your own, watch for BOMs (byte order marks) that might appear at the start of some files.
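Here is a minimal sketch of the byte-pattern validation described above, following the byte-range rules of RFC 3629 (linked below); the function names are mine, not from utfcpp. A real chunked reader must also take care not to split a multi-byte sequence at a chunk boundary.

#include <cstddef>
#include <cstdio>

static inline bool cont(unsigned char b) { return (b & 0xC0) == 0x80; }

// Length of the valid UTF-8 sequence starting at p (n bytes available),
// or 0 if it is not valid.
int utf8_seq_len(const unsigned char* p, size_t n)
{
    if (n == 0) return 0;
    unsigned char b0 = p[0];
    if (b0 < 0x80) return 1;                       // ASCII
    if (b0 < 0xC2) return 0;                       // stray continuation / overlong lead
    if (b0 < 0xE0)                                 // C2..DF: 2-byte sequence
        return (n >= 2 && cont(p[1])) ? 2 : 0;
    if (b0 < 0xF0) {                               // E0..EF: 3-byte sequence
        if (n < 3 || !cont(p[1]) || !cont(p[2])) return 0;
        if (b0 == 0xE0 && p[1] < 0xA0) return 0;   // overlong
        if (b0 == 0xED && p[1] > 0x9F) return 0;   // UTF-16 surrogate range
        return 3;
    }
    if (b0 < 0xF5) {                               // F0..F4: 4-byte sequence
        if (n < 4 || !cont(p[1]) || !cont(p[2]) || !cont(p[3])) return 0;
        if (b0 == 0xF0 && p[1] < 0x90) return 0;   // overlong
        if (b0 == 0xF4 && p[1] > 0x8F) return 0;   // beyond U+10FFFF
        return 4;
    }
    return 0;                                      // F5..FF are never valid
}

// Copy only the valid UTF-8 sequences from an in-memory buffer to 'out'.
void filter_utf8(const unsigned char* buf, size_t n, std::FILE* out)
{
    size_t i = 0;
    while (i < n) {
        int len = utf8_seq_len(buf + i, n - i);
        if (len) { std::fwrite(buf + i, 1, len, out); i += len; }
        else     { ++i; }                          // drop the invalid byte
    }
}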
Links
http://en.wikipedia.org/wiki/UTF-8 nice clear overview with table showing valid bit patterns.
https://www.rfc-editor.org/rfc/rfc3629 the rfc describing utf8
http://www.unicode.org/ homepage of the Unicode Consortium.
Your best bet, in my view, is to parallelize. If you can parallelize the cleaning and clean many chunks of content simultaneously, the process will be more efficient. I'd look into a framework for parallelization, e.g. MapReduce, where you can multithread the task.
I would look at memory-mapped files. This is something I know from the Microsoft world; I'm not sure if it exists in Unix etc., but it likely does.
Basically, you open the file and point the OS at it, and it loads the file (or a chunk of it) into memory, which you can then access via a pointer. For a 100 GB file, you could load perhaps 1 GB at a time, process it, and then write to a memory-mapped output file.
http://msdn.microsoft.com/en-us/library/windows/desktop/aa366556(v=vs.85).aspx
http://msdn.microsoft.com/en-us/library/windows/desktop/aa366542(v=vs.85).aspx
This should, I would think, be the fastest way to perform big I/O, but you would need to test it to say for sure.
HTH, good luck!
Unix/Linux and other POSIX-compliant OSs support memory mapping (mmap) too.
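For the POSIX side, here is a minimal sketch of memory-mapping a file for reading with mmap, with abbreviated error handling; the windowed-mapping strategy for huge files mentioned above is noted in a comment rather than implemented.

#include <cstdio>
#include <fcntl.h>      // open
#include <sys/mman.h>   // mmap, munmap
#include <sys/stat.h>   // fstat
#include <unistd.h>     // close

int main(int argc, char** argv)
{
    if (argc < 2) { std::fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) { close(fd); return 1; }

    // Map the whole file read-only. For multi-terabyte inputs you would
    // map and process a window (say 1 GB) at a time instead.
    void* map = mmap(nullptr, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    const unsigned char* data = static_cast<const unsigned char*>(map);
    // ... scan data[0 .. st.st_size) here, e.g. with a UTF-8 validator ...
    (void)data;

    munmap(map, (size_t)st.st_size);
    close(fd);
    return 0;
}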