Compressing an array of two values

What is the cheapest way to compress an array into a byte[] array, given that it contains only two possible values?
Array length has no limit.
The best idea I have seen so far is to record the number of times each value repeats, so for example the array "11111001" compresses to "521".
I wonder if there is a better way.
Thanks.

First off, convert to a bit array. Right off the bat it will take 1/8th the space.
Then you'd need to examine your data to see if there is any apparent redundancy. If not, you're done. If there is redundancy, you'd need to figure out a way to model it, and then compress it. Run-length coding as you propose is useful only if long runs of zeros and/or ones are common in the data.
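As a starting point, here is a minimal C++ sketch of packing a two-valued array into bytes, assuming the two values have already been mapped to 0 and 1 (function names are illustrative):

    #include <cstdint>
    #include <vector>

    // Pack a sequence of 0/1 values into bytes, 8 values per byte.
    std::vector<uint8_t> packBits(const std::vector<uint8_t>& values) {
        std::vector<uint8_t> packed((values.size() + 7) / 8, 0);
        for (std::size_t i = 0; i < values.size(); ++i)
            if (values[i])
                packed[i / 8] |= static_cast<uint8_t>(1u << (i % 8));
        return packed;
    }

    // Read back the i-th logical value.
    bool unpackBit(const std::vector<uint8_t>& packed, std::size_t i) {
        return (packed[i / 8] >> (i % 8)) & 1u;
    }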

Related

What's the most efficient way to store a subset of column indices of a big matrix in C++?

I am working with a very big matrix X (say, 1,000-by-1,000,000). My algorithm goes as follows:
1. Scan the columns of X one by one and, based on some filtering rules, identify only a subset of columns that are needed. Denote the subset of column indices by S. Its size depends on the filter, so it is unknown before the computation and will change if the filtering rules are different.
2. Loop over S and do some computation with a column x_i if i is in S. This step needs to be parallelized with OpenMP.
3. Repeat steps 1 and 2 100 times with changed filtering rules, defined by a parameter.
I am wondering what the best way is to implement this procedure in C++. Here are two ways I can think of:
(a) Use a 0-1 array (with length 1,000,000) to indicate the needed columns in step 1; then in step 2 loop over 1 to 1,000,000, check the indicator, and do the computation only if the indicator is 1 for that column;
(b) Use a std::vector for S and push_back the column index whenever it is identified as needed; then loop only over S, each time extracting a column index from S and doing the computation. (I thought about this approach, but I have heard that push_back is expensive even when just storing integers.)
Since my algorithm is very time-consuming, I assume a little time saving in the basic step would mean a lot overall. So my question is: should I try (a), (b), or some even better way to get better performance (and to work well with OpenMP)?
Any suggestions/comments for achieving better speedup are very appreciated. Thank you very much!
To me, it seems that step #1 really does not matter much; at the end of the day, you're going to wind up with a set of columns, however it is represented.
What's really going to matter is what happens when you unleash the parallelized step #2.
An array of ones and zeros, however large, should be fairly simple to work on in parallel, while a more 'advanced' data structure might well, in this case, just get in the way.
One thousand megabits, these days? Sure, done, no problem (and if not, a simple array of bit-sets). However many simultaneously executing entities there are, they should be able to navigate such a data structure in parallel with a minimum of conflict, so my gut says big bit-sets win.
I think you will find std::vector easier to use. Regarding push_back, the cost is incurred when the vector reallocates (and maybe copies) the data. To avoid that (if it matters), you could call vector::reserve(1000000) up front. Your vector is then 8 MB, insignificant compared to your problem size. It's only an order of magnitude or two bigger than a bitmap would be, and a lot simpler to deal with: if we call your vector S, then accessing the ith interesting column is just x[S[i]].
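A sketch of option (b) along those lines, with reserve() called up front (passes_filter and process_column are again hypothetical placeholders):

    #include <cstddef>
    #include <vector>

    // Hypothetical placeholders for the filtering rule and per-column computation.
    bool passes_filter(std::size_t col) { return col % 7 == 0; }
    void process_column(std::size_t col) { (void)col; /* real computation here */ }

    void run(std::size_t n_cols) {
        std::vector<std::size_t> S;
        S.reserve(n_cols);                      // avoid reallocations during push_back
        for (std::size_t j = 0; j < n_cols; ++j)
            if (passes_filter(j))
                S.push_back(j);                 // step 1: collect only the needed indices

        #pragma omp parallel for                // step 2: loop over S only
        for (long long k = 0; k < (long long)S.size(); ++k)
            process_column(S[k]);               // column access is just x[S[k]]
    }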
(Based on my gut feeling) I'd probably go for pushing back into a vector, but the answer is quite simple: Measure both methods (they are both trivial to implement). Most likely you won't see a noticeable difference.

"Tricks" to filling a "rolling" time-series array absent brute-force pushback of all values each iteration

My application is financial, in C++, Visual Studio 2003.
I'm trying to maintain an array of the last (x) values for an observation, and as each new value arrives I have a loop that first pushes all of the other values back and then adds the new value at the front.
It's computationally intensive, and I've been trying to be clever and come up with some way around this. I've probably either stated an oxymoronic problem or reached the limit of my intellect, or both.
Here's an idea I have:
Suppose it's 60 seconds of data, and new data arrive each second. Suppose we have an integer between 0 and 59 that will serve to index an element of the array. Suppose each second, when the data arrives, we first increment the integer (wrapping from 59 back to 0) and then overwrite the element of the array at that index with the new data. Then, in our calculations, we refer to the same integer as the base, work backwards to zero, then jump to 59 and continue back down again. The formulas in the math would be a bit more tedious to write. But my application does a lot of these pushback/fills of arrays, each second for several data points, and each array has 3600 elements per data series (one hour of seconds).
Does the board think this is a good idea? Or am I being silly here?
What you're describing is nothing more than a circular buffer. There's an implementation in Boost, and probably in other libraries as well, and a good algorithm description on Wikipedia (http://en.wikipedia.org/wiki/Circular_buffer). And yes, it's a good solution for the problem you describe.
You could use modulo as you suggested (hint: x % y is the syntax for "x" modulo "y"), or you could maintain two buffers where you essentially swap which one is the current data to be read and which one is the stale data that is to be overwritten. For copying large quantities of plain-old-data (POD), you should take a look at memcpy. Of course, in quantitative trading, people will do all sorts of things to get a speed edge, including custom hardware that allows one to bypass several layers of copying.
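A minimal C++ sketch of such a modulo-indexed circular buffer, sized for one hour of one-second data (class and member names are illustrative):

    #include <cstddef>

    class RollingSeries {
    public:
        RollingSeries() : head_(0) {
            for (std::size_t i = 0; i < N; ++i) data_[i] = 0.0;
        }

        void push(double v) {
            head_ = (head_ + 1) % N;    // advance and wrap instead of shifting everything
            data_[head_] = v;
        }

        // Value observed 'ago' seconds ago (ago == 0 is the newest sample).
        double at(std::size_t ago) const {
            return data_[(head_ + N - (ago % N)) % N];
        }

    private:
        static const std::size_t N = 3600;   // one hour of one-second samples
        double data_[N];
        std::size_t head_;                   // index of the most recent value
    };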
Are you sure you are talking about arrays? They don't have a "push" operation - it sounds more like a std::vector.
Anyway, here is a solution for what I think you want:
If I understood it right, you want a collection of 3600 elements where, each second, the oldest element drops off and a new element is added.
So you should use a linked-list-backed queue for that task; both operations are performed in O(1).
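A small sketch of that queue idea using std::list (names are illustrative; a deque-backed std::queue would work the same way):

    #include <list>

    // Keep the last 3600 values; drop the oldest whenever a new one arrives.
    class LastHour {
    public:
        void push(double v) {
            values_.push_front(v);           // newest at the front
            if (values_.size() > 3600)
                values_.pop_back();          // oldest falls off the back
        }
        const std::list<double>& values() const { return values_; }

    private:
        std::list<double> values_;
    };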

How to handle very large matrices (e.g. 1000000 by 1000000)

My question is very general, and it's not a duplicate.
When we declare something like int mat[1000000][1000000];
it will certainly give an error saying the matrix size is too large.
I have seen many problems on competitive programming websites where we need to declare a 2D matrix with 10^6 rows and columns, and I know there is always some trick associated with it to reduce the matrix size.
So I just want to ask: what are the possible ways or tricks we can use in such cases to minimize the size? I mean, which types of algorithms are generally required to solve it, like DP or anything else?
In DP, if the current row depends only on the previous row, you can use int mat[2][1000000];. After calculating the current row, you can immediately discard the previous row and switch current and previous.
Sometimes it is possible to use std::map instead of a 2D array.
I have encountered many such questions in programming contests, and the solution differs on a case-by-case basis, so if you mention a specific case, I can possibly give you a better targeted solution.
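As an illustration of the two-row trick (the recurrence here, counting monotone lattice paths, is just an example and not from the question), something like:

    #include <cstdio>
    #include <vector>

    // Two rows of 10^6 long longs is about 16 MB, versus terabytes for the full matrix.
    int main() {
        const int R = 1000, C = 1000000, MOD = 1000000007;
        std::vector<std::vector<long long> > mat(2, std::vector<long long>(C, 1));

        for (int r = 1; r < R; ++r) {
            int cur = r & 1, prev = cur ^ 1;    // switch which row is "current"
            mat[cur][0] = 1;
            for (int c = 1; c < C; ++c)
                mat[cur][c] = (mat[prev][c] + mat[cur][c - 1]) % MOD;
        }
        std::printf("%lld\n", mat[(R - 1) & 1][C - 1]);
        return 0;
    }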
That depends very much on the specific task. There is no universal "trick" that will always work. You'll have to look for something in the particular problem that allows you to solve it in a different way.
That said, if I could really see no other way, I'd start thinking about how many elements of that matrix will really be non-zero (perhaps I can use a sparse array or a map (dictionary) instead). Or maybe I don't need to store all the elements in memory, but can instead re-calculate them every time I need them.
At any rate, a matrix that large (or any kind of fake representation of it) will NOT be useful. Not just because you don't have enough memory, but also because filling up such an array with data will take anywhere from several hours to many months. That should be your first concern - figuring out how to solve the task with less data and computations. When you figure out that, you'll also see what data structure is appropriate.
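A tiny sketch of the sparse-storage idea mentioned above, keeping only non-zero cells of a conceptually 10^6-by-10^6 matrix in a std::map (types and names are illustrative):

    #include <map>
    #include <utility>

    typedef std::map<std::pair<int, int>, int> SparseMatrix;

    int get(const SparseMatrix& m, int r, int c) {
        SparseMatrix::const_iterator it = m.find(std::make_pair(r, c));
        return it == m.end() ? 0 : it->second;      // absent cells read as 0
    }

    void put(SparseMatrix& m, int r, int c, int v) {
        if (v == 0) m.erase(std::make_pair(r, c));  // keep the map truly sparse
        else        m[std::make_pair(r, c)] = v;
    }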

Sorting a sequence of numbers from within a file without saving them into an array

Is it possible to sort a sequence of numbers from within a file without saving them into an array, and, if yes, how?
I'm assuming this is a text file, rather than a binary file. One of the problems with text files (for storing numbers) is that the numbers are likely different sizes.
Yes, assuming all the numbers take up the same space (which means, if it's a text file, you have padded all the numbers to the same length). [Ok, so technically it would be possible to do anyway, but that would require reading all the intermediate numbers between two points and then writing them back again, and at that point you are almost certainly better off just reading the whole file in and storing it back out again.]
As to "how": the method is pretty much the same as for any other sorting algorithm - read two values; if they are out of order, swap them. There are probably algorithms that reduce the number of reads/swaps, but I have not looked into it.
I expect, if your concern is "I don't have enough memory for the entire file", then you could read a couple of large chunks and sort within/between those chunks. Repeat as necessary. Again, there are probably sorting algorithms specifically for this, but I'm unsure which - I tend to use unix sort when I need to sort text files.
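For the "read two values, swap them if out of order" approach, a rough sketch assuming a binary file of fixed-size int32 records (a padded text file would work the same way with formatted reads and writes); this does O(n^2) file accesses, so it is only reasonable for small files:

    #include <cstdint>
    #include <fstream>

    int32_t readAt(std::fstream& f, std::streamoff i) {
        int32_t v;
        f.seekg(i * sizeof(int32_t));
        f.read(reinterpret_cast<char*>(&v), sizeof v);
        return v;
    }

    void writeAt(std::fstream& f, std::streamoff i, int32_t v) {
        f.seekp(i * sizeof(int32_t));
        f.write(reinterpret_cast<const char*>(&v), sizeof v);
    }

    // Selection-style sort performed directly on the file's records.
    void sortFile(const char* path, std::streamoff n) {
        std::fstream f(path, std::ios::in | std::ios::out | std::ios::binary);
        for (std::streamoff i = 0; i < n; ++i)
            for (std::streamoff j = i + 1; j < n; ++j) {
                int32_t a = readAt(f, i), b = readAt(f, j);
                if (b < a) { writeAt(f, i, b); writeAt(f, j, a); }
            }
    }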
The first answer on this page has a link for "comparing sorts": How to sort an array using minimum number of writes?

Is it possible to do simple arithmetic (e.g. addition) on "compressed" integers?

I would like to compress an array of integers, all initially 0, using a yet-to-be-determined integer compression/decompression method.
Is it possible with some integer compression method to increment (+1) a specific element of an array of compressed integers accurately using C or C++?
Of all the common compression techniques, two stand out as potentially usable for this without a full decompression cycle.
First, sparse arrays were built specifically for this. With a sparse array, you typically store a map from index to value. You don't store array elements that haven't been modified, so if most of your array is 0, it need not be stored. Many arrays (and matrices) in simulations are sparse, and there's a huge literature. Here, adding to a value would simply be accessing the index with [] and incrementing; the access will create the entry if it doesn't exist.
Next, run-length encoding may also work if you find that you are working with long sequences of the same number, but those "runs" are not all the same number. Since they are not all the same, a sparse array would not work, and RLE is a solution. Incrementing a number is not as easy as with a sparse array: basically, if the element is not part of a run, you increment it and check whether you can make a new run; if it is part of a run, you split the run. RLE typically only makes sense with visual data or certain mathematical patterns.
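A minimal sketch of that sparse-array increment using a map from index to value (types and names are illustrative):

    #include <cstddef>
    #include <unordered_map>

    // Indices not present in the map are implicitly 0; operator[] value-initializes
    // a missing entry to 0, so incrementing "creates" it automatically.
    typedef std::unordered_map<std::size_t, int> SparseArray;

    void increment(SparseArray& a, std::size_t index) {
        ++a[index];
    }

    int value(const SparseArray& a, std::size_t index) {
        SparseArray::const_iterator it = a.find(index);
        return it == a.end() ? 0 : it->second;
    }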
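A rough sketch of incrementing one element of a run-length-encoded array along those lines; for simplicity it splits runs without re-merging equal neighbours, which a real implementation would want to do:

    #include <cstddef>
    #include <vector>

    // Each Run is (value, count); the logical array is the runs concatenated.
    struct Run { int value; std::size_t count; };

    void increment(std::vector<Run>& runs, std::size_t index) {
        for (std::size_t r = 0; r < runs.size(); ++r) {
            if (index >= runs[r].count) { index -= runs[r].count; continue; }

            if (runs[r].count == 1) {            // run of one element: bump it in place
                ++runs[r].value;
                return;
            }
            // Split the run into: [prefix] [single incremented element] [suffix].
            Run prefix = { runs[r].value, index };
            Run single = { runs[r].value + 1, 1 };
            Run suffix = { runs[r].value, runs[r].count - index - 1 };

            std::vector<Run> pieces;
            if (prefix.count) pieces.push_back(prefix);
            pieces.push_back(single);
            if (suffix.count) pieces.push_back(suffix);

            runs.erase(runs.begin() + r);
            runs.insert(runs.begin() + r, pieces.begin(), pieces.end());
            return;
        }
    }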
You can certainly implement this, if your increment method:
Decompresses the entire array.
Increments the desired entry.
Compresses the entire array again.
If you want to increment in less of a dumb way, you'll need intimate knowledge of the compression process, and so would we, in order to give you more assistance.