This program uses sockets to transfer highly redundant 2D byte arrays (image-like). While the transfer rate is comparatively high (10 Mbps), the arrays are also highly redundant (e.g. each row may contain several consecutive similar values).
I have tried zlib and lz4 and the results were promising; however, I am still looking for a better compression method, and please remember that it should be relatively fast, as lz4 is. Any suggestions?
You should look at the PNG algorithms for filtering image data before compressing. They range from simple to more sophisticated methods for predicting values in a 2D array based on previous values. To the extent that the predictions are good, the filtering can make for dramatic improvements in the subsequent compression step.
You should simply try these filters on your data and then feed the result to lz4.
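As an illustration, here is a minimal sketch of the simplest PNG-style filter (the "Sub" filter: each byte is replaced by its difference from the byte to its left), applied per row to a row-major byte buffer before handing it to lz4. The buffer layout and function names are assumptions for the example, not part of your code:
#include <cstdint>
#include <cstddef>
#include <vector>

// PNG "Sub" filter: store each byte as the difference from its left neighbour.
// Runs of similar values become runs of (near-)zero deltas, which compress well.
void subFilter(std::vector<uint8_t> &data, size_t width, size_t height)
{
    for (size_t row = 0; row < height; ++row) {
        uint8_t *p = &data[row * width];
        for (size_t col = width; col-- > 1; ) {
            p[col] = static_cast<uint8_t>(p[col] - p[col - 1]);  // wraps mod 256, fully reversible
        }
    }
}

// The inverse filter, applied on the receiving end after decompression.
void unSubFilter(std::vector<uint8_t> &data, size_t width, size_t height)
{
    for (size_t row = 0; row < height; ++row) {
        uint8_t *p = &data[row * width];
        for (size_t col = 1; col < width; ++col) {
            p[col] = static_cast<uint8_t>(p[col] + p[col - 1]);
        }
    }
}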
you could create your own: if the data in the rows is similar, you can create a resource/index map, thus reducing the size substantially, something like this
Original file:
row 1: 1212, 34,45,1212,45,34,56,45,56
row 2: 34,45,1212,78,54,87,....
you could create a list of unique values, then use an index in replacement,
34,45,54,56,78,87,1212
row 1: 6,0,1,6,1,0,.....
this can potentially save you over 30% or more of the data transfer, but it depends on how redundant the data is
UPDATE
Here is a simple implementation:
#include <set>
#include <string>
#include <vector>

typedef std::vector<std::vector<int> > DataTable; // assuming a 2D vector implementation

std::set<int> uniqueValues;
DataTable my2dData;
std::string indexMap;
std::string fileCompressed = "";

// returns the position of value inside uniqueValues (a std::set keeps its elements sorted)
int Find(int value){
    int index = 0;
    for(std::set<int>::iterator it = uniqueValues.begin(); it != uniqueValues.end(); ++it, ++index){
        if(*it == value) return index;
    }
    return -1;
}

//create the list of unique values
for(size_t i = 0; i < my2dData.size(); ++i){
    for(size_t j = 0; j < my2dData[i].size(); ++j){
        uniqueValues.insert(my2dData[i][j]);
    }
}

//create the indexes
for(size_t i = 0; i < my2dData.size(); ++i){
    std::string tmpRow = "";
    for(size_t j = 0; j < my2dData[i].size(); ++j){
        if(tmpRow == ""){
            tmpRow = std::to_string(Find(my2dData[i][j]));
        }
        else{
            tmpRow += "," + std::to_string(Find(my2dData[i][j]));
        }
    }
    tmpRow += "\r\n";
    indexMap += tmpRow;
}

//create the file to transfer
for(std::set<int>::iterator it = uniqueValues.begin(); it != uniqueValues.end(); ++it){
    if(fileCompressed == ""){
        fileCompressed = "i: " + std::to_string(*it);
    }
    else{
        fileCompressed += "," + std::to_string(*it);
    }
}
fileCompressed += "\r\nd:" + indexMap;
now on the receiving end you just do the opposite: if the line starts with "i" you get the value list, and if it starts with "d" you get the data
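A rough sketch of that receiving side under the same assumptions as the snippet above (lines separated by \r\n, an "i:" line carrying the value list, and a "d:" prefix introducing the index rows); the helper names here are made up for illustration:
#include <sstream>
#include <string>
#include <vector>

// Split a comma-separated list of integers.
std::vector<int> splitInts(const std::string &line)
{
    std::vector<int> out;
    std::stringstream ss(line);
    std::string item;
    while (std::getline(ss, item, ','))
        out.push_back(std::stoi(item));
    return out;
}

// Rebuild the 2D data from the "i:" value list and the "d:" index rows.
std::vector<std::vector<int> > decode(std::istream &in)
{
    std::vector<int> values;
    std::vector<std::vector<int> > data;
    std::string line;
    bool inData = false;
    while (std::getline(in, line)) {
        if (!line.empty() && line[line.size() - 1] == '\r')
            line.erase(line.size() - 1);            // strip the \r of the \r\n line ending
        if (line.compare(0, 2, "i:") == 0) {        // value list
            values = splitInts(line.substr(2));
            continue;
        }
        if (line.compare(0, 2, "d:") == 0) {        // start of the index rows
            inData = true;
            line = line.substr(2);                  // the first row follows "d:" directly
        }
        if (inData && !line.empty()) {
            std::vector<int> indexes = splitInts(line);
            std::vector<int> row;
            for (size_t k = 0; k < indexes.size(); ++k)
                row.push_back(values[indexes[k]]);  // map each index back to its value
            data.push_back(row);
        }
    }
    return data;
}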
I'm focusing on 3x3 matrices for now as my code is still volatile. I read the matrix from a text file and print it to the console; based on its dimensions I generate the identity matrix.
const int m = 3;
const int n = 3;
int ID[m][n] = {};
for (i = 1; i <= n; ++i){
ID[i][i] = 1;
}
For some reason ID[2][3] gets printed as 4227276, so I have to force it to zero manually after the fact.
Aside from other elementary row operations like swapping rows based on leading entry position, the main chunk of my code consists of the following:
float matrix[m][n];
int i,j,k,p,s;
for(s = 1; s <= m; ++s){
    j = s;
    k = j + 1;
    p = j;
    for(i = n; i >= j; --i){ // makes leading entries 1
        ID[j][i] = ID[j][i]/matrix[j][j];
        matrix[j][i] = matrix[j][i]/matrix[j][j];
    }
    for(j = k; j <= m; ++j){ //converts to upper triangular
        for(i = n; i >= 1; --i){
            ID[j][i] = ID[j][i] - matrix[j][i]*matrix[p][i];
            matrix[j][i] = matrix[j][i] - matrix[j][i]*matrix[p][i];
        }
    }
}
for(j = (m-1); j >= 1; --j){ //makes entries above diagonal zero
    for(i = n; i > j; --i){
        ID[j][i] = ID[j][i] - matrix[j][i]*matrix[i][i];
        matrix[j][i] = matrix[j][i] - matrix[j][i]*matrix[i][i];
    }
}
for(j = (m-1);j >= 1;--j){ //makes entries above diagonal zero
for(i = n;i > j;--i){
ID[j][i] = ID[j][i] - matrix[j][i]*matrix[i][i];
matrix[j][i] = matrix[j][i] - matrix[j][i]*matrix[i][i];
}
}
I'm basically doing to the identity matrix whatever I do to matrix[m][n] to reduce it to row echelon form as you would with the augmented matrix. The row operations are pretty haphazard as I was just doing whatever worked to make matrix[m][n] an identity matrix. Afterwards, I just slotted ID[m][n] in there... not really sure what's happening but the result is half right.
My result differs from the right answer (the screenshots are not reproduced here).
I realize that the term I subtract from ID might need to be a multiple of ID, but that makes it even worse. What mistakes have I made?
In C++ the indexes of an array of n elements run from 0 to n-1: the first element of the array a is a[0], the second element is a[1], ..., and the n-th element is a[n-1].
When you use the for loop
for (i = 1; i <= n; ++i){
ID[i][i] = 1;
}
you are discarding the first element of each row and each column, and moreover you are accessing memory positions that do not belong to ID (e.g. ID[n][n]), which contain unknown values.
You have to iterate over your arrays using for loops such as
for (i = 0; i < n; ++i){
ID[i][i] = 1;
}
or if you desire
for (i = 1; i <= n; ++i){
ID[i-1][i-1] = 1;
}
but I find the last solution quite confusing.
I'm making a Sudoku validator program that checks whether a solved sudoku is correct or not. In that program I need to compare multiple variables together to check whether they are equal or not...
I have provided a snippet of what I have tried: it is supposed to check whether every value in su[][] is different or not, but I'm not getting the expected result...
I want to make sure that all the values in su[][] are unequal.
How can I achieve that, and what are the mistakes in my snippet?
Thanks...
for(int i=0 ; i<9 ;++i){ //for checking an entire row
    if(!(su[i][0]!=su[i][1]!=su[i][2]!=su[i][3]!=su[i][4]!=su[i][5]!=su[i][6]!=su[i][7]!=su[i][8])){
        system("cls");
        cout<<"SUDOKU'S SOLUTION IS INCORRECT!!";
        exit(0);
    }
}
To check uniqueness for each column like that, you would have to compare each element to the other ones in the column.
e.g.:
for (int i = 0; i < 9; ++i) {
    for (int j = 0; j < 9; ++j) {
        for (int k = j + 1; k < 9; ++k) {
            if (su[i][j] == su[i][k]) {
                system("cls");
                cout << "SUDOKU'S SOLUTION IS INCORRECT!!\n";
                exit(0);
            }
        }
    }
}
Since there are only 9 elements per row, this cubic solution shouldn't give you much overhead.
If you had a higher number N of elements, you could initialize an array of size N with 0 and traverse the column. For the i-th element in the column, you add 1 to that element's position in the array. Then traverse the array: if there is a position whose value is different from 1, it means you have a duplicated value in the column.
e.g.:
for (int i = 0; i < N; ++i) {
    std::vector<int> counts(N, 0);   // one counter per possible value (avoids a non-standard variable-length array)
    for (int j = 0; j < N; ++j)
        ++counts[su[i][j] - 1];      // values are assumed to be 1..N
    for (int k = 0; k < N; ++k) {
        if (counts[k] != 1) {
            system("cls");
            cout << "SUDOKU'S SOLUTION IS INCORRECT!!\n";
            exit(0);
        }
    }
}
This approach is much faster than the first one for high values of N.
The code above checks uniqueness for each column; you would still have to check each row.
PS: I have not tested the code; it may have a bug, but I hope you get the idea.
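For the other direction, an equally untested sketch that runs the same pairwise check with the roles of the two indices swapped (assuming su is a 9x9 int grid):
for (int j = 0; j < 9; ++j) {              // fix the second index this time
    for (int i = 0; i < 9; ++i) {
        for (int k = i + 1; k < 9; ++k) {
            if (su[i][j] == su[k][j]) {    // duplicate found along this direction
                system("cls");
                cout << "SUDOKU'S SOLUTION IS INCORRECT!!\n";
                exit(0);
            }
        }
    }
}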
Suppose I have a vector that contains n strings, where the strings can be of length 5...n. Each string must be compared with every other string, character by character. If there is a mismatch, the score is increased by one. If there is a match, the score does not increase. Then I will store the resulting scores in a matrix.
I have implemented this in the following way:
for (auto i = 0u; i < vector.size(); ++i)
{
    // vector.size() x vector.size() matrix
    std::string first = vector[i]; //horrible naming convention
    for (auto j = 0u; j < vector.size(); ++j)
    {
        std::string second = vector[j];
        int score = 0;
        for (auto k = 0u; k < sizeOfStrings; ++k)
        {
            if(first[k] == second[k])
            {
                score += 0;
            }
            else
            {
                score += 1;
            }
        }
        //store score into matrix
    }
}
I am not happy with this solution because it is O(n^3), so I have been trying to think of other ways to make it more efficient. I have thought about writing another function to replace the innards of the j loop; however, that would still be O(n^3) since the function would still need a k loop.
I have also thought about a queue, since I only care about string[0] compared to string[1] through string[n], string[1] compared to string[2] through string[n], string[2] compared to string[3] through string[n], and so on. So my solution performs unnecessary computations, since each string is compared to every other string in both directions. The problem with this is that I am not really sure how to build my matrix out of it.
Finally, I have looked into the standard template library, but neither std::mismatch nor std::find seems to be what I am looking for. What other ideas do you guys have?
I don't think you can easily get away from O(n^3) comparisons, but you can easily implement the change you talk about. Since the comparisons only need to be done one way (i.e. comparing string[1] to string[2] is the same as comparing string[2] to string[1]), as you point out, you don't need to iterate through the entire array each time and can change the start value of your inner loop to be the current index of your outer loop:
for (auto i = 0u; i < vector.size(); ++i) {
    // vector.size() x vector.size() matrix
    std::string first = vector[i]; //horrible naming convention
    for (auto j = i; j < vector.size(); ++j) {
To store it in a matrix, set up your vector.size() x vector.size() matrix, initialize it to all zeroes, and simply store each score in M[i][j]:
        for (auto k = 0u; k < sizeOfStrings; ++k) {
            if (first[k] != second[k]) {
                M[i][j]++;
            }
        }
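Putting those pieces together, here is a minimal sketch under the same assumptions (the vector is named strings here, all strings have length sizeOfStrings, and M is the score matrix), which also mirrors each score so the full matrix gets filled:
std::vector<std::vector<int>> M(strings.size(), std::vector<int>(strings.size(), 0));

for (auto i = 0u; i < strings.size(); ++i) {
    for (auto j = i + 1; j < strings.size(); ++j) {   // start at i + 1; the diagonal stays 0
        int score = 0;
        for (auto k = 0u; k < sizeOfStrings; ++k) {
            if (strings[i][k] != strings[j][k])
                ++score;
        }
        M[i][j] = score;
        M[j][i] = score;                              // the mismatch count is symmetric
    }
}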
If you have n strings each of length m, then no matter what (even with your queue idea), you have to do at least (n-1)+(n-2)+...+(1)=n(n-1)/2 string comparisons, so you'll have to do (n(n-1)/2)*m char comparisons. So no matter what, your algorithm is going to be O(mn^2).
General comment:
You don't have to compare the same strings with each other. And, what is more important, you are starting from the beginning each time in the second loop even though you have already computed those diffs, so change the second loop to start from i+1.
By doing so your complexity will decrease, as you won't check strings that you have already checked or that are the same.
Improvement
Sort the vector and remove duplicate entries; then, instead of wasting computation checking identical strings, you will only check those that are different.
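For instance, a small sketch of that preprocessing step using the standard library (assuming the vector is named strings and <algorithm> is included):
// Sort, then erase the adjacent duplicates that sorting groups together.
std::sort(strings.begin(), strings.end());
strings.erase(std::unique(strings.begin(), strings.end()), strings.end());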
The other answers that say this is at least O(mn^2) or O(n^3) are incorrect. This can be done in O(mn) time where m is string size and n is number of strings.
For simplicity we'll start with the assumption that all characters are ascii.
You have a data structure:
int counts[m][255]
where counts[x][y] is the number of strings that have ascii character y at index x in the string.
Now, if you did not restrict to ascii, then you would need to use a std::map
std::map<char, int> counts[m]
But it works the same way: at index x in counts you have a map in which each entry (y, z) tells you that z strings use character y at index x. You would also want to choose a map with constant-time lookups and constant-time insertions (e.g. std::unordered_map) to match the complexity.
Going back to ascii and the array
int counts[m][255] // start by initializing this array to all zeros
First initialize the data structure:
m is size of strings,
vec is a std::vector with the strings
for (int i = 0; i < vec.size(); i++) {
    std::string str = vec[i];
    for (int j = 0; j < m; j++) {
        counts[j][str[j]]++;
    }
}
Now that you have this structure, you can calculate the scores easily:
for (int i = 0; i < vec.size(); i++) {
    std::string str = vec[i];
    int score = 0;
    for (int j = 0; j < m; j++) {
        score += counts[j][str[j]] - 1; //subtracting 1 gives how many other strings have that same char at that index
    }
    std::cout << "string \"" << str << "\" has score " << score;
}
As you can see by this code, this is O(m * n)
I wrote this code for smoothing a curve.
It takes the 5 points next to a point, adds them up, and averages them.
/* Smoothing */
void smoothing(vector<Point2D> &a)
{
    //How many neighbours to smooth
    int NO_OF_NEIGHBOURS=10;
    vector<Point2D> tmp=a;
    for(int i=0;i<a.size();i++)
    {
        if(i+NO_OF_NEIGHBOURS+1<a.size())
        {
            for(int j=1;j<NO_OF_NEIGHBOURS;j++)
            {
                a.at(i).x+=a.at(i+j).x;
                a.at(i).y+=a.at(i+j).y;
            }
            a.at(i).x/=NO_OF_NEIGHBOURS;
            a.at(i).y/=NO_OF_NEIGHBOURS;
        }
        else
        {
            for(int j=1;j<NO_OF_NEIGHBOURS;j++)
            {
                a.at(i).x+=tmp.at(i-j).x;
                a.at(i).y+=tmp.at(i-j).y;
            }
            a.at(i).x/=NO_OF_NEIGHBOURS;
            a.at(i).y/=NO_OF_NEIGHBOURS;
        }
    }
}
But I get very high values for each point, instead of values similar to the previous point. The shape gets magnified a lot. What is going wrong in this algorithm?
What it looks like you have here is a bass-ackwards implementation of a finite impulse response (FIR) filter that implements a boxcar window function. Thinking about the problem in terms of DSP, you need to filter your incoming vector with NO_OF_NEIGHBOURS equal FIR coefficients that each have a value of 1/NO_OF_NEIGHBOURS. It is normally best to use an established algorithm rather than reinvent the wheel.
Here is a pretty scruffy implementation that I hammered out quickly that filters doubles. You can easily modify this to filter your data type. The demo shows filtering of a few cycles of a rising saw function (0, .25, .5, .75, 1) just for demonstration purposes. It compiles, so you can play with it.
#include <iostream>
#include <vector>
using namespace std;

class boxFIR
{
    int numCoeffs;     //MUST be > 0
    vector<double> b;  //Filter coefficients
    vector<double> m;  //Filter memories

public:
    boxFIR(int _numCoeffs) :
        numCoeffs(_numCoeffs)
    {
        if (numCoeffs<1)
            numCoeffs = 1; //Must be > 0 or bad stuff happens

        double val = 1./numCoeffs;
        for (int ii=0; ii<numCoeffs; ++ii) {
            b.push_back(val);
            m.push_back(0.);
        }
    }

    void filter(vector<double> &a)
    {
        double output;
        for (int nn=0; nn<a.size(); ++nn)
        {
            //Apply smoothing filter to signal
            output = 0;
            m[0] = a[nn];
            for (int ii=0; ii<numCoeffs; ++ii) {
                output+=b[ii]*m[ii];
            }

            //Reshuffle memories
            for (int ii = numCoeffs-1; ii!=0; --ii) {
                m[ii] = m[ii-1];
            }

            a[nn] = output;
        }
    }
};

int main(int argc, const char * argv[])
{
    boxFIR box(1); //If this is 1, then no filtering happens, use bigger ints for more smoothing

    //Make a rising saw function for demo
    vector<double> a;
    a.push_back(0.); a.push_back(0.25); a.push_back(0.5); a.push_back(0.75); a.push_back(1.);
    a.push_back(0.); a.push_back(0.25); a.push_back(0.5); a.push_back(0.75); a.push_back(1.);
    a.push_back(0.); a.push_back(0.25); a.push_back(0.5); a.push_back(0.75); a.push_back(1.);
    a.push_back(0.); a.push_back(0.25); a.push_back(0.5); a.push_back(0.75); a.push_back(1.);

    box.filter(a);
    for (int nn=0; nn<a.size(); ++nn)
    {
        cout << a[nn] << endl;
    }
}
Up the number of filter coefficients using this line to see a progressively more smoothed output. With just 1 filter coefficient, there is no smoothing.
boxFIR box(1);
The code is flexible enough that you can even change the window shape if you like. Do this by modifying the coefficients defined in the constructor.
Note: This will give a slightly different output to your implementation, as this is a causal filter (it only depends on the current sample and previous samples). Your implementation is not causal: it looks ahead in time at future samples to make the average, and that is why you need the conditional statements for the situation where you are near the end of your vector. If you want output like what you are attempting to do with your filter, run your vector through this algorithm in reverse (this works fine as long as the window function is symmetrical). That way you can get similar output without the nasty conditional part of the algorithm.
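A minimal sketch of that reverse pass with the class above (assuming a fresh boxFIR instance and the vector a from the demo):
#include <algorithm>  // std::reverse

boxFIR box(5);                      // a bit of smoothing for the example
std::reverse(a.begin(), a.end());   // flip the signal so "future" samples come first
box.filter(a);                      // the causal filter now effectively looks ahead
std::reverse(a.begin(), a.end());   // flip back to the original sample order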
In the following block:
for(int j=0;j<NO_OF_NEIGHBOURS;j++)
{
    a.at(i).x=a.at(i).x+a.at(i+j).x;
    a.at(i).y=a.at(i).y+a.at(i+j).y;
}
for each neighbour you add the neighbour's values onto a.at(i)'s x and y, so the point itself also ends up in the sum.
If I understand correctly, it should be something like this:
for(int j=0;j<NO_OF_NEIGHBOURS;j++)
{
    a.at(i).x += a.at(i+j+1).x;
    a.at(i).y += a.at(i+j+1).y;
}
Filtering is good for "memory" smoothing. This is the reverse pass for learnvst's answer, to prevent phase distortion:
for (int i = a.size(); i > 0; --i)
{
    // Apply smoothing filter to signal
    output = 0;
    m[m.size() - 1] = a[i - 1];
    for (int j = numCoeffs; j > 0; --j)
        output += b[j - 1] * m[j - 1];

    // Reshuffle memories (shift towards the front this time)
    for (int j = 0; j != numCoeffs - 1; ++j)
        m[j] = m[j + 1];

    a[i - 1] = output;
}
More about zero-phase distortion FIR filter in MATLAB: http://www.mathworks.com/help/signal/ref/filtfilt.html
The current value of the point is used twice: once because you use += and once more when j==0. So you are building the sum of e.g. 6 points but only dividing by 5. This problem is in both the IF and the ELSE case. Also: you should check that the vector is long enough, otherwise your ELSE case will read at negative indices.
The following is not a problem in itself, just a thought: have you considered an algorithm that only touches every point twice? You can keep a temporary x-y sum (initialized to be identical to the first point); then, as you visit each point, you add the new point in and subtract the very oldest point once it is further than NEIGHBOURS back. You keep this "running sum" updated for every point and store this value, divided by the NEIGHBOURS number, into the new point.
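A rough sketch of that running-sum idea (assuming Point2D has x and y members that behave like floating-point values; the trailing window size and the names are just for illustration):
#include <vector>

// Trailing moving average using a running sum: each point is touched once to add it
// to the window and once to remove it, instead of NO_OF_NEIGHBOURS times.
void smoothingRunningSum(std::vector<Point2D> &a, int NO_OF_NEIGHBOURS = 10)
{
    if (a.empty()) return;
    std::vector<Point2D> out = a;
    double sumX = 0, sumY = 0;
    int count = 0;                               // how many points are currently in the window
    for (size_t i = 0; i < a.size(); ++i)
    {
        sumX += a[i].x;                          // add the newest point
        sumY += a[i].y;
        ++count;
        if (count > NO_OF_NEIGHBOURS)            // drop the oldest point once the window is full
        {
            sumX -= a[i - NO_OF_NEIGHBOURS].x;
            sumY -= a[i - NO_OF_NEIGHBOURS].y;
            --count;
        }
        out[i].x = sumX / count;                 // average of the points currently in the window
        out[i].y = sumY / count;
    }
    a = out;
}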
You add the point itself when you mean to take the neighbouring points - just offset the index by 1:
for(int j=0;j<NO_OF_NEIGHBOURS;j++)
{
    a.at(i).x += a.at(i+j+1).x;
    a.at(i).y += a.at(i+j+1).y;
}
This works fine for me:
for (i = 0; i < lenInput; i++)
{
    float x = 0;
    for (int j = -neighbours; j <= neighbours; j++)
    {
        // fall back to the current sample when the window runs off either end
        x += input[(i + j <= 0) || (i + j >= lenInput) ? i : i + j];
    }
    output[i] = x / (neighbours * 2 + 1);
}
I am trying to count all the white pixels in an OpenCV binary image. My current code is as follows:
whitePixels = 0;
for (int i = 0; i < height; ++i)
    for (int j = 0; j < width; ++j)
        if (binary.at<int>(i, j) != 0)
            ++whitePixels;
However, after profiling with gprof I've found that this is a very slow piece of code, and a large bottleneck in the program.
Is there a method which can compute the same value faster?
cv::countNonZero. Usually the OpenCV implementation of a task is heavily optimized.
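For example, assuming binary is a single-channel matrix:
// countNonZero requires a single-channel array; for a binary image this is exactly the white pixel count
int whitePixels = cv::countNonZero(binary);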
You can use parallel computing. Divide the image into N parts and run your code in different threads, then take the result of each thread and add the results together to obtain the final count.
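A minimal sketch of that idea using std::thread, assuming binary is a single-channel 8-bit cv::Mat (the function name and the default of four threads are just for illustration):
#include <opencv2/core/core.hpp>
#include <algorithm>
#include <thread>
#include <vector>

// Count non-zero pixels of a CV_8U image by splitting the rows across threads.
int countWhiteParallel(const cv::Mat &binary, int numThreads = 4)
{
    std::vector<long long> partial(numThreads, 0);
    std::vector<std::thread> workers;
    int rowsPerThread = (binary.rows + numThreads - 1) / numThreads;

    for (int t = 0; t < numThreads; ++t) {
        workers.emplace_back([&, t]() {
            int begin = t * rowsPerThread;
            int end = std::min(binary.rows, begin + rowsPerThread);
            long long count = 0;
            for (int i = begin; i < end; ++i) {
                const uchar *row = binary.ptr<uchar>(i);
                for (int j = 0; j < binary.cols; ++j)
                    if (row[j] != 0) ++count;
            }
            partial[t] = count;        // each thread writes only its own slot
        });
    }
    for (std::thread &w : workers) w.join();

    long long total = 0;
    for (int t = 0; t < numThreads; ++t) total += partial[t];
    return static_cast<int>(total);
}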
The last pixel in a row is usually followed by the first pixel in the next row, so when the matrix is stored continuously you can scan it as one flat array (C-style code):
if (binary.isContinuous())
{
    int limit = width * height;
    const uchar *p = binary.ptr<uchar>(0);  // first pixel of the whole image
    int i = 0;
    while (i < limit)
    {
        if (p[i] != 0) ++whitePixels;
        ++i;
    }
}
Actually binary.at<int>(i, j) is slow to access!
Here is simple code that accesses the pixels faster than yours:
for (int i = 0; i < height; ++i)
{
    const uchar *pixel = binary.ptr<uchar>(i);   // pointer to the start of row i
    for (int j = 0; j < width; ++j)
    {
        if (pixel[j] != 0)
        {
            ++whitePixels;   // do your job
        }
    }
}