Faster C# (or other .NET) Levenshtein distance implementation - c++

Good evening,
I have been working with fuzzy string matching for some time now, and using C with some pointers I was able to write a very fast (for my needs) implementation of the Levenshtein distance between two strings. I tried to port the code to C# using unsafe code and the fixed keyword, but it was much slower. So I chose to build a C++ DLL and use [DllImport] from C#, automatically marshalling every string.

The problem is that, after profiling, this is still the most time-consuming part of my program, taking 50-57% of its total running time. Since I will need to do some heavy work with lots of substrings of a text field coming from some 3 million database records, the time the Levenshtein distance takes is almost unacceptable. So: do you have any suggestions, algorithmic or programming-related, for the code below, or do you know of a better algorithm to compute this distance?
#define Inicio1 (*(BufferVar))
#define Inicio2 (*(BufferVar+1))
#define Fim1 (*(BufferVar+2))
#define Fim2 (*(BufferVar+3))
#define IndLinha (*(BufferVar+4))
#define IndCol (*(BufferVar+5))
#define CompLinha (*(BufferVar+6))
#define TamTmp (*(BufferVar+7))
int __DistanciaEdicao (char * Termo1, char * Termo2, int TamTermo1, int TamTermo2, int * BufferTab, int * BufferVar)
{
    *(BufferVar) = *(BufferVar + 1) = 0;
    *(BufferVar + 2) = TamTermo1 - 1;
    *(BufferVar + 3) = TamTermo2 - 1;
    /* Trim the common prefix. */
    while ((Inicio1 <= *(BufferVar + 2)) && (Inicio2 <= *(BufferVar + 3)) && *(Termo1 + Inicio1) == *(Termo2 + Inicio2))
        Inicio1 = ++Inicio2;
    if (Inicio2 > Fim2) return (Fim1 - Inicio1 + 1);
    /* Trim the common suffix. */
    while ((Fim1 >= 0) && (Fim2 >= 0) && *(Termo1 + Fim1) == *(Termo2 + Fim2))
    {
        Fim1--;
        Fim2--;
    }
    if (Inicio2 > Fim2) return (Fim1 - Inicio1 + 1);
    TamTermo1 = Fim1 - Inicio1 + 1;
    TamTermo2 = Fim2 - Inicio2 + 1;
    CompLinha = ((TamTermo1 > TamTermo2) ? TamTermo1 : TamTermo2) + 1;
    /* Initialize the first column and the first row of the DP table. */
    for (IndLinha = 0; IndLinha <= TamTermo2; *(BufferTab + CompLinha * IndLinha) = IndLinha++);
    for (IndCol = 0; IndCol <= TamTermo1; *(BufferTab + IndCol) = IndCol++);
    /* Fill the DP table. */
    for (IndCol = 1; IndCol <= TamTermo1; IndCol++)
        for (IndLinha = 1; IndLinha <= TamTermo2; IndLinha++)
            *(BufferTab + CompLinha * IndLinha + IndCol) = ((*(Termo1 + (IndCol + Inicio1 - 1)) == *(Termo2 + (IndLinha + Inicio2 - 1))) ? *(BufferTab + CompLinha * (IndLinha - 1) + (IndCol - 1)) : ((*(BufferTab + CompLinha * (IndLinha - 1) + IndCol) < *(BufferTab + CompLinha * (IndLinha - 1) + (IndCol - 1))) ? ((*(BufferTab + CompLinha * IndLinha + (IndCol - 1)) < *(BufferTab + CompLinha * (IndLinha - 1) + IndCol)) ? *(BufferTab + CompLinha * IndLinha + (IndCol - 1)) : *(BufferTab + CompLinha * (IndLinha - 1) + IndCol)) : ((*(BufferTab + CompLinha * IndLinha + (IndCol - 1)) < *(BufferTab + CompLinha * (IndLinha - 1) + (IndCol - 1))) ? *(BufferTab + CompLinha * IndLinha + (IndCol - 1)) : *(BufferTab + CompLinha * (IndLinha - 1) + (IndCol - 1)))) + 1);
    return *(BufferTab + CompLinha * TamTermo2 + TamTermo1);
}
Please note that BufferVar and BufferTab are two external int * buffers (in this case, int[] variables marshalled from C#) which I do not allocate on every function call, to make the whole process faster. Still, this code is pretty slow for my needs. Can anyone give me some suggestions or, if possible, provide some better code?
Edit: The distance can't be bounded, I need the actual distance.
Thank you very much,

1. Brute Force
Here is an implementation of the Levenshtein Distance in Python.
def levenshtein_matrix(lhs, rhs):
    def move(index):
        return (index + 1) % 2

    m = len(lhs)
    n = len(rhs)
    # Keep only two rows of the DP matrix alive at any time.
    states = [list(range(n + 1)), [0] * (n + 1)]
    previous = 0
    current = 1
    for i in range(1, m + 1):
        states[current][0] = i
        for j in range(1, n + 1):
            add = states[current][j - 1] + 1
            sub = states[previous][j] + 1
            repl = states[previous][j - 1] + (0 if lhs[i - 1] == rhs[j - 1] else 1)
            states[current][j] = min(repl, add, sub)
        previous = move(previous)
        current = move(current)
    return states[previous][n]
It's the typical dynamic programming algorithm, just taking advantage of the fact that, since each row depends only on the previous one, keeping two rows at a time is sufficient.
For a C++ implementation, you might look at LLVM's (lines 70-130); note the use of a stack-allocated array of fixed size, replaced only when necessary by a dynamically allocated array.
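To illustrate that trick, here is a minimal C++ sketch of the same two-row scheme with a small-buffer optimization (my code, not LLVM's; the 128-element threshold is arbitrary):

#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Two-row Levenshtein: use the stack for short strings,
// fall back to the heap only when the rows don't fit.
unsigned levenshtein(const std::string& a, const std::string& b) {
    const std::size_t n = b.size();
    unsigned stackBuf[128];
    std::vector<unsigned> heapBuf;
    unsigned* prev = stackBuf;
    if (2 * (n + 1) > 128) {
        heapBuf.resize(2 * (n + 1));
        prev = heapBuf.data();
    }
    unsigned* cur = prev + (n + 1);
    for (std::size_t j = 0; j <= n; ++j)
        prev[j] = static_cast<unsigned>(j);
    for (std::size_t i = 1; i <= a.size(); ++i) {
        cur[0] = static_cast<unsigned>(i);
        for (std::size_t j = 1; j <= n; ++j) {
            unsigned repl = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            cur[j] = std::min(repl, std::min(prev[j], cur[j - 1]) + 1);
        }
        std::swap(prev, cur);
    }
    return prev[n];
}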
I just can't follow your code well enough to diagnose it... so let's change the angle of attack. Instead of micro-optimizing the distance computation, we'll change the algorithm altogether.
2. Doing better: using a Dictionary
One of the issues you face is that you could do much better.
The first remark is that the distance is symmetric; while that doesn't change the overall complexity, it can halve the time necessary.
The second is that since you actually have a dictionary of known words, you can build on that: "actor" and "actual" share a common prefix ("act") and thus you need not recompute the first stages.
This can be exploited using a Trie (or any other sorted structure) to store your words. Next you will take one word, and compute its distance relatively to all of the words stored in the dictionary, taking advantage of the prefixes.
Let's take an example: dic = ["actor", "actual", "addict", "atchoum"], and we want to compute the distance for word = "atchoum" (we remove it from the dictionary at this point).
Initialize the matrix for the word "atchoum": matrix = [[0, 1, 2, 3, 4, 5, 6, 7]]
Pick the next word "actor"
Prefix = "a", matrix = [[0, 1, 2, 3, 4, 5, 6, 7], [1, 0, 1, 2, 3, 4, 5, 6]]
Prefix = "ac", matrix = [[0, 1, 2, 3, 4, 5, 6, 7], [1, 0, 1, 2, 3, 4, 5, 6], [2, 1, 1, 2, 3, 4, 5, 6]]
Prefix = "act", matrix = [[..], [..], [..], [..]]
Continue until "actor", you have your distance
Pick the next word "actual", rewind the matrix until the prefix is a prefix of our word, here up to "act"
Prefix = "actu", matrix = [[..], [..], [..], [..], [..]]
Continue until "actual"
Continue for the other words
What's important here is the rewind step: by preserving the computation done for the previous word, with which you share a good-length prefix, you effectively save a lot of work.
Note that this is trivially implemented with a simple stack and does not require any recursive call.
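To make the rewind concrete, here is a minimal sketch (my code; a sorted std::vector<std::string> stands in for the trie, since sorting already groups shared prefixes together, and rows is the stack mentioned above):

#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Distances from `word` to every entry of `dict`, reusing DP rows
// across the prefix shared by consecutive (sorted) words.
std::vector<int> distancesToAll(const std::string& word,
                                std::vector<std::string> dict) {
    std::sort(dict.begin(), dict.end());
    const std::size_t n = word.size();
    std::vector<std::vector<int>> rows(1, std::vector<int>(n + 1));
    for (std::size_t j = 0; j <= n; ++j)
        rows[0][j] = static_cast<int>(j);         // row for the empty prefix

    std::vector<int> result;
    std::string prev;
    for (const std::string& w : dict) {
        // Rewind: keep one row per character of the shared prefix.
        std::size_t keep = 0;
        while (keep < prev.size() && keep < w.size() && prev[keep] == w[keep])
            ++keep;
        rows.resize(keep + 1);
        // Extend one row per remaining character of w.
        for (std::size_t i = keep; i < w.size(); ++i) {
            const std::vector<int>& up = rows.back();
            std::vector<int> row(n + 1);
            row[0] = static_cast<int>(i) + 1;
            for (std::size_t j = 1; j <= n; ++j) {
                int repl = up[j - 1] + (w[i] == word[j - 1] ? 0 : 1);
                row[j] = std::min(repl, std::min(up[j], row[j - 1]) + 1);
            }
            rows.push_back(std::move(row));
        }
        result.push_back(rows.back()[n]);
        prev = w;
    }
    return result;
}

The more words share prefixes, the more rows survive each rewind; over millions of records the bulk of the matrix work is done once per distinct prefix rather than once per word.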

Try the simple approach first - don't use pointers and unsafe code - just code plain ordinary C#... but use the correct algorithm.
There is a simple and efficient algorithm on Wikipedia that uses dynamic programming and runs in O(n*m) time, where n and m are the lengths of the inputs. I suggest you implement that algorithm first, exactly as described there, and only start optimizing once you have measured its performance and found it insufficient.
See also the section Possible improvements where it says:
By examining diagonals instead of rows, and by using lazy evaluation, we can find the Levenshtein distance in O(m (1 + d)) time (where d is the Levenshtein distance), which is much faster than the regular dynamic programming algorithm if the distance is small
If I had to guess where the problem is I'd probably start by looking at this line that runs inside two loops:
*(BufferTab + CompLinha * IndLinha + IndCol) = ((*(Termo1 + (IndCol + Inicio1 - 1)) == *(Termo2 + (IndLinha + Inicio2 - 1))) ? *(BufferTab + CompLinha * (IndLinha - 1) + (IndCol - 1)) : ((*(BufferTab + CompLinha * (IndLinha - 1) + IndCol) < *(BufferTab + CompLinha * (IndLinha - 1) + (IndCol - 1))) ? ((*(BufferTab + CompLinha * IndLinha + (IndCol - 1)) < *(BufferTab + CompLinha * (IndLinha - 1) + IndCol)) ? *(BufferTab + CompLinha * IndLinha + (IndCol - 1)) : *(BufferTab + CompLinha * (IndLinha - 1) + IndCol)) : ((*(BufferTab + CompLinha * IndLinha + (IndCol - 1)) < *(BufferTab + CompLinha * (IndLinha - 1) + (IndCol - 1))) ? *(BufferTab + CompLinha * IndLinha + (IndCol - 1)) : *(BufferTab + CompLinha * (IndLinha - 1) + (IndCol - 1)))) + 1);
There appears to be a lot of duplication there, though it's hard for me to spot exactly what's going on. Could you factor some of it out? You definitely need to make it more readable.
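Something like this, for instance (a sketch that keeps the question's identifiers and row-major layout, so it could replace the inner-loop statement):

// cell(row, col) lives at BufferTab[CompLinha * row + col]
int* row   = BufferTab + CompLinha * IndLinha;
int* above = row - CompLinha;
int diag = above[IndCol - 1];
int up   = above[IndCol];
int left = row[IndCol - 1];
if (Termo1[IndCol + Inicio1 - 1] == Termo2[IndLinha + Inicio2 - 1]) {
    row[IndCol] = diag;                  // characters match: no edit needed
} else {
    int best = up < diag ? up : diag;    // min(delete, replace)
    if (left < best) best = left;        // min(..., insert)
    row[IndCol] = best + 1;
}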

You shouldn't try all your possible words with the Levenshtein distance algorithm. You should use another, faster metric to filter out likely candidates, and only then use Levenshtein to remove ambiguity. The first sieve can be based on an n-gram frequency histogram (trigrams often work well) or a hash function.
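A sketch of such a sieve (my names; the 3-gram length and any cut-off threshold are tuning choices):

#include <cstddef>
#include <string>
#include <unordered_set>

// The set of 3-grams of a string (empty for strings shorter than 3).
std::unordered_set<std::string> trigrams(const std::string& s) {
    std::unordered_set<std::string> out;
    for (std::size_t i = 0; i + 3 <= s.size(); ++i)
        out.insert(s.substr(i, 3));
    return out;
}

// Fraction of the query's trigrams that also occur in the candidate.
double trigramOverlap(const std::unordered_set<std::string>& query,
                      const std::unordered_set<std::string>& cand) {
    if (query.empty()) return 0.0;
    std::size_t hits = 0;
    for (const std::string& t : query)
        if (cand.count(t)) ++hits;
    return static_cast<double>(hits) / query.size();
}

Precompute the trigram sets once per database record; at query time, run the exact Levenshtein pass only on candidates whose overlap clears some threshold (0.4, say; tune it on your data).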

Related

Argmax function in C++?

I am trying to make an argmax function in C++.
For example,
C = wage * hour * shock + t + a * R + (1 - delta) * d - d_f - a_f - fx1 * (1 - delta) * d;
double result;
result = (1 / (1 - sig)) * pow(pow(C, psi) * pow(d, 1 - psi), (1 - sig));
Suppose every variable except 'd_f' and 'a_f' is given, and both 'd_f' and 'a_f' are doubles in [0, 24].
If I want to get the combination of 'd_f' and 'a_f' that maximizes 'result', is there a proper function that I can use?
Thanks.
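There is no ready-made argmax over a continuous box in the standard library; the usual quick answer is a brute-force grid search. A sketch (all the fixed parameter values and the step size below are placeholders of mine, not values from the question):

#include <cmath>
#include <cstdio>

// Hypothetical fixed parameters; plug in your own values.
const double wage = 1.0, hour = 8.0, shock = 1.0, t = 0.0, a = 1.0,
             R = 1.05, delta = 0.1, d = 10.0, fx1 = 0.05,
             sig = 2.0, psi = 0.5;

double objective(double d_f, double a_f) {
    double C = wage * hour * shock + t + a * R + (1 - delta) * d
               - d_f - a_f - fx1 * (1 - delta) * d;
    return (1 / (1 - sig)) * std::pow(std::pow(C, psi) * std::pow(d, 1 - psi),
                                      1 - sig);
}

int main() {
    const double step = 0.01;            // grid resolution: accuracy vs. speed
    double bestD = 0, bestA = 0, bestV = -HUGE_VAL;
    for (double d_f = 0; d_f <= 24; d_f += step)
        for (double a_f = 0; a_f <= 24; a_f += step) {
            double v = objective(d_f, a_f);
            if (std::isfinite(v) && v > bestV) {   // C <= 0 gives NaN; skip it
                bestV = v; bestD = d_f; bestA = a_f;
            }
        }
    std::printf("argmax: d_f = %f, a_f = %f (value %f)\n", bestD, bestA, bestV);
    return 0;
}

At this resolution that is about 5.8 million evaluations, which runs in well under a second; if the objective is smooth you can then refine around the best grid point with a finer step or a derivative-based method.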

Gaussian Blur Glitching/Segmentation

I'm trying to implement a Gaussian Blur from scratch (using C++). In the code below I've hard-coded the Gaussian kernel I'm using. I only kept one dimension as I'm trying to use the optimization I've read about where you can do a horizontal convolution pass and a vertical one over that to make your blur more efficient. Unfortunately, I'm running into some issues. Here is my code:
float gKern[5] = {0.05448868, 0.24420134, 0.40261995, 0.24420134, 0.05448868};

int** gaussianBlur(int** image, int height, int width) {
    int** ret = new int*[height];
    for (int i = 0; i < height; i++) {
        ret[i] = new int[width];
    }
    // Vertical pass: image -> ret.
    for (int i = 0; i < height; i++) {
        for (int j = 0; j < width; j++) {
            if (i == 0) {
                ret[i][j] = (gKern[0] * image[2][j]) + (gKern[1] * image[1][j]) + (gKern[2] * image[0][j]) + (gKern[3] * image[1][j]) + (gKern[4] * image[2][j]);
            } else if (i == 1) {
                ret[i][j] = (gKern[0] * image[1][j]) + (gKern[1] * image[0][j]) + (gKern[2] * image[1][j]) + (gKern[3] * image[2][j]) + (gKern[4] * image[3][j]);
            } else if (i == (height - 2)) {
                ret[i][j] = (gKern[0] * image[i - 2][j]) + (gKern[1] * image[i - 1][j]) + (gKern[2] * image[i][j]) + (gKern[3] * image[i + 1][j]) + (gKern[4] * image[i][j]);
            } else if (i == (height - 1)) {
                ret[i][j] = (gKern[0] * image[i - 2][j]) + (gKern[1] * image[i - 1][j]) + (gKern[2] * image[i][j]) + (gKern[3] * image[i - 1][j]) + (gKern[4] * image[i - 2][j]);
            } else {
                ret[i][j] = (gKern[0] * image[i - 2][j]) + (gKern[1] * image[i - 1][j]) + (gKern[2] * image[i][j]) + (gKern[3] * image[i + 1][j]) + (gKern[4] * image[i + 2][j]);
            }
        }
    }
    // Horizontal pass.
    int** temp = image;
    image = ret;
    for (int i = 0; i < height; i++) {
        for (int j = 0; j < width; j++) {
            if (j == 0) {
                ret[i][j] = (gKern[0] * image[i][2]) + (gKern[1] * image[i][1]) + (gKern[2] * image[i][0]) + (gKern[3] * image[i][1]) + (gKern[4] * image[i][2]);
            } else if (j == 1) {
                ret[i][j] = (gKern[0] * image[i][1]) + (gKern[1] * image[i][0]) + (gKern[2] * image[i][1]) + (gKern[3] * image[i][2]) + (gKern[4] * image[i][3]);
            } else if (j == (width - 2)) {
                ret[i][j] = (gKern[0] * image[i][j - 2]) + (gKern[1] * image[i][j - 1]) + (gKern[2] * image[i][j]) + (gKern[3] * image[i][j + 1]) + (gKern[4] * image[i][j]);
            } else if (j == (width - 1)) {
                ret[i][j] = (gKern[0] * image[i][j - 2]) + (gKern[1] * image[i][j - 1]) + (gKern[2] * image[i][j]) + (gKern[3] * image[i][j - 1]) + (gKern[4] * image[i][j - 2]);
            } else {
                ret[i][j] = (gKern[0] * image[i][j - 2]) + (gKern[1] * image[i][j - 1]) + (gKern[2] * image[i][j]) + (gKern[3] * image[i][j + 1]) + (gKern[4] * image[i][j + 2]);
            }
        }
    }
    image = temp;
    return ret;
}
The first pass (the first for block) seems to work fine as when I comment out the second block I do get a slightly blurred image. But when I use both I get a choppy "weird" image, as shown below (the first image is my grayscale input, the second is the choppy output):
The problem is with the pointers you use.
The function starts with image as input and ret as the intermediate result of the first step.
The second step must use ret as input, and write either to the original input (overwrite the input image) or to a new image. Instead, you do:
int** temp = image;
image = ret;
// read from image and write to ret
image = temp;
return ret;
That is, going into the second pass, both image and ret point to the same data; you then read from and write to the same buffer. Next you do a pointer assignment that has no effect (image is never used after it) and return that same, now corrupted, buffer.
If you want to write to the input image, simply swap the image and ret pointers before the second pass:
std::swap(image, ret);
If you don’t want that, you’ll have to new another image to write into.
It is bad practice to use an array of arrays to store an image. If you look into the source code of any image processing library, you’ll see they allocate a single large memory block for the image, which stores all image rows concatenated. Knowing the width of the image, you know how to index: image[x + y*width].
This not only simplifies code (no loops to allocate a single image), but it also greatly speeds up code: there is no pointer lookup any more, and all data is close together to best use the cache.
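For instance (a minimal sketch):

#include <cstddef>
#include <vector>

// One contiguous allocation; pixel (x, y) lives at index x + y * width.
std::vector<int> makeImage(int width, int height) {
    return std::vector<int>(static_cast<std::size_t>(width) * height);
}

int getPixel(const std::vector<int>& image, int width, int x, int y) {
    return image[x + y * width];   // instead of image[y][x]
}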
This whole code can be simplified significantly by following the advice above: the two passes can be done with the same code. Write a function that filters one line of the image. It takes a pointer to the first pixel, a line length, and a step (which is 1 for horizontal lines, and width for vertical lines). This 1D function is then called in a loop over the lines in another function. This second function is then called once to do the horizontal pass, and once to do the vertical pass. (See here for details.)
In this situation, it is easy to avoid intermediate images by using a buffer of the size of a single image line. Write into that buffer, then copy the whole line back into the input image after it is filtered. This means you have a single buffer of size max(width,height) rather than a buffer of size width*height.
The 1D filter function can also be simplified. The loop should not contain any if statements; they significantly slow down execution. Instead, special-case the first two and last two pixels, and loop only over the bulk of the pixels where you don't have to worry about the image edge.
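Putting the pieces together, a sketch (my names; it keeps the question's kernel and edge-mirroring scheme, filters a flat image in place, and assumes lines of at least 4 pixels):

#include <vector>

static const float kKern[5] = {0.05448868f, 0.24420134f, 0.40261995f,
                               0.24420134f, 0.05448868f};

// Filter one line of a flat image. `stride` is 1 for a row and `width`
// for a column. Results go into `buf` first, then back into the image,
// so the image is filtered in place.
void filterLine(int* pixel, int length, int stride, std::vector<float>& buf) {
    buf.resize(length);
    auto at = [&](int i) { return static_cast<float>(pixel[i * stride]); };
    // Special-case two pixels at each end; no ifs in the bulk loop.
    buf[0] = kKern[0] * at(2) + kKern[1] * at(1) + kKern[2] * at(0)
           + kKern[3] * at(1) + kKern[4] * at(2);
    buf[1] = kKern[0] * at(1) + kKern[1] * at(0) + kKern[2] * at(1)
           + kKern[3] * at(2) + kKern[4] * at(3);
    for (int i = 2; i < length - 2; ++i)
        buf[i] = kKern[0] * at(i - 2) + kKern[1] * at(i - 1) + kKern[2] * at(i)
               + kKern[3] * at(i + 1) + kKern[4] * at(i + 2);
    buf[length - 2] = kKern[0] * at(length - 4) + kKern[1] * at(length - 3)
                    + kKern[2] * at(length - 2) + kKern[3] * at(length - 1)
                    + kKern[4] * at(length - 2);
    buf[length - 1] = kKern[0] * at(length - 3) + kKern[1] * at(length - 2)
                    + kKern[2] * at(length - 1) + kKern[3] * at(length - 2)
                    + kKern[4] * at(length - 3);
    for (int i = 0; i < length; ++i)
        pixel[i * stride] = static_cast<int>(buf[i] + 0.5f);
}

// Horizontal pass over all rows, then vertical pass over all columns.
void gaussianBlur(int* image, int height, int width) {
    std::vector<float> buf;
    for (int y = 0; y < height; ++y)
        filterLine(image + y * width, width, 1, buf);
    for (int x = 0; x < width; ++x)
        filterLine(image + x, height, width, buf);
}

The single line buffer replaces the whole intermediate image, and the same code serves both passes.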

Derivative of a function with different formulas on different intervals

Is there a canonical way of declaring a piecewise-defined function in SymPy? I tried
import sympy
import sympy.functions.special.delta_functions as special
sympy.init_printing()
x = sympy.symbols('x', real=True)
V = x*x * (special.Heaviside(x + 1) - special.Heaviside(x - 1)) \
    + (1 + 2*sympy.log(x)) * special.Heaviside(x - 1) \
    + (1 + 2*sympy.log(-x)) * special.Heaviside(-x - 1)
which defines a differentiable function, but
print(V.diff(x).simplify())
# Prints: (x*(x**2*(-DiracDelta(x - 1) + DiracDelta(x + 1)) - 2*x*(Heaviside(x - 1) - Heaviside(x + 1)) - (2*log(-x) + 1)*DiracDelta(x + 1) + (2*log(x) + 1)*DiracDelta(x - 1)) + 2*Heaviside(-x - 1) + 2*Heaviside(x - 1))/x
Is there a way to somehow tell Sympy to simplify DiracDelta(x - a)*f(x) to DiracDelta(x - a)*f(a)?
Piecewise-defined functions are implemented by the Piecewise class. Your function would be expressed as
V = sympy.Piecewise((1 + 2*sympy.log(-x), x < -1),
                    (x**2, x < 1),
                    (1 + 2*sympy.log(x), True))
print(V.diff(x))
which prints Piecewise((2/x, x < -1), (2*x, x < 1), (2/x, True))
The (expr, cond) pairs in Piecewise are processed in the order given: the first cond that evaluates to True (all preceding ones having evaluated to False) selects the expr that is returned.

Time complexity of recursive algorithm with two recursive calls

I am trying to analyze the time complexity of a recursive algorithm that solves the "Generate all sequences of bits within Hamming distance t" problem. The algorithm is this:
// str is the bitstring, i the current length, and changesLeft the
// desired Hamming distance (see linked question for more)
void magic(char* str, int i, int changesLeft) {
    if (changesLeft == 0) {
        // assume that this is constant
        printf("%s\n", str);
        return;
    }
    if (i < 0) return;
    // flip current bit
    str[i] = str[i] == '0' ? '1' : '0';
    magic(str, i - 1, changesLeft - 1);
    // or don't flip it (flip it again to undo)
    str[i] = str[i] == '0' ? '1' : '0';
    magic(str, i - 1, changesLeft);
}
What is the time complexity of this algorithm?
I found myself pretty rusty when it comes to this, and here is my attempt, which I feel is nowhere near the truth:
t(0) = 1
t(n) = 2t(n - 1) + c
t(n) = t(n - 1) + c
= t(n - 2) + c + c
= ...
= (n - 1) * c + 1
~= O(n)
where n is the length of the bit string.
Related questions: 1, 2.
It's exponential:
t(0) = 1
t(n) = 2 t(n - 1) + c
t(n) = 2 (2 t(n - 2) + c) + c = 4 t (n - 2) + 3 c
= 2 (2 (2 t(n - 3) + c) + c) + c = 8 t (n - 3) + 7 c
= ...
= 2^i t(n-i) + (2^i - 1) c [at any step i]
= ...
= 2^n t(0) + (2^n - 1) c = 2^n + (2^n - 1) c
~= O(2^n)
Or, using WolframAlpha: https://www.wolframalpha.com/input/?i=t(0)%3D1,+t(n)%3D2+t(n-1)+%2B+c
The reason it's exponential is that each call reduces the problem size by only 1 but makes two recursive calls, so the calls form a binary tree of depth n.
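A quick way to convince yourself: a sketch (my code) that counts the calls empirically, mirroring the recursion shape of magic without doing the work:

#include <cstdio>

// Same shape as magic(): two self-calls, each reducing i by 1.
long long calls(int i, int changesLeft) {
    if (changesLeft == 0 || i < 0) return 1;
    return 1 + calls(i - 1, changesLeft - 1) + calls(i - 1, changesLeft);
}

int main() {
    // With changesLeft ~ n the early exit rarely triggers, and the
    // count roughly doubles with every extra bit, matching O(2^n).
    for (int n = 1; n <= 16; ++n)
        std::printf("n = %2d  calls = %lld\n", n, calls(n - 1, n));
    return 0;
}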

Discrete Wavelet Transform integer Daub 5/3 lifting issue

I'm trying to run an integer-to-integer lifting 5/3 on an image of lena. I've been following the paper "A low-power Low-memory system for wavelet-based image compression" by Walker, Nguyen, and Chen (Link active as of 7 Oct 2015).
I'm running into issues though. The image just doesn't seem to come out quite right. I appear to be overflowing slightly in the green and blue channels which means that subsequent passes of the wavelet function find high frequencies where there ought not to be any. I'm also pretty sure I'm getting something else wrong as I am seeing a line of the s0 image at the edges of the high frequency parts.
My function is as follows:
bool PerformHorizontal( Col24* pPixelsIn, Col24* pPixelsOut, int width, int pixelPitch, int height )
{
    const int widthDiv2 = width / 2;
    int y = 0;
    while( y < height )
    {
        int x = 0;
        while( x < width )
        {
            const int n  = (x) + (y * pixelPitch);
            const int n2 = (x / 2) + (y * pixelPitch);
            const int s  = n2;
            const int d  = n2 + widthDiv2;

            // Non-lifting 5 / 3
            /*pPixelsOut[n2 + widthDiv2].r = pPixelsIn[n + 2].r - ((pPixelsIn[n + 1].r + pPixelsIn[n + 3].r) / 2) + 128;
            pPixelsOut[n2].r = ((4 * pPixelsIn[n + 2].r) + (2 * pPixelsIn[n + 2].r) + (2 * (pPixelsIn[n + 1].r + pPixelsIn[n + 3].r)) - (pPixelsIn[n + 0].r + pPixelsIn[n + 4].r)) / 8;
            pPixelsOut[n2 + widthDiv2].g = pPixelsIn[n + 2].g - ((pPixelsIn[n + 1].g + pPixelsIn[n + 3].g) / 2) + 128;
            pPixelsOut[n2].g = ((4 * pPixelsIn[n + 2].g) + (2 * pPixelsIn[n + 2].g) + (2 * (pPixelsIn[n + 1].g + pPixelsIn[n + 3].g)) - (pPixelsIn[n + 0].g + pPixelsIn[n + 4].g)) / 8;
            pPixelsOut[n2 + widthDiv2].b = pPixelsIn[n + 2].b - ((pPixelsIn[n + 1].b + pPixelsIn[n + 3].b) / 2) + 128;
            pPixelsOut[n2].b = ((4 * pPixelsIn[n + 2].b) + (2 * pPixelsIn[n + 2].b) + (2 * (pPixelsIn[n + 1].b + pPixelsIn[n + 3].b)) - (pPixelsIn[n + 0].b + pPixelsIn[n + 4].b)) / 8;*/

            pPixelsOut[d].r = pPixelsIn[n + 1].r - (((pPixelsIn[n].r + pPixelsIn[n + 2].r) >> 1) + 127);
            pPixelsOut[s].r = pPixelsIn[n].r + (((pPixelsOut[d - 1].r + pPixelsOut[d].r) >> 2) - 64);
            pPixelsOut[d].g = pPixelsIn[n + 1].g - (((pPixelsIn[n].g + pPixelsIn[n + 2].g) >> 1) + 127);
            pPixelsOut[s].g = pPixelsIn[n].g + (((pPixelsOut[d - 1].g + pPixelsOut[d].g) >> 2) - 64);
            pPixelsOut[d].b = pPixelsIn[n + 1].b - (((pPixelsIn[n].b + pPixelsIn[n + 2].b) >> 1) + 127);
            pPixelsOut[s].b = pPixelsIn[n].b + (((pPixelsOut[d - 1].b + pPixelsOut[d].b) >> 2) - 64);

            x += 2;
        }
        y++;
    }
    return true;
}
There is definitely something wrong but I just can't figure it out. Can anyone with slightly more brain than me point out where I am going wrong? It's worth noting that you can see the un-lifted version of the Daub 5/3 above the working code, and this, too, gives me the same artifacts... I'm very confused, as I had this working once before (it was over 2 years ago and I no longer have that code).
Any help would be much appreciated :)
Edit: I appear to have eliminated my overflow issues by clamping the low pass pixels to the 0 to 255 range. I'm slightly concerned this isn't the right solution though. Can anyone comment on this?
You can do some tests with extreme values to see the possibility of overflow. Example:
pPixelsOut[d].r = pPixelsIn[n + 1].r - (((pPixelsIn[n].r + pPixelsIn[n + 2].r) >> 1) + 127);
If:
pPixelsIn[n ].r == 255
pPixelsIn[n+1].r == 0
pPixelsIn[n+2].r == 255
Then:
pPixelsOut[d].r == -382
But if:
pPixelsIn[n ].r == 0
pPixelsIn[n+1].r == 255
pPixelsIn[n+2].r == 0
Then:
pPixelsOut[d].r == 128
You have a range of 511 possible values (-382 .. 128), so, in order to avoid overflow or clamping, you would need one extra bit, some quantization, or another encoding type!
I'm assuming the data have already been thresholded?
I also don't get why you're adding back in +127 and -64.
OK, I can losslessly forward- and then inverse-transform as long as I store my post-forward-transform data in a short. Obviously this takes up a little more space than I was hoping for, but it does give me a good starting point for going into the various compression algorithms. You can also, nicely, process two 4-component pixels at a time using SSE2 instructions. This is the standard C forward transform I came up with:
const int16_t dr = (int16_t)pPixelsIn[n + 1].r - ((((int16_t)pPixelsIn[n].r + (int16_t)pPixelsIn[n + 2].r) >> 1));
const int16_t sr = (int16_t)pPixelsIn[n].r + ((((int16_t)pPixelsOut[d - 1].r + dr) >> 2));
const int16_t dg = (int16_t)pPixelsIn[n + 1].g - ((((int16_t)pPixelsIn[n].g + (int16_t)pPixelsIn[n + 2].g) >> 1));
const int16_t sg = (int16_t)pPixelsIn[n].g + ((((int16_t)pPixelsOut[d - 1].g + dg) >> 2));
const int16_t db = (int16_t)pPixelsIn[n + 1].b - ((((int16_t)pPixelsIn[n].b + (int16_t)pPixelsIn[n + 2].b) >> 1));
const int16_t sb = (int16_t)pPixelsIn[n].b + ((((int16_t)pPixelsOut[d - 1].b + db) >> 2));
pPixelsOut[d].r = dr;
pPixelsOut[s].r = sr;
pPixelsOut[d].g = dg;
pPixelsOut[s].g = sg;
pPixelsOut[d].b = db;
pPixelsOut[s].b = sb;
It is trivial to create the inverse of this (a very simple bit of algebra). It's worth noting, by the way, that you need to invert the image from right to left, bottom to top. I'll next see if I can shunt this data into uint8_ts and lose a bit or two of accuracy. For compression this really isn't a problem.
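For reference, a 1D single-channel sketch of that forward/inverse pair (my reconstruction of the algebra, not code from the paper; edges are mirrored, and the inverse recovers all even samples before any odd ones, which is the same ordering constraint that forces the 2D inverse to run right to left, bottom to top):

#include <cstdint>

// Forward integer 5/3 lifting on one channel; len must be even, >= 2.
// Layout matches the code above: smooth values first, details second.
void forward53(const int16_t* x, int16_t* out, int len) {
    const int half = len / 2;
    for (int i = 0; i < half; ++i) {                          // predict -> detail
        int right = (i + 1 < half) ? x[2 * i + 2] : x[2 * i]; // mirror edge
        out[half + i] = x[2 * i + 1] - ((x[2 * i] + right) >> 1);
    }
    for (int i = 0; i < half; ++i) {                          // update -> smooth
        int left = (i > 0) ? out[half + i - 1] : out[half + i];
        out[i] = x[2 * i] + ((left + out[half + i]) >> 2);
    }
}

// Exact inverse: undo the update first, then undo the predict.
void inverse53(const int16_t* y, int16_t* out, int len) {
    const int half = len / 2;
    for (int i = 0; i < half; ++i) {                          // undo update
        int left = (i > 0) ? y[half + i - 1] : y[half + i];
        out[2 * i] = y[i] - ((left + y[half + i]) >> 2);
    }
    for (int i = 0; i < half; ++i) {                          // undo predict
        int right = (i + 1 < half) ? out[2 * i + 2] : out[2 * i];
        out[2 * i + 1] = y[half + i] + ((out[2 * i] + right) >> 1);
    }
}

Because each lifting step adds back exactly the truncated term it subtracted, the round trip is bit-exact, with the detail values needing the extra headroom of an int16_t exactly as noted above.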