Calculating bytes from sextets in C/C++ for steganography - c++

This should be a simple question. I'm well advanced in developing steganographic code in C, which required manipulating the least significant bit in each R, G, and B channel of a 24 bit (3 byte) pixel of an image. A pair of pixels has 6 bits (which I call a sextet for want of a better word) that can be used, and I have developed code that converts a buffer in bytes to a buffer in sextets, where each byte in the latter buffer only uses the 6 lower order bits, with the upper 2 bits being discarded when changing pixels. This all works correctly, and I can encode text in any language in an image.
In doing this the application calculates the number of sextets that can be embedded in an image. However, it is useful to know how many bytes can be processed, as both the input is originally in bytes, and the output is recovered in bytes. As 4 sextets correspond to 3 bytes, I'm using the statement:
maxNumBytes = (3 * maxNumSexts - 2 * (maxNumSexts % 4)) / 4;
which converts and rounds down to a multiple of 3, where maxNumSexts and maxNumBytes are respectively the maximum number of sextets and bytes that can be hidden in an RGB image, and these two variable have the type int32_t. This formula works but is rather cumbersome, and I was wondering if someone could find something simpler that works correctly.
Incidentally, although the code is in C, this applies exactly in C++, hence that has been included as a tag, and some C++ code may be added later.
Many thanks for any suggestions.

I want all values between 24 and 27 to evaluate to 18, and likewise values between 28 and 31 to evaluate to 21, etc.
Since you want only multiples of 3, the last operation should be the multiplication by 3. And the "steps" on the input value is by 4 increments. So you can use this formula in integer arithmetic:
maxNumBytes = 3 * (maxNumSexts / 4);
Note 1: However, the actual number of bytes encoded by 27 sextets is 20, because 27 sextets contain 81 bits.
Note 2: Yes, a half byte is called a "nibble", from the verb. The form "nybble" is known, but rarely used.

Related

How to save a Huffman table in a file In a way that use the least storage?

It's my first question in stack overflow. it's long but I have explained it in detail and I think it's understandable.
I'm writing huffman code by c++ and saved characters and codes in a table like this:
Text: AAAAAAAAAAAAAAAAAAAABBBBBBBCCCCDDDDEEEE
Table: (Made by huffman tree)
Table
Now, I want to save this table to a file in the best way.
I can't save like this: A1B001C010D001E000
When it change to bits: 01000001101000010001010000110100100010000101000101000
Because I can't decode this.
If I save table in normal way, every character use 8 bit for saving it's code.
While my characters have 1bit or 3bit code. (In this case.)
this way use much storage.
My idea is add a separator character and set a code for it.
If we add a separator character and make huffman tree and write codes, have a table like this.
table2
Now, we can write codes in this way.
A0SepB110SepC100SepD1111sepE1110sep.
binary= 0100000101010100001011010101000011100101010001001111101010001011110101
I decode it in this way:
sep = 101.
Read 8 bit : 01000001 -> it's A.
rest = 01010100001011010101000011100101010001001111101010001011110101.
Read 1 bit : 0 (unlike sep1)
Read 1 bit : 1 (like sep1), Read 1 bit : 0 (like sep2), Read 1 bit : 1 (like sep3(end))
Sep was found so A = everything was befor sep = 0;
rest = 0100001011010101000011100101010001001111101010001011110101.
Read 8 bit : 01000010 -> it's B.
rest = 11010101000011100101010001001111101010001011110101.
Read 1 bit : 1 (like sep1)- Read 1 bit : 1 (unlike sep2)
Read 1 bit : 0 (unlike sep1)
Read 1 bit : 1 (like sep1) - Read 1 bit : 0 (like sep2) - Read 1 bit :1 (like sep3(end))
Sep was found so B = everything was befor sep = 110;
And so on ...
This way still use a little storage for separator ( number of characters * separator size )
My question: Is there a way to save first table in a file and use less storage?
For example like this: A1B001C010D001E000.
Don't save the table with the codes. Just save the lengths. See Canonical Huffman Code.
You can store the lengths of the codes (as Mark said) as a 256 byte header at the start of your compressed data. Each byte stores the length of the code, and because you're working with bytes with 256 possible values, and the huffman tree can only be of a certain depth (number of possible values - 1) you only need 8 bits to store the codes.
The first byte would store the code length of the value 0x00, the second byte stores the code length of 0x01, and so on and so forth.
However, if compressing English text, there is a better way to store the table.
Store the shape of the tree, 0s for nodes and 1s for leaves. Then, after you store the nodes and the leaves, you store the values of the leaves.
The tree for AAAAAAAAAAAAAAAAAAAABBBBBBBCCCCDDDDEEEE looks like this:
*
/ \
* A
/ \
* *
/ \ / \
E D C B
So you would store the shape of the tree as such:
000110111EDCBA
The reason why storing the huffman codes in this way is better for when you are compressing English text is that storing the shape of the tree costs 10n - 1 bits (where n is the number of unique characters in the data you are trying to compress) while storing the code lengths costs a flat 2048 bits. Therefore, for numbers of unique characters less than 205, storing the shape of the tree is more efficient, and because the average English string of text isn't going to have all that many of the possible 256 possible ASCII characters, you're usually better off storing the tree shape.
If you aren't just compressing text, and you're compressing more general data where there is a high likelihood that the number of unique characters could be greater than or equal to 205, you should probably use the code length storing format, or include 1 bit at the start of your header that says whether there's going to be a tree or a bunch of code lengths, and then write your decoder to decode either one depending on what that bit is set to.

JPEG Huffman Table

I have a question regarding the JPEG Huffman Table and using the Huffman Table to construct the symbol/binary string from a Tree. Suppose, that in an Huffman Table for 3-Bit code Length the number of codes is greater than 6, then how do we add all those codes in the Tree? If I am correct only 6 codes can be added at the 3-bit level/depth of the tree. So, how do we add the remaining codes if they won't fit in that level? Do we just ignore them?
Example
code length | Total Codes | Codes
3-Bit | 10 | 25 43 34 53 92 A2 B2 63 73 C2
In the above example if we go by order of constructing symbols/binary string for the code then up 'til A2 we can add codes in the tree at level 3-Bit, but what about B2,63,73,C2 etc? It's not possible to add them at 3-Bit level of the tree? So what do we do with them?
Well, clearly, the absolutely highest number of "things" that can be represented in 3 bits is 8 - (000, 001, 010, 011, 100, 101, 110, 111).
In Huffman encoding, bits represent "left" or "right" in a trie data-structure, to be able to "continue", you have to use SOME codes for "this continues another level", which is why not all 8 values can be encoded in 3 bits. If you have more values to encode, you need to use more bits (for some values - this is the whole point of Huffman coding, that SOME combinations are short, others are longer, and sometimes even longer than the original, but because it's based on what is the most common, it's fine, because they will be rare...)
How to construct and decode a Huffman tree is about four-five pages in your typical Algorithms book, and if you haven't got one of those, you probably want to find one - either a real paper one, or an e-book. There are LOTS of them - I'm not going to recommend one, since the ones I have are all about 15+ years old.
I should add that I think your question is missing something. Clearly, 3 bits can not possibly represent 10 values. And you can't build a [meaningful] Huffman tree on 10 values that all different - unless the idea is to split the values into pairs of {2,5}, {4,3}, {3,4}, {5,3}, {9,2}, {A,2}, {B,2}, {6,3}, {7,3}, {C,2} - which gives a fair number of repeated values - frequency of those are:
2 : 5
3 : 5
4 : 2
5 : 2
6 : 1
7 : 1
9 : 1
A : 1
B : 1
C : 1
But that's stil too many to represent anything meaningful...
Or is it the other way around, that we are supposed to use the bit values of those to decode? In which case we'd need the tree built from the original data to decode it...
In JPEG, a Huffman code can be up to 16-bits. The DHT market contains an array of 16 elements giving the number of codes for each length.
The JPEG standard explains how to use the code counts to do the Huffman translation. It is one of the few things explained in detail.
This book explains how it is done from a programmers perspective.
JPEG Book
The number of codes that exists at any code length depends upon the counts for other lengths.
I am wondering if you are really looking at the count of codes for length 4 rather than 3.
It looks like you're not following the correct procedure when creating your Huffman codes from the JPEG table. The count provided will fit in the number of bits unless the table has been corrupted. The reading out of the codes from a DHT marker is really simple. The more complicated part is how you define your lookup table from that data. A logical (but not practical) way is to create a reverse lookup table that's the maximum code length in size (16-bits = 65536 entries in the table). Then to decode your JPEG data, just pick up 16-bits of compressed data from the input stream and use it as an index in the table where you'll have the symbol and actual length of the code. I came up with a way to use a single, much smaller lookup table. I'm not going to share my specific code table method. What I will share is the basic format of the loop to create the codes from a DHT marker:
int iCurrentCode; // the current Huffman code
int iLength; // the code length in bits that you're working on
int i;
int iCount; // the number of codes defined for this length
int iSymbol; // JPEG symbol defined for each Huffman code
unsigned char *pData; // pointer to the data in the DHT marker
iCurrentCode = 0; // start with a Huffman code of 0
for (iLength = 1; iLength <= 16; iLength++)
{
iCount = *pData++; // get number of symbols for this bit length
for (i=0; i<iCount; i++) // read each of the codes for this bit length
{
iSymbol = *pData++; // get the JPEG symbol value (e.g. RRRR/SSSS value)
// It's up to you to create a lookup table from the code and its value
iCurrentCode++; // the Huffman bit pattern just increments for each code value
} // for each code defined at this bit length
iCurrentCode <<= 1; // shift the code left 1 bit to advance to the next bit length
} // for each bit length

Minimum description length and Huffman coding for two symbols?

I am confused about the interpretation of the minimum description length of an alphabet of two symbols.
To be more concrete, suppose that we want to encode a binary string where 1's occur with probability 0.80; for instance, here is a string of length 40, with 32 1's and 8 0's:
1 1 0 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 1 0 1 1 0 0 1
Following standard MDL analysis, we can encode this string using prefix codes (like Huffman's) and the code of encoding this string would be (-log(0.8) * 32 - log(0.2) * 8), which is lower than duplicating the string without any encoding.
Intuitively, it is "cheaper" to encode this string than some string where 1's and 0's occur with equal probability. However, in practice, I don't see why this would be the case. At the very least, we need one bit to distinguish between 1's and 0's. I don't see how prefix codes could do better than just writing the binary string without encoding.
Can someone help me clarify this, please?
I don't see how prefix codes could do better than just writing the
binary string without encoding.
You can't with prefix codes, unless you combine bits to make more symbols. For example, if you code every two bits, you now have four symbols with probabilities 0.64, 0.16, 0.16, and 0.04. That would be coded with 0, 10, 110, 111. That gives an average of 1.56 bits per symbol, or 0.7800 bits per original bit. We're getting a little closer to the optimal 0.7219 bits per bit (-0.2 log20.2 - 0.8 log20.8).
Do that for three-bit groupings, and you get 0.7280 bits per bit. Surprisingly close to the optimum. In this case, the code lengths just happen to group really nicely with the probabilities. The code is 1 bit (0) for the symbol with probability 0.512, 3 bits (100, 101, 110) for the three symbols with probability 0.128, and 5 bits (11100, 11101, 11110, 11111) for both the three symbols with probability 0.032 and the one symbol with probability 0.008.
You can keep going and get asymptotically close to the optimal 0.7219 bits per bit. Though it becomes more inefficient in time and space for larger groupings. The Pareto Front turns out to be at multiples of three bits up through 15. 6 bits gives 0.7252 bits per bit, 9 gives 0.7251, 12 is 0.7250, and 15 is 0.7249. The approach is monumentally slow, where you need to go to 28 bits to get to 0.7221. So you might as well stop at 6. Or even just 3 is pretty good.
Alternatively you can use something other than prefix coding, such as arithmetic coding, range coding, or asymmetric numeral system coding. They effectively use fractional bits for each symbol.

Create a sequence which is ordered by bits set

I'm looking for a reversible function unsigned f(unsigned) for which the number of bits set in f(i) increases with i, or at least does not decrease. Obviously, f(0) has to be 0 then, and f(~0) must come last. In between there's more flexibility. After f(0), the next 32* values must be 1U<<0 to 1U<<31, but I don't care a lot about the order (they all have 1 bit set).
I'd like an algorithm which doesn't need to calculate f(0)...f(i-1) in order to calculate f(i), and a complete table is also unworkable.
This is similar to Gray codes, but I can't see a way to reuse that algorithm. I'm trying to use this to label a large data set, and prioritize the order in which I search them. The idea is that I have a key C, and I'll check labels C ^ f(i). Low values of i should give me labels similar to C, i.e. differing in only a few bits.
[*] Bonus points for not assuming that unsigned has 32 bits.
[example]
A valid initial sequence:
0, 1, 2, 4, 16, 8 ... // 16 and 8 both have one bit set, so they compare equal
An invalid initial sequence:
0, 1, 2, 3, 4 ... // 3 has two bits set, so it cannot precede 4 or 2147483648.
Ok, seems like I have a reasonable answer. First let's define binom(n,k) as the number of ways in which we can set k out of n bits. That's the classic Pascal triangle:
1 1
1 2 1
1 3 3 1
1 4 6 4 1
1 5 10 10 5 1
1 6 15 20 15 6 1
1 7 21 35 35 21 7 1
1 8 28 56 70 56 28 8 1
...
Easily calculated and cached. Note that the sum of each line is 1<<lineNumber.
The next thing we'll need is the partial_sum of that triangle:
1 2
1 3 4
1 4 7 8
1 5 11 15 16
1 6 16 26 31 32
1 7 22 42 57 63 64
1 8 29 64 99 120 127 128
1 9 37 93 163 219 247 255 256
...
Again, this table can be created by summing two values from the previous line, except that the new entry on each line is now 1<<line instead of 1.
Let's use these tables above to construct f(x) for an 8 bits number (it trivially generalizes to any number of bits). f(0) still has to be 0. Looking up the 8th row in the first triangle, we see that next 8 entries are f(1) to f(9), all with one bit set. The next 28 entries (7+6+5+4+3+2+1) all have 2 bits set, so that's f(10) to f(37). The next 56 entries, f(38) to f(93) have 3 bits, and there are 70 entries with 4 bits set. From symmetry we can see that they're centered around f(128), in particular they're f(94) to f(163). And obviously, the only number with 8 bits set sorts last, as f(255).
So, with these tables we can quickly determine how many bits must be set in f(i). Just do a binary search in the last row of your table. But that doesn't answer exactly which bits are set. For that we need the previous rows.
The reason that each value in the table can be created from the previous line is simple. binom(n,k) == binom(k, n-1) + binom(k-1, n-1). There are two sorts of numbers with k bits set: Those that start with a 0... and numbers which start with 1.... In the first case, the next n-1 bits must contain those k bits, in the second case the next n-1 bits must contain only k-1 bits. Special cases are of course 0 out of n and n out of n.
This same stucture can be used to quickly tell us what f(16) must be. We already had established that it must contain 2 bits set, as it falls in the range f(10) - f(37). In particular, it's number 6 with 2 bits set (starting as usual with 0). It's useful to define this as an offset in a range as we'll try to shrink the length this range from 28 down to 1.
We now subdivide that range into 21 values which start with a zero and 7 which start a one. Since 6 < 21, we know that the first digit is a zero. Of the remaining 7 bits, still 2 need to be set, so we move up a line in the triangle and see that 15 values start with two zeroes, and 6 start with 01. Since 6 < 15, f(16) starts with 00. Going further up, 7 <= 10 so it starts with 000. But 6 == 6, so it doesn't start with 0000 but 0001. At this point we change the start of the range, so the new offset becomes 0 (6-6)
We know need can focus only on the numbers that start with 0001 and have one extra bit, which are f(16)...f(19). It should be obvious by know that the range is f(16)=00010001, f(17)=00010010, f(18)=00010100, f(19)=00011000.
So, to calculate each bit, we move one row up in the triangle, compare our "remainder", add a zero or one based on the comparison possibly go left one column. That means the computational complexity of f(x) is O(bits), or O(log N), and the storage needed is O(bits*bits).
For each given number k we know that there are binom(n, k) n-bit integers that have exactly k bits of value one. We can now generate a lookup table of n + 1 integers that store for each k how many numbers have less one bits. This lookup table can then be used to find the number o of one bits of f(i).
Once we know this number we subtract the lookup table value for this number of bits from i which leaves us with the permutation index p for numbers with the given number of one bits. Altough I have not done research in this area I am quite sure that there exists a method for finding the pth permutation of a std::vector<bool> which is initialized with zeros and o ones in the lowest bits.
The reverse function
Again the lookup table comes in handy. We can directly calculate the number of preceding numbers with less one bits by counting the one bits in the input integer and reading in the lookup table. Then you "only" need to determine the permutation index and add it to the looked up value and you are done.
Disclaimer
Of course this is only a rough outline and some parts (especially involving the permutations) might take longer than it sounds.
Addition
You stated yourself
I'm trying to use this to label a large data set, and prioritize the order in which I search them.
Which sounds to me as if you would be going from the low hamming distance to the high hamming distance. In this case it would be enough to have an incremental version which generates the next number from the previous:
unsigned next(unsigned previous)
{
if(std::next_permutation(previous))
return previous;
else
return (1 << (1 + countOneBits(previous))) - 1;
}
Of course std::next_permutation permutation does not work this way but I think it is clear how I mean to use it.

Compression algorithms for numbers only

I am to compress location data (latitude,longitude, date,time). All the numbers are in fixed format. 2 of them (latitude,longitude) are with decimal format. Other 2 are integers.
Now these numbers are in fixed format string.
What are the algorithms for compressing numbers in fixed format?
Is number only compressions (if there any) better than string compression?
Should I directly compress string without converting it to numbers and then compress?
Thanks in advance.
This is one of these places where a little theory is helpful. You need to think about several things:
what is the resolution of your measurements: 0.1° or 0.001°? 1 second or one microsecond?
are the measurements associated and in some order, or tossed together randomly?
Let's say, just for example, that the resolution is 0.01°. Them you know that your values range from -180° to +180°, or 35900 different values. Lg(35900) ≈ 16 so you need 16 bits; 14 bits for -90°–+90°. Clearly, if you're storing this kind of value as floating-point, you can compress the data by half immediately.
Similarly with date time, what's the range; how many bits must you have?
Now, if the data is in some order (like, samples taken sequentially aboard a single ship) then all you need is a start value and a delta; that can make a big difference. With a ship traveling at 30 knots, the position can't change any more that about 0.03 degrees an hour or about 0.0000083 degrees a second. Those deltas are going to be very small values, so you can store them in a very few bits.
The point is that there are a number of things you can do, but you have to know more about the data than we do to make a recommendation.
Update: Oh, wait, fixed point strings?!
Okay, this is (relatively) easy. Just to start with, yes, you want to convert your strings into some binary representation. Just making up a data item, you might have
040.00105.0020090518212100Z
which you could convert to
| 4000 | short int, 16 bits |
| 10500 | short int, 16 bits |
| 20090518212100Z | 64 bits |
So that's 96 bits, 12 bytes versus 26 bytes.
Compression typically works on a byte stream. When a stream has a non-uniform distribution of byte values (for instance text, or numbers stored as text), the compression ratio you can achieve will be higher, since fewer bits are used to store the bytes which appear more frequently (in Huffman compression).
Typically, the data you are talking about will simply be stored as binary numbers (not text), and that's usually space and retrieval efficient.
I recommend you have a look at The Data Compression Book
What kind of data are you compressing? How is it distributed? Is it ordered in any way? All of these things can affect how well it compresses, and perhaps allow you to convert the data in to something more easily compressed, or simply smaller right out the gate.
Data compress works poorly on "random" data. If your data is within a smaller range, you may well be able to leverage that.
In truth, you should simply try running any of the common algorithms and see if the data is "compressed enough". If not, and you know more about the data than can be "intuited" by the compression algorithms, you should leverage that information.
An example is say that your data is not just Lat's and Long's, but they're assumed to be "close" to each other. Then you could probably store an "origin" Lat and Long, and the rest can be differential. Perhaps these differences are small enough to fit in to a single, signed byte.
That's just a simple example of things you can do with knowledge of the data vs what some generic algorithm may not be able to figure out.
It depends on what you are going to do with the data, and how much precision you need.
Lat/long is traditionally given in degrees, minutes, and seconds, with 60 seconds to the minute, 60 minutes to the degree,and 1 degree of latitude nominally equal to 60 nautical miles (nmi). 1 minute is then 1 nmi, and 1 second is just over 100 ft.
Latitude goes from -90 to +90 degrees. Representing latitude as integer seconds gives you a range of -324000..+324000, or about 20 bits. Longitude goes -180 to +180, so representing longitude the same way requires 1 more bit.
So you can represent a complete lat/long position, to +/- 50 ft, in 41 bits.
Obviously, if you don't need that much precision, you can back down your bit count.
Observe that a traditional single-precision 32-bit float uses about 24 bits of mantissa, so you are down to about +/- 6 feet if you just convert your lat/long in seconds to float. It is kind of hard to beat two single-precision floats for this kind of thing.
Depending on the available characters, you could make something quite easily.
For example, if the input is only digits (0..9), here's a solution that will encode and decode them, in Kotlin (similar thing on Java) :
fun encodeDigitsOnlyString(stringWithDigitsOnly: String): ByteArray {
//we couple each 2 digits together into a single byte.
//For the last digit, if it has no digit to pair with, it's paired with something that's not a digit
val result = ArrayList<Byte>()
val length = stringWithDigitsOnly.length
var lastDigit: Byte? = null
for (i in 0 until length) {
val char = stringWithDigitsOnly[i]
val digitAsByte = char.toString().toInt().toByte()
if (lastDigit == null) {
if (i == length - 1) {
//last digit
val newByte = (digitAsByte + 0xf0).toByte()
result.add(newByte)
} else {
//more to go
lastDigit = digitAsByte
}
} else {
val newByte = (digitAsByte + lastDigit.toInt().shl(4)).toByte()
result.add(newByte)
lastDigit = null
}
}
return result.toByteArray()
}
fun decodeByteArrayToDigitsOnlyString(encodedDigitsOnlyByteArray: ByteArray): String {
val sb = StringBuilder(encodedDigitsOnlyByteArray.size * 2)
for (byte in encodedDigitsOnlyByteArray) {
val hex = Integer.toHexString(byte.toInt()).takeLast(2).padStart(2, '0')
if (hex[0].isLetter())
sb.append(hex.last())
else
sb.append(hex)
}
return sb.toString()
}
Example usage:
val inputString="12345"
val byteArray=encodeDigitsOnlyString(inputString) //produces a byte array of size 3
val outputString=decodeByteArrayToDigitsOnlyString(byteArray) //should be the same as the input