How to do a memory implementation of the Huffman algorithm? - c++

I have a Huffman code algorithm that compresses characters into sequences of bits of arbitrary length, smaller than the default size of a char (8 bits on most modern platforms).
If the Huffman code compresses an 8-bit character into 3 bits, how do I represent that 3-bit value in memory? To take this further, how do I combine multiple compressed characters into a compressed representation?
For example, consider the letter l whose code is "00000" (5x8 bits as a string, since each '0' is itself a character). How do I represent l with the 5 bits 00000 instead of a character sequence?
A C or C++ implementation is preferred.

Now that this question is re-opened...
To make a variable that holds a variable number of bits, we just use the lower bits of one unsigned int to store the bits, and use another unsigned int to remember how many bits we have stored.
When writing out a Huffman-compressed file, we wait until we have at least 8 bits stored. Then we write out a char using the top 8 bits and subtract 8 from the stored bit count.
Finally, if you have any bits left to write out at the end, pad them up to a multiple of 8 and write out the last char.
In C++, it's useful to encapsulate the output in some kind of BitOutputStream class, like:
#include <fstream>

class BitOutputStream
{
    std::ofstream m_out;
    unsigned m_bitsPending;
    unsigned m_numPending;
public:
    BitOutputStream(const char *fileName)
        : m_out(... /* you can do this part */)
    {
        m_bitsPending = 0;
        m_numPending = 0;
    }
    // write out the lower <count> bits of <bits>
    void write(unsigned bits, unsigned count)
    {
        if (count > 16)
        {
            // do it in two steps to prevent overflow
            write(bits >> 16, count - 16);
            count = 16;
        }
        // make space for the new bits
        m_numPending += count;
        m_bitsPending <<= count;
        // store the new bits
        m_bitsPending |= (bits & ((1 << count) - 1));
        // write out any complete bytes
        while (m_numPending >= 8)
        {
            m_numPending -= 8;
            m_out.put((char)(m_bitsPending >> m_numPending));
        }
    }
    // write out any remaining bits, padded with zeros
    void flush()
    {
        if (m_numPending > 0)
        {
            m_out.put((char)(m_bitsPending << (8 - m_numPending)));
        }
        m_bitsPending = m_numPending = 0;
        m_out.flush();
    }
};
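For example, a minimal usage sketch might look like this. It assumes the constructor above has been completed to open the file, and it assumes a hypothetical table mapping each byte value to its Huffman code bits and bit count (these names are just illustrations, not part of the class above):

#include <string>

struct HuffCode { unsigned bits; unsigned count; }; // code bits (right-aligned) and code length

void encodeToFile(const std::string &text, const HuffCode table[256], const char *fileName)
{
    BitOutputStream out(fileName);
    for (unsigned char ch : text)
        out.write(table[ch].bits, table[ch].count); // append each variable-length code
    out.flush(); // pad the final partial byte with zeros
}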

If your Huffman coder returns an array of 1s and 0s representing the bits that should and should not be set in the output, you can shift these bits onto an unsigned char. Every eight shifts, you start writing to the next character, ultimately outputting an array of unsigned char. The number of these compressed characters that you will output is equal to the number of bits divided by eight, rounded up to the nearest natural number.
In C, this is a relatively simple function, consisting of a left shift (<<) and a bitwise OR (|). Here is the function, with an example to make it runnable. To see it with more extensive comments, please refer to this GitHub gist.
#include <stdlib.h>
#include <stdio.h>
#define BYTE_SIZE 8
size_t compress_code(const int *code, const size_t code_length, unsigned char **compressed)
{
    if (code == NULL || code_length == 0 || compressed == NULL) {
        return 0;
    }
    size_t compressed_length = (code_length + BYTE_SIZE - 1) / BYTE_SIZE;
    *compressed = calloc(compressed_length, sizeof(unsigned char));
    if (*compressed == NULL) {
        return 0;
    }
    for (size_t char_counter = 0, i = 0; char_counter < compressed_length && i < code_length; ++i) {
        if (i > 0 && (i % BYTE_SIZE) == 0) {
            ++char_counter;
        }
        // Shift the bits already stored in this byte left by one
        (*compressed)[char_counter] <<= 1;
        // Put the next bit onto the end of the unsigned char
        (*compressed)[char_counter] |= (code[i] & 1);
    }
    // Pad the remaining space with 0s on the right-hand side
    (*compressed)[compressed_length - 1] <<= compressed_length * BYTE_SIZE - code_length;
    return compressed_length;
}
int main(void)
{
    const int code[] = { 0, 1, 0, 0, 0, 0, 0, 1,   // 65: A
                         0, 1, 0, 0, 0, 0, 1, 0 }; // 66: B
    const size_t code_length = 16;
    unsigned char *compressed = NULL;
    size_t compressed_length = compress_code(code, code_length, &compressed);
    for (size_t i = 0; i < compressed_length; ++i) {
        printf("%c\n", compressed[i]);
    }
    return 0;
}
You can then just write the characters in the array to a file, or even copy the array's memory directly to a file, to write the compressed output.
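For instance, a small sketch of doing that with fwrite; the function name and file path here are only illustrations:

#include <stdio.h>

/* Sketch: dump a packed buffer (as produced by compress_code) to a binary file. */
int write_compressed(const char *path, const unsigned char *compressed, size_t compressed_length)
{
    FILE *fp = fopen(path, "wb");
    if (fp == NULL) {
        return -1;
    }
    size_t written = fwrite(compressed, 1, compressed_length, fp);
    fclose(fp);
    return written == compressed_length ? 0 : -1;
}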
Reading the compressed characters into bits, which will allow you to traverse your Huffman tree for decoding, is done with right shifts (>>) and checking the rightmost bit with bitwise AND (&).
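As a sketch of that direction, reusing BYTE_SIZE and the includes from the compress_code example above (the tree-walking line in the comment is only a hypothetical illustration):

/* Sketch: visit the packed bits in the order compress_code stored them (MSB first
   within each byte). The caller would use each bit to walk its Huffman tree. */
void visit_bits(const unsigned char *compressed, size_t code_length)
{
    for (size_t i = 0; i < code_length; ++i) {
        int bit = (compressed[i / BYTE_SIZE] >> (BYTE_SIZE - 1 - (i % BYTE_SIZE))) & 1;
        /* e.g. node = bit ? node->right : node->left;  (hypothetical tree walk) */
        (void)bit; /* placeholder so the sketch compiles on its own */
    }
}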

Related

8b10b encoder with byte stream output (bits carry): faster bitwise algorithm?

I have written an 8b10b encoder that generates a stream of bytes intended to be sent to a serial transmitter which sends the bytes as-is, LSb first.
What I'm doing here is basically laying down groups of 10 bits (encoded from the input stream of bytes) on groups of 8, so a varying number of bits gets carried over from one output byte to the next - kind of like in music/rhythm.
The program has been successfully tested, but it is about 4-5x too slow for my application. I think it comes from the fact that every bit has to be looked up in an array. My gut tells me we could make that faster with some sort of rolling mask, but I can't yet see how to do that, even by swapping the 3D array of booleans for a 2D array of integers.
Any pointers or other ideas?
Here is the code. Please ignore most of the macros and some of the code related to deciding which byte is to be written as this is application-specific.
Header:
#ifndef TX_BYTESTREAM_GEN_H_INCLUDED
#define TX_BYTESTREAM_GEN_H_INCLUDED
#include <stdint.h> //for standard portable types such as uint16_t
#define MAX_USB_TRANSFER_SIZE 1016 //Bytes, size of the max payload in a USB transaction. Determined using FT4222_GetMaxTransferSize()
#define MAX_USB_PACKET_SIZE 62 //Bytes, max size of the payload of a single USB packet
#define MANDATORY_TX_PACKET_BLOCK 5 //Bytes, constant - equal to the minimum number of bytes of TX packet necessary to exactly transfer blocks of 10 bits of encoded data (LCM of 8 and 10, in bits)
#define SYNC_CHARS_MAX_INTERVAL 172 //Target number of payload bytes between sync chars. Max is 188 before desynchronisation
#define ROUND_UP(N, S) ((((N) + (S) - 1) / (S)) * (S)) //Macro to round the integer N up to the nearest multiple of the integer S
#define ROUND_DOWN(N,S) ((N / S) * S) //Same rounding down
#define N_SYNC_CHAR_PAIRS_IN_PCKT(pcktSz) (ROUND_UP((pcktSz*1000/(SYNC_CHARS_MAX_INTERVAL+2)),1000)/1000) //Number of sync (K28.5) character/byte pairs in a given packet
#define TX_PAYLOAD_SIZE(pcktSz) ((pcktSz*4/5)-2*N_SYNC_CHAR_PAIRS_IN_PCKT(pcktSz)) //Size in bytes of the payload data before encoding in a single TX packet
#define MAX_TX_PACKET_SIZE (ROUND_DOWN((MAX_USB_TRANSFER_SIZE-MAX_USB_PACKET_SIZE),(MAX_USB_PACKET_SIZE*MANDATORY_TX_PACKET_BLOCK))) //Maximum size in bytes of a TX packet
#define DEFAULT_TX_PACKET_SIZE (MAX_TX_PACKET_SIZE-MAX_USB_PACKET_SIZE*MANDATORY_TX_PACKET_BLOCK) //Default size in bytes of a TX packet with some margin
#define MAX_TX_PAYLOAD_SIZE (TX_PAYLOAD_SIZE(MAX_TX_PACKET_SIZE)) //Maximum size in bytes of the payload in a TX packet
#define DEFAULT_TX_PAYLOAD_SIZE (TX_PAYLOAD_SIZE(DEFAULT_TX_PACKET_SIZE))//Default size in bytes of the payload in a TX packet with some margin
//See string descriptors below for definitions. Error codes are individual bits so can be combined.
enum ErrCode
{
NO_ERR = 0,
INVALID_DIN_SIZE = 1,
INVALID_DOUT_SIZE = 2,
NULL_DIN_PTR = 4,
NULL_DOUT_PTR = 8
};
char const * const ERR_CODE_DESC[] = {
"No error",
"Invalid size of input data",
"Invalid size of output buffer",
"Input data pointer is NULL",
"Output buffer pointer is NULL"
};
/** @brief Generates the bytestream to the transmitter by encoding the incoming data using 8b10b encoding
and inserting K28.5 synchronisation characters to maintain the synchronisation with the demodulator (LVDS passthrough mode)
@param din is a pointer to an allocated array of bytes which contains the data to encode
@param dinSize is the size of din in bytes. This size must be equal to TX_PAYLOAD_SIZE(doutSize)
@param dout is a pointer to an allocated array of bytes which is intended to contain the output bytestream to the transmitter
@param doutSize is the size of dout in bytes. This size must meet the conditions at the top of this function's implementation. Use DEFAULT_TX_PACKET_SIZE if in doubt.
@return error code (c.f. ErrCode) **/
int TX_gen_bytestream(uint8_t *din, uint16_t dinSize, uint8_t *dout, uint16_t doutSize);
#endif // TX_BYTESTREAM_GEN_H_INCLUDED
Source file:
#include "TX_bytestream_gen.h"
#include <cstddef> //NULL
#define N_BYTE_VALUES (256+1) //256 possible data values + 1 special character (only accessible to this module)
#define N_ENCODED_BITS 10 //Number of bits corresponding to the 8b10b encoding of a byte
//Map the current running disparity, the desired value to encode to the array of encoded bits for 8b10b encoding.
//The Last value is the K28.5 sync character, only accessible to this module
//Notation = MSb to LSb
bool const encodedBits[2][N_BYTE_VALUES][N_ENCODED_BITS] =
{
//Long table (see appendix)
};
//New value of the running disparity after encoding with the specified previous running disparity and requested byte value (c.f. above)
bool const encodingDisparity[2][N_BYTE_VALUES] =
{
//Long table (see appendix)
};
int TX_gen_bytestream(uint8_t *din, uint16_t dinSize, uint8_t *dout, uint16_t doutSize)
{
    static bool RDp = false; //Running disparity is initially negative
    int ret = 0;

    //If the output buffer size is not a multiple of the mandatory payload block or of the USB packet size, or if it cannot be held in a single USB transaction
    //return an invalid output buffer size error
    if(doutSize == 0 || (doutSize % MANDATORY_TX_PACKET_BLOCK) || (doutSize % MAX_USB_PACKET_SIZE) || (doutSize > MAX_TX_PACKET_SIZE)) //Temp
        ret |= INVALID_DOUT_SIZE;
    //If the input data size is not consistent with the output buffer size, return the appropriate error code
    if(dinSize == 0 || dinSize != TX_PAYLOAD_SIZE(doutSize))
        ret |= INVALID_DIN_SIZE;
    if(din == NULL)
        ret |= NULL_DIN_PTR;
    if(dout == NULL)
        ret |= NULL_DOUT_PTR;

    //If everything checks out, carry on
    if(ret == NO_ERR)
    {
        uint16_t iByteIn = 0; //Index of the byte of input data currently being processed
        uint16_t iByteOut = 0; //Index of the output byte currently being written to
        uint8_t iBitOut = 0; //Starts with LSb
        int16_t nBytesUntilSync = 0; //Countdown of bytes until a sync marker needs to be sent. Cyclic.

        //For all output bytes to generate
        while(iByteOut < doutSize)
        {
            bool sync = false; //Initially this byte is not considered a sync byte (in which case the next byte of data will be processed)
            //If the maximum interval between sync characters has been reached, mark the two next bytes as sync bytes and reset the counter
            if(nBytesUntilSync <= 0)
            {
                sync = true;
                if(nBytesUntilSync == -1) //After the second SYNC is written, the counter is reset
                {
                    nBytesUntilSync = SYNC_CHARS_MAX_INTERVAL;
                }
            }

            //Append bit by bit the encoded data of the byte to write to the output bitstream (carried over from byte to byte) - LSb first
            //The byte to write is either the last byte of the encodedBits map (the sync character K28.5) if sync is set, or the next byte of
            //input data if it isn't
            uint16_t const byteToWrite = (sync ? (N_BYTE_VALUES-1) : din[iByteIn]);
            for(int8_t iEncodedBit = N_ENCODED_BITS-1; iEncodedBit >= 0; --iEncodedBit, iBitOut++)
            {
                //If the current output byte is complete, reset the bit index and select the next one
                if(iBitOut >= 8)
                {
                    iByteOut++;
                    iBitOut = 0;
                }
                //Effectively sets the iBitOut'th bit of the iByteOut'th byte out to the encoded value of the byte to write
                bool bitToWrite = encodedBits[RDp][byteToWrite][iEncodedBit]; //Temp
                dout[iByteOut] ^= (-bitToWrite ^ dout[iByteOut]) & (1 << iBitOut);
            }
            //The running disparity is also updated as per the standard (to achieve DC balance)
            RDp = encodingDisparity[RDp][byteToWrite]; //Update the running disparity

            //If sync was not set, this means a byte of the input data has been processed, in which case take the next one in
            //Also decrement the synchronisation counter
            if(!sync) {
                iByteIn++;
            }
            //In any case, decrease the synchronisation counter. Even sync characters decrease it (c.f. top of while loop)
            nBytesUntilSync--;
        }
    }
    return ret;
}
Testbench:
#include <iostream>
#include "TX_bytestream_gen.h"
#define PACKET_DURATION 0.000992 //In seconds, time of continuous data stream corresponding to one packet (5MHz output, default packet size)
#define TIME_TO_SIMULATE 10 //In seconds
#define PACKET_SIZE DEFAULT_TX_PACKET_SIZE
#define PAYLOAD_SIZE DEFAULT_TX_PAYLOAD_SIZE
#define N_ITERATIONS (TIME_TO_SIMULATE/PACKET_DURATION)
#include <chrono>
using namespace std;
//Testbench: measure the time taken to simulate TIME_TO_SIMULATE seconds of continuous encoding
int main()
{
uint8_t toEncode[PAYLOAD_SIZE] = {100}; //Dummy data, doesn't matter
uint8_t out[PACKET_SIZE] = {0};
std::chrono::time_point<std::chrono::system_clock> start, end;
start = std::chrono::system_clock::now();
for(unsigned int i = 0 ; i < N_ITERATIONS ; i++)
{
TX_gen_bytestream(toEncode, PAYLOAD_SIZE, out, PACKET_SIZE);
}
end = std::chrono::system_clock::now();
std::chrono::duration<double> elapsed_seconds = end - start;
std::cout << "Task execution time: " << elapsed_seconds.count()/TIME_TO_SIMULATE*100 << "% (for " << TIME_TO_SIMULATE << "s simulated)\n";
return 0;
}
Appendix: lookup tables. I don't have enough characters to paste it here, but it looks like so:
bool const encodedBits[2][N_BYTE_VALUES][N_ENCODED_BITS] =
{
//Running disparity = RD-
{
{1,0,0,1,1,1,0,1,0,0},
//...
},
//Running disparity = RD+
{
{0,1,1,0,0,0,1,0,1,1},
//...
}
};
bool const encodingDisparity[2][N_BYTE_VALUES] =
{
//Previous running disparity was RD-
{
0,
//...
},
//Previous running disparity was RD+
{
1,
//...
}
};
This will be a lot faster if you do everything a byte at a time instead of a bit at a time.
First change the way you store your lookup tables. You should have something like:
// conversion from (RD, byte) to (RD, 10-bit code)
// in each word, the lower 10 bits are the code,
// and bit 10 (the 11th bit) is the new RD
// The first 256 values are for RD -1, the next
// for RD 1
static const uint16_t BYTE_TO_CODE[512] = {
...
}
Then you need to change your encoding loop to write a byte at a time. You can use a uint16_t to store the leftover bits from each byte you output.
Something like this (I didn't figure out your sync byte logic, but presumably you can put that in the input or output byte loop):
// returns next isRD1
bool TX_gen_bytestream(uint8_t *dest, const uint8_t *src, size_t src_len, bool isRD1)
{
    // bits generated, but not yet written, LSB first
    uint16_t bits = 0;
    // number of bits in 'bits'
    unsigned numbits = 0;
    // current RD, either 0 or 256
    uint16_t rd = isRD1 ? 256 : 0;

    for (const uint8_t *end = src + src_len; src < end; ++src) {
        // lookup code and next rd
        uint16_t code = BYTE_TO_CODE[rd + *src];
        // new rd from code bit 10
        rd = (code >> 2) & 256;
        // store bits
        bits |= (code & (uint16_t)0x03FF) << numbits;
        numbits += 10;
        // write out any complete bytes
        while (numbits >= 8) {
            *dest++ = (uint8_t)bits;
            bits >>= 8;
            numbits -= 8;
        }
    }
    // If src_len isn't divisible by 4, then we have some extra bits
    if (numbits) {
        *dest = (uint8_t)bits;
    }
    return !!rd;
}

How to grab specific bits from a 256 bit message?

I'm using winsock to receive udp messages 256 bits long. I use 8 32-bit integers to hold the data.
int32_t dataReceived[8];
recvfrom(client, (char *)&dataReceived, 8 * sizeof(int), 0, &fromAddr, &fromLen);
I need to grab specific bits, like bit #100, #225, #55, etc. So some bits will be in dataReceived[3], some in dataReceived[4], etc.
I was thinking I would need to bit-shift each array element, but things got complicated. Am I approaching this all wrong?
Why are you using int32_t type for buffer elements and not uint32_t?
I usually use something like this:
int bit_needed = 100;
uint32_t the_bit = dataReceived[bit_needed>>5] & (1U << (bit_needed & 0x1F));
Or you can use this one (but it won't work reliably for the sign bit of signed integers):
int bit_needed = 100;
uint32_t the_bit = (dataReceived[bit_needed>>5] >> (bit_needed & 0x1F)) & 1U;
In the other answers you can access only the lowest 8 bits of each int32_t.
When you count bits and bytes from 0:
int bit_needed = 100;
So:
int byte = int(bit_needed / 8);
int bit = bit_needed % 8;
int the_bit = dataReceived[byte] & (1 << bit);
If the required bit contains 0, then the_bit will be zero. If it's 1, then the_bit will hold 2 to the power of that bit's ordinal place within the byte.
You can make a small function to do the job.
uint8_t checkbit(uint32_t *dataReceived, int bitToCheck)
{
    int word = bitToCheck / 32;
    int bit = bitToCheck - word * 32;
    if (dataReceived[word] & (1U << bit))
        return 1;
    else
        return 0;
}
Note that you should use uint32_t rather than int32_t if you are doing bit shifting. Right-shifting signed integers can give unwanted (implementation-defined) results, especially if the MSbit is 1.
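A small self-contained illustration of the difference:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    int32_t  s = INT32_MIN;     /* MSbit set, signed */
    uint32_t u = 0x80000000u;   /* same bit pattern, unsigned */
    /* Right-shifting a negative signed value is implementation-defined and usually
       smears the sign bit; the unsigned shift always brings in zeros. */
    printf("%ld %lu\n", (long)(s >> 31), (unsigned long)(u >> 31)); /* typically prints: -1 1 */
    return 0;
}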
You can use a macro in C or C++ to check for specific bit:
#define bit_is_set(var,bit) ((var) & (1 << (bit)))
and then a simple if:
if(bit_is_set(message,29)){
//bit is set
}

Split up 32 bit value in C++ and concatenate the chunks in MATLAB

I'm working on a project where I have to send 32-bit values over UART to MATLAB, where I need to print them in the MATLAB terminal. I do this by splitting up the 32-bit value into 8-bit values like so:
void Configurator::send(void) {
    /**
     * Split the 32 bits into chunks of 4 bytes of 8 bits
     */
    union {
        uint32_t data;
        uint8_t bytes[4];
    } splitData;

    splitData.data = 1234587;

    for (uint8_t byte : splitData.bytes) {
        XUartPs_SendByte(STDOUT_BASEADDRESS, byte);
    }
}
In MATLAB I receive the following 4 bytes:
252
230
25
155
Now the question is, how do I restore the 1234587?
Am I correct in creating an array of size 4 as uint8_t? I would also like to note that I'm using union for readability. If I'm doing it wrong, I'd be happy to hear why!
You could use left shift to restore the value
uint32_t value = ((uint32_t)byte[3]<<24) + (byte[2]<<16) + (byte[1]<<8) + (byte[0]<<0);
Try to avoid using unions for this sort of thing. It is not (in principle) portable, and can cause undefined behaviour. Instead write it like this:
void Configurator::send(void) {
    /**
     * Split the 32 bits into chunks of 4 bytes of 8 bits
     */
    uint32_t data = 1234587;
    for (int n = 0; n < 4; n++) {
        unsigned char octet = (data >> (n * 8)) & 0xFF;
        XUartPs_SendByte(STDOUT_BASEADDRESS, octet);
    }
}

uint32_t receiveBytes()
{
    uint32_t result = 0;
    for (int n = 0; n < 4; n++)
    {
        unsigned char octet = getOctet();
        uint32_t octet32 = octet;
        result |= octet32 << (n * 8);
    }
    return result;
}
The point is that by shifting out bytes like this, you avoid any problems with endianness. The masking also means that if either end has 32-bit chars (such platforms exist), it all works anyway.

concatenating individual characters and converting to a combined decimal in c++

I have a sensor that stores the recorded information as a .pcap file. I have managed to load the file into an unsigned char array. The sensor stores information in a unique format. For instance, to represent an angle of 290.16, it stores the binary equivalent of 0x58 0x71.
What I have to do to get the correct angle is concatenate 0x71 and 0x58, convert the resulting hex value to decimal, divide it by 100, and then store it for further analysis.
My current approach is this:
//all header files are included
int main()
{
    unsigned char data[50]; //I actually have the data loaded into this from a file
    data[40] = 0x58;
    data[41] = 0x71;
    // The above may be incorrect. What I am trying to imply is that if I use the statement
    //   printf("%.2x %.2x", data[40], data[41]);
    // the resultant output you see on screen is
    //   58 71
    // I get the decimal value I wanted using the statement below
    float gar = hex2Dec(dec2Hex(data[41]) + dec2Hex(data[40])) / 100.0;
}
hex2Dec and dec2Hex are my own functions.
unsigned int hex2Dec(const string Hex)
{
    unsigned int DecimalValue = 0;
    for (unsigned int i = 0; i < Hex.size(); ++i)
    {
        DecimalValue = DecimalValue * 16 + hexChar2Decimal(Hex[i]);
    }
    return DecimalValue;
}

string dec2Hex(unsigned int Decimal)
{
    string Hex = "";
    while (Decimal != 0)
    {
        int HexValue = Decimal % 16;
        // convert decimal value to a hex digit
        char HexChar = (HexValue <= 9 && HexValue >= 0) ?
            static_cast<char>(HexValue + '0') : static_cast<char>(HexValue - 10 + 'A');
        Hex = HexChar + Hex;
        Decimal = Decimal / 16;
    }
    return Hex;
}

int hexChar2Decimal(char Ch)
{
    Ch = toupper(Ch); //Change the character to upper case
    if (Ch >= 'A' && Ch <= 'F')
    {
        return 10 + Ch - 'A';
    }
    else
        return Ch - '0';
}
The pain is that I have to do this conversion billions of times, which really slows down the process. Is there any more efficient way to deal with this case?
A MATLAB script that my friend developed for a similar sensor took him 3 hours to extract data worth only 1 minute of real time. I really need it to be as fast as possible.
As far as I can tell this does the same as
float gar = ((data[45]<<8)+data[44])/100.0;
For:
unsigned char data[50];
data[44] = 0x58;
data[45] = 0x71;
the value of gar will be 290.16.
Explanation:
It is not necessary to convert the value of an integer to a string to get the hex value, because decimal, hexadecimal, binary, etc. are only different representations of the same value. data[45]<<8 shifts the value of data[45] eight bits to the left. Before the operation is performed, the type of the operand is promoted to int (except on some unusual implementations where it might be unsigned int), so the new data type is large enough not to overflow. Shifting eight bits to the left is equivalent to shifting 2 digits to the left in hexadecimal representation, so the result is 0x7100. Then the value of data[44] is added to that and you get 0x7158. This int result is then divided by 100.0 and the quotient is converted to float.
In general int might be too small to apply the shift operation without shifting the sign if it is only 16-bit long. If you want to cover that case then explicitly cast to unsigned int:
float gar = (((unsigned int)data[45]<<8)+data[44])/100.0;
In "C convert hex to decimal format", Emil H posted some sample code that looks very similar to what you want.
#include <stdio.h>
#include <stdlib.h>
int main(void)
{
char *hex_value_string = "deadbeef";
unsigned int out;
sscanf(hex_value_string, "%x", &out);
printf("%o %o\n", out, 0xdeadbeef);
printf("%x %x\n", out, 0xdeadbeef);
return 0;
}
Your conversion functions don't look particularly efficient, so hopefully this is faster.

bit pattern matching and replacing

I have come across a very tricky problem with bit manipulation.
As far as I know, the smallest variable size that can hold a value is one byte of 8 bits. The bit operations available in C/C++ apply to whole bytes (or larger units) at a time.
Imagine that I have a map to replace a binary pattern 100100 (6 bits) with a signal 10000 (5 bits). If the 1st byte of input data from a file is 10010001 (8 bits), stored in a char variable, part of it matches the 6-bit pattern and should therefore be replaced by the 5-bit signal, giving a result of 1000001 (7 bits).
I can use a mask to manipulate the bits within a byte and turn the leftmost bits into 10000 (5 bits), but the rightmost 3 bits become very tricky to manipulate. I cannot simply shift the rightmost 3 bits of the original data to get the correct result 1000001 (7 bits) followed by 1 padding bit in that char variable, a bit which should instead be filled by the 1st bit of the next input byte.
I wonder if C/C++ can actually do this sort of replacement for bit patterns whose length does not fit into a char (1 byte) variable or even an int (4 bytes). Can C/C++ do the trick, or do we have to go to assembly languages that deal with single-bit manipulation?
I heard that Power Basic may be able to do the bit-by-bit manipulation better than C/C++.
If time and space are not important then you can convert the bits to a string representation and perform replaces on the string, then convert back when needed. Not an elegant solution but one that works.
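Just as a rough sketch of that idea (all names here are made up, and the pattern/replacement strings are the ones from the question):

#include <bitset>
#include <string>

// Turn raw bytes into a string of '0'/'1' characters, MSB first within each byte.
std::string toBitString(const std::string &bytes)
{
    std::string bits;
    for (unsigned char c : bytes)
        bits += std::bitset<8>(c).to_string();
    return bits;
}

// Replace every occurrence of pat with rep in the bit string.
void replaceAll(std::string &bits, const std::string &pat, const std::string &rep)
{
    for (std::string::size_type pos = 0;
         (pos = bits.find(pat, pos)) != std::string::npos;
         pos += rep.size())
        bits.replace(pos, pat.size(), rep);
}

// Usage: replaceAll(bits, "100100", "10000"); afterwards repack 8 characters at a
// time with std::bitset<8>(bits.substr(i, 8)).to_ulong(), padding the tail with '0's.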
<< shift left
^ XOR
>> shift right
~ one's complement
Using these operations, you could easily isolate the pieces that you are interested in and compare them as integers.
Say the byte is 01000100 (68) and you want to check whether it contains 1000:
char k = (char)68;
char c = (char)8;
int i = 0;
while (i < 5) {
    // slide a 4-bit window across the byte and compare it with the pattern
    if (((k >> (4 - i)) & 0x0F) == c) {
        //do stuff
        break;
    }
    ++i;
}
This is very sketchy code, just meant to be a demonstration.
I wonder if C/C++ can actually do this
sort of replacement of bit patterns of
length that do not fit into a Char (1
byte) variable or even Int (4 bytes).
What about std::bitset?
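For instance, a tiny sketch using the 8-bit example byte from the question:

#include <bitset>
#include <iostream>

int main()
{
    std::bitset<8> input(0x91);    // 10010001, the example input byte
    std::bitset<8> pattern(0x24);  // 100100, right-aligned in 8 bits
    // Compare the top 6 bits of the input against the 6-bit pattern:
    bool match = (input >> 2) == pattern;
    std::cout << input << (match ? " starts with 100100\n" : " has no match at the front\n");
    return 0;
}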
Here's a small bit reader class which may suit your needs. Of course, you may want to create a bit writer for your use case.
#include <iostream>
#include <sstream>
#include <cassert>

class BitReader {
public:
    typedef unsigned char BitBuffer;

    BitReader(std::istream &input) :
        input(input), buffer(0), bufferedBits(8) {
    }

    BitBuffer peekBits(int numBits) {
        assert(numBits <= 8);
        assert(numBits > 0);
        skipBits(0); // Make sure we have a non-empty buffer
        return (((input.peek() << 8) | buffer) >> bufferedBits) & ((1 << numBits) - 1);
    }

    void skipBits(int numBits) {
        assert(numBits >= 0);
        numBits += bufferedBits;
        while (numBits > 8) {
            buffer = input.get();
            numBits -= 8;
        }
        bufferedBits = numBits;
    }

    BitBuffer readBits(int numBits) {
        assert(numBits <= 8);
        assert(numBits > 0);
        BitBuffer ret = peekBits(numBits);
        skipBits(numBits);
        return ret;
    }

    bool eof() const {
        return input.eof();
    }

private:
    std::istream &input;
    BitBuffer buffer;
    int bufferedBits; // How many bits of 'buffer' have already been consumed (8 = empty)
};
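A quick usage sketch, relying on the includes already pulled in above; note that the class hands back the low-order bits of each byte first:

int main()
{
    std::string bytes;
    bytes.push_back((char)0x91); // 10010001, the example byte from the question
    bytes.push_back(0);          // padding so peek() never returns EOF mid-read
    std::istringstream data(bytes);

    BitReader reader(data);
    std::cout << (int)reader.readBits(3) << '\n'; // lowest 3 bits of 0x91: 001 -> 1
    std::cout << (int)reader.readBits(3) << '\n'; // next 3 bits: 010 -> 2
    std::cout << (int)reader.readBits(2) << '\n'; // top 2 bits: 10 -> 2
    return 0;
}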
Use a vector<bool> if you can read your data into the vector mostly at once. It may be more difficult to find-and-replace sequences of bits, though.
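For example, a small sketch of unpacking a byte buffer into one vector<bool> element per bit (MSB first):

#include <cstddef>
#include <cstdint>
#include <vector>

std::vector<bool> toBits(const uint8_t *data, std::size_t len)
{
    std::vector<bool> bits;
    bits.reserve(len * 8);
    for (std::size_t i = 0; i < len; ++i)
        for (int b = 7; b >= 0; --b)              // MSB first, one bool per bit
            bits.push_back(((data[i] >> b) & 1) != 0);
    return bits;
}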
If I understood your question correctly, you have an input stream and an output stream, and you want to replace 6-bit patterns from the input with 5-bit patterns in the output - and your output should still be a bit stream?
So, the most important programmer's rule can be applied: Divide et impera!
You should split your component in three parts:
1. Input stream converter: convert every pattern in the input stream to a char array (ring) buffer. If I understood you correctly, your input "commands" are 8 bits long, so there is nothing special about this.
2. Do the replacement on the ring buffer in such a way that you replace every matching 6-bit pattern with the 5-bit one, but "pad" the 5 bits with leading zeros, so the total length is still 8 bits.
3. Write an output handler that reads from the ring buffer and writes only the 7 LSBs of each buffered byte to the output stream. Of course some bit manipulation is necessary again for this.
If your ring buffer size can be divided by 8 and 7 (i.e. is a multiple of 56) you will have a clean buffer at the end and can start again with step 1.
The simplest way to implement this is to iterate over these 3 steps as long as input data is available.
If performance really matters and you are running on a multi-core CPU, you could even split the steps across 3 threads, but then you must carefully synchronize access to the ring buffer.
I think the following does what you want.
#include <stdint.h>

#define PATTERN_LEN 6
#define PATTERNMASK 0x3F //6 bits
#define PATTERN 0x24 //b100100
#define REPLACE_LEN 5
#define REPLACEMENT 0x10 //b10000

void compress(const uint8_t *inbits, uint8_t *outbits, int len)
{
    uint16_t accumulator = 0;
    int nbits = 0;
    uint8_t candidate;

    while (len--) //for all input bytes
    {
        //for each bit (msb first)
        for (int i = 7; i >= 0; i--)
        {
            //add 1 bit to the accumulator
            accumulator <<= 1;
            accumulator |= (*inbits >> i) & 1;
            nbits++;
            //check for the pattern
            candidate = accumulator & PATTERNMASK;
            if (candidate == PATTERN)
            {
                //remove the pattern
                accumulator >>= PATTERN_LEN;
                //add the replacement
                accumulator <<= REPLACE_LEN;
                accumulator |= REPLACEMENT;
                nbits += (REPLACE_LEN - PATTERN_LEN);
            }
        }
        inbits++;
        //move the accumulator to the output to prevent overflow
        while (nbits > 8)
        {
            //copy the highest 8 bits
            nbits -= 8;
            *outbits++ = (accumulator >> nbits) & 0xFF;
            //clear them from the accumulator
            accumulator &= ~(0xFF << nbits);
        }
    }
    //copy the remainder of the accumulator to the output, left-aligned and padded with zeros
    if (nbits > 0)
    {
        *outbits++ = (accumulator << (8 - nbits)) & 0xFF;
    }
}
You could use a switch or a loop in the middle to check the candidate against multiple patterns. There might have to be some special handling after doing a replacement to ensure the replacement pattern is not re-checked for matches.
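As a rough sketch of such a table-driven check: the helper and rule values below are made up (only the question's 100100 -> 10000 rule is filled in), and a call like nbits += tryReplace(&accumulator); would stand in for the single-pattern test inside the loop above.

#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint16_t pattern;  // bit pattern to match, right-aligned
    uint16_t mask;     // mask covering the pattern's length
    int      patLen;   // number of bits in the pattern
    uint16_t repl;     // replacement bits, right-aligned
    int      replLen;  // number of bits in the replacement
} Rule;

static const Rule rules[] = {
    { 0x24, 0x3F, 6, 0x10, 5 },  // 100100 -> 10000, the example from the question
    // add further rules here
};

// Try each rule against the low bits of the accumulator; apply the first match
// and return the resulting change in bit count (0 if nothing matched).
static int tryReplace(uint16_t *accumulator)
{
    for (size_t r = 0; r < sizeof(rules) / sizeof(rules[0]); ++r) {
        if ((*accumulator & rules[r].mask) == rules[r].pattern) {
            *accumulator >>= rules[r].patLen;   // drop the matched bits
            *accumulator <<= rules[r].replLen;  // make room for the replacement
            *accumulator |= rules[r].repl;
            return rules[r].replLen - rules[r].patLen;
        }
    }
    return 0;
}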
#include <iostream>
#include <cstring>
size_t matchCount(const char* str, size_t size, char pat, size_t bsize) noexcept
{
    if (bsize > 8) {
        return 0;
    }
    size_t bcount = 0; // curr bit number
    size_t pcount = 0; // curr bit in pattern char
    size_t totalm = 0; // total number of patterns matched
    const size_t limit = size * 8;
    while (bcount < limit)
    {
        auto offset = bcount % 8;
        char c = str[bcount / 8];
        c >>= offset;
        char tpat = pat >> pcount;
        if ((c & 1) == (tpat & 1))
        {
            ++pcount;
            if (pcount == bsize)
            {
                ++totalm;
                pcount = 0;
            }
        }
        else // mismatch
        {
            bcount -= pcount; // backtrack
            // reset
            pcount = 0;
        }
        ++bcount;
    }
    return totalm;
}

int main(int argc, char** argv)
{
    const char* str = "abcdefghiibcdiixyz";
    char pat = 'i';
    std::cout << "Num matches = " << matchCount(str, 18, pat, 7) << std::endl;
    return 0;
}