My interrupt routine does not access an array correctly - C++

Update: it seems there are some issues with the trig functions in math.h (using the MPIDE compiler). It is no wonder I couldn't see this with my debugger, which was using its own math.h and therefore giving me the expected (correct) solutions. I found this out by accident on the Microchip boards and instead implemented a 'fast sine/cosine' algorithm (see devmaster dot com for this). My ISR and ColourWheel array now work perfectly.
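For anyone following the same route, here is a minimal sketch of the parabolic fast-sine approximation popularized in that devmaster write-up. The constants are the standard ones from that article, not taken from my own code, so treat this as an illustration:

// Parabolic approximation of sin(x), valid for x in [-pi, pi].
// Constants are the usual devmaster values; verify them for your needs.
float fast_sin(float x)
{
    const float B = 4.0f / 3.1415927f;                    // 4/pi
    const float C = -4.0f / (3.1415927f * 3.1415927f);    // -4/pi^2
    float y = B * x + C * x * (x < 0 ? -x : x);           // first parabola
    const float P = 0.225f;                               // precision blend factor
    y = P * (y * (y < 0 ? -y : y) - y) + y;               // optional refinement pass
    return y;                                             // max error ~0.001
}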
I must say that, as a relative newcomer to C/C++, I have spent a lot of hours reviewing and re-reviewing my own code for errors. The last possible thing on my mind was that some very basic functions, no doubt written decades ago, could give such problems.
I suppose I would have seen the problem earlier if I'd had access to a screen dump of the actual array but, as my chip is connected to my LED cube, I've no way to access the data in the chip directly.
Hey ho! When I get the chance I'll post a link to a YouTube video showing the wave function that I've now been able to program; it looks pretty good on my LED cube.
Russell
ps
Thank you all so very much for your help here - it stopped me giving up completely by giving me some avenues to chase down. I certainly did not know much about endianness before this, so I learned about that and about some systematic ways to go about a robust debugging approach.
I have a problem when trying to access an array in an interrupt routine.
The following is a snippet of code from inside the ISroutine.
if (CubeStatusArray[x][y][Layer]) {
    for (int8_t bitpos = 7; bitpos >= 0; bitpos--) {
        if ((ColourWheel[Colour] >> 16) & (1 << bitpos)) {   // This line seems to cause trouble
            setHigh(SINRED_PORT, SINRED_PIN);
        }
        else {
            setLow(SINRED_PORT, SINRED_PIN);
        }
    }
}
..........
ColourWheel has been declared as follows at the start of my program (outside any function):
static volatile uint32_t ColourWheel[255];  // this is the array from which
                                            // the colours can be obtained -
                                            // all set as 3 eight-bit numbers
                                            // using up 24 bits of a 32-bit
                                            // unsigned int.
What this snippet of code does is take each bit of an eight-bit segment of the colour value and set the port/pin high or low accordingly, MSB first. (I then have some other code which updates a TLC5940 LED driver IC for each high/low on the pin, and the code goes on to handle the green and blue 8 bits in a similar way.)
This does not work, and the colours output to my LEDs behave incorrectly.
However, if I change the code as follows, then the routine works:
if (CubeStatusArray[x][y][Layer]) {
    for (int8_t bitpos = 7; bitpos >= 0; bitpos--) {
        if (0b00000000111111111110101010111110 >> 16) & (1 << bitpos)) {   // This line seems to cause trouble
            setHigh(SINRED_PORT, SINRED_PIN);
        }
        else {
            setLow(SINRED_PORT, SINRED_PIN);
        }
    }
}
..........
(The actual binary number in the line is irrelevant: the first 8 bits are always zero, the next 8 bits represent a red colour, the next 8 a green colour, and the last 8 a blue colour.)
So why does the ISR work with the fixed number but not if I try to use a number held in an array?
Following is the actual code showing the full RGB update:
if (CubeStatusArray[x][y][Layer]) {
    for (int8_t bitpos = 7; bitpos >= 0; bitpos--) {
        if ((ColourWheel[Colour] >> 16) & (1 << bitpos)) {
            setHigh(SINRED_PORT, SINRED_PIN);
        } else {
            setLow(SINRED_PORT, SINRED_PIN);
        }
        if ((ColourWheel[Colour] >> 8) & (1 << bitpos)) {
            setHigh(SINGREEN_PORT, SINGREEN_PIN);
        } else {
            setLow(SINGREEN_PORT, SINGREEN_PIN);
        }
        if (ColourWheel[Colour] & (1 << bitpos)) {
            setHigh(SINBLUE_PORT, SINBLUE_PIN);
        } else {
            setLow(SINBLUE_PORT, SINBLUE_PIN);
        }
        pulse(SCLK_PORT, SCLK_PIN);
        pulse(GSCLK_PORT, GSCLK_PIN);
        Data_Counter++;
        GSCLK_Counter++;
    }
}

I assume the missing ( after if is a typo.
In the absence of a debugger, the suggested research technique is:
1. Confirm one more time that the test if( ( 0b00000000111111111110101010111110 >> 16 ) & ( 1 << bitpos ) ) works. Collect (print) the result for each bitpos.
2. Store 0b00000000111111111110101010111110 in element 0 of the array. Repeat with if( ( ColourWheel[0] >> 16 ) & ( 1 << bitpos ) ). Collect results and compare with the base case.
3. Store 0b00000000111111111110101010111110 in all elements of the array. Repeat with if( ( ColourWheel[Colour] >> 16 ) & ( 1 << bitpos ) ) for several different Colour values (assigned manually, though). Collect results and compare with the base case.
4. Store 0b00000000111111111110101010111110 in all elements of the array. Repeat with if( ( ColourWheel[Colour] >> 16 ) & ( 1 << bitpos ) ) with Colour assigned normally. Collect results and compare with the base case.
5. Revert to the original program and retest. Collect results and compare with the base case.
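As a sketch of step 1 (assuming an Arduino/MPIDE-style Serial is available, which the question does not confirm, and printing from the main loop rather than the ISR):

// Collect the base-case bit pattern for the red byte, MSB first.
// 'Serial' is an assumption based on the MPIDE environment.
const uint32_t test = 0b00000000111111111110101010111110;
for (int8_t bitpos = 7; bitpos >= 0; bitpos--) {
    Serial.print(((test >> 16) & (1UL << bitpos)) ? '1' : '0');
}
Serial.println();   // expected output: 11111111 (the red byte)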

I am confident that the value in ColourWheel[Colour] is either not what you expect or is unstable. Validate the index range and access the array only once. A code speed enhancement is included.
[Edit] If the receiving end does not like the slower signal changes caused by replacing a constant with ColourWheel[Colour] >> 16, more efficient code may solve this:
if (CubeStatusArray[x][y][Layer]) {
    uint32_t value = 0;
    uint32_t maskR = 0x800000UL;   // bit 23: MSB of the red byte
    uint32_t maskG = 0x8000UL;     // bit 15: MSB of the green byte
    uint32_t maskB = 0x80UL;       // bit 7:  MSB of the blue byte
    if ((Colour >= 0) && (Colour < 255)) {
        value = ColourWheel[Colour];   // read the volatile array exactly once
    }
    // All you need to do now is shift 'value'
    for (int8_t bitpos = 7; bitpos >= 0; bitpos--) {
        if (value & maskR) {           // set red
            setHigh(SINRED_PORT, SINRED_PIN);
        } else {
            setLow(SINRED_PORT, SINRED_PIN);
        }
        if (value & maskG) {           // set green
            setHigh(SINGREEN_PORT, SINGREEN_PIN);
        } else {
            setLow(SINGREEN_PORT, SINGREEN_PIN);
        }
        if (value & maskB) {           // set blue
            setHigh(SINBLUE_PORT, SINBLUE_PIN);
        } else {
            setLow(SINBLUE_PORT, SINBLUE_PIN);
        }
        value <<= 1;                   // next lower bit moves under each mask
    }
}

Related

Determining if a 16 bit binary number is negative or positive

I'm creating a library for a temperature sensor that returns a 16-bit binary value. I'm trying to find the best way to check whether that returned value is negative or positive. I'm curious whether I can check if the most significant bit is a 1 or a 0 and, if that would be the best way to go about it, how to implement it successfully.
I know that I can convert it to decimal and check that way, but I was curious whether there is an easier way. I've seen it implemented with shifting values but I don't fully understand that method. (I'm super new to C++.)
float TMP117::readTempC(void)
{
    int16_t digitalTemp;                     // Temperature stored in the TMP117 register
    digitalTemp = readRegister(TEMP_RESULT); // Reads the temperature from the sensor
    // Check if the value is a negative number
    /* Insert code to check here */
    // Returns the digital temperature value multiplied by the resolution
    // Resolution = .0078125
    return digitalTemp * 0.0078125;
}
I'm not sure how to check if the code works, and I haven't been able to compile it and run it on the device because the new PCB design and sensor have not come in the mail yet.
I know that I can convert it to decimal and check that way
I am not sure what you mean. An integer is an integer; it is an arithmetic object, so you just compare it with zero:
if( digitalTemp < 0 )
{
    // negative
}
else
{
    // positive
}
You can, as you suggest, test the MSB, but there is no particular benefit: it lacks clarity, and it will break or need modification if the type of digitalTemp changes.
if( digitalTemp & 0x8000 )
{
    // negative
}
else
{
    // positive
}
"conversion to decimal", can only be interpreted as conversion to a decimal string representation of an integer, which does not make your task any simpler, and is entirely unnecessary.
I'm not sure how to check if the code works, and I haven't been able to compile it and run it on the device because the new PCB design and sensor have not come in the mail yet.
Compile and run it on a PC in a test harness with stubs for the hardware-dependent functions. Frankly, if you are new to C++, you are perhaps better off practising the fundamentals in a PC environment, with its generally better debug facilities and faster development/test iteration, in any case.
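A minimal sketch of such a harness, assuming nothing about the real TMP117 API beyond the names in the question (the stub's register address and return value are invented for the test):

#include <cstdint>
#include <cstdio>

static const int TEMP_RESULT = 0;   // placeholder register address (assumption)

// Stub standing in for the hardware read; returns a made-up raw reading.
int16_t readRegister(int reg)
{
    (void)reg;
    return -256;                    // -256 * 0.0078125 = -2.0 degrees C
}

float readTempC(void)               // free function here; a method in the real library
{
    int16_t digitalTemp = readRegister(TEMP_RESULT);
    return digitalTemp * 0.0078125f;
}

int main(void)
{
    printf("%f\n", readTempC());    // expect -2.000000
}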
In general
float TMP117::readTempC(void)
{
    int16_t digitalTemp;                     // Temperature stored in the TMP117 register
    digitalTemp = readRegister(TEMP_RESULT); // Reads the temperature from the sensor
    // Check if the value is a negative number
    if (digitalTemp < 0)
    {
        printf("Dang it is cold\n");
    }
    // Returns the digital temperature value multiplied by the resolution
    // Resolution = .0078125
    return digitalTemp * 0.0078125;
}

Improving performance for a TM simulator

I am trying to simulate a lot of 2-state, 3-symbol (one-direction tape) Turing machines. Each simulation will have a different input and will run for a fixed number of steps. The current bottleneck in the program seems to be the simulator, which takes a great deal of memory on Turing machines that do not halt.
The task is to simulate about 650000 TMs, each with about 200 non-blank inputs. The largest number of steps I am trying is 1 billion (10**9).
Below is the code I am running. vector<vector<int> > TM is a transition table.
vector<int> fast_simulate(vector<vector<int> > TM, string TM_input, int steps) {
    /* Return the state reached after the supplied steps */
    vector<int> tape = itotape(TM_input);
    int head = 0;
    int current_state = 0;
    int halt_state = 2;
    for(int i = 0; i < steps; i++){
        // Read from tape
        if(head >= tape.size()) {
            tape.push_back(2);
        }
        int cell = tape[head];
        int data = TM[current_state][cell]; // get transition for this state/input
        int move = data % 2;
        int write = (data % 10) % 3;
        current_state = data / 10;
        if(current_state == halt_state) {
            // This highlights the last place that is written to in the tape
            tape[head] = 4;
            vector<int> res = shorten_tape(tape);
            res.push_back(i+1);
            return res;
        }
        // Write to tape
        tape[head] = write;
        // Move head
        if(move == 0) {
            if(head != 0) {
                head--;
            }
        } else {
            head++;
        }
    }
    vector<int> res {-1};
    return res;
}

vector<int> itotape(string TM_input) {
    vector<int> tape;
    for(char &c : TM_input) {
        tape.push_back(c - '0');
    }
    return tape;
}

vector<int> shorten_tape(vector<int> tape) {
    /* Shorten the tape by removing unnecessary 2's (blanks) from the end of it. */
    int i = tape.size()-1;
    for(; i >= 0; --i) {
        if(tape[i] != 2) {
            tape.resize(i+1);
            return tape;
        }
    }
    return tape;
}
Is there anywhere I can make improvements in terms of performance or memory usage? Even a 2% decrease would make a noticeable difference.
Make sure no allocations happen during the whole TM simulation.
Preallocate a single global array at program startup which is big enough for any state of the tape (e.g. 10^8 elements). Put the machine at the beginning of this tape array initially. Maintain the segment [0; R] of all cells which were visited by the current machine simulation: this allows you to avoid clearing the whole tape array when you start a new simulation.
Use the smallest integer type for tape elements that suffices (e.g. use unsigned char if the alphabet surely has fewer than 256 characters). Perhaps you can even switch to bitsets if the alphabet is very small. This reduces the memory footprint and improves cache/RAM performance.
Avoid generic integer divisions in the innermost loop (they are slow); use only divisions by powers of two (they turn into bit shifts). As a final optimization, you may try to remove all branches from the innermost loop (there are various clever techniques for this). A sketch of the first two points is shown below.
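A minimal sketch of the preallocation and small-element-type advice (the tape size, the blank symbol value 2, and the helper name are assumptions carried over from the question):

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// One global tape, allocated once at startup; 2 encodes the blank symbol.
static std::vector<uint8_t> g_tape(100000000);
static std::size_t g_visited = 0;   // R: one past the rightmost cell touched

void begin_simulation(const std::string& TM_input)
{
    // Clear only the segment the previous run touched, not the whole array.
    std::fill(g_tape.begin(), g_tape.begin() + g_visited, uint8_t(2));
    for (std::size_t i = 0; i < TM_input.size(); ++i)
        g_tape[i] = uint8_t(TM_input[i] - '0');
    g_visited = TM_input.size();
    // During simulation, keep g_visited = max(g_visited, head + 1).
}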
Here is another answer with more algorithmic approaches.
Simulation by blocks
Since you have a tiny alphabet and a tiny number of states, you can accelerate the simulation by processing chunks of the tape at once. This is related to the well-known speedup theorem, although I suggest a slightly different method.
Divide the tape into blocks of 8 characters each. Each such block can be represented with a 16-bit number (2 bits per character). Now imagine that the machine is located either at the first or at the last character of a block. Then its subsequent behavior depends only on its initial state and the initial value of the block, until the TM moves out of the block (either to the left or to the right). We can precompute the outcome for all (block value + state + end) combinations, or perhaps lazily compute them during simulation.
This method can simulate about 8 steps at once, although if you are unlucky it can do only one step per iteration (moving back and forth around a block boundary). Here is the code sample:
//R = table[s][e][V] --- outcome for a TM which:
//  starts in state s
//  runs on a tape block with initial contents V
//  starts on the (e = 0: leftmost, e = 1: rightmost) char of the block
//The value R is a bitmask encoding:
//  0..15 bits: the new value of the block
//  16..17 bits: the new state
//  18 bit: TM moved to the (0: left, 1: right) of the block
//  ??encode number of steps taken??
uint32_t table[2][2][1<<16];
//contents of the tape (grouped in 8-character blocks)
uint16_t tape[...];

int pos = 0;    //index of current block
int end = 0;    //TM is currently located at (0: start, 1: end) of the block
int state = 0;  //current state
while (state != 2) {
    //take the outcome of simulation on the current block
    uint32_t res = table[state][end][tape[pos]];
    //decode it into parts
    uint16_t newValue = res & 0xFFFFU;
    int newState = (res >> 16) & 3U;
    int move = (res >> 18);
    //write new contents to the tape
    tape[pos] = newValue;
    //switch to the new state
    state = newState;
    //move to the neighboring block
    pos += (2*move-1);
    end = !move;
    //avoid getting out of tape on the left: clamp to the leftmost
    //char of block 0 (end = 0, since 'move' is no longer used here)
    if (pos < 0)
        pos = 0, end = 0;
}
Halting problem
The comment says that a TM simulation is expected either to finish very early or to run all the steps up to the predefined huge limit. Since you are going to simulate many Turing machines, it might be worth investing some time in solving the halting problem.
The first type of hang that can be detected is when the machine stays in the same place without moving far away from it. Let's maintain the surrounding of the TM during simulation, which is the values of the segment of characters at distance < 16 from the TM's current location. If you have 3 characters, you can encode a surrounding in a 62-bit number (31 characters at 2 bits each).
Maintain a hash table for each position of the TM (as we'll see, only 31 tables are necessary). After each step, store the tuple (state, surrounding) in the hash table of the current position. Now the important part: after each move, clear all hash tables at distance >= 16 from the TM (actually, only one such hash table has to be cleared). Before each step, check whether (state, surrounding) is already present in the hash table. If it is, then the machine is in an infinite loop. A sketch of this bookkeeping follows the next paragraph.
You can also detect another type of hang: when the machine moves to the right infinitely and never returns. To achieve that, you can use the same hash tables. If the TM is located at the currently last character of the tape, with index p, check the current tuple (state, surrounding) not only in the p-th hash table but also in the (p-1)-th, (p-2)-th, ..., (p-15)-th hash tables. If you find a match, then the TM is in an infinite loop moving to the right.
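A compact sketch of that bookkeeping (the packing layout and the modulo-31 table reuse are my reading of the scheme, not code from the answer):

#include <cstdint>
#include <unordered_set>

// 31 tables suffice: tables further than 15 cells away are cleared as the
// head moves, so positions can share tables modulo 31.
std::unordered_set<uint64_t> seen[31];

// Pack (state, surrounding) into one key: 31 cells * 2 bits = 62 bits for
// the surrounding, leaving 2 bits for the state.
inline uint64_t make_key(uint64_t surrounding, int state)
{
    return (surrounding << 2) | uint64_t(state & 3);
}

// Before each step at head position 'pos':
//   if (!seen[pos % 31].insert(make_key(surrounding, state)).second)
//       -> repeated configuration: infinite loop detected.
// After each move: seen[(pos + 16) % 31].clear();  // that table is now >= 16 away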
Change
int move = data % 2;
To
int move = data & 1;
One is a divide, the other is a bitmask; both give 0 or 1 based on the low bit. You can do this any time you take % by a power of two.
You're also setting
cell = tape[head];
data = TM[current_state][cell];
int move = data % 2;
int write = (data % 10) % 3;
current_state = data / 10;
Every single step, regardless of whether tape[head] has changed, and even on branches where you're not accessing those values at all. Take a careful look at which branches use which data, and only update things as they're needed. Note that straight after that you write:
if(current_state == halt_state) {
    // This highlights the last place that is written to in the tape
    tape[head] = 4;
    vector<int> res = shorten_tape(tape);
    res.push_back(i+1);
    return res;
}
This code doesn't reference "move" or "write", so you can put the calculation of "move"/"write" after it and only calculate them if current_state != halt_state.
Also, the true branch of an if statement is typically the better-predicted branch. By testing for not the halt state, and putting the halt handling in the else branch, you may improve CPU branch prediction a little.
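A hedged sketch of the restructured inner-loop body, combining both points (variable names and the 'data' encoding are kept as posted in the question; the tape-growth check is omitted for brevity):

// Hot path first: compute move/write only when the machine keeps running.
int cell = tape[head];
int data = TM[current_state][cell];
current_state = data / 10;
if (current_state != halt_state) {
    int move = data & 1;            // % 2 replaced by a bitmask
    int write = (data % 10) % 3;
    tape[head] = write;
    if (move == 0) {
        if (head != 0) head--;      // one-direction tape: clamp at 0
    } else {
        head++;
    }
} else {
    tape[head] = 4;                 // mark the last written cell
    vector<int> res = shorten_tape(tape);
    res.push_back(i + 1);
    return res;
}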

Bad optimization of std::fabs()?

Recently I was working with an application that had code similar to:
for (auto x = 0; x < width - 1 - left; ++x)
{
    // store / reset points
    temp = hPoint = 0;
    for (int channel = 0; channel < audioData.size(); channel++)
    {
        if (peakmode) /* fir rms of window size */
        {
            for (int z = 0; z < sizeFactor; z++)
            {
                temp += audioData[channel][x * sizeFactor + z + offset];
            }
            hPoint += temp / sizeFactor;
        }
        else /* highest sample in window */
        {
            for (int z = 0; z < sizeFactor; z++)
            {
                temp = audioData[channel][x * sizeFactor + z + offset];
                if (std::fabs(temp) > std::fabs(hPoint))
                    hPoint = temp;
            }
        }
        // .. some other code
    }
    // ... some more code
}
This is inside a graphical render loop, called some 50-100 times per second, with buffers of up to 192 kHz audio in multiple channels. So a lot of data runs through the innermost loops, and profiling showed this was a hotspot.
It occurred to me that one could cast the float to an integer and erase the sign bit, and cast it back using only temporaries. It looked something like this:
if ((const float &&)(*((int *)&temp) & ~0x80000000) > (const float &&)(*((int *)&hPoint) & ~0x80000000))
hPoint = temp;
This gave a 12x reduction in render time, while still producing the same, valid output. Note that everything in the audio data is sanitized beforehand to exclude NaNs/infs/denormals, and the range is limited to [-1, 1].
Are there any corner cases where this optimization will give wrong results - or, why is the standard library function not implemented like this? I presume it has to do with handling of non-normal values?
Edit: the layout of the floating-point model conforms to IEEE, and sizeof(float) == sizeof(int) == 4.
Well, you set the floating-point mode to IEEE conforming. Typically, with switches like -ffast-math, the compiler can ignore IEEE corner cases like NaN, INF and denormals. If the compiler also uses intrinsics, it can probably emit the same code.
BTW, if you're going to assume IEEE format, there's no need for the cast back to float prior to the comparison. The IEEE format is nifty: for all positive finite values, a<b if and only if reinterpret_cast<int_type>(a) < reinterpret_cast<int_type>(b)
It occurred to me that one could cast the float to an integer and erase the sign bit, and cast it back using only temporaries.
No, you can't, because this violates the strict aliasing rule.
Are there any corner cases where this optimization will give wrong results
Technically, this code results in undefined behavior, so it always gives wrong "results" - not in the sense that the result of the absolute value will always be unexpected or incorrect, but in the sense that you can't possibly reason about what a program does if it has undefined behavior.
or, why is the standard library function not implemented like this?
Your suspicion is justified: handling denormals and other exceptional values is tricky, and the stdlib function needs to take those into account; the other reason is still the undefined behavior.
One (non-)solution if you care about performance: instead of casts and pointers, you can use a union. Unfortunately, that only works in C, not in C++. It won't result in UB, but it's still not portable (although it will likely work on most, if not all, platforms with IEEE-754).
union {
    float f;
    unsigned u;
} pun = { .f = -3.14 };

pun.u &= ~0x80000000;
printf("abs(-pi) = %f\n", pun.f);
But, granted, this may or may not be faster than calling fabs(). Only one thing is sure: it won't always be correct.
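For completeness, a well-defined C++ variant is possible with std::memcpy, which compilers optimize to the same bit operation; this sketch is my addition, not part of the answer above:

#include <cstdint>
#include <cstring>

// Clears the sign bit via memcpy type-punning: no aliasing violation,
// and typically compiles down to a single AND.
inline float fast_fabs(float x)
{
    static_assert(sizeof(float) == sizeof(std::uint32_t), "IEEE-754 float assumed");
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    bits &= 0x7FFFFFFFu;            // sign bit is bit 31
    std::memcpy(&x, &bits, sizeof x);
    return x;
}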
You would expect fabs() to be implemented in hardware. There was an 8087 instruction for it in 1980 after all. You're not going to beat the hardware.
How the standard library implements it is implementation-dependent, so you may find different implementations of the standard library with different performance.
I imagine that you could have problems on platforms where int is not 32 bits. You'd better use int32_t (<cstdint>).
Out of curiosity, was std::fabs previously inlined? Or is the optimisation you observed mainly due to the suppression of the function call?
Some observations on how refactoring may improve performance:
as mentioned, x * sizeFactor + offset can be factored out of the inner loops
peakmode is actually a switch changing the function's behaviour - make two functions rather than testing the switch mid-loop. This has 2 benefits:
easier to maintain
fewer local variables and code paths to get in the way of optimisation.
The division of temp by sizeFactor can be deferred until outside the channel loop in the peakmode version.
abs(hPoint) can be pre-computed whenever hPoint is updated
if audioData is a vector of vectors you may get some performance benefit by taking a reference to audioData[channel] at the start of the body of the channel loop, reducing the array indexing within the z loop down to one dimension.
finally, apply whatever specific optimisations for the calculation of fabs you deem fit. Anything you do here will hurt portability, so it's a last resort. A sketch combining several of these points is shown after this list.
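A hedged sketch of the 'highest sample in window' branch with several of the above applied (names come from the question; 'chan', 'base' and 'absH' are locals I introduced for illustration):

// Hoist indexing out of the z loop and cache fabs(hPoint) incrementally.
const auto& chan = audioData[channel];      // one dimension of indexing hoisted
const int base = x * sizeFactor + offset;   // factored out of the z loop
float absH = std::fabs(hPoint);             // |hPoint| maintained as we go
for (int z = 0; z < sizeFactor; z++)
{
    const float t = chan[base + z];
    const float absT = std::fabs(t);
    if (absT > absH)
    {
        hPoint = t;
        absH = absT;
    }
}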
In VS2008, using the following to track the absolute value of hpoint, with hIsNeg remembering whether it is positive or negative, is about twice as fast as using fabs():
int hIsNeg = 0;
...
// Inside loop, replacing
//     if (std::fabs(temp) > std::fabs(hPoint))
//         hPoint = temp;
if( temp < 0 )
{
    if( -temp > hpoint )
    {
        hpoint = -temp;
        hIsNeg = 1;
    }
}
else
{
    if( temp > hpoint )
    {
        hpoint = temp;
        hIsNeg = 0;
    }
}
...
// After loop
if( hIsNeg )
    hpoint = -hpoint;

Faster bit reading?

In my application, 20% of the CPU time is spent reading bits (skip) through my bit reader. Does anyone have any idea how one might make the following code faster? At any given time, I do not need more than 20 valid bits (which is why I can, in some situations, use fast_skip).
Bits are read in big-endian order, which is why the byte swap is needed.
class bit_reader
{
    std::uint32_t* m_data;
    std::size_t m_pos;
    std::uint64_t m_block;

public:
    bit_reader(void* data)
        : m_data(reinterpret_cast<std::uint32_t*>(data))
        , m_pos(0)
        , m_block(_byteswap_uint64(*reinterpret_cast<std::uint64_t*>(data)))
    {
    }

    std::uint64_t value(std::size_t n_bits = 64)
    {
        assert(m_pos + n_bits < 64);
        return (m_block << m_pos) >> (64 - n_bits);
    }

    void skip(std::size_t n_bits) // 20% cpu time
    {
        assert(m_pos + n_bits < 42);
        m_pos += n_bits;
        if(m_pos > 31)
        {
            m_block = _byteswap_uint64(reinterpret_cast<std::uint64_t*>(++m_data)[0]);
            m_pos -= 32;
        }
    }

    void fast_skip(std::size_t n_bits)
    {
        assert(m_pos + n_bits < 42);
        m_pos += n_bits;
    }
};
Target hardware is x64.
I see from an earlier comment that you are unpacking Huffman/arithmetic-coded streams in JPEG.
skip() and value() are really simple enough to be inlined. There's a chance that the compiler will keep the shift register and buffer pointers in registers the whole time. Marking all pointers here and in the caller with the restrict modifier might help, by telling the compiler that you won't be writing the results of Huffman decoding into the bit buffer, thus allowing further optimisation.
The average length of each Huffman/arithmetic symbol is short, so about 7 times out of 8 you won't need to top up the 64-bit shift register. Investigate giving the compiler a branch-prediction hint.
It's unusual for any symbol in the JPEG bitstream to be longer than 32 bits. Does this allow further optimization?
One very logical reason that skip() is a heavy path is that you're calling it a lot. You are consuming an entire symbol at once rather than bit by bit, aren't you? There are some clever tricks you can do by counting leading 0s or 1s in symbols and using table lookups.
You might consider arranging your shift register such that the next bit in the stream is the LSB. This will avoid the shifts in value(); a sketch of this layout follows.
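A minimal sketch of the LSB-first idea (an illustration of the layout, not the asker's format; the refill logic is omitted):

// With the next unread bit kept in bit 0, extraction is a single mask.
// n_bits < 64 is assumed (the asker never needs more than 20 bits).
std::uint64_t value_lsb(std::uint64_t block, std::size_t n_bits)
{
    return block & ((std::uint64_t(1) << n_bits) - 1);  // no position-dependent shifts
}

// skip(n) then reduces to: block >>= n; plus refilling the vacated high bits.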
Shifting by 64 bits is definitely not a good idea. On many CPUs, shifting is a slow operation.
I would advise you to change your code to byte addressing. This will limit the shift to 8 bits maximum.
In many cases you do not really need the bit itself, but rather a check of whether it is set or not. This can be done with code like:
if (data[bit_inx / 64] & mask[bit_inx % 64])
{
    ....
}
Try substituting this line in skip:
m_block = (m_block << 32) | _byteswap_uint32(*++m_data);
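To make the substitution concrete, here is my reading of how it fits into the class from the question. One caveat that I am inferring, not quoting: because the constructor already consumed two 32-bit words, m_data has to start one word further along for ++m_data to fetch fresh data:

bit_reader(void* data)
    : m_data(reinterpret_cast<std::uint32_t*>(data) + 1)  // word 1, not word 0
    , m_pos(0)
    , m_block(_byteswap_uint64(*reinterpret_cast<std::uint64_t*>(data)))
{
}

void skip(std::size_t n_bits)
{
    assert(m_pos + n_bits < 42);
    m_pos += n_bits;
    if (m_pos > 31)
    {
        // aligned 32-bit load + swap instead of an unaligned 64-bit load
        m_block = (m_block << 32) | _byteswap_uint32(*++m_data);
        m_pos -= 32;
    }
}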
I don't know if this is the cause, or what the underlying implementation of _byteswap_uint64 looks like, but you should read Rob Pike's article on byte order. Maybe that's your answer.
Abstract: endianness is less of a problem than it's often made out to be, and implementations of byte-order swapping often come with issues. But there's a simple alternative.
[EDIT] I've got a better theory, pasted from my comment below:
Maybe it's alignment. 64-bit architectures love to align data on 64-bit boundaries; when you read across an alignment boundary, it gets pretty slow. So it could be the (++m_data)[0] part: x64 is 64-bit aligned, and when you reinterpret_cast a uint32_t* to uint64_t*, you cross an alignment boundary about half of the time.
If your source buffers are not huge, then you should pre-process them: byte-swap the buffers before you access them through the bit_reader!
Reading from your bit_reader will be much faster then, because:
you will save some conditional instructions
the CPU caches can be used more efficiently: the CPU can read straight from memory, which is most probably already loaded into cache, instead of reading from memory that is modified after each 64-bit chunk is read, which destroys the benefit of having had it in cache
EDIT
Oh wait, you do not modify the source buffer. However, moving the byte swap into a pre-processing stage should at least be worth a try.
Another point: make sure those assert() calls are compiled only into the debug version.
EDIT 2
(deleted)
EDIT 3
Your code is definitely flawed, check the following usage scenario:
uint32_t source[] = { 0x00112233, 0x44556677, 0x8899AABB, 0xCCDDEEFF };
bit_reader br(source); // -> m_block = 0x7766554433221100
// reading...
br.value(16); // -> 0x77665544
br.skip(16);
br.value(16); // -> 0x33221100
br.skip(16); // -> triggers reading more bits
// -> m_block = 0xBBAA998877665544, m_pos = 0
br.value(16); // -> 0xBBAA9988
br.skip(16);
br.value(16); // -> 0x77665544
// that's not what you expect, right ???
EDIT 4
Well, no, EDIT 3 was wrong, but I cannot help it: the code is flawed. Isn't it?
uint32_t source[] = { 0x00112233, 0x44556677, 0x8899AABB, 0xCCDDEEFF };
bit_reader br(source); // -> m_block = 0x7766554433221100
// reading...
br.value(16); // -> 0x7766
br.skip(16);
br.value(16); // -> 0x5544
br.skip(16); // -> triggers reading more bits (because m_pos=32, which is: m_pos>31)
// -> m_block = 0xBBAA998877665544, m_pos = 0
br.value(16); // -> 0xBBAA --> not what you expect, right?
Here is another version I tried, which didn't give any performance improvements.
class bit_reader
{
public:
    const std::uint64_t* m_data64;
    std::size_t m_pos64;
    std::uint64_t m_block0;
    std::uint64_t m_block1;

    bit_reader(const void* data)
        : m_pos64(0)
        , m_data64(reinterpret_cast<const std::uint64_t*>(data))
        , m_block0(byte_swap(*m_data64++))
        , m_block1(byte_swap(*m_data64++))
    {
    }

    std::uint64_t value(std::size_t n_bits = 64)
    {
        return __shiftleft128(m_block1, m_block0, m_pos64) >> (64 - n_bits);
    }

    void skip(std::size_t n_bits)
    {
        m_pos64 += n_bits;
        if(m_pos64 > 63)
        {
            m_block0 = m_block1;
            m_block1 = byte_swap(*m_data64++);
            m_pos64 -= 64;
        }
    }

    void fast_skip(std::size_t n_bits)
    {
        skip(n_bits);
    }
};
If possible, it would be best to do this in multiple passes: each pass can be optimized separately and branching is reduced.
In general it is best to do:
uint64_t* arr = reinterpret_cast<uint64_t*>(data);
for(uint64_t* i = arr; i != &arr[len / sizeof(uint64_t)]; i++)
{
    *i = _byteswap_uint64(*i);
    // no more operations here
}
// another similar for loop
Such code can reduce the run time by a huge factor.
At worst you can do it in runs of, say, 100k blocks, to keep cache misses to a minimum and load the data from RAM only once.
In your case you do it in a streaming way, which is good only for keeping memory usage low and getting faster responses from a slow data source, but not for speed.

Want to translate/typecast parts of a char array into values

I'm playing around with networking, and I've hit a bit of a roadblock with translating a packet of lots of data into the values I want.
Basically I've made a mock-up of what I'm expecting my packets to look like. Essentially a char (8-bit value) indicates what the message is, and that is detected by a switch statement, which then populates values based on the data after that 8-bit value. I'm expecting my packet to contain all sorts of messages, which may not be in order.
Eg, I may end up with the heartbeat at the end, or a string of text from a chat message, etc.
I just want to be able to say to my program: take the data from a certain point in the char array and typecast (if that's the term for it?) it into what I want it to be. What is a nice, easy way to do that?
char bufferIncoming[15];
ZeroMemory(bufferIncoming, 15);

// Making a mock packet
bufferIncoming[0] = 0x01;  // Heartbeat value
bufferIncoming[1] = 0x01;  // Heartbeat again just because I can
bufferIncoming[2] = 0x10;  // This should = 16 if it's just an 8-bit number
bufferIncoming[3] = 0x00;  // This
bufferIncoming[4] = 0x00;  // and this
bufferIncoming[5] = 0x00;  // and this
bufferIncoming[6] = 0x09;  // and this should equal 9 if it is a 32-bit number (int)
bufferIncoming[7] = 0x00;
bufferIncoming[8] = 0x00;
bufferIncoming[9] = 0x01;
bufferIncoming[10] = 0x00; // These 4 should be 256, I think, when combined into an unsigned int
// End of mock packet

int bufferSize = 15;       // Just an arbitrary value for now
int i = 0;
while (i < bufferSize)
{
    switch (bufferIncoming[i])
    {
        case 0x01: // Heart Beat
        {
            cout << "Heartbeat ";
        }
        break;
        case 0x10: // Player Data
        {
            // We've detected the byte indicating that the following 8 bytes
            // will be player data - in this case an X and Y position.
            playerPosition.X = ??????????; // How do I combine the 4 hex values for this?
            playerPosition.Y = ??????????;
        }
        break;
        default:
        {
            cout << ".";
        }
        break;
    }
    i++;
}
cout << " End of Packet\n";
UPDATE
Following Clairvoire's idea, I added the following:
playerPosition.X = long(bufferIncoming[3]) << 24 | long(bufferIncoming[4]) << 16 | long(bufferIncoming[5]) << 8 | long(bufferIncoming[6]);
Notice I changed around the shifting values.
Another important change was
unsigned char bufferIncoming[15]
If I didn't do that, I was getting negative values mixed into the combining of the elements (sign extension of a signed char). I don't know what the compiler was doing under the hood, but it was bloody annoying.
As you can imagine, this is not my preferred solution, but I'll give it a go. "Chad" has a good example of how I could have structured it, and a fellow programmer from work also recommended his implementation. But...
I have this feeling that there must be a faster, cleaner way of doing what I want. I've tried things like...
playerPosition.X = *(bufferIncoming + 4); // Only giving me the value of one byte, not the combination >_<
playerPosition.X = reinterpret_cast<unsigned long>(&bufferIncoming); // Some random number; I don't know what it was
...and a few other things, since deleted, that didn't work either. What I was expecting to do was point somewhere in that char buffer and say, "hey playerPosition, start reading from this position, and fill in your values based on the byte data there".
Such as maybe...
playerPosition = (playerPosition)bufferIncoming[5]; //Reads from this spot and fills in the 8 bytes worth of data
//or
playerPosition.Y = (playerPosition)bufferIncoming[9]; //Reads in the 4 bytes of values
...Why doesn't it work like that, or something similar?
There is probably a prettier version of this, but personally I would combine the four char variables using left shifts and ORs, like so:
playerPosition.X = long(buffer[0]) | long(buffer[1])<<8 | long(buffer[2])<<16 | long(buffer[3])<<24;
Endianness shouldn't be a concern, since bitwise logic is always executed the same way, with the lowest order on the right (like how the ones place is on the right for decimal numbers).
Edit: Endianness may become a factor depending on how the sending machine initially splits the integer up before sending it across the network. If it doesn't decompose the integer the same way it is recomposed using shifts, you may get a value where the first byte is last and the last byte is first. It's small ambiguities like these that prompt most people to use networking libraries, aha.
An example of splitting an integer using bitwise operations would look something like this:
buffer[0] = integer&0xFF;
buffer[1] = (integer>>8)&0xFF;
buffer[2] = (integer>>16)&0xFF;
buffer[3] = (integer>>24)&0xFF;
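As an aside on the endianness point (my addition, not part of this answer): the conventional approach is to send multi-byte values in network byte order and convert on receipt with ntohl, and memcpy avoids the alignment and aliasing pitfalls of raw pointer casts. A sketch, assuming a Winsock or POSIX socket environment and the buffer layout from the question:

#include <cstdint>
#include <cstring>
// Windows: #include <winsock2.h>  /  POSIX: #include <arpa/inet.h>

uint32_t raw;
std::memcpy(&raw, bufferIncoming + 3, sizeof raw); // safe unaligned read of bytes 3..6
playerPosition.X = ntohl(raw);                     // big-endian wire -> host order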
In a typical messaging protocol, the most straightforward way is to have a set of messages that you can easily cast. Using inheritance (or composition) along with byte-aligned structures (important when casting from a raw data pointer) can make this relatively easy:
// Message type values taken from the mock packet in the question.
enum MessageType { HeartBeat_MessageType = 0x01, PlayerData_MessageType = 0x10 };

struct Header
{
    unsigned char message_type_;
    unsigned long message_length_;
};

struct HeartBeat : public Header
{
    // no data, just a heartbeat
};

struct PlayerData : public Header
{
    unsigned long position_x_;
    unsigned long position_y_;
};

unsigned char* raw_message; // filled elsewhere

// reinterpret_cast is usually best avoided; however, in this particular
// case we are casting between two completely unrelated types, so it is
// necessary
Header* h = reinterpret_cast<Header*>(raw_message);
switch(h->message_type_)
{
    case HeartBeat_MessageType:
        break;
    case PlayerData_MessageType:
    {
        PlayerData* data = reinterpret_cast<PlayerData*>(h);
    }
    break;
}
I was talking to one of the programmers I know on Skype, and he showed me the solution I was looking for:
playerPosition.X = *(int*)(bufferIncoming+3);
I couldn't remember how to get it to work, or what it's called. But it all seems good now.
Thanks guys for helping out :)