Deinterleave audio data in varied bitrates - C++

I'm trying to write one function that can deinterleave 8/16/24/32-bit audio data, given that the audio data naturally arrives in an 8-bit buffer.
I have this working for 8-bit, and it works for 16/24/32, but only for the first channel (channel 0). I have tried so many + and * and other operators that I'm just guessing at this point. I cannot find the magic formula. I am using C++ but would also accept a memcpy into the vector if that's easiest.
Check out the code. If you change the Demux call to another bitrate you will see the problem. There is an easy math solution here, I am sure; I just cannot get it.
#include <vector>
#include <map>
#include <iostream>
#include <iomanip>
#include <string>
#include <string.h>
#include <cstdint> // for uint8_t

const int bitrate = 8;
const int channel_count = 5;
const int audio_size = bitrate * channel_count * 4;
uint8_t audio_ptr[audio_size];
const int bytes_per_channel = audio_size / channel_count;

void Demux(int bitrate){
    int byterate = bitrate/8;
    std::map<int, std::vector<uint8_t> > channel_audio;
    for(int i = 0; i < channel_count; i++){
        std::vector<uint8_t> audio;
        audio.reserve(bytes_per_channel);
        for(int x = 0; x < bytes_per_channel; x += byterate){
            for(int z = 0; z < byterate; z++){
                // What is the magic formula!
                audio.push_back(audio_ptr[(x * channel_count) + i + z]);
            }
        }
        channel_audio.insert(std::make_pair(i, audio));
    }

    int remapsize = 0;
    std::cout << "\nRemapped Audio";
    std::map<int, std::vector<uint8_t> >::iterator it;
    for(it = channel_audio.begin(); it != channel_audio.end(); ++it){
        std::cout << "\nChannel" << it->first << " ";
        std::vector<uint8_t> v = it->second;
        remapsize += v.size();
        for(size_t i = 0; i < v.size(); i++){
            std::cout << "0x" << std::hex << std::setfill('0') << std::setw(2) << +v[i] << " ";
            if(i && (i + 1) % 32 == 0){
                std::cout << std::endl;
            }
        }
    }
    std::cout << "Total remapped audio size is " << std::dec << remapsize << std::endl;
}

int main()
{
    // External data
    std::cout << "Raw Audio\n";
    for(int i = 0; i < audio_size; i++){
        audio_ptr[i] = i;
        std::cout << "0x" << std::hex << std::setfill('0') << std::setw(2) << +audio_ptr[i] << " ";
        if(i && (i + 1) % 32 == 0){
            std::cout << std::endl;
        }
    }
    std::cout << "Total raw audio size is " << std::dec << audio_size << std::endl;
    Demux(8);
    //Demux(16);
    //Demux(24);
    //Demux(32);
}

You're actually pretty close. But the code is confusing: specifically the variable names and what actual values they represent. As a result, you appear to be just guessing the math. So let's go back to square one and determine what exactly it is we need to do, and the math will very easily fall out of it.
First, just imagine we have one sample covering each of the five channels. This is called an audio frame for that sample. The frame looks like this:
[channel0][channel1][channel2][channel3][channel4]
The width of a sample in one channel is called byterate in your code, but I don't like that name. I'm going to call it bytes_per_sample instead. You can easily see the width of the entire frame is this:
int bytes_per_frame = bytes_per_sample * channel_count;
It should be equally obvious that to find the starting offset for channel c within a single frame, you multiply as follows:
int sample_offset_in_frame = bytes_per_sample * c;
That's just about all you need! The last bit is your z loop which covers each byte in a single sample for one channel. I don't know what z is supposed to represent, apart from being a random single-letter identifier you chose, but hey let's just keep it.
Putting all this together, you get the absolute offset of sample s in channel c and then you copy individual bytes out of it:
int sample_offset = bytes_per_frame * s + bytes_per_sample * c;

for (int z = 0; z < bytes_per_sample; ++z) {
    audio.push_back(audio_ptr[sample_offset + z]);
}
This does assume you're looping over the number of samples, not the number of bytes in your channel. So let's show all the loops for completeness' sake:
const int bytes_per_sample = bitrate / 8;
const int bytes_per_frame = bytes_per_sample * channel_count;
const int num_samples = audio_size / bytes_per_frame;

for (int c = 0; c < channel_count; ++c)
{
    int sample_offset = bytes_per_sample * c;
    for (int s = 0; s < num_samples; ++s)
    {
        for (int z = 0; z < bytes_per_sample; ++z)
        {
            audio.push_back(audio_ptr[sample_offset + z]);
        }
        // Skip to next frame
        sample_offset += bytes_per_frame;
    }
}
You'll see here that I split the math up so that it's doing fewer multiplications in the loops. This is mostly for readability, but it might also help a compiler understand what's happening when it tries to optimize. Concerns over optimization are secondary (and in your case, there are much more expensive worries going on with those vectors and the map).
The most important thing is you have readable code with reasonable variable names that makes logical sense.
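For reference, here is a minimal sketch of a corrected Demux with those loops dropped in, assuming your global audio_ptr, audio_size and channel_count, and keeping your map-of-vectors output; the printing part is elided:

void Demux(int bitrate) {
    const int bytes_per_sample = bitrate / 8;
    const int bytes_per_frame = bytes_per_sample * channel_count;
    const int num_samples = audio_size / bytes_per_frame;

    std::map<int, std::vector<uint8_t> > channel_audio;
    for (int c = 0; c < channel_count; ++c) {
        std::vector<uint8_t> audio;
        audio.reserve(num_samples * bytes_per_sample);
        int sample_offset = bytes_per_sample * c; // first sample of channel c
        for (int s = 0; s < num_samples; ++s) {
            for (int z = 0; z < bytes_per_sample; ++z) {
                audio.push_back(audio_ptr[sample_offset + z]);
            }
            sample_offset += bytes_per_frame; // same channel, next frame
        }
        channel_audio.insert(std::make_pair(c, audio));
    }
    // ... print/consume channel_audio exactly as in the original code ...
}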

When I try to decompose a double number into an array as elements, why do I get an offset preventing me from getting the last digit?

I have a problem. From a given number, I want to get each digit as an element of an array.
But if I extend the range one iteration beyond the size of the given number, Visual Studio reports a corrupted-data exception in Debug mode.
At first I thought that was because the int type is a 4-byte entity, since I used to get only one digit for numbers above 9999. But I noticed that my number starts one iteration too late... which makes it impossible to show the last digit.
If I add a zero to my given number, I can manually offset in the opposite direction, but that doesn't work with my original number.
But I can't figure out how to fix that... Here is my code.
Before asking for help, here is a screenshot explaining the principle used to convert the number into an array: [image: math theory formula]
I want to solve it with numeric types only, because the char type involves a different way of managing memory with buffers... which I don't really know how to handle right now.
Can someone help me finish debugging this, please?
#include <iostream>
#include <math.h>

//method to convert user number entry to array of digits
long long numToArray(double num, double arrDigits[], const long long n) {
    //instanciate variables
    //array of with m elements
    arrDigits[n];
    double* loopValue = new double(0);
    //extract the digits and store them into arrDigits array
    for (long long i = 0; i < n; i++) {
        long temp = 0;
        for (long k = 0; k < i + 1; k++) {
            //mathematical general formula
            temp += arrDigits[i - k] * pow(10, k);
            loopValue = new double(0);
            *loopValue = floor(num / pow(10, n - i)) - temp;
            arrDigits[i] = *loopValue;
        }
        std::cout << "digits array value at " << i << " is " << arrDigits[i] << " \n";
    }
    return 0;
}

//main program interacting with the user
int main()
{
    std::cout << "please type an integer: ";
    double num;
    const long long n = sizeof(num);
    double array[n]{};
    std::cin >> num;
    //call the method to test if all values are in the array
    numToArray(num, array, n);
    return 0;
}
Explaining the problem
Note: Visual Studio shows an error if I extend the range from n to n+1. If I leave the type as int or long, sizeof(num) is always 4...
So I had to make it a double and move it out of the main scope, which makes it... double...
For the people asking me to remove the pointer: the program will not run if I do so.
From a given number, I want to get each digit as an element of an array.
If you simply want to get each digit into an array, it takes only a few lines of code to convert the double to a string, remove the decimal point (if it exists), and then copy the string to a buffer:
#include <iostream>
#include <vector>
#include <string>
#include <sstream>
#include <algorithm>
#include <iterator>
#include <iomanip>

int main()
{
    double d = 1.45624234;
    std::ostringstream strm;
    strm << std::setprecision(12);

    // copy this to a string using the output stream
    strm << d;
    std::string s = strm.str();

    // remove the decimal point
    s.erase(std::remove(s.begin(), s.end(), '.'), s.end());

    // Now copy each digit to a buffer (in this case, vector)
    std::vector<int> v;
    std::transform(s.begin(), s.end(), std::back_inserter(v), [&](char ch) { return ch - '0'; });

    // output the results
    for (auto c : v)
        std::cout << c;
}
Output:
145624234
All of the work you were doing is already done for you by the standard library. In this case the overloaded operator << for double, when streamed to a buffer, creates the string. How does it do it? That is basically what your code is attempting to do, but safely and correctly.
Then it's just a matter of transforming each digit character into the actual integer that digit represents, and that is what std::transform does. Each digit character is copied to the vector by subtracting the character '0' from it.
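If you would rather stay with numeric types only, as the question mentions, a division/modulo loop avoids strings entirely. This is a minimal sketch (my addition, not part of the answer above) for a non-negative integer:

#include <iostream>
#include <vector>
#include <cstdint>

int main()
{
    uint64_t num = 145624234;
    std::vector<int> digits;
    // peel off the last digit with % 10, then drop it with / 10
    do {
        digits.insert(digits.begin(), static_cast<int>(num % 10));
        num /= 10;
    } while (num != 0);
    for (int d : digits)
        std::cout << d;
}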
#include <iostream>
#include <math.h>
#include <list>

int main()
{
    //Entry request of any natural integer within the range of double type
    std::cout << "Please type a natural integer from 1 to 99999999\n";
    double num;
    std::cin >> num;

    //counting the number of digits
    int count = 0;
    long long CountingNum = static_cast<long long>(num);
    while (CountingNum != 0) {
        CountingNum = CountingNum / 10;
        ++count;
    }
    std::cout << "number of digits compositing your natural integer: " << count << std::endl;

    //process the value for conversion to list of digits, so you can
    //access each digit by power and enhance your calculus operations
    double converternum = num * 10; //removing the right offset to keep the last digit
    const int containerSize = sizeof(double); //defining array constant size
    int sizeRescale = containerSize - count; //set general offset to handle according to the user entry
    double arrDigits[containerSize] = {}; //initialize array with a sufficient size.
    double* loopValue = new double(0); //define pointer variable to make the operation possible

    //extract the digits and store them into arrDigits array
    for (long long i = 0; i < containerSize; i++) {
        long temp = 0;
        for (long k = 0; k < i + 1; k++) {
            //mathematical general formula adapted to the computation
            temp += arrDigits[i - k] * pow(10, k);
            loopValue = new double(0); //reinitialize the pointer
            *loopValue = floor(converternum / pow(10, containerSize - i)) - temp; //assign the math formula to the pointer
            arrDigits[i] = *loopValue; //assign the formula for any i to the array relatively to k
        }
        std::cout << "digits array value at " << i << " is " << arrDigits[i] << " \n";
    }

    //convert array to a list
    std::list<double> listDigits(std::begin(arrDigits), std::end(arrDigits));

    //print the converted list
    std::cout << "array converted to list: ";
    for (double j : listDigits) {
        std::cout << j << " ";
    }
    std::cout << std::endl;

    //remove the zeros offset and resize the new converted list
    for (int j = 0; j < sizeRescale; j++) {
        listDigits.pop_front();
    }
    std::cout << "removed zero elements from the list\n";
    for (double i : listDigits) {
        std::cout << i << " ";
    }
    std::cout << "natural integer successfully converted into list digits data\n";
    return 0;
}
[image: an example in Debug mode in Visual Studio 2019]
I finally encapsulated the whole code into two functions, but I get an extra value at the first and last iterations...
The answer is almost complete; I just needed to solve the offset once the code was moved from main into its own function. I finally added a new array variable with the exact size I want, built from the two new functions, so we get an array that can be manipulated from here on.
#include <iostream>
#include <math.h>
#include <list>
#include <algorithm> // for std::copy

int CountNumberDigits(int num) {
    int count = 0;
    long long CountingNum = static_cast<long long>(num);
    while (CountingNum != 0) {
        CountingNum = CountingNum / 10;
        ++count;
    }
    return count;
}

double* NumToArray(double num) {
    double converternum = num * 10; //removing the right offset to keep the last digit
    const int containerSize = sizeof(double); //defining array constant size
    int sizeRescale = containerSize - CountNumberDigits(num); //set general offset to handle according to the user entry
    double arrDigits[containerSize] = {}; //initialize array with a sufficient size.
    double* loopValue = new double(0); //define pointer variable to make the operation possible

    //extract the digits and store them into arrDigits array
    for (long long i = 0; i < containerSize; i++) {
        long temp = 0;
        for (long k = 0; k < i + 1; k++) {
            //mathematical general formula adapted to the computation
            temp += arrDigits[i - k] * pow(10, k);
            loopValue = new double(0); //reinitialize the pointer
            *loopValue = floor(converternum / pow(10, containerSize - i)) - temp; //assign the math formula to the pointer
            arrDigits[i] = *loopValue; //assign the formula for any i to the array relatively to k
        }
    }

    //convert array to a list
    std::list<double> listDigits(std::begin(arrDigits), std::end(arrDigits));
    for (double j : listDigits) {
        std::cout << j << " ";
    }

    //remove the zeros offset and resize the new converted list
    for (int j = 0; j < sizeRescale; j++) {
        listDigits.pop_front();
    }

    //convert list to array
    double* arrOutput = new double[listDigits.size()]{};
    std::copy(listDigits.begin(), listDigits.end(), arrOutput);
    double* ptrResult = arrOutput;
    return ptrResult;
}

int main()
{
    //Entry request of any natural integer within the range of double type
    std::cout << "Please type a natural integer from 1 to 99999999\n";
    double num;
    std::cin >> num;
    int count = CountNumberDigits(num);
    std::cout << "number of digits compositing your natural integer: " << count << std::endl;
    double* ptrOutput = NumToArray(num);

    //reduce the array to the num size
    double* shrinkArray = new double[CountNumberDigits(num)];
    for (int i = 0; i < CountNumberDigits(num); i++) {
        *(shrinkArray + i) = ptrOutput[i];
        std::cout << *(shrinkArray + i) << " ";
    }
    return 0;
}

omp parallel for no optimization achieved for quadratic sieve

I am trying to implement a parallel quadratic sieve using OpenMP. In the sieving phase, I am using log approximations to check divisibility. This is my code.
#pragma omp parallel for schedule(dynamic) num_threads(4)
for (int i = 0; i < factorBase.size(); ++i) {
    const uint32_t p = factorBase[i];
    const float logp = std::log(factorBase[i]) / std::log(2);
    // Sieve first sequence.
    while (startIndex.first[i] < intervalEnd) {
        logApprox[startIndex.first[i] - intervalStart] -= logp;
        startIndex.first[i] += p;
    }
    if (p == 2)
        continue; // a^2 = N (mod 2) only has one root.
    // Sieve second sequence.
    while (startIndex.second[i] < intervalEnd) {
        logApprox[startIndex.second[i] - intervalStart] -= logp;
        startIndex.second[i] += p;
    }
}
Here factorBase and logApprox are std::vectors initialized as follows:
std::vector<float> logApprox(INTERVAL_LENGTH, 0);
std::vector<uint32_t> factorBase;
Whenever I run this code and compare the running times, there is not much difference between the sequential and parallel runs. What are some optimizations that can be done? I am a beginner in OpenMP and any help is appreciated. Thanks.
A very interesting task you have! Thanks!
I decided to make my own implementation with a great many optimizations.
I achieved a 20.4x speedup compared to your original code (your code takes 17.86 seconds, mine takes 0.87 seconds). I also used 2x less memory for sieving compared to your algorithm, while achieving the same goal.
To make the comparison I simplified your code in such a way that it still does almost the same thing and runs in exactly the same time, but looks much simpler:
#pragma omp parallel for
for (size_t i = 0; i < factorBase.size(); ++i) {
    auto const p = factorBase[i];
    float const logp = std::log(p) / std::log(2);
    while (startIndex[i] < logApprox.size()) {
        logApprox[startIndex[i]] += logp;
        startIndex[i] += p;
    }
}
You can see that I kept only a single sieve loop; the second one does the same thing and is not necessary for the demonstration, so I removed it. I also removed startInterval, as it is irrelevant to the speed demonstration. And for simplicity I did += of the logarithm instead of your -=.
One important thing to notice about your algorithm is that it doesn't do any synchronization, which means different CPU cores may write to the same entry of the logApprox array and hence produce a wrong result.
As I have measured, this wrong result happens once or twice per hundred million entries of the logApprox array. My optimized code overcame this limitation and does correct synchronization, besides doing all the speed optimizations.
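As a point of reference (my addition, not part of the optimized approach below), the simplest way to make the original loop race-free is an atomic update on the shared array; it is correct but usually slow, which is exactly why the block-based scheme below avoids sharing instead:

#pragma omp parallel for
for (size_t i = 0; i < factorBase.size(); ++i) {
    auto const p = factorBase[i];
    float const logp = std::log(p) / std::log(2);
    while (startIndex[i] < logApprox.size()) {
        // serialize concurrent += on the same entry
        #pragma omp atomic
        logApprox[startIndex[i]] += logp;
        startIndex[i] += p;
    }
}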
I did the following improvements to gain the 20x speedup:
1. I split the whole array into blocks, approximately 2^13 elements in size. Each group of blocks is processed by a separate thread/CPU core, hence no synchronization between threads is needed. Besides avoiding synchronization, what is very important is that a 2^13 block fits fully into the L1 or L2 cache of the CPU, which speeds things up a lot.
2. Each block of 2^13 is processed for all possible primes. To keep track of which offsets of which primes are needed, I created a special ring buffer of size 2^7. This ring buffer is indexed with the block number modulo 2^7 and keeps track of which primes with which offsets are needed for each block (modulo 2^7).
3. I have as many threads as there are CPU cores. For each thread I precompute the starting offsets of all primes for that thread; these starting offsets are computed through modular arithmetic based on the startIndex array that you provided in your original code.
4. To speed things up even more, instead of a float logarithm I use an integer logarithm based on uint16_t. This integer logarithm is computed as uint16_t integer_log = uint16_t(std::log2(p) * (1 << 8) + 0.5);. Besides increasing the speed of computing += for integer logarithms, they also decrease occupied memory 2x. If for some reason the uint16_t logarithm is not enough for you, then please replace using ILog2T = u16; with using ILog2T = u32; in my code, but this will double the amount of used memory.
My code outputs the following to the console:
time_simple 17.859 sec, time_optimized 0.874 sec, boost 20.434, correct_ratio 0.999999993
time_simple is the time of your original code for sieving an array of size 2^28; time_optimized is my code for the same array; boost is how much faster my code is (you can see it is 20x faster). correct_ratio says whether there are any errors in your code due to the absence of multi-core synchronization (as you can see, it is sometimes less than 1.0, hence there are some errors).
Full optimized code below:
Try it online!
#include <cstdint>
#include <cmath> // for std::log, std::log2, std::abs
#include <random>
#include <iostream>
#include <iomanip>
#include <chrono>
#include <thread>
#include <type_traits>
#include <vector>
#include <stdexcept>
#include <sstream>
#include <mutex>
#include <omp.h>

#define ASSERT_MSG(cond, msg) { if (!(cond)) throw std::runtime_error("Assertion (" #cond ") failed at line " + std::to_string(__LINE__) + "! Msg: '" + std::string(msg) + "'."); }
#define ASSERT(cond) ASSERT_MSG(cond, "")
#define OSTR(code) ([&]{ std::ostringstream ss; ss code; return ss.str(); }())
#define COUT(code) { std::unique_lock<std::mutex> lock(cout_mux); std::cout code; std::cout << std::flush; }
#define LN { COUT(<< "LN " << __LINE__ << std::endl); }
#define DUMP(var) { COUT(<< #var << " = (" << (var) << ")" << std::endl); }

using u16 = uint16_t;
using u32 = uint32_t;
using u64 = uint64_t;
using ILog2T = u16;
using PrimeT = u32;

std::mutex cout_mux;

template <typename T>
std::vector<T> GenPrimes(size_t end) {
    thread_local std::vector<T> primes = {2, 3};
    while (primes.back() < end) {
        for (T p = primes.back() + 2;; p += 2) {
            bool is_prime = true;
            for (auto d: primes) {
                if (u64(d) * d > p)
                    break;
                if (p % d == 0) {
                    is_prime = false;
                    break;
                }
            }
            if (is_prime) {
                primes.push_back(p);
                break;
            }
        }
    }
    primes.pop_back();
    return primes;
}

void SieveA(std::vector<float> & logApprox, std::vector<PrimeT> const & factorBase, std::vector<PrimeT> startIndex) {
    #pragma omp parallel for
    for (size_t i = 0; i < factorBase.size(); ++i) {
        auto const p = factorBase[i];
        float const logp = std::log(p) / std::log(2);
        while (startIndex[i] < logApprox.size()) {
            logApprox[startIndex[i]] += logp;
            startIndex[i] += p;
        }
    }
}

size_t NThreads() {
    //return 1;
    return std::thread::hardware_concurrency();
}

ILog2T LogToI(double x) { return ILog2T(x * (1ULL << (sizeof(ILog2T) * 8 - 8)) + 0.5); }
double IToLog(ILog2T x) { return x / double(1ULL << (sizeof(ILog2T) * 8 - 8)); }

double Time() {
    static auto const gtb = std::chrono::high_resolution_clock::now();
    return std::chrono::duration_cast<std::chrono::duration<double>>(
        std::chrono::high_resolution_clock::now() - gtb).count();
}

std::string FloatToStr(double x, size_t round = 6) {
    return OSTR(<< std::fixed << std::setprecision(round) << x);
}

double SieveB(std::vector<ILog2T> & logs, std::vector<PrimeT> const & primes, std::vector<PrimeT> const & starts0) {
    auto const nthr = NThreads();
    std::vector<std::vector<PrimeT>> starts(nthr, std::vector<PrimeT>(primes.size()));
    std::vector<std::vector<ILog2T>> plogs(nthr, std::vector<ILog2T>(primes.size()));
    std::vector<std::pair<u64, u64>> ranges(nthr);
    size_t constexpr block_log2 = 13, block = 1 << block_log2, ring_log2 = 6, ring_size = 1ULL << ring_log2, ring_mask = ring_size - 1;
    std::vector<std::vector<std::vector<std::pair<u32, u32>>>> ring(nthr, std::vector<std::vector<std::pair<u32, u32>>>(ring_size));
    #pragma omp parallel for
    for (size_t ithr = 0; ithr < nthr; ++ithr) {
        size_t const nblock = ((logs.size() + nthr - 1) / nthr + block - 1) / block * block,
            begin = ithr * nblock, end = std::min<size_t>(logs.size(), (ithr + 1) * nblock);
        ranges[ithr] = {begin, end};
        for (size_t i = 0; i < primes.size(); ++i) {
            PrimeT const p = primes[i];
            size_t const mod0 = begin % p, mod = starts0[i] < mod0 ? p + starts0[i] - mod0 : starts0[i] - mod0;
            starts[ithr][i] = mod;
            plogs[ithr][i] = LogToI(std::log2(p));
            ring[ithr][((begin + starts[ithr][i]) >> block_log2) & ring_mask].push_back({i, begin + starts[ithr][i]});
        }
    }
    auto tim = Time();
    #pragma omp parallel for
    for (size_t ithr = 0; ithr < nthr; ++ithr) {
        auto const [begin, end] = ranges[ithr];
        auto const [bbegin, bend] = std::make_tuple(begin / block, (end - 1) / block + 1);
        auto const & cstarts = starts.at(ithr);
        auto const & cplogs = plogs.at(ithr);
        auto & cring = ring[ithr];
        std::decay_t<decltype(cring[0])> tmp;
        size_t hit_cnt = 0, miss_cnt = 0;
        for (size_t iblock = bbegin; iblock < bend; ++iblock) {
            size_t const cbegin = iblock << block_log2, cend = std::min<size_t>(end, (iblock + 1) << block_log2);
            auto & ring_cur = cring[iblock & ring_mask];
            tmp = ring_cur;
            ring_cur.clear();
            for (auto [ip, off]: tmp)
                if (off >= cend) {
                    //++miss_cnt;
                    ring_cur.push_back({ip, off});
                } else {
                    //++hit_cnt;
                    auto const p = primes[ip];
                    auto const plog = cplogs[ip];
                    for (; off < cend; off += p) {
                        //if (8192 - 10 <= off && off <= 8192 + 10) COUT(<< "logs.size() " << logs.size() << " begin " << begin << " end " << end << " bbegin " << bbegin << " bend " << bend << " cbegin " << cbegin << " cend " << cend << " iblock " << iblock << " off " << off << " p " << p << " plog " << plog << std::endl);
                        logs[off] += plog;
                    }
                    if (off < end)
                        cring[(off >> block_log2) & ring_mask].push_back({ip, off});
                }
        }
        //COUT(<< "hit_ratio " << std::fixed << std::setprecision(6) << double(hit_cnt) / (hit_cnt + miss_cnt) << std::endl);
    }
    return Time() - tim;
}

void Test() {
    size_t constexpr len = 1ULL << 28;
    std::mt19937_64 rng{123};
    auto const primes = GenPrimes<PrimeT>(1 << 12);
    std::vector<PrimeT> starts;
    for (auto p: primes)
        starts.push_back(rng() % p);
    ASSERT(primes.size() == starts.size());
    double tA = 0, tB = 0;
    std::vector<float> logsA(len);
    std::vector<ILog2T> logsB(len);
    {
        tA = Time();
        SieveA(logsA, primes, starts);
        tA = Time() - tA;
    }
    {
        tB = SieveB(logsB, primes, starts);
    }
    size_t correct = 0;
    for (size_t i = 0; i < len; ++i) {
        //ASSERT_MSG(std::abs(logsA[i] - IToLog(logsB[i])) < 0.1, "i " + std::to_string(i) + " logA " + FloatToStr(logsA[i], 3) + " logB " + FloatToStr(IToLog(logsB[i]), 3));
        if (std::abs(logsA[i] - IToLog(logsB[i])) < 0.1)
            ++correct;
    }
    std::cout << std::fixed << std::setprecision(3) << "time_simple " << tA << " sec, time_optimized " << tB << " sec, boost " << (tA / tB) << ", correct_ratio " << std::setprecision(9) << double(correct) / len << std::endl;
}

int main() {
    try {
        omp_set_num_threads(NThreads());
        Test();
        return 0;
    } catch (std::exception const & ex) {
        std::cout << "Exception: " << ex.what() << std::endl;
        return -1;
    }
}
Output:
time_simple 17.859 sec, time_optimized 0.874 sec, boost 20.434, correct_ratio 0.999999993
In my opinion, you should switch the schedule to static and give it a chunk size (https://software.intel.com/en-us/articles/openmp-loop-scheduling).
A small optimization would be:
outside of the big for loop, declare a const and initialize it to 1/std::log(2); then inside the for loop, instead of dividing by std::log(2), multiply by that const. Division is very expensive in CPU cycles.
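Applied to the loop from the question, those two suggestions might look like the sketch below; the chunk size of 16 is only an illustrative guess, and only the first sieve loop is shown:

const float inv_log2 = 1.0f / std::log(2.0f); // hoisted out of the loop

#pragma omp parallel for schedule(static, 16) num_threads(4)
for (int i = 0; i < (int)factorBase.size(); ++i) {
    const uint32_t p = factorBase[i];
    const float logp = std::log((float)p) * inv_log2; // multiply instead of divide
    while (startIndex.first[i] < intervalEnd) {
        logApprox[startIndex.first[i] - intervalStart] -= logp;
        startIndex.first[i] += p;
    }
}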

Looking for a sequence number in a series of char* buffers?

I am receiving a stream of const char* msg types of a certain size_t len. At some byte offset in there, there is a sequence number (32- or 64-bit, I'm not sure which), so my idea was to do the following every time I get one of the msg things:
for (int i = 0; i < 30; ++i)
{
    uint32_t seq = *(uint32_t*) msg[i];
    cout << "seq" << i << " " << seq << endl;
}
//and similar for 64 bits
so that afterwards I can group the lines with the same offset and see which offset i is giving me sequential-looking output. The problem with this is that I segfault, with stuff like:
(gdb) p *(uint32_t*) msg[i]
Cannot access memory at address 0x2d
How can I carry out my little search idea for the sequence numbers?
Try:
uint32_t seq = *(uint32_t*) &msg[i];
and
(gdb) p *(uint32_t*)&msg[i]
EDIT: A bigger change, which is potentially more portable is:
uint32_t seq;
memcpy(&seq, msg + i, sizeof(seq));
seq = ntohl(seq);
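Putting that together with the offset scan from the question, a sketch might look like this (the function name scan_offsets is mine; <arpa/inet.h> is the POSIX header for ntohl, on Windows use <winsock2.h>):

#include <cstdint>      // uint32_t
#include <cstring>      // std::memcpy
#include <iostream>
#include <arpa/inet.h>  // ntohl

void scan_offsets(const char* msg, size_t len)
{
    // try every candidate byte offset for a 32-bit sequence number
    for (size_t i = 0; i + sizeof(uint32_t) <= len && i < 30; ++i) {
        uint32_t seq;
        std::memcpy(&seq, msg + i, sizeof(seq)); // no unaligned dereference
        seq = ntohl(seq);                        // assuming network byte order
        std::cout << "seq" << i << " " << seq << "\n";
    }
}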
char msg[30];
for ( int i = 0; i < 30; i++ )
    msg[i] = '\0';

char *iter_p = NULL;
iter_p = msg;
int i = 0;
while ( iter_p < &msg[30] ) {
    uint32_t seq = *(uint32_t *)iter_p;
    cout << "seq" << i << " " << seq << endl;
    iter_p += 4;
    i++;
}
Try iterating through it like this, stepping an iterator pointer through. =)
iter_p += 4 --> steps 32 bits (4 bytes), since iter_p is a char pointer.
That's not how you convert bytes to an int; you are trying to dereference a pointer to a location in memory that doesn't exist. Try something like this: http://www.cplusplus.com/forum/beginner/3076/
You are making a simple mistake: msg[i] returns the VALUE of the char at position i. To get its address you should use msg + i or &msg[i].
But this code is not portable to architectures that can't read an unaligned word.
The best way to read an unaligned word is to use a packed structure:
#pragma pack(1)
struct Header {
    uint32_t seq;
};
#pragma pack()

for (int i = 0; i < 30; ++i)
{
    const Header *h = (const Header *)(msg + i);
    cout << "seq" << i << " " << htonl(h->seq) << endl;
}
Pay attention to the endianness issue and the htonl call.

Primitive types slower than user types in C++?

I was curious, so I did a little benchmark to determine the performance delta between primitive types, such as int or float, and user types.
I created a template class Var with some inline arithmetic operators. The test consisted of timing this loop for both the primitive and the Var vectors:
for (unsigned i = 0; i < 1000; ++i) {
    in1[i] = i;
    in2[i] = -i;
    out[i] = (i % 2) ? in1[i] + in2[i] : in2[i] - in1[i];
}
I was quite surprised by the results: it turns out my Var class is faster most of the time. With int, on average, that loop took about 5700 nsec less with the class. Out of 3000 runs, int was faster 11 times vs. Var, which was faster 2989 times. The results are similar with float, where Var is 15100 nsec faster than float in 2991 of the runs.
Shouldn't primitive types be faster?
Edit: The compiler is a rather ancient MinGW 4.4.0; build options are the Qt Creator defaults, no optimizations:
qmake call: qmake.exe C:\...\untitled15.pro -r -spec win32-g++ "CONFIG+=release"
OK, posting the full source. The platform is 64-bit Win7, 4 GB DDR2-800, Core2Duo @ 3 GHz.
#include <QTextStream>
#include <QVector>
#include <QElapsedTimer>

template<typename T>
class Var {
public:
    Var() {}
    Var(T val) : var(val) {}
    inline T operator+(Var& other)
    {
        return var + other.value();
    }
    inline T operator-(Var& other)
    {
        return var - other.value();
    }
    inline T operator+(T& other)
    {
        return var + other;
    }
    inline T operator-(T& other)
    {
        return var - other;
    }
    inline void operator=(T& other)
    {
        var = other;
    }
    inline T& value()
    {
        return var;
    }
private:
    T var;
};

int main()
{
    QTextStream cout(stdout);
    QElapsedTimer timer;
    unsigned count = 1000000;
    QVector<double> pin1(count), pin2(count), pout(count);
    QVector<Var<double> > vin1(count), vin2(count), vout(count);
    unsigned t1, t2, pAcc = 0, vAcc = 0, repeat = 10, pcount = 0, vcount = 0, ecount = 0;
    for (int cc = 0; cc < 5; ++cc)
    {
        for (unsigned c = 0; c < repeat; ++c)
        {
            timer.restart();
            for (unsigned i = 0; i < count; ++i)
            {
                pin1[i] = i;
                pin2[i] = -i;
                pout[i] = (i % 2) ? pin1[i] + pin2[i] : pin2[i] - pin1[i];
            }
            t1 = timer.nsecsElapsed();
            cout << t1 << endl;
            timer.restart();
            for (unsigned i = 0; i < count; ++i)
            {
                vin1[i] = i;
                vin2[i] = -i;
                vout[i] = (i % 2) ? vin1[i] + vin2[i] : vin2[i] - vin1[i];
            }
            t2 = timer.nsecsElapsed();
            cout << t2 << endl;
            pAcc += t1;
            vAcc += t2;
        }
        pAcc /= repeat;
        vAcc /= repeat;
        if (pAcc < vAcc) {
            cout << "primitive was faster" << endl;
            pcount++;
        }
        else if (pAcc > vAcc) {
            cout << "var was faster" << endl;
            vcount++;
        }
        else {
            cout << "amazingly, both are equally fast" << endl;
            ecount++;
        }
        cout << "Average for primitive type is " << pAcc << ", average for Var is " << vAcc << endl;
    }
    cout << "int was faster " << pcount << " times, var was faster " << vcount << " times, equal " << ecount << " times, " << pcount + vcount + ecount << " times ran total" << endl;
}
In relative terms, with floats the Var class is 6-7% faster than the primitive; with ints, about 3%.
I also ran the test with a vector length of 10,000,000 instead of the original 1000, and the results are still consistent and in favor of the class.
With QVector replaced by std::vector, at the -O2 optimization level, the code generated by GCC for the two types is exactly the same, instruction for instruction.
Without the replacement, the generated code is different, but that's hardly surprising, considering that QVector is implemented differently for primitive and non-primitive types (look for QTypeInfo<T>::isComplex in qvector.h).
Update: It looks like isComplex does not affect the inner loop, i.e. the measured part. The loop code still differs for the two types, albeit very slightly. It looks like the difference is due to GCC.
I benchmarked running time and memory allocation for QVector and float*, with very little difference between the two.

How to put bit sequence into bytes (C/C++)

I have a couple of integers, for example (in binary representation):
00001000, 01111111, 10000000, 00000001
and I need to put them in sequence into an array of bytes (chars), without the leading zeros, like so:
10001111 11110000 0001000
I understand that it must be done by bit shifting with <<, >> and using binary OR |. But I can't find the correct algorithm; can you suggest the best approach?
The integers I need to put there are unsigned long long ints, so the length of one can be anywhere from 1 bit to 8 bytes (64 bits).
You could use a std::bitset:
#include <bitset>
#include <iostream>

int main() {
    unsigned i = 242122534;
    std::bitset<sizeof(i) * 8> bits;
    bits = i;
    std::cout << bits.to_string() << "\n";
}
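The question also asks for the leading zeros to be dropped; a minimal follow-up (my addition, not part of the answer above) trims the string before use:

std::string s = bits.to_string();
size_t first = s.find('1');
// keep a single "0" if the value was zero
s = (first == std::string::npos) ? "0" : s.substr(first);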
There are doubtless other ways of doing it, but I would probably go with the simplest:
std::vector<unsigned char> integers; // Has your list of bytes
integers.push_back(0x02);
integers.push_back(0xFF);
integers.push_back(0x00);
integers.push_back(0x10);
integers.push_back(0x01);

std::string str; // Will have your resulting string
for(unsigned int i=0; i < integers.size(); i++)
    for(int j=0; j<8; j++)
        str += ((integers[i]<<j) & 0x80 ? "1" : "0");

std::cout << str << "\n";

size_t begin = str.find("1");
if(begin > 0) str.erase(0,begin);
std::cout << str << "\n";
I wrote this up before you mentioned that you were using long ints or whatnot, but that doesn't actually change very much of this. The mask needs to change, and the j loop variable, but otherwise the above should work.
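For what it's worth, a sketch of that 64-bit adaptation (my reading of "the mask needs to change, and the j loop variable"):

std::vector<unsigned long long> integers = {0x08, 0x7F, 0x80, 0x01};
std::string str;
for (size_t i = 0; i < integers.size(); i++)
    for (int j = 0; j < 64; j++)                                          // j now covers 64 bits
        str += ((integers[i] << j) & 0x8000000000000000ULL ? "1" : "0");  // 64-bit mask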
Convert them to strings, then erase all leading zeros:

#include <iostream>
#include <sstream>
#include <string>
#include <cstdint>

std::string to_bin(uint64_t v)
{
    std::stringstream ss;
    for(size_t x = 0; x < 64; ++x)
    {
        if(v & 0x8000000000000000)
            ss << "1";
        else
            ss << "0";
        v <<= 1;
    }
    return ss.str();
}

void trim_right(std::string& in)
{
    size_t non_zero = in.find_first_not_of("0");
    if(std::string::npos != non_zero)
        in.erase(in.begin(), in.begin() + non_zero);
    else
    {
        // no 1 in data set, what to do?
        in = "<no data>";
    }
}

int main()
{
    uint64_t v1 = 437148234;
    uint64_t v2 = 1;
    uint64_t v3 = 0;
    std::string v1s = to_bin(v1);
    std::string v2s = to_bin(v2);
    std::string v3s = to_bin(v3);
    trim_right(v1s);
    trim_right(v2s);
    trim_right(v3s);
    std::cout << v1s << "\n"
              << v2s << "\n"
              << v3s << "\n";
    return 0;
}
A simple approach would be having the "current byte" (acc in the following), the associated number of used bits in it (bitcount) and a vector of fully processed bytes (output):

int acc = 0;
int bitcount = 0;
std::vector<unsigned char> output;

void writeBits(int size, unsigned long long x)
{
    while (size > 0)
    {
        // sz = How many bits we're about to copy
        int sz = size;
        // max avail space in acc
        if (sz > 8 - bitcount) sz = 8 - bitcount;
        // get the bits
        acc |= ((x >> (size - sz)) << (8 - bitcount - sz));
        // zero them off in x (1ULL so the shift is done in 64 bits)
        x &= (1ULL << (size - sz)) - 1;
        // acc got bigger and x got smaller
        bitcount += sz;
        size -= sz;
        if (bitcount == 8)
        {
            // got a full byte!
            output.push_back(acc);
            acc = bitcount = 0;
        }
    }
}

void writeNumber(unsigned long long x)
{
    // How big is it?
    int size = 0;
    while (size < 64 && x >= (1ULL << size))
        size++;
    writeBits(size, x);
}
Note that at the end of the processing you should check if there are any bits still in the accumulator (bitcount > 0), and in that case you should flush them by doing an output.push_back(acc);.
Note also that if speed is an issue, then using a bigger accumulator is probably a good idea (however, the output will then depend on machine endianness), and also that discovering how many bits are used in a number can be made much faster than a linear search in C++ (for example, x86 has a special machine-language instruction, BSR, dedicated to this).
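A quick usage sketch with the same globals, feeding in the question's four example values and flushing the accumulator at the end (padding the final partial byte with zeros on the right is my assumption):

#include <iostream>

int main()
{
    // the four example integers from the question
    writeNumber(0x08); // 00001000
    writeNumber(0x7F); // 01111111
    writeNumber(0x80); // 10000000
    writeNumber(0x01); // 00000001
    // flush any bits still sitting in the accumulator
    if (bitcount > 0)
        output.push_back(acc);
    // print the packed bytes in hex
    for (unsigned char b : output)
        std::cout << std::hex << int(b) << " ";
    std::cout << "\n";
}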