Persistent Byte Output Counter in C++ [closed]

Closed 5 years ago. This question needs to be more focused; it is not currently accepting answers.
I am working on a project in which I need to keep track of how many bytes the software outputs.
The software will be turned on and off occasionally, so I must persist the running byte count in a way that an administrator or another user cannot simply open the file and change the stored number.
What is the best way to implement this?
Also, I am not able to use any libraries (ex: boost).

Here is an example using the Tiny Encryption Algorithm (https://en.wikipedia.org/wiki/Tiny_Encryption_Algorithm):
#include <cstdint>   // uint32_t (missing in the original listing)
#include <fstream>
#include <iostream>

void encrypt(uint32_t* v, uint32_t* k) {
    uint32_t v0 = v[0], v1 = v[1], sum = 0, i;           /* set up */
    uint32_t delta = 0x9e3779b9;                         /* a key schedule constant */
    uint32_t k0 = k[0], k1 = k[1], k2 = k[2], k3 = k[3]; /* cache key */
    for (i = 0; i < 32; i++) {                           /* basic cycle start */
        sum += delta;
        v0 += ((v1 << 4) + k0) ^ (v1 + sum) ^ ((v1 >> 5) + k1);
        v1 += ((v0 << 4) + k2) ^ (v0 + sum) ^ ((v0 >> 5) + k3);
    }                                                    /* end cycle */
    v[0] = v0; v[1] = v1;
}

void decrypt(uint32_t* v, uint32_t* k) {
    uint32_t v0 = v[0], v1 = v[1], sum = 0xC6EF3720, i;  /* set up */
    uint32_t delta = 0x9e3779b9;                         /* a key schedule constant */
    uint32_t k0 = k[0], k1 = k[1], k2 = k[2], k3 = k[3]; /* cache key */
    for (i = 0; i < 32; i++) {                           /* basic cycle start */
        v1 -= ((v0 << 4) + k2) ^ (v0 + sum) ^ ((v0 >> 5) + k3);
        v0 -= ((v1 << 4) + k0) ^ (v1 + sum) ^ ((v1 >> 5) + k1);
        sum -= delta;
    }                                                    /* end cycle */
    v[0] = v0; v[1] = v1;
}

int main()
{
    uint32_t k[4] = { 123, 456, 789, 10 }; // key
    uint32_t v[2] = { 1000000, 1000000 };  // data: the counter, stored twice
    // save into file
    std::ofstream ofs("save.dat", std::ios::binary);
    encrypt(v, k);
    ofs << v[0] << " " << v[1] << std::endl;
    ofs.close(); // flush and close before reading the file back
    // read from file
    std::ifstream ifs("save.dat", std::ios::binary);
    uint32_t v2[2];
    if (ifs >> v2[0] >> v2[1])
    {
        std::cout << "Filedata: " << v2[0] << " " << v2[1] << std::endl;
        decrypt(v2, k);
        if (v2[0] == v2[1])
            std::cout << "Decrypted: " << v2[0] << std::endl;
        else
            std::cout << "Data was tampered with!" << std::endl;
    }
}
http://coliru.stacked-crooked.com/view?id=d725bf798ff8ca12
Works pretty well and doesn't need any library. This is low-level protection, but it should be hard enough to discourage casual tampering by your users.
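For completeness, here is a compact, self-contained sketch of that persistence scheme: the counter is stored twice inside one TEA block, so after decryption the two halves must match. The function names and key values here are placeholders of mine, not part of the original answer.

```cpp
#include <cstdint>

// TEA block cipher, as in the answer above: v = one 64-bit block, k = 128-bit key.
void tea_encrypt(uint32_t* v, const uint32_t* k) {
    uint32_t v0 = v[0], v1 = v[1], sum = 0;
    const uint32_t delta = 0x9e3779b9;
    for (int i = 0; i < 32; i++) {
        sum += delta;
        v0 += ((v1 << 4) + k[0]) ^ (v1 + sum) ^ ((v1 >> 5) + k[1]);
        v1 += ((v0 << 4) + k[2]) ^ (v0 + sum) ^ ((v0 >> 5) + k[3]);
    }
    v[0] = v0; v[1] = v1;
}

void tea_decrypt(uint32_t* v, const uint32_t* k) {
    uint32_t v0 = v[0], v1 = v[1], sum = 0xC6EF3720; // delta * 32, wrapped mod 2^32
    const uint32_t delta = 0x9e3779b9;
    for (int i = 0; i < 32; i++) {
        v1 -= ((v0 << 4) + k[2]) ^ (v0 + sum) ^ ((v0 >> 5) + k[3]);
        v0 -= ((v1 << 4) + k[0]) ^ (v1 + sum) ^ ((v1 >> 5) + k[1]);
        sum -= delta;
    }
    v[0] = v0; v[1] = v1;
}

// Decrypt a stored block and report whether the duplicated counter halves agree.
bool counter_intact(const uint32_t stored[2], const uint32_t* k, uint32_t* countOut) {
    uint32_t tmp[2] = { stored[0], stored[1] };
    tea_decrypt(tmp, k);
    *countOut = tmp[0];
    return tmp[0] == tmp[1];
}
```

On load, refuse the value whenever `counter_intact` returns false; on save, write the count into both halves before encrypting.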

Related

Is there a fast way to convert a string of 8 ASCII decimal digits into a binary number?

Consider eight digit characters, such as "12345678", stored as a string. They can be converted to a number where every byte contains one digit, like this:
const char* const str = "12345678";
const char* const base = "00000000";
const uint64_t unpacked = *reinterpret_cast<const uint64_t*>(str)
- *reinterpret_cast<const uint64_t*>(base);
Then unpacked will be 0x0807060504030201 on a little-endian system.
What is the fastest way to convert the number into 12345678, perhaps by multiplying it by some magic number or using SIMD up to AVX2?
UPDATE: 12345678 has to be a number stored in a 32-bit or 64-bit integer, not a string.
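One note before the answers: the `reinterpret_cast` above technically violates strict aliasing. A `memcpy`-based variant of the same unpacking step (my sketch; little-endian assumed, as in the question) is well defined:

```cpp
#include <cstdint>
#include <cstring>

// Load eight ASCII digits and subtract '0' (0x30) from every byte, so that
// "12345678" becomes 0x0807060504030201 on a little-endian machine.
uint64_t unpack_digits(const char* s) {
    uint64_t raw;
    std::memcpy(&raw, s, 8);            // defined behavior, unlike the pointer cast
    return raw - 0x3030303030303030ULL; // no borrows: every byte holds 0x30..0x39
}
```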
Multiplication in binary is just a series of shifts and adds, so a SWAR approach shouldn't be too hard to understand. For a detailed walk-through, see:
https://johnnylee-sde.github.io/Fast-numeric-string-to-int/
https://kholdstare.github.io/technical/2020/05/26/faster-integer-parsing.html
https://lemire.me/blog/2022/01/21/swar-explained-parsing-eight-digits/
http://0x80.pl/articles/simd-parsing-int-sequences.html
// http://govnokod.ru/13461
static inline
uint32_t parse_8digits_swar_classic(const char* str) {
    uint64_t v;
    memcpy(&v, str, 8);
    v = (v & 0x0F0F0F0F0F0F0F0F) * 2561 >> 8;
    v = (v & 0x00FF00FF00FF00FF) * 6553601 >> 16;
    v = (v & 0x0000FFFF0000FFFF) * 42949672960001 >> 32;
    return v;
}
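To see why those three multiply–shift steps work: 2561 = 10·256 + 1 merges adjacent digit pairs, 6553601 = 100·65536 + 1 merges pairs into 4-digit groups, and 42949672960001 = 10000·2^32 + 1 merges the two halves. A self-contained copy for verification (harness mine; little-endian assumed):

```cpp
#include <cstdint>
#include <cstring>

// Identical to parse_8digits_swar_classic above, repeated so this snippet
// compiles on its own.
uint32_t parse8(const char* str) {
    uint64_t v;
    std::memcpy(&v, str, 8);
    v = (v & 0x0F0F0F0F0F0F0F0F) * 2561 >> 8;            // adjacent digit pairs
    v = (v & 0x00FF00FF00FF00FF) * 6553601 >> 16;        // 4-digit groups
    v = (v & 0x0000FFFF0000FFFF) * 42949672960001 >> 32; // full 8-digit value
    return (uint32_t)v;
}
```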
// attempt to improve the latency
static inline
uint32_t parse_8digits_swar_aqrit(const char* str) {
    const uint64_t mask = 0x000000FF000000FF;
    uint64_t v, t;
    memcpy(&v, str, 8);
    v = (v * 10) + (v >> 8);
    t = (v & mask) * 0x000F424000000064;
    v = ((v >> 16) & mask) * 0x0000271000000001;
    v = (v + t + 0xFF0915C600000000ULL) >> 32;
    return v;
}
// SSSE3 needs fewer `shift & mask` operations...
static inline
uint32_t parse_8digits_simd_ssse3(const char* str) {
    const __m128i mul1 = _mm_set_epi32(0, 0, 0x010A0A64, 0x14C814C8);
    const __m128i mul2 = _mm_set_epi32(0, 0, 0x0001000A, 0x00FA61A8);
    const __m128i mask = _mm_set1_epi8(0x0F);
    __m128i v;
    v = _mm_loadl_epi64((const __m128i*)(const void*)str);
    v = _mm_and_si128(v, mask);
    v = _mm_madd_epi16(_mm_maddubs_epi16(mul1, v), mul2);
    v = _mm_add_epi32(_mm_add_epi32(v, v), _mm_shuffle_epi32(v, 1));
    return (uint32_t)_mm_cvtsi128_si32(v);
}
On an older x86-64 system without AVX2, this simple version, which gathers the digits in tree fashion, is quite efficient; in my measurements its performance is on par with a simple SWAR-based implementation. It does require a processor with plenty of instruction-level parallelism, however, as it comprises 50% more instructions than the SWAR-based code when compiled with full optimizations.
/* convert a string of exactly eight 'char' into a 32-bit unsigned integer */
uint32_t string_to_number(const char* s)
{
    uint32_t t0 = s[0] * 10 + s[1];
    uint32_t t1 = s[2] * 10 + s[3];
    uint32_t t2 = s[4] * 10 + s[5];
    uint32_t t3 = s[6] * 10 + s[7];
    uint32_t s0 = t0 * 100 + t1;
    uint32_t s1 = t2 * 100 + t3;
    uint32_t num = s0 * 10000 + s1;
    uint32_t corr =
        '0' * 10000000 +
        '0' * 1000000 +
        '0' * 100000 +
        '0' * 10000 +
        '0' * 1000 +
        '0' * 100 +
        '0' * 10 +
        '0' * 1;
    return num - corr;
}
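The correction constant folds to '0' · 11111111 (48 · 11111111 = 533333328), which the compiler computes at build time. A quick self-check of the tree-gather routine (harness mine):

```cpp
#include <cstdint>

// Tree-fashion gather, as above: pair digits, then pairs of pairs, then the two
// halves, and subtract the accumulated ASCII bias in one step at the end.
uint32_t string_to_number(const char* s) {
    uint32_t t0 = s[0] * 10 + s[1];
    uint32_t t1 = s[2] * 10 + s[3];
    uint32_t t2 = s[4] * 10 + s[5];
    uint32_t t3 = s[6] * 10 + s[7];
    uint32_t s0 = t0 * 100 + t1;
    uint32_t s1 = t2 * 100 + t3;
    uint32_t num = s0 * 10000 + s1;
    return num - '0' * 11111111u; // same value as the spelled-out corr sum
}
```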
If you change your input format to breadth-first element order like this:
Sample: 9 numbers, interleaved
digit[]:
1 1 1 1 1 1 1 1 1 ...  2 2 2 2 2 2 2 2 2 ...  3 3 3 3 3 3 3 3 3 ...
for (int j = 0; j < num_parse; j += 9)
{
    for (int i = 0; i < 9; i++)
    {
        value[i] += (multiplier[i] *= 10) * (digit[i + j] - '0');
    }
    // value vector copied to output
    // clear value & multiplier vectors
}
And if you convert more than just 9 values at a time, say 512 or 8192, with padding to a multiple of 32, the compiler should vectorize it.
To prepare input, you can use 8 different channels, 1 per digit of every parsed value.
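A scalar version of that digit-major layout (my sketch; `digits` holds all first digits, then all second digits, and so on) shows the inner loop the compiler is expected to vectorize:

```cpp
#include <cstdint>

// Parse `count` numbers of `ndig` digits each from a digit-major buffer:
// digits[d * count + i] is digit position d (most significant first) of number i.
void parse_interleaved(const char* digits, int count, int ndig, uint32_t* value) {
    for (int i = 0; i < count; ++i) value[i] = 0;
    for (int d = 0; d < ndig; ++d)       // one pass per digit position
        for (int i = 0; i < count; ++i)  // contiguous, independent: vectorizes well
            value[i] = value[i] * 10 + (uint32_t)(digits[d * count + i] - '0');
}
```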
I've implemented a small program to test some ideas. The AVX2 implementation is ~1.5 times faster than the naive one, with the table implementation in the middle:
AVX2: 12345678 in 3.42759
Naive: 12345678 in 5.12581
Table: 12345678 in 4.49478
Source code:
#include <cstdlib>
#include <cstdint>
#include <ctime>       // clock(), CLOCKS_PER_SEC (missing in the original)
#include <immintrin.h>
#include <iostream>
using namespace std;

const __m256i mask = _mm256_set1_epi32(0xf);
const __m256i mul = _mm256_setr_epi32(10000000, 1000000, 100000, 10000, 1000, 100, 10, 1);
const volatile char* str = "12345678";
volatile uint32_t h;
const int64_t nIter = 1000LL * 1000LL * 1000LL;

inline void parse_avx2() {
    const char* const base = "00000000";
    const uint64_t unpacked = *reinterpret_cast<const volatile uint64_t*>(str)
        - *reinterpret_cast<const uint64_t*>(base);
    const __m128i a = _mm_set1_epi64x(unpacked);
    const __m256i b = _mm256_cvtepu8_epi32(a);
    const __m256i d = _mm256_mullo_epi32(b, mul);
    const __m128i e = _mm_add_epi32(_mm256_extractf128_si256(d, 0), _mm256_extractf128_si256(d, 1));
    const uint64_t f0 = _mm_extract_epi64(e, 0);
    const uint64_t f1 = _mm_extract_epi64(e, 1);
    const uint64_t g = f0 + f1;
    h = (g >> 32) + (g & 0xffffffff);
}

inline void parse_naive() {
    const char* const base = "00000000";
    const uint64_t unpacked = *reinterpret_cast<const volatile uint64_t*>(str)
        - *reinterpret_cast<const uint64_t*>(base);
    const uint8_t* a = reinterpret_cast<const uint8_t*>(&unpacked);
    h = a[7] + a[6]*10 + a[5]*100 + a[4]*1000 + a[3]*10000 + a[2]*100000 + a[1]*1000000 + a[0]*10000000;
}

uint32_t summands[8][10];

inline void parse_table() {
    const char* const base = "00000000";
    const uint64_t unpacked = *reinterpret_cast<const volatile uint64_t*>(str)
        - *reinterpret_cast<const uint64_t*>(base);
    const uint8_t* a = reinterpret_cast<const uint8_t*>(&unpacked);
    h = summands[7][a[0]] + summands[6][a[1]] + summands[5][a[2]] + summands[4][a[3]]
      + summands[3][a[4]] + summands[2][a[5]] + summands[1][a[6]] + summands[0][a[7]];
}

int main() {
    clock_t start = clock();
    for (int64_t i = 0; i < nIter; i++) {
        parse_avx2();
    }
    clock_t end = clock();
    cout << "AVX2: " << h << " in " << double(end - start) / CLOCKS_PER_SEC << endl;

    start = clock();
    for (int64_t i = 0; i < nIter; i++) {
        parse_naive();
    }
    end = clock();
    cout << "Naive: " << h << " in " << double(end - start) / CLOCKS_PER_SEC << endl;

    uint32_t mul = 1;
    for (int i = 0; i < 8; i++, mul *= 10) {
        for (int j = 0; j < 10; j++) { // include digit 9 (the original stopped at j < 9)
            summands[i][j] = j * mul;
        }
    }
    start = clock();
    for (int64_t i = 0; i < nIter; i++) {
        parse_table();
    }
    end = clock();
    cout << "Table: " << h << " in " << double(end - start) / CLOCKS_PER_SEC << endl;
    return 0;
}

C++ fstream throws exception on closing file, if i wrote too much

I have made a program that calculates a ballistic curve using Runge-Kutta 4.
It prints the results into a .csv file so it can be opened in Excel afterwards.
The exception is thrown at the point where I close the file at the end of the method.
It works if I use a small step size and write fewer than 500 entry lines.
I even changed it so that it writes at the end of the calculation.
If I wait for a second between every 100 entries, it sometimes works.
By the way, I'm fairly sure it's not a problem with my calculation, because it works with h = 0.1 and duration 3.
void BallisticCalculator::printGraph(float v0, float d, float m, float scopeOffset, float alpha, float h, float duration) {
    // constants
    static const float Roh = 1.2; // kg/m^3
    static const float Cw = 0.3;  // a typical bullet
    static const float PI = 3.1415926535;
    Vector2D* g = new Vector2D(0, -9.81);
    // open file
    std::remove("RungeKutta.csv");
    std::ofstream myfile;
    myfile.open("RungeKutta.csv");
    //myfile.open("X:\\Math\\RungeKutta.csv");
    int w = 0; // writing access
    // converting to SI
    float A = (d / 2000) * (d / 2000) * PI; // mm to m^2
    m = m / 1000;                    // gram to kg
    scopeOffset = scopeOffset / 100; // cm to meters
    alpha = alpha * PI / (60 * 180); // MOA to degrees to radians
    // drag deceleration (drag force divided by the mass of the bullet)
    double k = Roh * Cw * A / 2 / m;
    // data
    Bullet* graph = new Bullet[duration / h];
    for (double t = 0; t < duration; t += h) {
        w++;
        int i = (int)(t / h);
        if (i == 0) {
            graph[i].pos = Vector2D(0, -scopeOffset);
            graph[i].v = Vector2D(std::cos(alpha) * v0, std::sin(alpha) * v0);
            graph[i].a = acc(graph[i].v, *g, k);
        }
        else {
            Vector2D v1, v2, v3, v4 = Vector2D(0, 0);
            v1 = acc(graph[i - 1].v, *g, k);
            v2 = acc(graph[i - 1].v + (v1 * (h / 2)), *g, k);
            v3 = acc(graph[i - 1].v + (v2 * (h / 2)), *g, k);
            v4 = acc(graph[i - 1].v + (v3 * h), *g, k);
            graph[i].v = graph[i - 1].v + ((v1 + v2 + v2 + v3 + v3 + v4) * (h / 6));
            graph[i].pos = graph[i - 1].pos + (graph[i].v * h);
            graph[i].a = acc(graph[i].v, *g, k);
        }
        // I'd like to do it this way:
        myfile << t << ";" << graph[i].pos.x << ";" << graph[i].pos.y << ";" << norm(graph[i].v) << ";" << "\n";
        //        time        distance               drop                     speed
        // waiting here makes it sometimes possible to work...
        if (w > 100) {
            myfile.close();
            std::cout << "\n" << "Press Enter; this pause is meant to relieve the write system...";
            std::cin.ignore();
            myfile.open("RungeKutta.csv", std::ios::app);
            w = 0;
        }
    }
    // for testing purposes
    //for (int i = 0; i < (duration / h); i++) {
    //    myfile << graph[i].pos.x << ";" << graph[i].pos.y << ";" << norm(graph[i].v) << ";" << "\n";
    //}
    myfile.close();
    //return state;
    std::cout << "finished" << "\n" << "\n";
}

BallisticCalculator::Vector2D BallisticCalculator::acc(Vector2D v, Vector2D g, float k) // v{x,y}, gravity, k = Roh * Cw * A / 2 / m
{
    return (v * -k * norm(v)) + g;
}
The exception was:
Exception thrown at 0x7A5EB2E7 (ucrtbased.dll) in ConsoleApplicationRungeKutta.exe: 0xC0000005: Access violation writing location 0xC311F2DD.
It is thrown on the next line after myfile.close();
After discussing this with my co-worker, we found a solution. Although we couldn't pin down exactly what was wrong, I was probably accessing memory I shouldn't: new Bullet[duration / h] truncates the element count to an integer, while the floating-point loop accumulates rounding error in t, so the index (int)(t / h) can run one past the end of the array and corrupt the heap; the corruption then surfaces later, e.g. inside myfile.close().
Defining the array size with a rounded integer didn't help, so I had to make the container's size dynamic.
It works with a vector, so the problem isn't the fstream.
How to fix it:
//Bullet* graph = new Bullet[duration / h];
std::vector<Bullet> graph; // new
and:
for (double t = 0; t < duration; t += h) {
    graph.push_back(Bullet()); // new
It would be interesting to know why my array was not working.
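For reference, a pattern that avoids the problem entirely (a sketch with a simplified stand-in Bullet and placeholder physics, not the asker's full program) is to derive an integer step count once and loop over integers, so floating-point accumulation can never index past the end:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct Bullet { double x = 0, y = 0; }; // simplified stand-in for the asker's type

// Hypothetical helper: size the container from ceil(duration / h) and derive
// t from the loop index, instead of accumulating t and dividing back.
std::vector<Bullet> simulate(double duration, double h) {
    const std::size_t steps = (std::size_t)std::ceil(duration / h);
    std::vector<Bullet> graph;
    graph.reserve(steps);          // one allocation; push_back cannot run past it
    for (std::size_t k = 0; k < steps; ++k) {
        const double t = k * h;    // no accumulated rounding error in t
        Bullet b;
        b.x = t;
        b.y = -0.5 * 9.81 * t * t; // placeholder physics, not the RK4 step
        graph.push_back(b);
    }
    return graph;
}
```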

TEA implementation in VB.NET

I am wondering if anyone is able to help with converting some code from C++ to VB.NET.
I was given the code to implement within VB, but I am having some issues when passing values through: the code stops when a number value increases past the allowed limit for the type.
The original code is below:
void decrypt(unsigned char* v, unsigned long* k)
{
    unsigned long v0, v1, sum=0x########, i; /* set up */
    unsigned long delta=0x######;            /* a key schedule constant */
    unsigned long k0=k[0], k1=k[1], k2=k[2], k3=k[3]; /* cache key */
    // MAKE UNSIGNED LONG VALUES FROM PASSED CHARACTER BUFFER
    v0 = (unsigned long)v[0] |
         (unsigned long)v[1] << 8 |
         (unsigned long)v[2] << 16 |
         (unsigned long)v[3] << 24;
    v1 = (unsigned long)v[4] |
         (unsigned long)v[5] << 8 |
         (unsigned long)v[6] << 16 |
         (unsigned long)v[7] << 24;
    for (i=0; i<32; i++) { /* basic cycle start */
        v1 -= ((v0<<4) + k2) ^ (v0 + sum) ^ ((v0>>5) + k3);
        v0 -= ((v1<<4) + k0) ^ (v1 + sum) ^ ((v1>>5) + k1);
        sum -= delta;
    } /* end cycle */
    // WRITE THE DATA BACK TO THE CHARACTER ARRAY
    v[0] = (unsigned char)(v0);
    v[1] = (unsigned char)(v0 >> 8);
    v[2] = (unsigned char)(v0 >> 16);
    v[3] = (unsigned char)(v0 >> 24);
    v[4] = (unsigned char)(v1);
    v[5] = (unsigned char)(v1 >> 8);
    v[6] = (unsigned char)(v1 >> 16);
    v[7] = (unsigned char)(v1 >> 24);
}
My VB code is below:
Private Sub Decrypt()
    Try
        Dim v0, v1, sum, delta As Long
        v0 = 0
        v1 = 0
        Dim i As Integer
        sum = &#######
        delta = &#######
        Dim k0 As ULong = k(0)
        Dim k1 As ULong = k(1)
        Dim k2 As ULong = k(2)
        Dim k3 As ULong = k(3)
        Debug.Print(v(0) & " " & v(1) & " " & v(2) & " " & v(3) & " " & v(4) & " " & v(5) & " " & v(6) & " " & v(7))
        ' MAKE UNSIGNED LONG VALUES FROM PASSED CHARACTER BUFFER
        v0 = v(0) Or (v(1) << 8) Or (v(2) << 16) Or (v(3) << 24)
        v1 = v(4) Or (v(5) << 8) Or (v(6) << 16) Or (v(7) << 24)
        For i = 1 To 32
            Debug.Print(((v0 << 4) + k2))
            Debug.Print((v0 + sum))
            Debug.Print(((v0 >> 5) + k3))
            v1 -= ((v0 << 4) + k2) Xor (v0 + sum) Xor ((v0 >> 5) + k3)
            v0 += ((v1 << 4) + k0) Xor (v1 + sum) Xor ((v1 >> 5) + k1)
            sum -= delta
        Next
        ' WRITE THE DATA BACK TO THE CHARACTER ARRAY
        v(0) = v0
        v(1) = v0 >> 8
        v(2) = v0 >> 16
        v(3) = v0 >> 24
        v(4) = v1
        v(5) = v1 >> 8
        v(6) = v1 >> 16
        v(7) = v1 >> 24
    Catch ex As Exception
        MessageBox.Show("TEA_Encription - Decrypt " & ex.Message)
    End Try
End Sub
I load my key values into the constant K(0 to 4) and then eight 16-bit integers into the values v(0 to 7).
The code is failing on the lines
v1 -= ((v0 << 4) + k2) Xor (v0 + sum) Xor ((v0 >> 5) + k3)
v0 += ((v1 << 4) + k0) Xor (v1 + sum) Xor ((v1 >> 5) + k1)
around the fourth iteration of the loop.
Any pointers would be gratefully received.
The original code uses unsigned long throughout, but you are using signed Long variables in places.
Try changing
Dim v0, v1, sum, delta As Long
To
Dim v0, v1, sum, delta As ULong
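The root cause is that TEA is defined over arithmetic modulo 2^32: in C++, uint32_t additions and subtractions wrap silently, whereas VB.NET raises an overflow error when a signed Long result goes out of range. A small C++ illustration (mine, not from either post) of the wraparound the cipher relies on:

```cpp
#include <cstdint>

// Accumulate TEA's key-schedule constant 32 times; the sum wraps past 2^32
// several times and must land exactly on decrypt's starting value 0xC6EF3720.
uint32_t tea_sum_after_32_rounds() {
    const uint32_t delta = 0x9e3779b9u;
    uint32_t sum = 0;
    for (int i = 0; i < 32; i++)
        sum += delta; // well-defined modular wraparound for unsigned types
    return sum;
}
```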

C++ Spline interpolation from an array of points

I am writing a bit of code to animate a point using a sequence of positions. In order to get a decent result, I'd like to add some spline interpolation to smooth the transitions between positions. All the positions are separated by the same amount of time (say, 500 ms).
int delay = 500;
vector<Point> positions = { {0, 0}, {50, 20}, {150, 100}, {30, 120} };
Here is what I have done to make a linear interpolation (which seems to work properly), just to give you an idea of what I'm looking for later on:
Point getPositionAt(int currentTime) {
    Point before, after, result;
    int currentIndex = (currentTime / delay) % positions.size();
    before = positions[currentIndex];
    after = positions[(currentIndex + 1) % positions.size()];
    // progress between [before] and [after]
    double progress = fmod(((double)currentTime) / (double)delay, (double)positions.size()) - currentIndex;
    result.x = before.x + (int)(progress * (after.x - before.x)); // parenthesized: the cast must apply to the product
    result.y = before.y + (int)(progress * (after.y - before.y));
    return result;
}
So that was simple, but now what I would like to do is spline interpolation. Thanks !
I had to write a Bezier spline creation routine for an "entity" that was following a path in a game I am working on. I created a base class to handle a "SplineInterface" and then created two derived classes, one based on the classic spline technique (e.g. Sedgewick's Algorithms) and a second based on Bezier splines.
Here is the code. It is a single header file with a few includes (most should be obvious):
#ifndef __SplineCommon__
#define __SplineCommon__

#include "CommonSTL.h"
#include "CommonProject.h"
#include "MathUtilities.h"

/* A Spline base class. */
class SplineBase
{
private:
    vector<Vec2> _points;
    bool _elimColinearPoints;
protected:
    /* OVERRIDE THESE FUNCTIONS */
    virtual void ResetDerived() = 0;
    enum
    {
        NOM_SIZE = 32,
    };
public:
    SplineBase()
    {
        _points.reserve(NOM_SIZE);
        _elimColinearPoints = true;
    }

    const vector<Vec2>& GetPoints() { return _points; }
    bool GetElimColinearPoints() { return _elimColinearPoints; }
    void SetElimColinearPoints(bool elim) { _elimColinearPoints = elim; }

    /* OVERRIDE THESE FUNCTIONS */
    virtual Vec2 Eval(int seg, double t) = 0;
    virtual bool ComputeSpline() = 0;
    virtual void DumpDerived() {}

    /* Clear out all the data. */
    void Reset()
    {
        _points.clear();
        ResetDerived();
    }

    void AddPoint(const Vec2& pt)
    {
        // If this new point is colinear with the two previous points,
        // pop off the last point and add this one instead.
        if (_elimColinearPoints && _points.size() > 2)
        {
            int N = _points.size() - 1;
            Vec2 p0 = _points[N - 1] - _points[N - 2];
            Vec2 p1 = _points[N] - _points[N - 1];
            Vec2 p2 = pt - _points[N];
            // We test for colinearity by comparing the slopes
            // of the two lines.  If the slopes are the same,
            // we assume colinearity.
            float32 delta = (p2.y - p1.y) * (p1.x - p0.x) - (p1.y - p0.y) * (p2.x - p1.x);
            if (MathUtilities::IsNearZero(delta))
            {
                _points.pop_back();
            }
        }
        _points.push_back(pt);
    }

    void Dump(int segments = 5)
    {
        assert(segments > 1);
        cout << "Original Points (" << _points.size() << ")" << endl;
        cout << "-----------------------------" << endl;
        for (int idx = 0; idx < _points.size(); ++idx)
        {
            cout << "[" << idx << "]" << " " << _points[idx] << endl;
        }
        cout << "-----------------------------" << endl;
        DumpDerived();
        cout << "-----------------------------" << endl;
        cout << "Evaluating Spline at " << segments << " points." << endl;
        for (int idx = 0; idx < _points.size() - 1; idx++)
        {
            cout << "---------- " << "From " << _points[idx] << " to " << _points[idx + 1] << "." << endl;
            for (int tIdx = 0; tIdx < segments + 1; ++tIdx)
            {
                double t = tIdx * 1.0 / segments;
                cout << "[" << tIdx << "]" << " ";
                cout << "[" << t * 100 << "%]" << " ";
                cout << " --> " << Eval(idx, t);
                cout << endl;
            }
        }
    }
};
class ClassicSpline : public SplineBase
{
private:
    /* The system of linear equations found by solving
     * for the 3rd order spline polynomial is given by:
     * A*x = b.  The "x" is represented by _xCol and the
     * "b" is represented by _bCol in the code.
     *
     * The "A" is formulated with diagonal elements (_diagElems) and
     * symmetric off-diagonal elements (_offDiagElems).  The
     * general structure (for six points) looks like:
     *
     *  | d1 u1  0  0  0 |   | p1 |   | w1 |
     *  | u1 d2 u2  0  0 |   | p2 |   | w2 |
     *  |  0 u2 d3 u3  0 | * | p3 | = | w3 |
     *  |  0  0 u3 d4 u4 |   | p4 |   | w4 |
     *  |  0  0  0 u4 d5 |   | p5 |   | w5 |
     *
     * The general derivation for this can be found
     * in Robert Sedgewick's "Algorithms in C++".
     */
    vector<double> _xCol;
    vector<double> _bCol;
    vector<double> _diagElems;
    vector<double> _offDiagElems;
public:
    ClassicSpline()
    {
        _xCol.reserve(NOM_SIZE);
        _bCol.reserve(NOM_SIZE);
        _diagElems.reserve(NOM_SIZE);
        _offDiagElems.reserve(NOM_SIZE);
    }

    /* Evaluate the spline for the ith segment.  The value
     * of the parameter t must be between 0 and 1.
     */
    inline virtual Vec2 Eval(int seg, double t)
    {
        const vector<Vec2>& points = GetPoints();
        assert(t >= 0);
        assert(t <= 1.0);
        assert(seg >= 0);
        assert(seg < (points.size() - 1));
        const double ONE_OVER_SIX = 1.0 / 6.0;
        double oneMinust = 1.0 - t;
        double t3Minust = t * t * t - t;
        double oneMinust3minust = oneMinust * oneMinust * oneMinust - oneMinust;
        double deltaX = points[seg + 1].x - points[seg].x;
        double yValue = t * points[seg + 1].y +
                        oneMinust * points[seg].y +
                        ONE_OVER_SIX * deltaX * deltaX * (t3Minust * _xCol[seg + 1] - oneMinust3minust * _xCol[seg]);
        double xValue = t * (points[seg + 1].x - points[seg].x) + points[seg].x;
        return Vec2(xValue, yValue);
    }

    /* Clear out all the data. */
    virtual void ResetDerived()
    {
        _diagElems.clear();
        _bCol.clear();
        _xCol.clear();
        _offDiagElems.clear();
    }

    virtual bool ComputeSpline()
    {
        const vector<Vec2>& p = GetPoints();
        _bCol.resize(p.size());
        _xCol.resize(p.size());
        _diagElems.resize(p.size());
        _offDiagElems.resize(p.size()); // was never resized in the original; indexing it was out of bounds
        for (int idx = 1; idx < (int)p.size() - 1; ++idx) // p[idx+1] must stay in range
        {
            _diagElems[idx] = 2 * (p[idx + 1].x - p[idx - 1].x);
        }
        for (int idx = 0; idx < (int)p.size() - 1; ++idx) // p[idx+1] must stay in range
        {
            _offDiagElems[idx] = p[idx + 1].x - p[idx].x;
        }
        for (int idx = 1; idx < (int)p.size() - 1; ++idx) // p[idx+1] must stay in range
        {
            _bCol[idx] = 6.0 * ((p[idx + 1].y - p[idx].y) / _offDiagElems[idx] -
                                (p[idx].y - p[idx - 1].y) / _offDiagElems[idx - 1]);
        }
        _xCol[0] = 0.0;
        _xCol[p.size() - 1] = 0.0;
        for (int idx = 1; idx < (int)p.size() - 1; ++idx)
        {
            _bCol[idx + 1] = _bCol[idx + 1] - _bCol[idx] * _offDiagElems[idx] / _diagElems[idx];
            _diagElems[idx + 1] = _diagElems[idx + 1] - _offDiagElems[idx] * _offDiagElems[idx] / _diagElems[idx];
        }
        for (int idx = (int)p.size() - 2; idx > 0; --idx)
        {
            _xCol[idx] = (_bCol[idx] - _offDiagElems[idx] * _xCol[idx + 1]) / _diagElems[idx];
        }
        return true;
    }
};
/* Bezier Spline Implementation
 * Based on this article:
 * http://www.particleincell.com/blog/2012/bezier-splines/
 */
class BezierSpine : public SplineBase
{
private:
    vector<Vec2> _p1Points;
    vector<Vec2> _p2Points;
public:
    BezierSpine()
    {
        _p1Points.reserve(NOM_SIZE);
        _p2Points.reserve(NOM_SIZE);
    }

    /* Evaluate the spline for the ith segment.  The value
     * of the parameter t must be between 0 and 1.
     */
    inline virtual Vec2 Eval(int seg, double t)
    {
        assert(seg < _p1Points.size());
        assert(seg < _p2Points.size());
        double omt = 1.0 - t;
        Vec2 p0 = GetPoints()[seg];
        Vec2 p1 = _p1Points[seg];
        Vec2 p2 = _p2Points[seg];
        Vec2 p3 = GetPoints()[seg + 1];
        double xVal = omt*omt*omt*p0.x + 3*omt*omt*t*p1.x + 3*omt*t*t*p2.x + t*t*t*p3.x;
        double yVal = omt*omt*omt*p0.y + 3*omt*omt*t*p1.y + 3*omt*t*t*p2.y + t*t*t*p3.y;
        return Vec2(xVal, yVal);
    }

    /* Clear out all the data. */
    virtual void ResetDerived()
    {
        _p1Points.clear();
        _p2Points.clear();
    }

    virtual bool ComputeSpline()
    {
        const vector<Vec2>& p = GetPoints();
        int N = (int)p.size() - 1;
        _p1Points.resize(N);
        _p2Points.resize(N);
        if (N == 0)
            return false;
        if (N == 1)
        {   // Only 2 points...just create a straight line.
            // Constraint: 3*P1 = 2*P0 + P3
            _p1Points[0] = (2.0/3.0*p[0] + 1.0/3.0*p[1]);
            // Constraint: P2 = 2*P1 - P0
            _p2Points[0] = 2.0*_p1Points[0] - p[0];
            return true;
        }

        /* rhs vector */
        vector<Vec2> a(N);
        vector<Vec2> b(N);
        vector<Vec2> c(N);
        vector<Vec2> r(N);

        /* left most segment */
        a[0].x = 0;
        b[0].x = 2;
        c[0].x = 1;
        r[0].x = p[0].x + 2*p[1].x;
        a[0].y = 0;
        b[0].y = 2;
        c[0].y = 1;
        r[0].y = p[0].y + 2*p[1].y;

        /* internal segments */
        for (int i = 1; i < N - 1; i++)
        {
            a[i].x = 1;
            b[i].x = 4;
            c[i].x = 1;
            r[i].x = 4 * p[i].x + 2 * p[i+1].x;
            a[i].y = 1;
            b[i].y = 4;
            c[i].y = 1;
            r[i].y = 4 * p[i].y + 2 * p[i+1].y;
        }

        /* right segment */
        a[N-1].x = 2;
        b[N-1].x = 7;
        c[N-1].x = 0;
        r[N-1].x = 8*p[N-1].x + p[N].x;
        a[N-1].y = 2;
        b[N-1].y = 7;
        c[N-1].y = 0;
        r[N-1].y = 8*p[N-1].y + p[N].y;

        /* solves Ax=b with the Thomas algorithm (from Wikipedia) */
        for (int i = 1; i < N; i++)
        {
            double m;
            m = a[i].x / b[i-1].x;
            b[i].x = b[i].x - m * c[i-1].x;
            r[i].x = r[i].x - m * r[i-1].x;
            m = a[i].y / b[i-1].y;
            b[i].y = b[i].y - m * c[i-1].y;
            r[i].y = r[i].y - m * r[i-1].y;
        }

        _p1Points[N-1].x = r[N-1].x / b[N-1].x;
        _p1Points[N-1].y = r[N-1].y / b[N-1].y;
        for (int i = N - 2; i >= 0; --i)
        {
            _p1Points[i].x = (r[i].x - c[i].x * _p1Points[i+1].x) / b[i].x;
            _p1Points[i].y = (r[i].y - c[i].y * _p1Points[i+1].y) / b[i].y;
        }

        /* we have p1, now compute p2 */
        for (int i = 0; i < N - 1; i++)
        {
            _p2Points[i].x = 2*p[i+1].x - _p1Points[i+1].x;
            _p2Points[i].y = 2*p[i+1].y - _p1Points[i+1].y;
        }
        _p2Points[N-1].x = 0.5 * (p[N].x + _p1Points[N-1].x);
        _p2Points[N-1].y = 0.5 * (p[N].y + _p1Points[N-1].y);
        return true;
    }

    virtual void DumpDerived()
    {
        cout << " Control Points " << endl;
        for (int idx = 0; idx < _p1Points.size(); idx++)
        {
            cout << "[" << idx << "] ";
            cout << "P1: " << _p1Points[idx];
            cout << "   ";
            cout << "P2: " << _p2Points[idx];
            cout << endl;
        }
    }
};

#endif /* defined(__SplineCommon__) */
Some Notes
The classic spline will crash if you give it a vertical set of points. That is why I created the Bezier version: I have lots of vertical lines/paths to follow.
The base class has an option to remove colinear points as you add
them. This uses a simple slope comparison of two lines to figure out
if they are on the same line. You don't have to do this, but for
long paths that are straight lines, it cuts down on cycles. When you
do a lot of pathfinding on a regular-spaced graph, you tend to get a
lot of continuous segments.
Here is an example of using the Bezier Spline:
/* Smooth the points on the path so that turns look
 * more natural.  We'll only smooth the first few
 * points.  Most of the time, the full path will not
 * be executed anyway...why waste cycles.
 */
void SmoothPath(vector<Vec2>& path, int32 divisions)
{
    const int SMOOTH_POINTS = 6;
    BezierSpine spline;
    if (path.size() < 2)
        return;
    // Cache off the first point.  If the first point is removed,
    // then we occasionally run into problems if the collision detection
    // says the first node is occupied but the splined point is too
    // close, so the FSM "spins" trying to find a sensor cell that is
    // not occupied.
    // Vec2 firstPoint = path.back();
    // path.pop_back();
    // Grab the points.
    for (int idx = 0; idx < SMOOTH_POINTS && path.size() > 0; idx++)
    {
        spline.AddPoint(path.back());
        path.pop_back();
    }
    // Smooth them.
    spline.ComputeSpline();
    // Push them back in.
    for (int idx = spline.GetPoints().size() - 2; idx >= 0; --idx)
    {
        for (int division = divisions - 1; division >= 0; --division)
        {
            double t = division * 1.0 / divisions;
            path.push_back(spline.Eval(idx, t));
        }
    }
    // Push back in the original first point.
    // path.push_back(firstPoint);
}
Notes
While the whole path could be smoothed, in this application, since
the path was changing every so often, it was better to just smooth
the first points and then connect it up.
The points are loaded in "reverse" order into the path vector. This
may or may not save cycles (I've slept since then).
This code is part of a much larger code base, but you can download it all on github and see a blog entry about it here.
You can look at this in action in this video.

Convert 0x1234 to 0x11223344

How do I expand the hexadecimal number 0x1234 to 0x11223344 in a high-performance way?
unsigned int c = 0x1234, b;
b = (c & 0xff) << 4 | c & 0xf | (c & 0xff0) << 8
  | (c & 0xff00) << 12 | (c & 0xf000) << 16;
printf("%#x -> %#x\n", c, b); // %p is for pointers; %#x matches unsigned int
Output:
0x1234 -> 0x11223344
I need this for color conversion. Users provide their data in the form 0xARGB, and I need to convert it to 0xAARRGGBB. And yes, there could be millions, because each could be a pixel. 1000x1000 pixels equals to one million.
The actual case is even more complicated, because a single 32-bit value contains both foreground and background colors. So 0xARGBargb become: [ 0xAARRGGBB, 0xaarrggbb ]
Oh yes, one more thing: in the real application I also negate alpha, because in OpenGL 0xFF is non-transparent and 0x00 is fully transparent. That is inconvenient in most cases, because usually you just need the RGB part and transparency is assumed to be absent.
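As a scalar reference for what is being asked (my sketch of the stated requirement, not code from the question): nibble duplication is `b | (b << 4)` after spreading the nibbles apart, and the foreground/background split plus alpha flip can look like this:

```cpp
#include <cstdint>
#include <utility>

// Expand nibble-packed 0xARGB to 0xAARRGGBB by spreading the nibbles apart
// and then duplicating each one, e.g. 0x1234 -> 0x01020304 -> 0x11223344.
uint32_t expand4to8(uint32_t c) {
    uint32_t b = (c & 0x000F)
               | (c & 0x00F0) << 4
               | (c & 0x0F00) << 8
               | (c & 0xF000) << 12;
    return b | (b << 4);
}

// Split 0xARGBargb into { foreground, background } and flip alpha, so a packed
// alpha of 0x0 comes out as fully opaque 0xFF (my assumed convention).
std::pair<uint32_t, uint32_t> expandPixel(uint32_t argb2) {
    uint32_t fg = expand4to8(argb2 >> 16) ^ 0xFF000000u;
    uint32_t bg = expand4to8(argb2 & 0xFFFF) ^ 0xFF000000u;
    return { fg, bg };
}
```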
This can be done using SSE2 as follows:
void ExpandSSE2(unsigned __int64 in, unsigned __int64 &outLo, unsigned __int64 &outHi) {
    __m128i const mask = _mm_set1_epi16((short)0xF00F);
    __m128i const mul0 = _mm_set1_epi16(0x0011);
    __m128i const mul1 = _mm_set1_epi16(0x1000);
    __m128i       v;

    v = _mm_cvtsi64_si128(in);    // Move the 64-bit value to a 128-bit register
    v = _mm_unpacklo_epi8(v, v);  // 0x12 -> 0x1212
    v = _mm_and_si128(v, mask);   // 0x1212 -> 0x1002
    v = _mm_mullo_epi16(v, mul0); // 0x1002 -> 0x1022
    v = _mm_mulhi_epu16(v, mul1); // 0x1022 -> 0x0102
    v = _mm_mullo_epi16(v, mul0); // 0x0102 -> 0x1122

    // Extract with SSE2-only intrinsics (_mm_extract_epi64 would need SSE4.1)
    outLo = _mm_cvtsi128_si64(v);
    outHi = _mm_cvtsi128_si64(_mm_srli_si128(v, 8));
}
Of course you’d want to put the guts of the function in an inner loop and pull out the constants. You will also want to skip the x64 registers and load values directly into 128-bit SSE registers. For an example of how to do this, refer to the SSE2 implementation in the performance test below.
At its core, there are five instructions, which perform the operation on four color values at a time. So, that is only about 1.25 instructions per color value. It should also be noted that SSE2 is available anywhere x64 is available.
Performance tests for an assortment of the solutions here
A few people have mentioned that the only way to know what's faster is to run the code, and this is unarguably true. So I've compiled a few of the solutions into a performance test so we can compare apples to apples. I chose solutions I felt were different enough from the others to require testing. All the solutions read from memory, operate on the data, and write back to memory. In practice, some of the SSE solutions will require additional care about alignment and about handling the case where there isn't another full 16 bytes to process in the input data. The code I tested is x64, compiled under Release in Visual Studio 2013, running on a 4+ GHz Core i7.
Here are my results:
ExpandOrig: 56.234 seconds // From asker's original question
ExpandSmallLUT: 30.209 seconds // From Dmitry's answer
ExpandLookupSmallOneLUT: 33.689 seconds // from Dmitry's answer
ExpandLookupLarge: 51.312 seconds // A straightforward lookup table
ExpandAShelly: 43.829 seconds // From AShelly's answer
ExpandAShellyMulOp: 43.580 seconds // AShelly's answer with an optimization
ExpandSSE4: 17.854 seconds // My original SSE4 answer
ExpandSSE4Unroll: 17.405 seconds // My original SSE4 answer with loop unrolling
ExpandSSE2: 17.281 seconds // My current SSE2 answer
ExpandSSE2Unroll: 17.152 seconds // My current SSE2 answer with loop unrolling
In the test results above you'll see I included the asker's code, three lookup table implementations including the small lookup table implementation proposed in Dmitry's answer. AShelly's solution is included too, as well as a version with an optimization I made (an operation can be eliminated). I included my original SSE4 implementation, as well as a superior SSE2 version I made later (now reflected as the answer), as well as unrolled versions of both since they were the fastest here, and I wanted to see how much unrolling sped them up. I also included an SSE4 implementation of AShelly's answer.
So far I have to declare myself the winner. But the source is below, so anyone can test it out on their platform, and include their own solution into the testing to see if they've made a solution that's even faster.
#define DATA_SIZE_IN ((unsigned)(1024 * 1024 * 128))
#define DATA_SIZE_OUT ((unsigned)(2 * DATA_SIZE_IN))
#define RERUN_COUNT 500
#include <cstdlib>
#include <ctime>
#include <iostream>
#include <utility>
#include <emmintrin.h> // SSE2
#include <tmmintrin.h> // SSSE3
#include <smmintrin.h> // SSE4
void ExpandOrig(unsigned char const *in, unsigned char const *past, unsigned char *out) {
    unsigned u, v;
    do {
        // Read in data
        u = *(unsigned const*)in;
        v = u >> 16;
        u &= 0x0000FFFF;
        // Do computation
        u = (u & 0x00FF) << 4
          | (u & 0x000F)
          | (u & 0x0FF0) << 8
          | (u & 0xFF00) << 12
          | (u & 0xF000) << 16;
        v = (v & 0x00FF) << 4
          | (v & 0x000F)
          | (v & 0x0FF0) << 8
          | (v & 0xFF00) << 12
          | (v & 0xF000) << 16;
        // Store data
        *(unsigned*)(out) = u;
        *(unsigned*)(out + 4) = v;
        in += 4;
        out += 8;
    } while (in != past);
}

unsigned LutLo[256],
         LutHi[256];

void MakeLutLo(void) {
    for (unsigned i = 0, x; i < 256; ++i) {
        x = i;
        x = ((x & 0xF0) << 4) | (x & 0x0F);
        x |= (x << 4);
        LutLo[i] = x;
    }
}

void MakeLutHi(void) {
    for (unsigned i = 0, x; i < 256; ++i) {
        x = i;
        x = ((x & 0xF0) << 20) | ((x & 0x0F) << 16);
        x |= (x << 4);
        LutHi[i] = x;
    }
}

void ExpandLookupSmall(unsigned char const *in, unsigned char const *past, unsigned char *out) {
    unsigned u, v;
    do {
        // Read in data
        u = *(unsigned const*)in;
        v = u >> 16;
        u &= 0x0000FFFF;
        // Do computation
        u = LutHi[u >> 8] | LutLo[u & 0xFF];
        v = LutHi[v >> 8] | LutLo[v & 0xFF];
        // Store data
        *(unsigned*)(out) = u;
        *(unsigned*)(out + 4) = v;
        in += 4;
        out += 8;
    } while (in != past);
}

void ExpandLookupSmallOneLUT(unsigned char const *in, unsigned char const *past, unsigned char *out) {
    unsigned u, v;
    do {
        // Read in data
        u = *(unsigned const*)in;
        v = u >> 16;
        u &= 0x0000FFFF;
        // Do computation
        u = ((LutLo[u >> 8] << 16) | LutLo[u & 0xFF]);
        v = ((LutLo[v >> 8] << 16) | LutLo[v & 0xFF]);
        // Store data
        *(unsigned*)(out) = u;
        *(unsigned*)(out + 4) = v;
        in += 4;
        out += 8;
    } while (in != past);
}

unsigned LutLarge[256 * 256];

void MakeLutLarge(void) {
    for (unsigned i = 0; i < (256 * 256); ++i)
        LutLarge[i] = LutHi[i >> 8] | LutLo[i & 0xFF];
}

void ExpandLookupLarge(unsigned char const *in, unsigned char const *past, unsigned char *out) {
    unsigned u, v;
    do {
        // Read in data
        u = *(unsigned const*)in;
        v = u >> 16;
        u &= 0x0000FFFF;
        // Do computation
        u = LutLarge[u];
        v = LutLarge[v];
        // Store data
        *(unsigned*)(out) = u;
        *(unsigned*)(out + 4) = v;
        in += 4;
        out += 8;
    } while (in != past);
}

void ExpandAShelly(unsigned char const *in, unsigned char const *past, unsigned char *out) {
    unsigned u, v, w, x;
    do {
        // Read in data
        u = *(unsigned const*)in;
        v = u >> 16;
        u &= 0x0000FFFF;
        // Do computation
        w = (((u & 0xF0F) * 0x101) & 0xF000F) + (((u & 0xF0F0) * 0x1010) & 0xF000F00);
x = (((v & 0xF0F) * 0x101) & 0xF000F) + (((v & 0xF0F0) * 0x1010) & 0xF000F00);
w += w * 0x10;
x += x * 0x10;
// Store data
*(unsigned*)(out) = w;
*(unsigned*)(out + 4) = x;
in += 4;
out += 8;
} while (in != past);
}
void ExpandAShellyMulOp(unsigned char const *in, unsigned char const *past, unsigned char *out) {
unsigned u, v;
do {
// Read in data
u = *(unsigned const*)in;
v = u >> 16;
u &= 0x0000FFFF;
// Do computation
u = ((((u & 0xF0F) * 0x101) & 0xF000F) + (((u & 0xF0F0) * 0x1010) & 0xF000F00)) * 0x11;
v = ((((v & 0xF0F) * 0x101) & 0xF000F) + (((v & 0xF0F0) * 0x1010) & 0xF000F00)) * 0x11;
// Store data
*(unsigned*)(out) = u;
*(unsigned*)(out + 4) = v;
in += 4;
out += 8;
} while (in != past);
}
void ExpandSSE4(unsigned char const *in, unsigned char const *past, unsigned char *out) {
__m128i const mask0 = _mm_set1_epi16((short)0x8000),
mask1 = _mm_set1_epi8(0x0F),
mul = _mm_set1_epi16(0x0011);
__m128i u, v, w, x;
do {
// Read input into low 8 bytes of u and v
u = _mm_load_si128((__m128i const*)in);
v = _mm_unpackhi_epi8(u, u); // Expand each single byte to two bytes
u = _mm_unpacklo_epi8(u, u); // Do it again for u
w = _mm_srli_epi16(u, 4); // Copy the value into w and shift it right half a byte
x = _mm_srli_epi16(v, 4); // Do it again for v
u = _mm_blendv_epi8(u, w, mask0); // Select odd bytes from w and even bytes from u, giving the desired value in the upper nibble of each byte
v = _mm_blendv_epi8(v, x, mask0); // Do it again for v
u = _mm_and_si128(u, mask1); // Clear all the upper nibbles
v = _mm_and_si128(v, mask1); // Do it again for v
u = _mm_mullo_epi16(u, mul); // Multiply each 16-bit value by 0x0011 to duplicate the lower nibble in the upper nibble of each byte
v = _mm_mullo_epi16(v, mul); // Do it again for v
// Write output
_mm_store_si128((__m128i*)(out ), u);
_mm_store_si128((__m128i*)(out + 16), v);
in += 16;
out += 32;
} while (in != past);
}
void ExpandSSE4Unroll(unsigned char const *in, unsigned char const *past, unsigned char *out) {
__m128i const mask0 = _mm_set1_epi16((short)0x8000),
mask1 = _mm_set1_epi8(0x0F),
mul = _mm_set1_epi16(0x0011);
__m128i u0, v0, w0, x0,
u1, v1, w1, x1,
u2, v2, w2, x2,
u3, v3, w3, x3;
do {
// Read input into low 8 bytes of u and v
u0 = _mm_load_si128((__m128i const*)(in ));
u1 = _mm_load_si128((__m128i const*)(in + 16));
u2 = _mm_load_si128((__m128i const*)(in + 32));
u3 = _mm_load_si128((__m128i const*)(in + 48));
v0 = _mm_unpackhi_epi8(u0, u0); // Expand each single byte to two bytes
u0 = _mm_unpacklo_epi8(u0, u0); // Do it again for u0
v1 = _mm_unpackhi_epi8(u1, u1); // Again for v1
u1 = _mm_unpacklo_epi8(u1, u1); // Again for u1
v2 = _mm_unpackhi_epi8(u2, u2); // Again for v2
u2 = _mm_unpacklo_epi8(u2, u2); // Again for u2
v3 = _mm_unpackhi_epi8(u3, u3); // Again for v3
u3 = _mm_unpacklo_epi8(u3, u3); // Again for u3
w0 = _mm_srli_epi16(u0, 4); // Copy the value into w and shift it right half a byte
x0 = _mm_srli_epi16(v0, 4); // Do it again for v
w1 = _mm_srli_epi16(u1, 4); // Again for u1
x1 = _mm_srli_epi16(v1, 4); // Again for v1
w2 = _mm_srli_epi16(u2, 4); // Again for u2
x2 = _mm_srli_epi16(v2, 4); // Again for v2
w3 = _mm_srli_epi16(u3, 4); // Again for u3
x3 = _mm_srli_epi16(v3, 4); // Again for v3
u0 = _mm_blendv_epi8(u0, w0, mask0); // Select odd bytes from w0 and even bytes from u0, giving the desired value in the upper nibble of each byte
v0 = _mm_blendv_epi8(v0, x0, mask0); // Do it again for v
u1 = _mm_blendv_epi8(u1, w1, mask0); // Again for u1
v1 = _mm_blendv_epi8(v1, x1, mask0); // Again for v1
u2 = _mm_blendv_epi8(u2, w2, mask0); // Again for u2
v2 = _mm_blendv_epi8(v2, x2, mask0); // Again for v2
u3 = _mm_blendv_epi8(u3, w3, mask0); // Again for u3
v3 = _mm_blendv_epi8(v3, x3, mask0); // Again for v3
u0 = _mm_and_si128(u0, mask1); // Clear all the upper nibbles
v0 = _mm_and_si128(v0, mask1); // Do it again for v
u1 = _mm_and_si128(u1, mask1); // Again for u1
v1 = _mm_and_si128(v1, mask1); // Again for v1
u2 = _mm_and_si128(u2, mask1); // Again for u2
v2 = _mm_and_si128(v2, mask1); // Again for v2
u3 = _mm_and_si128(u3, mask1); // Again for u3
v3 = _mm_and_si128(v3, mask1); // Again for v3
u0 = _mm_mullo_epi16(u0, mul); // Multiply each 16-bit value by 0x0011 to duplicate the lower nibble in the upper nibble of each byte
v0 = _mm_mullo_epi16(v0, mul); // Do it again for v
u1 = _mm_mullo_epi16(u1, mul); // Again for u1
v1 = _mm_mullo_epi16(v1, mul); // Again for v1
u2 = _mm_mullo_epi16(u2, mul); // Again for u2
v2 = _mm_mullo_epi16(v2, mul); // Again for v2
u3 = _mm_mullo_epi16(u3, mul); // Again for u3
v3 = _mm_mullo_epi16(v3, mul); // Again for v3
// Write output
_mm_store_si128((__m128i*)(out ), u0);
_mm_store_si128((__m128i*)(out + 16), v0);
_mm_store_si128((__m128i*)(out + 32), u1);
_mm_store_si128((__m128i*)(out + 48), v1);
_mm_store_si128((__m128i*)(out + 64), u2);
_mm_store_si128((__m128i*)(out + 80), v2);
_mm_store_si128((__m128i*)(out + 96), u3);
_mm_store_si128((__m128i*)(out + 112), v3);
in += 64;
out += 128;
} while (in != past);
}
void ExpandSSE2(unsigned char const *in, unsigned char const *past, unsigned char *out) {
__m128i const mask = _mm_set1_epi16((short)0xF00F),
mul0 = _mm_set1_epi16(0x0011),
mul1 = _mm_set1_epi16(0x1000);
__m128i u, v;
do {
// Read input into low 8 bytes of u and v
u = _mm_load_si128((__m128i const*)in);
v = _mm_unpackhi_epi8(u, u); // Expand each single byte to two bytes
u = _mm_unpacklo_epi8(u, u); // Do it again for u
u = _mm_and_si128(u, mask);
v = _mm_and_si128(v, mask);
u = _mm_mullo_epi16(u, mul0);
v = _mm_mullo_epi16(v, mul0);
u = _mm_mulhi_epu16(u, mul1); // This can also be done with a right shift of 4 bits, but this seems to measure faster
v = _mm_mulhi_epu16(v, mul1);
u = _mm_mullo_epi16(u, mul0);
v = _mm_mullo_epi16(v, mul0);
// write output
_mm_store_si128((__m128i*)(out ), u);
_mm_store_si128((__m128i*)(out + 16), v);
in += 16;
out += 32;
} while (in != past);
}
void ExpandSSE2Unroll(unsigned char const *in, unsigned char const *past, unsigned char *out) {
__m128i const mask = _mm_set1_epi16((short)0xF00F),
mul0 = _mm_set1_epi16(0x0011),
mul1 = _mm_set1_epi16(0x1000);
__m128i u0, v0,
u1, v1;
do {
// Read input into low 8 bytes of u and v
u0 = _mm_load_si128((__m128i const*)(in ));
u1 = _mm_load_si128((__m128i const*)(in + 16));
v0 = _mm_unpackhi_epi8(u0, u0); // Expand each single byte to two bytes
u0 = _mm_unpacklo_epi8(u0, u0); // Do it again for u0
v1 = _mm_unpackhi_epi8(u1, u1); // Again for v1
u1 = _mm_unpacklo_epi8(u1, u1); // Again for u1
u0 = _mm_and_si128(u0, mask);
v0 = _mm_and_si128(v0, mask);
u1 = _mm_and_si128(u1, mask);
v1 = _mm_and_si128(v1, mask);
u0 = _mm_mullo_epi16(u0, mul0);
v0 = _mm_mullo_epi16(v0, mul0);
u1 = _mm_mullo_epi16(u1, mul0);
v1 = _mm_mullo_epi16(v1, mul0);
u0 = _mm_mulhi_epu16(u0, mul1);
v0 = _mm_mulhi_epu16(v0, mul1);
u1 = _mm_mulhi_epu16(u1, mul1);
v1 = _mm_mulhi_epu16(v1, mul1);
u0 = _mm_mullo_epi16(u0, mul0);
v0 = _mm_mullo_epi16(v0, mul0);
u1 = _mm_mullo_epi16(u1, mul0);
v1 = _mm_mullo_epi16(v1, mul0);
// write output
_mm_store_si128((__m128i*)(out ), u0);
_mm_store_si128((__m128i*)(out + 16), v0);
_mm_store_si128((__m128i*)(out + 32), u1);
_mm_store_si128((__m128i*)(out + 48), v1);
in += 32;
out += 64;
} while (in != past);
}
void ExpandAShellySSE4(unsigned char const *in, unsigned char const *past, unsigned char *out) {
__m128i const zero = _mm_setzero_si128(),
v0F0F = _mm_set1_epi32(0x0F0F),
vF0F0 = _mm_set1_epi32(0xF0F0),
v0101 = _mm_set1_epi32(0x0101),
v1010 = _mm_set1_epi32(0x1010),
v000F000F = _mm_set1_epi32(0x000F000F),
v0F000F00 = _mm_set1_epi32(0x0F000F00),
v0011 = _mm_set1_epi32(0x0011);
__m128i u, v, w, x;
do {
// Read in data
u = _mm_load_si128((__m128i const*)in);
v = _mm_unpackhi_epi16(u, zero);
u = _mm_unpacklo_epi16(u, zero);
// original source: ((((a & 0xF0F) * 0x101) & 0xF000F) + (((a & 0xF0F0) * 0x1010) & 0xF000F00)) * 0x11;
w = _mm_and_si128(u, v0F0F);
x = _mm_and_si128(v, v0F0F);
u = _mm_and_si128(u, vF0F0);
v = _mm_and_si128(v, vF0F0);
w = _mm_mullo_epi32(w, v0101); // _mm_mullo_epi32 is what makes this require SSE4 instead of SSE2
x = _mm_mullo_epi32(x, v0101);
u = _mm_mullo_epi32(u, v1010);
v = _mm_mullo_epi32(v, v1010);
w = _mm_and_si128(w, v000F000F);
x = _mm_and_si128(x, v000F000F);
u = _mm_and_si128(u, v0F000F00);
v = _mm_and_si128(v, v0F000F00);
u = _mm_add_epi32(u, w);
v = _mm_add_epi32(v, x);
u = _mm_mullo_epi32(u, v0011);
v = _mm_mullo_epi32(v, v0011);
// write output
_mm_store_si128((__m128i*)(out ), u);
_mm_store_si128((__m128i*)(out + 16), v);
in += 16;
out += 32;
} while (in != past);
}
int main() {
unsigned char *const indat = new unsigned char[DATA_SIZE_IN ],
*const outdat0 = new unsigned char[DATA_SIZE_OUT],
*const outdat1 = new unsigned char[DATA_SIZE_OUT],
* curout = outdat0,
* lastout = outdat1,
* place;
unsigned start,
stop;
place = indat + DATA_SIZE_IN - 1;
do {
*place = (unsigned char)rand();
} while (place-- != indat);
MakeLutLo();
MakeLutHi();
MakeLutLarge();
for (unsigned testcount = 0; testcount < 1000; ++testcount) {
// Solution posted by the asker
start = clock();
for (unsigned rerun = 0; rerun < RERUN_COUNT; ++rerun)
ExpandOrig(indat, indat + DATA_SIZE_IN, curout);
stop = clock();
std::cout << "ExpandOrig:\t\t\t" << (((stop - start) / 1000) / 60) << ':' << (((stop - start) / 1000) % 60) << ":." << ((stop - start) % 1000) << std::endl;
std::swap(curout, lastout);
// Dmitry's small lookup table solution
start = clock();
for (unsigned rerun = 0; rerun < RERUN_COUNT; ++rerun)
ExpandLookupSmall(indat, indat + DATA_SIZE_IN, curout);
stop = clock();
std::cout << "ExpandSmallLUT:\t\t\t" << (((stop - start) / 1000) / 60) << ':' << (((stop - start) / 1000) % 60) << ":." << ((stop - start) % 1000) << std::endl;
std::swap(curout, lastout);
if (memcmp(outdat0, outdat1, DATA_SIZE_OUT))
std::cout << "INCORRECT OUTPUT" << std::endl;
// Dmitry's small lookup table solution using only one lookup table
start = clock();
for (unsigned rerun = 0; rerun < RERUN_COUNT; ++rerun)
ExpandLookupSmallOneLUT(indat, indat + DATA_SIZE_IN, curout);
stop = clock();
std::cout << "ExpandLookupSmallOneLUT:\t" << (((stop - start) / 1000) / 60) << ':' << (((stop - start) / 1000) % 60) << ":." << ((stop - start) % 1000) << std::endl;
std::swap(curout, lastout);
if (memcmp(outdat0, outdat1, DATA_SIZE_OUT))
std::cout << "INCORRECT OUTPUT" << std::endl;
// Large lookup table solution
start = clock();
for (unsigned rerun = 0; rerun < RERUN_COUNT; ++rerun)
ExpandLookupLarge(indat, indat + DATA_SIZE_IN, curout);
stop = clock();
std::cout << "ExpandLookupLarge:\t\t" << (((stop - start) / 1000) / 60) << ':' << (((stop - start) / 1000) % 60) << ":." << ((stop - start) % 1000) << std::endl;
std::swap(curout, lastout);
if (memcmp(outdat0, outdat1, DATA_SIZE_OUT))
std::cout << "INCORRECT OUTPUT" << std::endl;
// AShelly's Interleave bits by Binary Magic Numbers solution
start = clock();
for (unsigned rerun = 0; rerun < RERUN_COUNT; ++rerun)
ExpandAShelly(indat, indat + DATA_SIZE_IN, curout);
stop = clock();
std::cout << "ExpandAShelly:\t\t\t" << (((stop - start) / 1000) / 60) << ':' << (((stop - start) / 1000) % 60) << ":." << ((stop - start) % 1000) << std::endl;
std::swap(curout, lastout);
if (memcmp(outdat0, outdat1, DATA_SIZE_OUT))
std::cout << "INCORRECT OUTPUT" << std::endl;
// AShelly's Interleave bits by Binary Magic Numbers solution optimizing out an addition
start = clock();
for (unsigned rerun = 0; rerun < RERUN_COUNT; ++rerun)
ExpandAShellyMulOp(indat, indat + DATA_SIZE_IN, curout);
stop = clock();
std::cout << "ExpandAShellyMulOp:\t\t" << (((stop - start) / 1000) / 60) << ':' << (((stop - start) / 1000) % 60) << ":." << ((stop - start) % 1000) << std::endl;
std::swap(curout, lastout);
if (memcmp(outdat0, outdat1, DATA_SIZE_OUT))
std::cout << "INCORRECT OUTPUT" << std::endl;
// My SSE4 solution
start = clock();
for (unsigned rerun = 0; rerun < RERUN_COUNT; ++rerun)
ExpandSSE4(indat, indat + DATA_SIZE_IN, curout);
stop = clock();
std::cout << "ExpandSSE4:\t\t\t" << (((stop - start) / 1000) / 60) << ':' << (((stop - start) / 1000) % 60) << ":." << ((stop - start) % 1000) << std::endl;
std::swap(curout, lastout);
if (memcmp(outdat0, outdat1, DATA_SIZE_OUT))
std::cout << "INCORRECT OUTPUT" << std::endl;
// My SSE4 solution unrolled
start = clock();
for (unsigned rerun = 0; rerun < RERUN_COUNT; ++rerun)
ExpandSSE4Unroll(indat, indat + DATA_SIZE_IN, curout);
stop = clock();
std::cout << "ExpandSSE4Unroll:\t\t" << (((stop - start) / 1000) / 60) << ':' << (((stop - start) / 1000) % 60) << ":." << ((stop - start) % 1000) << std::endl;
std::swap(curout, lastout);
if (memcmp(outdat0, outdat1, DATA_SIZE_OUT))
std::cout << "INCORRECT OUTPUT" << std::endl;
// My SSE2 solution
start = clock();
for (unsigned rerun = 0; rerun < RERUN_COUNT; ++rerun)
ExpandSSE2(indat, indat + DATA_SIZE_IN, curout);
stop = clock();
std::cout << "ExpandSSE2:\t\t\t" << (((stop - start) / 1000) / 60) << ':' << (((stop - start) / 1000) % 60) << ":." << ((stop - start) % 1000) << std::endl;
std::swap(curout, lastout);
if (memcmp(outdat0, outdat1, DATA_SIZE_OUT))
std::cout << "INCORRECT OUTPUT" << std::endl;
// My SSE2 solution unrolled
start = clock();
for (unsigned rerun = 0; rerun < RERUN_COUNT; ++rerun)
ExpandSSE2Unroll(indat, indat + DATA_SIZE_IN, curout);
stop = clock();
std::cout << "ExpandSSE2Unroll:\t\t" << (((stop - start) / 1000) / 60) << ':' << (((stop - start) / 1000) % 60) << ":." << ((stop - start) % 1000) << std::endl;
std::swap(curout, lastout);
if (memcmp(outdat0, outdat1, DATA_SIZE_OUT))
std::cout << "INCORRECT OUTPUT" << std::endl;
// AShelly's Interleave bits by Binary Magic Numbers solution implemented using SSE2
start = clock();
for (unsigned rerun = 0; rerun < RERUN_COUNT; ++rerun)
ExpandAShellySSE4(indat, indat + DATA_SIZE_IN, curout);
stop = clock();
std::cout << "ExpandAShellySSE4:\t\t" << (((stop - start) / 1000) / 60) << ':' << (((stop - start) / 1000) % 60) << ":." << ((stop - start) % 1000) << std::endl;
std::swap(curout, lastout);
if (memcmp(outdat0, outdat1, DATA_SIZE_OUT))
std::cout << "INCORRECT OUTPUT" << std::endl;
}
delete[] indat;
delete[] outdat0;
delete[] outdat1;
return 0;
}
NOTE:
I had an SSE4 implementation here initially. I found a way to implement this using SSE2, which is better because it will run on more platforms. The SSE2 implementation is also faster. So, the solution presented at the top is now the SSE2 implementation and not the SSE4 one. The SSE4 implementation can still be seen in the performance tests or in the edit history.
I'm not sure what the most efficient way would be, but this is a little shorter:
#include <stdio.h>
int main()
{
unsigned x = 0x1234;
x = (x << 8) | x;
x = ((x & 0x00f000f0) << 4) | (x & 0x000f000f);
x = (x << 4) | x;
printf("0x1234 -> 0x%08x\n",x);
return 0;
}
If you need to do this repeatedly and very quickly, as suggested in your edit, you could consider generating a lookup table and using that instead. The following function dynamically allocates and initializes such a table:
unsigned *makeLookupTable(void)
{
unsigned *tbl = malloc(sizeof(unsigned) * 65536);
if (!tbl) return NULL;
int i;
for (i = 0; i < 65536; i++) {
unsigned x = i;
x |= (x << 8);
x = ((x & 0x00f000f0) << 4) | (x & 0x000f000f);
x |= (x << 4);
/* Uncomment next line to invert the high byte as mentioned in the edit. */
/* x = x ^ 0xff000000; */
tbl[i] = x;
}
return tbl;
}
After that each conversion is just something like:
result = lookuptable[input];
..or maybe:
result = lookuptable[input & 0xffff];
Or a smaller, more cache-friendly lookup table (or pair) could be used, with one lookup each for the high and low bytes (as noted by @LưuVĩnhPhúc in the comments). In that case, the table-generation code might be:
unsigned *makeLookupTableLow(void)
{
unsigned *tbl = malloc(sizeof(unsigned) * 256);
if (!tbl) return NULL;
int i;
for (i = 0; i < 256; i++) {
unsigned x = i;
x = ((x & 0xf0) << 4) | (x & 0x0f);
x |= (x << 4);
tbl[i] = x;
}
return tbl;
}
...and an optional second table:
unsigned *makeLookupTableHigh(void)
{
unsigned *tbl = malloc(sizeof(unsigned) * 256);
if (!tbl) return NULL;
int i;
for (i = 0; i < 256; i++) {
unsigned x = i;
x = ((x & 0xf0) << 20) | ((x & 0x0f) << 16);
x |= (x << 4);
/* uncomment next line to invert high byte */
/* x = x ^ 0xff000000; */
tbl[i] = x;
}
return tbl;
}
...and to convert a value with two tables:
result = hightable[input >> 8] | lowtable[input & 0xff];
...or with one (just the low table above):
result = (lowtable[input >> 8] << 16) | lowtable[input & 0xff];
result ^= 0xff000000; /* to invert high byte */
If the upper part of the value (alpha?) doesn't change much, even the single large table might perform well since consecutive lookups would be closer together in the table.
I took the performance test code @Apriori posted, made some adjustments, and added tests for the other answers he hadn't included originally... then compiled three versions of it with different settings. One is 64-bit code with SSE4.1 enabled, where the compiler can use SSE for its own optimizations... and then two 32-bit versions, one with SSE and one without. Although all three were run on the same fairly recent processor, the results show how the optimal solution can change depending on the processor features:
64b SSE4.1 32b SSE4.1 32b no SSE
-------------------------- ---------- ---------- ----------
ExpandOrig time: 3.502 s 3.501 s 6.260 s
ExpandLookupSmall time: 3.530 s 3.997 s 3.996 s
ExpandLookupLarge time: 3.434 s 3.419 s 3.427 s
ExpandIsalamon time: 3.654 s 3.673 s 8.870 s
ExpandIsalamonOpt time: 3.784 s 3.720 s 8.719 s
ExpandChronoKitsune time: 3.658 s 3.463 s 6.546 s
ExpandEvgenyKluev time: 6.790 s 7.697 s 13.383 s
ExpandIammilind time: 3.485 s 3.498 s 6.436 s
ExpandDmitri time: 3.457 s 3.477 s 5.461 s
ExpandNitish712 time: 3.574 s 3.800 s 6.789 s
ExpandAdamLiss time: 3.673 s 5.680 s 6.969 s
ExpandAShelly time: 3.524 s 4.295 s 5.867 s
ExpandAShellyMulOp time: 3.527 s 4.295 s 5.852 s
ExpandSSE4 time: 3.428 s
ExpandSSE4Unroll time: 3.333 s
ExpandSSE2 time: 3.392 s
ExpandSSE2Unroll time: 3.318 s
ExpandAShellySSE4 time: 3.392 s
The executables were compiled on 64-bit Linux with gcc 4.8.1, using -m64 -O3 -march=core2 -msse4.1, -m32 -O3 -march=core2 -msse4.1 and -m32 -O3 -march=core2 -mno-sse respectively. @Apriori's SSE tests were omitted for the 32-bit builds (they crashed on 32-bit with SSE enabled, and obviously won't work with SSE disabled).
Among the adjustments made was to use actual image data instead of random values (photos of objects with transparent backgrounds), which greatly improved the performance of the large lookup table but made little difference for the others.
Essentially, the lookup tables win by a landslide when SSE is unavailable (or unused)... and the manually coded SSE solutions win otherwise. However, it's also noteworthy that when the compiler could use SSE for its own optimizations, most of the bit-manipulation solutions were almost as fast as the hand-written SSE -- still slower, but only marginally.
Here's another attempt, using eight operations:
b = (((c & 0x0F0F) * 0x0101) & 0x00F000F) +
(((c & 0xF0F0) * 0x1010) & 0xF000F00);
b += b * 0x10;
printf("%x\n",b); //Shows '0x11223344'
*Note: this post originally contained quite different code, based on Interleave bits by Binary Magic Numbers from Sean Anderson's bithacks page. But that wasn't quite what the OP was asking for, so it has been removed. The majority of the comments below refer to that missing version.
I wanted to add this link to the answer pool because I think it is extremely important, when talking about optimization, to remember the hardware we are running on and the technologies compiling our code for that platform.
The blog post Playing with the CPU pipeline looks at optimizing code for the CPU pipeline. It shows an example in which the author tries to simplify the math down to the fewest actual mathematical operations, yet the result is FAR from the most time-optimal solution. I have seen a couple of answers here speaking to that effect, and they may or may not be correct. The only way to know is to measure the time from start to finish of your particular snippet of code against that of others. Read this blog; it is EXTREMELY interesting.
I should mention that in this particular case I am not going to put ANY code up here unless I have truly made multiple attempts and actually produced one that is noticeably faster.
I think the lookup-table approach suggested by Dmitri is a good choice, but I suggest going one step further and generating the table at compile time; doing the work at compile time will obviously reduce the execution time.
First, we create a compile-time value, using any of the suggested methods:
constexpr unsigned int transform1(unsigned int x)
{
return ((x << 8) | x);
}
constexpr unsigned int transform2(unsigned int x)
{
return (((x & 0x00f000f0) << 4) | (x & 0x000f000f));
}
constexpr unsigned int transform3(unsigned int x)
{
return ((x << 4) | x);
}
constexpr unsigned int transform(unsigned int x)
{
return transform3(transform2(transform1(x)));
}
// Dimitri version, using constexprs
template <unsigned int argb> struct aarrggbb_dimitri
{
static const unsigned int value = transform(argb);
};
// Adam Liss version
template <unsigned int argb> struct aarrggbb_adamLiss
{
static const unsigned int value =
(argb & 0xf000) * 0x11000 +
(argb & 0x0f00) * 0x01100 +
(argb & 0x00f0) * 0x00110 +
(argb & 0x000f) * 0x00011;
};
Then we create the compile-time lookup table with whatever method is available. I would like to use the C++14 integer sequence, but I don't know which compiler the OP will be using, so another possible approach is a pretty ugly macro:
#define EXPAND16(x) aarrggbb<x + 0>::value, \
aarrggbb<x + 1>::value, \
aarrggbb<x + 2>::value, \
aarrggbb<x + 3>::value, \
aarrggbb<x + 4>::value, \
aarrggbb<x + 5>::value, \
aarrggbb<x + 6>::value, \
... and so on
#define EXPAND EXPAND16(0), \
EXPAND16(0x10), \
EXPAND16(0x20), \
EXPAND16(0x30), \
EXPAND16(0x40), \
... and so on
... and so on
See demo here.
PS: The Adam Liss approach could be used without C++11.
If multiplication is cheap and 64-bit arithmetic is available, you could use this code:
uint64_t x = 0x1234;
x *= 0x0001000100010001ull; // replicate the 16-bit value into all four lanes: 0x1234123412341234
x &= 0xF0000F0000F0000Full; // keep one nibble per lane: 0x1000020000300004
x *= 0x0000001001001001ull; // shift-and-add the nibbles toward the high bits
x &= 0xF0F0F0F000000000ull; // isolate them in alternating positions: 0x1020304000000000
x = (x >> 36) * 0x11;       // move down and duplicate each nibble: 0x11223344
std::cout << std::hex << x << '\n';
In fact, it uses the same idea as the original attempt by AShelly.
This works and may be easier to understand, but bit manipulations are so cheap that I wouldn't worry much about efficiency.
#include <stdio.h>
#include <stdlib.h>
int main() {
unsigned int c = 0x1234, b;
b = (c & 0xf000) * 0x11000 + (c & 0x0f00) * 0x01100 +
(c & 0x00f0) * 0x00110 + (c & 0x000f) * 0x00011;
printf("%x -> %x\n", c, b);
}
Assuming that you always want to convert 0xWXYZ to 0xWWXXYYZZ, I believe the solution below would be a little faster than the one you suggested:
unsigned int c = 0x1234;
unsigned int b = (c & 0xf) | ((c & 0xf0) << 4) |
((c & 0xf00) << 8) | ((c & 0xf000) << 12);
b |= (b << 4);
Notice that one AND operation is saved compared to your solution. :-)
Demo.
Another way is:
DWORD OrVal(DWORD & nible_pos, DWORD input_val, DWORD temp_val, int shift)
{
if (nible_pos==0)
nible_pos = 0x0000000F;
else
nible_pos = nible_pos << 4;
DWORD nible = input_val & nible_pos;
temp_val |= (nible << shift);
temp_val |= (nible << (shift + 4));
return temp_val;
}
DWORD Converter2(DWORD input_val)
{
DWORD nible_pos = 0x00000000;
DWORD temp_val = 0x00000000;
temp_val = OrVal(nible_pos, input_val, temp_val, 0);
temp_val = OrVal(nible_pos, input_val, temp_val, 4);
temp_val = OrVal(nible_pos, input_val, temp_val, 8);
temp_val = OrVal(nible_pos, input_val, temp_val, 12);
return temp_val;
}
DWORD val2 = Converter2(0x1234);
An optimized version (3 times faster):
DWORD Converter3(DWORD input_val)
{
DWORD nible_pos = 0;
DWORD temp_val = 0;
int shift = 0;
for ( ; shift < 16; shift+=4 )
{
if (nible_pos==0)
nible_pos = 0x0000000F;
else
nible_pos = nible_pos << 4;
DWORD nible = input_val & nible_pos;
temp_val |= (nible << shift);
temp_val |= (nible << (shift + 4));
}
return temp_val;
}
Perhaps this could be simpler and more efficient:
unsigned int g = 0x1234;
unsigned int ans = 0;
ans = ( ( g & 0xf000 ) << 16) + ( (g & 0xf00 ) << 12)
+ ( ( g&0xf0 ) << 8) + ( ( g&0xf ) << 4);
ans = ( ans | ans>>4 );
printf("%x -> %x\n", g, ans);
unsigned long transform(unsigned long n)
{
/* n: 00AR
* 00GB
*/
n = ((n & 0xff00) << 8) | (n & 0x00ff);
/* n: 0AR0
* 0GB0
*/
n <<= 4;
/* n: AAR0
* GGB0
*/
n |= (n & 0x0f000f00L) << 4;
/* n: AARR
* GGBB
*/
n |= (n & 0x00f000f0L) >> 4;
return n;
}
The alpha and red components are shifted into the higher 2 bytes where they belong, and the result is then shifted left by 4 bits, resulting in every component being exactly where it needs to be.
With a form of 0AR0 0GB0, a bit mask and left-shift combination is OR'ed with the current value. This copies the A and G components to the position just left of them. The same thing is done for the R and B components, except in the opposite direction.
If you are going to do this for OpenGL, I suggest you use a glTexImage*D function with its type parameter set to GL_UNSIGNED_SHORT_4_4_4_4. Your OpenGL driver should do the rest. And as for the transparency inversion, you can always manipulate blending via the glBlendFunc and glBlendEquation functions.
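For example, the upload might look like this (a fragment, not a complete program; it assumes a valid GL context, a bound 2D texture, and `width`, `height`, `pixels` variables describing your 4444 data):

```cpp
// Upload the packed 4444 data directly; the driver performs the expansion.
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, width, height, 0,
             GL_RGBA, GL_UNSIGNED_SHORT_4_4_4_4, pixels);
```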
While others operate on hard-core optimization...
Take this as your best bet:
#include <string>
std::string toAARRGGBB(const std::string &argb)
{
std::string ret("0x");
int start = 2; //"0x####";
// ^^ skipped
for (std::string::size_type i = start; i < argb.length(); ++i)
{
ret += argb[i];
ret += argb[i];
}
return ret;
}
int main()
{
std::string argb = toAARRGGBB("0xACED"); //!!!
}
Haha