Issues setting a hexadecimal value to a char variable for padding

Issues setting a hexadecimal value to a char variable for padding - c++

I'm having issues trying to set a char variable to a hexadecimal value.
I'm trying to implement a print function where I print the alignment of a structure. The "aa" is suppose to represent the padding, but I can't seem to set the variable in the default constructor.
Output that it should be
0x00: 00 00 00 00
0x04: 00 aa aa aa
.h file
struct A
{
int a0;
char a1;
char pad0;
char pad1;
char pad2;
A() {
a0 = 0;
a1 = 0;
pad1 = 0xAA;
pad2 = 0xAA;
}
};
.cpp
Alignment::Alignment:Print(void *data)
{
int d {0 };
for (int i = 0; i <2; ++i) {
printf("\n0x%02x:", (d));
d = d + 4;
for (int j = 0; j < sizeof(data); ++j)
{
printf(" %.2x", (*((unsigned char*)data+j )));
}
}
}
Main
A *pa = new A;
Pa is passed into the function
My Output
0x00: 00 00 00 00
0x04: 00 00 00 00

There are a few issues with your Print function. Some are "just" style, but they make fixing the real issue harder. (Also, you jumped to conclusions, possibly blinding you to the real issue.) First off, try to explain why the line
printf(" %.2x", (*((unsigned char*)data+j )));
is supposed to do something different just because i and d changed. Neither of those variables are in this line, so you get the same output with each iteration of the outer loop. The problem is not in the setting of data, but in the reading of it. For each line of output, you print the first four bytes of data when you intended to print the next four bytes.
To get the next four bytes, you need to add d to the pointer, but this is problematic because you increased d earlier in this iteration. Better would be to have d keep the same value throughout each iteration. You could increase it at the end of the iteration instead of the middle, but even slicker might be to use d to control the loop instead of i.
Finally, a bit of robustness: your code uses the magic number 4 when increasing d, which only works if sizeof(data) is 4. Fragile. Better would be to use a symbolic constant for this magic value to ensure consistency. I'll set it explicitly to 4 since I don't see why the size of a pointer should impact how the pointed-to data is displayed.
Alignment::Alignment::Print(void *data)
{
constexpr int BytesPerLine = 4; // could use sizeof(void *) instead of 4
const unsigned char * bytes = static_cast<unsigned char*>(data); // Taking the conversion out of the loop.
for (int d = 0; d < 2*BytesPerLine; d += BytesPerLine) {
printf("\n0x%02x:", (d));
for (int j = 0; j < BytesPerLine; ++j)
{
printf(" %.2x", *(bytes + d + j)); // Add d here!
}
}
}

Related

Most insanely fastest way to convert 9 char digits into an int or unsigned int

#include <stdio.h>
#include <iostream>
#include <string>
#include <chrono>
#include <memory>
#include <cstdlib>
#include <cstdint>
#include <cstring>
#include <immintrin.h>
using namespace std;
const int p[9] = {1, 10, 100,
1000, 10000, 100000,
1000000, 10000000, 100000000};
class MyTimer {
private:
std::chrono::time_point<std::chrono::steady_clock> starter;
public:
void startCounter() {
starter = std::chrono::steady_clock::now();
}
int64_t getCounterNs() {
return std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::steady_clock::now() - starter).count();
}
};
int convert1(const char *a) {
int res = 0;
for (int i=0; i<9; i++) res = res * 10 + a[i] - 48;
return res;
}
int convert2(const char *a) {
return (a[0] - 48) * p[8] + (a[1] - 48) * p[7] + (a[2] - 48) * p[6]
+ (a[3] - 48) * p[5] + (a[4] - 48) * p[4] + (a[5] - 48) * p[3]
+ (a[6] - 48) * p[2] + (a[7] - 48) * p[1] + (a[8] - 48) * p[0];
}
int convert3(const char *a) {
return (a[0] - 48) * p[8] + a[1] * p[7] + a[2] * p[6] + a[3] * p[5]
+ a[4] * p[4] + a[5] * p[3] + a[6] * p[2] + a[7] * p[1] + a[8]
- 533333328;
}
const unsigned pu[9] = {1, 10, 100, 1000, 10000, 100000, 1000000, 10000000,
100000000};
int convert4u(const char *aa) {
const unsigned char *a = (const unsigned char*) aa;
return a[0] * pu[8] + a[1] * pu[7] + a[2] * pu[6] + a[3] * pu[5] + a[4] * pu[4]
+ a[5] * pu[3] + a[6] * pu[2] + a[7] * pu[1] + a[8] - (unsigned) 5333333328u;
}
int convert5(const char* a) {
int val = 0;
for(size_t k =0;k <9;++k) {
val = (val << 3) + (val << 1) + (a[k]-'0');
}
return val;
}
const unsigned pu2[9] = {100000000, 10000000, 1000000, 100000, 10000, 1000, 100, 10, 1};
int convert6u(const char *a) {
return a[0]*pu2[0] + a[1]*pu2[1] + a[2]*pu2[2] + a[3] * pu2[3] + a[4] * pu2[4] + a[5] * pu2[5] + a[6] * pu2[6] + a[7] * pu2[7] + a[8] - (unsigned) 5333333328u;
}
constexpr std::uint64_t zeros(char z) {
std::uint64_t result = 0;
for (int i = 0; i < sizeof(result); ++i) {
result = result*256 + z;
}
return result;
}
int convertX(const char *a) {
constexpr std::uint64_t offset = zeros('0');
constexpr std::uint64_t o1 = 0xFF00FF00FF00FF00;
constexpr std::uint64_t o2 = 0xFFFF0000FFFF0000;
constexpr std::uint64_t o3 = 0xFFFFFFFF00000000;
std::uint64_t buffer;
std::memcpy(&buffer, a, sizeof(buffer));
const auto bytes = buffer - offset;
const auto b1 = (bytes & o1) >> 8;
const auto words = (bytes & ~o1) + 10*b1;
const auto w1 = (words & o2) >> 16;
const auto dwords = (words & ~o2) + 100*w1;
const auto d1 = (dwords & o3) >> 32;
const auto qwords = (dwords & ~o3) + 1000*d1;
const auto final = 10*static_cast<unsigned>(qwords) + (a[9] - '0');
return static_cast<int>(final);
}
//######################## ACCEPTED ANSWER
//########################
//########################
typedef struct { // for output into memory
alignas(16) unsigned hours;
unsigned minutes, seconds, nanos;
} hmsn;
void str2hmsn(hmsn *out, const char str[15]) // HHMMSSXXXXXXXXX 15 total, with 9-digit nanoseconds.
{ // 15 not including the terminating 0 (if any) which we don't read
//hmsn retval;
__m128i digs = _mm_loadu_si128((const __m128i*)str);
digs = _mm_sub_epi8( digs, _mm_set1_epi8('0') );
__m128i hms_x_words = _mm_maddubs_epi16( digs, _mm_set1_epi16( 10U + (1U<<8) )); // SSSE3 pairs of digits => 10s, 1s places.
__m128i hms_unpacked = _mm_cvtepu16_epi32(hms_x_words); // SSE4.1 hours, minutes, seconds unpack from uint16_t to uint32
//_mm_storeu_si128((__m128i*)&retval, hms_unpacked); // store first 3 struct members; last to be written separately
_mm_storeu_si128((__m128i*)out, hms_unpacked);
// or scalar extract with _mm_cvtsi128_si64 (movq) and shift / movzx
__m128i xwords = _mm_bsrli_si128(hms_x_words, 6); // would like to schedule this sooner, so oldest-uop-first starts this critical path shuffle ahead of pmovzx
// 8 bytes of data, lined up in low 2 dwords, rather than split across high 3
// could have got here with an 8-byte load that starts here, if we didn't want to get the H,M,S integers cheaply.
__m128i xdwords = _mm_madd_epi16(xwords, _mm_setr_epi16(100, 1, 100, 1, 0,0,0,0)); // low/high uint32 chunks, discard the 9th x digit.
uint64_t pair32 = _mm_cvtsi128_si64(xdwords);
uint32_t msd = 100*100 * (uint32_t)pair32; // most significant dword was at lower address (in printing order), so low half on little-endian x86. encourage compilers to use 32-bit operand-size for imul
uint32_t first8_x = msd + (uint32_t)(pair32 >> 32);
uint32_t nanos = first8_x * 10 + ((unsigned char)str[14] - '0'); // total*10 + lowest digit
out->nanos = nanos;
//retval.nanos = nanos;
//return retval;
// returning the struct by value encourages compilers in the wrong direction
// into not doing separate stores, even when inlining into a function that assigns the whole struct to a pointed-to output
}
hmsn mystruct;
int convertSIMD(const char* a)
{
str2hmsn(&mystruct, a);
return mystruct.nanos;
}
//########################
//########################
using ConvertFunc = int(const char*);
volatile int result = 0; // do something with the result of function to prevent unexpected optimization
void benchmark(ConvertFunc converter, string name, int numTest=1000) {
MyTimer timer;
const int N = 100000;
char *a = new char[9*N + 17];
int64_t runtime = 0;
for (int t=1; t<=numTest; t++) {
// change something to prevent unexpected optimization
for (int i=0; i<9*N; i++) a[i] = rand() % 10 + '0';
timer.startCounter();
for (int i=0; i<9*N; i+= 9) result = converter(a+i);
runtime += timer.getCounterNs();
}
cout << name << ": " << (runtime / (double(numTest) * N)) << "ns average\n";
delete[] a;
}
int main() {
benchmark(convert1, "slow");
benchmark(convert2, "normal");
benchmark(convert3, "fast");
benchmark(convert4u, "unsigned");
benchmark(convert5, "shifting");
benchmark(convert6u, "reverse");
benchmark(convertX, "swar64");
benchmark(convertSIMD, "manualSIMD");
return 0;
}
I want to find the fastest way turn char a[9] into an int. The full problem is convert char a[15] with form HHMMSSxxxxxxxxx timestamp to nanosecond, where ~50 bytes after the x are allocated and can be safely read (but not write). We only care about the last 9 digits in this question.
Version 1 is basic, version 2,3 try to save some computation. I compile with -O3 flag, and storing power of 10s in array is fine because it is optimized away (checked using Godbolt).
How can I make this faster? Yes I know this sounds like premature optimization, but let's assume I need that final 2-3% boost.
**Big edit:** I've replaced the code to reduce the effect of std::chrono on the measured time. The results is very different: 2700ms, 810ms, 670ms. On my laptop with i7 8750H, gcc 9.3.0 with -O3 flag, the result is: 355, 387, 320ms.
Version 3 is decidedly faster, while version 2 is slower due to code size. But can we do better than version 3? Invalid benchmark
Edit 2: the function can return unsigned int instead of int (i.e
unsigned convert1(char *a);
Edit 3: I noticed that the new code is an invalid benchmark, since convert(a) is only executed once. Using the original code, the difference is only ~1%.
Edit 4: New benchmark. using unsigned (convert4u, convert6u) is consistently 3-5% faster than using int. I will run a long (10+ min) benchmark to see if there's a winner. I've edited the code to use a new benchmark. It generates a large amount of data, then run the converter functions.
Edit 5: results: 4.19, 4.51, 3.82, 3.59, 7.64, 3.72 seconds. The unsigned version is fastest. Is it possible to use SIMD on just 9 bytes? If not, then I guess this is the best solution. I still hope there's a crazier solution, though
Edit 6: benchmark result on AMD Ryzen 4350G, gcc version 10.3, compile command gcc -o main main.cpp -std=c++17 -O3 -mavx -mavx2 -march=native
slow: 4.17794ns average
normal: 2.59945ns average
fast: 2.27917ns average
unsigned: 2.43814ns average
shifting: 4.72233ns average
reverse: 2.2274ns average
swar64: 2.17179ns average
manualSIMD: 1.55203ns average
The accepted answer does even more than the question require and compute HH/MM/SS/nanosec, so it's even faster than this benchmark shows.

Yes, SIMD is possible, as mentioned in comments. You can take advantage of it to parse the HH, MM, and SS parts of the string at the same time.
Since you have a 100% fixed format with leading 0s where necessary, this is easier than How to implement atoi using SIMD? - Place-values are fixed and we don't need any compare / bit-scan or pcmpistri to look up a shuffle control mask or scale-factor. Also SIMD string to unsigned int parsing in C# performance improvement has some good ideas, like tweaking the place-value multipliers to avoid a step at the end (TODO, do that here.)
9 decimal digits breaks down into two dwords and one leftover byte that's probably best to grab separately.
Assuming you care about throughput (ability to overlap this with surrounding code, or do this in a loop on independent elements) moreso than critical path latency in cycles from input pointer and data in memory being ready to nanoseconds integer being ready, SSSE3 SIMD should be very good on modern x86. (With SSE4.1 being useful if you want to unpack your hours, minutes, seconds into contiguous uint32_t elements e.g. in a struct). It might be competitive on latency, too, vs. scalar.
Fun fact: clang auto-vectorizes your convert2 / convert3 functions, widening to 8x dword in a YMM register for vpmulld (2 uops), then a chain of shuffle/add.
The strategy is to use pmaddubsw and pmaddwd to multiply-and-add pairs horizontally, in a way that gets each digit multiplied by its place value. e.g. 10 and 1 pairs, then 100 and 1 for pairs of integer that come from double-digits. Then extract to scalar for the last pair: multiply the most-significant part by 100 * 100, and add to the least-significant part. I'm pretty sure overflow is impossible at any step for inputs that are actually '0'..'9'; This runs and compiles to the asm I expected, but I didn't verify the numeric results.
#include <immintrin.h>
typedef struct { // for output into memory
alignas(16) unsigned hours;
unsigned minutes, seconds, nanos;
} hmsn;
void str2hmsn(hmsn *out, const char str[15]) // HHMMSSXXXXXXXXX 15 total, with 9-digit nanoseconds.
{ // 15 not including the terminating 0 (if any) which we don't read
//hmsn retval;
__m128i digs = _mm_loadu_si128((const __m128i*)str);
digs = _mm_sub_epi8( digs, _mm_set1_epi8('0') );
__m128i hms_x_words = _mm_maddubs_epi16( digs, _mm_set1_epi16( 10U + (1U<<8) )); // SSSE3 pairs of digits => 10s, 1s places.
__m128i hms_unpacked = _mm_cvtepu16_epi32(hms_x_words); // SSE4.1 hours, minutes, seconds unpack from uint16_t to uint32
//_mm_storeu_si128((__m128i*)&retval, hms_unpacked); // store first 3 struct members; last to be written separately
_mm_storeu_si128((__m128i*)out, hms_unpacked);
// or scalar extract with _mm_cvtsi128_si64 (movq) and shift / movzx
__m128i xwords = _mm_bsrli_si128(hms_x_words, 6); // would like to schedule this sooner, so oldest-uop-first starts this critical path shuffle ahead of pmovzx
// 8 bytes of data, lined up in low 2 dwords, rather than split across high 3
// could have got here with an 8-byte load that starts here, if we didn't want to get the H,M,S integers cheaply.
__m128i xdwords = _mm_madd_epi16(xwords, _mm_setr_epi16(100, 1, 100, 1, 0,0,0,0)); // low/high uint32 chunks, discard the 9th x digit.
uint64_t pair32 = _mm_cvtsi128_si64(xdwords);
uint32_t msd = 100*100 * (uint32_t)pair32; // most significant dword was at lower address (in printing order), so low half on little-endian x86. encourage compilers to use 32-bit operand-size for imul
uint32_t first8_x = msd + (uint32_t)(pair32 >> 32);
uint32_t nanos = first8_x * 10 + ((unsigned char)str[14] - '0'); // total*10 + lowest digit
out->nanos = nanos;
//retval.nanos = nanos;
//return retval;
// returning the struct by value encourages compilers in the wrong direction
// into not doing separate stores, even when inlining into a function that assigns the whole struct to a pointed-to output
}
On Godbolt with a test loop that uses asm("" ::"m"(sink): "memory" ) to make the compiler redo the work in a loop. Or a std::atomic_thread_fence(acq_rel) hack that gets MSVC to not optimize away the loop either. On my i7-6700k with GCC 11.1, x86-64 GNU/Linux, energy_performance_preference = performance, I got this to run at one iteration per 5 cycles.
IDK why it doesn't run at one per 4c; I tweaked GCC options to avoid the JCC erratum slowdown without padding, and to have the loop in hopefully 4 uop cache lines. (6 uops, 1 uop ended by a 32B boundary, 6 uops, 2 uops ended by the dec/jnz). Perf counters say the front-end was "ok", and uops_dispatched_port shows all 4 ALU ports at less than 4 uops per iteration, highest being port0 at 3.34.
Manually padding the early instructions gets it down to 3 total lines, of 3, 6, 6 uops but still no improvement from 5c per iter, so I guess the front-end really is ok.
LLVM-MCA seems very ambitious in projecting 3c per iter, apparently based on a wrong model of Skylake with a "dispatch" (front-end rename I think) width of 6. Even with -mcpu=haswell with a proper 4-wide model it projects 4.5c. (I used asm("# LLVM-MCA-BEGIN") etc. macros on Godbolt and included an LLVM-MCA output window for the test loop.) It doesn't have fully accurate uop->port mapping, apparently not knowing about slow-LEA running only on port 1, but IDK if that's significant.
Throughput may be limited by the ability to find instruction-level parallelism and overlap across several iterations, as in Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths
The test loop is:
#include <stdlib.h>
#ifndef __cplusplus
#include <stdalign.h>
#endif
#include <stdint.h>
#if 1 && defined(__GNUC__)
#define LLVM_MCA_BEGIN asm("# LLVM-MCA-BEGIN")
#define LLVM_MCA_END asm("# LLVM-MCA-END")
#else
#define LLVM_MCA_BEGIN
#define LLVM_MCA_END
#endif
#if defined(__cplusplus)
#include <atomic>
using std::atomic_thread_fence, std::memory_order_acq_rel;
#else
#include <stdatomic.h>
#endif
unsigned testloop(const char str[15]){
hmsn sink;
for (int i=0 ; i<1000000000 ; i++){
LLVM_MCA_BEGIN;
str2hmsn(&sink, str);
// compiler memory barrier
// force materializing the result, and forget about the input string being the same
#ifdef __GNUC__
asm volatile("" ::"m"(sink): "memory");
#else
//#warning happens to be enough with current MSVC
atomic_thread_fence(memory_order_acq_rel); // strongest barrier that doesn't require any asm instructions on x86; MSVC defeats signal_fence.
#endif
}
LLVM_MCA_END;
volatile unsigned dummy = sink.hours + sink.nanos; // make sure both halves are really used, else MSVC optimizes.
return dummy;
}
int main(int argc, char *argv[])
{
// performance isn't data-dependent, so just use a handy string.
// alignas(16) static char str[] = "235959123456789";
uintptr_t p = (uintptr_t)argv[0];
p &= -16;
return testloop((char*)p); // argv[0] apparently has a cache-line split within 16 bytes on my system, worsening from 5c throughput to 6.12c
}
I compiled as follows, to squeeze the loop in so it ends before the 32-byte boundary it's almost hitting. Note that -march=haswell allows it to use AVX encodings, saving an instruction or two.
$ g++ -fno-omit-frame-pointer -fno-stack-protector -falign-loops=16 -O3 -march=haswell foo.c -masm=intel
$ objdump -drwC -Mintel a.out | less
...
0000000000001190 <testloop(char const*)>:
1190: 55 push rbp
1191: b9 00 ca 9a 3b mov ecx,0x3b9aca00
1196: 48 89 e5 mov rbp,rsp
1199: c5 f9 6f 25 6f 0e 00 00 vmovdqa xmm4,XMMWORD PTR [rip+0xe6f] # 2010 <_IO_stdin_used+0x10>
11a1: c5 f9 6f 15 77 0e 00 00 vmovdqa xmm2,XMMWORD PTR [rip+0xe77] # 2020 <_IO_stdin_used+0x20> # vector constants hoisted
11a9: c5 f9 6f 0d 7f 0e 00 00 vmovdqa xmm1,XMMWORD PTR [rip+0xe7f] # 2030 <_IO_stdin_used+0x30>
11b1: 66 66 2e 0f 1f 84 00 00 00 00 00 data16 cs nop WORD PTR [rax+rax*1+0x0]
11bc: 0f 1f 40 00 nop DWORD PTR [rax+0x0]
### Top of loop is 16-byte aligned here, instead of ending up with 8 byte default
11c0: c5 d9 fc 07 vpaddb xmm0,xmm4,XMMWORD PTR [rdi]
11c4: c4 e2 79 04 c2 vpmaddubsw xmm0,xmm0,xmm2
11c9: c4 e2 79 33 d8 vpmovzxwd xmm3,xmm0
11ce: c5 f9 73 d8 06 vpsrldq xmm0,xmm0,0x6
11d3: c5 f9 f5 c1 vpmaddwd xmm0,xmm0,xmm1
11d7: c5 f9 7f 5d f0 vmovdqa XMMWORD PTR [rbp-0x10],xmm3
11dc: c4 e1 f9 7e c0 vmovq rax,xmm0
11e1: 69 d0 10 27 00 00 imul edx,eax,0x2710
11e7: 48 c1 e8 20 shr rax,0x20
11eb: 01 d0 add eax,edx
11ed: 8d 14 80 lea edx,[rax+rax*4]
11f0: 0f b6 47 0e movzx eax,BYTE PTR [rdi+0xe]
11f4: 8d 44 50 d0 lea eax,[rax+rdx*2-0x30]
11f8: 89 45 fc mov DWORD PTR [rbp-0x4],eax
11fb: ff c9 dec ecx
11fd: 75 c1 jne 11c0 <testloop(char const*)+0x30>
# loop ends 1 byte before it would be a problem for the JCC erratum workaround
11ff: 8b 45 fc mov eax,DWORD PTR [rbp-0x4]
So GCC made the asm I had planned by hand before writing the intrinsics this way, using as few instructions as possible to optimize for throughput. (Clang favours latency in this loop, using a separate add instead of a 3-component LEA).
This is faster than any of the scalar versions that just parse X, and it's parsing HH, MM, and SS as well. Although clang auto-vectorization of convert3 may give this a run for its money in that department, but it strangely doesn't do that when inlining.
GCC's scalar convert3 takes 8 cycles per iteration. clang's scalar convert3 in a loop takes 7, running at 4.0 fused-domain uops/clock, maxing out the front-end bandwidth and saturating port 1 with one imul uop per cycle. (This is reloading each byte with movzx and storing the scalar result to a stack local every iteration. But not touching the HHMMSS bytes.)
$ taskset -c 3 perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread,idq.mite_uops,idq_uops_not_delivered.cycles_fe_was_ok -r1 ./a.out
Performance counter stats for './a.out':
1,221.82 msec task-clock # 1.000 CPUs utilized
0 context-switches # 0.000 /sec
0 cpu-migrations # 0.000 /sec
105 page-faults # 85.937 /sec
5,079,784,301 cycles # 4.158 GHz
16,002,910,115 instructions # 3.15 insn per cycle
15,004,354,053 uops_issued.any # 12.280 G/sec
18,003,922,693 uops_executed.thread # 14.735 G/sec
1,484,567 idq.mite_uops # 1.215 M/sec
5,079,431,697 idq_uops_not_delivered.cycles_fe_was_ok # 4.157 G/sec
1.222107519 seconds time elapsed
1.221794000 seconds user
0.000000000 seconds sys
Note that this is for 1G iterations, so 5.08G cycles means 5.08 cycles per iteration average throughput.
Removing the extra work to produce the HHMMSS part of the output (vpsrldq, vpmovzxwd, and vmovdqa store), just the 9-digit integer part, it runs at 4.0 cycles per iteration on Skylake. Or 3.5 without the scalar store at the end. (I edited GCC's asm output to comment that instruction, so I know it's still doing all the work.)
The fact that there's some kind of back-end bottleneck here (rather than front-end) is probably a good thing for overlapping this with independent work.

An alternative candidate
Use unsigned math to avoid UB of int overflow and allow for taking all the - 48 out and then into a constant.
const unsigned p[9] = {1, 10, 100, 1000, 10000, 100000, 1000000, 10000000,
100000000};
int convert4u(const char *aa) {
const unsigned char *a = (const unsigned char*) aa;
return a[0] * p[8] + a[1] * p[7] + a[2] * p[6] + a[3] * p[5] + a[4] * p[4]
+ a[5] * p[3] + a[6] * p[2] + a[7] * p[1] + a[8] - (unsigned) 5333333328u;
}
Also try ordering p[9] like a[]. Perhaps easier to parallel calculate. I see no down-side to re-ordering.
const unsigned p[9] = {100000000, 10000000, ..., 1};
int convert4u(const char *aa) {
const unsigned char *a = (const unsigned char*) aa;
return a[0]*p[0] + a[1]*p[1] ... a[1]*p[1] + a[8] - (unsigned) 5333333328u;
}

You don't necessarily need to use special SIMD instructions to do computation in parallel. By using a 64-bit unsigned integer, we can process eight of the nine bytes in parallel and then treat the ninth as a one-off at the end.
constexpr std::uint64_t zeros(char z) {
std::uint64_t result = 0;
for (int i = 0; i < sizeof(result); ++i) {
result = result*256 + z;
}
return result;
}
unsigned convertX(const char *a) {
constexpr std::uint64_t offset = zeros('0');
constexpr std::uint64_t o1 = 0xFF00FF00FF00FF00;
constexpr std::uint64_t o2 = 0xFFFF0000FFFF0000;
constexpr std::uint64_t o3 = 0xFFFFFFFF00000000;
std::uint64_t buffer;
std::memcpy(&buffer, a, sizeof(buffer));
const auto bytes = buffer - offset;
const auto b1 = (bytes & o1) >> 8;
const auto words = (bytes & ~o1) + 10*b1;
const auto w1 = (words & o2) >> 16;
const auto dwords = (words & ~o2) + 100*w1;
const auto d1 = (dwords & o3) >> 32;
const auto qwords = (dwords & ~o3) + 1000*d1;
const auto final = 10*static_cast<unsigned>(qwords) + (a[9] - '0');
return static_cast<unsigned>(final);
}
I tested with MS Visual C++ (64-bit), and the benchmark time for this solution was just over 200 ms versus all of the others which came in right at 400 ms. This makes sense since it uses about half of the multiply and add instructions that the "normal" solution does.
I know the memcpy looks wasteful, but it avoids undefined behavior and alignment problems.

Efficient bit operations

In C++ I want to encode the bits of 3 unsigned variables into one. More precisely, when the three variables are:
A: a3 a2 a1 a0
B: b3 b2 b1 b0
C: c3 c2 c1 c0
then the output variable shall contain such triples:
D: a3 b3 c3 a2 b2 c2 a1 b1 c1 a0 b0 c0
Let's assume that the output variable is large enough for all used bits. I have come up with
unsigned long long result(0);
unsigned a,b,c; // Some numbers to be encoded
for(int level=0;level<numLevels;++level)
{
int q(1<<level); // SearchBit q: 1<<level
int baseShift((3*level)-level); // 0,2,4,6
result|=( ((a&q)<<(baseShift+2)) | ((b&q)<<(baseShift+1)) | ((c&q)<<(baseShift)) );
}
...and it works sufficiently. But I wonder if there is a solution that does not require a loop that iterates over all bits separately.

Define a table mapping all or part of your bits to where they end up. Shift values appropriately.
unsigned long long encoder(unsigned a, unsigned b, unsigned c) {
static unsigned const encoding[16] = {
0b0000000000,
0b0000000001,
0b0000001000,
0b0000001001,
0b0001000000,
0b0001000001,
0b0001001000,
0b0001001001,
0b1000000000,
0b1000000001,
0b1000001000,
0b1000001001,
0b1001000000,
0b1001000001,
0b1001001000,
0b1001001001,
};
unsigned long long result(0);
int shift = 0;
do {
result += ((encoding[a & 0xF] << 2) | (encoding[b & 0xF] << 1) | encoding[c & 0xF]) << shift;
shift += 12;
a >>= 4;
b >>= 4;
c >>= 4;
} while (a || b || c);
return result;
}
encoding defines a table to map 4 bits into their encoded locations. This used directly for c, and shifted 1 or 2 bits for b and a. If you have more than 4 bits to process, the next 4 bits in the source values are offset 12 bits further to the left. Keep doing this until all nonzero bits have been processed.
This could use a while loop instead of a do/while but checking for zero before starting is useless unless most of the encodings are of all zero values.
If you frequently use more than 4 bits, the encoding table can be expanded and appropriate changes made to the loop to process more than 4 bits at a time.

Print Byte Array Vertically in C

I have 4 byte arrays containing 4 bytes each, and when I print them they print horizontally, but I need them to print vertically. Is this possible in C? I know when I use printf() and pass it the array, it prints all four bytes automatically. Can I sub-divide the array into four separate bytes like a normal array and print each byte on a separate line? There has to be a way.
To illustrate what I am doing, here is the segment that I am having trouble with:
byte a0 = state[0];
byte a1 = state[1];
byte a2 = state[2];
byte a3 = state[3];
state[0] = lookup_g2[a0] ^ lookup_g3[a1] ^ a2 ^ a3;
state[1] = lookup_g2[a1] ^ lookup_g3[a2] ^ a3 ^ a0;
state[2] = lookup_g2[a2] ^ lookup_g3[a3] ^ a0 ^ a1;
state[3] = lookup_g2[a3] ^ lookup_g3[a0] ^ a1 ^ a2;
for (int i = 0; i < 4; i++)
{
printf("%02x", state[i]);
}
printf("%s", "\n");
This prints each byte array in its entirety on each line until main()ends. Preferably, I would like to print each array in groups of four vertically.
Also, I know there is sprintf() which I was recommended to use but for now I'm using printf() until I can figure out how to print vertically.
Example:
printf("%02x", state[0]);
yields:
b9 e4 47 c5
The output I am looking for is:
b9
e4
47
c5
and then the following iterations will print their values next to the first column as shown above. Is this possible?

Then how do you print the next line of bytes in a column next to the
first up to four columns then move to a new line?
If I follow what you are attempting to do, you have 4 state arrays (or 4 rows and columns of something) and you want to output the first line with the first element of each (one in each column), the next line (one element in each column), and so on. If you want to output each value individually, then you will need nested loops, outputting one value at a time in the inner loop (printing each element in a column), then output the '\n' once the inner loop completes.
Something like the following:
#include <stdio.h>
typedef unsigned char byte;
int main (void) {
byte state[] = { 0xb9, 0xe4, 0x47, 0xc5 },
n = sizeof state / sizeof *state;
for (byte i = 0; i < n; i++) { /* outer loop */
for (byte j = 0; j < n; j++) /* inner loop */
printf (" %02x", state[(j+i)%n]); /* stuff */
putchar ('\n'); /* newline after inner loop completes */
}
return 0;
}
Example Use/Output
$ ./bin/byteperline
b9 e4 47 c5
e4 47 c5 b9
47 c5 b9 e4
c5 b9 e4 47
note: I just used the values from the first array 4-times to construct the four columns. You will adjust your indexing scheme to whatever is called for by your data (e.g. to print rows as columns or columns as rows) Regardless how you do it, you must output the values in a row-wise manner. You can't print a column and then back up 4-lines and offset by 5 chars and print the next column. (I mean you can, but that is well-beyond your question and would require a separate curses library to do in a portable fashion)
Look things over and let me know if this is what you intended.

Code to find Endianness-pointer typecasting

I was trying to search for a code to determine the endianness of the system, and this is what I found:
int main()
{
unsigned int i= 1;
char *c = (char *)&i;
if (*c) {
printf("Little Endian\n");
} else {
printf("Big Endian\n");
}
}
Could someone tell me how this code works? More specifically, why is the ampersand needed here in this typecasting :
char *c = (char *)&i;
What is getting stored into the pointer c.. the value i contains or the actual address i is contained in? Also why is this a char for this program?

While dereferencing a character pointer, only one byte is interpreted(Assuming a char variable takes one byte).And in little-endian mode,the least-significant-byte of an integer is stored first.So for a 4-byte integer,say 3,it is stored as
00000011 00000000 00000000 00000000
while for big-endian mode it is stored as:
00000000 00000000 00000000 00000011
So in the first case, the char* interprets the first byte and displays 3 but in the second case it displays 0.
Had you not typecasted it as :
char *c = (char *)&i;
it will show a warning about incompatible pointer type.Had c been an integer pointer, dereferencing it will get an integer value 3 irrespective of the endianness,as all 4 bytes will be interpreted.
NB You need to initialize the variable i to see the whole picture.Else a garbage value is stored in the variable by default.
Warning!! OP,we discussed the difference between little-endian and big-endian,but it's more important to know the difference between little-endian and little-indian.I noticed that you used the latter.Well, the difference is that little-indian can cost you your dream job in Google or a $3 million in venture capital if your interviewer is a Nikesh Arora,Sundar Pichai,Vinod Dham or Vinod Khosla :-)

Let's try to walk through this: (in comments)
int main(void){ /
unsigned int i = 1; // i is an int in memory that can be conceptualized as
// int[0x00 00 00 01]
char *c = *(char *)&i; // We take the address of i and then cast it to a char pointer
// which we then dereference. This cast from int(4 bytes)
// to char(1 byte) results in only keeping the lowest byte by
if(*c){ // Endian-ness.
puts("little!\n"); // This means that on a Little Endian machine, 0x01 will be
} else { // the byte kept, but on a Big Endian machine, 0x00 is kept.
puts("big!\n"); // int[0x00 00 00 (char)[01]] vs int[0x01 00 00 (char)[00]]
}
return 0;
}

Using char array as number for math in C++ [closed]

It's difficult to tell what is being asked here. This question is ambiguous, vague, incomplete, overly broad, or rhetorical and cannot be reasonably answered in its current form. For help clarifying this question so that it can be reopened, visit the help center.
Closed 11 years ago.
I'm trying to make 128 and 256 bit integers in C++, and noticed casting char** to int* and int* to int (and backwards) can be used to convert char arrays to integers and integers to char arrays.
Also, char* + int works fine.
However, when I try char* + char* the compiler tells me the types are invalid. Is there any workaround for this, or will I have to write my own functions for the operators?
For example:
int32_t intValue = 2147483647;
char *charPointer = *( char** ) &intValue;
charPointer += 2147483647;
charPointer += 2;
cout << ( *( int64_t* ) &charPointer ) << endl;
output: 4294967296
Basically, what I do should be something like the following:
int32_t intValue = 2147483647;
somewhere in memory:
[ 05 06 07 08 09 0A 0B 0C ] ( address, in hex )
[ .. .. FF FF FF 7F .. .. ] ( value, in hex )
then:
char *charPointer = *( char** ) &intValue;
somewhere in memory:
[ 58 59 5A 5B 5C 5D 5E 5F ] ( address, in hex )
[ .. .. 07 00 00 00 .. .. ] ( value, in hex )
then:
charPointer += 2147483647;
I honestly have no idea what happens here.
It seems like it does something like this though:
[ 05 06 07 08 09 0A 0B 0C ] ( address, in hex )
[ .. .. FF FF FF FE .. .. ] ( value, in hex )
then:
charPointer += 2;
Same here.
Something like this:
[ 05 06 07 08 09 0A 0B 0C ] ( address, in hex )
[ .. .. 00 00 00 00 01 .. ] ( value, in hex )
And at last I just print it as if it were an 8 byte integer:
cout << ( *( int64_t* ) &charPointer ) << endl;
So, can anybody explain why it isn't the value of the pointer that is added but the value of what's being pointed to?

These conversions exist, but they don't do what you think they do. Converting a pointer to an integer is just treating it as an integer; it's not doing any actual "math". For example, char * s = "abcd"; int i = (int) s; will not give the same result every time, because s and i are both just the memory address that the string starts at. Neither has anything to do with the actual contents of the string.
Similarly, char* + int is just taking an offset. To write char * s = "abcd"; char * t = s + 2; is just another way to write char * s = "abcd"; char * t = &(s[2]);; that is, s is the memory location of the 'a', and t is the memory location of the 'c' (s, offset by two char-widths, that is, two bytes). No actual math has taken place, except in the sense that "pointer arithmetic" requires math to compute byte offsets and find memory locations.
char * + char * doesn't make sense: what would it mean to "add" two memory locations together?
Edit: Here is the code you added to your question:
int intValue = 5198;
char *charValue = *( char** ) &intValue;
charValue += 100;
cout << ( *( int* ) &charValue ) << endl;
Let me expand it a bit, so it's a bit clearer what's going on:
int intValue = 5198;
int * intPtr = &intValue;
// intPtr is now the address of the memory location containing intValue
char ** charPtrPtr = (char**) intPtr;
// charPtrPtr is now the address of the memory location containing intValue,
// but *pretending* that it's the address of a memory location that in turn
// contains the address of a memory location containing a char.
char *charPtr = *charPtrPtr;
// charPtr (note: you called it "charValue", but I've renamed it for clarity)
// is now intValue, but *pretending* that it's the address of a memory
// location containing a char.
charPtr += 100;
// charPtr is now 100 more than it was. It's still really just an integer,
// pretending to be a memory location. The above statement is equivalent to
// "charPtr = &(charPtr[100]);", that is, it sets charPtr to point 100 bytes
// later than it did before, but since it's not actually pointing to a real
// memory location, that's a poor way to look at it.
char ** charPtrPtr2 = &charPtr;
// charPtrPtr2 is now the address of the memory location containing charPtr.
// Note that it is *not* the same as charPtrPtr; we used charPtrPtr to
// initialize charPtr, but the two memory locations are distinct.
int * intPtr2 = (int *) charPtrPtr2;
// intPtr2 is now the address of the memory location containing charPtr, but
// *pretending* that it's the address of a memory location containing an
// integer.
int intValue2 = *intPtr2;
// intValue2 is now the integer that results from reading out of charPtrPtr2
// as though it were actually pointing to an integer. Which, in a perverse
// way, is actually true: charPtrPtr2 pointed to a memory location that held
// a value that was never *really* a memory location, anyway, just an integer
// masquerading as one. But this will depend on the specific platform,
// because there's no guarantee that an "int" and a pointer are the same
// size -- on some platforms "int" is 32 bits and pointers are 64 bits.
cout << intValue2 << endl;
Does that make sense?

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Issues setting a hexadecimal value to a char variable for padding - c++

Related

Most insanely fastest way to convert 9 char digits into an int or unsigned int

Efficient bit operations

Print Byte Array Vertically in C

Code to find Endianness-pointer typecasting

Using char array as number for math in C++ [closed]

Categories

Resources