hash_map implementation (Solaris 10, g++ 2.95.3) with default_alloc - c++

I am getting a core dump; pstack on the core shows the following:
fed4ebd4 _lwp_kill (b, fe984948, 0, 1, fffffffc, 0) + 8
fdbd6c6c skgesigOSCrash (ffbfb5d4, ffbfb47c, fec31460, feb7536c, bc0f4, bc000) + 4c
fe1253a0 kpeDbgSignalHandler (ffbfb5d4, b13d630, 9e800, b, ffbfb9dd, 6060000) + 2f0
fdbd71ac skgesig_sigactionHandler (b, ffbfc260, febef8f8, 0, fec314a0, ffbfb5c0) + f0
fed4b00c __sighndlr (b, ffbfc260, ffbfbfa8, fdbd70bc, 0, 1) + c
fed3f6bc call_user_handler (b, 0, 8, 0, fc7f2a00, ffbfbfa8) + 3b8
fed3f8a4 sigacthandler (b, ffbfc260, ffbfbfa8, 29a8d0, 0, 0) + 60
--- called from signal handler with signal 11 (SIGSEGV) ---
fecd8028 _malloc_unlocked (308, bad8f38, bad8f38, bad8f38, fffffffc, 0) + 22c
fecd7de0 malloc (304, 1, ea654, 297f1c, fedc23f0, fedcc5e0) + 4c
002f0ee8 malloc (304, ffbfddd4, ba56c40, 1c, 3a, ba56d7d) + 54
001c4c78 allocate__t23__malloc_alloc_template1i0Ui (304, 0, 0, ffbfd99f, 0, 80808080)
+ c
001c4cfc allocate__t24__default_alloc_template2b0i0Ui (304, 304, ffbfd99f, 1, b, 0) +
18
0029a80c allocate__t12simple_alloc2ZPt15_Hashtable_node1Zt4pair2ZCt12basic_string3ZcZt
18string_char_traits1ZcZt24__default_alloc_template2b0i0Zt6vector2ZiZt9allocator1ZiZt24
__default_alloc_template2b0i0Ui (c1, 0, 0, 0, 0, ffffffff) + 20
00299d38 _M_allocate__t18_Vector_alloc_base3ZPt15_Hashtable_node1Zt4pair2ZCt12basic_st
ring3ZcZt18string_char_traits1ZcZt24__default_alloc_template2b0i0Zt6vector2ZiZt9allocat
or1ZiZt9allocator1Zt6vector2ZiZt9allocator1Zib1Ui (ffbfddf4, c1, 0, ffffffff, a, 808080
80) + 10
0029a8d0 _M_allocate_and_copy__H1ZPPt15_Hashtable_node1Zt4pair2ZCt12basic_string3ZcZt1
8string_char_traits1ZcZt24__default_alloc_template2b0i0Zt6vector2ZiZt9allocator1Zi_t6ve
ctor2ZPt15_Hashtable_node1Zt4pair2ZCt12basic_string3ZcZt18string_char_traits1ZcZt (ffbf
ddf4, c1, 0, 0, ffbfe047, 1) + 1c
00299e24 reserve__t6vector2ZPt15_Hashtable_node1Zt4pair2ZCt12basic_string3ZcZt18string
_char_traits1ZcZt24__default_alloc_template2b0i0Zt6vector2ZiZt9allocator1ZiZt9allocator
1Zt6vector2ZiZt9allocator1ZiUi (ffbfddf4, c1, ffbfdb74, 0, ffbfe047, 7ffffff0) + 48
00297f1c _M_copy_from__t9hashtable6Zt4pair2ZCt12basic_string3ZcZt18string_char_traits1
ZcZt24__default_alloc_template2b0i0Zt6vector2ZiZt9allocator1ZiZt12basic_string3ZcZt18st
ring_char_traits1ZcZt24__default_alloc_template2b0i0Z7strhashZt10_Select1st1Zt4pa (ffbf
ddf0, ffbfddd4, c1, ffbfdbe8, 38, ba56602) + 3c
0029ace8 __t9hashtable6Zt4pair2ZCt12basic_string3ZcZt18string_char_traits1ZcZt24__defa
ult_alloc_template2b0i0Zt6vector2ZiZt9allocator1ZiZt12basic_string3ZcZt18string_char_tr
aits1ZcZt24__default_alloc_template2b0i0Z7strhashZt10_Select1st1Zt4pair2ZCt12basi (ffbf
ddf0, ffbfddd4, ba56c40, 1c, 3a, ba56d7d) + 124
002998ac __t8hash_map5Zt12basic_string3ZcZt18string_char_traits1ZcZt24__default_alloc_
template2b0i0Zt6vector2ZiZt9allocator1ZiZ7strhashZ5streqZt9allocator1Zt6vector2ZiZt9all
ocator1ZiRCt8hash_map5Zt12basic_string3ZcZt18string_char_traits1ZcZt24__default_a (ffbf
ddf0, ffbfddd4, ffbfdcef, ffbfdcee, ffbfdce0, 80808080) + 14
00294754 __Q212Notification4._47RCQ212Notification4._47 (ffbfddec, ffbfddd0, 484564, 1
, b, 0) + 1c
0029adf8 __t4pair2ZCt12basic_string3ZcZt18string_char_traits1ZcZt24__default_alloc_tem
plate2b0i0ZQ212Notification9_RecordIdRCt12basic_string3ZcZt18string_char_traits1ZcZt24_
_default_alloc_template2b0i0RCQ212Notification9_RecordId (ffbfdde8, ffbfde88, ffbfddd0,
0, 0, ffffffff) + 2c
00299fcc __vc__t8hash_map5Zt12basic_string3ZcZt18string_char_traits1ZcZt24__default_al
loc_template2b0i0ZQ212Notification9_RecordIdZ7strhashZ5streqZt9allocator1ZQ212Notificat
ion9_RecordIdRCt12basic_string3ZcZt18string_char_traits1ZcZt24__default_alloc_tem (4845
64, ffbfde88, ffbfde80, ffffffff, a, 80808080) + 2c
002948a8 _Add__12NotificationPCcP8RecordId (484560, ba56d60, 0, ffbfe047, ffbfe047, 1)
+ a0
...
The core dump occurs when using hash_map in a binary compiled with g++ 2.95.3 on Solaris 10.
What is __Q212Notification4._47RCQ212Notification4._47?
Any clues would be appreciated.

Related

Getting results of _mm256_cmpeq_epi8

I need to count the number of spaces in a string using _mm256_cmpeq_epi8.
Here is the code:
std::size_t simd256_count_of_spaces(std::string& text) noexcept
{
std::size_t spaces = 0;
for (std::uint64_t i = 0; i < text.length(); i += 32)
{
__m256i __32 =
_mm256_set_epi8(
text[i ], text[i + 1],
text[i + 2], text[i + 3],
text[i + 4], text[i + 5],
text[i + 6], text[i + 7],
text[i + 8], text[i + 9],
text[i + 10], text[i + 11],
text[i + 12], text[i + 13],
text[i + 14], text[i + 15],
text[i + 16], text[i + 17],
text[i + 18], text[i + 19],
text[i + 20], text[i + 21],
text[i + 22], text[i + 23],
text[i + 24], text[i + 25],
text[i + 26], text[i + 27],
text[i + 28], text[i + 29],
text[i + 30], text[i + 31]
);
__m256i __cmp_mask =
_mm256_set_epi8(
32, 32, 32, 32, 32, 32, 32, 32,
32, 32, 32, 32, 32, 32, 32, 32,
32, 32, 32, 32, 32, 32, 32, 32,
32, 32, 32, 32, 32, 32, 32, 32
);
__m256i __cmp_result = _mm256_cmpeq_epi8(__32, __cmp_mask);
// ...
}
}
The resulting vector looks like this:
255 255 0 0 0 0 255 0
255 0 0 255 255 0 255 0
255 255 255 0 0 255 255 0
255 255 0 0 0 0 255 0
After that, I can count the 255s (or 0s) this way:
std::uint8_t* cmp_res = reinterpret_cast<std::uint8_t*>(&__cmp_result);
for (int i = 0; i < 32; i++)
{
if (cmp_res[i] == 255) spaces++;
}
Is it possible to do the same thing (get the count of 255s or 0s) without an additional loop?
UPDATE
This code solved my problem:
std::size_t spaces = 0;
const __m256i
__cmp = _mm256_set1_epi8(32);
__m256i __eq = _mm256_cmpeq_epi8(__32, __cmp);
spaces += _popcnt32(_mm256_movemask_epi8(__eq));
Just use a plain for loop; the compiler will likely vectorize it for you.
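For completeness, here is a minimal sketch of how the snippet from the update could be wrapped into a whole function. It is not the asker's exact code. It assumes GCC or Clang on x86, loads 32 bytes at a time with _mm256_loadu_si256 instead of the 32-argument _mm256_set_epi8, swaps in __builtin_popcount for _popcnt32, and counts the last size() % 32 bytes with a scalar loop so nothing is read past the end of the string. The names count_spaces, count_spaces_avx2 and count_spaces_scalar are made up for this example.

```cpp
#include <cstddef>
#include <string>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#define HAS_AVX2_PATH 1
#else
#define HAS_AVX2_PATH 0
#endif

// Portable fallback: compare one byte at a time.
std::size_t count_spaces_scalar(const char* p, std::size_t n) {
    std::size_t spaces = 0;
    for (std::size_t i = 0; i < n; ++i) spaces += (p[i] == ' ');
    return spaces;
}

#if HAS_AVX2_PATH
// The target attribute (a GCC/Clang extension) lets this one function use
// AVX2 even when the file is not built with -mavx2.
__attribute__((target("avx2")))
std::size_t count_spaces_avx2(const char* p, std::size_t n) {
    const __m256i space = _mm256_set1_epi8(' ');
    std::size_t spaces = 0;
    std::size_t i = 0;
    for (; i + 32 <= n; i += 32) {
        // Unaligned 32-byte load.
        __m256i chunk = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(p + i));
        // Each byte becomes 0xFF where it equals ' ', else 0x00.
        __m256i eq = _mm256_cmpeq_epi8(chunk, space);
        // Collapse to one bit per byte, then count the set bits.
        unsigned mask = static_cast<unsigned>(_mm256_movemask_epi8(eq));
        spaces += static_cast<std::size_t>(__builtin_popcount(mask));
    }
    // Fewer than 32 bytes remain: finish with the scalar loop.
    return spaces + count_spaces_scalar(p + i, n - i);
}
#endif

std::size_t count_spaces(const std::string& text) {
#if HAS_AVX2_PATH
    // Runtime check, so the binary still runs on CPUs without AVX2.
    if (__builtin_cpu_supports("avx2"))
        return count_spaces_avx2(text.data(), text.size());
#endif
    return count_spaces_scalar(text.data(), text.size());
}
```

Calling count_spaces("a b c") should give 2. On a machine without AVX2 the scalar path is taken and the result is the same.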

What could be the cause of allocate failing with a SIGBUS

I am getting a SIGBUS at runtime, with the call stack shown below:
--- called from signal handler with signal 10 (SIGBUS) ---
001279b8 allocate__t24__default_alloc_template2b0i0Ui (20, 20, 2fa3c0, 32, 0, 0
) + a4
00117380 __nw__Q2t12basic_string3ZcZt18string_char_traits1ZcZt24__default_alloc
_template2b0i0_3RepUiUi (10, 10, 838e00, 0, 0, 0) + 14
001173c0 create__Q2t12basic_string3ZcZt18string_char_traits1ZcZt24__default_all
oc_template2b0i0_3RepUi (9, 9, 838e00, 9, 0, 0) + 24
00117784 replace__t12basic_string3ZcZt18string_char_traits1ZcZt24__default_allo
c_template2b0i0UiUiPCcUi (fbf7f758, 0, ffffffff, fcbf40c2, 9, 80808080) + 114
0012b988 assign__t12basic_string3ZcZt18string_char_traits1ZcZt24__default_alloc
_template2b0i0PCcUi (fbf7f758, fcbf40c2, 9, ffffffff, ffffffff, 20) + 24
0012a35c assign__t12basic_string3ZcZt18string_char_traits1ZcZt24__default_alloc
_template2b0i0PCc (fbf7f758, fcbf40c2, 90, b0, 1ff0, 0) + 24
00127170 __t12basic_string3ZcZt18string_char_traits1ZcZt24__default_alloc_templ
ate2b0i0PCc (fbf7f758, fcbf40c2, fcbf40b8, 1c00, 9, 7c) + 28
What could be the cause?
A stack overflow? Or no more space on the heap?

Thread-safe std::string and std::stringbuf in C++

In my multithreaded C++ program on Solaris 10, built with GNU g++ 2.95.3, I am seeing contention when one thread calls the string constructor while another thread calls a stringbuf constructor, as shown below:
One thread is calling
__t12basic_string3ZcZt18string_char_traits1ZcZt24__default_alloc_template2b0i0PCc
i.e.
basic_string<char, string_char_traits<char>, __default_alloc_template<false, 0>>::basic_string(char const *)
Another thread is calling
__9stringbufRCt12basic_string3ZcZt18string_char_traits1ZcZt24__default_alloc_template2b0i0i
i.e.
stringbuf::stringbuf(basic_string<char, string_char_traits<char>, __default_alloc_template<false, 0> > const &, int)
Eventually both threads end up allocating space for the string or stringbuf, and the resulting contention leads to a SIGBUS.
Any clue?
Do I need to override malloc or new to be thread-safe, or do I need to wrap string and stringbuf to ensure thread safety?
I use the following g++ linker flags:
g++ -g -lclntsh -ldl -lsocket -lrt -lthread
Do I need to add 'g++ -pthread' during compilation? I would hope this adds -D_REENTRANT as well. But g++ 2.95.3 on Solaris does not support -pthread.
Here's more pstack core output:
core 'core' of 28477:
----------------- lwp# 1 / thread# 1 --------------------
fcb48ac4 __lwp_park (0, 0, fcbb74c4, 1c00, 0, 0) + 14
ff375244 sem_wait (1da56fd8, 1da56c00, 8ea400, 2aae33d0, fcbb49cc, 0) + 20
....
001965f0 _start (0, 0, 0, 0, 0, 0) + 5c
----------------- lwp# 2 / thread# 2 --------------------
fcb4c714 _lwp_kill (a, fed84948, 0, 1, fffffffc, 0) + 8
fdfd6c6c skgesigOSCrash (fc3eea54, fc3ee8fc, ff031460, fef7536c, bc0f4, bc000) + 4c
fe5253a0 kpeDbgSignalHandler (fc3eea54, 2b200cd8, 9e800, a, fc3eee5d, 6060000) + 2f0
fdfd71ac skgesig_sigactionHandler (a, fc3ef6e0, fefef8f8, 0, ff0314a0, fc3eea40) + f0
fcb48b4c __sighndlr (a, fc3ef6e0, fc3ef428, fdfd70bc, 0, 1) + c
fcb3d1f8 call_user_handler (a, 0, 8, 0, fc540200, fc3ef428) + 3b8
fcb3d3cc sigacthandler (a, fc3ef6e0, fc3ef428, 1, fc540200, 0) + 4c
--- called from signal handler with signal 10 (SIGBUS) ---
00608e00 allocate__t24__default_alloc_template2b0i0Ui (20, 20, 9eff80, 1c7be, 2ab11e06, fcbb4f18) +
a4
003cde3c __nw__Q2t12basic_string3ZcZt18string_char_traits1ZcZt24__default_alloc_template2b0i0_3RepU
iUi (10, 10, ff000000, 4, fc540200, 861908) + 14
003cef8c create__Q2t12basic_string3ZcZt18string_char_traits1ZcZt24__default_alloc_template2b0i0_3Re
pUi (2, 2, 2b10d5c8, fcbb5434, fcbb5784, fffc00) + 24
003ee944 replace__t12basic_string3ZcZt18string_char_traits1ZcZt24__default_alloc_template2b0i0UiUiP
CcUi (fc3efaf8, 0, ffffffff, 8618a0, 2, 80808080) + 114
006919cc assign__t12basic_string3ZcZt18string_char_traits1ZcZt24__default_alloc_template2b0i0PCcUi
(fc3efaf8, 8618a0, 2, 0, fc540200, 7164b4) + 24
00666684 assign__t12basic_string3ZcZt18string_char_traits1ZcZt24__default_alloc_template2b0i0PCc (f
c3efaf8, 8618a0, 8ea400, 0, fc5706c0, 0) + 24
005f7868 __t12basic_string3ZcZt18string_char_traits1ZcZt24__default_alloc_template2b0i0PCc (fc3efaf
8, 8618a0, 861800, 0, fc3efa90, a4090) + 28
....
----------------- lwp# 3 / thread# 3 --------------------
fcb41818 mutex_lock_impl (fcbb3910, 0, 2dfc0030, fc3beee0, 6, 80808080) + 4
fcad640c malloc (6, 1, d9fd8, 6919cc, fcbb03a8, fcbba518) + 44
003c878c __builtin_vec_new (6, 3, 37, fc3bee38, ff32a888, 4) + 3c
005e16ec __9stringbufRCt12basic_string3ZcZt18string_char_traits1ZcZt24__default_alloc_template2b0i0
i (fc3bee60, fc3bee58, 3, 0, 2b2ab6e8, 20) + 64
...
----------------- lwp# 4 / thread# 4 --------------------
fcb48ac4 __lwp_park (0, 0, fcbb74c4, 1c00, 0, 0) + 14
ff375244 sem_wait (1da56f48, 1da56c00, 1, 0, fc541200, 0) + 20
....
fcb48a20 _lwp_start (0, 0, 0, 0, 0, 0)

Parallel algorithm that does a small insertion/shifting

Say I have an array A of 8 numbers, and another array B that determines how many places each number in A should be shifted to the right:
A 3, 6, 7, 8, 1, 2, 3, 5
B 0, 1, 0, 0, 0, 0, 0, 0
0 means the number is valid as it is; 1 means this number should move 1 place to the right. Here a 0 should be inserted after the 3, so the output array C should be:
C: 3,0,6,7,8,1,2,3
Whether to insert 0 or something else is not important; the point is that all numbers after 3 get shifted by one place. Numbers shifted out of bounds are dropped from the array.
Another example:
A 3, 6, 7, 8, 1, 2, 3, 5
B 0, 1, 0, 0, 2, 0, 0, 0
C 3, 0, 6, 7, 8, 0, 1, 2
.......................................
A 3, 6, 7, 8, 1, 2, 3, 5
B 0, 1, 0, 0, 1, 0, 0, 0
C 3, 0, 6, 7, 8, 1, 2, 3
I am thinking about using a scan/prefix-sum or something similar to solve this problem. The array is also small enough to fit in one warp (<32 numbers), so I should be able to use shuffle instructions. Does anyone have an idea?
One possible approach.
Due to the ambiguity of your shifting (0, 1, 0, 1, 0, 1, 1, 1 and 0, 1, 0 ,0 all produce the same data offset pattern, for example) it's not possible to just create a prefix sum of the shift pattern to produce the relative offset at each position. An observation we can make, however, is that a valid offset pattern will be created if each zero in the shift pattern gets replaced by the first non-zero shift value to its left:
0, 1, 0, 0 (shift pattern)
0, 1, 1, 1 (offset pattern)
or
0, 2, 0, 2 (shift pattern)
0, 2, 2, 2 (offset pattern)
So how to do this? Let's assume we have the second test case shift pattern:
0, 1, 0, 0, 2, 0, 0, 0
Our desired offset pattern would be:
0, 1, 1, 1, 2, 2, 2, 2
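Before getting to the warp-level version, it may help to pin down the rule with a sequential reference. This is only a sketch of the "replace each zero by the first non-zero value to its left" rule described above; the function name offset_pattern_reference is made up, and a position with no non-zero value to its left is assumed to keep offset 0.

```cpp
#include <cstddef>
#include <vector>

// Sequential reference for the offset pattern: every zero in the shift
// pattern takes the nearest non-zero shift value to its left.
std::vector<unsigned> offset_pattern_reference(const std::vector<unsigned>& shift) {
    std::vector<unsigned> offsets(shift.size());
    unsigned current = 0;  // assumption: nothing to the left yet means offset 0
    for (std::size_t i = 0; i < shift.size(); ++i) {
        if (shift[i] != 0) current = shift[i];  // a non-zero entry starts a new run
        offsets[i] = current;
    }
    return offsets;
}
```

For the shift pattern 0, 1, 0, 0, 2, 0, 0, 0 this yields 0, 1, 1, 1, 2, 2, 2, 2, which is the result the parallel steps need to reproduce.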
1. For a given shift pattern, create a binary value where each bit is one if the value at the corresponding index into the shift pattern is zero, and zero otherwise. We can use a warp vote instruction, called __ballot(), for this. Each lane will get the same value from the ballot:
1 0 1 1 0 1 1 1 (this is a single binary 8-bit value in this case)
2. Each warp lane will now take this value and add to it a value which has a 1 bit at the warp lane position. Using lane 1 for the remainder of the example:
+ 0 0 0 0 0 0 1 0 (the only 1 bit in this value will be at the lane index)
= 1 0 1 1 1 0 0 1
3. We now take the result of step 2 and bitwise exclusive-OR it with the result from step 1:
= 0 0 0 0 1 1 1 0
4. We now count the number of 1 bits in this value (there is a __popc() intrinsic for this), and subtract one from the result. So for the lane 1 example above, the result of this step would be 2, since there are 3 bits set. This gives us the distance to the first value to our left that is non-zero in the original shift pattern. So for the lane 1 example, the first non-zero value to the left of lane 1 is 2 lanes higher, i.e. lane 3.
5. For each lane, we use the result of step 4 to grab the appropriate offset value for that lane. We can process all lanes at once using a __shfl_down() warp shuffle instruction.
0, 1, 1, 1, 2, 2, 2, 2
Thus producing our desired "offset pattern".
Once we have the desired offset pattern, the process of having each warp lane use its offset value to appropriately shift its data item is straightforward.
Here is a fully worked example, using your 3 test cases. Steps 1-4 above are contained in the __device__ function mydelta. The remainder of the kernel is performing the step 5 shuffle, appropriately indexing into the data, and copying the data. Due to the usage of the warp shuffle instructions, we must compile this for a cc3.0 or higher GPU. (However, it would not be difficult to replace the warp shuffle instructions with other indexing code that would allow operation on cc2.0 or greater devices.) Also, due to the various intrinsics used, this function cannot work for more than 32 data items, but that was a prerequisite condition stated in your question.
$ cat t475.cu
#include <stdio.h>
#include <stdlib.h>
#define DSIZE 8
#define cudaCheckErrors(msg) \
do { \
cudaError_t __err = cudaGetLastError(); \
if (__err != cudaSuccess) { \
fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
msg, cudaGetErrorString(__err), \
__FILE__, __LINE__); \
fprintf(stderr, "*** FAILED - ABORTING\n"); \
exit(1); \
} \
} while (0)
__device__ int mydelta(const int shift){
unsigned nz = __ballot(shift == 0);
unsigned mylane = (threadIdx.x & 31);
unsigned lanebit = 1<<mylane;
unsigned temp = nz + lanebit;
temp = nz ^ temp;
unsigned delta = __popc(temp);
return delta-1;
}
__global__ void mykernel(const int *data, const unsigned *shift, int *result, const int limit){ // limit <= 32
if (threadIdx.x < limit){
unsigned lshift = shift[(limit - 1) - threadIdx.x];
unsigned delta = mydelta(lshift);
unsigned myshift = __shfl_down(lshift, delta);
myshift = __shfl(myshift, ((limit -1) - threadIdx.x)); // reverse offset pattern
result[threadIdx.x] = 0;
if ((myshift + threadIdx.x) < limit)
result[threadIdx.x + myshift] = data[threadIdx.x];
}
}
int main(){
int A[DSIZE] = {3, 6, 7, 8, 1, 2, 3, 5};
unsigned tc1B[DSIZE] = {0, 1, 0, 0, 0, 0, 0, 0};
unsigned tc2B[DSIZE] = {0, 1, 0, 0, 2, 0, 0, 0};
unsigned tc3B[DSIZE] = {0, 1, 0, 0, 1, 0, 0, 0};
int *d_data, *d_result, *h_result;
unsigned *d_shift;
h_result = (int *)malloc(DSIZE*sizeof(int));
if (h_result == NULL) { printf("malloc fail\n"); return 1;}
cudaMalloc(&d_data, DSIZE*sizeof(int));
cudaMalloc(&d_shift, DSIZE*sizeof(unsigned));
cudaMalloc(&d_result, DSIZE*sizeof(int));
cudaCheckErrors("cudaMalloc fail");
cudaMemcpy(d_data, A, DSIZE*sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(d_shift, tc1B, DSIZE*sizeof(unsigned), cudaMemcpyHostToDevice);
cudaCheckErrors("cudaMempcyH2D fail");
mykernel<<<1,32>>>(d_data, d_shift, d_result, DSIZE);
cudaDeviceSynchronize();
cudaCheckErrors("kernel fail");
cudaMemcpy(h_result, d_result, DSIZE*sizeof(int), cudaMemcpyDeviceToHost);
cudaCheckErrors("cudaMempcyD2H fail");
printf("index: ");
for (int i = 0; i < DSIZE; i++)
printf("%d, ", i);
printf("\nA: ");
for (int i = 0; i < DSIZE; i++)
printf("%d, ", A[i]);
printf("\ntc1 B: ");
for (int i = 0; i < DSIZE; i++)
printf("%d, ", tc1B[i]);
printf("\ntc1 C: ");
for (int i = 0; i < DSIZE; i++)
printf("%d, ", h_result[i]);
cudaMemcpy(d_shift, tc2B, DSIZE*sizeof(unsigned), cudaMemcpyHostToDevice);
cudaCheckErrors("cudaMempcyH2D fail");
mykernel<<<1,32>>>(d_data, d_shift, d_result, DSIZE);
cudaDeviceSynchronize();
cudaCheckErrors("kernel fail");
cudaMemcpy(h_result, d_result, DSIZE*sizeof(int), cudaMemcpyDeviceToHost);
cudaCheckErrors("cudaMempcyD2H fail");
printf("\ntc2 B: ");
for (int i = 0; i < DSIZE; i++)
printf("%d, ", tc2B[i]);
printf("\ntc2 C: ");
for (int i = 0; i < DSIZE; i++)
printf("%d, ", h_result[i]);
cudaMemcpy(d_shift, tc3B, DSIZE*sizeof(unsigned), cudaMemcpyHostToDevice);
cudaCheckErrors("cudaMempcyH2D fail");
mykernel<<<1,32>>>(d_data, d_shift, d_result, DSIZE);
cudaDeviceSynchronize();
cudaCheckErrors("kernel fail");
cudaMemcpy(h_result, d_result, DSIZE*sizeof(int), cudaMemcpyDeviceToHost);
cudaCheckErrors("cudaMempcyD2H fail");
printf("\ntc3 B: ");
for (int i = 0; i < DSIZE; i++)
printf("%d, ", tc3B[i]);
printf("\ntc3 C: ");
for (int i = 0; i < DSIZE; i++)
printf("%d, ", h_result[i]);
printf("\n");
return 0;
}
$ nvcc -arch=sm_35 -o t475 t475.cu
$ ./t475
index: 0, 1, 2, 3, 4, 5, 6, 7,
A: 3, 6, 7, 8, 1, 2, 3, 5,
tc1 B: 0, 1, 0, 0, 0, 0, 0, 0,
tc1 C: 3, 0, 6, 7, 8, 1, 2, 3,
tc2 B: 0, 1, 0, 0, 2, 0, 0, 0,
tc2 C: 3, 0, 6, 7, 8, 0, 1, 2,
tc3 B: 0, 1, 0, 0, 1, 0, 0, 0,
tc3 C: 3, 0, 6, 7, 8, 1, 2, 3,
$
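To trace the bit manipulation without a GPU, the steps of this answer can be simulated on the host. The following is only a sketch in plain C++: a loop stands in for the warp, building the mask in a loop replaces __ballot(), std::bitset::count replaces __popc(), and an array lookup replaces __shfl_down(). It assumes that a source lane beyond the last active lane contributes 0. The names kLanes, lane_delta, offset_pattern and apply_shift are made up for the illustration.

```cpp
#include <array>
#include <bitset>
#include <cstddef>

constexpr std::size_t kLanes = 8;  // simulated lanes; assumed to be < 32
using Shifts = std::array<unsigned, kLanes>;
using Data = std::array<int, kLanes>;

// Steps 2-4: distance from `lane` up to the nearest lane with a non-zero shift.
unsigned lane_delta(unsigned zero_mask, unsigned lane) {
    unsigned temp = zero_mask + (1u << lane);  // step 2: the carry runs through the 1 bits
    temp ^= zero_mask;                         // step 3: keep only the bits that changed
    return static_cast<unsigned>(std::bitset<32>(temp).count()) - 1u;  // step 4
}

Shifts offset_pattern(const Shifts& shift) {
    Shifts lane_shift{};
    unsigned zero_mask = 0;
    for (std::size_t lane = 0; lane < kLanes; ++lane) {
        lane_shift[lane] = shift[kLanes - 1 - lane];         // reversed load, as in the kernel
        if (lane_shift[lane] == 0) zero_mask |= 1u << lane;  // step 1: stands in for __ballot
    }
    Shifts offsets{};
    for (std::size_t lane = 0; lane < kLanes; ++lane) {
        std::size_t src = lane + lane_delta(zero_mask, static_cast<unsigned>(lane));
        // Step 5: stands in for __shfl_down. A source lane past the end is
        // assumed to contribute 0 (an assumption of this sketch).
        unsigned value = (src < kLanes) ? lane_shift[src] : 0u;
        offsets[kLanes - 1 - lane] = value;  // undo the reversal
    }
    return offsets;
}

Data apply_shift(const Data& data, const Shifts& shift) {
    const Shifts offsets = offset_pattern(shift);
    Data result{};  // zero-filled, like result[threadIdx.x] = 0 in the kernel
    for (std::size_t i = 0; i < kLanes; ++i) {
        // Items shifted past the end are dropped.
        if (i + offsets[i] < kLanes) result[i + offsets[i]] = data[i];
    }
    return result;
}
```

With A = 3, 6, 7, 8, 1, 2, 3, 5 and B = 0, 1, 0, 0, 2, 0, 0, 0, apply_shift returns 3, 0, 6, 7, 8, 0, 1, 2, matching the second test case above.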

An integer [0,4095] (12 bits) to a tuple {A,B,C} the fastest way in C++

Input: an integer in [0,4095] (12 bits).
Output: a tuple {A,B,C}, each in [0,255].
A, B and C range from 0 to 255, where 255 corresponds to 15 in the 4-bit field. The reason is that I want to construct a Color struct with RGB defined from 0 to 255.
I assume the solution is something like bit-shifting the input to extract the three 4-bit fields and then multiplying each by 17 (255/15 = 17, and 15 = 1111 in binary).
What is the fastest way to compute this?
My own solution:
QColor mycolor(int value)
{
if(value > 0xFFF)
value = 0xFFF;
int a=0,b=0,c=0;
a = (value & 0xF) * 17;
b = ((value&(0xF<<4))>>4) *17;
c = ((value&(0xF<<8))>>8) *17;
return QColor(c,b,a);
}
cv::Mat cv_image(10,10,CV_16U,cv::Scalar::all(1));
QImage image(cv_image.data, 10,10,QImage::Format_RGB444);
QPainter p(&image);
p.setPen(mycolor(255));
p.drawLine(0,0,9,0);
p.setPen(mycolor(4095));
p.drawLine(0,1,9,1);
p.setPen(mycolor(0));
p.drawLine(0,2,9,2);
p.setPen(mycolor(10000));
p.drawLine(0,3,9,3);
********* Start testing of Test1 *********
Config: Using QTest library 4.7.4, Qt 4.7.4
PASS : Test1::initTestCase()
[255, 255, 255, 255, 255, 255, 255, 255, 255, 255;
4095, 4095, 4095, 4095, 4095, 4095, 4095, 4095, 4095, 4095;
0, 0, 0, 0, 0, 0, 0, 0, 0, 0;
4095, 4095, 4095, 4095, 4095, 4095, 4095, 4095, 4095, 4095;
1, 1, 1, 1, 1, 1, 1, 1, 1, 1;
1, 1, 1, 1, 1, 1, 1, 1, 1, 1;
1, 1, 1, 1, 1, 1, 1, 1, 1, 1;
1, 1, 1, 1, 1, 1, 1, 1, 1, 1;
1, 1, 1, 1, 1, 1, 1, 1, 1, 1;
1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
PASS : Test1::test1()
First of all, the input 0...4095 is in fact 12 bits, and stating it that way makes the question easier to understand. Here is one possible solution:
int val; // 0...4095
int red = ((val&(15<<8))>>8)*17;
int green = ((val&(15<<4))>>4)*17;
int blue = ((val&(15<<0))>>0)*17;
I have kept the bit shifting for blue as well so you can spot the similarity in the calculation. Hope this helps.
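You can use a union with bit-fields to pick apart your color-coded 12-bit value (bit-field layout is implementation-defined, so check the field order on your compiler):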
union colorCoding
{
unsigned int val:12;
struct
{
unsigned int red:4;
unsigned int blue:4;
unsigned int green:4;
};
};
To get the lowest four bits of the input, AND it with binary 1111 (15), then shift the input right by four bits and repeat the process. This gets you three integers in the range 0 to 15.
If you then want to convert each to something in [0,255], shift it left by four bits and OR it with binary 1111 (for simplicity; note that 0 then maps to 15 rather than 0).
A = (input&15)<<4|15;
input >>= 4;
B = (input&15)<<4|15;
input >>= 4;
C = (input&15)<<4|15;
or (if you want 0 to map to 0)
A = input&15;
A = A<<4|A;
input >>= 4;
B = input&15;
B = B<<4|B;
input >>= 4;
C = input&15;
C = C<<4|C;
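The last variant is the same as multiplying by 17: for a 4-bit n, (n << 4) | n equals n * 16 + n. Here is a small self-contained sketch of that idea. The Rgb struct is a hypothetical stand-in for QColor, the function names are made up, and the channel layout (top nibble red, low nibble blue) is assumed from the question's own solution.

```cpp
#include <cstdint>

// Hypothetical stand-in for QColor, so the sketch has no Qt dependency.
struct Rgb {
    std::uint8_t r, g, b;
};

// Expand one 4-bit channel to 8 bits by repeating the nibble:
// (n << 4) | n == n * 17, so 0 maps to 0 and 15 maps to 255.
inline std::uint8_t expand_nibble(unsigned n) {
    n &= 0xF;
    return static_cast<std::uint8_t>((n << 4) | n);
}

// Assumed layout: bits 11..8 red, bits 7..4 green, bits 3..0 blue.
inline Rgb expand_rgb444(unsigned value) {
    if (value > 0xFFF) value = 0xFFF;  // clamp out-of-range input, as the question does
    return Rgb{expand_nibble(value >> 8),
               expand_nibble(value >> 4),
               expand_nibble(value)};
}
```

For example, expand_rgb444(0x123) gives r = 0x11, g = 0x22, b = 0x33. If this runs per pixel over a large image, a 4096-entry lookup table built once with this function is another option worth measuring.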