Suppose you have 32 threads and 32 pieces of data, one for each thread to operate on independently, e.g.
struct data
{
    unsigned short int N;
    char *features; // array length N
    uint *values;   // array length N
};

data alldata[32];
Suppose that the shared memory for these threads is "partitioned" into 32 "banks", where each bank is 4 bytes wide. Each thread can read from its corresponding "bank" in parallel, but if two threads try to access the same bank simultaneously, the accesses are serialized.
bank    |        0        |        1        |        2        | ...
bytes   |   0   1   2   3 |   4   5   6   7 |   8   9  10  11 | ...
bytes   | 128 129 130 131 | 132 133 134 135 | 136 137 138 139 | ...
        |       ...       |       ...       |       ...       |
threads |        0        |        1        |        2        | ...
(This bizarre situation is called GPU computing).
Thus, for maximum parallelization:
(in terms of the picture above) the member variables of alldata[0] must only be written to the bytes in the first column, the member variables of alldata[1] to the second column, and so on. Equivalently,
I must write the contents of alldata into one dynamic array, where the member variables of alldata[j] occupy 4-byte words spaced 32*4 bytes apart (column j of the picture). Then, when I copy this dynamic array into the shared memory for the threads, it will be properly aligned with the banks.
Question:
Does anybody know of any kind of package that will write variables to a byte array with the proper spacing discussed above (one 4-byte word every 32*4 bytes)?
This is a desperation question...
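To make the desired layout concrete, here is a minimal hand-rolled sketch of the interleaving described above. This is not an existing package; the names pack_for_banks, NUM_THREADS and the rows parameter are purely illustrative, and packing the features array is omitted for brevity:
#include <cstddef>
#include <cstring>
#include <vector>

struct data
{
    unsigned short int N;
    char *features;       // array length N
    unsigned int *values; // array length N
};

constexpr std::size_t NUM_THREADS = 32;
constexpr std::size_t STRIDE = NUM_THREADS * 4; // 128 bytes: one 4-byte word per bank

// Packs one 4-byte word per thread per 128-byte "row", so everything belonging
// to alldata[j] stays in column (bank) j of the picture above.
std::vector<unsigned char> pack_for_banks(const data alldata[NUM_THREADS],
                                          std::size_t rows)
{
    std::vector<unsigned char> buf(rows * STRIDE, 0);
    for (std::size_t j = 0; j < NUM_THREADS; ++j) {
        std::size_t row = 0;
        // word 0: N, widened to 4 bytes
        unsigned int n = alldata[j].N;
        std::memcpy(&buf[row++ * STRIDE + j * 4], &n, 4);
        // following words: values[0..N-1], one per row, always in bank j
        for (std::size_t k = 0; k < alldata[j].N && row < rows; ++k)
            std::memcpy(&buf[row++ * STRIDE + j * 4], &alldata[j].values[k], 4);
        // the features array would be packed the same way, 4 chars per word
    }
    return buf;
}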
Why does this take up 21 bytes on a 32-bit system? Three pointers plus two numbers is 5 * 4 = 20, so shouldn't it be 20 bytes?
Thank you for your answer!!!
https://redis.com/ebook/part-2-core-concepts/01chapter-9-reducing-memory-use/9-1-short-structures/9-1-1-the-ziplist-representation/
Your book counts the terminating \0 byte at the end of "one\0" as overhead, bringing the total to 21.
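Spelled out: three pointers plus two integers is 5 * 4 = 20 bytes on a 32-bit system, and the trailing '\0' of "one\0" adds one more byte, giving 21 in total.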
I want to clobber all load instructions - essentially, I want to find all load instructions, and after the load is complete I want to modify the value in the register that stores the value that was read from memory.
To do so, I instrument all instructions and when I find a load I insert a call to some function that will clobber the write register after the load. I pass in the register that needs to be modified (i.e. the register containing the data loaded from memory) using PIN_REGISTER*.
Assuming I know the type of data that was loaded (i.e. int, float, etc.) I can access the PIN_REGISTER union according to the data type (See this). However, as you can see in the link, PIN_REGISTER stores an array of values - i.e. it doesn't store one signed int but rather MAX_DWORDS_PER_PIN_REG signed ints.
Will the value loaded from memory always be stored at index 0? If for instance, I load a 32 bit signed int from memory into a register, can I always assume that it would be stored at s_dword[0]? What if for instance I write to the 8 bit AH/BH/CH/DH registers? Since these correspond to "middle" bits of 32 bit registers, I assume the data would not be at index 0 in the array?
What's the easiest way for me to figure out which index in the array the loaded data is stored at?
If for instance, I load a 32 bit signed int from memory into a register, can I always assume that it would be stored at s_dword[0]?
Yes.
If you are in long mode and have, e.g., the RAX register, you have two DWORDs: the lower, less significant 32 bits (index 0 in s_dword) and the upper, more significant 32 bits (index 1 in s_dword).
What if for instance I write to the 8 bit AH/BH/CH/DH registers? Since these correspond to "middle" bits of 32 bit registers, I assume the data would not be at index 0 in the array?
Note: AH is rAX[8:16] (rAX is RAX or EAX), not really in the 'middle'.
It really depends on which member of the union you are accessing. If we stay with the s_dword member (or dword), then AH is still in the "lowest" DWORD (index 0) of the 32- or 64-bit register. It is, at the same time, in the high part (most significant 8 bits) of the lowest WORD (16-bit quantity).
// DWORD (32-bit quantity)
auto ah = pinreg->dword[0] >> 8;
auto al = pinreg->dword[0] & 0xff;
// still the same for word (16-bit quantity)
auto ah = pinreg->word[0] >> 8;
auto al = pinreg->word[0] & 0xff;
// not the same for byte (8-bit quantity)
auto ah = pinreg->byte[1];
auto al = pinreg->byte[0];
What's the easiest way for me to figure out which index in the array the loaded data is stored at?
Hard to say; it just seems natural to me to know at which index it is. As long as you know the size of each member of the union, it's quite simple:
byte: 8 bits
word: 16 bits
dword: 32 bits
qword: 64 bits
Here's a crude drawing with different sizes:
+---+---+---+---+---+---+---+---+
| 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 | byte
+---+---+---+---+---+---+---+---+
+-------+-------+-------+-------+
| 3 | 2 | 1 | 0 | word
+-------+-------+-------+-------+
+---------------+---------------+
| 1 | 0 | dword
+---------------+---------------+
+-------------------------------+
| 0 | qword
+-------------------------------+
^ ^
MSB LSB
The same with AL and AH (you can see that AH is byte[1] and AL is byte[0] both are in word[0], dword[0] and qword[0]):
+---+---+---+---+---+---+---+---+
| 7 | 6 | 5 | 4 | 3 | 2 | AH| AL| byte
+---+---+---+---+---+---+---+---+
+-------+-------+-------+-------+
| 3 | 2 | 1 | 0 | word
+-------+-------+-------+-------+
+---------------+---------------+
| 1 | 0 | dword
+---------------+---------------+
+-------------------------------+
| 0 | qword
+-------------------------------+
^ ^
MSB LSB
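To make the indexing concrete, here is a standalone mock of such a union. It is not the real PIN_REGISTER definition (which has more members and is sized by constants like MAX_DWORDS_PER_PIN_REG), and it relies on the usual little-endian x86 layout plus common union type-punning behavior, but it shows a 32-bit value landing in dword[0] with AL at byte[0] and AH at byte[1]:
#include <cstdint>
#include <cstdio>

union MockReg {
    uint8_t  byte[8];
    uint16_t word[4];
    uint32_t dword[2];
    uint64_t qword[1];
};

int main() {
    MockReg r{};
    r.dword[0] = 0x11223344;  // pretend this was just loaded into EAX
    // Reading a member other than the one last written is type punning; it
    // works on the usual compilers and mirrors how PIN_REGISTER is used.
    std::printf("dword[0]      = %#x\n", (unsigned)r.dword[0]); // 0x11223344
    std::printf("AX = word[0]  = %#x\n", (unsigned)r.word[0]);  // 0x3344
    std::printf("AH = byte[1]  = %#x\n", (unsigned)r.byte[1]);  // 0x33
    std::printf("AL = byte[0]  = %#x\n", (unsigned)r.byte[0]);  // 0x44
    return 0;
}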
I have two functions that print a 32-bit number in binary.
The first one divides the number into bytes and starts printing from the last byte (from the 25th bit of the whole integer).
The second one is more straightforward and starts from the 1st bit of the number.
It seems to me that these functions should have different outputs because they process the bits in different orders, yet the outputs are the same. Why?
#include <stdio.h>

void printBits(size_t const size, void const * const ptr)
{
    unsigned char *b = (unsigned char*) ptr;
    unsigned char byte;
    int i, j;

    for (i = size - 1; i >= 0; i--)
    {
        for (j = 7; j >= 0; j--)
        {
            byte = (b[i] >> j) & 1;
            printf("%u", byte);
        }
    }
    puts("");
}

void printBits_2(unsigned *A)
{
    for (int i = 31; i >= 0; i--)
    {
        printf("%u", (A[0] >> i) & 1u);
    }
    puts("");
}

int main()
{
    unsigned a = 1014750;

    printBits(sizeof(a), &a); // ->00000000000011110111101111011110
    printBits_2(&a);          // ->00000000000011110111101111011110

    return 0;
}
Both your functions print the binary representation of the number from the most significant bit to the least significant bit. Today's PCs (and the majority of other computer architectures) use the so-called Little Endian format, in which multi-byte values are stored with the least significant byte first.
That means that 32-bit value 0x01020304 stored on address 0x1000 will look like this in the memory:
+--------++--------+--------+--------+--------+
|Address || 0x1000 | 0x1001 | 0x1002 | 0x1003 |
+--------++--------+--------+--------+--------+
|Data || 0x04 | 0x03 | 0x02 | 0x01 |
+--------++--------+--------+--------+--------+
Therefore, on Little Endian architectures, printing value's bits from MSB to LSB is equivalent to taking its bytes in reversed order and printing each byte's bits from MSB to LSB.
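If you want to verify this on your own machine, here is a small self-contained check (illustrative only) that prints the bytes of 0x01020304 in address order:
#include <cstddef>
#include <cstdint>
#include <cstdio>

int main() {
    uint32_t value = 0x01020304;
    const unsigned char *p = reinterpret_cast<const unsigned char *>(&value);
    // On a Little Endian machine this prints "04 03 02 01" (least significant
    // byte at the lowest address); on a Big Endian machine it prints "01 02 03 04".
    for (std::size_t i = 0; i < sizeof value; ++i)
        std::printf("%02x ", p[i]);
    std::printf("\n");
    return 0;
}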
This is the expected result when:
1) You use both functions to print a single integer, in binary.
2) Your C++ implementation is on a little-endian hardware platform.
Change either one of these factors (with printBits_2 appropriately adjusted), and the results will be different.
They don't process the bits in different orders. Here's a visual:
Bytes:            4                       3                       2                       1
Bits:    8  7  6  5  4  3  2  1  8  7  6  5  4  3  2  1  8  7  6  5  4  3  2  1  8  7  6  5  4  3  2  1
Bits:   32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10  9  8  7  6  5  4  3  2  1
The fact that the output is the same from both of these functions tells you that your platform uses Little-Endian encoding, which means the most significant byte comes last.
The first two rows show how the first function processes the value on your platform, and the last row shows how the second function does.
However, the first function will fail on platforms that use Big-Endian encoding and will output the bits in the order shown in the third row below:
Bytes:            4                       3                       2                       1
Bits:    8  7  6  5  4  3  2  1  8  7  6  5  4  3  2  1  8  7  6  5  4  3  2  1  8  7  6  5  4  3  2  1
Bits:    8  7  6  5  4  3  2  1 16 15 14 13 12 11 10  9 24 23 22 21 20 19 18 17 32 31 30 29 28 27 26 25
For the printBits function, it takes the uint32 pointer and casts it to a char pointer.
unsigned char *b = (unsigned char*) ptr;
The outer loop then walks from b[size-1] down to b[0]. On a little-endian processor, b[size-1] is the most significant byte of the uint32 value; the inner loop prints that byte in binary, then the loop moves on to the next most significant byte, and so on. Therefore this method prints the uint32 value MSB first.
As for printBits_2, you are using
unsigned *A
i.e. a pointer to an unsigned int. Its loop runs from bit 31 down to bit 0 and prints the uint32 value in binary, also MSB first.
I have an array A[]={3,2,5,11,17} and B[]={2,3,6}; the size of B is always less than the size of A. Now I have to map every element of B to a distinct element of A such that the total difference sum( abs(Bi-Aj) ) becomes minimal (where Bi has been mapped to Aj). What type of algorithm is this?
For the example input I could select 2->2 = 0, 3->3 = 0 and then 6->5 = 1, so the total cost is 0+0+1 = 1. I have been thinking of sorting both arrays and then taking the first (size of B) elements from A. Will this work?
It can be thought of as an unbalanced Assignment Problem.
The cost matrix is the absolute difference between the values of B[i] and A[j]. You can add dummy elements to B so that the problem becomes balanced, and assign the dummy rows a uniformly high cost.
Then the Hungarian algorithm can be applied to solve it.
For the example case A[]={3,2,5,11,17} and B[]={2,3,6} the cost matrix shall be:
       3    2    5   11   17
 2     1    0    3    9   15
 3     0    1    2    8   14
 6     3    4    1    5   11
d1    16   16   16   16   16
d2    16   16   16   16   16
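For an input this small you can also verify the formulation by brute force; the sketch below is not the Hungarian algorithm, just an exhaustive check of every assignment (the dummy rows are implicit, since they contribute nothing to the cost):
#include <algorithm>
#include <cstdio>
#include <cstdlib>
#include <numeric>
#include <vector>

int main() {
    const std::vector<int> A = {3, 2, 5, 11, 17};
    const std::vector<int> B = {2, 3, 6};

    // perm[i] = index of the A element assigned to B[i]; entries beyond
    // B.size() correspond to the dummy rows and add nothing to the cost.
    std::vector<std::size_t> perm(A.size());
    std::iota(perm.begin(), perm.end(), 0);

    long best = -1;
    do {
        long cost = 0;
        for (std::size_t i = 0; i < B.size(); ++i)
            cost += std::abs(B[i] - A[perm[i]]);
        if (best < 0 || cost < best)
            best = cost;
    } while (std::next_permutation(perm.begin(), perm.end()));

    std::printf("minimum total difference = %ld\n", best); // prints 1 for this example
    return 0;
}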
I'm using C++. Using sort from STL is allowed.
I have an array of int, like this:
1 4 1 5 145 345 14 4
The numbers are stored in a char* (I read them from a binary file, 4 bytes per number).
I want to do two things with this array :
swap each number with the one after that
4 1 5 1 345 145 4 14
sort it by group of 2
4 1 4 14 5 1 345 145
I could code it step by step, but it wouldn't be efficient. What I'm looking for is speed. O(n log n) would be great.
Also, this array can be bigger than 500MB, so memory usage is an issue.
My first idea was to sort the array starting from the end (to swap the numbers two by two) and to treat it as a long* (to force the sort to take 2 ints at a time). But I couldn't manage to code it, and I'm not even sure it would work.
I hope I was clear enough, thanks for your help : )
This is the most memory-efficient layout I could come up with. Obviously the vector I'm using would be replaced by the data blob you're using, assuming endianness is all handled well enough. The premise of the code below is simple.
Generate 1024 random values in pairs, each pair consisting of the first number between 1 and 500, the second number between 1 and 50.
Iterate the entire list, flipping all even-index values with their following odd-index brethren.
Send the entire thing to std::qsort with an item width of two (2) int32_t values and a count of half the original vector.
The comparator function simply sorts on the immediate value first, and on the second value if the first is equal.
The sample below does this for 1024 items. I've tested it without output for 134217728 items (exactly 536870912 bytes) and the results were pretty impressive for a measly MacBook Air laptop: about 15 seconds, only about 10 of that on the actual sort. What is most important is that no additional memory allocation is required beyond the data vector. Yes, to the purists, I do use call-stack space, but only because q-sort does.
I hope you get something out of it.
Note: I only show the first part of the output, but I hope it shows what you're looking for.
#include <iostream>
#include <fstream>
#include <algorithm>
#include <iterator>
#include <vector>
#include <cstdint>
#include <cstdlib>
#include <ctime>
using namespace std;

// a most-wacked-out random generator. every other call will
// pull from a rand modulo either the first, or second template
// parameter, in alternation.
template<int N,int M>
struct randN
{
    int i = 0;
    int32_t operator ()()
    {
        i = (i+1)%2;
        return (i ? rand() % N : rand() % M) + 1;
    }
};

// compare two 2-int blocks through their addresses.
int pair_cmp(const void* arg1, const void* arg2)
{
    const int32_t *left = (const int32_t*)arg1;
    const int32_t *right = (const int32_t *)arg2;
    return (left[0] == right[0]) ? left[1] - right[1] : left[0] - right[0];
}

int main(int argc, char *argv[])
{
    // a crapload of int values
    static const size_t N = 1024;
    // seed rand()
    srand((unsigned)time(0));
    // get a huge array of random pairs: first value 1..500, second 1..50
    vector<int32_t> data;
    data.reserve(N);
    std::generate_n(back_inserter(data), N, randN<500,50>());
    // flip all the values
    for (size_t i=0;i<data.size();i+=2)
    {
        int32_t tmp = data[i];
        data[i] = data[i+1];
        data[i+1] = tmp;
    }
    // now sort in pairs. using qsort only because it lends itself
    // *very* nicely to performing block-based sorting.
    std::qsort(&data[0], data.size()/2, sizeof(data[0])*2, pair_cmp);
    cout << "After sorting..." << endl;
    std::copy(data.begin(), data.end(), ostream_iterator<int32_t>(cout,"\n"));
    cout << endl << endl;
    return EXIT_SUCCESS;
}
Output
After sorting...
1
69
1
83
1
198
1
343
1
367
2
12
2
30
2
135
2
169
2
185
2
284
2
323
2
325
2
347
2
367
2
373
2
382
2
422
2
492
3
286
3
321
3
364
3
377
3
400
3
418
3
441
4
24
4
97
4
153
4
210
4
224
4
250
4
354
4
356
4
386
4
430
5
14
5
26
5
95
5
145
5
302
5
379
5
435
5
436
5
499
6
67
6
104
6
135
6
164
6
179
6
310
6
321
6
399
6
409
6
425
6
467
6
496
7
18
7
65
7
71
7
84
7
116
7
201
7
242
7
251
7
256
7
324
7
325
7
485
8
52
8
93
8
156
8
193
8
285
8
307
8
410
8
456
8
471
9
27
9
116
9
137
9
143
9
190
9
190
9
293
9
419
9
453
With some additional constraints on both your input and your platform, you can probably use an approach like the one you are thinking of. These constraints would include
Your input contains only positive numbers (i.e. can be treated as unsigned)
Your platform provides uint8_t and uint64_t in <cstdint>
You address a single platform with known endianness.
In that case you can divide your input into groups of 8 bytes, do some byte shuffling to arrange each group as one uint64_t with the "first" number from the input in the lower-valued half, and run std::sort on the resulting array. Depending on endianness you may need to do more byte shuffling afterwards to rearrange each sorted 8-byte group as a pair of uint32_t in the expected order.
If you can't code this on your own, I'd strongly advise you not to take this approach.
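For what it's worth, here is a minimal sketch of that approach under the constraints listed above (little-endian host, non-negative values, buffer length a multiple of 8 and suitably aligned). Treating the buffer as uint64_t* is not sanctioned by the standard, which is part of why this is non-portable; all names here are illustrative:
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>

void swap_and_sort_u64(unsigned char* buffer, std::size_t buflen) {
    assert(buflen % sizeof(uint64_t) == 0);
    uint64_t* groups = reinterpret_cast<uint64_t*>(buffer);
    const std::size_t n = buflen / sizeof(uint64_t);
    // On a little-endian host each 8-byte group already holds the first int in
    // the low half and the second int in the high half, so sorting the
    // uint64_t values orders the groups by the second int, then by the first.
    std::sort(groups, groups + n);
    // Swap the halves of every group so the former second int comes first.
    for (std::size_t i = 0; i < n; ++i)
        groups[i] = (groups[i] >> 32) | (groups[i] << 32);
}

int main() {
    alignas(8) uint32_t data[] = {1, 4, 1, 5, 145, 345, 14, 4};
    swap_and_sort_u64(reinterpret_cast<unsigned char*>(data), sizeof data);
    for (uint32_t v : data)
        std::printf("%u ", v);  // prints: 4 1 4 14 5 1 345 145
    std::printf("\n");
    return 0;
}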
A better and more portable approach (you have some inherent non-portability by starting from a not clearly specified binary file format), would be:
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

std::vector<int> swap_and_sort_int_pairs(const unsigned char buffer[], size_t buflen) {
    const size_t intsz = sizeof(int);
    // We have to assume that the binary format in buffer is compatible with our int representation;
    // we also require an even number of integers
    assert(buflen % (2*intsz) == 0);
    // load pairwise
    std::vector< std::pair<int,int> > pairs;
    pairs.reserve(buflen/(2*intsz));
    for (const unsigned char* bufp = buffer; bufp < buffer+buflen; bufp += 2*intsz) {
        // It would be better to have a more portable binary -> int conversion
        int first_value = *reinterpret_cast<const int*>(bufp);
        int second_value = *reinterpret_cast<const int*>(bufp + intsz);
        // swap each pair here
        pairs.emplace_back( second_value, first_value );
    }
    // less<pair<..>> does lexicographical ordering, which is what you are looking for
    std::sort(pairs.begin(), pairs.end());
    // convert back to linear vector
    std::vector<int> result;
    result.reserve(2*pairs.size());
    for (auto& entry : pairs) {
        result.push_back(entry.first);
        result.push_back(entry.second);
    }
    return result;
}
Both the initial parse/swap pass (which you need anyway) and the final conversion are O(N), so the total complexity is still O(N log N).
If you can continue to work with pairs, you can save the final conversion. The other way to save that conversion would be to use a hand-coded sort with two-int strides and two-int swap: much more work - and possibly still hard to get as efficient as a well-tuned library sort.
Do one thing at a time. First, give your data some *struct*ure. It seems that each 8 bytes form a unit of the form
struct unit {
    int key;
    int value;
};
If the endianness is right, you can do this in O(1) with a reinterpret_cast. If it isn't, you'll have to live with an O(n) conversion effort. Both vanish compared to the O(n log n) sorting effort.
When you have an array of these units, you can use std::sort like:
bool compare_units(const unit& a, const unit& b) {
    return a.key < b.key;
}

std::sort(array, array + length, compare_units);
The key to this solution is that you do the "swapping" and byte-interpretation first and then do the sorting.
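A self-contained usage sketch of that idea, assuming the endianness is right and accepting the reinterpret_cast caveat above (the data is the question's, already swapped pairwise so the sort key comes first):
#include <algorithm>
#include <cstddef>
#include <cstdio>

struct unit {
    int key;
    int value;
};

bool compare_units(const unit& a, const unit& b) {
    return a.key < b.key;
}

int main() {
    int raw[] = {4, 1, 5, 1, 345, 145, 4, 14};
    unit* array = reinterpret_cast<unit*>(raw);
    const std::size_t length = sizeof raw / sizeof(unit);

    std::sort(array, array + length, compare_units);

    // Prints the groups ordered by key, e.g. "4 1 4 14 5 1 345 145"; the two
    // key-4 groups may come out in either order, since only the key is
    // compared and std::sort is not stable.
    for (std::size_t i = 0; i < length; ++i)
        std::printf("%d %d ", array[i].key, array[i].value);
    std::printf("\n");
    return 0;
}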