Using values from `__m256i` to access an array efficiently - SIMD [closed] - c++

Let's say, for example, that I have two __m256i variables called rows and cols, holding these values:
rows: 0, 2, 7, 5, 7, 2, 3, 0
cols: 1, 2, 7, 5, 7, 2, 2, 6
Now, these values represent the x and y positions for 8 points, so, in this case I would have these points:
p0: [0, 1], p1: [2, 2], p2: [7, 7], p3: [5, 5]
p4: [7, 7], p5: [2, 2], p6: [3, 2], p7: [0, 6]
I also have an array called lut that will have values of int type:
lut: [0, 1, 2, 3, ..., 60, 61, 62, 63]
What I want to do is use the positions in rows and cols to index into the lut array and build a new __m256i from the values read.
The only way I know to do that is to store rows and cols into two int arrays of size 8, read the values from lut one at a time, and then use _mm256_set_epi32() to create the new __m256i value.
This works, but it seems very inefficient. So my question is whether there is a faster way to do it.
Note that these values are just for a more concrete example, and lut doesn't need to have ordered values or size 64.
thanks!

You can build a solution using an AVX2 gather instruction, like so (the shift by 3 assumes 8 entries per row, as in your 64-element example):
// index = (rows << 3) + cols;
const __m256i index = _mm256_add_epi32( _mm256_slli_epi32(rows, 3), cols);
// result = lut[index];
const __m256i result = _mm256_i32gather_epi32(lut, index, 4);
Be aware that on current CPUs gather instructions have fairly high latency, so unless you can interleave other independent instructions before actually using result, this may not be worth using.
To explain the factor of 4: The scale factor in
__m256i _mm256_i32gather_epi32 (int const* base_addr, __m256i vindex, const int scale)
is interpreted as a byte offset, i.e., the value returned for each index is:
*(const int*)((const char*) base_addr + scale*index)
I don't know whether there are many use cases for that behavior. Perhaps it is meant to allow access to a LUT with 1-byte or 2-byte entries (which would require some masking afterwards). Or perhaps it was simply allowed because scaling by 4 is possible, whereas scaling by 1/4 or 1/2 would not be (in case someone really needed that).
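To make the byte-offset semantics concrete, here is a scalar model of the gather in Python. It is only a sketch: gather_epi32 is a hypothetical helper name, the LUT is assumed to hold 64 little-endian 32-bit ints laid out as 8 entries per row, and nothing here reflects how the hardware actually executes the instruction.

```python
import struct

def gather_epi32(base, vindex, scale):
    """Scalar model of the gather: one 32-bit read at base + scale*index per lane.

    gather_epi32 is a hypothetical name used only for this illustration.
    """
    assert scale in (1, 2, 4, 8)  # the intrinsic only accepts these scale values
    return [struct.unpack_from("<i", base, scale * i)[0] for i in vindex]

rows = [0, 2, 7, 5, 7, 2, 3, 0]
cols = [1, 2, 7, 5, 7, 2, 2, 6]

# Assumed layout: 64 ints, 8 entries per row, stored as raw little-endian bytes.
lut = list(range(64))
lut_bytes = struct.pack("<64i", *lut)

# index = (rows << 3) + cols, computed lane by lane
index = [(r << 3) + c for r, c in zip(rows, cols)]

# scale = 4 because each int entry is 4 bytes wide
result = gather_epi32(lut_bytes, index, 4)

# With scale = 1 the indices are plain byte offsets, so they must be pre-multiplied.
same_result = gather_epi32(lut_bytes, [4 * i for i in index], 1)
```

With scale set to 4 the indices count ints; with scale set to 1 the same lookup needs indices that are already byte offsets.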


is pyrr.Matrix44 layout actually column-major?

The pyrr.Matrix docs state:
Matrices are laid out in row-major format and can be loaded directly into OpenGL. To convert to column-major format, transpose the array using the numpy.array.T method.
Creating a transformation matrix gives me:
Matrix44.from_translation( np.array([1,2,3]))
Matrix44([[1, 0, 0, 0],
[0, 1, 0, 0],
[0, 0, 1, 0],
[1, 2, 3, 1]])
If the layout is row-major, I would expect the output to be the transpose:
Matrix44([[1, 0, 0, 1],
[0, 1, 0, 2],
[0, 0, 1, 3],
[0, 0, 0, 1]])
I'm most likely confused (I come from C/OpenGL background), but could anyone please enlighten me?
Jonathan
I was writing up an answer, but then I found a really interesting link that I invite you to read.
Here is a short summary:
If it's a row-major matrix, the translation is stored at indices 3, 7, and 11.
If it's column-major, the translation is stored at indices 12, 13, and 14.
The difference behind the scenes is how the data is stored. The matrix is 16 floats that are contiguous in memory, so you have to decide whether to store them as 4 columns of 4 floats or as 4 rows of 4 floats. That choice then changes how you access and use the matrix.
You can look at this link too.
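The index rule can be checked against the matrix from the question using plain Python lists (no pyrr needed). This is a minimal sketch: flatten and transpose are hypothetical helpers, and flattening row by row is assumed to match how a C-ordered NumPy array sits in memory.

```python
def flatten(matrix):
    # Row-by-row flattening, i.e. the order of a C-ordered array in memory.
    return [value for row in matrix for value in row]

def transpose(matrix):
    return [list(column) for column in zip(*matrix)]

# Matrix printed by Matrix44.from_translation(np.array([1, 2, 3])) in the question.
pyrr_output = [[1, 0, 0, 0],
               [0, 1, 0, 0],
               [0, 0, 1, 0],
               [1, 2, 3, 1]]

flat = flatten(pyrr_output)
translation_as_stored = flat[12:15]

# The transposed matrix (the layout the question expected to see).
flat_t = flatten(transpose(pyrr_output))
translation_transposed = [flat_t[3], flat_t[7], flat_t[11]]
```

In memory, the matrix pyrr prints keeps the translation at indices 12, 13, and 14, which is the order OpenGL expects. That is presumably why the docs say it can be loaded directly.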

Find unique quadruplets in C++ [closed]

So, I need to find unique quadruples in C++. Any idea would help.
Input 1 : [1, 0, 2, 3], [2, 0, 1, 3], [4, 5, 6, 7], [8, 9, 10, 11]
Output 1 : [2, 0, 1, 3], [4, 5, 6, 7], [8, 9, 10, 11]
Since [1,0,2,3] and [2,0,1,3] contain the same elements, either one can be in the output.
Input 2 : [2, 0, 1, 3], [4, 5, 6, 7], [8, 9, 10, 11], [15,16,17,18]
Output 2 : [2, 0, 1, 3], [4, 5, 6, 7], [8, 9, 10, 11], [15,16,17,18]
I cannot initialize set (int,int,int,int). Any idea on how to get the unique ones?
Update for people who asked for defining the question more:
A quadruple is a combination of 4 integers. The problem is to find the unique quadruples among all the given quadruples. A quadruple (a,b,c,d) is unique if no other quadruple contains exactly the same elements, i.e. any permutation of (a,b,c,d) counts as the same quadruple. Quadruples (a,b,c,d) and (a,d,b,c) are the same, whereas quadruples (a,b,c,d) and (a,e,f,b) are not. Two quadruples are distinct if at least one element is not common to both.
Write a comparator that sorts the integers in the quadruples before comparing them.
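#include <algorithm>
#include <array>
#include <set>

// Assumption: Quad is a fixed-size array of 4 ints; any sortable container of 4 ints would do.
using Quad = std::array<int, 4>;

struct CompareQuads
{
bool operator()(Quad x, Quad y) const
{
// sort x integers (x is a by-value copy, so the caller's quad is untouched)
std::sort(x.begin(), x.end());
// sort y integers
std::sort(y.begin(), y.end());
// return true if x < y (lexicographically)
return x < y;
}
};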
Use the comparator in std::set to eliminate duplicates.
std::set<Quad, CompareQuads> s;
Add all the quads to s and the duplicates will be removed. Iterate through s and print the ones that remain.
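For comparison, here is the same idea sketched in Python. Note that it swaps the technique: instead of an ordered set with a custom comparator, it uses a hash set keyed by the sorted tuple. The function name unique_quads is hypothetical, and keeping the first quadruple seen is an assumption (the question allows either one).

```python
def unique_quads(quads):
    """Keep the first quadruple seen for each multiset of elements."""
    seen = set()
    result = []
    for quad in quads:
        key = tuple(sorted(quad))  # permutations of the same elements share a key
        if key not in seen:
            seen.add(key)
            result.append(quad)
    return result

input_1 = [[1, 0, 2, 3], [2, 0, 1, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
output_1 = unique_quads(input_1)

input_2 = [[2, 0, 1, 3], [4, 5, 6, 7], [8, 9, 10, 11], [15, 16, 17, 18]]
output_2 = unique_quads(input_2)
```

Unlike std::set, this keeps the input order.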

Equal-depth binning: is it just grouping data into k groups?

I have a small confusion about equal-depth (equal-frequency) binning.
Equal-depth binning is described as: it divides the range into N intervals, each containing approximately the same number of samples.
Let's take a small portion of the iris data:
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
If I need to bin my 1st column, what will the results be?
Is it just grouping the data, or does it involve some calculation, as equal-width binning does?
What happens if the number of elements to be binned is odd? How will I bin them equally?
As @Anony-Mousse mentions, it is not always possible to get exactly the same number of samples in each bin; approximately equal is what is desired.
I will walk you through the case when length(unique(N))/bins > 1, where N is the array of values to be binned. Assume
N = [1, 1, 1, 1, 1, 1,
2, 3, 4, 5,
6, 6, 6, 6, 6, 6, 6, 6, 6, 6]
bins = 4
Here, length(N) = 20 and length(unique(N)) = 6, making length(unique(N))/bins = 1.5 > 1, which means every bin will hold approximately 1.5 unique values. So you put 1 in bin1 and carry the 0.5 residue over to the next bin, which raises that bin's quota to 1.5 + 0.5 = 2, so 2 and 3 go in bin2. Extrapolating this logic, the final bins have the following split: [1], [2,3], [4], [5,6]. Of course, 1 repeats 6 times and 6 repeats 10 times.
I would not like ties to sit in separate bins; keeping values that are close to one another together is usually the point of having bins.
For cases with length(unique(N))/bins < 1, the same logic can be applied. Hope this answers your question.
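The carry-over rule from this walkthrough can be sketched in a few lines of Python. This is one reading of the description, not a standard library routine. The name bin_unique_values is hypothetical, and exact fractions are used so the residue does not suffer from floating-point drift.

```python
from fractions import Fraction
from math import floor

def bin_unique_values(values, bins):
    """Spread the sorted unique values over `bins` bins, carrying the residue forward."""
    uniques = sorted(set(values))
    per_bin = Fraction(len(uniques), bins)  # e.g. 6 / 4 = 1.5 unique values per bin
    result = []
    start = 0
    quota = Fraction(0)
    for _ in range(bins):
        quota += per_bin          # this bin's share plus any carried residue
        take = floor(quota)       # whole number of unique values that fit
        result.append(uniques[start:start + take])
        start += take
        quota -= take             # leftover fraction carries over to the next bin
    return result

N = [1, 1, 1, 1, 1, 1,
     2, 3, 4, 5,
     6, 6, 6, 6, 6, 6, 6, 6, 6, 6]
split = bin_unique_values(N, 4)
```

Because the bins are built from unique values, all copies of a tied value land in the same bin, at the cost of very uneven sample counts per bin.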
Sometimes you cannot make bins of exactly the same size.
For example, if your data is
1,1,1,2,99
and you want 4 bins, then the most intuitive result should be
[1,1,1], [2], [], [99]
Most tools will produce one of these answers:
[1,1,1], [], [2], [99]
[1,1], [1], [2], [99]
[1], [1], [1], [2,99]
None of them have exactly 1.25 elements in every bin. The two last solutions are closest, but also the least intuitive. That is why one only demands "approximately the same number". Sometimes, there is no good solution that exactly has this frequency.
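As an illustration, here is a minimal rank-based sketch of equal-frequency binning: sort the values, then assign each one to a bin by its position. The name equal_frequency_bins and the position * k // n rule are assumptions made for this example; real tools differ in how they place boundaries and handle ties.

```python
def equal_frequency_bins(values, k):
    """Assign sorted values to k bins by rank so the bin sizes differ by at most one."""
    ordered = sorted(values)
    n = len(ordered)
    bins = [[] for _ in range(k)]
    for position, value in enumerate(ordered):
        bins[position * k // n].append(value)  # rank decides the bin, not the value
    return bins

example = equal_frequency_bins([1, 1, 1, 2, 99], 4)

# First column of the iris excerpt from the question, split into 2 bins.
iris_first_column = [5.1, 4.9, 4.7, 4.6, 5.0]
iris_bins = equal_frequency_bins(iris_first_column, 2)
```

On the 1,1,1,2,99 data this rule gives [1,1], [1], [2], [99], the second of the three results listed above. It splits the tied 1s, which is exactly the unintuitive behavior described. With an odd number of elements, as in the iris column, one bin simply gets one more sample than the other.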

Replacing elements in an array in Python

I want to look through an array of elements. If an element exceeds a certain value x, I want to replace it with another value y. There could be many elements that need to be replaced. Is there a function to do this all at once? I don't want to use a for loop.
Does the any() function help here?
Thanks
I really don't know how one could achieve this without some kind of if condition.
I don't know about any(), but I gave it a try with map since you don't want a for loop. Do note that the complexity (Big O) is still O(n).
>>> array = [1, 2, 3, 4, 2, -2, -3, 8, 3, 0]
>>> array = list(map(lambda x: x if x < 3 else 2, array))
>>> array
[1, 2, 2, 2, 2, -2, -3, 2, 2, 0]
Basically, x if x < 3 else 2 implements "if an element exceeds a certain value x, replace it with another value y" (here any element of 3 or more is replaced with 2).
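The same replacement can be written with the threshold and replacement as parameters, using a list comprehension. This is a small sketch; replace_above is a hypothetical helper name, and "exceeds" is taken to mean strictly greater than.

```python
def replace_above(values, threshold, replacement):
    """Return a new list where every element greater than `threshold` is replaced."""
    return [replacement if value > threshold else value for value in values]

array = [1, 2, 3, 4, 2, -2, -3, 8, 3, 0]
result = replace_above(array, 2, 2)
```

If the data is a NumPy array rather than a list, np.where(a > x, y, a) performs the same replacement without an explicit Python-level loop.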

Proper form for using a 2D array in Clojure and initializing each cell

(Lisp beginner)
I need to create a 2D array, and initialize each cell in the array. Each cell is initialized with a function that is based on data in a preceding cell. So the cell at 0,1 will be initialized with the result of a function that uses the data from cell 0,0, and so on.
I was wondering what is the proper clojure idiom for setting up a data structure like this.
The representation of your array actually depends on how you need to use it, not on how you initialize it. For example, if you have a dense matrix, you should most probably use a vector of vectors like this:
[[0, 1, 2, 3, 4],
[5, 6, 7, 8, 9],
[9, 8, 7, 6, 5],
[4, 3, 2, 1, 0],
[0, 1, 2, 3, 4]]
or a single vector with some additional info on row length:
{:length 5
:data
[0, 1, 2, 3, 4,
5, 6, 7, 8, 9,
9, 8, 7, 6, 5,
4, 3, 2, 1, 0,
0, 1, 2, 3, 4]
}
and if you need a sparse matrix, you can use hash-maps:
{0 {0 0, 4 4},
2 {2 7},
3 {0 4, 2 2}}
(Since your 2D array is small and you generate each value from the previous one, I believe the first option is better suited for you.)
If you are going to do a lot of matrix-specific manipulations (multiplication, decomposition, etc.), you may want to use an existing library like Incanter.
As for filling, my proposal is to use transients and carry the interim result along, i.e. (for a one-dimensional vector):
(defn make-array [initial-value f length]
  (loop [result (transient [])
         length-left length
         interim-value initial-value]
    (if (zero? length-left)
      (persistent! result)
      ;; compute the next value once and reuse it for the vector and the next iteration
      (let [next-value (f interim-value)]
        (recur (conj! result next-value) (dec length-left) next-value)))))
Transients avoid creating a new data structure for each new element, and the interim value avoids having to read the previous element back from the transient structure.
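The same carry-the-interim-value idea extends to two dimensions. The sketch below is in Python rather than Clojure, purely as an illustration. It assumes "preceding cell" means the previous cell in row-major order (so the first cell of a row follows the last cell of the row above), and make_grid is a hypothetical name.

```python
def make_grid(initial_value, f, rows, cols):
    """Fill a rows x cols grid where each cell is f applied to the preceding cell."""
    grid = []
    current = initial_value
    for _ in range(rows):
        row = []
        for _ in range(cols):
            current = f(current)  # carry the interim value instead of re-reading the grid
            row.append(current)
        grid.append(row)
    return grid

grid = make_grid(0, lambda value: value + 1, 2, 3)
```

As in the one-dimensional version, the first cell holds f(initial-value), not the initial value itself.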
I don't know if this is a bad technique, but I've used hash maps (or usually ordered maps) to represent 2D "arrays". They are built up like this:
{ [x y] value ... }
There are cons to this, since you have to specify the limits of the array somehow. It is also probably very slow compared to the plain vector representations described in ffriend's post.