Restricted range of SV_DispatchThreadID in ComputeShader - hlsl

The situation is the following:
I have a kernel:
#pragma kernel diffga
#pragma[numthreads(16, 8, 1)]
void diffga(uint3 id : SV_DispatchThreadID) {
/* code here */
}
I dispatch this kernel with the following:
_shader.Dispatch(kidiffga, 8, 16, 1)
If I capture the values of id.x and id.y, id.y ranges from 0:31 as expected. However, id.x only ranges from 0:7. If I change numthreads and the dispatch so that the expected range is less than 8, it works fine. However, any configuration of numthreads and dispatch that requires a range greater than 8 is capped.
Any insight as to why this is the case would be much appreciated.

The source of the error came from instantiating the ComputeBuffer. I had the stride parameter set incorrectly.

Julia - replace

I have tried the following method and it worked.
a=[1,2,3]
b=[5,6,7]
for i=1:3
a=replace(a,a[i]=>b[i]*a[i])
end
The result was a=[5,12,21], which is the element-wise product I wanted.
However, I tried to use the same method for getting the product I want but it didn't work.
a=[]
for i=1:10
a=push!(a,2^i)
end
for i=1:10
a=replace(a,a[i]=>a[i]*a[i])
end
But the result is
a=[65536,65536,4096,65536,1048576,4096,16384,65536,262144,1048576]
And I want to yield
a=[4,16,64,256,1024,4096,16384,65536,262144,1048576]
The problem here is that replace might not do what you want. The command
replace(A, old => new)
takes a collection A and creates a new collection where every occurrence of old is replaced by new.
So if we look at your example, in the first iteration we replace every occurrence of a[1] == 2 by 4. This yields
a == [4, 4, 8, 16, 32, 64, 128, 256, 512, 1024]
In the second iteration, we replace every occurrence of a[2] == 4 by 16. This yields
a == [16, 16, 8, 16, 32, 64, 128, 256, 512, 1024]
and so on. This should explain why you get that weird result.
Apart from the broadcasts a .= a .* a or a .= a .^ 2 that Oscar Smith mentioned in his comment, you could also use the functions map
a = map(x -> x^2, a)
or map!:
map!(x -> x^2, a, a)
The difference between map and map! is that map creates a new collection, whereas map! writes to an already existing one. In this example, the input collection is the same as the output collection.
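For readers more at home in C++, the same contrast can be sketched with std::replace and std::transform. This is only an analogy to the Julia behaviour described above, not Julia code; the function names square_with_replace and square_with_transform are made up for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Mimics the loop from the question: at step i, EVERY element equal to the
// current a[i] is replaced by a[i] * a[i], which is what replace does.
std::vector<long long> square_with_replace(std::vector<long long> a) {
    for (std::size_t i = 0; i < a.size(); ++i) {
        // Copy the values first: std::replace takes them by reference, and
        // a[i] itself is about to be overwritten.
        const long long old_value = a[i];
        const long long new_value = old_value * old_value;
        std::replace(a.begin(), a.end(), old_value, new_value);
    }
    return a;
}

// Squares each element exactly once, in place, like map!(x -> x^2, a, a).
std::vector<long long> square_with_transform(std::vector<long long> a) {
    std::transform(a.begin(), a.end(), a.begin(),
                   [](long long x) { return x * x; });
    return a;
}
```

Calling square_with_replace on the powers of two 2, 4, ..., 1024 reproduces the unexpected result from the question, while square_with_transform yields the intended squares.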

Verify the number of times a CUDA kernel is called

Say you have a cuda kernel that you want to run 2048 times, so you define your kernel like this:
__global__ void run2048Times(){ }
Then you call it from your main code:
run2048Times<<<2,1024>>>();
All seems well so far. However, now say that for debugging purposes, when you're calling the kernel millions of times, you want to verify that you're actually calling the kernel that many times.
What I did was pass a pointer to the kernel and ++'d the pointer every time the kernel ran.
__global__ void run2048Times(int *kernelCount){
kernelCount[0]++; // Add to the pointer
}
However when I copied that pointer back to the main function I get "2".
At first it baffled me; then, after 5 minutes of coffee and pacing back and forth, I realized this probably makes sense because the CUDA kernel is running 1024 instances of itself at the same time, which means that the kernels overwrite "kernelCount[0]" instead of truly adding to it.
So instead I decided to do this:
__global__ void run2048Times(int *kernelCount){
// Get the id of the kernel
int id = blockIdx.x * blockDim.x + threadIdx.x;
// If the id is bigger than the pointer overwrite it
if(id > kernelCount[0]){
kernelCount[0] = id;
}
}
Genius!! This was guaranteed to work I thought. Until I ran it and got all sorts of numbers between 0 and 2000.
Which tells me that the problem mentioned above still happens here.
Is there any way to do this, even if it involves forcing the kernels to pause and wait for each other to run?
Assuming this is a simplified example, and you are not in fact trying to do profiling as others have already suggested, but want to use this in a more complex scenario, you can achieve the result you want with atomicAdd, which will ensure that the increment operation is executed as a single atomic operation:
__global__ void run2048Times(int *kernelCount){
atomicAdd(kernelCount, 1); // Add to the pointer
}
Why your solutions didn't work:
The problem with your first solution is that it gets compiled into the following PTX code (see the PTX ISA documentation for a description of the instructions):
ld.global.u32 %r1, [%rd2];
add.s32 %r2, %r1, 1;
st.global.u32 [%rd2], %r2;
You can verify this by calling nvcc with the --ptx option to only generate the intermediate representation.
What can happen here is the following timeline, assuming you launch 2 threads (Note: this is a simplified example and not exactly how GPUs work, but it is enough to illustrate the problem):
thread 0 reads 0 from kernelCount
thread 1 reads 0 from kernelCount
thread 0 increases its local copy by 1
thread 0 stores 1 back to kernelCount
thread 1 increases its local copy by 1
thread 1 stores 1 back to kernelCount
and you end up with 1 even though 2 threads were launched.
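The timeline above can be replayed deterministically on the CPU. The following is a minimal single-threaded sketch in plain C++, not CUDA: it only models the three-instruction load/add/store sequence, and the names SimThread and replay are hypothetical helpers introduced for this illustration.

```cpp
#include <cassert>
#include <vector>

// One simulated thread executing the three-instruction increment.
struct SimThread {
    int reg = 0;  // the thread's private register (its "local copy")
    int pc = 0;   // next instruction: 0 = load, 1 = add, 2 = store, 3 = done
};

// Each entry of `schedule` names the thread that executes its next
// instruction. Returns the final value of the shared counter.
int replay(int thread_count, const std::vector<int>& schedule) {
    int counter = 0;  // stands in for kernelCount[0]
    std::vector<SimThread> threads(thread_count);
    for (int t : schedule) {
        SimThread& th = threads[t];
        switch (th.pc) {
            case 0: th.reg = counter; break;  // ld.global.u32
            case 1: th.reg += 1; break;       // add.s32
            case 2: counter = th.reg; break;  // st.global.u32
            default: break;                   // thread already finished
        }
        if (th.pc < 3) ++th.pc;
    }
    return counter;
}
```

With the schedule {0, 1, 0, 0, 1, 1}, which matches the interleaving listed above, replay(2, ...) returns 1; with the sequential schedule {0, 0, 0, 1, 1, 1} it returns 2.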
Your second solution is wrong even if the threads are launched sequentially because thread indexes are 0-based. So I'll assume you wanted to do this:
__global__ void run2048Times(int *kernelCount){
// Get the id of the kernel
int id = blockIdx.x * blockDim.x + threadIdx.x;
// If the id is bigger than the pointer overwrite it
if(id + 1 > kernelCount[0]){
kernelCount[0] = id + 1;
}
}
This will compile into:
ld.global.u32 %r5, [%rd1];
setp.lt.s32 %p1, %r1, %r5;
@%p1 bra BB0_2;
add.s32 %r6, %r1, 1;
st.global.u32 [%rd1], %r6;
BB0_2:
ret;
What can happen here is the following timeline:
thread 0 reads 0 from kernelCount
thread 1 reads 0 from kernelCount
thread 1 compares 0 to 1 + 1 and stores 2 into kernelCount
thread 0 compares 0 to 0 + 1 and stores 1 into kernelCount
You end up having the wrong result of 1.
I suggest you pick up a good parallel programming / CUDA book if you want to better understand problems with synchronization and non-atomic operations.
EDIT:
For completeness, the version using atomicAdd compiles into:
atom.global.add.u32 %r1, [%rd2], 1;
It seems like the only point of that counter is to do profiling (i.e. analyse how the code runs) rather than to actually count something (i.e. no functional benefit to the program).
There are profiling tools available designed for this task. For example, nvprof gives the number of calls, as well as some time metrics for each kernel in your codebase.

How to draw desired matrix using c++?

Draw the matrix below using C++. The problem requires a function that can be called from main().
x!x!x
~~~~~
x!x!x
~~~~~
x!x!x
I tried comparing against positions 0, 2, and 4 and printing accordingly, but is there any other way to solve this problem?
If the matrix consists of characters, you could do something like this:
char board[] =
"x|x|x\n"
"-+-+-\n"
"x|x|x\n"
"-+-+-\n"
"x|x|x\n"
;
The cells containing the character 'x' are located at indices 0, 2, 4, 12, 14, 16, 24, 26, and 28. The rows of cells start at indices 0, 12, and 24.
Hint: there are 6 characters per line (5 visible characters plus the newline).
Hint: with 0-based row and column numbers for the cells, a cell's index is (row * 2 * (characters per line)) + (column * 2).
This has the nice benefit of requiring only one statement to print:
std::cout.write(&board[0], sizeof(board) - 1U);
The - 1U is so that the terminating nul is not sent to cout.
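Putting the pieces together, a minimal sketch of the requested function might look like the following. The names draw_board, cell_index, and kCharsPerLine are assumptions made for this example, and the board uses the '|', '-', and '+' characters from this answer, not the exact symbols in the question.

```cpp
#include <cassert>
#include <cstddef>
#include <iostream>

// The whole board as one string literal: 5 lines of 6 characters each
// (5 visible characters plus '\n'), followed by the terminating nul.
const char board[] =
    "x|x|x\n"
    "-+-+-\n"
    "x|x|x\n"
    "-+-+-\n"
    "x|x|x\n";

constexpr std::size_t kCharsPerLine = 6;

// Index into `board` of the cell at 0-based (row, col). Each cell row is
// followed by a separator line, hence the factor of 2 on the row.
std::size_t cell_index(std::size_t row, std::size_t col) {
    assert(row < 3 && col < 3);
    return row * 2 * kCharsPerLine + col * 2;
}

// Writes the board in a single statement, omitting the terminating nul.
void draw_board(std::ostream& out) {
    out.write(board, sizeof(board) - 1);
}
```

From main(), the board can be printed with draw_board(std::cout), and a single cell can be changed before drawing only if board is declared without const.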

Mod of two large numbers in C++

I have a class named LargeNum, which stores large numbers in an array such as digit[], because int is not large enough to hold them.
The base is 10000, so number '9876 8764 7263' is stored like:
digit[4] = {9876, 8764, 7263};
(the base can be changed into 10 or 100, like digit[12] = {9,8,7,6,8,7,6,4,7,2,6,3})
The problem is that I want to overload operator %, so that I can get the remainder of two large numbers. Overloading operators * and - between large numbers is done by dealing with every digit of the large number. But I really don't know how to do so with %. For example:
{1234,7890,1234} % {4567,0023}
Can anyone help me?
The pseudocode should be:
while digits_source > digits_base {
while first_digit_source > first_digit_base {
source -= base << (digits_source - digits_base)
}
second_digit_source += first_digit_source * LargeNum.base
first_digit_source = 0
digits_source--
}
while (source >= base) {
source -= base
}
return source
This should take advantage of your "digits" of the large number.
Edit: For simplicity, I am assuming that a single digit of your array can hold (numerically speaking) two digits. If it cannot, then the code would become quite tricky because you cannot do second_digit_source += first_digit_source * LargeNum.base
Edit:
Regarding an example operation (base = 10000)
{65,0000,0099} % {32,0001}
As 65 is > 32, then proceed to do:
65 - 32 = 33
0 - 1 = -1
Then we have {33, -1, 99} % {32, 1}. Proceed again
33 - 32 = 1
-1 - 1 = -2
We have {1, -2, 99} % {32, 1}. Because 32 > 1, we then join the first two digits of the source and we have {1*10000 - 2, 99} % {32, 1}. Now we can go into the simple while loop, simply by repeating the subtraction. That loop does a full comparison of source >= base because we cannot afford to have negative digits there. However, during the first part of the algorithm we can, because we are guaranteeing that the combination of the first two digits will be positive.
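As a cross-check for whatever operator % you end up with, here is a minimal sketch of a different and simpler technique: schoolbook long division that brings down one base-10000 digit at a time and subtracts the divisor until the running remainder is smaller than it. It is not the pseudocode above, and it is slow for large divisor counts (up to 9999 subtractions per digit). The Digits alias and the free functions are assumptions for this illustration; the real LargeNum class is not shown in the question.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

using Digits = std::vector<int>;  // base-10000 digits, most significant first
constexpr int kBase = 10000;

// Drops leading zero digits, keeping at least one digit if any exist.
void trim(Digits& n) {
    std::size_t zeros = 0;
    while (zeros + 1 < n.size() && n[zeros] == 0) ++zeros;
    n.erase(n.begin(), n.begin() + static_cast<std::ptrdiff_t>(zeros));
}

// True if a < b; both operands must already be trimmed.
bool less(const Digits& a, const Digits& b) {
    if (a.size() != b.size()) return a.size() < b.size();
    return a < b;  // lexicographic order is numeric order at equal length
}

// a -= b with borrow propagation; requires a >= b.
void subtract(Digits& a, const Digits& b) {
    int borrow = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const std::size_t ai = a.size() - 1 - i;
        const int bd = i < b.size() ? b[b.size() - 1 - i] : 0;
        const int d = a[ai] - bd - borrow;
        borrow = d < 0 ? 1 : 0;
        a[ai] = d + borrow * kBase;
    }
    trim(a);
}

// Remainder of source / divisor.
Digits mod(const Digits& source, Digits divisor) {
    trim(divisor);
    if (divisor.empty() || (divisor.size() == 1 && divisor[0] == 0))
        throw std::invalid_argument("modulo by zero");
    Digits rem{0};
    for (int d : source) {
        rem.push_back(d);  // rem = rem * kBase + d
        trim(rem);
        while (!less(rem, divisor)) subtract(rem, divisor);
    }
    return rem;
}
```

Note that the divisor from the question has to be written as {4567, 23} in C++ source, because a literal such as 0023 would be parsed as octal.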

Adding constraints for a known solution causes out of bounds exception

I have a linear optimization goal to Maximize EE+FF, where EE and FF each consist of some C and D.
With code I've written, I can get solver to find:
EE_quantity: 0, FF_quantity: 7
...but I know there to be another solution:
EE_quantity: 1, FF_quantity: 6
In order to validate user input for other valid solutions, I added a constraint for both EE and FF. So I added the EE_quantity == 0, FF_quantity == 7 in the code below, which is a runnable example:
SolverContext c2 = SolverContext.GetContext();
Model m2 = c2.CreateModel();
p.elements = elements_multilevel_productmix();
Decision C_quantity = new Decision(Domain.IntegerNonnegative, "C_quantity");
Decision D_quantity = new Decision(Domain.IntegerNonnegative, "D_quantity");
Decision EE_quantity = new Decision(Domain.IntegerNonnegative, "EE_quantity");
Decision FF_quantity = new Decision(Domain.IntegerNonnegative, "FF_quantity");
m2.AddDecisions(C_quantity, D_quantity, EE_quantity, FF_quantity);
m2.AddConstraints("production",
6 * C_quantity + 4 * D_quantity <= 100,
1 * C_quantity + 2 * D_quantity <= 200,
2 * EE_quantity + 1 * FF_quantity <= C_quantity,
1 * EE_quantity + 2 * FF_quantity <= D_quantity,
EE_quantity == 0,
FF_quantity == 7
);
m2.AddGoal("fixed_EE_FF", GoalKind.Maximize, "EE_quantity + FF_quantity");
Solution sol = c2.Solve(new SimplexDirective());
foreach (var item in sol.Decisions)
{
System.Diagnostics.Debug.WriteLine(
item.Name + ": " + item.GetDouble().ToString()
);
}
It seems that Solver Foundation really doesn't like this specific combination. Using EE_quantity == 1, FF_quantity == 6 is fine, as is using just EE_quantity == 0 or FF_quantity == 7. But using both, AND having one of them being zero, throws an exception:
Index was outside the bounds of the array.
What is going on under the hood, here? And how do I specify that I want to find "all" solutions for a specific problem?
(Note: no new releases of Solver Foundation are forthcoming - it's essentially been dropped by Microsoft.)
The stack trace indicates that this is a bug in the simplex solver's presolve routine. Unfortunately the SimplexDirective does not have a way to disable presolve (unlike InteriorPointDirective). Therefore the way to get around this problem is to specify the fixed variables differently.
Remove the last two constraints that set EE_quantity and FF_quantity, and instead set both the upper and lower bounds to be 0 and 7 respectively when you create the Decision objects. This is equivalent to what you wanted to express, but appears to avoid the MSF bug:
Decision EE_quantity = new Decision(Domain.IntegerRange(0, 0), "EE_quantity");
Decision FF_quantity = new Decision(Domain.IntegerRange(7, 7), "FF_quantity");
The MSF simplex solver, like many mixed integer solvers, only returns the optimal solution. If you want MSF to return all solutions, change to the constraint programming solver (ConstraintProgrammingDirective). If you review the documentation for Solution.GetNext() you should figure out how to do this.
Of course the CP solver is not guaranteed to produce the globally optimal solution immediately. But if you iterate through solutions long enough, you'll get there.
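Because this particular model is tiny, the set of optimal (EE, FF) pairs can also be checked independently by brute force. The sketch below is plain C++ and does not use Solver Foundation at all; it simply enumerates every non-negative integer assignment that satisfies the four "production" constraints from the question. The function name optimal_ee_ff_pairs is made up for this illustration.

```cpp
#include <cassert>
#include <set>
#include <utility>

// Enumerates all feasible integer (C, D, EE, FF) and returns every distinct
// (EE, FF) pair that reaches the maximum of EE + FF.
std::set<std::pair<int, int>> optimal_ee_ff_pairs() {
    int best = -1;
    std::set<std::pair<int, int>> pairs;
    for (int c = 0; 6 * c <= 100; ++c) {
        for (int d = 0; 6 * c + 4 * d <= 100; ++d) {
            if (c + 2 * d > 200) continue;          // second constraint
            for (int ee = 0; 2 * ee <= c; ++ee) {
                for (int ff = 0; 2 * ee + ff <= c; ++ff) {
                    if (ee + 2 * ff > d) continue;  // fourth constraint
                    const int goal = ee + ff;
                    if (goal > best) {              // strictly better: restart
                        best = goal;
                        pairs.clear();
                    }
                    if (goal == best) pairs.insert({ee, ff});
                }
            }
        }
    }
    return pairs;
}
```

For the constraints as written in the question, this returns exactly the two pairs mentioned there, (0, 7) and (1, 6). Exhaustive enumeration only scales to very small models, which is why a constraint programming solver is the practical route for anything larger.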