Timing performance of multithreading - C++

I wrote a small program to check the performance of threading, and I have a couple of questions about the results I obtained
(the CPU of my laptop is an i5-3220M).
1) The time required jumped up for 2 threads every time I ran the program. Is it because of the OMP timer I use, or is there a logical error in my program?
2) Also, would it be better to use CPU cycles to measure the performance instead?
3) The time continues to decrease as the number of threads increases. I know my program is simple enough that it probably needs no context switching, but where does the extra performance come from? Is it because the CPU adjusts itself to its turbo frequency? (Base 2.6 GHz, turbo 3.3 GHz according to Intel's website.)
Thanks!
Output
Adding 1 for 1000 million times
Average Time Elapsed for 1 threads = 3.11565(Check = 5000000000)
Average Time Elapsed for 2 threads = 4.54309(Check = 5000000000)
Average Time Elapsed for 4 threads = 2.19321(Check = 5000000000)
Average Time Elapsed for 8 threads = 2.48927(Check = 5000000000)
Average Time Elapsed for 16 threads = 1.84427(Check = 5000000000)
Average Time Elapsed for 32 threads = 1.30958(Check = 5000000000)
Average Time Elapsed for 64 threads = 1.08472(Check = 5000000000)
Average Time Elapsed for 128 threads = 0.996898(Check = 5000000000)
Average Time Elapsed for 256 threads = 1.01366(Check = 5000000000)
Average Time Elapsed for 512 threads = 0.951436(Check = 5000000000)
Average Time Elapsed for 1024 threads = 0.973331(Check = 4999997440)
Program
#include <iostream>
#include <thread>
#include <algorithm>   // for_each
#include <vector>
#include <omp.h>       // omp_get_wtime

class Adder{
public:
    long sum;
    Adder(){};
    void operator()(long endVal_i){
        sum = 0;
        for (long i = 1; i <= endVal_i; i++)
            sum++;
    };
};

int main()
{
    long totalCount = 1000000000;
    int maxThread = 1025;
    int numSample = 5;
    std::vector<std::thread> threads;
    Adder adderArray[maxThread];

    std::cout << "Adding 1 for " << totalCount/1000000 << " million times\n\n";

    for (int numThread = 1; numThread <= maxThread; numThread = numThread*2){
        double avgTime = 0;
        long check = 0;
        for (int i = 1; i <= numSample; i++){
            double startTime = omp_get_wtime();
            long loop = totalCount/numThread;
            for (int i = 0; i < numThread; i++)
                threads.push_back(std::thread(std::ref(adderArray[i]), loop));
            std::for_each(threads.begin(), threads.end(), std::mem_fn(&std::thread::join));
            double endTime = omp_get_wtime();

            for (int i = 0; i < numThread; i++)
                check += adderArray[i].sum;
            threads.erase(threads.begin(), threads.end());
            avgTime += endTime - startTime;
        }
        std::cout << "Average Time Elapsed for " << numThread << " threads = " << avgTime/numSample << "(Check = " << check << ")\n";
    }
}
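Regarding question 2: omp_get_wtime and std::chrono::steady_clock both measure monotonic wall-clock time, so a minimal sketch like the following (with the threaded work replaced by a single stand-in loop) can serve as a cross-check that the timer itself is not the issue:
#include <chrono>
#include <iostream>

int main()
{
    // Time a region with std::chrono::steady_clock as a cross-check against omp_get_wtime.
    const auto start = std::chrono::steady_clock::now();
    volatile long sum = 0;
    for (long i = 0; i < 1000000000; ++i)   // stand-in for the threaded work being measured
        sum = sum + 1;
    const auto end = std::chrono::steady_clock::now();
    std::cout << std::chrono::duration<double>(end - start).count() << " s\n";
}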

Related

Measuring elapsed seconds using chrono (stopwatch) in C++

I am currently trying to create a way to display the elapsed seconds (not the difference between cycles). My code is as follows:
#include <iostream>
#include <vector>
#include <chrono>
#include <Windows.h>

typedef std::chrono::high_resolution_clock::time_point TIME;
#define TIMENOW() std::chrono::high_resolution_clock::now()
#define TIMECAST(x) std::chrono::duration_cast<std::chrono::duration<double>>(x).count()

int main()
{
    std::chrono::duration<double> ms;
    double t = 0;
    while (1)
    {
        TIME begin = TIMENOW();
        int c = 0;
        for (int i = 0; i < 10000000; i++)
        {
            c += i*100000;
        }
        TIME end = TIMENOW();
        ms = std::chrono::duration_cast<std::chrono::duration<double>>(end - begin);
        t = t + ms.count();
        std::cout << t << std::endl;
    }
}
I expected that adding the delta time over and over again would roughly give me the elapsed time in seconds, but I noticed that it is only fairly accurate if I use i < some big number. If it is only 10,000 or so, t seems to accumulate more slowly and then gradually faster. Maybe I am missing something, but isn't the difference my delta time (the elapsed time between this cycle and the last one), and if I keep adding the delta times up, shouldn't it spit out seconds? Any help is appreciated.
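One approach that avoids the accumulation problem, assuming the goal is total elapsed wall time, is to keep a single fixed start point and subtract it each iteration instead of summing per-iteration deltas, so that time spent outside the timed region (the cout call, the time measurement itself) is never lost. A minimal sketch:
#include <chrono>
#include <iostream>

int main()
{
    // One fixed reference point; elapsed time is always measured against it.
    const auto begin = std::chrono::high_resolution_clock::now();
    while (true)
    {
        // ... the work being timed ...
        const auto now = std::chrono::high_resolution_clock::now();
        const double t = std::chrono::duration<double>(now - begin).count();
        std::cout << t << std::endl;
    }
}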

Why doesn't my OpenMP program scale with number of threads?

I wrote a program to calculate the sum of an array of 1M numbers, where all elements = 1. I use OpenMP for multithreading. However, the run time doesn't scale with the number of threads. Here is the code:
#include <iostream>
#include <omp.h>

#define SIZE 1000000
#define N_THREADS 4

using namespace std;

int main() {
    int* arr = new int[SIZE];
    long long sum = 0;
    int n_threads = 0;

    omp_set_num_threads(N_THREADS);

    double t1 = omp_get_wtime();
    #pragma omp parallel
    {
        if (omp_get_thread_num() == 0) {
            n_threads = omp_get_num_threads();
        }

        #pragma omp for schedule(static, 16)
        for (int i = 0; i < SIZE; i++) {
            arr[i] = 1;
        }

        #pragma omp for schedule(static, 16) reduction(+:sum)
        for (int i = 0; i < SIZE; i++) {
            sum += arr[i];
        }
    }
    double t2 = omp_get_wtime();

    cout << "n_threads " << n_threads << endl;
    cout << "time " << (t2 - t1)*1000 << endl;
    cout << sum << endl;
}
The run time (in milliseconds) for different values of N_THREADS is as follows:
n_threads 1
time 3.6718
n_threads 2
time 2.5308
n_threads 3
time 3.4383
n_threads 4
time 3.7427
n_threads 5
time 2.4621
I used schedule(static, 16) to assign chunks of 16 iterations per thread to avoid the false sharing problem. I thought the performance issue was related to false sharing, but I now think it is not. What could the problem be?
Your code is memory bound, not computationally expensive. Its speed depends on the speed of memory access (cache utilization, number of memory channels, etc.), so it is not expected to scale well with the number of threads.
UPDATE: I ran this code with a 100x bigger SIZE (i.e. #define SIZE 100000000), compiled with g++ -fopenmp -O3 -mavx2.
Here are the results; it still scales badly with the number of threads:
n_threads 1
time 652.656
time 657.207
time 608.838
time 639.168
1000000000
n_threads 2
time 422.621
time 373.995
time 425.819
time 386.511
time 466.632
time 394.198
1000000000
n_threads 3
time 394.419
time 391.283
time 470.925
time 375.833
time 442.268
time 449.611
time 370.12
time 458.79
1000000000
n_threads 4
time 421.89
time 402.363
time 424.738
time 414.368
time 491.843
time 429.757
time 431.459
time 497.566
1000000000
n_threads 8
time 414.426
time 430.29
time 494.899
time 442.164
time 458.576
time 449.313
time 452.309
1000000000
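One way to sanity-check the memory-bound claim, sketched below under the assumption of the same 100000000-element array (compile with -fopenmp for omp_get_wtime), is to compare the timed loops against a plain single pass over the data (here std::fill), which is limited almost entirely by memory; if the reduction runs at roughly the same rate as the fill, adding threads mostly adds contention for the same memory channels rather than useful compute:
#include <algorithm>
#include <iostream>
#include <omp.h>

int main()
{
    const int SIZE = 100000000;
    int* arr = new int[SIZE];

    double t1 = omp_get_wtime();
    std::fill(arr, arr + SIZE, 1);      // memory-limited baseline: one write pass
    double t2 = omp_get_wtime();

    long long sum = 0;
    for (int i = 0; i < SIZE; i++)      // single-threaded version of the reduction pass
        sum += arr[i];
    double t3 = omp_get_wtime();

    std::cout << "fill " << (t2 - t1)*1000 << " ms, sum " << (t3 - t2)*1000
              << " ms (" << sum << ")" << std::endl;
    delete[] arr;
}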
Five threads contending for the same accumulator in the reduction, or a chunk size of only 16, must be inhibiting efficient pipelining of the loop iterations. Try a coarser region per thread.
Maybe more importantly, you need to repeat the benchmark programmatically several times to get an average and to warm the CPU caches/cores up to higher frequencies for a better measurement.
The benchmark results say about 1 MB/s. Surely even the worst RAM will do 1000 times better than that, so memory is not the bottleneck (for now). One million elements in about 4 seconds looks like locking contention or an unwarmed benchmark; even a Pentium 1 would manage more bandwidth than that. Are you sure you are compiling with -O3 optimization?
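A minimal sketch of the repeat-and-warm-up idea from the comments above (the body here is only a stand-in for the benchmark in question; compile with -fopenmp):
#include <omp.h>
#include <iostream>
#include <vector>

int main()
{
    const int repeats = 10;
    std::vector<int> arr(100000000, 1);
    long long sum = 0;
    double total = 0;

    for (int r = 0; r < repeats; ++r) {
        double t0 = omp_get_wtime();
        sum = 0;
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < (int)arr.size(); ++i)
            sum += arr[i];
        double t1 = omp_get_wtime();
        if (r > 0)                      // discard the cold first run (caches, CPU frequency)
            total += t1 - t0;
    }
    std::cout << sum << " avg " << total/(repeats - 1)*1000 << " ms" << std::endl;
}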
I have reimplemented the test as a Google Benchmark with different values:
#include <benchmark/benchmark.h>
#include <memory>
#include <omp.h>

constexpr int SCALE{32};
constexpr int ARRAY_SIZE{1000000};
constexpr int CHUNK_SIZE{16};

void original_benchmark(benchmark::State& state)
{
    const int num_threads{state.range(0)};
    const int array_size{state.range(1)};
    const int chunk_size{state.range(2)};

    auto arr = std::make_unique<int[]>(array_size);
    long long sum = 0;
    int n_threads = 0;

    omp_set_num_threads(num_threads);

    // double t1 = omp_get_wtime();
    #pragma omp parallel
    {
        if (omp_get_thread_num() == 0) {
            n_threads = omp_get_num_threads();
        }

        #pragma omp for schedule(static, chunk_size)
        for (int i = 0; i < array_size; i++) {
            arr[i] = 1;
        }

        #pragma omp for schedule(static, chunk_size) reduction(+:sum)
        for (int i = 0; i < array_size; i++) {
            sum += arr[i];
        }
    }
    // double t2 = omp_get_wtime();

    // cout << "n_threads " << n_threads << endl;
    // cout << "time " << (t2 - t1)*1000 << endl;
    // cout << sum << endl;

    state.counters["n_threads"] = n_threads;
}

static void BM_original_benchmark(benchmark::State& state) {
    for (auto _ : state) {
        original_benchmark(state);
    }
}

BENCHMARK(BM_original_benchmark)
    ->Args({1, ARRAY_SIZE, CHUNK_SIZE})
    ->Args({1, SCALE * ARRAY_SIZE, CHUNK_SIZE})
    ->Args({1, ARRAY_SIZE, SCALE * CHUNK_SIZE})
    ->Args({2, ARRAY_SIZE, CHUNK_SIZE})
    ->Args({2, SCALE * ARRAY_SIZE, CHUNK_SIZE})
    ->Args({2, ARRAY_SIZE, SCALE * CHUNK_SIZE})
    ->Args({4, ARRAY_SIZE, CHUNK_SIZE})
    ->Args({4, SCALE * ARRAY_SIZE, CHUNK_SIZE})
    ->Args({4, ARRAY_SIZE, SCALE * CHUNK_SIZE})
    ->Args({8, ARRAY_SIZE, CHUNK_SIZE})
    ->Args({8, SCALE * ARRAY_SIZE, CHUNK_SIZE})
    ->Args({8, ARRAY_SIZE, SCALE * CHUNK_SIZE})
    ->Args({16, ARRAY_SIZE, CHUNK_SIZE})
    ->Args({16, SCALE * ARRAY_SIZE, CHUNK_SIZE})
    ->Args({16, ARRAY_SIZE, SCALE * CHUNK_SIZE});

BENCHMARK_MAIN();
I only have access to Compiler Explorer at the moment, which will not execute the complete suite of benchmarks. However, it looks like increasing the chunk size will improve the performance. Obviously, benchmark and optimize for your own system.
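For reference, assuming the Google Benchmark library is installed system-wide, a typical build line for the suite above would be something like g++ -O3 -fopenmp bench.cpp -lbenchmark -lpthread.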

Measure CPU time spent on each thread separately in C++

I know that this question sounds like an easy question and a duplicate of former ones, in which boost.timer and the chrono facility of C++11 are given as answers.
But what I have in mind is a bit different, and I found no answer to it either on StackOverflow or elsewhere:
In my (C++11) program on Ubuntu Linux, I start several threads with std::async and the std::future mechanism.
Inside every thread I measure CPU time with boost.timer(). If I start only one thread, I get a CPU time of (in my example) ~0.39 sec and an equal WC time of ~0.39 sec.
If I start several threads, I get a longer WC time for each, say 0.8 sec for 16 threads, and now the CPU time for each is about 6.4 sec, that is, 8 * 0.8 sec (I have a quad-core Xeon CPU).
So the CPU time of each thread is seemingly multiplied by (number of CPU cores) * 2.
Of course(?) I would like to see a CPU time near 0.39 sec for each thread, as this is probably still the time the thread uses the CPU for its own purposes. The longer CPU time shown (multiplied by the "CPU number factor") is not of much help in gauging the true CPU consumption of each thread separately.
For illustration I append my test program and its output, first for one thread, then for 16 threads.
So my question is: What can I do, which library, function or programming technique can I use, to get the true CPU usage of each thread which should not change much with the number of threads started?
#include <iostream>
#include <fstream>
#include <vector>
#include <cmath>
#include <cstdlib>
#include <future>
#include <mutex>
#include <chrono>
#include <boost/timer/timer.hpp>

std::mutex mtx;

class XTimer
{
public:
    XTimer() {};
    void start();
    void stop();
    double cpu_time();
    double boost_cpu_time();
    double wc_time();

    std::chrono::time_point<std::chrono::system_clock> timestamp_wc;
    std::chrono::time_point<std::chrono::steady_clock> timestamp_cpu;
    boost::timer::cpu_timer timer_cpu;
    double wc_time_val;
    double cpu_time_val;
    double boost_cpu_time_val;
};

void XTimer::start()
{
    timestamp_wc = std::chrono::system_clock::now();
    timestamp_cpu = std::chrono::steady_clock::now();
    timer_cpu.start();
    cpu_time_val = 0;
    wc_time_val = 0;
    boost_cpu_time_val = 0;
}

void XTimer::stop()
{
    const auto ns_wc = std::chrono::system_clock::now() - timestamp_wc;
    const auto ns_cpu = std::chrono::steady_clock::now() - timestamp_cpu;
    auto elapsed_times(timer_cpu.elapsed());
    auto cpu_elapsed(elapsed_times.system + elapsed_times.user);
    //std::cout << "boost: cpu elapsed = " << cpu_elapsed << std::endl;
    wc_time_val = double(ns_wc.count())/1e9;
    cpu_time_val = double(ns_cpu.count())/1e9;
    boost_cpu_time_val = double(cpu_elapsed)/1e9;
}

double XTimer::cpu_time()
{
    return cpu_time_val;
}

double XTimer::boost_cpu_time()
{
    return boost_cpu_time_val;
}

double XTimer::wc_time()
{
    return wc_time_val;
}

template<class T>
int wait_for_all(std::vector<std::future<T>> & fuvec)
{
    std::vector<T> res;
    for(auto & fu: fuvec) {
        res.push_back(fu.get());
    }
    return res.size();
}

int test_thread(int a)
{
    const int N = 10000000;
    double x = 0;
    XTimer tt;
    do {
        std::lock_guard<std::mutex> lck {mtx};
        std::cout << "start thread: " << a << std::endl;
    } while (0);
    tt.start();
    for(int i = 0; i < N; ++i) {
        if (i % 10000 == 0) {
            //std::cout << (char((int('A') + a)));
        }
        x += sin(i);
    }
    tt.stop();
    do {
        std::lock_guard<std::mutex> lck {mtx};
        std::cout << "end thread: " << a << std::endl;
        std::cout << "boost cpu = " << tt.boost_cpu_time() << " wc = " << tt.wc_time() << std::endl;
    } while (0);
    return 0;
}

int test_threads_start(int num_threads)
{
    std::vector<std::future<int>> fivec;
    XTimer tt;
    tt.start();
    for(int i = 0; i < num_threads; ++i) {
        fivec.push_back(std::async(test_thread, i));
    }
    int sz = wait_for_all(fivec);
    tt.stop();
    std::cout << std::endl << std::endl;
    std::cout << "all threads finished: total wc time = " << tt.wc_time() << std::endl;
    std::cout << "all threads finished: total boost cpu time = " << tt.boost_cpu_time() << std::endl;
    return sz;
}

int main(int argc, char** argv)
{
    const int num_threads_default = 1;
    int num_threads = num_threads_default;
    //boost::timer::auto_cpu_timer ac;
    if (argc > 1) {
        num_threads = atoi(argv[1]);
    }
    std::cout << "starting " << num_threads << " threads." << std::endl;
    test_threads_start(num_threads);
    std::cout << "end." << std::endl;
    return 0;
}
It can be compiled with
g++ -o testit testit.cpp -L/usr/lib/x86_64-linux-gnu -pthread -lboost_timer -lboost_system -lboost_thread
Sample output with 1 thread
starting 1 threads.
start thread: 0
end thread: 0
boost cpu = 0.37 wc = 0.374107
all threads finished: total wc time = 0.374374
all threads finished: total boost cpu time = 0.37
Sample output with 16 threads
starting 16 threads.
start thread: 0
start thread: 1
start thread: 2
start thread: 3
start thread: 4
start thread: 10
start thread: 5
start thread: 7
start thread: 6
start thread: 11
start thread: 8
start thread: 9
start thread: 13
start thread: 12
start thread: 14
start thread: 15
end thread: 1
boost cpu = 4.67 wc = 0.588818
end thread: 2
boost cpu = 5.29 wc = 0.66638
end thread: 0
boost cpu = 5.72 wc = 0.7206
end thread: 13
boost cpu = 5.82 wc = 0.728717
end thread: 11
boost cpu = 6.18 wc = 0.774979
end thread: 12
boost cpu = 6.17 wc = 0.773298
end thread: 6
boost cpu = 6.32 wc = 0.793143
end thread: 15
boost cpu = 6.12 wc = 0.767049
end thread: 4
boost cpu = 6.7 wc = 0.843377
end thread: 14
boost cpu = 6.74 wc = 0.84842
end thread: 3
boost cpu = 6.91 wc = 0.874065
end thread: 9
boost cpu = 6.83 wc = 0.86342
end thread: 5
boost cpu = 7 wc = 0.896873
end thread: 7
boost cpu = 7.05 wc = 0.917324
end thread: 10
boost cpu = 7.11 wc = 0.930335
end thread: 8
boost cpu = 7.03 wc = 0.940374
all threads finished: total wc time = 0.957748
all threads finished: total boost cpu time = 7.14
end.
Documentation of boost::timer does not mention anything about per-thread measurements. Fortunately, boost::chrono contains thread_clock, which gives per-thread CPU usage on platforms that support it. It uses the same interface as the std::chrono clocks and measures the thread's wall clock.
After adding the following lines to your example code:
// Includes section
#include <boost/chrono.hpp>
// XTimer
boost::chrono::thread_clock::time_point timestamp_thread_wc;
double thread_wc_time_val;
// XTimer::start()
timestamp_thread_wc = boost::chrono::thread_clock::now();
// XTimer::stop()
const auto ns_thread_wc = boost::chrono::thread_clock::now() - timestamp_thread_wc;
thread_wc_time_val = double(ns_thread_wc.count())/1e9;
// test_thread() just after for loop
sleep(1);
// test_thread() in bottom do -> while(0) loop
std::cout << "thread cpu = " << tt.thread_wc_time_val << std::endl;
and compiling with the additional -lboost_chrono option, I get:
starting 1 threads.
start thread: 0
end thread: 0
boost cpu = 0.16 wc = 1.16715
thread cpu = 0.166943
all threads finished: total wc time = 1.16754
all threads finished: total boost cpu time = 0.16
end.
and:
starting 2 threads.
start thread: 0
start thread: 1
end thread: 1
boost cpu = 0.28 wc = 1.14168
thread cpu = 0.141524
end thread: 0
boost cpu = 0.28 wc = 1.14417
thread cpu = 0.14401
all threads finished: total wc time = 1.14442
all threads finished: total boost cpu time = 0.28
end.
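If adding Boost.Chrono is not an option, a similar per-thread measurement is available on Linux through the POSIX CLOCK_THREAD_CPUTIME_ID clock; a minimal sketch (single-threaded here only to keep it short):
#include <ctime>
#include <cmath>
#include <iostream>

// Returns the CPU time consumed so far by the calling thread, in seconds.
static double thread_cpu_seconds()
{
    timespec ts{};
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
    return ts.tv_sec + ts.tv_nsec/1e9;
}

int main()
{
    double t0 = thread_cpu_seconds();
    double x = 0;
    for (int i = 0; i < 10000000; ++i)
        x += sin(i);                    // same kind of work as test_thread()
    double t1 = thread_cpu_seconds();
    std::cout << "thread cpu = " << (t1 - t0) << " s (x = " << x << ")" << std::endl;
}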

Count the duration of a task inside a method called by multiple threads simultaneously

I have created a member function that is called by a number of threads at the same time. Inside this function I want to count the total duration of the execution of a function. The problem is that if I create 4 threads, for example, the time I get back is 4 times the actual time! How can I get the actual time? My method looks like this:
void Class1::myTask() {
    //...code
    chrono::steady_clock::time_point start = chrono::steady_clock::now();
    theFunction();
    chrono::steady_clock::time_point end = chrono::steady_clock::now();
    chrono::duration<double> time_span = chrono::duration_cast<chrono::duration<double>>(end - start);
    mytime = time_span.count(); // mytime is of atomic type
    setTheTime(mytime);
    //...more code
}

// The method to set the Total Time
void Class1::setTheTime(double mTime){
    time = time + mTime; // time is of atomic type
}
This method is called a very large number of times, and each time "end - start" returns something like 0.000897442 sec. The total duration is about 11 sec, but time ends up at something like 44 seconds!
Here is an example of working code so that you can see the problem:
#include <iostream>
#include <cstdlib>
#include <string>
#include <vector>
#include <thread>
#include <chrono>
#include <atomic>

using namespace std;

atomic<double> time1;
atomic<double> mytime;

void theFunction() {
    int x = 0;
    for (int i = 0; i < 10000000; ++i) {
        x++;
    }
}

void setTheTime(double mTime1) {
    time1 = time1 + mTime1;
}

void countTime() {
    chrono::steady_clock::time_point start = chrono::steady_clock::now();
    theFunction();
    chrono::steady_clock::time_point end = chrono::steady_clock::now();
    chrono::duration<double> time_span = chrono::duration_cast<chrono::duration<double>>(end - start);
    mytime = time_span.count();
    setTheTime(mytime);
}

int main(int argc, char** argv) {
    vector<thread> threads;
    long double mt;

    chrono::steady_clock::time_point start = chrono::steady_clock::now();
    for (int i = 0; i < 4; i++)
        threads.push_back(thread(countTime));
    for (auto& thread : threads)
        thread.join();
    chrono::steady_clock::time_point end = chrono::steady_clock::now();

    chrono::duration<double> time_span = chrono::duration_cast<chrono::duration<double>>(end - start);
    mt = time_span.count();

    cout << "Time out of the function: " << mt * 1000 << endl;
    cout << "Time inside the function: " << time1 * 1000 << endl;
    return 0;
}
Let there be N threads which run in parallel for X seconds of natural (wall-clock) time.
Then for the time S they accumulate,
S = N * X
roughly holds.
And 44 s indeed equals 4 * 11 s.
So what is the problem? :)
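If the quantity wanted is how long the parallel phase itself took, one option (besides the outer steady_clock measurement that main() already does) is to record the maximum of the per-thread durations instead of their sum; a minimal sketch, keeping the atomic<double> accumulator style of the question:
#include <atomic>

std::atomic<double> longestTime{0.0};

void setTheTime(double mTime)
{
    // Atomically keep the largest duration reported by any thread.
    double current = longestTime.load();
    while (mTime > current &&
           !longestTime.compare_exchange_weak(current, mTime)) {
        // compare_exchange_weak reloads current on failure; the loop retries
    }
}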

OpenMP code which gives wrong answers when I start using 12 threads

I have this piece of OpenMP code which performs an integration of the function 4.0/(1+x^2) over the interval [0,1]. The analytical answer is pi = 3.14159...
The method of integration is just a plain approximating Riemann sum. The code
gives me the correct answer when I use 1 to 11 OpenMP threads.
However, it starts giving increasingly wrong answers once I use 12 OpenMP threads or more.
Why could this be happening? First, here is the C++ code. I am using gcc in an Ubuntu 10.10 environment. The code is compiled with g++ -fopenmp integration_OpenMP.cpp
// f(x) = 4/(1+x^2)
// Domain of integration: [0,1]
// Integral over the domain = pi = (approx) 3.14159

#include <iostream>
#include <omp.h>
#include <vector>
#include <algorithm>
#include <functional>
#include <numeric>

int main (void)
{
    // Information common to serial and parallel computation.
    int num_steps = 2e8;
    double dx = 1.0/num_steps;

    // Serial computation: method of integration is just a plain Riemann sum
    double start = omp_get_wtime();
    double serial_sum = 0;
    double x = 0;
    for (int i = 0; i < num_steps; ++i)
    {
        serial_sum += 4.0*dx/(1.0+x*x);
        x += dx;
    }
    double end = omp_get_wtime();
    std::cout << "Time taken for the serial computation: " << end-start << " seconds";
    std::cout << "\t\tPi serial: " << serial_sum << std::endl;

    // OpenMP computation. Method of integration, just a plain Riemann sum
    std::cout << "How many OpenMP threads do you need for parallel computation? ";
    int t; // number of OpenMP threads
    std::cin >> t;

    start = omp_get_wtime();
    double parallel_sum = 0; // will be modified atomically
    #pragma omp parallel num_threads(t)
    {
        int threadIdx = omp_get_thread_num();
        int begin = threadIdx * num_steps/t; // integer index of left end point of subinterval
        int end = begin + num_steps/t;       // integer index of right end point of subinterval
        double dx_local = dx;
        double temp = 0;
        double x = begin*dx;

        for (int i = begin; i < end; ++i)
        {
            temp += 4.0*dx_local/(1.0+x*x);
            x += dx_local;
        }
        #pragma omp atomic
        parallel_sum += temp;
    }
    end = omp_get_wtime();
    std::cout << "Time taken for the parallel computation: " << end-start << " seconds";
    std::cout << "\tPi parallel: " << parallel_sum << std::endl;

    return 0;
}
Here is the output for different numbers of threads, starting with 11 threads.
OpenMP: ./a.out
Time taken for the serial computation: 1.27744 seconds Pi serial: 3.14159
How many OpenMP threads do you need for parallel computation? 11
Time taken for the parallel computation: 0.366467 seconds Pi parallel: 3.14159
OpenMP: ./a.out
Time taken for the serial computation: 1.28167 seconds Pi serial: 3.14159
How many OpenMP threads do you need for parallel computation? 12
Time taken for the parallel computation: 0.351284 seconds Pi parallel: 3.16496
OpenMP: ./a.out
Time taken for the serial computation: 1.28178 seconds Pi serial: 3.14159
How many OpenMP threads do you need for parallel computation? 13
Time taken for the parallel computation: 0.434283 seconds Pi parallel: 3.21112
OpenMP: ./a.out
Time taken for the serial computation: 1.2765 seconds Pi serial: 3.14159
How many OpenMP threads do you need for parallel computation? 14
Time taken for the parallel computation: 0.375078 seconds Pi parallel: 3.27163
Why not just use a parallel for with static partitioning instead?
#pragma omp parallel shared(dx) num_threads(t)
{
    double x = omp_get_thread_num() * 1.0 / t;
    #pragma omp for reduction(+ : parallel_Sum)
    for (int i = 0; i < num_steps; ++i)
    {
        parallel_Sum += 4.0*dx/(1.0+x*x);
        x += dx;
    }
}
Then you won't need to manage all the partitioning and atomic collection of results by yourself.
In order to correctly initialize x, we notice that x = (begin * dx) = (threadIdx * num_steps/t) * (1.0 / num_steps) = (threadIdx * 1.0) / t.
Edit: Just tested this final version on my machine and it seems to work correctly.
The problem is in calculating begin:
since you set num_steps = 2e8, when threadIdx == 11 the product threadIdx * num_steps overflows a 32-bit integer, so begin is calculated incorrectly.
I advise you to use long long int for threadIdx, begin and end.
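A quick check of the threshold, assuming a 32-bit int: the largest intermediate product is (t-1) * num_steps. For 11 threads that is 10 * 200000000 = 2000000000, which still fits below INT_MAX = 2147483647; for 12 threads it is 11 * 200000000 = 2200000000, which overflows and wraps to a negative value. That matches the observation that 11 threads works and 12 does not.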
EDIT:
Also note that your method of calculating begin and end can cause steps (and precision) to be lost. For example, for 313 threads you lose 199 steps.
The right way to calculate begin and end would be:
long long int begin = threadIdx * num_steps/t;
long long int end = (threadIdx + 1) * num_steps/t;
For the same reason, you cannot avoid the overflow just by adding parentheses (computing threadIdx * (num_steps/t)); you have to use long long.