Rounding integers routine - c++

There is something that baffles me with integer arithmetic in tutorials. To be precise, integer division.
The seemingly preferred method is by casting the divisor into a float, then rounding the float to the nearest whole number, then cast that back into integer:
#include <cmath>
int round_divide_by_float_casting(int a, int b){
return (int) std::roundf( a / (float) b);
}
Yet this seems like scratching your left ear with your right hand. I use:
int round_divide (int a, int b){
return a / b + a % b * 2 / b;
}
It's no breakthrough, but the fact that it is not standard makes me wonder if I am missing anything?
Despite my (albeit limited) testing, I couldn't find any scenario where the two methods give me different results. Did someone run into some sort of scenario where the int → float → int casting produced more accurate results?

Arithmetic solution
If one defined what your functions should return, she would describe it as something close as "f(a, b) returns the closest integer of the result of the division of a by b in the real divisor ring."
Thus, the question can be summarized as: can we define this closest integer using only integer division. I think we can.
There is exactly two candidates as the closest integer: a / b and (a / b) + 1(1). The selection is easy, if a % b is closer to 0 as it is to b, then a / b is our result. If not, (a / b) + 1 is.
One could then write something similar to, ignoring optimization and good practices:
int divide(int a, int b)
{
const int quot = a / b;
const int rem = a % b;
int result;
if (rem < b - rem) {
result = quot;
} else {
result = quot + 1;
}
return result;
}
While this definition satisfies out needs, one could optimize it by not computing two times the division of a by b with the use of std::div():
int divide(int a, int b)
{
const std::div_t dv = std::div(a, b);
int result = dv.quot;
if (dv.rem >= b - dv.rem) {
++result;
}
return result;
}
The analysis of the problem we did earlier assures us of the well defined behaviour of our implementation.
(1)There is just one last thing to check: how does it behaves when a or b is negative? This is left to the reader ;).
Benchmark
#include <iostream>
#include <iomanip>
#include <string>
// solutions
#include <cmath>
#include <cstdlib>
// benchmak
#include <limits>
#include <random>
#include <chrono>
#include <algorithm>
#include <functional>
//
// Solutions
//
namespace
{
int round_divide_by_float_casting(int a, int b) {
return (int)roundf(a / (float)b);
}
int round_divide_by_modulo(int a, int b) {
return a / b + a % b * 2 / b;
}
int divide_by_quotient_comparison(int a, int b)
{
const std::div_t dv = std::div(a, b);
int result = dv.quot;
if (dv.rem >= b - dv.rem)
{
++result;
}
return result;
}
}
//
// benchmark
//
class Randomizer
{
std::mt19937 _rng_engine;
std::uniform_int_distribution<int> _distri;
public:
Randomizer() : _rng_engine(std::time(0)), _distri(std::numeric_limits<int>::min(), std::numeric_limits<int>::max())
{
}
template<class ForwardIt>
void operator()(ForwardIt begin, ForwardIt end)
{
std::generate(begin, end, std::bind(_distri, _rng_engine));
}
};
class Clock
{
std::chrono::time_point<std::chrono::steady_clock> _start;
public:
static inline std::chrono::time_point<std::chrono::steady_clock> now() { return std::chrono::steady_clock::now(); }
Clock() : _start(now())
{
}
template<class DurationUnit>
std::size_t end()
{
return std::chrono::duration_cast<DurationUnit>(now() - _start).count();
}
};
//
// Entry point
//
int main()
{
Randomizer randomizer;
std::array<int, 1000> dividends; // SCALE THIS UP (1'000'000 would be great)
std::array<int, dividends.size()> divisors;
std::array<int, dividends.size()> results;
randomizer(std::begin(dividends), std::end(dividends));
randomizer(std::begin(divisors), std::end(divisors));
{
Clock clock;
auto dividend = std::begin(dividends);
auto divisor = std::begin(divisors);
auto result = std::begin(results);
for ( ; dividend != std::end(dividends) ; ++dividend, ++divisor, ++result)
{
*result = round_divide_by_float_casting(*dividend, *divisor);
}
const float unit_time = clock.end<std::chrono::nanoseconds>() / static_cast<float>(results.size());
std::cout << std::setw(40) << "round_divide_by_float_casting(): " << std::setprecision(3) << unit_time << " ns\n";
}
{
Clock clock;
auto dividend = std::begin(dividends);
auto divisor = std::begin(divisors);
auto result = std::begin(results);
for ( ; dividend != std::end(dividends) ; ++dividend, ++divisor, ++result)
{
*result = round_divide_by_modulo(*dividend, *divisor);
}
const float unit_time = clock.end<std::chrono::nanoseconds>() / static_cast<float>(results.size());
std::cout << std::setw(40) << "round_divide_by_modulo(): " << std::setprecision(3) << unit_time << " ns\n";
}
{
Clock clock;
auto dividend = std::begin(dividends);
auto divisor = std::begin(divisors);
auto result = std::begin(results);
for ( ; dividend != std::end(dividends) ; ++dividend, ++divisor, ++result)
{
*result = divide_by_quotient_comparison(*dividend, *divisor);
}
const float unit_time = clock.end<std::chrono::nanoseconds>() / static_cast<float>(results.size());
std::cout << std::setw(40) << "divide_by_quotient_comparison(): " << std::setprecision(3) << unit_time << " ns\n";
}
}
Outputs:
g++ -std=c++11 -O2 -Wall -Wextra -Werror main.cpp && ./a.out
round_divide_by_float_casting(): 54.7 ns
round_divide_by_modulo(): 24 ns
divide_by_quotient_comparison(): 25.5 ns
Demo
The two arithmetic solutions' performances are not distinguishable (their benchmark converges when you scale the bench size up).

It would really depend on the processor, and the range of the integer which is better (and using double would resolve most of the range issues)
For modern "big" CPUs like x86-64 and ARM, integer division and floating point division are roughly the same effort, and converting an integer to a float or vice versa is not a "hard" task (and does the correct rounding directly in that conversion, at least), so most likely the resulting operations are.
atmp = (float) a;
btmp = (float) b;
resfloat = divide atmp/btmp;
return = to_int_with_rounding(resfloat)
About four machine instructions.
On the other hand, your code uses two divides, one modulo and a multiply, which is quite likely longer on such a processor.
tmp = a/b;
tmp1 = a % b;
tmp2 = tmp1 * 2;
tmp3 = tmp2 / b;
tmp4 = tmp + tmp3;
So five instructions, and three of those are "divide" (unless the compiler is clever enough to reuse a / b for a % b - but it's still two distinct divides).
Of course, if you are outside the range of number of digits that a float or double can hold without losing digits (23 bits for float, 53 bits for double), then your method MAY be better (assuming there is no overflow in the integer math).
On top of all that, since the first form is used by "everyone", it's the one that the compiler recognises and can optimise.
Obviously, the results depend on both the compiler being used and the processor it runs on, but these are my results from running the code posted above, compiled through clang++ (v3.9-pre-release, pretty close to released 3.8).
round_divide_by_float_casting(): 32.5 ns
round_divide_by_modulo(): 113 ns
divide_by_quotient_comparison(): 80.4 ns
However, the interesting thing I find when I look at the generated code:
xorps %xmm0, %xmm0
cvtsi2ssl 8016(%rsp,%rbp), %xmm0
xorps %xmm1, %xmm1
cvtsi2ssl 4016(%rsp,%rbp), %xmm1
divss %xmm1, %xmm0
callq roundf
cvttss2si %xmm0, %eax
movl %eax, 16(%rsp,%rbp)
addq $4, %rbp
cmpq $4000, %rbp # imm = 0xFA0
jne .LBB0_7
is that the round is actually a call. Which really surprises me, but explains why on some machines (particularly more recent x86 processors), it is faster.
g++ gives better results with -ffast-math, which gives around:
round_divide_by_float_casting(): 17.6 ns
round_divide_by_modulo(): 43.1 ns
divide_by_quotient_comparison(): 18.5 ns
(This is with increased count to 100k values)

Prefer the standard solution. Use std::div family of functions declared in cstdlib.
See: http://en.cppreference.com/w/cpp/numeric/math/div
Casting to float and then to int may be very inefficient on some architectures, for example, microcontrollers.

Thanks for the suggestions so far. To shed some light I made a test setup to compare performance.
#include <iostream>
#include <string>
#include <cmath>
#include <cstdlib>
#include <chrono>
using namespace std;
int round_divide_by_float_casting(int a, int b) {
return (int)roundf(a / (float)b);
}
int round_divide_by_modulo(int a, int b) {
return a / b + a % b * 2 / b;
}
int divide_by_quotient_comparison(int a, int b)
{
const std::div_t dv = std::div(a, b);
int result = dv.quot;
if (dv.rem <= b - dv.rem) {
++result;
}
return result;
}
int main()
{
int itr = 1000;
//while (true) {
auto begin = chrono::steady_clock::now();
for (int i = 0; i < itr; i++) {
for (int j = 10; j < itr + 1; j++) {
divide_by_quotient_comparison(i, j);
}
}
auto end = std::chrono::steady_clock::now();
cout << "divide_by_quotient_comparison(,) function took: "
<< chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count()
<< endl;
begin = chrono::steady_clock::now();
for (int i = 0; i < itr; i++) {
for (int j = 10; j < itr + 1; j++) {
round_divide_by_float_casting(i, j);
}
}
end = std::chrono::steady_clock::now();
cout << "round_divide_by_float_casting(,) function took: "
<< chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count()
<< endl;
begin = chrono::steady_clock::now();
for (int i = 0; i < itr; i++) {
for (int j = 10; j < itr + 1; j++) {
round_divide_by_modulo(i, j);
}
}
end = std::chrono::steady_clock::now();
cout << "round_divide_by_modulo(,) function took: "
<< chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count()
<< endl;
//}
return 0;
}
The results I got on my machine (i7 with Visual Studio 2015) was as follows: the modulo arithmetic was about twice as fast as the int → float → int casting method. The method relying on std::div_t (suggested by #YSC and #teroi) was faster than the int → float → int, but slower than the modulo arithmetic method.
A second test was performed to avoid certain compiler optimizations pointed out by #YSC:
#include <iostream>
#include <string>
#include <cmath>
#include <cstdlib>
#include <chrono>
#include <vector>
using namespace std;
int round_divide_by_float_casting(int a, int b) {
return (int)roundf(a / (float)b);
}
int round_divide_by_modulo(int a, int b) {
return a / b + a % b * 2 / b;
}
int divide_by_quotient_comparison(int a, int b)
{
const std::div_t dv = std::div(a, b);
int result = dv.quot;
if (dv.rem <= b - dv.rem) {
++result;
}
return result;
}
int main()
{
int itr = 100;
vector <int> randi, randj;
for (int i = 0; i < itr; i++) {
randi.push_back(rand());
int rj = rand();
if (rj == 0)
rj++;
randj.push_back(rj);
}
vector<int> f, m, q;
while (true) {
auto begin = chrono::steady_clock::now();
for (int i = 0; i < itr; i++) {
for (int j = 0; j < itr; j++) {
q.push_back( divide_by_quotient_comparison(randi[i] , randj[j]) );
}
}
auto end = std::chrono::steady_clock::now();
cout << "divide_by_quotient_comparison(,) function took: "
<< chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count()
<< endl;
begin = chrono::steady_clock::now();
for (int i = 0; i < itr; i++) {
for (int j = 0; j < itr; j++) {
f.push_back( round_divide_by_float_casting(randi[i], randj[j]) );
}
}
end = std::chrono::steady_clock::now();
cout << "round_divide_by_float_casting(,) function took: "
<< chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count()
<< endl;
begin = chrono::steady_clock::now();
for (int i = 0; i < itr; i++) {
for (int j = 0; j < itr; j++) {
m.push_back( round_divide_by_modulo(randi[i], randj[j]) );
}
}
end = std::chrono::steady_clock::now();
cout << "round_divide_by_modulo(,) function took: "
<< chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count()
<< endl;
cout << endl;
f.clear();
m.clear();
q.clear();
}
return 0;
}
In this second test the slowest was the divide_by_quotient() reliant on std::div_t, followed by divide_by_float(), and the fastest again was the divide_by_modulo(). However this time the performance difference was much, much lower, less than 20%.

Related

Why does threading floating point computations on the CPU make them take significantly longer?

I am currently working on a scientific simulation (Gravitational nbody). I first wrote it with a naive single-threaded algorithm, and this performed acceptably for a small number of particles. I then multi-threaded this algorithm (it is embarrassingly parallel), and the program took about 3x as long. What follows is a minimum, complete, verifiable example of a trivial algorithm with similar properties and output to a file in /tmp (it is designed to run on Linux, but the C++ is also standard). Be warned that if you decide to run this code, it will produce a 152.62MB file. The data is outputted to prevent the compiler from optimizing the computation out of the program.
#include <iostream>
#include <functional>
#include <thread>
#include <vector>
#include <atomic>
#include <random>
#include <fstream>
#include <chrono>
constexpr unsigned ITERATION_COUNT = 2000;
constexpr unsigned NUMBER_COUNT = 10000;
void runThreaded(unsigned count, unsigned batchSize, std::function<void(unsigned)> callback){
unsigned threadCount = std::thread::hardware_concurrency();
std::vector<std::thread> threads;
threads.reserve(threadCount);
std::atomic<unsigned> currentIndex(0);
for(unsigned i=0;i<threadCount;++i){
threads.emplace_back([&currentIndex, batchSize, count, callback]{
unsigned startAt = currentIndex.fetch_add(batchSize);
if(startAt >= count){
return;
}else{
for(unsigned i=0;i<count;++i){
unsigned index = startAt+i;
if(index >= count){
return;
}
callback(index);
}
}
});
}
for(std::thread &thread : threads){
thread.join();
}
}
void threadedTest(){
std::mt19937_64 rnd(0);
std::vector<double> numbers;
numbers.reserve(NUMBER_COUNT);
for(unsigned i=0;i<NUMBER_COUNT;++i){
numbers.push_back(rnd());
}
std::vector<double> newNumbers = numbers;
std::ofstream fout("/tmp/test-data.bin");
for(unsigned i=0;i<ITERATION_COUNT;++i) {
std::cout << "Iteration: " << i << "/" << ITERATION_COUNT << std::endl;
runThreaded(NUMBER_COUNT, 100, [&numbers, &newNumbers](unsigned x){
double total = 0;
for(unsigned y=0;y<NUMBER_COUNT;++y){
total += numbers[y]*(y-x)*(y-x);
}
newNumbers[x] = total;
});
fout.write(reinterpret_cast<char*>(newNumbers.data()), newNumbers.size()*sizeof(double));
std::swap(numbers, newNumbers);
}
}
void unThreadedTest(){
std::mt19937_64 rnd(0);
std::vector<double> numbers;
numbers.reserve(NUMBER_COUNT);
for(unsigned i=0;i<NUMBER_COUNT;++i){
numbers.push_back(rnd());
}
std::vector<double> newNumbers = numbers;
std::ofstream fout("/tmp/test-data.bin");
for(unsigned i=0;i<ITERATION_COUNT;++i){
std::cout << "Iteration: " << i << "/" << ITERATION_COUNT << std::endl;
for(unsigned x=0;x<NUMBER_COUNT;++x){
double total = 0;
for(unsigned y=0;y<NUMBER_COUNT;++y){
total += numbers[y]*(y-x)*(y-x);
}
newNumbers[x] = total;
}
fout.write(reinterpret_cast<char*>(newNumbers.data()), newNumbers.size()*sizeof(double));
std::swap(numbers, newNumbers);
}
}
int main(int argc, char *argv[]) {
if(argv[1][0] == 't'){
threadedTest();
}else{
unThreadedTest();
}
return 0;
}
When I run this (compiled with clang 7.0.1 on Linux), I get the following times from the Linux time command. The difference between these is similar to what I see in my real program. The entry labelled "real" is what is relevant to this question, as this is the clock time that the program takes to run.
Single-threaded:
real 6m27.261s
user 6m27.081s
sys 0m0.051s
Multi-threaded:
real 14m32.856s
user 216m58.063s
sys 0m4.492s
As such, I ask what is causing this massive slowdown when I expect it to speed up significantly (roughly by a factor of 8, as I have an 8 core 16 thread CPU). I am not implementing this on the GPU as the next step is to make some changes to the algorithm to take it from O(n²) to O(nlogn), but that are also not amicable to a GPU. The changed algorithm will have less difference with my currently implemented O(n²) algorithm than the included example. Lastly, I want to observe that the subjective time to run each iteration (judged by the time between the iteration lines appearing) changes significantly in both the threaded and unthreaded runs.
It's kind of hard to follow this code, but I think you're duplicating work on a massive scale because each thread does nearly all the work, just skipping a small portion of it at the start.
I'm presuming the inner loop of runThreaded should be:
unsigned startAt = currentIndex.fetch_add(batchSize);
while (startAt < count) {
if (startAt >= count) {
return;
} else {
for(unsigned i=0;i<batchSize;++i){
unsigned index = startAt+i;
if(index >= count){
return;
}
callback(index);
}
}
startAt = currentIndex.fetch_add(batchSize);
}
Where i < batchSize is the key here. You should only do as much work as the batch dictates, not count times, which is the whole list minus the initial offset.
With this update the code runs significantly faster. I'm not sure if it does all the required work because it's hard to tell if that's actually happening, the output is very minimal.
For easy parallelization over multiple CPUs I recommend using tbb::parallel_for. It uses the correct number of CPUs and splits the range for you, completely eliminating the risk of implementing it wrong. Alternatively, there is a parallel for_each in C++17. In other words, this problem has a number of good solutions.
Vectorizing code is a difficult problem and neither clang++-6 not g++-8 auto-vectorize the baseline code. Hence, SIMD version below I used excellent Vc: portable, zero-overhead C++ types for explicitly data-parallel programming library.
Below is a working benchmark that compares:
The baseline version.
SIMD version.
SIMD + multi-threading version.
#include <Vc/Vc>
#include <tbb/parallel_for.h>
#include <algorithm>
#include <chrono>
#include <iomanip>
#include <iostream>
#include <random>
#include <vector>
constexpr int ITERATION_COUNT = 20;
constexpr int NUMBER_COUNT = 20000;
double baseline() {
double result = 0;
std::vector<double> newNumbers(NUMBER_COUNT);
std::vector<double> numbers(NUMBER_COUNT);
std::mt19937 rnd(0);
for(auto& n : numbers)
n = rnd();
for(int i = 0; i < ITERATION_COUNT; ++i) {
for(int x = 0; x < NUMBER_COUNT; ++x) {
double total = 0;
for(int y = 0; y < NUMBER_COUNT; ++y) {
auto d = (y - x);
total += numbers[y] * (d * d);
}
newNumbers[x] = total;
}
result += std::accumulate(newNumbers.begin(), newNumbers.end(), 0.);
swap(numbers, newNumbers);
}
return result;
}
double simd() {
double result = 0;
constexpr int SIMD_NUMBER_COUNT = NUMBER_COUNT / Vc::double_v::Size;
using vector_double_v = std::vector<Vc::double_v, Vc::Allocator<Vc::double_v>>;
vector_double_v newNumbers(SIMD_NUMBER_COUNT);
vector_double_v numbers(SIMD_NUMBER_COUNT);
std::mt19937 rnd(0);
for(auto& n : numbers) {
alignas(Vc::VectorAlignment) double t[Vc::double_v::Size];
for(double& v : t)
v = rnd();
n.load(t, Vc::Aligned);
}
Vc::double_v const incv(Vc::double_v::Size);
for(int i = 0; i < ITERATION_COUNT; ++i) {
Vc::double_v x(Vc::IndexesFromZero);
for(auto& new_n : newNumbers) {
Vc::double_v totals;
int y = 0;
for(auto const& n : numbers) {
for(unsigned j = 0; j < Vc::double_v::Size; ++j) {
auto d = y - x;
totals += n[j] * (d * d);
++y;
}
}
new_n = totals;
x += incv;
}
result += std::accumulate(newNumbers.begin(), newNumbers.end(), Vc::double_v{}).sum();
swap(numbers, newNumbers);
}
return result;
}
double simd_mt() {
double result = 0;
constexpr int SIMD_NUMBER_COUNT = NUMBER_COUNT / Vc::double_v::Size;
using vector_double_v = std::vector<Vc::double_v, Vc::Allocator<Vc::double_v>>;
vector_double_v newNumbers(SIMD_NUMBER_COUNT);
vector_double_v numbers(SIMD_NUMBER_COUNT);
std::mt19937 rnd(0);
for(auto& n : numbers) {
alignas(Vc::VectorAlignment) double t[Vc::double_v::Size];
for(double& v : t)
v = rnd();
n.load(t, Vc::Aligned);
}
Vc::double_v const v0123(Vc::IndexesFromZero);
for(int i = 0; i < ITERATION_COUNT; ++i) {
constexpr int SIMD_STEP = 4;
tbb::parallel_for(0, SIMD_NUMBER_COUNT, SIMD_STEP, [&](int ix) {
Vc::double_v xs[SIMD_STEP];
for(int is = 0; is < SIMD_STEP; ++is)
xs[is] = v0123 + (ix + is) * Vc::double_v::Size;
Vc::double_v totals[SIMD_STEP];
int y = 0;
for(auto const& n : numbers) {
for(unsigned j = 0; j < Vc::double_v::Size; ++j) {
for(int is = 0; is < SIMD_STEP; ++is) {
auto d = y - xs[is];
totals[is] += n[j] * (d * d);
}
++y;
}
}
std::copy_n(totals, SIMD_STEP, &newNumbers[ix]);
});
result += std::accumulate(newNumbers.begin(), newNumbers.end(), Vc::double_v{}).sum();
swap(numbers, newNumbers);
}
return result;
}
struct Stopwatch {
using Clock = std::chrono::high_resolution_clock;
using Seconds = std::chrono::duration<double>;
Clock::time_point start_ = Clock::now();
Seconds elapsed() const {
return std::chrono::duration_cast<Seconds>(Clock::now() - start_);
}
};
std::ostream& operator<<(std::ostream& s, Stopwatch::Seconds const& a) {
auto precision = s.precision(9);
s << std::fixed << a.count() << std::resetiosflags(std::ios_base::floatfield) << 's';
s.precision(precision);
return s;
}
void benchmark() {
Stopwatch::Seconds baseline_time;
{
Stopwatch s;
double result = baseline();
baseline_time = s.elapsed();
std::cout << "baseline: " << result << ", " << baseline_time << '\n';
}
{
Stopwatch s;
double result = simd();
auto time = s.elapsed();
std::cout << " simd: " << result << ", " << time << ", " << (baseline_time / time) << "x speedup\n";
}
{
Stopwatch s;
double result = simd_mt();
auto time = s.elapsed();
std::cout << " simd_mt: " << result << ", " << time << ", " << (baseline_time / time) << "x speedup\n";
}
}
int main() {
benchmark();
benchmark();
benchmark();
}
Timings:
baseline: 2.76582e+257, 6.399848397s
simd: 2.76582e+257, 1.600373449s, 3.99897x speedup
simd_mt: 2.76582e+257, 0.168638435s, 37.9501x speedup
Notes:
My machine supports AVX but not AVX-512, so it is roughly 4x speedup when using SIMD.
simd_mt version uses 8 threads on my machine and larger SIMD steps. The theoretical speedup is 128x, on practice - 38x.
clang++-6 cannot auto-vectorize the baseline code, neither can g++-8.
g++-8 generates considerably faster code for SIMD versions than clang++-6 .
Your heart is certainly in the right place minus a bug or two.
par_for is a complex issue depending on the payload of your loop. There is
no one-size-fits-all solution to this. The payload can be anything from
a couple of adds to almost infinite mutex blocks - for example by doing memory
allocation.
The atomic variable as a work item pattern has always worked well for me but
remember that atomic variables have a high cost on X86 (~400 cycles) and even
incur a high cost if they are in an unexecuted branch as I found to my peril.
Some permutation of the following is usually good. Choosing the right chunks_per_thread (as in your batchSize) is critical. If you don't trust your
users, you can test execute a few iterations of the loop to guess the
best chunking level.
#include <atomic>
#include <future>
#include <thread>
#include <vector>
#include <stdio.h>
template<typename Func>
void par_for(int start, int end, int step, int chunks_per_thread, Func func) {
using namespace std;
using namespace chrono;
atomic<int> work_item{start};
vector<future<void>> futures(std::thread::hardware_concurrency());
for (auto &fut : futures) {
fut = async(std::launch::async, [&work_item, end, step, chunks_per_thread, &func]() {
for(;;) {
int wi = work_item.fetch_add(step * chunks_per_thread);
if (wi > end) break;
int wi_max = std::min(end, wi+step * chunks_per_thread);
while (wi < wi_max) {
func(wi);
wi += step;
}
}
});
}
for (auto &fut : futures) {
fut.wait();
}
}
int main() {
using namespace std;
using namespace chrono;
for (int k = 0; k != 2; ++k) {
auto t0 = high_resolution_clock::now();
constexpr int loops = 100000000;
if (k == 0) {
for (int i = 0; i != loops; ++i ) {
if (i % 10000000 == 0) printf("%d\n", i);
}
} else {
par_for(0, loops, 1, 100000, [](int i) {
if (i % 10000000 == 0) printf("%d\n", i);
});
}
auto t1 = high_resolution_clock::now();
duration<double, milli> ns = t1 - t0;
printf("k=%d %fms total\n", k, ns.count());
}
}
results
...
k=0 174.925903ms total
...
k=1 27.924738ms total
About a 6x speedup.
I avoid the term "embarassingly parallel" as it is almost never the case. You pay exponentially higher costs the more resources you use on your journey from level 1 cache (ns latency) to globe spanning cluster (ms latency). But I hope this code snippet is useful as an answer.

Using multithreading in C++

Trying to use multithreading in calculating of the integral. Please, correct me if I'm wrong.
I think, that calculating the integral from 0 to 20000 is slower(about time) than I'd divide it on (0-10000) to (10000-2000). That's why I wanted to check the time of realization a function.
At the first, I tried to calculate from 0 to 20000 and I got the defining time of realization of a function.
At the second, I was trying to divide it on segments and got the defining time of realization of functions but if I'd sum the times, and I'd get bigger time that it was before. Why?
In my opinion, I'm wrong with the using multithreading. Can you help me with that?
*But the second problem is that I get wrong resulsts with the calculating < but it happens, when I try to calculate big numbers, like 0-10000, for example. But as for 0-3, it's okay.Is my method of calculating mistaken? But why is it okay with small numbers (like:0-3)?
Here is my code, please help me to figure out..
The class
#pragma once
#include <iostream>
#include <cmath>
#include <mutex>
#define N 10000
class Integral
{
private:
std::mutex mtx;
double S, x, h;
public:
Integral() :S(0),x(0), h(0) {}
double givingFunction(int x);
void solvingFunction(double a, double b);
};
Integral.cpp
#include "Integral.h"
#include <chrono>
double Integral::givingFunction(int x)
{
double f;
f = x+x;
return f;
}
void Integral::solvingFunction(double a, double b)
{
auto start = std::chrono::high_resolution_clock::now();
mtx.lock();
/*std::cout << "Enter a: ";
std::cin >> a;
std::cout << "Enter b: ";
std::cin >> b;*/
//a = 3;
//b = 0;
std::cout <<std::endl;
h = (b - a) / N;
x = a + h;
while (x<b)
{
S = S + 4 * givingFunction(x);
x = x + h;
if (x >= b)
break;
S = S + 2 * givingFunction(x);
x = x + h;
}
S = (h / 3)*(S + givingFunction(a) + givingFunction(b));
std::cout << "Result:" << S <<std::endl;
mtx.unlock();
auto finish = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> elapsed = finish - start;
std::cout << "Elapsed time: " << elapsed.count() << " s\n";
}
main.cpp
#include <iostream>
#include "Integral.h"
#include <thread>
int main()
{
Integral w;
std::thread Solve(&Integral::solvingFunction,&w,0.,1000.);
std::thread Solve1(&Integral::solvingFunction,&w, 1000., 20000.);
Solve.join();
Solve1.join();
system("pause");
return 0;
}

thrust vector distance calculation

Consider the following dataset and centroids. There are 7 individuals and two means each with 8 dimensions. They are stored row major order.
short dim = 8;
float centroids[] = {
0.223, 0.002, 0.223, 0.412, 0.334, 0.532, 0.244, 0.612,
0.742, 0.812, 0.817, 0.353, 0.325, 0.452, 0.837, 0.441
};
float data[] = {
0.314, 0.504, 0.030, 0.215, 0.647, 0.045, 0.443, 0.325,
0.731, 0.354, 0.696, 0.604, 0.954, 0.673, 0.625, 0.744,
0.615, 0.936, 0.045, 0.779, 0.169, 0.589, 0.303, 0.869,
0.275, 0.406, 0.003, 0.763, 0.471, 0.748, 0.230, 0.769,
0.903, 0.489, 0.135, 0.599, 0.094, 0.088, 0.272, 0.719,
0.112, 0.448, 0.809, 0.157, 0.227, 0.978, 0.747, 0.530,
0.908, 0.121, 0.321, 0.911, 0.884, 0.792, 0.658, 0.114
};
I want to calculate each euclidean distances. c1 - d1, c1 - d2 ....
On CPU I would do:
float dist = 0.0, dist_sqrt;
for(int i = 0; i < 2; i++)
for(int j = 0; j < 7; j++)
{
float dist_sum = 0.0;
for(int k = 0; k < dim; k++)
{
dist = centroids[i * dim + k] - data[j * dim + k];
dist_sum += dist * dist;
}
dist_sqrt = sqrt(dist_sum);
// do something with the distance
std::cout << dist_sqrt << std::endl;
}
Is there any built in solution of vector distance calculation in THRUST?
It can be done in thrust. Explaining how will be rather involved, and the code is rather dense.
The key observation to start with is that the core operation can be done via a transformed reduction. The thrust transform operation is used to perform the elementwise subtraction of the vectors (individual-centroid) and squaring of each result, and the reduction sums the results together to produce the square of the euclidean distance. The starting point for this operation is thrust::reduce_by_key, but it gets rather involved to present the data correctly to reduce_by_key.
The final results are produced by taking the square root of each result from above, and we can use an ordinary thrust::transform for this.
The above is a summary description of the only 2 lines of thrust code that do all the work. However, the first line has considerable complexity to it. In order to exploit parallelism, the approach I took was to virtually "lay out" the necessary vectors in sequence, to be presented to reduce_by_key. To take a simple example, suppose we have 2 centroids and 4 individuals, and suppose our dimension is 2.
centroid 0: C00 C01
centroid 1: C10 C11
individ 0: I00 I01
individ 1: I10 I11
individ 2: I20 I21
individ 3: I30 I31
We can "lay out" the vectors like this:
C00 C01 C00 C01 C00 C01 C00 C01 C10 C11 C10 C11 C10 C11 C10 C11
I00 I01 I10 I11 I20 I21 I30 I31 I00 I01 I10 I11 I20 I21 I30 I31
To facilitate the reduce_by_key, we will also need to generate key values to delineate the vectors:
0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
The above data "laid-out" data sets can be quite large, and we don't want to incur storage and retrieval cost, so we will generate these "on-the-fly" using thrust's collection of fancy iterators. This is where things get quite dense. With the above strategy in mind, we will use thrust::reduce_by_key to do the work. We'll create a custom functor provided to a transform_iterator to do the subtraction (and squaring) of the I and C vectors, which will be zipped together for this purpose. The "lay out" of the vectors will be created on the fly using permutation iterators with additional custom index-creation functors, to help with the replicated patterns in each of I and C.
Therefore, working from the "inside out", the sequence of steps is as follows:
for both I (data) and C (centr) use a counting_iterator combined with a custom indexing functor inside of a transform_iterator to produce the indexing sequences we will need.
using the indexing sequences created in step 1 and the base I and C vectors, virtually "lay out" the vectors via a permutation_iterator (one for each laid-out vector).
zip the 2 "laid out" virtual I and C vectors together, to create a <float, float> tuple vector (virtual).
take the zip_iterator from step 3, and combine with a custom distance-calculation functor ((I-C)^2) in a transform_iterator
use another transform_iterator, combining a counting_iterator with a custom key-generating functor, to produce the key sequence (virtual)
pass the iterators in steps 4 and 5 to reduce_by_keyas the inputs (keys, values) to be reduced. The output vectors for reduce_by_key are also keys and values. We don't need the keys, so we'll use a discard_iterator to dump those. The values we will save.
The above steps are all accomplished in a single line of thrust code.
Here's a code illustrating the above:
#include <iostream>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/reduce.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/iterator/discard_iterator.h>
#include <thrust/copy.h>
#include <math.h>
#include <time.h>
#include <sys/time.h>
#include <stdlib.h>
#define MAX_DATA 100000000
#define MAX_CENT 5000
#define TOL 0.001
unsigned long long dtime_usec(unsigned long long prev){
#define USECPSEC 1000000ULL
timeval tv1;
gettimeofday(&tv1,0);
return ((tv1.tv_sec * USECPSEC)+tv1.tv_usec) - prev;
}
unsigned verify(float *d1, float *d2, int len){
unsigned pass = 1;
for (int i = 0; i < len; i++)
if (fabsf(d1[i] - d2[i]) > TOL){
std::cout << "mismatch at: " << i << " val1: " << d1[i] << " val2: " << d2[i] << std::endl;
pass = 0;
break;}
return pass;
}
void eucl_dist_cpu(const float *centroids, const float *data, float *rdist, int num_centroids, int dim, int num_data, int print){
int out_idx = 0;
float dist, dist_sqrt;
for(int i = 0; i < num_centroids; i++)
for(int j = 0; j < num_data; j++)
{
float dist_sum = 0.0;
for(int k = 0; k < dim; k++)
{
dist = centroids[i * dim + k] - data[j * dim + k];
dist_sum += dist * dist;
}
dist_sqrt = sqrt(dist_sum);
// do something with the distance
rdist[out_idx++] = dist_sqrt;
if (print) std::cout << dist_sqrt << ", ";
}
if (print) std::cout << std::endl;
}
struct dkeygen : public thrust::unary_function<int, int>
{
int dim;
int numd;
dkeygen(const int _dim, const int _numd) : dim(_dim), numd(_numd) {};
__host__ __device__ int operator()(const int val) const {
return (val/dim);
}
};
typedef thrust::tuple<float, float> mytuple;
struct my_dist : public thrust::unary_function<mytuple, float>
{
__host__ __device__ float operator()(const mytuple &my_tuple) const {
float temp = thrust::get<0>(my_tuple) - thrust::get<1>(my_tuple);
return temp*temp;
}
};
struct d_idx : public thrust::unary_function<int, int>
{
int dim;
int numd;
d_idx(int _dim, int _numd) : dim(_dim), numd(_numd) {};
__host__ __device__ int operator()(const int val) const {
return (val % (dim*numd));
}
};
struct c_idx : public thrust::unary_function<int, int>
{
int dim;
int numd;
c_idx(int _dim, int _numd) : dim(_dim), numd(_numd) {};
__host__ __device__ int operator()(const int val) const {
return (val % dim) + (dim * (val/(dim*numd)));
}
};
struct my_sqrt : public thrust::unary_function<float, float>
{
__host__ __device__ float operator()(const float val) const {
return sqrtf(val);
}
};
unsigned long long eucl_dist_thrust(thrust::host_vector<float> &centroids, thrust::host_vector<float> &data, thrust::host_vector<float> &dist, int num_centroids, int dim, int num_data, int print){
thrust::device_vector<float> d_data = data;
thrust::device_vector<float> d_centr = centroids;
thrust::device_vector<float> values_out(num_centroids*num_data);
unsigned long long compute_time = dtime_usec(0);
thrust::reduce_by_key(thrust::make_transform_iterator(thrust::make_counting_iterator<int>(0), dkeygen(dim, num_data)), thrust::make_transform_iterator(thrust::make_counting_iterator<int>(dim*num_data*num_centroids), dkeygen(dim, num_data)),thrust::make_transform_iterator(thrust::make_zip_iterator(thrust::make_tuple(thrust::make_permutation_iterator(d_centr.begin(), thrust::make_transform_iterator(thrust::make_counting_iterator<int>(0), c_idx(dim, num_data))), thrust::make_permutation_iterator(d_data.begin(), thrust::make_transform_iterator(thrust::make_counting_iterator<int>(0), d_idx(dim, num_data))))), my_dist()), thrust::make_discard_iterator(), values_out.begin());
thrust::transform(values_out.begin(), values_out.end(), values_out.begin(), my_sqrt());
cudaDeviceSynchronize();
compute_time = dtime_usec(compute_time);
if (print){
thrust::copy(values_out.begin(), values_out.end(), std::ostream_iterator<float>(std::cout, ", "));
std::cout << std::endl;
}
thrust::copy(values_out.begin(), values_out.end(), dist.begin());
return compute_time;
}
int main(int argc, char *argv[]){
int dim = 8;
int num_centroids = 2;
float centroids[] = {
0.223, 0.002, 0.223, 0.412, 0.334, 0.532, 0.244, 0.612,
0.742, 0.812, 0.817, 0.353, 0.325, 0.452, 0.837, 0.441
};
int num_data = 8;
float data[] = {
0.314, 0.504, 0.030, 0.215, 0.647, 0.045, 0.443, 0.325,
0.731, 0.354, 0.696, 0.604, 0.954, 0.673, 0.625, 0.744,
0.615, 0.936, 0.045, 0.779, 0.169, 0.589, 0.303, 0.869,
0.275, 0.406, 0.003, 0.763, 0.471, 0.748, 0.230, 0.769,
0.903, 0.489, 0.135, 0.599, 0.094, 0.088, 0.272, 0.719,
0.112, 0.448, 0.809, 0.157, 0.227, 0.978, 0.747, 0.530,
0.908, 0.121, 0.321, 0.911, 0.884, 0.792, 0.658, 0.114,
0.721, 0.555, 0.979, 0.412, 0.007, 0.501, 0.844, 0.234
};
std::cout << "cpu results: " << std::endl;
float dist[num_data*num_centroids];
eucl_dist_cpu(centroids, data, dist, num_centroids, dim, num_data, 1);
thrust::host_vector<float> h_data(data, data + (sizeof(data)/sizeof(float)));
thrust::host_vector<float> h_centr(centroids, centroids + (sizeof(centroids)/sizeof(float)));
thrust::host_vector<float> h_dist(num_centroids*num_data);
std::cout << "gpu results: " << std::endl;
eucl_dist_thrust(h_centr, h_data, h_dist, num_centroids, dim, num_data, 1);
float *data2, *centroids2, *dist2;
num_centroids = 10;
num_data = 1000000;
if (argc > 2) {
num_centroids = atoi(argv[1]);
num_data = atoi(argv[2]);
if ((num_centroids < 1) || (num_centroids > MAX_CENT)) {std::cout << "Num centroids out of range" << std::endl; return 1;}
if ((num_data < 1) || (num_data > MAX_DATA)) {std::cout << "Num data out of range" << std::endl; return 1;}
if (num_data * dim * num_centroids > 2000000000) {std::cout << "data set out of range" << std::endl; return 1;}}
std::cout << "Num Data: " << num_data << std::endl;
std::cout << "Num Cent: " << num_centroids << std::endl;
std::cout << "result size: " << ((num_data*num_centroids*4)/1048576) << " Mbytes" << std::endl;
data2 = new float[dim*num_data];
centroids2 = new float[dim*num_centroids];
dist2 = new float[num_data*num_centroids];
for (int i = 0; i < dim*num_data; i++) data2[i] = rand()/(float)RAND_MAX;
for (int i = 0; i < dim*num_centroids; i++) centroids2[i] = rand()/(float)RAND_MAX;
unsigned long long dtime = dtime_usec(0);
eucl_dist_cpu(centroids2, data2, dist2, num_centroids, dim, num_data, 0);
dtime = dtime_usec(dtime);
std::cout << "cpu time: " << dtime/(float)USECPSEC << "s" << std::endl;
thrust::host_vector<float> h_data2(data2, data2 + (dim*num_data));
thrust::host_vector<float> h_centr2(centroids2, centroids2 + (dim*num_centroids));
thrust::host_vector<float> h_dist2(num_data*num_centroids);
dtime = dtime_usec(0);
unsigned long long ctime = eucl_dist_thrust(h_centr2, h_data2, h_dist2, num_centroids, dim, num_data, 0);
dtime = dtime_usec(dtime);
std::cout << "gpu total time: " << dtime/(float)USECPSEC << "s, gpu compute time: " << ctime/(float)USECPSEC << "s" << std::endl;
if (!verify(dist2, &(h_dist2[0]), num_data*num_centroids)) {std::cout << "Verification failure." << std::endl; return 1;}
std::cout << "Success!" << std::endl;
return 0;
}
Notes:
The code is set up to do 2 passes, a short one using a data set similar to yours, with printout for visual check. Then a larger data set can be entered, via command-line sizing parameters (number of centroids, then number of individuals), for benchmark comparison and validation of results.
Contrary to what I stated in the comments, the thrust code is only running about 25% faster than the naive single-threaded CPU code. Your mileage may vary.
This is just one way to think about handling it. I have had other ideas, but not enough time to flesh them out.
The data sets can become rather large. The code right now is intended to be limited to data sets where the product of dimension*number_of_centroids*number_of_individuals is less than 2 billion. However, as you approach even this number, you will need a GPU and CPU that both have a few GB of memory. I briefly explored larger data set sizes. A few code changes would be needed in various places to extend from e.g. int to unsigned long long, etc. However I haven't provided that as I am still investigating an issue with that code.
For another, non-thrust-related look at computing euclidean distances on the GPU, you may be interested in this question. If you follow the sequence of optimizations that were made there, it may shed some light on either how this thrust code might be improved, or else how another non-thrust realization could be used.
Sorry I wasn't able to squeeze more performance out.

Using pow() for large number

I am trying to solve a problem, a part of which requires me to calculate (2^n)%1000000007 , where n<=10^9. But my following code gives me output "0" even for input like n=99.
Is there anyway other than having a loop which multilplies the output by 2 every time and finding the modulo every time (this is not I am looking for as this will be very slow for large numbers).
#include<stdio.h>
#include<math.h>
#include<iostream>
using namespace std;
int main()
{
unsigned long long gaps,total;
while(1)
{
cin>>gaps;
total=(unsigned long long)powf(2,gaps)%1000000007;
cout<<total<<endl;
}
}
You need a "big num" library, it is not clear what platform you are on, but start here:
http://gmplib.org/
this is not I am looking for as this will be very slow for large numbers
Using a bigint library will be considerably slower pretty much any other solution.
Don't take the modulo every pass through the loop: rather, only take it when the output grows bigger than the modulus, as follows:
#include <iostream>
int main() {
int modulus = 1000000007;
int n = 88888888;
long res = 1;
for(long i=0; i < n; ++i) {
res *= 2;
if(res > modulus)
res %= modulus;
}
std::cout << res << std::endl;
}
This is actually pretty quick:
$ time ./t
./t 1.19s user 0.00s system 99% cpu 1.197 total
I should mention that the reason this works is that if a and b are equivalent mod m (that is, a % m = b % m), then this equality holds multiple k of a and b (that is, the foregoing equality implies (a*k)%m = (b*k)%m).
Chris proposed GMP, but if you need just that and want to do things The C++ Way, not The C Way, and without unnecessary complexity, you may just want to check this out - it generates few warnings when compiling, but is quite simple and Just Works™.
You can split your 2^n into chunks of 2^m. You need to find: `
2^m * 2^m * ... 2^(less than m)
Number m should be 31 is for 32-bit CPU. Then your answer is:
chunk1 % k * chunk2 * k ... where k=1000000007
You are still O(N). But then you can utilize the fact that all chunk % k are equal except last one and you can make it O(1)
I wrote this function. It is very inefficient but it works with very large numbers. It uses my self-made algorithm to store big numbers in arrays using a decimal like system.
mpfr2.cpp
#include "mpfr2.h"
void mpfr2::mpfr::setNumber(std::string a) {
for (int i = a.length() - 1, j = 0; i >= 0; ++j, --i) {
_a[j] = a[i] - '0';
}
res_size = a.length();
}
int mpfr2::mpfr::multiply(mpfr& a, mpfr b)
{
mpfr ans = mpfr();
// One by one multiply n with individual digits of res[]
int i = 0;
for (i = 0; i < b.res_size; ++i)
{
for (int j = 0; j < a.res_size; ++j) {
ans._a[i + j] += b._a[i] * a._a[j];
}
}
for (i = 0; i < a.res_size + b.res_size; i++)
{
int tmp = ans._a[i] / 10;
ans._a[i] = ans._a[i] % 10;
ans._a[i + 1] = ans._a[i + 1] + tmp;
}
for (i = a.res_size + b.res_size; i >= 0; i--)
{
if (ans._a[i] > 0) break;
}
ans.res_size = i+1;
a = ans;
return a.res_size;
}
mpfr2::mpfr mpfr2::mpfr::pow(mpfr a, mpfr b) {
mpfr t = a;
std::string bStr = "";
for (int i = b.res_size - 1; i >= 0; --i) {
bStr += std::to_string(b._a[i]);
}
int i = 1;
while (!0) {
if (bStr == std::to_string(i)) break;
a.res_size = multiply(a, t);
// Debugging
std::cout << "\npow() iteration " << i << std::endl;
++i;
}
return a;
}
mpfr2.h
#pragma once
//#infdef MPFR2_H
//#define MPFR2_H
// C standard includes
#include <iostream>
#include <string>
#define MAX 0x7fffffff/32/4 // 2147483647
namespace mpfr2 {
class mpfr
{
public:
int _a[MAX];
int res_size;
void setNumber(std::string);
static int multiply(mpfr&, mpfr);
static mpfr pow(mpfr, mpfr);
};
}
//#endif
main.cpp
#include <iostream>
#include <fstream>
// Local headers
#include "mpfr2.h" // Defines local mpfr algorithm library
// Namespaces
namespace m = mpfr2; // Reduce the typing a bit later...
m::mpfr tetration(m::mpfr, int);
int main() {
// Hardcoded tests
int x = 7;
std::ofstream f("out.txt");
m::mpfr t;
for(int b=1; b<x;b++) {
std::cout << "2^^" << b << std::endl; // Hardcoded message
t.setNumber("2");
m::mpfr res = tetration(t, b);
for (int i = res.res_size - 1; i >= 0; i--) {
std::cout << res._a[i];
f << res._a[i];
}
f << std::endl << std::endl;
std::cout << std::endl << std::endl;
}
char c; std::cin.ignore(); std::cin >> c;
return 0;
}
m::mpfr tetration(m::mpfr a, int b)
{
m::mpfr tmp = a;
if (b <= 0) return m::mpfr();
for (; b > 1; b--) tmp = m::mpfr::pow(a, tmp);
return tmp;
}
I created this for tetration and eventually hyperoperations. When the numbers get really big it can take ages to calculate and a lot of memory. The #define MAX 0x7fffffff/32/4 is the number of decimals one number can have. I might make another algorithm later to combine multiple of these arrays into one number. On my system the max array length is 0x7fffffff aka 2147486347 aka 2^31-1 aka int32_max (which is usually the standard int size) so I had to divide int32_max by 32 to make the creation of this array possible. I also divided it by 4 to reduce memory usage in the multiply() function.
- Jubiman

Most efficient way to find the greatest of three ints

Below is my pseudo code.
function highest(i, j, k)
{
if(i > j && i > k)
{
return i;
}
else if (j > k)
{
return j;
}
else
{
return k;
}
}
I think that works, but is that the most efficient way in C++?
To find the greatest you need to look at exactly 3 ints, no more no less. You're looking at 6 with 3 compares. You should be able to do it in 3 and 2 compares.
int ret = max(i,j);
ret = max(ret, k);
return ret;
Pseudocode:
result = i
if j > result:
result = j
if k > result:
result = k
return result
How about
return i > j? (i > k? i: k): (j > k? j: k);
two comparisons, no use of transient temporary stack variables...
Your current method:
http://ideone.com/JZEqZTlj (0.40s)
Chris's solution:
int ret = max(i,j);
ret = max(ret, k);
return ret;
http://ideone.com/hlnl7QZX (0.39s)
Solution by Ignacio Vazquez-Abrams:
result = i;
if (j > result)
result = j;
if (k > result)
result = k;
return result;
http://ideone.com/JKbtkgXi (0.40s)
And Charles Bretana's:
return i > j? (i > k? i: k): (j > k? j: k);
http://ideone.com/kyl0SpUZ (0.40s)
Of those tests, all the solutions take within 3% the amount of time to execute as the others. The code you are trying to optimize is extremely short as it is. Even if you're able to squeeze 1 instruction out of it, it's not likely to make a huge difference across the entirety of your program (modern compilers might catch that small optimization). Spend your time elsewhere.
EDIT: Updated the tests, turns out it was still optimizing parts of it out before. Hopefully it's not anymore.
For a question like this, there is no substitute for knowing just what your optimizing compiler is doing and just what's available on the hardware. If the fundamental tool you have is binary comparison or binary max, two comparisons or max's are both necessary and sufficient.
I prefer Ignacio's solution:
result = i;
if (j > result)
result = j;
if (k > result)
result = k;
return result;
because on the common modern Intel hardware, the compiler will find it extremely easy to emit just two comparisons and two cmov instructions, which place a smaller load on the I-cache and less stress on the branch predictor than conditional branches. (Also, the code is clear and easy to read.) If you are using x86-64, the compiler will even keep everything in registers.
Note you are going to be hard pressed to embed this code into a program where your choice makes a difference...
I like to eliminate conditional jumps as an intellectual exercise. Whether this has any measurable effect on performance I have no idea though :)
#include <iostream>
#include <limits>
inline int max(int a, int b)
{
int difference = a - b;
int b_greater = difference >> std::numeric_limits<int>::digits;
return a - (difference & b_greater);
}
int max(int a, int b, int c)
{
return max(max(a, b), c);
}
int main()
{
std::cout << max(1, 2, 3) << std::endl;
std::cout << max(1, 3, 2) << std::endl;
std::cout << max(2, 1, 3) << std::endl;
std::cout << max(2, 3, 1) << std::endl;
std::cout << max(3, 1, 2) << std::endl;
std::cout << max(3, 2, 1) << std::endl;
}
This bit twiddling is just for fun, the cmov solution is probably a lot faster.
Not sure if this is the most efficient or not, but it might be, and it's definitely shorter:
int maximum = max( max(i, j), k);
There is a proposal to include this into the C++ library under N2485. The proposal is simple, so I've included the meaningful code below. Obviously, this assumes variadic templates
template < typename T >
const T & max ( const T & a )
{ return a ; }
template < typename T , typename ... Args >
const T & max( const T & a , const T & b , const Args &... args )
{ return max ( b > a ? b : a , args ...); }
The easiest way to find a maximum or minimum of 2 or more numbers in c++ is:-
int a = 3, b = 4, c = 5;
int maximum = max({a, b, c});
int a = 3, b = 4, c = 5;
int minimum = min({a, b, c});
You can give as many variables as you want.
Interestingly enough it is also incredibly efficient, at least as efficient as Ignacio Vazquez-Abrams'solution (https://godbolt.org/z/j1KM97):
mov eax, dword ptr [rsp + 8]
mov ecx, dword ptr [rsp + 4]
cmp eax, ecx
cmovl eax, ecx
mov ecx, dword ptr [rsp]
cmp eax, ecx
cmovl eax, ecx
Similar with GCC, while MSVC makes a mess with a loop.
public int maximum(int a,int b,int c){
int max = a;
if(b>max)
max = b;
if(c>max)
max = c;
return max;
}
I think by "most efficient" you are talking about performance, trying not to waste computing resources. But you could be referring to writing fewer lines of code or maybe about the readability of your source code. I am providing an example below, and you can evaluate if you find something useful or if you prefer another version from the answers you received.
/* Java version, whose syntax is very similar to C++. Call this program "LargestOfThreeNumbers.java" */
class LargestOfThreeNumbers{
public static void main(String args[]){
int x, y, z, largest;
x = 1;
y = 2;
z = 3;
largest = x;
if(y > x){
largest = y;
if(z > y){
largest = z;
}
}else if(z > x){
largest = z;
}
System.out.println("The largest number is: " + largest);
}
}
#include<stdio.h>
int main()
{
int a,b,c,d,e;
scanf("%d %d %d",&a,&b,&c);
d=(a+b+abs(a-b))/2;
e=(d+c+abs(c-d))/2;
printf("%d is Max\n",e);
return 0;
}
I Used This Way, It Took 0.01 Second
#include "iostream"
using std::cout;
using std::cin;
int main()
{
int num1, num2, num3;
cin>>num1>>num2>>num3;
int cot {((num1>num2)?num1:num2)};
int fnl {(num3>cot)?num3:cot};
cout<<fnl;
}
Or This
#include "iostream"
using std::cout;
using std::cin;
int main()
{
int num1, num2, num3;
cin>>num1>>num2>>num3;
int cot {(((num1>num2)?num1:num2)>((num3>cot)?num3:cot)?((num1>num2)?num1:num2):((num3>cot)?num3:cot))};
cout<<cot;
}
The most efficient way to find the greatest among 3 numbers is by using max function. Here is a small example:
#include <iostream>
#include <algorithm>
using namespace std;
int main() {
int x = 3, y = 4, z = 5;
cout << max(x, max(y, z)) << endl;
return 0;
}
If you have C++ 11, then you can do it as follow:
int main() {
int x = 3, y = 4, z = 5;
cout << std::max({x, y, z}); << endl;
return 0;
}
If you are interested to use a function, so that you can call it easily multiple times, here is the code:
using namespace std;
int test(int x, int y, int z) { //created a test function
//return std::max({x, y, z}); //if installed C++11
return max(x, max(y, z));
}
int main() {
cout << test(1, 2, 3) << endl;
cout << test(1, 3, 2) << endl;
cout << test(1, 1, 1) << endl;
cout << test(1, 2, 2) << endl;
return 0;
}
Here is a small function you can use:
int max3(int a, int b, int c=INT_MIN) {
return max(a, max(b, c));
}
In C# finding the greatest and smallest number between 3 digit
static void recorrectFindSmallestNumber()
{
int x = 30, y = 22, z = 11;
if (x < y)
{
if (x < z)
{
Console.WriteLine("X is Smaller Numebr {0}.", x);
}
else
{
Console.WriteLine("z is Smaller Numebr {0}.", z);
}
}
else if (x > y)
{
if (y < z)
{
Console.WriteLine("y is Smaller number.{0}", y);
}
else
{
Console.WriteLine("z is Smaller number.{0}", z);
}
}
else
{
}
}
=================================================================
static void recorrectFindLargeNumber()
{
int x, y, z;
Console.WriteLine("Enter the first number:");
x = int.Parse(Console.ReadLine());
Console.WriteLine("Enter the second number:");
y = int.Parse(Console.ReadLine());
Console.WriteLine("Enter the third nuumnber:");
z = int.Parse(Console.ReadLine());
if (x > y)
{
if (x > z)
{
Console.WriteLine("X is Greater numbaer: {0}.", x);
}
else
{
Console.WriteLine("Z is greatest number: {0}.", z);
}
}
else if (x < y)
{
if (y > z)
{
Console.WriteLine("y is Greater Number: {0}", y);
}
else
{
Console.WriteLine("Z is Greater Number; {0}", z);
}
}
else
{
}
}
#include<iostream>
#include<cmath>
using namespace std;
int main()
{
int num1,num2,num3,maximum;
cout<<"Enter 3 numbers one by one "<<endl;
cin>>num1;
cin>>num2;
cin>>num3;
maximum=max(max(num1,num2),num3);
cout<<"maximum of 3 numbers is "<<maximum<<endl;
}