Count floating-point instructions - C++

I am trying to count the number of floating point operations in one of my programs and I think perf could be the tool I am looking for (are there any alternatives?), but I have trouble limiting it to a certain function/block of code. Let's take the following example:
#include <complex>
#include <cstdlib>
#include <iostream>
#include <type_traits>
template <typename T>
typename std::enable_if<std::is_floating_point<T>::value, T>::type myrand()
{
return static_cast <T> (std::rand()) / static_cast <T> (RAND_MAX);
}
template <typename T>
typename std::enable_if<!std::is_floating_point<T>::value, std::complex<typename T::value_type>>::type myrand()
{
typedef typename T::value_type S;
return std::complex<S>(
static_cast <S> (std::rand()) / static_cast <S> (RAND_MAX),
static_cast <S> (std::rand()) / static_cast <S> (RAND_MAX)
);
}
int main()
{
auto const a = myrand<Type>();
auto const b = myrand<Type>();
// count here
auto const c = a * b;
// stop counting here
// prevent compiler from optimizing away c
std::cout << c << "\n";
return 0;
}
The myrand() function simply returns a random number; if the type T is complex, it returns a random complex number. I did not hardcode doubles into the program because the computation would be optimized away by the compiler.
You can compile the file (let's call it bench.cpp) with c++ -std=c++0x -DType=double bench.cpp.
Now I would like to count the number of floating point operations, which can be done on my processor (Nehalem architecture, x86_64 where floating point is done with scalar SSE) with the event r8010 (see Intel Manual 3B, Section 19.5). This can be done with
perf stat -e r8010 ./a.out
and works as expected; however, it counts the overall number of uops (is there a table telling how many uops e.g. a movsd is?), and I am only interested in the number for the multiplication (see the example above).
How can this be done?

I finally found a way to do this, although not using the perf tool itself but the underlying perf API. One first has to define a perf_event_open function, which is really just a wrapper around the corresponding syscall:
#include <cstdlib> // stdlib.h for C
#include <cstdio> // stdio.h for C
#include <cstring> // string.h for C
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/perf_event.h>
#include <asm/unistd.h>
long perf_event_open(
perf_event_attr* hw_event,
pid_t pid,
int cpu,
int group_fd,
unsigned long flags
) {
int ret = syscall(__NR_perf_event_open, hw_event, pid, cpu, group_fd, flags);
return ret;
}
Next, one selects the events one wishes to count:
perf_event_attr attr;
// select what we want to count
std::memset(&attr, 0, sizeof(perf_event_attr));
attr.size = sizeof(perf_event_attr);
attr.type = PERF_TYPE_HARDWARE;
attr.config = PERF_COUNT_HW_INSTRUCTIONS;
attr.disabled = 1;
attr.exclude_kernel = 1; // do not count the instructions the kernel executes
attr.exclude_hv = 1;
// open a file descriptor
int fd = perf_event_open(&attr, 0, -1, -1, 0);
if (fd == -1)
{
// handle error
}
In this case I want to count simply the number of instructions. Floating point instructions can be counted on my processor (Nehalem) by replacing the corresponding lines with
attr.type = PERF_TYPE_RAW;
attr.config = 0x8010; // Event Number = 10H, Umask Value = 80H
By setting the type to RAW one can basically count every event the processor is offering; the number 0x8010 specifies which one. Note that this number is highly processor-dependent! One can find the right numbers in the Intel Manual 3B, Part 2, Chapter 19, by picking the right subsection.
One can then measure the code by enclosing it in
// reset and enable the counter
ioctl(fd, PERF_EVENT_IOC_RESET, 0);
ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
// perform computation that should be measured here
// disable and read out the counter
ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
long long count;
read(fd, &count, sizeof(long long));
// count now has the (approximated) result
// close the file descriptor
close(fd);
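Putting the pieces above together, a minimal self-contained sketch could look like the following. This is not the exact program from the question; the raw event code 0x8010 is specific to Nehalem (use PERF_COUNT_HW_INSTRUCTIONS or the right code from the Intel manual on other CPUs), and the multiplication of two volatile doubles merely stands in for the code one wants to measure:

#include <cstdio>
#include <cstring>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/perf_event.h>
#include <asm/unistd.h>

static long perf_event_open(perf_event_attr* hw_event, pid_t pid, int cpu,
                            int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, hw_event, pid, cpu, group_fd, flags);
}

int main()
{
    perf_event_attr attr;
    std::memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_RAW;
    attr.config = 0x8010;            // Nehalem-specific raw event, see text above
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    int fd = perf_event_open(&attr, 0, -1, -1, 0);
    if (fd == -1) { std::perror("perf_event_open"); return 1; }

    volatile double a = 1.5, b = 2.5, c = 0.0;

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    c = a * b;                       // the code to be measured
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    long long count = 0;
    read(fd, &count, sizeof(count));
    close(fd);

    std::printf("c = %f, counted %lld events\n", c, count);
    return 0;
}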

Related

Is there a built-in function that quickly counts how many bits an int takes up?

I am trying to count how many bits an int takes up, e.g. count(10)=4, count(7)=3, count(127)=7, etc.
I have tried brute-forcing (<<ing a 1 until it's strictly bigger than the number) and using floor(log2(v))+1, but both are too slow for my needs.
I know that there exists a __builtin_popcount function that quickly counts how many 1s there are in an int, but I had trouble finding a built-in that fits my application; is there no such function, or have I overlooked something?
Edit: I'm working with g++ version 9.3.0.
Edit 2: mediocrevegetable1's answer was chosen because it was the only one usable with g++ at the time. However, future readers may also try out chris's answer for better compatibility, or in hopes that the compiler will give a more efficient implementation.
C++20 added std::bit_width (live example):
#include <bit>
#include <iostream>
int main() {
std::cout
<< std::bit_width(10u) << ' ' // 4
<< std::bit_width(7u) << ' ' // 3
<< std::bit_width(127u); // 7
}
With Clang trunk and, for example, -O3 -march=skylake, a function doing nothing but calling bit_width produces the following assembly:
count(unsigned int):
lzcnt ecx, edi
mov eax, 32
sub eax, ecx
ret
There are a number of similar functions in this new <bit> header as well. You can see them all here.
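For a quick taste of a few of those related helpers (this is just illustrative; the results follow from the C++20 definitions):

#include <bit>
#include <iostream>

int main() {
    std::cout
        << std::countl_zero(10u)   << ' '   // 28: leading zero bits in a 32-bit unsigned
        << std::bit_floor(10u)     << ' '   // 8: largest power of two not greater than 10
        << std::bit_ceil(10u)      << ' '   // 16: smallest power of two not less than 10
        << std::has_single_bit(8u) << '\n'; // 1: 8 is a power of two
}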
You can use the GCC built-in function __builtin_clz, which counts the leading zeros:
#include <climits>
constexpr unsigned bit_space(unsigned n)
{
return (sizeof n * CHAR_BIT) - __builtin_clz(n);
}
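Note that __builtin_clz has undefined behavior when its argument is 0, so a guarded variant (a small sketch; the name bit_space_safe is just for illustration) could look like:

#include <climits>

constexpr unsigned bit_space_safe(unsigned n)
{
    // __builtin_clz(0) is undefined, so handle zero explicitly.
    return n == 0 ? 0 : (sizeof n * CHAR_BIT) - __builtin_clz(n);
}

static_assert(bit_space_safe(0)   == 0, "");
static_assert(bit_space_safe(7)   == 3, "");
static_assert(bit_space_safe(127) == 7, "");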
You can use inline asm:
int index;
int value = 0b100010000000;
asm("bsr %1, %0":"=a"(index):"a"(value)); // now index equals 11
This method returns the zero-based index of the highest set bit. Note that the result is undefined when value is zero, i.e. when no bit is set at all.
To get around this, one can first check whether value is non-zero: if value is zero, no bit is set; if value is not zero, then index + 1 is the number of bits used by value.
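Putting that guard into code, a sketch of the whole thing (GCC/Clang inline asm, x86 only; bits_used is a name picked here for illustration):

#include <cstdio>

unsigned bits_used(unsigned value)
{
    if (value == 0)
        return 0;                 // bsr leaves its destination undefined for 0
    unsigned index;
    asm("bsr %1, %0" : "=r"(index) : "r"(value));
    return index + 1;             // index is the zero-based position of the highest set bit
}

int main()
{
    std::printf("%u %u %u\n", bits_used(0), bits_used(7), bits_used(0b100010000000)); // 0 3 12
    return 0;
}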
There is a standard library function for this.
No need to write anything yourself.
https://en.cppreference.com/w/cpp/utility/bitset/count
#include <cassert>
#include <bitset>
int main()
{
std::bitset<32> value = 0b10010110;
auto count = value.count();
assert(count == 4);
}
After the feedback below: this was an incorrect answer, since bitset::count counts the number of set bits rather than the bit width.
I don't know if there is a built-in function, but this compile-time constexpr will give the desired result.
#include <cassert>
#include <cstddef> // for size_t
static constexpr auto bits_needed(unsigned int value)
{
size_t n = 0;
for (; value > 0; value >>= 1, ++n);
return n;
}
int main()
{
assert(3 == bits_needed(7));
assert(4 == bits_needed(10));
assert(4 == bits_needed(9));
assert(16 == bits_needed(65000));
}

Choose 4 randoms printf

So, I am having a problem with my code. The program needs to randomly choose one of 4 printfs and print it in the terminal. I am new at this, so sorry about that.
#include <stdio.h>
#include <stdlib.h>
#include <locale.h>
int main () {
setlocale (LC_ALL, "Portuguese");
int opcao;
opcao = rand() % 3 + 1;
if (opcao == 0) {
printf ("\nA opção sorteada foi a de que o 1º classificado atual será o campeão (FC Porto)");
}
if (opcao == 1) {
printf ("\nA opção sorteada foi a de que o 1º classificado na 1ª volta será o campeão (SL Benfica)");
}
if (opcao == 2) {
printf ("\nA opção sorteada foi a de que Porto e Benfica farão um jogo em campo neutro para determinar o campeão!");
}
if (opcao == 4) {
printf ("\nFoi sorteada a opção de que não haverá campeão esta época");
}
return 0;
}
This is my code, but it keeps choosing the same printf over and over.
Use the <random> library instead of the obsolete and error-prone std::rand (using the modulo operator to obtain a random integer in a range is a common mistake). See Why is the new random library better than std::rand()? for more information.
#include <iostream>
#include <random>
int main()
{
std::mt19937 engine{std::random_device{}()};
std::uniform_int_distribution<int> dist{0, 3};
switch (dist(engine)) {
case 0:
std::cout << "...\n";
break;
case 1:
std::cout << "...\n";
break;
case 2:
std::cout << "...\n";
break;
case 3:
std::cout << "...\n";
break;
}
}
Here, we first create a std::mt19937 engine, which produces uniformly distributed integers in the half-open range [0, 2^32), and seed it using a std::random_device, which is supposed to generate a non-deterministic number (it may be implemented using the system time, for example). Then, we create a std::uniform_int_distribution to map the numbers generated by the engine to integers in the inclusive interval [0, 3], and obtain a value by calling the distribution with the engine as argument.
This can be generalized by taking a range of strings to print:
// headers needed by this snippet and the main() below
#include <array>
#include <functional>
#include <iostream>
#include <random>
#include <stdexcept>
#include <string>
template <typename RanIt, typename F>
decltype(auto) select(RanIt begin, RanIt end, F&& f)
{
if (begin == end) {
throw std::invalid_argument{"..."};
}
thread_local std::mt19937 engine{std::random_device{}()};
using index_t = long long; // for portability
std::uniform_int_distribution<index_t> dist{0, index_t{end - begin - 1}};
return std::invoke(std::forward<F>(f), begin[dist(engine)]);
}
int main()
{
const std::array<std::string, 4> messages {
// ...
};
select(messages.begin(), messages.end(),
[](const auto& string) {
std::cout << string << '\n';
});
}
Here, we take a pair of random access iterators and a Callable object to support selecting an element from an arbitrary random-accessible range and performing an arbitrary operation on it.
First, we check if the range is empty, in which case selection is not possible and an error is reported by throwing an exception.
Then, we create a std::mt19937 engine that is thread_local (that is, each thread has its own engine) to prevent data races. The state of the engine is maintained between calls, so we only seed it once for every thread.
After that, we create a std::uniform_int_distribution to generate a random index. Note that we used long long instead of typename std::iterator_traits<RanIt>::difference_type: std::uniform_int_distribution is only guaranteed to work with short, int, long, long long, unsigned short, unsigned int, unsigned long, and unsigned long long, so if difference_type is signed char or an extended signed integer type, it results in undefined behavior. long long is the largest supported signed integer type, and we use braced initialization to prevent narrowing conversions.
Finally, we std::forward the Callable object and std::invoke it with the selected element. The decltype(auto) specifier makes sure that the type and value category of the invocation is preserved.
We call the function with an std::array and a lambda expression that prints the selected element.
Since C++20, we can constrain the function template using concepts:
template <std::random_access_iterator RanIt,
std::indirectly_unary_invocable<RanIt> F>
decltype(auto) select(RanIt begin, RanIt end, F&& f)
{
// ...
}
Before C++20, we can also use SFINAE:
template <typename RanIt, typename F>
std::enable_if_t<
std::is_base_of_v<
std::random_access_iterator_tag,
typename std::iterator_traits<RanIt>::iterator_category
>,
std::invoke_result_t<F, typename std::iterator_traits<RanIt>::value_type>
> select(RanIt begin, RanIt end, F&& f)
{
// ...
}
I wanted to add some things here, adding to the previous answers.
First and foremost, you are not writing the code only for yourself. As a non-native English speaker, I understand why it may seem easier to write code in your native language, but don't do it!
Secondly, I have made changes to the code in order to make it easier and more readable:
#include <time.h>
#include <cstdint>
#include <cstdlib> // for srand/rand
#include <iostream>
#include <string>
constexpr uint32_t NUM_OF_POSSIBLE_PROMPTS = 4;
int main () {
srand(time(NULL)); // seed according to current time
for (uint32_t i = 0; i < 10; ++i)
{
int option = rand() % (NUM_OF_POSSIBLE_PROMPTS);
std::string prompt = "";
switch (option)
{
case 0:
prompt = "option 0";
break;
case 1:
prompt = "option 1";
break;
case 2:
prompt = "option 2";
break;
case 3:
prompt = "option 3";
break;
default:
// some error handling!
break;
}
std::cout << prompt << std::endl;
}
return 0;
}
I'm using switch-case instead of if-else-if-else, which is more readable (and often at least as efficient)!
I'm using constexpr in order to store my hard-coded number - it is a bad habit having hardcoded numbers in the code (in a real program, I would make a constexpr value of 10 as well for the loop boundaries).
In C++ (unlike in C), we use std::cout and its operator << in order to print, rather than the printf function. It gives unified behavior with other types of streams, such as stringstream (which is helpful when trying to construct strings at run time, though it is somewhat heavy on resources!).
Since this code is more organized, it is easier for you to understand where a possible error may occur, and it also makes such errors less likely to happen in the first place.
For example, using gcc's -Wswitch-enum flag will make sure that if you use an enum, all values must be handled in switch-case sections (which makes your program less error prone, of course).
P.S. I have added the loop only to show that this code produces different results every round; you can also test this by running the program multiple times.
You did not provide a seed for the random number generator. From the man page,
If no seed value is provided, the functions are automatically seeded
with a value of 1.
If you have the same seed on every run, you'll always get the same random sequence.
A couple of problems with your program:
You don't have a seed, that's the reason why the numbers are repeated. Use
srand (time(NULL)); // #include <time.h>
before you use rand()
Your random numbers are not sequential: you have 0-2 and then 4, so when you get 3 there is no option available. If that's on purpose, ignore this remark.
With rand() % 3 + 1, your random numbers will range from 1 to 3, so opcao == 0 and opcao == 4 will never occur. For a 0-4 interval you will need something like:
opcao = rand() % 5;

Small sized binary searches on CUDA GPUs

I have a large device array inputValues of int64_t type. Every 32 elements of this array are sorted in an ascending order. I have an unsorted search array removeValues.
My intention is to look for all the elements in removeValues inside inputValues and mark them as -1. What is the most efficient method to achieve this? I am using a compute capability 3.5 CUDA device, if that helps.
I am not looking for a higher level solution, i.e. I do not want to use thrust or cub, but I want to write this using cuda kernels.
My initial approach was to load every 32 values in shared memory in a thread block. Every thread also loads a single value from removeValues and does an independent binary search on the shared memory array. If found, the value is set accordingly using an if condition.
Wouldn't this approach involve a lot of bank conflicts and branch divergence? Do you think that branch divergence can be addressed by using ternary operators while implementing the binary search? Even if that is solved, how can bank conflict be eliminated? Since the size of sorted arrays is 32, would it be possible to implement a binary search using shuffle instructions? Would that help?
EDIT : I have added an example to show what I intend to achieve.
Let's say that inputValues is a vector where every 32 elements are sorted:
[2, 4, 6, ... , 64], [95, 97, ... , 157], [1, 3, ... , 63], [...]
The typical size for this array can range from 32*2 to 32*32. The values could range from 0 to INT64_MAX.
An example of removeValues would be:
[7, 75, 95, 106]
The typical size for this array could range from 1 to 1024.
After the operation removeValues would be:
[-1, 75, -1, 106]
The values in inputValues remain unchanged.
I would concur with the answer (now deleted) and comment by #harrism. Since I put some effort into the non-thrust approach, I'll present my findings.
I tried to naively implement a binary search at the warp-level using __shfl(), and then repeat that binary search across the data set, passing the data set through each 32-element group.
It's embarrassing, but my code is around 20x slower than thrust (in fact it may be worse than that if you do careful timing with nvprof).
I made the data sizes a little larger than what was proposed in the question, because the data sizes in the question are so small that the timing is in the dust.
Here's a fully worked example of 2 approaches:
What is approximately outlined in the question, i.e. create a binary search using warp shuffle that can search up to 32 elements against a 32-element ordered array. Repeat this process for as many 32-element ordered arrays as there are, passing the entire data set through each ordered array (hopefully you can start to see some of the inefficiency now.)
Use thrust, essentially the same as what is outlined by #harrism, i.e. sort the grouped data set, and then run a vectorized thrust::binary_search on that.
Here's the example:
$ cat t1030.cu
#include <stdio.h>
#include <assert.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/binary_search.h>
typedef long mytype;
const int gsize = 32;
const int nGRP = 512;
const int dsize = nGRP*gsize;//gsize*nGRP;
#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL
unsigned long long dtime_usec(unsigned long long start){
timeval tv;
gettimeofday(&tv, 0);
return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}
template <typename T>
__device__ T my_shfl32(T val, unsigned lane){
return __shfl(val, lane);
}
template <typename T>
__device__ T my_shfl64(T val, unsigned lane){
T retval = val;
int2 t1 = *(reinterpret_cast<int2 *>(&retval));
t1.x = __shfl(t1.x, lane);
t1.y = __shfl(t1.y, lane);
retval = *(reinterpret_cast<T *>(&t1));
return retval;
}
template <typename T>
__device__ bool bsearch_shfl(T grp_val, T my_val){
int src_lane = gsize>>1;
bool return_val = false;
T test_val;
int shift = gsize>>2;
for (int i = 0; i <= gsize>>3; i++){
if (sizeof(T)==4){
test_val = my_shfl32(grp_val, src_lane);}
else if (sizeof(T)==8){
test_val = my_shfl64(grp_val, src_lane);}
else assert(0);
if (test_val == my_val) return_val = true;
src_lane += (((test_val<my_val)*2)-1)*shift;
shift>>=1;
assert ((src_lane < gsize)&&(src_lane > 0));}
if (sizeof(T)==4){
test_val = my_shfl32(grp_val, 0);}
else if (sizeof(T)==8){
test_val = my_shfl64(grp_val, 0);}
else assert(0);
if (test_val == my_val) return_val = true;
return return_val;
}
template <typename T>
__global__ void bsearch_grp(const T * __restrict__ search_grps, T *data){
int idx = threadIdx.x+blockDim.x*blockIdx.x;
int tid = threadIdx.x;
if (idx < gsize*nGRP){
T grp_val = search_grps[idx];
while (tid < dsize){
T my_val = data[tid];
if (bsearch_shfl(grp_val, my_val)) data[tid] = -1;
tid += blockDim.x;}
}
}
int main(){
// data setup
assert(gsize == 32); //mandatory (warp size)
assert((dsize % 32)==0); //needed to preserve shfl capability
thrust::host_vector<mytype> grps(gsize*nGRP);
thrust::host_vector<mytype> data(dsize);
thrust::host_vector<mytype> result(dsize);
for (int i = 0; i < gsize*nGRP; i++) grps[i] = i;
for (int i = 0; i < dsize; i++) data[i] = i;
// method 1: individual shfl-based binary searches on each group
mytype *d_grps, *d_data;
cudaMalloc(&d_grps, gsize*nGRP*sizeof(mytype));
cudaMalloc(&d_data, dsize*sizeof(mytype));
cudaMemcpy(d_grps, &(grps[0]), gsize*nGRP*sizeof(mytype), cudaMemcpyHostToDevice);
cudaMemcpy(d_data, &(data[0]), dsize*sizeof(mytype), cudaMemcpyHostToDevice);
unsigned long long my_time = dtime_usec(0);
bsearch_grp<<<nGRP, gsize>>>(d_grps, d_data);
cudaDeviceSynchronize();
my_time = dtime_usec(my_time);
cudaMemcpy(&(result[0]), d_data, dsize*sizeof(mytype), cudaMemcpyDeviceToHost);
for (int i = 0; i < dsize; i++) if (result[i] != -1) {printf("method 1 mismatch at %d, was %d, should be -1\n", i, (int)(result[i])); return 1;}
printf("method 1 time: %fs\n", my_time/(float)USECPSEC);
// method 2: thrust sort, followed by thrust binary search
thrust::device_vector<mytype> t_grps = grps;
thrust::device_vector<mytype> t_data = data;
thrust::device_vector<bool> t_rslt(t_data.size());
my_time = dtime_usec(0);
thrust::sort(t_grps.begin(), t_grps.end());
thrust::binary_search(t_grps.begin(), t_grps.end(), t_data.begin(), t_data.end(), t_rslt.begin());
cudaDeviceSynchronize();
my_time = dtime_usec(my_time);
thrust::host_vector<bool> rslt = t_rslt;
for (int i = 0; i < dsize; i++) if (rslt[i] != true) {printf("method 2 mismatch at %d, was %d, should be 1\n", i, (int)(rslt[i])); return 1;}
printf("method 2 time: %fs\n", my_time/(float)USECPSEC);
// method 3: multiple thrust merges, followed by thrust binary search
return 0;
}
$ nvcc -O3 -arch=sm_35 t1030.cu -o t1030
$ ./t1030
method 1 time: 0.009075s
method 2 time: 0.000516s
$
I was running this on Linux, CUDA 7.5, with a GT640 GPU. Obviously the performance will be different on different GPUs, but I'd be surprised if any GPU significantly closed the gap.
In short, you'd be well advised to use a well-tuned library like thrust or cub. If you don't like the monolithic nature of thrust, you could try cub. I don't know if cub has a binary search, but a single binary search against the whole sorted data set is not a difficult thing to write, and it's the smaller part of the time involved (for method 2 -- identifiable using nvprof or additional timing code).
Since your 32-element grouped ranges are already sorted, I also pondered the idea of using multiple thrust::merge operations rather than a single sort. I'm not sure which would be faster, but since the thrust method is already so much faster than the 32-element shuffle search method, I think thrust (or cub) is the obvious choice.
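For completeness, here is a rough sketch of what that "method 3" (pairwise thrust::merge of the already-sorted 32-element groups, doubling the run length each pass) could look like. It is untimed and unverified here, and each thrust::merge is a separate kernel launch, so it may well still lose to the single thrust::sort:

#include <algorithm>
#include <thrust/device_vector.h>
#include <thrust/merge.h>

// Merge already-sorted runs of length run0 (32 in the question) pairwise,
// doubling the run length each pass, until the whole vector is one sorted range.
template <typename T>
void merge_sorted_groups(thrust::device_vector<T>& d, int run0)
{
    thrust::device_vector<T> tmp(d.size());
    const int n = static_cast<int>(d.size());
    for (int run = run0; run < n; run *= 2) {
        for (int base = 0; base < n; base += 2 * run) {
            const int mid = std::min(base + run,     n);
            const int end = std::min(base + 2 * run, n);
            thrust::merge(d.begin() + base, d.begin() + mid,
                          d.begin() + mid,  d.begin() + end,
                          tmp.begin() + base);
        }
        d.swap(tmp);    // merged runs of length 2*run are now in d
    }
}

One would call this in place of the thrust::sort in method 2 above, with run0 = gsize, and then run the thrust::binary_search exactly as before.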

How do you find out what parts of code are creating the most virtual memory?

I have a program that starts up and within about 5 minutes the virtual size of the process is about 13 GB. It runs on Linux, uses Boost, the GNU C++ library, and various other 3rd-party libraries.
After 5 minutes the virtual size stays at 13 GB and the RSS stays steady at around 5 GB.
I can't just run it in a debugger because at startup about 30 threads are started, each of which starts running its own code, that does various allocations. So stepping through and checking virtual memory at different parts of code at each breakpoint is not feasible.
I thought of changing program to start each thread one at a time to make it easier to track allocation of memory, but before doing this are there any good tools?
Valgrind is fairly slow, maybe tcmalloc could provide the info?
I would use valgrind (perhaps run it an entire night) or else use Boehm GC.
Alternatively, use the proc(5) filesystem to understand (e.g. through /proc/$pid/statm & /proc/$pid/maps) when a lot of memory gets allocated.
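For example, a small Linux-only helper (a sketch; log_mem is a hypothetical name) that reads /proc/self/statm and can be dropped at interesting checkpoints to bracket which phase of startup allocates the memory:

#include <cstdio>
#include <unistd.h>

// Print the current virtual size and resident set size, read from /proc/self/statm (in pages).
void log_mem(const char* tag)
{
    long vsize_pages = 0, rss_pages = 0;
    if (std::FILE* f = std::fopen("/proc/self/statm", "r")) {
        std::fscanf(f, "%ld %ld", &vsize_pages, &rss_pages);
        std::fclose(f);
    }
    const long page_kb = sysconf(_SC_PAGESIZE) / 1024;
    std::fprintf(stderr, "[%s] vsize=%ld MB rss=%ld MB\n",
                 tag, vsize_pages * page_kb / 1024, rss_pages * page_kb / 1024);
}

int main()
{
    log_mem("startup");
    // ... start threads / allocate here ...
    log_mem("after startup");
    return 0;
}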
The most important thing is to find memory leaks. If the memory doesn't grow after startup, it is less of an issue.
Perhaps adding instance counters to each class might help (use atomic integers or mutexes to serialize them).
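A sketch of that instance-counter idea, using a CRTP base with an atomic counter (the names counted and Widget are hypothetical):

#include <atomic>
#include <cstdio>

// Each class to be tracked derives from counted<Itself>; the counter is atomic
// so concurrently running threads can update it safely.
template <class Derived>
struct counted {
    static std::atomic<long> live;
    counted()               { ++live; }
    counted(const counted&) { ++live; }
    ~counted()              { --live; }
};
template <class Derived> std::atomic<long> counted<Derived>::live{0};

struct Widget : counted<Widget> { /* real members here */ };

int main()
{
    Widget a, b;
    std::printf("live Widgets: %ld\n", counted<Widget>::live.load()); // prints 2
    return 0;
}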
If the program's source code is big (e.g. a million source lines) so that spending several days/weeks is worth the effort, perhaps customizing the GCC compiler (e.g. with MELT) might be relevant.
a std::set minibenchmark
You mentioned a big std::set based upon a million rows.
#include <set>
#include <string>
#include <string.h>
#include <cstdio>
#include <cstdlib>
#include <unistd.h>
#include <time.h>
class MyElem
{
int _n;
char _s[16-sizeof(_n)];
public:
MyElem(int k) : _n(k)
{
snprintf (_s, sizeof(_s), "%d", k);
};
~MyElem()
{
_n=0;
memset(_s, 0, sizeof(_s));
};
int n() const
{
return _n;
};
std::string str() const
{
return std::string(_s);
};
bool less(const MyElem&x) const
{
return _n < x._n;
};
};
bool operator < (const MyElem& l, const MyElem& r)
{
return l.less(r);
}
typedef std::set<MyElem> MySet;
void bench (int cnt, MySet& set)
{
for (long i=0; i<(long)cnt*1024; i++)
set.insert(MyElem(i));
time_t now = 0;
time (&now);
set.insert (((now) & 0xfffffff) * 100);
}
int main (int argc, char** argv)
{
MySet s;
clock_t cstart, cend;
int c = argc>1?atoi(argv[1]):256;
if (c<16) c=16;
printf ("c=%d Kiter\n", c);
cstart = clock();
bench (c, s);
cend = clock();
int x = getpid();
char cmdbuf[64];
snprintf(cmdbuf, sizeof(cmdbuf), "pmap %d", x);
printf ("running %s\n", cmdbuf);
fflush (NULL);
system(cmdbuf);
putchar('\n');
printf ("at end c=%d Kiter clockdiff=%.2f millisec = %.f µs/Kiter\n",
c, (cend-cstart)*1.0e-3, (double)(cend-cstart)/c);
if (s.find(x) != s.end())
printf("set has %d\n", x);
else
printf("set don't contain %d\n", x);
return 0;
}
Notice the 16-byte sizeof(MyElem). On Debian/Sid/AMD64 with GCC 4.8.1 (Intel i3770K processor, 16 GB RAM), compiling that bench with g++ -Wall -O1 tset.cc -o ./tset-01 and running it with 32768 thousand iterations (so 32M elements) gives:
total 2109592K
(last line above given by pmap)
at end c=32768 Kiter clockdiff=16470.00 millisec = 503 µs/Kiter
Then the timing reported by my zsh:
./tset-01 32768 16.77s user 0.54s system 99% cpu 17.343 total
This is about 2.1 GB, so perhaps 64.3 bytes per element including set overhead (since sizeof(MyElem)==16, the set seems to have a non-negligible cost of perhaps 6 words per element).

Can I make this C++ code faster without making it much more complex?

Here's a problem I've solved from a programming problem website (codechef.com, in case anyone doesn't want to see this solution before trying it themselves). My solution took about 5.43 seconds on the test data; others have solved the same problem with the same test data in 0.14 seconds, but with much more complex code. Can anyone point out specific areas of my code where I am losing performance? I'm still learning C++, so I know there are a million ways I could solve this problem, but I'd like to know if I can improve my own solution with some subtle changes rather than rewrite the whole thing. Or, if there are any relatively simple solutions of comparable length that would perform better than mine, I'd be interested to see them as well.
Please keep in mind I'm learning C++ so my goal here is to improve the code I understand, not just to be given a perfect solution.
Thanks
Problem:
The purpose of this problem is to verify whether the method you are using to read input data is sufficiently fast to handle problems branded with the enormous Input/Output warning. You are expected to be able to process at least 2.5MB of input data per second at runtime. Time limit to process the test data is 8 seconds.
The input begins with two positive integers n k (n, k<=10^7). The next n lines of input contain one positive integer ti, not greater than 10^9, each.
Output
Write a single integer to output, denoting how many integers ti are divisible by k.
Example
Input:
7 3
1
51
966369
7
9
999996
11
Output:
4
Solution:
#include <iostream>
#include <stdio.h>
using namespace std;
int main(){
//n is number of integers to perform calculation on
//k is the divisor
//inputnum is the number to be divided by k
//total is the total number of inputnums divisible by k
int n,k,inputnum,total;
//initialize total to zero
total=0;
//read in n and k from stdin
scanf("%i%i",&n,&k);
//loop n times and if k divides into n, increment total
for (n; n>0; n--)
{
scanf("%i",&inputnum);
if(inputnum % k==0) total += 1;
}
//output value of total
printf("%i",total);
return 0;
}
The speed is not being determined by the computation; most of the time the program takes to run is consumed by I/O.
Add setvbuf calls before the first scanf for a significant improvement:
setvbuf(stdin, NULL, _IOFBF, 32768);
setvbuf(stdout, NULL, _IOFBF, 32768);
-- edit --
The alleged magic numbers are the new buffer size. By default, FILE uses a buffer of 512 bytes. Increasing this size decreases the number of times that the C++ runtime library has to issue a read or write call to the operating system, which is by far the most expensive operation in your algorithm.
By keeping the buffer size a multiple of 512, that eliminates buffer fragmentation. Whether the size should be 1024*10 or 1024*1024 depends on the system it is intended to run on. For 16 bit systems, a buffer size larger than 32K or 64K generally causes difficulty in allocating the buffer, and maybe managing it. For any larger system, make it as large as useful—depending on available memory and what else it will be competing against.
Lacking any known memory contention, choose sizes for the buffers at about the size of the associated files. That is, if the input file is 250K, use that as the buffer size. There is definitely a diminishing return as the buffer size increases. For the 250K example, a 100K buffer would require three reads, while a default 512 byte buffer requires 500 reads. Further increasing the buffer size so only one read is needed is unlikely to make a significant performance improvement over three reads.
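For illustration, here is how the two calls could slot into the original program (a sketch; the buffers are supplied explicitly here, but passing NULL as in the snippet above and letting the library allocate them works just as well):

#include <cstdio>

int main()
{
    static char inbuf[1 << 15], outbuf[1 << 15];   // 32768-byte buffers, as above
    std::setvbuf(stdin,  inbuf,  _IOFBF, sizeof inbuf);
    std::setvbuf(stdout, outbuf, _IOFBF, sizeof outbuf);
    int n, k, inputnum, total = 0;
    std::scanf("%i%i", &n, &k);
    while (n-- > 0) {
        std::scanf("%i", &inputnum);
        if (inputnum % k == 0) ++total;
    }
    std::printf("%i\n", total);
    return 0;
}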
I tested the following on 28311552 lines of input. It's 10 times faster than your code. What it does is read a large block at once, then read on up to the next newline to complete any partially read line. The goal here is to reduce I/O costs, since scanf() is reading a character at a time. Even with stdio, the buffer is likely too small.
Once the block is ready, I parse the numbers directly in memory.
This isn't the most elegant of codes, and I might have some edge cases a bit off, but it's enough to get you going with a faster approach.
Here are the timings (without the optimizer my solution is only about 6-7 times faster than your original reference)
[xavier:~/tmp] dalke% g++ -O3 my_solution.cpp
[xavier:~/tmp] dalke% time ./a.out < c.dat
15728647
0.284u 0.057s 0:00.39 84.6% 0+0k 0+1io 0pf+0w
[xavier:~/tmp] dalke% g++ -O3 your_solution.cpp
[xavier:~/tmp] dalke% time ./a.out < c.dat
15728647
3.585u 0.087s 0:03.72 98.3% 0+0k 0+0io 0pf+0w
Here's the code.
#include <iostream>
#include <stdio.h>
using namespace std;
const int BUFFER_SIZE=400000;
const int EXTRA=30; // well over the size of an integer
void read_to_newline(char *buffer) {
int c;
while (1) {
c = getc_unlocked(stdin);
if (c == '\n' || c == EOF) {
*buffer = '\0';
return;
}
*buffer++ = c;
}
}
int main() {
char buffer[BUFFER_SIZE+EXTRA];
char *end_buffer;
char *startptr, *endptr;
//n is number of integers to perform calculation on
//k is the divisor
//inputnum is the number to be divided by k
//total is the total number of inputnums divisible by k
int n,k,inputnum,total,nbytes;
//initialize total to zero
total=0;
//read in n and k from stdin
read_to_newline(buffer);
sscanf(buffer, "%i%i",&n,&k);
while (1) {
// Read a large block of values
// There should be one integer per line, with nothing else.
// This might truncate an integer!
nbytes = fread(buffer, 1, BUFFER_SIZE, stdin);
if (nbytes == 0) {
cerr << "Reached end of file too early" << endl;
break;
}
// Make sure I read to the next newline.
read_to_newline(buffer+nbytes);
startptr = buffer;
while (n>0) {
inputnum = 0;
// I had used strtol but that was too slow
// inputnum = strtol(startptr, &endptr, 10);
// Instead, parse the integers myself.
endptr = startptr;
while (*endptr >= '0') {
inputnum = inputnum * 10 + *endptr - '0';
endptr++;
}
// *endptr might be a '\n' or '\0'
// Might occur with the last field
if (startptr == endptr) {
break;
}
// skip the newline; go to the
// first digit of the next number.
if (*endptr == '\n') {
endptr++;
}
// Test if this is a factor
if (inputnum % k==0) total += 1;
// Advance to the next number
startptr = endptr;
// Reduce the count by one
n--;
}
// Either we are done, or we need new data
if (n==0) {
break;
}
}
// output value of total
printf("%i\n",total);
return 0;
}
Oh, and it very much assumes the input data is in the right format.
Try to replace the if statement with total += ((inputnum % k) == 0);. That might help a little bit.
But I think you really need to buffer your input into a temporary array. Reading one integer from the input at a time is expensive. If you can separate data acquisition and data processing, the compiler may be able to generate optimized code for the mathematical operations.
The I/O operations are the bottleneck. Try to limit them whenever you can, for instance by loading all data into a buffer or array with a buffered stream in one step.
Although your example is so simple that I hardly see what you can eliminate - assuming it's part of the question to do subsequent reading from stdin.
A few comments on the code: your example doesn't make use of any streams, so there is no need to include the iostream header. You already load the C library elements into the global namespace by including stdio.h instead of the C++ version of the header, cstdio, so using namespace std is not necessary.
You can read each line with gets(), and parse the strings yourself without scanf(). (Normally I wouldn't recommend gets(), but in this case, the input is well-specified.)
A sample C program to solve this problem:
#include <stdio.h>
int main() {
int n,k,in,tot=0,i;
char s[1024];
gets(s);
sscanf(s,"%d %d",&n,&k);
while(n--) {
gets(s);
in=s[0]-'0';
for(i=1; s[i]!=0; i++) {
in=in*10 + s[i]-'0'; /* For each digit read, multiply the previous
value of in with 10 and add the current digit */
}
tot += in%k==0; /* returns 1 if in%k is 0, 0 otherwise */
}
printf("%d\n",tot);
return 0;
}
This program is approximately 2.6 times faster than the solution you gave above (on my machine).
You could try to read input line by line and use atoi() for each input row. This should be a little bit faster than scanf, because you remove the "scan" overhead of the format string.
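A sketch of that line-by-line variant (untested, and it assumes well-formed input as in the problem statement):

#include <cstdio>
#include <cstdlib>

int main()
{
    char line[64];
    int n, k, total = 0;
    if (!std::fgets(line, sizeof line, stdin)) return 1;
    std::sscanf(line, "%d %d", &n, &k);
    while (n-- > 0 && std::fgets(line, sizeof line, stdin))
        if (std::atoi(line) % k == 0) ++total;   // atoi stops at the newline
    std::printf("%d\n", total);
    return 0;
}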
I think the code is fine. I ran it on my computer in less than 0.3s.
I even ran it on much larger inputs in less than a second.
How are you timing it?
One small thing you could do is remove the if statement.
Start with total = n and then, inside the loop, subtract one for every value that is not divisible:
total -= (inputnum % k != 0); // 0 if divisible, 1 if not
Though I doubt CodeChef will accept it, one possibility is to use multiple threads, one to handle the I/O, and another to process the data. This is especially effective on a multi-core processor, but can help even with a single core. For example, on Windows you could use code like this (no real attempt at conforming with CodeChef requirements -- I doubt they'll accept it with the timing data in the output):
#include <windows.h>
#include <process.h>
#include <iostream>
#include <time.h>
#include "queue.hpp"
namespace jvc = JVC_thread_queue;
struct buffer {
static const int initial_size = 1024 * 1024;
char buf[initial_size];
size_t size;
buffer() : size(initial_size) {}
};
jvc::queue<buffer *> outputs;
void read(HANDLE file) {
// read data from specified file, put into buffers for processing.
//
char temp[32];
int temp_len = 0;
int i;
buffer *b;
DWORD read;
do {
b = new buffer;
// If we have a partial line from the previous buffer, copy it into this one.
if (temp_len != 0)
memcpy(b->buf, temp, temp_len);
// Then fill the buffer with data.
ReadFile(file, b->buf+temp_len, b->size-temp_len, &read, NULL);
// Look for partial line at end of buffer.
for (i=read; b->buf[i] != '\n'; --i)
;
// copy partial line to holding area.
memcpy(temp, b->buf+i, temp_len=read-i);
// adjust size.
b->size = i;
// put buffer into queue for processing thread.
// transfers ownership.
outputs.add(b);
} while (read != 0);
}
// A simplified istrstream that can only read int's.
class num_reader {
buffer &b;
char *pos;
char *end;
public:
num_reader(buffer *buf) : b(*buf), pos(b.buf), end(pos+b.size) {}
num_reader &operator>>(int &value){
int v = 0;
// skip leading "stuff" up to the first digit.
while ((pos < end) && !isdigit(*pos))
++pos;
// read digits, create value from them.
while ((pos < end) && isdigit(*pos)) {
v = 10 * v + *pos-'0';
++pos;
}
value = v;
return *this;
}
// return stream status -- only whether we're at end
operator bool() { return pos < end; }
};
int result;
unsigned __stdcall processing_thread(void *) {
int value;
int n, k;
int count = 0;
// Read first buffer: n & k followed by values.
buffer *b = outputs.pop();
num_reader input(b);
input >> n;
input >> k;
while (input >> value && ++count < n)
result += ((value %k ) == 0);
// Ownership was transferred -- delete buffer when finished.
delete b;
// Then read subsequent buffers:
while ((b=outputs.pop()) && (b->size != 0)) {
num_reader input(b);
while (input >> value && ++count < n)
result += ((value %k) == 0);
// Ownership was transferred -- delete buffer when finished.
delete b;
}
return 0;
}
int main() {
HANDLE standard_input = GetStdHandle(STD_INPUT_HANDLE);
HANDLE processor = (HANDLE)_beginthreadex(NULL, 0, processing_thread, NULL, 0, NULL);
clock_t start = clock();
read(standard_input);
WaitForSingleObject(processor, INFINITE);
clock_t finish = clock();
std::cout << (float)(finish-start)/CLOCKS_PER_SEC << " Seconds.\n";
std::cout << result;
return 0;
}
This uses a thread-safe queue class I wrote years ago:
#ifndef QUEUE_H_INCLUDED
#define QUEUE_H_INCLUDED
namespace JVC_thread_queue {
template<class T, unsigned max = 256>
class queue {
HANDLE space_avail; // at least one slot empty
HANDLE data_avail; // at least one slot full
CRITICAL_SECTION mutex; // protect buffer, in_pos, out_pos
T buffer[max];
long in_pos, out_pos;
public:
queue() : in_pos(0), out_pos(0) {
space_avail = CreateSemaphore(NULL, max, max, NULL);
data_avail = CreateSemaphore(NULL, 0, max, NULL);
InitializeCriticalSection(&mutex);
}
void add(T data) {
WaitForSingleObject(space_avail, INFINITE);
EnterCriticalSection(&mutex);
buffer[in_pos] = data;
in_pos = (in_pos + 1) % max;
LeaveCriticalSection(&mutex);
ReleaseSemaphore(data_avail, 1, NULL);
}
T pop() {
WaitForSingleObject(data_avail,INFINITE);
EnterCriticalSection(&mutex);
T retval = buffer[out_pos];
out_pos = (out_pos + 1) % max;
LeaveCriticalSection(&mutex);
ReleaseSemaphore(space_avail, 1, NULL);
return retval;
}
~queue() {
DeleteCriticalSection(&mutex);
CloseHandle(data_avail);
CloseHandle(space_avail);
}
};
}
#endif
Exactly how much you gain from this depends on the amount of time spent reading versus the amount of time spent on other processing. In this case, the other processing is sufficiently trivial that it probably doesn't gain much. If more time was spent on processing the data, multi-threading would probably gain more.
2.5 MB/sec is 400 ns/byte.
There are two big per-byte processes, file input and parsing.
For the file input, I would just load it into a big memory buffer. fread should be able to read that in at roughly full disk bandwidth.
For the parsing, sscanf is built for generality, not speed. atoi should be pretty fast. My habit, for better or worse, is to do it myself, as in:
#define DIGIT(c)((c)>='0' && (c) <= '9')
bool parsInt(char* &p, int& num){
while(*p && *p <= ' ') p++; // scan over whitespace
if (!DIGIT(*p)) return false;
num = 0;
while(DIGIT(*p)){
num = num * 10 + (*p++ - '0');
}
return true;
}
The loops, first over leading whitespace, then over the digits, should be nearly as fast as the machine can go, certainly a lot less than 400ns/byte.
Dividing two large numbers is hard. Perhaps an improvement would be to first characterize k a little by looking at some of the smaller primes. Let's say 2, 3, and 5 for now. If k is divisible by any of these, then inputnum also needs to be, or else inputnum is not divisible by k. Of course there are more tricks to play (you could use a bitwise AND of inputnum with 1 to determine whether it is divisible by 2), but I think just removing the low-prime possibilities will give a reasonable speed improvement (worth a shot anyway).
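A sketch of that pre-screening idea (the values of k and inputnum here are hypothetical, and whether this actually beats a single % k is speculative and worth measuring):

#include <cstdio>

int main()
{
    int k = 30, inputnum = 12345;              // hypothetical values for illustration
    // Characterize k once, up front.
    const bool k_has2 = (k % 2 == 0), k_has3 = (k % 3 == 0), k_has5 = (k % 5 == 0);
    bool divisible;
    if ((k_has2 && (inputnum & 1)) ||          // k is even but inputnum is odd
        (k_has3 && inputnum % 3 != 0) ||
        (k_has5 && inputnum % 5 != 0))
        divisible = false;                     // cheap rejection: skip the % k
    else
        divisible = (inputnum % k == 0);       // fall back to the full division
    std::printf("%s\n", divisible ? "divisible" : "not divisible");
    return 0;
}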