How could I parallelize the following loop USING C++ AMP?

I have the following loop in c++
dword result = 0;
for ( int i = 0; i < 16; i++ ) {
result |= ( value[i] << (unsigned int)( i << 1 ) );
}
And I would like to parallelize it in AMP. I know it might run slower than the non-parallelized version above, but I want to do it to learn more about AMP.
My idea was to loop through the value array in parallel
and fill a new array with newarray[0] = value[0] << (unsigned int)(0 << 1 ), newarray[1] = value[1] << (unsigned int)(1 << 1 ), etc. Then I would OR the values in the array in parallel in a tree structure (see image).
I have tried to put this idea into some simple C++ AMP code, but I haven't succeeded, so any help would be appreciated.
Thank you for your consideration of this matter, I look forward to a response.

The following code is part of what I think you need. It takes a number of elements as input and prepares the vector on the CPU, then performs the bit-shift operations in parallel on the GPU. I then set av[elements] back to 0, because I am using that extra element to store your final result. It's rough, but AMP is pretty restrictive about which data types can be processed on the GPU, so I just use an extra element of the existing array for it. After the bit shifting is done, a second parallel_for_each performs the bitwise OR. This one also runs on the GPU, but it is less satisfactory: every thread ORs its element into the same av[elements] slot, so all the updates contend for one location and that creates a bottleneck. Your tree structure would make this part run much more quickly, but I was unable to figure out how to do that part easily. As it is, this program can process 100 million elements in a couple of seconds on a fairly old computer. Apologies in advance for any best-practice violations in the code; I am a novice as well. The code follows:
#include <conio.h>
#include <amp.h>
#include <iostream>
#include <vector>
using namespace concurrency;
using namespace std;
unsigned int doParallel(unsigned int);
unsigned int elements;
int main()
{
int ch=0;
cout<<"\nHow many elements to populate: ";
cin>>elements;
cout<<"The result is: "<<doParallel(elements);
cout<<"\nPress 'X' to exit.";
do
{
ch=_getch();
} while (ch!='X' && ch!='x');
return 0;
}
unsigned int doParallel(unsigned int elements)
{
vector<unsigned int> v(elements+1);
for (unsigned int i = 0; i<elements+1;i++)
{
v[i]=i;
}
array_view<unsigned int,1> av(elements+1,v);
parallel_for_each(av.extent,[=](index<1> idx)
restrict(amp)
{
av[idx] = static_cast<unsigned int>(av[idx])<<1;
});
av[elements]=0;
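parallel_for_each(av.extent,[=](index<1> idx)
restrict(amp)
{
// Every thread updates the same element, so the OR has to be atomic;
// a plain |= here lets concurrent updates overwrite each other.
atomic_fetch_or(&av[elements], av[idx]);
});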
return av[elements];
}
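
For reference, the tree-shaped OR from the question can be sketched in plain C++ as below. This is only a CPU illustration of the access pattern, not AMP code: the function names are made up for this sketch, and the assumption is that in AMP each pass of the outer loop would become its own parallel_for_each, since the pairs combined within one pass touch disjoint slots.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Step 1: shift every element independently. Each iteration is independent,
// so this is the part that maps directly onto one parallel loop.
std::vector<std::uint32_t> shift_values(const std::vector<std::uint32_t>& value) {
    assert(value.size() <= 16);  // keeps the shift amount below 32 bits
    std::vector<std::uint32_t> shifted(value.size());
    for (std::size_t i = 0; i < value.size(); ++i)
        shifted[i] = value[i] << (i << 1);  // same shift as the original loop
    return shifted;
}

// Step 2: tree reduction. In each pass, slot i absorbs slot i + stride,
// and the stride doubles, so 16 elements need only 4 passes.
std::uint32_t or_reduce_tree(std::vector<std::uint32_t> data) {
    if (data.empty()) return 0;
    for (std::size_t stride = 1; stride < data.size(); stride *= 2) {
        // The pairs handled in this inner loop do not overlap,
        // so they could all be processed at the same time.
        for (std::size_t i = 0; i + stride < data.size(); i += 2 * stride)
            data[i] |= data[i + stride];
    }
    return data[0];
}
```

Calling or_reduce_tree(shift_values(value)) gives the same result as the sequential loop in the question, and the reduction also works when the element count is not a power of two.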

Related

What is the fastest implementation for accessing and changing a long array of boolean?

I want to implement a very long boolean array (as a binary genome) and access some intervals to check if that interval is all true or not, and in addition I want to change some intervals value,
For example, I can create 4 representations:
boolean binaryGenome1[10e6]={false};
vector<bool> binaryGenome2; binaryGenome2.resize(10e6);
vector<char> binaryGenome3; binaryGenome3.resize(10e6);
bitset<10e6> binaryGenome4;
and access this way:
inline bool checkBinGenome(long long start , long long end){
for(long long i = start; i < end+1 ; i++)
if(binaryGenome[i] == false)
return false;
return true;
}
inline void changeBinGenome(long long start , long long end){
for(long long i = start; i < end+1 ; i++)
binaryGenome[i] = true;
}
vector<char> and a plain boolean array (as each stores every boolean in a byte) both seem to be poor choices, as I need to be efficient in space. But what are the differences between vector<bool> and bitset?
Somewhere else I read that vector has some overhead as you can choose its size and compile time - "overhead" for what - accessing? And how much is that overhead?
As I want to access array elements many times using CheckBinGenome() and changeBinGenome(), what is the fastest implementation?

Use std::bitset. It's the best.

If the length of the data is known at compile time, consider std::array<bool> or std::bitset. The latter is likely to be more space-efficient (you'll have to measure whether the associated extra work in access times outweighs the speed gain from reducing cache pressure - that will depend on your workload).
If your array's length is not fixed, then you'll need a std::vector<bool> or std::vector<char>; there's also boost::dynamic_bitset but I've never used that.
If you will be changing large regions at once, as your sample implies, it may well be worth constructing your own representation and manipulating the underlying storage directly, rather than one bit at a time through the iterators. For example, if you use an array of char as the underlying representation, then setting a large range to 0 or 1 is mostly a memset() or std::fill() call, with computation only for the values at the start and end of the range. I'd start with a simple implementation and a good set of unit tests before trying anything like that.
It is (at least theoretically) possible that your Standard Library has specialized versions of algorithms for the iterators of std::vector<bool>, std::array<bool> and/or std::bitset that do exactly the above, or you may be able to write and contribute such specializations. That's a better path if possible - the world may thank you, and you'll have shared some of the maintenance responsibility.
Important note
If using std::vector<bool>, you do need to be aware that, unlike other std::vector<> instantiations, it does not implement the standard container semantics (it is a bit-packed specialization whose operator[] returns a proxy object rather than a bool&). That's not to say it shouldn't be used, but make sure you understand its foibles!
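
The idea of manipulating the underlying storage directly can be sketched as follows. This is a minimal illustration, not a tuned implementation: PackedBits and its members are hypothetical names, bits are packed into 64-bit words instead of the char array mentioned above, and ranges are inclusive to match checkBinGenome() and changeBinGenome() in the question. Only the first and last word of a range need mask arithmetic; everything in between is a single std::fill or a whole-word comparison.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper type: bits packed into 64-bit words.
struct PackedBits {
    std::vector<std::uint64_t> words;

    explicit PackedBits(std::size_t nbits) : words((nbits + 63) / 64, 0) {}

    // Mask with bits [lo, hi) of a single word set; requires lo < hi <= 64.
    static std::uint64_t mask(unsigned lo, unsigned hi) {
        std::uint64_t upto_hi = (hi == 64) ? ~0ULL : ((1ULL << hi) - 1);
        std::uint64_t upto_lo = (1ULL << lo) - 1;
        return upto_hi & ~upto_lo;
    }

    // Set every bit in the inclusive range [start, end] to true.
    void set_range(std::size_t start, std::size_t end) {
        std::size_t first = start / 64, last = end / 64;
        unsigned lo = start % 64, hi = end % 64 + 1;
        if (first == last) { words[first] |= mask(lo, hi); return; }
        words[first] |= mask(lo, 64);                        // partial first word
        std::fill(words.begin() + first + 1,
                  words.begin() + last, ~0ULL);              // whole words in between
        words[last] |= mask(0, hi);                          // partial last word
    }

    // True if every bit in the inclusive range [start, end] is set.
    bool all_set(std::size_t start, std::size_t end) const {
        std::size_t first = start / 64, last = end / 64;
        unsigned lo = start % 64, hi = end % 64 + 1;
        if (first == last) return (words[first] & mask(lo, hi)) == mask(lo, hi);
        if ((words[first] & mask(lo, 64)) != mask(lo, 64)) return false;
        for (std::size_t w = first + 1; w < last; ++w)
            if (words[w] != ~0ULL) return false;             // 64 bits per comparison
        return (words[last] & mask(0, hi)) == mask(0, hi);
    }
};
```

As the answer says, a representation like this is worth having only with a good set of unit tests around the word boundaries, and only if measurement shows the simple per-bit loop is too slow.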

For example, checking whether all the elements are true can be split across threads with OpenMP.
I am really not sure whether this will cost more in overhead than it gains in speedup. Modern CPUs can do this scan quite fast; are you really experiencing poor performance (or is this just a skeleton of your real problem)?
#include <omp.h>
#include <iostream>
#include <cstring>
using namespace std;
#define N 10000000
bool binaryGenome[N];
int main() {
memset(binaryGenome, true, sizeof(bool) * N);
int shouldBreak = 0;
bool result = true;
cout << result << endl;
binaryGenome[9999995] = false;
bool go = true;
uint give = 0;
#pragma omp parallel
{
uint start, stop;
#pragma omp critical
{
start = give;
give += N / omp_get_num_threads();
stop = give;
if (omp_get_thread_num() == omp_get_num_threads() - 1)
stop = N;
}
while (start < stop && go) {
if (!binaryGenome[start]) {
cout << start << endl;
go = false;
result = false;
}
++start;
}
}
cout << result << endl;
}

Small sized binary searches on CUDA GPUs

I have a large device array inputValues of int64_t type. Every 32 elements of this array are sorted in an ascending order. I have an unsorted search array removeValues.
My intention is to look for all the elements in removeValues inside inputValues and mark them as -1. What is the most efficient method to achieve this? I am using a compute capability 3.5 CUDA device, if that helps.
I am not looking for a higher level solution, i.e. I do not want to use thrust or cub, but I want to write this using cuda kernels.
My initial approach was to load every 32 values in shared memory in a thread block. Every thread also loads a single value from removeValues and does an independent binary search on the shared memory array. If found, the value is set accordingly by using an if condition.
Wouldn't this approach involve a lot of bank conflicts and branch divergence? Do you think that branch divergence can be addressed by using ternary operators while implementing the binary search? Even if that is solved, how can bank conflict be eliminated? Since the size of sorted arrays is 32, would it be possible to implement a binary search using shuffle instructions? Would that help?
EDIT : I have added an example to show what I intend to achieve.
Let's say that inputValues is a vector where every 32 elements are sorted:
[2, 4, 6, ... , 64], [95, 97, ... , 157], [1, 3, ... , 63], [...]
The typical size for this array can range between 32*2 to 32*32. The values could range from 0 to INT64_MAX.
An example of removeValues would be:
[7, 75, 95, 106]
The typical size for this array could range from 1 to 1024.
After the operation removeValues would be:
[-1, 75, -1, 106]
The values in inputValues remain unchanged.

I would concur with the answer (now deleted) and comment by @harrism. Since I put some effort into the non-thrust approach, I'll present my findings.
I tried to naively implement a binary search at the warp-level using __shfl(), and then repeat that binary search across the data set, passing the data set through each 32-element group.
It's embarrassing, but my code is around 20x slower than thrust (in fact it may be worse than that if you do careful timing with nvprof).
I made the data sizes a little larger than what was proposed in the question, because the data sizes in the question are so small that the timing is in the dust.
Here's a fully worked example of 2 approaches:
1. What is approximately outlined in the question, i.e. create a binary search using warp shuffle that can search up to 32 elements against a 32-element ordered array. Repeat this process for as many 32-element ordered arrays as there are, passing the entire data set through each ordered array (hopefully you can start to see some of the inefficiency now.)
2. Use thrust, essentially the same as what is outlined by @harrism, i.e. sort the grouped data set, and then run a vectorized thrust::binary_search on that.
Here's the example:
$ cat t1030.cu
#include <stdio.h>
#include <assert.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/binary_search.h>
typedef long mytype;
const int gsize = 32;
const int nGRP = 512;
const int dsize = nGRP*gsize;//gsize*nGRP;
#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL
unsigned long long dtime_usec(unsigned long long start){
timeval tv;
gettimeofday(&tv, 0);
return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}
template <typename T>
__device__ T my_shfl32(T val, unsigned lane){
return __shfl(val, lane);
}
template <typename T>
__device__ T my_shfl64(T val, unsigned lane){
T retval = val;
int2 t1 = *(reinterpret_cast<int2 *>(&retval));
t1.x = __shfl(t1.x, lane);
t1.y = __shfl(t1.y, lane);
retval = *(reinterpret_cast<T *>(&t1));
return retval;
}
template <typename T>
__device__ bool bsearch_shfl(T grp_val, T my_val){
int src_lane = gsize>>1;
bool return_val = false;
T test_val;
int shift = gsize>>2;
for (int i = 0; i <= gsize>>3; i++){
if (sizeof(T)==4){
test_val = my_shfl32(grp_val, src_lane);}
else if (sizeof(T)==8){
test_val = my_shfl64(grp_val, src_lane);}
else assert(0);
if (test_val == my_val) return_val = true;
src_lane += (((test_val<my_val)*2)-1)*shift;
shift>>=1;
assert ((src_lane < gsize)&&(src_lane > 0));}
if (sizeof(T)==4){
test_val = my_shfl32(grp_val, 0);}
else if (sizeof(T)==8){
test_val = my_shfl64(grp_val, 0);}
else assert(0);
if (test_val == my_val) return_val = true;
return return_val;
}
template <typename T>
__global__ void bsearch_grp(const T * __restrict__ search_grps, T *data){
int idx = threadIdx.x+blockDim.x*blockIdx.x;
int tid = threadIdx.x;
if (idx < gsize*nGRP){
T grp_val = search_grps[idx];
while (tid < dsize){
T my_val = data[tid];
if (bsearch_shfl(grp_val, my_val)) data[tid] = -1;
tid += blockDim.x;}
}
}
int main(){
// data setup
assert(gsize == 32); //mandatory (warp size)
assert((dsize % 32)==0); //needed to preserve shfl capability
thrust::host_vector<mytype> grps(gsize*nGRP);
thrust::host_vector<mytype> data(dsize);
thrust::host_vector<mytype> result(dsize);
for (int i = 0; i < gsize*nGRP; i++) grps[i] = i;
for (int i = 0; i < dsize; i++) data[i] = i;
// method 1: individual shfl-based binary searches on each group
mytype *d_grps, *d_data;
cudaMalloc(&d_grps, gsize*nGRP*sizeof(mytype));
cudaMalloc(&d_data, dsize*sizeof(mytype));
cudaMemcpy(d_grps, &(grps[0]), gsize*nGRP*sizeof(mytype), cudaMemcpyHostToDevice);
cudaMemcpy(d_data, &(data[0]), dsize*sizeof(mytype), cudaMemcpyHostToDevice);
unsigned long long my_time = dtime_usec(0);
bsearch_grp<<<nGRP, gsize>>>(d_grps, d_data);
cudaDeviceSynchronize();
my_time = dtime_usec(my_time);
cudaMemcpy(&(result[0]), d_data, dsize*sizeof(mytype), cudaMemcpyDeviceToHost);
for (int i = 0; i < dsize; i++) if (result[i] != -1) {printf("method 1 mismatch at %d, was %d, should be -1\n", i, (int)(result[i])); return 1;}
printf("method 1 time: %fs\n", my_time/(float)USECPSEC);
// method 2: thrust sort, followed by thrust binary search
thrust::device_vector<mytype> t_grps = grps;
thrust::device_vector<mytype> t_data = data;
thrust::device_vector<bool> t_rslt(t_data.size());
my_time = dtime_usec(0);
thrust::sort(t_grps.begin(), t_grps.end());
thrust::binary_search(t_grps.begin(), t_grps.end(), t_data.begin(), t_data.end(), t_rslt.begin());
cudaDeviceSynchronize();
my_time = dtime_usec(my_time);
thrust::host_vector<bool> rslt = t_rslt;
for (int i = 0; i < dsize; i++) if (rslt[i] != true) {printf("method 2 mismatch at %d, was %d, should be 1\n", i, (int)(rslt[i])); return 1;}
printf("method 2 time: %fs\n", my_time/(float)USECPSEC);
// method 3: multiple thrust merges, followed by thrust binary search
return 0;
}
$ nvcc -O3 -arch=sm_35 t1030.cu -o t1030
$ ./t1030
method 1 time: 0.009075s
method 2 time: 0.000516s
$
I was running this on linux, CUDA 7.5, GT640 GPU. Obviously the performance will be different on different GPUs, but I'd be surprised if any GPU significantly closed the gap.
In short, you'd be well advised to use a well-tuned library like thrust or cub. If you don't like the monolithic nature of thrust, you could try cub. I don't know if cub has a binary search, but a single binary search against the whole sorted data set is not a difficult thing to write, and it's the smaller part of the time involved (for method 2 -- identifiable using nvprof or additional timing code).
Since your 32-element grouped ranges are already sorted, I also pondered the idea of using multiple thrust::merge operations rather than a single sort. I'm not sure which would be faster, but since the thrust method is already so much faster than the 32-element shuffle search method, I think thrust (or cub) is the obvious choice.
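
When experimenting with hand-written kernels, it can help to have a host-side reference to check results against. The sketch below mirrors the sort-then-binary-search idea of method 2 using only the C++ standard library; mark_found is a hypothetical name and nothing here runs on the GPU.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Host-side reference: every entry of removeValues that occurs anywhere in
// inputValues is replaced by -1. inputValues is taken by value so that the
// caller's grouped data stays unchanged, as in the question.
std::vector<std::int64_t> mark_found(std::vector<std::int64_t> inputValues,
                                     std::vector<std::int64_t> removeValues) {
    // One global sort replaces the per-group ordering.
    std::sort(inputValues.begin(), inputValues.end());
    for (std::int64_t& v : removeValues) {
        if (std::binary_search(inputValues.begin(), inputValues.end(), v))
            v = -1;
    }
    return removeValues;
}
```

With the three groups from the question's example and removeValues = [7, 75, 95, 106], this returns [-1, 75, -1, 106].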

Thrust: summing the elements of an array indexed by another array [Matlab's syntax sum(x(indices))]

I'm trying to sum the elements of an array indexed by another array using the Thrust library, but I couldn't find an example. In other words, I want to implement Matlab's syntax
sum(x(indices))
Here is some guideline code showing what I would like to achieve:
#define N 65536
// device array copied using cudaMemcpyToSymbol
__device__ int global_array[N];
// function to implement with thrust
__device__ int support(unsigned short* _memory, unsigned short* _memShort)
{
int support = 0;
for(int i=0; i < _memSizeShort; i++)
support += global_array[_memory[i]];
return support;
}
Also, from the host code, can I use the global_array[N] without copying it back with cudaMemcpyFromSymbol ?
Every comment/answer is appreciated :)
Thanks

This is a very late answer provided here to remove this question from the unanswered list. I'm sure that the OP has already found a solution (since May 2012 :-)), but I believe that the following could be useful to other users.
As pointed out by @talonmies, the problem can be solved by a fused gather-reduction. The solution is indeed an application of Thrust's permutation_iterator and reduce. The permutation_iterator allows you to (implicitly) reorder the target array x according to the indices in the indices array. reduce performs the sum of the (implicitly) reordered array.
This application is part of Thrust's documentation, reproduced below for convenience:
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/reduce.h>
#include <thrust/device_vector.h>
#include <iostream>
// this example fuses a gather operation with a reduction for
// greater efficiency than separate gather() and reduce() calls
int main(void)
{
// gather locations
thrust::device_vector<int> map(4);
map[0] = 3;
map[1] = 1;
map[2] = 0;
map[3] = 5;
// array to gather from
thrust::device_vector<int> source(6);
source[0] = 10;
source[1] = 20;
source[2] = 30;
source[3] = 40;
source[4] = 50;
source[5] = 60;
// fuse gather with reduction:
// sum = source[map[0]] + source[map[1]] + ...
int sum = thrust::reduce(thrust::make_permutation_iterator(source.begin(), map.begin()),
thrust::make_permutation_iterator(source.begin(), map.end()));
// print sum
std::cout << "sum is " << sum << std::endl;
return 0;
}
In the above example, map plays the role of indices, while source plays the role of x.
Concerning the additional question in your comment (iterating over a reduced number of terms), it will be sufficient to change the following line
int sum = thrust::reduce(thrust::make_permutation_iterator(source.begin(), map.begin()),
thrust::make_permutation_iterator(source.begin(), map.end()));
to
int sum = thrust::reduce(thrust::make_permutation_iterator(source.begin(), map.begin()),
thrust::make_permutation_iterator(source.begin(), map.begin()+N));
if you want to iterate only over the first N terms of the indexing array map.
Finally, concerning the possibility of using global_array from the host, you should notice that this is a vector residing on the device, so you do need a cudaMemcpyFromSymbol to move it to the host first.
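
To make clear what the fused gather-reduction computes, here is a host-only analogue in standard C++. It is a sketch for checking results, not a replacement for the Thrust version; gather_sum and its count parameter are names invented for this illustration, with count playing the role of N in map.begin()+N.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Sums source[map[0]] + source[map[1]] + ... over the first `count`
// entries of map, i.e. Matlab's sum(x(indices(1:count))).
int gather_sum(const std::vector<int>& source,
               const std::vector<int>& map,
               std::size_t count) {
    assert(count <= map.size());
    return std::accumulate(map.begin(), map.begin() + count, 0,
                           [&source](int acc, int idx) { return acc + source[idx]; });
}
```

For the data in the example above (source = 10, 20, 30, 40, 50, 60 and map = 3, 1, 0, 5), summing all four terms gives 40 + 20 + 10 + 60 = 130, and summing only the first two gives 60.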

C++ vector and memoization runtime error issues

I encountered a problem here at Codechef. I am trying to use a vector for memoization. As I am still new at programming and quite unfamiliar with STL containers, I have used a vector for the lookup table (although it was suggested to me that using a map helps to solve the problem).
So, my question is how is the solution given below running into a run time error. In order to get the error, I used the boundary value for the problem (100000000) as the input. The error message displayed by my Netbeans IDE is RUN FAILED (exit value 1, total time: 4s) with input as 1000000000. Here is the code:
#include <iostream>
#include <cstdlib>
#include <vector>
#include <string>
#define LCM 12
#define MAXSIZE 100000000
using namespace std;
/*
*
*/
vector<unsigned long> lookup(MAXSIZE,0);
int solve(int n)
{
if ( n < 12) {
return n;
}
else {
if (n < MAXSIZE) {
if (lookup[n] != 0) {
return lookup[n];
}
}
int temp = solve(n/2)+solve(n/3)+solve(n/4);
if (temp >= lookup[n] ) {
lookup[n] = temp;
}
return lookup[n];
}
}
int main(int argc, char** argv) {
int t;
cin>>t;
int n;
n = solve(t);
if ( t >= n) {
cout<<t<<endl;
}
else {
cout<<n<<endl;
}
return 0;
}

I doubt if this is a memory issue because he already said that the program actually runs and he inputs 100000000.
One thing that I noticed: you're doing a lookup[n] even if n == MAXSIZE, because that access sits outside the if (n < MAXSIZE) check. Since C++ uses 0-indexed vectors, this would be 1 beyond the end of the vector.
if (n < MAXSIZE) {
...
}
...
if (temp >= lookup[n] ) {
lookup[n] = temp;
}
return lookup[n];
I can't guess what the algorithm is doing but I think the closing brace } of the first "if" should be lower down and you could return an error on this boundary condition.

You either don't have enough memory or don't have enough contiguous address space to store 100,000,000 unsigned longs.

This is most likely a memory issue. For a vector, you need contiguous memory allocation [so that it can keep its promise of constant-time lookup]. In your case, with an 8-byte unsigned long, you are asking your machine for around 762 MB of memory (100,000,000 × 8 bytes) in a single block.
I don't know which problem you're solving, but it looks like you're solving Bytelandian coins. For this, it is much better to use a map, because:
You will mostly not be storing the values for all 100000000 cases in a test case run. So, what you need is a way to allocate memory for only those values that you actually memoize.
Even if you are, you have no need for a constant time lookup. Although it would speed up your program, std::map uses trees to give you logarithmic lookup time. And it does away with the requirement of using up 762 MB contiguously. 762 MB is not a big deal, but expecting it in a single block is.
So, the best thing to use in your situation is a std::map. In your case, simply replacing std::vector<unsigned long> with std::map<int, unsigned long> would work, as map also has [] operator access [for the most part, it should].
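
A minimal sketch of the map-based version follows. It assumes the problem really is Bytelandian coins, which is only a guess from the recurrence in the question (a coin n can be kept or exchanged for n/2 + n/3 + n/4), and it differs from the question's code by passing the memo table explicitly and using 64-bit integers so that large results do not overflow.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>

// Best value obtainable from a coin worth n. Only the values actually
// reached by the recursion are stored, so no huge table is allocated.
std::uint64_t solve(std::uint64_t n,
                    std::map<std::uint64_t, std::uint64_t>& memo) {
    if (n < 12) return n;  // below 12, exchanging never beats keeping the coin
    auto it = memo.find(n);
    if (it != memo.end()) return it->second;  // already computed
    std::uint64_t split = solve(n / 2, memo) + solve(n / 3, memo) + solve(n / 4, memo);
    std::uint64_t best = std::max(n, split);
    memo[n] = best;
    return best;
}
```

Because every argument the recursion sees has the form n / (2^a * 3^b), the map ends up holding only a few hundred entries even for n = 1000000000, which is why the lost constant-time lookup does not matter here.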

Can I make this C++ code faster without making it much more complex?

Here's a problem I've solved from a programming problem website (codechef.com, in case anyone doesn't want to see this solution before trying it themselves). This solved the problem in about 5.43 seconds with the test data; others have solved this same problem with the same test data in 0.14 seconds, but with much more complex code. Can anyone point out specific areas of my code where I am losing performance? I'm still learning C++, so I know there are a million ways I could solve this problem, but I'd like to know if I can improve my own solution with some subtle changes rather than rewrite the whole thing. Or, if there are any relatively simple solutions which are comparable in length but would perform better than mine, I'd be interested to see them also.
Please keep in mind I'm learning C++ so my goal here is to improve the code I understand, not just to be given a perfect solution.
Thanks
Problem:
The purpose of this problem is to verify whether the method you are using to read input data is sufficiently fast to handle problems branded with the enormous Input/Output warning. You are expected to be able to process at least 2.5MB of input data per second at runtime. Time limit to process the test data is 8 seconds.
The input begins with two positive integers n k (n, k<=10^7). The next n lines of input contain one positive integer ti, not greater than 10^9, each.
Output
Write a single integer to output, denoting how many integers ti are divisible by k.
Example
Input:
7 3
1
51
966369
7
9
999996
11
Output:
4
Solution:
#include <iostream>
#include <stdio.h>
using namespace std;
int main(){
//n is number of integers to perform calculation on
//k is the divisor
//inputnum is the number to be divided by k
//total is the total number of inputnums divisible by k
int n,k,inputnum,total;
//initialize total to zero
total=0;
//read in n and k from stdin
scanf("%i%i",&n,&k);
//loop n times and if k divides into n, increment total
for (n; n>0; n--)
{
scanf("%i",&inputnum);
if(inputnum % k==0) total += 1;
}
//output value of total
printf("%i",total);
return 0;
}

The speed is not being determined by the computation—most of the time the program takes to run is consumed by i/o.
Add setvbuf calls before the first scanf for a significant improvement:
setvbuf(stdin, NULL, _IOFBF, 32768);
setvbuf(stdout, NULL, _IOFBF, 32768);
-- edit --
The alleged magic numbers are the new buffer size. By default, a FILE uses a buffer of BUFSIZ bytes, which is implementation-defined and can be as small as 512 bytes on some implementations. Increasing this size decreases the number of times that the C runtime library has to issue a read or write call to the operating system, which is by far the most expensive operation in your algorithm.
By keeping the buffer size a multiple of 512, that eliminates buffer fragmentation. Whether the size should be 1024*10 or 1024*1024 depends on the system it is intended to run on. For 16 bit systems, a buffer size larger than 32K or 64K generally causes difficulty in allocating the buffer, and maybe managing it. For any larger system, make it as large as useful—depending on available memory and what else it will be competing against.
Lacking any known memory contention, choose sizes for the buffers at about the size of the associated files. That is, if the input file is 250K, use that as the buffer size. There is definitely a diminishing return as the buffer size increases. For the 250K example, a 100K buffer would require three reads, while a default 512 byte buffer requires 500 reads. Further increasing the buffer size so only one read is needed is unlikely to make a significant performance improvement over three reads.
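
Put together, the buffering advice might look like the sketch below. It reads from a named file rather than stdin so that it is easy to try out in isolation; count_divisible_in_file and its parameters are hypothetical names, and the right buffer size is something to measure on the target system rather than a value taken from this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Reads "n k" followed by n integers from the file at `path` and returns
// how many of them are divisible by k, or -1 on error.
long count_divisible_in_file(const char* path, std::size_t buffer_size) {
    std::FILE* in = std::fopen(path, "r");
    if (!in) return -1;
    // setvbuf has to be called before any other operation on the stream.
    // With a null buffer pointer, the library may allocate a buffer itself,
    // using buffer_size as the requested size.
    std::setvbuf(in, nullptr, _IOFBF, buffer_size);

    int n = 0, k = 0, value = 0;
    long total = 0;
    if (std::fscanf(in, "%d %d", &n, &k) == 2 && k != 0) {
        while (n-- > 0 && std::fscanf(in, "%d", &value) == 1)
            total += (value % k == 0);  // adds 1 when divisible, 0 otherwise
    } else {
        total = -1;                     // malformed header line
    }
    std::fclose(in);
    return total;
}
```

For the sample input from the question, a call such as count_divisible_in_file("input.txt", 32768) returns 4.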

I tested the following on 28311552 lines of input. It's 10 times faster than your code. What it does is read a large block at once, then finishes up to the next newline. The goal here is to reduce I/O costs, since scanf() is reading a character at a time. Even with stdio, the buffer is likely too small.
Once the block is ready, I parse the numbers directly in memory.
This isn't the most elegant of codes, and I might have some edge cases a bit off, but it's enough to get you going with a faster approach.
Here are the timings (without the optimizer my solution is only about 6-7 times faster than your original reference)
[xavier:~/tmp] dalke% g++ -O3 my_solution.cpp
[xavier:~/tmp] dalke% time ./a.out < c.dat
15728647
0.284u 0.057s 0:00.39 84.6% 0+0k 0+1io 0pf+0w
[xavier:~/tmp] dalke% g++ -O3 your_solution.cpp
[xavier:~/tmp] dalke% time ./a.out < c.dat
15728647
3.585u 0.087s 0:03.72 98.3% 0+0k 0+0io 0pf+0w
Here's the code.
#include <iostream>
#include <stdio.h>
using namespace std;
const int BUFFER_SIZE=400000;
const int EXTRA=30; // well over the size of an integer
void read_to_newline(char *buffer) {
int c;
while (1) {
c = getc_unlocked(stdin);
if (c == '\n' || c == EOF) {
*buffer = '\0';
return;
}
*buffer++ = c;
}
}
int main() {
char buffer[BUFFER_SIZE+EXTRA];
char *end_buffer;
char *startptr, *endptr;
//n is number of integers to perform calculation on
//k is the divisor
//inputnum is the number to be divided by k
//total is the total number of inputnums divisible by k
int n,k,inputnum,total,nbytes;
//initialize total to zero
total=0;
//read in n and k from stdin
read_to_newline(buffer);
sscanf(buffer, "%i%i",&n,&k);
while (1) {
// Read a large block of values
// There should be one integer per line, with nothing else.
// This might truncate an integer!
nbytes = fread(buffer, 1, BUFFER_SIZE, stdin);
if (nbytes == 0) {
cerr << "Reached end of file too early" << endl;
break;
}
// Make sure I read to the next newline.
read_to_newline(buffer+nbytes);
startptr = buffer;
while (n>0) {
inputnum = 0;
// I had used strtol but that was too slow
// inputnum = strtol(startptr, &endptr, 10);
// Instead, parse the integers myself.
endptr = startptr;
while (*endptr >= '0') {
inputnum = inputnum * 10 + *endptr - '0';
endptr++;
}
// *endptr might be a '\n' or '\0'
// Might occur with the last field
if (startptr == endptr) {
break;
}
// skip the newline; go to the
// first digit of the next number.
if (*endptr == '\n') {
endptr++;
}
// Test if this is a factor
if (inputnum % k==0) total += 1;
// Advance to the next number
startptr = endptr;
// Reduce the count by one
n--;
}
// Either we are done, or we need new data
if (n==0) {
break;
}
}
// output value of total
printf("%i\n",total);
return 0;
}
Oh, and it very much assumes the input data is in the right format.

Try replacing the if statement with total += ((inputnum % k) == 0);. That might help a little bit.
But I think you really need to buffer your input into a temporary array. Reading one integer from input at a time is expensive. If you can separate data acquisition from data processing, the compiler may be able to generate optimized code for the mathematical operations.

The I/O operations are the bottleneck. Try to limit them whenever you can, for instance by loading all the data into a buffer or array with a buffered stream in one step.
That said, your example is so simple that I can hardly see what you could eliminate, assuming that reading successive values from stdin is part of the problem statement.
A few comments on the code: your example doesn't make use of any streams, so there is no need to include the iostream header. You already load the C library elements into the global namespace by including stdio.h instead of the C++ version of the header, cstdio, so using namespace std is not necessary.

You can read each line with fgets() and parse the strings yourself without scanf(). (gets() would be slightly simpler, but it cannot bound the read and was removed from the language in C11, so it is best avoided even when the input is well-specified.)
A sample C program to solve this problem:
#include <stdio.h>
int main() {
    int n,k,in,tot=0,i;
    char s[1024];
    if (!fgets(s, sizeof s, stdin)) return 1;
    sscanf(s,"%d %d",&n,&k);
    while(n--) {
        if (!fgets(s, sizeof s, stdin)) break;
        in=0;
        /* For each digit read, multiply the previous value of in
           by 10 and add the current digit; stop at the newline
           that fgets() keeps in the buffer */
        for(i=0; s[i]>='0' && s[i]<='9'; i++) {
            in=in*10 + s[i]-'0';
        }
        tot += in%k==0; /* adds 1 if in%k is 0, 0 otherwise */
    }
    printf("%d\n",tot);
    return 0;
}
This program is approximately 2.6 times faster than the solution you gave above (on my machine).

You could try to read input line by line and use atoi() for each input row. This should be a little bit faster than scanf, because you remove the "scan" overhead of the format string.

I think the code is fine. I ran it on my computer in less than 0.3s
I even ran it on much larger inputs in less than a second.
How are you timing it?
One small thing you could do is remove the if statement.
Start with total=n and then inside the loop:
total -= (input % k != 0); //subtracts 0 if divisible, 1 if not

Though I doubt CodeChef will accept it, one possibility is to use multiple threads, one to handle the I/O, and another to process the data. This is especially effective on a multi-core processor, but can help even with a single core. For example, on Windows you could use code like this (no real attempt at conforming with CodeChef requirements -- I doubt they'll accept it with the timing data in the output):
#include <windows.h>
#include <process.h>
#include <iostream>
#include <time.h>
#include "queue.hpp"
namespace jvc = JVC_thread_queue;
struct buffer {
static const int initial_size = 1024 * 1024;
char buf[initial_size];
size_t size;
buffer() : size(initial_size) {}
};
jvc::queue<buffer *> outputs;
void read(HANDLE file) {
// read data from specified file, put into buffers for processing.
//
char temp[32];
int temp_len = 0;
int i;
buffer *b;
DWORD read;
do {
b = new buffer;
// If we have a partial line from the previous buffer, copy it into this one.
if (temp_len != 0)
memcpy(b->buf, temp, temp_len);
// Then fill the buffer with data.
ReadFile(file, b->buf+temp_len, b->size-temp_len, &read, NULL);
// Look for partial line at end of buffer.
for (i=read; b->buf[i] != '\n'; --i)
;
// copy partial line to holding area.
memcpy(temp, b->buf+i, temp_len=read-i);
// adjust size.
b->size = i;
// put buffer into queue for processing thread.
// transfers ownership.
outputs.add(b);
} while (read != 0);
}
// A simplified istrstream that can only read int's.
class num_reader {
buffer &b;
char *pos;
char *end;
public:
num_reader(buffer *buf) : b(*buf), pos(b.buf), end(pos+b.size) {}
num_reader &operator>>(int &value){
int v = 0;
// skip leading "stuff" up to the first digit.
while ((pos < end) && !isdigit(*pos))
++pos;
// read digits, create value from them.
while ((pos < end) && isdigit(*pos)) {
v = 10 * v + *pos-'0';
++pos;
}
value = v;
return *this;
}
// return stream status -- only whether we're at end
operator bool() { return pos < end; }
};
int result;
unsigned __stdcall processing_thread(void *) {
int value;
int n, k;
int count = 0;
// Read first buffer: n & k followed by values.
buffer *b = outputs.pop();
num_reader input(b);
input >> n;
input >> k;
while (input >> value && ++count < n)
result += ((value %k ) == 0);
// Ownership was transferred -- delete buffer when finished.
delete b;
// Then read subsequent buffers:
while ((b=outputs.pop()) && (b->size != 0)) {
num_reader input(b);
while (input >> value && ++count < n)
result += ((value %k) == 0);
// Ownership was transferred -- delete buffer when finished.
delete b;
}
return 0;
}
int main() {
HANDLE standard_input = GetStdHandle(STD_INPUT_HANDLE);
HANDLE processor = (HANDLE)_beginthreadex(NULL, 0, processing_thread, NULL, 0, NULL);
clock_t start = clock();
read(standard_input);
WaitForSingleObject(processor, INFINITE);
clock_t finish = clock();
std::cout << (float)(finish-start)/CLOCKS_PER_SEC << " Seconds.\n";
std::cout << result;
return 0;
}
This uses a thread-safe queue class I wrote years ago:
#ifndef QUEUE_H_INCLUDED
#define QUEUE_H_INCLUDED
namespace JVC_thread_queue {
template<class T, unsigned max = 256>
class queue {
HANDLE space_avail; // at least one slot empty
HANDLE data_avail; // at least one slot full
CRITICAL_SECTION mutex; // protect buffer, in_pos, out_pos
T buffer[max];
long in_pos, out_pos;
public:
queue() : in_pos(0), out_pos(0) {
space_avail = CreateSemaphore(NULL, max, max, NULL);
data_avail = CreateSemaphore(NULL, 0, max, NULL);
InitializeCriticalSection(&mutex);
}
void add(T data) {
WaitForSingleObject(space_avail, INFINITE);
EnterCriticalSection(&mutex);
buffer[in_pos] = data;
in_pos = (in_pos + 1) % max;
LeaveCriticalSection(&mutex);
ReleaseSemaphore(data_avail, 1, NULL);
}
T pop() {
WaitForSingleObject(data_avail,INFINITE);
EnterCriticalSection(&mutex);
T retval = buffer[out_pos];
out_pos = (out_pos + 1) % max;
LeaveCriticalSection(&mutex);
ReleaseSemaphore(space_avail, 1, NULL);
return retval;
}
~queue() {
DeleteCriticalSection(&mutex);
CloseHandle(data_avail);
CloseHandle(space_avail);
}
};
}
#endif
Exactly how much you gain from this depends on the amount of time spent reading versus the amount of time spent on other processing. In this case, the other processing is sufficiently trivial that it probably doesn't gain much. If more time was spent on processing the data, multi-threading would probably gain more.

2.5 MB/sec is 400 ns/byte.
There are two big per-byte processes, file input and parsing.
For the file input, I would just load it into a big memory buffer. fread should be able to read that in at roughly full disc bandwidth.
For the parsing, sscanf is built for generality, not speed. atoi should be pretty fast. My habit, for better or worse, is to do it myself, as in:
#define DIGIT(c)((c)>='0' && (c) <= '9')
bool parsInt(char* &p, int& num){
while(*p && *p <= ' ') p++; // scan over whitespace
if (!DIGIT(*p)) return false;
num = 0;
while(DIGIT(*p)){
num = num * 10 + (*p++ - '0');
}
return true;
}
The loops, first over leading whitespace, then over the digits, should be nearly as fast as the machine can go, certainly a lot less than 400ns/byte.
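
As a usage sketch, the hand-rolled parser can be applied to the whole input once it has been loaded into a NUL-terminated memory buffer. The helper below restates the parser with a const pointer so the block is self-contained; parse_int and count_divisible are names chosen for this illustration, and the input format ("n k" followed by n integers) is the one given in the problem statement.

```cpp
#include <cassert>

inline bool is_digit(char c) { return c >= '0' && c <= '9'; }

// Same idea as parsInt above: skip whitespace, then accumulate digits.
// Advances p past the number and returns false if no digit is found.
bool parse_int(const char*& p, int& num) {
    while (*p && *p <= ' ') ++p;  // scan over whitespace
    if (!is_digit(*p)) return false;
    num = 0;
    while (is_digit(*p)) num = num * 10 + (*p++ - '0');
    return true;
}

// Counts the values divisible by k in a NUL-terminated buffer holding
// "n k" followed by n integers. Returns -1 if the header is malformed.
int count_divisible(const char* buffer) {
    const char* p = buffer;
    int n = 0, k = 0, value = 0, total = 0;
    if (!parse_int(p, n) || !parse_int(p, k) || k == 0) return -1;
    while (n-- > 0 && parse_int(p, value))
        total += (value % k == 0);
    return total;
}
```

Running count_divisible on the sample input from the problem statement gives 4, matching the expected output.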

Dividing two large numbers is hard. Perhaps an improvement would be to first characterize k a little by looking at some of the smaller primes. Let's say 2, 3, and 5 for now. If k is divisible by any of these, then inputnum also needs to be, or inputnum is not divisible by k. Of course there are more tricks to play (you could use a bitwise AND of inputnum with 1 to determine whether it is divisible by 2), but I think just removing the low prime possibilities will give a reasonable speed improvement (worth a shot anyway).
