Efficient index bound check and double to int cast

Efficient index bound check and double to int cast - c++

Consider the following code snippet
double *x, *id;
int i, n; // = vector size
// allocate and zero x
// set id to 0:n-1
for(i=0; i<n; i++) {
long iid = (long)id[i];
if(iid>=0 && iid<n && (double)iid==id[i]){
x[iid] = 1;
} else break;
}
The code uses values in vector id of type double as indices into vector x. In order for the indices to be valid I verify that they are greater than or equal to 0, less than vector size n, and that doubles stored in id are in fact integers. In this example id stores integers from 1 to n, so all vectors are accessed linearly and branch prediction of the if statement should always work.
For n=1e8 the code takes 0.21s on my computer. Since it seems to me it is a computationally light-weight loop, I expect it to be memory bandwidth bounded. Based on the benchmarked memory bandwidth I expect it to run in 0.15s. I calculate the memory footprint as 8 bytes per id value, and 16 bytes per x value (it needs to be both written, and read from memory since I assume SSE streaming is not used). So a total of 24 bytes per vector entry.
The questions:
Am I wrong saying that this code should be memory bandwidth bounded, and that it can be improved?
If not, do you know a way in which I could improve the performance so that it works with the speed of the memory?
Or maybe everything is fine and I can not easily improve it otherwise than running it in parallel?
Changing the type of id is not an option - it must be double. Also, in the general case id and x have different sizes and must be kept as separate arrays - they come from different parts of the program. In short, I wonder if it is possible to write the bound checks and the type cast/integer validation in a more efficient manner.
For convenience, the entire code:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
static struct timeval tb, te;
void tic()
{
gettimeofday(&tb, NULL);
}
void toc(const char *idtxt)
{
long s,u;
gettimeofday(&te, NULL);
s=te.tv_sec-tb.tv_sec;
u=te.tv_usec-tb.tv_usec;
printf("%-30s%10li.%.6li\n", idtxt,
(s*1000000+u)/1000000, (s*1000000+u)%1000000);
}
int main(int argc, char *argv[])
{
double *x = NULL;
double *id = NULL;
int i, n;
// vector size is a command line parameter
n = atoi(argv[1]);
printf("x size %i\n", n);
// not included in timing in MATLAB
x = calloc(sizeof(double),n);
memset(x, 0, sizeof(double)*n);
// create index vector
tic();
id = malloc(sizeof(double)*n);
for(i=0; i<n; i++) id[i] = i;
toc("id = 1:n");
// use id to index x and set all entries to 4
tic();
for(i=0; i<n; i++) {
long iid = (long)id[i];
if(iid>=0 && iid<n && (double)iid==id[i]){
x[iid] = 1;
} else break;
}
toc("x(id) = 1");
}

EDIT: Disregard if you can't split the arrays!
I think it can be improved by taking advantage of a common cache concept. You can either make data accesses close in time or location. With tight for-loops, you can achieve a better data hit-rate by shaping your data structures like your for-loop. In this case, you access two different arrays, usually the same indices in each array. Your machine is loading chunks of both arrays each iteration through that loop. To increase the use of each load, create a structure to hold an element of each array, and create a single array with that struct:
struct my_arrays
{
double x;
int id;
};
struct my_arrays* arr = malloc(sizeof(my_arrays)*n);
Now, each time you load data into cache, you'll hit everything you load because the arrays are close together.
EDIT: Since your intent is to check for an integer value, and you make the explicit assumption that the values are small enough to be represented precisely in a double with no loss of precision, then I think your comparison is fine.
My previous answer had a reference to beware comparing large doubles after implicit casting, and I referenced this:
What is the most effective way for float and double comparison?

It might be worth considering examination of double type representation.
For example, the following code shows how to compare a double number greater than 1 to 999:
bool check(double x)
{
union
{
double d;
uint32_t y[2];
};
d = x;
bool answer;
uint32_t exp = (y[1] >> 20) & 0x3ff;
uint32_t fraction1 = y[1] << (13 + exp); // upper bits of fractiona part
uint32_t fraction2 = y[0]; // lower 32 bits of fractional part
if (fraction2 != 0 || fraction1 != 0)
answer = false;
else if (exp > 8)
answer = false;
else if (exp == 8)
answer = (y[1] < 0x408f3800); // this is the representation of 999
else
answer = true;
return answer;
}
This looks like much code, but it might be vectorized easily (using e.g. SSE), and if your bound is a power of 2, it might simplify the code further.

Related

Performance gap between vector<bool> and array

I was trying to solve a coding problem in C++ which counts the number of prime numbers less than a non-negative number n.
So I first came up with some code:
int countPrimes(int n) {
vector<bool> flag(n+1,1);
for(int i =2;i<n;i++)
{
if(flag[i]==1)
for(long j=i;i*j<n;j++)
flag[i*j]=0;
}
int result=0;
for(int i =2;i<n;i++)
result+=flag[i];
return result;
}
which takes 88 ms and uses 8.6 MB of memory. Then I changed my code into:
int countPrimes(int n) {
// vector<bool> flag(n+1,1);
bool flag[n+1] ;
fill(flag,flag+n+1,true);
for(int i =2;i<n;i++)
{
if(flag[i]==1)
for(long j=i;i*j<n;j++)
flag[i*j]=0;
}
int result=0;
for(int i =2;i<n;i++)
result+=flag[i];
return result;
}
which takes 28 ms and 9.9 MB. I don't really understand why there is such a performance gap in both the running time and memory consumption. I have read relative questions like this one and that one but I am still confused.
EDIT: I reduced the running time to 40 ms with 11.5 MB of memory after replacing vector<bool> with vector<char>.

std::vector<bool> isn't like any other vector. The documentation says:
std::vector<bool> is a possibly space-efficient specialization of
std::vector for the type bool.
That's why it may use up less memory than an array, because it might represent multiple boolean values with one byte, like a bitset. It also explains the performance difference, since accessing it isn't as simple anymore. According to the documentation, it doesn't even have to store it as a contiguous array.

std::vector<bool> is special case. It is specialized template. Each value is stored in single bit, so bit operations are needed. This memory compact but has couple drawbacks (like no way to have a pointer to bool inside this container).
Now bool flag[n+1]; compiler will usually allocate same memory in same manner as for char flag[n+1]; and it will do that on stack, not on heap.
Now depending on page sizes, cache misses and i values one can be faster then other. It is hard to predict (for small n array will be faster, but for larger n result may change).
As an interesting experiment you can change std::vector<bool> to std::vector<char>. In this case you will have similar memory mapping as in case of array, but it will be located at heap not a stack.

I'd like to add some remarks to the good answers already posted.
The performance differences between std::vector<bool> and std::vector<char> may vary (a lot) between different library implementations and different sizes of the vectors.
See e.g. those quick benches: clang++ / libc++(LLVM) vs. g++ / libstdc++(GNU).
This: bool flag[n+1]; declares a Variable Length Array, which (despites some performance advantages due to it beeing allocated in the stack) has never been part of the C++ standard, even if provided as an extension by some (C99 compliant) compilers.
Another way to increase the performances could be to reduce the amount of calculations (and memory occupation) by considering only the odd numbers, given that all the primes except for 2 are odd.
If you can bare the less readable code, you could try to profile the following snippet.
int countPrimes(int n)
{
if ( n < 2 )
return 0;
// Sieve starting from 3 up to n, the number of odd number between 3 and n are
int sieve_size = n / 2 - 1;
std::vector<char> sieve(sieve_size);
int result = 1; // 2 is a prime.
for (int i = 0; i < sieve_size; ++i)
{
if ( sieve[i] == 0 )
{
// It's a prime, no need to scan the vector again
++result;
// Some ugly transformations are needed, here
int prime = i * 2 + 3;
for ( int j = prime * 3, k = prime * 2; j <= n; j += k)
sieve[j / 2 - 1] = 1;
}
}
return result;
}
Edit
As Peter Cordes noted in the comments, using an unsigned type for the variable j
the compiler can implement j/2 as cheaply as possible. C signed division by a power of 2 has different rounding semantics (for negative dividends) than a right shift, and compilers don't always propagate value-range proofs sufficiently to prove that j will always be non-negative.
It's also possible to reduce the number of candidates exploiting the fact that all primes (past 2 and 3) are one below or above a multiple of 6.

I am getting different timings and memory usage than the ones mentioned in the question when compiling with g++-7.4.0 -g -march=native -O2 -Wall and running on a Ryzen 5 1600 CPU:
vector<bool>: 0.038 seconds, 3344 KiB memory, IPC 3.16
vector<char>: 0.048 seconds, 12004 KiB memory, IPC 1.52
bool[N]: 0.050 seconds, 12644 KiB memory, IPC 1.69
Conclusion: vector<bool> is the fastest option because of its higher IPC (instructions per clock).
#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <vector>
size_t countPrimes(size_t n) {
std::vector<bool> flag(n+1,1);
//std::vector<char> flag(n+1,1);
//bool flag[n+1]; std::fill(flag,flag+n+1,true);
for(size_t i=2;i<n;i++) {
if(flag[i]==1) {
for(size_t j=i;i*j<n;j++) {
flag[i*j]=0;
}
}
}
size_t result=0;
for(size_t i=2;i<n;i++) {
result+=flag[i];
}
return result;
}
int main() {
{
const rlim_t kStackSize = 16*1024*1024;
struct rlimit rl;
int result = getrlimit(RLIMIT_STACK, &rl);
if(result != 0) abort();
if(rl.rlim_cur < kStackSize) {
rl.rlim_cur = kStackSize;
result = setrlimit(RLIMIT_STACK, &rl);
if(result != 0) abort();
}
}
printf("%zu\n", countPrimes(10e6));
return 0;
}

Small sized binary searches on CUDA GPUs

I have a large device array inputValues of int64_t type. Every 32 elements of this array are sorted in an ascending order. I have an unsorted search array removeValues.
My intention is to look for all the elements in removeValues inside inputValues and mark them as -1. What is the most efficient method to achieve this? I am using a 3.5 cuda device if that helps.
I am not looking for a higher level solution, i.e. I do not want to use thrust or cub, but I want to write this using cuda kernels.
My initial approach was to load every 32 values in shared memory in a thread block. Every thread also loads a single value from removeValues and does an independent binary search on the shared memory array. If found, the value is set according by using an if condition.
Wouldn't this approach involve a lot of bank conflicts and branch divergence? Do you think that branch divergence can be addressed by using ternary operators while implementing the binary search? Even if that is solved, how can bank conflict be eliminated? Since the size of sorted arrays is 32, would it be possible to implement a binary search using shuffle instructions? Would that help?
EDIT : I have added an example to show what I intend to achieve.
Let's say that inputValues is a vector where every 32 elements are sorted:
[2, 4, 6, ... , 64], [95, 97, ... , 157], [1, 3, ... , 63], [...]
The typical size for this array can range between 32*2 to 32*32. The values could range from 0 to INT64_MAX.
An example of removeValues would be:
[7, 75, 95, 106]
The typical size for this array could range from 1 to 1024.
After the operation removeValues would be:
[-1, 75, -1, 106]
The values in inputValues remain unchanged.

I would concur with the answer (now deleted) and comment by #harrism. Since I put some effort into the non-thrust approach, I'll present my findings.
I tried to naively implement a binary search at the warp-level using __shfl(), and then repeat that binary search across the data set, passing the data set through each 32-element group.
It's embarrassing, but my code is around 20x slower than thrust (in fact it may be worse than that if you do careful timing with nvprof).
I made the data sizes a little larger than what was proposed in the question, because the data sizes in the question are so small that the timing is in the dust.
Here's a fully worked example of 2 approaches:
What is approximately outlined in the question, i.e. create a binary search using warp shuffle that can search up to 32 elements against a 32-element ordered array. Repeat this process for as many 32-element ordered arrays as there are, passing the entire data set through each ordered array (hopefully you can start to see some of the inefficiency now.)
Use thrust, essentially the same as what is outlined by #harrism, i.e. sort the grouped data set, and then run a vectorized thrust::binary_search on that.
Here's the example:
$ cat t1030.cu
#include <stdio.h>
#include <assert.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/binary_search.h>
typedef long mytype;
const int gsize = 32;
const int nGRP = 512;
const int dsize = nGRP*gsize;//gsize*nGRP;
#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL
unsigned long long dtime_usec(unsigned long long start){
timeval tv;
gettimeofday(&tv, 0);
return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}
template <typename T>
__device__ T my_shfl32(T val, unsigned lane){
return __shfl(val, lane);
}
template <typename T>
__device__ T my_shfl64(T val, unsigned lane){
T retval = val;
int2 t1 = *(reinterpret_cast<int2 *>(&retval));
t1.x = __shfl(t1.x, lane);
t1.y = __shfl(t1.y, lane);
retval = *(reinterpret_cast<T *>(&t1));
return retval;
}
template <typename T>
__device__ bool bsearch_shfl(T grp_val, T my_val){
int src_lane = gsize>>1;
bool return_val = false;
T test_val;
int shift = gsize>>2;
for (int i = 0; i <= gsize>>3; i++){
if (sizeof(T)==4){
test_val = my_shfl32(grp_val, src_lane);}
else if (sizeof(T)==8){
test_val = my_shfl64(grp_val, src_lane);}
else assert(0);
if (test_val == my_val) return_val = true;
src_lane += (((test_val<my_val)*2)-1)*shift;
shift>>=1;
assert ((src_lane < gsize)&&(src_lane > 0));}
if (sizeof(T)==4){
test_val = my_shfl32(grp_val, 0);}
else if (sizeof(T)==8){
test_val = my_shfl64(grp_val, 0);}
else assert(0);
if (test_val == my_val) return_val = true;
return return_val;
}
template <typename T>
__global__ void bsearch_grp(const T * __restrict__ search_grps, T *data){
int idx = threadIdx.x+blockDim.x*blockIdx.x;
int tid = threadIdx.x;
if (idx < gsize*nGRP){
T grp_val = search_grps[idx];
while (tid < dsize){
T my_val = data[tid];
if (bsearch_shfl(grp_val, my_val)) data[tid] = -1;
tid += blockDim.x;}
}
}
int main(){
// data setup
assert(gsize == 32); //mandatory (warp size)
assert((dsize % 32)==0); //needed to preserve shfl capability
thrust::host_vector<mytype> grps(gsize*nGRP);
thrust::host_vector<mytype> data(dsize);
thrust::host_vector<mytype> result(dsize);
for (int i = 0; i < gsize*nGRP; i++) grps[i] = i;
for (int i = 0; i < dsize; i++) data[i] = i;
// method 1: individual shfl-based binary searches on each group
mytype *d_grps, *d_data;
cudaMalloc(&d_grps, gsize*nGRP*sizeof(mytype));
cudaMalloc(&d_data, dsize*sizeof(mytype));
cudaMemcpy(d_grps, &(grps[0]), gsize*nGRP*sizeof(mytype), cudaMemcpyHostToDevice);
cudaMemcpy(d_data, &(data[0]), dsize*sizeof(mytype), cudaMemcpyHostToDevice);
unsigned long long my_time = dtime_usec(0);
bsearch_grp<<<nGRP, gsize>>>(d_grps, d_data);
cudaDeviceSynchronize();
my_time = dtime_usec(my_time);
cudaMemcpy(&(result[0]), d_data, dsize*sizeof(mytype), cudaMemcpyDeviceToHost);
for (int i = 0; i < dsize; i++) if (result[i] != -1) {printf("method 1 mismatch at %d, was %d, should be -1\n", i, (int)(result[i])); return 1;}
printf("method 1 time: %fs\n", my_time/(float)USECPSEC);
// method 2: thrust sort, followed by thrust binary search
thrust::device_vector<mytype> t_grps = grps;
thrust::device_vector<mytype> t_data = data;
thrust::device_vector<bool> t_rslt(t_data.size());
my_time = dtime_usec(0);
thrust::sort(t_grps.begin(), t_grps.end());
thrust::binary_search(t_grps.begin(), t_grps.end(), t_data.begin(), t_data.end(), t_rslt.begin());
cudaDeviceSynchronize();
my_time = dtime_usec(my_time);
thrust::host_vector<bool> rslt = t_rslt;
for (int i = 0; i < dsize; i++) if (rslt[i] != true) {printf("method 2 mismatch at %d, was %d, should be 1\n", i, (int)(rslt[i])); return 1;}
printf("method 2 time: %fs\n", my_time/(float)USECPSEC);
// method 3: multiple thrust merges, followed by thrust binary search
return 0;
}
$ nvcc -O3 -arch=sm_35 t1030.cu -o t1030
$ ./t1030
method 1 time: 0.009075s
method 2 time: 0.000516s
$
I was running this on linux, CUDA 7.5, GT640 GPU. Obviously the performance will be different on different GPUs, but I'd be surprised if any GPU significantly closed the gap.
In short, you'd be well advised to use a well-tuned library like thrust or cub. If you don't like the monolithic nature of thrust, you could try cub. I don't know if cub has a binary search, but a single binary search against the whole sorted data set is not a difficult thing to write, and it's the smaller part of the time involved (for method 2 -- identifiable using nvprof or additional timing code).
Since your 32-element grouped ranges are already sorted, I also pondered the idea of using multiple thrust::merge operations rather than a single sort. I'm not sure which would be faster, but since the thrust method is already so much faster than the 32-element shuffle search method, I think thrust (or cub) is the obvious choice.

Count the occurrence of each element in large data stream

I have a simulation, with N particles, running over T timesteps. At each timestep, each particle calculates some data about itself and the other particles nearby (within radius), which is bitpacked into a c-string of 4-22 bytes long (depending on how many nearby particles there are). I call this a State String.
I need to count how many times each state string occurs, to form a histogram. I've tried using Google's Sparse Hash Map, but the memory overhead is crazy.
I've been running some reduced tests (attached) over 100,000 Timesteps, for 500 particles. This results in just over 18.2mil unique state strings out of 50mil possible state strings, which is consistent with the actual work that needs to be done.
It ends up using 323 MB in space for the char* and int for each unique entry as well as as the actual state string itself. However, task manager is reporting 870M used. This is 547M of overhead, or ~251.87 bits/entry, way over what Google advertises of about 4-5 bits.
So I figure I've got to be doing something wrong. But then I found this site, which showed similar results, however, I'm not sure if his charts show just the hash table size, or include the size of the actual data as well. Additionally, his code does not free any strings being inserted into the hashmap that already exist (Meaning if his charts do include the size of the actual data, it is going to be over).
Here is some code showing the problem with the output:
#include <google/sparse_hash_map>
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <stdlib.h>
//String equality
struct eqstrc
{
bool operator()(const char* s1, const char* s2) const
{
return (s1 == s2) || (s1 && s2 && !strcmp(s1,s2));
}
};
//Hashing function
template <class T>
class fnv1Hash
{
public:
size_t operator()(const T& c) const {
unsigned int hash = 2166136261;
const unsigned char *key = (const unsigned char*)(c);
size_t L = strlen((const char*)c);
size_t i = 0;
for(const unsigned char *s = key; i < L; ++s, ++i)
hash = (16777619 * hash) ^ (*s);
return (size_t)hash;
}
};
//Function to form new string
char * new_string_from_integer(int num)
{
int ndigits = num == 0 ? 1 : (int)log10((float)num) + 1;
char * str = (char *)malloc(ndigits + 1);
sprintf(str, "%d", num);
return str;
}
typedef google::sparse_hash_map<const char*, int, fnv1Hash<const char*>, eqstrc> HashCharMap;
int main()
{
HashCharMap hashMapChar;
int N = 500;
int T = 100000;
//Fill hash table with strings
for(int k = 0; k < T; ++k)
{
for(int i = 0; i < N; ++i)
{
char * newString = new_string_from_integer(i*k);
std::pair<HashCharMap::iterator, bool> res = hashMapChar.insert(HashCharMap::value_type(newString, HashCharMap::data_type()));
(res.first)->second++;
if(res.second == false) //If the string already in hash map, don't need this memory
free(newString);
}
}
//Count memory used by key
size_t dataCount = 0;
for(HashCharMap::iterator hashCharItr = hashMapChar.begin(); hashCharItr != hashMapChar.end(); ++hashCharItr)
{
dataCount += sizeof(char*) + sizeof(unsigned int); //Size of data to store entries
dataCount += (((strlen(hashCharItr->first) + 1) + 3) & ~0x03); //Size of entries, padded to 4 byte boundaries
}
printf("Hash Map Size: %lu\n", (unsigned long)hashMapChar.size());
printf("Bytes written: %lu\n", (unsigned long)dataCount);
system("pause");
}
Output
Hash Map Size: 18218975
Bytes written: 339018772
Peak Working Set (Reported by TaskManager): 891,228 K
Overhead: 560,155 K, or 251.87 bits/entry
I've tried both Google Sparse Hash Map v1.10 and v2.0.2.
Am I doing something wrong in my use of the hash map. Or is there a better way to approach this, because with these strings, I'd be almost as well off just storing the list of strings, sorting, then counting consecutive entries.
Thanks for any help
Edit
Because I was asked, here is format of the actual data:
Each component is 2 bytes, and broken up into two subparts. 12bits, and 4bits.
First two bytes (short): [id of current particle (12 bits) | angle of
current particle (4 bits)]
Second short: [number of interacting
particles (12 bits)(N) | previous angle of current particle (4 bits)]
For next N shorts: [id of particle i (12 bits) | previous angle of particle i (4 bits)]
Angles are approximated (divided by 16), to store in 4 bits.
That's a bit wordy, so I'll write an example:
0x120A 0x001B 0x136F = Particle 288 (0x120), with angle 10 (0xA). Had angle 11 (0xB) in previous timestep. Interacts with 1 (0x001) other particle. This other particle is Particle 310 (0x136) and had angle 15 (0xF) in previous timestep.
Particles interact with between 0 to 9 other particles, hence the 4-22 bytes I mentioned above (although, rarely, can interact with up to 12 or more other particles. There is no limit. If all 500 particles are within the radius, then the string will be 1004 bytes long)
Additional information: The hashing function and compare function in my actual code use the size stored in the most significant 12 bits of the second short to do processing, since non-terminal 0x0000s can appear in my state strings. That all works fine.

These figures are from experiments with gcc on Linux. Allocating short chunks of 4-22 bytes requires 16 bytes for lengths from 1 - 12, 24 bytes for 13 - 20 and 32 bytes for the rest.
This means that your experiment with the 18218975 strings ("0".."50000000") requires 291503600 bytes on the heap, with the sum of their lengths (plus trailing 0) being 156681483.
Thus you have 135MB overhead simply due to malloc.
(Is this Peak Working Set size a reliable figure?)

C hack for storing a bit that takes 1 bit space?

I have a long list of numbers between 0 and 67600. Now I want to store them using an array that is 67600 elements long. An element is set to 1 if a number was in the set and it is set to 0 if the number is not in the set. ie. each time I need only 1bit information for storing the presence of a number. Is there any hack in C/C++ that helps me achieve this?

In C++ you can use std::vector<bool> if the size is dynamic (it's a special case of std::vector, see this) otherwise there is std::bitset (prefer std::bitset if possible.) There is also boost::dynamic_bitset if you need to set/change the size at runtime. You can find info on it here, it is pretty cool!
In C (and C++) you can manually implement this with bitwise operators. A good summary of common operations is here. One thing I want to mention is its a good idea to use unsigned integers when you are doing bit operations. << and >> are undefined when shifting negative integers. You will need to allocate arrays of some integral type like uint32_t. If you want to store N bits, it will take N/32 of these uint32_ts. Bit i is stored in the i % 32'th bit of the i / 32'th uint32_t. You may want to use a differently sized integral type depending on your architecture and other constraints. Note: prefer using an existing implementation (e.g. as described in the first paragraph for C++, search Google for C solutions) over rolling your own (unless you specifically want to, in which case I suggest learning more about binary/bit manipulation from elsewhere before tackling this.) This kind of thing has been done to death and there are "good" solutions.
There are a number of tricks that will maybe only consume one bit: e.g. arrays of bitfields (applicable in C as well), but whether less space gets used is up to compiler. See this link.
Please note that whatever you do, you will almost surely never be able to use exactly N bits to store N bits of information - your computer very likely can't allocate less than 8 bits: if you want 7 bits you'll have to waste 1 bit, and if you want 9 you will have to take 16 bits and waste 7 of them. Even if your computer (CPU + RAM etc.) could "operate" on single bits, if you're running in an OS with malloc/new it would not be sane for your allocator to track data to such a small precision due to overhead. That last qualification was pretty silly - you won't find an architecture in use that allows you to operate on less than 8 bits at a time I imagine :)

You should use std::bitset.
std::bitset functions like an array of bool (actually like std::array, since it copies by value), but only uses 1 bit of storage for each element.
Another option is vector<bool>, which I don't recommend because:
It uses slower pointer indirection and heap memory to enable resizing, which you don't need.
That type is often maligned by standards-purists because it claims to be a standard container, but fails to adhere to the definition of a standard container*.
*For example, a standard-conforming function could expect &container.front() to produce a pointer to the first element of any container type, which fails with std::vector<bool>. Perhaps a nitpick for your usage case, but still worth knowing about.

There is in fact! std::vector<bool> has a specialization for this: http://en.cppreference.com/w/cpp/container/vector_bool
See the doc, it stores it as efficiently as possible.
Edit: as somebody else said, std::bitset is also available: http://en.cppreference.com/w/cpp/utility/bitset

If you want to write it in C, have an array of char that is 67601 bits in length (67601/8 = 8451) and then turn on/off the appropriate bit for each value.

Others have given the right idea. Here's my own implementation of a bitsarr, or 'array' of bits. An unsigned char is one byte, so it's essentially an array of unsigned chars that stores information in individual bits. I added the option of storing TWO or FOUR bit values in addition to ONE bit values, because those both divide 8 (the size of a byte), and would be useful if you want to store a huge number of integers that will range from 0-3 or 0-15.
When setting and getting, the math is done in the functions, so you can just give it an index as if it were a normal array--it knows where to look.
Also, it's the user's responsibility to not pass a value to set that's too large, or it will screw up other values. It could be modified so that overflow loops back around to 0, but that would just make it more convoluted, so I decided to trust myself.
#include<stdio.h>
#include <stdlib.h>
#define BYTE 8
typedef enum {ONE=1, TWO=2, FOUR=4} numbits;
typedef struct bitsarr{
unsigned char* buckets;
numbits n;
} bitsarr;
bitsarr new_bitsarr(int size, numbits n)
{
int b = sizeof(unsigned char)*BYTE;
int numbuckets = (size*n + b - 1)/b;
bitsarr ret;
ret.buckets = malloc(sizeof(ret.buckets)*numbuckets);
ret.n = n;
return ret;
}
void bitsarr_delete(bitsarr xp)
{
free(xp.buckets);
}
void bitsarr_set(bitsarr *xp, int index, int value)
{
int buckdex, innerdex;
buckdex = index/(BYTE/xp->n);
innerdex = index%(BYTE/xp->n);
xp->buckets[buckdex] = (value << innerdex*xp->n) | ((~(((1 << xp->n) - 1) << innerdex*xp->n)) & xp->buckets[buckdex]);
//longer version
/*unsigned int width, width_in_place, zeros, old, newbits, new;
width = (1 << xp->n) - 1;
width_in_place = width << innerdex*xp->n;
zeros = ~width_in_place;
old = xp->buckets[buckdex];
old = old & zeros;
newbits = value << innerdex*xp->n;
new = newbits | old;
xp->buckets[buckdex] = new; */
}
int bitsarr_get(bitsarr *xp, int index)
{
int buckdex, innerdex;
buckdex = index/(BYTE/xp->n);
innerdex = index%(BYTE/xp->n);
return ((((1 << xp->n) - 1) << innerdex*xp->n) & (xp->buckets[buckdex])) >> innerdex*xp->n;
//longer version
/*unsigned int width = (1 << xp->n) - 1;
unsigned int width_in_place = width << innerdex*xp->n;
unsigned int val = xp->buckets[buckdex];
unsigned int retshifted = width_in_place & val;
unsigned int ret = retshifted >> innerdex*xp->n;
return ret; */
}
int main()
{
bitsarr x = new_bitsarr(100, FOUR);
for(int i = 0; i<16; i++)
bitsarr_set(&x, i, i);
for(int i = 0; i<16; i++)
printf("%d\n", bitsarr_get(&x, i));
for(int i = 0; i<16; i++)
bitsarr_set(&x, i, 15-i);
for(int i = 0; i<16; i++)
printf("%d\n", bitsarr_get(&x, i));
bitsarr_delete(x);
}

Can I make this C++ code faster without making it much more complex?

here's a problem I've solved from a programming problem website(codechef.com in case anyone doesn't want to see this solution before trying themselves). This solved the problem in about 5.43 seconds with the test data, others have solved this same problem with the same test data in 0.14 seconds but with much more complex code. Can anyone point out specific areas of my code where I am losing performance? I'm still learning C++ so I know there are a million ways I could solve this problem, but I'd like to know if I can improve my own solution with some subtle changes rather than rewrite the whole thing. Or if there are any relatively simple solutions which are comparable in length but would perform better than mine I'd be interested to see them also.
Please keep in mind I'm learning C++ so my goal here is to improve the code I understand, not just to be given a perfect solution.
Thanks
Problem:
The purpose of this problem is to verify whether the method you are using to read input data is sufficiently fast to handle problems branded with the enormous Input/Output warning. You are expected to be able to process at least 2.5MB of input data per second at runtime. Time limit to process the test data is 8 seconds.
The input begins with two positive integers n k (n, k<=10^7). The next n lines of input contain one positive integer ti, not greater than 10^9, each.
Output
Write a single integer to output, denoting how many integers ti are divisible by k.
Example
Input:
7 3
1
51
966369
7
9
999996
11
Output:
4
Solution:
#include <iostream>
#include <stdio.h>
using namespace std;
int main(){
//n is number of integers to perform calculation on
//k is the divisor
//inputnum is the number to be divided by k
//total is the total number of inputnums divisible by k
int n,k,inputnum,total;
//initialize total to zero
total=0;
//read in n and k from stdin
scanf("%i%i",&n,&k);
//loop n times and if k divides into n, increment total
for (n; n>0; n--)
{
scanf("%i",&inputnum);
if(inputnum % k==0) total += 1;
}
//output value of total
printf("%i",total);
return 0;
}

The speed is not being determined by the computation—most of the time the program takes to run is consumed by i/o.
Add setvbuf calls before the first scanf for a significant improvement:
setvbuf(stdin, NULL, _IOFBF, 32768);
setvbuf(stdout, NULL, _IOFBF, 32768);
-- edit --
The alleged magic numbers are the new buffer size. By default, FILE uses a buffer of 512 bytes. Increasing this size decreases the number of times that the C++ runtime library has to issue a read or write call to the operating system, which is by far the most expensive operation in your algorithm.
By keeping the buffer size a multiple of 512, that eliminates buffer fragmentation. Whether the size should be 1024*10 or 1024*1024 depends on the system it is intended to run on. For 16 bit systems, a buffer size larger than 32K or 64K generally causes difficulty in allocating the buffer, and maybe managing it. For any larger system, make it as large as useful—depending on available memory and what else it will be competing against.
Lacking any known memory contention, choose sizes for the buffers at about the size of the associated files. That is, if the input file is 250K, use that as the buffer size. There is definitely a diminishing return as the buffer size increases. For the 250K example, a 100K buffer would require three reads, while a default 512 byte buffer requires 500 reads. Further increasing the buffer size so only one read is needed is unlikely to make a significant performance improvement over three reads.

I tested the following on 28311552 lines of input. It's 10 times faster than your code. What it does is read a large block at once, then finishes up to the next newline. The goal here is to reduce I/O costs, since scanf() is reading a character at a time. Even with stdio, the buffer is likely too small.
Once the block is ready, I parse the numbers directly in memory.
This isn't the most elegant of codes, and I might have some edge cases a bit off, but it's enough to get you going with a faster approach.
Here are the timings (without the optimizer my solution is only about 6-7 times faster than your original reference)
[xavier:~/tmp] dalke% g++ -O3 my_solution.cpp
[xavier:~/tmp] dalke% time ./a.out < c.dat
15728647
0.284u 0.057s 0:00.39 84.6% 0+0k 0+1io 0pf+0w
[xavier:~/tmp] dalke% g++ -O3 your_solution.cpp
[xavier:~/tmp] dalke% time ./a.out < c.dat
15728647
3.585u 0.087s 0:03.72 98.3% 0+0k 0+0io 0pf+0w
Here's the code.
#include <iostream>
#include <stdio.h>
using namespace std;
const int BUFFER_SIZE=400000;
const int EXTRA=30; // well over the size of an integer
void read_to_newline(char *buffer) {
int c;
while (1) {
c = getc_unlocked(stdin);
if (c == '\n' || c == EOF) {
*buffer = '\0';
return;
}
*buffer++ = c;
}
}
int main() {
char buffer[BUFFER_SIZE+EXTRA];
char *end_buffer;
char *startptr, *endptr;
//n is number of integers to perform calculation on
//k is the divisor
//inputnum is the number to be divided by k
//total is the total number of inputnums divisible by k
int n,k,inputnum,total,nbytes;
//initialize total to zero
total=0;
//read in n and k from stdin
read_to_newline(buffer);
sscanf(buffer, "%i%i",&n,&k);
while (1) {
// Read a large block of values
// There should be one integer per line, with nothing else.
// This might truncate an integer!
nbytes = fread(buffer, 1, BUFFER_SIZE, stdin);
if (nbytes == 0) {
cerr << "Reached end of file too early" << endl;
break;
}
// Make sure I read to the next newline.
read_to_newline(buffer+nbytes);
startptr = buffer;
while (n>0) {
inputnum = 0;
// I had used strtol but that was too slow
// inputnum = strtol(startptr, &endptr, 10);
// Instead, parse the integers myself.
endptr = startptr;
while (*endptr >= '0') {
inputnum = inputnum * 10 + *endptr - '0';
endptr++;
}
// *endptr might be a '\n' or '\0'
// Might occur with the last field
if (startptr == endptr) {
break;
}
// skip the newline; go to the
// first digit of the next number.
if (*endptr == '\n') {
endptr++;
}
// Test if this is a factor
if (inputnum % k==0) total += 1;
// Advance to the next number
startptr = endptr;
// Reduce the count by one
n--;
}
// Either we are done, or we need new data
if (n==0) {
break;
}
}
// output value of total
printf("%i\n",total);
return 0;
}
Oh, and it very much assumes the input data is in the right format.

try to replace if statement with count += ((n%k)==0);. that might help little bit.
but i think you really need to buffer your input into temporary array. reading one integer from input at a time is expensive. if you can separate data acquisition and data processing, compiler may be able to generate optimized code for mathematical operations.

The I/O operations are bottleneck. Try to limit them whenever you can, for instance load all data to a buffer or array with buffered stream in one step.
Although your example is so simple that I hardly see what you can eliminate - assuming it's a part of the question to do subsequent reading from stdin.
A few comments to the code: Your example doesn't make use of any streams - no need to include iostream header. You already load C library elements to global namespace by including stdio.h instead of C++ version of the header cstdio, so using namespace std not necessary.

You can read each line with gets(), and parse the strings yourself without scanf(). (Normally I wouldn't recommend gets(), but in this case, the input is well-specified.)
A sample C program to solve this problem:
#include <stdio.h>
int main() {
int n,k,in,tot=0,i;
char s[1024];
gets(s);
sscanf(s,"%d %d",&n,&k);
while(n--) {
gets(s);
in=s[0]-'0';
for(i=1; s[i]!=0; i++) {
in=in*10 + s[i]-'0'; /* For each digit read, multiply the previous
value of in with 10 and add the current digit */
}
tot += in%k==0; /* returns 1 if in%k is 0, 0 otherwise */
}
printf("%d\n",tot);
return 0;
}
This program is approximately 2.6 times faster than the solution you gave above (on my machine).

You could try to read input line by line and use atoi() for each input row. This should be a little bit faster than scanf, because you remove the "scan" overhead of the format string.

I think the code is fine. I ran it on my computer in less than 0.3s
I even ran it on much larger inputs in less than a second.
How are you timing it?
One small thing you could do is remove the if statement.
start with total=n and then inside the loop:
total -= int( (input % k) / k + 1) //0 if divisible, 1 if not

Though I doubt CodeChef will accept it, one possibility is to use multiple threads, one to handle the I/O, and another to process the data. This is especially effective on a multi-core processor, but can help even with a single core. For example, on Windows you code use code like this (no real attempt at conforming with CodeChef requirements -- I doubt they'll accept it with the timing data in the output):
#include <windows.h>
#include <process.h>
#include <iostream>
#include <time.h>
#include "queue.hpp"
namespace jvc = JVC_thread_queue;
struct buffer {
static const int initial_size = 1024 * 1024;
char buf[initial_size];
size_t size;
buffer() : size(initial_size) {}
};
jvc::queue<buffer *> outputs;
void read(HANDLE file) {
// read data from specified file, put into buffers for processing.
//
char temp[32];
int temp_len = 0;
int i;
buffer *b;
DWORD read;
do {
b = new buffer;
// If we have a partial line from the previous buffer, copy it into this one.
if (temp_len != 0)
memcpy(b->buf, temp, temp_len);
// Then fill the buffer with data.
ReadFile(file, b->buf+temp_len, b->size-temp_len, &read, NULL);
// Look for partial line at end of buffer.
for (i=read; b->buf[i] != '\n'; --i)
;
// copy partial line to holding area.
memcpy(temp, b->buf+i, temp_len=read-i);
// adjust size.
b->size = i;
// put buffer into queue for processing thread.
// transfers ownership.
outputs.add(b);
} while (read != 0);
}
// A simplified istrstream that can only read int's.
class num_reader {
buffer &b;
char *pos;
char *end;
public:
num_reader(buffer *buf) : b(*buf), pos(b.buf), end(pos+b.size) {}
num_reader &operator>>(int &value){
int v = 0;
// skip leading "stuff" up to the first digit.
while ((pos < end) && !isdigit(*pos))
++pos;
// read digits, create value from them.
while ((pos < end) && isdigit(*pos)) {
v = 10 * v + *pos-'0';
++pos;
}
value = v;
return *this;
}
// return stream status -- only whether we're at end
operator bool() { return pos < end; }
};
int result;
unsigned __stdcall processing_thread(void *) {
int value;
int n, k;
int count = 0;
// Read first buffer: n & k followed by values.
buffer *b = outputs.pop();
num_reader input(b);
input >> n;
input >> k;
while (input >> value && ++count < n)
result += ((value %k ) == 0);
// Ownership was transferred -- delete buffer when finished.
delete b;
// Then read subsequent buffers:
while ((b=outputs.pop()) && (b->size != 0)) {
num_reader input(b);
while (input >> value && ++count < n)
result += ((value %k) == 0);
// Ownership was transferred -- delete buffer when finished.
delete b;
}
return 0;
}
int main() {
HANDLE standard_input = GetStdHandle(STD_INPUT_HANDLE);
HANDLE processor = (HANDLE)_beginthreadex(NULL, 0, processing_thread, NULL, 0, NULL);
clock_t start = clock();
read(standard_input);
WaitForSingleObject(processor, INFINITE);
clock_t finish = clock();
std::cout << (float)(finish-start)/CLOCKS_PER_SEC << " Seconds.\n";
std::cout << result;
return 0;
}
This uses a thread-safe queue class I wrote years ago:
#ifndef QUEUE_H_INCLUDED
#define QUEUE_H_INCLUDED
namespace JVC_thread_queue {
template<class T, unsigned max = 256>
class queue {
HANDLE space_avail; // at least one slot empty
HANDLE data_avail; // at least one slot full
CRITICAL_SECTION mutex; // protect buffer, in_pos, out_pos
T buffer[max];
long in_pos, out_pos;
public:
queue() : in_pos(0), out_pos(0) {
space_avail = CreateSemaphore(NULL, max, max, NULL);
data_avail = CreateSemaphore(NULL, 0, max, NULL);
InitializeCriticalSection(&mutex);
}
void add(T data) {
WaitForSingleObject(space_avail, INFINITE);
EnterCriticalSection(&mutex);
buffer[in_pos] = data;
in_pos = (in_pos + 1) % max;
LeaveCriticalSection(&mutex);
ReleaseSemaphore(data_avail, 1, NULL);
}
T pop() {
WaitForSingleObject(data_avail,INFINITE);
EnterCriticalSection(&mutex);
T retval = buffer[out_pos];
out_pos = (out_pos + 1) % max;
LeaveCriticalSection(&mutex);
ReleaseSemaphore(space_avail, 1, NULL);
return retval;
}
~queue() {
DeleteCriticalSection(&mutex);
CloseHandle(data_avail);
CloseHandle(space_avail);
}
};
}
#endif
Exactly how much you gain from this depends on the amount of time spent reading versus the amount of time spent on other processing. In this case, the other processing is sufficiently trivial that it probably doesn't gain much. If more time was spent on processing the data, multi-threading would probably gain more.

2.5mb/sec is 400ns/byte.
There are two big per-byte processes, file input and parsing.
For the file input, I would just load it into a big memory buffer. fread should be able to read that in at roughly full disc bandwidth.
For the parsing, sscanf is built for generality, not speed. atoi should be pretty fast. My habit, for better or worse, is to do it myself, as in:
#define DIGIT(c)((c)>='0' && (c) <= '9')
bool parsInt(char* &p, int& num){
while(*p && *p <= ' ') p++; // scan over whitespace
if (!DIGIT(*p)) return false;
num = 0;
while(DIGIT(*p)){
num = num * 10 + (*p++ - '0');
}
return true;
}
The loops, first over leading whitespace, then over the digits, should be nearly as fast as the machine can go, certainly a lot less than 400ns/byte.

Dividing two large numbers is hard. Perhaps an improvement would be to first characterize k a little by looking at some of the smaller primes. Let's say 2, 3, and 5 for now. If k is divisible by any of these, than inputnum also needs to be or inputnum is not divisible by k. Of course there are more tricks to play (you could use bitwise and of inputnum to 1 to determine whether you are divisible by 2), but I think just removing the low prime possibilities will give a reasonable speed improvement (worth a shot anyway).

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js