Improving performance for a TM simulator


I am trying to simulate a lot of 2 state, 3 symbol (One direction tape) Turing machines. Each simulation will have different input, and will run for a fixed number of steps. The current bottleneck in the program seems to be the simulator, taking a ton of memory on Turing machines which do not halt.
The task is to simulate about 650000 TMs, each with about 200 non-blank inputs. The largest number of steps I am trying is 1 billion (10**9).
Below is the code I am running. vector<vector<int> > TM is a transition table.
vector<int> fast_simulate(vector<vector<int> > TM, string TM_input, int steps) {
/* Return the state reached after supplied steps */
vector<int> tape = itotape(TM_input);
int head = 0;
int current_state = 0;
int halt_state = 2;
for(int i = 0; i < steps; i++){
// Read from tape
if(head >= tape.size()) {
tape.push_back(2);
}
int cell = tape[head];
int data = TM[current_state][cell]; // get transition for this state/input
int move = data % 2;
int write = (data % 10) % 3;
current_state = data / 10;
if(current_state == halt_state) {
// This highlights the last place that is written to in the tape
tape[head] = 4;
vector<int> res = shorten_tape(tape);
res.push_back(i+1);
return res;
}
// Write to tape
tape[head] = write;
// move head
if(move == 0) {
if(head != 0) {
head--;
}
} else {
head++;
}
}
vector<int> res {-1};
return res;
}
vector<int> itotape(string TM_input) {
vector<int> tape;
for(char &c : TM_input) {
tape.push_back(c - '0');
}
return tape;
}
vector<int> shorten_tape(vector<int> tape) {
/* Shorten the tape by removing unnecessary 2's (blanks) from the end of it.
*/
int i = tape.size()-1;
for(; i >= 0; --i) {
if(tape[i] != 2) {
tape.resize(i+1);
return tape;
}
}
return tape;
}
Is there anywhere I can make improvements in terms of performance or memory usage? Even a 2% decrease would make a noticeable difference.

Make sure no allocations happen during the whole TM simulation.
Preallocate a single global array at program startup that is big enough for any state of the tape (e.g. 10^8 elements). Put the machine at the beginning of this tape array initially. Maintain the segment [0; R] of all cells visited by the current simulation: this lets you reset only that segment, rather than the whole tape array, when you start the next simulation.
Use the smallest integer type for tape elements which is enough (e.g. use unsigned char if the alphabet surely has less than 256 characters). Perhaps you can even switch to bitsets if alphabet is very small. This reduces memory footprint and improves cache/RAM performance.
Avoid using generic integer divisions in the innermost loop (they are slow), use only divisions by powers-of-two (they turn into bit shifts). As the final optimization, you may try to remove all branches from the innermost loop (there are various clever techniques for this).
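A minimal sketch of these suggestions. It assumes a hypothetical transition encoding (bits 0-1: symbol to write, bit 2: move right, bits 3-4: next state) instead of the decimal packing used in the question, and the names `kTapeSize`, `g_tape`, `pack` and `simulate` are illustrative only:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

constexpr std::size_t kTapeSize = 1u << 20;  // assumed bound on cells ever visited
constexpr std::uint8_t kBlank = 2;
constexpr int kHaltState = 2;

static std::uint8_t g_tape[kTapeSize];       // allocated once, reused by every run
static std::size_t g_dirty = kTapeSize;      // cells [0, g_dirty) must be reset before a run

// Hypothetical packing: bits 0-1 = write symbol, bit 2 = move right, bits 3-4 = next state.
inline std::uint8_t pack(int write, int move_right, int next) {
    return static_cast<std::uint8_t>(write | (move_right << 2) | (next << 3));
}

// Returns the number of steps taken to halt, or -1 if the step limit was reached.
long long simulate(const std::uint8_t tm[2][3], const std::string& input, long long steps) {
    assert(input.size() < kTapeSize);
    // Reset only the segment the previous run touched (the whole tape on the first run).
    std::memset(g_tape, kBlank, g_dirty);
    for (std::size_t i = 0; i < input.size(); ++i)
        g_tape[i] = static_cast<std::uint8_t>(input[i] - '0');
    std::size_t used = input.empty() ? 1 : input.size();  // cells in use by this run
    std::size_t head = 0;
    int state = 0;
    long long result = -1;
    for (long long i = 0; i < steps; ++i) {
        const std::uint8_t t = tm[state][g_tape[head]];
        state = t >> 3;                                    // shifts and masks, no division
        if (state == kHaltState) { result = i + 1; break; }
        g_tape[head] = static_cast<std::uint8_t>(t & 3u);
        if (t & 4u) {
            ++head;
            assert(head < kTapeSize);
            if (head >= used) used = head + 1;
        } else if (head != 0) {
            --head;
        }
    }
    g_dirty = used;  // remember what the next run has to clear
    return result;
}
```

The sketch only returns the step count; returning the final tape, as the question does, would mean copying `g_tape[0 .. used)` once at the end.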

Here is another answer with more algorithmic approaches.
Simulation by blocks
Since you have tiny alphabet and tiny number of states, you can accelerate the simulation by processing chunks of the tape at once. This is related to the well-known speedup theorem, although I suggest a slightly different method.
Divide the tape into blocks of 8 characters each. Each such block can be represented with 16-bit number (2 bits per character). Now imagine that the machine is located either at the first or at the last character of a block. Then its subsequent behavior depends only on its initial state and the initial value on the block, until the TM moves out of the block (either to the left or to the right). We can precompute the outcome for all (block value + state + end) combinations, or maybe lazily compute them during simulation.
This method can simulate about 8 steps at once, although if you are unlucky it can do only one step per iteration (moving back and forth around block boundary). Here is the code sample:
//R = table[s][e][V] --- outcome for TM which:
// starts in state s
// runs on a tape block with initial contents V
// starts on the (e = 0: leftmost, e = 1: rightmost) char of the block
//The value R is a bitmask encoding:
// 0..15 bits: the new value of the block
// 16..17 bits: the new state
// 18 bit: TM moved to the (0: left, 1: right) of the block
// ??encode number of steps taken??
uint32_t table[2][2][1<<16];
//contents of the tape (grouped in 8-character blocks)
uint16_t tape[...];
int pos = 0; //index of current block
int end = 0; //TM is currently located at (0: start, 1: end) of the block
int state = 0; //current state
while (state != 2) {
//take the outcome of simulation on the current block
uint32_t res = table[state][end][tape[pos]];
//decode it into parts
uint16_t newValue = res & 0xFFFFU;
int newState = (res >> 16) & 3U;
int move = (res >> 18) & 1U; //mask, so higher bits stay free (e.g. for a step count)
//write new contents to the tape
tape[pos] = newValue;
//switch to the new state
state = newState;
//move to the neighboring block
pos += (2*move-1);
end = !move;
//avoid getting out of tape on the left: stay on the first cell of block 0
if (pos < 0)
pos = 0, end = 0;
}
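The table entries themselves can be filled by simulating the machine inside one block. The following is a rough sketch under stated assumptions: `Transition`, `run_block` and the extra "trapped" flag in bit 19 are hypothetical names and conventions, not part of the answer above, and the halting transition here writes its symbol (the question's code writes a marker instead).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical unpacked transition; the question stores these as packed ints.
struct Transition {
    std::uint8_t write;  // symbol to write (0..2)
    std::uint8_t right;  // 1 = move right, 0 = move left
    std::uint8_t next;   // next state; 2 = halt
};

constexpr int kCells = 8;                     // characters per block, 2 bits each
constexpr int kHalt = 2;
constexpr std::uint32_t kTrapped = 1u << 19;  // assumed flag: the head never leaves the block

// Computes one table entry: run from `state` on block contents `value`,
// starting at the leftmost (end = 0) or rightmost (end = 1) character.
std::uint32_t run_block(const Transition tm[2][3], int state, int end, std::uint16_t value) {
    assert(state == 0 || state == 1);
    std::uint32_t v = value;
    int pos = end ? kCells - 1 : 0;
    // 2 states * 8 head positions * 3^8 contents: running longer means a repeated configuration.
    const int kMaxConfigs = 2 * 8 * 6561;
    for (int step = 0; step < kMaxConfigs; ++step) {
        const std::uint32_t sym = (v >> (2 * pos)) & 3u;
        assert(sym < 3);
        const Transition& t = tm[state][sym];
        v = (v & ~(3u << (2 * pos))) | (std::uint32_t(t.write) << (2 * pos));
        state = t.next;
        if (state == kHalt) return v | (std::uint32_t(kHalt) << 16);
        pos += t.right ? 1 : -1;
        if (pos < 0) return v | (std::uint32_t(state) << 16);                   // left the block to the left
        if (pos >= kCells) return v | (std::uint32_t(state) << 16) | (1u << 18);  // left to the right
    }
    return v | (std::uint32_t(state) << 16) | kTrapped;
}
```

A machine that is trapped inside a single block can never halt, so the flag doubles as a cheap non-halting detector for that case.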
Halting problem
The comment says that TM simulation is expected either to finish very early, or to run all the steps up to the predefined huge limit. Since you are going to simulate many Turing machines, it might be worth investing some time in solving the halting problem.
The first type of hanging which can be detected is: when machine stays at the same place without moving far away from it. Let's maintain surrounding of TM during simulation, which is the values of segment of characters at distance < 16 from TM's current location. If you have 3 characters, you can encode surrounding in a 62-bit number.
Maintain a hash table for each position of TM (as we'll see later, only 31 tables are necessary). After each step, store tuple (state, surrounding) in the hash table of current position. Now the important part: after each move, clear all hash tables at distance >= 16 from TM (actually, only one such hash table has to be cleared). Before each step, check if (state, surrounding) is already present in the hash table. If it is, then the machine is in infinite loop.
You can also detect another type of hanging: when machine moves to the right infinitely, but never returns back. In order to achieve that, you can use the same hashtables. If TM is located at the currently last character of the tape with index p, check current tuple (state, surrounding) not only in the p-th hashtable, but also in the (p-1)-th, (p-2)-th, ..., (p-15)-th hash tables. If you find a match, then TM is in infinite loop moving to the right.

Change
int move = data % 2;
To
int move = data & 1;
The first is a division, the second is a bitmask; both give 0 or 1 based on the low bit (for non-negative values). You can do this any time you have % by a power of two.
You're also setting
cell = tape[head];
data = TM[current_state][cell];
int move = data % 2;
int write = (data % 10) % 3;
current_state = data / 10;
Every single step, regardless of whether tape[head] has changed and even on branches where you're not accessing those values at all. Take a careful look at which branches use which data, and only update things just as they're needed. See straight after that you write:
if(current_state == halt_state) {
// This highlights the last place that is written to in the tape
tape[head] = 4;
vector<int> res = shorten_tape(tape);
res.push_back(i+1);
return res;
}
^ This code doesn't reference "move" or "write", so you can put the calculation for "move"/"write" after it and only calculate them if current_state != halt_state
Also, compilers often lay out the true branch of an if statement as the fall-through path. By checking for "not the halt state" first and putting the halt handling in the else branch, you may improve CPU branch prediction a little, although the effect depends on the compiler and the CPU.

Related

for-loop behaviour changing from seemingly unrelated code

The code example is used to create a binary counter using a row of LEDs. I was checking out an online tutorial that issued this as a challenge. My goal was to do this in a way that is scalable.
The below code is functioning, but I have an issue. Within the setup function, I call the Serial.begin function which I was using for logging while writing the code. As it is, the code loops through from 0 to 16, flashing the correct corresponding LED.
When I remove the Serial.begin line, the loop breaks, but at a weird point. It goes all the way through to 16 once (ie all 4 LEDs lit) and then it loops back around, and then gets stuck flashing just the one LED (indicating 1). The puzzling thing to me is obviously since the loop starts at 0, it's actually going through to the second iteration of the loop when it fails.
I'm not otherwise using any other Serial functions and since it works with Serial.begin, I feel like it's mathematically sound. It's leading me to think this is something Arduino specific and I'd really like to understand what's happening here to produce the different results.
I'm also new to C++ and Arduino in general, so general advice or feedback is also appreciated!
/*
* The following code runs an Arduino powering 4 LEDs counting in binary
* Mission was to do this dynamically using for-loops, so that it is scalable for adding say, more LEDs
* To add an LED, all one needs to do is update the variable 'bits', increase the size of the 'ledArray' and 'myArray' and assign Arduino pins to the new LEDs
*/
// using Arduino pins 3, 5, 6 & 9. While I'm using PWM pins, this code uses digital write and there's no need to stick to these for the purpose of this code
int led1 = 3;
int led2 = 5;
int led3 = 6;
int led4 = 9;
int bits = 4; // the number of bits AKA the number of LEDs in the circuit
int del = 350; // the interval for each flash
int topNumber = pow(2,bits); // define the decimal number that can be counted to on a given set.
int ledArray[4] = {led1, led2, led3, led4}; // this array is made up of the Arduino pins. The array size should be the same as the number of bits
int myArray[4] = {0}; // this array is the array that creates the binary string. The array size should be the same as the number of bits
void setup() {
Serial.begin(9600); // Getting some odd behaviour with this :/ originally here for troubleshooting
pinMode(led1, OUTPUT);
pinMode(led2, OUTPUT);
pinMode(led3, OUTPUT);
pinMode(led4, OUTPUT);
}
void loop() {
// the following for-loop iterates each number in the set from 0 to the ceiling (topNumber)
for (int n = 0; n <= topNumber; n++) {
int tempValue = n; // set a temporary variable for n - we're going to manipulate n to format our binary, but we want to come back to this value before this loop concludes to increment on it
// the following for-loop builds an array to present the binary string eg 1,0,0,1 for 1001
for (int i = 0; n > 0; i++) {
myArray[i] = n % 2;
n /= 2;
}
// the following for-loop matches the leds in the ledArray with the binary array and sets the voltage high as required
for (int i = 0; i <= bits; i++) {
if(myArray[i] == 1) {
digitalWrite(ledArray[i], HIGH);
} else {
}
}
delay(del);
// the following for-loop iterates through the ledArray and switches them all off after completion
for (int i = 0; i <= bits; i++) {
digitalWrite(ledArray[i],LOW);
}
n = tempValue; // return n to its original value so that the for-loop will iterate as intended
}
}

n <= topNumber should be n < topNumber.
Here's what the as-written code does:
topNumber = 2 raised to the 4th power, that is 16. 16 in binary is 10000; note that that uses 5 bits rather than 4. When the loop()'s for() loop reaches n = 16, it writes into myArray[4], which doesn't exist.
Because C and C++ don't protect against out of bounds array accesses, that loop writes a value into some other place in memory. My guess is that adding or removing the Serial.begin() line likely rearranges memory a little, which changes the effect of writing into that non-existent myArray[4].
One C language thing you probably know: arrays declare the number of items in the array; not the index of the last member of the array. So myArray[4] declares an array with myArray[0], myArray[1], myArray[2], and myArray[3]. There is no myArray[4].
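A minimal sketch of indexing that stays in bounds, written as plain C++ rather than an Arduino sketch so it can be run anywhere. The names `kBits`, `kTop` and `to_bits` are illustrative. Note that the same `<` bound applies to the two loops over `ledArray` in the question, which use `i <= bits` and therefore also touch index 4.

```cpp
#include <array>
#include <cassert>

constexpr int kBits = 4;          // number of LEDs
constexpr int kTop = 1 << kBits;  // 16: integer shift instead of pow(), which works in floating point

// Returns the binary digits of n, least significant bit first.
std::array<int, kBits> to_bits(int n) {
    assert(n >= 0 && n < kTop);           // valid values are 0..15, never 16
    std::array<int, kBits> out{};         // all zeros, so stale digits cannot survive
    for (int i = 0; i < kBits; ++i)       // i < kBits, so the last index used is kBits - 1
        out[i] = (n >> i) & 1;
    return out;
}
```

Building a fresh array for every `n` also avoids leftover 1s from a previous number, which the question's `myArray` would keep because it is never cleared.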

Function to count time with precision less than a millisecond

I have a function here that can make a program count, wait, etc. with a resolution of 1 millisecond. But I was wondering if I can do the same with finer resolution. I have read other answers, but they are mostly about switching to Linux or say that sleep is a guesstimate, and what's more, those answers are around a decade old, so maybe a new function has appeared since then.
Here's function-
void sleep(unsigned int mseconds)
{
clock_t goal = mseconds + clock();
while (goal > clock());
}
Actually, I was trying to make a function similar to secure_compare, but I don't think it is a wise idea to waste 1 millisecond (the current resolution) on just comparing two strings.
Here is function i made for the same -
bool secure_compare(string a,string b){
clock_t limit=wait + clock(); //limit of time program can take to compare
bool x = (a==b);
if(clock()>limit){ //if time taken to compare is more increase wait so it takes this new max time for other comparisons too
wait = clock()-limit;
cout<<"Error";
secure_compare(a,b);
}
while(clock()<limit); //finishing time left to make it constant time function
return x;
}

You're trying to make a comparison function time-independent. There are basically two ways to do this:
Measure the time taken for the call and sleep the appropriate amount
This might only swap out one side channel (timing) with another (power consumption, since sleeping and computation might have different power usage characteristics).
Make the control flow more data-independent:
Instead of using the normal string comparison, you could implement your own comparison that compares all characters and not just up until the first mismatch, like this:
bool match = (a.size() == b.size()); // strings of different length never match
size_t min_length = min(a.size(), b.size());
for (size_t i = 0; i < min_length; ++i) {
match &= (a[i] == b[i]);
}
return match;
Here, no branching (conditional operations) takes place, so every call of this method with strings of the same length should take roughly the same time. So the only side-channel information you leak is the length of the strings you compare, but that would be difficult to hide anyways, if they are of arbitrary length.
EDIT: Incorporating Passer By's comment:
If we want to reduce the size leakage, we could try to round the size up and clamp the index values.
bool match = (a.size() == b.size()); // strings of different length never match
size_t min_length = min(a.size(), b.size());
size_t rounded_length = (min_length + 1023) / 1024 * 1024;
for (size_t i = 0; i < rounded_length; ++i) {
size_t clamped_i = min(i, min_length - 1);
match &= (a[clamped_i] == b[clamped_i]);
}
return match;
There might be a tiny cache timing sidechannel present (because we don't get any more cache misses if i > clamped_i), but since a and b should be in the cache hierarchy anyways, I doubt the difference is usable in any way.
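Put together as a complete function, the data-independent comparison could look like the sketch below. The name `equal_no_early_exit` is illustrative, and this is not a vetted cryptographic primitive: an optimizing compiler is still free to transform the loop, so treat it as an illustration of the idea only.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Compares every character of the common prefix, with no early exit on a mismatch.
bool equal_no_early_exit(const std::string& a, const std::string& b) {
    // A length difference is folded into the same accumulator as the character differences.
    std::size_t diff = a.size() ^ b.size();
    const std::size_t n = a.size() < b.size() ? a.size() : b.size();
    for (std::size_t i = 0; i < n; ++i)
        diff |= static_cast<unsigned char>(a[i]) ^ static_cast<unsigned char>(b[i]);
    return diff == 0;
}
```

Accumulating with XOR and OR instead of `&=` on a bool keeps the loop body free of comparisons whose result depends on the data.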

Efficient index bound check and double to int cast

Consider the following code snippet
double *x, *id;
int i, n; // = vector size
// allocate and zero x
// set id to 0:n-1
for(i=0; i<n; i++) {
long iid = (long)id[i];
if(iid>=0 && iid<n && (double)iid==id[i]){
x[iid] = 1;
} else break;
}
The code uses values in vector id of type double as indices into vector x. In order for the indices to be valid I verify that they are greater than or equal to 0, less than vector size n, and that doubles stored in id are in fact integers. In this example id stores integers from 0 to n-1, so all vectors are accessed linearly and branch prediction of the if statement should always work.
For n=1e8 the code takes 0.21s on my computer. Since it seems to me it is a computationally light-weight loop, I expect it to be memory bandwidth bounded. Based on the benchmarked memory bandwidth I expect it to run in 0.15s. I calculate the memory footprint as 8 bytes per id value, and 16 bytes per x value (it needs to be both written, and read from memory since I assume SSE streaming is not used). So a total of 24 bytes per vector entry.
The questions:
Am I wrong saying that this code should be memory bandwidth bounded, and that it can be improved?
If not, do you know a way in which I could improve the performance so that it works with the speed of the memory?
Or maybe everything is fine and I can not easily improve it otherwise than running it in parallel?
Changing the type of id is not an option - it must be double. Also, in the general case id and x have different sizes and must be kept as separate arrays - they come from different parts of the program. In short, I wonder if it is possible to write the bound checks and the type cast/integer validation in a more efficient manner.
For convenience, the entire code:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h> /* gettimeofday, struct timeval */
static struct timeval tb, te;
void tic()
{
gettimeofday(&tb, NULL);
}
void toc(const char *idtxt)
{
long s,u;
gettimeofday(&te, NULL);
s=te.tv_sec-tb.tv_sec;
u=te.tv_usec-tb.tv_usec;
printf("%-30s%10li.%.6li\n", idtxt,
(s*1000000+u)/1000000, (s*1000000+u)%1000000);
}
int main(int argc, char *argv[])
{
double *x = NULL;
double *id = NULL;
int i, n;
// vector size is a command line parameter
n = atoi(argv[1]);
printf("x size %i\n", n);
// not included in timing in MATLAB
x = calloc(sizeof(double),n);
memset(x, 0, sizeof(double)*n);
// create index vector
tic();
id = malloc(sizeof(double)*n);
for(i=0; i<n; i++) id[i] = i;
toc("id = 1:n");
// use id to index x and set all entries to 1
tic();
for(i=0; i<n; i++) {
long iid = (long)id[i];
if(iid>=0 && iid<n && (double)iid==id[i]){
x[iid] = 1;
} else break;
}
toc("x(id) = 1");
}

EDIT: Disregard if you can't split the arrays!
I think it can be improved by taking advantage of a common cache concept. You can either make data accesses close in time or location. With tight for-loops, you can achieve a better data hit-rate by shaping your data structures like your for-loop. In this case, you access two different arrays, usually the same indices in each array. Your machine is loading chunks of both arrays each iteration through that loop. To increase the use of each load, create a structure to hold an element of each array, and create a single array with that struct:
struct my_arrays
{
double x;
int id;
};
struct my_arrays* arr = malloc(sizeof(struct my_arrays)*n);
Now, each time you load data into cache, you'll hit everything you load because the arrays are close together.
EDIT: Since your intent is to check for an integer value, and you make the explicit assumption that the values are small enough to be represented precisely in a double with no loss of precision, then I think your comparison is fine.
My previous answer had a reference to beware comparing large doubles after implicit casting, and I referenced this:
What is the most effective way for float and double comparison?

It might be worth considering examination of double type representation.
For example, the following code checks whether a double greater than 1 holds an integer value below 999:
bool check(double x)
{
union
{
double d;
uint32_t y[2];
};
d = x;
bool answer;
uint32_t exp = (y[1] >> 20) & 0x3ff;
uint32_t fraction1 = y[1] << (13 + exp); // upper bits of fractional part
uint32_t fraction2 = y[0]; // lower 32 bits of fractional part
if (fraction2 != 0 || fraction1 != 0)
answer = false;
else if (exp > 8)
answer = false;
else if (exp == 8)
answer = (y[1] < 0x408f3800); // this is the representation of 999
else
answer = true;
return answer;
}
This looks like much code, but it might be vectorized easily (using e.g. SSE), and if your bound is a power of 2, it might simplify the code further.
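For comparison, the same kind of validation can be written without touching the bit pattern. This is a minimal sketch, assuming a hypothetical helper `valid_index`; it does the range check in floating point first, so the conversion to an integer is only performed on values that are known to fit.

```cpp
#include <cassert>
#include <cstddef>

// Returns true and stores the index if d is an integer in [0, n).
inline bool valid_index(double d, std::size_t n, std::size_t& out) {
    // Written as a negated conjunction so that NaN is rejected as well.
    if (!(d >= 0.0 && d < static_cast<double>(n))) return false;
    const std::size_t i = static_cast<std::size_t>(d);  // in range, so the cast is well defined
    if (static_cast<double>(i) != d) return false;      // d had a fractional part
    out = i;
    return true;
}
```

Whether this is faster than the original loop body is something only a measurement on the target machine can tell.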

Optimize indexed array summation

I have the following C++ code:
const int N = 1000000;
int id[N]; //Value can range from 0 to 9
float value[N];
// load id and value from an external source...
int size[10] = { 0 };
float sum[10] = { 0 };
for (int i = 0; i < N; ++i)
{
++size[id[i]];
sum[id[i]] += value[i];
}
How should I optimize the loop?
I considered using SSE to add every 4 floats to a sum and then after N iterations, the sum is just the sum of the 4 floats in the xmm register but this doesn't work when the source is indexed like this and needs to write out to 10 different arrays.

This kind of loop is very hard to optimize using SIMD instructions. Not only isn't there an easy way in most SIMD instruction sets to do this kind of indexed read ("gather") or write ("scatter"), even if there was, this particular loop still has the problem that you might have two values that map to the same id in one SIMD register, e.g. when
id[0] == 0
id[1] == 1
id[2] == 2
id[3] == 0
in this case, the obvious approach (pseudocode here)
x = gather(size, id[i]);
y = gather(sum, id[i]);
x += 1; // componentwise
y += value[i];
scatter(x, size, id[i]);
scatter(y, sum, id[i]);
won't work either!
You can get by if there's a really small number of possible cases (e.g. assume that sum and size only had 3 elements each) by just doing brute-force compares, but that doesn't really scale.
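A sketch of that brute-force variant for three possible ids, written as plain scalar C++ so the compiler can decide whether to vectorize it. The function name `sum_by_compare` and the bucket count `kSmall` are assumptions for illustration.

```cpp
#include <cassert>

constexpr int kSmall = 3;  // assumed small number of distinct ids

// One pass over the data per id: no indexed writes inside the inner loop.
void sum_by_compare(const int* id, const float* value, int n,
                    int size[kSmall], float sum[kSmall]) {
    for (int k = 0; k < kSmall; ++k) {
        int count = 0;
        float total = 0.0f;
        for (int i = 0; i < n; ++i) {
            const bool hit = (id[i] == k);
            count += hit;                    // adds 0 or 1
            total += hit ? value[i] : 0.0f;  // select instead of scatter
        }
        size[k] = count;
        sum[k] = total;
    }
}
```

The cost grows with the number of distinct ids, since the data is read once per id, which is why this does not scale to many buckets.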
One way to get this somewhat faster without using SIMD is by breaking up the dependencies between instructions a bit using unrolling:
int size[10] = { 0 }, size2[10] = { 0 };
float sum[10] = { 0 }, sum2[10] = { 0 };
for (int i = 0; i < N/2; i++) {
int id0 = id[i*2+0], id1 = id[i*2+1];
++size[id0];
++size2[id1];
sum[id0] += value[i*2+0];
sum2[id1] += value[i*2+1];
}
// if N was odd, process last element
if (N & 1) {
++size[id[N-1]];
sum[id[N-1]] += value[N-1];
}
// add partial sums together
for (int i = 0; i < 10; i++) {
size[i] += size2[i];
sum[i] += sum2[i];
}
Whether this helps or not depends on the target CPU though.

Well, you are reading id[i] twice in your loop. You could store it in a local variable, or a register int if you wanted to (on older compilers; the register keyword was deprecated in C++11 and removed in C++17).
register int index;
for(int i = 0; i < N; ++i)
{
index = id[i];
++size[index];
sum[index] += value[i];
}
The MSDN docs state this about register:
The register keyword specifies that the variable is to be stored in a machine register.
Microsoft Specific: The compiler does not accept user requests for register variables; instead, it makes its own register choices when global register-allocation optimization (/Oe option) is on. However, all other semantics associated with the register keyword are honored.

Something you can do is to compile it with the -S flag (or equivalent if you aren't using gcc) and compare the various assembly outputs using -O, -O2, and -O3 flags. One common way to optimize a loop is to do some degree of unrolling, for (a very simple, naive) example:
int end = N/2;
int index = 0;
for (int i = 0; i < end; ++i)
{
index = 2 * i;
++size[id[index]];
sum[id[index]] += value[index];
index++;
++size[id[index]];
sum[id[index]] += value[index];
}
which will cut the number of cmp instructions in half. However, any half-decent optimizing compiler will do this for you.

Are you sure it will make much difference? The likelihood is that the loading of "id from an external source" will take significantly longer than adding up the values.
Do not optimise until you KNOW where the bottleneck is.
Edit in answer to the comment: You misunderstand me. If it takes 10 seconds to load the ids from a hard disk then the fractions of a second spent on processing the list are immaterial in the grander scheme of things. Lets say it takes 10 seconds to load and 1 second to process:
If you optimise the processing loop so it takes 0 seconds (almost impossible, but it illustrates the point), the whole run STILL takes 10 seconds. 11 seconds really isn't that bad a performance hit compared to 10, and you would be better off focusing your optimisation time on the actual data load, as this is far more likely to be the slow part.
In fact it can be quite optimal to do double-buffered data loads, i.e. you load buffer 0, then you start the load of buffer 1. While buffer 1 is loading you process buffer 0. When finished, start the load of the next buffer while processing buffer 1, and so on. This way you can completely amortise the cost of processing.
Further edit: In fact your best optimisation would probably come from loading things into a set of buckets that eliminate the "id[i]" part of the calculation. You could then simply offload to 3 threads where each uses SSE adds. This way you could have them all going simultaneously and, provided you have at least a triple core machine, process the whole data in a 10th of the time. Organising data for optimal processing will always allow for the best optimisation, IMO.

Depending on your target machine and compiler, see if you have the _mm_prefetch intrinsic and give it a shot. Back in the Pentium D days, pre-fetching data using the asm instruction for that intrinsic was a real speed win as long as you were pre-fetching a few loop iterations before you needed the data.
See here (Page 95 in the PDF) for more info from Intel.

This computation is trivially parallelizable; just add
#pragma omp parallel for reduction(+:size[:10],sum[:10]) schedule(static)
immediately above the loop if you have OpenMP support (-fopenmp in GCC.) However, I would not expect much speedup on a typical multicore desktop machine; you're doing so little computation per item fetched that you're almost certainly going to be constrained by memory bandwidth.
If you need to perform the summation several times for a given id mapping (i.e. the value[] array changes more often than id[]), you can halve your memory bandwidth requirements by pre-sorting the value[] elements into id order and eliminating the per-element fetch from id[]:
for (i = 0, j = 0, k = 0; j < 10; sum[j] += tmp, j++)
for (k += size[j], tmp = 0; i < k; i++)
tmp += value[i];
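The pre-sorting step is not shown above; one way to do it is a counting sort keyed on id, sketched below. The helper names `group_by_id` and `segment_sums` are hypothetical, and the sketch assumes ids in the range 0..9 as in the question.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int kIds = 10;

// Counting sort: fills size[] and returns the values grouped by id, in id order.
std::vector<float> group_by_id(const std::vector<int>& id,
                               const std::vector<float>& value, int size[kIds]) {
    assert(id.size() == value.size());
    for (int j = 0; j < kIds; ++j) size[j] = 0;
    for (int x : id) {
        assert(x >= 0 && x < kIds);
        ++size[x];
    }
    int start[kIds];  // first output slot of each id
    int next = 0;
    for (int j = 0; j < kIds; ++j) {
        start[j] = next;
        next += size[j];
    }
    std::vector<float> sorted(value.size());
    for (std::size_t i = 0; i < id.size(); ++i)
        sorted[start[id[i]]++] = value[i];
    return sorted;
}

// Sums each contiguous segment; id[] is no longer read at all.
void segment_sums(const std::vector<float>& sorted, const int size[kIds], float sum[kIds]) {
    std::size_t i = 0;
    for (int j = 0; j < kIds; ++j) {
        float tmp = 0.0f;
        for (int c = 0; c < size[j]; ++c) tmp += sorted[i++];
        sum[j] = tmp;
    }
}
```

The grouping pass pays off only when the same id mapping is reused for several value arrays, as the answer notes; each new value array has to be permuted the same way before `segment_sums` can be applied.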

Can I make this C++ code faster without making it much more complex?

Here's a problem I've solved from a programming problem website (codechef.com, in case anyone doesn't want to see this solution before trying it themselves). This solved the problem in about 5.43 seconds with the test data; others have solved this same problem with the same test data in 0.14 seconds but with much more complex code. Can anyone point out specific areas of my code where I am losing performance? I'm still learning C++ so I know there are a million ways I could solve this problem, but I'd like to know if I can improve my own solution with some subtle changes rather than rewrite the whole thing. Or if there are any relatively simple solutions which are comparable in length but would perform better than mine, I'd be interested to see them also.
Please keep in mind I'm learning C++ so my goal here is to improve the code I understand, not just to be given a perfect solution.
Thanks
Problem:
The purpose of this problem is to verify whether the method you are using to read input data is sufficiently fast to handle problems branded with the enormous Input/Output warning. You are expected to be able to process at least 2.5MB of input data per second at runtime. Time limit to process the test data is 8 seconds.
The input begins with two positive integers n k (n, k<=10^7). The next n lines of input contain one positive integer ti, not greater than 10^9, each.
Output
Write a single integer to output, denoting how many integers ti are divisible by k.
Example
Input:
7 3
1
51
966369
7
9
999996
11
Output:
4
Solution:
#include <iostream>
#include <stdio.h>
using namespace std;
int main(){
//n is number of integers to perform calculation on
//k is the divisor
//inputnum is the number to be divided by k
//total is the total number of inputnums divisible by k
int n,k,inputnum,total;
//initialize total to zero
total=0;
//read in n and k from stdin
scanf("%i%i",&n,&k);
//loop n times and if k divides into inputnum, increment total
for (n; n>0; n--)
{
scanf("%i",&inputnum);
if(inputnum % k==0) total += 1;
}
//output value of total
printf("%i",total);
return 0;
}

The speed is not being determined by the computation—most of the time the program takes to run is consumed by i/o.
Add setvbuf calls before the first scanf for a significant improvement:
setvbuf(stdin, NULL, _IOFBF, 32768);
setvbuf(stdout, NULL, _IOFBF, 32768);
-- edit --
The alleged magic numbers are the new buffer size. The default FILE buffer is often small (512 bytes in this example; the actual size is implementation-defined). Increasing this size decreases the number of times that the C++ runtime library has to issue a read or write call to the operating system, which is by far the most expensive operation in your algorithm.
By keeping the buffer size a multiple of 512, that eliminates buffer fragmentation. Whether the size should be 1024*10 or 1024*1024 depends on the system it is intended to run on. For 16 bit systems, a buffer size larger than 32K or 64K generally causes difficulty in allocating the buffer, and maybe managing it. For any larger system, make it as large as useful—depending on available memory and what else it will be competing against.
Lacking any known memory contention, choose sizes for the buffers at about the size of the associated files. That is, if the input file is 250K, use that as the buffer size. There is definitely a diminishing return as the buffer size increases. For the 250K example, a 100K buffer would require three reads, while a default 512 byte buffer requires 500 reads. Further increasing the buffer size so only one read is needed is unlikely to make a significant performance improvement over three reads.
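A minimal sketch of where the call goes, using a hypothetical helper `count_divisible` that takes the stream as a parameter so it can be tried on a file as well as on stdin. The one firm requirement is that `setvbuf` is called before any other operation on the stream.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Reads "n k" followed by n integers and counts those divisible by k.
// Returns -1 if the header cannot be parsed.
long count_divisible(std::FILE* in, std::size_t buffer_bytes) {
    // Must come first: setvbuf is only valid before the stream has been read from.
    std::setvbuf(in, nullptr, _IOFBF, buffer_bytes);
    int n = 0, k = 0;
    if (std::fscanf(in, "%d %d", &n, &k) != 2 || k <= 0) return -1;
    long total = 0;
    for (; n > 0; --n) {
        int t = 0;
        if (std::fscanf(in, "%d", &t) != 1) break;  // input ended early
        total += (t % k == 0);
    }
    return total;
}
```

For the original program this would be called as `count_divisible(stdin, 32768)`; the buffer size is a tuning parameter, as discussed above.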

I tested the following on 28311552 lines of input. It's 10 times faster than your code. What it does is read a large block at once, then finishes up to the next newline. The goal here is to reduce I/O costs, since scanf() is reading a character at a time. Even with stdio, the buffer is likely too small.
Once the block is ready, I parse the numbers directly in memory.
This isn't the most elegant of codes, and I might have some edge cases a bit off, but it's enough to get you going with a faster approach.
Here are the timings (without the optimizer my solution is only about 6-7 times faster than your original reference)
[xavier:~/tmp] dalke% g++ -O3 my_solution.cpp
[xavier:~/tmp] dalke% time ./a.out < c.dat
15728647
0.284u 0.057s 0:00.39 84.6% 0+0k 0+1io 0pf+0w
[xavier:~/tmp] dalke% g++ -O3 your_solution.cpp
[xavier:~/tmp] dalke% time ./a.out < c.dat
15728647
3.585u 0.087s 0:03.72 98.3% 0+0k 0+0io 0pf+0w
Here's the code.
#include <iostream>
#include <stdio.h>
using namespace std;
const int BUFFER_SIZE=400000;
const int EXTRA=30; // well over the size of an integer
void read_to_newline(char *buffer) {
int c;
while (1) {
c = getc_unlocked(stdin);
if (c == '\n' || c == EOF) {
*buffer = '\0';
return;
}
*buffer++ = c;
}
}
int main() {
char buffer[BUFFER_SIZE+EXTRA];
char *end_buffer;
char *startptr, *endptr;
//n is number of integers to perform calculation on
//k is the divisor
//inputnum is the number to be divided by k
//total is the total number of inputnums divisible by k
int n,k,inputnum,total,nbytes;
//initialize total to zero
total=0;
//read in n and k from stdin
read_to_newline(buffer);
sscanf(buffer, "%i%i",&n,&k);
while (1) {
// Read a large block of values
// There should be one integer per line, with nothing else.
// This might truncate an integer!
nbytes = fread(buffer, 1, BUFFER_SIZE, stdin);
if (nbytes == 0) {
cerr << "Reached end of file too early" << endl;
break;
}
// Make sure I read to the next newline.
read_to_newline(buffer+nbytes);
startptr = buffer;
while (n>0) {
inputnum = 0;
// I had used strtol but that was too slow
// inputnum = strtol(startptr, &endptr, 10);
// Instead, parse the integers myself.
endptr = startptr;
while (*endptr >= '0') {
inputnum = inputnum * 10 + *endptr - '0';
endptr++;
}
// *endptr might be a '\n' or '\0'
// Might occur with the last field
if (startptr == endptr) {
break;
}
// skip the newline; go to the
// first digit of the next number.
if (*endptr == '\n') {
endptr++;
}
// Test if this is a factor
if (inputnum % k==0) total += 1;
// Advance to the next number
startptr = endptr;
// Reduce the count by one
n--;
}
// Either we are done, or we need new data
if (n==0) {
break;
}
}
// output value of total
printf("%i\n",total);
return 0;
}
Oh, and it very much assumes the input data is in the right format.

Try replacing the if statement with total += ((inputnum % k) == 0);. That might help a little bit.
But I think you really need to buffer your input into a temporary array. Reading one integer from input at a time is expensive. If you can separate data acquisition from data processing, the compiler may be able to generate optimized code for the mathematical operations.
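A minimal sketch of what separating acquisition from processing could look like, assuming the whole input is already in memory as text. The function names parse_ints and count_divisible are invented for this example, and it assumes non-negative decimal integers separated by whitespace.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Phase 1 (acquisition): turn an in-memory text buffer into integers.
std::vector<int> parse_ints(const std::string &text) {
    std::vector<int> values;
    std::size_t i = 0;
    while (i < text.size()) {
        // Skip anything that is not a digit.
        while (i < text.size() && (text[i] < '0' || text[i] > '9')) ++i;
        if (i == text.size()) break;
        int v = 0;
        while (i < text.size() && text[i] >= '0' && text[i] <= '9') {
            v = v * 10 + (text[i] - '0');
            ++i;
        }
        values.push_back(v);
    }
    return values;
}

// Phase 2 (processing): a tight loop with no I/O and no if statement.
int count_divisible(const std::vector<int> &values, int k) {
    int count = 0;
    for (std::size_t i = 0; i < values.size(); ++i)
        count += ((values[i] % k) == 0);
    return count;
}
```

The intermediate vector costs memory, so this trades space for a simpler inner loop. For very large inputs, parsing and counting in one pass may still be the better choice.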

The I/O operations are the bottleneck. Try to limit them whenever you can, for instance by loading all the data into a buffer or array in one step with a buffered stream.
That said, your example is so simple that I hardly see what you can eliminate, assuming that reading the values one after another from stdin is part of the question.
A few comments on the code: your example doesn't make use of any streams, so there is no need to include the iostream header. You already load the C library elements into the global namespace by including stdio.h instead of cstdio, the C++ version of the header, so using namespace std is not necessary.

You can read each line with fgets() and parse the strings yourself without scanf(). (gets() would be slightly shorter here since the input is well-specified, but it cannot check the buffer size and was removed from the language in C11.)
A sample C program to solve this problem:
#include <stdio.h>
int main() {
int n,k,in,tot=0,i;
char s[1024];
if(!fgets(s,sizeof s,stdin)) return 1;
sscanf(s,"%d %d",&n,&k);
while(n--) {
if(!fgets(s,sizeof s,stdin)) break;
in=0;
/* fgets() keeps the trailing newline, so stop at the first non-digit */
for(i=0; s[i]>='0' && s[i]<='9'; i++) {
in=in*10 + s[i]-'0'; /* For each digit read, multiply the previous
value of in with 10 and add the current digit */
}
tot += in%k==0; /* adds 1 if in%k is 0, 0 otherwise */
}
printf("%d\n",tot);
return 0;
}
This program is approximately 2.6 times faster than the solution you gave above (on my machine).

You could try to read input line by line and use atoi() for each input row. This should be a little bit faster than scanf, because you remove the "scan" overhead of the format string.
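A small sketch of the line-by-line atoi() idea, assuming one integer per line and lines shorter than 64 bytes. The names line_divisible and count_lines are invented for this example.

```cpp
#include <cstdio>
#include <cstdlib>

// Returns 1 if the integer at the start of `line` is divisible by k, else 0.
// atoi() stops at the first non-digit, so a trailing newline is harmless.
int line_divisible(const char *line, int k) {
    return std::atoi(line) % k == 0;
}

// Reads at most n lines from `in` and counts those divisible by k.
int count_lines(std::FILE *in, int n, int k) {
    char line[64];
    int total = 0;
    while (n-- > 0 && std::fgets(line, sizeof line, in))
        total += line_divisible(line, k);
    return total;
}
```

Note that atoi() does no error reporting, so this only makes sense when the input format can be trusted.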

I think the code is fine. I ran it on my computer in less than 0.3s
I even ran it on much larger inputs in less than a second.
How are you timing it?
One small thing you could do is remove the if statement.
Start with total=n and then inside the loop:
total -= (input % k != 0); // subtracts 0 if divisible, 1 if not

Though I doubt CodeChef will accept it, one possibility is to use multiple threads, one to handle the I/O, and another to process the data. This is especially effective on a multi-core processor, but can help even with a single core. For example, on Windows you could use code like this (no real attempt at conforming with CodeChef requirements -- I doubt they'll accept it with the timing data in the output):
#include <windows.h>
#include <process.h>
#include <iostream>
#include <time.h>
#include <string.h> // memcpy
#include <ctype.h> // isdigit
#include "queue.hpp"
namespace jvc = JVC_thread_queue;
struct buffer {
static const int initial_size = 1024 * 1024;
char buf[initial_size];
size_t size;
buffer() : size(initial_size) {}
};
jvc::queue<buffer *> outputs;
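void read(HANDLE file) {
// read data from specified file, put into buffers for processing.
//
char temp[32];
int temp_len = 0;
int filled;
int i;
buffer *b;
DWORD read;
do {
b = new buffer;
// If we have a partial line from the previous buffer, copy it into this one.
if (temp_len != 0)
memcpy(b->buf, temp, temp_len);
// Then fill the buffer with data.
read = 0;
ReadFile(file, b->buf+temp_len, b->size-temp_len, &read, NULL);
// Valid bytes: the carried-over partial line plus the newly read data.
filled = temp_len + (int)read;
// Look for partial line at end of buffer: i stops just past the last newline.
// At end of input (read == 0) whatever is left is treated as a complete line.
// Assumes every successful read contains at least one newline.
i = filled;
if (read != 0)
while (i > 0 && b->buf[i-1] != '\n')
--i;
// copy partial line to holding area (assumes it fits in temp).
temp_len = filled - i;
memcpy(temp, b->buf+i, temp_len);
// adjust size.
b->size = i;
// put buffer into queue for processing thread.
// transfers ownership.
outputs.add(b);
} while (read != 0);
// The processing thread stops when it pops an empty buffer, so send one
// if the last buffer still carried data.
if (i != 0) {
b = new buffer;
b->size = 0;
outputs.add(b);
}
}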
// A simplified istrstream that can only read int's.
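class num_reader {
buffer &b;
char *pos;
char *end;
bool ok; // did the last extraction find at least one digit?
public:
num_reader(buffer *buf) : b(*buf), pos(b.buf), end(pos+b.size), ok(true) {}
num_reader &operator>>(int &value){
int v = 0;
// skip leading "stuff" up to the first digit.
while ((pos < end) && !isdigit((unsigned char)*pos))
++pos;
// read digits, create value from them.
char *first_digit = pos;
while ((pos < end) && isdigit((unsigned char)*pos)) {
v = 10 * v + *pos-'0';
++pos;
}
// Report success by whether a number was read, so that a number ending
// exactly at the end of the buffer is not dropped.
ok = (pos != first_digit);
value = v;
return *this;
}
// return stream status -- whether the last read produced a number
operator bool() { return ok; }
};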
int result;
unsigned __stdcall processing_thread(void *) {
int value;
int n, k;
int count = 0;
// Read first buffer: n & k followed by values.
buffer *b = outputs.pop();
num_reader input(b);
input >> n;
input >> k;
while (input >> value && ++count <= n) // <= so the n-th value is tested too
result += ((value %k ) == 0);
// Ownership was transferred -- delete buffer when finished.
delete b;
// Then read subsequent buffers:
while ((b=outputs.pop()) && (b->size != 0)) {
num_reader input(b);
while (input >> value && ++count <= n) // <= so the n-th value is tested too
result += ((value %k) == 0);
// Ownership was transferred -- delete buffer when finished.
delete b;
}
return 0;
}
int main() {
HANDLE standard_input = GetStdHandle(STD_INPUT_HANDLE);
HANDLE processor = (HANDLE)_beginthreadex(NULL, 0, processing_thread, NULL, 0, NULL);
clock_t start = clock();
read(standard_input);
WaitForSingleObject(processor, INFINITE);
clock_t finish = clock();
std::cout << (float)(finish-start)/CLOCKS_PER_SEC << " Seconds.\n";
std::cout << result;
return 0;
}
This uses a thread-safe queue class I wrote years ago:
#ifndef QUEUE_H_INCLUDED
#define QUEUE_H_INCLUDED
#include <windows.h> // HANDLE, CRITICAL_SECTION, semaphore functions
namespace JVC_thread_queue {
template<class T, unsigned max = 256>
class queue {
HANDLE space_avail; // at least one slot empty
HANDLE data_avail; // at least one slot full
CRITICAL_SECTION mutex; // protect buffer, in_pos, out_pos
T buffer[max];
long in_pos, out_pos;
public:
queue() : in_pos(0), out_pos(0) {
space_avail = CreateSemaphore(NULL, max, max, NULL);
data_avail = CreateSemaphore(NULL, 0, max, NULL);
InitializeCriticalSection(&mutex);
}
void add(T data) {
WaitForSingleObject(space_avail, INFINITE);
EnterCriticalSection(&mutex);
buffer[in_pos] = data;
in_pos = (in_pos + 1) % max;
LeaveCriticalSection(&mutex);
ReleaseSemaphore(data_avail, 1, NULL);
}
T pop() {
WaitForSingleObject(data_avail,INFINITE);
EnterCriticalSection(&mutex);
T retval = buffer[out_pos];
out_pos = (out_pos + 1) % max;
LeaveCriticalSection(&mutex);
ReleaseSemaphore(space_avail, 1, NULL);
return retval;
}
~queue() {
DeleteCriticalSection(&mutex);
CloseHandle(data_avail);
CloseHandle(space_avail);
}
};
}
#endif
Exactly how much you gain from this depends on the amount of time spent reading versus the amount of time spent on other processing. In this case, the other processing is sufficiently trivial that it probably doesn't gain much. If more time was spent on processing the data, multi-threading would probably gain more.
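For anyone not on Windows, the same reader/processor split can be sketched with the standard C++11 threading library. This is only an illustration of the idea, not a port of the code above: BlockQueue and count_threaded are invented names, the queue is unbounded (the semaphore-based one above is bounded), and each block is assumed to end on a line boundary.

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

// Minimal unbounded queue of text blocks shared by two threads.
class BlockQueue {
    std::mutex m;
    std::condition_variable cv;
    std::queue<std::string> blocks;
    bool closed;
public:
    BlockQueue() : closed(false) {}
    void push(const std::string &block) {
        { std::lock_guard<std::mutex> lock(m); blocks.push(block); }
        cv.notify_one();
    }
    // Signals that no more blocks will arrive.
    void close() {
        { std::lock_guard<std::mutex> lock(m); closed = true; }
        cv.notify_one();
    }
    // Returns false once the queue is closed and drained.
    bool pop(std::string &out) {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [this] { return closed || !blocks.empty(); });
        if (blocks.empty()) return false;
        out = blocks.front();
        blocks.pop();
        return true;
    }
};

// The calling thread plays the reader and pushes blocks; a second thread
// parses them and counts the values divisible by k.
int count_threaded(const std::vector<std::string> &input_blocks, int k) {
    BlockQueue q;
    int total = 0;
    std::thread consumer([&q, &total, k] {
        std::string block;
        while (q.pop(block)) {
            int v = 0;
            bool in_number = false;
            for (std::size_t i = 0; i < block.size(); ++i) {
                char c = block[i];
                if (c >= '0' && c <= '9') {
                    v = v * 10 + (c - '0');
                    in_number = true;
                } else if (in_number) {
                    total += (v % k == 0);
                    v = 0;
                    in_number = false;
                }
            }
            if (in_number) total += (v % k == 0);
        }
    });
    for (std::size_t i = 0; i < input_blocks.size(); ++i)
        q.push(input_blocks[i]);  // a real program would push blocks read from stdin
    q.close();
    consumer.join();              // total is safe to read after the join
    return total;
}
```

As with the Windows version, the gain depends on how the reading time compares with the processing time, so it is worth timing before committing to the extra complexity.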

2.5 MB/sec is 400 ns/byte.
There are two big per-byte processes, file input and parsing.
For the file input, I would just load it into a big memory buffer. fread should be able to read that in at roughly full disc bandwidth.
For the parsing, sscanf is built for generality, not speed. atoi should be pretty fast. My habit, for better or worse, is to do it myself, as in:
#define DIGIT(c)((c)>='0' && (c) <= '9')
bool parsInt(char* &p, int& num){
while(*p && *p <= ' ') p++; // scan over whitespace
if (!DIGIT(*p)) return false;
num = 0;
while(DIGIT(*p)){
num = num * 10 + (*p++ - '0');
}
return true;
}
The loops, first over leading whitespace, then over the digits, should be nearly as fast as the machine can go, certainly a lot less than 400ns/byte.
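To show how such a parser might be driven over a buffer that has already been loaded, here is a small sketch. parse_int mirrors the routine above (written without the macro so the block stands alone), and count_divisible_in_buffer is an invented name. It assumes the buffer is NUL-terminated and holds "n k" followed by n non-negative integers.

```cpp
// Same idea as the parser above: skip whitespace, then read decimal digits.
// Advances p past whatever it consumed.
static bool parse_int(const char *&p, int &num) {
    while (*p && *p <= ' ') ++p;              // scan over whitespace
    if (*p < '0' || *p > '9') return false;   // no digit here
    num = 0;
    while (*p >= '0' && *p <= '9')
        num = num * 10 + (*p++ - '0');
    return true;
}

// Counts how many of the n integers after the "n k" header are divisible by k.
int count_divisible_in_buffer(const char *buffer) {
    const char *p = buffer;
    int n = 0, k = 0, value = 0, total = 0;
    if (!parse_int(p, n) || !parse_int(p, k) || k == 0) return 0;
    while (n-- > 0 && parse_int(p, value))
        total += (value % k == 0);
    return total;
}
```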

Dividing two large numbers is hard. Perhaps an improvement would be to first characterize k a little by looking at some of the smaller primes. Let's say 2, 3, and 5 for now. If k is divisible by any of these, then inputnum also needs to be; otherwise inputnum is not divisible by k. Of course there are more tricks to play (you could use a bitwise AND of inputnum with 1 to determine whether it is divisible by 2), but I think just removing the low prime possibilities will give a reasonable speed improvement (worth a shot anyway).
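A minimal sketch of the cheapest version of this pre-check, covering only the prime 2, since testing for 3 or 5 would itself need a modulo. DivisorFilter is an invented name and k > 0 is assumed. The extra branch may cost more than it saves, so treat this as something to benchmark rather than a guaranteed win.

```cpp
// Precomputed fact about the divisor k (k > 0 assumed).
struct DivisorFilter {
    int k;
    bool k_even;  // true if k is divisible by 2

    explicit DivisorFilter(int divisor)
        : k(divisor), k_even((divisor & 1) == 0) {}

    // True if value is divisible by k. An odd value can never be divisible
    // by an even k, so that case is rejected with a bit test instead of a
    // division.
    bool divides(int value) const {
        if (k_even && (value & 1)) return false;
        return value % k == 0;
    }
};
```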
