C++: insane memory consumption on large file

I am loading a 10GB file into memory, and I find that even if I strip away any extra overhead and store the data in nothing but an array, it still takes up 53GB of RAM. This seems crazy to me, since I am converting some of the text data to longs, which take up less room, and converting the rest to char*, which should take up the same amount of room as the text file. I have about 150M rows of data in the file I am trying to load. Is there any reason why this should take up so much RAM when I load it the way I do below?
There are three files here: a fileLoader class, its header file, and a main that simply runs them.
To answer some questions:
The OS is Ubuntu 12.04, 64-bit.
This is on a machine with 64GB of RAM and an SSD that I have providing 64GB of swap space.
I am loading all of the data at once because of the need for speed; it is critical for the application. All sorting, indexing, and much of the data-intensive work runs on the GPU.
The other reason is that loading all of the data at once made it much simpler for me to write the code. I don't have to worry about indexed files and mappings to locations in another file, for example.
Here is the header file:
#ifndef FILELOADER_H_
#define FILELOADER_H_

#include <iostream>
#include <fstream>
#include <fcntl.h>
#include <unistd.h> // read(), close()
#include <stdlib.h>
#include <string.h>
#include <string>

class fileLoader {
public:
    fileLoader();
    virtual ~fileLoader();
    void loadFile();
private:
    long long **longs;
    char ***chars;
    long count;
    long countLines(std::string inFile);
};

#endif /* FILELOADER_H_ */
Here is the CPP file:
#include "fileLoader.h"

fileLoader::fileLoader() {
    this->longs = NULL;
    this->chars = NULL;
}

// Split a delimited line in place; returns an array of pointers into `line`.
// Note: writes past `val` if the line has more than `size` fields.
char **split(char *line, const char *delim, int size) {
    char **val = new char *[size];
    int i = 0;
    char *curVal = strsep(&line, delim);
    while (curVal != NULL) {
        val[i] = curVal;
        i++;
        curVal = strsep(&line, delim);
    }
    return val;
}
void fileLoader::loadFile() {
    const char *fileName = "/blazing/final/tasteslikevictory";
    std::string fileString(fileName);
    // -1 since there's a header row and we are skipping it
    this->count = countLines(fileString) - 1;
    this->longs = new long long *[this->count];
    this->chars = new char **[this->count];
    std::ifstream inFile;
    inFile.open(fileName);
    if (inFile.is_open()) {
        std::string line;
        int i = 0;
        getline(inFile, line); // skip the header row
        while (getline(inFile, line)) {
            this->longs[i] = new long long[6];
            this->chars[i] = new char *[7];
            char *copy = strdup(line.c_str());
            char **splitValues = split(copy, "|", 13);
            this->longs[i][0] = atoll(splitValues[4]);
            this->longs[i][1] = atoll(splitValues[5]);
            this->longs[i][2] = atoll(splitValues[6]);
            this->longs[i][3] = atoll(splitValues[7]);
            this->longs[i][4] = atoll(splitValues[11]);
            this->longs[i][5] = atoll(splitValues[12]);
            this->chars[i][0] = strdup(splitValues[0]);
            this->chars[i][1] = strdup(splitValues[1]);
            this->chars[i][2] = strdup(splitValues[2]);
            this->chars[i][3] = strdup(splitValues[3]);
            this->chars[i][4] = strdup(splitValues[8]);
            this->chars[i][5] = strdup(splitValues[9]);
            this->chars[i][6] = strdup(splitValues[10]);
            i++;
            delete[] splitValues;
            free(copy);
        }
    }
}
fileLoader::~fileLoader() {
    if (this->longs != NULL) {
        delete[] this->longs;
    }
    if (this->chars != NULL) {
        for (int i = 0; i < this->count; i++) {
            free(this->chars[i]);
        }
        delete[] this->chars;
    }
}

long fileLoader::countLines(std::string inFile) {
    int BUFFER_SIZE = 16 * 1024;
    int fd = open(inFile.c_str(), O_RDONLY);
    if (fd == -1)
        return 0;
    /* Advise the kernel of our access pattern. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
    char buf[BUFFER_SIZE + 1];
    long lines = 0;
    while (size_t bytes_read = read(fd, buf, BUFFER_SIZE)) {
        if (bytes_read == (size_t)-1)
            return 0;
        if (!bytes_read)
            break;
        for (char *p = buf; (p = (char *)memchr(p, '\n', (buf + bytes_read) - p)); ++p)
            ++lines;
    }
    return lines;
}
Here is the file with my main function:
#include "fileLoader.h"
int main()
{
fileLoader loader;
loader.loadFile();
return 0;
}
Here is an example of the data that I am loading:
13|0|1|1997|113|1|4|12408012|C9FF921CA04ADA3D606BF6DAC4A0B092|SEMANAL|66C5E828DC69F857ADE060B8062C923E|113|1
14|0|1|1997|113|1|5|12408012|C9FF921CA04ADA3D606BF6DAC4A0B092|SEMANAL|66C5E828DC69F857ADE060B8062C923E|113|1
15|0|1|1997|113|1|6|12408012|C9FF921CA04ADA3D606BF6DAC4A0B092|SEMANAL|66C5E828DC69F857ADE060B8062C923E|113|1
16|0|1|1997|113|1|7|12408012|C9FF921CA04ADA3D606BF6DAC4A0B092|SEMANAL|66C5E828DC69F857ADE060B8062C923E|113|1
17|0|1|1997|113|1|8|12408012|C9FF921CA04ADA3D606BF6DAC4A0B092|SEMANAL|66C5E828DC69F857ADE060B8062C923E|113|1
18|0|1|1997|113|1|9|12408012|C9FF921CA04ADA3D606BF6DAC4A0B092|SEMANAL|66C5E828DC69F857ADE060B8062C923E|113|1
19|0|1|1997|113|1|10|12408012|C9FF921CA04ADA3D606BF6DAC4A0B092|SEMANAL|66C5E828DC69F857ADE060B8062C923E|113|1
20|0|1|1997|113|1|11|12408012|C9FF921CA04ADA3D606BF6DAC4A0B092|SEMANAL|66C5E828DC69F857ADE060B8062C923E|113|1
21|0|1|1997|113|1|12|12408012|C9FF921CA04ADA3D606BF6DAC4A0B092|SEMANAL|66C5E828DC69F857ADE060B8062C923E|113|1
9|0|1|1997|113|1|13|12408012|C9FF921CA04ADA3D606BF6DAC4A0B092|SEMANAL|66C5E828DC69F857ADE060B8062C923E|113|1
27|0|1|1992|125|1|1|10183|9EF534D2CF74B24AC28CBD9BE937A412|SEMANAL|375CCE505F5353CCDE85D4E84A9888D8|125|1
28|0|1|1992|125|1|2|10183|9EF534D2CF74B24AC28CBD9BE937A412|SEMANAL|375CCE505F5353CCDE85D4E84A9888D8|125|1
29|0|1|1992|125|1|3|10183|9EF534D2CF74B24AC28CBD9BE937A412|SEMANAL|375CCE505F5353CCDE85D4E84A9888D8|125|1
30|0|1|1992|125|1|4|10183|9EF534D2CF74B24AC28CBD9BE937A412|SEMANAL|375CCE505F5353CCDE85D4E84A9888D8|125|1
31|0|1|1992|125|1|5|10183|9EF534D2CF74B24AC28CBD9BE937A412|SEMANAL|375CCE505F5353CCDE85D4E84A9888D8|125|1
32|0|1|1992|125|1|6|10183|9EF534D2CF74B24AC28CBD9BE937A412|SEMANAL|375CCE505F5353CCDE85D4E84A9888D8|125|1
33|0|1|1992|125|1|7|10183|9EF534D2CF74B24AC28CBD9BE937A412|SEMANAL|375CCE505F5353CCDE85D4E84A9888D8|125|1
34|0|1|1992|125|1|8|10183|9EF534D2CF74B24AC28CBD9BE937A412|SEMANAL|375CCE505F5353CCDE85D4E84A9888D8|125|1
35|0|1|1992|125|1|9|10183|9EF534D2CF74B24AC28CBD9BE937A412|SEMANAL|375CCE505F5353CCDE85D4E84A9888D8|125|1
36|0|1|1992|125|1|10|10183|9EF534D2CF74B24AC28CBD9BE937A412|SEMANAL|375CCE505F5353CCDE85D4E84A9888D8|125|1
37|0|1|1992|125|1|11|10183|9EF534D2CF74B24AC28CBD9BE937A412|SEMANAL|375CCE505F5353CCDE85D4E84A9888D8|125|1
38|0|1|1992|125|1|12|10183|9EF534D2CF74B24AC28CBD9BE937A412|SEMANAL|375CCE505F5353CCDE85D4E84A9888D8|125|1
39|0|1|1992|125|1|13|10183|9EF534D2CF74B24AC28CBD9BE937A412|SEMANAL|375CCE505F5353CCDE85D4E84A9888D8|125|1
40|0|1|1992|125|1|14|10183|9EF534D2CF74B24AC28CBD9BE937A412|SEMANAL|375CCE505F5353CCDE85D4E84A9888D8|125|1
41|0|1|1992|125|1|15|10183|9EF534D2CF74B24AC28CBD9BE937A412|SEMANAL|375CCE505F5353CCDE85D4E84A9888D8|125|1
10|0|1|1996|126|1|1||||||

You are allocating nine chunks of memory for each line, so you are allocating a total of 1350 million blocks. These allocations have a certain overhead, usually at least twice the size of a pointer, possibly even more. On a 64-bit machine, that is already 16 bytes per allocation, so you get 21.6 GB of overhead.
In addition to that, you get the overhead of heap fragmentation and alignment: even if you only ever store a string in it, the allocator has to align the memory allocations so that the largest possible values can be stored in them without triggering misalignment. Alignment may depend on the vector unit of your CPU, which can require very significant alignments; 16-byte alignment is not uncommon.
Doing the calculation with 16 bytes of allocation overhead and 16-byte alignment, we get 43.2 GB of allocations without the original data. Adding the original data brings this calculation very close to your measurement.

Each of those objects and strings you create has individual memory-management overhead. Say you load the string "0" from column 2: depending on your memory manager, it probably takes between two and four full words (could be more). Call it 16 to 32 bytes of storage to hold a one-byte string. Then you load the "1" from column 3. And so on.
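One way to act on both answers is to replace the per-row and per-field allocations with a few contiguous arrays, so the allocator overhead is paid a handful of times instead of 1.35 billion times. A minimal sketch, not from the original thread: the six-longs/seven-strings layout comes from the question's code, everything else is illustrative.
#include <cstddef>
#include <vector>

// All rows share three big allocations instead of nine small ones per row.
struct Table {
    std::vector<long long>   longs;   // 6 values per row, row-major
    std::vector<char>        arena;   // every string's bytes, '\0'-terminated
    std::vector<std::size_t> offsets; // 7 offsets into `arena` per row

    void addText(const char *s, std::size_t len) {
        offsets.push_back(arena.size());
        arena.insert(arena.end(), s, s + len);
        arena.push_back('\0');
    }
    long long   num(std::size_t row, std::size_t col) const { return longs[row * 6 + col]; }
    const char *text(std::size_t row, std::size_t col) const { return &arena[offsets[row * 7 + col]]; }
};
With this layout the per-allocation headers and alignment padding vanish, and the resident size should approach the raw data size plus a small percentage for the offset table.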

Related

Why is performance reduced so much when I run a for loop with a count of 5368709120 and a few lines of memcpy?

I allocate three large byte arrays and initialize them to some values. I have to perform operations on every 64 bits across these three arrays, so I created a for loop that walks the arrays, converts each consecutive 8 bytes (64 bits) into a 64-bit integer using memcpy, and performs operations between them. Afterwards, I calculate the time taken by the for loop. I have given my code here.
#include <stdio.h>
#include <iostream>
#include <ctime>
#include <cstdint>  // int64_t
#include <cstring>  // memcpy, memset
#include <cstdlib>  // malloc, free
#include <Windows.h>
#include <chrono>
using namespace std;

BYTE* buffer1;
BYTE* buffer2;
BYTE* buffer3;

int main()
{
    unsigned long long offsetValue = 0;
    int64_t data1, data2, data3;
    unsigned long long BufferSize = 5368709120;
    buffer1 = (BYTE*)malloc(BufferSize);
    buffer2 = (BYTE*)malloc(BufferSize);
    buffer3 = (BYTE*)malloc(BufferSize);
    memset(buffer1, 0, BufferSize);
    memset(buffer2, 1, BufferSize);
    memset(buffer3, 1, BufferSize);
    bool overallResult = false;
    bool stopOnFail = false;
    auto start = chrono::steady_clock::now();
    for (unsigned long long i = 0, cycle = 0; i < BufferSize; i += 8, ++cycle)
    {
        long long offset = (offsetValue * 8) + i;
        if (offset > BufferSize - 1)
            break;
        else if (offset < 0)
            continue;
        memcpy(&data1, buffer1 + offset, sizeof(int64_t));
        if (data1 == -1)
            continue;
        memcpy(&data2, buffer2 + offset, sizeof(int64_t));
        memcpy(&data3, buffer3 + offset, sizeof(int64_t));
        int64_t Exor = data2 ^ data3 ^ -1;
        int64_t Or = Exor | data1;
        bool result = Or == -1;
        overallResult &= result;
        if (!result)
        {
            if (stopOnFail)
                break;
        }
    }
    auto ending = chrono::steady_clock::now();
    cout << "For loop execution time in milliseconds: "
         << chrono::duration_cast<chrono::milliseconds>(ending - start).count()
         << " ms" << endl;
    free(buffer1);
    free(buffer2);
    free(buffer3);
    system("pause");
    return 0;
}
A loop count of 4294967296 gave me a time of 760 milliseconds, but a loop count of 5368709120 gives me a time of 25000 milliseconds. What is draining the time in the for loop? How should I optimize?
1. You're not using the value overallResult outside the loop, so a good optimizing compiler can optimize away the loop entirely. MSVC probably isn't that smart, but it's still a good idea to, e.g., print out overallResult at the end.
2. You're allocating (and actually using) 3 × 5,368,709,120 bytes = 15 GiB. A Windows 10 system uses a lot more than 1 GB to run (especially in combination with Visual Studio), so on a system with 16 GB, allocating 15 GiB would inevitably cause paging, which is most probably what you're observing (a ~20-40x slowdown is also characteristic of memory paging).
To verify:
Open up Performance Monitor (perfmon.exe)
Add Counters -> Paging File -> % Usage
Run your program
If the paging counters are > 0, then you don't have enough RAM, and looping over memory will slow down due to reading of pages from disk.
You can also watch RAM usage in Task Manager -> Performance tab.
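On point 1, a minimal sketch of making the result observable (assuming the names from the question; note it initializes overallResult to false, so the &= chain can never become true, and starting from true is presumably intended):
#include <iostream>

// Call with overallResult after the timing loop; printing the value
// forces the compiler to keep the loop's work.
void reportResult(bool overallResult)
{
    std::cout << "overallResult = " << std::boolalpha << overallResult << std::endl;
}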

How to write a large binary file to a disk

I am writing a program which requires writing a large binary file (about 12 GiB or more) to a disk. I have created a small test program to test this functionality. Although allocating the RAM for the buffer is not a problem, my program does not write the data to the file; the file remains empty, even for 3.72 GiB files.
//size_t bufferSize=1000;      //ok
//size_t bufferSize=100000000; //ok
size_t bufferSize=500000000;   //fails although it is under 4GiB, which shouldn't cause problems anyway
double mem=double(bufferSize)*double(sizeof(double))/std::pow(1024.,3.);
cout<<"Total memory used: "<<mem<<" GiB"<<endl;
double *buffer=new double[bufferSize];
/* //enable if you want to fill the buffer with random data
printf("\r[%i %]",0);
for (size_t i=0;i<(size_t)bufferSize;i++)
{
    if ((i+1)%100==0) printf("\r[%i %]",(size_t)(100.*double(i+1)/bufferSize));
    buffer[i]=rand() % 100;
}
*/
cout<<endl;
std::ofstream outfile ("largeStuff.bin",std::ofstream::binary);
outfile.write ((char*)buffer,((size_t)(bufferSize*double(sizeof(double)))));
outfile.close();
delete[] buffer;
I actually compiled and ran the code exactly as you have pasted it, and it works; it creates a 4GB file.
If you are on a FAT32 filesystem the max filesize is 4GB.
Otherwise I suggest you check:
The amount of free disk space you have.
Whether your user account has any disk usage limits in place.
The amount of free RAM you have.
Whether there are any runtime errors.
@enhzflep's suggestion about the number of prints (although that is commented out)
It seems that you want to have a buffer that contains the whole file's contents prior to writing it.
You're doing it wrong, though: the virtual memory requirements are essentially double what they need to be. Your process retains the buffer, but when you write that buffer to disk it gets duplicated in the operating system's buffers. Now, most OSes will notice that you write sequentially and may discard their buffers quickly, but it's still rather wasteful.
Instead, you should create an empty file, grow it to its desired size, then map its view into memory and do the modifications on the file's view in memory. For 32-bit hosts your file size is limited to <1GB. For 64-bit hosts, it's limited by the filesystem only. On modern hardware, creating and filling a 1GB file that way takes on the order of 1 second (!) if you have enough free RAM available.
Thanks to the wonders of RAII, you don't need to do anything special to release the mapped memory, or to close/finalize the file. By leveraging boost you can avoid writing platform-specific code, too.
// https://github.com/KubaO/stackoverflown/tree/master/questions/mmap-boost-40308164
#include <boost/interprocess/file_mapping.hpp>
#include <boost/interprocess/mapped_region.hpp>
#include <boost/filesystem.hpp>
#include <cassert>
#include <cstdint>
#include <fstream>

namespace bip = boost::interprocess;

void fill(const char * fileName, size_t size) {
    using element_type = uint64_t;
    assert(size % sizeof(element_type) == 0);
    std::ofstream().open(fileName); // create an empty file
    boost::filesystem::resize_file(fileName, size);
    auto mapping = bip::file_mapping{fileName, bip::read_write};
    auto mapped_rgn = bip::mapped_region{mapping, bip::read_write};
    const auto mmaped_data = static_cast<element_type*>(mapped_rgn.get_address());
    const auto mmap_bytes = mapped_rgn.get_size();
    const auto mmap_size = mmap_bytes / sizeof(*mmaped_data);
    assert(mmap_bytes == size);
    element_type n = 0;
    for (auto p = mmaped_data; p < mmaped_data + mmap_size; ++p)
        *p = n++;
}

int main() {
    const uint64_t G = 1024ULL*1024ULL*1024ULL;
    fill("tmp.bin", 1*G);
}

Weird C stack memory overrides

I am implementing a version of malloc and free for practice. I have a static char array of fixed length (10000), and I implemented a struct memblock that holds information such as the size of the block and whether it is free.
The way I am implementing malloc, I put small blocks (< 8 bytes) at the front of the char array and larger ones at the other end, so I am basically using two linked lists, one for the blocks in front and one for the blocks in back. However, I am having weird problems initializing the lists (on the first call of malloc).
This is my code:
#define MEMSIZE 10000               // maximum size of the char array
#define BLOCKSIZE sizeof(memblock)  // size of the memblock struct

static char memory[MEMSIZE];        // array to store all the memory
static int init;                    // checks if memory is initialized
static memblock root;               // general ptr that deals with both smallroot and bigroot
static memblock smallroot, bigroot; // pointers to storage of small memory blocks and bigger blocks

void initRoots(size_t size, char* fileName, int lineNum)
{
    smallroot = (memblock)memory;
    smallroot->prev = smallroot->next = 0;
    smallroot->size = MEMSIZE - 2 * BLOCKSIZE;
    smallroot->isFree = 1;
    smallroot->file = fileName;
    smallroot->lineNum = lineNum;

    bigroot = (memblock)(((char *)memory) + MEMSIZE - BLOCKSIZE - 1);
    bigroot->prev = bigroot->next = 0;
    bigroot->size = MEMSIZE - 2 * BLOCKSIZE;
    bigroot->isFree = 1;
    bigroot->file = fileName;
    bigroot->lineNum = lineNum;
    init = 1;
}
I used GDB to see where I am getting a seg fault. It happens when bigroot->next = 0; is executed; this somehow sets smallroot to 0. Weirder still: if I set bigroot->next = 0x123, then smallroot becomes 0x1. If I set 0x1234, then it becomes 0x12. It is setting smallroot to the value of bigroot->next shifted down by one byte (the last two hex digits dropped). I really don't understand how this is happening!
This is the definition of memblock:
typedef struct memblock_* memblock;
struct memblock_ {
    struct memblock_ *prev, *next; // pointers to next and previous blocks
    /* size: size of allocated memory
       isFree: 0 if not free, 1 if free
       lineNum: line number of user's file where malloc was invoked */
    size_t size, isFree, lineNum;
    char* file; // user's file name where the block was malloced
};
#define BLOCKSIZE sizeof(memblock) // size of the memblock struct
You want:
#define BLOCKSIZE sizeof(struct memblock_) // size of the struct, not the pointer
Since memblock is a typedef for a pointer, sizeof(memblock) is just the pointer size (8 bytes on a 64-bit build), so bigroot ends up only 9 bytes before the end of memory, and writing its fields runs off the end of the array, most likely into statics such as smallroot placed right after it; the one-byte misalignment from the stray -1 is why smallroot receives the value shifted by a byte.
Also the -1 here is bogus (it creates a misaligned pointer):
bigroot = (memblock)(((char *)memory) + MEMSIZE - BLOCKSIZE - 1);
Actually, I am storing the pointer to the memblock in the memory array. The values of the memblock are stored on the stack.
No, they are not. The smallroot and bigroot clearly point into the array itself.
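A hedged sketch of the corrected initialization (the struct, names, and signature come from the question; the rest is illustrative): size the blocks by the struct, and place bigroot exactly BLOCKSIZE bytes before the end so it stays inside the array and aligned.
#define BLOCKSIZE sizeof(struct memblock_) // the struct's size, not the pointer's

void initRoots(size_t size, char *fileName, int lineNum)
{
    smallroot = (memblock)memory;
    // No stray -1: the block sits flush against the end of the array.
    bigroot = (memblock)(memory + MEMSIZE - BLOCKSIZE);

    smallroot->prev = smallroot->next = 0;
    smallroot->size = MEMSIZE - 2 * BLOCKSIZE;
    smallroot->isFree = 1;
    smallroot->file = fileName;
    smallroot->lineNum = lineNum;

    *bigroot = *smallroot; // same initial bookkeeping for the big-block root
    init = 1;
}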

C++ Optimal Block Size For Reading From A File

I have a program that generates files containing random distributions of the characters A-Z. I have written a method that reads these files (and counts each character) using fread with different buffer sizes, in an attempt to determine the optimal block size for reads. Here is the method:
int get_histogram(FILE * fp, long *hist, int block_size, long *milliseconds, long *filelen)
{
    char *buffer = new char[block_size];
    bzero(buffer, block_size);

    struct timeb t;
    ftime(&t);
    long start_in_ms = t.time * 1000 + t.millitm;

    size_t bytes_read = 0;
    while (!feof(fp))
    {
        bytes_read += fread(buffer, 1, block_size, fp);
        if (ferror(fp))
        {
            return -1;
        }
        int i;
        for (i = 0; i < block_size; i++)
        {
            int j;
            for (j = 0; j < 26; j++)
            {
                if (buffer[i] == 'A' + j)
                {
                    hist[j]++;
                }
            }
        }
    }

    ftime(&t);
    long end_in_ms = t.time * 1000 + t.millitm;
    *milliseconds = end_in_ms - start_in_ms;
    *filelen = bytes_read;
    return 0;
}
However, when I plot bytes/second vs. block size (buffer size) using block sizes from 2 to 2^20, I get an optimal block size of 4 bytes, which just can't be correct. Something must be wrong with my code, but I can't find it.
Any advice is appreciated.
EDIT:
The point of this exercise is to demonstrate the optimal buffer size by recording the read times (plus computation time) for different buffer sizes. The file pointer is opened and closed by the calling code.
There are many bugs in this code:
It uses new[], which is C++.
It doesn't free the allocated memory.
It always loops over block_size bytes of input, not bytes_read as returned by fread().
Also, the actual histogram code is rather inefficient, since it seems to loop over each character to determine which character it is.
UPDATE: Removed claim that using feof() before I/O is wrong, since that wasn't true. Thanks to Eric for pointing this out in a comment.
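A hedged sketch folding in the fixes listed above (the shape mirrors the question's function, with the timing left out for brevity; the name is mine): loop over what fread() actually returned, index the histogram arithmetically, and release the buffer.
#include <cstddef>
#include <cstdio>

// Count letter frequencies over exactly the bytes each fread() returned.
int get_histogram_fixed(FILE *fp, long *hist, int block_size, long *filelen)
{
    char *buffer = new char[block_size];
    size_t total = 0, bytes_read;
    while ((bytes_read = fread(buffer, 1, block_size, fp)) > 0)
    {
        for (size_t i = 0; i < bytes_read; i++)  // only the bytes actually read
        {
            if (buffer[i] >= 'A' && buffer[i] <= 'Z')
                hist[buffer[i] - 'A']++;         // direct index, no inner loop
        }
        total += bytes_read;
    }
    delete[] buffer;                             // don't leak the buffer
    if (ferror(fp))
        return -1;
    *filelen = (long)total;
    return 0;
}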
You're not stating what platform you're running this on, or what compile-time parameters you use.
Of course, fread() involves some overhead, leaving user mode and returning. On the other hand, instead of setting the hist[] entry directly, you're looping through the alphabet. This is unnecessary and, without optimization, causes some overhead per byte.
I'd re-test this with hist[buffer[i] - 'A']++ or something similar.
Typically, the best timing would be achieved if your buffer size equals the system's buffer size for the given media.
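If you want to start from the media's own preferred size rather than sweep blindly, POSIX exposes it via fstat(); a small sketch (the helper name and fallback value are mine):
#include <cstdio>
#include <sys/stat.h>

// Query the filesystem's preferred I/O block size for an already-open file.
long preferred_block_size(FILE *fp)
{
    struct stat st;
    if (fstat(fileno(fp), &st) != 0)
        return 4096;            // reasonable fallback guess
    return (long)st.st_blksize; // the kernel's hint for efficient I/O
}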

Can I make this C++ code faster without making it much more complex?

Here's a problem I've solved from a programming problem website (codechef.com, in case anyone would rather try it themselves before seeing this solution). My solution runs in about 5.43 seconds on the test data; others have solved the same problem with the same test data in 0.14 seconds, but with much more complex code. Can anyone point out specific areas of my code where I am losing performance? I'm still learning C++, so I know there are a million ways I could solve this problem, but I'd like to know if I can improve my own solution with some subtle changes rather than rewriting the whole thing. Or, if there are any relatively simple solutions of comparable length that would perform better than mine, I'd be interested to see them too.
Please keep in mind I'm learning C++, so my goal here is to improve the code I understand, not just to be given a perfect solution.
Thanks
Problem:
The purpose of this problem is to verify whether the method you are using to read input data is sufficiently fast to handle problems branded with the enormous Input/Output warning. You are expected to be able to process at least 2.5MB of input data per second at runtime. Time limit to process the test data is 8 seconds.
The input begins with two positive integers n k (n, k<=10^7). The next n lines of input contain one positive integer ti, not greater than 10^9, each.
Output
Write a single integer to output, denoting how many integers ti are divisible by k.
Example
Input:
7 3
1
51
966369
7
9
999996
11
Output:
4
Solution:
#include <iostream>
#include <stdio.h>
using namespace std;

int main(){
    //n is number of integers to perform calculation on
    //k is the divisor
    //inputnum is the number to be divided by k
    //total is the total number of inputnums divisible by k
    int n,k,inputnum,total;

    //initialize total to zero
    total=0;

    //read in n and k from stdin
    scanf("%i%i",&n,&k);

    //loop n times and if k divides into n, increment total
    for (n; n>0; n--)
    {
        scanf("%i",&inputnum);
        if(inputnum % k==0) total += 1;
    }

    //output value of total
    printf("%i",total);
    return 0;
}
The speed is not being determined by the computation; most of the time the program takes to run is consumed by I/O.
Add setvbuf calls before the first scanf for a significant improvement:
setvbuf(stdin, NULL, _IOFBF, 32768);
setvbuf(stdout, NULL, _IOFBF, 32768);
-- edit --
The alleged magic numbers are the new buffer size. By default, FILE uses a buffer of 512 bytes. Increasing this size decreases the number of times that the C++ runtime library has to issue a read or write call to the operating system, which is by far the most expensive operation in your algorithm.
By keeping the buffer size a multiple of 512, that eliminates buffer fragmentation. Whether the size should be 1024*10 or 1024*1024 depends on the system it is intended to run on. For 16 bit systems, a buffer size larger than 32K or 64K generally causes difficulty in allocating the buffer, and maybe managing it. For any larger system, make it as large as useful—depending on available memory and what else it will be competing against.
Lacking any known memory contention, choose sizes for the buffers at about the size of the associated files. That is, if the input file is 250K, use that as the buffer size. There is definitely a diminishing return as the buffer size increases. For the 250K example, a 100K buffer would require three reads, while a default 512 byte buffer requires 500 reads. Further increasing the buffer size so only one read is needed is unlikely to make a significant performance improvement over three reads.
I tested the following on 28311552 lines of input. It's 10 times faster than your code. What it does is read a large block at once, then finishes up to the next newline. The goal here is to reduce I/O costs, since scanf() is reading a character at a time. Even with stdio, the buffer is likely too small.
Once the block is ready, I parse the numbers directly in memory.
This isn't the most elegant of codes, and I might have some edge cases a bit off, but it's enough to get you going with a faster approach.
Here are the timings (without the optimizer my solution is only about 6-7 times faster than your original reference)
[xavier:~/tmp] dalke% g++ -O3 my_solution.cpp
[xavier:~/tmp] dalke% time ./a.out < c.dat
15728647
0.284u 0.057s 0:00.39 84.6% 0+0k 0+1io 0pf+0w
[xavier:~/tmp] dalke% g++ -O3 your_solution.cpp
[xavier:~/tmp] dalke% time ./a.out < c.dat
15728647
3.585u 0.087s 0:03.72 98.3% 0+0k 0+0io 0pf+0w
Here's the code.
#include <iostream>
#include <stdio.h>
using namespace std;

const int BUFFER_SIZE=400000;
const int EXTRA=30; // well over the size of an integer

void read_to_newline(char *buffer) {
    int c;
    while (1) {
        c = getc_unlocked(stdin);
        if (c == '\n' || c == EOF) {
            *buffer = '\0';
            return;
        }
        *buffer++ = c;
    }
}

int main() {
    char buffer[BUFFER_SIZE+EXTRA];
    char *startptr, *endptr;
    //n is number of integers to perform calculation on
    //k is the divisor
    //inputnum is the number to be divided by k
    //total is the total number of inputnums divisible by k
    int n,k,inputnum,total,nbytes;
    //initialize total to zero
    total=0;
    //read in n and k from stdin
    read_to_newline(buffer);
    sscanf(buffer, "%i%i",&n,&k);
    while (1) {
        // Read a large block of values
        // There should be one integer per line, with nothing else.
        // This might truncate an integer!
        nbytes = fread(buffer, 1, BUFFER_SIZE, stdin);
        if (nbytes == 0) {
            cerr << "Reached end of file too early" << endl;
            break;
        }
        // Make sure I read to the next newline.
        read_to_newline(buffer+nbytes);
        startptr = buffer;
        while (n>0) {
            inputnum = 0;
            // I had used strtol but that was too slow
            //   inputnum = strtol(startptr, &endptr, 10);
            // Instead, parse the integers myself.
            endptr = startptr;
            while (*endptr >= '0') {
                inputnum = inputnum * 10 + *endptr - '0';
                endptr++;
            }
            // *endptr might be a '\n' or '\0'
            // Might occur with the last field
            if (startptr == endptr) {
                break;
            }
            // skip the newline; go to the
            // first digit of the next number.
            if (*endptr == '\n') {
                endptr++;
            }
            // Test if this is a factor
            if (inputnum % k==0) total += 1;
            // Advance to the next number
            startptr = endptr;
            // Reduce the count by one
            n--;
        }
        // Either we are done, or we need new data
        if (n==0) {
            break;
        }
    }
    // output value of total
    printf("%i\n",total);
    return 0;
}
Oh, and it very much assumes the input data is in the right format.
Try replacing the if statement with total += ((inputnum % k) == 0);. That might help a little bit.
But I think you really need to buffer your input into a temporary array; reading one integer from input at a time is expensive. If you can separate data acquisition and data processing, the compiler may be able to generate optimized code for the mathematical operations.
The I/O operations are the bottleneck. Try to limit them whenever you can; for instance, load all data into a buffer or array with a buffered stream in one step.
Although your example is so simple that I hardly see what you can eliminate, assuming it's part of the exercise to do subsequent reading from stdin.
A few comments on the code: your example doesn't make use of any streams, so there is no need to include the iostream header. You already load C library elements into the global namespace by including stdio.h instead of the C++ version of the header, cstdio, so using namespace std is not necessary.
You can read each line with gets() and parse the strings yourself, without scanf(). (Normally I wouldn't recommend gets(), but in this case the input is well-specified.)
A sample C program to solve this problem:
#include <stdio.h>

int main() {
    int n,k,in,tot=0,i;
    char s[1024];

    gets(s);
    sscanf(s,"%d %d",&n,&k);

    while(n--) {
        gets(s);
        in=s[0]-'0';
        for(i=1; s[i]!=0; i++) {
            in=in*10 + s[i]-'0'; /* For each digit read, multiply the previous
                                    value of in with 10 and add the current digit */
        }
        tot += in%k==0; /* adds 1 if in%k is 0, 0 otherwise */
    }

    printf("%d\n",tot);
    return 0;
}
This program is approximately 2.6 times faster than the solution you gave above (on my machine).
You could try reading the input line by line and using atoi() on each row. This should be a little faster than scanf, because you remove the scanning overhead of the format string.
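A minimal sketch of that suggestion (the helper name and buffer size are mine, not from the answer):
#include <cstdio>
#include <cstdlib>

// Read one whitespace-terminated integer per line with fgets() + atoi().
bool read_int(FILE *fp, int &value) {
    char line[32];
    if (!fgets(line, sizeof line, fp))
        return false;   // EOF or read error
    value = atoi(line); // no format-string scanning overhead
    return true;
}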
I think the code is fine. I ran it on my computer in less than 0.3s
I even ran it on much larger inputs in less than a second.
How are you timing it?
One small thing you could do is remove the if statement:
start with total=n and then, inside the loop:
total -= (inputnum % k + k - 1) / k; // subtracts 0 if divisible, 1 if not
Though I doubt CodeChef will accept it, one possibility is to use multiple threads: one to handle the I/O, and another to process the data. This is especially effective on a multi-core processor, but can help even with a single core. For example, on Windows you could use code like this (no real attempt at conforming to CodeChef requirements; I doubt they'll accept it with the timing data in the output):
#include <windows.h>
#include <process.h>
#include <iostream>
#include <time.h>
#include "queue.hpp"

namespace jvc = JVC_thread_queue;

struct buffer {
    static const int initial_size = 1024 * 1024;
    char buf[initial_size];
    size_t size;
    buffer() : size(initial_size) {}
};

jvc::queue<buffer *> outputs;

void read(HANDLE file) {
    // read data from specified file, put into buffers for processing.
    char temp[32];
    int temp_len = 0;
    int i;
    buffer *b;
    DWORD read;

    do {
        b = new buffer;
        // If we have a partial line from the previous buffer, copy it into this one.
        if (temp_len != 0)
            memcpy(b->buf, temp, temp_len);
        // Then fill the buffer with data.
        ReadFile(file, b->buf+temp_len, b->size-temp_len, &read, NULL);
        // Look for partial line at end of buffer.
        for (i=read; b->buf[i] != '\n'; --i)
            ;
        // copy partial line to holding area.
        memcpy(temp, b->buf+i, temp_len=read-i);
        // adjust size.
        b->size = i;
        // put buffer into queue for processing thread.
        // transfers ownership.
        outputs.add(b);
    } while (read != 0);
}

// A simplified istrstream that can only read int's.
class num_reader {
    buffer &b;
    char *pos;
    char *end;
public:
    num_reader(buffer *buf) : b(*buf), pos(b.buf), end(pos+b.size) {}

    num_reader &operator>>(int &value){
        int v = 0;
        // skip leading "stuff" up to the first digit.
        while ((pos < end) && !isdigit(*pos))
            ++pos;
        // read digits, create value from them.
        while ((pos < end) && isdigit(*pos)) {
            v = 10 * v + *pos-'0';
            ++pos;
        }
        value = v;
        return *this;
    }

    // return stream status -- only whether we're at end
    operator bool() { return pos < end; }
};

int result;

unsigned __stdcall processing_thread(void *) {
    int value;
    int n, k;
    int count = 0;

    // Read first buffer: n & k followed by values.
    buffer *b = outputs.pop();
    num_reader input(b);
    input >> n;
    input >> k;
    while (input >> value && ++count < n)
        result += ((value % k) == 0);
    // Ownership was transferred -- delete buffer when finished.
    delete b;

    // Then read subsequent buffers:
    while ((b=outputs.pop()) && (b->size != 0)) {
        num_reader input(b);
        while (input >> value && ++count < n)
            result += ((value % k) == 0);
        // Ownership was transferred -- delete buffer when finished.
        delete b;
    }
    return 0;
}

int main() {
    HANDLE standard_input = GetStdHandle(STD_INPUT_HANDLE);
    HANDLE processor = (HANDLE)_beginthreadex(NULL, 0, processing_thread, NULL, 0, NULL);

    clock_t start = clock();
    read(standard_input);
    WaitForSingleObject(processor, INFINITE);
    clock_t finish = clock();

    std::cout << (float)(finish-start)/CLOCKS_PER_SEC << " Seconds.\n";
    std::cout << result;
    return 0;
}
This uses a thread-safe queue class I wrote years ago:
#ifndef QUEUE_H_INCLUDED
#define QUEUE_H_INCLUDED

namespace JVC_thread_queue {

template<class T, unsigned max = 256>
class queue {
    HANDLE space_avail;     // at least one slot empty
    HANDLE data_avail;      // at least one slot full
    CRITICAL_SECTION mutex; // protect buffer, in_pos, out_pos
    T buffer[max];
    long in_pos, out_pos;
public:
    queue() : in_pos(0), out_pos(0) {
        space_avail = CreateSemaphore(NULL, max, max, NULL);
        data_avail = CreateSemaphore(NULL, 0, max, NULL);
        InitializeCriticalSection(&mutex);
    }

    void add(T data) {
        WaitForSingleObject(space_avail, INFINITE);
        EnterCriticalSection(&mutex);
        buffer[in_pos] = data;
        in_pos = (in_pos + 1) % max;
        LeaveCriticalSection(&mutex);
        ReleaseSemaphore(data_avail, 1, NULL);
    }

    T pop() {
        WaitForSingleObject(data_avail, INFINITE);
        EnterCriticalSection(&mutex);
        T retval = buffer[out_pos];
        out_pos = (out_pos + 1) % max;
        LeaveCriticalSection(&mutex);
        ReleaseSemaphore(space_avail, 1, NULL);
        return retval;
    }

    ~queue() {
        DeleteCriticalSection(&mutex);
        CloseHandle(data_avail);
        CloseHandle(space_avail);
    }
};

}
#endif
Exactly how much you gain from this depends on the amount of time spent reading versus the amount of time spent on other processing. In this case, the other processing is sufficiently trivial that it probably doesn't gain much. If more time was spent on processing the data, multi-threading would probably gain more.
2.5 MB/sec is 400 ns/byte.
There are two big per-byte processes: file input and parsing.
For the file input, I would just load it into a big memory buffer; fread should be able to read that in at roughly full disk bandwidth.
For the parsing, sscanf is built for generality, not speed. atoi should be pretty fast. My habit, for better or worse, is to do it myself, as in:
#define DIGIT(c) ((c) >= '0' && (c) <= '9')

bool parsInt(char* &p, int& num){
    while(*p && *p <= ' ') p++; // scan over whitespace
    if (!DIGIT(*p)) return false;
    num = 0;
    while(DIGIT(*p)){
        num = num * 10 + (*p++ - '0');
    }
    return true;
}
The loops, first over leading whitespace, then over the digits, should be nearly as fast as the machine can go, certainly a lot less than 400ns/byte.
Dividing two large numbers is hard. Perhaps an improvement would be to first characterize k a little by checking it against some small primes, say 2, 3, and 5 for now. If k is divisible by any of these, then inputnum also needs to be, or inputnum is not divisible by k. Of course there are more tricks to play (you could use a bitwise AND of inputnum with 1 to determine whether it is divisible by 2), but I think just removing the low-prime possibilities will give a reasonable speed improvement (worth a shot anyway).
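A hedged illustration of that idea (the function name is mine, not from the answer): if a small prime p divides k but not inputnum, then k cannot divide inputnum, so the expensive full modulo can be skipped.
// Pre-filter by small prime factors of k before the full divisibility test.
bool divisibleByK(int inputnum, int k) {
    static const int primes[] = {2, 3, 5};
    for (int p : primes)
        if (k % p == 0 && inputnum % p != 0)
            return false;      // p divides k but not inputnum, so k cannot divide it
    return inputnum % k == 0;  // fall back to the full test
}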