Big array of size 1 MB causes high CPU? - c++

I have a multithreaded server application. It receives data from sockets, then handles that data (unpacking packets, adding to a data queue, etc.) in the function below. The function is called frequently: a select() loop checks for readable sockets and calls it whenever data is available:
// The main function used to receive file data from clients.
void service(void) {
    while (1) {
        ....
        struct timeval timeout;
        timeout.tv_sec = 3;
        ...
        ret = select(maxFd + 1, &read_set, NULL, NULL, &timeout);
        if (ret > 0) {
            // Get the socket from SocketsMap; if fd is in SocketsMap
            // and its bit is set, receive data from the socket.
            receive_data(fd);
        }
    }
}

void receive_data(int fd) {
    const int ONE_MEGA = 1024 * 1024;
    //char *buffer = new char[ONE_MEGA]; // consumes much less CPU
    char buffer[ONE_MEGA];               // causes high CPU
    int readn = recv(fd, buffer, ONE_MEGA, 0);
    // handle the data
}
I found that the code above consumes too much CPU -- usually 80% to 90% -- but if I allocate the buffer on the heap instead, CPU usage is only about 14%. Why?
[update]
Added more code
[update2]
The strangest thing is that I also wrote another simple data-receiving server and client. That server simply receives data from sockets and discards it. There, both kinds of allocation perform almost the same, with no big difference in CPU usage. In the multithreaded server application that has the problem, even after I raised the process stack size to 30 MB, using the array still triggers the problem, while allocating from the heap solves it. I don't know why.
Regarding sizeof(buffer): thanks for pointing that out, but I am 100% sure it is not the problem, because my application doesn't use sizeof(buffer); it uses ONE_MEGA (1024*1024) instead.
By the way, one more thing worth mentioning, though I'm not sure whether it's useful: replacing the array with a smaller one such as "char buffer[1024];" also decreases the CPU usage dramatically.
[update3]
All sockets are in non-blocking mode.

I just wrote this:
#include <iostream>
#include <cstdio>

using namespace std;

static __inline__ unsigned long long rdtsc(void)
{
    unsigned hi, lo;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return ((unsigned long long)lo) | (((unsigned long long)hi) << 32);
}

const int M = 1024 * 1024;

void bigstack()
{
    FILE *f = fopen("test.txt", "r");
    unsigned long long time;
    char buffer[M];

    time = rdtsc();
    fread(buffer, M, 1, f);
    time = rdtsc() - time;
    fclose(f);
    cout << "bs: Time = " << time / 1000 << endl;
}

void bigheap()
{
    FILE *f = fopen("test.txt", "r");
    unsigned long long time;
    char *buffer = new char[M];

    time = rdtsc();
    fread(buffer, M, 1, f);
    time = rdtsc() - time;
    delete [] buffer;
    fclose(f);
    cout << "bh: Time = " << time / 1000 << endl;
}

int main()
{
    for (int i = 0; i < 10; i++)
    {
        bigstack();
        bigheap();
    }
}
The output is something like this:
bs: Time = 8434
bh: Time = 7242
bs: Time = 1094
bh: Time = 2060
bs: Time = 842
bh: Time = 830
bs: Time = 785
bh: Time = 781
bs: Time = 782
bh: Time = 804
bs: Time = 782
bh: Time = 778
bs: Time = 792
bh: Time = 809
bs: Time = 785
bh: Time = 786
bs: Time = 782
bh: Time = 829
bs: Time = 786
bh: Time = 781
In other words, allocating from the stack or the heap makes absolutely no difference. The small amount of "slowness" at the beginning has to do with warming up the caches.
And I'm fairly convinced that the reason your code behaves differently between the two is something else - maybe what simonc says: sizeof buffer is the problem?

If all things are equal, memory is memory and it shouldn't matter whether your buffer is on the heap or on the stack.
But clearly all things aren't equal. I suspect that the allocation of the 1 MB buffer on the stack interferes/overlaps with the stack space allocated to the other threads. That is, growing the stack requires either relocating the stack of the current thread or relocating the stacks of the other threads, and that takes time. This time is not needed when allocating from the heap, or when the stack allocation is small enough not to interfere, as in the 1 KB example.
Assuming you are using a Posix-compatible thread implementation, take a look at
pthread_create
pthread_attr_getstack
pthread_attr_setstack
for giving the thread with the 1M buffer more stack space at thread creation time.
-Jeff

You're ignoring the return value from recv. That's not good. Partial reads are a fact of life, and very likely when you pass such a large buffer. If you start processing parts of the buffer that don't contain valid data, unexpected things can happen.
The maximum frame size for the most commonly used protocols is 64 kB. It's even possible (though unlikely) that something in the system uses only the lowest 16 bits of the buffer size -- which, at 1 MB, you've incidentally set to zero. That would cause recv to return immediately without doing anything, resulting in an endless loop and high CPU usage.
Of course none of this should be any different with a dynamically allocated buffer, but if you also used sizeof(buffer) and ended up with the heap-using code reading only a pointer-sized chunk at a time, it could be different.

Related

Why is performance reduced so much when I run a for loop with a count of 5368709120 and a few lines of memcpy?

I allocate three large byte arrays and initialize them to some values. I have to perform operations on every 64 bits across these three arrays. I created a for loop that walks the arrays, converts each consecutive 8 bytes (64 bits) into a 64-bit integer using memcpy, and performs operations between them. Afterwards, I measure the time taken by the for loop. My code is given here.
#include <stdio.h>
#include <iostream>
#include <ctime>
#include <cstring>    // memset
#include <Windows.h>
#include <chrono>

using namespace std;

BYTE* buffer1;
BYTE* buffer2;
BYTE* buffer3;

int main()
{
    unsigned long long offsetValue = 0;
    int64_t data1, data2, data3;
    unsigned long long BufferSize = 5368709120;

    buffer1 = (BYTE*)malloc(BufferSize);
    buffer2 = (BYTE*)malloc(BufferSize);
    buffer3 = (BYTE*)malloc(BufferSize);
    memset(buffer1, 0, BufferSize);
    memset(buffer2, 1, BufferSize);
    memset(buffer3, 1, BufferSize);

    bool overallResult = false;
    bool stopOnFail = false;

    auto start = chrono::steady_clock::now();
    for (unsigned long long i = 0, cycle = 0; i < BufferSize; i += 8, ++cycle)
    {
        long long offset = (offsetValue * 8) + i;
        if (offset > BufferSize - 1)
            break;
        else if (offset < 0)
            continue;

        memcpy(&data1, buffer1 + offset, sizeof(int64_t));
        if (data1 == -1)
            continue;
        memcpy(&data2, buffer2 + offset, sizeof(int64_t));
        memcpy(&data3, buffer3 + offset, sizeof(int64_t));

        int64_t Exor = data2 ^ data3 ^ -1;
        int64_t Or = Exor | data1;
        bool result = Or == -1;
        overallResult &= result;
        if (!result)
        {
            if (stopOnFail)
                break;
        }
    }
    auto ending = chrono::steady_clock::now();
    cout << "For loop Execution time in milliseconds :"
         << chrono::duration_cast<chrono::milliseconds>(ending - start).count()
         << " ms" << endl;

    free(buffer1);
    free(buffer2);
    free(buffer3);
    system("pause");
    return 0;
}
A for-loop count of 4294967296 gave me a time of 760 milliseconds, but a count of 5368709120 gives me 25000 milliseconds. What is draining the time in the for loop? How should I optimize?
1. You're not using the value of overallResult outside the loop, so a good optimizing compiler can optimize away the loop entirely. MSVC probably isn't that smart, but it's still a good idea to e.g. print out overallResult at the end.
2. You're allocating (and actually using) 3 × 5,368,709,120 bytes = 15 GB. A Windows 10 system uses well over 1 GB just to run (especially in combination with Visual Studio), so on a system with 16 GB, allocating 15 GB inevitably causes paging, which is most probably what you're observing (a ~20-40x slowdown is also characteristic of memory paging).
To verify:
Open up Performance Monitor (perfmon.exe)
Add Counters -> Paging File -> % Usage
Run your program
If the paging counters are > 0, then you don't have enough RAM, and looping over memory will slow down due to reading of pages from disk.
You can also watch RAM usage in Task Manager -> Performance tab.

A separate loop slows down an independent earlier loop?

How can a separate loop affect the performance of an independent earlier loop?
My first loop reads some large text files and counts the lines/rows.
After a malloc, the second loop populates the allocated matrix.
If the second loop is commented out, the first loop takes 1.5 seconds. However, compiling WITH the second loop slows down the first loop, which now takes 30-40 seconds!
In other words: the second loop somehow slows down the first. I have tried changing scope, compilers, and compiler flags, changing the loop itself, bringing everything into main(), using boost::iostream, and even placing one loop in a shared library, but the same problem persists in every attempt!
The first loop is fast until the program is compiled with the second loop.
EDIT: Here is a full example of my problem:
#include <iostream>
#include <vector>
#include "string.h"
#include "unistd.h"          // read(), close()
#include "boost/chrono.hpp"
#include "sys/mman.h"
#include "sys/stat.h"
#include "fcntl.h"
#include <cstdlib>           // malloc(), free()
#include <algorithm>

unsigned long int countLines(char const *fname) {
    static const auto BUFFER_SIZE = 16 * 1024;
    int fd = open(fname, O_RDONLY);
    if (fd == -1) {
        std::cout << "Open Error" << std::endl;
        std::exit(EXIT_FAILURE);
    }
    posix_fadvise(fd, 0, 0, 1);
    char buf[BUFFER_SIZE + 1];
    unsigned long int lines = 0;

    while (size_t bytes_read = read(fd, buf, BUFFER_SIZE)) {
        if (bytes_read == (size_t)-1) {
            std::cout << "Read Failed" << std::endl;
            std::exit(EXIT_FAILURE);
        }
        if (!bytes_read)
            break;

        int n;
        char *p;
        for (p = buf, n = bytes_read;
             n > 0 && (p = (char *)memchr(p, '\n', n));
             n = (buf + bytes_read) - ++p)
            ++lines;
    }
    close(fd);
    return lines;
}

int main(int argc, char *argv[])
{
    // initial variables
    int offset = 55;
    unsigned long int rows = 0;
    unsigned long int cols = 0;
    std::vector<unsigned long int> dbRows = {0, 0, 0};
    std::vector<std::string> files = {"DATA/test/file1.csv",  // large files: 3 GB
                                      "DATA/test/file2.csv",  // each line is 55 chars long
                                      "DATA/test/file3.csv"};

    // find each file's number of rows
    for (int x = 0; x < files.size(); x++) {                  // <--- FIRST LOOP **
        dbRows[x] = countLines(files[x].c_str());
    }

    // define matrix row count as the largest row count found
    // define matrix col count as 55 chars for each csv file
    std::vector<unsigned long int>::iterator maxCount;
    maxCount = std::max_element(dbRows.begin(), dbRows.end());
    rows = dbRows[std::distance(dbRows.begin(), maxCount)];   // typically rows = 72716067
    cols = dbRows.size() * offset;                            // cols = 165

    // malloc required space (11998151055)
    char *syncData = (char *)malloc(rows * cols * sizeof(char));

    // fill up allocated memory with a test letter
    char t[] = "x";
    for (unsigned long int x = 0; x < (rows * cols); x++) {   // <--- SECOND LOOP **
        syncData[x] = t[0];
    }
    free(syncData);
    return 0;
}
I have also noticed that lowering the number of columns speeds up the first loop.
A profiler points the finger to this line:
while(size_t bytes_read = read(fd, buf, BUFFER_SIZE))
The program idles on this line for 30 seconds or a wait count of 230,000.
In assembly, the wait count occurs on:
Block 5:
lea 0x8(%rsp), %rsi
mov %r12d, %edi
mov $0x4000, %edx
callq 0x402fc0 <------ stalls on callq
Block 6:
mov %rax, %rbx
test %rbx, %rbx
jz 0x404480 <Block 18>
My guess is that the call blocks when reading from the stream, but I don't know why.
My theory:
Allocating and touching all that memory evicts the big files from disk cache, so the next run has to read them from disk.
If you ran the version without loop2 a couple times to warm up the disk cache, then run a version with loop2, I predict it will be fast the first time, but slow for further runs without warming up the disk cache again first.
The memory consumption happens after the files have been read. This causes "memory pressure" on the page cache (aka disk cache), causing it to evict data from the cache to make room for the pages for your process to write to.
Your computer probably has just barely enough free RAM to cache your working set. Closing your web browser might free up enough to make a difference! Or not, since your 11998151055 is 11.1GiB, and you're writing every page of it. (Every byte, even. You could do it with memset for higher performance, although I assume what you've shown is just a dummy version)
BTW, another tool for investigating this would be time ./a.out. It can show you if your program is spending all its CPU time in user-space vs. kernel ("system") time.
If user+sys adds up to the real time, your process is CPU bound. If not, it's I/O bound, and your process is blocking on disk I/O (which is normal, since counting newlines should be fast).
From THIS PAGE:
"The function close closes the file descriptor filedes. Closing a file has the following consequences:
The file descriptor is deallocated.
Any record locks owned by the process on the file are unlocked.
When all file descriptors associated with a pipe or FIFO have been closed, any unread data is discarded."
I think the resources might still be locked by a previous read. Try closing the file and let us know the result.

Slow heap array performance

I'm experiencing strange memory access performance problem, any ideas?
int* pixel_ptr = somewhereFromHeap;
int local_ptr[307200]; // local

// this is very slow
for (int i = 0; i < 307200; i++) {
    pixel_ptr[i] = someCalculatedVal;
}

// this is very slow
for (int i = 0; i < 307200; i++) {
    pixel_ptr[i] = 1; // constant
}

// this is fast
for (int i = 0; i < 307200; i++) {
    int val = pixel_ptr[i];
    local_ptr[i] = val;
}

// this is fast
for (int i = 0; i < 307200; i++) {
    local_ptr[i] = someCalculatedVal;
}
I tried consolidating values into a local scanline:
int scanline[640]; // local

// this is very slow
for (int i = xMin; i < xMax; i++) {
    int screen_pos = sy * screen_width + i;
    int val = scanline[i];
    pixel_ptr[screen_pos] = val;
}

// this is fast
for (int i = xMin; i < xMax; i++) {
    int screen_pos = sy * screen_width + i;
    int val = scanline[i];
    pixel_ptr[screen_pos] = 1; // constant
}

// this is fast
for (int i = xMin; i < xMax; i++) {
    int screen_pos = sy * screen_width + i;
    int val = i; // or a constant
    pixel_ptr[screen_pos] = val;
}

// this is slow
for (int i = xMin; i < xMax; i++) {
    int screen_pos = sy * screen_width + i;
    int val = scanline[0];
    pixel_ptr[screen_pos] = val;
}
Any ideas? I'm using MinGW with cflags -O1 -std=c++11 -fpermissive.
update4:
I should say that these are snippets from my program, and there is heavy code running before and after them. The scanline block ran at the end of the function, just before exit.
Now with a proper test program, thanks to @Iwillnotexist:
#include <stdio.h>
#include <unistd.h>
#include <sys/time.h>

#define SIZE 307200
#define SAMPLES 1000

double local_test() {
    int local_array[SIZE];
    timeval start, end;
    long cpu_time_used_sec, cpu_time_used_usec;
    double cpu_time_used;

    gettimeofday(&start, NULL);
    for (int i = 0; i < SIZE; i++) {
        local_array[i] = i;
    }
    gettimeofday(&end, NULL);

    cpu_time_used_sec = end.tv_sec - start.tv_sec;
    cpu_time_used_usec = end.tv_usec - start.tv_usec;
    cpu_time_used = cpu_time_used_sec * 1000 + cpu_time_used_usec / 1000.0;
    return cpu_time_used;
}

double heap_test() {
    int* heap_array = new int[SIZE];
    timeval start, end;
    long cpu_time_used_sec, cpu_time_used_usec;
    double cpu_time_used;

    gettimeofday(&start, NULL);
    for (int i = 0; i < SIZE; i++) {
        heap_array[i] = i;
    }
    gettimeofday(&end, NULL);

    cpu_time_used_sec = end.tv_sec - start.tv_sec;
    cpu_time_used_usec = end.tv_usec - start.tv_usec;
    cpu_time_used = cpu_time_used_sec * 1000 + cpu_time_used_usec / 1000.0;
    delete[] heap_array;
    return cpu_time_used;
}

double heap_test2() {
    static int* heap_array = NULL;
    if (heap_array == NULL) {
        heap_array = new int[SIZE];
    }
    timeval start, end;
    long cpu_time_used_sec, cpu_time_used_usec;
    double cpu_time_used;

    gettimeofday(&start, NULL);
    for (int i = 0; i < SIZE; i++) {
        heap_array[i] = i;
    }
    gettimeofday(&end, NULL);

    cpu_time_used_sec = end.tv_sec - start.tv_sec;
    cpu_time_used_usec = end.tv_usec - start.tv_usec;
    cpu_time_used = cpu_time_used_sec * 1000 + cpu_time_used_usec / 1000.0;
    return cpu_time_used;
}

int main(int argc, char** argv) {
    double cpu_time_used = 0;
    for (int i = 0; i < SAMPLES; i++)
        cpu_time_used += local_test();
    printf("local: %f ms\n", cpu_time_used);

    cpu_time_used = 0;
    for (int i = 0; i < SAMPLES; i++)
        cpu_time_used += heap_test();
    printf("heap_: %f ms\n", cpu_time_used);

    cpu_time_used = 0;
    for (int i = 0; i < SAMPLES; i++)
        cpu_time_used += heap_test2();
    printf("heap2: %f ms\n", cpu_time_used);
}
Compiled with no optimization.
local: 577.201000 ms
heap_: 826.802000 ms
heap2: 686.401000 ms
The first heap test, with new and delete, is 2x slower. (Paging, as suggested?)
The second heap test, which reuses the heap array, is still 1.2x slower.
But I guess the second test is not that practical, as other code tends to run before and after, at least in my case. For my case, my pixel_ptr is of course only allocated once, during program initialization.
But if anyone has solutions/ideas for speeding things up, please reply!
I'm still perplexed why heap writes are so much slower than writes to the stack segment.
Surely there must be some tricks to make the heap more CPU/cache friendly.
Final update?:
I revisited the disassemblies again, and this time I suddenly had an idea why some of my breakpoints don't activate. The program looks suspiciously shorter, so I suspect the compiler may have removed the redundant dummy code I put in, which explains why the local array is magically many times faster.
I was a bit curious, so I did the test, and indeed I could measure a difference between stack and heap access.
The first guess would be that the generated assembly is different, but after taking a look, it is actually identical for heap and stack (which makes sense; memory shouldn't be discriminated).
If the assembly is the same, then the difference must come from the paging mechanism. The guess is that on the stack the pages are already allocated, but on the heap the first access causes a page fault and a page allocation (invisible; it all happens at kernel level). To verify this, I did the same test, but first accessed the heap once before measuring. That test gave identical times for stack and heap. To be sure, I also did a test in which I first accessed the heap, but only every 4096 bytes (every 1024th int), then every 8192, because a page is usually 4096 bytes long. The result: accessing only every 4096 bytes also gives the same time for heap and stack, but accessing every 8192 bytes gives a difference, though not as large as with no previous access at all. This is because only half of the pages were accessed and allocated beforehand.
So the answer is that on the stack, memory pages are already allocated, but on the heap, pages are allocated on the fly. This depends on the OS paging policy, but all major PC OSes probably behave similarly.
For all the tests I used Windows, with the MS compiler targeting x64.
EDIT: For the test, I measured a single, larger loop, so there was only one access to each memory location. Deleting the array and measuring the same loop multiple times should give similar times for stack and heap, because deleting memory probably doesn't de-allocate the pages, and they are already allocated for the next loop (if the next new allocates in the same space).
The following two code examples should not differ in runtime with a good compiler setting; your compiler will probably generate the same code for both:

// this is fast
for (int i = 0; i < 307200; i++) {
    int val = pixel_ptr[i];
    local_ptr[i] = val;
}

// this is fast
for (int i = 0; i < 307200; i++) {
    local_ptr[i] = pixel_ptr[i];
}

Please try increasing the optimization setting.

Why does reading speed just increase with increasing buffer size in ifstream?

I have an information retrieval and storage course project; for the first part I have to find the optimum buffer size for reading big files from the hard disk. Our TA says that reading speed increases with buffer size up to a certain point (usually 4 bytes) and decreases after that. But with my code below, it just keeps increasing no matter the buffer size or the file size (I have tested with 100 MB). From what I know, buffering only makes sense for parallel asynchronous processes (like threads), and the expected buffer-size/reading-speed curve should hold when the file is defragmented and/or the cost of looking up the file directory and disk addresses is significant enough. So is the problem related to my code, to the way ifstream handles things, or do those conditions simply not hold here?
ifstream in("D:ISR\\Articles.dat", std::ifstream::binary);
if (in)
{
    in.seekg(0, in.end);
    int length = in.tellg();
    length = 100 * 1024 * 1024;

    int bufferSize = 2;
    int blockSize = 1024; // 1 kB
    int numberOfBlocks = length / blockSize;
    if (length % blockSize > 0) numberOfBlocks++;

    clock_t t;
    double time;
    for (int i = 0; i < 5; i++)
    {
        in.seekg(0, in.beg);
        int position = 0;
        int bufferPosition;
        char* streamBuffer = new char[bufferSize];
        in.rdbuf()->pubsetbuf(streamBuffer, bufferSize);

        t = clock();
        for (int i = 0; i < numberOfBlocks; i++)
        {
            char* buffer = new char[blockSize];
            bufferPosition = 0;
            while (bufferPosition < blockSize && position < length)
            {
                in.read(buffer + bufferPosition, bufferSize);
                position += bufferSize;
                bufferPosition += bufferSize;
            }
            delete[] buffer;
        }
        t = clock() - t;
        time = double(t) / CLOCKS_PER_SEC;
        cout << "Buffer size : " << bufferSize << " -> Total time in seconds : " << time << "\n";
        bufferSize *= 2;
    }
}
"from what I know, buffering only makes sense for parallel asynchronous processes"
No! No! Buffering makes sense in many situations. A common one is I/O: if you increase the size of the read/write buffer, the operating system can touch the I/O device less often and read/write larger blocks in each operation, so performance gets better.
Choose buffer sizes that are powers of two (128, 256, 512, 1024, ...); other sizes can decrease performance.
"it just keeps increasing no matter the buffer size or the file size"
That statement does not hold true. Because you measure your program repeatedly, successive results will be better than previous ones thanks to the system cache: you are reading the file content from the system cache instead of the hard disk. BUT once the buffer size exceeds a threshold, reading performance WILL decrease. See chapter 3 of Richard Stevens's APUE, 2nd edition, for detailed and extensive experiments with read and write buffer sizes.

C++ Optimal Block Size For Reading From A File

I have a program that generates files containing random distributions of the characters A-Z. I have written a method that reads these files (and counts each character) using fread with different buffer sizes, in an attempt to determine the optimal block size for reads. Here is the method:
int get_histogram(FILE * fp, long *hist, int block_size, long *milliseconds, long *filelen)
{
    char *buffer = new char[block_size];
    bzero(buffer, block_size);

    struct timeb t;
    ftime(&t);
    long start_in_ms = t.time * 1000 + t.millitm;

    size_t bytes_read = 0;
    while (!feof(fp))
    {
        bytes_read += fread(buffer, 1, block_size, fp);
        if (ferror(fp))
        {
            return -1;
        }
        int i;
        for (i = 0; i < block_size; i++)
        {
            int j;
            for (j = 0; j < 26; j++)
            {
                if (buffer[i] == 'A' + j)
                {
                    hist[j]++;
                }
            }
        }
    }
    ftime(&t);
    long end_in_ms = t.time * 1000 + t.millitm;
    *milliseconds = end_in_ms - start_in_ms;
    *filelen = bytes_read;
    return 0;
}
However, when I plot bytes/second vs. block size (buffer size) using block sizes from 2 to 2^20, I get an optimal block size of 4 bytes -- which just can't be correct. Something must be wrong with my code, but I can't find it.
Any advice is appreciated.
Regards.
EDIT:
The point of this exercise is to demonstrate the optimal buffer size by recording the read times (plus computation time) for different buffer sizes. The file pointer is opened and closed by the calling code.
There are many bugs in this code:
It uses new[], which is C++.
It doesn't free the allocated memory.
It always loops over block_size bytes of input, not bytes_read as returned by fread().
Also, the actual histogram code is rather inefficient, since it seems to loop over each character to determine which character it is.
UPDATE: Removed claim that using feof() before I/O is wrong, since that wasn't true. Thanks to Eric for pointing this out in a comment.
You don't state what platform you're running this on or what compile-time parameters you use.
Of course, fread() involves some overhead for leaving user mode and returning. On the other hand, instead of setting the hist[] entry directly, you're looping through the alphabet. This is unnecessary and, without optimization, adds overhead per byte.
I'd re-test this with hist[buffer[i] - 'A']++ or something similar.
Typically, the best timing is achieved when your buffer size equals the system's buffer size for the given medium.