I am working on a sorting project and I've reached the point where the main bottleneck is reading in the data. It takes my program about 20 seconds to sort 100,000,000 integers read from stdin using cin with std::ios::sync_with_stdio(false), but it turns out that 10 of those seconds are spent reading the data. We do know how many integers we will be reading (the count is at the top of the file we need to sort).
How can I make this faster? I know it's possible because a student in a previous semester was able to do counting sort in a little over 3 seconds (and that's basically purely read time).
The program is just fed the contents of a file with integers separated by newlines like $ ./program < numstosort.txt
Thanks
Here is the relevant code:
std::ios::sync_with_stdio(false);
int max;
cin >> max;
short num;
short* a = new short[max];
int n = 0;
while(cin >> num) {
    a[n] = num;
    n++;
}
Note that since you apparently aren't allowed to use compiler optimizations, C++ IO is not going to be the fastest way to read data. As others have noted, without optimizations the C++ code will not run anywhere near as fast as it can.
Given that the redirected file is already open as stdin/STDIN_FILENO, use low-level system-call/C-style IO. The following will get your data into memory about as fast as possible, assuming Linux/POSIX running on commodity hardware, and it won't need to be optimized, as it will run just about as fast as possible:
struct stat sb;
int rc = ::fstat( STDIN_FILENO, &sb );

// use C-style calloc() to get memory that's been
// set to zero, as calloc() is often optimized to be
// faster than a new followed by a memset().
char *data = (char *)::calloc( 1, sb.st_size + 1 );

size_t totalRead = 0UL;
while ( totalRead < ( size_t ) sb.st_size )
{
    ssize_t bytesRead = ::read( STDIN_FILENO,
        data + totalRead, sb.st_size - totalRead );
    if ( bytesRead <= 0 )
    {
        break;
    }
    totalRead += bytesRead;
}

// data is now in memory - start processing it
// data is now in memory - start processing it
That code will read your data into memory as one long C-style string. And the lack of compiler optimizations won't matter one bit as it's all almost bare-metal system calls.
Using fstat() to get the file size allows allocating all the needed memory at once - no realloc() or copying data around is necessary.
You'll need to add some error checking. A more robust version of the code would also check that what fstat() reports actually is a regular file with a real size, and not a "useless use of cat" such as cat filename | YourProgram, because in that case the fstat() call won't return a useful file size. Examine the sb.st_mode field of the struct stat after the call to see what the stdin stream really is:
::fstat( STDIN_FILENO, &sb );
...
if ( S_ISREG( sb.st_mode ) )
{
// regular file...
}
(And for really high-performance systems, it can be important to ensure that the memory pages you're reading data into are actually mapped in your process address space. Performance can really stall if data arrives faster than the kernel's memory management system can create virtual-to-physical mappings for the pages data is getting dumped into.)
To handle a large file as fast as possible, you'd want to go multithreaded, with one thread reading data and feeding one or more data processing threads so you can start processing data before you're done reading it.
Edit: parsing the data.
Again, disabling compiler optimizations probably makes C++ operations slower than C-style processing. Based on that assumption, something simple will probably run faster.
This would probably work a lot faster in a non-optimized binary, assuming the data is in a C-style string read in as above:
char *next;
long count = ::strtol( data, &next, 0 );
long *values = new long[ count ];
for ( long ii = 0; ii < count; ii++ )
{
    values[ ii ] = ::strtol( next, &next, 0 );
}
That is also very fragile. It relies on strtol() skipping over leading whitespace, meaning if there's anything other than whitespace between the numeric values it will fail. It also relies on the initial count of values being correct. Again - that code will fail if that's not true. And because it can replace the value of next before checking for errors, if it ever goes off the rails because of bad data it'll be hopelessly lost.
But it should be about as fast as possible without allowing compiler optimizations.
That's what's crazy about not allowing compiler optimizations. You can write simple, robust C++ code to do all your processing, make use of a good optimizing compiler, and probably run almost as fast as the code I posted - which has no error checking and will fail spectacularly in unexpected and undefined ways if fed unexpected data.
You can make it faster by using a solid-state drive. Also, if you want to ask about code performance, you need to post how you are doing things in the first place.
You may be able to speed up your program by reading the data into a buffer, then converting the text in the buffer to internal representation.
The thought behind this is that all stream devices like to keep streaming. Starting and stopping the stream wastes time. A block read transfers a lot of data with one transaction.
Although cin is buffered, by using cin.read and a buffer, you can make the buffer a lot bigger than cin uses.
If the data has fixed width fields, there are opportunities to speed up the input and conversion processes.
Edit 1: Example
const unsigned int BUFFER_SIZE = 65536;
char text_buffer[BUFFER_SIZE];
//...
cin.read(text_buffer, BUFFER_SIZE);
//...
int value1;
// _snscanf is MSVC-specific; plain sscanf works if the buffer is NUL-terminated
int arguments_scanned = _snscanf(text_buffer, REMAINING_BUFFER_SIZE,
                                 "%d", &value1);
The tricky part is handling the cases where the text of a number is cut off at the end of the buffer.
Can you run this little test and compare it to yours, with and without the commented-out line?
#include <iostream>
#include <cstdlib>

int main()
{
    std::ios::sync_with_stdio(false);
    char buffer[20] {0,};
    int t = 0;
    while( std::cin.get(buffer, 20) )
    {
        // t = std::atoi(buffer);
        std::cin.ignore(1);
    }
    return 0;
}
Pure read test:
#include <iostream>
#include <cstdlib>

int main()
{
    std::ios::sync_with_stdio(false);
    char buffer[1024*1024];
    while( std::cin.read(buffer, 1024*1024) )
    {
    }
    return 0;
}
In my program I want to read several text files (more than ~800), each with 256 lines and filenames running from 1.txt to n.txt, and store them in a database after several processing steps. My problem is the reading speed. I could roughly double the speed by using OpenMP multithreading for the reading loop. Is there a way to speed it up a bit more? My actual code is
std::string CCD_Folder = CCDFolder; //CCDFolder is a pointer to a char array
int b = 0;
int PosCounter = 0;
int WAVENUMBER, WAVELUT;
std::vector<std::string> tempstr;
std::string inputline;
//Input
omp_set_num_threads(YValue);
#pragma omp parallel for private(WAVENUMBER) private(WAVELUT) private(PosCounter) private(tempstr) private(inputline)
for(int i = 1; i < (CCD_Filenumbers+1); i++)
{
    //std::cout << omp_get_thread_num() << ' ' << i << '\n';
    //Convert and build the filename, open the input stream
    std::string CCD_Filenumber = boost::lexical_cast<string>(i);
    std::string CCD_Filename = CCD_Folder + '\\' + CCD_Filenumber + ".txt";
    std::ifstream datain(CCD_Filename, std::ifstream::in);
    while(!datain.eof())
    {
        std::getline(datain, inputline);
        //Processing
    }
}
All variables which are not defined here are defined somewhere else in my code, and it is working. So is there a possibility to speed this code a bit more up?
Thank you very much!
Some experiment:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <Windows.h>
void generateFiles(int n) {
    char fileName[32];
    char fileStr[1032];
    for (int i=0;i<n;i++) {
        sprintf( fileName, "c:\\t\\%i.txt", i );
        FILE * f = fopen( fileName, "w" );
        for (int j=0;j<256;j++) {
            int lineLen = rand() % 1024;
            memset(fileStr, 'X', lineLen );
            fileStr[lineLen] = 0x0D;
            fileStr[lineLen+1] = 0x0A;
            fileStr[lineLen+2] = 0x00;
            fwrite( fileStr, 1, lineLen+2, f );
        }
        fclose(f);
    }
}
void readFiles(int n) {
    char fileName[32];
    for (int i=0;i<n;i++) {
        sprintf( fileName, "c:\\t\\%i.txt", i );
        FILE * f = fopen( fileName, "r" );
        fseek(f, 0L, SEEK_END);
        int size = ftell(f);
        fseek(f, 0L, SEEK_SET);
        char * data = (char*)malloc(size);
        fread(data, size, 1, f);
        free(data);
        fclose(f);
    }
}
DWORD WINAPI readInThread( LPVOID lpParam )
{
    int * number = (int *)lpParam;
    char fileName[32];
    sprintf( fileName, "c:\\t\\%i.txt", *number );
    FILE * f = fopen( fileName, "r" );
    fseek(f, 0L, SEEK_END);
    int size = ftell(f);
    fseek(f, 0L, SEEK_SET);
    char * data = (char*)malloc(size);
    fread(data, size, 1, f);
    free(data);
    fclose(f);
    return 0;
}
int main(int argc, char ** argv) {
    long t1 = GetTickCount();
    generateFiles(256);
    printf("Write: %li ms\n", GetTickCount() - t1 );

    t1 = GetTickCount();
    readFiles(256);
    printf("Read: %li ms\n", GetTickCount() - t1 );

    t1 = GetTickCount();
    const int MAX_THREADS = 256;
    int pDataArray[MAX_THREADS];
    DWORD dwThreadIdArray[MAX_THREADS];
    HANDLE hThreadArray[MAX_THREADS];
    for( int i=0; i<MAX_THREADS; i++ )
    {
        pDataArray[i] = i;   // each thread gets its own file number
        hThreadArray[i] = CreateThread(
            NULL,
            0,
            readInThread,
            &pDataArray[i],
            0,
            &dwThreadIdArray[i]);
    }
    WaitForMultipleObjects(MAX_THREADS, hThreadArray, TRUE, INFINITE);
    printf("Read (threaded): %li ms\n", GetTickCount() - t1 );
}
The first function is just an ugly way to make a test dataset (I know it could be done much better, but I honestly had no time).
1st experiment - sequential read
2nd experiment - read all in parallel
results:
256 files:
Write: 250 ms
Read: 140 ms
Read (threaded): 78 ms
1024 files:
Write: 1250 ms
Read: 547 ms
Read (threaded): 843 ms
I think the second attempt clearly shows that in the long run 'dumb' thread creation just makes things worse. Of course it could be improved with preallocated workers, a thread pool, etc., but I think for an operation as fast as reading 100-200 KB from disk there is no real benefit in moving it into a thread. I had no time to write a 'cleverer' solution, but I doubt it would be much faster, because you would have to add system calls for mutexes, etc.
Going to extremes, you could think about preallocating memory pools, etc., but as mentioned before, the code you posted is simply wrong: it's a matter of milliseconds, certainly not seconds.
800 files (20 chars per line, 256 lines)
Write: 250 ms
Read: 63 ms
Read (threaded): 500 ms
Conclusion:
ANSWER IS:
Your reading code is wrong. You are reading files so slowly that there is a significant speedup when you run the tasks in parallel. In the code above, reading a file is actually faster than the expense of spawning a thread.
Your primary bottleneck is physically reading from the hard disk.
Unless you have the files on separate drives, the drive can only read data from one file at a time. Your best bet is to read each file as a whole, rather than read a portion of one file, tell the drive to locate another file, read from there, and repeat. Repositioning the drive head, especially to other files, is usually more expensive than letting the drive finish reading a single file.
The next bottleneck is the data channel between the processor and the hard drive. If your hard drives share any kind of communications channel, you will see a bottleneck, as data from each drive must come through that channel to your processor. Your processor also sends commands to the drive(s) through this channel (PATA, SATA, USB, etc.).
The objective of the next steps is to reduce the overhead of the "middle men" between your program's memory and the hard drive's communications interface. The most efficient is to access the controller directly; less efficient are the OS functions; then the C functions (fread and family); and least efficient are C++ streams. With increased efficiency comes tighter coupling to the platform and reduced safety (and simplicity).
I suggest the following:
1. Create multiple buffers in memory, large enough to save time, small enough to prevent the OS from paging the memory to the hard drive.
2. Create a thread that reads the files into memory as necessary. Search the web for "double buffering". As long as there is space in the buffer, this thread will read data.
3. Create multiple "outgoing" buffers.
4. Create a second thread that removes data from memory, "processes" it, and inserts it into the "outgoing" buffers.
5. Create a third thread that takes the data in the "outgoing" buffers and sends it to the databases.
6. Adjust the size of the buffers for the best efficiency within the limitations of memory.
If you can access the DMA channels, use them to read from the hard drive into the "read buffers".
Next, you can optimize your code to use the processor's data cache efficiently. For example, arrange your "processing" so the data structures do not exceed a cache line. Also, optimize your code to use registers (either specify the register keyword or use statement blocks so that the compiler knows when variables can be reused).
Other optimizations that may help:
- Align data to the processor's native word size, padding if necessary. For example, prefer 32 bytes to 13 or 24.
- Fetch data in quantities of the processor's word size. For example, access 4 octets (bytes) at a time on a 32-bit processor rather than making 4 accesses of 1 byte.
- Unroll loops - put more instructions inside the loop, as branch instructions slow down processing.
You are probably hitting the read limit of your disks, which means your options are somewhat limited. If this is a constant problem you could consider a different RAID structure, which will give you greater read throughput because more than one read head can access data at the same time.
To see if disk access really is the bottleneck, run your program with the time command:
$ /usr/bin/time -v <my program>
In the output you'll see how much CPU time you were utilizing compared to the amount of time required for things like disk access.
I would try going with C code for reading the file. I suspect that it'll be faster.
FILE* f = ::fopen( CCD_Filename.c_str(), "rb" );
if( f == NULL )
{
    return;
}

::fseek( f, 0, SEEK_END );
const long lFileBytes = ::ftell( f );
::fseek( f, 0, SEEK_SET );

char* fileContents = new char[lFileBytes + 1];
const size_t numObjectsRead = ::fread( fileContents, lFileBytes, 1, f );
::fclose( f );

if( numObjectsRead < 1 )
{
    delete [] fileContents;
    return;
}

fileContents[lFileBytes] = '\0';
// assign char buffer of file contents here
delete [] fileContents;
I prefer not to use an XML parser library, so can you suggest a good write function for writing data to an XML file? I will make a lot of calls to the write function, so it should keep track of the last write position without taking too many resources. I have two different attempts below, but I can't keep track of the last write position unless I read the file to the end.
case#1
FILE *pfile = _tfopen(GetFileNameXML(), _T("w"));
if(pfile)
{
    _fputts(TEXT(""), pfile);
}
if(pfile)
{
    fclose(pfile);
    pfile = NULL;
}
case#2
HANDLE hFile = CreateFile(GetFileNameXML(), GENERIC_READ|GENERIC_WRITE,
    FILE_SHARE_WRITE|FILE_SHARE_READ, NULL, OPEN_EXISTING,
    FILE_ATTRIBUTE_NORMAL, NULL);
if(hFile != INVALID_HANDLE_VALUE)
{
    WriteFile(hFile,,,,,);
}
CloseHandle(hFile);
thanks.
If all you need is to write some text files, use C++'s standard library file facilities. The samples here will be helpful: http://www.cplusplus.com/doc/tutorial/files/
First, what's your aversion to using a standard XML processing library?
Next, if you decide to roll your own, definitely don't go directly at the Win32 APIs - at least not unless you're going to write out the generated XML in large chunks, or you're going to implement your own buffering layer.
It's not going to matter for dealing with tiny files, but you specifically mention good performance and many calls to the write function. WriteFile has a fair amount of overhead, it does a lot of work and involves user->kernel->user mode switches, which are expensive. If you're dealing with "normally sized" XML files you probably won't be able to see much of a difference, but if you're generating monstrously sized dumps it's definitely something to keep in mind.
You mention tracking the last write position - first off, it should be easy... with FILE buffers you have ftell, with raw Win32 API you have SetFilePointerEx - call it with liDistanceToMove=0 and dwMoveMethod=FILE_CURRENT, and you get the current file position after a write. But why do you need this? If you're streaming out an XML file, you should generally keep on streaming until you're done writing - are you closing and re-opening the file? Or are you writing a valid XML file which you want to insert more data into later?
As for the overhead of the Win32 file functions, it may or may not be relevant in your case (depending on the size of the files you're dealing with), but with larger files it matters a lot - included below is a micro-benchmark that simply reads a file to memory with ReadFile, letting you specify different buffer sizes from the command line. It's interesting to look at, say, Process Explorer's IO tab while running the tool. Here are some statistics from my measly laptop (Win7-SP1 x64, core2duo P7350#2.0GHz, 4GB ram, 120GB Intel-320 SSD).
Take it for what it is, a micro-benchmark. The performance might or might not matter in your particular situation, but I do believe the numbers demonstrate that there's considerable overhead to the Win32 file APIs, and that doing a little buffering of your own helps.
With a fully cached 2GB file:
BlkSz Speed
32 14.4MB/s
64 28.6MB/s
128 56MB/s
256 107MB/s
512 205MB/s
1024 350MB/s
4096 800MB/s
32768 ~2GB/s
With a "so big there will only be cache misses" 4GB file:
BlkSz Speed CPU
32 13MB/s 49%
64 26MB/s 49%
128 52MB/s 49%
256 99MB/s 49%
512 180MB/s 49%
1024 200MB/s 32%
4096 185MB/s 22%
32768 205MB/s 13%
Keep in mind that 49% CPU usage means that one CPU core is pretty much fully pegged - a single thread can't really push the machine much harder. Notice the pathological behavior of the 4kb buffer in the second table - it was reproducible, and I don't have an explanation for it.
Crappy micro-benchmark code goes here:
#define WIN32_LEAN_AND_MEAN
#include <Windows.h>
#include <stdio.h>
#include <stdlib.h>
#include <vector>
#include <iostream>
#include <string>
#include <assert.h>
unsigned getDuration(FILETIME& timeStart, FILETIME& timeEnd)
{
    // duration is in 100-nanosecond units, we want milliseconds
    // 1 millisecond = 1000 microseconds = 1000000 nanoseconds
    LARGE_INTEGER ts, te, res;
    ts.HighPart = timeStart.dwHighDateTime; ts.LowPart = timeStart.dwLowDateTime;
    te.HighPart = timeEnd.dwHighDateTime; te.LowPart = timeEnd.dwLowDateTime;
    res.QuadPart = ((te.QuadPart - ts.QuadPart) / 10000);
    assert(res.QuadPart < UINT_MAX);
    return (unsigned)res.QuadPart;
}
int main(int argc, char* argv[])
{
    if(argc < 3) {
        puts("Syntax: ReadFile [filename] [blocksize]");
        return 0;
    }
    char *filename = argv[1];
    int blockSize = atoi(argv[2]);
    if(blockSize < 1) {
        puts("Please specify a blocksize larger than 0");
        return 1;
    }

    HANDLE hFile = CreateFile(filename, GENERIC_READ, FILE_SHARE_READ, 0,
        OPEN_EXISTING, FILE_FLAG_SEQUENTIAL_SCAN, 0);
    if(INVALID_HANDLE_VALUE == hFile) {
        puts("error opening input file");
        return 1;
    }

    std::vector<char> buffer(blockSize);

    LARGE_INTEGER fileSize;
    if(!GetFileSizeEx(hFile, &fileSize)) {
        puts("Failed getting file size.");
        return 1;
    }

    std::cout << "File size " << fileSize.QuadPart << ", that's " << (fileSize.QuadPart / blockSize) <<
        " blocks of " << blockSize << " bytes - reading..." << std::endl;

    FILETIME dummy, kernelStart, userStart;
    GetProcessTimes(GetCurrentProcess(), &dummy, &dummy, &kernelStart, &userStart);
    DWORD ticks = GetTickCount();

    DWORD bytesRead = 0;
    do {
        if(!ReadFile(hFile, &buffer[0], blockSize, &bytesRead, 0)) {
            puts("Error calling ReadFile");
            return 1;
        }
    } while(bytesRead == blockSize);

    ticks = GetTickCount() - ticks;
    FILETIME kernelEnd, userEnd;
    GetProcessTimes(GetCurrentProcess(), &dummy, &dummy, &kernelEnd, &userEnd);
    CloseHandle(hFile);

    std::cout << "Reading with " << blockSize << " sized blocks took " << ticks << "ms, spending " <<
        getDuration(kernelStart, kernelEnd) << "ms in kernel and " <<
        getDuration(userStart, userEnd) << "ms in user mode. Hit enter to continue." << std::endl;
    std::string dummyString;
    std::cin >> dummyString;
    return 0;
}
My platform is Windows Vista 32-bit, with Visual C++ Express 2008.
For example: if I have a file containing 4000 bytes, can I have 4 threads read from the file at the same time, with each thread accessing a different section of the file?
Thread 1 reads 0-999, thread 2 reads 1000-1999, etc.
Please give an example in C.
If you don't write to the file, there's no need to worry about synchronization or race conditions.
Just open the file with shared reading via separate handles and everything will work (i.e., open the file in each thread's context instead of sharing the same file handle).
#include <stdio.h>
#include <windows.h>

DWORD WINAPI mythread(LPVOID param)
{
    int i = (int)(INT_PTR) param;   // cast via INT_PTR avoids truncation warnings on 64-bit
    BYTE buf[1000];
    DWORD numread;
    HANDLE h = CreateFile("c:\\test.txt", GENERIC_READ, FILE_SHARE_READ,
        NULL, OPEN_EXISTING, 0, NULL);
    SetFilePointer(h, i * 1000, NULL, FILE_BEGIN);
    ReadFile(h, buf, sizeof(buf), &numread, NULL);
    printf("buf[%d]: %02X %02X %02X\n", i+1, buf[0], buf[1], buf[2]);
    return 0;
}

int main()
{
    int i;
    HANDLE h[4];
    for (i = 0; i < 4; i++)
        h[i] = CreateThread(NULL, 0, mythread, (LPVOID)(INT_PTR)i, 0, NULL);
    // for (i = 0; i < 4; i++) WaitForSingleObject(h[i], INFINITE);
    WaitForMultipleObjects(4, h, TRUE, INFINITE);
    return 0;
}
There's not even a big problem writing to the same file, in all honesty.
By far the easiest way is to just memory-map the file. The OS will then give you a void* where the file is mapped into memory. Cast that to a char[], and make sure that each thread uses non-overlapping subarrays.
void foo(char* begin, char*end) { /* .... */ }
void* base_address = myOS_memory_map("example.binary");
myOS_start_thread(&foo, (char*)base_address, (char*)base_address + 1000);
myOS_start_thread(&foo, (char*)base_address+1000, (char*)base_address + 2000);
myOS_start_thread(&foo, (char*)base_address+2000, (char*)base_address + 3000);
As others have noted already, there is no inherent problem in having multiple threads read from the same file, as long as they have their own file descriptors/handles. However, I'm a little curious about your motives. Why do you want to read a file in parallel? If you're only reading a file into memory, your bottleneck is likely the disk itself, in which case multiple threads won't help you at all (they'll just clutter your code).
And as always when optimizing, you should not attempt it until you (1) have an easy-to-understand, working solution, and (2) have measured your code so you know where to optimize.
Windows supports overlapped I/O, which allows a single thread to asynchronously queue multiple I/O requests for better performance. This could conceivably be used by multiple threads simultaneously as long as the file you are accessing supports seeking (i.e. this is not a pipe).
Passing FILE_FLAG_OVERLAPPED to CreateFile() allows simultaneous reads and writes on the same file handle; otherwise, Windows serializes them. Specify the file offset using the Offset and OffsetHigh members of the OVERLAPPED structure.
For more information see Synchronization and Overlapped Input and Output.
You can certainly have multiple threads reading from a data structure; race conditions can only occur if any writing is taking place.
To avoid such race conditions you need to define the boundaries that threads can read, if you have an explicit number of data segments and an explicit number of threads to match these then that is easy.
As for an example in C you would need to provide some more information, like the threading library you are using. Attempt it first, then we can help you fix any issues.
The easiest way is to open the file within each parallel instance, but just open it as readonly.
The people who say there may be an IO bottleneck are probably wrong. Any modern operating system caches file reads. Which means the first time you read a file will be the slowest, and any subsequent reads will be lightning fast. A 4000 byte file can even rest inside the processor's cache.
I don't see any real advantage to doing this.
You may have multiple threads reading from the device but your bottleneck will not be CPU but rather disk IO speed.
If you are not careful you may even slow the processes down (but you will need to measure it to know for certain).
It is possible, though I'm not sure it will be worth the effort. Have you considered reading the entire file into memory in a single thread and then letting multiple threads access that data?
You shouldn't need to do anything particularly clever if all they're doing is reading. Obviously you can read it as many times in parallel as you like, as long as you don't exclusively lock it. Writing is clearly another matter of course...
I do have to wonder why you'd want to though - it will likely perform badly, since your HDD will waste a lot of time seeking back and forth rather than reading it all in one (relatively) uninterrupted sweep. For small files (like your 4000 byte example), where that might not be such a problem, it doesn't seem worth the trouble.
Reading: No need to lock the file. Just open the file as read only or shared read
Writing: Use a mutex to ensure the file is only written to by one person.
#include <algorithm>
#include <fstream>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>
using namespace std;

std::mutex mtx;

void worker(int n)
{
    // reading itself needs no lock; the mutex here just keeps
    // the cout output from interleaving between threads
    mtx.lock();
    char * memblock;
    ifstream file ("D:\\test.txt", ios::in);
    if (file.is_open())
    {
        memblock = new char [1000];
        file.seekg (n * 999, ios::beg);
        file.read (memblock, 999);
        memblock[999] = '\0';
        cout << memblock << endl;
        file.close();
        delete[] memblock;
    }
    else
        cout << "Unable to open file";
    mtx.unlock();
}

int main()
{
    vector<std::thread> vec;
    for(int i=0; i < 3; i++)
    {
        vec.push_back(std::thread(&worker, i));
    }
    std::for_each(vec.begin(), vec.end(), [](std::thread& th)
    {
        th.join();
    });
    return 0;
}
You need a way to sync those threads. There are different solutions for mutual exclusion: http://en.wikipedia.org/wiki/Mutual_exclusion
He wants to read from a file in different threads. I guess that should be ok if the file is opened as read-only by each thread.
I hope you don't want to do this for performance though, since you will have to scan large parts of the file for newline characters in each thread.
Profiling my program shows that the print function is taking a lot of time. How can I send "raw" byte output directly to stdout, instead of using fwrite, to make it faster? (I need to send all 9 bytes in print() to stdout at the same time.)
void print(){
    unsigned char temp[9];
    temp[0] = matrix[0][0];
    temp[1] = matrix[0][1];
    temp[2] = matrix[0][2];
    temp[3] = matrix[1][0];
    temp[4] = matrix[1][1];
    temp[5] = matrix[1][2];
    temp[6] = matrix[2][0];
    temp[7] = matrix[2][1];
    temp[8] = matrix[2][2];
    fwrite(temp,1,9,stdout);
}
matrix is defined globally as an unsigned char matrix[3][3];
IO is not an inexpensive operation. It is, in fact, a blocking operation, meaning that the OS can preempt your process when you call write to allow more CPU-bound processes to run, before the IO device you're writing to completes the operation.
The only lower-level option (if you're developing on a *nix machine) is the raw write function, but even then your performance will not be much faster than it is now. Simply put: IO is expensive.
The top rated answer claims that IO is slow.
Here's a quick benchmark with a sufficiently large buffer to take the OS out of the critical performance path, but only if you're willing to receive your output in giant blurps. If latency to first byte is your problem, you need to run in "dribs" mode.
Write 10 million records from a nine byte array
Mint 12 AMD64 on 3GHz CoreDuo under gcc 4.6.1
340ms to /dev/null
710ms to 90MB output file
15254ms to 90MB output file in "dribs" mode
FreeBSD 9 AMD64 on 2.4GHz CoreDuo under clang 3.0
450ms to /dev/null
550ms to 90MB output file on ZFS triple mirror
1150ms to 90MB output file on FFS system drive
22154ms to 90MB output file in "dribs" mode
There's nothing slow about IO if you can afford to buffer properly.
#include <stdio.h>
#include <assert.h>
#include <stdlib.h>
#include <string.h>

int main (int argc, char* argv[])
{
    int dribs = argc > 1 && 0==strcmp (argv[1], "dribs");
    int err;
    int i;

    enum { BigBuf = 4*1024*1024 };
    char* outbuf = malloc (BigBuf);
    assert (outbuf != NULL);
    err = setvbuf (stdout, outbuf, _IOFBF, BigBuf); // _IOFBF = full buffering
    assert (err == 0);

    enum { ArraySize = 9 };
    char temp[ArraySize] = {0};
    enum { Count = 10*1000*1000 };

    for (i = 0; i < Count; ++i) {
        fwrite (temp, 1, ArraySize, stdout);
        if (dribs) fflush (stdout);
    }
    fflush (stdout); // seems to be needed after setting own buffer
    fclose (stdout);
    if (outbuf) { free (outbuf); outbuf = NULL; }
}
The rawest form of output you can do is probably the write system call, like this:
write (1, matrix, 9);
Here 1 is the file descriptor for standard out (0 is standard in, and 2 is standard error). Your standard out will only write as fast as whatever is reading it at the other end (i.e. the terminal, or the program you're piping into), which might be rather slow.
I'm not 100% sure, but you could try setting non-blocking IO on fd 1 (using fcntl) and hope the OS will buffer it for you until it can be consumed by the other end. It's been a while, but I think it works like this
fcntl (1, F_SETFL, O_NONBLOCK);
YMMV though. Please correct me if I'm wrong on the syntax, as I said, it's been a while.
Perhaps your problem is not that fwrite() is slow, but that it is buffered.
Try calling fflush(stdout) after the fwrite().
This all really depends on your definition of slow in this context.
All printing is fairly slow, although iostreams are really slow for printing.
Your best bet would be to use printf, something along the lines of:
printf("%c%c%c%c%c%c%c%c%c\n", matrix[0][0], matrix[0][1], matrix[0][2], matrix[1][0],
matrix[1][1], matrix[1][2], matrix[2][0], matrix[2][1], matrix[2][2]);
As everyone has pointed out, IO in a tight inner loop is expensive. I have normally ended up doing a conditional cout of the matrix, based on some criterion, only when debugging.
If your app is a console app, try redirecting its output to a file; that will be a lot faster than console refreshes, e.g. app.exe > matrixDump.txt
What's wrong with:
fwrite(matrix,1,9,stdout);
both the one and the two dimensional arrays take up the same memory.
Try running the program twice, once with output and once without. You will notice that overall, the one without the IO is fastest. Also, you could fork the process (or create a thread), one writing to a file (stdout), and one doing the operations.
So first, don't print on every entry. Basically, what I am saying is: do not do this:
for(int i = 0; i < 100; i++){
    printf("Your stuff");
}
Instead, allocate a buffer either on the stack or on the heap, store your information there, and then write the whole buffer to stdout in one call, like this:
char *buffer = malloc(100);   // note: malloc(sizeof(100)) would allocate only sizeof(int) bytes
for(int i = 0; i < 100; i++){
    buffer[i] = 1; //your byte value goes here
}
//once you are done, print it to the console with
write(1, buffer, 100);
But in your case, just use write(1, temp, 9);
I am pretty sure you can increase output performance by increasing the buffer size, so that you make fewer fwrite calls. write might be faster, but I am not sure. Just try this:
❯ yes | dd of=/dev/null count=1000000
1000000+0 records in
1000000+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 2.18338 s, 234 MB/s
vs
> yes | dd of=/dev/null count=100000 bs=50KB iflag=fullblock
100000+0 records in
100000+0 records out
5000000000 bytes (5.0 GB, 4.7 GiB) copied, 2.63986 s, 1.9 GB/s
The same applies to your code. Some tests during the last few days suggest that good buffer sizes are probably around 1 << 12 (=4096) and 1 << 16 (=65536) bytes.
You can simply:
std::cout.write(reinterpret_cast<const char*>(temp), 9);
(Plain std::cout << temp would treat temp as a NUL-terminated string, which it isn't.) printf is more C-style.
Yet, IO operations are costly, so use them wisely.