Parsing a large file with MPI in C++

I have a C++ program in which I want to parse a huge file, looking for some regexes that I've implemented. The program worked fine when executed sequentially, but then I wanted to run it using MPI.
I started the adaptation to MPI by differentiating the master (the one who coordinates the execution) from the workers (the ones that parse the file in parallel) in the main function. Something like this:
MPI::Init(argc, argv);
...
if(rank == 0) {
...
// Master sends initial and ending byte to every worker
for(int i = 1; i < total_workers; i++) {
array[0] = (i-1) * first_worker_file_part;
array[1] = i * first_worker_file_part;
MPI::COMM_WORLD.Send(array, 2, MPI::INT, i, 1);
}
}
if(rank != 0)
readDocument();
...
MPI::Finalize();
The master will send every worker an array with 2 positions: position 0 holds the byte offset where the worker should start reading the file, and position 1 the byte offset where it should stop.
The readDocument() function looks like this for now (no parsing yet, each worker just reads its part of the file):
void readDocument()
{
array = new int[2];
MPI::COMM_WORLD.Recv(array, 10, MPI::INT, 0, 1, status);
int read_length = array[1] - array[0];
char* buffer = new char [read_length];
if (infile)
{
infile.seekg(array[0]); // Start reading in supposed byte
infile.read(buffer, read_length);
}
}
I've tried different experiments, from writing the output of the reading to a file to running it with different numbers of processes. What happens is that when I run the program with 20 processes instead of 10, for example, it takes twice as long to read the file. I expected it to take nearly half the time, and I can't figure out why this is happening.
Also, on a different matter, I want to make the master wait for all the workers to complete their execution and then print the final time. Is there any way to "block" it while the workers are processing, like a cond_wait in C pthreads?

In my experience, people working on computer systems with parallel file systems tend to know about those parallel file systems, so your question marks you out, initially, as someone not working on such a system.
Without specific hardware support reading from a single file boils down to the system positioning a single read head and reading a sequence of bytes from the disk to memory. This situation is not materially altered by the complex realities of many modern file systems, such as RAID, which may in fact store a file across multiple disks. When multiple processes ask the operating system for access to files at the same time the o/s parcels out disk access according to some notion, possibly of fairness, so that no process gets starved. At worst the o/s spends so much time switching disk access from process to process that the rate of reading drops significantly. The most efficient, in terms of throughput, approach is for a single process to read an entire file in one go while other processes do other things.
This situation, multiple processes contending for scarce disk i/o resources, applies whether or not those processes are part of a parallel, MPI (or similar) program or entirely separate programs running concurrently.
The impact is what you observe -- instead of 10 processes each waiting to get their own 1/10th share of the file you have 20 processes each waiting for their 1/20th share. Oh, you cry, but each process is only reading half as much data, so the whole gang should take the same amount of time to get the file. No, I respond, you've forgotten to add the time it takes the o/s to position and reposition the read/write heads between accesses. Read time comprises latency (how long it takes for reading to start once the request has been made) and throughput (how fast the i/o system can pass the bytes to and fro).
It should be easy to come up with some reasonable estimates of latency and bandwidth that explain why reading with 20 processes takes twice as long as reading with 10.
How can you solve this? You can't, not without a parallel file system. But you might find that having the master process read the whole file and then parcel it out is faster than your current approach. You might not; you might just find that the current approach is the fastest for your whole computation. If read time is, say, 10% of total computation time, you might decide it's a reasonable overhead to live with.
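If you want to try the "master reads, then parcels out" variant, here is a minimal sketch using MPI_Scatterv. It is only a sketch under stated assumptions: the whole file fits in rank 0's memory, command-line checking is omitted, and no attempt is made to split on line or record boundaries (which your regex pass would need to handle).
#include <mpi.h>
#include <fstream>
#include <vector>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    std::vector<char> filedata;                 // whole file, only filled on rank 0
    std::vector<int> counts(size), displs(size);
    if (rank == 0) {
        std::ifstream in(argv[1], std::ios::binary | std::ios::ate);
        int fsize = static_cast<int>(in.tellg());
        in.seekg(0);
        filedata.resize(fsize);
        in.read(filedata.data(), fsize);
        // split into nearly equal chunks
        for (int i = 0; i < size; i++) {
            counts[i] = fsize / size + (i < fsize % size ? 1 : 0);
            displs[i] = (i == 0) ? 0 : displs[i - 1] + counts[i - 1];
        }
    }
    int mycount = 0;
    MPI_Scatter(counts.data(), 1, MPI_INT, &mycount, 1, MPI_INT, 0, MPI_COMM_WORLD);
    std::vector<char> mychunk(mycount);
    MPI_Scatterv(filedata.data(), counts.data(), displs.data(), MPI_CHAR,
                 mychunk.data(), mycount, MPI_CHAR, 0, MPI_COMM_WORLD);

    // ... each rank runs its regex pass over mychunk here ...

    MPI_Finalize();
    return 0;
}
As for making rank 0 wait until every worker has finished before printing the final time, the usual MPI idiom is a collective synchronization point reached by all ranks, e.g. MPI_Barrier(MPI_COMM_WORLD), rather than a pthreads-style cond_wait.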

To add to High Performance Mark's correct answer, one can use MPI-IO to do the file reading, providing (in this case) hints to the IO routines not to read from every processor; this same code with a modified (or empty) MPI_Info should also be able to take advantage of a parallel file system, should you move to a cluster that has one. For the most common implementation of MPI-IO, ROMIO, the manual describing the available hints is here; in particular, we're using
MPI_Info_set(info, "cb_config_list","*:1");
to set the number of readers to one per node. The code below will let you try reading the file using either MPI-IO or POSIX (e.g., seek).
#include <iostream>
#include <fstream>
#include <mpi.h>
void partitionFile(const int filesize, const int rank, const int size,
const int overlap, int *start, int *end) {
int localsize = filesize/size;
*start = rank * localsize;
*end = *start + localsize-1;
if (rank != 0) *start -= overlap;
if (rank != size-1) *end += overlap;
}
void readdataMPI(MPI_File *in, const int rank, const int size, const int overlap,
char **data, int *ndata) {
MPI_Offset filesize;
int start;
int end;
// figure out who reads what
MPI_File_get_size(*in, &filesize);
partitionFile((int)filesize, rank, size, overlap, &start, &end);
*ndata = end - start + 1;
// allocate memory
*data = new char[*ndata + 1];
// everyone reads in their part
MPI_File_read_at_all(*in, (MPI_Offset)start, *data,
(MPI_Offset)(*ndata), MPI_CHAR, MPI_STATUS_IGNORE);
(*data)[*ndata] = '\0';
}
void readdataSeek(std::ifstream &infile, int array[2], char *buffer)
{
int read_length = array[1] - array[0];
if (infile)
{
infile.seekg(array[0]); // Start reading in supposed byte
infile.read(buffer, read_length);
}
}
int main(int argc, char **argv) {
MPI_File in;
int rank, size;
int ierr;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
if (argc != 3) {
if (rank == 0)
std::cerr << "Usage: " << argv[0] << " infilename [MPI|POSIX]" << std::endl;
MPI_Finalize();
return -1;
}
std::string optionMPI("MPI");
if ( !optionMPI.compare(argv[2]) ) {
MPI_Info info;
MPI_Info_create(&info);
MPI_Info_set(info, "cb_config_list","*:1"); // ROMIO: one reader per node
// Eventually, should be able to use io_nodes_list or similar
ierr = MPI_File_open(MPI_COMM_WORLD, argv[1], MPI_MODE_RDONLY, info, &in);
if (ierr) {
if (rank == 0)
std::cerr << "Usage: " << argv[0] << " Couldn't open file " << argv[1] << std::endl;
MPI_Finalize();
return -1;
}
const int overlap=1;
char *data;
int ndata;
readdataMPI(&in, rank, size, overlap, &data, &ndata);
std::cout << "MPI: Rank " << rank << " has " << ndata << " characters." << std::endl;
delete [] data;
MPI_File_close(&in);
MPI_Info_free(&info);
} else {
int fsize;
if (rank == 0) {
std::ifstream file( argv[1], std::ios::ate );
fsize=file.tellg();
file.close();
}
MPI_Bcast(&fsize, 1, MPI_INT, 0, MPI_COMM_WORLD);
int start, end;
partitionFile(fsize, rank, size, 1, &start, &end);
int array[2] = {start, end};
char *buffer = new char[end-start+2];
std::ifstream infile;
infile.open(argv[1], std::ios::in);
readdataSeek(infile, array, buffer);
buffer[end-start+1] = '\0';
std::cout << "Seeking: Rank " << rank << " has " << end-start+1 << " characters." << std::endl;
infile.close() ;
delete [] buffer;
}
MPI_Finalize();
return 0;
}
On my desktop, I don't get much of a performance difference, even when oversubscribing the cores (e.g., using lots of seeks):
$ time mpirun -np 20 ./read-chunks moby-dick.txt POSIX
Seeking: Rank 0 has 62864 characters.
[...]
Seeking: Rank 8 has 62865 characters.
real 0m1.250s
user 0m0.290s
sys 0m0.190s
$ time mpirun -np 20 ./read-chunks moby-dick.txt MPI
MPI: Rank 1 has 62865 characters.
[...]
MPI: Rank 4 has 62865 characters.
real 0m1.272s
user 0m0.337s
sys 0m0.265s

Related

Unix command line failing to run program after compiling with no error message

I'm trying to run a C++ program I've been writing from my school's Unix command-line based server. The program is supposed to use calls like pipe() and fork() to calculate an integer in the child process and send it to the parent process through a pipe. The problem I've come across is that when I try to run the program after compiling it, nothing happens at all except that a '0' is inserted before the prompt. I don't completely understand forking and piping, so I'll post the entire program in case the problem is in my use of those calls. There are probably errors because I haven't been able to successfully run it yet. Here is my code:
#include <cstdlib>
#include <iostream>
#include <string>
#include <array>
#include <cmath>
#include <unistd.h>
using namespace std;
// Return bool for whether an int is prime or not
bool primeChecker(int num)
{
bool prime = true;
for (int i = 2; i <= num / 2; ++i)
{
if (num%i == 0)
{
prime = false;
break;
}
}
return prime;
}
int main(int argc, char *argv[])
{
int *array;
array = new int[argc - 1]; // dynamically allocated array (size is number of parameters)
int fd[2];
int count = 0; // counts number of primes already found
int num = 1; // sent to primeChecker
int k = 1; // index for argv
int addRes = 0;
// Creates a pair of file descriptors (int that represents a file), pointing to a pipe inode,
// and places them in the array pointed to. fd[0] is for reading, fd[1] is for writing
pipe(fd);
while (k < argc)
{
if (primeChecker(num)) // if the current number is prime,
{
count++; // increment the prime number count
if (count == (stoi(argv[k]))) // if the count reaches one of the arguments...
{
array[k - 1] = num; // store prime number
k++; // increment the array of arguments
}
}
num++;
}
pid_t pid;
pid = fork();
if (pid < 0) // Error occurred
{
cout << "Fork failed.";
return 0;
}
else if(pid == 0) // Child process
{
for (int i = 0; i < (argc-1); i++)
{
// Close read descriptor (not used)
close(fd[0]);
// Write data
write(fd[1], &addRes, sizeof(addRes)); /* write(fd, writebuffer, max write lvl) */
// Close write descriptor
close(fd[1]);
}
}
else // Parent process
{
// Wait for child to finish
wait(0);
// Close write descriptor (not used)
close(fd[1]);
// Read data
read(fd[0], &addRes, sizeof(addRes));
cout << addRes;
// Close read descriptor
close(fd[0]);
}
return 0;
}
Here is what I'm seeing in the command window (including the prompt) when I try to compile and run my program:
~/cs3270j/Prog2$ g++ -o prog2.exe prog2.cpp
~/cs3270j/Prog2$ ./prog2.exe
0~/cs3270j/Prog2$
and nothing happens. I've tried different naming variations as well as running it from 'a.out' with no success.
tl;dr after compiling and attempting to execute my program, the Unix command prompt simply adds a 0 to the beginning of the prompt and does nothing else.
Any help that anybody could give me would be very much appreciated as I can't find any information whatsoever about a '0' appearing before the prompt.
Your program is doing exactly what you're telling it to do! You feed addRes into the pipe and then print it out. addRes is initialized to 0 and never changed. In your child, you want to pass num instead. Also, you may want to print a newline as well ('\n').
You never write anything to the pipe; the child writes once per command-line argument, and ./prog2.exe does not supply any, so the loop never executes.
If you passed one argument, you would write addRes; you never change addRes, so you'd get 0 in the parent.
If you passed multiple arguments, you'd write one addRes and then close the channel. This is not too bad, since you never read more than one addRes anyway.
You print your addRes (which is unchanged from its initialisation int addRes = 0) without a newline, which makes the next prompt stick right next to it (using cout << addRes << endl would print a newline, making it prettier).
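For reference, here is a minimal sketch of the intended exchange with those problems fixed. The hard-coded 42 is only a stand-in for whatever integer you actually compute from the prime search; the prime logic is omitted so the sketch stays self-contained.
#include <iostream>
#include <unistd.h>
#include <sys/wait.h>
using namespace std;

int main()
{
    int fd[2];
    int addRes = 0;
    if (pipe(fd) < 0) { perror("pipe"); return 1; }

    pid_t pid = fork();
    if (pid < 0) { perror("fork"); return 1; }

    if (pid == 0) {                        // child
        close(fd[0]);                      // close unused read end once
        addRes = 42;                       // stand-in for the computed result
        write(fd[1], &addRes, sizeof(addRes));
        close(fd[1]);                      // close write end after writing
        return 0;
    }
    // parent
    close(fd[1]);                          // close unused write end
    wait(0);                               // wait for the child to finish
    read(fd[0], &addRes, sizeof(addRes));
    cout << addRes << endl;                // endl supplies the missing newline
    close(fd[0]);
    return 0;
}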

Why is MPI_Bcast so much slower than MPI_Reduce?

Using MPI, we can do a broadcast to send an array to many nodes, or a reduce to combine arrays from many nodes onto one node.
I guess that the fastest way to implement these will be using a binary tree, where each node either sends to two nodes (bcast) or reduces over two nodes (reduce), which will give a time logarithmic in the number of nodes.
There doesn't seem to be any reason why broadcast would be particularly slower than reduce.
I ran the following test program on a 4-computer cluster, where each computer has 12 cores. The strange thing is that broadcast was quite a lot slower than reduce. Why? Is there anything I can do about it?
The results were:
inited mpi: 0.472943 seconds
N: 200000 1.52588MB
P = 48
did alloc: 0.000147641 seconds
bcast: 0.349956 seconds
reduce: 0.0478526 seconds
bcast: 0.369131 seconds
reduce: 0.0472673 seconds
bcast: 0.516606 seconds
reduce: 0.0448555 seconds
The code was:
#include <iostream>
#include <cstdlib>
#include <cstdio>
#include <ctime>
#include <sys/time.h>
using namespace std;
#include <mpi.h>
class NanoTimer {
public:
struct timespec start;
NanoTimer() {
clock_gettime(CLOCK_MONOTONIC, &start);
}
double elapsedSeconds() {
struct timespec now;
clock_gettime(CLOCK_MONOTONIC, &now);
double time = (now.tv_sec - start.tv_sec) + (double) (now.tv_nsec - start.tv_nsec) * 1e-9;
start = now;
return time;
}
void toc(string label) {
double elapsed = elapsedSeconds();
cout << label << ": " << elapsed << " seconds" << endl;
}
};
int main( int argc, char *argv[] ) {
if( argc < 2 ) {
cout << "Usage: " << argv[0] << " [N]" << endl;
return -1;
}
int N = atoi( argv[1] );
NanoTimer timer;
MPI_Init( &argc, &argv );
int p, P;
MPI_Comm_rank( MPI_COMM_WORLD, &p );
MPI_Comm_size( MPI_COMM_WORLD, &P );
MPI_Barrier(MPI_COMM_WORLD);
if( p == 0 ) timer.toc("inited mpi");
if( p == 0 ) {
cout << "N: " << N << " " << (N*sizeof(double)/1024.0/1024) << "MB" << endl;
cout << "P = " << P << endl;
}
double *src = new double[N];
double *dst = new double[N];
MPI_Barrier(MPI_COMM_WORLD);
if( p == 0 ) timer.toc("did alloc");
for( int it = 0; it < 3; it++ ) {
MPI_Bcast( src, N, MPI_DOUBLE, 0, MPI_COMM_WORLD );
MPI_Barrier(MPI_COMM_WORLD);
if( p == 0 ) timer.toc("bcast");
MPI_Reduce( src, dst, N, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD );
MPI_Barrier(MPI_COMM_WORLD);
if( p == 0 ) timer.toc("reduce");
}
delete[] src;
MPI_Finalize();
return 0;
}
The cluster nodes were running 64-bit Ubuntu 12.04. I tried both Open MPI and MPICH2 and got very similar results. The network is gigabit Ethernet, which is not the fastest, but what I'm most curious about is not the absolute speed so much as the disparity between broadcast and reduce.
I don't think this quite answers your question, but I hope it provides some insight.
MPI is just a standard. It doesn't define how every function should be implemented. Therefore the performance of certain tasks in MPI (in your case MPI_Bcast and MPI_Reduce) is based strictly on the implementation you are using. It is possible that you could design a broadcast using point-to-point communication methods that performs better than the given MPI_Bcast.
Anyway, you have to consider what each of these functions is doing. Broadcast takes information from one process and sends it to all other processes; reduce takes information from each process and reduces it onto one process. According to the (most recent) standard, MPI_Bcast is considered a One-to-All collective operation and MPI_Reduce is considered an All-to-One collective operation. Therefore your intuition about using binary trees for MPI_Reduce is probably reflected in both implementations. However, it is most likely not present in MPI_Bcast. It might be the case that MPI_Bcast is implemented using non-blocking point-to-point communication (sending from the process containing the information to all other processes) with a wait-all after the communication. In any case, in order to figure out how both functions work, I would suggest delving into the source code of Open MPI and MPICH2.
As Hristo mentioned, it depends on the size of your buffer. If you're sending a large buffer, the broadcast has to do lots of large sends, while a reduce does some local operation on the buffer to reduce it down to a single value and then only transmits that one value instead of the full buffer.
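If you want to experiment with the point-to-point alternative mentioned in the first answer, a minimal hand-rolled binomial-tree broadcast from rank 0 might look like the sketch below. It is an experiment, not a fix: production MPI_Bcast implementations usually do at least this, plus pipelining for large buffers.
#include <mpi.h>
#include <vector>

// In round k, every rank that already has the data sends it to the rank
// 2^k positions above it (if that rank exists): log2(P) rounds in total.
void treeBcast(double *buf, int n, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    for (int step = 1; step < size; step *= 2) {
        if (rank < step) {
            int dest = rank + step;
            if (dest < size)
                MPI_Send(buf, n, MPI_DOUBLE, dest, 0, comm);
        } else if (rank < 2 * step) {
            MPI_Recv(buf, n, MPI_DOUBLE, rank - step, 0, comm, MPI_STATUS_IGNORE);
        }
    }
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    const int N = 200000;
    std::vector<double> buf(N, 0.0);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) buf.assign(N, 1.0);      // only the root starts with the data
    treeBcast(buf.data(), N, MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}
Timing treeBcast against MPI_Bcast for your N would tell you whether your library's broadcast really behaves like the flat one-to-all pattern speculated about above.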

Read multiple .dat files by GPU

I understand that reading files on the GPU is an inefficient task, as it is bound by the slowest part of the system, that is, IO. However, I came up with another approach: use the CPU for file reading and let the processing burden be handled by the GPU. I wrote the following code in C++, but I'm stuck at the integration point, that is, how to make the GPU handle these files after they've been read by the CPU. In other words, what is the starting point for C++ AMP to be added and integrated with the code? Or should I rewrite the whole code from scratch?
/* this code reads multiple .dat files from the directory that contains the implementation (from my account on Stack Overflow) */
#include <Windows.h>
#include <ctime>
#include <stdint.h>
#include <iostream>
using std::cout;
using std::endl;
#include <fstream>
using std::ifstream;
#include <cstring>
/* Returns the amount of milliseconds elapsed since the UNIX epoch. Works on both
* windows and linux. */
uint64_t GetTimeMs64()
{
FILETIME ft;
LARGE_INTEGER li;
/* Get the amount of 100 nano seconds intervals elapsed since January 1, 1601 (UTC) and copy it
* to a LARGE_INTEGER structure. */
GetSystemTimeAsFileTime(&ft);
li.LowPart = ft.dwLowDateTime;
li.HighPart = ft.dwHighDateTime;
uint64_t ret;
ret = li.QuadPart;
ret -= 116444736000000000LL; /* Convert from file time to UNIX epoch time. */
ret /= 10000; /* From 100 nano seconds (10^-7) to 1 millisecond (10^-3) intervals */
return ret;
}
const int MAX_CHARS_PER_LINE = 512;
const int MAX_TOKENS_PER_LINE = 20;
const char* const DELIMITER = "|";
int main()
{
// create a file-reading object
uint64_t a = GetTimeMs64();
cout << a << endl;
HANDLE h;
WIN32_FIND_DATA find_data;
h = FindFirstFile( "*.dat", & find_data );
if( h == INVALID_HANDLE_VALUE ) {
cout<<"error"<<endl;
}
do {
char * s = find_data.cFileName;
ifstream fin;
fin.open(s); // open a file
if (!fin.good())
return 1; // exit if file not found
// read each line of the file
while (!fin.eof())
{
// read an entire line into memory
char buf[MAX_CHARS_PER_LINE];
fin.getline(buf, MAX_CHARS_PER_LINE);
// parse the line into blank-delimited tokens
int n = 0; // a for-loop index
// array to store memory addresses of the tokens in buf
const char* token[MAX_TOKENS_PER_LINE] = {}; // initialize to 0
// parse the line
token[0] = strtok(buf, DELIMITER); // first token
if (token[0]) // zero if line is blank
{
for (n = 1; n < MAX_TOKENS_PER_LINE; n++)
{
token[n] = strtok(0, DELIMITER); // subsequent tokens
if (!token[n]) break; // no more tokens
}
}
// process (print) the tokens
for (int i = 0; i < n; i++) // n = #of tokens
cout << "Token[" << i << "] = " << token[i] << endl;
cout << endl;
}
// Your code here
} while( FindNextFile( h, & find_data ) );
FindClose( h );
uint64_t b = GetTimeMs64();
cout << a << endl;
cout << b << endl;
uint64_t c = b - a;
cout << c << endl;
system("pause");
}
There is no way for the GPU to handle the files directly. As you assumed, the CPU handles IO.
So you need to store the information you read in memory, send it to the GPU, compute there, and so on.
One good way to work with files is to archive (compress) your information with the GPU.
So you read the file with the CPU, then extract > compute > archive with the GPU, and store it with the CPU.
UPD.
(CPU IO READ from file (should be already archived information)) to -> main memory
(CPU SEND) to -> GPU global memory from main memory
(GPU EXTRACT (if archived))
(GPU COMPUTE (your work here))
(GPU ARCHIVE)
(CPU RETRIEVE) to -> main memory from GPU global memory
(CPU IO WRITE to file)
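If the question is specifically where C++ AMP hooks in: once the CPU-side parsing loop has produced numeric data in host memory, you wrap it in a concurrency::array_view and launch a parallel_for_each. The sketch below is a hypothetical, minimal illustration (the doubling kernel merely stands in for your real computation); it is not a drop-in replacement for the code above.
#include <amp.h>
#include <vector>
#include <iostream>

int main()
{
    // Stand-in for data the CPU has already read and tokenized from the .dat files.
    std::vector<float> values(1024, 1.0f);

    // Wrap the host data; the copy to the accelerator happens lazily on first use.
    concurrency::array_view<float, 1> av(static_cast<int>(values.size()), values);

    // GPU COMPUTE step: one thread per element.
    concurrency::parallel_for_each(av.extent,
        [=](concurrency::index<1> idx) restrict(amp)
    {
        av[idx] = av[idx] * 2.0f;            // replace with the real computation
    });

    av.synchronize();                        // CPU RETRIEVE: copy results back to `values`
    std::cout << values[0] << std::endl;
    return 0;
}
The array_view manages the transfers for you, so the CPU keeps doing all the file IO, exactly as the steps above describe; only the compute step runs on the GPU.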

Detect disc removal on fwrite in C

I am writing an application to continuously write and read files on a drive (whether it's a hard drive or an SD card or whatever). I'm writing a certain pattern and then reading it back as verification. I want to immediately output some kind of blaring error as soon as the app fails. Basically we're hitting the hardware with radiation and need to detect when it fails. I have the app reading and writing the files just fine so far, but I can yank the SD card mid-execution and it keeps on running as if it's still there. I really need to detect the moment the SD card is removed. I've seen some suggestions to use libudev. I cannot use that, as this is on an embedded Linux system which doesn't have it. Here's the code I have so far:
#include <stdio.h>
#include <time.h>
const unsigned long long size = 16ULL*1024;
#define NANOS 1000000000LL
#define KB 1024
long long CreateFile(char* filename)
{
struct timespec time_start;
struct timespec time_stop;
long long start, elapsed, microseconds;
int timefail = 0;
size_t stat;
if(clock_gettime(CLOCK_REALTIME, &time_start) < 0)
timefail = 1;
start = time_start.tv_sec*NANOS + time_start.tv_nsec;
int a[size];
int i, j;
for(i=0;i<size;i++)
a[i] = i;
FILE* pFile;
pFile = fopen(filename, "wb");
if(pFile < 0)
{
perror("fopen");
return -1;
}
for(j=0; j < KB; j++)
{
stat = fwrite(a, sizeof(int), size, pFile);
if(stat < 0)
perror("fwrite");
stat = fsync(pFile);
//if(stat)
// perror("fysnc");
}
fclose(pFile);
if(clock_gettime(CLOCK_REALTIME, &time_stop) < 0)
timefail = 1;
elapsed = time_stop.tv_sec*NANOS + time_stop.tv_nsec - start;
microseconds = elapsed / 1000 + (elapsed % 1000 >= 500);
if(timefail)
return -1;
return microseconds / 1000;
}
long long ReadFile(char* filename)
{
struct timespec time_start;
struct timespec time_stop;
long long start, elapsed, microseconds;
int timefail = 0;
if(clock_gettime(CLOCK_REALTIME, &time_start) < 0)
timefail = 1;
start = time_start.tv_sec*NANOS + time_start.tv_nsec;
FILE* pFile;
pFile = fopen(filename, "rb");
int a[KB];
int i=0, j=0;
for(i=0; i<size; i++)
{
if(ferror(pFile) != 0)
{
fprintf(stderr, "**********************************************");
fprintf(stderr, "READ FAILURE\n");
fclose(pFile);
return -1;
}
fread(a, sizeof(a), 1, pFile);
for(j=0; j<KB;j++)
{
if(a[0] != a[1]-1)
{
fprintf(stderr, "**********************************************");
fprintf(stderr, "DATA FAILURE, %d != %d\n", a[j], a[j+1]-1);
fclose(pFile);
return -1;
}
}
}
fclose(pFile);
if(clock_gettime(CLOCK_REALTIME, &time_stop) < 0)
timefail = 1;
if(timefail)
return -1;
elapsed = time_stop.tv_sec*NANOS + time_stop.tv_nsec - start;
microseconds = elapsed / 1000 + (elapsed % 1000 >= 500);
return microseconds/1000;
}
int main(int argc, char* argv[])
{
char* filenamebase = "/tmp/media/mmcblk0p1/test.file";
char filename[100] = "";
int i=0;
long long tmpsec = 0;
long long totalwritetime = 0;
int totalreadtime = 0;
int numfiles = 10;
int totalwritten = 0;
int totalread = 0;
for(i=0;i<numfiles;i++)
{
sprintf(filename, "%s%d", filenamebase, i);
fprintf(stderr, "Writing File: %s ...", filename);
tmpsec = CreateFile(filename);
if(tmpsec < 0)
return 0;
totalwritetime += tmpsec;
totalwritten++;
fprintf(stderr, "completed in %lld seconds\n", tmpsec);
fprintf(stderr, "Reading File: %s ...", filename);
tmpsec = ReadFile(filename);
if(tmpsec < 0)
return 0;
totalreadtime += tmpsec;
totalread++;
fprintf(stderr, "completed in %lld seconds\n", tmpsec);
}
fprintf(stderr, "Test Complete\nTotal Files: %d written, %d read\n", totalwritten, totalread);
fprintf(stderr, "File Size: %lld KB\n", size);
fprintf(stderr, "Total KBytes Written: %lld\n", size*totalwritten);
fprintf(stderr, "Average Write Speed: %0.2f KBps\n", (double)size*totalwritten/(totalwritetime/1000));
fprintf(stderr, "Total KBytes Read: %lld\n", size*totalread);
fprintf(stderr, "Average Read Speed: %0.2f KBps\n", (double)size*totalread/(totalreadtime/1000));
return 0;
}
You'll need to change your approach.
If you yank out media that has been mounted, you're likely to panic your kernel (as it keeps complex data structures that represent the mounted filesystem in memory), and break the media itself.
I've destroyed quite a few USB memory sticks that way -- the small internal logic that handles allocation and wear leveling does not like to lose power mid-run, and the cheapest ones do not seem to have capacitors capable of providing enough power to keep them running long enough to ensure a consistent state -- but SD cards and more expensive USB sticks might survive better.
Depending on the drivers used, the kernel may allow you to read and write to the media, but simply keep the changes in page cache. (Furthermore, your stdio.h I/O is likely to only reach into the page cache, and not the actual media, depending on the mount options (whether mounted direct/sync or not). Your approach simply does not provide the behaviour you assume it does.)
Instead, you should use low-level I/O (unistd.h, see man 2 open and related calls, none of stdio.h), using O_RDWR|O_DIRECT|O_SYNC flags to make sure your reads and writes hit the hardware, and access the raw media directly via the block device node, instead of mounting it at all. You can also read/write to random locations on the device, in the hopes that wear leveling does not affect your radiation resistance checks too much.
(Edited to add: If you write in blocks exactly the size of the native allocation block for the tested media device, you'll avoid the slow read-modify-write cycles on the device. The device will still do wear leveling, but that just means that the block you wrote is in a random physical location(s) in the flash chip. The native block size depends on the media device. It is possible to measure the native block size by observing how long it takes to read and write a block of different size, but I think for damage testing, a large enough power of two should work best -- say 256k or 262144 bytes. It's probably best to let the user set it for each device separately, and use either manufacturer-provided information, or a separate test program to find out the proper value.)
You do not want to use mmap() for this, as the SIGBUS signal caused by media errors and media becoming unavailable, is very tricky to handle correctly. Low-level unistd.h I/O is best for this, in my opinion.
I believe, but have not verified, that yanking out the media in mid-read/write to the unmounted low-level device, should simply yield a read/write error. (I don't have any media I'm willing to risk right now to check it, though :)
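A minimal sketch of that low-level approach follows. The device node /dev/mmcblk0 and the 256 KiB block size are assumptions for your board, and note that this writes straight to the raw device, destroying any filesystem on it.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE 1                 /* needed for O_DIRECT on Linux */
#endif
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define TEST_BLOCK (256 * 1024)       /* assumed native block size; tune per device */

int main(void)
{
    /* Hypothetical raw block device node for the SD card -- adjust for your board. */
    const char *dev = "/dev/mmcblk0";

    int fd = open(dev, O_RDWR | O_DIRECT | O_SYNC);
    if (fd < 0) { perror("open"); return 1; }

    /* O_DIRECT requires the buffer (and offset/length) to be block-aligned. */
    void *buf;
    if (posix_memalign(&buf, 4096, TEST_BLOCK) != 0) {
        fprintf(stderr, "posix_memalign failed\n");
        close(fd);
        return 1;
    }
    memset(buf, 0xA5, TEST_BLOCK);

    /* Write then read back one block at offset 0. When the card is yanked,
       these calls fail immediately instead of quietly hitting the page cache. */
    if (pwrite(fd, buf, TEST_BLOCK, 0) != TEST_BLOCK) { perror("pwrite"); }
    else if (pread(fd, buf, TEST_BLOCK, 0) != TEST_BLOCK) { perror("pread"); }

    free(buf);
    close(fd);
    return 0;
}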
Answer from my comment:
In your write function you should have:
for(j=0; j < KB; j++)
{
size_t items_written = fwrite(a, sizeof(int), size, pFile); /* fwrite returns the number of items written */
if(items_written < size)
{
perror("fwrite");
break;
}
fflush(pFile);                      /* flush the stdio buffer before syncing */
int rc = fsync(fileno(pFile));      /* fsync (from <unistd.h>) takes a file descriptor, not a FILE* */
if(rc < 0)
{
perror("fsync");
break;
}
}
and in your read function:
size_t items_read = fread(a, sizeof(a), 1, pFile); /* with nmemb == 1, fread returns 0 or 1 items */
if(items_read < 1)
break;
Also, if you have a C99 compiler, please use the fixed-size types available in stdint.h, e.g. uint32_t, ...

C: performance of pthreads lower than single thread

I'm confused about the performance of my code: when running with a single thread it takes only 13 s, but with multiple threads it takes 80 s. I don't know whether the vector can only be accessed by one thread at a time; if so, I'd likely have to use a struct array to store the data instead of a vector. Could anyone kindly help?
#include <iostream>
#include <stdio.h>
#include <stdlib.h>
#include <vector>
#include <iterator>
#include <string>
#include <ctime>
#include <bangdb/database.h>
#include "SEQ.h"
#define NUM_THREADS 16
using namespace std;
typedef struct _thread_data_t {
std::vector<FDT> *Query;
unsigned long start;
unsigned long end;
connection* conn;
int thread;
} thread_data_t;
void *thr_func(void *arg) {
thread_data_t *data = (thread_data_t *)arg;
std::vector<FDT> *Query = data->Query;
unsigned long start = data->start;
unsigned long end = data->end;
connection* conn = data->conn;
printf("thread %d started %lu -> %lu\n", data->thread, start, end);
for (unsigned long i=start;i<=end ;i++ )
{
FDT *fout = conn->get(&((*Query).at(i)));
if (fout == NULL)
{
//printf("%s\tNULL\n", s);
}
else
{
printf("Thread:%d\t%s\n", data->thread, fout->data);
}
}
pthread_exit(NULL);
}
int main(int argc, char *argv[])
{
if (argc<2)
{
printf("USAGE: ./seq <.txt>\n");
printf("/home/rd/SCRIPTs/12X18610_L5_I052.R1.clean.code.seq\n");
exit(-1);
}
printf("%s\n", argv[1]);
vector<FDT> Query;
FILE* fpin;
if((fpin=fopen(argv[1],"r"))==NULL) {
printf("Can't open Input file %s\n", argv[1]);
return -1;
}
char *key = (char *)malloc(36);
while (fscanf(fpin, "%s", key) != EOF)
{
SEQ * sequence = new SEQ(key);
FDT *fk = new FDT( (void*)sequence, sizeof(*sequence) );
Query.push_back(*fk);
}
unsigned long Querysize = (unsigned long)(Query.size());
std::cout << "myvector stores " << Querysize << " numbers.\n";
//create database, table and connection
database* db = new database((char*)"berrydb");
//get a table, a new one or existing one, walog tells if log is on or off
table* tbl = db->gettable((char*)"hg19", JUSTOPEN);
if(tbl == NULL)
{
printf("ERROR:table NULL error");
exit(-1);
}
//get a new connection
connection* conn = tbl->getconnection();
if(conn == NULL)
{
printf("ERROR:connection NULL error");
exit(-1);
}
cerr<<"begin querying...\n";
time_t begin, end;
double duration;
begin = clock();
unsigned long ThreadDealSize = Querysize/NUM_THREADS;
cerr<<"Querysize:"<<ThreadDealSize<<endl;
pthread_t thr[NUM_THREADS];
int rc;
thread_data_t thr_data[NUM_THREADS];
for (int i=0;i<NUM_THREADS ;i++ )
{
unsigned long ThreadDealStart = ThreadDealSize*i;
unsigned long ThreadDealEnd = ThreadDealSize*(i+1) - 1;
if (i == (NUM_THREADS-1) )
{
ThreadDealEnd = Querysize-1;
}
thr_data[i].conn = conn;
thr_data[i].Query = &Query;
thr_data[i].start = ThreadDealStart;
thr_data[i].end = ThreadDealEnd;
thr_data[i].thread = i;
}
for (int i=0;i<NUM_THREADS ;i++ )
{
if (rc = pthread_create(&thr[i], NULL, thr_func, &thr_data[i]))
{
fprintf(stderr, "error: pthread_create, rc: %d\n", rc);
return EXIT_FAILURE;
}
}
for (int i = 0; i < NUM_THREADS; ++i) {
pthread_join(thr[i], NULL);
}
cerr<<"done\n"<<endl;
end = clock();
duration = double(end - begin) / CLOCKS_PER_SEC;
cerr << "runtime: " << duration << "\n" << endl;
db->closedatabase(OPTIMISTIC);
delete db;
printf("Done\n");
return EXIT_SUCCESS;
}
Like all data structures in the standard library, the methods of vector are reentrant, but not thread-safe. That means different instances can be accessed by multiple threads independently, but each instance may only be accessed by one thread at a time, and you have to ensure that. But since each thread only reads its own range of the vector, that's not your problem.
Your problem is probably the printf. printf is thread-safe, meaning you can call it from any number of threads at the same time, but at the cost of being wrapped in mutual exclusion internally.
The majority of the work in the threaded part of your program is done inside printf. So what probably happens is that all the threads are started and quickly get to the printf, where all but the first will stop. When the printf finishes and releases the mutex, the system considers scheduling the threads that were waiting for it. It probably does, so a rather slow context switch happens. And this repeats after every printf.
How exactly this happens depends on which actual locking primitive is being used, which in turn depends on your operating system and standard library versions. The system should wake up only the next sleeper each time, but many implementations actually wake up all of them. So in addition to the printfs being executed in a mostly round-robin fashion, incurring one context switch each, there may be quite a few additional spurious wake-ups in which a thread just finds the lock is held and goes back to sleep.
So the lesson from this is that threads don't make things automagically faster. They only help when:
The thread spends most of its time doing blocking system calls. In things like network servers the threads wait for data from the socket, then for data for the response to come from disk, and finally for the network to accept the response. In such cases, having many threads helps as long as they are mostly independent.
There are only as many threads as there are CPU threads. Currently the usual number is 4 (either quad-core or dual-core with hyper-threading). More threads can't physically run in parallel, so they provide no gain and incur a bit of overhead. 16 threads is thus overkill.
And they never help when they all manipulate the same objects, so they end up spending most of the time waiting for locks anyway. In addition to any of your own objects that you lock, keep in mind that input and output file handles have to be internally locked as well.
Memory allocation also needs to internally synchronize between threads, but modern allocators have separate pools for threads to avoid much of it; if the default allocator proves to be too slow with many threads, there are some specialized ones you can use.
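To take printf out of the hot loop, one option is to let each thread append its hits to its own buffer and print everything after the joins. Below is a minimal sketch of that pattern (plain pthreads, no BangDB, and a hypothetical worker loop standing in for the query loop); only the final loop touches stdout, so there is no lock contention while the threads are working.
#include <pthread.h>
#include <sstream>
#include <iostream>

// Each worker owns its ostringstream, so no locking is needed while it runs.
struct ThreadData {
    int id;
    std::ostringstream out;   // private to this thread until after the join
};

static void *worker(void *arg) {
    ThreadData *d = static_cast<ThreadData *>(arg);
    for (int i = 0; i < 5; i++)                        // stand-in for the query loop
        d->out << "Thread:" << d->id << "\thit " << i << "\n";
    return NULL;
}

int main() {
    const int nthreads = 4;                            // roughly the number of CPU cores, not 16
    ThreadData data[nthreads];
    pthread_t thr[nthreads];

    for (int i = 0; i < nthreads; i++) {
        data[i].id = i;
        pthread_create(&thr[i], NULL, worker, &data[i]);
    }
    for (int i = 0; i < nthreads; i++)
        pthread_join(thr[i], NULL);

    for (int i = 0; i < nthreads; i++)                 // single-threaded output phase
        std::cout << data[i].out.str();
    return 0;
}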