Using MPI, we can do a broadcast to send an array to many nodes, or a reduce to combine arrays from many nodes onto one node.
My guess is that the fastest way to implement these is a binary tree, where each node either sends to two children (bcast) or combines the results from two children (reduce), giving a time logarithmic in the number of nodes.
I don't see any obvious reason why broadcast should be noticeably slower than reduce.
I ran the following test program on a 4-computer cluster, where each computer has 12 cores. The strange thing is that broadcast was quite a lot slower than reduce. Why? Is there anything I can do about it?
The results were:
inited mpi: 0.472943 seconds
N: 200000 1.52588MB
P = 48
did alloc: 0.000147641 seconds
bcast: 0.349956 seconds
reduce: 0.0478526 seconds
bcast: 0.369131 seconds
reduce: 0.0472673 seconds
bcast: 0.516606 seconds
reduce: 0.0448555 seconds
The code was:
#include <iostream>
#include <cstdlib>
#include <cstdio>
#include <ctime>
#include <sys/time.h>
#include <string> // for the std::string parameter of NanoTimer::toc
using namespace std;
#include <mpi.h>
class NanoTimer {
public:
struct timespec start;
NanoTimer() {
clock_gettime(CLOCK_MONOTONIC, &start);
}
double elapsedSeconds() {
struct timespec now;
clock_gettime(CLOCK_MONOTONIC, &now);
double time = (now.tv_sec - start.tv_sec) + (double) (now.tv_nsec - start.tv_nsec) * 1e-9;
start = now;
return time;
}
void toc(string label) {
double elapsed = elapsedSeconds();
cout << label << ": " << elapsed << " seconds" << endl;
}
};
int main( int argc, char *argv[] ) {
if( argc < 2 ) {
cout << "Usage: " << argv[0] << " [N]" << endl;
return -1;
}
int N = atoi( argv[1] );
NanoTimer timer;
MPI_Init( &argc, &argv );
int p, P;
MPI_Comm_rank( MPI_COMM_WORLD, &p );
MPI_Comm_size( MPI_COMM_WORLD, &P );
MPI_Barrier(MPI_COMM_WORLD);
if( p == 0 ) timer.toc("inited mpi");
if( p == 0 ) {
cout << "N: " << N << " " << (N*sizeof(double)/1024.0/1024) << "MB" << endl;
cout << "P = " << P << endl;
}
double *src = new double[N];
double *dst = new double[N];
MPI_Barrier(MPI_COMM_WORLD);
if( p == 0 ) timer.toc("did alloc");
for( int it = 0; it < 3; it++ ) {
MPI_Bcast( src, N, MPI_DOUBLE, 0, MPI_COMM_WORLD );
MPI_Barrier(MPI_COMM_WORLD);
if( p == 0 ) timer.toc("bcast");
MPI_Reduce( src, dst, N, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD );
MPI_Barrier(MPI_COMM_WORLD);
if( p == 0 ) timer.toc("reduce");
}
delete[] src;
delete[] dst;
MPI_Finalize();
return 0;
}
The cluster nodes were running 64-bit ubuntu 12.04. I tried both openmpi and mpich2, and got very similar results. The network is gigabit ethernet, which is not the fastest, but what I'm most curious about is not the absolute speed, so much as the disparity between broadcast and reduce.
I don't think this quite answers your question, but I hope it provides some insight.
MPI is just a standard. It doesn't define how every function should be implemented. Therefore the performance of a given operation in MPI (in your case MPI_Bcast and MPI_Reduce) is determined strictly by the implementation you are using. It is possible that you could design a broadcast using point-to-point communication methods that performs better than the given MPI_Bcast.
Anyways, you have to consider what each of these functions is doing. Broadcast takes information from one process and sends it to all other processes; reduce takes information from each process and reduces it onto one process. According to the (most recent) standard, MPI_Bcast is considered a One-to-All collective operation and MPI_Reduce is considered an All-to-One collective operation. Therefore your intuition about binary trees is probably realized in both implementations' MPI_Reduce. However, it is most likely not realized in MPI_Bcast. It might be the case that MPI_Bcast is implemented using non-blocking point-to-point communication (sending from the process containing the information to all other processes) with a wait-all after the communication. In any case, in order to figure out how both functions work, I would suggest delving into the source code of your implementations of OpenMPI and MPICH2.
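To make that concrete, here is a minimal sketch of a log-time broadcast built from point-to-point calls, in the spirit of the binary-tree intuition from the question. This is only an illustration of the technique, not the code inside OpenMPI or MPICH2; the function name tree_bcast and the choice of rank 0 as root are my own assumptions:
void tree_bcast(double *buf, int N, MPI_Comm comm) {
    int p, P;
    MPI_Comm_rank(comm, &p);
    MPI_Comm_size(comm, &P);
    // Round k (mask = 1, 2, 4, ...): every rank that already holds the data
    // forwards it to the rank 'mask' above itself, so the set of ranks
    // holding the data doubles each round -- O(log P) rounds in total.
    for (int mask = 1; mask < P; mask <<= 1) {
        if (p < mask) {
            if (p + mask < P)
                MPI_Send(buf, N, MPI_DOUBLE, p + mask, 0, comm);
        } else if (p < 2 * mask) {
            MPI_Recv(buf, N, MPI_DOUBLE, p - mask, 0, comm, MPI_STATUS_IGNORE);
        }
    }
}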
As Hristo mentioned, it depends on the size of your buffer. If you're sending a large buffer, the broadcast has to push the full buffer across every link of the tree, while a reduce performs a local combine at each node and then transmits only the single combined buffer up the tree, rather than forwarding every contribution separately.
#include <sys/time.h>
#include <pthread.h>
#include <cstdio>
#include <iostream>
timespec m_timeToWait;
pthread_mutex_t m_lock;
pthread_cond_t m_cond;
timespec & calculateNextCheckTime(int intervalSeconds){
timeval now{};
gettimeofday(&now, nullptr);
m_timeToWait.tv_sec = now.tv_sec + intervalSeconds;
//m_timeToWait.tv_nsec = (1000 * now.tv_usec) + intervalSeconds;
return m_timeToWait;
}
void *run(void *){
int i = 0;
pthread_mutex_lock(&m_lock);
while (i < 10) {
std::cout << "Waiting .." << std::endl;
int ret = pthread_cond_timedwait(&m_cond, &m_lock, &calculateNextCheckTime(1));
std::cout << "doing work" << std::endl;
i++;
}
pthread_mutex_unlock(&m_lock);
return nullptr; // a pthread start routine must return a void*
}
int main()
{
pthread_t thread;
int ret;
int i;
std::cout << "In main: creating thread" << std::endl;
ret = pthread_create(&thread, NULL, &run, NULL);
pthread_join(reinterpret_cast<pthread_t>(&thread), reinterpret_cast<void **>(ret));
return 0;
}
There are similar examples on SO, but I can't seem to figure it out. Also, the CLion IDE insists that I use reinterpret casts on the pthread_join params, even though the examples on SO don't have those casts in place. I am using C++11.
This is just maths.
You have access to tv_sec, and you have access to tv_nsec.
Currently you're only setting tv_sec, to "the seconds part of now, plus X seconds".
You can also set tv_nsec, to "the nanoseconds part of now, plus Y nanoseconds".
The result is "now, plus X seconds and Y nanoseconds"… which is when you want the program to wait (at the earliest), with nanoseconds resolution.
Just uncomment the line that does this, then provide the appropriate numbers for what you want to do.
You could have the function take an additional "milliseconds" argument (don't forget to multiply it by 1,000,000!) then leave the "seconds" at zero if you want that:
timespec& calculateNextCheckTime(const int intervalSeconds, const int intervalMillis)
{
timeval now{};
gettimeofday(&now, nullptr);
m_timeToWait.tv_sec = now.tv_sec + intervalSeconds;
m_timeToWait.tv_nsec = (1000 * now.tv_usec) + (1000 * 1000 * intervalMillis);
return m_timeToWait;
}
You may or may not wish to perform some range checking (i.e. verify that intervalMillis >= 0 && intervalMillis < 1000) to avoid nasty overflows.
Or, instead, you may wish to allow calculateNextCheckTime(1, 2034) to be treated the same as calculateNextCheckTime(3, 34). That can work, but only if you also implement "carry" semantics to ensure that m_timeToWait.tv_nsec is less than 1,000,000,000 after adding the (1000 * now.tv_usec) component, over which the calling user has no control. (I have not implemented that in the above example; see the sketch below.)
Also, you may or may not wish to make those arguments unsigned.
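For completeness, here is a minimal sketch of that carry normalization, folded into the function above (my own addition, not part of the original code):
timespec& calculateNextCheckTime(const int intervalSeconds, const int intervalMillis)
{
    timeval now{};
    gettimeofday(&now, nullptr);
    m_timeToWait.tv_sec = now.tv_sec + intervalSeconds;
    m_timeToWait.tv_nsec = (1000L * now.tv_usec) + (1000L * 1000L * intervalMillis);
    // Carry whole seconds out of tv_nsec so it ends up in [0, 1e9).
    m_timeToWait.tv_sec += m_timeToWait.tv_nsec / 1000000000L;
    m_timeToWait.tv_nsec %= 1000000000L;
    return m_timeToWait;
}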
I have a C++ program running under Linux Debian 9. I'm doing a simple read() from a file descriptor:
int bytes_read = read(fd, buffer, buffer_size);
Imagine that I want to read some more data from the socket, but I want to skip a known number of bytes before getting to some content I'm interested in:
int unwanted_bytes_read = read(fd, unwanted_buffer, bytes_to_skip);
int useful_bytes = read(fd, buffer, buffer_size);
In Linux, is there a system-wide 'built-in' location that I can dump the unwanted bytes into, rather than having to maintain a buffer for unwanted data (like unwanted_buffer in the above example)?
I suppose what I'm looking for would be (sort of) the opposite of MSG_PEEK in the socket world, i.e. the kernel would purge bytes_to_skip from its receive buffer before the next useful call to recv.
If I were reading from a file then lseek would be enough. But this is not possible if you are reading from a socket and are using scatter/gather I/O, and you want to drop one of the fields.
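For the plain-file case, that skip would just be a seek; a minimal sketch of what I mean (assuming fd here refers to a seekable file):
#include <unistd.h>

// Skip bytes_to_skip bytes without reading them. On a socket this
// fails with errno == ESPIPE, which is exactly my problem.
if (lseek(fd, bytes_to_skip, SEEK_CUR) == (off_t)-1) {
    // not seekable (e.g. a socket)
}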
I'm thinking about something like this:
// send side
int a = 1;
int b = 2;
int c = 3;
struct iovec iov[3];
ssize_t nwritten;
iov[0].iov_base = &a;
iov[0].iov_len = sizeof(int);
iov[1].iov_base = &b;
iov[1].iov_len = sizeof(int);
iov[2].iov_base = &c;
iov[2].iov_len = sizeof(int);
nwritten = writev(fd, iov, 3);
// receive side
int a = -1;
int c = -1;
struct iovec iov[3]; // you know that you'll be receiving three fields and what their sizes are, but you don't care about the second.
ssize_t nread;
iov[0].iov_base = &a;
iov[0].iov_len = sizeof(int);
iov[1].iov_base = ??? <---- what to put here?
iov[1].iov_len = sizeof(int);
iov[2].iov_base = &c;
iov[2].iov_len = sizeof(int);
nread = readv(fd, iov, 3);
I know that I could just create another b variable on the receive side, but if I don't want to, how can I read the sizeof(int) bytes that it occupies in the file but just dump the data and proceed to c? I could just create a generic buffer to dump b into, all I was asking is if there is such a location by default.
[EDIT]
Following a suggestion from #inetknght, I tried memory mapping /dev/null and doing my gather into the mapped address:
int nullfd = open("/dev/null", O_WRONLY);
void* blackhole = mmap(NULL, iov[1].iov_len, PROT_WRITE, MAP_SHARED, nullfd, 0);
iov[1].iov_base = blackhole;
nread = readv(fd, iov, 3);
However, blackhole comes out as 0xffff and I get an errno 13 'Permission Denied'. I tried running my code as su and this doesn't work either. Perhaps I'm setting up my mmap incorrectly?
There's a tl;dr at the end.
In my comment, I suggested you mmap() the /dev/null device. However it seems that device is not mappable on my machine (err 19: No such device). It looks like /dev/zero is mappable though. Another question/answer suggests that is equivalent to MAP_ANONYMOUS which makes the fd argument and its associated open() unnecessary in the first place. Check out an example:
#include <iostream>
#include <cstring>
#include <cerrno>
#include <cstdlib>
extern "C" {
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <fcntl.h>
}
template <class Type>
struct iovec ignored(void *p)
{
struct iovec iov_ = {};
iov_.iov_base = p;
iov_.iov_len = sizeof(Type);
return iov_;
}
int main()
{
auto * p = mmap(nullptr, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
if ( MAP_FAILED == p ) {
auto err = errno;
std::cerr << "mmap(MAP_PRIVATE | MAP_ANONYMOUS): " << err << ": " << strerror(err) << std::endl;
return EXIT_FAILURE;
}
int s_[2] = {-1, -1};
int result = socketpair(AF_UNIX, SOCK_STREAM, 0, s_);
if ( result < 0 ) {
auto err = errno;
std::cerr << "socketpair(): " << err << ": " << strerror(err) << std::endl;
return EXIT_FAILURE;
}
int w_[3] = {1,2,3};
ssize_t nwritten = 0;
auto makeiov = [](int & v){
struct iovec iov_ = {};
iov_.iov_base = &v;
iov_.iov_len = sizeof(v);
return iov_;
};
struct iovec wv[3] = {
makeiov(w_[0]),
makeiov(w_[1]),
makeiov(w_[2])
};
nwritten = writev(s_[0], wv, 3);
if ( nwritten < 0 ) {
auto err = errno;
std::cerr << "writev(): " << err << ": " << strerror(err) << std::endl;
return EXIT_FAILURE;
}
int r_ = {0};
ssize_t nread = 0;
struct iovec rv[3] = {
ignored<int>(p),
makeiov(r_),
ignored<int>(p),
};
nread = readv(s_[1], rv, 3);
if ( nread < 0 ) {
auto err = errno;
std::cerr << "readv(): " << err << ": " << strerror(err) << std::endl;
return EXIT_FAILURE;
}
std::cout <<
w_[0] << '\t' <<
w_[1] << '\t' <<
w_[2] << '\n' <<
r_ << '\t' <<
*(int*)p << std::endl;
return EXIT_SUCCESS;
}
In the above example you can see that I create a private (writes won't be visible by children after fork()) anonymous (not backed by a file) memory mapping of 4KiB (one single page size on most systems). It's then used twice to provide a write destination for two ints -- the later int overwriting the earlier one.
That doesn't exactly solve your question of how to ignore the bytes. Since you're using readv(), I looked into its sister function, preadv(), which at first glance appears to do what you want: skip bytes. However, it seems that's not supported on socket file descriptors. The following code gives preadv(): 29: Illegal seek.
rv = makeiov(r_[1]);
nread = preadv(s_[1], &rv, 1, sizeof(int));
if ( nread < 0 ) {
auto err = errno;
std::cerr << "preadv(): " << err << ": " << strerror(err) << std::endl;
return EXIT_FAILURE;
}
So it looks like even preadv() relies on seeking under the hood, which is, of course, not permitted on a socket. I'm not sure if there is (yet?) a way to tell the OS to ignore/drop bytes received on an established stream. I suspect that's because #geza is correct: the cost of writing to the final (ignored) destination is extremely trivial for most situations I've encountered. And in the situations where the cost of the ignored bytes is not trivial, you should seriously consider using better options, implementations, or protocols.
tl;dr:
Creating a 4KiB anonymous private memory mapping is effectively indistinguishable from contiguous-allocation containers (there are subtle differences that aren't likely to be important for any workload outside of very high-end performance). Using a standard container is also a lot less prone to allocation bugs: memory leaks, wild pointers, et al. So I'd say KISS and just do that instead of endorsing any of the code I wrote above. For example: std::array<char, 4096> ignored; or std::vector<char> ignored(4096); (note the parentheses; braces would create a one-element vector) and just set iovec.iov_base = ignored.data(); and set the .iov_len to whatever size you need to ignore (within the length of the container).
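A minimal sketch of that KISS version, reusing the receive-side layout from the question (fd is assumed to be the connected socket):
#include <sys/uio.h>
#include <vector>

std::vector<char> ignored(sizeof(int)); // scratch space for the unwanted field

int a = -1, c = -1;
struct iovec iov[3];
iov[0].iov_base = &a;             iov[0].iov_len = sizeof(int);
iov[1].iov_base = ignored.data(); iov[1].iov_len = ignored.size(); // 'b' lands here and is dropped
iov[2].iov_base = &c;             iov[2].iov_len = sizeof(int);
ssize_t nread = readv(fd, iov, 3);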
The efficient reading of data from a socket is when:
The user-space buffer is at least as large as the kernel socket receive buffer plus the largest partial message that can be carried over (SO_RCVBUF_size + maximum_message_size - 1). You can even map the buffer's memory pages twice, contiguously, to make it a ring buffer and avoid memmove-ing incomplete messages to the beginning of the buffer.
The reading is done in one call of recv. This minimizes the number of syscalls (which are more expensive these days due to mitigations for Spectre, Meltdown, etc.). It also prevents starvation of other sockets in the same event loop, which can happen if the code repeatedly calls recv on the same socket with a small buffer size until it fails with EAGAIN, and it guarantees that you drain the entire kernel receive buffer in one recv syscall.
If you do the above, you should then interpret/decode the message from the user-space buffer ignoring whatever is necessary.
Using multiple recv or recvmsg calls with small buffer sizes is sub-optimal with regards to latency and throughput.
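A minimal sketch of that pattern (the buffer size and the bytes_to_skip variable are invented for the example; fd is the connected socket):
#include <sys/socket.h>

char buf[64 * 1024]; // sized to cover the kernel receive buffer in this sketch
ssize_t n = recv(fd, buf, sizeof(buf), 0); // drain the kernel buffer in one syscall
if (n > 0) {
    const char *p = buf + bytes_to_skip; // "skipping" is just pointer arithmetic
    // decode the interesting fields from p, up to buf + n
}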
I have a C++ program in which I want to parse a huge file, looking for some regex that I've implemented. The program was working ok when executed sequentially but then I wanted to run it using MPI.
I started the adaptation to MPI by differentiating the master (the one who coordinates the execution) from the workers (the ones that parse the file in parallel) in the main function. Something like this:
MPI::Init(argc, argv);
...
if(rank == 0) {
...
// Master sends initial and ending byte to every worker
for(int i = 1; i < total_workers; i++) {
array[0] = (i-1) * first_worker_file_part;
array[1] = i * first_worker_file_part;
MPI::COMM_WORLD.Send(array, 2, MPI::INT, i, 1);
}
}
if(rank != 0)
readDocument();
...
MPI::Finalize();
The master will send every worker an array with 2 positions, containing the byte where it should start reading the file in position 0 and the byte where it should stop reading in position 1.
The readDocument() function looks like this by now (not parsing, just each worker reading his part of the file):
void readDocument()
{
array = new int[2];
MPI::COMM_WORLD.Recv(array, 2, MPI::INT, 0, 1, status); // the master sends exactly 2 ints
int read_length = array[1] - array[0];
char* buffer = new char [read_length];
if (infile)
{
infile.seekg(array[0]); // Start reading in supposed byte
infile.read(buffer, read_length);
}
}
I've tried different experiments, from writing the output of the reading to a file, to running it with different numbers of processes. What happens is that when I run the program with 20 processes instead of 10, for example, it takes twice as long to read the file. I expected it to take nearly half the time, and I can't figure out why this is happening.
Also, on a different matter, I want to make the master wait for all the workers to complete their execution and then print the final time. Is there any way to "block" it while the workers are processing? Like a cond_wait in C pthreads?
In my experience people working on computer systems with parallel file systems tend to know about those parallel file systems so your question marks you out, initially, as someone not working on such a system.
Without specific hardware support reading from a single file boils down to the system positioning a single read head and reading a sequence of bytes from the disk to memory. This situation is not materially altered by the complex realities of many modern file systems, such as RAID, which may in fact store a file across multiple disks. When multiple processes ask the operating system for access to files at the same time the o/s parcels out disk access according to some notion, possibly of fairness, so that no process gets starved. At worst the o/s spends so much time switching disk access from process to process that the rate of reading drops significantly. The most efficient, in terms of throughput, approach is for a single process to read an entire file in one go while other processes do other things.
This situation, multiple processes contending for scarce disk i/o resources, applies whether or not those processes are part of a parallel, MPI (or similar) program or entirely separate programs running concurrently.
The impact is what you observe -- instead of 10 processes each waiting to get their own 1/10th share of the file you have 20 processes each waiting for their 1/20th share. Oh, you cry, but each process is only reading half as much data so the whole gang should take the same amount of time to get the file. No, I respond, you've forgotten to add the time it takes the o/s to position and reposition the read/write heads between accesses. Read time comprises latency (how long does it take reading to start once the request has been made) and throughput (how fast can the i/o system pass the bytes to and fro).
It should be easy to come up with some reasonable estimates of latency and bandwidth that explain the twice-as-long reading by 20 processes as by 10.
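To make that concrete with invented but plausible numbers: suppose a head repositioning costs 5 ms and the disk streams at 100 MB/s. A 1 GB file then needs 10 s of pure transfer regardless of how many processes read it. If the o/s interleaves 10 readers finely enough to cause, say, 2000 repositionings, that adds 2000 × 5 ms = 10 s; with 20 readers and 4000 repositionings it adds 20 s. The transfer term is fixed, so the seek term, which grows with the number of contending processes, can easily account for the doubling you measured.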
How can you solve this ? You can't, not without a parallel file system. But you might find that having the master process read the whole file and then parcel it out to be faster than your current approach. You might not, you might just find that the current approach is the fastest for your whole computation. If read time is, say, 10% of total computation time you might decide it's a reasonable overhead to live with.
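If you do try the master-reads-everything approach, here is a minimal sketch using MPI_Scatterv (my own illustration, assuming rank, P, and filename are set up as in your code, that the file fits in the master's memory, and that chunks need no overlap):
std::vector<char> whole;               // only rank 0 fills this
std::vector<int> counts(P), displs(P);
if (rank == 0) {
    std::ifstream f(filename, std::ios::binary | std::ios::ate);
    int fsize = f.tellg();
    whole.resize(fsize);
    f.seekg(0);
    f.read(whole.data(), fsize);
    for (int i = 0; i < P; i++) {      // split into near-equal chunks
        counts[i] = fsize / P + (i < fsize % P ? 1 : 0);
        displs[i] = (i == 0) ? 0 : displs[i - 1] + counts[i - 1];
    }
}
MPI_Bcast(counts.data(), P, MPI_INT, 0, MPI_COMM_WORLD);
std::vector<char> chunk(counts[rank]); // each worker receives only its piece
MPI_Scatterv(whole.data(), counts.data(), displs.data(), MPI_CHAR,
             chunk.data(), counts[rank], MPI_CHAR, 0, MPI_COMM_WORLD);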
To add to High Performance Mark's correct answer, one can use MPI-IO to do the file reading, providing (in this case) hints to the IO routines not to read from every processor; but this same code with a modified (or empty) MPI_Info should be able to take advantage of a parallel file system as well should you move to a cluster that has one. For the most common implementation of MPI-IO, Romio, the manual describing what hints are available is here; in particular, we're using
MPI_Info_set(info, "cb_config_list","*:1");
to set the number of readers to be one per node. The code below will let you try reading the file using MPI-IO or POSIX (eg, seek).
#include <iostream>
#include <fstream>
#include <mpi.h>
void partitionFile(const int filesize, const int rank, const int size,
const int overlap, int *start, int *end) {
int localsize = filesize/size;
*start = rank * localsize;
*end = *start + localsize-1;
if (rank != 0) *start -= overlap;
if (rank != size-1) *end += overlap;
}
void readdataMPI(MPI_File *in, const int rank, const int size, const int overlap,
char **data, int *ndata) {
MPI_Offset filesize;
int start;
int end;
// figure out who reads what
MPI_File_get_size(*in, &filesize);
partitionFile((int)filesize, rank, size, overlap, &start, &end);
*ndata = end - start + 1;
// allocate memory
*data = new char[*ndata + 1];
// everyone reads in their part
MPI_File_read_at_all(*in, (MPI_Offset)start, *data,
(MPI_Offset)(*ndata), MPI_CHAR, MPI_STATUS_IGNORE);
(*data)[*ndata] = '\0';
}
void readdataSeek(std::ifstream &infile, int array[2], char *buffer)
{
int read_length = array[1] - array[0];
if (infile)
{
infile.seekg(array[0]); // Start reading in supposed byte
infile.read(buffer, read_length);
}
}
int main(int argc, char **argv) {
MPI_File in;
int rank, size;
int ierr;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
if (argc != 3) {
if (rank == 0)
std::cerr << "Usage: " << argv[0] << " infilename [MPI|POSIX]" << std::endl;
MPI_Finalize();
return -1;
}
std::string optionMPI("MPI");
if ( !optionMPI.compare(argv[2]) ) {
MPI_Info info;
MPI_Info_create(&info);
MPI_Info_set(info, "cb_config_list","*:1"); // ROMIO: one reader per node
// Eventually, should be able to use io_nodes_list or similar
ierr = MPI_File_open(MPI_COMM_WORLD, argv[1], MPI_MODE_RDONLY, info, &in);
if (ierr) {
if (rank == 0)
std::cerr << "Usage: " << argv[0] << " Couldn't open file " << argv[1] << std::endl;
MPI_Finalize();
return -1;
}
const int overlap=1;
char *data;
int ndata;
readdataMPI(&in, rank, size, overlap, &data, &ndata);
std::cout << "MPI: Rank " << rank << " has " << ndata << " characters." << std::endl;
delete [] data;
MPI_File_close(&in);
MPI_Info_free(&info);
} else {
int fsize;
if (rank == 0) {
std::ifstream file( argv[1], std::ios::ate );
fsize=file.tellg();
file.close();
}
MPI_Bcast(&fsize, 1, MPI_INT, 0, MPI_COMM_WORLD);
int start, end;
partitionFile(fsize, rank, size, 1, &start, &end);
int array[2] = {start, end};
char *buffer = new char[end-start+2];
std::ifstream infile;
infile.open(argv[1], std::ios::in);
readdataSeek(infile, array, buffer);
buffer[end-start+1] = '\0';
std::cout << "Seeking: Rank " << rank << " has " << end-start+1 << " characters." << std::endl;
infile.close() ;
delete [] buffer;
}
MPI_Finalize();
return 0;
}
On my desktop, I don't get much of a performance difference, even oversubscribing the cores (eg, using lots of seeks):
$ time mpirun -np 20 ./read-chunks moby-dick.txt POSIX
Seeking: Rank 0 has 62864 characters.
[...]
Seeking: Rank 8 has 62865 characters.
real 0m1.250s
user 0m0.290s
sys 0m0.190s
$ time mpirun -np 20 ./read-chunks moby-dick.txt MPI
MPI: Rank 1 has 62865 characters.
[...]
MPI: Rank 4 has 62865 characters.
real 0m1.272s
user 0m0.337s
sys 0m0.265s
Hi, I use this code to check for processes after my app "piko.exe" runs: if any of the programs "non.exe", "firefox.exe", or "lol.exe" is running, it closes my app and returns an error.
But I need this check to run every 30 seconds. I tried a while loop, but then my main program (this code is one part of my project) stopped responding. So please, if possible, could someone edit my code? Thank you.
#include "StdInc.h"
#include <windows.h>
#include <tlhelp32.h>
#include <tchar.h>
#include <stdio.h>
void find_Proc(){
HANDLE proc_Snap;
HANDLE proc_pik;
HANDLE proc_pikterm;
PROCESSENTRY32 pe32;
PROCESSENTRY32 pe32pik;
int i;
char* chos[3] = {"non.exe","firefox.exe","lol.exe"};
char* piko = "piko.exe";
proc_pik = CreateToolhelp32Snapshot( TH32CS_SNAPPROCESS, 0 );
proc_Snap = CreateToolhelp32Snapshot( TH32CS_SNAPPROCESS, 0 );
pe32.dwSize = sizeof(PROCESSENTRY32);
pe32pik.dwSize = sizeof(PROCESSENTRY32);
for(i = 0; i < 3 ; i++){
Process32First(proc_Snap , &pe32);
do{
if(!strcmp(chos[i],pe32.szExeFile)){
MessageBox(NULL,"CHEAT DETECTED","ERROR",NULL);
Process32First(proc_pik,&pe32pik);
do{
if(!strcmp(piko,pe32pik.szExeFile)){
proc_pikterm = OpenProcess(PROCESS_ALL_ACCESS, TRUE, pe32pik.th32ProcessID);
if(proc_pikterm != NULL)
TerminateProcess(proc_pikterm, 0);
CloseHandle(proc_pikterm);
}
} while(Process32Next(proc_pik, &pe32pik));
}
} while(Process32Next(proc_Snap, &pe32));
}
CloseHandle(proc_Snap);
CloseHandle(proc_pik);
}
Based on what OS you're using, you can poll the system time and check whether 30 seconds have elapsed. The way to do so is to take the time at the beginning of your loop, take the time at the end, and subtract them. Then subtract the time your routine took from the time you want to sleep.
Also, if you don't need EXACTLY 30 seconds, you could just add a sleep call to your loop; note that on Windows (which your snapshot code implies) that's Sleep(30000), in milliseconds. See the sketch below for running the check without blocking your main program.
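Since the code in the question is Windows-specific, one way to get the every-30-seconds behaviour without freezing the main program is to run the check in a background thread; a minimal sketch (assuming the find_Proc() from the question):
#include <windows.h>

// Background thread: re-run the scan every 30 seconds without blocking main.
DWORD WINAPI checkLoop(LPVOID)
{
    for (;;) {
        find_Proc();      // the scan function from the question
        Sleep(30 * 1000); // Sleep() takes milliseconds on Windows
    }
    return 0;
}

// Somewhere during startup:
// HANDLE h = CreateThread(NULL, 0, checkLoop, NULL, 0, NULL);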
Can you explain to me why this method wouldn't work for you? The code below is designed to count up one value each second. Make "checkMyProcess" do whatever you need it to do within that while loop before the sleep call.
#include <iostream>
#include <unistd.h> // for sleep() and fork()
using namespace std;
int someGlobal = 5;//Added in a global so you can see what fork does, with respect to not sharing memory!
bool checkMyProcess(const int MAX) {
int counter = 0;
while(counter < MAX) {
cout << "CHECKING: " << counter++ << " Global: " << someGlobal++ << endl;
sleep(1);
}
return true; // report that the checks ran to completion
}
void doOtherWork(const int MIN) {
int counter = 100;
while(counter > MIN) {
cout << "OTHER WORK:" << counter-- << " Global: " << someGlobal << endl;
sleep(1);
}
}
int main() {
int pid = fork();
if(pid == 0) {
checkMyProcess(5);
} else {
doOtherWork(90);
}
}
Realize of course that, if you want to do work outside of the while loop, within this same program, you would have to use threading, or fork a pair of processes.
EDIT:
I added in a call to "fork" so you can see the two processes doing work at the same time. Note: if the "checkMyProcess" function needs to know something about the memory going on in the "doOtherWork" function threading will be a much easier solution for you!
I'm confused about the performance of my code: with a single thread it takes only 13 s, but with 16 threads it takes 80 s. I don't know whether the vector can only be accessed by one thread at a time; if so, I will likely have to use a struct array to store the data instead of a vector. Could anyone kindly help?
#include <iostream>
#include <stdio.h>
#include <stdlib.h>
#include <vector>
#include <iterator>
#include <string>
#include <ctime>
#include <pthread.h> // pthread_create/pthread_join are used below
#include <bangdb/database.h>
#include "SEQ.h"
#define NUM_THREADS 16
using namespace std;
typedef struct _thread_data_t {
std::vector<FDT> *Query;
unsigned long start;
unsigned long end;
connection* conn;
int thread;
} thread_data_t;
void *thr_func(void *arg) {
thread_data_t *data = (thread_data_t *)arg;
std::vector<FDT> *Query = data->Query;
unsigned long start = data->start;
unsigned long end = data->end;
connection* conn = data->conn;
printf("thread %d started %lu -> %lu\n", data->thread, start, end);
for (unsigned long i=start;i<=end ;i++ )
{
FDT *fout = conn->get(&((*Query).at(i)));
if (fout == NULL)
{
//printf("%s\tNULL\n", s);
}
else
{
printf("Thread:%d\t%s\n", data->thread, fout->data);
}
}
pthread_exit(NULL);
}
int main(int argc, char *argv[])
{
if (argc<2)
{
printf("USAGE: ./seq <.txt>\n");
printf("/home/rd/SCRIPTs/12X18610_L5_I052.R1.clean.code.seq\n");
exit(-1);
}
printf("%s\n", argv[1]);
vector<FDT> Query;
FILE* fpin;
if((fpin=fopen(argv[1],"r"))==NULL) {
printf("Can't open Input file %s\n", argv[1]);
return -1;
}
char *key = (char *)malloc(36);
while (fscanf(fpin, "%s", key) != EOF)
{
SEQ * sequence = new SEQ(key);
FDT *fk = new FDT( (void*)sequence, sizeof(*sequence) );
Query.push_back(*fk);
}
unsigned long Querysize = (unsigned long)(Query.size());
std::cout << "myvector stores " << Querysize << " numbers.\n";
//create database, table and connection
database* db = new database((char*)"berrydb");
//get a table, a new one or existing one, walog tells if log is on or off
table* tbl = db->gettable((char*)"hg19", JUSTOPEN);
if(tbl == NULL)
{
printf("ERROR:table NULL error");
exit(-1);
}
//get a new connection
connection* conn = tbl->getconnection();
if(conn == NULL)
{
printf("ERROR:connection NULL error");
exit(-1);
}
cerr<<"begin querying...\n";
time_t begin, end;
double duration;
begin = clock();
unsigned long ThreadDealSize = Querysize/NUM_THREADS;
cerr<<"Querysize:"<<ThreadDealSize<<endl;
pthread_t thr[NUM_THREADS];
int rc;
thread_data_t thr_data[NUM_THREADS];
for (int i=0;i<NUM_THREADS ;i++ )
{
unsigned long ThreadDealStart = ThreadDealSize*i;
unsigned long ThreadDealEnd = ThreadDealSize*(i+1) - 1;
if (i == (NUM_THREADS-1) )
{
ThreadDealEnd = Querysize-1;
}
thr_data[i].conn = conn;
thr_data[i].Query = &Query;
thr_data[i].start = ThreadDealStart;
thr_data[i].end = ThreadDealEnd;
thr_data[i].thread = i;
}
for (int i=0;i<NUM_THREADS ;i++ )
{
if (rc = pthread_create(&thr[i], NULL, thr_func, &thr_data[i]))
{
fprintf(stderr, "error: pthread_create, rc: %d\n", rc);
return EXIT_FAILURE;
}
}
for (int i = 0; i < NUM_THREADS; ++i) {
pthread_join(thr[i], NULL);
}
cerr<<"done\n"<<endl;
end = clock();
duration = double(end - begin) / CLOCKS_PER_SEC;
cerr << "runtime: " << duration << "\n" << endl;
db->closedatabase(OPTIMISTIC);
delete db;
printf("Done\n");
return EXIT_SUCCESS;
}
Like all data structures in the standard library, the methods of vector are reentrant, but not thread-safe. That means different instances can be accessed by multiple threads independently, but each instance may only be modified by one thread at a time, and you have to ensure that yourself. But since your threads only read the vector, each over its own disjoint range, that's not your problem.
What is probably your problem is the printf. printf is thread-safe, meaning you can call it from any number of threads at the same time, but at the cost of being wrapped in mutual exclusion internally.
The majority of the work in the threaded part of your program is done inside printf. So what probably happens is that all the threads are started and quickly get to the printf, where all but the first stop. When the printf finishes and releases the mutex, the system considers scheduling the threads that were waiting for it. It probably does, so a rather slow context switch happens. And that repeats after every printf.
How exactly it happens depends on which actual locking primitive is being used, which depends on your operating system and standard library versions. The system should each time wake up only the next sleeper, but many implementations actually wake up all of them. So in addition to the printfs being executed in mostly round-robin fashion, incurring one context switch for each, there may be quite a few additional spurious wake-ups in which the thread just finds the lock is held and goes back to sleep.
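One way to confirm (and remove) this effect is to buffer each thread's output locally and write it once at the end, so the stdio lock is taken once per thread rather than once per line. A sketch against the loop in thr_func (FDT, conn, Query, and data are the question's own types and variables, and fout->data is assumed NUL-terminated as in the original printf):
#include <string>
#include <cstdio>

std::string out; // per-thread buffer, no locking needed while filling it
for (unsigned long i = start; i <= end; i++) {
    FDT *fout = conn->get(&((*Query).at(i)));
    if (fout != NULL) {
        out += "Thread:";
        out += std::to_string(data->thread);
        out += '\t';
        out += static_cast<const char *>(fout->data);
        out += '\n';
    }
}
fwrite(out.data(), 1, out.size(), stdout); // one locked write per thread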
So the lesson from this is that threads don't make things automagically faster. They only help when:
The thread spends most of its time doing blocking system calls. In things like network servers the threads wait for data from the socket, then for the response data to come from disk, and finally for the network to accept the response. In such cases, having many threads helps as long as they are mostly independent.
There are no more threads than there are CPU threads. Currently the usual number is 4 (either quad-core, or dual-core with hyper-threading). More threads can't physically run in parallel, so they provide no gain and incur a bit of overhead. 16 threads is thus overkill.
And they never help when they all manipulate the same objects, so they end up spending most of the time waiting for locks anyway. In addition to any of your own objects that you lock, keep in mind that input and output file handles have to be internally locked as well.
Memory allocation also needs to internally synchronize between threads, but modern allocators have separate pools for threads to avoid much of it; if the default allocator proves to be too slow with many threads, there are some specialized ones you can use.