Edit/Solved: Joachim Pileborg's answer did the job for me. THX
Please be gentle as this is my first question.
I am currently learning and playing with C++, in particular threading. I looked for an answer (and it would astonish me if there is not already one out there, but I wasn't able to find it).
So back to topic:
My "play" code looks something like this (Console application)
void foo(){
//do something
}
int _tmain(int argc, _TCHAR* argv[])
{
std::thread t[threadcount];
for (int i = 0; i < threadcount; ++i) {
t[i] = std::thread(foo);
}
for (int i = 0; i < threadcount; ++i) {
t[i].join();
}
}
Is it possible to set the value of threadcount through argv?
If not could someone please give me a short snippet on how to implement
std::thread::hardware_concurrency()
as the threadcount, because there, too, Visual Studio gives me an error when setting
const int threadcount = std::thread::hardware_concurrency();
Thanks in advance.
As the number of threads is to be controlled by threadcount, setting it from the command line can be implemented by adding
int threadcount = atoi(argv[1]);
to the implementation. Some error checking could be done, e.g. reporting an error on a non-positive number of threads.
If the number of threads is to be determined programmatically, depending on the specific platform, this question could be interesting.
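One detail worth spelling out: an array declared as std::thread t[threadcount] needs a compile-time constant size, which is why Visual Studio rejects a threadcount that comes from argv or from std::thread::hardware_concurrency(). A minimal sketch of one way around that, assuming a std::vector of threads is acceptable; choose_thread_count and run_workers are hypothetical helper names, and the upper bound of 1024 is an arbitrary sanity limit chosen for illustration:

```cpp
#include <cstdlib>
#include <thread>
#include <vector>

void foo() {
    // do something
}

// Hypothetical helper: take the thread count from argv[1] if it is a
// positive number, otherwise fall back to hardware_concurrency().
// hardware_concurrency() may return 0 when the value is unknown.
unsigned choose_thread_count(int argc, char* argv[]) {
    if (argc > 1) {
        char* end = nullptr;
        long requested = std::strtol(argv[1], &end, 10);
        // 1024 is an assumed sanity limit, not something from the question.
        if (end != argv[1] && *end == '\0' && requested > 0 && requested <= 1024) {
            return static_cast<unsigned>(requested);
        }
    }
    unsigned detected = std::thread::hardware_concurrency();
    return detected != 0 ? detected : 1;
}

// Hypothetical helper: start `count` threads running foo() and join them all.
void run_workers(unsigned count) {
    std::vector<std::thread> threads;
    threads.reserve(count);
    for (unsigned i = 0; i < count; ++i) {
        threads.emplace_back(foo);
    }
    for (std::thread& t : threads) {
        t.join();
    }
}
```

In _tmain (or main) this would be called as run_workers(choose_thread_count(argc, argv)); note that with _TCHAR arguments in a Unicode build the parsing call would have to be the wide-character equivalent.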
I've been learning C++ from the internet for the past 2 years and finally the need has arisen for me to delve into MPI. I've been scouring stackoverflow and the rest of the internet (including http://people.sc.fsu.edu/~jburkardt/cpp_src/mpi/mpi.html and https://computing.llnl.gov/tutorials/mpi/#LLNL). I think I've got some of the logic down, but I'm having a hard time wrapping my head around the following:
#include (stuff)
using namespace std;
vector<double> function(vector<double> &foo, const vector<double> &bar, int dim, int rows);
int main(int argc, char** argv)
{
vector<double> result;//represents a regular 1D vector
int id_proc, tot_proc, root_proc = 0;
int dim;//set to number of "columns" in A and B below
int rows;//set to number of "rows" of A and B below
vector<double> A(dim*rows), B(dim*rows);//represent matrices as 1D vectors
MPI::Init(argc,argv);
id_proc = MPI::COMM_WORLD.Get_rank();
tot_proc = MPI::COMM_WORLD.Get_size();
/*
initialize A and B here on root_proc with RNG and Bcast to everyone else
*/
//allow all processors to call function() so they can each work on a portion of A
result = function(A,B,dim,rows);
//all processors do stuff with A
//root_proc does stuff with result (doesn't matter if other processors have updated result)
MPI::Finalize();
return 0;
}
vector<double> function(vector<double> &foo, const vector<double> &bar, int dim, int rows)
{
/*
purpose of function() is two-fold:
1. update foo because all processors need the updated "matrix"
2. get the average of the "rows" of foo and return that to main (only root processor needs this)
*/
vector<double> output(dim,0);
//add matrices the way I would normally do it in serial
for (int i = 0; i < rows; i++)
{
for (int j = 0; j < dim; j++)
{
foo[i*dim + j] += bar[i*dim + j];//perform "matrix" addition (+= ON PURPOSE)
}
}
//obtain average of rows in foo in serial
for (int i = 0; i < rows; i++)
{
for (int j = 0; j < dim; j++)
{
output[j] += foo[i*dim + j];//sum rows of A
}
}
for (int j = 0; j < dim; j++)
{
output[j] /= rows;//divide to obtain average
}
return output;
}
The code above is to illustrate the concept only. My main concern is to parallelize the matrix addition but what boggles my mind is this:
1) If each processor only works on a portion of that loop (naturally I'd have to modify the loop parameters per processor) what command do I use to merge all portions of A back into a single, updated A that all processors have in their memory. My guess is that I have to do some kind of Alltoall where each processor sends its portion of A to all other processors, but how do I guarantee that (for example) row 3 worked on by processor 3 overwrites row 3 of the other processors, and not row 1 by accident.
2) If I use an Alltoall inside function(), do all processors have to be allowed to step into function(), or can I isolate function() using...
if (id_proc == root_proc)
{
result = function(A,B,dim,rows);
}
… and then inside function() handle all the parallelization. As silly as it sounds, I'm trying to do a lot of the work on one processor (with broadcasts), and just parallelize the big time-consuming for loops. Just trying to keep the code conceptually simple so I can get my results and move on.
3) For the averaging part, I'm sure I can just use a reducing command if I wanted to parallelize it, correct?
Also, as an aside: is there a way to call Bcast() such that it is blocking? I'd like to use it to synchronize all my processors (boost libraries are not an option). If not then I'll just go with Barrier(). Thank you for your answer to this question, and to the community of stackoverflow for teaching me how to program over the past two years! :)
1) The function you are looking for is MPI_Allgather. MPI_Allgather lets you send a row from each processor and receive the combined result on all processors.
2) Yes you can use some of the processors in your function. Since MPI functions work with communicators you have to create a separate communicator for this purpose. I don't know how this is implemented in the C++ bindings but C bindings use the MPI_Comm_create function.
3) Yes see MPI_Allreduce.
aside: Bcast blocks a process until the send/receive operation assigned to that process has finished. If you want to wait for all processors to finish their work (I don't have any idea why you would want to do this) you should use Barrier().
extra note: I wouldn't recommend using the C++ bindings as they are deprecated and you won't find specific examples of how to use them. Boost MPI is the library to use if you want C++ bindings; however, it does not cover all MPI functions.
It seems that most tutorials, guides, books and Q&A on the web refer to CUDA 3 and 4.x, which is why I'm asking specifically about CUDA 5.0. To the question...
I would like to program for an environment with two CUDA devices, but use only one thread, to keep the design simple (especially because it is a prototype). I want to know if the following code is valid:
float *x[2];
float *dev_x[2];
for(int d = 0; d < 2; d++) {
cudaSetDevice(d);
cudaMalloc(&dev_x[d], 1024);
}
for(int repeats = 0; repeats < 100; repeats++) {
for(int d = 0; d < 2; d++) {
cudaSetDevice(d);
cudaMemcpy(dev_x[d],x[d],1024,cudaMemcpyHostToDevice);
some_kernel<<<...>>>(dev_x[d]);
cudaMemcpy(x[d],dev_x[d],1024,cudaMemcpyDeviceToHost);
}
cudaStreamSynchronize(0);
}
I would like to know specifically whether the allocations made by cudaMalloc(...) before the test loop persist even though cudaSetDevice() keeps switching devices within the same thread. Also, I would like to know if the same holds for context-dependent objects such as cudaEvent_t and cudaStream_t.
I am asking because I have an application in this style that keeps getting a mapping error, and I can't find the cause, whether it is a memory leak I have missed or wrong API usage.
Note: In my original code, I do check every single CUDA call. I did not put it here for code readability.
Is this just a typo?
for(int d = 0; d < 2; d++) {
cudaSetDevice(0); // shouldn't that be 'd'
cudaMalloc(&dev_x, 1024);
}
Please check the return value of all API calls!
[C++ using Visual Studio Professional 2012]
Hi All, I am having trouble using std::mutex to prevent main() from changing variables that a second thread is accessing. In the following example (which is a massively simplified representation of my actual program) the function update() runs from the std::thread t2 in main(). update() checks if the vector world.m_grid[i][j].vec is empty and, if it is not, modifies the value contained. main() also accesses and occasionally clears this vector, and as a result if main() clears the vector after the empty check in update() but before world.m_grid[i][j].vec[0] is modified you get a vector subscript out of range error. I am trying to use std::mutex to prevent this from happening by locking barrier before the update() empty check, and releasing it after world.m_grid[i][j].vec[0] has been modified by update(), and after extensive browsing of mutex tutorials and examples I am unable to understand why the following does not have the desired effect:
#include <cstdlib>
#include <thread>
#include <mutex>
#include <vector>
using namespace std;
mutex barrier;
class World
{
public:
int m_rows;
int m_columns;
class Tile
{
public:
vector<int> vec;
int someVar;
};
vector<vector<Tile> > m_grid;
World (int rows = 100, int columns = 200): m_rows(rows), m_columns(columns), m_grid(rows, vector<Tile>(columns)) {}
};
void update(World& world)
{
while (true)
{
for (int i = 0; i < world.m_rows; ++i)
{
for (int j = 0; j < world.m_columns; ++j)
{
if (!world.m_grid[i][j].vec.empty())
{
lock_guard<mutex> guard(barrier);
world.m_grid[i][j].vec[0] += 5;
}
}
}
}
}
int main()
{
World world;
thread t2(update, ref(world));
while (true)
{
for (int i = 0; i < world.m_rows; ++i)
{
for (int j = 0; j < world.m_columns; ++j)
{
int random = rand() % 10;
if (world.m_grid[i][j].vec.empty() && random < 3) world.m_grid[i][j].vec.push_back(1);
else if (!world.m_grid[i][j].vec.empty() && random < 3) world.m_grid[i][j].vec.clear();
}
}
}
t2.join();
return 0;
}
I must be missing something fundamental here. Ideally the solution would just lock down world.m_grid[i][j] (leaving the rest of world.m_grid accessible to main()), which I assume would involve including a mutex in the class "Tile", but I run into the same problem as described here: Why does std::mutex create a C2248 when used in a struct with WIndows SOCKET? and have been unable to adapt the solution described to my project, so it would be extra helpful if someone was able to help me out there too.
-Thank you for your time.
[edit] spelling
You also need to lock the mutex in your main function when you access the array:
...
for (int j = 0; j < world.m_columns; ++j) {
lock_guard<mutex> guard(barrier);
int random = rand() % 10;
if (world.m_grid[i][j].vec.empty() && random < 3) world.m_grid[i][j].vec.push_back(1);
else if (!world.m_grid[i][j].vec.empty() && random < 3) world.m_grid[i][j].vec.clear();
}
...
With mutexes you need to protect all accesses to your data. So far in your code, thread 2 locks the mutex when it accesses the data. However, the main thread does not know anything about the mutex, so it can simply change the data at any time. The empty() check in update() also has to happen after the lock is taken; otherwise main() can still clear the vector between the check and the write.
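A minimal sketch of that rule, reduced to a single tile's vector: both threads take the same mutex, and the empty() check happens under the same lock as the write that depends on it. The names grid_mutex (standing in for the question's barrier), bump_first and fill_or_clear are hypothetical and only serve the illustration:

```cpp
#include <mutex>
#include <vector>

std::mutex grid_mutex;  // plays the role of `barrier` in the question

// Work done by the update thread for one tile. The check and the write
// share one lock, so the vector cannot be cleared in between.
bool bump_first(std::vector<int>& vec) {
    std::lock_guard<std::mutex> guard(grid_mutex);
    if (vec.empty()) {
        return false;
    }
    vec[0] += 5;
    return true;
}

// Work done by main() for one tile. It takes the same mutex, so it can
// never run while bump_first() is between its check and its write.
void fill_or_clear(std::vector<int>& vec, int random) {
    std::lock_guard<std::mutex> guard(grid_mutex);
    if (random >= 3) {
        return;
    }
    if (vec.empty()) {
        vec.push_back(1);
    } else {
        vec.clear();
    }
}
```

A single global mutex serializes every tile access; that is the simplest correct arrangement, not the fastest one.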
The problem you are having is that you are using so-called client-side synchronization. In other words: you have several threads, and before each of them reads or writes the shared resource, it has to lock the barrier mutex itself. As tune2fs has already replied, you have to add lock_guard<mutex> guard(barrier) before the access in the main thread.
That said, it would be much better for you to implement server-side synchronization. In other words, every block of code (if it is more than one line, as in your main thread, you need to send all of those lines together) would be synchronized by the server (World).
Currently I could propose using a method void modify(std::function<void(vector<int>&)> mutator); in World, so that you send all your logic through this method as lambdas (the easiest option). Inside modify you would use the standard lock_guard with a mutex owned by World. This solution is much more scalable and safe (you really don't want to have to look at all the places where you invoke code that modifies the vector).
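A rough sketch of that server-side idea, under a few assumptions that are not in the original posts: the grid is simplified to hold the vectors directly instead of Tile objects, modify takes the row and column of the tile to operate on, and the mutex is a private member named m_mutex:

```cpp
#include <functional>
#include <mutex>
#include <vector>

class World {
public:
    World(int rows, int columns)
        : m_grid(rows, std::vector<std::vector<int>>(columns)) {}

    // Runs `mutator` on one tile's vector while holding the World's mutex.
    // Callers never touch the mutex themselves. The row/col parameters are
    // an assumption added for this sketch.
    void modify(int row, int col,
                const std::function<void(std::vector<int>&)>& mutator) {
        std::lock_guard<std::mutex> guard(m_mutex);
        mutator(m_grid[row][col]);
    }

private:
    std::mutex m_mutex;  // owned by World; makes World non-copyable
    std::vector<std::vector<std::vector<int>>> m_grid;
};
```

The update thread would then call something like world.modify(i, j, [](std::vector<int>& v) { if (!v.empty()) v[0] += 5; }); and main() would send its push_back/clear logic through the same method. Because World now contains a std::mutex it can no longer be copied, so it has to be passed by reference, as the question already does with ref(world).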
I've got a plugin system in my project (running on linux), and part of this is that plugins have a "run" method such as:
void run(int argc, char* argv[]);
I'm calling my plugin and go to check my argv array (after doing a bunch of other stuff), and
the array is corrupted. I can print the values out at the top of the function, and they're correct, but not later on in the execution. Clearly something is corrupting the heap, but
I'm at a loss of how I can try to pin down exactly what's overwriting that memory. Valgrind hasn't helped me out much.
Sample code by request:
My plugin looks something like this:
void test_fileio::run(int argc, char* argv[]) {
bool all_passed = true;
// Prints out correctly.
for (int ii=0; ii < argc; ii++) {
printf("Arg[%i]: %s\n", ii, argv[ii]);
}
<bunch of tests snipped for brevity>
// Prints out incorrectly.
for (int ii=0; ii < argc; ii++) {
printf("Arg[%i]: %s\n", ii, argv[ii]);
}
}
This is linked into a system that exposes it to python so I can call these plugins as python functions. So I take a string parameter to my python function and break that out thusly:
char** translate_arguments(string args, int& argc) {
int counter = 0;
vector<char*> str_vec;
// Copy argument string to get rid of const modifier
char arg_str[MAX_ARG_LEN];
strcpy(arg_str, args.c_str());
// Tokenize the string, splitting on spaces
char* token = strtok(arg_str, " ");
while (token) {
counter++;
str_vec.push_back(token);
token = strtok(NULL, " ");
}
// Allocate array
char** to_return = new char*[counter];
for (int ii=0; ii < counter; ii++)
to_return[ii] = str_vec[ii];
// Save arg count and return
argc = counter;
return to_return;
}
The resulting argc and argv is then passed to the plugin mentioned above.
How does translate_arguments get called? That is missing...
Does it prepare an array of pointers to chars before calling the run function in the plugin, since the run function has parameter char *argv[]?
This looks like the line that is causing trouble...judging by the code
// Allocate array
char** to_return = new char*[counter];
You are intending to allocate a pointer to pointers to chars, a double pointer, but it looks like the precedence in the code is a bit mixed up?
Have you tried it this way:
char** to_return = new (char *)[counter];
Also, in your for loop as shown...you are not allocating space for the string itself contained in the vector...?
for (int ii=0; ii < counter; ii++)
to_return[ii] = str_vec[ii];
// Should it be this way...???
for (int ii=0; ii < counter; ii++)
to_return[ii] = strdup(str_vec[ii]);
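The reason copying matters here: in translate_arguments the tokens returned by strtok point into arg_str, a local array that stops existing when the function returns, so the pointers stored in to_return dangle once the caller receives them. A minimal sketch of a variant that gives every token its own heap copy follows; translate_arguments_copy and free_arguments are hypothetical names, the trailing null entry is an added convenience, and splitting uses stream extraction (any whitespace) rather than strtok on single spaces:

```cpp
#include <cstddef>
#include <cstring>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical variant of translate_arguments: each returned pointer owns
// its own heap buffer, so nothing refers to a local array after return.
char** translate_arguments_copy(const std::string& args, int& argc) {
    std::istringstream stream(args);
    std::vector<std::string> tokens;
    std::string token;
    while (stream >> token) {
        tokens.push_back(token);
    }

    // One extra slot for a terminating null pointer, as with a real argv.
    char** to_return = new char*[tokens.size() + 1];
    for (std::size_t ii = 0; ii < tokens.size(); ++ii) {
        to_return[ii] = new char[tokens[ii].size() + 1];
        std::strcpy(to_return[ii], tokens[ii].c_str());
    }
    to_return[tokens.size()] = nullptr;

    argc = static_cast<int>(tokens.size());
    return to_return;
}

// Hypothetical counterpart that releases everything allocated above.
void free_arguments(char** argv, int argc) {
    for (int ii = 0; ii < argc; ++ii) {
        delete[] argv[ii];
    }
    delete[] argv;
}
```

The caller would pass the result to run() and call free_arguments afterwards; with strdup, as in the snippet above, the matching release for each string would be free() instead of delete[].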
At the risk of getting downvoted as the OP did not show how the translate_arguments is called and lacking further information....and misjudging if my answer is incorrect...
Hope this helps,
Best regards,
Tom.
Look up how to use memory access breakpoints with your debugger. If you have a solid repro, this will pinpoint your problem in seconds. In windbg, it's:
ba w4 0x<address>
Where ba stands for "break on access", "w4" is "write 4 bytes" (use w8 on a 64 bit system) and "address" is obviously the address you're seeing corrupted. gdb and Visual Studio have similar capabilities.
If Valgrind and code inspection don't help, you could try Electric Fence.
I have a class in SystemC with some data members such as:
long double x[8];
I'm initializing it in the constructor like this:
for (i = 0; i < 8; ++i) {
x[i] = 0;
}
But the first time I use it in my code I have garbage there.
Because of the way the system is built I can't connect a debugger easily. Are there any methods to set a data breakpoint in the code so that it tells me where in the code the variables were actually changed, but without hooking up a debugger?
Edit:
@Prakash:
Actually, this is a typo in the question, but not in my code... Thanks!
You could try starting a second thread which spins, looking for changes in the variable:
#include <pthread.h>
void *ThreadProc(void *arg)
{
    volatile long double *x = (volatile long double *)arg;
    while(1)
    {
        for(int i = 0; i < 8; i++)
        {
            if(x[i] != 0)
            {
                __asm__ __volatile__ ("int $3"); // breakpoint (x86, AT&T syntax)
            }
        }
    }
    return 0; // Never reached, but placate the compiler
}
...
pthread_t threadID;
pthread_create(&threadID, NULL, ThreadProc, &x[0]);
This will raise a SIGTRAP signal to your application whenever any of the x values is not zero.
Just use printk/syslog.
It's old-fashioned, but super duper easy.
Sure, it will be garbage!
The code should have been as
for (i = 0; i < 8; ++i) {
x[i] = 0;
}
EDIT: Oops, Sorry for underestimating ;)
@Frank
Actually, that lets me log debug prints to a file. What I'm looking for is something that will let me print something whenever a variable changes, without me explicitly looking for the variable.
How about conditional breakpoints? You could try various conditions, such as whether the first element's value is zero or non-zero.
That's assuming I can easily connect a debugger. The whole point is that I only have a library, but the executable that linked it in isn't readily available.