Multi-thread crawler doesn't speed up with threading (on local files)

Multi-thread crawler doesn't speed up with threading (on local files) - c++

I have a task - to write a multithreaded webcrawler (actually I have a local set.html files that I need to parse and move to another directory). The main condition for this task is to make it possible to enter an arbitrary number of threads and determine at what number the program will stop adding in performance.
#include <iostream>
#include <fstream>
#include <thread>
#include <mutex>
#include <queue>
#include <ctime>
#include <set>
#include <chrono>
#include <atomic>
using namespace std;
class WebCrawler{
private:
const string start_path = "/";
const string end_path = "/";
int thread_counts;
string home_page;
queue<string> to_visit;
set<string> visited;
vector<thread> threads;
mutex mt1;
int count;
public:
WebCrawler(int thread_counts_, string root_)
:thread_counts(thread_counts_), home_page(root_) {
to_visit.push(root_);
visited.insert(root_);
count = 0;
}
void crawler(){
for(int i = 0; i<thread_counts; i++)
threads.push_back(thread(&WebCrawler::start_crawl, this));
for(auto &th: threads)
th.join();
cout<<"Count: "<<count<<endl;
}
void parse_html(string page_){
cout<<"Thread: "<<this_thread::get_id()<<" page: "<<page_<< endl;
ifstream page;
page.open(start_path+page_, ios::in);
string tmp;
getline(page, tmp);
page.close();
for(int i = 0; i<tmp.size(); i++){
if( tmp[i] == '<'){
string tmp_num ="";
while(tmp[i]!= '>'){
if(isdigit(tmp[i]))
tmp_num+=tmp[i];
i++;
}
tmp_num+= ".html";
if((visited.find(tmp_num) == visited.end())){
mt1.lock();
to_visit.push(tmp_num);
visited.insert(tmp_num);
mt1.unlock();
}
}
}
}
void move(string page_){
mt1.lock();
count++;
ofstream page;
page.open(end_path+page_, ios::out);
page.close();
mt1.unlock();
}
void start_crawl(){
cout<<"Thread started: "<<this_thread::get_id()<< endl;
string page;
while(!to_visit.empty()){
mt1.lock();
page = to_visit.front();
to_visit.pop();
mt1.unlock();
parse_html(page);
move(page);
}
}
};
int main(int argc, char const *argv\[])
{
int start_time = clock();
WebCrawler crawler(7, "0.html");
crawler.crawler();
int end_time = clock();
cout<<"Time: "<<(float)(end_time -start_time)/CLOCKS_PER_SEC<<endl;
cout<<thread::hardware_concurrency()<<endl;
return 0;
}
1 thread = Time: 0.709504
2 thread = Time: 0.668037
4 thread = Time: 0.762967
7 thread = Time: 0.781821
I've been trying to figure out for a week why my program is running slower even on two threads. I probably don't fully understand how mutex works, or perhaps the speed is lost during the joining of threads. Do you have any ideas how to fix it?

There are many ways to protect things in multithreading, implicit or explicit.
In addition to the totally untested code, there are also some implicit assumptions, for example of that int is large enough for your task, that must be considered.
Lets make a short analysis of what is needing protection.
Variables that are accessed from multiple threads
things that are const can be excluded
unless you const cast them
part of them are mutable
global objects like files or cout
could be overwritten
written from multiple threads
streams have their own internal locks
so you can write to a stream from multiple threads to cout
but you don't want it for the files in this case.
if multiple threads want to open the same file, you will get an error.
std::endl forces an synchronization, so change it to "\n" like a commenter noted.
So this boils down to:
queue<string> to_visit;
set<string> visited; // should be renamed visiting
int count;
<streams and files>
count is easy
std::atomic<int> count;
The files are implicit protected by your visited/visiting check, so they are good too. So the mutex in move can be removed.
The remaining needs an mutex each as they could be independently updated.
mutex mutTovisit, // formerly known as mut1.
mutVisiting.
Now we have the problem that we could deadlock with two mutexes, if we try to lock in different order in two places. You need to read up on all the lock stuff if you add more locks, scoped_lock and lock are good places to start.
Changing the code to
{
scoped_lock visitLock(mutVisiting); // unlocks at end of } even if stuff throws
if((visited.find(tmp_num) == visited.end())){
scoped_lock toLock(mutTo);
to_visit.push(tmp_num);
visited.insert(tmp_num);
}
}
And in this code there are multiple errors, that are hidden by the not thread safe access to to_visit and the randomness of the thread starts.
while(!to_visit.empty()){ // 2. now the next thread starts and sees its empty and stops
// 3. or worse it starts then hang at lock
mt1.lock();
page = to_visit.front(); // 4. and does things that are not good here with an empty to_visit
to_visit.pop(); // 1. is now empty after reading root
mt1.unlock();
parse_html(page);
move(page);
}
To solve this you need an (atomic?) counter, found(Pages) of current known unvisited pages so we know if are done. Then to start threads when there is new work that needs to be done we can use std::condition_variable(_any)
The general idea of the plan is to have the threads wait until work is available, then each time a new page is discovered notify_one to start work.
To Startup, set the found to 1 and notify_one once the threads have started, when a thread is done with the work decrease found. To stop when found is zero, the thread that decrease it to zero notify_all so they all can stop.
What you will find is that if the data is on a single slow disk, it is unlikely you will see much effect from more than 2 threads, if all files are currently cached in ram, you might see more effect.

I think there's a bottle neck on your move function. Each thread takes the same amount of time to go through that function. You could start with that

Related

C++ Waiting for one of several threads to finish and get that threads return value

I want to parallelize the execution of a randomized algorithm in the following way: I have a number of threads which execute the same randomized operations in a loop and return in case of success. I want to start multiple threads and return once at least one of them stops (returns a value). As a minimum example, consider the following code snippet:
#include <iostream>
#include <stdlib.h> /* srand, rand */
#include <future>
#include <vector>
int random_algorithm(){
while(true) {
int random_number = rand() % 10 + 1;
if (random_number > 5){
return random_number;
}
}
}
int main(){
std::vector<std::future<int>> thread_vec;
for(int i=0;i<5;++i){
std::future<int> t = std::async(std::launch::async, random_algorithm);
thread_vec.push_back(std::move(t));
}
**//Instead of the following loop, I want to**
**//continue execution as soon as one of the threads returned.**
for(auto& th: thread_vec){
th.wait();
std::cout << "thread returned " << th.get() << std::endl;
}
return 0;
}
Basically, instead of calling th.wait() on every thread, I just want to wait here until one of the threads in thread_vec has finished its work and then get that threads return value. How would I achieve this?
Note: I saw this question, but this does not seem to reveal which of the threads finished its work.

Ok, let's start discussing your code:
rand() is not re-entrant safe. You must never use it in multiple threads concurrently. Also, it's typically a really bad random number generator.
You're using C++11 or later, so use std::random instead.
Solving your problem: instead of waiting on a future, you should simply share a condition variable with all threads, and the first thread to notify the variable and thus the main thread ends the computation.
Result returning can be implemented through atomic variables, for example (std::atomic).

Which types of memory_order should be used for non-blocking behaviour with an atomic_flag?

I'd like, instead of having my threads wait, doing nothing, for other threads to finish using data, to do something else in the meantime (like checking for input, or re-rendering the previous frame in the queue, and then returning to check to see if the other thread is done with its task).
I think this code that I've written does that, and it "seems" to work in the tests I've performed, but I don't really understand how std::memory_order_acquire and std::memory_order_clear work exactly, so I'd like some expert advice on if I'm using those correctly to achieve the behaviour I want.
Also, I've never seen multithreading done this way before, which makes me a bit worried. Are there good reasons not to have a thread do other tasks instead of waiting?
/*test program
intended to test if atomic flags can be used to perform other tasks while shared
data is in use, instead of blocking
each thread enters the flag protected part of the loop 20 times before quitting
if the flag indicates that the if block is already in use, the thread is intended to
execute the code in the else block (only up to 5 times to avoid cluttering the output)
debug note: this doesn't work with std::cout because all the threads are using it at once
and it's not thread safe so it all gets garbled. at least it didn't crash
real world usage
one thread renders and draws to the screen, while the other checks for input and
provides frameData for the renderer to use. neither thread should ever block*/
#include <fstream>
#include <atomic>
#include <thread>
#include <string>
struct ThreadData {
int numTimesToWriteToDebugIfBlockFile;
int numTimesToWriteToDebugElseBlockFile;
};
class SharedData {
public:
SharedData() {
threadData = new ThreadData[10];
for (int a = 0; a < 10; ++a) {
threadData[a] = { 20, 5 };
}
flag.clear();
}
~SharedData() {
delete[] threadData;
}
void runThread(int threadID) {
while (this->threadData[threadID].numTimesToWriteToDebugIfBlockFile > 0) {
if (this->flag.test_and_set(std::memory_order_acquire)) {
std::string fileName = "debugIfBlockOutputThread#";
fileName += std::to_string(threadID);
fileName += ".txt";
std::ofstream writeFile(fileName.c_str(), std::ios::app);
writeFile << threadID << ", running, output #" << this->threadData[threadID].numTimesToWriteToDebugIfBlockFile << std::endl;
writeFile.close();
writeFile.clear();
this->threadData[threadID].numTimesToWriteToDebugIfBlockFile -= 1;
this->flag.clear(std::memory_order_release);
}
else {
if (this->threadData[threadID].numTimesToWriteToDebugElseBlockFile > 0) {
std::string fileName = "debugElseBlockOutputThread#";
fileName += std::to_string(threadID);
fileName += ".txt";
std::ofstream writeFile(fileName.c_str(), std::ios::app);
writeFile << threadID << ", standing by, output #" << this->threadData[threadID].numTimesToWriteToDebugElseBlockFile << std::endl;
writeFile.close();
writeFile.clear();
this->threadData[threadID].numTimesToWriteToDebugElseBlockFile -= 1;
}
}
}
}
private:
ThreadData* threadData;
std::atomic_flag flag;
};
void runThread(int threadID, SharedData* sharedData) {
sharedData->runThread(threadID);
}
int main() {
SharedData sharedData;
std::thread thread[10];
for (int a = 0; a < 10; ++a) {
thread[a] = std::thread(runThread, a, &sharedData);
}
thread[0].join();
thread[1].join();
thread[2].join();
thread[3].join();
thread[4].join();
thread[5].join();
thread[6].join();
thread[7].join();
thread[8].join();
thread[9].join();
return 0;
}```

The memory ordering you're using here is correct.
The acquire memory order when you test and set your flag (to take your hand-written lock) has the effect, informally speaking, of preventing any memory accesses of the following code from becoming visible before the flag is tested. That's what you want, because you want to ensure that those accesses are effectively not done if the flag was already set. Likewise, the release order on the clear at the end prevents any of the preceding accesses from becoming visible after the clear, which is also what you need so that they only happen while the lock is held.
However, it's probably simpler to just use a std::mutex. If you don't want to wait to take the lock, but instead do something else if you can't, that's what try_lock is for.
class SharedData {
// ...
private:
std::mutex my_lock;
}
// ...
if (my_lock.try_lock()) {
// lock was taken, proceed with critical section
my_lock.unlock();
} else {
// lock not taken, do non-critical work
}
This may have a bit more overhead, but avoids the need to think about atomicity and memory ordering. It also gives you the option to easily do a blocking wait if that later becomes useful. If you've designed your program around an atomic_flag and later find a situation where you must wait to take the lock, you may find yourself stuck with either spinning while continually retrying the lock (which is wasteful of CPU cycles), or something like std::this_thread::yield(), which may wait for longer than necessary after the lock is available.
It's true this pattern is somewhat unusual. If there is always non-critical work to be done that doesn't need the lock, commonly you'd design your program to have a separate thread that just does the non-critical work continuously, and then the "critical" thread can just block as it waits for the lock.

How to use std::condition_variable in a loop

I'm trying to implement some algorithm using threads that must be synchronized at some moment. More or less the sequence for each thread should be:
1. Try to find a solution with current settings.
2. Synchronize solution with other threads.
3. If any of the threads found solution end work.
4. (empty - to be inline with example below)
5. Modify parameters for algorithm and jump to 1.
Here is a toy example with algorithm changed to just random number generation - all threads should end if at least one of them will find 0.
#include <iostream>
#include <condition_variable>
#include <thread>
#include <vector>
const int numOfThreads = 8;
std::condition_variable cv1, cv2;
std::mutex m1, m2;
int lockCnt1 = 0;
int lockCnt2 = 0;
int solutionCnt = 0;
void workerThread()
{
while(true) {
// 1. do some important work
int r = rand() % 1000;
// 2. synchronize and get results from all threads
{
std::unique_lock<std::mutex> l1(m1);
++lockCnt1;
if (r == 0) ++solutionCnt; // gather solutions
if (lockCnt1 == numOfThreads) {
// last thread ends here
lockCnt2 = 0;
cv1.notify_all();
}
else {
cv1.wait(l1, [&] { return lockCnt1 == numOfThreads; });
}
}
// 3. if solution found then quit all threads
if (solutionCnt > 0) return;
// 4. if not, then set lockCnt1 to 0 to have section 2. working again
{
std::unique_lock<std::mutex> l2(m2);
++lockCnt2;
if (lockCnt2 == numOfThreads) {
// last thread ends here
lockCnt1 = 0;
cv2.notify_all();
}
else {
cv2.wait(l2, [&] { return lockCnt2 == numOfThreads; });
}
}
// 5. Setup new algorithm parameters and repeat.
}
}
int main()
{
srand(time(NULL));
std::vector<std::thread> v;
for (int i = 0; i < numOfThreads ; ++i) v.emplace_back(std::thread(workerThread));
for (int i = 0; i < numOfThreads ; ++i) v[i].join();
return 0;
}
The questions I have are about sections 2. and 4. from code above.
A) In a section 2 there is synchronization of all threads and gathering solutions (if found). All is done using lockCnt1 variable. Comparing to single use of condition_variable I found it hard how to set lockCnt1 to zero safely, to be able to reuse this section (2.) next time. Because of that I introduced section 4. Is there better way to do that (without introducing section 4.)?
B) It seems that all examples shows using condition_variable rather in context of 'producer-consumer' scenario. Is there better way to synchronization all threads in case where all are 'producers'?
Edit: Just to be clear, I didn't want to describe algorithm details since this is not important here - anyway this is necessary to have all solution(s) or none from given loop execution and mixing them is not allowed. Described sequence of execution must be followed and the question is how to have such synchronization between threads.

A) You could just not reset the lockCnt1 to 0, just keep incrementing it further. The condition lockCnt2 == numOfThreads then changes to lockCnt2 % numOfThreads == 0. You can then drop the block #4. In future you could also use std::experimental::barrier to get the threads to meet.
B) I would suggest using std::atomic for solutionCnt and then you can drop all other counters, the mutex and the condition variable. Just atomically increase it by one in the thread that found solution and then return. In all threads after every iteration check if the value is bigger than zero. If it is, then return. The advantage is that the threads do not have to meet regularly, but can try to solve it at their own pace.

Out of curiosity, I tried to solve your problem using std::async. For every attempt to find a solution, we call async. Once all parallel attempts have finished, we process feedback, adjust parameters, and repeat. An important difference with your implementation is that feedback is processed in the calling (main) thread. If processing feedback takes too long — or if we don't want to block the main thread at all — then the code in main() can be adjusted to also call std::async.
The code is supposed to be quite efficient, provided that the implementation of async uses a thread pool (e. g. Microsoft's implementation does that).
#include <chrono>
#include <future>
#include <iostream>
#include <vector>
const int numOfThreads = 8;
struct Parameters{};
struct Feedback {
int result;
};
Feedback doTheWork(const Parameters &){
// do the work and provide result and feedback for future runs
return Feedback{rand() % 1000};
}
bool isSolution(const Feedback &f){
return f.result == 0;
}
// Runs doTheWork in parallel. Number of parallel tasks is same as size of params vector
std::vector<Feedback> findSolutions(const std::vector<Parameters> &params){
// 1. Run async tasks to find solutions. Normally threads are not created each time but re-used from a pool
std::vector<std::future<Feedback>> futures;
for (auto &p: params){
futures.push_back(std::async(std::launch::async,
[&p](){ return doTheWork(p); }));
}
// 2. Syncrhonize: wait for all tasks
std::vector<Feedback> feedback(futures.size());
for (auto nofRunning = futures.size(), iFuture = size_t{0}; nofRunning > 0; ){
// Check if the task has finished (future is invalid if we already handled it during an earlier iteration)
auto &future = futures[iFuture];
if (future.valid() && future.wait_for(std::chrono::milliseconds(1)) != std::future_status::timeout){
// Collect feedback for next attempt
// Alternatively, we could already check if solution has been found and cancel other tasks [if our algorithm supports cancellation]
feedback[iFuture] = std::move(future.get());
--nofRunning;
}
if (++iFuture == futures.size())
iFuture = 0;
}
return feedback;
}
int main()
{
srand(time(NULL));
std::vector<Parameters> params(numOfThreads);
// 0. Set inital parameter values here
// If we don't want to block the main thread while the algorithm is running, we can use std::async here too
while (true){
auto feedbackVector = findSolutions(params);
auto itSolution = std::find_if(std::begin(feedbackVector), std::end(feedbackVector), isSolution);
// 3. If any of the threads has found a solution, we stop
if (itSolution != feedbackVector.end())
break;
// 5. Use feedback to re-configure parameters for next iteration
}
return 0;
}

recursive threading with C++ gives a Resource temporarily unavailable

So I'm trying to create a program that implements a function that generates a random number (n) and based on n, creates n threads. The main thread is responsible to print the minimum and maximum of the leafs. The depth of hierarchy with the Main thread is 3.
I have written the code below:
#include <iostream>
#include <thread>
#include <time.h>
#include <string>
#include <sstream>
using namespace std;
// a structure to keep the needed information of each thread
struct ThreadInfo
{
long randomN;
int level;
bool run;
int maxOfVals;
double minOfVals;
};
// The start address (function) of the threads
void ChildWork(void* a) {
ThreadInfo* info = (ThreadInfo*)a;
// Generate random value n
srand(time(NULL));
double n=rand()%6+1;
// initialize the thread info with n value
info->randomN=n;
info->maxOfVals=n;
info->minOfVals=n;
// the depth of recursion should not be more than 3
if(info->level > 3)
{
info->run = false;
}
// Create n threads and run them
ThreadInfo* childInfo = new ThreadInfo[(int)n];
for(int i = 0; i < n; i++)
{
childInfo[i].level = info->level + 1;
childInfo[i].run = true;
std::thread tt(ChildWork, &childInfo[i]) ;
tt.detach();
}
// checks if any child threads are working
bool anyRun = true;
while(anyRun)
{
anyRun = false;
for(int i = 0; i < n; i++)
{
anyRun = anyRun || childInfo[i].run;
}
}
// once all child threads are done, we find their max and min value
double maximum=1, minimum=6;
for( int i=0;i<n;i++)
{
// cout<<childInfo[i].maxOfVals<<endl;
if(childInfo[i].maxOfVals>=maximum)
maximum=childInfo[i].maxOfVals;
if(childInfo[i].minOfVals< minimum)
minimum=childInfo[i].minOfVals;
}
info->maxOfVals=maximum;
info->minOfVals=minimum;
// we set the info->run value to false, so that the parrent thread of this thread will know that it is done
info->run = false;
}
int main()
{
ThreadInfo info;
srand(time(NULL));
double n=rand()%6+1;
cout<<"n is: "<<n<<endl;
// initializing thread info
info.randomN=n;
info.maxOfVals=n;
info.minOfVals=n;
info.level = 1;
info.run = true;
std::thread t(ChildWork, &info) ;
t.join();
while(info.run);
info.maxOfVals= max<unsigned long>(info.randomN,info.maxOfVals);
info.minOfVals= min<unsigned long>(info.randomN,info.minOfVals);
cout << "Max is: " << info.maxOfVals <<" and Min is: "<<info.minOfVals;
}
The code compiles with no error, but when I execute it, it gives me this :
libc++abi.dylib: terminating with uncaught exception of type
std::__1::system_error: thread constructor failed: Resource
temporarily unavailable Abort trap: 6

You spawn too many threads. It looks a bit like a fork() bomb. Threads are a very heavy-weight system resource. Use them sparingly.
Within the function void Childwork I see two mistakes:
As someone already pointed out in the comments, you check the info level of a thread and then you go and create some more threads regardless of the previous check.
Within the for loop that spawns your new threads, you increment the info level right before you spawn the actual thread. However you increment a freshly created instance of ThreadInfo here ThreadInfo* childInfo = new ThreadInfo[(int)n]. All instances within childInfo hold a level of 0. Basically the level of each thread you spawn is 1.
In general avoid using threads to achieve concurrency for I/O bound operations (*). Just use threads to achieve concurrency for independent CPU bound operations. As a rule of thumb you never need more threads than you have CPU cores in your system (**). Having more does not improve concurrency and does not improve performance.
(*) You should always use direct function calls and an event based system to run pseudo concurrent I/O operations. You do not need any threading to do so. For example a TCP server does not need any threads to serve thousands of clients.
(**) This is the ideal case. In practice your software is composed of multiple parts, developed by independent developers and maintained in different modes, so it is ok to have some threads which could be theoretically avoided.
Multithreading is still rocket science in 2019. Especially in C++. Do not do it unless you know exactly what you are doing. Here is a good series of blog posts that handle threads.

Progress bar in Windows activity field?

I've written a c++ program that performs time consuming calculations and i want the user to be able to see the progress while the program is running in the background (minimized).
I'd like to use the same effect as chrome uses when downloading a file:
How do i access this feature? Can i use it in my c++ program?

If the time consuming operation can be performed inside a loop, and depending on whether or not it is a count controlled loop, you may be able to use thread and atomic to solve your problem.
If your processor architecture supports multithreading you can use threads to run calculations concurrently. The basic use of a thread is to run a function in parallel with the main thread, these operations may be effectively done at the same time, meaning you would be able to use the main thread to check the progress of your time consuming calculations. With parallel threads comes the problem of data races, wherein if two threads try to access or edit the same data, they could do so incorrectly and corrupt the memory. This can be solved with atomic. You could use an atomic_int to make sure two actions are never cause a data race.
A viable example:
#include <thread>
#include <mutex>
#include <atomic>
#include <iostream>
//function prototypes
void foo(std::mutex * mtx, std::atomic_int * i);
//main function
int main() {
//first define your variables
std::thread bar;
std::mutex mtx;
std::atomic_int value;
//store initial value just in case
value.store(0);
//create the thread and assign it a task by passing a function and any parameters of the function as parameters of thread
std::thread functionalThread;
functionalThread = std::thread(foo/*function name*/, &mtx, &value/*parameters of the function*/);
//a loop to keep checking value to see if it has reached its final value
//temp variable to hold value so that operations can be performed on it while the main thread does other things
int temp = value.load();
//double to hold percent value
double percent;
while (temp < 1000000000) {
//calculate percent value
percent = 100.0 * double(temp) / 1000000000.0;
//display percent value
std::cout << "The current percent is: " << percent << "%" << std::endl;
//get new value for temp
temp = value.load();
}
//display message when calculations complete
std::cout << "Task is done." << std::endl;
//when you join a thread you are essentially waiting for the thread to finish before the calling thread continues
functionalThread.join();
//cin to hold program from completing to view results
int wait;
std::cin >> wait;
//end program
return 0;
}
void foo(std::mutex * mtx, std::atomic_int * i) {
//function counts to 1,000,000,000 as fast as it can
for (i->store(0); i->load() < 1000000000; i->store(i->load() + 1)) {
//keep i counting
//the first part is the initial value, store() sets the value of the atomic int
//the second part is the exit condition, load() returns the currently stored value of the atomic
//the third part is the increment
}
}

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Multi-thread crawler doesn't speed up with threading (on local files) - c++

I think there's a bottle neck on your move function. Each thread takes the same amount of time to go through that function. You could start with that

Related

C++ Waiting for one of several threads to finish and get that threads return value

Which types of memory_order should be used for non-blocking behaviour with an atomic_flag?

How to use std::condition_variable in a loop

recursive threading with C++ gives a Resource temporarily unavailable

Progress bar in Windows activity field?

Categories

Resources