Increase the number of concurrent processes in WEKA's Apriori

I am currently using the Apriori algorithm in one of our projects via WEKA. Is it possible to run it with multiple concurrent processes? If so, how?

The bottleneck with Apriori should never be the CPU.
In the end, it is just counting, which is not very CPU intensive.
Its bottlenecks are supposedly IO and memory.
Why not try another algorithm? Depending on your data, Eclat or FP-growth may be substantially faster.
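To give a feel for what Eclat does differently, here is a minimal pure-Python sketch. It is not WEKA's implementation and not tuned for speed; the function name `eclat` and the toy transactions are made up for illustration. The idea is to store, for each item, the set of transaction ids that contain it, so that the support of a larger itemset comes from a set intersection instead of another pass over the data.

```python
def eclat(transactions, min_support):
    """Return {frozenset(itemset): support count} for all frequent itemsets.

    Illustrative sketch only: min_support is an absolute count, not a ratio.
    """
    # Vertical layout: item -> ids of the transactions that contain it.
    tidsets = {}
    for tid, items in enumerate(transactions):
        for item in items:
            tidsets.setdefault(item, set()).add(tid)

    frequent = {}

    def extend(prefix, candidates):
        # Every (item, tids) pair in candidates is frequent together with prefix.
        for i, (item, tids) in enumerate(candidates):
            itemset = prefix | {item}
            frequent[itemset] = len(tids)
            # Combine only with later items so each itemset is visited once.
            narrower = []
            for other, other_tids in candidates[i + 1:]:
                common = tids & other_tids
                if len(common) >= min_support:
                    narrower.append((other, common))
            if narrower:
                extend(itemset, narrower)

    start = sorted(
        ((item, tids) for item, tids in tidsets.items() if len(tids) >= min_support),
        key=lambda pair: pair[0],
    )
    extend(frozenset(), start)
    return frequent


# Toy data, invented for this example.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
result = eclat(baskets, min_support=2)
```

Whether this beats Apriori on your data depends on how dense the transactions are, so treat it as something to benchmark rather than a guaranteed win.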

Related

Is it possible to engage multiple cores (like gcc -j8) when solving with Pyomo?

The power flow library PyPSA uses Pyomo. I am trying to reduce the cost of each linear optimal power flow simulation.
I read through the Pyomo docs. Nothing sticks out at me yet. Perhaps it is not possible to split up the processing when solving linear optimisation problems.
Ubuntu 19.04, i5-4210U 1.70 GHz, 8 GB RAM
When you talk about processing, there are two things to consider: the processing to write the .lp file, and the processing to solve the problem with an optimization solver.
First, writing the .lp file is, to my knowledge, not yet parallelized in Pyomo. The PyPSA developers created Linopy to parallelize this step, reducing RAM requirements and increasing speed.
Second, parallelizing the solver processing depends on the solver. PyPSA-Eur has an example of such an integration for Gurobi and CPLEX. The performant open-source solver HiGHS can do something similar.

What is the expected speed-up of using parallelization in C++ (not OpenMp, but <thread>)

What is the expected theoretical speed-up of using parallelization in C++?
For example, say I have 2 cores, and 4 logical processors. If I use a fully optimized parallel program to execute some tasks for me using 4 threads working at maximum capacity, how much of a speed-up over the serial code can I expect? Twice as fast? Four times as fast?
Please provide a reference for your answer.
And please do not close this question as being too broad or not containing a code sample. Providing a code sample would defeat the purpose of the question, since I am in search of a general, theoretical answer that might be used in a sales pitch for parallel computing. I am NOT wondering about the particular efficiency of some particular piece of code.
There is no limit imposed by using <thread>. It creates OS threads, so it can scale linearly with the number of cores you have.
As for real cores vs. logical processors (Hyper-Threading, SMT), you might find https://superuser.com/a/279803/112292 interesting. There are also lots of other benchmarks out there.
SMT generally helps when it can hide memory latency. So the speed-up you gain from SMT depends entirely on your application (is it compute heavy or memory heavy?), and the only way to find out is to benchmark.
There is no specific number.
More practically, there is nothing in std::thread that has to impede linear scaling. And that translates to the real world. Using dozens of CPU cores is trivial with std::thread.
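If you want a formula for the theoretical side, the usual textbook bound is Amdahl's law. It is not cited in the answer above, so take this as an added illustration: with a fraction p of the work parallelizable and n workers, the speed-up is at most 1 / ((1 - p) + p / n). A small Python sketch (the function name is made up):

```python
def amdahl_speedup(parallel_fraction, workers):
    """Upper bound on speed-up according to Amdahl's law.

    parallel_fraction: share of the serial run time that can run in parallel (0..1).
    workers: number of equally fast execution units.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_part = 1.0 - parallel_fraction
    return 1.0 / (serial_part + parallel_fraction / workers)


# Example: 90% of the work parallelizes, on 2 and on 4 workers.
two = amdahl_speedup(0.9, 2)
four = amdahl_speedup(0.9, 4)
print(f"2 workers: {two:.2f}x, 4 workers: {four:.2f}x")
```

Note that the formula assumes every worker is a full core. Four logical processors on two physical cores will usually land somewhere between the 2-worker and 4-worker figures, which is why benchmarking is still needed.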

How can the optimization time of a Support Vector Machine with huge Data Set be reduced?

I have 30k samples to train my Support Vector Machine and also do Cross Validation which adds to the computational expense. Are there any techniques or rules of thumb to reduce computational costs, e.g. partitioning in a certain way or training only on a specific subset of the 30k samples?
My code is already fast but depending on the choice of kernel parameter, it takes a considerable amount of time to perform the optimization. How can I speed this up?

How to fill thrust vector in a parallel way? [duplicate]

I would like to search for a given string in multiple files in parallel using CUDA. I plan to use the PFAC library to search for the string. The problem is how to access multiple files in parallel.
Example: We have a folder containing thousands of files which have to be searched.
The problem here is how I should access multiple files in the given folder. The files in the folder should be obtained dynamically, and each thread should be assigned a file in which to search for the given string.
Is it possible?
Edit:
In this post: very fast text file processing (C++), the author uses the Boost library to read a 3 GB text file in 16 seconds, while in my case I have to read thousands of smaller files.
Thank you
Doing your task in CUDA will not help much over doing the same thing on the CPU.
Assuming that your files are stored on a standard magnetic HDD, a typical single-threaded CPU program would spend:
1. About 5 ms to find the sector where the file is stored and put it under the reading head.
2. About 10 ms to load a 1 MB file (assuming 100 MB/s read speed) into RAM.
3. Less than 0.1 ms to load 1 MB of data from RAM into the CPU cache and process it using a linear search algorithm.
That is 15.1 ms for a single file. If you have 1000 files, it will take 15.1 s to do the work.
Now, if I give you a super-powerful GPU with infinite memory bandwidth, no latency, and infinite processor speed, you will be able to perform task (3) in no time. However, the HDD reads will still take exactly the same time. A GPU cannot parallelise the work of another, independent device.
As a result, instead of spending 15.1 s, you will now do it in 15.0 s.
The infinite GPU would give you a speed-up of roughly 0.7%. A real GPU would not even come close to that!
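The arithmetic above is easy to replay with your own numbers. A small Python sketch, using the figures from this estimate (5 ms seek, 10 ms read, 0.1 ms search per 1 MB file); the constant and function names are made up for the example:

```python
# Per-file costs in milliseconds, taken from the estimate above.
SEEK_MS = 5.0     # move the head to the file
READ_MS = 10.0    # read 1 MB at 100 MB/s
SEARCH_MS = 0.1   # upper bound for the in-memory linear search


def total_seconds(n_files, search_ms=SEARCH_MS):
    """Total wall time for reading and searching n_files one after another."""
    return n_files * (SEEK_MS + READ_MS + search_ms) / 1000.0


cpu_time = total_seconds(1000)
ideal_gpu_time = total_seconds(1000, search_ms=0.0)  # search costs nothing at all
saving = (cpu_time - ideal_gpu_time) / cpu_time

print(f"CPU: {cpu_time:.1f} s, ideal GPU: {ideal_gpu_time:.1f} s, saved: {saving:.2%}")
```

Swap in SSD-like numbers for the seek and read costs and the picture changes, which is exactly why it is worth checking where the time goes before reaching for CUDA.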
In the more general case: if you are considering CUDA, ask yourself whether the actual computation is the bottleneck of the problem.
If yes - continue searching for possible solutions in the CUDA world.
If no - CUDA cannot help you.
If you deal with thousands of tiny files and need to read them often, consider techniques that "attack" your actual bottleneck. Some of these include:
RAM buffering
Putting your hard drives in a RAID configuration
Getting an SSD
There may be more options; I am not an expert in that area.
Yes, it's probably possible to get a speed-up using CUDA if you can reduce the impact of read latency/bandwidth. One way would be to perform multiple searches concurrently, i.e. if you can search for [needle1], ..., [needle1000] in your large haystack, then each thread could search pieces of the haystack and store the hits. Some analysis of the throughput required per comparison is needed to determine whether your search is likely to be improved by employing CUDA. This may be useful: http://dl.acm.org/citation.cfm?id=1855600

building a web crawler

I'm currently developing a custom search engine with a built-in web crawler. For some reason I'm not into multi-threading, so my indexer has so far been coded in a single-threaded manner. Now I have a small dilemma with the crawler I'm building. Can anybody suggest which is better: crawl 1 page and then index it, or crawl 1000+ pages, cache them, and then index?
Networks are slow (relative to the CPU). You will see a significant speed increase by parallelizing your crawler. Otherwise, your app will spend the majority of its time waiting on network IO to complete. You can either use multiple threads and blocking IO or a single thread with asynchronous IO.
Also, most indexing algorithms will perform better on batches of documents versus indexing one document at a time.
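A rough Python sketch of the "multiple threads with blocking IO, index in batches" variant. Everything here is a stand-in: `fetch` only sleeps instead of doing a real HTTP request, `index_batch` is a toy inverted index, and the URLs are invented. It is meant to show the shape of the loop, not a production crawler.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch(url):
    """Hypothetical stand-in for a blocking HTTP request."""
    time.sleep(0.05)  # pretend we are waiting on the network
    return url, f"page about {url}"


def index_batch(pages, index):
    """Toy batch indexer: maps each word to the set of URLs containing it."""
    for url, body in pages:
        for word in body.split():
            index.setdefault(word, set()).add(url)


def crawl_and_index(urls, workers=8, batch_size=4):
    index = {}
    batch = []
    # The pool overlaps the network waits; indexing stays single-threaded.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for page in pool.map(fetch, urls):
            batch.append(page)
            if len(batch) >= batch_size:
                index_batch(batch, index)
                batch = []
    if batch:  # index whatever is left over
        index_batch(batch, index)
    return index


urls = [f"http://example.invalid/{i}" for i in range(10)]
index = crawl_and_index(urls)
```

The batch size is a knob between the two extremes in the question: 1 gives "crawl a page, index it", a large value gives "crawl many, then index".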
Better? In terms of what? In terms of speed, I can't foresee a noticeable difference. In terms of robustness (recovering from a catastrophic failure), it's probably better to index each page as you crawl it.
I would strongly suggest getting "into" multi-threading if you are serious about your crawler. Basically, you would want to have at least one indexer and at least one crawler (potentially multitudes of both) running at all times. Among other things, this minimizes start-up and shutdown overhead (e.g. initializing and freeing data structures).
Not using threads is OK.
However, if you still want performance, you need to deal with asynchronous IO.
I would recommend checking out Boost.Asio. Using asynchronous IO will make your dilemma irrelevant, as it would no longer matter. As a bonus, if you do decide to use threads in the future, it's trivial to tell Boost.Asio to apply multiple threads to the problem.
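To show the single-threaded asynchronous idea without pulling in Boost, here is a sketch using Python's asyncio instead of Boost.Asio. It is the same pattern in a different library, not Boost.Asio code. `fetch_async` just sleeps in place of a real socket read, and the names and URLs are invented for the example.

```python
import asyncio


async def fetch_async(url, delay=0.05):
    """Hypothetical stand-in for a non-blocking HTTP request."""
    await asyncio.sleep(delay)  # yields to the event loop while "waiting"
    return url, f"page about {url}"


async def crawl_async(urls, max_in_flight=20):
    # Cap the number of simultaneous requests so we do not flood the network.
    limit = asyncio.Semaphore(max_in_flight)

    async def bounded(url):
        async with limit:
            return await fetch_async(url)

    # gather returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded(u) for u in urls))


urls = [f"http://example.invalid/{i}" for i in range(10)]
pages = asyncio.run(crawl_async(urls))
```

All the waits overlap on one thread, so the crawler is no longer stuck on one page at a time and the indexer can consume pages as they arrive.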