Nodejs: parallel request, serial response - concurrency

I have already done Nodejs: How to write high performance async loop for stormjs (you can check the stormjs serial loop demo), but there is still a problem with the parallel loop. E.g., we have a function requestContext(index, callback(err, context)) which remotely fetches the context from 'http://host/post/{index}', and we need to get the contexts for indices [0-99] and push them into an array in order: [context0...context99].
But obviously this can't be done with the stormjs parallel loop.
I still want to know how noders do this task: the requests must be made in parallel, not 1 by 1. It should be parallel request and serial push.

var counter = 0;
// create an array with numbers 0 to 99.
_.range(0, 100).forEach(function(key, value, arr) {
    // for each of them request context
    requestContext(key, function(err, context) {
        // add the context to the array under the correct key
        if (!err) {
            arr[key] = context;
        }
        // increment counter, if all have finished then fire finished.
        if (++counter === 100) {
            finished(arr);
        }
    });
});

function finished(results) {
    // do stuff
}
No storm required. If you want an execution / flow control library, I would recommend Futures, because it doesn't compile your code or "hide the magic".
Previously you recursed through each request and executed them in serial order, pushing the results into the array as you went.
This time you execute them all in parallel and tell each one to assign its own value to the correct ordered key in the array.
_.range creates an array containing the values 0 to 99.

Related

TBB Parallel Pipeline seems to run in-order?

I am working on a data processing pipeline with some OpenCV code. After implementing my pipeline I found no speedup, but also no slowdown, and I am trying to investigate why.
I came up with the following example:
int start = 0;
tbb::parallel_pipeline(16,
    tbb::make_filter<void, int>(tbb::filter::serial_out_of_order, [&](tbb::flow_control& fc) {
        if (start < 1000) {
            return start++;
        }
        fc.stop();
        return start;
    }) &
    tbb::make_filter<int, int>(tbb::filter::parallel, [](int num) {
        std::cout << num << std::endl;
        return num + 1;
    }) &
    tbb::make_filter<int, void>(tbb::filter::parallel, [](int num) {
    })
);
When this code executes, 0-999 is printed sequentially. Is this correct behavior? Or do I have an issue with my environment?
Reordering is rather unlikely to be seen at the start of the second filter in practice.
The parallel_pipeline works in such a way that the same thread puts a given item through the pipeline for as long as possible (in your pipeline, all filters after the first are parallel, so the same thread will execute all three filters for an item). The overhead for a thread to move an item to the next filter is much less than what another thread needs to steal a task for the next item, process the first filter, and then also move to the second one. Reordering is still possible if e.g. the first thread is preempted by OS, but rather unlikely.
For better chances to observe out-of-order execution, move your print statements to the third filter and add some random amount of "work" to the second one, so that the time for it to process an item varies.
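A sketch of that experiment, using the same (older) TBB filter API as the question; the random spin loop is only stand-in "work", and rand() is merely convenient here, not thread-safe:

int start = 0;
tbb::parallel_pipeline(16,
    tbb::make_filter<void, int>(tbb::filter::serial_out_of_order, [&](tbb::flow_control& fc) {
        if (start < 1000) {
            return start++;
        }
        fc.stop();
        return start;
    }) &
    tbb::make_filter<int, int>(tbb::filter::parallel, [](int num) {
        // Simulate work of varying duration so items can overtake each other.
        volatile long dummy = 0;
        long spin = std::rand() % 1000000;
        for (long i = 0; i < spin; ++i) dummy += i;
        return num + 1;
    }) &
    tbb::make_filter<int, void>(tbb::filter::serial_out_of_order, [](int num) {
        // Printing moved here; arrival order should now vary between runs.
        std::cout << num << std::endl;
    })
);

Making the last filter serial_out_of_order also keeps the printed lines from interleaving mid-line, which a parallel filter would allow.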

Check for a condition periodically without blocking

In my project, the function updateClips reads some facts which are set by CLIPS without interference from my C++ code. Based on the facts read, updateClips calls the needed function.
void updateClips(void)
{
    // read clipsAction
    switch (clipsAction)
    {
        case ActMove:
            goToPosition(0, 0, clipsActionArg);
            break;
    }
}
In goToPosition function, a message is sent to the vehicle to move to the specified position and then a while loop is used to wait until the vehicle reaches the position.
void goToPosition(float north, float east, float down)
{
    // Prepare and send the message.
    do
    {
        // Read new location information.
    } while (/* Specified position reached? */);
}
The problem is that updateClips should be called every 500 ms and when the goToPosition function is called, the execution is blocked until the target location is reached. During this waiting period, something may happen that requires the vehicle to stop. Therefore, updateClips should be called every 500 ms no matter what, and it should be able to stop executing goToPosition if it's running.
I tried using threads as follows, but it didn't work for me and it was difficult to debug. I think it can be done in a simpler and cleaner way.
case ActMove:
{
    std::thread t1(goToPosition, 0, 0, clipsActionArg);
    t1.detach();
    break;
}
My question is, how can I check if the target location is reached without blocking the execution, i.e., without using while?
You probably want an event-driven model.
In an event-driven model, your main engine is a tight loop that reads events, updates state, then waits for more events.
Some events are time based, others are input based.
The only code that is permitted to block your main thread is the main loop, where it blocks until a timer hits or a new event arrives.
It might very roughly look like this:
using namespace std::literals::chrono_literals;

void main_loop( engine_state* state ) {
    bool bContinue = true;
    while (bContinue) {
        update_ui(state);
        while (bContinue && process_message(state, 10ms)) {
            bContinue = update_state(state);
        }
        bContinue = update_state(state);
    }
}
update_ui provides feedback to the user, if required.
process_message(state, duration) looks for a message to process, or for 10ms to pass. If it sees a message (like goToPosition), it modifies state to reflect that message (for example, it might store the desired destination). It does not block, nor does it take lots of time.
If no message is received within duration, it returns anyhow without modifying state (I'm assuming you want things to happen even if no new input/messages occur).
update_state takes the state and evolves it. state might have a last-updated time stamp; update_state would then make the "physics" reflect the time elapsed since the last update, or do any other updates.
The point is that process_message doesn't do work on the state (it encodes desires), while update_state advances "reality".
update_state returns false if the main loop should exit.
update_state is called once for every process_message call.
updateClips being called every 500ms can be encoded as a repeated automatic event in the queue of messages process_message reads.
bool process_message( engine_state* state, std::chrono::milliseconds ms ) {
    auto start = std::chrono::high_resolution_clock::now();
    bool processed = false;
    while (start + ms > std::chrono::high_resolution_clock::now()) {
        // engine_state::delayed is a priority_queue of timestamp/action
        // entries, ordered so the earliest timestamp is on top:
        while (!state->delayed.empty()) {
            auto stamp = state->delayed.top().stamp;
            if (stamp <= std::chrono::high_resolution_clock::now()) {
                auto f = state->delayed.top().action;
                state->delayed.pop();
                f(stamp, state);
                processed = true;
            } else {
                break;
            }
        }
        // engine_state::queue is a std::queue<std::function<void(engine_state*)>>
        if (!state->queue.empty()) {
            auto f = state->queue.front();
            state->queue.pop();
            f(state);
            processed = true;
        }
    }
    return processed;
}
The repeated polling is implemented as a delayed action that, as its first operation, inserts a new delayed action due 500ms after this one. We pass in the time the action was due to run.
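A sketch of that self-rescheduling action; timed_action {stamp, action} is an assumed name for the delayed-queue entry type used by process_message above:

void schedule_update_clips( engine_state* state,
                            std::chrono::high_resolution_clock::time_point due )
{
    state->delayed.push( timed_action{ due,
        [](auto stamp, engine_state* s) {
            // First operation: re-arm 500ms after the time this run was due,
            // so the period doesn't drift with processing delays.
            schedule_update_clips(s, stamp + std::chrono::milliseconds(500));
            // Then do the periodic work:
            updateClips();
        } } );
}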
"Normal" events can be instead pushed into the normal action queue, which is a sequence of std::function<void(engine_state*)> and executed in order.
If there is nothing to do, the above function busy-waits for ms time and then returns. In some cases, we might want to go to sleep instead.
This is just a sketch of an event loop. There are many, many on the internet.

Long cycle blocks application

I have the following cycle in my app:
var maxIterations: Int = 0

func calculatePoint(cn: Complex) -> Int {
    let threshold: Double = 2
    var z: Complex = .init(re: 0, im: 0)
    var z2: Complex = .init(re: 0, im: 0)
    var iteration: Int = 0
    repeat {
        z2 = self.pow2ForComplex(cn: z)
        z.re = z2.re + cn.re
        z.im = z2.im + cn.im
        iteration += 1
    } while self.absForComplex(cn: z) <= threshold && iteration < self.maxIterations
    return iteration
}
and the rainbow wheel shows while the cycle executes. How can I keep the app responding to UI actions?
Note I have an NSProgressIndicator updated in a different part of the code which is not being updated (progress is not shown) while the cycle is running.
I suspect it has something to do with dispatching, but I'm quite "green" with that. I do appreciate any help.
Thanks.
To dispatch something asynchronously, call async on the appropriate queue. For example, you might change this method to do the calculation on a global background queue, and then report the result back on the main queue. By the way, when you do that, you shift from returning the result immediately to using a completion handler closure which the asynchronous method will call when the calculation is done:
func calculatePoint(_ cn: Complex, completionHandler: @escaping (Int) -> Void) {
    DispatchQueue.global(qos: .userInitiated).async {
        // do your complicated calculation here which calculates `iteration`
        DispatchQueue.main.async {
            completionHandler(iteration)
        }
    }
}
And you'd call it like so:
// start NSProgressIndicator here
calculatePoint(point) { iterations in
    // use iterations here, noting that this is called asynchronously (i.e. later)
    // stop NSProgressIndicator here
}
// don't use iterations here, because the above closure is likely not yet done by the time
// we get here; we'll get here almost immediately, but the above completion handler is
// called when the asynchronous calculation is done.
Martin has surmised that you are calculating a Mandelbrot set. If so, dispatching the calculation of each point to a global queue is not a good idea (because these global queues dispatch their blocks to worker threads, but those worker threads are quite limited).
If you want to avoid using up all of these global queue worker threads, one simple choice is to take the async call out of your routine that calculates an individual point, and just dispatch the whole routine that iterates through all of the complex values to a background thread:
DispatchQueue.global(qos: .userInitiated).async {
    for row in 0 ..< height {
        for column in 0 ..< width {
            let c = ...
            let m = self.mandelbrotValue(c)
            pixelBuffer[row * width + column] = self.color(for: m)
        }
    }
    let outputCGImage = context.makeImage()!
    DispatchQueue.main.async {
        completionHandler(NSImage(cgImage: outputCGImage, size: NSSize(width: width, height: height)))
    }
}
That solves the "get it off the main thread" and the "don't use up the worker threads" problems, but now we've swung from using too many worker threads to using only one, not fully utilizing the device. We really want to do as many calculations in parallel as we can (while not exhausting the worker threads).
One approach, when doing a for loop of complex calculations, is to use dispatch_apply (now called concurrentPerform in Swift 3). This is like a for loop, but it performs each of the iterations concurrently with respect to each other (while, at the end, waiting for all of those concurrent iterations to finish). To do this, replace the outer for loop with concurrentPerform:
DispatchQueue.global(qos: .userInitiated).async {
    DispatchQueue.concurrentPerform(iterations: height) { row in
        for column in 0 ..< width {
            let c = ...
            let m = self.mandelbrotValue(c)
            pixelBuffer[row * width + column] = self.color(for: m)
        }
    }
    let outputCGImage = context.makeImage()!
    DispatchQueue.main.async {
        completionHandler(NSImage(cgImage: outputCGImage, size: NSSize(width: width, height: height)))
    }
}
The concurrentPerform (formerly known as dispatch_apply) will perform the various iterations of that loop concurrently, but it will automatically optimize the number of concurrent threads for the capabilities of your device. On my MacBook Pro, this made the calculation 4.8 times faster than the simple for loop. Note, I still dispatch the whole thing to a global queue (because concurrentPerform runs synchronously, and we never want to perform slow, synchronous calculations on the main thread), but concurrentPerform will run the calculations in parallel. It's a great way to enjoy concurrency in a for loop in such a way that you won't exhaust GCD worker threads.
By the way, you mentioned that you are updating a NSProgressIndicator. Ideally, you want to update it as every pixel is processed, but if you do that, the UI may get backlogged, unable to keep up with all of these updates. You'll end up slowing the final result to allow the UI to catch up to all of those progress indicator updates.
The solution is to decouple the UI update from the progress updates. You want the background calculations to inform you as each pixel is updated, but you want the progress indicator to be updated, each time effectively saying "ok, update the progress with however many pixels were calculated since the last time I checked". There are cumbersome manual techniques to do that, but GCD provides a really elegant solution, a dispatch source, or more specifically, a DispatchSourceUserDataAdd.
So define properties for the dispatch source and a counter to keep track of how many pixels have been processed thus far:
let source = DispatchSource.makeUserDataAddSource(queue: .main)
var pixelsProcessed: UInt = 0
And then set up an event handler for the dispatch source, which updates the progress indicator:
source.setEventHandler() { [unowned self] in
    self.pixelsProcessed += self.source.data
    self.progressIndicator.doubleValue = Double(self.pixelsProcessed) / Double(width * height)
}
source.resume()
And then, as you process the pixels, you can simply add to your source from the background thread:
DispatchQueue.concurrentPerform(iterations: height) { row in
    for column in 0 ..< width {
        let c = ...
        let m = self.mandelbrotValue(for: c)
        pixelBuffer[row * width + column] = self.color(for: m)
        self.source.add(data: 1)
    }
}
If you do this, it will update the UI with the greatest frequency possible, but it will never get backlogged with a queue of updates. The dispatch source will coalesce these add calls for you.

Getting item sequence numbers from a QtConcurrent Threaded Calculation

The QtConcurrent namespace is really great for simplifying the management of multi-threaded calculations. Overall this works great and I have been able to use QtConcurrent run(), map(), and other variants in the way they are described in the API.
Overall Goal:
I would like to query, cancel(), or pause() a numerically intensive calculation from QML. So far this is working the way I would like, except that I cannot access the sequence numbers in the calculation. Here is a link that describes a similar QML setup.
Below is an image from a small test app that I created to encapsulate what I am trying to do:
In the example above the calculation has nearly completed and all the cores have been enqueued with work properly, as can be seen from a system query:
But what I really would like to do is use the sequence numbers from a given list of the items IN THE multi-threaded calculation itself. E.g., one approach might be to simply setup the sequence numbers directly in a QList or QVector (other C++ STL containers can work as well), like this:
void TaskDialog::mapTask()
{
    // Number of times the map function will be called:
    int N = 5;

    // Prepare the vector that we operate on with mapFunction:
    QList<int> vectorOfInts;
    for (int i = 0; i < N; i++) {
        vectorOfInts << i;
    }

    // Start the calc:
    QFuture<void> future = QtConcurrent::map(vectorOfInts, mapFunction);
    _futureWatcher.setFuture(future);
    //_futureWatcher.waitForFinished();
}
The calculation is non-blocking with the line: _futureWatcher.waitForFinished(); commented out, as shown in the code above. Note that when setup as a non-blocking calculation, the GUI thread is responsive, and the progress bar updates as desired.
But when the values in the QList container are queried during the calculation, they appear to be the uninitialized garbage values one would expect from an array that was never properly initialized.
Below is the example function I am calling:
void mapFunction(int& n)
{
    // Check the n values:
    qDebug() << "n = " << n;

    /* Below is an arbitrary task. Note that we left out n,
     * although normally we would want to use it: */
    const long work = 10000 * 10000 * 10;
    long s = 0;
    for (long j = 0; j < work; j++)
        s++;
}
And the output of qDebug() is:
n = 30458288
n = 204778
n = 270195923
n = 0
n = 270385260
The n-values are useless but the sum values, s, are correct (although not shown) when the calculation is mapped in this fashion (non-blocking).
Now, if I uncomment the _futureWatcher.waitForFinished(); line then I get the expected values (the order is irrelevant):
n = 0
n = 2
n = 4
n = 3
n = 1
But in this case, with _futureWatcher.waitForFinished(); enabled, my GUI thread is blocked and the progress bar does not update.
What then would be the advantage of using QtConcurrent::map() with blocking enabled, if the goal is to not block the main GUI thread?
Secondly, how can I get the correct values of n in the non-blocking case, keeping the GUI responsive and the progress bar updating?
My only option may be to use QThread directly but I wanted to take advantage of all the nice tools setup for us in QtConcurrent.
Thoughts? Suggestions? Other options? Thanks.
EDIT: Thanks to user2025983 for the insight which helped me to solve this. The bottom line is that I first needed to dynamically allocate the QList:
QList<int>* vectorOfInts = new QList<int>;
for (int i = 0; i < N; i++)
    vectorOfInts->push_back(i);
Next, the vectorOfInts is passed by reference to the map function by de-referencing the pointer, like this:
QFuture<void> future = QtConcurrent::map(*vectorOfInts, mapFunction);
Note also that the prototype of the mapFunction remains the same:
void mapFunction(int& n)
And then it all works properly: the GUI remains responsive, the progress bar updates, the values of n are all correct, etc., WITHOUT the need to block via:
_futureWatcher.waitForFinished();
Hope these extra details can help someone else.
The problem here is that your QList goes out of scope when mapTask() finishes.
Since mapFunction(int &n) takes its parameter by reference, it gets references to integer values which are now part of an array that is out of scope! The computer is then free to do whatever it likes with that memory, which is why you see garbage values. If you are just using integer parameters, I would recommend passing the parameters by value, and then everything should work.
Alternatively, if you must pass by reference, you can have the future watcher delete the array when it's finished. Note that QList is not a QObject, so it has no deleteLater() slot; a lambda connected to the finished() signal does the job:
QList<int>* vectorOfInts = new QList<int>;
// push back into structure
connect(&_futureWatcher, &QFutureWatcher<void>::finished,
        [vectorOfInts] { delete vectorOfInts; });
// launch stuff
QtConcurrent::map...
// profit
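Another option, not in the original answer but a common idiom (a sketch, assuming the watcher and the list can simply live as members of TaskDialog): store the container as a member so it outlives mapTask() with no heap allocation at all:

class TaskDialog : public QDialog {
    // ...
private:
    QFutureWatcher<void> _futureWatcher;
    QList<int> _vectorOfInts;   // lives as long as the dialog does
};

void TaskDialog::mapTask()
{
    const int N = 5;
    _vectorOfInts.clear();
    for (int i = 0; i < N; ++i)
        _vectorOfInts << i;
    _futureWatcher.setFuture(QtConcurrent::map(_vectorOfInts, mapFunction));
}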

How to reduce cpu usage during data transfer on TCP ports realtime

I have a socket program which acts as both client and server.
It initiates a connection on an input port and reads data from it. In a real-time scenario it reads data on the input port and sends the data (record by record) on to the output port.
The problem is that while sending data to the output port, CPU usage increases to 50%, which is not permissible.
while (1)
{
    if (IsInputDataAvail == 1)  // check if data is available on input port
    {
        // condition to avoid duplications while sending
        if (LastRecordSent < LastRecordRecvd)
        {
            record_time temprt;
            list<record_time> BufferList;
            list<record_time>::iterator j;
            list<record_time>::iterator i;

            // Storing into a temp list
            for (i = L.begin(); i != L.end(); ++i)
            {
                if ((i->recordId > LastRecordSent) && (i->recordId <= LastRecordRecvd))
                {
                    temprt.listrec = i->listrec;
                    temprt.recordId = i->recordId;
                    temprt.timestamp = i->timestamp;
                    BufferList.push_back(temprt);
                }
            }

            // Sending to output port
            for (j = BufferList.begin(); j != BufferList.end(); ++j)
            {
                LastRecordSent = j->recordId;
                std::string newlistrecord = j->listrec;
                newlistrecord.append("\n");
                char* newrecord = new char[newlistrecord.size() + 1];
                strcpy(newrecord, newlistrecord.c_str());
                if (s.OutputClientAvail() == 1)  // check if output client is available
                {
                    int ret = s.SendBytes(newrecord, strlen(newrecord));
                    if (ret < 0)
                    {
                        log1.AddLogFormatFatal("Nice Send Thread : Nice Client Disconnected");
                        --connected;
                        return;
                    }
                }
                else
                {
                    log1.AddLogFormatFatal("Nice Send Thread : Nice Client Timedout..connection closed");
                    --connected;  // if output client not available, disconnect after a timeout
                    return;
                }
            }
        }
    }
    // Sleep(100); if we include a sleep here, CPU usage is lower, but to send data in real time I need to remove it.
}  // End of while loop
If I remove the Sleep(), CPU usage goes very high while sending data to the output port.
Are there any possible ways to maintain real-time data transfer and reduce CPU usage? Please suggest.
There are two potential CPU sinks in the listed code. First, the outer loop:
while (1)
{
    if (IsInputDataAvail == 1)
    {
        // Not run most of the time
    }
    // Sleep(100);
}
Given that the Sleep call significantly reduces your CPU usage, this spin-loop is the most likely culprit. It looks like IsInputDataAvail is a variable set by another thread (though it could be a preprocessor macro), which would mean that almost all of that CPU is being used to run this one comparison instruction and a couple of jumps.
The way to reclaim that wasted power is to block until input is available. Your reading thread probably does so already, so you just need some sort of semaphore to communicate between the two, with a system call to block the output thread. Where available, the ideal option would be sem_wait() in the output thread, right at the top of your loop, and sem_post() in the input thread, where it currently sets IsInputDataAvail. If that's not possible, the self-pipe trick might work in its place.
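A minimal sketch of that handshake with POSIX semaphores (error checking omitted; the shared record list still needs its own mutex):

#include <semaphore.h>

sem_t dataAvailable;                 // counts records ready to send

// once, at startup:
sem_init(&dataAvailable, 0, 0);

// input thread, after appending a record to the shared list:
sem_post(&dataAvailable);

// output thread, at the top of its loop -- blocks instead of spinning:
sem_wait(&dataAvailable);
// ...send the next record...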
The second potential CPU sink is in s.SendBytes(). If a positive result indicates that the record was fully sent, then that method must be using a loop. It probably uses a blocking call to write the record; if it doesn't, then it could be rewritten to do so.
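For reference, a well-behaved blocking send loop looks roughly like this (a sketch with POSIX sockets; fd stands in for the output socket descriptor):

#include <sys/socket.h>

int SendAll(int fd, const char* buf, size_t len)
{
    size_t sent = 0;
    while (sent < len) {
        ssize_t n = send(fd, buf + sent, len - sent, 0);  // blocks until some bytes go out
        if (n < 0)
            return -1;                                    // caller treats this as a disconnect
        sent += static_cast<size_t>(n);
    }
    return 0;
}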
Alternatively, you could rewrite half the application to use select(), poll(), or a similar method to merge reading and writing into the same thread, but that's far too much work if your program is already mostly complete.
if(IsInputDataAvail==1)//check if data is available on input port
Get rid of that. Just read from the input port. It will block until data is available. This is where most of your CPU time is going. However there are other problems:
std::string newlistrecord = j->listrec;
Here you are copying data.
newlistrecord.append("\n");
char* newrecord= new char [newlistrecord.size()+1];
strcpy (newrecord, newlistrecord.c_str());
Here you are copying the same data again. You are also dynamically allocating memory, and you are also leaking it.
if ( s.OutputClientAvail() == 1) //check if output client is available
I don't know what this does but you should delete it. The following send is the time to check for errors. Don't try to guess the future.
int ret = s.SendBytes(newrecord,strlen(newrecord));
Here you are recomputing the length of the string, which you probably already knew back when you set j->listrec. It would be much more efficient to just call s.SendBytes() directly with j->listrec and then again with "\n" than to do all this. TCP will coalesce the data anyway.
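Concretely, the body of the sending loop could shrink to something like this (a sketch; it assumes SendBytes accepts a const buffer/length pair like the call above):

int ret = s.SendBytes(j->listrec.c_str(), j->listrec.size());
if (ret >= 0)
    ret = s.SendBytes("\n", 1);
if (ret < 0) {
    // handle the disconnect as before
}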