Futures and Promises in Erlang

Does Erlang have equivalents for futures and promises? Or do futures and promises solve a problem that doesn't exist in Erlang systems (orchestrating synchronisation, for example), so that we simply don't need them in Erlang?
If I want the semantics of futures and promises in Erlang, can they be emulated via Erlang processes/actors?

You could easily implement a future in Erlang like this:
F = fun() -> fancy_function() end,
% fancy code
Pid = self(),
Other = spawn(fun() -> X = F(), Pid ! {future, self(), X} end). % start the computation in another process
% more fancy code
Value = receive {future, Other, Val} -> Val end. % block here until the result arrives
Having this functionality in a module and building chains from it should be easy too, but to be honest I never actually missed something like this. You are more flexible if you just freely send messages around.
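A minimal sketch of such a module could look like this (module and function names are invented for illustration; note the value must be read by the same process that created the future):
-module(future).
-export([new/1, value/1]).

%% Spawn a process that evaluates Fun and mails the result back to its creator.
new(Fun) ->
    Caller = self(),
    spawn(fun() -> Caller ! {future_value, self(), Fun()} end).

%% Block until the result of this particular future arrives.
value(Pid) ->
    receive
        {future_value, Pid, Value} -> Value
    end.
Usage then mirrors the snippet above: F = future:new(fun() -> fancy_function() end), and later Value = future:value(F).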

The rpc module contains the function rpc:async_call/4, which does what you need. It will run a computation anywhere in a cluster (including on node(), the local node) and lets you wait for the result with rpc:yield/1:
1> MaxTime = rpc:async_call(node(), timer, sleep, [30000]).
<0.48.0>
2> lists:sort([a,c,b]).
[a,b,c]
3> rpc:yield(MaxTime).
... [long wait] ...
ok
You can also poll for the result without blocking by using rpc:nb_yield/1, or block for at most a given number of milliseconds with rpc:nb_yield/2:
4> Key2 = rpc:async_call(node(), timer, sleep, [30000]).
<0.52.0>
5> rpc:nb_yield(Key2).
timeout
6> rpc:nb_yield(Key2).
timeout
7> rpc:nb_yield(Key2).
timeout
8> rpc:nb_yield(Key2, 1000).
timeout
9> rpc:nb_yield(Key2, 100000).
... [long wait] ...
{value,ok}
That's all in the standard library and ready to be used.
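If you want a promise-flavoured API on top of that, the calls can be wrapped in a thin module; here is a sketch (the promise module and its function names are invented for illustration, only the rpc calls are real):
-module(promise).
-export([async/3, await/1, await/2]).

%% Start M:F(A) asynchronously on the local node; returns a key.
async(M, F, A) ->
    rpc:async_call(node(), M, F, A).

%% Block until the result is ready.
await(Key) ->
    rpc:yield(Key).

%% Wait at most Timeout milliseconds; returns {value, Result} or timeout.
await(Key, Timeout) ->
    rpc:nb_yield(Key, Timeout).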

Related

Explicit throughput limiting on part of an akka stream

I have a flow in our system which reads some elements from SQS (using alpakka) and does some preprocessing (~10 stages, normally < 1 minute in total). Then the prepared element is sent to the main processing (single stage, taking a few minutes). The whole thing runs on AWS/K8S and we'd like to scale out when the SQS queue grows above a certain threshold. The issue is, the SQS queue takes a long time to blow up, since there are a lot of elements "idling" in-process, having done their preprocessing but waiting for the main thing.
We can’t externalize the preprocessing stuff to a separate queue since their outcome can’t survive a de/serialization roundtrip. Also, this service and the “main” processor are deeply coupled (this service runs as main’s sidecar) and can’t be scaled independently.
The preprocessing stages are technically .mapAsyncUnordered, but the whole thing is already very slim (stream stages and SQS batches/buffers).
We tried lowering the interstage buffer (akka.stream.materializer.max-input-buffer-size), but that only gives some indirect benefit, no direct control (and is too internal to be mucking with, for my taste anyway).
I tried implementing a "gate" wrapper which would limit the number of elements allowed inside some arbitrary Flow, looking something like:
class LimitingGate[T, U](originalFlow: Flow[T, U], maxInFlight: Int) {
  private def in: InputGate[T] = ???
  private def out: OutputGate[U] = ???

  def gatedFlow: Flow[T, U, NotUsed] = Flow[T].via(in).via(originalFlow).via(out)
}
And using callbacks between the in/out gates for throttling.
The implementation partially works (stream termination is giving me a hard time), but it feels like the wrong way to go about achieving the actual goal.
Any ideas / comments / enlightening questions are appreciated
Thanks!
Try something along these lines (I'm only compiling it in my head):
def inflightLimit[T, B, M](n: Int, source: Source[T, M])(businessFlow: Flow[T, B, _])(implicit materializer: Materializer): Source[B, M] = {
  require(n > 0) // alternatively, could just result in a Source.empty...

  val actorSource = Source.actorRef[Unit](
    completionMatcher = PartialFunction.empty,
    failureMatcher = PartialFunction.empty,
    bufferSize = 2 * n,
    overflowStrategy = OverflowStrategy.dropHead // shouldn't matter, but if the buffer fills, the effective limit will be reduced
  )

  val (flowControl, unitSource) = actorSource.preMaterialize()

  source.statefulMapConcat { () =>
    var firstElem: Boolean = true

    { a =>
      if (firstElem) {
        (0 until n).foreach(_ => flowControl ! (())) // prime the pump on stream materialization
        firstElem = false
      }
      List(a)
    }
  }
    .zip(unitSource)
    .map(_._1)
    .via(businessFlow)
    .wireTap { _ => flowControl ! (()) } // wireTap is Akka Streams 2.6, but can easily be replaced by a map stage which sends () to flowControl and passes through the input
}
Basically:
actorSource will emit a Unit ((), i.e. meaningless) element for every () it receives
statefulMapConcat will cause n messages to be sent to the actorSource only when the stream first starts (thus allowing n elements from the source through)
zip will pass on a pair of the input from source and a () only when actorSource and source both have an element available
for every element which exits businessFlow, a message will be sent to the actorSource, which will allow another element from the source through
Some things to note:
this will not in any way limit buffering within source
businessFlow cannot drop elements: after n elements are dropped, the stream will no longer process elements but won't fail; if dropping elements is required, you may be able to inline businessFlow and have the stages which drop elements send a message to flowControl when they drop an element; there are other ways to address this as well

Parallelize map() operation on single Observable and receive results out of order

Given an Observable<Input> and a mapping function Function<Input, Output> that is expensive but takes variable time, is there a way to call the mapping function in parallel on multiple inputs, and receive the outputs in the order they're produced?
I've tried using observeOn() with a multi-threaded Scheduler:
PublishSubject<Input> inputs = PublishSubject.create();
Function<Input, Output> mf = ...
Observer<Output> myObserver = ...
// Note: same results with newFixedThreadPool(2)
Executor exec = Executors.newWorkStealingPool();
// Use ConnectableObservable to make sure mf is called only once
// no matter how many downstream observers
ConnectableObservable<Output> outputs = inputs
    .observeOn(Schedulers.from(exec))
    .map(mf)
    .publish();
outputs.subscribe(myObserver1);
outputs.subscribe(myObserver2);
outputs.connect();
inputs.onNext(slowInput); // `mf.apply()` takes a long time to complete on this input
inputs.onNext(fastInput); // `mf.apply()` takes a short time to complete on this input
but in testing, mf.apply(fastInput) is never called till after mf.apply(slowInput) completes.
If I play some tricks in my test with CountDownLatch to ensure mf.apply(slowInput) can't complete until after mf.apply(fastInput), the program deadlocks.
Is there some simple operator I should be using here, or is getting Observables out of order just against the grain of RxJava, and I should be using a different technology?
ETA: I looked at using ParallelFlowable (converting it back to a plain Flowable with .sequential() before subscribing myObserver1/2, or rather mySubscriber1/2), but then I get extra mf.apply() calls, one per input per Subscriber. There's ConnectableFlowable, but I'm not having much luck figuring out how to mix it with .parallel().
I guess the observeOn operator does not support concurrent execution on its own. So how about using flatMap? Assume the mf function takes a lot of time.
ConnectableObservable<Output> outputs = inputs
    .flatMap(it -> Observable.just(it)
        .observeOn(Schedulers.from(exec))
        .map(mf))
    .publish();
or
ConnectableObservable<Output> outputs = inputs
    .flatMap(it -> Observable.just(it)
        .map(mf))
    .subscribeOn(Schedulers.from(exec))
    .publish();
Edit 2019-12-30
If you want to run tasks concurrently but are supposed to keep the order, use the concatMapEager operator instead of flatMap.
ConnectableObservable<Output> outputs = inputs
    .concatMapEager(it -> Observable.just(it) // here
        .observeOn(Schedulers.from(exec))
        .map(mf))
    .publish();
Doesn't sound possible to me, unless Rx has some very specialised operator to do so. If you're using flatMap to do the mapping, then the elements will arrive out-of-order. Or you could use concatMap but then you'll lose the parallel mapping that you want.
Edit: As mentioned by another poster, concatMapEager should work for this. Parallel subscription and in-order results.

ZeroMQ: how to reduce multithread-communication latency with inproc?

I'm using inproc and PAIR to achieve inter-thread communication and trying to solve a latency problem due to polling. Correct me if I'm wrong: Polling is inevitable, because a plain recv() call will usually block and cannot take a specific timeout.
In my current case, among N threads, each of the N-1 worker threads has a main while-loop. The N-th thread is a controller thread which will notify all the worker threads to quit at any time. However, worker threads have to use polling with a timeout to get that quit message. This introduces a latency, the latency parameter is usually 1000ms.
Here is an example
while (true) {
    const std::chrono::milliseconds nTimeoutMs(1000);
    std::vector<zmq::poller_event<std::size_t>> events(n);
    size_t nEvents = m_poller.wait_all(events, nTimeoutMs);
    bool isToQuit = false;
    for (auto& evt : events) {
        zmq::message_t out_recved;
        try {
            evt.socket.recv(out_recved, zmq::recv_flags::dontwait);
        }
        catch (std::exception& e) {
            trace("{}: Caught exception while polling: {}. Skipped.", GetLogTitle(), e.what());
            continue;
        }
        if (!out_recved.empty()) {
            if (IsToQuit(out_recved))
                isToQuit = true;
            break;
        }
    }
    if (isToQuit)
        break;
    //
    // main business
    //
    ...
}
To make things worse, when the main loop has nested loops, the worker threads then need to include more polling code in each layer of the nested loops. Very ugly.
The reason why I chose ZMQ for multithread communication is because of its elegance and the potential of getting rid of thread-locking. But I never realized the polling overhead.
Am I able to achieve the typical latency when using a regular mutex or an std::atomic data operation? Should I understand that the inproc is in fact a network communication pattern in disguise so that some latency is inevitable?
An above posted statement ( a hypothesis ):
"...a plain recv() call will usually block and cannot take a specific timeout."
is not correct:
a plain .recv( ZMQ_NOBLOCK )-call will never "block",
a plain .recv( ZMQ_NOBLOCK )-call can get decorated so as to mimic "a specific timeout"
An above posted statement ( a hypothesis ):
"...have to use polling with a timeout ... introduces a latency, the latency parameter is usually 1000ms."
is not correct:
- one need not use polling with a timeout
- much less need one set a 1000 ms code-"injected" latency, which is obviously spent only in the no-new-message state
Q : "Am I able to achieve the typical latency when using a regular mutex or an std::atomic data operation?"
Yes.
Q : "Should I understand that the inproc is in fact a network communication pattern in disguise so that some latency is inevitable?"
No. inproc-transport-class is the fastest of all these kinds as it is principally protocol-less / stack-less and has more to do with ultimately fast pointer-mechanics, like in a dual-end ring-buffer pointer-management.
The Best Next Step:
1) Re-factor your code so as to always harness only the zero-wait { .poll() | .recv() }-methods, properly decorated for both { event- | no-event- }-specific looping.
2) If you are then willing to shave the last few [us] off the smart-loop-detection turn-around time, you may focus on an improved Context()-instance setting, letting it work with a larger number of nIOthreads > N "under the hood".
optionally 3) For an almost hard-real-time system design, one may finally harness a deterministically driven mapping of the Context()-threads and socket-specific execution-vehicles onto specific, non-overlapping CPU cores (using a carefully crafted affinity map).
Having set 1000 [ms] in code, it is not fair to complain about spending those very 1000 [ms] waiting in a timeout one has coded oneself. There is no excuse for that.
Do not blame ZeroMQ for behaviour that was coded on the application side of the API.
Never.

Getting result of a spawned function in Erlang

My objective at the moment is to write Erlang code calculating a list of N elements, where each element is the factorial of its "index" (so, for N = 10 I would like to get [1!, 2!, 3!, ..., 10!]). What's more, I would like every element to be calculated in a separate process (I know it is simply inefficient, but I am expected to implement it and compare its efficiency with other methods later).
In my code, I wanted to use one function as a "loop" over given N, that for N, N-1, N-2... spawns a process which calculates factorial(N) and sends the result to some "collecting" function, which packs received results into a list. I know my concept is probably overcomplicated, so hopefully the code will explain a bit more:
messageFactorial(N, listPID) ->
    listPID ! factorial(N). %% send calculated factorial to "collector".

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
nProcessesFactorialList(-1) ->
    ok;
nProcessesFactorialList(N) ->
    spawn(pFactorial, messageFactorial, [N, listPID]), %% for each N spawn...
    nProcessesFactorialList(N-1).

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
listPrepare(List) ->    %% "collector", for the last factorial returns
    receive             %% a list of factorials (1! = 1).
        1 -> List;
        X ->
            listPrepare([X | List])
    end.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
startProcessesFactorialList(N) ->
    register(listPID, spawn(pFactorial, listPrepare, [[]])),
    nProcessesFactorialList(N).
I guess it should work, by which I mean that listPrepare finally returns a list of factorials. But the problem is, I do not know how to get that list, how to get what it returned. As of now my code returns ok, as this is what nProcessesFactorialList returns when it finishes. I thought about sending the List of results from listPrepare to nProcessesFactorialList at the end, but then it would also need to be a registered process, from which I wouldn't know how to recover that list.
So basically, how do I get the result (my list of factorials) from the registered process running listPrepare? If my code is not right at all, I would ask for a suggestion of how to do it better. Thanks in advance.
My way to do this sort of task is:
-module(par_fact).
-export([calc/1]).

fact(X) -> fact(X, 1).

fact(0, R) -> R;
fact(X, R) when X > 0 -> fact(X-1, R*X).

calc(N) ->
    Self = self(),
    Pids = [ spawn_link(fun() -> Self ! {self(), {X, fact(X)}} end)
             || X <- lists:seq(1, N) ],
    [ receive {Pid, R} -> R end || Pid <- Pids ].
and result:
> par_fact:calc(25).
[{1,1},
{2,2},
{3,6},
{4,24},
{5,120},
{6,720},
{7,5040},
{8,40320},
{9,362880},
{10,3628800},
{11,39916800},
{12,479001600},
{13,6227020800},
{14,87178291200},
{15,1307674368000},
{16,20922789888000},
{17,355687428096000},
{18,6402373705728000},
{19,121645100408832000},
{20,2432902008176640000},
{21,51090942171709440000},
{22,1124000727777607680000},
{23,25852016738884976640000},
{24,620448401733239439360000},
{25,15511210043330985984000000}]
The first problem is that your listPrepare process doesn't do anything with the result. Try printing it at the end.
The second problem is that you don't wait for all the processes to finish, but only for the process that sends 1, and that is the quickest factorial to calculate. That message will almost certainly be received before the more complex ones have been calculated, so you'll end up with only a few responses.
I answered a somewhat similar question about parallel work with many processes here: Create list across many processes in Erlang. Maybe that one will help you.
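A minimal sketch of a collector that addresses both problems, assuming it is told up front how many results to expect and which process to report back to (the names reuse those from the question, but this is untested illustration code):
%% Collector: count down the expected number of results, then send the
%% finished list back to the process that started the computation.
listPrepare(0, List, Caller) ->
    Caller ! {factorials, List};
listPrepare(N, List, Caller) ->
    receive
        X -> listPrepare(N - 1, [X | List], Caller)
    end.

startProcessesFactorialList(N) ->
    %% nProcessesFactorialList/1 spawns N+1 workers (for N down to 0),
    %% so the collector expects N+1 results.
    register(listPID, spawn(pFactorial, listPrepare, [N + 1, [], self()])),
    nProcessesFactorialList(N),
    receive
        {factorials, List} -> List
    end.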
I propose you this solution:
-export([launch/1,fact/2]).

launch(N) ->
    launch(N,N).

% launch(Current,Total)
% when all processes are launched go to the result collect phase
launch(-1,N) -> collect(N+1);
launch(I,N) ->
    % fact will be executed in a new process, so the normal way to get the answer is by message passing
    % need to give the current process pid to get the answer back from the spawned process
    spawn(?MODULE,fact,[I,self()]),
    % loop until all processes are launched
    launch(I-1,N).

% simply send the result to Pid.
fact(N,Pid) -> Pid ! {N,fact_1(N,1)}.

fact_1(I,R) when I < 2 -> R;
fact_1(I,R) -> fact_1(I-1,R*I).

% init the collect phase with an empty result list
collect(N) -> collect(N,[]).

% collect(Remaining_result_to_collect,Result_list)
collect(0,L) -> L;
% accumulate the results in L and loop until all messages are received
collect(N,L) ->
    receive
        R -> collect(N-1,[R|L])
    end.
but a much more straightforward (single-process) solution could be:
1> F = fun(N) -> lists:foldl(fun(I,[{X,R}|Q]) -> [{I,R*I},{X,R}|Q] end, [{0,1}], lists:seq(1,N)) end.
#Fun<erl_eval.6.80484245>
2> F(6).
[{6,720},{5,120},{4,24},{3,6},{2,2},{1,1},{0,1}]
[edit]
On a multicore system with caches and a multitasking underlying OS, there is absolutely no guarantee on the order of execution, and the same goes for message sending. The only guarantee is in the message queue, where you know that you will analyse the messages according to the order of message reception. So I agree with Dmitry: your stop condition is not 100% effective.
In addition, via startProcessesFactorialList you spawn listPrepare, which effectively collects all the factorial values (except 1!) and then simply forgets the result at the end of the process; I guess this code snippet is not exactly the one you use for testing.

Is F# really faster than Erlang at spawning and killing processes?

Updated: This question contains an error which makes the benchmark meaningless. I will attempt a better benchmark comparing F# and Erlang's basic concurrency functionality and inquire about the results in another question.
I am trying to understand the performance characteristics of Erlang and F#. I find Erlang's concurrency model very appealing but am inclined to use F# for interoperability reasons. While out of the box F# doesn't offer anything like Erlang's concurrency primitives -- from what I can tell, async and MailboxProcessor only cover a small portion of what Erlang does well -- I've been trying to understand what is possible in F# performance-wise.
In Joe Armstrong's Programming Erlang book, he makes the point that processes are very cheap in Erlang. He uses (roughly) the following code to demonstrate this fact:
-module(processes).
-export([max/1]).

%% max(N)
%%   Create N processes then destroy them
%%   See how much time this takes

max(N) ->
    statistics(runtime),
    statistics(wall_clock),
    L = for(1, N, fun() -> spawn(fun() -> wait() end) end),
    {_, Time1} = statistics(runtime),
    {_, Time2} = statistics(wall_clock),
    lists:foreach(fun(Pid) -> Pid ! die end, L),
    U1 = Time1 * 1000 / N,
    U2 = Time2 * 1000 / N,
    io:format("Process spawn time=~p (~p) microseconds~n",
              [U1, U2]).

wait() ->
    receive
        die -> void
    end.

for(N, N, F) -> [F()];
for(I, N, F) -> [F()|for(I+1, N, F)].
On my MacBook Pro, spawning and killing 100 thousand processes (processes:max(100000)) takes about 8 microseconds per process. I can raise the number of processes a bit further, but a million seems to break things pretty consistently.
Knowing very little F#, I tried to implement this example using async and MailboxProcessor. My attempt, which may be wrong, is as follows:
#r "System.dll"
open System.Diagnostics
type waitMsg =
| Die
let wait =
MailboxProcessor.Start(fun inbox ->
let rec loop =
async { let! msg = inbox.Receive()
match msg with
| Die -> return() }
loop)
let max N =
printfn "Started!"
let stopwatch = new Stopwatch()
stopwatch.Start()
let actors = [for i in 1 .. N do yield wait]
for actor in actors do
actor.Post(Die)
stopwatch.Stop()
printfn "Process spawn time=%f microseconds." (stopwatch.Elapsed.TotalMilliseconds * 1000.0 / float(N))
printfn "Done."
Using F# on Mono, starting and killing 100,000 actors/processes takes under 2 microseconds per process, roughly 4 times faster than Erlang. More importantly, perhaps, I can scale up to millions of processes without any apparent problems. Starting 1 or 2 million processes still takes about 2 microseconds per process. Starting 20 million processes is still feasible, but slows to about 6 microseconds per process.
I have not yet taken the time to fully understand how F# implements async and MailboxProcessor, but these results are encouraging. Is there something I'm doing horribly wrong?
If not, is there some place Erlang will likely outperform F#? Is there any reason Erlang's concurrency primitives can't be brought to F# through a library?
EDIT: The above numbers are wrong, due to the error Brian pointed out. I will update the entire question when I fix it.
In your original code, you only started one MailboxProcessor. Make wait() a function, and call it with each yield. Also you are not waiting for them to spin up or receive the messages, which I think invalidates the timing info; see my code below.
That said, I have some success; on my box I can do 100,000 at about 25us each. After too much more, I think possibly you start fighting the allocator/GC as much as anything, but I was able to do a million too (at about 27us each, but at this point was using like 1.5G of memory).
Basically each 'suspended async' (which is the state when a mailbox is waiting on a line like
let! msg = inbox.Receive()
) only takes some number of bytes while it's blocked. That's why you can have way, way, way more asyncs than threads; a thread typically takes like a megabyte of memory or more.
Ok, here's the code I'm using. You can use a small number like 10, and --define DEBUG to ensure the program semantics are what is desired (printf outputs may be interleaved, but you'll get the idea).
open System.Diagnostics

let MAX = 100000

type waitMsg =
    | Die

let mutable countDown = MAX
let mre = new System.Threading.ManualResetEvent(false)

let wait(i) =
    MailboxProcessor.Start(fun inbox ->
        let rec loop =
            async {
#if DEBUG
                printfn "I am mbox #%d" i
#endif
                if System.Threading.Interlocked.Decrement(&countDown) = 0 then
                    mre.Set() |> ignore
                let! msg = inbox.Receive()
                match msg with
                | Die ->
#if DEBUG
                    printfn "mbox #%d died" i
#endif
                    if System.Threading.Interlocked.Decrement(&countDown) = 0 then
                        mre.Set() |> ignore
                    return () }
        loop)

let max N =
    printfn "Started!"
    let stopwatch = new Stopwatch()
    stopwatch.Start()
    let actors = [for i in 1 .. N do yield wait(i)]
    mre.WaitOne() |> ignore // ensure they have all spun up
    mre.Reset() |> ignore
    countDown <- MAX
    for actor in actors do
        actor.Post(Die)
    mre.WaitOne() |> ignore // ensure they have all got the message
    stopwatch.Stop()
    printfn "Process spawn time=%f microseconds." (stopwatch.Elapsed.TotalMilliseconds * 1000.0 / float(N))
    printfn "Done."

max MAX
All this said, I don't know Erlang, and I have not thought deeply about whether there's a way to trim down the F# any more (though it's pretty idiomatic as-is).
Erlang's VM doesn't use OS threads or processes to switch to a new Erlang process. The VM simply counts function calls (reductions) in your code/process and jumps to another of its processes after a certain number of them (staying within the same OS process and the same OS thread).
The CLR uses mechanics based on OS processes and threads, so F# has a much higher overhead cost for each context switch.
So the answer to your question is: no, Erlang is much faster at spawning and killing processes.
P.S. You may find the results of that practical contest interesting.