How to call Executor.map in custom dask graph? - python-2.7

I've got a computation, consisting of 3 "map" steps, and the last step depends on results of the first two. I am performing this task using dask.distributed running on several PCs.
The dependency graph looks like the following:
map(func1, list1) -> res_list1 -\
                                 |-> create_list_3(res_list1, res_list2) -> list3 -> map(func3, list3)
map(func2, list2) -> res_list2 -/
If these computations were independent, it would be straightforward to call the map function three times:
from distributed import Executor, progress

def process(jobid):
    e = Executor('{address}:{port}'.format(address=config('SERVER_ADDR'),
                                           port=config('SERVER_PORT')))
    futures = []
    futures.append(e.map(func1, list1))
    futures.append(e.map(func2, list2))
    futures.append(e.map(func3, list3))
    return futures

if __name__ == '__main__':
    jobid = 'blah-blah-blah'
    r = process(jobid)
    progress(r)
However, list3 is constructed from the results of func1 and func2, and its creation is not easily mappable (list1, list2, res_list1 and res_list2 are stored in a PostgreSQL database, and creating list3 is a JOIN query that takes some time).
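For concreteness, create_list_3 is roughly of this shape (the table and column names here are made up; the real JOIN is more involved):

import psycopg2

def create_list_3():
    # simplified sketch only; the real query joins the result tables
    conn = psycopg2.connect(config('DB_DSN'))
    with conn, conn.cursor() as cur:
        cur.execute("""
            select a.item, b.item
            from res_list1 a
            join res_list2 b on a.key = b.key
        """)
        return cur.fetchall()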
I've tried adding a call to submit to the list of futures, but that did not work as expected:
def process(jobid):
    e = Executor('{address}:{port}'.format(address=config('SERVER_ADDR'),
                                           port=config('SERVER_PORT')))
    futures = []
    futures.append(e.map(func1, list1))
    futures.append(e.map(func2, list2))
    futures.append(e.submit(create_list_3))
    futures.append(e.map(func3, list3))
    return futures
In this case one dask-worker received the task to execute create_list_3, but the others simultaneously received tasks to call func3, which failed because list3 did not exist yet.
The obvious problem: I'm missing synchronization. The workers must stop and wait until the creation of list3 is finished.
The dask documentation describes custom task graphs, which can provide synchronization.
However, the examples in the documentation do not include map functions, only simple calculations such as calls to add and inc.
Is it possible to combine map with a custom dask graph in my case, or should I implement the synchronization by some other means that are not part of dask?

If you want to link dependencies between tasks, then you should pass the outputs from previous tasks into the inputs of another:
futures1 = e.map(func1, list1)
futures2 = e.map(func2, list2)
futures3 = e.map(func3, futures1, futures2)
For any invocation of func3, Dask will handle waiting until its inputs are ready and will send the appropriate results to that function from wherever they were computed.
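If create_list_3 could be changed to accept the upstream results directly (an assumption on my part; in your case it reads them from Postgres), then the whole pipeline can be expressed with futures, roughly like this:

futures1 = e.map(func1, list1)
futures2 = e.map(func2, list2)

# submit resolves futures nested inside the argument lists, so this task
# will not run until every result from func1 and func2 is available
list3_future = e.submit(create_list_3, futures1, futures2)

# block on the client (not on the workers) until list3 exists, then map over it
list3 = list3_future.result()
futures3 = e.map(func3, list3)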
However, it looks like you want to handle data transfer and synchronization through some other custom means. If this is so, then perhaps it would be useful to pass along some token to the call to func3.
futures1 = e.map(func1, list1)
futures2 = e.map(func2, list2)

def do_nothing(*args):
    return None

token1 = e.submit(do_nothing, futures1)
token2 = e.submit(do_nothing, futures2)

list3 = e.submit(create_list_3)

def func3(arg, tokens=None):
    ...

futures3 = e.map(func3, list3, tokens=[token1, token2])
This is a bit of a hack, but would force all func3 functions to wait until they were able to get the token results from the previous map calls.
However, I recommend trying to do something like the first option. This will allow dask to be much smarter about when it runs tasks and releases resources. Barriers like token1/token2 result in sub-optimal scheduling.

Related

error: pyspark list append operation inside foreach on dataframe gives empty list outside of loop

I am facing the following issue:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

data = [('James','Smith','M',30), ('Anna','Rose','F',41), ('Robert','Williams','M',62)]
columns = ["firstname", "lastname", "gender", "salary"]
df = spark.createDataFrame(data=data, schema=columns)

lst = []

def func2(x):
    lst = lst.append(x.firstname)

df.foreach(func2)
# df.foreach(lambda x: func2(x))

print(len(lst))
The lst variable at the end of the loop is always empty. What is the reason for this? Any fix?
Thanks!
The reason your code does not work is that lambda functions in PySpark are executed on different executors, each within its own local Python process, and hence global variables are not accessible across executors.
You can use accumulators to achieve this. However, this comes with a performance penalty, and PySpark does not provide a native list accumulator.
Solution using Accumulators
from pyspark.accumulators import AccumulatorParam

class ListParam(AccumulatorParam):
    def zero(self, init_value):
        return init_value
    def addInPlace(self, v1, v2):
        return v1 + v2

lst = spark.sparkContext.accumulator([], ListParam())

def func2(x):
    global lst
    lst += [x.firstname]

df.foreach(func2)
print(lst.value)
Output:
['James', 'Anna', 'Robert']
If you are looking to get back all the values for a particular column in PySpark, you can select that column, collect the results as Row objects and then fetch the field you are interested in.
Collecting is an expensive operation that brings all the data to the driver, and with a large volume of data it can cause the driver to fail.
[row.firstname for row in df.select("firstname").collect()]
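If the whole column does not need to be in driver memory at once, one alternative (assuming Spark 2.x or later) is to stream the rows to the driver instead of collecting them all at once:

# iterates over the selected column one partition at a time
names = [row.firstname for row in df.select("firstname").toLocalIterator()]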

How to solve "unresolved flex record" in else if statement in SML?

I want to find the list of nodes that the given nodes directly or indirectly connect to.
For example, I have a list of nodes:
[1,2]
and a list of tuples, and each of the tuples represents a direct edge:
[(1,5),(2,4),(4,6)]
So, the nodes I am looking for are
[1,2,5,4,6]
This is because 1 connects to 5, 2 connects to 4, and 4 in turn connects to 6.
To achieve this, I need a queue and a list. Each time a new node is discovered, we append it to both the queue and the list. Then we remove the first node of the queue and move on to the next node; if a new node is connected to the current node of the queue, we add it to both the queue and the list.
We keep doing this until the queue is empty, and then we return the list.
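In Python-like pseudocode (just to make the intended algorithm concrete; SML is what I actually have to use), the traversal would look roughly like this:

from collections import deque

def reachable(start_nodes, edges):
    # edges is a list of (src, dst) tuples, e.g. [(1, 5), (2, 4), (4, 6)]
    queue = deque(start_nodes)
    result = list(start_nodes)
    while queue:
        current = queue.popleft()
        for (src, dst) in edges:
            if src == current and dst not in result:
                queue.append(dst)
                result.append(dst)
    return result

# reachable([1, 2], [(1, 5), (2, 4), (4, 6)]) returns [1, 2, 5, 4, 6]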
So now, I have an append function which appends a list to another list:
fun append(xs, ys) =
    case ys of
        []       => xs
      | (y::ys') => append(xs @ [y], ys')
Then I have a function called getIndirectNode, which is intended to return the list of nodes that the given nodes are indirectly connected to, but it throws "unresolved flex record". list1 and list2 supposedly hold the same items, but list1 serves as the queue and list2 serves as the list to be returned.
fun getIndirectNode(listRoleTuples, list1, list2) =
    if list1 = []
    then list2
    else if hd(list1) = #1(hd(listRoleTuples))
    then (
        append(list1, #2(hd(listRoleTuples)) :: []);
        append(list2, #2(hd(listRoleTuples)) :: []);
        getIndirectNode(listRoleTuples, tl(list1), list2)
    )
    else
        getIndirectNode(listRoleTuples, tl(list1), list2)
If I remove the else if branch, it works perfectly fine, but that's not what I intended. The problem is in the else if branch. What can I do to fix it?
SML needs to know exactly what shape a tuple has in order to deconstruct it.
You could specify the type of the parameter - listRoleTuples : (''a * ''a) list - but using pattern matching is a better idea.
(There are many other problems with that code, but that's the answer to your question.)
It seems that one of your classmates had this exact tuple problem in a very related task.
Make sure you browse the StackOverflow Q&A's before you ask the same question again.
As for getting the indirect nodes, this can be solved by fixed-point iteration.
First you get all the direct nodes, and then you get the direct nodes of the direct nodes.
And you do this recursively until no more new nodes occur this way.
fun getDirectNodes (startNode, edges) =
    List.map #2 (List.filter (fn (node, _) => node = startNode) edges)

fun toSet xs =
    ... sort and remove duplicates ...
fun getReachableNodes (startNodes, edges) =
    let
        fun f startNode = getDirectNodes (startNode, edges)
        val startNodes = toSet startNodes
        val endNodes = toSet (List.concat (List.map f startNodes))
    in
        if startNodes = endNodes
        then endNodes
        else getReachableNodes (startNodes @ endNodes, edges)
    end
This doesn't exactly find indirect end-nodes; it finds all nodes directly or indirectly reachable by startNodes, and it includes startNodes themselves even if they're not directly or indirectly reachable by themselves.
I've tried to make this exercise easier by using sets as a datatype; it would be even neater with an actual, efficient implementation of a set type, e.g. using a balanced binary search tree. It is easier to see if there are no new nodes by adding elements to a set, since if a set already contains an element, it will be equivalent to itself before and after the addition of the element.
And I've tried to use higher-order functions when this makes sense. For example, given a list of things where I want to do the same thing to each element, List.map produces a list of results. But since the thing I want to do, getDirectNodes (startNode, edges), itself produces a list, List.map f produces a list of lists, so List.concat collapses that into a single list.
List.concat (List.map f xs)
is a pretty common thing to do.

clojure pmap - why aren't I using all the cores?

I'm attempting to use the clojure pantomime library to extract/ocr text from a large number of tif documents (among others).
My plan has been to use pmap to apply the mapping over a sequence of input data (from a Postgres database) and then update that same Postgres database with the Tika/Tesseract OCR output. This has been working OK; however, I notice in htop that many of the cores are idle at times.
Is there any way to reconcile this, and what steps can I take to determine where this may be blocking? All processing occurs on a single tif file per thread, and the threads are entirely independent of each other.
Additional info:
some Tika/Tesseract processes take 3 seconds, others take up to 90 seconds. Generally speaking, Tika is heavily CPU bound. I have ample memory available according to htop.
Postgres has no locking issues in session management, so I don't think that's holding me up.
maybe somewhere futures are waiting to be dereferenced? How can I tell where?
Any tips appreciated, thanks. Code added below.
(defn parse-a-path [{:keys [row_id, file_path]}]
  (try
    (let [start        (System/currentTimeMillis)
          mime_type    (pm/mime-type-of file_path)
          file_content (-> file_path (extract/parse) :text)
          language     (pl/detect-language file_content)]
      {:mime_type mime_type
       :file_content file_content
       :language language
       :row_id row_id
       :parse_time_in_seconds (float (/ (- (System/currentTimeMillis) start) 100))
       :record_status "doc parsed"})))
(defn fetch-all-batch []
  (t/info (str "Fetching lazy seq. all rows for batch."))
  (jdbc/query (db-connection)
              ["select
                  row_id,
                  file_path,
                  file_extension
                from the_table"]))
(defn update-a-row [{:keys [row_id, file_path, file_extension] :as all-keys}]
  (let [parse-out (parse-a-path all-keys)]
    (try
      (doall
        (jdbc/execute!
          (db-connection)
          ["update the_table
              set
                record_last_updated = current_timestamp ,
                file_content = ? ,
                mime_type = ? ,
                language = ? ,
                parse_time_in_seconds = ? ,
                record_status = ?
              where row_id = ? "
           (:file_content parse-out) ,
           (:mime_type parse-out) ,
           (:language parse-out) ,
           (:parse_time_in_seconds parse-out) ,
           (:record_status parse-out) ,
           row_id])
        (t/debug (str "updated row_id " (:row_id parse-out) " (" file_extension ") "
                      " in " (:parse_time_in_seconds parse-out) " seconds.")))
      (catch Exception _))))
(dorun
  (pmap
    #(try
       (update-a-row %)
       (catch Exception e (t/error (.getNextException e))))
    fetch-all-batch))
pmap runs the map function in parallel on batches of (+ 2 cores) items, but preserves ordering. This means that if you have 8 cores, a batch of 10 items will be processed, but a new batch will only be started once all 10 have finished.
You could create your own code that uses combinations of future, delay and deref, which would be a good academic exercise. After that, you can throw out your code and start using the claypoole library, which has a set of abstractions that cover the majority of uses of future.
For this specific case, use their unordered pmap or pfor implementations (upmap and upfor), which do exactly the same thing pmap does but do not have ordering; new items are picked up as soon as any one item in the batch is finished.
In situations where IO is the main bottleneck, or where processing times can greatly vary between items of work, it is the best way to parallelize map or for operations.
Of course you should take care not to rely on any sort of ordering for the return values.
(require '[com.climate.claypoole :as cp])

(cp/upmap (cp/ncpus)
  #(try
     (update-a-row %)
     (catch Exception e (t/error (.getNextException e))))
  fetch-all-batch)
I had a similar problem some time ago. I guess that you are making the same assumptions as me:
pmap calls f in parallel. But that doesn't mean that the work is shared equally. As you said, some items take 3 seconds whereas others take 90 seconds. A thread that finishes in 3 seconds does NOT ask the other ones to share some of the work left to do, so the finished threads just sit idle until the last one finishes.
You didn't describe exactly what your data looks like, but I will assume that you are using some kind of lazy sequence, which is bad for parallel processing. If your process is CPU-bound and you can hold your entire input in memory, then prefer clojure.core.reducers ('map', 'filter' and especially 'fold') over the lazy map, filter and the others.
In my case, these tips dropped the processing time from 34 seconds to a mere 8. Hope it helps.

Spark FlatMap function for huge lists

I have a very basic question. Spark's flatMap function allows you to emit 0, 1 or more outputs per input. So the (lambda) function you feed to flatMap should return a list.
My question is: what happens if this list is too large for your memory to handle?
I haven't implemented this yet; the question should be resolved before I rewrite my MapReduce software, which could easily deal with this by putting context.write() anywhere in my algorithm I wanted to (the output of a single mapper could easily be many gigabytes).
In case you're interested: the mapper does some sort of word count, but in fact it generates all possible substrings, together with a wide range of regular expressions matching the text (a bioinformatics use case).
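To make that concrete, a naive sketch of the kind of mapper I have in mind (not actual code) would materialize every substring of a record in one list before Spark sees any of it:

def all_substrings(record):
    # builds the complete list in memory; this is exactly what worries me
    subs = []
    for i in xrange(len(record)):
        for j in xrange(i + 1, len(record) + 1):
            subs.append(record[i:j])
    return subs

rdd.flatMap(all_substrings)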
"So the (lambda) function you feed to flatMap should return a list."
No, it doesn't have to return a list. In practice you can easily use a lazy sequence. It is probably easier to spot if you take a look at the Scala RDD.flatMap signature:
flatMap[U](f: (T) ⇒ TraversableOnce[U])
Since subclasses of TraversableOnce include SeqView or Stream you can use a lazy sequence instead of a List. For example:
val rdd = sc.parallelize("foo" :: "bar" :: Nil)
rdd.flatMap {x => (1 to 1000000000).view.map {
  _ => (x, scala.util.Random.nextLong)
}}
Since you've mentioned a lambda function, I assume you're using PySpark. The simplest thing you can do is to return a generator instead of a list:
import numpy as np
rdd = sc.parallelize(["foo", "bar"])
rdd.flatMap(lambda x: ((x, np.random.randint(1000)) for _ in xrange(100000000)))
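Applied to the kind of substring mapper sketched in the question, the same idea would look roughly like this (simplified):

def all_substrings(record):
    # yields substrings one at a time instead of building the whole list
    for i in xrange(len(record)):
        for j in xrange(i + 1, len(record) + 1):
            yield record[i:j]

rdd.flatMap(all_substrings)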
Since RDDs are lazily evaluated it is even possible to return an infinite sequence from the flatMap. Using a little bit of toolz power:
from toolz.itertoolz import iterate

def inc(x):
    return x + 1

rdd.flatMap(lambda x: ((i, x) for i in iterate(inc, 0))).take(1)

F# append to list in a loop functionally

I am looking to convert this code to use F# list instead of the C# list implementation.
I am connecting to a database and running a query; with C# I would usually create a list of a type and keep adding to the list while the data reader has values. How would I go about converting this to use an F# list?
let queryDatabase (connection: NpgsqlConnection) (queryString: string) =
    let transactions = new List<string>()
    let command = new NpgsqlCommand(queryString, connection)
    let dataReader = command.ExecuteReader()
    while dataReader.Read() do
        let json = dataReader.GetString(1)
        transactions.Add(json)
    transactions
The tricky thing here is that the input data source is inherently imperative (you have to call Read, which mutates the internal state). So you're crossing from the imperative to the functional world, and so you cannot avoid all mutation.
I would probably write the code using a list comprehension, which keeps a similar, familiar structure but removes the explicit mutation:
let queryDatabase (connection: NpgsqlConnection) (queryString: string) =
    [ let command = new NpgsqlCommand(queryString, connection)
      let dataReader = command.ExecuteReader()
      while dataReader.Read() do
          yield dataReader.GetString(1) ]
Tomas' answer is the solution to use in production code. But for the sake of learning F# and functional programming, here is a snippet using tail recursion and the cons operator:
let drToList (dr: DataReader) =
    let rec toList acc =
        if not (dr.Read()) then acc
        else toList <| dr.GetString(1) :: acc
    toList []
This tail-recursive function is compiled into imperative-like code, so there is no stack overflow and execution is fast.
Also, I advise you to look at this C# thread and the F# documentation to see how to properly dispose of your command. Basically, you need something like this:
let queryDb (conn: NpgsqlConnection) (qStr: string) =
    use cmd = new NpgsqlCommand(qStr, conn)
    cmd.ExecuteReader() |> drToList
And if we go deeper, you should also think about exception handling.