Akka thread pool configuration

Based on my use case, I would like to create an Akka actor per item id. This way each item's actions are executed by one and only one actor (and therefore one thread at a time), but different items can be processed concurrently. I have stumbled on how to configure the number of threads for my system, since I may have 1000s of items coming in at the same time and would like to use some maximum number of threads, say 20, to service those 1000 items. Is this possible with Akka?
Thanks,
Ravi

Based on the docs, you should try these parameters in your Akka config:
# This will be used if you have set "executor = "fork-join-executor""
fork-join-executor {
  # Min number of threads to cap factor-based parallelism number to
  parallelism-min = 8
  # The parallelism factor is used to determine thread pool size using the
  # following formula: ceil(available processors * factor). Resulting size
  # is then bounded by the parallelism-min and parallelism-max values.
  parallelism-factor = 3.0
  # Max number of threads to cap factor-based parallelism number to
  parallelism-max = 64
}
I'm assuming you have not changed your default-executor parameter because if you had you would already know where to look.
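If the goal is a hard cap of roughly 20 threads shared by all the per-item actors, one minimal sketch is a dedicated dispatcher pinned to 20 threads. The dispatcher name item-dispatcher is made up here for illustration; the keys are the same fork-join-executor settings shown above:
item-dispatcher {
  type = Dispatcher
  executor = "fork-join-executor"
  fork-join-executor {
    # pin the pool at ~20 threads regardless of how many cores the box has
    parallelism-min = 20
    parallelism-factor = 1.0
    parallelism-max = 20
  }
}
The per-item actors are then created with Props(...).withDispatcher("item-dispatcher") (or the dispatcher is assigned under akka.actor.deployment), so all 1000 actors share those ~20 threads while each individual actor still processes its own messages one at a time.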


Can the minimum time to schedule tasks be found without brute force?

If I have a list of integers representing the time each task takes to complete, and I have x workers that each work on one task at a time until it finishes, can I find the minimum time it could possibly take to complete all tasks? I do not need the exact assignment that achieves this minimum completion time, just the time itself.
For example, with the list [2, 4, 6] and 2 workers: if I start with 2 and 4, then when 2 finishes 6 starts, so it takes 8 seconds to complete all tasks. However, if I start with 6 and 2, then when 2 finishes 4 starts and finishes at the same time as 6, so the tasks take only 6 seconds in this order.
Is there a way of knowing that it will take only 6 seconds, with better than n! (brute force) complexity, that guarantees it is the minimum time possible? Thanks for any help in advance; feel free to ask questions if I left out any details or you're confused!
edit: please help :(
edit 2: is it even possible? Anyone know?
In the case of a single worker, the total time required is simply the sum of all task times.
jobs = [ 2, 4, 6, etc... ]
time_required = SUM( jobs )
In the case of two workers, the total time required for a given ordering of jobs can be determined by assigning each task to whichever worker currently has the lowest total, then taking the highest total over all workers:
define type worker = vector<time_t>
define type workers = min_priority_queue<worker> using worker.sum() # the current worker.sum() (the sum of `time_t` values in `vector<time_t>`) is the priority-queue key
define type task = int

jobs = [ 2, 4, 6, etc... ]

# Use two workers:
workers.add( new worker )
workers.add( new worker )

# Iterate once through each job:
foreach( task t in jobs ) {
    minWorker = workers.extractMin()  # priority queue "extract-min" operation
    minWorker.add( t )
    workers.add( minWorker )          # re-insert so the queue reflects the worker's new sum
}

# Determine which worker will work the longest time:
time_required = max( worker.sum() for each worker in workers )  # scan all workers for the largest total
Because this is an actual (greedy) schedule, the resulting time_required is a single sample that lies between the lower and upper bounds - which isn't exactly what you're after, but because it can be computed quickly it's a good starting point.
The above algorithm generalises to any number of workers just by adding more workers to the priority queue. Each job needs one extract-min and one re-insert, each O(log x) for x workers, so the whole pass runs in O(n log x) time, where n is the number of jobs. Note that it does not guarantee the true minimum completion time - finding that exactly is the NP-hard multiprocessor-scheduling (makespan) problem - but it gives a usable estimate between the bounds below.
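As a rough, runnable sketch of that greedy pass in Python (the name greedy_makespan is made up; heapq keeps the least-loaded worker at the front of the queue):
import heapq

def greedy_makespan(jobs, worker_count):
    """Assign each job to the currently least-loaded worker; return the resulting finish time."""
    workers = [(0, i) for i in range(worker_count)]   # (total_time, worker_index)
    heapq.heapify(workers)
    for job in jobs:
        total, idx = heapq.heappop(workers)           # worker with the lowest total so far
        heapq.heappush(workers, (total + job, idx))   # give it the job and re-insert
    return max(total for total, _ in workers)         # the busiest worker determines the finish time

print(greedy_makespan([2, 4, 6], 2))                          # -> 8: the result depends on the job order
print(greedy_makespan(sorted([2, 4, 6], reverse=True), 2))    # -> 6: longest-job-first usually does better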
As for computing bounds in less time than O(n!) time... that's tricky (at least for me, as it's been a few years since I last cracked-open my copy of CLRS).
A simple lower bound for x workers, for any order of jobs, is the largest single value in the job set; the optimum is also at least the total work divided by x (i.e. ceil(sum(jobs) / x)), so take the larger of the two.
An upper bound for x workers, for any order of jobs, is the sum of the largest ceil(n / x) jobs (so given 2 workers it's the sum of the largest half of the jobs, for 3 workers the largest third, for 4 workers the largest quarter, etc.); rounding up matters when n is not a multiple of x. This requires you to sort the set first, taking O(n log n).
jobs = [ 2, 4, 6, etc... ]
worker_count = 2
jobs.sortDescending() # O(n log n)
# if there are 50 jobs and 2 workers, then take the 25 longest jobs and sum them...
# ...that's an upper bound on the time required for 2 workers, as it assumes that 1 unlucky worker gets all of the longest tasks
upper_bound = jobs.take( ceil( jobs.count / worker_count ) ).sum()
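The two bounds as a small Python sketch (makespan_bounds is a made-up name; the lower bound also folds in the total-work-divided-by-workers argument):
import math

def makespan_bounds(jobs, worker_count):
    """Cheap lower/upper bounds on the best possible completion time."""
    lower = max(max(jobs), int(math.ceil(sum(jobs) / float(worker_count))))  # longest job, or total work spread evenly
    k = int(math.ceil(len(jobs) / float(worker_count)))                      # round up so the bound holds when len(jobs) % worker_count != 0
    upper = sum(sorted(jobs, reverse=True)[:k])                              # one unlucky worker gets the k longest jobs
    return lower, upper

print(makespan_bounds([2, 4, 6], 2))   # -> (6, 10) for the example above; the true optimum (6) sits inside that range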

How to get current time in CP when using Intervals for scheduling

I am trying to schedule tasks on different machines. These machines have dynamically available resources, for example:
machine 1: max capacity 4 core.
At T=t1 => available CPU = 2 core;
At T=t2 => available CPU = 1 core;
Each interval has a fixed time (Ex: 1 minute).
So in CPLEX, I have a cumulFunction to sum the resources used on a machine:
cumulFunction cumuls[host in Hosts] =
    sum(task in Jobs) pulse(itvs[task][host], requests[task]);
Now the problem is in the constraint:
forall(host in Hosts) {
    cumuls[host] <= ftoi(available_res_function[host](<<current period>>));
}
I can't find a way to get the current period so that I can compare the resources used to the resources available in that specific period.
PS: available_res_function is a stepFunction of the available resources.
Thank you so much for your help.
What you can do is add a set of pulses to your cumul function.
For instance, in the sched_cumul example you could change:
cumulFunction workersUsage =
sum(h in Houses, t in TaskNames) pulse(itvs[h][t],1);
into
cumulFunction workersUsage =
sum(h in Houses, t in TaskNames) pulse(itvs[h][t],1)+pulse(1,40,3);
if you want to say that 3 fewer workers are available between times 1 and 40.

Limit total CPU usage in python multiprocessing

I am using multiprocessing.Pool.imap to run many independent jobs in parallel using Python 2.7 on Windows 7. With the default settings, my total CPU usage is pegged at 100%, as measured by Windows Task Manager. This makes it impossible to do any other work while my code runs in the background.
I've tried limiting the number of processes to be the number of CPUs minus 1, as described in How to limit the number of processors that Python uses:
pool = Pool(processes=max(multiprocessing.cpu_count() - 1, 1))
for p in pool.imap(func, iterable):
    ...
This does reduce the total number of running processes. However, each process just takes up more cycles to make up for it. So my total CPU usage is still pegged at 100%.
Is there a way to directly limit the total CPU usage - NOT just the number of processes - or failing that, is there any workaround?
The solution depends on what you want to do. Here are a few options:
Lower priorities of processes
You can nice the subprocesses. This way, even though they still eat 100% of the CPU, the OS gives preference to other applications when you start them. If you want to leave a work-intensive computation running in the background on your laptop and don't care about the CPU fan running all the time, then setting the nice value with psutil is your solution. The following test script runs on all cores for long enough that you can see how it behaves.
from multiprocessing import Pool, cpu_count
import math
import psutil
import os

def f(i):
    return math.sqrt(i)

def limit_cpu():
    "is called at every process start"
    p = psutil.Process(os.getpid())
    # set to lowest priority; this is Windows-only, on Unix use p.nice(19)
    p.nice(psutil.BELOW_NORMAL_PRIORITY_CLASS)

if __name__ == '__main__':
    # start "number of cores" processes
    pool = Pool(None, limit_cpu)
    for p in pool.imap(f, range(10**8)):
        pass
The trick is that limit_cpu is run at the beginning of every process (see the initializer argument in the docs). Whereas Unix has niceness levels from -20 (highest priority) to 19 (lowest priority), Windows has a few distinct priority classes. BELOW_NORMAL_PRIORITY_CLASS probably fits your requirements best; there is also IDLE_PRIORITY_CLASS, which tells Windows to run your process only when the system is idle.
You can view the priority if you switch to the Details view in Task Manager and right-click on the process.
Lower number of processes
Although you have rejected this option, it might still be a good one: say you limit the number of subprocesses to half the CPU cores using pool = Pool(max(cpu_count() // 2, 1)); then the OS initially runs those processes on half the CPU cores, while the others stay idle or just run the other applications currently running. After a short time, the OS reschedules the processes and may move them to other CPU cores, etc. Both Windows and Unix-based systems behave this way.
Windows: running 2 processes on 4 cores.
OSX: running 4 processes on 8 cores.
You see that both OSes balance the processes between the cores, although not evenly, so you still see a few cores with higher percentages than others.
Sleep
If you absolutely want to be sure that your processes never eat 100% of a certain core (e.g. if you want to prevent the CPU fan from spinning up), then you can sleep in your processing function:
from time import sleep
import math

def f(i):
    sleep(0.01)
    return math.sqrt(i)
This makes the OS "schedule out" your process for 0.01 seconds for each computation and makes room for other applications. If there are no other applications, the CPU core stays partly idle, so it will never hit 100%. You'll need to play around with different sleep durations; it will also vary from computer to computer. If you want to make it very sophisticated, you could adapt the sleep depending on what cpu_times() reports.
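If you do want to go down that road, here is a rough sketch of the idea, assuming psutil is installed; throttled and TARGET_FRACTION are made-up names and the 50% target is only an example:
import os
import time
import psutil

TARGET_FRACTION = 0.5            # made-up target: aim for ~50% of one core per worker process
_proc = psutil.Process(os.getpid())

def throttled(func, *args):
    """Run func(*args), then sleep just long enough to stay near TARGET_FRACTION."""
    cpu_before = sum(_proc.cpu_times()[:2])      # user + system CPU seconds used so far
    wall_before = time.time()
    result = func(*args)
    cpu_used = sum(_proc.cpu_times()[:2]) - cpu_before
    elapsed = time.time() - wall_before
    # choose a sleep so that cpu_used / (elapsed + sleep) is roughly TARGET_FRACTION
    time.sleep(max(cpu_used / TARGET_FRACTION - elapsed, 0))
    return result

# usage inside the worker function, e.g.: return throttled(math.sqrt, i)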
On the OS level
you can use nice to set the priority of a single command. You could also start a Python script with nice. (The excerpt below is from: http://blog.scoutapp.com/articles/2014/11/04/restricting-process-cpu-usage-using-nice-cpulimit-and-cgroups)
nice
The nice command tweaks the priority level of a process so that it runs less frequently. This is useful when you need to run a CPU-intensive task as a background or batch job. The niceness level ranges from -20 (most favorable scheduling) to 19 (least favorable). Processes on Linux are started with a niceness of 0 by default. The nice command (without any additional parameters) will start a process with a niceness of 10. At that level the scheduler will see it as a lower-priority task and give it less CPU resources. Start two matho-primes tasks, one with nice and one without:
nice matho-primes 0 9999999999 > /dev/null &
matho-primes 0 9999999999 > /dev/null &
Now run top.
As a function in Python
Another approach is to use psutil to check the CPU load average for the past minute, and then have your worker processes check it: spool up another process if you are below the specified CPU load target, and sleep or kill the process if you are above it. This will get out of your way when you are using your computer, but will maintain a roughly constant CPU load. (Note that os.getloadavg(), used below, is only available on Unix-like systems, not on Windows.)
# Import Python modules
import time
import os
import multiprocessing
import psutil
import math
from random import randint

# Main task function
def main_process(item_queue, args_array):
    # Go through each item in the queue passed in.
    while not item_queue.empty():
        # Get the next item in the queue
        item = item_queue.get()
        # Create a random number to simulate workers that
        # are not all going to take the same time
        randomizer = randint(100, 100000)
        for i in range(randomizer):
            algo_seed = math.sqrt(math.sqrt(i * randomizer) % randomizer)
        # Check if this process should continue based on the current load balance
        if spool_down_load_balance(args_array):
            print "Process " + str(os.getpid()) + " saying goodnight..."
            break

# This function will build a queue and start the load balancer process
def start_thread_process(queue_pile, args_array):
    # Create a Queue to hold the item pile and share it between processes
    item_queue = multiprocessing.Queue()
    # Put all the initial items into the queue
    for item in queue_pile:
        item_queue.put(item)
    # Create the load balancer process
    load_balance_process = multiprocessing.Process(target=spool_up_load_balance, args=(item_queue, args_array))
    # Start the load balancer
    load_balance_process.start()
    # This .join() call blocks until the load balancer process finishes.
    load_balance_process.join()

# Spool down the worker count when the load is too high
def spool_down_load_balance(args_array):
    # Get the count of CPU cores
    core_count = psutil.cpu_count()
    # Calculate the short-term load average of the past minute
    one_minute_load_average = os.getloadavg()[0] / core_count
    # If the load is above the target, return True to kill the process
    if one_minute_load_average > args_array['cpu_target']:
        print "-Unacceptable load balance detected. Killing process " + str(os.getpid()) + "..."
        return True

# Load balancer process function
def spool_up_load_balance(item_queue, args_array):
    print "[Starting load balancer...]"
    # Get the count of CPU cores
    core_count = psutil.cpu_count()
    # While there are still items in the queue
    while not item_queue.empty():
        print "[Calculating load balance...]"
        # Check the 1 minute average CPU load
        # os.getloadavg() returns the 1, 5 and 15 minute load averages
        one_minute_load_average = os.getloadavg()[0] / core_count
        # If the load average is much less than the target, start a group of new processes
        if one_minute_load_average < args_array['cpu_target'] / 2:
            # Print message and log that the load balancer is starting another group of processes
            print "Starting another thread group due to low CPU load balance of: " + str(one_minute_load_average * 100) + "%"
            time.sleep(5)
            # Start another group of processes
            for i in range(args_array['thread_group_size']):
                start_new_thread = multiprocessing.Process(target=main_process, args=(item_queue, args_array))
                start_new_thread.start()
            # Allow the added processes to have an impact on the CPU load
            # before checking the one minute average again
            time.sleep(20)
        # If the load average is less than the target, start a single process
        elif one_minute_load_average < args_array['cpu_target']:
            # Print message and log that the load balancer is starting another process
            print "Starting another single thread due to low CPU load balance of: " + str(one_minute_load_average * 100) + "%"
            # Start another process
            start_new_thread = multiprocessing.Process(target=main_process, args=(item_queue, args_array))
            start_new_thread.start()
            # Allow the added process to have an impact on the CPU load
            # before checking the one minute average again
            time.sleep(20)
        else:
            # Print CPU load balance
            print "Reporting stable CPU load balance: " + str(one_minute_load_average * 100) + "%"
            # Sleep before checking the load average again
            time.sleep(20)

if __name__ == "__main__":
    # Set the queue size
    queue_size = 10000
    # Define an arguments array to pass around all the values
    args_array = {
        # Set some initial CPU load values as a CPU usage goal
        "cpu_target": 0.60,
        # When CPU load is significantly low, start this number of processes
        "thread_group_size": 3
    }
    # Create an array of fixed length to act as the queue
    queue_pile = list(range(queue_size))
    # Set main process start time
    start_time = time.time()
    # Start the main process
    start_thread_process(queue_pile, args_array)
    print '[Finished processing the entire queue! Time consuming:{0} Time Finished: {1}]'.format(time.time() - start_time, time.strftime("%c"))
In Linux:
Use nice() with a numerical value:
# on Unix use p.nice(10) for a very low priority
p.nice(10)
https://en.wikipedia.org/wiki/Nice_(Unix)#:~:text=nice%20is%20a%20program%20found,CPU%20time%20than%20other%20processes.

AWS EMR Parallel Mappers?

I am trying to determine how many nodes I need for my EMR cluster. As part of best practices the recommendations are:
(Total Mappers needed for your job + Time taken to process) / (per instance capacity + desired time) as outlined here: http://www.slideshare.net/AmazonWebServices/amazon-elastic-mapreduce-deep-dive-and-best-practices-bdt404-aws-reinvent-2013, page 89.
The question is how to determine how many parallel mappers each instance will support, since AWS doesn't publish this: https://aws.amazon.com/emr/pricing/
Sorry if i missed something obvious.
Wayne
To determine the number of parallel mappers, you will need to check the EMR documentation called Task Configuration, where EMR has a predefined set of configurations for every instance type which determines the number of mappers/reducers:
http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hadoop-task-config.html
For example: let's say you have 5 m1.xlarge core nodes. According to the default mapred-site.xml configuration values for that instance type from the EMR docs, we have:
mapreduce.map.memory.mb = 768
yarn.nodemanager.resource.memory-mb = 12288
yarn.scheduler.maximum-allocation-mb = 12288 (same as above)
You can simply divide the latter by the former to get the maximum number of mappers supported by one m1.xlarge node: 12288 / 768 = 16.
So, for the 5-node cluster, that is a maximum of 16 * 5 = 80 mappers that can run in parallel (considering a map-only job). The same math gives the maximum number of parallel reducers (30). You can do similar calculations for a combination of mappers and reducers.
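To make the arithmetic explicit, a tiny sketch (max_parallel_mappers is a made-up helper; the defaults are the m1.xlarge values quoted above):
def max_parallel_mappers(node_count,
                         node_container_mb=12288,  # yarn.nodemanager.resource.memory-mb
                         mapper_mb=768):           # mapreduce.map.memory.mb
    per_node = node_container_mb // mapper_mb      # 12288 / 768 = 16 mappers per node
    return per_node * node_count

print(max_parallel_mappers(5))                     # -> 80 for a 5-node m1.xlarge cluster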
So, if you want to run more mappers in parallel, you can either re-size the cluster or reduce mapreduce.map.memory.mb (and its heap, mapreduce.map.java.opts) on every node and restart the NodeManagers.
To understand what the above mapred-site.xml properties mean and why you need to do these calculations, you can refer to:
https://hadoop.apache.org/docs/r2.7.2/hadoop-yarn/hadoop-yarn-common/yarn-default.xml
Note: the above calculations and statements are true if EMR stays in its default configuration, using the YARN capacity scheduler with the DefaultResourceCalculator. If, for example, you configure your capacity scheduler to use the DominantResourceCalculator, it will consider vCPUs + memory on every node (not just memory) to decide on the number of parallel mappers.

Akka durable mailbox results in dead letters

First, I am not an Akka newbie (been using it for 2+ years). I have a high throughput (many millions of msgs/min/node) application that does heavy network I/O. An initial actor (backed by a RandomRouter) receives messages and distributes them to the appropriate child actor for processing:
private val distributionRouter = system.actorOf(Props(new DistributionActor)
  .withDispatcher("distributor-dispatcher")
  .withRouter(RandomRouter(distrib.distributionActors)), "distributionActor")
The application has been highly tuned and has excellent performance. I'd like to make it more fault tolerant by using a durable mailbox in front of DistributionActor. Here's the relevant config (only change is the addition of the file-based mailbox):
akka.actor.mailbox.file-based {
  directory-path = "./.akka_mb"
  max-items = 2147483647
  # attempting to add an item after the queue reaches this size (in bytes) will fail.
  max-size = 2147483647 bytes
  # attempting to add an item larger than this size (in bytes) will fail.
  max-item-size = 2147483647 bytes
  # maximum expiration time for this queue (seconds).
  max-age = 3600s
  # maximum journal size before the journal should be rotated.
  max-journal-size = 16 MiB
  # maximum size of a queue before it drops into read-behind mode.
  max-memory-size = 128 MiB
  # maximum overflow (multiplier) of a journal file before we re-create it.
  max-journal-overflow = 10
  # absolute maximum size of a journal file until we rebuild it, no matter what.
  max-journal-size-absolute = 9223372036854775807 bytes
  # whether to drop older items (instead of newer) when the queue is full
  discard-old-when-full = on
  # whether to keep a journal file at all
  keep-journal = on
  # whether to sync the journal after each transaction
  sync-journal = off
  # circuit breaker configuration
  circuit-breaker {
    # maximum number of failures before opening breaker
    max-failures = 3
    # duration of time beyond which a call is assumed to be timed out and considered a failure
    call-timeout = 3 seconds
    # duration of time to wait until attempting to reset the breaker during which all calls fail-fast
    reset-timeout = 30 seconds
  }
}

distributor-dispatcher {
  executor = "thread-pool-executor"
  type = Dispatcher
  thread-pool-executor {
    core-pool-size-min = 20
    core-pool-size-max = 20
    max-pool-size-min = 20
  }
  throughput = 100
  mailbox-type = akka.actor.mailbox.FileBasedMailboxType
}
As soon as I introduced this I noticed lots of dropped messages. When I profile it via the Typesafe Console, I see a bunch of dead letters (like 100k or so per 1M). My mailbox files are only 12MB per actor, so they're not even close to the limit. I also set up a dead letters listener to count the dead letters so I could run it outside the profiler (maybe an instrumentation issue, I thought?). Same results.
Any idea what may be causing the dead letters?
I am using Akka 2.0.4 on Scala 2.9.2.
Update:
I have noticed that the dead letters seem to have been bound for a couple of child actors owned by DistributionActor. I don't understand why changing the parent's mailbox has any effect on this, but this is definitely the behavior.