I have a piece of code that creates several threads on a Glue job like this:
from threading import Thread

threads = []
for data_chunk in data_chunks:
    json_data = get_bulk_upload_json(data_chunk)
    threads.append(Thread(target=my_func, args=(arg1, arg2)))
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
Where data_chunks is a list of dictionaries. Due to the nature of the data, this consumes a lot of memory. The Glue job keeps failing with a memory error. After further debugging, I found that it crashes as soon as it reaches the memory limit of just one of the workers, meaning it is not using the memory of the other workers at all. Further proof of this is that no matter how many more workers I add, the same error happens in the same part of the process.
How can I use threads and distribute them between the workers?
It seems that you are misusing AWS Glue.
You shouldn't use Threads, since Glue does the parallelization for you. It is a managed version of Spark. Instead you should use Spark / Glue functions, which will then be executed on the workers.
Problem
I am currently writing a stream parser that parses multiple feeds that are coming in very fast. Let's assume that it is a twitter stream of accounts' tweets where there are X accounts. I am trying to make the processing as concurrent as possible, while also making sure each account's tweets are processed sequentially.
Each tweet requires some parsing that takes some time. So if I were to use a naive thread pool, I would run into a problem where some tweets are assigned to threads that finish sooner, and a single account's tweets may be logged out of order.
This task can be approached using a producer-consumer model, where in this case there is only one producer: the twitter feed. The consumers are where I am uncertain.
The Approach
My idea to tackle this is fairly simple: map each account to a bucket numbered between 1 and T, where T is the number of threads available on my computer. Then process each bucket sequentially. This way all buckets can be run concurrently, and no single account's tweets will be logged out of order.
Here is a crude visualization of what that looks like with two threads and three accounts:
As you can see, since we have two threads, Accounts 1 and 3 map to the same thread but maintain internal consistency. Threads 1 & 2 can run concurrently with no conflicts ever arising.
This structure is also very extendable. If I have more producers with Accounts 4 & 5, for example, I can still add them to threads 2 and 1, respectively, without losing internal account consistency.
What I've Done So Far/The Code
I'm not sure how to structure this programmatically. I'm fairly new to multi-threading in C++ so I'm using modified code from this blog post as a way to structure my file.
I'd take a read-through if you have the time, but basically my code is a minimal example replicating this process. There are 5 buckets. The tweet parsing is simulated by sleeping for 1 second. I assign each task a mutex to lock based on its bucket. This is done using a simple mutex = mutex_map[task_id % NUM_BUCKETS].
The code is available here, although the VM is limited to 2 threads. If we scale up to 11 threads (on my machine), we run into race conditions where some threads beat the others. Essentially what happens is this:
The machine has 11 threads initially available.
It assigns Task 0, 5, and 10 to some threads in the thread pool.
Task 0 goes first, gets the mutex and locks it. Tasks 5 and 10 are waiting because the mutex for bucket 0 is locked.
Once Task 0 is finished, sometimes Task 10 goes first, and sometimes Task 5 goes first.
Now the solution is to just limit the thread pool size to NUM_BUCKETS, but there is a core problem I'm trying to solve here, which is that what I want to happen in the background is not being implemented.
Solution?
Does anyone have any ideas on how to approach this? How do I enforce consistency within a bucket? I basically want to assign each task to a specific thread based on the hash. I'm not sure how to do so, as the thread pool is what manages this for me...
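One way to get the per-bucket ordering described above is to give each bucket its own worker thread and its own FIFO queue, instead of sharing one pool and locking a mutex per bucket. The following is only a minimal sketch of that idea, written in Python rather than C++ for brevity; the names process_in_buckets, handle, and the (account_id, payload) task shape are assumptions made for illustration and do not come from the original code.

```python
import queue
import threading


def process_in_buckets(tasks, handle, num_buckets):
    """Process (account_id, payload) tasks with one worker thread per bucket.

    Every account maps to exactly one bucket, and each bucket is drained by a
    single thread in FIFO order, so one account's tasks cannot be reordered.
    Returns one log per bucket, in the order the items were processed.
    """
    queues = [queue.Queue() for _ in range(num_buckets)]
    logs = [[] for _ in range(num_buckets)]

    def worker(bucket):
        while True:
            item = queues[bucket].get()
            if item is None:  # sentinel: no more work for this bucket
                break
            account_id, payload = item
            logs[bucket].append((account_id, handle(payload)))

    threads = [threading.Thread(target=worker, args=(b,)) for b in range(num_buckets)]
    for t in threads:
        t.start()

    # The single producer routes each task to its bucket's queue.
    for account_id, payload in tasks:
        queues[account_id % num_buckets].put((account_id, payload))

    for q in queues:
        q.put(None)
    for t in threads:
        t.join()
    return logs
```

Because ordering comes from the queue rather than from which thread wins a lock, the number of hardware threads no longer affects correctness, only how many buckets make progress at the same time.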
I am using Qt SQL, which is a blocking API, so I have to execute SQL code in a separate thread (QtConcurrent::run) and return a (Q)Future.
Something like this:
QFuture<QString> future = QtConcurrent::run( []() { /* some SQL code */ } );
auto watcher = new QFutureWatcher<QString>();
// Connect before calling setFuture() so the finished signal cannot be missed.
connect(watcher, &QFutureWatcher<QString>::finished,
        [future]() { /* code to execute after future is finished */ });
watcher->setFuture(future);  // watcher is a pointer, so use -> rather than .
But I learned that threading is costly: every context switch is expensive. So it looks like a waste of CPU to create a new thread just to wait for a result from the MySQL server. My application is going to run on a single-core virtual machine on Google Cloud anyway. Is there any way I can execute Qt SQL code asynchronously without creating a new thread?
I was also wondering how other APIs like Qt Networking implement an asynchronous API without creating a new thread. Or am I wrong, and they do create a new thread under the hood?
Many threaded applications run on a single core. Flushing cache to run on a separate core is also expensive. Use the right tool for the job. There's nothing wrong with threads.
That said, if you really want to run on a single thread use a workqueue to keep track of async task progress. The libevent library does this for you, but there are others. You just run a polling loop adding work onto the queue and executing callbacks when a task needs attention or completes.
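The polling loop with callbacks described above can be sketched roughly as follows. This is an illustration in Python using the standard selectors module, not libevent and not Qt; the function name poll_once and the use of a local socket pair as a stand-in for a database connection are assumptions made for the example.

```python
import selectors
import socket


def poll_once(message=b"row 1"):
    """Single-threaded readiness loop: the OS reports when data is available.

    A local socket pair stands in for a connection to a server. No extra
    thread is created; the callback runs only once the selector reports the
    socket as readable.
    """
    sel = selectors.DefaultSelector()
    reader, writer = socket.socketpair()
    reader.setblocking(False)
    received = []

    def on_readable(sock):
        # Callback invoked by the loop when the socket has data to read.
        received.append(sock.recv(1024))

    sel.register(reader, selectors.EVENT_READ, on_readable)
    writer.send(message)  # simulate the server replying

    # One iteration of the polling loop: wait for events, run their callbacks.
    for key, _events in sel.select(timeout=1.0):
        key.data(key.fileobj)

    sel.unregister(reader)
    reader.close()
    writer.close()
    sel.close()
    return received
```

This only works when the underlying API exposes a non-blocking handle to wait on; a blocking API such as the one in the question does not, which is why a worker thread is the usual answer there.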
By using QtConcurrent::run you have already solved one problem, the cost of creating a thread, because it uses a thread pool.
When it comes to context switches, first try to measure them with perf stat and then, depending on the situation, optimize. If it's just simple queries, then probably the vast majority of context switches come from the system, not your app.
Doing something async means that you can start a task and move on with your current code without waiting for the result. But usually such a task, e.g. an SQL query, will spawn a thread/process or make a request to the OS.
Qt Networking, for example, issues a read request and the OS signals (epoll) when data arrives. But on a single core the OS will interrupt your thread anyway.
If you have many small queries, you could try to reduce their number or cache the results.
I have been working with the Akka actor model. I have a use case where more than 1000 actors will be active and I have to process those actors. I thought of controlling the thread count through configuration defined in application.conf.
But the number of dispatcher threads created in my application leaves me unable to tune the dispatcher configuration. Each time I restart my application, I see a different number of dispatcher threads created (I checked this with a thread dump after each start).
The thread count is not even equal to the value I defined in parallelism-min. Because of this low thread count, my application processes very slowly.
I checked the number of cores on my machine with the code below:
Runtime.getRuntime().availableProcessors();
It displays 40. But the number of dispatcher threads created is less than 300, even though I configured parallelism as 500.
Following is my application.conf file:
consumer-dispatcher {
  type = "Dispatcher"
  executor = "fork-join-executor"
  fork-join-executor {
    parallelism-min = 500
    parallelism-factor = 20.0
    parallelism-max = 1000
  }
  shutdown-timeout = 1s
  throughput = 1
}
On what basis does Akka create dispatcher threads internally, and how can I increase my dispatcher thread count to increase parallel processing of actors?
X-Post from discuss.lightbend.com
First let me answer the question directly.
A fork-join-executor will be backed by a java.util.concurrent.ForkJoinPool with its parallelism set to the implied parallelism from the dispatcher config (parallelism-factor * processors, but no larger than parallelism-max and no less than parallelism-min). So, in your case, 40 * 20.0 = 800.
And while I’m no expert on the implementation of ForkJoinPool, the source for the Java implementation says “All worker thread creation is on-demand, triggered by task submissions, replacement of terminated workers, and/or compensation for blocked workers.” and it has methods like getActiveThreadCount(), so it’s clear that ForkJoinPool doesn’t just naively create a giant pool of workers.
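The arithmetic behind that figure can be written out as a small sketch. This is Python used purely as a calculator, not Akka code; the function name implied_parallelism is made up for the example, and rounding the product up before clamping is an assumption about how the scaling is done.

```python
import math


def implied_parallelism(processors, factor, parallelism_min, parallelism_max):
    """Scale the core count by the factor, then clamp it into [min, max]."""
    scaled = math.ceil(processors * factor)
    return min(max(scaled, parallelism_min), parallelism_max)
```

With the values from the question (40 processors, factor 20.0, min 500, max 1000) the scaled value already lies inside the bounds, so neither limit applies. This number is only the target parallelism of the pool, not a count of threads created up front.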
In other words, what you are seeing is expected: it’s only going to create threads as they are needed. If you really must have a gigantic pool of worker threads you could create a thread-pool-executor with a fixed-pool-size of 800. This would give you the implementation you are looking for.
But, before you do so, I think you are entirely missing the point of actors and Akka. One of the reasons that people like actors is that they are much more lightweight than threads and can give you a lot more concurrency than a thread. (Also note that concurrency != parallelism as noted in the documentation on concepts.) So trying to create a pool of 800 threads to back 1000 actors is very wasteful. In the akka docs introduction it highlights "Millions of actors can be efficiently scheduled on a dozen of threads".
I can’t tell you exactly how many threads you need without knowing your application (for example, whether you have blocking behavior), but the defaults (which would give you a parallelism factor of 20) are probably just fine. Benchmark to be certain, but I really don’t think you have a problem with too few threads. (The ForkJoinPool behavior you are observing seems to confirm this.)
Using a dask distributed cluster, I've noticed that several of the futures of long-running tasks switch from pending to finished, while others switch from pending to lost.
I suspect that some of the lost tasks are still running, as I see dask-worker processes with high CPU usage even when no futures have the status pending anymore.
What exactly does lost mean here? Can long-running tasks (hours) be classified as lost because they might stop the worker from reporting back to the scheduler? What else could cause the state lost, and how does the scheduler react to it?
This means that for some reason the scheduler no longer has the information necessary to execute this task. Commonly this is due to non-resilient data being lost by a worker going down, such as if you explicitly scatter a piece of data to a single worker and then that worker fails.
>>> future = client.scatter(123)
>>> x = client.submit(f, future)
... worker holding future/123 dies
>>> x.status
'lost'
This is rare in general though. Usually if a worker goes down the scheduler can replicate all of the work for a particular task elsewhere.
As always, providing a minimal complete verifiable example can help to isolate what's going on in your particular situation.
I'm using quite a few cfthreads in a scheduled task (because cf runs out of memory otherwise), and now I'm getting the following error:
Cannot create a new thread because the task queue has reached it maximum
limit 5000.
So here are my questions:
what is the "task queue" exactly, and where are the docs?
how do I increase this limit?
how can I determine what the limit is dynamically? and how many threads are already in the queue?
Why not use the run-join idiom I provided as an answer to another question of yours: many queries in a task to generate json? You could alter that code example to create several threads and then join them if you want things to work asynchronously. In addition, having as many threads as your question describes can actually slow things down, because the server spends too much time context switching between threads.
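The general shape of a run-then-join approach that keeps the number of live threads small might look like the sketch below. It is written in Python as an analogy, not in CFML, and it is not the code from the linked answer; run_in_batches, work, and batch_size are names invented for the example.

```python
from threading import Thread


def run_in_batches(items, work, batch_size=4):
    """Start a small batch of threads, join them all, then start the next batch.

    At most batch_size threads exist at any time, so work never piles up in a
    large queue of pending threads.
    """
    results = [None] * len(items)

    def runner(index):
        # Each thread writes to its own slot, so no locking is needed here.
        results[index] = work(items[index])

    for start in range(0, len(items), batch_size):
        stop = min(start + batch_size, len(items))
        batch = [Thread(target=runner, args=(i,)) for i in range(start, stop)]
        for t in batch:
            t.start()
        for t in batch:
            t.join()
    return results
```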
It looks like the limit is a built-in limit that cannot be changed.
The message above is an error message though, so you could wrap the cfthread in a cftry to find out when the limit is reached.