How should I parallelize a mix of cpu- and network-intensive tasks (in Celery) - concurrency

I have a job that scans a network file system (can be remote), pulls many files, runs a computation on them and pushes the results (per file) into a DB. I am in the process of moving this to Celery so that it can be scaled up. The number of files can get really huge (1M+).
I am not sure what design approach to take, specifically:
Uniform "end2end" tasks
A task gets a batch (list of N files), pulls them, computes and uploads results.
(Using batches rather than individual files is meant to optimize the connection to the remote file system and the DB, although this is purely a heuristic at this point.)
Clearly, a task would spend a large part of its time waiting for I/O, so we'll need to experiment with the number of worker processes (many more than the number of CPUs) so that enough tasks are running (computing) concurrently.
pro: simple design, easier coding and control.
con: probably will need to tune the process pool size individually per installation, as it would depend on environment (network, machines etc.)
Split into dedicated smaller tasks
download, compute, upload (again, batches).
This option is appealing intuitively, but I don't actually see the advantage.
I'd be glad to get some references to tutorials on concurrency design, as well as design suggestions.

How long does it take to scan the network file system compared to computation per file?
What does the hierarchy of the remote file system look like? Are the files evenly distributed? How can you use this to your advantage?
I would follow a process like this:
1. In one process, list the first two levels of the root remote target folder.
2. For each of the discovered folders, spin up a separate celery process that further lists the content of those folders. You may also want to save the location of the discovered files just in case things go wrong.
3. Once the contents of the remote file system have been listed and all the Celery processes that list files have terminated, you can switch to processing mode.
4. You may want to list files with 2 processes and use the rest of your cores to start doing per file work.
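The listing steps above can be sketched roughly as follows. This is only an illustration under assumptions: it uses a local thread pool where the answer suggests separate Celery processes, and the names `list_top_levels`, `list_subtree` and `discover` are made up for the example. In a Celery setup, `list_subtree` would be the body of the per-folder task.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def list_top_levels(root, depth=2):
    """Return the directories `depth` levels below root, plus files seen on the way."""
    dirs, files = [root], []
    for _ in range(depth):
        next_dirs = []
        for folder in dirs:
            with os.scandir(folder) as entries:
                for entry in entries:
                    if entry.is_dir(follow_symlinks=False):
                        next_dirs.append(entry.path)
                    else:
                        files.append(entry.path)
        dirs = next_dirs
    return dirs, files


def list_subtree(folder):
    """Recursively list every file under one folder (the unit of parallel work)."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(folder):
        found.extend(os.path.join(dirpath, name) for name in filenames)
    return found


def discover(root, workers=2):
    """Shallow listing in one place, then one listing job per discovered folder."""
    folders, files = list_top_levels(root)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for chunk in pool.map(list_subtree, folders):
            files.extend(chunk)
    return sorted(files)
```

The sorted list returned by `discover` could be saved to disk before processing starts, which covers the "just in case things go wrong" point in step 2.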
NB: Before doing everything in Python, I would also investigate how shell tools like xargs and find work together for remote file discovery. Xargs allows you to spin up multiple C processes that do what you want. That might be the most efficient way to do the remote file discovery, after which you can pipe everything to your Python code.

Instead of Celery, you can write a simple Python script that runs k*cpu_count threads just to connect to the remote servers and fetch files.
Personally, I found that a k value between 4 and 7 gives better results in terms of CPU utilization for IO-bound tasks. Depending on the number of files produced or the rate at which you want to consume them, you can use a suitable number of threads.
Alternatively, you can use celery + gevent or celery with threads if your tasks are IO bound.
For computation and updating the DB you can use Celery, so that you can scale dynamically as your requirements change. If you have too many tasks at a time that need a DB connection, you should use DB connection pooling for the workers.
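A minimal sketch of the thread-based fetcher described in this answer, assuming the files are reachable through a callable that takes a path. `fetch_all` and `read_file` are hypothetical names, and `K = 4` is just the low end of the 4 to 7 range mentioned above, not a recommendation.

```python
import os
from concurrent.futures import ThreadPoolExecutor

K = 4  # threads per CPU; an assumption taken from the 4-7 range, tune per environment


def fetch_all(paths, fetch, k=K):
    """Run `fetch(path)` for every path on k * cpu_count threads.

    Returns a dict mapping each path to its result, or to the exception raised.
    """
    workers = k * (os.cpu_count() or 1)
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {path: pool.submit(fetch, path) for path in paths}
        for path, future in futures.items():
            try:
                results[path] = future.result()
            except Exception as exc:  # one bad file should not stop the batch
                results[path] = exc
    return results


def read_file(path):
    """Hypothetical fetcher: reads a file from a mounted network path."""
    with open(path, "rb") as handle:
        return handle.read()
```

Calling `fetch_all(paths, read_file)` returns the file contents keyed by path. Threads suit this step because the work is mostly waiting on the network; the CPU-bound computation would still go to separate worker processes.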

making Airflow behave like Luigi: how to prevent tasks to be re-run in future runs of a DAG if their output was necessary to be obtained only once?

I come from a Luigi background, where if a file was produced successfully by a task and the task was unmodified, re-runs of the DAG would not re-run that task but would reuse its previously obtained output.
Is there any way to get the same behavior with Airflow?
Currently, if I re-run the dag, it re-executes all the tasks, no matter if they produced a successful (and unchanged) output in the past. So, basically I need a task to be marked as successful if its code was unchanged.
A crucial feature of Airflow is that all tasks are expected to be idempotent. This means that re-running a task on the same input should generally overwrite the output with a newly processed version of that data, so that tasks depending on it can be rerun automatically. But the data might be different after reprocessing than it was originally.
That's why Airflow has a backfill command, which basically means:
Please re-run this DAG for selected past runs (say, the last week's worth of runs) - but you should JUST reprocess starting from task X (which will re-run task X and ALL tasks that depend on its output).
This also means that when you want to re-run parts of past DAGs but you know that you want to rely on the existing output of a certain task, you only backfill the tasks that depend on the output of that task (but not the task itself).
This allows for much more flexibility by defining which tasks in past DAG runs should be re-run (you basically invalidate outputs of certain tasks by making them target of backfill).
This covers more than the case you mention:
a) if you do not want to change the output of a certain task - you do not backfill that task, only the task(s) that follow from it
b) more importantly - if you want to re-process the task even if the task input and the task itself were not modified, you can still do it by backfilling that task.
Case b) is often important, because some of the tasks might have implicit dependencies that change - even if the inputs and the task did not change, processing it again might produce a different (often better) result.
A good example that I've heard of is telecom operators re-processing call records, where you have to determine phone models from the IMEI of the phones. In this case you might have a single service that does the mapping, but it might get updated to a newer version when manufacturers refresh their model database. When new phones are introduced, the refresh happens with some delay, so regularly reprocessing the last week's data might give different results even if the input ("list of calls") and the task ("execute map IMEIs to phone models") did not change from the DAG's Python point of view.
Airflow almost always calls external services to run certain tasks, and those services themselves might improve over time - this means that limiting re-processing to the cases where "no input + no task code" has changed is very limiting (but you can still deliberately decide on it by choosing the backfill scope - i.e. which tasks to reprocess).
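As a rough illustration of what idempotent means for a task body, here is a sketch in plain Python. Everything in it is an assumption made for the example: the function name `process_partition`, the JSON layout and the file naming are hypothetical and not part of Airflow. The point is only that the output location depends on the run date alone, so a re-run overwrites rather than duplicates.

```python
import json
import os
import tempfile


def process_partition(run_date, records, out_dir):
    """Idempotent task body: the output path depends only on run_date.

    Re-running for the same run_date overwrites the previous output
    instead of appending to it or creating a second copy.
    """
    os.makedirs(out_dir, exist_ok=True)
    target = os.path.join(out_dir, f"{run_date}.json")
    result = {"run_date": run_date, "count": len(records), "total": sum(records)}

    # Write to a temp file first, then atomically replace the target, so a
    # downstream task never reads a half-written file.
    fd, tmp_path = tempfile.mkstemp(dir=out_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(result, handle)
    os.replace(tmp_path, target)
    return target
```

With a task written this way, backfilling it for a past date simply refreshes that date's file, and downstream tasks pick up the new version when they are backfilled too.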

Why sqlite3 can't work with NFS?

I switched from MySQL to sqlite3 because I had to run many jobs on a PBS system which does not have MySQL. Of course, my own machine does not use NFS, while the PBS system does. After spending lots of time switching to sqlite3, I ran many jobs and corrupted my database.
Of course, the sqlite3 FAQ does mention NFS further down, but I didn't even think about this when I started.
I can copy the database at the beginning of the job but it will turn into a merging nightmare!
I would never recommend sqlite to any of my colleagues for this simple reason: "sqlite doesn't work (on the machines that matter)"
I have read rants about NFS not being up to par and it being their fault.
I have tried a few workarounds, but as this post suggests, it is not possible.
Isn't there a workaround which sacrifices performance?
So what do I do? Try some other db software? Which one?
You are using the wrong tool. Saying "I would never recommend sqlite ..." based on this experience is a bit like saying "I would never recommend glass bottles" after they keep breaking when you use them to hammer in a nail.
You need to specify your problem more precisely. My attempt to read between the lines of your question gives me something like this:
You have many nodes that get work through some unspecified path, and produce output. The jobs do not interact because you say you can copy the database. The output from all the jobs can be merged after they are finished. How do you effectively produce the merged output?
Given that as the question, this is my advice:
Have each job produce its output in a structured file, unique to each job. After the jobs are finished, write a program to parse each file and insert it into an sqlite3 database. This uses NFS in a way it can handle (single process writing sequentially to a file) and uses sqlite3 in a way that is also sensible (single process writing to a database on a local filesystem). This avoids NFS locking issues while running the jobs, and should improve throughput because you don't have contention on the sqlite3 database.
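A minimal sketch of that approach, under assumptions: the per-job format is JSON lines, each row has a `key` and a `value`, and the file naming `job-<id>.jsonl` and the `results` table are invented for the example.

```python
import glob
import json
import os
import sqlite3


def write_job_output(out_dir, job_id, rows):
    """Called inside each job: one file per job, written sequentially."""
    path = os.path.join(out_dir, f"job-{job_id}.jsonl")
    with open(path, "w") as handle:
        for row in rows:
            handle.write(json.dumps(row) + "\n")
    return path


def merge_outputs(out_dir, db_path):
    """Called once after all jobs finish, by a single process on a local disk."""
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS results (job_id TEXT, key TEXT, value REAL)"
        )
        inserted = 0
        for path in sorted(glob.glob(os.path.join(out_dir, "job-*.jsonl"))):
            job_id = os.path.basename(path)[len("job-"):-len(".jsonl")]
            with open(path) as handle:
                rows = [json.loads(line) for line in handle if line.strip()]
            with conn:  # one transaction per job file
                conn.executemany(
                    "INSERT INTO results VALUES (?, ?, ?)",
                    [(job_id, row["key"], row["value"]) for row in rows],
                )
            inserted += len(rows)
        return inserted
    finally:
        conn.close()
```

Here `out_dir` can live on NFS, since each file has exactly one writer, while `db_path` should point to a local filesystem.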

Process queue as folder with files. Possible problems?

I have an executable that needs to process records in the database when a command arrives telling it to do so. Right now, I am issuing commands via a TCP exchange, but I don't really like it because
a) queue is not persistent between sessions
b) TCP port might get locked
The idea I have is to create a folder and place files in it whose names match the commands I want to issue
Like:
1.23045.-1.1
2.999.-1.1
Then, after the command has been processed, the file will be deleted or moved to Errors folder.
Is this viable or are there some unavoidable problems with this approach?
P.S. The process will be used on Linux system, so Antivirus problems are out of the question.
Yes, a few.
First, there are all the problems associated with using a filesystem. Antivirus programs are one (though I cannot see why it doesn't apply to Linux - no delete locks?). Disk space, file and directory count maximums are others. Then, open file limits and permissions...
Second, race conditions. If there are multiple consumers, more than one of them might see and start processing the command before the first one has [re]moved it.
There are also the issues of converting commands to filenames and vice versa, and coming up with different names for a single command that needs to be issued multiple times. (Though these are programming issues, not design ones; they'll merely annoy.)
None of these may apply or be of great concern to you, in which case I say: Go ahead and do it. See what we've missed that Real Life will come up with.
I probably would use an MQ server for anything approaching "serious", though.
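For the race condition mentioned above, one common technique (not something the answer itself proposes) is to have each consumer claim a command by renaming its file into a separate directory before processing it. The sketch below assumes Linux and that all directories are on the same local filesystem; `claim_next` and `finish` are hypothetical names, and the working directory is assumed to be outside the queue folder.

```python
import os


def claim_next(queue_dir, work_dir):
    """Try to claim one command file; return its new path, or None if none is left.

    os.rename is atomic when both directories are on the same local filesystem,
    so if two consumers race for one file, exactly one rename succeeds.
    """
    os.makedirs(work_dir, exist_ok=True)
    for name in sorted(os.listdir(queue_dir)):
        source = os.path.join(queue_dir, name)
        claimed = os.path.join(work_dir, name)
        try:
            os.rename(source, claimed)
        except FileNotFoundError:
            continue  # another consumer took it first; try the next file
        return claimed
    return None


def finish(claimed_path, ok, errors_dir):
    """Delete a processed command, or move it to the Errors folder on failure."""
    if ok:
        os.remove(claimed_path)
    else:
        os.makedirs(errors_dir, exist_ok=True)
        os.rename(claimed_path, os.path.join(errors_dir, os.path.basename(claimed_path)))
```

This does not address the other concerns listed (disk space, file count limits, issuing the same command twice); it only makes sure a command is picked up by one consumer.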

How to measure the amount of data transmitted by my MPI program?

I'm experimenting with my distributed clustering algorithm (implemented with MPI) on 24 computers that I set up as a cluster using BCCD (Bootable Cluster CD), which can be downloaded at http://bccd.net/.
I've written a batch program to run my experiment, which consists of running my algorithm several times while varying the number of nodes and the size of the input data.
I want to know the amount of data used in the MPI communications for each run of my algorithm, so I can see how the amount of data changes when varying the previously mentioned parameters. And I want to do all this automatically using a batch program.
Someone told me to use tcpdump, but I found some difficulties in this approach.
First, I don't know how to call tcpdump from my batch program (which is written in C++ and uses the system() call to run commands) before each run of my algorithm, since tcpdump needs another terminal to run in parallel with my application. And I can't run tcpdump on another computer since the network uses a switch, so I need to run it on the master node.
Second, I watched the traffic with tcpdump while my experiment was running and I couldn't figure out which port MPI was using. It seems to use many ports. I wanted to know that in order to filter the packets.
Third, I tried capturing whole packets and saving them to a file using tcpdump, and in a few seconds the file was 3.5 MB. But my whole experiment takes 2 days, so the final log file will be huge if I follow this approach.
The ideal approach would be to capture just the size field in the header of each packet and sum these up to obtain the total amount of data transmitted. That way the log file would be much smaller than if I were capturing whole packets. But I don't know how to do it.
Another restriction is that I don't have access to the computers' disks. I only have the RAM and my 4GB USB flash drive, so I can't have huge log files.
I have already thought about using an MPI tracing or profiling tool such as those mentioned at http://www.open-mpi.org/faq/?category=perftools. I have only tested Sun Performance Analyzer so far. The problem is that I guess it will be difficult, and maybe even impossible, to install those tools on BCCD. In addition to that, such a tool will make my experiment take longer to finish, since it adds overhead. But if someone is familiar with BCCD and thinks one of those tools is a good choice, please let me know.
Hope someone has a solution.
Approaches like tcpdump won't work if there are multi-core nodes which use shared memory to communicate, anyway.
Using something like MPE is almost certainly the way to go. Those tools add very little overhead, and some overhead is always going to be necessary if you want to count messages. You can use mpitrace to write out every MPI call, and parse the resulting text file yourself. By the way, note that MPE is explicitly discussed on the bccd website. MPICH2 comes with MPE built in, but it can be compiled for any implementation. I've only found a very modest overhead for MPE.
IPM is another nice tool that does counting of messages and sizes; you should be able to either parse the XML output, or use the postprocessing tools and just manually integrate the graphs (say either bytes_rx/bytes_tx by rank, or the message buffer size/count graph). The overhead for IPM is even less than for MPE, and mostly comes after the program has finished running, to do the file I/O.
If you were really super worried about the overhead with either of these approaches, you could always write your own MPI wrappers using the profiling interface that wrap MPI_Send, MPI_Recv, etc., and just count the number of bytes sent and received for each process, and output only that total at the end.
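The counting-wrapper idea can be illustrated in Python, with the caveat that the real thing would be written in C against the MPI profiling interface. The sketch below is only an analogue: it assumes a communicator object exposing `send(obj, dest=..., tag=...)` and `recv(source=..., tag=...)` in the style of mpi4py's object interface, and it counts pickled sizes, which only approximate what goes over the wire. `CountingComm` and `LoopbackComm` are hypothetical names; the latter is a stand-in so the wrapper can be tried without an MPI runtime.

```python
import pickle


class CountingComm:
    """Wrap a communicator and count the pickled size of every message."""

    def __init__(self, comm):
        self._comm = comm
        self.bytes_sent = 0
        self.bytes_received = 0

    def send(self, obj, dest, tag=0):
        self.bytes_sent += len(pickle.dumps(obj))
        return self._comm.send(obj, dest=dest, tag=tag)

    def recv(self, source, tag=0):
        obj = self._comm.recv(source=source, tag=tag)
        self.bytes_received += len(pickle.dumps(obj))
        return obj

    def report(self):
        """The only output needed at the end of a run."""
        return {"sent": self.bytes_sent, "received": self.bytes_received}


class LoopbackComm:
    """Stand-in communicator (an assumption for the demo, not an MPI class)."""

    def __init__(self):
        self._queue = []

    def send(self, obj, dest, tag=0):
        self._queue.append(obj)

    def recv(self, source, tag=0):
        return self._queue.pop(0)
```

Each process would keep its own counters and print `report()` once at the end, so the log stays tiny regardless of how long the experiment runs.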

How does rsync behave for concurrent file access?

I'm using rsync to run backups of my machine twice a day, and the ten to fifteen minutes it spends searching my files for modifications, slowing everything down considerably, are starting to get on my nerves.
Now I'd like to use the inotify interface of my kernel (I'm running Linux) to write a small background app that collects notifications about modified files and adds their pathnames to a list which is then processed regularly by a call to rsync.
Now, because this process by definition always works on files I've just been - and might still be - working on, I'm wondering whether I'll get loads of corrupted / partially updated files in my backup as rsync copies the files while I'm writing to them.
I couldn't find anything in the manpage and have so far been unsuccessful in googling for the answer. I could go read the source, but that might take quite a while. Does anybody know how concurrent file access is handled inside rsync?
It's not handled at all: rsync opens the file, reads as much as it can and copies that over.
So it depends on how your applications handle this: do they rewrite the file in place (not creating a new one), or do they create a temp file and rename it once all the data has been written (as they should)?
In the first case, there is little you can do: if two processes access the same data without any kind of synchronization, the result will be a mess. What you could do is defer the rsync for N minutes, assuming that the writing process will have finished by then. Reschedule the file if it changes again within this time limit.
In the second case, you must tell rsync to ignore temp files (*.tmp, *~, etc).
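The "defer for N minutes and reschedule on change" idea from the first case can be sketched as a small bookkeeping class. This is an assumption-laden illustration: `PendingFiles` is a hypothetical name, and the code that feeds it from inotify events and later invokes rsync is left out.

```python
import time


class PendingFiles:
    """Track modified files and release them only after a quiet period."""

    def __init__(self, quiet_seconds):
        self.quiet_seconds = quiet_seconds
        self._last_change = {}  # path -> timestamp of the most recent change

    def touch(self, path, now=None):
        """Record a modification; a repeat change restarts the wait for that path."""
        self._last_change[path] = time.monotonic() if now is None else now

    def pop_due(self, now=None):
        """Return and forget the paths that have been quiet long enough."""
        now = time.monotonic() if now is None else now
        due = sorted(
            path for path, changed in self._last_change.items()
            if now - changed >= self.quiet_seconds
        )
        for path in due:
            del self._last_change[path]
        return due
```

The background app would call `touch` for every change notification and periodically pass the result of `pop_due` to rsync, for instance via a list file and rsync's `--files-from` option. A file that is still being written keeps getting its timer reset and is only backed up once it has settled.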
It isn't handled in any way. If it is a problem, you can use e.g. LVM snapshots, and take the backup from the snapshot. That won't in itself guarantee that the files will be in a usable state, but it does guarantee that, as the name implies, it's a snapshot at a specific time.
Note that this doesn't have anything to do with whether you're letting rsync handle the change detection itself or if you use your own app. Your app, or rsync itself, just produces a list of files that have been changed, and then for each file, the rsync binary diff algorithm is run. The problem is if the file is changed while the rsync algorithm runs, not when producing the file list.