Airflow delays in picking up DAGs from the DAG folder - airflow-scheduler

We are using Airflow version 2, and when we place a DAG file in the DAG folder, it takes 20 to 30 seconds for the DAG to show up in the Airflow DAG list.
NOTE: we are using the Celery executor and Postgres as the database.
Could someone help me make Airflow pick up the DAG faster?
Is there a configuration for it? Any idea is appreciated.
Thanks,
Harry

Your best bet is to decrease the [scheduler] min_file_process_interval parameter (this will increase CPU usage). The documentation describes it as follows:
Number of seconds after which a DAG file is parsed. The DAG file is parsed every min_file_process_interval number of seconds. Updates to DAGs are reflected after this interval. Keeping this number low will increase CPU usage.
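In airflow.cfg these knobs live in the [scheduler] section. A sketch with illustrative values (starting points, not recommendations); note that dag_dir_list_interval additionally controls how often the folder is scanned for brand-new files, which matters for your "new file takes a while to appear" case:

```ini
[scheduler]
# Seconds between re-parses of each known DAG file; lower = faster pickup, more CPU
min_file_process_interval = 5
# Seconds between scans of the DAGs folder for brand-new files (default is 300)
dag_dir_list_interval = 30
# Number of parallel DAG-file parsing processes
parsing_processes = 2
```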
However, it really depends on several factors: how many DAGs you have, how many file-parsing processes, how many schedulers, the sorting order, and the interval parameters.
See https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#id24 (scheduler parameters)
Also, https://airflow.apache.org/docs/apache-airflow/stable/concepts/scheduler.html#scheduler-tuneables has a good overview of the tunables you can use.
Generally your approach should be the same as for any performance improvement and optimisations. Airflow gives you a lot of "knobs" to turn but it's your task (depending on your particular deployment) to decide which knobs to turn.
For example, if you see that you are saturating the CPU, you might need to add another scheduler, as your machine might be too slow to parse all your DAGs. But when you see that you have some CPUs free, you might increase the number of parsing processes. If you use a remote filesystem and you see that you are blocked on I/O, you might want to increase the capacity of the filesystem (cloud remote filesystems usually have limited throughput), etc. You can also configure the sorting order, i.e. which files are processed first.
I really recommend watching the talk from Airflow Summit 2021 - https://youtu.be/DYC4-xElccE - as it might help you understand how the scheduler works and what tuning you can do.

Related

GCP Functions Node.js Huge Latency

We are running simple GCP Functions (pure, no Firebase or any other layer added) that just handle HTTP requests using the Node.js engine (previously version 8, now 10) and return some "simple JSON response". What we see is that sometimes (and not rarely) there is a huge latency between the request being "accepted by GCP" and it reaching our function code. When I say huge, I'm not talking milliseconds but whole seconds! And it is not a cold start (we have separate log messages in the global scope, so we know when a cold start occurs). The functions currently have 256 or 512 MB and run in a nearby region.
We log at the very first line of the GCP function.
Does anyone also experience that? And is that normal that sometimes this delay may take up to 5s (or rarely even more)?
By the way, sometimes the same thing happens on the output side as well, so if we're unlucky a request may take up to 10 s in total. Thanks in advance for any reply, whether or not you have had a similar experience.
All such problems I have seen were either related to cold starts, or it was not possible to prove that they were not.
This question may even be too broad for Stack Overflow. We have no way to reproduce it without at least the example functions and the number of executions; however, I will try to answer.
It seems that the latency analysis is being done mainly from logs. I think you should try the "Trace" functionality available in GCP (direct link and documentation). This should give you the data you need to track down the issue.
For example, I used it on a helloworld Cloud Function and curl'ed it from a bash script. Over a few hundred invocations there was one execution with a latency 10 times greater than usual.
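Once you have per-request latencies extracted from your logs or traces, flagging such outliers is straightforward. A minimal sketch (the 10x-the-median threshold is an arbitrary choice and the sample data is made up):

```python
import statistics

def find_latency_outliers(latencies, factor=10):
    """Return requests whose latency exceeds `factor` times the median."""
    median = statistics.median(latencies)
    return [x for x in latencies if x > factor * median]

# Made-up sample: mostly ~200 ms responses, with one ~5 s spike.
samples = [0.2, 0.21, 0.19, 0.22, 0.18, 5.1, 0.2, 0.23]
print(find_latency_outliers(samples))  # [5.1]
```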
I hope this helps somehow :)

Why is IO at 99.99% even though the disk reads and writes seem to be very small?

One of our Kafka brokers had a very high load average (about 8 on average) on an 8-core machine. Although this should be okay, our cluster still seemed to be facing problems, and producers were failing to flush messages at the usual pace.
Upon further investigation, I found that my Java process was waiting on IO almost 99.99% of the time, and as of now I believe this is a problem.
Mind that this happened even when the load was relatively low (around 100-150 Kbps); I have seen the cluster perform perfectly even with 2 Mbps of data coming in.
I am not sure if this problem is caused by Kafka; I am assuming it is not, because all the other brokers worked fine during this time and our data is evenly divided among the 5 brokers.
Please assist me in finding the root cause of the problem. Where should I look to find the problem? Are there any other tools that can help me debug this problem?
We are using a 1 TB mounted EBS volume on an m5.2xlarge machine.
Please feel free to ask any questions.
GC Logs Snapshot
Answering my own question after figuring out the problem.
It turns out that the real problem was associated with the way the st1 HDD volume type works, rather than Kafka or GC.
The st1 HDD volume type is optimized for workloads involving large, sequential I/O and performs very badly with small random IOs. You can read more about it here.
Although it should have worked fine for Kafka alone, we were also writing Kafka application logs to the same HDD, which added a lot of READ/WRITE IOs and consequently depleted our burst credits very quickly during peak time. Our cluster worked fine as long as burst credits were available, and performance degraded after the credits were depleted.
There are several solutions to this problem:
First, remove any external apps adding IO load to the st1 drive, as it is not meant for those kinds of small random IOs.
Increase the number of st1 drives running in parallel to divide the load. This is easy to do with Kafka, as it allows us to keep data in different directories on different drives. Note that only new topics will be divided, since partitions are assigned to directories when a topic is created.
Use gp2 SSD drives, as they handle both kinds of load fairly well. But they are expensive.
Use larger st1 drives sized for your use case, since throughput and burst credits depend on the size of the disk. READ HERE
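To get a feel for how quickly a sustained load can drain the bucket, here is a back-of-the-envelope sketch. The constants (roughly 40 MB/s of baseline throughput per TiB and a credit bucket of about 1 TiB) are my recollection of AWS's published st1 numbers, so verify them against the current EBS documentation:

```python
def hours_until_credits_exhausted(volume_tib, sustained_mb_per_s,
                                  bucket_tib=1.0, baseline_per_tib=40):
    """Rough time until an st1 volume's burst bucket empties under sustained load.

    Assumes ~40 MB/s baseline per TiB and a ~1 TiB credit bucket
    (approximate st1 figures; check current AWS docs).
    """
    baseline = baseline_per_tib * volume_tib       # MB/s the bucket refills at
    if sustained_mb_per_s <= baseline:
        return float("inf")                        # bucket never empties
    drain_rate = sustained_mb_per_s - baseline     # net MB/s drained
    bucket_mb = bucket_tib * 1024 * 1024           # credits expressed in MB
    return bucket_mb / drain_rate / 3600           # seconds -> hours

# A 1 TiB volume pushed at a sustained 140 MB/s drains 100 MB/s net:
print(round(hours_until_credits_exhausted(1.0, 140), 1))  # 2.9 hours
```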
This article helped me a lot to figure out the problem.
Thanks.

Speeding up model training using MITIE with Rasa

I'm training a model for recognizing short, one to three sentence strings of text using the MITIE back-end in Rasa. The model trains and works using spaCy, but it isn't quite as accurate as I'd like. Training on spaCy takes no more than five minutes, but training for MITIE ran for several days non-stop on my computer with 16GB of RAM. So I started training it on an Amazon EC2 r4.8xlarge instance with 255GB RAM and 32 threads, but it doesn't seem to be using all the resources available to it.
In the Rasa config file, I have set num_threads: 32 and max_training_processes: 1, which I thought would make use of all the memory and computing power available. But now that it has been running for a few hours, CPU usage is sitting at 3% (100% usage, but only on one thread), and memory usage stays around 25GB, one tenth of what it could be.
Do any of you have any experience with trying to accelerate MITIE training? My model has 175 intents and a total of 6000 intent examples. Is there something to tweak in the Rasa config files?
So I am going to try to address this from several angles. First, specifically from the Rasa NLU angle, the docs say:
Training MITIE can be quite slow on datasets with more than a few intents.
and provide two alternatives:
Use the mitie_sklearn pipeline, which trains using sklearn.
Use the MITIE fork, where Tom B. from Rasa has modified the code to run faster in most cases.
Given that only a single core is being used, I doubt this next one will have much impact, but Alan from Rasa has suggested that num_threads should be set to 2-3x your number of cores.
If you haven't evaluated both of those possibilities then you probably should.
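For reference, switching pipelines is a small config change. A sketch of what that might look like (the keys follow older Rasa NLU config conventions, and the file path and values here are illustrative; check the docs for your version):

```yaml
language: "en"
pipeline: "mitie_sklearn"   # sklearn-based trainer instead of plain MITIE
mitie_file: "data/total_word_feature_extractor.dat"  # example path
num_threads: 8              # per the suggestion above: 2-3x physical cores
max_training_processes: 1
```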
Not all aspects of MITIE are multi-threaded. See this issue, opened by someone else using Rasa, on the MITIE GitHub page, quoted here:
Some parts of MITIE aren't threaded. How much you benefit from the threading varies from task to task and dataset to dataset. Sometimes only 100% CPU utilization happens and that's normal.
Specifically on the training data side, I would recommend looking at the evaluate tool recently introduced into the Rasa repo. It includes a confusion matrix that could help you identify trouble areas.
This may allow you to switch to spaCy, use a portion of your 6000 examples as an evaluation set, and add examples back into the intents that aren't performing well.
I have more questions about where the 6000 examples came from, whether they're balanced, how different the intents are from each other, whether you have verified that words from the training examples appear in the corpus you are using, etc., but I think the above is enough to get started.
It will be no surprise to the Rasa team that MITIE takes forever to train; it will be more of a surprise that you can't get good accuracy out of another pipeline.
As a last resort, I would encourage you to open an issue on the Rasa NLU GitHub page and engage the team there for further support. Or join the Gitter conversation.

Distributed Tensorflow Training of Reinpect Human detection model

I am working on Distributed TensorFlow, particularly an implementation of the ReInspect model using Distributed TensorFlow, based on the following project: https://github.com/Russell91/TensorBox .
We are using the between-graph asynchronous implementation of the distributed TensorFlow settings, but the results are very surprising. While benchmarking, we found that distributed training takes more than twice as long as single-machine training. Any leads about what could be happening and what else could be tried would be really appreciated. Thanks.
Note: there is a correction in the post; we are using the between-graph implementation, not the in-graph implementation. Sorry for the mistake.
In general, I wouldn't be surprised if moving from a single-process implementation of a model to a multi-machine implementation would lead to a slowdown. From your question, it's not obvious what might be going on, but here are a few general pointers:
If the model has a large number of parameters relative to the amount of computation (e.g. if it mostly performs large matrix multiplications rather than convolutions), then you may find that the network is the bottleneck. What is the bandwidth of your network connection?
Are there a large number of copies between processes, perhaps due to unfortunate device placement? Try collecting and visualizing a timeline to see what is going on when you run your model.
You mention that you are using "in-graph replication", which is not currently recommended for scalability. In-graph replication can create a bottleneck at the single master, especially when you have a large model graph with many replicas.
Are you using a single input pipeline across the replicas or multiple input pipelines? Using a single input pipeline would create a bottleneck at the process running the input pipeline. (However, with in-graph replication, running multiple input pipelines could also create a bottleneck as there would be one Python process driving the I/O with a large number of threads.)
Or are you using the feed mechanism? Feeding data is much slower when it has to cross process boundaries, as it would in a replicated setting. Using between-graph replication would at least remove the bottleneck at the single client process, but to get better performance you should use an input pipeline. (As Yaroslav observed, feeding and fetching large tensor values is slower in the distributed version because the data is transferred via RPC. In a single process these would use a simple memcpy() instead.)
How many processes are you using? What does the scaling curve look like? Is there an immediate slowdown when you switch to using a parameter server and single worker replica (compared to a single combined process)? Does the performance get better or worse as you add more replicas?
I was looking at something similar recently, and I noticed that moving data from grpc into the Python runtime is slower than expected. In particular, consider the following pattern:
add_op = params.assign_add(update)
...
sess.run(add_op)
If add_op lives in a different process, then sess.run adds a decoding step that happens at a rate of 50-100 MB/second.
Here's a benchmark and relevant discussion
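You can get a feel for that decode cost without a cluster by comparing an in-process copy of a large value with a serialize/decode round trip, a rough local stand-in for what the RPC path has to do (pure-Python sketch; absolute numbers will differ from TensorFlow's C++ path):

```python
import pickle
import time

data = [float(i) for i in range(1_000_000)]  # stand-in for a large tensor

t0 = time.perf_counter()
copied = list(data)                  # in-process transfer: essentially a memcpy
t1 = time.perf_counter()

blob = pickle.dumps(data)            # what the sending side of an RPC must do
t2 = time.perf_counter()
decoded = pickle.loads(blob)         # the decoding step on the receiving side
t3 = time.perf_counter()

assert decoded == data
print(f"in-process copy: {t1 - t0:.4f}s")
print(f"decode:          {t3 - t2:.4f}s")  # typically far slower than the copy
```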

Is MapReduce right for me?

I am working on a project that deals with analyzing a very large amount of data, so I discovered MapReduce fairly recently, and before I dive any further into it, I would like to make sure my expectations are correct.
The interaction with the data will happen from a web interface, so response time is critical here; I am thinking of a 10-15 second limit. Assuming my data will be loaded into a distributed file system before I perform any analysis on it, what kind of performance can I expect?
Let's say I need to filter a simple 5GB XML file that is well formed, has a fairly flat data structure and 10,000,000 records in it. And let's say the output will result in 100,000 records. Is 10 seconds possible?
If so, what kind of hardware am I looking at?
If not, why not?
I put the example down, but now wish that I hadn't. 5GB was just a sample, and in reality I would be dealing with a lot more data. 5GB might be the data for one hour of the day, and I might want to identify all the records that meet certain criteria.
A database is really not an option for me. What I wanted to find out is the fastest performance I can expect from MapReduce. Is it always minutes or hours? Is it never seconds?
MapReduce is good for scaling the processing of large datasets, but it is not intended to be responsive. In the Hadoop implementation, for instance, the overhead of startup usually takes a couple of minutes alone. The idea here is to take a processing job that would take days and bring it down to the order of hours, or hours to minutes, etc. But you would not start a new job in response to a web request and expect it to finish in time to respond.
To touch on why this is the case, consider the way MapReduce works (a general, high-level overview):
1. A bunch of nodes receive portions of the input data (called splits) and do some processing (the map step).
2. The intermediate data (the output from the last step) is repartitioned such that data with like keys ends up together. This usually requires some data transfer between nodes.
3. The reduce nodes (which are not necessarily distinct from the mapper nodes; a single machine can do multiple jobs in succession) perform the reduce step.
4. The result data is collected and merged to produce the final output set.
While Hadoop et al. try to keep data locality as high as possible, there is still a fair amount of shuffling during processing. This alone should preclude you from backing a responsive web interface with a distributed MapReduce implementation.
Edit: as Jan Jongboom pointed out, MapReduce is very good for preprocessing data so that web queries can be fast, precisely because they don't need to engage in processing. Consider the famous example of creating an inverted index from a large set of webpages.
A distributed implementation of MapReduce such as Hadoop is not a good fit for processing a 5GB XML file.
Hadoop works best on large amounts of data. Although 5GB is a fairly big XML file, it can easily be processed on a single machine.
Input files to Hadoop jobs need to be splittable so that different parts of the file can be processed on different machines. Unless your XML is trivially flat, splitting the file will be nondeterministic, so you'll need a preprocessing step to format the file for splitting.
If you had many 5GB files, then you could use Hadoop to distribute the splitting. You could also use it to merge results across files and store them in a format suitable for fast querying by your web interface, as other answers have mentioned.
MapReduce is a generic term. You probably mean to ask whether a fully featured MapReduce framework with job control, such as Hadoop, is right for you. The answer still depends on the framework, but usually the job control, network, data replication, and fault-tolerance features of a MapReduce framework make it suitable for tasks that take minutes, hours, or longer, and that's probably the short and correct answer for you.
The MapReduce paradigm might be useful to you if your tasks can be split among independent mappers and combined with one or more reducers, and the language, framework, and infrastructure that you have available let you take advantage of that.
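As a toy illustration of that split (the records and the filter predicate here are made up): each mapper independently filters its share of the records, and a reducer merges the partial results:

```python
from functools import reduce

records = ["keep:1", "drop:2", "keep:3", "drop:4", "keep:5", "drop:6"]

# Pretend each chunk lives on a different node of the cluster.
chunks = [records[i:i + 2] for i in range(0, len(records), 2)]

def mapper(chunk):
    # Runs independently per chunk; here the "analysis" is a simple filter.
    return [r for r in chunk if r.startswith("keep")]

def reducer(left, right):
    # Merges two partial results; the merge order doesn't matter.
    return left + right

partial_results = [mapper(c) for c in chunks]
result = reduce(reducer, partial_results, [])
print(result)  # ['keep:1', 'keep:3', 'keep:5']
```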
There isn't necessarily a distinction between MapReduce and a database. A declarative language such as SQL is a good way to abstract parallelism, as are queryable MapReduce frameworks such as HBase. This article discusses MapReduce implementations of a k-means algorithm, and ends with a pure SQL example (which assumes that the server can parallelize it).
Ideally, a developer doesn't need to know too much about the plumbing at all. Erlang examples like to show off how the functional language features handle process control.
Also, keep in mind that there are lightweight ways to play with MapReduce, such as bashreduce.
I recently worked on a system that processes roughly 120GB/hour with 30 days of history. We ended up using Netezza for organizational reasons, but I think Hadoop may be an appropriate solution depending on the details of your data and queries.
Note that XML is very verbose. One of your main costs will be reading from and writing to disk. If you can, choose a more compact format.
The number of nodes in your cluster will depend on the type and number of disks and CPUs. For a rough calculation you can assume you will be limited by disk speed. If your 7200rpm disk can scan at 50MB/s and you want to scan 500GB in 10s, then you need 1000 nodes.
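That estimate in code (the 50 MB/s per disk and the 10 s budget are the figures from the paragraph above; the result lands at 1024 rather than exactly 1000 only because of GB-to-MB rounding):

```python
def nodes_needed(total_gb, seconds, disk_mb_per_s=50):
    """Disks (one per node assumed) required to scan total_gb within `seconds`."""
    required_mb_per_s = total_gb * 1024 / seconds  # aggregate throughput needed
    return required_mb_per_s / disk_mb_per_s       # divide across disks

print(nodes_needed(500, 10))  # 1024.0, i.e. on the order of 1000 nodes
```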
You may want to play with Amazon's EC2, where you can stand up a Hadoop cluster and pay by the minute, or you can run a MapReduce job on their infrastructure.
It sounds like what you might want is a good old-fashioned database. Not quite as trendy as map/reduce, but often sufficient for small jobs like this. Depending on how flexible your filtering needs to be, you could either import your 5GB file into a SQL database, or implement your own indexing scheme by storing records in different files, keeping everything in memory in a giant hashtable, or whatever is appropriate for your needs.