I have an 8-node Hadoop cluster, where each node has 24 physical cores with Hyper-Threading (48 vCPUs) and 256 GB of memory.
I am trying to run a 6TB Terasort job.
Problem: Terasort runs with no errors when I use yarn.nodemanager.resource.cpu-vcores=44 (48 minus 4 for OS, DN, RM, etc.). However, when I try to over-subscribe the CPUs with yarn.nodemanager.resource.cpu-vcores=88, I get several map and reduce errors.
All map failures are like "Too many fetch failures....".
All reduce errors are like "....#Block does not have enough number of replicas....".
I have seen THIS and THIS links. I have checked my /etc/hosts files and also bumped my net.core.somaxconn kernel parameter.
I don't understand why I get map and reduce failures with over-subscribed CPUs.
Any hints or recommendations would be helpful, and thanks in advance.
I got to the bottom of the “Too many fetch…” error. What was happening was that because the servers were heavily loaded when running my 7TB job (remember that 1TB jobs always ran successfully), there were not enough connections happening between the master and the slaves. I needed to increase the listen queue between the master and the slaves, which can be done by modifying a kernel parameter called “somaxconn”.
By default, “somaxconn” is set to 128 on RHEL. By bumping it to 1024, the 7TB terasort job ran successfully with no failures.
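For reference, here is a sketch of how to check and raise that limit on RHEL (1024 is the value that worked for me; tune it for your own cluster):

```shell
# Check the current listen-queue limit (default 128 on RHEL)
sysctl net.core.somaxconn

# Raise it for the running kernel
sudo sysctl -w net.core.somaxconn=1024

# Persist the change across reboots
echo "net.core.somaxconn = 1024" | sudo tee -a /etc/sysctl.conf
```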
Hope this helps someone.
I am running a Vizier HyperParameter Tuning job on the GCP AI Platform and trials keep getting interrupted with the error: Terminated by service. If the job is supposed to continue running, it will be restarted on other VM shortly.
I am using a STANDARD_P100 GPU and it seems like the individual tuning trials are getting booted (pre-empted) from the GPU in the middle of training. Some trials complete successfully, while others are stopped at around 1000 or 2000 steps. The stops always happen at a multiple of 1000 steps, which is significant because I run evaluation every 1000 steps, so something in the switch between training and evaluation seems to be allowing these jobs to get pre-empted. The next trial then starts up and typically runs for 1000 steps again (rather than restarting the previous trial).
Is there anything I can do so that my trials will complete successfully? They never get restarted as the message claims, and it makes the entire hyperparameter tuning effort nearly worthless: ~90% of the trials never complete, and the ones that fail likely feed bad information to the Vizier optimization algorithm. These runs can be quite expensive on GPUs, and they are essentially worthless as currently configured, even though I am being charged for trials that never complete.
An example of my hptuning_config is below...
scaleTier: CUSTOM
masterType: standard_v100
hyperparameters:
goal: MAXIMIZE
hyperparameterMetricTag: 'accuracy'
maxTrials: 80
maxParallelTrials: 1
enableTrialEarlyStopping: TRUE
params: ...
I got the same problem. I suspect it's because of the enableTrialEarlyStopping setting:
https://cloud.google.com/ml-engine/docs/using-hyperparameter-tuning#stopping_trials_early
You have to set:
enableTrialEarlyStopping: False
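Applied to the config above, the change is just that one flag (a sketch, assuming the rest of your configuration stays the same):

```yaml
scaleTier: CUSTOM
masterType: standard_v100
hyperparameters:
  goal: MAXIMIZE
  hyperparameterMetricTag: 'accuracy'
  maxTrials: 80
  maxParallelTrials: 1
  enableTrialEarlyStopping: False
  params: ...
```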
I am running a single-instance worker on AWS Beanstalk. It is a single-container Docker that runs some processes once every business day. Mostly, the processes sync a large number of small files from S3 and analyze those.
The setup runs fine for about a week, and then CPU load starts growing linearly in time, as in this screenshot.
The CPU load stays at a considerable level, slowing down my scheduled processes. At the same time, the top-resource tracking I run inside the container (the container runs in privileged Docker mode to enable it):
echo "%CPU %MEM ARGS $(date)" && ps -e -o pcpu,pmem,args --sort=pcpu | cut -d" " -f1-5 | tail
shows nearly no CPU load (which changes only during the time that my daily process runs, seemingly accurately reflecting system load at those times).
What am I missing here in terms of the origin of this "background" system load? I am wondering if anybody has seen similar behavior, and/or could suggest additional diagnostics from inside the running container.
So far I have been restarting the setup every week to clear the "background" load, but that is sub-optimal since the first run after each restart has to collect over 1 million small files from S3 (while subsequent daily runs add only a few thousand files per day).
The profile is a bit odd. Especially that it is a linear growth. Almost like something is accumulating and taking progressively longer to process.
I don't have enough information to point at a specific issue. A few things that you could check:
Are you collecting files anywhere, whether intentionally or in a cache or transfer folder? It could be that the system is running background processes (AV, index, defrag, dedupe, etc) and the "large number of small files" are accumulating to become something that needs to be paged or handled inefficiently.
Does any part of your process use a weekly naming convention or housekeeping process? Might you be getting conflicts, or accumulating workload as the week rolls over, i.e. the 2nd week is actually processing both the 1st and 2nd week's data but never completing, so that the next day it is progressively worse? I saw something similar where an inappropriate bubble-sort process was not completing (it never reached its completion condition because the slow but steady inflow of data kept resetting it), and the demand from the process grew progressively higher as the array got larger.
Do you have any logging on a weekly rollover cycle?
Are any other key performance metrics following the same trend (network, disk I/O, memory, paging, etc.)?
Do consider whether it is a false positive. If the CPU load is real, there should be other metrics mirroring the CPU behaviour: cache use, disk I/O, S3 transfer statistics/logging.
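To illustrate the accumulation idea above, here is a hypothetical sketch (names made up) of why a job that re-scans everything collected so far, rather than only the new files, shows per-run cost growing linearly over time:

```python
# Hypothetical model: a daily job that re-scans every file accumulated
# so far does linearly more work each run, matching a CPU-load curve
# that creeps upward week over week. An incremental job stays flat.

def daily_cost(new_files_per_day, days, incremental=False):
    """Return the number of files scanned on each day."""
    costs = []
    total = 0
    for _ in range(days):
        total += new_files_per_day
        costs.append(new_files_per_day if incremental else total)
    return costs

full = daily_cost(5000, 14)               # re-scan everything each day
inc = daily_cost(5000, 14, incremental=True)  # scan only the new files

# Full re-scan doubles the work by day 14 relative to day 7:
assert full[13] == 2 * full[6]
# Incremental work stays flat:
assert inc[13] == inc[0]
```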
RL
How can I fully utilize each of my EC2 cores?
I'm using a c4.4xlarge AWS Ubuntu EC2 instance and TensorFlow to build a large convolutional neural network. nproc says that my EC2 instance has 16 cores. When I run my convnet training code, the top utility says that I'm only using 400% CPU. I was expecting it to use 1600% CPU because of the 16 cores. The AWS EC2 monitoring tab confirms that I'm only using 25% of my CPU capacity. This is a huge network; on my new Mac Pro it consumes about 600% CPU and takes a few hours to build, so I don't think the problem is that my network is too small.
I believe the line below ultimately determines CPU usage:
sess = tf.InteractiveSession(config=tf.ConfigProto())
I admit I don't fully understand the relationship between threads and cores, but I tried increasing the number of threads. It had the same effect as the line above: still 400% CPU.
NUM_THREADS = 16
sess = tf.InteractiveSession(config=tf.ConfigProto(intra_op_parallelism_threads=NUM_THREADS))
EDIT:
htop shows that I am actually using all 16 of my EC2 cores, but each core is only at about 25%.
top shows that my total CPU % is around 400%, but occasionally it shoots up to 1300% and then almost immediately drops back down to ~400%. This makes me think there could be a deadlock problem.
Several things you can try:
Increase the number of threads
You already tried changing the intra_op_parallelism_threads. Depending on your network it can also make sense to increase the inter_op_parallelism_threads. From the doc:
inter_op_parallelism_threads: Nodes that perform blocking operations are enqueued on a pool of inter_op_parallelism_threads available in each process. 0 means the system picks an appropriate number.
intra_op_parallelism_threads: The execution of an individual op (for some op types) can be parallelized on a pool of intra_op_parallelism_threads. 0 means the system picks an appropriate number.
(Side note: The values from the configuration file referenced above are not the actual default values tensorflow uses but just example values. You can see the actual default configuration by manually inspecting the object returned by tf.ConfigProto().)
TensorFlow uses 0 for the above options, meaning it tries to choose appropriate values itself. I don't think TensorFlow picked poor values that caused your problem, but you can try out different values for the above options to be on the safe side.
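As a concrete starting point, a session-configuration sketch (TF 1.x API; the thread counts are just example values to experiment with, not recommendations):

```python
import tensorflow as tf

# Example values only -- tune for your instance; 0 lets TensorFlow decide.
config = tf.ConfigProto(
    intra_op_parallelism_threads=16,  # thread pool used inside a single op
    inter_op_parallelism_threads=16,  # thread pool used to run independent ops
)
sess = tf.InteractiveSession(config=config)
```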
Extract traces to see how well your code parallelizes
Have a look at
tensorflow code optimization strategy
It gives you something like this. In this picture you can see that the actual computation happens on far fewer threads than available. This could also be the case for your network. I marked potential synchronization points. There you can see that all threads are active for a short moment which potentially is the reason for the sporadic peaks in CPU utilization that you experience.
Miscellaneous
Make sure you are not running out of memory (htop)
Make sure you are not doing a lot of I/O or something similar
I implemented a simple HTTP server (link), but the result of the test (ab -n 10000 -c 100 http://localhost:8080/status) is very bad (look at test.png in the previous link).
I don't understand why it doesn't work correctly with multiple threads.
I believe that, by default, Netty's default thread pool is configured with as many threads as there are cores on the machine. The idea being to handle requests asynchronously and non-blocking (where possible).
Your /status test includes a database transaction, which blocks because of the intrinsic design of database drivers etc. So your performance - at a high level - is essentially a result of:-
a.) you are running a pretty hefty test of 10,000 requests attempting to run 100 requests in parallel
b.) you are calling into a database for each request, so this will not be quick (relatively speaking, compared to some non-blocking I/O operation)
A couple of questions/considerations for you:-
Machine Spec.?
What is the spec. of the machine you are running your application and test on?
How many cores?
If you only have 8 cores available then you will only have 8 threads running in parallel at any one time. That means those batches of 100 concurrent requests will be queueing up.
Consider what is running on the machine during the test
It sounds like you are running the application AND Apache Bench on the same machine, so be aware that your application and the testing tool will both be contending for those cores (in addition to any background processes also contending for them - such as the OS)
What will the load be?
Predicting load is difficult. If you do think you are likely to have 100 requests hitting the database at any one time then you may need to think about:-
a. your production environment may need a couple of instances to handle the load
b. try changing the config. of Netty's default thread pool to increase the number of threads
c. think about your application architecture - can you cache any of those results instead of going to the database for each request
Could this be linked to the use of database access (a synchronous task) within one of your handlers (at least in your TrafficShappingHandler)?
You might need to "make async" your database calls (e.g. hand them off to other threads in a producer/consumer fashion)...
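The producer/consumer idea could look something like this language-agnostic sketch (shown in Python; all names are made up): handler threads enqueue requests and return immediately, while a small pool of workers performs the blocking database calls off the hot path.

```python
import queue
import threading

db_jobs = queue.Queue()
results = []

def fake_blocking_query(req):
    # Stand-in for a real, blocking database driver call.
    return f"row-for-{req}"

def db_worker():
    while True:
        req = db_jobs.get()
        if req is None:          # sentinel: shut the worker down
            break
        results.append(fake_blocking_query(req))
        db_jobs.task_done()

# A small pool of workers absorbs the blocking calls.
workers = [threading.Thread(target=db_worker) for _ in range(4)]
for w in workers:
    w.start()

# "Handler" side: enqueue and move on -- never block on the DB.
for i in range(100):
    db_jobs.put(i)

db_jobs.join()                   # demo only: wait for outstanding queries
for _ in workers:
    db_jobs.put(None)
for w in workers:
    w.join()

assert len(results) == 100
```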
If it is something else, I do not have enough information...
Actually I have 3 questions. Any input is appreciated. Thank you!
1) How to run exactly 1 process on each host? My application uses TBB for multi-threading. Does it mean that I should run exactly 1 process on each host for best performance?
2) My cluster has heterogeneous hosts. Some hosts have better CPUs and more memory than others. How do I map process ranks to real hosts for work-distribution purposes? I am thinking of using the hostname. Is there a better way to do it?
3) How are process ranks assigned? Which process gets rank 0?
1) TBB splits loops into several threads of a thread pool to utilize all processors of one machine. So you should only run one process per machine. More processes would fight with each other for processor time. The number of processes per machine is given by options in your hostfile:
# my_hostfile
192.168.0.208 slots=1 max_slots=1
...
2) To give each machine an appropriate amount of work according to its performance is not trivial.
The easiest approach is to split the workload into small pieces of work, send them to the slaves, collect their answers, and give them new pieces of work, until you are done. There is an example on my website (in German). You can also find some references to manuals and tutorials there.
3) Each process gets a number (processID) in your program by
MPI_Comm_rank(MPI_COMM_WORLD, &processID);
The master has processID == 0. Maybe the others are given ranks in the order of the slots in your hostfile. Another possibility is that they are assigned in the order in which the connections to the slaves are established; I don't know for sure.