YARN min container size - mapreduce

Can someone explain to me what the negative impact would be of setting the minimum container size to a small value like 512 MB for a cluster with 128 GB per datanode?
MIN_CONTAINER_SIZE=
╔══════════════════════╦═══════════════════════════╗
║ Total RAM per Node   ║ Recommended Min Container ║
╠══════════════════════╬═══════════════════════════╣
║ Less than 4 GB       ║ 256 MB                    ║
║ Between 4 GB & 8 GB  ║ 512 MB                    ║
║ Between 8 GB & 24 GB ║ 1024 MB                   ║
║ Above 24 GB          ║ 2048 MB                   ║
╚══════════════════════╩═══════════════════════════╝
What's the benefit of adhering to these guidelines?
What if we have jobs that process smaller amounts of data? Wouldn't a small yarn.scheduler.minimum-allocation-mb = 512 MB and a bigger
mapreduce.map.memory.mb = 4096 (even though this is typically set to just double the above parameter)
allow elasticity between a big number of containers, and thus more splits, while still letting them grow quite a bit bigger if needed?
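For a rough feel of the numbers involved, here is a small illustrative C sketch. It assumes the commonly documented behaviour that the scheduler rounds each container request up to a multiple of yarn.scheduler.minimum-allocation-mb, and uses the 128 GB node and 4096 MB map request from the question; the NodeManager memory figure is only a placeholder, not a recommended setting.

#include <stdio.h>

/* Illustrative only: the minimum allocation mainly sets the granularity a
   request is rounded up to; the actual container size still comes from the
   per-job request (e.g. mapreduce.map.memory.mb). */
int main(void) {
    int min_alloc_mb   = 512;          /* yarn.scheduler.minimum-allocation-mb */
    int map_request_mb = 4096;         /* mapreduce.map.memory.mb */
    int node_mb        = 128 * 1024;   /* placeholder for one 128 GB datanode */

    /* request rounded up to the next multiple of the minimum allocation */
    int granted = (map_request_mb + min_alloc_mb - 1) / min_alloc_mb * min_alloc_mb;

    printf("granted per container : %d MB\n", granted);          /* 4096 */
    printf("containers per node   : %d\n", node_mb / granted);   /* 32   */
    return 0;
}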

Related

Intel MPI Benchmarks result exceeds network bandwidth

I was running the Intel MPI Benchmarks on a mini MPI cluster of two nodes on AWS EC2.
The executables of the benchmark were compiled with make CC=/usr/bin/mpicc CXX=/usr/bin/mpicxx.
The cluster was set up based on this tutorial.
(What I did differently is that I installed OpenMPI instead of MPICH.)
The type of the AWS instances is t3a.xlarge.
According to this spec the network performance is up to 5 Gbps (i.e. 625 MB/s).
I chose Ubuntu 18.04 as the OS for these instances.
The PingPong benchmark gave me surprising results:
mpiuser@mpin0:~$ mpirun -rf rankfile -np 2 ./IMB-MPI1 PingPong
#------------------------------------------------------------
# Intel(R) MPI Benchmarks 2019 Update 3, MPI-1 part
#------------------------------------------------------------
# Date : Thu Nov 28 15:17:38 2019
# Machine : x86_64
# System : Linux
# Release : 4.15.0-1054-aws
# Version : #56-Ubuntu SMP Thu Nov 7 16:15:59 UTC 2019
# MPI Version : 3.1
# MPI_Datatype : MPI_BYTE
#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
#---------------------------------------------------
      #bytes  #repetitions       t[usec]    Mbytes/sec
           0          1000         32.05          0.00
           1          1000         31.74          0.03
           2          1000         31.98          0.06
           4          1000         32.08          0.12
           8          1000         31.86          0.25
          16          1000         31.30          0.51
          32          1000         31.40          1.02
          64          1000         32.86          1.95
         128          1000         32.66          3.92
         256          1000         33.56          7.63
         512          1000         35.00         14.63
        1024          1000         34.74         29.48
        2048          1000         36.56         56.02
        4096          1000         39.80        102.91
        8192          1000         49.92        164.10
       16384          1000         57.73        283.79
       32768          1000         78.86        415.54
       65536           640        170.26        384.91
      131072           320        239.80        546.60
      262144           160        357.34        733.61
      524288            80        609.82        859.74
     1048576            40       1106.36        947.77
     2097152            20       2514.62        833.98
     4194304            10       5830.39        719.39
Some of the speeds exceed 625 MB/s.
The rankfile was used to force distributing the two processes onto two nodes.
It is not the case that the two MPI processes are running on the same node.
The content of the rankfile is:
rank 0=mpin0 slot=0
rank 1=mpin1 slot=0
What are the possible reasons that caused the benchmark result to exceed the theoretical limit?
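Purely as a sanity check of the numbers above, here is an illustrative C sketch; the 5 Gbps figure and the 1 MiB row are taken from the question:

#include <stdio.h>

/* At 5 Gbps (625 MB/s), a 1 MiB one-way transfer needs at least ~1678 us,
   yet the table above reports ~1106 us (947.77 MB/s), i.e. above the
   advertised link rate. */
int main(void) {
    double link_bytes_per_s = 5e9 / 8.0;     /* 5 Gbps -> 625,000,000 bytes/s */
    double msg_bytes = 1048576.0;            /* the 1 MiB row of the table   */
    double measured_us = 1106.36;            /* reported t[usec]             */

    printf("minimum time at 625 MB/s: %.0f us\n", msg_bytes / link_bytes_per_s * 1e6);
    printf("measured: %.2f us -> %.2f MB/s\n", measured_us, msg_bytes / measured_us);
    return 0;
}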

How to avoid remote memory allocation in numa architecture?

I have a NUMA machine with two nodes; each node has 8 cores and 64 GB of memory. I run two services, service-1 and service-2: service-1 is deployed on node-0 and service-2 on node-1.
I just want these two services to run separately, so I start them as follows:
numactl --membind=0 ./service-1
numactl --membind=1 ./service-2
In the service code, I use pthread_setaffinity_np to bind threads to the corresponding node's CPUs (binding service-1's threads to CPUs [0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30]).
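For illustration only, here is a minimal C sketch of that kind of binding, assuming the odd/even CPU numbering shown by lscpu below (this is not the actual service code):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Sketch: pin the calling thread to node 1's CPUs (the odd IDs 1..31 on this
   box); service-1 would do the same with the even IDs belonging to node 0. */
static void bind_current_thread_to_node1(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int cpu = 1; cpu < 32; cpu += 2)
        CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}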
localhost:~$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 2
NUMA node(s): 2
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31
localhost:~$ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22
node 0 size: 64327 MB
node 0 free: 17231 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23
node 1 size: 64498 MB
node 1 free: 37633 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10
What I expected is that Linux allocates memory for service-2 on node-1, so that service-2 mainly accesses node-1 memory, plus some dynamic-library pages on node-0. But sadly I find that service-2 has a lot of memory on the remote node (node-0), and accessing it costs too much time.
service-2's numa_maps is as follows:
7f07e8c00000 bind:1 anon=455168 dirty=455168 N0=442368 N1=12800
7f08e4200000 bind:1 anon=128000 dirty=128000 N0=121344 N1=6656
7f092be00000 bind:1 anon=21504 dirty=21504 N0=20992 N1=512
7f093ac00000 bind:1 anon=32768 dirty=32768 N0=27136 N1=5632
7f0944e00000 bind:1 anon=82432 dirty=82432 N0=81920 N1=512
7f0959400000 bind:1 anon=1024 dirty=1024 N0=1024
7f0959a00000 bind:1 anon=4096 dirty=4096 N0=4096
7f095aa00000 bind:1 anon=2560 dirty=2560 N0=2560
7f095b600000 bind:1 anon=4608 dirty=4608 N0=4608
7f095c800000 bind:1 anon=512 dirty=512 N0=512
7f095cc00000 bind:1 anon=512 dirty=512 N0=512
...
So here are my questions:
1. Does Linux really allocate remote memory (node-0) for service-2, even though service-2 is already bound to node-1 by the membind command?
--membind=nodes, -m nodes
Only allocate memory from nodes. Allocation will fail when there is not enough memory available on these nodes. nodes may be specified as noted above.
2. Is this related to kernel.numa_balancing, which is set to 1 on my machine? My understanding is that with kernel.numa_balancing = 1, Linux only migrates tasks closer to their memory, or moves memory to the node where the task is executing. Since service-2 is already bound to node-1, no balancing should happen?
3. Can someone explain how this remote allocation happened? Is there any method to avoid it?
Thank you very much!

AWS volume larger than specified

I created an AWS instance with an attached volume of size 500 GB. Everything looks good in the AWS console, which shows the volume to be 500 GB (it is /dev/xvdf). But when I ssh into the instance and look at the drive, I see it is actually 540 GB instead of 500 GB. Why is this, and where did this extra 40 GB come from?
fdisk output:
Disk /dev/xvdf: 536.9 GB, 536870912000 bytes
255 heads, 63 sectors/track, 65270 cylinders, total 1048576000 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000
df -h (uses 1024):
/dev/xvdf 493G 110G 358G 24% /data0
df -H (uses 1000):
/dev/xvdf 529G 118G 384G 24% /data0
Your volume size is correct.
536,870,912,000 ÷ 1,024 ÷ 1,024 ÷ 1,024 = 500 GiB.
1 GiB ("gibibyte," or giga-binary byte) is 2^30 bytes. EBS volume sizes are in GiB.
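A quick sketch of the two conventions in play (illustrative C, nothing AWS-specific):

#include <stdio.h>

/* The same 500 GiB EBS volume, expressed in binary GiB (what the console
   shows) and in decimal GB (what fdisk reports as 536.9 GB). */
int main(void) {
    unsigned long long bytes = 536870912000ULL;              /* 500 * 2^30 */
    printf("GiB: %.1f\n", bytes / (1024.0 * 1024 * 1024));   /* 500.0 */
    printf("GB : %.1f\n", bytes / 1e9);                      /* 536.9 */
    return 0;
}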
I might be wrong, but "/dev/xvdf" shows that AWS is using some form of Xen,
be that XenServer or some other flavor.
What happens is:
Xen calculates how much space is actually needed so that after you format the volume to ext4 (or any other FS) you will have 500 GB, or as close to it as possible.
Anyways, this is just from my experience.

In Hadoop HDFS, how many data nodes does a 1 GB file use to be stored?

I have a 1 GB file to be stored on the HDFS file system. I have a cluster of 10 data nodes and a namenode. Is there any calculation by which the Namenode uses (not counting replicas) a particular number of data nodes to store the file? Or is there any parameter we can configure for file storage? If so, what is the default number of datanodes Hadoop uses to store the file when nothing is specifically configured?
I want to know whether it uses all the datanodes of the cluster or only a specific number of datanodes.
Let's assume the HDFS block size is 64 MB and that there is free space on all of the datanodes.
Thanks in advance.
If the configured block size is 64 MB and you have a 1 GB file, the file size is 1024 MB.
So the blocks needed will be 1024/64 = 16 blocks to store your 1 GB file.
Now, let's say you have a 10-node cluster; the default replication factor is 3, which means each block of your 1 GB file is stored on 3 different nodes. So the blocks occupied by your 1 GB file: 16 * 3 = 48 blocks.
If one block is 64 MB, then the total space your 1 GB file consumes is
64 * 48 = 3072 MB.
Hope that clears your doubt.
In the second (2nd) generation of Hadoop, where the default block size is 128 MB:
If the configured block size is 128 MB and you have a 1 GB file, the file size is 1024 MB.
So the blocks needed will be 1024/128 = 8 blocks to store your 1 GB file.
Now, let's say you have a 10-node cluster; the default replication factor is 3, which means each block of your 1 GB file is stored on 3 different nodes. So the blocks occupied by your
1 GB file: 8 * 3 = 24 blocks.
If one block is 128 MB, then the total space your 1 GB file consumes is
128 * 24 = 3072 MB.
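Restating the arithmetic from both answers as a quick C sketch (illustrative only; the default replication factor of 3 is assumed):

#include <stdio.h>

/* Blocks needed and raw space consumed for a 1 GB (1024 MB) file at the two
   block sizes discussed above, with replication factor 3. */
int main(void) {
    const int file_mb = 1024, replication = 3;
    const int block_sizes_mb[] = { 64, 128 };

    for (int i = 0; i < 2; i++) {
        int block_mb = block_sizes_mb[i];
        int blocks   = (file_mb + block_mb - 1) / block_mb;   /* ceiling division */
        printf("block=%3d MB: %2d blocks, %2d replicas, %d MB raw\n",
               block_mb, blocks, blocks * replication,
               blocks * replication * block_mb);
    }
    return 0;   /* 64 MB -> 48 block replicas (3072 MB); 128 MB -> 24 (3072 MB) */
}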

Converting B to MB properly, or entirely wrong

I'm not very experienced in these things, so please try not to jump to conclusions right out of the gate. Okay, so I've got a number in bytes that I've been trying to convert to MB with little consistency or success. An example is a directory I have that comes back as 191,919,191 bytes (191.919 MB) when I 'get info'.
I was curious about how to convert it myself, so here's what I learned:
Google:
1 KB = 1000 B
1 MB = 1000 KB
1 GB = 1000 MB
So far so good...
1024000 B in KB = 1024
1024 KB in MB = 1.024
This seems perfectly logical...
191,919,191 B to MB = 191.919 MB
This looks correct too, but when I go to convert my bytes to MB using almost any code sample out there, I end up with something far different from friendly ol' Google.
According to Princeton
SYNOPSIS:
Converting between bytes, kilobytes, megabytes, and gigabytes.
SOLUTION:
1 byte = 1 character
1 kilobyte (kb) = 1024 bytes
1 megabyte (Mb) = 1024 kb = 1,048,576 bytes
1 gigabyte (Gb) = 1024 Mb = 1,048,576 kb = 1,073,741,824 bytes
So with this information:
191,919,191 B / 1,024,000 = 187.421 MB
I've also seen conversions like this:
191,919,191 B / (1024 * 1024) = 183.028 MB
WTF? Is this stuff just made up as we go along, or is there some standard way of getting the real file size in MB from bytes? I'm completely lost and confused now because of this conflicting information. I have no real idea of who is right or wrong here, or if I'm just completely missing something.
I have code like this:
UInt32 fileSize = 191919191; // size in bytes
UInt32 mbSize = fileSize / 1024000; // do conversion
printf("%u MB",(unsigned int)mbSize); // result:
Which outputs:
187 MB
So how in the world can 191,919,191 bytes = 191 MB?
Just to summarise...
The official, SI standardised, correct use of the prefixes is that kilo = 10^3 = 1000, mega = 10^6 = 1000000 and so on. The abbreviations are k, M, G etc.
There is a separate set of prefixes for the digital world where kibi = 2^10 = 1024, mebi = 2^20 = 1048576 and so on. The abbreviations are Ki, Mi, Gi and so on.
In popular usage the abbreviations K, M, G etc are slightly vague, and sometimes understood to mean one thing, sometimes the other.
The bottom line is that whenever it matters, you have to take extra care to know which you're using. Some documentation will not have taken that care, and can be very misleading. I don't see this ambiguity changing any time soon.
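To tie the two conventions together, here is a tiny illustrative C sketch using the byte count from the question:

#include <stdio.h>

/* 191,919,191 bytes expressed with decimal (SI) and binary prefixes. */
int main(void) {
    unsigned long long bytes = 191919191ULL;
    printf("decimal MB: %.3f\n", bytes / (1000.0 * 1000.0));   /* 191.919 */
    printf("binary MiB: %.3f\n", bytes / (1024.0 * 1024.0));   /* 183.028 */
    return 0;
}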