Number of mappers too low for 9 GB of data - mapreduce

I have 9 GB of data made up of many small files, each smaller than the 64 MB block size. The cluster-level block size is set to 512 MB, but I copied this data in with a 64 MB block size. When I run my Pig script with the job-level Hadoop block size set to 64 MB, it allocates 20 mappers for the job; I verified that job.xml shows the block size as 64 MB. When I re-execute the same Pig script with the cluster-level setting at 64 MB, it allocates 160 mappers.
My concern is why it allocates only 20 mappers when I set the block size at the job level. Any ideas?

Related

How can I force a Spark executor to spawn more threads per task?

I am running an EMR Spark cluster with this setup:
Master: 1 of m5.xlarge
Core: 4 of m5.xlarge
spark.executor.instances 4
spark.executor.cores 4
spark.driver.memory 11171M
spark.executor.memory 10356M
spark.emr.maximizeResourceAllocation true
spark.emr.default.executor.memory 10356M
spark.emr.default.executor.cores 4
spark.emr.default.executor.instances 4
where m5.xlarge is an instance type with 4 vCPU cores and 16 GB of memory.
Since I am using Spark to migrate a database, the workload is very I/O-intensive but not particularly CPU-intensive. I notice each executor node only spawns 4 threads (seemingly 1 thread per vCPU core), while the CPU still has plenty of headroom.
Is there a way to force a higher thread allocation per executor node so that I can fully utilize my resources? Thanks.
One vCPU can run only one thread at a time.
If you have assigned 4 vCPUs to your executor, it will never spawn more than 4 task threads.
For more detail:
Calculation of vCPU & Cores
First, we need to select a virtual server and CPU. For this example, we’ll select Intel Xeon E-2288G as the underlying CPU. Key stats for the Intel Xeon E-2288G include 8 cores / 16 threads with a 3.7GHz base clock and 5.0GHz turbo boost. There is 16MB of onboard cache.
(2 threads per core x 8 cores) x 1 CPU = 16 vCPU
reference

Excessive thread count yields better results with file reading

I have a hundred million files, and my program reads all of them at every startup. I have been looking for ways to make this process faster, and along the way I encountered something strange: my CPU has 4 physical cores, yet reading these files with far higher thread counts yields much better results. This is interesting, given that spawning more threads than the CPU's logical core count should be somewhat pointless.
8 Threads: 29.858 s
16 Threads: 15.882 s
32 Threads: 9.989 s
64 Threads: 7.965 s
128 Threads: 8.275 s
256 Threads: 8.159 s
512 Threads: 8.098 s
1024 Threads: 8.253 s
4096 Threads: 8.744 s
16001 Threads: 10.033 s
Why might this occur? Is it some disk bottleneck?
I did my homework and profiled the code: literally 95% of the runtime consists of read(), open() and close().
I am reading the first 4096 bytes of every file (my page size).
Ubuntu 18.04
Intel i7 6700HQ
Samsung 970 Evo Plus NVMe SSD
GCC/G++ 11
Why might this occur?
If you open one file at "/a/b/c/d/e" and then read one block of data from it, the OS may have to fetch directory info for "/a", then for "/a/b", then for "/a/b/c", and so on. It might add up to a total of 6 blocks fetched from disk (5 blocks of directory info, then one block of file data), and those blocks might be scattered all over the disk.
If you open 100 million files and read one block of file data from each, this might involve fetching 600 million things (500 million pieces of directory info and 100 million pieces of file data).
What is the optimal order to do these 600 million things?
Often there are directory info caches and file data caches involved (and all requests that can be satisfied by data that's already cached should be done ASAP, before that data is evicted from the cache(s) to make room for other data). Often the disk hardware also has rules (e.g. it's faster to access all blocks within the same "group of disk blocks" before switching to the next group). Sometimes there's parallelism in the disk hardware (e.g. two requests from the same zone can't be done in parallel, but two requests from different zones can).
The optimal order in which to do these 600 million things is something the OS can figure out.
More specifically: the optimal order is something the OS can figure out if, and only if, the OS actually knows about all of them.
If you have (e.g.) 8 threads that send one request (e.g. to open a file) and then block (using no CPU time) until the pending request completes; then the OS will only know about a maximum of 8 requests at a time. In other words; the operating system's ability to optimize the order that file IO requests are performed is constrained by the number of pending requests, which is constrained by the number of threads you have.
Ideally; a single thread would be able to ask the OS "open all the files in this list of a hundred million files" so that the OS can fully optimize the order (with the least thread management overhead). Sadly, most operating systems don't support anything like this (e.g. POSIX asynchronous IO fails to support any kind of "asynchronous open").
Having a large number of threads (all blocked and using no CPU time while they wait for their requests to actually be done by the file system and/or disk driver) is the only way to improve the operating system's ability to optimize the order of IO requests.
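A minimal sketch of that pattern (hypothetical code, not the asker's program): N worker threads pull paths from a shared counter and each read the first 4096 bytes, so the kernel always has many blocked requests in flight that it is free to reorder:

#include <atomic>
#include <cstddef>
#include <fcntl.h>
#include <string>
#include <thread>
#include <unistd.h>
#include <vector>

// Read the first 4096 bytes of every file in `paths` using `nthreads`
// threads; compile with -pthread.
void read_all(const std::vector<std::string>& paths, unsigned nthreads)
{
    std::atomic<std::size_t> next{0};
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < nthreads; ++t) {
        workers.emplace_back([&] {
            char buf[4096];
            for (std::size_t i = next++; i < paths.size(); i = next++) {
                int fd = open(paths[i].c_str(), O_RDONLY);
                if (fd < 0) continue;             // skip unreadable files
                (void)read(fd, buf, sizeof buf);  // thread blocks here; the
                close(fd);                        // kernel sees many requests
            }
        });
    }
    for (auto& w : workers) w.join();
}

Notice that around 64 threads the timings above flatten out, consistent with the pending-request queue (rather than the CPU) being the limiting factor.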

Different results while profiling memory to get max RSS with massif and time

A little bit of context: I am implementing a C++ application that uses mmap to map arbitrarily large files for reading and writing, which can scale from a few MB to several GB. Because of this, it is important to profile the program's memory usage (peak RSS; I want to see how much physical memory it consumes) in order to gauge its performance.
I use Valgrind's massif tool with the option --pages-as-heap=yes and massif-visualizer. I expected this to show me the peak RSS. I ran the program with mmap reserving exactly 1 GB, and massif-visualizer showed exactly what I expected (a 1 GB peak).
I also used the \time -v command, and it showed a very small max RSS (roughly 5000 kbytes). Here's one example output:
User time (seconds): 0.00
System time (seconds): 0.00
Percent of CPU this job got: 112%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.00
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 4772
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 1928
Voluntary context switches: 32
Involuntary context switches: 1
Swaps: 0
File system inputs: 0
File system outputs: 88
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
I also modified the program so it would run much longer, in case the interval between memory snapshots was not sufficient, but I got the same results.
Why are the memory peaks different? From what I read in several posts, people expect that massif with the --pages-as-heap=yes option will show the maximum RSS as its peak. However, the \time command reports a much smaller peak. I suspect that massif's snapshots measure virtual memory, but correct me if I am wrong. Furthermore, I would appreciate it if anyone familiar with massif could describe how it works and whether there is a way to get the max RSS. Thanks in advance!
EDIT: The answer here seems to partly answer my question:
Does mmap or malloc allocate RAM?
As far as I understand, on most operating systems, when you map files into memory they occupy virtual address space rather than physical memory. However, the mapped pages start to occupy physical memory as soon as they become dirty, i.e. something is written to them.
So, back to my original question: could this imply that Valgrind massif with the --pages-as-heap=yes flag tracks virtual memory, whilst \time -v shows less physical memory because the pages were never modified?
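A minimal sketch that illustrates this hypothesis (hypothetical code; it uses an anonymous mapping rather than the file-backed mappings described above):

#include <cstddef>
#include <sys/mman.h>

int main()
{
    const std::size_t len = 1UL << 30;  // 1 GB
    // The mapping counts toward virtual size immediately (which is what
    // massif --pages-as-heap=yes tracks) but adds almost nothing to RSS
    // until pages are written.
    char* p = static_cast<char*>(mmap(nullptr, len, PROT_READ | PROT_WRITE,
                                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0));
    if (p == MAP_FAILED)
        return 1;
    // Uncomment to dirty every page: the max RSS reported by \time -v
    // then climbs toward 1 GB, matching the massif view.
    // for (std::size_t off = 0; off < len; off += 4096) p[off] = 1;
    munmap(p, len);
    return 0;
}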

poll / epoll server - how to handle per connection data

I am writing a Linux network server using poll/epoll.
I plan to have lots of connected clients, say 5000-10000 or even 20000.
Each client will send some data as a request, and the server will send some data back. For the moment, for simplicity, I have decided to limit this data to 16 KB.
I am using C++11.
As I see it:
I can create one huge "static" array of 16 KB blocks sized for the maximum number of clients, e.g. for 10K connections: 10000 x 16 KB = 160 MB (sketched below).
I can create an array of buffers (std::vector<char>) and push_back into them.
I can create a std::vector of buffers.
In all cases the server will take 160 MB at full load, but if I use the "static" array there are no memory allocations or moves after the initial allocation.
What is the best way to proceed, and is there some other solution I am missing here?
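A minimal sketch of the first option (hypothetical names, not a definitive design):

#include <array>
#include <cstddef>
#include <vector>

constexpr std::size_t kBufSize = 16 * 1024;  // 16 KB per connection
constexpr std::size_t kMaxClients = 10000;   // 10000 x 16 KB = ~160 MB

struct Connection {
    int fd = -1;                       // -1 marks a free slot
    std::size_t used = 0;              // bytes currently buffered
    std::array<char, kBufSize> buf;    // fixed per-connection block
};

class ConnectionPool {
public:
    ConnectionPool() : slots_(kMaxClients) {}  // one-time ~160 MB allocation
    Connection* acquire(int fd) {
        for (auto& c : slots_)                 // linear scan keeps the sketch
            if (c.fd == -1) {                  // short; a free list would
                c.fd = fd;                     // avoid O(n) on the hot path
                c.used = 0;
                return &c;
            }
        return nullptr;                        // pool exhausted
    }
    void release(Connection* c) { c->fd = -1; }
private:
    std::vector<Connection> slots_;            // never reallocated
};

int main() {
    ConnectionPool pool;
    Connection* c = pool.acquire(42);  // pretend fd 42 just connected
    if (c) pool.release(c);
}

Because the pool never grows or shrinks, pointers handed out by acquire() stay valid for the connection's lifetime, which is what makes this layout attractive inside an epoll event loop.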

Speed variation between vCPUs on the same Amazon EC2 instance

I'm exploring the feasibility of running numerical computations on Amazon EC2. I currently have one c4.8xlarge instance running. It has 36 vCPUs, each of which is a hyperthread of a Haswell Xeon chip. The instance runs Ubuntu in HVM mode.
I have a GCC-optimised binary of a completely sequential (i.e. single-threaded) program. I launched 30 instances of it with CPU pinning, thus:
for i in `seq 0 29`; do
nohup taskset -c $i $BINARY_PATH &> $i.out &
done
The 30 processes run almost identical calculations. There's very little disk activity (a few megabytes every 5 minutes), and there's no network activity or interprocess communication whatsoever. htop reports that all processes run constantly at 100%.
The whole thing has been running for about 4 hours at this point. Six processes (12-17) have already done their task, while processes 0, 20, 24 and 29 look as if they will require another 4 hours to complete. Other processes fall somewhere in between.
My questions are:
Other than resource contention with other users, is there anything else that may be causing the significant variation in performance between the vCPUs within the same instance? As it stands, the instance would be rather unsuitable for any OpenMP or MPI jobs that synchronise between threads/ranks.
Is there anything I can do to achieve a more uniform (hopefully higher) performance across the cores? I have basically excluded hyperthreading as a culprit here since the six "fast" processes are hyperthreads on the same physical cores. Perhaps there's some NUMA-related issue?
My experience is with c3 instances; it's likely similar for c4.
For example, take a c3.2xlarge instance with 8 vCPUs (most of the explanation below is derived from direct discussion with AWS support).
In fact, only the first 4 vCPUs are usable for heavy scientific calculations; the last 4 vCPUs are hyperthreads. For scientific applications hyperthreading is often not useful: it causes context switching or reduces the available cache (and associated bandwidth) per thread.
To find out the exact mapping between the vCPUs and the physical cores, look into /proc/cpuinfo:
"physical id": the physical processor id (only one processor in a c3.2xlarge)
"processor": the vCPU number
"core id": which physical core the vCPU maps back to
If you put this in a table, you have:
physical_id  processor  core_id
0            0          0
0            1          1
0            2          2
0            3          3
0            4          0
0            5          1
0            6          2
0            7          3
You can also get this from "thread_siblings_list", the kernel's internal map of the hardware threads within the same core as cpuX (https://www.kernel.org/doc/Documentation/cputopology.txt):
cat /sys/devices/system/cpu/cpuX/topology/thread_siblings_list
When hyper-threading is enabled, each vCPU ("processor") is a logical core, and there are 2 logical cores associated with each physical core.
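A small sketch (hypothetical, Linux-only) that prints this mapping for every vCPU by walking the sysfs files above:

#include <fstream>
#include <iostream>
#include <string>

int main()
{
    // Walk cpu0, cpu1, ... until a topology file is missing.
    for (int cpu = 0; ; ++cpu) {
        std::ifstream f("/sys/devices/system/cpu/cpu" + std::to_string(cpu)
                        + "/topology/thread_siblings_list");
        if (!f)
            break;                    // no more vCPUs
        std::string siblings;
        std::getline(f, siblings);    // e.g. "0,4" for cpu0 on a c3.2xlarge
        std::cout << "cpu" << cpu << ": siblings " << siblings << "\n";
    }
    return 0;
}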
So, in your case, one solution is to disable hyperthreading with:
echo 0 > /sys/devices/system/cpu/cpuX/online
where X, for a c3.2xlarge, would be 4 through 7 (one echo per vCPU, run as root).
EDIT : you can observe this behaviour only in HVM instances. In PV instances, this topology is hidden by the hypervisor : all core ids & processor ids in /proc/cpuinfo are '0'.