Spark Using Disk Resources When Memory is Available

I'm working on optimizing the performance of my Spark cluster (running on AWS EMR), which performs collaborative filtering using the ALS matrix factorization algorithm. We are using quite a few factors and iterations, so I'm trying to optimize these steps in particular. I am trying to understand why I am using disk space when I have plenty of memory available. Here is the total cluster memory available (from the monitoring graph):
Here is the remaining disk space (notice the dips in disk utilization):
I've looked at the YARN manager, and it shows that each slave node has 110 GB used and 4 GB available. You can also see the total allocated in the first image (700 GB). I've also tried changing the ALS source and forcing intermediateRDDStorageLevel and finalRDDStorageLevel from MEMORY_AND_DISK to MEMORY_ONLY, and that didn't change anything.
I am not persisting my RDDs anywhere else in my code, so I'm not sure where this disk utilization is coming from. I'd like to make better use of the resources on my cluster. Any ideas? How can I use the available memory more effectively?
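For reference, those storage levels can also be switched through the public setters on the RDD-based spark.mllib ALS instead of editing its source; a minimal sketch, with placeholder rank/iteration/lambda values and a placeholder ratings RDD:

import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// `ratings` and the rank/iteration/lambda values stand in for whatever the job really uses.
def trainInMemory(ratings: RDD[Rating]) =
  new ALS()
    .setRank(100)          // placeholder
    .setIterations(15)     // placeholder
    .setLambda(0.01)       // placeholder
    .setIntermediateRDDStorageLevel(StorageLevel.MEMORY_ONLY)  // default: MEMORY_AND_DISK
    .setFinalRDDStorageLevel(StorageLevel.MEMORY_ONLY)         // default: MEMORY_AND_DISK
    .run(ratings)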

There are a few scenarios where Spark will use disk instead of memory:
Shuffle operations. Spark writes shuffled data to disk only, so if your job shuffles (as ALS does) you are out of luck on that front.
Low executor memory. If executor memory is low, Spark has less room to keep data in memory, so it will spill data from memory to disk. However, as you mentioned, you have already tried executor memory from 20G to 40G. I would recommend keeping executor memory at no more than 40G, as beyond that JVM GC can make your process slower.
If you don't have shuffle operations, you might as well tweak spark.memory.fraction if you are using Spark 2.2.
From the documentation:
spark.memory.fraction expresses the size of M (the fraction of the heap used for execution and storage) as a fraction of (JVM heap space - 300MB); the default is 0.6. The rest of the space (40%) is reserved for user data structures, internal metadata in Spark, and safeguarding against OOM errors in the case of sparse and unusually large records.
So you can set spark.memory.fraction to 0.9 and see the behavior.
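A minimal sketch of what that looks like, assuming the unified memory manager of Spark 1.6+ and a SparkConf built before the session:

import org.apache.spark.SparkConf

// Raise the unified memory fraction before creating the context/session.
val conf = new SparkConf()
  .set("spark.memory.fraction", "0.9")          // default 0.6
  .set("spark.memory.storageFraction", "0.5")   // default; share of the above protected from eviction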
Lastly, there are storage levels other than MEMORY_ONLY, such as MEMORY_ONLY_SER, which serializes the data and stores it in memory. This option reduces memory usage, because the serialized object size is much smaller than the actual object size. If you see a lot of spill, you can opt for this storage level.
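A small sketch of opting into that storage level (the cacheSerialized helper is just illustrative):

import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Cache an RDD serialized in memory instead of as deserialized Java objects.
// Pairing this with Kryo serialization shrinks the footprint further.
def cacheSerialized[T](rdd: RDD[T]): RDD[T] =
  rdd.persist(StorageLevel.MEMORY_ONLY_SER)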

Related

LMDB: Open large databases in a limited memory system

I have a program that is projected to use a few GB of LMDB disk space (it's a blockchain, and we're moving away from LevelDB due to its lack of ACID guarantees, which I need for some future plans). Is it possible to run that program with that database on a Raspberry Pi, with its roughly 1 GB of memory, without adding more swap (considering that adding swap is something for advanced users)?
Currently, when I run that program with mdb_env_set_mapsize(1 << 30), hence a 1 GB map size, it returns error 12, which is out of memory (ENOMEM). But it works if I reduce the size to 512 MB.
What is the right way to handle such memory issues in LMDB when the database size keeps increasing?
The maximum amount of memory that can be memory-mapped depends on the size of the virtual address space, which is dictated by the CPU's virtual memory manager. A 32-bit CPU has a limit of 4 GB of virtual address space; this limit is for the whole system unless PAE is enabled, in which case the limit is per process.
In addition, the kernel and your application reserve some space of their own in the address space, and memory mapping usually requires contiguous address space, which further reduces the memory available for the database.
So your users will need to either enable PAE on their system or upgrade to a 64-bit CPU. If neither of these is an option for your application, then you cannot use a memory-mapped file larger than your available address space, so you'll have to do some segmentation to split your data into multiple files of which you map only small chunks at a time. I'm guessing that LMDB requires being able to map the entire database file into memory.
For a blockchain application, your data is mostly a linear sequence of log entries, so your application should only need to work with the most recent entries most of the time. You can separate the recent entries into their own working file, and keep the rest of the log either in a database that doesn't require mapping the entire file into memory or in multiple fixed-size files that you map and unmap as needed, as in the sketch below.
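A rough sketch of that segmentation idea using plain memory-mapped files from java.nio (this is not LMDB's API; the segment size and file name are made up):

import java.io.RandomAccessFile
import java.nio.MappedByteBuffer
import java.nio.channels.FileChannel

// Keep the log in fixed-size segment files and map only the segment(s) currently
// in use, so the mapping always fits comfortably in a 32-bit address space.
val segmentSize: Long = 64L * 1024 * 1024   // 64 MB per segment (made-up figure)

def mapSegment(path: String): MappedByteBuffer = {
  val file = new RandomAccessFile(path, "rw")
  try file.getChannel.map(FileChannel.MapMode.READ_WRITE, 0, segmentSize)
  finally file.close()                      // the mapping stays valid after the channel closes
}

// Only the newest segment is mapped; older segments stay on disk until needed.
val recent = mapSegment("chain-segment-0042.dat")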

How Size of Driver Memory & Executor memory is allocated when using Spark on Amazon EMR

I was using AWS EMR 5.2 with 10 m4.2xlarge nodes to run my Spark applications on Spark 2.0.2. I used the property maximizeResourceAllocation=true, and in spark-defaults.conf I saw the following properties:
spark.executor.instances 10
spark.executor.cores 16
spark.driver.memory 22342M
spark.executor.memory 21527M
spark.default.parallelism 320
In yarn-site.xml I saw yarn.nodemanager.resource.memory-mb=24576 (24 GB). I understand that spark.executor.instances is set to 10 because I am using a 10-node cluster, but can anyone explain how the other properties have been set, i.e. how the driver memory and executor memory have been calculated? Also, how does maximizeResourceAllocation=true affect the memory?
I suggest the book Spark in Action. In brief, executors are containers that run tasks delivered to them by the driver. One node in a cluster can launch several executors, depending on resource allocation. CPU allocation enables running tasks in parallel, so it is better to have more cores per executor: more CPU cores means more task slots. Memory allocation for executors should be made in a sane way that fits within the YARN container memory: YARN container memory >= executor memory + executor memory overhead.
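Plugging in the numbers from the question, and assuming the usual default overhead of max(384 MB, 10% of executor memory) (the exact factor can differ by version), the executor does fit the container:

// Does one executor fit in the 24 GB YARN container from yarn-site.xml?
val executorMemoryMb = 21527                                           // spark.executor.memory
val overheadMb       = math.max(384, (0.10 * executorMemoryMb).toInt)  // ≈ 2,152 MB
val containerNeedMb  = executorMemoryMb + overheadMb                   // ≈ 23,679 MB
val containerLimitMb = 24576                                           // yarn.nodemanager.resource.memory-mb
assert(containerNeedMb <= containerLimitMb)                            // fits, with a little headroom left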
Spark reserves parts of that memory for cached data storage and for temporary shuffle data. Set the heap for these with the parameters spark.storage.memoryFraction (default 0.6) and spark.shuffle.memoryFraction (default 0.2); note that these are the legacy settings, which Spark 1.6+ replaces with the unified spark.memory.fraction. Because these parts of the heap can grow before Spark can measure and limit them, two additional safety parameters must be set: spark.storage.safetyFraction (default 0.9) and spark.shuffle.safetyFraction (default 0.8). The safety parameters lower the memory fraction by the amount specified. The actual part of the heap used for storage by default is 0.6 × 0.9 (the storage memory fraction times its safety fraction), which equals 54%. Similarly, the part of the heap used for shuffle data is 0.2 × 0.8 (the shuffle memory fraction times its safety fraction), which equals 16%. You then have 30% of the heap reserved for other Java objects and resources needed to run tasks. You should, however, count on only 20%.
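Applying those fractions to the 21,527 MB executor heap from the question gives a rough picture (an estimate only, not EMR's exact formula):

// Rough split of the 21,527 MB executor heap under the legacy fractions above.
val executorHeapMb = 21527.0
val storageMb = executorHeapMb * 0.6 * 0.9             // ≈ 11,625 MB for cached data (54%)
val shuffleMb = executorHeapMb * 0.2 * 0.8             // ≈ 3,444 MB for shuffle data (16%)
val otherMb   = executorHeapMb - storageMb - shuffleMb // ≈ 6,458 MB for everything else (~30%)

// maximizeResourceAllocation also appears to derive parallelism from cores × instances:
val parallelism = 2 * 16 * 10                          // = 320, matching spark.default.parallelism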
The driver orchestrates stages and tasks among executors. Results from executors are returned to the driver, so the driver's memory should also be sized to handle all the data that may be gathered together from all the executors.

How does hardware affect the performance of malloc/new on Windows

I am working on the performance of a C++ application on Windows 7 that does a lot of computation and a lot of small allocations. Basically, I observed a bottleneck using the Visual Studio sampling profiler, and it came down to the parsing of a file and the creation of a huge tree structure of the type:
#include <map>

// key, SomeMetadata and TreeStructPtr are defined elsewhere in the real code.
class TreeStruct : public std::map<key, TreeStructPtr>
{
    SomeMetadata p;   // per-node metadata
    int* buff;        // small per-node buffer, 1 byte to a few hundred bytes
    int buffsize;
};
There are tens of thousands of these structures created during the parsing.
The buffer is not that big: 1 byte to a few hundred bytes.
The profiler reports that the most costly functions are:
free (13 000 exclusive samples, 38% Exclusive Samples)
operator new (13 000 exclusive samples, 38% Exclusive Samples)
realloc (4000 exclusive samples, 13% Exclusive Samples)
I managed to optimize and reduce the allocations to:
operator new (2200 exclusive samples, 48% Exclusive Samples)
free (1770 exclusive samples, 38% Exclusive Samples)
some function (73 exclusive samples, 1.5% Exclusive Samples)
When I measure the client waiting time (i.e., a client waits for the action to be processed, timed with a stopwatch), the installed version on my machine went from 85 s of processing time to 16 s of processing time, which is great. I then proceeded to test on the most powerful machine we have and was stunned that the non-optimized version took only 3.5 s while the optimized one took around 2 s. Same executable, same operating system...
Question: How is such a disparity possible on two modern machines?
Here are the specs of the two machines (the detailed hardware listings are not reproduced here):
85s-to-16s machine
3.5s-to-2s machine
The processing is single-threaded.
As others have commented, frequent small allocations are a waste of time and memory.
For every allocation, there is overhead:
Function call preparation.
The function call itself (a break in the execution path; possible reload of the execution pipeline).
An algorithm to find a free memory block (perhaps a search).
Allocating the memory (marking the block as unavailable).
Placing the address into a register.
Returning from the function (another break in sequential execution).
Regardless of your machine's speed, the above process is a lot of execution to allocate a small block of memory.
Modern processors love to keep their data close, in a data cache. Their performance increases when they can fetch data from the cache instead of fetching it from outside the processor (access times slow down the further away the values are: memory on the chip but outside the core, memory off the chip on the same board, memory on other boards, memory on devices such as flash or a hard drive). Reallocating memory defeats the effectiveness of the data cache.
The operating system may also get involved and slow down your program. In the allocation or delete functions, the OS may check for paging. Paging, in a simple form, is the swapping of memory areas with areas on the hard drive. This may occur when other, higher-priority tasks are running and demand more memory.
An algorithm for speeding up data access:
Load data from memory into local variables (registers if possible).
Process the data in the local variables (registers).
Store the finished data.
If you can, place data into structures. Load all the structure members at once. Structures allow for data to be placed into contiguous memory (which reduces the need to reload the cache).
Lastly, reduce branching or changes in execution. Research "loop unrolling". Your compiler may perform this optimization at higher optimization settings.
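To make the contiguous-allocation advice concrete, here is the same idea sketched in Scala rather than C++ (all names are made up): reserve one big buffer up front and hand out offsets into it, instead of allocating a small buffer per node.

// Hypothetical arena: one contiguous Int buffer shared by every tree node,
// so the parser performs one big allocation instead of tens of thousands of small ones.
final class IntArena(capacity: Int) {
  private val data = new Array[Int](capacity)   // single up-front allocation
  private var next = 0

  // Reserve `size` ints and return the starting offset into the shared buffer.
  def allocate(size: Int): Int = { val off = next; next += size; off }

  def get(offset: Int, i: Int): Int          = data(offset + i)
  def set(offset: Int, i: Int, v: Int): Unit = data(offset + i) = v
}

// A node now stores just an (offset, size) pair rather than owning its own small buffer.
final case class Node(metadata: String, offset: Int, size: Int)

val arena = new IntArena(10 * 1024 * 1024)
val node  = Node("example", arena.allocate(100), 100)
arena.set(node.offset, 0, 42)

In C++ the same effect comes from reserving one large std::vector<int> (or using a pool/arena allocator) up front and storing offsets in the nodes, which removes most allocator calls and keeps the node data contiguous for the cache.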

Amazon RDS running out of freeable memory. Should I be worried?

I have an Amazon RDS instance. Freeable memory has been declining since setup over 1-2 weeks, starting from 15 GB of memory and dropping down to about 250 MB. Now that it has dropped this low, it has started to follow a sawtooth pattern: freeable memory drops into the 250-350 MB range and then climbs back up to 500-600 MB.
There has not been any notable decline in application quality. However, I am worried that the DB will run out of memory and crash.
Is there a danger that the RDS instance will run out of memory? Is there some setting or parameter I should be looking at to determine if the instance is set up correctly? What is causing this sawtooth pattern?
Short answer: you shouldn't worry about FreeableMemory unless it becomes really low (about 100-200 MB) or significant swapping occurs (see the RDS SwapUsage metric).
FreeableMemory is not a MySQL metric but an OS metric. It is hard to give a precise definition, but you can treat it as the memory the OS would be able to allocate to anyone who requests it (in your case that will most likely be MySQL). MySQL has a set of settings that restrict its overall memory usage to some cap (you can calculate that cap from those settings). It's unlikely that your instance will ever hit this limit, because in general you never reach the maximum number of connections, but it is still possible.
Now back to the "decline" in the FreeableMemory metric. For MySQL, most of the memory is consumed by the InnoDB buffer pool (see the MySQL documentation on the buffer pool for details). RDS instances in their default configuration set the size of this buffer to 75% of the host's physical memory, which in your case is about 12 GB. This buffer is used for caching all DB data involved in both read and write operations.
So in your case, since this buffer is really big, it slowly fills with cached data (it is likely that the buffer is actually big enough to cache the entire DB). When you first start your instance the buffer is empty, and once you start reading from and writing to the DB, that data is brought into the cache. It stays there until the cache becomes full and a new request comes in, at which point the least recently used data is replaced with new data. This explains the initial decline of FreeableMemory after a DB instance restart. It is not a bad thing, because you actually want as much of your data as possible to be cached so that your DB works faster. The only way this can turn nasty is if part or all of this buffer is pushed out of physical memory into swap; at that point you will see a huge performance drop.
As preventive care, it might be a good idea to tune how much memory MySQL may use for different purposes if your FreeableMemory metric is constantly at a level of 100-200 MB, just to reduce the possibility of swapping.
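As a quick sanity check of where the memory goes, assuming a 16 GB instance class (the exact figure depends on your instance type):

// Back-of-the-envelope: the default InnoDB buffer pool on RDS is roughly 3/4 of instance RAM.
val instanceRamGb = 16.0                          // assuming a 16 GB instance class
val bufferPoolGb  = instanceRamGb * 0.75          // ≈ 12 GB, matching the figure above
val leftoverGb    = instanceRamGb - bufferPoolGb  // ≈ 4 GB for the OS, connections and other caches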
The freeable memory is used by MySQL for buffering and caching for its own processes. It is normal for the amount of freeable memory to decrease over time, and I wouldn't be worried: MySQL kicks old data out of the cache as it demands more room.
After several support tickets with AWS, I found that tuning the parameter groups can help, especially the shared buffer settings: lowering them keeps a reserved quantity of memory and avoids drops or failovers due to lack of memory.

Cuda zero-copy performance

Does anyone have experience with analyzing the performance of CUDA applications utilizing the zero-copy (reference here: Default Pinned Memory Vs Zero-Copy Memory) memory model?
I have a kernel that uses the zero-copy feature and with NVVP I see the following:
Running the kernel on an average problem size, I get an instruction replay overhead of 0.7%, so nothing major, and all of this 0.7% is global memory replay overhead.
When I really jack up the problem size, I get an instruction replay overhead of 95.7%, all of which is due to global memory replay overhead.
However, the global load efficiency and global store efficiency are the same for both the normal problem size run and the very, very large problem size run. I'm not really sure what to make of this combination of metrics.
The main thing I'm not sure of is which statistics in NVVP will help me see what is going on with the zero copy feature. Any ideas of what type of statistics I should be looking at?
Fermi and Kepler GPUs need to replay memory instructions for multiple reasons:
The memory operation was for a size specifier (vector type) that requires multiple transactions in order to perform the address divergence calculation and communicate data to/from the L1 cache.
The memory operation had thread address divergence requiring access to multiple cache lines.
The memory transaction missed the L1 cache. When the missed value is returned to L1, the L1 notifies the warp scheduler to replay the instruction.
The LSU (load/store unit) resources are full and the instruction needs to be replayed when the resources are available.
The approximate latencies are:
L2: 200-400 cycles
device memory (DRAM): 400-800 cycles
zero-copy memory over PCIe: thousands of cycles
The replay overhead is increasing because of the increase in misses and the contention for LSU resources caused by the increased latency.
The global load efficiency is not increasing because it is the ratio of the ideal amount of data that would need to be transferred for the memory instructions that were executed to the actual amount of data transferred. Ideal means that the executed threads accessed sequential elements in memory starting at a cache-line boundary (a 32-bit operation is 1 cache line, a 64-bit operation is 2 cache lines, a 128-bit operation is 4 cache lines). Accessing zero-copy memory is slower and less efficient, but it does not increase or change the amount of data transferred.
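A back-of-the-envelope illustration of that ratio, assuming a 32-thread warp, 4-byte loads, and 128-byte cache lines (not a counter the profiler reports in this form):

// Global load efficiency = bytes the threads asked for / bytes actually moved.
val warpSize      = 32
val bytesPerLoad  = 4      // 32-bit loads
val lineSizeBytes = 128

val requested = warpSize * bytesPerLoad                 // 128 bytes per warp-wide load

val coalescedMoved = 1 * lineSizeBytes                  // all addresses fall in one line
val scatteredMoved = warpSize * lineSizeBytes           // every thread hits a different line

val coalescedEff = requested.toDouble / coalescedMoved  // 1.0     (100%)
val scatteredEff = requested.toDouble / scatteredMoved  // 0.03125 (~3%)

The ratio depends only on the access pattern, which is why it stays the same whether the data sits in device memory or in zero-copy host memory.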
The profiler exposes the following counters:
gld_throughput
l1_cache_global_hit_rate
dram_{read, write}_throughput
l2_l1_read_hit_rate
In the zero copy case all of these metrics should be much lower.
The Nsight VSE CUDA Profiler memory experiments will show the amount of data accessed over PCIe (zero copy memory).