MPI allocating massive amounts of memory on startup - fortran

I am trying to run an MPI application compiled with mpiifort (ifort (IFORT) 2021.6.0 20220226) and launched with mpiexec.hydra (Intel(R) MPI Library for Linux* OS, Version 2021.6 Build 20220227 (id: 28877f3f32)) on a machine with 36 Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz cores.
Before any actual work is done, that is, before the application prints any output, massive amounts of memory are allocated (tens to hundreds of GB). The amount seems to depend on the number of cores used; with more than 30 cores, MPI actually crashes because it runs out of memory.
Is this an MPI or compiler bug? What is all this memory allocated for? Can I control its amount somehow?
Edit: here's a minimal example
program test
    implicit none
    integer :: mpi_err
    include "mpif.h"
    ! Nothing but init/finalize; the memory is already allocated after MPI_INIT
    call MPI_INIT(mpi_err)
    call MPI_FINALIZE(mpi_err)   ! the ierror argument is required with mpif.h
end program test
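For reference, here is a rough C++ equivalent of the reproducer that prints each rank's resident set size immediately after MPI_Init (the VmRSS readout from /proc/self/status is Linux-specific and purely a diagnostic sketch, not part of the original program):

// Diagnostic sketch: print per-rank resident memory right after MPI_Init.
// Build e.g. with: mpiicpc rss_after_init.cpp -o rss_after_init
#include <mpi.h>
#include <cstdio>

// Read VmRSS (resident set size in kB) from /proc/self/status (Linux only).
static long rss_kb() {
    std::FILE* f = std::fopen("/proc/self/status", "r");
    char line[256];
    long kb = -1;
    while (f && std::fgets(line, sizeof(line), f)) {
        if (std::sscanf(line, "VmRSS: %ld kB", &kb) == 1) break;
    }
    if (f) std::fclose(f);
    return kb;
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    std::printf("rank %d: VmRSS after MPI_Init = %ld kB\n", rank, rss_kb());
    MPI_Finalize();
    return 0;
}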

Related

CUDA C++: How to benchmark shared memory bandwidth?

I'm looking for a way to benchmark shared memory and L1/L2 cache bandwidth. However, the benchmark results I have found differ a lot depending on the source.
In this paper, Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking, they show a shared memory bandwidth of 12000 GB/s on a Tesla V100, but they don't explain how they arrived at that number. If I use gpumembench on an NVIDIA A30, I only get ~5000 GB/s.
Are there any other sample programs I can use to benchmark shared memory?
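For context, the kind of naive kernel I would write myself looks like the sketch below; the block/grid sizes, iteration count, and the volatile trick are arbitrary choices of mine, and I don't expect it to reproduce the paper's methodology:

// Rough shared-memory read-bandwidth microbenchmark (illustrative only).
#include <cstdio>
#include <cuda_runtime.h>

constexpr int THREADS = 1024;   // threads per block
constexpr int BLOCKS  = 512;    // blocks in the grid
constexpr int ITERS   = 4096;   // shared-memory reads per thread

__global__ void shmem_read_bw(float* out)
{
    // volatile keeps the compiler from caching the shared loads in registers
    __shared__ volatile float buf[THREADS];
    buf[threadIdx.x] = threadIdx.x;        // populate shared memory
    __syncthreads();

    float acc = 0.0f;
    int idx = threadIdx.x;
    #pragma unroll 16
    for (int i = 0; i < ITERS; ++i) {
        acc += buf[idx];                   // one 4-byte shared load per iteration
        idx = (idx + 32) & (THREADS - 1);  // consecutive lanes stay in distinct banks
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = acc;  // keep the result live
}

int main()
{
    float* d_out;
    cudaMalloc(&d_out, BLOCKS * THREADS * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    shmem_read_bw<<<BLOCKS, THREADS>>>(d_out);   // warm-up launch
    cudaEventRecord(start);
    shmem_read_bw<<<BLOCKS, THREADS>>>(d_out);   // timed launch
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double bytes = double(BLOCKS) * THREADS * ITERS * sizeof(float);
    std::printf("shared memory read bandwidth: %.1f GB/s\n",
                bytes / (ms * 1e-3) / 1e9);

    cudaFree(d_out);
    return 0;
}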

Detecting CPU and Core information from my Intel System

I am currently using Windows 8 Pro with an Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz and 8 GB of RAM.
I want to know how many physical processors and how many actual cores my system has. With my very basic understanding of hardware and this discussion here, when I look up this processor at this Intel site here, it says:
# of Cores 4
# of Threads 8
In the Task Manager of my System for CPU, it says:
Maximum Speed: 3.60 GHz
Sockets: 1
Cores: 4
Physical processors: 8
Am I correct in assuming that I have 1 physical processor with 4 actual physical cores, and that each physical core has 2 virtual cores (= 2 threads)? That would account for the total of 8 "processors" shown in my Task Manager. But if my assumption is correct, why does it say physical processors = 8 and not virtual processors?
I need to know the core details of my machine because I need to write low-latency programs, possibly using OpenMP.
Thanks for your time...
From the perspective of your operating system, even HyperThreaded processors are "real" processors - they exist in the CPU. They use real, physical resources like instruction decoders and ALUs. Just because those resources are shared between HT cores doesn't mean they're not "real".
General computing will see a speedup from Hyper-Threading, because the various threads are doing different kinds of things and can leverage the shared resources. A CPU-intensive task running in parallel may not see as large a gain, however, due to the strain on the shared resources. For example, if there's only one ALU, it doesn't make sense to have two threads competing for it.
Run benchmarks and determine for your application what the appropriate settings are, regarding HT being enabled or not. With a question this broad, we can't give you a definitive answer.
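If you want to confirm the topology programmatically rather than read it off Task Manager, a minimal Win32 sketch along the lines below (my own variable names, error handling omitted) distinguishes physical cores from the logical processors that sit on them:

// Count physical cores and logical processors via GetLogicalProcessorInformation.
#include <windows.h>
#include <vector>
#include <cstdio>

int main() {
    DWORD len = 0;
    GetLogicalProcessorInformation(nullptr, &len);   // first call: query buffer size
    std::vector<SYSTEM_LOGICAL_PROCESSOR_INFORMATION> info(
        len / sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION));
    GetLogicalProcessorInformation(info.data(), &len);

    int physical_cores = 0, logical_processors = 0;
    for (const auto& e : info) {
        if (e.Relationship == RelationProcessorCore) {
            ++physical_cores;
            // one set bit per logical processor (hardware thread) on this core
            for (ULONG_PTR mask = e.ProcessorMask; mask != 0; mask >>= 1)
                logical_processors += static_cast<int>(mask & 1);
        }
    }
    std::printf("physical cores:     %d\n", physical_cores);
    std::printf("logical processors: %d\n", logical_processors);
    return 0;
}

On an i7-4790 this should report 4 physical cores and 8 logical processors, which is what both Intel's spec page and Task Manager are describing.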

Will CUDA API affect CPU's Ram access performance?

* Update: more testing shows the CPU RAM slowness is not related to CUDA. It turns out Func2 (CPU) is CPU-intensive but not memory-intensive, so in program1 the pressure on memory is low because Func2 is what occupies the CPU. In program2 (GPU), Func2 becomes very fast on the GPU, so Func1 dominates the CPU and puts a lot of pressure on memory, which is what makes Func1 slow. *
Short version: if I run 20 processes concurrently on the same server, I notice the CPU-side code runs much slower when the GPU is involved (vs. pure CPU processes).
Long version:
My server: Win Server 2012, 48 cores (hyperthreaded from 24), 192 GB RAM (the 20 processes will only use ~40 GB), 4 K40 cards
My program1 (CPU Version):
For 30 iterations:
Func1(CPU): 6s (lots of CPU memory access)
Func2(CPU): 60s (lots of CPU memory access)
My program2 (GPU Version, to use CPU cores for Func1, and K40s for Func2):
//1 K40 will hold 5 contexts at the same time, till end of the 30 iterations
cudaSetDevice //very slow, can be few seconds
cudaMalloc ~1GB //can be another few seconds
For 30 iterations:
Func1(CPU): 6s
Func2(GPU), 1s (60X speedup) //share GPU via named_mutex
If I run 20 instances of program1 (CPU) together, Func1's 6s becomes 12s on average.
With 20 instances of program2 (GPU), Func1 takes ~42s to complete, while Func2 (GPU) still takes ~1s (this 1s includes locking the GPU, some cudaMemcpy calls and the kernel call, and presumably GPU context switching as well). So the GPU's own performance is not affected much, while the CPU side is (by the GPU).
So I suspect cudaSetDevice/cudaMalloc/cudaMemcpy are affecting the CPU's RAM access? If that's true, parallelization using both the multi-core CPU and the GPU will suffer.
Thanks.
This is almost certainly caused by resource contention.
When, in the standard API, you run 20 processes, you have 20 separate GPU contexts. Every time one of those processes wishes to perform an API call, there must be a context switch to that process's context. Context switching is expensive and has a lot of latency. That is the source of the slowdown you are seeing. It has nothing to do with memory performance.
NVIDIA has released a system called MPS (Multi-Process Service) which reimplements the CUDA API as a service and internally exploits the Hyper-Q facility of modern Tesla cards to push operations onto the wide command queue which Hyper-Q supports. This removes all the context switching latency. It sounds like you might want to investigate this, if performance is important to you and your code requires a large number of processes sharing a single device.

parallel c++11 program random crashes

I have a problem which I have not been able to solve for a long time now. Since I am out of ideas, I am happy for any suggestions.
The program is a physics simulation which works on a huge tree data structure with millions of dynamically allocated nodes that are constructed / reorganized / destructed many times in parallel throughout the simulation, with a lot of pointers involved. Although this might sound very error-prone, I am almost sure that I am doing all of this in a thread-safe manner. The program uses only standard libs and classes plus Intel MKL (BLAS / LAPACK optimized for Intel CPUs) for matrix operations.
My code is parallelized using C++11 threads. The program runs fine on my desktop, my laptop, and on two different Intel clusters using up to 8 threads. Only on one cluster does the code suffer from random crashes if I use more than 2 threads (it runs absolutely fine with one or two threads).
The crash reports vary from case to case but are mostly connected to heap corruption (segmentation fault, corrupted doubly linked list, malloc assertions, ...). Sometimes the program gets caught in an infinite loop as well. In very rare cases the data structure suddenly blows up and the program runs out of memory. Anyway, since the program runs fine on all other machines, I doubt the problem is in my source code. Since the crashes occur randomly, I found all backtrace information relatively useless.
The hardware of the problematic cluster is almost identical to another cluster on which the code runs fine with up to 8 threads (Intel Xeon E5-2630 CPUs). The libs / compilers / OS are all relatively up to date. Note that other OpenMP-parallelized programs run fine on the same cluster.
(Linux version 3.11.10-21-default (geeko#buildhost) (gcc version 4.8.1 20130909 [gcc-4_8-branch revision 202388] (SUSE Linux) ) #1 SMP Mon Jul 21 15:28:46 UTC 2014 (9a9565d))
I already tried the following approaches:
adding a lot of assertions to make sure that all my pointers are handled correctly
linking against tcmalloc instead of glibc malloc/free
trying different compilers (g++, icpc, clang++) and compiler options (with / without compiler optimizations / debugging options)
using a working binary from another machine with statically linked libraries
using OpenMP instead of C++11 threads
switching between serial / parallel MKL
using other blas / lapack libraries
Using valgrind is out of the question, since the problem occurs randomly after anywhere from 10 minutes up to several hours, and valgrind gives me a slowdown factor of around 50 - 100 (plus valgrind does not allow real concurrency). Nevertheless, I ran the code in valgrind for several hours without problems.
Also, I cannot see any problem with the resource limits:
RLIMIT_AS: 18446744073709551615
RLIMIT_CORE : 18446744073709551615
RLIMIT_CPU: 18446744073709551615
RLIMIT_DATA: 18446744073709551615
RLIMIT_FSIZE: 18446744073709551615
RLIMIT_LOCKS: 18446744073709551615
RLIMIT_MEMLOCK: 18446744073709551615
RLIMIT_MSGQUEUE: 819200
RLIMIT_NICE: 0
RLIMIT_NOFILE: 1024
RLIMIT_NPROC: 2066856
RLIMIT_RSS: 18446744073709551615
RLIMIT_RTPRIO: 0
RLIMIT_RTTIME: 18446744073709551615
RLIMIT_SIGPENDING: 2066856
RLIMIT_STACK: 18446744073709551615
I found out that for some reason the stack size per thread seems to be only 2 MB, so I increased it using ulimit -s. Anyway, stack size shouldn't be the problem.
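For completeness, a minimal way to cross-check the per-thread stack size (a sketch assuming Linux/glibc, where std::thread sits on top of pthreads; pthread_getattr_np is a GNU extension, so build with g++ -std=c++11 -pthread):

// Print the actual stack size of a thread created with std::thread,
// by querying the underlying pthread attributes.
#include <pthread.h>
#include <thread>
#include <cstdio>

int main() {
    std::thread t([] {
        pthread_attr_t attr;
        pthread_getattr_np(pthread_self(), &attr);   // attributes of this running thread
        size_t stack_size = 0;
        pthread_attr_getstacksize(&attr, &stack_size);
        std::printf("per-thread stack size: %zu kB\n", stack_size / 1024);
        pthread_attr_destroy(&attr);
    });
    t.join();
    return 0;
}

std::thread itself does not expose the stack size, so changing it has to go through ulimit -s (or pthread_attr_setstacksize when creating threads with raw pthreads).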
Also, the program should not have a problem with allocating memory on the heap, since the memory size is more than sufficient.
Does anyone have an idea of what could be going wrong here / where I should look? Maybe I am missing some environment variables I should check? I think the fact that the error occurs only if I use more than two threads, and that the crash rate for more than two threads is independent of the number of threads, could be a hint.
Thanks in advance.

CUDA ptxas Error "function uses too much shared data"

I have never used CUDA or C++ before, but I am trying to get Ramses GPU from http://www.maisondelasimulation.fr/projects/RAMSES-GPU/html/download.html running.
Due to an error in autogen.sh, I used ./configure and got that working.
The generated Makefile contains the following NVCC flags:
NVCCFLAGS = -gencode=arch=compute_10,code=sm_10 -gencode=arch=compute_11,code=sm_11 -gencode=arch=compute_13,code=sm_13 -gencode=arch=compute_20,code=sm_20 -gencode=arch=compute_20,code=compute_20 -use_fast_math -O3
But when I try to compile the program using make, I get multiple ptxas Errors:
Entry function '_Z30kernel_viscosity_forces_3d_oldPfS_S_S_iiiiiffff' uses too much shared data (0x70d0 bytes + 0x10 bytes system, 0x4000 max)
Entry function '_Z26kernel_viscosity_forces_3dPfS_S_S_iiiiiffff' uses too much shared data (0x70d0 bytes + 0x10 bytes system, 0x4000 max)
Entry function '_Z32kernel_viscosity_forces_3d_zslabPfS_S_S_iiiiiffff9ZslabInfo' uses too much shared data (0x70e0 bytes + 0x10 bytes system, 0x4000 max)
I'm trying to compile this code on Linux with kernel 2.6 and CUDA 4.2 (this is at my university, and they don't upgrade things regularly) on two NVIDIA C1060s. I tried replacing sm_10, sm_11 and sm_13 with sm_20 (I saw this fix here: Entry function uses too much shared data (0x8020 bytes + 0x10 bytes system, 0x4000 max) - CUDA error), but that didn't fix my problem.
Do you have any suggestions? I can upload the Makefile as well as everything else, if you need it.
Thank you for your help!
The code you are compiling requires a static allocation of 28880 bytes (0x70d0) of shared memory per block. For compute capability 2.x and newer GPUs this is no problem, because they support up to 48 KB of shared memory. However, for compute capability 1.x devices the shared memory limit is 16 KB (and up to 256 bytes of that can be consumed by kernel arguments). Because of this, the code cannot be compiled for compute 1.x devices, and the compiler is generating an error telling you so. The error therefore comes from specifying sm_13/compute_13 to the compiler. You can remove that (and the other compute 1.x targets) and the build should work; see the trimmed flags below.
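For illustration (assuming you keep only the compute 2.0 targets from your original Makefile line), NVCCFLAGS would reduce to something like:
NVCCFLAGS = -gencode=arch=compute_20,code=sm_20 -gencode=arch=compute_20,code=compute_20 -use_fast_math -O3
Note that this only makes the build succeed; as explained below, it does not make those kernels runnable on a compute 1.3 C1060.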
However, it gets worse. The Tesla C1060 is a compute capability 1.3 device, so you will not be able to compile and run those kernels on your GPUs. There is no solution short of omitting those kernels from the build (if you don't need them), back-porting the code to the compute 1.x architecture (I have no idea whether that is feasible), or finding more modern hardware to run the code on.