Utilizing the two devices in a Tesla K80 on AWS p2 instances

I use a p2 instance on AWS, which is supposed to have a Tesla K80 GPU with two GK210 GPUs inside it (https://blogs.nvidia.com/blog/2014/11/18/tesla-k80-perf/).
According to the following post from Nvidia forums, I should be able to see and access each of the two devices separately (https://devtalk.nvidia.com/default/topic/995255/using-tesla-k80-as-two-tesla-k40/?offset=4).
However, when I run nvidia-smi on the p2 instance, I only see one device:
[ec2-user@ip-172-31-34-73 caffe]$ nvidia-smi
Wed Feb 22 12:20:51 2017
+------------------------------------------------------+
| NVIDIA-SMI 352.99 Driver Version: 352.99 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 On | 0000:00:1E.0 Off | 0 |
| N/A 34C P8 31W / 149W | 55MiB / 11519MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
How can I monitor and access the 2 devices?

The actual situation with a p2.xlarge instance is that you have half of a K80 (a single GK210 device) assigned to that VM, so the nvidia-smi output here is expected, and you will not be able to access two GPU devices from that VM/instance type.
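If you want to confirm this from code rather than from nvidia-smi, the following is a minimal sketch using the standard CUDA runtime API (the printed text and the instance-size comments are just illustrative):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        std::printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    // On a p2.xlarge this should report 1 device; the larger p2.8xlarge and
    // p2.16xlarge instance types expose 8 and 16 GK210 devices respectively.
    std::printf("visible CUDA devices: %d\n", count);
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        std::printf("  device %d: %s (PCI bus id %d)\n", i, prop.name, prop.pciBusID);
    }
    return 0;
}

From the shell, nvidia-smi -L gives the same one-line-per-device listing.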

Related

Build Gstreamer gst-plugins-bad 1.18.6 with support for Nvidia cuda, encoding/decoding, etc

Similar questions have been asked elsewhere on Stack Overflow. However, I do not believe any posts here give an answer relevant to the more recent releases of gst-plugins-bad.
I'd like to encode and decode H264 video streams with gstreamer using hardware support via my GTX1080 video card. I was able to get this to work previously by following this guide for gst-plugins-bad 1.16.3, but my goal now is to access features available in 1.18.6. However, starting with 1.17.0 the build system for gst-plugins-bad changed from autoconf to Meson. I don't have any experience with Meson at all (honestly, I hadn't even heard of it until this point), and as such have no idea how to pass the proper arguments to build with nvidia support. I also wouldn't be asking here if I had found any documentation covering what I'm trying to do that is specific to the later versions of the gstreamer plugins; as far as I can tell, there isn't any.
I am on Ubuntu 22.04 with Cuda 11.8, Gstreamer 1.20.3, gst-plugins-bad 1.18.6.
For reference, here is the output from nvidia-smi:
Wed Jan 4 14:42:19 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05 Driver Version: 520.61.05 CUDA Version: 11.8 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:42:00.0 On | N/A |
| 0% 54C P2 40W / 240W | 652MiB / 8192MiB | 1% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1879 G /usr/lib/xorg/Xorg 354MiB |
| 0 N/A N/A 2034 G /usr/bin/gnome-shell 92MiB |
| 0 N/A N/A 12603 G ...1/usr/lib/firefox/firefox 163MiB |
| 0 N/A N/A 20393 G ...RendererForSitePerProcess 38MiB |
+-----------------------------------------------------------------------------+
Any help is appreciated, thanks in advance.

While using X11 on a GPU, does XShmGetImage give you host/device memory back?

If you have X11 running on a GPU like so:
Fri Aug 2 23:52:39 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.30 Driver Version: 430.30 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla M60 Off | 00000000:00:1E.0 Off | 0 |
| N/A 28C P8 14W / 150W | 141MiB / 7618MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 3255 G /usr/lib/xorg/Xorg 57MiB |
| 0 3286 G /usr/bin/gnome-shell 81MiB |
+-----------------------------------------------------------------------------+
If you run XShmGetImage(), does it give you a pointer to a memory address in GPU memory or host memory?
If it is in GPU memory, I assume you can do other operations on the NVIDIA card with it, like H264-encoding that data.
Is there a way to copy the memory from one GPU memory block to a different GPU memory block?
I am using NVENC libraries.
Reading the MIT Shared Memory extension's documentation:
The next step is to create the shared memory segment. This is best
done after the creation of the XImage, since you need to make use of
the information in that XImage to know how much memory to allocate. To
create the segment, you need a call like:
shminfo.shmid = shmget(IPC_PRIVATE, image->bytes_per_line * image->height, IPC_CREAT|0777);
This implies the extension regards "shared memory" as "that which is returned by shmget or equivalent". Since shmget is incapable of allocating GPU memory, my answer is that the XImage is in host memory, not device memory.
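For completeness, here is a minimal sketch of the usual XShm capture path, assuming an already-open Display and a target Window (the helper name captureWindow is made up for illustration). It shows why the pixels end up in system memory allocated by shmget/shmat, and why they would still need an explicit host-to-device upload (for example cudaMemcpy with cudaMemcpyHostToDevice, or the equivalent copy into an NVENC input buffer) before the GPU can H264-encode them:

#include <X11/Xlib.h>
#include <X11/extensions/XShm.h>
#include <sys/ipc.h>
#include <sys/shm.h>

// Hypothetical helper: grab one frame of 'win' into a shared-memory XImage.
XImage* captureWindow(Display* dpy, Window win, XShmSegmentInfo* shminfo) {
    XWindowAttributes attr;
    XGetWindowAttributes(dpy, win, &attr);

    // The XImage is backed by a System V shared-memory segment on the host...
    XImage* img = XShmCreateImage(dpy, attr.visual, attr.depth, ZPixmap,
                                  nullptr, shminfo, attr.width, attr.height);
    shminfo->shmid = shmget(IPC_PRIVATE, img->bytes_per_line * img->height,
                            IPC_CREAT | 0777);
    shminfo->shmaddr = img->data = static_cast<char*>(shmat(shminfo->shmid, nullptr, 0));
    shminfo->readOnly = False;
    XShmAttach(dpy, shminfo);
    XSync(dpy, False);

    // ...so img->data is ordinary host memory that the X server copies the
    // pixels into. To encode on the GPU you would still have to upload it,
    // e.g. cudaMemcpy(devPtr, img->data, size, cudaMemcpyHostToDevice).
    XShmGetImage(dpy, win, img, 0, 0, AllPlanes);
    return img;
}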

How to check the region where your Google Colab is running?

Following this tweet, I tried without luck to use this GPU on Google Colab. I'm wondering if this is due to the region where my notebook is running, but I have no idea how to check this.
Am I missing something setting up the GPU? I followed this post [Solved! See UPDATE 2]
How can I check from Colab which region I'm in?
UPDATE
The output of !nvidia-smi is
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.56 Driver Version: 410.79 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 00000000:00:04.0 Off | 0 |
| N/A 32C P0 54W / 149W | 121MiB / 11441MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
UPDATE 2
Tesla T4 is now available but I'm still interested in how to check in which region my instances are running.
You can check the server location with curl:
curl ipinfo.io
You can change the server with a factory reset of the runtime, but you can't choose the region.
In your Colab notebook, create a new cell and run:
!curl ipinfo.io
Output:
{
"ip": "31.141.44.210",
"hostname": "31.141.44.210.bc.googleusercontent.com",
"city": "Groningen",
"region": "Groningen", <--
"country": "NL",
"loc": "52.2192,6.1667",
"org": "AS396981 Google LLC",
"postal": "9711",
"timezone": "Europe/Amsterdam",
"readme": "https://ipinfo.io/missingauth"
}
Just running curl ipinfo.io (without the leading !) didn't work for me, so I'm extending @Krishna's answer here.
For me, !nvidia-smi is not working; it tells me:
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
Moreover I still have the K80 so

How to troubleshoot cudaErrorUnknown from cudaDeviceSynchronize()?

I have a large code base that performs RGB to YUV color conversion with CUDA kernels. Since I do a lot of parallel conversions, I use streams (maybe that is relevant here). The code runs on Linux. It works fine on a Quadro K4200 GPU, but I recently got a new Quadro P4000 GPU on which I constantly get cudaErrorUnknown when calling cudaDeviceSynchronize(). Before this happens, the only things I do are a call to cuMemcpy2DAsync to copy the pixel data, followed by a call to my kernel. The code base is large and I can share some relevant parts, but can anyone give advice on how I could troubleshoot this? Since I was working with the K4200 all the time, I haven't changed the CUDA compiler flags. Should I do that? I am currently compiling the same code for both cards with the following flags:
--compiler-bindir /usr/bin/gcc-4.9 -gencode=arch=compute_30,code=\"sm_30,compute_30\" -cudart static -maxrregcount=0 --machine 64 --compile -g -G -std=c++11 -D_MWAITXINTRIN_H_INCLUDED
But in that case is it even possible to make a single object that runs on different GPUs?
This is the output of nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.90 Driver Version: 384.90 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Quadro P4000 Off | 00000000:04:00.0 Off | N/A |
| 46% 39C P0 29W / 105W | 0MiB / 8112MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Quadro K4200 Off | 00000000:84:00.0 Off | N/A |
| 30% 40C P0 26W / 110W | 0MiB / 4036MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
Should I disable the old card? Could the driver start behaving incorrectly because it sees both cards? Are there any internal NVIDIA logs/tools that I can use to get a more detailed description of what is failing?
How to troubleshoot ... ?
By transforming your program into a Minimal, Complete, Verifiable Example (MCVE) of this issue manifesting.
This will focus your "list of suspects" to very few CUDA API calls, which should either be enough for you to figure out the problem by yourself or would make it possible for you to post the whole thing (in a different question) here and get proper help. Or you'll find out the problem goes away as you drop supposedly-irrelevant parts of the code, meaning that it lies with what you've just removed.
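While cutting the code down, it also helps to check the result of every CUDA runtime call, so the first failing call is reported immediately rather than surfacing later as cudaErrorUnknown from cudaDeviceSynchronize(). A minimal sketch of such a check (the macro name and the placement of the two checks after the launch are just illustrative):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: abort with file/line info on the first failure.
#define CHECK_CUDA(call)                                                   \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            std::fprintf(stderr, "%s failed at %s:%d: %s\n", #call,        \
                         __FILE__, __LINE__, cudaGetErrorString(err_));    \
            std::abort();                                                  \
        }                                                                  \
    } while (0)

int main() {
    CHECK_CUDA(cudaSetDevice(0));
    // ... the cuMemcpy2DAsync and the kernel launch would go here ...
    CHECK_CUDA(cudaGetLastError());       // catches invalid launch configurations
    CHECK_CUDA(cudaDeviceSynchronize());  // catches errors raised during execution
    return 0;
}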
Recompiling the kernel with the correct architecture flags, -gencode=arch=compute_61,code=sm_61, as suggested by @tera, fixed it for the Quadro P4000. However, the same code now fails on the Quadro K4200, this time with a more reasonable error, cudaErrorNoKernelImageForDevice:
This indicates that there is no kernel image available that is suitable for the device. This can occur when a user specifies code generation options for a particular CUDA source file that do not include the corresponding device configuration.
So apparently my biggest problem was that I lacked the knowledge to understand what could be causing the cudaErrorUnknown.
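Regarding the earlier question of whether a single object can run on different GPUs: nvcc can embed code for several architectures in one fat binary if you pass one -gencode pair per target (for example both arch=compute_30,code=sm_30 and arch=compute_61,code=sm_61), and the runtime picks the matching image. A small sketch for checking at runtime which compute capability each installed card actually has:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        // The K4200 is Kepler (sm_30) and the P4000 is Pascal (sm_61); the
        // binary must carry SASS (or at least PTX) for each of these, or the
        // launch fails with cudaErrorNoKernelImageForDevice.
        std::printf("device %d: %s, compute capability %d.%d\n",
                    i, prop.name, prop.major, prop.minor);
    }
    return 0;
}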

Reliable way to programmatically get the number of hardware threads on Windows

I'm struggling to find a reliable way to get the number of hardware threads on Windows. I am running Windows 7 Professional SP1 64-bit on a machine with dual Intel Xeon E5-2699 v3 CPUs @ 2.30GHz, totaling 36 cores and 72 threads.
I have tried different methods to get the number of cores, and I have found that only two of them seem to work accurately in a 32-bit or 64-bit process. Here are my results:
+------------------------------------------------+----------------+----------------+
| Methods | 32-bit process | 64-bit process |
+------------------------------------------------+----------------+----------------+
| GetSystemInfo->dwNumberOfProcessors | 32 | 36 |
| GetNativeSystemInfo->dwNumberOfProcessors | 36 | 36 |
| GetLogicalProcessorInformation | 36 | 36 |
| GetProcessAffinityMask.processAffinityMask | 32 | 32 |
| GetProcessAffinityMask.systemAffinityMask | 32 | 32 |
| omp_get_num_procs | 32 | 36 |
| getenv("NUMBER_OF_PROCESSORS") | 36 | 36 |
| GetActiveProcessorCount(ALL_PROCESSOR_GROUPS) | 64 | 72 |
| GetMaximumProcessorCount(ALL_PROCESSOR_GROUPS) | 64 | 72 |
| boost::thread::hardware_concurrency() | 32 | 36 |
| Performance counter API | 36 | 36 |
| WMI | 72 | 72 |
| HARDWARE\DESCRIPTION\System\CentralProcessor | 72 | 72 |
+------------------------------------------------+----------------+----------------+
I cannot explain why all these functions return different values. The only two methods that seem reliable to me are either using WMI (which is fairly complicated) or simply reading the following key from the Windows registry: HARDWARE\DESCRIPTION\System\CentralProcessor.
What do you think?
Do you confirm that the WMI and registry key methods are the only reliable methods?
Thanks in advance
The API function that you need is GetLogicalProcessorInformationEx. Since you have more than 64 logical processors, your processors are grouped, and GetLogicalProcessorInformation only reports the processors in the processor group to which the calling thread is currently assigned. You need to use GetLogicalProcessorInformationEx to get past that limitation.
The documentation says:
On systems with more than 64 logical processors, the GetLogicalProcessorInformation function retrieves logical processor information about processors in the processor group to which the calling thread is currently assigned. Use the GetLogicalProcessorInformationEx function to retrieve information about processors in all processor groups on the system.
Late answer with code:
// Requires <windows.h>, <memory>, and <cstdlib>.
size_t myHardwareConcurrency() {
    size_t concurrency = 0;
    DWORD length = 0;
    // The first call with a null buffer only queries the required size; it is
    // expected to fail with ERROR_INSUFFICIENT_BUFFER.
    if (GetLogicalProcessorInformationEx(RelationAll, nullptr, &length) != FALSE)
        return concurrency;
    if (GetLastError() != ERROR_INSUFFICIENT_BUFFER)
        return concurrency;
    std::unique_ptr<void, void (*)(void*)> buffer(std::malloc(length), std::free);
    if (!buffer)
        return concurrency;
    unsigned char* mem = reinterpret_cast<unsigned char*>(buffer.get());
    // The second call fills the buffer with variable-sized records.
    if (GetLogicalProcessorInformationEx(
            RelationAll,
            reinterpret_cast<PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX>(mem),
            &length) == FALSE)
        return concurrency;
    // For every physical core, count the set bits of its group affinity
    // masks, i.e. its logical processors (hardware threads).
    for (DWORD i = 0; i < length;) {
        auto* proc = reinterpret_cast<PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX>(mem + i);
        if (proc->Relationship == RelationProcessorCore) {
            for (WORD group = 0; group < proc->Processor.GroupCount; ++group) {
                for (KAFFINITY mask = proc->Processor.GroupMask[group].Mask; mask != 0; mask >>= 1) {
                    concurrency += mask & 1;
                }
            }
        }
        i += proc->Size;
    }
    return concurrency;
}
It worked on my dual Xeon Gold 6154 machine running 64-bit Windows (2 processors * 18 cores/processor * 2 threads/core = 72 threads). The result is 72 for both 32-bit and 64-bit processes.
I do not have access to a system running 32-bit Windows, though.
In case of error, it returns zero, like std::thread::hardware_concurrency does.
You can use the CPUID instruction to query the processor directly. It is platform-independent, although since you can no longer use inline asm in MSVC for x64 targets you will need compiler intrinsics (such as __cpuid) to access it. The only downside is that Intel and AMD have handled this instruction differently for some years now, and you'll need to do a lot of work to ensure you are reading the information correctly. In fact, not only can you get a core count, you can get all kinds of processor topology information. I'm not sure how it behaves inside a VM, though, if you are using that environment.
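For illustration, here is a minimal MSVC-flavoured sketch of querying CPUID through the <intrin.h> intrinsics. Note that the leaf-1 field shown is only an upper bound per package, which is exactly why interpreting the real topology (leaves 0x0B/0x1F on Intel, 0x8000001E on AMD, enumerated across every package) takes the extra work mentioned above:

#include <intrin.h>   // MSVC: __cpuid / __cpuidex
#include <cstdio>
#include <cstring>

int main() {
    int regs[4] = {0};  // EAX, EBX, ECX, EDX

    // Leaf 0: vendor string ("GenuineIntel" / "AuthenticAMD").
    __cpuid(regs, 0);
    char vendor[13] = {0};
    std::memcpy(vendor + 0, &regs[1], 4);  // EBX
    std::memcpy(vendor + 4, &regs[3], 4);  // EDX
    std::memcpy(vendor + 8, &regs[2], 4);  // ECX
    std::printf("vendor: %s\n", vendor);

    // Leaf 1, EBX bits 23:16 -- maximum number of addressable IDs for logical
    // processors in this physical package. This is an upper bound, not the
    // actual system-wide hardware-thread count.
    __cpuid(regs, 1);
    int maxLogicalPerPackage = (regs[1] >> 16) & 0xFF;
    std::printf("max addressable logical IDs per package: %d\n", maxLogicalPerPackage);
    return 0;
}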