D3D11: Reference Rasterizer vs WARP - c++

I have a test for a pixel shader that does some rendering and compares the result to a reference image to verify that the shader produces an expected output. When this test is run on a CI machine, it is on a VM without a GPU, so I call D3D11CreateDevice with D3D_DRIVER_TYPE_REFERENCE to use the reference rasterizer. We have been doing this for years without issue on a Windows 7 VM.
We are now trying to move to a Windows 10 VM for our CI tests. When I run the test here, various API calls start failing after some number of successful tests (on the order of 5000-10000) with DXGI_ERROR_DEVICE_REMOVED, and calling GetDeviceRemovedReason returns DXGI_ERROR_DRIVER_INTERNAL_ERROR. After some debugging I've found that the failure originates during a call to ID3D11DeviceContext::PSSetShader (yes, this returns void, but I found this via a breakpoint in KernelBase.dll!RaiseException). This call looks exactly like the thousands of previous calls to PSSetShader as far as I can tell. It doesn't appear to be a resource issue, the process is only using 8MB of memory when the error occurs, and the handle count is not growing.
I can reproduce the issue on multiple Win10 systems, and the test succeeds on multiple Win7 systems. The big difference between the two is that on Win7 the API calls go through d3d11ref.dll, and on Win10 they go through d3d10warp.dll. I am not really familiar with the differences or why one or the other would be chosen, and MSDN's documentation is quite opaque on the subject. Both d3d11ref.dll and d3d10warp.dll are present on the failing and the passing systems; I don't know what logic decides which one is loaded for the same set of calls, or why the d3d10warp library fails.
So, can someone explain the difference between the two, and/or suggest how I could get d3d11ref.dll to load in Windows 10? As far as I can tell it is a bug in d3d10warp.dll and for now I would just like to side-step it.
In case it matters, I am calling D3D11CreateDevice with the desired feature level set to D3D_FEATURE_LEVEL_11_0, and I verify that the same level is returned as achieved. I am passing 0 for creationFlags, and my D3D11_SDK_VERSION is defined as 7 in d3d11.h. Below is the call stack above PSSetShader when the failure occurs. This seems to be the first call that fails, and every call after it with a return code also fails.
d3d10warp.dll!UMDevice::MSCB_SetError(long,enum UMDevice::DDI_TYPE)
d3d10warp.dll!UMContext::SetShaderWithInterfaces(enum PIXELJIT_SHADER_STAGE,struct D3D10DDI_HSHADER,unsigned int,unsigned int const *,struct D3D11DDIARG_POINTERDATA const *)
d3d10warp.dll!UMDevice::PsSetShaderWithInterfaces(struct D3D10DDI_HDEVICE,struct D3D10DDI_HSHADER,unsigned int,unsigned int const *,struct D3D11DDIARG_POINTERDATA const *)
d3d11.dll!CContext::TID3D11DeviceContext_SetShaderWithInterfaces_<1,4>(class CContext *,struct ID3D11PixelShader *,struct ID3D11ClassInstance * const *,unsigned int)
Update: With the D3D Debug layers enabled, I get the following additional output when the error occurs:
D3D11: Removing Device.
D3D11 WARNING: ID3D11Device::RemoveDevice: Device removal has been triggered for the following reason (DXGI_ERROR_DRIVER_INTERNAL_ERROR: There is strong evidence that the driver has performed an undefined operation; but it may be because the application performed an illegal or undefined operation to begin with.). [ EXECUTION WARNING #379: DEVICE_REMOVAL_PROCESS_POSSIBLY_AT_FAULT]
D3D11 ERROR: ID3D11DeviceContext::Map: Returning DXGI_ERROR_DEVICE_REMOVED, when a Resource was trying to be mapped with READ or READWRITE. [ RESOURCE_MANIPULATION ERROR #2097214: RESOURCE_MAP_DEVICEREMOVED_RETURN]
The third line about the call to Map happens after my test fails to notice and handle the device removed and later tries to map a texture, so I don't think that's related. The other is about what I expected; there's an error in the driver, and possibly my test is doing something bad to cause it. I still don't know what that might be, or why it worked in Windows 7.
Update 2: I have found that if I run my tests on Windows 10 in Windows 7 compatibility mode, there is no device removed error and all of my tests pass. It is still using d3d10warp.dll instead of d3d11ref.dll, so that wasn't exactly the problem. I'm not sure how to investigate "what am I doing that's incompatible with Windows 10 or its WARP device"; this might need to be a Microsoft support ticket.

The problem is that you haven't enabled the Windows 10 optional feature "Graphics Tools" on that system. That is how you install the DirectX 11/12 Debug Runtime on Windows 10 including Direct3D 11's reference device, WARP for DirectX 12, the DirectX SDK debug layer for DX11/DX12, etc.
WARP for DirectX 11 is available on all systems, not just those with the "Graphics Tools" feature. Generally speaking, most people have switched to using WARP instead of the software reference driver since it is a lot faster. If you are having crashes under WARP, you should investigate the source of those crashes by enabling the DEBUG device.
See this blog post.
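As a rough illustration of the advice above, here is a minimal sketch that requests WARP explicitly and turns on the debug layer. It assumes the "Graphics Tools" feature is installed so that the debug layer is available. The helper name CreateWarpDebugDevice is hypothetical, and the non-Windows branch is only a stub so the snippet compiles elsewhere.

```cpp
#ifdef _WIN32
#include <d3d11.h>
#pragma comment(lib, "d3d11.lib")
#else
struct ID3D11Device;          // forward declarations so the stub compiles elsewhere
struct ID3D11DeviceContext;
#endif

// Hypothetical helper: create a WARP device with the debug layer enabled.
// Returns the HRESULT from D3D11CreateDevice (as a long), or -1 from the stub.
long CreateWarpDebugDevice(ID3D11Device** device, ID3D11DeviceContext** context)
{
#ifdef _WIN32
    const D3D_FEATURE_LEVEL requested[] = { D3D_FEATURE_LEVEL_11_0 };
    D3D_FEATURE_LEVEL achieved = D3D_FEATURE_LEVEL_11_0;
    return D3D11CreateDevice(
        nullptr,                    // default adapter
        D3D_DRIVER_TYPE_WARP,       // software rasterizer, no GPU required
        nullptr,                    // no software module handle
        D3D11_CREATE_DEVICE_DEBUG,  // debug layer; fails if it is not installed
        requested, 1,
        D3D11_SDK_VERSION,
        device, &achieved, context);
#else
    (void)device;
    (void)context;
    return -1;
#endif
}
```

If the call fails with the debug flag set, retrying with 0 for the flags would show whether the debug layer itself is what is missing.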


Accessing Max Input Delay with C++ on Windows

I am having trouble obtaining certain data from Windows Performance Counters with C++. I will preface my question by stating that I am new to both C++ and to developing for Windows, but I have spent some time on this issue already so I feel familiar with the concepts I am discussing here.
How do I use Windows PDH (Performance Data Helper) from C++ to obtain Max Input Delay--either per session or per process? Are there certain Performance Counters that are not available outside of perfmon?
Progress so far:
I have used this example to log some Performance Counters successfully, but the ones I want produce the error code 0xC0000BB8: "The specified object is not found on the system." This confuses me because I can access the objects--"User Input Delay per Process" or "User Input Delay per Session"--fine through perfmon. I even went as far as enabling the counter in the registry as outlined in the article I linked in my question, despite being on a build of Windows 10 that should have it enabled by default. I had to make a small change to get the code to compile, but I have changed only the definition of COUNTER_PATH during my testing because, again, the code works as advertised except when it comes to the counter I want to access. Specifically:
Does not compile:
CONST PWSTR COUNTER_PATH = L"\\Processor(0)\\% Processor Time";
Does compile and log:
CONST wchar_t *COUNTER_PATH = L"\\Processor(0)\\% Processor Time";
CONST PWSTR COUNTER_PATH = const_cast<PWSTR>(TEXT( "\\Processor(0)\\% Processor Time" ));
Compiles, but throws error code 0xC0000BB8 at runtime (This is the Counter I want to access):
CONST PWSTR COUNTER_PATH = const_cast<PWSTR>(TEXT( "\\User Input Delay per Session(1)\\Max Input Delay" ));
The hardcoded session ID of 1 in the string was for troubleshooting purposes, but wildcard (*) and 0 were also used with the same result. The counter path matches that shown in perfmon.
Essentially, all Performance Counters that I have attempted to access with this code--about 5 completely different ones--have successfully logged the data being requested, but the one I want to access continues to be evasive.
I asked this same question on Microsoft Q&A and received the answer:
The Performance Counters in question require administrator privileges to access. All I had to do was run this program in an administrator command prompt, and that solved my issue.

Linking Cuda (cudart.lib) makes DXGI DuplicateOutput1() fail

For an obscure reason, my call to IDXGIOutput5::DuplicateOutput1() fails with error 0x887a0004 (DXGI_ERROR_UNSUPPORTED) after I added cudart.lib to my project.
I work in Visual Studio 2019; my code for monitor duplication is the classic:
hr = output5->DuplicateOutput1(this->dxgiDevice, 0, sizeof(supportedFormats) / sizeof(DXGI_FORMAT), supportedFormats, &this->dxgiOutputDuplication);
And the only thing I have tried to do with CUDA so far is simply to list the CUDA devices:
int nDevices = 0;
cudaError_t error = cudaGetDeviceCount(&nDevices);
for (int i = 0; i < nDevices; i++) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, i);
    LOG_FUNC_DEBUG("Graphic adapter : Description: %s, Memory Clock Rate : %d kHz, Memory Bus Width : %u bits",
        prop.name, prop.memoryClockRate, prop.memoryBusWidth);
}
Moreover, this piece of code is called long after I try to start monitor duplication with DXGI.
Everything seems correct in my application: I call SetProcessDpiAwarenessContext(DPI_AWARENESS_CONTEXT_PER_MONITOR_AWARE_V2), and I'm not running on a discrete GPU (see https://support.microsoft.com/en-us/help/3019314/error-generated-when-desktop-duplication-api-capable-application-is-ru).
And by the way, it used to work, and it works again if I just remove the "so simple" CUDA call and cudart.lib from the linker input!
I really don't understand what can cause this strange behavior. Any ideas?
...after I added cudart.lib in my project
When you link the CUDA library, you force your application to run on the discrete GPU. You already know this should be avoided, but you still force it through this link.
...and I'm not running on a discrete GPU...
You are: statically linking to CUDA is a specific case that hints the driver to use the dGPU.
There are systems where Desktop Duplication does not work against the dGPU, and yours seems to be one of them. Even though it is not obvious, you are seeing behavior that is by [NVIDIA] design.
(There are however also other systems where Desktop Duplication is working against dGPU and is not working against iGPU).
Your potential solution is along this line:
Application is not directly linked against cuda.lib or cudart.lib or LoadLibrary to dynamically load the nvcuda.dll or cudart*.dll and uses GetProcAddress to retrieve function addresses from nvcuda.dll or cudart*.dll.
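A minimal sketch of the dynamic-loading approach described in that quote, assuming the CUDA driver API entry points cuInit and cuDeviceGetCount exported by nvcuda.dll. The helper name CountCudaDevicesDynamically is hypothetical, int stands in for the CUresult enum from cuda.h, and whether this actually avoids the dGPU hint on a given system is not confirmed here.

```cpp
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: count CUDA devices through the driver API without
// linking cuda.lib or cudart.lib. Returns -1 if the driver is unavailable.
int CountCudaDevicesDynamically()
{
#ifdef _WIN32
    // CUresult is an enum in cuda.h; int is used here as a stand-in.
    typedef int (__stdcall *CuInitFn)(unsigned int);
    typedef int (__stdcall *CuDeviceGetCountFn)(int*);

    HMODULE driver = LoadLibraryA("nvcuda.dll");
    if (!driver) return -1;

    CuInitFn cuInit =
        reinterpret_cast<CuInitFn>(GetProcAddress(driver, "cuInit"));
    CuDeviceGetCountFn cuDeviceGetCount =
        reinterpret_cast<CuDeviceGetCountFn>(GetProcAddress(driver, "cuDeviceGetCount"));

    int count = -1;
    if (cuInit && cuDeviceGetCount && cuInit(0) == 0) {  // 0 == CUDA_SUCCESS
        if (cuDeviceGetCount(&count) != 0) count = -1;
    }
    FreeLibrary(driver);
    return count;
#else
    return -1;  // stub so the sketch compiles on other platforms
#endif
}
```

With this shape, cudart.lib can be removed from the linker input, and the device listing would only touch the NVIDIA driver when the function is actually called.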

Several Nt functions return STATUS_WAIT_0 on Windows 7 x32

I have downloaded a Windows 7 x32 Enterprise (IE11) Hyper-V image from the Microsoft website to test a research project.
For some reason all the Ntdll functions I call (syscall) return STATUS_WAIT_0. I mean all of them that I have tested including RtlGetVersion, NtAllocateVirtualMemory, NtCreateFile and more.
Could this be because it's a virtual machine? Or could it be because I do direct system calls?
Please advise, I have tested my project under non-virtual machines including latest Windows 10 and it works fine so I doubt it's my code.
STATUS_WAIT_0 can be considered equivalent to STATUS_SUCCESS, since both have the value 0.
Ntdll functions such as RtlGetVersion, NtAllocateVirtualMemory, and NtCreateFile return an NTSTATUS.
The following document contains the common usage details of the NTSTATUS values:
Return value/code: Description
STATUS_SUCCESS (0x00000000): The operation completed successfully.
STATUS_WAIT_0 (0x00000000): The caller specified WaitAny for WaitType and one of the dispatcher objects in the Object array has been set to the signaled state.
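To make the point concrete, here is a small sketch that uses local stand-ins for the Windows definitions, so it compiles without the Windows headers. The names NtStatus, kStatusSuccess, kStatusWait0, kStatusAccessDenied, and NtSucceeded are hypothetical; the numeric values and the "greater than or equal to zero" test mirror STATUS_SUCCESS, STATUS_WAIT_0, STATUS_ACCESS_DENIED, and the NT_SUCCESS macro.

```cpp
#include <cstdint>

// Local stand-ins mirroring the Windows definitions (assumed names).
typedef std::int32_t NtStatus;               // NTSTATUS is a signed 32-bit value
const NtStatus kStatusSuccess = 0x00000000;  // STATUS_SUCCESS
const NtStatus kStatusWait0   = 0x00000000;  // STATUS_WAIT_0
const NtStatus kStatusAccessDenied =
    static_cast<NtStatus>(0xC0000022u);      // STATUS_ACCESS_DENIED

// Same test as the NT_SUCCESS macro: success and informational codes are >= 0,
// while warning and error codes have the high bit set and are negative.
inline bool NtSucceeded(NtStatus status)
{
    return status >= 0;
}
```

Checking the result with a success test like this, rather than comparing it to one named constant, avoids being misled when a debugger or lookup tool displays 0 as STATUS_WAIT_0.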

Running the executable of hdl_simple_viewer.cpp from Point Cloud Library

The Point Cloud library comes with an executable pcl_hdl_viewer_simple that I can run (./pcl_hdl_viewer_simple) without any extra arguments to get live data from a Velodyne LIDAR HDL32.
The source code for this program is supposed to be hdl_viewer_simple.cpp. A simplified version of the code is given on this page which cannot be compiled readily and requires a tiny bit of tweaking to make it compile.
My problem is that the executables I build myself from both versions fail to run. I always get the smart pointer error "Assertion px!=0". I am not sure whether I am executing the program incorrectly or something else is wrong. The executable is supposed to be executed like
./hdl_viewer_simple -calibrationFile hdl32calib.xml -pcapFile file.pcap
in case of playing from previously recorded PCAP files or just ./hdl_viewer_simple if wanting to get live data from the real sensor. However, I always get the assertion failed error.
Has anyone been able to run the executables? I do not want to use the ROS drivers.
"Assertion px!=0" is occurring because your pointer is not initialized.
That being said, you could initialize it inside your routines in case the pointer is NULL, especially for data input.
Here, you can try updating line 83 like this:
CloudConstPtr cloud(new Cloud); //initializing your pointer
and hopefully, it will work.
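The failure mode can be sketched without PCL. The example below uses std::shared_ptr as a stand-in for the Boost smart pointer that raises the assertion, and a hypothetical Cloud struct in place of the PCL point cloud type; PointCount is likewise a made-up helper. It shows that a default-constructed pointer is empty and should be checked, or initialized as in the line above, before it is dereferenced.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for a PCL point cloud.
struct Cloud {
    std::vector<float> points;
};
typedef std::shared_ptr<const Cloud> CloudConstPtr;

// Returns the point count, treating an unset pointer as an empty cloud
// instead of dereferencing it.
inline std::size_t PointCount(const CloudConstPtr& cloud)
{
    return cloud ? cloud->points.size() : 0;
}
```

A default-constructed CloudConstPtr is empty, so PointCount returns 0 for it, while CloudConstPtr cloud(new Cloud) owns a valid (empty) cloud that is safe to dereference.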

Visual Studio 2013 Floating Point Support fix?

I have a DLL, which is commercial software, so therefore I cannot show the code here...
I get the error "6002" (floating point support not loaded), but only in some applications.
This dll is hooked to several applications, without problems.
I tried everything that I found on Google, like reinstalling VC++, clean PC, registry, everything.
So my conclusion is that either there is another DLL compiled with another version of Visual Studio (2010) that is somehow conflicting with my DLL,
or I have some memory leak which I cannot find.
I use the following functions in my DLL, which (I think) are the issue:
Example function I use for logging:
void ##SOFTWARE-CLASS##::write_log(char *text)
{
    FILE *write_log;
    char dateStr [9];
    char timeStr [9];
    _strdate( dateStr);
    _strtime( timeStr );
    write_log = fopen("##SOFTWARE-FILE##","a+");
    fprintf_s(write_log,"[%s %s] %s \n", dateStr, timeStr, text);
}
Nothing else is used that may cause floating errors...
The dll is hooked properly, I have no warnings, and no errors.
I must mention that I created an empty DLL with just a MessageBox. On the first run I got the same error, but after switching to /fp:strict the error disappeared. So I did the same thing in my project, but the error is still there. I even recoded the whole project, just to see if that fixes the problem, but it does not.
Please give me advice on how can I solve this problem, as this is the third day that I am testing...
According to the MSDN documentation for R6002, a program only loads floating point support if it is needed. This means that the detoured code is being injected into binaries which did not initialize the floating point subsystem. The solution would be to relink your commercial product so that it does not require the floating point code, or to relink with an alternative floating point library.
You could try and detect the missing initialization and perform it yourself, but that would have a larger impact on the injected system, and possibly create instabilities.
Use a debugger with the failing executable, and you should be able to get a call stack which identifies where the failure occurs.