Running a dll in kernel mode - c++

I'm just curious: I have a Windows dll which does some rendering/drawing jobs with openGL and then returns the result to the application.
Would it be faster if the code didn't run in user-mode but in kernel-mode? (no interruptions and higher priority)

Running in kernel mode doesn't get you higher priority, and it doesn't get rid of interruptions. Unless you ask it to, which you can do in user mode too for the most part.
The biggest problem you would face is that openGL is simply not available in kernel mode. It is a user mode API, that talks down into a device driver to implement some of its logic, but a lot of the logic is implemented entirely in user mode. It isn't like there is a syscall for every openGL API.
Even if you could overcome that (which you can't), as Erbureth mentions the security risk would be huge, debugging it would be a nightmare (have you ever used a kernel mode debugger?) and installing it would require admin privileges.
So all in all, no - it isn't possible.

Related

Gauss Blur 3d image in cuda, sometimes it works sometimes it does not [duplicate]

I've noticed that CUDA applications tend to have a rough maximum run-time of 5-15 seconds before they will fail and exit out. I realize it's ideal to not have CUDA application run that long but assuming that it is the correct choice to use CUDA and due to the amount of sequential work per thread it must run that long, is there any way to extend this amount of time or to get around it?
I'm not a CUDA expert, --- I've been developing with the AMD Stream SDK, which AFAIK is roughly comparable.
You can disable the Windows watchdog timer, but that is highly not recommended, for reasons that should be obvious.
To disable it, you need to regedit HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Watchdog\Display\DisableBugCheck, create a REG_DWORD and set it to 1.
You may also need to do something in the NVidia control panel. Look for some reference to "VPU Recovery" in the CUDA docs.
Ideally, you should be able to break your kernel operations up into multiple passes over your data to break it up into operations that run in the time limit.
Alternatively, you can divide the problem domain up so that it's computing fewer output pixels per command. I.e., instead of computing 1,000,000 output pixels in one fell swoop, issue 10 commands to the gpu to compute 100,000 each.
The basic unit that has to fit within the time slice is not your entire application, but the execution of a single command buffer. In the AMD Stream SDK, a long sequence of operations can be broken up into multiple time slices by explicitly flushing the command queue with a CtxFlush() call. Perhaps CUDA has something similar?
You should not have to read all of your data back and forth across the PCIX bus on every time slice; you can leave your textures, etc. in gpu local memory; you just have some command buffers complete occasionally, to prove to the OS that you're not stuck in an infinite loop.
Finally, GPUs are fast, so if your application is not able to do useful work in that 5 or 10 seconds, I'd take that as a sign that something is wrong.
[EDIT Mar 2010 to update:] (outdated again, see the updates below for the most recent information) The registry key above is out-of-date. I think that was the key for Windows XP 64-bit. There are new registry keys for Vista and Windows 7. You can find them here: http://www.microsoft.com/whdc/device/display/wddm_timeout.mspx
or here: http://msdn.microsoft.com/en-us/library/ee817001.aspx
[EDIT Apr 2015 to update:] This is getting really out of date. The easiest way to disable TDR for Cuda programming, assuming you have the NVIDIA Nsight tools installed, is to open the Nsight Monitor, click on "Nsight Monitor options", and under "General" set "WDDM TDR enabled" to false. This will change the registry setting for you. Close and reboot. Any change to the TDR registry setting won't take effect until you reboot.
[EDIT August 2018 to update:]
Although the NVIDIA tools allow disabling the TDR now, the same question is relevant for AMD/OpenCL developers. For those: The current link that documents the TDR settings is at https://learn.microsoft.com/en-us/windows-hardware/drivers/display/tdr-registry-keys
On Windows, the graphics driver has a watchdog timer that kills any shader programs that run for more than 5 seconds. Note that the Xorg/XFree86 drivers don't do this, so one possible workaround is to run the CUDA apps on Linux.
AFAIK it is not possible to disable the watchdog timer on Windows. The only way to get around this on Windows is to use a second card that has no displayed screens on it. It doesn't have to be a Tesla but it must have no active screens.
Resolve Timeout Detection and Recovery - WINDOWS 7 (32/64 bit)
Create a registry key in Windows to change the TDR settings to a
higher amount, so that Windows will allow for a longer delay before
TDR process starts.
Open Regedit from Run or DOS.
In Windows 7 navigate to the correct registry key area, to create the
new key:
HKEY_LOCAL_MACHINE>SYSTEM>CurrentControlSet>Control>GraphicsDrivers.
There will probably one key in there called DxgKrnlVersion there as a
DWord.
Right click and select to create a new key REG_DWORD, and name it
TdrDelay. The value assigned to it is the number of seconds before
TDR kicks in - it > is currently 2 automatically in Windows (even
though the reg. key value doesn't exist >until you create it). Assign
it with a new value (I tried 4 seconds), which doubles the time before
TDR. Then restart PC. You need to restart the PC before the value will
work.
Source from Win7 TDR (Driver Timeout Detection & Recovery)
I have also verified this and works fine.
The most basic solution is to pick a point in the calculation some percentage of the way through that I am sure the GPU I am working with is able to complete in time, save all the state information and stop, then to start again.
Update:
For Linux: Exiting X will allow you to run CUDA applications as long as you want. No Tesla required (A 9600 was used in testing this)
One thing to note, however, is that if X is never entered, the drivers probably won't be loaded, and it won't work.
It also seems that for Linux, simply not having any X displays up at the time will also work, so X does not need to be exited as long as you screen to a non-X full-screen terminal.
This isn't possible. The time-out is there to prevent bugs in calculations from taking up the GPU for long periods of time.
If you use a dedicated card for CUDA work, the time limit is lifted. I'm not sure if this requires a Tesla card, or if a GeForce with no monitor connected can be used.
The solution I use is:
1. Pass all information to device.
2. Run iterative versions of algorithms, where each iteration invokes the kernel on the memory already stored within the device.
3. Finally transfer memory to host only after all iterations have ended.
This enables control over iterations from CPU (including option to abort), without the costly device<-->host memory transfers between iterations.
The watchdog timer only applies on GPUs with a display attached.
On Windows the timer is part of the WDDM, it is possible to modify the settings (timeout, behaviour on reaching timeout etc.) with some registry keys, see this Microsoft article for more information.
It is possible to disable this behavior in Linux. Although the "watchdog" has an obvious purpose, it may cause some very unexpected results when doing extensive computations using shaders / CUDA.
The option can be toggled in your X-configuration (likely /etc/X11/xorg.conf)
Adding: Option "Interactive" "0" to the device section of your GPU does the job.
see CUDA Visual Profiler 'Interactive' X config option?
For details on the config
and
see ftp://download.nvidia.com/XFree86/Linux-x86/270.41.06/README/xconfigoptions.html#Interactive
For a description of the parameter.

C++, OpenCV, & Kinect: Processing speed goes down

I use C++ (Visual Studio 2015) and OpenCV (ver 3.2.0) to process data sent from Kinect v1. My C++ program has no problem when it starts debugging for the first time. After it stops debugging and re-start debugging, however, it gets very slow.
I am suspecting that the program closes without releasing some memory (i.e., memory leak). I am aware of that I would need to use the delete function to release the memory if I use the new function. But I didn't use the new function in the C++ program (I neither used the malloc() function, which is equivalent to the new function in C programs).
For OpenCV, I use the destroyAllWindows function at the end of the program. For Kinect v1, I also use the NuiShutdown(), Release(), and CloseHandle() functions at the end of the program.
Is there anything else I need do to release the memory (e.g., releasing memory associated with Mat in OpenCV)? Or is something else causing the decrease in processing speed?
I'd appreciate your help. Thanks.
After first run disconnect Kinect then reconnect and try second run.
If all goes well now then the problem is most likely stuck thread. The device access is usually handled by separate threads and especially with USB they can get stuck (in case of error or sync problem between accessing form host and expecting on device side) until you disconnect device (not sure which Kinect driver are you using but JUNGO version which NuiShutdown() infers have this problem). You can also check task manager before disconnection if there are not some stuck processes left after first run.
To remedy this you need to find out what are you doing wrong during access. It could be:
wrong USB port
use the back side not front slots.
invalid USB transfer request
device is always waiting for specific set of commands or stream and waits until it does not receive it so it blocks all other things. So using unsupported commands or reading in wrong times or sizes of packets can cause this.
USB communication is out of sync
PC host can timeout in case you do not have enough CPU power while critical operation is processed (or have opened too many apps on background).
This can be caused also by wrong gfx driver as I suspect you are using rendering ... Intel HD graphics can generate such problems with ease especially on notebooks. Try to disable any rendering in your app or at least limit rendering to OpenGL 1.0 to see if speed is the same in between runs. If this is the case the whole desktop usually flickers or is not repainting parts of apps ... and animations are sometimes sluggish.
Another problem might be a debugger. If without it all is well then debugger is the problem and you can not solve it. Debugging while accessing IO can cause sync and timeout problems especially with USB.
To check for memory leaks you can simply see how much free memory you got before 1st run and compare it to values after 1st,2nd,3th .. runs if the value lowers you got something stuck somewhere. After app close all the memory belonging to app is freed by OS so even if you forget some delete that does not matter unless some thread is still running ...
Some USB drivers based on libUSB I encountered got also problem with Handle leaks. But that behaves differently ... all runs fine until there are no free handles. After that OS is non functional you can not open any window,app, anything ... until any app is closed.
[Edit1] Front USB slots
Front slots are usually connected to motherboard with relatively long cable (usually flat and not very well shielded) so it is more susceptible to noise. Also as it is located usually around HDD and above high frequency parts of the motherboard it also induce it into the USB feed. All this degrades the quality of USB signal causing much much bigger rejection rate hence lowering sync capability and also the overall usable bandwidth.
If you compare that with backside USB ports they have no cables but are connected directly in PCB with short and well shielded paths so the connection quality is much much better.
So if you use device demanding high bandwith or synchronism then front ports are a bad choice.

How to get a pointer to a hardware driver in Windows?

I want to write a program that will monitor memory in a driver and print the memory contents every so often.
However, I'm not finding any resources in the Windows API that seem to allow me to grab a pointer (Handle) to a specific driver.
I'd appreciate any answer either from User space OR kernel space.
If you want to know exactly what I'm doing, I'm attempting to duplicate the results from this paper except on Windows. After I gain the ability to monitor a buffer in a basic windows console program, I intend to monitor from the GPU.
[For the record: I am a Graduate Student who is pursuing this as a summer project... this is ethical malware research.]
============UPDATE ==================
This might technically be better suited as an answer, but not really until I have a working solution.
My initial plan of attack is to use WinDbg to do dynamic analysis on the keyboard driver when it gets loaded, so I can get some idea about normal loading/unloading behavior. I'm using chapter 10 of this book, to guide setting up my testbed and once I understand more about the keyboard structure and its buffer, I'll work backwards towards getting a permanent reference to this structure and see about passing it into the graphics card and monitoring it with DMA as the original paper did on Linux.
You won't solve this problem by "grabbing a pointer to a specific driver". You need to locate the specific buffer used by the keyboard driver that resides on top of the USB driver.
You will have to actually grok the keyboard and USB drivers for Windows. At least part of which is probably available if you have a DDK (driver development kit) [aka WDK, Windows Driver Kit]. You will definitely need a graphics driver for this part of the project.
You will also have to develop a driver mechanism to map an arbitrary (kernel) lump of memory to your graphics driver - which means you need access to the source code for the graphics driver. (In theory, you could perhaps hack about in the page-tables, but Windows itself isn't too keen on software messing with the page-tables, and you'd definitely need to be VERY careful if the system is SMP, since modifying page-tables in an SMP system requires that you flush the TLB's of the "other" CPU(cores) in the system after updates).
To me, this seems like a rather interesting project, but a really tough one in a closed source system like Windows. At least in Linux, the developer has the source-code to read. When it comes to Windows, most of the relevant source code is completely unavailable (unless your school has special license to the MS Source code - I think there are some that do).

C++ Possible to access kernel-mode registry key accesses?

When I used C# i was only able to access user-mode registry accesses.
Is it very difficult to access kernel-mode registry accesses using C++?
I recall reading somewhere I may have to create a dummy windows driver or something?
EDIT: Basically as a hobby project I wish to create a simple registry monitor. However, I do want to catch kernel mode (as well as user mode) registry accesses..... last time I did this, using C# I could not access the kernel mode activity.
There are two ways to achieve this:
Hook the relevant functions in the kernel - the traditional way - which requires a C/Kernel Driver. This is possible on x86 Windows, but on x64 Kernel Patch Protection will detect these modifications and shut down the system (with a bluescreen).
Build a registry filter driver - this is the now encouraged way to attack this problem and is the way process monitor works. You can also build file system filter drivers this way. Essentially, you simply need to pass the information back to userland which boils down to:
IoRegisterDevice(...somewhere in \Devices\YourDriverName...)
IoCreateSymbolicLink(\\DosDevices\Name -> \Devices\YourDriverName)
then a C, C++, C# application should be able to open the file \\.\YourDriverName and DeviceIoControl to it and receive responses.
It is possible to use C++ to write kernel drivers, but see this before you embark on doing so. To be clearer, you need to be really careful about memory in kernel mode (paged, nonpaged) and you're not going to have access to much of the standard library.
As an aside, you should be aware that:
Not all registry hives are accessible to kernel mode drivers, depending on context.
The paths are not common. So the kernel accesses \Registry\System whereas userland accesses HKLM.

Why can't I set master volume for USB/Firewire Audio interface with IAudioEndpointVolume::SetMasterVolumeLevelScalar

I am trying to fix an Audacity bug that revolves around portmixer. The output/input level is settable using the mac version of portmixer, but not always in windows. I am debugging portmixer's window code to try to make it work there.
Using IAudioEndpointVolume::SetMasterVolumeLevelScalar to set the master volume works fine for onboard sound, but using pro external USB or firewire interfaces like the RME Fireface 400, the output volume won't change, although it is reflected in Window's sound control panel for that device, and also in the system mixer.
Also, outside of our program, changing the master slider for the system mixer (in the taskbar) there is no effect - the soundcard outputs the same (full) level regardless of the level the system says it is at. The only way to change the output level is using the custom app that the hardware developers give with the card.
The IAudioEndpointVolume::QueryHardwareSupport function gives back ENDPOINT_HARDWARE_SUPPORT_VOLUME so it should be able to do this.
This behavior exists for both input and output on many devices.
Is this possibly a Window's bug?
It is possible to workaround this by emulating (scaling) the output, but this is not preferred as it is not functionally identical - better to let the audio interface do the scaling (esp. for input if it involves a preamp).
The cards you talk about -like the RME- ones simply do not support setting the master or any other level through software, and there is not much you can do about it. This is not a Windows bug. One could argue that giving back ENDPOINT_HARDWARE_SUPPORT_VOLUME is a bug though, but that likely originates from the driver level, not Windows itself.
The only solution I found so far is hooking up a debugger (or adding a dll hook) to the vendor supplied software and looking at the DeviceIOControl calls it makes (those are the ones used to talk to the hardware) while setting the volume in the vendor software. Pretty hard to do this for every single card, but probably worth doing for a couple of pro cards. Especially for Audacity, for open source audio software it's actually not that bad so I can imagine some people being really happy if the volume on their card could be set by it. (at the time we were exclusively using an RME Multiface I spent quite some time in figuring out the DeviceIOControl calls, but in the end it was definitely worth it as I could set the volume in dB for any point in the matrix)