Strange OpenXR Behaviour on xrEndFrame - directx-12

Firstly: Context is VR with HP Reverb G2, WMR runtime, DX12.
We're seeing some unexplained behaviour across developer machines when working with OpenXR. It looks as though the OpenXR runtime changes the way it presents depending on the machine's setting for preferred GPU.
More specifically, we noticed that depending on that preferred-GPU setting, a different method is used when xrEndFrame is called. This is a big deal, as the different method results in a blank screen being drawn into our current renderTarget!
The difference is that when the preferred device is an Nvidia GPU, xrEndFrame looks like this in PIX (in a graphics queue that is separate to our main render):
Index Global ID Name EOP to EOP Duration (ns) Execution Start Time (ns)
2 8063 Signal(pFence:obj#20, Value:62)
3 8064 Wait(pFence:obj#36, Value:31)
5 8065 CopyTextureRegion(pDst:{pResource:obj#4083, Type:D3D12_TEXTURE_COPY_TYPE_SUBRESOURCE_INDEX, SubresourceIndex:0}, DstX:0, DstY:0, DstZ:0, pSrc:{pResource:obj#4084, Type:D3D12_TEXTURE_COPY_TYPE_SUBRESOURCE_INDEX, SubresourceIndex:0}, pSrcBox:{left:0, top:0, front:0, right:2088, bottom:2036, back:1})
6 8066 CopyTextureRegion(pDst:{pResource:obj#4083, Type:D3D12_TEXTURE_COPY_TYPE_SUBRESOURCE_INDEX, SubresourceIndex:1}, DstX:0, DstY:0, DstZ:0, pSrc:{pResource:obj#4085, Type:D3D12_TEXTURE_COPY_TYPE_SUBRESOURCE_INDEX, SubresourceIndex:0}, pSrcBox:{left:0, top:0, front:0, right:2088, bottom:2036, back:1})
8 8067 Signal(pFence:obj#20, Value:63)
9 8068 Signal(pFence:obj#21, Value:31)
and when it isn't (i.e. when it is somehow picking up the Intel onboard GPU?), it looks like this:
Index Global ID Name EOP to EOP Duration (ns) Execution Start Time (ns)
0 8064 Wait(pFence:obj#45, Value:21)
2 8065 ClearRenderTargetView(RenderTargetView:res#4008, ColorRGBA:{Element:0, Element:0, Element:0, Element:0}, NumRects:0, pRects:nullptr)
15 8066 DrawIndexedInstanced(IndexCountPerInstance:4, InstanceCount:2, StartIndexLocation:0, BaseVertexLocation:0, StartInstanceLocation:0)
17 8067 Signal(pFence:obj#22, Value:23)
18 8068 Signal(pFence:obj#23, Value:21)
The latter is clearing the current renderTargetView and drawing a quad over the top that is the dimensions of the headset display.
Yet we've checked the rendering code and it is definitely not selecting the Intel graphics device. However, the second behaviour goes away if we set 'preferred graphics processor' to the Nvidia GPU in the Nvidia Control Panel.
We can also see that the above behaviour is the result of a call to xrEndFrame, and that our rendering code is otherwise identical.
Any clue as to what part of the runtime might be looking at or influenced by this setting?
Unfortunately (fortuitously?) we found we would need to rework the rendering code to be able to swap runtimes to, say, SteamVR, so right now we can't swap out the runtime.
Obviously we have a workaround, which is to set the preferred device. But understanding how/why this issue is occurring would be great.

So this was finally tracked down to an error on our part.
In our case we were using xrGetD3D12GraphicsRequirementsKHR to get the minimum graphics requirements for OpenXR.
The XrGraphicsRequirementsD3D12KHR structure it fills in has an adapterLuid identifier, which we should have been using to select the GPU in the graphics API, but weren't.
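For anyone hitting the same thing, here is a minimal sketch of the fix, assuming the standard OpenXR loader and DXGI headers (the helper name, variables, and omitted error checking are our own, not part of either API): enumerate the DXGI adapters and create the D3D12 device on the one whose LUID matches adapterLuid, instead of letting the driver's preferred-GPU setting decide.
#define XR_USE_GRAPHICS_API_D3D12
#include <d3d12.h>
#include <dxgi1_4.h>
#include <wrl/client.h>
#include <cstring>
#include <openxr/openxr.h>
#include <openxr/openxr_platform.h>
using Microsoft::WRL::ComPtr;

ComPtr<IDXGIAdapter1> FindAdapterForXr(XrInstance instance, XrSystemId systemId)
{
    // xrGetD3D12GraphicsRequirementsKHR is an extension function, so it has to
    // be loaded through xrGetInstanceProcAddr.
    PFN_xrGetD3D12GraphicsRequirementsKHR pfnGetReqs = nullptr;
    xrGetInstanceProcAddr(instance, "xrGetD3D12GraphicsRequirementsKHR",
                          reinterpret_cast<PFN_xrVoidFunction*>(&pfnGetReqs));

    XrGraphicsRequirementsD3D12KHR reqs{XR_TYPE_GRAPHICS_REQUIREMENTS_D3D12_KHR};
    pfnGetReqs(instance, systemId, &reqs); // reqs.adapterLuid names the GPU the runtime composites on

    ComPtr<IDXGIFactory4> factory;
    CreateDXGIFactory1(IID_PPV_ARGS(&factory));

    ComPtr<IDXGIAdapter1> adapter;
    for (UINT i = 0; factory->EnumAdapters1(i, &adapter) != DXGI_ERROR_NOT_FOUND; ++i) {
        DXGI_ADAPTER_DESC1 desc;
        adapter->GetDesc1(&desc);
        if (std::memcmp(&desc.AdapterLuid, &reqs.adapterLuid, sizeof(LUID)) == 0)
            return adapter; // create the ID3D12Device on this adapter
    }
    return nullptr; // no match: fail loudly rather than fall back silently
}
Creating the ID3D12Device on the returned adapter keeps the app on the same GPU the runtime is using, which is what the 'preferred graphics processor' setting was papering over.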

Related

OpenCL / OpenGL Implicit Synchronization on AMD Tahiti

I'm having a problem with the "implicit synchronization" of OpenCL and OpenGL on an AMD Tahiti (AMD Radeon HD 7900 Series) device. The device has the cl/gl extensions, cl_khr_gl_sharing, and cl_khr_gl_event.
When I run the program, which is just a simple VBO-update kernel drawn as a white line with a simple shader, it hiccups like crazy, stalling for what looks to be 2-4 frames every update. I can confirm that it isn't the CL kernel or GL shader I'm using to update and draw the buffer, because if I put glFinish and commandQueue.finish() before and after the acquire and release of GL objects for the CL update, everything works as it should.
So, I figured that I needed to "enable" the event extension...
#pragma OPENCL EXTENSION cl_khr_gl_event : enable
...in the cl program, but that throws errors. I assume this isn't a client-facing extension and is supposed to just work as "expected", which is why I can't enable it.
The third behavior that I noticed: if I take out the glFinish() and commandQueue.finish() and run it under CodeXL debug, the implicit synchronization works. That is, without any changes to the code base, like forcing synchronization with finish, CodeXL allows for implicit synchronization. So implicit sync clearly works, but I can't get it to work by just running the application regularly through Visual Studio, short of forcing synchronization.
Clearly I'm missing something, but I honestly can't see it. Any thoughts or explanations would be greatly appreciated, as I'd love to keep the synchronization implicit.
I'm guessing you're not using the GLsync-cl_event synchro (GL_ARB_cl_event and cl_khr_gl_event extensions), which is why adding cl/glFinish and the overhead from CodeXL are helping.
My guess is your code looks like:
A1. clEnqueueNDRangeKernel
A2. clEnqueueReleaseObjects
[here is where you inserted clFinish]
B1. glDraw*
B2. wgl/glXSwapBuffers
[here is where you inserted glFinish]
C1. clEnqueueAcquireObjects
[repeat from A1]
Instead, you should:
CL->GL synchro: have clEnqueueReleaseObjects create an (output) event to be passed to glCreateSyncFromCLeventARB, then use glWaitSync (NOT glClientWaitSync - which in this case would be the same as clFinish).
GL->CL synchro: have clEnqueueAcquireObjects take an (input) event, which will be created with clCreateFromGLsync, taking a sync object from glFenceSync
Overall, it should be:
A1. clEnqueueNDRangeKernel
[Option 1.1:]
A2. clEnqueueReleaseObjects(..., 0, NULL, &eve1)
[Option 1.2:]
A2. clEnqueueReleaseObjects(..., 0, NULL, NULL)
A2'. clEnqueueMarker(&eve1)
A3. sync1 = glCreateSyncFromCLeventARB(eve1)
* clReleaseEvent(eve1)
A4. glWaitSync(sync1)
* glDeleteSync(sync1)
B1. glDraw*
B2. wgl/glXSwapBuffers
B3. sync2 = glFenceSync
B4. eve2 = clCreateFromGLSync(sync2)
* glDeleteSync(sync2)
[Option 2.1:]
C1. clEnqueueAcquireObjects(..., 1, &eve2, NULL)
* clReleaseEvent(eve2)
[Option 2.2:]
B5. clEnqueueWaitForEvents(1, &eve2)
* clReleaseEvent(eve2)
C1. clEnqueueAcquireObjects(..., 0, NULL, NULL)
[Repeat from A1]
(Options 1.2 / 2.2 are better if you don't exactly know in advance what will be the last enqueue before handing control over to the other API)
As a side note, I assumed you're not using an out-of-order queue for OpenCL (there really shouldn't be a need for one in this case) - if you do, you of course also have to synchronize clEnqueueAcquire -> clEnqueueNDRange -> clEnqueueRelease.
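Spelled out in code, a per-frame sketch of Options 1.1/2.1 might look like the following. This assumes a working shared CL/GL context, and that your loader exposes the extension entry points (glCreateSyncFromCLeventARB from GL_ARB_cl_event, and clCreateEventFromGLsyncKHR, which is the header spelling of the clCreateFromGLSync call above); the function and variable names are made up for illustration:
#include <CL/cl.h>
#include <CL/cl_gl.h>
#include <GL/glew.h> // or any loader exposing GL 3.2 sync objects and GL_ARB_cl_event

void frame(cl_command_queue q, cl_context ctx, cl_kernel k, cl_mem vboCl, size_t gws)
{
    // A1/A2 (Option 1.1): update the already-acquired VBO, release it with an output event.
    clEnqueueNDRangeKernel(q, k, 1, NULL, &gws, NULL, 0, NULL, NULL);
    cl_event eve1 = NULL;
    clEnqueueReleaseGLObjects(q, 1, &vboCl, 0, NULL, &eve1);
    clFlush(q); // the CL commands must be submitted before GL can wait on them

    // A3/A4: CL -> GL, a server-side wait (no CPU stall the way clFinish/glFinish cause).
    GLsync sync1 = glCreateSyncFromCLeventARB(ctx, eve1, 0);
    clReleaseEvent(eve1);
    glWaitSync(sync1, 0, GL_TIMEOUT_IGNORED);
    glDeleteSync(sync1);

    // B1/B2: glDraw* and SwapBuffers go here.

    // B3/B4: GL -> CL, fence the GL stream and wrap the fence in a CL event.
    GLsync sync2 = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
    glFlush(); // likewise, flush GL so the fence is actually in flight
    cl_int err = 0;
    cl_event eve2 = clCreateEventFromGLsyncKHR(ctx, sync2, &err);

    // C1 (Option 2.1): re-acquire for the next frame, gated on the GL fence.
    clEnqueueAcquireGLObjects(q, 1, &vboCl, 1, &eve2, NULL);
    clReleaseEvent(eve2);
    glDeleteSync(sync2); // GL defers the actual deletion of the fence object
}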

IRQ 8 isn't working... HW or SW?

First, I program for vintage computer groups. What I write is specifically for MS-DOS and not Windows, because that's what people are running. My current program is for later systems, not the 8086 line, so the plan was to use IRQ 8. This lets me set the interrupt rate in binary steps from 2/second to 8192/second (2, 4, 8, 16, etc.).
Only, for some reason, on the newer old systems (OK, that sounds weird) it doesn't seem to be working. In emulation, and on the 386 system I have access to, it works just fine, but on the P3 system I have (GA-6BXC MB w/P3 800 CPU) it just doesn't work.
The code
setting up the interrupt
disable();
oldrtc = getvect(0x70);     //Reads the old vector for IRQ 8
setvect(0x70,countdown);    //Sets the vector for IRQ 8 to our handler
outportb(0x70,0x8a);        //Select RTC register A (with NMI disabled)
y = inportb(0x71) & 0xf0;   //Keep the divider bits, clear the rate bits
outportb(0x70,0x8a);
outportb(0x71,y | _MRATE_); //Adjustable value, set for 64 interrupts per second
outportb(0x70,0x8b);        //Select RTC register B
y = inportb(0x71);
outportb(0x70,0x8b);
outportb(0x71,y | 0x40);    //Set bit 6: enable the periodic interrupt
enable();
at the end of the interrupt
outportb(0x70,0x0c);        //Select RTC register C
inportb(0x71);              //Reading register C resets the interrupt
outportb(0xa0,0x20);        //EOI to the slave PIC (IRQ 8 lives on the slave)
outportb(0x20,0x20);        //EOI to the master PIC (there are 2 PICs on AT machines and later)
When closing program down
disable();
outportb(0x70,0x8b);        //Select RTC register B
y = inportb(0x71);
outportb(0x70,0x8b);
outportb(0x71,y & 0xbf);    //Clear bit 6: disable the periodic interrupt
setvect(0x70,oldrtc);       //Restore the original IRQ 8 vector
enable();
I don't see anything in the code that could be causing the problem, but it just doesn't seem to make sense. While I don't completely trust the information, MSD does report IRQ 8 as the RTC Counter and says it is present and working just fine. Is it possible that later systems have moved the vector? Everything I find says that IRQ 8 is vector 0x70, but the interrupt never triggers on my Pentium III system. Is there some way to find out whether the vectors have been changed?
It's been a LONG time since I've done any MS-DOS code and I don't think I ever worked with this particular interrupt (I'm pretty sure you can just read a memory location to fetch the time too, and IRQ 0 can be used to trigger you at an interval too, so maybe that's better). Anyway, given my rustiness, forgive me for kinda link dumping.
http://wiki.osdev.org/Real_Time_Clock the bottom of that page has someone saying they've had problem on some machines too. RBIL suggests it might be a BIOS thing: http://www.ctyme.com/intr/rb-7797.htm
Without DOS, I'd just capture IRQ0 itself and remap all of them to my own interrupt numbers and change the timing as needed. I've done that somewhat recently! I think that's a bad idea on DOS though, this looks more recommended for that: http://www.ctyme.com/intr/rb-2443.htm
Anyway though, I betcha it has to do with the BIOS thing:
"Notes: Many BIOSes turn off the periodic interrupt in the INT 70h handler unless in an event wait (see INT 15/AH=83h,INT 15/AH=86h).. May be masked by setting bit 0 on I/O port A1h "

Getting OpenCV Error: Insufficient memory while running OpenCV Sample Program: "stitching_detailed.cpp"

I recently started working with OpenCV with the intent of stitching large numbers of images together to create massive panoramas. To begin my experimentation, I looked into the sample programs that come with the OpenCV files to get an idea of how to implement the OpenCV libraries. Since I was interested in image stitching, I went straight for "stitching_detailed.cpp". The code can be found at:
https://code.ros.org/trac/opencv/browser/trunk/opencv/samples/cpp/stitching_detailed.cpp?rev=6856
Now, this program does most of what I need it to do, but I ran into something interesting. I found that for 9 out of 15 of the optional projection warpers, I receive the following error when I try to run the program:
Insufficient memory (Failed to allocate XXXXXXXXXX bytes) in unknown function,
file C:\slave\winInstallerMegaPack\src\opencv\modules\core\src\alloc.cpp,
line 52
where the "X's" mark integer that change between the different types of projection (as though different methods require different amounts of space). The full source code for "alloc.cpp" can be found at the following website:
https://code.ros.org/trac/opencv/browser/trunk/opencv/modules/core/src/alloc.cpp?rev=3060
However, the line of code that emits this error in alloc.cpp is:
static void* OutOfMemoryError(size_t size)
{
--HERE--> CV_Error_(CV_StsNoMem, ("Failed to allocate %lu bytes", (unsigned long)size));
return 0;
}
So, I am simply lost as to the possible reasons that this error may be occurring. I realize that this error would normally occur if the system were out of memory, but when running this program with my test images I never use more than ~3.5GB of RAM, according to my Task Manager.
Also, since the program was written as a sample of the OpenCV stitching capabilities BY OpenCV developers, I find it hard to believe that there is a drastic memory error present within the source code.
Finally, the program works fine if I use some of the warping methods:
- spherical
- fisheye
- transverseMercator
- compressedPlanePortraitA2B1
- paniniPortraitA2B1
- paniniPortraitA1.5B1
but when I ask the program to use any of the others (through the command line tag
--warp [PROJECTION_NAME]):
- plane
- cylindrical
- stereographic
- compressedPlaneA2B1
- mercator
- compressedPlaneA1.5B1
- compressedPlanePortraitA1.5B1
- paniniA2B1
- paniniA1.5B1
I get the error mentioned above. I get pretty good results from the transverseMercator projection warper, but I would like to test the stereographic one in particular. Can anyone help me figure this out?
The pictures that I am trying to process are 1360 x 1024 in resolution and my computer has the following stats:
Model: HP Z800 Workstation
Operating System: Windows 7 Enterprise 64-bit
Processor: Intel Xeon 2.40GHz (12 cores)
Memory: 14GB RAM
Hard Drive: 1TB Hitachi
Video Card: ATI FirePro V4800
Any help would be greatly appreciated, thanks!
When I run OpenCV's traincascade app, I get the same error as you:
Insufficient memory (Failed to allocate XXXXXXXXXX bytes) in unknown function,
file C:\slave\winInstallerMegaPack\src\opencv\modules\core\src\alloc.cpp,
line 52
At the time, only about 70 percent of my 6GB of RAM was occupied, and when running traincascade step by step I found that the error would be thrown when it used more than about 1.5GB of RAM.
Then I found there are two arguments which control how much memory it will use:
-precalcValBufSize
-precalcIdxBufSize
So I tried setting these two to 128, and it ran. I hope my experience can help you.
I don't think this problem is about a memory leak; it seems to be related to how much memory the OS allows a single application to occupy. I'd appreciate it if someone could check my guess.
I've recently had a similar issue with OpenCV image stitching. I used the create method to create a stitcher instance and provided 5 images in vertical order to the stitch method, but I received the insufficient memory error.
The panorama was successfully created after setting:
setWaveCorrection(false)
This solution will not be applicable if you need wave correction.
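For reference, a minimal sketch of that workaround with the high-level cv::Stitcher API (OpenCV 3.x/4.x naming; the function name and error handling are just illustrative):
#include <opencv2/stitching.hpp>
#include <vector>

cv::Mat stitchWithoutWaveCorrection(const std::vector<cv::Mat>& images)
{
    cv::Ptr<cv::Stitcher> stitcher = cv::Stitcher::create(cv::Stitcher::PANORAMA);
    stitcher->setWaveCorrection(false);  // skip wave correction to reduce the memory footprint
    cv::Mat pano;
    cv::Stitcher::Status status = stitcher->stitch(images, pano);
    if (status != cv::Stitcher::OK)
        pano.release();                  // stitching failed; inspect 'status'
    return pano;
}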
This may be related to the order of the stitching. I split a big picture into 3*3 tiles; when I stitch them row by row there is no problem, but when I stitch them column by column I get the same problem as you.

kyoto cabinet scan_parallel not really parallel?

I just spent a day creating an abstraction layer over kyotodb to remove global locks from my code. I was busy porting my algorithms to this new abstraction layer when I discovered that scan_parallel isn't really parallel. It only maxes out one core. For jollies I stuck a billion-int-countdown spin-loop into my code (empty stubs as I port) to try and simulate some processing time; still only one core maxed. Do I need to move to Berkeley DB or LevelDB? I thought kyotodb was meant for internet-scale problems :/. I must be doing something wrong or missing some gotchas.
top and iostat never went above 100% / 25% (for iostat, one CPU maxed = 1/number of cores * 100) :/ on a quad-core i5.
The source DB is a 10-gig corpus of protocol-buffer-encoded data (TreeDB) with the following flags (picked these up from the documentation):
index_db.tune_options(TreeDB::TLINEAR | TreeDB::TCOMPRESS);  // linear collision chaining + record compression
index_db.tune_buckets(1LL * 1000);                           // number of buckets in the inner hash table
index_db.tune_defrag(8);                                     // auto-defragment every 8 fragmentation units
index_db.tune_page(32768);                                   // page size in bytes
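For context, the scan itself is just a visitor handed to scan_parallel with a thread count. A stripped-down sketch of what our abstraction layer does (class and variable names are made up, and this assumes Kyoto Cabinet's TreeDB / DB::Visitor API):
#include <kctreedb.h>
#include <cstdio>
using namespace kyotocabinet;

// Read-only visitor: visit_full is called once per record, potentially from several worker threads.
class RecordVisitor : public DB::Visitor {
public:
  const char* visit_full(const char* kbuf, size_t ksiz,
                         const char* vbuf, size_t vsiz, size_t* sp) {
    // ... decode the protocol buffer and do the per-record work here ...
    return NOP;  // leave the record unchanged
  }
};

int main() {
  TreeDB index_db;
  if (!index_db.open("corpus.kct", TreeDB::OREADER)) return 1;
  RecordVisitor visitor;
  // Ask for 4 worker threads; this is where we expected all cores to light up.
  if (!index_db.scan_parallel(&visitor, 4))
    std::fprintf(stderr, "scan_parallel failed: %s\n", index_db.error().name());
  index_db.close();
  return 0;
}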
edit
Do not remove the IR tag. Please think before you wave around the detag bat.
This IS an IR-related question: it's about creating GINORMOUS (40 gig+) inverted files ONLINE. Inverted indices are the basis of IR data access methods, and inverted index creation has a unique transactional profile. By removing the IR tag you rob me of the wisdom of IR researchers who have used a database library to create such large database files.

ASCII DOS Games - Rendering methods

I'm writing an old-school ASCII DOS-prompt game. Honestly, I'm trying to emulate ZZT to learn more about this brand of game design (even if it is antiquated).
I'm doing well: I've got full-screen text mode working and I can create worlds and move around without problems, BUT I cannot find a decent timing method for my renders.
I know my rendering and pre-rendering code is fast, because if I don't add any delay()s or (clock()-renderBegin)/CLK_TCK checks from time.h, the renders are blazingly fast.
I don't want to use delay() because, to my knowledge, it is platform specific, and on top of that I can't run any code while it delays (like user input and processing). So I decided to do something like this:
do {
    if(kbhit()) {
        input = getch();
        processInput(input);
    }
    if(clock()/CLOCKS_PER_SEC - renderTimer/CLOCKS_PER_SEC > RenderInterval) {
        renderTimer = clock();
        render();
        ballLogic();
    }
} while(input != 'p');
Which should in "theory" work just fine. The problem is that when I run this code (setting the RenderInterval to 0.0333 or 30fps) I don't get ANYWHERE close to 30fps, I get more like 18 at max.
I thought maybe I'd try setting the RenderInterval to 0.0 to see if the performance kicked up... it did not. I was (with a RenderInterval of 0.0) getting at max ~18-20fps.
I though maybe since I'm continuously calling all these clock() and "divide this by that" methods I was slowing the CPU down something scary, but when I took the render and ballLogic calls out of the if statement's brackets and set RenderInterval to 0.0 I get, again, blazingly fast renders.
This doesn't make sence to me since if I left the if check in, shouldn't it run just as slow? I mean it still has to do all the calculations
BTW I'm compiling with Borland's Turbo C++ V1.01
The best gaming experience is usually achieved by synchronizing with the vertical retrace of the monitor. In addition to providing timing, this will also make the game run smoother on the screen, at least if you have a CRT monitor connected to the computer.
In 80x25 text mode, the vertical retrace (on VGA) occurs 70 times/second. I don't remember if the frequency was the same on EGA/CGA, but am pretty sure that it was 50 Hz on Hercules and MDA. By measuring the duration of, say, 20 frames, you should have a sufficiently good estimate of what frequency you are dealing with.
Let the main loop be something like:
while (playing) {
    /* do whatever needs to be done for this particular frame */
    VSync();
}
... /* snip */
/* Wait for vertical retrace */
void VSync() {
    while( (inp(0x3DA) & 0x08));  /* if a retrace is in progress, wait for it to end */
    while(!(inp(0x3DA) & 0x08));  /* then wait for the next retrace to begin */
}
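If you want to confirm which refresh rate you're dealing with, a rough sketch of that measurement using the VSync() routine above and plain clock() (note that clock()'s coarse granularity on DOS means you may want to time more than 20 frames for a tighter estimate):
#include <time.h>

double estimateRefreshHz(void)
{
    const int frames = 70;   /* about a second's worth at VGA rates */
    int i;
    clock_t start, elapsed;
    VSync();                 /* align to a retrace boundary first */
    start = clock();
    for (i = 0; i < frames; i++)
        VSync();
    elapsed = clock() - start;
    return frames * (double)CLOCKS_PER_SEC / (double)elapsed;
}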
clock()-renderTimer > RenderInterval * CLOCKS_PER_SEC
would compute a bit faster, possibly even faster if you pre-compute the RenderInterval * CLOCKS_PER_SEC part.
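For instance, a minimal sketch of that rearrangement (assuming RenderInterval is a double in seconds, as in the question):
/* computed once, before the main loop */
const long renderTicks = (long)(RenderInterval * CLOCKS_PER_SEC);

/* inside the loop: one subtraction and one comparison per iteration */
if (clock() - renderTimer > renderTicks) {
    renderTimer = clock();
    render();
    ballLogic();
}
Note that with RenderInterval = 0.0333 and Borland's ~18.2 ticks per second this rounds down to zero ticks, which ties into the clock-granularity point below.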
I figured out why it wasn't rendering right away: the timer I created is fine; the problem is that clock_t is only accurate to .054547XXX seconds or so, and so I could only render at 18fps. The way to fix this would be to use a more accurate clock... which is a whole other story.
What about this: you are subtracting y (=renderTimer) from x (=clock()), and both x and y are being divided by CLOCKS_PER_SEC:
clock()/CLOCKS_PER_SEC-renderTimer/CLOCKS_PER_SEC > RenderInterval
Wouldn't it be more efficient to write:
( clock() - renderTimer ) > RenderInterval * CLOCKS_PER_SEC
The very first problem I see with the division is that you're not going to get a real number from it, since it happens between two long ints. The second problem is that it is more efficient to multiply RenderInterval by CLOCKS_PER_SEC once and this way get rid of the divisions, simplifying the operation.
Adding the brackets gives it more legibility. And maybe by simplifying this formula you will see more easily what's going wrong.
As you've spotted with your most recent question, you're limited by CLOCKS_PER_SEC, which is only about 18. You get one frame per discrete value of clock, which is why you're limited to 18fps.
You could use the screen's vertical blanking interval for timing; it's traditional for games as it avoids "tearing" (where half the screen shows one frame and half shows another).