I'm working on a 3D software renderer. In my code, I've declared a structure Arti3DVSOutput with no default constructor. It looks like this:
struct Arti3DVSOutput {
    vec4 vPosition; // vec4 has a default ctor that sets all 4 floats to 0.0f.
    float Varyings[g_ciMaxVaryingNum];
};
void Arti3DDevice::GetTransformedVertex(uint32_t i_iVertexIndex, Arti3DTransformedVertex *out)
{
    // Try to fetch the result from the cache.
    uint32_t iCacheIndex = i_iVertexIndex & (g_ciCacheSize - 1);
    if (vCache[iCacheIndex].tag == i_iVertexIndex)
    {
        *out = *(vCache[iCacheIndex].v);
    }
    else
    {
        // Cache miss. Need calculation.
        Arti3DVSInput vsinput;
        // ... code that fills in "vsinput" omitted ...
        Arti3DVSOutput vs_output;
        // Whether the following line is commented out makes a big difference.
        //memset(&vs_output, 0, sizeof(Arti3DVSOutput));
        mRC.pfnVS(&vsinput, &mRC.globals, &vs_output);
        *out = vs_output;
        // Store the result in the cache.
        vCache[iCacheIndex].tag = i_iVertexIndex;
        vCache[iCacheIndex].v = &tvBuffer[i_iVertexIndex];
        tvBuffer[i_iVertexIndex] = vs_output;
    }
}
mRC.pfnVS is a function pointer; the function it points to is implemented like this:
void NewCubeVS(Arti3DVSInput *i_pVSInput, Arti3DShaderUniform *i_pUniform, Arti3DVSOutput *o_pVSOutput)
{
    o_pVSOutput->vPosition = i_pUniform->mvp * i_pVSInput->ShaderInputs[0];
    o_pVSOutput->Varyings[0] = i_pVSInput->ShaderInputs[1].x;
    o_pVSOutput->Varyings[1] = i_pVSInput->ShaderInputs[1].y;
    o_pVSOutput->Varyings[2] = i_pVSInput->ShaderInputs[1].z;
}
As you can see, all this function does is fill in some members of "o_pVSOutput"; no reads are performed.
Here comes the problem: the renderer's performance drops sharply, from 400+ fps to 60+ fps, when the local variable vs_output is not zeroed before I pass its address to the function ("NewCubeVS" in this case) as the third parameter.
The rendered image is exactly the same. When I turn off optimization (-O0), the two versions perform identically. Once I turn optimization on (-O1, -O2 or -O3), the performance difference shows up again.
I profiled the program and found something very strange: the extra time in the "vs_output uninitialized" version is not spent in GetTransformedVertex, or even anywhere near it. The increase is caused by some SSE intrinsics executed well after GetTransformedVertex is called. I'm really confused...
FYI, I'm using Visual Studio 2013 Community.
I now know that this performance drop is caused by the uninitialized structure, but I don't know how. Does it implicitly turn off some of the compiler's optimizations?
If necessary, I will post my source code to my github for your reference.
Any opinions are appreciated! Thank you in advance.
Update: Enlightened by @KerrekSB, I did some more tests.
Calling memset() with different fill values, the performance can be quite different!
1: With 0: 400+ fps.
2: With 1, 2, 3, 4, 5, ...: 40-60 fps.
Then I removed the memset() and explicitly implemented a ctor for Arti3DVSOutput. The ctor did nothing but set all the floats in Varyings[] to one valid floating-point value (e.g. 0.0f, 1.5f, 100.0f, ...). Haha, 400+ fps.
So far, it seems that the values stored in Arti3DVSOutput have a great effect on performance.
Then I did some more tests to find out which piece of Arti3DVSOutput's memory really matters. Here comes the code.
Arti3DVSOutput vs_output; // No explicit default ctor in this version.
// Comment out the following 12 lines one by one to find out which piece of uninitialized memory really matters.
vs_output.Varyings[0] = 0.0f;
vs_output.Varyings[1] = 0.0f;
vs_output.Varyings[2] = 0.0f;
vs_output.Varyings[3] = 0.0f;
vs_output.Varyings[4] = 0.0f;
vs_output.Varyings[5] = 0.0f;
vs_output.Varyings[6] = 0.0f;
vs_output.Varyings[7] = 0.0f;
vs_output.Varyings[8] = 0.0f;
vs_output.Varyings[9] = 0.0f;
vs_output.Varyings[10] = 0.0f;
vs_output.Varyings[11] = 0.0f;
mRC.pfnVS(&vsinput, &mRC.globals, &vs_output);
Commenting out those 12 lines one by one and running the program gives the following results:
commented line(s)   FPS
0                   420
1                   420
2                   420
3                   420
4                   200
5                   420
6                   280
7                   195
8                   197
9                   200
10                  200
11                  420
0,1,2,3,5,11        420
4,6,7,8,9,10        60
It seems that the 4th, 6th, 7th, 8th, 9th and 10th elements of Varyings[] all contribute to the performance drop.
I'm really confused by what the compiler does behind my back. Is there some kind of value check the compiler has to do?
Solution:
I figured it out!
The source of the problem is that uninitialized (or improperly initialized) floating-point values are later used as operands by SSE intrinsics. Those invalid floats (garbage bit patterns such as denormals or NaNs) trigger slow floating-point exception/assist paths and drag the SSE intrinsics down dramatically.
Related
I'm attempting to write a fairly simple compute shader that computes a simple moving average.
It is my first shader where I had to test DTid.x against conditions related to the logic.
The shader works and the moving average is calculated as expected, except (ugh) for the case DTid.x = 0, where I get a bad result.
It seems my testing of the value DTid.x is somehow corrupted, or not possible, for the case DTid.x = 0.
I may be missing some fundamental understanding of how compute shaders work, as this piece of code seems super simple but doesn't work as I'd expect it to.
Hopefully someone can tell me why this code doesn't work for the case DTid.x = 0.
For example, I simplified the shader to...
[numthreads(1024, 1, 1)]
void CSSimpleMovingAvgDX(uint3 DTid : SV_DispatchThreadID)
{
    // I added the check below trying to limit the logic.
    // I initially had it check a range like >50 and <100 and that did work as expected.
    // But I saw that my value at DTid.x = 0 was corrupted and started working on why. No luck.
    // It is only the case DTid.x = 0 where this shader does not work.
    if (DTid.x > 0)
    {
        return;
    }
    nAvgCnt = 1;
    ft0 = asfloat(BufferY0.Load(DTid.x * 4)); // load data at the actual DTid.x location
    if (DTid.x > 0) // to avoid loading a second value for averaging
    {
        // somehow this code is still being executed for the case DTid.x = 0?
        nAvgCnt = nAvgCnt + 1;
        ft1 = asfloat(BufferY0.Load((DTid.x - 1) * 4)); // load the value at the previous DTid.x location
    }
    if (nAvgCnt > 1) // if DTid.x was larger than 0, we should have loaded ft1 and can average ft0 and ft1
    {
        result = ((ft0 + ft1) / ((float)nAvgCnt));
    }
    else
    {
        result = ft0;
    }
    // And when I add the code below, which should override the code above, the result is still corrupted?
    if (DTid.x < 2)
        result = ft0;
    llByteOffsetLS = ((DTid.x) * dwStrideSomeBuffer);
    BufferOut0.Store(llByteOffsetLS, asuint(result)); // store the result; all good except for the case DTid.x = 0
}
I am compiling the shader with FXC. My actual shader was slightly more involved than the above. When I added the /Od option, the code behaved as expected. Without /Od I tried to refactor the code over and over with no luck; eventually I changed the variable names in every section to make sure the compiler would treat them separately, and that finally worked. So the lesson I learned is: never reuse a variable in any way. Another, worse, solution would be to decompile the compiled shader to understand how it was optimized. If you attempt a large shader with several conditions/branches, I'd start with /Od, remove it eventually, and never reuse variables, or you may start chasing problems that are not truly problems.
I have implemented a pixel mask class used for pixel-perfect collision checking. I am using SFML, so the implementation is fairly straightforward: loop through each pixel of the image and decide whether it is true or false based on its transparency value. Here is the code I have used:
// Create an Image from the given texture
sf::Image image(texture.copyToImage());
// Measure the time this function takes
sf::Clock clock;
sf::Time time = sf::Time::Zero;
clock.restart();
// Reserve memory for the pixelMask vector to avoid repeated allocation
pixelMask.reserve(image.getSize().x);
// Loop through every pixel of the texture
for (unsigned int i = 0; i < image.getSize().x; i++)
{
    // Create the mask for one column
    std::vector<bool> tempMask;
    // Reserve memory for the column vector to avoid repeated allocation
    tempMask.reserve(image.getSize().y);
    for (unsigned int j = 0; j < image.getSize().y; j++)
    {
        // If the pixel is not transparent
        if (image.getPixel(i, j).a > 0)
            // Some part of the texture is there --> push back true
            tempMask.push_back(true);
        else
            // The user can't see this part of the texture --> push back false
            tempMask.push_back(false);
    }
    pixelMask.push_back(tempMask);
}
time = clock.restart();
std::cout << std::endl << "The creation of the pixel mask took: " << time.asMicroseconds() << " microseconds (" << time.asSeconds() << ")";
I used an instance of sf::Clock to measure the time.
My problem is that this function takes ages (e.g. 15 seconds) for larger images (e.g. 1280x720), but interestingly only in debug mode. When compiling the release version, the same texture/image takes 0.1 seconds or less.
I have tried to reduce memory allocations by using the resize() method, but it didn't change much. I know that looping through almost 1 million pixels is slow, but it shouldn't be 15 seconds slow, should it?
Since I want to test my code in debug mode (for obvious reasons) and I don't want to wait 5 minutes until all the pixel masks have been created, what I am looking for is basically a way to:
either optimise the code (have I missed something obvious?),
or get something similar to release performance in debug mode.
Thanks for your help!
Optimizing For Debug
Optimizing for debug builds is generally a very counter-productive idea. It could even have you optimize for debug in a way that not only makes maintaining code more difficult, but may even slow down release builds. Debug builds in general are going to be much slower to run. Even with the flattest kind of C code I write which doesn't pose much for an optimizer to do beyond reasonable register allocation and instruction selection, it's normal for the debug build to take 20 times longer to finish an operation. That's just something to accept rather than change too much.
That said, I can understand the temptation to do so at times. Sometimes you want to debug a certain part of the code, only for the other operations in the software to take ages, requiring a long wait before you can even get to the code you are interested in tracing through. In those cases I find it helpful, if you can, to separate debug-mode input sizes from release mode (e.g. have the debug mode work with an input 1/10th of the original size). That does cause discrepancies between release and debug, which is a negative, but the positives sometimes outweigh the negatives from a productivity standpoint. Another strategy is to build parts of your code in release and debug only the parts you're interested in, like building a plugin in debug against a host application in release.
Approach at Your Own Peril
With that aside, if you really want to make your debug builds run faster and accept all the risks associated, then the main way is to just pose less work for your compiler to optimize away. That's going to be flatter code typically with more plain old data types, less function calls, and so forth.
First and foremost, you might be spending a lot of time on debug mode assertions for safety. See things like checked iterators and how to disable them:
https://msdn.microsoft.com/en-us/library/aa985965.aspx
For your case, you can easily flatten your nested loop into a single loop. There's no need to create these pixel masks with separate containers per scanline, since you can always get at your scanline data with some basic arithmetic (y*image_width or y*image_stride). So initially I'd flatten the loop. That might even help modestly for release mode. I don't know the SFML API so I'll illustrate with pseudocode.
const int num_pixels = image.w * image.h;
vector<bool> pixelMask(num_pixels);
for (int j=0; j < num_pixels; ++j)
pixelMask[j] = image.pixelAlpha(j) > 0;
Just that already might help a lot. Hopefully SFML lets you access pixels with a single index without having to specify column and row (x and y). If you want to go even further, it might help to grab the pointer to the array of pixels from SFML (also hopefully possible) and use that:
const int num_pixels = image.w * image.h;
vector<bool> pixelMask(num_pixels);
const unsigned int* pixels = image.getPixels();
for (int j=0; j < num_pixels; ++j)
{
    // Assuming 32-bit pixels (should probably use uint32_t).
    // Note that no right shift is necessary when you just want
    // to check for non-zero values.
    const unsigned int alpha = pixels[j] & 0xff000000;
    pixelMask[j] = alpha > 0;
}
Also, vector<bool> stores each boolean as a single bit. That saves memory but translates to a few more instructions for random access. Sometimes you can get a speed-up, even in release, by just using more memory. I'd test both release and debug and time carefully, but you can try this:
const int num_pixels = image.w * image.h;
vector<char> pixelMask(num_pixels);
const unsigned int* pixels = image.getPixels();
char* pixelUsed = &pixelMask[0];
for (int j=0; j < num_pixels; ++j)
{
    const unsigned int alpha = pixels[j] & 0xff000000;
    pixelUsed[j] = alpha > 0;
}
Loops are faster when they work with constants:
1. In for (unsigned int i = 0; i < image.getSize().x; i++), fetch image.getSize() once before the loop.
2. Move the per-line mask out of the loop and reuse it; the lines are all the same length, I assume: std::vector<bool> tempMask;
This should speed you up a bit.
Note that compiling for debugging produces quite different machine code.
I'm working on implementing animations in my model loader, which uses Assimp and C++/OpenGL for rendering. I've been following this tutorial: http://ogldev.atspace.co.uk/www/tutorial38/tutorial38.html extensively. Suffice it to say that I did not follow the tutorial completely, as there were some bits I disagreed with code-wise, so I adapted it. Mind you, I don't use any of the maths components the author uses; I use glm instead. At any rate, the problem is that sometimes my program runs and at other times it doesn't. Sometimes it crashes instantly on launch, and at other times it simply runs as normal.
A few things to take into account:
Before animations/bone loading were added, the model loader worked completely fine and models loaded without any crash whatsoever;
Models with NO bones still load just fine; it only becomes a problem when models with bones are loaded.
Please note that NOTHING from the bones is being rendered. I haven't even started allocating the bones to vertex attributes; not even the shaders are modified for this.
Everything is being run on a single thread; there is no multi-threading... yet.
So naturally I took to the bit of code that actually loads the bones. I've debugged the application and found that the problem lies mostly around here:
Mesh* processMesh(uint meshIndex, aiMesh *mesh)
{
    vector<VertexBoneData> bones;
    bones.resize(mesh->mNumVertices);
    // ... getting other mesh data ...
    if (pAnimate)
    {
        for (uint i = 0; i < mesh->mNumBones; i++)
        {
            uint boneIndex = 0;
            string boneName(mesh->mBones[i]->mName.data);
            auto it = pBoneMap.find(boneName);
            if (it == pBoneMap.end())
            {
                boneIndex = pNumBones;
                ++pNumBones;
                BoneInfo bi;
                pBoneInfo.push_back(bi);
                auto tempMat = mesh->mBones[i]->mOffsetMatrix;
                pBoneInfo[boneIndex].boneOffset = to_glm_mat4(tempMat);
                pBoneMap[boneName] = boneIndex;
            }
            else boneIndex = pBoneMap[boneName];
            for (uint j = 0; j < mesh->mBones[i]->mNumWeights; j++)
            {
                uint vertexID = mesh->mBones[i]->mWeights[j].mVertexId;
                float weit = mesh->mBones[i]->mWeights[j].mWeight;
                bones.at(vertexID).addBoneData(boneIndex, weit);
            }
        }
    }
}
In the last line the author used the [] operator to access elements, but I decided to use .at() for range checking. The function to_glm_mat4 is defined thus:
glm::mat4 to_glm_mat4(const aiMatrix4x4 &m)
{
    glm::mat4 to;
    to[0][0] = m.a1; to[1][0] = m.a2;
    to[2][0] = m.a3; to[3][0] = m.a4;
    to[0][1] = m.b1; to[1][1] = m.b2;
    to[2][1] = m.b3; to[3][1] = m.b4;
    to[0][2] = m.c1; to[1][2] = m.c2;
    to[2][2] = m.c3; to[3][2] = m.c4;
    to[0][3] = m.d1; to[1][3] = m.d2;
    to[2][3] = m.d3; to[3][3] = m.d4;
    return to;
}
I also had to change VertexBoneData, since it used raw arrays, which I considered flawed:
struct VertexBoneData
{
    std::vector<unsigned int> boneIDs;
    std::vector<float> weights;
    VertexBoneData()
    {
        reset();
        boneIDs.resize(NUM_BONES_PER_VERTEX);
        weights.resize(NUM_BONES_PER_VERTEX);
    }
    void reset()
    {
        boneIDs.clear();
        weights.clear();
    }
    void addBoneData(unsigned int boneID, float weight)
    {
        for (uint i = 0; i < boneIDs.size(); i++)
        {
            if (weights.at(i) == 0.0) // SEG FAULT HERE
            {
                boneIDs.at(i) = boneID;
                weights.at(i) = weight;
                return;
            }
        }
        assert(0);
    }
};
Now, I'm not entirely sure what is causing the crash, but what baffles me most is that the program sometimes runs (implying that the code isn't necessarily the culprit). So I decided to do a debug smashdown, which involved inspecting each bone (I skipped some; there are loads of bones!), and found that AFTER all the bones have been loaded I would get this very strange error:
No source available for "drm_intel_bo_unreference() at 0x7fffec369ed9"
and sometimes I would get this error:
Error in '/home/.../: corrupted double-linked list (not small): 0x00000 etc ***
and sometimes I would get a seg fault from glm regarding a vec4 instantiation;
and sometimes... my program runs without ever crashing!
To be fair, implementing animations may simply be too harsh on my laptop, so maybe it's a CPU/GPU problem, as in it's unable to process so much data in one gulp, which results in the crash. My theory is that since it's unable to process that much data, the data never gets allocated into the vectors.
I'm not using any multi-threading whatsoever, but it has crossed my mind. I figure the CPU may be unable to process so much data at once, hence the chance-run. Perhaps threading would help, such that the bone loading is done on another thread; or better, a mutex, because what I found is that by stepping through the application slowly in the debugger, the program runs. That makes sense to me because each task is being broken down into chunks, which is roughly what a mutex does, per se.
For the sake of the argument, and no mockery avowed, my technical specs:
Ubuntu 15.04 64-bit
Intel i5 dual-core
Intel HD 5500
Mesa 10.5.9 (OpenGL 3.3)
Programming on Eclipse Mars
I thus ask, what the hell is causing these intel_drm errors?
I've reproduced this issue and found it may have been a problem with the lack of multi-threading when it comes to loading bones. I decided to move the bone-loading code into its own function, as prescribed in the aforementioned tutorial. What I then did was:
if (pAnimate)
{
    std::thread t1([&] {
        loadBones(meshIndex, mesh, bones);
    });
    t1.join();
}
The lambda above captures everything by reference ([&]) to ensure no copies are created. To prevent any external forces from 'touching' the data inside the loadBones(..) function, I installed a mutex in the function like so:
void ModelLoader::loadBones(uint meshIndex, const aiMesh *mesh, std::vector<VertexBoneData> &bones)
{
    std::mutex mut;
    std::lock_guard<std::mutex> lock(mut);
    // load bones
}
This is only a quick and dirty fix. It might not work for everyone, and there's no guarantee the program will run crash-less.
Here are some testing results:
Sans threading & mutex: program runs 0 out of 3 times in a row
With threading; sans mutex: program runs 2 out of 3 times in a row
With threading & mutex: program runs 3 out of 3 times in a row
If you're using Linux, remember to link against pthread as well as include <thread> and <mutex>. Suggestions on thread optimisation are welcome!
I am writing code for a mathematical method (incomplete Cholesky) and I have hit a curious roadblock. Please see the following simplified code.
for (k = 0; k < nosUnknowns; k++)
{
    // pieces of code
    for (i = k+1; i < nosUnknowns; i++)
    {
        // more code
    }
    for (j = k+1; j < nosUnknowns; j++)
    {
        for (i = j; i < nosUnknowns; i++)
        {
            // some more code
            if (xOk && yOk && zOk)
            {
                if (xDF == 1 && yDF == 0 && zDF == 0)
                {
                    for (row = 0; row < 3; row++)
                    {
                        for (col = 0; col < 3; col++)
                        {
                            // All 3x3 static arrays. This is the line in question:
                            statObj->A1_[row][col] -= localFuncArr[row][col];
                        }
                    }
                }
            }
        } // inner loop i ends here
    } // inner loop j ends here
} // outer loop k ends here
For context,
statObj is an object containing a number of 3x3 static double arrays. I initialize statObj with a call to new, then populate the arrays inside it using some mathematical functions. One such array is A1_. The value of the variable nosUnknowns is around 3000. The array localFuncArr is a double array generated earlier by matrix multiplication.
Now this is my problem:
When I use the line as shown in the code, the code runs extremely sluggishly: something like 245 seconds for the whole function.
When I comment out said line, the code runs extremely fast: something like 6 seconds.
When I replace said line with localFuncArr[row][col] += 3.0, the code again runs at the speed of case (2) above.
Clearly something about the access to statObj->A1_ is making the code run slowly.
My question(s):
Is cache poisoning the reason this is happening?
If so, what could be changed in terms of array initialization, object initialization, loop unrolling, or any other form of code optimization to speed this up?
Any insights to this from experienced folks is highly appreciated.
EDIT: Changed the description to be more verbose and redress some of the points mentioned in the comments.
If the conditions are mostly true, your line of code is executed up to 3000 x 3000 x 3000 x 3 x 3 times, about 245 billion times (the triangular loop bounds reduce this by roughly a factor of six, but it remains tens of billions). Depending on your hardware architecture, a couple of hundred seconds can be a very reasonable timing: with the full-cube estimate that is one iteration every two cycles on a 2 GHz processor. Note also that when the store into statObj->A1_ is removed, the optimizer is free to delete the now-dead loop body, which is part of why those versions look so fast. In any case there isn't anything in the code that suggests cache poisoning.
While optimizing my connect-four game engine I reached a point where further improvements can only be minimal, because much of the CPU time is used by the instruction TableEntry te = mTable[idx + i] in the following code sample.
TableEntry getTableEntry(unsigned __int64 lock)
{
    int idx = (lock & 0xFFFFF) * BUCKETSIZE;
    for (int i = 0; i < BUCKETSIZE; i++)
    {
        TableEntry te = mTable[idx + i]; // bottleneck, about 35% of CPU usage
        if (te.height == NOTSET || lock == te.lock)
            return te;
    }
    return TableEntry();
}
The hash table mTable is defined as std::vector<TableEntry> and has about 4.2 million entries (about 64 MB). I have tried replacing the vector by allocating the table with new, with no speed improvement.
I suspect that accessing the memory randomly (because of the Zobrist hashing function) could be expensive, but really that much? Do you have suggestions to improve the function?
Thank you!
Edit: BUCKETSIZE has a value of 4; it's used for the collision strategy. The size of one TableEntry is 16 bytes; the struct looks like this:
struct TableEntry
{                                        // Old New
    unsigned __int64 lock;               //  8   8
    enum { VALID, UBOUND, LBOUND } flag; //  4   4
    short score;                         //  4   2
    char move;                           //  4   1
    char height;                         //  4   1
                                         // -------
                                         // 24  16 bytes
    TableEntry() : lock(0LL), flag(VALID), score(0), move(0), height(-127) {}
};
Summary: the function originally needed 39 seconds. After making the changes jdehaan suggested, the function needs 33 seconds (the program is stopped after 100 seconds). That's better, but I think Konrad Rudolph is right and the main reason it's slow is cache misses.
You are making copies of your table entries; what about using a TableEntry& reference instead? For the default value at the bottom, a static default TableEntry will also do. I suppose that is where you lose most of the time.
const TableEntry& getTableEntry(unsigned __int64 lock)
{
    int idx = (lock & 0xFFFFF) * BUCKETSIZE;
    for (int i = 0; i < BUCKETSIZE; i++)
    {
        // hopefully now less than 35% of CPU usage :-)
        const TableEntry& te = mTable[idx + i];
        if (te.height == NOTSET || lock == te.lock)
            return te;
    }
    return DEFAULT_TABLE_ENTRY;
}
How big is a table entry? I suspect it's the copy that is expensive, not the memory lookup.
Memory accesses are quicker when they are contiguous because of cache hits, but it seems you are already doing that.
The point about copying the TableEntry is valid. But let’s look at this question:
I suspect that accessing the memory randomly (…) could be expensive, but really that much?
In a word, yes.
Random memory access with an array of your size is a cache killer. It will generate lots of cache misses, which can be up to three orders of magnitude slower than access to memory already in cache. Three orders of magnitude: that's a factor of 1000.
On the other hand, it actually looks as though you are using lots of array elements in order, even though you generated your starting point using a hash. This speaks against the cache miss theory, unless your BUCKETSIZE is tiny and the code gets called very often with different lock values from the outside.
I have seen this exact problem with hash tables before. The problem is that continuous random access to the hash table touches all of the memory used by the table (both the main array and all of the elements). If this is large relative to your cache size, you will thrash. This manifests as exactly the problem you are encountering: the instruction that first references new memory appears to have a very high cost due to the memory stall.
In the case I worked on, a further issue was that the hash table represented a rather small part of the key space. The "default" value (similar to what you call DEFAULT_TABLE_ENTRY) applied to the vast majority of keys so it seemed like the hash table was not heavily used. The problem was that although default entries avoided many inserts, the continuous action of searching touched every element of the cache over and over (and in random order). In that case I was able to move the values from the hashed data to live with the associated structure. It took more overall space because even keys with the default value had to explicitly store the default value, but the locality of reference was vastly improved and the performance gain was huge.
Use pointers:
TableEntry* getTableEntry(unsigned __int64 lock)
{
    int idx = (lock & 0xFFFFF) * BUCKETSIZE;
    TableEntry* te = &mTable[idx];
    TableEntry* max = te + BUCKETSIZE;
    for (; te < max; te++)
    {
        if (te->height == NOTSET || lock == te->lock)
            return te;
    }
    return &DEFAULT_TABLE_ENTRY; // now returning a pointer to the static default entry
}