Can't create a proper alias to a __saddr variable in IAR RL78, or is it a bug in the optimizer? - iar

I think IAR loses the __saddr attribute when creating an alias.
Assume we have an SFR with the following declaration:
__saddr __no_init volatile union
{
unsigned char P3;
__BITS8 P3_bit;
} @ 0xFFF03;
Now we want to access P3 indirectly, through an alias:
{
auto &Px = P3;
P3 = 0x55;
Px = 0x55;
}
Now look at the disassembly window:
P3 = 0x55;
02947 CD0355 MOV S:__A_P3, #0x55
Px = 0x55;
0294A 3603FF MOVW HL, #0xFF03
0294D 5155 MOV A, #0x55
0294F 9B MOV [HL], A
So, can IAR not apply __saddr through an alias, or is something wrong with my code?
Reproducing code:
// No real hardware needed. This code can be run under the simulator
#include <ior5f100aa.h> // Any MCU with an existing P3 can be used
// C++ program
int main(void)
{
// write to register P3 (direct)
P3 = 0x55;
{
// write to register P3 (indirect, using alias)
auto& Px = P3;
Px = 0x55;
}
}
Compile, run, and open the disassembly window:
P3 = 0x55;
main():
_main:
00136 CD0355 MOV S:__A_P3, #0x55
auto& Px = P3;
00139 3603FF MOVW HL, #0xFF03
Px = 0x55;
0013C 5155 MOV A, #0x55
0013E 9B MOV [HL], A
[screenshot of disassembly window][1]
[1]: https://i.stack.imgur.com/XuKGv.png
I can see that the direct write to P3 is shorter than the write through the alias.


Using OpenMP threads and std::(experimental::)simd to compute the Mandelbrot set

I am looking to implement a simple Mandelbrot set plotter using different kinds of HPC paradigms, showing their strengths and weaknesses and how easy or difficult their implementations are. Think of GPGPU (CUDA/OpenACC/OpenMP4.5), threading/OpenMP and MPI. I want to use these examples to give programmers new to HPC a handhold and to show them what the possibilities are. Clarity of code is more important than getting the absolute top performance out of the hardware, that's the second step ;)
Because the problem is trivial to parallelize and modern CPUs can gain a huge amount of performance using vector instructions, I also want to combine OpenMP and SIMD. Unfortunately, simply adding a #pragma omp simd does not yield satisfying results and using intrinsics is not very user friendly or future proof. Or pretty.
Fortunately, work is being done to the C++ standard such that it should be easier to generically implement vector instructions, as mentioned in the TS: "Extensions for parallelism, version 2", specifically section 9 on data-parallel types. A WIP implementation can be found here, which is based on VC which can be found here.
Assume that I have the following class (which has been changed to make it a bit simpler)
#include <cstddef>  // std::size_t
#include <utility>  // std::pair
#include <string>   // std::string
using Range = std::pair<double, double>;
using Resolution = std::pair<std::size_t, std::size_t>;
class Mandelbrot
{
double* d_iters;
Range d_xrange;
Range d_yrange;
Resolution d_res;
std::size_t d_maxIter;
public:
Mandelbrot(Range xrange, Range yrange, Resolution res, std::size_t maxIter);
~Mandelbrot();
void writeImage(std::string const& fileName);
void computeMandelbrot();
private:
void calculateColors();
};
And the following implementation of computeMandelbrot() using OpenMP
void Mandelbrot::computeMandelbrot()
{
double dx = (d_xrange.second - d_xrange.first) / d_res.first;
double dy = (d_yrange.second - d_yrange.first) / d_res.second;
#pragma omp parallel for schedule(dynamic)
for (std::size_t row = 0; row != d_res.second; ++row)
{
double c_imag = d_yrange.first + row * dy;
for (std::size_t col = 0; col != d_res.first; ++col)
{
double real = 0.0;
double imag = 0.0;
double realSquared = 0.0;
double imagSquared = 0.0;
double c_real = d_xrange.first + col * dx;
std::size_t iter = 0;
while (iter < d_maxIter && realSquared + imagSquared < 4.0)
{
realSquared = real * real;
imagSquared = imag * imag;
imag = 2 * real * imag + c_imag;
real = realSquared - imagSquared + c_real;
++iter;
}
d_iters[row * d_res.first + col] = iter;
}
}
}
We can assume that the resolutions in both the x and y directions are multiples of 2/4/8/.., depending on which SIMD instructions we use.
Unfortunately, there is very little information available online on std::experimental::simd, nor any non-trivial examples as far as I could find.
In the Vc git repository, there is an implementation of the Mandelbrot set calculator, but it's quite convoluted and due to the lack of comments rather difficult to follow.
It is clear that I should change the data types of the doubles in the function computeMandelbrot(), but I'm unsure what to change them to. The TS mentions two main new data types, for some type T:
using native_simd = std::experimental::simd<T, std::experimental::simd_abi::native<T>>;
and
using fixed_size_simd = std::experimental::simd<T, std::experimental::simd_abi::fixed_size<N>>;
Using native_simd makes the most sense, since I don't know my bounds at compile time. But then it is not clear to me what these types represent: is a native_simd<double> a single double, or is it a collection of doubles on which a vector instruction is executed? And if so, how many doubles are in this collection?
If somebody could point me to examples where these concepts are used, or give me some pointers on how to implement vector instructions using std::experimental::simd, I would be very grateful.
Here is a very basic implementation, which works (as far as I can tell). Testing which elements of the vector have an absolute value larger than 2 is done in a very cumbersome and inefficient way. There must be a better way to do this, but I haven't found it yet.
I get about a 72% performance increase on an AMD Ryzen 5 3600 when giving g++ the option -march=znver2, which is less than expected.
#include <experimental/simd>
#include <cassert>
#include <cstddef>

template <class T>
void mandelbrot(T xstart, T xend,
T ystart, T yend)
{
namespace stdx = std::experimental;
constexpr auto simdSize = stdx::native_simd<T>().size();
constexpr unsigned size = 4096;
constexpr unsigned maxIter = 250;
assert(size % simdSize == 0);
unsigned* res = new unsigned[size * size];
T dx = (xend - xstart) / size;
T dy = (yend - ystart) / size;
for (std::size_t row = 0; row != size; ++row)
{
T c_imag = ystart + row * dy;
for (std::size_t col = 0; col != size; col += simdSize)
{
stdx::native_simd<T> real{0};
stdx::native_simd<T> imag{0};
stdx::native_simd<T> realSquared{0};
stdx::native_simd<T> imagSquared{0};
stdx::fixed_size_simd<unsigned, simdSize> iters{0};
stdx::native_simd<T> c_real;
for (int idx = 0; idx != simdSize; ++idx)
{
c_real[idx] = xstart + (col + idx) * dx;
}
for (unsigned iter = 0; iter != maxIter; ++iter)
{
realSquared = real * real;
imagSquared = imag * imag;
auto isInside = realSquared + imagSquared > stdx::native_simd<T>{4}; // despite the name, true means the lane has escaped
for (int idx = 0; idx != simdSize; ++idx)
{
// if not bigger than 4, increase iters
if (!isInside[idx])
{
iters[idx] += 1;
}
else
{
// prevent that they become inf/nan
real[idx] = static_cast<T>(4);
imag[idx] = static_cast<T>(4);
}
}
if (stdx::all_of(isInside) )
{
break;
}
imag = static_cast<T>(2.0) * real * imag + c_imag;
real = realSquared - imagSquared + c_real;
}
iters.copy_to(res + row * size + col, stdx::element_aligned);
}
}
delete[] res;
}
The whole lane-testing code (starting from auto isInside = (...)) compiles down to
.L9:
vmulps ymm1, ymm1, ymm1
vmulps ymm13, ymm2, ymm2
xor eax, eax
vaddps ymm2, ymm13, ymm1
vcmpltps ymm2, ymm5, ymm2
vmovaps YMMWORD PTR [rsp+160], ymm2
jmp .L6
.L3:
vmovss DWORD PTR [rsp+32+rax], xmm0
vmovss DWORD PTR [rsp+64+rax], xmm0
add rax, 4
cmp rax, 32
je .L22
.L6:
vucomiss xmm3, DWORD PTR [rsp+160+rax]
jp .L3
jne .L3
inc DWORD PTR [rsp+96+rax]
add rax, 4
cmp rax, 32
jne .L6

Values of a vector are changing when they shouldn't be

I am writing a game engine from scratch as a free-time learning exercise. I am currently implementing a rendering queue, but the values of the vector in charge of the queue keep changing (always to the same value of -107374176 when it should be 10.0f). The vector objRID holds elements of type OBJR*, where OBJR is a struct containing position information as well as a pointer to a bitmap. The bitmap library I am using doesn't seem to be the culprit, but it can be found at: http://partow.net/programming/bitmap/index.html.
The overarching exception is a read access violation at 0x1CCCCCCCC. I have stepped through the program and found that the values of the struct change one by one on every iteration of the "rep stos" after the 19th iteration. I have no real idea how the "rep stos" could affect something which is seemingly unrelated (I don't have a great grasp of assembler in the first place). I am very open to suggestions besides the error at hand.
If someone could explain how the following assembly affects the vector objRID I think I would be able to solve this problem myself in the future.
163: int loop()
164: {
00007FF7AC74D580 40 55 push rbp
00007FF7AC74D582 57 push rdi
00007FF7AC74D583 48 81 EC A8 01 00 00 sub rsp,1A8h
00007FF7AC74D58A 48 8D 6C 24 20 lea rbp,[rsp+20h]
00007FF7AC74D58F 48 8B FC mov rdi,rsp
00007FF7AC74D592 B9 6A 00 00 00 mov ecx,6Ah
00007FF7AC74D597 B8 CC CC CC CC mov eax,0CCCCCCCCh
00007FF7AC74D59C F3 AB rep stos dword ptr [rdi] <---- 19th - 26th iteration here
I hate to just throw the whole program in here, but I believe it is much less confusing this way.
The program is structured as such:
#include "stdafx.h"
#include <Windows.h>
#include "bitmap_image.hpp"
#define maxObjects 1024
struct VEC2_f {
float x, y;
VEC2_f(float x, float y)
{
VEC2_f::x = x;
VEC2_f::y = y;
}
VEC2_f()
{
VEC2_f::x = 0.0f;
VEC2_f::y = 0.0f;
}
};
struct OBJR {
VEC2_f pos, vel;
int ID = -1;
bitmap_image* Lbmp;
OBJR(bitmap_image* Lbmp, VEC2_f pos, VEC2_f vel)
{
OBJR::Lbmp = Lbmp;
OBJR::pos = pos;
OBJR::vel = vel;
}
OBJR(bitmap_image* Lbmp, float x, float y, float vx, float vy)
{
OBJR::Lbmp = Lbmp;
OBJR::pos = VEC2_f(x, y);
OBJR::vel = VEC2_f(vx, vy);
}
//if -1 then ID isn't set yet
int getID()
{
return ID;
}
};
std::vector<OBJR*> objRID;
int IDCOUNTER = 0;
bool running = true;
HWND con;
HDC dc;
COLORREF color;
void objInit(OBJR* Lobj)
{
if (objRID.size() > maxObjects)
{
objRID.pop_back();
Lobj->ID = maxObjects;
}
Lobj->ID = IDCOUNTER++;
objRID.push_back(Lobj);
}
void input()
{
}
void update()
{
}
VEC2_f interpolate(float interpolation, VEC2_f pos, VEC2_f vel)
{
return VEC2_f(pos.x + (vel.x * interpolation), pos.y + (vel.y * interpolation));
}
void renderBitmap(bitmap_image* Lbmp, VEC2_f Ipos)
{
unsigned int h, w;
rgb_t colorT;
h = Lbmp->height(); // <--- Read access violation here
w = Lbmp->width();
for (unsigned int y = 0; y < h; y++)
{
for (unsigned int x = 0; x < w; x++)
{
colorT = Lbmp->get_pixel(x, y);
color = RGB(colorT.red, colorT.green, colorT.blue);
SetPixelV(dc, x + Ipos.x, y + Ipos.y, color);
}
}
}
void renderOBJR(float interpolation, OBJR* obj)
{
renderBitmap(obj->Lbmp, interpolate(interpolation, obj->pos, obj->vel));
}
void render(float interpolation)
{
for (int i = 0; i < objRID.size(); i++)
{
renderOBJR(interpolation, objRID[i]);
}
}
void resizeWindow()
{
RECT r;
GetWindowRect(con, &r);
MoveWindow(con, r.left, r.top, 800, 600, true);
}
int init()
{
con = GetConsoleWindow();
dc = GetDC(con);
resizeWindow();
return 0;
}
int loop()
{ //<--- this is where the disassembly was taken from and is where the Lbmp becomes invalid
const int TPS = 60;
const int SKIP_TICKS = 1000 / TPS;
const int FRAMESKIP = 1;
DWORD next_tick = GetTickCount();
float interpolation;
int loop;
while (running)
{
loop = 0;
while (GetTickCount() > next_tick && loop < FRAMESKIP)
{
input();
update();
next_tick += SKIP_TICKS;
loop++;
}
interpolation = float(GetTickCount() + SKIP_TICKS - next_tick) / float(SKIP_TICKS);
render(interpolation);
}
return 0;
}
int deInit()
{
ReleaseDC(con, dc);
return 0;
}
void test()
{
bitmap_image bitmap = bitmap_image("testBW.bmp");
VEC2_f pos = VEC2_f(10.f, 10.f);
VEC2_f vel = VEC2_f();
OBJR test1 = OBJR(&bitmap, pos, vel);
objInit(&test1);
renderBitmap(&bitmap, pos);
}
int main()
{
init();
test();
loop();
deInit();
return 0;
}
rep stosd is one way to implement memset. Some compilers will inline it.
It looks like your compiler is using it to init stack memory with a poison value of 0xCCCCCCCC, for 4 * 0x6A bytes. (Notice that there's a mov rdi, rsp right before it.) I assume this is a debug build, because optimized builds wouldn't do that extra work.
The overarching exception is a read access violation of 0x1CCCCCCCC.
Looks like you read a pointer from an uninitialized object, and the "poison" value did its job of creating a pointer value that you can easily see is bogus, and which faults instead of silently continuing until some later fault farther away from the real problem.
The 0x00000001 in the high half of the pointer is suspicious. Perhaps you partially overwrote this object? Set a watchpoint on the memory where the pointer value is stored, and find out what else modifies it.
Or you're keeping pointers to local variables after they've gone out of scope, and then the next function call reuses that stack memory for its stack frame. The debug-mode code poisons its whole stack frame, overwriting some of the objects pointed to by your std::vector<OBJR*> objRID;.
So it's not the vector contents or the vector object itself that are changing, it's the objects pointed to by the vector contents.
Again, the 0xCCCCCCCC poison is doing its job of finding bugs in your program.
(#molbdnilo spotted this. I didn't take the time to read your whole C++ code that carefully the first time.)

CUDA - Kernel uses more registers than expected? [duplicate]

This question already has an answer here:
What kind of variables consume registers in CUDA?
(1 answer)
Closed 8 years ago.
I have a kernel, which calculates sums. If I go through the kernel counting the number of variables declared, I would assume a total of 5 registers per kernel*. However, when profiling, the kernel uses 34 registers. I need to get down to 30 registers to allow the execution of 1024 threads.
Can anybody see what is wrong?
__global__ void sum_kernel(float* values, float bk_size, int start_idx, int end_idx, int resolution, float* avgs){
// Allocate shared memory (assuming a maximum of 1024 threads).
__shared__ float sums[1024];
// Boundary check.
if(blockIdx.x == 0){
avgs[blockIdx.x] = values[start_idx];
return;
}
else if(blockIdx.x == resolution-1) {
avgs[blockIdx.x] = values[start_idx+(end_idx-start_idx)-1];
return;
}
else if(blockIdx.x > resolution -2){
return;
}
// Iteration index calculation.
unsigned int idx_prev = floor((blockIdx.x + 0) * bk_size) + 1;
unsigned int from = idx_prev + threadIdx.x*(bk_size / blockDim.x);
unsigned int to = from + (bk_size / blockDim.x);
to = (to < (end_idx-start_idx))? to : (end_idx-start_idx);
// Partial average calculation using shared memory.
sums[threadIdx.x] = 0;
for (from; from < to; from++)
{
sums[threadIdx.x] += values[from+start_idx];
}
__syncthreads();
// Addition of partial sums.
if(threadIdx.x != 0) return;
float sum = sums[0]; // 'sum' was missing a declaration in the original listing
for(from = 1; from < 1024; from++)
{
sum += sums[from];
}
avgs[blockIdx.x] = sum;
}
*Assuming 2 registers per pointer, 1 register per unsigned int, and arguments stored in constant memory.
You cannot estimate the number of registers used from the number of declared variables. The compiler can use registers for address calculations, to store temporary variables you are not explicitly declaring, etc.
For example, I have disassembled the first part of your kernel function, namely
__global__ void sum_kernel(float* values, float bk_size, int start_idx, int end_idx, int resolution, float* avgs){
// Boundary check.
if(blockIdx.x == 0){
avgs[blockIdx.x] = values[start_idx];
return;
}
else if(blockIdx.x == resolution-1) {
avgs[blockIdx.x] = values[start_idx+(end_idx-start_idx)-1];
return;
}
else if(blockIdx.x > resolution -2){
return;
}
}
with the following result:
code for sm_20
Function : _Z10sum_kernelPffiiiS_
.headerflags #"EF_CUDA_SM20 EF_CUDA_PTX_SM(EF_CUDA_SM20)"
/*0000*/ MOV R1, c[0x1][0x100]; /* 0x2800440400005de4 */ R1 = [0x1][0x100]
/*0008*/ S2R R2, SR_CTAID.X; /* 0x2c00000094009c04 */ R2 = BlockIdx.x
/*0010*/ MOV R0, c[0x0][0x34]; /* 0x28004000d0001de4 */ R0 = [0x0][0x34]
/*0018*/ ISETP.EQ.AND P0, PT, R2, RZ, PT; /* 0x190e0000fc21dc23 */ if (R2 == 0)
/*0020*/ #P0 BRA 0x78; /* 0x40000001400001e7 */
/*0028*/ MOV R0, c[0x0][0x30]; /* 0x28004000c0001de4 */
/*0030*/ IADD R0, R0, -0x1; /* 0x4800fffffc001c03 */
/*0038*/ ISETP.NE.AND P0, PT, R2, R0, PT; /* 0x1a8e00000021dc23 */
/*0040*/ #P0 EXIT ; /* 0x80000000000001e7 */
/*0048*/ MOV R0, c[0x0][0x2c]; /* 0x28004000b0001de4 */
/*0050*/ ISCADD R2, R2, c[0x0][0x34], 0x2; /* 0x40004000d0209c43 */
/*0058*/ ISCADD R0, R0, c[0x0][0x20], 0x2; /* 0x4000400080001c43 */
/*0060*/ LDU R0, [R0+-0x4]; /* 0x8bfffffff0001c85 */
/*0068*/ ST [R2], R0; /* 0x9000000000201c85 */
/*0070*/ BRA 0x98; /* 0x4000000080001de7 */
/*0078*/ MOV R2, c[0x0][0x28]; /* 0x28004000a0009de4 */
/*0080*/ ISCADD R2, R2, c[0x0][0x20], 0x2; /* 0x4000400080209c43 */
/*0088*/ LDU R2, [R2]; /* 0x8800000000209c85 */ R2 used for addressing and storing gmem data
/*0090*/ ST [R0], R2; /* 0x9000000000009c85 */ R0 used for addressing
/*0098*/ EXIT ; /* 0x8000000000001de7 */
In the CUDA code snippet above, there is no explicitly declared variable. As you can see from the disassembled code, the compiler has used 3 registers, namely R0, R1, and R2. Those registers are interchangeable in functionality and are used to store constants, memory addresses, and global memory values.

Does dereferencing a pointer create a copy in this example?

I'm attempting to optimize some code because I must draw the same QPixmap onto a larger one many, many times. Since passing a QPixmap by value in my own methods would create copies with each call, I thought I could shave some time off by working with pointers to QPixmaps. However, it seems my work has been in vain. I think it's because calling QPainter::drawPixmap(..., const QPixmap&, ...) creates a copy of it.
QPixmap *pixmap = new QPixmap(10,10);
painter.drawPixmap(0,0, *pixmap);
Is a copy being created in this example?
If so, how might I go about optimizing drawing many images onto another?
I have already read this Q/A here: Does dereferencing a pointer make a copy of it? but a definite answer for my specific case eludes me.
No. The function drawPixmap takes a const reference to the pixmap, so there is no copy being made. Here's the prototype for the QPainter member function:
void drawPixmap ( int x, int y, const QPixmap & pixmap )
According to the QPixmap class reference:
QPixmap objects can be passed around by value since the QPixmap class
uses implicit data sharing. For more information, see the Implicit
Data Sharing documentation.
QPixmap implementation:
QPixmap::QPixmap(const QPixmap &pixmap)
: QPaintDevice()
{
if (!qt_pixmap_thread_test()) {
init(0, 0, QPixmapData::PixmapType);
return;
}
if (pixmap.paintingActive()) { // make a deep copy
operator=(pixmap.copy());
} else {
data = pixmap.data;
}
}
Only when the pixmap is actively being painted on do you need a deep copy; otherwise the new pixmap only copies the original data pointer.
To see the difference between a const reference and a pointer:
QPixmap largeMap(1000, 1000);
QPainter p(&largeMap);
int count = 100000;
qint64 time1, time2;
QPixmap *pSmallMap = new QPixmap("e:/test.png");
QPixmap smallMap = QPixmap("e:/test.png");
time1 = QDateTime::currentMSecsSinceEpoch();
for (int i = 0; i < count; ++i) {
p.drawPixmap(0, 0, *pSmallMap);
}
time2 = QDateTime::currentMSecsSinceEpoch();
qDebug("def time = %lld\n", time2 - time1);
time1 = QDateTime::currentMSecsSinceEpoch();
for (int i = 0; i < count; ++i) {
p.drawPixmap(0, 0, smallMap);
}
time2 = QDateTime::currentMSecsSinceEpoch();
qDebug("normal time = %lld\n", time2 - time1);
Compiling under the Visual Studio 2010 Debug configuration produces the following assembly:
28: p.drawPixmap(0, 0, *pSmallMap);
003B1647 8B 55 C4 mov edx,dword ptr [ebp-3Ch] //the pixmap pointer
003B164A 52 push edx
003B164B 6A 00 push 0 //x
003B164D 6A 00 push 0 //y
003B164F 8D 4D F0 lea ecx,[ebp-10h] //the qpainter pointer
003B1652 FF 15 9C D7 3B 00 call dword ptr [__imp_QPainter::drawPixmap (3BD79Ch)]
35: p.drawPixmap(0, 0, smallMap);
003B16A8 8D 4D E0 lea ecx,[ebp-20h] //the pixmap pointer
003B16AB 51 push ecx
003B16AC 6A 00 push 0 //x
003B16AE 6A 00 push 0 //y
003B16B0 8D 4D F0 lea ecx,[ebp-10h] //the qpainter pointer
003B16B3 FF 15 9C D7 3B 00 call dword ptr [__imp_QPainter::drawPixmap (3BD79Ch)]
There should be no difference between the two, because the compiler produces the same assembly code: it passes a pointer to the drawPixmap function.
And QDateTime::currentMSecsSinceEpoch() shows nearly the same result for both on my box.

Can I calculate a global variable in HLSL before the vertex shader is applied?

There are three global variables: g_A, g_B, g_C.
I want to do this: g_C = g_A * g_B.
I tried this:
technique RenderScene
{
g_C = g_A * g_B;
pass P0
{
VertexShader = compile vs_2_0 RenderSceneVS();
PixelShader = compile ps_2_0 RenderScenePS();
}
}
However, that is invalid syntax.
What should I do?
Must I calculate this variable in C++ code before rendering?
DirectX 9 and 10 Effects support "preshaders", where static expressions will be pulled out to be executed on the CPU automatically.
See D3DXSHADER_NO_PRESHADER in the documentation, which is a flag that suppresses this behavior. Here's a web link: http://msdn.microsoft.com/en-us/library/windows/desktop/bb205441(v=vs.85).aspx
Your declaration of g_C is missing both static and the type (e.g. float4). You'll also need to move it into global scope.
When you compile it with fxc, you might see something like the following (note the preshader block after the comment block):
technique RenderScene
{
pass P0
{
vertexshader =
asm {
//
// Generated by Microsoft (R) HLSL Shader Compiler 9.29.952.3111
//
// Parameters:
//
// float4 g_A;
// float4 g_B;
//
//
// Registers:
//
// Name Reg Size
// ------------ ----- ----
// g_A c0 1
// g_B c1 1
//
preshader
mul c0, c0, c1
// approximately 1 instruction used
//
// Generated by Microsoft (R) HLSL Shader Compiler 9.29.952.3111
vs_2_0
mov oPos, c0
// approximately 1 instruction slot used
};
pixelshader =
asm {
//
// Generated by Microsoft (R) HLSL Shader Compiler 9.29.952.3111
ps_2_0
def c0, 0, 0, 0, 0
mov r0, c0.x
mov oC0, r0
// approximately 2 instruction slots used
};
}
}
I compiled the following:
float4 g_A;
float4 g_B;
static float4 g_C = g_A * g_B;
float4 RenderSceneVS() : POSITION
{
return g_C;
}
float4 RenderScenePS() : COLOR
{
return 0.0;
}
technique RenderScene
{
pass P0
{
VertexShader = compile vs_2_0 RenderSceneVS();
PixelShader = compile ps_2_0 RenderScenePS();
}
}
with fxc /Tfx_2_0 t.fx to generate the listing. They're not very interesting shaders...