LLVM NVPTX backend struct parameter zero size - llvm

I'm getting an obscure exception when loading the PTX assembly generated by LLVM's NVPTX backend. (I'm loading the PTX from ManagedCuda - http://managedcuda.codeplex.com/ )
ErrorNoBinaryForGPU: This indicates that there is no kernel image available that is suitable for the device. This can occur when a user specifies code generation options for a particular CUDA source file that do not include the corresponding device configuration.
Here is the LLVM IR for the module (it's a bit weird since it's generated by a tool)
; ModuleID = 'Module'
target triple = "nvptx64-nvidia-cuda"
%testStruct = type { i32 }
define void @kernel(i32 addrspace(1)*) {
entry:
%1 = alloca %testStruct
store %testStruct zeroinitializer, %testStruct* %1
%2 = load %testStruct* %1
call void @structtest(%testStruct %2)
ret void
}
define void @structtest(%testStruct) {
entry:
ret void
}
!nvvm.annotations = !{!0}
!0 = metadata !{void (i32 addrspace(1)*)* @kernel, metadata !"kernel", i32 1}
and here is the resulting PTX
//
// Generated by LLVM NVPTX Back-End
//
.version 3.1
.target sm_20
.address_size 64
// .globl kernel
.visible .func structtest
(
.param .b0 structtest_param_0
)
;
.visible .entry kernel(
.param .u64 kernel_param_0
)
{
.local .align 8 .b8 __local_depot0[8];
.reg .b64 %SP;
.reg .b64 %SPL;
.reg .s32 %r<2>;
.reg .s64 %rl<2>;
mov.u64 %rl1, __local_depot0;
cvta.local.u64 %SP, %rl1;
mov.u32 %r1, 0;
st.u32 [%SP+0], %r1;
// Callseq Start 0
{
.reg .b32 temp_param_reg;
// <end>}
.param .align 4 .b8 param0[4];
st.param.b32 [param0+0], %r1;
call.uni
structtest,
(
param0
);
//{
}// Callseq End 0
ret;
}
// .globl structtest
.visible .func structtest(
.param .b0 structtest_param_0
)
{
ret;
}
I have no idea how to read PTX, but I have a feeling the problem has to do with the .b0 bit of .param .b0 structtest_param_0 in the structtest function definition.
Passing non-structure values (like integers or pointers) works fine, and the .b0 part of the function signature reads something sane like .b32 or .b64 when doing so.
Changing the triple to nvptx-nvidia-cuda (32-bit) makes no difference, nor does including or excluding the data layout suggested in http://llvm.org/docs/NVPTXUsage.html
Is this a bug in the NVPTX backend, or am I doing something wrong?
Update:
I'm looking through this - http://llvm.org/docs/doxygen/html/NVPTXAsmPrinter_8cpp_source.html - and it appears that the type falls through to line 01568. A struct is obviously not a primitive type, so Ty->getPrimitiveSizeInBits() returns zero. (At least that's my guess, anyway.)
Do I need to add a special case for checking to see if it's a structure, taking the address, making the argument byval, and dereferencing the struct afterwards? That seems like a hacky solution, but I'm not sure how else to fix it.

Have you tried to get the error message buffer from compilation? In managedCuda this would be something like:
CudaContext ctx = new CudaContext();
CudaJitOptionCollection options = new CudaJitOptionCollection();
CudaJOErrorLogBuffer err = new CudaJOErrorLogBuffer(1024);
options.Add(err);
try
{
ctx.LoadModulePTX("test.ptx", options);
}
catch
{
options.UpdateValues();
MessageBox.Show(err.Value);
}
When I run your ptx it says:
ptxas application ptx input, line 12; fatal : Parsing error near '.b0': syntax error
ptxas fatal : Ptx assembly aborted due to errors
which supports your guess about .b0.
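As a stopgap while debugging, the generated PTX can be scanned for zero-width parameter declarations before it is handed to the driver. The following is a minimal C++ sketch under stated assumptions: `findZeroWidthParams` is a hypothetical helper name, and it only does a textual search for `.param .b0`, the pattern seen in the PTX above. It is not a PTX parser and does not fix the backend.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: returns the 1-based line numbers of every line in
// `ptx` that declares a parameter with the zero-width type `.b0`.
// This is a plain substring search, so it would also match the text
// inside a comment.
std::vector<int> findZeroWidthParams(const std::string& ptx) {
    std::vector<int> hits;
    std::istringstream in(ptx);
    std::string line;
    int lineNo = 0;
    while (std::getline(in, line)) {
        ++lineNo;
        if (line.find(".param .b0 ") != std::string::npos) {
            hits.push_back(lineNo);
        }
    }
    return hits;
}
```

A non-empty result means ptxas will most likely reject the module, so the caller can report the offending lines instead of waiting for the less specific ErrorNoBinaryForGPU.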

Related

GNU LD section attributes and flags for 64-bit kernel ELF

I am trying to link a 64-bit kernel ELF using GNU LD. I have an executable section named lowerhalf and then the other usual sections. The linker script I use is this:
ENTRY(kernelCompatibilityModeStart);
SECTIONS {
. = 0x80000000; /* lowerhalf origin */
.lowerhalf : ALIGN(0x1000) {
*(.lowerhalf);
}
. = 0xffffffff80000000; /* higherhalf origin */
.GDT64 : ALIGN(0x1000) {
*(.GDT64);
}
.IDT64 : ALIGN(0x1000) {
*(.IDT64);
}
.TSS64 : ALIGN(0x1000) {
*(.TSS64);
}
.KERNELSTACK : ALIGN(0x1000) {
*(.KERNELSTACK);
}
.ISTs : ALIGN(0x1000) {
*(.ISTs);
}
.text : ALIGN(0x1000) {
*(.text);
}
.data : ALIGN(0x1000) {
*(.data);
}
.rodata : ALIGN(0x1000) {
*(.rodata*);
}
.bss : ALIGN(0x1000) {
*(COMMON);
*(.bss);
}
}
When I look at the generated segments using readelf I see something like
Elf file type is EXEC (Executable file)
Entry point 0x80000000
There are 3 program headers, starting at offset 64
Program Headers:
Type Offset VirtAddr PhysAddr
FileSiz MemSiz Flags Align
LOAD 0x0000000000001000 0x0000000080000000 0x0000000080000000
0x0000000000000030 0x0000000000000030 R 0x1000
LOAD 0x0000000000002000 0xffffffff80000000 0xffffffff80000000
0x0000000000026718 0x0000000000026718 R E 0x1000
LOAD 0x0000000000029000 0xffffffff80027000 0xffffffff80027000
0x0000000000004954 0x00000000000053d8 RW 0x1000
Section to Segment mapping:
Segment Sections...
00 .lowerhalf
01 .GDT64 .IDT64 .TSS64 .KERNELSTACK .ISTs .text
02 .data .ctors .rodata .eh_frame .bss
Some problems with the output ELF:
the .text section gets bundled up with other sections like .GDT64 and .KERNELSTACK
.lowerhalf should have RE attributes instead of just R
all .data, .ctors, and .rodata end up with the RW flags.
I tried using the MEMORY directive but that results in a load of linker errors or warnings and doesn't give the desired output.
I'd like for the sections to have appropriate RWE attributes so I can create my paging entries accordingly.
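For the paging side of this, each program header's `p_flags` field carries the permissions that readelf prints as R, W and E. Below is a minimal C++ sketch of decoding them, assuming the standard ELF values (PF_X = 1, PF_W = 2, PF_R = 4). `PagePerms` and `permsFromFlags` are hypothetical names, and translating the result into actual x86-64 page-table bits is left to the kernel.

```cpp
#include <cstdint>

// Standard ELF segment permission bits (values from the ELF specification).
constexpr std::uint32_t kPfX = 0x1;  // execute
constexpr std::uint32_t kPfW = 0x2;  // write
constexpr std::uint32_t kPfR = 0x4;  // read

// Hypothetical description of the permissions a kernel would apply to the
// pages backing one PT_LOAD segment.
struct PagePerms {
    bool readable;
    bool writable;
    bool executable;
};

// Decode the p_flags field of a program header into individual permissions.
PagePerms permsFromFlags(std::uint32_t pFlags) {
    return PagePerms{
        (pFlags & kPfR) != 0,
        (pFlags & kPfW) != 0,
        (pFlags & kPfX) != 0,
    };
}
```

With the readelf output above, segment 00 (flags R) would decode to read-only and non-executable, which is why .lowerhalf needs the E flag before the paging entries can be derived this way.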

What LLVM store instruction pattern do I need?

I'm trying to write an LLVM backend and I don't know what I need to fix this error:
LLVM ERROR: Cannot select: t5: ch = store<ST4[%retval]> t0, Constant:i32<0>, FrameIndex:i64<0>, undef:i64
This is the IR I'm trying to process:
define i32 @main() #0 {
%retval = alloca i32, align 4
store i32 0, i32* %retval, align 4
ret i32 0
}
but I don't know what DAG pattern I need to match it.
A TableGen file that contains some of the instructions my architecture supports is here: https://github.com/jfmherokiller/customllvm/blob/master/llvm/lib/Target/ZCPU/zcpuInstr.td
I just figured it out: I was reading the error the wrong way.
store<ST4[%retval]> t0, Constant:i32<0>, FrameIndex:i64<0>, undef:i64
can be expressed in function form as store(Constant:i32<0>, FrameIndex:i64<0>), that is, store the constant i32 0 in stack frame index 0.
What I was missing was that FrameIndex:i64<0> corresponds directly to this line in TargetSelectionDAG.td:
def frameindex : SDNode<"ISD::FrameIndex", SDTPtrLeaf, [], "FrameIndexSDNode">;
so FrameIndex = frameindex.

cuModuleGetFunction returns not found

I want to compile CUDA kernels with the nvrtc JIT compiler to improve the performance of my application (I get more instruction fetches, but I save multiple array accesses).
The functions look like this, for example, and are generated by my function generator (not that important):
extern "C" __device__ void GetSumOfBranches(double* branches, double* outSum)
{
double sum = (branches[38])+(-branches[334])+(-branches[398])+(-branches[411]);
*outSum = sum;
}
I am compiling the code above with the following function:
CUfunction* FunctionGenerator::CreateFunction(const char* programText)
{
// When I comment out this statement, the PTX output changes.
// What is the reason?!
// Bug?
std::string savedString = std::string(programText);
nvrtcProgram prog;
nvrtcCreateProgram(&prog, programText, "GetSumOfBranches.cu", 0, NULL, NULL);
const char *opts[] = {"--gpu-architecture=compute_52", "--fmad=false"};
nvrtcCompileProgram(prog, 2, opts);
// Obtain compilation log from the program.
size_t logSize;
nvrtcGetProgramLogSize(prog, &logSize);
char *log = new char[logSize];
nvrtcGetProgramLog(prog, log);
// Obtain PTX from the program.
size_t ptxSize;
nvrtcGetPTXSize(prog, &ptxSize);
char *ptx = new char[ptxSize];
nvrtcGetPTX(prog, ptx);
printf("%s", ptx);
CUdevice cuDevice;
CUcontext context;
CUmodule module;
CUfunction* kernel;
kernel = (CUfunction*)malloc(sizeof(CUfunction));
cuInit(0);
cuDeviceGet(&cuDevice, 0);
cuCtxCreate(&context, 0, cuDevice);
auto resultLoad = cuModuleLoadDataEx(&module, ptx, 0, 0, 0);
auto resultGetF = cuModuleGetFunction(kernel, module, "GetSumOfBranches");
return kernel;
}
Everything is working except that cuModuleGetFunction is returning CUDA_ERROR_NOT_FOUND. That error occurs because GetSumOfBranches cannot be found in the PTX file.
However the output of printf("%s", ptx); is this:
// Generated by NVIDIA NVVM Compiler
//
// Compiler Build ID: CL-19856038
// Cuda compilation tools, release 7.5, V7.5.17
// Based on LLVM 3.4svn
//
.version 4.3
.target sm_52
.address_size 64
// .globl GetSumOfBranches
.visible .func GetSumOfBranches(
.param .b64 GetSumOfBranches_param_0,
.param .b64 GetSumOfBranches_param_1
)
{
.reg .f64 %fd<8>;
.reg .b64 %rd<3>;
ld.param.u64 %rd1, [GetSumOfBranches_param_0];
ld.param.u64 %rd2, [GetSumOfBranches_param_1];
ld.f64 %fd1, [%rd1+304];
ld.f64 %fd2, [%rd1+2672];
sub.rn.f64 %fd3, %fd1, %fd2;
ld.f64 %fd4, [%rd1+3184];
sub.rn.f64 %fd5, %fd3, %fd4;
ld.f64 %fd6, [%rd1+3288];
sub.rn.f64 %fd7, %fd5, %fd6;
st.f64 [%rd2], %fd7;
ret;
}
In my opinion everything is fine and GetSumOfBranches should be found by cuModuleGetFunction. Can you explain why it is not?
Second Question
When I comment out std::string savedString = std::string(programText); the PTX output is just:
// Generated by NVIDIA NVVM Compiler
//
// Compiler Build ID: CL-19856038
// Cuda compilation tools, release 7.5, V7.5.17
// Based on LLVM 3.4svn
//
.version 4.3
.target sm_52
.address_size 64
and this is weird because savedString is not used at all...
What you are trying to do isn't supported. The host-side module management APIs and the device ELF format do not expose __device__ functions, only __global__ functions, which are callable via the kernel launch APIs.
You can compile device functions a priori or at runtime and link them with kernels in a JIT fashion, and you can retrieve those kernels and call them. But that is all you can do.
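The distinction is visible in the PTX itself: `__global__` functions are emitted as `.entry` directives, while `__device__` functions are emitted as `.func`, as in the dump in the question. The sketch below extracts the names that follow `.entry` from a PTX string, which gives a rough list of what `cuModuleGetFunction` can be expected to find. `listEntryNames` is a hypothetical helper name, and the scan is purely textual, so it would also pick up `.entry` inside a comment.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: collects the identifier following each ".entry"
// directive in `ptx`. Functions declared with ".func" are not reported.
std::vector<std::string> listEntryNames(const std::string& ptx) {
    std::vector<std::string> names;
    const std::string directive = ".entry";
    std::size_t pos = 0;
    while ((pos = ptx.find(directive, pos)) != std::string::npos) {
        pos += directive.size();
        // Skip whitespace between the directive and the name.
        while (pos < ptx.size() &&
               std::isspace(static_cast<unsigned char>(ptx[pos]))) {
            ++pos;
        }
        const std::size_t start = pos;
        // Accept letters, digits, '_', '$' and '%' as identifier characters.
        while (pos < ptx.size() &&
               (std::isalnum(static_cast<unsigned char>(ptx[pos])) ||
                ptx[pos] == '_' || ptx[pos] == '$' || ptx[pos] == '%')) {
            ++pos;
        }
        if (pos > start) {
            names.push_back(ptx.substr(start, pos - start));
        }
    }
    return names;
}
```

Run against the PTX printed in the question, this would return an empty list, because GetSumOfBranches appears only as `.visible .func`. Declaring the function `extern "C" __global__` instead of `__device__` is what makes it show up as an `.entry`.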

How to use the LLVM intrinsic @llvm.read_register?

I noticed that llvm.read_register() can read the value of the stack pointer, and that llvm.write_register() can set it. I added a main function to stackpointer.ll, which can be found in the LLVM source tree:
;stackpointer.ll
define i32 @get_stack() nounwind {
  %sp = call i32 @llvm.read_register.i32(metadata !0)
  ret i32 %sp
}
declare i32 @llvm.read_register.i32(metadata) nounwind
!0 = metadata !{metadata !"sp\00"}
define i32 @main() {
  %1 = call i32 @get_stack()
  ret i32 %1
}
}
I tested on an armv7 board running ubuntu 11.04:
lli stackpointer.ll
Then I get a stack dump:
ARMCodeEmitter::emitPseudoInstruction
UNREACHABLE executed at ARMCodeEmitter.cpp:847!
Stack dump:
0.  Program arguments: lli stackpointer.ll
1.  Running pass 'ARM Machine Code Emitter' on function '@main'
Aborted
I also tried llc:
llc stackpointer.ll -o stackpointer.s
The error message:
Can't get register for value!
UNREACHABLE executed at ARMCodeEmitter.cpp:1183!
Stack dump:
0.  Program arguments: llc stackpointer.ll -o stackpointer.s
1.  Running pass 'Function Pass Manager' on module 'stackpointer.ll'
2.  Running pass 'ARM Instruction Selection' on function '@get_stack'
Aborted
I also tried on an x86-64 platform, and it didn't work there either. What is the correct way to use these intrinsics?
My lli didn't like your metadata definition.
I changed your
!0 = metadata !{metadata !"sp\00"}
to
!0 = !{!"sp\00"}
And it worked. (Well, since I'm on x86-64, I also changed i32 to i64 everywhere and sp to rsp.)
There were also bad whitespace characters in your formatting, but I think that might be due to StackOverflow/HTML or something.

Bad align value for a ELF section causes the program to be loaded wrong

I'm currently building a toy OS using a custom linker script to create the binary:
ENTRY(entry_point)
/* base virtual address of the kernel */
VIRT_BASE = 0xFFFFFFFF80000000;
SECTIONS
{
. = 0x100000;
/*
* Place multiboot header at 0x100000 as it is where Grub will be looking
* for it.
* Immediately followed by the boot code
*/
.boot :
{
*(.mbhdr)
_load_start = .;
*(.boot)
. = ALIGN(4096);
/* reserve space for paging data structures */
pml4 = .;
. += 0x1000;
pdpt = .;
. += 0x1000;
pagedir = .;
. += 0x1000;
. += 0x8000;
/* stack segment for loader */
stack = .;
}
/*
* Kernel code section is placed at its virtual address
*/
. += VIRT_BASE;
.text ALIGN(0x1000) : AT(ADDR(.text) - VIRT_BASE)
{
*(.text)
*(.gnu.linkonce.t*)
}
.data ALIGN(0x1000) : AT(ADDR(.data) - VIRT_BASE)
{
*(.data)
*(.gnu.linkonce.d*)
}
.rodata ALIGN(0x1000) : AT(ADDR(.rodata) - VIRT_BASE)
{
*(.rodata*)
*(.gnu.linkonce.r*)
}
_load_end = . - VIRT_BASE;
.bss ALIGN(0x1000) : AT(ADDR(.bss) - VIRT_BASE)
{
*(COMMON)
*(.bss)
*(.gnu.linkonce.b*)
}
_bss_end = . - VIRT_BASE;
/DISCARD/ :
{
*(.comment)
*(.eh_frame)
}
}
Since I use virtual memory, I use the AT() directive to make a distinction between the relocation address (the value of the symbol) and the actual physical load address, because virtual memory is not enabled when the binary is loaded. The address I give to AT corresponds to the physical address mapped to the virtual relocation address. This works well.
But in some cases, I notice a strange shift after my code is loaded in memory: the .text section (which does not include the assembly boot code, only C++ code) is located 8 bytes higher than where it should be. An objdump shows that the virtual relocation address is correct in the binary, as is the load address. The only visible change is in readelf -l:
Elf file type is EXEC (Executable file)
Entry point 0x10003c
There are 3 program headers, starting at offset 64
Program Headers:
Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align
LOAD 0x0000e8 0x0000000000100000 0x0000000000100000 0x00c000 0x00c000 R 0x8
LOAD 0x00c0f0 0xffffffff8010c000 0x000000000010c000 0x004032 0x005002 RWE 0x10
GNU_STACK 0x000000 0x0000000000000000 0x0000000000000000 0x000000 0x000000 RWE 0x8
Section to Segment mapping:
Segment Sections...
00 .boot
01 .text .text._ZN2io3outEth .data .rodata .bss
02
The Align value of the segment containing my .text section is 0x10 here, whereas it is 0x8 when everything works properly.
Why does this alignment only appear in some cases? For example, when I use this kind of C++ static initialization:
struct interrupt_desc
{
uint16_t clbk_low = 0x0;
uint16_t selector = 0x08;
uint8_t zero = 0x0;
uint8_t flags = 0xE;
uint16_t clbk_mid = 0x0;
uint32_t clbk_high = 0x0;
uint32_t zero2 = 0x0;
} __attribute__((packed));
interrupt_desc bar[32];
or when I implement these assembly structures for IDT handling: if I put them in the .data section, everything is fine, but if they go into .text, the alignment is there again:
[SECTION .data]
[GLOBAL _interrupt_table_register]
[GLOBAL _interrupt_vector_table]
[BITS 64]
align 8
_interrupt_table_register:
DW 0xABCD
DQ _interrupt_vector_table
_interrupt_vector_table:
%rep 256
DW 0x0000
DW 0x0008
DB 0x00
DB 0x0E
DW 0x0000
DD 0x00000000
DD 0x00000000
%endrep
And finally, I don't really understand what this align value means or how I can adjust it. Moreover, the load address is already aligned to 0x10 bytes, so why does this alignment change the effective load address?
Maybe I am failing to understand something important here, so any explanation of the internals of ELF is very welcome.
In case anyone wonders, the binary is an ELF64 linked with GNU ld, booted by Grub2 using Multiboot2.
PS: there is nothing wrong with my virtual memory mappings, as they work perfectly when the align is 8. Also, the shift is present even when I dump physical memory.
Also, I previously used Rust for the main code instead of C++. With Rust, this alignment bug has always been present since LTO was implemented in the compiler, but things were fine before that.
Thanks for your answers.
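On what the Align column means: per the ELF specification, `p_align` values of 0 and 1 mean no alignment is required; otherwise `p_align` should be a power of two, and `p_vaddr` should equal `p_offset` modulo `p_align`. The following is a minimal C++ sketch of that rule, not an explanation of the 8-byte shift itself. `segmentAlignmentOk` is a hypothetical helper name, and the sample values in the usage note are taken from the readelf output above.

```cpp
#include <cstdint>

// Hypothetical helper: checks the ELF rule that, for a loadable segment,
// p_vaddr and p_offset must be congruent modulo p_align.
bool segmentAlignmentOk(std::uint64_t offset,
                        std::uint64_t vaddr,
                        std::uint64_t align) {
    // 0 and 1 both mean "no alignment constraint".
    if (align <= 1) {
        return true;
    }
    // Any other value must be a power of two.
    if ((align & (align - 1)) != 0) {
        return false;
    }
    return (offset % align) == (vaddr % align);
}
```

Both LOAD headers shown above satisfy the rule as emitted: offset 0x0000e8 with address 0x100000 at align 0x8, and offset 0x00c0f0 with address 0xffffffff8010c000 at align 0x10. The same check fails for offset 0xe8 at align 0x10, since 0xe8 is only 8-byte aligned, so it may be worth checking how the loader treats file offsets when the alignment changes.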