Why is my (re)implementation of strlen wrong? - c++

I came up with this little code, but all the professionals said it's dangerous and that I should not write code like this. Can anyone highlight its vulnerabilities in more detail?
int strlen(char *s) {
    return (*s) ? 1 + strlen(s + 1) : 0;
}

It has no vulnerabilities per se, this is perfectly correct code. It is prematurely pessimized, of course. It will run out of stack space for anything but the shortest strings, and its performance will suck due to recursive calls, but otherwise it's OK.
The tail call optimization most likely won't cope with such code. If you want to live dangerously and depend on tail-call optimizations, you should rephrase it to use the tail-call:
// note: size_t is an unsigned integer type
int strlen_impl(const char *s, size_t len) {
    if (*s == 0) return len;
    if (len + 1 < len) return len; // protect against overflow
    return strlen_impl(s + 1, len + 1);
}

int strlen(const char *s) {
    return strlen_impl(s, 0);
}

Dangerous is a bit of a stretch, but it is needlessly recursive and likely to be less efficient than the iterative alternative.
I suppose also that, given a very long string, there is a danger of a stack overflow.

There are two serious security bugs in this code:
Use of int instead of size_t for the return type. As written, strings longer than INT_MAX will cause this function to invoke undefined behavior via integer overflow. In practice, this could lead to computing strlen(huge_string) as some small value like 1, malloc'ing the wrong amount of memory, and then performing strcpy into it, causing a buffer overflow.
Unbounded recursion which can overflow the stack, i.e. Stack Overflow. :-) A compiler may choose to optimize the recursion into a loop (in this case, it's possible with current compiler technology), but there is no guarantee that it will. In a best case, stack overflow will simply crash the program. In a worst case (e.g. running on a thread with no guard page) it could clobber unrelated memory, possibly yielding arbitrary code execution.
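Both problems go away with a loop and an unsigned return type. A minimal iterative sketch (function name mine, to avoid shadowing the standard strlen):

```cpp
#include <cstddef>

// Iterative version: size_t cannot overflow into undefined behavior the
// way a signed int can, and a loop cannot exhaust the stack.
std::size_t my_strlen(const char *s) {
    std::size_t len = 0;
    while (s[len] != '\0')
        ++len;
    return len;
}
```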

The problem with killing the stack that has been pointed out ought to be fixed by a decent compiler, where the apparent recursive call is flattened into a loop. I verified this hypothesis and asked clang to translate your code:
// sl.c
unsigned sl(char const *s) {
    return (*s) ? (1 + sl(s + 1)) : 0;
}
Compiling and disassembling:
clang -emit-llvm -O1 -c sl.c -o sl.o
# ^^ Yes, O1 is already sufficient.
llvm-dis-3.2 sl.o
And this is the relevant part of the llvm result (sl.o.ll)
define i32 @sl(i8* nocapture %s) nounwind uwtable readonly {
  %1 = load i8* %s, align 1, !tbaa !0
  %2 = icmp eq i8 %1, 0
  br i1 %2, label %tailrecurse._crit_edge, label %tailrecurse

tailrecurse:                                      ; preds = %tailrecurse, %0
  %s.tr3 = phi i8* [ %3, %tailrecurse ], [ %s, %0 ]
  %accumulator.tr2 = phi i32 [ %4, %tailrecurse ], [ 0, %0 ]
  %3 = getelementptr inbounds i8* %s.tr3, i64 1
  %4 = add i32 %accumulator.tr2, 1
  %5 = load i8* %3, align 1, !tbaa !0
  %6 = icmp eq i8 %5, 0
  br i1 %6, label %tailrecurse._crit_edge, label %tailrecurse

tailrecurse._crit_edge:                           ; preds = %tailrecurse, %0
  %accumulator.tr.lcssa = phi i32 [ 0, %0 ], [ %4, %tailrecurse ]
  ret i32 %accumulator.tr.lcssa
}
I don't see a recursive call. Indeed, clang named the loop label tailrecurse, which gives us a hint as to what clang is doing here.
So, finally (tl;dr): yes, this code is perfectly safe, and a decent compiler with a decent flag will iron the recursion out.

Related

Is LLVM bitcast from vector of bool (i1) to i8, i16, etc. well defined?

In LLVM, can a value of type <8 x i1> be bitcasted to an i8? If so what is the expected bit order? The LLVM documentation on bitcast is not explicit on this. The claim it makes is
The bitcast instruction converts value to type ty2. It is always a no-op cast because no bits change with this conversion. The conversion is done as if the value had been stored to memory and read back as type ty2.
Tangentially, it has been clarified on the mailing list that no-op cast does not mean what it sounds like.
Back to the issue at hand: the problem I see with bitcasting <8 x i1> to any other type (not just i8) is that a value of type <8 x i1> cannot be stored to memory. I have confirmed this experimentally (code not included), and it is also well-documented on the mailing list. Since storing values of type <8 x i1> leads to undefined behavior, the specification "as if the value had been stored to memory and read back as type ty2" implies that any bitcast to or from <8 x i1> results in undefined behavior.
Note that a very similar question has been asked before, but its answers do not address the general soundness issue presented here. The author of that question worked around the problem by bitcasting <8 x i1> to <1 x i8>, but that cast still involves an argument of type <8 x i1>, so I am not convinced that it is sound.
For what it's worth, in some of my own small tests with LLVM, I have confirmed that bitcasting <8 x i1> to i8 works. Below is a function that tests 8 i16s at a time for whether or not they are each equal to 42.
; Filename is equality-8x16.ll
define void @equals42(<8 x i16>* %src0, i8* %dst0, i64 %len0) {
entry:
  %len = udiv exact i64 %len0, 8
  br label %cond

cond:
  %i = phi i64 [ 0, %entry ], [ %isucc, %loop ]
  %src = phi <8 x i16>* [ %src0, %entry ], [ %srcsucc, %loop ]
  %dst = phi i8* [ %dst0, %entry ], [ %dstsucc, %loop ]
  %cmp = icmp slt i64 %i, %len
  br i1 %cmp, label %loop, label %end

loop:
  %isucc = add i64 %i, 1
  %srcsucc = getelementptr <8 x i16>, <8 x i16>* %src, i64 1
  %dstsucc = getelementptr i8, i8* %dst, i64 1
  %val = load <8 x i16>, <8 x i16>* %src
  %bits = icmp eq <8 x i16> %val, <i16 42, i16 42, i16 42, i16 42, i16 42, i16 42, i16 42, i16 42>
  %res = bitcast <8 x i1> %bits to i8
  store i8 %res, i8* %dst
  br label %cond

end:
  ret void
}
And here's some C code (call-equality.c) that calls it:
#include <stdio.h>
#include <stdint.h>
#define SZ 8
void equals42(void*,void*,int64_t);
/* Prints the highest bit first and the lowest bit last */
void printbits(uint8_t x)
{
    for (int i = sizeof(x) << 3; i; i--)
        putchar('0' + ((x >> (i - 1)) & 1));
}

int main()
{
    uint16_t a[SZ * 8] = {0};
    uint8_t b[8];
    a[1] = 42;
    a[15] = 42;
    equals42(a, b, SZ * 8);
    for (int i = 0; i < SZ; i++) {
        printf("Index %d:", i);
        printbits(b[i]);
        printf("\n");
    }
}
Build, link, and run with:
llc-9 -O3 -mcpu=skylake -filetype=obj equality-8x16.ll
gcc call-equality.c equality-8x16.o
./a.out
And here's the results:
Index 0:00000010
Index 1:10000000
Index 2:00000000
Index 3:00000000
Index 4:00000000
Index 5:00000000
Index 6:00000000
Index 7:00000000
This works, and it even happens to do what I expect. The bits at positions 1 and 15 are set (interpreting byte 1, bit position 7 as bit position 15). However, it's not clear whether I would get the same results on a big-endian platform (I'm using a little-endian Skylake CPU). Again, I'd like to stress that LLVM's official documentation does not document the behavior of bitcasts involving <8 x i1>.
The question is not just "does this happen to work on your computer or mine". (Although if someone has a big-endian platform, I would be curious to see if the example program gives the same results). The real questions are:
Is there some quasi-authoritative source, even if it's just mailing list threads and issue trackers, that specifies the semantics of these bitcasts?
If these bitcasts are unsound, what is the idiomatic way to convert a <8 x i1> to an i8? It is possible to project out all eight bits individually (via extractelement) and then build an i8 with some ORs and bitshifts, but that seems both tedious and relies heavily on an optimization pass to get the shuffle operation I would expect. Is there something better?
The closest thing I've found so far is a mailing list thread from 2018 where a user notices a problem where bitcast <16 x i1> %a1 to i16 is poorly optimized. A maintainer responds, offering a fix in r348104 (which I'm not able to find on GitHub or Phabricator). But this seems to imply that bitcast <16 x i1> %a1 to i16 is understood to be well defined. But what is it actually supposed to mean? Is element 0 supposed to be bit position 0 in the resulting word? I think so, but it would be nice to see this spelled out somewhere.
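For what it's worth, the output above is consistent with one particular reading: element i of the <8 x i1> vector becomes bit i (element 0 = least significant bit) of the resulting i8. Below is a scalar C++ model of one 8-element step of the kernel (function name mine) that encodes that reading; it reproduces the Index 0 and Index 1 bytes observed above, but this mapping is my interpretation, not something the LLVM documentation guarantees:

```cpp
#include <cstdint>

// Scalar model of the assumed bitcast semantics:
// vector element i -> bit i of the output byte.
uint8_t equals42_scalar(const uint16_t *src) {
    uint8_t out = 0;
    for (int i = 0; i < 8; i++)
        if (src[i] == 42)
            out |= uint8_t(1u << i);  // element i sets bit i
    return out;
}
```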

Setting a variable to 0 in LLVM IR

Is it possible to set a variable to 0 (or any other number) in LLVM IR? My searches turned up the following 3-line snippet, but is there anything simpler than this?
%ptr = alloca i32 ; yields i32*:ptr
store i32 3, i32* %ptr ; yields void
%val = load i32, i32* %ptr ; yields i32:val = i32 3
To set a value to zero (or null in general) you can use
Constant::getNullValue(Type)
and to set a value to an arbitrary constant number you can use ConstantInt::get(), but you need to obtain the context first, like this:
LLVMContext &context = function->getContext();
/* or BB->getContext(), BB can be any basic block in the function */
Value* constVal = ConstantInt::get(Type::getInt32Ty(context), 3);
LLVM IR is in static single assignment (SSA) form, so each variable is assigned only once. If you want to assign a value to a memory region, you can simply use a store operation, as you showed in your example:
store i32 3, i32* %ptr
The type of the second argument is i32*, which means that it is a pointer to an integer that is 32 bits long.

incrementing a ptr in llvm ir

I am trying to understand the getelementptr instruction in llvm IR, but not fully understanding it.
I have a struct like below -
struct Foo {
    int32_t* p;
};
I want to do this -
foo.p++;
What would be the right code for this?
%0 = getelementptr %Foo* %fooPtr, i32 0, i32 0
%1 = getelementptr i32* %0, i8 1
store i32* %1, i32* %0
I am wondering if value in %0 needs to be first loaded using "load" before executing 2nd line.
Thanks!
You can think of the GEP instruction as an operation that performs arithmetic on pointers. In LLVM IR, GEP is the instruction of choice for such pointer operations: you don't have to do cumbersome manual calculations of type sizes and offsets.
In your case:
%0 = getelementptr %Foo* %fooPtr, i32 0, i32 0
selects the member inside the structure. It uses the pointer operand %fooPtr to compute %0 = ((fooPtr + 0) + 0). GEP does not know that %fooPtr points to just one Foo element, which is why two indices are needed to select the member.
%1 = getelementptr i32* %0, i8 1
As mentioned above, GEP performs pointer arithmetic, but note what it is applied to here: %0 is the address of the member p (an i32**), not the value of p. GEP never dereferences its base pointer, so to compute p + 1 you do need to load the value of p first, and then store the incremented pointer back:
%0 = getelementptr %Foo* %fooPtr, i32 0, i32 0 ; address of member p (i32**)
%1 = load i32** %0 ; load the current value of p
%2 = getelementptr i32* %1, i8 1 ; p + 1
store i32* %2, i32** %0 ; store it back into foo.p
This stores the new pointer back into the p member of the Foo struct pointed to by %fooPtr.
For further reading: The Often Misunderstood GEP Instruction
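To see the same steps in familiar terms, here is foo.p++ spelled out in C++ (function name mine): the member-address computation corresponds to the first GEP, the explicit load retrieves p, and the pointer addition corresponds to the second GEP.

```cpp
#include <cstdint>

struct Foo {
    int32_t* p;
};

// Step-by-step equivalent of foo.p++.
void increment_p(Foo *fooPtr) {
    int32_t **addr = &fooPtr->p;  // first GEP: address of member p
    int32_t *val = *addr;         // explicit load of p's current value
    *addr = val + 1;              // second GEP (p + 1), then the store
}
```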

At what moment is memory typically allocated for local variables in C++?

I'm debugging a rather weird stack overflow, supposedly caused by allocating too-large variables on the stack, and I'd like to clarify the following.
Suppose I have the following function:
void function()
{
    char buffer[1 * 1024];
    if (condition) {
        char buffer[1 * 1024];
        doSomething(buffer, sizeof(buffer));
    } else {
        char buffer[512 * 1024];
        doSomething(buffer, sizeof(buffer));
    }
}
I understand, that it's compiler-dependent and also depends on what optimizer decides, but what is the typical strategy for allocating memory for those local variables?
Will the worst case (1 + 512 kilobytes) be allocated immediately when the function is entered, or will 1 kilobyte be allocated first and then, depending on condition, either 1 or 512 additional kilobytes?
On many platforms/ABIs, the entire stackframe (including memory for every local variable) is allocated when you enter the function. On others, it's common to push/pop memory bit by bit, as it is needed.
Of course, in cases where the entire stack frame is allocated in one go, different compilers might still decide on different stack-frame sizes. Some compilers miss an optimization opportunity and allocate unique memory for every local variable, even the ones in different branches of the code (both the 1 * 1024 array and the 512 * 1024 one in your case), whereas a better optimizing compiler allocates only the maximum memory required by any path through the function (the else path in your case, so a 512 KB block would be enough).
If you want to know what your platform does, look at the disassembly.
But it wouldn't surprise me to see the entire chunk of memory allocated immediately.
I checked on LLVM:
void doSomething(char*,char*);
void function(bool b)
{
char b1[1 * 1024];
if( b ) {
char b2[1 * 1024];
doSomething(b1, b2);
} else {
char b3[512 * 1024];
doSomething(b1, b3);
}
}
Yields:
; ModuleID = '/tmp/webcompile/_28066_0.bc'
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64"
target triple = "x86_64-unknown-linux-gnu"

define void @_Z8functionb(i1 zeroext %b) {
entry:
  %b1 = alloca [1024 x i8], align 1               ; <[1024 x i8]*> [#uses=1]
  %b2 = alloca [1024 x i8], align 1               ; <[1024 x i8]*> [#uses=1]
  %b3 = alloca [524288 x i8], align 1             ; <[524288 x i8]*> [#uses=1]
  %arraydecay = getelementptr inbounds [1024 x i8]* %b1, i64 0, i64 0 ; <i8*> [#uses=2]
  br i1 %b, label %if.then, label %if.else

if.then:                                          ; preds = %entry
  %arraydecay2 = getelementptr inbounds [1024 x i8]* %b2, i64 0, i64 0 ; <i8*> [#uses=1]
  call void @_Z11doSomethingPcS_(i8* %arraydecay, i8* %arraydecay2)
  ret void

if.else:                                          ; preds = %entry
  %arraydecay6 = getelementptr inbounds [524288 x i8]* %b3, i64 0, i64 0 ; <i8*> [#uses=1]
  call void @_Z11doSomethingPcS_(i8* %arraydecay, i8* %arraydecay6)
  ret void
}

declare void @_Z11doSomethingPcS_(i8*, i8*)
You can see the 3 alloca at the top of the function.
I must admit I am slightly disappointed that b2 and b3 are not folded together in the IR, since only one of them will ever be used.
This optimization is known as "stack coloring", because you're assigning multiple stack objects to the same address. This is an area that we know LLVM can improve. Currently LLVM only does this for stack objects created by the register allocator for spill slots. We'd like to extend this to handle user stack variables as well, but we need a way to capture the lifetime of the value in IR.
There is a rough sketch of how we plan to do this here:
http://nondot.org/sabre/LLVMNotes/MemoryUseMarkers.txt
Implementation work on this is underway, several pieces are implemented in mainline.
-Chris
Your local (automatic) variables are allocated as part of the function's stack frame. When the function is called, the stack pointer is adjusted to "make room" for the frame, typically in a single instruction. If your local variables consume too much of the stack, you'll get a stack overflow.
~512 KB is really too large for the stack in any case; you should allocate this on the heap, for example with std::vector.
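A hedged sketch of that suggestion (doSomething's body is a placeholder of mine): the usage pattern stays the same, but the 512 KiB buffer moves to the heap and the stack-overflow risk disappears.

```cpp
#include <vector>
#include <cstring>
#include <cstddef>

// Stand-in for the doSomething from the question (body hypothetical).
static void doSomething(char *buf, std::size_t len) {
    std::memset(buf, 'x', len);
}

// The 512 KiB buffer lives on the heap; it is freed automatically when
// the vector goes out of scope, just like a stack array would be.
char function_heap() {
    std::vector<char> buffer(512 * 1024);
    doSomething(buffer.data(), buffer.size());
    return buffer.back();  // observe that the buffer was filled
}
```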
As you say, it is compiler-dependent, but you could consider using alloca to overcome this. The variables would still be allocated on the stack, and still automatically freed as they go out of scope, but you take control over when and whether the stack space is allocated.
While use of alloca is typically discouraged, it does have its uses in situations such as the above.
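A hypothetical sketch of that idea (function name and sizes mine, and note alloca is non-standard): the stack space is requested only on the path that needs it, and only for the size actually required.

```cpp
#include <alloca.h>  // non-standard; glibc/BSD location, <malloc.h> elsewhere
#include <cstring>
#include <cstddef>

// The allocation happens at the alloca() call, not at function entry,
// and the space is released automatically when the function returns.
char use_alloca(bool big) {
    std::size_t n = big ? 4096 : 64;             // illustrative sizes
    char *buf = static_cast<char *>(alloca(n));  // stack allocation, sized at runtime
    std::memset(buf, 'y', n);
    return buf[n - 1];
}
```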

Writing llvm byte code

I have just discovered LLVM and don't know much about it yet. I have been trying it out using llvm in browser. I can see that any C code I write is converted to LLVM byte code which is then converted to native code. The page shows a textual representation of the byte code. For example for the following C code:
int array[] = { 1, 2, 3};
int foo(int X) {
return array[X];
}
It shows the following byte code:
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64"
target triple = "x86_64-linux-gnu"
@array = global [3 x i32] [i32 1, i32 2, i32 3]   ; <[3 x i32]*> [#uses=1]

define i32 @foo(i32 %X) nounwind readonly {
entry:
  %0 = sext i32 %X to i64                         ; <i64> [#uses=1]
  %1 = getelementptr inbounds [3 x i32]* @array, i64 0, i64 %0 ; <i32*> [#uses=1]
  %2 = load i32* %1, align 4                      ; <i32> [#uses=1]
  ret i32 %2
}
My question is: can I write the byte code myself and give it to the LLVM assembler to convert to native code, skipping the first step of writing C code altogether? If so, how do I do it? Does anyone have any pointers for me?
One very important feature (and design goal) of the LLVM IR language is its 3-way representation:
The textual representation you can see here
The bytecode representation (or binary form)
The in-memory representation
All 3 are indeed completely interchangeable: nothing that can be expressed in one cannot be expressed in the other two as well.
Therefore, as long as you conform to the syntax, you can indeed write the IR yourself. It is rather pointless though, unless used as an exercise to accustom yourself with the format, whether to be better at reading (and diagnosing) the IR or to produce your own compiler :)
Yes, surely you can. First, you can write LLVM IR by hand. All tools, such as llc (which will generate native code for you) and opt (an LLVM IR => LLVM IR optimizer), accept the textual representation of LLVM IR as input.
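As a minimal illustration (filename and function mine), you could save the following hand-written IR and lower it with llc:

```llvm
; square.ll -- a hand-written module
define i32 @square(i32 %x) {
entry:
  %0 = mul i32 %x, %x
  ret i32 %0
}
```

Then llc square.ll -o square.s produces native assembly, and llc -filetype=obj square.ll produces an object file you can link against from C.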