Keeping tight allocation-free loops interruptible - OCaml

I want to keep an otherwise alloc-free loop interruptible.
To do this, I'm adding a dummy allocation to the loop.
My question is about minimizing the allocation costs.
Here is a minimal version of what I am doing:
let my_dummy_alloc () =
  ignore (Sys.opaque_identity (ref 0));;
let my_fun x =
  my_dummy_alloc ();
  x;;
let my_handler _ = raise Exit;;
Sys.(set_signal sigint (Signal_handle my_handler));
for i = 1 to int_of_string Sys.argv.(1) do
  ignore (Sys.opaque_identity (my_fun i));
done;;
ocamlopt emits AMD64 asm code that looks like this:
_camlTt__my_fun_166:
subq $8, %rsp
L104:
subq $16, %r15 # I want 8
cmpq 8(%r14), %r15
jb L105
L107:
leaq 8(%r15), %rbx
movq $1024, -8(%rbx)
movq $1, (%rbx) # unneeded
addq $8, %rsp
ret
L105:
call _caml_call_gc
L106:
jmp L107
From what I understand about the internal representation of heap-allocated blocks, a block comprising one header word and zero data words should be possible.
How can I generate such blocks from OCaml code?
Edit. I had a hunch that Obj.new_block might help. It did not. Instead, the running time quadrupled and the interruptibility was gone.
Looking at the asm code, I noticed call _caml_c_call, which explains the effects I observed. Back to square one!

Related

Understanding of the following assembly code from CSAPP

I have recently been reading CSAPP and I have a question about an example of assembly code. This is an example from CSAPP; the code follows:
long pcount_goto(unsigned long x) {
    long result = 0;
 loop:
    result += x & 0x1;
    x >>= 1;
    if (x) goto loop;
    return result;
}
And the corresponding assembly code is:
movl $0, %eax # result = 0
.L2: # loop:
movq %rdi, %rdx
andl $1, %edx # t = x & 0x1
addq %rdx, %rax # result += t
shrq %rdi # x >>= 1
jne .L2 # if (x) goto loop
rep; ret
The questions I have may look naive since I am very new to assembly code, but I will be grateful if someone can help me with them.
What's the difference between %eax and %rax (also %edx and %rdx)? I have seen them occur in the assembly code, but they seem to refer to the same space/address. What's the point of using two different names?
In the code
andl $1, %edx # t = x & 0x1
I understand that %edx now stores t, but where does x go then?
In the code
shrq %rdi
I think
shrq 1, %rdi
would be better?
For
jne .L2 # if (x) goto loop
Where does the if (x) go? I can't see any comparison or test.
These are really basic questions; a little research of your own should have answered all of them. Anyway:
The e registers are the low 32 bits of the r registers. You pick one depending on what size you need. There are also 16 and 8 bit registers. Consult a basic architecture manual.
The and instruction modifies its destination operand: it's not a = b & c, it's a &= b.
That would be shrq $1, %rdi which is valid, and shrq %rdi is just an alias for it.
jne examines the zero flag which is set earlier by shrq automatically if the result was zero.
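To tie those answers back to the C, here is a loop-level rendering of what the listing computes. This is not from the original post, just an illustration; the function name pcount_loop is mine.
long pcount_loop(unsigned long x) {
    long result = 0;            // movl $0, %eax
    do {
        unsigned long t = x;    // movq %rdi, %rdx  (x is copied; the original x is untouched)
        t &= 1;                 // andl $1, %edx    (and overwrites its destination: t &= 1)
        result += t;            // addq %rdx, %rax
        x >>= 1;                // shrq %rdi        (shift by 1; also sets the zero flag)
    } while (x != 0);           // jne .L2          (branches on the flag left by shrq)
    return result;
}
Writing the copy of x into t explicitly shows where x goes, and the shift/branch pair shows where the if (x) test comes from.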

Why does g++ still optimize tail recursion when the recursive call's result is multiplied?

They say the tail recursion optimization works only when the call is just before the return from the function. So they show this code as an example of what shouldn't be optimized by C compilers:
long long f(long long n) {
    return n > 0 ? f(n - 1) * n : 1;
}
because the recursive call is multiplied by n, which means the last operation is the multiplication, not the recursive call. However, it is optimized to a loop even at the -O1 level:
recursion`f:
0x100000930 <+0>: pushq %rbp
0x100000931 <+1>: movq %rsp, %rbp
0x100000934 <+4>: movl $0x1, %eax
0x100000939 <+9>: testq %rdi, %rdi
0x10000093c <+12>: jle 0x10000094e
0x10000093e <+14>: nop
0x100000940 <+16>: imulq %rdi, %rax
0x100000944 <+20>: cmpq $0x1, %rdi
0x100000948 <+24>: leaq -0x1(%rdi), %rdi
0x10000094c <+28>: jg 0x100000940
0x10000094e <+30>: popq %rbp
0x10000094f <+31>: retq
They say that:
Your final rules are therefore sufficiently correct. However, return n * fact(n - 1) does have an operation in the tail position! This is the multiplication *, which will be the last thing the function does before it returns. In some languages, this might actually be implemented as a function call which could then be tail-call optimized.
However, as we see from the asm listing, the multiplication is still an asm instruction, not a separate function call. So I really struggle to see the difference from the accumulator approach:
int fac_times (int n, int acc) {
    return (n == 0) ? acc : fac_times(n - 1, acc * n);
}

int factorial (int n) {
    return fac_times(n, 1);
}
This produces
recursion`fac_times:
0x1000008e0 <+0>: pushq %rbp
0x1000008e1 <+1>: movq %rsp, %rbp
0x1000008e4 <+4>: testl %edi, %edi
0x1000008e6 <+6>: je 0x1000008f7
0x1000008e8 <+8>: nopl (%rax,%rax)
0x1000008f0 <+16>: imull %edi, %esi
0x1000008f3 <+19>: decl %edi
0x1000008f5 <+21>: jne 0x1000008f0
0x1000008f7 <+23>: movl %esi, %eax
0x1000008f9 <+25>: popq %rbp
0x1000008fa <+26>: retq
Am I missing something? Or is it just that compilers have become smarter?
As you see in the assembly code, the compiler is smart enough to turn your code into a loop that is basically equivalent to (disregarding the different data types):
int fac(int n)
{
    int result = n;
    while (--n)
        result *= n;
    return result;
}
GCC is smart enough to know that the state needed by each call to your original f can be kept in two variables (n and result) through the whole recursive call sequence, so no stack is necessary. It can transform f into fac_times, and both into fac, so to speak. This is most likely not only a result of tail-call optimization in the strictest sense, but also of the many other heuristics GCC uses for optimization.
(I can't go more into detail regarding the specific heuristics that are used here since I don't know enough about them.)
The non-accumulator f isn't tail-recursive. The compiler's options are to turn it into a loop by transforming it, or to emit call / some insns / ret; they don't include a plain jmp f without other transformations.
Tail-call optimization applies in cases like this:
int ext(int a);
int foo(int x) { return ext(x); }
asm output from godbolt:
foo: # #foo
jmp ext # TAILCALL
Tail-call optimization means leaving a function (or recursing) with a jmp instead of a ret. Anything else is not tailcall optimization. Tail-recursion that's optimized with a jmp really is a loop, though.
A good compiler will do further transformations to put the conditional branch at the bottom of the loop when possible, removing the unconditional branch. (In asm, the do{}while() style of looping is the most natural).
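For reference, here is a sketch (the function name is mine) of the loop shape that the -O1 listing above corresponds to, with the conditional branch at the bottom:
long long f_loop(long long n) {
    long long result = 1;        // movl $0x1, %eax
    if (n > 0) {                 // testq %rdi, %rdi ; jle past the loop
        do {
            result *= n;         // imulq %rdi, %rax
            --n;                 // leaq -0x1(%rdi), %rdi
        } while (n > 0);         // cmpq $0x1, %rdi ; jg (compares the value before the decrement)
    }
    return result;
}
The single test before the loop plus the test at the bottom is exactly the rotated do{}while() form mentioned above.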

Optimizing: Inline or Macro Function?

I need to optimize a program as much as possible. I have come across this issue: I have a one-dimensional array which represents a texture as pixel data, and I need to manipulate that data. The array is indexed via the following calculation:
(y * width) + x
to address x,y coordinates. Now the question is: which form of this is the most optimized? I have considered the following two possibilities:
Inline:
inline int Coords(int x, int y) { return (y * width) + x; }
Macro:
#define COORDS(X,Y) ((Y)*width)+(X)
Which one is the best practice to use here, or is there a way to get an even more optimized variant that I don't know of?
I wrote a little test program to see what the difference would be between the two approaches.
Here it is:
#include <cstdint>
#include <algorithm>
#include <iterator>
#include <iostream>

using namespace std;

static constexpr int width = 100;

inline int Coords(int x, int y) { return (y * width) + x; }

#define COORDS(X,Y) ((Y)*width)+(X)

void fill1(uint8_t* bytes, int height)
{
    for (int x = 0 ; x < width ; ++x) {
        for (int y = 0 ; y < height ; ++y) {
            bytes[Coords(x,y)] = 0;
        }
    }
}

void fill2(uint8_t* bytes, int height)
{
    for (int x = 0 ; x < width ; ++x) {
        for (int y = 0 ; y < height ; ++y) {
            bytes[COORDS(x,y)] = 0;
        }
    }
}

auto main() -> int
{
    uint8_t buf1[100 * 100];
    uint8_t buf2[100 * 100];

    fill1(buf1, 100);
    fill2(buf2, 100);

    // these are here to prevent the compiler from optimising away all the above code.
    copy(begin(buf1), end(buf1), ostream_iterator<char>(cout));
    copy(begin(buf2), end(buf2), ostream_iterator<char>(cout));

    return 0;
}
I compiled it like this:
c++ -S -o intent.s -std=c++1y -O3 intent.cpp
and then looked at the generated assembly to see what the compiler would do.
As expected, the compiler completely ignores all attempts by the programmer to optimise, and instead looks solely at the expressed intent, the side effects, and the possibilities of aliasing. It then emits exactly the same code for both functions (which are of course inlined).
Relevant parts of the assembly:
.globl _main
.align 4, 0x90
_main: ## #main
.cfi_startproc
## BB#0:
pushq %rbp
Ltmp16:
.cfi_def_cfa_offset 16
Ltmp17:
.cfi_offset %rbp, -16
movq %rsp, %rbp
Ltmp18:
.cfi_def_cfa_register %rbp
pushq %r15
pushq %r14
pushq %r13
pushq %r12
pushq %rbx
subq $20024, %rsp ## imm = 0x4E38
Ltmp19:
.cfi_offset %rbx, -56
Ltmp20:
.cfi_offset %r12, -48
Ltmp21:
.cfi_offset %r13, -40
Ltmp22:
.cfi_offset %r14, -32
Ltmp23:
.cfi_offset %r15, -24
movq ___stack_chk_guard@GOTPCREL(%rip), %r15
movq (%r15), %r15
movq %r15, -48(%rbp)
xorl %eax, %eax
xorl %ecx, %ecx
.align 4, 0x90
LBB2_1: ## %.lr.ph.us.i
## =>This Loop Header: Depth=1
## Child Loop BB2_2 Depth 2
leaq -10048(%rbp,%rcx), %rdx
movl $400, %esi ## imm = 0x190
.align 4, 0x90
LBB2_2: ## Parent Loop BB2_1 Depth=1
## => This Inner Loop Header: Depth=2
movb $0, -400(%rdx,%rsi)
movb $0, -300(%rdx,%rsi)
movb $0, -200(%rdx,%rsi)
movb $0, -100(%rdx,%rsi)
movb $0, (%rdx,%rsi)
addq $500, %rsi ## imm = 0x1F4
cmpq $10400, %rsi ## imm = 0x28A0
jne LBB2_2
## BB#3: ## in Loop: Header=BB2_1 Depth=1
incq %rcx
cmpq $100, %rcx
jne LBB2_1
## BB#4:
xorl %r13d, %r13d
.align 4, 0x90
LBB2_5: ## %.lr.ph.us.i10
## =>This Loop Header: Depth=1
## Child Loop BB2_6 Depth 2
leaq -20048(%rbp,%rax), %rcx
movl $400, %edx ## imm = 0x190
.align 4, 0x90
LBB2_6: ## Parent Loop BB2_5 Depth=1
## => This Inner Loop Header: Depth=2
movb $0, -400(%rcx,%rdx)
movb $0, -300(%rcx,%rdx)
movb $0, -200(%rcx,%rdx)
movb $0, -100(%rcx,%rdx)
movb $0, (%rcx,%rdx)
addq $500, %rdx ## imm = 0x1F4
cmpq $10400, %rdx ## imm = 0x28A0
jne LBB2_6
## BB#7: ## in Loop: Header=BB2_5 Depth=1
incq %rax
cmpq $100, %rax
jne LBB2_5
## BB#8:
movq __ZNSt3__14coutE@GOTPCREL(%rip), %r14
leaq -20049(%rbp), %r12
xorl %ebx, %ebx
.align 4, 0x90
LBB2_9: ## %_ZNSt3__116ostream_iteratorIccNS_11char_traitsIcEEEaSERKc.exit.us.i.i13
## =>This Inner Loop Header: Depth=1
movb -10048(%rbp,%r13), %al
movb %al, -20049(%rbp)
movl $1, %edx
movq %r14, %rdi
movq %r12, %rsi
callq __ZNSt3__124__put_character_sequenceIcNS_11char_traitsIcEEEERNS_13basic_ostreamIT_T0_EES7_PKS4_m
incq %r13
cmpq $10000, %r13 ## imm = 0x2710
jne LBB2_9
## BB#10:
movq __ZNSt3__14coutE@GOTPCREL(%rip), %r14
leaq -20049(%rbp), %r12
.align 4, 0x90
LBB2_11: ## %_ZNSt3__116ostream_iteratorIccNS_11char_traitsIcEEEaSERKc.exit.us.i.i
## =>This Inner Loop Header: Depth=1
movb -20048(%rbp,%rbx), %al
movb %al, -20049(%rbp)
movl $1, %edx
movq %r14, %rdi
movq %r12, %rsi
callq __ZNSt3__124__put_character_sequenceIcNS_11char_traitsIcEEEERNS_13basic_ostreamIT_T0_EES7_PKS4_m
incq %rbx
cmpq $10000, %rbx ## imm = 0x2710
jne LBB2_11
## BB#12: ## %_ZNSt3__14copyIPhNS_16ostream_iteratorIccNS_11char_traitsIcEEEEEET0_T_S7_S6_.exit
cmpq -48(%rbp), %r15
jne LBB2_14
## BB#13: ## %_ZNSt3__14copyIPhNS_16ostream_iteratorIccNS_11char_traitsIcEEEEEET0_T_S7_S6_.exit
xorl %eax, %eax
addq $20024, %rsp ## imm = 0x4E38
popq %rbx
popq %r12
popq %r13
popq %r14
popq %r15
popq %rbp
retq
Note that without the two calls to copy(..., ostream_iterator...), the compiler surmised that the total effect of the program was nothing and refused to emit any code at all, other than to return 0 from main().
Moral of the story: stop trying to do the compiler's job. Get on with yours.
Your job is to express intent as elegantly as you can. That's all.
Inline function, for two reasons:
it's less prone to bugs,
it lets the compiler decide whether to inline or not, so you don't have to waste time worrying about such trivial things.
First job: fix the bugs in the macro.
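The answer doesn't spell the fix out, but the defect it presumably refers to is the missing outer parentheses: an expression such as COORDS(x,y) * 4 expands to ((y)*width)+(x) * 4, which is not what was intended. A repaired sketch:
#define COORDS(X,Y) (((Y) * width) + (X))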
If you're that concerned, implement both ways using a compiler directive and profile the results.
Change inline int Coords(int x, int y) to inline int Coords(const int x, const int y) so that, if the macro version does turn out quicker, the inline build will fail to compile if the macro is ever refactored to modify its arguments.
My hunch is that the function will be no slower than the macro in a good optimised build. And a code base without macros is easier to maintain.
If you do end up settling for the macro, then I'd be inclined to pass width as a macro argument too for the sake of program stability.
I am surprised that no one mentioned one major difference between a function and a macro: any compiler can inline the function, but few (if any) can create a function out of a macro, even if that would benefit performance.
I would offer a diverging answer in that this question seems to be looking at the wrong solutions. It's comparing two things that even the most basic optimizer from the 90s (maybe even 80s) should be able to optimize to the same degree (a trivial one-liner function versus a macro).
If you want to improve performance here, you have to compare between solutions that aren't so trivial for the compiler to optimize.
For example, let's say you access the texture in a sequential way. Then you don't need to access a pixel through (y*w) + x; you can simply iterate over the data sequentially:
for (int j=0; j < num_pixels; ++j)
// do something with pixels[j]
In practice I've seen performance benefits with these kinds of loops over the y/x double loop even against the most modern compilers.
Let's say you aren't accessing things perfectly sequentially but can still access adjacent horizontal pixels within a scanline. You might get a performance boost in that case by doing:
// Given a particular y value:
Pixel* scanline = pixels + y*w;
for (int x=0; x < w; ++x)
// do something with scanline[x]
If you aren't doing either of these things and need completely random access to an image, maybe you can figure out a way to make your memory access pattern more uniform (accessing more horizontal pixels that would likely be in the same L1 cache line prior to eviction).
Sometimes it can even be worth the cost to transpose the image if that results in the bulk of your subsequent memory access being horizontal within a scanline and not across scanlines (due to spatial locality). It might seem crazy that the cost of transposing an image (basically rotating it 90 degrees and swapping rows with columns) will more than make up for the reduced cost of accessing it afterwards, but accessing memory in an efficient, cache-friendly pattern is a huge deal, and especially in image processing (like the difference between hundreds of millions of pixels per second vs. just millions of pixels per second).
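To make the transpose idea concrete, here is a minimal sketch, assuming a plain grayscale uint8_t image; the function and parameter names are mine, not from the answer:
#include <cstdint>
#include <vector>

// Copy a width x height image into a transposed (height x width) buffer so
// that what used to be a column-wise walk over the original becomes a
// sequential, cache-friendly row-wise walk over the copy.
std::vector<uint8_t> transposed(const std::vector<uint8_t>& src, int width, int height)
{
    std::vector<uint8_t> dst(src.size());
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            dst[x * height + y] = src[y * width + x];
    return dst;
}
Whether the one-off cost of the copy pays for itself depends on how many passes over the data follow it, so this is something to profile rather than assume.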
If you can't do any of this, still need completely random access, and are facing profiler hotspots here, then it might help to split your texture image into smaller tiles. That means rendering more textured quads/triangles, and possibly extra work to ensure seamless results at the boundaries of each tile, but it can balance out: the extra geometry overhead can be outweighed by the savings if your real cost is in processing the texture. Splitting the texture reduces the size of the input you process in a totally random-access way, which increases locality of reference and the probability that the memory you touch is still cached in faster but smaller memory before it is evicted.
Any of these techniques can provide a boost in performance -- trying to optimize a one-liner function by using a macro instead is very unlikely to help anything except just make the code harder to maintain. In the best case scenario a macro might improve performance in a completely unoptimized debug build, but that kind of defeats the whole purpose of a debug build which is intended to be easy to debug, and macros are notoriously difficult to debug.

Performance: if (i == 0) versus if (!i) [closed]

I'm writing (or at least trying to write) some high-performance C++ code. I've come across a part where I need to do a large number of integer comparisons, namely, checking whether a result is equal to zero.
Which is more efficient? That is, which requires fewer processor instructions?
if (i == 0) {
    // do stuff
}
or
if (!i) {
    // do stuff
}
I'm running it on an x86-64 architecture, if that makes any difference.
Let's look at the assembly (with no optimizations) of this code with gcc:
void foo(int& i)
{
    if (!i)
        i++;
}

void bar(int& i)
{
    if (i == 0)
        i++;
}

int main()
{
    int i = 0;
    foo(i);
    bar(i);
}
foo(int&): # #foo(int&)
movq %rdi, -8(%rsp)
movq -8(%rsp), %rdi
cmpl $0, (%rdi)
jne .LBB0_2
movq -8(%rsp), %rax
movl (%rax), %ecx
addl $1, %ecx
movl %ecx, (%rax)
.LBB0_2:
ret
bar(int&): # #bar(int&)
movq %rdi, -8(%rsp)
movq -8(%rsp), %rdi
cmpl $0, (%rdi)
jne .LBB1_2
movq -8(%rsp), %rax
movl (%rax), %ecx
addl $1, %ecx
movl %ecx, (%rax)
.LBB1_2:
ret
main: # #main
pushq %rbp
movq %rsp, %rbp
subq $16, %rsp
leaq -8(%rbp), %rdi
movl $0, -4(%rbp)
movl $0, -8(%rbp)
callq foo(int&)
leaq -8(%rbp), %rdi
callq bar(int&)
movl -4(%rbp), %eax
addq $16, %rsp
popq %rbp
ret
Bottom line:
The generated assembly is exactly identical (even without optimizations enabled), so it doesn't matter: choose the clearest, most readable syntax, which is probably if (i == 0) in your case.
In C++, you almost never need to care about such micro-optimizations; compilers/optimizers are very good at this game: trust them. If you don't, and if you have a performance bottleneck, profile / look at the assembly for your particular platform.
Note:
You can use godbolt.org to generate such assembly; it is a very handy tool.
You can also use the -S option on gcc to produce the assembly (other compilers have similar options)
Unless you have an insane compiler, they should compile identically. Having said that, for the sanity of future people looking at your code, only use i == 0 if i is a numeric type and !i if i is a bool type.
None of the better-known compilers will compile those to anything that differs significantly enough to matter, as you will find when you do what everyone must do before applying manual optimizations: measure.

Why is my stack-based implementation of this code so much slower than recursion?

I have a tree whose nodes store either -1 or a non-negative integer that is the name of a vertex. Each vertex appears at most once within the tree. The following function is a bottleneck in my code:
Version A:
void node_vertex_members(node *A, vector<int> *vertexList){
    if(A->contents != -1){
        vertexList->push_back(A->contents);
    }
    else{
        for(int i=0; i<A->children.size(); i++){
            node_vertex_members(A->children[i], vertexList);
        }
    }
}
Version B:
void node_vertex_members(node *A, vector<int> *vertexList){
    stack<node*> q;
    q.push(A);
    while(!q.empty()){
        int x = q.top()->contents;
        if(x != -1){
            vertexList->push_back(x);
            q.pop();
        }
        else{
            node *temp = q.top();
            q.pop();
            for(int i=temp->children.size()-1; i>=0; --i){
                q.push(temp->children[i]);
            }
        }
    }
}
For some reason, version B takes significantly longer to run than version A, which I did not expect. What might the compiler be doing that's so much more clever than my code? Put another way, what am I doing that's so inefficient? Also perplexing to me is that if I try anything such as checking in version B whether the children's contents are -1 before putting them on the stack, it slows down dramatically (almost 3x). For reference, I am using g++ in Cygwin with the -O3 option.
Update:
I was able to match the recursive version using the following code (version C):
node *node_list[65536];

void node_vertex_members(node *A, vector<int> *vertex_list){
    int top = 0;
    node_list[top] = A;
    while(top >= 0){
        int x = node_list[top]->contents;
        if(x != -1){
            vertex_list->push_back(x);
            --top;
        }
        else{
            node* temp = node_list[top];
            --top;
            for(int i=temp->children.size()-1; i>=0; --i){
                ++top;
                node_list[top] = temp->children[i];
            }
        }
    }
}
Obvious downsides are the code length and the magic number (and associated hard limit). And, as I said, this only matches the version A performance. I will of course be sticking with the recursive version, but I am satisfied now that it was basically STL overhead biting me.
Version A has one significant advantage: far smaller code size.
Version B has one significant disadvantage: memory allocation for the stack elements. Consider that the stack starts out empty and has elements pushed into it one by one. Every so often, a new allocation will have to be made for the underlying deque. This is an expensive operation, and it may be repeated a few times for each call of your function.
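A middle ground between version B and version C — not from the original post, just a sketch using the node shape implied by the question — is to back the explicit stack with a std::vector and reserve some capacity up front, which avoids both the deque's piecewise allocations and version C's fixed-size global array:
#include <vector>

// 'node' as implied by the question: contents is -1 for internal nodes,
// otherwise the vertex name; children holds the child pointers.
struct node {
    int contents;
    std::vector<node*> children;
};

void node_vertex_members_vec(node *A, std::vector<int> *vertexList){
    std::vector<node*> stk;
    stk.reserve(64);                      // initial guess; grows geometrically if exceeded
    stk.push_back(A);
    while(!stk.empty()){
        node *temp = stk.back();
        stk.pop_back();
        if(temp->contents != -1){
            vertexList->push_back(temp->contents);
        }
        else{
            for(int i = (int)temp->children.size() - 1; i >= 0; --i){
                stk.push_back(temp->children[i]);
            }
        }
    }
}
Contiguous storage also tends to behave better in the cache than a deque's segmented blocks.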
Edit: here's the assembly generated by g++ -O2 -S with GCC 4.7.3 on Mac OS, run through c++filt and annotated by me:
versionA(node*, std::vector<int, std::allocator<int> >*):
LFB609:
pushq %r12
LCFI5:
movq %rsi, %r12
pushq %rbp
LCFI6:
movq %rdi, %rbp
pushq %rbx
LCFI7:
movl (%rdi), %eax
cmpl $-1, %eax ; if(A->contents != -1)
jne L36 ; vertexList->push_back(A->contents)
movq 8(%rdi), %rcx
xorl %r8d, %r8d
movl $1, %ebx
movq 16(%rdi), %rax
subq %rcx, %rax
sarq $3, %rax
testq %rax, %rax
jne L46 ; i < A->children.size()
jmp L35
L43: ; for(int i=0;i<A->children.size();i++)
movq %rdx, %rbx
L46:
movq (%rcx,%r8,8), %rdi
movq %r12, %rsi
call versionA(node*, std::vector<int, std::allocator<int> >*)
movq 8(%rbp), %rcx
leaq 1(%rbx), %rdx
movq 16(%rbp), %rax
movq %rbx, %r8
subq %rcx, %rax
sarq $3, %rax
cmpq %rbx, %rax
ja L43 ; continue
L35:
popq %rbx
LCFI8:
popq %rbp
LCFI9:
popq %r12
LCFI10:
ret
L36: ; vertexList->push_back(A->contents)
LCFI11:
movq 8(%rsi), %rsi
cmpq 16(%r12), %rsi ; vector::size == vector::capacity
je L39
testq %rsi, %rsi
je L40
movl %eax, (%rsi)
L40:
popq %rbx
LCFI12:
addq $4, %rsi
movq %rsi, 8(%r12)
popq %rbp
LCFI13:
popq %r12
LCFI14:
ret
L39: ; slow path for vector to expand capacity
LCFI15:
movq %rdi, %rdx
movq %r12, %rdi
call std::vector<int, std::allocator<int> >::_M_insert_aux(__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, int const&)
jmp L35
That's fairly succinct and at a glance seems fairly free of "speed bumps." When I compile with -O3 I get an unholy mess, with unrolled loops and other fun stuff. I don't have time to annotate Version B right now, but suffice to say it is more complex due to many deque functions and scribbling on a lot more memory. No surprise it's slower.
The maximum size of q in version B is significantly greater than the maximum recursion depth in version A. That could make your cache performance quite a bit less efficient.
(For a tree with branching factor b: version A's recursion depth is about log(N)/log(b), while version B's stack can hold about b*log(N)/log(b) entries.)
The second code is slower because it's maintaining a second dynamic set data structure in addition to the collection that is being returned. That involves more memory allocations, more object initializations, more list insertions and deletions.
However, the algorithm in the second code is more flexible: it can be trivially modified to give you a breadth-first traversal instead of depth-first, whereas recursion only performs depth-first traversal. (Well, recursion can go breadth-first, but the change is not quite as trivial; see the comment at the end.)
Since the job is to traverse everything and collect some nodes, perhaps the depth-first traversal is better, assuming you don't specifically need breadth-first order.
But in situations where you are searching for a node which satisfies some condition, it may be more appropriate to implement a breadth-first search. If the tree is infinite (because it is not a data structure, but a search tree of possibilities, such as future moves in a game or whatever), it may be intractable to do depth-first, because there is no bottom. In some situations, it is desirable to find a node which is close to the root, not just any node. A depth-first search can take a long time to find a node which is close to the root of the tree. If the tree is deep, but usually the desired node is found not far from the root, a depth-first search can waste a lot of time, even if the recursion mechanism which implements it is fast.
Recursion can do breadth-first, by iterative deepening: recurse to a maximum depth of 1, then recurse from the top again, this time to a maximum depth of 2, and so on. The stack-based traversal, by contrast, just has to change the order in which it adds and removes nodes from its work collection (using a FIFO queue instead of a stack gives breadth-first).
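The answer doesn't include code for this, so here is a minimal sketch of iterative deepening, recast as a search for a target vertex (the search framing, the depth limit, and all names are mine), reusing the node shape from the question:
// Depth-limited DFS: return the node whose contents equal 'target', or nullptr.
// '*deeper' is set if the depth limit cut off an unexplored subtree.
node* find_limited(node *A, int target, int limit, bool *deeper)
{
    if (A->contents == target)
        return A;
    if (limit == 0) {
        if (!A->children.empty())
            *deeper = true;               // there is more tree below the cut-off
        return nullptr;
    }
    for (node *child : A->children)
        if (node *hit = find_limited(child, target, limit - 1, deeper))
            return hit;
    return nullptr;
}

// Iterative deepening: try depth 0, then 1, then 2, ... so a match near the
// root is found before deep subtrees are explored, yet only recursion is used.
node* find_iddfs(node *A, int target)
{
    for (int limit = 0; ; ++limit) {
        bool deeper = false;
        if (node *hit = find_limited(A, target, limit, &deeper))
            return hit;
        if (!deeper)
            return nullptr;               // the whole tree fit within the limit: no match
    }
}
Each pass revisits the upper levels of the tree, but for trees with a reasonable branching factor that duplicated work is a small fraction of the total.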