What's the size of the function stack in OCaml?

If we write non-tail-recursive functions, OCaml pushes a stack frame for each call, and it is possible to get a stack overflow error if we recurse too deeply.
So what's the threshold? What's the size of the function stack?

For the bytecode interpreter, the documentation says the default size is 256k words. (A word is 32 or 64 bits, depending on the system.) You can adjust it with the l parameter in OCAMLRUNPARAM or through the Gc module.
For native code, the documentation says that the native conventions of the OS are used. So it will be different for each implementation.
I just looked these things up now; I've never needed to know this in practice. Generally I don't want to write code that gets anywhere near the stack-size limit.
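For the bytecode runtime, a small hedged sketch of reading and raising that limit through the Gc module (the stack_limit field of Gc.control is in words, and it is ignored by native code):

```ocaml
(* Inspect and raise the bytecode stack limit via the Gc module.
   stack_limit is measured in words; native code ignores it and
   uses the operating system's stack instead. *)
let () =
  let c = Gc.get () in
  Printf.printf "stack_limit: %d words\n" c.Gc.stack_limit;
  (* Double the limit for deeply recursive bytecode programs. *)
  Gc.set { c with Gc.stack_limit = 2 * c.Gc.stack_limit }
```

The equivalent from outside the program would be setting the l parameter in OCAMLRUNPARAM before launching the bytecode executable.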

I don't know this for sure, but it is clear that the maximum recursion depth depends on the function in question. Simply consider these two (non-tail-recursive) functions:
let rec f x = print_int x; print_char '\n'; 1 + f (x+1);;
let rec g x y z = print_int x; print_char '\n'; 1 + g (x+1) y z;;
Try f 0 and g 0 0 0, respectively. Both functions will eventually produce a stack overflow, but the latter (g) will do so earlier, because each of its frames is larger: it also has to keep y and z around.
It may be the case that a certain number of bytes is available on the stack. You can probably approximate this number by looking at how far f gets and looking up what exactly is pushed onto the stack when a function call occurs.
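One way to approximate that number is to catch Stack_overflow and record how deep the recursion got before it was raised; a minimal sketch (the exact depth will differ between bytecode and native code, and with the frame size of the probing function):

```ocaml
(* Record the deepest level reached before Stack_overflow. *)
let max_depth = ref 0

let rec probe n =
  if n > !max_depth then max_depth := n;
  1 + probe (n + 1)  (* non-tail call: every level keeps a frame alive *)

let () =
  (try ignore (probe 0) with Stack_overflow -> ());
  Printf.printf "overflowed at depth ~%d\n" !max_depth
```

Adding more arguments to probe (as with g above) should make it overflow at a smaller depth, since each frame gets bigger.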

Related

A unique type of data conversion

in the following code
int i, tt;
tt = 5;
for (i = 0; i < tt; i++)
{
    int c, d, l;
    scanf("%lld%lld%lld", &c, &d, &l);
    printf("%d %d %d %d", c, d, l, tt);
}
in the first iteration, the value of tt changes to 0 automatically.
I know that I have declared c, d, l as int while reading them as long long, so that makes c and d 0. But I'm still not able to understand how tt is becoming 0.
Small but obligatory announcement. As was said in the comments, you are facing undefined behavior, so
don't be surprised by tt being assigned zero
don't be surprised by tt not being assigned zero after insignificant code changes (e.g. reordering the declaration from "int i,tt;" to "int tt, i;" or vice versa)
don't be surprised by tt not being assigned zero after compiling with different flags, with a different compiler version, for a different platform, or when testing with different input
don't be surprised by anything. Any behavior is possible.
You can't expect this code to work one way or another, so don't ever use it in a real program.
However, you seem to be OK with that, and the question is "what is actually happening with tt". IMHO this question is really great: it reveals a passion for understanding programming more deeply, and it helps in digging into the lower layers. So let's get started.
Possible explanation
I failed to reproduce the behavior on VS2015, but the situation is quite clear. The actual data alignment, variable sizes, endianness, stack growth direction and other details may differ on your PC, but the general idea should be the same.
Variables i, tt, c, d, l are local, so they are stored on the stack. Let's assume sizeof(int) is 4 and sizeof(long long) is 8, which is quite common. Then one possible data alignment is shown in the picture (addresses grow from left to right; each cell represents one byte):
When scanf runs, you pass the address of c (blue arrow in the next picture) to be filled with data. But the size of the data is 8 bytes, so the data of both c and tt gets overwritten (blue cells in the picture). With a little-endian representation, you always write zeroes into tt unless the user enters a really big number, while c actually receives valid data for small numbers.
However, the valid data in c will be overwritten the same way while filling d, and the same will happen to d while filling l. So only l gets a nonzero value in the described case. Easy test: enter large numbers for c, d, l and check whether tt is still zero.
How to get precise answer
You can get all the answers from the assembly code. Enable a disassembly listing (exact steps depend on the toolchain: gcc has the -S option; Visual Studio has a "Go to Disassembly" item in the context menu while on a breakpoint) and analyze the listing. It's really helpful to see the exact instructions your CPU is going to execute. Some debuggers allow executing instructions one by one. So you need to find out how the variables are aligned on the stack and when exactly they are overwritten. Analyzing scanf is hard for beginners, so you can start with a simplified version of your program: replace the scanf with the following (untested, but it should work):
*((long long *)(&c)) = 1; //or any other user specified value
*((long long *)(&d)) = 2;
*((long long *)(&l)) = 3;

Why is this Lwt-based and seemingly concurrent code so inconsistent?

I am trying to create concurrent examples of Lwt and came up with this little sample
let () =
  Lwt_main.run (
    let start = Unix.time () in
    Lwt_io.open_file ~mode:Lwt_io.Input "/dev/urandom" >>= fun data_source ->
    Lwt_unix.mkdir "serial" 0o777 >>= fun () ->
    Lwt_list.iter_p
      (fun count ->
         let count = string_of_int count in
         Lwt_io.open_file
           ~flags:[Unix.O_RDWR; Unix.O_CREAT]
           ~perm:0o777
           ~mode:Lwt_io.Output ("serial/file" ^ count ^ ".txt") >>= fun h ->
         Lwt_io.read ~count:52428800 data_source >>= Lwt_io.write_line h)
      [0;1;2;3;4;5;6;7;8;9] >>= fun () ->
    let finished = Unix.time () in
    Lwt_io.printlf "Execution time took %f seconds" (finished -. start))
EDIT: When asking for 50 GB it was:
"However this is incredibly slow and basically useless.
Does the inner bind need to be forced somehow?"
EDIT: I originally wrote this asking for 50 GB and it never finished; now, asking for 50 MB, I have a different problem: execution finishes nearly instantaneously and du -sh reports a directory size of only 80K.
EDIT: I have also tried the code with explicitly closing the file handles, with the same bad result.
I am on the latest version of OS X and compile with
ocamlfind ocamlopt -package lwt.unix main.ml -linkpkg -o Test
(I have also tried /dev/random, yes I'm using wall-clock time.)
So, your code has some issues.
Issue 1
The main issue is that you understood the Lwt_io.read function incorrectly (and nobody can blame you!).
val read : ?count : int -> input_channel -> string Lwt.t
(** [read ?count ic] reads at most [len] characters from [ic]. It
returns [""] if the end of input is reached. If [count] is not
specified, it reads all bytes until the end of input. *)
When ~count:len is specified, it will read at most len characters. "At most" means that it can read less. But if the count option is omitted, then it will read all data. I, personally, find this behavior unintuitive, if not weird. So this "at most" means up to len or less, i.e., no guarantee is provided that it will read exactly len bytes. And indeed, if you add a check to your program:
Lwt_io.read ~count:52428800 data_source >>= fun data ->
Lwt_io.printlf "Read %d bytes" (String.length data) >>= fun () ->
Lwt_io.write h data >>= fun () ->
You will see that it reads only 4096 bytes per attempt:
Read 4096 bytes
Read 4096 bytes
Read 4096 bytes
Read 4096 bytes
Read 4096 bytes
Read 4096 bytes
Read 4096 bytes
Read 4096 bytes
Read 4096 bytes
Read 4096 bytes
Why 4096? Because this is the default buffer size. But it actually doesn't matter.
Issue 2
The Lwt_io module implements buffered IO. That means that all your writes and reads do not go directly to a file but are buffered in memory, and that you should remember to flush and close. Your code doesn't close descriptors on finish, so you can end up in a situation where some buffers are left unflushed when the program terminates. Lwt_io in particular flushes all buffers before program exit, but you shouldn't rely on this undocumented feature (it may bite you in the future, when you try some other buffered IO, like the FILE streams from the standard C library). So, always close your files. (Another problem is that file descriptors are a precious resource, and their leaking is very hard to find.)
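A hedged sketch of the close-on-finish pattern, using Lwt.finalize so the descriptor is closed whether the write succeeds or fails (the function and argument names are illustrative, not from the question's code):

```ocaml
open Lwt

(* Open a file, write to it, and always close it, even on failure. *)
let write_file name data =
  Lwt_io.open_file ~mode:Lwt_io.Output name >>= fun h ->
  Lwt.finalize
    (fun () -> Lwt_io.write h data)  (* the actual work *)
    (fun () -> Lwt_io.close h)       (* closing also flushes the buffer *)
```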
Issue 3
Don't use /dev/urandom or /dev/random to measure IO. For the former you will measure the performance of the random number generator; for the latter you will measure the flow of entropy in your machine. Both are quite slow. Depending on the speed of your CPU, you will rarely get more than 16 MB/s, which is much less than Lwt can push through. Reading from /dev/zero and writing to /dev/null will perform all transfers in memory and will show the actual speed that can be achieved by your program. A well-written program will still be bounded by the kernel speed. In the example program provided below, this shows an average speed of 700 MB/s.
Issue 4
Don't use buffered IO if you're really striving for performance; you will never get the maximum. For example, Lwt_io.read will first read into a buffer, then create a string and copy the data into that string. If you really need performance, you should provide your own buffering. In most cases there is no need for this, as Lwt_io is quite performant. But if you need to process dozens of megabytes per second, or need some special buffering policy (something non-linear), you may need to think about providing your own buffering. The good news is that Lwt_io allows you to do this. You can take a look at an example program that measures the performance of Lwt input/output. It mimics the well-known pv program.
Issue 5
You're expecting to get some performance by running threads in parallel. The problem is that in your test there is no room for concurrency. /dev/random (as well as /dev/zero) is a device that is bounded only by the CPU. It is the same as just calling a random function: it will always be available, so no system call will block on it. Writing to a regular file is also not a good place for concurrency. First of all, there is usually only one hard drive, with one write head in it. Even if a system call blocks and yields control to another thread, this results in performance degradation, as the two threads will now compete for the head position. If you have an SSD there will be no competition for the head, but performance will still be worse, as you will spoil your caches. Fortunately, writing to regular files usually doesn't block, so your threads will run consecutively, i.e., they will be serialized.
If you look at your files, you'll see they're each 4097 bytes – that's 4096 bytes read from /dev/urandom, plus one byte for the newline. You're hitting the buffer maximum in Lwt_io.read, so even though you say ~count:awholelot, it only gives you ~count:4096.
I don't know what the canonical Lwt way to do this is, but here's one alternative:
open Lwt
let stream_a_little source n =
  let left = ref n in
  Lwt_stream.from (fun () ->
    if !left <= 0 then return None
    else
      Lwt_io.read ~count:!left source >>= fun s ->
      left := !left - (String.length s);
      return (Some s))
let main () =
  Lwt_io.open_file ~buffer_size:(4096*8) ~mode:Lwt_io.Input "/dev/urandom" >>= fun data_source ->
  Lwt_unix.mkdir "serial" 0o777 >>= fun () ->
  Lwt_list.iter_p
    (fun count ->
       let count = string_of_int count in
       Lwt_io.open_file
         ~flags:[Unix.O_RDWR; Unix.O_CREAT]
         ~perm:0o777
         ~mode:Lwt_io.Output ("serial/file" ^ count ^ ".txt") >>= fun h ->
       Lwt_stream.iter_s (Lwt_io.write h)
         (stream_a_little data_source 52428800))
    [0;1;2;3;4;5;6;7;8;9]
let timeit f =
  let start = Unix.time () in
  f () >>= fun () ->
  let finished = Unix.time () in
  Lwt_io.printlf "Execution time took %f seconds" (finished -. start)

let () =
  Lwt_main.run (timeit main)
EDIT: Note that Lwt is a cooperative threading library; when you have two threads going "at the same time", they don't actually do work in your OCaml process at the same time. OCaml is (as of yet) single-core, so when one thread is running, the others wait nicely until that thread says "OK, I've done some work, you others go". So when you try to stream to 8 files at the same time, you're basically doling out a little randomness to file1, then a little to file2, … a little to file8, then (if there's still work left to do) a little to file1 again, then a little to file2, etc. This makes sense if you're waiting on lots of input anyway (say your input is coming over the network) and your main process has plenty of time to go through each thread and check "is there any input?", but when all your threads are just reading from /dev/random, it would be much faster to simply fill up the first file, then the second, etc. And assuming several CPUs could read /dev/(u)random in parallel (and your drive could keep up), it would of course be much faster to run ncpu reads at the same time, but then you need multicore (or just do this in a shell script).
EDIT 2: Showed how to increase the buffer size on the reader, which ups the speed a bit ;) Note that you can also simply set buffer_size as high as you want in your old example, which will read it all in one go, but you can't get more than buffer_size unless you read several times.

How to print memory addresses in OCaml?

Say I have a variable:
let a = ref 3 in magic_code
magic_code should print the address in memory that is stored in a. Is there something like that? I googled this but nothing came up...
This should work:
let a = ref 3 in
let address = 2*(Obj.magic a) in
Printf.printf "%d" address;;
OCaml distinguishes between heap pointers and integers using the least significant bit of a word, 0 for pointers and 1 for integers (see this chapter in Real World OCaml).
Obj.magic is a function of type 'a -> 'b that lets you bypass typing (i.e. arbitrarily "cast"). If you force OCaml to interpret the reference as an int by unsafely casting it via Obj.magic, the value you get is the address shifted right by one bit. To obtain the actual memory address, you need to shift it back left by 1 bit, i.e. double the value.
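Putting the two facts together, a hedged sketch of a small helper (address_of is an illustrative name, not a library function; this relies on Obj.magic, works only for heap-allocated values, and the compacting GC may move the block, so treat the result as a snapshot):

```ocaml
(* Unsafe: reinterpret a boxed value's machine word as an int
   (which yields address/2, because of the tag bit), then double
   it to recover the actual address. *)
let address_of (x : 'a) : int = 2 * (Obj.magic x : int)

let () =
  let a = ref 3 in
  Printf.printf "a is at 0x%x\n" (address_of a)
```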
Also see this answer.

Why is polymorphism so costly in Haskell (GHC)?

I am asking this question with reference to this SO question.
Accepted answer by Don Stewart: the first line says "Your code is highly polymorphic; change all float vars to Double ..." and it gives a 4X performance improvement.
I am interested in doing matrix computations in Haskell, should I make it a habit of writing highly monomorphic code?
But some languages make good use of ad-hoc polymorphism to generate fast code, so why won't or can't GHC? (Read: C++ or D.)
Why can't we have something like blitz++ or Eigen for Haskell? I don't understand how typeclasses and (ad-hoc) polymorphism in GHC work.
With polymorphic code, there is usually a tradeoff between code size and code speed. Either you produce a separate version of the same code for each type that it will operate on, which results in larger code, or you produce a single version that can operate on multiple types, which will be slower.
C++ implementations of templates choose in favor of increasing code speed at the cost of increasing code size. By default, GHC takes the opposite tradeoff. However, it is possible to get GHC to produce separate versions for different types using the SPECIALIZE and INLINABLE pragmas. This will result in polymorphic code that has speed similar to monomorphic code.
I want to supplement Dirk's answer by saying that INLINABLE is usually recommended over SPECIALIZE. An INLINABLE annotation on a function guarantees that the module exports the original source code of the function so that it can be specialized at the point of usage. This usually removes the need to provide separate SPECIALIZE pragmas for every single use case.
Unlike INLINE, INLINABLE does not change GHC's optimization heuristics. It just says "Please export the source code".
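As a concrete sketch of the two pragmas (the module name is illustrative): INLINABLE ships the function's source in the interface file so call sites can specialize it, while SPECIALIZE additionally forces a monomorphic Double copy in this module:

```haskell
module Linear (linear) where

-- Export the source so callers can specialize at their use sites.
{-# INLINABLE linear #-}
linear :: Num x => x -> x -> x -> x
linear a b x = a * x + b

-- Additionally force a dedicated Double version right here.
{-# SPECIALIZE linear :: Double -> Double -> Double -> Double #-}
```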
I don't understand how typeclasses work in GHC.
OK, consider this function:
linear :: Num x => x -> x -> x -> x
linear a b x = a*x + b
This takes three numbers as input, and returns a number as output. This function accepts any number type; it is polymorphic. How does GHC implement that? Well, essentially the compiler creates a "class dictionary" which holds all the class methods inside it (in this case, +, -, *, etc.) This dictionary becomes an extra, hidden argument to the function. Something like this:
data NumDict x =
  NumDict
  {
    method_add :: x -> x -> x,
    method_subtract :: x -> x -> x,
    method_multiply :: x -> x -> x,
    ...
  }
linear :: NumDict x -> x -> x -> x -> x
linear dict a b x = method_add dict (method_multiply dict a x) b
Whenever you call the function, the compiler automatically inserts the correct dictionary - unless the calling function is also polymorphic, in which case it will have received a dictionary itself, so just pass that along.
In truth, functions that lack polymorphism are typically faster not so much because of the lack of function look-ups, but because knowing the types allows additional optimisations to be done. For example, our polymorphic linear function will work on numbers, vectors, matrices, ratios, complex numbers, anything. Now, if the compiler knows that we want to use it on, say, Double, all the operations become single machine-code instructions, all the operands can be passed in processor registers, and so on. All of which results in fantastically efficient code. Even if it's complex numbers with Double components, we can make it nice and efficient. If we have no idea what type we'll get, we can't do any of those optimisations... That's where most of the speed difference typically comes from.
For a tiny function like linear, it's highly likely it will be inlined every time it's called, resulting in no polymorphism overhead and a small amount of code duplication - rather like a C++ template. For a larger, more complex polymorphic function, there may be some cost. In general, the compiler decides this, not you - unless you want to start sprinkling pragmas around the place. ;-) Or, if you don't actually use any polymorphism, you can just give everything monomorphic type signatures...

C++ using boolean evaluations for array positions (jump table)

I have a C++ if statement which looks like this (pseudo-code; all variables are ints):
if (x < y) {
    c += d;
}
else {
    c += f;
}
and I am thinking of trying to remove the IF statement and instead, load the values d and f into a two-element array:
array[0] = d
array[1] = f
and then I would like to be able to refer to array element 0 or 1 based on the underlying value of the boolean (at least in C, 0 or 1). Is there any way to do this? My code would then change to something like:
c += array[(x < y)]
If this evaluates to true, c increments by f; otherwise, if it's false, c increments by d.
Can I do this, using the boolean result to look up the array index?
Of course you can do it. However, chances are that you are only going to make things worse. If you think that you are removing a branch in this case, you are mistaken. Assuming a production-quality compiler and the x86_64 architecture, your first version will compile to a nice conditional move (e.g. cmovge). The second version, however, will result in an extra level of indirection and a memory read (e.g. mov eax, DWORD PTR [rax*4+0x4005d0]).
If you accept suggestions, I have a very bad feeling that you are on a very, very wrong path right now. When you are optimizing your program, you first have to measure/profile to find a bottleneck. Only when you know what the bottlenecks are can you start optimizing them. While optimizing, you have to measure/profile again to see whether there is any improvement. What you seem to be doing is not trusting your compiler, guessing, and pseudo-optimizing. I recommend you stop right there, or else it will go downhill from here, trust me.
You could replace the if statement with the following if you want more compact code.
c += (x < y) ? d : f;
Yes, that will work. Although it will make your code harder to understand, and modern compilers will eliminate the if statement anyway (when translating to assembly).