Efficiently copying a string to an mmap'ed location - OCaml

I'd like to copy a string from an OCaml program to an mmap'ed memory region (obtained via Genarray.map_file) as efficiently as possible. My objective is to let other processes read from this shared memory while the OCaml process is running, with minimal overhead (I do not need full concurrency features; there is only one writer and one reader).
I tried copying character by character, as in the following snippet (where I copy the first 255 characters of the string s):
let fd = Unix.openfile "/tmp/foo" [Unix.O_RDWR; Unix.O_CREAT] 0o600 in
let mmap =
  Bigarray.Genarray.map_file fd Bigarray.Char Bigarray.C_layout true
    (Array.of_list [256])
in
let n = min (String.length s - 1) 255 in
for i = 0 to n do
  Bigarray.Genarray.set mmap [|i|] (String.get s i)
done;
Bigarray.Genarray.set mmap [|n|] (Char.chr 0)
But this is very inefficient: even with a relatively small input, it already takes three times longer to execute than without mmap.
Is there a better way to do it? Ideally I'd like to avoid too many dependencies, e.g. Jane Street's core.

The Bigarray module only provides blit operations between bigarrays. If the extra overhead really is due to the behavior of mmapped memory, you might try copying from the string into an ordinary bigarray first, then blitting from that bigarray into the mmapped target. (To read from the mmapped array, you could do the inverse.)
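A minimal sketch of that staging-buffer idea (assuming the mmapped genarray from the question is one-dimensional and in scope as mmap; copy_to_mmap is just an illustrative name) could look like this:

(* Copy [s] into an ordinary in-memory bigarray character by character,
   then move it into the mmapped region with a single blit. *)
let copy_to_mmap mmap s =
  let n = min (String.length s) 255 in
  let staging = Bigarray.Array1.create Bigarray.Char Bigarray.C_layout (n + 1) in
  for i = 0 to n - 1 do
    Bigarray.Array1.set staging i s.[i]
  done;
  Bigarray.Array1.set staging n '\000';  (* terminating NUL, as in the question *)
  (* View the first n+1 cells of the mmapped genarray as a 1-D slice and blit once. *)
  let dst = Bigarray.Array1.sub (Bigarray.array1_of_genarray mmap) 0 (n + 1) in
  Bigarray.Array1.blit staging dst

Whether this actually helps depends on whether the slowdown really comes from touching the mmapped pages one character at a time; it is worth measuring both variants.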
The only other thing I can think of is to code up the transfers in C (using memcpy()) and call them as external functions.

Related

How to decompress less than the original size with the LZ4 library?

I'm using LZ4 library and when decompressing data with:
int LZ4_decompress_fast_continue (void* LZ4_streamDecode, const char* source, char* dest, int originalSize);
I need only the first n bytes of the originally encoded N bytes, where n < N. So in order to improve the performance, it makes sense to decompress only a part of the original buffer.
I wonder if I can pass n instead of N to the originalSize argument of the function?
My initial test showed that it's not possible (I got incorrectly decompressed data). Though maybe there is a way, for example if n is a multiple of some CHUNK_SIZE? All original N bytes were compressed with one call of a compress function.
LZ4_decompress_safe_continue() and LZ4_decompress_fast_continue() can only decode full blocks. They consider a partial block as an error, and report it as such. They also consider that if there is not enough room to decompress a full block, it's also an error.
The functionality you are looking for doesn't exist yet. But there is a close cousin that might help.
LZ4_decompress_safe_partial() can decode a part of a block.
Note that, in contrast with _continue() variants, it only works on independent blocks.
Note also that the compressed block must nonetheless be complete, and the output buffer must nonetheless have enough space to decode the entire block. So the only advantage provided by this function is speed: if you want only the first 10 bytes, it will stop as soon as it has generated enough bytes.
"as soon as" doesn't mean "exactly at 10". It could be much later, and in the worst case, it could be after decoding the entire block. That's because the internal decoding engine is still the same : it decodes entire sequences, and doesn't "break them" in the middle, for speed considerations.
If you need to extract fewer bytes than a full block in order to save some memory, I'm afraid there is no solution yet. Report it as a feature request upstream.
This seems to have been implemented in lz4 1.8.3.

Efficient way to read multiple lines of a file at one time?

I am now trying to handle a large file (several GB), so I am considering using multiple threads. The file contains multiple lines of data like:
data1 attr1.1 attr1.2 attr1.3
data2 attr2.1 attr2.2 attr2.3
data3 attr3.1 attr3.2 attr3.3
I am thinking of using one thread to read multiple lines into buffer1 first, and another thread to handle the data in buffer1 line by line while the reading thread reads the file into buffer2. The handling thread then continues with buffer2 when it is ready, and the reading thread reads into buffer1 again.
I have finished the handler part using fread for a small file (several KB), but I am not sure how to make the buffer contain complete lines instead of splitting part of a line at the end of the buffer, like this:
data1 attr1.1 attr1.2 attr1.3
data2 attr2.1 att
Also, I find that fgets or ifstream's getline can read a file line by line, but would that be very costly since it involves many I/O operations?
I am struggling to figure out the best way to do this. Is there an efficient way to read multiple lines at one time? Any advice is appreciated.
C stdio and C++ iostream functions use buffered I/O. Small reads only have function-call and locking overhead, not read(2) system call overhead.
Without knowing the line length ahead of time, fgets has to either use a buffer or read one byte at a time. Luckily, the C/C++ I/O semantics allow it to use buffering, so every mainstream implementation does. (According to the docs, mixing stdio and I/O on the underlying file descriptors gives undefined results. This is what allows buffering.)
You're right that it would be a problem if every fgets required a system call.
You might find it useful for one thread to read lines and put the lines into some kind of data structure that's useful for the processing thread.
If you don't have to do much processing on each line, doing the I/O in the same thread as the processing will keep everything in the L1 cache of that CPU, though. Otherwise data will end up in L1 of the I/O thread, and then have to make it to L1 of the core running the processing thread.
Depending on what you want to do with your data, you can minimize copying by memory-mapping the file in-place. Or read with fread, or skip the stdio layer entirely and just use POSIX open / read, if you don't need your code to be as portable. Scanning a buffer for newlines might have less overhead than what the stdio functions do.
You can handle the leftover line at the end of the buffer by copying it to the front of the buffer, and calling the next fread with a reduced buffer size. (Or, make your buffer ~1k bigger than the size of your fread calls, so you can always read multiples of the memory and filesystem page size (typically 4kiB), unless the trailing part of the line is > 1k.)
Or use a circular buffer, but reading from a circular buffer means checking for wraparound every time you touch it.
It all depends on what you want to do with the data afterwards: do you need to keep a copy of the lines? Do you intend to process the input as std::strings? etc.
Here are some general remarks that could help you further:
istream::getline() and fgets() are buffered operations. So I/O is already reduced and you can assume the performance is already reasonable.
std::getline() is also buffered. Nevertheless, if you don't need to process std::strings, the function will cost you a considerable number of memory allocations/deallocations, which might impact performance.
Block operations like read() or fread() can achieve economies of scale if you can afford large buffers. This can be especially efficient if you use the data in a throw-away fashion (because you can avoid copying the data and work directly in the buffer), but at the cost of extra complexity.
But all these considerations should not make you forget that performance is very much affected by the library implementation that you use.
I've done a little informal benchmark reading a million lines in the format you've shown:
* With MSVC2015 on my PC, read() is twice as fast as fgets(), and almost 4 times faster than std::string.
* With GCC on CodingGround, compiling with -O3, fgets() and both getline() variants are approximately the same, and read() is slower.
Here is the full code if you want to play around with it.
Here is the code that shows how to move the buffer around:
// Assumes the surrounding code defines: std::ifstream ifs (the open file),
// char buffer[szb], and the counters lines and sln.
int nr = 0;        // number of bytes kept in the buffer (carried over from the previous read)
bool last = false; // set on the last (possibly incomplete) read
while (!last)
{
    // here nr contains the number of bytes kept from the incomplete line
    last = !ifs.read(buffer + nr, szb - nr);
    nr = nr + ifs.gcount();
    char *s, *p = buffer, *pe = p + nr;
    do { // process complete lines in buffer
        for (s = p; p != pe && *p != '\n'; p++)
            ;
        if (p != pe || (p == pe && last)) {
            if (p != pe)
                *p++ = '\0';
            lines++;            // TO DO: here s is a null-terminated line to process
            sln += strlen(s);   // (dummy operation for the example)
        }
    } while (p != pe);          // until the end of the buffer is reached
    std::copy(s, pe, buffer);   // copy the last (incomplete) line to the beginning of the buffer
    nr = pe - s;                // and prepare the info for the next iteration
}

Why is this Lwt-based and seemingly concurrent code so inconsistent?

I am trying to create concurrent examples with Lwt and came up with this little sample:
open Lwt

let () =
  Lwt_main.run (
    let start = Unix.time () in
    Lwt_io.open_file ~mode:Lwt_io.Input "/dev/urandom" >>= fun data_source ->
    Lwt_unix.mkdir "serial" 0o777 >>= fun () ->
    Lwt_list.iter_p
      (fun count ->
         let count = string_of_int count in
         Lwt_io.open_file
           ~flags:[Unix.O_RDWR; Unix.O_CREAT]
           ~perm:0o777
           ~mode:Lwt_io.Output ("serial/file" ^ count ^ ".txt") >>= fun h ->
         Lwt_io.read ~count:52428800 data_source >>= Lwt_io.write_line h)
      [0;1;2;3;4;5;6;7;8;9] >>= fun () ->
    let finished = Unix.time () in
    Lwt_io.printlf "Execution time took %f seconds" (finished -. start))
EDIT: When asking for 50 GB, the question originally read:
"However this is incredibly slow and basically useless.
Does the inner bind need to be forced somehow?"
EDIT: I originally wrote this asking for 50 GB and it never finished; now I have a different problem when asking for 50 MB: the execution finishes nearly instantaneously and du -sh reports a directory size of only 80K.
EDIT: I have also tried the code with explicitly closing the file handles with the same bad result.
I am on the latest version of OS X and compile with:
ocamlfind ocamlopt -package lwt.unix main.ml -linkpkg -o Test
(I have also tried /dev/random, yes I'm using wall-clock time.)
So, your code has some issues.
Issue 1
The main issue is that you misunderstood the Lwt_io.read function (and nobody can blame you!).
val read : ?count : int -> input_channel -> string Lwt.t
(** [read ?count ic] reads at most [len] characters from [ic]. It
returns [""] if the end of input is reached. If [count] is not
specified, it reads all bytes until the end of input. *)
When ~count:len is specified, it will read at most len characters. "At most" means that it can read less. But if the count option is omitted, then it will read all data. I personally find this behavior unintuitive, if not weird. So this "at most" means up to len or less, i.e., no guarantee is provided that it will read exactly len bytes. And indeed, if you add a check to your program:
Lwt_io.read ~count:52428800 data_source >>= fun data ->
Lwt_io.printlf "Read %d bytes" (String.length data) >>= fun () ->
Lwt_io.write h data >>= fun () ->
You will see that it reads only 4096 bytes per try:
Read 4096 bytes
Read 4096 bytes
Read 4096 bytes
Read 4096 bytes
Read 4096 bytes
Read 4096 bytes
Read 4096 bytes
Read 4096 bytes
Read 4096 bytes
Read 4096 bytes
Why 4096? Because this is the default buffer size. But it actually doesn't matter.
Issue 2
The Lwt_io module implements buffered I/O. That means that your writes and reads don't go directly to a file, but are buffered in memory. That means you should remember to flush and close. Your code doesn't close descriptors when it finishes, so you can end up in a situation where some buffers are left unflushed after the program terminates. Lwt_io, in particular, flushes all buffers before program exit. But you shouldn't rely on this undocumented feature (it may bite you in the future when you try some other buffered I/O, like FILE streams from the standard C library). So always close your files (another problem is that file descriptors are a precious resource, and leaking them is very hard to track down).
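For example, a small helper based on Lwt.finalize guarantees the channel is flushed and closed even if the body fails (with_output_file is a made-up name for this sketch, not part of Lwt):

open Lwt.Infix

(* Open [name] for output, pass the channel to [f], and always close it.
   Lwt_io.close flushes any buffered data before closing. *)
let with_output_file name f =
  Lwt_io.open_file ~mode:Lwt_io.Output name >>= fun h ->
  Lwt.finalize
    (fun () -> f h)
    (fun () -> Lwt_io.close h)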
Issue 3
Don't use /dev/urandom or /dev/random to measure I/O. For the former you will measure the performance of the random number generator; for the latter you will measure the flow of entropy in your machine. Both are quite slow. Depending on the speed of your CPU, you will rarely get more than 16 Mb/s, which is much less than Lwt can push through. Reading from /dev/zero and writing to /dev/null will perform all transfers in memory and will show the actual speed that can be achieved by your program. A well-written program will still be bounded by kernel speed. In the example program mentioned in Issue 4 below, this shows an average speed of 700 MB/s.
Issue 4
Don't use buffered I/O if you're really striving for performance. You will never get the maximum. For example, Lwt_io.read will first read into a buffer, then it will create a string and copy the data into that string. If you really need performance, then you should provide your own buffering. In most cases there is no need for this, as Lwt_io is quite performant. But if you need to process dozens of megabytes per second, or need some special buffering policy (something non-linear), you may need to think about providing your own buffering. The good news is that Lwt_io allows you to do this. You can take a look at an example program that measures the performance of Lwt input/output; it mimics the well-known pv program.
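As a rough illustration of managing your own buffer (this is not the linked example program, and read_exactly is a made-up name), you can read into a preallocated Bytes value with Lwt_io.read_into and avoid allocating a fresh string on every read:

open Lwt.Infix

(* Read up to [n] bytes from [ic] into [buf], retrying until either [n]
   bytes have arrived or the end of input is reached; returns the number
   of bytes actually read. [buf] must be at least [n] bytes long. *)
let read_exactly ic buf n =
  let rec loop pos =
    if pos >= n then Lwt.return pos
    else
      Lwt_io.read_into ic buf pos (n - pos) >>= function
      | 0 -> Lwt.return pos    (* end of input *)
      | k -> loop (pos + k)
  in
  loop 0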
Issue 5
You're expecting to gain some performance by running threads in parallel. The problem is that in your test there is no room for concurrency. /dev/random (as well as /dev/zero) is a single device that is bound only by the CPU. This is the same as just calling a random function. It will always be available, so no system call will block on it. Writing to a regular file is also not a good place for concurrency. First of all, there is usually only one hard drive, with one writing head in it. Even if a system call blocks and yields control to another thread, this will result in a performance degradation, as two threads will now compete for the head position. If you have an SSD, there will not be any competition for the head, but performance will still be worse, as you will spoil your caches. But fortunately, writing to regular files usually doesn't block. So your threads will run sequentially, i.e., they will be serialized.
If you look at your files, you'll see they're each 4097 bytes – that's the 4096 bytes that were read from /dev/urandom, plus one byte for the newline. You're hitting the buffer maximum in Lwt_io.read, so even though you say ~count:awholelot, it only gives you ~count:4096.
I don't know what the canonical Lwt way to do this is, but here's one alternative:
open Lwt

let stream_a_little source n =
  let left = ref n in
  Lwt_stream.from (fun () ->
    if !left <= 0 then return None
    else
      Lwt_io.read ~count:!left source >>= (fun s ->
        left := !left - (String.length s);
        return (Some s)))
let main () =
  Lwt_io.open_file ~buffer_size:(4096*8) ~mode:Lwt_io.Input "/dev/urandom" >>= fun data_source ->
  Lwt_unix.mkdir "serial" 0o777 >>= fun () ->
  Lwt_list.iter_p
    (fun count ->
       let count = string_of_int count in
       Lwt_io.open_file
         ~flags:[Unix.O_RDWR; Unix.O_CREAT]
         ~perm:0o777
         ~mode:Lwt_io.Output ("serial/file" ^ count ^ ".txt") >>= (fun h ->
           Lwt_stream.iter_s (Lwt_io.write h)
             (stream_a_little data_source 52428800)))
    [0;1;2;3;4;5;6;7;8;9]

let timeit f =
  let start = Unix.time () in
  f () >>= fun () ->
  let finished = Unix.time () in
  Lwt_io.printlf "Execution time took %f seconds" (finished -. start)

let () =
  Lwt_main.run (timeit main)
EDIT: Note that Lwt is a cooperative threading library; when you have two threads going "at the same time", they don't actually do stuff in your OCaml process at the same time. OCaml is (as of yet) single-core, so when one thread is moving, the others wait nicely until that thread says "OK, I've done some work, you others go". So when you try to stream to 8 files at the same time, you're basically doling out a little randomness to file1, then a little to file2, … a little to file8, then (if there's still work left to do) a little to file1, then a little to file2, etc. This makes sense if you're waiting on lots of input anyway (say your input is coming over the network) and your main process has a lot of time to go through each thread and check "is there any input?", but when all your threads are just reading from /dev/random, it'd be much faster to just fill up one file first, then the second, etc. (a sketch of that serialized variant follows below). And assuming it's possible for several CPUs to read /dev/(u)random in parallel (and your drive can keep up), it'd of course be much faster to run ncpu reads at the same time, but then you need multicore (or just do this in a shell script).
EDIT 2: I showed how to increase the buffer size on the reader, which ups the speed a bit ;) Note that you can also simply set buffer_size as high as you want in your old example, which will read it all in one go, but you can't get more than your buffer_size unless you read several times.
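To make the "one file at a time" remark from the first EDIT concrete, here is a minimal sketch of the serialized variant: it reuses stream_a_little (and the open Lwt) from above and only swaps Lwt_list.iter_p for Lwt_list.iter_s, closing each handle when done (main_serial is a made-up name):

(* Same as [main] above, but the files are written one after another. *)
let main_serial () =
  Lwt_io.open_file ~mode:Lwt_io.Input "/dev/urandom" >>= fun data_source ->
  Lwt_unix.mkdir "serial" 0o777 >>= fun () ->
  Lwt_list.iter_s                       (* iter_s instead of iter_p *)
    (fun count ->
       let name = "serial/file" ^ string_of_int count ^ ".txt" in
       Lwt_io.open_file ~flags:[Unix.O_RDWR; Unix.O_CREAT] ~perm:0o777
         ~mode:Lwt_io.Output name >>= fun h ->
       Lwt_stream.iter_s (Lwt_io.write h)
         (stream_a_little data_source 52428800) >>= fun () ->
       Lwt_io.close h)
    [0;1;2;3;4;5;6;7;8;9]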

Proper, efficient file reading

I'd like to read and process (e.g. print) entries from the first row of a CSV file one at a time. I assume Unix-style \n newlines, that no entry is longer than 255 chars and (for now) that there's a newline before EOF. This is meant to be a more efficient alternative to fgets() followed by strtok().
#include <stdio.h>
#include <string.h>

int main() {
    int i;
    char ch, buf[256];
    FILE *fp = fopen("test.csv", "r");

    for (;;) {
        for (i = 0; ; i++) {
            ch = fgetc(fp);
            if (ch == ',') {
                buf[i] = '\0';
                puts(buf);
                break;
            } else if (ch == '\n') {
                buf[i] = '\0';
                puts(buf);
                fclose(fp);
                return 0;
            } else buf[i] = ch;
        }
    }
}
Is this method as efficient and correct as possible?
What is the best way to test for EOF and file reading errors using this method? (Possibilities: testing against the character macro EOF, feof(), ferror(), etc.).
Can I perform the same task using C++ file I/O with no loss of efficiency?
What is most efficient is going to depend a lot on the operating system, standard libraries (e.g. libc), and even the hardware you are running on. This makes it nearly impossible to tell you what's "most efficient".
That having been said, there are a few things you could try:
Use mmap() or a local operating system equivalent (Windows has CreateFileMapping / OpenFileMapping / MapViewOfFile, and probably others). Then you don't do explicit file reads: you simply access the file as if it were already in memory, and anything that isn't there will be faulted in by the page fault mechanism.
Read the entire file into a buffer manually, then work on that buffer. The fewer times you call into file read functions, the fewer function-call overheads you take, and likely also fewer application/OS domain switches. Obviously this uses more memory, but may very well be worth it.
Use a more optimal string scanner for your problem and platform. Going character-by-character yourself is almost never as fast as relying on something existing that's close to your problem domain. For example, you can bet that strchr and memchr are probably better-optimized than most code you can roll yourself, doing things like reading entire cachelines or words at once, scanning using better algorithms for this kind of search, etc. For more complicated cases, you might consider a full regular expression engine that could compile your regex to something fast for your complicated case.
Avoid copying your string around. It may be helpful to think in terms of "find delimiters" and then "output between delimiters". You could for example use strchr to find the next character of interest, and then fwrite or something to write to stdout directly from your input buffer. Then you're keeping most of your work in a few local registers, rather than using a stack or heap buf.
When in doubt, though, try a few possibilities and profile, profile, profile.
Also for this kind of problem, be very aware of differences between runs that are caused by OS and hardware caches: profile a bunch of runs rather than just one after each change -- and if possible, use tests that will either likely always hit caches (if you're trying to measure best-case performance) or tests that will likely miss (if you're trying to measure worst-case performance).
Regarding C++ file IO (fstream and such), just be aware that they're larger, more complicated beasts. They tend to include things such as locale management, automatic buffering, and the like -- as well as being less prone to particular types of coding mistakes.
If you're doing something pretty simple (like what you describe here), I tend to find C++ library stuff gets in the way. (Use a debugger and "step instruction" through a stringstream method versus some C string functions some time, you'll get a good feel for this quickly.)
It all depends on whether you're going to want or need that additional functionality or safety in the future.
Finally, the obligatory "don't sweat the small stuff". Only spend time on optimizing here if it's really important. Otherwise trust the libraries and OS to do the right thing for you most of the time -- if you get too far into micro-optimizations you'll find you're shooting yourself in the foot later. This is not to discourage you from thinking in terms of "should I read the whole file in ahead of time, will that break future use cases" -- because that's macro, rather than micro.
But generally speaking if you're not doing this kind of "make it faster" investigation for a good reason -- i.e. "need this app to perform better now that I've written it, and this code shows up as slow in profiler", or "doing this for fun so I can better understand the system" -- well, spend your time elsewhere first. =)
One method, provided you are going to scan through the file serially, is to use 2 buffers of a decent enough size (16K is the optimal size for SSDs and 4K for HDDs, IIRC, but 16K should suffice for both). You start off by performing an asynchronous load (in Windows look up overlapped I/O; on Unix/OSX use O_NONBLOCK) of the first 16K into buffer 0, and then start another load of bytes 16K to 32K into buffer 1. When your read position hits 16K, swap the buffers (so you are now reading from buffer 1 instead), wait for any remaining loads into buffer 1 to complete, and perform an asynchronous load of bytes 32K to 48K into buffer 0, and so on. This way you have far less chance of ever having to wait for a load to complete, as it should be happening while you are processing the previous 16K.
I moved over to a scheme like this in my XML parser, having previously used fopen and fgetc, and the speedup was huge. Loading and processing a 15 MB XML file went from minutes to seconds. Of course, your mileage may vary.
Use fgets to read one line at a time. C++ file I/O is basically wrapper code with some compiler optimization tucked inside (and much unwanted functionality). Unless you are reading millions of lines and measuring the time, it does not matter.

Profiling memory usage in Mathematica

Is there any way to profile the MathKernel memory usage (down to individual variables) other than paying $$$ for their Eclipse plugin (Mathematica Workbench, IIRC)?
Right now I finish execution of a program that takes multiple GBs of RAM, but the only things that should be stored are ~50 MB of data at most, yet MathKernel.exe tends to hold onto ~1.5 GB (basically, as much as Windows will give it). Is there any better way to get around this, other than saving the data I need and quitting the kernel every time?
EDIT: I've just learned of the ByteCount function (which shows some disturbing results on basic datatypes, but that's beside the point), but even the sum over all my variables is nowhere near the amount taken by MathKernel. What gives?
One thing a lot of users don't realize is that it takes memory to store all your inputs and outputs in the In and Out symbols, regardless of whether or not you assign an output to a variable. Out is also aliased as %, where % is the previous output, %% is the second-to-last, etc. %123 is equivalent to Out[123].
If you don't have a habit of using %, or only use it to a few levels deep, set $HistoryLength to 0 or a small positive integer, to keep only the last few (or no) outputs around in Out.
You might also want to look at the functions MaxMemoryUsed and MemoryInUse.
Of course, the $HistoryLength issue may or may not be your problem, but you haven't shared what your actual evaluation is.
If you're able to post it, perhaps someone will be able to shed more light on why it's so memory-intensive.
Here is my solution for profiling memory usage:
myByteCount[symbolName_String] :=
  Replace[ToHeldExpression[symbolName],
   Hold[x__] :>
    If[MemberQ[Attributes[x], Protected | ReadProtected],
     Sequence @@ {}, {ByteCount[
       Through[{OwnValues, DownValues, UpValues, SubValues,
          DefaultValues, FormatValues, NValues}[Unevaluated@x,
         Sort -> False]]], symbolName}]];
With[{listing = myByteCount /@ Names[]},
 Labeled[Grid[Reverse@Take[Sort[listing], -100], Frame -> True,
   Alignment -> Left],
  Column[{Style[
     "ByteCount for symbols without attributes Protected and \
ReadProtected in all contexts", 16, FontFamily -> "Times"],
    Style[Row@{"Total: ", Total[listing[[All, 1]]], " bytes for ",
       Length[listing], " symbols"}, Bold]}, Center, 1.5], Top]]
Evaluating the above gives a table of the 100 symbols with the largest ByteCount.
Michael Pilat's answer is a good one, and MemoryInUse and MaxMemoryUsed are probably the best tools you have. ByteCount is rarely all that helpful because what it measures can be a huge overestimate: it ignores shared subexpressions, and it often misses memory that isn't directly accessible through Mathematica functions, which is often a major component of memory usage.
One thing you can do in some circumstances is use the Share function, which forces subexpressions to be shared when possible. In some circumstances this can save you tens or even hundreds of megabytes. You can tell how well it's working by using MemoryInUse before and after you use Share.
Also, some innocuous-seeming things can cause Mathematica to use a whole lot more memory than you expect. Contiguous arrays of machine reals (and only machine reals) can be allocated as so-called "packed" arrays, much the way they would be allocated by C or Fortran. However, if you have a mix of machine reals and other structures (including symbols) in an array, everything has to be "boxed", and the array becomes an array of pointers, which can add a lot of overhead.
One way is to automate restarting the kernel when it runs out of memory. You can execute your memory-consuming code in a slave kernel, while the master kernel only takes the result of the computation and controls memory usage.