I was inspired by the fst package to try to write a C++ function to quickly serialize some data structures I have in R to disk.
But I am having trouble achieving the same write speed even on very simple objects. The code below is a simple example of writing a large 1 GB vector to disk.
Using custom C++ code, I achieve a write speed of 135 MB/s, which is the limit of my disk according to CrystalBench.
On the same data, write_fst achieves a write speed of 223 MB/s, which seems impossible since my disk can't write that fast. (Note, I am using fst::threads_fst(1) and compress=0 settings, and the files have the same data size.)
What am I missing?
How can I get the C++ function to write to disk faster?
C++ Code:
#include <Rcpp.h>
#include <fstream>
#include <cstring>
#include <iostream>
// [[Rcpp::plugins(cpp11)]]
using namespace Rcpp;
// [[Rcpp::export]]
void test(SEXP x) {
char* d = reinterpret_cast<char*>(REAL(x));
long dl = Rf_xlength(x) * 8;
std::ofstream OutFile;
OutFile.open("/tmp/test.raw", std::ios::out | std::ios::binary);
OutFile.write(d, dl);
OutFile.close();
}
R Code:
library(microbenchmark)
library(Rcpp)
library(dplyr)
library(fst)
fst::threads_fst(1)
sourceCpp("test.cpp")
x <- runif(134217728) # 1 gigabyte
df <- data.frame(x)
microbenchmark(test(x), write_fst(df, "/tmp/test.fst", compress=0), times=3)
Unit: seconds
expr min lq mean median uq max neval
test(x) 6.549581 7.262408 7.559021 7.975235 8.063740 8.152246 3
write_fst(df, "/tmp/test.fst", compress = 0) 4.548579 4.570346 4.592398 4.592114 4.614307 4.636501 3
file.info("/tmp/test.fst")$size/1e6
# [1] 1073.742
file.info("/tmp/test.raw")$size/1e6
# [1] 1073.742
Benchmarking SSD write and read performance is a tricky business and hard to do right. There are many effects to take into account.
For example, many SSD's use techniques to accelerate data speeds (intelligently), such as DRAM caching. Those techniques can increase your write speed, especially in cases where an identical dataset is written to disk multiple times, as in your example. To avoid this effect, each iteration of the benchmark should write a unique dataset to disk.
The block sizes of write and read operations are also important: the default physical sector size of SSD's is 4KB. Writing smaller blocks hampers performance, but with fst I found that writing blocks of data larger than a few MB's also lowers performance, due to CPU cache effects. Because fst writes it's data to disk in relatively small chunks, it's usually faster than alternatives that write data in a single large block.
To facilitate this block-wise writing to SSD, you could modify your code:
Rcpp::cppFunction('
#include <fstream>
#include <cstring>
#include <iostream>
#define BLOCKSIZE 262144 // 2^18 bytes per block
long test_blocks(SEXP x, Rcpp::String path) {
char* d = reinterpret_cast<char*>(REAL(x));
std::ofstream outfile;
outfile.open(path.get_cstring(), std::ios::out | std::ios::binary);
long dl = Rf_xlength(x) * 8;
long nr_of_blocks = dl / BLOCKSIZE;
for (long block_nr = 0; block_nr < nr_of_blocks; block_nr++) {
outfile.write(&d[block_nr * BLOCKSIZE], BLOCKSIZE);
}
long remaining_bytes = dl % BLOCKSIZE;
outfile.write(&d[nr_of_blocks * BLOCKSIZE], remaining_bytes);
outfile.close();
return dl;
}
')
Now we can compare methods test, test_blocks and fst::write_fst in a single benchmark:
x <- runif(134217728) # 1 gigabyte
df <- data.frame(X = x)
fst::threads_fst(1) # use fst in single threaded mode
microbenchmark::microbenchmark(
test(x, "test.bin"),
test_blocks(x, "test.bin"),
fst::write_fst(df, "test.fst", compress = 0),
times = 10)
#> Unit: seconds
#> expr min lq mean
#> test(x, "test.bin") 1.473615 1.506019 1.590430
#> test_blocks(x, "test.bin") 1.018082 1.062673 1.134956
#> fst::write_fst(df, "test.fst", compress = 0) 1.127446 1.144039 1.249864
#> median uq max neval
#> 1.600055 1.635883 1.765512 10
#> 1.131631 1.204373 1.264220 10
#> 1.261269 1.327304 1.343248 10
As you can see, the modified method test_blocks is about 40 percent faster than the original method and even slightly faster than the fst package. This is expected, because fst has some overhead in storing column and table information, (possible) attributes, hashes and compression information.
Please note that the difference between fst and your initial test method is much less pronounced on my system, showing again the challenges in using benchmarks to optimize a system.
Related
I am trying to benchmark the potential performance increase of emplace_back vs. push_back using Google Benchmark:
#include <benchmark/benchmark.h>
#include <vector>
#include <string>
using namespace std;
static void BM_pushback(benchmark::State& state) {
for(auto _ : state) {
vector<string> data;
for(int i=0;i<10000000;++i)
data.push_back("A long string to avoid sbo");
}
}
static void BM_emplaceback(benchmark::State& state) {
for(auto _ : state) {
vector<string> data;
for(int i=0;i<10000000;++i)
data.emplace_back("A long string to avoid sbo");
}
}
BENCHMARK(BM_pushback)->UseRealTime();
BENCHMARK(BM_emplaceback)->UseRealTime();
BENCHMARK_MAIN();
Using g++ with -O3 I get:
Running ./benchmark
Run on (8 X 4200 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x4)
L1 Instruction 32 KiB (x4)
L2 Unified 256 KiB (x4)
L3 Unified 8192 KiB (x1)
Load Average: 0.60, 0.48, 0.53
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-------------------------------------------------------------------
Benchmark Time CPU Iterations
-------------------------------------------------------------------
BM_pushback/real_time 886383300 ns 886276679 ns 1
BM_emplaceback/real_time 698194513 ns 698159138 ns 1
If I, however, change the order of execution, that is
BENCHMARK(BM_emplaceback)->UseRealTime();
BENCHMARK(BM_pushback)->UseRealTime();
I receive opposite timings:
-------------------------------------------------------------------
Benchmark Time CPU Iterations
-------------------------------------------------------------------
BM_emplaceback/real_time 863837627 ns 863814885 ns 1
BM_pushback/real_time 765406073 ns 765407578 ns 1
Can someone explain this behavior? I think, this might have something to do with caching but what exactly is going on here? Just as a note: What seems to help is to reduce the benchmarking size, i.e. adding only 10000 instead of 10000000 strings to my vector (which leads to more cycles being performed by the library).
I have this minimal example of Google Benchmark usage.
The weird thing is that "42" is printed a number of times (4), not just once.
I understand that the library has to run things several times to gain statistics but this I though that this is handled by the statie-loop itself.
This is a minimal example of something more complicated where I wanted to print (outside the loop) the result to verify that different implementations of the same function would give the same result.
#include <benchmark/benchmark.h>
#include<iostream>
#include <thread> //sleep for
int SomeFunction(){
using namespace std::chrono_literals;
std::this_thread::sleep_for(10ms);
return 42;
}
static void BM_SomeFunction(benchmark::State& state) {
// Perform setup here
int result = -1;
for (auto _ : state) {
// This code gets timed
result = SomeFunction();
benchmark::DoNotOptimize(result);
}
std::cout<< result <<std::endl;
}
// Register the function as a benchmark
BENCHMARK(BM_SomeFunction);
// Run the benchmark
BENCHMARK_MAIN();
output: (42 is printed 4 times, why more than once, why 4?)
Running ./a.out
Run on (12 X 4600 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x6)
L1 Instruction 32 KiB (x6)
L2 Unified 256 KiB (x6)
L3 Unified 12288 KiB (x1)
Load Average: 0.30, 0.65, 0.79
42
42
42
42
----------------------------------------------------------
Benchmark Time CPU Iterations
----------------------------------------------------------
BM_SomeFunction 10243011 ns 11051 ns 1000
How else could I test (at least visually) that different benchmarking blocks give the same answer?
The problem is not of practical nature and I'm only looking for a sound explanation of the observed occurence. I'm reading Seamless R and C++ Integration with Rcpp (Use R!) by Dirk Eddelbuettel. Following the introduction, I'm looking at two simple "Fibonacci functions".
In RStudio I have a cpp file of the following structure
fib_fun.cpp
#include <Rcpp.h>
// [[Rcpp::export]]
int fibonacci(const int x) {
if (x < 2)
return x;
else
return (fibonacci(x -1)) + fibonacci(x-2);
}
/*** R
# Call the fib function defined in R
fibonacci(10)
*/
I also have an inline implementation of the same function:
inline_fib.R
# Inline fib implementation
incltxt <- "int fibonacci(const int x) {
if (x == 0) return(0);
if (x == 1) return(1);
return fibonacci(x - 1) + fibonacci(x - 2);
}"
# Inline call
require(inline)
fibRcpp <- cxxfunction(signature(xs = "int"), plugin = "Rcpp",
includes = incltxt,
body = "int x = Rcpp::as<int>(xs);
return Rcpp::wrap(fibonacci(x));")
When I benchmark the functions I get the following results:
> microbenchmark(fibonacci(10), fibRcpp(10), times = 10)
Unit: microseconds
expr min lq mean median uq max neval
fibonacci(10) 3.121 3.198 5.5192 3.447 3.886 23.491 10
fibRcpp(10) 1.176 1.398 3.9520 1.558 1.709 25.721 10
Questions
I would like to understand why there is a significant difference in performance between the two functions?
With respect to the practicalities surrounding the use of Rcpp, what generally considered to be a good practice? In my naivety, my first hunch would be to write a function and source it via sourceCpp but this solutions appears to be much slower.
Benchmarking code
require(microbenchmark); require(Rcpp); require(inline)
sourceCpp("fib_fun.cpp"); source("inline_fib.R")
microbenchmark(fibonacci(10), fibRcpp(10), times = 10)
Comment replies
I tried the functions with the unsigned int instead of the int, results:
Unit: microseconds
expr min lq mean median uq max neval
fibonacci(10) 2.908 2.992 5.0369 3.267 3.598 20.291 10
fibRcpp(10) 1.201 1.263 6.3523 1.424 1.639 50.536 10
All good comments above.
The function is way too lightweight at x=10 and you need to call way more often than times=10 to find anything meaningful. You are measuring noise.
As for style, most of us prefer fibonacci() via Rcpp Attributes...
Suppose, I've a data.frame as follows:
set.seed(45)
DF <- data.frame(x=1:10, strata2013=sample(letters[1:3], 10, TRUE))
x strata2013
1 1 b
2 2 a
3 3 a
4 4 b
5 5 b
6 6 a
7 7 a
8 8 b
9 9 a
10 10 a
And I'd like to get the counts for each unique value in the column strata2013, then, using data.table (for speed), one could do it in this manner:
DT <- as.data.table(DF)
DT[, .N, by=strata2013]
strata2013 N
1: b 4
2: a 6
Now, I'd like to try and accomplish this in Rcpp, as a learning exercise. I've written and tried out the code shown below which is supposed to provide the same output, but instead it gives me an error. Here's the code:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector LengthStrata (CharacterVector uniqueStrata, DataFrame dataset ) {
int n = uniqueStrata.size();
NumericVector Nh(n);
Rcpp::CharacterVector strata=dataset["strate2013"];
for (int i = 0; i < n; ++i) {
Nh[i]=strata(uniqueStrata(i)).size();
}
return Nh;
}
Here is the error message:
conversion from 'Rcpp::Vector<16>::Proxy {aka Rcpp::internal::string_proxy<16>}'
to 'const size_t { aka const long long unsigned int}' is ambiguous
What am I doing wrong? Thank you very much for your help.
If I understand correctly, you're hoping that strata( uniqueStrata(i) ) will subset the vector, similar to how R's subsetting operates. This is unfortunately not the case; you would have to perform the subsetting 'by hand'. Rcpp doesn't have 'generic' subsetting operates available yet.
When it comes to using Rcpp, you really want to leverage the C++ standard library where possible. The de-facto C++ way of generating these counts would be to use a std::map (or std::unordered_map, if you can assume C++11), with something like the following. I include a benchmark for interest.
Note from Dirk: unordered_map is actually available from tr1 for pre-C++11, so one can include it using e.g. #include <tr1/unordered_map>
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
IntegerVector LengthStrata( DataFrame dataset ) {
Rcpp::CharacterVector strata = dataset["strata2013"];
int n = strata.size();
std::map<SEXP, int> counts;
for (int i = 0; i < n; ++i) {
++counts[ strata[i] ];
}
return wrap(counts);
}
/*** R
library(data.table)
library(microbenchmark)
set.seed(45)
DF <- data.frame(strata2013=sample(letters, 1E5, TRUE))
DT <- data.table(DF)
LengthStrata(DF)
DT[, .N, by=strata2013]
microbenchmark(
LengthStrata(DF),
DT[, .N, by=strata2013]
)
*/
gives me
Unit: milliseconds
expr min lq median uq max neval
LengthStrata(DF) 3.267131 3.831563 3.934992 4.101050 11.491939 100
DT[, .N, by = strata2013] 1.980896 2.360590 2.480884 2.687771 3.052583 100
The Rcpp solution is slower in this case likely due to the time it takes to move R objects to and from the C++ containers, but hopefully this is instructive.
Aside: This is, in fact, already included in Rcpp as the sugar table function, so if you want to skip the learning experience, you can use a pre-baked solution as
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
IntegerVector LengthStrata( DataFrame dataset ) {
Rcpp::CharacterVector strata = dataset["strata2013"];
return table(strata);
}
Sugar improves the speed of the Rcpp function:
Unit: milliseconds
expr min lq median uq max neval
LengthStrata(DF) 5.548094 5.870184 6.014002 6.448235 6.922062 100
DT[, .N, by = strate2013] 6.526993 7.136290 7.462661 7.949543 81.233216 100
I am not sure I understand what you are trying to do. And when strata is a vector
Rcpp::CharacterVector strata=df["strate2013"];
then I am not sure what
strata(uniqueStrata(i)).size()
is supposed to do. Maybe you could describe in words (or in R with some example code and data) what you are trying to do here.
I have this line of R code:
croppedDNA <- completeDNA[,apply(completeDNA,2,function(x) any(c(FALSE,x[-length(x)]!=x[-1])))]
What it does is identify the sites (cols) in a matrix of DNA sequences (1 row = one seq) that are not universal (informative) and subsets them from the matrix to make a new 'cropped matrix' i.e. get rid of all the columns in which values are the same. For a big dataset this takes about 6 seconds. I don't know if I can do it faster in C++ (still a beginner in C++) but it will be good for me to try. My idea is to use Rcpp, loop through the columns of the CharacterMatrix, pull out the column (the site) as a CharacterVector check if they are the same. If they are the same, record that column number/index, continue for all columns. Then at the end make a new CharacterMatrix that only includes those columns. It is important that I keep the rownames and column names as they are in th "R version" of the matrix i.e. if a column goes, so should the colname.
I've been writing for about two minutes, so far what I have is (not finished):
#include <Rcpp.h>
#include <vector>
using namespace Rcpp;
// [[Rcpp::export]]
CharacterMatrix reduce_sequences(CharacterMatrix completeDNA)
{
std::vector<bool> informativeSites;
for(int i = 0; i < completeDNA.ncol(); i++)
{
CharacterVector bpsite = completeDNA(,i);
if(all(bpsite == bpsite[1])
{
informativeSites.push_back(i);
}
}
CharacterMatrix cutDNA = completeDNA(,informativeSites);
return cutDNA;
}
Am I going the right way about this? Is there an easier way. My understanding is I need std::vector because it's easy to grow them (since I don't know in advance how many cols I am going to want to keep). With the indexing will I need to +1 to the informativeSites vector at the end (because R indexes from 1 and C++ from 0)?
Thanks,
Ben W.
Sample data:
set.seed(123)
z <- matrix(sample(c("a", "t", "c", "g", "N", "-"), 3*398508, TRUE), 3, 398508)
OP's solution:
system.time(y1 <- z[,apply(z,2,function(x) any(c(FALSE,x[-length(x)]!=x[-1])))])
# user system elapsed
# 4.929 0.043 4.976
A faster version using base R:
system.time(y2 <- (z[, colSums(z[-1,] != z[-nrow(z), ]) > 0]))
# user system elapsed
# 0.087 0.011 0.098
The results are identical:
identical(y1, y2)
# [1] TRUE
It's very possible c++ will beat it, but is it really necessary?