I was wondering whether there is an Rcpp function that takes an Rcpp::String as input and returns the character at a given index of the string, for example the character at index 0. This would be equivalent to the std::string::at method from the <string> header in C++. I have written this:
#include <vector>
#include <string>
#include <Rcpp.h>
using namespace Rcpp;
typedef std::vector<std::string> stringList;
int SplitGenotypesA(std::string s) {
char a = s.at(0);
int b = a - '0';
return b;
}
But I would prefer not to have to convert between the Rcpp::String and std::string types.
You can feed an R vector of strings directly to C++ using Rcpp::StringVector. This will obviously handle single elements too.
Getting the nth character of the ith element of your vector is as simple as vector[i][n].
So, without using std::string you can do this:
#include<Rcpp.h>
// [[Rcpp::export]]
Rcpp::NumericVector SplitGenotypesA(Rcpp::StringVector R_character_vector)
{
int number_of_strings = R_character_vector.size();
Rcpp::NumericVector result(number_of_strings);
for(int i = 0; i < number_of_strings; ++i)
{
char a = R_character_vector[i][0];
result[i] = a - '0';
}
return result;
}
Now in R you can do:
SplitGenotypesA("9C")
# [1] 9
or better yet,
SplitGenotypesA(c("1A", "2B", "9C"))
# [1] 1 2 9
This is even a little faster than the native R way of doing the same thing:
microbenchmark::microbenchmark(
R_method = as.numeric(substr(c("1A", "2B", "9C"), 1, 1)),
Rcpp_method = SplitGenotypesA(c("1A", "2B", "9C")),
times = 1000)
# Unit: microseconds
# expr min lq mean median uq max neval
# R_method 3.422 3.765 4.076722 4.107 4.108 46.881 1000
# Rcpp_method 3.080 3.423 3.718779 3.765 3.765 32.509 1000
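One caveat: a - '0' only gives a meaningful digit when the first character really is '0' to '9'. Here is a minimal sketch in plain standard C++ (no Rcpp types) that guards against that. The helper name first_digit and the -1 sentinel are my own assumptions, not part of Rcpp:

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: returns the leading decimal digit of s as an int,
// or -1 if s is empty or does not start with a digit.
int first_digit(const std::string& s) {
    if (s.empty()) return -1;
    const unsigned char c = static_cast<unsigned char>(s[0]);
    if (!std::isdigit(c)) return -1;
    return c - '0';  // the digits '0'..'9' are guaranteed to be contiguous
}
```

With this, first_digit("9C") gives 9, while an empty string or "AB" gives -1 instead of an arbitrary number.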
I have a question regarding splitting a string. I am working on a binary to hex converter, and want to split my binary sequence, which is represented as a string, into groups of 4 chars so that I can easily convert each set of 4 bits into hexadecimal form:
Example:
00000111010110111100110100010101
would turn into:
"0000", "0111", "0101", "1011", "1100", "1101", "0001", "0101"
Thank you for any assistance you can provide!
Using the std::string::substr function and a simple for loop you can just sub-divide the string into groups of 4 and push them into a std::vector<std::string> as shown below...
#include <iostream>
#include <vector>
#include <string>
int main() {
std::string nums = "00000111010110111100110100010101";
std::vector<std::string> bins;
for (std::size_t i = 0; i < nums.size(); i += 4)
bins.push_back(nums.substr(i, 4));
return 0;
}
Then bins becomes a std::vector filled with the sub-divided binary numbers.
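Since the goal was a binary to hex converter, the same loop can convert each group as it goes. This is only a sketch: the function name binary_to_hex is made up for illustration, and it assumes the input contains only '0' and '1' and that its length is a multiple of 4.

```cpp
#include <cstddef>
#include <string>

// Converts a string of '0'/'1' characters to uppercase hexadecimal,
// four bits at a time. Assumes the input holds only '0' and '1' and
// that its length is a multiple of 4 (any trailing partial group is ignored).
std::string binary_to_hex(const std::string& bits) {
    static const char digits[] = "0123456789ABCDEF";
    std::string hex;
    for (std::size_t i = 0; i + 4 <= bits.size(); i += 4) {
        // std::stoi with base 2 parses e.g. "1011" as 11
        const int value = std::stoi(bits.substr(i, 4), nullptr, 2);
        hex.push_back(digits[value]);
    }
    return hex;
}
```

For the example string from the question this yields "075BCD15".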
I think the length of a stringstream is calculated in blocks, that is, by how many blocks it contains. The blocks are separated by '\t', '\s' or '\n'.
For example, for stringstream = '23\t45\t5.677\t' the length should be 6, since the delimiters are counted too.
I can only confirm this idea when all the arguments are of type int.
Here is my code.
I wonder why s_double.tellp() is not 10.
#include<iostream>
#include<sstream>
#include<cstdlib>
#include<ctime>
using namespace std;
int main()
{
stringstream s_int;
stringstream s_double;
srand((unsigned)time(NULL));
for(int index = 0;index<5;index++)
{
double random = rand() / (double) RAND_MAX * 5;
s_int<<index<<'\t';
s_double<<random<<'\t';
}
cout<<s_int.tellp()<<'\n';
cout<<s_double.tellp()<<'\n';
exit(0);
}
output:
10
40
After I changed the range of random, the output for s_double changed too.
double random = rand() / (double) RAND_MAX *9;
output:
10
42
The easiest, though not the fastest, method is:
auto nLength = strm.str().length();
Regarding the s_double position: it is easy to answer your question by examining the content of this stream in a debugger, or by printing it. You will see that a double could be written as "0.554213" for 0.554212545 or as "1" for 1, so the string length for different doubles is completely different.
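To make that concrete: tellp() counts characters, not blocks. By default operator<< writes a double with at most six significant digits and drops trailing zeros, so the number of characters depends on the value. A small sketch (written_length is a hypothetical helper name):

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Returns how many characters operator<< writes for one double
// followed by a tab, using the stream's default formatting.
std::size_t written_length(double value) {
    std::ostringstream out;
    out << value << '\t';
    return out.str().size();
}
```

Here written_length(1.0) is 2 ("1" plus the tab) and written_length(0.554212545) is 9 ("0.554213" plus the tab). The int stream in the question reports 10 simply because each of the indices 0 to 4 is one character followed by one tab.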
I have a recursive function that prints some of the nodes in a tree as integer ids. After exporting the function to R, I cannot use the cout output for anything (or so it seems). Ideally I could either (1) return the output as a vector or (2) parse the cout output inside R without losing too much speed.
I would insert some code here but my function is particularly generic. Essentially I'm trying to return, say, the Fibonacci sequence as a vector instead of a sum but through a recursive function without using global or static variables.
For example, fib(6) would return inside R as:
[1] 0 1 1 2 3 5
So one could,
y <- fib(6)
y[4] and y[4:5] would return respectively,
[1] 2
[1] 2 3
Thanks in advance for insights and ideas in problem solving. Using a static variable was as far as I got on my own.
I discuss this problem at length, with different hashing and memoization implementations in both R and C++, in chapter one of the Rcpp book.
You should read this online book, http://adv-r.had.co.nz/, especially the memoization part, where your question is partly answered: http://adv-r.had.co.nz/Function-operators.html
Just add the function fib3 such as:
library(memoise)
fib2 <- memoise(function(n) {
if (n < 2) return(1)
fib2(n - 2) + fib2(n - 1)
})
fib3 <- memoise(function(n) sapply(1:n, fib2))
#> fib3(6)
#[1] 1 2 3 5 8 13
Just for fun, a slightly more involved approach that uses std::generate_n and a function object (fseq) in lieu of sapply:
#include <Rcpp.h>
struct fseq {
public:
fseq() {
current = 0;
}
int operator()() {
int val = fib(current);
current++;
return val;
}
int fib(int n) {
if (n==0) return 0;
if (n==1) return 1;
return fib(n-2) + fib(n-1);
}
private:
int current;
};
// [[Rcpp::export(".fib")]]
int fib(int n) {
if (n==0) return 0;
if (n==1) return 1;
return fib(n-2) + fib(n-1);
}
// [[Rcpp::export]]
std::vector<int> fib_seq(const int n) {
if (n < 1) throw std::invalid_argument("n must be >= 1");
std::vector<int> seq;
seq.reserve(n);
std::generate_n(std::back_inserter(seq), n, fseq());
return seq;
}
library(microbenchmark)
##
.fib_seq <- function(n) sapply(0:(n-1), .fib)
##
R> fib_seq(6)
[1] 0 1 1 2 3 5
R> all.equal(fib_seq(6),.fib_seq(6))
[1] TRUE
##
R> microbenchmark(
fib_seq(6),.fib_seq(6),
times=1000L,unit="us")
Unit: microseconds
expr min lq mean median uq max neval
fib_seq(6) 1.561 1.9015 3.287824 2.108 2.3430 1046.021 1000
.fib_seq(6) 27.239 29.0615 35.538355 30.290 32.8065 1108.266 1000
R> microbenchmark(
fib_seq(15),.fib_seq(15),
times=100L,unit="us")
Unit: microseconds
expr min lq mean median uq max neval
fib_seq(15) 6.108 6.5875 7.46431 7.0795 7.7590 20.391 100
.fib_seq(15) 57.243 60.7195 72.97281 63.8120 73.4045 231.707 100
R> microbenchmark(
fib_seq(28),.fib_seq(28),
times=100L,unit="us")
Unit: microseconds
expr min lq mean median uq max neval
fib_seq(28) 2134.861 2143.489 2222.018 2167.364 2219.400 2650.854 100
.fib_seq(28) 3705.492 3721.586 3871.314 3745.956 3852.516 5040.827 100
Note that these functions were parametrized to reflect your statement
For example, fib(6) would return inside R as:
[1] 0 1 1 2 3 5
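If the point is to collect values from a recursive function without globals or statics, one option is to pass the output vector down by reference. Below is a sketch in plain C++; the names fib_fill and fib_first are hypothetical. fib_seq above already returns a std::vector<int> from an exported function, so the same return type should work here too.

```cpp
#include <cstddef>
#include <vector>

// Recursively appends the first n Fibonacci numbers (starting at 0) to out.
// The accumulator is passed by reference, so no global or static state is needed.
void fib_fill(std::size_t n, std::vector<int>& out) {
    if (n == 0) return;
    fib_fill(n - 1, out);  // first fill the n - 1 earlier terms
    if (n <= 2) {
        out.push_back(static_cast<int>(n) - 1);   // the seed terms 0 and 1
    } else {
        out.push_back(out[n - 2] + out[n - 3]);   // sum of the two previous terms
    }
}

// Convenience wrapper that owns the vector and returns it.
std::vector<int> fib_first(std::size_t n) {
    std::vector<int> out;
    out.reserve(n);
    fib_fill(n, out);
    return out;
}
```

Each term is computed once from the two stored before it, so the work grows linearly with n rather than exponentially as in the naive recursion. The recursion depth is n, which is fine for small inputs like these.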
PROBLEM SOLVED: thanks everyone!
I am almost entirely new to C++ so I apologise in advance if the question seems trivial.
I am trying to convert a string of letters to a set of 2 digit numbers where a = 10, b = 11, ..., y = 34, z = 35, so that (for example) "abc def" goes to "101112131415". How would I go about doing this? Any help would really be appreciated. Also, I don't mind whether capitalization results in the same number or a different number. Thank you very much in advance. I probably won't need it for a few days, but if anyone is feeling particularly nice, how would I go about reversing this process, i.e. "101112131415" --> "abcdef"? Thanks.
EDIT: This isn't homework, I'm entirely self taught. I have completed this project before in a different language and decided to try C++ to compare the differences and try to learn C++ in the process :)
EDIT: I have roughly what I want, I just need a little bit of help converting this so that it applies to strings, thanks guys.
#include <iostream>
#include <sstream>
#include <string>
int returnVal (char x)
{
return (int) x - 87;
}
int main()
{
char x = 'g';
std::cout << returnVal(x);
}
A portable method is to use a table lookup:
const unsigned int letter_to_value[] =
{10, 11, 12, /*...*/ 35};
// ...
letter = toupper(letter);
const unsigned int index = letter - 'A';
value = letter_to_value[index];
cout << value; // print the looked-up value, not the index
Each character has its ASCII value. Try converting your characters into ASCII and then work with the difference.
Example:
int x = 'a';
cout << x;
will print 97; and
int x = 'a';
cout << x - 87;
will print 10.
Hence, you could write a function like this:
int returnVal(char x)
{
return (int)x - 87;
}
to get the required output.
And your main program could look like:
int main()
{
string s = "abcdef";
for (unsigned int i = 0; i < s.length(); i++)
{
cout << returnVal(s[i]);
}
return 0;
}
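Note that the loop above would also "convert" spaces and other non-letters, so "abc def" would not come out as "101112131415". Here is a sketch that builds the result as a string and skips everything that is not a letter. The name encode_letters is made up, and it assumes an ASCII-compatible character set where 'a' to 'z' are contiguous:

```cpp
#include <cctype>
#include <string>

// Encodes letters as two-digit numbers (a/A = 10, ..., z/Z = 35) and
// skips every other character. Assumes an ASCII-compatible character
// set in which 'a'..'z' are contiguous.
std::string encode_letters(const std::string& text) {
    std::string out;
    for (char ch : text) {
        const unsigned char c = static_cast<unsigned char>(ch);
        if (!std::isalpha(c)) continue;  // ignore spaces, digits, punctuation
        const int value = std::tolower(c) - 'a' + 10;
        out += std::to_string(value);
    }
    return out;
}
```

With this, encode_letters("abc def") returns "101112131415", and upper and lower case map to the same number.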
This is a simple way to do it, if a bit messy.
map<char, int> vals; // maps a character to an integer
int g = 1; // if a needs to be 10 then set g = 10
string alphabet = "abcdefghijklmnopqrstuvwxyz";
for(char c : alphabet) { // kooky krazy for loop
vals[c] = g;
g++;
}
What Daniel said, try it out for yourself.
As a starting point though, casting:
int i = (int)string[0] + offset;
will get you the number for a character; std::stringstream will be useful too.
How would I go about doing this?
By trying to do something first, and looking for help only if you feel you cannot advance.
That being said, the most obvious solution that comes to mind is based on the fact that characters (i.e. 'a', 'G') are really numbers. Suppose you have the following:
char c = 'a';
You can get the number associated with c by doing:
int n = static_cast<int>(c);
Then, add some offset to 'n':
n += 10;
...and cast it back to a char:
c = static_cast<char>(n);
Note: The above assumes that characters are consecutive, i.e. the number corresponding to 'a' is equal to the one corresponding to 'z' minus the amount of letters between the two. This usually holds, though.
This can work
int Number = 123; // number to be converted to a string
string Result; // string which will contain the result
ostringstream convert; // stream used for the conversion
convert << Number; // insert the textual representation of 'Number' in the characters in the stream
Result = convert.str(); // set 'Result' to the contents of the stream
You should add these headers:
#include <sstream>
#include <string>
Many answers will tell you that characters are encoded in ASCII and that you can convert a letter to an index by subtracting 'a'.
This is not proper C++. It is acceptable when your program requirements include a specification that ASCII is in use. However, the C++ standard alone does not require this. There are C++ implementations with other character sets.
In the absence of knowledge that ASCII is in use, you can use translation tables:
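#include <limits.h>
// Define a table to translate from characters to desired codes.
// It needs UCHAR_MAX + 1 entries so that every unsigned char value is a valid index.
// Note: array designators such as ['a'] = 10 are a C99 feature, not standard C++.
static unsigned int Translate[UCHAR_MAX + 1] =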
{
['a'] = 10,
['b'] = 11,
…
};
Then you may translate characters to numbers by looking them up in the table:
unsigned char x = something;
int result = Translate[x];
Once you have the translation, you could print it as two digits using printf("%02d", result);.
Translating in the other direction requires reading two characters, converting them to a number (interpreting them as decimal), and performing a similar translation. You might have a different translation table set up for this reverse translation.
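As a rough illustration of both directions without relying on ASCII, the position of a letter in an alphabet string can serve as the table. This is a sketch only; kAlphabet, letter_to_code and decode_letters are hypothetical names, and it handles lowercase letters only:

```cpp
#include <cstddef>
#include <string>

// The position of a letter in this string defines its code, so nothing
// is assumed about the execution character set.
const std::string kAlphabet = "abcdefghijklmnopqrstuvwxyz";

// Returns 10..35 for a lowercase letter, or -1 for anything else.
int letter_to_code(char c) {
    const std::size_t pos = kAlphabet.find(c);
    return pos == std::string::npos ? -1 : static_cast<int>(pos) + 10;
}

// Decodes a string of two-digit codes, e.g. "101112" -> "abc".
// Returns an empty string if the input is malformed.
std::string decode_letters(const std::string& digits) {
    static const std::string kDigits = "0123456789";
    if (digits.size() % 2 != 0) return "";
    std::string out;
    for (std::size_t i = 0; i < digits.size(); i += 2) {
        const std::size_t tens = kDigits.find(digits[i]);
        const std::size_t ones = kDigits.find(digits[i + 1]);
        if (tens == std::string::npos || ones == std::string::npos) return "";
        const std::size_t code = tens * 10 + ones;  // read the pair as decimal
        if (code < 10 || code > 35) return "";
        out.push_back(kAlphabet[code - 10]);
    }
    return out;
}
```

For example, decode_letters("101112131415") gives "abcdef". A linear find per character is slower than a real lookup table, but for short strings that hardly matters.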
Just do this:
(s[i] - 'A' + 1)
Basically, this converts a char to a number by subtracting 'A' and then adding 1, so that 'A' maps to 1, 'B' to 2, and so on. For the mapping in the question, where the first letter is 10, add 10 instead of 1.
Suppose, I've a data.frame as follows:
set.seed(45)
DF <- data.frame(x=1:10, strata2013=sample(letters[1:3], 10, TRUE))
x strata2013
1 1 b
2 2 a
3 3 a
4 4 b
5 5 b
6 6 a
7 7 a
8 8 b
9 9 a
10 10 a
And I'd like to get the counts for each unique value in the column strata2013, then, using data.table (for speed), one could do it in this manner:
DT <- as.data.table(DF)
DT[, .N, by=strata2013]
strata2013 N
1: b 4
2: a 6
Now, I'd like to try and accomplish this in Rcpp, as a learning exercise. I've written and tried out the code shown below which is supposed to provide the same output, but instead it gives me an error. Here's the code:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector LengthStrata (CharacterVector uniqueStrata, DataFrame dataset ) {
int n = uniqueStrata.size();
NumericVector Nh(n);
Rcpp::CharacterVector strata=dataset["strate2013"];
for (int i = 0; i < n; ++i) {
Nh[i]=strata(uniqueStrata(i)).size();
}
return Nh;
}
Here is the error message:
conversion from 'Rcpp::Vector<16>::Proxy {aka Rcpp::internal::string_proxy<16>}'
to 'const size_t { aka const long long unsigned int}' is ambiguous
What am I doing wrong? Thank you very much for your help.
If I understand correctly, you're hoping that strata( uniqueStrata(i) ) will subset the vector, similar to how R's subsetting operates. This is unfortunately not the case; you would have to perform the subsetting 'by hand'. Rcpp doesn't have 'generic' subsetting operators available yet.
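To show what 'by hand' might look like, here is a sketch in plain standard C++ using std::vector<std::string> rather than Rcpp types. The function name count_each is hypothetical. For each unique value it simply counts the matching entries:

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// For each entry of unique_strata, counts how many entries of strata equal it.
std::vector<std::size_t> count_each(const std::vector<std::string>& unique_strata,
                                    const std::vector<std::string>& strata) {
    std::vector<std::size_t> counts;
    counts.reserve(unique_strata.size());
    for (const std::string& level : unique_strata) {
        // std::count scans the whole vector once per unique value
        counts.push_back(static_cast<std::size_t>(
            std::count(strata.begin(), strata.end(), level)));
    }
    return counts;
}
```

This makes one pass over the data per unique value, so it is only reasonable when there are few distinct values.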
When it comes to using Rcpp, you really want to leverage the C++ standard library where possible. The de-facto C++ way of generating these counts would be to use a std::map (or std::unordered_map, if you can assume C++11), with something like the following. I include a benchmark for interest.
Note from Dirk: unordered_map is actually available from tr1 for pre-C++11, so one can include it using e.g. #include <tr1/unordered_map>
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
IntegerVector LengthStrata( DataFrame dataset ) {
Rcpp::CharacterVector strata = dataset["strata2013"];
int n = strata.size();
std::map<SEXP, int> counts;
for (int i = 0; i < n; ++i) {
++counts[ strata[i] ];
}
return wrap(counts);
}
/*** R
library(data.table)
library(microbenchmark)
set.seed(45)
DF <- data.frame(strata2013=sample(letters, 1E5, TRUE))
DT <- data.table(DF)
LengthStrata(DF)
DT[, .N, by=strata2013]
microbenchmark(
LengthStrata(DF),
DT[, .N, by=strata2013]
)
*/
gives me
Unit: milliseconds
expr min lq median uq max neval
LengthStrata(DF) 3.267131 3.831563 3.934992 4.101050 11.491939 100
DT[, .N, by = strata2013] 1.980896 2.360590 2.480884 2.687771 3.052583 100
The Rcpp solution is slower in this case likely due to the time it takes to move R objects to and from the C++ containers, but hopefully this is instructive.
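For comparison, here is the same counting idea with std::unordered_map, as a sketch in plain C++11 without any Rcpp types. The name count_strata is made up, and keying on std::string instead of SEXP is my own choice for the illustration:

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Tallies the occurrences of each distinct string in a single pass.
std::unordered_map<std::string, int> count_strata(const std::vector<std::string>& strata) {
    std::unordered_map<std::string, int> counts;
    for (const std::string& s : strata) {
        ++counts[s];  // operator[] value-initialises a missing entry to 0
    }
    return counts;
}
```

Unlike std::map, the result has no defined iteration order, so sort the keys afterwards if the order matters.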
Aside: This is, in fact, already included in Rcpp as the sugar table function, so if you want to skip the learning experience, you can use a pre-baked solution as
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
IntegerVector LengthStrata( DataFrame dataset ) {
Rcpp::CharacterVector strata = dataset["strata2013"];
return table(strata);
}
Sugar improves the speed of the Rcpp function:
Unit: milliseconds
expr min lq median uq max neval
LengthStrata(DF) 5.548094 5.870184 6.014002 6.448235 6.922062 100
DT[, .N, by = strate2013] 6.526993 7.136290 7.462661 7.949543 81.233216 100
I am not sure I understand what you are trying to do. And when strata is a vector
Rcpp::CharacterVector strata=df["strate2013"];
then I am not sure what
strata(uniqueStrata(i)).size()
is supposed to do. Maybe you could describe in words (or in R with some example code and data) what you are trying to do here.