I am a novice in C++ and Rcpp, and I am wondering how to compare the elements of two different vectors all at once, without a loop.
My goal is to change the elements of v1 by referencing other vectors.
Current code is
v1 <- c(6, 7, 8, 9, 10)
v2 <- c(2, 4, 6, 8, 10)
v3 <- c("a", "b", "a", "b", "c")
v4 <- c(0, 0, 0, 0, 0)
v5 <- c("a", "b", "c")
v6 <- c(1, 2, 3)
for (i in 1:5) {
  if (v1[i] > v2[i]) {
    for (j in 1:3) {
      if (v5[j] == v3[i]) {
        v4[i] <- v2[i] + v6[j]
        if (v1[i] > v4[i]) {
          v1[i] <- v4[i]
        }
      }
    }
  }
}
The result should be
v1 = {3,6,7,9,10}
In fact, v1 to v4 and v5, v6 come from two different data frames in R. Each element of v1 is compared with the corresponding element of v2. If v1[i] is larger than v2[i], v4[i] is computed as the sum of v2[i] and the element of v6 whose v5 value matches v3[i]. The newly computed v4[i] is then compared with v1[i], and v1[i] is replaced by v4[i] if v4[i] is smaller.
I have a large number of cases in v1~v4 and v5~v6, so the loop takes a long time. Is it possible to compare the vectors without a loop? Or how else can I compute values by referencing the other vector's elements?
I do not see the need to use Rcpp or C++ here. The way I understand your requirements, you are trying to manipulate two sets of equal length vectors. For a "set of equal length" vectors one normally uses a data.frame or one of its extensions. Here I am using base R, data.table and dplyr with tibble. See for yourself which syntax you prefer. Generally speaking, data.table will most likely be faster for large data sets.
Setup data:
v1 <- c(6,7,8,9,10)
v2 <- c(2,4,6,8,10)
v3 <- c("a","b","a","b","c")
v5 <- c("a","b","c")
v6 <- c(1,2,3)
Base R:
df1 <- data.frame(v1, v2, v3)
df2 <- data.frame(v5, v6)
df1 <- merge(df1, df2, by.x = "v3", by.y = "v5")
df1$v4 <- df1$v2 + df1$v6
df1$v1 <- ifelse(df1$v1 > df1$v2 & df1$v1 > df1$v4, df1[["v4"]], df1[["v1"]])
df1
#> v3 v1 v2 v6 v4
#> 1 a 3 2 1 3
#> 2 a 7 6 1 7
#> 3 b 6 4 2 6
#> 4 b 9 8 2 10
#> 5 c 10 10 3 13
data.table:
library(data.table)
dt1 <- data.table(v1, v2, v3, key = "v3")
dt2 <- data.table(v5, v6, key = "v5")
dt1[dt2, v4 := v2 + v6]
dt1[v1 > v2 & v1 > v4, v1 := v4]
dt1
#> v1 v2 v3 v4
#> 1: 3 2 a 3
#> 2: 7 6 a 7
#> 3: 6 4 b 6
#> 4: 9 8 b 10
#> 5: 10 10 c 13
dplyr:
suppressPackageStartupMessages(library(dplyr))
t1 <- tibble(v1, v2, v3)
t2 <- tibble(v5, v6)
t1 %>%
  inner_join(t2, by = c("v3" = "v5")) %>%
  mutate(v4 = v2 + v6) %>%
  mutate(v1 = case_when(
    v1 > v2 & v1 > v4 ~ v4,
    TRUE ~ v1
  ))
#> # A tibble: 5 x 5
#> v1 v2 v3 v6 v4
#> <dbl> <dbl> <chr> <dbl> <dbl>
#> 1 3 2 a 1 3
#> 2 6 4 b 2 6
#> 3 7 6 a 1 7
#> 4 9 8 b 2 10
#> 5 10 10 c 3 13
Created on 2019-04-19 by the reprex package (v0.2.1)
The general idea is always the same:
join the two tables on the character column
create new column v4 as sum of v2 and v6
update v1 to the value of v4 where v1 > v2 and v1 > v4
Note that base R and data.table do not preserve the order, so it would make more sense to put the output into an additional column.
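If the original row order matters, the same three steps can also be done without a join. The following is only a sketch using base R's match() as a lookup in place of merge; the names v1_new and needs_update are made up for illustration and are not part of the answer above.

```r
v1 <- c(6, 7, 8, 9, 10)
v2 <- c(2, 4, 6, 8, 10)
v3 <- c("a", "b", "a", "b", "c")
v5 <- c("a", "b", "c")
v6 <- c(1, 2, 3)

# look up the v6 value for each element of v3 (NA if v3[i] is not in v5)
v4 <- v2 + v6[match(v3, v5)]

# same condition as in the loop: v1 > v2 and v1 > v4
needs_update <- !is.na(v4) & v1 > v2 & v1 > v4

# keep the result in a separate vector so the input stays intact
v1_new <- ifelse(needs_update, v4, v1)
v1_new
```

Because match() returns positions in the order of v3, the result lines up with the original v1 and no re-sorting is needed.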
Related
After running this code:
t1 <-Sys.time()
df.m <- left_join(df.h,daRta3,by=c("year","month","MA","day"))
t2 <- Sys.time()
difftime(t2,t1)
I have this error.
Error: std::bad_alloc
The data frame I am trying to create would have 74495 * 2695 rows, which is roughly 200 * 10^6.
The computer on which I run the code has 20 GB of RAM.
I tried memory.limit(), but it did not solve my issue.
Examine cardinality of your join key
Is the c("year","month","MA","day") unique in both df.h and daRta3?
What are the most frequent values?
NA values. left_join can treat NA values as equal or different:
> tibble(x = c(NA, NA, NA)) %>% left_join(., ., by = 'x')
# A tibble: 9 x 1
x
<lgl>
1 NA
2 NA
3 NA
4 NA
5 NA
6 NA
7 NA
8 NA
9 NA
> tibble(x = c(NA, NA, NA)) %>% left_join(., ., by = 'x', na_matches = 'never')
# A tibble: 3 x 1
x
<lgl>
1 NA
2 NA
3 NA
If the order and values of c("year","month","MA","day") can be guaranteed to be the same in both tables, then a simple cbind or bind_cols might be an efficient solution.
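To make the cardinality check concrete, here is a rough sketch in base R that counts how often each key combination occurs on both sides and estimates the size of the left join before running it. The two small data frames are hypothetical stand-ins for df.h and daRta3, and only two of the four key columns are used to keep it short.

```r
# hypothetical toy data standing in for df.h and daRta3
df.h   <- data.frame(year = c(2019, 2019, 2019), month = c(1, 1, 2), val_h = 1:3)
daRta3 <- data.frame(year = c(2019, 2019, 2019, 2019), month = c(1, 1, 1, 2), val_d = 1:4)
keys <- c("year", "month")

# one string per row identifying the key combination
key_h <- do.call(paste, c(df.h[keys], sep = "|"))
key_d <- do.call(paste, c(daRta3[keys], sep = "|"))

# how often does each key occur on each side?
n_h <- table(key_h)
n_d <- table(key_d)

# a left join yields n_h * n_d rows per key, or n_h rows if the key has no match
cnt_d <- as.vector(n_d)[match(names(n_h), names(n_d))]
cnt_d[is.na(cnt_d)] <- 1
expected_rows <- sum(as.vector(n_h) * cnt_d)
expected_rows

# most frequent keys on the right-hand side
head(sort(n_d, decreasing = TRUE))
```

If expected_rows is far larger than nrow(df.h), the key is not unique in the right-hand table and the join multiplies rows.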
Is it possible to customize setdiff using regular expressions to see what is in one vector and not another? For example:
x <- c("1\t119\t120\t1\t119\t120\tABC\tDEF\t0", "2\t558\t559\t2\t558\t559\tGHI\tJKL\t0", "3\t139\t141\t3\t139\t141\tMNO\tPQR\t0", "3\t139\t143\t3\t139\t143\tSTU\tVWX\t0")
[1] "1\t119\t120\t1\t119\t120\tABC\tDEF\t0"
[2] "2\t558\t559\t2\t558\t559\tGHI\tJKL\t0"
[3] "3\t139\t141\t3\t139\t141\tMNO\tPQR\t0"
[4] "3\t139\t143\t3\t139\t143\tSTU\tVWX\t0"
y <- c("1\t119\t120\t1\t109\t120\tABC\tDEF\t0", "2\t558\t559\t2\t548\t559\tGHI\tJKL\t0", "3\t139\t141\t3\t129\t141\tMNO\tPQR\t0", "3\t139\t143\t3\t129\t143\tSTU\tVWX\t0", "4\t157\t158\t4\t147\t158\tXWX\tYTY\t0", "5\t158\t159\t5\t148\t159\tPHP\tWZW\t0")
[1] "1\t119\t120\t1\t109\t120\tABC\tDEF\t0"
[2] "2\t558\t559\t2\t548\t559\tGHI\tJKL\t0"
[3] "3\t139\t141\t3\t129\t141\tMNO\tPQR\t0"
[4] "3\t139\t143\t3\t129\t143\tSTU\tVWX\t0"
[5] "4\t157\t158\t4\t147\t158\tXWX\tYTY\t0"
[6] "5\t158\t159\t5\t148\t159\tPHP\tWZW\t0"
I want to be able to show that:
[5] "4\t157\t158\t4\t147\t158\tXWX\tYTY\t0"
[6] "5\t158\t159\t5\t148\t159\tPHP\tWZW\t0"
are new because 4\t157\t158 and 5\t158\t159 are unique to y. This doesn't work:
> setdiff(y,x)
[1] "1\t119\t120\t1\t109\t120\tABC\tDEF\t0" "2\t558\t559\t2\t548\t559\tGHI\tJKL\t0"
[3] "3\t139\t141\t3\t129\t141\tMNO\tPQR\t0" "3\t139\t143\t3\t129\t143\tSTU\tVWX\t0"
[5] "4\t157\t158\t4\t147\t158\tXWX\tYTY\t0" "5\t158\t159\t5\t148\t159\tPHP\tWZW\t0"
Because column 5 is clearly different in both x and y. I want to setdiff only based on the first three columns.
A simple example of setdiff can be found here: How to tell what is in one vector and not another?
One way to do this is to put x and y as data.frames and anti-join. I'll use data.table since I find it more natural.
library(data.table)
xDT <- as.data.table(do.call("rbind", strsplit(x, split = "\t")))
yDT <- as.data.table(do.call("rbind", strsplit(y, split = "\t")))
Now anti-join (a "setdiff" for data.frames/data.tables):
yDT[!xDT, on = paste0("V", 1:3)]
# V1 V2 V3 V4 V5 V6 V7 V8 V9
# 1: 4 157 158 4 147 158 XWX YTY 0
# 2: 5 158 159 5 148 159 PHP WZW 0
You could also get the row index (thanks to @Frank for the suggested improvement/simplification):
> yDT[!xDT, which = TRUE, on = paste0("V", 1:3)]
Or extract it directly from y:
> y[yDT[!xDT, which = TRUE, on = paste0("V", 1:3)]]
# [1] "4\t157\t158\t4\t147\t158\tXWX\tYTY\t0" "5\t158\t159\t5\t148\t159\tPHP\tWZW\t0"
We could also use anti_join from dplyr after reading the data with fread
library(data.table)
library(dplyr)
anti_join(fread(paste(y, collapse='\n')),
fread(paste(x, collapse='\n')), by = c('V1', 'V2', 'V3'))
# V1 V2 V3 V4 V5 V6 V7 V8 V9
# (int) (int) (int) (int) (int) (int) (chr) (chr) (int)
# 1 4 157 158 4 147 158 XWX YTY 0
# 2 5 158 159 5 148 159 PHP WZW 0
Or (as the title requests for regex) we can use regex to remove part of the string and then do the %in%
y[!sub('(([^\t]+\t){3}).*', '\\1', y) %in%
sub('(([^\t]+\t){3}).*', '\\1', x)]
#[1] "4\t157\t158\t4\t147\t158\tXWX\tYTY\t0" "5\t158\t159\t5\t148\t159\tPHP\tWZW\t0"
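The idea of comparing only the first three columns can also be written without a regular expression. This is a sketch with a made-up helper first_fields() and shortened example strings, not the original x and y:

```r
# shortened, hypothetical versions of x and y
x <- c("1\t119\t120\t1\t119", "2\t558\t559\t2\t558")
y <- c("1\t119\t120\t1\t109", "4\t157\t158\t4\t147")

# build a key from the first n tab-separated fields of each string
first_fields <- function(s, n = 3) {
  vapply(strsplit(s, "\t", fixed = TRUE),
         function(p) paste(p[seq_len(n)], collapse = "\t"),
         character(1))
}

# keep the elements of y whose key does not occur among the keys of x
new_in_y <- y[!first_fields(y) %in% first_fields(x)]
new_in_y
```

Note that this assumes every string has at least n fields; shorter strings would get NA pasted into their key.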
I'm trying to get accustomed to the tidyr package, and am struggling with the problem of having a variable which is a concatenation of several variables. In the minimal example below, I would like to split variable v2 into its constituent variables v3 and v4 and then swing these so I end up with the four variables v1-v4.
require(plyr)
require(dplyr)
require(stringr)
require(tidyr)
data <-
data.frame(
v1=c(1,2),
v2=c("v3 cheese; v4 200", "v3 ham; v4 150")) %>%
tbl_df()
If I split v2 into a new temp I get only v3:
mutate(data,
temp=unlist(sapply(str_split(data$v2, pattern=";"), "[", 1)))
v1 v2 temp
1 1 v3 cheese; v4 200 v3 cheese
2 2 v3 ham; v4 150 v3 ham
My problems are:
1) How do I split and swing v3 AND v4 up as column names using tidyr?
2) In my real data I do not know the variable names (or there are too many of them), but they have the structure "var value", and I would like to use some regex to automatically identify and swing them as in 1).
Got inspired by this SO answer but could not get it to work though with regex code for variable names.
UPDATE:
My output would be something like (v2 could be skipped as its now redundant with v3 and v4):
v1 v2 v3 v4
1 1 v3 cheese; v4 200 cheese 200
2 2 v3 ham; v4 150 ham 150
Split the data by ";", convert the split output to a long form, split the data again by " " (but in a wide form this time) and spread the values out to the wide form you desire.
Here it is using "dplyr" + "tidyr" + "stringi":
library(dplyr)
library(tidyr)
library(stringi)
data %>%
mutate(v2 = stri_split_fixed(as.character(v2), ";")) %>%
unnest(v2) %>%
mutate(v2 = stri_trim_both(v2)) %>%
separate(v2, into = c("var", "val")) %>%
spread(var, val)
# Source: local data frame [2 x 3]
#
# v1 v3 v4
# 1 1 cheese 200
# 2 2 ham 150
Alternatively, using cSplit from my "splitstackshape" package (which doesn't presently work with tbl_dfs)
library(dplyr)
library(tidyr)
library(splitstackshape)
as.data.frame(data) %>%
cSplit("v2", ";", "long") %>%
cSplit("v2", " ") %>%
spread(v2_1, v2_2)
# v1 v3 v4
# 1: 1 cheese 200
# 2: 2 ham 150
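For comparison, the same split, long, wide sequence can be sketched in base R only. This is not the method from either answer above; it assumes every piece has the form "var value" with a single space-separated name in front, and the intermediate names pieces, long and wide are made up for illustration.

```r
data <- data.frame(
  v1 = c(1, 2),
  v2 = c("v3 cheese; v4 200", "v3 ham; v4 150"),
  stringsAsFactors = FALSE
)

# step 1: split each string on ";" (plus optional whitespace)
pieces <- strsplit(data$v2, ";\\s*")
flat <- unlist(pieces)

# step 2: long form, with the variable name and value separated by regex
long <- data.frame(
  v1  = rep(data$v1, lengths(pieces)),
  var = sub("^\\s*(\\S+)\\s+.*$", "\\1", flat),
  val = sub("^\\s*\\S+\\s+", "", flat),
  stringsAsFactors = FALSE
)

# step 3: wide form, one column per variable name found in the data
wide <- reshape(long, idvar = "v1", timevar = "var", direction = "wide")
names(wide) <- sub("^val\\.", "", names(wide))
wide
```

The variable names are taken from the data itself, so they do not need to be known in advance. All values come out as character and may need type conversion afterwards.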
I have a dataframe with 2 columns GL and GLDESC and want to add a 3rd column called KIND based on some data that is inside of column GLDESC.
The dataframe is as follows:
GL GLDESC
1 515100 Payroll-Indir Salary Labor
2 515900 Payroll-Indir Compensated Absences
3 532300 Bulk Gas
4 539991 Area Charge In
5 551000 Repairs & Maint-Spare Parts
6 551100 Supplies-Operating
7 551300 Consumables
For each row of the data table:
If GLDESC contains the word Payroll anywhere in the string then I want KIND to be Payroll
If GLDESC contains the word Gas anywhere in the string then I want KIND to be Materials
In all other cases I want KIND to be Other
I looked for similar examples on stackoverflow but could not find any, also looked in R for dummies on switch, grep, apply and regular expressions to try and match only part of the GLDESC column and then fill the KIND column with the kind of account but was unable to make it work.
Since you have only two conditions, you can use a nested ifelse:
#random data; it wasn't easy to copy-paste yours
DF <- data.frame(GL = sample(10), GLDESC = paste(sample(letters, 10),
c("gas", "payroll12", "GaSer", "asdf", "qweaa", "PayROll-12",
"asdfg", "GAS--2", "fghfgh", "qweee"), sample(letters, 10), sep = " "))
DF$KIND <- ifelse(grepl("gas", DF$GLDESC, ignore.case = TRUE), "Materials",
                  ifelse(grepl("payroll", DF$GLDESC, ignore.case = TRUE), "Payroll", "Other"))
DF
# GL GLDESC KIND
#1 8 e gas l Materials
#2 1 c payroll12 y Payroll
#3 10 m GaSer v Materials
#4 6 t asdf n Other
#5 2 w qweaa t Other
#6 4 r PayROll-12 q Payroll
#7 9 n asdfg a Other
#8 5 d GAS--2 w Materials
#9 7 s fghfgh e Other
#10 3 g qweee k Other
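Applied to the data from the question, the same logic can also be sketched with logical indexing instead of nested ifelse. The data frame below is retyped from the question, and the matching is case-sensitive here, which is an assumption about what is wanted:

```r
DF <- data.frame(
  GL = c(515100, 515900, 532300, 539991, 551000, 551100, 551300),
  GLDESC = c("Payroll-Indir Salary Labor", "Payroll-Indir Compensated Absences",
             "Bulk Gas", "Area Charge In", "Repairs & Maint-Spare Parts",
             "Supplies-Operating", "Consumables"),
  stringsAsFactors = FALSE
)

# start with the default, then overwrite the rows that match a pattern
DF$KIND <- "Other"
DF$KIND[grepl("Gas", DF$GLDESC, fixed = TRUE)] <- "Materials"
DF$KIND[grepl("Payroll", DF$GLDESC, fixed = TRUE)] <- "Payroll"
DF
```

Later assignments overwrite earlier ones, so if a description could match both words, the pattern that should win goes last.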
EDIT 10/3/2016 (..after receiving more attention than expected)
A possible solution to deal with more patterns could be to iterate over all patterns and, whenever there is match, progressively reduce the amount of comparisons:
ff = function(x, patterns, replacements = patterns, fill = NA, ...)
{
    stopifnot(length(patterns) == length(replacements))
    ans = rep_len(as.character(fill), length(x))
    empty = seq_along(x)
    for(i in seq_along(patterns)) {
        greps = grepl(patterns[[i]], x[empty], ...)
        ans[empty[greps]] = replacements[[i]]
        empty = empty[!greps]
    }
    return(ans)
}
ff(DF$GLDESC, c("gas", "payroll"), c("Materials", "Payroll"), "Other", ignore.case = TRUE)
# [1] "Materials" "Payroll" "Materials" "Other" "Other" "Payroll" "Other" "Materials" "Other" "Other"
ff(c("pat1a pat2", "pat1a pat1b", "pat3", "pat4"),
c("pat1a|pat1b", "pat2", "pat3"),
c("1", "2", "3"), fill = "empty")
#[1] "1" "1" "3" "empty"
ff(c("pat1a pat2", "pat1a pat1b", "pat3", "pat4"),
c("pat2", "pat1a|pat1b", "pat3"),
c("2", "1", "3"), fill = "empty")
#[1] "2" "1" "3" "empty"
I personally like matching by index. You can loop grep over your new labels, in order to get the indices of your partial matches, then use this with a lookup table to simply reassign the values.
If you wanna create new labels, use a named vector.
DF <- data.frame(GL = sample(10), GLDESC = paste(sample(letters, 10),
c(
"gas", "payroll12", "GaSer", "asdf", "qweaa", "PayROll-12",
"asdfg", "GAS--2", "fghfgh", "qweee"
), sample(letters, 10),
sep = " "
))
lu <- stack(lapply(c(Material = "gas", Payroll = "payroll"), grep, x = DF$GLDESC, ignore.case = TRUE))
DF$KIND <- DF$GLDESC
DF$KIND[lu$values] <- as.character(lu$ind)
DF$KIND[-lu$values] <- "Other"
DF
#> GL GLDESC KIND
#> 1 6 x gas f Material
#> 2 3 t payroll12 q Payroll
#> 3 5 a GaSer h Material
#> 4 4 s asdf x Other
#> 5 1 m qweaa y Other
#> 6 10 y PayROll-12 r Payroll
#> 7 7 g asdfg a Other
#> 8 2 k GAS--2 i Material
#> 9 9 e fghfgh j Other
#> 10 8 l qweee p Other
Created on 2021-11-13 by the reprex package (v2.0.1)
Say I have a list mn like this
i<-c(w=5,n="oes")
p<-c(w=9,n="ty",j="ooe")
mn<-list(i,p,i,p,i,p,i)
Now I'd like to select the list elements with the shortest length (the i's) and append "unknown" to them before creating a data frame. How can I do this?
EDIT: In the end I'd like every i element in mn to be w=5, n="oes", and j="unknown" before mn (including the p elements) is converted into a data frame.
To find the length of each element in your list, use length wrapped in sapply:
len <- sapply(mn, length)
len
[1] 2 3 2 3 2 3 2
Now, to identify only those elements that have lengths equal to the shortest length:
which(len==min(len))
[1] 1 3 5 7
Use subsetting and as.data.frame to create your data.frame. But this data.frame will have somewhat random column names, so I rename the column names:
df <- as.data.frame(mn[which(len==min(len))])
names(df) <- seq_len(ncol(df))
df
1 2 3 4
w 5 5 5 5
n oes oes oes oes
You will have to clarify what you mean with "append unknown" to this data.frame.
Another possibility is:
all.names = unique( unlist( lapply( mn, names ) ) )
do.call( 'rbind', lapply( mn, function( r ) {
  data.frame( sapply( all.names, function( v ) r[ v ], simplify=FALSE ) )
} ) )
which gives:
w n j
w 5 oes <NA>
w1 9 ty ooe
w2 5 oes <NA>
w3 9 ty ooe
w4 5 oes <NA>
w5 9 ty ooe
w6 5 oes <NA>
But I get the feeling there's a much neater route to this solution...
edit
If you want unknown rather than <NA>, you can change the inner sapply to:
sapply( all.names, function( v ) if( is.na( r[v] ) ) 'unknown' else r[v], simplify=FALSE )
This is not very elegant, but it might do the trick:
maxlength <- max(sapply(mn,length))
## make a new list, with the "missing" entries replaced with "unknown"
mn2 <- lapply(mn,function(x)c(x,rep('unknown',maxlength - length(x))))
## convert to a data.frame
mn3 <- data.frame(matrix(unlist(mn2),nrow = maxlength))
Which gives the following
> mn3
X1 X2 X3 X4 X5 X6 X7
1 5 9 5 9 5 9 5
2 oes ty oes ty oes ty oes
3 unknown ooe unknown ooe unknown ooe unknown
However, it is better practice to use NA rather than "unknown".
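If the layout with one row per list element and a j column is wanted, the padding can also be done by name rather than by position. This is a sketch under the assumption that every element is a named character vector; the names all_names and rows are made up for illustration.

```r
i <- c(w = 5, n = "oes")
p <- c(w = 9, n = "ty", j = "ooe")
mn <- list(i, p, i, p, i, p, i)

# every name that occurs in any element of the list
all_names <- unique(unlist(lapply(mn, names)))

# reindex each element by the full set of names and fill the gaps
rows <- lapply(mn, function(el) {
  out <- el[all_names]          # missing names come back as NA
  names(out) <- all_names
  out[is.na(out)] <- "unknown"  # drop this line to keep NA instead
  out
})

# one row per list element, one column per name
df <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
df
```

All columns are character, because c(w = 5, n = "oes") already coerces the number to a string; w can be converted back with as.numeric afterwards.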