Related
I am a novice in C++ and Rcpp, and I am wondering how to compare each element of two different vectors without loop at one time.
My goal is to change the element of v1 by referencing other vector.`
Current code is
v1 = {6,7,8,9,10}
v2 = {2,4,6,8,10}
v3 = {a,b,a,b,c}
v4 = {0,0,0,0,0}
v5 = {a,b,c}
v6 = {1,2,3}
for (i in 1:5){
if (v1[i] > v2[i]){
for (j in 1:3){
if (v5[j] == v3[i]){
v4[i] = v2[i] + v6[j]
if (v1[i] > v4[i]){
v1[i] = v4[i]
}
}
}
}
}
The result sould be
v1 = {3,6,7,9,10}
In fact, v1, v2, v3, v4 and v5, v6 are the different dataframe in R. Each element of v1 is compared to v2. If an element i in v1 is larger than i element in v2, the element of v1 becomes a sum of i element of v1 and element of v6 by corresponding v3 & v5. Then the newly estimated value v4[i] is compared to v1[i].
I have ta large number of cases in v1~v5 and v5~v6. In this case, using loop takes a long time. Is it possible to compare the different vectors without loop? or how to estimate and reference the other vector's element?
I do not see the need to use Rcpp or C++ here. The way I understand your requirements, you are trying to manipulate two sets of equal length vectors. For a "set of equal length" vectors one normally uses a data.frame or one of its extensions. Here I am using base R, data.table and dplyr with tibble. See for yourself which syntax you prefer. Generally speaking, data.table will most likely be faster for large data sets.
Setup data:
v1 <- c(6,7,8,9,10)
v2 <- c(2,4,6,8,10)
v3 <- c("a","b","a","b","c")
v5 <- c("a","b","c")
v6 <- c(1,2,3)
Base R:
df1 <- data.frame(v1, v2, v3)
df2 <- data.frame(v5, v6)
df1 <- merge(df1, df2, by.x = "v3", by = "v5")
df1$v4 <- df1$v2 + df1$v6
df1$v1 <- ifelse(df1$v1 > df1$v2 & df1$v1 > df1$v4, df1[["v4"]], df1[["v1"]])
df1
#> v3 v1 v2 v6 v4
#> 1 a 3 2 1 3
#> 2 a 7 6 1 7
#> 3 b 6 4 2 6
#> 4 b 9 8 2 10
#> 5 c 10 10 3 13
data.table:
library(data.table)
dt1 <- data.table(v1, v2, v3, key = "v3")
dt2 <- data.table(v5, v6, key = "v5")
dt1[dt2, v4 := v2 + v6]
dt1[v1 > v2 & v1 > v4, v1 := v4]
dt1
#> v1 v2 v3 v4
#> 1: 3 2 a 3
#> 2: 7 6 a 7
#> 3: 6 4 b 6
#> 4: 9 8 b 10
#> 5: 10 10 c 13
dplyr:
suppressPackageStartupMessages(library(dplyr))
t1 <- tibble(v1, v2, v3)
t2 <- tibble(v5, v6)
t1 %>%
inner_join(t2, by = c("v3" = "v5")) %>%
mutate(v4 = v2 + v6) %>%
mutate(v1 = case_when(
v1 > v2 & v1 > v4 ~ v4,
TRUE ~ v1
))
#> # A tibble: 5 x 5
#> v1 v2 v3 v6 v4
#> <dbl> <dbl> <chr> <dbl> <dbl>
#> 1 3 2 a 1 3
#> 2 6 4 b 2 6
#> 3 7 6 a 1 7
#> 4 9 8 b 2 10
#> 5 10 10 c 3 13
Created on 2019-04-19 by the reprex package (v0.2.1)
The general idea is always the same:
join the two tables on the character column
create new column v4 as sum of v2 and v6
update v1 to the value of v4 where v1 > v2 and v1 > v4
Note that base R and data.table do not preserve the order, so it would make more sense to put the output into an additional column.
I'am trying to extract sets of coordinates from strings and change the format.
I have tried some of the stringr package and getting nowhere with the pattern extraction.
It's my first time dealing with regex and still is a little confusing to create a pattern.
There is a data frame with one column with one or more sets of coordinates.
The only pattern (the majority) separating Lat from Long is (-), and to separate one set of coordinates to another there is a (/)
Here is an example of some of the data:
ID Coordinates
1 3438-5150
2 3346-5108/3352-5120 East island, South port
3 West coast (284312 472254)
4 28.39.97-47.05.62/29.09.13-47.44.03
5 2843-4722/3359-5122(1H-2H-3H-4F)
Most of the data is in decimal degree, e.g. (id 1 is Lat 34.38 Lon 51.50), some others is in 00º00'00'', e.g. (id 4 is Lat 28º 39' 97'' Lon 47º 05' 62'')
I will need to make in a few steps
1 - Extract all coordinates sets creating a new row for each set of each record;
2 - Extract the text label of record to a new column, concatenating them;
3- Convert the coordinates from 00º00'00''(28.39.97) to 00.0000º (28.6769 - decimal dregree) so all coordinates are in the same format. I can easily convert if they are as numeric.
4 - Add dot (.) to separate the decimal degree values (from 3438 to 34.38) and add (-) to identify as (-34.38) south west hemisphere. All value must have (-) sign.
I'am trying to get something like this:
Step 1 and 2 - Extract coordinates sets and names
ID x y label
1 3438 5150
2 3346 5108 East island, South port
2 3352 5120 East island, South port
3 284312 472254 West coast
4 28.39.97 47.05.62
4 29.09.13 47.44.03
5 2843 4722 1H-2H-3H-4F
5 3359 5122 1H-2H-3H-4F
Step 3 - convert coordinates format to decimal degree (ID 4)
ID x y label
1 3438 5150
2 3346 5108 East island, South port
2 3352 5120 East island, South port
3 284312 472254 West coast
4 286769 471005
4 291536 470675
5 2843 4722 1H-2H-3H-4F
5 3359 5122 1H-2H-3H-4F
Step 4 - change display format
ID x y label
1 -34.38 -51.50
2 -33.46 -51.08 East island, South port
2 -33.52 -51.20 East island, South port
3 -28.43 -47.22 West coast
4 -28.6769 -47.1005
4 -29.1536 -47.0675
5 -28.43 -47.22 1H-2H-3H-4F
5 -33.59 -51.22 1H-2H-3H-4F
I have edit the question to better clarify my problems and change some of my needs. I realized that it was messy to understand.
So, has anyone worked with something similar?
Any other suggestion would be of great help.
Thank you again for the time to help.
Note: the first answers address the original asking of the question and the last answer addresses its current state. The data in data1 should be set appropriately for each solution.
The following should address your first question given the data you provided and the expected output (using dplyr and tidyr).
library(dplyr)
library(tidyr)
### Load Data
data1 <- structure(list(ID = 1:4, Coordinates = c("3438-5150", "3346-5108/3352-5120",
"2843-4722/3359-5122(1H-2H-3H-4F)", "28.39.97-47.05.62/29.09.13-47.44.03"
)), .Names = c("ID", "Coordinates"), class = "data.frame", row.names = c(NA,
-4L))
### This is a helper function to transform data that is like '1234'
### but should be '12.34', and leaves alone '12.34'.
### You may have to change this based on your use case.
div100 <- function(x) { return(ifelse(x > 100, x / 100, x)) }
### Remove items like "(...)" and change "12.34.56" to "12.34"
### Split into 4 columns and xform numeric value.
data1 %>%
mutate(Coordinates = gsub('\\([^)]+\\)', '', Coordinates),
Coordinates = gsub('(\\d+[.]\\d+)[.]\\d+', '\\1', Coordinates)) %>%
separate(Coordinates, c('x.1', 'y.1', 'x.2', 'y.2'), fill = 'right', sep = '[-/]', convert = TRUE) %>%
mutate_at(vars(matches('^[xy][.]')), div100) # xform columns x.N and y.N
## ID x.1 y.1 x.2 y.2
## 1 1 34.38 51.50 NA NA
## 2 2 33.46 51.08 33.52 51.20
## 3 3 28.43 47.22 33.59 51.22
## 4 4 28.39 47.05 29.09 47.44
The call to mutate modifies Coordinates twice to make substitutions easier.
Edit
A variation that uses another regex substitution instead of mutate_at.
data1 %>%
mutate(Coordinates = gsub('\\([^)]+\\)', '', Coordinates),
Coordinates = gsub('(\\d{2}[.]\\d{2})[.]\\d{2}', '\\1', Coordinates),
Coordinates = gsub('(\\d{2})(\\d{2})', '\\1.\\2', Coordinates)) %>%
separate(Coordinates, c('x.1', 'y.1', 'x.2', 'y.2'), fill = 'right', sep = '[-/]', convert = TRUE)
Edit 2: The following solution addresses the updated version of the question
The following solution does a number of transformations to transform the data. These are separate to make it a bit easier to think about (much easier relatively speaking).
library(dplyr)
library(tidyr)
data1 <- structure(list(ID = 1:5, Coordinates = c("3438-5150", "3346-5108/3352-5120 East island, South port",
"East coast (284312 472254)", "28.39.97-47.05.62/29.09.13-47.44.03",
"2843-4722/3359-5122(1H-2H-3H-4F)")), .Names = c("ID", "Coordinates"
), class = "data.frame", row.names = c(NA, -5L))
### Function for converting to numeric values and
### handles case of "12.34.56" (hours/min/sec)
hms_convert <- function(llval) {
nres <- rep(0, length(llval))
coord3_match_idx <- grepl('^\\d{2}[.]\\d{2}[.]\\d{2}$', llval)
nres[coord3_match_idx] <- sapply(str_split(llval[coord3_match_idx], '[.]', 3), function(x) { sum(as.numeric(x) / c(1,60,3600))})
nres[!coord3_match_idx] <- as.numeric(llval[!coord3_match_idx])
nres
}
### Each mutate works to transform the various data formats
### into a single format. The 'separate' commands then split
### the data into the appropriate columns. The action of each
### 'mutate' can be seen by progressively viewing the results
### (i.e. adding one 'mutate' command at a time).
data1 %>%
mutate(Coordinates_new = Coordinates) %>%
mutate(Coordinates_new = gsub('\\([^) ]+\\)', '', Coordinates_new)) %>%
mutate(Coordinates_new = gsub('(.*?)\\(((\\d{6})[ ](\\d{6}))\\).*', '\\3-\\4 \\1', Coordinates_new)) %>%
mutate(Coordinates_new = gsub('(\\d{2})(\\d{2})(\\d{2})', '\\1.\\2.\\3', Coordinates_new)) %>%
mutate(Coordinates_new = gsub('(\\S+)[\\s]+(.+)', '\\1|\\2', Coordinates_new, perl = TRUE)) %>%
separate(Coordinates_new, c('Coords', 'label'), fill = 'right', sep = '[|]', convert = TRUE) %>%
mutate(Coords = gsub('(\\d{2})(\\d{2})', '\\1.\\2', Coords)) %>%
separate(Coords, c('x.1', 'y.1', 'x.2', 'y.2'), fill = 'right', sep = '[-/]', convert = TRUE) %>%
mutate_at(vars(matches('^[xy][.]')), hms_convert) %>%
mutate_at(vars(matches('^[xy][.]')), function(x) ifelse(!is.na(x), -x, x))
## ID Coordinates x.1 y.1 x.2 y.2 label
## 1 1 3438-5150 -34.38000 -51.50000 NA NA <NA>
## 2 2 3346-5108/3352-5120 East island, South port -33.46000 -51.08000 -33.52000 -51.20000 East island, South port
## 3 3 East coast (284312 472254) -28.72000 -47.38167 NA NA East coast
## 4 4 28.39.97-47.05.62/29.09.13-47.44.03 -28.67694 -47.10056 -29.15361 -47.73417 <NA>
## 5 5 2843-4722/3359-5122(1H-2H-3H-4F) -28.43000 -47.22000 -33.59000 -51.22000 <NA>
We can use stringi. We create a . between the 4 digit numbers with gsub, use stri_extract_all (from stringi) to extract two digit numbers followed by a dot followed by two digit numbers (\\d{2}\\.\\d{2}) to get a list output. As the list elements have unequal length, we can pad NA at the end for those elements that have shorter length than the maximum length and convert to matrix (using stri_list2matrix). After converting to data.frame, changing the character columns to numeric, and cbind with the 'ID' column of the original dataset.
library(stringi)
d1 <- as.data.frame(stri_list2matrix(stri_extract_all_regex(gsub("(\\d{2})(\\d{2})",
"\\1.\\2", data1$Coordinates), "\\d{2}\\.\\d{2}"), byrow=TRUE), stringsAsFactors=FALSE)
d1[] <- lapply(d1, as.numeric)
colnames(d1) <- paste0(c("x.", "y."), rep(1:2,each = 2))
cbind(data1[1], d1)
# ID x.1 y.1 x.2 y.2
#1 1 34.38 51.50 NA NA
#2 2 33.46 51.08 33.52 51.20
#3 3 28.43 47.22 33.59 51.22
#4 4 28.39 47.05 29.09 47.44
But, this can also be done with base R.
#Create the dots for the 4-digit numbers
str1 <- gsub("(\\d{2})(\\d{2})", "\\1.\\2", data1$Coordinates)
#extract the numbers in a list with gregexpr/regmatches
lst <- regmatches(str1, gregexpr("\\d{2}\\.\\d{2}", str1))
#convert to numeric
lst <- lapply(lst, as.numeric)
#pad with NA's at the end and convert to data.frame
d1 <- do.call(rbind.data.frame, lapply(lst, `length<-`, max(lengths(lst))))
#change the column names
colnames(d1) <- paste0(c("x.", "y."), rep(1:2,each = 2))
#cbind with the first column of 'data1'
cbind(data1[1], d1)
I have big data like this :
> Data[1:7,1]
[1] mature=hsa-miR-5087|mir_Family=-|Gene=OR4F5
[2] mature=hsa-miR-26a-1-3p|mir_Family=mir-26|Gene=OR4F9
[3] mature=hsa-miR-448|mir_Family=mir-448|Gene=OR4F5
[4] mature=hsa-miR-659-3p|mir_Family=-|Gene=OR4F5
[5] mature=hsa-miR-5197-3p|mir_Family=-|Gene=OR4F5
[6] mature=hsa-miR-5093|mir_Family=-|Gene=OR4F5
[7] mature=hsa-miR-650|mir_Family=mir-650|Gene=OR4F5
what I want to do is that, in every row, I want to select the name after word mature= and also the word after Gene= and then pater them together with
paste(a,b, sep="-")
for example, the expected output from first two rows would be like :
hsa-miR-5087-OR4F5
hsa-miR-26a-1-3p-OR4F9
so, the final implementation is like this:
for(i in 1:nrow(Data)){
Data[i,3] <- sub("mature=([^|]*).*Gene=(.*)", "\\1-\\2", Data[i,1])
Name <- strsplit(as.vector(Data[i,2]),"\\|")[[1]][2]
Data[i,4] <- as.numeric(sub("pvalue=","",Name))
print(i)
}
which work well, but it's very slow. the size of Data is very big and it has 200,000,000 rows. this implementation is very slow for that. how can I speed it up ?
If you can guarantee that the format is exactly as you specified, then a regular expression can capture (denoted by the brackets below) everything from the equals sign upto the pipe symbol, and from the Gene= to the end, and paste them together with a minus sign:
sub("mature=([^|]*).*Gene=(.*)", "\\1-\\2", Data[,1])
Another option is to use read.table with = as a separator then pasting the 2 columns:
res = read.table(text=txt,sep='=')
paste(sub('[|].*','',res$V2), ## get rid from last part here
sub('^ +| +$','',res$V4),sep='-') ## remove extra spaces
[1] "hsa-miR-5087-OR4F5" "hsa-miR-26a-1-3p-OR4F9" "hsa-miR-448-OR4F5" "hsa-miR-659-3p-OR4F5"
[5] "hsa-miR-5197-3p-OR4F5" "hsa-miR-5093-OR4F5" "hsa-miR-650-OR4F5"
The simple sub solution already given looks quite nice but just in case here are some other approaches:
1) read.pattern Using read.pattern in the gsubfn package we can parse the data into a data.frame. This intermediate form, DF, can then be manipulated in many ways. In this case we use paste in essentially the same way as in the question:
library(gsubfn)
DF <- read.pattern(text = Data[, 1], pattern = "(\\w+)=([^|]*)")
paste(DF$V2, DF$V6, sep = "-")
giving:
[1] "hsa-miR-5087-OR4F5" "hsa-miR-26a-1-3p-OR4F9" "hsa-miR-448-OR4F5"
[4] "hsa-miR-659-3p-OR4F5" "hsa-miR-5197-3p-OR4F5" "hsa-miR-5093-OR4F5"
[7] "hsa-miR-650-OR4F5"
The intermediate data frame, DF, that was produced looks like this:
> DF
V1 V2 V3 V4 V5 V6
1 mature hsa-miR-5087 mir_Family - Gene OR4F5
2 mature hsa-miR-26a-1-3p mir_Family mir-26 Gene OR4F9
3 mature hsa-miR-448 mir_Family mir-448 Gene OR4F5
4 mature hsa-miR-659-3p mir_Family - Gene OR4F5
5 mature hsa-miR-5197-3p mir_Family - Gene OR4F5
6 mature hsa-miR-5093 mir_Family - Gene OR4F5
7 mature hsa-miR-650 mir_Family mir-650 Gene OR4F5
Here is a visualization of the regular expression we used:
(\w+)=([^|]*)
Debuggex Demo
1a) names We could make DF look nicer by reading the three columns of data and the three names separately. This also improves the paste statement:
DF <- read.pattern(text = Data[, 1], pattern = "=([^|]*)")
names(DF) <- unlist(read.pattern(text = Data[1,1], pattern = "(\\w+)=", as.is = TRUE))
paste(DF$mature, DF$Gene, sep = "-") # same answer as above
The DF in this section that was produced looks like this. It has 3 instead of 6 columns and remaining columns were used to determine appropriate column names:
> DF
mature mir_Family Gene
1 hsa-miR-5087 - OR4F5
2 hsa-miR-26a-1-3p mir-26 OR4F9
3 hsa-miR-448 mir-448 OR4F5
4 hsa-miR-659-3p - OR4F5
5 hsa-miR-5197-3p - OR4F5
6 hsa-miR-5093 - OR4F5
7 hsa-miR-650 mir-650 OR4F5
2) strapplyc
Another approach using the same package. This extracts the fields coming after a = and not containing a | producing a list. We then sapply over that list pasting the first and third fields together:
sapply(strapplyc(Data[, 1], "=([^|]*)"), function(x) paste(x[1], x[3], sep = "-"))
giving the same result.
Here is a visualization of the regular expression used:
=([^|]*)
Debuggex Demo
Here is one approach:
Data <- readLines(n = 7)
mature=hsa-miR-5087|mir_Family=-|Gene=OR4F5
mature=hsa-miR-26a-1-3p|mir_Family=mir-26|Gene=OR4F9
mature=hsa-miR-448|mir_Family=mir-448|Gene=OR4F5
mature=hsa-miR-659-3p|mir_Family=-|Gene=OR4F5
mature=hsa-miR-5197-3p|mir_Family=-|Gene=OR4F5
mature=hsa-miR-5093|mir_Family=-|Gene=OR4F5
mature=hsa-miR-650|mir_Family=mir-650|Gene=OR4F5
df <- read.table(sep = "|", text = Data, stringsAsFactors = FALSE)
l <- lapply(df, strsplit, "=")
trim <- function(x) gsub("^\\s*|\\s*$", "", x)
paste(trim(sapply(l[[1]], "[", 2)), trim(sapply(l[[3]], "[", 2)), sep = "-")
# [1] "hsa-miR-5087-OR4F5" "hsa-miR-26a-1-3p-OR4F9" "hsa-miR-448-OR4F5" "hsa-miR-659-3p-OR4F5" "hsa-miR-5197-3p-OR4F5" "hsa-miR-5093-OR4F5"
# [7] "hsa-miR-650-OR4F5"
Maybe not the more elegant but you can try :
sapply(Data[,1],function(x){
parts<-strsplit(x,"\\|")[[1]]
y<-paste(gsub("(mature=)|(Gene=)","",parts[grepl("mature|Gene",parts)]),collapse="-")
return(y)
})
Example
Data<-data.frame(col1=c("mature=hsa-miR-5087|mir_Family=-|Gene=OR4F5","mature=hsa-miR-26a-1-3p|mir_Family=mir-26|Gene=OR4F9"),col2=1:2,stringsAsFactors=F)
> Data[,1]
[1] "mature=hsa-miR-5087|mir_Family=-|Gene=OR4F5" "mature=hsa-miR-26a-1-3p|mir_Family=mir-26|Gene=OR4F9"
> sapply(Data[,1],function(x){
+ parts<-strsplit(x,"\\|")[[1]]
+ y<-paste(gsub("(mature=)|(Gene=)","",parts[grepl("mature|Gene",parts)]),collapse="-")
+ return(y)
+ })
mature=hsa-miR-5087|mir_Family=-|Gene=OR4F5 mature=hsa-miR-26a-1-3p|mir_Family=mir-26|Gene=OR4F9
"hsa-miR-5087-OR4F5" "hsa-miR-26a-1-3p-OR4F9"
I have a table with 1 columns and I want to extract one among the other elements in rows and insert into new column.
lets say my table: df
V1
elements-of-01-to-20
ACTCTGCGACHCHAHAATT
elements-of-21-to-30
ACTAGCTATTATCGATATT
elements-of-31-to-40
CCCTTATATTGGAGCTACT
my desired result:
V1 V2
elements-of-01-to-20 ACTCTGCGACHCHAHAATT
elements-of-21-to-20 ACTAGCTATTATCGATATT
elements-of-31-to-20 CCCTTATATTGGAGCTACT
elements-of-31-to-40 CCCTTATATTGGAGCTACT
edited:
thanks for all replies. my second question is what if my dataset has multiple sequences followed by specific term like elements-of:
V1 => result => V1 v2
elements-of-01-to-20 elements-of-01-to-20 ACTCTGCGACHCHAHAATTAGGGGATGCTGATTTAGTA
ACTCTGCGACHCHAHAATT elements-of-21-to-30 ACTAGCTATTATCGATATT
AGGGGATGCTGATTTAGTA
elements-of-21-to-30
ACTAGCTATTATCGATATT
If the pattern is the same as in the example
indx <- c(TRUE, FALSE)
data.frame(V1=df$V1[indx], V2=df$V1[!indx])
# V1 V2
#1 elements-of-01-to-20 ACTCTGCGACHCHAHAATT
#2 elements-of-21-to-30 ACTAGCTATTATCGATATT
#3 elements-of-31-to-40 CCCTTATATTGGAGCTACT
Update
Based on the updated dataset
library(data.table)
setDT(df)[,list(V1=V1[1], V2=paste(V1[-1], collapse='')),
by= list(indx=cumsum(grepl('^[^A-Z]', df$V1)))][, indx:=NULL][]
# V1 V2
#1: elements-of-01-to-20 ACTCTGCGACHCHAHAATTAGGGGATGCTGATTTAGTA
#2: elements-of-21-to-30 ACTAGCTATTATCGATATT
New data
df <- structure(list(V1 = c("elements-of-01-to-20", "ACTCTGCGACHCHAHAATT",
"AGGGGATGCTGATTTAGTA", "elements-of-21-to-30", "ACTAGCTATTATCGATATT"
)), .Names = "V1", class = "data.frame", row.names = c(NA, -5L))
If that is just a fasta file then look at the Biostrings package. You could do it this way too
MySeq <- data.frame("Name" = df$V1[(seq(1, length(df$V1), by=2)],
"Seq" = df$V1[(seq(2, length(df$V1), by=2)],
stringsAsFactors = FALSE)
Here is another way using grepl:
#dummy data
df <- read.table(text=" V1
elements-of-01-to-20
ACTCTGCGACHCHAHAATT
elements-of-21-to-30
ACTAGCTATTATCGATATT
elements-of-31-to-40
CCCTTATATTGGAGCTACT",
as.is=TRUE,header=TRUE)
#result
cbind(df[ grepl("elements",df$V1), "V1"],
df[ !grepl("elements",df$V1), "V1"])
#output
# [,1] [,2]
# [1,] "elements-of-01-to-20" "ACTCTGCGACHCHAHAATT"
# [2,] "elements-of-21-to-30" "ACTAGCTATTATCGATATT"
# [3,] "elements-of-31-to-40" "CCCTTATATTGGAGCTACT"
Try (using traditional programming methods):
ndf = data.frame(V1="", V2="", stringsAsFactors=FALSE)
i=1
while(i<nrow(df)){
ndf[(nrow(ndf)+1),]=c(df[i,1], df[(i+1),1])
i=i+2
}
ndf[-1,]
V1 V2
2 elements-of-01-to-20 ACTCTGCGACHCHAHAATT
3 elements-of-21-to-30 ACTAGCTATTATCGATATT
4 elements-of-31-to-40 CCCTTATATTGGAGCTACT
I have a dataframe with 2 columns GL and GLDESC and want to add a 3rd column called KIND based on some data that is inside of column GLDESC.
The dataframe is as follows:
GL GLDESC
1 515100 Payroll-Indir Salary Labor
2 515900 Payroll-Indir Compensated Absences
3 532300 Bulk Gas
4 539991 Area Charge In
5 551000 Repairs & Maint-Spare Parts
6 551100 Supplies-Operating
7 551300 Consumables
For each row of the data table:
If GLDESC contains the word Payroll anywhere in the string then I want KIND to be Payroll
If GLDESC contains the word Gas anywhere in the string then I want KIND to be Materials
In all other cases I want KIND to be Other
I looked for similar examples on stackoverflow but could not find any, also looked in R for dummies on switch, grep, apply and regular expressions to try and match only part of the GLDESC column and then fill the KIND column with the kind of account but was unable to make it work.
Since you have only two conditions, you can use a nested ifelse:
#random data; it wasn't easy to copy-paste yours
DF <- data.frame(GL = sample(10), GLDESC = paste(sample(letters, 10),
c("gas", "payroll12", "GaSer", "asdf", "qweaa", "PayROll-12",
"asdfg", "GAS--2", "fghfgh", "qweee"), sample(letters, 10), sep = " "))
DF$KIND <- ifelse(grepl("gas", DF$GLDESC, ignore.case = T), "Materials",
ifelse(grepl("payroll", DF$GLDESC, ignore.case = T), "Payroll", "Other"))
DF
# GL GLDESC KIND
#1 8 e gas l Materials
#2 1 c payroll12 y Payroll
#3 10 m GaSer v Materials
#4 6 t asdf n Other
#5 2 w qweaa t Other
#6 4 r PayROll-12 q Payroll
#7 9 n asdfg a Other
#8 5 d GAS--2 w Materials
#9 7 s fghfgh e Other
#10 3 g qweee k Other
EDIT 10/3/2016 (..after receiving more attention than expected)
A possible solution to deal with more patterns could be to iterate over all patterns and, whenever there is match, progressively reduce the amount of comparisons:
ff = function(x, patterns, replacements = patterns, fill = NA, ...)
{
stopifnot(length(patterns) == length(replacements))
ans = rep_len(as.character(fill), length(x))
empty = seq_along(x)
for(i in seq_along(patterns)) {
greps = grepl(patterns[[i]], x[empty], ...)
ans[empty[greps]] = replacements[[i]]
empty = empty[!greps]
}
return(ans)
}
ff(DF$GLDESC, c("gas", "payroll"), c("Materials", "Payroll"), "Other", ignore.case = TRUE)
# [1] "Materials" "Payroll" "Materials" "Other" "Other" "Payroll" "Other" "Materials" "Other" "Other"
ff(c("pat1a pat2", "pat1a pat1b", "pat3", "pat4"),
c("pat1a|pat1b", "pat2", "pat3"),
c("1", "2", "3"), fill = "empty")
#[1] "1" "1" "3" "empty"
ff(c("pat1a pat2", "pat1a pat1b", "pat3", "pat4"),
c("pat2", "pat1a|pat1b", "pat3"),
c("2", "1", "3"), fill = "empty")
#[1] "2" "1" "3" "empty"
I personally like matching by index. You can loop grep over your new labels, in order to get the indices of your partial matches, then use this with a lookup table to simply reassign the values.
If you wanna create new labels, use a named vector.
DF <- data.frame(GL = sample(10), GLDESC = paste(sample(letters, 10),
c(
"gas", "payroll12", "GaSer", "asdf", "qweaa", "PayROll-12",
"asdfg", "GAS--2", "fghfgh", "qweee"
), sample(letters, 10),
sep = " "
))
lu <- stack(sapply(c(Material = "gas", Payroll = "payroll"), grep, x = DF$GLDESC, ignore.case = TRUE))
DF$KIND <- DF$GLDESC
DF$KIND[lu$values] <- as.character(lu$ind)
DF$KIND[-lu$values] <- "Other"
DF
#> GL GLDESC KIND
#> 1 6 x gas f Material
#> 2 3 t payroll12 q Payroll
#> 3 5 a GaSer h Material
#> 4 4 s asdf x Other
#> 5 1 m qweaa y Other
#> 6 10 y PayROll-12 r Payroll
#> 7 7 g asdfg a Other
#> 8 2 k GAS--2 i Material
#> 9 9 e fghfgh j Other
#> 10 8 l qweee p Other
Created on 2021-11-13 by the reprex package (v2.0.1)