substring characters from a column in a data.table in R - regex

Is there a more "r" way to substring two meaningful characters out of a longer string from a column in a data.table?
I have a data.table that has a column with "degree strings"... shorthand code for the degree someone got and the year they graduated.
> srcDT<- data.table(
alum=c("Paul Lennon","Stevadora Nicks","Fred Murcury"),
degree=c("W72","WG95","W88")
)
> srcDT
alum degree
1: Paul Lennon W72
2: Stevadora Nicks WG95
3: Fred Murcury W88
I need to extract the digits of the year from the degree, and put it in a new column called "degree_year"
No problem:
> srcDT[,degree_year:=substr(degree,nchar(degree)-1,nchar(degree))]
> srcDT
alum degree degree_year
1: Paul Lennon W72 72
2: Stevadora Nicks WG95 95
3: Fred Murcury W88 88
If only it were always that simple.
The problem is, the degree strings only sometimes look like the above. More often, they look like this:
srcDT<- data.table(
alum=c("Ringo Harrison","Brian Wilson","Mike Jackson"),
degree=c("W72 C73","WG95 L95","W88 WG90")
)
I am only interested in the 2 numbers next to the characters I care about: W & WG (and if both W and WG are there, I only care about WG)
Here's how I solved it:
x <-srcDT$degree ##grab just the degree column
z <-character() ## create an empty character vector
degree.grep.pattern <-c("WG[0-9][0-9]","W[0-9][0-9]")
## define a vector of regex's, in the order
## I want them
for(i in 1:length(x)){ ## loop thru all elements in degree column
matched=F ## at the start of the loop, reset flag to F
for(j in 1:length(degree.grep.pattern)){
## loop thru all elements of the pattern vector
if(length(grep(degree.grep.pattern[j],x[i]))>0){
## see if you get a match
m <- regexpr(degree.grep.pattern[j],x[i])
## if you do, great! grab the index of the match
y<-regmatches(x[i],m) ## then subset down. y will equal "WG95"
matched=T ## set the flag to T
break ## stop looping
}
## if no match, go on to next element in pattern vector
}
if(matched){ ## after finishing the loop, check if you got a match
yr <- substr(y,nchar(y)-1,nchar(y))
## if yes, then grab the last 2 characters of it
}else{
#if you run thru the whole list and don't match any pattern at all, just
# take the last two characters from the affilitation
yr <- substr(x[i],nchar(as.character(x[i]))-1,nchar(as.character(x[i])))
}
z<-c(z,yr) ## add this result (95) to the character vector
}
srcDT$degree_year<-z ## set the column to the results.
> srcDT
alum degree degree_year
1: Ringo Harrison W72 C73 72
2: Brian Wilson WG95 L95 95
3: Mike Jackson W88 WG90 90
This works. 100% of the time. No errors, no mis-matches.
The problem is: it doesn't scale. Given a data table with 10k rows, or 100k rows, it really slows down.
Is there a smarter, better way to do this? This solution is very "C" to me. Not very "R."
Thoughts on improvement?
Note: I gave a simplified example. In the actual data, there are about 30 different possible combinations of degrees, and combined with different years, there are something like 540 unique combinations of degree strings.
Also, I gave the degree.grep.pattern with only 2 patterns to match. In the actual work I'm doing, there are 7 or 8 patterns to match.

As it seem (per OPs) comments, there is no situation of "WG W", then a simple regex solution should do the job
srcDT[ , degree_year := gsub(".*WG?(\\d+).*", "\\1", degree)]
srcDT
# alum degree degree_year
# 1: Ringo Harrison W72 C73 72
# 2: Brian Wilson WG95 L95 95
# 3: Mike Jackson W88 WG90 90

Here's a solution based on the assumption that want the most recent degree with W in it:
regex <- "(?<=W|(?<=W)G)[0-9]{2}"
srcDT[ , degree_year :=
sapply(regmatches(degree,
gregexpr(regex, degree, perl = TRUE)),
function(x) max(as.integer(x)))]
> srcDT
alum degree degree_year
1: Ringo Harrison W72 C73 72
2: Brian Wilson WG95 L95 95
3: Mike Jackson W88 WG90 90
You said:
I gave the degree.grep.pattern with only 2 patterns to match. In the actual work I'm doing, there are 7 or 8 patterns to match.
But I'm not sure what this means. There are more options besides W and WG?

Here is one quick hack:
# split all words from degree and order so that WG is before W
words <- lapply(strsplit(srcDT$degree, " "), sort, decreasing=TRUE)
# obtain tags for each row (getting only first. But works since ordered)
tags <- mapply(Find, list(function(x) grepl("^WG|^W", x)), words)
# simple gsub to remove WG and W
(result <- gsub("^WG|^W", "", tags))
[1] "72" "95" "90"
Fast with 100k rows.

A solution without regular expressions, it's quite slow as it creates a sparse table... but it's clean and flexible so i leave it here.
First I split the degreeyears by space, then browse through them and build a clean structured table with one column per degree, that i fill it with years.
degreeyear_split <- sapply(srcDT$degree,strsplit," ")
for(i in 1:nrow(srcDT)){
for (degree_year in degreeyear_split[[i]]){
n <- nchar(degree_year)
degree <- substr(degree_year,1,n-2)
year <- substr(degree_year,n-1,n)
srcDT[i,degree] <- year
}}
Here I have my structure table, I paste W on the year i'm interested in, then paste WG on top of it.
srcDT$year <- srcDT$W
srcDT$year[srcDT$WG!=""]<-srcDT$WG[srcDT$WG!=""]
Then here's you result:
srcDT
alum degree W C WG L year
1: Ringo Harrison W72 C73 72 73 72
2: Brian Wilson WG95 L95 95 95 95
3: Mike Jackson W88 WG90 88 90 90

Related

Subtracting every two columns

Imagine I have a dataframe like this (or the names of all months)
set.seed(1)
mydata <- data.frame()
mydata <- rbind(mydata,c(1,round(runif(20),3)))
mydata <- rbind(mydata,c(2,round(runif(20),3)))
mydata <- rbind(mydata,c(3,round(runif(20),3)))
colnames(mydata) <- c("id", paste0(rep(c('Mary', 'Bob', 'Dylan', 'Tom', 'Jane', 'Sam', 'Tony', 'Luke', 'John', "Pam"), each=2), 1:2))
.
id Mary1 Mary2 Bob1 Bob2 Dylan1 Dylan2 Tom1 Tom2 Jane1 Jane2 Sam1 Sam2 Tony1 Tony2 Luke1 Luke2 John1 John2 Pam1 Pam2
1 0.266 0.372 0.573 0.908 0.202 0.898 0.945 0.661 0.629 0.062 0.206 0.177 0.687 0.384 0.770 0.498 0.718 0.992 0.380 0.777
2 0.935 0.212 0.652 0.126 0.267 0.386 0.013 0.382 0.870 0.340 0.482 0.600 0.494 0.186 0.827 0.668 0.794 0.108 0.724 0.411
3 0.821 0.647 0.783 0.553 0.530 0.789 0.023 0.477 0.732 0.693 0.478 0.861 0.438 0.245 0.071 0.099 0.316 0.519 0.662 0.407
Usually with many more columns.
And I want to add columns (it's up to you to decide to add them to the right, or create a new dataframe with these new columns) substracting every two.. (*)
id, Mary1-Mary2, Bob1-Bob2, Dylan1-Dylan2, Tom1-Tom2, Jane1-Jane2,...
This operation is quite common.
I'd like to do it by name, not by position, to prevent problems if they are not consecutive.
It could even happen that some columns don't have it's "twin" column, just leave as is, or ignore this complication now.
(*) The names of the columns have a prefix and a number.
Instead of just substracting two columns I could have groups of 5 and I may want to do something such as adding all numbers. A generic solution would be great.
I first tried to do it by convert it to long format, later operate with aggregate, and convert it back to wide format, but maybe it's much easier to do it directly in wide format. I know the problem is mainly related to use efficiently regular expressions.
R, data.table or dplyr, long format splitting colnames
I don't mind the speed but the simplest solution.
Any package is wellcome.
PD: All your codes fail if I add a lonely column.
set.seed(1)
mydata <- data.frame()
mydata <- rbind(mydata,c(1,round(runif(21),3)))
mydata <- rbind(mydata,c(2,round(runif(21),3)))
mydata <- rbind(mydata,c(3,round(runif(21),3)))
colnames(mydata) <- c(c("id", paste0(rep(c('Mary', 'Bob', 'Dylan', 'Tom', 'Jane', 'Sam', 'Tony', 'Luke', 'John', "Pam"), each=2), 1:2)),"Lola" )
I know I could filter it out manually but it would be better if the result is the difference (*) of every pair and leave alone the lonely column. (In case of differences of groups of size two)
The best option would be not manually remove the first column but split all columns in single and multiple columns.
How about using base R:
cn <- unique(gsub("\\d", "", colnames(mydata)))[-1]
sapply(cn, function(x) mydata[[paste0(x, 1)]] - mydata[[paste0(x, 2)]] )
You can use this approach for any arbitrary number of groups. For example this would return the row sums across the names with the suffix 1 or 2.:
sapply(cn, function(x) rowSums(mydata[, paste0(x, 1:2)]))
This paste approach could be replaced by regular expressions for more general applications.
You can do something like,
sapply(unique(sub('\\d', '', names(mydata[,-1]))),
function(i) Reduce('-', mydata[,-1][,grepl(i, sub('\\d', '', names(mydata[,-1])))]))
# Mary Bob Dylan Tom Jane Sam Tony Luke John Pam
#[1,] -0.106 -0.335 -0.696 0.284 0.567 0.029 0.303 0.272 -0.274 -0.397
#[2,] 0.723 0.526 -0.119 -0.369 0.530 -0.118 0.308 0.159 0.686 0.313
#[3,] 0.174 0.230 -0.259 -0.454 0.039 -0.383 0.193 -0.028 -0.203 0.255
As per your comment, we can easily sort the columns and then apply the formula above,
sorted.names <- names(mydata)[order(nchar(names(mydata)), names(mydata))]
mydata <- mydata[,sorted.names]
This solution handles an arbitrary number of twins.
## return data frame
twin.vars <- function(prefix, df) {
df[grep(paste0(prefix, '[0-9]+$'), names(df))]
}
pfx <- unique(sub('[0-9]*$', '', names(mydata[-1])))
tmp <- lapply(pfx, function(x) Reduce(`-`, twin.vars(x, mydata)))
cbind(id=mydata$id, as.data.frame(setNames(tmp, pfx)))
OK, I've chosen #NBATrends solution because it works well almost always and he was the first.
Anyway, I add my little contribution, just in case anybody is interested:
runs <- rle(sort(sub('\\d$', '', names(mydata))))
sapply(runs[[2]][runs[[1]]>1], function(x) mydata[[paste0(x, 1)]] - mydata[[paste0(x, 2)]] )
The only "problem" is that it changes the final order, but you don't need to manually remove isolated columns, and works for disordered columns too.
I'm perplexed because nobody posted a solution with dplyr or data.table :)

Extracting everything after first two words in R

I am trying to extract all the info, using a regular expression in R, after the first number and first word of an entry in a data frame.
For example:
Header =
c("2006 Volvo XC70",
"2012 Ford Econoline Cargo Van E-250 Commercial",
"2012 Nissan Frontier",
"2012 Kia Soul 5dr Wagon Automatic")
I want to write a pattern that will grab Volvo XC70, or Econoline Cargo Van E-250 Commercial (everything after the year and make) from an entry in my "header" column so that I may run the function on my data frame and create a new "model" column. I can't figure out a pattern that will allow me to skip the first string of integers, then a space, then the first string of characters, and then a space, and then grab everything proceeding.
Any help would be appreciated. Thanks!
Just use sub.
sub("^\\d+\\s+\\w+\\s+", "", df$x)
Example:
x <- "2012 Ford Econoline Cargo Van E-250 Commercial"
sub("^\\d+\\s+\\w+\\s+", "", x)
# [1] "Econoline Cargo Van E-250 Commercial"
For this task, I would fetch a basic list using the XML package:
library(XML)
doc <- xmlParse('http://www.fueleconomy.gov/ws/rest/ympg/shared/menu/make')
Now that we fetched the XML data we can create a vector with the car makes:
mk <- xpathSApply(doc, '//value', xmlValue)
Finally, I'll compile the pattern and play around with sprintf and sub:
df$Makes <- sub(sprintf('\\d+ (?:%s) ', paste(mk, collapse='|')), '', df$Header)
Output:
## Header
# 1 2006 Volvo XC70
# 2 2012 Ford Econoline Cargo Van E-250 Commercial
# 3 2012 Nissan Frontier
# 4 2012 Kia Soul 5dr Wagon Automatic
## Makes
# 1 XC70
# 2 Econoline Cargo Van E-250 Commercial
# 3 Frontier
# 4 Soul 5dr Wagon Automatic

Grouping Similar words/phrases

I have a frequency table of words which looks like below
> head(freqWords)
employees work bose people company
1879 1804 1405 971 959
employee
100
> tail(freqWords)
youll younggood yoyo ytd yuorself zeal
1 1 1 1 1 1
I want to create another frequency table which will combine similar words and add their frequencies
In above example, my new table should contain both employee and employees as one element with a frequency of 1979. For example
> head(newTable)
employee,employees work bose people
1979 1804 1405 971
company
959
I know how to find out similar words (using adist, stringdist) but I am unable to create the frequency table. For instance I can use following to get a list of similar words
words <- names(freqWords)
lapply(words, function(x) words[stringdist(x, words) < 3])
and following to get a list of similar phrases of two words
lapply(words, function(x) words[stringdist2(x, words) < 3])
where stringdist2 is follwoing
stringdist2 <- function(word1, word2){
min(stringdist(word1, word2),
stringdist(word1, gsub(word2,
pattern = "(.*) (.*)",
repl="\\2,\\1")))
}
I do not have any punctuation/special symbols in my words/phrases. (I do not know a lot of R; I created stringdist2 by tweaking an implementation of adist2 I found here but I do not understand everything about how pattern and repl works)
So I need help to create new frequency table.

Split one column into two columns and retaining the seperator

I have a very large data array:
'data.frame': 40525992 obs. of 14 variables:
$ INSTNM : Factor w/ 7050 levels "A W Healthcare Educators"
$ Total : Factor w/ 3212 levels "1","10","100",
$ Crime_Type : Factor w/ 72 levels "MURD11","NEG_M11",
$ Count : num 0 0 0 0 0 0 0 0 0 0 ...
The Crime_Type column contains the type of Crime and the Year, so "MURD11" is Murder in 2011. These are college campus crime statistics my kid is analyzing for her school project, I am helping when she is stuck. I am currently stuck at creating a clean data file she can analyze
Once i converted the wide file (all crime types '9' in columns) to a long file using 'gather' the file size is going from 300MB to 8 GB. The file I am working on is 8GB. do you that is the problem. How do i convert it to a data.table for faster processing?
What I want to do is to split this 'Crime_Type' column into two columns 'Crime_Type' and 'Year'. The data contains alphanumeric and numbers. There are also some special characters like NEG_M which is 'Negligent Manslaughter'.
We will replace the full names later but can some one suggest on how I separate
MURD11 --> MURD and 11 (in two columns)
NEG_M10 --> NEG_M and 10 (in two columns)
etc...
I have tried using,
df <- separate(totallong, Crime_Type, into = c("Crime", "Year"), sep = "[:digit:]", extra = "merge")
df <- separate(totallong, Crime_Type, into = c("Year", "Temp"), sep = "[:alpha:]", extra = "merge")
The first one separates the Crime as it looks for numbers. The second one does not work at all.
I also tried
df$Crime_Type<- apply (strsplit(as.character(df$Crime_Type), split="[:digit:]"))
That does not work at all. I have gone through many posts on stack-overflow and thats where I got these commands but I am now truly stuck and would appreciate your help.
Since you're using tidyr already (as evidenced by separate), try the extract function, which, given a regex, puts each captured group into a new column. The 'Crime_Type' is all the non-numeric stuff, and the 'Year' is the numeric stuff. Adjust the regex accordingly.
library(tidyr)
extract(df, 'Crime_Type', into=c('Crime', 'Year'), regex='^([^0-9]+)([0-9]+)$')
In base R, one option would be to create a unique delimiter between the non-numeric and numeric part. We can capture as a group the non-numeric ([^0-9]+) and numeric ([0-9]+) characters by wrapping it inside the parentheses ((..)) and in the replacement we use \\1 for the first capture group, followed by a , and the second group (\\2). This can be used as input vector to read.table with sep=',' to read as two columns.
df1 <- read.table(text=gsub('([^0-9]+)([0-9]+)', '\\1,\\2',
totallong$Crime_Type),sep=",", col.names=c('Crime', 'Year'))
df1
# Crime Year
#1 MURD 11
#2 NEG_M 11
If we need, we can cbind with the original dataset
cbind(totallong, df1)
Or in base R, we can use strsplit with split specifying the boundary between non-number ((?<=[^0-9])) and a number ((?=[0-9])). Here we use lookarounds to match the boundary. The output will be a list, we can rbind the list elements with do.call(rbind and convert it to data.frame
as.data.frame(do.call(rbind, strsplit(as.character(totallong$Crime_Type),
split="(?<=[^0-9])(?=[0-9])", perl=TRUE)))
# V1 V2
#1 MURD 11
#2 NEG_M 11
Or another option is tstrsplit from the devel version of data.table ie. v1.9.5. Here also, we use the same regex. In addition, there is option to convert the output columns into different class.
library(data.table)#v1.9.5+
setDT(totallong)[, c('Crime', 'Year') := tstrsplit(Crime_Type,
"(?<=[^0-9])(?=[0-9])", perl=TRUE, type.convert=TRUE)]
# Crime_Type Crime Year
#1: MURD11 MURD 11
#2: NEG_M11 NEG_M 11
If we don't need the 'Crime_Type' column in the output, it can be assigned to NULL
totallong[, Crime_Type:= NULL]
NOTE: Instructions to install the devel version are here
Or a faster option would be stri_extract_all from library(stringi) after collapsing the rows to a single string ('v2'). The alternate elements in 'v3' can be extracted by indexing with seq to create new data.frame
library(stringi)
v2 <- paste(totallong$Crime_Type, collapse='')
v3 <- stri_extract_all(v2, regex='\\d+|\\D+')[[1]]
ind1 <- seq(1, length(v3), by=2)
ind2 <- seq(2, length(v3), by=2)
d1 <- data.frame(Crime=v3[ind1], Year= v3[ind2])
Benchmarks
v1 <- do.call(paste, c(expand.grid(c('MURD', 'NEG_M'), 11:15), sep=''))
set.seed(24)
test <- data.frame(v1= sample(v1, 40525992, replace=TRUE ))
system.time({
v2 <- paste(test$v1, collapse='')
v3 <- stri_extract_all(v2, regex='\\d+|\\D+')[[1]]
ind1 <- seq(1, length(v3), by=2)
ind2 <- seq(2, length(v3), by=2)
d1 <- data.frame(Crime=v3[ind1], Year= v3[ind2])
})
#user system elapsed
#56.019 1.709 57.838
data
totallong <- data.frame(Crime_Type= c('MURD11', 'NEG_M11'))

Ectracting some content between two words in R

If this is the test string -
alt="mass |36 grams\nserving volume | 63 mL (milliliters)\nserving density | 0.57 g\/cm^3 (grams per cubic centimeter)" title="mass | 36 grams.
\btitle="mass| \b.*+\s*+\K.*(?=serving volume\b)
This is my code but it does not return what is required.
Then how to extract 36 grams from this text?
It would be great if someone could share a link from where I can learn regex.
gsub('mass \\|([0-9]* [A-Za-z]*).*', '\\1', alt)
[1] "36 grams"
To exclude the unit:
gsub('mass \\|([0-9]*).*', '\\1', alt)
[1] "36"
Careful with the extra space, it will be captured too. This is not what you want:
gsub('mass \\|([0-9]* ).*', '\\1', alt)
[1] "36 "
For the example you gave this will work, but depending on what you want to do you might need something more general:
alt<-"mass |36 grams\nserving volume | 63 mL (milliliters)\nserving density | 0.57 g/cm^3 (grams per cubic centimeter)"
gsub(".*\\|([0-9]+ gram).*","\\1",alt)
[1] "36 gram"
Did you try with:
/mass \|([a-zA-Z-0-9\s]+)\sserving volume/