I am trying to clean about 2 million entries in a database consisting of job titles. Many have several abbreviations that I wish to change to a single consistent and more easily searchable option. So far I am simply running through the column with individual mapply(gsub(...) commands. But I have about 80 changes to make this way, so it takes almost 30 minutes to run.
There has got to be a better way. I'm new to string searching, I found the *$ trick, which helped. Is there a way to do more than one search in a single mapply? I imagine that maybe faster?
Any help would be great. Thanks.
Here is some of the code below. Test is a column of 2 million individual job titles.
test <- mapply(gsub, " Admin ", " Administrator ", test)
test <- mapply(gsub, "Admin ", "Administrator ", test)
test <- mapply(gsub, " Admin*$", " Administrator", test)
test <- mapply(gsub, "Acc ", " Accounting ", test)
test <- mapply(gsub, " Admstr ", " Administrator ", test)
test <- mapply(gsub, " Anlyst ", " Analyst ", test)
test <- mapply(gsub, "Anlyst ", "Analyst ", test)
test <- mapply(gsub, " Asst ", " Assistant ", test)
test <- mapply(gsub, "Asst ", "Assistant ", test)
test <- mapply(gsub, " Assoc ", " Associate ", test)
test <- mapply(gsub, "Assoc ", "Associate ", test)
One option would be to use mgsub from library(qdap)
mgsub(patternVec, replaceVec, test)
data
patternVec <- c(" Admin ", "Admin ")
replaceVec <- c(" Administrator ", "Administrator ")
Here is a base R solution which works. You can define a data frame which will contain all patterns and their replacements. Then you use apply() in row mode and call gsub() on your test vector for each pattern/replacement combination. Here is sample code demonstrating this:
df <- data.frame(pattern=c(" Admin ", "Admin "),
replacement=c(" Administrator ", "Administrator "))
test <- c(" Admin ", "Admin ")
apply(df, 1, function(x) {
test <<- gsub(x[1], x[2], test)
})
> test
[1] " Administrator " "Administrator "
Related
Cell A1 contains multiple strings, eg "CAT DOG RAT GNU";
Cell B1 contains multiple strings, eg "RAT CAT";
How can I run a test (using formula in C1) to find if all the strings in B1 are present in cell A1?
Returning true/false would be good
Strings not necessarily in the same order, as example above
The number of items can vary
Multiple instances not a problem, so long as they're there
But returns true only if all items in cell B1 are present in cell A1.
So far I've tried transposed-split arrays with vlookups and matches, counts, etc, but nothing working for me. (And maybe regex won't do it as can't loop for each string?)
you can try:
=ARRAYFORMULA(IF(PRODUCT(N(NOT(ISNA(REGEXEXTRACT(SPLIT(B1, " "),
SUBSTITUTE(A1, " ", "|"))))))=1, TRUE))
for more precision you can do:
=ARRAYFORMULA(IF(PRODUCT(N(NOT(ISNA(REGEXEXTRACT(SPLIT(B1, " "),
"^"&SUBSTITUTE(A1, " ", "$|^")&"$")))))=1, TRUE))
then for case insensivity:
=ARRAYFORMULA(IF(PRODUCT(N(NOT(ISNA(REGEXEXTRACT(SPLIT(LOWER(B1), " "),
"^"&SUBSTITUTE(LOWER(A1), " ", "$|^")&"$")))))=1, TRUE))
and true ArrayFormula would be:
=ARRAYFORMULA(IF((A1:A<>"")*(B1:B<>""), IF(REGEXMATCH(TRANSPOSE(QUERY(TRANSPOSE(IFERROR(
REGEXMATCH(IF(SPLIT(B1:B, " ")<>"", SPLIT(LOWER(B1:B), " "), 0),
"^"&SUBSTITUTE(LOWER(A1:A), " ", "$|^")&"$"))),,999^99)), "FALSE"), FALSE, TRUE), ))
I've searched many times, and haven't found the answer here or elsewhere. I want to replace each space ' ' in variables containing file names with a '\ '. (A use case could be for shell commands, with the spaces escaped, so each file name doesn't appear as a list of arguments.) I have looked through the StackOverflow question "how to replace single backslash in R", and find that many combinations do work as advertised:
> gsub(" ", "\\\\", "a b")
[1] "a\\b"
> gsub(" ", "\\ ", "a b", fixed = TRUE)
[1] "a\\ b"
but try these with a single-slash version, and R ignores it:
> gsub(" ", "\\ ", "a b")
[1] "a b"
> gsub(" ", "\ ", "a b", fixed = TRUE)
[1] "a b"
For the case going in the opposite direction — removing slashes from a string, it works for two:
> gsub("\\\\", " ", "a\\b")
[1] "a b"
> gsub("\\", " ", "a\\b", fixed = TRUE)
[1] "a b"
However, for single slashes some inner perversity in R prevents me from even attempting to remove them:
> gsub("\\", " ", "a\\b")
Error in gsub("\\", " ", "a\\b") :
invalid regular expression '\', reason 'Trailing backslash'
> gsub("\", " ", "a\b", fixed = TRUE)
Error: unexpected string constant in "gsub("\", " ", ""
The 'invalid regular expression' is telling us something, but I don't see what. (Note too that the perl = True option does not help.)
Even with three back slashes R fails to notice even one:
> gsub(" ", "\\\ ", "a b")
[1] "a b"
The patter extends too! Even multiples of two work:
> gsub(" ", "\\\\\\\\", "a b")
[1] "a\\\\b"
but not odd multiples (should get '\\\ ':
> gsub(" ", "\\\\\\ ", "a b")
[1] "a\\ b"
> gsub(" ", "\\\ ", "a b", fixed = TRUE)
[1] "a\\ b"
(I would expect 3 slashes, not two.)
My two questions are:
How can my goal of replacing a ' ' with a '\ ' be accomplished?
Why did the odd number-slash variants of the replacements fail, while the even number-slash replacements worked?
For shell commands a simple work-around is to quote the file names, but part of my interest is just wanting to understand what is going on with R's regex engine.
Get ready for a face-palm, because this:
> gsub(" ", "\\\ ", "a b", fixed = TRUE)
[1] "a\\ b"
is actually working.
The two backslashes you see are just the R console's way of displaying a single backslash, which is escaped when printed to the screen.
To confirm the replacement with a single backslash is indeed working, try writing the output to a text file and inspect yourself:
f <- file("C:\\output.txt")
writeLines(gsub(" ", "\\", "a b", fixed = TRUE), f)
close(f)
In output.txt you should see the following:
a\b
Very helpful discussion! (I've been Googling the heck out of this for 2 days.)
Another way to see the difference (rather than writing to a file) is to compare the contents of the string using print and cat.
z <- gsub(" ", "\\", "a b", fixed = TRUE)
> print(z)
[1] "a\\ b"
> cat(z)
a\ b
So, by using cat instead of print we can confirm that the gsub line is doing what was intended when we're trying to add single backslashes to a string.
Building on top of two questions I previously asked:
R: How to prevent memory overflow when using mgsub in vector mode?
gsub speed vs pattern length
I do like suggestions on usage of fixed=TRUE by #Tyler as it speeds up calculations significantly. However, it's not always applicable. I need to substitute, say, caps as a stand-alone word w/ or w/o punctuation that surrounds it. A priori it's not know what can follow or precede the word, but it must be any of regular punctuation signs (, . ! - + etc). It cannot be a number or a letter. Example below. capsule must stay as is.
i = "Here is the capsule, caps key, and two caps, or two caps. or even three caps-"
orig = "caps"
change = "cap"
gsub_FixedTrue <- function(i) {
i = paste0(" ", i, " ")
orig = paste0(" ", orig, " ")
change = paste0(" ", change, " ")
i = gsub(orig,change,i,fixed=TRUE)
i = gsub("^\\s|\\s$", "", i, perl=TRUE)
return(i)
}
#Second fastest, doesn't clog memory
gsub_FixedFalse <- function(i) {
i = gsub(paste0("\\b",orig,"\\b"),change,i)
return(i)
}
print(gsub_FixedTrue(i)) #wrong
print(gsub_FixedFalse(i)) #correct
Results. Second output is desired
[1] "Here is the capsule, cap key, and two caps, or two caps. or even three caps-"
[1] "Here is the capsule, cap key, and two cap, or two cap. or even three cap-"
Using parts from your previous question to test I think we can put a place holder in front of punctuation as follows, without slowing it down too much:
line <- c("one", "two one", "four phones", "and a capsule", "But here's a caps key",
"Here is the capsule, caps key, and two caps, or two caps. or even three caps-" )
e <- c("one", "two", "caps")
r <- c("ONE", "TWO", "cap")
line <- rep(line, 1700000/length(line))
line <- gsub("([[:punct:]])", " <DEL>\\1<DEL> ", line, perl=TRUE)
## Start
line2 <- paste0(" ", line, " ")
e2 <- paste0(" ", e, " ")
r2 <- paste0(" ", r, " ")
for (i in seq_along(e2)) {
line2 <- gsub(e2[i], r2[i], line2, fixed=TRUE)
}
gsub("^\\s|\\s$| <DEL>|<DEL> ", "", line2, perl=TRUE)
I have this string:
c <- "thethirsty thirsty itthirsty (thirsty) is"
I want the output to be as
"thethirsty thirsty itthirsty no is"
This is what I am trying.
gsub(" (thirsty) ", " no ", c)
This is what I am getting. Why does not it work? And suggest an alternative to do this.
"thethirsty no itthirsty (thirsty) is"
By default gsub interprets the first parameter as a regular expression. You don't want that and should set fixed=TRUE:
gsub(" (thirsty) ", " no ", c, fixed=TRUE)
#[1] "thethirsty thirsty itthirsty no is"
I am on the lookout for two R functions that would perform the following string rearrangements:
(1) place the parts following a ", " in a string at the start of a string, e.g.
name="2,6-Octadien-1-ol, 3,7-dimethyl-, (E)-"
should yield
"(E)-3,7-dimethyl-2,6-Octadien-1-ol"
(note that there could be any number of ", " in a string, or none at all, and that the parts after the ", " should be placed at the start of the string successively, starting from the end of the string. What would be the most efficient way of achieving this in R (without using loops etc)?
(2) place the parts between "<" and ">" at the start of a string and remove any ", ".
E.g.
name="Pyrazine <2-acetyl-, 3-ethyl->"
should yield
"2-acetyl-3-ethyl-Pyrazine"
(this is a simpler gsub problem, right?)
The part between the "<" and ">" could be in any place in the string though.
E.g.
name="Cyclohexanol <4-tertbutyl-> acetate"
should yield
"4-tertbutyl-Cyclohexanol acetate"
Any thoughts would be welcome!
cheers,
Tom
For the first problem:
name <- c("2,6-Octadien-1-ol, 3,7-dimethyl-, (E)-",
"2,6-Octadien-1-ol,3,7-dimethyl-,(E)-")
sapply(strsplit(name, "(?<!\\d), ?", perl = TRUE), function(x)
paste(rev(x), collapse = ""))
# [1] "(E)-3,7-dimethyl-2,6-Octadien-1-ol" "(E)-3,7-dimethyl-2,6-Octadien-1-ol"
For the second problem:
name <- c("Pyrazine <2-acetyl-, 3-ethyl->",
"Cyclohexanol <4-tertbutyl-> acetate")
inside <- gsub(", ", "", sub("^.*<(.+)>.*$", "\\1", name))
outside <- sub("^(.*) <.*>(.*)$" , "\\1\\2", name)
paste0(inside, outside)
# [1] "2-acetyl-3-ethyl-Pyrazine" "4-tertbutyl-Cyclohexanol acetate"