How to create a `if` statement dynamically in R? - if-statement

How to make the following code more concise, so as not to repeat the if statement three times?
txtA <- "text A"
txtB <- "text B"
txtC <- "text C"
section <- "B"
# How to make the following code in a more programmatic manner?
if (section == "A") txtA <- tolower(txtA)
if (section == "B") txtB <- tolower(txtB)
if (section == "C") txtC <- tolower(txtC)
I don't expect a simple dplyr::case_when but something that also does the txtXX <- tolower(txtXX) more general.

Not sure, if I got your goal correct and you simply want to get rid of the explicit if statements?
I would suggest collecting all 'paragraphs' in a single object first for easier handling.
all_txt <- list(
A = "text A",
B = "text B",
C = "text C"
)
section <- 'B'
all_txt[[section]] <- tolower(all_txt[[section]])
all_txt

Related

Removing rows containing special characters

I am working on filtering out a massive dataset that reads in as a list. I need to filter out special markings and am getting stuck on some of them. Here is what I currently have:
library(R.utils)
library(stringr)
gunzip("movies.list.gz") #open file
movies <- readLines("movies.list") #read lines in
movies <- gsub("[\t]", '', movies) #remove tabs (\t)
#movies <- gsub(, '', movies)
a <- movies[!grepl("\\{", movies)] # removed any line that contained special character {
b <- a[!grepl("\\(V)", a)] #remove porn?
c <- b[!grepl("\\(TV)", b)] #remove tv
d <- c[!grepl("\\(VG)", c)] #remove video games
e <- d[!grepl("\\(\\?\\?\\?\\?\\)", d)] #remove anyhting with unknown date ex (????)
f <- e[!grepl("\\#)", e)]
g <- e[!grepl("\\!)", f)]
i <- data.frame(g)
i <- i[-c(1:15),]
i <- data.frame(i)
i$Date <- lapply(strsplit(as.character(i$i), "\\(....\\)"), "[", 2)
i$Title <- lapply(strsplit(as.character(i$i), "\\(....\\)"), "[", 1)
I still need to clean it up a bit, and remove the original column (i) but from the output you can see that it is not removing the special characters ! or #
> head(i)
i Date Title
1 "!Next?" (1994)1994-1995 1994-1995 "!Next?"
2 "#1 Single" (2006)2006-???? 2006-???? "#1 Single"
3 "#1MinuteNightmare" (2014)2014-???? 2014-???? "#1MinuteNightmare"
4 "#30Nods" (2014)2014-2015 2014-2015 "#30Nods"
5 "#7DaysLater" (2013)2013-???? 2013-???? "#7DaysLater"
6 "#ATown" (2014)2014-???? 2014-???? "#ATown"
What I actually want to do is remove the entire rows containing those special characters. Everything I have tried has thrown errors. Any suggestions?
You could sub anything that is not alphanumeric or a "-" or "()" like this:
gsub("[^A-Za-z()-]", "", row)
In order to remove the rows you can try something like the one below:
data[!grepl(pattern = "[#!]", x = data)]
In case you want to remove all the rows with special characters you can use the code suggested by #luke1018 using grepl:
data[!grepl(pattern = "[^A-Za-z0-9-()]", x = data)]

Combining lines in character vector in R

I have a character vector (content) of about 50,000 lines in R. However, some of the lines when read in from a text file are on separate lines and should not be. Specifically, the lines look something like this:
[1] hello,
[2] world
[3] ""
[4] how
[5] are
[6] you
[7] ""
I would like to combine the lines so that I have something that looks like this:
[1] hello, world
[2] how are you
I have tried to write a for loop:
for(i in 1:length(content)){
if(content[i+1] != ""){
content[i+1] <- c(content[i], content[i+1])
}
}
But when I run the loop, I get an error: missing value where TRUE/FALSE needed.
Can anyone suggest a better way to do this, maybe not even using a loop?
Thanks!
EDIT:
I am actually trying to apply this to a Corpus of documents that are all many thousands lines each. Any ideas on how to translate these solutions into a function that can be applied to the content of each of the documents?
you don't need a loop to do that
x <- c("hello,", "world", "", "how", "\nare", "you", "")
dummy <- paste(
c("\n", sample(letters, 20, replace = TRUE), "\n"),
collapse = ""
) # complex random string as a split marker
x[x == ""] <- dummy #replace empty string by split marker
y <- paste(x, collapse = " ") #make one long string
z <- unlist(strsplit(y, dummy)) #cut the string at the split marker
gsub(" $", "", gsub("^ ", "", z)) # remove space at start and end
I think there are more elegant solutions, but this might be usable for you:
chars <- c("hello,","world","","how","are","you","")
###identify groups that belong together (id increases each time a "" is found)
ids <- cumsum(chars=="")
#split vector (an filter out "" by using the select vector)
select <- chars!=""
splitted <- split(chars[select], ids[select])
#paste the groups together
res <- sapply(splitted,paste, collapse=" ")
#remove names(if necessary, probably not)
res <- unname(res) #thanks #Roland
> res
[1] "hello, world" "how are you"
Here's a different approach using data.table which is likely to be faster than for or *apply loops:
library(data.table)
dt <- data.table(x)
dt[, .(paste(x, collapse = " ")), rleid(x == "")][V1 != ""]$V1
#[1] "hello, world" "how are you"
Sample data:
x <- c("hello,", "world", "", "how", "are", "you", "")
Replace the "" with something you can later split on, and then collapse the characters together, and then use strsplit(). Here I have used the newline character since if you were to just paste it you could get the different lines on the output, e.g. cat(txt3) will output each phrase on a separate line.
txt <- c("hello", "world", "", "how", "are", "you", "", "more", "text", "")
txt2 <- gsub("^$", "\n", txt)
txt3 <- paste(txt2, collapse = " ")
unlist(strsplit(txt3, "\\s\n\\s*"))
## [1] "hello world" "how are you" "more text"
Another way to add to the mix:
tapply(x[x != ''], cumsum(x == '')[x != '']+1, paste, collapse=' ')
# 1 2 3
#"hello, world" "how are you" "more text"
Group by non-empty strings. And paste the elements together by group.

Using regular expressions in R to extract information from string

I searched the stack overflow a little and all I found was, that regex in R are a bit tricky and not convenient compared to Perl or Python.
My problem is the following. I have long file names with informations inside. The look like the following:
20150416_QEP1_EXT_GR_1234_hs_IP_NON_060.raw
20150416_QEP1_EXT_GR_1234-1235_hs_IP_NON_060.raw
20150416_QEP1_EXT_GR_1236_hs_IP_NON_060_some_other_info.raw
20150416_QEP1_EXT_GR_1237_hs_IP_NON_060
I want to extract the parts from the filename and convert them conveniently into values, for example the first part is a date, the second is machine abbreviation, the next an institute abbreviation, group abbreviation, sample number(s) etc...
What I do at the moment is constructing a regex, to make (almost) sure, I grep the correct part of the string:
regex <- '([:digit:]{8})_([:alnum:]{1,4})_([:upper:]+)_ etc'
Then I use sub to save each snipped into a variable:
date <- sub(regex, '\\1', filename)
machine <- sub(regex, '\\2', filename)
etc
This works, if the filename has the correct convention. It is overall very hard to read and I am search for a more convenient way of doing the work. I thought about splitting the filename by _ and accessing the string by index might be a good solution. But sometimes, since the filenames often get created by hand, there are terms missing or additional information in the names and I am looking for a better solution to this.
Can anyone suggest a better way of doing so?
EDIT
What I want to create is an object, which has all the information of the filenames extracted and accessible... such as my_object$machine or so....
The help page for ?regex actually gives an example that is exactly equivalent to Python's re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds") (as per your comment):
## named capture
notables <- c(" Ben Franklin and Jefferson Davis",
"\tMillard Fillmore")
#name groups 'first' and 'last'
name.rex <- "(?<first>[[:upper:]][[:lower:]]+) (?<last>[[:upper:]][[:lower:]]+)"
(parsed <- regexpr(name.rex, notables, perl = TRUE))
gregexpr(name.rex, notables, perl = TRUE)[[2]]
parse.one <- function(res, result) {
m <- do.call(rbind, lapply(seq_along(res), function(i) {
if(result[i] == -1) return("")
st <- attr(result, "capture.start")[i, ]
substring(res[i], st, st + attr(result, "capture.length")[i, ] - 1)
}))
colnames(m) <- attr(result, "capture.names")
m
}
parse.one(notables, parsed)
The normal way (i.e. the R way) to extract from a string is the following:
text <- "Malcolm Reynolds"
x <- gregexpr("\\w+", text) #Don't forget to escape the backslash
regmatches(text, x)
[[1]]
[1] "Malcolm" "Reynolds"
You can use however Perl-style group naming by using argument perl=TRUE:
regexpr("(?P<first_name>\\w+) (?P<last_name>\\w+)", text, perl=TRUE)
However regmatches does not support it, hence the need to create your own function to handle that, which is given in the help page:
parse.one <- function(res, result) {
m <- do.call(rbind, lapply(seq_along(res), function(i) {
if(result[i] == -1) return("")
st <- attr(result, "capture.start")[i, ]
substring(res[i], st, st + attr(result, "capture.length")[i, ] - 1)
}))
colnames(m) <- attr(result, "capture.names")
m
}
Applied to your example:
text <- "Malcolm Reynolds"
x <- regexpr("(?P<first_name>\\w+) (?P<last_name>\\w+)", text, perl=TRUE)
parse.one(text, x)
first_name last_name
[1,] "Malcolm" "Reynolds"
To go back to your initial problem:
filenames <- c("20150416_QEP1_EXT_GR_1234_hs_IP_NON_060.raw", "20150416_QEP1_EXT_GR_1234-1235_hs_IP_NON_060.raw", "20150416_QEP1_EXT_GR_1236_hs_IP_NON_060_some_other_info.raw", "20150416_QEP1_EXT_GR_1237_hs_IP_NON_060")
regex <- '(?P<date>[[:digit:]]{8})_(?P<machine>[[:alnum:]]{1,4})_(?P<whatev>[[:upper:]]+)'
x <- regexpr(regex,filenames,perl=TRUE)
parse.one(filenames,x)
date machine whatev
[1,] "20150416" "QEP1" "EXT"
[2,] "20150416" "QEP1" "EXT"
[3,] "20150416" "QEP1" "EXT"
[4,] "20150416" "QEP1" "EXT"

Selecting the word immediately after a keyword

I'm trying to extract the word immediately a keyword using R. I don't have a lot of experience with regular expressions so everything I've found so far doesn't help me much. If I could get the function to return multiple instances that would be ideal.
For example if my keyword was the and my string was:
The yellow log is in the stream
It would return yellow and stream.
I found this solution for c# and it seems exactly like what I want but I'm having trouble implementing it in R.
You can try
library(stringr)
str_extract_all(str1, perl('(?<=\\b(?i)The )\\w+'))[[1]]
#[1] "yellow" "stream"
Or using stringi
library(stringi)
stri_extract_all_regex(str1, '(?<=\\b(?i)The )\\w+')[[1]]
#[1] "yellow" "stream"
EDIT: Changed based on #Roland's suggestion in the comments.
data
str1 <- 'The yellow log is in the stream'
assign key to whatever string you want and use
key <- 'the'
p <- "The yellow log is in the stream"
regmatches(p, gregexpr(sprintf('(?i)(?<=%s\\s)\\w+', key), p, perl = TRUE))[[1]]
# [1] "yellow" "stream"
or as #Roland points out, it would be safer to use a word boundary around your keyword to avoid this:
key <- 'the'
p <- "The yellow log is in the stream drinking absinthe and beer"
regmatches(p, gregexpr(sprintf('(?i)(?<=%s\\s)\\w+', key), p, perl = TRUE))[[1]]
# [1] "yellow" "stream" "and"
regmatches(p, gregexpr(sprintf('(?i)(?<=\\b%s )\\w+', key), p, perl = TRUE))[[1]]
# [1] "yellow" "stream"
Here is non regex solution:
mytext <- "The yellow log is in the stream"
mykey <- "the"
x <- unlist(strsplit(mytext," "))
x[which(tolower(x)==mykey)+1]
Try this: this returns 'yellow' and 'stream'
x <- "The yellow log is in the stream"
regmatches(x, gregexpr("(?:(?:T|t)he)\\s(\\w+)", x, perl = TRUE))[[1]]
## [1] "The yellow" "the stream"
The qdapRegex package I maintain has a regular expression after_ in the regex_supplement dictionary that is perfect for this. You can use rm_ to make your own after_the function:
library(qdapRegex)
x<- "The yellow log is in the stream"
after_the <- rm_(pattern = S("#after_", "[Tt]he"), extract = TRUE)
after_the(x)
## [[1]]
## [1] "yellow" "stream"
The S function is a wrapper for sprintf that allows you to easily pass elements (like the work "the" in this case) to the base regex producing:
S("#after_", "the", "The")
## [1] "(?<=\\b(the|The)\\s)(\\w+)"
EDIT
library(qdapRegex)
x<- c("The yellow log is in the stream", "I like the one box for a pack")
after_ <- rm_(extract = TRUE)
after_the(x)
after_ <- rm_(extract = TRUE)
words <- c("the", "a", "one")
setNames(lapply(words, function(y){
after_(x, pattern = S("#after_", y, TC(y)))
}), words)
## $the
## $the[[1]]
## [1] "yellow" "stream"
##
## $the[[2]]
## [1] "one"
##
##
## $a
## $a[[1]]
## [1] NA
##
## $a[[2]]
## [1] "pack"
##
##
## $one
## $one[[1]]
## [1] NA
##
## $one[[2]]
## [1] "box"

regex split on Boolean

I wish to split a string into vectors and lists. If there is a OR or an || I want to split into lists. If there is ANDor&&split into a vector. With the word version I get it but not with the use of|and&`. Here is the code:
splitting <- function(x) {
lapply(strsplit(x, "OR|[\\|\\|]"), function(y){
strsplit(y, "AND|[\\&\\&]")
})
}
splitting("3AND4AND5OR4OR6AND7") ## desired outcome for all three
splitting("3&&4&&5||4||6&&7")
splitting("3&&4&&5OR4||6&&7")
Here is the desired outcome:
> splitting("3AND4AND5OR4OR6AND7")
[[1]]
[[1]][[1]]
[1] "3" "4" "5"
[[1]][[2]]
[1] "4"
[[1]][[3]]
[1] "6" "7"
How can I set this regex appropriately? What am I doing incorrect?
I'm not saying it's the best answer but if you've already solved the problem using "AND" and "OR" then why not reduce it down to a problem you've already solved?
splitting <- function(x) {
x <- gsub("&&", "AND", x, fixed = TRUE)
x <- gsub("||", "OR", x, fixed = TRUE)
lapply(strsplit(x, "OR|[\\|\\|]"), function(y){
strsplit(y, "AND|[\\&\\&]")
})
}
splitting("3AND4AND5OR4OR6AND7") ## desired outcome for all three
splitting("3&&4&&5||4||6&&7")
splitting("3&&4&&5OR4||6&&7")
this was just the first thing that popped into my head and I haven't really thought about if there is a better way to do it.
Also this appears to work
splitting <- function(x) {
#x <- gsub("&&", "AND", x, fixed = T)
#x <- gsub("||", "OR", x, fixed = T)
lapply(strsplit(x, "OR|\\|\\|"), function(y){
strsplit(y, "AND|\\&\\&")
})
}