Extracting text in R

Extracting text in R - regex

I am trying to extract a variable-length substring of text using R. I have several character strings such as the following:
"\"/Users/Nel/Documents/Project/Data/dataset.csv\""
I need to extract the file path from each such string. In this case what I am trying to get is:
path1 <- "/Users/Nel/Documents/Project/Data/dataset.csv"
I am able to use the substr function:
path1 <- substr("\"/Users/Nel/Documents/Project/Data/dataset.csv\"", 2, 46)
with the indices hard-coded to get what I want in this particular instance. However, this particular path is one of many, and I need to be able to find these indices on the fly. I believe the grep() function could work but I can't figure out the relevant regular expressions. Thanks.

It seems like you are just trying to remove some hard-coded quotation marks.
Try gsub:
x
# [1] "\"/Users/Nel/Documents/Project/Data/dataset.csv\""
gsub('\"',"",x)
# [1] "/Users/Nel/Documents/Project/Data/dataset.csv"
## or
# gsub('["]', "", x)
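If only the enclosing quotation marks should go (leaving any quote inside the path untouched), anchoring the pattern is one option. A minimal sketch, assuming the inputs sit in a character vector; the name quoted_paths and the second path are made up for illustration:

```r
# Hypothetical input: file paths wrapped in literal double quotes.
quoted_paths <- c(
  "\"/Users/Nel/Documents/Project/Data/dataset.csv\"",
  "\"/Users/Nel/Documents/Project/Data/other.csv\""
)

# ^" matches a quote at the very start, "$ a quote at the very end.
# gsub is vectorised, so every element is cleaned in a single call.
paths <- gsub("^\"|\"$", "", quoted_paths)
paths
```

Because no indices are involved, the same call works regardless of how long each path is.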

Related

Use lapply on a subset of list elements and return list of same length as original in R

I want to apply a regex operation to a subset of list elements (which are character strings) using lapply and return a list of same length as the original. The list elements are long strings (derived from reading in long text files and collapsing paragraphs into a single string). The regex operation is valid only for the subset of list elements/strings. I want the non-subsetted list elements (character strings) to be returned in their original state.
The regex operation is str_extract from the stringr package, i.e. I want to extract a substring from a longer string. I subset the list elements based on a regex pattern in the filename.
An example with simplified data:
library(stringr)
texts <- as.list(c("abcdefghijkl", "mnopqrstuvwxyz", "ghijklmnopqrs", "uvwxyzabcdef"))
filenames <- c("AB1997R.txt", "BG2000S.txt", "MN1999R.txt", "DC1997S.txt")
names(texts) <- filenames
regexp <- "abcdef"
I know in advance to which strings I want to apply the regex operation, and hence I want to subset these strings. That is, I don't want to run the regex over all elements in the list, as doing so will return some invalid results (which is not apparent in this simplified example).
I've made a few naive efforts, e.g.:
x <- lapply(texts[str_detect(names(texts), "1997")], str_extract, regexp)
> x
$AB1997R.txt
[1] "abcdef"
$DC1997S.txt
[1] "abcdef"
which returns a reduced-length list containing just the substrings found.
But the results I want to get are:
> x
$AB1997R.txt
[1] "abcdef"
$BG2000S.txt
[1] "mnopqrstuvwxyz"
$MN1999R.txt
[1] "ghijklmnopqrs"
$DC1997S.txt
[1] "abcdef"
where the strings not containing the regex pattern are returned in their original state.
I have informed myself about stringr, lapply and llply (in the plyr package), but many operations are illustrated using dataframes as examples, not lists, and don't involve regex operations on character strings. I can achieve my goal using a for loop, but I'm trying to get away from that, as is generally advised, and get better at using the apply-class of functions.

You can use the subset assignment operator [<-:
x <- texts
is1997 <- str_detect(names(texts), "1997")
x[is1997] <- lapply(texts[is1997], str_extract, regexp)
x
# $AB1997R.txt
# [1] "abcdef"
#
# $BG2000S.txt
# [1] "mnopqrstuvwxyz"
#
# $MN1999R.txt
# [1] "ghijklmnopqrs"
#
# $DC1997S.txt
# [1] "abcdef"
#
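The same subset-assignment idea also works without stringr. The following is a base R sketch, not the answer's own code: extract_first is a hypothetical helper that mimics str_extract by returning NA when nothing matches.

```r
texts <- list(
  "AB1997R.txt" = "abcdefghijkl",
  "BG2000S.txt" = "mnopqrstuvwxyz",
  "MN1999R.txt" = "ghijklmnopqrs",
  "DC1997S.txt" = "uvwxyzabcdef"
)
regexp <- "abcdef"

# Logical index of the elements to transform, based on the file name.
is1997 <- grepl("1997", names(texts), fixed = TRUE)

# Hypothetical helper: first match of `pattern` in `s`, or NA if none.
extract_first <- function(s, pattern) {
  m <- regmatches(s, regexpr(pattern, s))
  if (length(m) == 0) NA_character_ else m
}

# Copy the list, then overwrite only the selected elements.
x <- texts
x[is1997] <- lapply(texts[is1997], extract_first, pattern = regexp)
x
```

The list keeps its original length and names because only the indexed positions are replaced.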

You can try sub
sub(paste0('.*(', regexp, ').*'), '\\1', texts)
# AB1997R.txt BG2000S.txt MN1999R.txt DC1997S.txt
# "abcdef" "mnopqrstuvwxyz" "ghijklmnopqrs" "abcdef"
Alternatively, to restrict the substitution to the elements of 'texts' whose names contain 1997, use grep to build the index:
indx <- grep('1997', names(texts))
texts[indx] <- sub(paste0('.*(', regexp, ').*'), '\\1', texts[indx])
as.list(texts)

Extract text between certain symbols using Regular Expression in R

I have a series of expressions such as:
"<i>the text I need to extract</i></b></a></div>"
I need to extract the text between the <i> and </i> "symbols". That is, the result should be:
"the text I need to extract"
At the moment I am using gsub in R to manually remove all the symbols that are not text. However, I would like to use a regular expression to do the job. Does anyone know a regular expression to extract the text between <i> and </i>?
Thanks.

If there is only one <i>...</i> as in the example then match everything up to <i> and everything from </i> forward and replace them both with the empty string:
x <- "<i>the text I need to extract</i></b></a></div>"
gsub(".*<i>|</i>.*", "", x)
giving:
[1] "the text I need to extract"
If there could be multiple occurrences in the same string then try:
library(gsubfn)
strapplyc(x, "<i>(.*?)</i>", simplify = c)
giving the same in this example.

This approach uses a package I maintain, qdapRegex. It doesn't require you to write a regex yourself, but it may be of use to you or future searchers. The function rm_between lets the user extract text between a left and a right bound and optionally include the bounds. This approach is easy in that you don't have to think of a specific regex, just the exact left and right boundaries:
library(qdapRegex)
x <- "<i>the text I need to extract</i></b></a></div>"
rm_between(x, "<i>", "</i>", extract=TRUE)
## [[1]]
## [1] "the text I need to extract"
I would point out that it may be more reliable to use an html parser for this job.

If this is HTML (which it looks like it is), you should probably use an HTML parser. The XML package can do this:
library(XML)
x <- "<i>the text I need to extract</i></b></a></div>"
xmlValue(getNodeSet(htmlParse(x), "//i")[[1]])
# [1] "the text I need to extract"
On an entire html document, you can use
doc <- htmlParse(x)
sapply(getNodeSet(doc, "//i"), xmlValue)

You can use the following approach with gregexpr and regmatches if you don't know the number of matches in a string.
vec <- c("<i>the text I need to extract</i></b></a></div>",
"abc <i>another text</i> def <i>and another text</i> ghi")
regmatches(vec, gregexpr("(?<=<i>).*?(?=</i>)", vec, perl = TRUE))
# [[1]]
# [1] "the text I need to extract"
#
# [[2]]
# [1] "another text" "and another text"

<i>((?:(?!<\/i>).)*)<\/i>
This should do it for you.
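The pattern above is written in generic regex notation. A sketch of how it might be used from R, assuming perl = TRUE (the negative lookahead is PCRE syntax) and dropping the backslashes before /, which R does not need; the sample string is borrowed from another answer on this page:

```r
# Matches <i>, then any characters that do not begin a closing </i>, then </i>.
pattern <- "<i>((?:(?!</i>).)*)</i>"
x <- "abc <i>another text</i> def <i>and another text</i> ghi"

# Step 1: pull out every complete <i>...</i> element.
tagged <- regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]

# Step 2: keep only capture group 1, i.e. the text between the tags.
inner <- sub(pattern, "\\1", tagged, perl = TRUE)
inner
```

The two-step form is used because gregexpr returns whole matches rather than capture groups.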

r gsub and regex, obtaining y_x from y_x_xxxx.csv

General situation: I am currently trying to name dataframes inside a list according to the csv files they have been retrieved from, and I found that using gsub and regex is the way to go. Unfortunately, I can’t produce exactly what I need, just sort of.
I would be very grateful for some hints from someone more experienced. Maybe there is a reasonable R regex cheat sheet?
Files are named like r2_m1_enzyme.csv, and the script should use the first 5 characters to name the corresponding dataframe r2_m1, and so on…
# generates a list of dataframes, to mimic a lapply(f,read.csv) output:
data <- list(data.frame(c(1,2)),data.frame(c(1,2)),data.frame(c(1,2)),data.frame(c(1,2)))
# this mimics file names obtained by list.files() function
f <-c("r1_m1_enzyme.csv","r2_m1_enzyme.csv","r1_m2_enzyme.csv","r2_m2_enzyme.csv")
# this should name the data frames according to the csv file they have been derived from
names(data) <- gsub("r*_m*_.*","\\1", f)
but it doesn't work as expected... the dataframes are named r2_m1_enzyme.csv instead of the desired r2_m1, although .* should stop it?
If I do:
names(data) <- gsub("r*_.*","\\1", f)
I do get r1, r2, r3 ... but I am missing my second index.
The question: So my question is, what regex would allow me to obtain the strings “r1_m1”, “r2_m1”, “r1_m2”, ... from strings that are named r*_m*_xyz.csv?
Search history: R regex use * for only one character, Gsub regex replacement, R using parts of filename to name dataframe, R regex cheat sheet,...

If your names are always five characters long you could use substr:
substr(f, 1, 5)
If you want to use gsub you have to group your expression (via ( and )) because \\1 refers to the first group and insert its content, e.g.:
gsub("^(r[0-9]+_m[0-9]+).*", "\\1", f)
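Two details explain why the original attempts failed: in a regex, * is a quantifier ("zero or more of the preceding item"), not a shell-style wildcard, and \\1 is empty unless the pattern contains a parenthesised group. A short sketch putting the grouped pattern to work on the question's file names; the data frames are stand-ins for the read.csv output, and the extra _ and $ in the pattern are a slightly stricter variation on the answer's version:

```r
f <- c("r1_m1_enzyme.csv", "r2_m1_enzyme.csv", "r1_m2_enzyme.csv", "r2_m2_enzyme.csv")

# Stand-in for lapply(f, read.csv): one small data frame per file.
data <- lapply(f, function(path) data.frame(value = c(1, 2)))

# Capture "r<digits>_m<digits>" at the start and discard the rest of the name.
names(data) <- sub("^(r[0-9]+_m[0-9]+)_.*$", "\\1", f)
names(data)
```

Unlike substr(f, 1, 5), this keeps working if an index grows to two digits, for example r10_m2_enzyme.csv.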

Capturing specific part of domain name in R using regex

I am trying to capture domain names from a long string in R. The domain names are as follows.
11.22.44.55.url.com.localhost
The regex I am using is as follows:
(gsub("(.*)\\.([^.]*url[^.]*)\\.(.*)","\\2","11.22.44.55.test.url.com.localhost",ignore.case=T)[1])
When I test it, I get the right answer that is
url.com
But when I run it as a job on a large dataset, (I run this using R and Hadoop), the result ends up being this,
11.22.44.55.url
And sometimes when the domain is
11.22.44.55.test.url.com.localhost
but I never get
url.com
I am not sure how this could happen. I know that when I test it individually it's fine, but when running it on my actual dataset it fails. Am I missing a corner case that is causing the problem?
Additional information on the dataset, each of these domain addresses is an element in a list, stored as a string, I extract this and run the gsub on it.

This solution is based on using sub twice. First,".localhost" is removed from the string. Then, the URL is extracted:
# example strings
test <- c("11.22.44.55.url.com.localhost",
"11.22.44.55.test.url.com.localhost",
"11.22.44.55.foo.bar.localhost")
sub(".*\\.(\\w+\\.\\w+)$", "\\1", sub("\\.localhost", "", test))
# [1] "url.com" "url.com" "foo.bar"
This solution works also for strings ending with "url.com" (without ".localhost").

Why not try something simpler: split on . and pick the parts you want?
x <-unlist(strsplit("11.22.44.55.test.url.com.localhost",
split=".",fixed=T))
paste(x[6],x[7],sep=".")
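The fixed positions 6 and 7 only fit strings with exactly that many labels. A sketch of a position-independent variant, assuming the wanted domain is always the last two labels before an optional trailing ".localhost"; extract_domain is a hypothetical helper name:

```r
# Hypothetical helper: last two dot-separated labels, ignoring ".localhost".
extract_domain <- function(host) {
  trimmed <- sub("\\.localhost$", "", host)
  parts <- strsplit(trimmed, ".", fixed = TRUE)
  # tail(p, 2) takes the final two labels however many precede them.
  vapply(parts, function(p) paste(tail(p, 2), collapse = "."), character(1))
}

hosts <- c("11.22.44.55.url.com.localhost",
           "11.22.44.55.test.url.com.localhost")
domains <- extract_domain(hosts)
domains
```

The function is vectorised over its input, so it can be applied to a whole column or to list elements via unlist first.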

I'm not 100% sure what you're going for with the match, but this will grab "url" plus the next word/numeric sequence after that. I think the "*" quantifier is too permissive here, so I made use of "+", which matches 1 or more characters, rather than 0 or more (like "*").
oobar = c(
"11.22.44.55.url.com.localhost",
"11.22.44.55.test.url.cog.localhost",
"11.22.44.55.test.url.com.localhost"
)
f = function(url) (gsub("(.+)[\\.](url[\\.]+[^\\.]+)[\\.](.+)","\\2",url,ignore.case=TRUE))
f(oobar)
[1] "url.com" "url.cog" "url.com"

How to extract words between two periods using gsub

I have a text that looks like this:
txt <- "w.raw.median"
I want to extract the second word in between two periods (.),
giving this output
> raw
But why doesn't this work:
gsub(".*\\.", "", txt)
What's the right way to do it?

Try this:
gsub(".*\\.(.*)\\..*", "\\1", txt)
[1] "raw"

Also consider
strsplit(txt,'.',fixed=TRUE)[[1]][2]
for a (slightly) more readable version
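The question's attempt returns "median" because the greedy .* runs up to the last period, so everything before the final field is removed. A short sketch contrasting that with a vectorised form of the strsplit approach; the second input string is made up for illustration:

```r
txt <- c("w.raw.median", "x.scaled.mean")  # second element is hypothetical

# Greedy .* consumes up to the LAST period, leaving only the final field.
last_field <- gsub(".*\\.", "", txt)

# Split on literal periods and take the second piece of each string.
second_field <- vapply(strsplit(txt, ".", fixed = TRUE), `[`, character(1), 2)
second_field
```

Passing fixed = TRUE makes strsplit treat the period literally instead of as the regex "any character".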

We Keep Coding
