How to extract words between two periods using gsub - regex

I have a text that looks like this:
txt <- "w.raw.median"
I want to extract the second word, the one in between the two periods (.),
giving this output:
> raw
But why doesn't this work?
gsub(".*\\.", "", txt)
What's the right way to do it?

Try this:
gsub(".*\\.(.*)\\..*", "\\1", txt)
[1] "raw"
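To see why the original attempt fails: `.*` is greedy, so `.*\\.` swallows everything up to the last period and the replacement keeps only the final word:

```r
txt <- "w.raw.median"
# ".*\\." greedily matches "w.raw.", so only the last word remains
gsub(".*\\.", "", txt)
## [1] "median"
```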

Also consider
strsplit(txt,'.',fixed=TRUE)[[1]][2]
for a (slightly) more readable version

Related

Extract info inside parenthesis in R

I have some rows; some have parentheses and some don't, like ABC(DEF) and ABC. I want to extract the info inside the parentheses:
ABC(DEF) -> DEF
ABC -> NA
I wrote
gsub(".*\\((.*)\\).*", "\\1", X)
It works fine for ABC(DEF), but outputs "ABC" when there are no parentheses.
If you do not want to get ABC back when using sub with your regex, you need to add an alternative that matches the whole string (when there are no parentheses) and removes it.
X <- c("ABC(DEF)", "ABC")
sub(".*(?:\\((.*)\\)).*|.*", "\\1",X)
Note that you do not have to use gsub: only one replacement needs to be performed, so sub will do.
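To make the behaviour concrete, running the sub call on both inputs gives the following (note that the no-parentheses case yields an empty string rather than NA):

```r
X <- c("ABC(DEF)", "ABC")
# when there are no parentheses, the trailing "|.*" alternative matches the
# whole string and group 1 is unset, so the replacement "\\1" is ""
sub(".*(?:\\((.*)\\)).*|.*", "\\1", X)
## [1] "DEF" ""
```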
A stringr str_match would also be handy for this task:
str_match(X, "\\((.*)\\)")
or
str_match(X, "\\(([^()]*)\\)")
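For reference, str_match returns a matrix (full match in column 1, capture group in column 2) and fills non-matching rows with NA:

```r
library(stringr)

X <- c("ABC(DEF)", "ABC")
# column 2 holds the capture group; non-matching rows become NA
str_match(X, "\\((.*)\\)")[, 2]
## [1] "DEF" NA
```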
Using str_extract() will also work.
library(stringr)
df$`new column` <- str_extract(df$`existing column`, "(?<=\\().+?(?=\\))")
This creates a new column containing any text inside parentheses in an existing column. If there are no parentheses in a row, it fills in NA.
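A minimal, self-contained sketch of the same idea (the column names in the answer above are placeholders, so invented ones are used here):

```r
library(stringr)

df <- data.frame(existing = c("ABC(DEF)", "ABC"))
# lookbehind/lookahead keep only the text inside the parentheses
df$new <- str_extract(df$existing, "(?<=\\().+?(?=\\))")
df$new
## [1] "DEF" NA
```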
The inspiration for this answer comes from an answer to the original question on this topic.

Removing parentheses as unwanted text in R using gsub

I'm trying to clean up a column in my data frame where the rows look like this:
1234, text ()
and I need to keep just the number in all the rows. I used:
df$column = gsub(", text ()", "", df$column)
and got this:
1234()
I repeated the operation with only the parentheses, but they won't go away. I wasn't able to find an example that deals specifically with removing parentheses as unwanted text. sub doesn't work either.
Anyone knows why this isn't working?
Parentheses are metacharacters in regex. You should escape them with \\, wrap them in [], or add fixed = TRUE. But in your case you just want to keep the number, so remove everything that is not a digit using \\D:
gsub("\\D", "", "1234, text ()")
## [1] "1234"
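Alternatively, following the escaping advice above, fixed = TRUE makes the pattern literal so the parentheses no longer act as metacharacters:

```r
# fixed = TRUE disables regex metacharacters, so "()" matches literally
gsub(", text ()", "", "1234, text ()", fixed = TRUE)
## [1] "1234"
```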
If your column always looks like the format described above:
1234, text ()
Something like the following (note: this is C#, not R) should work:
string extractedNumber = Regex.Match(INPUT_COLUMN, @"^\d{4,}").Value;
It reads: from the start of the string, match four or more digits.
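Since the question is about R, here is a base-R sketch of the same idea ("four or more digits at the start") using regexpr and regmatches:

```r
x <- "1234, text ()"
# match four or more digits anchored at the start of the string
regmatches(x, regexpr("^\\d{4,}", x))
## [1] "1234"
```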

Extract text between certain symbols using Regular Expression in R

I have a series of expressions such as:
"<i>the text I need to extract</i></b></a></div>"
I need to extract the text between the <i> and </i> "symbols". That is, the result should be:
"the text I need to extract"
At the moment I am using gsub in R to manually remove all the symbols that are not text. However, I would like to use a regular expression to do the job. Does anyone know a regular expression to extract the text between <i> and </i>?
Thanks.
If there is only one <i>...</i> as in the example then match everything up to <i> and everything from </i> forward and replace them both with the empty string:
x <- "<i>the text I need to extract</i></b></a></div>"
gsub(".*<i>|</i>.*", "", x)
giving:
[1] "the text I need to extract"
If there could be multiple occurrences in the same string then try:
library(gsubfn)
strapplyc(x, "<i>(.*?)</i>", simplify = c)
giving the same in this example.
This approach uses qdapRegex, a package I maintain. It isn't regex itself, but it may be of use to you or future searchers. The function rm_between allows the user to extract text between a left and a right bound, and optionally include them. This approach is easy in that you don't have to think up a specific regex, just the exact left and right boundaries:
library(qdapRegex)
x <- "<i>the text I need to extract</i></b></a></div>"
rm_between(x, "<i>", "</i>", extract=TRUE)
## [[1]]
## [1] "the text I need to extract"
I would point out that it may be more reliable to use an HTML parser for this job.
If this is HTML (which it looks like it is), you should probably use an HTML parser. The XML package can do this:
library(XML)
x <- "<i>the text I need to extract</i></b></a></div>"
xmlValue(getNodeSet(htmlParse(x), "//i")[[1]])
# [1] "the text I need to extract"
On an entire html document, you can use
doc <- htmlParse(x)
sapply(getNodeSet(doc, "//i"), xmlValue)
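For example, on a fragment with several <i> nodes (a small sketch using the same XML functions):

```r
library(XML)

doc <- htmlParse("<div><i>first</i> and <i>second</i></div>")
# collect the text value of every <i> node in the document
sapply(getNodeSet(doc, "//i"), xmlValue)
## [1] "first"  "second"
```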
You can use the following approach with gregexpr and regmatches if you don't know the number of matches in a string.
vec <- c("<i>the text I need to extract</i></b></a></div>",
"abc <i>another text</i> def <i>and another text</i> ghi")
regmatches(vec, gregexpr("(?<=<i>).*?(?=</i>)", vec, perl = TRUE))
# [[1]]
# [1] "the text I need to extract"
#
# [[2]]
# [1] "another text" "and another text"
<i>((?:(?!<\/i>).)*)<\/i>
This tempered greedy token should do it for you (in base R, use perl = TRUE).
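In base R this pattern needs perl = TRUE; a sketch using regexec to pull out the capture group:

```r
x <- "<i>the text I need to extract</i></b></a></div>"
# (?:(?!</i>).)* matches any character as long as "</i>" does not start there
m <- regexec("<i>((?:(?!</i>).)*)</i>", x, perl = TRUE)
regmatches(x, m)[[1]][2]
## [1] "the text I need to extract"
```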

how do I replace text within a string

I have a data such as
c("1988-10-25T11:12:47.00", "1988-10-25T14:43:24.36", "1988-10-26T14:14:25.60")
and I would like to replace everything after the period with A. I tried to use gsub, but the digits
after the period differ from row to row. What should I do?
The expected output: ("1988-10-25T11:12:47A", "1988-10-25T14:43:24A", "1988-10-26T14:14:25A")
You can use sub:
s <- c("1988-10-25T11:12:47.00", "1988-10-25T14:43:24.36", "1988-10-26T14:14:25.60")
sub("\\..*", "A", s)
# [1] "1988-10-25T11:12:47A" "1988-10-25T14:43:24A" "1988-10-26T14:14:25A"

Extracting text in R

I am trying to extract a variable-length substring of text using R. I have several characters such as the following:
"\"/Users/Nel/Documents/Project/Data/dataset.csv\""
I need to extract the file path from each such character. In this case what I am trying to get is:
path1 <- "/Users/Nel/Documents/Project/Data/dataset.csv"
I am able to use the substring function:
path1 <- substr("\"/Users/Nel/Documents/Project/Data/dataset.csv\"", 3, 46)
with the indices hard-coded to get what I want in this particular instance. However, this particular path is one of many, and I need to be able to find these indices on the fly. I believe the
grep()
function could work but I can't figure out the relevant regular expressions. Thanks.
It seems like you are just trying to remove some hard-coded quotation marks.
Try gsub:
x
# [1] "\"/Users/Nel/Documents/Project/Data/dataset.csv\""
gsub('\"',"",x)
# [1] "/Users/Nel/Documents/Project/Data/dataset.csv"
## or
# gsub('["]', "", x)