Finding last character in R using regexpr function - regex

I am having a problem finding the last character in a string. I am trying to use the regexpr function to check whether the last character is a forward slash (/).
Unfortunately, it does not work. Can anyone help me? Below is my code.
regexpr( pattern = ".$", text = "/home/rexamine/archivist2/ex///" ) != "/"

You can avoid regular expressions and use substr for this.
> x <- '/home/rexamine/archivist2/ex///'
> substr(x, nchar(x), nchar(x)) == '/'
[1] TRUE
Or use str_sub from the stringr package:
> library(stringr)
> str_sub(x, -1) == '/'
[1] TRUE

You could use a simple grepl function,
> text = "/home/rexamine/archivist2/ex///"
> grepl("/$", text, perl=TRUE)
[1] TRUE
> text = "/home/rexamine/archivist2/ex"
> grepl("/$", text, perl=TRUE)
[1] FALSE

^.*\/$
You can use this. It will fail to match if the last character is not /.
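For reference, here is a minimal sketch of why the original comparison cannot work: regexpr returns the position of the match, not the matched text, so comparing its result with "/" never tests the last character. The snippet uses only base R; endsWith is assumed to be available (R 3.3.0 or later), and the variable names are just for illustration.

```r
x <- "/home/rexamine/archivist2/ex///"

# regexpr() gives the start position of the match (here: the last character)
m <- regexpr(".$", x)
last_pos <- as.integer(m)

# regmatches() turns that position into the matched text
last_char <- regmatches(x, m)
last_is_slash <- last_char == "/"

# Base R alternative without any regex (assumes R >= 3.3.0)
ends_with_slash <- endsWith(x, "/")

print(last_pos == nchar(x))
print(last_is_slash)
print(ends_with_slash)
```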

Related

Subdivide an expression into alternative subpattern - using gsub()

I'm trying to split the pattern in my gsub() call into alternative subpatterns, but the combined pattern does not match anything.
Task: I want to delete all sections of string that contain either .ST or -XST in my vector of strings.
As you can see below, using one expression works fine. But the | expression simply does not work. I'm following the metacharacter guide on https://www.stat.auckland.ac.nz/~paul/ItDT/HTML/node84.html
What can be the issue? And what caused this issue?
My data
> rownames(table.summary)[1:10]
[1] "AAK.ST" "ABB.ST" "ALFA.ST" "ALIV-SDB.ST" "AOI.ST" "ATCO-A.ST" "ATCO-B.ST" "AXFO.ST" "AXIS.ST" "AZN.ST"
> gsub(pattern = '[.](.*)$ | [-](.*)$', replacement = "", x = rownames(table.summary)[1:10])
[1] "AAK.ST" "ABB.ST" "ALFA.ST" "ALIV-SDB.ST" "AOI.ST" "ATCO-A.ST" "ATCO-B.ST" "AXFO.ST" "AXIS.ST" "AZN.ST"
> gsub(pattern = '[.](.*)$', replacement = "", x = rownames(table.summary)[1:10])
[1] "AAK" "ABB" "ALFA" "ALIV-SDB" "AOI" "ATCO-A" "ATCO-B" "AXFO" "AXIS" "AZN"
> gsub(pattern = '[-](.*)$', replacement = "", x = rownames(table.summary)[1:10])
[1] "AAK.ST" "ABB.ST" "ALFA.ST" "ALIV" "AOI.ST" "ATCO" "ATCO" "AXFO.ST" "AXIS.ST" "AZN.ST"
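It seems you tested your regex with a flag like IgnorePatternWhitespace (VERBOSE, /x), which allows whitespace inside patterns for readability. Without that flag, the spaces around | are matched literally. In R you can enable it with the inline (?x) modifier together with the perl=TRUE option: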
d <- c("AAK.ST","ABB.ST","ALFA.ST","ALIV-SDB.ST","AOI.ST","ATCO-A.ST","ATCO-B.ST","AXFO.ST", "AXIS.ST","AZN.ST")
gsub('(?x)[.](.*)$ | [-](.*)$', '', d, perl=T)
## [1] "AAK" "ABB" "ALFA" "ALIV" "AOI" "ATCO" "ATCO" "AXFO" "AXIS" "AZN"
However, you really do not have to use that complex regex here.
If you plan to remove everything from the first hyphen or dot up to the end, you may use the following regex:
[.-].*$
The character class [.-] will match the first . or - symbol and .* will match all characters up to the end of the string ($).
See IDEONE demo:
d <- c("AAK.ST","ABB.ST","ALFA.ST","ALIV-SDB.ST","AOI.ST","ATCO-A.ST","ATCO-B.ST","AXFO.ST", "AXIS.ST","AZN.ST")
gsub("[.-].*$", "", d)
Result: [1] "AAK" "ABB" "ALFA" "ALIV" "AOI" "ATCO" "ATCO" "AXFO" "AXIS" "AZN"
This will find .ST or -XST at the end of the text and replace it with an empty string (effectively removing that part). Don't forget that gsub returns the modified strings; it does not modify them in place. You won't see any change until you assign the return value back to a variable.
strings <- c("AAK.ST", "ABB.ST", "ALFA.ST", "ALIV-SDB.ST", "AOI.ST", "ATCO-A.ST", "ATCO-B.ST", "AXFO.ST", "AXIS.ST", "AZN.ST", "AAC-XST", "AAD-XSTV")
strings <- gsub('(\\.ST|-XST)$', '', strings)
Your regular expression ('[.](.*)$ | [-](.*)$'), if not for the unnecessary spaces, would remove everything from the first dot (.) or dash (-) to the end of the text. This might be what you want, but it is not what you said you want.
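To make the effect of the spaces concrete, here is a small sketch on a subset of the data, using only base R. The vector d and the extra "AAC-XST" element are illustrative; the three patterns are the ones discussed above.

```r
d <- c("AAK.ST", "ALIV-SDB.ST", "ATCO-A.ST")

# Spaces around | are literal: neither alternative can match, nothing changes
with_spaces <- gsub("[.](.*)$ | [-](.*)$", "", d)

# Same alternation without spaces: removes from the first . or - to the end
without_spaces <- gsub("[.].*$|[-].*$", "", d)

# Only strip a literal .ST or -XST suffix
suffix_only <- gsub("(\\.ST|-XST)$", "", c(d, "AAC-XST"))

print(with_spaces)
print(without_spaces)
print(suffix_only)
```

The second and third calls give different results for names such as "ALIV-SDB.ST", so which one is right depends on whether the part after the hyphen should be kept.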

Replace a random block of characters in a string in R

I have a text and I want to replace a text block in a line, like that:
"\t\t\tFGHGFJKJKJKGDSJS"
with
x= "ABCCCBBHHJJJH"
I'm interested in changing just the text block (FGHGFJKJKJKGDSJS) without modifying the other special characters, so as to obtain:
"\t\t\tABCCCBBHHJJJH"
Is there a way to replace FGHGFJKJKJKGDSJS without explicitly specifying the exact combination of letters?
I found a solution in this way: txt[n° of the line] = paste0(\t,\t,\t,x)
But I would like to know whether there is a more general solution.
> library(stringr)
> mystring <- "\t\t\tFGHGFJKJKJKGDSJS"
> x <- "ABCCCBBHHJJJH"
> str_replace(mystring,"\\w+",x)
[1] "\t\t\tABCCCBBHHJJJH"
\w+ matches one or more word characters (letters, digits, or underscore), as many as possible. The tabs are not word characters, so they are left alone, and str_replace swaps the first run of word characters for your x variable.
> a = "\t\t\tDFGGD"
> gsub("(\t\t\t).*","\\1ABCDF",a)
[1] "\t\t\tABCDF"
mystring <- "\t\t\tFGHGFJKJKJKGDSJS"
x <- "ABCCCBBHHJJJH"
sub('\\w+',x,mystring,ignore.case=T)
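As a sketch of the more general case the question asks about, the same idea can be applied to several lines at once while keeping whatever leading whitespace each line has. This assumes the block to replace is the first run of word characters on each line; the txt vector is made up for illustration.

```r
x <- "ABCCCBBHHJJJH"
mystring <- "\t\t\tFGHGFJKJKJKGDSJS"

# Replace the first run of word characters; the tabs are untouched
single <- sub("\\w+", x, mystring)

# Vectorised over lines: capture leading whitespace and put it back with \\1
txt <- c("\t\t\tFGHG", "\tXYZ more", "no tabs")
replaced <- sub("^(\\s*)\\w+", paste0("\\1", x), txt)

print(single)
print(replaced)
```

If x could contain backslashes, they would need escaping first, since sub interprets them in the replacement string.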

Extracting numbers from a string including decimals and scientific notation

I have some strings that look like
x<-"p = 9.636e-05"
And I would like to extract just the number using gsub. So far I have
gsub("[[:alpha:]](?!-)|=|\\^2", "", x)
But that removes the 'e' from the scientific notation, giving me
" 9.636-05"
Which can't be converted to a number using as.numeric. I know that it would be possible to use a lookahead to match the "-", but I don't know exactly how to go about doing this.
You could try
sub('.* = ', '', x)
#[1] "9.636e-05"
You can use the following to initially remove all non-digit characters at the start of the string:
sub('^\\D+', '', x)
Try
format(as.numeric(gsub("[^0-9e.-]", "", x)), scientific = FALSE)
# [1] "0.00009636"
You can also do this with sub or regmatches, or with stri_extract from the stringi package.
> x<-"p = 9.636e-05"
> sub(".* ", "", x)
[1] "9.636e-05"
> regmatches(x, regexpr("\\S+$", x))
[1] "9.636e-05"
> library(stringi)
> stri_extract(x, regex="\\S+$")
[1] "9.636e-05"
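Instead of deleting everything that is not part of the number, another option is to match the number itself. The sketch below assumes the number sits at the end of each string; the pattern and the extra example strings are illustrative, not taken from the question.

```r
x <- c("p = 9.636e-05", "R^2 = 0.87", "n = 12")

# Optional sign, digits with optional decimal point, optional exponent,
# anchored at the end of the string
num_pattern <- "[-+]?[0-9]*\\.?[0-9]+([eE][-+]?[0-9]+)?$"

m <- regexpr(num_pattern, x)
vals <- as.numeric(regmatches(x, m))

print(vals)
```

Note that regmatches silently drops elements with no match, so the result can be shorter than x if some strings contain no trailing number.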

strsplit inconsistent with gregexpr

A comment on my answer to this question suggested a regex that should give the desired result using strsplit, but it does not, even though the regex seems to correctly match only the first and last commas in a character vector. This can be shown using gregexpr and regmatches.
So why does strsplit split on each comma in this example, even though regmatches only returns two matches for the same regex?
# We would like to split on the first comma and
# the last comma (positions 4 and 13 in this string)
x <- "123,34,56,78,90"
# Splits on every comma. Must be wrong.
strsplit( x , '^\\w+\\K,|,(?=\\w+$)' , perl = TRUE )[[1]]
#[1] "123" "34" "56" "78" "90"
# Ok. Let's check the positions of matches for this regex
m <- gregexpr( '^\\w+\\K,|,(?=\\w+$)' , x , perl = TRUE )
# Matching positions are at
unlist(m)
[1] 4 13
# And extracting them...
regmatches( x , m )
[[1]]
[1] "," ","
Huh?! What is going on?
@Aprillion's theory is correct. From the R documentation:
The algorithm applied to each input string is
repeat {
    if the string is empty
        break.
    if there is a match
        add the string to the left of the match to the output.
        remove the match and all to the left of it.
    else
        add the string to the output.
        break.
}
In other words, at each iteration ^ matches the beginning of the remaining string (without the part that was already consumed).
To simply illustrate this behavior:
> x <- "12345"
> strsplit( x , "^." , perl = TRUE )
[[1]]
[1] "" "" "" "" ""
Here, you can see the consequence of this behavior with a lookahead assertion as delimiter (thanks to @JoshO'Brien for the link).
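One possible workaround, sketched here, is to skip strsplit and cut the string at the positions that gregexpr reports, since gregexpr evaluates ^ against the whole string. The helper split_at_matches is hypothetical (not part of base R) and handles a single string only.

```r
# Hypothetical helper: split one string at the matches found by gregexpr()
split_at_matches <- function(x, pattern, ...) {
  m <- gregexpr(pattern, x, ...)[[1]]
  if (m[1] == -1) return(x)            # no match: return the string unchanged
  pos <- as.integer(m)                 # start of each delimiter
  len <- attr(m, "match.length")       # length of each delimiter
  starts <- c(1L, pos + len)           # pieces begin after each delimiter
  ends <- c(pos - 1L, nchar(x))        # and end just before the next one
  substring(x, starts, ends)
}

x <- "123,34,56,78,90"
pieces <- split_at_matches(x, "^\\w+\\K,|,(?=\\w+$)", perl = TRUE)
print(pieces)
```

With the match positions 4 and 13 shown in the question, this yields the three pieces "123", "34,56,78" and "90".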

R : regular expression for 'not followed by' not working

I needed to retain the words enclosed in brackets and delete the others in the following string.
(a(b(c)d)(e)f)
So what I expected would be (((c))(e)).
To delete a, b, d, f, I tried the 'not followed by' regex.
str <- "(a(b(c)d)(e)f)"
gsub("([a-z]+)(?!\\))", "", str) #(sub. anything that isn't followed by a ")" )
The message shows my regex is invalid. As far as I can see, the brackets in the second part of the regex "(?!\))" don't match properly. In my editor, the first "(" is paired with the immediately following ")", which is not meant to be a closing bracket (the one to its right is). This is the only error I could make out in my regex. Can you please tell me what is actually wrong? Is there any other way to do this?
In two steps, and using positive lookaheads:
str1 <- gsub("\\([a-z](?=\\()", "\\(", str, perl=TRUE)
str1
# [1] "(((c)d)(e)f)"
str2 <- gsub("\\)[a-z](?=\\))", "\\)", str1, perl=TRUE)
str2
# [1] "(((c))(e))"
Edit: it turns out you can even do it in one:
gsub("([\\(\\)])[a-z](?=\\1)", "\\1", str, perl=TRUE)
# [1] "(((c))(e))"
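As for what is wrong with the original call: lookarounds such as (?!...) are only understood by the PCRE engine, so the pattern needs perl=TRUE. The sketch below wraps the original call in tryCatch because the default engine is expected to reject it, as the question reports. It then shows that even with perl=TRUE the negative lookahead alone does not produce the desired result.

```r
str <- "(a(b(c)d)(e)f)"

# Default engine: expected to fail on (?!...); capture the message if it does
default_engine <- tryCatch(
  gsub("([a-z]+)(?!\\))", "", str),
  error = function(e) conditionMessage(e)
)

# With PCRE the pattern is valid, but it only removes letters that are
# not followed by ")", so d and f survive
with_perl <- gsub("([a-z]+)(?!\\))", "", str, perl = TRUE)

# The one-step version from the answer above, for comparison
one_step <- gsub("([\\(\\)])[a-z](?=\\1)", "\\1", str, perl = TRUE)

print(default_engine)
print(with_perl)
print(one_step)
```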
I agree with @Dason's comment:
st <- "(a(b(c)d)(e)f)"
while(grepl("\\([a-z]+\\(",st)) {
st <- sub("\\([a-z]+(\\(.+\\))[a-z]+\\)","\\1",st)
}
> st
[1] "(c)(e)"