Extract 2nd to last word in string - regex

I know how to do it in Python, but can't get it to work in R
> string <- "this is a sentence"
> pattern <- "\b([\w]+)[\s]+([\w]+)[\W]*?$"
Error: '\w' is an unrecognized escape in character string starting "\b([\w"
> match <- regexec(pattern, string)
> words <- regmatches(string, match)
> words
[[1]]
character(0)

sub('.*?(\\w+)\\W+\\w+\\W*?$', '\\1', string)
#[1] "a"
This reads: lazily skip anything until you reach the sequence "word characters, non-word characters, word characters, optional non-word characters, end of string", then return the first group of word characters in that sequence.
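The error in the question comes from writing single backslashes inside an R string literal; each regex backslash has to be doubled. A minimal sketch of the asker's regexec/regmatches attempt with that fixed (the simplified character classes and the perl = TRUE flag are my choices, not part of the original answer):

```r
string <- "this is a sentence"

# Backslashes are doubled because "\w" is not a valid escape in an R string.
pattern <- "\\b(\\w+)\\s+(\\w+)\\W*$"

# regexec returns the full match plus one entry per capture group.
m <- regmatches(string, regexec(pattern, string, perl = TRUE))

# Element 1 is the whole match, element 2 is the first capture group.
second_to_last <- m[[1]][2]
second_to_last
```

If the string has fewer than two words, m[[1]] is character(0) and indexing it gives NA, so a real script would probably want to check for that.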

Non-regex solution:
string <- "this is a sentence"
split <- strsplit(string, " ")[[1]]
split[length(split)-1]

Python non-regex version:
t = "this is a sentence"
spl = t.split(" ")
if len(spl) > 1:  # need at least two words
    s = spl[-2]  # second-to-last word

Related

Regex to reject only nonalphanumeric characters

If the keyword to be checked is other. It should not be preceded or followed by alphanumeric character.
spaces are allowed, \n allowed, Special characters allowed.
Not allowed - "AOther9", "noTHERX"
Allowed - "other", "\nother" , " other ", "$other/"
grepl(paste("[^a-zA-Z0-9]","other","[^a-zA-Z0-9]",sep=""),String1 , ignore.case = TRUE)
The above regex works for all cases except the bare keyword, i.e. when the keyword is preceded and followed by nothing.
You need to use a PCRE regex with lookarounds:
grepl(paste("(?<![a-zA-Z0-9])","other","(?![a-zA-Z0-9])",sep=""), String1, ignore.case = TRUE, perl=TRUE)
Negative lookarounds do not consume the non-alphanumeric characters, so they do not require those characters to actually be present in the string.
You can read more about lookarounds here.
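To make the difference concrete, here is a small sketch that runs both patterns over the examples from the question. The helper names has_keyword_consuming and has_keyword_lookaround are made up for illustration, and the keyword is assumed to contain no regex metacharacters:

```r
String1 <- c("AOther9", "noTHERX", "other", "\nother", " other ", "$other/")

# Original attempt: a character must be present on both sides of the keyword.
has_keyword_consuming <- function(x, keyword = "other") {
  pattern <- paste0("[^a-zA-Z0-9]", keyword, "[^a-zA-Z0-9]")
  grepl(pattern, x, ignore.case = TRUE)
}

# Lookaround version: only asserts that no alphanumeric character is adjacent.
has_keyword_lookaround <- function(x, keyword = "other") {
  pattern <- paste0("(?<![a-zA-Z0-9])", keyword, "(?![a-zA-Z0-9])")
  grepl(pattern, x, ignore.case = TRUE, perl = TRUE)
}

consuming  <- has_keyword_consuming(String1)
lookaround <- has_keyword_lookaround(String1)
data.frame(String1, consuming, lookaround)
```

The consuming pattern misses "other" and "\nother" because there is nothing before or after the keyword to match, while the lookaround pattern accepts all four allowed strings.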
Add a * quantifier to the inverted ranges, and start ^ and end $ of line anchors:
String1 <- c("AOther9", "noTHERX", "other", "\nother", " other ", "$other/")
grep('^[^a-z0-9]*other[^a-z0-9]*$', String1, ignore.case = TRUE, value = TRUE)
# [1] "other" "\nother" " other " "$other/"

delete the words with length greater than X in R

In R, after I remove the punctuation, numbers and non-ASCII characters, I am left with many very long words:
ques1<-gsub("[[:digit:]]"," ", ques1,perl=TRUE)
ques1<-gsub("[[:punct:]]"," ", ques1,perl=TRUE)
ques1<-iconv(ques1, "latin1", "ASCII", sub=" ")
ques1<-rm_white(ques1)
ques1
I checked that the longest word has 35 characters using
max(nchar(strsplit(ques1, " ")[[1]]))
[1] 35
Now I want to remove the words that have more than 10 characters, since I don't want them, such as
wwwhotmailcomlearnbyexample
Please help me out !!!
Use the following gsub:
ques1 = "A long sentence with long wwwhotmailcomlearnbyexample"
gsub("\\b[[:alpha:]]{11,}\\b", "", ques1, perl = TRUE)
The \\b[[:alpha:]]{11,}\\b regex will match words with length of 11 or more (\\b is a word boundary and [:alpha:] stands for any letter).
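Since the title asks about a length "greater than X", the same regex can be built from a parameter. This is only a sketch: drop_long_words and max_len are hypothetical names, and the whitespace cleanup at the end is my addition to avoid leaving a gap where a word was removed.

```r
# Remove every purely alphabetic word longer than max_len characters.
drop_long_words <- function(x, max_len = 10) {
  # {n,} means "n or more", so words longer than max_len need max_len + 1.
  pattern <- sprintf("\\b[[:alpha:]]{%d,}\\b", as.integer(max_len) + 1L)
  out <- gsub(pattern, "", x, perl = TRUE)
  # Collapse the leftover runs of whitespace and trim the ends.
  trimws(gsub("\\s+", " ", out))
}

ques1 <- "A long sentence with long wwwhotmailcomlearnbyexample"
drop_long_words(ques1)
drop_long_words(ques1, max_len = 4)
```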

Strsplit() by second occurrence of the delimiter

I am trying to split a string at the second occurrence of a character, i.e. return the substring before the second appearance of character x.
For the string:
s <- "a_b_c", if the delimiter is "_", I need the substring "a_b".
My function returns the substring before the first occurrence:
return_topic <- function(s)
{
  if (length(grep("_", s)) > 0)
  { return (unlist(strsplit(s, "_"))[1]) }
  else return (" ")
}
> return_topic("a_b_c")
[1] "a"
You can use sub:
sub("(.*?_.*?)_.*", "\\1", s)
# [1] "a_b"
One way using strsplit
s <- c('a_b_c', '_b', '_bc_', 'abc__')
sapply(strsplit(s, '^[^_]*?[_][^_]*?(*SKIP)(*F)|_', perl=TRUE),`[`,1)
#[1] "a_b" "_b" "_bc" "abc_"
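For comparison, a non-regex sketch of the same idea: locate every delimiter with gregexpr and cut just before the n-th one. The function name before_nth and its arguments are hypothetical, and returning the input unchanged when there are fewer than n delimiters is an assumption chosen to agree with the strsplit output above.

```r
# Return the part of a single string s before the n-th occurrence of delim.
before_nth <- function(s, delim = "_", n = 2) {
  # fixed = TRUE treats delim as a literal string, not a regex.
  pos <- gregexpr(delim, s, fixed = TRUE)[[1]]
  # gregexpr reports -1 when there is no match at all.
  if (pos[1] == -1 || length(pos) < n) return(s)
  substr(s, 1, pos[n] - 1)
}

s <- c("a_b_c", "_b", "_bc_", "abc__")
# before_nth handles one string at a time, so apply it element-wise.
res <- vapply(s, before_nth, character(1), USE.NAMES = FALSE)
res
```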

strsplit inconsistent with gregexpr

A comment on my answer to this question suggested a regex that should give the desired result with strsplit, but it does not, even though the regex correctly matches only the first and last commas in a character vector. This can be shown using gregexpr and regmatches.
So why does strsplit split on every comma in this example, even though regmatches returns only two matches for the same regex?
# We would like to split on the first comma and
# the last comma (positions 4 and 13 in this string)
x <- "123,34,56,78,90"
# Splits on every comma. Must be wrong.
strsplit( x , '^\\w+\\K,|,(?=\\w+$)' , perl = TRUE )[[1]]
#[1] "123" "34" "56" "78" "90"
# Ok. Let's check the positions of matches for this regex
m <- gregexpr( '^\\w+\\K,|,(?=\\w+$)' , x , perl = TRUE )
# Matching positions are at
unlist(m)
[1] 4 13
# And extracting them...
regmatches( x , m )
[[1]]
[1] "," ","
Huh?! What is going on?
@Aprillion's theory is correct. From the R documentation:
The algorithm applied to each input string is
repeat {
    if the string is empty
        break.
    if there is a match
        add the string to the left of the match to the output.
        remove the match and all to the left of it.
    else
        add the string to the output.
        break.
}
In other words, at each iteration ^ matches the beginning of the remaining string (with the previously consumed parts removed).
To simply illustrate this behavior:
> x <- "12345"
> strsplit( x , "^." , perl = TRUE )
[[1]]
[1] "" "" "" "" ""
Here, you can see the consequence of this behavior with a lookahead assertion as delimiter (thanks to @JoshO'Brien for the link).
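The documented loop can be imitated in a few lines of R, which makes it easier to see why the anchored alternative matches again on every pass. This is a rough sketch, not R's actual implementation: split_like_strsplit is a made-up name, it handles a single string only, and it simply stops on zero-length matches rather than reproducing how strsplit treats them.

```r
split_like_strsplit <- function(x, pattern) {
  out <- character(0)
  repeat {
    if (nchar(x) == 0) break
    m <- regexpr(pattern, x, perl = TRUE)
    start <- as.integer(m)
    len <- attr(m, "match.length")
    if (start == -1) {
      # No match: the rest of the string is the last piece.
      out <- c(out, x)
      break
    }
    if (len == 0) stop("zero-length matches are not handled in this sketch")
    # Add the text to the left of the match to the output ...
    out <- c(out, substr(x, 1, start - 1))
    # ... then drop the match and everything to its left, so the next
    # search sees a brand-new string and ^ can match again.
    x <- substr(x, start + len, nchar(x))
  }
  out
}

x <- "123,34,56,78,90"
pat <- "^\\w+\\K,|,(?=\\w+$)"
split_like_strsplit(x, pat)
```

Under these assumptions the sketch splits on every comma, just like strsplit, because after each cut the first number of the remainder again sits at the start of the string.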

How to trim and replace a string

string<-c("       this is a string  ")
Is it possible to trim off the whitespace on both sides of the string (or just one side as required) and replace each space with a desired character, such as this, in R? The number of spaces differs on each side of the string and has to be retained on replacement.
"~~~~~~~this is a string~~"
This seems like an inefficient way of doing it, but maybe you should be looking in the direction of gregexpr and regmatches instead of gsub:
x <- "    this is a string  "
pattern <- "^ +?\\b|\\b? +$"
startstop <- gsub(" ", "~", regmatches(x, gregexpr(pattern, x))[[1]])
text <- paste(regmatches(x, gregexpr(pattern, x), invert=TRUE)[[1]], collapse="")
paste0(startstop[1], text, startstop[2])
# [1] "~~~~this is a string~~"
And, for fun, as a function, and a "vectorized" function:
## The function
replaceEnds <- function(string) {
  pattern <- "^ +?\\b|\\b? +$"
  startstop <- gsub(" ", "~", regmatches(string, gregexpr(pattern, string))[[1]])
  text <- paste(regmatches(string, gregexpr(pattern, string), invert = TRUE)[[1]],
                collapse = "")
  paste0(startstop[1], text, startstop[2])
}
## use Vectorize here if you want to apply over a vector
vReplaceEnds <- Vectorize(replaceEnds)
Some sample data:
myStrings <- c("    Four at the start, 2 at the end  ",
               "   three at the start, one at the end ")
vReplaceEnds(myStrings)
# Four at the start, 2 at the end three at the start, one at the end
# "~~~~Four at the start, 2 at the end~~" "~~~three at the start, one at the end~"
Use gsub:
gsub(" ", "~", "    this is a string  ")
[1] "~~~~this~is~a~string~~"
This function uses regular expressions to replace (i.e. sub), all occurrences of a pattern inside a string.
In your case, you have to express the pattern in a special way:
gsub("(^ *)|( *$)", "~~~", "    this is a string  ")
[1] "~~~this is a string~~~"
The pattern means:
(^ *): Find zero or more spaces at the start of the string
( *$): Find zero or more spaces at the end of the string
|: The OR operator
Now you can use this approach to tackle your problem of replacing each space with a new character:
txt <- "    this is a string  "
foo <- function(x, new="~"){
  lead <- gsub("(^ *).*", "\\1", x)
  last <- gsub(".*?( *$)", "\\1", x)
  mid <- gsub("(^ *)|( *$)", "", x)
  paste0(
    gsub(" ", new, lead),
    mid,
    gsub(" ", new, last)
  )
}
> foo("    this is a string  ")
[1] "~~~~this is a string~~"
> foo(" And another one        ")
[1] "~And another one~~~~~~~~"
For more, see ?gsub or ?regexp.
Or using a more complex pattern matching and gsub...
gsub("\\s(?!\\b)|(?<=\\s)\\s(?=\\b)", "~", "    this is a string  " , perl = TRUE )
#[1] "~~~~this is a string~~"
Or with @AnandaMahto's data:
gsub("\\s(?!\\b)|(?<=\\s)\\s(?=\\b)", "~", myStrings , perl = TRUE )
#[1] "~~~~Four at the start, 2 at the end~~"
#[2] "~~~three at the start, one at the end~"
Explanation
This uses the positive and negative lookahead and look behind assertions:
\\s(?!\\b) - match a space, \\s, that is not followed by a word boundary, (?!\\b). On its own this handles everything except the last space before the first word, i.e. by itself we would get
"~~~ this is a string~~". So we need another pattern...
(?<=\\s)\\s(?=\\b) - match a space, \\s, that is preceded by another space, (?<=\\s), and is followed by a word boundary, (?=\\b).
And because it is gsub, every match in the string is replaced, not just the first one.
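For readers who would rather avoid lookarounds, the same result can be reached by counting the leading and trailing spaces and rebuilding the string. This is a sketch under my own assumptions: pad_ends is a hypothetical name, only literal spaces are treated as padding, and strrep requires R 3.3.0 or later.

```r
# Replace each leading and trailing space with `new`, keeping inner spaces.
pad_ends <- function(x, new = "~") {
  # Number of spaces stripped from each end.
  n_lead  <- nchar(x) - nchar(sub("^ +", "", x))
  n_trail <- nchar(x) - nchar(sub(" +$", "", x))
  core <- gsub("^ +| +$", "", x)
  paste0(strrep(new, n_lead), core, strrep(new, n_trail))
}

padded <- pad_ends(c("    this is a string  ", " And another one        "))
padded
```

All the functions used are vectorized, so no Vectorize wrapper is needed. Unlike the lookaround pattern, this also handles a single leading space; a string made only of spaces would be counted from both ends, which a real implementation would want to special-case.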