R regex: number followed by punctuation followed by space

Suppose I had a string like so:
x <- "i2: 32390. 2093.32: "
How would I return a vector that would give me the positions of where a number is followed by a : or a . followed by a space?
So for this string it would be
"2: ","0. ","2: "

The regex you need is just '\\d[\\.:]\\s'. Using stringr's str_extract_all to quickly extract matches:
library(stringr)
str_extract_all("i2: 32390. 2093.32: ", '\\d[\\.:]\\s')
produces
[[1]]
[1] "2: " "0. " "2: "
You can use it with R's built-in functions, and it should work fine, as well.
What it matches:
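\\d matches a single digit
[ ... ] defines a character class: any one of the listed characters matches
\\. matches a literal period (the escape is optional inside a character class)
: matches a colon
\\s matches any whitespace character, including a space.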
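For the base R route mentioned above, here is a minimal sketch using gregexpr and regmatches. The variable names m, matches and positions are illustrative and not part of the original answer. Since the question also asks about positions, the start offset of each match is pulled out as well:

```r
x <- "i2: 32390. 2093.32: "

# gregexpr() finds every match; perl = TRUE selects the PCRE engine
m <- gregexpr("\\d[.:]\\s", x, perl = TRUE)

# regmatches() turns the match data back into the matched substrings
matches <- regmatches(x, m)[[1]]

# the match object itself holds the 1-based start position of each match
positions <- as.integer(m[[1]])

print(matches)
print(positions)
```

Both vectors have one entry per match, so matches[i] starts at character positions[i] of x.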

Regex: Match a pattern within quoted texts

Text is
lemma A:
"
abx K() bc
"
// comment lemma B
lemma B:
"
abx bc sdsf
"
lemma C:
"
abfdfx K() bc
"
lemma D:
"
abxsf bc
"
I want to find the lemmas which contain K() inside its following quoted text. I have tried Perl regex (?s)^[ ]*lemma.*?"(?!").*?K\( but it overlaps two lemmas. The output should be: lemma A: "..." and lemma C: "...".
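If each opening double quote sits at the start of its own line, you can match a newline and then the double quote.
Then match any character except a double quote until you reach K(. The pattern needs multiline mode so that ^ matches at the start of every line:
^[ ]*lemma\b.*\R"[^"]*K\(
^ Start of a line (with the multiline flag enabled)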
[ ]*lemma\b Match optional spaces and lemma
.*\R Match the rest of the line and a newline
"[^"]* Match " followed by optional chars other than "
K\( Match K(
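The question uses Perl regex, but the same pattern can be tried from R with perl = TRUE. The sketch below is only an illustration under a few assumptions: the sample text is held in one string called txt, the inline (?m) flag is added to enable multiline mode, and lemma_lines is a made-up name for the result.

```r
# the sample text from the question, joined into one multi-line string
txt <- paste(
  'lemma A:', '"', 'abx K() bc', '"',
  '// comment lemma B',
  'lemma B:', '"', 'abx bc sdsf', '"',
  'lemma C:', '"', 'abfdfx K() bc', '"',
  'lemma D:', '"', 'abxsf bc', '"',
  sep = "\n"
)

# (?m) lets ^ match at every line start; [^"]* cannot cross a closing quote,
# so a match can never run on into the next lemma
pat <- '(?m)^[ ]*lemma\\b.*\\R"[^"]*K\\('

hits <- regmatches(txt, gregexpr(pat, txt, perl = TRUE))[[1]]

# keep only the first line of each hit, i.e. the lemma header
lemma_lines <- vapply(strsplit(hits, "\n", fixed = TRUE), `[`, "", 1)
print(lemma_lines)
```

Each hit ends at K(, so to capture the whole quoted block you would extend the pattern with [^"]*" after K\(.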
You could use:
(?s)^[ ]*lemma[^"]*"[^"]*?K\(
[^"] means "any character except a double quote".

replace every other space with new line

I have strings like this:
a <- "this string has an even number of words"
b <- "this string doesn't have an even number of words"
I want to replace every other space with a new line. So the output would look like this...
myfunc(a)
# "this string\nhas an\neven number\nof words"
myfunc(b)
# "this string\ndoesn't have\nan even\nnumber of\nwords"
I've accomplished this by doing a strsplit, paste-ing a newline on even numbered words, then paste(a, collapse=" ") them back together into one string. Is there a regular expression to use with gsub that can accomplish this?
@Jota suggested a simple and concise way:
myfunc = function(x) gsub("( \\S+) ", "\\1\n", x) # Jota's
myfunc2 = function(x) gsub("([^ ]+ [^ ]+) ", "\\1\n", x) # my idea
lapply(list(a,b), myfunc)
[[1]]
[1] "this string\nhas an\neven number\nof words"
[[2]]
[1] "this string\ndoesn't have\nan even\nnumber of\nwords"
How it works. The idea of "([^ ]+ [^ ]+) " regex is (1) "find two sequences of words/nonspaces with a space between them and a space after them" and (2) "replace the trailing space with a newline".
@Jota's "( \\S+) " is trickier: it finds any word with a space before and after it, then replaces the trailing space with a newline. This works because the first word it catches is the second word of the string, and the next word it catches is not the third (the space in front of the third word was already consumed when handling the second word) but the fourth, and so on.
Oh, and some basic regex stuff.
[^xyz] means any single char except the chars x, y, and z.
\\s is any whitespace character, while \\S is any non-whitespace character
x+ means x one or more times
(x) "captures" x, allowing for reference in the replacement, like \\1
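To make the "consumed space" argument concrete, this small sketch lists exactly which pieces "( \\S+) " matches in the first example string. The names pat, consumed and result are illustrative only.

```r
a <- "this string has an even number of words"
pat <- "( \\S+) "

# every match includes its leading and trailing space, so the next
# search can only start at the word after the trailing space
consumed <- regmatches(a, gregexpr(pat, a, perl = TRUE))[[1]]
print(consumed)

# replacing each match with group 1 plus a newline swaps only the trailing space
result <- gsub(pat, "\\1\n", a, perl = TRUE)
print(result)
```

Only the second, fourth and sixth words are matched, which is why every other space ends up as a newline.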

Remove trailing and leading spaces and extra internal whitespace with one gsub call

I know you can remove trailing and leading spaces with
gsub("^\\s+|\\s+$", "", x)
And you can remove internal spaces with
gsub("\\s+"," ",x)
I can combine these into one function, but I was wondering if there was a way to do it with just one use of the gsub function
trim <- function (x) {
x <- gsub("^\\s+|\\s+$", "", x)
gsub("\\s+", " ", x)
}
testString<- " This is a test. "
trim(testString)
Here is an option:
gsub("^ +| +$|( ) +", "\\1", testString) # with Frank's input, and Agstudy's style
We use a capturing group to make sure that multiple internal spaces are replaced by a single space. Change " " to \\s if you expect non-space whitespace you want to remove.
Using a positive lookbehind :
gsub("^ *|(?<= ) | *$",'',testString,perl=TRUE)
# "This is a test."
Explanation :
## "^ *" matches any leading space
## "(?<= ) " The general form is (?<=a)b :
## matches a "b"( a space here)
## that is preceded by "a" (another space here)
## " *$" matches trailing spaces
You can just add \\s+(?=\\s) to your original regex:
gsub("^\\s+|\\s+$|\\s+(?=\\s)", "", x, perl=TRUE)
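As a quick comparison of the lookbehind and lookahead variants, the sketch below runs both on a made-up input with extra spaces (s and tabbed are hypothetical test strings, not the original testString). It also shows where the space-only pattern and the \\s pattern differ.

```r
s <- "  This   is a    test.  "   # hypothetical input with extra spaces

space_only <- "^ *|(?<= ) | *$"            # lookbehind version, literal spaces only
any_ws     <- "^\\s+|\\s+$|\\s+(?=\\s)"    # lookahead version, any whitespace

r_space <- gsub(space_only, "", s, perl = TRUE)
r_ws    <- gsub(any_ws, "", s, perl = TRUE)
print(c(r_space, r_ws))

# a tab between two spaces: only the \\s version collapses the run
tabbed <- "a \t b"
t_space <- gsub(space_only, "", tabbed, perl = TRUE)
t_ws    <- gsub(any_ws, "", tabbed, perl = TRUE)
print(c(t_space, t_ws))
```

Both patterns give the same result on plain spaces. The choice only matters when tabs or newlines may appear in the input.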
You've asked for a gsub option and gotten good options. There's also rm_white_multiple from "qdapRegex":
> testString<- " This is a test. "
> library(qdapRegex)
> rm_white_multiple(testString)
[1] "This is a test."
If an answer not using gsub is acceptable then the following does it. It does not use any regular expressions:
paste(scan(textConnection(testString), what = "", quiet = TRUE), collapse = " ")
giving:
[1] "This is a test."
You can also nest gsub calls, though this is less elegant than the previous answers:
> gsub("\\s+"," ",gsub("^\\s+|\\s+$","",testString))
[1] "This is a test."

How to use gsub() to add a string to the front and remove a string from the end at same time?

I want to match end of given string to a certain pattern. If pattern is there, then I want to remove it and add a certain string to the front of given string.
E.g: If I have gradesmeanA and the pattern is meanA then I want to add "mean of class A " to the front of gradesmeanA and remove meanA at the end. So the result should be "mean of class A grades".
I want to use gsub() and regular expressions. I want to do this in one step.
What I tried:
s1<-gsub(pattern="/\\w /meanA$",replacement="mean of class A / \\w/","gradesmeanA")
but didn't work.
You could try the below code,
> s <- "gradesmeanA"
> sub("^(.*?)meanA$", "mean of class A \\1", s, perl=TRUE)
[1] "mean of class A grades"
> sub("^(.*)meanA$", "mean of class A \\1", s)
[1] "mean of class A grades"
Pattern Explanation:
^ matches the start of the string.
(.*) is a capturing group; here it captures every character before the final meanA.
meanA$ matches the literal meanA at the end of the string ($ matches the end).
sub and gsub replace the entire matched text. Here the match is the whole string, so it is replaced by mean of class A followed by the characters captured in group 1.
OR
An ugly one,
> gsub("^(.*?)(mean)(A)$", "\\2 of class \\3 \\1", s, perl=TRUE)
[1] "mean of class A grades"
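Because the pattern is anchored with $, strings that do not end in meanA are left untouched, which covers the "if the pattern is there" part of the question. A small sketch on a vector (the extra labels are invented for illustration):

```r
# only the first label ends in "meanA"; the other two are made-up counterexamples
labels <- c("gradesmeanA", "scoresmeanB", "meanAgrades")

# sub() is vectorised: non-matching elements are returned unchanged
relabelled <- sub("^(.*)meanA$", "mean of class A \\1", labels)
print(relabelled)
```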

How does "gsub" handle spaces?

I have a character string "ab b cde", i.e. "ab[space]b[space]cde". I want to replace "space-b" and "space-c" with blank spaces, so that the output string is "ab[space][space][space][space]de". I can't figure out how to get rid of the second "b" without deleting the first one. I have tried:
gsub("[\\sb,\\sc]", " ", "ab b cde", perl=T)
but this is giving me "a[spaces]de". Any pointers? Thanks.
Edit: Consider a more complicated problem: I want to convert the string "akui i ii" i.e. "akui[space]i[space]ii" to "akui[spaces]" by removing the "space-i" and "space-ii".
[\sb,\sc] is a character class, so it means "one character that is whitespace, b, a comma, or c".
You probably want something like (\sb|\sc), which means "space followed by b, or space followed by c"
or \s[bc] which means "space followed by b or c".
s <- "ab b cde"
gsub( "(\\sb|\\sc)", "  ", s, perl=TRUE )
gsub( "\\s[bc]", "  ", s, perl=TRUE )
gsub( "[[:space:]][bc]", "  ", s, perl=TRUE ) # No backslashes
To remove multiple instances of a letter (as in the second example) include a + after the letter to be removed.
s2 <- "akui i ii"
gsub("\\si+", " ", s2)
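To see why the bracketed attempt over-matches, the sketch below uses underscores instead of spaces as the replacement so the result is easy to read. The underscore replacement and the names char_class and sequence are illustrative choices only.

```r
s <- "ab b cde"

# character class: every single whitespace, b, comma or c is replaced
char_class <- gsub("[\\sb,\\sc]", "_", s, perl = TRUE)

# sequence: only a whitespace character followed by b or c is replaced,
# with two underscores so the string length stays the same
sequence <- gsub("\\s[bc]", "__", s, perl = TRUE)

print(c(char_class, sequence))
```

The first b survives in the second call because it is not preceded by whitespace.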
There is a simple solution to this.
gsub("\\s[bc]", "  ", "ab b cde", perl=TRUE)
This will give you what you want.
You can use lookbehind matching like this:
gsub("(?<=\\s)i+", " ", "akui i ii", perl=TRUE)
Edit:
Lookbehind is still the way to go, demonstrated with another example from your original post. Hope this helps.
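The difference between the lookbehind and a plain \\s prefix is whether the preceding space is part of the match. A sketch with an underscore as a visible replacement (an illustrative choice, as are the variable names):

```r
s2 <- "akui i ii"

# \\s is consumed: the space and the run of i's are replaced together
consuming <- gsub("\\si+", "_", s2, perl = TRUE)

# (?<=\\s) only checks for the space: the space stays, the i's are replaced
lookbehind <- gsub("(?<=\\s)i+", "_", s2, perl = TRUE)

print(c(consuming, lookbehind))
```

The i at the end of akui is untouched in both cases because it is preceded by u, not whitespace.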
x<-"ab b cde"
gsub(" b| c", "  ",x)
Note the double spaces in the 2nd argument.