replace every other space with new line - regex

I have strings like this:
a <- "this string has an even number of words"
b <- "this string doesn't have an even number of words"
I want to replace every other space with a new line. So the output would look like this...
myfunc(a)
# "this string\nhas an\neven number\nof words"
myfunc(b)
# "this string\ndoesn't have\nan even\nnumber of\nwords"
I've accomplished this by doing a strsplit, paste-ing a newline on even numbered words, then paste(a, collapse=" ") them back together into one string. Is there a regular expression to use with gsub that can accomplish this?

#Jota suggested a simple and concise way:
myfunc = function(x) gsub("( \\S+) ", "\\1\n", x) # Jota's
myfunc2 = function(x) gsub("([^ ]+ [^ ]+) ", "\\1\n", x) # my idea
lapply(list(a,b), myfunc)
[[1]]
[1] "this string\nhas an\neven number\nof words"
[[2]]
[1] "this string\ndoesn't have\nan even\nnumber of\nwords"
How it works. The idea of "([^ ]+ [^ ]+) " regex is (1) "find two sequences of words/nonspaces with a space between them and a space after them" and (2) "replace the trailing space with a newline".
#Jota's "( \\S+) " is trickier -- it finds any word with a space before and after it and then replaces the trailing space with a newline. This works because the first word that is caught by this is the second word of the string; and the next word caught by it is not the third (since we have already "consumed"/looked at the space in front of the third word when handling the second word), but rather the fourth; and so on.
Oh, and some basic regex stuff.
[^xyz] means any single char except the chars x, y, and z.
\\s is a space, while \\S is anything but a space
x+ means x one or more times
(x) "captures" x, allowing for reference in the replacement, like \\1

Related

Subdivide an expression into alternative subpattern - using gsub()

I'm trying to subdivide my metacharacter expression in my gsub() function. But it does not return anything found.
Task: I want to delete all sections of string that contain either .ST or -XST in my vector of strings.
As you can see below, using one expression works fine. But the | expression simply does not work. I'm following the metacharacter guide on https://www.stat.auckland.ac.nz/~paul/ItDT/HTML/node84.html
What can be the issue? And what caused this issue?
My data
> rownames(table.summary)[1:10]
[1] "AAK.ST" "ABB.ST" "ALFA.ST" "ALIV-SDB.ST" "AOI.ST" "ATCO-A.ST" "ATCO-B.ST" "AXFO.ST" "AXIS.ST" "AZN.ST"
> gsub(pattern = '[.](.*)$ | [-](.*)$', replacement = "", x = rownames(table.summary)[1:10])
[1] "AAK.ST" "ABB.ST" "ALFA.ST" "ALIV-SDB.ST" "AOI.ST" "ATCO-A.ST" "ATCO-B.ST" "AXFO.ST" "AXIS.ST" "AZN.ST"
> gsub(pattern = '[.](.*)$', replacement = "", x = rownames(table.summary)[1:10])
[1] "AAK" "ABB" "ALFA" "ALIV-SDB" "AOI" "ATCO-A" "ATCO-B" "AXFO" "AXIS" "AZN"
> gsub(pattern = '[-](.*)$', replacement = "", x = rownames(table.summary)[1:10])
[1] "AAK.ST" "ABB.ST" "ALFA.ST" "ALIV" "AOI.ST" "ATCO" "ATCO" "AXFO.ST" "AXIS.ST" "AZN.ST"
It seems you tested your regex with a flag like IgnorePatternWhitespace (VERBOSE, /x) that allows whitespace inside patterns for readability. You can use it with perl=T option:
d <- c("AAK.ST","ABB.ST","ALFA.ST","ALIV-SDB.ST","AOI.ST","ATCO-A.ST","ATCO-B.ST","AXFO.ST", "AXIS.ST","AZN.ST")
gsub('(?x)[.](.*)$ | [-](.*)$', '', d, perl=T)
## [1] "AAK" "ABB" "ALFA" "ALIV" "AOI" "ATCO" "ATCO" "AXFO" "AXIS" "AZN"
However, you really do not have to use that complex regex here.
If you plan to remove all substrings from ther first hyphen or dot up to the end, you may use the following regex:
[.-].*$
The character class [.-] will match the first . or - symbol and .* wil match all characters up to the end of the string ($).
See IDEONE demo:
d <- c("AAK.ST","ABB.ST","ALFA.ST","ALIV-SDB.ST","AOI.ST","ATCO-A.ST","ATCO-B.ST","AXFO.ST", "AXIS.ST","AZN.ST")
gsub("[.-].*$", "", d)
Result: [1] "AAK" "ABB" "ALFA" "ALIV" "AOI" "ATCO" "ATCO" "AXFO" "AXIS" "AZN"
This will find .ST or -XST at the end of the text and substitute it with empty characters string (effectively removing that part). Don't forget that gsub returns modified string, not modifies it in place. You won't see any change until you reassign return value back to some variable.
strings <- c("AAK.ST", "ABB.ST", "ALFA.ST", "ALIV-SDB.ST", "AOI.ST", "ATCO-A.ST", "ATCO-B.ST", "AXFO.ST", "AXIS.ST", "AZN.ST", "AAC-XST", "AAD-XSTV")
strings <- gsub('(\\.ST|-XST)$', '', strings)
Your regular expression ([.](.*)$ | [-](.*)$'), if not for unnecessary spaces, would remove everything from first dot (.) or dash (-) to end of text. This might be what you want, but not what you said you want.

Remove trailing and leading spaces and extra internal whitespace with one gsub call

I know you can remove trailing and leading spaces with
gsub("^\\s+|\\s+$", "", x)
And you can remove internal spaces with
gsub("\\s+"," ",x)
I can combine these into one function, but I was wondering if there was a way to do it with just one use of the gsub function
trim <- function (x) {
x <- gsub("^\\s+|\\s+$|", "", x)
gsub("\\s+", " ", x)
}
testString<- " This is a test. "
trim(testString)
Here is an option:
gsub("^ +| +$|( ) +", "\\1", testString) # with Frank's input, and Agstudy's style
We use a capturing group to make sure that multiple internal spaces are replaced by a single space. Change " " to \\s if you expect non-space whitespace you want to remove.
Using a positive lookbehind :
gsub("^ *|(?<= ) | *$",'',testString,perl=TRUE)
# "This is a test."
Explanation :
## "^ *" matches any leading space
## "(?<= ) " The general form is (?<=a)b :
## matches a "b"( a space here)
## that is preceded by "a" (another space here)
## " *$" matches trailing spaces
You can just add \\s+(?=\\s) to your original regex:
gsub("^\\s+|\\s+$|\\s+(?=\\s)", "", x, perl=T)
See DEMO
You've asked for a gsub option and gotten good options. There's also rm_white_multiple from "qdapRegex":
> testString<- " This is a test. "
> library(qdapRegex)
> rm_white_multiple(testString)
[1] "This is a test."
If an answer not using gsub is acceptable then the following does it. It does not use any regular expressions:
paste(scan(textConnection(testString), what = "", quiet = TRUE), collapse = " ")
giving:
[1] "This is a test."
You can also use nested gsub. Less elegant than the previous answers tho
> gsub("\\s+"," ",gsub("^\\s+|\\s$","",testString))
[1] "This is a test."

Removing multiple commas and trailing commas using gsub

This question is very similar to Removing multiple spaces and trailing spaces using gsub, except that I'd like to apply it to commas instead of spaces.
For example, I'd like a function TrimCommas to turn x into y:
x <- c("a,b,c", ",a,b,,c", ",,,a,,,b,c,,,")
# y <- TrimCommas(x) # presumably
y <- c("a,b,c", "a,b,c", "a,b,c")
The solution for spaces was gsub("^ *|(?<= ) | *$", "", x, perl=T), so I'm hoping comparing the solution for this will help explain some regex fundamentals as well.
Isn't the solution pretty similar?
x <- c("a,b,c", ",a,b,,c", ",,,a,,,b,c,,,")
gsub("^,*|(?<=,),|,*$", "", x, perl=T)
# [1] "a,b,c" "a,b,c" "a,b,c"
There are three parts to the regex ^,*|(?<=,),|,*$:
^,* -- this matches 0 or more commas at the beginning of the string
(?<=,), -- this is a positive lookbehind to see if there a comma behind a comma, so it matches , in ,,
,*$ -- this matches 0 or more commas at the end of the string
As you can see all of the above are substituted with nothing.
You can make this generic to any character (" ", ",", etc.) with this function:
TrimMult <- function(x, char=" ") {
return(gsub(paste0("^", char, "*|(?<=", char, ")", char, "|", char, "*$"),
"", x, perl=T))
}

How to trim and replace a string

string<-c(" this is a string ")
Is it possible to trim-off the white spaces on both the sides of the string (or just one side as required) and replace it with a desired character, such as this, in R? The number of white spaces differ on each side of the string and have to be retained on replacement.
"~~~~~~~this is a string~~"
This seems like an inefficient way of doing it, but maybe you should be looking in the direction of gregexpr and regmatches instead of gsub:
x <- " this is a string "
pattern <- "^ +?\\b|\\b? +$"
startstop <- gsub(" ", "~", regmatches(x, gregexpr(pattern, x))[[1]])
text <- paste(regmatches(x, gregexpr(pattern, x), invert=TRUE)[[1]], collapse="")
paste0(startstop[1], text, startstop[2])
# [1] "~~~~this is a string~~"
And, for fun, as a function, and a "vectorized" function:
## The function
replaceEnds <- function(string) {
pattern <- "^ +?\\b|\\b? +$"
startstop <- gsub(" ", "~", regmatches(string, gregexpr(pattern, string))[[1]])
text <- paste(regmatches(string, gregexpr(pattern, string), invert = TRUE)[[1]],
collapse = "")
paste0(startstop[1], text, startstop[2])
}
## use Vectorize here if you want to apply over a vector
vReplaceEnds <- Vectorize(replaceEnds)
Some sample data:
myStrings <- c(" Four at the start, 2 at the end ",
" three at the start, one at the end ")
vReplaceEnds(myStrings)
# Four at the start, 2 at the end three at the start, one at the end
# "~~~~Four at the start, 2 at the end~~" "~~~three at the start, one at the end~"
Use gsub:
gsub(" ", "~", " this is a string ")
[1] "~~~~this~is~a~string~~"
This function uses regular expressions to replace (i.e. sub), all occurrences of a pattern inside a string.
In your case, you have to express the pattern in a special way:
gsub("(^ *)|( *$)", "~~~", " this is a string ")
[1] "~~~this is a string~~~"
The pattern means:
(^ *): Find one or more spaces at the start of the string
( *$): Find one or more spaces at the end of the string
`|: The OR operator
Now you can use this approach to tackle your problem of replacing each space with a new character:
txt <- " this is a string "
foo <- function(x, new="~"){
lead <- gsub("(^ *).*", "\\1", x)
last <- gsub(".*?( *$)", "\\1", x)
mid <- gsub("(^ *)|( *$)", "", x)
paste0(
gsub(" ", new, lead),
mid,
gsub(" ", new, last)
)
}
> foo(" this is a string ")
[1] "~~~~this is a string~~"
> foo(" And another one ")
[1] "~And another one~~~~~~~~"
For more, see ?gsub or ?regexp.
Or using a more complex pattern matching and gsub...
gsub("\\s(?!\\b)|(?<=\\s)\\s(?=\\b)", "~", " this is a string " , perl = TRUE )
#[1] "~~~~this is a string~~"
Or with #AnandaMahto's data:
gsub("\\s(?!\\b)|(?<=\\s)\\s(?=\\b)", "~", myStrings , perl = TRUE )
#[1] "~~~~Four at the start, 2 at the end~~"
#[2] "~~~three at the start, one at the end~"
Explanation
This uses the positive and negative lookahead and look behind assertions:
\\s(?!\\b) - match a space, \\s not followed by a word boundary, (?!\\b). This would work by itself for everything except the last space before the first word, i.e. by itself we would get
"~~~~ this is a string~~". So we need another pattern...
(?<=\\s)\\s(?=\\b) - match a space, \\s that is preceded by another space, (?<=\\s) and is followed by a word boundary, (?=\\b).
And it is gsub so it tries to make the maximal number of matches that it can.

Regular expression to find and replace conditionally

I need to replace string A with string B, only when string A is a whole word (e.g. "MECH"), and I don't want to make the replacement when A is a part of a longer string (e.g. "MECHANICAL"). So far, I have a grepl() which checks if string A is a whole string, but I cannot figure out how to make the replacement. I have added an ifelse() with the idea to makes the gsub() replacement when grep() returns TRUE, otherwise not to replace. Any suggestions? Please see the code below. Thanks.
aa <- data.frame(type = c("CONSTR", "MECH CONSTRUCTION", "MECHANICAL CONSTRUCTION MECH", "MECH CONSTR", "MECHCONSTRUCTION"))
from <- c("MECH", "MECHANICAL", "CONSTR", "CONSTRUCTION")
to <- c("MECHANICAL", "MECHANICAL", "CONSTRUCTION", "CONSTRUCTION")
gsub2 <- function(pattern, replacement, x, ...) {
for(i in 1:length(pattern)){
reg <- paste0("(^", pattern[i], "$)|(^", pattern[i], " )|( ", pattern[i], "$)|( ", pattern[i], " )")
ifelse(grepl(reg, aa$type),
x <- gsub(pattern[i], replacement[i], x, ...),
aa$type)
}
x
}
aa$title3 <- gsub2(from, to, aa$type)
You can enclose the strings in the from vector in \\< and \\> to match only whole words:
x <- c("CONSTR", "MECH CONSTRUCTION", "MECHANICAL CONSTRUCTION MECH",
"MECH CONSTR", "MECHCONSTRUCTION")
from <- c("\\<MECH\\>", "\\<CONSTR\\>")
to <- c("MECHANICAL", "CONSTRUCTION")
for(i in 1:length(from)){
x <- gsub(from[i], to[i], x)
}
print(x)
# [1] "CONSTRUCTION" "MECHANICAL CONSTRUCTION"
# [3] "MECHANICAL CONSTRUCTION MECHANICAL" "MECHANICAL CONSTRUCTION"
# [5] "MECHCONSTRUCTION"
I use regex (?<=\W|^)MECH(?=\W|$) to get if inside the string contain whole word MECH like this.
Is that what you need?
Just for posterity, other than using the \< \> enclosure, a whole word can be defined as any string ending in a space or end-of-line (\s|$).
gsub("MECH(\\s|$)", "MECHANICAL\\1", aa$type)
The only problem with this approach is that you need to carry over the space or end-of-line that you used as part of the match, hence the encapsulation in parentheses and the backreference (\1).
The \< \> enclosure is superior for this particular question, since you have no special exceptions. However, if you have exceptions, it is better to use a more explicit method. The more tools in your toolbox, the better.