Remove spaces between words of a certain length - regex

I have strings of the following variety:
A B C Company
XYZ Inc
S & K Co
I would like to remove the spaces in these strings only where they fall between single-letter words. For example, in the first string I would like to remove the spaces between A, B and C, but not the one between C and Company. The result should be:
ABC Company
XYZ Inc
S&K Co
What is the proper regular expression to use in gsub for this?

Here is one way you could do this seeing how & is mixed in and not a word character ...
x <- c('A B C Company', 'XYZ Inc', 'S & K Co', 'A B C D E F G Company')
gsub('(?<!\\S\\S)\\s+(?=\\S(?!\\S))', '', x, perl=TRUE)
# [1] "ABC Company" "XYZ Inc" "S&K Co" "ABCDEFG Company"
Explanation:
First we assert that the whitespace is not preceded by two consecutive non-whitespace characters. Then we match whitespace "one or more" times. Finally we look ahead to assert that exactly one non-whitespace character follows, that is, a non-whitespace character that is not itself followed by another non-whitespace character.
(?<! # look behind to see if there is not:
\S # non-whitespace (all but \n, \r, \t, \f, and " ")
\S # non-whitespace (all but \n, \r, \t, \f, and " ")
) # end of look-behind
\s+ # whitespace (\n, \r, \t, \f, and " ") (1 or more times)
(?= # look ahead to see if there is:
\S # non-whitespace (all but \n, \r, \t, \f, and " ")
(?! # look ahead to see if there is not:
\S # non-whitespace (all but \n, \r, \t, \f, and " ")
) # end of look-ahead
) # end of look-ahead
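For anyone who wants to check the pattern outside R, here is a minimal sketch in Python. It assumes Python's standard re module, which accepts the same lookaround syntax (the lookbehind is fixed-width, so it is allowed). The helper name squeeze_single_letters is made up for this example.

```python
import re

# Same pattern as the gsub call above, written as a Python raw string.
SINGLE_LETTER_GAP = re.compile(r'(?<!\S\S)\s+(?=\S(?!\S))')

def squeeze_single_letters(text):
    """Remove whitespace that is not preceded by two non-whitespace
    characters and is followed by exactly one non-whitespace character."""
    return SINGLE_LETTER_GAP.sub('', text)

names = ['A B C Company', 'XYZ Inc', 'S & K Co', 'A B C D E F G Company']
for name in names:
    print(squeeze_single_letters(name))
```

On these four inputs the sketch should reproduce the R output shown above.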

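Obligatory strsplit / paste answer. This will also get those single characters that might be in the middle or at the end of the string. Note that it pastes together every single-character token in a string, adjacent or not, and that unique() also drops any repeated word.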
x <- c('A B C Company', 'XYZ Inc', 'S & K Co',
'A B C D E F G Company', 'Company A B C', 'Co A B C mpany')
foo <- function(x) {
x[nchar(x) == 1L] <- paste(x[nchar(x) == 1L], collapse = "")
paste(unique(x), collapse = " ")
}
vapply(strsplit(x, " "), foo, character(1L))
# [1] "ABC Company" "XYZ Inc" "S&K Co"
# [4] "ABCDEFG Company" "Company ABC" "Co ABC mpany"
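A split-and-join variant can also be sketched in Python. This is not a translation of foo above: it is a different take that merges only runs of adjacent single-character tokens, using itertools.groupby from the standard library. The function name merge_single_char_tokens is an assumption made for this example.

```python
from itertools import groupby

def merge_single_char_tokens(text):
    """Split on whitespace, glue each run of adjacent one-character
    tokens together, and join everything back with single spaces."""
    tokens = text.split()
    merged = []
    # groupby yields consecutive runs of tokens that share the same key,
    # here "is this token exactly one character long?".
    for is_single, run in groupby(tokens, key=lambda tok: len(tok) == 1):
        run = list(run)
        if is_single:
            merged.append(''.join(run))
        else:
            merged.extend(run)
    return ' '.join(merged)

print(merge_single_char_tokens('Co A B C mpany'))
```

Because only adjacent tokens are merged, a string such as 'A Company B' is left unchanged, and repeated words are kept.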

Coming late to the game, but would this pattern work for you?
(?<!\\S\\S)\\s+(?!\\S\\S)

Another option
(?![ ]+\\S\\S)[ ]+
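The two short patterns behave the same on the question's examples but can differ when the single letters come after a longer word. A small Python sketch to compare them, assuming the standard re module. The constant names are invented here, and 'Company A B C' is an extra test string borrowed from the strsplit answer.

```python
import re

# (?<!\S\S)\s+(?!\S\S): lookbehind and lookahead, from the previous answer.
LOOKAROUND_BOTH = re.compile(r'(?<!\S\S)\s+(?!\S\S)')
# (?![ ]+\S\S)[ ]+: lookahead only, from this answer.
LOOKAHEAD_ONLY = re.compile(r'(?![ ]+\S\S)[ ]+')

for text in ['A B C Company', 'S & K Co', 'Company A B C']:
    both = LOOKAROUND_BOTH.sub('', text)
    ahead = LOOKAHEAD_ONLY.sub('', text)
    print(repr(text), '->', repr(both), '|', repr(ahead))
```

With the lookahead-only pattern the space after Company is removed as well, because nothing checks what precedes the space. Whether that matters depends on the data.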

You could do this also through PCRE verb (*SKIP)(*F)
> x <- c('A B C Company', 'XYZ Inc', 'S & K Co', 'A B C D E F G Company', ' H & K')
> gsub("\\s*\\S\\S+\\s*(*SKIP)(*F)|(?<=\\S)\\s+(?=\\S)", "", x, perl=TRUE)
[1] "ABC Company" "XYZ Inc" "S&K Co" "ABCDEFG Company"
[5] " H&K"
Explanation:
\\s*\\S\\S+\\s* matches two or more non-space characters, together with any preceding and following spaces.
(*SKIP)(*F) causes that match to fail and skips past it, so the text it covered cannot be matched again.
| Alternation: the second branch is then tried on the remaining text.
(?<=\\S)\\s+(?=\\S) matches one or more spaces that are preceded by a non-space character and followed by a non-space character.
Removing those spaces gives the desired output.
Note: See the last element. This regex won't remove the leading space, because a space at the start of the string isn't preceded by a non-space character.
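Python's standard re module has no (*SKIP)(*F), but the same "match what you want to keep first, then match what you want to remove" idea can be emulated with an alternation and a replacement function. This is a sketch of that emulation, not the PCRE verbs themselves. The names SKIP_OR_GAP and drop_gaps are made up for this example.

```python
import re

# Branch 1 matches text that must be left alone: a run of two or more
# non-space characters plus any surrounding whitespace.
# Branch 2 captures (group 1) whitespace that sits between two non-space
# characters; only this branch sets group 1.
SKIP_OR_GAP = re.compile(r'\s*\S\S+\s*|(?<=\S)(\s+)(?=\S)')

def drop_gaps(text):
    def replace(match):
        # Group 1 is set only when branch 2 matched: delete that gap.
        # Otherwise branch 1 matched: put the text back unchanged.
        return '' if match.group(1) is not None else match.group(0)
    return SKIP_OR_GAP.sub(replace, text)

for text in ['A B C Company', 'XYZ Inc', 'S & K Co', ' H & K']:
    print(repr(drop_gaps(text)))
```

As in the R version, the leading space of ' H & K' survives, since nothing precedes it.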

Related

Regex capture optional group in any order

I would like to capture groups based on a consecutive occurrence of matched groups in any order. And when one set type is repeated without the alternative set type, the alternative set is returned as nil.
So the following:
"123 dog cat cow 456 678 890 sheep"
Would return the following:
[["123", "dog"], [nil, "cat"], ["456", "cow"], ["678", nil], ["890", "sheep"]]
A regular expression can get us part of the way, but I do not believe all the way.
r = /
(?: # begin non-capture group
\d+ # match 1+ digits
[ ] # match 1 space
[^ \d]+ # match 1+ chars other than digits and spaces
| # or
[^ \d]+ # match 1+ chars other than digits and spaces
[ ] # match 1 space
\d+ # match 1+ digits
| # or
[^ ]+ # match 1+ chars other than spaces
) # end non-capture group
/x # free-spacing regex definition mode
str = "123 dog cat cow 456 678 890 sheep"
str.scan(r).map do |s|
case s
when /\d [^ \d]/
s.split(' ')
when /[^ \d] \d/
s.split(' ').reverse
when /\d/
[s,nil]
else
[nil,s]
end
end
#=> [["123", "dog"], [nil, "cat"], ["456", "cow"],
# ["678", nil], ["890", "sheep"]]
Note:
str.scan r
#=> ["123 dog", "cat", "cow 456", "678", "890 sheep"]
This regular expression is conventionally written
/(?:\d+ [^ \d]+|[^ \d]+ \d+|[^ ]+)/
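The same scan-then-classify idea carries over to other languages. Here is a rough Python sketch, assuming the standard re module, with None standing in for nil. The function name pair_tokens is an assumption for this example.

```python
import re

# Compact form of the regex above: digit-word pair, word-digit pair,
# or any lone run of non-space characters.
CHUNK = re.compile(r'\d+ [^ \d]+|[^ \d]+ \d+|[^ ]+')

def pair_tokens(text):
    pairs = []
    for chunk in CHUNK.findall(text):
        if re.search(r'\d [^ \d]', chunk):      # "123 dog"
            number, word = chunk.split(' ')
            pairs.append([number, word])
        elif re.search(r'[^ \d] \d', chunk):    # "cow 456"
            word, number = chunk.split(' ')
            pairs.append([number, word])
        elif re.search(r'\d', chunk):           # lone number
            pairs.append([chunk, None])
        else:                                   # lone word
            pairs.append([None, chunk])
    return pairs

print(pair_tokens("123 dog cat cow 456 678 890 sheep"))
```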
Here is another solution that only uses regular expressions incidentally.
def doit(str)
str.gsub(/[^ ]+/).with_object([]) do |s,a|
prev = a.empty? ? [0,'a'] : a.last
case s
when /\A\d+\z/ # all digits
if prev[0].nil?
a[-1][0] = s
else
a << [s,nil]
end
when /\A\D+\z/ # all non-digits
if prev[1].nil?
a[-1][1] = s
else
a << [nil,s]
end
else
raise ArgumentError
end
end
end
doit str
#=> [["123", "dog"], [nil, "cat"], ["456", "cow"], ["678", nil],
# ["890", "sheep"]]
This uses the form of String#gsub that has no block and therefore returns an enumerator:
enum = str.gsub(/[^ ]+/)
#=> #<Enumerator: "123 dog cat cow 456 678 890 sheep":gsub(/[^ ]+/)>
enum.next
#=> "123"
enum.next
#=> "dog"
...
enum.next
#=> "sheep"
enum.next
#=> StopIteration (iteration reached an end)
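For comparison, Python's re.finditer plays a similar role: it returns a lazy iterator over the matches, and calling next on an exhausted iterator raises StopIteration. Below is a sketch of the token-by-token approach in Python, under the assumption that None stands in for nil and ValueError for ArgumentError. The name pair_up is invented for this example.

```python
import re

def pair_up(text):
    pairs = []
    for match in re.finditer(r'[^ ]+', text):
        token = match.group()
        if re.fullmatch(r'\d+', token):          # all digits
            if pairs and pairs[-1][0] is None:
                pairs[-1][0] = token             # fill the open number slot
            else:
                pairs.append([token, None])
        elif re.fullmatch(r'\D+', token):        # all non-digits
            if pairs and pairs[-1][1] is None:
                pairs[-1][1] = token             # fill the open word slot
            else:
                pairs.append([None, token])
        else:
            raise ValueError(f"mixed token: {token!r}")
    return pairs

tokens = re.finditer(r'[^ ]+', "123 dog")
print(next(tokens).group())   # first token
print(next(tokens).group())   # second token; one more next() raises StopIteration
print(pair_up("123 dog cat cow 456 678 890 sheep"))
```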

Regex to reject only nonalphanumeric characters

The keyword to be checked is other. It should not be preceded or followed by an alphanumeric character.
Spaces are allowed, \n is allowed, and special characters are allowed.
Not allowed - "AOther9", "noTHERX"
Allowed - "other", "\nother" , " other ", "$other/"
grepl(paste("[^a-zA-Z0-9]","other","[^a-zA-Z0-9]",sep=""),String1 , ignore.case = TRUE)
The above regex works for all cases except when the keyword stands alone, that is, when it is preceded and followed by nothing.
You need to use a PCRE regex with lookarounds:
grepl(paste("(?<![a-zA-Z0-9])","other","(?![a-zA-Z0-9])",sep=""), String1, ignore.case = TRUE, perl=TRUE)
The changes are the two negative lookarounds, (?<![a-zA-Z0-9]) and (?![a-zA-Z0-9]), and perl=TRUE. Negative lookarounds do not consume characters, so they do not require a non-alphanumeric character to actually be present in the string. They only require that no alphanumeric character is adjacent to the keyword.
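The lookaround pattern is not specific to R. Here is a minimal Python sketch with the standard re module, where re.IGNORECASE takes the place of ignore.case = TRUE. The helper name has_standalone_keyword is made up for this example.

```python
import re

# "other" with no alphanumeric character directly before or after it.
KEYWORD = re.compile(r'(?<![a-zA-Z0-9])other(?![a-zA-Z0-9])', re.IGNORECASE)

def has_standalone_keyword(text):
    return KEYWORD.search(text) is not None

samples = ["AOther9", "noTHERX", "other", "\nother", " other ", "$other/"]
for sample in samples:
    print(repr(sample), has_standalone_keyword(sample))
```

The bare string "other" is accepted here because the lookarounds succeed at the start and end of the string, where there is no character to reject.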
Add a * quantifier to the negated character classes, and anchor the pattern with ^ (start of line) and $ (end of line):
String1 <- c("AOther9", "noTHERX", "other", "\nother", " other ", "$other/")
grep('^[^a-z0-9]*other[^a-z0-9]*$', String1, ignore.case = TRUE, value = TRUE)
# [1] "other" "\nother" " other " "$other/"
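One thing to keep in mind with the anchored pattern is that it tests the whole string, so any alphanumeric character anywhere else in the string makes it fail. A small Python sketch of that behaviour, assuming the standard re module. The name is_only_keyword and the sample "say other things" are assumptions added for illustration.

```python
import re

# The whole string must be: non-alphanumerics, "other", non-alphanumerics.
ANCHORED = re.compile(r'^[^a-z0-9]*other[^a-z0-9]*$', re.IGNORECASE)

def is_only_keyword(text):
    return ANCHORED.search(text) is not None

for sample in ["other", " other ", "$other/", "AOther9", "say other things"]:
    print(repr(sample), is_only_keyword(sample))
```

If the keyword may appear inside a longer sentence, the lookaround version is probably the better fit. If the string should contain nothing but the keyword and punctuation, the anchored version does the job.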

replace every other space with new line

I have strings like this:
a <- "this string has an even number of words"
b <- "this string doesn't have an even number of words"
I want to replace every other space with a new line. So the output would look like this...
myfunc(a)
# "this string\nhas an\neven number\nof words"
myfunc(b)
# "this string\ndoesn't have\nan even\nnumber of\nwords"
I've accomplished this by doing a strsplit, paste-ing a newline on even numbered words, then paste(a, collapse=" ") them back together into one string. Is there a regular expression to use with gsub that can accomplish this?
@Jota suggested a simple and concise way:
myfunc = function(x) gsub("( \\S+) ", "\\1\n", x) # Jota's
myfunc2 = function(x) gsub("([^ ]+ [^ ]+) ", "\\1\n", x) # my idea
lapply(list(a,b), myfunc)
[[1]]
[1] "this string\nhas an\neven number\nof words"
[[2]]
[1] "this string\ndoesn't have\nan even\nnumber of\nwords"
How it works. The idea of "([^ ]+ [^ ]+) " regex is (1) "find two sequences of words/nonspaces with a space between them and a space after them" and (2) "replace the trailing space with a newline".
@Jota's "( \\S+) " is trickier: it finds any word with a space before and after it, and then replaces the trailing space with a newline. This works because the first word caught by this is the second word of the string, and the next word caught is not the third (the space in front of the third word was already consumed when handling the second word) but the fourth, and so on.
Oh, and some basic regex stuff.
[^xyz] means any single char except the chars x, y, and z.
\\s is any whitespace character, while \\S is any non-whitespace character
x+ means x one or more times
(x) "captures" x, allowing for reference in the replacement, like \\1
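Both patterns can be tried outside R as well. A quick Python sketch with re.sub from the standard library, where the function names are invented for this example. In a Python replacement template, \1 is the captured group and \n becomes a newline.

```python
import re

def break_every_other_space(text):
    # A space, a word, then a space: keep " word", turn the last space into \n.
    return re.sub(r'( \S+) ', r'\1\n', text)

def break_after_word_pairs(text):
    # Two words separated by a space, then a space: keep the pair, add \n.
    return re.sub(r'([^ ]+ [^ ]+) ', r'\1\n', text)

a = "this string has an even number of words"
b = "this string doesn't have an even number of words"
print(repr(break_every_other_space(a)))
print(repr(break_after_word_pairs(b)))
```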

Regular expression in R to remove the part of a string after the last space

I would like to have a gsub expression in R to remove everything in a string that occurs after the last space. E.g. string="Da Silva UF" should return me "Da Silva". Any thoughts?
Using $ anchor:
> string = "Da Silva UF"
> gsub(" [^ ]*$", "", string)
[1] "Da Silva"
You can use the following.
string <- 'Da Silva UF'
gsub(' \\S*$', '', string)
[1] "Da Silva"
Explanation:
' ' a literal space
\S* non-whitespace (all but \n, \r, \t, \f, and " ") (0 or more times)
$ before an optional \n, and the end of the string
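The same idea in Python, as a sketch with the standard library only. The second function is a regex-free alternative using str.rsplit. Both function names are made up for this example.

```python
import re

def drop_after_last_space(text):
    # A space, then any non-whitespace up to the end of the string.
    return re.sub(r' \S*$', '', text)

def drop_after_last_space_no_regex(text):
    # Split once from the right and keep everything before the last space.
    return text.rsplit(' ', 1)[0]

print(drop_after_last_space("Da Silva UF"))
print(drop_after_last_space_no_regex("Da Silva UF"))
```

A string with no space at all comes back unchanged from both functions.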

Extract 2nd to last word in string

I know how to do it in Python, but can't get it to work in R
> string <- "this is a sentence"
> pattern <- "\b([\w]+)[\s]+([\w]+)[\W]*?$"
Error: '\w' is an unrecognized escape in character string starting "\b([\w"
> match <- regexec(pattern, string)
> words <- regmatches(string, match)
> words
[[1]]
character(0)
sub('.*?(\\w+)\\W+\\w+\\W*?$', '\\1', string)
#[1] "a"
which reads: be non-greedy and match anything until you reach the sequence word characters + non-word characters + word characters + optional non-word characters + end of string, then extract the first run of word characters in that sequence.
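Since the question starts from Python, here is a sketch of the same tail-anchored idea with Python's re module. It uses re.search with a capture group rather than a substitution. The function name second_to_last_word is an assumption, and returning None for strings with fewer than two words is a design choice made for this example.

```python
import re

# A word, a separator, the last word, optional trailing non-word
# characters, then the end of the string.
SECOND_TO_LAST = re.compile(r'(\w+)\W+\w+\W*$')

def second_to_last_word(text):
    match = SECOND_TO_LAST.search(text)
    return match.group(1) if match else None

print(second_to_last_word("this is a sentence"))
```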
Non-regex solution:
string <- "this is a sentence"
split <- strsplit(string, " ")[[1]]
split[length(split)-1]
Python non-regex version
t = "this is a sentence"
spl = t.split(" ")
if len(spl) >= 2:  # need at least two words
    s = spl[-2]    # second-to-last word