Capture word from regex match - regex

I am working on a regex in Perl that identifies what I want it to: a word-final g (but not following an 'n') or k (but not following an 'r') that precedes a word-initial g (but not g preceding l or r), a word-initial k, or a word-initial c (but not c preceding i, e, y, or h):
(((?<!n)g)|(?<!r)k)\s(g(?!l|r)|k|c(?!i|e|y|h));
However, I want it to capture the word that has the g or k at the end of it, so I tried something like this:
(^|\s.*(((?<!n)g)|(?<!r)k))\s(g(?!l|r)|k|c(?!i|e|y|h));
so that $1 captures everything from the beginning of the line or a whitespace character (signifying the beginning of a word) up to the next whitespace before the g, k, or c (the end of the word). Perhaps this is a parentheses problem, but I'm not sure how to keep the grouping I have while also specifying what I want $1 to capture.

What about /(\S*(((?<!n)g)|(?<!r)k))\s(g(?!l|r)|k|c(?!i|e|y|h))/?
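EDIT: Looking at it, it could use some cleanup. Note that the negated character classes in the shorter version below each consume a character, so unlike the lookarounds they need a character to be present at that position: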
/(\S*([^n]g|[^r]k))\s(g[^lr]|k|c[^ieyh])/
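As a rough illustration of the lookaround version above, here is a minimal Python sketch (Python's re module accepts the same lookbehind and lookahead syntax for this pattern). The helper name final_gk_words and the sample phrases are made up for the example, and the inner groups are made non-capturing so that group 1 is the whole word.

```python
import re

# Group 1: the whole word ending in g (not after n) or k (not after r).
# Group 2: the g, k, or c that starts the following word.
PATTERN = re.compile(r"(\S*(?:(?<!n)g|(?<!r)k))\s(g(?![lr])|k|c(?![ieyh]))")

def final_gk_words(text):
    """Return the words captured by group 1 (hypothetical helper)."""
    return [m.group(1) for m in PATTERN.finditer(text)]

print(final_gk_words("a big cat"))    # ['big']
print(final_gk_words("a long coat"))  # [] because the g follows an n
print(final_gk_words("dark cave"))    # [] because the k follows an r
```

Because \S* cannot cross whitespace, group 1 never reaches back into the previous word.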

Related

Match string with a multi-character limit

I need to divide a list in alphabetical order.
I am using:
regexp_match("via",'^[A-G]')
for one segment, and
regexp_match("via",'^[H-Z]')
for the other. However, I need to cut the list halfway through the "G" set of words; that is, "Galveston" should fall in the first segment and "Geneve" in the second.
How can I do this?
You can use the following two regexps:
^([A-F]|G[a-d])
^([H-Z]|G[e-z])
See regex demo #1 and regex demo #2.
Details
^([A-F]|G[a-d]) - a letter from A to F, or G followed by a letter from a to d
^([H-Z]|G[e-z]) - a letter from H to Z, or G followed by a letter from e to z.
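To see how the two patterns partition a list, here is a small Python sketch. The function name split_segments and the city names other than "Galveston" and "Geneve" are assumptions made for the illustration; the question itself uses regexp_match on a "via" column.

```python
import re

FIRST = re.compile(r"^([A-F]|G[a-d])")   # A-F, or G followed by a-d
SECOND = re.compile(r"^([H-Z]|G[e-z])")  # H-Z, or G followed by e-z

def split_segments(names):
    """Split names into the two alphabetical segments (hypothetical helper)."""
    first = [n for n in names if FIRST.match(n)]
    second = [n for n in names if SECOND.match(n)]
    return first, second

first, second = split_segments(["Austin", "Galveston", "Geneve", "Zurich"])
print(first)   # ['Austin', 'Galveston']
print(second)  # ['Geneve', 'Zurich']
```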

Remove last distinct K character(s) from end of a string

How do I remove the Gs at the end of each string and capture the string to the left of them?
...TGTGGG
...CTGAGGGGG
...ACAGGGGGGGG
...CAAACAGGGGGGGGGGGG
The result would look like this. If possible, I want to capture the remaining string in a regex.
...TGT
...CTGA
...ACA
...CAAACA
Thank you.
Removing trailing Gs is easy.
s/G*$//
If it's not necessarily a G, you can match it with a capture group.
s/(.)\1*$//
If you want to remove a character only if it is repeated at the end (so ATCG would be untouched but ATCGGG would change), you can do that with + instead of *:
s/(.)\1+$//
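The three substitutions above behave as follows when translated to Python's re.sub. This is only a sketch: the function names are invented for the example, and the sample strings drop the leading "..." from the question.

```python
import re

def strip_trailing_g(s):
    # Counterpart of s/G*$//: remove any run of G at the end.
    return re.sub(r"G*$", "", s)

def strip_trailing_run(s):
    # Counterpart of s/(.)\1*$//: remove the last character and its repeats.
    return re.sub(r"(.)\1*$", "", s)

def strip_repeated_run(s):
    # Counterpart of s/(.)\1+$//: remove the last character only if repeated.
    return re.sub(r"(.)\1+$", "", s)

print(strip_trailing_g("CTGAGGGGG"))   # CTGA
print(strip_trailing_run("ATCG"))      # ATC
print(strip_repeated_run("ATCG"))      # ATCG
print(strip_repeated_run("ATCGGG"))    # ATC
```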

Removing one letter except a combination

I am trying to remove all single characters except the combination 'r d'. To be clearer, here are some examples:
a ball -> ball
r something -> something
d someone -> someone
r d something -> r d something
r something d -> something
So far I have managed to remove the single letters other than r or d, but this is not what I want. I want to keep only the combination (example 4). I use this:
\b(?!r|d)\w{1}\b
Any idea how to do it?
Edit: The regex engine supports lookbehinds.
You may capture the r d combination and use a backreference in the replacement pattern to restore that combination, and remove all other matches:
\b(r d)\b|\b\w\b\s*
See the regex demo (replace with $1, which puts the r d back into the result).
Details:
\b(r d)\b - a "whole word" r d that is captured into Group 1
| - or
\b\w\b\s* - a single whole word consisting of 1 letter/digit/underscore (\b\w\b), followed by zero or more whitespace characters (\s*, just for removing the excess whitespace; it might not be necessary).
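The capture-and-restore idea can be tried out with a short Python sketch. The function name drop_single_letters is an assumption for the example; Python writes the $1 replacement as \1, and since Python 3.5 an unmatched group in the replacement expands to an empty string.

```python
import re

PATTERN = re.compile(r"\b(r d)\b|\b\w\b\s*")

def drop_single_letters(text):
    # When the second alternative matches, group 1 is unset and \1 is empty,
    # so the single letter is deleted; when "r d" matches, it is put back.
    return PATTERN.sub(r"\1", text)

print(drop_single_letters("a ball"))         # ball
print(drop_single_letters("r something"))    # something
print(drop_single_letters("r d something"))  # r d something
```

For "r something d" the space before the final d is left in place, so the result may need trimming.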

Google Sheets REGEXREPLACE to keep one of repeated strings if in the middle but remove them if at the beginning or the end

I have a string that may have repeated ", " (a comma and a space) in the middle, or at the beginning, or at the end.
For example, to clean ", , a, , c, d, "
I use REGEXREPLACE twice:
=REGEXREPLACE(REGEXREPLACE(", , a, , c, d, ","(, )+",", "),"^(, )|(, )$","")
Result: "a, c, d"
Is it possible to do it in just one REGEXREPLACE?
Use the regex
^[, ]+(?=[a-z])|[, ]+$|[, ]+(?=, )
http://regexr.com/3ct8r
or
^[, ]+(?=[a-zA-Z])|[, ]+$|[, ]+(?=, )
for lower- and upper-case support, and replace with an empty string.
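I have just read the RE2 syntax documentation (RE2 is the regex engine behind Google Sheets, and it does not support lookaheads such as (?=...)) at: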
https://re2.googlecode.com/hg/doc/syntax.html
Single characters:
[xyz] character class
Composites:
x|y x or y (prefer x)
Repetitions:
x+ one or more x, prefer more
Grouping:
(re) numbered capturing group
(?:re) non-capturing group
Empty strings:
^ at beginning of text or line (m=true)
$ at end of text (like \z not \Z) or line (m=true)
Then the regex
^[, ]+|[, ]+$|(?:, )+(, [a-zA-Z])
with capturing group 1 ($1) as the replacement should do the trick.
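The single-pass idea can be checked with a small Python sketch. Python's re module is not RE2, but this pattern uses only constructs listed in the excerpt above; the function name clean_commas is an assumption, and \1 plays the role of $1.

```python
import re

PATTERN = re.compile(r"^[, ]+|[, ]+$|(?:, )+(, [a-zA-Z])")

def clean_commas(text):
    # Leading and trailing runs are replaced by an empty group 1;
    # a repeated ", " in the middle collapses to the captured ", x".
    return PATTERN.sub(r"\1", text)

print(clean_commas(", , a, , c, d, "))  # a, c, d
```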
This works, pretending your text is in A1:
=join(" ",(REGEXEXTRACT(A1,"^.*"&rept("(\w,).*",counta(split(regexreplace(A1,"\w,","$"),"$"))-1)&"(\w),?$")))
It doesn't quite do it in one formula, as you were asking, I think, but it does handle the various cases.

How to find the base part of a variable name that ends in I, J, or IJ

I want to cut off the end of each string in a character vector; the strings vary in length and all end in I, J, or IJ. I haven't quite got it right yet:
Current attempt, using a simple case.
vars <- c("VARI", "VARJ", "VARIJ")
sapply(vars, function(v) {
  m <- regexec("^(.*)(?:I|J|IJ)$", v)
  regmatches(v, m)[[1]][2]
})
However, it doesn't work for the IJ case:
VARI VARJ VARIJ
"VAR" "VAR" "VARI"
Try putting the IJ first in the group:
^(.*?)(?:IJ|J|I)$
It'll match IJ before trying to match I or J alone.
Then make the .* lazy (by adding a ?) to prevent the . from eating too much.
EDIT: Actually, I messed up. Here's the deal:
In ^(.*)(?:J|I|IJ)$, .* will match as much as possible, meaning the whole string. In the case of VARIJ, it will backtrack to VARI and see that the (?:J|I|IJ)$ part matches.
With .* made lazy (by adding a ?), the dot will first match V in VARIJ; then, as there is no match for (?:J|I|IJ)$, it will continue by matching A. When it reaches R, it finds a match for (?:J|I|IJ)$ and stops eating more characters.
I initially messed up since this question was a bit like a previous one where something like (1|5|10|50|100|500) was used to match 500 but only 5 got matched. This is different here because of the end of line anchor $. I apologize for not having noticed the variation immediately.
In conclusion, you can still use (?:J|I|IJ)$ as long as .* is lazy.
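The difference between the greedy and lazy versions can be reproduced with a short Python sketch. Python's re module is a backtracking engine like the one described here, though this is an illustration rather than the R code from the question; the function names are made up.

```python
import re

def base_greedy(v):
    # Greedy .* grabs as much as it can, leaving only the final J to match.
    return re.match(r"^(.*)(?:I|J|IJ)$", v).group(1)

def base_lazy(v):
    # Lazy .*? stops as soon as the rest of the string is I, J, or IJ.
    return re.match(r"^(.*?)(?:I|J|IJ)$", v).group(1)

print(base_greedy("VARIJ"))  # VARI
print(base_lazy("VARIJ"))    # VAR
```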
What about good old simple gsub? It is vectorised, so you just need to do:
gsub( "I$|J$|IJ$" , "" , vars )
#[1] "VAR" "VAR" "VAR"
$ anchors each alternative at the end of the string, so the pattern matches I, J, or IJ there and replaces it with nothing.
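For comparison, the same anchored alternation works in a Python list comprehension. This is a sketch of the substitution approach, not a translation of R's vectorised gsub; the function name strip_suffix is an assumption.

```python
import re

def strip_suffix(names):
    # The search finds the leftmost position where an alternative reaches
    # the end of the string, so "VARIJ" loses "IJ" rather than just "J".
    return [re.sub(r"I$|J$|IJ$", "", v) for v in names]

print(strip_suffix(["VARI", "VARJ", "VARIJ"]))  # ['VAR', 'VAR', 'VAR']
```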