I am working on a regex in Perl that already identifies what I want it to: a word-final g (but not following an 'n') or k (but not following an 'r') that precedes a word-initial g (but not followed by l or r), a word-initial k, or a word-initial c (but not followed by i, e, y, or h):
(((?<!n)g)|(?<!r)k)\s(g(?!l|r)|k|c(?!i|e|y|h));
However, I want it to capture the word that has the g or k at the end of it, so I tried something like this:
(^|\s.*(((?<!n)g)|(?<!r)k))\s(g(?!l|r)|k|c(?!i|e|y|h));
The idea is that $1 captures from the beginning of the line or a whitespace character (marking the start of a word) up to the next whitespace before the g, k, or c (the end of the word). Perhaps this is a parentheses problem, but I'm not sure how to keep the grouping I have while also specifying where I want $1 to capture.
What about /(\S*(((?<!n)g)|(?<!r)k))\s(g(?!l|r)|k|c(?!i|e|y|h))/?
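EDIT: Looking at it, it could use some cleanup. Note that the character classes below consume a character where the lookarounds did not, so this shorter version is not strictly equivalent (for example, it fails when the second word is a lone g at the very end of the line):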
/(\S*([^n]g|[^r]k))\s(g[^lr]|k|c[^ieyh])/
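As a rough illustration, the lookaround version can be tried in Python, whose re module accepts the same lookbehind and lookahead syntax used here. The helper name first_word and the sample phrases are made up for this sketch, and the inner groups are written as non-capturing so that group 1 is the whole first word.

```python
import re

# Same logic as the lookaround pattern above; the inner groups are
# non-capturing so that group 1 holds the whole first word.
PATTERN = re.compile(r"(\S*(?:(?<!n)g|(?<!r)k))\s(g(?![lr])|k|c(?![ieyh]))")

def first_word(text):
    """Return the word ending in g/k that precedes a g/k/c word, or None."""
    m = PATTERN.search(text)
    return m.group(1) if m else None

print(first_word("a big cat"))   # big
print(first_word("back kite"))   # back
print(first_word("king cat"))    # None (the g follows an n)
print(first_word("dark cave"))   # None (the k follows an r)
print(first_word("bag glove"))   # None (the second word starts with gl)
```

Because \S* cannot cross whitespace, the capture never reaches back past the start of the word.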
I need to divide a list in alphabetical order.
I am using:
regexp_match("via",'^[A-G]')
for one segment, and
regexp_match("via",'^[H-Z]')
However, I need to cut the list halfway through the "G" set of words, that is, to make "Galveston" fall in the first segment and "Geneve" in the second.
How can I do this?
You can use the following two regexps:
^([A-F]|G[a-d])
^([H-Z]|G[e-z])
See regex demo #1 and regex demo #2.
Details
^([A-F]|G[a-d]) - a letter from A to F, or G followed by a letter from a to d
^([H-Z]|G[e-z]) - a letter from H to Z, or G followed by a letter from e to z.
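A small sketch of how the two patterns split a list, written in Python rather than with regexp_match. The names FIRST_HALF and SECOND_HALF and every list entry other than Galveston and Geneve are assumptions for illustration.

```python
import re

# The two patterns from the answer above.
FIRST_HALF = re.compile(r"^([A-F]|G[a-d])")
SECOND_HALF = re.compile(r"^([H-Z]|G[e-z])")

# Sample data; only Galveston and Geneve come from the question.
names = ["Amsterdam", "Galveston", "Geneve", "Houston"]

first = [n for n in names if FIRST_HALF.match(n)]
second = [n for n in names if SECOND_HALF.match(n)]

print(first)   # ['Amsterdam', 'Galveston']
print(second)  # ['Geneve', 'Houston']
```

A name where G is followed by something other than a lowercase letter (an apostrophe, say) would fall into neither segment with these patterns.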
How do I remove the Gs at the end of each string and capture the string to the left of them?
...TGTGGG
...CTGAGGGGG
...ACAGGGGGGGG
...CAAACAGGGGGGGGGGGG
The result would look like this. If possible, I want to capture this remaining string in a regex.
...TGT
...CTGA
...ACA
...CAAACA
Thank you.
Removing trailing Gs is easy.
s/G*$//
If it's not necessarily a G, you can match it with a capture group.
s/(.)\1*$//
If you want to only remove a character if it is repeated at the end (so ATCG would be untouched but ATCGGG would change), you can do that with +
s/(.)\1+$//
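The three substitutions can be tried outside Perl as well. Below is a minimal Python sketch using the same patterns; the function names are hypothetical, and capture_prefix is one possible way (an assumption, not part of the answer above) to capture the remaining string directly instead of deleting the tail.

```python
import re

def strip_trailing_g(seq):
    """Equivalent of s/G*$//: drop any run of G at the end."""
    return re.sub(r"G*$", "", seq)

def strip_trailing_run(seq):
    """Equivalent of s/(.)\\1*$//: drop the final character and its repeats."""
    return re.sub(r"(.)\1*$", "", seq)

def strip_repeated_run(seq):
    """Equivalent of s/(.)\\1+$//: drop the tail only if it repeats."""
    return re.sub(r"(.)\1+$", "", seq)

def capture_prefix(seq):
    """Capture what is left of the trailing Gs with a lazy group."""
    return re.match(r"^(.*?)G*$", seq).group(1)

print(strip_trailing_g("CTGAGGGGG"))   # CTGA
print(capture_prefix("TGTGGG"))        # TGT
print(strip_trailing_run("ATCG"))      # ATC
print(strip_repeated_run("ATCG"))      # ATCG
print(strip_repeated_run("ATCGGG"))    # ATC
```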
I am trying to remove all single-character words except when they form the combination 'r d'. To be clearer, some examples:
a ball -> ball
r something -> something
d someone -> someone
r d something -> r d something
r something d -> something
So far I have managed to remove the single letters other than r or d, but this is not what I want. I want to keep only the combination (example 4). I use this:
\b(?!r|d)\w{1}\b
Any idea how to do it?
Edit: The regex engine supports lookbehinds.
You may capture the r d combination and use a backreference in the replacement pattern to restore that combination, and remove all other matches:
\b(r d)\b|\b\w\b\s*
See the regex demo (replace with $1 that will put the r d back into the result).
Details:
\b(r d)\b - a "whole word" r d that is captured into Group 1
| - or
\b\w\b\s* - a single whole word consisting of 1 letter/digit/underscore (\b\w\b) and followed with 0+ whitespaces (\s*, just for removing the excessive whitespace, might not be necessary).
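To see the capture-and-restore idea at work, here is a short Python sketch run on the examples from the question. The function name clean is hypothetical; Python writes the backreference as \1 where the answer uses $1, and an unmatched group is replaced by an empty string (Python 3.5 and later).

```python
import re

# Group 1 keeps "r d"; any other single-character word is matched
# by the second alternative and dropped, since group 1 is then empty.
PATTERN = re.compile(r"\b(r d)\b|\b\w\b\s*")

def clean(text):
    return PATTERN.sub(r"\1", text)

for sample in ["a ball", "r something", "d someone",
               "r d something", "r something d"]:
    print(repr(clean(sample)))
```

The last example leaves a trailing space ("something "), because the whitespace in front of the removed d is not part of the match; a final strip would tidy that up if it matters.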
I have a string that may have repeated ", " (a comma and a space) in the middle, or at the beginning, or at the end.
for example, to clean ", , a, , c, d, "
I use REGEXREPLACE twice:
=REGEXREPLACE(REGEXREPLACE(", , a, , c, d, ","(, )+",", "),"^(, )|(, )$","")
Result: "a, c, d"
Is it possible to do it in just one REGEXREPLACE?
Use the regex
^[, ]+(?=[a-z])|[, ]+$|[, ]+(?=, )
http://regexr.com/3ct8r
or
^[, ]+(?=[a-zA-Z])|[, ]+$|[, ]+(?=, )
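for lowercase and uppercase support, and replace with an empty string. Note that both patterns rely on lookahead, which RE2 (the engine behind REGEXREPLACE in Google Sheets) does not support, so they only work in engines that allow it.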
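As a quick check of the lookahead pattern in an engine that supports it, here is a Python sketch. The function name tidy is an assumption, and the input is the string from the question.

```python
import re

# Three cases: leading separators before a letter, trailing separators,
# and an extra separator that is followed by another ", ".
LOOKAHEAD = re.compile(r"^[, ]+(?=[a-zA-Z])|[, ]+$|[, ]+(?=, )")

def tidy(text):
    return LOOKAHEAD.sub("", text)

print(tidy(", , a, , c, d, "))  # a, c, d
```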
I have just read the doc syntax of RE2 at:
https://re2.googlecode.com/hg/doc/syntax.html
Single characters:
[xyz] character class
Composites:
x|y x or y (prefer x)
Repetitions:
x+ one or more x, prefer more
Grouping:
(re) numbered capturing group
(?:re) non-capturing group
Empty strings:
^ at beginning of text or line (m=true)
$ at end of text (like \z not \Z) or line (m=true)
then, the regex
^[, ]+|[, ]+$|(?:, )+(, [a-zA-Z])
and replacement with "capturing group" 1, should do the trick.
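The same single-pass idea can be sketched in Python. This is only an approximation of the Sheets behaviour, since Python's re module is not RE2, but the pattern uses only constructs that appear in the RE2 syntax list quoted above. The function name tidy_re2_style is hypothetical, and \1 plays the role of capturing group 1 in the replacement.

```python
import re

# Leading and trailing separators match with group 1 empty and vanish;
# repeated separators in the middle are replaced by the captured
# ", <letter>" so one separator and the letter survive.
PATTERN = re.compile(r"^[, ]+|[, ]+$|(?:, )+(, [a-zA-Z])")

def tidy_re2_style(text):
    return PATTERN.sub(r"\1", text)

print(tidy_re2_style(", , a, , c, d, "))  # a, c, d
```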
This works, pretending your text is in A1:
=join(" ",(REGEXEXTRACT(A1,"^.*"&rept("(\w,).*",counta(split(regexreplace(A1,"\w,","$"),"$"))-1)&"(\w),?$")))
It doesn't quite do it in one formula, as you were asking, I think, but it does handle the various cases.
I want to cut off the end part of a vector of characters of variable length that all end in either I, J, or IJ, but haven't quite got it right yet:
Current attempt, using a simple case.
vars <- c("VARI", "VARJ", "VARIJ")
sapply(vars, function(v) {
m <- regexec("^(.*)(?:I|J|IJ)$", v)
regmatches(v, m)[[1]][2]
})
However, it doesn't work for the IJ case:
VARI VARJ VARIJ
"VAR" "VAR" "VARI"
Try putting the IJ first in the group:
^(.*?)(?:IJ|J|I)$
It'll match IJ before trying to match I or J alone.
Then make the .* lazy (by adding a ?) to prevent the . from eating too much.
EDIT: Actually, I messed up. Here's the deal:
In ^(.*)(?:J|I|IJ)$, .* will match as much as possible, meaning the whole string. In the case of VARIJ, it will backtrack to VARI and see that the (?:J|I|IJ)$ part matches.
Making the .* lazy (by adding a ?), the dot will first match V in VARIJ, then, as there are no matches for (?:J|I|IJ)$, will continue with matching A. When it reaches R, it finds a match in (?:J|I|IJ)$ and stops eating more characters.
I initially messed up since this question was a bit like a previous one where something like (1|5|10|50|100|500) was used to match 500 but only 5 got matched. This is different here because of the end of line anchor $. I apologize for not having noticed the variation immediately.
Conclusion, you can still use (?:J|I|IJ)$ as long as .* is lazy.
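The greedy and lazy behaviour is easy to compare side by side. The sketch below uses Python's re module, which backtracks in the way described here; the helper name prefix is made up for illustration.

```python
import re

def prefix(pattern, s):
    """Return group 1 of the match, or None when nothing matches."""
    m = re.match(pattern, s)
    return m.group(1) if m else None

GREEDY = r"^(.*)(?:I|J|IJ)$"   # .* grabs everything, then backtracks
LAZY = r"^(.*?)(?:I|J|IJ)$"    # .*? grows one character at a time

for v in ["VARI", "VARJ", "VARIJ"]:
    print(v, prefix(GREEDY, v), prefix(LAZY, v))
# VARI VAR VAR
# VARJ VAR VAR
# VARIJ VARI VAR
```

With the lazy quantifier the order inside the alternation does not matter, because the $ anchor forces the engine to try IJ once I alone fails to reach the end of the string.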
What about good old simple gsub which is vectorised so you just need to do...
gsub( "I$|J$|IJ$" , "" , vars )
#[1] "VAR" "VAR" "VAR"
$ anchors the regex at the end of the string and then matches either I or J or IJ and replaces them with nothing.
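For comparison, the same substitution can be written in Python with a list comprehension standing in for gsub's vectorisation. The variable names are assumptions; the pattern is the one from the gsub call.

```python
import re

vars_ = ["VARI", "VARJ", "VARIJ"]

# The search scans left to right, so for "VARIJ" it reaches the I first;
# I$ and J$ fail there and IJ$ succeeds, removing both letters.
trimmed = [re.sub(r"I$|J$|IJ$", "", v) for v in vars_]

print(trimmed)  # ['VAR', 'VAR', 'VAR']
```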