Regex - more compact/elegant form - regex

I would like to write this regex (see the simplified form below) in a more compact, elegant, and systematic way. PCRE or Python (the newer regex engine) is preferred. In short, I would like to capture each artery name (iliac, femoral, popliteal, and so on), regardless of the string between them. Ideally, the resulting regex won't depend on any particular regex flavor.
LE2: An even more simplified regex, which does not work correctly: https://www.regex101.com/r/cK5wB6/7. I've eliminated the DEFINE section; it was added only for modularity, and DEFINE is not compatible with Python anyway (the newer v1 engine added this feature). I want to capture all the artery names, the equivalent of getting a vector of all artery names, regardless of the number of names or the strings between them.
(arteries:.{0,25}?)
((?<art>iliac|femoral|popliteal|peroneal|tibial).*?)*
(?<artfinal>(?&art))
The problem is that some arteries are still not recognized correctly (at least visually). I'm trying to capture those names without explicitly writing capturing groups as in this.
LE4: The last variant actually ignores all names aside from the first and the last two.

First, a flavour-independent regex pattern is a myth. Regex engines differ and have different features, and even the same pattern that uses only tokens common to two or more engines can return different results.
An example with the Python regex module that has interesting features like the ability to use a set (\L<arteries> in the pattern) and the ability to store repeated capture groups:
import regex
s = '''arteries: jhjh iliac jdfd femoral
arteries: sdsdsd iliac jdfd femoral fd d popliteal
arteries: hgv popliteal,sddsdsds iliac tibial nkjkknperoneal nkjkkn
arteries: iliac, peroneal jm tibia nktibial nkjkkn
arteries: m bkjkjnperoneal vc peroneal fdfd femoral n tibial jnmmmmm tibial jnnjnjmbn n iliacbjk
arteries:m bkjnkjnperoneal mm femoral jnnbn n right femoralbjkkbb jk'''
arteries_set = ['femoral', 'iliac', 'peroneal', 'tibial']
p = regex.compile(r'^arteries: (?: [^\w\n]* (?>\w+[^\w\n]+)*? (\L<arteries>) \M)+', regex.M | regex.I | regex.X, arteries=arteries_set)
for m in p.finditer(s):
    print(m.captures(1))
I deliberately removed the "less than 25 characters" condition to build a more efficient pattern, but feel free to replace [^\w\n]* (?>\w+[^\w\n]+)*? with .{0,25}? \m
(\m and \M are word boundaries, respectively for the start and the end of a word)

I want to capture all the artery names,
The problem is that some arteries are still not recognized correctly (at least visually)
The problem with this regex:
((?<art>iliac|femoral|popliteal|peroneal|tibial).*?)*
is that the group art continuously overwrites its capture with the last match. This is an expected behaviour by design.
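To make the overwriting concrete, here is a minimal sketch with Python's standard re module (the sample string and the variable names are made up for illustration). A quantified group reports only its last iteration, while one match per artery collects them all:

```python
import re

# Hypothetical sample line, invented for this illustration.
text = "arteries: a iliac b femoral c tibial"
names = r"iliac|femoral|popliteal|peroneal|tibial"

# The capture group sits inside a repeated group, so each iteration
# overwrites the previous capture; only the last one survives.
m = re.search(r"arteries:(?:.*?\b(" + names + r")\b)+", text)
last_only = m.group(1)

# Matching one artery at a time keeps every occurrence.
all_names = re.findall(r"\b(?:" + names + r")\b", text)

print(last_only)
print(all_names)
```

The second form drops the "arteries:" prefix and the 25-character limit, so it only shows the capture behaviour and is not a full solution.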
I want to recognize all the arteries given in the list (see definition
of ) The same artery could appear 1...n to and in any position.
The strings between artery names could be anything of max. 25 chars.
As of flavors, let's stick with PCRE
Provided you're working with PCRE, instead of matching all occurrences of arteries at once, I would suggest matching 1 artery at a time. And to achieve that, we can use \G to match at the end of the last match.
Regex:
/\G # Match anchor (BoS or EoLastMatch)
(?:
(?!^) # With previous match
|
.*? # Or first occurrence
arteries: # of arteries:
)
.{1,25}? # Separated by max 25 chars
(?P<art> # Group 1 (capture 1 artery)
\b # List of arteries
(?:iliac|femoral|popliteal|peroneal|tibial)
\b # in between word boundaries
# Modif: global, caseless, singleline, extra
)/gixs
This will capture each artery in group art (group 1).
Notes about other flavours:
As for compatibility with other regex flavours, you could loop over each match in your code to simulate \G (which some flavours, such as Python's standard re module, do not implement). Another option is to split the text with the expression:
(arteries:|\b(?:iliac|femoral|popliteal|peroneal|tibial)\b)
and then check the length of each token to guarantee there isn't more than 25 chars in between.
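A rough sketch of that split-and-check idea with Python's standard re module follows. The function name arteries_by_split and the max_gap parameter are my own, and treating an over-long gap as the end of a run is an assumption about the intended behaviour:

```python
import re

SPLIT_RE = re.compile(
    r"(arteries:|\b(?:iliac|femoral|popliteal|peroneal|tibial)\b)", re.I
)

def arteries_by_split(text, max_gap=25):
    """Return one list of artery names per 'arteries:' marker (hypothetical helper)."""
    # With a single capture group, re.split alternates: gap, token, gap, token, ...
    parts = SPLIT_RE.split(text)
    results = []
    current = None  # list being filled, or None when no run is active
    for i in range(1, len(parts), 2):
        gap, token = parts[i - 1], parts[i]
        if token.lower() == "arteries:":
            current = []
            results.append(current)
        elif current is not None:
            if 1 <= len(gap) <= max_gap:
                current.append(token)
            else:
                current = None  # gap too long (or empty): stop this run
    return results

sample = "arteries: jhjh iliac jdfd femoral\narteries: x popliteal " + "z" * 30 + " tibial"
print(arteries_by_split(sample))
```

In the sample, tibial is dropped because more than 25 characters separate it from popliteal.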
the code will have to be migrated to Python one (not so distant) day
Update: Migrating to Python:
You can use \G in Python since the regex module implements it, but if you do use that module, take advantage of its ability to retrieve repeated captures from a group with the .captures method. Check @CasimiretHippolyte's answer, a perfect example of using captures in this case.
On the other hand, if you stick to the standard re module, I'd recommend looping each match to simulate the same behaviour.
Code:
import re
text = '''arteries: jhjh iliac jdfd femoral
arteries: sdsdsd iliac jdfd femoral fd d popliteal
some arteries: hgv popliteal,sddsdsds iliac tibial nkjkknperoneal nkjkkn
arteries: iliac, peroneal jm tibia nktibial nkjkkn
arteries: m bkjkjnperoneal vc peroneal fdfd femoral n tibial jnmmmmm tibial jnnjnjmbn n iliacbjk
arteries:m bkjnkjnperoneal mm femoral jnnbn n right femoralbjkkbb jk'''
n = 0
pattern_from = re.compile( r'arteries:', re.I)
pattern_token = re.compile( r'.{1,25}?\b(iliac|femoral|popliteal|peroneal|tibial)\b', re.I)
for match_from in pattern_from.finditer(text):
    n = n + 1
    print( '\nMatch #%s:' % n, end="")
    match_token = pattern_token.match( text, match_from.end())
    while match_token:
        print( '[%s:%s]="%s" ' % (match_token.start(1), match_token.end(1), match_token.group(1)), end="")
        match_token = pattern_token.match( text, match_token.end())
Output:
Match #1:[15:20]="iliac" [26:33]="femoral"
Match #2:[52:57]="iliac" [63:70]="femoral" [76:85]="popliteal"
Match #3:[106:115]="popliteal" [125:130]="iliac" [132:138]="tibial"
Match #4:[171:176]="iliac" [179:187]="peroneal"
Match #5:[243:251]="peroneal" [257:264]="femoral" [267:273]="tibial" [282:288]="tibial"
Match #6:[344:351]="femoral"

Related

Regex pattern for mm/dd/yyyy and mmddyyyy in Scala

I have date in my .txt file which comes like either of the below:
mmddyyyy
OR
mm/dd/yyyy
Below is the regex which works fine for mm/dd/yyyy.
^02\/(?:[01]\d|2\d)\/(?:19|20)(?:0[048]|[13579][26]|[2468][048])|(?:0[13578]|10|12)\/(?:[0-2]\d|3[01])\/(?:19|20)\d{2}|(?:0[469]|11)\/(?:[0-2]\d|30)\/(?:19|20)\d{2}|02\/(?:[0-1]\d|2[0-8])\/(?:19|20)\d{2}$
However, I am unable to build the regex for mmddyyyy. Is there a generic regex that would work for both cases?
Why use regex for this? Seems like a case of "Now you have two problems"
It would be more effective (and easier to understand) to use a DateTimeFormatter (assuming you are on the JVM and not using scala-js)
The format patterns support using [] to surround optional sections, such as the /, and the formatters inherently perform input validation so if you plug in a month or day that can't exist, it'll throw an exception.
import java.time.format.DateTimeFormatter
import java.time.LocalDate
val mdy = DateTimeFormatter.ofPattern("MM[/]dd[/]yyyy")
def parse(rawDate: String) = LocalDate.parse(rawDate, mdy)
scala> parse("12252022")
res7: java.time.LocalDate = 2022-12-25
scala> parse("12/25/2022")
res8: java.time.LocalDate = 2022-12-25
scala> parse("25/12/2022")
java.time.format.DateTimeParseException: Text '25/12/2022' could not be parsed: Invalid value for MonthOfYear (valid values 1 - 12): 25
scala> parse("abc123")
java.time.format.DateTimeParseException: Text 'abc123' could not be parsed at index 0
If you want to match all those variations with either 2 forward slashes or only digits, you can use a positive lookahead to assert either only digits or 2 forward slashes surrounded by digits.
Then in the pattern itself you can make matching the / optional.
Note that you don't have to escape the \/
^(?=\d+(?:/\d+/\d+)?$)(?:02/?(?:[01]\d|2\d)/?(?:19|20)(?:0[048]|[13579][26]|[2468][048])|(?:0[13578]|10|12)/?(?:[0-2]\d|3[01])/?(?:19|20)\d{2}|(?:0[469]|11)/?(?:[0-2]\d|30)/?(?:19|20)\d{2}|02/?(?:[0-1]\d|2[0-8])/?(?:19|20)\d{2})$
Another option is to write an alternation | matching the same pattern without the / in it.
First of all, there is a tiny shortcoming in your regex: the ^ anchor only applies to the first part of your regex, not to the other alternatives that are separated by |. Similarly the final $ applies only to the final alternative. You should put all alternatives in a non-capturing group, like ^(?: | | | )$
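The anchor scope issue is easy to see with a toy pattern. This is a small sketch in Python's re module; the two-alternative patterns are invented purely for illustration and are not the date regex itself:

```python
import re

# Hypothetical patterns, chosen only to show how far the anchors reach.
loose = re.compile(r"^ab|cd$")       # parsed as (^ab) | (cd$)
strict = re.compile(r"^(?:ab|cd)$")  # both anchors apply to every alternative

samples = ("ab", "abXXX", "XXXcd")
loose_hits = [s for s in samples if loose.search(s)]
strict_hits = [s for s in samples if strict.search(s)]

print(loose_hits)
print(strict_hits)
```

Without the group, "abXXX" and "XXXcd" both slip through because each alternative carries only one of the two anchors.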
Then for the question itself, you could make the forward slash that follows the month optional and put it in a capture group. Then what comes between the day and the year could be a backreference to that capture group. So (\/?) and \1.
^(?:02(\/?)(?:[01]\d|2\d)\1(?:19|20)(?:0[048]|[13579][26]|[2468][048])|(?:0[13578]|10|12)(\/?)(?:[0-2]\d|3[01])\2(?:19|20)\d{2}|(?:0[469]|11)(\/?)(?:[0-2]\d|30)\3(?:19|20)\d{2}|02(\/?)(?:[0-1]\d|2[0-8])\4(?:19|20)\d{2})$
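As a sketch of the same backreference trick in a shorter form, the Python snippet below lets the regex check only the shape (with a consistent separator) and leaves calendar validation to datetime.date. The name parse_mdy is made up, and handing validation to the date type is my simplification, not part of the pattern above:

```python
import re
from datetime import date

# (/?) captures the optional separator; \2 forces the second one to match it.
DATE_RE = re.compile(r"^(\d{2})(/?)(\d{2})\2(\d{4})$")

def parse_mdy(raw):
    """Return a date for mmddyyyy or mm/dd/yyyy input, else None (hypothetical helper)."""
    m = DATE_RE.match(raw)
    if m is None:
        return None
    month, _sep, day, year = m.groups()
    try:
        return date(int(year), int(month), int(day))
    except ValueError:  # e.g. month 25 or February 30
        return None

print(parse_mdy("12252022"), parse_mdy("12/25/2022"), parse_mdy("12/252022"))
```

Mixed input such as 12/252022 is rejected because the backreference requires the same separator (or none) in both positions.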

Regular Expression to match many coordinate formats

I am working on a regex that will match many different types of location coordinates. So far it matches about 90% of the formats:
([SNsn][\\s]*)?((?:[\\+-]?[0-9]*[\\.,][0-9]+)|(?:[\\+-]?[0-9]+))(?:(?:[^ms'′""″,\\.\\dNEWnew]?)|(?:[^ms'′""″,\\.\\dNEWnew]+((?:[\\+-]?[0-9]*[\\.,][0-9]+)|(?:[\\+-]?[0-9]+))(?:(?:[^ds°""″,\\.\\dNEWnew]?)|(?:[^ds°""″,\\.\\dNEWnew]+((?:[\\+-]?[0-9]*[\\.,][0-9]+)|(?:[\\+-]?[0-9]+))[^dm°'′,\\.\\dNEWnew]*))))([SNsn]?)[^\\dSNsnEWew]+([EWew][\\s]*)?((?:[\\+-]?[0-9]*[\\.,][0-9]+)|(?:[\\+-]?[0-9]+))(?:(?:[^ms'′""″,\\.\\dNEWnew]?)|(?:[^ms'′""″,\\.\\dNEWnew]+((?:[\\+-]?[0-9]*[\\.,][0-9]+)|(?:[\\+-]?[0-9]+))(?:(?:[^ds°""″,\\.\\dNEWnew]?)|(?:[^ds°""″,\\.\\dNEWnew]+((?:[\\+-]?[0-9]*[\\.,][0-9]+)|(?:[\\+-]?[0-9]+))[^dm°'′,\\.\\dNEWnew]*))))([EWew]?)
Testing the formats:
N 45° 55.732 W 122° 29.882
N 047° 38.938', W 122° 20.887'
40.123, -74.123
40.123° N 74.123° W
40° 7´ 22.8" N 74° 7´ 22.8" W
40° 7.38’ , -74° 7.38’
N40°7’22.8, W74°7’22.8"
40°7’22.8"N, 74°7’22.8"W
40 7 22.8, -74 7 22.8
40.123 -74.123
40.123°,-74.123°
144442800, -266842800
40.123N74.123W
4007.38N7407.38W
40°7’22.8"N, 74°7’22.8"W
400722.8N740722.8W
N 40 7.38 W 74 7.38
40:7:23N,74:7:23W
40:7:22.8N 74:7:22.8W
40°7’23"N 74°7’23"W
40°7’23" -74°7’23"
40d 7’ 23" N 74d 7’ 23" W
40.123N 74.123W
40° 7.38, -74° 7.38
Testing if it works: https://regexr.com/3ivu2
As you can see there are issues with the spaces and commas that are causing the regex to not match some of these formats.
I am trying to match the coordinate strings so that they can be highlighted in my iOS app and allow the user to tap them.
What can I do to update the regex and fix the matching issues?
Overview
I'm sure there are many ways to go about this. Since you haven't specified a regex engine or programming language, I'll post one that works in PCRE and one that should work in most engines. The PCRE regex is much easier to understand than the non-PCRE regex, but both use the exact same logic.
The patterns defined below match each string you've presented in your question and properly separate each part of the coordinate (x, y).
Code
PCRE
This method uses the DEFINE construct to pre-define patterns. The beauty of this construct is that you can define reusable parts of your regex in one location, thus, you can edit most of the regex just by editing these subpatterns.
See regex in use here
(?(DEFINE)
(?<ns>[ns])
(?<ew>[ew])
(?<d>[°´’'"d:])
(?<n>[+-]?\d+(?:\.\d+)?)
)
(
(?&ns)?
(?:\ ?(?&n)(?&d)?){1,3}
\ ?(?&ns)?
)
\ ?,?\ ?
(
(?&ew)?
(?:\ ?(?&n)(?&d)?){1,3}
\ ?(?&ew)?
)
Flags: gix
Non-PCRE
See regex in use here
(
[ns]?
(?:\ ?[+-]?\d+(?:\.\d+)?[°´’'"d:]?){1,3}
\ ?[ns]?
)
\ ?,?\ ?
(
[ew]?
(?:\ ?[+-]?\d+(?:\.\d+)?[°´’'"d:]?){1,3}
\ ?[ew]?
)
Flags: gix.
Some engines don't have the x flag. For those engines you can use the following one-liner (as seen here):
([ns]?(?: ?[+-]?\d+(?:\.\d+)?[°´’'"d:]?){1,3} ?[ns]?) ?,? ?([ew]?(?: ?[+-]?\d+(?:\.\d+)?[°´’'"d:]?){1,3} ?[ew]?)
Explanation
Since both patterns are essentially the same (non-PCRE is just an expanded version of the PCRE), I'll define the PCRE regex pattern since it's easier to grasp.
Note that the patterns that use x have escaped spaces since they would otherwise be ignored (x ignores whitespace within the pattern). The i flag allows us to match text regardless of case (i makes our pattern case-insensitive).
DEFINE
(?(DEFINE)...) The DEFINE group is never matched directly by the regex engine. It works like a variable declaration (name=value): each named subpattern defined inside it can be recalled by name elsewhere in the pattern.
(?<ns>[ns]) The group ns matches any character in the set nsNS
(?<ew>[ew]) The group ew matches any character in the set ewEW
(?<d>[°´’'"d:]) The group d matches any character in the set °´’'"d:
(?<n>[+-]?\d+(?:\.\d+)?) The group n matches any number that matches the following structure
[+-]? Optionally match any character in the set +-
\d+ Match one or more digits
(?:\.\d+)? Optionally match a decimal point followed by one or more digits
Pattern
The pattern is composed of 3 larger parts. The first and last are capture groups (the coordinates themselves) and the second is what separates the two.
Capture 1:
(?&ns)? Optionally match the group ns
(?:\ ?(?&n)(?&d)?){1,3} Matches [an optional space, followed by the group n then optionally group d] between one and three times
\ ?(?&ns)? Optionally match a space, optionally match the group ns
\ ?,?\ ? Match an optional space, comma and space (this separates each coordinate part)
Capture 2: This is the same as Capture 1 but replaces the group ns with the group ew
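For engines without DEFINE (Python's standard re module, for instance), the same modularity can be approximated by building the pattern from string pieces. The sketch below assembles the one-liner shown earlier; the names NUM, DELIM, half and COORD_RE are mine:

```python
import re

NUM = r"[+-]?\d+(?:\.\d+)?"  # same as group n
DELIM = r"[°´’'\"d:]"        # same as group d

def half(hemi):
    """Build one coordinate half; hemi is the hemisphere class, e.g. '[ns]'."""
    # {{1,3}} is an escaped literal {1,3} inside the f-string.
    return rf"({hemi}?(?: ?{NUM}{DELIM}?){{1,3}} ?{hemi}?)"

COORD_RE = re.compile(half("[ns]") + r" ?,? ?" + half("[ew]"), re.I)

m = COORD_RE.fullmatch("40.123, -74.123")
print(m.groups())
```

Because the optional trailing space sits inside each capture group, a captured half can end with a space, so stripping the groups before use is advisable.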
This simplified regex literally matches all the patterns you've given:
^((?:[NW]? ?(?:[-\d.d]+[NW:°´’'",]?[ NW]?)+[, ]*)+[NW]?)$
I'm not an expert for coordinates, but you can modify it easily if I didn't take into account some specifics.
A full test is here.

A strict regular expression for matching chemical formulae

In the course of processing a large textual chemical database with Perl, I had been faced with the problem of using a regex to match chemical formulae. I have seen these two previous topics, but the suggested answers there are too loose for my requirements.
Specifically, my (admittedly limited) research has led me to this posting that gives a regex for the currently accepted chemical symbols, which I'll copy here for reference
[BCFHIKNOPSUVWY]|[ISZ][nr]|[ACELP][ru]|A[cglmst]|B[aehikr]|C[adeflos]|D[bsy]|Es|F[elmr]|G[ade]|H[efgos]|Kr|L[aiv]|M[cdgnot]|N[abdehiop]|O[gs]|P[abdmot]|R[abe-hnu]|S[bcegim]|T[abcehilms]|Xe|Yb
(Thus e.g. C, Cm, and Cn will pass, but not Cg or Cx.)
As with the previous questions, I also need to match numbers, complete sets of parentheses and complete sets of square brackets, so that both e.g. C2H6O and (CH3)2CFCOO(CH2)2Si(CH3)2Cl are matched.
So how do I combine the previous solutions with the grand regex for matching valid chemical elements to strictly match a chemical formula?
(If it's not too much trouble to add, a blow-by-blow account of how to humanly parse the regex would be appreciated greatly, though not strictly necessary.)
Brief
I decided why not create a massive regex to do what you want (but still maintain a clean regex). This regex would be used in conjunction with a loop to go over matches for bracket or parentheses groups.
Assumptions
I am assuming the following since the OP has not given a full list of positive and negative matches:
Nested parentheses aren't possible
Nested square brackets aren't possible
Square bracket groups that surround a single parentheses group are redundant and therefore incorrect
Square bracket groups must contain at least 2 groups, of which 1 such group must be a parentheses group
If any of these assumptions are incorrect, please let me know so that I may fix the regex accordingly
Answer
View this regex in use here
Code
(?(DEFINE)
(?# Periodic elements )
(?<Hydrogen>H)
(?<Helium>He)
(?<Lithium>Li)
(?<Beryllium>Be)
(?<Boron>B)
(?<Carbon>C)
(?<Nitrogen>N)
(?<Oxygen>O)
(?<Fluorine>F)
(?<Neon>Ne)
(?<Sodium>Na)
(?<Magnesium>Mg)
(?<Aluminum>Al)
(?<Silicon>Si)
(?<Phosphorus>P)
(?<Sulfur>S)
(?<Chlorine>Cl)
(?<Argon>Ar)
(?<Potassium>K)
(?<Calcium>Ca)
(?<Scandium>Sc)
(?<Titanium>Ti)
(?<Vanadium>V)
(?<Chromium>Cr)
(?<Manganese>Mn)
(?<Iron>Fe)
(?<Cobalt>Co)
(?<Nickel>Ni)
(?<Copper>Cu)
(?<Zinc>Zn)
(?<Gallium>Ga)
(?<Germanium>Ge)
(?<Arsenic>As)
(?<Selenium>Se)
(?<Bromine>Br)
(?<Krypton>Kr)
(?<Rubidium>Rb)
(?<Strontium>Sr)
(?<Yttrium>Y)
(?<Zirconium>Zr)
(?<Niobium>Nb)
(?<Molybdenum>Mo)
(?<Technetium>Tc)
(?<Ruthenium>Ru)
(?<Rhodium>Rh)
(?<Palladium>Pd)
(?<Silver>Ag)
(?<Cadmium>Cd)
(?<Indium>In)
(?<Tin>Sn)
(?<Antimony>Sb)
(?<Tellurium>Te)
(?<Iodine>I)
(?<Xenon>Xe)
(?<Cesium>Cs)
(?<Barium>Ba)
(?<Lanthanum>La)
(?<Cerium>Ce)
(?<Praseodymium>Pr)
(?<Neodymium>Nd)
(?<Promethium>Pm)
(?<Samarium>Sm)
(?<Europium>Eu)
(?<Gadolinium>Gd)
(?<Terbium>Tb)
(?<Dysprosium>Dy)
(?<Holmium>Ho)
(?<Erbium>Er)
(?<Thulium>Tm)
(?<Ytterbium>Yb)
(?<Lutetium>Lu)
(?<Hafnium>Hf)
(?<Tantalum>Ta)
(?<Tungsten>W)
(?<Rhenium>Re)
(?<Osmium>Os)
(?<Iridium>Ir)
(?<Platinum>Pt)
(?<Gold>Au)
(?<Mercury>Hg)
(?<Thallium>Tl)
(?<Lead>Pb)
(?<Bismuth>Bi)
(?<Polonium>Po)
(?<Astatine>At)
(?<Radon>Rn)
(?<Francium>Fr)
(?<Radium>Ra)
(?<Actinium>Ac)
(?<Thorium>Th)
(?<Protactinium>Pa)
(?<Uranium>U)
(?<Neptunium>Np)
(?<Plutonium>Pu)
(?<Americium>Am)
(?<Curium>Cm)
(?<Berkelium>Bk)
(?<Californium>Cf)
(?<Einsteinium>Es)
(?<Fermium>Fm)
(?<Mendelevium>Md)
(?<Nobelium>No)
(?<Lawrencium>Lr)
(?<Rutherfordium>Rf)
(?<Dubnium>Db)
(?<Seaborgium>Sg)
(?<Bohrium>Bh)
(?<Hassium>Hs)
(?<Meitnerium>Mt)
(?<Darmstadtium>Ds)
(?<Roentgenium>Rg)
(?<Copernicium>Cn)
(?<Nihonium>Nh)
(?<Flerovium>Fl)
(?<Moscovium>Mc)
(?<Livermorium>Lv)
(?<Tennessine>Ts)
(?<Oganesson>Og)
(?# Regex )
(?<Element>(?&Actinium)|(?&Silver)|(?&Aluminum)|(?&Americium)|(?&Argon)|(?&Arsenic)|(?&Astatine)|(?&Gold)|(?&Barium)|(?&Beryllium)|(?&Bohrium)|(?&Bismuth)|(?&Berkelium)|(?&Bromine)|(?&Boron)|(?&Calcium)|(?&Cadmium)|(?&Cerium)|(?&Californium)|(?&Chlorine)|(?&Curium)|(?&Copernicium)|(?&Cobalt)|(?&Chromium)|(?&Cesium)|(?&Copper)|(?&Carbon)|(?&Dubnium)|(?&Darmstadtium)|(?&Dysprosium)|(?&Erbium)|(?&Einsteinium)|(?&Europium)|(?&Iron)|(?&Flerovium)|(?&Fermium)|(?&Francium)|(?&Fluorine)|(?&Gallium)|(?&Gadolinium)|(?&Germanium)|(?&Helium)|(?&Hafnium)|(?&Mercury)|(?&Holmium)|(?&Hassium)|(?&Hydrogen)|(?&Indium)|(?&Iridium)|(?&Iodine)|(?&Krypton)|(?&Potassium)|(?&Lanthanum)|(?&Lithium)|(?&Lawrencium)|(?&Lutetium)|(?&Livermorium)|(?&Moscovium)|(?&Mendelevium)|(?&Magnesium)|(?&Manganese)|(?&Molybdenum)|(?&Meitnerium)|(?&Sodium)|(?&Niobium)|(?&Neodymium)|(?&Neon)|(?&Nihonium)|(?&Nickel)|(?&Nobelium)|(?&Neptunium)|(?&Nitrogen)|(?&Oganesson)|(?&Osmium)|(?&Oxygen)|(?&Protactinium)|(?&Lead)|(?&Palladium)|(?&Promethium)|(?&Polonium)|(?&Praseodymium)|(?&Platinum)|(?&Plutonium)|(?&Phosphorus)|(?&Radium)|(?&Rubidium)|(?&Rhenium)|(?&Rutherfordium)|(?&Roentgenium)|(?&Rhodium)|(?&Radon)|(?&Ruthenium)|(?&Antimony)|(?&Scandium)|(?&Selenium)|(?&Seaborgium)|(?&Silicon)|(?&Samarium)|(?&Tin)|(?&Strontium)|(?&Sulfur)|(?&Tantalum)|(?&Terbium)|(?&Technetium)|(?&Tellurium)|(?&Thorium)|(?&Titanium)|(?&Thallium)|(?&Thulium)|(?&Tennessine)|(?&Uranium)|(?&Vanadium)|(?&Tungsten)|(?&Xenon)|(?&Ytterbium)|(?&Yttrium)|(?&Zirconium)|(?&Zinc))
(?<Num>(?:[1-9]\d*)?)
(?<ElementGroup>(?:(?&Element)(?&Num))+)
(?<ElementParenthesesGroup>\((?&ElementGroup)+\)(?&Num))
(?<ElementSquareBracketGroup>\[(?:(?:(?&ElementParenthesesGroup)(?:(?&ElementGroup)|(?&ElementParenthesesGroup))+)|(?:(?:(?&ElementGroup)|(?&ElementParenthesesGroup))+(?&ElementParenthesesGroup)))\](?&Num))
)
^((?<Brackets>(?&ElementSquareBracketGroup))|(?<Parentheses>(?&ElementParenthesesGroup))|(?<Group>(?&ElementGroup)))+$
Explanation
The first part of the (?(DEFINE)) section lists each periodic element (ordered by atomic number for easy lookup).
The Element group acts as a simple or | between each of the elements listed in 1. ensuring that each element's symbol is ordered alphabetically by the first character, and then by symbol character length (so as not to catch, for example, Carbon C instead of Calcium Ca)
ElementGroup specifies a group of elements in the format: one or more Element, each optionally followed by a number that does not begin with zero (specified by the group Num)
Valid Examples
C - Element
CH - Element followed by another Element
CH3 -Element followed by another Element and a Num
O2 - Element followed by a Num
Invalid Examples
N0 - 0 cannot be used explicitly
N01 - Num group specifies the number must begin with 1-9 or not have a number
A - Element does not exist
c - Element does not exist - case sensitive regex
ElementParenthesesGroup specifies one or more groupings of ElementGroup between parentheses ( ) but containing at least one ElementGroup
Valid Examples
(CH) - ElementGroup surrounded by parentheses
(CH3) - ElementGroup surrounded by parentheses
(CH3NO4) - multiple ElementGroup surrounded by parentheses
(CH3NO4)2 - multiple ElementGroup surrounded by parentheses followed by a Num
Invalid Examples
(CH[NO4]) - Only ElementGroup is valid inside ElementParenthesesGroup
ElementSquareBracketGroup specifies a grouping of ElementParenthesesGroup or ElementGroup between square brackets [ ] but containing at least one ElementParenthesesGroup and one other group (ElementParenthesesGroup or ElementGroup)
Valid Examples
[CH3(NO4)] - Contains at least one ElementParenthesesGroup and one other ElementParenthesesGroup or ElementGroup
[(NO4)CH]2 - Contains at least one ElementParenthesesGroup and one other ElementParenthesesGroup or ElementGroup followed by Num
[(NO4)(CH3)] - Contains at least one ElementParenthesesGroup and one other ElementParenthesesGroup or ElementGroup
Invalid Examples
[(NO4)] - Does not contain second group, brackets [ ] are redundant
[NO4] - Does not contain ElementParenthesesGroup
Additional Information
I realize this is a very long answer, but the OP is asking a very specific question and wants to ensure specific criteria are met.
Ensure the following flags are set:
g - ensures global matches
x - ensures whitespace is ignored
if the data is across multiple lines (separated by a newline character) use m for multi line
Note: Regex will only capture the last group of type X that it finds (and overwrite the previously captured group of said type X). This is the default behaviour of regex and there is currently no way to override it. This may give you undesirable results. You can see this with the last example in the linked regex as well as with your example of (CH3)2CFCOO(CH2)2Si(CH3)2Cl since there are multiple of each group type.
It is best not to assemble such a large regex manually. Instead, let's assume we have an array of atoms @atoms. We can then create a regex matching any of these atoms like:
my ($atoms_regex) = map qr/$_/, join '|', map quotemeta, sort @atoms;
(Sort all items so that shorter atom names come first, then escape all items with quotemeta, join them with a | for alternatives, and compile the regex.)
You can add any used abbreviations to the @atoms array.
Next, we can write a regex that allows grouping and numbering. Our regex will match any number of items, where an item may be an atom or a group, and may be followed by a number:
my $chemical_formula_regex = qr/
(?&item)++
(?(DEFINE)
(?<item> (?: \((?&item)++\) | \[(?&item)++\] | $atoms_regex ) [0-9]* )
)
/x;
Within the (?(DEFINE) ...) group we can define named subpatterns with (?<name> ...). A subpattern is like a subroutine for a regex. We can call those subpatterns with (?&name). This allows us to structure the regex without unnecessary repetition.
The /x flag allows us to use whitespace and linebreaks and comments to lay out the regex in a more readable fashion. Regexes don't have to be an incomprehensible mess!
The ++ quantifier instead of + is not strictly necessary, but prevents unwanted backtracking. That may be a bit faster when a match fails.
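Python's standard re module has no subpattern recursion, so the recursive item rule cannot be ported directly. One possible substitute (a different technique, not the Perl regex above) is a tiny recursive-descent check that mirrors the same grammar. ATOMS here is a deliberately short, illustrative subset, and the function names are made up:

```python
import re

ATOMS = ["H", "He", "C", "N", "O", "F", "Si", "Cl"]  # illustrative subset only
# Longest symbols first, so "Cl" is tried before "C".
ATOM_RE = re.compile("|".join(sorted(map(re.escape, ATOMS), key=len, reverse=True)))
NUM_RE = re.compile(r"[0-9]*")
CLOSER = {"(": ")", "[": "]"}

def parse_items(s, i):
    """Consume one or more items from index i; return the new index, or -1 on failure."""
    start = i
    while i < len(s):
        if s[i] in CLOSER:
            j = parse_items(s, i + 1)  # recurse into the group
            if j == -1 or j >= len(s) or s[j] != CLOSER[s[i]]:
                return -1
            i = j + 1
        else:
            m = ATOM_RE.match(s, i)
            if m is None:
                break
            i = m.end()
        i = NUM_RE.match(s, i).end()  # optional count after the item
    return i if i > start else -1

def is_formula(s):
    return parse_items(s, 0) == len(s)

print(is_formula("(CH3)2CFCOO(CH2)2Si(CH3)2Cl"), is_formula("(CH3"))
```

Like the Perl version, this allows arbitrary nesting and any count (including a leading zero), so it is looser than the strict rules in the other answer.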
Since this post is a top result for "regex chemistry symbols", I would also like to submit a solution. It is a python script that uses a regex to match chemistry formulas of the type A#B#, where A and B are chemical symbols and # is a number. The script as implemented matches and then surrounds the matches with \ce{} for use in LaTeX. It also includes the ability to exclude matches if the capture is in a user-defined list, which means words such as "I" and "In" will not be matched. Gist Link.
#!/usr/bin/env python3
# Find chemical symbols and surround them with \ce{ Symbol }
# Problem words: I, HOW, In, degrees K. Add words to exlist to ignore them.
import re, sys
if len(sys.argv) < 2 :
    print('Usage:> {} <filename>'.format(sys.argv[0]))
    sys.exit(1)
ptable =" H He "
ptable+=" Li Be B C N O F Ne "
ptable+=" Na Mg Al Si P S Cl Ar "
ptable+=" K Ca Sc Ti V Cr Mn Fe Co Ni Cu Zn Ga Ge As Se Br Kr "
ptable+=" Rb Sr Y Zr Nb Mo Tc Ru Rh Pd Ag Cd In Sn Sb Te I Xe "
ptable+=" Cs Ba La Hf Ta W Re Os Ir Pt Au Hg Tl Pb Bi Po At Rn "
ptable+=" Fr Ra Ac Rf Db Sg Bh Hs Mt Ds Rg Cn Nh Fl Mc Lv Ts Og "
ptable+=" Ce Pr Nd Pm Sm Eu Gd Tb Dy Ho Er Tm Yb Lu "
ptable+=" Th Pa U Np Pu Am Cm Bk Cf Es Fm Md No Lr "
exlist = ['C','I','In','K','HOW'] # exclude these words from being replaced
orsyms = '|'.join(ptable.split())
resyms = re.compile(r'\b'+r'((?:(?:{})\d*)+)'.format(orsyms)+r'\b')
latexfile=sys.argv[1]
with open(latexfile,'r') as fd:
    for line in fd:
        for m in list(set(resyms.findall(line))):
            if m not in exlist :
                # r'\\ce{' produces a literal \ce{ ; a bare \c is rejected as a bad escape in re.sub replacements (Python 3.7+)
                line = re.sub(r'\b'+m+r'\b', r'\\ce{'+m+r'}', line)
        print(line,end='')

How do I access the captures within a match?

I am trying to parse a csv file, and I am trying to access a named regex inside a proto regex in Perl6. It turns out to be Nil. What is the proper way to do it?
grammar rsCSV {
regex TOP { ( \s* <oneCSV> \s* \, \s* )* }
proto regex oneCSV {*}
regex oneCSV:sym<noQuote> { <-[\"]>*? }
regex oneCSV:sym<quoted> { \" .*? \" } # use non-greedy match
}
my $input = prompt("Enter csv line: ");
my $m1 = rsCSV.parse($input);
say "===========================";
say $m1;
say "===========================";
say "1 " ~ $m1<oneCSV><quoted>; # this fails; it is "Nil"
say "2 " ~ $m1[0];
say "3 " ~ $m1[0][2];
Detailed discussion complementing Christoph's answer
I am trying to parse a csv file
Perhaps you are focused on learning Raku parsing and are writing some throwaway code. But if you want industrial strength CSV parsing out of the box, please be aware of the Text::CSV modules[1].
I am trying to access a named regex
If you are learning Raku parsing, please take advantage of the awesome related (free) developer tools[2].
in proto regex in Raku
Your issue is unrelated to it being a proto regex.
Instead the issue is that, while the match object corresponding to your named capture is stored in the overall match object you stored in $m1, it is not stored precisely where you are looking for it.
Where do match objects corresponding to captures appear?
To see what's going on, I'll start by simulating what you were trying to do. I'll use a regex that declares just one capture, a "named" (aka "Associative") capture that matches the string ab.
given 'ab'
{
my $m1 = m/ $<named-capture> = ( ab ) /;
say $m1<named-capture>;
# 「ab」
}
The match object corresponding to the named capture is stored where you'd presumably expect it to appear within $m1, at $m1<named-capture>.
But you were getting Nil with $m1<oneCSV>. What gives?
Why your $m1<oneCSV> did not work
There are two types of capture: named (aka "Associative") and numbered (aka "Positional"). The parens you wrote in your regex that surrounded <oneCSV> introduced a numbered capture:
given 'ab'
{
my $m1 = m/ ( $<named-capture> = ( ab ) ) /; # extra parens added
say $m1[0]<named-capture>;
# 「ab」
}
The parens in / ( ... ) / declare a single top level numbered capture. If it matches, then the corresponding match object is stored in $m1[0]. (If your regex looked like / ... ( ... ) ... ( ... ) ... ( ... ) ... / then another match object corresponding to what matches the second pair of parentheses would be stored in $m1[1], another in $m1[2] for the third, and so on.)
The match result for $<named-capture> = ( ab ) is then stored inside $m1[0]. That's why say $m1[0]<named-capture> works.
So far so good. But this is only half the story...
Why $m1[0]<oneCSV> in your code would not work either
While $m1[0]<named-capture> in the immediately above code is working, you would still not get a match object in $m1[0]<oneCSV> in your original code. This is because you also asked for multiple matches of the zeroth capture because you used a * quantifier:
given 'ab'
{
my $m1 = m/ ( $<named-capture> = ( ab ) )* /; # * is a quantifier
say $m1[0][0]<named-capture>;
# 「ab」
}
Because the * quantifier asks for multiple matches, Raku writes a list of match objects into $m1[0]. (In this case there's only one such match so you end up with a list of length 1, i.e. just $m1[0][0] (and not $m1[0][1], $m1[0][2], etc.).)
Summary
Captures nest;
A capture quantified by either * or + corresponds to two levels of nesting not just one.
In your original code, you'd have to write say $m1[0][0]<oneCSV>; to get to the match object you're looking for.
[1] Install relevant modules and write use Text::CSV; (for a pure Raku implementation) or use Text::CSV:from<Perl5>; (for a Perl plus XS implementation) at the start of your code. (talk slides (click on top word, eg. "csv", to advance through slides), video, Raku module, Perl XS module.)
[2] Install CommaIDE and have fun with its awesome grammar/regex development/debugging/analysis features. Or install the Grammar::Tracer and/or Grammar::Debugger modules and write use Grammar::Tracer; or use Grammar::Debugger; at the start of your code (talk slides, video, modules.)
The match for <oneCSV> lives within the scope of the capture group, which you get via $m1[0].
As the group is quantified with *, the results will again be a list, ie you need another indexing operation to get at a match object, eg $m1[0][0] for the first one.
The named capture can then be accessed by name, eg $m1[0][0]<oneCSV>. This will already contain the match result of the appropriate branch of the protoregex.
If you want the whole list of matches instead of a specific one, you can use >> or map, eg $m1[0]>>.<oneCSV>.

regex to duplicate repeated patterns, substituting part of the pattern

I'd like to duplicate multiple matches in a line, substituting part of each match, but keeping the runs of matches together (that seems to be the tricky part).
e.g.:
Regex:
(x(\d)(,)?)
Replacement:
X$2,O$2$3
Input:
x1,x2,Z3,x4,Z5,x6
Output: (repeated groups broken apart)
X1,O1,X2,O2,Z3,X4,O4,Z5,X6,O6
Desired output (repeated groups, "X1,X2" kept together):
X1,X2,O1,O2,Z3,X4,O4,Z5,X6,O6
Demo: https://regex101.com/r/gH9tL9/1
Is this possible with regex or do I need to use something else?
Update: Will's answer is what I expected. It occurs to me that it might be possible with multiple passes of regex.
You would have to capture the repeating patterns as one match and write out replacements for the whole repeating pattern at once. Your current pattern cannot tell that your first and second matches, x1, and x2, respectively, are adjacent.
I'm going to say no, this is not possible with one pure regex.
This is because of two important facts about capture groups and replacing.
Repeated capture groups will return the last capture:
Regexes can capture patterns that repeat an arbitrary number of times by using the form <PATTERN>{1,}, <PATTERN>+ or <PATTERN>*. However, any capture group within <PATTERN> would only return the captures from the last iteration of the pattern. This would prevent your desired ability to capture matches that arbitrarily repeat.
"Hold on", you might say, "I only want to capture patterns that repeat one or two times, I could use (x(\d)(,)?)(x(\d)(,)?)?", which brings us to point 2.
There is no conditional replacement
Using the above pattern we could get your desired output for the repeated match, but not without mangling the solo match replacement.
See: https://regex101.com/r/gH9tL9/2 Without the ability to turn off sections of the replacement based on the existence of capture groups, we cannot achieve the desired output.
But "No, you can't do that" is a challenge to a hacker, I hope I am shown up by a true regex ninja.
Solution with 2 regexes and some code
There are definitely ways to achieve this goal with some code.
Here's a quick and dirty python hack using two regexes http://pythonfiddle.com/wip-soln-for-so-q/
This makes use of python's re.sub(), which can pass each match of one regex to a function, ordered_repl, which returns the replacement string. By using your original regex within ordered_repl we can extract the information we want and get the right order by buffering our lists of Xs and Os.
import re
input_string="x1,x2,Z3,x4,Z5,x6"
re1 = re.compile(r"(?:x\d,?)+") # matches a whole run of adjacent x items using a repeating non-capturing group
re2 = re.compile(r"(x(\d)(,)?)") # your actual matcher
def ordered_repl(m): # m is a match object
    buf1 = []
    buf2 = []
    cap_iter = re.finditer(re2,m.group(0)) # returns an iterator of match objects for all non-overlapping matches
    for cap_group in cap_iter:
        capture = cap_group.group(2) # capture the digit
        buf1.append("X%s" % capture) # buffer X's of this submatch group
        buf2.append("O%s" % capture) # buffer O's of this submatch group
    return "%s,%s," % (",".join(buf1),",".join(buf2)) # concatenate the buffers and return
print(re.sub(re1,ordered_repl,input_string).rstrip(',')) # searches string for matches to re1 and passes them to the ordered_repl function
In my specific case I'm using powershell, so I was able to come up with the following:
(linebreaks added for readability)
("x1,x2,z3,x4,z5,x6"
-split '((?<=x\d),(?!x)|(?<!x\d),(?=x))'
| Foreach-Object {
if ($_ -match 'x') {
$_ + ',' + ($_ -replace 'x','y')
} else {$_}
}
) -join ''
Outputs:
x1,x2,y1,y2,z3,x4,y4,z5,x6,y6
Where:
-split '((?<=x\d),(?!x)|(?<!x\d),(?=x))'
breaks apart the string into these groups:
x1,x2
,
z3
,
x4
,
z5
,
x6
using positive and negative lookahead and lookbehind:
comma with x\d before and without x after:
(?<=x\d),(?!x)
comma without x\d before and with x after:
(?<!x\d),(?=x)
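For comparison, the same split-on-lookarounds idea can be sketched with Python's standard re module. Both lookbehinds are fixed-width, which re requires; duplicate_runs is a made-up name, and the x-to-y substitution simply mirrors the PowerShell snippet above:

```python
import re

# The capture group makes re.split keep the separating commas in the result.
BOUNDARY_RE = re.compile(r"((?<=x\d),(?!x)|(?<!x\d),(?=x))")

def duplicate_runs(line):
    """Duplicate each run of x items as a whole, e.g. 'x1,x2' -> 'x1,x2,y1,y2'."""
    out = []
    for part in BOUNDARY_RE.split(line):
        if "x" in part:
            out.append(part + "," + part.replace("x", "y"))
        else:
            out.append(part)  # separators and non-x items pass through
    return "".join(out)

print(duplicate_runs("x1,x2,z3,x4,z5,x6"))
```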