In the course of processing a large textual chemical database with Perl, I have been faced with the problem of using a regex to match chemical formulae. I have seen these two previous topics, but the answers suggested there are too loose for my requirements.
Specifically, my (admittedly limited) research has led me to this posting that gives a regex for the currently accepted chemical symbols, which I'll copy here for reference
[BCFHIKNOPSUVWY]|[ISZ][nr]|[ACELP][ru]|A[cglmst]|B[aehikr]|C[adeflos]|D[bsy]|Es|F[elmr]|G[ade]|H[efgos]|Kr|L[aiv]|M[cdgnot]|N[abdehiop]|O[gs]|P[abdmot]|R[abe-hnu]|S[bcegim]|T[abcehilms]|Xe|Yb
(Thus e.g. C, Cm, and Cn will pass, but not Cg or Cx.)
As with the previous questions, I also need to match numbers, complete sets of parentheses and complete sets of square brackets, so that both e.g. C2H6O and (CH3)2CFCOO(CH2)2Si(CH3)2Cl are matched.
So how do I combine the previous solutions with the grand regex for matching valid chemical elements to strictly match a chemical formula?
(If it's not too much trouble to add, a blow-by-blow account of how to humanly parse the regex would be appreciated greatly, though not strictly necessary.)
Brief
I decided to create one large regex that does what you want while still keeping it clean. This regex would be used in conjunction with a loop that goes over the matches for bracket or parentheses groups.
Assumptions
I am assuming the following since the OP has not given a full list of positive and negative matches:
Nested parentheses aren't possible
Nested square brackets aren't possible
Square bracket groups that surround a single parentheses group are redundant and therefore incorrect
Square bracket groups must contain at least 2 groups, of which 1 such group must be a parentheses group
If any of these assumptions are incorrect, please let me know so that I may fix the regex accordingly
Answer
View this regex in use here
Code
(?(DEFINE)
(?# Periodic elements )
(?<Hydrogen>H)
(?<Helium>He)
(?<Lithium>Li)
(?<Beryllium>Be)
(?<Boron>B)
(?<Carbon>C)
(?<Nitrogen>N)
(?<Oxygen>O)
(?<Fluorine>F)
(?<Neon>Ne)
(?<Sodium>Na)
(?<Magnesium>Mg)
(?<Aluminum>Al)
(?<Silicon>Si)
(?<Phosphorus>P)
(?<Sulfur>S)
(?<Chlorine>Cl)
(?<Argon>Ar)
(?<Potassium>K)
(?<Calcium>Ca)
(?<Scandium>Sc)
(?<Titanium>Ti)
(?<Vanadium>V)
(?<Chromium>Cr)
(?<Manganese>Mn)
(?<Iron>Fe)
(?<Cobalt>Co)
(?<Nickel>Ni)
(?<Copper>Cu)
(?<Zinc>Zn)
(?<Gallium>Ga)
(?<Germanium>Ge)
(?<Arsenic>As)
(?<Selenium>Se)
(?<Bromine>Br)
(?<Krypton>Kr)
(?<Rubidium>Rb)
(?<Strontium>Sr)
(?<Yttrium>Y)
(?<Zirconium>Zr)
(?<Niobium>Nb)
(?<Molybdenum>Mo)
(?<Technetium>Tc)
(?<Ruthenium>Ru)
(?<Rhodium>Rh)
(?<Palladium>Pd)
(?<Silver>Ag)
(?<Cadmium>Cd)
(?<Indium>In)
(?<Tin>Sn)
(?<Antimony>Sb)
(?<Tellurium>Te)
(?<Iodine>I)
(?<Xenon>Xe)
(?<Cesium>Cs)
(?<Barium>Ba)
(?<Lanthanum>La)
(?<Cerium>Ce)
(?<Praseodymium>Pr)
(?<Neodymium>Nd)
(?<Promethium>Pm)
(?<Samarium>Sm)
(?<Europium>Eu)
(?<Gadolinium>Gd)
(?<Terbium>Tb)
(?<Dysprosium>Dy)
(?<Holmium>Ho)
(?<Erbium>Er)
(?<Thulium>Tm)
(?<Ytterbium>Yb)
(?<Lutetium>Lu)
(?<Hafnium>Hf)
(?<Tantalum>Ta)
(?<Tungsten>W)
(?<Rhenium>Re)
(?<Osmium>Os)
(?<Iridium>Ir)
(?<Platinum>Pt)
(?<Gold>Au)
(?<Mercury>Hg)
(?<Thallium>Tl)
(?<Lead>Pb)
(?<Bismuth>Bi)
(?<Polonium>Po)
(?<Astatine>At)
(?<Radon>Rn)
(?<Francium>Fr)
(?<Radium>Ra)
(?<Actinium>Ac)
(?<Thorium>Th)
(?<Protactinium>Pa)
(?<Uranium>U)
(?<Neptunium>Np)
(?<Plutonium>Pu)
(?<Americium>Am)
(?<Curium>Cm)
(?<Berkelium>Bk)
(?<Californium>Cf)
(?<Einsteinium>Es)
(?<Fermium>Fm)
(?<Mendelevium>Md)
(?<Nobelium>No)
(?<Lawrencium>Lr)
(?<Rutherfordium>Rf)
(?<Dubnium>Db)
(?<Seaborgium>Sg)
(?<Bohrium>Bh)
(?<Hassium>Hs)
(?<Meitnerium>Mt)
(?<Darmstadtium>Ds)
(?<Roentgenium>Rg)
(?<Copernicium>Cn)
(?<Nihonium>Nh)
(?<Flerovium>Fl)
(?<Moscovium>Mc)
(?<Livermorium>Lv)
(?<Tennessine>Ts)
(?<Oganesson>Og)
(?# Regex )
(?<Element>(?&Actinium)|(?&Silver)|(?&Aluminum)|(?&Americium)|(?&Argon)|(?&Arsenic)|(?&Astatine)|(?&Gold)|(?&Barium)|(?&Beryllium)|(?&Bohrium)|(?&Bismuth)|(?&Berkelium)|(?&Bromine)|(?&Boron)|(?&Calcium)|(?&Cadmium)|(?&Cerium)|(?&Californium)|(?&Chlorine)|(?&Curium)|(?&Copernicium)|(?&Cobalt)|(?&Chromium)|(?&Cesium)|(?&Copper)|(?&Carbon)|(?&Dubnium)|(?&Darmstadtium)|(?&Dysprosium)|(?&Erbium)|(?&Einsteinium)|(?&Europium)|(?&Iron)|(?&Flerovium)|(?&Fermium)|(?&Francium)|(?&Fluorine)|(?&Gallium)|(?&Gadolinium)|(?&Germanium)|(?&Helium)|(?&Hafnium)|(?&Mercury)|(?&Holmium)|(?&Hassium)|(?&Hydrogen)|(?&Indium)|(?&Iridium)|(?&Iodine)|(?&Krypton)|(?&Potassium)|(?&Lanthanum)|(?&Lithium)|(?&Lawrencium)|(?&Lutetium)|(?&Livermorium)|(?&Moscovium)|(?&Mendelevium)|(?&Magnesium)|(?&Manganese)|(?&Molybdenum)|(?&Meitnerium)|(?&Sodium)|(?&Niobium)|(?&Neodymium)|(?&Neon)|(?&Nihonium)|(?&Nickel)|(?&Nobelium)|(?&Neptunium)|(?&Nitrogen)|(?&Oganesson)|(?&Osmium)|(?&Oxygen)|(?&Protactinium)|(?&Lead)|(?&Palladium)|(?&Promethium)|(?&Polonium)|(?&Praseodymium)|(?&Platinum)|(?&Plutonium)|(?&Phosphorus)|(?&Radium)|(?&Rubidium)|(?&Rhenium)|(?&Rutherfordium)|(?&Roentgenium)|(?&Rhodium)|(?&Radon)|(?&Ruthenium)|(?&Antimony)|(?&Scandium)|(?&Selenium)|(?&Seaborgium)|(?&Silicon)|(?&Samarium)|(?&Tin)|(?&Strontium)|(?&Sulfur)|(?&Tantalum)|(?&Terbium)|(?&Technetium)|(?&Tellurium)|(?&Thorium)|(?&Titanium)|(?&Thallium)|(?&Thulium)|(?&Tennessine)|(?&Uranium)|(?&Vanadium)|(?&Tungsten)|(?&Xenon)|(?&Ytterbium)|(?&Yttrium)|(?&Zirconium)|(?&Zinc))
(?<Num>(?:[1-9]\d*)?)
(?<ElementGroup>(?:(?&Element)(?&Num))+)
(?<ElementParenthesesGroup>\((?&ElementGroup)+\)(?&Num))
(?<ElementSquareBracketGroup>\[(?:(?:(?&ElementParenthesesGroup)(?:(?&ElementGroup)|(?&ElementParenthesesGroup))+)|(?:(?:(?&ElementGroup)|(?&ElementParenthesesGroup))+(?&ElementParenthesesGroup)))\](?&Num))
)
^((?<Brackets>(?&ElementSquareBracketGroup))|(?<Parentheses>(?&ElementParenthesesGroup))|(?<Group>(?&ElementGroup)))+$
Explanation
The first part of the (?(DEFINE)) section lists each periodic element (ordered by atomic number for easy lookup).
The Element group is a simple alternation (|) of the elements listed in 1., ordered alphabetically by the first character and then with longer symbols first (so as not to catch, for example, Carbon C instead of Calcium Ca)
ElementGroup specifies one or more repetitions of an Element followed by an optional count: a positive integer with no leading zero (specified by the group Num)
Valid Examples
C - Element
CH - Element followed by another Element
CH3 - Element followed by another Element and a Num
O2 - Element followed by a Num
Invalid Examples
N0 - 0 cannot be used explicitly
N01 - Num group specifies the number must begin with 1-9 or not have a number
A - Element does not exist
c - Element does not exist - case sensitive regex
ElementParenthesesGroup specifies one or more groupings of ElementGroup between parentheses ( ) but containing at least one ElementGroup
Valid Examples
(CH) - ElementGroup surrounded by parentheses
(CH3) - ElementGroup surrounded by parentheses
(CH3NO4) - multiple ElementGroup surrounded by parentheses
(CH3NO4)2 - multiple ElementGroup surrounded by parentheses followed by a Num
Invalid Examples
(CH[NO4]) - Only ElementGroup is valid inside ElementParenthesesGroup
ElementSquareBracketGroup specifies a grouping of ElementParenthesesGroup or ElementGroup between square brackets [ ] but containing at least one ElementParenthesesGroup and one other group (ElementParenthesesGroup or ElementGroup)
Valid Examples
[CH3(NO4)] - Contains at least one ElementParenthesesGroup and one other ElementParenthesesGroup or ElementGroup
[(NO4)CH]2 - Contains at least one ElementParenthesesGroup and one other ElementParenthesesGroup or ElementGroup followed by Num
[(NO4)(CH3)] - Contains at least one ElementParenthesesGroup and one other ElementParenthesesGroup or ElementGroup
Invalid Examples
[(NO4)] - Does not contain second group, brackets [ ] are redundant
[NO4] - Does not contain ElementParenthesesGroup
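The grouping rules described above can also be sketched outside PCRE. The following Python version is only an illustration under the stated assumptions (no nesting; square brackets need a parentheses group plus one other group). The names SYMBOLS, ATOM, PAREN, ANY, BRACKET and is_formula are made up for this sketch. It composes the sub-patterns as strings because Python's standard re module has no (?(DEFINE)...) construct, and it simplifies ElementGroup to a single Element plus Num, repeated where needed, to avoid nested quantifiers.

```python
import re

# All element symbols; sorted longest-first below so "Cl" is tried before "C".
SYMBOLS = """
H He Li Be B C N O F Ne Na Mg Al Si P S Cl Ar K Ca Sc Ti V Cr Mn Fe Co Ni Cu Zn
Ga Ge As Se Br Kr Rb Sr Y Zr Nb Mo Tc Ru Rh Pd Ag Cd In Sn Sb Te I Xe Cs Ba La
Ce Pr Nd Pm Sm Eu Gd Tb Dy Ho Er Tm Yb Lu Hf Ta W Re Os Ir Pt Au Hg Tl Pb Bi Po
At Rn Fr Ra Ac Th Pa U Np Pu Am Cm Bk Cf Es Fm Md No Lr Rf Db Sg Bh Hs Mt Ds Rg
Cn Nh Fl Mc Lv Ts Og
""".split()

ELEMENT = "(?:" + "|".join(sorted(SYMBOLS, key=len, reverse=True)) + ")"
NUM = r"(?:[1-9]\d*)?"                  # optional count, no leading zero
ATOM = f"(?:{ELEMENT}{NUM})"            # one Element followed by Num
PAREN = rf"\({ATOM}+\){NUM}"            # ( atoms ) followed by Num
ANY = f"(?:{ATOM}|{PAREN})"
# [ ... ] must hold a parentheses group plus at least one other group.
BRACKET = rf"\[(?:{PAREN}{ANY}+|{ANY}+{PAREN})\]{NUM}"
FORMULA = re.compile(f"(?:{BRACKET}|{PAREN}|{ATOM})+")


def is_formula(text):
    """Return True if the whole string is a formula under the assumptions above."""
    return FORMULA.fullmatch(text) is not None


for candidate in ["C2H6O", "(CH3)2CFCOO(CH2)2Si(CH3)2Cl", "[(NO4)CH]2", "N0", "[NO4]"]:
    print(candidate, is_formula(candidate))
```

Using fullmatch plays the role of the ^ and $ anchors in the original pattern.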
Additional Information
I realize this is a very long answer, but the OP is asking a very specific question and wants to ensure specific criteria are met.
Ensure the following flags are set:
g - ensures global matches
x - ensures whitespace is ignored
if the data is across multiple lines (separated by a newline character) use m for multi line
Note: Regex will only capture the last group of type X that it finds (and overwrite the previously captured group of said type X). This is the default behaviour of regex and there is currently no way to override it. This may give you undesirable results. You can see this with the last example in the linked regex as well as with your example of (CH3)2CFCOO(CH2)2Si(CH3)2Cl, since there are multiple groups of each type.
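The overwriting behaviour is easy to reproduce in isolation. The snippet below is a small Python illustration, not part of the regex above; it assumes a deliberately loose symbol pattern ([A-Z][a-z]?) just to keep the example short. It shows that a repeated capture group keeps only its final iteration, and that looping over individual matches recovers every pair.

```python
import re

formula = "C2H6O"

# A capture group inside a repetition keeps only its last iteration.
last_only = re.fullmatch(r"(?:([A-Z][a-z]?)(\d*))+", formula)
print(last_only.group(1))

# Iterating over separate matches retrieves every symbol/count pair instead.
pairs = re.findall(r"([A-Z][a-z]?)(\d*)", formula)
print(pairs)
```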
It is best not to assemble such a large regex manually. Instead, let's assume we have an array of atoms @atoms. We can then create a regex matching any of these atoms like:
my ($atoms_regex) = map qr/$_/, join '|', map quotemeta, sort @atoms;
(Sort all items so that shorter atom names come first, then escape all items with quotemeta, join them with a | for alternatives, and compile the regex.)
You can add any used abbreviations to the @atoms array.
Next, we can write a regex that allows grouping and numbering. Our regex will match any number of items, where an item may be an atom or a group, and may be followed by a number:
my $chemical_formula_regex = qr/
(?&item)++
(?(DEFINE)
(?<item> (?: \((?&item)++\) | \[(?&item)++\] | $atoms_regex ) [0-9]* )
)
/x;
Within the (?(DEFINE) ...) group we can define named subpatterns with (?<name> ...). A subpattern is like a subroutine for a regex. We can call those subpatterns with (?&name). This allows us to structure the regex without unnecessary repetition.
The /x flag allows us to use whitespace and linebreaks and comments to lay out the regex in a more readable fashion. Regexes don't have to be an incomprehensible mess!
The ++ quantifier instead of + is not strictly necessary, but prevents unwanted backtracking. That may be a bit faster when a match fails.
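For readers without recursive subpatterns available (Python's standard re module has none), the same item grammar can be sketched as a small hand-written matcher. This is only an illustration: ATOMS is a tiny made-up subset standing in for the full atom list, and the function names are hypothetical. Like the ++ quantifier, it consumes items greedily and never backtracks.

```python
import re

# Illustrative subset only; two-letter symbols come first so "Cl" wins over "C".
ATOMS = ("Cl", "Si", "C", "F", "H", "N", "O")
ATOM_RE = re.compile("|".join(ATOMS))
DIGITS_RE = re.compile(r"[0-9]*")
CLOSERS = {"(": ")", "[": "]"}


def match_item(s, i):
    """Return the index just past one item starting at s[i], or None."""
    if i < len(s) and s[i] in CLOSERS:
        j = match_items(s, i + 1)          # group body: one or more items
        if j is None or j >= len(s) or s[j] != CLOSERS[s[i]]:
            return None
        j += 1                             # step over the closing delimiter
    else:
        m = ATOM_RE.match(s, i)
        if m is None:
            return None
        j = m.end()
    return DIGITS_RE.match(s, j).end()     # optional trailing number


def match_items(s, i):
    """Match one or more items; return the end index, or None if none match."""
    j = match_item(s, i)
    if j is None:
        return None
    while True:
        k = match_item(s, j)
        if k is None:
            return j
        j = k


def is_formula(s):
    return match_items(s, 0) == len(s)


print(is_formula("(CH3)2CFCOO(CH2)2Si(CH3)2Cl"))
print(is_formula("(CH3"))
```

As with the regex, nested groups are accepted and the trailing number is any run of digits.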
Since this post is a top result for "regex chemistry symbols", I would also like to submit a solution. It is a python script that uses a regex to match chemistry formulas of the type A#B#, where A and B are chemical symbols and # is a number. The script as implemented matches and then surrounds the matches with \ce{} for use in LaTeX. It also includes the ability to exclude matches if the capture is in a user-defined list, which means words such as "I" and "In" will not be matched. Gist Link.
#!/usr/bin/env python3
# Find chemical symbols and surround them with \ce{ Symbol }
# Problem words: I, HOW, In, degrees K. Add words to exlist to ignore them.
import re, sys
if len(sys.argv) < 2 :
    print('Usage:> {} <filename>'.format(sys.argv[0]))
    sys.exit(1)
ptable =" H He "
ptable+=" Li Be B C N O F Ne "
ptable+=" Na Mg Al Si P S Cl Ar "
ptable+=" K Ca Sc Ti V Cr Mn Fe Co Ni Cu Zn Ga Ge As Se Br Kr "
ptable+=" Rb Sr Y Zr Nb Mo Tc Ru Rh Pd Ag Cd In Sn Sb Te I Xe "
ptable+=" Cs Ba La Hf Ta W Re Os Ir Pt Au Hg Tl Pb Bi Po At Rn "
ptable+=" Fr Ra Ac Rf Db Sg Bh Hs Mt Ds Rg Cn Nh Fl Mc Lv Ts Og "
ptable+=" Ce Pr Nd Pm Sm Eu Gd Tb Dy Ho Er Tm Yb Lu "
ptable+=" Th Pa U Np Pu Am Cm Bk Cf Es Fm Md No Lr "
exlist = ['C','I','In','K','HOW'] # exclude these words from being replaced
orsyms = '|'.join(ptable.split())
resyms = re.compile(r'\b'+'((?:(?:{})\d*)+)'.format(orsyms)+r'\b')
latexfile=sys.argv[1]
with open(latexfile,'r') as fd:
    for line in fd:
        for m in list(set(resyms.findall(line))):
            if m not in exlist :
                # the backslash is doubled because re.sub rejects \c in a replacement as a bad escape
                line = re.sub(r'\b'+m+r'\b', r'\\ce{'+m+r'}', line)
        print(line,end='')
I am trying to reject name suffixes such as JR/SR, or other suffixes made up of I, V, and X, using a regular expression. To accomplish this I have implemented the following regex:
((^((?!((\b((I+))\b)|(\b(V+)\b)|(\b(X+)\b)|\b(IV)\b|(\b(V?I){1,2}\b)|(\b(IX)\b)|(\bX[I|IX]{1,2}\b)|(\bX|X+[V|VI]{1,2}\b)|(\b(JR)\b)|(\b(SR)\b))).)*$))
Using this I am able to reject various possible combinations, e.g.:
'Last Name I',
'Last Name II',
'Last Name IJR',
'Last Name SRX' etc.
However, there are still a couple of combinations that this regex lets through, e.g. 'Last Name IXV' or 'Last Name VXI'.
I am not able to debug these two. Please suggest which part of this regex I should change to satisfy the requirement.
Thank you!
Try this pattern: .+\b(?:(?>[JS]R)|X|I|J|V)+$
Explanation:
.+ - match one or more of any characters
\b - word boundary
(?:...) - non-capturing group
(?>...) - atomic group
[JS]R - match either S or J followed by R
| - alternation: match what is on the left OR what's on the right
+ - quantifier: match the preceding pattern one or more times
$ - match end of the string
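The pattern can be tried quickly in Python. The sketch below is an assumption-laden port: atomic groups are only available in the standard re module from Python 3.11, so it swaps (?>[JS]R) for a plain alternative, which behaves the same here because [JS]R has nothing to backtrack into. The name has_suffix is made up for the example.

```python
import re

# Same pattern as above, with the atomic group replaced by a plain alternative.
SUFFIX_RE = re.compile(r".+\b(?:[JS]R|X|I|J|V)+$")


def has_suffix(name):
    """Return True if the name ends in a JR/SR or I/V/X style suffix."""
    return SUFFIX_RE.match(name) is not None


for name in ["Last Name II", "Last Name IXV", "Last Name SRX", "Last Name"]:
    print(name, has_suffix(name))
```

A name for which has_suffix returns True would then be rejected by the validation.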
Demo
To solve this I worked on the above regex a little more. Here is the final result, which is meant to match Roman numerals up to thirty composed of I, V, and X.
"(\b(?!(IIX|IIV|IVV|IXX|IXI))I[IVX]{0,3}\b|\b(V|X)\b|\bV[I]{1,2}\b|\b((?!XVV|XVX)X([IXV]{1,2}))\b|\b[S|J]R\b)|^$"
What I have done here is:
I have only taken standalone inputs into consideration, that is, inputs such as SR or XXV. I observed the incorrect patterns and prevented them from matching as a positive result.
Separate inputs are ensured using \b, the word boundary.
Word boundary: it marks the start or end of a word; in simple words, it says "yes, there is a word here" or "no, there is not."
The incorrect patterns are excluded in the following way:
using negative lookahead (?!(IIX|IIV|IVV|IXX|IXI))
Here is how I arrived at this solution:
I first looked closely at all the patterns from I to X, that is:
I
I I
I I I
I V
V
V I
V I I
V I I I (it is out of the range of 3 characters.)
I X
X
We have an I, V, or X in the first position, then another I, X, or V in the second position, and after that again an I or V. I have implemented this in the following part of the regex above:
\b(?!(IIX|IIV|IVV|IXX|IXI))I[IVX]{0,3}\b
Start with I and then look for any of I, V, or X repeated zero to three times, and reject the invalid numbers listed inside (?!(IIX|IIV|IVV|IXX|IXI)). I have handled the other combinations similarly, as given below.
Then for V and X : \b(V|X)\b
Then for the VI, VII: \bV[I]{1,2}\b
Then for the XI - XXX: \b((?!XVV|XVX)X([IXV]{1,2}))\b
To validate a name suffix, i.e. JR or SR, one can use the following regex: \b[S|J]R\b
and the last part (^$) is for matching a blank string, in other words, when no input has been provided in the input box or textbox.
Feel free to post any questions or suggestions.
Thanks!
PS: This regex is simply one way to validate Roman numerals from 1 to 30 using I, V, and X. I hope it helps regex newcomers learn a bit.
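As a cross-check, here is a different technique from the hand-written pattern above: generate the thirty valid numerals programmatically and build the alternation from them. This is only a sketch; to_roman, NUMERALS, SUFFIX_RE and has_suffix are names invented for the example, and the JR/SR alternative is written as [JS]R.

```python
import re

ONES = ["", "I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX"]


def to_roman(n):
    """Convert 1 <= n <= 39 to a Roman numeral that uses only X, V and I."""
    tens, ones = divmod(n, 10)
    return "X" * tens + ONES[ones]


NUMERALS = [to_roman(n) for n in range(1, 31)]

# Longest alternatives first, each required to stand alone as a word.
SUFFIX_RE = re.compile(
    r"\b(?:" + "|".join(sorted(NUMERALS, key=len, reverse=True)) + r"|[JS]R)\b"
)


def has_suffix(name):
    """Return True if the name contains a standalone numeral 1-30 or JR/SR."""
    return SUFFIX_RE.search(name) is not None


for name in ["Last Name VIII", "Last Name XXX", "Last Name IXV", "Last Name VXI"]:
    print(name, has_suffix(name))
```

Because the alternation contains only well-formed numerals, strings such as IXV and VXI are not treated as suffixes.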
I solved this with a more explicit:
(.+) (?:(?>JR$|SR$|I$|II$|III$|IV$|MD$|DO$|PHD$))|(.+)
I know I could do something like [JS]R but I like the way this reads:
(.+) match any characters and then a space
(?:(?>JR$|SR$|I$|II$|III$|IV$|MD$|DO$|PHD$)) atomically look for but don't match endings like JR etc
|(.+) if you don't find the endings then match any characters
Feel free to add the endings you'd like to suit your needs.
I am working on a regex that will match many different types of location coordinates. So far it matches about 90% of the formats:
([SNsn][\\s]*)?((?:[\\+-]?[0-9]*[\\.,][0-9]+)|(?:[\\+-]?[0-9]+))(?:(?:[^ms'′""″,\\.\\dNEWnew]?)|(?:[^ms'′""″,\\.\\dNEWnew]+((?:[\\+-]?[0-9]*[\\.,][0-9]+)|(?:[\\+-]?[0-9]+))(?:(?:[^ds°""″,\\.\\dNEWnew]?)|(?:[^ds°""″,\\.\\dNEWnew]+((?:[\\+-]?[0-9]*[\\.,][0-9]+)|(?:[\\+-]?[0-9]+))[^dm°'′,\\.\\dNEWnew]*))))([SNsn]?)[^\\dSNsnEWew]+([EWew][\\s]*)?((?:[\\+-]?[0-9]*[\\.,][0-9]+)|(?:[\\+-]?[0-9]+))(?:(?:[^ms'′""″,\\.\\dNEWnew]?)|(?:[^ms'′""″,\\.\\dNEWnew]+((?:[\\+-]?[0-9]*[\\.,][0-9]+)|(?:[\\+-]?[0-9]+))(?:(?:[^ds°""″,\\.\\dNEWnew]?)|(?:[^ds°""″,\\.\\dNEWnew]+((?:[\\+-]?[0-9]*[\\.,][0-9]+)|(?:[\\+-]?[0-9]+))[^dm°'′,\\.\\dNEWnew]*))))([EWew]?)
Testing the formats:
N 45° 55.732 W 122° 29.882
N 047° 38.938', W 122° 20.887'
40.123, -74.123
40.123° N 74.123° W
40° 7´ 22.8" N 74° 7´ 22.8" W
40° 7.38’ , -74° 7.38’
N40°7’22.8, W74°7’22.8"
40°7’22.8"N, 74°7’22.8"W
40 7 22.8, -74 7 22.8
40.123 -74.123
40.123°,-74.123°
144442800, -266842800
40.123N74.123W
4007.38N7407.38W
40°7’22.8"N, 74°7’22.8"W
400722.8N740722.8W
N 40 7.38 W 74 7.38
40:7:23N,74:7:23W
40:7:22.8N 74:7:22.8W
40°7’23"N 74°7’23"W
40°7’23" -74°7’23"
40d 7’ 23" N 74d 7’ 23" W
40.123N 74.123W
40° 7.38, -74° 7.38
Testing if it works: https://regexr.com/3ivu2
As you can see there are issues with the spaces and commas that are causing the regex to not match some of these formats.
I am trying to match the coordinate strings so that they can be highlighted in my iOS app and allow the user to tap them.
What can I do to update the regex and fix the matching issues?
Overview
I'm sure there are many ways to go about this. Since you haven't specified a regex engine or programming language, I'll post one that works in PCRE and one that should work in most engines. The PCRE regex is much easier to understand than the non-PCRE regex, but both use the exact same logic.
The patterns defined below match each string you've presented in your question and properly separate each part of the coordinate (x, y).
Code
PCRE
This method uses the DEFINE construct to pre-define patterns. The beauty of this construct is that you can define reusable parts of your regex in one location, thus, you can edit most of the regex just by editing these subpatterns.
See regex in use here
(?(DEFINE)
(?<ns>[ns])
(?<ew>[ew])
(?<d>[°´’'"d:])
(?<n>[+-]?\d+(?:\.\d+)?)
)
(
(?&ns)?
(?:\ ?(?&n)(?&d)?){1,3}
\ ?(?&ns)?
)
\ ?,?\ ?
(
(?&ew)?
(?:\ ?(?&n)(?&d)?){1,3}
\ ?(?&ew)?
)
Flags: gix
Non-PCRE
See regex in use here
(
[ns]?
(?:\ ?[+-]?\d+(?:\.\d+)?[°´’'"d:]?){1,3}
\ ?[ns]?
)
\ ?,?\ ?
(
[ew]?
(?:\ ?[+-]?\d+(?:\.\d+)?[°´’'"d:]?){1,3}
\ ?[ew]?
)
Flags: gix.
Some engines don't have the x flag. For those engines you can use the following one-liner (as seen here):
([ns]?(?: ?[+-]?\d+(?:\.\d+)?[°´’'"d:]?){1,3} ?[ns]?) ?,? ?([ew]?(?: ?[+-]?\d+(?:\.\d+)?[°´’'"d:]?){1,3} ?[ew]?)
Explanation
Since both patterns are essentially the same (non-PCRE is just an expanded version of the PCRE), I'll define the PCRE regex pattern since it's easier to grasp.
Note that the patterns that use x have escaped spaces since they would otherwise be ignored (x ignores whitespace within the pattern). The i flag allows us to match text regardless of case (i makes our pattern case-insensitive).
DEFINE
(?(DEFINE)...) The DEFINE group itself never matches anything. It works like a variable declaration (var name=value): each subpattern is defined once and can be recalled later by its name.
(?<ns>[ns]) The group ns matches any character in the set nsNS
(?<ew>[ew]) The group ew matches any character in the set ewEW
(?<d>[°´’'"d:]) The group d matches any character in the set °´’'"d:
(?<n>[+-]?\d+(?:\.\d+)?) The group n matches any number that matches the following structure
[+-]? Optionally match any character in the set +-
\d+ Match one or more digits
(?:\.\d+)? Optionally match a decimal point followed by one or more digits
Pattern
The pattern is composed of 3 larger parts. The first and last are capture groups (the coordinates themselves) and the second is what separates the two.
Capture 1:
(?&ns)? Optionally match the group ns
(?:\ ?(?&n)(?&d)?){1,3} Matches [an optional space, followed by the group n then optionally group d] between one and three times
\ ?(?&ns)? Optionally match a space, optionally match the group ns
\ ?,?\ ? Match an optional space, comma and space (this separates each coordinate part)
Capture 2: This is the same as Capture 1 but replaces the group ns with the group ew
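The same structure can be mimicked in engines without DEFINE by composing the pieces as strings. The Python sketch below is one way to do that under the assumption that the standard re module is used; the names NS, EW, D, N, side and parse_pair are made up for the example. It strips the captures because the optional space before the hemisphere letter can end up inside group 1.

```python
import re

NS = "[ns]"                   # latitude hemisphere letters
EW = "[ew]"                   # longitude hemisphere letters
D = "[°´’'\"d:]"              # unit / delimiter characters
N = r"[+-]?\d+(?:\.\d+)?"     # signed integer or decimal number


def side(hemi):
    """One coordinate: optional hemisphere, 1-3 numbers, optional hemisphere."""
    return f"({hemi}?(?: ?{N}{D}?){{1,3}} ?{hemi}?)"


# Two sides separated by an optional space, comma and space; case-insensitive.
COORD_RE = re.compile(side(NS) + " ?,? ?" + side(EW), re.IGNORECASE)


def parse_pair(text):
    """Return (latitude_text, longitude_text) or None if nothing matches."""
    m = COORD_RE.search(text)
    if m is None:
        return None
    return m.group(1).strip(), m.group(2).strip()


print(parse_pair("40.123, -74.123"))
print(parse_pair("N 45° 55.732 W 122° 29.882"))
```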
This simplified regex literally matches all the patterns you've given:
^((?:[NW]? ?(?:[-\d.d]+[NW:°´’'",]?[ NW]?)+[, ]*)+[NW]?)$
I'm not an expert for coordinates, but you can modify it easily if I didn't take into account some specifics.
A full test is here.
I'm dealing with a large spreadsheet of order data which uses a unique, sort of hashed string of syntax to represent order items and attributes.
I currently have this data in Google Sheets and I'm hoping to be able to make use of the REGEXEXTRACT function (https://support.google.com/docs/answer/3098244) to retrieve the pieces of information I need from each row.
Example of function: REGEXEXTRACT("Needle in a haystack", ".e{2}dle")
Order data is huge and I believe I can use this regex function to isolate the piece of information I want.
Examples of the portion of string snippets I need to work with. Keeping in mind the actual order string is much longer than this:
"Location\";s:5:\"value\";s:7:\"Atlanta\"
"Location\";s:2:\"value\";s:8:\"New York\"
"Location\";s:5:\"value\";s:15:\"barrio de boedo\"
So the common string in each row is Location, as value is used multiple times throughout each order.
Let me see if I can articulate this correctly: How would I use regex to specify the value between the 4th and 5th double quotes occurring immediately after the string 'Location' so that in my examples above, the results would be Atlanta, New York, barrio de boedo?
For reference, the barrio de boedo example in its entirety:
\";s:7:\"product\";s:2:\"31\";s:8:\"form_key\";s:16:\"aasdf\";s:7:\"options\";a:2:{i:1;s:1:\"2\";i:2;s:15:\"barrio de boedo\";}s:15:\"super_attribute\";a:2:{i:92;s:1:\"4\";i:132;s:1:\"9\";}s:3:\"qty\";s:1:\"1\";}s:7:\"options\";a:2:{i:0;a:7:{s:5:\"label\";s:15:\"Language-Gender\";s:5:\"value\";s:8:\"spa-male\";s:11:\"print_value\";s:8:\"spa-male\";s:9:\"option_id\";s:1:\"1\";s:11:\"option_type\";s:9:\"drop_down\";s:12:\"option_value\";s:1:\"2\";s:11:\"custom_view\";b:0;}i:1;a:7:{s:5:\"label\";s:8:\"Location\";s:5:\"value\";s:15:\"barrio de boedo\";s:11:\"print_value\";s:15:\"barrio de boedo\";s:9:\"option_id\";s:1:\"2\";s:11:\"option_type\";s:5:\"field\";s:12:\"option_value\";s:15:\"barrio de boedo\";s:11:\"custom_view\";b:0;}}s:15:\"attributes_info\";a:2:{i:0;a:2:{s:5:\"label\";s:5:\"Color\";s:5:\"value\";s:4:\"Grey\";}i:1;a:2:{s:5:\"label\";s:4:\"Size\";s:5:\"value\";s:1:\"L\";}}s:11:\"simple_name\";s:14:\"T-Shirt-Grey-L\";s:10:\"simple_sku\";s:14:\"t-shirt-Grey-L\";s:20:\"product_calculations\";i:1;s:13:\"shipment_type\";i:0;}"
You need to use the following pattern in REGEXEXTRACT:
"Location\\?(?:""[^""]*){3}""([^""]+)\\"""
See the regex demo.
The pattern is Location\\?(?:"[^"]*){3}"([^"]+)\\" and it matches:
Location - a substring Location
\\? - 1 or 0 \ symbols (the ? makes a pattern optional)
(?:"[^"]*){3} - exactly 3 occurrences (due to the limiting quantifier {3}) of a " followed with zero or more (due to the * quantifier) chars other than " (the [^...] is a negated character class that matches any chars but those defined in the class)
" - a single double quote
([^"]+) - capturing group #1 (whose contents will be returned with REGEXEXTRACT): 1 or more (due to + quantifier) chars other than "
\\" - a \" substring.
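To check the pattern outside Google Sheets, it can be run in Python with the same syntax. This is only a sketch: extract_location and the shortened sample string are made up for the example, and the sample is written as a raw string so the backslashes stay literal, as they are in the order data.

```python
import re

# Same pattern as above; group 1 is the value REGEXEXTRACT would return.
LOCATION_RE = re.compile(r'Location\\?(?:"[^"]*){3}"([^"]+)\\"')


def extract_location(order):
    """Return the value that follows the Location label, or None."""
    m = LOCATION_RE.search(order)
    return m.group(1) if m else None


sample = r's:5:\"label\";s:8:\"Location\";s:5:\"value\";s:15:\"barrio de boedo\";s:11:\"print_value\"'
print(extract_location(sample))
```

The greedy [^"]+ first swallows the trailing backslash and then gives it back so that the final \\" can match.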
I'm using Notepad++ v6.9.2. I need to find ICD9 Codes which will take the following forms:
(X##.), (X##.#) or (X##.##) where X is a letter and always at the beginning and # is a number
(##.), (##.#), (##.##), (###.), (###.#), (###.##) or (###.###) where # is a number
and
replace the opening ( with | and the closing ) together with the single space after it with |.
EXAMPLE
(305.11) TOBACCO ABUSE-CONTINUOUS
Becomes:
|305.11|TOBACCO ABUSE-CONTINUOUS
OTHER CONSIDERATIONS:
There are other places with parentheses but will only contain letters. Those do not need to be changed. Some examples:
UE (Major) Amputation
(282.45) THALASSEMIA (ALPHA)
(284.87) RED CELL APLASIA (W/THYMOMA)
Pain (non-headache) (338.3) Neoplasm related pain (acute) (chronic)
Becomes
UE (Major) Amputation
|282.45|THALASSEMIA (ALPHA)
|284.87|RED CELL APLASIA (W/THYMOMA)
Pain (non-headache) |338.3|Neoplasm related pain (acute) (chronic)
You can use a regex like this to match ICD9 codes:
[EV]\d+\.?\d*
This covers both E and V codes and cases where the . is omitted (in my experience this is not uncommon). Use this regex to match the portions of text you need:
\(([EV]?\d+\.?\d*)\)\s?
The outer parentheses are escaped to match literal ( and ) characters, and the inner parentheses create a group for replacement (\1). The \s? at end will capture an optional space after the parentheses.
So your Notepad++ replace window should look like this:
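The screenshot is not reproduced here. As a rough equivalent, the same find-and-replace can be sketched in Python; the replacement string |\1| is an assumption inferred from the expected output in the question, and pipe_delimit is a made-up name.

```python
import re

# Find pattern from above; group 1 is the code between the parentheses.
ICD9_RE = re.compile(r"\(([EV]?\d+\.?\d*)\)\s?")


def pipe_delimit(line):
    """Replace "(code) " with "|code|" and leave other parentheses alone."""
    return ICD9_RE.sub(r"|\1|", line)


for line in [
    "(305.11) TOBACCO ABUSE-CONTINUOUS",
    "UE (Major) Amputation",
    "Pain (non-headache) (338.3) Neoplasm related pain (acute) (chronic)",
]:
    print(pipe_delimit(line))
```

Parenthesised text that contains only letters does not match the pattern and is left untouched.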
I would like to write this regex (later simplified form) in a more compact, elegant, and systematic way. PCRE or Python (newer engine) preferred. In short, I would like to capture each artery name (iliac, femoral, popliteal and so on), regardless of the string between them. Ideally, the resulting regex won't depend on any particular regex flavor.
LE2: Even more simplified regex, but not working correctly: https://www.regex101.com/r/cK5wB6/7. I've eliminated the DEFINE section - this was added only for modularity purposes, and DEFINE is not compatible with Python anyway (newer, v1 engine added this feature). I want to capture all the artery names, the equivalent of getting a vector of all artery names, regardless of the number of names or the strings between them.
(arteries:.{0,25}?)
((?<art>iliac|femoral|popliteal|peroneal|tibial).*?)*
(?<artfinal>(?&art))
The problem is that some arteries are still not recognized correctly (at least visually). I'm trying to capture those names without explicitly writing capturing groups like in this.
LE4: The last variant actually ignores all names, aside from the first and the last two.
First, a regex-flavour-independent pattern is a myth. Regex engines differ and have different features, and even the same pattern that only uses tokens common to two or more regex engines can return different results.
An example with the Python regex module that has interesting features like the ability to use a set (\L<arteries> in the pattern) and the ability to store repeated capture groups:
import regex
s = '''arteries: jhjh iliac jdfd femoral
arteries: sdsdsd iliac jdfd femoral fd d popliteal
arteries: hgv popliteal,sddsdsds iliac tibial nkjkknperoneal nkjkkn
arteries: iliac, peroneal jm tibia nktibial nkjkkn
arteries: m bkjkjnperoneal vc peroneal fdfd femoral n tibial jnmmmmm tibial jnnjnjmbn n iliacbjk
arteries:m bkjnkjnperoneal mm femoral jnnbn n right femoralbjkkbb jk'''
arteries_set = ['femoral', 'iliac', 'peroneal', 'tibial']
p = regex.compile(r'^arteries: (?: [^\w\n]* (?>\w+[^\w\n]+)*? (\L<arteries>) \M)+', regex.M | regex.I | regex.X, arteries=arteries_set)
for m in p.finditer(s):
    print(m.captures(1))
I deliberately removed the "less than 25 characters" condition to build a more efficient pattern, but feel free to replace [^\w\n]* (?>\w+[^\w\n]+)*? with .{0,25}? \m
(\m and \M are word boundaries, respectively for the start and the end of a word)
I want capture all the artery names,
The problem is that some arteries are still not recognized correctly (at least visually)
The problem with this regex:
((?<art>iliac|femoral|popliteal|peroneal|tibial).*?)*
is that the group art continuously overwrites its capture with the last match. This is an expected behaviour by design.
I want to recognize all the arteries given in the list (see definition
of ) The same artery could appear 1...n to and in any position.
The strings between artery names could be anything of max. 25 chars.
As of flavors, let's stick with PCRE
Provided you're working with PCRE, instead of matching all occurrences of arteries at once, I would suggest matching 1 artery at a time. And to achieve that, we can use \G to match at the end of the last match.
Regex:
/\G # Match anchor (BoS or EoLastMatch)
(?:
(?!^) # With previous match
|
.*? # Or first occurrence
arteries: # of arteries:
)
.{1,25}? # Separated by max 25 chars
(?P<art> # Group 1 (capture 1 artery)
\b # List of arteries
(?:iliac|femoral|popliteal|peroneal|tibial)
\b # in between word boundaries
# Modif: global, caseless, singleline, extra
)/gixs
This will capture each artery in group art (group 1).
DEMO
Notes about other flavours:
As for compatibility with other regex flavours, you could loop over each match in your code to simulate \G where it is not available (for example, JavaScript and Python's standard re module do not implement it). Another option is to split the text with the expression:
(arteries:|\b(?:iliac|femoral|popliteal|peroneal|tibial)\b)
and then check the length of each token to guarantee there isn't more than 25 chars in between.
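The split-and-check idea can be sketched in Python as follows. This is only an illustration of the approach; arteries_by_header and max_gap are names invented here. re.split keeps the delimiters because the expression is wrapped in a capturing group, so the resulting list alternates between gaps and tokens.

```python
import re

SPLIT_RE = re.compile(
    r"(arteries:|\b(?:iliac|femoral|popliteal|peroneal|tibial)\b)", re.I
)


def arteries_by_header(text, max_gap=25):
    """Return one list of artery names per "arteries:" header found in text."""
    tokens = SPLIT_RE.split(text)      # [gap, token, gap, token, ..., gap]
    results, active = [], False
    for i in range(1, len(tokens), 2):
        gap, token = tokens[i - 1], tokens[i]
        if token.lower() == "arteries:":
            results.append([])         # a new header starts a new list
            active = True
        elif active and 1 <= len(gap) <= max_gap:
            results[-1].append(token)
        else:
            active = False             # gap too long: stop until next header
    return results


text = ("arteries: jhjh iliac jdfd femoral\n"
        "arteries: sdsdsd iliac jdfd femoral fd d popliteal")
print(arteries_by_header(text))
```

The 1 to 25 character bound mirrors the .{1,25}? separator in the PCRE pattern.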
the code will have to be migrated to Python one (not so distant) day
Update: Migrating to Python:
You can use \G in Python since the regex module implemented it, but if you do use that module, take advantage of its ability to retrieve repeated captures from a group with the .captures method. Check @CasimiretHippolyte's answer, a perfect example of using captures in this case.
On the other hand, if you stick to the standard re module, I'd recommend looping each match to simulate the same behaviour.
Code:
import re
text = '''arteries: jhjh iliac jdfd femoral
arteries: sdsdsd iliac jdfd femoral fd d popliteal
some arteries: hgv popliteal,sddsdsds iliac tibial nkjkknperoneal nkjkkn
arteries: iliac, peroneal jm tibia nktibial nkjkkn
arteries: m bkjkjnperoneal vc peroneal fdfd femoral n tibial jnmmmmm tibial jnnjnjmbn n iliacbjk
arteries:m bkjnkjnperoneal mm femoral jnnbn n right femoralbjkkbb jk'''
n = 0
pattern_from = re.compile( r'arteries:', re.I)
pattern_token = re.compile( r'.{1,25}?\b(iliac|femoral|popliteal|peroneal|tibial)\b', re.I)
for match_from in pattern_from.finditer(text):
    n = n + 1
    print( '\nMatch #%s:' % n, end="")
    match_token = pattern_token.match( text, match_from.end())
    while match_token:
        print( '[%s:%s]="%s" ' % (match_token.start(1), match_token.end(1), match_token.group(1)), end="")
        match_token = pattern_token.match( text, match_token.end())
Output:
Match #1:[15:20]="iliac" [26:33]="femoral"
Match #2:[52:57]="iliac" [63:70]="femoral" [76:85]="popliteal"
Match #3:[106:115]="popliteal" [125:130]="iliac" [132:138]="tibial"
Match #4:[171:176]="iliac" [179:187]="peroneal"
Match #5:[243:251]="peroneal" [257:264]="femoral" [267:273]="tibial" [282:288]="tibial"
Match #6:[344:351]="femoral"
ideone DEMO