I am working on a regex that will match many different types of of location coordinates. So far it matches about 90% of the formats:
([SNsn][\\s]*)?((?:[\\+-]?[0-9]*[\\.,][0-9]+)|(?:[\\+-]?[0-9]+))(?:(?:[^ms'′""″,\\.\\dNEWnew]?)|(?:[^ms'′""″,\\.\\dNEWnew]+((?:[\\+-]?[0-9]*[\\.,][0-9]+)|(?:[\\+-]?[0-9]+))(?:(?:[^ds°""″,\\.\\dNEWnew]?)|(?:[^ds°""″,\\.\\dNEWnew]+((?:[\\+-]?[0-9]*[\\.,][0-9]+)|(?:[\\+-]?[0-9]+))[^dm°'′,\\.\\dNEWnew]*))))([SNsn]?)[^\\dSNsnEWew]+([EWew][\\s]*)?((?:[\\+-]?[0-9]*[\\.,][0-9]+)|(?:[\\+-]?[0-9]+))(?:(?:[^ms'′""″,\\.\\dNEWnew]?)|(?:[^ms'′""″,\\.\\dNEWnew]+((?:[\\+-]?[0-9]*[\\.,][0-9]+)|(?:[\\+-]?[0-9]+))(?:(?:[^ds°""″,\\.\\dNEWnew]?)|(?:[^ds°""″,\\.\\dNEWnew]+((?:[\\+-]?[0-9]*[\\.,][0-9]+)|(?:[\\+-]?[0-9]+))[^dm°'′,\\.\\dNEWnew]*))))([EWew]?)
Testing the formats:
N 45° 55.732 W 122° 29.882
N 047° 38.938', W 122° 20.887'
40.123, -74.123
40.123° N 74.123° W
40° 7´ 22.8" N 74° 7´ 22.8" W
40° 7.38’ , -74° 7.38’
N40°7’22.8, W74°7’22.8"
40°7’22.8"N, 74°7’22.8"W
40 7 22.8, -74 7 22.8
40.123 -74.123
40.123°,-74.123°
144442800, -266842800
40.123N74.123W
4007.38N7407.38W
40°7’22.8"N, 74°7’22.8"W
400722.8N740722.8W
N 40 7.38 W 74 7.38
40:7:23N,74:7:23W
40:7:22.8N 74:7:22.8W
40°7’23"N 74°7’23"W
40°7’23" -74°7’23"
40d 7’ 23" N 74d 7’ 23" W
40.123N 74.123W
40° 7.38, -74° 7.38
Testing if it works: https://regexr.com/3ivu2
As you can see there are issues with the spaces and commas that are causing the regex to not match some of these formats.
I am trying to match the coordinate strings so that they can be highlighted in my iOS app and allow the user to tap them.
What can I do to update the regex and fix the matching issues?
Overview
I'm sure there are many ways to go about this. Since you haven't specified a regex engine or programming language, I'll post one that works in PCRE and what that should work in most engines. The PCRE regex is much easier to understand than the non-PCRE regex, but both use the exact same logic.
The patterns defined below match each string you've presented in your question and properly separates each part of the coordinate (x, y).
Code
PCRE
This method uses the DEFINE construct to pre-define patterns. The beauty of this construct is that you can define reusable parts of your regex in one location, thus, you can edit most of the regex just by editing these subpatterns.
See regex in use here
(?(DEFINE)
(?<ns>[ns])
(?<ew>[ew])
(?<d>[°´’'"d:])
(?<n>[+-]?\d+(?:\.\d+)?)
)
(
(?&ns)?
(?:\ ?(?&n)(?&d)?){1,3}
\ ?(?&ns)?
)
\ ?,?\ ?
(
(?&ew)?
(?:\ ?(?&n)(?&d)?){1,3}
\ ?(?&ew)?
)
Flags: gix
Non-PCRE
See regex in use here
(
[ns]?
(?:\ ?[+-]?\d+(?:\.\d+)?[°´’'"d:]?){1,3}
\ ?[ns]?
)
\ ?,?\ ?
(
[ew]?
(?:\ ?[+-]?\d+(?:\.\d+)?[°´’'"d:]?){1,3}
\ ?[ew]?
)
Flags: gix.
Some engines don't have the x flag. For those engines you can use the following one-liner (as seen here):
([ns]?(?: ?[+-]?\d+(?:\.\d+)?[°´’'"d:]?){1,3} ?[ns]?) ?,? ?([ew]?(?: ?[+-]?\d+(?:\.\d+)?[°´’'"d:]?){1,3} ?[ew]?)
Explanation
Since both patterns are essentially the same (non-PCRE is just an expanded version of the PCRE), I'll define the PCRE regex pattern since it's easier to grasp.
Note that the patterns that use x have escaped spaces since they would otherwise be ignored (x ignores whitespace within the pattern). The i flag allows us to match text regardless of case (i makes our pattern case-insensitive).
DEFINE
(?(DEFINE)...) The DEFINE group is completely ignored by regex. It gets treated as a var name=value, whereas you can recall the specific pattern for use via its name.
(?<ns>[ns]) The group ns matches any character in the set nsNS
(?<ew>[ew]) The group ew matches any character in the set ewEW
(?<d>[°´’'"d:]) The group d matches any character in the set °´’'"d:
(?<n>[+-]?\d+(?:\.\d+)?) The group n matches any number that matches the following structure
[+-]? Optionally match any character in the set +-
\d+ Match one or more digits
(?:\.\d+)? Optionally match a decimal point followed by one or more digits
Pattern
The pattern is composed of 3 larger parts. The first and last are capture groups (the coordinates themselves) and the second is what separates the two.
Capture 1:
(?&ns)? Optionally match the group ns
(?:\ ?(?&n)(?&d)?){1,3} Matches [an optional space, followed by the group n then optionally group d] between one and three times
\ ?(?&ns)? Optionally match a space, optionally match the group ns
\ ?,?\ ? Match an optional space, comma and space (this separates each coordinate part)
Capture 2: This is the same as Capture 1 but replaces the group ns with the group ew
This simplified regex literally matches all the patterns you've given:
^((?:[NW]? ?(?:[-\d.d]+[NW:°´’'",]?[ NW]?)+[, ]*)+[NW]?)$
I'm not an expert for coordinates, but you can modify it easily if I didn't take into account some specifics.
A full test is here.
Related
Struggling trying to construct a Regexp to identify equipment numbers, I require this to identify equipment numbers in multiple formats including pooled equipment numbers e.g AFD21101 or AFD21101-02-03 or AFD21101-2-3 including various prefixes as per testdata.
Any tips or feedback welcome, possibly it may be easier with multiple RegExp for each scenario but I had hopped to have a master that would identify any of these patterns and be able to extract from a string for further process in a more detailed order. Possibly converting to Long format etc.
Any assistance is greatly appreciated. Hopefully I can return the favour.
What I've tried so far:
^[abcpfsmschafddfcpdcdplldt][glvmdugmrxftiichlewsnuabn][mmrprbdpucdsxtvuwcrslbubk][0-9][0-9xX][0-9xX][0-9xX][0-9xX]|[0-9xX-][0-9]|[0-9]
^[abcpfsmschafddfcpdcdplldt][glvmdugmrxftiichlewsnuabn][mmrprbdpucdsxtvuwcrslbubk][0-9][0-9xX][0-9xX][0-9xX][0-9xX]
^(BLM)|(SUB)|
(CVR)|FDR|SMP|CRU|HXC|ATS|AFD|FTS|DIX|DIT|FIT|FCV|KV|FV|CHU|PLW|BCR|DEC|CTR|CWR|V|DSS|PNL|MTR|LUB|LAU|CCL|DBB|TNK|THK|PIT|[0-9][0-9xX][0-9xX][0-9xX][0-9xX]
Testdata - will have to handle multiple separated by comma or multiline as per testdata examples below
// Example test data 1: (CSV+)
CRN21003 (CB-3), CRN21004 (CB-4)
// Example test data 2: (CSV)
CVR21404, CHU21437, AFD21401
// Example test data 3: (Multi-line)
MGD22401 - 16
DEC22401 - 16
// Example test data 4: (In string)
AFD11122 SOME OTHER RANDOM DATA WDC11121_22 SOME OTHER RANDOM DATA
//Additional matches
AFD21101-03
AFD21101_03
AFD21101-02-03
AFD21101_02_03
AFD21101-2-3
AFD21101_2_3
FDR21407-08
BLM21401
SUB21601
CVR21601
Fdr21601
SMP21501
CRU21501
HXC21501
AFD21501
FTS21X01
DIX21301
DIT22501
FIT21X0X
FCV21501
Pattern:
Base is max 8 digits
1-3 letters (A-Z)
5 Digits (0-9) including X as wildcard
Followed by pooled EQUIPMENT ID's
e.g. AFD21101-2-3, AFD21101-02-03 or AFD21101_02_03
_ or - are delimiters indicating abbreviated subsequent equipment id's or ranges.
AFD21101-02-03 is equivalent to AFD21101, AFD21102, AFD21103 in full form
Possible Prefix's continued
KV
CHU
PLW
BCR
DEC
CTR
CWR
V
DSS
PNL
MTR
LUB
LAU
CCL
DBB
TNK
THK
PIT
AGM2XXXX - valid
Some Invalid matches would be something like
AGM211011 or AGMXXXXX or 21101 or 2110 or AGM21101-094-034 or AGM (prefix only without a trailing 5 digit number/ X wildcard)
If I understand your issue, you need to get the strings which starts with substring provided and contains numbers.
You could try the following regex.
^(?:BLM|SUB|CVR|FDR|SMP|CRU|HXC|ATS|AFD|FTS|DIX|DIT|FIT|FCV|KV|FV|CHU|PLW|BCR|DEC|CTR|CWR|V|DSS|PNL|MTR|LUB|LAU|CCL|DBB|TNK|THK|PIT)[0-9_-]+
Details:
^: start of string
?:: non capturing group
(?:BLM|SUB|CVR|FDR|SMP|CRU|HXC|ATS|AFD|FTS|DIX|DIT|FIT|FCV|KV|FV|CHU|PLW|BCR|DEC|CTR|CWR|V|DSS|PNL|MTR|LUB|LAU|CCL|DBB|TNK|THK|PIT): list of prefixes.
Demo
It isn't 100% clear what you're intending to do because:
The test data you've supplied is comprised wholly of expected matches
The expected output is unclear. Although this largely relays back to point 1!
However, there are many ways of getting the information you require. They all depend on how your source data is organised though...
// Example test data 1:
AFD11122 SOME OTHER RANDOM DATA
WDC11121_22 SOME OTHER RANDOM DATA
// Example test Data 2:
SOME RANDOM DATA AFD11122 AND SOME MORE RANDOM DATA WDC11121_22 WITH SOME MORE
Assuming that the data is at the start of the string AND that you want to capture each string as a whole:
// Option 1
/^(.*?)\s/
^ : Start of string
(.*?) : Non-greedy capture group
\s : First space (first because the capture group was non-greedy)
// Option 2
/^([ABCDEFHIKLMNPRSTUVWX][ABCDEFHILMNRSTUVWX]?[BCDKLMPRSTUVWX]?[x\d]{5}[_\-\d]*)/i
^ : Start of string
( : Start of capture group
[ABCDEFHIKLMNPRSTUVWX] : Capture any letter in character set
[ABCDEFHILMNRSTUVWX]? : OPTIONALLY [?] capture any letter in character set
[BCDKLMPRSTUVWX]? : OPTIONALLY [?] capture any letter in character set
[x\d]{5} : Capture any number or x 5 times
[_\-\d]* : Capture any number, hyphen, or underscore until you reach a character not in the set
) : End of capture group
i : FLAG - case insensitive
// Option 3
/^((?:AFD|BCR|BLM....TNK|V)[\d_\-]*)/i
^ : Start of string
( : Start of capture group
(?: : Start of non-capturing group
AFD|BCR|BLM....TNK|V : List of prefixes separated with "|"
) : End of non-capturing group
[\d_\-]* : Capture any number, hyphen, or underscore until you reach a character not in the set
) : End of capture group
i : FLAG - case insensitive
// Option 4
/^([a-z]{1,3}[x\d]{5}[_\-\d]*)/i :
^ : Start of string
( : Start of capture group
[a-z]{1,3} : Capture any letter [range: a-z] 1 to 3 times {1,3}
[x\d]{5} : Capture any number [\d] or x [x] 5 times {5}
[_\-\d]* : Capture any number, hyphen, or underscore until you reach a character not in the set
) : End of capture group
i : FLAG - case insensitive
Based on your updates to the main question I would stick with option 4 unless you specifically need to make sure that only the set prefixes are matched.
In the event that your data looks more like Example Data 2 then the above expressions will need to be altered accordingly; some examples below:
/([a-z]{1,3}[x\d]{5}[_\-\d]*)/i : Remove the ^
/\b([a-z]{1,3}[x\d]{5}[_\-\d]*)/i : Add a word boundary to the start of the expression
/[^a-z]([a-z]{1,3}[x\d]{5}[_\-\d]*)/i : Start the expression with anything BUT a letter
How you alter it will depend on the data that you're searching through.
Updated RegEx based on latest question edits
/([a-z]{1,3}(?!xxxxx)[x\d]{5}(?!\d)[_\-\d]*)/ig
Try this:
[A-Z]{1,3}[\dX]{5}([_-])0?\d(\10?\d)?
This requires the separator to be the consistent, ie either both - or both _, by capturing the separator and using a back reference to it \1, although the second “pooled ID” is optional.
As far as I can tell, this matches all of your examples.
I have 2 strings and I would like to get a result that gives me everything before the first '\n\n'.
'1. melléklet a 37/2018. (XI. 13.) MNB rendelethez\n\nÁltalános kitöltési előírások\nI.\nA felügyeleti jelentésre vonatkozó általános szabályok\n\n1.
'12. melléklet a 40/2018. (XI. 14.) MNB rendelethez\n\nÁltalános kitöltési előírások\n\nKapcsolódó jogszabályok\naz Önkéntes Kölcsönös Biztosító Pénztárakról szóló 1993. évi XCVI. törvény (a továbbiakban: Öpt.);\na személyi jövedelemadóról szóló 1995. évi CXVII.
I have been trying to combine 2 regular expressions to solve my problem; however, I could be on a bad track either. Maybe a function could be easier, I do not know.
I am attaching one that says that I am finding the character 'z'
extended regex : [\z+$]
I guess finding the first number is: [^0-9.].+
My problem is how to combine these two expressions to get the string inbetween them?
Is there a more efficient way to do?
You may use
re.findall(r'^(\d.*?)(?:\n\n|$)', s, re.S)
Or with re.search, since it seems that only one match is expected:
m = re.search(r'^(\d.*?)(?:\n\n|$)', s, re.S)
if m:
print(m.group(1))
See the Python demo.
Pattern details
^ - start of a string
(\d.*?) - Capturing group 1: a digit and then any 0+ chars, as few as possible
(?:\n\n|$) - a non-capturing group matching either two newlines or end of string.
See the regex graph:
In the course of processing a large textual chemical database with Perl, I had been faced with the problem of using a regex to match chemical formulae. I have seen these two previous topics, but the suggested answers there are too loose for my requirements.
Specifically, my (admittedly limited) research has led me to this posting that gives a regex for the currently accepted chemical symbols, which I'll copy here for reference
[BCFHIKNOPSUVWY]|[ISZ][nr]|[ACELP][ru]|A[cglmst]|B[aehikr]|C[adeflos]|D[bsy]|Es|F[elmr]|G[ade]|H[efgos]|Kr|L[aiv]|M[cdgnot]|N[abdehiop]|O[gs]|P[abdmot]|R[abe-hnu]|S[bcegim]|T[abcehilms]|Xe|Yb
(Thus e.g. C, Cm, and Cn will pass, but not Cg or Cx.)
As with the previous questions, I also need to match numbers, complete sets of parentheses and complete sets of square brackets, so that both e.g. C2H6O and (CH3)2CFCOO(CH2)2Si(CH3)2Cl are matched.
So how do I combine the previous solutions with the grand regex for matching valid chemical elements to strictly match a chemical formula?
(If it's not too much trouble to add, a blow-by-blow account of how to humanly parse the regex would be appreciated greatly, though not strictly necessary.)
Brief
I decided why not create a massive regex to do what you want (but still maintain a clean regex). This regex would be used in conjunction with a loop to go over matches for bracket or parentheses groups.
Assumptions
I am assuming the following since the OP has not given a full list of positive and negative matches:
Nested parentheses aren't possible
Nested square brackets aren't possible
Square bracket groups that surround a single parentheses group are redundant and therefore incorrect
Square bracket groups must contain at least 2 groups, of which 1 such group must be a parentheses group
If any of these assumptions are incorrect, please let me know so that I may fix the regex accordingly
Answer
View this regex in use here
Code
(?(DEFINE)
(?# Periodic elements )
(?<Hydrogen>H)
(?<Helium>He)
(?<Lithium>Li)
(?<Beryllium>Be)
(?<Boron>B)
(?<Carbon>C)
(?<Nitrogen>N)
(?<Oxygen>O)
(?<Fluorine>F)
(?<Neon>Ne)
(?<Sodium>Na)
(?<Magnesium>Mg)
(?<Aluminum>Al)
(?<Silicon>Si)
(?<Phosphorus>P)
(?<Sulfur>S)
(?<Chlorine>Cl)
(?<Argon>Ar)
(?<Potassium>K)
(?<Calcium>Ca)
(?<Scandium>Sc)
(?<Titanium>Ti)
(?<Vanadium>V)
(?<Chromium>Cr)
(?<Manganese>Mn)
(?<Iron>Fe)
(?<Cobalt>Co)
(?<Nickel>Ni)
(?<Copper>Cu)
(?<Zinc>Zn)
(?<Gallium>Ga)
(?<Germanium>Ge)
(?<Arsenic>As)
(?<Selenium>Se)
(?<Bromine>Br)
(?<Krypton>Kr)
(?<Rubidium>Rb)
(?<Strontium>Sr)
(?<Yttrium>Y)
(?<Zirconium>Zr)
(?<Niobium>Nb)
(?<Molybdenum>Mo)
(?<Technetium>Tc)
(?<Ruthenium>Ru)
(?<Rhodium>Rh)
(?<Palladium>Pd)
(?<Silver>Ag)
(?<Cadmium>Cd)
(?<Indium>In)
(?<Tin>Sn)
(?<Antimony>Sb)
(?<Tellurium>Te)
(?<Iodine>I)
(?<Xenon>Xe)
(?<Cesium>Cs)
(?<Barium>Ba)
(?<Lanthanum>La)
(?<Cerium>Ce)
(?<Praseodymium>Pr)
(?<Neodymium>Nd)
(?<Promethium>Pm)
(?<Samarium>Sm)
(?<Europium>Eu)
(?<Gadolinium>Gd)
(?<Terbium>Tb)
(?<Dysprosium>Dy)
(?<Holmium>Ho)
(?<Erbium>Er)
(?<Thulium>Tm)
(?<Ytterbium>Yb)
(?<Lutetium>Lu)
(?<Hafnium>Hf)
(?<Tantalum>Ta)
(?<Tungsten>W)
(?<Rhenium>Re)
(?<Osmium>Os)
(?<Iridium>Ir)
(?<Platinum>Pt)
(?<Gold>Au)
(?<Mercury>Hg)
(?<Thallium>Tl)
(?<Lead>Pb)
(?<Bismuth>Bi)
(?<Polonium>Po)
(?<Astatine>At)
(?<Radon>Rn)
(?<Francium>Fr)
(?<Radium>Ra)
(?<Actinium>Ac)
(?<Thorium>Th)
(?<Protactinium>Pa)
(?<Uranium>U)
(?<Neptunium>Np)
(?<Plutonium>Pu)
(?<Americium>Am)
(?<Curium>Cm)
(?<Berkelium>Bk)
(?<Californium>Cf)
(?<Einsteinium>Es)
(?<Fermium>Fm)
(?<Mendelevium>Md)
(?<Nobelium>No)
(?<Lawrencium>Lr)
(?<Rutherfordium>Rf)
(?<Dubnium>Db)
(?<Seaborgium>Sg)
(?<Bohrium>Bh)
(?<Hassium>Hs)
(?<Meitnerium>Mt)
(?<Darmstadtium>Ds)
(?<Roentgenium>Rg)
(?<Copernicium>Cn)
(?<Nihonium>Nh)
(?<Flerovium>Fl)
(?<Moscovium>Mc)
(?<Livermorium>Lv)
(?<Tennessine>Ts)
(?<Oganesson>Og)
(?# Regex )
(?<Element>(?&Actinium)|(?&Silver)|(?&Aluminum)|(?&Americium)|(?&Argon)|(?&Arsenic)|(?&Astatine)|(?&Gold)|(?&Barium)|(?&Beryllium)|(?&Bohrium)|(?&Bismuth)|(?&Berkelium)|(?&Bromine)|(?&Boron)|(?&Calcium)|(?&Cadmium)|(?&Cerium)|(?&Californium)|(?&Chlorine)|(?&Curium)|(?&Copernicium)|(?&Cobalt)|(?&Chromium)|(?&Cesium)|(?&Copper)|(?&Carbon)|(?&Dubnium)|(?&Darmstadtium)|(?&Dysprosium)|(?&Erbium)|(?&Einsteinium)|(?&Europium)|(?&Iron)|(?&Flerovium)|(?&Fermium)|(?&Francium)|(?&Fluorine)|(?&Gallium)|(?&Gadolinium)|(?&Germanium)|(?&Helium)|(?&Hafnium)|(?&Mercury)|(?&Holmium)|(?&Hassium)|(?&Hydrogen)|(?&Indium)|(?&Iridium)|(?&Iodine)|(?&Krypton)|(?&Potassium)|(?&Lanthanum)|(?&Lithium)|(?&Lawrencium)|(?&Lutetium)|(?&Livermorium)|(?&Moscovium)|(?&Mendelevium)|(?&Magnesium)|(?&Manganese)|(?&Molybdenum)|(?&Meitnerium)|(?&Sodium)|(?&Niobium)|(?&Neodymium)|(?&Neon)|(?&Nihonium)|(?&Nickel)|(?&Nobelium)|(?&Neptunium)|(?&Nitrogen)|(?&Oganesson)|(?&Osmium)|(?&Oxygen)|(?&Protactinium)|(?&Lead)|(?&Palladium)|(?&Promethium)|(?&Polonium)|(?&Praseodymium)|(?&Platinum)|(?&Plutonium)|(?&Phosphorus)|(?&Radium)|(?&Rubidium)|(?&Rhenium)|(?&Rutherfordium)|(?&Roentgenium)|(?&Rhodium)|(?&Radon)|(?&Ruthenium)|(?&Antimony)|(?&Scandium)|(?&Selenium)|(?&Seaborgium)|(?&Silicon)|(?&Samarium)|(?&Tin)|(?&Strontium)|(?&Sulfur)|(?&Tantalum)|(?&Terbium)|(?&Technetium)|(?&Tellurium)|(?&Thorium)|(?&Titanium)|(?&Thallium)|(?&Thulium)|(?&Tennessine)|(?&Uranium)|(?&Vanadium)|(?&Tungsten)|(?&Xenon)|(?&Ytterbium)|(?&Yttrium)|(?&Zirconium)|(?&Zinc))
(?<Num>(?:[1-9]\d*)?)
(?<ElementGroup>(?:(?&Element)(?&Num))+)
(?<ElementParenthesesGroup>\((?&ElementGroup)+\)(?&Num))
(?<ElementSquareBracketGroup>\[(?:(?:(?&ElementParenthesesGroup)(?:(?&ElementGroup)|(?&ElementParenthesesGroup))+)|(?:(?:(?&ElementGroup)|(?&ElementParenthesesGroup))+(?&ElementParenthesesGroup)))\](?&Num))
)
^((?<Brackets>(?&ElementSquareBracketGroup))|(?<Parentheses>(?&ElementParenthesesGroup))|(?<Group>(?&ElementGroup)))+$
Explanation
The first part of the (?(DEFINE)) section lists each periodic element (ordered by atomic number for easy lookup).
The Element group acts as a simple or | between each of the elements listed in 1. ensuring that each element's symbol is ordered alphabetically by the first character, and then by symbol character length (so as not to catch, for example, Carbon C instead of Calcium Ca)
ElementGroup specifies a group of chemicals in the format: one or more Element followed by zero or more digits, excluding zero (specified by the group Num)
Valid Examples
C - Element
CH - Element followed by another Element
CH3 -Element followed by another Element and a Num
O2 - Element followed by a Num
Invalid Examples
N0 - 0 cannot be used explicitly
N01 - Num group specifies the number must begin with 1-9 or not have a number
A - Element does not exist
c - Element does not exist - case sensitive regex
ElementParenthesesGroup specifies one or more groupings of ElementGroup between parentheses ( ) but containing at least one ElementGroup
Valid Examples
(CH) - ElementGroup surrounded by parentheses
(CH3) - ElementGroup surrounded by parentheses
(CH3NO4) - multiple ElementGroup surrounded by parentheses
(CH3N04)2 - multiple ElementGroup surrounded by parentheses followed by a Num
Invalid Examples
(CH[NO4]) - Only ElementGroup is valid inside ElementParenthesesGroup
ElementSquareBracketGroup specifies a grouping of ElementParenthesesGroup or ElementGroup between square brackets [ ] but containing at least one ElementParenthesesGroup and one other group (ElementParenthesesGroup or ElementGroup)
Valid Examples
[CH3(NO4)] - Contains at least one ElementParenthesesGroup and one other ElementParenthesesGroup or ElementGroup
[(NO4)CH]2 - Contains at least one ElementParenthesesGroup and one other ElementParenthesesGroup or ElementGroup followed by Num
[(NO4)(CH3)] - Contains at least one ElementParenthesesGroup and one other ElementParenthesesGroup or ElementGroup
Invalid Examples
[(NO4)] - Does not contain second group, brackets [ ] are redundant
[NO4] - Does not contain ElementParenthesesGroup
Additional Information
I realize this is a very long answer, but the OP is asking a very specific question and wants to ensure specific criteria are met.
Ensure the following flags are set:
g - ensures global matches
x - ensures whitespace is ignored
if the data is across multiple lines (separated by a newline character) use m for multi line
Note: Regex will only capture the last group of type X that it finds (and overwrite the previously captured group of said type X. This is the default behaviour of regex and there is no way to currently override this behaviour. This may give you undesirable results. You can see this with the last example in the linked regex as well as with your example of (CH3)2CFCOO(CH2)2Si(CH3)2Cl since there are multiple of each group type.
It is best not to assemble such a large regex manually. Instead, let's assume we have an array of atoms #atoms. We can then create a regex matching any of these atoms like:
my ($atoms_regex) = map qr/$_/, join '|', map quotemeta, sort #atoms;
(Sort all items so that shorter atom names come first, then escape all items with quotemeta, join them with a | for alternatives, and compile the regex.)
You can add any used abbreviations to the #atoms array.
Next, we can write a regex that allows grouping and numbering. Our regex will match any number of items, where an item may be an atom or a group, and may be followed by a number:
my $chemical_formula_regex = qr/
(?&item)++
(?(DEFINE)
(?<item> (?: \((?&item)++\) | \[(?&item)++\] | $atoms_regex ) [0-9]* )
)
/x;
Within the (?(DEFINE) ...) group we can define named subpatterns with (?<name> ...). A subpattern is like a subroutine for a regex. We can call those subpatterns with (?&name). This allows us to structure the regex without unnecessary repetition.
The /x flag allows us to use whitespace and linebreaks and comments to lay out the regex in a more readable fashion. Regexes don't have to be an incomprehensible mess!
The ++ quantifier instead of + is not strictly necessary, but prevents unwanted backtracking. That may be a bit faster when a match fails.
Since this post is a top result for "regex chemistry symbols", I would also like to submit a solution. It is a python script that uses a regex to match chemistry formulas of the type A#B#, where A and B are chemical symbols and # is a number. The script as implemented matches and then surrounds the matches with \ce{} for use in LaTeX. It also includes the ability to exclude matches if the capture is in a user-defined list, which means words such as "I" and "In" will not be matched. Gist Link.
#!/usr/bin/env python3
# Find chemical symbols and surround them with \ce{ Symbol }
# Problem words: I, HOW, In, degrees K. Add words to exlist to ignore them.
import re, sys
if len(sys.argv) < 2 :
print('Usage:> {} <filename>'.format(sys.argv[0]))
sys.exit(1)
ptable =" H He "
ptable+=" Li Be B C N O F Ne "
ptable+=" Na Mg Al Si P S Cl Ar "
ptable+=" K Ca Sc Ti V Cr Mn Fe Co Ni Cu Zn Ga Ge As Se Br Kr "
ptable+=" Rb Sr Y Zr Nb Mo Tc Ru Rh Pd Ag Cd In Sn Sb Te I Xe "
ptable+=" Cs Ba La Hf Ta W Re Os Ir Pt Au Hg Tl Pb Bi Po At Rn "
ptable+=" Fr Ra Ac Rf Db Sg Bh Hs Mt Ds Rg Cn Nh Fl Mc Lv Ts Og "
ptable+=" Ce Pr Nd Pm Sm Eu Gd Tb Dy Ho Er Tm Yb Lu "
ptable+=" Th Pa U Np Pu Am Cm Bk Cf Es Fm Md No Lr "
exlist = ['C','I','In','K','HOW'] # exclude these words from being replaced
orsyms = '|'.join(ptable.split())
resyms = re.compile(r'\b'+'((?:(?:{})\d*)+)'.format(orsyms)+r'\b')
latexfile=sys.argv[1]
with open(latexfile,'r') as fd:
for line in fd:
for m in list(set(resyms.findall(line))):
if m not in exlist :
line = re.sub(r'\b'+m+r'\b', r'\ce{'+m+r'}', line)
print(line,end='')
I'm using Notepad++ v6.9.2. I need to find ICD9 Codes which will take the following forms:
(X##.), (X##.#) or (X##.##) where X is a letter and always at the beginning and # is a number
(##.), (##.#), (##.##), (###.), (###.#), (###.##) or (###.###) where # is a number
and
replace the first ( with | and the ) and single space behind second with |.
EXAMPLE
(305.11) TOBACCO ABUSE-CONTINUOUS
Becomes:
|305.11|TOBACCO ABUSE-CONTINUOUS
OTHER CONSIDERATIONS:
There are other places with parentheses but will only contain letters. Those do not need to be changed. Some examples:
UE (Major) Amputation
(282.45) THALASSEMIA (ALPHA)
(284.87) RED CELL APLASIA (W/THYMOMA)
Pain (non-headache) (338.3) Neoplasm related pain (acute) (chronic)
Becomes
UE (Major) Amputation
|282.45|THALASSEMIA (ALPHA)
|284.87|RED CELL APLASIA (W/THYMOMA)
Pain (non-headache) |338.3|Neoplasm related pain (acute) (chronic)
You can use a regex like this to match ICD9 codes:
[EV]\d+\.?\d*
This covers both E and V codes and cases where the . is omitted (in my experience this is not uncommon). Use this regex to match the portions of text you need:
\(([EV]?\d+\.?\d*)\)\s?
The outer parentheses are escaped to match literal ( and ) characters, and the inner parentheses create a group for replacement (\1). The \s? at end will capture an optional space after the parentheses.
So your Notepad++ replace window should look like this:
I don't write many regular expressions so I'm going to need some help on the one.
I need a regular expression that can validate that a string is an alphanumeric comma delimited string.
Examples:
123, 4A67, GGG, 767 would be valid.
12333, 78787&*, GH778 would be invalid
fghkjhfdg8797< would be invalid
This is what I have so far, but isn't quite right: ^(?=.*[a-zA-Z0-9][,]).*$
Any suggestions?
Sounds like you need an expression like this:
^[0-9a-zA-Z]+(,[0-9a-zA-Z]+)*$
Posix allows for the more self-descriptive version:
^[[:alnum:]]+(,[[:alnum:]]+)*$
^[[:alnum:]]+([[:space:]]*,[[:space:]]*[[:alnum:]]+)*$ // allow whitespace
If you're willing to admit underscores, too, search for entire words (\w+):
^\w+(,\w+)*$
^\w+(\s*,\s*\w+)*$ // allow whitespaces around the comma
Try this pattern: ^([a-zA-Z0-9]+,?\s*)+$
I tested it with your cases, as well as just a single number "123". I don't know if you will always have a comma or not.
The [a-zA-Z0-9]+ means match 1 or more of these symbols
The ,? means match 0 or 1 commas (basically, the comma is optional)
The \s* handles 1 or more spaces after the comma
and finally the outer + says match 1 or more of the pattern.
This will also match
123 123 abc (no commas) which might be a problem
This will also match 123, (ends with a comma) which might be a problem.
Try the following expression:
/^([a-z0-9\s]+,)*([a-z0-9\s]+){1}$/i
This will work for:
test
test, test
test123,Test 123,test
I would strongly suggest trimming the whitespaces at the beginning and end of each item in the comma-separated list.
You seem to be lacking repetition. How about:
^(?:[a-zA-Z0-9 ]+,)*[a-zA-Z0-9 ]+$
I'm not sure how you'd express that in VB.Net, but in Python:
>>> import re
>>> x [ "123, $a67, GGG, 767", "12333, 78787&*, GH778" ]
>>> r = '^(?:[a-zA-Z0-9 ]+,)*[a-zA-Z0-9 ]+$'
>>> for s in x:
... print re.match( r, s )
...
<_sre.SRE_Match object at 0xb75c8218>
None
>>>>
You can use shortcuts instead of listing the [a-zA-Z0-9 ] part, but this is probably easier to understand.
Analyzing the highlights:
[a-zA-Z0-9 ]+ : capture one or more (but not zero) of the listed ranges, and space.
(?:[...]+,)* : In non-capturing parenthesis, match one or more of the characters, plus a comma at the end. Match such sequences zero or more times. Capturing zero times allows for no comma.
[...]+ : capture at least one of these. This does not include a comma. This is to ensure that it does not accept a trailing comma. If a trailing comma is acceptable, then the expression is easier: ^[a-zA-Z0-9 ,]+
Yes, when you want to catch comma separated things where a comma at the end is not legal, and the things match to $LONGSTUFF, you have to repeat $LONGSTUFF:
$LONGSTUFF(,$LONGSTUFF)*
If $LONGSTUFF is really long and contains comma repeated items itself etc., it might be a good idea to not build the regexp by hand and instead rely on a computer for doing that for you, even if it's just through string concatenation. For example, I just wanted to build a regular expression to validate the CPUID parameter of a XEN configuration file, of the ['1:a=b,c=d','2:e=f,g=h'] type. I... believe this mostly fits the bill: (whitespace notwithstanding!)
xend_fudge_item_re = r"""
e[a-d]x= #register of the call return value to fudge
(
0x[0-9A-F]+ | #either hardcode the reply
[10xks]{32} #or edit the bitfield directly
)
"""
xend_string_item_re = r"""
(0x)?[0-9A-F]+: #leafnum (the contents of EAX before the call)
%s #one fudge
(,%s)* #repeated multiple times
""" % (xend_fudge_item_re, xend_fudge_item_re)
xend_syntax = re.compile(r"""
\[ #a list of
'%s' #string elements
(,'%s')* #repeated multiple times
\]
$ #and nothing else
""" % (xend_string_item_re, xend_string_item_re), re.VERBOSE | re.MULTILINE)
Try ^(?!,)((, *)?([a-zA-Z0-9])\b)*$
Step by step description:
Don't match a beginning comma (good for the upcoming "loop").
Match optional comma and spaces.
Match characters you like.
The match of a word boundary make sure that a comma is necessary if more arguments are stacked in string.
Please use - ^((([a-zA-Z0-9\s]){1,45},)+([a-zA-Z0-9\s]){1,45})$
Here, I have set max word size to 45, as longest word in english is 45 characters, can be changed as per requirement