Get text between a group of delimiters - regex

I have a string of text with four delimiters, ST:, SI:, T: and I:, each followed by a sequence of digits and letters. I need to capture the delimiter as a group called group and the digits and letters as a group called code.
ST:12YEOR48000FCT:24YEOR48000FCSI:12YEOR13000FCI:12YEOR13000FCT:12YEOR51200FCI:12YEOR14500FCST:12YEOR48000FCT:24YEOR48000FCSI:12YEOR13000FCI:12YEOR13000FCT:12ACTYEI:12ACTYET:32000ACTFCI:13300ACTFC
The results should be
GROUP CODE
ST: 12YEOR48000FC
T: 24YEOR48000FC
SI: 12YEOR13000FC
I: 12YEOR13000FC
T: 12YEOR51200FC
I: 12YEOR14500FC
ST: 12YEOR48000FC
T: 24YEOR48000FC
SI: 12YEOR13000FC
I: 12YEOR13000FC
T: 12ACTYE
I: 12ACTYE
T: 32000ACTFC
I: 13300ACTFC
(?'group'ST:|SI:|T:|I:)(?'code'.*?)(?<=ST:|SI:|T:|I:|$)
My thought is that I want to grab the starting delimiter as the group, then any characters as the code, until another delimiter or the end of the string is found. The regex I came up with gets the delimiters but not the code.
Thanks for any help.
RegEx101

You're using a positive lookbehind for your code group, which won't accomplish the functionality you're looking for.
However, you're on the right track! Removing the < to create a positive lookahead will achieve what you're looking for:
(?'group'ST:|SI:|T:|I:)(?'code'.*?)(?=ST:|SI:|T:|I:|$)
Regex101
You should also consider optimizing the pattern a bit for maintainability by using a nested non-capturing group to factor the colon out of each of your group items. This will make it easier to add group codes later and limit the potential for typos (e.g., forgetting the colon in a new group code):
(?'group'(?:ST|SI|T|I):)(?'code'.*?)(?=(?:ST|SI|T|I):|$)
Regex101
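For readers working in Python, the same lookahead idea can be tried with a small sketch like the one below. It assumes Python's re module, where named groups are spelled (?P<name>...) instead of (?'name'...); the names DELIMITED and pairs are illustrative only, and the sample text is just the first part of the string from the question.

```python
import re

# Python's re module spells named groups (?P<name>...), not (?'name'...).
DELIMITED = re.compile(r"(?P<group>(?:ST|SI|T|I):)(?P<code>.*?)(?=(?:ST|SI|T|I):|$)")

# First four segments of the input from the question.
text = "ST:12YEOR48000FCT:24YEOR48000FCSI:12YEOR13000FCI:12YEOR13000FC"

# Each match yields one delimiter plus the code that follows it.
pairs = [(m.group("group"), m.group("code")) for m in DELIMITED.finditer(text)]
for group, code in pairs:
    print(group, code)
```

The lazy .*? stops at the first position where the lookahead sees another delimiter, which is why ST: is recognised before the shorter T: that it contains.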

Related

Shorten Regular Expression (\n) [duplicate]

I'd like to match three-character sequences of letters (only the letters 'a', 'b', 'c' are allowed) separated by commas (the last group is not followed by a comma).
Examples:
abc,bca,cbb
ccc,abc,aab,baa
bcb
I have written the following regular expression:
re.match('([abc][abc][abc],)+', "abc,defx,df")
However, it doesn't work correctly, because for the above example:
>>> print(bool(re.match('([abc][abc][abc],)+', "abc,defx,df"))) # defx in second group
True
>>> print(bool(re.match('([abc][abc][abc],)+', "axc,defx,df"))) # 'x' in first group
False
It seems to check only the first group of three letters and ignore the rest. How do I write this regular expression correctly?
Try the following regex:
^[abc]{3}(,[abc]{3})*$
^...$ anchors the match from the start to the end of the string
[...] matches one of the listed characters
...{3} repeats the preceding element three times
(...)* repeats the group in the parentheses zero or more times
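A minimal sketch of how this anchored pattern behaves in Python is shown below. The helper name is_valid and the sample list are illustrative assumptions, not part of the answer.

```python
import re

# Anchored pattern from the answer: one triple, then zero or more ",triple" repetitions.
TRIPLES = re.compile(r"^[abc]{3}(,[abc]{3})*$")

def is_valid(data):
    """Return True when the whole string is made of comma-separated a/b/c triples."""
    return TRIPLES.match(data) is not None

samples = ["abc,bca,cbb", "bcb", "abc,defx,df", "abc,"]
results = {s: is_valid(s) for s in samples}
for sample, ok in results.items():
    print(sample, ok)
```

The last sample shows that a trailing comma is rejected, because every comma must be followed by another triple.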
What you're asking it to find with your regex is "at least one triple of letters a, b, c" - that's what "+" gives you. Whatever follows after that doesn't really matter to the regex. You might want to include "$", which means "end of the line", to be sure that the line must all consist of allowed triples. However in the current form your regex would also demand that the last triple ends in a comma, so you should explicitly code that it's not so.
Try this:
re.match('([abc][abc][abc],)*([abc][abc][abc])$', "abc,defx,df")
This finds any number of allowed triples followed by a comma (maybe zero), then a triple without a comma, then the end of the line.
Edit: including the "^" (start of string) symbol is not necessary, because the match method already checks for a match only at the beginning of the string.
The obligatory "you don't need a regex" solution:
all(letter in 'abc,' for letter in data) and all(len(item) == 3 for item in data.split(','))
You need to iterate over the sequence of found values.
import re
data_string = "abc,bca,df"
imatch = re.finditer(r'(?P<value>[abc]{3})(,|$)', data_string)
for match in imatch:
    print(match.group('value'))
So the regex to check whether the whole string matches the pattern will be
data_string = "abc,bca,df"
# The trailing $ is required; without it the leading "abc,bca," alone would satisfy the match.
match = re.match(r'^([abc]{3}(,|$))+$', data_string)
if match:
    print("data string is correct")
Your result is not surprising since the regular expression
([abc][abc][abc],)+
tries to match three characters of [abc] followed by a comma, one or more times, starting at the beginning of the string (re.match only matches there), and does not care what follows. So the most important part is to make sure that there is nothing more in the string, as scessor suggests, by adding ^ (start of string) and $ (end of string) to the regular expression.
An alternative without using regex (albeit a brute force way):
>>> import itertools
>>> def matcher(x):
        total = ["".join(p) for p in itertools.product(('a','b','c'),repeat=3)]
        for i in x.split(','):
            if i not in total:
                return False
        return True
>>> matcher("abc,bca,aaa")
True
>>> matcher("abc,bca,xyz")
False
>>> matcher("abc,aaa,bb")
False
If your aim is to validate a string as being composed of triplets of the letters a, b and c:
for ss in ("abc,bbc,abb,baa,bbb",
           "acc",
           "abc,bbc,abb,bXa,bbb",
           "abc,bbc,ab,baa,bbb"):
    print(ss, bool(re.match(r'([abc]{3},?)+\Z', ss)))
result
abc,bbc,abb,baa,bbb True
acc True
abc,bbc,abb,bXa,bbb False
abc,bbc,ab,baa,bbb False
\Z means: the end of the string. Its presence obliges the match to be until the very end of the string
By the way, I like the form of Sonya too, in a way it is clearer:
bool(re.match('([abc]{3},)*[abc]{3}\Z',ss))
To just repeat a sequence of patterns, you need to use a non-capturing group, a (?:...)-like construct, and apply a quantifier right after the closing parenthesis. The question mark and the colon after the opening parenthesis are the syntax that creates a non-capturing group (SO post).
For example:
(?:abc)+ matches strings like abc, abcabc, abcabcabc, etc.
(?:\d+\.){3} matches strings like 1.12.2., 000.00000.0., etc.
Here, you can use
^[abc]{3}(?:,[abc]{3})*$
          ^^
Note that using a capturing group is fraught with unwelcome effects in a lot of Python regex methods. See the classic issue described in the "re.findall behaves weird" post, for example, where re.findall, and all other regex methods using this function behind the scenes, only return the captured substrings if there is a capturing group in the pattern.
In Pandas, it is also important to use non-capturing groups when you just need to group a pattern sequence: Series.str.contains will complain that "this pattern has match groups. To actually get the groups, use str.extract", and Series.str.extract, Series.str.extractall and Series.str.findall will behave like re.findall.
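The re.findall effect described above can be reproduced with a short sketch. The variable names are illustrative, and the input is one of the example strings from the question.

```python
import re

data = "abc,bca,cbb"

# With a capturing group, findall returns only what the group captured
# (its last repetition), not the whole match.
with_capture = re.findall(r"[abc]{3}(,[abc]{3})*", data)

# With a non-capturing group, findall returns the whole match.
without_capture = re.findall(r"[abc]{3}(?:,[abc]{3})*", data)

print(with_capture)     # [',cbb']
print(without_capture)  # ['abc,bca,cbb']
```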

Using a regex to identify EQUIPMENTID numbers - VBA

I am struggling to construct a RegExp to identify equipment numbers. It needs to identify equipment numbers in multiple formats, including pooled equipment numbers, e.g. AFD21101 or AFD21101-02-03 or AFD21101-2-3, with the various prefixes shown in the test data.
Any tips or feedback welcome. It may be easier with a separate RegExp for each scenario, but I had hoped to have a master pattern that would identify any of these formats and extract them from a string for further, more detailed processing, possibly converting to long format etc.
Any assistance is greatly appreciated. Hopefully I can return the favour.
What I've tried so far:
^[abcpfsmschafddfcpdcdplldt][glvmdugmrxftiichlewsnuabn][mmrprbdpucdsxtvuwcrslbubk][0-9][0-9xX][0-9xX][0-9xX][0-9xX]|[0-9xX-][0-9]|[0-9]
^[abcpfsmschafddfcpdcdplldt][glvmdugmrxftiichlewsnuabn][mmrprbdpucdsxtvuwcrslbubk][0-9][0-9xX][0-9xX][0-9xX][0-9xX]
^(BLM)|(SUB)|
(CVR)|FDR|SMP|CRU|HXC|ATS|AFD|FTS|DIX|DIT|FIT|FCV|KV|FV|CHU|PLW|BCR|DEC|CTR|CWR|V|DSS|PNL|MTR|LUB|LAU|CCL|DBB|TNK|THK|PIT|[0-9][0-9xX][0-9xX][0-9xX][0-9xX]
Testdata - will have to handle multiple separated by comma or multiline as per testdata examples below
// Example test data 1: (CSV+)
CRN21003 (CB-3), CRN21004 (CB-4)
// Example test data 2: (CSV)
CVR21404, CHU21437, AFD21401
// Example test data 3: (Multi-line)
MGD22401 - 16
DEC22401 - 16
// Example test data 4: (In string)
AFD11122 SOME OTHER RANDOM DATA WDC11121_22 SOME OTHER RANDOM DATA
//Additional matches
AFD21101-03
AFD21101_03
AFD21101-02-03
AFD21101_02_03
AFD21101-2-3
AFD21101_2_3
FDR21407-08
BLM21401
SUB21601
CVR21601
Fdr21601
SMP21501
CRU21501
HXC21501
AFD21501
FTS21X01
DIX21301
DIT22501
FIT21X0X
FCV21501
Pattern:
Base is max 8 characters
1-3 letters (A-Z)
5 digits (0-9), with X allowed as a wildcard
Followed by pooled equipment IDs
e.g. AFD21101-2-3, AFD21101-02-03 or AFD21101_02_03
_ or - are delimiters indicating abbreviated subsequent equipment IDs or ranges.
AFD21101-02-03 is equivalent to AFD21101, AFD21102, AFD21103 in full form
Possible prefixes, continued
KV
CHU
PLW
BCR
DEC
CTR
CWR
V
DSS
PNL
MTR
LUB
LAU
CCL
DBB
TNK
THK
PIT
AGM2XXXX - valid
Some invalid values, which should not match, would be something like
AGM211011 or AGMXXXXX or 21101 or 2110 or AGM21101-094-034 or AGM (prefix only without a trailing 5 digit number/ X wildcard)
If I understand your issue, you need to get the strings that start with one of the provided prefixes and contain numbers.
You could try the following regex.
^(?:BLM|SUB|CVR|FDR|SMP|CRU|HXC|ATS|AFD|FTS|DIX|DIT|FIT|FCV|KV|FV|CHU|PLW|BCR|DEC|CTR|CWR|V|DSS|PNL|MTR|LUB|LAU|CCL|DBB|TNK|THK|PIT)[0-9_-]+
Details:
^: start of string
(?: ... ): non-capturing group
(?:BLM|SUB|CVR|FDR|SMP|CRU|HXC|ATS|AFD|FTS|DIX|DIT|FIT|FCV|KV|FV|CHU|PLW|BCR|DEC|CTR|CWR|V|DSS|PNL|MTR|LUB|LAU|CCL|DBB|TNK|THK|PIT): list of prefixes.
[0-9_-]+: one or more digits, underscores or hyphens
Demo
It isn't 100% clear what you're intending to do because:
The test data you've supplied consists wholly of expected matches
The expected output is unclear, although this largely relates back to point 1!
However, there are many ways of getting the information you require. They all depend on how your source data is organised though...
// Example test data 1:
AFD11122 SOME OTHER RANDOM DATA
WDC11121_22 SOME OTHER RANDOM DATA
// Example test Data 2:
SOME RANDOM DATA AFD11122 AND SOME MORE RANDOM DATA WDC11121_22 WITH SOME MORE
Assuming that the data is at the start of the string AND that you want to capture each string as a whole:
// Option 1
/^(.*?)\s/
^ : Start of string
(.*?) : Non-greedy capture group
\s : First space (first because the capture group was non-greedy)
// Option 2
/^([ABCDEFHIKLMNPRSTUVWX][ABCDEFHILMNRSTUVWX]?[BCDKLMPRSTUVWX]?[x\d]{5}[_\-\d]*)/i
^ : Start of string
( : Start of capture group
[ABCDEFHIKLMNPRSTUVWX] : Capture any letter in character set
[ABCDEFHILMNRSTUVWX]? : OPTIONALLY [?] capture any letter in character set
[BCDKLMPRSTUVWX]? : OPTIONALLY [?] capture any letter in character set
[x\d]{5} : Capture any number or x 5 times
[_\-\d]* : Capture any number, hyphen, or underscore until you reach a character not in the set
) : End of capture group
i : FLAG - case insensitive
// Option 3
/^((?:AFD|BCR|BLM....TNK|V)[\d_\-]*)/i
^ : Start of string
( : Start of capture group
(?: : Start of non-capturing group
AFD|BCR|BLM....TNK|V : List of prefixes separated with "|"
) : End of non-capturing group
[\d_\-]* : Capture any number, hyphen, or underscore until you reach a character not in the set
) : End of capture group
i : FLAG - case insensitive
// Option 4
/^([a-z]{1,3}[x\d]{5}[_\-\d]*)/i
^ : Start of string
( : Start of capture group
[a-z]{1,3} : Capture any letter [range: a-z] 1 to 3 times {1,3}
[x\d]{5} : Capture any number [\d] or x [x] 5 times {5}
[_\-\d]* : Capture any number, hyphen, or underscore until you reach a character not in the set
) : End of capture group
i : FLAG - case insensitive
Based on your updates to the main question I would stick with option 4 unless you specifically need to make sure that only the set prefixes are matched.
In the event that your data looks more like Example Data 2 then the above expressions will need to be altered accordingly; some examples below:
/([a-z]{1,3}[x\d]{5}[_\-\d]*)/i : Remove the ^
/\b([a-z]{1,3}[x\d]{5}[_\-\d]*)/i : Add a word boundary to the start of the expression
/[^a-z]([a-z]{1,3}[x\d]{5}[_\-\d]*)/i : Start the expression with anything BUT a letter
How you alter it will depend on the data that you're searching through.
Updated RegEx based on latest question edits
/([a-z]{1,3}(?!xxxxx)[x\d]{5}(?!\d)[_\-\d]*)/ig
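As a rough check of the updated expression, here is a sketch in Python rather than VBA. It assumes the i flag maps to re.IGNORECASE and the g flag to re.findall; the name EQUIPMENT_ID is illustrative, and the inputs are taken from the question's test data.

```python
import re

# Updated expression from the answer; /i becomes re.IGNORECASE, /g becomes findall.
EQUIPMENT_ID = re.compile(r"([a-z]{1,3}(?!xxxxx)[x\d]{5}(?!\d)[_\-\d]*)", re.IGNORECASE)

# "In string" example from the question.
text = "AFD11122 SOME OTHER RANDOM DATA WDC11121_22 SOME OTHER RANDOM DATA"
found = EQUIPMENT_ID.findall(text)
print(found)  # ['AFD11122', 'WDC11121_22']

# A few of the valid and invalid values listed in the question.
print(EQUIPMENT_ID.findall("AGM2XXXX"))   # ['AGM2XXXX']  (X wildcards allowed)
print(EQUIPMENT_ID.findall("AGM211011"))  # []  (six digits)
print(EQUIPMENT_ID.findall("AGMXXXXX"))   # []  (all wildcards)
```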
Try this:
[A-Z]{1,3}[\dX]{5}([_-])0?\d(\1[0]?\d)?
This requires the separator to be consistent, i.e. either both - or both _, by capturing the separator and using a back reference to it (\1); the second "pooled ID" is optional. The optional zero after the back reference is written as [0]? so that \1 followed by 0 cannot be misread as \10.
As far as I can tell, this matches all of your examples.
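The question states that AFD21101-02-03 is equivalent to AFD21101, AFD21102, AFD21103. A possible way to expand a pooled ID into its full forms is sketched below in Python. It rests on assumptions that go beyond the thread: each suffix is one or two digits and overwrites the same number of trailing digits of the base, at most two suffixes are present, and both use the same separator. The names POOLED and expand are hypothetical.

```python
import re

# Groups: prefix, 5-character base, separator, first suffix, optional second suffix.
# \3 forces the second separator to equal the first one.
POOLED = re.compile(
    r"^([A-Z]{1,3})([\dX]{5})(?:([_-])(\d{1,2})(?:\3(\d{1,2}))?)?$",
    re.IGNORECASE,
)

def expand(pooled_id):
    """Expand a pooled ID into full IDs; return [] when the input does not fit."""
    m = POOLED.match(pooled_id)
    if m is None:
        return []
    prefix, base, _sep, first, second = m.groups()
    ids = [prefix + base]
    for suffix in (first, second):
        if suffix is not None:
            # Assumption: the suffix replaces the last len(suffix) characters of the base.
            ids.append(prefix + base[:-len(suffix)] + suffix)
    return ids

print(expand("AFD21101-02-03"))  # ['AFD21101', 'AFD21102', 'AFD21103']
print(expand("AFD21101_2_3"))    # ['AFD21101', 'AFD21102', 'AFD21103']
print(expand("AFD21101-02_03"))  # []  (mixed separators)
```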

Combining 2 regular expressions

I have 2 strings and I would like to get a result that gives me everything before the first '\n\n'.
'1. melléklet a 37/2018. (XI. 13.) MNB rendelethez\n\nÁltalános kitöltési előírások\nI.\nA felügyeleti jelentésre vonatkozó általános szabályok\n\n1.
'12. melléklet a 40/2018. (XI. 14.) MNB rendelethez\n\nÁltalános kitöltési előírások\n\nKapcsolódó jogszabályok\naz Önkéntes Kölcsönös Biztosító Pénztárakról szóló 1993. évi XCVI. törvény (a továbbiakban: Öpt.);\na személyi jövedelemadóról szóló 1995. évi CXVII.
I have been trying to combine two regular expressions to solve my problem; however, I may be on the wrong track. Maybe a function would be easier, I do not know.
I am attaching one that I think finds the character 'z':
extended regex : [\z+$]
I guess finding the first number is: [^0-9.].+
My problem is how to combine these two expressions to get the string in between them.
Is there a more efficient way to do this?
You may use
re.findall(r'^(\d.*?)(?:\n\n|$)', s, re.S)
Or with re.search, since it seems that only one match is expected:
m = re.search(r'^(\d.*?)(?:\n\n|$)', s, re.S)
if m:
print(m.group(1))
See the Python demo.
Pattern details
^ - start of a string
(\d.*?) - Capturing group 1: a digit and then any 0+ chars, as few as possible
(?:\n\n|$) - a non-capturing group matching either two newlines or end of string.
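Applied to the first string from the question, the pattern can be tried as follows. This is a minimal sketch: the string is used as far as it is shown in the question, and the plain str.split alternative is an addition for comparison, not part of the answer.

```python
import re

# First sample string from the question, as far as it is shown there.
s = ('1. melléklet a 37/2018. (XI. 13.) MNB rendelethez\n\n'
     'Általános kitöltési előírások\nI.\n'
     'A felügyeleti jelentésre vonatkozó általános szabályok\n\n1.')

# re.S lets . match newlines, so the lazy .*? runs up to the first blank line.
m = re.search(r'^(\d.*?)(?:\n\n|$)', s, re.S)
heading = m.group(1) if m else None
print(heading)

# Without a regex: everything before the first '\n\n'.
heading_split = s.split('\n\n', 1)[0]
print(heading_split)
```

The split version does not check that the text starts with a digit, so it is only equivalent when that is already known.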

Extract everything between pipes in key value pair

I have the following sourceString
|User=gmailUser1|login with password=false|addition information=|source IP location=DE|
I want to extract everything between the pipes as key-value pairs. In this case:
User=gmailUser1
Login with password=false
addition information=
Source IP location=DE
My regex pattern is giving me the entire string.
\|(\b+)=(\b+)\|
Try with the expression:
/\|([^=|]+)=([^|]*)/g
or if you just want the pattern:
\|([^=|]+)=([^|]*)
Depending on your environment you will be able to get captures of group 1 and 2 for each key-value pair.
(I'm not able to test it out right now.)
Update 1: I did a short test and adapted it with the optimization of Wiktor Stribizew.
Update 2: Short explanation of the regex used:
The \b in your pattern means a word boundary; it is a zero-width assertion and does not represent a character, so you cannot combine it with +. See also What is a word boundary.
The first group ([^=|]+) matches one or more characters that are neither = nor |.
The second group ([^|]*) matches zero or more characters that are not | (addition information has an empty value).
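In Python, for instance, the two groups could be collected with re.findall. This is a small sketch; the variable names are illustrative and the input is the sourceString from the question.

```python
import re

source = "|User=gmailUser1|login with password=false|addition information=|source IP location=DE|"

# Group 1 is the key, group 2 the (possibly empty) value.
pairs = re.findall(r"\|([^=|]+)=([^|]*)", source)
as_dict = dict(pairs)

for key, value in pairs:
    print(f"{key}={value}")
```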
Try this:
\w+(=|\s|\w+)
This matches:
\w+ = a run of word characters (letters, digits or underscore)
(=|\s|\w+) = a group matching an = sign, a whitespace character or another run of word characters

1 to 5 of the same groups in REGEX

For a string such as:
abzyxcabkmqfcmkcde
Notice the string patterns between ab and c, namely zyx and kmqf. To capture the first string pattern:
ab([a-z]{3,5})c
Is it possible to match both of the groups from the sample string? Actually, there should be 1 to 5 groups.
Note: python style regex.
You can verify that a given string conforms to the 1-5 repetitions of ab([a-z]{3,5})c using this regex
(?:ab([a-z]{3,5})c){1,5}
or this one if there are characters expected between the groups
(?:ab([a-z]{3,5})c.*?){1,5}
You will only be able to extract the last matching group from that string, however, not any of the previous ones. To get the previous ones you need to use hsz's approach.
Just match all results - i.e. with g flag:
/ab([a-z]{3,5})c/g
or some method like in Python:
re.findall(pattern, string, flags=0)
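Both behaviours can be seen side by side in a short Python sketch using the sample string from the question. The variable names are illustrative.

```python
import re

text = "abzyxcabkmqfcmkcde"

# findall collects the captured group from every match.
all_groups = re.findall(r"ab([a-z]{3,5})c", text)
print(all_groups)  # ['zyx', 'kmqf']

# A repeated capturing group keeps only its last repetition.
repeated = re.match(r"(?:ab([a-z]{3,5})c){1,5}", text)
last_only = repeated.group(1)
print(last_only)  # kmqf
```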