Modifying an existing regex to find a pattern until "Index=0" - regex

I have an existing regex that I need to modify.
The current regex is: {(\bVar Name=)\w+\b[^{}]*} which works on patterns as "{Var Name=New Variable Selection List}" and finds the pattern: "New Variable Selection List".
It also works on more complex strings where there are multiple patterns one inside another.
Now the input string changes to: "{Var Name=New Variable Selection List Index=current}", where the "Index=current" part can consist of letters and numbers with an '=' sign inside it.
In this string, I still need to find only the pattern "New Variable Selection List", and the regex must still find multiple occurrences, as it does now.
Also, I need to create another regex to find only the "Index=current" part.

Provided Index=current is always the last string in the sequence and doesn't contain any spaces (otherwise it would be impossible to distinguish the beginning of the index from the end of the previous entry), you can read the string with this expression:
/(?<entry>(?<key>[\w ]+)=(?<value>[\w ]+)) (?=[\w=]+)(?<index>[\w=]+)/gm
Breaking it down into parts:
(?<entry>(?<key>[\w ]+)=(?<value>[\w ]+)) - reads the first pair of values. The entry capture group contains both key and value: Var Name=New Variable Selection List
(?<key>[\w ]+) - looks for the first part of the entry: Var Name
(?<value>[\w ]+) - looks for the second part of the entry: New Variable Selection List
 (?=[\w=]+) - a space followed by a lookahead, which keeps the entry capture group from extending into the index part
(?<index>[\w=]+) - looks for the index part: Index=current
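As a rough illustration, the same expression can be tried in Python. This is only a sketch: Python's `re` module spells named groups `(?P<name>...)`, and the helper name `parse_var` is made up for this example.

```python
import re

# Python spells named groups (?P<name>...); the pattern is otherwise unchanged.
PATTERN = re.compile(
    r"(?P<entry>(?P<key>[\w ]+)=(?P<value>[\w ]+)) (?=[\w=]+)(?P<index>[\w=]+)"
)

def parse_var(text):
    """Return the named groups of the first entry found, or None."""
    match = PATTERN.search(text)
    return match.groupdict() if match else None

parts = parse_var("{Var Name=New Variable Selection List Index=current}")
print(parts["value"])  # New Variable Selection List
print(parts["index"])  # Index=current
```

The value group first runs greedily up to "Index" and then backtracks until a space and the index part can follow it.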
Edit:
If the Index=current part is optional, you can extend the regex like this:
/(?(DEFINE)(?'inseq'([\w]+=[\w]+)))(?<entry>(?<key>[\w ]+)=(?<value>[\w ]+))(?=(?: (?P>inseq)|}))\s?(?<index>(?P>inseq))?/gm
Compared with the previous version, it has the following additions:
(?(DEFINE)(?'inseq'([\w]+=[\w]+))) - a predefined pattern that matches the index part (so it can be reused in both the lookahead and the corresponding capture group)
\s?(?<index>(?P>inseq))? - the final capture group (and space) are now optional
(?=(?: (?P>inseq)|})) - the lookahead now checks for closing curly bracket OR the index pattern
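The (?(DEFINE)...) construct is a PCRE feature that Python's `re` module does not support. A possible workaround, sketched below under that assumption, is to keep the index sub-pattern in a variable and splice it into both places. The names `INSEQ` and `parse_var` are illustrative only.

```python
import re

# Stand-in for the (?(DEFINE)...) subroutine: the index sub-pattern lives in
# one variable and is spliced in wherever the original calls (?P>inseq).
INSEQ = r"\w+=\w+"

PATTERN = re.compile(
    r"(?P<entry>(?P<key>[\w ]+)=(?P<value>[\w ]+))"
    r"(?=(?: " + INSEQ + r"|\}))"      # next comes " index" or the closing brace
    r"\s?(?P<index>" + INSEQ + r")?"   # the space and the index are optional
)

def parse_var(text):
    """Return (value, index) for the first entry; index is None when absent."""
    match = PATTERN.search(text)
    return (match.group("value"), match.group("index")) if match else None

print(parse_var("{Var Name=New Variable Selection List}"))
print(parse_var("{Var Name=New Variable Selection List Index=current}"))
```

The first call should report no index, the second should report Index=current, and the value is the same in both cases.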

Related

RegExp set contains one or multiple words

Is there a way in regular expressions to match a subset of words against a set of words separated by a separator, without creating a new pattern for every new word added to the set?
Right now I cannot think of anything else than creating a (?:{item1, item2, ...}) pattern for every extra item in the set (see example below).
Example matching a single word of the set:
Set: foo,bar,baz
Match: foo
RegExp: /^(foo|bar|baz)$/ <- MATCH
Example that will match a subset of words:
Set: foo,bar,baz
Match: foo,bar
RegExp: /^(foo|bar|baz)(?:,(foo|bar|baz)(?:,(foo|bar|baz))?)?$/ <- MATCH
The pattern grows rapidly when adding new items to the set. Is there some (magical) way to do this in a shorter version?
One general approach which looks slightly better than your current attempt would be to use lookaheads:
^(?=.*\bfoo\b)(?=.*\bbar\b).*$
You may add one lookahead assertion for each CSV term which needs to be matched in the input CSV list.
Edit: If you want OR behavior here, then we can use an alternation of lookaheads. To match either foo or bar as a CSV term we can try:
^(?:(?=.*\bfoo\b)|(?=.*\bbar\b)).*$
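One way to avoid writing the lookaheads by hand is to generate them from the word list. The Python sketch below does this for both the AND and the OR variant; the function names are hypothetical and the words are escaped in case they contain regex metacharacters.

```python
import re

def all_terms_pattern(words):
    """Regex matching a line that contains every word in `words`."""
    lookaheads = "".join(rf"(?=.*\b{re.escape(w)}\b)" for w in words)
    return re.compile(rf"^{lookaheads}.*$")

def any_term_pattern(words):
    """Regex matching a line that contains at least one word in `words`."""
    lookaheads = "|".join(rf"(?=.*\b{re.escape(w)}\b)" for w in words)
    return re.compile(rf"^(?:{lookaheads}).*$")

both = all_terms_pattern(["foo", "bar"])
either = any_term_pattern(["foo", "bar"])

print(bool(both.match("baz,bar,foo")))  # True: order does not matter
print(bool(both.match("foo,baz")))      # False: bar is missing
print(bool(either.match("baz,bar")))    # True: bar is present
```

Adding a word to the list adds one lookahead, so the pattern grows linearly rather than with every possible ordering.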

Dart regex for capturing groups but ignoring certain similar patterns

I'm trying to capture a group from a string with ~, ~~ and ~~~ symbols. I was successful with extracting single symbols but it doesn't ignore the other occurrences in the string.
This is my code I tried experimenting with:
void main() {
  String f = '~the calculator is on and working~I entered 50 into the calculator' +
      '~~I press add button~~holding equal button ~~~The result should be 50';
  // Each split uses "one or more" of the marker, so the three lists overlap.
  List<String> givens = f.split(RegExp(r'~+'));
  List<String> whens = f.split(RegExp(r'~~+'));
  List<String> thens = f.split(RegExp(r'~~~+'));
  for (String ss in givens) {
    print(ss);
  }
  print('xxxxxxxxxxxx');
  for (String ss in whens) {
    print(ss);
  }
  print('xxxxxxxxxxxx');
  for (String ss in thens) {
    print(ss);
  }
}
The output is not what I intended:
The givens list also captured the strings marked with ~~ and ~~~.
The whens list also captured the strings marked with a single ~, which made it very confusing.
Lastly, the thens list also captured the others.
I need each list to capture only the strings that start with its specific marker and to stop at the next marker.
Example: givens should capture only 'the calculator is on and working' and 'I entered 50 into the calculator'.
Any hints or help is greatly appreciated!
I think the problem is that you started off by splitting the string into pieces. It might be easier to search for the elements with a pattern that looks for text preceded by one, two or three ~ characters.
This can be done with regex positive lookbehind patterns.
Typically, if you want to find a string preceded by exactly one tilde, you have to prevent a match when there are other tildes before it.
Find givens
(?<=(?:[^~]|^)~)[^~]+ would be the pattern to find only givens.
Test it here: https://regex101.com/r/9WLbM3/2
Explanation
[^~] means search for any character which is not a ~. This is because [abc] means any character in the list, so a, b or c. If you add the ^ character at the beginning of the list, it means "not these characters".
[^~]+ means search for one or more characters which are not ~. This will capture the phrases between the tildes.
A positive lookbehind is written (?<=something present). We want to search for a tilde, so we would put (?<=~) as the positive lookbehind. The problem is that it also matches phrases with several tildes in front. To avoid that, we can say that the tilde should be preceded either by ^ (the beginning of the string) or by [^~] (any character that is not a tilde). To say "either this or that", we use the syntax (this|that|or even that). Parentheses capture their content, though, and we don't need that here. To disable capturing we add ?: at the beginning of the group, which finally gives (?:[^~]|^), meaning either a non-tilde character or the beginning of the string, without capturing it.
Find whens and thens
The regular expression is almost the same. It's just that we replace ~ by ~{2} or ~{3}.
Pattern for whens: (?<=(?:[^~]|^)~{2})[^~]+
Pattern for thens: (?<=(?:[^~]|^)~{3})[^~]+
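For comparison, here is a sketch of the same idea in Python. Python's `re` module only accepts fixed-width lookbehinds, so the (?:[^~]|^) alternation is swapped for a negative lookbehind plus a capture group; that substitution is this example's assumption, not the pattern above. The helper name `extract` is hypothetical.

```python
import re

def extract(text, tildes):
    """Return the phrases introduced by exactly `tildes` consecutive ~ characters."""
    # (?<!~) rejects a run with an extra ~ in front of it, and [^~]+ rejects a
    # run with an extra ~ behind it, so only runs of the exact length match.
    pattern = re.compile(r"(?<!~)~{%d}([^~]+)" % tildes)
    return pattern.findall(text)

f = ('~the calculator is on and working~I entered 50 into the calculator'
     '~~I press add button~~holding equal button ~~~The result should be 50')

givens = extract(f, 1)
whens = extract(f, 2)
thens = extract(f, 3)
print(givens)
print(whens)
print(thens)
```

Each list should contain only the phrases for its own marker. Note that 'holding equal button ' keeps its trailing space, because the space is part of the input.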

Concentric matches with one expression

What is the regex syntax for combining 2 expressions like a Venn diagram?
I have HTML with 2 table cells. Each of the 2 cells contains several table rows:
https://regex101.com/r/cTXwrT/3
This expression captures the 2nd table cell only:
(?<=your mother)(?s).*(?=Monochrome)
This expression matches table rows from all table cells:
[A-Za-z].*Yoghurt
How do I combine both expressions into one, so that I get the table rows from only the 2nd table cell?
I'm writing in AutoHotkey which uses PCRE for the regex engine.
I apologise for poor terminology— I've read up on recursion, back referencing, capture groups, atomic groups, etc but they didn't seem to apply.
I think you can do what you want with a nested capturing group. Here I capture everything between the td tags in an inner capturing group:
(?<=your mother)(?s).*((?<=\<td bgcolor="#F0F0F0"\>).*(?=\<\/td\>)).*(?=Monochrome)
You might need to tweak it a bit, it's a pretty scrappy regex, but it works for your current use case.
Reading the documentation for AutoHotkey#RegExMatch:
FoundPos := RegExMatch(Haystack, NeedleRegEx [, UnquotedOutputVar = "", StartingPosition = 1])
If any capturing subpatterns are present inside NeedleRegEx, their matches are stored in a pseudo-array whose base name is OutputVar. For example, if the variable's name is Match, the substring that matches the first subpattern would be stored in Match1, the second would be stored in Match2, and so on. The exception to this is named subpatterns: they are stored by name instead of number. For example, the substring that matches the named subpattern "(?P<Year>\d{4})" would be stored in MatchYear. If a particular subpattern does not match anything (or if the function returns zero), the corresponding variable is made blank.
So you'd have to call it with UnQuotedOutputVar, say Match, and then look in Match2 for what was captured by the second capturing group.

regex to find value at a particular location

Presently the regex is:
[A-Z]+(?=-\d+$)
This pulls out the correct value for most of the strings which follow the below format:
ANG-RGN-SOR-BCP-0004 i.e. BCP
However it pulls out SS for the following document instead of PMR:
ANG-B31-OPS-PMR-MACE-SS-0229
So basically I want to pull out the fourth term (between the hyphens), so it should pick BCP and PMR.
The following regex will get the 4th item in group 1:
(?:[A-Z0-9]+-){3}([A-Z0-9]+)
The first bit in (?:...) is a "non-capturing group" which acts like a group but won't appear in the backreference list.
The next bit means "3 of these non-capturing groups".
And finally, a capturing group to collect what you want.
I have assumed here that all the groups contain only uppercase letters and digits. You should modify the parts in [square brackets] to represent what these groups could actually contain.
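A quick way to check the expression against both sample identifiers is a small Python sketch like the one below. The function name `fourth_term` is made up for illustration, and `match` is used so that counting always starts at the beginning of the string.

```python
import re

# Three "term-" prefixes are skipped; the fourth term is captured in group 1.
FOURTH_TERM = re.compile(r"(?:[A-Z0-9]+-){3}([A-Z0-9]+)")

def fourth_term(doc_id):
    """Return the fourth hyphen-separated term, or None if there are fewer."""
    match = FOURTH_TERM.match(doc_id)
    return match.group(1) if match else None

print(fourth_term("ANG-RGN-SOR-BCP-0004"))          # BCP
print(fourth_term("ANG-B31-OPS-PMR-MACE-SS-0229"))  # PMR
```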
A more easily understandable method in Python:
a = "ANG-B31-OPS-PMR-MACE-SS-0229"
part = a.split('-')[3]
print(part)
This gives "PMR".
This should suit your needs (demo):
(?:.+?-){3}([^-]+)
You'll be able to access the fourth term in the first capturing group.

Bounding Multiple Matches With Single Text

I'm trying to parse out the properties of a type (eg. the words 'Cusip', 'Issuer', and 'Coupon') shown here:
Public Type GetPricesResponse
Cusip As String
Issuer As String
Coupon As String
End Type
The regex ([a-zA-Z0-9]+).+As works great for this code snippet (see http://regexr.com?300fl), but may not work when mixed with a larger body of code. So, I've tried to "bound" this regex with the words Public Type on the front, and End Type at the end to specifically identify what I need as follows:
Public\sType\s([a-zA-Z0-9]+).+As.+End\sType
...but of course it then doesn't match anything.
I have the MultiLine option set as well.
You've presented two different problems.
The first is, roughly, "can I write a regex to match this thing", the answer is yes. For simplicity I've used \w instead of [a-zA-Z0-9]:
Public\s+Type\s+(\w+)\s+((\w+)\s+As\s+(\w+)\s*('.*\s*)?)+End\s+Type
The next is "how can I parse out the properties" and the answer to that is, as written in the comments: don't use a single regex. First, use a regex which captures only the definitions:
Public\s+Type\s+\w+\s+(.*?)End\s+Type
This uses the reluctant quantifier *? so that the regex won't gobble up End Type, and it needs the DOTALL flag so that . can match across several lines. From this match, you take group 1 and repeatedly find the following:
^\s+(\w+)\s+.*$
Group 1 from this match will be your property name.
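The two-step approach could look roughly like this in Python. It is a sketch with two deliberate deviations from the patterns above: the per-line regex uses \s* instead of \s+ so that an unindented first line is still matched, and it also captures the type after As. The names `TYPE_BODY`, `PROPERTY` and `parse_types` are hypothetical.

```python
import re

SOURCE = """\
Public Type GetPricesResponse
    Cusip As String
    Issuer As String
    Coupon As String
End Type
"""

# Step 1: isolate each Type block. DOTALL lets .*? run across line breaks.
TYPE_BODY = re.compile(r"Public\s+Type\s+(\w+)\s+(.*?)End\s+Type", re.DOTALL)
# Step 2: within one block, read a "Name As Type" pair from each line.
PROPERTY = re.compile(r"^\s*(\w+)\s+As\s+(\w+)", re.MULTILINE)

def parse_types(code):
    """Map each type name to its list of (property, type) pairs."""
    return {
        block.group(1): PROPERTY.findall(block.group(2))
        for block in TYPE_BODY.finditer(code)
    }

for type_name, properties in parse_types(SOURCE).items():
    print(type_name, [name for name, _ in properties])
```

Because the second regex only ever sees the text between Public Type and End Type, As clauses elsewhere in a larger body of code are ignored.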
Use the following regexp to match the whole thing:
Public\s+Type\s+(?<tname>[\w]+)\s+((?<pname>[\w]+)\s+As\s+(?<ptype>[\w]+)\s+)+End\s+Type
Note that it uses named groups for easier access to the matched content. After the whole content is matched, the group named tname holds the type name, the group named pname holds the property name, and the group named ptype holds the corresponding property's type.
Here's its live demo:
http://regexr.com?300l0