Concentric matches with one expression - regex

What is the regex syntax for combining 2 expressions like a Venn diagram?
I have HTML with 2 table cells. Each of the 2 cells contains several table rows:
https://regex101.com/r/cTXwrT/3
This expression captures the 2nd table cell only:
(?<=your mother)(?s).*(?=Monochrome)
This expression matches table rows from all table cells:
[A-Za-z].*Yoghurt
How do I combine both expressions into one, so that I get the table rows from only the 2nd table cell?
I'm writing in AutoHotkey which uses PCRE for the regex engine.
I apologise for the poor terminology. I've read up on recursion, backreferences, capture groups, atomic groups, etc., but they didn't seem to apply.

I think you can do what you want with a nested capturing group. Here I capture everything between the td tags in an inner capturing group:
(?<=your mother)(?s).*((?<=\<td bgcolor="#F0F0F0"\>).*(?=\<\/td\>)).*(?=Monochrome)
You might need to tweak it a bit, it's a pretty scrappy regex, but it works for your current use case.
Reading the documentation for AutoHotkey#RegExMatch:
FoundPos := RegExMatch(Haystack, NeedleRegEx [, UnquotedOutputVar = "", StartingPosition = 1])
If any capturing subpatterns are present inside NeedleRegEx, their matches are stored in a pseudo-array whose base name is OutputVar. For example, if the variable's name is Match, the substring that matches the first subpattern would be stored in Match1, the second would be stored in Match2, and so on. The exception to this is named subpatterns: they are stored by name instead of number. For example, the substring that matches the named subpattern "(?P<Year>\d{4})" would be stored in MatchYear. If a particular subpattern does not match anything (or if the function returns zero), the corresponding variable is made blank.
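So you'd have to call it with UnquotedOutputVar set to, say, Match, and then look in Match1 for what was captured by the capturing group. The pattern above contains only one capturing group, because the lookarounds and (?s) do not capture.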
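The group numbering can be checked outside AutoHotkey as well. Below is a minimal sketch in Python, assuming a made-up one-cell HTML fragment as a stand-in for the page in the question (the real markup is only behind the regex101 link). Python's re module is used instead of RegExMatch, so group(1) plays the role of the first numbered subpattern.

```python
import re

# Hypothetical stand-in for the question's HTML; not the real page.
html = (
    'header your mother '
    '<td bgcolor="#F0F0F0">Strawberry Yoghurt</td> '
    'Monochrome footer'
)

# Same shape as the answer's pattern. Python wants (?s) at the very start,
# and the escapes before < > / are not needed.
pattern = re.compile(
    r'(?s)(?<=your mother).*'
    r'((?<=<td bgcolor="#F0F0F0">).*(?=</td>))'
    r'.*(?=Monochrome)'
)

m = pattern.search(html)
whole = m.group(0)   # everything between "your mother" and "Monochrome"
inner = m.group(1)   # only what the single capturing group saw

print(repr(inner))
```

Lookbehinds and lookaheads are written with parentheses but are not counted as capturing groups, which is why the inner text ends up in group 1.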

Related

Modifying an existing regex to find a pattern until "Index=0"

I have an existing regex that I need to modify.
The current regex is: {(\bVar Name=)\w+\b[^{}]*} which works on patterns as "{Var Name=New Variable Selection List}" and finds the pattern: "New Variable Selection List".
It also works on more complex strings where there are multiple patterns one inside another.
Now, I need to modify the string to: "{Var Name=New Variable Selection List Index=current}" where the "Index=current" part can be letters and numbers with '=' sign inside it.
In this string, I still need to find only the pattern "New Variable Selection List", but the regex still needs to find multiple occurrences as it does now.
Also, I need to create another regex to find only the "Index=current" part.
Provided Index=current is always the last string in the sequence and it doesn't contain any spaces (otherwise it would be impossible to distinguish beginning of the index from the end of the previous entry), you can read the string with this expression:
/(?<entry>(?<key>[\w ]+)=(?<value>[\w ]+)) (?=[\w=]+)(?<index>[\w=]+)/gm
Breaking it down into parts:
(?<entry>(?<key>[\w ]+)=(?<value>[\w ]+)) - reads the first pair of values. The entry capture group contains both key and value: Var Name=New Variable Selection List
(?<key>[\w ]+) - looks for the first part of the entry: Var Name
(?<value>[\w ]+) - looks for the second part of the entry: New Variable Selection List
 (?=[\w=]+) - a literal space followed by a lookahead; together they prevent the <entry> capture group from extending into the index part
(?<index>[\w=]+) - looks for the index part: Index=current
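To see the named groups side by side, here is a small sketch in Python. It assumes the sample string from the question; the only change to the pattern is that Python spells named groups (?P<name>...) instead of (?<name>...).

```python
import re

# Same pattern as above, with Python's (?P<name>...) spelling for named groups.
PATTERN = re.compile(
    r"(?P<entry>(?P<key>[\w ]+)=(?P<value>[\w ]+)) (?=[\w=]+)(?P<index>[\w=]+)"
)

sample = "{Var Name=New Variable Selection List Index=current}"

m = PATTERN.search(sample)
parts = m.groupdict()  # maps each group name to the text it captured

for name, text in parts.items():
    print(f"{name}: {text}")
```

The value group first swallows "New Variable Selection List Index" and then backtracks to the last space, because the index part has to start right after a space and may contain "=".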
Edit:
If Index=current part is optional, you can extend regex like this:
/(?(DEFINE)(?'inseq'([\w]+=[\w]+)))(?<entry>(?<key>[\w ]+)=(?<value>[\w ]+))(?=(?: (?P>inseq)|}))\s?(?<index>(?P>inseq))?/gm
Compared with the previous version, it has the following changes:
(?(DEFINE)(?'inseq'([\w]+=[\w]+))) - a predefined pattern that matches the index part (so it can be reused in both the lookahead and the corresponding capture group)
\s?(?<index>(?P>inseq))? - the final capture group (and space) are now optional
(?=(?: (?P>inseq)|})) - the lookahead now checks for closing curly bracket OR the index pattern

Regex fragment for "one or more instances of this pattern" [duplicate]

I need to get an array of floats (both positive and negative) from the multiline string. E.g.: -45.124, 1124.325 etc
Here's what I do:
text.scan(/(\+|\-)?\d+(\.\d+)?/)
Although it works fine on regex101 (capturing group 0 matches everything I need), it doesn't work in Ruby code.
Any ideas why it's happening and how I can improve that?
See scan documentation:
If the pattern contains no groups, each individual result consists of the matched string, $&. If the pattern contains groups, each individual result is itself an array containing one entry per group.
You should remove capturing groups (if they are redundant), or make them non-capturing (if you just need to group a sequence of patterns to be able to quantify them), or use extra code/group in case a capturing group cannot be avoided.
In this scenario, the capturing group is used to quantify a pattern sequence, thus all you need to do is convert the capturing group into a non-capturing one by replacing all unescaped ( with (?: (there is only one occurrence here):
text = " -45.124, 1124.325"
puts text.scan(/[+-]?\d+(?:\.\d+)?/)
See demo, output:
-45.124
1124.325
Well, if you need to also match floats like .04 you can use [+-]?\d*\.?\d+. See another demo
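The same pitfall exists outside Ruby. As a comparison sketch, Python's re.findall follows the same rule as scan: with capturing groups it returns the groups, without them it returns the whole matches. The variable names below are only illustrative.

```python
import re

text = " -45.124, 1124.325"

# With capturing groups, findall returns one tuple of groups per match.
with_groups = re.findall(r"(\+|\-)?\d+(\.\d+)?", text)

# With a character class and a non-capturing group, it returns whole matches.
without_groups = re.findall(r"[+-]?\d+(?:\.\d+)?", text)

floats = [float(s) for s in without_groups]

print(with_groups)
print(without_groups)
print(floats)
```

The first call yields only the sign and the fractional part of each number, which is the same kind of result the question saw from scan.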
There are cases when you cannot get rid of a capturing group, e.g. when the regex contains a backreference to it. In that case, you may a) declare a variable to store the matches and collect them inside a scan block, b) enclose the whole pattern in another capturing group and map the results to get the first item from each match, or c) call gsub with the regex as its only argument to get an Enumerator, then use .to_a to get the array of matches:
text = "11234566666678"
# Variant a:
results = []
text.scan(/(\d)\1+/) { results << Regexp.last_match(0) }
p results # => ["11", "666666"]
# Variant b:
p text.scan(/((\d)\2+)/).map(&:first) # => ["11", "666666"]
# Variant c:
p text.gsub(/(\d)\1+/).to_a # => ["11", "666666"]
See this Ruby demo.
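For comparison, a rough Python counterpart of variants a and b, assuming the same input string. re.finditer gives access to the whole match even when the pattern needs a backreference, and wrapping the pattern in an outer group works the same way as in Ruby.

```python
import re

text = "11234566666678"

# Like variant a: iterate over match objects and take the whole match (group 0).
runs = [m.group(0) for m in re.finditer(r"(\d)\1+", text)]

# Like variant b: wrap everything in an outer group; the backreference
# becomes \2, and the first item of each tuple is the whole run.
wrapped = [outer for outer, _digit in re.findall(r"((\d)\2+)", text)]

print(runs)
print(wrapped)
```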
([+-]?\d+\.\d+)
assumes there is a leading digit before the decimal point
see demo at Rubular
If you need capture groups for a complex pattern match, but want the entire expression returned by .scan, this can work for you.
Suppose you want to get the image urls in this string perhaps from a markdown text with html image tags:
str = %(
Before
<img src="https://images.zenhubusercontent.com/11223344e051aa2c30577d9d17/110459e6-915b-47cd-9d2c-1842z4b73d71">
After
<img src="https://user-images.githubusercontent.com/111222333/75255445-f59fb800-57af-11ea-9b7a-a235b84bf150.png">).strip
You may have a regular expression defined to match just the urls, and maybe used a Rubular example like this to build/test your Regexp
image_regex =
/https\:\/\/(user-)?images.(githubusercontent|zenhubusercontent).com.*\b/
If you don't need each sub-capture group but just the entire expression from your .scan, you can wrap the whole pattern inside a capture group and use it like this:
image_regex =
/(https\:\/\/(user-)?images.(githubusercontent|zenhubusercontent).com.*\b)/
str.scan(image_regex).map(&:first)
=> ["https://images.zenhubusercontent.com/11223344e051aa2c30577d9d17/110459e6-915b-47cd-9d2c-1842z4b73d71",
"https://user-images.githubusercontent.com/111222333/75255445-f59fb800-57af-11ea-9b7a-a235b84bf150.png"]
How does this actually work?
Since you have 3 capture groups, .scan alone returns an array of arrays, with one entry per capture group in each inner array:
str.scan(image_regex)
=> [["https://images.zenhubusercontent.com/11223344e051aa2c30577d9d17/110459e6-915b-47cd-9d2c-1842z4b73d71", nil, "zenhubusercontent"],
["https://user-images.githubusercontent.com/111222333/75255445-f59fb800-57af-11ea-9b7a-a235b84bf150.png", "user-", "githubusercontent"]]
Since we only want the 1st (outer) capture group, we can just call .map(&:first)

Extracting a numerical value from a paragraph based on preceding words

I'm working with some big text fields in columns. After some cleanup I have something like below:
truth_val: ["5"]
xerb Scale: ["2"]
perb Scale: ["1"]
I want to extract the number 2. I'm trying to match the string "xerb Scale" and then extract 2. I tried capturing the group including 2 as (?:xerb Scale:\s\[\")\d{1} and tried to exclude the matched group through a negative look ahead but had no luck.
This is going to be in a SQL query and I'm trying to extract the numerical value through a REGEXP_EXTRACT() function. This query is part of a pipeline that loads this information into the database.
Any help would be much appreciated!
You should match what you do not need to obtain in order to set the context for your match, and you need to match and capture what you need to extract:
xerb Scale:\s*\["(\d+)"]
                 ^^^^^
See the regex demo. In Presto, use REGEXP_EXTRACT to get the first match:
SELECT regexp_extract(col, 'xerb Scale:\s*\["(\d+)"]', 1); -- 2
                                                     ^^^
Note the 1 argument:
regexp_extract(string, pattern, group) → varchar
Finds the first occurrence of the regular expression pattern in string and returns the capturing group number group
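The "match the context, capture the value" idea is not specific to SQL. A minimal sketch in Python, assuming the three sample lines from the question joined into one string; group(1) corresponds to the group argument 1 of regexp_extract.

```python
import re

# Sample text from the question, joined into one multi-line string.
text = 'truth_val: ["5"]\nxerb Scale: ["2"]\nperb Scale: ["1"]'

# The label and brackets are matched only for context; (\d+) is what we keep.
m = re.search(r'xerb Scale:\s*\["(\d+)"]', text)
value = m.group(1) if m else None

print(value)
```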

regex: substitute character in captured group

EDIT
In a regex, can a matching capturing group be replaced with the same match altered substituting a character with another?
ORIGINAL QUESTION
I'm converting a list of products into a CSV text file. Every line in the list has: number name[ description] price in this format:
1 PRODUCT description:120
2 PRODUCT NAME TWO second description, maybe:80
3 THIRD PROD:18
The resulting format must also include a slug (with - instead of spaces) as the second field:
1 PRODUCT:product-1:description:120
2 PRODUCT NAME TWO:product-name-two-2:second description, maybe:80
3 THIRD PROD:third-prod-3::18
The regex I'm using is this:
(\d+) ([A-Z ]+?)[ ]?([a-z ,]*):([\d]+)
and substitution string is:
\1 \2:\L$2-\1:\3:\4
This way my result is:
1 PRODUCT:product-1:description:120
2 PRODUCT NAME TWO:product name two-2:second description, maybe:80
3 THIRD PROD:third prod-3::18
What I'm missing is the hyphen separator I need in the second field, that is, group \2 with '-' instead of ' '.
Is it possible with a single regex, or should I go for a second pass?
(for now i'm using Sublime text editor)
Thanx.
I don't think doing this in a single pass is reasonable, and it may not even be possible. To replace the spaces with hyphens you need either multiple passes or continuous matching, and both lose the context of the capturing groups you need to rearrange your structure. So after your first replace, I would search for (?m)(?:^[^:\n]*:|\G(?!^))[^: \n]*\K[ ] and replace with -. The [ ] after \K is the single space that gets replaced. I'm not sure whether Sublime enables the multiline modifier by default; if it does, you can drop the (?m).
The answer might be different if you were using a programming language that supports callback functions for regex replace operations; there you could do the space-to-hyphen replacement inside the callback.
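As a sketch of that callback idea, here is one way it could look in Python, assuming the three sample lines from the question. The function name to_csv and the anchored, multiline version of the question's pattern are choices made for this example, not something from the original post.

```python
import re

# The question's pattern, anchored per line so each product is one match.
LINE = re.compile(r"^(\d+) ([A-Z ]+?) ?([a-z ,]*):(\d+)$", re.MULTILINE)

def to_csv(m):
    """Rebuild one line; the slug is made inside the callback."""
    number, name, description, price = m.groups()
    slug = name.lower().replace(" ", "-") + "-" + number
    return f"{number} {name}:{slug}:{description}:{price}"

source = (
    "1 PRODUCT description:120\n"
    "2 PRODUCT NAME TWO second description, maybe:80\n"
    "3 THIRD PROD:18"
)

result = LINE.sub(to_csv, source)
print(result)
```

Because the callback receives the match object, the capture groups are still available while the spaces are being replaced, so a second pass is not needed.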

regex to find value at a particular location

Presently the regex is:
[A-Z]+(?=-\d+$)
This pulls out the correct value for most of the strings which follow the below format:
ANG-RGN-SOR-BCP-0004 i.e. BCP
However it pulls out SS for the following document instead of PMR:
ANG-B31-OPS-PMR-MACE-SS-0229
So basically I want to pull out the fourth term (between the hyphens), so it should pick BCP and PMR.
The following regex will get the 4th item in group 1:
(?:[A-Z0-9]+-){3}([A-Z0-9]+)
The first bit in (?:...) is a "non-capturing group" which acts like a group but won't appear in the backreference list.
The next bit means "3 of these non-capturing groups".
And finally, a capturing group to collect what you want.
I have assumed here that all the groups contain only uppercase letters and digits; modify the parts in [square brackets] to match what your groups can actually contain.
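A quick way to check the pattern against both sample strings is a short Python sketch like the one below. The helper name fourth_term is made up for this example.

```python
import re

# Three "TERM-" prefixes that are skipped, then the fourth term is captured.
FOURTH = re.compile(r"(?:[A-Z0-9]+-){3}([A-Z0-9]+)")

def fourth_term(doc_id):
    """Return the fourth hyphen-separated term, or None if there is none."""
    m = FOURTH.match(doc_id)
    return m.group(1) if m else None

for doc_id in ("ANG-RGN-SOR-BCP-0004", "ANG-B31-OPS-PMR-MACE-SS-0229"):
    print(doc_id, "->", fourth_term(doc_id))
```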
A more easily understandable method in Python:
a = "ANG-B31-OPS-PMR-MACE-SS-0229"
part = a.split('-')[3]
print(part)
This gives "PMR".
This should suit your needs (demo):
(?:.+?-){3}([^-]+)
You'll be able to access the fourth term in the first capturing group.