Regex - Match Character Between Two Elements

I'm trying to match a single character (%) between two elements <p></p>. Given the following HTML I'd want to find a match:
<p>%</p>
However, the following HTML should not return a match:
<p>test</p>
<span>50%<span>
<p>test</p>
The problem I am running into is that my regex matches from the first <p> to the very last </p> whenever a % appears anywhere between them. Additionally, the regex needs to cope with new lines. Here is my incorrect regex so far:
(<p>)(.|\n|\r)*(%)(.|\n|\r)*(<\/p>)

For the shown strings, this will do:
<p>(?:(?!<\/?p>)[^%])*%[\d\D]*?<\/p>
See regex proof.
EXPLANATION
--------------------------------------------------------------------------------
<p> '<p>'
--------------------------------------------------------------------------------
(?: group, but do not capture (0 or more times
(matching the most amount possible)):
--------------------------------------------------------------------------------
(?! look ahead to see if there is not:
--------------------------------------------------------------------------------
< '<'
--------------------------------------------------------------------------------
\/? '/' (optional (matching the most
amount possible))
--------------------------------------------------------------------------------
p> 'p>'
--------------------------------------------------------------------------------
) end of look-ahead
--------------------------------------------------------------------------------
[^%] any character except: '%'
--------------------------------------------------------------------------------
)* end of grouping
--------------------------------------------------------------------------------
% '%'
--------------------------------------------------------------------------------
[\d\D]*? any character of: digits (0-9), non-digits
(all but 0-9) (0 or more times (matching
the least amount possible))
--------------------------------------------------------------------------------
<\/p> '</p>'
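As a quick check of the pattern above, here is a minimal Python sketch run against both samples from the question. Python's `re` module is only an assumption here (the question does not name an engine), and the variable names are illustrative.

```python
import re

# Pattern from the answer above. The `\/` escapes are not required in
# Python, but they are harmless and keep the pattern identical.
P_PERCENT = re.compile(r"<p>(?:(?!<\/?p>)[^%])*%[\d\D]*?<\/p>")

should_match = "<p>%</p>"
should_not_match = "<p>test</p>\n<span>50%<span>\n<p>test</p>"

hit = P_PERCENT.search(should_match)
miss = P_PERCENT.search(should_not_match)

print(hit.group(0) if hit else None)
print(miss)
```

Because the group `(?:(?!<\/?p>)[^%])*` cannot step over a `<p>` or `</p>`, the `%` has to appear before the first closing tag that follows the opening `<p>`. Note that the pattern also accepts other text around the `%`, for example `<p>50% off</p>`.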

Related

Regex match a line if it contains a specific word

I have a long xml text and I want to match each product that is available. The text is made of products that are structured like this:
<product>
...
<available>instock</available>
...
</product>
I can match all products with this regex
((?s)<product>.*?<\/product>)
Example: https://regex101.com/r/kz8cn1/1
However, I want to match only those products that have an 'instock' value in their <available> tag.
My solution is this:
((?s)<product>(?=.*?\binstock\b).*?<\/product>)
Unfortunately, this works only partially as I believe the lookaround regex is not contained to the match group which results in products with 'outofstock' values being matched as well.
Here is my example:
https://regex101.com/r/AHlC0K/1
How should I change my regex so that the lookaround works only in the context of the match?
Use an XML parser. If none is available, you can use
(?s)<product>(?=(?:(?!<\/?product>).)*?\binstock\b).*?<\/product>
See regex proof.
EXPLANATION
--------------------------------------------------------------------------------
(?s) set flags for this block (with . matching
\n) (case-sensitive) (with ^ and $
matching normally) (matching whitespace
and # normally)
--------------------------------------------------------------------------------
<product> '<product>'
--------------------------------------------------------------------------------
(?= look ahead to see if there is:
--------------------------------------------------------------------------------
(?: group, but do not capture (0 or more
times (matching the least amount
possible)):
--------------------------------------------------------------------------------
(?! look ahead to see if there is not:
--------------------------------------------------------------------------------
< '<'
--------------------------------------------------------------------------------
\/? '/' (optional (matching the most
amount possible))
--------------------------------------------------------------------------------
product> 'product>'
--------------------------------------------------------------------------------
) end of look-ahead
--------------------------------------------------------------------------------
. any character
--------------------------------------------------------------------------------
)*? end of grouping
--------------------------------------------------------------------------------
\b the boundary between a word char (\w)
and something that is not a word char
--------------------------------------------------------------------------------
instock 'instock'
--------------------------------------------------------------------------------
\b the boundary between a word char (\w)
and something that is not a word char
--------------------------------------------------------------------------------
) end of look-ahead
--------------------------------------------------------------------------------
.*? any character (0 or more times (matching
the least amount possible))
--------------------------------------------------------------------------------
< '<'
--------------------------------------------------------------------------------
\/ '/'
--------------------------------------------------------------------------------
product> 'product>'
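To see why the tempered lookahead matters, the sketch below compares the pattern from the question with the one from the answer in Python. The sample XML (including the `<name>` elements) is made up for illustration, and the outer capturing parentheses from the question are dropped so that `(?s)` sits at the very start of the pattern.

```python
import re

# Hypothetical sample data: one out-of-stock product followed by one in stock.
XML = """<product>
<name>A</name>
<available>outofstock</available>
</product>
<product>
<name>B</name>
<available>instock</available>
</product>"""

# Lookahead from the question: free to scan past </product> into the next product.
NAIVE = re.compile(r"(?s)<product>(?=.*?\binstock\b).*?<\/product>")

# Lookahead from the answer: each step refuses to cross <product> or </product>.
TEMPERED = re.compile(
    r"(?s)<product>(?=(?:(?!<\/?product>).)*?\binstock\b).*?<\/product>"
)

naive_hits = NAIVE.findall(XML)
tempered_hits = TEMPERED.findall(XML)

print(len(naive_hits), len(tempered_hits))
```

With this input the first pattern reports both products, while the tempered version reports only product B.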

Need help in identifying the correct regex format

I am learning regex and am trying to find a pattern that satisfies the conditions below:
Check the contents between "<NoteText>" and "</NoteText>".
If there are one or more "<" symbols not followed by "!", return all the identified "<" symbols.
Example:
<NoteText><![CDATA[dvsdhjkndlv <<<RED>>> <72901> </NoteText>
This should return the 3 "<" before RED and the 1 "<" before 72901.
Initially I tried the negative lookahead pattern below:
<(?!!)
But it also returns the "<" before "NoteText".
I am not sure how to restrict the search to the text between "<NoteText>" and "</NoteText>".
The following attempt did not work either:
(?:<NoteText>.*)(<(?!!)).*(?:<\/NoteText>)
PCRE, not pretty, but working:
(?:\G(?!\A)|<NoteText>)(?:(?!<\/?NoteText>).)*?\K<(?!!)(?=(?:(?!<\/?NoteText>).)*?<\/NoteText>)
See regex proof.
EXPLANATION
--------------------------------------------------------------------------------
(?: group, but do not capture:
--------------------------------------------------------------------------------
\G where the last m//g left off
--------------------------------------------------------------------------------
(?! look ahead to see if there is not:
--------------------------------------------------------------------------------
\A the beginning of the string
--------------------------------------------------------------------------------
) end of look-ahead
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
<NoteText> '<NoteText>'
--------------------------------------------------------------------------------
) end of grouping
--------------------------------------------------------------------------------
(?: group, but do not capture (0 or more times
(matching the least amount possible)):
--------------------------------------------------------------------------------
(?! look ahead to see if there is not:
--------------------------------------------------------------------------------
< '<'
--------------------------------------------------------------------------------
\/? '/' (optional (matching the most
amount possible))
--------------------------------------------------------------------------------
NoteText> 'NoteText>'
--------------------------------------------------------------------------------
) end of look-ahead
--------------------------------------------------------------------------------
. any character except \n
--------------------------------------------------------------------------------
)*? end of grouping
--------------------------------------------------------------------------------
\K match reset operator
--------------------------------------------------------------------------------
< '<'
--------------------------------------------------------------------------------
(?! look ahead to see if there is not:
--------------------------------------------------------------------------------
! '!'
--------------------------------------------------------------------------------
) end of look-ahead
--------------------------------------------------------------------------------
(?= look ahead to see if there is:
--------------------------------------------------------------------------------
(?: group, but do not capture (0 or more
times (matching the least amount
possible)):
--------------------------------------------------------------------------------
(?! look ahead to see if there is not:
--------------------------------------------------------------------------------
< '<'
--------------------------------------------------------------------------------
\/? '/' (optional (matching the most
amount possible))
--------------------------------------------------------------------------------
NoteText> 'NoteText>'
--------------------------------------------------------------------------------
) end of look-ahead
--------------------------------------------------------------------------------
. any character except \n
--------------------------------------------------------------------------------
)*? end of grouping
--------------------------------------------------------------------------------
< '<'
--------------------------------------------------------------------------------
\/ '/'
--------------------------------------------------------------------------------
NoteText> 'NoteText>'
--------------------------------------------------------------------------------
) end of look-ahead
This is a working method in Java 8. Remember that this works only if you don't have nested <NoteText> tags.
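import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NoteTextCount {
    public static void main(String[] args) {
        String myString = "<NoteText><![CDATA[dvsdhjkndlv <<<RED>>> <72901> </NoteText>";
        // Outer pattern: the text between <NoteText> and the next </NoteText>
        Matcher outerMatcher = Pattern.compile("(?<=<NoteText>).*?(?=</NoteText>)").matcher(myString);
        // Inner pattern: a "<" that is not followed by "!" (compiled once, reused per tag)
        Pattern innerPattern = Pattern.compile("<(?!!)");
        while (outerMatcher.find()) {
            String content = outerMatcher.group(); // content of the current NoteText tag
            Matcher innerMatcher = innerPattern.matcher(content);
            int count = 0;
            while (innerMatcher.find()) count++;
            System.out.println(count); // prints 4 for the sample string
        }
    }
}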
The code above also handles strings that contain multiple <NoteText> tags.
If you know there is only one <NoteText> tag, just replace the outer while with an if.
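The same two-pass idea can be sketched in Python, whose built-in `re` module supports neither `\G` nor `\K`. This version returns the positions of the matching "<" characters instead of a count; the function name `find_bare_lt` is made up for illustration, and, like the Java code, it assumes <NoteText> tags are not nested.

```python
import re

# Pass 1: capture the content of each <NoteText>...</NoteText> element.
NOTE_RE = re.compile(r"<NoteText>(.*?)</NoteText>", re.DOTALL)
# Pass 2: a "<" that is not immediately followed by "!".
LT_RE = re.compile(r"<(?!!)")

def find_bare_lt(text):
    """Return the offsets in `text` of every "<" inside a NoteText
    element that is not followed by "!"."""
    positions = []
    for note in NOTE_RE.finditer(text):
        base = note.start(1)  # offset of the element content in `text`
        for lt in LT_RE.finditer(note.group(1)):
            positions.append(base + lt.start())
    return positions

sample = "<NoteText><![CDATA[dvsdhjkndlv <<<RED>>> <72901> </NoteText>"
positions = find_bare_lt(sample)
print(len(positions))
```

For the sample string this finds four positions: the three "<" before RED and the one before 72901.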

How to match strings not containing any word characters between a minus sign and numbers in PL/SQL regexp

I have some strings in Oracle where there is a minus sign (not at the beginning but inside the string), followed by a number (int or decimal with dot or comma).
I would like to find these in PLSQL. I have this already, and it's almost perfect:
REGEXP_LIKE(string, '-\d+(,|\.)*\d*')
I was hoping it would match only strings like somestring-11,1, but it also matches strings like somestring-11a1,1, where a non-numeric (word) character appears between the minus sign and the rest of the number. I tried to use a negative lookahead, but unfortunately it does not work:
REGEXP_LIKE(string, '-\d+!(\w)(,|\.)*\d*')
because then somestring-1s is no longer found either. Could you please point me in the right direction? Thank you.
Please try the following, written and tested against your samples. In short: a lazy match runs up to the -, then the pattern matches one or more digits, a comma, and one or more digits.
.*?-\d+,\d+
Online regex demo for above regex
Use
(^|\D)-(\d+([,.]*\d+)?)($|\W)
See proof.
EXPLANATION
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
^ the beginning of the string
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
\D non-digits (all but 0-9)
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
- '-'
--------------------------------------------------------------------------------
( group and capture to \2:
--------------------------------------------------------------------------------
\d+ digits (0-9) (1 or more times (matching
the most amount possible))
--------------------------------------------------------------------------------
( group and capture to \3 (optional
(matching the most amount possible)):
--------------------------------------------------------------------------------
[,.]* any character of: ',', '.' (0 or more
times (matching the most amount
possible))
--------------------------------------------------------------------------------
\d+ digits (0-9) (1 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
)? end of \3 (NOTE: because you are using a
quantifier on this capture, only the
LAST repetition of the captured pattern
will be stored in \3)
--------------------------------------------------------------------------------
) end of \2
--------------------------------------------------------------------------------
( group and capture to \4:
--------------------------------------------------------------------------------
$ before an optional \n, and the end of
the string
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
\W non-word characters (all but a-z, A-Z, 0-
9, _)
--------------------------------------------------------------------------------
) end of \4
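The logic of this pattern can be checked with a short Python sketch. This is only a stand-in for Oracle's REGEXP_LIKE: it assumes both engines treat `\d`, `\D` and `\W` the same way for these ASCII samples. The helper name and the third sample string are made up for illustration.

```python
import re

# Pattern from the answer above, unchanged.
DASH_NUMBER = re.compile(r"(^|\D)-(\d+([,.]*\d+)?)($|\W)")

def has_dash_number(s):
    """True if `s` contains a minus sign followed by a clean number."""
    return DASH_NUMBER.search(s) is not None

samples = ["somestring-11,1", "somestring-11a1,1", "somestring-11.5 kg"]
results = {s: has_dash_number(s) for s in samples}
print(results)
```

The second sample is rejected because the character right after the number has to be either the end of the string or a non-word character, and `a` is neither.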

Regex help - match one string but not another

I have been using this:
~^\/student-accommodation\/(?:[^\/]+?)\/([^\/]+)\/$
to match for URLs like
/student-accommodation/manchester/ropemaker-court-manchester/
But now I need to edit this regex so it also matches URLs like the one below. All these new URLs will follow the same pattern and append a string that starts with #utm_source. Importantly, they won't have another / in them.
/student-accommodation/manchester/ropemaker-court-manchester/#utm_source=afs&utm_medium=email&utm_campaign=ropemakercourt_afs_dec20
But then I don't want the regex to match for URLs like the below:
/student-accommodation/manchester/ropemaker-court-manchester/en-suite/
Can anyone help? I am a novice at regex! Thanks
Use
^\/student-accommodation\/[^\/]+\/([^\/]+)\/(?:#utm_source.*)?$
See proof
Explanation
--------------------------------------------------------------------------------
^ the beginning of the string
--------------------------------------------------------------------------------
\/ '/'
--------------------------------------------------------------------------------
student-accommodation 'student-accommodation'
--------------------------------------------------------------------------------
\/ '/'
--------------------------------------------------------------------------------
[^\/]+ any character except: '\/' (1 or more
times (matching the most amount possible))
--------------------------------------------------------------------------------
\/ '/'
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
[^\/]+ any character except: '\/' (1 or more
times (matching the most amount
possible))
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
\/ '/'
--------------------------------------------------------------------------------
(?: group, but do not capture (optional
(matching the most amount possible)):
--------------------------------------------------------------------------------
#utm_source '#utm_source'
--------------------------------------------------------------------------------
.* any character except \n (0 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
)? end of grouping
--------------------------------------------------------------------------------
$ before an optional \n, and the end of the
string
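The three URLs from the question can be run through the pattern with a small Python sketch. The original regex uses `~` as a delimiter, which suggests a different host language, so Python is only an assumption here, and the helper name `property_slug` is made up for illustration.

```python
import re

# Pattern from the answer above; the optional group allows a trailing
# "#utm_source..." fragment but nothing else after the final slash.
URL_RE = re.compile(
    r"^\/student-accommodation\/[^\/]+\/([^\/]+)\/(?:#utm_source.*)?$"
)

def property_slug(url):
    """Return the captured last path segment, or None if the URL does not match."""
    m = URL_RE.match(url)
    return m.group(1) if m else None

plain = "/student-accommodation/manchester/ropemaker-court-manchester/"
tracked = plain + "#utm_source=afs&utm_medium=email&utm_campaign=ropemakercourt_afs_dec20"
deeper = plain + "en-suite/"

for url in (plain, tracked, deeper):
    print(property_slug(url))
```

The first two URLs yield `ropemaker-court-manchester`; the third yields `None` because `[^\/]+` cannot absorb the extra path segment.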

Multiple regex matching within a lookaround

I am trying to get multiple matches between a lookbehind and a lookahead.
Let's say I have the following string:
$ Hi, this is an example #
Regex: (?<=$).+a.+(?=#)
I expect it to return both 'a' characters within those boundaries. Is there a way to do it with only one regex?
Engine: Python
If you can use quantifiers in the lookbehind, use
(?<=\$[^$#]*?)a(?=[^#]*#)
See proof.
Explanation
--------------------------------------------------------------------------------
(?<= look behind to see if there is:
--------------------------------------------------------------------------------
\$ '$'
--------------------------------------------------------------------------------
[^$#]*? any character except: '$' and '#' (0 or more
times (matching the least amount possible))
--------------------------------------------------------------------------------
) end of look-behind
--------------------------------------------------------------------------------
a 'a'
--------------------------------------------------------------------------------
(?= look ahead to see if there is:
--------------------------------------------------------------------------------
[^#]* any character except: '#' (0 or more
times (matching the most amount
possible))
--------------------------------------------------------------------------------
# '#'
--------------------------------------------------------------------------------
) end of look-ahead
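Python's built-in `re` module accepts only fixed-width lookbehinds, so the pattern above will not compile there (the third-party `regex` package is one option that lifts this restriction). A minimal standard-library sketch uses two passes instead of one regex: first isolate each `$...#` span, then search inside it. The function name and the `letter` parameter are made up for illustration.

```python
import re

# Pass 1: a "$", then text containing neither "$" nor "#", then the closing "#".
SPAN_RE = re.compile(r"\$([^$#]*)#")

def letters_between(text, letter="a"):
    """Return the offsets of `letter` inside every $...# span of `text`."""
    hits = []
    inner = re.compile(re.escape(letter))
    for span in SPAN_RE.finditer(text):
        base = span.start(1)  # offset of the span content in `text`
        hits.extend(base + m.start() for m in inner.finditer(span.group(1)))
    return hits

sample = "$ Hi, this is an example #"
hits = letters_between(sample)
print(len(hits))
```

For the sample string this finds two offsets, one for the `a` in "an" and one for the `a` in "example".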
PCRE pattern:
(?:\G(?<!^)|\$)[^$#]*?\Ka(?=[^#]*#)
See another proof
Explanation
--------------------------------------------------------------------------------
(?: group, but do not capture:
--------------------------------------------------------------------------------
\G where the last m//g left off
--------------------------------------------------------------------------------
(?<! look behind to see if there is not:
--------------------------------------------------------------------------------
^ the beginning of the string
--------------------------------------------------------------------------------
) end of look-behind
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
\$ '$'
--------------------------------------------------------------------------------
) end of grouping
--------------------------------------------------------------------------------
[^$#]*? any character except: '$' and '#' (0 or more times
(matching the least amount possible))
--------------------------------------------------------------------------------
\K match reset operator
--------------------------------------------------------------------------------
a 'a'
--------------------------------------------------------------------------------
(?= look ahead to see if there is:
--------------------------------------------------------------------------------
[^#]* any character except: '#' (0 or more
times (matching the most amount
possible))
--------------------------------------------------------------------------------
# '#'
--------------------------------------------------------------------------------
) end of look-ahead