Perl regular expression explanation

I have a regular expression like this:
s/<(?:[^>'"]|(['"]).?\1)*>//gs
and I don't know exactly what it means.

The regex looks intended to remove HTML tags from input.
It matches text that begins with < and ends with >, and that contains any mix of characters other than > and quotes, or quoted strings (which may themselves contain >). But it appears to have an error:
The .? says that a quoted string may contain only 0 or 1 characters; it was probably intended to be .*? (0 or more characters). And to prevent backtracking from doing things like making the . match a quote in some odd cases, the (?: ... ) grouping should be made atomic, i.e. (?> ... ), with > instead of :.
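
To make the difference concrete, here is a minimal sketch in Java rather than Perl (an assumption made for illustration; Pattern.DOTALL stands in for the /s flag and replaceAll for /g). The class name TagStripDemo and the sample input are hypothetical. It compares the regex as posted with the variant suggested above.

```java
import java.util.regex.Pattern;

class TagStripDemo {
    // The regex as posted: ".?" allows at most one character between the quotes.
    static final Pattern ORIGINAL =
        Pattern.compile("<(?:[^>'\"]|(['\"]).?\\1)*>", Pattern.DOTALL);

    // Suggested variant: ".*?" between the quotes and an atomic group "(?>...)".
    static final Pattern FIXED =
        Pattern.compile("<(?>[^>'\"]|(['\"]).*?\\1)*>", Pattern.DOTALL);

    // Removes every match of the given pattern, like s/.../ /g with an empty replacement.
    static String strip(Pattern p, String html) {
        return p.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // Hypothetical input: the attribute value contains a '>' inside quotes.
        String html = "<a href=\"x>y\">text</a>";
        System.out.println(strip(ORIGINAL, html)); // <a href="x>y">text
        System.out.println(strip(FIXED, html));    // text
    }
}
```

With the posted regex the opening tag survives, because the three-character attribute value cannot be matched by .? and the tag then has no valid ending.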

This tool can explain the details: http://rick.measham.id.au/paste/explain.pl?regex=%3C%28%3F%3A[^%3E%27%22]|%28[%27%22]%29.%3F\1%29*%3E
NODE EXPLANATION
--------------------------------------------------------------------------------
< '<'
--------------------------------------------------------------------------------
(?: group, but do not capture (0 or more times
(matching the most amount possible)):
--------------------------------------------------------------------------------
[^>'"] any character except: '>', ''', '"'
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
['"] any character of: ''', '"'
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
.? any character except \n (optional
(matching the most amount possible))
--------------------------------------------------------------------------------
\1 what was matched by capture \1
--------------------------------------------------------------------------------
)* end of grouping
--------------------------------------------------------------------------------
> '>'
So it tries to remove HTML tags as ysth also mentions.

Related

Regex to match inverse using negative lookahead ignores matches

I am trying to create a regex to match any tags not including [first].
# Trying to match:
# [second]
# [first.second]
# [first.third]
[first]
# something = else
[second]
test = yes
[first.second]
[first.third]
I was trying ^\[((?!first).*)\]$
https://regex101.com/r/1fz1CW/1
And this seems to match [second] in the example above, but I can't figure out why it doesn't match [first.second] or [first.third]. I was thinking I may need word boundaries, but I can't seem to get them to work.
Use
^\[((?!first\])[^\]\[]*)\]$
See regex proof.
EXPLANATION
--------------------------------------------------------------------------------
^ the beginning of the string
--------------------------------------------------------------------------------
\[ '['
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
(?! look ahead to see if there is not:
--------------------------------------------------------------------------------
first 'first'
--------------------------------------------------------------------------------
\] ']'
--------------------------------------------------------------------------------
) end of look-ahead
--------------------------------------------------------------------------------
[^\]\[]* any character except: '\]', '\[' (0 or
more times (matching the most amount
possible))
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
\] ']'
--------------------------------------------------------------------------------
$ before an optional \n, and the end of the
string
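
The attempt with (?!first) rejects every name that merely starts with first, which is why [first.second] and [first.third] were skipped; putting \] inside the lookahead rejects only the exact tag [first]. A minimal sketch in Java, assuming the MULTILINE flag so that ^ and $ apply to each line (the class name SectionFilterDemo is made up for the example):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SectionFilterDemo {
    // MULTILINE (an assumption here) makes ^ and $ match at every line boundary.
    static final Pattern NOT_FIRST =
        Pattern.compile("^\\[((?!first\\])[^\\]\\[]*)\\]$", Pattern.MULTILINE);

    // Collects the names of all bracketed tags except the exact tag [first].
    static List<String> sections(String text) {
        List<String> names = new ArrayList<>();
        Matcher m = NOT_FIRST.matcher(text);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    public static void main(String[] args) {
        String text = String.join("\n",
            "[first]",
            "# something = else",
            "[second]",
            "test = yes",
            "[first.second]",
            "[first.third]");
        System.out.println(sections(text)); // [second, first.second, first.third]
    }
}
```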

How to capture multiple sequence of numbers as repeated groups?

I have a URL that contains multiple sequences of numbers, and I want to capture them all in groups. Suppose I have the following:
https://www.example.com//first/part/54323?key=value
or
https://www.example.com/first/12345/second/part/part2/5432?key=value
I tried something like this, but it only matches one sequence of numbers:
(.*\/)([0-9]{4,})(\/.*|$|)
I want multiple groups that represent the different sections whenever a number sequence is included:
1st group will be "example.com/first"
2nd group "12345"
3rd group "second/part"
4th group "5432"
5th group "?key=value"
The initial .* is greedy, meaning it tries to match as much as possible. It matched everything up to the last slash that is followed by digits, "https://www.example.com/first/12345/second/part/part2", so only 5432 was left for the number group. You can modify this behavior by replacing the initial .* with .*?, but that is also not what you want: the lazy version stops at the first slash that is followed by four or more digits, so you get 12345 and everything after it ends up in the single trailing group.
But really we need to back up and ask some questions about your pattern. Apparently, you have a preamble you are not interested in, an indefinite number of sequences of 'character string, followed by slash, followed by number string', and then everything that remains once there are no more slash-digit patterns.
The key question is whether the number of char/char/digits combos is indefinite or limited to a definite number, like the two pairs in your example. To get the regex parser to return an unbounded number of string-number pairs, you are going to want to turn on the /g (global) switch so that the regex returns all matches. That is a problem for the parts of your URL at the beginning and end, which do not fit the pattern.
I recommend first using a regular expression to divide your URL into three parts, preamble, path, remaining data. Then you can pass the path string to a second regular expression to parse the pairs - it will be much simpler.
If you do it that way your first expression could be:
^[a-z+.-]+:\/\/(?:www\.)?([^?#]+)(.*)$
The first part skips over everything through the optional www. and does not capture it because you are not interested in that part. The second part captures everything up to any query or fragment (delimited by ? and #, respectively) and places it in the first capture group. The last part captures the rest of the URL into the second capture group. In your example that is ?key=value.
Now take your first capture group, which contains the host and the path, and pass it to a second regex with the global flag set (so it processes all pairs repeatedly). This second regex will be:
(.*?)\/([0-9]{4,})\/?
For each match of this string, the parsed values and numbers will be in capture groups 1 & 2.
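
The two-step approach can be sketched as follows in Java. This is an illustration under assumptions, not a definitive implementation: the first expression is written with a non-capturing (?:www\.)? group and a greedy [^?#]+ so that the host and path land in group 1 as described, and the names UrlPairsDemo, SPLIT, PAIR and parse are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlPairsDemo {
    // Step 1: skip scheme and optional "www.", capture host+path, then the rest.
    static final Pattern SPLIT =
        Pattern.compile("^[a-z+.-]+://(?:www\\.)?([^?#]+)(.*)$");
    // Step 2: a lazy text part, a slash, a run of 4+ digits, an optional slash.
    static final Pattern PAIR = Pattern.compile("(.*?)/([0-9]{4,})/?");

    // Returns text1, number1, text2, number2, ..., followed by the query/fragment.
    static List<String> parse(String url) {
        List<String> parts = new ArrayList<>();
        Matcher split = SPLIT.matcher(url);
        if (!split.matches()) {
            return parts; // not a URL of the expected shape
        }
        Matcher pair = PAIR.matcher(split.group(1));
        while (pair.find()) { // the loop plays the role of the global flag
            parts.add(pair.group(1));
            parts.add(pair.group(2));
        }
        parts.add(split.group(2));
        return parts;
    }

    public static void main(String[] args) {
        // [example.com/first, 12345, second/part/part2, 5432, ?key=value]
        System.out.println(parse(
            "https://www.example.com/first/12345/second/part/part2/5432?key=value"));
    }
}
```

Note that the text before the second number comes out as second/part/part2, since the lazy group runs up to the slash that precedes the digits.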
It sounds very straight-forward:
https?:\/\/(?:www\.)?(.*?)\/(\d+)\/(.*?)\/(\d+)(?:\?(.*))?
See regex proof.
EXPLANATION
--------------------------------------------------------------------------------
http 'http'
--------------------------------------------------------------------------------
s? 's' (optional (matching the most amount
possible))
--------------------------------------------------------------------------------
: ':'
--------------------------------------------------------------------------------
\/ '/'
--------------------------------------------------------------------------------
\/ '/'
--------------------------------------------------------------------------------
(?: group, but do not capture (optional
(matching the most amount possible)):
--------------------------------------------------------------------------------
www 'www'
--------------------------------------------------------------------------------
\. '.'
--------------------------------------------------------------------------------
)? end of grouping
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
.*? any character except \n (0 or more times
(matching the least amount possible))
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
\/ '/'
--------------------------------------------------------------------------------
( group and capture to \2:
--------------------------------------------------------------------------------
\d+ digits (0-9) (1 or more times (matching
the most amount possible))
--------------------------------------------------------------------------------
) end of \2
--------------------------------------------------------------------------------
\/ '/'
--------------------------------------------------------------------------------
( group and capture to \3:
--------------------------------------------------------------------------------
.*? any character except \n (0 or more times
(matching the least amount possible))
--------------------------------------------------------------------------------
) end of \3
--------------------------------------------------------------------------------
\/ '/'
--------------------------------------------------------------------------------
( group and capture to \4:
--------------------------------------------------------------------------------
\d+ digits (0-9) (1 or more times (matching
the most amount possible))
--------------------------------------------------------------------------------
) end of \4
--------------------------------------------------------------------------------
(?: group, but do not capture (optional
(matching the most amount possible)):
--------------------------------------------------------------------------------
\? '?'
--------------------------------------------------------------------------------
( group and capture to \5:
--------------------------------------------------------------------------------
.* any character except \n (0 or more
times (matching the most amount
possible))
--------------------------------------------------------------------------------
) end of \5
--------------------------------------------------------------------------------
)? end of grouping
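
As a quick check of the single-expression approach, the sketch below runs it in Java against both URLs from the question (the class name SingleRegexDemo and the helper groups are hypothetical). It suggests the expression fits URLs with exactly two number sequences; the first URL, which has only one, does not match at all.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SingleRegexDemo {
    static final Pattern URL = Pattern.compile(
        "https?://(?:www\\.)?(.*?)/(\\d+)/(.*?)/(\\d+)(?:\\?(.*))?");

    // Returns capture groups 1..5, or an empty list when the URL does not match.
    static List<String> groups(String url) {
        List<String> out = new ArrayList<>();
        Matcher m = URL.matcher(url);
        if (m.find()) {
            for (int i = 1; i <= m.groupCount(); i++) {
                out.add(m.group(i)); // null if an optional group did not take part
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // [example.com/first, 12345, second/part/part2, 5432, key=value]
        System.out.println(groups(
            "https://www.example.com/first/12345/second/part/part2/5432?key=value"));
        // [] because this URL contains only one number sequence
        System.out.println(groups(
            "https://www.example.com//first/part/54323?key=value"));
    }
}
```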

Need help in identifying the correct regex format

I am learning regex and am working on finding the regex format to satisfy below conditions:
check the contents in between "<NoteText>" and "</NoteText>"
If there is one or more "<" symbol not followed by "!", return all the identified "<" symbols.
example:
<NoteText><![CDATA[dvsdhjkndlv <<<RED>>> <72901> </NoteText>
this should return the 3 "<" before RED and the 1 "<" before 72901
initially i tried with the below regex pattern of negative lookahead.
<(?!!)
But it returns the "<" before the "NoteText" phrase as well.
I am not sure how to limit the area of filtering in between "<NoteText>" and "</NoteText>".
trying the below way did not work as well.
(?:<NoteText>.*)(<(?!!)).*(?:<\/NoteText>)
PCRE, not pretty, but working:
(?:\G(?!\A)|<NoteText>)(?:(?!<\/?NoteText>).)*?\K<(?!!)(?=(?:(?!<\/?NoteText>).)*?<\/NoteText>)
See regex proof.
EXPLANATION
--------------------------------------------------------------------------------
(?: group, but do not capture:
--------------------------------------------------------------------------------
\G where the last m//g left off
--------------------------------------------------------------------------------
(?! look ahead to see if there is not:
--------------------------------------------------------------------------------
\A the beginning of the string
--------------------------------------------------------------------------------
) end of look-ahead
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
<NoteText> '<NoteText>'
--------------------------------------------------------------------------------
) end of grouping
--------------------------------------------------------------------------------
(?: group, but do not capture (0 or more times
(matching the least amount possible)):
--------------------------------------------------------------------------------
(?! look ahead to see if there is not:
--------------------------------------------------------------------------------
< '<'
--------------------------------------------------------------------------------
\/? '/' (optional (matching the most
amount possible))
--------------------------------------------------------------------------------
NoteText> 'NoteText>'
--------------------------------------------------------------------------------
) end of look-ahead
--------------------------------------------------------------------------------
. any character except \n
--------------------------------------------------------------------------------
)*? end of grouping
--------------------------------------------------------------------------------
\K match reset operator
--------------------------------------------------------------------------------
< '<'
--------------------------------------------------------------------------------
(?! look ahead to see if there is not:
--------------------------------------------------------------------------------
! '!'
--------------------------------------------------------------------------------
) end of look-ahead
--------------------------------------------------------------------------------
(?= look ahead to see if there is:
--------------------------------------------------------------------------------
(?: group, but do not capture (0 or more
times (matching the least amount
possible)):
--------------------------------------------------------------------------------
(?! look ahead to see if there is not:
--------------------------------------------------------------------------------
< '<'
--------------------------------------------------------------------------------
\/? '/' (optional (matching the most
amount possible))
--------------------------------------------------------------------------------
NoteText> 'NoteText>'
--------------------------------------------------------------------------------
) end of look-ahead
--------------------------------------------------------------------------------
. any character except \n
--------------------------------------------------------------------------------
)*? end of grouping
--------------------------------------------------------------------------------
< '<'
--------------------------------------------------------------------------------
\/ '/'
--------------------------------------------------------------------------------
NoteText> 'NoteText>'
--------------------------------------------------------------------------------
) end of look-ahead
This is a working method in Java 8. Remember that this works only if you don't have nested <NoteText> tags.
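import java.util.regex.Matcher;
import java.util.regex.Pattern;

String myString = "<NoteText><![CDATA[dvsdhjkndlv <<<RED>>> <72901> </NoteText>";
// Outer pattern: the text between <NoteText> and </NoteText> (nesting is not supported)
Matcher outerMatcher = Pattern.compile("(?<=<NoteText>).*?(?=</NoteText>)").matcher(myString);
// Inner pattern: a "<" that is not followed by "!"
Pattern openAngle = Pattern.compile("<(?!!)");
while (outerMatcher.find()) {
    String content = outerMatcher.group(); // this is the content of the current NoteText tag
    Matcher innerMatcher = openAngle.matcher(content);
    int count = 0;
    while (innerMatcher.find()) {
        count++;
    }
    System.out.println(count); // this will print 4
}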
The code above is also meant to work with strings that contain multiple <NoteText> tags.
If you know you have only one <NoteText> tag, just replace the while with an if.

How to match strings not containing any word characters between a minus sign and numbers in PL/SQL regexp

I have some strings in Oracle where there is a minus sign (not at the beginning but inside the string), followed by a number (int or decimal with dot or comma).
I would like to find these in PLSQL. I have this already, and it's almost perfect:
REGEXP_LIKE(string, '-\d+(,|\.)*\d*')
I was hoping that it would find strictly strings like somestring-11,1, but the problem is that it also finds strings like somestring-11a1,1, that is, strings where a non-numeric (or word) character appears between the minus and the numbers. I was trying to use negative lookahead, but unfortunately it's not working:
REGEXP_LIKE(string, '-\d+!(\w)(,|\.)*\d*')
because somestring-1s won't be found anymore either. Could you please point me in the right direction? Thank you.
Could you please try the following, written and tested against your samples. In short: a lazy match runs up to the -, then one or more digits are matched, followed by a comma and one or more further digits.
.*?-\d+,\d+
Online regex demo for above regex
Use
(^|\D)-(\d+([,.]*\d+)?)($|\W)
See proof.
EXPLANATION
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
^ the beginning of the string
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
\D non-digits (all but 0-9)
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
- '-'
--------------------------------------------------------------------------------
( group and capture to \2:
--------------------------------------------------------------------------------
\d+ digits (0-9) (1 or more times (matching
the most amount possible))
--------------------------------------------------------------------------------
( group and capture to \3 (optional
(matching the most amount possible)):
--------------------------------------------------------------------------------
[,.]* any character of: ',', '.' (0 or more
times (matching the most amount
possible))
--------------------------------------------------------------------------------
\d+ digits (0-9) (1 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
)? end of \3 (NOTE: because you are using a
quantifier on this capture, only the
LAST repetition of the captured pattern
will be stored in \3)
--------------------------------------------------------------------------------
) end of \2
--------------------------------------------------------------------------------
( group and capture to \4:
--------------------------------------------------------------------------------
$ before an optional \n, and the end of
the string
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
\W non-word characters (all but a-z, A-Z, 0-9, _)
--------------------------------------------------------------------------------
) end of \4
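
The logic of this pattern can be checked outside the database. The sketch below uses Java's regex engine, which is an assumption made for illustration and not Oracle's implementation, so it only demonstrates the pattern on the two sample strings from the question; the class name MinusNumberDemo is hypothetical.

```java
import java.util.regex.Pattern;

class MinusNumberDemo {
    static final Pattern MINUS_NUMBER =
        Pattern.compile("(^|\\D)-(\\d+([,.]*\\d+)?)($|\\W)");

    // find() searches anywhere in the string, comparable to an unanchored REGEXP_LIKE.
    static boolean hasMinusNumber(String s) {
        return MINUS_NUMBER.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(hasMinusNumber("somestring-11,1"));   // true
        System.out.println(hasMinusNumber("somestring-11a1,1")); // false
    }
}
```

The second string is rejected because the trailing ($|\W) requires the number to be followed by the end of the string or a non-word character, and the a after 11 is neither.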

RegEx for removing everything before and after a delimiter

I am trying to remove everything before and after two | delimiters using regex.
An example being:
EM|CX-001|Test Campaign Name
and grabbing everything except CX-001. I cannot use a substring as the number of characters before and after the pipes may change.
I tried using the regex (?<=\|)(.*?)(?=\-), but while this selects CX-001, I need to select everything else but this.
How do I solve this problem?
You can try the following regular expression:
(^[^|]*\|)|(\|[^|]*$)
String input = "EM|CX-001|Test Campaign Name";
System.out.println(
input.replaceAll("(^[^|]*\\|)|(\\|[^|]*$)", "")
); // prints "CX-001"
Explanation of the regular expression:
NODE EXPLANATION
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
^ the beginning of the string
--------------------------------------------------------------------------------
[^|]* any character except: '|' (0 or more
times (matching the most amount
possible))
--------------------------------------------------------------------------------
\| '|'
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
( group and capture to \2:
--------------------------------------------------------------------------------
\| '|'
--------------------------------------------------------------------------------
[^|]* any character except: '|' (0 or more
times (matching the most amount
possible))
--------------------------------------------------------------------------------
$ before an optional \n, and the end of
the string
--------------------------------------------------------------------------------
) end of \2
If you have only 2 pipes in your string, you could either match up to the first pipe or match from the last one until the end of the string:
^.*?\||\|.*$
Explanation
^.*?\| Match from start of string non greedy until the first pipe
| Or
\|.*$ Match from last pipe until end of string
Regex demo
Or you might also use a negated character class [^|]* without the need of capturing groups:
^[^|]*\||\|[^|]*$
Regex demo
Note
In your pattern (?<=\|)(.*?)(?=\-) I think you meant that the last positive lookahead should be (?=\|) instead of the - if you want to select between 2 pipes.
Find: ^[^|]*\|([^|]+).+$
Replace: $1
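
The three approaches above can be compared side by side. This is a minimal Java sketch with hypothetical method names; it applies each pattern with String.replaceAll to the example from the question.

```java
class PipeFieldDemo {
    // Negated character classes: strip up to the first pipe and from the last pipe on.
    static String viaNegatedClass(String s) {
        return s.replaceAll("^[^|]*\\||\\|[^|]*$", "");
    }

    // Lazy/greedy dots: strip up to the first pipe, then from the next pipe to the end.
    static String viaDots(String s) {
        return s.replaceAll("^.*?\\||\\|.*$", "");
    }

    // Find/replace variant: match the whole string and keep only capture group 1.
    static String viaCapture(String s) {
        return s.replaceAll("^[^|]*\\|([^|]+).+$", "$1");
    }

    public static void main(String[] args) {
        String input = "EM|CX-001|Test Campaign Name";
        System.out.println(viaNegatedClass(input)); // CX-001
        System.out.println(viaDots(input));         // CX-001
        System.out.println(viaCapture(input));      // CX-001
    }
}
```

All three agree when the string has exactly two pipes. With more pipes they diverge: for a|b|c|d the first variant leaves b|c, while the other two leave b.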