Protect escaped chars from pattern ending - regex

Assume you have the pattern "A<(.*?)>".
Take Java's Pattern, Matcher and the matcher.find() method as an example.
For the input "A<v1>", the pattern matches and group(1) is "v1".
For the input "A<v1>v2>", the pattern matches and group(1) is "v1", because the "?" makes ".*" non-greedy.
Now assume a user wants to protect part of the input with an escape, like:
"A<v1\>v2>", so that the pattern matches and group(1) has the value "v1>v2".
The pattern should stay non-greedy, but an escaped character should be protected and become part of the value (the group).
The pattern is applied in a "while" loop, because I want to find all occurrences of the pattern in the input. So the pattern should accept as little as possible (non-greedy), but still handle the escaped character (here, the ">" that ends the pattern).
Any hints?
Thanks in advance.

You can accept \> as a valid expression to match:
A<((\\>|.)*?)>
The group (\\>|.) will match either the two characters \> or, if that doesn't match, any single character (.). The order is important, because \> matches two characters while . only matches one, meaning that . would gobble up the \ character if it were tried first.
To illustrate:
A <   v 1 \> v 2     >
| |   | | |  | |     |
A < ( . . \> . . )*? >
However, the resulting match would be v1\>v2, so you'll need to do some processing after the fact to convert \> to >.
If you wanted to go even further and allow escaping the \ character, you could use a character class like so:
A<((\\[>\\]|.)*?)>
Which would match the following:
A<v1\\>
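To make the two steps above concrete (match with the escape-aware pattern, then unescape the captured value), here is a minimal sketch. It uses Python's re module rather than Java, on the assumption that both engines treat these constructs the same way. The name extract_values is hypothetical, and the inner group is written as non-capturing, which is a small variation on the pattern above.

```python
import re

# Escape-aware pattern: "\>" or "\\" is consumed as one unit, so the
# lazy quantifier cannot stop at an escaped ">".
# The inner group is non-capturing here, so group(1) is the whole value.
PATTERN = re.compile(r"A<((?:\\[>\\]|.)*?)>")


def extract_values(text):
    """Return the value of every A<...> occurrence, with escapes removed."""
    values = []
    for match in PATTERN.finditer(text):  # plays the role of the while/find() loop
        raw = match.group(1)
        # Post-processing: turn "\>" into ">" and "\\" into "\".
        values.append(re.sub(r"\\([>\\])", r"\1", raw))
    return values


if __name__ == "__main__":
    for sample in (r"A<v1>", r"A<v1>v2>", r"A<v1\>v2>", r"A<a> and A<b\>c>"):
        print(sample, "->", extract_values(sample))
```

The unescape step uses the same character class as the pattern, so only the sequences that the pattern treats as escapes are rewritten.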


Regex allows for repeating the pattern when I don't want it to

I'm trying to take a query parameter and verify if the syntax provided by the user is correct. Regex seems like the best choice for this, but I'm having trouble making it so the pattern doesn't allow for repeating itself.
The pattern I came up with is:
(^(\w+)(=|!=|>=|>|<=|<|~)((')(.*)('))(\s(AND|OR)\s)(\w+)(=|!=|>=|>|<=|<|~)((')(.*)('))$)
The syntax provided by the user should be:
[field][predicate][single quote][value][single quote][white space][logical operator][white space][field][predicate][single quote][value][single quote]
Where:
field is [any word]
predicate is [= | != | >= | > | <= | < | ~]
logical operator is [AND | OR (with a space on both sides)]
value is [any word wrapped by single quotes]
An example looks like this: field1='value1' OR field2='value2'
The problem I am having is that the pattern I created allows for things like this:
field1='value1' OR field2='value2field1='value' OR field2='value2'' [This shouldn't work but does]
field1='value1' OR field2='value2 field1='value' OR field2='value2'' [This shouldn't work but does]
field1='value1' OR field2='value2' AND field3='value3' OR field4='value4'' [This shouldn't work but does]
Any help would be appreciated making it so the pattern doesn't match if it repeats.
You might use:
^\w+(?:<=|>=|!=|[~<>=])'\w+'(?: (?:OR|AND) \w+(?:<=|>=|!=|[~<>=])'\w+')*$
^ Start of string
\w+ Match 1 or more word chars
(?: Non-capturing group
<=|>=|!=|[~<>=] Match one of the alternatives
) Close group
'\w+' Match 1 or more word chars between single quotes
(?: Non-capturing group
 (?:OR|AND) \w+ Match a space, either OR or AND, a space, and 1+ word chars
(?:<=|>=|!=|[~<>=]) Match one of the alternatives
'\w+' Match 1 or more word chars between single quotes
)* Close group and repeat 0+ times to also match without AND or OR
$ End of string
If there should be at least a single AND or OR the quantifier of the last group could be + instead of *
The single chars in the predicate could be added to a character class [~<>=] to take out a few alternations.
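As a rough check of this approach, the sketch below builds the pattern from smaller pieces and runs it against the inputs from the question. It is written in Python and assumes the predicate list given in the question (=, !=, >=, >, <=, <, ~). The names PREDICATE, CLAUSE and is_valid_query are hypothetical.

```python
import re

# One comparison operator; two-character operators come first so that
# "<=" is not read as "<" followed by a stray "=".
PREDICATE = r"(?:<=|>=|!=|[~<>=])"
# One clause: field, predicate, value made of word chars in single quotes.
CLAUSE = rf"\w+{PREDICATE}'\w+'"
# A clause, followed by zero or more " AND clause" / " OR clause" parts.
QUERY = re.compile(rf"^{CLAUSE}(?: (?:OR|AND) {CLAUSE})*$")


def is_valid_query(query):
    """Return True if the whole string follows the clause syntax."""
    return QUERY.fullmatch(query) is not None


if __name__ == "__main__":
    samples = [
        "field1='value1' OR field2='value2'",
        "field1='value1' OR field2='value2field1='value' OR field2='value2''",
        "field1='value1' OR field2='value2 field1='value' OR field2='value2''",
    ]
    for sample in samples:
        print(is_valid_query(sample), sample)
```

Because the value is restricted to \w+ instead of .*, a quote can no longer be swallowed inside a value, which is what let the repeated input through in the original pattern.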

Hive regex: Positive lookahead to match '&' or end of string

I would like to match text between two strings, although the last string/character might not always be available.
String1: 'www.mywebsite.com/search/keyword=toys'
String2: 'www.mywebsite.com/search/keyword=toys&lnk=hp1'
Here I want to match the value in keyword= that is 'toys' and I am using
(?<=keyword=)(.*)(?=&|$)
This works for String1, but for String2 it matches 'toys&lnk=hp1', that is, the '&' and everything after it as well.
What am I doing wrong?
.* is greedy. It takes everything it can, therefore stops at the end of the string ($) and not at the & character.
Change it to its non-greedy version - .*?
with t as
(
    select explode
           (
               array
               (
                   'www.mywebsite.com/search/keyword=toys'
                  ,'www.mywebsite.com/search/keyword=toys&lnk=hp1'
               )
           ) as (val)
)
select regexp_extract(val,'(?<=keyword=)(.*?)(?=&|$)',0)
from   t
;
+------+
| toys |
+------+
| toys |
+------+
You do not need to bother with greediness when you need to match zero or more occurrences of any characters but a specific character (or set of characters). All you need is to get rid of the lookahead and the dot pattern and use [^&]* (or, if the value you expect should not be an empty string, [^&]+):
(?<=keyword=)[^&]+
Code:
select regexp_extract(val,'(?<=keyword=)[^&]+', 0) from t
Note you do not even need a capturing group since the 0 argument instructs regexp_extract to retrieve the value of the whole match.
Pattern details
(?<=keyword=) - a positive lookbehind that matches a location that is immediately preceded with keyword=
[^&]+ - any 1+ chars other than & (if you use * instead of +, it will match 0 or more occurrences).
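The three variants discussed in this thread (greedy, non-greedy, negated character class) can be compared side by side. The sketch below does so in Python instead of Hive, assuming Python's re module treats these constructs the same way as the engine Hive uses. The helper name extract is hypothetical.

```python
import re

URLS = [
    "www.mywebsite.com/search/keyword=toys",
    "www.mywebsite.com/search/keyword=toys&lnk=hp1",
]

GREEDY = r"(?<=keyword=)(.*)(?=&|$)"   # pattern from the question
LAZY = r"(?<=keyword=)(.*?)(?=&|$)"    # non-greedy fix
NEGATED = r"(?<=keyword=)[^&]+"        # negated character class, no lookahead


def extract(pattern, text):
    """Return the whole match (like index 0 in regexp_extract), or None."""
    match = re.search(pattern, text)
    return match.group(0) if match else None


if __name__ == "__main__":
    for url in URLS:
        for name, pattern in (("greedy", GREEDY), ("lazy", LAZY), ("negated", NEGATED)):
            print(f"{name:8} {extract(pattern, url)!r}")
```

With the second URL the greedy pattern runs to the end of the string, while the other two stop before the &.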

RegEx that excludes characters doesn't begin matching until 2nd character

I'm trying to create a regular expression that will include all ASCII characters but exclude certain characters such as "+" or "%". I'm currently using this:
^[\x00-\x7F][^%=+]+$
But I noticed (using various RegEx validators) that this pattern only begins matching with 2 characters. It won't match "a" but it will match "ab." If I remove the "[^]" section, (^[\x00-\x7F]+$) then the pattern matches one character. I've searched for other options, but so far come up with nothing. I'd like the pattern to begin matching on 1 character but also exclude characters. Any suggestions would be great!
Try this:
^(?:(?![%=+])[\x00-\x7F])+$
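At each position, the negative lookahead first checks that the next character is not one of the "bad" characters, then [\x00-\x7F] consumes it, and the group repeats until the end of the string. Your original pattern needs at least two characters because [\x00-\x7F] consumes one character and [^%=+]+ requires at least one more.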
You can use a negative lookahead here to exclude certain characters:
^((?![%=+])[\x00-\x7F])+$
(?![%=+]) is a negative lookahead that asserts that the next character is not one of %, = or +.
You could simply exclude those chars from the \x00-\x7f range (using the hex value of each char).
+------+-----+-----+-----+
| Char | Dec | Oct | Hex |
+------+-----+-----+-----+
| %    | 37  | 45  | 25  |
| +    | 43  | 53  | 2B  |
| =    | 61  | 75  | 3D  |
+------+-----+-----+-----+
Regex:
^[\x00-\x24\x26-\x2A\x2C-\x3C\x3E-\x7F]+$
Engine-wise this is more efficient than attempting an assertion for each character.
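To compare the pattern from the question with the two suggested fixes, here is a small Python sketch (Python's re is used only as a convenient stand-in for whichever engine you validate with). The names ORIGINAL, LOOKAHEAD, RANGES and accepts are hypothetical.

```python
import re

ORIGINAL = re.compile(r"^[\x00-\x7F][^%=+]+$")                      # from the question
LOOKAHEAD = re.compile(r"^(?:(?![%=+])[\x00-\x7F])+$")              # lookahead fix
RANGES = re.compile(r"^[\x00-\x24\x26-\x2A\x2C-\x3C\x3E-\x7F]+$")   # explicit ranges


def accepts(pattern, text):
    """Return True if the compiled pattern matches the text."""
    return pattern.match(text) is not None


if __name__ == "__main__":
    for sample in ("a", "ab", "%a", "a+b", "a\u00e9"):
        print(
            repr(sample),
            accepts(ORIGINAL, sample),
            accepts(LOOKAHEAD, sample),
            accepts(RANGES, sample),
        )
```

Besides rejecting single characters, the original pattern checks the first character only against the ASCII range and the remaining characters only against the excluded set, so it lets through inputs like "%a" or a non-ASCII second character. Both fixes apply the two conditions to every character.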

Match a number in a string with letters and numbers

I need to write a Perl regex to match numbers in a word with both letters and numbers.
Example: test123. I want to write a regex that matches only the number part and capture it
I am trying this \S*(\d+)\S* and it captures only the 3 but not 123.
Regex atoms will match as much as they can.
Initially, the first \S* matched "test123", but the regex engine had to backtrack to allow \d+ to match. The result is:
+------------------ Matches "test12"
|     +------------ Matches "3"
|     |     +------ Matches ""
|     |     |
---   ----- ---
\S*   (\d+) \S*
All you need is:
my ($num) = "test123" =~ /(\d+)/;
It'll try to match at position 0, then position 1, ... until it finds a digit, then it will match as many digits as it can.
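The backtracking described above can be observed by putting each part of the original pattern in its own capture group. The sketch below does this in Python instead of Perl, assuming both engines behave the same for these simple patterns. The helper name groups is hypothetical.

```python
import re


def groups(pattern, text):
    """Return the captured groups of the first match, or None if no match."""
    match = re.search(pattern, text)
    return match.groups() if match else None


if __name__ == "__main__":
    # Greedy \S* takes everything, then gives back one char so \d+ can match.
    print(groups(r"(\S*)(\d+)(\S*)", "test123"))
    # Without the surrounding \S*, the engine scans forward to the first digit.
    print(groups(r"(\d+)", "test123"))
    # A non-greedy prefix also leaves all the digits to the capture group.
    print(groups(r"\S+?(\d+)", "test123"))
```

Only the first call shows the backtracking effect; the other two capture the full run of digits.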
The * quantifiers in your regex are greedy; that's why they also "eat" digits. As Marc said, you don't need them.
perl -e '$_ = "qwe123qwe"; s/(\d+)/$numbers=$1/e; print $numbers . "\n";'
"something122320" =~ /(\d+)/ will return 122320; this is probably what you're trying to do ;)
\S matches any non-whitespace characters, including digits. You want \d+:
my ($number) = 'test123' =~ /(\d+)/;
If at least one other character were required before the digits (as in your example), you could use the following non-greedy expressions:
/\w+?(\d+)/ or /\S+?(\d+)/
(The second one is more in tune with your \S* specification.)
Your expression is satisfied by any string containing one or more digits, and that may be what you want. The digits could even be surrounded by spaces (" 123 "): the boundary between the leading space and the first digit satisfies zero-or-more non-space characters, and the same is true of the boundary between the '3' and the following space.
Chances are that you don't need any specification and capturing the first digits in the string is enough. But when it's not, it's good to know how to specify expected patterns.
I think parentheses signify capture groups, which is exactly what you don't want. Remove them. You're looking for /\d+/ or /[0-9]+/

Regex will not match all of my patterns

I have been trying to get this to work and I am nearly there, but I can't quite get the last match. This is the regex I'm using:
^`.*` (.*?)(\(.*?\))?\s
These are some examples of the patterns I'm trying to match
1. `asgKey` tinyblob
2. `is_asg` bit(1) DEFAULT NULL
3. `lastModified` datetime DEFAULT NULL
This regex will match 2 and 3 but not 1. I have tried adding ? and * to the space character, but then it doesn't match anything. I think I am misunderstanding the matching groups:
(.*?) - match any number of characters
(\(.*?\))? - if there are brackets, match anything inside them, else ignore
\s - space character
Group 1 is the string; group 2 is the contents of the brackets, if they exist.
You're matching them one at a time, right? Then what's the \s meant to match for #1?
`asgKey` tinyblob
|\____/| \______/
|  |   |    |
`  .*  `  (.*?)
There's nothing left, so \s can't match. Maybe you want (?:\s|$) to match a space or EOL.
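That said, consider using ([^\s(]+) instead of (.*?), as it only matches characters that are neither whitespace nor an opening parenthesis, and thus will do the same thing, but faster. A plain (\S+) would not be equivalent: being greedy, it would swallow the (1) in bit(1) and leave group 2 empty.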
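Here is a minimal sketch of the (?:\s|$) fix applied to the three sample lines, written in Python on the assumption that the constructs behave the same as in the engine used in the question. The names COLUMN and parse_column are hypothetical.

```python
import re

# Original pattern, with the trailing \s replaced by "whitespace or end of line".
COLUMN = re.compile(r"^`.*` (.*?)(\(.*?\))?(?:\s|$)")


def parse_column(line):
    """Return (type, bracket part or None) for a column line, or None if no match."""
    match = COLUMN.match(line)
    return (match.group(1), match.group(2)) if match else None


if __name__ == "__main__":
    for line in (
        "`asgKey` tinyblob",
        "`is_asg` bit(1) DEFAULT NULL",
        "`lastModified` datetime DEFAULT NULL",
    ):
        print(line, "->", parse_column(line))
```

Note that group 2 includes the brackets themselves, since they sit inside the capturing parentheses.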