Hive regex: Positive lookahead to match '&' or end of string

I would like to match text between two strings, although the last string/character might not always be available.
String1: 'www.mywebsite.com/search/keyword=toys'
String2: 'www.mywebsite.com/search/keyword=toys&lnk=hp1'
Here I want to match the value in keyword= that is 'toys' and I am using
(?<=keyword=)(.*)(?=&|$)
This works for String1, but for String2 it also matches the '&' and everything after it.
What am I doing wrong?

.* is greedy. It takes everything it can, therefore stops at the end of the string ($) and not at the & character.
Change it to its non-greedy version - .*?
with t as
(
select explode
(
array
(
'www.mywebsite.com/search/keyword=toys'
,'www.mywebsite.com/search/keyword=toys&lnk=hp1'
)
) as (val)
)
select regexp_extract(val,'(?<=keyword=)(.*?)(?=&|$)',0)
from t
;
+------+
| toys |
+------+
| toys |
+------+
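To see the difference outside Hive, here is a minimal sketch in Python. It assumes Python's re module behaves like Hive's Java-based regex engine for this particular pattern; the helper extract and the variable names are illustrative only and are not part of the original query.

```python
import re

urls = [
    "www.mywebsite.com/search/keyword=toys",
    "www.mywebsite.com/search/keyword=toys&lnk=hp1",
]

# Greedy: .* runs to the end of the string, where the $ branch of the lookahead succeeds.
greedy = re.compile(r"(?<=keyword=)(.*)(?=&|$)")
# Lazy: .*? stops at the first position where the lookahead (& or end of string) succeeds.
lazy = re.compile(r"(?<=keyword=)(.*?)(?=&|$)")


def extract(pattern, text):
    """Return the whole match (group 0), or '' when nothing matches (hypothetical helper)."""
    m = pattern.search(text)
    return m.group(0) if m else ""


greedy_results = [extract(greedy, u) for u in urls]
lazy_results = [extract(lazy, u) for u in urls]

print(greedy_results)  # ['toys', 'toys&lnk=hp1']
print(lazy_results)    # ['toys', 'toys']
```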

You do not need to bother with greediness when you need to match zero or more occurrences of any characters but a specific character (or set of characters). All you need is to get rid of the lookahead and the dot pattern and use [^&]* (or, if the value you expect should not be an empty string, [^&]+):
(?<=keyword=)[^&]+
Code:
select regexp_extract(val,'(?<=keyword=)[^&]+', 0) from t
Note you do not even need a capturing group since the 0 argument instructs regexp_extract to retrieve the value of the whole match.
Pattern details
(?<=keyword=) - a positive lookbehind that matches a location that is immediately preceded with keyword=
[^&]+ - any 1+ chars other than & (if you use * instead of +, it will match 0 or more occurrences).
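As a rough check of the negated character class, the following Python sketch runs the same pattern against the two sample URLs plus one with an empty value. Python's re is used here only as a stand-in for Hive's engine, and keyword_of is a hypothetical helper name.

```python
import re

# Lookbehind fixes the start; [^&]+ consumes everything up to (not including) the next &.
keyword_value = re.compile(r"(?<=keyword=)[^&]+")


def keyword_of(url):
    """Return the keyword value, or None when it is missing or empty."""
    m = keyword_value.search(url)
    return m.group(0) if m else None


print(keyword_of("www.mywebsite.com/search/keyword=toys"))          # toys
print(keyword_of("www.mywebsite.com/search/keyword=toys&lnk=hp1"))  # toys
print(keyword_of("www.mywebsite.com/search/keyword=&lnk=hp1"))      # None
```

With * instead of +, the last call would return an empty string rather than None.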

Related

Match string between delimiters, but ignore matches with specific substring

I have to parse all the text in parentheses, but not the text that contains "GST"
e.g:
(AUSTRALIAN RED CROSS – ATHERTON)
(Total GST for this Invoice $1,104.96)
today for a quote (07) 55394226 − admin.nerang#waste.com.au − this applies to your Nerang services.
expected parsed value:
AUSTRALIAN RED CROSS – ATHERTON
I am trying:
^\(((?!GST).)*$
But it's only matching the value and not grouping correctly.
https://regex101.com/r/HndrUv/1
What would be the correct regex for the same?
This regex should work to get the expected string:
^\((?!.*GST)(.*)\)$
It first checks, with a negative lookahead, that GST does not occur anywhere in the rest of the string (the pattern .*GST). If that holds, it then captures the entire text.
(?!.*GST)(.*)
All that is then surrounded by \( and \) to leave it out of the capturing group.
\((?!.*GST)(.*)\)
Finally you add the BOL and EOL symbols and you get the result.
^\((?!.*GST)(.*)\)$
The expected value is saved in the first capture group (.*).
You can use
^\((?![^()]*\bGST\b)([^()]*)\)$
Details:
^ - start of string
\( - a ( char
(?![^()]*\bGST\b) - a negative lookahead that fails the match if, immediately to the right of the current location, there are zero or more chars other than ) and ( and then GST as a whole word (remove \bs if you do not need whole word matching)
([^()]*) - Group 1: any zero or more chars other than ) and (
\) - a ) char
$ - end of string
Bonus:
If substrings in longer texts need to be matched, too, you need to remove ^ and $ anchors in the above regex.
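A small Python sketch of the pattern above, applied to the three sample lines from the question. The variable names are illustrative, and Python's re is assumed to be close enough to the asker's engine for this pattern.

```python
import re

# Anchored version: the whole line must be one parenthesized group without the word GST.
no_gst = re.compile(r"^\((?![^()]*\bGST\b)([^()]*)\)$")
# Unanchored version from the bonus note: finds such groups inside longer text.
no_gst_anywhere = re.compile(r"\((?![^()]*\bGST\b)([^()]*)\)")

lines = [
    "(AUSTRALIAN RED CROSS – ATHERTON)",
    "(Total GST for this Invoice $1,104.96)",
    "today for a quote (07) 55394226 − admin.nerang#waste.com.au − this applies to your Nerang services.",
]

parsed = []
for line in lines:
    m = no_gst.match(line)
    if m:
        parsed.append(m.group(1))  # group 1 is the text between the parentheses

print(parsed)                              # ['AUSTRALIAN RED CROSS – ATHERTON']
print(no_gst_anywhere.findall(lines[2]))   # ['07']
```

Note that dropping the anchors also picks up the (07) in the third line, so keep them if only whole-line matches are wanted.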

Regex allows for repeating the pattern when I don't want it to

I'm trying to take a query parameter and verify if the syntax provided by the user is correct. Regex seems like the best choice for this, but I'm having trouble making it so the pattern doesn't allow for repeating itself.
The pattern I came up with is:
(^(\w+)(=|!=|>=|>|<=|<|~)((')(.*)('))(\s(AND|OR)\s)(\w+)(=|!=|>=|>|<=|<|~)((')(.*)('))$)
The syntax provided by the user should be:
[field][predicate][single quote][value][single quote][white space][logical operator][white space][field][predicate][single quote][value][single quote]
Where:
field is [any word]
predicate is [= | != | >= | > | <= | < | ~]
logical operator is [AND | OR (with a space on both sides)]
value is [any word wrapped by single quotes]
An example looks like this: field1='value1' OR field2='value2'
The problem I am having is that the pattern I created allows for things like this:
field1='value1' OR field2='value2field1='value' OR field2='value2'' [This shouldn't work but does]
field1='value1' OR field2='value2 field1='value' OR field2='value2'' [This shouldn't work but does]
field1='value1' OR field2='value2' AND field3='value3' OR field4='value4'' [This shouldn't work but does]
Any help would be appreciated making it so the pattern doesn't match if it repeats.
You might use:
^\w+(?:<=|>=|!=|[~<>=])'\w+'(?: (?:OR|AND) \w+(?:<=|>=|!=|[~<>=])'\w+')*$
^ Start of string
\w+ Match 1 or more word chars
(?: Non capture group
<=|>=|!=|[~<>=] Match one of the alternatives
) Close group
'\w+' Match 1 or more word chars between single quotes
(?: Non capture group
 (?:OR|AND) \w+ Match a space, either OR or AND, a space and 1+ word chars
(?:<=|>=|!=|[~<>=]) Match one of the alternatives
'\w+' Match 1 or more word chars between single quotes
)* Close group and repeat 0+ times to also match without AND or OR
$ End of string
If there should be at least a single AND or OR the quantifier of the last group could be + instead of *
The single chars in the predicate could be added to a character class [~<>=] to take out a few alternations.
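Here is a hedged Python sketch of that validation. It builds the pattern from named pieces (PREDICATE, CONDITION, QUERY and is_valid are names chosen for illustration), and it writes the two-character predicates as <=, >= and != to follow the predicate list in the question.

```python
import re

# Two-character predicates come first so that ">=" is not cut short as ">".
PREDICATE = r"(?:<=|>=|!=|[~<>=])"
# One condition: field, predicate, value made of word chars inside single quotes.
CONDITION = rf"\w+{PREDICATE}'\w+'"
# One condition, then zero or more " AND condition" / " OR condition" parts.
QUERY = re.compile(rf"^{CONDITION}(?: (?:OR|AND) {CONDITION})*$")


def is_valid(query):
    """Return True when the whole string follows the expected syntax."""
    return QUERY.match(query) is not None


print(is_valid("field1='value1' OR field2='value2'"))                                     # True
print(is_valid("field1>='10'"))                                                           # True
print(is_valid("field1='value1' OR field2='value2field1='value' OR field2='value2''"))   # False
```

Because the value is \w+ rather than .*, a quote can no longer be swallowed into a value, which is what let the repeated input through in the original pattern.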

RegEx that excludes characters doesn't begin matching until 2nd character

I'm trying to create a regular expression that will include all ascii but exclude certain characters such as "+" or "%" - I'm currently using this:
^[\x00-\x7F][^%=+]+$
But I noticed (using various RegEx validators) that this pattern only begins matching with 2 characters. It won't match "a" but it will match "ab." If I remove the "[^]" section, (^[\x00-\x7F]+$) then the pattern matches one character. I've searched for other options, but so far come up with nothing. I'd like the pattern to begin matching on 1 character but also exclude characters. Any suggestions would be great!
Try this:
^(?:(?![%=+])[\x00-\x7F])+$
This will loop through, make sure that the "bad" characters aren't there with a negative lookahead, then match the "good" characters, then repeat.
You can use a negative lookahead here to exclude certain characters:
^((?![%=+])[\x00-\x7F])+$
(?![%=+]) is a negative lookahead that asserts that the next character is not one of %, = or +.
You could simply exclude those chars from the \x00-\x7f range (using the hex value of each char).
+------+-----+-----+-----+
| Char | Dec | Oct | Hex |
+------+-----+-----+-----+
| %    | 37  | 45  | 25  |
| +    | 43  | 53  | 2B  |
| =    | 61  | 75  | 3D  |
+------+-----+-----+-----+
Regex:
^[\x00-\x24\x26-\x2A\x2C-\x3C\x3E-\x7F]+$
Engine-wise this is more efficient than attempting an assertion for each character.
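The following Python sketch compares the question's pattern with the two suggested forms on a few inputs. The variable names are illustrative; no timing is measured, so the efficiency remark above is not tested here.

```python
import re

# Pattern from the question: one ASCII char, then 1+ chars that are not %, = or +.
original = re.compile(r"^[\x00-\x7F][^%=+]+$")
# Lookahead form: check each position is not a bad char, then consume one ASCII char.
lookahead_form = re.compile(r"^(?:(?![%=+])[\x00-\x7F])+$")
# Explicit ranges: ASCII with 0x25 (%), 0x2B (+) and 0x3D (=) cut out.
explicit_ranges = re.compile(r"^[\x00-\x24\x26-\x2A\x2C-\x3C\x3E-\x7F]+$")

for text in ["a", "ab", "a+b", "50%", "x=1"]:
    print(
        repr(text),
        bool(original.match(text)),
        bool(lookahead_form.match(text)),
        bool(explicit_ranges.match(text)),
    )

# The original needs two characters, and its first class does not exclude anything:
print(bool(original.match("a")))   # False
print(bool(original.match("+a")))  # True
```

The two replacement patterns agree on every input shown; the original differs because its two classes each consume at least one character and only the second one applies the exclusion.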

Pipe separated values in groups of 3 regex

I have the following string
abc|ghy|33d
The regex below matches it fine
^([\d\w]{3}[|]{1})+[\d\w]{3}$
The string changes but the characters separated by the pipe are always in 3's ... so we can have
krr|455
we can also have
ddc
Here's where the problem happens: The regex explained above doesn't match the string if there is only one set of letters ... i.e. "dcc"
Let's do this step by step.
Your regex :
^([\d\w]{3}[|]{1})+[\d\w]{3}$
We can already see some changes. [|]{1} is equivalent to \|.
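Then, we see that you match the first part (aaa|) at least once (the + operator matches one or more times), which is why a single group such as ddc fails. Also, \w already matches digits, so [\d\w] can be shortened to \w.
The * operator matches zero or more times, so: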
^(?:\w{3}\|)*\w{3}$
works.
Explanation
^ matches beginning of string
(?:something)* matches something zero or more times. The group is non-capturing as you won't need to capture it
\w{3} matches 3 word characters (letters, digits or underscore)
\| matches |
$ matches end of string.
^[\d\w]{3}(?:[|][\d\w]{3}){0,2}$
You simply quantify the variable part. See the demo:
https://regex101.com/r/tS1hW2/18
You can modify your regex as below:
^([\d\w]{3})(\|[\d\w]{3})*$
Here, first match 3 alphanumeric characters, and then zero or more groups of 3 alphanumeric characters, each prefixed with |.
Your description is a little awkward, but I'm guessing you want to be able to match
abc
abc|def
abc|def|ghi
You can do that with
/^\w{3}(?:\|\w{3}){0,2}$/
Explanation
^ - match beginning of string
\w{3} - match any 3 of [A-Za-z0-9_]
(?: ... ){0,2} - non-capturing group, repeated 0 to 2 times
\| - literal | character
$ - end of string
If the goal is to match any number of 3-character segments, you can use the following (note that it also accepts a trailing |, as in abc|)
/^(?:\w{3}(?:\||$))+$/
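To compare the suggested patterns side by side, here is a short Python sketch. The variable names are made up for illustration, and the regex delimiters / ... / are dropped because Python takes the pattern as a plain string.

```python
import re

# Pattern from the question: requires at least one "xxx|" prefix group.
original = re.compile(r"^([\d\w]{3}[|]{1})+[\d\w]{3}$")
# Any number of groups: zero or more "xxx|" prefixes, then a final "xxx".
any_count = re.compile(r"^(?:\w{3}\|)*\w{3}$")
# One to three groups: "xxx" followed by at most two "|xxx" parts.
up_to_three = re.compile(r"^\w{3}(?:\|\w{3}){0,2}$")
# Each group is followed by | or end of string, repeated.
segment_loop = re.compile(r"^(?:\w{3}(?:\||$))+$")

samples = ["abc|ghy|33d", "krr|455", "ddc", "abc|def|ghi|jkl", "abc|", "abcd"]
for text in samples:
    print(
        text.ljust(16),
        bool(original.match(text)),
        bool(any_count.match(text)),
        bool(up_to_three.match(text)),
        bool(segment_loop.match(text)),
    )
```

In this run only the original rejects ddc, only up_to_three rejects four groups, and only segment_loop accepts the trailing pipe in abc|.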

Regex will not match all of my patterns

I have been trying to get this to work and I am nearly there, but I can't quite get the last match. This is the regex I'm using:
^`.*` (.*?)(\(.*?\))?\s
These are some examples of the patterns I'm trying to match
1.`asgKey` tinyblob
2.`is_asg` bit(1) DEFAULT NULL
3.`lastModified` datetime DEFAULT NULL
This regex will match 2 and 3 but not 1. I have tried adding ? and * to the space char, but then it doesn't match anything. I think I am misunderstanding the matching groups
(.*?) - match any number of characters
(\(.*?\))? - if there are brackets match anything inside them else ignore
\s - space character
Group 1 is the string; group 2 is the contents of the brackets, if they exist
You're matching them one at a time, right? Then what's the \s meant to match for #1?
`asgKey` tinyblob
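Here ` matches the opening backtick, .* matches asgKey, ` matches the closing backtick, the literal space matches the space, and (.*?) has to expand to cover tinyblob.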
There's nothing left, so \s can't match. Maybe you want (?:\s|$) to match a space or EOL.
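That said, consider using ([^\s(]+) instead of (.*?): it only matches characters that are neither whitespace nor an opening parenthesis, so it captures the same thing with less backtracking. (A plain (\S+) would also swallow a parenthesized part such as (1), leaving group 2 empty.)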
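A minimal Python sketch of the fix, run against the three sample lines without their list numbers. The pattern names and the parse helper are illustrative assumptions, and Python's re stands in for whatever engine the asker uses.

```python
import re

lines = [
    "`asgKey` tinyblob",
    "`is_asg` bit(1) DEFAULT NULL",
    "`lastModified` datetime DEFAULT NULL",
]

# Pattern from the question: the trailing \s needs a whitespace char after the type.
original = re.compile(r"^`.*` (.*?)(\(.*?\))?\s")
# Suggested fix: accept either whitespace or end of string after the type.
fixed = re.compile(r"^`.*` (.*?)(\(.*?\))?(?:\s|$)")
# \S+ is greedy and also matches parentheses, so it takes "bit(1)" as a whole.
non_space = re.compile(r"^`.*` (\S+)(\(.*?\))?(?:\s|$)")
# Excluding "(" as well keeps the parenthesized part available for group 2.
no_paren = re.compile(r"^`.*` ([^\s(]+)(\(.*?\))?(?:\s|$)")


def parse(pattern, line):
    """Return (group 1, group 2) for a matching line, or None (hypothetical helper)."""
    m = pattern.match(line)
    return m.groups() if m else None


for line in lines:
    print(parse(original, line), parse(fixed, line), parse(non_space, line), parse(no_paren, line))
```

Group 2 comes back as None when the type has no parenthesized part.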