Regex Matching a string - c++

I am currently writing a regex to match strings such as these:
( expr ) | id | num
term * factor | factor
expr
I want the regex to match each set of characters between the ' | ' separators, but also match solo expressions such as:
expr
I currently have this, but I am doing my negative lookahead wrong and I am not really sure how to proceed.
((.*) \|) (.*)$
P.S. I am not really fond of using .* in this situation, but I cannot think of another way to match, because the characters between the ' | 's can be word characters, digits, or anything in between.
EDIT:
I would like the output matches to look like this:
Regex ran on line 1, output:
3 matches - ( expr ), id, num
Regex ran on line 2:
2 matches - term * factor, factor
Regex ran on line 3:
1 match - expr

This could be your simple regex:
[^|]+
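This captures one or more characters up to the next "|" (or the end of the string). Note that each match keeps the spaces around it.
Alternatively, in Java you could use String.split with the pipe escaped, because split takes a regex: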
String line = "term * factor | factor";
String[] split = line.split("\\|");
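The same idea can be sketched in Python (used here only for illustration; the question is tagged C++ and the answer above uses Java). The function name split_alternatives is hypothetical. It collects every run of non-| characters and strips the surrounding whitespace that the bare [^|]+ pattern would otherwise keep.

```python
import re

def split_alternatives(line):
    """Return the pieces of `line` separated by '|', without surrounding spaces."""
    # [^|]+ matches a maximal run of characters that are not a pipe.
    pieces = re.findall(r"[^|]+", line)
    # Strip whitespace and drop pieces that were whitespace only.
    return [p.strip() for p in pieces if p.strip()]

for line in ["( expr ) | id | num", "term * factor | factor", "expr"]:
    matches = split_alternatives(line)
    print(len(matches), "matches -", ", ".join(matches))
```

A line without any | comes back as a single-element list, which covers the "solo expression" case from the question.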

Related

Protect escaped chars from pattern ending

Assuming you have a pattern "A<(.*?)>"
Using Java, Pattern, Matcher, matcher.find() method as an example.
As input you have "A<v1>" --> Pattern is matching and the group(1) is "v1"
As input you have "A<v1>v2>" --> Pattern is matching and the group(1) is "v1" due to "?" turning ".*" to non-greedy.
Now assume a user wants to protect part of the input by escaping it, as in:
"A<v1\>v2>", so that the pattern matches and group(1) has the value "v1>v2".
The pattern should stay non-greedy, but an escaped character should be protected and become part of the value (the group).
The pattern is applied in a "while" loop because I want to find all occurrences of the pattern in the input. So the pattern should accept as little as possible (non-greedy) but still handle the escaped character (here ">", which ends my pattern).
Any hints?
Thanks in advance.
You can accept \> as a valid expression to match:
A<((\\>|.)*?)>
The group (\\>|.) matches either the two characters \> or, if that fails, any single character (.). The order matters: \> consumes two characters while . consumes only one, so if . came first in the alternation it would gobble up the \ character on its own.
To illustrate:
A < v 1 \> v 2 >
| | | | | | | |
A < ( . . \> . . )*? >
However, the resulting match would be v1\>v2, so you'll need to do some processing after the fact to convert \> to >
If you wanted to go even further and allow escaping the \ character, you could use a character class like so:
A<((\\[>\\]|.)*?)>
Which would match the following:
A<v1\\>
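A minimal sketch of this approach, written in Python rather than the Java of the question and using the variant that also allows an escaped backslash. The names PATTERN and extract_values are hypothetical. The second step performs the post-processing mentioned above, turning \> into > and \\ into \.

```python
import re

# Escaped '>' or '\' is tried before '.', so an escape is consumed as a pair.
PATTERN = re.compile(r"A<((?:\\[>\\]|.)*?)>")

def extract_values(text):
    """Find every A<...> occurrence and return the unescaped group values."""
    values = []
    for m in PATTERN.finditer(text):
        raw = m.group(1)
        # Drop the escaping backslash in front of '>' or '\'.
        values.append(re.sub(r"\\([>\\])", r"\1", raw))
    return values

print(extract_values(r"A<v1\>v2>"))
print(extract_values(r"A<a> and A<b\>c>"))
```

finditer plays the role of the "while" loop around matcher.find(): it yields each non-overlapping match from left to right.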

RegExp - Notepad++ combine statements

I'm missing something with this regular expression find/replace attempt. I have the following format:
word | word | word
I would like to first replace every word with "word" to produce
"word" | "word" | "word"
and then replace every [space]| with a comma, finally producing
"word", "word", "word"
Obviously I could just do this with two simple find/replace commands (find ([a-z]*\>) and replace with "$1"; then find [space]| and replace with ,), but is there a way to do all of this at once?
I've tried lots of different ideas, but they all failed. The most successful was finding ([a-z]*\>)(( \|)|\R) and replacing with "$1", (including the trailing comma), which only ever got me a "word", "word", word format. The solution is probably either much more complicated or much simpler than what I'm trying, but I'm stumped. Thanks!
You may use
(\w+)|\s*\|
and replace with (?1"$1":,).
Details
(\w+) - Group 1: one or more word chars
| - or
\s*\| - 0+ whitespaces and then a | char.
(?1"$1":,) - a conditional replacement pattern: if Group 1 matched, replace with Group 1 wrapped in double quotes; otherwise replace with a comma.
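To see the logic of the conditional replacement outside Notepad++, here is a rough Python sketch. Python's re module has no conditional replacement syntax, so a replacement function (the hypothetical quote_and_comma and its inner repl) makes the same decision based on whether Group 1 took part in the match.

```python
import re

def quote_and_comma(line):
    """Quote each word and turn ' |' separators into commas in one pass."""
    def repl(m):
        # Group 1 is set only when the (\w+) alternative matched.
        if m.group(1) is not None:
            return '"' + m.group(1) + '"'
        return ","
    return re.sub(r"(\w+)|\s*\|", repl, line)

print(quote_and_comma("word | word | word"))
```

The space after each | is not part of any match, so it survives and ends up after the comma.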

Hive regex: Positive lookahead to match '&' or end of string

I would like to match text between two strings, although the last string/character might not always be present.
String1: 'www.mywebsite.com/search/keyword=toys'
String2: 'www.mywebsite.com/search/keyword=toys&lnk=hp1'
Here I want to match the value in keyword= that is 'toys' and I am using
(?<=keyword=)(.*)(?=&|$)
This works for String1, but for String2 the match also includes the '&' and everything after it.
What am I doing wrong?
.* is greedy. It takes everything it can, therefore stops at the end of the string ($) and not at the & character.
Change it to its non-greedy version - .*?
with t as
(
    select explode
    (
        array
        (
            'www.mywebsite.com/search/keyword=toys'
           ,'www.mywebsite.com/search/keyword=toys&lnk=hp1'
        )
    ) as (val)
)
select regexp_extract(val,'(?<=keyword=)(.*?)(?=&|$)',0)
from t
;
+------+
| toys |
+------+
| toys |
+------+
You do not need to bother with greediness when you need to match zero or more occurrences of any characters but a specific character (or set of characters). All you need is to get rid of the lookahead and the dot pattern and use [^&]* (or, if the value you expect should not be an empty string, [^&]+):
(?<=keyword=)[^&]+
Code:
select regexp_extract(val,'(?<=keyword=)[^&]+', 0) from t
Note you do not even need a capturing group since the 0 argument instructs regexp_extract to retrieve the value of the whole match.
Pattern details
(?<=keyword=) - a positive lookbehind that matches a location that is immediately preceded with keyword=
[^&]+ - any 1+ chars other than & (if you use * instead of +, it will match 0 or more occurrences).
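The three patterns discussed above can be compared side by side in a short Python sketch (Python instead of Hive, purely so the behaviour can be tried locally). The helper extract and the constant names are hypothetical; group(0) stands in for the 0 argument of regexp_extract.

```python
import re

URLS = [
    "www.mywebsite.com/search/keyword=toys",
    "www.mywebsite.com/search/keyword=toys&lnk=hp1",
]

GREEDY = re.compile(r"(?<=keyword=)(.*)(?=&|$)")    # pattern from the question
LAZY = re.compile(r"(?<=keyword=)(.*?)(?=&|$)")     # non-greedy fix
NEGATED = re.compile(r"(?<=keyword=)[^&]+")         # negated character class

def extract(pattern, url):
    """Return the whole match, or None when the pattern does not match."""
    m = pattern.search(url)
    return m.group(0) if m else None

for url in URLS:
    print(extract(GREEDY, url), extract(LAZY, url), extract(NEGATED, url))
```

Only the greedy version runs past the & in the second URL; the lazy and negated-class versions stop in front of it.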

RegEx that excludes characters doesn't begin matching until 2nd character

I'm trying to create a regular expression that will include all ascii but exclude certain characters such as "+" or "%" - I'm currently using this:
^[\x00-\x7F][^%=+]+$
But I noticed (using various RegEx validators) that this pattern only begins matching with 2 characters. It won't match "a" but it will match "ab." If I remove the "[^]" section, (^[\x00-\x7F]+$) then the pattern matches one character. I've searched for other options, but so far come up with nothing. I'd like the pattern to begin matching on 1 character but also exclude characters. Any suggestions would be great!
Try this:
^(?:(?![%=+])[\x00-\x7F])+$
At each position, the negative lookahead first checks that the next character is not one of the "bad" ones, then [\x00-\x7F] consumes that character, and the group repeats.
You can use a negative lookahead here to exclude certain characters:
^((?![%=+])[\x00-\x7F])+$
(?![%=+]) is a negative lookahead that asserts that the next character is not one of [%=+].
You could simply exclude those chars from the \x00-\x7f range (using the hex value of each char).
+------+-----+-----+-----+
| Char | Dec | Oct | Hex |
+------+-----+-----+-----+
| %    | 37  | 45  | 25  |
| +    | 43  | 53  | 2B  |
| =    | 61  | 75  | 3D  |
+------+-----+-----+-----+
Regex:
^[\x00-\x24\x26-\x2A\x2C-\x3C\x3E-\x7F]+$
Engine-wise this is more efficient than attempting an assertion for each character.
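The original pattern needs two characters because it is two character classes in a row: [\x00-\x7F] consumes exactly one character and [^%=+]+ requires at least one more. The second class also accepts non-ASCII characters, since it only excludes three symbols. The Python sketch below checks this and compares the two suggested fixes; accepts and the constant names are hypothetical helpers.

```python
import re

ORIGINAL = re.compile(r"^[\x00-\x7F][^%=+]+$")
LOOKAHEAD = re.compile(r"^(?:(?![%=+])[\x00-\x7F])+$")
RANGES = re.compile(r"^[\x00-\x24\x26-\x2A\x2C-\x3C\x3E-\x7F]+$")

def accepts(pattern, text):
    """True when the whole text is matched by the anchored pattern."""
    return pattern.search(text) is not None

# One row per sample: original pattern, lookahead fix, explicit-range fix.
for sample in ["a", "ab", "a+b", "50%", "x=y"]:
    print(repr(sample),
          accepts(ORIGINAL, sample),
          accepts(LOOKAHEAD, sample),
          accepts(RANGES, sample))
```

Both fixes accept the same set of single ASCII characters, so the choice between them is mostly about readability versus the per-character lookahead.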

Match a number in a string with letters and numbers

I need to write a Perl regex to match numbers in a word with both letters and numbers.
Example: test123. I want to write a regex that matches only the number part and capture it
I am trying this \S*(\d+)\S* and it captures only the 3 but not 123.
Quantified regex atoms are greedy by default: each one matches as much as it can.
Initially, the first \S* matched "test123", but the regex engine had to backtrack to allow \d+ to match. The result is:
 +------------------- Matches "test12"
 |     +------------- Matches "3"
 |     |      +------ Matches ""
 |     |      |
---  -----   ---
\S*  (\d+)   \S*
All you need is:
my ($num) = "test123" =~ /(\d+)/;
It'll try to match at position 0, then position 1, and so on until it finds a digit; then it will match as many digits as it can.
The * quantifiers in your regex are greedy, which is why they also "eat" the digits. Exactly as #Marc said, you don't need them.
perl -e '$_ = "qwe123qwe"; s/(\d+)/$numbers=$1/e; print $numbers . "\n";'
"something122320" =~ /(\d+)/ will return 122320; this is probably what you're trying to do ;)
\S matches any non-whitespace characters, including digits. You want \d+:
my ($number) = 'test123' =~ /(\d+)/;
If some character were required before the number (as in your example), you could use the following non-greedy expressions:
/\w+?(\d+)/ or /\S+?(\d+)/
(The second one is more in tune with your \S* specification.)
Your expression matches any string that contains one or more digits, and that may be what you want. It would even match a string of digits surrounded by spaces (" 123 "), because \S* can match the empty string at the boundary between the last space and the first digit, and again at the boundary between the '3' and the following space.
Chances are that you don't need any specification and capturing the first digits in the string is enough. But when it's not, it's good to know how to specify expected patterns.
Parentheses create capture groups. If you only need to match the digits and not capture them, you can drop the parentheses: /\d+/ or /[0-9]+/
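The backtracking behaviour described in the answers can be reproduced with a small Python sketch (Python's re uses the same greedy and lazy quantifier semantics for these patterns as Perl). The variable names are only illustrative.

```python
import re

word = "test123"

# Greedy \S* first grabs the whole word, then gives back one character at a
# time until \d+ can match, so the group ends up with only the last digit.
greedy = re.match(r"\S*(\d+)\S*", word).group(1)

# Searching for the digits alone lets the engine scan forward to the first
# digit and then take the whole run.
plain = re.search(r"(\d+)", word).group(1)

# A lazy prefix takes as few characters as possible before the digits,
# but still requires at least one.
lazy = re.match(r"\S+?(\d+)", word).group(1)

print(greedy, plain, lazy)
```

The plain (\d+) search is the simplest choice when nothing specific has to precede the number.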