Regex: extract string NOT between two brackets

OK, regex question: how do I match a character that is NOT between two other characters, in this case brackets?
I have a string such as:
word1 | {word2 | word3 } | word 4
I only want to match the first and last pipe, not the second one, which is between brackets. I have made a myriad of attempts with negated character classes (carets) and negative groupings and can't seem to get it to work.
Basically I am using this regex in a JavaScript split function to split this into an array containing: "word1", "{word2 | word3}", "word4".
Any assistance would be greatly appreciated!

Try using this pattern
/\|(?![^{]*})/g
with this text
word1 | {word2 | word3 } | word 4 | word 4 | {word2 | word3 }
This should match all of the pipe symbols that are not inside {}.

Depends on the language/implementation you're using, but...
\|(?![^{]*})
This matches a pipe that is not followed by a } except in the case that a { comes first.
The (?! ... ) is known as a negative lookahead assertion. This is easier to understand if we start with a positive lookahead assertion:
\|(?=[^{]*})
The above only matches a pipe that is followed by a } without encountering a { first. When you negate that by replacing the = with a !, the match is now only successful if there's no way for the positive case to be true (also known as the complement).
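Since the question mentions JavaScript's split function, here is a minimal sketch of how the pattern might be used there. The helper name splitOutsideBraces and the \s* around the pipe (to trim the surrounding whitespace) are my own additions, and the sketch assumes braces are balanced and never nested.

```javascript
// Hypothetical helper: split on pipes that are not inside {...}.
// Assumes braces are balanced and never nested.
function splitOutsideBraces(text) {
  // \s* on both sides swallows the whitespace around each delimiter.
  // (?![^{]*}) rejects a pipe if a } follows before any {.
  return text.split(/\s*\|\s*(?![^{]*})/);
}

const parts = splitOutsideBraces("word1 | {word2 | word3 } | word 4");
console.log(parts); // [ 'word1', '{word2 | word3 }', 'word 4' ]
```

Note that the whitespace inside the braces is left untouched, because only the delimiters outside the braces are matched.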

Related

Can you match only a single character within parentheses for replacement using regex?

I have a weird case where the only real tool I have to use is Notepad++ without some heavy lifting, and I have a | delimited text file that has |s in the text that I need to remove.
Each | that I need to remove falls within parentheses, so the text patterns look like this:
(123 | 456) (11.1 | 11.2)
...and so on.
My ideal result would be removing the |s contained within ()s and replacing with a -, so:
(123 - 456) (11.1 - 11.2)
So far I have:
\(.*\|.*\)
That reliably matches each set of parentheses that contains a |, but I can't figure out a way to match just the | itself for replacement. Any ideas?
With your shown samples, please try the following regex in Notepad++:
Find what: ([^|]*)\|([^)]*\))
Replace with: $1-$2
Explanation of the regex:
([^|]*) ##Creating 1st capturing group here, which has everything till | comes.
\| ##Matching literal | here.
([^)]*\)) ##Creating 2nd capturing group here, which has everything till ) here including ).
You can use
(\([^()|]*)\|(?=[^()]*\))
and replace with $1-. Details:
(\([^()|]*) - Group 1: ( char and then zero or more chars other than (, ) and |
\| - a | char
(?=[^()]*\)) - there must be zero or more chars other than ( and ) and then a ) char immediately to the right of the current location
If you have multiple pipes (like in (123 | 456 | 23) (11.1 | 11.2 | 788 | 6896)):
(\G(?!^)|\()([^()|]*)\|(?=[^()]*\))
In this case, replace with $1$2-. This pattern is compatible with some other common text editors as well, which is why I did not suggest a pattern with \K.
I just tested the following, which is a bit safer to use, although the pattern is somewhat longer:
Find: (\(\d+[. \d]*)[|](?=[ \d.]*\))
Replace All: $1-
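For readers who are not limited to Notepad++, the lookahead idea from the answers above can be sketched in JavaScript. This is a simplified variant rather than a port of any one answer: because replace only substitutes the matched pipe, no capture group is needed. The function name dashPipesInParens is hypothetical, and the sketch assumes parentheses are balanced and never nested.

```javascript
// Hypothetical helper: replace every | that sits inside (...) with -.
// Assumes parentheses are balanced and never nested.
function dashPipesInParens(line) {
  // A pipe counts as "inside" when a ) follows before any other ( or ).
  return line.replace(/\|(?=[^()]*\))/g, "-");
}

console.log(dashPipesInParens("a | (123 | 456) | (11.1 | 11.2 | 788)"));
// a | (123 - 456) | (11.1 - 11.2 - 788)
```

The delimiter pipes outside the parentheses are left alone, and groups with several pipes are handled without any extra work.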

Conditional regex in Ruby

I've got the following string:
'USD 100'
Based on this post I'm trying to capture 100 if USD is contained in the string or the individual (currency) characters if USD is not contained in the string.
For example:
'USD 100' # => '100'
'YEN 300' # => ['Y', 'E', 'N']
So far I've got up to this but it's not working:
https://rubular.com/r/cK8Hn2mzrheHXZ
Interestingly if I place the USD after the amount it seems to work. Ideally I'd like to have the same behaviour regardless of the position of the currency characters.
Your regex (?=.*(USD))(?(1)\d+|[a-zA-Z]) does not work because
(?=.*(USD)) - a positive lookahead, triggered at every location inside a string (if scan is used) that matches USD substring after any 0 or more chars other than line break chars as many as possible (it means, there will only be a match if there is USD somewhere on a line)
(?(1)\d+|[a-zA-Z]) - a conditional construct that matches 1+ digits if Group 1 matched (if there is USD), or, an ASCII letter will be tried. However, the second alternative pattern will never be tried, because you required USD to be present in the string for a match to occur.
Look at the USD 100 regex debugger, it shows exactly what happens when the (?=.*(USD))(?(1)\d+|[a-zA-Z]) regex tries to find a match:
Step 1 to 22: The lookahead pattern is tried first. The point here is that the match will fail immediately if the positive lookahead pattern does not find a match. In this case, USD is found at the start of the string (since the first time the pattern is tried, the regex index is at the string start position). The lookahead found a match.
Step 23-25: Since a lookahead is a non-consuming pattern, the regex index is still at the string start position. The lookahead says "go ahead", and the conditional construct is entered. The (?(1) condition is met: Group 1, USD, was matched. So the first, "then", part is triggered. \d+ does not find any digits, since there is a U letter at the start. The regex match fails at the string start position, but there are more positions in the string to test, since there is no \A or ^ anchor that would only allow a match at the start of the string/line.
Step 26: The regex engine index is advanced one char to the right, now, it is right before the letter S.
Step 27-40: The regex engine wants to find 0+ chars and then USD immediately to the right of the current location, but fails (U is already "behind" the index).
Then, the execution is just the same as described above: the regex fails to match USD anywhere to the right of the current location and eventually fails.
If the USD is somewhere to the right of 100, then you'd get a match.
So, the lookahead does not set any search range, it simply allows matching the rest of the patterns (if its pattern matches) or not (if its pattern is not found).
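The non-consuming behaviour described above can be observed in isolation. The following is a rough JavaScript sketch rather than Ruby: JavaScript has no (?(1)...) conditional, so it only keeps the lookahead and the \d+ branch, which is an assumption made purely for illustration.

```javascript
// Only the lookahead plus the "then" branch of the original pattern.
// JavaScript does not support the (?(1)...|...) conditional itself.
const pattern = /(?=.*USD)\d+/;

// Index 0: the lookahead succeeds, but \d+ is then tried on "U" and fails.
// Every later index: USD is no longer to the right, so the lookahead fails.
console.log(pattern.exec("USD 100")); // null

// Here USD is still ahead when the engine is positioned at the digits.
console.log(pattern.exec("100 USD")[0]); // 100
```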
You may use
.scan(/^USD.*?\K(\d+)|([a-zA-Z])/).flatten.compact
Pattern details
^USD.*?\K(\d+) - either USD at the start of the string, then any 0 or more chars other than line break chars as few as possible, and then the text matched is dropped and 1+ digits are captured into Group 1
| - or
([a-zA-Z]) - any ASCII letter captured into Group 2.
See Ruby demo:
p "USD 100".scan(/^USD.*?\K(\d+)|([a-zA-Z])/).flatten.compact
# => ["100"]
p "YEN 100".scan(/^USD.*?\K(\d+)|([a-zA-Z])/).flatten.compact
# => ["Y", "E", "N"]
Anatomy of your pattern
(?=.*(USD))(?(1)\d+|[a-zA-Z])
(?=.*(USD)) - Positive lookahead
(USD) - Capture group 1
(?(1)\d+|[a-zA-Z]) - If clause
(1) - Test for group 1
\d+ - If group 1 exists, match 1+ digits
[a-zA-Z] - Else match a single char a-zA-Z
About the pattern you tried
The positive lookahead is not anchored and will be tried on each position. It will continue the match if it returns true, else the match stops and the engine will move to the next position.
Why does the pattern not match?
On the first position the lookahead is true as it can find USD on the right.
It tries to match 1+ digits, but the first char is U which it can not match.
USD 100
⎸
First position
From the second position till the end, the lookahead is false because it can not find USD on the right.
USD 100
 ⎸
 Second position
Eventually, the if clause is only tried once, where it could not match 1+ digits. The else clause is never tried and overall there is no match.
For the YEN 300 part, the if clause is never tried as the lookahead will never find USD at the right and overall there is no match.
Interesting resources about conditionals can be for example found at rexegg.com and regular-expressions.info
If you want the separate matches, you might use:
\bUSD \K\d+|[A-Z](?=[A-Z]* \d+\b)
Explanation
\bUSD - Assert a word boundary, then match USD and a space
\K\d+ - Forget what is matched so far using \K, then match 1+ digits
| - Or
[A-Z] - Match a single char A-Z
(?=[A-Z]* \d+\b) - Assert that what is on the right is optional chars A-Z, a space, 1+ digits and a word boundary
Or using capturing groups:
\bUSD \K(\d+)|([A-Z])(?=[A-Z]* \d+\b)
The following pattern seems to work:
\b(?:USD (\d+)|(?!USD\b)(\w+) \d+)\b
This works, with the caveat that it has only a single capture group for the non-USD currency symbol. One part of the regex might merit explanation:
(?!USD\b)(\w+)
This uses a negative lookahead to assert that the currency symbol is not USD. If so, then it captures that currency symbol.
I suggest the information desired be extracted as follows.
R = /\b([A-Z]{3}) +(\d+)\b/
def doit(str)
  str.scan(R).each_with_object({}) do |(cc,val),h|
    h[cc] = (cc == 'USD') ? val : cc.split('')
  end
end
doit 'USD 100'
#=> {"USD"=>"100"}
doit 'YEN 300'
#=> {"YEN"=>["Y", "E", "N"]}
doit 'I had USD 6000 to spend'
#=> {"USD"=>"6000"}
doit 'I had YEN 25779 to spend'
#=> {"YEN"=>["Y", "E", "N"]}
doit 'I had USD 60 and CDN 80 to spend'
#=> {"USD"=>"60", "CDN"=>["C", "D", "N"]}
doit 'USD -100'
#=> {}
doit 'YENS 4000'
#=> {}
Ruby's regex engine performs the following operations.
\b : assert a word boundary
([A-Z]{3}) : match 3 uppercase letters in capture group 1
\ + : match 1+ spaces
(\d+) : match 1+ digits in capture group 2
\b : assert a word boundary
TL;DR
An excellent working solution can be found in Wiktor's answer and the rest of the posts.
Long answer:
Since I wasn't perfectly satisfied with Wiktor's explanation of why my solution wasn't working, I decided to dig into it a bit more myself and this is my take on it:
Given the string USD 100, the following regex
(?=.*(USD))(?(1)\d+|[a-zA-Z])
simply won't work. The juice of this whole thing is to figure out why.
It turns out that the lookahead (?=.*(USD)) only checks that USD occurs somewhere to the right of the current position; it does not move that position. The conditional (?(1)\d+|[a-zA-Z]) is then applied at the very position where the lookahead started, so the digits would have to start at a position from which USD is still ahead, and in USD 100 there are none.
If we break it down in steps here's an outline of what -I think- is happening:
The pointer is set at the very beginning. The lookahead (?=.*(USD)) is parsed and executed.
USD is found but since the expression is a lookahead the pointer remains at the beginning of the string and is not consumed.
The conditional ((?(1)\d+|[a-zA-Z])) is parsed and executed.
Group 1 is set (since USD has been found), so \d+ is tried. However, \d+ is tried at the pointer's current position, which is still the beginning of the string, and the character there is U, not a digit. After all, that's exactly why it's called a lookahead: it looks ahead without moving the pointer.
The engine then retries from each later position, but from there USD is no longer ahead, so the lookahead itself fails. Since no position satisfies both parts, the regex returns no results. And as Wiktor correctly pointed out:
the second alternative pattern will never be tried, because you required USD to be present in the string for a match to occur.
which basically says that the lookahead makes USD mandatory for any match, and whenever USD is found Group 1 is set, so the engine never takes the "else" branch.
As a counterexample, if the same regex is tested on a string where the digits come before USD, it will work:
'100 USD'
Hope this helps someone in the future.

Regex that match table input

I have this kind of input
||ID||Part Number||Product Name||Serial Number||Status||Dunning Status||Commitment End||Address||Country||
|1|SX0486|Mobilný Hlas Postpaid|0911193419|Active|Closed|04. 08. 2020| | |
I am looking for two regexes. The first should match only the header cells in ||ID||Part Number||Product Name||Serial Number||Status||Dunning Status||Commitment End||Address||Country|| out of the whole table input, so there should be no match in |1|SX0486|Mobilný Hlas Postpaid|0911193419|Active|Closed|04. 08. 2020| | |. For the second, I could theoretically just split by newlines and by |.
I have tried something like [^\|\|]+(?=\|\|). Is this a good solution?
You can't negate a sequence of characters with a negated character class, only individual chars.
I suggest using a regex that will extract any chunks of chars other than | between double ||:
(?<=\|\|)[^|]+(?=\|\|)
Details
(?<=\|\|) - two | chars must be present immediately on the left
[^|]+ - 1+ chars other than |
(?=\|\|) - two | chars must be present immediately on the right.
If you ever need to make sure there are exactly two pipes on each side, and not match if there are three or more, you will need to make the pattern more precise: (?<=(?<!\|)\|\|)[^|]+(?=\|\|(?!\|)).
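As a quick check of both patterns, here is a small JavaScript sketch (lookbehind requires an ES2018-capable engine). The table is a shortened, hypothetical version of the one in the question, and the "||a||b|||" string is made up to show the difference between the two patterns.

```javascript
// Shortened, hypothetical version of the table from the question.
const table = [
  "||ID||Part Number||Product Name||",
  "|1|SX0486|Mobilný Hlas Postpaid|",
].join("\n");

// Cells with || immediately on the left and on the right.
const headers = table.match(/(?<=\|\|)[^|]+(?=\|\|)/g);
console.log(headers); // [ 'ID', 'Part Number', 'Product Name' ]

// Stricter variant: exactly two pipes on each side, so "b" (followed by |||) is skipped.
const strict = "||a||b|||".match(/(?<=(?<!\|)\|\|)[^|]+(?=\|\|(?!\|))/g);
console.log(strict); // [ 'a' ]
```

Nothing in the data row matches, because its cells are delimited by single pipes.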

RegEx that excludes characters doesn't begin matching until 2nd character

I'm trying to create a regular expression that will include all ascii but exclude certain characters such as "+" or "%" - I'm currently using this:
^[\x00-\x7F][^%=+]+$
But I noticed (using various RegEx validators) that this pattern only begins matching with 2 characters. It won't match "a" but it will match "ab." If I remove the "[^]" section, (^[\x00-\x7F]+$) then the pattern matches one character. I've searched for other options, but so far come up with nothing. I'd like the pattern to begin matching on 1 character but also exclude characters. Any suggestions would be great!
Try this:
^(?:(?![%=+])[\x00-\x7F])+$
This will loop through, make sure that the "bad" characters aren't there with a negative lookahead, then match the "good" characters, then repeat.
You can use a negative lookahead here to exclude certain characters:
^((?![%=+])[\x00-\x7F])+$
(?![%=+]) is a negative lookahead that will assert that matched character is not one of the [%=+].
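It may help to see why the original pattern needs two characters: [\x00-\x7F] consumes exactly one character and [^%=+]+ then demands at least one more, so the two classes apply to different characters instead of restricting the same one. A small JavaScript sketch comparing it with the lookahead version (the test strings are arbitrary examples of mine):

```javascript
const original = /^[\x00-\x7F][^%=+]+$/;      // one ASCII char, then 1+ chars that are not % = +
const fixed = /^(?:(?![%=+])[\x00-\x7F])+$/;  // 1+ ASCII chars, each one vetted by the lookahead

console.log(original.test("a"));  // false: nothing is left for [^%=+]+
console.log(original.test("+a")); // true: the first char is never checked against % = +
console.log(fixed.test("a"));     // true
console.log(fixed.test("a+b"));   // false
console.log(fixed.test("é"));     // false: not ASCII
```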
You could simply exclude those chars from the \x00-\x7f range (using the hex value of each char).
+----------------+
|Char|Dec|Oct|Hex|
+----------------+
| % |37 |45 |25 |
+----------------+
| + |43 |53 |2B |
+----------------+
| = |61 |75 |3D |
+----------------+
Regex:
^[\x00-\x24\x26-\x2A\x2C-\x3C\x3E-\x7F]+$
Engine-wise this is more efficient than attempting an assertion for each character.
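Hand-written hex ranges are easy to get wrong, so one way to gain confidence is to compare the two patterns on every ASCII character. This is only a sanity check sketched in JavaScript; it says nothing about the relative speed of the two approaches.

```javascript
const withLookahead = /^(?:(?![%=+])[\x00-\x7F])+$/;
const withRanges = /^[\x00-\x24\x26-\x2A\x2C-\x3C\x3E-\x7F]+$/;

// Collect every ASCII code point on which the two patterns disagree.
const disagreements = [];
for (let code = 0; code <= 0x7f; code++) {
  const ch = String.fromCharCode(code);
  if (withLookahead.test(ch) !== withRanges.test(ch)) {
    disagreements.push(code);
  }
}

console.log(disagreements); // []
```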

Match a number in a string with letters and numbers

I need to write a Perl regex to match numbers in a word with both letters and numbers.
Example: test123. I want to write a regex that matches only the number part and capture it
I am trying this \S*(\d+)\S* and it captures only the 3 but not 123.
Regex atoms will match as much as they can.
Initially, the first \S* matched "test123", but the regex engine had to backtrack to allow \d+ to match. The result is:
 +------------------ Matches "test12"
 |     +------------ Matches "3"
 |     |     +------ Matches ""
 |     |     |
---  -----  ---
\S*  (\d+)  \S*
All you need is:
my ($num) = "test123" =~ /(\d+)/;
It'll try to match at position 0, then position 1, ... until it finds a digit, then it will match as many digits as it can.
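The same greedy backtracking can be reproduced outside Perl. Here is a short JavaScript sketch of the two patterns discussed above; the behaviour of \S*, \d+ and capture groups is the same in both engines for this input.

```javascript
// Greedy \S* first grabs "test123", then gives back just one char for \d+.
const greedy = "test123".match(/\S*(\d+)\S*/);
console.log(greedy[1]); // 3

// Without the \S* parts, the engine scans forward to the first digit.
const plain = "test123".match(/(\d+)/);
console.log(plain[1]); // 123
```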
The * quantifiers in your regex are greedy, which is why they also "eat" the digits. Exactly as Marc said, you don't need them.
perl -e '$_ = "qwe123qwe"; s/(\d+)/$numbers=$1/e; print $numbers . "\n";'
"something122320" =~ /(\d+)/ will return 122320; this is probably what you're trying to do ;)
\S matches any non-whitespace characters, including digits. You want \d+:
my ($number) = 'test123' =~ /(\d+)/;
Were it a case where a non-digit was required (say before, per your example), you could use the following non-greedy expressions:
/\w+?(\d+)/ or /\S+?(\d+)/
(The second one is more in tune with your \S* specification.)
Your expression satisfies any condition with one or more digits, and that may be what you want. It could be a string of digits surrounded by spaces (" 123 "), because the border between the last space and the first digit satisfies zero-or-more non-space, same thing is true about the border between the '3' and the following space.
Chances are that you don't need any specification and capturing the first digits in the string is enough. But when it's not, it's good to know how to specify expected patterns.
I think parentheses signify capture groups, which is exactly what you don't want. Remove them. You're looking for /\d+/ or /[0-9]+/