Regex - Contains pattern but not starting with xyz - regex

I'm trying to match a number pattern in a text file.
The file can contain values such as
12345 567890
90123 string word word 54616
98765
The pattern should match on any line that contains a 5 digit number that does not start with 1234
I have tried using ((?!1234).*)[[:digit:]]{5} but it does not give the desired results.
Edit: The pattern can occur anywhere in the line and should still match
Any suggestions?

This regex should work for matching a line containing a number at least 5 digits long iff the line does not start with '12345':
^((?!12345).*\d{5}.*)$
Short explanation:
^ matches the start of a line
(?!12345) looks ahead and makes sure the line does not begin with "12345"
.* matches any character sequence
\d{5} matches exactly 5 digits
.* matches any character sequence
$ matches the end of the line
EDIT:
It seems that the question has been edited, so this solution no longer reflects the OP's requirements. Still I'll leave it here in case someone looking for something similar lands on this page.
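As a rough check of the anchored pattern above, here is a minimal Python sketch run against the sample lines from the question. It assumes Python's re module. The last line ("1234 5") is a made-up addition to show a line with no 5-digit run.

```python
import re

# The answer's pattern: the line must not begin with "12345"
# and must contain a run of 5 digits somewhere.
LINE_PATTERN = re.compile(r"^((?!12345).*\d{5}.*)$")

sample_lines = [
    "12345 567890",                  # begins with 12345, so it is rejected
    "90123 string word word 54616",
    "98765",
    "1234 5",                        # hypothetical line with no 5-digit run
]

accepted = [line for line in sample_lines if LINE_PATTERN.match(line)]
print(accepted)
```

The first sample line is rejected even though it also contains 567890, which is why this pattern does not fit the edited question.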

The following would work, using \b to match word boundaries such as start of string or space:
\b(?!12345)\d{5}.*
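A minimal Python sketch of how this pattern behaves on the question's sample lines, assuming Python's re module. The fourth line is a hypothetical addition whose only number is 12345.

```python
import re

# \b puts the lookahead at the start of each number rather than
# at the start of the line.
NUMBER_PATTERN = re.compile(r"\b(?!12345)\d{5}.*")

sample_lines = [
    "12345 567890",
    "90123 string word word 54616",
    "98765",
    "12345 only",   # hypothetical line whose only number is 12345
]

matching = [line for line in sample_lines if NUMBER_PATTERN.search(line)]
print(matching)
```

Since there is no \b after \d{5}, the first five digits of a longer number such as 567890 also count as a match.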

Try this; it uses a negative lookbehind to match a run of at least 5 decimal digits that does not end in 12345:
\d{5,}(?<!12345)
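The lookbehind is checked at the end of the digit run, so it may be worth trying the pattern on a few inputs before relying on it. A small sketch, assuming Python's re module; the test strings are made up for illustration.

```python
import re

LOOKBEHIND_PATTERN = re.compile(r"\d{5,}(?<!12345)")

def first_match(text):
    """Return the matched digits, or None if the pattern does not match."""
    found = LOOKBEHIND_PATTERN.search(text)
    return found.group(0) if found else None

# Hypothetical inputs chosen to show the edge cases.
results = {text: first_match(text) for text in ["12345", "98765", "123456", "012345"]}
print(results)
```

Note that 123456 is accepted although it starts with 12345, and that 012345 still matches after backtracking to 01234, so this pattern tests how the run ends rather than how it starts.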

Related

RegEx Substring Extraction

I am trying to write a RegEx on the following text:
CpuUtilization[GqIF:CA-TORONTO-1-AD-1 | FAULT-DOMAIN-3 | ocid1.image.oc1.ca-toronto-1.aaaaaaaaq4cxrudcxy5seck2cweks2zglo2tfieag6svtvqssa2zmjha | Default | ca-toronto-1 | oke-ccf3jglvbia-nc7pit2gv2a-sa65utwc32a-2 | ocid1.instance.oc1.ca-toronto-1.an2g6ljrwe6j4fqcgrlo7dmzkrtbcgr3jy35gie3qh3w65ctfh3hsd6da | VM.Standard.E2.2]
I need to extract oke-ccf3jglvbia-nc7pit2gv2a-sa65utwc32a-2 from the statement. The text above can vary, so I am looking for a generic regex.
I tried using: ([^\|]+)\|.+ which extracts only the first field, before the first |
Why use RegEx?
const s = 'CpuUtilization[GqIF:CA-TORONTO-1-AD-1 | FAULT-DOMAIN-3 | ocid1.image.oc1.ca-toronto-1.aaaaaaaaq4cxrudcxy5seck2cweks2zglo2tfieag6svtvqssa2zmjha | Default | ca-toronto-1 | oke-ccf3jglvbia-nc7pit2gv2a-sa65utwc32a-2 | ocid1.instance.oc1.ca-toronto-1.an2g6ljrwe6j4fqcgrlo7dmzkrtbcgr3jy35gie3qh3w65ctfh3hsd6da | VM.Standard.E2.2]'
console.log(s.split(" | ")[5])
A regex solution can be
^(?:[^|]+ \| ){5}([^ ]+).*$
^ start of the string
(?:[^|]+ \| ){5} any run of characters other than |, followed by " | ", 5 times. (The ?: makes this a non-capturing group.)
([^ ]+) your string, captured as the first group
.*$ any characters up to the end of the line
To get your string out of this, substitute the match with $1 or \1.
Test it on regex101. There you can test different programming languages/regex processors.
Remark:
Like Kazi's answer, this works in this case but maybe not in others.
There are no more examples in your question.
Functionally, this answer is nearly the same.
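Both approaches from this thread can be compared side by side. The sketch below uses Python instead of JavaScript and a shortened stand-in for the original text. The two OCID values are placeholders, not real identifiers.

```python
import re

# Shortened stand-in for the text in the question; the OCID values
# are placeholders (assumption), the other fields are from the question.
text = (
    "CpuUtilization[GqIF:CA-TORONTO-1-AD-1 | FAULT-DOMAIN-3 | "
    "ocid1.image.example | Default | ca-toronto-1 | "
    "oke-ccf3jglvbia-nc7pit2gv2a-sa65utwc32a-2 | "
    "ocid1.instance.example | VM.Standard.E2.2]"
)

# Skip five "field | " prefixes, then capture the sixth field.
SIXTH_FIELD = re.compile(r"^(?:[^|]+ \| ){5}([^ ]+).*$")

by_regex = SIXTH_FIELD.match(text).group(1)
by_split = text.split(" | ")[5]
print(by_regex, by_split)
```

Both rely on the wanted value always being the sixth field, so either would need adjusting if the number of fields changes.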

How to exclude groups having alphabetical chars / or capture lines having only numeric chars?

I am trying to create a pattern that matches numeric digits but excludes those that start with any letters/words.
This is the sample text that I am trying to match :
| 30 | 00:45.3 | 00:42.4 | 2.4869 | 5.6578
| event/slno1 | 00:45.3 | 00:42.4 | 2.4869 | 5.6578
| event/slno2 | 00:00.0 | 00:00.0 | 0.0000 | 0.0000
| event/slno3 | 00:45.3 | 00:42.4 | 2.4869 | 5.6578
| event/slno4 | 00:00.0 | 00:00.0 | 0.0000 | 0.0000
I wrote this:
(\d+)|\s+(\d\d):(\d+\.\d)\s+|(\d\d):(\d+\.\d)\s+|(\d+\.\d+)\s+|(\d+\.\d+)
I want to match only the
30 00:45.3 00:42.4 2.4869 5.6578 part and ignore the rest. How can I ignore the additional matches? I am not sure how to negate the other ones.
Here the sample : https://regex101.com/r/ArZB3O/1
As you want to match whole lines, you should anchor your matches to the beginning and end of lines. Your input is composed of fields prefixed by a vertical bar, and the lines of interest are those in which every field consists only of digits, colons and dots (to avoid complicating the numeric format further). So you can use this regexp to do that:
^\s*(\|\s+[0-9:.]+\s*)+$
as demonstrated by this demo. As your matching string starts after some whitespace, I have added support for it with the first \s*. Then comes a repeating group of one or more sequences of one vertical bar, some whitespace, and some sequence of digits, colons or dots. If you want to be more precise, you can specify the substructure of [0-9:.]+, as the fields follow a pattern, but I think this is enough for your problem.
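A small Python sketch of the line filter, assuming Python's re module and using three of the sample rows from the question. Splitting the matched row into fields is an extra step added here for illustration.

```python
import re

# Whole-line pattern: every field must consist of digits, colons and dots.
NUMERIC_ROW = re.compile(r"^\s*(\|\s+[0-9:.]+\s*)+$")

rows = [
    "| 30 | 00:45.3 | 00:42.4 | 2.4869 | 5.6578",
    "| event/slno1 | 00:45.3 | 00:42.4 | 2.4869 | 5.6578",
    "| event/slno2 | 00:00.0 | 00:00.0 | 0.0000 | 0.0000",
]

numeric_rows = [row for row in rows if NUMERIC_ROW.match(row)]

# Split the accepted row on "|" and drop the empty leading cell.
fields = [cell.strip() for cell in numeric_rows[0].split("|") if cell.strip()]
print(fields)
```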

lookahead in the middle of regex doesn't match

I have a string $s1 = "a_b"; and I want to match this string but only capture the letters. I tried to use a lookahead:
if($s1 =~ /([a-z])(?=_)([a-z])/){print "Captured: $1, $2\n";}
but this does not seem to match my string. I have solved the original problem by using a (?:_) instead, but I am curious as to why my original attempt did not work. To my understanding a lookahead matches but does not capture, so what did I do wrong?
A lookahead is a zero-width assertion: it looks at the immediately following positions and, if the assertion is true, it backtracks to the end of the previous match (right after a) to continue matching from there. So after (?=_) succeeds, the second ([a-z]) is tried against _ and fails. Your regex would work only if you put a _ right after the positive lookahead: ([a-z])(?=_)_([a-z])
You don't even need a (non-)capturing group around the _:
if ($s1 =~ /([a-z])_([a-z])/) { print "Captured: $1, $2\n"; }
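The same three patterns can be tried in Python, whose re module treats lookaheads as zero-width in the same way. This is a sketch for comparison, not the Perl code from the question.

```python
import re

subject = "a_b"

# Lookahead only: "_" is checked but never consumed, so the second
# group is tried against "_" and the match fails.
lookahead_only = re.search(r"([a-z])(?=_)([a-z])", subject)

# Lookahead followed by a literal "_" that consumes the character.
lookahead_then_consume = re.search(r"([a-z])(?=_)_([a-z])", subject)

# No lookahead at all: a literal "_" outside any group is not captured.
plain = re.search(r"([a-z])_([a-z])", subject)

print(lookahead_only)
print(lookahead_then_consume.groups())
print(plain.groups())
```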
Edit
In reply to @Borodin's comment:
I think that moving backwards is the same as a backtrack which is more recognizable by debugging the whole thing (Perl debug mode):
Matching REx "a(?=_)_b" against "a_b"
.
.
.
0 <> <a_b> | 0| 1:EXACT <a>(3)
1 <a> <_b> | 0| 3:IFMATCH[0](9)
1 <a> <_b> | 1| 5:EXACT <_>(7)
2 <a_> <b> | 1| 7:SUCCEED(0)
| 1| subpattern success...
1 <a> <_b> | 0| 9:EXACT <_b>(11)
3 <a_b> <> | 0| 11:END(0)
Match successful!
As the debug output above shows, at the fourth line of the results (the third step) the engine has consumed the characters a_ while inside the lookahead assertion. Then a backtrack happens: after the positive lookahead succeeds, the engine steps back over the whole sub-pattern and resumes at the position right after a.
At line 5, the engine has consumed only one character: a.
[Image: Regex101 debugger]
How I interpret this backtrack is clearer in this illustration (thanks to @JDB, whose style of representation I borrowed):
a(?=_)_b
*
|\
| \
|  : a (match)
|  * (?=_)
|  |↖
|  | ↖
|  |↘ ↖
|  | ↘ ↖
|  |  ↘ ↖
|  |   : _ (match)
|  |   ^ SUBPATTERN SUCCESS (OP_ASSERT :=> MATCH_MATCH)
|  * _b
|  |\
|  | \
|  |  : _ (match)
|  |  : b (match)
|  | /
|  |/
| /
|/
MATCHED
By this I mean that if the lookahead assertion succeeds, then, since part of the input string has already been consumed, the engine goes back up to the previous match offset (eptr, the pointer into the subject, is not changed, but the offset is). It resets the consumed characters and tries to continue matching from there, and I call that a backtrack. Below is a visual representation of the steps taken by the engine, produced with Regexp::Debugger:
[Image: Regexp::Debugger output]
So I see it as a backtrack, or a kind of one. If I'm wrong about any of this, I'd welcome corrections.

How do I select a substring using a regexp in robot framework

In the Robot Framework library called String, there are several keywords that allow us to use a regexp to manipulate a string, but these manipulations don't seem to include selecting a substring from a string.
To clarify, what I intend is to have a price, i.e. € 1234,00 from which I would like to select only the 4 primary digits, meaning I am left with 1234 (which I will convert to an int for use in validation calculations). I have a regexp which will allow me to do that, which is as follows:
(\d+)[\.\,]
If I use Remove String Using Regexp with this regexp I will be left with exactly what I tried to remove. If I use Get Lines Matching Regexp, I will get the entire line rather than just the result I wanted, and if I use Get Regexp Matches I will get the right result except it will be in a list, which I will then have to manipulate again so that doesn't seem optimal.
Did I simply miss the keyword that will allow me to do this or am I forced to write my own custom keyword that will let me do this? I am slightly amazed that this functionality doesn't seem to be available, as this is the first use case I would think of when I think of using a regexp with a string...
You can use the Evaluate keyword to run some python code.
For example:
| Using 'Evaluate' to find a pattern in a string
| | ${string}= | set variable | € 1234,00
| | ${result}= | evaluate | re.search(r'\\d+', '''${string}''').group(0) | re
| | should be equal as strings | ${result} | 1234
Starting with Robot Framework 2.9 there is a keyword named Get Regexp Matches, which returns a list of all matches.
For example:
| Using 'Get regexp matches' to find a pattern in a string
| | ${string}= | set variable | € 1234,00
| | ${matches}= | get regexp matches | ${string} | \\d+
| | should be equal as strings | ${matches[0]} | 1234
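For reference, this is roughly what the Evaluate expression does in plain Python, alongside the regexp from the question, whose capture group gives the same digits. The conversion to int is the step the question mentions for later calculations.

```python
import re

price = "€ 1234,00"

# First run of digits, as in the Evaluate example.
whole_digits = re.search(r"\d+", price).group(0)

# The question's regexp: digits followed by "." or ",", digits captured in group 1.
captured = re.search(r"(\d+)[\.\,]", price).group(1)

amount = int(whole_digits)
print(whole_digits, captured, amount)
```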

notepad++: keep regex (multi occurence per line) and line structure, remove other characters

I have a 130k line text file with patent information and I just want to keep the dates (regex "[0-9]{4}-[0-9]{2}-[0-9]{2} ") for subsequent work in Excel. For this purpose I need to keep the line structure intact (also blank lines). My main problem is that I can't seem to find a way to identify and keep multiple occurrences of date information in the same line while deleting all other information.
Original file structure:
US20110228428A1 | US | | 7 | 2010-03-19 | SEAGATE TECHNOLOGY LLC
US20120026629A1 | US | | 7 | 2010-07-28 | TDK CORP | US20120127612A1 | US | | EXAMINER | 2010-11-24 | | US20120147501A1 | US | | 2 | 2010-12-09 | SAE MAGNETICS HK LTD,HEADWAY TECHNOLOGIES INC
Desired file structure:
2010-03-19
2010-07-28 2010-11-24 2010-12-09
Thank you for your help!
Search for
.*?(?:([0-9]{4}-[0-9]{2}-[0-9]{2})|$)
And replace with
" $1"
Leave out the quotes; they are only there to show that there is a space before the $1. This will also put a space before the first match in a row.
This regex matches as little as possible with .*? before it finds either a date or the end of the row (the $). If a date is found, it is stored in $1 because of the parentheses around it. So the replacement is just a space, to separate the dates, followed by the date from $1.
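The same search and replace can be sketched outside Notepad++. The Python version below applies the pattern line by line and is an approximation, not a statement about how Notepad++ handles empty matches. The blank middle line is an assumption added to show that the line structure is kept, and the final strip() removes the extra spaces the replacement leaves at both ends.

```python
import re

# Lazily skip text until either a date (captured) or the end of the line.
DATE_OR_EOL = re.compile(r".*?(?:([0-9]{4}-[0-9]{2}-[0-9]{2})|$)")

def keep_dates(line):
    """Replace each match with a space plus the captured date, if any."""
    replaced = DATE_OR_EOL.sub(lambda m: " " + (m.group(1) or ""), line)
    return replaced.strip()  # drop the leading and trailing spaces

lines = [
    "US20110228428A1 | US | | 7 | 2010-03-19 | SEAGATE TECHNOLOGY LLC",
    "",  # hypothetical blank line, kept as a blank line
    "US20120026629A1 | US | | 7 | 2010-07-28 | TDK CORP | US20120127612A1 | US | | EXAMINER | 2010-11-24 | | US20120147501A1 | US | | 2 | 2010-12-09 | SAE MAGNETICS HK LTD,HEADWAY TECHNOLOGIES INC",
]

cleaned = [keep_dates(line) for line in lines]
print(cleaned)
```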