I'm new to regex, and would appreciate some guidance/help.
Currently, I'm trying to write an expression that extracts a certain part of the 2nd line of the provided text.
Here is the text:
123 anywhere Avenue
Winnipeg, Manitoba R3E 0L7
Canada
Pharmacy Manager: person person
Pharmacy Licence Holder/Owner: 123456 Manitoba Ltd.
My goal is to extract the 'Manitoba' string from the second line, but I'd like the expression to be dynamic rather than always matching the static text Manitoba. I used the code below to target the second line:
(.*)(?=(\n.*){3}$)
(It matches 3 lines up from the last line, thus targeting the desired line)
I noticed that, within the dataset, the province (Manitoba) is always in between two spaces.
Is there any addition I can make to the code so that the expression only targets the second line, then matches the first string in between spaces?
Perhaps using a lazy expression with a positive lookaround?
If I target all matches in between spaces, it would take both 'Manitoba' and 'R3E 0L7', which I don't want.
I want it to match only the first piece of text in between spaces on the second line.
Any help is much appreciated :-)
Thanks.
One option could be to match the first line, then capture the second word of the second line in capturing group 1.
Then match the rest of the second line and assert that exactly 3 more lines follow before the end of the string.
^.*\r?\n\S+[^\S\r\n]+(\S+).*(?=(?:\r?\n.*){3}$)
In parts:
^ Start of string
.*\r?\n Match the whole first line and a newline
\S+ Match 1+ non whitespace char (the first "word")
[^\S\r\n]+ Match 1+ times a whitespace char except newlines
(\S+) Capture group 1, match 1+ times a non whitespace char (the second "word")
.* Match the rest of the line
(?= Positive lookahead, assert what follows on the right is
(?:\r?\n.*){3}$ Match 3 times a newline followed by 0+ times any char except a newline, and assert the end of the string
) Close lookahead
You could also turn the lookahead into a match instead:
^.*\r?\n\S+[^\S\r\n]+(\S+).*(?:\r?\n.*){3}$
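A minimal Python sketch of both patterns above, run against the sample address from the question. The variable and function names are made up for illustration. Note that group 1 holds only the first whitespace-delimited token after the city, so a two-word province would presumably be cut off after its first word.

```python
import re

# Sample text from the question (5 lines, province on line 2).
address = (
    "123 anywhere Avenue\n"
    "Winnipeg, Manitoba R3E 0L7\n"
    "Canada\n"
    "Pharmacy Manager: person person\n"
    "Pharmacy Licence Holder/Owner: 123456 Manitoba Ltd."
)

# Variant 1: the trailing three lines are only asserted (lookahead).
LOOKAHEAD = re.compile(r"^.*\r?\n\S+[^\S\r\n]+(\S+).*(?=(?:\r?\n.*){3}$)")
# Variant 2: the trailing three lines are consumed by the match.
FULL_MATCH = re.compile(r"^.*\r?\n\S+[^\S\r\n]+(\S+).*(?:\r?\n.*){3}$")


def second_word_of_second_line(text, pattern):
    """Return capture group 1, or None when the text does not fit the layout."""
    match = pattern.search(text)
    return match.group(1) if match else None


province_lookahead = second_word_of_second_line(address, LOOKAHEAD)
province_full = second_word_of_second_line(address, FULL_MATCH)
# A text with only 4 lines does not satisfy the "3 lines follow" condition.
too_short = second_word_of_second_line("a\nb c\nd\ne", LOOKAHEAD)

print(province_lookahead, province_full, too_short)
```

Both variants return the same group 1; they differ only in what the overall match (group 0) contains.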
I have used an online regex learning site (regexr) and created something that works, but with my very limited experience with regex creation I could do with some help/advice.
In IIS10 logs, there is a list for time, date... but I am only interested in the cs(User-Agent) field.
My Regex:
(scan\-\d+)(?:\w)+\.shadowserver\.org
which matches these:
scan-02.shadowserver.org
scan-15n.shadowserver.org
scan-42o.shadowserver.org
scan-42j.shadowserver.org
scan-42b.shadowserver.org
scan-47m.shadowserver.org
scan-47a.shadowserver.org
scan-47c.shadowserver.org
scan-42a.shadowserver.org
scan-42n.shadowserver.org
scan-42o.shadowserver.org
but what I would like it to do is:
Match a single number with the option of capturing more than one: scan-2 or scan-02 with an optional letter: scan-2j or scan-02f
Append the rest of the User Agent: .shadowserver.org to the regex.
I will then add it to an existing URL Rewrite rule (as a condition) to abort the request.
Any advice/help would be very much appreciated
Tried:
To write a regex for IIS10 to block requests from a certain user-agent
Expected:
It to work on single numbers as well as double/triple numbers with or without a letter.
(scan\-\d+)(?:\w)+\.shadowserver\.org
Input Text:
scan-2.shadowserver.org
scan-02.shadowserver.org
scan-2j.shadowserver.org
scan-02j.shadowserver.org
scan-17w.shadowserver.org
scan-101p.shadowserver.org
UPDATE:
I eventually came up with this:
scan\-[0-9]+[a-z]{0,1}\.shadowserver\.org
This is an explanation of your regex pattern. If you only want the solution, go directly to the end.
(scan\-\d+)(?:\w)+
(scan\-\d+) Group 1: matches the word scan followed by a literal -. You escaped the hyphen with a \, but outside a character class an unescaped - is also a literal hyphen, so the escape is not needed here. The - is followed by \d+, which means one or more digits from 0-9 (there must be at least one digit). Whatever this group matches is saved in the first capturing group.
(?:\w)+ Non-capturing group: \w is one word character, equal to [A-Za-z0-9_]. The plus sign + after the group means the whole group is matched one or more times, and since the group contains only \w, it matches one or more word characters. Note that the non-capturing group is redundant here; \w+ can be used directly.
Taking two examples:
The first example: scan-02.shadowserver.org
(scan\-\d+)(?:\w)+
Here scan matches the word scan and \- matches the hyphen after it. \d+ is greedy, so it first takes both digits, 02, giving scan-02. Next, (?:\w)+ must match at least one word character, but the next character is the period ., which is not a word character, so this attempt fails. It is not over at this point: the regex engine backtracks into \d+ and makes it give up one digit, so \d+ now matches only the 0 and scan-0 is saved in the first capturing group. Then (?:\w)+ matches the 2 in scan-02 and the match succeeds. The engine backtracks because both \d+ and (?:\w)+ use +, so each needs at least one character (a digit and a word character respectively), and the engine tries every way of satisfying both.
The second example: scan-2.shadowserver.org
(scan\-\d+)(?:\w)+
(scan\-\d+) matches scan-2, then (?:\w)+ tries to match the period after scan-2 and fails. This is the important point: \d+ has matched only one digit, so it cannot give one up for \w. The engine then retries the whole pattern from the next position in scan-2.shadowserver.org, starting at the character c. The s in (scan\-\d+) fails to match c, the same happens at every remaining position, and in the end the overall match fails.
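The two walkthroughs above can be checked with a short Python sketch. It only probes the first part of the original pattern, as in the explanation; the helper name probe is made up for illustration.

```python
import re

# First part of the pattern from the question, exactly as discussed above.
ORIGINAL = re.compile(r"(scan\-\d+)(?:\w)+")


def probe(host):
    """Return (whole match, group 1), or None when nothing matches."""
    match = ORIGINAL.search(host)
    return (match.group(0), match.group(1)) if match else None


# \d+ gives back the "2" so that (?:\w)+ has something to match.
two_digits = probe("scan-02.shadowserver.org")
# Only one digit: \d+ cannot give anything back, so the match fails.
one_digit = probe("scan-2.shadowserver.org")
# A trailing letter satisfies (?:\w)+ without any backtracking into \d+.
with_letter = probe("scan-15n.shadowserver.org")

print(two_digits, one_digit, with_letter)
```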
Simple solution:
(scan-\d+[a-z]?)\.shadowserver\.org
Explanation
(scan-\d+[a-z]?) Group 1: captures the word scan, followed by a literal -, followed by \d+ (one or more digits), followed by an optional lowercase letter [a-z]?. The ? makes the [a-z] part optional; without it, [a-z] would require exactly one lowercase letter.
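A small Python sketch that runs the simple solution against the input text from the question. The rejected samples are made-up examples, and fullmatch is used here only to make the check strict; how IIS URL Rewrite applies the condition is not modeled.

```python
import re

USER_AGENT = re.compile(r"(scan-\d+[a-z]?)\.shadowserver\.org")

# Input text from the question.
hosts = [
    "scan-2.shadowserver.org",
    "scan-02.shadowserver.org",
    "scan-2j.shadowserver.org",
    "scan-02j.shadowserver.org",
    "scan-17w.shadowserver.org",
    "scan-101p.shadowserver.org",
]

# Group 1 for every host; all six are expected to match.
captured = [USER_AGENT.fullmatch(host).group(1) for host in hosts]

# Hypothetical strings that should not match: no digit, two letters, wrong TLD.
candidates = [
    "scan-.shadowserver.org",
    "scan-2jj.shadowserver.org",
    "scan-2.shadowserver.com",
]
rejected = [c for c in candidates if USER_AGENT.fullmatch(c) is None]

print(captured)
print(rejected)
```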
I want to extract [games, games, things, things] from the following array.
Today_games
Today_games_freq
Today_things
Today_things_freq
I have tried Today_(\w+)(?=_freq)?
which gives me the extra "freq".
I tried some other combinations too, but I couldn't figure out how to capture just the part after the first underscore.
You can use
Today_(\w+?)(?:_freq)?$
This matches Today_, then captures one or more word chars (as few as possible) into Group 1 with (\w+?), and then (?:_freq)?$ matches an optional _freq substring and asserts the position at the end of the string.
Or,
Today_([^\W_]+)
Here, Today_ is matched and the ([^\W_]+) pattern captures one or more alphanumeric chars into Group 1 (same as \w+ with _ subtracted from \w).
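Both suggestions can be compared with a short Python sketch. The helper extract and the extra value Today_board_games_freq are made up for illustration; the extra value shows where the two patterns behave differently.

```python
import re

items = ["Today_games", "Today_games_freq", "Today_things", "Today_things_freq"]

# Lazy capture, optional _freq suffix, anchored at the end of the string.
LAZY = re.compile(r"Today_(\w+?)(?:_freq)?$")
# Capture word chars except underscore, so the capture stops at the next "_".
NO_UNDERSCORE = re.compile(r"Today_([^\W_]+)")


def extract(pattern, values):
    """Collect group 1 for every value the pattern matches."""
    result = []
    for value in values:
        match = pattern.search(value)
        if match:
            result.append(match.group(1))
    return result


lazy_result = extract(LAZY, items)
class_result = extract(NO_UNDERSCORE, items)

# The patterns differ when the middle part itself contains an underscore.
lazy_multi = extract(LAZY, ["Today_board_games_freq"])
class_multi = extract(NO_UNDERSCORE, ["Today_board_games_freq"])

print(lazy_result, class_result, lazy_multi, class_multi)
```

Which one fits better depends on whether the middle part may contain underscores.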
We are doing loose validation on zipcodes of the form CITY, ST, ZIP. These can span countries, so all of the following are valid:
PITTSBURGH, PA, 15020
HAMILTON,ONTARIO,L8E 4B3
All I want to validate is that we have three comma-separated words (whitespace is fine). All of these would be valid:
foo, bar, baz
foo,bar,baz123
However these would be invalid because they don't have exactly two commas and three words:
foo, bar
boo,bar,baz,bang
foo, bar,
foo,bar,baz,
What I've Tried Unsuccessfully
^[\w],[\w],[\w]$
^[a-zA-Z0-9_.-]*,[a-zA-Z0-9_.-]*,[a-zA-Z0-9_.-]*$ (Doesn't allow spaces)
Also just curious: do y'all typically allow whitespace in the regex, or prefer that the application filters whitespace first and then applies the regex? We can do either.
The pattern ^[\w],[\w],[\w]$ that you tried, can be written as ^\w,\w,\w$ and matches 3 times a single word char with a comma in between.
The pattern ^[a-zA-Z0-9_.-]*,[a-zA-Z0-9_.-]*,[a-zA-Z0-9_.-]*$ matches 3 times repeating 0 or more times any of the listed chars/ranges in the character class with a comma in between.
As the quantifier * is 0 or more times, it could possibly also match ,,
If the word chars should be present at all 3 occasions, and there can not be spaces at the start and end:
^\w+(?:\s*,\s*\w+){2}$
^ Start of string
\w+ Match 1+ word chars
(?:\s*,\s*\w+){2} Repeat 2 times matching a comma between optional whitespace chars and 1+ word chars
$ End of string
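A Python sketch that checks the pattern above against the examples from the question, and also confirms that the attempt with * accepts a bare ,, string. One thing it brings out: the Canadian example HAMILTON,ONTARIO,L8E 4B3 is rejected, because \w+ does not allow the space inside the postal code. The last pattern, WORDS_WITH_INNER_SPACES, is not from the answer; it is a hypothetical variant that allows single spaces inside a word.

```python
import re

# Pattern from the answer above.
THREE_WORDS = re.compile(r"^\w+(?:\s*,\s*\w+){2}$")
# The questioner's attempt using the * quantifier.
STAR_ATTEMPT = re.compile(r"^[a-zA-Z0-9_.-]*,[a-zA-Z0-9_.-]*,[a-zA-Z0-9_.-]*$")
# Hypothetical variant (an assumption, not from the answer): each "word"
# may consist of several word-char runs separated by single spaces.
WORDS_WITH_INNER_SPACES = re.compile(r"^\w+(?: \w+)*(?:\s*,\s*\w+(?: \w+)*){2}$")

valid = ["foo, bar, baz", "foo,bar,baz123", "PITTSBURGH, PA, 15020"]
invalid = ["foo, bar", "boo,bar,baz,bang", "foo, bar,", "foo,bar,baz,"]


def accepts(pattern, value):
    return pattern.search(value) is not None


valid_ok = all(accepts(THREE_WORDS, v) for v in valid)
invalid_rejected = not any(accepts(THREE_WORDS, v) for v in invalid)

# The * quantifier allows empty fields, so two bare commas pass.
star_accepts_empty = accepts(STAR_ATTEMPT, ",,")

# The space inside "L8E 4B3" breaks the \w+ based pattern.
canadian = "HAMILTON,ONTARIO,L8E 4B3"
canadian_strict = accepts(THREE_WORDS, canadian)
canadian_variant = accepts(WORDS_WITH_INNER_SPACES, canadian)
variant_rejects_invalid = not any(accepts(WORDS_WITH_INNER_SPACES, v) for v in invalid)

print(valid_ok, invalid_rejected, star_accepts_empty, canadian_strict, canadian_variant)
```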
Note that \s can also match a newline. If you want to match spaces only, and the string can also start and end with a space, you could use the pattern that anubhava suggested in the comments.
Try
^\w*\W?,\W?\w*\W?,\W?(\w| ){1,}
(I tested by your examples)
I am working on validating PAN card numbers. I need to check that the first character and the fifth character are the same: whatever the first character of the string below is, the fifth character must match it. Can anyone help me apply this condition?
Regex I have tried : [A-Za-z]{4}\d{4}[A-Za-z]{1}
Here is my pan card example: ABCDA9999K
If you want to match the full example string, where the first A should match up with the fifth A, the pattern should start with 5 letters, [A-Za-z]{5}, instead of [A-Za-z]{4}.
You could use a capturing group with a backreference ([A-Za-z])[A-Za-z]{3}\1 to account for the first 5 chars.
You might add word boundaries \b to the start and end to prevent a partial match or add anchors to assert the start ^ and the end $ of the string.
The {1} quantifier at the end of the pattern can be omitted.
([A-Za-z])[A-Za-z]{3}\1\d{4}[A-Za-z]
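A minimal Python sketch of the backreference pattern above. The function name is_valid_pan is made up, and fullmatch stands in for the ^ and $ anchors mentioned earlier. The comparison through \1 is case-sensitive here, so a lowercase first letter does not match an uppercase fifth letter.

```python
import re

# Group 1 is the first letter; \1 requires the fifth char to be the same letter.
PAN = re.compile(r"([A-Za-z])[A-Za-z]{3}\1\d{4}[A-Za-z]")


def is_valid_pan(value):
    """True when the whole string fits the pattern (no partial matches)."""
    return PAN.fullmatch(value) is not None


same_letter = is_valid_pan("ABCDA9999K")      # example from the question
different_letter = is_valid_pan("ABCDE9999K")  # fifth char differs from first
case_mismatch = is_valid_pan("abcdA9999K")     # "a" and "A" are not equal
too_few_digits = is_valid_pan("ABCDA999K")

print(same_letter, different_letter, case_mismatch, too_few_digits)
```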
I have multiple lines in a text file that I need to combine together. The file is about 200 million lines long, so opening it with Excel and using their built-in tools is out of the picture.
The first set of lines looks like this:
1,example#gmail.com,Username
3,example#gmail.com,Username
4,example#gmail.com,Username
5,example#gmail.com,Username
9,example#gmail.com,Username
10,example#gmail.com,Username
The second set, whose values I want to append to the end of the matching lines of the first set, is:
1,$2a$10$gdsZkf62vUfwHQX8pUGe2.7zqvBvcIPWseaJmboJw3U2sxDj18y5q
3,$2a$10$gdsZkf62vUfwHQX8pUGe2.7zqvBvcIPWseaJmboJw3U2sxDj18y5q
4,$2a$10$gdsZkf62vUfwHQX8pUGe2.7zqvBvcIPWseaJmboJw3U2sxDj18y5q
5,$2a$10$gdsZkf62vUfwHQX8pUGe2.7zqvBvcIPWseaJmboJw3U2sxDj18y5q
9,$2a$10$gdsZkf62vUfwHQX8pUGe2.7zqvBvcIPWseaJmboJw3U2sxDj18y5q
10,$2a$10$gdsZkf62vUfwHQX8pUGe2.7zqvBvcIPWseaJmboJw3U2sxDj18y5q
If anyone has experience with this, I'd love some help
Code
Regex
^(\d+),(.*$)(?=[\s\S]*^\1,(.*))
Formatting output
$1,$2,$3
Results
Input
1,example#gmail.com,Username
3,example#gmail.com,Username
4,example#gmail.com,Username
5,example#gmail.com,Username
9,example#gmail.com,Username
10,example#gmail.com,Username
1,$2a$10$gdsZkf62vUfwHQX8pUGe2.7zqvBvcIPWseaJmboJw3U2sxDj18y5q
3,$2a$10$gdsZkf62vUfwHQX8pUGe2.7zqvBvcIPWseaJmboJw3U2sxDj18y5q
4,$2a$10$gdsZkf62vUfwHQX8pUGe2.7zqvBvcIPWseaJmboJw3U2sxDj18y5q
5,$2a$10$gdsZkf62vUfwHQX8pUGe2.7zqvBvcIPWseaJmboJw3U2sxDj18y5q
9,$2a$10$gdsZkf62vUfwHQX8pUGe2.7zqvBvcIPWseaJmboJw3U2sxDj18y5q
10,$2a$10$gdsZkf62vUfwHQX8pUGe2.7zqvBvcIPWseaJmboJw3U2sxDj18y5q
Output
1,example#gmail.com,Username,$2a$10$gdsZkf62vUfwHQX8pUGe2.7zqvBvcIPWseaJmboJw3U2sxDj18y5q
3,example#gmail.com,Username,$2a$10$gdsZkf62vUfwHQX8pUGe2.7zqvBvcIPWseaJmboJw3U2sxDj18y5q
4,example#gmail.com,Username,$2a$10$gdsZkf62vUfwHQX8pUGe2.7zqvBvcIPWseaJmboJw3U2sxDj18y5q
5,example#gmail.com,Username,$2a$10$gdsZkf62vUfwHQX8pUGe2.7zqvBvcIPWseaJmboJw3U2sxDj18y5q
9,example#gmail.com,Username,$2a$10$gdsZkf62vUfwHQX8pUGe2.7zqvBvcIPWseaJmboJw3U2sxDj18y5q
10,example#gmail.com,Username,$2a$10$gdsZkf62vUfwHQX8pUGe2.7zqvBvcIPWseaJmboJw3U2sxDj18y5q
Explanation
^ Assert position at the start of a line
(\d+) Capture one or more digits into capture group 1
, Match the comma character , literally
(.*$) Capture any number of any character (except newline characters) until the asserted position at the end of the line (asserting end of line position dramatically reduces steps) into capture group 2
(?=[\s\S]*^\1,(.*)) Positive lookahead asserting what follows matches
[\s\S]* Match any number of any character (\s: any whitespace character; \S: any non-whitespace character)
^ Assert position at the start of a line
\1 Matches the same text as most recently matched by the 1st capturing group
, Matches the comma character , literally
(.*) Capture any number of any character into capture group 3
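The pattern explained above can be tried in Python as a rough sketch. Assumptions: Python's re module is used with re.MULTILINE so that ^ and $ work per line, the replacement $1,$2,$3 is written as \1,\2,\3 in Python syntax, and HASH_ONE and HASH_TEN are short placeholders for the real hashes. Lines of the second set have no later line with the same id, so they are left unchanged and would still need to be dropped separately. Each match also scans forward through the rest of the input, which may be slow on a file of this size.

```python
import re

# ^ and $ must work per line, hence re.MULTILINE.
MERGE = re.compile(r"^(\d+),(.*$)(?=[\s\S]*^\1,(.*))", re.MULTILINE)

# Shortened sample: two ids, with placeholder hashes instead of the real ones.
data = "\n".join([
    "1,example#gmail.com,Username",
    "10,example#gmail.com,Username",
    "1,HASH_ONE",
    "10,HASH_TEN",
])

# Python writes $1,$2,$3 as \1,\2,\3 in the replacement string.
merged = MERGE.sub(r"\1,\2,\3", data)
merged_lines = merged.split("\n")

print(merged)
```

Note that id 1 is not confused with id 10, because the comma after \1 must follow the id directly.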