Regex to identify a specific pattern - regex

I am writing a regex to find a specific pattern in my string. I need to check whether the string satisfies every part of the pattern. The criteria are:
The name should start with either "P" or "Q" or "R"
Following the first character the string should match either "XYZ" or "ABCD"
If the XYZ is present then the 8th character should either be "H" or "D", if "ABCD" is present the 9th character should be either "H" or "D".
String could be:
PXYZ****H***** -> Should be true
QABCD****H***** -> Should be true
AXYG****Z***** -> Should be false
RABCD****H=D***** -> Should be true
I have tried ([P|Q|R])\w+ to check how the string starts, but I am not sure how to combine it with the other conditions.

Use
^[PQR](XYZ|ABCD)....[HD].*
EXPLANATION
^ asserts position at start of a line
Match a single character present in the list below [PQR]
PQR matches a single character in the list PQR (case sensitive)
1st Capturing Group (XYZ|ABCD)
1st Alternative XYZ
XYZ matches the characters XYZ literally (case sensitive)
2nd Alternative ABCD
ABCD matches the characters ABCD literally (case sensitive)
. matches any character (except for line terminators)
. matches any character (except for line terminators)
. matches any character (except for line terminators)
. matches any character (except for line terminators)
Match a single character present in the list below [HD]
HD matches a single character in the list HD (case sensitive)
. matches any character (except for line terminators)
* matches the previous token between zero and unlimited times, as many times as possible, giving back as needed (greedy)
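To check the pattern above against the four strings from the question, here is a minimal sketch in Python (the question does not name a language, so Python's re module is an assumption; the names PATTERN, samples and results are only for illustration):

```python
import re

# Pattern from the answer above: P/Q/R, then XYZ or ABCD, then any four
# characters, then H or D, then anything else.
PATTERN = re.compile(r"^[PQR](XYZ|ABCD)....[HD].*")

# Sample strings taken from the question.
samples = [
    "PXYZ****H*****",
    "QABCD****H*****",
    "AXYG****Z*****",
    "RABCD****H=D*****",
]

# Map each sample to True/False depending on whether the pattern matches.
results = {s: bool(PATTERN.match(s)) for s in samples}
for s, ok in results.items():
    print(f"{s!r}: {ok}")
```

With these inputs only AXYG****Z***** is rejected, which agrees with the expected results listed in the question.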

What is specific about the strings you want to match is that each one:
starts with P, Q, or R
continues with XYZ or ABCD
has an H or D followed by exactly five characters at the end
Here's my attempt:
'^[PQR](XYZ|ABCD).*[HD].{5}$'
Does it work for you?
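The two answers anchor the H/D check differently: one counts four characters after the prefix, the other counts five characters back from the end. A small Python sketch (Python is an assumption, and the string PXYZ****H** is a made-up input, not one from the question) shows where they diverge:

```python
import re

# H or D exactly four characters after the XYZ/ABCD prefix.
fixed_offset = re.compile(r"^[PQR](XYZ|ABCD)....[HD].*")
# H or D followed by exactly five characters, then end of string.
from_the_end = re.compile(r"^[PQR](XYZ|ABCD).*[HD].{5}$")


def check(s):
    """Return (fixed_offset result, from_the_end result) for one string."""
    return bool(fixed_offset.match(s)), bool(from_the_end.match(s))


sample_from_question = check("PXYZ****H*****")
shorter_tail = check("PXYZ****H**")  # hypothetical input with a shorter tail

print(sample_from_question)
print(shorter_tail)
```

Both patterns accept the sample from the question, but only the first accepts the shorter string, so which one fits depends on whether the tail length is really fixed.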

Related

Validating emails in file with batch

I have a file with emails and I need to validate them.
The sequence is:
First name.
Dot.
Last name.
Number (optional - for same names).
Static string domain (#utp.ac.pa).
I wrote this:
egrep -E [a-z]\.+[a-z][0-9]*#["utp.ac.pa"] test.txt
It should match this email: "anell.zheng#utp.ac.pa"
But it is also matching:
test4#utp.ac.pa
2anell#utp.ac.pa
Although they don't follow the sequence. What am I doing wrong?
Your regex doesn't even match the first email. If I understand your requirements correctly, this should work:
[A-Za-z]+\.[A-Za-z]+[0-9]*#utp\.ac\.pa
Note that to match a dot, it needs to be escaped (i.e., \.) because . matches any character.
You can get rid of A-Z if you don't want to match upper-case letters.
Let me know if this isn't what you want.
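As a quick check of the suggested pattern, here is a rough Python sketch (Python instead of egrep is an assumption, as is the helper name is_valid). It keeps the # separator exactly as written in the question and requires the whole string to match:

```python
import re

# Pattern from the answer above, with # as the separator as in the question.
EMAIL = re.compile(r"[A-Za-z]+\.[A-Za-z]+[0-9]*#utp\.ac\.pa")


def is_valid(address: str) -> bool:
    # fullmatch requires the entire string to match the pattern
    # (assumption: each line of the file holds exactly one address).
    return EMAIL.fullmatch(address) is not None


# The three addresses mentioned in the question.
addresses = ["anell.zheng#utp.ac.pa", "test4#utp.ac.pa", "2anell#utp.ac.pa"]
checked = {a: is_valid(a) for a in addresses}
for a, ok in checked.items():
    print(f"{a}: {ok}")
```

With grep, the same whole-line requirement would correspond to anchoring the expression with ^ and $.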
Regex: ^[A-Za-z]+\.[A-Za-z]+(?:_\d+)*#utp\.ac\.pa$
Regex Details:
^ asserts position at start of a line
Match a single character present in the list below [A-Za-z]+
\. matches the character . literally (case sensitive)
Match a single character present in the list below [A-Za-z]+
Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
Non-capturing group (?:_\d+)*
Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
_ matches the character _ literally (case sensitive)
\d+ matches a digit (equal to [0-9])
Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
#utp matches the characters #utp literally (case sensitive)
\. matches the character . literally (case sensitive)
ac matches the characters ac literally (case sensitive)
\. matches the character . literally (case sensitive)
pa matches the characters pa literally (case sensitive)
$ asserts position at the end of a line
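Note that this second pattern only accepts a number when it is preceded by an underscore. A short Python sketch makes that visible (Python is an assumption, and the two addresses containing a number are invented for illustration):

```python
import re

# Pattern from the answer above: digits are only allowed after an underscore.
EMAIL_WITH_SUFFIX = re.compile(r"^[A-Za-z]+\.[A-Za-z]+(?:_\d+)*#utp\.ac\.pa$")

cases = [
    "anell.zheng#utp.ac.pa",    # from the question
    "anell.zheng_2#utp.ac.pa",  # hypothetical: number with underscore
    "anell.zheng2#utp.ac.pa",   # hypothetical: number without underscore
]

outcome = {c: bool(EMAIL_WITH_SUFFIX.match(c)) for c in cases}
for c, ok in outcome.items():
    print(f"{c}: {ok}")
```

If the number is meant to follow the last name directly, the [0-9]* form from the first answer is probably closer to the stated requirement.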

Regular expression not working

I am trying to create a regular expression in javascript with the following rules:
At least 2 characters.
Should have at least 1 letter as a prefix, and then either end with a . or have a space or a - followed by more letters.
The following strings should be legal - aa, aaaaa, a., a-a, a a.
These should not be legal - a (too short), aa.aa. (two dots), aa- (after - should be another letter).
I don't know what I'm doing wrong here, but my regex doesn't seem to work: it is a valid expression, yet no word matches it:
(?=^.{2,}$)^(([a-z][A-Z])+([.]|[ -][a-zA-Z]+){0,1}$)
I had to rewrite it completely to cover the OP's comment. The new regex would be:
^[a-zA-Z][a-zA-Z]*[ -][a-zA-Z]*[a-zA-Z]$|^[a-zA-Z][a-zA-Z]*([a-zA-Z]|\.)$
Explanation
1st Alternative ^[a-zA-Z][a-zA-Z]*[ -][a-zA-Z]*[a-zA-Z]$
^ asserts position at start of a line
[a-zA-Z] matches a single character present in [a-zA-Z]
[a-zA-Z]* * Quantifier: matches between zero and unlimited times (greedy)
[ -] matches a single character, either a space or -
[a-zA-Z]*[a-zA-Z] matches zero or more letters followed by one final letter
$ asserts position at the end of a line
2nd Alternative
^[a-zA-Z][a-zA-Z]*([a-zA-Z]|\.)$
^ asserts position at start of a line
[a-zA-Z] matches a single character present in [a-zA-Z]
[a-zA-Z]* * Quantifier: matches between zero and unlimited times (greedy)
([a-zA-Z]|\.) matches a single character: a letter in [a-zA-Z] or a literal dot
$ asserts position at the end of a line
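The rewritten expression can be checked against the legal and illegal examples from the question. The question is about JavaScript; the sketch below uses Python's re module instead, which is an assumption, but the pattern uses no syntax specific to either engine:

```python
import re

# Pattern from the answer above, split over two lines for readability.
NAME = re.compile(
    r"^[a-zA-Z][a-zA-Z]*[ -][a-zA-Z]*[a-zA-Z]$"
    r"|^[a-zA-Z][a-zA-Z]*([a-zA-Z]|\.)$"
)

# Examples listed in the question.
legal = ["aa", "aaaaa", "a.", "a-a", "a a"]
illegal = ["a", "aa.aa.", "aa-"]

legal_results = [bool(NAME.match(s)) for s in legal]
illegal_results = [bool(NAME.match(s)) for s in illegal]

print(legal_results)
print(illegal_results)
```

All five legal strings match and all three illegal strings are rejected.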

regex: extract text blocks, defined beginning, undefined end

I have text like this:
Date: 01.02.2015 //<-stable format
something
something more
some random more
Date: 02.02.2015
something random
i dont know
So I have many such blocks. Each one starts with Date... and ends where the next Date... starts.
The text in the lines of a block could be anything, except the Date... format.
I need an array at the end, with such blocks:
array[0] = "Date: 01.02.2015
something
something more
some random more"
array[1] = "Date: 02.02.2015
something random
i dont know"
For now I add a unique splitter before each Date... and then split by that splitter.
Question: is it possible to get such blocks using only a regex?
(I use VBA to parse the text, with the RegExp object.)
Instead of split just match using
\bDate:\s\d{1,2}\.\d{1,2}\.\d{4}[\s\S]*?(?=\nDate:|$)
See demo.
https://regex101.com/r/uF4oY4/77
Syntax explanation (from the linked site):
\b assert position at a word boundary: (^\w|\w$|\W\w|\w\W)
Date: matches the characters Date: literally (case sensitive)
\s matches any whitespace character (equal to [\r\n\t\f\v ])
\d{1,2} matches a digit (equal to [0-9]) between 1 and 2 times, as many times as possible, giving back as needed (greedy)
\. matches the character . literally (case sensitive)
\d{1,2} matches a digit (equal to [0-9]) between 1 and 2 times, as many times as possible, giving back as needed (greedy)
\. matches the character . literally (case sensitive)
\d{4} matches a digit (equal to [0-9]) exactly 4 times
[\s\S] matches a single character present in the list below, i.e. any character, including newlines
\s matches any whitespace character (equal to [\r\n\t\f\v ])
\S matches any non-whitespace character (equal to [^\r\n\t\f\v ])
*? Quantifier: matches between zero and unlimited times, as few times as possible, expanding as needed (lazy); it applies to the preceding [\s\S] class
?= Positive Lookahead - Assert that the following Regex matches
\nDate: Option 1
\n matches a line-feed (newline) character (ASCII 10)
Date: matches the characters Date: literally (case sensitive)
$ Option 2: $ asserts position at the end of the string, or before the line terminator right at the end of the string (if any)
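To see the match-instead-of-split idea in action, here is a small sketch. It uses Python rather than VBA (an assumption made so the example is easy to run), with the sample text from the question minus the //<-stable format annotation:

```python
import re

# Pattern from the answer above: a Date line, then as little text as
# possible up to the next "\nDate:" or the end of the string.
BLOCK = re.compile(r"\bDate:\s\d{1,2}\.\d{1,2}\.\d{4}[\s\S]*?(?=\nDate:|$)")

text = (
    "Date: 01.02.2015\n"
    "something\n"
    "something more\n"
    "some random more\n"
    "Date: 02.02.2015\n"
    "something random\n"
    "i dont know"
)

# The pattern has no capturing groups, so findall returns the full matches.
blocks = BLOCK.findall(text)
for i, block in enumerate(blocks):
    print(f"array[{i}] = {block!r}")
```

In VBA the equivalent would presumably be the RegExp object's Execute method with Global set to True, so that every block is returned rather than only the first.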

RegExp: Ignore all links starting with a specific set of characters

I'm using the RegExp below to find all links in a string. How can I add a condition that ignores all links starting with one of these characters: . _ - or a space (e.g. .sub.example.com, -example.com)?
AS3:
var str = "hello world .sub.example.com foo bar -example.com lorem http://example.com/test";
var filter:RegExp = /((https?:\/\/|www\.)?[äöüa-z0-9]+[äöüa-z0-9\-\:\/]{1,}+\.[\*\!\'\(\)\;\:\#\&\=\$\,\?\#\%\[\]\~\-\+\_äöüa-z0-9\/\.]{2,}+)/gi
var links = str.match(filter);
if (links !== null) {
    trace("Links: " + links);
}
You can use the following regex:
\b((https?:\/\/|www\.)?(?<![._ -])[äöüa-z0-9]+[äöüa-z0-9\-\:\/]{1,}+\.[\*\!\'\(\)\;\:\#\&\=\$\,\?\#\%\[\]\~\-\+\_äöüa-z0-9\/\.]{2,}+)\b
Edits:
Added word boundaries \b
Added a negative lookbehind for [._ -], i.e. (?<![._ -])
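The effect of the two edits can be tried with a deliberately simplified sketch. This is not the full expression above: it is written in Python rather than AS3, drops the possessive quantifiers and the umlauts, and uses a reduced character set, all of which are assumptions made to keep it short.

```python
import re

LINK = re.compile(
    r"\b(?:https?://|www\.)?"         # optional protocol or www. prefix
    r"(?<![._ -])"                    # previous character must not be . _ space or -
    r"[a-z0-9]+[a-z0-9\-:/]+"         # host part (reduced character set)
    r"\.[a-z0-9/.\-_~%?&=#+]{2,}\b",  # dot plus at least two more characters
    re.IGNORECASE,
)

# Input string from the question.
text = "hello world .sub.example.com foo bar -example.com lorem http://example.com/test"

# No capturing groups are used, so findall returns the whole matches.
links = LINK.findall(text)
print(links)
```

For this input only http://example.com/test is reported. Because a space is in the excluded set, a bare domain that follows a space would be skipped by this sketch as well; the link here survives only because the lookbehind is checked after the protocol prefix.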
This is the regex I use to find URLs in full text:
/\b(https?|ftp|file):\/\/[-A-Z0-9+&##\/%?=~_|$!:,.;]*[A-Z0-9+&##\/%=~_|$]/i
Regex explanation:
\b(https?|ftp|file)://[-A-Z0-9+&##/%?=~_|$!:,.;]*[A-Z0-9+&##/%=~_|$]
Assert position at a word boundary «\b»
Match the regex below and capture its match into backreference number 1 «(https?|ftp|file)»
Match this alternative «https?»
Match the character string “http” literally «http»
Match the character “s” literally «s?»
Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
Or match this alternative «ftp»
Match the character string “ftp” literally «ftp»
Or match this alternative «file»
Match the character string “file” literally «file»
Match the character string “://” literally «://»
Match a single character present in the list below «[-A-Z0-9+&##/%?=~_|$!:,.;]*»
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
The literal character “-” «-»
A character in the range between “A” and “Z” «A-Z»
A character in the range between “0” and “9” «0-9»
A single character from the list “+&##/%?=~_|$!:,.;” «+&##/%?=~_|$!:,.;»
Match a single character present in the list below «[A-Z0-9+&##/%=~_|$]»
A character in the range between “A” and “Z” «A-Z»
A character in the range between “0” and “9” «0-9»
A single character from the list “+&##/%=~_|$” «+&##/%=~_|$»
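A short Python sketch of this expression in use (Python is an assumption, and the sample sentence and its URLs are made up). The character classes are copied as they appear in the explanation above:

```python
import re

URL = re.compile(
    r"\b(https?|ftp|file)://[-A-Z0-9+&##/%?=~_|$!:,.;]*[A-Z0-9+&##/%=~_|$]",
    re.IGNORECASE,
)

# Hypothetical text with punctuation directly after each URL.
text = "See http://example.com/test, or ftp://host/file.txt."

# The pattern has a capturing group for the scheme, so use finditer and
# group(0) to get each whole URL rather than only the scheme.
urls = [m.group(0) for m in URL.finditer(text)]
print(urls)
```

The final character class leaves out , . ; : and !, which is why the trailing comma and full stop are not swallowed into the matches.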

characters between two delimiters

Trying to put a regex expression together that returns the string between _ and _$ (where $ is the end of the string).
input:
abc_def_ghi_
desired regex outcome:
def_ghi
I've tried quite a few combinations such as this:
((([^_]*){1})[^_]*)_$
Any help appreciated.
Note: the regex above returns abc_def, and not the desired def_ghi.
So it's everything between the first _ and the final _ (both excluding)?
Then try
(?<=_).*(?=_$)
(hoping you're not using JavaScript)
Explanation:
(?<=_) # Assert that the previous character is a _
.* # Match any number of characters...
(?=_$) # ... until right before the final, string-ending _
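In an engine that supports lookbehind this can be tried directly. A minimal sketch in Python (the choice of Python is an assumption; the question does not say which engine is used):

```python
import re

# Lookbehind for the opening underscore, lookahead for the closing
# underscore at the end of the string.
match = re.search(r"(?<=_).*(?=_$)", "abc_def_ghi_")

between = match.group(0) if match else None
print(between)
```

The search starts at the first position preceded by an underscore, and the greedy .* then backs off one character so the lookahead can see the final underscore.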
You could try to use the greediness of operators to your advantage:
^.*?_(.*)_$
This matches lazily from the start of the string up to the first underscore, then greedily from that underscore to the end of the string, where it expects an underscore followed by the end of the string; the text in between is captured in the first group.
^ Beginning of string
.*? Any number of characters (zero or more), lazy
_ Literal underscore
(.*) Any number of characters, greedy, captured in group 1
_ Literal underscore
$ End of string
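This variant needs no lookbehind, so it should also work in engines without that feature. A minimal Python sketch (Python is an assumption) that reads the result from the first capture group:

```python
import re

# Lazy prefix up to the first underscore, greedy capture, final underscore.
match = re.match(r"^.*?_(.*)_$", "abc_def_ghi_")

# The wanted text is in capture group 1, not in the overall match.
captured = match.group(1) if match else None
print(captured)
```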
I was searching for this within a larger log entry:
"threat_name":"PUP.Optional.Wajam"
The format enclosed the field name in double quotes then a colon then the value in double quotes.
Here's what I ended up with to avoid punctuation breaking the regex:
threat_name["][:]["](?P<signature>.*?)["]
(from regex101.com)
threat_name matches the characters threat_name literally (case sensitive)
["] match a single character present in the list below
" a single character in the list " literally (case sensitive)
[:] match a single character present in the list below
: the literal character :
["] match a single character present in the list below
" a single character in the list " literally (case sensitive)
(?P<signature>.*?) Named capturing group signature
.*? matches any character (except newline)
Quantifier: *? Between zero and unlimited times, as few times as possible, expanding as needed [lazy]
["] match a single character present in the list below
" a single character in the list " literally (case sensitive)