Validating emails in file with batch - regex

I have a file with emails and I need to validate them.
The sequence is:
First name.
Dot.
Last name.
Number (optional - for same names).
static string domain(#utp.ac.pa).
I wrote this:
egrep -E [a-z]\.+[a-z][0-9]*#["utp.ac.pa"] test.txt
It should match this email: "anell.zheng#utp.ac.pa"
But it is also matching:
test4#utp.ac.pa
2anell#utp.ac.pa
Although they don't follow the sequence. What am I doing wrong?

Your regex doesn't even match the first email. If I understand your requirements correctly, this should work:
[A-Za-z]+\.[A-Za-z]+[0-9]*#utp\.ac\.pa
Note that to match a dot, it needs to be escaped (i.e., \.) because . matches any character.
You can get rid of A-Z if you don't want to match upper-case letters.
Try it online.
Let me know if this isn't what you want.

Regex: ^[A-Za-z]+\.[A-Za-z]+(?:_\d+)*#utp\.ac\.pa$
Demo
Regex Details:
^ asserts position at start of a line
Match a single character present in the list below [A-Za-z]+
. matches the character . literally (case sensitive)
Match a single character present in the list below [A-Za-z]+
Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
Non-capturing group (?:_\d+)*
Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
_ matches the character _ literally (case sensitive)
\d+ matches a digit (equal to [0-9])
Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
#utp matches the characters #utp literally (case sensitive)
. matches the character . literally (case sensitive)
ac matches the characters ac literally (case sensitive)
. matches the character . literally (case sensitive)
pa matches the characters pa literally (case sensitive)
$ asserts position at the end of a line

Related

PCRE to find all urls without file extensions that include #, ?, & etc

Very much a rookie question - have the following which I am using to look for any urls without the following extensions (seems to work):
href="\S*(?i)(?<!\.html)(?<!\.pdf)(?<!\.doc)(?<!\.docx)(?<!\.ppt)(?<!\.pptx)(?<!\.xls)(?<!\.xlsx)(?<!\.jpg)(?<!\.jpeg)(?<!\.eps)"
Where I am struggling is trying to figure out how to also find file names with extensions and exclude such as:
test.html#help
test.html?help
test.html?help&please
Not sure how to take something like this (?<!\.html) and add a wildcard to handle anything after .html
Did some more testing via an online regex tester site and this seems to work - matches any of the file extensions including test.html#help etc :
href="\S*(?i)(((?<=\.html)\S*)|((?<=\.pdf)\S*)|((?<=\.doc)\S*)|((?<=\.ppt)\S*)|((?<=\.xls)\S*)|((?<=\.jpg)\S*)|((?<=\.jpeg)\S*)|((?<=\.eps)\S*))"
but this does not work at all:
href="\S*(?i)((?<!\.html)\S*)"
Any help greatly appreciated.
Does this work for you?
href="(?<url>.+?\.(?!(html|xlsm|pdf|doc|ppt|jpg|jpeg|eps)).*?)"
https://regex101.com/r/SPjvkR/1
href=" matches the characters href=" literally (case sensitive)
Named Capture Group url (?<url>.+?\.(?!(html|xlsm|pdf|doc|ppt|jpg|jpeg|eps)).*?)
(?<url> is what gives the capturing group the name. It can be omitted if you don't want the capturing group to have a name or can be replaced with ?: to make the it a non-capturing group. Naming can just make it more convenient to get the group's value in later code if needed, but in your case I don't think it matters
.+? matches any character (except for line terminators) between one and unlimited times, as few times as possible, expanding as needed (lazy)
\. matches the character . with index 4610 (2E16 or 568) literally (case sensitive)
Negative Lookahead (?!(html|xlsm|pdf|doc|ppt|jpg|jpeg|eps))
Assert that the Regex below does not match
2nd Capturing Group (html|xlsm|pdf|doc|ppt|jpg|jpeg|eps)
.*? matches any character (except for line terminators) between zero and unlimited times, as few times as possible, expanding as needed (lazy)
" matches the character " with index 3410 (2216 or 428) literally (case sensitive)
Update
New regex based on comments
href="((?!.*\.(?:html|xlsm|pdf|doc|ppt|jpg|jpeg|eps)).*?)"
Regex Demo
href=" matches the characters href=" literally (case sensitive)
1st Capturing Group ((?!.*\.(?:html|xlsm|pdf|doc|ppt|jpg|jpeg|eps)).*?)
Negative Lookahead (?!.*\.(?:html|xlsm|pdf|doc|ppt|jpg|jpeg|eps))
Assert that the Regex below does not match
. matches any character (except for line terminators)
* matches the previous token between zero and unlimited times, as many times as possible, giving back as needed (greedy)
\. matches the character . with index 4610 (2E16 or 568) literally (case sensitive)
Non-capturing group (?:html|xlsm|pdf|doc|ppt|jpg|jpeg|eps)
. matches any character (except for line terminators)
*? matches the previous token between zero and unlimited times, as few times as possible, expanding as needed (lazy)
" matches the character " with index 3410 (2216 or 428) literally (case sensitive)

Google Sheets Regex Match all URL between two strings (In a paragraph)

I have try to build my regex, to using Google Sheet to extract the domain url from any paragraph:
Website: https://www.interprism.co.jp/ => interprism.co.jp
Website: https://growupwork.com => growupwork.com
Email: contact#interprism.com website: None => interprism.com
HP: onetech.jp => onetech.jp
Web:interprism.jp/index.html => interprism.jp
I have tried with this look ok, =iferror(regexextract(A11,".+?[#|www.](.*\n?)( )")) but not match all case, any one can help me on this?
Best Regards
Nim
You can try:
=iferror(regexextract(A11,"(?:https?:\/\/)?(?:[^#]+#)?(?:www\.)?([^:\/?]+)"))
? Quantifier — Matches between zero and one times, as many times as possible, giving back as needed (greedy)
http matches the characters http literally (case sensitive)
s? matches the character s literally (case sensitive)
? Quantifier — Matches between zero and one times, as many times as possible, giving back as needed (greedy)
: matches the character : literally (case sensitive)
\/ matches the character / literally (case sensitive)
Non-capturing group (?:[^#]+#)?
? Quantifier — Matches between zero and one times, as many times as possible, giving back as needed (greedy)
Match a single character not present in the list below [^#]+
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
# matches the character # literally (case sensitive)
Non-capturing group (?:www\.)?
? Quantifier — Matches between zero and one times, as many times as possible, giving back as needed (greedy)
www matches the characters www literally (case sensitive)
\. matches the character . literally (case sensitive)
1st Capturing Group ([^:\/?]+)
Match a single character not present in the list below [^:\/?]+
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
: matches the character : literally (case sensitive)
\/ matches the character / literally (case sensitive)
? matches the character ? literally (case sensitive)

Regex / grep match on: "not this" and "that"

From a Linux command line, I would like to find all the instances in multiple files where I do not reference a figure reference with Fig..
So I'm looking each line for when I don't preface \ref{fig with exactly Fig. .
Fig. \ref{fig:myFigure}
A sentence with Fig. \ref{fig:myFigure} there.
\ref{fig:myFigure}
A sentence with \ref{fig:myFigure} there.
The regex should ignore cases (1) and (2), but find cases (3) and (4).
You can use Negative Lookahead like:
^((?!Fig\. {0,1}\\ref\{fig).)*$
https://regex101.com/r/wSw9iI/2
Negative Lookahead (?!Fig\.\s*\\ref\{fig)
Assert that the Regex below does not match
Fig matches the characters Fig literally (case sensitive)
\. matches the character . literally (case sensitive)
\s* matches any whitespace character (equal to [\r\n\t\f\v ])
* Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
\\ matches the character \ literally (case sensitive)
ref matches the characters ref literally (case sensitive)
\{ matches the character { literally (case sensitive)
fig matches the characters fig literally (case sensitive)

regex: extract text blocks, defined beginning, undefined end

i have text like this:
Date: 01.02.2015 //<-stable format
something
something more
some random more
Date: 02.02.2015
something random
i dont know
so i have many such blocks. Starts with Date... ends with next Date... start.
The text in the lines in the block could be anything, but not Date... format
I need an array at the end, with such blocks:
array[0] = "Date: 01.02.2015
something
something more
some random more"
array[1] = "Date: 02.02.2015
something random
i dont know"
for now i add some unique splitter before Date... than split by the splitter.
Question: is it possible to get such blocks only by regex?
(i use VBA to parse the text, RegExp object)
Instead of split just match using
\bDate:\s\d{1,2}\.\d{1,2}\.\d{4}[\s\S]*?(?=\nDate:|$)
See demo.
https://regex101.com/r/uF4oY4/77
Syntax explanation (from the linked site):
\b assert position at a word boundary: (^\w|\w$|\W\w|\w\W)
Date: matches the characters Date: literally (case sensitive)
\s matches any whitespace character (equal to [\r\n\t\f\v ])
\d{1,2} matches a digit (equal to [0-9]) between 1 and 2 times, as many times as possible, giving back as needed (greedy)
. matches the character . literally (case sensitive)
\d{1,2} matches a digit (equal to [0-9]) between 1 and 2 times, as many times as possible, giving back as needed (greedy)
. matches the character . literally (case sensitive)
\d{4} matches a digit (equal to [0-9]) exactly 4 times
\s matches any whitespace character (equal to [\r\n\t\f\v ])
\S matches any non-whitespace character (equal to [^\r\n\t\f\v ])
*? Quantifier — Matches between zero and unlimited times, as few times as possible, expanding as needed (lazy) , what specified in previous brackets
?= Positive Lookahead - Assert that the following Regex matches
\nDate Option 1
\n matches a line-feed (newline) character (ASCII 10)
Date matches the characters Date: literally (case sensitive)
$: Option 2 - $ asserts position at the end of the string, or before the line terminator right at the end of the string (if any)

Regex to match URL end-of-line or "/" character

I have a URL, and I'm trying to match it to a regular expression to pull out some groups. The problem I'm having is that the URL can either end or continue with a "/" and more URL text. I'd like to match URLs like this:
http://server/xyz/2008-10-08-4
http://server/xyz/2008-10-08-4/
http://server/xyz/2008-10-08-4/123/more
But not match something like this:
http://server/xyz/2008-10-08-4-1
So, I thought my best bet was something like this:
/(.+)/(\d{4}-\d{2}-\d{2})-(\d+)[/$]
where the character class at the end contained either the "/" or the end-of-line. The character class doesn't seem to be happy with the "$" in there though. How can I best discriminate between these URLs while still pulling back the correct groups?
To match either / or end of content, use (/|\z)
This only applies if you are not using multi-line matching (i.e. you're matching a single URL, not a newline-delimited list of URLs).
To put that with an updated version of what you had:
/(\S+?)/(\d{4}-\d{2}-\d{2})-(\d+)(/|\z)
Note that I've changed the start to be a non-greedy match for non-whitespace ( \S+? ) rather than matching anything and everything ( .* )
You've got a couple regexes now which will do what you want, so that's adequately covered.
What hasn't been mentioned is why your attempt won't work: Inside a character class, $ (as well as ^, ., and /) has no special meaning, so [/$] matches either a literal / or a literal $ rather than terminating the regex (/) or matching end-of-line ($).
/(.+)/(\d{4}-\d{2}-\d{2})-(\d+)(/.*)?$
1st Capturing Group (.+)
.+ matches any character (except for line terminators)
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
2nd Capturing Group (\d{4}-\d{2}-\d{2})
\d{4} matches a digit (equal to [0-9])
{4} Quantifier — Matches exactly 4 times
- matches the character - literally (case sensitive)
\d{2} matches a digit (equal to [0-9])
{2} Quantifier — Matches exactly 2 times
- matches the character - literally (case sensitive)
\d{2} matches a digit (equal to [0-9])
{2} Quantifier — Matches exactly 2 times
- matches the character - literally (case sensitive)
3rd Capturing Group (\d+)
\d+ matches a digit (equal to [0-9])
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
4th Capturing Group (.*)?
? Quantifier — Matches between zero and one times, as many times as possible, giving back as needed (greedy)
.* matches any character (except for line terminators)
* Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
$ asserts position at the end of the string
In Ruby and Bash, you can use $ inside parentheses.
/(\S+?)/(\d{4}-\d{2}-\d{2})-(\d+)(/|$)
(This solution is similar to Pete Boughton's, but preserves the usage of $, which means end of line, rather than using \z, which means end of string.)