Regex postive lookbehind only if after specific string - regex

I'm trying to form a regex to find the "from" address for any forwarded emails. There may be multiple formats, so it needs to match this pattern [Ffrom:] [there may be more text here including spaces,",A-Z,a-z,<,[, and multiple words] and then an email address. Here's what I have so far:
(?<=)[\w-]*#[\w-]*[.\w]*
This finds any email addresses, but I need it to only find those following this pattern. "From:" ["Possible Sender Name Here"] example-email1#test2.co.uk The line will always start with "[Ff]rom:" and this will be implemented in Powershell if that matters.

Usually headers are in the format From: "Fooy McBar" <Fooy#bar.com> which can be matched with:
-replace 'From:.*<(.+)>','$1'
Really the easiest way to match an email address at the end of a line is to just match everything after the last whitespace though:
-replace '^.*\s(.+)$','$1'
will match Fooy#bar.com in both:
From: "Fooy McBar" Fooy#bar.com
and
From: Fooy#bar.com

Related

Get an exact regex match of an email value from a list of email addresses

I have a text field which stores a list of email addresses e.g: x#demo.com; a.x#demo.com. I have another text field which stores the exact value matched from the list of emails i.e. if /x#demo.com/i is in x#demo.com;a.x#demo.com then it should return x#demo.com.
The issue I am having is that if I have /a.x#demo.com/i, I will get x#demo.com instead of a.x#demo.com
I know of the regex expression /^x#demo.com$/i, but this means I can only have one email in my list of email addresses which won't help.
I have tried a couple of other regex expressions with no luck.
Any ideas on how I can achieve this?
You can use this slightly changed regex:
/(^|;)x#demo.com($|;)/i
It will match from either beginning of string or start after a semi colon and end either at end of string or at a semi colon.
Edit:
Small change, this uses look behind and look forward, then you will only get the match, you want:
(?<=^|;)x#demo.com(?=$|;)
Edit2:
To allow Spaces around the semi colon and at start and end, use this (#-quoted):
#"(?<=^\s*|;\s*)x#demo.com(?=\s*$|\s*;)"
or use double escaping:
"(?<=^\\s*|;\\s*)x#demo.com(?=\\s*$|\\s*;)"

Regex get domain name from email

I am learning regex and am having trouble getting google from email address
String
first.name#google.com
I just want to get google, not google.com
Regex:
[^#].+(?=\.)
Result: https://regex101.com/r/wA5eX5/1
From my understanding. It ignore # find a string after that until . (dot) using (?=\.)
What did I do wrong?
[^#] means "match one symbol that is not an # sign. That is not what you are looking for - use lookbehind (?<=#) for # and your (?=\.) lookahead for \. to extract server name in the middle:
(?<=#)[^.]+(?=\.)
The middle portion [^.]+ means "one or more non-dot characters".
Demo.
Updated answer:Use a capturing group and keep it simple :)
#(\w+)
Explanation by splitting it up
( capturing group for extraction )
\w stands for word character [A-Za-z0-9_]
+ is a quantifier for one or more occurances of \w
Regex explanation and demo on Regex101
I used the solution's regex for my task, but realized that some of the emails weren't that easy: foo#us.industries.com, foobar#tm.valves.net, andfoo#ge.test.com
To anyone who came here wanting the sub domain as well (or is being cut off by it), here's the regex:
(?<=#)[^.]*.[^.]*(?=\.)
This should be the regex:
(?<=#)[^.]+
(?<=#) - places the search right after the #
[^.]+ - take all the characters that are not dot (stops on dot)
So it extracts google from the email address.
As I was working to get the domain name of email addresses and none corresponded to what I needed:
To not catch subdomains
To match countries top domains (like .com.ar or co.jp)
For example, in test#ext.domain.com.mx I need to match domain.com.mx
So I made this one:
[^.#]*?\.\w{2,}$|[^.#]*?\.com?\.\w{2}$
Here is a link to regex101 to illustrate the regex: https://regex101.com/r/vE8rP9/59
You can get the sumdomain name (without the top-level domain ex: .com or .com.mx) by adding lookaround operators (but it will match twice in test#test.com.mx):
[^.#]*?(?=\.\w{2,}$)|[^.#]*?(?=\.com?\.\w{2}$)
Maybe not strictly a "full regex answer" but more flexible ( in case the part before the # is not "first.last") would be using cut:
cut -d # -f 2 | cut -d . -f 1
The first cut will isolate the part after # and the second one will get what you want.
This will work also for another kinds of email patterns : xxxx#server.com / xxx.yyy.zzz# server.com and so on...
Thanks everyone for your great responses, I took what you had and expanded it with labelled match-groups for easy extraction of separate parts.
Caveat : Regex.Speed = Slow
Another post mentioned how SLOW and nonperformant regexes are, and that is a fair point to remember. My particular need is targeting my own background/slow/reporting processes and therefore it doesn't matter how long it takes.
But it's good to remember whenever possible Regex should NOT be used in any sort of web page load or "needs-to-be-quick" kind of application. In that case you're much better off using substring to algorithmically strip down the inputs and throw away all the junk that I'm optionally matching/allowing/including here.
https://regex101.com/r/ZnU3OC/1
One Regex to rule them all...
Subdomain/Domain/TopLevelDomain/CountryCode extraction for Emails, domain lists, & URLs
Also handles ?Querystring=junk, Slashes/With/Paths, #anchors
Now with more broth, batteries not included
^(?<Email>.*#)?(?<Protocol>\w+:\/\/)?(?<SubDomain>(?:[\w-]{2,63}\.){0,127}?)?(?<DomainWithTLD>(?<Domain>[\w-]{2,63})\.(?<TopLevelDomain>[\w-]{2,63}?)(?:\.(?<CountryCode>[a-z]{2}))?)(?:[:](?<Port>\d+))?(?<Path>(?:[\/]\w*)+)?(?<QString>(?<QSParams>(?:[?&=][\w-]*)+)?(?:[#](?<Anchor>\w*))*)?$
not overly complicated at all... why would you even say that?
Substitution / Outputs
EXAMPLE INPUT: "https://www.stackoverflow.co.uk/path/2?q=mysearch&and=more#stuff"
EXAMPLE OUTPUT:
{
Protocol: "https://"
SubDomain: "www"
DomainWithTLD: "stackoverflow.co.uk"
Domain: "stackoverflow"
TopLevelDomain: "co"
CountryCode: "uk"
Path: "/path/2"
QString: "?q=mysearch&and=more#stuff"
}
Allowed/Compliant Domains : Should ALL MATCH
www.bankofamerica.com
bankofamerica.com.securersite.regexr.com
bankofamerica.co.uk.blahblahblah.secure.com.it
dashes-bad-for-seo.but-technically-still-allowed.not-in-front-or-end
bit.ly
is.gd
foo.biz.pl
google.com.cn
stackoverflow.co.uk
level_three.sub_domain.example.com
www.thelongestdomainnameintheworldandthensomeandthensomemoreandmore.com
https://www.stackoverflow.co.uk?q=mysearch&and=more
foo://5th.4th.3rd.example.com:8042/over/there
foo://subdomain.example.com:8042/over/there?name=ferret#nose
example.com
www.example.com
example.co.uk
trailing-slash.com/
trailing-pound.com#
trailing-question.com?
probably-not-valid.com.cn?&#
probably-not-valid.com.cn/?&#
example.com/page
example.com?key=value
* NOTE: PunyCodes (Unicode in urls) handled just fine with \w ,no extra sauce needed
xn--fsqu00a.xn--0zwm56d.com
xn--diseolatinoamericano-66b.com
Emails : Should ALL MATCH
first.name#google1.co.com
foo#us.industries.com,
foobar#tm.valves.net,
andfoo#ge.test.com
jane.doe#my-bank.no
john.doe#spam.com
jane.ann.doe#sandnes.district.gov
Non-Compliant Domains : Should NOT MATCH
either not long-enough (domain min length 2), or too long (64)
v.gd
thing.y
0123456789012345678901234567890123456789012345678901234567891234.com
its-sixty-four-instead-of-sixty-three!.com
symbols-not-allowed#.com
symbols-not-allowed#.com
symbols-not-allowed$.com
symbols-not-allowed%.com
symbols-not-allowed^.com
symbols-not-allowed&.com
symbols-not-allowed*.com
symbols-not-allowed(.com
symbols-not-allowed).com
symbols-not-allowed+.com
symbols-not-allowed=.com
TBD Not handled:
* dashes as start or ending is disallowed (dropped from Regex for readability)
-junk-.com
* is underscore allowed? i donno... (but it simplifies the regex using \w instead of [a-zA-Z0-9\-] everywhere)
symbols-not-allowed_.com
* special case localhost?
.localhost
also see:
Domain Name Rules :: Super handy ASCII Diagram of a URL
see: https://stackoverflow.com/a/66660651/738895 *
Side NOTE: lazy load '?' for subdomains{0,127}? currently needed for any of the cases with country codes... (example: stackoverflow.co.uk)
Matches these, but does NOT grab $NLevelSubdomains in a match group, can only grab 3rd level only.
This is a relatively simple regex, and it grabs everything between the # and the final domain extension (e.g. .com, .org). It allows domain names that are made up of non-word characters, which exist in real-world data.
>>> regex = re.compile(r"^.+#(.+)\.[\w]+$")
>>> regex.findall('jane.doe#my-bank.no')
['my-bank']
>>> regex.findall('john.doe#spam.com')
['spam']
>>> regex.findall('jane.ann.doe#sandnes.district.gov')
['sandnes.district']
I used this regular expression to get the complete domain name '.*#+(.*)' where .* will ignore all the character before # (by #+) and start extracting cpmlete domain name by mentioning paranthesis and complete string inside(except linebrake characters)

addHttpRequestHandler regex not matching

I am attempting to match a regex pattern in the middle of an incoming request.
I have verified my regex pattern works with regex101.com.
The documentation states that addHttpRequestHandler regex will match anywhere given the correct pattern.
addHttpRequestHandler('/clinServicesScheduled',
'Scripts/clinServicesScheduled.js', 'clinServicesScheduled');
The above only matches at the beginning of the request.
What is the correct pattern to match in the middle of the request?
http://127.0.0.1/data/dashboard/community/communityCode/clinServicesScheduled/month/...
http://livedoc.wakanda.org/Global-Application/Application/addHttpRequestHandler.301-636268.en.html
/myPattern: intercepts any requests containing /myPattern regardless of its position in the string. This pattern will accept indifferently requests such as /files/myPattern, /myPattern.html, and /myPattern/bar.js.
try the following regex. it looks like wakanda always matches the urlPath from the beginning even though it is documented differently in the docs: http://doc.wakanda.org/home2.en.html#/Global-Application/Application/addHttpRequestHandler.301-636268.en.html
addHttpRequestHandler('.*/clinServicesScheduled', 'Scripts/clinServicesScheduled.js', 'clinServicesScheduled');
i have used this before to catch patterns in the middle of a request string

How to build this regex?

I want to match the emails in following texts,
uma#cs.stanford.edu - match
uma at cs.Stanford.edu - match
http://infolab.stanford.edu/~widom/yearoff.h
we
genale.stanford.edu
n <A href="mailto:cheriton#cs.stanford.edu - match
hola # kirti.edu - match
Now I want to capture 2 parts of email address only like (uma) and (cs.stanford) in the email uma#cs.stanford.edu.
My current pattern is :
(\w+)[(\s+at\s+)|(\s*#\s*)]+(\w+|\w+\.\w+).edu
But it matches the string - infolab.stanford.edu - which I don't want.
Can anybody suggest any modification on this?
As long as you understand that this regex doesn't verify the correctness of your email address, but merely acts as a quick first line of defense against malformed addresses, an easy fix to your regex is as follows:
([\w.]+)(?:\s+at\s+|\s*#\s*)(\w+|\w+\.\w+).edu
In particular your regex was missing addresses with usernames containing . (which for example my main email address uses), as well as had a messed up middle part (pretending it's a character class and something weird about letting it repeat??). You can see the results here: http://refiddle.com/2js1

Regular Expression for some email rules

I was using a regular expression for email formats which I thought was ok but the customer is complaining that the expression is too strict. So they have come back with the following requirement:
The email must contain an "#" symbol and end with either .xx or .xxx ie.(.nl or .com). They are happy with this to pass validation. I have started the expression to see if the string contains an "#" symbol as below
^(?=.*[#])
this seems to work but how do I add the last requirement (must end with .xx or .xxx)?
A regex simply enforcing your two requirements is:
^.+#.+\.[a-zA-Z]{2,3}$
However, there are email validation libraries for most languages that will generally work better than a regex.
I always use this for emails
^([a-zA-Z0-9_\-\.]+)#((\[[0-9]{1,3}" +
#"\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\" +
#".)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)$
Try http://www.ultrapico.com/Expresso.htm as well!
It is not possible to validate every E-Mail Adress with RegEx but for your requirements this simple regex works. It is neither complete nor does it in any way check for errors but it exactly meets the specs:
[^#]+#.+\.\w{2,3}$
Explanation:
[^#]+: Match one or more characters that are not #
#: Match the #
.+: Match one or more of any character
\.: Match a .
\w{2,3}: Match 2 or 3 word-characters (a-zA-Z)
$: End of string
Try this :
([\w-\.]+)#((?:[\w]+\.)+)([a-zA-Z]{2,4})\be(\w*)s\b
A good tool to test our regular expression :
http://gskinner.com/RegExr/
You could use
[#].+\.[a-z0-9]{2,3}$
This should work:
^[^#\r\n\s]+[^.#]#[^.#][^#\r\n\s]+\.(\w){2,}$
I tested it against these invalid emails:
#exampleexample#domaincom.com
example#domaincom
exampledomain.com
exampledomain#.com
exampledomain.#com
example.domain#.#com
e.x+a.1m.5e#em.a.i.l.c.o
some-user#internal-email.company.c
some-user#internal-ema#il.company.co
some-user##internal-email.company.co
#test.com
test#asdaf
test#.com
test.#com.co
And these valid emails:
example#domain.com
e.x+a.1m.5e#em.a.i.l.c.om
some-user#internal-email.company.co
edit
This one appears to validate all of the addresses from that wikipedia page, though it probably allows some invalid emails as well. The parenthesis will split it into everything before and after the #:
^([^\r\n]+)#([^\r\n]+\.?\w{2,})$
niceandsimple#example.com
very.common#example.com
a.little.lengthy.but.fine#dept.example.com
disposable.style.email.with+symbol#example.com
other.email-with-dash#example.com
user#[IPv6:2001:db8:1ff::a0b:dbd0]
"much.more unusual"#example.com
"very.unusual.#.unusual.com"#example.com
"very.(),:;<>[]\".VERY.\"very#\\ \"very\".unusual"#strange.example.com
postbox#com
admin#mailserver1
!#$%&'*+-/=?^_`{}|~#example.org
"()<>[]:,;#\\\"!#$%&'*+-/=?^_`{}| ~.a"#example.org
" "#example.org
üñîçøðé#example.com
üñîçøðé#üñîçøðé.com