Regex email(several types) extraction - regex

I'm tring to extract email adressess from a content. I've a problem about false positives.
My regex for: example#site.com
[^\.^\w+](\w+) *?# *?(\w+) *?(?:\.|dot) *?(\w+)
Regex for: example#sub.site.com
[^\.^\w+](\w+) *?# *?(\w+) *?(?:\.|dot) *?(\w+) *?(?:\.|dot) *?(\w+)
I want the first regex not to match with:
example#sub.site
How can I fix it?

The only way to distinguish example#site.com and example#sub.site is to maintain a list of valid top level domains (yes, I'm sorry).
i.e, replacing your last (\w+) by (com|org|info|ly|... and so on.
There is no universal way.
Also, you could do only one regex.
Also, my address could be example#sub1.sub2.site.com, be careful...

Related

Regex, how to match all urls but one?

My Regex skills a minimum, I have been trying for a while now to get this to work:
I need to match all urls in one domain, but one (the login one).
Example:
Match: domain.com/ANYTHING-GOES-HERE
but
Not Match: domain.com/login
I don't actually need to match the domain.com part because that's always the same, what comes after it.
I have tried:
(?!\/login)\/.*
\/.*[^login]
Neither one seems to work as desired.
Update:
I should have explained that this is done in PHP. I don't have control over the actual code that runs the regex, but I do have control over how many regex I can have. So I could have one regex that matches everything, and then have one regex that matches or not matches "/login"
You're almost there:
// javascript
r = /domain\.com\/(?!login).+/
r.test("domain.com/ANYTHING-GOES-HERE") // true
r.test("domain.com/login") // false
This also rejects "domain.com/login/foobar", if you want it to be accepted, modify the regex to be
r = /domain\.com\/(?!login$).+/

Regex to match anything after /

I'm basically not in the clue about regex but I need a regex statement that will recognise anything after the / in a URL.
Basically, i'm developing a site for someone and a page's URL (Local URL of Course) is say (http://)localhost/sweettemptations/available-sweets. This page is filled with custom post types (It's a WordPress site) which have the URL of (http://)localhost/sweettemptations/sweets/sweet-name.
What I want to do is redirect the URL (http://)localhost/sweettemptations/sweets back to (http://)localhost/sweettemptations/available-sweets which is easy to do, but I also need to redirect any type of sweet back to (http://)localhost/sweettemptations/available-sweets. So say I need to redirect (http://)localhost/sweettemptations/sweets/* back to (http://)localhost/sweettemptations/available-sweets.
If anyone could help by telling me how to write a proper regex statement to match everything after sweets/ in the URL, it would be hugely appreciated.
To do what you ask you need to use groups. In regular expression groups allow you to isolate parts of the whole match.
for example:
input string of: aaaaaaaabbbbcccc
regex: a*(b*)
The parenthesis mark a group in this case it will be group 1 since it is the first in the pattern.
Note: group 0 is implicit and is the complete match.
So the matches in my above case will be:
group 0: aaaaaaaabbbb
group 1: bbbb
In order to achieve what you want with the sweets pattern above, you just need to put a group around the end.
possible solution: /sweets/(.*)
the more precise you are with the pattern before the group the less likely you will have a possible false positive.
If what you really want is to match anything after the last / you can take another approach:
possible other solution: /([^/]*)
The pattern above will find a / with a string of characters that are NOT another / and keep it in group 1. Issue here is that you could match things that do not have sweets in the URL.
Note if you do not mind the / at the beginning then just remove the ( and ) and you do not have to worry about groups.
I like to use http://regexpal.com/ to test my regex.. It will mark in different colors the different matches.
Hope this helps.
I may have misunderstood you requirement in my original post.
if you just want to change any string that matches
(http://)localhost/sweettemptations/sweets/*
into the other one you provided (without adding the part match by your * at the end) I would use a regular expression to match the pattern in the URL but them just blind replace the whole string with the desired one:
(http://)localhost/sweettemptations/available-sweets
So if you want the URL:
http://localhost/sweettemptations/sweets/somethingmore.html
to turn into:
http://localhost/sweettemptations/available-sweets
and not into:
localhost/sweettemptations/available-sweets/somethingmore.html
Then the solution is simpler, no groups required :).
when doing this I would make sure you do not match the "localhost" part. Also I am assuming the (http://) really means an optional http:// in front as (http://) is not a valid protocol prefix.
so if that is what you want then this should match the pattern:
(http://)?[^/]+/sweettemptations/sweets/.*
This regular expression will match the http:// part optionally with a host (be it localhost, an IP or the host name). You could omit the .* at the end if you want.
If that pattern matches just replace the whole URL with the one you want to redirect to.
use this regular expression (?<=://).+

Exclude part of the string with regex

I'm quite bad with regex, and I'm looking to match a criteria.
This is a regex expression that should go emmbed into the url for a firewall, so It will block any url that is not like the list at the end.
This is what Im currently using but its not working:
http://www.youtube.com/(*.*)list=UUFwtOm4N5djdcuTAlNIWJaQ
This is the example url (to be blocked):
http://www.youtube.com/watch?NR=1&feature=fvwp&v=P1b5VY_Bp_o&list=UUFwtOm4N5djdcuTAlNIWJaQ
I'm trying to make a regex that will Success fully match when NR=1 or feature=fvwp
are NOT present, I asume I can do it like this: (?!^feature=fvwp$) but the v= and list=UUFwtOm4N5djdcuTAlNIWJaQ are allowed.
Also the v= should be limited to any character (uppercase and lowercase) and 11 length, I assume its: /^[a-z0-9]{11}$/
How can I build all that together and make it work so it would allow and match only on this urls excluding from allowing the previous criterias that I explained:
http://www.youtube.com/watch?v=4eK_RWpTgcc&feature=BFa&list=UUFwtOm4N5djdcuTAlNIWJaQ
http://www.youtube.com/watch?v=TLRl85TJwZM&feature=BFa&list=UUFwtOm4N5djdcuTAlNIWJaQ
http://www.youtube.com/watch?v=QEV9yqrpxkc&feature=BFa&list=UUFwtOm4N5djdcuTAlNIWJaQ
Can you block based on matching by regex? If so, just use
(.*)www\.youtube\.com/watch\?NR=1&feature=fvwp and block whatever matches that.

Why is it selecting this file?

I have the following statement:
Directory.GetFiles(filePath, "A*.pdf")
.Where(file => Regex.IsMatch(Path.GetFileName(file), "[Aa][i-lI-L].*"))
.Skip((pageNum - 1) * pageSize)
.Take(pageSize)
.Select(path => new FileInfo(path))
.ToArray()
My problems is that the above statement also finds the file "Adali.pdf" which it should not - but i cannot figure out why.
The above statement should only select files starting with a, and where the second letter is in the range i-l.
Because it matches Adali taking 3rd and 4th characters (al):
Adali
--
Try using ^ in your regex which allows looking for start of the string (regex cheatsheet):
Regex.IsMatch(..., "^[Aa][i-lI-L].*")
Also I doubt you need asterisk at all.
PS: As a sidenote let me notice that this question doesn't seem to be written that good. You should try debugging this code yourself and particularly you should try checking your regex against your cases without LINQ. I'm sure there is nothing to do here with LINQ (the tag you have in your question), but the issue is about regular expressions (which you didn't mention in tags at all).
You are not anchoring the string. This makes the regex match the al in Adali.pdf.
Change the regex to ^[Aa][i-lI-L].* You can do just ^[Aa][i-lI-L] if you don't need anything besides matching.
You should to do this
var f = Directory.GetFiles(tb_Path.Text, "A*.pdf").Where(file => Regex.IsMatch(Path.GetFileName(file), "[Aa][i-lI-L].pdf")).ToArray();
When you call ".*" Adali accept in Regex

Regex to match URL not followed by " or <

I'm trying to modify the url-matching regex at http://daringfireball.net/2010/07/improved_regex_for_matching_urls to not match anything that's already part of a valid URL tag or used as the link text.
For example, in the following string, I want to match http://www.foo.com, but NOT http://www.bar.com or http://www.baz.com
www.foo.com http://www.baz.com
I was trying to add a negative lookahead to exclude matches followed by " or <, but for some reason, it's only applying to the "m" in .com. So, this regex still returns http://www.bar.co and http://www.baz.co as matches.
I can't see what I'm doing wrong... any ideas?
\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))(?!["<])
Here is a simpler example too:
((((ht|f)tps?:\/\/)|(www.))[a-zA-Z0-9_\-.:#/~}?]+)(?!["<])
I looked into this issue last year and developed a solution that you may want to look at - See: URL Linkification (HTTP/FTP) This link is a test page for the Javascript solution with many examples of difficult-to-linkify URLs.
My regex solution, written for both PHP and Javascript - is not simple (but neither is the problem as it turns out.) For more information I would recommend also reading:
The Problem With URLs by Jeff Atwood, and
An Improved Liberal, Accurate Regex Pattern for Matching URLs by John Gruber
The comments following Jeff's blog post are a must read if you want to do this right...
Note also that John Gruber's regex has a component that can go into realm of catastrophic backtracking (the part which matches one level of matching parentheses).
Yeah, its actually trivial to make it work if you just want to exclude trailing characters, just make your expression 'independent', then no backtracking will occurr in that segment.
(?>\b ...)(?!["<])
A perl test:
use strict;
use warnings;
my $str = 'www.foo.com http://www.baz.comhttp://www.some.com';
while ($str =~ m~
(?>
\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
)
(?!["<])
~xg)
{
print "$1\n";
}
Output:
www.foo.com
http://www.some.com