Facebook profile URL regular expression - regex

Given the following Facebook profile and page URLs, my intent is to extract profile ids or usernames into the first match position.
http://www.facebook.com/profile.php?id=123456789
http://www.facebook.com/someusername
www.facebook.com/pages/Regular-Expressions/207279373093
The regex I have so far looks like this:
(?:http:\/\/)?(?:www.)?facebook.com\/(?:(?:\w)*#!\/)?(?:pages\/)?(?:[?\w\-]*\/)?(?:profile.php\?id=(\d.*))?([\w\-]*)?
Which produces the following results:
Result 1:
123456789
Result 2:
someusername
Result 3:
207279373093
The ideal outcome would look like:
Result 1:
123456789
Result 2:
someusername
Result 3:
207279373093
That is to say, I'd like to have the profile identifier to always be returned in the first position.
It would also be ideal if www.facebook.com/ and facebook.com/ didn't match either.

I'd recommend Rad Software Regular Expression Designer.
Also, this online tool is great: https://regex101.com/ (though most people prefer http://regexr.com/).
(?:(?:http|https):\/\/)?(?:www.)?facebook.com\/(?:(?:\w)*#!\/)?(?:pages\/)?(?:[?\w\-]*\/)?(?:profile.php\?id=(?=\d.*))?([\w\-]*)?

I made a gist a while back that works fine against the given examples:
# Matches patterns such as:
# http://www.facebook.com/my_page_id => my_page_id
# http://www.facebook.com/#!/my_page_id => my_page_id
# http://www.facebook.com/pages/Paris-France/Vanity-Url/123456?v=app_555 => 123456
# http://www.facebook.com/pages/Vanity-Url/45678 => 45678
# http://www.facebook.com/#!/page_with_1_number => page_with_1_number
# http://www.facebook.com/bounce_page#!/pages/Vanity-Url/45678 => 45678
# http://www.facebook.com/bounce_page#!/my_page_id?v=app_166292090072334 => my_page_id
/(?:http:\/\/)?(?:www\.)?facebook\.com\/(?:(?:\w)*#!\/)?(?:pages\/)?(?:[\w\-]*\/)*([\w\-]*)/
To get the latest version: https://gist.github.com/733592

Only this regular expression works correctly for all FB URLs:
/(?:https?:\/\/)?(?:www\.)?(?:facebook|fb|m\.facebook)\.(?:com|me)\/(?:(?:\w)*#!\/)?(?:pages\/)?(?:[\w\-]*\/)*([\w\-\.]+)(?:\/)?/i

I've tried every single answer above and each one doesn't work for at least one reason. This most likely won't be helpful to OP, but if anybody like me finds this in a web search, I believe this is the correct answer:
^(?:.*)\/(?:pages\/[A-Za-z0-9-]+\/)?(?:profile\.php\?id=)?([A-Za-z0-9.]+)
Supports basically everything I can think of, except verifying that the domain contains facebook.com. If you need to check if the URL is valid, this should be done outside of a regular expression to make sure the page or profile actually exists. Why check it twice, especially when one of the checks is incomplete?
Doesn't cut off the first character
Grabs URLs with periods
Ignores superfluous GET parameters
Supports /usernames as provided by the Facebook app
Supports both profile URL structures
Doesn't match facebook.com/ or facebook.com (by ignoring them)
Works with and without www. (by ignoring it)
Supports both http and https (by ignoring them)
Supports both facebook.com and fb.com (by ignoring them)
Supports pages with special characters in the name (by ignoring them)
Supports #! (by ignoring it)
Supports bounce_page#! (by ignoring it)
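As a quick check of the pattern in this answer, here is a minimal Python sketch that runs it against the URLs from the question. The helper name `extract_facebook_id` is hypothetical, and Python's `re` module is assumed only for illustration, since the question does not name a regex engine.

```python
import re

# Pattern from the answer above, unchanged. It does not check the domain.
FB_ID = re.compile(
    r"^(?:.*)\/(?:pages\/[A-Za-z0-9-]+\/)?(?:profile\.php\?id=)?([A-Za-z0-9.]+)"
)


def extract_facebook_id(url):
    """Return the profile id or username from group 1, or None if nothing matches."""
    match = FB_ID.match(url)
    return match.group(1) if match else None
```

One caveat this makes visible: the slashes of a scheme also count as candidates for the last slash, so `extract_facebook_id("http://www.facebook.com/")` returns the host name rather than `None`. The "doesn't match facebook.com/" property holds only for scheme-less input such as `www.facebook.com/`.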

The most complete pattern for a Facebook profile URL:
/(?:https?:\/\/)?(?:www\.)?facebook\.com\/.(?:(?:\w)*#!\/)?(?:pages\/)?(?:[\w\-]*\/)*([\w\-\.]*)/
It handles all the cases above, with one important difference. Other patterns accept http://www.facebook.com/ as a valid profile URL, but that is just the Facebook home page, not a user or page address. This pattern distinguishes the bare domain from profile and page URLs and accepts only the latter.

Matches facebook.com, m.facebook.com, mbasic.facebook.com and fb.me (short link)
/(?:https?:\/\/)?(?:www\.)?(mbasic.facebook|m\.facebook|facebook|fb)\.(com|me)\/(?:(?:\w\.)*#!\/)?(?:pages\/)?(?:[\w\-\.]*\/)*([\w\-\.]*)/ig

Regex that will correctly identify profile pages with a . in the name like www.facebook.com/my.name and it will also exclude www.facebook.com/ or home.php as it is not a valid facebook page.
https://regex101.com/r/koN8C2/2
(?:(?:http|https):\/\/)?(?:www.|m.)?facebook.com\/(?!home.php)(?:(?:\w)*#!\/)?(?:pages\/)?(?:[?\w\-]*\/)?(?:profile.php\?id=(?=\d.*))?([\w\.-]+)
Let me know if you find any that are not matched.
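To see the negative lookahead `(?!home.php)` in action, here is a small Python sketch that applies the pattern from this answer as written. The function name `profile_id` is hypothetical and the choice of Python's `re` module is an assumption for illustration.

```python
import re

# Pattern from the answer above, unchanged (its unescaped dots match any character).
PROFILE = re.compile(
    r"(?:(?:http|https):\/\/)?(?:www.|m.)?facebook.com\/(?!home.php)"
    r"(?:(?:\w)*#!\/)?(?:pages\/)?(?:[?\w\-]*\/)?"
    r"(?:profile.php\?id=(?=\d.*))?([\w\.-]+)"
)


def profile_id(url):
    """Return the captured profile id or username, or None when the URL is rejected."""
    match = PROFILE.search(url)
    return match.group(1) if match else None
```

Because the capture group uses `+` rather than `*`, a bare `www.facebook.com/` has nothing to capture and is rejected, while the lookahead rejects `home.php` before any capture is attempted.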

This works well for me. It detects personal profile URLs and excludes all fan pages and groups.
.+www.facebook.com\/[^\/]+$

Related

Exclude Character in Google Analytics via Regex

I'm trying to exclude (in a Goal) a character in a regex in Google Analytics.
Basically, I have two pages with the following URL:
/signup/done/b
/signup/done/bp
Note that both may also have UTM parameters appended in some results.
I am trying to measure only /done/b.
The Regex I had was the following, but it includes both strings:
(/signup/done/plan/b)
When I changed it (and verified it in an external regex tester) I got 0 results, so the /b/ was also not included.
(/signup/done/plan/b[^p])
This regex would handle the case where the URL ends with /b or if there are query parameters:
/signup/done/b($|\?.*)
So examples of converting URLs would be:
/signup/done/b
/signup/done/b?utm_campaign=test&utm_medium=display
/signup/done/b?query=value
Examples of non-converting URLs would be:
/signup/done/bd
/signup/done/b/something
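The converting and non-converting examples above can be checked with a short Python sketch. Python's `re` module is only a stand-in here, on the assumption that it treats this simple pattern the same way Google Analytics does, and the function name `is_conversion` is hypothetical.

```python
import re

# Pattern from the answer above: "/b" must be followed by the end of the
# string or by a query string.
GOAL = re.compile(r"/signup/done/b($|\?.*)")


def is_conversion(request_uri):
    """Return True when the request URI counts toward the goal."""
    return GOAL.search(request_uri) is not None
```

The alternation `($|\?.*)` is what excludes `/signup/done/bp`: after `b` the next character is neither the end of the string nor a `?`.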

Search & Replace Request URI Filter in Google Analytics

I have 2 landing pages:
/aa/index.php/aa/index/[sessionID]/alpha
/bb/index.php/bb/index/[sessionID]/bravo
Because the sessionID is unique, each landing page is tracked as a different page. Therefore, I need a filter to remove the sessionID. This is what I want to track:
/aa/index.php/aa/index/alpha
/bb/index.php/bb/index/bravo
I created the Search and Replace Custom Filter on the Request URI:
Search String: /(aa|bb)/index\.php/(aa|bb)/index/(.*)
Replace String: /$1/index.php/$2/index/$3
But I get /$1/index.php/$2/index/$3 reported on the dashboard the next day. So I tried /\1/index.php/\2/index/\3, but I got very strange results: //aa/index.php/aa/index/alpha/index.php/aa/index/aa.
Does anyone know how to reference the grouped patterns in the replace string?
My Solution:
I managed to solve it using an Advanced Filter:
Field A => Request URI: /(aa|bb)/index\.php/(aa|bb)/index/(.*)/(.*)
Field B => -
Output to => Request URI: /$A1/index.php/$A2/index/$A4
I haven't used the Google Analytics regex engine, but it appears to me that \1 is referencing the entire match (which in other regex implementations is called \0), while \2 is the first group, \3 is the second group, and so on.
Your initial regex, however, looks incomplete--I think it should look as follows:
Search String: /(aa|bb)/index\.php/(aa|bb)/index(/.*)/(alpha|bravo)
Replace String: /\2/index.php/\3/index/\5
(Note that I'm not sure whether ? is supported in this regex implementation as the non-greedy modifier, but if it is, the above search string pattern might run a little faster if you change /.* to /.*?.)
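For comparison, here is how the same search-and-replace looks in Python, using the four-group pattern from the Advanced Filter solution. This is only a sketch of the grouping logic: Python numbers groups from `\1` for the first parenthesis, which may not match how Google Analytics numbers them, and the session ID `abc123` and the function name `strip_session_id` are made up for the example.

```python
import re

# Search pattern from the Advanced Filter solution above.
SESSION_URI = re.compile(r"/(aa|bb)/index\.php/(aa|bb)/index/(.*)/(.*)")


def strip_session_id(request_uri):
    """Drop the session ID segment and keep the trailing page name."""
    # Groups 1 and 2 are the aa/bb segments, group 3 is the session ID,
    # and group 4 is the page name (alpha or bravo).
    return SESSION_URI.sub(r"/\1/index.php/\2/index/\4", request_uri)
```

For instance, `strip_session_id("/aa/index.php/aa/index/abc123/alpha")` yields `/aa/index.php/aa/index/alpha`, and a URI that does not match the pattern is returned unchanged.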

REGEX - How to ignore some query strings in URLS, but not in others

I need to redirect an old blog URL to a new blog URL. The ID field is the key query string, and everything else in the query string should be ignored. The logic at a high level:
If old case insensitive URL matches: /Blog/Post.aspx? + ID=33 anywhere in the query string of the URL then I will redirect to: /newblog/newurl/
Current REGEX Code: (?i:/Blog/Post.aspx)|(\?)|(?i:id=33)
Success: /Blog/Post.aspx?id=33
Fails: /Blog/Post.aspx?ignore=me&id=33
Fails: /Blog/Post.aspx?ignore=me&id=33&ignoreme=too
How would I have it ignore the potential unknown query string ignore=me and ignoreme=too, but still come up with a REGEX match to redirect when the ID=33 is in the query string?
Thank you for the answer m.buettner!
Right now you would even redirect, if you have only ID=33 in your URL, or even if you have only a question mark in there. I suppose that is not what you want. You are probably looking for something like this:
(?i:/Blog/Post.aspx\?.*id=33(?!\w)).*
That will require /Blog/Post.aspx? and then allow arbitrary characters until the id=33 is encountered.
Depending on which language you are using this in, you could also use a lookahead, which makes it easier to check for different parameters, whose order you might not know:
(?i:/Blog/Post.aspx\?(?=.*id=33(?!\w))).*
This could then be easily extended to
(?i:/Blog/Post.aspx\?(?=.*id=33(?!\w))(?=.*another=requirement(?!\w))).*
With the first approach you would have to add two alternatives for both possible orders.
EDIT: A caveat for all three solutions: they require that the number is not followed by a word character (a letter, digit or underscore). This means that they would give false positives in cases like ...id=33+34... and ...id=33%2F.... But WordPress should not generate these in the first place.
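The lookahead variant can be tried out with a minimal Python sketch. It assumes Python's `re` module, uses `re.IGNORECASE` in place of the inline `(?i:...)` group, and escapes the dot in `Post.aspx`; the function name `should_redirect` is hypothetical.

```python
import re

# Lookahead version from the answer above: the path must be followed by "?",
# and "id=33" must appear somewhere after it without a trailing word character.
OLD_POST = re.compile(r"/Blog/Post\.aspx\?(?=.*id=33(?!\w))", re.IGNORECASE)


def should_redirect(url):
    """Return True when the old blog URL carries id=33 in its query string."""
    return OLD_POST.search(url) is not None
```

Besides the caveat above, note that nothing anchors the start of the parameter name, so a query string such as `?paid=33` also matches. Adding `[?&]` or `\b` before `id` would be one way to tighten it.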
Oops, I was going to post a general answer for matching arbitrary parameters in a URL! I'll leave it here in case you need it.
(?:(id|noignoreme|dontignoreme)=([^&\n]+)(?:\n|&|$))
With this you can list the parameters you want to accept, and each match returns the parameter name as group 1 and its value as group 2.
After that you can check whether id equals 33 and act accordingly.
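A short Python sketch of that idea, collecting the accepted parameters into a dictionary. The function name `accepted_params` is hypothetical, and Python's `re` module is an assumption since the answer does not name a language.

```python
import re

# Pattern from the answer above: group 1 is the parameter name, group 2 its value.
PARAM = re.compile(r"(?:(id|noignoreme|dontignoreme)=([^&\n]+)(?:\n|&|$))")


def accepted_params(url):
    """Return a dict of the whitelisted parameters found in the URL."""
    # findall returns one (name, value) tuple per match.
    return dict(PARAM.findall(url))
```

With the URL from the question, `accepted_params("/Blog/Post.aspx?ignore=me&id=33&ignoreme=too")` gives `{'id': '33'}`, so the redirect decision becomes a plain comparison of `accepted_params(url).get("id")` with `"33"`.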

How do I match the question mark character in a Django URL?

In my Django application, I have a URL I would like to match which looks a little like this:
/mydjangoapp/?parameter1=hello&parameter2=world
The problem here is the '?' character being a reserved regex character.
I have tried a number of ways to match this... This was my first attempt:
(r'^pbanalytics/log/\?parameter1=(?P<parameter1>[\w0-9-]+)&parameter2=(?P<parameter2>[\w0-9-]+)', 'mydjangoapp.myFunction')
This was my second attempt:
(r'^pbanalytics/log/\\?parameter1=(?P<parameter1>[\w0-9-]+)&parameter2=(?P<parameter2>[\w0-9-]+)', 'mydjangoapp.myFunction')
but still no luck!
Does anyone know how I might match a '?' exactly in a Django URL?
Don't. You shouldn't match the query string with the URL dispatcher.
You can access all values using request.GET dictionary.
urls
(r'^pbanalytics/log/$', 'mydjangoapp.myFunction')
function
def myFunction(request):
    # request.GET holds the parsed query string parameters
    param1 = request.GET.get('parameter1')
Django's URL patterns only match the path component of a URL. You're trying to match on the query string as well, which is why you're having trouble. Your first regex escapes the ? correctly, but you should only ever be matching the path component.
In your view you can access the querystring via request.GET
The ? character is a reserved symbol in regex, yes. Your first attempt looks like proper escaping of it.
However, ? in a URL is also the end of the path and the beginning of the query part (like this: protocol://host/path/?query#hash).
Django's URL dispatcher doesn't let you dispatch URLs based on the query part, AFAIK.
My suggestion would be writing a django view that does the dispatching based on the request.GET parameter to your view function.
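The split between path and query can be illustrated without Django at all. The following sketch uses only the Python standard library as a stand-in for what a dispatcher does conceptually; it is not Django's implementation, and `PATH_PATTERN` and `dispatch` are hypothetical names.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Hypothetical pattern for the path only, in the style of a URLconf regex.
PATH_PATTERN = re.compile(r"^mydjangoapp/$")


def dispatch(url):
    """Match the path against the pattern and return the query parameters, or None."""
    parts = urlsplit(url)
    # Only the path takes part in matching; the leading slash is dropped first.
    if not PATH_PATTERN.match(parts.path.lstrip("/")):
        return None
    # Flatten single-valued parameters for readability.
    return {key: values[0] for key, values in parse_qs(parts.query).items()}
```

The `?` never reaches the pattern, which is why escaping it in the URL regex has no effect; the parameters arrive separately, much as they do through `request.GET` in a view.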
The way to do what the original question asked, i.e. a catch-all variable in the URL dispatcher:
url(r'^mens/(?P<pl_slug>.+)/$', 'main.views.mens',),
or
url(r'^mens/(?P<pl_slug>\?+)/$', 'main.views.mens',),
As for why this is needed: GET URLs don't make good permalinks, and they don't present well to customers and clients.
Clients often request that the URL be formatted like this:
www.example-clothing-site.com/mens/tops/shirts/t-shirts/Big_Brown_Shirt3XL
This is far more readable for the end user and gives a better overall presentation for the client.

Regex: Match URLs for specific domain EXCEPT when a certain querystring parameter has a certain value

In short, I need to match all URLs in a block of text that are for a certain domain and don't contain a specific querystring parameter and value (refer=twitter)
I have the following regex to match all URLs for the domain.
\b(https?://)?([a-z0-9-]+\.)*example\.com(/[^\s]*)?
I just can't get the last part to work
(?![&?]refer=twitter)\b(https?://)?([a-z0-9-]+\.)*example\.com(/[^\s]*)?
So the following SHOULD match
example.com
http://example.com/
https://www.example.com#link
www.example.com?somevalue=foo
But these should NOT
https://www.anotherexample.com#link
www.example.com?refer=twitter
EDIT:
And if you can get it to match the
http://example.com?foo=foo.bar
out of a sentence like
For examples go to http://example.com?foo=foo.bar.
without picking up the period, that would be great!
EDIT2:
Fixed the trailing period issue with this
\b(https?://)?([a-z0-9-]+\.)*example\.com/?([^\s]*[^.])?
EDIT3:
This seems to work, or at least 99% of the tests I've thrown at it
(?!\b.*[&?]refer=twitter)\b(https?://)?([a-z0-9-]+\.)*example\.com/?([^\s]*[^.])?
EDIT4:
Settled on
\b(?!.*[&?]refer=twitter)(https?://)?([a-z0-9-]+\.)*nygard\.com(?!\.)[^\s]*\b+
(?!\b.*[&?]refer=twitter)
Is what you're looking for.
To be honest, at first the thought of using a regex didn't even cross my mind (which is a good sign - using a regex must, IMO, always be a secondary option, not primary). Here is how I'd do it in my language of choice
>>> from urllib.parse import urlparse, parse_qs  # Python 3; in Python 2: from urlparse import urlparse, parse_qs
>>> p = urlparse(r'http://foo.bar.com/baz?refer=twitter&rock=paper')
>>> parse_qs(p.query)
{'refer': ['twitter'], 'rock': ['paper']}
You can do anything from here.
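Taking the parsing approach one step further, here is a minimal sketch of the filter the question asks for. It assumes the candidate URLs have already been pulled out of the text, and the function name `keep_url` and its `domain` parameter are hypothetical.

```python
from urllib.parse import urlsplit, parse_qs


def keep_url(url, domain="example.com"):
    """Return True for URLs on the domain that do not carry refer=twitter."""
    # Scheme-less input such as "www.example.com?x=1" needs a "//" prefix
    # so that urlsplit treats the host as the network location.
    parts = urlsplit(url if "://" in url else "//" + url)
    host = parts.hostname or ""
    # Accept the domain itself and any subdomain, but not "anotherexample.com".
    on_domain = host == domain or host.endswith("." + domain)
    refers_twitter = "twitter" in parse_qs(parts.query).get("refer", [])
    return on_domain and not refers_twitter
```

Checking the host with `endswith("." + domain)` rather than a substring test is what keeps `www.anotherexample.com` out, a case the regex handles with `\b` and the `([a-z0-9-]+\.)*` prefix.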