Need a regular expression that copes with this URL

I have a URL from Google circles that isn't validated by the usual regular expressions. For instance, ASP.NET provides a standard regular expression for URLs, which is:
"http(s)?://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)?"
But when you get a google circles URL:
https://plus.google.com/photos/114197249914471021468/albums/5845982797151575009/5845982803176407170?authkey=CKfNzLrhmenraA#photos/114197249914471021468/albums/5845982797151575009/5845982803176407170?authkey=CKfNzLrhmenraA
it can't cope.
I thought of appending to the end the following expression: (\?.+)?
which basically means the URL can have a question mark after it and then any number of characters of any type, but that doesn't work.
The whole expression would be:
"[Hh][Tt][Tt][Pp]([Ss])?://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*(\?.+)?)?"
For some reason, that doesn't work with complicated URLs either.
Help is appreciated.

I added the anchors ^ and $ for the purposes of this test, escaped the / because the following is a JavaScript regex literal, changed the &amp; (which had no business being there) to &;, removed the space and added # to the third character set, and it seems to work okay:
/^http(s)?:\/\/([\w-]+\.)+[\w-]+(\/[\w.\/?%&;#=-]*)?$/.test(
'https://plus.google.com/photos/114197249914471021468/albums/5845982797151575009/5845982803176407170?authkey=CKfNzLrhmenraA#photos/114197249914471021468/albums/5845982797151575009/5845982803176407170?authkey=CKfNzLrhmenraA' )
// true
I also moved the - to the end in the third character set, as it should be at the start or end of the set if not specifying a range.
Disclaimer: I do not propose this as a good way of validating URLs in general; it is just an edited version of the original regex which now works in this specific case.
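To see why the original pattern gives up on this URL, here is a minimal JavaScript sketch that runs the original and the edited pattern, both anchored, against the URL from the question. The variable names are mine. The only difference that matters for this particular URL is that # is now part of the last character class.

```javascript
// URL from the question; note the # fragment in the middle.
const url =
  'https://plus.google.com/photos/114197249914471021468/albums/5845982797151575009/5845982803176407170' +
  '?authkey=CKfNzLrhmenraA' +
  '#photos/114197249914471021468/albums/5845982797151575009/5845982803176407170' +
  '?authkey=CKfNzLrhmenraA';

// Original pattern, anchored so the whole string has to match.
// Its last character class has no #, so matching stops at the fragment.
const original = /^http(s)?:\/\/([\w-]+\.)+[\w-]+(\/[\w- .\/?%&=]*)?$/;

// Edited pattern from the answer: # and ; added, space removed, - moved to the end.
const fixed = /^http(s)?:\/\/([\w-]+\.)+[\w-]+(\/[\w.\/?%&;#=-]*)?$/;

const originalMatches = original.test(url);
const fixedMatches = fixed.test(url);
console.log(originalMatches, fixedMatches); // false true
```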

Related

Regular expression to check path of url as well as specific parameters

I have URLs like the following:
/home/lead/statusupdate.php?callback=jQuery211010657244874164462_1455536082020&ref=e13ec8e3-99a8-411c-be50-7e57991d7acb&status=5&_=1455536082021
I would like a regular expression for my Google Analytics goal that checks that the request URI is /home/lead/statusupdate.php and that both the ref and status parameters are present, regardless of the order in which they are passed and regardless of any extra parameters, because I really only care about those two. I have looked at these examples:
How to say in RegExp "contain this too"? and Regular Expressions: Is there an AND operator? but I can't seem to adapt the examples given there to work.
I'm using this online tool to test: http://www.regexr.com/ (perhaps the tool is the buggy one? I'll try in JavaScript in the meantime).
You can try:
\/home\/lead\/statusupdate\.php\?(ref=|.*(&ref=)).*(&status=)
If the order does not matter, then add the opposite:
\/home\/lead\/statusupdate\.php\?(status=|.*(&status=)).*(&ref=)
all put together
\/home\/lead\/statusupdate\.php\?(((ref=|.*(&ref=)).*(&status=))|((status=|.*(&status=)).*(&ref=)))
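A quick way to sanity-check the combined pattern outside Google Analytics is to run it in JavaScript. This is only a sketch: the first sample is the URI from the question, the other three are made up for illustration, and the regex engine used by Google Analytics may not behave identically to JavaScript's.

```javascript
// Combined pattern from above: ref before status, or status before ref.
const bothParams =
  /\/home\/lead\/statusupdate\.php\?(((ref=|.*(&ref=)).*(&status=))|((status=|.*(&status=)).*(&ref=)))/;

const samples = [
  // URI from the question (ref before status, extra parameters around them)
  '/home/lead/statusupdate.php?callback=jQuery211010657244874164462_1455536082020&ref=e13ec8e3-99a8-411c-be50-7e57991d7acb&status=5&_=1455536082021',
  // Hypothetical: status before ref
  '/home/lead/statusupdate.php?status=5&ref=abc',
  // Hypothetical: status missing
  '/home/lead/statusupdate.php?ref=abc&callback=x',
  // Hypothetical: "xref" must not count as "ref"
  '/home/lead/statusupdate.php?xref=abc&status=5',
];

const results = samples.map((uri) => bothParams.test(uri));
console.log(results); // [ true, true, false, false ]
```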
try:
(/home/lead/statusupdate.php?A)|(/home/lead/statusupdate.php?B)|(/home/lead/statusupdate.php?C)|(/home/lead/statusupdate.php?D)|(/home/lead/statusupdate.php?E)|(/home/lead/statusupdate.php?F)
Note that here A,B,C,D,E,F are notations for six different permutations for 'callback' string, 'ref' string, 'status' string and '_' string.
Not really elegant but this works:
\/home\/lead\/statusupdate\.php(.*(ref|status)){2}
Looks for /home/lead/statusupdate.php followed, twice, by any characters and then ref or status. Admittedly, this would also match a URL that contains ref twice or status twice.
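The trade-off is easy to see with a few made-up query strings. A small JavaScript sketch (all sample URIs are hypothetical):

```javascript
// Short pattern from above: "ref" or "status" must occur twice after the path.
const loose = /\/home\/lead\/statusupdate\.php(.*(ref|status)){2}/;

const checks = {
  // Intended case: both parameters present.
  bothPresent: loose.test('/home/lead/statusupdate.php?ref=abc&status=5'),
  // False positive: ref given twice, status missing.
  refTwice: loose.test('/home/lead/statusupdate.php?ref=abc&ref=def'),
  // False positive: the words only appear inside other parameter names.
  substringsOnly: loose.test('/home/lead/statusupdate.php?preference=1&laststatus=2'),
  // Correctly rejected: only one occurrence.
  refOnly: loose.test('/home/lead/statusupdate.php?ref=abc'),
};
console.log(checks);
```

So it is fine as a quick filter, but it does not strictly check that both parameters are there.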

Regex for URL routing - match alphanumeric and dashes except words in this list

I'm using CodeIgniter to write an app where a user will be allowed to register an account and is assigned a URL (URL slug) of their choosing (ex. domain.com/user-name). CodeIgniter has a URL routing feature that allows the utilization of regular expressions (link).
Users are only allowed to register URLs that contain alphanumeric characters, dashes (-), and underscores (_). This is the regex I'm using to verify the validity of the URL slug: ^[A-Za-z0-9][A-Za-z0-9_-]{2,254}$
I am using the URL routing feature to route a few URLs to features on my site (ex. /home -> /pages/index, /activity -> /user/activity), so those particular URLs obviously cannot be registered by a user.
I'm largely inexperienced with regular expressions but have attempted to write an expression that would match any URL slugs with alphanumerics/dash/underscore except if they are any of the following:
default_controller
404_override
home
activity
Here is the code I'm using to try to match the words with that specific criteria:
$route['(?!default_controller|404_override|home|activity)[A-Za-z0-9][A-Za-z0-9_-]{2,254}'] = 'view/slug/$1';
but it isn't routing properly. Can someone help? (Side question: is it necessary to have ^ or $ in the regex when trying to match URLs?)
Alright, let's pick this apart.
Ignore CodeIgniter's reserved routes.
The default_controller and 404_override portions of your route are unnecessary. Routes are compared to the requested URI to see if there's a match. It is highly unlikely that those two items will ever be in your URI, since they are special reserved routes for CodeIgniter. So let's forget about them.
$route['(?!home|activity)[A-Za-z0-9][A-Za-z0-9_-]{2,254}'] = 'view/slug/$1';
Capture everything!
With regular expressions, a group is created using parentheses (). This group can then be retrieved with a back reference - in our case, the $1, $2, etc. located in the second part of the route. You only had a group around the first set of items you were trying to exclude, so it would not properly capture the entire wild card. You found this out yourself already, and added a group around the entire item (good!).
$route['((?!home|activity)[A-Za-z0-9][A-Za-z0-9_-]{2,254})'] = 'view/slug/$1';
Look-ahead?!
On that subject, the first group around home|activity is not actually a traditional group, due to the use of ?! at the beginning. This is called a negative look-ahead, and it's a complicated regular expression feature. And it's being used incorrectly:
Negative lookahead is indispensable if you want to match something not followed by something else.
There's a LOT more I could go into with this, but basically we don't really want or need it in the first place, so I'll let you explore if you'd like.
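If you do want to explore, here is a small JavaScript sketch of what goes wrong. The ^ and $ are written out because a bare JavaScript RegExp is not anchored for you, and the second pattern is just one possible way to write an "exact word" lookahead, not something taken from your routes.

```javascript
// Lookahead without an end anchor: rejects anything that merely STARTS with a reserved word.
const prefixLookahead = /^(?!home|activity)[A-Za-z0-9][A-Za-z0-9_-]{2,254}$/;

// Lookahead with $ inside: rejects only the exact reserved words.
const exactLookahead = /^(?!(?:home|activity)$)[A-Za-z0-9][A-Za-z0-9_-]{2,254}$/;

const slugs = ['home', 'homer', 'activity-log', 'john-doe'];
const withPrefixLookahead = slugs.map((s) => prefixLookahead.test(s));
const withExactLookahead = slugs.map((s) => exactLookahead.test(s));

console.log(withPrefixLookahead); // [ false, false, false, true ]
console.log(withExactLookahead);  // [ false, true, true, true ]
```

With the first form, a user could never register "homer" or "activity-log", which is probably not what you intended.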
In order to make your life easier, I'd suggest separating the home, activity, and other existing controllers in the routes. CodeIgniter will look through the list of routes from top to bottom, and once something matches, it stops checking. So if you specify your existing controllers before the wild card, they will match, and your wild card regular expression can be greatly simplified.
$route['home'] = 'pages';
$route['activity'] = 'user/activity';
$route['([A-Za-z0-9][A-Za-z0-9_-]{2,254})'] = 'view/slug/$1';
Remember to list your routes in order from most specific to least. Wild card matches are less specific than exact matches (like home and activity), so they should come after (below).
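To make the ordering point concrete, here is a rough JavaScript sketch of first-match-wins routing. This is not CodeIgniter's code; resolve() is a hypothetical helper that anchors each pattern, stops at the first match, and fills in $1.

```javascript
// Routes as ordered [pattern, target] pairs, mirroring the $route array above.
const specificFirst = [
  ['home', 'pages'],
  ['activity', 'user/activity'],
  ['([A-Za-z0-9][A-Za-z0-9_-]{2,254})', 'view/slug/$1'],
];

// Same routes with the wildcard moved to the top, to show what goes wrong.
const wildcardFirst = [specificFirst[2], specificFirst[0], specificFirst[1]];

// Hypothetical resolver: anchor each pattern, stop at the first match,
// and substitute $1 in the target from the first capture group.
function resolve(routes, uri) {
  for (const [pattern, target] of routes) {
    const re = new RegExp('^' + pattern + '$');
    if (re.test(uri)) {
      return uri.replace(re, target);
    }
  }
  return null; // nothing matched
}

console.log(resolve(specificFirst, 'home'));     // pages
console.log(resolve(specificFirst, 'john-doe')); // view/slug/john-doe
console.log(resolve(wildcardFirst, 'home'));     // view/slug/home
```

With the wildcard first, "home" is swallowed by the slug route before the exact route is ever looked at.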
Now, that's all the complicated stuff. A little more FYI.
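Remember that dashes - have a special meaning when in between [] brackets: between two characters they define a range. A dash at the very start or end of the set, as here, is already taken literally, but escaping it makes the intent explicit.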
$route['([A-Za-z0-9][A-Za-z0-9_\-]{2,254})'] = 'view/slug/$1';
Note that your character repetition min/max {2,254} only applies to the second set of characters, so your user names must be 3 characters at minimum, and 255 at maximum. Just an FYI if you didn't realize that already.
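A quick check of those bounds, as a JavaScript sketch (anchors written out, test strings are just repeated "a" characters):

```javascript
// One leading character plus 2 to 254 more: total length 3 to 255.
const slug = /^[A-Za-z0-9][A-Za-z0-9_-]{2,254}$/;

const lengths = [2, 3, 255, 256];
const accepted = lengths.map((n) => slug.test('a'.repeat(n)));

console.log(accepted); // [ false, true, true, false ]
```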
I saw your own answer to this problem, and it's just ugly. Sorry. The ^ and $ symbols are used improperly throughout the lookahead (which still shouldn't be there in the first place). It may "work" for a few use cases that you're testing it with, but it will just give you problems and headaches in the future.
Hopefully now you know more about regular expressions and how they're matched in the routing process.
And to answer your question, no, you should not use ^ and $ at the beginning and end of your regex -- CodeIgniter will add that for you.
Use the 404, Luke...
At this point your routes are improved and should be functional. I will throw it out there, though, that you might want to consider using the controller/method defined as the 404_override to handle your wild cards. The main benefit of this is that you don't need ANY routes to direct a wild card, or to prevent your wild card from goofing up existing controllers. You only need:
$route['404_override'] = 'view/slug';
Then, your View::slug() method would check the URI, and see if it's a valid pattern, then check if it exists as a user (same as your slug method does now, no doubt). If it does, then you're good to go. If it doesn't, then you throw a 404 error.
It may not seem that graceful, but it works great. Give it a shot if it sounds better for you.
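The logic of such a method could look roughly like this. It is written as a JavaScript sketch rather than CodeIgniter PHP; handleSlug() and the registeredSlugs set are made-up stand-ins for your View::slug() method and your user lookup.

```javascript
// Hypothetical stand-in for the user table; a real app would query its database.
const registeredSlugs = new Set(['john-doe', 'jane_doe']);

// Same slug rule as in the question, anchored.
const SLUG_PATTERN = /^[A-Za-z0-9][A-Za-z0-9_-]{2,254}$/;

// Sketch of what the 404_override handler would decide for a given URI.
function handleSlug(uri) {
  if (!SLUG_PATTERN.test(uri)) {
    return { status: 404 }; // not even a valid slug
  }
  if (!registeredSlugs.has(uri)) {
    return { status: 404 }; // valid pattern, but no such user
  }
  return { status: 200, user: uri };
}

console.log(handleSlug('john-doe'));   // { status: 200, user: 'john-doe' }
console.log(handleSlug('nobody-here')); // { status: 404 }
console.log(handleSlug('a/b'));         // { status: 404 }
```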
I'm not familiar with CodeIgniter specifically, but most frameworks' routing operates on precedence. In other words, the default controller, 404, etc. routes should be defined first. Then you can simplify your regex to only match the slugs.
Ok answering my own question
I seem to have come up with a different expression that works:
$route['(^(?!default_controller$|404_override$|home$|activity$)[A-Za-z0-9][A-Za-z0-9_-]{2,254}$)'] = 'view/slug/$1';
I added parentheses around the whole expression (I think that's what CodeIgniter matches with $1 on the right) and added a start-of-line anchor, ^, and a bunch of end-of-line anchors, $.
Hope this helps someone who may run into this problem later.

hl.regex.pattern not working in solr

I am using solr to fetch data.
I was using below parameters to fetch data:
http://testURL/solr/core0/select?start=10&rows=10&hl.fl=CC&hl.requireFieldMatch=true&hl=on&hl.maxAnalyzedChars=1&hl.fragsize=145&hl.snippets=99&sort=COlumn1+desc&q=CC%3a%28%22test%22~2%29&fl=title120%2ccolumn2%2ccolumn3%2cRL_DateTime%2cSid%2ccolumn4%2cguid%2chour&hl.regex.pattern=^\d+%20%3E%3E%20
The above query does not work with the hl.regex.pattern parameter.
If I remove hl.regex.pattern, it returns results in the highlight section.
If I provide the regex pattern, it does not.
The regex works in my C# code.
So am I missing anything here?
It's almost certainly the ^ and the \. Those aren't valid unencoded in a URI, so you'll have to percent-encode them.
From RFC 1738:
only alphanumerics, the special characters "$-_.+!*'(),", and
reserved characters used for their reserved purposes may be used
unencoded within a URL.
This is a little dated, since non-Roman alphanumerics like λάμδα are allowed now, but the gist is the same.
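Try hl.regex.pattern=%5E%5Cd%2B%20%3E%3E%20 instead. (The + is encoded as %2B as well, because a bare + in a query string is normally decoded as a space.)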
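Rather than encoding by hand, you can let a library do it. As a sketch, this is what JavaScript's encodeURIComponent produces for the pattern from the question (the C# side would need its own equivalent, which I am not showing here). Note that it also turns + into %2B; a bare + in a query string is commonly decoded as a space, so encoding it is the safer choice.

```javascript
// The raw pattern: ^\d+ >> (with a trailing space).
const pattern = '^\\d+ >> ';

// Percent-encode it for use as a query parameter value.
const encoded = encodeURIComponent(pattern);
const queryPart = 'hl.regex.pattern=' + encoded;

// Decoding gives the original pattern back.
const decoded = decodeURIComponent(encoded);

console.log(queryPart); // hl.regex.pattern=%5E%5Cd%2B%20%3E%3E%20
```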

Regex with URLs - syntax

We're using a proprietary tracking system that requires the use of regular expressions to load third party scripts on the URLs we specify.
I wanted to check the syntax of the regex we're using to see if it looks right.
To match the following URL
/products/18/indoor-posters
We are using this rule:
.*\/products\/18\/indoor-posters.*
Does this look right? Also, if there was a query parameter on the URL, would it still work? e.g.
/products/18/indoor-posters?someParam=someValue
There's another URL to match:
/products
The rule for this is:
.*\/products
Would this match correctly?
Well, "right" is a relative term. Usually, .* is not a good idea because it matches anything, even nothing. So while these regexes will all match your example strings, they'll also match much more. The question is: What are you using the regexes for?
If you only want to check whether those substrings are present anywhere in the string, then they are fine (but then you don't need regex anyway, just check for substrings).
If you want to somehow check whether it's a valid URL, then no, the regexes are not fine because they'd also match foo-bar!$%(§$§$/products/18/indoor-postersssssss)(/$%/§($/.
If you can be sure that you'll always get a correct URL as your input and just want to check whether it matches your pattern, then I'd suggest
^.*\/products$
to match any URL that ends in /products, and
^.*\/products\/18\/indoor-posters(?:\?[\w-]+=[\w-]+)?$
to match a URL that ends in /products/18/indoor-posters with an optional ?name=value bit at the end, assuming only letters, digits, underscores and hyphens are legal for name and value.
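Here is a small JavaScript sketch comparing the unanchored rule from the question with the two anchored suggestions. The sample paths with two parameters and with trailing junk are made up; whether your tracking system's regex engine behaves exactly like JavaScript's is an assumption.

```javascript
// Rule from the question, unanchored.
const looseProducts = /.*\/products/;
// Anchored suggestions from above.
const endsInProducts = /^.*\/products$/;
const indoorPosters = /^.*\/products\/18\/indoor-posters(?:\?[\w-]+=[\w-]+)?$/;

const results = {
  // The loose rule also fires on the longer URL.
  looseOnLongerPath: looseProducts.test('/products/18/indoor-posters'),
  anchoredOnLongerPath: endsInProducts.test('/products/18/indoor-posters'),
  anchoredOnProducts: endsInProducts.test('/products'),
  postersPlain: indoorPosters.test('/products/18/indoor-posters'),
  postersOneParam: indoorPosters.test('/products/18/indoor-posters?someParam=someValue'),
  // Only a single name=value pair is allowed by this pattern.
  postersTwoParams: indoorPosters.test('/products/18/indoor-posters?a=b&c=d'),
  postersTrailingJunk: indoorPosters.test('/products/18/indoor-postersssssss'),
};
console.log(results);
```

If your URLs can carry more than one query parameter, the optional group at the end would have to be extended accordingly.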

Regular expression to add base domain to directory

Ten websites need to be cached. In the cached pages, photos, CSS, JS, etc. are not displayed properly because the base domain isn't attached to the directory. I need a regex to add the base domain to the directory. Examples below.
base domain: http://www.example.com
the problem occurs when reading cached pages with img src="thumb/123.jpg" or src="/inc/123.js".
they would display correctly if it was img src="http://www.example.com/thumb/123.jpg" or src="http://www.example.com/inc/123.js".
regex something like: if (src=") isn't followed by the base domain then add the base domain
without knowing the language, you can use the (maybe most portable) substitute modifier:
s/^(src=")([^"]+")$/$1www\.example\.com\/$2/
This should do the following:
1. Matches the string 'src="' (and captures it in variable $1).
2. Matches one or more non-double-quote characters followed by " (and captures them in variable $2).
3. Substitutes 'www.example.com/' in between the two capture groups.
Depending on the language, you can wrap this in a conditional that checks for the existence of the domain and substitutes if it isn't found.
to check for domain: /www\.example\.com/i should do.
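As an illustration of folding the "is the domain already there?" check into the pattern itself, here is a JavaScript sketch that uses a negative lookahead. JavaScript is only used for demonstration, addBase() is a made-up helper name, and the full http://www.example.com/ prefix from the question is used as the base.

```javascript
const BASE = 'http://www.example.com/';

// src=" that is NOT already followed by the base domain.
// An optional leading slash is consumed so the result has no double slash.
const relativeSrc = /(src=")(?!http:\/\/www\.example\.com\/)\/?([^"]+")/gi;

// Hypothetical helper: prefix every relative src with the base domain.
function addBase(html) {
  return html.replace(relativeSrc, '$1' + BASE + '$2');
}

console.log(addBase('<img src="thumb/123.jpg">'));
// <img src="http://www.example.com/thumb/123.jpg">
console.log(addBase('<script src="/inc/123.js"></script>'));
// <script src="http://www.example.com/inc/123.js"></script>
console.log(addBase('<img src="http://www.example.com/thumb/123.jpg">'));
// unchanged
```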
EDIT: See comments:
For PHP, I would do this a bit differently. I would probably use simplexml. I don't think that will translate well, though, so here's a regex one...
$html = file_get_contents('/path/to/file.html');
// Negative lookahead: only match values that do not already start with the base domain.
$regex_match = '/(src="|href=")(?!http:\/\/www\.example\.com\/)\/?([^"]+")/i';
$regex_substitute = '$1http://www.example.com/$2';
$html = preg_replace($regex_match, $regex_substitute, $html);
Note: I haven't actually run this to debug it; it's just off the cuff. I would be concerned about three things. First, / is the pattern delimiter here, so every / inside the pattern has to be escaped for preg_replace (I don't think you're concerned with this, though, unless VB has a similar problem). Second, if there's a chance that line breaks would get in the way, I might change the regex. Third, I added the (?!http:\/\/www\.example\.com\/) bit. This is a negative lookahead, which limits the match to any src or href that doesn't already have the base domain there, but it depends on the type of regex being used (PCRE supports lookaheads, POSIX does not).
The rest of the changes should be fine: I added href=", made it case-insensitive (the i modifier), and dropped the leading / of root-relative paths so the result has no double slash. There is no g modifier in PHP; preg_replace replaces every match by default.
I hope that helps.
Matching regular expression:
(?:src|href)="(http://www\.example\.com/)?.+
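One way to use an optional group like this for the actual rewrite is to check in a callback whether the group took part in the match. The JavaScript sketch below adapts the expression above (capturing the attribute, stopping at the closing quote, and swallowing a leading slash); the adaptation and the withBase() name are mine, not part of the answer.

```javascript
const BASE = 'http://www.example.com/';

// Group 1: attribute and opening quote. Group 2: base domain, if present.
// Group 3: the rest of the value up to the closing quote.
const attr = /((?:src|href)=")(http:\/\/www\.example\.com\/)?\/?([^"]*)/gi;

// Hypothetical helper: leave values that already carry the base domain alone,
// prefix everything else.
function withBase(html) {
  return html.replace(attr, (match, prefix, base, rest) =>
    base ? match : prefix + BASE + rest
  );
}

console.log(withBase('<a href="page.html">'));
// <a href="http://www.example.com/page.html">
console.log(withBase('<img src="/inc/a.png">'));
// <img src="http://www.example.com/inc/a.png">
console.log(withBase('<img src="http://www.example.com/a.png">'));
// unchanged
```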