Regex exclude .html from string

I am stuck creating a regex that matches my needs.
The regex should capture the beginning of the string but not the ending (.html).
Example:
company.html should convert into company
My attempt:
url(r'^(?P<page>.+\.html)$', some_view)
Is there any chance someone could advise me on this one? I need to prepare my Django URLs to expect over 50 company names, and this seems to be the easiest way to keep my code DRY.

Simply move it out of the capture group:
url(r'^(?P<page>.+)\.html$', some_view)
(Here the capture group is (?P<page>.+).)
The part between the parentheses that starts with (?P<var>...) is the capture group: whatever that part of the pattern matches is passed to the view as var.
You can add extra parts outside the capture group. They are still required for the pattern to match, but they are not captured in the variable.
That being said, typically in Django apps, one does not add noise like extensions, etc. Why would you add weird characters to a URL that a non-technical person does not understand at all?
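To see the difference between the two patterns outside Django, here is a minimal sketch using Python's re module directly. The sample string company.html comes from the question; the variable names are only illustrative.

```python
import re

# Pattern from the question: ".html" sits inside the capture group.
with_ext = re.compile(r'^(?P<page>.+\.html)$')
# Pattern from the answer: ".html" must still match, but is not captured.
without_ext = re.compile(r'^(?P<page>.+)\.html$')

captured_with_ext = with_ext.match('company.html').group('page')
captured_without_ext = without_ext.match('company.html').group('page')

print(captured_with_ext)     # company.html
print(captured_without_ext)  # company
```

A string without the extension, such as company, does not match the second pattern at all, because the literal \.html is still required.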

What is the correct regex pattern to use to clean up Google links in Vim?

As you know, Google links can be pretty unwieldy:
https://www.google.com/search?q=some+search+here&source=hp&newwindow=1&ei=A_23ssOllsUx&oq=some+se....
I have MANY Google links saved that I would like to clean up to make them look like so:
https://www.google.com/search?q=some+search+here
The only issue is that I cannot figure out the correct regex pattern for Vim to do this.
I figure it must be something like this:
:%s/&source=[^&].*//
:%s/&source=[^&].*[^&]//
:%s/&source=.*[^&]//
But none of these are working; they start at &source, and replace until the end of the line.
Also, the search?q=some+search+here can appear anywhere after the .com/, so I cannot rely on it being in the same place every time.
So, what is the correct Vim regex pattern to use in order to clean up these links?
Your example can easily be dealt with by using a very simple pattern:
:%s/&.*
because you want to keep everything that comes before the second parameter, which is marked by the first & in the string.
But, if the q parameter can be anywhere in the query string, as in:
https://www.google.com/search?source=hp&newwindow=1&q=some+search+here&ei=A_23ssOllsUx&oq=some+se....
then no amount of capturing or whatnot will be enough to cover every possible case with a single pattern, let alone a readable one. At this point, scripting is really the only reasonable approach, preferably with a language that understands URLs.
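As a rough illustration of the scripting approach, here is a minimal sketch in Python using the standard urllib.parse module, which parses the query string instead of pattern-matching it. The function name keep_only_q is hypothetical, and the sample URL is a shortened version of the one above.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qs, urlencode

def keep_only_q(url):
    """Return the URL with every query parameter except q removed."""
    parts = urlsplit(url)
    params = parse_qs(parts.query)
    if 'q' not in params:
        # Nothing to keep; leave the link untouched.
        return url
    # Re-encode only the first q value; spaces become "+" again.
    new_query = urlencode({'q': params['q'][0]})
    return urlunsplit((parts.scheme, parts.netloc, parts.path, new_query, ''))

sample = ('https://www.google.com/search'
          '?source=hp&newwindow=1&q=some+search+here&ei=A_23ssOllsUx&oq=some+se')
print(keep_only_q(sample))
```

Since the parameters are looked up by name, the position of q in the query string does not matter.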
--- EDIT ---
Hmm, scratch that. The following seems to work across the board:
:%s#^\(https://www.google.com/search?\)\(.*\)\(q=.\{-}\)&.*#\1\3
We use # as separator because of the many / in a typical URL.
We capture a first group, up to and including the ? that marks the beginning of the query string.
We capture a second group, whatever comes between the ? and q=, only so that we can leave it out of the replacement.
We capture a third group, the q parameter, up to and excluding the next &.
We replace the whole thing with the first capture group followed by the third (\1\3).
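To check the substitution outside Vim, the same idea can be sketched with Python's re.sub. This is a translation rather than the original command: Vim's \{-} becomes the lazy .*?, and the literal ? and . are escaped because Python treats them as special. The function name clean_link is hypothetical, and the sample URLs are shortened versions of the ones in the question.

```python
import re

# Group 1: prefix up to "?", group 2: discarded parameters, group 3: q=...
GOOGLE_LINK = re.compile(r'^(https://www\.google\.com/search\?)(.*)(q=.*?)&.*')

def clean_link(url):
    # Keep groups 1 and 3, drop everything else.
    return GOOGLE_LINK.sub(r'\1\3', url)

q_first = 'https://www.google.com/search?q=some+search+here&source=hp&newwindow=1'
q_middle = ('https://www.google.com/search'
            '?source=hp&newwindow=1&q=some+search+here&ei=A_23ssOllsUx&oq=some+se')

print(clean_link(q_first))
print(clean_link(q_middle))
```

Because the middle group is greedy, a link in which a later parameter ending in q= (such as oq=) is itself followed by & would keep that later parameter instead, so the pattern is worth checking against a sample of the real links first.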

How can I check if a string has multiple matching groups that are the same?

Currently, I am filtering out URL paths using Regex (Python). A couple of the URL paths I have come across are irrelevant and I want to detect URLs that are like this.
For example:
/ugrad/honors/index.php/policies/sao/policies/overview/step-1-course-requirements.html
/ugrad/honors/index.php/overview/sao/overview/sao/policies/noodle.html
In the examples above, you can see that policies and overview are repeated both times.
How can I design a Regex function to detect if there are 2+ matching texts anywhere in a URL path?
I have attempted something like this, but I am unsure whether it is possible to detect 2+ matching texts anywhere in the string.
My attempt: \S+(\/.+)\1\S+
Capture a slash, followed by non-slashes, followed by a slash again. Then repeat anything and backreference the capture group:
(\/[^\/]+\/).*\1
https://regex101.com/r/ygqRZc/1
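Since the question mentions Python, the pattern above can be tried with the re module as follows. This is a small sketch; the function name has_repeated_segment is made up for illustration, and the third test path is an invented counter-example.

```python
import re

# A "/segment/" that appears again later in the path.
REPEATED_SEGMENT = re.compile(r'(/[^/]+/).*\1')

def has_repeated_segment(path):
    """Return True if some /segment/ occurs at least twice in the path."""
    return REPEATED_SEGMENT.search(path) is not None

paths = [
    '/ugrad/honors/index.php/policies/sao/policies/overview/step-1-course-requirements.html',
    '/ugrad/honors/index.php/overview/sao/overview/sao/policies/noodle.html',
    '/ugrad/honors/index.php/policies/overview.html',  # invented, no repeats
]
flags = [has_repeated_segment(p) for p in paths]
print(flags)
```

In Python the slashes do not need escaping, so \/ can be written as a plain /.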

Regex for URL routing - match alphanumeric and dashes except words in this list

I'm using CodeIgniter to write an app where a user will be allowed to register an account and is assigned a URL (URL slug) of their choosing (e.g. domain.com/user-name). CodeIgniter has a URL routing feature that allows the use of regular expressions.
Users are only allowed to register URLs that contain alphanumeric characters, dashes (-), and underscores (_). This is the regex I'm using to verify the validity of the URL slug: ^[A-Za-z0-9][A-Za-z0-9_-]{2,254}$
I am using the URL routing feature to route a few URLs to features on my site (e.g. /home -> /pages/index, /activity -> /user/activity), so those particular URLs obviously cannot be registered by a user.
I'm largely inexperienced with regular expressions but have attempted to write an expression that would match any URL slugs with alphanumerics/dash/underscore except if they are any of the following:
default_controller
404_override
home
activity
Here is the code I'm using to try to match the words with that specific criteria:
$route['(?!default_controller|404_override|home|activity)[A-Za-z0-9][A-Za-z0-9_-]{2,254}'] = 'view/slug/$1';
but it isn't routing properly. Can someone help? (Side question: is it necessary to have ^ or $ in the regex when trying to match URLs?)
Alright, let's pick this apart.
Ignore CodeIgniter's reserved routes.
The default_controller and 404_override portions of your route are unnecessary. Routes are compared to the requested URI to see if there's a match. It is highly unlikely that those two items will ever be in your URI, since they are special reserved routes for CodeIgniter. So let's forget about them.
$route['(?!home|activity)[A-Za-z0-9][A-Za-z0-9_-]{2,254}'] = 'view/slug/$1';
Capture everything!
With regular expressions, a group is created using parentheses (). This group can then be retrieved with a back reference - in our case, the $1, $2, etc. located in the second part of the route. You only had a group around the first set of items you were trying to exclude, so it would not properly capture the entire wild card. You found this out yourself already, and added a group around the entire item (good!).
$route['((?!home|activity)[A-Za-z0-9][A-Za-z0-9_-]{2,254})'] = 'view/slug/$1';
Look-ahead?!
On that subject, the first group around home|activity is not actually a traditional group, due to the use of ?! at the beginning. This is called a negative look-ahead, and it's a complicated regular expression feature. And it's being used incorrectly:
Negative lookahead is indispensable if you want to match something not followed by something else.
There's a LOT more I could go into with this, but basically we don't really want or need it in the first place, so I'll let you explore if you'd like.
In order to make your life easier, I'd suggest separating the home, activity, and other existing controllers in the routes. CodeIgniter will look through the list of routes from top to bottom, and once something matches, it stops checking. So if you specify your existing controllers before the wild card, they will match, and your wild card regular expression can be greatly simplified.
$route['home'] = 'pages';
$route['activity'] = 'user/activity';
$route['([A-Za-z0-9][A-Za-z0-9_-]{2,254})'] = 'view/slug/$1';
Remember to list your routes in order from most specific to least. Wild card matches are less specific than exact matches (like home and activity), so they should come after (below).
Now, that's all the complicated stuff. A little more FYI.
Remember that a dash - has a special meaning between [] brackets, where it denotes a range. At the very end of the class, as in your pattern, it is already taken literally, but escaping it makes the intent explicit:
$route['([A-Za-z0-9][A-Za-z0-9_\-]{2,254})'] = 'view/slug/$1';
Note that your character repetition min/max {2,254} only applies to the second set of characters, so your user names must be 3 characters at minimum, and 255 at maximum. Just an FYI if you didn't realize that already.
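The length arithmetic can be confirmed with a quick sketch. It uses Python's re rather than CodeIgniter's router, with fullmatch standing in for the anchoring of the whole slug; the function name is_valid_slug is hypothetical.

```python
import re

# One leading character plus 2 to 254 more: 3 to 255 characters in total.
SLUG = re.compile(r'[A-Za-z0-9][A-Za-z0-9_-]{2,254}')

def is_valid_slug(slug):
    return SLUG.fullmatch(slug) is not None

checks = {
    'ab': is_valid_slug('ab'),            # 2 characters, too short
    'abc': is_valid_slug('abc'),          # 3 characters, minimum
    '255': is_valid_slug('a' * 255),      # maximum
    '256': is_valid_slug('a' * 256),      # one too many
}
print(checks)
```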
I saw your own answer to this problem, and it's just ugly. Sorry. The ^ and $ symbols are used improperly throughout the lookahead (which still shouldn't be there in the first place). It may "work" for a few use cases that you're testing it with, but it will just give you problems and headaches in the future.
Hopefully now you know more about regular expressions and how they're matched in the routing process.
And to answer your question, no, you should not use ^ and $ at the beginning and end of your regex -- CodeIgniter will add that for you.
Use the 404, Luke...
At this point your routes are improved and should be functional. I will throw it out there, though, that you might want to consider using the controller/method defined as the 404_override to handle your wild cards. The main benefit of this is that you don't need ANY routes to direct a wild card, or to prevent your wild card from goofing up existing controllers. You only need:
$route['404_override'] = 'view/slug';
Then, your View::slug() method would check the URI, and see if it's a valid pattern, then check if it exists as a user (same as your slug method does now, no doubt). If it does, then you're good to go. If it doesn't, then you throw a 404 error.
It may not seem that graceful, but it works great. Give it a shot if it sounds better for you.
I'm not familiar with CodeIgniter specifically, but most frameworks' routing operates on precedence. In other words, the default controller, 404, and similar routes should be defined first. Then you can simplify your regex to match only the slugs.
Ok answering my own question
I seem to have come up with a different expression that works:
$route['(^(?!default_controller$|404_override$|home$|activity$)[A-Za-z0-9][A-Za-z0-9_-]{2,254}$)'] = 'view/slug/$1';
I added parentheses around the whole expression (I think that's what CodeIgniter matches with $1 on the right) and added a start-of-line anchor, ^, and a bunch of end-of-line anchors, $.
Hope this helps someone who may run into this problem later.
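For anyone weighing the two lookahead variants in this thread, the sketch below shows what the $ inside the lookahead changes. It uses Python's re, not CodeIgniter's router, and assumes fullmatch is a fair stand-in for matching the whole URI; the reserved list is shortened to home and activity, and the sample slugs are invented.

```python
import re

# Lookahead without "$": rejects anything that merely starts with a reserved word.
PREFIX_BLOCK = re.compile(r'(?!home|activity)[A-Za-z0-9][A-Za-z0-9_-]{2,254}')
# Lookahead with "$": rejects only the exact reserved words.
EXACT_BLOCK = re.compile(r'(?!(?:home|activity)$)[A-Za-z0-9][A-Za-z0-9_-]{2,254}')

def routes(pattern, slug):
    # fullmatch requires the pattern to cover the entire slug.
    return pattern.fullmatch(slug) is not None

results = {slug: (routes(PREFIX_BLOCK, slug), routes(EXACT_BLOCK, slug))
           for slug in ('home', 'homer', 'activity', 'user-name')}
for slug, (prefix_ok, exact_ok) in results.items():
    print(slug, prefix_ok, exact_ok)
```

Both variants reject home and activity, but only the second lets a slug such as homer through. Listing the fixed routes before the wildcard, as suggested above, avoids the question entirely.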

RegEx to find all possible relative links to a specific file - also capture link text

Yes, there's hundreds of [regex] [html] topics on SO, but the first 30 I've checked don't help me with my problem.
I've got 745 total links (all relative, and they have to stay relative) to a file in my site. I need to find all these links and append data before and after them. I also need to capture and use the link text.
I've tried several expressions and the regex below is the closest I can get, but it's not good enough - it keeps finding a few instances of some other href to a different file and captures the content all the way to the </a> of the file I actually care about.
<a href="((.)*?)?myFile.html((.)*?)?>((.)*?)?</a>
In the above, I need to capture the relative path to the file and any anchors that might be present, as well as the actual link text.
What regex should I be using?
It shouldn't matter, but I'm using Adobe Dreamweaver to perform the search.
The following regex should work for what you need:
<a href="([^"]*?a\.fparameters\.html)(#[^"]+?)?".*?>(.*?)<
It will work even if you have links like:
<a href="a.fparameters.html">JOBMAXNODECOUNT</a>
that do not have #xxxx.
A few examples:
For <a href="a.fparameters.html#jobmaxnodecount">JOBMAXNODECOUNT</a> you will get:
Group 1: a.fparameters.html
Group 2: #jobmaxnodecount
Group 3: JOBMAXNODECOUNT
For mjobctl -m to modify the job after it has been submitted. See the <a href="a.fparameters.html#rsvsearchalgo">RSVSEARCHALGO</a> you will get only one match:
Group 1: a.fparameters.html
Group 2: #rsvsearchalgo
Group 3: RSVSEARCHALGO
Try this regex: (updated)
href="([^"]*?)myFile\.html#?([^"]*).*?>(.*?)<\/a>
Explained demo here: http://regex101.com/r/lA6vB7
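The question is about Dreamweaver's search, but the pattern can be sanity-checked in Python first. The sketch below applies it to an invented HTML snippet (the link targets and texts are assumptions) and shows that a link to a different file earlier on the same line is skipped, because [^"]* cannot cross the closing quote of another href.

```python
import re

# Group 1: relative path, group 2: anchor (without "#"), group 3: link text.
LINK = re.compile(r'href="([^"]*?)myFile\.html#?([^"]*).*?>(.*?)</a>')

html = ('<p><a href="other.html">Other</a> and '
        '<a href="../docs/myFile.html#top">Read me</a> and '
        '<a href="myFile.html">Mine</a></p>')

links = LINK.findall(html)
for path, anchor, text in links:
    print(repr(path), repr(anchor), repr(text))
```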
First, never do this: (.)* ...or this: (?:.)*
The first one consumes one character at a time and captures it in a group, each time overwriting previous captured character. The second one avoids most of that overhead by using a non-capturing group, but it's still only matching one character at a time inside that group; why bother? All it's doing is cluttering up the regex.
Adding the ? to make it non-greedy -- e.g. (.)*? -- doesn't make it worse, but it doesn't help, either. And sticking that inside another group and making the group optional -- i.e. ((.)*?)? -- is a recipe for catastrophic backtracking. But performance considerations aside, when I see a capturing group with a quantifier attached, it almost always turns out to be a mistake on the author's part.
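The overwriting behaviour is easy to observe. The following sketch uses Python's re, where a quantified capturing group likewise keeps only its last iteration; the input string abc is arbitrary.

```python
import re

# Quantifier outside the group: the group is re-captured on every character.
last_char_only = re.match(r'(.)*', 'abc').group(1)
# Quantifier inside the group: the group captures the whole run once.
whole_run = re.match(r'(.*)', 'abc').group(1)

print(last_char_only)  # c
print(whole_run)       # abc
```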
As for your question, my solution turns out to be almost identical to Oscar's:
([^<>]*)

Finding a URL within two strings regex

I have a long HTML file that contains the names of organizations and their URLs. Each organization's "section" in the code is demarcated by the word "organization", followed by a lot of code with the URL located somewhere inside it, and ends with the word "organization".
For example:
organization -- a lot of code (with the URL located somewhere inside) -- organization
I have tried to use regex to search and extract the URL, but to no avail.
organization(?<Protocol>\w+):\/\/(?<Domain>[\w#][\w.:#]+)\/?[\w\.?=%&=\ #/$,]*organization
I suspect my problem lies somewhere in my trying to demarcate the search for URLs by just using the word "organization", but I am not sure.
Try group 1 from this:
organization.*\b(\w+://[\w.?%&=#/$,-]+).*?organization
Your current regex is searching for something sandwiched immediately between two instances of "organization". If there's any chance of characters existing between "organization" and your URL, you'll need to introduce a non-greedy match for any instances of anything (.*?), and if there are newlines in the mix you'll need to use (?:.|\n)*?.
So your regex becomes:
organization(?:.|\n)*?(?<Protocol>\w+):\/\/(?<Domain>[\w#][\w.:#]+)\/?[\w\.?=%&=\ #/$,]*(?:.|\n)*?organization
(Because of the bold insertions, this mistakenly appears to have spaces, but it does not. If you select it and copy/paste, it will paste correctly without spaces)
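The question does not say which regex engine is in use, so the following is only a sketch of the same idea in Python. It combines the lazy matching suggested above with the simpler URL character class from the first answer, uses re.DOTALL instead of (?:.|\n) so that . also matches newlines, and writes the named group in Python's (?P<name>...) syntax. The sample text and the group name url are invented for illustration.

```python
import re

ORG_URL = re.compile(
    r'organization.*?(?P<url>\w+://[\w.?%&=#/$,-]+).*?organization',
    re.DOTALL,  # let "." span the newlines inside a section
)

sample = (
    'organization <div class="card">\n'
    '  <a href="http://example.org/about?x=1">Example Org</a>\n'
    '</div> organization'
)

match = ORG_URL.search(sample)
found_url = match.group('url') if match else None
print(found_url)
```

For a file with many sections, ORG_URL.finditer would yield one match per pair of "organization" markers.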