regular expression to parse short urls - regex

I have a list of possible URLs on my site, such as:
1 http://dev.site.com/People/
2 http://dev.site.com/People
3 http://dev.site.com/Groups/
4 http://dev.site.com/Groups
5 http://dev.site.com/
6 http://dev.site.com/[extraword]
I want to match all URLs like 6 and redirect them to
http://dev.site.com/?Shorturl=extraword
but I don't want to redirect the first 5 URLs.
I tried something like
((.*)(?!People|Groups))\r
but something is wrong.
Any help?
Thanks

You should put the check that it isn't People or Groups at the start:
(?!People|Groups)(.*)
At the moment you're checking that whatever (.*) matched isn't followed by People or Groups.
Depending on which language/framework you're using, you might also need to use ^ and $ to make sure you're matching the whole string:
^(?!People|Groups)(.*)$
You should also think about whether you want to match urls that begin with People, eg. http://dev.site.com/People2/. So this might be better:
^(?!(?:People|Groups)(?:/|$))(.*)$
The lookahead now rejects People or Groups only when the name is followed by a slash or the end of the URL, so a path such as People2 is still matched.
You might want to make sure you don't match an empty string, so use .+ instead of .*:
^(?!(?:People|Groups)(?:/|$))(.+)$
And if you want a word without any slashes:
^(?!(?:People|Groups)(?:/|$))([^/]+)$
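As a rough illustration, here is a minimal Python sketch of that last pattern applied to the path part of the URL (everything after http://dev.site.com/). The function name short_url_redirect and the choice of Python are assumptions for demonstration only; the question does not say which language or framework performs the redirect.

```python
import re

# Last pattern from the answer: reject People/Groups, capture one slash-free word.
SHORT_URL = re.compile(r"^(?!(?:People|Groups)(?:/|$))([^/]+)$")

def short_url_redirect(path):
    """Return the redirect target for a short-URL path, or None to leave it alone.

    `path` is assumed to be the part after http://dev.site.com/ (no leading slash).
    """
    m = SHORT_URL.match(path)
    if m is None:
        return None
    return "http://dev.site.com/?Shorturl=" + m.group(1)

for path in ["People/", "People", "Groups/", "Groups", "", "extraword", "People2"]:
    print(repr(path), "->", short_url_redirect(path))
```

Only the last two paths produce a redirect; the five reserved URLs return None.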

In your regex, the (.*) subpattern consumes the entire string, which then causes the negative lookahead to succeed.
You need a negative lookahead to exclude People|Groups, and then you need to capture the extra word (and the word needs to have some stuff in it, otherwise we want the match to fail). The crucial piece here is that the negative lookahead does not consume any of the string, so you are able to capture the extra word for subsequent use in the redirect URL you are trying to build.
Here's a solution in Perl, but the approach should work for you in C#:
use warnings;
use strict;
while (<DATA>) {
    print "URL=$1 EXTRA_WORD=$2\n"
        if /^(.*)\/(?!People|Groups)(\w+)\/?$/;
}
__DATA__
http://dev.site.com/People/
http://dev.site.com/People
http://dev.site.com/Groups/
http://dev.site.com/Groups
http://dev.site.com/
http://dev.site.com/extraword1
http://dev.site.com/extraword2/
Output:
URL=http://dev.site.com EXTRA_WORD=extraword1
URL=http://dev.site.com EXTRA_WORD=extraword2
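For comparison, the same idea can be sketched in Python. This is only an illustrative port of the Perl pattern above, not a tested C# solution; the helper name extra_word is made up for the example. It also shows why the pattern from the question always succeeds: (.*) swallows the whole string before the lookahead is tried.

```python
import re

# Pattern from the question: (.*) consumes everything, so the lookahead
# is checked at the end of the string, where it trivially succeeds.
broken = re.compile(r"((.*)(?!People|Groups))")

# Pattern from the Perl answer: the lookahead runs right after the last slash.
fixed = re.compile(r"^(.*)/(?!People|Groups)(\w+)/?$")

def extra_word(url):
    """Return the captured short-URL word, or None if the URL is reserved."""
    m = fixed.match(url)
    return m.group(2) if m else None

urls = [
    "http://dev.site.com/People/",
    "http://dev.site.com/People",
    "http://dev.site.com/Groups/",
    "http://dev.site.com/Groups",
    "http://dev.site.com/",
    "http://dev.site.com/extraword1",
    "http://dev.site.com/extraword2/",
]
for url in urls:
    print(url, "->", extra_word(url))
```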

Related

Regex - Skip characters to match

I'm having an issue with Regex.
I'm trying to match T0000001 (2, 3 and so on).
However, some of the lines it searches have what I can only describe as positioners. These are shown as a question mark followed by 2 digits, such as ?21.
These positioners describe a new position if the document were to be printed off the website.
Example:
T123?214567
T?211234567
I need to disregard ?21 and match T1234567.
From what I can see, this is not possible.
I have looked everywhere and tried numerous attempts.
All we have to work off is the linked image. The creators can't even confirm which flavour of regex it is - they believe it's Python but I'm unsure.
Regex Image
Update
Unfortunately, none of the suggestions below have worked so far. I tested each one live (rather than only in a regex tester, in case it behaved differently), but they still didn't work.
There is no replace feature, and as mentioned before I'm not sure if it is Python. Appreciate your help.
Do two regex operations
First do the regex replace to replace the positioners with an empty string.
(\?[0-9]{2})
Then do the regex match
T[0-9]{7}
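A minimal sketch of the two-step approach, assuming the tool really is Python-based and that a replace step is available (the question's update suggests neither is certain). The function name find_codes is hypothetical.

```python
import re

POSITIONER = re.compile(r"\?[0-9]{2}")   # step 1: the ?NN positioners
TARGET = re.compile(r"T[0-9]{7}")        # step 2: T followed by 7 digits

def find_codes(text):
    """Strip positioners, then return every T + 7 digit code in the text."""
    cleaned = POSITIONER.sub("", text)
    return TARGET.findall(cleaned)

print(find_codes("T123?214567\nT?211234567\nT0000001"))
```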
If there's only one occurrence of the 'positioners' in each match, something like this should work: (T.*?)\?\d{2}(.*)
This can be tested here: https://regex101.com/r/XhQXkh/2
Basically, match two capture groups before and after the '?21' sequence. You'll need to concatenate these two matches.
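The concatenation step might look like this in Python. It is a sketch under the same assumption as the answer (at most one positioner per line); join_around_positioner is an illustrative name, and lines without a positioner are returned unchanged.

```python
import re

SPLIT = re.compile(r"(T.*?)\?\d{2}(.*)")

def join_around_positioner(line):
    """Join the text before and after a single ?NN positioner."""
    m = SPLIT.match(line)
    if m is None:
        return line  # no positioner found, nothing to join
    return m.group(1) + m.group(2)

for line in ["T123?214567", "T?211234567", "T1234567"]:
    print(line, "->", join_around_positioner(line))
```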
First, match the ?21 and replace it with a distinctive character such as #:
\?21
Then you can try this regex to find what you want:
(T(?:\d{7}|[\#\d]{8}))\s
The target string is captured in group 1 (or \1).
Finally, replace # with ?21 or whatever you like.
A Python script might look like this:
import re

ss = """T123?214567
T?211234567
T1234567
T1234434?21
T5435433"""
rexpre = re.compile(r'\?21')
regx = re.compile(r'(T(?:\d{7}|[\#\d]{8}))\s')
# Print the matches with the # placeholder still in place
for m in regx.findall(rexpre.sub('#', ss)):
    print(m)
print()
# Print the matches with the placeholder restored to ?21
for m in regx.findall(rexpre.sub('#', ss)):
    print(re.sub('#', r'?21', m))
Output is
T123#4567
T#1234567
T1234567
T1234434#
T123?214567
T?211234567
T1234567
T1234434?21
If using a replace functionality is an option for you then this might be an approach to match T0000001 or T123?214567:
Capture a T followed by zero or more digits before the optional part in group 1 (T\d*)
Make the question mark followed by 2 digits part optional (?:\?\d{2})?
Capture one or more digits after in group 2 (\d+).
Then in the replacement you could use group1group2 \1\2.
Using word boundaries \b (Or use assertions for the start and the end of the line ^ $) this could look like:
\b(T\d*)(?:\?\d{2})?(\d+)\b
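As a hedged sketch, assuming Python's re module and that a replace step is available, the pattern and the \1\2 replacement could be applied like this (strip_positioners is an illustrative name):

```python
import re

PATTERN = re.compile(r"\b(T\d*)(?:\?\d{2})?(\d+)\b")

def strip_positioners(text):
    """Rewrite each match as group 1 followed by group 2, dropping the ?NN part."""
    return PATTERN.sub(r"\1\2", text)

print(strip_positioners("T123?214567 T?211234567 T0000001"))
```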
Is the below what you want?
Use RegExReplace with multiline tag (m) and enable replace all occurrences!
Pattern = (T\d*)\?\d{2}(\d*)
replace = $1$2

What's the right regular expression to match the exact word at the end of a string and excluding all other urls with more chars at the end?

I have to match an exact string at the end of a url, but not match all other urls that have more characters after that string
I can better explain with example.
I need to match the url having the string 'white' at its end: http//mysite.com/white
But I also need to not match URLs that have one or more characters appended to it, like http//mysite.com/white__blue or http//mysite.com/white/yellow or http//mysite.com/white/
How to do that?
Thanks
Regex to match any url*
^(https?:\/\/)?([\da-z\.-]+\.[a-z\.]{2,6}|[\d\.]+)([\/:?=&#]{1}[\da-z\.-]+)*[\/\?]?$
Regex to match a url ending in white
^(https?:\/\/)?([\da-z\.-]+\.[a-z\.]{2,6}|[\d\.]+)([\/:?=&#]{1}[\da-z\.-]+)*[\/\?]?white$
You can check the regex here
From regexr.com
It does not match urls(which are not valid anyway) like
httpabrakadabra.co//
http:google.com
http://no-tld-here-folks.a
http://potato.54.211.192.240/
Based on your limited sample inputs, I'd say you could get away with this very minimal pattern:
^http[^\s]+white$
However, depending on what you are truly trying to achieve, what language/function you are implementing this pattern with, and what the full input string looks like, this pattern may need to be refined.
It would be best if you would improve your question to include all of the above relevant information.
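To check the minimal pattern against the sample inputs, here is a small Python sketch. Python is an assumption, since the question names no language, and ends_with_white is a made-up helper. The URLs are copied as written in the question.

```python
import re

ENDS_WITH_WHITE = re.compile(r"^http[^\s]+white$")

def ends_with_white(url):
    """True only when the URL ends with 'white' and nothing follows it."""
    return ENDS_WITH_WHITE.search(url) is not None

samples = [
    "http//mysite.com/white",
    "http//mysite.com/white__blue",
    "http//mysite.com/white/yellow",
    "http//mysite.com/white/",
]
for url in samples:
    print(url, "->", ends_with_white(url))
```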

can a regex match cn.cn. or ti.ti. but not vv.pp. or aa.bb.?

Is it possible with a regex to match a particular sequence repeating itself, rather than a number of letters? I would like to match cn.cn. or ti.ti. or xft.xft. but not vv.pp. or aa.bb., and I don't seem to be able to do that with (\w\w.)+ as opposed to \w+.\w+. In the first case I want to keep only one occurrence, like cn. or ti.; in the second I want to keep v.p. or a.b.
Thanks for any help.
Depending on your flavor of regex, you can use backreferences in your regex to match an earlier group. Your question title and question body disagree, however, on what exactly is supposed to be matched. I'll answer in Python as that's the flavor I'm most familiar with.
# match vv.pp., no match cn.cn.
re.match(r"(\w)\1\.(\w)\2\.", some_text)
# match cn.cn., no match vv.pp.
re.match(r"(\w{2})\.\1\.", some_text)
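Building on the backreference idea, a possible way to collapse a repeated sequence to a single occurrence is sketched below. Using (\w+) instead of (\w{2}) so that xft.xft. is also covered is my assumption about what the question wants, and collapse_repeats is an illustrative name.

```python
import re

# A run of word characters, a dot, then the very same run and another dot.
REPEATED = re.compile(r"\b(\w+)\.\1\.")

def collapse_repeats(text):
    """Replace 'cn.cn.' with 'cn.' while leaving 'vv.pp.' untouched."""
    return REPEATED.sub(r"\1.", text)

for s in ["cn.cn.", "ti.ti.", "xft.xft.", "vv.pp.", "aa.bb."]:
    print(s, "->", collapse_repeats(s))
```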

Clear Regex for "URL Contains"

I'm always stymied by regular expressions. My tool has a filtering option for "Current URL Matches Regex (case insensitive)" but I'm not sure how to write the regular expression for my needs. I'd love to figure out how to write a regex that would ONLY trigger for URLs that contain ANY of these 5 strings anywhere in URL:
Product=Neo-Supreme
Product=Cordura
Product=Hawaiian
Product=Animal%20Deluxe
Product=Camo
Basically the regex you need is something along the lines of
'Product\=[^&]+'
unless you know that the product can be something other than one of those 5 options.
If so, you'll need to use
'Product\=(Neo-Supreme|Cordura|Hawaiian|Animal%20Deluxe|Camo)'
EDIT for comments:
To match anything you can always use .*, which matches on any number of any character (except a newline, unless otherwise specified).
'.*seat-option.*Product\=(Neo-Supreme|Cordura|Hawaiian|Animal%20Deluxe|Camo).*'
Here's a demo
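To try the alternation outside the tool, a minimal Python sketch follows. The example.com URLs are invented for illustration, and re.IGNORECASE stands in for the tool's "case insensitive" option, whose exact behaviour is not documented here.

```python
import re

PRODUCT = re.compile(
    r"Product=(Neo-Supreme|Cordura|Hawaiian|Animal%20Deluxe|Camo)",
    re.IGNORECASE,
)

def url_matches(url):
    """True when any of the five Product= values appears anywhere in the URL."""
    return PRODUCT.search(url) is not None

print(url_matches("http://example.com/seat-option?Product=Camo&x=1"))
print(url_matches("http://example.com/seat-option?product=cordura"))
print(url_matches("http://example.com/seat-option?Product=Leather"))
```

Because search looks anywhere in the string, no leading or trailing .* is needed in this setting. Note that the pattern has no terminator after the product name, so a value that merely starts with one of the five names would also match.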

regex : how to eliminate urls ending with .dtd

This is JavaScript regex.
regex = /(http:\/\/[^\s]*)/g;
text = "I have http://hibernate.sourceforge.net/hibernate-mapping-3.0.dtd and I like http://google.com a lot";
matches = text.match(regex);
console.log(matches);
I get both the urls in the result. However I want to eliminate all the urls ending with .dtd . How do I do that?
Note that I am saying ending with .dtd should be removed. It means a url like http://a.dtd.google.com should pass .
The nicest way to do it is to use a negative lookbehind (in languages that support them):
/(?>http:\/\/[^\s]*)(?<!\.dtd)/g
The ?> in the first bracket makes it an atomic grouping which stops the regex engine backtracking - so it'll match the full URL as it does now, and if/when the next part fails it won't try going back and matching less.
The (?<!\.dtd) is a negative lookbehind, which only matches if \.dtd doesn't match ending at that position (i.e., the URL doesn't end in .dtd).
For languages that don't (such as JavaScript, which has no atomic groups and only gained lookbehind in ES2018), you can do a negative lookahead instead, which is a bit more ugly and is generally less efficient:
/(http:\/\/(?![^\s]*\.dtd(?:\s|$))[^\s]*)/g
Will match http://, then scan ahead to make sure the URL doesn't end in .dtd (the (?:\s|$) ensures .dtd is at the very end, so http://a.dtd.google.com still passes), then backtrack and scan forward again to get the actual match.
As always, http://www.regular-expressions.info/ is a good reference for more information
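For readers who want to experiment outside the browser, here is a hedged Python sketch of the lookahead approach. It is a port for illustration, not the JavaScript from the question; (?!\S) is used as the "end of URL" check so that a .dtd in the middle of a host name does not count, and urls_not_ending_in_dtd is a made-up name.

```python
import re

# Match http:// only when the run of non-space characters that follows
# does not end in ".dtd".
URL_NOT_DTD = re.compile(r"http://(?!\S*\.dtd(?!\S))\S*")

def urls_not_ending_in_dtd(text):
    """Return all http:// URLs in the text except those ending in .dtd."""
    return URL_NOT_DTD.findall(text)

text = ("I have http://hibernate.sourceforge.net/hibernate-mapping-3.0.dtd "
        "and I like http://google.com a lot")
print(urls_not_ending_in_dtd(text))
print(urls_not_ending_in_dtd("see http://a.dtd.google.com for details"))
```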