Using regex to determine conversions for various dynamic URLs - regex

This stuff is over my head, and I hope someone can help me. Thank you in advance for your time!
I want to determine different types of conversions based on the success URL. I have two types:
a. A job was posted on the site
b. A new tradesman has signed up
These are the two success URLs:
a. A job was posted
Unique URL: http://www.redfish.co.za/job/Some-Variable-Job-Name/success
b. A new tradesman registered
Unique URL: http://www.redfish.co.za/tradesman/Some-Variable-Trademan-Name/success
So the first regex function should look at the URL and determine whether it contains both /job/ AND /success
The second regex function should tell me whether the URL contains both /tradesman/ AND /success
Please, can you help me? I have tried to work from other examples given here but don't understand this stuff.

URL contains the word /job/ AND /success
/\/job\/.*\/success/
URL contains the word /tradesman/ AND /success
/\/tradesman\/.*\/success/
Explanation:
/ Indicates the start of the regex
\/ A literal /; the / character has a special meaning in a regex literal (see above), so it needs to be escaped with \
job These exact characters
\/ A second /, to complete the string /job/
.* Zero or more characters (anything except a newline character)
\/success The string /success, with the / character escaped
/ Indicates the end of the regex
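As a rough illustration, the two patterns can be tried outside the browser as well. The sketch below uses Python's re module rather than JavaScript, so the / needs no escaping; the helper name conversion_type is made up for this example.

```python
import re

# The two patterns from the answer. Python has no /.../ regex delimiters,
# so the slashes are written without a backslash here.
JOB_SUCCESS = re.compile(r"/job/.*/success")
TRADESMAN_SUCCESS = re.compile(r"/tradesman/.*/success")


def conversion_type(url):
    """Return 'job', 'tradesman', or None (hypothetical helper for illustration)."""
    if JOB_SUCCESS.search(url):
        return "job"
    if TRADESMAN_SUCCESS.search(url):
        return "tradesman"
    return None


for url in (
    "http://www.redfish.co.za/job/Some-Variable-Job-Name/success",
    "http://www.redfish.co.za/tradesman/Some-Variable-Trademan-Name/success",
    "http://www.redfish.co.za/job/Some-Variable-Job-Name",
):
    print(url, "->", conversion_type(url))
```

The last URL has no /success part, so neither pattern matches it.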

Related

Fluentvalidation 6.4.1.0 support me with Incorrect regex

In my case, I want to validate image URLs, but some valid URLs are rejected.
E.g., the image link "https://fuvitech.online/wpcontent/uploads/2021/02/bta16600brg.jpg" or "https://fuvitech.online/wp-content/uploads/2021/02/bta16-600brg.jpg" gets the response "The image link is not in the correct format".
My code is here:
RuleFor(product => product.Images)
    .Length(1, 3000).WithMessage(Labels.importProduct_ExceedDescription, p => ImportHelpers.GetColumnName(typeof(ProductEntity).GetProperty(nameof(p.Images))))
    .Matches(@"^(http:\/\/|https:\/\/){1}?[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?$").WithMessage(Labels.importProduct_UrlNotCorrect, p => ImportHelpers.GetColumnName(typeof(ProductEntity).GetProperty(nameof(p.Images))));
Please help me find where the above regex is wrong. Thank you.
Try this:
NOTE the following regex pattern may trigger false positives and also may ignore valid image URLs, because it is very difficult to validate whether a given URL is valid.
^https?:\/\/(?:(?:[A-Za-z0-9]+(?:-[A-Za-z0-9]+)+|[A-Za-z0-9]{2,})\.)+[A-Za-z]{2,}(?::\d+)?\/(?:(?:[A-Za-z0-9]+(?:(?:-[A-Za-z0-9]+)+)?\/)+|)[\w-]+\.(?:jpg|jpeg|png)$
Explanation
^ the start of a line/string.
https?:\/\/ match http with an optional letter s, followed by ://.
(?:(?:[A-Za-z0-9]+(?:-[A-Za-z0-9]+)+|[A-Za-z0-9]{2,})\.)+ This will match things like foo-foo.bar-bar., foo.bar-bar. and foo.
[A-Za-z]{2,} this will match the TLD part, e.g., com, org, this part with the previous part will match things like foo-foo.bar-bar.com, foo.bar-bar.com or foo.com.
(?::\d+)? optional group of (a colon : followed by one or more digits) for port part.
\/(?:(?:[A-Za-z0-9]+(?:(?:-[A-Za-z0-9]+)+)?\/)+|) this checks for two things: the first is a path such as /uploads/public-images/ or /uploads/images/, the second is a single /.
[\w-]+ this part is for the file name, e.g., bta16-600brg.
\.(?:jpg|jpeg|png) the file extension; you can add more extensions here, and you can allow uppercase letters by using, for example, [Jj][Pp][Gg] for jpg.
$ the end of the line/string.
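A quick way to try the suggested pattern against the two URLs from the question is sketched below. It uses Python's re module instead of .NET, so it only approximates how FluentValidation would evaluate it; the readme.txt URL is an invented negative example.

```python
import re

# The suggested pattern, split over several lines for readability only.
IMAGE_URL = re.compile(
    r"^https?:\/\/(?:(?:[A-Za-z0-9]+(?:-[A-Za-z0-9]+)+|[A-Za-z0-9]{2,})\.)+"
    r"[A-Za-z]{2,}(?::\d+)?\/"
    r"(?:(?:[A-Za-z0-9]+(?:(?:-[A-Za-z0-9]+)+)?\/)+|)"
    r"[\w-]+\.(?:jpg|jpeg|png)$"
)

samples = [
    "https://fuvitech.online/wpcontent/uploads/2021/02/bta16600brg.jpg",
    "https://fuvitech.online/wp-content/uploads/2021/02/bta16-600brg.jpg",
    # Made-up URL with a non-image extension, expected to be rejected.
    "https://fuvitech.online/wp-content/uploads/2021/02/readme.txt",
]

for url in samples:
    print(url, "->", bool(IMAGE_URL.match(url)))
```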
Thanks to @SaSkY for answering my question.
I found my mistake.
The part \.[a-z]{2,5} only allows domain extensions of 2 to 5 characters. For example, .com is valid, but in my case .online was not.
I changed it to \.[a-z]{1,10}.
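The effect of the quantifier can be checked in isolation. This is a minimal sketch in Python's re (an assumption, since the original code runs on .NET) that compares the pattern from the question with the same pattern using {1,10} for the domain extension.

```python
import re

# Pattern from the question: the domain extension is limited to 2-5 letters.
ORIGINAL = re.compile(
    r"^(http:\/\/|https:\/\/){1}?[a-z0-9]+([\-\.]{1}[a-z0-9]+)*"
    r"\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?$"
)
# Same pattern with the extension widened to 1-10 letters.
WIDENED = re.compile(
    r"^(http:\/\/|https:\/\/){1}?[a-z0-9]+([\-\.]{1}[a-z0-9]+)*"
    r"\.[a-z]{1,10}(:[0-9]{1,5})?(\/.*)?$"
)

url = "https://fuvitech.online/wp-content/uploads/2021/02/bta16-600brg.jpg"

# ".online" has six letters, so only the widened pattern accepts it.
print("original:", bool(ORIGINAL.match(url)))
print("widened: ", bool(WIDENED.match(url)))
```

Note that both versions accept any path after the host, so neither checks that the URL actually points to an image.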

How to match a regex with a fixed URL + variable slug

I am trying to write the following regexes for Google Analytics, and so far I have been unable to.
Case 1. to match with all the URLs containing /cms/en/product/{variable slug}/ which only contains one slug after the /product/. I mean something like the following:
/cms/en/product/firstslug/
Case 2. to match with all the URLs containing /cms/en/product/{variable slug1}/{variable slug2}/ which only contains two slugs after the /product/. I mean something like the following:
/cms/en/product/firstslug/secondslug/
Really appreciate anyone's help in advance.
I have already tried basics like the following and it doesn't work:
/cms/en/product/.*/$
^/cms/en/product/.*/$
/cms/en/product/([^/]+)/?$
^/cms/en/product/([^/]+)/?$
^/cms/en/product/[^/]+/$ matches "/cms/en/product/firstslug/"
^/cms/en/product/[^/]+/[^/]+/$ matches "/cms/en/product/firstslug/secondslug/"
^/cms/en/product/[^/]+/([^/]+/)?$ matches both "/cms/en/product/firstslug/" and "/cms/en/product/firstslug/secondslug/"
where
[^/]+ matches a single slug, i.e. one or more character(s) (+) which are not "/" ([^/])
([^/]+/)? matches an optional slug, i.e. an optional (?) group (()) of one or more character(s) (+) which are not "/" ([^/]) followed by a single "/"
Anyway: I would suggest using Content Grouping on collection.
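The three patterns above can be sanity-checked with a small script. This sketch uses Python's re module, which is only a stand-in for the regex engine in Google Analytics; the third test path with three slugs is invented for the example.

```python
import re

ONE_SLUG = re.compile(r"^/cms/en/product/[^/]+/$")
TWO_SLUGS = re.compile(r"^/cms/en/product/[^/]+/[^/]+/$")
ONE_OR_TWO = re.compile(r"^/cms/en/product/[^/]+/([^/]+/)?$")

paths = [
    "/cms/en/product/firstslug/",
    "/cms/en/product/firstslug/secondslug/",
    "/cms/en/product/firstslug/secondslug/thirdslug/",  # made-up path, should match none
]

for path in paths:
    print(
        path,
        "one:", bool(ONE_SLUG.match(path)),
        "two:", bool(TWO_SLUGS.match(path)),
        "one-or-two:", bool(ONE_OR_TWO.match(path)),
    )
```

Because [^/]+ cannot cross a slash, each pattern counts slugs exactly, which is what the .* attempts in the question could not do.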

Django Url pattern regex for tokens

I need to pass tokens like b'//x0eaa#abc.com//x00//xf0//x7f//xff//xff//xfd//x00' in my Django URL pattern. I am not able to find a matching regex for that, which results in a Page not found error.
My URL will be like /api/users/0/"b'//x0eaa#abc.com//x00//xf0//x7f//xff//xff//xfd//x00'"/
I have tried the following regex:
url(r'^api/users/(?P<username>[\w\-]+)/(?P<paging_state>[\w.%+-]+#[A-Za-z0-9.-]+\.[A-Za-z]{2,4})/$', views.getUserPagination),
Please pass the token in request header or body and then use accordingly in your view.
Considering there are some static predictable elements in your url like -
api/users/
/" before b
"/ at the end after '
So I can see the URL in either of the two ways below, each with its regex:
api/users/(set of words, digits or hyphens)/"(any character except newline)"/
REGEX: ^api\/users\/([\w\d\-]+)\/"(.*)"\/$
URL: url(r'^api\/users\/([\w\d\-]+)\/"(.*)"\/$', views.getUserPagination),
api/users/(set of words, digits or hyphens)/"(one character-b)'//(any no. of words or digits)#(any no. of words or digits).(any no. of words or digits) (any no. of words, digits, front slashes)'"/
REGEX: ^api\/users\/([\w\d\-]+)\/"([a-g]'\/\/[\w\d]*#[\w\d]*\.[\w\d]*[\/\w\d]*')"\/$
URL: url(r"""^api\/users\/([\w\d\-]+)\/"([a-g]'\/\/[\w\d]*#[\w\d]*\.[\w\d]*[\/\w\d]*')"\/$""", views.getUserPagination),
You should be able to use either of the above two. There can be multiple ways to match the token part in your url. So unless it is a big security concern, you can do with the simplest approach as mentioned in point 1.
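Before wiring a pattern into the URL configuration, it can be checked with plain re. The sketch below tests the first, simpler regex against the example path from the question; it assumes the path reaches the resolver exactly as written (no percent-encoding of the quotes) and without the leading slash.

```python
import re

# First pattern from the answer: anything between the double quotes is the token.
PATTERN = re.compile(r'^api\/users\/([\w\d\-]+)\/"(.*)"\/$')

# Example path from the question, without the leading slash (assumption:
# this is the form the URL resolver matches against).
path = "api/users/0/\"b'//x0eaa#abc.com//x00//xf0//x7f//xff//xff//xfd//x00'\"/"

match = PATTERN.match(path)
if match:
    user_part, token = match.groups()
    print("user part:", user_part)
    print("token:    ", token)
else:
    print("no match")
```

Both groups are unnamed, so a Django view would receive them as positional arguments.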

Regex: matching unknown groups that repeat?

I'm trying to create a generic regex pattern for a crawler, to avoid so-called "crawler traps" (links that just add URL parameters and refer to the exact same page, which results in tons of useless data). A lot of the time, those links just add the same part to a URL over and over again. Here is an example from a log file:
http://examplepage.com/cssms/chrome/cssms/chrome/cssms/pages/browse/cssms/pages/misc/...
I can use regular expressions to narrow the scope of the crawler, and I would love to have a pattern that tells the crawler to ignore everything that has repeating parts. Is that possible with a regex?
Thanks in advance for some tips!
JUST TO CLARIFY:
the crawlertraps are not designed to prevent crawling, they are a result of poor web design. All the pages we are crawling explicitly allowed us to do so!
If you are already looping through a list of URLs, you could add matching as a condition to skip the current iteration:
import re

array = ["/abcd/abcd/abcd/abcd/", "http://examplepage.com/cssms/chrome/cssms/chrome/cssms/pages/browse/cssms/pages/misc/", "http://examplepage/apple/cake/banana/"]
# A captured chunk of 4+ characters that shows up at least 3 more times marks a likely trap.
pattern1 = re.compile(r'.*?([^\/\&?]{4,})(?:[\/\&\?])(.*?\1){3,}.*')
for url in array:
    if pattern1.match(url):
        print("It matches; skipping this URL")
        continue
    print(url)
Example regex:
.*?([^\/\&?]{4,})(?:[\/\&\?])(.*?\1){3,}.*
([^\/\&?]{4,}) matches and captures a sequence of 4 or more characters, none of which is /, & or ?.
(?:[\/\&\?]) looks for one /, & or ?
(.*?\1){3,} lazily matches anything followed by what we captured, doing all of this 3 or more times.
You can use a backreference in Python/Perl regexes (and possibly others) to catch a pattern which is repeated:
>>> re.search(r"(/.+)\1", "http://examplepage.com/cssms/chrome/cssms/chrome/cssms/pages/browse/cssms/pages/misc/").group(1)
'/cssms/chrome'
\1 references the first match, so (/.+)\1 means the same sequence repeated twice in a row. The leading / is just to avoid the regex matching the first single repeating letter (which is the t in http) and catch repetitions in the path.
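For use inside a crawl loop, the backreference pattern can be wrapped in a small function. This is only a sketch: the name repeated_part is made up, and the /foo/foobar input is an invented example that shows a limitation of the pattern.

```python
import re

# Same pattern as above: a chunk starting with "/" that is immediately repeated.
REPEATED_PATH = re.compile(r"(/.+)\1")


def repeated_part(url):
    """Return the repeated chunk, or None if nothing repeats back to back
    (hypothetical helper for illustration)."""
    match = REPEATED_PATH.search(url)
    return match.group(1) if match else None


urls = [
    "http://examplepage.com/cssms/chrome/cssms/chrome/cssms/pages/browse/cssms/pages/misc/",
    "http://examplepage/apple/cake/banana/",
    "/foo/foobar",  # made-up input: matches although the segments differ
]

for url in urls:
    print(url, "->", repeated_part(url))
```

The pattern does not require the repetition to end at a segment boundary, so /foo/foobar is reported as repeating /foo. Whether that is acceptable depends on how aggressive the filter should be.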

How to rewrite url containing plus and special characters?

We've got some incoming URLs that need to be redirected, but we are having trouble with URLs that contain pluses (+).
For example, any incoming URL like the following must be redirected to the homepage of the new site:
/eng/news/2005+01+01.htm
It should be redirected to the home page of the new site:
/en/
Using UrlRewriter.net we've set up a rule which works with 'normal' URLs but does not work for the above
<redirect url="~/eng/(.+)" to="/en/index.aspx" />
However, it works fine if I change the incoming URL to
/eng/news/2005-01-01.htm
What's the problem and can anyone help?
I don't know UrlRewriter.net, and I'm not sure which regex syntax it uses, so here are some hints based on Perl regex.
What is the ~ at the beginning? Perhaps you mean ^, i.e. the beginning of the string.
(.+) matches any character repeated one or more times; it does not specifically match a + sign, which has to be written as \+.
This is one way to write a (Perl) regex matching URLs that start with the string /eng/ and contain a + sign:
^\/eng\/.*\+.*
I hope this helps.
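The difference between the quantifier + and the literal \+ can be seen in a short test. The sketch uses Python's re module, whose syntax matches Perl's for these constructs; it says nothing about how UrlRewriter.net itself processes the incoming URL.

```python
import re

# "+" as a quantifier: one or more of any character after /eng/.
ANY_UNDER_ENG = re.compile(r"^/eng/(.+)")
# "\+" as a literal: the path must contain an actual plus sign.
WITH_PLUS = re.compile(r"^\/eng\/.*\+.*")

for path in ("/eng/news/2005+01+01.htm", "/eng/news/2005-01-01.htm"):
    print(
        path,
        "any:", bool(ANY_UNDER_ENG.match(path)),
        "with plus:", bool(WITH_PLUS.match(path)),
    )
```

In plain regex terms, (.+) accepts both paths, while the \+ version accepts only the one that contains a plus.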