New at this so thanks in advance for the help.
I'm looking to write a regex that matches the end of the string but not the beginning. In some cases the string is only one character long.
Here are the sample strings. I'm trying to match only the parts shown; otherwise there should be no match.
/en-ca/brand/atf-type-f/ # should match /brand/atf-type-f/
/ # no match
/en-ca # no match
/en-ca/ # no match
/es-xl # no match
/en-gb # no match
/ru-kz/ # no match
/knowledge-centre/sds # should match /knowledge-centre/sds
/en-us/brand/purity-fg # should match /brand/purity-fg
The regex engine I'm using is Google Analytics, and I'm looking to output the Page Path without the country ID and the language ID.
Figured this out.
Using the Advanced Filter within GA I:
1) Used regex with ^(/..-..)?(/)?(.*)
2) Used Output To -> Constructor to put together the groups I wanted. Each () in the regex is numbered in the GA output constructor, so $A1 picks up the first group and so on. Returning just $A3 therefore gave me the path. I had to add the / back in at the beginning, so the output statement became /$A3
Hope this helps someone else.
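For anyone who wants to check the pattern outside GA, here is a minimal Python sketch of the same idea. It assumes Python's re module behaves like GA's regex engine for this simple pattern, and the helper name strip_locale is made up for illustration. Note that locale-only paths such as /en-ca come out as a bare / rather than as "no match".

```python
import re

# Same pattern as in the GA filter: optional "/xx-yy", optional "/", then the rest.
LOCALE_PREFIX = re.compile(r"^(/..-..)?(/)?(.*)")

def strip_locale(path):
    """Hypothetical helper: mimic the GA constructor output "/$A3"."""
    match = LOCALE_PREFIX.match(path)  # always matches, every part is optional
    return "/" + match.group(3)

for sample in ["/en-ca/brand/atf-type-f/", "/knowledge-centre/sds", "/en-ca", "/"]:
    print(sample, "->", strip_locale(sample))
```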
Related
I'm trying to create a generic regex pattern for a crawler to avoid so-called "crawler traps" (links that just add URL parameters and refer to the exact same page, which results in tons of useless data). A lot of the time, those links just add the same part to a URL over and over again. Here is an example from a log file:
http://examplepage.com/cssms/chrome/cssms/chrome/cssms/pages/browse/cssms/pages/misc/...
I can use regular expressions to narrow the scope of the crawler, and I would love to have a pattern that tells the crawler to ignore every URL that has repeating parts. Is that possible with a regex?
Thanks in advance for some tips!
JUST TO CLARIFY:
The crawler traps are not designed to prevent crawling; they are a result of poor web design. All the sites we are crawling have explicitly allowed us to do so!
If you are already looping through a list of URLs, you could add matching as a condition to skip the current iteration:
import re

array = ["/abcd/abcd/abcd/abcd/", "http://examplepage.com/cssms/chrome/cssms/chrome/cssms/pages/browse/cssms/pages/misc/", "http://examplepage/apple/cake/banana/"]
# Flags a URL when a chunk of 4+ characters shows up again at least 3 more times.
pattern1 = re.compile(r'.*?([^\/\&?]{4,})(?:[\/\&\?])(.*?\1){3,}.*')
for url in array:
    if pattern1.match(url):
        print("It matches; skipping this URL")
        continue
    print(url)
Example regex:
.*?([^\/\&?]{4,})(?:[\/\&\?])(.*?\1){3,}.*
([^\/\&?]{4,}) matches and captures a run of four or more characters, none of which is /, & or ?.
(?:[\/\&\?]) matches a single /, & or ?.
(.*?\1){3,} matches anything up to the next occurrence of the text we captured, and does so three or more times.
You can use a backreference in Python/Perl regexes (and possibly others) to catch a pattern which is repeated:
>>> re.search(r"(/.+)\1", "http://examplepage.com/cssms/chrome/cssms/chrome/cssms/pages/browse/cssms/pages/misc/").group(1)
'/cssms/chrome'
\1 refers to the first capture group, so (/.+)\1 means the same sequence repeated twice in a row. The leading / is there to stop the regex from matching the first repeated letter (the tt in http) and to make it catch repetitions in the path instead.
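A minimal sketch of how this could be wrapped up for the crawler follows. It is a variation on the pattern above, not the same regex: it assumes you only care about the path (taken with urllib.parse.urlsplit) and it requires the repeated chunk to be made of whole path segments, so that /ab/abc is not flagged. The helper name repeated_chunk is made up. Like the original, it only catches repetitions that are directly adjacent.

```python
import re
from urllib.parse import urlsplit

# Variation on (/.+)\1: the repeated chunk must consist of whole path segments.
REPEATED_SEGMENTS = re.compile(r"((?:/[^/]+)+)\1(?=/|$)")

def repeated_chunk(url):
    """Hypothetical helper: return the back-to-back repeated path chunk, or None."""
    match = REPEATED_SEGMENTS.search(urlsplit(url).path)
    return match.group(1) if match else None

print(repeated_chunk("http://examplepage.com/cssms/chrome/cssms/chrome/cssms/pages/browse/cssms/pages/misc/"))
print(repeated_chunk("http://examplepage/apple/cake/banana/"))
```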
I'm trying to write a regex for WordPress pretty permalinks.
I have the following URLs and need two matches:
1st: the last segment (between / and /) before get/
2nd: the string which starts with get/
The URLs may look like these:
http://localhost/akasia/yacht-technical-services/yacht-crew/get/gulets/for/sale/
Here I need "yacht-crew" and "get/gulets/for/sale/"
http://localhost/akasia/testimonials/get/motoryachts/for/sale/
Here I need "testimonials" and "get/motoryachts/for/sale/"
http://localhost/akasia/may/be/lots/of/seperator/but/ineed/last/get/ships/for/rent/
Here I need "last" and "get/ships/for/rent/"
I can catch the second part with
(.(get/(.)?))
but I have had no luck with the first part.
I would appreciate it if someone could help.
Regards
Deniz
I suggest the following:
([^\/]+?)\/(get\/.+)
https://regex101.com/r/uN6yH3/1
The idea is to lazily match a run of non-slash characters up to the slash that is followed by "get/", capture that run as the first group, and then grab the rest, starting at "get/", as the second group.
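As a quick check, here is a small Python sketch that runs the same pattern against the three URLs from the question. The escaped slashes are dropped because Python does not need them, and split_permalink is a made-up helper name; in WordPress itself you would use the pattern in a rewrite rule instead.

```python
import re

# Same pattern as suggested above, without the escaped slashes.
PERMALINK = re.compile(r"([^/]+?)/(get/.+)")

def split_permalink(url):
    """Hypothetical helper: return (segment before get/, the get/... tail) or None."""
    match = PERMALINK.search(url)
    return match.groups() if match else None

for url in [
    "http://localhost/akasia/yacht-technical-services/yacht-crew/get/gulets/for/sale/",
    "http://localhost/akasia/testimonials/get/motoryachts/for/sale/",
    "http://localhost/akasia/may/be/lots/of/seperator/but/ineed/last/get/ships/for/rent/",
]:
    print(split_permalink(url))
```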
I am assuming PHP.
$path = parse_url($url,PHP_URL_PATH);
$s = strrpos($path,'/');
$matches[] = substr($path,$s+1);
I'm basically clueless about regex, but I need a regex that will match anything after the / in a URL.
Basically, I'm developing a site for someone, and a page's URL (a local URL, of course) is, say, (http://)localhost/sweettemptations/available-sweets. This page is filled with custom post types (it's a WordPress site), which have URLs of the form (http://)localhost/sweettemptations/sweets/sweet-name.
What I want to do is redirect the URL (http://)localhost/sweettemptations/sweets back to (http://)localhost/sweettemptations/available-sweets which is easy to do, but I also need to redirect any type of sweet back to (http://)localhost/sweettemptations/available-sweets. So say I need to redirect (http://)localhost/sweettemptations/sweets/* back to (http://)localhost/sweettemptations/available-sweets.
If anyone could help by telling me how to write a proper regex statement to match everything after sweets/ in the URL, it would be hugely appreciated.
To do what you ask, you need to use groups. In regular expressions, groups allow you to isolate parts of the whole match.
for example:
input string of: aaaaaaaabbbbcccc
regex: a*(b*)
The parentheses mark a group; in this case it will be group 1, since it is the first in the pattern.
Note: group 0 is implicit and is the complete match.
So the matches in my above case will be:
group 0: aaaaaaaabbbb
group 1: bbbb
In order to achieve what you want with the sweets pattern above, you just need to put a group around the end.
possible solution: /sweets/(.*)
The more precise you are with the pattern before the group, the less likely you are to get a false positive.
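The group examples above can be tried out directly. This is a small sketch in Python (an assumption, since the question does not say which language or tool runs the regex); group(0) and group(1) correspond to group 0 and group 1 as described.

```python
import re

# Worked example from above: a*(b*) against "aaaaaaaabbbbcccc".
demo = re.match(r"a*(b*)", "aaaaaaaabbbbcccc")
print(demo.group(0))  # the complete match
print(demo.group(1))  # the first parenthesised group

# The sweets pattern: group 1 holds whatever follows "/sweets/".
sweet = re.search(r"/sweets/(.*)", "http://localhost/sweettemptations/sweets/sweet-name")
print(sweet.group(1))
```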
If what you really want is to match anything after the last /, you can take another approach:
possible other solution: /([^/]*)$
The pattern above will find a / followed by a string of characters that are NOT another /, running to the end of the input, and keep that string in group 1. The $ anchor is what ties it to the last /; without it the first / in the URL would match. The issue here is that you could match things that do not have sweets in the URL.
Note if you do not mind the / at the beginning then just remove the ( and ) and you do not have to worry about groups.
I like to use http://regexpal.com/ to test my regexes. It marks the different matches in different colors.
Hope this helps.
I may have misunderstood your requirement in my original post.
If you just want to change any string that matches
(http://)localhost/sweettemptations/sweets/*
into the other one you provided (without appending the part matched by your * at the end), I would use a regular expression to match the pattern in the URL and then just blindly replace the whole string with the desired one:
(http://)localhost/sweettemptations/available-sweets
So if you want the URL:
http://localhost/sweettemptations/sweets/somethingmore.html
to turn into:
http://localhost/sweettemptations/available-sweets
and not into:
localhost/sweettemptations/available-sweets/somethingmore.html
Then the solution is simpler, no groups required :).
When doing this, I would make sure you do not hard-code the "localhost" part. Also, I am assuming that (http://) really means an optional http:// in front, as (http://) is not a valid protocol prefix.
So if that is what you want, then this should match the pattern:
(http://)?[^/]+/sweettemptations/sweets/.*
This regular expression matches an optional http:// followed by any host (be it localhost, an IP address or a host name). You could omit the .* at the end if you want.
If that pattern matches just replace the whole URL with the one you want to redirect to.
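A rough Python sketch of that match-then-replace logic is below. It only illustrates the matching; the actual redirect would live wherever your site handles redirects. The target URL is the one from the question, and redirect_for is a made-up helper name.

```python
import re

# Pattern from the answer above; the target URL is the one given in the question.
SWEETS_URL = re.compile(r"(http://)?[^/]+/sweettemptations/sweets/.*")
TARGET = "http://localhost/sweettemptations/available-sweets"

def redirect_for(url):
    """Hypothetical helper: return the redirect target if the URL matches, else the URL unchanged."""
    return TARGET if SWEETS_URL.match(url) else url

print(redirect_for("http://localhost/sweettemptations/sweets/somethingmore.html"))
print(redirect_for("http://localhost/sweettemptations/available-sweets"))
```

Note that the pattern requires the trailing slash after sweets, so a bare .../sweets URL is not caught by it and would need its own rule.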
Use this regular expression: (?<=://).+
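To see what the lookbehind does, here is a short Python sketch (Python is an assumption; any engine with lookbehind support should behave the same way for this pattern). The match starts right after :// and runs to the end of the line, so it returns the host plus the path, not only the part after sweets/.

```python
import re

url = "http://localhost/sweettemptations/sweets/sweet-name"
# (?<=://) is a lookbehind: the match starts right after "://" without including it.
match = re.search(r"(?<=://).+", url)
print(match.group(0))
```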
Can someone assist in creating a Regex for the following situation:
I have about 2000 records in which I need to do a search/replace on a known item in each record that looks like this:
<li>View Product Information</li>
The FILEPATH and FILE are variable, but the surrounding HTML is always the same. Can someone assist with what kind of Regex I would substitute for the "FILEPATH/FILE" part of the search?
You can match the constant parts and use grouping to put them back:
(<li>View Product Information</li>)
Then replace the string with $1your_replacement$2, where $1 is the first matching group and $2 the second (in Python, for instance, you would call Match.group(1) and Match.group(2)).
You would have to escape \ chars if you're using Java instead.
I've just recently started learning regex, so I'm not sure yet about a couple of aspects of the whole thing.
Right now my web page reads in the URL, breaks it up into parts and only uses certain parts for processing:
E.g. 1) http://mycontoso.com/products/luggage/selloBag
E.g. 2) http://mycontoso.com/products/luggage/selloBag.sf404.aspx
For some reason Sitefinity is giving us both possibilities, which is fine, but what I need from this is only the actual product details as in "luggage/selloBag"
My current regex is: "(.*)(map-search)(\/)(.*)(\.sf404\.aspx)". I combine this with a replace statement and extract the contents of group 4 (or $4), which is fine, but it doesn't work for example 2.
So the question is: Is it possible to match 2 possibilities with regular expressions where a part of a string might or might not be there and then still reference a group whose value you actually want to use?
RFC-3986 is the authority regarding URIs. Appendix B provides this regex to break one down into its components:
re_3986 = r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
# Where:
# scheme = $2
# authority = $4
# path = $5
# query = $7
# fragment = $9
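As a small illustration, the Appendix B regex can be applied to one of the URLs from the question like this. The helper name split_uri and the dictionary layout are made up for the sketch; the group numbers are the ones listed above.

```python
import re

re_3986 = r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"

def split_uri(uri):
    """Hypothetical helper: map the Appendix B group numbers to names."""
    m = re.match(re_3986, uri)  # every component is optional, so this always matches
    return {"scheme": m.group(2), "authority": m.group(4), "path": m.group(5),
            "query": m.group(7), "fragment": m.group(9)}

parts = split_uri("http://mycontoso.com/products/luggage/selloBag.sf404.aspx")
print(parts["path"])
```

Once the path is isolated, stripping the /products/ prefix and the optional .sf404.aspx suffix becomes a much smaller problem.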
Here is an enhanced (and commented) regex (in Python syntax) which utilizes named capture groups:
import re

re_3986_enhanced = re.compile(r"""
    # Parse and capture RFC-3986 Generic URI components.
    ^                                    # anchor to beginning of string
    (?:  (?P<scheme>    [^:/?#\s]+): )?  # capture optional scheme
    (?://(?P<authority>  [^/?#\s]*)  )?  # capture optional authority
         (?P<path>        [^?#\s]*)      # capture required path
    (?:\?(?P<query>        [^#\s]*)  )?  # capture optional query
    (?:\#(?P<fragment>      [^\s]*)  )?  # capture optional fragment
    $                                    # anchor to end of string
    """, re.MULTILINE | re.VERBOSE)
For more information regarding the picking apart and validation of a URI according to RFC-3986, you may want to take a look at an article I've been working on: Regular Expression URI Validation
Depends on your regex implementation, but most support a syntax like
(\.sf404\.aspx|)
Assuming that's your group 4 (i.e. zero-indexed groups). The | lists two alternatives, one of which is the empty string.
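Here is a hedged Python sketch of the optional-suffix idea applied to the two example URLs. Two things are assumptions on my part: "products" stands in for "map-search" so that the pattern fits the examples, and the product group is made lazy with a $ anchor at the end. With a greedy (.*) and no anchor, the empty alternative would always win and the suffix would stay inside the captured group. The helper name product_details is made up.

```python
import re

# Assumption: "products" replaces "map-search" so the pattern fits the example URLs.
# (.*?) is lazy and the pattern is anchored with $ so the optional suffix is split off.
PRODUCT = re.compile(r"(.*)(products)(/)(.*?)(\.sf404\.aspx|)$")

def product_details(url):
    """Hypothetical helper: return group 4, with or without the .sf404.aspx suffix."""
    match = PRODUCT.match(url)
    return match.group(4) if match else None

print(product_details("http://mycontoso.com/products/luggage/selloBag"))
print(product_details("http://mycontoso.com/products/luggage/selloBag.sf404.aspx"))
```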
You don't say if you're doing this in JavaScript, but if you are, the parseUri lib written by Steven Levithan does a pretty damn good job at parsing URLs. You can get it from various places, including here (click on the "Source Code" tab) and here.