I'm trying to add something along the lines of this regex logic.
For Input:
reading/
reading/123
reading/456
reading/789
I want the regex to match only
reading/123
reading/456
reading/789
Excluding reading/.
I've tried reading\/* but that doesn't work because it also matches reading/.
You must escape your backslashes in Hugo: \\/\\d+ (the \\d+ part requires at least one digit after the slash).
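For anyone who wants to check the logic outside Hugo, here is a minimal Python sketch of the same idea. The list of paths is just the sample input from the question, and a Python raw string is used, so the backslash doubling that Hugo needs does not apply here.

```python
import re

# Sample input copied from the question.
paths = ["reading/", "reading/123", "reading/456", "reading/789"]

# "\/*" means "zero or more slashes", so the attempted pattern accepts "reading/".
attempted = re.compile(r"reading\/*")
attempted_accepts_bare = bool(attempted.fullmatch("reading/"))

# "\d+" demands at least one digit after the slash, so "reading/" alone fails.
pattern = re.compile(r"reading/\d+")
matched = [p for p in paths if pattern.fullmatch(p)]

print(attempted_accepts_bare)
print(matched)
```
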
I am trying to figure out how to pull the following string out of a folder path... I want to pull COMPANY_NAME from the below folder path. Is there a way to use REGEX to pull string between 2nd and 3rd backslash?
Example:
\10.20.3.23\S$\COMPANY_NAME\Main_5e08a942f39a430db0b081736a3f1881\C_VOL-b002.spf
Try this: (?(DEFINE)(?<urlPart>[^\\\s]+))\\\\(?&urlPart)\\(?&urlPart)\\\K(?&urlPart) (demo)
It will match the desired part of the URL you are after. Things to note:
The url does not need to start at the beginning of the string (if you require this add ^ after the define group)
It will match many urls in the same string
It will match even if there is no file name
White space will invalidate the match
See the demo for details
If you were wondering, it uses subroutine definitions to reuse parts of the regex.
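Note that Python's built-in re module does not support (?(DEFINE)...), (?&name) subroutine calls, or \K, so the pattern above won't compile there. A rough equivalent, assuming the path really starts with two backslashes (a UNC path, which is what the pattern above expects), is to build the repeated part as a string and use a capture group instead of \K. The variable names here are my own.

```python
import re

# Assumed input: the path from the question written as a UNC path
# with two leading backslashes.
path = r"\\10.20.3.23\S$\COMPANY_NAME\Main_5e08a942f39a430db0b081736a3f1881\C_VOL-b002.spf"

# One path component: one or more characters that are neither a
# backslash nor whitespace (same as urlPart in the PCRE version).
part = r"[^\\\s]+"

# \\\\ = two literal backslashes, then host, share, and the captured folder.
pattern = re.compile(r"\\\\" + part + r"\\" + part + r"\\(" + part + r")")

m = pattern.search(path)
company = m.group(1) if m else None
print(company)
```

Reusing the `part` string plays the role that the subroutine definition plays in PCRE, and group 1 stands in for what \K would have left as the overall match.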
I'm using the following regex to find URLs in a text file:
/http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+/
It outputs the following:
http://rda.ucar.edu/datasets/ds117.0/.
http://rda.ucar.edu/datasets/ds111.1/.
http://www.discover-earth.org/index.html).
http://community.eosdis.nasa.gov/measures/).
Ideally they would print out this:
http://rda.ucar.edu/datasets/ds117.0/
http://rda.ucar.edu/datasets/ds111.1/
http://www.discover-earth.org/index.html
http://community.eosdis.nasa.gov/measures/
Any ideas on how I should tweak my regex?
Thank you in advance!
UPDATE - Example of the text would be:
this is a test http://rda.ucar.edu/datasets/ds117.0/. and I want this to be copied over http://rda.ucar.edu/datasets/ds111.1/. http://www.discover-earth.org/index.html). http://community.eosdis.nasa.gov/measures/).
This trims the trailing ) and . characters from each line of your output:
import re
regx = re.compile(r'(?m)[\.\)]+$')
print(regx.sub('', your_output))
This regex also seems to work for extracting the URLs directly from your original sample text:
https?:[\S]*\/(?:\w+(?:\.\w+)?)?
(Edited from https?:[\S]*\/)
The Python script might look something like this:
import re
ss = """ this is a test http://rda.ucar.edu/datasets/ds117.0/. and I want this to be copied over http://rda.ucar.edu/datasets/ds111.1/. http://www.discover-earth.org/index.html). http://community.eosdis.nasa.gov/measures/). """
regx = re.compile(r'https?:[\S]*\/(?:\w+(?:\.\w+)?)?')
for m in regx.findall(ss):
    print(m)
So for the urls you have here:
https://regex101.com/r/uSlkcQ/4
Pattern explanation:
Protocols (e.g. https://)
^[A-Za-z]{3,9}:(?://)
Look for recurring [-;:&=+$,\w]+ segments, each optionally followed by a dot (www.sub.domain.com)
(?:[\-;:&=\+\$,\w]+\.?)+
Look for recurring /[\-;:&=\+\$,\w\.]+ (/some.path/to/somewhere)
(?:\/[\-;:&=\+\$,\w\.]+)+
Now, for your special case: ensure that the last character is not a dot or a parenthesis, using negative lookahead
(?!\.|\)).
The full pattern is then
^[A-Za-z]{3,9}:(?://)(?:[\-;:&=\+\$,\w]+\.?)+(?:\/[\-;:&=\+\$,\w\.]+)+(?!\.|\)).
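As a quick sanity check, here is a small Python sketch that runs the full pattern above against the four URLs from the question. It assumes each URL sits at the start of its own string, because the pattern begins with ^ and so will not find URLs in the middle of a sentence. The variable names are mine.

```python
import re

# The full pattern from above, split across lines for readability only.
full_pattern = re.compile(
    r"^[A-Za-z]{3,9}:(?://)"
    r"(?:[\-;:&=\+\$,\w]+\.?)+"
    r"(?:\/[\-;:&=\+\$,\w\.]+)+"
    r"(?!\.|\))."
)

# The raw outputs reported in the question, one URL per string.
lines = [
    "http://rda.ucar.edu/datasets/ds117.0/.",
    "http://rda.ucar.edu/datasets/ds111.1/.",
    "http://www.discover-earth.org/index.html).",
    "http://community.eosdis.nasa.gov/measures/).",
]

cleaned = []
for line in lines:
    m = full_pattern.match(line)
    if m:
        cleaned.append(m.group(0))

for url in cleaned:
    print(url)
```

The trailing `(?!\.|\)).` consumes one final character that is neither a dot nor a closing parenthesis, so the engine backtracks until the match ends on an acceptable character.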
There are a few things to improve or change in your existing regex to allow this to work:
http[s]? can be changed to https?. They're identical. No use putting s in its own character class
[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),] You can shorten this entire thing and combine character classes instead of using | between them. This not only improves performance, but also allows you to combine certain ranges into existing character class tokens. Simplifying this, we get [a-zA-Z0-9$-_#.&+!*\(\),]
We can go one step further: a-zA-Z0-9_ is the same as \w. So we can replace those in the character class to get [\w$-#.&+!*\(\),] (the leftover $- still starts a range, which the next point deals with).
In the original regex we have $-_. This creates a range, so it actually includes everything between $ and _ on the ASCII table. This will cause unwanted characters to be matched: $%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_. There are a few options to fix this:
[-\w$#.&+!*\(\),] Place - at the start of the character class
[\w$#.&+!*\(\),-] Place - at the end of the character class
[\w$\-#.&+!*\(\),] Escape - such that you have \- instead
You don't need to escape ( and ) in the character class: [\w$#.&+!*(),-]
[0-9a-fA-F][0-9a-fA-F] You don't need to specify [0-9a-fA-F] twice. Just use a quantifier like so: [0-9a-fA-F]{2}
(?:%[0-9a-fA-F][0-9a-fA-F]) The non-capture group isn't actually needed here, so we can drop it (it adds another step that the regex engine needs to perform, which is unnecessary)
So the result of just simplifying your existing regex is the following:
https?://(?:[$\w#.&+!*(),-]|%[0-9a-fA-F]{2})+
Now you'll notice it doesn't match / so we need to add that to the character class. Your regex was matching this originally because it has an improper range $-_.
https?://(?:[$\w#.&+!*(),/-]|%[0-9a-fA-F]{2})+
Unfortunately, even with this change, it'll still match ). at the end. That's because your regex isn't told to stop matching after /. Even implementing this will now cause it to not match file names like index.html. So a better solution is needed. If you give me a couple of days, I'm working on a fully functional RFC-compliant regex that matches URLs. I figured, in the meantime, I would at least explain why your regex isn't working as you'd expect it to.
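To make the two points above concrete, here is a short Python sketch. The first half shows that $-_ inside a character class behaves as a range, and the second half shows the simplified regex still swallowing the trailing ). characters. The probe string and sample sentence are made up for illustration.

```python
import re

# "$-_" inside a character class is a range from "$" (0x24) to "_" (0x5F).
accidental_range = re.compile(r"[$-_]")
# Moving "-" to the end of the class makes it a literal hyphen.
literal_hyphen = re.compile(r"[$_-]")

probe = "/:;<=>?@"  # characters that sit between "$" and "_" in ASCII
via_range = [c for c in probe if accidental_range.fullmatch(c)]
via_literal = [c for c in probe if literal_hyphen.fullmatch(c)]

# The simplified regex with "/" added back into the class.
simplified = re.compile(r"https?://(?:[$\w#.&+!*(),/-]|%[0-9a-fA-F]{2})+")
sample = "see http://www.discover-earth.org/index.html). for details"
found = simplified.findall(sample)

print(via_range)
print(via_literal)
print(found)
```

Because `)` and `.` are both legitimate members of the character class, the greedy `+` has no reason to stop before them.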
Thanks all for the responses. A coworker ended up helping me with it. Here is the solution:
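import re
# des is assumed to hold the text being searched
des_links = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', des)
for i in des_links:
    # drop the last "/"-separated piece, along with the slash before it
    tmps = "/".join(i.split('/')[0:-1])
    print(tmps)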
I have the following sample urls
/alfa/wp-includes/js/jquery/jquery.js
/beta/wp-content/plugins/app/js/media.js?parameter=value
/beta/wp-admin/network
/beta/wp-content/themes/journal/data.php
I'm using the following regex to match all paths, excluding parameters:
^/(alfa|beta)((/wp-(content|admin|includes))([^?\s]*)).*
This works well, but how can I change the regex to exclude any paths that include .php? It needs to return the first 3 paths but not the last.
You can use the PCRE verbs skip and fail to skip over matches of expressions. You can read more about them here, http://www.rexegg.com/backtracking-control-verbs.html#skipfail. For your current example you can use:
.*\.php$(*SKIP)(*FAIL)|^/(alfa|beta)((/wp-(content|admin|includes))([^?\s]*)).*
which would skip files that end in .php.
Demo: https://regex101.com/r/YH3n0x/1/
The .*\.php$ looks for anything until a .php at the end of the string/line.
The solution I was looking for is the following, thanks @chris85:
.*\.php(*SKIP)(*FAIL)|^/(alfa|beta)((/wp-(content|admin|includes))([^?\s]*)).*
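In case someone lands here using Python: the built-in re module does not support the (*SKIP) and (*FAIL) verbs. A rough alternative, not the same mechanism, is a negative lookahead at the start of the pattern that rejects any line containing .php. This is only a sketch against the four sample URLs from the question.

```python
import re

urls = [
    "/alfa/wp-includes/js/jquery/jquery.js",
    "/beta/wp-content/plugins/app/js/media.js?parameter=value",
    "/beta/wp-admin/network",
    "/beta/wp-content/themes/journal/data.php",
]

# (?!.*\.php) fails the match up front if ".php" appears anywhere in the line;
# the rest is the original pattern from the question, unchanged.
pattern = re.compile(
    r"^(?!.*\.php)/(alfa|beta)((/wp-(content|admin|includes))([^?\s]*)).*"
)

kept = [u for u in urls if pattern.match(u)]
# Group 2 is the path after the site prefix, without the query string.
paths = [pattern.match(u).group(2) for u in kept]

print(kept)
print(paths)
```
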
I need to use Regex to check for URLs that don't contain 'folder'. For example, this URL should not match:
subdomain.domain.co.uk/section/folder/page
I'm using:
subdomain.domain.co.uk\/.*\/(?!folder\/).*
but it's still finding 'folder'. Any ideas?
Try this regex:
^subdomain.domain.co.uk\/((?!folder).)*$
Demo here:
Regex101
First off, you need slashes around "folder", otherwise you'll also exclude "/anotherfolder/" and "/folder.jpg" etc.
Put the negative look ahead before the ".*" and add ".*" before "folder":
subdomain.domain.co.uk\/(?!.*\/folder\/).*
This won't match a URL with "/folder/" anywhere in it.
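Here is a small Python sketch comparing the two suggested patterns. The second and third test URLs are made up to show the difference described above, and I have escaped the dots in the host name, which the patterns as posted do not do.

```python
import re

# First answer: reject "folder" anywhere after the host (no slashes required).
tempered = re.compile(r"^subdomain\.domain\.co\.uk/((?!folder).)*$")
# Second answer: reject only a whole "/folder/" path segment.
lookahead = re.compile(r"subdomain\.domain\.co\.uk/(?!.*/folder/).*")

urls = [
    "subdomain.domain.co.uk/section/folder/page",
    "subdomain.domain.co.uk/section/anotherfolder/page",  # hypothetical
    "subdomain.domain.co.uk/section/page",                # hypothetical
]

# Maps each URL to (matches first pattern, matches second pattern).
results = {u: (bool(tempered.match(u)), bool(lookahead.match(u))) for u in urls}

for url, outcome in results.items():
    print(url, outcome)
```

The middle URL is where the two differ: the first pattern rejects it because "anotherfolder" contains "folder", while the second accepts it because there is no "/folder/" segment.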
That's my current regex:
url(r'^([a-zA-Z0-9/_-]+):p:(?P<sku>[a-zA-Z0-9_-]+)/$', 'product_display', name='product_display'),
url(r'^(?P<path>[a-zA-Z0-9/_-]+)$', 'collection_display', name='collection_display'),
My problem is this: I want to be able to match the product_display regex without using :p: in it. I can set it apart from the collection_display regex by putting .html at the end, but that doesn't fix the underlying problem: without the ":p:", the URI "some-collection/other/other/sku.html" would be matched all the way up to ".html", disregarding the sku. How can I do this without using ":p:" to end the collection part of the regex? Anything will help.
Thanks
It looks like your sku can't contain slashes, so I would recommend using "/" as your delimiter. Then the ".html" trick can be used; it turns out that your collection_display regex doesn't match the dot, but to make absolutely sure, you can use a negative look-behind:
url(r'^([a-zA-Z0-9/_-]+)/(?P<sku>[a-zA-Z0-9_-]+)\.html$', 'product_display', name='product_display'),
url(r'^(?P<path>[a-zA-Z0-9/_-]+)(?<!\.html)$', 'collection_display', name='collection_display'),
Alternatively, always end your collection_display urls with a slash and product_display with ".html" (or vice versa).
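If you want to try the two patterns outside Django, here is a minimal sketch using plain re. The resolve function is a hypothetical stand-in for the URL dispatcher that simply tries the patterns in the same order as the urlconf; it is not how Django itself is implemented.

```python
import re

product = re.compile(r"^([a-zA-Z0-9/_-]+)/(?P<sku>[a-zA-Z0-9_-]+)\.html$")
collection = re.compile(r"^(?P<path>[a-zA-Z0-9/_-]+)(?<!\.html)$")


def resolve(uri):
    """Return (view_name, captured_value), trying product_display first."""
    m = product.match(uri)
    if m:
        return ("product_display", m.group("sku"))
    m = collection.match(uri)
    if m:
        return ("collection_display", m.group("path"))
    return None


print(resolve("some-collection/other/other/sku.html"))
print(resolve("some-collection/other"))
```

The first group in the product pattern is greedy, but it has to give back the last path segment so that `/`, the sku, and `.html` can still match, which is why the sku comes out on its own.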