I have to filter paths they can look like:
some_path//rest
some_path/rest
some_path\\\\rest
some_path\rest
I need to replace some_path//rest with FILTER
some_path/rest// I want FILTER/
some_path/rest\\ I want FILTER\
some_path/rest I want FILTER
some_path/rest/ I want FILTER/
some_path/rest\ I want FILTER\
I am using some_path[\\\\\\\/]+rest to match the middle, if I use it at the end it consumes all the path separators.
I do not know in advance whether the separators will be / or \\ it can mix in a single path.
some_path/rest\some_more//and/more\\\\more
Consider using back references. Keep in mind that with Python, you will be seeing the \ escaped with a second \ in the output. This example seems to do what you are looking for:
>>> for test in ('some_path/rest//','some_path/rest\\','some_path/rest','some_path/rest/','some_path/rest\\'):
... re.sub(r"some_path[\/]+rest([\/]?)\1*", r"FILTER\1", test)
...
'FILTER/'
'FILTER\\'
'FILTER'
'FILTER/'
'FILTER\\'
>>>
The \1 is a back reference to the previous () group. In the search, it is searching for any number of matches of that item. In the replace, it is just adding in the one item.
You can do it with a simple (without back reference) replace term by using a look ahead.
Use this regex to search:
some_path[\\\\/]+rest(?:([\\\\/])(?=\1))?
and replace the match with just 'FILTER':
re.sub(r"some_path[\\\\/]+rest(?:([\\\\/])(?=\1))?", 'FILTER', path)
This works by matching (ie consuming) the trailing slash only when it is doubled.
To allow for when there's no trailing slashes, the match for trailing slashes is made optional by wrapping in (?:...)? (which is non-capturing, so the back reference is \1, not \2 which is harder to read).
Note that you don't need quite so many backslashes in your regex.
Here's some test code:
for path in ('some_path/rest//','some_path/rest\\','some_path/rest','some_path/rest/','some_path/rest\\'):
print path + ' -> ' + re.sub(r"some_path[\\\\/]+rest(?:([\\\\/])(?=\1))?", 'FILTER', path)
Output:
some_path/rest// -> FILTER/
some_path/rest\ -> FILTER\
some_path/rest -> FILTER
some_path/rest/ -> FILTER/
some_path/rest\ -> FILTER\
Related
I am using NextAuth for Next.js for session management. In addition, I am using the middleware.js to protect my routes from unauthenticated users.
According to https://nextjs.org/docs/advanced-features/middleware#matcher,
if we want to exclude a path, we do something like
export const config = {
matcher: [
/*
* Match all request paths except for the ones starting with:
* - api (API routes)
* - static (static files)
* - favicon.ico (favicon file)
*/
'/((?!api|static|favicon.ico).*)',
],
}
In this example, we exclude /api, /static,/favicon.icon. However, I want to exclude all path except the home page, "/". What is the regular expression for that? I am tried '/(*)'. It doesn't seem to work.
The regular expression which matches everything but a specific one-character string / is constructed as follows:
we need to match the empty string: empty regex.
we need to match all strings two characters long or longer: ..+
we need to match one-character strings which are not that character: [^/].
Combining these three together with the | branching operator: "|..+|[^/]".
If we are using a regular expression tool that performs substring searching rather than a full match, we need to use its anchoring features; perhaps it supports the ^ and $ notation for that: "^(|..+|[^/])$".
I'm guessing that you might not want to match empty strings; in which case, revise your requirement and drop that branch from the expression.
Suppose we wanted to match all strings, except for a specific fixed word like abc. Without negation support in the regex language, we can use a generalization of the above trick.
Match the empty string, like before, if desired.
Match all one-character strings: .
Match all two-character strings: ..
Match all strings longer than three characters: ....+
Those simple cases taken care of, we focus on matching just those three-symbol strings that are not abc. How can we do that?
Match all three-character strings that don't start with a: [^a]...
Match all three-character strings that don't have a b in the middle: .[^b].
Match all three-character strings that don't end in c: ..[^c].
Combine it all together: "|.|..|....+|[^a]..|.[^b].|..[^c]".
For longer words, we might want to take advantage of the {m,n} notation, if available, to express "match from zero to nine characters" and "match eleven or more characters".
I will need to exclude the signin page and register page as well. Because, it will cause an infinite loop and an error, if you don't exclude signin page. For register page, you won't be able to register if you are redirected to the signin page.
So the "/", "/auth/signin", and "/auth/register" will be excluded. Here is what I needed:
export const config = {
matcher: [
'/((?!auth).*)(.+)'
]
}
I have a file with many different file path locations. Some of them have multiple directory depth and some don't. What I need to do is prepend a directory /WEB_ROOT/ to all file path locations in the file.
For example
index.jsp -> /WEB_ROOT/index.jsp
/instructor/assigned_appts.jsp -> /WEB_ROOT/instructor/assigned_appts.jsp
I have tried this one ([\/_]?[A-Za-z]*).jsp to try and capture the optional _ and / values but this doesn't match properly.
/instructor/assigned_appts.jsp only matches _appts.jsp
I have tried this as well ([\/_]?[A-Za-z])*.jsp which properly matches all expected file paths but when I replace I only get the last letter instead of the full group
So a replace with /WEB_ROOT/$1.jsp gives the following
index.jsp -> /WEB_ROOT/x.jsp
/instructor/assigned_appts.jsp -> /WEB_ROOT/s.jsp
Help please!
You can match the whole line, and as [\/_]? is optional, make sure that you match at least a single char A-Za-z before the .jsp
If you want to replace with group 1 like /WEB_ROOT/$1 you can also capture the .jsp
(.*[A-Za-z]\.jsp)
Note sure if supported in eclipse, but you might also just get the whole match and use $0 instead of group 1
.*[A-Za-z]\.jsp
If .jsp is at the end of the string, you can append an anchor .*[A-Za-z]\.jsp$
I'm using the following regex to find URLs in a text file:
/http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+/
It outputs the following:
http://rda.ucar.edu/datasets/ds117.0/.
http://rda.ucar.edu/datasets/ds111.1/.
http://www.discover-earth.org/index.html).
http://community.eosdis.nasa.gov/measures/).
Ideally they would print out this:
http://rda.ucar.edu/datasets/ds117.0/
http://rda.ucar.edu/datasets/ds111.1/
http://www.discover-earth.org/index.html
http://community.eosdis.nasa.gov/measures/
Any ideas on how I should tweak my regex?
Thank you in advance!
UPDATE - Example of the text would be:
this is a test http://rda.ucar.edu/datasets/ds117.0/. and I want this to be copied over http://rda.ucar.edu/datasets/ds111.1/. http://www.discover-earth.org/index.html). http://community.eosdis.nasa.gov/measures/).
This will trim your output containing trail characters, ) .
import re
regx= re.compile(r'(?m)[\.\)]+$')
print(regx.sub('', your_output))
And this regex seems workable to extract URL from your original sample text.
https?:[\S]*\/(?:\w+(?:\.\w+)?)?
Demo,,, ( edited from https?:[\S]*\/)
Python script may be something like this
ss=""" this is a test http://rda.ucar.edu/datasets/ds117.0/. and I want this to be copied over http://rda.ucar.edu/datasets/ds111.1/. http://www.discover-earth.org/index.html). http://community.eosdis.nasa.gov/measures/). """
regx= re.compile(r'https?:[\S]*\/(?:\w+(?:\.\w+)?)?')
for m in regx.findall(ss):
print(m)
So for the urls you have here:
https://regex101.com/r/uSlkcQ/4
Pattern explanation:
Protocols (e.g. https://)
^[A-Za-z]{3,9}:(?://)
Look for recurring .[-;:&=+\$,\w]+-class (www.sub.domain.com)
(?:[\-;:&=\+\$,\w]+\.?)+`
Look for recurring /[\-;:&=\+\$,\w\.]+ (/some.path/to/somewhere)
(?:\/[\-;:&=\+\$,\w\.]+)+
Now, for your special case: ensure that the last character is not a dot or a parenthesis, using negative lookahead
(?!\.|\)).
The full pattern is then
^[A-Za-z]{3,9}:(?://)(?:[\-;:&=\+\$,\w]+\.?)+(?:\/[\-;:&=\+\$,\w\.]+)+(?!\.|\)).
There are a few things to improve or change in your existing regex to allow this to work:
http[s]? can be changed to https?. They're identical. No use putting s in its own character class
[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),] You can shorten this entire thing and combine character classes instead of using | between them. This not only improves performance, but also allows you to combine certain ranges into existing character class tokens. Simplifying this, we get [a-zA-Z0-9$-_#.&+!*\(\),]
We can go one step further: a-zA-Z0-9_ is the same as \w. So we can replace those in the character class to get [\w$-#.&+!*\(\),]
In the original regex we have $-_. This creates a range so it actually inclues everything between $ and _ on the ASCII table. This will cause unwanted characters to be matched: $%&'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_. There are a few options to fix this:
[-\w$#.&+!*\(\),] Place - at the start of the character class
[\w$#.&+!*\(\),-] Place - at the end of the character class
[\w$\-#.&+!*\(\),] Escape - such that you have \- instead
You don't need to escape ( and ) in the character class: [\w$#.&+!*(),-]
[0-9a-fA-F][0-9a-fA-F] You don't need to specify [0-9a-fA-F] twice. Just use a quantifier like so: [0-9a-fA-F]{2}
(?:%[0-9a-fA-F][0-9a-fA-F]) The non-capture group isn't actually needed here, so we can drop it (it adds another step that the regex engine needs to perform, which is unnecessary)
So the result of just simplifying your existing regex is the following:
https?://(?:[$\w#.&+!*(),-]|%[0-9a-fA-F]{2})+
Now you'll notice it doesn't match / so we need to add that to the character class. Your regex was matching this originally because it has an improper range $-_.
https?://(?:[$\w#.&+!*(),/-]|%[0-9a-fA-F]{2})+
Unfortunately, even with this change, it'll still match ). at the end. That's because your regex isn't told to stop matching after /. Even implementing this will now cause it to not match file names like index.html. So a better solution is needed. If you give me a couple of days, I'm working on a fully functional RFC-compliant regex that matches URLs. I figured, in the meantime, I would at least explain why your regex isn't working as you'd expect it to.
Thanks all for the responses. A coworker ended up helping me with it. Here is the solution:
des_links = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', des)
for i in des_links:
tmps = "/".join(i.split('/')[0:-1])
print(tmps)
i have string, which has the value like
"urls":[
{
"url":"https:\/\/t.co\/OjiDUThEvK",
"expanded_url":"http:(escape sequence slash)/(escape sequence slash)/fb.me\/7Wnh0hMLL",
"display_url":"fb.me(escape sequence slash)/7Wnh0hMLL",
"indices":[48,71]}],
"user_mentions":[],
"symbols":[]
}
]
i need to capture only "expanded url" i tried the following regex:
"expanded_url"\:\"http\:\\\/\\\/(.*?)\"
this gave a result :
"fb.me(escape sequence slash)/7Wnh0hMLL"
but i want to exclude the escape sequence slash in the URL, is it possible to achieve the same, kindly let me know the changes to me made to the regex
I'm not 100% sure if this is what you're after. Can you post the raw input without the "(escape sequence slash)" part I'm assuming that this is actually / in the text you're matching against.
match:
\"expanded_url\":\"http:\\\/\\\/([^\\]*)\\\/([^\\"]*)\"
replace with:
$1/$2
I'm still learning regular expressions and I seem to be stuck.
I wanted to write a reg exp that matches URL paths like these that contain "bulk":
/bulk-category_one/product
/another-category/bulk-product
to only get the product pages, but not the category pages like:
/bulk-category_one/
/another-category/
So I came up with:
[/].*(bulk).*[/].+|[/].*[/].*(bulk).*
But there's pagination, so when I put the reg exp in Google Analytics, I'm finding stuff like:
/bulk-category/_/showAll/1/
All of them have
/_/
and I don't want any URL paths that contain
/_/
and I can't figure out how to exclude them.
I would go about it this way:
/[^/\s]*bulk[^/]*/[^/\s]+(?!/)|/[^/\s]+/[^/]*bulk[^/\s]*(?!/)
first part:
/ - match the slash
[^/\s]* - match everything that's not a slash and not a whitespace
bulk - match bulk literally
[^/]* - match everything that's not a slash
/ - match the slash
[^/\s]+ - match everything that's not a slash and not a whitespace
(?!/) - ensure there is not a slash afterwards (i.e. url has two parts)
The second part is more of the same, but this time 'bulk' is expected in the second part of the url not the first one.
If you need the word 'product' specifically in the second part of the url one more alternative would be required:
/[^/\s]*bulk[^/]*/[^/\s]*product[^/\s]*(?!/)|/[^/\s]+/[^/]*bulk[^/\s]*product[^/\s]*(?!/)|/[^/\s]+/[^/]*product[^/\s]*bulk[^/\s]*(?!/)
If I apply that simple regex to a file FILE
egrep ".*bulk.*product" FILE
which contains your examples above, it only matches the 2 lines with bulk and product. We can, additionally, exclude '/_/':
egrep ".*bulk.*product" FILE | egrep -v "/_/"
Two invocations are often much more easy to define and to understand, than a big one-fits-all.