AEM - How to restrict a template from showing in a certain path - regex

I wonder if someone has achieved what I'll post here. In order to allow a template to be created under a certain path, there is a flag allowedPaths that receives a regex.
So, if I want my template "test" to appear only under /content/www/xx/xx/test-templates and child elements, I can do this:
/content/www/.*/.*/test-templates(/.*)?
But what if I want to do the opposite? I want the template "test" to appear in every /content/www/xx/xx/ node and beyond, EXCEPT /content/www/xx/xx/test-templates and its children?
I have tried several ways but no luck so far. Do you have some hint regarding this?
Thanks!

You can always restrict a more generic pattern with a lookahead. Here is an expression that should work for you:
^(?!/content/www/[^/]*/[^/]*/test-templates(?:/|$))/content/www/[^/]*/[^/]*(/.*)
See demo.
^ - matches the start of string
(?!/content/www/[^/]*/[^/]*/test-templates(?:/|$)) - makes sure the next substring is not /content/www/<some_node>/<some_node>/test-templates, followed by the end of string ($) or /
/content/www/[^/]*/[^/]*(/.*) - matches /content/www/<some_node>/<some_node> followed by / and zero or more characters other than a newline (write the group as (/.*)? if the two-level node itself, with nothing after it, should match as well)
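The behaviour of the lookahead can be checked outside AEM. The following is a minimal sketch in Python, assuming that Python's re treats the lookahead and the negated character classes in this pattern the same way as the Java engine AEM runs on. The sample paths and the helper name template_allowed are hypothetical.

```python
import re

# Pattern copied from the answer above.
ALLOWED = re.compile(
    r"^(?!/content/www/[^/]*/[^/]*/test-templates(?:/|$))"
    r"/content/www/[^/]*/[^/]*(/.*)"
)


def template_allowed(path: str) -> bool:
    """Return True if the pattern matches the given path."""
    return ALLOWED.match(path) is not None


# Hypothetical sample paths, not taken from a real repository.
samples = [
    "/content/www/us/en/products",
    "/content/www/us/en/test-templates",
    "/content/www/us/en/test-templates/child",
    "/content/www/us/en/test-templates-old",
]
for p in samples:
    print(p, "->", template_allowed(p))
```

The last sample shows why the lookahead ends with (?:/|$): a sibling node whose name merely starts with test-templates is not excluded.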


How to combine groups in Regex with non capturing groups to have all optional

What I'm trying to achieve: I want to match user entered sentence with my templates and to see which template matches better (as many groups out of all in template as possible).
Regex which I'm building to solve example:
^(\bMyCompany1\b)?(?:.+)?\s(\bestablishes\b)?(?:.+)?\s(\bAnotherCompany\b)?(?:.+)?$
Example sentences:
'MyCompany1 establishes AnotherCompany' - matches all 3 groups. is OK
'MyCompany1 establ AnotherCompany' - matches first and last group. ignores the middle typo. is also OK
'MyCompany1 establishes AnotherCompany ' - space in the end. cannot identify 2 and 3 groups. I don't understand why
'MyCompany1 establishes AnotherCompany' - additional spaces after word 'establishes'. For some reason is not detecting 2nd group anymore
This regex is just an example of one template. I will have one regex (built dynamically) per template, such as 'User1 sent a request to User2' or 'Company1 borrowed to Company2 $111'. My idea is to define each part of the template and to see how many parts I matched. E.g. in my example:
- I expect some company name from the list (MyCompany or MyCompany1), or a non-capturing group to ignore the rest (maybe the user made a typo or is still typing and hasn't finished)
- I expect the groups to appear in the same order
Can you please explain what I'm doing wrong in my Regex? Is it correct to achieve that by using Regex at all?
This covers all your test cases. It is based on 3 lookaheads, each containing an optional non-capturing group that wraps a capturing group for one of the keywords you're looking for.
^(?=(?:.*(\bMyCompany1\b))?)(?=(?:.*?(\bestablishes\b))?)(?=(?:.*(\bAnotherCompany\b))?).*$
You'll get regex explanation at the link below:
Demo
Or, if the order matters:
^(?:.*(\bMyCompany1\b))?(?:.*?(\bestablishes\b))?(?:.*(\bAnotherCompany\b))?.*$
Demo
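To compare templates by how many parts matched, the captured groups can simply be counted. Here is a minimal sketch in Python using the two patterns above unchanged; the helper name matched_parts and the reversed sample sentence are assumptions added for illustration.

```python
import re

# Lookahead version: each keyword is searched independently of the others.
ANY_ORDER = re.compile(
    r"^(?=(?:.*(\bMyCompany1\b))?)"
    r"(?=(?:.*?(\bestablishes\b))?)"
    r"(?=(?:.*(\bAnotherCompany\b))?).*$"
)

# Sequential version: each keyword is searched after the previous match.
IN_ORDER = re.compile(
    r"^(?:.*(\bMyCompany1\b))?"
    r"(?:.*?(\bestablishes\b))?"
    r"(?:.*(\bAnotherCompany\b))?.*$"
)


def matched_parts(pattern: re.Pattern, sentence: str) -> int:
    """Count how many capturing groups took part in the match."""
    m = pattern.match(sentence)
    if m is None:
        return 0
    return sum(g is not None for g in m.groups())


for s in [
    "MyCompany1 establishes AnotherCompany",
    "MyCompany1 establ AnotherCompany",
    "MyCompany1 establishes AnotherCompany ",
    "AnotherCompany establishes MyCompany1",  # keywords in reverse order
]:
    print(repr(s), matched_parts(ANY_ORDER, s), matched_parts(IN_ORDER, s))
```

The reversed sentence is where the two patterns differ: the lookahead version still finds all three keywords, while the sequential version finds only the first one it looks for.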
Could you please try the regex below:
^(\bMyCompany1\b)?\s+(\bestablishes\b)?\s+(\bAnotherCompany\b)?(?:.+)?$
Hope it helps.

How to find "complicated" URLs in a text file

I'm using the following regex to find URLs in a text file:
/http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+/
It outputs the following:
http://rda.ucar.edu/datasets/ds117.0/.
http://rda.ucar.edu/datasets/ds111.1/.
http://www.discover-earth.org/index.html).
http://community.eosdis.nasa.gov/measures/).
Ideally they would print out this:
http://rda.ucar.edu/datasets/ds117.0/
http://rda.ucar.edu/datasets/ds111.1/
http://www.discover-earth.org/index.html
http://community.eosdis.nasa.gov/measures/
Any ideas on how I should tweak my regex?
Thank you in advance!
UPDATE - Example of the text would be:
this is a test http://rda.ucar.edu/datasets/ds117.0/. and I want this to be copied over http://rda.ucar.edu/datasets/ds111.1/. http://www.discover-earth.org/index.html). http://community.eosdis.nasa.gov/measures/).
This will trim the trailing characters ) and . from your output.
import re
regx= re.compile(r'(?m)[\.\)]+$')
print(regx.sub('', your_output))
And this regex seems to work for extracting the URLs from your original sample text.
https?:[\S]*\/(?:\w+(?:\.\w+)?)?
Demo (edited from https?:[\S]*\/)
A Python script may look something like this:
import re

ss = """ this is a test http://rda.ucar.edu/datasets/ds117.0/. and I want this to be copied over http://rda.ucar.edu/datasets/ds111.1/. http://www.discover-earth.org/index.html). http://community.eosdis.nasa.gov/measures/). """
regx = re.compile(r'https?:[\S]*\/(?:\w+(?:\.\w+)?)?')
# Print each URL without its trailing "." or ")."
for m in regx.findall(ss):
    print(m)
So for the urls you have here:
https://regex101.com/r/uSlkcQ/4
Pattern explanation:
Protocols (e.g. https://)
^[A-Za-z]{3,9}:(?://)
Look for recurring [-;:&=+\$,\w]+ groups, each optionally followed by a dot (www.sub.domain.com)
(?:[\-;:&=\+\$,\w]+\.?)+
Look for recurring /[\-;:&=\+\$,\w\.]+ (/some.path/to/somewhere)
(?:\/[\-;:&=\+\$,\w\.]+)+
Now, for your special case: ensure that the last character is not a dot or a parenthesis, using negative lookahead
(?!\.|\)).
The full pattern is then
^[A-Za-z]{3,9}:(?://)(?:[\-;:&=\+\$,\w]+\.?)+(?:\/[\-;:&=\+\$,\w\.]+)+(?!\.|\)).
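Because the full pattern starts with ^, it expects each URL at the start of a line. Here is a minimal sketch in Python, assuming the input is the asker's printed output with one URL per line, and using re.MULTILINE so that ^ matches at every line start. The variable names are illustrative.

```python
import re

# Full pattern from above, split across lines for readability only.
URL = re.compile(
    r"^[A-Za-z]{3,9}:(?://)"
    r"(?:[\-;:&=\+\$,\w]+\.?)+"
    r"(?:\/[\-;:&=\+\$,\w\.]+)+"
    r"(?!\.|\)).",
    re.MULTILINE,
)

# The output shown in the question, one URL per line.
raw_output = """\
http://rda.ucar.edu/datasets/ds117.0/.
http://rda.ucar.edu/datasets/ds111.1/.
http://www.discover-earth.org/index.html).
http://community.eosdis.nasa.gov/measures/).
"""

# The pattern has no capturing groups, so findall returns whole matches.
cleaned = URL.findall(raw_output)
for url in cleaned:
    print(url)
```

The final (?!\.|\)). forces the engine to backtrack until the last consumed character is neither a dot nor a closing parenthesis, which is what drops the trailing punctuation.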
There are a few things to improve or change in your existing regex to allow this to work:
http[s]? can be changed to https?. They're identical. No use putting s in its own character class
[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),] You can shorten this entire thing and combine character classes instead of using | between them. This not only improves performance, but also allows you to combine certain ranges into existing character class tokens. Simplifying this, we get [a-zA-Z0-9$-_#.&+!*\(\),]
We can go one step further: a-zA-Z0-9_ is the same as \w. So we can replace those in the character class to get [\w$-#.&+!*\(\),]
In the original regex we have $-_. This creates a range, so it actually includes everything between $ and _ in the ASCII table. This will cause unwanted characters to be matched: $%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_. There are a few options to fix this:
[-\w$#.&+!*\(\),] Place - at the start of the character class
[\w$#.&+!*\(\),-] Place - at the end of the character class
[\w$\-#.&+!*\(\),] Escape - such that you have \- instead
You don't need to escape ( and ) in the character class: [\w$#.&+!*(),-]
[0-9a-fA-F][0-9a-fA-F] You don't need to specify [0-9a-fA-F] twice. Just use a quantifier like so: [0-9a-fA-F]{2}
(?:%[0-9a-fA-F][0-9a-fA-F]) The non-capture group isn't actually needed here, so we can drop it (it adds another step that the regex engine needs to perform, which is unnecessary)
So the result of just simplifying your existing regex is the following:
https?://(?:[$\w#.&+!*(),-]|%[0-9a-fA-F]{2})+
Now you'll notice it doesn't match / so we need to add that to the character class. Your regex was matching this originally because it has an improper range $-_.
https?://(?:[$\w#.&+!*(),/-]|%[0-9a-fA-F]{2})+
Unfortunately, even with this change, it'll still match ). at the end. That's because your regex isn't told to stop matching after /. Even implementing this will now cause it to not match file names like index.html. So a better solution is needed. If you give me a couple of days, I'm working on a fully functional RFC-compliant regex that matches URLs. I figured, in the meantime, I would at least explain why your regex isn't working as you'd expect it to.
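The points about the accidental range and the two simplified patterns can be checked with a short script. This is a minimal sketch in Python; the sample sentence and variable names are made up for illustration.

```python
import re

# "$-_" inside a class is a range from "$" (0x24) to "_" (0x5F), so it
# also covers "/", the digits, ":" and the uppercase letters.
accidental_range = re.compile(r"[$-_]")
# With "-" at the end of the class it is a literal hyphen.
literal_hyphen = re.compile(r"[$_-]")

# The simplified pattern, without and with "/" in the character class.
simplified = re.compile(r"https?://(?:[$\w#.&+!*(),-]|%[0-9a-fA-F]{2})+")
with_slash = re.compile(r"https?://(?:[$\w#.&+!*(),/-]|%[0-9a-fA-F]{2})+")

text = "see http://www.discover-earth.org/index.html)."

print(accidental_range.fullmatch("/") is not None)  # "/" is inside the range
print(literal_hyphen.fullmatch("/") is not None)    # "/" is no longer matched
print(simplified.search(text).group())   # stops at the first "/"
print(with_slash.search(text).group())   # still swallows the trailing ")."
```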
Thanks all for the responses. A coworker ended up helping me with it. Here is the solution:
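import re
# des is assumed to hold the text being searched
des_links = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', des)
for i in des_links:
    # Keep only what comes before the last "/"
    tmps = "/".join(i.split('/')[0:-1])
    print(tmps)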

Google Analytics - Content grouping - Regex fix

This is our URL structure:
http://www.disabledgo.com/access-guide/the-university-of-manchester/176-waterloo-place-2
http://www.disabledgo.com/access-guide/kingston-university/coombehurst-court-2
http://www.disabledgo.com/access-guide/kings-college-london/franklin-wilkins-building-2
http://www.disabledgo.com/access-guide/redbridge-college/brook-centre-learning-resource-centre
I am trying to create a list of groups based on the client names
/access-guide/[this bit]/...
So I can have a performance list of all our clients.
This is my regex:
/access-guide/(.*universit(y|ies)|.*colleg(e|es))/
I want it to group anything that has university/ies or college/es in it, at any point within that client name section of the URL.
At the moment, my current regex will only return groups that are X-University:
Durham-University
Plymouth-University
Cardiff-University
etc.
What does the regex need to be to have the list I'm looking for?
Do I need to have something at the end to stop it matching things after the client name? E.g. ([^/]+$)?
Thanks for your help in advance!
Depending upon your needs you may want to do:
/access-guide/([^/]*(?:university|universities|college|colleges)[^/]*)/
This will match names even if "university" or "college" is not at the end of the string, for example "college-of-the-ozarks". Note the non-capturing inner parentheses, which should probably be used no matter what solution you go with, as you don't want to capture just the word "university" or "college".
Live Example
Additionally, I don't know what may be in your URLs, but if they may contain compound words that you want to exclude, using \b may be advisable. For instance, if you don't want to match "miskatonic-postcollege" you may want to do something like this:
/access-guide/([^/]*\b(?:university|universities|college|colleges)\b[^/]*)/
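The difference between the original expression and the \b version can be tried on the URLs from the question. This is a minimal sketch in Python, assuming Analytics treats these constructs the same way Python's re does. The helper name client_group and the miskatonic URL are hypothetical.

```python
import re
from typing import Optional

# Expression from the question: the keyword must be directly followed by "/".
ORIGINAL = re.compile(r"/access-guide/(.*universit(y|ies)|.*colleg(e|es))/")

# Expression from this answer: the keyword may sit anywhere in the segment.
IMPROVED = re.compile(
    r"/access-guide/([^/]*\b(?:university|universities|college|colleges)\b[^/]*)/"
)


def client_group(pattern: re.Pattern, url: str) -> Optional[str]:
    """Return the first capture group, or None if the URL does not match."""
    m = pattern.search(url)
    return m.group(1) if m else None


base = "http://www.disabledgo.com/access-guide/"
urls = [
    base + "the-university-of-manchester/176-waterloo-place-2",
    base + "kingston-university/coombehurst-court-2",
    base + "kings-college-london/franklin-wilkins-building-2",
    base + "miskatonic-postcollege/some-page",  # hypothetical URL
]
for u in urls:
    print(client_group(ORIGINAL, u), "|", client_group(IMPROVED, u))
```

Only the second URL matches the original expression, which is consistent with the asker seeing nothing but X-University groups.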
If the client name section of the URL is after the access-guide/ and before the next /:
http://www.disabledgo.com/access-guide/the-university-of-manchester/176-waterloo-place-2
                                      |----------------------------|
you need to use a negated character class to only match university before the regex reaches that rightmost / boundary.
As per the Reference:
You can extract pages by Page URL, Page Title, or Screen Name. Identify each one with a regex capture group (Analytics uses the first capture group for each expression)
Thus, you can use
/access-guide/([^/]*(universit(y|ies)|colleges?))
               ^^^^^
See demo.
The regex matches
/access-guide/ - leftmost boundary, matches /access-guide/ literally
[^/]* - any character other than / (so we still remain in that customer section)
(universit(y|ies)|colleges?) - university, or universities, or college, or colleges literally. Add more if needed.

RegExp replace after

I have some link templates and I need to replace substrings inside those links.
Link templates:
"/all_news"
"/all_news/"
"/all_news/page1"
"/all_news/page1/"
All of these templates mean the same thing - first page of news page without filtering.
So I need to:
1st template - insert "/pageX"
2nd template - insert "pageX"
3rd and 4th templates - replace page number
Is it possible with only one regexp?
If yes, then please help me.
If no, then I have 2nd question:
maybe it's possible to replace everything after "/all_news" with "/pageX"?
I mean the following logic:
string started
ok, I see substring "/all_news"
I replace everything after "/all_news", even if nothing exists there (if the string ends with "/all_news")
I return "/all_news/pageX".
This'll do it.
'/all_news/page1'.replace(/(.*\/all_news).*/,'$1' + '/pageX');
Just one for all.
Java has lookbehind. It negates the need for the $1. The solution looks like:
String result = "/all_news/page1";
String pattern = "(?<=\\/all_news).*";
System.out.println(result.replaceAll(pattern,"/pageX"));
Cheers.
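Both approaches, the capture group and the lookbehind, can be applied to all four link templates from the question. Here is a minimal sketch in Python; the function names and the page parameter are assumptions added for illustration.

```python
import re


def set_page(link: str, page: int) -> str:
    """Capture-group version: keep the text up to "/all_news", replace the rest."""
    return re.sub(r"(.*/all_news).*", rf"\g<1>/page{page}", link, count=1)


def set_page_lookbehind(link: str, page: int) -> str:
    """Lookbehind version: replace whatever follows "/all_news"."""
    return re.sub(r"(?<=/all_news).*", f"/page{page}", link, count=1)


templates = ["/all_news", "/all_news/", "/all_news/page1", "/all_news/page1/"]
for t in templates:
    print(t, "->", set_page(t, 3), set_page_lookbehind(t, 3))
```

Every template ends up as /all_news/page3, including the first one, where the part being replaced is empty.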

Exclude part of the string with regex

I'm quite bad with regex, and I'm trying to match a set of criteria.
This is a regex that should be embedded into the URL rule for a firewall, so that it blocks any URL that is not like the ones in the list at the end.
This is what I'm currently using, but it's not working:
http://www.youtube.com/(*.*)list=UUFwtOm4N5djdcuTAlNIWJaQ
This is the example url (to be blocked):
http://www.youtube.com/watch?NR=1&feature=fvwp&v=P1b5VY_Bp_o&list=UUFwtOm4N5djdcuTAlNIWJaQ
I'm trying to make a regex that will successfully match when NR=1 or feature=fvwp
are NOT present. I assume I can do it like this: (?!^feature=fvwp$), but the v= and list=UUFwtOm4N5djdcuTAlNIWJaQ parameters are allowed.
Also, the v= value should be limited to any character (uppercase and lowercase) with a length of 11; I assume that is: /^[a-z0-9]{11}$/
How can I put all of that together so that it allows and matches only these URLs, while excluding the criteria I explained above:
http://www.youtube.com/watch?v=4eK_RWpTgcc&feature=BFa&list=UUFwtOm4N5djdcuTAlNIWJaQ
http://www.youtube.com/watch?v=TLRl85TJwZM&feature=BFa&list=UUFwtOm4N5djdcuTAlNIWJaQ
http://www.youtube.com/watch?v=QEV9yqrpxkc&feature=BFa&list=UUFwtOm4N5djdcuTAlNIWJaQ
Can you block based on matching by regex? If so, just use
(.*)www\.youtube\.com/watch\?NR=1&feature=fvwp and block whatever matches that.
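The block-on-match idea can be tried against the URLs from the question. This is a minimal sketch in Python, assuming the firewall applies the expression as an unanchored search. The helper name should_block is hypothetical.

```python
import re

# Expression from the answer above. It only catches URLs where NR=1 is the
# first query parameter and is directly followed by feature=fvwp.
BLOCK = re.compile(r"(.*)www\.youtube\.com/watch\?NR=1&feature=fvwp")


def should_block(url: str) -> bool:
    """Return True if the block expression is found anywhere in the URL."""
    return BLOCK.search(url) is not None


to_block = (
    "http://www.youtube.com/watch?NR=1&feature=fvwp"
    "&v=P1b5VY_Bp_o&list=UUFwtOm4N5djdcuTAlNIWJaQ"
)
to_allow = [
    "http://www.youtube.com/watch?v=4eK_RWpTgcc&feature=BFa&list=UUFwtOm4N5djdcuTAlNIWJaQ",
    "http://www.youtube.com/watch?v=TLRl85TJwZM&feature=BFa&list=UUFwtOm4N5djdcuTAlNIWJaQ",
    "http://www.youtube.com/watch?v=QEV9yqrpxkc&feature=BFa&list=UUFwtOm4N5djdcuTAlNIWJaQ",
]

print(should_block(to_block))
print([should_block(u) for u in to_allow])
```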