Regex expression for matching folder content - regex

This may be an obvious regex, but I need an expression to filter the following folders and files.
The regex should match the contents of the folder
SUNSOLAR_Demo_0/My Documents/CUSTOMER/sale_docs
and exclude the rest.
Eg.:
Include
SUNSOLAR_Demo_0/My Documents/CUSTOMER/sale_docs/invoices/emails/sample.pdf
Exclude:
SUNSOLAR_Demo_0/My Documents/CUSTOMER/unseal/mytest.doc
SUNSOLAR_Demo_0/My Documents/PROVIDERS/orders/invoices.xls
I was trying something like this, but no luck:
"^SUNSOLAR_Demo_0/My Documents/(?!CUSTOMER/sale_docs).*$"
Thanks

Your pattern ^SUNSOLAR_Demo_0/My Documents/(?!CUSTOMER/sale_docs).*$ matches up to Documents/ and then asserts that what is directly to the right is not CUSTOMER/sale_docs.
But CUSTOMER/sale_docs is actually part of the string that you want to match, so the negative lookahead rejects exactly the paths you are after.
There is no need for a lookaround here.
You can match the whole folder path, followed by an optional part that starts with / and runs to the end of the line.
^SUNSOLAR_Demo_0/My Documents/CUSTOMER/sale_docs(?:/.*)?$
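As a quick check, here is a minimal Python sketch that runs the pattern above against the three example paths from the question. The variable names are only illustrative, and Python's re module is an assumption, since the question does not say which regex engine is in use.

```python
import re

# Pattern from the answer: the folder itself, optionally followed by "/" and anything.
FOLDER = re.compile(r"^SUNSOLAR_Demo_0/My Documents/CUSTOMER/sale_docs(?:/.*)?$")

paths = [
    "SUNSOLAR_Demo_0/My Documents/CUSTOMER/sale_docs/invoices/emails/sample.pdf",
    "SUNSOLAR_Demo_0/My Documents/CUSTOMER/unseal/mytest.doc",
    "SUNSOLAR_Demo_0/My Documents/PROVIDERS/orders/invoices.xls",
]

# Keep only the paths that live in (or are) the sale_docs folder.
included = [p for p in paths if FOLDER.match(p)]
print(included)
```

Because the optional group must start with /, a sibling folder whose name merely begins with sale_docs is not matched.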

Related

Regular Expression - Starting and ending with, and contains specific string in the middle

I would like to generate a regex with the following condition:
The string "EVENT" is contained within a xml tag called "SHEM-HAKOVETZ".
For example, the following string should be a match:
<SHEM-HAKOVETZ>104000514813450EVENTS0001dfd0.DAT</SHEM-HAKOVETZ>
I think you want something like this ^<SHEM-HAKOVETZ>.*EVENT.*<\/SHEM-HAKOVETZ>$
Regular expression
^<SHEM-HAKOVETZ>.*EVENT.*<\/SHEM-HAKOVETZ>$
Parts of the regular expression
^ From the beginning of the line
<SHEM-HAKOVETZ> Starting tag
.* Any character - zero or more
EVENT Middle part
.* Any character - zero or more
<\/SHEM-HAKOVETZ>$ Ending part of the match
Here is the working regex.
If you want to match this line, you could use this regex:
<SHEM-HAKOVETZ>.*EVENTS.*(?=<\/SHEM-HAKOVETZ>)
However, I would not recommend using regex on XML-based data, because there may be problems with whitespace handling in XML (see this article for more information). I would suggest using an actual XML parser, and then applying the regex to the extracted text to be sure about your results.
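A minimal sketch of the parser-first approach, assuming Python and its standard xml.etree.ElementTree module. The wrapping root element and the second entry are made up for illustration; only the first SHEM-HAKOVETZ value comes from the question.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical document: the <root> wrapper and the second entry are invented.
doc = """<root>
  <SHEM-HAKOVETZ>104000514813450EVENTS0001dfd0.DAT</SHEM-HAKOVETZ>
  <SHEM-HAKOVETZ>104000514813450OTHER0001dfd0.DAT</SHEM-HAKOVETZ>
</root>"""

root = ET.fromstring(doc)

# Let the parser find the tags, then apply the regex to the text only.
matches = [
    el.text
    for el in root.iter("SHEM-HAKOVETZ")
    if el.text and re.search(r"EVENT", el.text)
]
print(matches)
```

The regex then only has to describe the value, not the markup around it.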
Here is a solution to only match the "value" part ignoring the XML tags:
(?<=<SHEM-HAKOVETZ>)(?:.*EVENTS.*)(?=<\/SHEM-HAKOVETZ>)
You can check it out in action at: https://regex101.com/r/4XiRch/1
It uses a lookbehind and a lookahead so that it only matches when the tags are present, while the match itself contains only the content, which is convenient for further processing.
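To illustrate, a small Python sketch (Python is an assumption here, the question does not name a language) showing that the overall match is just the value between the tags. The second input with an OTHER tag is hypothetical.

```python
import re

# The lookbehind and lookahead check the tags without consuming them.
VALUE = re.compile(r"(?<=<SHEM-HAKOVETZ>)(?:.*EVENTS.*)(?=</SHEM-HAKOVETZ>)")

line = "<SHEM-HAKOVETZ>104000514813450EVENTS0001dfd0.DAT</SHEM-HAKOVETZ>"
m = VALUE.search(line)
value = m.group(0) if m else None
print(value)

# Hypothetical input with a different tag: the lookbehind never succeeds.
other = VALUE.search("<OTHER>104000514813450EVENTS0001dfd0.DAT</OTHER>")
```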

Regex expression to exclude both prefix and suffix

I'm trying to build an expression which will match all text EXCLUDING text with prefix 'abc' AND suffix 'def' (text which only has the prefix OR the suffix is ok).
I've tried the following:
^(?!([a][b][c]])).*(?!([d][e][f])$), but it doesn't match text which meets only one of the criteria (i.e. abc.xxx fails, as well as xxx.pdf, though they should pass)
I understand the answer is related to 'look behind' but I'm still not quite sure how to achieve this behavior
I've also tried the following:
^(?<!([a][b][c])).*(?!([d][e][f])$), but again, with no luck
^((abc.*\.(?!def))|((?!abc).*\.def))$
I think there can be a simpler solution, but this one will work as you wanted it.
[a][b][c] can be simplified to abc, the same goes for def.
The first part of the pattern matches abc.*\. without def at the end.
The second part matches .*\.def without the prefix abc.
Keep it simple and combine it into a single lookahead to check both conditions:
^(?!abc.*def$).*
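A short Python sketch of how the single-lookahead pattern behaves; the sample strings are invented for illustration and Python's re is assumed as the engine.

```python
import re

# Reject only strings that start with "abc" AND end with "def".
NOT_BOTH = re.compile(r"^(?!abc.*def$).*")

samples = ["abc.xxx", "xxx.def", "xxx.yyy", "abc.def"]
accepted = [s for s in samples if NOT_BOTH.match(s)]
print(accepted)
```

Only the string with both the prefix and the suffix is rejected; having just one of them is fine.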

Regex to match all urls, excluding .css, .js resources

I'm looking for a regular expression that excludes URLs with extensions I don't want.
For example resources ending with: .css, .js, .font, .png, .jpg etc. should be excluded.
However, I can put all resources to the same folder and try to exclude URLs to this folder, like:
.*\/(?!content\/media)\/.*
But that doesn't work! How can I improve this regex to match my criteria?
e.g.
Match:
http://www.myapp.com/xyzOranotherContextRoot/rest/user/get/123?some=par#/other
No match:
http://www.myapp.com/xyzOranotherContextRoot/content/media/css/main.css?7892843
The correct solution is:
^((?!\/content\/media\/).)*$
see: https://regex101.com/r/bD0iD9/4
Inspired by: Regular expression to match a line that doesn't contain a word?
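Here is a minimal Python sketch of that pattern applied to the two URLs from the question. In Python the forward slashes do not need escaping, so they are written plainly; the variable names are illustrative only.

```python
import re

# Each character is consumed only if "/content/media/" does not start at that position.
NO_MEDIA = re.compile(r"^((?!/content/media/).)*$")

urls = [
    "http://www.myapp.com/xyzOranotherContextRoot/rest/user/get/123?some=par#/other",
    "http://www.myapp.com/xyzOranotherContextRoot/content/media/css/main.css?7892843",
]

allowed = [u for u in urls if NO_MEDIA.match(u)]
print(allowed)
```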
Two things:
First, the ?! negative lookahead doesn't remove any characters from the input. Add [^\/]+ before the trailing slash. Right now it is trying to match two consecutive slashes. For example:
.*\/(?!content\/media)[^\/]+\/.*
(edit) Second, the .*s at the beginning and end match too much. Try tightening those up, or adding more detail to content\/media. As it stands, content/media can be swallowed by one of the .*s and never be checked against the lookahead.
Suggestions:
Use your original idea - test against the extensions: ^.*\.(?!css|js|font|png|jpeg)[a-z0-9]+$ (with the case-insensitive flag).
Instead of doing everything in one regular expression, use a regex that will pull any URL (e.g., https?:\/\/\S+, perhaps?) and then test each one you find with String.indexOf: if(candidateURL.indexOf('content/media')==-1) { /*do something with the OK URL */ }
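The second suggestion could look roughly like this in Python, where a plain substring test plays the role of String.indexOf. The surrounding text is invented; the two URLs are the ones from the question.

```python
import re

# Hypothetical input text containing the two URLs from the question.
text = (
    "See http://www.myapp.com/xyzOranotherContextRoot/rest/user/get/123?some=par#/other "
    "and http://www.myapp.com/xyzOranotherContextRoot/content/media/css/main.css?7892843"
)

# Step 1: pull out anything that looks like a URL.
candidates = re.findall(r"https?://\S+", text)

# Step 2: drop the ones that point into the media folder.
ok_urls = [u for u in candidates if "content/media" not in u]
print(ok_urls)
```

Splitting the work this way keeps the regex trivial and moves the exclusion rule into ordinary code.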

Delphi XE2 Regex: Quantifier does not work inside positive lookbehind?

I have a complete HTML document string from a web page containing this BASE tag:
<BASE href="http://whatreallyhappened.com/">
In Delphi XE2, I use this regular expression with the whole HTML document as subject to get the URL from the BASE tag between the double quotes:
BaseURL := TRegEx.Match(HTMLDocStr, '(?<=<base(\s)href=").*(?=")', [roIgnoreCase]).Value;
This works, but only if there is only ONE space character in the subject between BASE and href.
I tried to add a quantifier to the space part in the regex (\s), but it did not work.
So how can I make this regex match the URL even if there are several spaces between BASE and href?
You're making this far too complicated by using lookaround. If you want to extract only part of the regex match, simply add a capturing group. Then you can use the text matched by the capturing group instead of the overall match. In most cases you'll also get much better performance this way.
To find the base tag in a file and extract its URL you can use the regex <base[^>]+href=["']([^"']*)["']. Call TRegex.Match() to get a TMatch. This has a Groups property that you can use to retrieve group 1 if a match was found.
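The same idea sketched in Python rather than Delphi, purely as an illustration of the capturing-group approach; the HTML snippet is a made-up minimal document with several spaces between BASE and href.

```python
import re

# Group 1 captures the URL; the rest of the pattern just locates the tag.
BASE_HREF = re.compile(r"""<base[^>]+href=["']([^"']*)["']""", re.IGNORECASE)

# Hypothetical document with more than one space before href.
html = '<html><head><BASE    href="http://whatreallyhappened.com/"></head></html>'

m = BASE_HREF.search(html)
base_url = m.group(1) if m else ""
print(base_url)
```

In Delphi the equivalent step would be reading group 1 from the TMatch returned by TRegEx.Match, as described above.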
With lookaround
You can try moving the quantified part out of the lookbehind, for example:
(?<=<BASE)\s+href=".*(?=")
(?<=<BASE)\s{0,30}href=".*(?=")
Without lookaround
By the way, if you just want the content of href, there is no need for lookaround; you can simply use:
<BASE\s+href="(.*?)"
EDIT: after reading your comments I figured out a workaround (ugly but could work). You can try using something like this:
((?<=<BASE\shref=")|(?<=<BASE\s\shref=")|(?<=<BASE\s\s\shref=")).*(?=")
Note the \s, \s\s and \s\s\s in the first, second and third lookbehind respectively.
I know that this is horrible, but if none of the above works you can try that.
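For comparison, Python's re module has a similar fixed-width restriction on lookbehind, so the workaround can be tried there. This is only an analogy: Python's engine is not the one Delphi XE2 uses, and the sample tag with three spaces is made up.

```python
import re

html = '<BASE   href="http://whatreallyhappened.com/">'

# Python's re refuses a quantifier inside a lookbehind at compile time.
try:
    re.compile(r'(?<=<base\s+href=").*(?=")', re.IGNORECASE)
    quantifier_allowed = True
except re.error:
    quantifier_allowed = False

# Workaround: one fixed-width lookbehind per expected number of spaces (1 to 3).
WORKAROUND = re.compile(
    r'((?<=<base\shref=")|(?<=<base\s\shref=")|(?<=<base\s\s\shref=")).*(?=")',
    re.IGNORECASE,
)
m = WORKAROUND.search(html)
url = m.group(0) if m else None
print(quantifier_allowed, url)
```

The workaround only covers the space counts that are spelled out, which is why the capturing-group version is usually easier to live with.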

How to match data inside a tag, but not other similar tags

Regexr link for the lazy: http://regexr.com?33udv
Test string:
<li><strong>Start</strong></li><li>End</li>
When I search for "Start", I want to match
<li><strong>Start</strong></li>
My pattern is this:
<li>(?!<li>)*Start.*?</li>
My issue is that it's matching both list children, when I only want to match the one that contains "Start".
Note: This is a very predictable html string that will always look the same. I know Regex shouldn't parse html, but the question is more about understanding of Negative Lookaheads.
Solution:
<li>((?!<li>).)*Start.*?</li>
The expression you posted is different than the one from the link. I will focus on the one from the link.
.* is greedy, it will try to find the longest match. You want it to be lazy:
<li>.*?Start.*?</li>
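A small Python sketch comparing the two patterns from this thread. On the test string from the question both return the same item. The reordered string is a hypothetical variation, added to show where they differ: the lazy version can still run across a preceding list item, while the version with the negative lookahead cannot.

```python
import re

TEMPERED = re.compile(r"<li>((?!<li>).)*Start.*?</li>")  # from the "Solution" above
LAZY = re.compile(r"<li>.*?Start.*?</li>")               # lazy variant

original = "<li><strong>Start</strong></li><li>End</li>"
# Hypothetical variation: the item containing "Start" comes second.
reordered = "<li>End</li><li><strong>Start</strong></li>"

tempered_original = TEMPERED.search(original).group(0)
lazy_original = LAZY.search(original).group(0)
tempered_reordered = TEMPERED.search(reordered).group(0)
lazy_reordered = LAZY.search(reordered).group(0)

print(tempered_reordered)
print(lazy_reordered)
```

The ((?!<li>).)* part refuses to step over an opening <li>, so the match has to begin at the item that actually contains Start.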