Hierarchical path RegExp - regex

I have to remove a known "level" from a hierarchical path using a regular expression.
In other terms, I want to go from 'a/b/X/c/d' to 'a/b/c/d', where X can be at any level of the path.
Using JavaScript as an example, I have crafted the following:
str = str.replace(/^(?:(.+\/)|)X(?:$|\/(.+$))/, "$1$2")
which works fine when X is either the root or is in the middle of the path, but leaves a trailing slash when X comes last in the path. I could make a subsequent replace to handle those instances, but would it be possible to create a better RegEx that matches all the cases?
Thanks.
Edit: To clarify, each level of the path may contain any number of characters, and I only want to remove a level if it matches X exactly.

Search: \bX/|/X(?=$)
Replace: Empty String
In the Regex Demo, see the substitutions at the bottom.
Input
a/b/X/c/d
X/a/b/c/d
a/b/c/d/X
Output
a/b/c/d
a/b/c/d
a/b/c/d
Explanation
\b assert word boundary
X/ match X/
OR |
Match /X, if the lookahead (?=$) can assert that what follows is the end of the string
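As a quick check of the search/replace pair above, here is a minimal Python sketch. The helper name remove_level is hypothetical, and X is treated as the literal level name.

```python
import re

# Same alternation as the answer: "X/" at a word boundary, or "/X" at the very end.
PATTERN = re.compile(r"\bX/|/X(?=$)")

def remove_level(path: str) -> str:
    """Remove the level named X from a slash-separated path (hypothetical helper)."""
    return PATTERN.sub("", path)

for path in ("a/b/X/c/d", "X/a/b/c/d", "a/b/c/d/X"):
    print(path, "->", remove_level(path))  # each prints "... -> a/b/c/d"
```

Note that \b also matches after punctuation such as - or ., so a level named b-X would lose its X/ part. If that matters, a lookbehind for / or the start of the string would be stricter.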

Related

Regex to match(extract) string between dot(.)

I want to select certain dot-joined strings (words joined with dots (.)) from a very long string (SQL). The full string may be a single line or multiple lines separated by newlines, and the combination may appear on the first line, on a later line, or both.
I need help in writing a regex for it.
Examples:
String s = I am testing something like test.test.test in sentence.
Expected output: test.test.test
Example2 (real usecase):
UPDATE test.table
SET access = 01
WHERE access IN (
SELECT name FROM project.dataset.tablename WHERE name = 'test' GROUP BY 1 )
Expected output: test.table and project.dataset.tablename
Also, can I add prefix or suffix words or spaces that must be present wherever this logic is checked? In the case above, if the keyword is update, the regex should pick up test.table, but if the statement is like select test.table, the regex should not pick up this combination. The same applies to suffixes.
Example3: This is to illustrate the above theory.
INS INTO test.table
SEL 'abcscsc', wu_id.Item_Nbr ,1
FROM test.table as_t
WHERE as_t.old <> 0 AND as_t.date = 11
AND (as_t.numb IN ('11') )
Expected Output: test.table, test.table (Key words are INTO and FROM)
Things Not Needed in selection:as_t.numb, as_t.old, as_t.date
If I get the regex I can use in program to extract this word.
Note: The words before and after the combination could be anything, such as update, select, { or (, so we have to find every occurrence of words joined together with a dot (.).
I tried something like this:
(?<=.)(.?)(?=.)(.?) -: This only selected the word between two .dot and not all.
.(?<=.)(.?)(?=.)(.?). - This everything before and after.
To solve your initial problem, we can just use some negation. Here's the pattern I came up with:
[^\s]+\.[^\s]+
[^ ... ] Means to make a character class including everything except for what's between the brackets. In this case, I put \s in there, which matches any whitespace. So [^\s] matches anything that isn't whitespace.
+ Is a quantifier. It means to find as many of the preceding construct as you can without breaking the match. This would happily match everything that's not whitespace, but I follow it with a \., which matches a literal .. The \ is necessary because . means to match any character in regex, so we need to escape it so it only has its literal meaning. This means there has to be a . in this group of non-whitespace characters.
I end the pattern with another [^\s]+, which matches everything after the . until the next whitespace.
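To see the pattern in action, here is a minimal Python sketch that runs it over the two examples from the question. The variable names are only illustrative.

```python
import re

# Non-whitespace run, a literal dot, then another non-whitespace run.
DOTTED = re.compile(r"[^\s]+\.[^\s]+")

sentence = "I am testing something like test.test.test in sentence."
sql = """UPDATE test.table
SET access = 01
WHERE access IN (
SELECT name FROM project.dataset.tablename WHERE name = 'test' GROUP BY 1 )"""

print(DOTTED.findall(sentence))  # ['test.test.test']
print(DOTTED.findall(sql))       # ['test.table', 'project.dataset.tablename']
```

The trailing "sentence." is not matched because nothing but the end of the string follows its dot.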
Now, to solve your secondary problem, you want to make this match only work if it is preceded by a given keyword. Luckily, regex has a construct almost specifically for this case. It's called a lookbehind. The syntax is (?<= ... ) where the ... is the pattern you want to look for. Using your example, this will only match after the keywords INTO and FROM:
(?<=(?:INTO|FROM)\s)[^\s]+\.[^\s]+
Here (?:INTO|FROM) means to match either the text INTO or the text FROM. I then specify that it should be followed by a whitespace character with \s. One possible problem here is that it will only match if the keywords are written in all upper case. You can change this behavior by specifying the case insensitive flag i to your regex parser. If your regex parser doesn't have a way to specify flags, you can usually still specify it inline by putting (?i) in front of the pattern, like so:
(?i)(?<=(?:INTO|FROM)\s)[^\s]+\.[^\s]+
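A minimal Python sketch of the lookbehind version, run against Example 3 from the question. It passes re.IGNORECASE instead of the inline (?i), which should be equivalent here.

```python
import re

# Only match a dotted name that directly follows INTO or FROM plus one whitespace character.
AFTER_KEYWORD = re.compile(r"(?<=(?:INTO|FROM)\s)[^\s]+\.[^\s]+", re.IGNORECASE)

sql = """INS INTO test.table
SEL 'abcscsc', wu_id.Item_Nbr ,1
FROM test.table as_t
WHERE as_t.old <> 0 AND as_t.date = 11
AND (as_t.numb IN ('11') )"""

print(AFTER_KEYWORD.findall(sql))  # ['test.table', 'test.table']
```

One caveat for Python specifically: its re module requires lookbehinds to have a fixed width. INTO and FROM work together because they are the same length, but adding a keyword of a different length (UPDATE, say) would need its own lookbehind, for example (?:(?<=INTO\s)|(?<=UPDATE\s)).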
If you are new to regex, I highly recommend using the www.regex101.com website to generate regex and learn how it works. Don't forget to check out the code generator part for getting the regex code based on the programming language you are using, that's a cool feature.
For your question, you need a regex that matches any word character \w between 0 and unlimited times, followed by a dot, followed by another run of word characters repeated between 0 and unlimited times.
So here is my solution to your question:
Your regex in JavaScript:
const regex = /([\w]*[.][\w]*)+/gm;
in Java:
final String regex = "([\\w]*[.][\\w]*)+";
in Python:
regex = r"([\w]*[.][\w]*)+"
in PHP:
$re = '/([\w]*[.][\w]*)+/m';
Note that: this solution is written for your use case (to be used for SQL strings), because now if you have something like '.word' or 'word..word', it will still catch it which I assume you don't have a string like that.
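The caveat above can be checked with a short Python sketch. It assumes the pattern is "zero or more word characters, a dot, zero or more word characters, repeated", as described in this answer, and it uses a non-capturing group so that findall returns whole matches rather than the last captured group.

```python
import re

# Assumed form of the answer's pattern, with a non-capturing group for findall.
WORD_DOT_WORD = re.compile(r"(?:\w*\.\w*)+")

print(WORD_DOT_WORD.findall("UPDATE test.table SET access = 01"))  # ['test.table']

# The caveat: stray dots are picked up as well.
print(WORD_DOT_WORD.findall("a .word and word..word"))  # ['.word', 'word..word']
```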
See this screenshot for more details

Extracting address with Regex

I'm trying to look for Street|St|Drive|Dr and then get all the contents of the line to extract the address:
(?:(?!\s{2,}|\$).)*(Street|St|Drive|Dr).*?(?=\s{2,})
.. but it also matches:
Full match 420-442 ` Tax Invoice/Statement`
Group 1. 433-435 `St`
Full match 4858-4867 `163.66 DR`
Group 1. 4865-4867 `DR`
Full match 11053-11089 ` Permanent Water Saving Plan, please`
Group 1. 11077-11079 `Pl`
How do I match only whole words and not substrings, so that it ignores words that contain those words (the first match, for example)?
One option is to use the word-boundary anchor, \b, to accomplish this:
(?:(?!\s{2,}|\$).)*\b(Street|St|Drive|Dr)\b.*?(?=\s{2,})
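The effect of \b can be seen in isolation with a minimal Python sketch. It uses only the suffix alternation rather than the full pattern, and apart from "Tax Invoice/Statement" the sample strings are made up for illustration.

```python
import re

SUFFIXES = r"(?:Street|St|Drive|Dr)"
loose = re.compile(SUFFIXES)             # matches inside longer words
strict = re.compile(rf"\b{SUFFIXES}\b")  # whole words only

samples = ["Tax Invoice/Statement", "12 Example St", "7 Sample Drive"]  # hypothetical lines
for text in samples:
    l, s = loose.search(text), strict.search(text)
    print(repr(text), "| loose:", l and l.group(), "| strict:", s and s.group())
```

Here the loose pattern finds "St" inside "Statement", while the strict one reports no match for that line and still finds "St" and "Drive" in the other two.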
If you provide an example of the raw text you're parsing, I'll be able to give additional help if this doesn't work.
Edit:
From the link you posted in a comment, it seems that the \b solution solves your question:
How do I match only whole words and not substrings, so that it ignores words that contain those words (the first match, for example)?
However, it seems like there are additional issues with your regex.

Regex for specific file structure

I need to parse a file with the following simple structure:
some string 1
some string 2
some string 3
some string x
some string y
some string z
...
The file consists of 2 parts separated by "\n\n" or "\r\n\r\n". In my example this separator appears after "some string 3". Each part is optional: if the first part is omitted, there will be 1 empty line (\n|\r\n) before the second part (but my regex requires 2). If the second part is omitted, there may be any number of empty lines after the first part (including none at all).
I'm trying to achieve desired result with regex like this:
(?isx: \h* (.+)? \h* (?:(?:\n|\r\n){2,} \h* (.+))? \s*)
But this has not worked, because the first "(.+)?" is very greedy, and making the 2nd part non-optional violates my requirement that both parts be optional. I know that I can use split /(?:\n|\r\n)/, $str in this case, but in the future this file could have a more complex structure, so I can't use split.
Can someone help me with this?
You actually might want to use a non-greedy group, since you don't want to match your separator.
(?isx: (?:
(.*?) # Non greedy
(?:\r?\n){2,} # also matches \r\n\n but that might not be of concern
|\r?\n) # one empty line.
(.*) # second group
)
I don't know what you wanted to achieve with the \hs. If you want to ensure that there is something in the lines (right now, the . also could all match \n or spaces) you could try something like (?:[^\n]+\n)*? for the groups.
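For readers who want to try the non-greedy pattern outside Perl, here is a rough Python port. The name TWO_PARTS is hypothetical, and the i, s and x modifiers are passed as flags. It only covers the two cases the pattern is written for: both parts present, and the first part omitted.

```python
import re

TWO_PARTS = re.compile(
    r"""
    (?:
        (.*?)            # first part, non-greedy
        (?:\r?\n){2,}    # separator: two or more line breaks
      |
        \r?\n            # or a single empty line when the first part is absent
    )
    (.*)                 # second part
    """,
    re.IGNORECASE | re.DOTALL | re.VERBOSE,
)

both = TWO_PARTS.match("some string 1\nsome string 2\n\nsome string x\nsome string y")
print(both.groups())         # ('some string 1\nsome string 2', 'some string x\nsome string y')

only_second = TWO_PARTS.match("\nsome string x\nsome string y")
print(only_second.groups())  # (None, 'some string x\nsome string y')
```

With match anchoring at the start, an input that contains only a first part and no blank line does not match at all, so that case would need separate handling.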
Also, for brevity's sake, I avoided the explicit ? you used. There might be a difference in results. If you match nothing under a star, you get the empty string; if the group doesn't match at all, the value of the group variable is undefined. Here is a short example to show the difference:
"aa" =~ /(c)?(d*)aa/
Here $1 is undefined, while $2 is the empty string. This minor difference might yield some annoying warnings or unexpected results if someone tested with defined for the contents of a group.
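The same distinction shows up in other engines. As a small illustration in Python, where an unmatched group comes back as None rather than Perl's undef:

```python
import re

m = re.search(r"(c)?(d*)aa", "aa")

print(m.group(1))        # None: the optional group did not participate in the match
print(repr(m.group(2)))  # '': the group matched, but matched nothing
```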

How to exclude a certain word in regex?

I'm using this expression and it's perfect for what I need:
.*(cq|conquest).*
It returns any word/phrase/sentence/etc. with the letters 'cq' or the word 'conquest' in it. However, from those matches I want to exclude all that contain the term 'conquest power'.
Examples:
some conquest here (should match)
another cq with some conquest here (should match)
too much cq or conquest power is bad (should not match)
How can I do that to the regex above? It has to be only one regex otherwise the program that I'm using (Advanced Combat Tracker) will create two different tabs.
If you want to match any string which contains "conquest" or "cq", but not if the string contains "conquest power", then the regex is
^(?!.*conquest power).*?(?:cq|conquest).*
The above will attempt to match from the start of the string to the end of the line. If you want to match from the start of each line, switch on multiline mode if available; adding (?m) to the start of the regex may do that.
If you want to match across newlines change . to [\s\S], or switch on singleline mode if available.
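A minimal Python sketch that runs this regex over the three sample lines from the question (the variable names are only illustrative):

```python
import re

# Reject the line up front if "conquest power" appears anywhere, then require cq or conquest.
WANTED = re.compile(r"^(?!.*conquest power).*?(?:cq|conquest).*")

lines = [
    "some conquest here",
    "another cq with some conquest here",
    "too much cq or conquest power is bad",
]
for line in lines:
    print(bool(WANTED.search(line)), "-", line)
# True, True, False
```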
You have confused people by stating "I want to match 'cq' or 'conquest'" but also "I want the regex to extract that line".
I assume you don't really want to match just "cq" or "conquest", you want to match strings/lines (?) containing "cq" or "conquest".
From your original question I got that you want to match all strings which contain "cq" or "conquest" but do not contain "power". For this case the following regexp works:
^([^p]|p(?!ower))*(cq|conquest)([^p]|p(?!ower))*$
(regexpal)

Regular expression get filename without extension from full filepath

How can I extract the filename without extension from the following file path:
D:\Projects\Extract\downtown - second.pdf
The following regular expression gives me the filename with extension: [^\\]*$
e.g. downtown - second.pdf
The following regular expression gives me the filename without extension: (.+)(?=(\.))
e.g. D:\Projects\Extract\downtown - second
I'm struggling to combine the two into one regular expression to give me the results I want: downtown - second
I suspect that your 2nd regex would not give you the output you have shown. It will give you the complete string till the first period (.).
To get just the file name without extension, you can use this regex: -
[^\\]*(?=[.][a-zA-Z]+$)
I have just replaced (.+) in your 2nd regex with the [^\\]* from your first regex, and added a pattern to match pdf till the end.
Now this pattern will match 0 or more repetitions of any character except a backslash (\), followed by a . and then 1 or more letters making up the extension.
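As a quick check, here is the same pattern applied to the path from the question in Python. Raw strings keep the backslashes literal.

```python
import re

# Any run of non-backslash characters that is followed by ".<letters>" at the end.
NAME_ONLY = re.compile(r"[^\\]*(?=[.][a-zA-Z]+$)")

path = r"D:\Projects\Extract\downtown - second.pdf"
print(NAME_ONLY.search(path).group())  # downtown - second
```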
I made up this one, which captures most of the possibilities:
/[^\\\/]+(?=\.[\w]+$)|[^\\\/]+$/
/path/to/file
/path/to/file.txt
/path.with/dots.to/file.txt
/path/to/file.with.dots.txt
file.txt
C:\path\to\file.txt
and so on...
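The listed paths can be run through this pattern with a short Python sketch. The / delimiters and the \/ escapes are JavaScript-style syntax, so they are dropped here. The name STEM is hypothetical.

```python
import re

# First alternative: name followed by a final ".ext"; second: a name with no extension.
STEM = re.compile(r"[^\\/]+(?=\.\w+$)|[^\\/]+$")

paths = [
    "/path/to/file",
    "/path/to/file.txt",
    "/path.with/dots.to/file.txt",
    "/path/to/file.with.dots.txt",
    "file.txt",
    r"C:\path\to\file.txt",
]
for p in paths:
    print(p, "->", STEM.search(p).group())
```

Every path yields file except /path/to/file.with.dots.txt, which yields file.with.dots, since only the last extension is stripped.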
I captured file from /path/to/file.pdf by using following regex:
[^/]*(?=\.[^.]+($|\?))
Hope this helps you
I had to use an extra backslash before the first ']' to make this work
[^\\\]*(?=[.][a-zA-Z]+$)
I use this pattern
[^\/]+[.+\.].*$ for / path separator
[^\\]+[.+\.].*$ for \ path separator
which matches the filename at the end of the string without worrying about characters. There is one exception: if the path for some reason has a folder with a period in it, this will get upset. Linux hidden directories that are preceded with a . like .rvm are unaffected.
Hope this helps.
http://rubular.com/r/LNrI4inMU1