Combining Regex 'in between' AND 'do not contain' - regex

I tried to combine two regexes with AND, but the attempt failed.
Pick up anything between '[[' and (']]' or '|') in direct succession :
(?<=(\[\[))(.*?)(?=(\||(\]\])))
Doesn't contain 'http' :
^(?:(?!http).)*$
My best try was
(?=(?<=(\[\[))(.*?)(?=(\||(\]\]))))(?=^(?:(?!http).)*$).*$
I followed https://stackoverflow.com/a/870506, but it is not working.
My goal is to get all the internal links in a DokuWiki page, typically 'my_page' and 'my_other_page' but not 'http://your_page', in:
[[my_page]]
[[my_other_page|this is my other page]]
[[http://your_page|this is your page]]

As an alternative, you could make use of a SKIP FAIL approach:
\[\[https?:\/\/(*SKIP)(*FAIL)|\[\[\K[^][|]+
The pattern matches:
\[\[https?:\/\/ Match [[http:// with an optional s
(*SKIP)(*FAIL) Consume the characters that you want to avoid
| Or
\[\[\K Match [[ and forget what is matched so far
[^][|]+ Match 1+ times any char except ] [ or |
Regex demo
$strings = [
"[[my_page]]",
"[[my_other_page|this is my other page]]",
"[[http://your_page|this is your page]]",
];
$re = '/\[\[https?:\/\/(*SKIP)(*FAIL)|\[\[\K[^][|]+/';
foreach ($strings as $s){
if (preg_match($re, $s, $matches)) {
var_dump($matches[0]);
}
}
Output
string(7) "my_page"
string(13) "my_other_page"
To verify the optional part with | and the closing ]] you can use a positive lookahead
\[\[https?:\/\/(*SKIP)(*FAIL)|\[\[\K[^][|]+(?=(?:\|[^][]*)?]])
Regex demo
Or if the last part can also contain ] or [
\[\[https?:\/\/(*SKIP)(*FAIL)|\[\[\K[^][|]+(?=(?:\|.*?)?]])
Regex demo
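Python's built-in re module supports neither (*SKIP)(*FAIL) nor \K, so the patterns above cannot be pasted into it unchanged. As a rough sketch of the same idea, assuming plain Python is acceptable, the unwanted [[http:// prefix can be matched by a first alternative that captures nothing, while the second alternative captures the page name; only matches with a filled group are kept. The names LINK_RE and internal_links are made up for this illustration.

```python
import re

# First alternative consumes the external-link prefix and captures nothing;
# the second alternative captures the page name that follows "[[".
LINK_RE = re.compile(r"\[\[https?://|\[\[([^\]\[|]+)")

def internal_links(text):
    """Return page names of [[...]] links that do not start with http(s)://."""
    return [m.group(1) for m in LINK_RE.finditer(text) if m.group(1) is not None]

page = "\n".join([
    "[[my_page]]",
    "[[my_other_page|this is my other page]]",
    "[[http://your_page|this is your page]]",
])
print(internal_links(page))  # ['my_page', 'my_other_page']
```

Unlike the lookahead variants, this sketch does not check for the closing ]].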

You can use
(?<=\[\[)(?!https?:\/\/)[^][|]+
(?<=\[\[)(?!https?:\/\/)[^][|]+(?=(?:\|[^][]*)?]])
See the regex demo
Details:
(?<=\[\[) - a positive lookbehind that matches a location immediately preceded with [[
(?!https?:\/\/) - a negative lookahead that cancels the match if there is http:// or https:// immediately to the right of the current location
[^][|]+ - one or more chars other than ], [ and |
(?=(?:\|[^][]*)?]]) - a positive lookahead that requires the following sequence of patterns immediately to the right of the current location:
(?:\|[^][]*)? - an optional occurrence of a | and then any zero or more chars other than [ and ]
]] - a ]] string.
NOTE: Depending on the regex flavor, you may need to escape ] or/and [ chars in the character class, i.e. [^][] => [^\][] (JavaScript RegExp) or [^\]\[] (Java, Ruby).
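For illustration, the lookaround pattern can be tried in Python, whose re module accepts the fixed-width lookbehind used here. This is only a sketch: the brackets inside the character classes are escaped to stay on the safe side, and the names PATTERN and find_internal_links are hypothetical.

```python
import re

PATTERN = re.compile(
    r"(?<=\[\[)"                 # location immediately preceded by [[
    r"(?!https?://)"             # no http:// or https:// right here
    r"[^\]\[|]+"                 # the page name: no ], [ or |
    r"(?=(?:\|[^\]\[]*)?\]\])"   # optional |label, then the closing ]]
)

def find_internal_links(text):
    """Return all internal page names found in DokuWiki-style link markup."""
    return PATTERN.findall(text)

page = (
    "[[my_page]]\n"
    "[[my_other_page|this is my other page]]\n"
    "[[http://your_page|this is your page]]"
)
print(find_internal_links(page))  # ['my_page', 'my_other_page']
```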

Here is the Regex Pattern with some adjustments to Wiktor Stribiżew's answer in the comment section. By using lookahead and lookbehind, you can deselect the brackets.
(?<=\[\[)(?!https?:\/\/)[^][|]*(?:\|[^][]*)?(?=]])

Related

Renaming files with criteria

Need some advice. I'm trying to do something with regular expressions that might not be possible, and if it is possible it's over my head. I can't get anything to work. I'm trying to create a tagging system for my PDF files. So if I have this file name:
"csharp 8 in a nutshell[studying programming csharp ebooks].pdf"
I would like all the words inside the '[ ]' to have a '#' in front of them. So the above file name would look like this:
"csharp 8 in a nutshell[#studying #programming #csharp #ebooks].pdf"
The problem is keeping the '#' inside the '[ ]'. For example I'd rather the 'csharp' at the very front of the file name not have the '#'.
Also, I'm using a bulk renamer called 'Bulk Rename Utility' to help me.
Can this be done?
If it can, any hints on how?
Thanks.
Bulk Rename Utility does not support replacing multiple matches; you can only match the whole file name and perform replacements using capturing groups/backreferences.
Since you are using Windows, I suggest using Powershell:
cd 'C:\YOUR_FOLDER\HERE'
Get-ChildItem -File | Rename-Item -NewName { $_.Name -replace '(?<=\[[^][]*?)\w+(?=[^][]*])','#$&' }
See this regex demo and the proof it works with .NET regex flavor.
(?<=\[[^][]*?) - right before this location, there must be a [ and then any amount of chars other than [ and ], as few as possible
\w+ - 1+ word chars
(?=[^][]*]) - right after this location, there must be any amount of chars other than [ and ], as many as possible, and then a ] char.
The replacement is # + the whole match value ($&).
Also, you may use
Get-ChildItem -File | Rename-Item -NewName { $_.Name -replace '(\G(?!\A)[^][\w]+|\[)(\w+)','$1#$2' }
See this regex demo and .NET regex test.
(\G(?!\A)[^][\w]+|\[) - Group 1 ($1): either the end of the previous match and 1+ chars other than ], [ and word chars, or a [ char
(\w+) - Group 2 ($2): one or more word chars.
If you only want to rename *.pdf files, replace Get-ChildItem -File with Get-ChildItem *.pdf.
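If Python is at hand instead of PowerShell, the same renaming rule can be sketched as below. Python's re module does not allow the variable-width lookbehind used in the first pattern, so this version swaps in a different technique: it first matches the whole [...] part and then tags the words inside it with a nested substitution. The function name tag_words is made up, and the actual renaming of files on disk is deliberately left out.

```python
import re

def tag_words(name):
    """Prefix every word inside [...] with '#', leaving the rest untouched."""
    def add_hashes(match):
        # match.group(1) is the text between the brackets
        return "[" + re.sub(r"\w+", r"#\g<0>", match.group(1)) + "]"
    return re.sub(r"\[([^\]\[]*)\]", add_hashes, name)

old = "csharp 8 in a nutshell[studying programming csharp ebooks].pdf"
print(tag_words(old))
# csharp 8 in a nutshell[#studying #programming #csharp #ebooks].pdf
```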
I assume there is at most one bracket-delimited substring.
You can replace zero-length matches of the following regular expression with '#' when using Perl, Ruby, Python's alternative regex engine, R with perl=TRUE, or languages that use the PCRE regex engine, which includes PHP. With the exception of Ruby, the case-insensitive (i) and global (g) flags need to be set. Ruby only requires the case-insensitive flag.
r = /(?:^.*\[ *|\G(?<!^)|[a-z]+ +)\K(?<=\[| )(?=[a-z][^\[\]]*\])/
If using Ruby, for example, one would execute
str = "csharp 8 in a nutshell[studying programming csharp ebooks].pdf"
str.gsub(r,'#')
#=> "csharp 8 in a nutshell[#studying #programming #csharp #ebooks].pdf"
I believe all of the languages I named above allow one to run a short script from the command line. (I provide a Ruby script below.)
The regex engine performs the following operations.
(?: : begin non-capture group
^.*\[ * : match beginning of string then 0+ characters then '['
then 0+ spaces
| : or
\G : asserts the position at the end of the previous match
or at the start of the string for the first match
(?<!^) : use a negative lookbehind to assert that the current
location is not the start of the string
| : or
[a-z]+ + : match 1+ letters then 1+ spaces
) : end non-capture group
\K : reset beginning of reported match to current location
and discard all previously-matched characters from match
to be returned
(?<= : begin positive lookbehind
\[|[ ] : match '[' or a space
) : end positive lookbehind
(?= : begin positive lookahead
[a-z][^\[\]]*\] : match a letter then 0+ characters other than '[' and ']'
then ']'
) : end positive lookahead
Another possibility (illustrated with Ruby) is to break the string into three pieces, modify the middle one, then rejoin the pieces:
first, mid, last = str.split /(?<=\[)|(?=\])/
#=> ["csharp 8 in a nutshell[",
# "studying programming csharp ebooks",
# "].pdf"]
first + mid.gsub(/(?<=\A| )(?! )/,'#') + last
#=> "csharp 8 in a nutshell[#studying #programming #csharp #ebooks].pdf"
The regex used by split reads, "match a (zero-width) string that is preceded by '[' ((?<=\[) being a positive lookbehind) or is followed by ']' ((?=\]) being a positive lookahead)". By matching zero-width strings, split does not remove any characters.
gsub's regex reads, "match a zero-width string that is at the start of the string or is preceded by a space, and is not followed by a space ((?! ) being a negative lookahead)". It could alternatively be written /(?<![^ ])(?! )/ ((?<![^ ]) being a negative lookbehind).
A variant:
first + mid.split.map { |s| '#' + s }.join(' ') + last
#=> "csharp 8 in a nutshell[#studying #programming #csharp #ebooks].pdf"
I created a file named 'in' that contains the following two lines:
Little [Miss Muffet sat on her] tuffet
eating her [curds and] whey
Here is an example of a (Ruby) script that could be run from the command line to perform the necessary replacements.
ruby -e "File.open('out', 'w') do |fout|
File.foreach('in') do |str|
first, mid, last = str.split(/(?<=\[)|(?=\])/)
fout.puts(first + mid.gsub(/(?<=\A| )(?! )/,'#') + last)
end
end"
This produces a file named 'out' that contains these two lines:
Little [#Miss #Muffet #sat #on #her] tuffet
eating her [#curds #and] whey
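The split, modify and rejoin idea carries over to other languages without any regex. A minimal Python sketch, assuming (as above) at most one bracket-delimited substring per line; the function name tag_bracketed is hypothetical:

```python
def tag_bracketed(line):
    """Put '#' in front of each word inside the first [...] of the line."""
    first, open_br, rest = line.partition("[")
    mid, close_br, last = rest.partition("]")
    if not open_br or not close_br:
        return line  # no complete [...] pair: leave the line unchanged
    # Like the Ruby split/map variant, this collapses runs of spaces.
    tagged = " ".join("#" + word for word in mid.split())
    return first + "[" + tagged + "]" + last

for line in ["Little [Miss Muffet sat on her] tuffet",
             "eating her [curds and] whey"]:
    print(tag_bracketed(line))
# Little [#Miss #Muffet #sat #on #her] tuffet
# eating her [#curds #and] whey
```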

How to delete two groups of characters with regex?

I have this type of string:
First part: [[archive 726|The Archive]] is a great start
And I want to print:
First part: The Archive is a great start
Here is what I've come up with so far:
input.gsub!(/\[\[(.*?)\|/,"")
print input
> "First part: The Archive]] is a great start"
How can I also match the ]]?
You may use
input.gsub!(/\[\[[^\]\[]*\|(.*?)\]\]/, '\1')
See the Rubular demo and a Ruby demo.
Details
\[\[ - a [[ substring
[^\]\[]* - any 0 or more chars other than [ and ], as many as possible (if there are multiple | chars inside [[...]], replace * with *? to match as few as possible)
\| - a | char
(.*?) - Group 1 (the group value is referred to with \1 from the replacement pattern, mind the single quotes around \1): any 0 or more chars other than line break chars, as few as possible
\]\] - a ]] substring.
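The pattern is not Ruby-specific. As a quick cross-check, here is the same substitution sketched in Python, where \1 in the replacement likewise refers to Group 1; the function name unwrap_links is made up for the example.

```python
import re

def unwrap_links(text):
    """Replace [[target|label]] with just the label."""
    return re.sub(r"\[\[[^\]\[]*\|(.*?)\]\]", r"\1", text)

s = "First part: [[archive 726|The Archive]] is a great start"
print(unwrap_links(s))  # First part: The Archive is a great start
```

Links without a | (for example [[page]]) do not match and are left as they are.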

Match a string that doesn't contain another string in Bash?

I want to match a string that contains some text in the beginning and end but doesn't contain a different text in the middle. For example: starts with a word (\w+) and ends with another one but doesn't contain NOT in between:
some_YES_text // ok
other_COOL_string // also ok
some_NOT_string // don't want to match this
Normally, I could do that with negative lookahead:
\w+_(?!NOT)\w+_\w+
But I'm writing a script in Bash which doesn't support it. What is the easiest way to achieve the same effect?
Edit: I wasn't precise before - I still need to use regex, not just plain text matching.
You may match abc_NOT_def or abc_anywordhere_def and capture one of them, or part of them, and upon a match, check if that capture is not empty. Then, just implement the logic you need:
s="other_NOT_string"
rx='^([[:alnum:]_]+_(NOT)_[[:alnum:]_]+|[[:alnum:]_]+_[[:alnum:]_]+_[[:alnum:]_]+)$'
if [[ "$s" =~ $rx ]]; then
if [ -z "${BASH_REMATCH[2]}" ]; then
echo "MATCH: ${BASH_REMATCH[0]}"
else
echo "No match"
fi;
else
echo "No match"
fi;
Details
^ - start of string
( - Start of Group 1:
[[:alnum:]_]+_ - 1+ word chars (POSIX ERE \w equivalent) and a _
(NOT) - Group 2: NOT
_[[:alnum:]_]+ - _ and 1+ word chars
| - or
[[:alnum:]_]+_[[:alnum:]_]+_[[:alnum:]_]+ - 1+ word chars, _, 1+ word chars, _ and again 1+ word chars
) - end of Group 1.
$ - end of string
The -z test on ${BASH_REMATCH[2]} checks whether NOT was matched. If it was, there is no valid match; otherwise, there is one.
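The "match the unwanted case too, then inspect the capture" trick does not depend on Bash. Purely to show the logic in isolation, here is a Python sketch (Python does have lookaheads, so this is not the usual way to do it there). The outer group is non-capturing in this version, so NOT lands in group 1 rather than group 2, and the names RX and accepts are hypothetical.

```python
import re

# Alternative 1 captures NOT when it is the middle word;
# alternative 2 matches any three underscore-separated words.
RX = re.compile(r"^(?:\w+_(NOT)_\w+|\w+_\w+_\w+)$")

def accepts(s):
    """True if s has the word_word_word shape and NOT was not captured."""
    m = RX.match(s)
    return m is not None and m.group(1) is None

for s in ["some_YES_text", "other_COOL_string", "some_NOT_string"]:
    print(s, accepts(s))
# some_YES_text True
# other_COOL_string True
# some_NOT_string False
```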

Explode string with comma when comma is not inside any brackets

I have the string "xyz(text1,(text2,text3)),asd". I want to explode it on , but only on commas that are not inside any brackets (here the brackets are ()).
I saw many such solutions on Stack Overflow, but they didn't work with my pattern. (example1) (example2)
What is correct regex for my pattern?
In my case xyz(text1,(text2,text3)),asd
result should be
xyz(text1,(text2,text3)) and asd.
You may use a matching approach using a regex with a subroutine:
preg_match_all('~\w+(\((?:[^()]++|(?1))*\))?~', $s, $m)
See the regex demo
Details
\w+ - 1+ word chars
(\((?:[^()]++|(?1))*\))? - an optional capturing group matching
\( - a (
(?:[^()]++|(?1))* - zero or more occurrences of
[^()]++ - 1+ chars other than ( and )
| - or
(?1) - the whole Group 1 pattern
\) - a ).
PHP demo:
$rx = '/\w+(\((?:[^()]++|(?1))*\))?/';
$s = 'xyz(text1,(text2,text3)),asd';
if (preg_match_all($rx, $s, $m)) {
print_r($m[0]);
}
Output:
Array
(
[0] => xyz(text1,(text2,text3))
[1] => asd
)
If the requirement is to split at , but only outside nested parentheses, another idea is to use preg_split and skip the parenthesized parts, again with a recursive pattern.
$res = preg_split('/(\((?>[^)(]*(?1)?)*\))(*SKIP)(*F)|,/', $str);
See this pattern demo at regex101 or a PHP demo at eval.in
The left side of the pipe character matches and skips what is inside the parentheses.
The right side matches the remaining commas outside the parentheses.
The pattern is a variant of several common patterns for matching nested parentheses.
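Both patterns rely on PCRE recursion, which not every engine offers; Python's built-in re module, for instance, has neither (?1) nor (*SKIP)(*F). Where that is the case, a plain depth counter is a common substitute. The sketch below is not a regex solution: it simply splits on commas seen at nesting depth zero. The function name split_top_level is hypothetical.

```python
def split_top_level(text, sep=","):
    """Split text on sep, ignoring separators inside (possibly nested) parentheses."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(text):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth = max(depth - 1, 0)  # tolerate a stray closing parenthesis
        elif ch == sep and depth == 0:
            parts.append(text[start:i])
            start = i + 1
    parts.append(text[start:])  # the piece after the last top-level separator
    return parts

print(split_top_level("xyz(text1,(text2,text3)),asd"))
# ['xyz(text1,(text2,text3))', 'asd']
```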

Exclude an escaped character from a range

I need to extract an expression between brackets that can contain anything except an unescaped closing bracket.
For example, applying the regexp to [aaa\]bbbbbb] should give the result: aaa\]bbbbbb.
I tried this: \[([^(?<!\\)\]]*)\] but it fails.
Any hints?
You may use
\[([^\]\[\\]*(?:\\.[^\]\[\\]*)*)]
Or - if there may be any non-escaped [ in-between non-escaped [ and ] (e.g. [a[\[aa\]bbbbbba\[aabbbbbb]), take out the \[:
\[([^\]\\]*(?:\\.[^\]\\]*)*)]
See the regex demo 1 and regex demo 2. It is an unrolled variant of a \[((?:[^][\\]|\\.)*)] regex.
Details:
\[ - a [
([^\]\[\\]*(?:\\.[^\]\[\\]*)*) - Group 1 capturing:
[^\]\[\\]* - zero or more chars other than [, ] and \ (in some regex flavors, you may write it without escapes - [^][\\]*)
(?:\\.[^\]\[\\]*)* - zero or more sequences of:
\\. - any escaped sequence (\ and any char other than line break chars)
[^\]\[\\]* - zero or more chars other than [, ] and \
] - a closing ].
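To see the first pattern at work, here is a small Python sketch (the final ] is escaped for readability, and the names BRACKETED and first_bracketed are made up for the example):

```python
import re

# Unrolled form: plain chars, then any number of (escape + plain chars) runs.
BRACKETED = re.compile(r"\[([^\]\[\\]*(?:\\.[^\]\[\\]*)*)\]")

def first_bracketed(text):
    """Return the content of the first [...] group, honouring backslash escapes."""
    m = BRACKETED.search(text)
    return m.group(1) if m else None

print(first_bracketed(r"see [aaa\]bbbbbb] here"))  # aaa\]bbbbbb
```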
This is the simplest regex that (I think) works:
\[(.*?)(?<!\\)\]
which captures the bracketed text as group 1.
See live demo.
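A short Python sketch of this lookbehind version, with hypothetical names SIMPLE and bracket_content. It handles the example from the question; one caveat worth knowing is that the lookbehind inspects only the single preceding character, so an escaped backslash directly before ] is read as an escaped bracket.

```python
import re

SIMPLE = re.compile(r"\[(.*?)(?<!\\)\]")

def bracket_content(text):
    """Return the text up to the first ] that is not preceded by a backslash."""
    m = SIMPLE.search(text)
    return m.group(1) if m else None

print(bracket_content(r"[aaa\]bbbbbb]"))  # aaa\]bbbbbb
# Edge case: the ] after the escaped backslash is skipped, so the match runs on.
print(bracket_content(r"[a\\]b]"))        # a\\]b
```

If escaped backslashes can occur in the input, the unrolled pattern from the other answer is the safer choice.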