Match a string that doesn't contain another string in Bash? - regex

I want to match a string that contains some text in the beginning and end but doesn't contain a different text in the middle. For example: starts with a word (\w+) and ends with another one but doesn't contain NOT in between:
some_YES_text // ok
other_COOL_string // also ok
some_NOT_string // don't want to match this
Normally, I could do that with negative lookahead:
\w+_(?!NOT)\w+_\w+
But I'm writing a script in Bash which doesn't support it. What is the easiest way to achieve the same effect?
Edit: I wasn't precise before - I still need to use regex, not just plain text matching.

You may match abc_NOT_def or abc_anywordhere_def and capture one of them, or part of them, and upon a match, check if that capture is not empty. Then, just implement the logic you need:
s="other_NOT_string"
rx='^([[:alnum:]_]+_(NOT)_[[:alnum:]_]+|[[:alnum:]_]+_[[:alnum:]_]+_[[:alnum:]_]+)$'
if [[ "$s" =~ $rx ]]; then
    # Quote the expansion so the test stays well-formed whether or not Group 2 matched
    if [ -z "${BASH_REMATCH[2]}" ]; then
        echo "MATCH: ${BASH_REMATCH[0]}"
    else
        echo "No match"
    fi
else
    echo "No match"
fi
Details
^ - start of string
( - Start of Group 1:
[[:alnum:]_]+_ - 1+ word chars (POSIX ERE \w equivalent) and a _
(NOT) - Group 2: NOT
_[[:alnum:]_]+ - _ and 1+ word chars
| - or
[[:alnum:]_]+_[[:alnum:]_]+_[[:alnum:]_]+ - 1+ word chars, _, 1+ word chars, _ and again 1+ word chars
) - end of Group 1.
$ - end of string
With the -z test on ${BASH_REMATCH[2]}, we check whether NOT was matched. If it was, there is no valid match; otherwise, there is one.
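For comparison, the same capture-and-check idea can be sketched in Python. This is only an illustration and not part of the Bash solution: the helper name matches_without_not and the explicit ASCII class standing in for [[:alnum:]_] are assumptions made for the example.

```python
import re

# ASCII stand-in for the POSIX class [[:alnum:]_]+ used in the Bash pattern
WORD = r'[A-Za-z0-9_]+'

# Group 2 participates in the match only when the forbidden word NOT is found
RX = re.compile(rf'({WORD}_(NOT)_{WORD}|{WORD}_{WORD}_{WORD})')


def matches_without_not(s):
    """Return True when s has the word_word_word shape and Group 2 (NOT) is unset."""
    m = RX.fullmatch(s)
    return m is not None and m.group(2) is None


if __name__ == "__main__":
    for sample in ("some_YES_text", "other_COOL_string", "some_NOT_string"):
        print(sample, matches_without_not(sample))
```

As in the Bash version, the first alternative is tried first, so a string containing _NOT_ still matches the regex but is rejected by the check on Group 2.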

Related

Combining Regex 'in between' AND 'do not contain'

I've tried to combine two Regex with AND but failed at the attempt.
Pick up anything between '[[' and (']]' or '|') in direct succession :
(?<=(\[\[))(.*?)(?=(\||(\]\])))
Doesn't contain 'http' :
^(?:(?!http).)*$
My best try was
(?=(?<=(\[\[))(.*?)(?=(\||(\]\]))))(?=^(?:(?!http).)*$).*$
Following https://stackoverflow.com/a/870506 but it is not working.
My goal is to get all the internal links in a DokuWiki page, typically 'my_page' and 'my_other_page' but not 'http://your_page', in:
[[my_page]]
[[my_other_page|this is my other page]]
[[http://your_page|this is your page]]
As an alternative, you could make use of a SKIP FAIL approach:
\[\[https?:\/\/(*SKIP)(*FAIL)|\[\[\K[^][|]+
The pattern matches:
\[\[https?:\/\/ Match [[http:// with an optional s
(*SKIP)(*FAIL) Consume the characters that you want to avoid, then fail the match
| Or
\[\[\K Match [[ and forget what is matched so far
[^][|]+ Match 1+ times any char except ] [ or |
$strings = [
    "[[my_page]]",
    "[[my_other_page|this is my other page]]",
    "[[http://your_page|this is your page]]",
];
$re = '/\[\[https?:\/\/(*SKIP)(*FAIL)|\[\[\K[^][|]+/';
foreach ($strings as $s) {
    if (preg_match($re, $s, $matches)) {
        var_dump($matches[0]);
    }
}
Output
string(7) "my_page"
string(13) "my_other_page"
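Engines without backtracking control verbs, such as Python's built-in re module, support neither (*SKIP)(*FAIL) nor \K. A similar effect is commonly achieved there by matching the unwanted prefix in one alternative and capturing the wanted text in the other. The following is a minimal sketch of that substitute technique, not the PCRE pattern itself; internal_links is a hypothetical helper name.

```python
import re

# Alternative 1 consumes "[[http://" or "[[https://" without capturing anything.
# Alternative 2 captures the page name that follows "[[".
LINK = re.compile(r'\[\[https?://|\[\[([^\]\[|]+)')


def internal_links(text):
    """Return page names from [[...]] links, skipping external http(s) links."""
    return [m.group(1) for m in LINK.finditer(text) if m.group(1) is not None]


if __name__ == "__main__":
    page = (
        "[[my_page]]\n"
        "[[my_other_page|this is my other page]]\n"
        "[[http://your_page|this is your page]]\n"
    )
    print(internal_links(page))  # ['my_page', 'my_other_page']
```

Because the external-link alternative comes first and consumes the [[http:// prefix, the second alternative never gets a chance to start at that position.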
To verify the optional part with | and the closing ]] you can use a positive lookahead
\[\[https?:\/\/(*SKIP)(*FAIL)|\[\[\K[^][|]+(?=(?:\|[^][]*)?]])
Or if the last part can also contain ] or [
\[\[https?:\/\/(*SKIP)(*FAIL)|\[\[\K[^][|]+(?=(?:\|.*?)?]])
Then use
(?<=\[\[)(?!https?:\/\/)[^][|]+
(?<=\[\[)(?!https?:\/\/)[^][|]+(?=(?:\|[^][]*)?]])
See the regex demo
Details:
(?<=\[\[) - a positive lookbehind that matches a location immediately preceded with [[
(?!https?:\/\/) - a negative lookahead that cancels the match if there is http:// or https:// immediately to the right of the current location
[^][|]+ - one or more chars other than ], [ and |
(?=(?:\|[^][]*)?]]) - a positive lookahead that requires the following sequence of patterns immediately to the right of the current location:
(?:\|[^][]*)? - an optional occurrence of a | and then any zero or more chars other than [ and ]
]] - a ]] string.
NOTE: Depending on the regex flavor, you may need to escape ] or/and [ chars in the character class, i.e. [^][] => [^\][] (JavaScript RegExp) or [^\]\[] (Java, Ruby).
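Since the lookbehind in this pattern has a fixed width, the same approach should also work in Python's re module once both brackets are escaped inside the character classes. A short sketch under that assumption, with find_internal_links as a hypothetical helper name:

```python
import re

# Fixed-width lookbehind for "[[", negative lookahead for an http(s) scheme,
# then the page name, then a lookahead for an optional "|label" and the closing "]]".
PATTERN = re.compile(
    r'(?<=\[\[)(?!https?://)[^\]\[|]+(?=(?:\|[^\]\[]*)?\]\])'
)


def find_internal_links(text):
    """Return the page names of internal [[...]] links found in text."""
    return PATTERN.findall(text)


if __name__ == "__main__":
    page = (
        "[[my_page]]\n"
        "[[my_other_page|this is my other page]]\n"
        "[[http://your_page|this is your page]]\n"
    )
    print(find_internal_links(page))  # ['my_page', 'my_other_page']
```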
Here is a regex pattern with some adjustments to Wiktor Stribiżew's answer in the comment section. By using a lookahead and a lookbehind, you can exclude the brackets from the match.
(?<=\[\[)(?!https?:\/\/)[^][|]*(?:\|[^][]*)?(?=]])

Renaming files with criteria

Need some advice. I'm trying to do something with regular expressions that might not be possible, and if it is possible it's over my head. I can't get anything to work. I'm trying to create a tagging system for my PDF files. So if I have this file name:
"csharp 8 in a nutshell[studying programming csharp ebooks].pdf"
I would like all the words inside the '[ ]' to have a '#' in front of them. So the above file name would look like this:
"csharp 8 in a nutshell[#studying #programming #csharp #ebooks].pdf"
The problem is keeping the '#' inside the '[ ]'. For example I'd rather the 'csharp' at the very front of the file name not have the '#'.
Also, I'm using a bulk renamer called 'Bulk Rename Utility' to help me.
Can this be done?
If it can, any hints on how?
Thanks.
Bulk Rename Utility does not support replacing multiple matches; you can only match the whole file name and perform replacements using capturing groups/backreferences.
Since you are using Windows, I suggest using PowerShell:
cd 'C:\YOUR_FOLDER\HERE'
Get-ChildItem -File | Rename-Item -NewName { $_.Name -replace '(?<=\[[^][]*?)\w+(?=[^][]*])','#$&' }
See this regex demo and the proof it works with .NET regex flavor.
(?<=\[[^][]*?) - right before this location, there must be a [ and then any amount of chars other than [ and ], as few as possible
\w+ - 1+ word chars
(?=[^][]*]) - right after this location, there must be any amount of chars other than [ and ], as many as possible, and then a ] char.
The replacement is # + the whole match value ($&).
Also, you may use
Get-ChildItem -File | Rename-Item -NewName { $_.Name -replace '(\G(?!\A)[^][\w]+|\[)(\w+)','$1#$2' }
See this regex demo and .NET regex test.
(\G(?!\A)[^][\w]+|\[) - Group 1 ($1): either the end of the previous match and 1+ chars other than ], [ and word chars, or a [ char
(\w+) - Group 2 ($2): one or more word chars.
If you only want to rename *.pdf files, replace Get-ChildItem -File with Get-ChildItem *.pdf.
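Python's built-in re module supports neither variable-length lookbehinds nor \G, so the two patterns above do not carry over directly. If the renaming were scripted in Python instead, one possible substitute is a nested substitution: find each bracketed section, then prefix the words inside it. This sketch only computes the new name and does not touch the file system; tag_bracketed_words is a hypothetical helper name.

```python
import re


def tag_bracketed_words(name):
    """Prefix every word inside [...] with '#', leaving the rest of the name alone."""

    def tag_section(match):
        # match.group(1) is the text between the brackets
        inner = re.sub(r'\w+', lambda w: '#' + w.group(0), match.group(1))
        return '[' + inner + ']'

    return re.sub(r'\[([^\]\[]*)\]', tag_section, name)


if __name__ == "__main__":
    old = "csharp 8 in a nutshell[studying programming csharp ebooks].pdf"
    print(tag_bracketed_words(old))
    # csharp 8 in a nutshell[#studying #programming #csharp #ebooks].pdf
```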
I assume there is at most one bracket-delimited substring.
You can replace zero-length matches of the following regular expression with '#' when using Perl, Ruby, Python's alternative regex engine, R with perl=TRUE, or languages that use the PCRE regex engine, which includes PHP. With the exception of Ruby, the case-insensitive (i) and global (g) flags need to be set. Ruby only requires the case-insensitive flag.
r = /(?:^.*\[ *|\G(?<!^)|[a-z]+ +)\K(?<=\[| )(?=[a-z][^\[\]]*\])/
If using Ruby, for example, one would execute
str = "csharp 8 in a nutshell[studying programming csharp ebooks].pdf"
str.gsub(r,'#')
#=> "csharp 8 in a nutshell[#studying #programming #csharp #ebooks].pdf"
I believe all of the languages I named above allow one to run a short script from the command line. (I provide a Ruby script below.)
The regex engine performs the following operations.
(?: : begin non-capture group
^.*\[ * : match beginning of string then 0+ characters then '['
then 0+ spaces
| : or
\G : asserts the position at the end of the previous match
or at the start of the string for the first match
(?<!^) : use a negative lookbehind to assert that the current
location is not the start of the string
| : or
[a-z]+ + : match 1+ letters then 1+ spaces
) : end non-capture group
\K : reset beginning of reported match to current location
and discard all previously-matched characters from match
to be returned
(?<= : begin positive lookbehind
\[|[ ] : match '[' or a space
) : end positive lookbehind
(?= : begin positive lookahead
[a-z][^\[\]]*\] : match a letter then 0+ characters other than '[' and ']'
then ']'
) : end positive lookahead
Another possibility (illustrated with Ruby) is to break the string into three pieces, modify the middle one, then rejoin the pieces:
first, mid, last = str.split /(?<=\[)|(?=\])/
#=> ["csharp 8 in a nutshell[",
# "studying programming csharp ebooks",
# "].pdf"]
first + mid.gsub(/(?<=\A| )(?! )/,'#') + last
#=> "csharp 8 in a nutshell[#studying #programming #csharp #ebooks].pdf"
The regex used by split reads, "match a (zero-width) string that is preceded by '[' or is followed by ']'" ((?<=\[) being a positive lookbehind and (?=\]) being a positive lookahead). By matching zero-width strings, split does not remove any characters.
gsub's regex reads, "match a zero-width string that is at the start of the string or is preceded by a space, and is not followed by a space" ((?! ) being a negative lookahead). It could alternatively be written /(?<![^ ])(?! )/ ((?<![^ ]) being a negative lookbehind).
A variant:
first + mid.split.map { |s| '#' + s }.join(' ') + last
#=> "csharp 8 in a nutshell[#studying #programming #csharp #ebooks].pdf"
I created a file named 'in' that contains the following two lines:
Little [Miss Muffet sat on her] tuffet
eating her [curds and] whey
Here is an example of a (Ruby) script that could be run from the command line to perform the necessary replacements.
ruby -e "File.open('out', 'w') do |fout|
File.foreach('in') do |str|
first, mid, last = str.split(/(?<=\[)|(?=\])/)
fout.puts(first + mid.gsub(/(?<=\A| )(?! )/,'#') + last)
end
end"
This produces a file named 'out' that contains these two lines:
Little [#Miss #Muffet #sat #on #her] tuffet
eating her [#curds #and] whey
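The split, modify and rejoin idea does not depend on Ruby. As a rough Python illustration, assuming (as above) at most one bracket-delimited substring per line, str.partition can do the splitting without any regex. The helper name prefix_tags is hypothetical.

```python
def prefix_tags(line):
    """Add '#' before each word between the first '[' and the following ']'."""
    first, opener, rest = line.partition('[')
    mid, closer, last = rest.partition(']')
    if not opener or not closer:
        # No complete [...] section: return the line unchanged
        return line
    # split() with no argument also collapses runs of spaces between words
    tagged = ' '.join('#' + word for word in mid.split())
    return first + opener + tagged + closer + last


if __name__ == "__main__":
    for line in ("Little [Miss Muffet sat on her] tuffet",
                 "eating her [curds and] whey"):
        print(prefix_tags(line))
```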

How to delete two groups of characters with regex?

I have this type of string:
First part: [[archive 726|The Archive]] is a great start
And I want to print:
First part: The Archive is a great start
Here is what I've come up with so far:
input.gsub!(/\[\[(.*?)\|/,"")
print input
> "First part: The Archive]] is a great start"
How can I also match the ]]?
You may use
input.gsub!(/\[\[[^\]\[]*\|(.*?)\]\]/, '\1')
See the Rubular demo and a Ruby demo.
Details
\[\[ - a [[ substring
[^\]\[]* - any 0 or more chars other than [ and ], as many as possible (if there are multiple | chars inside [[...]], replace * with *? to match as few as possible)
\| - a | char
(.*?) - Group 1 (the group value is referred to with \1 from the replacement pattern, mind the single quotes around \1): any 0 or more chars other than line break chars, as few as possible
\]\] - a ]] substring.
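The same replacement translates almost directly to other regex engines. A small Python sketch for comparison, where strip_link_markup is a hypothetical helper name:

```python
import re

# [[target|label]] -> label; Group 1 holds the label text
LINK = re.compile(r'\[\[[^\]\[]*\|(.*?)\]\]')


def strip_link_markup(text):
    """Replace each [[target|label]] with just its label."""
    return LINK.sub(r'\1', text)


if __name__ == "__main__":
    s = "First part: [[archive 726|The Archive]] is a great start"
    print(strip_link_markup(s))  # First part: The Archive is a great start
```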

Looking for regex to match before and after a number

Given the string
170905-CBM-238.pdf
I'm trying to match 170905-CBM and .pdf so that I can replace/remove them and be left with 238.
I've searched and found pieces that work but can't put it all together.
This-> (.*-) will match the first section and
This-> (.[^/.]+$) will match the last section
But I can't figure out how to tie them together so that it matches everything before, including the second dash and everything after, including the period (or the extension) but does not match the numbers between.
help :) and thank you for your kind consideration.
There are several options to achieve what you need in Nintex.
If you use Extract operation, use (?<=^.*-)\d+(?=\.[^.]*$) as Pattern.
See the regex demo.
Details
(?<=^.*-) - a positive lookbehind requiring, immediately to the left of the current location, the start of string (^), then any 0+ chars other than LF as many as possible up to the last occurrence of - and the subsequent subpatterns
\d+ - 1 or more digits
(?=\.[^.]*$) - a positive lookahead requiring, immediately to the right of the current location, the presence of a . and 0+ chars other than . up to the end of the string.
If you use Replace text operation, use
Pattern: ^.*-([0-9]+)\.[^.]+$
Replacement text: $1
See another regex demo (the Context tab shows the result of the replacement).
Details
^ - a start of string anchor
.* - any 0+ chars other than LF up to the last occurrence of the subsequent subpatterns...
- - a hyphen
([0-9]+) - Group 1: one or more ASCII digits
\. - a literal .
[^.]+ - 1 or more chars other than .
$ - end of string.
The replacement $1 references the value stored in Group 1.
I don't know Nintex regex, but here is a sed-style regex:
$ echo "170905-CBM-238.pdf" | sed -E 's/^.*-([0-9]*)\.[^.]*$/\1/'
238
Same works in Perl:
$ echo "170905-CBM-238.pdf" | perl -pe 's/^.*-([0-9]*)\.[^.]*$/$1/'
238
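For completeness, the capture-group variant can be sketched in Python as well. Rather than replacing the whole string, this version reads Group 1 directly and returns None when the name does not have the expected shape; extract_number is a hypothetical helper name.

```python
import re

# Everything up to the last hyphen, then the digits (Group 1), then the extension
NUMBER = re.compile(r'^.*-([0-9]+)\.[^.]+$')


def extract_number(filename):
    """Return the digits between the last hyphen and the extension, or None."""
    m = NUMBER.search(filename)
    return m.group(1) if m else None


if __name__ == "__main__":
    print(extract_number("170905-CBM-238.pdf"))  # 238
```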

PowerShell -replace to get string between two different characters

I am currently using split to get what I need, but I am hoping there is a better way in PowerShell.
Here is the string:
server=ss8.server.com;database=CSSDatabase;uid=WS_CSSDatabase;pwd=abc123-1cda23-123-A7A0-CC54;Max Pool Size=5000
I want to get the server and database values without the server= or database= prefix.
Here is the method I am currently using:
$databaseserver = (($details.value).split(';')[0]).split('=')[1]
$database = (($details.value).split(';')[1]).split('=')[1]
This outputs to:
ss8.server.com
CSSDatabase
I would like it to be as simple as possible.
Thank you in advance
Replacing approach
You may use the following regex replace:
$s = 'server=ss8.server.com;database=CSSDatabase;uid=WS_CSSDatabase;pwd=abc123-1cda23-123-A7A0-CC54;Max Pool Size=5000'
$dbserver = $s -replace '^server=([^;]+).*', '$1'
$db = $s -replace '^[^;]*;database=([^;]+).*', '$1'
The technique is to match and capture (with (...)) what we need and just match what we need to remove.
Pattern details:
^ - start of the line
server= - a literal substring
([^;]+) - Group 1 (what $1 refers to) matching 1+ chars other than ;
.* - any 0+ chars other than a newline, as many as possible
Pattern 2 is almost the same, the capturing group is shifted a bit to capture another detail, and some more literal values are added to match the right context.
Note: if the values you need to extract may appear anywhere in the string, replace ^ in the first one and ^[^;]*; pattern in the second one with .*?\b (any 0+ chars other than a newline, as few as possible followed with a word boundary).
Matching approach
With a -match, you may do it the following way:
$s -match '^server=(.+?);database=([^;]+)'
The $Matches[1] will contain the server details and $Matches[2] will hold the DB info:
Name                           Value
----                           -----
2                              CSSDatabase
1                              ss8.server.com
0                              server=ss8.server.com;database=CSSDatabase
Pattern details
^ - start of string
server= - literal substring
(.+?) - Group 1: any 1+ non-linebreak chars as few as possible
;database= - literal substring
([^;]+) - 1+ chars other than ;
Another solution with a RegEx and named capture groups, similar to Wiktor's Matching Approach.
$s = 'server=ss8.server.com;database=CSSDatabase;uid=WS_CSSDatabase;pwd=abc123-1cda23-123-A7A0-CC54;Max Pool Size=5000'
$RegEx = '^server=(?<databaseserver>[^;]+);database=(?<database>[^;]+)'
if ($s -match $RegEx){
$Matches.databaseserver
$Matches.database
}
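Named capture groups work much the same way outside PowerShell. As an illustration only, here is the equivalent pattern in Python, which spells named groups as (?P<name>...); parse_connection is a hypothetical helper name, and the pattern assumes server= comes first and is immediately followed by database=, as in the sample string.

```python
import re

CONN = re.compile(r'^server=(?P<databaseserver>[^;]+);database=(?P<database>[^;]+)')


def parse_connection(conn_string):
    """Return a dict with 'databaseserver' and 'database', or None if no match."""
    m = CONN.match(conn_string)
    return m.groupdict() if m else None


if __name__ == "__main__":
    s = ("server=ss8.server.com;database=CSSDatabase;uid=WS_CSSDatabase;"
         "pwd=abc123-1cda23-123-A7A0-CC54;Max Pool Size=5000")
    info = parse_connection(s)
    print(info["databaseserver"])  # ss8.server.com
    print(info["database"])        # CSSDatabase
```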