bash 2.0 string matching - regex

I'm on GNU bash, version 2.05b.0(1)-release (2002). I'd like to determine whether the value of $1 is a path in one of those /path/*.log rules in, say, /etc/logrotate.conf. It's not my box so I can't upgrade it.
Edit: my real goal is, given /path/actual.log, to determine whether it is already governed by logrotate or whether all the current rules miss it. I wonder, then, if my script should just run logrotate -d /etc/logrotate.conf and check whether /path/actual.log appears in the output. This seems simpler, and it covers all the cases, unlike the string-matching approach.
But I still want to know how to approach string matching in Bash 2.0 in general...
the line itself can start with some white space or none
it's not a match if it is in a commented line (comments are lines where the first non white space char is #)
there can be one or more paths on the same line to the left of $1
like if $1 is /my/path/*.log and the line in question is
/other/path*.log /yet/another.log /my/path/*.log {
there can be one or more paths to the right as well
the line itself can end with { and even more white space or not
paths can be contained in double-quotes or not
it can be assumed that the file is a valid logrotate conf file.
I have something that seems to work in Bash 4 but not in Bash 2.05. Where can I go to read what Bash 2.0 supports? How would this matching be checked in Bash 2.0?

You can find a terse bash changelog here.
You'll see that =~, the regex-matching operator, didn't get introduced until version 3.0.
Thus, your best bet is to use a utility to perform the regex matching for you; e.g.:
if grep -Eq '<your-extended-regex>' <<<"$1"; then ...
grep -Eq '<your-extended-regex>' <<<"$1":
IS like [[ $1 =~ <your-extended-regex> ]] in Bash 3.0+ in that its exit code indicates whether the literal value of $1 matches the extended regex <your-extended-regex>
Note that Bash 3.1 changed the interpretation of the RHS to treat quoted (sub)strings as literals.
Also note that grep -E may support a slightly different regular-expression dialect.
is NOT like it in that the grep solution cannot return capture groups; by contrast, Bash 3.0+ provides the overall match and capture groups via the special array variable ${BASH_REMATCH[@]}.
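As a rough sketch of how that could look for the logrotate case without =~ (the matches helper and the sample line are made up for illustration, and the path is escaped by hand as an extended regex), something along these lines should run on Bash 2.05b. The sed call at the end is one assumed way to stand in for a capture group:

```shell
# Hypothetical helper: succeed if string $1 matches extended regex $2.
matches() {
  printf '%s\n' "$1" | grep -Eq -- "$2"
}

# Made-up sample line in the style of a logrotate rule.
line='  /other/path*.log "/my/path/*.log" {'

# First test: the first non-whitespace character is not '#' (not a comment).
# Second test: the hand-escaped path appears as a whole word, quoted or not.
if matches "$line" '^[[:space:]]*[^#[:space:]]' &&
   matches "$line" '(^|[[:space:]"])/my/path/\*\.log([[:space:]"{]|$)'; then
  result=governed
else
  result=missed
fi
printf '%s\n' "$result"

# grep cannot hand back capture groups; sed -n 's/…\(…\)…/\1/p' can.
# Here: extract the first path on the line.
first_path=$(printf '%s\n' "$line" |
  sed -n 's/^[[:space:]]*"\{0,1\}\([^"[:space:]]*\).*/\1/p')
printf '%s\n' "$first_path"
```

Escaping an arbitrary $1 for use inside the regex is left out here; that part would need extra care.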

Related

Unable to make the mentioned regular expression to work in sed command

I am trying to make the following regular expressions to work in sed command in bash.
^[^<]?(https?:\/\/(www\.)?[-a-zA-Z0-9#:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()#:%_\+.~#?&\/\/=]*))[^>]?$
I know the regular expression is correct and it is working as I expected. So; there is no help needed with that. I tested it on online regular expressions tester and it is working as per my expectations.
Please find the demo of the above regex in here.
My requirement:
I want to enclose every url inside <>. If the url is already enclosed; then append it to the result as can be seen in the above regex link.
Sample Input: (in the file named websites.txt)
// List of all legal urls
https://www.google.com/
https://www.fakesite.co.in
https://www.fakesite.co.uk
<https://www.fakesite.co.uk>
<https://www.google.com/>
Expected Output:(in the file named output.txt)
<https://www.google.com/> // Please notice every url is enclosed in the <>.
<https://www.fakesite.co.in>
<https://www.fakesite.co.uk>
<https://www.fakesite.co.uk> // Please notice if the url is already enclosed in <> then it is appended as it is.
<https://www.google.com/>
What I tried in sed:
I'm not well-versed in bash commands, so at first I was not able to capture the group properly in sed. After reading this answer, I figured out that the parentheses need to be escaped to capture a group.
I read somewhere that look-arounds are not supported in sed (GNU-based), so I removed the look-arounds too, but that didn't work either. Without look-arounds, I used this regex instead and it served my purpose.
Then; this is my latest try with sed command:
sed 's#^[^<]?(https?://(?:www\.)?[-a-zA-Z0-9#:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b(?:[-a-zA-Z0-9()#:%_\+.~#?&/=]*))[^>]?$#<\1>#gm;t;d' websites.txt > output.txt
My exact problem:
How can I make the above command work properly? If you run the sample command I attached above in point 3, you'll see it is not replacing the contents properly. It just dumps the contents of websites.txt to output.txt. In the regex demo attached above, however, it works properly, i.e. it encloses all the unenclosed websites in <>. Any suggestions would be helpful. I would prefer a sed solution, but could the above command also be converted to awk? If you can help me with that too, I'd be much obliged. Thanks.
After working on it for a long time, I got my sed command to work. Below is the command that worked.
sed -E 's#^[^<]?(https?://(www\.)?[-a-zA-Z0-9#:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()#:%_\+.~#?&=]*))[^>]?$#<\1>#gm;t' websites.txt > output.txt
You can find the sample implementation of the command in here.
Since the regex already fulfils the requirement of the person I'm writing this for, I only needed help with the command syntax (although any improvements are heartily welcomed); I want the command to work with the same regular expression pattern.
Things I was previously unaware of and have now learnt:
I didn't know anything about the -E flag. Now I know that -E uses POSIX "extended" syntax ("ERE"). Thanks to @GordonDavisson and @Sundeep. Further reading.
I didn't know for certain that sed doesn't support look-arounds. Now I know it doesn't. Thanks to @dmitri-chubarov. Further reading.
I didn't know that sed doesn't support non-capturing groups either. Thanks to @Sundeep for solving this part. Further reading.
I didn't know about GNU sed as a specific command-line tool. Thanks to @oguzismail for this. Further reading.
With respect to the command in your answer:
sed -E 's#^[^<]?(https?://(www\.)?[-a-zA-Z0-9#:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()#:%_\+.~#?&=]*))[^>]?$#<\1>#gm;t'
Here's a few notes:
Your posted sample input has 1 URL per line so AFAIK the gm;t at the end of your sed command is doing nothing useful so either your input is inadequate or your script is wrong.
The hard-coded ranges a-z, A-Z, and 0-9 include different characters in different locales. If you meant to include all (and only) lower case letters, upper case letters, and digits then you should replace a-zA-Z0-9 with the POSIX character class [:alnum:]. So either change to a locale-independent character class or specify the locale you need on your command line, depending on your requirements for which characters to match in your regexp.
Like most characters, the character + is literal inside a bracket expression so it shouldn't be escaped - change \+ to just +.
The bracket expression [^<]? means "1 or 0 occurrences of any character that is not a <" and similarly for [^>]? so if your "url" contained random characters at the start/end it'd be accepted, e.g.:
echo 'xhttp://foo.bar%' | sed -E 's#^[^<]?(https?://(www\.)?[-a-zA-Z0-9#:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()#:%_\+.~#?&=]*))[^>]?$#<\1>#gm;t'
<http://foo.bar%>
I think you meant to use <? and >? instead of [^<]? and [^>]?.
Your regexp would allow a "url" that has no letters:
echo 'http://=.9' | gsed -E 's#^[^<]?(https?://(www\.)?[-a-zA-Z0-9#:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()#:%_\+.~#?&=]*))[^>]?$#<\1>#gm;t'
<http://=.9>
If you edit your question to provide more truly representative sample input and expected output (including cases you do not want to match) then we can help you BUT based on a quick google of what a valid URL is it looks like there are several valid URLs that'd be disallowed by your regexp and several invalid ones that'd be allowed so you might want to ask about that in a question tagged with url or similar (with the tags you currently have we can help you implement your regexp but there may be better people to help with defining your regexp).
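To illustrate the <? and >? suggestion, here is a minimal sketch. Note that the URL part is a deliberately simplified stand-in (any run of characters other than <, > and whitespace after the scheme), not the regex from the question, and wrap_url is a made-up name:

```shell
# Hypothetical helper: wrap a bare or already-bracketed URL line in <>.
# "<?" and ">?" consume an existing bracket pair so it is not doubled;
# "|" is used as the s-command delimiter so that "#" and "/" in the
# pattern need no escaping.
wrap_url() {
  sed -E 's|^<?(https?://[^<>[:space:]]+)>?$|<\1>|'
}

bare=$(printf '%s\n' 'https://www.google.com/' | wrap_url)
wrapped=$(printf '%s\n' '<https://www.fakesite.co.uk>' | wrap_url)
junk=$(printf '%s\n' 'xhttp://foo.bar%' | wrap_url)
printf '%s\n' "$bare" "$wrapped" "$junk"
```

With anchors on both sides and no [^<]? wildcard, the line with the stray leading x no longer matches and is passed through unchanged.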
If the input file is just a comment followed by a list of URLs, try:
sed '1d;s/^[^<]/<&/;s/[^>]$/&>/' websites.txt
Output:
<https://www.google.com/>
<https://www.fakesite.co.in>
<https://www.fakesite.co.uk>
<https://www.fakesite.co.uk>
<https://www.google.com/>
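Since the question also asked about awk, a rough equivalent of the same idea might look like the following. It makes the same assumption (a single comment line followed by one URL per line), and wrap_lines is a made-up name:

```shell
# Hypothetical helper: skip the first line, strip any existing <> pair,
# then wrap every remaining line in <>.
wrap_lines() {
  awk 'NR > 1 { sub(/^</, ""); sub(/>$/, ""); print "<" $0 ">" }'
}

output=$(printf '%s\n' \
  '// List of all legal urls' \
  'https://www.google.com/' \
  '<https://www.fakesite.co.uk>' | wrap_lines)
printf '%s\n' "$output"
```

Like the sed one-liner, this does no URL validation at all; it only normalises the brackets.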

Where is this Regex expression not closed in sed (apostrophe parenthesis)?

I'm trying to update some setting for wordpress and I need to use sed. When I run the below command, it seems to think the line is not finished. What am I doing wrong?
$ sed -i 's/define\( \'DB_NAME\', \'database_name_here\' \);/define\( \'DB_NAME\', \'wordpress\' \);/g' /usr/share/nginx/wordpress/wp-settings.php
> ^C
Thanks.
Single quotes in most shells don't support any escaping. If you want to include a single quote, you need to close the single quotes and add the single quote - either in double quotes, or backslashed:
sed 's/define\( '\''DB_NAME'\'', '\''database_name_here'\'' \);/define\( '\''DB_NAME'\'', '\''wordpress'\'' \);/g'
I fear it still wouldn't work for you, as \( is special in sed. You probably want just a simple ( instead.
sed 's/define( '\''DB_NAME'\'', '\''database_name_here'\'' );/define( '\''DB_NAME'\'', '\''wordpress'\'' );/g'
or
sed 's/define( '"'"'DB_NAME'"'"', '"'"'database_name_here'"'"' );/define( '"'"'DB_NAME'"'"', '"'"'wordpress'"'"' );/g'
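If the quoting looks opaque, a quick way to convince yourself that these spellings are the same string is to assign each to a variable and compare (the variable names here are only for illustration):

```shell
# Close the single quotes, add an escaped quote, reopen: '\''
a='define( '\''DB_NAME'\'' );'
# Close the single quotes, add a double-quoted quote, reopen: '"'"'
b='define( '"'"'DB_NAME'"'"' );'
# Plain double quotes around the whole thing.
c="define( 'DB_NAME' );"

printf '%s\n' "$a" "$b" "$c"
```

All three lines print identically, because the shell concatenates adjacent quoted pieces into one word before sed ever sees it.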
Normally, using single quotes around a sed script is sensible. This is a case where double quotes would be a better choice, since there are no shell metacharacters other than single quotes in the sed script:
sed -e "s/define( 'DB_NAME', 'database_name_here' );/define( 'DB_NAME', 'wordpress' );/g" /usr/share/nginx/wordpress/wp-settings.php
or:
sed -e "s/\(define( 'DB_NAME', '\)database_name_here' );/\1wordpress' );/g" /usr/share/nginx/wordpress/wp-settings.php
or even:
sed -e "/define( 'DB_NAME', 'database_name_here' );/s/database_name_here/wordpress/g" /usr/share/nginx/wordpress/wp-settings.php
One other option to consider is using sed's -f option to provide the script as a file. That saves you from having to escape the script contents from the shell. The downside may be that you have to create the file, run sed using it, and then remove the file. It is likely that's too painful for the current task, but it can be sensible — it can certainly make life easier when you don't have to worry about shell escapes.
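A minimal sketch of the -f route, assuming mktemp is available and using a temporary file as the script (the variable names are illustrative only):

```shell
# Write the sed script to a temporary file. The quoted 'EOF' delimiter
# stops the shell from interpreting anything inside the here-document,
# so the single quotes need no escaping at all.
script=$(mktemp)
cat > "$script" <<'EOF'
s/define( 'DB_NAME', 'database_name_here' );/define( 'DB_NAME', 'wordpress' );/
EOF

result=$(printf '%s\n' "define( 'DB_NAME', 'database_name_here' );" |
  sed -f "$script")
rm -f "$script"
printf '%s\n' "$result"
```

For a real run you would pass the settings file as an argument after -f "$script" instead of piping a test line in.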
I'm not convinced the g (global replace) option is relevant; how many single lines are you going to find in the settings file containing two independent define DB_NAME operations with the default value?
You can add the -i option when you've got the basic code working. Do note that if you might ever work on macOS or a BSD-based system, you'll need to provide a suffix as an extra argument to the -i option (e.g. -i '' for a null suffix or no backup), or use -i.bak to be able to work reliably on both Linux (or, more accurately, with GNU sed) and macOS and BSD (or, more accurately, with BSD sed). Appealing to POSIX is no help; it doesn't support an overwrite option.
Test case (second alternative, using the capture):
$ echo "define( 'DB_NAME', 'database_name_here' );" |
> sed -e "s/\(define( 'DB_NAME', '\)database_name_here' );/\1wordpress' );/g"
define( 'DB_NAME', 'wordpress' );
$
If the spacing around 'DB_NAME' is not consistent, then you'd end up with more verbose regular expressions, using [[:space:]]* in lieu of blanks, and you'd find the third alternative better than the others, but the second could capture both the leading and trailing contexts and use both captures in the replacement.
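As a sketch of that more tolerant variant, here is the second alternative with [[:space:]]* in place of the blanks and with both the leading and trailing context captured. The input line is made up, and relax_db_name is a hypothetical name:

```shell
# Hypothetical helper: replace the default DB name regardless of spacing.
# \1 holds everything up to the opening quote of the value,
# \2 holds everything from the closing quote to the semicolon.
relax_db_name() {
  sed -e "s/\(define([[:space:]]*'DB_NAME'[[:space:]]*,[[:space:]]*'\)database_name_here\('[[:space:]]*);\)/\1wordpress\2/"
}

result=$(printf '%s\n' "define('DB_NAME',   'database_name_here');" |
  relax_db_name)
printf '%s\n' "$result"
```

Because both contexts are put back from the captures, the original spacing of the line is preserved.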
Parting words: this technique works this time because the patterns don't involve shell metacharacters like $ or  ` . Very often, the script does need to match those, and then using mainly single quotes around the script argument is sensible. Tackling a different task — replace $DB_NAME in the input with the value of the shell variable $DB_NAME (leaving $DB_NAMEORHOST unchanged):
sed -e 's/$DB_NAME\([^[:alnum:]]\)/'"$DB_NAME"'\1/'
There are three separate shell strings, all concatenated with no spaces. The first is single-quoted and contains the s/…/ part of a s/…/…/ command; the second is "$DB_NAME", the value of the shell variable, double-quoted so that if the value of $DB_NAME is 'autonomous vehicle recording', you still have a single argument to sed; the third is the '\1/' part, which puts back whatever character followed $DB_NAME in the input text (with the observation that if $DB_NAME could appear at the end of an input line, this would not match it).
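A quick way to see this in action, with a made-up input line and the example value from above:

```shell
DB_NAME='autonomous vehicle recording'

# The input is single-quoted so the shell leaves the dollar signs alone.
input='name=$DB_NAME; host=$DB_NAMEORHOST;'

# Single-quoted piece, double-quoted variable, single-quoted piece.
result=$(printf '%s\n' "$input" |
  sed -e 's/$DB_NAME\([^[:alnum:]]\)/'"$DB_NAME"'\1/')
printf '%s\n' "$result"
```

One caveat: this sketch assumes the value contains no / or & characters, which are special in the replacement part of s/…/…/ and would need escaping first.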
Most regexes do fuzzy matching; you have to consider variations on what might be in the input to determine how hard your regular expressions have to work to identify the material accurately.

Need to match similarly titled filenames present in a variable using regex

I need to find similarly named strings that are passed as bash variables to a regex pattern in an interpolated string as a function argument. I'm new to Regex so am unsure what the best approach is.
Here's what I currently have:
bash_script.sh
findKeys(`grep --ignore-case ^${apiServiceName}$`)
However, some APIs have similar names, eg:
apiServiceNames = ['api-name', 'api-name-one', 'api-name-two']
The confusing bit is where to put \ (which characters to escape) as I need ${} for the variable but $^ opens and closes a string.
You don't need a regex match with grep or any third-party tools. The native bash shell provides strong enough features for pattern matching. For example, the construct below:
if [[ $apiServiceName == api-name?(?(-)+(one|two)) ]]; then
    printf '%s - is allowed\n' "$apiServiceName"
fi
The construct api-name?(?(-)+(one|two)) is extended glob (extglob) syntax provided by the shell. It is enabled by default when [[ .. ]] is used for pattern matching with the == operator (in Bash 4.1 and later; older versions need shopt -s extglob first). See more on extglob
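To see which names the pattern accepts, a small test loop can help. The is_allowed function and the extra api-name-three entry are made up for illustration, and shopt -s extglob is set explicitly in case the Bash in use is older than 4.1:

```shell
# Enable extended globs explicitly; harmless where they are already on.
shopt -s extglob

# Hypothetical helper: succeed if $1 is one of the known API names.
# The pattern must stay unquoted, or it would be compared literally.
is_allowed() {
  [[ $1 == api-name?(?(-)+(one|two)) ]]
}

report=''
for name in api-name api-name-one api-name-two api-name-three; do
  if is_allowed "$name"; then
    report="$report$name:allowed "
  else
    report="$report$name:rejected "
  fi
done
printf '%s\n' "$report"
```

Note that the pattern is looser than a fixed list: +(one|two) also accepts repetitions such as api-name-oneone, so a plain alternation like @(api-name|api-name-one|api-name-two) may be preferable if only exact names should pass.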

BASH_REMATCH empty

I'm trying to capture parts of some input with a regex in Bash, but BASH_REMATCH comes back EMPTY
#!/usr/bin/env /bin/bash
INPUT=$(cat input.txt)
TASK_NAME="MailAccountFetch"
MATCH_PATTERN="(${TASK_NAME})\s+([0-9]{4}-[0-9]{2}-[0-9]{2}\s[0-9]{2}:[0-9]{2}:[0-9]{2})"
while read -r line; do
if [[ $line =~ $MATCH_PATTERN ]]; then
TASK_RESULT=${BASH_REMATCH[3]}
TASK_LAST_RUN=${BASH_REMATCH[2]}
TASK_EXECUTION_DURATION=${BASH_REMATCH[4]}
fi
done <<< "$INPUT"
My input is:
MailAccountFetch 2017-03-29 19:00:00 Success 5.0 Second(s) 2017-03-29 19:03:00
By debugging the script (VS Code+Bash ext) I can see the INPUT string matches as the code goes inside the IF but BASH_REMATCH is not populated with my two capture groups.
I'm on:
GNU bash, version 4.4.0(1)-release (x86_64-pc-linux-gnu)
What could be the issue?
LATER EDIT
Accepted Answer
Accepting most explanatory answer.
What finally resolved the issue:
The bashdb/VS Code environment is causing the empty BASH_REMATCH. The code works OK when run on its own.
As Cyrus shows in his answer, a simplified version of your code - with the same input - does work on Linux in principle.
That said, your code references capture groups 3 and 4, whereas your regex only defines 2.
In other words: ${BASH_REMATCH[3]} and ${BASH_REMATCH[4]} are empty by definition.
Note, however, that if =~ signals success, BASH_REMATCH is never fully empty: at the very least - in the absence of any capture groups - ${BASH_REMATCH[0]} will be defined.
There are some general points worth making:
Your shebang line reads #!/usr/bin/env /bin/bash, which is effectively the same as #!/bin/bash.
/usr/bin/env is typically used if you want a version other than /bin/bash to execute, one you've installed later and put in the PATH (too):
#!/usr/bin/env bash
ghoti points out that another reason for using #!/usr/bin/env bash is to also support less common platforms such as FreeBSD, where bash, if installed, is located in /usr/local/bin rather than the usual /bin.
In either scenario it is less predictable which bash binary will be executed, because it depends on the effective $PATH value at the time of invocation.
=~ is one of the few Bash features that are platform-dependent: it uses the particular regex dialect implemented by the platform's regex libraries.
\s is a character class shortcut that is not available on all platforms, notably not on macOS; the POSIX-compliant equivalent is [[:space:]].
(In your particular case, \s should work, however, because your Bash --version output suggests that you are on a Linux distro.)
It's better not to use all-uppercase shell variable names such as INPUT, so as to avoid conflicts with environment variables and special shell variables.
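Putting those points together, a sketch of the loop body might look like this. The third and fourth capture groups are an assumption about what the question wanted, inferred from the sample input (a one-word result followed by a numeric duration):

```shell
task_name='MailAccountFetch'

# POSIX classes instead of \s, so the pattern is not platform-dependent.
timestamp='[0-9]{4}-[0-9]{2}-[0-9]{2}[[:space:]][0-9]{2}:[0-9]{2}:[0-9]{2}'

# Four groups: task name, last run, result word, duration number.
# Groups 3 and 4 are assumed from the layout of the sample line.
pattern="^(${task_name})[[:space:]]+(${timestamp})[[:space:]]+([[:alpha:]]+)[[:space:]]+([0-9.]+)"

line='MailAccountFetch 2017-03-29 19:00:00 Success 5.0 Second(s) 2017-03-29 19:03:00'

# The pattern variable is left unquoted so it is treated as a regex.
if [[ $line =~ $pattern ]]; then
  task_last_run=${BASH_REMATCH[2]}
  task_result=${BASH_REMATCH[3]}
  task_execution_duration=${BASH_REMATCH[4]}
fi
printf '%s | %s | %s\n' "$task_last_run" "$task_result" "$task_execution_duration"
```

Lowercase variable names are used here in line with the last point above.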
Bash uses system libraries to parse regular expressions, and different parsers implement different features. You've come across a place where regex class shorthand strings do not work. Note the following:
$ s="one12345 two"
$ [[ $s =~ ^([a-z]+[0-9]{4})\S*\s+(.*) ]] && echo yep; declare -p BASH_REMATCH
declare -ar BASH_REMATCH=()
$ [[ $s =~ ^([a-z]+[0-9]{4})[^[:space:]]*[[:space:]]+(.*) ]] && echo yep; declare -p BASH_REMATCH
yep
declare -ar BASH_REMATCH=([0]="one12345 two" [1]="one1234" [2]="two")
I'm doing this on macOS as well, but I get the same behaviour on FreeBSD.
Simply replace \s with [[:space:]], \d with [[:digit:]], etc, and you should be good to go. If you avoid using RE shortcuts, your expressions will be more widely understood.

Regex for prosodically-defined words: working in Atom but not grep

I'm trying to search a .txt dictionary for all trisyllabic roots, and then have the matching roots passed to a new .txt file. The dictionary in question is a raw text version of Heath's Nunggubuyu dictionary. When I search the file in Atom (my preferred text editor), the following string does a pretty good job of singling out the desired roots and eliminating any material from the definitions below the headwords (which begin with whitespace), as well as any English words, and any trisyllabic strings interrupted by a hyphen or equals sign (which mean they are not monomorphemic roots). Forgive me if it looks clunky; I'm an absolute beginner. (In this orthography, vowel length is indicated with a ':', and there are only three vowels 'a,i,u'. None of the headwords have uppercase letters.)
^\S[^aeiousf]*[aiu:]+[^csfaioeu:\-\=\W]+[aiu:]+[^VNcsfaeiou:\-\=]+[aiu:]+[^VcsfNaeiou:]*\b
However, I need the matched strings to be output to a new file. When I try using this same string in grep (on a Mac), nothing is matched. I use the syntax
grep -o "^\S[^aeiousf]*[aiu:]+[^csfaioeu:\-\=\W]+[aiu:]+[^VNcsfaeiou:\-\=]+[aiu:]+[^VcsfNaeiou:]*\b" Dict-nofrontmatter.txt > output.txt
I've been searching for hours trying to figure out how to translate from Atom's regex dialect to grep (Mac), to no avail. Whenever I do manage to get matches, the results looks wildly different to what I expect, and what I get from Atom. I've also looked at some apparent grep tools for Atom, but the documentation is virtually non-existent so I can't work out what they even do. What am I getting wrong here? Should I try an alternative to grep?
grep supports different regex styles. From man re_format:
Regular expressions ("RE"s), as defined in POSIX.2, come in two
forms:
modern REs (roughly those of egrep; POSIX.2 calls these extended REs) and
obsolete REs (roughly those of ed(1); POSIX.2 basic REs).
Grep has switches to choose which variant is used. Sorted from fewest to most features:
fixed string: grep -F or fgrep
No regex at all. Plain text search.
basic regex: grep -G or just grep
|, +, and ? are ordinary characters. | has no equivalent. Parentheses must be escaped to work as sub-expressions.
extended regex: grep -E or egrep
"Normal" regexes with |, +, ?, bounds, and so on.
perl regex: grep -P (for GNU grep, not pre-installed on Mac)
Most powerful regexes. Supports lookaheads and other features.
In your case you should try grep -Eo "^\S....
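A small experiment makes the difference between the first three variants visible. The two sample lines are made up; the same pattern a+b is run through each mode:

```shell
# Two made-up input lines: one with repeated letters, one with a literal +.
sample() {
  printf '%s\n' 'aab' 'a+b'
}

fixed=$(sample | grep -F 'a+b')     # plain text search: finds the literal a+b
basic=$(sample | grep 'a+b')        # basic regex: + is an ordinary character
extended=$(sample | grep -E 'a+b')  # extended regex: + means "one or more a"

printf 'fixed=%s basic=%s extended=%s\n' "$fixed" "$basic" "$extended"
```

Only the -E run treats + as a repetition operator, which is why a pattern full of + quantifiers matches nothing useful without it.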
Possibly the only thing missing from your grep command is the -E option:
regex='^\S[^aeiousf]*[aiu:]+[^csfaioeu:\-\=\W]+[aiu:]+[^VNcsfaeiou:\-\=]+[aiu:]+[^VcsfNaeiou:]*\b'
grep -Eo "$regex" Dict-nofrontmatter.txt > output.txt
-E activates support for extended (modern) regular expressions, which work as one expects nowadays (duplication symbols + and ? work as expected, ( and ) form capture groups, | is alternation).
Without -E (or with -G) basic regular expressions are assumed - a limited legacy form that differs in syntax. Given that -E is part of POSIX, there's no reason not to use it.
On macOS, grep does understand character-class shortcuts such as \S and \W, and also word-boundary assertions such as \b - this is in contrast with the other BSD utilities that macOS comes with, notably sed and awk.
It doesn't look like you need them, but PCREs (Perl-compatible Regular Expressions) would provide additional features, such as look-around assertions.
macOS grep doesn't support them, but GNU grep does, via the -P option. You can install GNU grep on macOS via Homebrew.
Alternatively, you can simply use perl directly; the equivalent of the above command would be:
regex='^\S[^aeiousf]*[aiu:]+[^csfaioeu:\-\=\W]+[aiu:]+[^VNcsfaeiou:\-\=]+[aiu:]+[^VcsfNaeiou:]*\b'
perl -lne "print for m/$regex/g" Dict-nofrontmatter.txt > output.txt