Need to match similarly titled filenames present in a variable using regex

I need to find similarly named strings that are passed as bash variables to a regex pattern in an interpolated string as a function argument. I'm new to Regex so am unsure what the best approach is.
Here's what I currently have:
bash_script.sh
findKeys(`grep --ignore-case ^${apiServiceName}$`)
However, some APIs have similar names, eg:
apiServiceNames = ['api-name', 'api-name-one', 'api-name-two']
The confusing bit is where to put \ (which characters to escape), as I need ${} for the variable but ^ and $ mark the start and end of the string.

You don't need grep or any third-party tool for this match. The native bash shell provides strong enough features for pattern matching. For example, the construct below:
if [[ $apiServiceName == api-name?(?(-)+(one|two)) ]]; then
printf '%s - is allowed\n' "$apiServiceName"
fi
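The construct api-name?(?(-)+(one|two)) is an extended glob (extglob) pattern provided by the shell. Extglob syntax is enabled by default on the right-hand side of the == operator inside [[ .. ]] (in bash 4.1 and later, as far as I know), so no shopt call is needed here. See the bash manual for more on extglob.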
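As a rough illustration of the construct above, here is a sketch that runs the same pattern against the three names from the question plus one name that should not match. The function name is_allowed and the extra name api-name-three are made up for this example, and the explicit shopt -s extglob is only a precaution for older bash versions.

```bash
#!/usr/bin/env bash
# Precaution only: recent bash versions enable extglob inside [[ ]] anyway.
shopt -s extglob

# Hypothetical helper: succeeds when the name is api-name, optionally
# followed by an optional "-" and one or more of "one" / "two".
is_allowed() {
  [[ $1 == api-name?(?(-)+(one|two)) ]]
}

for name in api-name api-name-one api-name-two api-name-three; do
  if is_allowed "$name"; then
    printf '%s - is allowed\n' "$name"
  else
    printf '%s - is rejected\n' "$name"
  fi
done
```

Because the pattern has to match the whole string, api-name does not accidentally match api-name-three the way an unanchored grep would.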

Related

Difference between using grep regex pattern with or without quotes?

I'm learning from Linux Academy and the tutorial shows how to use grep and regex.
He is putting his regex pattern in between quotes something like this:
grep 'pattern' file.txt
This seems to be the same as doing it without quotes:
grep pattern file.txt
But when he does something like this, he needs to escape the { and }:
grep '^A\{1,4\}' file.txt
And after doing some testing, these escape characters don't seem to be needed when writing the pattern without the quotes.
grep ^A{1,4} file.txt
So what is the difference between these two methods?
Are the quotations necessary?
Why are the escape characters needed in the first case?
Lastly, I've also seen other methods like grep -E and egrep, which is the most common method that people use to grep with regex?
Edit: Thanks for the reminder that the pattern goes before the file.
Many thanks!
You can sometimes get away with omitting quotes, but it's safest not to. This is because the syntax of regular expressions overlaps that of filename wildcard patterns, and when the shell sees something that looks like a wildcard pattern (and it isn't in quotes), the shell will try to "expand" it into a list of matching filenames. If there are no matching files, it gets passed through unchanged, but if there are matches it gets replaced with the matching filenames.
Here's a simple example. Suppose we're trying to search file.txt for an "a" followed optionally by some "b"s, and print only the matches. So you run:
grep -o ab* file.txt
Now, "ab*" could be interpreted as a wildcard pattern looking for files that start with "ab", and the shell will interpret it that way. If there are no files in the current directory that start with "ab", this won't cause a problem. But suppose there are two, "abcd.txt" and "abcdef.jpg". Then the shell expands this to the equivalent of:
grep -o abcd.txt abcdef.jpg file.txt
...and then grep will search the files abcdef.jpg and file.txt for the regex pattern abcd.txt.
So, basically, using an unquoted regex pattern might work, but is not safe. So don't do it.
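To see the effect described above without risking real files, something like the following sketch can be run. It works in a scratch directory created with mktemp; the file contents are invented for the demonstration.

```bash
#!/usr/bin/env bash
# Scratch directory so the demo does not touch real files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'xabbby\n'      > file.txt     # the file we actually want to search
printf 'placeholder\n' > abcd.txt     # two unrelated files whose names
printf 'placeholder\n' > abcdef.jpg   # happen to start with "ab"

# Quoted: grep receives the regex ab* unchanged.
quoted=$(grep -o 'ab*' file.txt)

# Unquoted: the shell replaces ab* with the matching filenames first, so
# grep ends up using a filename as its pattern.
unquoted=$(grep -o ab* file.txt)

printf 'quoted:   [%s]\n' "$quoted"
printf 'unquoted: [%s]\n' "$unquoted"

# Leave and remove the scratch directory.
cd - >/dev/null && rm -rf "$workdir"
```

In this setup the quoted command finds the run of "a" followed by "b"s, while the unquoted one finds nothing, because neither filename-as-pattern occurs in the files being searched.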
BTW, I'd also recommend using single-quotes instead of double-quotes, because there are some regex characters that're treated specially by the shell even when they're in double-quotes (mostly dollar sign and backslash/escape). Again, they'll often get passed through unchanged, but not always, and unless you understand the (somewhat messy) parsing rules, you might get unexpected results.
BTW^2, for similar reasons you should (almost) always put double-quotes around variable references (e.g. grep -o 'ab*' "$filename" instead of grep -o 'ab*' $filename). Single-quotes don't allow variable references at all; unquoted variable references are subject to word splitting and wildcard expansion, both of which can cause trouble. Double-quoted variables get expanded and nothing else.
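A small sketch of the variable-quoting point, assuming a made-up file name that contains a space ("my notes.txt") inside a scratch directory:

```bash
#!/usr/bin/env bash
workdir=$(mktemp -d)
filename="$workdir/my notes.txt"   # hypothetical file name containing a space
printf 'abbb\n' > "$filename"

# Double-quoted: grep gets the whole path as a single argument.
quoted_result=$(grep -o 'ab*' "$filename")

# Unquoted: the value is split at the space into two words, so grep is
# asked to read two files that normally do not exist and reports an error.
unquoted_status=0
grep -o 'ab*' $filename >/dev/null 2>&1 || unquoted_status=$?

printf 'quoted result:   %s\n' "$quoted_result"
printf 'unquoted status: %s\n' "$unquoted_status"

rm -rf "$workdir"
```

The non-zero status in the unquoted case comes from grep failing to open the pieces of the split path, not from the regex itself.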
BTW^3, there are a bunch of variants of regular expression syntax. The reason the curly braces in your example expression need to be escaped is that, by default, grep uses POSIX "basic" regular expression syntax ("BRE"). In BRE syntax, some regex special characters (including curly brackets and parentheses) must be escaped to have their special meaning (and some others, like alternation with |, are just not available at all). grep -E, on the other hand, uses "extended" regular expression syntax ("ERE"), in which those characters have their special meanings unless they're escaped.
And then there's the Perl-compatible syntax (PCRE), and many other variants. Using the wrong variant of the syntax is a common cause of trouble with regular expressions (e.g. using perl extensions in an ERE context, as here and here). It's important to know which variant the tool you're using understands, and write your regex to that syntax.
Here's a simple example: "a", followed by 1 to 3 space-like characters, followed by "b", in various regex syntax variants:
a[[:space:]]\{1,3\}b # BRE syntax
a[[:space:]]{1,3}b # ERE syntax
a\s{1,3}b # PCRE syntax
Just to make things more complicated, some tools will nominally accept one syntax, but also allow some extensions from other syntax variants. In the example above, you can see that perl added the shorthand \s for a space-like character, which is not part of either POSIX standard syntax. But in fact many tools that nominally use BRE or ERE will actually accept the \s shorthand.
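The BRE and ERE variants above can be tried directly with grep. This sketch uses a made-up sample string ("a", a space, a tab, "b") and counts matching lines; it leaves out the PCRE form because grep -P is a GNU extension that is not available everywhere.

```bash
#!/usr/bin/env bash
# Sample input: "a", then a space and a tab, then "b".
sample=$(printf 'a \tb')

# BRE (plain grep): the braces must be escaped to act as a repetition count.
bre_count=$(printf '%s\n' "$sample" | grep -c 'a[[:space:]]\{1,3\}b')

# ERE (grep -E): the braces are special without escaping.
ere_count=$(printf '%s\n' "$sample" | grep -E -c 'a[[:space:]]{1,3}b')

# The ERE spelling under plain grep: the braces are taken literally,
# so the sample line is not matched.
bre_unescaped_count=$(printf '%s\n' "$sample" | grep -c 'a[[:space:]]{1,3}b')

printf 'BRE escaped:   %s\n' "$bre_count"
printf 'ERE:           %s\n' "$ere_count"
printf 'BRE unescaped: %s\n' "$bre_unescaped_count"
```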
Actually, there are two completely unrelated aspects of escaping in your question. The first has to do with how to represent strings in bash. This is about readability, which usually means personal taste. For example, I don't like escaping, so I prefer writing ab\ cd as 'ab cd'. Hence, I would write
echo 'ab cd'
grep -F 'ab cd' myfile.txt
instead of
echo ab\ cd
grep -F ab\ cd myfile.txt
but there is nothing wrong with either one, and you can choose whichever looks simpler to you.
The other aspect indeed is related to grep, at least as long as you do not use the -F option in grep, which always interprets the search argument literally. In this case, the shell is not involved, and the question is whether a certain character is interpreted as a regexp character or as a literal. Gordon Davisson has already explained this in detail, so I give only an example which combines both aspects:
Say you want to grep for a space, followed by one or more periods, followed by another space. You can't write this as
grep -E .+ myfile.txt
because the spaces would be eaten by bash and the . would have special meaning to grep. Hence, you have to choose some escape mechanism. My personal style would be
grep -E ' [.]+ ' myfile.txt
but many people dislike the [.] and prefer \. instead. This would then become
grep -E ' \.+ ' myfile.txt
This still uses quotes to salvage the spaces from the shell, but escapes the period for grep. If you prefer to use no quotes at all, you can write
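grep -E \ \\.+\  myfile.txt
Note that there are two spaces before myfile.txt: the first is escaped and belongs to the pattern, the second separates the arguments. Also note that you need to prefix the \ which is intended for grep with another \, because the backslash, like a space, has a special meaning for the shell; if you did not write \\., grep would not see a backslash-period, but just a period.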
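As a quick check that the three spellings discussed above are equivalent, here is a sketch that runs each of them against a small made-up file and keeps the results for comparison. The file contents and variable names are only for illustration.

```bash
#!/usr/bin/env bash
workdir=$(mktemp -d)
f="$workdir/myfile.txt"
printf 'wait ... done\nno.dots.here\n' > "$f"

# Quotes for the shell, bracket expression for grep.
bracket=$(grep -E ' [.]+ ' "$f")

# Quotes for the shell, backslash for grep.
escaped=$(grep -E ' \.+ ' "$f")

# No quotes: each space and the backslash are escaped for the shell.
# The "\ " before "$f" is part of the pattern; the plain space after it
# separates the pattern from the file argument.
noquotes=$(grep -E \ \\.+\  "$f")

printf '%s\n' "$bracket" "$escaped" "$noquotes"

rm -rf "$workdir"
```

All three commands hand grep the same pattern (space, one or more literal periods, space), so they should select the same line.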

reg exp: "if" and single "="

I need a regular expression (grep -e "__") that matches all lines containing if and just one = (ignoring lines containing ==)
I tried this:
grep -e "if.*=[^=]"
but = is not a character class, so it doesn't work.
The problem is .* may contain an =.
I'd suggest
grep -e "if[^=]*=[^=]"
If your goal is to find lines of code with an if containing an erroneous assignment instead of a comparison, I'd suggest using a linter (which would be based on a robust parser instead of just regexes). The linter to use depends on the language of the code, of course (for example I use this one in Javascript).
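To compare the two patterns, here is a sketch that feeds three invented lines of code to both of them. The sample lines and variable names are assumptions for the demonstration only.

```bash
#!/usr/bin/env bash
# Three made-up input lines: an assignment in an if, a comparison in an if,
# and a plain assignment without any if.
lines=$(printf '%s\n' 'if (a = b) {' 'if (a == b) {' 'x = 1;')

# Original attempt: .* may swallow the first "=" of "==", so the
# comparison line is matched as well.
original_count=$(printf '%s\n' "$lines" | grep -c -e 'if.*=[^=]')

# Suggested pattern: nothing between "if" and the "=" may itself be "=".
suggested=$(printf '%s\n' "$lines" | grep -e 'if[^=]*=[^=]')

printf 'original pattern matched %s line(s)\n' "$original_count"
printf 'suggested pattern matched: %s\n' "$suggested"
```

Keep in mind that the suggested pattern still accepts operators such as != or <=, since the character before the = is not checked; that is one more reason to prefer a linter for real code.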

Can OR expressions be used in ${var//OLD/NEW} replacements?

I was testing some string manipulation stuff in a bash script and I've quickly realized it doesn't understand regular expressions (at least not with the syntax I'm using for string operations), then I've tried some glob expressions and it seems to understand some of them, some not. To be specific:
FINAL_STRING=${FINAL_STRING//<title>/$(get_title)}
is the main operation I'm trying to use and the above line works, replacing all occurrences of <title> with $(get_title) on $FINAL_STRING... and
local field=${1/#*:::/}
works, assigning $1 with everything from the beginning to the first occurrence of ::: replaced by nothing (removed). However, # does what I'd expect ^ to do. Plus, when I tried to use the {,,} glob expression here:
FINAL_STRING=${FINAL_STRING//{<suffix>,<extension>}/${SUFFIX}}
to replace any occurrence of <suffix> OR <extension> by ${SUFFIX}, it doesn't work.
So I see it doesn't take regex and it also doesn't take glob patterns... so what does it take? Is there an exhaustive listing of the symbols/expressions understood by plain bash string operations (particularly substring replacement)? Or are *, ?, #, ##, % and %% the only valid stuff?
(I'm trying to rely only on plain bash, without calling sed or grep to do what I want)
The gory details can be found in the bash manual, Shell Expansions section. The complete picture is surprisingly complex.
What you're doing is described in the Shell Parameter Expansion section. You'll see that the pattern in
${parameter/pattern/string}
uses the Filename Expansion rules, and those don't include Brace Expansion - that is done earlier when processing the command line arguments. Filename expansion "only" does ?, * and [...] matching (unless extglob is set).
But parameter expansion does a bit more than just filename expansion, notably the anchoring you noticed with # or %.
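Since the pattern follows the filename-expansion rules, one way to get the OR the question asks for is an extglob alternation, @(a|b), once extglob is switched on. The sketch below is only an illustration; the sample strings and the value of SUFFIX are made up.

```bash
#!/usr/bin/env bash
# extglob has to be enabled for @(...|...) to work in ${var//pattern/string}.
shopt -s extglob

SUFFIX='.bak'
FINAL_STRING='report<suffix> and photo<extension>'

# Replace every occurrence of <suffix> OR <extension> with $SUFFIX.
FINAL_STRING=${FINAL_STRING//@(<suffix>|<extension>)/$SUFFIX}
printf '%s\n' "$FINAL_STRING"

# Anchoring with #: the pattern must match at the start of the value.
record='name:::value'
field=${record/#*:::/}
printf '%s\n' "$field"
```

Brace expansion ({a,b}) is not involved at all here; the alternation is part of the pattern itself, which is why it works inside the parameter expansion.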
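bash does in fact handle regex, specifically through the [[ =~ ]] operator; after a successful match, the matched text and any capture groups are available in the special array variable BASH_REMATCH, which you can then assign to your own variables. It's funky, but it works.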
See: http://www.linuxjournal.com/content/bash-regular-expressions
Note this is a bash-only hack feature.
For code that works in shells besides bash as well, the old school way of doing something like this is indeed to use #/##/%/%% along with a loop around a case statement (which supports basic * glob matching).
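Both approaches can be sketched briefly. The first part uses [[ =~ ]] with BASH_REMATCH (bash only); the second uses a case statement with glob patterns, which also works in other Bourne-style shells. The sample strings, the regex, and the function name classify are invented for this example.

```bash
#!/usr/bin/env bash
# bash-only: regex match with a capture group.
line='<title>My Page</title>'
re='<title>(.*)</title>'   # kept in a variable to avoid quoting surprises
title=''
if [[ $line =~ $re ]]; then
  title=${BASH_REMATCH[1]}   # first parenthesised group
fi
printf 'title: %s\n' "$title"

# Portable: glob matching with case.
classify() {
  case $1 in
    *.log|*.txt) echo text ;;
    *)           echo other ;;
  esac
}
classify app.log
classify picture.png
```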

bash 2.0 string matching

I'm on GNU bash, version 2.05b.0(1)-release (2002). I'd like to determine whether the value of $1 is a path in one of those /path/*.log rules in, say, /etc/logrotate.conf. It's not my box so I can't upgrade it.
Edit: my real goal is given /path/actual.log answer whether it is already governed by logrotate or if all the current rules miss it. I wonder then if my script should just run logrotate -d /etc/logrotate.conf and see if /path/actual.log is in the output. This seems simpler and covers all the cases as opposed to this other approach.
But I still want to know how to approach string matching in Bash 2.0 in general...
the line itself can start with some white space or none
it's not a match if it is in a commented line (comments are lines where the first non white space char is #)
there can be one or more paths on the same line to the left of $1
like if $1 is /my/path/*.log and the line in question is
/other/path*.log /yet/another.log /my/path/*.log {
there can be one or more paths to the right as well
the line itself can end with { and even more white space or not
paths can be contained in double-quotes or not
it can be assumed that the file is a valid logrotate conf file.
I have something that seems to work in Bash 4 but not in Bash 2.05. Where can I go to read what Bash 2.0 supports? How would this matching be checked in Bash 2.0?
You can find a terse bash changelog here.
You'll see that =~, the regex-matching operator, didn't get introduced until version 3.0.
Thus, your best bet is to use a utility to perform the regex matching for you; e.g.:
if grep -Eq '<your-extended-regex>' <<<"$1"; then ...
grep -Eq '<your-extended-regex>' <<<"$1":
IS like [[ $1 =~ <your-extended-regex> ]] in Bash 3.0+ in that its exit code indicates whether the literal value of $1 matches the extended regex <your-extended-regex>
Note that Bash 3.1 changed the interpretation of the RHS to treat quoted (sub)strings as literals.
Also note that grep -E may support a slightly different regular-expression dialect.
is NOT like it in that the grep solution cannot return capture groups; by contrast, Bash 3.0+ provides the overall match and capture groups via the special array variable ${BASH_REMATCH[@]}.
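The grep-based test can be wrapped in a small helper so the calling code reads much like the =~ version. This is a sketch only: the function name matches_regex, the sample path and the regex are hypothetical, and the real logrotate rules would need a more careful expression.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds when string $1 matches extended regex $2.
# Only the exit status is used; grep -q prints nothing.
matches_regex() {
  grep -Eq -- "$2" <<<"$1"
}

if matches_regex '/my/path/actual.log' '^/my/path/[^/]+\.log$'; then
  echo 'path matches'
else
  echo 'path does not match'
fi
```

As noted above, this only answers yes or no; if the matched text itself is needed, grep -o or sed would have to be used in place of -q.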

How to reference captures in bash regex replacement

How can I include the regex match in the replacement expression in BASH?
Non-working example:
#!/bin/bash
name=joshua
echo ${name//[oa]/X\1}
I expect to output jXoshuXa with \1 being replaced by the matched character.
This doesn't actually work though and outputs jX1shuX1 instead.
Perhaps not as intuitive as sed and arguably quite obscure, but in the spirit of completeness: BASH will probably never support capture variables in replacements (at least not in the usual fashion, as parentheses are used for extended pattern matching). It is still possible, though, to capture a pattern when testing with the binary operator =~, which produces an array of matches called BASH_REMATCH.
Making the following example possible:
#!/bin/bash
name='joshua'
[[ $name =~ ([ao].*)([oa]) ]] && \
echo ${name/$BASH_REMATCH/X${BASH_REMATCH[1]}X${BASH_REMATCH[2]}}
The conditional match of the regular expression ([ao].*)([oa]) captures the following values to $BASH_REMATCH:
$ echo ${BASH_REMATCH[*]}
oshua oshu a
If found we use the ${parameter/pattern/string} expansion to search for the pattern oshua in parameter with value joshua and replace it with the combined string Xoshu and Xa. However this only works for our example string because we know what to expect.
For something that behaves more like a match-all (global) regex replacement, the following example greedily matches any unchanged o or a, inserting X from back to front. Note that the negative length in ${BASH_REMATCH:0:-1} needs bash 4.2 or later.
#!/bin/bash
name='joshua'
while [[ $name =~ .*[^X]([oa]) ]]; do
name=${name/$BASH_REMATCH/${BASH_REMATCH:0:-1}X${BASH_REMATCH[1]}}
done
echo $name
The first iteration changes $name to joshuXa and the second to jXoshuXa, after which the condition fails and the loop terminates. This example works similarly to the lookbehind expression /(?<!X)([oa])/X\1/, which only cares about the o or a characters that are not prefixed by an X.
The output for both examples:
jXoshuXa
nJoy!
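For comparison, here is a different technique that avoids both =~ and the negative substring length: walk the string one character at a time and build the result. This is not the method from the answer above, just a sketch; the function name insert_x_before_vowels is made up, and unlike the loop above it inserts X before every o or a, whether or not an X is already there.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print $1 with an X inserted before each "o" or "a".
insert_x_before_vowels() {
  local input=$1 output='' ch i
  for ((i = 0; i < ${#input}; i++)); do
    ch=${input:i:1}            # one character at offset i
    if [[ $ch == [oa] ]]; then
      output+=X                # the "replacement prefix"
    fi
    output+=$ch                # then the matched character itself
  done
  printf '%s\n' "$output"
}

insert_x_before_vowels joshua
```

Everything happens inside the current shell, so no sub-process is launched for the substitution itself.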
bash> name=joshua
bash> echo $name | sed 's/\([oa]\)/X\1/g'
jXoshuXa
The question bash string substitution: reference matched subexpressions was marked a duplicate of this one, in spite of the requirement that
The code runs in a long loop, it should be a one-liner that does not
launch sub-processes.
So the answer is:
If you really cannot afford launching sed in a subprocess, do not use bash! Use perl instead: its read-update-output loop will be several times faster, and the difference in syntax is small. (Well, you must not forget semicolons.)
I switched to perl, and there was only one gotcha: Unicode support was not available on one of the computers, so I had to reinstall packages.