Bash regex - different number of backslashes for escaping? - regex

Recently I wrote a script with this regex test:
# Works fine on Sabayon/Mac, doesn't work on CentOS
[[ $line =~ (.+)\{(.+)\} ]] || continue
It runs smoothly on Sabayon Linux and on Mac, but when I needed to run it in another environment, CentOS Linux, the script failed. After a while I found out that I need to double the backslash escapes to make it work there.
# Works fine on CentOS, does not on Sabayon/Mac
[[ $line =~ (.+)\\{(.+)\\} ]] || continue
Environment list:
CentOS release 5.5 (Final), 2.6.18-194.el5, running GNU bash, version 3.2.25(1)-release (x86_64-redhat-linux-gnu)
Sabayon latest release, 3.12.0-sabayon, running GNU bash, version 4.2.45(1)-release (x86_64-pc-linux-gnu)
OS X 10.9.3, running GNU bash, version 3.2.51(1)-release (x86_64-apple-darwin13)
Why is that happening? How can I make it run in all of these environments?

You can store the expression in a variable and use that in your test:
re="(.+)\{(.+)\}"
[[ $line =~ $re ]] || continue
I don't think you really need the parentheses here. ".+\{.+\}" would work as well.
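As a minimal sketch of the variable approach (the sample value of line is hypothetical, since the question does not show its input): keeping the pattern in a single-quoted variable means the shell leaves the backslashes alone, so the same pattern text should reach the regex engine regardless of the bash version.

```shell
#!/usr/bin/env bash
# Hypothetical sample input; the original question does not show its data.
line='name{value}'

# Single quotes keep the backslashes literal, so the regex engine sees \{ and \}.
re='(.+)\{(.+)\}'

if [[ $line =~ $re ]]; then
    before=${BASH_REMATCH[1]}   # text before the opening brace
    inside=${BASH_REMATCH[2]}   # text between the braces
    printf 'before: %s\ninside: %s\n' "$before" "$inside"
fi
```

Note that $re is left unquoted on the right-hand side of =~ so that it is treated as a regular expression rather than a literal string.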

Related

Cross platform regex substring match for git

Sorry for yet another pattern matching question, but I'm struggling to find a tool that will do a regex in a git hook. It needs to work on Windows, Mac and Linux.
This GNU grep works for Windows and Linux, but not Mac (because BSD)
echo "feature/EOPP-234-foo" | grep -Po -e '[A-Z]{4}-\d{1,5}'
This works for Mac and Linux, but not Windows (because <git>\usr\bin\egrep doesn't seem to work)
echo "feature/EOPP-234-foo" | egrep -o '[A-Z]{4}-\d{1,5}'
sed might be the most common tool, but stuffed if I can get it to match:
echo "feature/EOPP-234-foo" | sed -n 's/^.*\([A-Z]{4}\-\d{1,5}\).*$/\1/p'
I've even tried bash matching with no luck
[[ "feature/EOPP-234-foo" =~ ([A-Z]{4}-\d{1,5}) ]] && echo ${BASH_REMATCH[1]}
Any ideas?
When you need to make POSIX tools run on Windows, you need to remember to use double quotation marks around your commands, not single quotes.
Also, you can use a common POSIX ERE compliant regex across all these environments. This means \d must be replaced with [0-9] or [[:digit:]], because \d is a PCRE-only construct (GNU grep accepts it only with -P).
You can use
grep -Eo "[A-Z]{4}-[0-9]{1,5}"
grep -Eo "[A-Z]{4}-[[:digit:]]{1,5}"
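A short sketch of how the ERE version might be used inside a hook script. The variable names branch and ticket are illustrative only; in a real hook the branch name would come from git rather than a hard-coded string.

```shell
#!/usr/bin/env bash
# Illustrative input taken from the question; a real hook would query git instead.
branch="feature/EOPP-234-foo"

# -E selects POSIX extended regex, -o prints only the matched part.
ticket=$(printf '%s\n' "$branch" | grep -Eo "[A-Z]{4}-[0-9]{1,5}")

echo "$ticket"   # prints EOPP-234
```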

Capture group from regex in bash script

When building an R package the command outputs the process steps to std out. From that output I would like to capture the final name of the package.
In the simulated script below I show the output of the build command. The part that needs to be captured is the last line starting with building.
How do I get the regex to match with these quotes, and then capture the package name into a variable?
#!/usr/bin/env bash
var=$(cat <<"EOF"
Warning message:
* checking for file ‘./DESCRIPTION’ ... OK
* preparing ‘analysis’:
* checking DESCRIPTION meta-information ... OK
* cleaning src
* checking for LF line-endings in source and make files and shell scripts
* checking for empty or unneeded directories
Removed empty directory ‘analysis/.idea/inspectionProfiles’
Removed empty directory ‘analysis/.idea/snapshots’
* creating default NAMESPACE file
* building ‘analysis_0.1.tar.gz’
EOF
)
regex="building [\u2018](.*?)?[\u2019]"
if [[ "${var}" =~ $regex ]]; then
pkgname="${BASH_REMATCH[1]}"
echo "${pkgname}"
else
echo "sad face"
fi
This should work on both macOS and CentOS.
Support for the \u and \U unicode escapes was introduced in Bash 4.2. CentOS 7 has Bash 4.2, so this should work on that platform:
regex=$'.*building[[:space:]]+\u2018(.*)\u2019'
Unfortunately, earlier versions of CentOS had older versions of Bash, and I believe the default version of Bash on MacOS is still 3.2. For those, assuming that the quotes are encoded as UTF-8, this should work:
regex=$'.*building[[:space:]]+\xe2\x80\x98(.*)\xe2\x80\x99'
If the quotes are encoded in different ways on different platforms, then you could use alternation (e.g. (\xe2\x80\x98|...) instead of \xe2\x80\x98) to match all of the possibilities, adjusting the index used for BASH_REMATCH accordingly.
See How do you echo a 4-digit Unicode character in Bash? for more information about Unicode in Bash.
I've used $'...' to set the regular expression because it supports \x and (from Bash 4.2) \u escapes for characters, and Bash regular expressions don't.
With regard to the regular expression:
The leading .* is to ensure that the match occurs at the end of the text.
I've dropped the ?s because they aren't compatible with Bash's built-in regular expressions. See mkelement0's excellent answer to How do I use a regex in a shell script? for information about Bash regular expressions.
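Putting the pieces together, a sketch of the byte-escape variant as a complete script. The var value here is a shortened, hypothetical stand-in for the build output, and it assumes the quotes are UTF-8 encoded as described above.

```shell
#!/usr/bin/env bash
# Shortened stand-in for the build output; the quotes are written as UTF-8 bytes.
var=$'* creating default NAMESPACE file\n* building \xe2\x80\x98analysis_0.1.tar.gz\xe2\x80\x99'

# $'...' turns the \x escapes into raw bytes before the regex engine sees them.
regex=$'.*building[[:space:]]+\xe2\x80\x98(.*)\xe2\x80\x99'

if [[ $var =~ $regex ]]; then
    pkgname=${BASH_REMATCH[1]}
    printf '%s\n' "$pkgname"   # prints analysis_0.1.tar.gz
else
    echo "sad face"
fi
```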
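There are many ways to do it; this is one:
# Note: head -c -N (drop the last N bytes) is a GNU extension; BSD/macOS head does not accept it
file=$(echo "$var" | grep '^\* building' | grep -o '‘.*’' | head -c -4 | tail -c +4)
echo "$file"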
Find the line starting with * building (first grep)
Find the text between ‘’ (second grep)
Discard the quotes: tail -c +4 skips the 3-byte opening quote, and head -c -4 drops the 3-byte closing quote plus the trailing newline
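Since a negative byte count for head -c is a GNU extension, a variant that strips the quotes with parameter expansion may travel better to macOS. This is only a sketch: the open and close helper variables are made up for illustration, the input is a shortened stand-in for the build output, and the quotes are assumed to be UTF-8 encoded.

```shell
#!/usr/bin/env bash
# Helper variables (hypothetical names) holding the UTF-8 bytes of the two quotes.
open=$'\xe2\x80\x98'
close=$'\xe2\x80\x99'

# Shortened stand-in for the build output.
var="* creating default NAMESPACE file
* building ${open}analysis_0.1.tar.gz${close}"

# Keep only the line that starts with "* building".
line=$(printf '%s\n' "$var" | grep '^\* building')

file=${line#*"$open"}    # drop everything up to and including the opening quote
file=${file%"$close"*}   # drop the closing quote and anything after it
printf '%s\n' "$file"    # prints analysis_0.1.tar.gz
```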

Bash Regex matches on Ubuntu but not on macOS

I have the following bash script:
#!/bin/bash
git_status="$(git status 2>/dev/null)"
branch_pattern="^(# |)On branch ([^${IFS}]*)"
echo $git_status
echo $branch_pattern
if [[ ${git_status} =~ ${branch_pattern} ]]; then
echo 'hello'
echo $BASH_REMATCH
fi
Here is the output when I run the script on Ubuntu with bash version 4:
On branch master Initial commit Untracked files: (use "git add <file>..." to include in what will be committed) test.sh nothing added to commit but untracked files present (use "git add" to track)
^(# |)On branch ([^ ]*)
hello
On branch master
However, when I run the same script on macOS with bash version 3, the regex does not match, and nothing inside the if block is executed. The rest of the output is identical. What am I missing? Does my regex need to be formatted differently on macOS/in this version of bash? Is there a flag I am missing?
I have seen similar posts about differences in regex behavior across platforms for e.g., the find command, but I have not yet found a post relevant to my issue.
It looks to me like there's a bug in the RE engine in the version of bash that macOS comes with (it's rather old -- 3.2.57). It's something to do with the ^(# |) part -- it doesn't seem to match an empty string at the beginning of the string, as it should. But I found a workaround. Apparently the bug doesn't happen if the ^ is inside the parentheses, like this:
branch_pattern="(^# |^)On branch ([^${IFS}]*)"
BTW, you shouldn't use echo $varname to print the contents of a variable. For one thing, it'll do word splitting (converting all runs of whitespace into single spaces) and wildcard expansion on the value, which can be very confusing/misleading. Try something like printf '<<%q>>\n' "$varname" instead. Its output can be a little cryptic if the variable contains weird characters, but at least it'll make it clear that there are weird things in there.
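A sketch combining the workaround pattern with the printf suggestion. The git_status value is a hypothetical stand-in so the snippet runs outside a repository, and the branch variable is introduced here only for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the output of "git status".
git_status=$'On branch master\nInitial commit'

# The anchor sits inside the group, as in the workaround above.
branch_pattern="(^# |^)On branch ([^${IFS}]*)"

if [[ $git_status =~ $branch_pattern ]]; then
    branch=${BASH_REMATCH[2]}         # second group: the branch name
    printf '<<%q>>\n' "$branch"       # prints <<master>>
fi
```

Group 1 holds the optional "# " prefix, which is why the branch name is read from index 2.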
If you are only trying to get the current git branch name, there is no need for regex. Git already has this built in.
git rev-parse --abbrev-ref HEAD 2>/dev/null
This will print out the current branch name (if any).
If you are in a git repository without any commits, it will only return "HEAD".

How can I capture group from string using regexp in shell in msysgit on Windows?

[Editor's note: The OP has later clarified that he's running bash as part of msysgit, the Git version for Windows.]
I'm trying to get the last digits from a string. I have a little script, but it doesn't work and I don't know why:
#!bin/bash
TAGS="MASTER_46"
re="_(\d+)"
if [[ ${TAGS}=~$re ]]; then
echo "Find"
echo ${BASH_REMATCH}
echo ${BASH_REMATCH[1]}
fi
The output:
Find
{empty}
{empty}
I am using bash
$ bash --version
GNU bash, version 3.1.20(4)-release
There are several problems:
bash doesn't support \d. Use [0-9].
Whitespace is needed around the operator: $TAGS =~ $re, otherwise bash parses it as [[ -n "$TAGS=~$re" ]].
Path to the shell is /bin/bash, not bin/bash.
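With those three fixes applied, the script might look like the sketch below. It assumes a bash build that supports =~; the digits variable is added here only to make the result easy to reuse.

```shell
#!/bin/bash
TAGS="MASTER_46"
re="_([0-9]+)"    # [0-9] instead of \d

# Spaces around =~ so bash parses it as the regex-matching operator.
if [[ $TAGS =~ $re ]]; then
    echo "Found"
    echo "${BASH_REMATCH[0]}"     # whole match: _46
    digits=${BASH_REMATCH[1]}     # first group: 46
    echo "$digits"
fi
```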
Update, based on the OP's clarification re environment and his own findings:
tl;dr:
msysgit (as of version 1.9.5) comes with a bash executable that is compiled without support for =~, the regex-matching operator
A limited workaround is to use utilities such as grep, sed, and awk instead.
To solve this problem fundamentally, install a separate Unix emulation environment such as MSYS or Cygwin, and use git.exe (the core of msysgit) from there.
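For the utility-based workaround, a sketch that avoids =~ entirely, using the sample value from the question. The variable names digits and suffix are illustrative. The sed pattern is a POSIX basic regex, so it should not depend on GNU extensions.

```shell
#!/bin/bash
TAGS="MASTER_46"

# sed: capture the run of digits after the last underscore at the end of the string.
digits=$(printf '%s\n' "$TAGS" | sed -n 's/.*_\([0-9][0-9]*\)$/\1/p')
echo "$digits"    # prints 46

# Pure parameter expansion: strip everything up to the last underscore.
# Unlike the sed version, this does not check that the remainder is numeric.
suffix=${TAGS##*_}
echo "$suffix"    # prints 46
```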
choroba's answer has great pointers, but let me add that, since you get the following error message:
conditional binary operator expected syntax error near =~
the implication is either that
your bash version is too old to support =~, the regex-matching operator.
your bash version was compiled without regex support
Given that =~ was introduced in bash 3.0 (see http://tiswww.case.edu/php/chet/bash/NEWS) and you're running 3.1.x, it must be the latter, which indeed turned out to be true:
The OP runs msysgit, the official Git version for Windows that comes with bash and a set of Unix utilities.
As it turns out, as of version 1.9.5, the bash executable that comes with msysgit is built without regex support, presumably due to technical difficulties - see https://groups.google.com/forum/#!topic/msysgit/yPh85MPDyfE.
Incredibly, the "Known Issues" section of the release notes does not mention this limitation.
Your best bet is to:
Install MSYS to use as your Unix emulation environment - its bash does come with =~ support (note that msysgit was originally forked from MSYS).
Alternatively, to get better Unix emulation and more tools at the expense of a larger installation footprint and possibly performance, install Cygwin instead.
In MSYS, use only git.exe from msysgit, via the Windows %PATH%.
To that end, be sure to install msysgit with the Run Git from the Windows Command Prompt option - see https://stackoverflow.com/a/5890222/45375
Alternatively, add C:\Program Files\Git\cmd (assumes the default path on 32-bit Windows, on 64-bit Windows it's C:\Program Files (x86)\Git\cmd) manually to your Windows %PATH%, OR extend $PATH in your MSYS ~/.profile file (e.g., PATH="$PATH:/c/program files/git/cmd").
You could hack your msysgit installation, but it hardly seems worth it and I don't know what the side effects are;
If you really wanted to try, do the following: Copy the following files from an MSYS installation's bin directory to msysgit's bin directory: bash.exe, sh.exe, msys-termcap-0.dll - in other words: replace msysgit's bash altogether.

Syntax error using =~ operator in bash

I'm very new to bash and I'm trying to extract a portion of a string based on a pattern, but when I execute my code I'm seeing errors.
Sample code:
#!/bin/sh
STRING="LAX-8912_Words_Are_Here";
if [[ $STRING =~ LAX-(\d)+ ]]; then
echo "${BASH_REMATCH[1]}"
fi
So from the code above, I want to extract the "LAX-8912" portion of the string. Basically the string will be LAX- followed by a series of digits of any length. When the code is executed, however, I get this message:
Syntax error: "(" unexpected (expecting "then")
I've also tried storing the regex in a variable like this:
#!/bin/sh
STRING="LAX-8912_Words_Are_Here";
REX="LAX-(\d)+";
if [[ $STRING =~ $REX ]]; then
echo "${BASH_REMATCH[1]}"
fi
But then I get this error:
[[: not found
My bash version is 4.2.25 so I'm guessing it's not a version issue, but I'm at a bit of a loss as to what's going on.
Since your script starts with:
#!/bin/sh
it will run the system shell, which may be a completely different shell like dash, or at best bash in compatibility mode. You should use:
#!/bin/bash
to use bash with all its features.
Similarly, if run with sh file you override the shebang and force the script to run with the system shell. Use ./file so that the script can run with its declared shebang.
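A sketch of the script with the bash shebang in place. Beyond the shebang fix, it also swaps \d for [0-9], because bash's =~ uses POSIX extended regular expressions, which have no \d. The re and match variables are introduced here for illustration.

```shell
#!/bin/bash
STRING="LAX-8912_Words_Are_Here"
re='LAX-([0-9]+)'    # [0-9] instead of \d

if [[ $STRING =~ $re ]]; then
    match=${BASH_REMATCH[0]}      # whole match: LAX-8912
    echo "$match"
    echo "${BASH_REMATCH[1]}"     # first group: 8912
fi
```

Index 0 of BASH_REMATCH holds the whole match, which is the "LAX-8912" portion the question asks for; index 1 holds only the digits.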