Capture group from regex in bash script - regex

When building an R package, the build command writes its process steps to stdout. From that output I would like to capture the final name of the package.
In the simulated script below I show the output of the build command. The part that needs to be captured is the last line starting with building.
How do I get the regex to match with these quotes, and then capture the package name into a variable?
#!/usr/bin/env bash
var=$(cat <<"EOF"
Warning message:
* checking for file ‘./DESCRIPTION’ ... OK
* preparing ‘analysis’:
* checking DESCRIPTION meta-information ... OK
* cleaning src
* checking for LF line-endings in source and make files and shell scripts
* checking for empty or unneeded directories
Removed empty directory ‘analysis/.idea/inspectionProfiles’
Removed empty directory ‘analysis/.idea/snapshots’
* creating default NAMESPACE file
* building ‘analysis_0.1.tar.gz’
EOF
)
regex="building [\u2018](.*?)?[\u2019]"
if [[ "${var}" =~ $regex ]]; then
pkgname="${BASH_REMATCH[1]}"
echo "${pkgname}"
else
echo "sad face"
fi
This should work on both macOS and CentOS.

Support for the \u and \U unicode escapes was introduced in Bash 4.2. CentOS 7 has Bash 4.2, so this should work on that platform:
regex=$'.*building[[:space:]]+\u2018(.*)\u2019'
Unfortunately, earlier versions of CentOS had older versions of Bash, and I believe the default version of Bash on macOS is still 3.2. For those, assuming that the quotes are encoded as UTF-8, this should work:
regex=$'.*building[[:space:]]+\xe2\x80\x98(.*)\xe2\x80\x99'
If the quotes are encoded in different ways on different platforms, then you could use alternation (e.g. (\xe2\x80\x98|...) instead of \xe2\x80\x98) to match all of the possibilities (and adjust the index used for BASH_REMATCH accordingly).
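As a rough sketch of the alternation idea, the pattern below accepts either the UTF-8 curly quotes or a plain ASCII apostrophe. The ASCII fallback and the sample line are assumptions for illustration, not something the build tool is known to emit. Note that the package name moves to BASH_REMATCH[2], because the opening-quote alternation is now group 1.

```shell
#!/usr/bin/env bash
# Hypothetical input that uses ASCII quotes instead of curly quotes.
line="* building 'analysis_0.1.tar.gz'"

# Group 1: opening quote (UTF-8 curly quote or ASCII apostrophe).
# Group 2: the package name.
# Group 3: closing quote (UTF-8 curly quote or ASCII apostrophe).
regex=$'.*building[[:space:]]+(\xe2\x80\x98|\')(.*)(\xe2\x80\x99|\')'

if [[ "$line" =~ $regex ]]; then
    pkgname="${BASH_REMATCH[2]}"
    printf '%s\n' "$pkgname"
fi
```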
See How do you echo a 4-digit Unicode character in Bash? for more information about Unicode in Bash.
I've used $'...' to set the regular expression because it supports \x and (from Bash 4.2) \u escapes for characters, and Bash regular expressions don't.
With regard to the regular expression:
The leading .* is greedy, so it ensures that the last occurrence of building in the text is the one that gets matched.
I've dropped the ?s because they aren't compatible with Bash's built-in regular expressions. See mklement0's excellent answer to How do I use a regex in a shell script? for information about Bash regular expressions.
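Putting the pieces together, a minimal sketch of the question's script with the byte-escaped pattern might look like this. The build output is shortened here, and the variable name build_output is just illustrative; the pattern assumes the quotes arrive as UTF-8.

```shell
#!/usr/bin/env bash
# Simulated (shortened) build output; the real text would come from the build command.
build_output=$(cat <<"EOF"
* creating default NAMESPACE file
* building ‘analysis_0.1.tar.gz’
EOF
)

# \xe2\x80\x98 and \xe2\x80\x99 are the UTF-8 bytes of the opening and closing curly quotes.
regex=$'.*building[[:space:]]+\xe2\x80\x98(.*)\xe2\x80\x99'

if [[ "$build_output" =~ $regex ]]; then
    pkgname="${BASH_REMATCH[1]}"
    printf '%s\n' "$pkgname"
else
    echo "no match" >&2
fi
```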

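There are many ways to do it; this is one:
file=$(echo "$var" | grep '^\* building' | grep -o '‘.*’' | head -c -4 | tail -c +4)
echo "$file"
Find the line starting with * building (first grep)
Find the text between ‘’ (second grep)
Discard the quotes: the closing quote plus the trailing newline (last 4 bytes, head) and the opening quote (first 3 bytes, tail). Note that head -c with a negative count is a GNU extension, so this may not work with the stock head on macOS.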
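A possible alternative to the pipeline above is a single sed substitution, sketched below. As far as I can tell it uses only POSIX sed features (-n, a basic regular expression with one group, and the p flag), but it assumes a UTF-8 locale and that the package name sits between curly quotes at the end of the line.

```shell
#!/usr/bin/env bash
# Simulated (shortened) build output.
var=$(cat <<"EOF"
* creating default NAMESPACE file
* building ‘analysis_0.1.tar.gz’
EOF
)

# -n suppresses the default output; the p flag prints only the line
# on which the substitution succeeded, reduced to the captured name.
file=$(printf '%s\n' "$var" | sed -n 's/^\* building ‘\(.*\)’$/\1/p')
printf '%s\n' "$file"
```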

Related

Case insensitive sed for pattern [duplicate]

I'm trying to use SED to extract text from a log file. I can do a search-and-replace without too much trouble:
sed 's/foo/bar/' mylog.txt
However, I want to make the search case-insensitive. From what I've googled, it looks like appending i to the end of the command should work:
sed 's/foo/bar/i' mylog.txt
However, this gives me an error message:
sed: 1: "s/foo/bar/i": bad flag in substitute command: 'i'
What's going wrong here, and how do I fix it?
Update: Starting with macOS Big Sur (11.0), sed now does support the I flag for case-insensitive matching, so the command in the question should now work (BSD sed doesn't report its version, but you can go by the date at the bottom of the man page, which should be March 27, 2017 or more recent); a simple example:
# BSD sed on macOS Big Sur and above (and GNU sed, the default on Linux)
$ sed 's/ö/#/I' <<<'FÖO'
F#O # `I` matched the uppercase Ö correctly against its lowercase counterpart
Note: I (uppercase) is the documented form of the flag, but i works as well.
Similarly, starting with macOS Big Sur (11.0) awk now is locale-aware (awk --version should report 20200816 or more recent):
# BSD awk on macOS Big Sur and above (and GNU awk, the default on Linux)
$ awk '{ print tolower($0) }' <<<'FÖO'
föo # non-ASCII character Ö was properly lowercased
The following applies to macOS up to Catalina (10.15):
To be clear: On macOS, sed - which is the BSD implementation - does NOT support case-insensitive matching - hard to believe, but true. The formerly accepted answer, which itself shows a GNU sed command, gained that status because of the perl-based solution mentioned in the comments.
To make that Perl solution work with foreign characters as well, via UTF-8, use something like:
perl -C -Mutf8 -pe 's/öœ/oo/i' <<< "FÖŒ" # -> "Foo"
-C turns on UTF-8 support for streams and files, assuming the current locale is UTF-8-based.
-Mutf8 tells Perl to interpret the source code as UTF-8 (in this case, the string passed to -pe) - this is the shorter equivalent of the more verbose -e 'use utf8;'. Thanks, Mark Reed.
(Note that using awk is not an option either, as awk on macOS (i.e., BWK awk and BSD awk) appears to be completely unaware of locales altogether - its tolower() and toupper() functions ignore foreign characters (and sub() / gsub() don't have case-insensitivity flags to begin with).)
A note on the relationship of sed and awk to the POSIX standard:
BSD sed and awk limit their functionality mostly to what the POSIX sed and
POSIX awk specs mandate, whereas their GNU counterparts implement many more extensions.
Editor's note: This solution doesn't work on macOS (out of the box), because it only applies to GNU sed, whereas macOS comes with BSD sed.
Capitalize the 'I'.
sed 's/foo/bar/I' file
Another work-around for sed on Mac OS X is to install gsed from MacPorts or Homebrew and then create the alias sed='gsed'.
If you are doing pattern matching first, e.g.,
/pattern/s/xx/yy/g
then you want to put the I after the pattern:
/pattern/Is/xx/yy/g
Example:
echo Fred | sed '/fred/Is//willma/g'
returns willma; without the I, it returns the string untouched (Fred).
The sed FAQ addresses the closely related case-insensitive search. It points out that a) many versions of sed support a flag for it and b) it's awkward to do in sed, you should rather use awk or Perl.
But to do it in POSIX sed, they suggest three options (adapted for substitution here):
Convert to uppercase and store original line in hold space; this won't work for substitutions, though, as the original content will be restored before printing, so it's only good for insert or adding lines based on a case-insensitive match.
Maybe the possibilities are limited to FOO, Foo and foo. These can be covered by
s/FOO/bar/;s/[Ff]oo/bar/
To search for all possible matches, one can use bracket expressions for each character:
s/[Ff][Oo][Oo]/bar/
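Writing the bracket expressions by hand gets tedious for longer words, so here is a small sketch of a helper that generates them. The function name ci_pattern is hypothetical, and the sketch assumes the search word contains only ASCII letters and digits (regex metacharacters are not escaped).

```shell
#!/usr/bin/env bash
# Build a case-insensitive bracket-expression pattern from a plain word,
# e.g. foo -> [Ff][Oo][Oo]. Assumes ASCII letters and digits only.
ci_pattern() {
    local word=$1 out='' i ch lower upper
    for (( i = 0; i < ${#word}; i++ )); do
        ch=${word:i:1}
        lower=$(printf '%s' "$ch" | tr '[:upper:]' '[:lower:]')
        upper=$(printf '%s' "$ch" | tr '[:lower:]' '[:upper:]')
        if [[ $lower != "$upper" ]]; then
            out+="[${upper}${lower}]"   # a letter: match both cases
        else
            out+=$ch                    # not a letter: keep as-is
        fi
    done
    printf '%s\n' "$out"
}

pattern=$(ci_pattern foo)
result=$(printf 'FOO Foo foo\n' | sed "s/${pattern}/bar/g")
printf '%s\n' "$pattern" "$result"
```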
The Mac version of sed seems a bit limited. One way to work around this is to use a Linux container (via Docker) which has a usable version of sed:
cat your_file.txt | docker run -i busybox /bin/sed -r 's/[0-9]{4}/****/Ig'
Use the following to replace all occurrences:
sed 's/foo/bar/gI' mylog.txt
I had a similar need, and came up with this.
This command simply finds all the files:
grep -i -l -r foo ./*
This one excludes this_shell.sh (in case you put the command in a script called this_shell.sh), tees the output to the console so you can see what happened, and then runs sed on each file name found to replace the text foo with bar:
grep -i -l -r --exclude "this_shell.sh" foo ./* | tee /dev/fd/2 | while read -r x; do sed -b -i 's/foo/bar/gi' "$x"; done
I chose this method because I didn't like having the timestamps changed on files that were not modified. Feeding sed the grep result means only the files containing the target text are touched (which likely improves speed as well).
Be sure to back up your files and test before using. It may not work in some environments for files whose names contain embedded spaces.
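One way to make the loop tolerate file names with spaces, assuming GNU grep and GNU sed, is to pass NUL-separated names between the two. The temporary directory and file names below are made up purely for the demonstration.

```shell
#!/usr/bin/env bash
# Throwaway test data (hypothetical file names).
workdir=$(mktemp -d)
printf 'Foo one\n' > "$workdir/has space.txt"
printf 'nothing here\n' > "$workdir/other.txt"

# -Z makes GNU grep terminate each file name with a NUL byte;
# read -d '' then reads one NUL-terminated name at a time.
grep -rliZ foo "$workdir" | while IFS= read -r -d '' f; do
    sed -i 's/foo/bar/gI' "$f"   # GNU sed: in-place edit, case-insensitive
done

changed=$(cat "$workdir/has space.txt")
untouched=$(cat "$workdir/other.txt")
printf '%s\n' "$changed" "$untouched"
rm -rf "$workdir"
```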
The following should be fine:
sed -i 's/foo/bar/gi' mylog.txt

sed doesn't work in bash script even though individual RegEx expressions seem to be correct [duplicate]

This question already has an answer here:
Sed doesn't work in command line however regular expression in online test regex101 works
What I'm trying to do
Bash script to replace the uncommented php_version: "7.2" with a user-entered version:
#!/bin/bash
# Ask desired PHP version
read -p "What version of PHP should be used? (5.6, 7.0, 7.1, or 7.2): " phpVersion
# Replace default php version with desired php version
sed -i "s/^php_version: \"[0-9]+\.[0-9]+\"/php_version: \"${phpVersion}\"/g" fileToSearchReplace.txt
Contents of fileToSearchReplace.txt which vim reports is a unix file:
APIVersion: v1.0.0
name: alpha-local
type: typo3
docroot: public
# php_version: "7.1" # PHP version to use, "5.6", "7.0", "7.1", "7.2"
php_version: "7.2"
router_http_port: "80"
router_https_port: "443"
xdebug_enabled: false
additional_hostnames: []
additional_fqdns: []
provider: default
Diagnostics
This worked: I checked the RegEx here using ^php_version: "[0-9]+\.[0-9]+" for search and php_version: "7.0" for replace (you can find the replace field, and paste in the text, on a tab below). Notice I could only get the search to work using the /gm flags at the end. I have no idea how to get multiline matching to work in sed.
Environment
Windows 10 Pro host
Using MINGW64 git bash terminal
These files are part of a Docker container
I tried running this sed both in my MINGW64 git bash terminal and also inside the Ubuntu container
By default, sed uses basic regular expressions, in which an unescaped + is a literal character rather than a quantifier. Either emulate + yourself, like this:
sed -i "s/^php_version: \"[0-9][0-9]*\.[0-9][0-9]*\"/php_version: \"${phpVersion}\"/g" fileToSearchReplace.txt
Or use the extended regex option -r:
sed -r -i "s/^php_version: \"[0-9]+\.[0-9]+\"/php_version: \"${phpVersion}\"/g" fileToSearchReplace.txt
Example: https://repl.it/repls/PowderblueFoolishEquipment
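To see the difference without touching a file, the following sketch runs the three variants against a single sample line through a pipe. The variable names are illustrative, and the version number is hard-coded instead of being read from the user.

```shell
#!/usr/bin/env bash
phpVersion="7.0"
sample='php_version: "7.2"'

# BRE with a hand-written "one or more": [0-9][0-9]*
bre=$(printf '%s\n' "$sample" |
    sed "s/^php_version: \"[0-9][0-9]*\.[0-9][0-9]*\"/php_version: \"${phpVersion}\"/")

# ERE (-E), where + is a quantifier
ere=$(printf '%s\n' "$sample" |
    sed -E "s/^php_version: \"[0-9]+\.[0-9]+\"/php_version: \"${phpVersion}\"/")

# BRE with a bare +, which is taken literally, so nothing matches
literal_plus=$(printf '%s\n' "$sample" |
    sed "s/^php_version: \"[0-9]+\.[0-9]+\"/php_version: \"${phpVersion}\"/")

printf '%s\n' "$bre" "$ere" "$literal_plus"
```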
I don't think it's a good idea to try out a regex on a site that doesn't say which flavor of regex it supports and then reuse it in your command.
You have a classic mix-up: the pattern from the site uses Extended Regular Expressions (ERE), while your sed command uses Basic Regular Expressions (BRE), which do not support your [0-9]+ expression. Simply enable ERE support with the -E switch in your sed:
sed -Ei "s/^php_version: \"[0-9]+\.[0-9]+\"/php_version: \"${phpVersion}\"/g" fileToSearchReplace.txt
See the [ POSIX | GNU ] variants of the same solution, with the POSIX version using the BRE and the GNU version using ERE. Do read through this useful article on POSIX Regular Expressions on how the two variants of the regular expressions work.
You may use this script with select that lets user select a version from a menu list. You can then pass selected option to a simplified sed command:
select ver in "5.6" "7.0" "7.1" "7.2"; do
    sed -i "s/^\(php_version: \).\+/\1\"$ver\"/" fileToSearchReplace.txt
    break
done

Bash Regex matches on Ubuntu but not on macOS

I have the following bash script:
#!/bin/bash
git_status="$(git status 2>/dev/null)"
branch_pattern="^(# |)On branch ([^${IFS}]*)"
echo $git_status
echo $branch_pattern
if [[ ${git_status} =~ ${branch_pattern} ]]; then
echo 'hello'
echo $BASH_REMATCH
fi
Here is the output when I run the script on Ubuntu with bash version 4:
On branch master Initial commit Untracked files: (use "git add <file>..." to include in what will be committed) test.sh nothing added to commit but untracked files present (use "git add" to track)
^(# |)On branch ([^ ]*)
hello
On branch master
However, when I run the same script on macOS with bash version 3, the regex does not match, and nothing inside the if block is executed. The rest of the output is identical. What am I missing? Does my regex need to be formatted differently on macOS/in this version of bash? Is there a flag I am missing?
I have seen similar posts about differences in regex behavior across platforms for e.g., the find command, but I have not yet found a post relevant to my issue.
It looks to me like there's a bug in the RE engine in the version of bash that macOS comes with (it's rather old -- 3.2.57). It's something to do with the ^(# |) part -- it doesn't seem to match an empty string at the beginning of the string, as it should. But I found a workaround. Apparently the bug doesn't happen if the ^ is inside the parentheses, like this:
branch_pattern="(^# |^)On branch ([^${IFS}]*)"
BTW, you shouldn't use echo $varname to print the contents of a variable. For one thing, it'll do word splitting (converting all runs of whitespace into single spaces) and wildcard expansion on the value, which can be very confusing/misleading. Try something like printf '<<%q>>\n' "$varname" instead. Its output can be a little cryptic if the variable contains weird characters, but at least it'll make it clear that there are weird things in there.
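As a small sketch of how the workaround and the printf advice fit together, the snippet below feeds a hard-coded sample of git status output (an assumption, so that it runs outside a repository) through the adjusted pattern and prints the captured branch name, which ends up in BASH_REMATCH[2].

```shell
#!/usr/bin/env bash
# Hard-coded stand-in for "$(git status 2>/dev/null)".
git_status=$'On branch master\n\nInitial commit\n'

# The anchor is inside the parentheses, as in the workaround above.
branch_pattern="(^# |^)On branch ([^${IFS}]*)"

if [[ ${git_status} =~ ${branch_pattern} ]]; then
    branch=${BASH_REMATCH[2]}
    # %q quotes the value so stray whitespace or odd characters are visible.
    printf '<<%q>>\n' "$branch"
fi
```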
If you are only trying to get the current git branch name, there is no need for regex. Git already has this built in.
git rev-parse --abbrev-ref HEAD 2>/dev/null
This will print out the current branch name (if any).
If you are in a git repository without any commits, it will only return "HEAD".

use sed replace ""string"\1\"787" to "string"\1\"787" in cygwin

I am trying to search for a string and replace it in a file. I used the below code:
sed -e 's/{"AP_SESSION_ID"\1\"787"}/{"AP_SESSION_ID"\1\"800"}/g' FILE|tee FILE
but it is not working and the output is like this:
sed: number in \[0-9] invalid
My environment is CYGWIN.
sample file is:
DP_SESSION_ID is a sting for values
DP_SESSION_ID is aplicat
"DP_S42SETTACC_TYPE"\1\"02"
"DP_SAP_CLIENT"\1\"460"
"DP_SAP_COMM_CONNECTION"\1\"JAVA_COMM_TOOL_ANALYZER"
"DP_SAP_CONNECTION"\1\"JAVA_TOOL_ANALYZER"
"DP_SAP_TOOLBI_CONNECTION"\1\"JAVA_TOOLBI_ANALYZER"
"DP_SESSION_ID"\1\"808"
I want to search for the string "DP_SESSION_ID"\1\" and replace the corresponding number (such as 808) in the file permanently (Windows environment). I want a single-line command, either a Windows batch command or a Perl command; I don't want a script or program.
I have also installed Cygwin on my server, so a Unix command is fine too, as long as it is a single line.
server: windows 2008,cygwin x
using tool : datastage server jobs
perl -pi -e 's{" "DP_SESSION_ID"\1\"808 '"}{' "DP_SESSION_ID"\1\"900 '"'"}g' " file name
This code is not working.
Please suggest a good solution.
You need to "escape" the backslashes by using two in a row:
sed -e 's/{"AP_SESSION_ID"\\1\\"787"}/{"AP_SESSION_ID"\\1\\"800"}/g' FILE|tee FILE
Otherwise the \1 is treated as a backreference, and you have no subgroups (parenthesized expressions) to reference.
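To check the escaping in isolation, here is a sketch that applies the same idea to one line from the sample file, passed through a pipe instead of a file. The replacement value 900 is taken from the question's Perl attempt; the variable names are illustrative.

```shell
#!/usr/bin/env bash
# One line from the sample file; single quotes keep the backslashes literal.
line='"DP_SESSION_ID"\1\"808"'

# In the sed script, \\ stands for one literal backslash,
# both in the pattern and in the replacement.
updated=$(printf '%s\n' "$line" |
    sed -e 's/"DP_SESSION_ID"\\1\\"808"/"DP_SESSION_ID"\\1\\"900"/')

printf '%s\n' "$updated"
```

When applying this to a real file, it is probably safer to write the result to a temporary file and move it into place afterwards, since piping sed's output into tee on the same file it is still reading can truncate the input.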
Apart from back-slashes, IMO you also need to escape the quotes.
sed -e 's/{\"AP_SESSION_ID\"\\1\\\"787\"}/{\"AP_SESSION_ID\"\\1\\\"800\"}/g' FILE|tee FILE

grep with regexp: whitespace doesn't match unless I add an assertion

GNU grep 2.5.4 on bash 4.1.5(1) on Ubuntu 10.04
This matches
$ echo "this is a line" | grep 'a[[:space:]]\+line'
this is a line
But this doesn't
$ echo "this is a line" | grep 'a\s\+line'
But this matches too
$ echo "this is a line" | grep 'a\s\+\bline'
this is a line
I don't understand why #2 does not match (whereas #1 does) and why #3 also shows a match. What's the difference here?
Take a look at your grep manpage. Perl added a lot of regular expression extensions that weren't in the original specification. However, because they proved so useful, many programs adopted them.
Unfortunately, grep is sometimes stuck in the past because you want to make sure your grep command remains compatible with older versions of grep.
Some systems have egrep with some extensions. Others allow you to use grep -E to get them. Still others have a grep -P that allows you to use Perl extensions. I believe Linux systems' grep command can use the -P extension which is not available in most Unix systems unless someone has replaced the grep with the GNU version. Newer versions of Mac OS X also support the -P switch, but not older versions.
grep doesn't support the complete set of Perl regular expression extensions by default, so try using -P to enable Perl regular expressions. With -P you don't need to escape the +, e.g.:
echo "this is a line" | grep -P 'a\s+line'
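If -P is not available on a given system, one portable sketch is to stay within POSIX extended regular expressions and use the [[:space:]] class, which the question already showed working in the basic syntax. The variable names here are illustrative.

```shell
#!/usr/bin/env bash
line='this is a line'

# -E selects extended regular expressions (unescaped +);
# -q suppresses output so only the exit status is used.
if printf '%s\n' "$line" | grep -Eq 'a[[:space:]]+line'; then
    matched=yes
else
    matched=no
fi
printf '%s\n' "$matched"
```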