Using regular expressions in shell script - regex

What is the correct way to parse a string using regular expressions in a linux shell script? I wrote the following script to print my SO rep on the console using curl and sed (not solely because I'm rep-crazy - I'm trying to learn some shell scripting and regex before switching to linux).
json=$(curl -s http://stackoverflow.com/users/flair/165297.json)
echo $json | sed 's/.*"reputation":"\([0-9,]\{1,\}\)".*/\1/' | sed s/,//
But somehow I feel that sed is not the proper tool to use here. I heard that grep is all about regex and explored it a bit. But apparently it prints the whole line whenever a match is found - I am trying to extract a number from a single line of text. Here is a downsized version of the string that I'm working on (returned by curl).
{"displayName":"Amarghosh","reputation":"2,737","badgeHtml":"\u003cspan title=\"1 silver badge\"\u003e\u003cspan class=\"badge2\"\u003e●\u003c/span\u003e\u003cspan class=\"badgecount\"\u003e1\u003c/span\u003e\u003c/span\u003e"}
I guess my questions are:
What is the correct way to parse a string using regular expressions in a linux shell script?
Is sed the right thing to use here?
Could this be done using grep?
Is there any other command that's easier or more appropriate?

The grep command will select the desired line(s) from many but it will not directly manipulate the line. For that, you use sed in a pipeline:
someCommand | grep 'Amarghosh' | sed -e 's/foo/bar/g'
Alternatively, awk (or perl if available) can be used. It's a far more powerful text processing tool than sed in my opinion.
someCommand | awk '/Amarghosh/ { do something }'
For simple text manipulations, just stick with the grep/sed combo. When you need more complicated processing, move on up to awk or perl.
My first thought is to just use:
echo '{"displayName":"Amarghosh","reputation":"2,737","badgeHtml"' |
sed -e 's/.*tion":"//' -e 's/".*//' -e 's/,//g'
which keeps the number of sed processes to one (you can give multiple commands with -e).

You may be interested in using Perl for such tasks. As a demonstration, here is a Perl script which prints the number you want:
#!/usr/local/bin/perl
use warnings;
use strict;
use LWP::Simple;
use JSON;
my $url = "http://stackoverflow.com/users/flair/165297.json";
my $flair = get ($url);
my $parsed = from_json ($flair);
print "$parsed->{reputation}\n";
This script requires you to install the JSON module, which you can do with just the command cpan JSON.

For working with JSON in a shell script, use jsawk, which is like awk but for JSON.
json=$(curl -s http://stackoverflow.com/users/flair/165297.json)
echo $json | jsawk 'return this.reputation' # 2,747

My proposition:
$ echo $json | sed 's/,//g;s/^.*reputation...\([0-9]*\).*$/\1/'
I put two commands in sed argument:
s/,//g is used to remove all commas, in particular the ones that are present in the reputation value.
s/^.*reputation...\([0-9]*\).*$/\1/ locates the reputation value in the line and replaces the whole line by that value.
In this particular case, I find that sed provides the most compact command without loss of readability.
Other tools for manipulating strings (not only regex) include:
grep, awk, perl mentioned in most of the other answers
tr for replacing characters
cut, paste for handling multicolumn inputs
bash itself with its rich ${...} parameter expansion syntax for accessing and trimming variables
tail, head for keeping last or first lines of a file
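
As a rough illustration of the smaller tools in that list, here is a minimal sketch that uses cut and tr instead of a regex. The json value is a shortened copy of the sample from the question (the badgeHtml content is elided here), and the field number 8 is an assumption that only holds while "reputation" stays the second key.

```shell
# Shortened sample of the string returned by curl (badgeHtml content elided).
json='{"displayName":"Amarghosh","reputation":"2,737","badgeHtml":"..."}'

# Split on double quotes, take the 8th field (the reputation value),
# then delete the thousands separator.
rep=$(printf '%s\n' "$json" | cut -d '"' -f 8 | tr -d ',')
echo "$rep"
```

This is purely positional, so it breaks as soon as the key order changes; the regex-based answers are more robust in that respect.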

sed is appropriate, but you'll spawn a new process for every sed you use (which may be too heavyweight in more complex scenarios). grep is not really appropriate. It's a search tool that uses regexps to find lines of interest.
Perl is one appropriate solution here, being a shell scripting language with powerful regexp features. It'll do most everything you need without spawning out to separate processes (unlike normal Unix shell scripting) and has a huge library of additional functions.

You can do it with grep. grep has an -o switch, which extracts only the matching string rather than the whole line.
$ echo $json | grep -o '"reputation":"[0-9,]\+"' | grep -o '[0-9,]\+'
2,747

1) What is the correct way to parse a string using regular expressions in a linux shell script?
Tools with regular expression capabilities include sed, grep, awk, Perl and Python, to mention a few. Even newer versions of Bash have regex capabilities. All you need to do is look up the docs on how to use them.
2) Is sed the right thing to use here?
It can be, but it is not necessary.
3) Could this be done using grep?
Yes, it can. You just construct a regex similar to the one you would use with sed or the other tools. Note that grep only searches; if you want to modify any files, it will not do that for you.
4) Is there any other command that's easier/more appropriate?
Of course. Regex can be powerful, but it's not necessarily the best tool to use every time. It also depends on what you mean by "easier/appropriate".
Another method that needs minimal fuss with regex is the fields/delimiter approach: you look for patterns that the input can be split on. For example, in your case (I downloaded the 165297.json file instead of using curl, but the result is the same):
awk 'BEGIN{
FS="reputation" # split on the word "reputation"
}
{
m=split($2,a,"\",\"") # field 2 holds the value you want plus the rest of the line;
# split it on "," (quote, comma, quote) and save the pieces in array "a"
gsub(/[:\",]/,"",a[1]) # now strip the leftover colon, quotes and commas
print a[1]
}' 165297.json
output:
$ ./shell.sh
2747
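
The Bash regex support mentioned under question 1 could look roughly like the sketch below. It assumes Bash 3.2 or later (it will not run in plain sh), and the json value is a shortened copy of the sample from the question.

```shell
# Shortened sample of the string returned by curl.
json='{"displayName":"Amarghosh","reputation":"2,737"}'

# Keep the pattern in a variable so the quoting stays simple.
re='"reputation":"([0-9,]+)"'

if [[ $json =~ $re ]]; then
  rep=${BASH_REMATCH[1]//,/}   # capture group 1 with all commas removed
  echo "$rep"
fi
```

No external process is started here, which may matter if the extraction runs often.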

sed is a perfectly valid command for your task, but it may not be the only one.
grep may be useful too, but as you say it prints the whole line. It's most useful for filtering the lines of a multi-line file, and discarding the lines you don't want.
Efficient shell scripts can use a combination of commands (not just the two you mentioned), exploiting the talents of each.

Blindly:
echo $json | awk -F\" '{print $8}'
Similar (the field separator can be a regex):
awk -F'{"|":"|","|"}' '{print $5}'
Smarter (look for the key and print its value):
awk -F'{"|":"|","|"}' '{for(i=2; i<=NF; i+=2) if ($i == "reputation") print $(i+1)}'

You can use a proper library (as others noted):
E:\Home> perl -MLWP::Simple -MJSON -e "print from_json(get 'http://stackoverflow.com/users/flair/165297.json')->{reputation}"
or
$ perl -MLWP::Simple -MJSON -e 'print from_json(get "http://stackoverflow.com/users/flair/165297.json")->{reputation}, "\n"'
depending on OS/shell combination.

Simple RegEx via Shell
Disregarding the specific code in question, there may be times when you want to do a quick regex replace-all from stdin to stdout using shell, in a simple way, using a string syntax similar to JavaScript.
Below are some examples for anyone looking for a way to do this. Perl is a better bet on a Mac, since the sed that ships with macOS lacks some options found in GNU sed. If you want to capture stdin in a variable, you can use MY_VAR=$(cat);.
echo 'text' | perl -pe 's/search/replace/g'; # using perl
echo 'text' | sed -e 's/search/replace/g'; # using sed
And here's an example of a custom, reusable regex function. Arguments are source string (or -- for stdin), search, replace, and options.
regex() {
case "$#" in
( '0' ) return 1 ;; ( '1' ) echo "$1"; return 0 ;; # return, not exit: exit would close the calling shell
( '2' ) REP=''; OPT='' ;; ( '3' ) REP="$3"; OPT='' ;;
( * ) REP="$3"; OPT="$4" ;;
esac
TXT="$1"; SRCH="$2";
if [ "$1" = "--" ]; then [ ! -t 0 ] && TXT=$(cat); fi # read all of stdin, not just the first line
echo "$TXT" | perl -pe 's/'"$SRCH"'/'"$REP"'/'"$OPT";
}
echo 'text' | regex -- search replace g;

Related

Grep on Linux - How do I replace text with blankspace and newlines

I'm not used to using grep on Linux via the terminal. I'm used to dnGREP on Windows, but from what I've found there is no comparable GUI tool on Ubuntu.
How do I match the regular expression "^(.*?)[" in all files in a folder and replace it with a blank space?
I assume this one would follow the same methodology "](?=[^.]*$)"
Also, how do I replace the text below to add new lines
{"dev_is_looking_week"
with the same text and 4 blank lines underneath. Ignore the "." at the end. StackOverflow won't show blank newlines without a character at the end.
{"dev_is_looking_week"
.
You are using the wrong tool. grep is for selecting data. You may want to use awk, perl or sed instead.
Some examples:
awk '/example/ {print; printf "\n\n\n\n"; }'
awk '{print;} /example/ {printf "\n\n\n\n"; }'
perl -ne 'print $_; /example/ && print "\n\n\n\n"'
Note that perl also has the neat -i option, for inplace modification of files, which comes in handy when you have to do this change on a lot of files.
Or you might opt for regexxer, redet, or kregexpeditor from KDE.
You can use sed like this:
sed 's/{"dev_is_looking_week"/&\n\n\n\n/' file
OR using awk:
awk '/{"dev_is_looking_week"/{$0=sprintf("%s\n\n\n\n", $0)} 1'
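
To apply this to every file in a folder without relying on an in-place option, one possibility is a small loop with a temporary file. This is only a sketch: add_blank_lines is a hypothetical helper name, and the marker is the text from the question.

```shell
# Hypothetical helper: add four blank lines after every line containing
# the marker, in each regular file of the directory given as $1.
add_blank_lines() {
  dir=$1
  marker='{"dev_is_looking_week"'
  for f in "$dir"/*; do
    [ -f "$f" ] || continue
    tmp=$(mktemp) || return 1
    # index() does a plain substring test, so the marker needs no regex escaping.
    awk -v m="$marker" '{ print } index($0, m) { printf "\n\n\n\n" }' "$f" > "$tmp" &&
      cat "$tmp" > "$f"
    rm -f "$tmp"
  done
}
```

Usage would be something like add_blank_lines /path/to/folder. Try it on a copy of the data first, since it overwrites the files.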

Using bash regexp to insert the contents of a file into another

I have a javascript file with a jquery function call:
$.getScript('/scripts/files/file.js');
I want to replace that line with the contents of the file at that path. This is the bash script I have so far:
cat public/scripts/old.js | sed -e "s/$\.getScript\('(.)+'\);/$(cat \1)/g" > public/scripts/new.js
However, my regular expression and remembering the path does not seem to be working correctly. I get cat: 1: No such file or directory
as it seems as if cat is being called on the number 1 (which should be the remembered portion of the regexp). How can I fix this?
Because you are using $() inside double quotes, the shell is parsing the cat \1, stripping the backslash and trying to run cat 1 to pass its output as part of the argument to sed. Sed has a command (r) for reading a file, but the filename must be literal, and cannot be the result of previous sed commands (at least in standard sed, perhaps some implementations provide that ability). sed is really the wrong tool for this. You could do an awk solution, but it will be fragile.
Here's a possible perl solution (warning: fragile):
perl -ne 'if( $_ =~ /\$\.getScript\('"'(.*)'"'\)/ )
{ system( "cat $1" ) } else {print}' public/scripts/old.js
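
For comparison, the awk approach mentioned above might be sketched as follows. It is just as fragile: it assumes one $.getScript call per line, a path in single quotes, and that the quoted path is relative to a directory you pass in. inline_getscript is a hypothetical name.

```shell
# Hypothetical helper: print a script with each $.getScript('...') line
# replaced by the contents of the referenced file.
# $1 = script to read, $2 = directory the quoted paths are relative to.
inline_getscript() {
  awk -v root="$2" '
    index($0, "$.getScript(") {
      split($0, parts, "\047")   # \047 is a single quote; parts[2] is the path
      path = root parts[2]
      while ((getline line < path) > 0) print line
      close(path)
      next
    }
    { print }
  ' "$1"
}
```

With the layout from the question this would be called as inline_getscript public/scripts/old.js public > public/scripts/new.js. If the referenced file cannot be read, the line is dropped silently, so check the output.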

put regular expression in variable

output=`grep -R -l "${images}" *`
new_output=`regex "slide[0-9]" $output`
Basically $output is a string like this:
slides/_rels/slide9.xml.rels
The number in $output will change. I want to grab "slide9" and put that in a variable. I was hoping new_output would do that but I get a command not found for using regex. Any other options? I'm using a bash shell script.
Well, regex is not a program like grep. ;)
But you can use
grep -Eo "(slide[0-9]+)"
as a simple approach. -o means: show only the matching part, -E means: extended regex (allows more sophisticated patterns).
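
Put together with the variable assignment, that could look like the sketch below. The output value is the sample path from the question, standing in for the real grep -R -l result.

```shell
# Sample value from the question; in the real script this comes from grep -R -l.
output='slides/_rels/slide9.xml.rels'

# -o prints only the matching part, -E enables the + quantifier.
new_output=$(printf '%s\n' "$output" | grep -Eo 'slide[0-9]+')
echo "$new_output"
```

Note that the leading "slides/" directory does not match, because "slide" has to be followed by at least one digit. If $output contains several paths, new_output gets one match per line.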
Reading "I want to grab "slide9" and put that in a variable", I assume you want only the part that matches your regexp to end up in $new_output? If so, then you can change that to:
new_output=`grep -R -l "${images}" * | sed 's/.*\(slide[0-9][0-9]*\).*/\1/'`
Note that no setting of output= is required (unless you use it for something else).
If you need $output elsewhere, then instead use:
output=`grep -R -l "${images}" *`
new_output=`echo ${output} | sed 's/.*\(slide[0-9][0-9]*\).*/\1/'`
sed's s/// command is similar to Perl's s/// operator and has an equivalent in most languages.
Here I'm matching zero or more characters (.*) before and after the slide[0-9][0-9]* part and remembering (capturing for a backreference) that part with \( ... \). In sed's default basic regular expressions the parentheses must be escaped to form a group (with -E they must not be), and + is not portable as a "one or more" quantifier, which is why the digits are written as [0-9][0-9]*. We then replace the whole match (i.e. the whole line) with \1, which expands to the first captured group, in this case the slideN match.
In these situations using awk is better:
output="`grep -R -l "main" codes`"
echo $output
tout=`echo $output | awk -F. '{for(i=1;i<=NF;i++){if(index($i,"/")>0){n=split($i,ar,"/");print ar[n];}}}'`
echo $tout
This prints the filename without the extension. If you want to grab only slide9, then use the solutions provided by others.
Sample output :
A#A-laptop ~ $ bash try.sh
codes/quicksort_iterative.cpp codes/graham_scan.cpp codes/a.out
quicksort_iterative graham_scan a

Return only matches from substitution in Perl 5.8.8 (was: Perl "p" regex modifier equivalent)

I've got a script (source) to parse svn info to create a suitable string for Bash's $PS1. Unfortunately this doesn't work on one system I'm using which is running Perl 5.8.8 - It outputs all lines instead of only the matches. What would be the Perl 5.8.8 equivalent to the following?
__svn_ps1()
{
local result=$(
svn info 2>/dev/null | \
perl -pe 's;^URL: .*?/((trunk)|(branches|tags)/([^/]*)).*;\2\4 ;p')
if [ -n "$result" ]
then
printf "${1:- (%s)}" $result
fi
}
The output from Perl 5.10 contains only a space, parenthesis, one of branch name, tag name or trunk, and the end parenthesis. The output from Perl 5.8.8 (without the final p) contains this plus a parenthesized version of each space-separated part of the svn info output.
A possible workaround involves a simple grep '^URL: ' between the svn and perl commands, but I was hoping to avoid that since this will be executed for each Bash prompt.
If you only want output from a line that matches, don't use the -p command-line switch. It prints the value of $_ at the end of each loop. You might want something with the -n command-line switch:
perl -ne 'print if s/.../.../'
I'd do it in the same way for Perl v5.8 and v5.10. I'm not sure what you think the /p modifier is doing since you don't use the $`, $&, or $' variables or their per-match equivalents.
You can read about the command-line switches in perlrun.
As of perl 5.10, the /p switch tells perl to put matched content into ${^PREMATCH}, ${^MATCH} and ${^POSTMATCH}.
And the one-liner you've posted never uses those vars, so omit the /p.
UPDATE: Trying to keep up with the initial question...
perl -ne 's/search/replace/ and print'
Will only print lines for which the replacement was made. Note -n versus -p switch. Also, I've tried the -p /p combo on my 5.10 and it happily prints unaltered non-matching lines too. Maybe I missed something...
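
Applied to the substitution from the question, the -n plus print idea might look like this sketch. svn_branch_from_info is a hypothetical name, and it reads svn info output on stdin so it can be tried without a working copy.

```shell
# Hypothetical helper: read `svn info` output on stdin and print the
# branch/tag name (or "trunk") followed by a space; print nothing if no
# URL line matches.
svn_branch_from_info() {
  # -n loops over the input without printing; only substituted lines are printed.
  perl -ne 'print if s;^URL: .*?/((trunk)|(branches|tags)/([^/]*)).*;$2$4 ;'
}
```

In the prompt function this would replace the perl -pe line, as in svn info 2>/dev/null | svn_branch_from_info. The replacement uses $2$4 rather than \2\4, which Perl prefers on the right-hand side of a substitution.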

matching a specific substring with regular expressions using awk

I'm dealing with specific filenames, and need to extract information from them.
The structure of the filename is similar to: "20100613_M4_28007834.005_F_RANDOMSTR.raw.gz"
with RANDOMSTR a string of max 22 chars, and which may contain a substring (or not) with the format "-W[0-9].[0-9]{2}.[0-9]{3}". This substring also has the unique feature of starting with "-W".
The information I need to extract is the substring of RANDOMSTR without this optional substring.
I want to implement this in a bash script, and so far the best option I found is to use gawk with a regular expression. My best attempt so far fails:
gawk --re-interval '{match ($0,"([0-9]{8})_(M[0-9])_([0-9]{8}\\.[0-9]{3})_(.)_(.*)(-W.*)?.raw.gz",arr); print arr[5]}' <<< "20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz"
OTHER-STRING-W0.40+045
The expected results are:
gawk --re-interval '{match ($0,$regexp,arr); print arr[5]}' <<< "20100613_M4_28007834.005_F_SOME-STRING.raw.gz"
SOME-STRING
gawk --re-interval '{match ($0,$regexp,arr); print arr[5]}' <<< "20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz"
OTHER-STRING
How can I get the desired effect?
Thanks.
You need to be able to use look-arounds and I don't think awk/gawk supports that, but grep -P does.
$ pat='(?<=[0-9]{8}_M[0-9]_[0-9]{8}\.[0-9]{3}_._)(.*?)(?=(-W.*)?\.raw\.gz)'
$ echo "20100613_M4_28007834.005_F_SOME-STRING.raw.gz" | grep -Po "$pat"
SOME-STRING
$ echo "20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz" | grep -Po "$pat"
OTHER-STRING
While the grep solution is very nice indeed, the OP didn't mention an operating system, and the -P option is a GNU grep extension, so it is mostly available on Linux. It's also pretty simple to do this in awk.
$ awk -F_ '{sub(/(-W[0-9].[0-9]+.[0-9]+)?\.raw\.gz$/,"",$NF); print $NF}' <<EOT
> 20100613_M4_28007834.005_F_SOME-STRING.raw.gz
> 20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz
> EOT
SOME-STRING
OTHER-STRING
$
Note that this breaks on "20100613_M4_28007834.005_F_OTHER-STRING-W0_40+045.raw.gz". If this is a risk, and -W only shows up in the place shown above, it might be better to use something like:
$ awk -F_ '{sub(/(-W[0-9.+]+)?\.raw\.gz$/,"",$NF); print $NF}'
The difficulty here seems to be the fact that the (.*) before the optional (-W.*)? gobbles up the latter text. Using a non-greedy match doesn't help either. My regex-fu is unfortunately too weak to combat this.
If you don't mind a multi-pass solution, then a simpler approach would be to first sanitise the input by removing the trailing .raw.gz and possible -W*.
str="20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz"
echo ${str%.raw.gz} | # remove trailing .raw.gz
sed 's/-W.*$//' | # remove trailing -W.*, if any
sed -nr 's/[0-9]{8}_M[0-9]_[0-9]{8}\.[0-9]{3}_._(.*)/\1/p'
I used sed, but you can just as well use gawk/awk.
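
The same multi-pass idea can also be sketched with shell parameter expansion alone, without starting sed at all. random_part is a hypothetical name, and the sketch assumes the first four underscore-separated fields are always present.

```shell
# Hypothetical helper: print RANDOMSTR without the optional -W... suffix.
random_part() {
  name=${1%.raw.gz}             # drop the trailing .raw.gz
  name=${name#*_*_*_*_}         # drop the first four underscore-separated fields
  printf '%s\n' "${name%-W*}"   # drop a trailing -W... part, if present
}

random_part "20100613_M4_28007834.005_F_SOME-STRING.raw.gz"
random_part "20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz"
```

This does not validate the date or numeric fields the way the regex does; it only cuts at the known delimiters.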
Wasn't able to get reluctant quantifiers going, but running through two regexes in sequence does the job:
sed -E -e 's/^.{27}(.*)\.raw\.gz$/\1/' << FOO | sed -E -e 's/-W[0-9.]+\+[0-9.]+$//'
20100613_M4_28007834.005_F_SOME-STRING.raw.gz
20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz
FOO