Shell - Refactoring a string regex that join numbers

I am trying to refactor my script to make it readable while keeping it usable on a single line.
My script does the following:
runs a regex on a string (GXXRXXCXX) to collect all matched numbers into an array
converts each string in the array to a number (0X -> X)
joins all the numbers with a '.' delimiter
finally, adds a 'v' at the start of the string
The part I am struggling with the most is turning the array of numbers (3 2 1) into a joined string (3.2.1) without using any tmp variable.
code :
GOROCO=G03R02C01
version=v$(tmp=( $(grep -Eo '[[:digit:]]+' <<< $GOROCO | bc) ); echo "${tmp[@]}" | sed 's/ /./g')
process :
G03R02C01
03 02 01
3 2 1
3.2.1
v3.2.1

Using a single sed you can do this:
GOROCO='G03R02C01'
version=$(sed -E 's/[^0-9]+0*/./g; s/^\./v/' <<< "$GOROCO")
# version=v3.2.1
Details:
-E: Enables extended regex mode in sed
s/[^0-9]+0*/./g: Replace each run of non-digits, together with any zeros that follow it, with a single dot
s/^\./v/: Replace the leading dot with the letter v
As an academic exercise, here is a pure bash equivalent:
shopt -s extglob
version="${GOROCO//+([!0-9])*(0)/.}"
version="v${version#.}"
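One caveat with the sed one-liner above: because the zeros are folded into the delimiter match, a component made only of zeros (for example G00R10C00, an input that does not appear in the question) would lose its digit entirely. As a minimal sketch, assuming grep -o is available (GNU and BSD grep both have it), the question's pipeline can stay on one logical line while awk does both the string-to-number conversion and the join. The function name goroco_to_version is hypothetical.

```shell
# Hypothetical helper: G03R02C01 -> v3.2.1
goroco_to_version() {
  # grep -o prints each run of digits on its own line;
  # awk's %d turns "03" into 3 and the ternary picks "v" or "." as the prefix.
  printf '%s\n' "$1" |
    grep -Eo '[0-9]+' |
    awk '{ printf "%s%d", (NR > 1 ? "." : "v"), $1 } END { print "" }'
}

version=$(goroco_to_version 'G03R02C01')
echo "$version"
```

Since awk formats each component as a decimal integer, values such as 08 or 00 should come out as 8 and 0 rather than being dropped.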

You're looking for paste
$ grep -Eo '[[:digit:]]+' <<< $GOROCO | bc | paste -s -d"."
3.2.1

Related

How to extract text between first 2 dashes in the string using sed or grep in shell

I have a string like this: feature/test-111-test-test.
I need to extract the string up to the second dash and also change the forward slash to a dash.
I have to do this in a Makefile using shell syntax, and the regular expressions I tried do not work there.
In the end I need to get something like this:
input - feature/test-111-test-test
output - feature-test-111- or at least feature-test-111
echo 'feature/test-111-test-test' | grep -oP '\A(?:[^-]++-??){2}' | sed -e 's/\//-/g'
But grep -oP doesn't work in my case. This regexp doesn't work either: (.*?-.*?)-.*.

Another sed solution using a capture group and regex/pattern iteration (same thing Socowi used):
$ s='feature/test-111-test-test'
$ sed -E 's/\//-/;s/^(([^-]*-){3}).*$/\1/' <<< "${s}"
feature-test-111-
Where:
-E - enable extended regex support
s/\//-/ - replace / with -
s/^....*$/ - match start and end of input line
(([^-]*-){3}) - capture group #1 that consists of 3 sets of anything not - followed by -
\1 - print just the capture group #1 (this will discard everything else on the line that's not part of the capture group)
To store the result in a variable:
$ url=$(sed -E 's/\//-/;s/^(([^-]*-){3}).*$/\1/' <<< "${s}")
$ echo $url
feature-test-111-

You can use awk keeping in mind that in Makefile the $ char in awk command must be doubled:
url=$(shell echo 'feature/test-111-test-test' | awk -F'-' '{gsub(/\//, "-", $$1);print $$1"-"$$2"-"}')
echo "$url"
# => feature-test-111-
Here, -F'-' sets the field delimiter to -, gsub(/\//, "-", $1) replaces / with - in Field 1, and print $1"-"$2"-" prints Fields 1 and 2, each followed by a -.
Or, with a regex as a field delimiter:
url=$(shell echo 'feature/test-111-test-test' | awk -F'[-/]' '{print $$1"-"$$2"-"$$3"-"}')
echo "$url"
# => feature-test-111-
The -F'[-/]' option sets the field separator to - and /.
The '{print $1"-"$2"-"$3"-"}' part prints the first, second and third value with a separating hyphen.

To get the nth occurrence of a character C you don't need fancy perl regexes. Instead, build a regex of the form "(anything that isn't C, then C) for n times":
grep -Eo '([^-]*-){2}' | tr / -
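The same "not C, then C, n times" idea can be parameterized. The sketch below is only an illustration: up_to_nth is a hypothetical name, and it assumes the delimiter is a single character with no special meaning inside a bracket expression or as a sed delimiter (so not /, ], ^ or \).

```shell
# Hypothetical helper: print the input up to and including
# the n-th occurrence of a single delimiter character.
# $1 = string, $2 = delimiter character, $3 = count
up_to_nth() {
  # Builds e.g. s/^(([^-]*-){2}).*/\1/ when called with - and 2.
  printf '%s\n' "$1" | sed -E "s/^(([^$2]*$2){$3}).*/\1/"
}

up_to_nth 'feature/test-111-test-test' - 2 | tr / -
```

If the input has fewer than n delimiters, the substitution does not match and sed prints the line unchanged, so callers may want to check for that case.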

With sed and cut
echo feature/test-111-test-test| cut -d'-' -f-2 |sed 's/\//-/'
Output
feature-test-111
echo feature/test-111-test-test| cut -d'-' -f-2 |sed 's/\//-/;s/$/-/'
Output
feature-test-111-

You can use the simple BRE regex form of not something then that something which is [^-]*- to get all characters other than - up to a -.
This works:
echo 'feature/test-111-test-test' | sed -nE 's/^([^/]*)\/([^-]*-[^-]*-).*/\1-\2/p'
feature-test-111-

Another idea using parameter expansions/substitutions:
s='feature/test-111-test-test'
tail="${s//\//-}" # replace '/' with '-'
# split first field from rest of fields ('-' delimited); do this 3x times
head="${tail%%-*}" # pull first field
tail="${tail#*-}" # drop first field
head="${head}-${tail%%-*}" # pull first field; append to previous field
tail="${tail#*-}" # drop first field
head="${head}-${tail%%-*}-" # pull first field; append to previous fields; add trailing '-'
$ echo "${head}"
feature-test-111-
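The repeated "pull first field, drop first field" steps above can also be written as a loop. This is a rough sketch rather than a drop-in replacement: first_fields is a hypothetical name, tr is used for the slash replacement so that it also runs in a plain POSIX sh, and the variables are global because POSIX sh has no local.

```shell
# Hypothetical helper: keep the first N '-'-delimited fields,
# each followed by a '-'.
# $1 = string, $2 = number of fields to keep
first_fields() {
  rest=$(printf '%s' "$1" | tr / -)   # replace '/' with '-'
  head=
  i=0
  while [ "$i" -lt "$2" ]; do
    head="${head}${rest%%-*}-"        # append first field plus '-'
    rest="${rest#*-}"                 # drop first field
    i=$((i + 1))
  done
  printf '%s\n' "$head"
}

first_fields 'feature/test-111-test-test' 3
```

The loop assumes the input has at least as many fields as requested; with fewer, the last field would be repeated because ${rest#*-} leaves the string unchanged once no dash remains.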

A short sed solution, without extended regular expressions:
sed 's|\(.*\)/\([^-]*-[^-]*\).*|\1-\2|'

How to match a regex 1 to 3 times in a sed command?

Problem
I want to get any text that consists of one to three digits followed by a %, but without the %, using sed.
What I tried
So I guess the following regex should match the right pattern: [0-9]{1,3}%.
Then I can use this sed command to capture the digits and print only them:
sed -nE 's/.*([0-9]{1,3})%.*/\1/p'
Example
However, when I run it, it shows:
$ echo "100%" | sed -nE 's/.*([0-9]{1,3})%.*/\1/p'
0
instead of
100
Obviously, there's something wrong with my sed command and I think the problem comes from here:
[0-9]{1,3}
which apparently doesn't do what i want it to do.
edit:
Solution
The .* at the start of sed -nE 's/.*([0-9]{1,3})%.*/\1/p' "ate" the first two digits.
The right way to write it, according to Wicktor's answer, is :
sed -nE 's/(.*[^0-9])?([0-9]{1,3})%.*/\2/p'

The .* is greedy: it grabs as many characters as it can, including the first two digits, leaving only the last digit of 100% for the capture group.
Use
sed -nE 's/(.*[^0-9])?([0-9]{1,3})%.*/\2/p'
Details
(.*[^0-9])? - (Group 1) an optional sequence of any 0 or more chars up to the non-digit char including it
([0-9]{1,3}) - (Group 2) one to three digits
% - a % char
.* - the rest of the string.
The match is replaced with Group 2 contents, and that is the only value printed since -n suppresses the default line output.
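To see the difference side by side, here is a small sketch that wraps both commands in functions. The names percent_greedy and percent_fixed are made up for the comparison; the sed expressions are the ones from the question and from this answer.

```shell
# Greedy version from the question: .* swallows the first two digits.
percent_greedy() {
  printf '%s\n' "$1" | sed -nE 's/.*([0-9]{1,3})%.*/\1/p'
}

# Fixed version: the optional prefix must end in a non-digit,
# so it cannot eat digits that belong to the number.
percent_fixed() {
  printf '%s\n' "$1" | sed -nE 's/(.*[^0-9])?([0-9]{1,3})%.*/\2/p'
}

percent_greedy '100%'      # prints 0
percent_fixed '100%'       # prints 100
percent_fixed 'abc 100%'   # prints 100
```

Lines without a digit followed by % print nothing, because -n is combined with the p flag.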

It will be easier to use a cut + grep option:
echo "abc 100%" | cut -d% -f1 | grep -oE '[0-9]{1,3}'
100
echo "100%" | cut -d% -f1 | grep -oE '[0-9]{1,3}'
100
Or else you may use this awk:
echo "100%" | awk 'match($0, /[0-9]{1,3}%/){print substr($0, RSTART, RLENGTH-1)}'
100
Or else if you have gnu grep then use -P (PCRE) option:
echo "abc 100%" | ggrep -oP '[0-9]{1,3}(?=%)'
100

This might work for you (GNU sed):
sed -En 's/.*\<([0-9]{1,3})%.*/\1/p' file
This is a filtering exercise, so use the -n option.
Use a back reference to capture 1 to 3 digits, followed by % and print the result if successful.
N.B. The \< ensures the digits start on a word boundary, \b could also be used. The -E option is employed to reduce the number of back slashes which would normally be necessary to quote (,),{ and } metacharacters.

How to extract a number out of a string preceded by zeroes

I got a string that looks like this SOMETHING00000076XYZ
How can I extract the number 76 out of the string using a shell script? Note that 76 is preceded by zeroes and followed by letters.

1st solution: If you are OK with awk, please try the following.
echo "SOMETHING00000076XYZ" | awk 'match($0,/0+[0-9]+/){val=substr($0,RSTART,RLENGTH);sub(/0+/,"",val);print val;val=""}'
In case you want to save this into a variable, use the following.
variable="$(echo "SOMETHING00000076XYZ" | awk '{sub(/.*[^1-9]0+/,"");sub(/[a-zA-Z]+/,"")} 1')"
2nd solution: Adding 1 more awk solution here(keeping your sample in mind).
echo "SOMETHING00000076XYZ" | awk '{sub(/.*[^1-9]0+/,"");sub(/[a-zA-Z]+/,"")} 1'

Here is a sed option:
echo "SOMETHING00000076XYZ" | sed -r 's/[^0-9]*0*([0-9]+).*/\1/g';
76
Here is an explanation of the regex pattern used:
[^0-9]* match zero or more non digits
0* match zero or more 0's
([0-9]+) match AND capture the remaining digits (one or more), i.e. the number without its leading zeros
.* match the remainder of the string
Then, we just replace with \1, which is the first (and only) capture group.
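A quick way to check how this pattern behaves on a few other inputs is to wrap it in a function. This is only a sketch: strip_to_number is a hypothetical name, the extra inputs are made up, and -E is used in place of -r (GNU sed accepts both).

```shell
# Hypothetical helper around the sed expression above.
strip_to_number() {
  printf '%s\n' "$1" | sed -E 's/[^0-9]*0*([0-9]+).*/\1/'
}

strip_to_number 'SOMETHING00000076XYZ'   # prints 76
strip_to_number 'ABC00100XYZ'            # prints 100 (inner zeros are kept)
strip_to_number 'SOMETHING000XYZ'        # prints 0   (one digit is always left)
```

Because the capture group needs at least one digit, an all-zero number still yields a single 0 instead of an empty string.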

echo 'SOMETHING00000076XYZ' | grep -o '[1-9][0-9]*'

Using gnu grep:
grep -oP '0+\K\d+' <<< 'SOMETHING00000076XYZ'
76
\K discards everything matched so far, so only the digits after the leading zeros are printed.
Here is another variant of awk:
awk -F '0+' 'match($2, /^[0-9]+/){print substr($2, 1, RLENGTH)}' <<< 'SOMETHING00000076XYZ'
76

You can try Perl as well
$ echo "SOMETHING00000076XYZ" | perl -ne ' /\D+0+(\d+)/ and print $1 '
76
$ a=$(echo "SOMETHING00000076XYZ" | perl -ne ' /\D+0+(\d+)/ and print $1 ')
$ echo $a
76
$

$ echo 'SOMETHING00000076XYZ' | awk '{sub(/^[^0-9]+/,""); print $0+0}'
76

You can use sed as
echo "SOMETHING00000076XYZ" | sed "s/[a-zA-Z]//g" | sed "s/^0*//"
The first step is for removing all letters
The second step is for removing leading zeroes

non matching groups in grep regex not working

I would like to extract 1, 10, and 100 from:
1 one -args 123
10 ten -args 123
100 one hundred -args 123
However this regex returns 100:
echo -e " 1 one\n 10 ten\n100 one hundred" | grep -Po '^(?=[ ]*)\d+(?=.*)'
100
Not ignoring the preceding spaces returns the numbers (but of course with undesired spaces):
echo -e " 1 one\n 10 ten\n100 one hundred" | grep -Po '^[ ]*\d+(?=.*)'
1
10
100
Have I misunderstood non capturing regex groups in grep / Perl (grep version 2.2, Perl as the -P flag should use its regex) or is this a bug? I notice the release notes for 2.6 says "This release fixes an unexpectedly large number of flaws, from outright bugs (surprisingly many, considering this is "grep")".
If someone with 2.6 could try these examples that would be valuable to determine if this is a bug (in 2.2) or intended behaviour.

The issue is what is considered a 'match' by grep. In the absence of telling grep part of the total match is not what you want, it prints everything up to the end of the match regardless of matching groups.
Given:
$ echo "$txt"
1 one -args 123
10 ten -args 123
100 one hundred -args 123
You can get just the first column of digits without leading spaces several ways.
With GNU grep:
$ echo "$txt" | grep -Po '^[ ]*\K\d+'
1
10
100
Here \K is equivalent to a look behind assertion that resets the match text of the match to be what comes after. The left hand, before the \K, is required to match, but is not included in match text printed by grep.
awk:
$ echo "$txt" | awk '/^[ ]*[0-9]+/{print $1}'
sed:
$ echo "$txt" | sed 's/^[ ]*\([0-9]*\).*/\1/'
Perl:
$ echo "$txt" | perl -lne 'print $1 if /^[ ]*\K(\d+)/'
And then if you want the matches on a single line, run through xargs:
$ echo "$txt" | grep -Po '^[ ]*\K(\d+)' | xargs
1 10 100
Or, if you are using awk or Perl, just change the way it is printed to not include a newline.

You can delete the unwanted spaces this way :
echo -e " 1 one\n 10 ten\n100 one hundred" | grep -Po '^[ ]*(\d+)' | tr -d ' '
As for your question of why it is not working, it is not a bug, it is working as intended, you just misinterpreted how it should work.
If we focus on this ^(?=[ ]*)\d+:
The (?=[ ]*) part is a lookahead assertion. So it means that the regex engine tries to check if the ^ is followed by zero or more spaces. But the assertion itself is not part of the match, so in reality this code means :
- Match a ^ that is followed by 0 or more spaces
- After this ^, match one or more digits
So your code will only match when a digit is the first character of the line. The lookahead won't help you on your use case.
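Since the lookahead consumes nothing, the leading spaces have to be matched by the pattern itself and then dropped from the output. As a sketch of one portable way to do that without PCRE (first_numbers is a hypothetical name, and the sample input assumes space-padded lines like those in the question):

```shell
# Hypothetical filter: print the leading number of each line,
# ignoring any spaces in front of it.
first_numbers() {
  # ^ *               leading spaces, matched but not captured
  # \([0-9][0-9]*\)   one or more digits, captured as \1
  sed -n 's/^ *\([0-9][0-9]*\).*/\1/p'
}

printf '  1 one\n 10 ten\n100 one hundred\n' | first_numbers
```

Lines that do not start with optional spaces followed by a digit are skipped, because -n is combined with the p flag.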

I think the anchor messes with the lookahead, which could be a lookbehind instead, except that lookbehinds can't be variable-length (I always run into that one). So the following would work:
echo -e " 1 one\n 10 ten\n100 one hundred" | grep -Po '(?=[ ]*)\d+(?=.*)'
As for a better tool, I would use awk as it is suited to any column driven data. So if you were running it off of ps you could do something like:
ps | awk '/stuff you want to look for here/{print $1}'
awk will take care of all the white space by default

How can I output only captured groups with sed?

Is there a way to tell sed to output only captured groups?
For example, given the input:
This is a sample 123 text and some 987 numbers
And pattern:
/([\d]+)/
Could I get only 123 and 987 output in the way formatted by back references?

The key to getting this to work is to tell sed to exclude what you don't want to be output as well as specifying what you do want. This technique depends on knowing how many matches you're looking for. The grep command below works for an unspecified number of matches.
string='This is a sample 123 text and some 987 numbers'
echo "$string" | sed -rn 's/[^[:digit:]]*([[:digit:]]+)[^[:digit:]]+([[:digit:]]+)[^[:digit:]]*/\1 \2/p'
This says:
don't default to printing each line (-n)
exclude zero or more non-digits
include one or more digits
exclude one or more non-digits
include one or more digits
exclude zero or more non-digits
print the substitution (p) (on one line)
In general, in sed you capture groups using parentheses and output what you capture using a back reference:
echo "foobarbaz" | sed 's/^foo\(.*\)baz$/\1/'
will output "bar". If you use -r (-E for OS X) for extended regex, you don't need to escape the parentheses:
echo "foobarbaz" | sed -r 's/^foo(.*)baz$/\1/'
There can be up to 9 capture groups and their back references. The back references are numbered in the order the groups appear, but they can be used in any order and can be repeated:
echo "foobarbaz" | sed -r 's/^foo(.*)b(.)z$/\2 \1 \2/'
outputs "a bar a".
If you have GNU grep:
echo "$string" | grep -Po '\d+'
It may also work in BSD, including OS X:
echo "$string" | grep -Eo '\d+'
These commands will match any number of digit sequences. The output will be on multiple lines.
or variations such as:
echo "$string" | grep -Po '(?<=\D )(\d+)'
The -P option enables Perl Compatible Regular Expressions. See man 3 pcrepattern or man 3 pcresyntax.
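As a further illustration of using back references in a different order than the groups appear, here is a small sketch that rearranges a date. The function name reorder_date and the sample date are made up for the example; -E is used, which GNU sed accepts as well as -r.

```shell
# Hypothetical helper: YYYY-MM-DD -> DD/MM/YYYY using three capture groups.
reorder_date() {
  # | is used as the s delimiter so the slashes need no escaping.
  printf '%s\n' "$1" | sed -E 's|^([0-9]{4})-([0-9]{2})-([0-9]{2})$|\3/\2/\1|'
}

reorder_date '2024-01-31'   # prints 31/01/2024
```

Input that does not match the whole pattern is printed unchanged, since neither -n nor the p flag is used here.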

Sed has up to nine remembered patterns but you need to use escaped parentheses to remember portions of the regular expression.
See here for examples and more detail

you can use grep
grep -Eow "[0-9]+" file

run(s) of digits
This answer works with any count of digit groups. Example:
$ echo 'Num123that456are7899900contained0018166intext' \
| sed -En 's/[^0-9]*([0-9]{1,})[^0-9]*/\1 /gp'
123 456 7899900 0018166
Expanded answer.
Is there any way to tell sed to output only captured groups?
Yes. replace all text by the capture group:
$ echo 'Number 123 inside text' \
| sed 's/[^0-9]*\([0-9]\{1,\}\)[^0-9]*/\1/'
123
s/[^0-9]* # several non-digits
\([0-9]\{1,\}\) # followed by one or more digits
[^0-9]* # and followed by more non-digits.
/\1/ # gets replaced only by the digits.
Or with extended syntax (fewer backslashes, and it allows the use of +):
$ echo 'Number 123 in text' \
| sed -E 's/[^0-9]*([0-9]+)[^0-9]*/\1/'
123
To avoid printing the original text when there is no number, use:
$ echo 'Number xxx in text' \
| sed -En 's/[^0-9]*([0-9]+)[^0-9]*/\1/p'
(-n) Do not print the input by default.
(/p) print only if a replacement was done.
And to match several numbers (and also print them):
$ echo 'N 123 in 456 text' \
| sed -En 's/[^0-9]*([0-9]+)[^0-9]*/\1 /gp'
123 456
That works for any count of digit runs:
$ str='Test Num(s) 123 456 7899900 contained as0018166df in text'
$ echo "$str" \
| sed -En 's/[^0-9]*([0-9]{1,})[^0-9]*/\1 /gp'
123 456 7899900 0018166
Which is very similar to the grep command:
$ str='Test Num(s) 123 456 7899900 contained as0018166df in text'
$ echo "$str" | grep -Po '\d+'
123
456
7899900
0018166
About \d
and pattern: /([\d]+)/
Sed does not recognize the '\d' (shortcut) syntax. The ASCII equivalent used above, [0-9], is not exactly equivalent. The only alternative solution is to use a character class: '[[:digit:]]'.
The selected answer uses such "character classes" to build a solution:
$ str='This is a sample 123 text and some 987 numbers'
$ echo "$str" | sed -rn 's/[^[:digit:]]*([[:digit:]]+)[^[:digit:]]+([[:digit:]]+)[^[:digit:]]*/\1 \2/p'
That solution only works for (exactly) two runs of digits.
Of course, as the answer is being executed inside the shell, we can define a couple of variables to make such answer shorter:
$ str='This is a sample 123 text and some 987 numbers'
$ d=[[:digit:]] D=[^[:digit:]]
$ echo "$str" | sed -rn "s/$D*($d+)$D+($d+)$D*/\1 \2/p"
But, as has been already explained, using a s/…/…/gp command is better:
$ str='This is 75577 a sam33ple 123 text and some 987 numbers'
$ d=[[:digit:]] D=[^[:digit:]]
$ echo "$str" | sed -rn "s/$D*($d+)$D*/\1 /gp"
75577 33 123 987
That will cover both repeated runs of digits and writing a short(er) command.
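One detail of the s/…/…/gp form is that every match is replaced by the digits plus a space, so the output line ends with a trailing space. A possible variant that strips it is sketched below; digit_runs is a hypothetical name, and the /[0-9]/ address is an addition so that lines without digits are skipped.

```shell
# Hypothetical helper: print all runs of digits on one line,
# separated by single spaces, with no trailing space.
digit_runs() {
  printf '%s\n' "$1" |
    sed -En '/[0-9]/{s/[^0-9]*([0-9]+)[^0-9]*/\1 /g;s/ $//;p;}'
}

digit_runs 'N 123 in 456 text'   # prints 123 456
```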

Give up and use Perl
Since sed does not cut it, let's just throw in the towel and use Perl, at least it is LSB while grep GNU extensions are not :-)
Print the entire matching part, no matching groups or lookbehind needed:
cat <<EOS | perl -lane 'print m/\d+/g'
a1 b2
a34 b56
EOS
Output:
12
3456
Single match per line, often structured data fields:
cat <<EOS | perl -lape 's/.*?a(\d+).*/$1/g'
a1 b2
a34 b56
EOS
Output:
1
34
With lookbehind:
cat <<EOS | perl -lane 'print m/(?<=a)(\d+)/'
a1 b2
a34 b56
EOS
Multiple fields:
cat <<EOS | perl -lape 's/.*?a(\d+).*?b(\d+).*/$1 $2/g'
a1 c0 b2 c0
a34 c0 b56 c0
EOS
Output:
1 2
34 56
Multiple matches per line, often unstructured data:
cat <<EOS | perl -lape 's/.*?a(\d+)|.*/$1 /g'
a1 b2
a34 b56 a78 b90
EOS
Output:
1
34 78
With lookbehind:
cat <<EOS | perl -lane 'print m/(?<=a)(\d+)/g'
a1 b2
a34 b56 a78 b90
EOS
Output:
1
3478

I believe the pattern given in the question was by way of example only, and the goal was to match any pattern.
If you have a sed with the GNU extension allowing insertion of a newline in the pattern space, one suggestion is:
> set string = "This is a sample 123 text and some 987 numbers"
>
> set pattern = "[0-9][0-9]*"
> echo $string | sed "s/$pattern/\n&\n/g" | sed -n "/$pattern/p"
123
987
> set pattern = "[a-z][a-z]*"
> echo $string | sed "s/$pattern/\n&\n/g" | sed -n "/$pattern/p"
his
is
a
sample
text
and
some
numbers
These examples are with tcsh (yes, I know it's the wrong shell) with CYGWIN. (Edit: For bash, remove set, and the spaces around =.)

Try
sed -n -e "/[0-9]/s/^[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\).*$/\1 \2 \3 \4 \5 \6 \7 \8 \9/p"
I got this under cygwin:
$ (echo "asdf"; \
echo "1234"; \
echo "asdf1234adsf1234asdf"; \
echo "1m2m3m4m5m6m7m8m9m0m1m2m3m4m5m6m7m8m9") | \
sed -n -e "/[0-9]/s/^[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\).*$/\1 \2 \3 \4 \5 \6 \7 \8 \9/p"
1234
1234 1234
1 2 3 4 5 6 7 8 9
$

You need to include the whole line in the match to print only the group, which you're doing in the second command, but you don't need to group the first wildcard. This will work as well:
echo "/home/me/myfile-99" | sed -r 's/.*myfile-(.*)$/\1/'

It's not what the OP asked for (capturing groups) but you can extract the numbers using:
S='This is a sample 123 text and some 987 numbers'
echo "$S" | sed 's/ /\n/g' | sed -r '/([0-9]+)/ !d'
Gives the following:
123
987

I want to give a simpler example on "output only captured groups with sed"
I have /home/me/myfile-99 and wish to output the serial number of the file: 99
My first try, which didn't work was:
echo "/home/me/myfile-99" | sed -r 's/myfile-(.*)$/\1/'
# output: /home/me/99
To make this work, we need to capture the unwanted portion in capture group as well:
echo "/home/me/myfile-99" | sed -r 's/^(.*)myfile-(.*)$/\2/'
# output: 99
*) Note that sed doesn't have \d
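For this particular case, where the wanted part is simply everything after a fixed marker, plain parameter expansion may be enough and avoids sed altogether. This is a sketch under that assumption; file_serial is a hypothetical name.

```shell
# Hypothetical helper: return whatever follows the last "myfile-" in a path.
file_serial() {
  # ${1##*myfile-} removes the longest prefix ending in "myfile-".
  printf '%s\n' "${1##*myfile-}"
}

file_serial '/home/me/myfile-99'   # prints 99
```

Unlike the sed version, this does no pattern validation: if the marker is missing, the whole path is printed unchanged.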

You can use ripgrep, which also seems to be a sed replacement for simple substitutions, like this
rg '(\d+)' -or '$1'
where ripgrep uses -o or --only-matching and -r or --replace to output only the first capture group with $1 (quoted to avoid interpretation as a variable by the shell). It is printed twice because there are two matches.
