How to extract version number from an xml attribute with bash - regex

From this xml string from my config.xml file I need to extract the first three digits of the version number:
<widget id="com.test.enterprise.test" version="3.0.0.0" xmlns="http://www.w3.org/ns/widgets" xmlns:cdv="http://cordova.apache.org/ns/1.0">
I've tried:
cat config.xml | grep "<widget" | sed 's/[^0-9.]*\([0-9.]*\).*/\1/'
but this only yields a single dot (.). What would the correct regex look like?

Don't use regular expressions to parse XML.
xmllint --xpath 'string(//*[local-name()="widget"]/@version)' config.xml \
| cut -f1-3 -d.
If you need to specify the namespace, too, use the namespace-uri function:
//*[local-name()="widget"][namespace-uri()="http://www.w3.org/ns/widgets"]

With GNU grep and PCRE support (-P), \K drops everything matched to its left from the result:
grep -Po '<widget.*?version="\K[^"]*' <<< '<widget id="com.test.enterprise.test" version="3.0.0.0" xmlns="http://www.w3.org/ns/widgets" xmlns:cdv="http://cordova.apache.org/ns/1.0">'
To get only the first three digits:
grep -Po '<widget.*?version="\K\d*(\.\d*){2}' <<< '<widget id="com.test.enterprise.test" version="3.0.0.0" xmlns="http://www.w3.org/ns/widgets" xmlns:cdv="http://cordova.apache.org/ns/1.0">'

You may grab the digits and dots only after version=" substring:
cat config.xml | grep "<widget" | sed 's/.*version="\([0-9.]*\).*/\1/'
Pattern details:
.* - any 0+ chars
version=" - a version=" substring
\([0-9.]*\) - capturing group #1 matching zero or more digits or .
.* - any 0+ chars.
The \1 backreference will keep Group 1 value in the result.
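For context, the attempt in the question prints only a dot because [^0-9.]* stops at the first dot in the line (the one in com.test), and the group then captures just that dot. Anchoring on version=" avoids this. Below is a minimal sketch that also trims the result to three components. The variable names are illustrative, and the sample line is the one from the question.

```shell
# Sample line copied from the question; normally it would be read from config.xml.
line='<widget id="com.test.enterprise.test" version="3.0.0.0" xmlns="http://www.w3.org/ns/widgets" xmlns:cdv="http://cordova.apache.org/ns/1.0">'

# Capture the digits and dots between version=" and the closing quote,
# then keep only the first three dot-separated fields.
short_version=$(printf '%s\n' "$line" \
  | sed -n 's/.*version="\([0-9.]*\)".*/\1/p' \
  | cut -d. -f1-3)

echo "$short_version"   # prints 3.0.0
```

The -n flag together with the p modifier means lines without a version attribute produce no output rather than being echoed unchanged.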

For first three digits of version:
grep -oP 'widget.*version="\K\d+\.\d+\.\d+' xmlFile
3.0.0

You could also try the following awk solutions.
Solution 1: using awk's match function.
awk '{match($0,/version=\"[^"]*/);print substr($0,RSTART+9,RLENGTH-9)}' Input_file
Solution 2: going through all the fields one by one and checking each for version.
awk '{for(i=1;i<=NF;i++){if($i ~ /version/){gsub(/version=|\"/,"",$i);print $i;next}}}' Input_file
Solution 3: setting the record separator to a space and the field separator to a double quote (").
awk -v RS=" " -v FS="\"" '/^version/{print $2}' Input_file
Solution 4: removing all the text from the start of the line through version=", then removing everything from the next " to the end, which leaves only the version number in the output.
awk '{sub(/.*version=\"/,"");sub(/\".*/,"");print}' Input_file
I hope this helps.
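The awk commands above print the full version (3.0.0.0). If only the first three components are wanted, one possible extension of the match-based variant is sketched below. The names value, part and short_version are illustrative, and the sample line is the one from the question.

```shell
# Sample line copied from the question; normally it would be read from config.xml.
line='<widget id="com.test.enterprise.test" version="3.0.0.0" xmlns="http://www.w3.org/ns/widgets" xmlns:cdv="http://cordova.apache.org/ns/1.0">'

short_version=$(printf '%s\n' "$line" | awk 'match($0, /version="[^"]*/) {
    value = substr($0, RSTART + 9, RLENGTH - 9)   # skip the 9 characters of version="
    n = split(value, part, ".")                   # break the version into components
    if (n >= 3) print part[1] "." part[2] "." part[3]
}')

echo "$short_version"   # prints 3.0.0
```

Versions with fewer than three components print nothing in this sketch; whether that is the right behaviour depends on the use case.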

Related

How do I take only the first occurrence of a hyphen in sed?

I have a string, for example home/JOHNSMITH-4991-common-task-list, and I want to take out the uppercase part and the numbers with the hyphen between them. I echo the string and pipe it to sed like so, but I keep getting all the hyphens I don't want, e.g.:
echo home/JOHNSMITH-4991-common-task-list | sed 's/[^A-Z0-9-]//g'
gives me:
JOHNSMITH-4991---
I need:
JOHNSMITH-4991
How do I ignore all but the first hyphen?
You can use
sed 's,.*/\([^-]*-[^-]*\).*,\1,'
POSIX BRE regex details:
.* - any zero or more chars
/ - a / char
\([^-]*-[^-]*\) - Group 1: any zero or more chars other than -, a hyphen, and then again zero or more chars other than -
.* - any zero or more chars
The replacement is the Group 1 placeholder, \1, to restore just the text captured.
See the online demo:
#!/bin/bash
s="home/JOHNSMITH-4991-common-task-list"
sed 's,.*/\([^-]*-[^-]*\).*,\1,' <<< "$s"
# => JOHNSMITH-4991
1st solution: awk keeps this simple. Written and tested with the shown samples.
echo "home/JOHNSMITH-4991-common-task-list" | awk -F'/|-' '{print $2"-"$3}'
Explanation: set the field separator to / or -, then print the 2nd field, a hyphen, and the 3rd field of the current line.
2nd solution: using awk's match function.
echo "home/JOHNSMITH-4991-common-task-list" |
awk '
match($0,/\/[^-]*-[^-]*/){
print substr($0,RSTART+1,RLENGTH-1)
}'
3rd solution: using GNU grep. The -o option prints only the matched part, and -P enables PCRE (Perl-compatible regular expressions). The pattern matches .*/ and then uses \K to discard what has been matched so far; [^-]*-[^-]* then matches everything up to, but not including, the second - in the line.
echo "home/JOHNSMITH-4991-common-task-list" | grep -oP '.*/\K[^-]*-[^-]*'
Here is a simple alternative solution using cut with bash string substitution:
s='home/JOHNSMITH-4991-common-task-list'
cut -d- -f1-2 <<< "${s##*/}"
JOHNSMITH-4991
You could match up to the first occurrence of /, reset the match start with \K, and then repeat the character class one or more times on each side of a hyphen, so that at least one character is required before and after the hyphen.
[^/]*/\K[A-Z0-9]+-[A-Z0-9]+
If supported, using GNU grep:
echo "home/JOHNSMITH-4991-common-task-list" | grep -oP '[^/]*/\K[A-Z0-9]+-[A-Z0-9]+'
Output
JOHNSMITH-4991
If gnu awk is an option, using the same pattern but with a capture group:
echo "home/JOHNSMITH-4991-common-task-list" | awk 'match($0, /[^\/]*\/([A-Z0-9]+-[A-Z0-9]+)/, a) {print a[1]}'
If the desired output is always the first match where the character class with a hyphen matches:
echo "home/JOHNSMITH-4991-common-task-list" | awk -v FPAT="[A-Z0-9]+-[A-Z0-9]+" '$0=$1'
Output
JOHNSMITH-4991
Assumptions:
could be more than one fwd slash in string
(after the last fwd slash) there are 2 or more hyphens in the string
desired output is between last fwd slash and 2nd hyphen
One idea using parameter substitutions:
$ string='home/dir/JOHNSMITH-4991-common-task-list'
$ string1="${string##*/}"
$ typeset -p string1
declare -- string1="JOHNSMITH-4991-common-task-list"
$ string1="${string1%%-*}"
$ typeset -p string1
declare -- string1="JOHNSMITH"
$ string2="${string#*-}"
$ typeset -p string2
declare -- string2="4991-common-task-list"
$ string2="${string2%%-*}"
$ typeset -p string2
declare -- string2="4991"
$ newstring="${string1}-${string2}"
$ echo "${newstring}"
JOHNSMITH-4991
NOTES:
typeset commands added solely to show progression of values
a bit of typing but if doing this a lot of times in bash the overall performance should be good compared to other solutions that require spawning a sub-process
if there's a need to parse a large number of strings best performance will come from streaming all strings at once (via a file?) to one of the other solutions (eg, a single awk call that processes all strings will be faster than running the set of strings through a bash loop and performing all of these parameter substitutions)
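The parameter substitutions above can be wrapped in a small function so they are easy to reuse. This is only a sketch: ticket_prefix is a hypothetical name, and unlike the step-by-step version it takes the second part from the text after the last slash, so a hyphen in a directory name does not affect the result.

```shell
# Hypothetical helper: prints the text between the last slash and the second hyphen.
# Plain assignments are used (no "local") to stay compatible with POSIX sh.
ticket_prefix() {
    base="${1##*/}"        # drop everything through the last /
    first="${base%%-*}"    # text before the first hyphen
    rest="${base#*-}"      # text after the first hyphen
    second="${rest%%-*}"   # text before the next hyphen
    printf '%s-%s\n' "$first" "$second"
}

ticket_prefix 'home/dir/JOHNSMITH-4991-common-task-list'   # prints JOHNSMITH-4991
```

Calling the function directly spawns no sub-process, which is the performance point made in the notes above. Capturing its output with $(...) does create a subshell.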

How to extract jira ticket number with sed?

I want to extract Jira ticket number from the branch name with sed.
This is what I have
echo "PTW-123-branch-name" | sed 's/.*\([A-Z]+-[0-9]+[^-]\).*/\1/'
expected result: PTW-123
What is wrong with the regexp?
You may use this sed:
echo "PTW-123-branch-name" | sed 's/\([0-9]\)-.*$/\1/'
PTW-123
Details:
\([0-9]\)-: Matches a digit and captures it in group #1 followed by hyphen
.*$: Match remaining string until end
\1: Is replacement that puts captured digit back in output
Alternatively you can use cut also:
echo "PTW-123-branch-name" | cut -d- -f1,2
PTW-123
If GNU grep is acceptable, you could try the following. The output of echo is piped to grep as standard input; -o prints only the matched portion and -P enables PCRE. The pattern uses a non-greedy match from the start of the line up to the first run of digits that is followed by a -.
echo "PTW-123-branch-name" | grep -oP '^.*?\d+(?=-)'
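As for what is wrong with the original expression: without -E, sed uses basic regular expressions, in which a bare + is a literal plus sign and not a quantifier. The pattern therefore never matches and the line is printed unchanged. Even with working quantifiers, the leading greedy .* and the trailing [^-] would distort the capture. A minimal sketch using extended syntax and an anchor at the start of the line (the variable names are illustrative):

```shell
branch='PTW-123-branch-name'

# -E enables extended regular expressions, so + means "one or more".
# The group captures uppercase letters, a hyphen, and digits at the start of the line.
ticket=$(printf '%s\n' "$branch" | sed -E 's/^([A-Z]+-[0-9]+).*/\1/')

echo "$ticket"   # prints PTW-123
```

This assumes the ticket key is always at the very start of the branch name.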

grep a pattern until a specific character (:)

Consider the following file.txt:
#A00940:70:HTCYYDRXX:2:2101:1561:1063 1:N:0:ATCACG
TAGCACTGGGCTGTGAGACTGTCGTGTGTGCTTTGGATCAAGCAAGATCGG
+
FFFFFFFFFFFFFFFFFFFFFFFFFF:FFF::FFFFFFF:FFFFFFFFFFF
#A00940:70:HTCYYDRXX:2:2101:2175:1063 1:N:0:ATCACG
CGCCCCCTCCTCCGGTCGCCGCCGCGGTGTCCGCGCGTGGGTCCTGAGGGA
+
FFFF:FFFFF:FFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFF
#A00940:70:HTCYYDRXX:2:2101:2772:1063 1:N:0:ATCACG
TGGTGGCAGGCACCTGTAATCCCAGCTACTCGGGAGCCTGAGGCAGGAGAA
I am trying to grep all the characters of the lines that start with # up to : but not including the colon, in this example the result would be A00940.
I have tried this:
cat file.txt | grep '[^:]*'
and this:
cat file.txt | grep '^(.*?):'
but both commands do not work, why is that?
With the shown samples, you could try the following:
awk 'match($0,/#[^:]*/){print substr($0,RSTART+1,RLENGTH-1)}' Input_file
A sed solution: suppress default printing with -n and print only the lines where the regex matches.
sed -n 's/^#\([^:]*\):.*/\1/p' Input_file
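The pattern [^:]* matches any character except : zero or more times and does not take the # char into account. Because the quantifier is *, it can also match an empty string, so every line matches and grep prints the whole file.
The pattern ^(.*?): is meant to match from the start of the string, as few characters as possible, up to the first occurrence of :, and it also does not take the # char into account. Note that grep's default basic regex syntax treats ( and ) as literal characters and has no lazy quantifiers, so this pattern needs -P to behave that way.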
One option is to use -P for a Perl compatible regex with a positive lookbehind to assert an # to the left.
grep -oP '(?<=#)[^#:]+' file.txt
The pattern matches:
(?<=#) Positive lookbehind, assert from the current position an # directly to the left
[^#:]+ Negated character class, match 1+ times any character except # and :
Output
A00940
A00940
A00940
Another option using gawk with a capture group:
gawk 'match($0, /#([^#:]+):/, a) {print a[1]}' file.txt
In addition to the other answers, another way to get the output is with the following
awk -F '#|:' '$2=="A00940" {print $2}' file.txt
That sets the delimiter as either # or : and then prints the second column where the value is A00940:
Output:
A00940
A00940
A00940
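The awk command above compares against a hard-coded value. A more general variant, sketched here under the assumption that only header lines start with #, selects lines by their first character and prints whatever sits between the # and the first colon. The sample variable holds lines copied from file.txt in the question.

```shell
# A few lines copied from the question's file.txt.
sample='#A00940:70:HTCYYDRXX:2:2101:1561:1063 1:N:0:ATCACG
TAGCACTGGGCTGTGAGACTGTCGTGTGTGCTTTGGATCAAGCAAGATCGG
+
FFFFFFFFFFFFFFFFFFFFFFFFFF:FFF::FFFFFFF:FFFFFFFFFFF
#A00940:70:HTCYYDRXX:2:2101:2175:1063 1:N:0:ATCACG'

# Split on # or :, and print field 2 only for lines that begin with #.
# Field 1 is the empty string before the leading #.
ids=$(printf '%s\n' "$sample" | awk -F'[#:]' '/^#/ { print $2 }')

echo "$ids"
```

The /^#/ condition is what keeps the quality lines, which also contain colons, out of the output.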

Extract QueryString value using sed

I have the following lines in an apache access log
/sms/receiveHLRLookup?Ported=No&Status=Success&MSISDN=647930229655&blah
/sms/receiveHLRLookup?Ported=No&Status=Success&MSISDN=647930229656&blah
/sms/receiveHLRLookup?Ported=No&Status=Success&MSISDN=647930229657&blah
/sms/receiveHLRLookup?Ported=No&Status=Success&MSISDN=647930229658&blah
and i want to extract the MSISDN value only, so expected output would be
647930229655
647930229656
647930229657
647930229658
I'm using the following sed command but i can't get it to stop capturing at &
sed 's/.*MSISDN=\(.*\)/\1/'
sed solution:
sed -E 's/.*&MSISDN=([^&]+).*/\1/' file
& - is key/value pair separator in URL syntax, so you should rely on it
([^&]+) - 1st captured group containing any character sequence except &
\1 - backreference to the 1st captured group
The output:
647930229655
647930229656
647930229657
647930229658
-o: print only the matching part, not the whole line.
-P: enable PCRE regex.
\K: drop everything matched to its left from the result; that text must still be present in the input.
\d: a digit; + means one or more.
grep -oP 'MSISDN=\K\d+' input
647930229655
647930229656
647930229657
647930229658
The following simple sed may also help:
sed 's/.*MSISDN=//;s/&.*//' Input_file
Explanation:
s/.*MSISDN=//: substitutes everything up to and including MSISDN= with an empty string in the current line.
; the semicolon separates this from the next command.
s/&.*//: substitutes everything from the first remaining & to the end of the line with an empty string.
$ grep -oP '(?<=&MSISDN=)\d+' file
647930229655
647930229656
647930229657
647930229658
-o option is meant to show only matched output
-P option is meant to enable PCRE (Perl Compatible Regex)
(?<=regex) is a positive lookbehind assertion. Lookarounds do not consume any characters while matching, unlike the rest of the pattern. Hence the only matched output you get is \d+, which is one or more digits.
or using sed:
$ sed -r 's/^.*MSISDN=([0-9]+).*$/\1/' file
647930229655
647930229656
647930229657
647930229658
You can also pipe cut into cut (this relies on MSISDN being the third parameter):
cut -d '&' -f3 Input_file |cut -d '=' -f2
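Several of the commands above depend on where MSISDN sits in the query string: &MSISDN= does not match when it is the first parameter, and cut -f3 does not work when it is not the third. A sketch that accepts either ? or & before the key and stops at the next & or whitespace is shown below. The variable names are illustrative, and the sample lines are taken from the question.

```shell
# Two lines copied from the question's access log excerpt.
log='/sms/receiveHLRLookup?Ported=No&Status=Success&MSISDN=647930229655&blah
/sms/receiveHLRLookup?Ported=No&Status=Success&MSISDN=647930229656&blah'

# [?&] allows the key to be the first or a later parameter;
# [^&[:space:]]* stops the capture at the next & or at whitespace.
numbers=$(printf '%s\n' "$log" \
  | sed -nE 's/.*[?&]MSISDN=([^&[:space:]]*).*/\1/p')

echo "$numbers"
```

With -n and the p modifier, log lines without an MSISDN parameter are skipped instead of being printed unchanged.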

regex -- grepping for alphabetic characters only

I have a quick regex question.
Let's say I have a list of packages:
packageA-0:8.39-6.fc24.x86_64
packageB-0:6.4-1.fc24.x86_64
packageB-utils-0:3.63-2.fc24.x86_64
What I want returned is:
packageA
packageB
packageB-utils
I've tried
grep -oP '^[a-z]*' myfile.txt
and
awk -F"[_-]" '{print $1}' myfile.txt
Any ideas? I think I'm sort of close, but I just can't get packageB-utils
.*?(?=-\d)
.*? => everything non greedy
(?=-\d) => until "-" followed by a digit
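The same idea, keeping everything before the first - that is followed by a digit, can also be written without lookaheads. The sketch below swaps in plain sed for the PCRE pattern: it deletes from the first hyphen-plus-digit to the end of the line. The package list is the one from the question, and the variable names are illustrative.

```shell
packages='packageA-0:8.39-6.fc24.x86_64
packageB-0:6.4-1.fc24.x86_64
packageB-utils-0:3.63-2.fc24.x86_64'

# Remove the leftmost "-<digit>" and everything after it.
names=$(printf '%s\n' "$packages" | sed 's/-[0-9].*//')

echo "$names"
```

A package name that itself contains a hyphen followed by a digit would be cut short by this approach, so it only suits names like the ones shown.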
Try this. It selects everything up to the last letter:
grep -o "^[a-zA-Z-]*[a-zA-Z]" file.txt
Or, if your package name also contains digits, you can use sed to trim everything from -0: onward:
sed 's|-[0-9]*:.*||' file.txt
With sed using grouping:
sed -rn 's/([A-Za-z\-]+)\-(.*)/\1/p' packages.txt
Should yield:
#packageA
#packageB
#packageB-utils
packages.txt contains:
packageA-0:8.39-6.fc24.x86_64
packageB-0:6.4-1.fc24.x86_64
packageB-utils-0:3.63-2.fc24.x86_64