Retrieving a value from XML using grep and regular expressions

I have the below response being returned from my build system. The build generates multiple artifacts, and I want to extract the link to a particular artifact from the response below, say something.exe.
<Artifacts>
<artifact name="artifact1" version="1.0" buildId="13321123" make_target="beta" branch="branchName" date="2017-04-21 00:31:38.74856-07"
endtime="2017-04-21 00:59:54.680601-07"
status="succeeded"
change="e850b01967222464ffca02bf94dc711236fa978a"
released="no">
<file url="http://build.system.org/path/to/artifact/folder/MD5SUM.txt"/><file url="http://build.system.org/path/to/artifact/folder/SHA1SUM.txt"/><file url="http://build.system.org/path/to/artifact/folder/SHA256SUM.txt"/><file url="http://build.system.org/path/to/artifact/folder/something.exe"/><file url="http://build.system.org/path/to/artifact/folder/something_x64.msi"/>
</artifact>
</Artifacts>
I would like to know a way to extract just the URL for something.exe. I have tried piping the curl output into grep -E with a regular expression, but that gives me the entire line instead.
curl -s --request GET http://build.system.org/path/to/artifact/folder/api/?build=13321123 | grep -E 'file url='
curl -s --request GET http://build.system.org/path/to/artifact/folder/api/?build=13321123 | grep -E 'file url="http\S+OVF10.ova"'
Is there a way to extract just the following?
http://build.system.org/path/to/artifact/folder/something.exe

The right way here is to use XML tools, such as xmlstarlet
But that, of course, requires a valid XML structure. A valid XML structure would look like:
<artifact name="artifact1" version="1.0" buildId="13321123" make_target="beta" branch="branchName" date="2017-04-21 00:31:38.74856-07"
endtime="2017-04-21 00:59:54.680601-07"
status="succeeded"
change="e850b01967222464ffca02bf94dc711236fa978a"
released="no">
<file url="http://build.system.org/path/to/artifact/folder/MD5SUM.txt"/><file url="http://build.system.org/path/to/artifact/folder/SHA1SUM.txt"/><file url="http://build.system.org/path/to/artifact/folder/SHA256SUM.txt"/><file url="http://build.system.org/path/to/artifact/folder/something.exe"/><file url="http://build.system.org/path/to/artifact/folder/something_x64.msi"/>
</artifact>
The command:
xmlstarlet sel -t -v "//artifact/file[contains(@url,'something.exe')]/@url" -n xmlfile
The output:
http://build.system.org/path/to/artifact/folder/something.exe
The -v option (or --value-of) prints the value of the XPath expression.
The XPath contains() function returns true if the first argument string contains the second argument string, and otherwise returns false.

As RomanPerekhrest said, use an xml parser for this kind of task. For your example input you could use xmlstarlet like this:
xml sel -t -m 'Artifacts/artifact/file[contains(@url, "something.exe")]' -v @url
Output:
http://build.system.org/path/to/artifact/folder/something.exe

This regex should work: [\w\d\s-]*\.exe(?="/>). It matches somename.exe immediately before a closing "/>, where somename may consist of letters, digits, underscores, hyphens, or spaces; the lookahead keeps the "/> out of the printed match.
$ regex='[\w\d\s-]*\.exe(?="/>)'
$ echo "$input" | grep -oP "$regex"
Though, as someone mentioned above, you shouldn't use regex to parse xml, use xml parsers.
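If grep really is the only option, a more robust grep-only variant uses PCRE \K together with a lookahead so that -o prints only the full URL. This is a fragile sketch that assumes the url="..." attribute layout from the question, demonstrated on a literal line rather than the live curl output:

```shell
line='<file url="http://build.system.org/path/to/artifact/folder/something.exe"/>'
# \K discards everything matched so far, so -o prints only the URL itself;
# (?=") requires a closing quote without including it in the match
grep -oP 'url="\K[^"]*something\.exe(?=")' <<<"$line"
```

In the real pipeline the same pattern would simply replace the grep -E stage after the curl command.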


Regex isn't recognized in a curl/grep combination

I'm trying to get image URLs from a list of HTML URLs with the following curl/grep/sed combination (with wget I fail with a 403, but curl gets the source code correctly):
curl -K "C:\urls.txt" | "C:\GnuWin32\bin\grep.exe" -o '(http[^\s]+(jpg|png|webp)\b)' | sed 's/\?.*//' > imglinks.txt
But I get an error: The command "png" is either misspelled or could not be found.
Regex should be correct: https://regex101.com/r/Qk6A0Z/1/
How could this code be improved?
Edit: the source code of a single URL from my list can be seen by running curl https://watchbase.com/sellita
The snippet from which I want to get image URLs looks like:
<picture>
<source type="image/webp" data-srcset="https://cdn.watchbase.com/caliber/md/origin:png/sellita/sw200-1-bd.webp" srcset="https://assets.watchbase.com/img/FFFFFF-0.png" />
<img class="lazyload" data-src="https://cdn.watchbase.com/caliber/md/sellita/sw200-1-bd.png" src="https://assets.watchbase.com/img/FFFFFF-0.png" alt="Sellita caliber SW200-1"/>
</picture>
Expected output is a file with all image urls, even those from data-src and data-srcset.
You may try this xargs+curl+grep pipeline:
xargs -n 1 curl < "C:\urls.txt" | "C:\GnuWin32\bin\grep.exe" -Eo "http[^[:blank:]?'\"]+(jpe?g|png|gif|bmp|ico|tiff|webp)\b" > imglinks.txt
You can use
curl "https://watchbase.com/sellita" | "C:\GnuWin32\bin\grep.exe" -oE "http[^?[:space:]]+(jpg|png|webp)\b" > imglinks.txt
The "'png' is not recognized as an internal or external command, operable program or batch file" issue is due to the use of single quotation marks. You should use double quotation marks with grep on Windows.
To read all URLs from a file and process them, you may use
FOR /F %i in (C:\urls.txt) DO curl %i | "C:\GnuWin32\bin\grep.exe" -oP "http[^?\s]+(jpg|png|webp)\b" >> imglinks.txt
It's really bad practice to try to parse HTML with regex! And seeing senior members even encouraging this really makes me want to cry. This way the constant flood of these questions will never end.
Please have a look at:
Why it's not possible to use regex to parse HTML/XML: a formal explanation in layman's terms
RegEx match open tags except XHTML self-contained tags
To parse HTML please use an HTML parser like xmllint, xmlstarlet, or xidel!
<picture>
<source type="image/webp" data-srcset="https://cdn.watchbase.com/caliber/md/origin:png/sellita/sw200-1-bd.webp" srcset="https://assets.watchbase.com/img/FFFFFF-0.png" />
<img class="lazyload" data-src="https://cdn.watchbase.com/caliber/md/sellita/sw200-1-bd.png" src="https://assets.watchbase.com/img/FFFFFF-0.png" alt="Sellita caliber SW200-1"/>
</picture>
https://assets.watchbase.com/img/FFFFFF-0.png is just 1 white pixel and appears in every single <picture> node. So I'm going to assume you just want the attributes data-srcset and data-src.
xidel -s "https://watchbase.com/sellita" -e "//picture/(source/@data-srcset,img/@data-src)"
You can also use xidel (with just 1 invocation) to process the urls you have in "C:\urls.txt" (assuming they all have the same <picture>-nodes as https://watchbase.com/sellita).
xidel -s "C:\urls.txt" -e "for $url in x:lines($raw) return doc($url)//picture/(source/@data-srcset,img/@data-src)" > imglinks.txt
or
xidel -se "for $url in file:read-text-lines('C:\urls.txt') return doc($url)//picture/(source/@data-srcset,img/@data-src)" > imglinks.txt
If your goal is to download all images from 'imglinks.txt', then xidel can do that too.
xidel -s "C:\urls.txt" -f "for $url in x:lines($raw) return doc($url)//picture/(source/@data-srcset,img/@data-src)" --download "."
or
xidel -s --xquery "for $url in file:read-text-lines('C:\urls.txt') for $img in doc($url)//picture/(source/@data-srcset,img/@data-src) return file:write-binary(tokenize($img,'/')[last()],string-to-base64Binary(x:request($img)/raw))"
or, spread over multiple lines for readability in the Windows command prompt:
xidel -s --xquery ^"^
for $url in file:read-text-lines('C:\urls.txt')^
for $img in doc($url)//picture/(source/@data-srcset,img/@data-src)^
return^
file:write-binary(^
tokenize($img,'/')[last()],^
string-to-base64Binary(x:request($img)/raw)^
)^
"

Bash, grep between two lines with specified strings

example_file.txt:
a43
<un:Test1 id="U111">
abc1
cvb1
bnm1
</un:Test1>
<un:Test1 id="U222">
abc2
cvb2
bnm2
</un:Test1>
I need all lines between <un:Test1 id="U111"> and the first </un:Test1> only. The number of these lines differs from one input file to another. I have tried
grep -E -A100000 '<un:Test1 id=\"U111\">' example_file.txt | grep -B100000 '</un:Test1>'
but it returns all the lines below <un:Test1 id="U222"> as well. I know that it's better to use an XML parser for such files, but I'm not allowed to install additional libs on the server, so I can only use grep, awk, sed etc. Help me please.
Do you mean this?
sed -n '/<un:Test1 id="U111">/,/<\/un:Test1>/p' file
update with xmllint
If your input is xml, you can try:
xmllint --xpath "//*[local-name()='Test1'][@id='U111']" file.xml
Note: If you have different namespaces for the same local name ("Test1"), you need to add a namespace-uri() predicate as well.
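If xmllint isn't available either, the extraction can also be sketched in plain awk, which the stated constraints allow. This assumes the tag layout from example_file.txt and stops at the first closing tag after the matching opening tag:

```shell
# Recreate the sample input from the question
cat > example_file.txt <<'EOF'
a43
<un:Test1 id="U111">
abc1
cvb1
bnm1
</un:Test1>
<un:Test1 id="U222">
abc2
cvb2
bnm2
</un:Test1>
EOF

# Print from the U111 opening tag through its first closing tag, then stop
awk '/<un:Test1 id="U111">/ {found=1}
     found
     /<\/un:Test1>/ && found {exit}' example_file.txt
```

Unlike the grep -A/-B attempt, the exit guarantees nothing after the first closing tag is ever printed.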

updating a line in a xml file using a shell function

I'm trying to update a line in an xml file using a custom shell function and sed
On the command line, I run it as follows:
updateloc db_name
However, it does not update anything. Below is a sample of the code:
updateloc(){
    db_name=$1
    file="file.xml"
    olddb="<dbname><![CDATA[olddb]]></dbname>"
    newddb="<dbname><![CDATA[$db_name]]></dbname>"
    sed -i '' 's/$olddb/$newdb/g' $file
}
The right tool for this job is XMLStarlet. To modify any element named dbname with the new value:
updateloc() {
local db_name=$1 file=file.xml
xmlstarlet ed --inplace -u '//dbname' -v "$db_name" "$file"
}
To replace only elements with the old value olddb:
updateloc() {
local db_name=$1 file=file.xml
xmlstarlet ed --inplace \
-u "//dbname[. = 'olddb']" -v "$db_name" "$file"
}
Note that while the serialization generated by XMLStarlet won't necessarily use CDATA, it is guaranteed to be semantically equivalent, and to behave in exactly the same way in any XML-compliant parser.
1) $olddb and $newdb are not getting expanded because they're in single quotes. (There is also a typo: the variable is assigned as newddb but used as newdb.)
2) Your sed command is getting tripped up by all those XML characters: '[' and '/' are both meaningful to sed. You'd have to escape all of those, and perhaps use a different regex delimiter (e.g. "s#$olddb#$newdb#g", with double quotes so the variables expand). It's probably a bad idea to use sed for this unless the format of file.xml is very consistent (the closing tag could be on a separate line, for example).
That said, this would work for your example:
olddb='<dbname><!\[CDATA\[olddb\]\]></dbname>'
newdb='<dbname><!\[CDATA\[newdb\]\]></dbname>'
sed -i '' "s#$olddb#$newdb#g" $file
Grep and Sed Equivalent for XML Command Line Processing has some good approaches for better ways to mangle xml from the command line.
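If you do go the sed route, the escaping can be done programmatically instead of by hand. A minimal sketch, assuming GNU sed and writing to stdout rather than editing in place (the <config> wrapper line is just sample input for illustration):

```shell
olddb='<dbname><![CDATA[olddb]]></dbname>'
newdb='<dbname><![CDATA[newdb]]></dbname>'

# Escape characters that are special in a sed pattern ( ] [ \ / . * ^ $ )
esc_old=$(printf '%s' "$olddb" | sed 's/[][\/.*^$]/\\&/g')
# In the replacement only \ / and & are special
esc_new=$(printf '%s' "$newdb" | sed 's/[\/&]/\\&/g')

printf '%s\n' '<config><dbname><![CDATA[olddb]]></dbname></config>' |
  sed "s/$esc_old/$esc_new/g"
```

This keeps the substitution working even as the CDATA payload changes, though the structural caveats above (tags split across lines) still apply.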

Replacing tags in xml in bash

I have an xml file that is of the following format:
<list>
<version>1.5</version>
<version>1.4</version>
<version>1.3</version>
<version>1.2</version>
</list>
The idea is that I always update the first version tag with a new version. And when I do so, I replace the subsequent tags.
For example, when I update the 1.6 version as the first tag (which I know how to do), the following tags would be:
<list>
<version>1.6</version>
<version>1.5</version>
<version>1.4</version>
<version>1.3</version>
</list>
I've tried to get two options going.
First Option:
My preferred option would be to search the xml file and replace the version tag i+1 with version tag i. Something like:
sed -E '2,/<version>.*<\/version>/s#<version>(.*)</c>#<version>\1</version>#' file.xml
Where I search for the second instance of version and replace it with the first instance of version (currently not working).
Second Option:
My second option would be to store the version tags in variables like:
version=$(grep -oPm1 "(?<=version>)[^<]+" file.xml)
version2=$(grep -oPm2 "(?<=version>)[^<]+" file.xml)
Then replace version 2 by version 1 and do the replacement:
sed -i "s/${version2}/${version}/g" file.xml
However, this option gives:
sed: -e expression #1, char 9: unterminated 's' command.
And when I try:
sed -i "/$version2/s/${version2}/${version}/g" file.xml
I get:
unterminated address regex
Obviously, the idea for either option would be to put the code in a loop so that I can run it i times. However, I am stuck and both options I've tried don't work.
Don't use text-manipulation tools such as awk or sed to work with XML if you can at all avoid it. While this specific subset may be so simple as to make the approach feasible, having the right tools at hand will avoid headaches later (if the file format gets extended; if someone adds comments to the front; etc).
new_version=1.6
xmlstarlet ed \
-d '/list/version[last()]' \
-i '/list/version[1]' -t elem -n version -v "$new_version" \
<old.xml >new.xml
-d '/list/version[last()]' deletes the last version entry in the list.
-i '/list/version[1]' -t elem -n version -v 1.6 introduces a new element named version, with the value 1.6, in the position currently held by the very first version.
Use ! or # as the separator in sed instead of / if your match and replace variables contain /. In this case, though, the "unterminated 's' command" error comes from $version2 containing an embedded newline: grep -oPm2 prints the first two matches on separate lines, and that newline breaks the s command. Keep only the second match, e.g. version2=$(grep -oPm2 "(?<=version>)[^<]+" file.xml | tail -n1), and the s command is no longer unterminated.
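To illustrate the delimiter change, here is a sketch with hypothetical $old/$new values based on this question's sample tags; note that the / in </version> needs no escaping once ! is the delimiter:

```shell
old='<version>1.5</version>'
new='<version>1.6</version>'

# With ! as the s-command delimiter, slashes in the variables are harmless
printf '%s\n' '<list>' '<version>1.5</version>' '</list>' |
  sed "s!$old!$new!g"
```

The variables must be single-line for this to work, which is exactly why the grep -oPm2 capture above fails without tail -n1.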

regular expression shell script save specific part of file to variable

I need help with a command to handle the following situation.
I searched the net for help but couldn't find a solution.
Problem is:
I have an XML file named "temp.xml".
The text in the XML file is, for example:
<?xml version="1.0" encoding="utf-8"?>
<NoteData Note_Nbr="312" Data_Revision="2" Note_Revision="1" />
I want to save only Note_Nbr into my variable x (312).
I tried something, but it doesn't work:
X=$(sed -r 's/[^|]*(Note_Nbr=")(\d\d\d)([^|]*)/2 /' temp.xml )
Thank you for your help.
The right way to do this is with a real XML parser:
x=$(xmllint --xpath 'string(/NoteData/@Note_Nbr)' temp.xml)
...or, if you have XMLStarlet rather than a new enough xmllint:
x=$(xmlstarlet sel -t -m '/NoteData' -v @Note_Nbr -n <temp.xml)
See also this answer: https://stackoverflow.com/a/1732454/14122
Now, if you only wanted to work with literal strings, you could build something fragile that looked like this, using parameter expansion:
s='<NoteData Note_Nbr="312" Data_Revision="2" Note_Revision="1" />'
s=${s#* Note_Nbr=\"}; s=${s%%\"*}; echo "$s"
Alternately, you could use native regular expression support within bash (note that this functionality is a bash extension not present in POSIX sh):
s='<NoteData Note_Nbr="312" Data_Revision="2" Note_Revision="1" />'
re='Note_Nbr="([^"]+)"'
if [[ $s =~ $re ]]; then
match="${BASH_REMATCH[1]}"
else
echo "ERROR: No match found" >&2
fi
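For completeness, where bash's =~ isn't available, a POSIX sed substitution performs the same extraction; this is essentially the asker's original attempt corrected to use BRE capture groups \( \) with \1, and -n with the p flag so only matching lines print:

```shell
# Recreate the sample file from the question
printf '%s\n' '<?xml version="1.0" encoding="utf-8"?>' \
  '<NoteData Note_Nbr="312" Data_Revision="2" Note_Revision="1" />' > temp.xml

# Capture the attribute value between the quotes and print only on a match
x=$(sed -n 's/.*Note_Nbr="\([^"]*\)".*/\1/p' temp.xml)
echo "$x"   # 312
```

Like the other string-based approaches, this depends on the attribute sitting on one line and would silently break on reformatted XML, so the xmllint/xmlstarlet versions remain preferable.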