I have large log files (around 50mb each), which contain java debug information plus all kinds of XML responses
Here's an example of something I'm trying to extract from the log
<envelope>
<response>
<ATTR name="uniqueid" value="XYZ_00000-00-00_12345_1"/>
<ATTR name="status" value="Activated"/>
<ATTR name="datecreated" value="2018/10/04 09:39:05"/>
</response>
</envelope>
I need only the XMLs which the uniqueid attribute contains "12345" and the status attribute is set to "Activated"
By using "sed" I'm able to extract all the envelopes, and currently I'm using regex to check if the above conditions exist inside of it (by running all of them in a loop).
sed -n '/<envelope>/,/<\/envelope>/p' logfile
What would be a proper solution to extract what I need from the file?
Thanks!
assuming your xml is formatted as shown, this should work...
$ awk '/<envelope>/ {line=$0; p=0; next}
line {line=line ORS $0}
/uniqueid/ && $3~/12345/ {p=1}
/<\/envelope>/ && p {print line}' file
with the opening tag, start accumulating the lines, if the desired line found set the flag, with the end tag if the flag is set print the record.
with gawk you can do this instead
$ awk -F'\n' -v RS='</envelope>\n' \
'$3~/uniqueid.*12345/ && $4~/status.*Activated/{print $0, RT}' file
there will be an extra newline though.
Related
I am currently trying to merge several xml files via the following code:
rex_xh="-e '/^ *<\?xml[^>]*>$/d' -e s/^ *<\?xml[^>]+>//'"
rex_el="-e '/^[[:space:]]*$/d'"
rex_ts="-e "'s/^[ \t]*//'
while read xmldat ; do
cat $xmldat | sed $rex_xh $rex_el $rex_ts >> "$OUTDIR/$OUTFILE" ;
done << "$files"
which should essentially be executed (for each file) as:
cat $xmldat | sed -e '/^ *<\?xml[^>]*>$/d' -e s/^ *<\?xml[^>]+>//' -e '/^[[:space:]]*$/d' -e "'s/^[ \t]*// >> "$OUTDIR/$OUTFILE"
However when trying to execute this, i get this error message:
sed: -e expression #1, char 1: unknown command: `'
If I execute the command without the variables and instead enter the sed commands directly, it works fine. What am i missing? Am I doing something wrong with the variable expansion?
Based on (later given) user input, all 3, only 2 or only 1 of the given regular expressions should be used on the file(s).
The current setup should
-remove xml headers
-remove empty lines
-remove tabs and spaces in the beginning of new lines.
Input Example
<?xml version="1.0" encoding="ISO-8859-15" standalone="no"?>
<RootNode xmlns="http://stub/example">
<ExampleBase someattr="val">
<InnerNode>Example</InnerNode>
<ExampleBase someattr="val">
</RootNode>
Expected result (when header removal, space removal and empty line removal is wanted)
<RootNode xmlns="http://stub/example">
<ExampleBase someattr="val">
<InnerNode>Example</InnerNode>
<ExampleBase someattr="val">
</RootNode>
Expected result (when only space removal and empty line removal is wanted)
<?xml version="1.0" encoding="ISO-8859-15" standalone="no"?>
<RootNode xmlns="http://stub/example">
<ExampleBase someattr="val">
<InnerNode>Example</InnerNode>
<ExampleBase someattr="val">
</RootNode>
Input Example 2
<?xml version="1.0" encoding="ISO-8859-15" standalone="no"?><RootNode xmlns="http://stub/example"><ExampleBase someattr="val"><InnerNode>Example
</InnerNode>
<ExampleBase someattr="val">
</RootNode>
(And yes, we get that kind of weird formatted xml)
Expected result (when header removal, space removal and empty line removal is wanted)
<RootNode xmlns="http://stub/example"><ExampleBase someattr="val"><InnerNode>Example
</InnerNode>
<ExampleBase someattr="val">
</RootNode>
Notes:
The files are not always valid xml files, hence I cant use xmllint or other xml tools
e.g. no closing tag
The header dows not always is alone on the first line, sometimes it is not succeeded by a line break.
The different regexes (e.g rex_xh) will later be optional and controlled by user input, hence the "necessity" to wrap them in variables
In the future it should be easy, to add new "options", hence another reason for using the "options" in variables.
Can anyone help me out here?
Please try following awk code to deal with few edge cases added by OP into the question now. Written and tested with shown samples only in GNU awk.
awk -v RS="^$" '
match($0,/^<\?xml version="[^"]*" encoding="[^"]*" standalone="[^"]*"\?>/){
val=substr($0,RSTART+RLENGTH)
gsub(/\n/,"",val)
gsub(/>[[:space:]]*</,">\n<",val)
gsub(/[[:space:]]+</,"<",val)
gsub(/>[[:space:]]*</,">\n<",val)
print val
}
' Input_file
Explanation: Simple explanation would be, using 2 conditions in awk program. 1st: If a line is NOT having value(matching by regex ^<\?xml version="[^"]*" encoding="[^"]*" standalone="[^"]*"\?>$) AND its NOT NULL, then using gsub function(s) to get output as per need and print that line's value present in val variable..
EDIT by OP - Implemented Solution
After fiddling around and due to the help,comments and answer from #RavinderSingh13 The following code is the final solution(snippet for the important part):
rm_xmlhead=1; # Option given via user input (later)
rm_tabspac=1; # Option given via user input (later)
rm_emptyln=1; # Option given via user input (later)
while read xmldat ; do
cat $xmldat | awk -v rem_xh=$rm_xmlhead -v rem_ts=$rm_tabspac -v rem_el=$rm_emptyln ' {
if(rem_xh) { sub(/^ *<\?xml[^>]+>/,"") }
if(rem_ts) { sub(/^[[:space:]]+/,"") }
if(rem_el && $0 =="" ) {next}
print
}' >> "$OUTPUT" ;
done << "$files"
This removes empty lines, leading spaces and tabs, xml headers and is easily expandable if any "new" requirements arise...also it gives me the possibility to later make each of the removals optional.
Following up on an answer by #dawg to my question how to delete multiple sections in a file based on known patterns, I want to use a regular expression in awk to identify the start of the section(s) I want to delete.
The file I am working with is an xml file. It is in fact the file containing the recently used filenames list (RUFL) in Linux Mint (~/.local/share/recently-used.xbel).
This is how the RUFL is structured:
<?xml version="1.0" encoding="UTF-8"?>
<xbel version="1.0"
xmlns:bookmark="http://www.freedesktop.org/standards/desktop-bookmarks"
xmlns:mime="http://www.freedesktop.org/standards/shared-mime-info"
>
<bookmark href="file:///home/ocor61/Documents/Linux/Linux%20Mint%20Cinnamon%20Keyboard%20Shortcuts.pdf" added="2021-07-18T01:57:02Z" modified="2021-07-18T01:57:02Z" visited="1969-12-31T23:59:59Z">
<info>
<metadata owner="http://freedesktop.org">
<mime:mime-type type="application/pdf"/>
<bookmark:applications>
<bookmark:application name="Document Viewer" exec="'xreader %u'" modified="2021-07-18T01:57:02Z" count="1"/>
</bookmark:applications>
</metadata>
</info>
</bookmark>
<bookmark href="file:///home/ocor61/Documents/Linux/Linux%20Command%20Line%20Cheat%20Sheet.pdf" added="2021-07-18T01:57:09Z" modified="2021-07-18T01:57:09Z" visited="1969-12-31T23:59:59Z">
<info>
<metadata owner="http://freedesktop.org">
<mime:mime-type type="application/pdf"/>
<bookmark:applications>
<bookmark:application name="Document Viewer" exec="'xreader %u'" modified="2021-07-18T01:57:09Z" count="1"/>
</bookmark:applications>
</metadata>
</info>
</bookmark>
<bookmark href="file:///home/ocor61/Documents/work.bfproject" added="2021-07-20T10:52:59Z" modified="2021-07-22T08:41:57Z" visited="1969-12-31T23:59:59Z">
<info>
<metadata owner="http://freedesktop.org">
<mime:mime-type type="application/x-bluefish-project"/>
<bookmark:applications>
<bookmark:application name="bluefish" exec="'bluefish %u'" modified="2021-07-22T08:41:57Z" count="2"/>
</bookmark:applications>
</metadata>
</info>
</bookmark>
</xbel>
I am working on a script to remove filenames from the list. It works fine, but I am also working with an array that contains patterns that should not be used. For example: if the pattern [bookmark] would be used to identify a section that must be removed, the entire file would become unusable. That goes for parts of [bookmark], but also for href, added, info... You get my drift.
So, I want to work with a regexp to counter the problems of entering patterns that cannot be used.
Currently, this is the awk code I am using now (thanks to #dawg):
ENDLINE='</bookmark>'
awk -v f=1 -v st="$1" -v end="$ENDLINE" '
match($0, st) {f=0}
f
match($0, end){f=1}' ~/.local/share/recently-used.xbel
$1 would be the pattern a user enters at the command line, which is part of the file name that must be removed from the RUFL.
The following is the code I would like to use, including the regexp, which doesn't work:
STARTLINE='/(<bookmark href)(.*)($1)(.*)(>)/'
ENDLINE='</bookmark>'
awk -v f=1 -v st="$STARTLINE" -v end="$ENDLINE" '
match($0, st) {f=0}
f
match($0, end){f=1}' ~/.local/share/recently-used.xbel
I have tested the regular expression at https://regexr.com/, so I know it is correct. However, when I use it in my script, this is the error message I am getting:
./ruffle.sh: line 99: syntax error near unexpected token `$0,'
./ruffle.sh: line 99: ` match($0, st) {f=0}'
I have also tried to enter the regexp itself in the awk command line instead of the variable, but that has the same result.
I don't know how to proceed, so any help is appreciated.
The answer to my question lies in how regular expressions can differ when used in different environments. The website I used to check my regexp does so for languages like JS, but not for Bash or likely other shell implementations.
With shellcheck.net as well as by putting the command 'set -vx' in my script right before the awk command, I managed to work things out.
Another mistake I made was to attempt to catch the complete line in the regexp, while I need only the part in that line that can hold the pattern that is entered (which is the part between 'file:' and 'added' in the file ~/.local/share/recently-used.xbel).
The regexp that ultimately works for me now with the variable STARTLINE is:
STARTLINE='file:.*'$1'.*added='
I will have to look into using an xml parser, thanks for the suggestion! For now, however, my script works. Thanks #Sundeep and #EdMorton!
I have an xml file with many lines like:
<xhtml:link vip="true" href="http://store.vcenter.com/stores/en/product/tigers-midi/100" />
How do I extract just the link - http://store.vcenter.com/stores/en/product/tigers-midi/100?
I tried http://www\.\.com[^<]+ but that captures everything untill the end of the line - including quotes and closing XML tags.
I'm using this expression with egrep.
Don't parse HTML with regex, use a proper XML/HTML parser.
Check: Using regular expressions with HTML tags
You can use one of the following :
xmllint
xmlstarlet
saxon-lint
File:
<root>
<xhtml:link vip="true" href="http://store.vcenter.com/stores/en/product/tigers-midi/100" />
</root>
Example with xmllint :
xmllint --xpath '//*[#vip="true"]/#href' file.xml 2>/dev/null
Output:
href="http://store.vcenter.com/stores/en/product/tigers-midi/100"
If you need a quick & dirty one time command, you can do:
egrep -o 'https?://[^"]+' file
I have the below response being returned from my build system. The build generates multiple artifacts and I want to extract the link to particular artifact from the response below. Let us say something.exe.
<Artifacts>
<artifact name="artifact1" version="1.0" buildId="13321123" make_target="beta" branch="branchName" date="2017-04-21 00:31:38.74856-07"
endtime="2017-04-21 00:59:54.680601-07"
status="succeeded"
change="e850b01967222464ffca02bf94dc711236fa978a"
released="no">
<file url="http://build.system.org/path/to/artifact/folder/MD5SUM.txt"/><file url="http://build.system.org/path/to/artifact/folder/SHA1SUM.txt"/><file url="http://build.system.org/path/to/artifact/folder/SHA256SUM.txt"/><file url="http://build.system.org/path/to/artifact/folder/something.exe"/><file url="http://build.system.org/path/to/artifact/folder/something_x64.msi"/>
</artifact>
</Artifacts>
I would like to know a way to extract just the URL for something.exe. I have tried using piping the curl output and run a grep -E with a regular expression but that gives me the entire line instead.
curl -s --request GET http://build.system.org/path/to/artifact/folder/api/?build=13321123 | grep -E 'file url='
curl -s --request GET http://build.system.org/path/to/artifact/folder/api/?build=13321123 | | grep -E 'file url="http\S+OVF10.ova"'
Is there a way to just extract the following ?
http://build.system.org/path/to/artifact/folder/something.exe
The righteous way would be to use XML tools in this case, such as xmlstarlet
But that, of course, requires a valid XML structure. A valid XML structure would look like:
<artifact name="artifact1" version="1.0" buildId="13321123" make_target="beta" branch="branchName" date="2017-04-21 00:31:38.74856-07"
endtime="2017-04-21 00:59:54.680601-07"
status="succeeded"
change="e850b01967222464ffca02bf94dc711236fa978a"
released="no">
<file url="http://build.system.org/path/to/artifact/folder/MD5SUM.txt"/><file url="http://build.system.org/path/to/artifact/folder/SHA1SUM.txt"/><file url="http://build.system.org/path/to/artifact/folder/SHA256SUM.txt"/><file url="http://build.system.org/path/to/artifact/folder/something.exe"/><file url="http://build.system.org/path/to/artifact/folder/something_x64.msi"/>
</artifact>
The command:
xmlstarlet sel -t -v "//artifact/file[contains(#url,'something.exe')]/#url" -n xmlfile
The output:
http://build.system.org/path/to/artifact/folder/something.exe
-v option (or --value-of ) - print value of XPATH expression
The XPATH contains() function returns true if the first argument string contains the second argument string, and otherwise returns false.
As RomanPerekhrest said, use an xml parser for this kind of task. For your example input you could use xmlstarlet like this:
xml sel -t -m 'Artifacts/artifact/file [contains(#url, "something.exe")]' -v #url
Output:
http://build.system.org/path/to/artifact/folder/something.exe
This regex should work: ([\w\d\s]*.exe)"\/> (it searches for a string that consists of (/somename.exe"/> , where someonemae must consist of letters, digits, or basic space signs ("_","-"," ").
$ regex="([\w\d\s]*.exe)"\/>"
$ echo $input | grep -oP "$regex"
Though, as someone mentioned above, you shouldn't use regex to parse xml, use xml parsers.
I am using sed to parse an xml file from yahoo.finance. the file contains a bunch of uninteresting information and all global stock symbols which i want to extract. It's a 1 liner xml file with a big amount of stock symbols which are represented like that:
symbol="VALUE"
i am using sed like this:
sed "s/.* symbol=\"\(.*\)\".*/\1/" list_stocksymbols.xml >> ./tmpfile.txt
my output looks like that:
<?xml version="1.0" encoding="UTF-8"?>
WRG.AX
<!-- engine8.yql.bf1.yahoo.com -->
problem
as you can see only 1 symbol is extracted (WRG.AX).
question
how would i go about getting sed to write out all symbols?
i tried
sed "s/.* symbol=\"\(.*\)\".*/\1/g" list_stocksymbols.xml >> ./tmpfile.txt
global flag, but it didnt work :/
**xml file extract **
<?xml version="1.0" encoding="UTF-8"?>
<query xmlns:yahoo="http://www.yahooapis.com/v1/base.rng" yahoo:count="215" yahoo:created="2014-08-22T09:05:59Z" yahoo:lang="en-US">
<results><industry id="112" name="Agricultural Chemicals">
<company name="Adarsh Plant Protect Ltd" symbol="ADARSHPL.BO"/>
<company name="Agrium Inc" symbol="AGU.DE"/><company name="Agrium Inc" symbol="AGU.TO"/>
<company name="Agrium Inc." symbol="AGU"/>
<company name="Aimco Pesticides Ltd" symbol="AIMCO.BO"/>
<company name="American Vanguard Corp." symbol="AVD"/>
... and so on. The file is in 1 line only and not formatted like above.
** perl regex try **
perl -nle'print $& if m{(?<=symbol=")[^"]+}' list_stocksymbols
did also just bring out the first occurence
grep -Eo 'symbol="[^"]+' yahoo.txt | cut -c 9-
This works for all the grep versions without Perl support (as in Mac OS X in your case).
Also using only sed you could:
sed 's/.*symbol=\"//;s/\".*//' yahoo.txt