I am trying to extract image URLs from a list of HTML page URLs with the following curl/grep/sed combination (wget fails with a 403, but curl fetches the source code correctly):
curl -K "C:\urls.txt" | "C:\GnuWin32\bin\grep.exe" -o '(http[^\s]+(jpg|png|webp)\b)' | sed 's/\?.*//' > imglinks.txt
But I get the error: The command "png" is either misspelled or could not be found.
Regex should be correct: https://regex101.com/r/Qk6A0Z/1/
How could this code be improved?
Edit: the source code of a single URL from my list can be seen by running curl https://watchbase.com/sellita
The snippet from which I want to get the image URLs looks like this:
<picture>
<source type="image/webp" data-srcset="https://cdn.watchbase.com/caliber/md/origin:png/sellita/sw200-1-bd.webp" srcset="https://assets.watchbase.com/img/FFFFFF-0.png" />
<img class="lazyload" data-src="https://cdn.watchbase.com/caliber/md/sellita/sw200-1-bd.png" src="https://assets.watchbase.com/img/FFFFFF-0.png" alt="Sellita caliber SW200-1"/>
</picture>
Expected output is a file with all image urls, even those from data-src and data-srcset.
You may try this xargs+curl+grep pipeline:
xargs -n 1 curl < "C:\urls.txt" | "C:\GnuWin32\bin\grep.exe" -Eo "http[^[:blank:]?'\"]+(jpe?g|png|gif|bmp|ico|tiff|webp)\b" > imglinks.txt
You can use
curl "https://watchbase.com/sellita" | "C:\GnuWin32\bin\grep.exe" -oE "http[^?[:space:]]+(jpg|png|webp)\b" > imglinks.txt
The 'png' is not recognized as an internal or external command, operable program or batch file issue is due to the use of single quotation marks. You should use double quotation marks in Windows grep.
To read all URLs from a file and process them, you may use
FOR /F %i in (C:\urls.txt) DO curl %i | "C:\GnuWin32\bin\grep.exe" -oP "http[^?\s]+(jpg|png|webp)\b" >> imglinks.txt
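To check the pattern without hitting the network, here is a minimal sketch that runs the same kind of grep -oE extraction on the snippet from the question. It assumes a POSIX shell (for example Git Bash or WSL) rather than cmd.exe, and the variable names html and urls are only illustrative. The jpe?g alternative, the https?:// prefix and the literal dot before the extension are my additions to the pattern above.

```shell
# Sample markup copied from the question; in practice this would be curl output.
html='<source type="image/webp" data-srcset="https://cdn.watchbase.com/caliber/md/origin:png/sellita/sw200-1-bd.webp" srcset="https://assets.watchbase.com/img/FFFFFF-0.png" />
<img class="lazyload" data-src="https://cdn.watchbase.com/caliber/md/sellita/sw200-1-bd.png" src="https://assets.watchbase.com/img/FFFFFF-0.png" alt="Sellita caliber SW200-1"/>'

# -o prints only the matched part, -E enables the (jpe?g|png|webp) alternation.
# The bracket expression stops the URL at a quote, a "?" or whitespace,
# so no separate sed step is needed to cut off query strings.
urls=$(printf '%s\n' "$html" |
  grep -oE "https?://[^?\"'[:space:]]+\.(jpe?g|png|webp)" |
  LC_ALL=C sort -u)

printf '%s\n' "$urls"
```

The sort -u step drops the placeholder image that repeats in every picture node; remove it if duplicates should be kept.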
It's really bad practice trying to parse HTML with RegEX! And to see senior members even encouraging this really makes me want to cry. This way the constant flood of these questions will never end.
Please have a look at:
Why it's not possible to use regex to parse HTML/XML: a formal explanation in layman's terms
RegEx match open tags except XHTML self-contained tags
To parse HTML please use an HTML parser like xmllint, xmlstarlet, or xidel!
<picture>
<source type="image/webp" data-srcset="https://cdn.watchbase.com/caliber/md/origin:png/sellita/sw200-1-bd.webp" srcset="https://assets.watchbase.com/img/FFFFFF-0.png" />
<img class="lazyload" data-src="https://cdn.watchbase.com/caliber/md/sellita/sw200-1-bd.png" src="https://assets.watchbase.com/img/FFFFFF-0.png" alt="Sellita caliber SW200-1"/>
</picture>
https://assets.watchbase.com/img/FFFFFF-0.png is just 1 white pixel and returns in every single <picture>-node. So I'm going to assume you just want the attributes data-srcset and data-src.
xidel -s "https://watchbase.com/sellita" -e "//picture/(source/@data-srcset,img/@data-src)"
You can also use xidel (with just 1 invocation) to process the urls you have in "C:\urls.txt" (assuming they all have the same <picture>-nodes as https://watchbase.com/sellita).
xidel -s "C:\urls.txt" -e "for $url in x:lines($raw) return doc($url)//picture/(source/@data-srcset,img/@data-src)" > imglinks.txt
or
xidel -se "for $url in file:read-text-lines('C:\urls.txt') return doc($url)//picture/(source/@data-srcset,img/@data-src)" > imglinks.txt
If your goal is to download all the images listed in 'imglinks.txt', then xidel can do this too.
xidel -s "C:\urls.txt" -f "for $url in x:lines($raw) return doc($url)//picture/(source/@data-srcset,img/@data-src)" --download "."
or
xidel -s --xquery "for $url in file:read-text-lines('C:\urls.txt') for $img in doc($url)//picture/(source/@data-srcset,img/@data-src) return file:write-binary(tokenize($img,'/')[last()],string-to-base64Binary(x:request($img)/raw))"
xidel -s --xquery ^"^
for $url in file:read-text-lines('C:\urls.txt')^
for $img in doc($url)//picture/(source/@data-srcset,img/@data-src)^
return^
file:write-binary(^
tokenize($img,'/')[last()],^
string-to-base64Binary(x:request($img)/raw)^
)^
"
Related
I have a number of XML files that have HTML embedded in a node. I need to capture everything that is not a tag and add some non-HTML tags (for Moodle) around the text.
I'm processing the files from the command line, using a bash script. I'm using xpath to get the content, piping through xargs to sneakily rip out newlines and then piping through sed.
Here's a sample of the tag:
xpath -q -e '/activity/page/content' page.xml|xargs
<content><h3 style=float:right><img
src=##PLUGINFILE##/consumables.png> </h3> <h3>TITLE</h3>
<p>In order to conduct an LE5 drug test you need a Druglizaer
(batch controlled) foil pouch that contains two items:</p>
<p></p> <ol> <li><span style=font-
weight:900>Druglizer Cartridge</span></li><li><span
style=font-weight:900>Druglizer Oral Fluid
Collector</span></li> </ol> <p></p></content>
On https://regex101.com/ I used \>(.*?)\< which groups the text as expected, but when I run it with sed it doesn't do any substitutions.
#!/bin/bash
# get new name string
name=$(xpath -q -e '/activity/page/name' page.xml);
en=$(echo $name|sed -e 's/<[^>]*>//g');
vi=$(echo $en|trans -brief -t vi);
cn=$(echo $en|trans -brief -t zh-CN);
mlang_name=$(echo "{mlang en}$en{mlang}{mlang
vi}$vi{mlang}{mlang
zh_cn}$cn{mlang}")
# xmlstarlet to update node
# get new content string
content=$(xpath -q -e '/activity/page/content' page.xml);
# \>(.*?)\<
mlang_name=$(echo $content|sed -e 's/\>(.*?)\</\{mlang
en\}$1\{mlang\}\{mlang
vi\}#VI#\{mlang\}\{mlang
zh_cn\}#CN#\{mlang\}/g')
# xmlstarlet to update node
I need the replace to put {mlang en}TEXT{mlang} around the text.
I ended up using perl, as it supports the non-greedy quantifier I was using.
perl -pe 's/(.*?>)(.*?)(<.*?)/$1\{mlang en\}$2\{mlang\}$3/g'
With the above file, the full command I used was
content=$(xpath -q -e '/activity/page/content' page.xml);echo $content|xargs|sed -e 's/<|<content>//g'|sed -e 's|</content>||g' |perl -pe 's/(.*?>)(.*?)(<.*?)/$1\{mlang en\}$2\{mlang\}$3/g'|sed -e 's/{mlang en}[\ ]*{mlang}//g'|sed -e 's/<content>//g'
Which gave the following output
<h3 style=float:right><img src=##PLUGINFILE##/consumables.png></h3><h3>{mlang en}TITLE{mlang}</h3><p>{mlang en}In order to conduct an LE5 drug test you need a Druglizaer (batch controlled) foil pouch that contains two items:{mlang}</p><p></p><ol><li><span style=font-weight:900>{mlang en}Druglizer LE5 Cartridge{mlang}</span></li><li><span style=font-weight:900>{mlang en}Druglizer Oral Fluid Collector{mlang}</span></li></ol><p></p>
If there's a more elegant way feel free to let me know.
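For what it's worth, the original sed attempt could not have worked for three reasons: sed has no non-greedy .*? quantifier, GNU sed treats \< and \> as word boundaries rather than literal angle brackets, and backreferences are written \1, not $1. A sed-only variant is possible by replacing the non-greedy match with a negated bracket expression. The following is only a sketch on a shortened, hypothetical sample string, and it assumes a sed that accepts -E (GNU and BSD sed both do):

```shell
# Shortened sample in the same shape as the <content> node from the question.
content='<h3>TITLE</h3> <p>Two items:</p><p></p>'

# >( ... )< captures the text between two tags. [^<>]* cannot cross a tag
# boundary, so no non-greedy quantifier is needed. The middle bracket
# expression requires at least one non-whitespace character, so empty or
# whitespace-only gaps between tags are left alone.
wrapped=$(printf '%s\n' "$content" |
  sed -E 's/>([^<>]*[^<>[:space:]][^<>]*)</>{mlang en}\1{mlang}</g')

printf '%s\n' "$wrapped"
```

This skips the later cleanup step that strips empty {mlang en}{mlang} pairs, because such pairs are never produced in the first place.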
I have the response below being returned from my build system. The build generates multiple artifacts and I want to extract the link to a particular artifact from the response, say something.exe.
<Artifacts>
<artifact name="artifact1" version="1.0" buildId="13321123" make_target="beta" branch="branchName" date="2017-04-21 00:31:38.74856-07"
endtime="2017-04-21 00:59:54.680601-07"
status="succeeded"
change="e850b01967222464ffca02bf94dc711236fa978a"
released="no">
<file url="http://build.system.org/path/to/artifact/folder/MD5SUM.txt"/><file url="http://build.system.org/path/to/artifact/folder/SHA1SUM.txt"/><file url="http://build.system.org/path/to/artifact/folder/SHA256SUM.txt"/><file url="http://build.system.org/path/to/artifact/folder/something.exe"/><file url="http://build.system.org/path/to/artifact/folder/something_x64.msi"/>
</artifact>
</Artifacts>
I would like to know a way to extract just the URL for something.exe. I have tried piping the curl output into grep -E with a regular expression, but that gives me the entire line instead.
curl -s --request GET http://build.system.org/path/to/artifact/folder/api/?build=13321123 | grep -E 'file url='
curl -s --request GET http://build.system.org/path/to/artifact/folder/api/?build=13321123 | grep -E 'file url="http\S+OVF10.ova"'
Is there a way to just extract the following ?
http://build.system.org/path/to/artifact/folder/something.exe
The right way would be to use XML tools in this case, such as xmlstarlet.
But that, of course, requires a valid XML structure. A valid XML structure would look like:
<artifact name="artifact1" version="1.0" buildId="13321123" make_target="beta" branch="branchName" date="2017-04-21 00:31:38.74856-07"
endtime="2017-04-21 00:59:54.680601-07"
status="succeeded"
change="e850b01967222464ffca02bf94dc711236fa978a"
released="no">
<file url="http://build.system.org/path/to/artifact/folder/MD5SUM.txt"/><file url="http://build.system.org/path/to/artifact/folder/SHA1SUM.txt"/><file url="http://build.system.org/path/to/artifact/folder/SHA256SUM.txt"/><file url="http://build.system.org/path/to/artifact/folder/something.exe"/><file url="http://build.system.org/path/to/artifact/folder/something_x64.msi"/>
</artifact>
The command:
xmlstarlet sel -t -v "//artifact/file[contains(@url,'something.exe')]/@url" -n xmlfile
The output:
http://build.system.org/path/to/artifact/folder/something.exe
-v option (or --value-of ) - print value of XPATH expression
The XPATH contains() function returns true if the first argument string contains the second argument string, and otherwise returns false.
As RomanPerekhrest said, use an xml parser for this kind of task. For your example input you could use xmlstarlet like this:
xml sel -t -m 'Artifacts/artifact/file [contains(@url, "something.exe")]' -v @url
Output:
http://build.system.org/path/to/artifact/folder/something.exe
This regex should work: ([\w\d\s]*\.exe)"\/> (it searches for a string of the form somename.exe"/>, where somename must consist of letters, digits, underscores, or whitespace).
$ regex='([\w\d\s]*\.exe)"\/>'
$ echo "$input" | grep -oP "$regex"
Though, as someone mentioned above, you shouldn't use regex to parse xml, use xml parsers.
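If no XML parser is available on the build host, the "entire line" problem from the question can also be avoided with grep -o, which prints only the matched text. The sketch below assumes the response looks like the sample above (double-quoted url attributes). The response variable stands in for the curl output and is shortened for illustration.

```shell
# Stand-in for the curl output, shortened from the response in the question.
response='<file url="http://build.system.org/path/to/artifact/folder/SHA256SUM.txt"/><file url="http://build.system.org/path/to/artifact/folder/something.exe"/><file url="http://build.system.org/path/to/artifact/folder/something_x64.msi"/>'

# [^"]* cannot run past the closing quote of an attribute, so the match
# stays inside one url="..." value. The trailing quote in the pattern
# makes sure the file name ends right after ".exe"; tr then strips it.
url=$(printf '%s\n' "$response" |
  grep -oE 'https?://[^"]*/something\.exe"' |
  tr -d '"')

printf '%s\n' "$url"
```

This remains a text match, so it will break if the attribute quoting or layout changes; the xmlstarlet commands above do not have that weakness.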
I have an HTML page and I'm trying to parse it.
Source ::
<tr class="active0"><td class=ac><a name="redis/172.29.219.17"></a><a class=lfsb href="#redis/172.29.219.17">172.29.219.17</a></td><td>0</td><td>0</td><td>-</td><td>0</td><td>0</td><td></td><td>0</td><td>0</td><td>-</td><td><u>0<div class=tips><table class=det><tr><th>Cum. sessions:</th><td>0</td></tr><tr><th colspan=3>Avg over last 1024 success. conn.</th></tr><tr><th>- Queue time:</th><td>0</td><td>ms</td></tr><tr><th>- Connect time:</th><td>0</td><td>ms</td></tr><tr><th>- Total time:</th><td>0</td><td>ms</td></tr></table></div></u></td><td>0</td><td>?</td><td>0</td><td>0</td><td></td><td>0</td><td></td><td>0</td><td><u>0<div class=tips>Connection resets during transfers: 0 client, 0 server</div></u></td><td>0</td><td>0</td><td class=ac>17h12m DOWN</td><td class=ac><u> L7TOUT in 1001ms<div class=tips>Layer7 timeout: at step 6 of tcp-check (expect string 'role:master')</div></u></td><td class=ac>1</td><td class=ac>Y</td><td class=ac>-</td><td><u>1<div class=tips>Failed Health Checks</div></u></td><td>1</td><td>17h12m</td><td class=ac>-</td></tr>
<tr class="backend"><td class=ac><a name="redis/Backend"></a><a class=lfsb href="#redis/Backend">Backend</a></td><td>0</td><td>0</td><td></td><td>1</td><td>24</td><td></td><td>29</td><td>41</td><td>200</td><td><u>5<span class="rls">4</span>033<div class=tips><table class=det><tr><th>Cum. sessions:</th><td>5<span class="rls">4</span>033</td></tr><tr><th>- Queue time:</th><td>0</td><td>ms</td></tr><tr><th>- Connect time:</th><td>0</td><td>ms</td></tr><tr><th>- Total time:</th><td><span class="rls">6</span>094</td><td>ms</td></tr></table></div></u></td><td>5<span class="rls">4</span>033</td><td>1s</td><td><span class="rls">4</span>89<span class="rls">1</span>000</td><td>1<span class="rls">8</span>11<span class="rls">6</span>385<div class=tips>compression: in=0 out=0 bypassed=0 savings=0%</div></td><td>0</td><td>0</td><td></td><td>0</td><td><u>0<div class=tips>Connection resets during transfers: 54004 client, 0 server</div></u></td><td>0</td><td>0</td><td class=ac>17h12m UP</td><td class=ac> </td><td class=ac>1</td><td class=ac>1</td><td class=ac>0</td><td class=ac> </td><td>0</td><td>0s</td><td></td></tr></table><p>
What I want is ::
172.29.219.17 L7TOUT in 1001ms
So what I'm trying right now is ::
grep redis index.html | grep 'a name=\"redis\/[0-9]*.*\"'
to extract the IP address.
But the regex doesn't seem to pick out only the first row; it returns both rows, whereas the IP is only in row 1.
I've double-checked the regex I'm using, but it doesn't seem to work.
Any ideas?
Using xpath expressions in xmllint with its built-in HTML parser would produce an output as
ipAddr=$(xmllint --html --xpath "string(//tr[1]/td[1])" html)
172.29.219.17
and for the timeout value, I manually counted the position of the td cell containing it, which turned out to be 24
xmllint --html --xpath "string(//tr[1]/td[24]/u[1])" html
produces an output as
L7TOUT in 1001ms
Layer7 timeout: at step 6 of tcp-check (expect string 'role:master')
removing the whitespaces and extracting out only the needed parts with Awk as
xmllint --html --xpath "string(//tr[1]/td[24]/u[1])" html | awk 'NF && /L7TOUT/{gsub(/^[[:space:]]*/,"",$0); print}'
L7TOUT in 1001ms
put in a variable as
timeOut=$(xmllint --html --xpath "string(//tr[1]/td[24]/u[1])" html | awk 'NF && /L7TOUT/{gsub(/^[[:space:]]*/,"",$0); print}')
Now you can print both the values together as
echo "${ipAddr} ${timeOut}"
172.29.219.17 L7TOUT in 1001ms
version details,
xmllint --version
xmllint: using libxml version 20902
Also there is an incorrect tag in your HTML input file </table> at the end just before <p> which xmllint reports as
htmlfile:147: HTML parser error : Unexpected end tag : table
remove the line before further testing.
Here is a list of command line tools that will help you parse different formats via bash; bash is extremely powerful and useful.
JSON utilize jq
XML/HTML utilize xq
YAML utilize yq
CSS utilize bashcss
I have tested all the other tools, comment on this one
If the code starts getting truly complex, you might consider the naive answer below, as programming languages with class support will assist.
naive - Old Answer
Parsing complex formats like JSON, XML, HTML, CSS, YAML, ...ETC is extremely difficult in bash and likely error prone. Because of this I recommend one of the following:
PHP
RUBY
PYTHON
GOLANG
because these languages are cross platform and have parsers for all the above listed formats.
If you want to parse HTML with regexes, then you have to make assumptions about the HTML formatting. E.g. you assume here that the a tag and its name attribute are on the same line. However, this is perfectly valid HTML too:
<a
name="redis/172.29.219.17">
Some text
</a>
Anyway, let's solve the problem assuming that the a tags are on one line and the name is the first attribute. This is what I could come up with:
sed 's/\(<a name="redis\)/\n\1/g' index.html | grep '^<a name="redis\/[0-9.]\+"' | sed -e 's/^<a name="redis\///g' -e 's/".*//g'
Explanation:
The first sed command makes sure that all <a name="redis text goes to a separate line.
Then the grep keeps only those lines that start with <a name="redis/ followed by digits and dots and a closing "
The last sed contains two expressions:
The first expression removes the leading <a name="redis/ text
The last expression removes everything that comes after the closing "
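The pipeline above relies on GNU extensions (\n in the sed replacement and \+ in grep). Under the same assumptions about the markup, a roughly equivalent sketch can let grep -o do the line splitting instead. The html variable below holds a shortened, hypothetical version of the two rows from the question.

```shell
# Two shortened rows in the same shape as the page in the question.
html='<tr class="active0"><td class=ac><a name="redis/172.29.219.17"></a><a class=lfsb href="#redis/172.29.219.17">172.29.219.17</a></td></tr>
<tr class="backend"><td class=ac><a name="redis/Backend"></a><a class=lfsb href="#redis/Backend">Backend</a></td></tr>'

# grep -o emits each match on its own line, so no prior line splitting is
# needed. [0-9.]+ only accepts digits and dots, which excludes "Backend".
# The two sed expressions strip the leading tag text and the closing quote.
ip=$(printf '%s\n' "$html" |
  grep -oE '<a name="redis/[0-9.]+"' |
  sed -e 's/^<a name="redis\///' -e 's/"$//')

printf '%s\n' "$ip"
```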
I'm trying to take Google's HTML and parse out the links. I use curl to obtain the HTML, then pass it to gawk. In gawk I used the match() function, and it works, but it only returns a small number of links, maybe 10 at most. If I test my regex on regex101.com, it returns 51 links using the g (global) modifier. How can I use this in gawk to obtain all the links (relative and absolute)?
#!/bin/bash
html=$(curl -L "http://google.com")
echo "${html}" | gawk '
BEGIN {
RS=" "
IGNORECASE=1
}
{
match($0, /href=\"([^\"]*)/, array);
if (length(array[1]) > 0) {
print array[1];
}
}'
Instead of awk you can also use grep -oP:
curl -sL "http://google.com" | grep -iPo 'href="\K[^"]+'
However, this also fetches only 31 links for me. This may vary with your browser, because google.com serves a different page for different locations and signed-in users.
match() only finds the leftmost match, so you need to update the line after each match.
Try
curl -sL "http://google.com" | gawk '{while(match($0, /href=\"([^\"]+)/, array)){
$0=substr($0,RSTART+RLENGTH);print array[1]}}'
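The same loop can be written without the gawk-only third argument to match(), using just RSTART and RLENGTH, which should also run under other awk implementations. This is a sketch on a small made-up HTML string (the html and links names are illustrative), so it can be tried offline:

```shell
# Hypothetical sample input with several links on one line.
html='<a href="/a">A</a> <a href="https://example.com/b">B</a><a href="/c">C</a>'

links=$(printf '%s\n' "$html" | awk '{
  line = $0
  # match() reports only the leftmost match, so print it and then cut the
  # line just past that match before searching again.
  while (match(line, /href="[^"]+/)) {
    # Skip the 6 characters of href=" to keep only the link itself.
    print substr(line, RSTART + 6, RLENGTH - 6)
    line = substr(line, RSTART + RLENGTH)
  }
}')

printf '%s\n' "$links"
```

Note that the script in the question also sets RS=" ", which splits the input into records at every space; combined with a single match() call per record, any record holding more than one href loses all but the first.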
I need help with a command to handle the following situation.
I searched the net for help but couldn't find a solution.
The problem is:
I have an XML file named "temp.xml".
The text in the XML file is, for example:
<?xml version="1.0" encoding="utf-8"?>
<NoteData Note_Nbr="312" Data_Revision="2" Note_Revision="1" />
I want to save only the value of Note_Nbr (312) to my variable x.
I tried something, but it doesn't work.
X=$(sed -r 's/[^|]*(Note_Nbr=")(\d\d\d)([^|]*)/2 /' temp.xml )
Thank you for your help.
The right way to do this is with a real XML parser:
x=$(xmllint --xpath 'string(/NoteData/@Note_Nbr)' test.xml)
...or, if you have XMLStarlet rather than a new enough xmllint:
x=$(xmlstarlet sel -t -m '/NoteData' -v @Note_Nbr -n <test.xml)
See also this answer: https://stackoverflow.com/a/1732454/14122
Now, if you only wanted to work with literal strings, you could build something fragile that looked like this, using parameter expansion:
s='<NoteData Note_Nbr="312" Data_Revision="2" Note_Revision="1" />'
s=${s#* Note_Nbr=\"}; s=${s%%\"*}; echo "$s"
Alternately, you could use native regular expression support within bash (note that this functionality is a bash extension not present in POSIX sh):
s='<NoteData Note_Nbr="312" Data_Revision="2" Note_Revision="1" />'
re='Note_Nbr="([^"]+)"'
if [[ $s =~ $re ]]; then
match="${BASH_REMATCH[1]}"
else
echo "ERROR: No match found" >&2
fi
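As for why the sed attempt in the question fails: sed does not understand \d, the replacement 2 is a literal "2" rather than the backreference \2, and without -n plus the p flag every line is printed whether it matched or not. If a sed one-liner is still wanted, something like the following sketch should work on the sample file. It is just as fragile as the string-based approaches above, and the temporary file is only there to make the example self-contained.

```shell
# Recreate the sample file from the question in a temporary location.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<?xml version="1.0" encoding="utf-8"?>
<NoteData Note_Nbr="312" Data_Revision="2" Note_Revision="1" />
EOF

# -n suppresses default output; the p flag prints only lines where the
# substitution succeeded. \([0-9][0-9]*\) captures the digits and \1
# replaces the whole line with them.
x=$(sed -n 's/.*Note_Nbr="\([0-9][0-9]*\)".*/\1/p' "$tmp")
rm -f "$tmp"

printf '%s\n' "$x"
```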