Regex not working in bash

My regex:
(<property\sname=\"uri\"\s[value=\"htttp:\/\/]+(\d+\.)+\d+)+
Text sample:
<bean id="journeyWSClient" parent="abstractClient" class=" lib.JourneyWSClient">
<property name="uri" value="http://192.24.342.432:20010/some/path/1_0_0"/>
<!-- Fortuna -->
<!-- property name="uri" value="http://164.7.19.11:20010/some/path/1_0_0"/ -->
It works on http://regexr.com/, however when I put the regex in bash script, it doesn't work. Are there some characters I need to escape? Ideas?
Bonus cookie for extracting the IP with only one regex.

Bash's =~ operator uses POSIX extended regular expressions, which do not define \d or \s. Replace the \d with [0-9] and \s with [[:space:]], and adjust the IP matching part to ([0-9]+(\.[0-9]+)+) (or simplify it to ([0-9.]+)) so that you can get its value with ${BASH_REMATCH[1]}:
text='<property name="uri" value="http://192.49.200.142:20010/some/path/1_0_0"/>'
regex='<property[[:space:]]name="uri"[[:space:]]value="http://([0-9]+(\.[0-9]+)+)'
if [[ $text =~ $regex ]]; then
echo "${BASH_REMATCH[1]}"
fi
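A minimal sketch of the same idea applied to several lines at once, assuming the question's text sample is held in a shell variable (the names sample and ips are illustrative, not from the original post). Because the pattern requires a literal <property, the commented-out line is skipped:

```shell
#!/usr/bin/env bash
# Two lines from the question's text sample; the second one is commented out in the XML.
sample='<property name="uri" value="http://192.24.342.432:20010/some/path/1_0_0"/>
<!-- property name="uri" value="http://164.7.19.11:20010/some/path/1_0_0"/ -->'

# POSIX ERE only: [[:space:]] instead of \s, [0-9] instead of \d.
regex='<property[[:space:]]+name="uri"[[:space:]]+value="http://([0-9.]+)'

ips=()
while IFS= read -r line; do
  if [[ $line =~ $regex ]]; then
    ips+=("${BASH_REMATCH[1]}")   # group 1 holds the IP
  fi
done <<< "$sample"

printf '%s\n' "${ips[@]}"
```

Storing the pattern in a variable and expanding it unquoted on the right of =~ keeps bash from treating parts of it as a literal string.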

See if this works for you:
value="http://(([0-9]\.?)+):
This worked for me when the IP only appears on lines that contain http:
grep http TEXT_SAMPLE_FILE | sed -E 's;.*value="http://(([0-9]\.?)+):.*;\1;g'
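As a hedged variant of the pipeline above, sed alone can both filter and extract when it runs with -n and the p flag. This sketch assumes the sample text is in a variable named sample (an illustrative name) and anchors on <property so that the commented-out line produces no output:

```shell
# Two lines from the question's text sample.
sample='<property name="uri" value="http://192.24.342.432:20010/some/path/1_0_0"/>
<!-- property name="uri" value="http://164.7.19.11:20010/some/path/1_0_0"/ -->'

# -n suppresses default output; the trailing p prints only lines where the substitution succeeded.
ip=$(printf '%s\n' "$sample" |
  sed -nE 's;.*<property name="uri" value="http://([0-9.]+):.*;\1;p')

printf '%s\n' "$ip"
```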

Related

How to select an image URL using Regex for Grep in bash script?

I have a text file, where I need to select URLs for images using bash script.
An example of a line from the text file:
<icon height="36" width="36" density="ldpi" src="res/icon/android/ldpi.png"/>
I wrote the following script using Regex:
echo $line | grep -E -o "[^\"\'=\s]+\.(jpe?g|png|gif)"
An output shows: /icon/android/ldpi.png
However, I need: res/icon/android/ldpi.png
Can anyone help to fix the problem and make the right output like res/icon/android/ldpi.png ?
Thank you in advance! 😊
Here, \s inside a bracket expression is not treated by grep as a whitespace shorthand; it stands for a literal \ and a literal s. Since res contains an s, the match cannot include it and starts right after it.
You may try
grep -Eo "[^\"'=[:space:]]+\.(jpe?g|png|gif)" <<< "$line"
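To see the difference between the two bracket expressions side by side, here is a small sketch using the line from the question. For readability the single quote is left out of both character classes, which is a simplification of the original patterns:

```shell
line='<icon height="36" width="36" density="ldpi" src="res/icon/android/ldpi.png"/>'

# \s inside [...] means "backslash or s", so the s in "res" cannot be part of the match.
bad=$(printf '%s\n' "$line" | grep -Eo '[^"=\s]+\.(jpe?g|png|gif)')

# [:space:] is the POSIX character class for whitespace and is valid inside [...].
good=$(printf '%s\n' "$line" | grep -Eo '[^"=[:space:]]+\.(jpe?g|png|gif)')

printf 'bad:  %s\ngood: %s\n' "$bad" "$good"
```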
Or, use Bash regex matching:
rx="src=([\"'])([^\"']+\.(jpe?g|png|gif))\1"
if [[ "$line" =~ $rx ]]; then
echo "${BASH_REMATCH[2]}";
fi;
The src=([\"'])([^\"']+\.(jpe?g|png|gif))\1 pattern matches
src= - a literal substring
([\"']) - Group 1: a " or '
([^\"']+\.(jpe?g|png|gif)) - Group 2 (this value is accessible via "${BASH_REMATCH[2]}" after a match is found): one or more characters other than " and ', followed by . and one of jpeg, jpg, png, or gif
\1 - Group 1 value.
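POSIX extended regular expressions do not define backreferences, so \1 in a Bash regex depends on an extension of the system's regex library. A sketch of a variant without the backreference follows; it accepts either quote on both sides without checking that they match, and the second sample line and the names rx and paths are made up for illustration:

```shell
#!/usr/bin/env bash
# Same idea as the pattern above, but with a plain ["'] class instead of \1.
rx="src=[\"']([^\"']+\.(jpe?g|png|gif))[\"']"

paths=()
for line in '<icon src="res/icon/android/ldpi.png"/>' "<icon src='res/icon/ios/icon.jpg'/>"; do
  if [[ $line =~ $rx ]]; then
    paths+=("${BASH_REMATCH[1]}")   # the path is now group 1, not group 2
  fi
done

printf '%s\n' "${paths[@]}"
```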
If you can use awk, you can do:
awk -F'src=' 'NF>1 {split($2,a,"\"");print a[2]}' file
res/icon/android/ldpi.png
It will print all data between "" if it comes after src=
If it should only match jpg/gif/png extensions:
awk -F'src=' 'NF>1 && $2~/\.(jpe?g|png|gif)/ {split($2,a,"\"");print a[2]}' file
res/icon/android/ldpi.png

How to use sed to fix an xml issue

I have an xml with the following (invalid) structure
<tag1>text1<tag2>text2</tag1><tag3>text3</tag3><tag1></tag2>text4</tag1>
I want to use sed to change it into
<tag1>text1<tag2>text2<tag3>text3</tag3></tag2>text4</tag1>
i.e. I want to remove </tag1>...<tag1> (and move everything in between under the enclosing tag1), if I encounter an invalid xml substring as <tag1></*
I have tried using sed without success (one such attempt is below)
sed -e 's/<\/tag1>\(.*\)<tag1><\//\1<\//g'
It does work with the example above, but if I have two occurrences of the same condition, it just removes the first </tag1> and the last <tag1> instead of performing the replacement twice
echo '<tag1>text1<tag2>text2</tag1><tag3>text3</tag3><tag1></tag2>text4</tag1><tag1>text5<tag4>text6</tag1><tag3>text7</tag3><tag1></tag4>text8</tag1>' | sed -e 's/<\/tag1>\(.*\)<tag1><\//\1<\//g'
outputs
<tag1>text1<tag2>text2<tag3>text3</tag3><tag1></tag2>text4</tag1><tag1>text5<tag4>text6</tag1><tag3>text7</tag3></tag4>text8</tag1>
I think sed just expands the RE to cover the largest possible selection, but what should I do if I do not want it to do that?
You want non-greedy matching, but to the best of my knowledge, sed doesn't support it. Can you use perl or do you have to use sed?
Try: perl -p -e 's/<\/tag1>(.*?)<tag1>(<\/.+?<\/tag1>)/$1$2/g'
I think the issue is that the regex has to match through to the end of the actual closing </tag1>, or else that closing tag becomes the beginning of the next match.
sed 's|</tag1><tag3>|<tag3>|;s|</tag3><tag1>|</tag3>|' file.xml
Output:
<tag1>text1<tag2>text2<tag3>text3</tag3></tag2>text4</tag1>
This might work for you (GNU sed):
sed -r 's/<tag1>/\n/g;s/<\/tag1>(<tag3>[^\n]*)\n/\1/g;s/\n/<tag1>/g' file
Reduce <tag1> to a unique character, i.e. \n, then use the negated character class [^\n] to emulate non-greedy matching. After the changes, reverse the initial substitution.
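The newline trick relies on the fact that sed reads one line at a time, so the pattern space never contains \n on its own. A sketch applying the same three substitutions to the two-occurrence input from the question, assuming GNU sed (other seds may not accept \n in these positions):

```shell
input='<tag1>text1<tag2>text2</tag1><tag3>text3</tag3><tag1></tag2>text4</tag1><tag1>text5<tag4>text6</tag1><tag3>text7</tag3><tag1></tag4>text8</tag1>'

# 1) replace every <tag1> with a newline
# 2) drop a </tag1> that is followed by <tag3>...newline; [^\n]* cannot run past the next marker
# 3) turn the remaining newlines back into <tag1>
fixed=$(printf '%s\n' "$input" |
  sed -E 's/<tag1>/\n/g;s/<\/tag1>(<tag3>[^\n]*)\n/\1/g;s/\n/<tag1>/g')

printf '%s\n' "$fixed"
```

Note that the middle substitution hard-codes <tag3>, exactly as the command above does, so other intervening tags would need their own pattern.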
GNU sed
sed '\,<tag1></,{ s,</tag1>,,; s,<tag1>,,2; }' <<END
<tag1>text1<tag2>text2</tag1><tag3>text3</tag3><tag1></tag2>text4</tag1> <!-- error case -->
<tag1><tag2 /></tag1><tag1><tag3 /></tag1> <!-- should not change -->
END
<tag1>text1<tag2>text2<tag3>text3</tag3></tag2>text4</tag1> <!-- error case -->
<tag1><tag2 /></tag1><tag1><tag3 /></tag1> <!-- should not change -->
If the string <tag1></ is seen, then remove the first </tag1> and the second <tag1>

How to use regex with grep

I just used the following grep command:
grep -ri '^(<topicref |<mapref).*( )(dest=")'
to match the following:
<topicref version="1" dest="susu"/>
<mapref id="" dest="summat"/>
all topicref and mapref that have a dest attribute.
However, it didn't work although regexpal accepts the regex. How do I have to change this to work with grep?
By default grep uses basic regular expressions, in which ( ) and | are literal characters. If you would like to use grouping and alternation without switching to extended regular expressions, escape them with a backslash to enable this functionality (\| is a GNU extension).
grep -ir '^\(<topicref \|<mapref\).*\( \)\(dest="\)' .
Or, you can use the -E option, and then you do not have to escape the parentheses:
grep -iEr '^(<topicref |<mapref).*( )(dest=")' .
Note that the . at the end stands for the current directory; together with the -r (recursive) option, grep searches every file under that directory for matches.
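The three behaviours can be compared on standard input instead of a directory tree. This is a sketch assuming GNU grep; the third sample line (a topicref without dest) is made up to show a non-match, and the capture groups around the space and dest=" are dropped because they are not needed for matching:

```shell
sample='<topicref version="1" dest="susu"/>
<mapref id="" dest="summat"/>
<topicref version="1"/>'

# Basic regex with unescaped ( | ): they are literal characters, so nothing matches.
plain_count=$(printf '%s\n' "$sample" | grep -ci '^(<topicref |<mapref).* dest="') || true

# Basic regex with escaped grouping and alternation.
bre_count=$(printf '%s\n' "$sample" | grep -ci '^\(<topicref \|<mapref\).* dest="')

# Extended regex via -E.
ere_count=$(printf '%s\n' "$sample" | grep -ciE '^(<topicref |<mapref).* dest="')

printf 'plain=%s bre=%s ere=%s\n' "$plain_count" "$bre_count" "$ere_count"
```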

Extract url from a string with regex in shell script

I need to extract a URL that is wrapped with <strong> tags. It's a simple regular expression, but I don't know how to do that in shell script. Here is example:
line="<strong>http://www.example.com/index.php</strong>"
url=$(echo $line | sed -n '/strong>(http:\/\/.+)<\/strong/p')
I need "http://www.example.com/index.php" in the $url variable.
Using busybox.
This might work:
url=$(echo $line | sed -r 's/<strong>([^<]+)<\/strong>/\1/')
url=$(echo $line | sed -n 's!<strong>\(http://[^<]*\)</strong>!\1!p')
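You don't have to escape forward slashes with backslashes if you pick a different sed delimiter (such as ! above); they only need escaping when / is the delimiter. To avoid matching too much when there are multiple strong tags in the HTML source, a Perl-compatible engine can use non-greedy matching with the ? operator, as in the pattern below; sed and other POSIX tools do not support it, so use a negated class such as [^<]* there instead.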
strong>(http://.+?)</strong
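Since sed has no non-greedy quantifier, a negated character class is the usual substitute. The sketch below contrasts the two on a made-up line with two strong tags (the URLs and variable names are illustrative):

```shell
line='<strong>http://a.example/</strong> and <strong>http://b.example/</strong>'

# Greedy .+ runs to the last </strong> on the line.
greedy=$(printf '%s\n' "$line" | sed -E 's!<strong>(http://.+)</strong>!\1!')

# [^<]+ stops at the first <, so only the first URL is captured; the trailing .* discards the rest.
first=$(printf '%s\n' "$line" | sed -E 's!<strong>(http://[^<]+)</strong>.*!\1!')

printf 'greedy: %s\nfirst:  %s\n' "$greedy" "$first"
```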
Update: as busybox uses ash, the solution assuming bash features likely won't work. Something only a little longer but still POSIX-compliant will work:
url=${line#<strong>} # $line minus the initial "<strong>"
url=${url%</strong>} # Remove the trailing "</strong>"
If you are using bash (or another shell with similar features), you can combine extended pattern matching with parameter substitution. (I don't know what features busybox supports.)
# Turn on extended pattern support
shopt -s extglob
# ?(\/) matches an optional forward slash; like /? in a regex
# Expand $line, but remove all occurrences of <strong> or </strong>
# from the expansion
url=${line//<?(\/)strong>}

What's wrong with this shell/sed script?

I have about 150 HTML files in a given directory that I'd like to make some changes to. Some of the anchor tags have an href along the following lines: index.php?page=something. I'd like all of those to be changed to something.html. Simple regex, simple script. I can't seem to get it correct, though. Can somebody weigh in on what I'm doing wrong?
Sample html, before and after output:
<!-- Before -->
<ul>
<li>Apple</li>
<li>Dandelion</li>
<li>Elephant</li>
<li>Resonate</li>
</ul>
<!-- After -->
<ul>
<li>Apple</li>
<li>Dandelion</li>
<li>Elephant</li>
<li>Resonate</li>
</ul>
Script file:
#! /bin/bash
for f in *.html
do
sed s/\"index\.php?page=\([.]*\)\"/\1\.html/g < $f >! $f
done
It's your regex, and the fact that the shell is trying to interpret bits of your regex.
First, [.]* matches any number of literal dots (.). Change it to .*.
Secondly, enclose the entire regex in single quotes ' to prevent the bash shell from interpreting any of it.
sed 's/"index\.php?page=\(.*\)"/\1\.html/g'
Also, instead of < $f >! $f (redirecting output onto the input file truncates it before sed reads it), you can pass the '-i' switch to sed to have it operate in-place:
sed -i 's/"index\.php?page=\(.*\)"/"\1\.html"/g' "$f"
(Also, as another point, I think in your replacement you want double quotes around the \1.html so that the new URL is quoted within the HTML. I also quoted your $f as "$f", because an unquoted file name containing spaces is split into several words.)
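Where sed -i is not available, writing to a temporary file and moving it over the original is a common alternative. A self-contained sketch, assuming mktemp exists and using a throwaway directory with one made-up HTML file:

```shell
# Set up a scratch directory with a sample file (illustrative content).
workdir=$(mktemp -d)
printf '%s\n' '<a href="index.php?page=apple">Apple</a>' > "$workdir/a.html"

for f in "$workdir"/*.html; do
  tmp=$(mktemp)
  # Write to a temp file first; only replace the original if sed succeeded.
  sed 's/"index\.php?page=\([^"]*\)"/"\1.html"/g' "$f" > "$tmp" && mv "$tmp" "$f"
done

result=$(cat "$workdir/a.html")
rm -rf "$workdir"
printf '%s\n' "$result"
```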
EDIT: as @TimPote notes, the standard way to match something within quotes is either ".*?" (so that the .* is non-greedy) or "[^"]+". Sed doesn't support the former, so try:
sed -i 's/"index\.php?page=\([^"]\+\)"/"\1\.html"/g' "$f"
This prevents the greedy \(.*\) from capturing past the href's closing quote up to the last double quote on the line (for example, capturing asdf">"asdf instead of just asdf).
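The effect is easiest to see on a line that contains two links. The sample line below is hypothetical, and \+ assumes GNU sed:

```shell
line='<a href="index.php?page=apple">Apple</a> <a href="index.php?page=pear">Pear</a>'

# Greedy: .* extends to the last double quote on the line, so only one bad rewrite happens.
greedy=$(printf '%s\n' "$line" | sed 's/"index\.php?page=\(.*\)"/"\1.html"/g')

# Negated class: each capture stops at its own closing quote, so both links are rewritten.
fixed=$(printf '%s\n' "$line" | sed 's/"index\.php?page=\([^"]\+\)"/"\1.html"/g')

printf 'greedy: %s\nfixed:  %s\n' "$greedy" "$fixed"
```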
Your .* was too greedy. Use [^"]\+ instead. Plus your quotes were all messed up. Surround the whole thing with single quotes instead, then you can use " without escaping them.
sed -i 's/"index\.php?page=\([^"]\+\)"/"\1\.html"/g' "$f"
You can do this whole operation with a single statement using find:
find . -maxdepth 1 -type f -name '*.html' \
-exec sed -i 's/"index\.php?page=\([^"]\+\)"/"\1\.html"/g' {} \+
The following works:
sed "s/\"index\.php?page=\(.*\)\"/\"\1.html\"/g" < 1.html
I think it was mostly the square brackets. Not sure why you had them.
Oh, and the entire sed command needs to be in quotes.