Extract URL from a string with regex in shell script

I need to extract a URL that is wrapped in <strong> tags. It's a simple regular expression, but I don't know how to do that in a shell script. Here is an example:
line="<strong>http://www.example.com/index.php</strong>"
url=$(echo $line | sed -n '/strong>(http:\/\/.+)<\/strong/p')
I need "http://www.example.com/index.php" in the $url variable.
Using busybox.

This might work:
url=$(echo $line | sed -r 's/<strong>([^<]+)<\/strong>/\1/')

url=$(echo $line | sed -n 's!<strong>\(http://[^<]*\)</strong>!\1!p')

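You don't have to escape forward slashes with backslashes if you pick a different delimiter for the sed command (such as ! or #); with / as the delimiter they do need escaping. You should also avoid greedy matching so that you don't get more than you want when there are multiple strong tags in the HTML source code. In Perl-compatible regexes that is the non-greedy ? modifier shown below; sed's POSIX regexes don't support it, so use a negated class such as [^<]* there.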
strong>(http://.+?)</strong
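As far as I know, the POSIX regexes used by sed have no non-greedy quantifier, so a negated character class is the usual substitute. A minimal sketch of the difference, assuming a made-up input line with two strong tags (the sample URLs are hypothetical):

```shell
#!/bin/sh
# Hypothetical input with two <strong> elements on one line.
line='<strong>http://www.example.com/a.php</strong> and <strong>http://www.example.com/b.php</strong>'

# Greedy .* runs on to the LAST </strong>, so the capture spans both tags.
greedy=$(printf '%s\n' "$line" | sed -n 's!<strong>\(http://.*\)</strong>!\1!p')

# [^<]* cannot cross a "<", so the capture stops at the first closing tag.
# The trailing .* discards whatever follows the first match.
first=$(printf '%s\n' "$line" | sed -n 's!<strong>\(http://[^<]*\)</strong>.*!\1!p')

printf 'greedy: %s\n' "$greedy"
printf 'first:  %s\n' "$first"
```

Only the second command yields a single clean URL; the first one keeps the markup between the two tags.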

Update: as busybox uses ash, the solution assuming bash features likely won't work. Something only a little longer but still POSIX-compliant will work:
url=${line#<strong>} # $line minus the initial "<strong>"
url=${url%</strong>} # Remove the trailing "</strong>"
If you are using bash (or another shell with similar features), you can combine extended pattern matching with parameter substitution. (I don't know what features busybox supports.)
# Turn on extended pattern support
shopt -s extglob
# ?(\/) matches an optional forward slash; like /? in a regex
# Expand $line, but remove all occurrences of <strong> or </strong>
# from the expansion
url=${line//<?(\/)strong>}
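For busybox, the two POSIX expansions can be wrapped in a check that the <strong> wrapper is really there. A minimal sketch; the case guard and the empty fallback are my own additions, not part of the answer:

```shell
#!/bin/sh
line='<strong>http://www.example.com/index.php</strong>'

case $line in
  '<strong>'*'</strong>')
    url=${line#'<strong>'}    # strip the leading <strong>
    url=${url%'</strong>'}    # strip the trailing </strong>
    ;;
  *)
    url=''                    # wrapper not found; leave url empty
    ;;
esac

printf '%s\n' "$url"
```

This uses no external process at all, which can matter in a loop on a small busybox system.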

Related

use regular expressions to identify html form action tags

I am trying to sed -i to update all my html forms for url shortening. Basically I need to delete the .php from all the action="..." tags in my html forms.
But I am stuck at just identifying these instances. I am trying this testfile:
action = "yo.php"
action = 'test.php'
action='test.php'
action="upup.php"
And I am using this expression:
grep -R "action\s?=\s?(.*)php(\"|\')" testfile
And grep returns nothing at all.
I've tried a bunch of variations, and I can see that even the \s? isn't working because just this grep command also returns nothing:
grep -R "action\s?=\s?" testfile
grep -R "action\\s?=\\s?" testfile
(the latter I tried thinking maybe I had to escape the \ in \s).
Can someone tell me what's wrong with these commands?
Edit:
Fix 1 - apparently I need to escape the question mark in \s? so that it is treated as "optional character" rather than as a literal question mark.
The way you're using it, grep accepts basic POSIX regex syntax. The single quote does not need to be escaped in it [1], but some of the metacharacters you use do: in particular, ?, (), and |. You can use
grep -R "action\s\?=\s\?\(.*\)php\(\"\|'\)" testfile
I recommend, however, that you use extended posix regex syntax by giving grep the -E flag:
grep -E -R "action\s?=\s?(.*)php(\"|')" testfile
As you can see, that makes the whole thing much more readable.
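To see the difference between the two syntaxes, here is a small sketch that counts matching lines in a copy of the test file from the question. I use [[:space:]] instead of \s because \s is a GNU extension; the temporary file name comes from mktemp and is otherwise arbitrary:

```shell
#!/bin/sh
testfile=$(mktemp)
cat > "$testfile" <<'EOF'
action = "yo.php"
action = 'test.php'
action='test.php'
action="upup.php"
EOF

# Basic syntax: an unescaped ? is a literal question mark, so nothing matches.
# grep -c exits non-zero on zero matches, hence the "|| true".
bre_count=$(grep -c 'action[[:space:]]?=' "$testfile" || true)

# Extended syntax: ?, ( ) and | act as operators without backslashes.
ere_count=$(grep -E -c "action[[:space:]]?=[[:space:]]?(.*)php(\"|')" "$testfile" || true)

echo "basic: $bre_count, extended: $ere_count"
rm -f "$testfile"
```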
Addendum: To remove the .php extension from all action attributes in a file, you could use
sed -i 's/\(action\s*=\s*["'\''][^"'\'']*\)\.php\(["'\'']\)/\1\2/g' testfile
Shell strings make this look scarier than it is; the sed code is simply
s/\(action\s*=\s*["'][^"']*\)\.php\(["']\)/\1\2/g
I amended the regex slightly so that in a line action='foo.php' somethingelse='bar.php' the right .php would be removed. I tried to make this as safe as I can, but be aware that handling HTML with sed is always hacky.
Combine this with find and its -exec filter to handle a whole directory.
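A rough sketch of the find/-exec combination, run against a scratch directory so nothing real is touched. It assumes GNU sed (BSD/macOS sed wants -i ''), swaps \s for the more portable [[:space:]], and the file names and form markup are invented for the demonstration:

```shell
#!/bin/sh
# Assumes GNU sed, where -i takes no argument; BSD/macOS sed needs -i ''.
workdir=$(mktemp -d)   # hypothetical scratch directory
printf '%s\n' '<form action="login.php" method="post">' > "$workdir/a.html"
printf '%s\n' "<form action='save.php' data-next='keep.php'>" > "$workdir/b.html"

# Strip .php only from the value of action attributes, in every .html file.
find "$workdir" -type f -name '*.html' -exec \
  sed -i 's/\(action[[:space:]]*=[[:space:]]*["'\''][^"'\'']*\)\.php\(["'\'']\)/\1\2/g' {} +

result_a=$(cat "$workdir/a.html")
result_b=$(cat "$workdir/b.html")
printf '%s\n%s\n' "$result_a" "$result_b"
rm -rf "$workdir"
```

Trying it on a copy first, as here, is a good idea before pointing it at the real directory.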
[1] The double quote needs to be escaped only because you use a double-quoted shell string, not because the regex requires it.
Another option is the -P flag, which makes GNU grep use Perl-compatible regexes:
$ grep -P "action\s?=\s?(.*)php(\"|\')" test
action = "yo.php"
action = 'test.php'
action='test.php'
action="upup.php"
try this unescaped plain regex, which only selects text within quotes:
action\s?=\s?["'](.*)\.php["']
you can fiddle around here:
https://regex101.com/r/lN8iG0/1
so on command line this would be:
grep -P "action\s?=\s?[\"'](.*)\.php[\"']" test

Sed regex not matching 'either or' inner group

I would like to match multiple file extensions passed through a pipe using sed and regex.
The following works:
sed '/.\(rb\)\$/!d'
But if I want to allow multiple file extensions, the following does not work.
sed '/.\(rb\|js\)\$/!d'
sed '/.\(rb|js\)\$/!d'
sed '/.(rb|js)\$/!d'
Any ideas on how to do either/or inner groups?
Here is the whole block of code:
#!/bin/sh
files=`git diff-index --check --cached $against | # Find all changed files
sed '/.\(rb\|js\)\$/!d' | # Only process .rb and .js files
uniq` # Remove duplicate files
I am using Mac OS X 10.8.3, and the previous answer does not work for me, but this does:
sed -E '/\.(rb|js)$/!d'
Note: use -E to "Interpret regular expressions as extended (modern) regular expressions rather than basic regular expressions (BRE's)." This enables the OR operator |; other versions of sed seem to want the -r flag to enable extended regular expressions.
Note that the initial . must be escaped and the trailing $ must not be.
Try something like this:
sed '/\.\(rb\|js\)$/!d'
or, if your sed supports it, use the -r option to enable extended regular expressions and avoid escaping the special characters.
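Here is a quick sketch of why the dot has to be escaped, using a made-up file list in place of the git output. It assumes a sed that accepts -E (both BSD sed and current GNU sed do):

```shell
#!/bin/sh
# Hypothetical file list; "xrb" has no dot before the extension letters.
filtered=$(printf '%s\n' app.rb main.js README.md xrb | sed -E '/\.(rb|js)$/!d')
printf 'escaped dot:\n%s\n' "$filtered"

# With an unescaped dot, "." matches any character, so "xrb" slips through.
loose=$(printf '%s\n' app.rb main.js README.md xrb | sed -E '/.(rb|js)$/!d')
printf 'unescaped dot:\n%s\n' "$loose"
```

The first filter keeps only app.rb and main.js; the second also keeps xrb.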

What's wrong with this shell/sed script?

I have about 150 HTML files in a given directory that I'd like to make some changes to. Some of the anchor tags have an href along the following lines: index.php?page=something. I'd like all of those to be changed to something.html. Simple regex, simple script. I can't seem to get it correct, though. Can somebody weigh in on what I'm doing wrong?
Sample html, before and after output:
<!-- Before -->
<ul>
<li>Apple</li>
<li>Dandelion</li>
<li>Elephant</li>
<li>Resonate</li>
</ul>
<!-- After -->
<ul>
<li>Apple</li>
<li>Dandelion</li>
<li>Elephant</li>
<li>Resonate</li>
</ul>
Script file:
#! /bin/bash
for f in *.html
do
sed s/\"index\.php?page=\([.]*\)\"/\1\.html/g < $f >! $f
done
It's your regex, and the fact that the shell is trying to interpret bits of your regex.
First - the [.]* matches any number of literal dots (.). Change it to .*.
Secondly, enclose the entire regex in single quotes ' to prevent the bash shell from interpreting any of it.
sed 's/"index\.php?page=\(.*\)"/\1\.html/g'
Also, instead of < $f >! $f you can just feed in the '-i' switch to sed to have it operate in-place:
sed -i 's/"index\.php?page=\(.*\)"/"\1\.html"/g' "$f"
(Also, as another point I think in your replacement you want double quotes around the \1.html so that the new URL is quoted within the HTML. I also quoted your $f to "$f", because if the file name contains spaces bash will complain).
EDIT: as @TimPote notes, the standard way to match something within quotes is either ".*?" (so that the .* is non-greedy) or "[^"]+". Sed doesn't support the former, so try:
sed -i 's/"index\.php?page=\([^"]\+\)"/"\1\.html"/g' "$f"
This is to prevent (for example) "asdf" from being turned into "asdf.html" (where the (.*) captured asdf">"asdf, being greedy).
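To make the greediness problem concrete, here is a small sketch on one line that holds two links. The sample markup is invented, and I write [^"][^"]* instead of [^"]\+ only because \+ is a GNU extension:

```shell
#!/bin/sh
# Hypothetical line holding two links.
line='<a href="index.php?page=apple">Apple</a> <a href="index.php?page=pear">Pear</a>'

# Greedy: .* runs to the last double quote on the line.
greedy=$(printf '%s\n' "$line" | sed 's/"index\.php?page=\(.*\)"/"\1.html"/g')

# Negated class: [^"] cannot cross a quote, so each href is handled on its own.
fixed=$(printf '%s\n' "$line" | sed 's/"index\.php?page=\([^"][^"]*\)"/"\1.html"/g')

printf 'greedy: %s\n' "$greedy"
printf 'fixed:  %s\n' "$fixed"
```

The greedy version rewrites only the tail of the second link and mangles the first; the negated class turns both into apple.html and pear.html.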
Your .* was too greedy. Use [^"]\+ instead. Plus your quotes were all messed up. Surround the whole thing with single quotes instead, then you can use " without escaping them.
sed -i 's/"index\.php?page=\([^"]\+\)"/"\1\.html"/g'
You can do this whole operation with a single statement using find:
find . -maxdepth 1 -type f -name '*.html' \
-exec sed -i 's/"index\.php?page=\([^"]\+\)"/"\1\.html"/g' {} \+
The following works:
sed "s/\"index\.php?page=\(.*\)\"/\"\1.html\"/g" < 1.html
I think it was mostly the square brackets. Not sure why you had them.
Oh, and the entire sed command needs to be in quotes.

Add a prefix to all media links in a html file

I'm trying to insert an absolute path before all images in an HTML file, like this:
<img src="/media/some_path/some_image.png"> to <img src="{ABS_PATH}/some_path/some_image.png">
I tried the following regex to identify the lines :
egrep '(src|href)="/media([^"]*)"'
I want to use sed to make these changes, but the above regexp doesn't work, any hints?
sed 's#(src|href)="/media([^"]*)"##g'
sed: -e expression #1, char 32: unknown option to `s'
EDIT:
ok, now i have:
echo 'src="/media/some_image.png"' | "egrep -o '(src|href)="/media([^"]*)"' | sed 's/(src|href)=\"\/media([^"]*)\"//g'
Sed should match the string, but it doesn't
sed doesn't understand ERE (extended regular expressions) by default, only BRE (basic regular expressions). GNU sed has the -r option, which turns on ERE.
You should change delimiters for regular expressions, because you have slash in the regex, like this:
sed -r 's#(src|href)="/media([^"]*)"##g'
You can use almost any punctuation for delimiters.
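The command above replaces the match with nothing. To actually swap /media for an absolute path, something along these lines might do. It is only a sketch: abs_path is a placeholder value I made up, it must not contain #, & or backslashes, and -E is used as the more portable spelling of -r:

```shell
#!/bin/sh
abs_path='/var/www/static'   # hypothetical prefix; must not contain #, & or backslashes
line='<img src="/media/some_path/some_image.png"> <a href="/media/doc.pdf">doc</a>'

# \1 keeps the attribute name, \2 keeps the rest of the path after /media.
result=$(printf '%s\n' "$line" |
  sed -E "s#(src|href)=\"/media([^\"]*)\"#\1=\"${abs_path}\2\"#g")

printf '%s\n' "$result"
```

The script is in double quotes so that the shell expands abs_path, which is why the inner double quotes are backslash-escaped.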
You must escape / in sed if using it as a delimiter for the pattern.
So:
sed 's/(src|href)="/media([^"]*)"//g'
becomes:
sed 's/(src|href)="\/media([^"]*)"//g'
Perhaps what is confusing is that egrep (which uses extended regular expressions) has different rules from sed and vanilla grep (which use basic regular expressions) when it comes to what must be escaped.

put regular expression in variable

output=`grep -R -l "${images}" *`
new_output=`regex "slide[0-9]" $output`
Basically $output is a string like this:
slides/_rels/slide9.xml.rels
The number in $output will change. I want to grab "slide9" and put that in a variable. I was hoping new_output would do that but I get a command not found for using regex. Any other options? I'm using a bash shell script.
Well, regex is not a program like grep. ;)
But you can use
grep -Eo "(slide[0-9]+)"
as a simple approach. -o means: show only the matching part, -E means: extended regex (allows more sophisticated patterns).
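Put together with the variable from the question, that could look like the sketch below. The sample path is the one from the question; the head -n 1 is my addition, in case the input contains more than one match:

```shell
#!/bin/sh
output='slides/_rels/slide9.xml.rels'   # sample value from the question

# -o prints only the matched text; head keeps the first hit if there are several.
new_output=$(printf '%s\n' "$output" | grep -Eo 'slide[0-9]+' | head -n 1)

printf '%s\n' "$new_output"
```

Note that the "slides" directory name is not matched, because the pattern requires a digit right after "slide".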
Reading "I want to grab "slide9" and put that in a variable", I assume you want what matches your regexp to be the only thing put in $new_output. If so, you can change that to:
new_output=`egrep -R -l "${images}" * | sed 's/.*\(slide[0-9][0-9]*\).*/\1/'`
Note that no setting of output= is required (unless you use it for something else).
If you need $output elsewhere, then instead use:
output=`grep -R -l "${images}" *`
new_output=`echo "${output}" | sed 's/.*\(slide[0-9][0-9]*\).*/\1/'`
sed's s/// command is similar to Perl's s/// operator and has an equivalent in most languages.
Here I'm matching zero or more characters .* before and after the slide pattern and remembering (capturing) the result with \( ... \) (the parentheses may or may not need to be escaped depending on the version of sed and its options). The pattern is written slide[0-9][0-9]* rather than slide[0-9]+ because in basic regular expressions an unescaped + is a literal plus sign. We then replace the whole match (i.e. the whole line) with \1, which expands to the first captured group, in this case the slide[0-9][0-9]* match.
In these situations using awk is better :
output="`grep -R -l "main" codes`"
echo $output
tout=`echo $output | awk -F. '{for(i=1;i<=NF;i++){if(index($i,"/")>0){n=split($i,ar,"/");print ar[n];}}}'`
echo $tout
This prints the filename without the extension. If you want to grab only slide9, then use the solutions provided by others.
Sample output :
A@A-laptop ~ $ bash try.sh
codes/quicksort_iterative.cpp codes/graham_scan.cpp codes/a.out
quicksort_iterative graham_scan a