How to use regex with grep

I just used the following grep command:
grep -ri '^(<topicref |<mapref).*( )(dest=")'
to match the following:
<topicref version="1" dest="susu"/>
<mapref id="" dest="summat"/>
all topicref and mapref that have a dest attribute.
However, it didn't work, although RegexPal accepts the regex. How do I have to change this to work with grep?

If you would like to use parentheses and alternation without using extended regular expressions, you can escape them with the backslash to enable this functionality.
grep -ir '^\(<topicref \|<mapref\).*\( \)\(dest="\)' .
Or you can use the -E option, and then you do not have to escape the parentheses or the alternation bar:
grep -iEr '^(<topicref |<mapref).*( )(dest=")' .
Note that the . at the end stands for the current directory; together with the -r (recursive) option, this returns all the matches in the files under that directory.
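As a quick check of the two forms above, here is a minimal sketch that runs both against a scratch directory. The file name sample.xml and its third line are made up for illustration. Note that \| in a basic regular expression is a GNU extension, so the first command may behave differently with other grep implementations.

```shell
# Build a scratch directory with one sample file (names are hypothetical).
dir=$(mktemp -d)
cat > "$dir/sample.xml" <<'EOF'
<topicref version="1" dest="susu"/>
<mapref id="" dest="summat"/>
<topicref version="1" href="none"/>
EOF

# Basic regular expression: groups and alternation need backslashes (GNU grep).
# -h drops the file name prefix so only the matching lines are captured.
bre_hits=$(grep -irh '^\(<topicref \|<mapref\).*\( \)\(dest="\)' "$dir")

# Extended regular expression: -E lets ( ) and | work unescaped.
ere_hits=$(grep -iErh '^(<topicref |<mapref).*( )(dest=")' "$dir")

printf 'BRE:\n%s\nERE:\n%s\n' "$bre_hits" "$ere_hits"
rm -r "$dir"
```

Both commands should report the same two lines; the line without a dest attribute is skipped.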

Related

Sed regex not matching 'either or' inner group

I would like to match multiple file extensions passed through a pipe using sed and regex.
The following works:
sed '/.\(rb\)\$/!d'
But if I want to allow multiple file extensions, the following does not work.
sed '/.\(rb\|js\)\$/!d'
sed '/.\(rb|js\)\$/!d'
sed '/.(rb|js)\$/!d'
Any ideas on how to do either/or inner groups?
Here is the whole block of code:
#!/bin/sh
files=`git diff-index --check --cached $against | # Find all changed files
sed '/.\(rb\|js\)\$/!d' | # Only process .rb and .js files
uniq` # Remove duplicate files
I am using Mac OS X 10.8.3, and the previous answer does not work for me, but this does:
sed -E '/\.(rb|js)$/!d'
Note: use -E to
Interpret regular expressions as extended (modern) regular expressions
rather than basic regular expressions (BRE's).
and this enables the OR function |; other versions seem to want the -r flag to enable extended regular expressions.
Note that the initial . must be escaped and the trailing $ must not be.
Try something like this:
sed '/\.\(rb\|js\)$/!d'
or, if your sed supports it (GNU sed does), use the -r option to enable extended regular expressions and avoid escaping the special characters.
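To see the effect of the escaped dot and the unescaped $, here is a small sketch using the -E form from above. The function name filter_sources and the sample file names are invented for the example.

```shell
# Keep only names ending in .rb or .js; every other line is deleted.
# The dot is escaped so it matches a literal ".", and $ anchors the end of line.
filter_sources() {
  sed -E '/\.(rb|js)$/!d'
}

kept=$(printf '%s\n' app.rb main.js notes.txt herb Makefile | filter_sources)
printf '%s\n' "$kept"
```

The name herb ends in "rb" but has no dot before it, so it is dropped; with an unescaped . it would have been kept.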

Extract url from a string with regex in shell script

I need to extract a URL that is wrapped with <strong> tags. It's a simple regular expression, but I don't know how to do that in shell script. Here is example:
line="<strong>http://www.example.com/index.php</strong>"
url=$(echo $line | sed -n '/strong>(http:\/\/.+)<\/strong/p')
I need "http://www.example.com/index.php" in the $url variable.
Using busybox.
This might work:
url=$(echo $line | sed -r 's/<strong>([^<]+)<\/strong>/\1/')
url=$(echo $line | sed -n 's!<strong>\(http://[^<]*\)</strong>!\1!p')
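You don't have to escape forward slashes in the regular expression itself; with sed they only need a backslash when / is also the delimiter of the s command. If there can be several strong tags in the HTML source, also avoid greedy matching so that you don't get more than you want. In Perl-compatible engines that is done with the non-greedy ? operator, as below; sed's POSIX regular expressions do not support it, so there you would use a negated class such as [^<]* instead.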
strong>(http://.+?)</strong
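The .+? form is a Perl-style construct that sed's POSIX regular expressions do not have, so in sed the same effect is usually obtained with a negated character class. A minimal sketch, assuming a sed that accepts -E and using a made-up input line with two strong tags:

```shell
line='<strong>http://a.example/</strong> and <strong>http://b.example/</strong>'

# Greedy .+ runs to the LAST </strong>, so the capture swallows both tags.
greedy=$(printf '%s\n' "$line" | sed -E 's!<strong>(http://.+)</strong>!\1!')

# [^<]+ cannot cross a "<", so the capture stops at the first closing tag;
# the trailing .* discards the rest of the line.
first=$(printf '%s\n' "$line" | sed -E 's!<strong>(http://[^<]+)</strong>.*!\1!')

printf 'greedy: %s\nfirst:  %s\n' "$greedy" "$first"
```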
Update: as busybox uses ash, the solution assuming bash features likely won't work. Something only a little longer but still POSIX-compliant will work:
url=${line#<strong>} # $line minus the initial "<strong>"
url=${url%</strong>} # Remove the trailing "</strong>"
If you are using bash (or another shell with similar features), you can combine extended pattern matching with parameter substitution. (I don't know what features busybox supports.)
# Turn on extended pattern support
shopt -s extglob
# ?(\/) matches an optional forward slash; like /? in a regex
# Expand $line, but remove all occurrences of <strong> or </strong>
# from the expansion
url=${line//<?(\/)strong>}

Add a prefix to all media links in a html file

I'm trying to insert an absolute path before all images in an HTML file, like this:
<img src="/media/some_path/some_image.png"> to <img src="{ABS_PATH}/some_path/some_image.png">
I tried the following regex to identify the lines:
egrep '(src|href)="/media([^"]*)"'
I want to use sed to make these changes, but the above regexp doesn't work, any hints?
sed 's#(src|href)="/media([^"]*)"##g'
sed: -e expression #1, char 32: unknown option to `s'
EDIT:
ok, now i have:
echo 'src="/media/some_image.png"' | "egrep -o '(src|href)="/media([^"]*)"' | sed 's/(src|href)=\"\/media([^"]*)\"//g'
Sed should match the string, but it doesn't
By default, sed doesn't understand ERE (extended regular expressions), only BRE (basic regular expressions). GNU sed has the -r option, which turns on ERE.
You should change delimiters for regular expressions, because you have slash in the regex, like this:
sed -r 's#(src|href)="/media([^"]*)"##g'
You can use almost any punctuation for delimiters.
You must escape / in sed if using it as a delimiter for the pattern.
So:
sed 's/(src|href)="/media([^"]*)"//g'
becomes:
sed 's/(src|href)="\/media([^"]*)"//g'
Perhaps what is confusing is that egrep (which uses extended regular expressions) has different escaping rules from sed and plain grep (which use basic regular expressions by default). Without -r, sed treats the ( ) and | in the commands above as literal characters.
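The commands above replace the match with nothing, whereas the question asks for a prefix to be put in place of /media. A hedged sketch of one way to do that with back-references, assuming a sed that accepts -E (GNU sed also accepts -r); the function name add_prefix is invented, and {ABS_PATH} is kept as the literal placeholder from the question:

```shell
# -E enables unescaped ( ) and |; using # as the delimiter avoids escaping "/".
# \1 is the attribute name (src or href), \2 is the path that followed /media.
add_prefix() {
  sed -E 's#(src|href)="/media([^"]*)"#\1="{ABS_PATH}\2"#g'
}

result=$(printf '%s\n' '<img src="/media/some_path/some_image.png">' | add_prefix)
printf '%s\n' "$result"
```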

Regex in sed to replace parts of an url given a specific format

I'm having some issues in doing a simple regex using sed.
I've to do some replacement in a sql file and I'm trying to use sed.
I should replace the url of some links. The links are in the following format:
www.site1.com/blog/2012/12/12
I would like to replace site1 with site2 in all links.
To find these links I've written the following regex:
(site1.com)\/blog\/\d{4}\/\d{2}\/\d{2}
And it seems to work properly.
Using sed to do the replacement things I've written the following code
cat back.sql | sed 's:(site1.com)\/blog\/\d{4}\/\d{2}\/\d{2}:site2.com:' > fixed.sql
But it does not seem to be working.
sed does not support \d (as far as I know), and supports {4} only with extended regular expressions. The captured group already starts with a slash, so the replacement needs no extra one.
sed -r 's:site1.com(/blog/[0-9]{4}/[0-9]{2}/[0-9]{2}):site2.com\1:'
as a basic regular expression (requires lots of escaping):
sed 's:site1.com\(/blog/[0-9]\{4\}/[0-9]\{2\}/[0-9]\{2\}\):site2.com\1:'
P.S. You don't need to escape slashes if you use a different delimiter (:).
Looks to be a straight substitution to me:
$ sed -i 's/\.site1\./.site2./g' afile.txt
... where afile.txt contains your list of sites.
If you want to output to another file, remove the -i and redirect the output using > .
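To check that only links in the dated blog format are rewritten, here is a small sketch on two sample lines. The function name fix_links and the second input line are invented for the example. The replacement is site2.com\1 with no extra slash, because the captured group already begins with /.

```shell
# Rewrite site1.com only when it is followed by /blog/YYYY/MM/DD.
# [0-9] stands in for \d, which sed does not support.
fix_links() {
  sed -E 's:site1\.com(/blog/[0-9]{4}/[0-9]{2}/[0-9]{2}):site2.com\1:g'
}

fixed=$(printf '%s\n' \
  'www.site1.com/blog/2012/12/12' \
  'www.site1.com/about' | fix_links)
printf '%s\n' "$fixed"
```

The first line becomes www.site2.com/blog/2012/12/12, while the second is left alone because it does not match the date pattern.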

Grep does not show results, online regex tester does

I am fairly inexperienced with the behavior of grep. I have a bunch of XML files that contain lines like these:
<identifier type="abc">abc:def.ghi/g1234.ab012345</identifier>
<identifier type="abc">abc:def.ghi/g5678m.ab678901</identifier>
I wanted to get the identifier part after the slash and constructed a regex using RegexPal:
[a-z]\d{4}[a-z]*\.[a-z]*\d*
It highlights everything that I wanted. Perfect. Now when I run grep on the very same file, I don't get any results. And as I said, I really don't know much about grep, so I tried all different combinations.
grep [a-z]\d{4}[a-z]*\.[a-z]*\d* test.xml
grep "[a-z]\d{4}[a-z]*\.[a-z]*\d*" test.xml
egrep "[a-z]\d{4}[a-z]*\.[a-z]*\d*" test.xml
grep '[a-z]\d{4}[a-z]*\.[a-z]*\d*' test.xml
grep -E '[a-z]\d{4}[a-z]*\.[a-z]*\d*' test.xml
What am I doing wrong?
Your regex does match the input in an engine that supports \d. Let's break it down:
[a-z] matches g
\d{4} matches 1234
[a-z]* matches zero letters here, so \. can go on to match the .
Note that grep and family don't support the \d syntax (unless GNU grep's -P is used). Try either [0-9] or [[:digit:]].
Finally, when using regular expressions, prefer egrep (equivalent to grep -E) to plain grep: it uses extended regular expressions, so operators such as +, ?, |, and {n} work without backslashes. Also, in many shells (including bash on OS X, as you mentioned), quote the pattern, preferably with single quotes. Left unquoted, * can be expanded by the shell to a list of files in the current directory before grep sees it, and unquoted backslashes are removed; inside double quotes the shell still processes $, backticks, and some backslashes. Bash won't touch anything in single quotes.
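Putting those points together, here is a minimal sketch with the POSIX digit class, extended syntax, and a single-quoted pattern, run on the two sample lines from the question. The -o option (print only the matched part) is not in POSIX but is available in both GNU and BSD grep.

```shell
xml='<identifier type="abc">abc:def.ghi/g1234.ab012345</identifier>
<identifier type="abc">abc:def.ghi/g5678m.ab678901</identifier>'

# Single quotes keep *, \ and { } away from the shell.
# [[:digit:]] is the POSIX class written inside a bracket expression.
ids=$(printf '%s\n' "$xml" |
  grep -E -o '[a-z][[:digit:]]{4}[a-z]*\.[a-z]*[[:digit:]]*')
printf '%s\n' "$ids"
```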
grep doesn't support \d by default. To match a digit, use [0-9], or enable Perl-compatible regular expressions:
$ grep -P "[a-z]\d{4}[a-z]*\.[a-z]*\d*" test.xml
or:
$ egrep "[a-z][0-9]{4}[a-z]*\.[a-z]*[0-9]*" test.xml
grep uses "basic" regular expressions (excerpt from the man page):
Basic vs Extended Regular Expressions
In basic regular expressions the meta-characters ?, +, {, |, (, and ) lose their
special meaning; instead use the backslashed versions \?, \+, \{, \|, \(, and
\).
Traditional egrep did not support the { meta-character, and some egrep
implementations support \{ instead, so portable scripts should avoid { in
grep -E patterns and should use [{] to match a literal {.
GNU grep -E attempts to support traditional usage by assuming that { is not
special if it would be the start of an invalid interval specification. For
example, the command grep -E '{1' searches for the two-character string {1
instead of reporting a syntax error in the regular expression. POSIX.2 allows
this behavior as an extension, but portable scripts should avoid it.
Also, depending on which shell you are running, the '*' character might get expanded if the pattern is not quoted.
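The difference described in the man page excerpt can be observed directly. A small sketch, assuming GNU grep behavior for the unescaped brace in basic mode; the variable names are invented for the example:

```shell
sample='g1234.ab012345'

# BRE: a bare { is a literal brace, so this pattern finds nothing here.
# grep exits non-zero on zero matches, hence the "|| true".
bre_plain=$(printf '%s\n' "$sample" | grep -c '[0-9]{4}' || true)

# BRE: the interval has to be written \{4\}.
bre_escaped=$(printf '%s\n' "$sample" | grep -c '[0-9]\{4\}' || true)

# ERE (-E): {4} is an interval without any backslashes.
ere_plain=$(printf '%s\n' "$sample" | grep -E -c '[0-9]{4}' || true)

printf 'BRE bare: %s, BRE escaped: %s, ERE bare: %s\n' \
  "$bre_plain" "$bre_escaped" "$ere_plain"
```

The counts are the number of matching input lines, so only the first command should report 0.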
You can make use of the following command:
$ cat file
<identifier type="abc">abc:def.ghi/g1234.ab012345</identifier>
# Use -P option to enable Perl style regex \d.
$ grep -P '[a-z]\d{4}[a-z]*\.[a-z]*\d*' file
<identifier type="abc">abc:def.ghi/g1234.ab012345</identifier>
# to get only the part of the input that matches use -o option:
$ grep -P -o '[a-z]\d{4}[a-z]*\.[a-z]*\d*' file
g1234.ab012345
# You can use [0-9] in place of \d and use the -E option.
$ grep -E -o '[a-z][0-9]{4}[a-z]*\.[a-z]*[0-9]*' file
g1234.ab012345
$
Try this:
[a-z]\d{5}[.][a-z]{2}\d{6}
Try this expression in grep:
[a-z]\d{4}[a-z]*\.[a-z]*\d*