Grep and sed returning only first match - regex

I am trying to extract the title and description of a rss Feed , I have written following script to return all the title in the Feed , But its returning only the first Title from the xml:
curl "http://www.dailystar.com.lb/RSS.aspx?id=113" 2>/dev/null | grep -E -o "<title>(.*)</title>" |sed -e 's,.*<title>\(.*\)</title>.*,\1,g' | less
How can I also find the description ?

You can use grep -P:
curl "http://www.dailystar.com.lb/RSS.aspx?id=113" 2>/dev/null |\
grep -oP "<title>\K[\s\S]*?(?=</title>)"

First put each title and description on its own line. Here is an example:
curl "http://www.dailystar.com.lb/RSS.aspx?id=113" 2>/dev/null | \
grep -E -o "<title>(.*)</title>" | \
sed -e 's,<\(title\|description\)>,\n<\1>,g' |
sed -n 's,.*<title>\(.*\)</title>.*,\1,gp'
For the description:
curl "http://www.dailystar.com.lb/RSS.aspx?id=113" 2>/dev/null | \
grep -E -o "<title>(.*)</title>" | \
sed -e 's,<\(title\|description\)>,\n<\1>,g' | \
sed 's,<title>\([^<]*\)</title>,T:\1,' | \
sed 's,<description>\([^<]*\)</description>,D:\1,' | \
sed -n 's/[DT]://p'

You should use non-greedy match (.*?) instead of greedy matching (.*) to get all the titles:
curl "http://www.dailystar.com.lb/RSS.aspx?id=113" 2>/dev/null | grep -E -o "<title>(.*?)</title>" |sed -e 's,.*<title>\(.*?\)</title>.*,\1,g' | less

Related

Need to get substring from string in bash

I'm trying to get Atom version in bash. Thid regex is working, but I need a substring from string, which giving grep. How can I get version from this string?
<span class="version">1.34.0</span>
curl https://atom.io/ | grep 'class="version"' | grep '[0-9]\+.[0-9]\+.[0-9]\+'
with awk
$ curl ... | awk -F'[<>]' '/class="version"/{print $3; exit}'
You can achieve this by using the cut command and adding your respective delimiters; in your case this would be the > and < tags encapsulating the version.
Input:
curl -s https://atom.io/ \
| grep 'class="version"' \
| grep '[0-9]\+.[0-9]\+.[0-9]\+' \
| cut -d '>' -f2 \
| cut -d '<' -f1
Output:
1.34.0
*added the curl -s flag to make output silent, personal choice

Linux delete egrepped lines

I pass file to my egrep expression (tcpdump log), then I want to delete all matched lines
Code example:
cat file | tr -d '\000' |egrep -i 'user: | usr: ' --color=auto --line-buffered -B20
How can I delete all matched lines now?
Use -v flag
-v, --invert-match
Selected lines are those not matching any of the specified patterns.
cat file | tr -d '\000' |egrep -iv 'user: | usr: ' --color=auto --line-buffered -B20 > newfile
You can do all that using sed:
sed -iE '/use?r: /d; s/\x0//g' file

Grep in bash with regex

I am getting the following output from a bash script:
INFOPLIST_FILE = MajorDomo/MajorDomo-Info.plist
and I would like to get only the path(MajorDomo/MajorDomo-Info.plist) using grep. In other words, everything after the equals sign. Any ideas of how to do this?
This job suites more to awk:
s='INFOPLIST_FILE = MajorDomo/MajorDomo-Info.plist'
awk -F' *= *' '{print $2}' <<< "$s"
MajorDomo/MajorDomo-Info.plist
If you really want grep then use grep -P:
grep -oP ' = \K.+' <<< "$s"
MajorDomo/MajorDomo-Info.plist
Not exactly what you were asking, but
echo "INFOPLIST_FILE = MajorDomo/MajorDomo-Info.plist" | sed 's/.*= \(.*\)$/\1/'
will do what you want.
You could use cut as well:
your_script | cut -d = -f 2-
(where your_script does something equivalent to echo INFOPLIST_FILE = MajorDomo/MajorDomo-Info.plist)
If you need to trim the space at the beginning:
your_script | cut -d = -f 2- | cut -d ' ' -f 2-
If you have multiple spaces at the beginning and you want to trim them all, you'll have to fall back to sed: your_script | cut -d = -f 2- | sed 's/^ *//' (or, simpler, your_script | sed 's/^[^=]*= *//')
Assuming your script outputs a single line, there is a shell only solution:
line="$(your_script)"
echo "${line#*= }"
Bash
IFS=' =' read -r _ x <<<"INFOPLIST_FILE = MajorDomo/MajorDomo-Info.plist"
printf "%s\n" "$x"
MajorDomo/MajorDomo-Info.plist

Regular Expression for "D210" for Linux?

The tile says it all. Right now I'm using:
grep "^D[\d][\d][\d]" file.txt
to no avail.
\d is not recognized unless -P or --perl-regexp option is specified. (assuming GNU grep).
$ echo D210 | grep '^D\d\d\d'
$ echo D210 | grep -P '^D\d\d\d'
D210
$ echo D210 | grep -P '^D\d{3}'
D210
If your grep does not accept -P, use [0-9] or [[:digit:]]:
$ echo D210 | grep '^D[0-9][0-9][0-9]'
D210
$ echo D210 | grep '^D[[:digit:]][[:digit:]][[:digit:]]'
D210

Can not extract the capture group with either sed or grep

I want to extract the value pair from a key-value pair syntax but I can not.
Example I tried:
echo employee_id=1234 | sed 's/employee_id=\([0-9]+\)/\1/g'
But this gives employee_id=1234 and not 1234 which is actually the capture group.
What am I doing wrong here? I also tried:
echo employee_id=1234| egrep -o employee_id=([0-9]+)
but no success.
1. Use grep -Eo: (as egrep is deprecated)
echo 'employee_id=1234' | grep -Eo '[0-9]+'
1234
2. using grep -oP (PCRE):
echo 'employee_id=1234' | grep -oP 'employee_id=\K([0-9]+)'
1234
3. Using sed:
echo 'employee_id=1234' | sed 's/^.*employee_id=\([0-9][0-9]*\).*$/\1/'
1234
To expand on anubhava's answer number 2, the general pattern to have grep return only the capture group is:
$ regex="$precedes_regex\K($capture_regex)(?=$follows_regex)"
$ echo $some_string | grep -oP "$regex"
so
# matches and returns b
$ echo "abc" | grep -oP "a\K(b)(?=c)"
b
# no match
$ echo "abc" | grep -oP "z\K(b)(?=c)"
# no match
$ echo "abc" | grep -oP "a\K(b)(?=d)"
Using awk
echo 'employee_id=1234' | awk -F= '{print $2}'
1234
use sed -E for extended regex
echo employee_id=1234 | sed -E 's/employee_id=([0-9]+)/\1/g'
You are specifically asking for sed, but in case you may use something else - any POSIX-compliant shell can do parameter expansion which doesn't require a fork/subshell:
foo='employee_id=1234'
var=${foo%%=*}
value=${foo#*=}
 
$ echo "var=${var} value=${value}"
var=employee_id value=1234