Why does grep return nothing?

I'm somewhat at ease with regex, but not with grep particularly, and can't figure out why the following regex returns nothing:
wget -qO- 'http://www.acme.com/index.html' | grep -iPo '(?s)(^<div class="titlebar">.+?<div class="colleft">)'
I prepended (?s) because the catch-all ".+?" includes carriage-returns (either CRLF, CR, or LF, depending on how the text was saved).
Any idea why it doesn't work as expected?
Thank you.

grep is line-oriented, so if there are newlines between the tags, grep can't find it. You'll want:
wget -qO- 'http://website.invalid/index.html' |
perl -0777 -nE 'say for /(^<div class="titlebar">.+?<div class="colleft">)/msg'

Related

Why is sed not extracting value?

When I run my regex with sed
echo "abc-def-stg" | sed -e '/(\w*$)/g'
on regexr.com it works with no problems, but when I try to extract the value stg using sed it does not work.
Can anyone explain why?
sed is used to replace strings. You are trying to extract.
Use (as John1024 said)
echo "abc-def-stg" | sed 's/.*-//'
It will remove all up to and including the last hyphen. Or
echo "abc-def-stg" | grep -oE '[^-]+$'
It will extract all characters other than a hyphen at the end of the string.
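As a quick side-by-side, the sketch below runs both approaches on the sample string and adds a plain shell parameter expansion as a third option. The variable names are only illustrative, and the grep line assumes a grep that supports -o (GNU and BSD grep do).

```shell
s='abc-def-stg'

# sed: delete everything up to and including the last hyphen.
by_sed=$(printf '%s\n' "$s" | sed 's/.*-//')

# grep -o: print only the run of non-hyphen characters at the end of the line.
by_grep=$(printf '%s\n' "$s" | grep -oE '[^-]+$')

# Shell only: strip the longest prefix that ends in a hyphen.
by_shell=${s##*-}

printf '%s\n' "$by_sed" "$by_grep" "$by_shell"
```

All three print stg for this input.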

Extract few matching strings from matching lines in file using sed

I have a file with strings similar to this:
abcd u'current_count': u'2', u'total_count': u'3', u'order_id': u'90'
I have to find current_count and total_count for each line of the file. I am trying the command below but it's not working. Please help.
grep current_count file | sed "s/.*\('current_count': u'\d+'\).*/\1/"
It is outputting the whole line but I want something like this:
'current_count': u'3', 'total_count': u'3'
It's printing the whole line because the pattern in the s command doesn't match, so no substitution happens.
sed regexes don't support \d for digits, or x+ for xx*. GNU sed has a -r option to enable extended-regex support so + will be a meta-character, but \d still doesn't work. GNU sed also allows \+ as a meta-character in basic regex mode, but that's not POSIX standard.
So anyway, this will work:
echo -e "foo\nabcd u'current_count': u'2', u'total_count': u'3', u'order_id': u'90'" |
sed -nr "s/.*('current_count': u'[0-9]+').*/\1/p"
# output: 'current_count': u'2'
Notice that I skip the grep by using sed -n s///p. I could also have used /current_count/ as an address:
sed -r -e '/current_count/!d' -e "s/.*('current_count': u'[0-9]+').*/\1/"
Or with just grep printing only the matching part of the pattern, instead of the whole line:
grep -E -o "'current_count': u'[[:digit:]]+'"
(or egrep instead of grep -E). I forget if grep -o is POSIX-required behaviour.
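The question asked for both current_count and total_count on one line. A possible extension of the sed approach above, as a sketch only: it assumes total_count always comes after current_count on the same line, and it uses -E, which GNU and BSD sed accept for extended regexes.

```shell
# Sample line copied from the question.
line="abcd u'current_count': u'2', u'total_count': u'3', u'order_id': u'90'"

# Two capture groups, one per key; -n plus the p flag prints only lines
# where the substitution actually happened.
counts=$(printf '%s\n' "$line" |
  sed -nE "s/.*('current_count': u'[0-9]+').*('total_count': u'[0-9]+').*/\1, \2/p")

printf '%s\n' "$counts"
# prints: 'current_count': u'2', 'total_count': u'3'
```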
For me this looks like some sort of serialized Python data. Basically I would try to find out the origin of that data and parse it properly.
However, while being hackish, sed can also be used here:
sed "s/.*current_count': [a-z]'\([0-9]\+\).*/\1/" input.txt
sed "s/.*total_count': [a-z]'\([0-9]\+\).*/\1/" input.txt

Getting defined substring with help of sed or egrep

Everyone!!
I want to get specific substring from stdout of command.
stdout:
{"response":
{"id":"110200dev1","success":"true","token":"09ad7cc7da1db13334281b84f2a8fa54"},"success":"true"}
I need to get the hex string after token, without the quotation marks; the hex string is 32 characters long. I suppose it can be done with sed or egrep. I don't want to use awk here, because the stdout changes very often.
This is an alternate gnu-awk solution when grep -P isn't available:
awk -F: '{gsub(/"/, "")} NF==2&&$1=="token"{print $2}' RS='[{},]' <<< "$string"
09ad7cc7da1db13334281b84f2a8fa54
grep's nature is extracting things:
grep -Po '"token":"\K[^"]+'
-P option interprets the pattern as a Perl regular expression.
-o option shows only the matching part that matches the pattern.
\K throws away everything that it has matched up to that point.
Or an option using sed...
sed 's/.*"token":"\([^"]*\)".*/\1/'
With sed:
your-command | sed 's/.*"token":"\([^"]*\)".*/\1/'
YourStreamOrFile | sed -n 's/.*"token":"\([a-f0-9]\{32\}\)".*/\1/p'
The second command prints nothing when a line has no matching token, instead of returning the whole line unchanged.
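A small sketch of that difference, using the JSON from the question plus a made-up response without a token. The function name extract_token is just for illustration.

```shell
json='{"response":{"id":"110200dev1","success":"true","token":"09ad7cc7da1db13334281b84f2a8fa54"},"success":"true"}'

# -n suppresses the default output; the p flag prints only when the
# substitution succeeded, so non-matching input yields nothing.
extract_token() {
  sed -n 's/.*"token":"\([a-f0-9]\{32\}\)".*/\1/p'
}

token=$(printf '%s\n' "$json" | extract_token)
missing=$(printf '%s\n' '{"success":"false"}' | extract_token)

echo "token:   $token"
echo "missing: '$missing'"
```

The first call yields the 32-character hex string; the second yields an empty string, which is easy to test for in a script.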

Grep - using get "Match" to the end of text file with .* not working

I have a long text file in ascii. Every so often it will have Page: ##### in a line of it. I would like to match starting at "Page: 25141" and every single line after that to the end of the document.
Some of the combos I have tried are:
grep -E ^["Page: 25141".*] document.txt
grep "Page: 25141.*" document.txt
grep "Page: 25141\.\*" document.txt
grep -E "Page: 25141"[.*] document.txt
grep -E "Page: 25141"{.*} document.txt
grep -E {"Page: 25141".*} document.txt
Can't get this to work.
If sed solution is ok for you:
sed -n '/Page: 25141/,$p' file
The above sed command prints the range of lines that starts at the first line containing the pattern 'Page: 25141' and runs to the end of the file ($).
Since grep is line-oriented, matching multiple lines with a regular expression is not easy. However, the -A option will print the context following any matches:
grep -A 1000000000 'Page: 25141'
You could also do it with ed or sed:
echo '/Page: 25141/,$p' | ed -s filename
I almost got it to work with .*, based on nenopera's answer to a similar question. The only problem is that it prints an extra newline at the end:
grep -Pzo '(?s)Page: 25141.*'
You could also consider awk, which is built for such jobs. awk 'BEGIN{i=0;} /pattern/ {i=1;} {if(i==1) print}' file will do the job nicely, and awk leaves room to do more if you need it.
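To compare the sed range and the awk flag approach, here is a sketch on a tiny made-up document (the page contents are invented for the example). The awk program is a shortened variant of the one above: once the flag is set, the bare condition prints every line.

```shell
# Hypothetical six-line document with three page markers.
doc=$(printf '%s\n' 'Page: 25140' 'old text' 'Page: 25141' 'wanted text' 'Page: 25142' 'more text')

# sed: print from the first matching line through the last line ($).
from_sed=$(printf '%s\n' "$doc" | sed -n '/Page: 25141/,$p')

# awk: set a flag at the first match, then print every line while it is set.
from_awk=$(printf '%s\n' "$doc" | awk '/Page: 25141/ {found=1} found')

printf '%s\n' "$from_sed"
```

Both commands output the four lines starting at "Page: 25141".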

Bash (grep) regex performing unexpectedly

I have a text file, which contains a date in the form of dd/mm/yyyy (e.g 20/12/2012).
I am trying to use grep to parse the date and show it in the terminal, and it is successful,
until I meet a certain case:
These are my test cases:
grep -E "\d*" returns 20/12/2012
grep -E "\d*/" returns 20/12/2012
grep -E "\d*/\d*" returns 20/12/2012
grep -E "\d*/\d*/" returns nothing
grep -E "\d+" also returns nothing
Could someone explain to me why I get this unexpected behavior?
EDIT: I get the same behavior if I substitute the " (weak quotes) for ' (strong quotes).
The syntax you used (\d) is not recognised by grep's extended regex (ERE).
Use grep -P instead which uses Perl regex (PCRE). For example:
grep -P "\d+/\d+/\d+" input.txt
grep -P "\d{2}/\d{2}/\d{4}" input.txt # more restrictive
Or, to stick with extended regex, use [0-9] in place of \d:
grep -E "[0-9]+/[0-9]+/[0-9]+" input.txt
grep -E "[0-9]{2}/[0-9]{2}/[0-9]{4}" input.txt # more restrictive
You could also use -P instead of -E which allows grep to use the PCRE syntax
grep -P "\d+/\d+" file
does work too.
grep and egrep/grep -E don't recognize \d. The reason your first three patterns work is because of the asterisk that makes \d optional. It is actually not found.
Use [0-9] or [[:digit:]].
To help troubleshoot cases like this, the -o flag can be helpful as it shows only the matched portion of the line. With your original expressions:
grep -Eo "\d*" returns nothing - a clue that \d isn't doing what you thought it was.
grep -Eo "\d*/" returns / (twice) - confirmation that \d isn't matching while the slashes are.
As noted by others, the -P flag solves the issue by recognizing "\d", but to clarify Explosion Pills' answer, you could also use -E as follows:
grep -Eo "[[:digit:]]*/[[:digit:]]*/" returns 20/12/
EDIT: Per a comment by #shawn-chin (thanks!), --color can be used similarly to highlight the portions of the line that are matched while still showing the entire line:
grep -E --color "[[:digit:]]*/[[:digit:]]*/" returns 20/12/2012 (can't do color here, but the bold "20/12/" portion would be in color)
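Putting the portable variants together, a short sketch that uses only -E with [0-9] and [[:digit:]], and -o to show exactly what matched. The variable names are illustrative, and -o is assumed to be available (it is in GNU and BSD grep).

```shell
date_line='20/12/2012'

# Bracket expressions work in extended regex mode, unlike \d.
full=$(printf '%s\n' "$date_line" | grep -Eo '[0-9]{2}/[0-9]{2}/[0-9]{4}')

# -o shows only the matched portion, which helps when debugging a pattern.
partial=$(printf '%s\n' "$date_line" | grep -Eo '[[:digit:]]*/[[:digit:]]*/')

echo "full:    $full"     # full:    20/12/2012
echo "partial: $partial"  # partial: 20/12/
```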