grep text before string - regex

I have to extract a few fields from the HTML input below using bash (only).
HTML input
SOMETEXT
I have to extract the id value and SOMETEXT from the above input.
I am hoping that grep with some regex will work.
For id_value I am using following regex
"id=[0-9]*"
which is giving me correct results.
grep -o 'id=[0-9]*' index.html | head -n 5
But I am not sure what sort of regex I should use to grab the text up to the next </a>.
Thanks in advance.

(?<=>).*?(?=<)
You can use this with grep -P, since it relies on lookarounds, which Perl-compatible regexes support. See demo.
https://regex101.com/r/fM9lY3/21
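As a minimal sketch of how the lookaround approach could be applied: the sample line below is hypothetical (the question's original HTML was not preserved), and the function names are made up for illustration. It assumes GNU grep built with -P support.

```shell
# Hypothetical sample line; the original input HTML is not available.
sample='<a href="page.php?id=123">SOMETEXT</a>'

# Digits that follow "id=" (\K drops "id=" from the reported match).
extract_id() { grep -oP 'id=\K[0-9]+'; }

# Text between a ">" and the closing </a>, excluding both delimiters.
extract_link_text() { grep -oP '(?<=>)[^<]+(?=</a>)'; }

printf '%s\n' "$sample" | extract_id          # prints 123
printf '%s\n' "$sample" | extract_link_text   # prints SOMETEXT
```

Using [^<]+ instead of .*? keeps the match from running across tags when several anchors share a line.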

The regex you have in your OP ("id=[0-9]*") looks like it worked in your case, but a better approach is to home in on the anchor tags themselves.
Here is a regex to extract out the id value:
<a.*?id=(\d.*?)">
And here is a regex to extract out the contents inside the <a> tag:
<a.*?">(.*?)<\/a>
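(.*?)<\/a>">
grep -o cannot print capture groups on its own, so one way to use group-based patterns like these from bash is sed with extended regexes. This is only a sketch: the sample line and the function name are hypothetical, and because sed has no \d or lazy quantifiers, [0-9] and negated character classes stand in for them.

```shell
# Hypothetical sample line; the original input HTML is not available.
sample='<a href="page.php?id=123">SOMETEXT</a>'

# Print "<id> <link text>" for a line that contains a matching anchor.
id_and_text() {
  sed -nE 's/.*<a[^>]*id=([0-9]+)[^>]*>([^<]*)<\/a>.*/\1 \2/p'
}

printf '%s\n' "$sample" | id_and_text   # prints 123 SOMETEXT
```

With -n and the p flag, lines without a matching anchor produce no output at all.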

Related

Find everything between tags which contains specific content using regex

Input (Non-valid xml):
blabla<Val>Test2312x<End><Val>Nonazx<End><Val>Test<End><Val>Testazxcz<End><Val>asdsad<End>
Goal:
Extracting all tags content which contains "Test":
1231Test2312x
Test
Testazxcz
I have tried this regex:
<Val>.?Test.*?<End>
but it only captures the first occurrence without any letters before "Test".
Any ideas ?
Since you haven't mentioned which language you want to use, I am using awk:
awk -F"[><]" '{for(i=1;i<=NF;i++){if($i ~ /Test/){print $i}}}' Input_file
Output will be as follows.
Test2312x
Test
Testazxcz
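For comparison, a grep-only sketch (assuming GNU grep with -P; the function name is made up for illustration). The regex in the question allows at most one character before "Test" because of .?, whereas [^<]* allows any run of non-tag characters on either side.

```shell
input='blabla<Val>Test2312x<End><Val>Nonazx<End><Val>Test<End><Val>Testazxcz<End><Val>asdsad<End>'

# Print the content of every <Val>...<End> pair that contains "Test".
vals_with_test() {
  grep -oP '(?<=<Val>)[^<]*Test[^<]*(?=<End>)'
}

printf '%s\n' "$input" | vals_with_test
# prints:
# Test2312x
# Test
# Testazxcz
```

Unlike the awk field split, this only looks at text that sits between <Val> and <End>, so a leading fragment such as "blabla" is never considered.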

Can grep show only result i want

I have data as this
tatusx2.atc?beginnum=0;8pctgRB Mwdf fgEio"text1"text4"text
tatqsx3.atc?beginnum=1;8pctgRBwsaNezxio"text2
tatssx4.atc?beginnum=2;8pctgsvMALNejkio"data2
tatksx4.atc?beginnum=1;8pctgxdfALNebfio"text3
tatzsx5.atc?beginnum=3;8pwerRBMALNetior"datac
How to get only data between ; and "
I have tried grep -oP ';.*?"' file and got output :
;8pctgRBMwdffgEio"
;8pctgRBwsaNezxio"
;8pctgsvMALNejkio"
;8pctgxdfALNebfio"
;8pwerRBMALNetior"
But my desired output is:
8pctgRB Mwdf fgEio
8pctgRBwsaNezxio
8pctgsvMALNejkio
8pctgxdfALNebfio
8pwerRBMALNetior
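You need to use lookahead and lookbehind expressions. Use [^"]* rather than \w* for the part in between, because \w does not match the spaces in your first line:
grep -oP '(?<=;)[^"]*(?=")' file
I suggest you play around with regexr to learn more about regular expressions. Check out their cheatsheet.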
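A much more readable way to write the expression you need is:
grep -oP '(?<=;).*?(?=")' file
and it will get you the desired result. The .*? is lazy, so the match stops at the first " even when a line contains several. GNU grep's documentation has described -P (Perl-compatible regexes) as experimental, but patterns like this one work without issues.
The following options are being used:
-o --only-matching to print only the matched parts of a matching line
-P --perl-regexp to interpret the pattern as a Perl-compatible regular expression
(?<=;) is a lookbehind: it requires a ; immediately before the match but does not include it in the output, so the match starts at the index after it. Similarly, (?=") is a lookahead that requires a " right after the match without including it.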
Here is suggested additional reading.
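Where grep -P is not available, the same extraction can be sketched with plain sed. The function name below is hypothetical, and the two input lines are taken from the question's data.

```shell
# Print the text between the first ";" and the first '"' that follows it.
between_semicolon_and_quote() {
  sed -n 's/^[^;]*;\([^"]*\)".*/\1/p'
}

printf '%s\n' \
  'tatusx2.atc?beginnum=0;8pctgRB Mwdf fgEio"text1"text4"text' \
  'tatqsx3.atc?beginnum=1;8pctgRBwsaNezxio"text2' |
  between_semicolon_and_quote
# prints:
# 8pctgRB Mwdf fgEio
# 8pctgRBwsaNezxio
```

The negated classes [^;]* and [^"]* play the role of the lazy quantifier, which basic sed regexes do not have.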

sed/grep - get text between two strings (html)

I am trying to extract "pagename" from the following:
<a class="timetable work" href="http://www.test.com/pagename?tag=meta376">Test</a>
I tried to get it to work using "sed" but it only says invalid command code.
What line of code would you guys suggest to get the pagename? By the way: This is not a single line but there is more content on the same line - but that should not make a difference as it should just matter what is between the limiters, right?
Thanks in advance for helping me out!
I would use awk for this:
awk -F"[/?]" '/timetable work/ {print $4}' file
pagename
It searches for a line containing "timetable work", then prints the fourth field using / or ? as the separator.
As you commented, if you want to extract "<a class="timetable work" href="test.com/"; and "?tag=meta376">Test</a>" you can use the following regex:
<a class="timetable.*?<\/a>
Working demo
If you want to grab the content just surround the regex with capturing groups:
(<a class="timetable.*?<\/a>)
The match is:
MATCH 1
1. [9-80] `<a class="timetable work" href="test.com/"; and "?tag=meta376">Test</a>`
I think this is what you want:
sed 's_^.*<a [^<>]* href="https*://[^/]*/\([^"?]*\).*$_\1_'
Giving you exactly what you asked for using exactly the delimiters you told us to use:
$ sed -n 's|.*<a class="timetable work" href="http://www\.test\.com/\(.*\)?tag=meta376">Test</a>|\1|p' file
pagename
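If the surrounding attributes may vary, a looser sketch that only relies on the href itself is possible with GNU grep -P (the function name is made up for illustration). It prints the first path segment after the host, and it works when the anchor is embedded in a longer line.

```shell
# Print the first path segment of each http(s) href on the input lines.
first_path_segment() {
  grep -oP 'href="https?://[^/"]+/\K[^"?/]+'
}

printf '%s\n' 'x <a class="timetable work" href="http://www.test.com/pagename?tag=meta376">Test</a> y' |
  first_path_segment   # prints pagename
```

Note that this matches every http(s) link on the line, not only those with the "timetable work" class.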
I know it may be tempting to handle this using a regular expression but here's an alternative.
You are trying to parse some HTML, so use an HTML parser. Here's an example in Perl:
use strict;
use warnings;
use feature qw(say);
use HTML::TokeParser::Simple;
use URI::URL;
my $filename = 'file.html';
my $parser = HTML::TokeParser::Simple->new($filename);
while (my $anchor = $parser->get_tag('a')) {
    next unless defined(my $class = $anchor->get_attr('class'));
    next unless $class =~ /\btimetable\b/ and $class =~ /\bwork\b/;
    my $url = url $anchor->get_attr('href');
    say substr($url->path, 1);
}
Parse the HTML using HTML::TokeParser::Simple. Loop through the <a> tags, skipping any that don't have the correct classes defined. For the ones that do, use URI::URL to parse the URL and extract the "path" component (which, in your case, would be "/pagename"). As you didn't want the leading slash, I used substr to remove the first character.
Output:
pagename
I know it's much longer than a single regex but it's also a lot more robust and will continue to work even when the format of your HTML changes slightly in the future. HTML parsers exist for a reason :)

replace one or more occurrences of a string pattern on matching lines

Within an HTML document, I need to replace the '&amp' with "&" within the following using sed:
33140
<a class="coding_reference" href="/cgi-bin/_subs/efgu?c=mre_icd9cm&u=icdv58&p=">V58.6</a>
There are other occurrences of "&amp" that need to be preserved, so I only want occurrences replaced if they are within an href attribute.
This solved it for me. You can restrict sed to lines matching a pattern by putting a regex address before the substitution. Note that this removes every "amp;" on a matching line, not only those inside the href value.
sed '/href="\(\S*\)"/s/amp;//g' file
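sed ': again
s/\([hH][rR][eE][fF]="[^"]*&\)amp;/\1/
t again' YourFile
Each pass changes one &amp; inside an href value to &, and "t again" repeats the substitution until none is left.
This assumes the whole href attribute sits on a single line.
PS: a plain & is enough in the pattern; & is only special in the replacement part of s///.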
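A small sketch of a looped substitution restricted to href values, for trying the idea out. The sample line and the function name are hypothetical, and the script is split across -e options so the label is parsed the same way by GNU and BSD sed.

```shell
# Turn "&amp;" into "&" only inside href="..." values, one occurrence per pass.
fix_href_amp() {
  sed -e ':again' \
      -e 's/\([hH][rR][eE][fF]="[^"]*&\)amp;/\1/' \
      -e 't again'
}

printf '%s\n' '<a href="/x?c=1&amp;u=2&amp;p=">A &amp; B</a>' | fix_href_amp
# prints <a href="/x?c=1&u=2&p=">A &amp; B</a>
```

Because [^"]* cannot cross the closing quote, the &amp; in the link text is left alone.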

Bash/PHP extract URL from HTML via regex

Is there any easy way to extract this URL in bash/or PHP?
http://shop.image-site.com/images/2/format2013/fullies/kju_product.png
From this HTML code?
<a href="javascript: open_window_zoom('http://shop.image-site.com/image.php?image=http://shop.image-site.com/images/2/format2013/fullies/kju_product.png&pID=31777&download=kju.png&name=13011 KELLYS Kju: 490mm (19.5")',550,366);">
With perl you could do a match and a capture
perl -n -e 'print "$1\n" if (m/image=(.*?)\&/);'
This captures everything between image= and the next & and prints the captured group, $1.
For more on regular expressions, see perlre or http://www.regular-expressions.info/
In bash, you can try the following:
sed 's/.*image=\(http:\/\/[^&]*\).*/\1/g'
Update:
The solution above performs substitution rather than extraction. The line containing the pattern (required url) is replaced by the pattern itself. However, the substitution isn't in-place.
Whichever way you decide to dress it up, you could simply split with the delimiter equal to ?image= and then split the second token you receive (i.e. result[1]) with a simple & delimiter. The first result from that split is your answer.
However, a pure regex match would look something like: m#image=([a-z0-9:/._-]+)&#i. You can take that regex and put it wherever you want; the result is stored in $1. Despite what a lot of people think, you do not have to match the beginning of a line and the end of a line to match a result.
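The split-on-delimiters idea described above can be sketched in the shell without any regex, using parameter expansion. The function name is made up for illustration; it assumes the first "image=" in the string introduces the wanted URL and that the URL itself contains no "&".

```shell
# Print the text between the first "image=" and the next "&".
extract_image_url() {
  s=$1
  s=${s#*image=}   # drop everything up to and including the first "image="
  s=${s%%&*}       # drop everything from the first "&" onward
  printf '%s\n' "$s"
}

extract_image_url 'image.php?image=http://shop.image-site.com/images/2/format2013/fullies/kju_product.png&pID=31777&download=kju.png'
# prints http://shop.image-site.com/images/2/format2013/fullies/kju_product.png
```

If the input contains no "image=", the first expansion leaves the string unchanged, so a real script would want to check for the delimiter first.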
Try doing this :
xmllint --html --xpath '//a/@href' file.html |
grep -oP 'image=\Khttp://.*?\.png'
You can use a URL instead of a local file:
http://domain.tld/path
Or if you had already extracted the line to parse in the $string variable :
grep -oP 'image=\Khttp://.*?\.png' <<< "$string"