Find after pattern plus next line - regex

In an HTML file with some PHP included:
<table>
<tbody>
<tr>
<td>td1</td>
<td>td2</td>
</tr>
<tr>
<td colspan="3">
<?php echo 'something' ?>
</td>
<td>
<?php echo 'something' ?>
</td>
<td>
<?php echo 'something' ?>
</td>
</tr>
</tbody>
</table>
I'd like to find every <?php block that comes directly after a <td ...> tag plus a line break.
(In this sample, that excludes td1 and td2.)
My approach so far:
(?s)(<td.*?>)+(\n)... also matches <td>td1</td> and <td>td2</td>
What should come after (?s)(<td.*?>)?

To remove the <?php ... ?> blocks, replace (<td[^>]*>)\s*<\?php.*?\?> with $1.
Explanation:
(<td[^>]*>) # capture the opening <td ...> tag in $1 (the part to keep)
\s* # match the whitespace and line break after it (to be removed)
<\?php.*?\?> # the PHP block, up to the first ?>
Replacing with $1 then removes only \s*<\?php.*?\?>, and only where it directly follows a <td ...> tag.
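For anyone who wants to try the pattern outside an editor, here is a minimal Python sketch of the same find-and-replace. The sample string and the variable names are made up for illustration, and Python writes the backreference as \1 rather than $1. The DOTALL flag is only needed if a PHP block spans several lines.

```python
import re

# Sample markup adapted from the question (PHP tags are just text here).
html = """<tr>
<td>td1</td>
<td>td2</td>
</tr>
<tr>
<td colspan="3">
<?php echo 'something' ?>
</td>
<td>
<?php echo 'something' ?>
</td>
</tr>"""

# Group 1 keeps the opening <td ...>; \s* consumes the line break, and the
# lazy .*? stops at the first closing ?>.
pattern = re.compile(r"(<td[^>]*>)\s*<\?php.*?\?>", re.DOTALL)

# Finding: each match is a <td ...> directly followed by a PHP block.
found = [m.group(0) for m in pattern.finditer(html)]

# Removing: put back only the captured <td ...>.
cleaned = pattern.sub(r"\1", html)

print(len(found))
print(cleaned)
```

The cells td1 and td2 are left alone because the text after their <td> is not whitespace followed by <?php.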
Hope it helps.


html - hyperlink and link text extraction

Hi, I'm trying to extract each hyperlink together with its link text.
HTML
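<tr valign="top">
<td class="beginner">
B03
</td>
<td>
    <a href="http://www.drawspace.com/lessons/b03/simple-symmetry">Simple Symmetry</a> </td>
</tr>
<tr valign="top">
<td class="beginner">
B04
</td>
<td>
    <a href="http://www.drawspace.com/lessons/b04/faces-and-a-vase">Faces and a Vase</a> </td>
</tr>
<tr valign="top">
<td class="beginner">
B05
</td>
<td>
    <a href="http://www.drawspace.com/lessons/b05/blind-contour-drawing">Blind Contour Drawing</a> </td>
</tr>
<tr valign="top">
<td class="beginner">
B06
</td>
<td>
    <a href="http://www.drawspace.com/lessons/b06/seeing-values">Seeing Values</a> </td>
</tr>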
Code
sed -n 's/.*href="\([^"]*\).*/\1/p' file
http://www.drawspace.com/lessons/b03/simple-symmetry
http://www.drawspace.com/lessons/b04/faces-and-a-vase
http://www.drawspace.com/lessons/b05/blind-contour-drawing
http://www.drawspace.com/lessons/b06/seeing-values
Desired
http://www.drawspace.com/lessons/b03/simple-symmetry Simple Symmetry
http://www.drawspace.com/lessons/b04/faces-and-a-vase Faces and a Vase
http://www.drawspace.com/lessons/b05/blind-contour-drawing Blind Contour Drawing
http://www.drawspace.com/lessons/b06/seeing-values Seeing Values
1st solution: With your shown samples, please try the following awk code, written and tested in GNU awk. It sets RS to the regex <a href="[^"]*">[^<]*, so each record terminator RT holds one matched link. The main block checks that RT is not empty, splits it into the array arr using > or " as the delimiter, and prints the 2nd and 4th elements of arr (the URL and the link text).
awk -v RS='<a href="[^"]*">[^<]*' '
RT && split(RT,arr,"[>\"]"){
print arr[2],arr[4]
}
' Input_file
2nd solution: Using sed with its -E option (to enable ERE, extended regular expressions), please try the following code. The -n option stops sed from printing every line by default. The s command then substitutes the text matched by the regex [[:space:]]+<a href="([^"]*)">([^<]*).*, which has 2 capturing groups, with those two groups separated by a space, and the p flag prints only the lines where the substitution succeeded.
sed -E -n 's/[[:space:]]+<a href="([^"]*)">([^<]*).*/\1 \2/p' Input_file
3rd solution: Using GNU awk's match function with a regex that has 2 capturing groups; the captured values are stored in the array arr.
awk '
match($0,/^[[:space:]]+<a href="([^"]*)">([^<]*)/,arr){
print arr[1],arr[2]
}
' Input_file
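The same capture-group idea can be sketched in Python, for readers without GNU awk or sed -E. This assumes each link has the form <a href="URL">text</a> with no other attributes; the anchor markup in the sample is reconstructed from the sed output in the question, and the variable names are only illustrative.

```python
import re

# Assumed input: each link sits on its own line inside a <td>.
html = """<tr valign="top">
<td class="beginner">
B03
</td>
<td>
    <a href="http://www.drawspace.com/lessons/b03/simple-symmetry">Simple Symmetry</a> </td>
</tr>
<tr valign="top">
<td class="beginner">
B04
</td>
<td>
    <a href="http://www.drawspace.com/lessons/b04/faces-and-a-vase">Faces and a Vase</a> </td>
</tr>"""

# Group 1 is the URL (everything up to the closing quote),
# group 2 is the link text (everything up to the next tag).
link_re = re.compile(r'<a href="([^"]*)">([^<]*)')

pairs = link_re.findall(html)
lines = [f"{url} {text.strip()}" for url, text in pairs]
print("\n".join(lines))
```

Unlike the sed version, this does not depend on the whitespace in front of <a, because findall scans the whole string rather than line by line.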

regular expression: what's wrong with my expression?

I'm having difficulty building a regex.
Suppose there is an HTML clip like the one below.
I want to use JavaScript to cut out each <tbody> block that contains the "apple" link (the <a> is inside the <td class="by">).
I constructed the following expression:
/<tbody.*?text[\s\S]*?<td class="by"[\s\S]*?<a.*?>apple<\/a>[\s\S]*?<\/tbody>/g
But the result is different from what I wanted: each match contains more than one <tbody> block. How should it be written? Regards!
(I tested it on https://regex101.com/ and got the unexpected selection. Please forgive me, I can't figure out the problem :( )
<tbody id="text_0">
<td class="by">
...lots of other tags
cat
...lots of other tags
</td>
</tbody>
<tbody id="text_1">
...lots of other tags
<td class="by">
apple
</td>
...lots of other tags
</tbody>
<tbody id="text_2">
...lots of other tags
<td class="by">
cat
</td>
...lots of other tags
</tbody>
<tbody id="text_3">
...lots of other tags
<td class="by">
...lots of other tags
tiger
</td>
...lots of other tags
</tbody>
<tbody id="text_4">
<td class="by">
banana
</td>
</tbody>
<tbody id="text_5">
<td class="by">
peach
</td>
</tbody>
<tbody id="text_6">
<td class="by">
apple
</td>
</tbody>
<tbody id="text_7">
<td class="by">
banana
</td>
</tbody>
And this is what I expect to get:
<tbody id="text_1">
<td class="by">
apple
</td>
</tbody>
<tbody id="text_6">
<td class="by">
apple
</td>
</tbody>
This is not an answer to the regex part of the question, but shouldn't the td elements be embedded in tr elements? tr stands for "table row", while tbody stands for "table body". tbody usually groups the table rows. It is not prohibited to have more than one tbody in the same table, but it is usually not necessary. (tbody is actually optional; you can have tr directly inside the table element.)
First, regex is not a good tool for parsing anything like HTML or XML.
I can fix your pattern to work with this specific example, but I can't guarantee that it will work in all cases. Regex just is not the right tool for the job.
But anyway, replace the first 2 instances of [\s\S] in your pattern with [^<]. Even when it is lazy, [\s\S]*? can cross tag boundaries, so a match that starts in one <tbody> can keep going until it finds "apple" in a later one; [^<]*? cannot cross a tag.
/<tbody.*?text[^<]*?<td class="by"[^<]*?<a.*?>apple<\/a>[\s\S]*?<\/tbody>/g
Start with this working regexp and go from there:
/<a href="(.*?)">apple<\/a>/g
If that is too broad and you want to make it more specific, add the next surrounding tag:
/<td.*?>\s*<a href="(.*?)">apple<\/a>/g
Then continue:
/<tbody.*?>\s*<td.*?>\s*<a href="(.*?)">apple<\/a>/g
Also, consider an alternative such as XPath. Regular expressions can't really parse all variations of HTML.
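If a regex has to be used, one way to keep a match from spanning several blocks is to work in two steps: first isolate each <tbody> block, then test each block on its own. The sketch below shows that idea in Python rather than JavaScript; the <a href="#"> markup in the sample is an assumption (the question's regex expects an <a> inside the cell), and all names are illustrative.

```python
import re

# Hypothetical sample with an <a> element inside each <td class="by">.
html = """<tbody id="text_0">
<td class="by"><a href="#">cat</a></td>
</tbody>
<tbody id="text_1">
<td class="by"><a href="#">apple</a></td>
</tbody>
<tbody id="text_2">
<td class="by"><a href="#">banana</a></td>
</tbody>
<tbody id="text_3">
<td class="by"><a href="#">apple</a></td>
</tbody>"""

# Step 1: one match per block. The lazy quantifier stops at the first
# </tbody>, so a match never spans two blocks.
block_re = re.compile(r"<tbody\b[^>]*>[\s\S]*?</tbody>")

# Step 2: look for the "apple" link after <td class="by">. The search only
# ever sees one block, so it cannot wander into the next one.
apple_re = re.compile(r'<td class="by"[\s\S]*?<a\b[^>]*>apple</a>')

apple_blocks = [b for b in block_re.findall(html) if apple_re.search(b)]
ids = [re.search(r'id="([^"]*)"', b).group(1) for b in apple_blocks]
print(ids)
```

Within a block the second pattern can still cross from one <td> into another, so this remains a sketch for well-behaved input and not a general HTML parser.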

Tree-like matches in regex with a fixed chain

I have a very specific task to achieve with a single regex.
Here's the pattern of the text I have to extract the data from (note that I'm parsing HTML-like code stored in an immutable file):
<tr>
<td > <a ><img /></a>
</td>
<td > <a ><span >RootData</span></a>
</td>
<td > Data1.1
</td>
<td > <a ><img /></a>
</td>
<td > <a ><span >Data1.2</span></a>
</td>
<td >  
</td></tr>
<tr>
<td > Data2.1
</td>
<td > <a ><img /></a>
</td>
<td > <a ><span >Data2.2</span></a>
</td>
<td >  
</td></tr>
...
First there's a root contained inside the first "tr". Still inside that row, there's some data (Data1.1 and Data1.2) to extract.
Then comes a finite number of "tr" blocks, each containing data to extract.
I'd like the matches to look like this:
match 1 : 'RootData' 'Data1.1' 'Data1.2'
match 2 : 'RootData' 'Data2.1' 'Data2.2'
etc.
So far I can see how to do it with 2 regexes and 2 loops (one searching for the root, the other finding all the data for that root), but I'd like it to be a single regex.
If any of you have already encountered this and could help, that'd be nice :)
Thanks in advance.
If I understand you correctly, you'd like a single regular expression to return several matches that all reuse the same piece of the input (the root). Successive matches of one pattern do not overlap that way, so a single regex is probably just not the right tool for the problem you're trying to solve.
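The two-pass approach mentioned in the question is short in practice. Here is a minimal Python sketch, assuming the input looks like the sample in the question and that the blank cells hold a non-breaking space (written as &nbsp; here); cell_texts and the other names are hypothetical helpers, not part of any library.

```python
import html as htmllib
import re

# Sample shaped like the question's input.
source = """<tr>
<td > <a ><img /></a>
</td>
<td > <a ><span >RootData</span></a>
</td>
<td > Data1.1
</td>
<td > <a ><img /></a>
</td>
<td > <a ><span >Data1.2</span></a>
</td>
<td > &nbsp;
</td></tr>
<tr>
<td > Data2.1
</td>
<td > <a ><img /></a>
</td>
<td > <a ><span >Data2.2</span></a>
</td>
<td > &nbsp;
</td></tr>"""


def cell_texts(row):
    """Return the non-empty text content of each <td> in one row."""
    texts = []
    for cell in re.findall(r"<td[^>]*>(.*?)</td>", row, re.DOTALL):
        # Drop the inner tags, decode entities, trim whitespace.
        text = htmllib.unescape(re.sub(r"<[^>]+>", "", cell)).strip()
        if text:
            texts.append(text)
    return texts


rows = [cell_texts(r) for r in re.findall(r"<tr[^>]*>(.*?)</tr>", source, re.DOTALL)]

# First pass: the root is the first text cell of the first row.
root, first_data = rows[0][0], rows[0][1:]

# Second pass: pair the root with the data cells of every row.
matches = [(root, *data) for data in [first_data, *rows[1:]]]
for m in matches:
    print(m)
```

Each tuple in matches corresponds to one of the desired "match" lines, with the root repeated in plain code instead of inside the regex.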

Regular expression with multiple results

What's wrong with my regex?
"/Blabla\(2\) :.*<tr><td class=\"generic\">(.*)<\/td>.+<\/tr>/Uis"
....
<tr>
<td class="aaa">Blabla(1) :</td>
<td>
<table class="bbb"><tbody>
<tr class="ccc"><th>title1</th><th>title2</th><th>title3</th></tr>
<tr><td class="generic">word1</td><td class="generic">word2 </td><td class="generic">word3</td></tr>
<tr><td class="generic">word4</td><td class="generic">word5 </td><td class="generic">word6</td></tr>
</tbody></table>
</td>
</tr>
<tr>
<td class="aaa">Blabla(2) :</td>
<td>
<table class="bbb"><tbody>
<tr class="ccc"><th>title1</th><th>title2</th><th>title3</th></tr>
<tr><td class="generic">word1b</td><td class="generic">word2b </td><td class="generic">word3b</td></tr>
<tr><td class="generic">word4b</td><td class="generic">word5b </td><td class="generic">word6b</td></tr>
</tbody></table>
</td>
</tr>
What I want to do is get the content of the FIRST TD of each TR in the block beginning with Blabla(2).
So the expected result is word1b AND word4b.
But only the first one is returned...
Thank you for your help. Please don't tell me to use a DOM parser; it's not possible in my case.
That's an interesting regex; I learned about the ungreedy flag from it, nice!
As for your problem, you can use \G, which matches immediately after the previous match, together with the g flag, assuming a PCRE engine:
/(?:Blabla\(2\) :|(?<!^)\G).*<tr><td class=\"generic\">(.*)<\/td>.+<\/tr>/Uisg
Or a little shorter with different delimiters:
'~(?:Blabla\(2\) :|(?<!^)\G).*<tr><td class="generic">(.*)</td>.+</tr>~Uisg'
Thanks to @Jerry, I learned new tricks today:
(Blabla\(2\) :.*?|\G)<tr><td class=\"generic\">\K([^<]+).+?<\/tr>\r\n
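The original pattern can only match once because "Blabla(2) :" occurs once in the input, and \G is what lets the PCRE versions continue from the previous match. Python's built-in re module has no \G, so the sketch below swaps in a different technique: cut out the Blabla(2) block first, then collect the first cell of each row inside it. It assumes the block ends at the first </table> after the label; the variable names are illustrative.

```python
import re

# Input copied from the question.
html = """<tr>
<td class="aaa">Blabla(1) :</td>
<td>
<table class="bbb"><tbody>
<tr class="ccc"><th>title1</th><th>title2</th><th>title3</th></tr>
<tr><td class="generic">word1</td><td class="generic">word2 </td><td class="generic">word3</td></tr>
<tr><td class="generic">word4</td><td class="generic">word5 </td><td class="generic">word6</td></tr>
</tbody></table>
</td>
</tr>
<tr>
<td class="aaa">Blabla(2) :</td>
<td>
<table class="bbb"><tbody>
<tr class="ccc"><th>title1</th><th>title2</th><th>title3</th></tr>
<tr><td class="generic">word1b</td><td class="generic">word2b </td><td class="generic">word3b</td></tr>
<tr><td class="generic">word4b</td><td class="generic">word5b </td><td class="generic">word6b</td></tr>
</tbody></table>
</td>
</tr>"""

# Step 1: cut out the block that starts at "Blabla(2) :" and ends at the
# first </table> after it.
block = re.search(r"Blabla\(2\) :.*?</table>", html, re.DOTALL)

# Step 2: inside that block, take the cell that directly follows each <tr>.
first_cells = (
    re.findall(r'<tr><td class="generic">(.*?)</td>', block.group(0), re.DOTALL)
    if block
    else []
)
print(first_cells)
```

The header row is skipped because its <tr> carries a class attribute and is followed by <th>, not by <td class="generic">.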

Displaying product viewed report in dashboard in OpenCart

I want to display the Product Viewed report on the Dashboard itself. Currently the report is under Report->Products->Viewed.
How can I display it? I tried copying the code from admin->controller->report->product_viewed.php to admin->controller->common->home.php
and the code from admin->view->report->product_viewed.tpl to admin->common->home.tpl.
I added code like this to home.tpl:
<div class="content">
<table class="list">
<thead>
<tr>
<td class="left"><?php echo $column_name; ?></td>
<td class="left"><?php echo $column_model; ?></td>
<td class="right"><?php echo $column_viewed; ?></td>
<td class="right"><?php echo $column_percent; ?></td>
</tr>
</thead>
<tbody>
<?php if ($products) { ?>
<?php foreach ($products as $product) { ?>
<tr>
<td class="left"><?php echo $product['name']; ?></td>
<td class="left"><?php echo $product['model']; ?></td>
<td class="right"><?php echo $product['viewed']; ?></td>
<td class="right"><?php echo $product['percent']; ?></td>
</tr>
<?php } ?>
<?php } else { ?>
<tr>
<td class="center" colspan="4"><?php echo $text_no_results; ?></td>
</tr>
<?php } ?>
</tbody>
</table>
</div>
In my admin panel's dashboard I am getting errors like this:
Notice: Undefined variable: products in /Applications/MAMP/htdocs/opencart/admin/view/template/common/home.tpl on line 95
Notice: Undefined variable: column_name in /Applications/MAMP/htdocs/opencart/admin/view/template/common/home.tpl on line 88
Notice: Undefined variable: column_model in /Applications/MAMP/htdocs/opencart/admin/view/template/common/home.tpl on line 89
Notice: Undefined variable: column_viewed in /Applications/MAMP/htdocs/opencart/admin/view/template/common/home.tpl on line 90
Notice: Undefined variable: column_percent in /Applications/MAMP/htdocs/opencart/admin/view/template/common/home.tpl on line 91
Please help me solve this. Where should I declare 'products'?
It looks to me as if there are no results from the products viewed, and you have done this transition almost perfectly.
You can either update the language file admin>language>english>common>home.php with the appropriate fields, or change the tpl a little like this:
<div class="content">
<table class="list">
<thead>
<tr>
<td class="left">Product Name:</td>
<td class="left">Model:</td>
<td class="right">Viewed:</td>
<td class="right">Percent:</td>