How to parse specific contents from a table with Scrapy

I'm trying to parse certain contents from a table that looks like this:
<table class="dataTbl col-4">
<tr>
<th scope="row">Rent</th>
<td>5.5</td>
<th scope="row">Management</th>
<td>3.3</td>
</tr>
<tr>
<th scope="row">Deposit</th>
<td>No</td>
<th scope="row">Other</th>
<td>No</td>
</tr>
<tr>
<th scope="row">Other2</th>
<td>No</td>
<th scope="row">Insurance</th>
<td>Yes</td>
</tr>
</table>
My goal is to find a specific row header (for example, Rent) and, if there is a match, extract the content of the <td> tag that follows it (for example, 5.5).
How can I do this in Python?
I'm using Python 3 and Scrapy 1.3.0.
Thanks

In [9]: Selector(text=html).xpath('//th[text()="Rent"]/following-sibling::td[1]').extract()
Out[9]: ['<td>5.5</td>']
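Use text()="Rent" to identify the th tag.
Use following-sibling::td to select the td siblings that come after it, and [1] to take the first one.
To get only the cell text (5.5) rather than the whole element, append /text() to the expression.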
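For comparison, here is a dependency-free sketch of the same idea (find a header cell, then take the cell that follows it) using only the standard library's html.parser instead of Scrapy's XPath support. The class name HeaderValueParser is made up for this illustration, and the sample markup is a shortened copy of the table in the question.

```python
from html.parser import HTMLParser

html = """<table class="dataTbl col-4">
<tr>
<th scope="row">Rent</th>
<td>5.5</td>
<th scope="row">Management</th>
<td>3.3</td>
</tr>
<tr>
<th scope="row">Other2</th>
<td>No</td>
<th scope="row">Insurance</th>
<td>Yes</td>
</tr>
</table>"""


class HeaderValueParser(HTMLParser):
    """Map each <th> text to the text of the <td> that follows it."""

    def __init__(self):
        super().__init__()
        self.pairs = {}
        self._tag = None      # "th" or "td" while inside such a cell, else None
        self._header = None   # last <th> text still waiting for its <td>
        self._buf = []        # text fragments of the current cell

    def handle_starttag(self, tag, attrs):
        if tag in ("th", "td"):
            self._tag = tag
            self._buf = []

    def handle_data(self, data):
        if self._tag is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return
        text = "".join(self._buf).strip()
        if tag == "th":
            self._header = text
        elif self._header is not None:
            # A <td> closing right after a <th>: record the pair.
            self.pairs[self._header] = text
            self._header = None
        self._tag = None


parser = HeaderValueParser()
parser.feed(html)
pairs = parser.pairs
print(pairs.get("Rent"))
```

Building the whole header-to-value mapping once makes it cheap to look up several rows, and pairs.get() returns None when a header is missing. Inside a Scrapy spider the XPath above remains the shorter option.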

You can also use a Python regular expression:
r'\>text\<.+\n +\<td\>(\d+\.\d+)'
In your case, replace text with Rent. Also, this is a useful web page to debug regular expressions.
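A runnable sketch of the regex approach follows. The pattern is a more tolerant variant, not the exact expression above: it allows any whitespace (or none) between </th> and <td>, and it captures any cell text rather than only decimals. The helper name cell_after_header is an assumption made for this example.

```python
import re

html = """<table class="dataTbl col-4">
<tr>
<th scope="row">Rent</th>
<td>5.5</td>
<th scope="row">Management</th>
<td>3.3</td>
</tr>
<tr>
<th scope="row">Deposit</th>
<td>No</td>
<th scope="row">Other</th>
<td>No</td>
</tr>
</table>"""


def cell_after_header(markup, header):
    """Return the text of the <td> right after the <th> named `header`, or None."""
    # re.escape keeps characters such as "." in the header from acting as regex syntax.
    pattern = r">" + re.escape(header) + r"</th>\s*<td>([^<]*)</td>"
    match = re.search(pattern, markup)
    return match.group(1).strip() if match else None


print(cell_after_header(html, "Rent"))     # 5.5
print(cell_after_header(html, "Deposit"))  # No
print(cell_after_header(html, "Missing"))  # None
```

This only holds up while the markup keeps this exact shape; nested tags inside the cells would need a real parser.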

Related

Remove string between HTML tags with TRegEx

I am generating, in code, an HTML report that is sent by email through Outlook.
To do that, I first load an HTML template into which I can insert all the dynamic parts using predefined tags like [CustomerName].
<p>You will find below reports for customer [CustomerName] dated [ReportdDate]</p>
<tag-1>
<h3>TableTitleA</h3>
<table>
<thead id="t01">
<tr>
<th align='center' width='80'>Order Nr</th>
<th align='left' width='400'>Date</th>
<th align='left' width='200'>Info</th>
<th align='center' width='200'>Site Name</th>
</tr>
</thead>
<tbody>
[TableA]
</tbody>
</table>
</tag-1>
<tag-2>
<h3>TableTitleB</h3>
<table>
<thead id="t01">
<tr>
<th align='center' width='80'>Order Nr</th>
<th align='left' width='100'>Date</th>
<th align='left' width='400'>Info</th>
<th align='left' width='200'>Site Name</th>
</tr>
</thead>
<tbody>
[TableB]
</tbody>
</table>
</tag-2>
<p>Best regards</p>
This template is ready to receive two HTML tables: [TableA] and [TableB].
But sometimes a table has no data, so I want to remove that complete HTML section. To achieve this, I have inserted fake tags:
<tag-1></tag-1> and <tag-2></tag-2>
I then remove the complete section, including the two fake tags, using TRegEx. This works fine here:
https://regex101.com/r/5OFlyC/1
But this code in Delphi doesn't work as expected:
TRegEx.Replace(MessageBody.Text, '<tag-1>.*?</tag-1>', '');
Could you tell me what's wrong here?
My problem is fixed. Thanks to all of you.

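Just use the roSingleLine option to deal with line feeds. By default the dot does not match line breaks, so .*? cannot span the lines between <tag-1> and </tag-1>; with roSingleLine it can: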
MessageBody.Text := TRegEx.Replace(MessageBody.Text, '<tag-1>.*?</tag-1>', '', [roSingleLine]);
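The effect of a dot-matches-newline flag can be reproduced in a few lines of Python. This is an analogy rather than Delphi code: Python's re.DOTALL plays the role that roSingleLine plays in TRegEx, and the template string is a shortened, made-up stand-in for the one in the question.

```python
import re

# Shortened stand-in for the email template, with CR LF line breaks.
template = (
    "<p>Intro</p>\r\n"
    "<tag-1>\r\n<h3>TableTitleA</h3>\r\n</tag-1>\r\n"
    "<p>Best regards</p>"
)

pattern = r"<tag-1>.*?</tag-1>"

# Without DOTALL, "." stops at the line feed, so nothing matches
# and the text comes back unchanged.
unchanged = re.sub(pattern, "", template)

# With DOTALL, "." also matches line breaks, so the whole section goes.
cleaned = re.sub(pattern, "", template, flags=re.DOTALL)

print(unchanged == template)  # True
print("<tag-1>" in cleaned)   # False
```

The regex101 test in the question presumably passed because the single-line flag was enabled there, which is the difference this sketch isolates.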

Alternatively, first replace every CR LF in the string with a placeholder, then run the expression (shown here with < and > escaped), and finally restore the line breaks:
S:=StringReplace(messagebody.Text,#13#10,'<br>',[rfReplaceAll]);
S:=TRegEx.Replace(S,'(\<tag-1\>.*?\<\/tag-1\>)','');
messagebody.text:=StringReplace(S,'<br>',#13#10,[rfReplaceAll]);

How to get date input from table created using for loop in django?

I pass a context from views.py to my HTML template.
I create an HTML table with a for loop in the following way, and add a column with a date input field.
<table class="table">
<thead style="background-color:DodgerBlue;color:White;">
<tr>
<th scope="col">Barcode</th>
<th scope="col">Owner</th>
<th scope="col">Mobile</th>
<th scope="col">Address</th>
<th scope="col">Asset Type</th>
<th scope="col">Schedule Date</th>
<th scope="col">Approve Asset Request</th>
</tr>
</thead>
<tbody>
{% for i in deliverylist %}
<tr>
<td class="barcode">{{i.barcode}}</td>
<td class="owner">{{i.owner}}</td>
<td class="mobile">{{i.mobile}}</td>
<td class="address">{{i.address}}</td>
<td class="atype">{{i.atype}}</td>
<td class="deliverydate"><input type="date"></td>
<td><button id="schedulebutton" onclick="schedule({{forloop.counter0}})" style="background-color:#288233; color:white;" class="btn btn-indigo btn-sm m-0">Schedule Date</button></td>
</tr>
{% endfor %}
</tbody>
</table>
Now I would like to get the value of that date input in JavaScript, but it's proving difficult because I am assigning a class instead of an id (multiple elements can't share the same id).
I tried the following, but it's not working. The console log shows no value in that variable.
<script>
// i is the iteration number passed in the function call using forloop.counter0
function schedule(i){
  var deldate = document.getElementsByClassName("deliverydate");
  deldate2 = deldate[i].innerText;
  console.log(deldate2); // log shows no value/empty
  console.log(i); // log shows iteration number
}
</script>

How to stop encoding url in template file on Beego?

I'm having trouble with templates and URL encoding in Beego.
(Beego is a web framework for Go.)
How can I stop a URL from being encoded inside an HTML tag in a Beego template file?
Please let me know.
--
logcontroller.go
package controllers
import (
"mycode/models"
)
type FiletranslogController struct {
baseController
}
func (this *FiletranslogController) Get() {
// Already encoded url
this.Data["querystring"] = "/filetranslog/getlogs?sdate=2016-11-13%2000%3A00&edate=2016-12-13%2023%3A59&md5=&trans_type=2"
this.TplName = "log/filetrans.html"
}
filetrans.html
<!-- Not in TABLE TAG -->
{{str2html .querystring}}
<!-- In TABLE TAG -->
<table id="table-log"
data-url="{{str2html .querystring}}"
data-toggle="table"
data-toolbar="#toolbar-log"
data-search="true"
data-show-refresh="true"
data-pagination="true"
data-side-pagination="server"
>
<thead>
<tr>
<th data-field="rdate">Date</th>
<th data-field="mail_sender">Mail Sender</th>
<th data-field="trans_type">Trans Type</th>
<th data-field="md5">MD5</th>
</tr>
</thead>
</table>
<script>
view source on Web browser
<!-- Not in TABLE TAG -->
/filetranslog/getlogs?sdate=2016-11-13%2000%3A00&edate=2016-12-13%2023%3A59&md5=&trans_type=2
<!-- In TABLE TAG -->
<table id="table-log"
data-url="/filetranslog/getlogs?sdate=2016-11-13%2000%3A00&edate=2016-12-13%2023%3A59&md5=&trans_type=2"
data-toggle="table"
data-toolbar="#toolbar-log"
data-search="true"
data-show-refresh="true"
data-pagination="true"
data-side-pagination="server"
>
<thead>
<tr>
<th data-field="rdate">Date</th>
<th data-field="mail_sender">Mail Sender</th>
<th data-field="trans_type">Trans Type</th>
<th data-field="md5">MD5</th>
</tr>
</thead>
</table>
<script>
OMG
/filetranslog/getlogs?sdate=2016-11-13%2000%3A00&edate=2016-12-13%2023%3A59&md5=&trans_type=2
---> changed to
/filetranslog/getlogs?sdate=2016-11-13%2000%3A00&edate=2016-12-13%2023%3A59&md5=&trans_type=2
* ex) PHP Smarty template engine supports {literal} bla..bla..never encoded {/literal} tag. *

str2html
Parse string to HTML, no escaping. {{str2html .Strhtml}}
https://beego.me/docs/mvc/view/template.md

Second test result.
template_file.html
{{str2html .querystring}}
<table data-url="{{.querystring}}"
data-url='{{.querystring}}'
data-url="{{str2html .querystring}}"
data-url='{{str2html .querystring}}'
>
<thead>
<tr>
<th data-field="rdate">Date</th>
<th data-field="mail_sender">Mail Sender</th>
<th data-field="trans_type">Trans Type</th>
<th data-field="md5">MD5</th>
</tr>
</thead>
</table>
view source on Web Browser
/filetranslog/getlogs?sdate=2016-11-13%2000%3A00&edate=2016-12-13%2023%3A59&md5=&trans_type=2
<table data-url="/filetranslog/getlogs?sdate=2016-11-13%2000%3A00&edate=2016-12-13%2023%3A59&md5=&trans_type=2"
data-url='/filetranslog/getlogs?sdate=2016-11-13%2000%3A00&edate=2016-12-13%2023%3A59&md5=&trans_type=2'
data-url="/filetranslog/getlogs?sdate=2016-11-13%2000%3A00&edate=2016-12-13%2023%3A59&md5=&trans_type=2"
data-url='/filetranslog/getlogs?sdate=2016-11-13%2000%3A00&edate=2016-12-13%2023%3A59&md5=&trans_type=2'
>
<thead>
<tr>
<th data-field="rdate">Date</th>
<th data-field="mail_sender">Mail Sender</th>
<th data-field="trans_type">Trans Type</th>
<th data-field="md5">MD5</th>
</tr>
</thead>
</table>
Why is the literal string encoded? I use the "beego.ParseForm" function for form parsing; however, it does not parse the double-encoded URL properly.

Beautifulsoup replace set of html code with different code

I have a block of HTML in my BeautifulSoup object that needs to be replaced with some other code.
This is what I am getting in my BeautifulSoup object:
<html>
<body>
<table class="bt" width="100%">
<tr class="heading">
<th scope="col">Â </th>
<th class="th-heading" scope="col">B</th>
<th class="tho" scope="col"><b>O</b></th></tr></table></div></div></div></div></div></div></body></html></html>
<th class="thm" scope="col"><b>M</b></th>
<th class="thr" scope="col"><b>R</b></th>
<th class="thw" scope="col"><b>W</b></th>
<th class="thecon" scope="col"><b>E</b></th>
<th class="thw" scope="col"><b>0s</b></th>
<th class="thw" scope="col"><b>F</b></th>
<th class="thw" scope="col"><b>S</b></th>
<th scope="col">Â </th>.............</body></html>
Required code:
<html>
<body>
<table class="bt" width="100%">
<tr class="heading">
<th scope="col">Â </th>
<th class="th-heading" scope="col">B</th>
<th class="tho" scope="col"><b>O</b></th>
<th class="thm" scope="col"><b>M</b></th>
<th class="thr" scope="col"><b>R</b></th>
<th class="thw" scope="col"><b>W</b></th>
<th class="thecon" scope="col"><b>E</b></th>
<th class="thw" scope="col"><b>0s</b></th>
<th class="thw" scope="col"><b>F</b></th>
<th class="thw" scope="col"><b>S</b></th>
<th scope="col">Â </th>.............</body></html>
I have tried this, but it's not working:
soup.replace('<th class="tho" scope="col"><b>O</b></th></tr></table></div></div></div></div></div></div></body></html></html>', '<th class="tho" scope="col"><b>O</b></th>')

In your own solution you're already hinting at string replacements, rather than
actual HTML tree insertions. That's because the HTML you're starting from is terrible.
One solution is to add tags to the original tree that was generated by BeautifulSoup:
from bs4 import BeautifulSoup
import re
start_str = """<html><body><table class="bt" width="100%"><tr class="heading"><th scope="col">Â </th>
<th class="th-heading" scope="col">B</th>
<th class="tho" scope="col"><b>O</b></th></tr></table></div></div></div></div></div></div></body></html></html>
<th class="thm" scope="col"><b>M</b></th>
<th class="thr" scope="col"><b>R</b></th>
<th class="thw" scope="col"><b>W</b></th>
<th class="thecon" scope="col"><b>E</b></th>
<th class="thw" scope="col"><b>0s</b></th>
<th class="thw" scope="col"><b>F</b></th>
<th class="thw" scope="col"><b>S</b></th>
<th scope="col">Â </th>.............</body></html>"""
soup = BeautifulSoup(start_str) # remark: this'll split right after the first '</html>'
substr = re.findall('<th class="thm".*', start_str, re.DOTALL)
subsoup = BeautifulSoup(substr[0])
for tag in subsoup.findAll('th'):
    soup.tr.append(tag)
While using regular expressions to parse HTML isn't recommended, this is a
borderline case, and it's not even really parsing, merely selecting a substring.
In that sense, it can even be replaced completely with pure python builtins:
substr = start_str.split('</html></html>')[1]
Another solution is simply to remove those undesired tags, but that will only work if that substring is fixed:
to_remove = '</tr></table></div></div></div></div></div></div></body></html></html>'
soup = BeautifulSoup(''.join(start_str.split(to_remove)))
You could also use the re module in this solution, if there is whitespace between those tags for example.
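As a rough sketch of that last suggestion, the pattern below removes the run of stray closing tags even when whitespace separates them. The pattern and the function name strip_stray_closers are illustrative assumptions, and the sample markup is a shortened version of the one above. The cleaned string could then be passed to BeautifulSoup as before.

```python
import re

# Hypothetical pattern for the stray run: </tr></table>, one or more </div>,
# then </body> and one or more </html>, with optional whitespace in between.
STRAY_CLOSERS = re.compile(
    r"</tr>\s*</table>\s*(?:</div>\s*)+</body>\s*(?:</html>\s*)+"
)


def strip_stray_closers(markup):
    """Remove the first run of premature closing tags from `markup`."""
    # count=1 limits the change to the first run; the document's real
    # closing tags at the end do not start with </tr>, so they are kept.
    return STRAY_CLOSERS.sub("", markup, count=1)


broken = (
    "<table><tr><th>O</th></tr> </table>\n"
    "</div></div></body></html></html>\n"
    "<th>M</th></body></html>"
)

print(strip_stray_closers(broken))
```

Note that this drops </tr> and </table> along with the rest, so the parser is left to close the row and table on its own, just as in the fixed-string version.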

Extract attribute value from html element

Been struggling with this for a couple of hours now...
I have the following regex:
(?<=\bdata-video-id=""."">)(.*?)(title=.*?>)
The following input:
<div class="cameras">
<table class="results">
<colgroup>
<col class="col0">
<col class="col1">
</colgroup>
<thead>
<tr>
<th title="Name">
Name
</th>
<th title="Date">
Date
</th>
</tr>
</thead>
<tbody>
<tr data-video-id="1">
<td title="149 - Cam123">
149 - Cam123
</td>
<td title="Feb 18 2013">
Feb 18 2013
</td>
</tr>
<tr data-video-id="2">
<td title="150 - Cam456">
150 - Cam456
</td>
<td title="Feb 18 2013">
Feb 18 2013
</td>
</tr>
</tbody>
</table>
</div>
The regex outputs this:
<td title="149 - Cam123">
<td title="150 - Cam456">
But what I'd like to get is the contents of the title attribute of the 1st cell from every table row:
149 - Cam123
150 - Cam456
The number of rows may obviously vary but the number of columns is fixed.
Please help me fine tune the above regex.
Thanks
NOTE: The solution MUST be a regular expression. I do not have access to the code base therefore an HTML parser or any other kind of code intervention is not possible. The only way I can hook into the application is by injecting a different regex.

Given the OP's requirement that it MUST be a regex, my suggestion is to add a capturing group around the title value:
(?<=\bdata-video-id=""."">).*?title="(.*?)">
Otherwise, the general solution is to not use a regex:
Why are you using a regex? The typical solution for this due to the complexities of the tags is to use an HTML parser
Here is a SO about this topic
Here is another even more popular response on using regex for XHTML which was pointed out by Jeff Atwood in this blogpost
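For checking the suggested pattern outside the target application, here is a small Python sketch. It rests on a few assumptions: the doubled quotes of the original are written as single quotes, re.DOTALL is added because the <tr> and <td> tags sit on different lines, and the input is a trimmed copy of the table body from the question. The single . in the lookbehind only matches one-character ids, so ids of 10 and above would need a different pattern.

```python
import re

html = """<tbody>
<tr data-video-id="1">
<td title="149 - Cam123">
149 - Cam123
</td>
<td title="Feb 18 2013">
Feb 18 2013
</td>
</tr>
<tr data-video-id="2">
<td title="150 - Cam456">
150 - Cam456
</td>
<td title="Feb 18 2013">
Feb 18 2013
</td>
</tr>
</tbody>"""

# The lookbehind anchors the match right after a <tr data-video-id="N"> tag;
# the lazy .*? then runs to the first title attribute, whose value is captured.
pattern = re.compile(r'(?<=\bdata-video-id=".">).*?title="(.*?)">', re.DOTALL)

titles = pattern.findall(html)
print(titles)  # ['149 - Cam123', '150 - Cam456']
```

Only the first cell of each row is returned, because each match has to start directly after a data-video-id attribute.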

