Beautifulsoup replace set of html code with different code

My BeautifulSoup object contains a block of HTML that I need to replace with different markup.
This is what the BeautifulSoup object currently contains:
<html>
<body>
<table class="bt" width="100%">
<tr class="heading">
<th scope="col">Â </th>
<th class="th-heading" scope="col">B</th>
<th class="tho" scope="col"><b>O</b></th></tr></table></div></div></div></div></div></div></body></html></html>
<th class="thm" scope="col"><b>M</b></th>
<th class="thr" scope="col"><b>R</b></th>
<th class="thw" scope="col"><b>W</b></th>
<th class="thecon" scope="col"><b>E</b></th>
<th class="thw" scope="col"><b>0s</b></th>
<th class="thw" scope="col"><b>F</b></th>
<th class="thw" scope="col"><b>S</b></th>
<th scope="col">Â </th>.............</body></html>
Required code:
<html>
<body>
<table class="bt" width="100%">
<tr class="heading">
<th scope="col">Â </th>
<th class="th-heading" scope="col">B</th>
<th class="tho" scope="col"><b>O</b></th>
<th class="thm" scope="col"><b>M</b></th>
<th class="thr" scope="col"><b>R</b></th>
<th class="thw" scope="col"><b>W</b></th>
<th class="thecon" scope="col"><b>E</b></th>
<th class="thw" scope="col"><b>0s</b></th>
<th class="thw" scope="col"><b>F</b></th>
<th class="thw" scope="col"><b>S</b></th>
<th scope="col">Â </th>.............</body></html>
I tried the following, but it does not work:
soup.replace('<th class="tho" scope="col"><b>O</b></th></tr></table></div></div></div></div></div></div></body></html></html>', '<th class="tho" scope="col"><b>O</b></th>')

Your own attempt already leans toward string replacement rather than
actual HTML tree insertions. That is understandable, because the HTML you are starting from is malformed: the table, body, and document are closed in the middle of the header row.
One solution is to append the orphaned tags to the original tree that BeautifulSoup generated:
from bs4 import BeautifulSoup
import re
start_str = """<html><body><table class="bt" width="100%"><tr class="heading"><th scope="col">Â </th>
<th class="th-heading" scope="col">B</th>
<th class="tho" scope="col"><b>O</b></th></tr></table></div></div></div></div></div></div></body></html></html>
<th class="thm" scope="col"><b>M</b></th>
<th class="thr" scope="col"><b>R</b></th>
<th class="thw" scope="col"><b>W</b></th>
<th class="thecon" scope="col"><b>E</b></th>
<th class="thw" scope="col"><b>0s</b></th>
<th class="thw" scope="col"><b>F</b></th>
<th class="thw" scope="col"><b>S</b></th>
<th scope="col">Â </th>.............</body></html>"""
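soup = BeautifulSoup(start_str)  # remark: this'll split right after the first '</html>'
# Select everything from the first orphaned <th> onward; re.DOTALL lets '.' span newlines.
substr = re.findall('<th class="thm".*', start_str, re.DOTALL)
subsoup = BeautifulSoup(substr[0])
# Move each orphaned <th> into the header row of the main tree.
for tag in subsoup.findAll('th'):
    soup.tr.append(tag)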
While using regular expressions to parse HTML isn't recommended, this is a
borderline case, and it's not even really parsing, merely selecting a substring.
In that sense, it can even be replaced completely with a plain Python string method:
substr = start_str.split('</html></html>')[1]
Another solution is simply to remove the undesired tags before parsing, but that only works if that substring is always exactly the same:
to_remove = '</tr></table></div></div></div></div></div></div></body></html></html>'
soup = BeautifulSoup(''.join(start_str.split(to_remove)))
You could also use the re module in this solution, if there is whitespace between those tags for example.
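The whitespace-tolerant variant mentioned above could look roughly like the sketch below. The input string is a shortened, hypothetical stand-in for the real document, and the pattern assumes the stray run always has the shape </tr></table>, some </div> tags, then </body></html></html>; adjust it if the real markup differs.

```python
import re

# Hypothetical, shortened input: the stray closing run appears mid-row,
# with whitespace between some of the tags.
broken = (
    '<html><body><table class="bt"><tr class="heading">'
    '<th class="tho" scope="col"><b>O</b></th></tr> </table>\n'
    '</div></div> </body></html></html>\n'
    '<th class="thm" scope="col"><b>M</b></th>'
    '</tr></table></body></html>'
)

# Match </tr></table>, any number of </div>, then </body></html></html>,
# allowing optional whitespace after each tag.
stray_run = re.compile(
    r'</tr>\s*</table>\s*(?:</div>\s*)*</body>\s*</html>\s*</html>\s*'
)

# count=1: only the first (premature) run is removed.
cleaned = stray_run.sub('', broken, count=1)
print(cleaned)
```

The cleaned string can then be passed to BeautifulSoup as in the previous snippet.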

Related

Remove string between HTML tags with TRegEx

I am building, in code, a report that is sent by email through Outlook in HTML format.
To do that, I first load an HTML template into which I insert all dynamic parts using predefined placeholders such as [CustomerName].
<p>You will find below reports for customer [CustomerName] dated [ReportdDate]</p>
<tag-1>
<h3>TableTitleA</h3>
<table>
<thead id="t01">
<tr>
<th align='center' width='80'>Order Nr</th>
<th align='left' width='400'>Date</th>
<th align='left' width='200'>Info</th>
<th align='center' width='200'>Site Name</th>
</tr>
</thead>
<tbody>
[TableA]
</tbody>
</table>
</tag-1>
<tag-2>
<h3>TableTitleB</h3>
<table>
<thead id="t01">
<tr>
<th align='center' width='80'>Order Nr</th>
<th align='left' width='100'>Date</th>
<th align='left' width='400'>Info</th>
<th align='left' width='200'>Site Name</th>
</tr>
</thead>
<tbody>
[TableB]
</tbody>
</table>
</tag-2>
<p>Best regards</p>
This template is ready to receive two HTML tables: [TableA] and [TableB].
But sometimes a table has no data, and in that case I want to remove its complete HTML section. To achieve this, I have inserted fake tags:
<tag-1></tag-1> and <tag-2></tag-2>
I then remove the complete section, including the two fake tags, using TRegEx. This works fine here:
https://regex101.com/r/5OFlyC/1
But this code in Delphi does not work as expected:
TRegEx.Replace(MessageBody.Text, '<tag-1>.*?</tag-1>', '');
Could you tell me what's wrong here?
My problem is fixed. Thanks to all of you

Use the roSingleLine option so that the dot also matches line breaks (by default it does not, so the pattern never spans more than one line):
MessageBody.Text := TRegEx.Replace(MessageBody.Text, '<tag-1>.*?</tag-1>', '', [roSingleLine]);
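
As an analogy only (this is Python, not Delphi), the same effect can be reproduced with Python's re module, where re.DOTALL plays the role that roSingleLine plays in TRegEx. The template string below is a shortened, hypothetical version of the one in the question.

```python
import re

# Shortened, hypothetical version of the email template from the question.
template = (
    "<p>Intro</p>\n"
    "<tag-1>\n<h3>TableTitleA</h3>\n[TableA]\n</tag-1>\n"
    "<p>Best regards</p>"
)
pattern = r"<tag-1>.*?</tag-1>"

# Without DOTALL, '.' stops at the first line break, so nothing matches.
without_flag = re.sub(pattern, "", template)

# With DOTALL, '.' also matches line breaks and the whole section is removed.
with_flag = re.sub(pattern, "", template, flags=re.DOTALL)

print(without_flag == template)
print(with_flag)
```

The lazy quantifier .*? still matters here: it keeps the match from running on to a later closing tag if the same fake tag appears more than once.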

Alternatively, first replace every CR LF in your string with a placeholder, then apply the expression with < and > escaped, and finally restore the line breaks:
S:=StringReplace(messagebody.Text,#13#10,'<br>',[rfReplaceAll]);
S:=TRegEx.Replace(S,'(\<tag-1\>.*?\<\/tag-1\>)','');
messagebody.text:=StringReplace(S,'<br>',#13#10,[rfReplaceAll]);

How to get date input from table created using for loop in django?

I pass a context from views.py to my HTML template.
In the template I build an HTML table with a for loop, and I added a column containing a date input field:
<table class="table">
<thead style="background-color:DodgerBlue;color:White;">
<tr>
<th scope="col">Barcode</th>
<th scope="col">Owner</th>
<th scope="col">Mobile</th>
<th scope="col">Address</th>
<th scope="col">Asset Type</th>
<th scope="col">Schedule Date</th>
<th scope="col">Approve Asset Request</th>
</tr>
</thead>
<tbody>
{% for i in deliverylist %}
<tr>
<td class="barcode">{{i.barcode}}</td>
<td class="owner">{{i.owner}}</td>
<td class="mobile">{{i.mobile}}</td>
<td class="address">{{i.address}}</td>
<td class="atype">{{i.atype}}</td>
<td class="deliverydate"><input type="date"></td>
<td><button id="schedulebutton" onclick="schedule({{forloop.counter0}})" style="background-color:#288233; color:white;" class="btn btn-indigo btn-sm m-0">Schedule Date</button></td>
</tr>
{% endfor %}
</tbody>
Now I would like to read the value of that date input in JavaScript, but it is proving difficult because I am assigning a class instead of an id (multiple elements can't share the same id).
I tried the following, but it does not work. The console log shows no value for that variable.
<script>
// i is the iteration number passed in the function call via forloop.counter0
function schedule(i) {
    var deldate = document.getElementsByClassName("deliverydate");
    deldate2 = deldate[i].innerText;
    console.log(deldate2); // log shows no value/empty
    console.log(i); // log shows iteration number
}
</script>

How to find the element part of the anchor tag

I am completely new to Selenium, so please accept my apologies if this is a silly question.
The website contains the markup below. I would like to know how to get the data-selectdate value using Selenium and Python. Once I have that value, I want to compare it against the date I am interested in.
Your help is deeply appreciated.
Note: I am not using Beautiful Soup or anything similar.
Cheers
<table role="grid" tabindex="0" summary="October 2018">
<thead>
<tr>
<th role="columnheader" id="dayMonday"><abbr title="Monday">Mon</abbr></th>
<th role="columnheader" id="dayTuesday"><abbr title="Tuesday">Tue</abbr></th>
<th role="columnheader" id="dayWednesday"><abbr title="Wednesday">Wed</abbr></th>
<th role="columnheader" id="dayThursday"><abbr title="Thursday">Thur</abbr></th>
<th role="columnheader" id="dayFriday"><abbr title="Friday">Fri</abbr></th>
<th role="columnheader" id="daySaturday"><abbr title="Saturday">Sat</abbr></th>
</tr>
</thead>
<tbody>
<tr>
<td role="gridcell" headers="dayMonday">
<a data-selectdate="2018-10-22T00:00:00+01:00" data-selected="false" id="day22"
class="day-appointments-available">22</a>
</td>
<td role="gridcell" headers="dayTuesday">
<a data-selectdate="2018-10-23T00:00:00+01:00" data-selected="false" id="day23"
class="day-appointments-available">23</a>
</td>
<td role="gridcell" headers="dayWednesday">
<a data-selectdate="2018-10-24T00:00:00+01:00" data-selected="false" id="day24"
class="day-appointments-available">24</a>
</td>
<td role="gridcell" headers="dayThursday">
<a data-selectdate="2018-10-25T00:00:00+01:00" data-selected="false" id="day25"
class="day-appointments-available">25</a>
</td>
<td role="gridcell" headers="dayFriday">
<a data-selectdate="2018-10-26T00:00:00+01:00" data-selected="false" id="day26"
class="day-appointments-available">26</a>
</td>
<td role="gridcell" headers="daySaturday">
<a data-selectdate="2018-10-27T00:00:00+01:00" data-selected="false" id="day27"
class="day-appointments-available">27</a>
</td>
</tr>
</tbody>
</table>

To get the values of the attribute data-selectdate you can use the following solution:
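from selenium.webdriver.common.by import By
# find_elements(By.CSS_SELECTOR, ...) replaces find_elements_by_css_selector, which recent Selenium 4 releases removed
elements = driver.find_elements(By.CSS_SELECTOR, "table[summary='October 2018'] tbody td[role='gridcell'][headers^='day']>a")
for element in elements:
    print(element.get_attribute("data-selectdate"))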

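You can use the element's get_attribute method to read the value of an attribute.
from selenium.webdriver.common.by import By
css_locator = "table tr:nth-child(1) > td[headers='dayMonday'] > a"
# find_element(By.CSS_SELECTOR, ...) replaces find_element_by_css_selector, which recent Selenium 4 releases removed
ele = driver.find_element(By.CSS_SELECTOR, css_locator)
selectdate = ele.get_attribute('data-selectdate')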
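
For the comparison step the question asks about, a minimal sketch using only the standard library might look like this. The attribute values are copied from the markup above; the helper name matches_target and the target date are hypothetical, and the sketch assumes Python 3.7 or later, where datetime.fromisoformat is available.

```python
from datetime import date, datetime

def matches_target(selectdate_value, target):
    """Return True if the ISO 8601 attribute value falls on the target date."""
    # fromisoformat accepts UTC offsets such as +01:00 (Python 3.7+).
    return datetime.fromisoformat(selectdate_value).date() == target

# Values as they appear in the data-selectdate attributes above.
values = ["2018-10-22T00:00:00+01:00", "2018-10-23T00:00:00+01:00"]
target = date(2018, 10, 23)  # hypothetical date of interest

available = [v for v in values if matches_target(v, target)]
print(available)
```

In a real script, values would be filled from element.get_attribute("data-selectdate") for each matched anchor.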

How to parse specific contents from table with Scrapy

I'm trying to parse certain contents from a table that looks like this:
<table class="dataTbl col-4">
<tr>
<th scope="row">Rent</th>
<td>5.5</td>
<th scope="row">Management</th>
<td>3.3</td>
</tr>
<tr>
<th scope="row">Deposit</th>
<td>No</td>
<th scope="row">Other</th>
<td>No</td>
</tr>
<tr>
<th scope="row">Other2</th>
<td>No</td>
<th scope="row">Insurance</th>
<td>Yes</td>
</tr>
</table>
My goal is to find a specific row header (for example, Rent) and, if there is a match, extract the content of the next <td> tag (for example, 5.5).
How can I do this in Python?
I'm using Python 3 and Scrapy 1.3.0.
Thanks

In [9]: Selector(text=html).xpath('//th[text()="Rent"]/following-sibling::td[1]').extract()
Out[9]: ['<td>5.5</td>']
Use text()="Rent" to identify the th tag.
Use following-sibling::td to select the td siblings that follow it, and [1] to take the first one.
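
The XPath above returns the whole <td> element; appending /text() to it should return only the cell text. The same "th followed by its td" pairing can also be sketched without Scrapy. The version below swaps in the standard library's xml.etree.ElementTree, which has no following-sibling axis, so it walks each row's cells in order instead. It assumes the table fragment is well-formed XML, as the sample is, and the helper name value_for is hypothetical.

```python
import xml.etree.ElementTree as ET

html = """<table class="dataTbl col-4">
<tr>
<th scope="row">Rent</th>
<td>5.5</td>
<th scope="row">Management</th>
<td>3.3</td>
</tr>
<tr>
<th scope="row">Deposit</th>
<td>No</td>
<th scope="row">Other</th>
<td>No</td>
</tr>
<tr>
<th scope="row">Other2</th>
<td>No</td>
<th scope="row">Insurance</th>
<td>Yes</td>
</tr>
</table>"""

def value_for(label, markup):
    """Return the text of the <td> directly after the <th> whose text is label."""
    table = ET.fromstring(markup)
    for row in table.iter("tr"):
        cells = list(row)
        # Pair every cell with the cell that immediately follows it.
        for th, td in zip(cells, cells[1:]):
            if th.tag == "th" and td.tag == "td" and (th.text or "").strip() == label:
                return (td.text or "").strip()
    return None  # no matching header found

print(value_for("Rent", html))
```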

Use a Python regular expression:
r'\>text\<.+\n +\<td\>(\d+\.\d+)'
In your case, replace text with Rent. Also, this is a useful web page to debug regular expressions.
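
A runnable variant of the regular-expression idea is sketched below. It does not use the exact pattern above: this one tolerates any whitespace between the tags and captures non-numeric cell values as well. The helper name cell_after is hypothetical, and a regex like this is only reasonable for small, predictable fragments.

```python
import re

html = (
    '<tr>\n<th scope="row">Rent</th>\n<td>5.5</td>\n'
    '<th scope="row">Management</th>\n<td>3.3</td>\n</tr>'
)

def cell_after(label, markup):
    """Return the text of the <td> that follows the <th> containing label."""
    pattern = (
        r'<th[^>]*>\s*' + re.escape(label) + r'\s*</th>'  # the header cell
        r'\s*<td[^>]*>\s*([^<]*?)\s*</td>'                # the next data cell
    )
    match = re.search(pattern, markup)
    return match.group(1) if match else None

print(cell_after("Rent", html))
```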

How to stop encoding url in template file on Beego?

I'm having trouble with templates and URL encoding in Beego.
(Beego is a web framework for Go.)
How can I stop a URL from being encoded inside an HTML tag in a Beego template file?
Please let me know.
--
logcontroller.go
package controllers
import (
"mycode/models"
)
type FiletranslogController struct {
baseController
}
func (this *FiletranslogController) Get() {
// Already encoded url
this.Data["querystring"] = "/filetranslog/getlogs?sdate=2016-11-13%2000%3A00&edate=2016-12-13%2023%3A59&md5=&trans_type=2"
this.TplName = "log/filetrans.html"
}
filetrans.html
<!-- Not in TABLE TAG -->
{{str2html .querystring}}
<!-- In TABLE TAG -->
<table id="table-log"
data-url="{{str2html .querystring}}"
data-toggle="table"
data-toolbar="#toolbar-log"
data-search="true"
data-show-refresh="true"
data-pagination="true"
data-side-pagination="server"
>
<thead>
<tr>
<th data-field="rdate">Date</th>
<th data-field="mail_sender">Mail Sender</th>
<th data-field="trans_type">Trans Type</th>
<th data-field="md5">MD5</th>
</tr>
</thead>
</table>
<script>
view source on Web browser
<!-- Not in TABLE TAG -->
/filetranslog/getlogs?sdate=2016-11-13%2000%3A00&edate=2016-12-13%2023%3A59&md5=&trans_type=2
<!-- In TABLE TAG -->
<table id="table-log"
data-url="/filetranslog/getlogs?sdate=2016-11-13%2000%3A00&edate=2016-12-13%2023%3A59&md5=&trans_type=2"
data-toggle="table"
data-toolbar="#toolbar-log"
data-search="true"
data-show-refresh="true"
data-pagination="true"
data-side-pagination="server"
>
<thead>
<tr>
<th data-field="rdate">Date</th>
<th data-field="mail_sender">Mail Sender</th>
<th data-field="trans_type">Trans Type</th>
<th data-field="md5">MD5</th>
</tr>
</thead>
</table>
<script>
OMG
/filetranslog/getlogs?sdate=2016-11-13%2000%3A00&edate=2016-12-13%2023%3A59&md5=&trans_type=2
---> changed to
/filetranslog/getlogs?sdate=2016-11-13%2000%3A00&edate=2016-12-13%2023%3A59&md5=&trans_type=2
For example, the PHP Smarty template engine supports a {literal} ... {/literal} tag whose content is never encoded.

str2html
Parse string to HTML, no escaping. {{str2html .Strhtml}}
https://beego.me/docs/mvc/view/template.md

Second test result.
template_file.html
{{str2html .querystring}}
<table data-url="{{.querystring}}"
data-url='{{.querystring}}'
data-url="{{str2html .querystring}}"
data-url='{{str2html .querystring}}'
>
<thead>
<tr>
<th data-field="rdate">Date</th>
<th data-field="mail_sender">Mail Sender</th>
<th data-field="trans_type">Trans Type</th>
<th data-field="md5">MD5</th>
</tr>
</thead>
</table>
view source on Web Browser
/filetranslog/getlogs?sdate=2016-11-13%2000%3A00&edate=2016-12-13%2023%3A59&md5=&trans_type=2
<table data-url="/filetranslog/getlogs?sdate=2016-11-13%2000%3A00&edate=2016-12-13%2023%3A59&md5=&trans_type=2"
data-url='/filetranslog/getlogs?sdate=2016-11-13%2000%3A00&edate=2016-12-13%2023%3A59&md5=&trans_type=2'
data-url="/filetranslog/getlogs?sdate=2016-11-13%2000%3A00&edate=2016-12-13%2023%3A59&md5=&trans_type=2"
data-url='/filetranslog/getlogs?sdate=2016-11-13%2000%3A00&edate=2016-12-13%2023%3A59&md5=&trans_type=2'
>
<thead>
<tr>
<th data-field="rdate">Date</th>
<th data-field="mail_sender">Mail Sender</th>
<th data-field="trans_type">Trans Type</th>
<th data-field="md5">MD5</th>
</tr>
</thead>
</table>
Why is the literal string encoded? I use the "beego.ParseForm" function for form parsing, but it does not parse the double-encoded URL properly.
