Using BeautifulSoup to clean up markup but skip specific HTML comments - python-2.7

I am using BeautifulSoup to clean up some HTML semantically, and I want to move all style, meta, and link tags into the head tag.
Here's the HTML I am working with:
<!DOCTYPE HTML>
<html xmlns="http://www.w3.org/1999/xhtml"
xmlns:v="urn:schemas-microsoft-com:vml"
xmlns:o="urn:schemas-microsoft-com:office:office">
<!--[if gte mso 9]><xml>
<o:OfficeDocumentSettings>
<o:AllowPNG/>
<o:PixelsPerInch>96</o:PixelsPerInch>
</o:OfficeDocumentSettings>
</xml><![endif]-->
<!--[if !mso]><!-- -->
<link href="https://fonts.googleapis.com/css?family=Didact+Gothic|Ubuntu" rel="stylesheet">
<!--<![endif]-->
<style type="text/css">
h1, h2, h3, h4, h5, h6 {
margin: 0;
padding: 0;
border: 0;
font-size: 100%;
font: inherit;
vertical-align: baseline;
}
</style>
<body>
<p>Hello, World</p>
</body>
</html>
Here is my Python method:
import re
from bs4 import BeautifulSoup

def cleanup_markup(html):
    soup = BeautifulSoup(html, "html.parser")
    tags = soup.find_all(['style', 'meta', 'link'])
    conditional_search = r"<!.*\[if(.*)\](.*\n)*(.*)endif\]-->"
    re_flags = re.MULTILINE | re.DOTALL
    search = re.findall(conditional_search, html, flags=re_flags)
    found = filter(lambda a: a not in map(str, tags), search)
    head_tag = soup.head or soup.new_tag('head')
    for tag in tags:
        if tag.name not in found:
            head_tag.append(tag.extract())
    if not soup.head:
        soup.html.insert(0, head_tag)
    return unicode(soup)
But every time the method above runs, the resulting markup looks like this:
<!DOCTYPE html>
<html>
<head>
<title>
</title>
</head>
<body>
<link href="https://fonts.googleapis.com/css?family=Didact+Gothic|Ubuntu" rel="stylesheet">\n<!--<![endif]-->
<p>Hello, World</p>
\n\n
<style type="text/css">
\nh1, h2, h3, h4, h5, h6 {\n\tmargin: 0;\n\tpadding: 0;\n\tborder: 0;\n\tfont-size: 100%;\n\tfont: inherit;\n\tvertical-align: baseline;\n}\n
</style>\n<!--[if gte mso 9]><xml>\n <o:OfficeDocumentSettings>\n <o:AllowPNG/>\n <o:PixelsPerInch>96</o:PixelsPerInch>\n </o:OfficeDocumentSettings>\n</xml><![endif]-->\n<!--[if !mso]><!== -->
</body>
</html>
I basically need to skip the conditional tags so that they stay in place, but BeautifulSoup keeps shifting things around in a weird way.

With the right parser, you get the "head" element for free. From there, all you need to do is remove the comments using the extract method.
In [10]: from bs4 import Comment
In [11]: from bs4 import BeautifulSoup
In [12]: soup = BeautifulSoup(html, "html5lib")
In [13]: for c in soup.find_all(text=lambda t: isinstance(t, Comment)):
    ...:     c.extract()
    ...:
In [14]: soup
Out[14]:
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:v="urn:schemas-microsoft-com:vml"><head><link href="https://fonts.googleapis.com/css?family=Didact+Gothic|Ubuntu" rel="stylesheet"/>
<style type="text/css">
h1, h2, h3, h4, h5, h6 {
margin: 0;
padding: 0;
border: 0;
font-size: 100%;
font: inherit;
vertical-align: baseline;
}
</style>
</head><body>
<p>Hello, World</p>
</body></html>
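The loop above removes every comment, including the conditional ones. If the conditional comments have to stay in place, as the question asks, one option is to check the comment text before extracting it. The following is a minimal Python 3 sketch under stated assumptions, not a tested drop-in: the names is_conditional_comment and strip_plain_comments and the regular expression are hypothetical, and html5lib has to be installed for the second function.

```python
import re

# Hypothetical pattern: matches the text inside conditional comments such as
# "[if gte mso 9]>...<![endif]" or "<![endif]".
CONDITIONAL_RE = re.compile(r"\[if\s[^\]]*\]|<!\[endif\]", re.IGNORECASE)


def is_conditional_comment(text):
    """Return True if the comment text looks like an IE/Outlook conditional."""
    return CONDITIONAL_RE.search(text) is not None


def strip_plain_comments(html):
    """Parse html and drop only the comments that are not conditional."""
    # Imported here so the predicate above can be used without bs4 installed.
    from bs4 import BeautifulSoup, Comment

    soup = BeautifulSoup(html, "html5lib")
    for comment in soup.find_all(string=lambda t: isinstance(t, Comment)):
        if not is_conditional_comment(comment):
            comment.extract()
    return soup
```

Where the parser places the retained comments relative to head and body is up to html5lib, so it is worth inspecting the output before relying on it.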

Related

Django: How can I remove the white space between the background image and the page edge? Also, the page number appears as "Page undefined of undefined"

I am still learning Django and wkhtml2pdf, and I need some help, please. I have searched and tried many solutions, but none of them work for me. I want to set a background image and show the page number in the footer (I am using the wkhtml2pdf library).
Here is the code:
<!doctype html>
<html lang="en">
<head>
<style type="text/css">
body{
background-image: url('data:image/jpeg;base64,/9j/4AAQSkZJRgABAQEBLAEsAAD/...');
margin:0px;
background-size: cover;
background-position: left top;
padding-top:0px;
height: 1100px;
width: 900px;
}
#divb
{
font-size: 12px;
font-family: Arial;
}
div.footer {
display: block; text-align: center;
position: running(footer);
@page {
@bottom-center { content: element(footer) }
}
}
</style>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
</head>
<body style="border:0; margin: 0;">
<div id="divb">
<p><strong>Materials :</strong> {{ materiel }}</p>
.........(the rest of calling the data)
</div>
<div class='footer'>
Page <span id='page'></span> of
<span id='topage'></span>
<script>
var vars={};
var x=window.location.search.substring(1).split('&');
for (var i in x) {
var z=x[i].split('=',2);
vars[z[0]] = unescape(z[1]);
}
document.getElementById('page').innerHTML = vars.page;
document.getElementById('topage').innerHTML = vars.topage;
</script>
</div>
<script src="https://cdn.jsdelivr.net/npm/bootstrap@5.0.1/dist/js/bootstrap.bundle.min.js" integrity="sha384-gtEjrD/SeCtmISkJkNUaaKMoLD0//ElJ19smozuHV6z3Iehds+3Ulb9Bn9Plx0x4" crossorigin="anonymous"></script>
<script>
window.load = function() {
window.status = 'render-pdf';
};
</script>
</body>
</html>
And here is the result:
There is white space between the background image and the edge of the page, and the page number appears as "Page undefined of undefined". How can I adjust the background image so that it fills the whole page without white space, and what is wrong with the page number in the footer?
Please see the screenshot.
Thanks in advance to everyone here.

google cloud static hosting only shows html, but no website

I want to host a single page, index.html, on Google Cloud. The HTML validates, but the link shows only the source code, not the rendered website.
I also uploaded some other test HTML pages that work on Google Cloud (I copied their source code), but the link still shows only HTML source.
What am I missing?
TLDR: Your link does in fact serve HTML, but the content of that HTML page is itself HTML source.
When I open the link you provided, https://storage.googleapis.com/fronteris-makler-gmbh/Immobilienmakler-regensburg/index.html, I see the following:
<!DOCTYPE html>
<html>
<head>
<title>Immobilienmakler in regensburg - FRONTERIS MAKLER Gmbh</title>
<head>
But when I display the page source I see:
view-source:https://storage.googleapis.com/fronteris-makler-gmbh/Immobilienmakler-regensburg/index.html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta http-equiv="Content-Style-Type" content="text/css">
<title></title>
<meta name="Generator" content="Cocoa HTML Writer">
<meta name="CocoaVersion" content="1671.5">
<style type="text/css">
p.p1 {margin: 0.0px 0.0px 0.0px 0.0px; font: 15.0px 'Courier New'; color: #0000c2; background-color: #ffffff}
p.p2 {margin: 0.0px 0.0px 0.0px 0.0px; font: 15.0px 'Courier New'; color: #0000c2; -webkit-text-stroke: #0000c2; background-color: #ffffff}
p.p3 {margin: 0.0px 0.0px 0.0px 0.0px; font: 15.0px 'Courier New'; color: #0000c2; -webkit-text-stroke: #0000c2; background-color: #ffffff; min-height: 17.0px}
span.s1 {font-kerning: none}
span.Apple-tab-span {white-space:pre}
</style>
</head>
<body>
<p class="p1"><span class="s1">&lt;!DOCTYPE html&gt;</span></p>
<p class="p2"><span class="s1">&lt;html&gt;</span></p>
<p class="p2"><span class="s1">&lt;head&gt;</span></p>
<p class="p2"><span class="s1">&lt;title&gt;Immobilienmakler in regensburg - FRONTERIS MAKLER Gmbh&lt;/title&gt;</span></p>
See the last line, for example: it is your page title, HTML-encoded. I have no idea how you achieved this, but you stored the HTML source in a file of type HTML, and it was re-encoded as HTML. What kind of editor did you use? This would happen, for example, if you created your page in Word and then saved it as .html. Do you use a programming editor? Use at least Notepad++, Atom, TextMate, or whatever suits your platform.
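To illustrate what "re-encoded as HTML" means here, the following Python 3 sketch escapes the page title the way a rich-text editor would and then reverses it. It only demonstrates the effect; the helper looks_like_escaped_markup is a hypothetical heuristic, not something Google Cloud or any editor provides.

```python
import html

original = "<title>Immobilienmakler in regensburg - FRONTERIS MAKLER Gmbh</title>"

# What an editor writes when it treats the markup as visible text.
encoded = html.escape(original)

# Decoding the entities recovers the markup that should have been saved.
decoded = html.unescape(encoded)


def looks_like_escaped_markup(page_source):
    """Heuristic: the file contains escaped tags such as &lt;html&gt;."""
    lowered = page_source.lower()
    markers = ("&lt;html", "&lt;!doctype", "&lt;title")
    return any(marker in lowered for marker in markers)
```

If looks_like_escaped_markup returns True for the uploaded file, the fix is to save the original markup as plain text rather than to change anything in the bucket configuration.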

Cannot decode russian symbols in pdf (xhtml2pdf)

Following this, I've created an HTML-to-PDF converter. It works fine with English, but I have some Russian text that does not come out correctly. Instead of normal Russian words I get:
тут должен быть текÑ■Ñ
template:
<html lang="ru">
<head>
<meta charset="UTF-8">
<title>MC-report</title>
</head>
<body>
<div style="align:center"> тут должен быть текст {{ today }}</div>
</body>
</html>
I have the same code as in this manual (plus some code to fetch the data I need), apart from experimenting with html.encode and the template:
pdf = pisa.pisaDocument(BytesIO(html.encode("UTF-8")), result) #for decoding data, not template text
None of cp1251, cp1252, cp866, or UTF-8 works.
Add the following to your HTML (Alice-Regular.ttf is a font that supports Russian):
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<style>
@font-face {
font-family: "Alice-Regular";
src: url("/fonts/Alice-Regular.ttf") format("truetype");
}
body {
font-family: "Alice-Regular";
font-size: 20px
}
</style>

Django + Anymail + Mailgun - Email HTML renders without button and image

I am having an issue while trying to send an email containing html and an image, through mailgun, using the anymail library.
This is my code:
url_formulario = CLIENT_URL + str(token.key)
email = EmailMultiAlternatives('Confirmación Vacante', to=emails)
cid = attach_inline_image_file(email, '/var/www/static/icons/ba_logo.png')
contexto = {'nombre_contacto': contacto.responsable_nombre,
'nombre_alumno': contacto.alumno_nombre,
'url_formulario': url_formulario,
'imagen':cid}
mensaje = render_to_string('email.html', context=contexto)
email.attach_alternative(mensaje, "text/html")
email.track_clicks = True
email.send()
I have also tried doing it like this:
url_formulario = CLIENT_URL + str(token.key)
contexto = {'nombre_contacto': contacto.responsable_nombre,
'nombre_alumno': contacto.alumno_nombre,
'url_formulario': url_formulario}
mensaje = render_to_string('email.html', context=contexto)
content = strip_tags(mensaje)
email = EmailMultiAlternatives('Confirmación Vacante', content,to=emails)
email.attach_alternative(mensaje, "text/html")
email.track_clicks = True
email.send()
Here are the two corresponding versions of the html file I am using:
<html>
<head>
<title>Ingresa al formulario</title>
<!-- Required meta tags -->
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<!-- Bootstrap CSS -->
<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css" integrity="sha384-BVYiiSIFeK1dGmJRAkycuHAHRg32OmUcww7on3RYdg4Va+PmSTsz/K68vbdEjh4u" crossorigin="anonymous">
<!-- Bastrap CSS -->
<link rel="stylesheet" href="/static/css/bastrap.css">
<style>
.contenedor-general{
background:#e5e5e5;
padding-top:3em;
}
.contenedor-general img{
padding-bottom:3em;
}
.contenido-mensaje{
background:white;
margin-bottom:calc(43px + 6em);
}
.contenido-mensaje p{
font-family:"CHANEWEI", Helvetica, Arial, sans-serif;
margin:7%;
color:#717170;
}
.contenido-mensaje h1,
.contenido-mensaje a{
margin: 0 7% 0 7%;
}
.contenido-mensaje h1{
padding-top:7%;
color:#717170;
}
.contenido-mensaje a{
color:#333;
}
.btn-primary{
background-color:#fcda59 !important;
color:#685723 !important;
box-shadow:none !important;
}
.btn-primary:hover{
background-color:#fdd306 !important;
border-color:#fdd306 !important;
color:#685723 !important;
box-shadow:none !important;
}
</style>
</head>
<body>
<div class="container">
<div class="contenedor-general col-lg-8 col-lg-offset-2">
<img src="{{imagen}}" alt="Logo Buenos Aires" class="center-block"/>
<div class="col-lg-8 col-lg-offset-2 contenido-mensaje">
<h2>Hola {{nombre_contacto}},</h2>
<p>Tenemos una vacante escolar pendiente para {{nombre_alumno}}</p>
Confirmar vacante
<p>Si tenés problemas para ingresar comunicate al XXXX-XXXX (Número de télefono)</p>
<p>Muchas gracias</p>
</div>
</div>
</body>
</html>
Another version of the tag without passing the image:
<img src="" alt="Logo Buenos Aires" class="center-block"/>
This is the resulting email:
Is there a way to attach an HTML file after rendering it to a string with a specified context and an attached image?
Thanks.
It looks like your template is missing the cid: part of the <img src>. You have:
<img src="{{imagen}}" alt="Logo Buenos Aires" class="center-block"/>
But that would need to be:
<img src="cid:{{imagen}}" alt="Logo Buenos Aires" class="center-block"/>
The cid: scheme is how the email client knows to treat the value of {{imagen}} as an inline Content-ID. Without it, the client doesn't know where it's supposed to look for that image source, so you get a broken image icon instead.
There's a little more detail in the Anymail docs
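To show how the cid: reference and the Content-ID header fit together, here is a minimal sketch that builds the same kind of message with only the Python standard library instead of Anymail. The function name build_inline_image_email, the default Content-ID "logo", and the plain string replacement standing in for template rendering are all assumptions made for illustration.

```python
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_inline_image_email(html_template, image_bytes, cid="logo"):
    """Return a multipart/related message whose HTML refers to an inline image."""
    msg = MIMEMultipart("related")

    # Stand-in for template rendering: the template holds src="cid:{{imagen}}".
    html_body = html_template.replace("{{imagen}}", cid)
    msg.attach(MIMEText(html_body, "html", "utf-8"))

    # The Content-ID header (in angle brackets) is what "cid:<value>" resolves to.
    image = MIMEImage(image_bytes, _subtype="png")
    image.add_header("Content-ID", "<%s>" % cid)
    image.add_header("Content-Disposition", "inline")
    msg.attach(image)
    return msg
```

The template carries the cid: prefix and the context supplies only the identifier, which mirrors the corrected img tag above.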

BeautifulSoup fails to read page

I am trying the following:
from urllib2 import urlopen
from BeautifulSoup import BeautifulSoup
url = 'http://propaccess.traviscad.org/clientdb/Property.aspx?prop_id=312669'
soup = BeautifulSoup(urlopen(url).read())
print soup
The print statement above shows the following:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-type" content="text/html;charset=utf-8" />
<title>Travis Property Search</title>
<style type="text/css">
body { text-align: center; padding: 150px; }
h1 { font-size: 50px; }
body { font: 20px Helvetica, sans-serif; color: #333; }
#article { display: block; text-align: left; width: 650px; margin: 0 auto; }
a { color: #dc8100; text-decoration: none; }
a:hover { color: #333; text-decoration: none; }
</style>
</head>
<body>
<div id="article">
<h1>Please try again</h1>
<div>
<p>Sorry for the inconvenience but your session has either timed out or the server is busy handling other requests. You may visit us on the the following website for information, otherwise please retry your search again shortly:<br /><br />
Travis Central Appraisal District Website </p>
<p><b>Click here to reload the property search to try again</b></p>
</div>
</div>
</body>
</html>
However, I am able to access the URL through the browser on the same computer, so the server is definitely not blocking my IP. What is wrong with my code?
You need to get some cookies first; then you can visit the URL.
Although this can be done with urllib2 and CookieJar, I recommend requests:
import requests
from BeautifulSoup import BeautifulSoup
url1 = 'http://propaccess.traviscad.org/clientdb/?cid=1'
url = 'http://propaccess.traviscad.org/clientdb/Property.aspx?prop_id=312669'
ses = requests.Session()
ses.get(url1)
soup = BeautifulSoup(ses.get(url).content)
print soup.prettify()
Note that requests is not part of the standard library; you'll have to install it.
If you want to use urllib2:
import urllib2
from cookielib import CookieJar
from BeautifulSoup import BeautifulSoup
url1 = 'http://propaccess.traviscad.org/clientdb/?cid=1'
url = 'http://propaccess.traviscad.org/clientdb/Property.aspx?prop_id=312669'
cj = CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.open(url1)
soup = BeautifulSoup(opener.open(url).read())
print soup.prettify()
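The snippets above target Python 2. On Python 3 the same idea can be sketched with http.cookiejar and urllib.request, as below. This is an untested port under assumptions: the helper names make_session_opener and fetch_with_session are hypothetical, and whether the site still hands out its session cookie from the landing page is not verified.

```python
from http.cookiejar import CookieJar
from urllib.request import HTTPCookieProcessor, build_opener


def make_session_opener():
    """Return an opener that stores and resends cookies, plus its cookie jar."""
    jar = CookieJar()
    opener = build_opener(HTTPCookieProcessor(jar))
    return opener, jar


def fetch_with_session(landing_url, target_url):
    """Visit landing_url first to collect cookies, then fetch target_url."""
    opener, _jar = make_session_opener()
    opener.open(landing_url).read()  # the response sets the session cookie
    return opener.open(target_url).read()
```

Calling fetch_with_session(url1, url) with the two URLs from the answer would return the page bytes, which can then be passed to BeautifulSoup.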