I am trying the following:
from urllib2 import urlopen
from BeautifulSoup import BeautifulSoup
url = 'http://propaccess.traviscad.org/clientdb/Property.aspx?prop_id=312669'
soup = BeautifulSoup(urlopen(url).read())
print soup
The print statement above shows the following:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-type" content="text/html;charset=utf-8" />
<title>Travis Property Search</title>
<style type="text/css">
body { text-align: center; padding: 150px; }
h1 { font-size: 50px; }
body { font: 20px Helvetica, sans-serif; color: #333; }
#article { display: block; text-align: left; width: 650px; margin: 0 auto; }
a { color: #dc8100; text-decoration: none; }
a:hover { color: #333; text-decoration: none; }
</style>
</head>
<body>
<div id="article">
<h1>Please try again</h1>
<div>
<p>Sorry for the inconvenience but your session has either timed out or the server is busy handling other requests. You may visit us on the the following website for information, otherwise please retry your search again shortly:<br /><br />
Travis Central Appraisal District Website </p>
<p><b>Click here to reload the property search to try again</b></p>
</div>
</div>
</body>
</html>
I can open the URL in a browser on the same computer, so the server is definitely not blocking my IP. What is wrong with my code?
You need to get some cookies first; then you can visit the URL.
Although this can be done with urllib2 and CookieJar, I recommend requests:
import requests
from BeautifulSoup import BeautifulSoup
url1 = 'http://propaccess.traviscad.org/clientdb/?cid=1'
url = 'http://propaccess.traviscad.org/clientdb/Property.aspx?prop_id=312669'
ses = requests.Session()
ses.get(url1)
soup = BeautifulSoup(ses.get(url).content)
print soup.prettify()
Note that requests is not part of the standard library; you'll have to install it.
If you want to use urllib2:
import urllib2
from cookielib import CookieJar
from BeautifulSoup import BeautifulSoup
url1 = 'http://propaccess.traviscad.org/clientdb/?cid=1'
url = 'http://propaccess.traviscad.org/clientdb/Property.aspx?prop_id=312669'
cj = CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.open(url1)
soup = BeautifulSoup(opener.open(url).read())
print soup.prettify()
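For readers on Python 3, where urllib2 and cookielib no longer exist, the same idea can be sketched with urllib.request and http.cookiejar. This is only a sketch under assumptions: the helper names build_cookie_opener and fetch_property_page are made up for illustration, and the two URLs are copied from the answer above.

```python
import urllib.request
from http.cookiejar import CookieJar

# Both URLs are copied from the answer above.
LANDING_URL = 'http://propaccess.traviscad.org/clientdb/?cid=1'
PROPERTY_URL = 'http://propaccess.traviscad.org/clientdb/Property.aspx?prop_id=312669'


def build_cookie_opener():
    """Return an opener that stores and resends cookies, plus its cookie jar."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def fetch_property_page(opener):
    """Visit the landing page first so the session cookie is set, then the target."""
    opener.open(LANDING_URL).close()
    with opener.open(PROPERTY_URL) as response:
        return response.read()
```

Calling fetch_property_page(opener) performs two real HTTP requests, so it is left out of the sketch itself; the returned bytes can be passed to BeautifulSoup as in the examples above.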
I'm trying to build a small web app: a simple price scraper. I have a working script, but I haven't managed to turn it into a web app.
Here is my working Python code, which returns the price of one specific item, with this stock code:
918193-012
The idea is that a user enters the stock code of a product and receives the price for that product, either on the same page or on another one. Any advice is appreciated.
import requests
from bs4 import BeautifulSoup
stock_code = '918193-012'
url = f'https://www.tennis.fr/catalogsearch/result/?q={stock_code}'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
capturedPrice = soup.findAll('p', class_="special-price")[0].text
capturedPrice = capturedPrice.strip('0')
capturedPrice = capturedPrice.strip()
print(capturedPrice[35:])
Here's my app.py
from flask import Flask, jsonify, render_template, request
from bs4 import BeautifulSoup
import subprocess

app = Flask(__name__)

@app.route('/')
def index():
    return render_template('index.html')

if __name__ == '__main__':
    app.run(debug=True)
And here's the index.html:
<!DOCTYPE html>
<html>
<head>
<meta name="viewport" content="width=device-width, initial-scale=1">
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css">
<link rel="stylesheet" href="https://fonts.googleapis.com/css?family=Roboto+Mono&display=swap" >
<style>
body {
font-family: 'Roboto Mono', monospace;
}
* {
box-sizing: border-box;
}
form.example input[type=text] {
padding: 10px;
font-size: 17px;
border: 1px solid grey;
float: left;
width: 80%;
background: #f1f1f1;
}
form.example button {
float: left;
width: 20%;
padding: 10px;
background: #2196F3;
color: white;
font-size: 17px;
border: 1px solid grey;
border-left: none;
cursor: pointer;
}
form.example button:hover {
background: #0b7dda;
}
form.example::after {
content: "";
clear: both;
display: table;
}
</style>
</head>
<body>
<center>
<h1>Find the best prices</h1>
<h4>Stock code to test: 918193-012</h4>
<form class="example" action="" style="margin:auto;max-width:500px">
<input type="text" placeholder="Stock code" name="">
<button type="submit"><i class="fa fa-search"></i></button>
</form>
</center>
</body>
</html>
You want to read the user's input from your HTML page. One possible solution is to use a form (this example uses Flask-WTF):
app.py:
from flask import Flask, render_template
from flask_wtf import FlaskForm
from wtforms import StringField, SubmitField
from wtforms.validators import DataRequired

app = Flask(__name__)
# Flask-WTF needs a secret key for CSRF protection; replace this placeholder.
app.config['SECRET_KEY'] = 'change-me'

class InputForm(FlaskForm):
    input = StringField('Input', validators=[DataRequired()])
    submit = SubmitField('Search for Price')

# Accept POST as well, so the submitted form reaches this view.
@app.route('/', methods=['GET', 'POST'])
def index():
    form = InputForm()
    if form.validate_on_submit():
        result = form.input.data
        return render_template("index.html", result=result, form=form)
    else:
        return render_template("index.html", form=form)

if __name__ == "__main__":
    app.run(debug=True)
You can change the StringField to an IntegerField if you prefer; that's up to you.
The next step is to add the price lookup to your @app.route view.
Of course, you also need to change your HTML page so that it passes the search value to the web app.
Instead of your plain HTML input and button, render the form fields:
<!DOCTYPE html>
<html>
<head>
##your css stuff##
<style>
</style>
</head>
<body>
<center>
<h1>Find the best prices</h1>
<h4>Stock code to test: 918193-012</h4>
<form method="post">
{{ form.hidden_tag() }}
{{ form.input(class="form-control") }}
{{ form.submit(class="btn") }}
</form>
{% if result == #yourvalue# %}
#your printed data#
{% else %}
{% endif %}
</center>
</body>
</html>
This example uses Bootstrap class names for styling, but you can change that however you like.
I hope my explanation was understandable.
Edit 1:
If you want to use just one HTML file, use an if statement inside the template. In my example, the result is passed to the template by the render_template call.
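The view above only echoes the stock code back. To show a price, the text scraped from the "special-price" element still has to be reduced to a number, which the question does with strip() and a fixed slice. A minimal sketch of a less position-dependent way, assuming the scraped text contains one price such as "89,90" (the sample string below is hypothetical, not real output from the site, and extract_price is a made-up helper name):

```python
import re

# One or more digits, optionally followed by a decimal part such as ",90" or ".9".
PRICE_PATTERN = re.compile(r"\d+(?:[.,]\d{1,2})?")


def extract_price(text):
    """Return the first price-like number found in text, or None if there is none."""
    match = PRICE_PATTERN.search(text)
    return match.group(0) if match else None


# Hypothetical example of what the scraped element text might look like.
sample = "\n  Special Price\n   89,90 €\n"
print(extract_price(sample))
```

Inside index(), the value of form.input.data would be used to build the search URL, and the result of extract_price would be passed to render_template as result.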
I'm very new to Django and am trying to follow along with a tutorial by sentdex on YouTube.
Django version: 1.9
I chose this version because it is the one used in the tutorial.
I can't figure out how to get the CSS file to load.
The location of the CSS file:
/media/xxx/django tutorial/mysite/personal/static/personal/css
I assume BASE_DIR refers to:
/media/xxx/django tutorial/mysite
This is the location of manage.py. Or am I wrong?
The CSS file is referenced in header.html:
{% load staticfiles %}
<link rel="stylesheet" href="{% static 'personal/css/bootstrap.min.css' %}" type = "text/css"/>
I read through a lot of the answers and, if I understand correctly, I have to change settings.py in the mysite folder.
This is what I have at the moment:
STATIC_URL = '/static/'
STATICFILES_DIRS = [
os.path.join(BASE_DIR,'personal'),
]
I have tried many combinations in os.path.join, but I still can't get the file to load.
Thanks for your help.
Project Structure:
django tutorial
--mysite
--mysite
---------__pycache__
---------__init__.py
---------settings.py
---------urls.py
---------wsgi.py
--personal
---------migrations
---------__pycache__
---------static
------personal
------css # has the bootstrap.min.css
------js
---------templates
---------admin.py
---------apps.py
---------__init__.py
---------models.py
---------tests.py
---------urls.py
---------views.py
If someone can tell me the right command to get the directory structure in ubuntu I would be happy to show that here.
Template:
In home.html:
{% extends "personal/header.html" %}
{% block content %}
<p>Hey, welcome to my webpage. We are testing.<p>
{% include "personal/includes/htmlsnippets.html" %}
{% endblock %}
Error message (from View Page Source):
<!DOCTYPE html>
<html lang="en">
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8">
<title>Page not found at /static/personal/css/bootstrap.min.css</title>
<meta name="robots" content="NONE,NOARCHIVE">
<style type="text/css">
html * { padding:0; margin:0; }
body * { padding:10px 20px; }
body * * { padding:0; }
body { font:small sans-serif; background:#eee; }
body>div { border-bottom:1px solid #ddd; }
h1 { font-weight:normal; margin-bottom:.4em; }
h1 span { font-size:60%; color:#666; font-weight:normal; }
table { border:none; border-collapse: collapse; width:100%; }
td, th { vertical-align:top; padding:2px 3px; }
th { width:12em; text-align:right; color:#666; padding-right:.5em; }
#info { background:#f6f6f6; }
#info ol { margin: 0.5em 4em; }
#info ol li { font-family: monospace; }
#summary { background: #ffc; }
#explanation { background:#eee; border-bottom: 0px none; }
</style>
</head>
<body>
<div id="summary">
<h1>Page not found <span>(404)</span></h1>
<table class="meta">
<tr>
<th>Request Method:</th>
<td>GET</td>
</tr>
<tr>
<th>Request URL:</th>
<td>http://127.0.0.1:8000/static/personal/css/bootstrap.min.css</td>
</tr>
</table>
</div>
<div id="info">
<p>'personal/css/bootstrap.min.css' could not be found</p>
</div>
<div id="explanation">
<p>
You're seeing this error because you have <code>DEBUG = True</code> in
your Django settings file. Change that to <code>False</code>, and Django
will display a standard 404 page.
</p>
</div>
</body>
</html>
In the terminal where runserver is running:
[24/Feb/2019 06:19:32] "GET /static/personal/css/bootstrap.min.css HTTP/1.1" 404 1703
Error when running with @bkawan's code:
<!DOCTYPE html>
<html lang="en">
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8">
<title>Page not found at /static/personal/css/bootstrap.min.css</title>
<meta name="robots" content="NONE,NOARCHIVE">
<style type="text/css">
html * { padding:0; margin:0; }
body * { padding:10px 20px; }
body * * { padding:0; }
body { font:small sans-serif; background:#eee; }
body>div { border-bottom:1px solid #ddd; }
h1 { font-weight:normal; margin-bottom:.4em; }
h1 span { font-size:60%; color:#666; font-weight:normal; }
table { border:none; border-collapse: collapse; width:100%; }
td, th { vertical-align:top; padding:2px 3px; }
th { width:12em; text-align:right; color:#666; padding-right:.5em; }
#info { background:#f6f6f6; }
#info ol { margin: 0.5em 4em; }
#info ol li { font-family: monospace; }
#summary { background: #ffc; }
#explanation { background:#eee; border-bottom: 0px none; }
</style>
</head>
<body>
<div id="summary">
<h1>Page not found <span>(404)</span></h1>
<table class="meta">
<tr>
<th>Request Method:</th>
<td>GET</td>
</tr>
<tr>
<th>Request URL:</th>
<td>http://127.0.0.1:8000/static/personal/css/bootstrap.min.css</td>
</tr>
</table>
</div>
<div id="info">
<p>'personal/css/bootstrap.min.css' could not be found</p>
</div>
<div id="explanation">
<p>
You're seeing this error because you have <code>DEBUG = True</code> in
your Django settings file. Change that to <code>False</code>, and Django
will display a standard 404 page.
</p>
</div>
</body>
</html>
Please check that you have done the first steps described below.
The official Django 1.9 documentation carries this warning:
This document is for an insecure version of Django that is no longer
supported. Please upgrade to a newer release!
Try to move to a newer release, since Django 1.9 is no longer supported.
Configuring static files
Make sure that django.contrib.staticfiles is included in your INSTALLED_APPS.
Check that your folder structure follows the layout personal/static/personal/css/bootstrap.min.css.
You do not need to add the line below to STATICFILES_DIRS unless the personal folder is at the same level as BASE_DIR:
os.path.join(BASE_DIR,'personal')
Your code below is fine if you have a personal app, a static folder inside that app, and a personal folder inside static, i.e. personal/static/personal:
{% load staticfiles %}
<link rel="stylesheet" href="{% static 'personal/css/bootstrap.min.css' %}" type = "text/css"/>
If you get a Page not found error:
Check that bootstrap.min.css exists at the path personal/css/bootstrap.min.css inside the app's static folder.
Check the spelling as well.
Check that your app (personal in this case) is included in INSTALLED_APPS:
INSTALLED_APPS = [
'django.contrib.admin',
'django.contrib.auth',
'django.contrib.contenttypes',
'django.contrib.sessions',
'django.contrib.messages',
'django.contrib.staticfiles',
'personal'
]
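To make the lookup rule concrete: with Django's default static file finders, each installed app is searched under its own static/ subfolder. The sketch below is not Django code; it is a plain-Python illustration with a made-up helper name (app_static_candidates), and it assumes the apps sit directly under the project directory and skips the built-in django.* apps, which live elsewhere.

```python
from pathlib import PurePosixPath


def app_static_candidates(base_dir, installed_apps, relative_path):
    """List the per-app locations where a static file would be expected."""
    return [
        PurePosixPath(base_dir) / app / "static" / relative_path
        for app in installed_apps
        if not app.startswith("django.")  # simplification: only project-local apps
    ]


candidates = app_static_candidates(
    "/media/xxx/django tutorial/mysite",
    ["django.contrib.staticfiles", "personal"],
    "personal/css/bootstrap.min.css",
)
for path in candidates:
    print(path)
```

If the file on disk is not at one of the printed locations, or personal is missing from INSTALLED_APPS, the {% static %} URL will most likely return the 404 shown in the question.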
You should serve static files during development; see the section on serving static files during development in the Django documentation.
Edit your main urls.py (you'll find it in the same folder as settings.py):
from django.conf import settings
from django.conf.urls.static import static
urlpatterns = [
# ... the rest of your URLconf goes here ...
] + static(settings.STATIC_URL, document_root=settings.STATIC_ROOT)
Make sure that your app is listed in INSTALLED_APPS in your Django settings.
Remove /personal from your style sheet import. Instead, just have:
{% load static %}
<link rel="stylesheet" href="{% static 'css/bootstrap.min.css' %}" type = "text/css"/>
I am not sure whether you copied and pasted the code exactly, but the closing p tag in home.html is malformed.
I have been playing with ReportLab and Django all day today, and I finally have it working with the help of this SO question from earlier today: https://stackoverflow.com/questions/54565679/how-to-incorporate-reportlab-with-django-class-based-view/54566002#= The output is now produced as a PDF as expected. However, I can't get an image to display in the PDF. Nothing renders.
I have tried investigating my URLs and the absolute URLs in my settings.py file, with no success. I've noticed that my template doesn't pick up any of my static settings, yet the rest of my project works fine with the same references. I also found a very similar SO question, "html template to pdf with images", but I can't determine the actual fix from it.
My utils.py file:
from io import BytesIO
from django.http import HttpResponse
from django.template.loader import get_template
from xhtml2pdf import pisa

def render_to_pdf(template_src, context_dict={}):
    template = get_template(template_src)
    html = template.render(context_dict)
    result = BytesIO()
    pdf = pisa.pisaDocument(BytesIO(html.encode("ISO-8859-1")), result)
    if not pdf.err:
        return HttpResponse(result.getvalue(), content_type='application/pdf')
    return None
My HTML template file...
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<title>Title</title>
<style type="text/css">
body {
font-weight: 200;
font-size: 14px;
}
.header {
font-size: 20px;
font-weight: 100;
text-align: center;
color: #007cae;
}
.title {
font-size: 22px;
font-weight: 100;
/* text-align: right;*/
padding: 10px 20px 0px 20px;
}
.title span {
color: #007cae;
}
.details {
padding: 10px 20px 0px 20px;
text-align: left !important;
/*margin-left: 40%;*/
}
.hrItem {
border: none;
height: 1px;
/* Set the hr color */
color: #333; /* old IE */
background-color: #fff; /* Modern Browsers */
}
</style>
</head>
<body>
<img class="logo3" src="/book/author.svg" >
<div class='wrapper'>
<div class='header'>
<p class='title'>Invoice # </p>
</div>
<div>
<div class='details'>
Bill to: {{ customer_name }} <br/>
Amount: {{ amount }} <br/>
Date:
<hr class='hrItem' />
</div>
</div>
</body>
</html>
And the view....
class SomeDetailView(DetailView):
    model = YourModel
    template_name = 'pdf/invoice.html'

    def get_context_data(self, **kwargs):
        context = super(SomeDetailView, self).get_context_data(**kwargs)
        # add extra context if needed
        return context

    def render_to_response(self, context, **kwargs):
        pdf = render_to_pdf(self.template_name, context)
        return HttpResponse(pdf, content_type='application/pdf')
I am trying to get the author.svg image to appear on the PDF but no luck as of yet.
After a lot of trial and error I gave up on the SVG approach for now and used a PNG file instead. I found the answer in this SO question: "django - pisa : adding images to PDF output". It was as simple as referencing the image attribute of my model in the Django template.
I include a default location in my model. I experimented with a PNG and an SVG; the PNG worked but the SVG did not. Either way, this solves my issue.
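A different route, not the one used in the answer above, is xhtml2pdf's link_callback hook: a function taking (uri, rel) that translates the URLs found in the HTML into local file paths before the PDF is built. The sketch below only shows the translation step in plain Python. The factory name make_link_callback and the example directories are hypothetical; in a Django project the four arguments would normally come from STATIC_URL, STATIC_ROOT, MEDIA_URL and MEDIA_ROOT.

```python
import os


def make_link_callback(static_url, static_root, media_url, media_root):
    """Build a callback that maps URLs used in the template to local file paths."""
    def link_callback(uri, rel):
        # uri is the address found in the HTML; rel is the referencing document.
        for url_prefix, root in ((media_url, media_root), (static_url, static_root)):
            if uri.startswith(url_prefix):
                return os.path.join(root, uri[len(url_prefix):])
        # Anything else (for example an absolute http URL) is returned unchanged.
        return uri
    return link_callback


# Hypothetical directories, for illustration only.
callback = make_link_callback("/static/", "/srv/app/static", "/media/", "/srv/app/media")
print(callback("/media/logos/author.png", None))
```

The resulting function would then be passed as link_callback=callback to the pisa call in render_to_pdf. Whether this also helps with SVG files is not something the thread establishes; the answer above only reports success with PNG.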
I am trying to convert my HTML to a PDF. I followed this tutorial, https://www.codingforentrepreneurs.com/blog/html-template-to-pdf-in-django/, and got it working with hardcoded values. The problem is that I can't figure out how to dynamically incorporate my model and its attributes into this example.
Here are the steps I followed.
First I installed xhtml2pdf (which builds on ReportLab) into my environment:
pip install --pre xhtml2pdf
This worked.
Then I added a utils.py file to my project as instructed.
Then I copied this code into my utils.py file:
from io import BytesIO
from django.http import HttpResponse
from django.template.loader import get_template
from xhtml2pdf import pisa

def render_to_pdf(template_src, context_dict={}):
    template = get_template(template_src)
    html = template.render(context_dict)
    result = BytesIO()
    pdf = pisa.pisaDocument(BytesIO(html.encode("ISO-8859-1")), result)
    if not pdf.err:
        return HttpResponse(result.getvalue(), content_type='application/pdf')
    return None
Then I created an HTML file like the one below:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<title>Title</title>
<style type="text/css">
body {
font-weight: 200;
font-size: 14px;
}
.header {
font-size: 20px;
font-weight: 100;
text-align: center;
color: #007cae;
}
.title {
font-size: 22px;
font-weight: 100;
/* text-align: right;*/
padding: 10px 20px 0px 20px;
}
.title span {
color: #007cae;
}
.details {
padding: 10px 20px 0px 20px;
text-align: left !important;
/*margin-left: 40%;*/
}
.hrItem {
border: none;
height: 1px;
/* Set the hr color */
color: #333; /* old IE */
background-color: #fff; /* Modern Browsers */
}
</style>
</head>
<body>
<div class='wrapper'>
<div class='header'>
<p class='title'>Invoice # </p>
</div>
<div>
<div class='details'>
Bill to: {{ customer_name }} <br/>
Amount: {{ amount }} <br/>
Date:
<hr class='hrItem' />
</div>
</div>
</body>
</html>
I also created the necessary URL.
Then I created a view similar to the one below:
import datetime
from django.http import HttpResponse
from django.views.generic import View
from yourproject.utils import render_to_pdf  # created in step 4

class GeneratePdf(View):
    def get(self, request, *args, **kwargs):
        data = {
            'today': datetime.date.today(),
            'amount': 39.99,
            'customer_name': 'Cooper Mann',
            'order_id': 1233434,
        }
        pdf = render_to_pdf('pdf/invoice.html', data)
        return HttpResponse(pdf, content_type='application/pdf')
When I click the link to this view, the output looks just as the tutorial says.
However, I'm stumped as to how to get my model to display in this format. I tried a DetailView, and then I get the data but no PDF. I've also searched many other places and can't find an example that would let me get my model dynamically and pull in attributes as needed. Thanks in advance for any help or pointers.
If you want to use DetailView, I think you can do it like this:
class SomeDetailView(DetailView):
    model = YourModel
    template_name = 'pdf/invoice.html'

    def get_context_data(self, **kwargs):
        context = super(SomeDetailView, self).get_context_data(**kwargs)
        # add extra context if needed
        return context

    def render_to_response(self, context, **kwargs):
        pdf = render_to_pdf(self.template_name, context)
        return HttpResponse(pdf, content_type='application/pdf')
Here I am overriding render_to_response to replace DetailView's default response with a PDF response. The context it receives comes from get_context_data, where you can add any extra context if needed.
I am using BeautifulSoup to clean up some HTML semantically, and I want to move all style, meta, and link tags into the head tag.
Here's the HTML I am working with:
<!DOCTYPE HTML>
<html xmlns="http://www.w3.org/1999/xhtml"
xmlns:v="urn:schemas-microsoft-com:vml"
xmlns:o="urn:schemas-microsoft-com:office:office">
<!--[if gte mso 9]><xml>
<o:OfficeDocumentSettings>
<o:AllowPNG/>
<o:PixelsPerInch>96</o:PixelsPerInch>
</o:OfficeDocumentSettings>
</xml><![endif]-->
<!--[if !mso]><!-- -->
<link href="https://fonts.googleapis.com/css?family=Didact+Gothic|Ubuntu" rel="stylesheet">
<!--<![endif]-->
<style type="text/css">
h1, h2, h3, h4, h5, h6 {
margin: 0;
padding: 0;
border: 0;
font-size: 100%;
font: inherit;
vertical-align: baseline;
}
</style>
<body>
<p>Hello, World</p>
</body>
</html>
Here is my Python method:
import re
from bs4 import BeautifulSoup

def cleanup_markup(html):
    soup = BeautifulSoup(html, "html.parser")
    tags = soup.find_all(['style', 'meta', 'link'])
    conditional_search = r"<!.*\[if(.*)\](.*\n)*(.*)endif\]-->"
    re_flags = re.MULTILINE | re.DOTALL
    search = re.findall(conditional_search, html, flags=re_flags)
    found = filter(lambda a: a not in map(str, tags), search)
    head_tag = soup.head or soup.new_tag('head')
    for tag in tags:
        if tag.name not in found:
            head_tag.append(tag.extract())
    if not soup.head:
        soup.html.insert(0, head_tag)
    return unicode(soup)  # Python 2; use str(soup) on Python 3
But every time the method above runs, the resulting markup looks like this:
<!DOCTYPE html>
<html>
<head>
<title>
</title>
</head>
<body>
<link href="https://fonts.googleapis.com/css?family=Didact+Gothic|Ubuntu" rel="stylesheet">\n<!--<![endif]-->
<p>Hello, World</p>
\n\n
<style type="text/css">
\nh1, h2, h3, h4, h5, h6 {\n\tmargin: 0;\n\tpadding: 0;\n\tborder: 0;\n\tfont-size: 100%;\n\tfont: inherit;\n\tvertical-align: baseline;\n}\n
</style>\n<!--[if gte mso 9]><xml>\n <o:OfficeDocumentSettings>\n <o:AllowPNG/>\n <o:PixelsPerInch>96</o:PixelsPerInch>\n </o:OfficeDocumentSettings>\n</xml><![endif]-->\n<!--[if !mso]><!== -->
</body>
</html>
I basically need to skip the conditional comments so that they stay in place, but BeautifulSoup keeps shifting things around in a weird way.
With the right parser, you get the "head" element for free. From there, all you need to do is remove the comments using the extract method.
In [10]: from bs4 import Comment
In [11]: from bs4 import BeautifulSoup
In [12]: soup = BeautifulSoup(html, "html5lib")
In [13]: for c in soup.find_all(text=lambda t: isinstance(t, Comment)):
...: c.extract()
...:
In [14]: soup
Out[14]:
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:v="urn:schemas-microsoft-com:vml"><head><link href="https://fonts.googleapis.com/css?family=Didact+Gothic|Ubuntu" rel="stylesheet"/>
<style type="text/css">
h1, h2, h3, h4, h5, h6 {
margin: 0;
padding: 0;
border: 0;
font-size: 100%;
font: inherit;
vertical-align: baseline;
}
</style>
</head><body>
<p>Hello, World</p>
</body></html>
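Note that the loop above removes every comment, including the conditional ones the question wanted to keep. One way to narrow it down, sketched here under the assumption that bs4 Comment objects can be treated as plain strings, is a small predicate that recognises conditional-comment markers; the name is_conditional_comment is made up for this example.

```python
import re

# Matches the "[if ...]" opener or the "[endif]" closer of a conditional comment.
CONDITIONAL_MARKER = re.compile(r"\[if\s[^\]]*\]|\[endif\]", re.IGNORECASE)


def is_conditional_comment(comment_text):
    """Return True if the comment text looks like part of a conditional comment."""
    return CONDITIONAL_MARKER.search(comment_text) is not None


print(is_conditional_comment("[if gte mso 9]><xml></xml><![endif]"))
print(is_conditional_comment(" an ordinary note "))
```

In the loop, the extract call would then be guarded with "if not is_conditional_comment(c):" so that only ordinary comments are removed.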