Unrequested code added to index.html when building the book

When I run Build Book, index.html gains "automatic contributions" that I did not ask for. An example of such a "contribution":
</div>
<div id="prerequisites" class="section level1">
<h1><span class="header-section-number"> 1</span> Prerequisites</h1>
<p>This is a <em>sample</em> book written in <strong>Markdown</strong>. You can use anything that
Pandoc’s Markdown supports, e.g., a math equation <span class="math inline">\(a^2 + b^2 = c^2\)
</span>
...
</p>class="uri">https://yihui.org/tinytex/</a>.</p>
This "gift" shows up in the ePUB output, but not in the HTML output.
It seems wrong to have to edit index.html by hand to remove this "addition".
Has anyone been able to avoid this effect?
Thank you.

Related

Can't parse Google Finance html

I'm trying to scrape some stock prices, and variations, from Google Finance using python3 but I just can't figure out if there's something wrong with the page, or my regex. I'm thinking that either the svg graphic or the many script tags throughout the page are making the regex parsers fail to properly analyze the code.
I have tested this regex on many online regex builders/testers and it looks ok. As ok as a regex designed for HTML can be, anyway.
The Google Finance page I'm testing this out on is https://www.google.com/finance?q=NYSE%3AAAPL
And my python code is the following
import urllib.request
import re
page = urllib.request.urlopen('https://www.google.com/finance?q=NYSE%3AAAPL')
text = page.read().decode('utf-8')
m = re.search(r'id="price-panel.*>(\d*\d*\d\.\d\d)</span>.*\((-*\d\.\d\d%)\)', text, re.S)
print(m.groups())
It would extract the stock price and its percent variation.
I have also tried using python2 + BeautifulSoup, like so
soup.find(id='price-panel')
but it returns empty even for this simple query. This is especially why I'm thinking that there's something weird with the html.
And here's the most important bit of html that I'm aiming for
<div id="price-panel" class="id-price-panel goog-inline-block">
<div>
<span class="pr">
<span class="unchanged" id="ref_22144_l"><span class="unchanged">96.41</span><span></span></span>
</span>
<div class="id-price-change nwp goog-inline-block">
<span class="ch bld"><span class="down" id="ref_22144_c">-1.13</span>
<span class="down" id="ref_22144_cp">(-1.16%)</span>
</span>
</div>
</div>
<div>
<span class="nwp">
Real-time:
<span class="unchanged" id="ref_22144_ltt">3:42PM EDT</span>
</span>
<div class="mdata-dis">
<span class="dis-large"><nobr>NASDAQ
real-time data -
Disclaimer
</nobr></span>
<div>Currency in USD</div>
</div>
</div>
</div>
I'm wondering if any of you have encountered a similar problem with this page and/or can figure out if there's anything wrong with my code. Thanks in advance!
You might try a different URL that will be easier to parse, such as: http://www.google.com/finance/info?q=AAPL
The catch is that Google has said that using this API in an application for public consumption is against their Terms of Service. Maybe there is an alternative that Google will allow you to use?
I managed to get it working using BeautifulSoup, on the link posted originally.
Here's the bit of code I finally used:
import urllib2
from bs4 import BeautifulSoup
response = urllib2.urlopen('https://www.google.com/finance?q=NYSE%3AAAPL')
html = response.read()
soup = BeautifulSoup(html, "lxml")
# First nested span inside the panel holds the price
aaplPrice = soup.find(id='price-panel').div.span.span.text
# Second span of the change block holds "(-1.16%)"; strip the parentheses
aaplVar = soup.find(id='price-panel').div.div.span.find_all('span')[1].string.split('(')[1].split(')')[0]
aapl = aaplPrice + ' ' + aaplVar
I couldn't get it working with BeautifulSoup before because I was actually trying to parse the table on this page, https://www.google.com/finance?q=NYSE%3AAAPL%3BNYSE%3AGOOG, not the one I posted.
Neither method described in my question worked on that page.
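For comparison, here is a minimal regex sketch run against the HTML excerpt from the question rather than the live page. It assumes the price and percent-change spans always carry ids of the form ref_<digits>_l and ref_<digits>_cp, which is only inferred from that one excerpt; the names SAMPLE, PRICE_RE, CHANGE_RE and extract_quote are made up for illustration.

```python
import re

# Excerpt of the price panel copied from the question (trimmed).
SAMPLE = '''
<div id="price-panel" class="id-price-panel goog-inline-block">
<div>
<span class="pr">
<span class="unchanged" id="ref_22144_l"><span class="unchanged">96.41</span><span></span></span>
</span>
<div class="id-price-change nwp goog-inline-block">
<span class="ch bld"><span class="down" id="ref_22144_c">-1.13</span>
<span class="down" id="ref_22144_cp">(-1.16%)</span>
</span>
</div>
</div>
</div>
'''

# Raw strings keep the backslashes intact; the lazy .*? stops at the
# first number after the id instead of running to the end of the page.
# Assumption: ids look like ref_<digits>_l and ref_<digits>_cp.
PRICE_RE = re.compile(r'id="ref_\d+_l">.*?(\d+\.\d+)<', re.S)
CHANGE_RE = re.compile(r'id="ref_\d+_cp">\((-?\d+\.\d+%)\)<')


def extract_quote(html):
    """Return (price, percent_change) as strings, or None if either is missing."""
    price = PRICE_RE.search(html)
    change = CHANGE_RE.search(html)
    if price is None or change is None:
        return None
    return price.group(1), change.group(1)


print(extract_quote(SAMPLE))
```

Anchoring on the ids keeps each pattern short and avoids a single greedy match across the whole document, but it will still break whenever the page markup changes.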

Mako inheriting from multiple files

I have a pyramid application with multiple views each depending on a single mako template. The views are quite complicated and bug free, so I don't want to split or merge views, and by extension, the corresponding templates.
However, I would like a single view to represent all the others. Merging all the pyramid views and templates is practically not an option.
For example, I have a login view & template and a signup view & template. Now I want my root page to contain both of them. Both login and signup inherit from base.mak, which contains common scripts and style sheet imports. The following is a pictorial representation of the mako import structure I want.
       base.mak
       /      \
login.mak    signup.mak
       \      /
       root.mak
Alternatively, I tried chaining them as such:
base -> login -> signup -> root
However, I think that the views no longer talk to their respective templates.
My problem comes in when I do the 3rd chain (login.mak -> signup). I'll post analogous and extract code below, since my full code is a bit long (If more code is needed, feel free to shout).
base.mak:
<!DOCTYPE HTML>
<html lang="en">
<head>
<meta charset="utf-8">
<title>
${next.title()}
</title>
#Imports
${next.head()}
</head>
<body>
<div id = "content">
${next.body()}
</div>
</body>
</html>
login.mak:
<%inherit file="base.mak"/>
<%def name="title()">
${next.title()}
</%def>
<%def name="head()">
${next.head()}
</%def>
<div id="login">
<div id="message">
${sMessage}
</div>
<div id="form">
<form action="${url}" method="post"> <--- url returned in views.py
...
</div>
${next.body()}
signup.mak:
<%inherit file="login.mak"/>
<%def name="title()">
</%def>
<%def name="head()">
</%def>
<div id="box">
...
</div>
My problem is that the url returned from my views is undefined when I try to inherit as above.
Then of course, if I get this working, adding base.mak to inherit from signup should be trivial.
I assume that there is a simple fix for this, but I can't find an example/explanation on how to do this in pyramid, where the templates actually do stuff.
Alternatively, Is there another way to bring together multiple pyramid views and templates into a single view?
Ok, I figured it out. One has to use mako's <%include/>, and then there is no complicated inheritance structure. So, now my files look like this:
root.mak
<%inherit file="base.mak"/>
<%def name="title()">
Welcome
</%def>
<%def name="head()">
</%def>
<%include file="login.mak"/>
<%include file="signup.mak"/>
login.mak:
<%inherit file="base.mak"/>
<%def name="title()">
</%def>
<%def name="head()">
<link rel="stylesheet" type="text/css" href="${request.static_url(...
</%def>
<div id="login">
<div id=".....
</div>
and the same structure with signup.mak. base.mak still looks the same as in the question above.
Now, if you're using pyramid (I assume another framework will work the same), and you have views that receive and pass information from forms for example, then turn them into normal functions (without @view_config(renderer='path/file.mak')) and place their functionality into the parent view function, in my case root. In other words:
@view_config(renderer='pyramidapp:templates/root.mak',
             context=Root,
             name="")
@forbidden_view_config(renderer='pyramidapp:templates/root.mak')
def root(self):
    xLoginRet = login(self)
    xSignupRet = signup(self)
    # logic and functionality for both, return stuff to go to base.mak
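The "turn the child views into normal functions" step can be sketched in plain Python without pyramid. Everything here is hypothetical: the request is a plain dict, and the keys base_url, login_url, signup_url and sMessage stand in for whatever the real views return to their templates.

```python
def login(request):
    """Former login view, now a plain function returning template values."""
    return {"login_url": request["base_url"] + "/login", "sMessage": ""}


def signup(request):
    """Former signup view, reduced to a plain function in the same way."""
    return {"signup_url": request["base_url"] + "/signup"}


def root(request):
    """Parent view: merge what both child functions return into one dict."""
    context = {}
    context.update(login(request))
    context.update(signup(request))
    return context


# Hypothetical stand-in for a real request object.
fake_request = {"base_url": "http://example.com"}
print(root(fake_request))
```

Since both dicts end up in a single template namespace, the child functions need distinct key names (login_url and signup_url here instead of two values both called url).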

How can I use non-ASCII characters?

I am using Scrapy and XPath to parse a website in Russian.
In this topic, alecxe showed me how to construct the XPath expression to get the values. However, I don't understand how to handle the case where Param1_name is in Russian.
Here is the xpath expression:
//*[text()="Param1_name_in_russian"]/following-sibling::text()
Html snippet:
<div class="obj-params">
<div class="wrap">
<div class="obj-params-col" style="min-width:50%;">
<p>
<b>Param1_name_in_russian</b>" Param1_value"</p>
<p>
<strong>Param2_name_in_russian</strong>" Param2_value</p>
<p>
<strong>Param3_name_in_russian</strong>" Param3_value"</p>
</div>
</div>
<div class="wrap">
<div class="obj-params-col">
<p>
<b>Param4_name_in_russian</b>Param4_value</p>
<div class="inline-popup popup-hor left">
<b>Param5_name</b>
<a target="_blank" href="link">Param5_value</a></div></div>
EDITED based on comments
I assume I didn't state the question properly, since none of the suggested solutions worked for me: when I tested the suggested XPath expressions in the Scrapy console, the output was empty. So here is more detailed information about the website I need to parse:
link to the web-site: link to real-estate web site
screenshot of what I need to parse:
Consider declaring your encoding at the beginning of the file as latin-1. See the documentation for a thorough explanation as to why.
I'll be using lxml instead of Scrapy below, but the logic is the same.
Code:
#!/usr/bin/env python
# -*- coding: latin-1 -*-
from lxml import html
markup = """<div class="obj-params">
<div class="wrap">
<div class="obj-params-col" style="min-width:50%;">
<p>
<b>Некий текст</b>" Param1_value"</p>
<p>
<strong>Param2_name_in_russian</strong>" Param2_value</p>
<p>
<strong>Param3_name_in_russian</strong>" Param3_value"</p>
</div>
</div>
<div class="wrap">
<div class="obj-params-col">
<p>
<b>Param4_name_in_russian</b>Param4_value</p>
<div class="inline-popup popup-hor left">
<b>Param5_name</b>
<a target="_blank" href="link">Param5_value</a></div></div>"""
tree = html.fromstring(markup)
pone_val = tree.xpath(u"//*[text()='Некий текст']/following-sibling::text()")
print pone_val
Result:
['" Param1_value"']
[Finished in 0.5s]
Note that since this is a unicode string, the u at the beginning of the XPath is necessary, same as @warwaruk's comment in your question.
Let us know if this helps.
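As a rough cross-check that needs neither lxml nor Scrapy, the same lookup can be sketched with the standard library's xml.etree.ElementTree under Python 3. This swaps in a different technique from the XPath above: ElementTree has no following-sibling axis, so the sketch reads the element's tail text instead. It assumes the excerpt is well-formed XML (real pages usually are not, and need a lenient HTML parser), and the helper name text_after_label is made up. In Python 3 string literals are already Unicode and source files default to UTF-8, so no u prefix or coding line is needed.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed excerpt of the markup used in the answer above.
MARKUP = """<div class="obj-params-col">
<p><b>Некий текст</b>" Param1_value"</p>
<p><strong>Param2_name_in_russian</strong>" Param2_value"</p>
</div>"""


def text_after_label(root, label):
    """Return the text directly following the first element whose text equals label."""
    for element in root.iter():
        if element.text == label:
            # .tail is the text between this element's end tag and the next tag
            return element.tail
    return None


tree = ET.fromstring(MARKUP)
value = text_after_label(tree, "Некий текст")
print(value)
```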
EDIT:
Based on the site's markup, there's actually a better way to get the values. Again, using lxml and not Scrapy since the difference between the two here is just .extract() anyway. Basically, check my XPath for the name, room, square, and floor.
import requests as rq
from lxml import html
url = "http://www.lun.ua/%D0%BF%D1%80%D0%BE%D0%B4%D0%B0%D0%B6%D0%B0-%D0%BA%D0%B2%D0%B0%D1%80%D1%82%D0%B8%D1%80-%D0%BA%D0%B8%D0%B5%D0%B2"
r = rq.get(url)
tree = html.fromstring(r.text)
divs = tree.xpath("//div[@class='obj-left']")
for div in divs:
    name = div.xpath("./h3/span/a/text()")[0]
    details = div.xpath(".//div[@class='obj-params-col'][1]")[0]
    room = details.xpath("./p[1]/text()[last()]")[0]
    square = details.xpath("./p[2]/text()[last()]")[0]
    floor = details.xpath("./p[3]/text()[last()]")[0]
    print name.encode("utf-8")
    print room.encode("utf-8")
    print square.encode("utf-8")
    print floor.encode("utf-8")
Not all of these print correctly on my end (I get some [Decode error - output not utf-8]). However, I believe that, encoding aside, this approach is much better scraping practice overall.
Let us know what you think.

Trouble accessing attribute after using BeautifulSoup's findAll

I'm trying to scrape sites like this one on the BBC website to grab the relevant parts of the programme listing, and I've just started using BeautifulSoup to do this.
The parts of interest start with sections like:
<li about="/programmes/p013zzsl#segment" class="segment track" id="segmentevent-p013zzsm" typeof="po:MusicSegment">
<li about="/programmes/p014003v#segment" class="segment speech alt" id="segmentevent_p014003w" typeof="po:SpeechSegment">
What I've done so far is open the HTML as soup and then use soup.findAll(typeof=['po:MusicSegment', 'po:SpeechSegment']) to get a ResultSet of the parts I'm interested in, in the order in which they appear.
What I then want to do is check whether a section refers to po:MusicSegment or po:SpeechSegment in HTML that looks like:
<li about="/programmes/p01400m9#segment" class="segment track" id="segmentevent-p01400mb" typeof="po:MusicSegment"> <span class="artist-image"> <span class="depiction" rel="foaf:depiction"><img alt="" height="63" src="http://static.bbci.co.uk/programmes/2.54.3/img/thumbnail/artists_default.jpg" width="112"/></span> </span> <script type="text/javascript"> window.programme_data.tracklist.push({ segment_event_pid : "p01400mb", segment_pid : "p01400m9", playlist : "http://www.bbc.co.uk/programmes/p01400m9.emp" }); </script> <h3> <span rel="mo:performer"> <span class="artist no-image" property="foaf:name" typeof="mo:MusicArtist">Mala</span> </span> <span class="title" property="dc:title">Calle F</span> </h3></li>
I want to access the typeof attribute associated with <li>, but if this chunk of HTML (as a BS4 tag) is called section and I enter section.li, it returns None.
Note that if I do section.img instead, I get something back:
<img alt="" height="63" src="http://static.bbci.co.uk/programmes/2.54.3/img/thumbnail/artists_default.jpg" width="112"/>
and I could then do, e.g. section.img['height'] to get back u'63'
What I want is something analogous for the section.li part, so section.li['typeof'] to give me po:MusicSegment or po:SpeechSegment
Of course, I could simply convert each result to text and then do a simple string search, but searching by attribute seems more elegant.
I'd iterate over the list returned by findAll:
soup = BeautifulSoup('<li about="/programmes/p013zzsl#segment" class="segment track" id="segmentevent-p013zzsm" typeof="po:MusicSegment"><li about="/programmes/p014003v#segment" class="segment speech alt" id="segmentevent_p014003w" typeof="po:SpeechSegment">')
for elem in soup.findAll(typeof=['po:MusicSegment', 'po:SpeechSegment']):
    print elem['typeof']
returns
po:MusicSegment
po:SpeechSegment
and then conditionally perform your other tasks:
if elem['typeof'] == 'po:MusicSegment':
    do.something()
elif elem['typeof'] == 'po:SpeechSegment':
    do.something_else()
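The underlying point is that each item in the result is already the li tag itself, so its typeof attribute is read from the item directly; section.li looks for an li nested inside it, which would explain the None. The same idea can be sketched without BeautifulSoup using the standard library's html.parser. This is a swapped-in technique, not the answer's method, and the class name SegmentTypes is made up for illustration.

```python
from html.parser import HTMLParser


class SegmentTypes(HTMLParser):
    """Collect the typeof attribute of every <li> that is a music or speech segment."""

    WANTED = {"po:MusicSegment", "po:SpeechSegment"}

    def __init__(self):
        super().__init__()
        self.types = []

    def handle_starttag(self, tag, attrs):
        if tag != "li":
            return
        # attrs is a list of (name, value) pairs for this start tag
        value = dict(attrs).get("typeof")
        if value in self.WANTED:
            self.types.append(value)


# The two <li> start tags from the question, closed for tidiness.
SAMPLE = (
    '<li about="/programmes/p013zzsl#segment" class="segment track" '
    'id="segmentevent-p013zzsm" typeof="po:MusicSegment"></li>'
    '<li about="/programmes/p014003v#segment" class="segment speech alt" '
    'id="segmentevent_p014003w" typeof="po:SpeechSegment"></li>'
)

parser = SegmentTypes()
parser.feed(SAMPLE)
print(parser.types)
```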

CSS3 class match letter range [a-z]+?

Is it possible to write a CSS rule for any element whose class is "icon-" followed by a set of letters, but not numbers?
According to this article something like:
[class^='/icon\-([a-zA-Z]+)/'] {}
should work. But for some reason it doesn't.
In particular, I need to create a style definition for all elements like "icon-user", "icon-ok", etc., but not "icon-16" or "icon-32".
Is it possible at all?
CSS attribute selectors do not support regular expressions.
If you actually read that article closely:
Regex Matching Attribute Selectors
They don’t exist, but wouldn’t that be so cool? I’ve no idea how hard it would be to implement, or how to expensive to parse, but wouldn’t it just be the bomb?
Notice the first three words. They don't exist. That article is nothing more than a blog post lamenting the absence of regex support in CSS attribute selectors.
But if you're using jQuery, James Padolsey's :regex selector for jQuery may interest you. Your given CSS selector might look like this for example:
$(":regex(class, ^icon\-[a-zA-Z]+)")
I answered this one on facebook but thought I'd best share here too :)
I haven't tested this, so don't shoot me if it doesn't work :) but my guess would be to explicitly target elements that contain the word icon in the class name, and to instruct the browser not to include those classes containing numbers.
Example code:
div[class|=icon]:not(.icon-16, .icon-32, icon-64, icon-96) {.....}
Reference:
attribute selectors... (http://www.w3.org/TR/CSS2/selector.html#attribute-selectors):
[att|=val]
Represents an element with the att attribute, its value either being exactly "val" or beginning with "val" immediately followed by "-" (U+002D).
:not selector...
(http://kilianvalkhof.com/2008/css-xhtml/the-css3-not-selector/)
Hope this helps,
Waseem
I tested my previous solution and can confirm that it DOES NOT work (see comment from BoltClock). This however does:
OP: "In particular I need to create style definition for all elements like "icon-user", "icon-ok" etc but not "icon-16" or "icon-32""
The required CSS code would look something like this:
/* target every element where the class name begins with ( ^= ) "icon" but NOT those that begin with ( ^= ) "icon-16" or "icon-32" */
*[class^="icon"]:not([class^="icon-16"]):not([class^="icon-32"]) {.....}
or
/* target every element where the class name begins with ( ^= ) "icon" but NOT those that contain ( *= ) the number "16" or the number "32" */
*[class^="icon"]:not([class*="16"]):not([class*="32"]) { ...... }
Test code:
<!DOCTYPE html>
<html>
<head>
<style>
div{border:1px solid #999;margin-bottom:1em;height:100px;}
*[class|=icon]:not([class|=icon-16]):not([class|=icon-32]) {background:red;color:white;}
</style>
</head>
<body>
<div class="icon-something">
<h4>icon-something</h4>
<p><strong>IS</strong> targeted therefore background colour will be red</p>
</div>
<div class="icon-anotherthing">
<h4>icon-anotherthing</h4>
<p><strong>IS</strong> targeted therefore background colour will be red</p>
</div>
<div class="icon-16-install">
<h4>icon-16-install</h4>
<p>Is <strong>NOT</strong> targeted therefore no background colour</p>
</div>
<div class="icon-16-redirect">
<h4>icon-16-redirect</h4>
<p>Is <strong>NOT</strong> targeted therefore no background colour</p>
</div>
<div class="icon-16-login">
<h4>icon-16-login</h4>
<p>Is <strong>NOT</strong> targeted therefore no background colour</p>
</div>
<div class="icon-32-install">
<h4>icon-32-install</h4>
<p>Is <strong>NOT</strong> targeted therefore no background colour</p>
</div>
<div class="icon-32-redirect">
<h4>icon-32-redirect</h4>
<p>Is <strong>NOT</strong> targeted therefore no background colour</p>
</div>
<div class="icon-32-login">
<h4>icon-32-login</h4>
<p>Is <strong>NOT</strong> targeted therefore no background colour</p>
</div>
</body>
</html>
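To see which class values the ^= selector above would accept, its logic can be emulated in a few lines of Python. This is only a sketch of the matching rule, not something a browser runs, and the function name matches_selector is made up. It also shows one limitation: ^= tests the start of the whole class attribute, so an element whose class list does not begin with the icon class is not matched.

```python
def matches_selector(class_attr):
    """Emulate [class^="icon"]:not([class^="icon-16"]):not([class^="icon-32"])."""
    return (
        class_attr.startswith("icon")
        and not class_attr.startswith("icon-16")
        and not class_attr.startswith("icon-32")
    )


# Class attribute values from the test page, plus one multi-class case.
for name in ["icon-something", "icon-anotherthing", "icon-16-install",
             "icon-32-login", "btn icon-user"]:
    print(name, matches_selector(name))
```

The last case, "btn icon-user", is rejected because the attribute value starts with "btn", even though one of its classes is an icon class.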