Email address is not being parsed in BeautifulSoup - Django

I am scraping a webpage using BeautifulSoup and requests in Python 3.5. The problem is that when I try to parse the email addresses in the <p> tags, I get [email protected]. I have tried other links, but to no avail. The cf_email tag is not even there. I am parsing with this:
email_addresses = []
for email_address in detail.findAll('p'):
    email_addresses.append(email_address.text)
information = {}
information['email'] = email_addresses
The emails are in the <p> tags. This is the HTML I see when inspecting the element:
<div class="email">
<p>test1@hotmail.com</p>
<p>test2@yahoo.com</p>
<p>test3@yahoo.com</p>
</div>
When I open the page source, I notice this:
<p>[email protected]</p>

The page does not actually contain the email addresses. This is probably done as a protection against spammers: there will be some JavaScript that replaces the holding text with the actual value.
In other words, the site is trying to stop people from doing exactly what you are trying to do.
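The [email protected] placeholder is typical of Cloudflare's email obfuscation, which stores the real address XOR-encoded in a data-cfemail attribute on the protected element. If that attribute is present in the raw source (the question suggests it may not be, so treat this as an assumption about the page), the address can be recovered without executing any JavaScript. A minimal sketch:
def decode_cfemail(encoded):
    # encoded is the hex string from a data-cfemail attribute;
    # the first byte is the XOR key, the remaining bytes encode the address
    key = int(encoded[:2], 16)
    return ''.join(chr(int(encoded[i:i + 2], 16) ^ key)
                   for i in range(2, len(encoded), 2))
If the attribute really is absent, the usual fallback is to render the page with a JavaScript-capable tool such as Selenium and parse the rendered HTML instead.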

Related

Anchor tag not acting properly; the full string is shown inside the cfmail content

I've written functionality for a send-email process. I've set the mail server details in the admin settings, and written the code below for sending email. I can successfully send and receive email to my Gmail account. But I've added a paragraph with an anchor tag whose value is "click me".
<cfoutput>
<cfmail from="test@gmail.com" to="test@gmail.com" username="myemail@gmail.com" password="mypass" port="587" subject="Chaange title">
<p> I'm from test link click Me 2! </p>
</cfmail>
</cfoutput>
The issue is that my email is not received with "click me" as a link; instead it displays the entire HTML of the anchor tag (please refer to my email content image).
Note: I've already tried cfsavecontent too, but it didn't help.
Could anyone help with this? Why does it happen? Thanks in advance.
Add type="html" to your cfmail tag. cfmail sends plain text by default, so the markup is displayed literally; type="html" indicates to the end user's email client that the message should be rendered as an HTML page instead of plain text.

Is it possible to read the tweet text of a tweet URL without the Twitter API?

I am using Goose to read the title/text body of an article from a URL. However, this does not work with a Twitter URL, I guess due to the different HTML tag structure. Is there a way to read the tweet text from such a link?
One such example of a tweet is the following:
https://twitter.com/UniteAlbertans/status/899468829151043584/photo/1
NOTE: I know how to read tweets through the Twitter API. However, I am not interested in that. I just want to get the text by parsing the HTML source, without all the Twitter authentication hassle.
Scrape yourself
Open the URL of the tweet, pass it to the HTML parser of your choice and extract the XPaths you are interested in.
Scraping is discussed in: http://docs.python-guide.org/en/latest/scenarios/scrape/
XPaths can be obtained by right-clicking the element you want, selecting "Inspect", then right-clicking the highlighted line in the inspector and selecting "Copy" > "Copy XPath". This works if the structure of the site is always the same; otherwise, choose properties that uniquely identify the object you want.
In your case:
//div[contains(@class, 'permalink-tweet-container')]//strong[contains(@class, 'fullname')]/text()
will get you the name of the author and
//div[contains(@class, 'permalink-tweet-container')]//p[contains(@class, 'tweet-text')]//text()
will get you the content of the Tweet.
The full working example:
from lxml import html
import requests
page = requests.get('https://twitter.com/UniteAlbertans/status/899468829151043584')
tree = html.fromstring(page.content)
tree.xpath('//div[contains(@class, "permalink-tweet-container")]//p[contains(@class, "tweet-text")]//text()')
results in:
['Breaking:\n10 sailors missing, 5 injured after USS John S. McCain collides with merchant vessel near Singapore...\n\n', 'https://www.', 'washingtonpost.com/world/another-', 'us-navy-destroyer-collides-with-a-merchant-ship-rescue-efforts-underway/2017/08/20/c42f15b2-8602-11e7-9ce7-9e175d8953fa_story.html?utm_term=.e3e91fff99ba&wpisrc=al_alert-COMBO-world%252Bnation&wpmk=1', u'\xa0', u'\u2026', 'pic.twitter.com/UiGEZq7Eq6']
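Note that //text() returns the tweet as a list of fragments, as shown above. If you want a single string, you can simply join them:
fragments = tree.xpath('//div[contains(@class, "permalink-tweet-container")]//p[contains(@class, "tweet-text")]//text()')
print(''.join(fragments))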

Using BeautifulSoup to print specific information that has a <div> tag

I'm still new to using BeautifulSoup to scrape information from a website. With this piece of code I'm specifically trying to grab a value (and others like it) and display it back to me, the user, in a more condensed, easy-to-read form. Below is a screenshot I took, with the div and class I am trying to parse highlighted:
This is the code I'm using:
import urllib2
from bs4 import BeautifulSoup
a = "http://forecast.weather.gov/MapClick.php?lat=39.32196712788175&lon=-82.10190859830237&site=all&smap=1#.VQM_kOGGP7l"
website = urllib2.urlopen(a)
html = website.read()
soup = BeautifulSoup(html)
x = soup.find_all("div", {"class": "point-forecast-icons-low"})
print x
However, once it runs, it returns []. I get no errors, but nothing happens either. At first I thought it couldn't find anything inside the <div> I told it to search for, but in that case I would usually get back None, saying nothing was found. My best guess now is that the code isn't reaching into the div to pull out the content inside it.
You are getting [] because point-forecast-icons-low is not a class of the div; rather, it's a class of the p tag. Try this instead:
x = soup.find_all("p", attrs={"class": "point-forecast-icons-low"})
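To show only the readable text of each match, you could then strip the markup with get_text(); a minimal sketch following the question's Python 2 / urllib2 setup:
for p in soup.find_all("p", attrs={"class": "point-forecast-icons-low"}):
    print(p.get_text(strip=True))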

How to get large numbers of href links from very large website content with BeautifulSoup

I am parsing a large HTML website that has over 1000 href links. I am using BeautifulSoup to get all the links, but when I run the program a second time, BeautifulSoup cannot handle it (finding all the specific 'td' tags). How can I overcome this problem? Though I can load the HTML page with urllib, not all the links can be printed. When I use find with a single 'td' tag, it passes.
Tag = self.__Page.find('table', {'class': 'RSLTS'}).findAll('td')
print Tag
for a in Tag.find('a', href=True):
    print "found", a['href']
It works when written as:
Tag = self.__Page.find('table', {'class': 'RSLTS'}).find('td')
print Tag
for a in Tag.find('a', href=True):
    print "found", a['href']
You need to iterate over them:
tds = self.__Page.find('table', class_='RSLTS').find_all('td')
for td in tds:
    a = td.find('a', href=True)
    if a:
        print "found", a['href']
Although I'd just use lxml if you have a ton of stuff:
root.xpath('//table[contains(@class, "RSLTS")]//td/a/@href')
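For reference, here is a self-contained sketch of the lxml route; the URL is a placeholder, and the table class follows the question:
import requests
from lxml import html

page = requests.get('http://example.com/results')  # placeholder URL
root = html.fromstring(page.content)
# //td keeps the query robust to intervening <tr>/<tbody> elements
for href in root.xpath('//table[contains(@class, "RSLTS")]//td/a/@href'):
    print("found " + href)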

Embedding Issuu

I need to embed an Issuu document inside a website. The website administrator should be allowed to decide which document is displayed on the frontend.
This is an easy task using the embed link on the Issuu page. But I need to customize some options, for instance disable sharing, set the dimensions and so on. I cannot rely on the administrators going through this process every time they need to change the document.
I can easily customize the Issuu embed code to my taste, and all I need is the document id. Unfortunately, the id is not included in the Issuu page for the document. For instance, the id for this random link happens to be 110209071155-d0ed1d10ac0b40dda80dad24166a76ee, which is nowhere to be found, neither in the URL nor easily inside the page. You have to dig into the embed code to find it.
I thought the Issuu API could allow me to get a document id given its URL, but I cannot find anything like this. The closest match is the search API, but if I search for the exact name of the document I get only one match, for a different document!
Is there some easy way to embed a document knowing only its URL? Or an easy way for a non-techie person to find a document id in the page?
Unfortunately, the only way for you to customize it is to pay for the service, which is $39 per month =/.
You can force a fullscreen mode without ads by using:
<body style="margin:0px;padding:0px;overflow:hidden">
<iframe src="YOUR ISSUU EMBED" frameborder="0" style="overflow:hidden;height:105%;width:105%;position:absolute;" height="100%" width="100%"></iframe>
</body>
You can of course also embed stacks, but that isn't shown on the Issuu site. This is the code (it's old code, but it works):
<iframe src="http://static.issuu.com/widgets/shelf/index.html?folderId=FOLDERID&theme=theme1&rows=1&thumbSize=large&roundedCorners=true&showTitle=true&showAuthor=false&shadow=true&effect3d=true" frameborder="0" marginwidth="0" marginheight="0" scrolling="no" width="100%" height="200"></iframe>
FOLDERID is the 36-character id that appears in the address bar when you open a stack (example: https://issuu.com/username/stacks/FOLDERID). When you substitute it into the code, you must paste the characters in the 8-4-4-4-12 format, with a - between the groups. And voilà, it works.
You can change the theme and other options in the code.
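Since the 8-4-4-4-12 format trips people up, here is a tiny illustrative Python helper (hypothetical, not part of any Issuu API) that inserts the hyphens into a 32-character folder id:
def hyphenate(folder_id):
    # Split a 32-character id into 8-4-4-4-12 groups joined by hyphens
    parts = (folder_id[:8], folder_id[8:12], folder_id[12:16],
             folder_id[16:20], folder_id[20:])
    return '-'.join(parts)

print(hyphenate('0123456789abcdef0123456789abcdef'))  # -> 01234567-89ab-cdef-0123-456789abcdef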
The Document ID is found in the HTML source of every document. It is in the og:video meta property.
<meta property="og:video" content="http://static.issuu.com/webembed/viewers/style1/v2/IssuuReader.swf?mode=mini&documentId=XXXXXXXX-XXXXXXXXXXXXX&pageNumber=0">
You can handle it easily using the DOMDocument and DOMXPath PHP classes.
Here is a how-to using PHP:
// Your document URL
$url = 'https://issuu.com/proyectotres/docs/proyecto_3_edicion_135';
// Turn off errors, loads the URL as an object and then turn errors on again
libxml_use_internal_errors(true);
$dom = DomDocument::loadHTMLFile($url);
libxml_use_internal_errors(false);
// DomXPath helps find the <meta property="og:video" content="http://hereyoucanfindthedocumentid?documentId=xxxxx-xxxxxxx"/>
$xpath = new DOMXPath($dom);
$meta = $xpath->query("//html/head/meta[@property='og:video']");
// Get the content attribute of the <meta> node and parse its query
$vars = [];
parse_str(parse_url($meta[0]->getAttribute('content'))['query'], $vars);
// Ready. The document ID is here:
$docID = $vars['documentId'];
// You can print it:
echo $docID;
You can try it with the URL of your own Issuu document.
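If PHP is not your environment, the same og:video trick can be sketched in Python with requests and lxml (assuming, as above, that the document page exposes that meta tag):
import requests
from lxml import html
from urllib.parse import urlparse, parse_qs

url = 'https://issuu.com/proyectotres/docs/proyecto_3_edicion_135'
tree = html.fromstring(requests.get(url).content)
content = tree.xpath("//meta[@property='og:video']/@content")
if content:
    # the document id travels in the query string of the content URL
    query = parse_qs(urlparse(content[0]).query)
    print(query['documentId'][0])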
You can use the Issuu URL of your document to complete this iframe :
<iframe width="100%" height="283" style="display: block; margin-left: auto; margin-right: auto;" src="https://e.issuu.com/issuu-reader3-embed-files/latest/twittercard.html?u=nantucketchamber&d=program-update1&p=1" frameborder="0" allowfullscreen="allowfullscreen"></iframe>
You just need to replace "nantucketchamber" with the user name and "program-update1" with the file name from the Issuu URL (for this example the URL is https://issuu.com/nantucketchamber/docs/program-update1).