I would like to scrape all the messages from the message board pages of Yahoo Finance for a specific stock.
Here is an example page:
http://finance.yahoo.com/mb/AMD/
I would like to get all the messages there.
If I click on the "Messages" button on the above link I go to this link:
http://finance.yahoo.com/mb/forumview/?&v=m&bn=d56b9fc4-b0f1-3e88-b1f5-e1c40c0067e7
which has more than 10 pages.
How can I use Python code to scrape this data by just knowing the stock symbol "AMD"?
The basics:
import mechanize
from bs4 import BeautifulSoup

br = mechanize.Browser()  # br is assumed to be a mechanize Browser
tickers = ['AMD', 'AAPL', 'GOOG']
for t in tickers:
    url = 'http://finance.yahoo.com/mb/' + t + '/'
    r = br.open(url)
    html = r.read()
    soup = BeautifulSoup(html, 'html.parser')
    print soup
The content you want is located within particular HTML tags. Use soup.find_all to get what you want. To move between pages, use Selenium.
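For example, a minimal sketch of that approach (the 'msg-body' class and the "Next" link text are assumptions, not Yahoo's real markup; inspect the live page for the actual selectors, since they change often):

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Firefox()
driver.get('http://finance.yahoo.com/mb/forumview/?&v=m&bn=d56b9fc4-b0f1-3e88-b1f5-e1c40c0067e7')

while True:
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    # 'msg-body' is a placeholder class name -- replace it with the one
    # you find in the browser inspector.
    for msg in soup.find_all('div', class_='msg-body'):
        print msg.get_text()
    try:
        # Click through to the next page until the link disappears.
        driver.find_element_by_link_text('Next').click()
    except Exception:
        break

driver.quit()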
I am using Goose to read the title/body text of an article from a URL. However, this does not work with a Twitter URL, I guess due to the different HTML tag structure. Is there a way to read the tweet text from such a link?
One such example of a tweet (shortened link) is as follows:
https://twitter.com/UniteAlbertans/status/899468829151043584/photo/1
NOTE: I know how to read tweets through the Twitter API. However, I am not interested in that. I just want to get the text by parsing the HTML source, without all the Twitter authentication hassle.
Scrape yourself
Open the URL of the tweet, pass it to the HTML parser of your choice, and extract the elements you are interested in via XPath.
Scraping is discussed in: http://docs.python-guide.org/en/latest/scenarios/scrape/
XPaths can be obtained by right-clicking the element you want, selecting "Inspect", then right-clicking the highlighted line in the inspector and selecting "Copy" > "Copy XPath". This works if the structure of the site is always the same; otherwise, choose properties that identify exactly the element you want.
In your case:
//div[contains(@class, 'permalink-tweet-container')]//strong[contains(@class, 'fullname')]/text()
will get you the name of the author and
//div[contains(@class, 'permalink-tweet-container')]//p[contains(@class, 'tweet-text')]//text()
will get you the content of the Tweet.
The full working example:
from lxml import html
import requests
page = requests.get('https://twitter.com/UniteAlbertans/status/899468829151043584')
tree = html.fromstring(page.content)
tree.xpath('//div[contains(@class, "permalink-tweet-container")]//p[contains(@class, "tweet-text")]//text()')
results in:
['Breaking:\n10 sailors missing, 5 injured after USS John S. McCain collides with merchant vessel near Singapore...\n\n', 'https://www.', 'washingtonpost.com/world/another-', 'us-navy-destroyer-collides-with-a-merchant-ship-rescue-efforts-underway/2017/08/20/c42f15b2-8602-11e7-9ce7-9e175d8953fa_story.html?utm_term=.e3e91fff99ba&wpisrc=al_alert-COMBO-world%252Bnation&wpmk=1', u'\xa0', u'\u2026', 'pic.twitter.com/UiGEZq7Eq6']
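Since the XPath returns the tweet text in fragments, you can stitch them back together with a plain join:

fragments = tree.xpath('//div[contains(@class, "permalink-tweet-container")]//p[contains(@class, "tweet-text")]//text()')
print ''.join(fragments)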
I primarily work in Python 2.7. I'm trying to extract the written content (body text) of hundreds of articles from their respective URLs. To simplify things, I've started by trying to extract the text from just one website in my list, and I've been able to do so successfully using BeautifulSoup4. My code looks like this:
import urllib2
from bs4 import BeautifulSoup

url = 'http://insertaddresshere'

request = urllib2.Request(url)
request.add_header('Accept-Encoding', 'utf-8')
response = urllib2.urlopen(request)

soup = BeautifulSoup(response, "html.parser")
texts = soup.find_all("p")
for item in texts:
    print item.text
This gets me the body text of a single article. I know how to iterate through a CSV file and write to a new one, but the sites I need to iterate through are all on different domains, so the HTML structure varies a lot. Is there any way to find the body text of multiple articles that use different element labels (here, "p") for said body text? Is it possible to do this with BeautifulSoup?
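There is no selector that works across arbitrary domains, but one common heuristic is to score each container by how much text its <p> children hold and keep the best one. A rough sketch (the scoring rule is an assumption, not tuned for any particular site):

import urllib2
from bs4 import BeautifulSoup

def extract_body_text(url):
    html = urllib2.urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')
    # Score every tag by the total length of text in its direct <p>
    # children; the article body usually wins by a wide margin.
    best_text, best_len = None, 0
    for tag in soup.find_all(True):
        text = ' '.join(p.get_text() for p in tag.find_all('p', recursive=False))
        if len(text) > best_len:
            best_text, best_len = text, len(text)
    return best_text

Libraries such as Goose (mentioned above) or newspaper automate this kind of heuristic for you.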
I am new to Python and trying my hand at building some small web crawlers. I am trying to write a program in Python 2.7 with BeautifulSoup that extracts all profile URLs from this page and the subsequent pages:
http://www.bda-findadentist.org.uk/pagination.php?limit=50&page=1
Here I am trying to scrape all the URLs that link to the details pages, such as this:
http://www.bda-findadentist.org.uk/practice_details.php?practice_id=6034&no=61881
However, I am lost as to how to make my program recognize these URLs. They are not inside a div with a class or ID; instead, they are wrapped in a td with a bgcolor attribute:
<td bgcolor="E7F3F1"><a href="practice_details.php?practice_id=6034&no=61881">View Details</a></td>
Please advise on how I can make my program identify these URLs and scrape them. I tried the following, but none of them worked:
for link in soup.select('td bgcolor=E7F3F1 a'):
for link in soup.select('td#bgcolor#E7F3F1 a'):
for link in soup.findAll('a[practice_id=*]'):
My full program is as follows:
import requests
from bs4 import BeautifulSoup

def bda_crawler(pages):
    page = 1
    while page <= pages:
        url = 'http://www.bda-findadentist.org.uk/pagination.php?limit=50&page=' + str(page)
        code = requests.get(url)
        text = code.text
        soup = BeautifulSoup(text)
        for link in soup.findAll('a[practice_id=*]'):
            href = "http://www.bda-findadentist.org.uk" + link.get('href')
            print (href)
        page += 1

bda_crawler(2)
Please help
Many thanks
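For what it's worth, bgcolor is an attribute rather than part of the tag name, so one way to match it in BeautifulSoup is a keyword argument to find_all. A sketch (untested against the live site):

from urlparse import urljoin  # Python 2; urllib.parse in Python 3

base = 'http://www.bda-findadentist.org.uk/'
# Match the <td> by its bgcolor attribute, then take the anchor inside it.
for td in soup.find_all('td', bgcolor='E7F3F1'):
    a = td.find('a', href=True)
    if a:
        print urljoin(base, a['href'])

The same match can also be written as a CSS attribute selector: soup.select('td[bgcolor="E7F3F1"] a').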
I am doing some web crawling with Scrapy. Currently it can fetch the start URL but does not crawl any further.
start_urls = ['https://cloud.cubecontentgovernance.com/retention/document_types.aspx']
allowed_domains = ['cubecontentgovernance.com']
rules = (
    Rule(LinkExtractor(allow=("document_type_retention.aspx?dtid=1054456",)),
         callback='parse_item', follow=True),
)
The link I want to extract, as it appears in the developer tools, is:
<a id="ctl00_body_ListView1_ctrl0_hyperNameLink" href="document_type_retention.aspx?dtid=1054456"> pricing </a>
The corresponding URL is https://cloud.cubecontentgovernance.com/retention/document_type_retention.aspx?dtid=1054456
So what should the allow field be? Thanks a lot.
When I try to open your start URL I get a login window.
Did you try printing response.body in a simple parse method for your start URL? I guess your Scrapy instance gets the same login window, which does not contain the URL you want to extract with the LinkExtractor.
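A quick way to check is a bare-bones spider that just logs the body of the start URL (a minimal sketch for that check):

import scrapy

class DebugSpider(scrapy.Spider):
    name = 'debug'
    start_urls = ['https://cloud.cubecontentgovernance.com/retention/document_types.aspx']

    def parse(self, response):
        # If this shows a login form instead of the document list,
        # the spider has to authenticate before any LinkExtractor
        # can see the links.
        self.logger.info(response.body[:2000])

Separately, note that allow is matched as a regular expression, so the ? and . in your pattern have special meanings; something like allow=(r'document_type_retention\.aspx\?dtid=\d+',) would match the literal URL.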
I am parsing a large HTML page that has over 1000 href links. I am using BeautifulSoup to get all the links, but when I run the program it cannot handle finding all the specific 'td' tags. How can I overcome this problem? Although I can load the HTML page with urllib, not all the links can be printed. When I use find with a single 'td' tag, it works.
Tag = self.__Page.find('table', {'class':'RSLTS'}).findAll('td')
print Tag
for a in Tag.find('a', href=True):
    print "found", a['href']
It works when I change it to:
Tag = self.__Page.find('table', {'class':'RSLTS'}).find('td')
print Tag
for a in Tag.find('a', href=True):
    print "found", a['href']
You need to iterate over them:
tds = self.__Page.find('table', class_='RSLTS').find_all('td')

for td in tds:
    a = td.find('a', href=True)
    if a:
        print "found", a['href']
Although I'd just use lxml if you have a ton of stuff:
root.xpath('//table[contains(@class, "RSLTS")]//td/a/@href')