I am trying to scroll down an infinite scroll page and get the links of news. The problem is when I scrolled down the page for let say 100 times, and I tried to get the links, Python launched an error that says: "StaleElementReferenceException: Message: stale element reference: element is not attached to the page document". I think its because the page is get updated and scrolled page is not available any more. here is my code for scrolling the page with Selenium Webdriver:
import urllib2
from bs4 import BeautifulSoup
from __future__ import print_function
from selenium import webdriver #open webdriver for specific browser
from selenium.webdriver.common.keys import Keys # for necessary browser action
from selenium.webdriver.common.by import By # For selecting html code
import time
driver = webdriver.Chrome('C:\\Program Files (x86)\\Google\\Chrome\\chromedriver.exe')
driver.get('http://seekingalpha.com/market-news/top-news')
for i in range(0,100):
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(15)
URL = driver.find_elements_by_class_name('market_current_title')
print URL
and the code for getting the URLs
for a in URL:
links = a.get_attribute('href')
print(links)
I am wondering if there is any solution to settle this problem or it is possible to get URLs for this specific page with request library, as I couldn't do that.
Related
I have the following route, which is supposed to redirect the user to an external URL (I'm using Apple's URL as an example here) -
import flask
from flask import Flask, jsonify, Response, render_template
import pymongo
from pymongo import MongoClient
from bson import ObjectId, json_util
import json
cluster = pymongo.MongoClient("mongodb+srv://USERNAME:PASSWORD#cluster0.mpjcg.mongodb.net/<dbname>?retryWrites=true&w=majority")
db = cluster["simply_recipe"]
collection = db["recipes_collection"]
app = Flask(__name__)
# This route returns the team's index page
#app.route("/")
def home():
return render_template('index.html')
# This route returns heesung's plot page of the team's website
#app.route("/heesung")
def heesung():
return redirect("http://www.apple.com")
if __name__ == "__main__":
app.run()
Issue:I keep getting a "GET /heesung/ HTTP/1.1" 404 - in my terminal, when I navigate to my localhost/heesung
Note:I am aware there are other questions of similar nature and for those, I have followed the steps, but they are old posts, so I'm wondering if Flask has changed anything. I couldn't find any definitive documentation.
This isn't related to the return redirect("http://www.apple.com") line.
I keep getting a GET /heesung/ HTTP/1.1" 404 - in my terminal, when I navigate to my localhost/heesung
That terminal output suggests you're hitting /heesung/ (with the trailing slash).
With the decorator:
#app.route("/heesung")
Requests to /heesung will succeed
Requests to /heesung/ will 404.
Instead with the decorator:
#app.route("/heesung/")
Requests to /heesung/ will succeed
Requests to /heesung will issue a '308 permanent redirect' to /heesung/
Pick whichever best suits your use-case.
As per V25's comment, I forgot to import the dependency. Sorry for everyone's trouble!
Answer:
from flask import redirect
I am trying to click a tab (Regulatory Regional) on a webpage: https://www5.fdic.gov/idasp/advSearchLanding.asp
However, it does not recognize the command. Here, I have attached the code.
import urllib2
import urllib
from bs4 import BeautifulSoup
import subprocess
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.keys import Keys
browser = webdriver.Chrome("/usr/local/bin/chromedriver")
import time
s1_url = 'https://www5.fdic.gov/idasp/advSearchLanding.asp'
browser.get(s1_url)
Problem: choose regulatory regional tab but it does not click it.
browser.find_element_by_xpath('//[#id="Banks_Regulatory_Tab"]/a').click()
Got an exception:
NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"//*[#id="Banks_Regulatory_Tab"]/a"}
Required element located inside an iframe. To be able to handle it you need to switch to that iframe:
browser.switch_to.frame("content")
browser.find_element_by_link_text("Regulatory Regional").click()
I am clicking the link "Images" on a new page (after searching 'bugs bunny') on Google. It is not retrieving images of the search, rather it is opening the link 'Images' on the old page.
My Code:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
browser = webdriver.Firefox()
browser.get('http://www.google.com')
search = browser.find_element_by_name('q')
search.send_keys("bugs bunny")
search.send_keys(Keys.RETURN) # hit return after you enter search text
browser.current_window_handle
print(browser.current_url)
browser.find_element_by_link_text("Images").click()
Your problem is you are using send_keys, which perform the action and don't wait
search.send_keys(Keys.RETURN) # hit return after you enter search text
So after that if you use click it is doing it nearly on the current page even when the results are not loaded. So you need to add some delay for the return key to change the results and once the results are loaded, you can do the click
So what you need is a simple sleep delay
I am trying to scrape the stats off the table on this webpage: http://stats.nba.com/teams/traditional/ but I am unable to find the html for the table. This is in python 2.7.10.
from bs4 import BeautifulSoup
import json
import urllib
html = urllib.urlopen('http://stats.nba.com/teams/traditional/').read()
soup = BeautifulSoup(html, "html.parser")
for table in soup.find_all('tr'):
print(table)
This is the code I have now, but nothing is being outputted.
If I try this with different elements on the page it works fine.
The table is loaded dynamically, so when you grab the html, there are no tr tags in it to be found.
The table you're looking for is NOT in that specific page/URL.
The stats you're trying to scrape come from this url:
http://stats.nba.com/stats/leaguedashteamstats?Conference=&DateFrom=&DateTo=&Division=&GameScope=&GameSegment=&LastNGames=0&LeagueID=00&Location=&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2016-17&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision=
When you browse a webpage/url in a modern browser, more requests are made "behind the scene" other than the original url you use to fully render the whole page.
I know this sounds counter-intuitive, you can check out this answer for a bit more detailed explanation.
Try this code. It is giving me the HTML code. I am using requests to obtain information.
import datetime
import BeautifulSoup
import os
import sys
import pdb
import webbrowser
import urllib2
import requests
from datetime import datetime
from requests.auth import HTTPBasicAuth
from HTMLParser import HTMLParser
from urllib import urlopen
from bs4 import BeautifulSoup
url="http://stats.nba.com/teams/traditional/"
data=requests.get(url)
if (data.status_code<400):
print("AUTHENTICATED:STATUS_CODE"+" "+str(data.status_code))
sample=data.content
soup=BeautifulSoup(sample,'html.parser')
print soup
You can use selenium and PhantomJS (or chomedriver, firefox etc.) to load the page, thereby also loading all the javascript. All you need is to download selenium and the PhantomJS webdriver, then place a sleep timer after the get(url) to ensure that the page loads (actually, using a function such as WebDriverWait would be much better than sleep, but you can look more into that if you need it). Now your soup content will look exactly like that what you see when looking at the site through your browser.
from bs4 import BeautifulSoup
from selenium import webdriver
from time import sleep
url = 'http://stats.nba.com/teams/traditional/'
browser = webdriver.PhantomJS('*path to PhantomJS driver')
browser.get(url)
sleep(10)
soup = BeautifulSoup(browser.page_source, "html.parser")
for table in soup.find_all('tr'):
print(table)
import requests from bs4
import BeautifulSoup
import xlrd file="C:\Users\Ashadeep\PycharmProjects\untitled1\xlwt.ashadee.xls"
workbook=xlrd.open_workbook(file)
sheet=workbook.sheet_by_index(0)
print(sheet.cell_value(0,0))
r = requests.get(sheet.cell_value(0,0))
r.content soup = BeautifulSoup(r.content,"html.parser") g_data=soup.find_all("div",{"class":"admissionhelp-left"})
print(g_data)
text=soup.find_all("Tel") for item in g_data:print(item.text)
Are you trying to download an Excel file from the web and save it to your HDD? I don't see any URL, but you can try one of these 3 ideas.
import urllib
dls = "http://www.muellerindustries.com/uploads/pdf/UW SPD0114.xls"
urllib.urlretrieve(dls, "test.xls")
import requests
dls = "http://www.muellerindustries.com/uploads/pdf/UW SPD0114.xls"
resp = requests.get(dls)
with open('test.xls', 'wb') as output:
output.write(resp.content)
Or, if you don't necessarily need to go through the browser, you can use the urllib module to save a file to a specified location.
import urllib
url = 'http://www.example.com/file/processing/path/excelfile.xls'
local_fname = '/home/John/excelfile.xls'
filename, headers = urllib.retrieveurl(url, local_fname)