I have a project for which I need to extract some information from www.wikizero.com, because Wikipedia is not accessible in my country. I am trying to scrape the page https://www.wikizero.com/tr/Mustafa_Kemal_Atat%C3%BCrk. What I want is the first <p> under <div class="mw-parser-output">: "Mustafa Kemal Atatürk[n 2] (1881[n 3] - 10 Kasım 1938), Türk mareşal ve devlet adamı. Ülkesinde monarşinin kaldırılarak cumhuriyetin kurulmasına önderlik etti ve 1923'ten 1938'e kadar cumhurbaşkanı olarak görev yaptı." However, I cannot get this text.
When I run this code:
soup.find("div", attrs={"class":"mw-parser-output"}).select("p:nth-of-type(1)")
it returns something other than that paragraph.
So how can I extract this block?
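One possible cause, offered as an assumption because the actual output is not shown: select("p:nth-of-type(1)") returns a list of every <p> that is the first <p> among its siblings at any depth, so paragraphs nested in tables or empty placeholder paragraphs can come back instead of the lead paragraph. A minimal sketch that takes the first non-empty <p> that is a direct child of the container; SAMPLE_HTML is a simplified, hypothetical stand-in for the real page and first_paragraph is a hypothetical helper name:

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the page markup (not the real page source).
SAMPLE_HTML = """
<div class="mw-parser-output">
  <table><tr><td><p>Infobox text</p></td></tr></table>
  <p class="mw-empty-elt"></p>
  <p>Mustafa Kemal Atatürk (1881 - 10 Kasım 1938), Türk mareşal ve devlet adamı.</p>
  <p>Second paragraph.</p>
</div>
"""


def first_paragraph(html_text):
    """Return the text of the first non-empty <p> directly under the container."""
    soup = BeautifulSoup(html_text, "html.parser")
    container = soup.find("div", attrs={"class": "mw-parser-output"})
    if container is None:
        return None
    # recursive=False looks only at direct children, so <p> tags nested
    # inside tables are skipped.
    for p in container.find_all("p", recursive=False):
        text = p.get_text(strip=True)
        if text:
            return text
    return None


print(first_paragraph(SAMPLE_HTML))
```

On the real page, html_text would be the body of the HTTP response; whether the lead paragraph is a direct child of the container there is something to verify in the browser's inspector.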
I am scraping a webpage using BeautifulSoup and requests in Python 3.5. The problem is that when I try to parse the email addresses in the <p> tags, I get [email protected] instead. I have tried other links, with no success. The cf_email tag is not even there. I am parsing with this code:
email_addresses = []
for email_address in detail.findAll('p'):
    email_addresses.append(email_address.text)
information = {}
information['email'] = email_addresses
The emails are in the <p> tags.
This is the HTML I see when inspecting the element:
<div class="email">
<p>test1@hotmail.com</p>
<p>test2@yahoo.com</p>
<p>test3@yahoo.com</p>
</div>
But when I open the page source, I see this:
<p>[email protected]</p>
The page does not actually contain the email address. This is probably a protection against spammers: some JavaScript replaces the placeholder text with the actual value.
In other words, the site is trying to stop people doing exactly what you are trying to do.
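If the placeholder comes from Cloudflare's email obfuscation (an assumption; the question says no cf_email tag is present, so this may not apply to that site), the address is typically stored as a hex string in a data-cfemail attribute, where the first byte is an XOR key for the remaining bytes. A minimal sketch of decoding such a string; decode_cfemail and encode_cfemail are hypothetical helper names, and the encoder exists only to build input for the example:

```python
def decode_cfemail(encoded):
    """Decode a hex string in the format assumed for a data-cfemail attribute."""
    key = int(encoded[:2], 16)  # the first byte is the XOR key
    return "".join(
        chr(int(encoded[i:i + 2], 16) ^ key)
        for i in range(2, len(encoded), 2)
    )


def encode_cfemail(address, key=0x5A):
    """Hypothetical inverse of decode_cfemail, used only to create sample input."""
    return "%02x" % key + "".join("%02x" % (ord(ch) ^ key) for ch in address)


sample = encode_cfemail("test1@hotmail.com")
print(decode_cfemail(sample))
```

Keep in mind that the obfuscation is there on purpose, so check the site's terms before decoding addresses in bulk.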
I am using Goose to read the title and text body of an article from a URL. However, this does not work with a Twitter URL, I guess because of the different HTML tag structure. Is there a way to read the tweet text from such a link?
One example of a tweet (shortened link) is:
https://twitter.com/UniteAlbertans/status/899468829151043584/photo/1
NOTE: I know how to read tweets through the Twitter API. However, I am not interested in that. I just want to get the text by parsing the HTML source, without all the Twitter authentication hassle.
Scrape it yourself
Open the URL of the tweet, pass it to an HTML parser of your choice, and extract the XPaths you are interested in.
Scraping is discussed in: http://docs.python-guide.org/en/latest/scenarios/scrape/
If the structure of the site is always the same, you can obtain XPaths by right-clicking the element you want, selecting "Inspect", right-clicking the highlighted line in the Inspector, and selecting "Copy" > "Copy XPath". Otherwise, choose properties that identify exactly the object you want.
In your case:
//div[contains(@class, 'permalink-tweet-container')]//strong[contains(@class, 'fullname')]/text()
will get you the name of the author and
//div[contains(@class, 'permalink-tweet-container')]//p[contains(@class, 'tweet-text')]//text()
will get you the content of the Tweet.
The full working example:
from lxml import html
import requests

# Fetch the tweet page and parse it into an element tree
page = requests.get('https://twitter.com/UniteAlbertans/status/899468829151043584')
tree = html.fromstring(page.content)
# Collect every text node inside the tweet's <p class="tweet-text">
tree.xpath('//div[contains(@class, "permalink-tweet-container")]//p[contains(@class, "tweet-text")]//text()')
results in:
['Breaking:\n10 sailors missing, 5 injured after USS John S. McCain collides with merchant vessel near Singapore...\n\n', 'https://www.', 'washingtonpost.com/world/another-', 'us-navy-destroyer-collides-with-a-merchant-ship-rescue-efforts-underway/2017/08/20/c42f15b2-8602-11e7-9ce7-9e175d8953fa_story.html?utm_term=.e3e91fff99ba&wpisrc=al_alert-COMBO-world%252Bnation&wpmk=1', u'\xa0', u'\u2026', 'pic.twitter.com/UiGEZq7Eq6']
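The XPath returns the tweet as a list of text fragments, because links and entities sit in separate child elements. A small sketch of stitching them back together, using the fragments from the result shown above; collapsing all whitespace (including the non-breaking space) into single spaces is my own choice rather than part of the answer, and join_fragments is a hypothetical helper name:

```python
# Fragments as returned by the XPath query for the example tweet
parts = [
    'Breaking:\n10 sailors missing, 5 injured after USS John S. McCain collides with merchant vessel near Singapore...\n\n',
    'https://www.',
    'washingtonpost.com/world/another-',
    'us-navy-destroyer-collides-with-a-merchant-ship-rescue-efforts-underway/2017/08/20/c42f15b2-8602-11e7-9ce7-9e175d8953fa_story.html?utm_term=.e3e91fff99ba&wpisrc=al_alert-COMBO-world%252Bnation&wpmk=1',
    '\xa0',
    '\u2026',
    'pic.twitter.com/UiGEZq7Eq6',
]


def join_fragments(fragments):
    # Concatenate the pieces, then collapse every run of whitespace into one
    # space; str.split() with no arguments also splits on '\xa0'.
    return ' '.join(''.join(fragments).split())


tweet_text = join_fragments(parts)
print(tweet_text)
```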
1.
<p class="followText">Follow us</p>
<p><a class="symbol ss-social-circle ss-facebook" href="http://www.facebook.com/HowStuffWorks" target="_blank">Facebook</a></p>
2.
<p>Gyroscopes can be very perplexing objects because they move in peculiar ways and even seem to defy gravity. These special properties make gyroscopes extremely important in everything from your bicycle to the advanced navigation system on the space shuttle. A typical airplane uses about a dozen gyroscopes in everything from its compass to its autopilot. The Russian Mir space station used 11 gyroscopes to keep its orientation to the sun, and the Hubble Space Telescope has a batch of navigational gyros as well. Gyroscopic effects are also central to things like yo-yos and Frisbees!</p>
This is part of the source of the website http://science.howstuffworks.com/gyroscope.htm, from which I'm trying to extract the contents of the <p> tags.
This is the code I'm using to do that:
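import requests
from bs4 import BeautifulSoup

def trade_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://science.howstuffworks.com/gyroscope' + str(page) + ".htm"
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        # Print the text of every <p> tag on the page
        for link in soup.findAll('p'):
            paragraph = link.string
            print(paragraph)
        page += 1  # move on to the next page so the loop terminates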
But I'm getting both types of data (both 1 and 2) from the <p> tags.
I only need the data from section 2, not from section 1.
Please suggest a way to skip <p> tags that have attributes while keeping the plain <p> tags.
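One way to approach this with BeautifulSoup is to pass a filter function to find_all. The sketch below keeps a <p> only when it has no attributes and contains some text of its own, which also drops the attribute-less <p> that merely wraps the Facebook link; that second condition is my reading of the question, and is_plain_paragraph is a hypothetical name. The sample HTML is taken from the question, with the long paragraph shortened:

```python
from bs4 import BeautifulSoup, NavigableString

# Markup from the question; the gyroscope paragraph is shortened here.
SAMPLE_HTML = """
<p class="followText">Follow us</p>
<p><a class="symbol ss-social-circle ss-facebook" href="http://www.facebook.com/HowStuffWorks" target="_blank">Facebook</a></p>
<p>Gyroscopes can be very perplexing objects because they move in peculiar ways and even seem to defy gravity.</p>
"""


def is_plain_paragraph(tag):
    """True for a <p> with no attributes that holds text directly."""
    if tag.name != "p" or tag.attrs:
        return False
    # Require at least one non-blank text node as a direct child, so a <p>
    # that only wraps another tag (such as a link) is rejected.
    return any(
        isinstance(child, NavigableString) and child.strip()
        for child in tag.children
    )


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
paragraphs = [p.get_text() for p in soup.find_all(is_plain_paragraph)]
print(paragraphs)
```

Paragraphs that mix text with inline links still pass the filter, because they have text nodes of their own.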
I'm currently trying to scrape a website. The problem is that some of the information, specifically the latitude and longitude, is placed on a Google Maps embed in an iframe.
I'm able to get all the other information I need except this. Searching around, and working with import.io tech support, I found that I need to use a specific XPath and regex to pull this information, but the code I found on the site has me lost. Ideally I'd like to pull latitude and longitude separately. This is the code I have to work with.
What are my options? Thank you.
<div class="padding-listItem--sm">
<iframe width="100%" height="310" frameborder="0" allowfullscreen="" src="https://www.google.com/maps/embed/v1/place?q=33.3929503,-111.908652&key=AIzaSyDK08tC4NRubbIiw-xwDR1WEp-YAXX1Mx8" style="border:0"></iframe>
</div>
1) Get the src attribute of the iframe element.
String srcText = driver.findElement(By.tagName("iframe")).getAttribute("src");
2) Parse the URL (found in srcText) for the latitude and longitude values.
Regex to find both numbers:
/([-]?\d+\.\d+)/g
when the URL is as you specified:
https://www.google.com/maps/embed/v1/place?q=33.3929503,-111.908652&key=AIzaSyDK08tC4NRubbIiw-xwDR1WEp-YAXX1Mx8
The XPath to obtain the iframe source is:
//div[@class='padding-listItem--sm']/iframe/@src
Then you can apply a regex like this one to obtain the latitude and longitude:
/q=(-?[\d\.]*),(-?[\d\.]*)/g
Implementation online Here
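The same two steps can also be sketched in Python. This version reads the q parameter with urllib.parse instead of applying a regex to the whole URL, which is a swapped-in approach rather than the one from the answers; extract_lat_lng is a hypothetical helper name, and the src value is the one from the question's iframe:

```python
from urllib.parse import urlparse, parse_qs

# src attribute of the iframe, as posted in the question
IFRAME_SRC = (
    "https://www.google.com/maps/embed/v1/place"
    "?q=33.3929503,-111.908652&key=AIzaSyDK08tC4NRubbIiw-xwDR1WEp-YAXX1Mx8"
)


def extract_lat_lng(src):
    """Return (latitude, longitude) as floats, or None if q is not 'lat,lng'."""
    q_values = parse_qs(urlparse(src).query).get("q", [])
    if not q_values:
        return None
    pieces = q_values[0].split(",")
    if len(pieces) != 2:
        return None
    try:
        return float(pieces[0]), float(pieces[1])
    except ValueError:
        # q holds something other than two numbers, e.g. a place name
        return None


latitude, longitude = extract_lat_lng(IFRAME_SRC)
print(latitude, longitude)
```

Parsing the query string first means the key parameter can never be mistaken for a coordinate, whatever characters it contains.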
I'm trying to scrape some stock prices, and their variations, from Google Finance using Python 3, but I can't figure out whether there's something wrong with the page or with my regex. I suspect that either the SVG graphic or the many script tags throughout the page are making the regex parser fail to analyze the code properly.
I have tested this regex on many online regex builders/testers and it looks ok. As ok as a regex designed for HTML can be, anyway.
The Google Finance page I'm testing this out on is https://www.google.com/finance?q=NYSE%3AAAPL
And my Python code is the following:
import urllib.request
import re

page = urllib.request.urlopen('https://www.google.com/finance?q=NYSE%3AAAPL')
text = page.read().decode('utf-8')
# Raw string, so that \d and \. reach the regex engine unchanged
m = re.search(r'id="price-panel.*>(\d*\d*\d\.\d\d)</span>.*\((-*\d\.\d\d%)\)', text, re.S)
print(m.groups())
It should extract the stock price and its percent variation.
I have also tried using Python 2 + BeautifulSoup, like so:
soup.find(id='price-panel')
but it returns nothing, even for this simple query. This is mainly why I think there's something weird with the HTML.
And here's the most important bit of HTML that I'm aiming for:
<div id="price-panel" class="id-price-panel goog-inline-block">
<div>
<span class="pr">
<span class="unchanged" id="ref_22144_l"><span class="unchanged">96.41</span><span></span></span>
</span>
<div class="id-price-change nwp goog-inline-block">
<span class="ch bld"><span class="down" id="ref_22144_c">-1.13</span>
<span class="down" id="ref_22144_cp">(-1.16%)</span>
</span>
</div>
</div>
<div>
<span class="nwp">
Real-time:
<span class="unchanged" id="ref_22144_ltt">3:42PM EDT</span>
</span>
<div class="mdata-dis">
<span class="dis-large"><nobr>NASDAQ
real-time data -
Disclaimer
</nobr></span>
<div>Currency in USD</div>
</div>
</div>
</div>
I'm wondering if any of you have encountered a similar problem with this page and/or can figure out if there's anything wrong with my code. Thanks in advance!
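As a sanity check, the regex from the question can be run against the HTML fragment posted above, without any network access. In this sketch it does match, which suggests (but does not prove) that the markup downloaded by the script differs from the fragment seen in the browser. SNIPPET is an excerpt of that fragment, and the pattern is written as a raw string:

```python
import re

# Excerpt of the HTML fragment posted in the question
SNIPPET = '''<div id="price-panel" class="id-price-panel goog-inline-block">
<div>
<span class="pr">
<span class="unchanged" id="ref_22144_l"><span class="unchanged">96.41</span><span></span></span>
</span>
<div class="id-price-change nwp goog-inline-block">
<span class="ch bld"><span class="down" id="ref_22144_c">-1.13</span>
<span class="down" id="ref_22144_cp">(-1.16%)</span>
</span>
</div>
</div>
</div>'''

# Same pattern as in the question; re.S lets '.' match newlines
PATTERN = re.compile(
    r'id="price-panel.*>(\d*\d*\d\.\d\d)</span>.*\((-*\d\.\d\d%)\)', re.S
)

match = PATTERN.search(SNIPPET)
print(match.groups() if match else "no match")
```

Printing a slice of the downloaded text around "price-panel" would be a quick way to compare the two.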
You might try a different URL that will be easier to parse, such as: http://www.google.com/finance/info?q=AAPL
The catch is that Google has said that using this API in an application for public consumption is against their Terms of Service. Maybe there is an alternative that Google will allow you to use?
I managed to get it working using BeautifulSoup, on the link originally posted.
Here's the bit of code I finally used:
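import urllib2  # Python 2; in Python 3 use urllib.request instead
from bs4 import BeautifulSoup

response = urllib2.urlopen('https://www.google.com/finance?q=NYSE%3AAAPL')
html = response.read()
soup = BeautifulSoup(html, "lxml")
# Price: first nested <span> inside the panel's first <div>
aaplPrice = soup.find(id='price-panel').div.span.span.text
# Percent change: second <span> of the change block, parentheses stripped
aaplVar = soup.find(id='price-panel').div.div.span.find_all('span')[1].string.split('(')[1].split(')')[0]
aapl = aaplPrice + ' ' + aaplVar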
I couldn't get it working with BeautifulSoup before because I was actually trying to parse the table on this page, https://www.google.com/finance?q=NYSE%3AAAPL%3BNYSE%3AGOOG, not the one I posted.
Neither method described in my question worked on that page.
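A variation that depends less on the exact nesting is to look elements up by class and id instead of chaining .div.span.span. This is only a sketch against the fragment from the question: it assumes the price sits in the span with class "pr" and that the percent change is the span whose id ends in "_cp", as in that fragment, and extract_quote is a hypothetical helper name:

```python
import re
from bs4 import BeautifulSoup

# Excerpt of the HTML fragment posted in the question
SNIPPET = """<div id="price-panel" class="id-price-panel goog-inline-block">
<div>
<span class="pr">
<span class="unchanged" id="ref_22144_l"><span class="unchanged">96.41</span><span></span></span>
</span>
<div class="id-price-change nwp goog-inline-block">
<span class="ch bld"><span class="down" id="ref_22144_c">-1.13</span>
<span class="down" id="ref_22144_cp">(-1.16%)</span>
</span>
</div>
</div>
</div>"""


def extract_quote(html_text):
    """Return (price, percent_change) as strings, or None if not found."""
    soup = BeautifulSoup(html_text, "html.parser")
    panel = soup.find(id="price-panel")
    if panel is None:
        return None
    price_tag = panel.find("span", class_="pr")
    # Assumption: the percent-change span has an id ending in "_cp"
    change_tag = panel.find("span", id=re.compile(r"_cp$"))
    if price_tag is None or change_tag is None:
        return None
    price = price_tag.get_text(strip=True)
    change = change_tag.get_text(strip=True).strip("()")
    return price, change


print(extract_quote(SNIPPET))
```

Returning None instead of raising makes it easier to notice when a page, such as the two-ticker comparison page, does not contain the panel at all.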