Specific xPath and Regex - Web Crawling - regex

I'm currently in the process of trying to scrape a website. The problem is the information is placed on google maps in an iframe. Specifically, Latitude and Longitude.
I'm able to get all the other information I currently need expect this. Searching around, and working with import.io tech support, I found I need to use specific xPath and Regex to pull this information but the code I found on the site has me lost. Ideally I'd like to pull Latitude and Longitude separately. This is the code I have to work with.
What are my options? Thank you.
<div class="padding-listItem--sm">
<iframe width="100%" height="310" frameborder="0" allowfullscreen="" src="https://www.google.com/maps/embed/v1/place?q=33.3929503,-111.908652&key=AIzaSyDK08tC4NRubbIiw-xwDR1WEp-YAXX1Mx8" style="border:0"></iframe>
</div>

1) Get the src attribute of the iframe element.
string srcText = driver.findElement(By.tagName("iframe")).getAttribute("src");
2) Parse the url (found in srcText) for the latitude and longitude values.
Regex to find both numbers:
/([-]?\d+\.\d+)/g
when the url is as you specified:
https://www.google.com/maps/embed/v1/place?q=33.3929503,-111.908652&key=AIzaSyDK08tC4NRubbIiw-xwDR1WEp-YAXX1Mx8"

The XPath to obtain the iframe source is:
//div[#class='padding-listItem--sm']/iframe/#src
Then you can apply a regex like this one to obtain latitude and longitude
/q=(-?[\d\.]*),(-?[\d\.]*)/g
Implementation online Here

Related

Is it possible to read tweet-text of a tweet URL without twitter API?

I am using Goose to read the title/text-body of an article from a URL. However, this does not work with a twitter URL, I guess due to the different HTML tag structure. Is there a way to read the tweet text from such a link?
One such example of a tweet (shortened link) is as follows:
https://twitter.com/UniteAlbertans/status/899468829151043584/photo/1
NOTE: I know how to read Tweets through twitter API. However, I am not interested in that. I just want to get the text by parsing the HTML source without all the twitter authentication hassle.
Scrape yourself
Open the url of the tweet, pass to HTML parser of your choice and extract the XPaths you are interested in.
Scraping is discussed in: http://docs.python-guide.org/en/latest/scenarios/scrape/
XPaths can be obtained by right-clicking to element you want, selecting "Inspect", right clicking on the highlighted line in Inspector and selecting "Copy" > "Copy XPath" if the structure of the site is always the same. Otherwise choose properties that define exactly the object you want.
In your case:
//div[contains(#class, 'permalink-tweet-container')]//strong[contains(#class, 'fullname')]/text()
will get you the name of the author and
//div[contains(#class, 'permalink-tweet-container')]//p[contains(#class, 'tweet-text')]//text()
will get you the content of the Tweet.
The full working example:
from lxml import html
import requests
page = requests.get('https://twitter.com/UniteAlbertans/status/899468829151043584')
tree = html.fromstring(page.content)
tree.xpath('//div[contains(#class, "permalink-tweet-container")]//p[contains(#class, "tweet-text")]//text()')
results in:
['Breaking:\n10 sailors missing, 5 injured after USS John S. McCain collides with merchant vessel near Singapore...\n\n', 'https://www.', 'washingtonpost.com/world/another-', 'us-navy-destroyer-collides-with-a-merchant-ship-rescue-efforts-underway/2017/08/20/c42f15b2-8602-11e7-9ce7-9e175d8953fa_story.html?utm_term=.e3e91fff99ba&wpisrc=al_alert-COMBO-world%252Bnation&wpmk=1', u'\xa0', u'\u2026', 'pic.twitter.com/UiGEZq7Eq6']

Embedding issuu

I need to embed an issuu document inside a website. The website administrator should be allowed to decide which document is displayed on the frontend.
This is an easy task, using the embed link on the issuu page. But I need to customize some options - for instance, disable sharing, set the dimensions and so on. I cannot rely on the administrators doing this process every time they need to change the document.
I can easily customize the issuu embed code to my taste, and all that I need is the document id. Unfortunately, the id is not included in the issuu page for the document. For instance, the id for this random link happens to be 110209071155-d0ed1d10ac0b40dda80dad24166a76ee, which is nowhere to be found, neither in the URL nor easily inside the page. You have to dig into the embed code to find it.
I thought the issuu API could allow me to get a document id given its URL, but I cannot find anything like this. The closest match is the search API, but if I search for the exact name of the document I get only one match for a different document!
Is there some easy way to be able to embed a document only knowing its URL? Or an easy way for a non techie person to find a document id in the page?
Unfortunate the only way for you to costomize is to pay for the service wich is 39$ for month =/.
You can force a fullscreen mode without ads by using
<body style="margin:0px;padding:0px;overflow:hidden">
<iframe src="YOUR ISSU EMBED" frameborder="0" style="overflow:hidden;height:105%;width:105%;position:absolute;" height="100%" width="100%""></iframe>
</body>
You can embed of course stacks but that isnt showed on Issuu site. This is code (its old code but it works):
<iframe src="http://static.issuu.com/widgets/shelf/index.html?folderId=FOLDERIDamp;theme=theme1&rows=1&thumbSize=large&roundedCorners=true&showTitle=true&showAuthor=false&shadow=true&effect3d=true" frameborder="0" marginwidth="0" marginheight="0" scrolling="no" width="100%" height="200"></iframe>
FOLDERID is number of 36 chars that you get on address bar when you enter stacks (example: https://issuu.com/username/stacks/FOLDERID). When you replacing that in code you must paste 36 chars in this format 8-4-4-4-12 with - between chars. And voila its working.
You can change theme and other stuffs in code.
The Document ID is found in the HTML source of every document. It is in the og:video meta property.
<meta property="og:video" content="http://static.issuu.com/webembed/viewers/style1/v2/IssuuReader.swf?mode=mini&documentId=XXXXXXXX-XXXXXXXXXXXXX&pageNumber=0">
You can easily handle it by using the DomDocument and DomXPath php classes.
Here is how-to using PHP:
// Your document URL
$url = 'https://issuu.com/proyectotres/docs/proyecto_3_edicion_135';
// Turn off errors, loads the URL as an object and then turn errors on again
libxml_use_internal_errors(true);
$dom = DomDocument::loadHTMLFile($url);
libxml_use_internal_errors(false);
// DomXPath helps find the <meta property="og:video" content="http://hereyoucanfindthedocumentid?documentId=xxxxx-xxxxxxx"/>
$xpath = new DOMXPath($dom);
$meta = $xpath->query("//html/head/meta[#property='og:video']");
// Get the content attribute of the <meta> node and parse its query
$vars = [];
parse_str(parse_url($meta[0]->getAttribute('content'))['query'], $vars);
// Ready. The document ID is here:
$docID = $vars['documentId'];
// You can print it:
echo $docID;
You can try it with the URL of your own Issu document.
You can use the Issuu URL of your document to complete this iframe :
<iframe width="100%" height="283" style="display: block; margin-left: auto; margin-right: auto;" src="https://e.issuu.com/issuu-reader3-embed-files/latest/twittercard.html?u=nantucketchamber&d=program-update1&p=1" frameborder="0" allowfullscreen="allowfullscreen" span="" id="CmCaReT"></iframe>
You just need to replace "nantucketchamber" by a user name and "program-update1" by the file name in the Issuu URL
(for this example the URL is https://issuu.com/nantucketchamber/docs/program-update1)

Extract all Images from HTML whose width or height higher than a specified value - Regex

I'm trying to make a small link share function with Classic ASP like LinkedIn or Facebook.
What I need to do is to get HTML of remote URL and extract all the images whose width are greater than 50px for example.
I can crawl and take the HTML and also I can find the images with this regex:
<img([^<>+]*)>
It matches; <img src="/images/icon.jpg" width="60" height="90" style="display:none"/>
Then I'm able to extract the path but sometimes it matches <img src="/track.php" style="display:none" width="1" height="1"/> which is not a real image.
Anyway, I feel like you are gonna be mad because of classic ASP but my company ....
I know there are lots of topics about this issue and mostly, they recommend not to USE regex but I couldn't find a way to this with classic asp. Is there a component or something to this?
Regards
This will get you close:
<img [^>]*width="(0?[1-9]\d{2,}|[5-9]\d)"[^>]*>
It accepts image tags with a width of 50 or greater.
Edit: tags with unspecified widths:
<img [^>]*width="(0?[1-9]\d{2,}|[5-9]\d)"[^>]*>|<img ((?!width=)[^>])*>

Regular Expression Match for Google Analytics Goal Tracking

I have a coupon request form on every page on my website, when the coupon form is submitted you are taken to the same page you are on with an additional "?coupon=sent" parameter added to the query string. I would like to be able to track any page url wiht ?coupon=sent on the end as a goal. Currently, I have this:
/[^.][.php][\?coupon\=sent]+
which does not seem to be doing the trick. Any ideas?
Use this one:
/\?coupon=sent/

The regular expression for finding the image url in <img> tag in HTML using VB .Net code

I want to extract the image url from any website. I am reading the source info through webRequest. I want a regular expression which will fetch the Image url from this content i.e the Src value in the <img> tag.
I'd recommend using an HTML parser to read the html and pull the image tags out of it, as regexes don't mesh well with data structures like xml and html.
In C#: (from this SO question)
var web = new HtmlWeb();
var doc = web.Load("http://www.stackoverflow.com");
var nodes = doc.DocumentNode.SelectNodes("//img[#src]");
foreach (var node in nodes)
{
Console.WriteLine(node.src);
}
/(?:\"|')[^\\x22*<>|\\\\]+?\.(?:jpg|bmp|gif|png)(?:\"|')/i
is a decent one I have used before. This gets any reference to an image file within an html document. I didn't strip " or ' around the match, so you will need to do that.
Try this*:
<img .*?src=["']?([^'">]+)["']?.*?>
Tested here with:
<img class="test" src="/content/img/so/logo.png" alt="logo homepage">
Gives
$1 = /content/img/so/logo.png
The $1 (you have to mouseover the match to see it) corresponds to the part of the regex between (). How you access that value will depend on what implementation of regex you are using.
*If you want to know how this works, leave a comment
EDIT
As nearly always with regexp, there are edge cases:
<img title="src=hack" src="/content/img/so/logo.png" alt="logo homepage">
This would be matched as 'hack'.