Getting the website title from a link in a string - regex

string: "Here is the badges, https://stackoverflow.com/badges bla bla bla"
If the string contains a link (see above), I want to get the website title of that link.
It should return: Badges - Stack Overflow.
How can I do that?
Thanks.

#!/usr/bin/perl -w
require LWP::UserAgent;
my $ua = LWP::UserAgent->new;
$ua->timeout(10);
$ua->env_proxy;
my $response = $ua->get('http://search.cpan.org/');
if ($response->is_success) {
    print $response->title();
}
else {
    die $response->status_line;
}
See LWP::UserAgent. Cheers :-)

I use URI::Find::Simple's list_uris method and URI::Title for this.
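For comparison, here is a minimal sketch of the same two steps (find the URL in the string, then read the page's title element) using only the Python standard library. It does not use URI::Find::Simple or URI::Title; the names find_urls, title_from_html and fetch_title are made up for this illustration, and the URL pattern is deliberately simple (it will, for example, keep trailing punctuation attached to a link).

```python
import re
from html.parser import HTMLParser
from urllib.request import urlopen

# Deliberately simple pattern (an assumption): http(s) URLs up to the next
# whitespace, quote or angle bracket.
URL_RE = re.compile(r"https?://[^\s\"'<>]+")


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def find_urls(text):
    """Return every URL-looking substring of text."""
    return URL_RE.findall(text)


def title_from_html(html_text):
    """Return the stripped <title> text, or None if there is none."""
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    title = "".join(parser.parts).strip()
    return title or None


def fetch_title(url, timeout=10):
    """Download url and return its title (needs network access)."""
    with urlopen(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return title_from_html(resp.read().decode(charset, errors="replace"))
```

Typical use would be something like [fetch_title(u) for u in find_urls(text)], with error handling around the download left to the caller.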

Depending on how the link is given and how you define the title, you need one approach or another.
In the exact scenario that you have presented, getting the URL with URI::Find, HTML::LinkExtractor, etc., and then my $title=URI->new($link)->path() will provide the title and the link.
But if the website title is the linked text, like badges, then How can I extract URL and link text from HTML in Perl? will give you the answer.
If the title is encoded in the link itself and the link is the text itself of the link, how do you define the title?
Do you want the last bit of the URI before any query? What happens with the queries set as URL paths?
Do you want the part between the host and the query?
Do you want to parse the link source and retrieve the title tag if any?
As always, going from a trivial first implementation to covering all corner cases is a daunting task ;-)

Related

Regex for extracting only the Youtube Embedment URL in Angular 5

I think it is not very convenient for a user to get this link here:
https://www.youtube.com/embed/GmvM6syadl0
Because YouTube provides an entire code snippet like so:
<iframe width="560" height="315" src="https://www.youtube.com/embed/GmvM6syadl0" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe>
It would be a lot better if the user could paste the code snippet above and my program extracted the URL for them.
Any ideas how to go about this? I'm usually not very good at extracting data from elaborate strings, what I would like to end up with is something like this:
let yTLink = extractYoutubeLinkfromIframe(providedInput);
extractYoutubeLinkfromIframe(iframeTag) {
// do fancy regex stuff
}
If the input always has the format of that iframe, you could use split. I did it using the following code:
extractYoutubeLinkfromIframe(iframeTag) {
    let youtubeUrl = iframeTag.split('src');
    youtubeUrl = youtubeUrl[1].split('"');
    return youtubeUrl[1];
}
First we split on src, which separates the iframe string into the part before the attribute name and the part after it. Then we split the second part on the double quote ("). Because the link is wrapped in quotes ("[link]"), it ends up at index 1 of the resulting array, and that is what we return.
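If the pasted snippet may vary a little (single quotes, extra whitespace, attributes in a different order), a regular expression is somewhat more forgiving than split. The sketch below is in Python purely for illustration; the function name extract_iframe_src is made up, and the same pattern should carry over to a JavaScript RegExp with little change.

```python
import re

# Matches src="..." or src='...' inside an <iframe ...> tag. Group 1 is the
# quote character, group 2 the attribute value.
IFRAME_SRC_RE = re.compile(
    r"<iframe\b[^>]*?\ssrc\s*=\s*([\"'])(.*?)\1",
    re.IGNORECASE | re.DOTALL,
)


def extract_iframe_src(snippet):
    """Return the src value of the first iframe in snippet, or None."""
    match = IFRAME_SRC_RE.search(snippet)
    return match.group(2) if match else None
```

Returning None when nothing matches lets the caller decide how to report a bad paste, instead of failing with an index error as the split version would.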

Is it possible to read tweet-text of a tweet URL without twitter API?

I am using Goose to read the title/text-body of an article from a URL. However, this does not work with a twitter URL, I guess due to the different HTML tag structure. Is there a way to read the tweet text from such a link?
One such example of a tweet (shortened link) is as follows:
https://twitter.com/UniteAlbertans/status/899468829151043584/photo/1
NOTE: I know how to read Tweets through twitter API. However, I am not interested in that. I just want to get the text by parsing the HTML source without all the twitter authentication hassle.
Scrape yourself
Open the URL of the tweet, pass the page to an HTML parser of your choice, and extract the XPaths you are interested in.
Scraping is discussed in: http://docs.python-guide.org/en/latest/scenarios/scrape/
If the structure of the site is always the same, XPaths can be obtained by right-clicking the element you want, selecting "Inspect", right-clicking the highlighted line in the Inspector and selecting "Copy" > "Copy XPath". Otherwise, choose properties that define exactly the object you want.
In your case:
//div[contains(@class, 'permalink-tweet-container')]//strong[contains(@class, 'fullname')]/text()
will get you the name of the author and
//div[contains(@class, 'permalink-tweet-container')]//p[contains(@class, 'tweet-text')]//text()
will get you the content of the Tweet.
The full working example:
from lxml import html
import requests
page = requests.get('https://twitter.com/UniteAlbertans/status/899468829151043584')
tree = html.fromstring(page.content)
tree.xpath('//div[contains(@class, "permalink-tweet-container")]//p[contains(@class, "tweet-text")]//text()')
results in:
['Breaking:\n10 sailors missing, 5 injured after USS John S. McCain collides with merchant vessel near Singapore...\n\n', 'https://www.', 'washingtonpost.com/world/another-', 'us-navy-destroyer-collides-with-a-merchant-ship-rescue-efforts-underway/2017/08/20/c42f15b2-8602-11e7-9ce7-9e175d8953fa_story.html?utm_term=.e3e91fff99ba&wpisrc=al_alert-COMBO-world%252Bnation&wpmk=1', u'\xa0', u'\u2026', 'pic.twitter.com/UiGEZq7Eq6']

Regex to look for html tags based on their classes, and extract their value

I'm looking for a Regex to look for html tags based on their class name, and extract their value, for example:
<span class="myclass" id="myid">Hello world</span>
I need to extract - Hello world
I've tried doing that on my own, but it seems to be more complicated than it looks.
Some help? :)
Thanks!
You can try
var str = '<span class="myclass" id="myid">Hello world</span>';
var res = str.match("<([A-Za-z][A-Za-z0-9]*)\\b[^>]*>(.*?)</\\1>");
alert(res[2]);
I would really prefer to use an HTML parser.
But, if it is really needed, you can try this https://regex101.com/r/xP5kG7/1
.+(?<="myclass")[^>]+>([^<]+).+
It will give you the desired output.
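To illustrate the parser route mentioned above, here is a minimal sketch in Python using only the standard library's html.parser. The class ClassTextExtractor and the function text_by_class are made-up names for this example; it collects the text of every element whose class attribute contains the given class name, including text in nested child elements.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ClassTextExtractor(HTMLParser):
    """Collects the text content of elements carrying a given class."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.depth = 0        # > 0 while inside a matching element
        self.current = []     # text pieces of the element being read
        self.results = []     # finished text of each matching element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1   # nested element inside a match
        else:
            classes = (dict(attrs).get("class") or "").split()
            if self.class_name in classes:
                self.depth = 1
                self.current = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.results.append("".join(self.current))

    def handle_data(self, data):
        if self.depth:
            self.current.append(data)


def text_by_class(html_text, class_name):
    """Return the text of every element whose class list has class_name."""
    parser = ClassTextExtractor(class_name)
    parser.feed(html_text)
    parser.close()
    return parser.results
```

Unlike the regular expressions above, this does not care about attribute order, extra classes on the same element, or tags nested inside the matching element. It does assume reasonably well-nested markup.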

JavaScript Regex to remove certain string if a pattern is found

Let's say I have the input string
<div id="infoLangIcon"></div>ARA, DAN, ENGLISHinGERMAN, FRA<div id="infoPipe"></div><div id="infoRating0"></div><div id="infoPipe"></div><div id="infoMonoIcon"></div>
I want to check whether the rating is 0 (infoRating0) and, if so, remove that div and the previous div as well. The expected output is
<div id="infoLangIcon"></div>ARA, DAN, ENGLISHinGERMAN, FRA<div id="infoPipe"></div><div id="infoMonoIcon"></div>
Regex is not your best option here. It is not reliable when it comes to HTML.
I suggest you use DOM functions to do this (I am giving a JavaScript example, since you have not specified a language). If I understood correctly, if there is an element with the ID infoRating0, you want to remove it and its previous sibling. This little snippet should do that:
var rating0 = document.getElementById('infoRating0');
if (rating0) {
    var rParent = rating0.parentNode;
    // previousElementSibling skips whitespace text nodes between the divs
    if (rating0.previousElementSibling) {
        rParent.removeChild(rating0.previousElementSibling);
    }
    rParent.removeChild(rating0);
}
Also, your HTML is invalid. You can only use an ID once in your HTML. You have two divs with the same ID (infoPipe) which you should REALLY fix. Use classes instead.

how to match a URL inside a HTML comment with regular expressions?

I'm making an automated PHP script to check whether my link exists on my partner's website (link exchange). Besides making sure my link exists in the source code, I want to make sure he is not placing it in an HTML comment like <!-- http://www.mywebsite.com --> and cheating me.
I tried to match it with a regular expression, but failed.
Use the DOM and XPath, it ignores comments:
$doc = new DOMDocument();
$doc->loadHTML($htmlstring);
$xpath = new DOMXPath($doc);
$result = $xpath->query('//a[contains(@href, "mywebsite.com")]');
if (!$result->length) echo "You've been cheated\n";
And then, if you still want to know whether your link is being commented out:
if (strpos($htmlstring, 'mywebsite.com') !== false && !$result->length)
echo "Your partner is hiding your link in a comment, sneaky bastard\n";
Sounds like a perfect use for an HTML parser: load the page with DOMDocument->loadHTML() and look for an anchor tag with your link. He could still remove it via JavaScript on the browser side, but that's a different issue.
If it's a cat and mouse game of "are you showing a link to my site" using a standard parser is your best bet. There are just too many ways for a regex to fail on html.
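The same parser-based check can be sketched outside PHP as well. The Python version below uses the standard library's html.parser, which reports comments separately from live tags, so a link hidden in a comment never looks like a real anchor. The names LinkAudit and audit_link and the three result strings are made up for this illustration.

```python
from html.parser import HTMLParser


class LinkAudit(HTMLParser):
    """Records whether needle appears in a live <a href> or in a comment."""

    def __init__(self, needle):
        super().__init__()
        self.needle = needle
        self.in_anchor = False    # seen in the href of a real <a> tag
        self.in_comment = False   # seen inside <!-- ... -->

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if self.needle in href:
                self.in_anchor = True

    def handle_comment(self, data):
        # Comment bodies arrive here as raw text; tags inside are not parsed.
        if self.needle in data:
            self.in_comment = True


def audit_link(html_text, needle):
    """Return 'linked', 'commented out' or 'missing' for needle."""
    parser = LinkAudit(needle)
    parser.feed(html_text)
    parser.close()
    if parser.in_anchor:
        return "linked"
    if parser.in_comment:
        return "commented out"
    return "missing"
```

As noted above, this only inspects the HTML as served; a link removed or hidden later by JavaScript or CSS would still be reported as linked.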