How to find attribute values using regular expression with Selenium WebDriver?

The HTML is as follows:
<div class="item link-color-1
automated link-track logout-link"
data-track-category="Logout"
data-track-action="Logout from /myaccount/mymoney/cashier"
data-data-automated-function="clickTracker"> logout </div>
To find the XPath of the link, I tried something like: //*[contains(@data-track-category='Logout']. But it's not working. Please help.

If you are just trying to find the element itself (and not the value of the attribute, as your question title implies), you could always use CSS (it's my selector of choice over XPath).
You have not indicated which language bindings you use, so this is how I would find it in Ruby, using only the data-track-category attribute to select the element:
@driver.find_element(:css => "[data-track-category='Logout']")
Of course, the same applies across all the bindings. Just use the value "[data-track-category='Logout']" for your CSS method.

You can always store the element in a WebElement and then use the getAttribute() method to read the attribute value:
WebElement myDiv = driver.findElement(By.className("logout-link"));
String attributeValue = myDiv.getAttribute("data-track-category");
This should work in Java. I'm not sure whether that's the language you are using, or whether you must use regex for it.
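The XPath in the question fails for two reasons: contains() takes two arguments, and its parenthesis is never closed. For an exact match, a plain attribute predicate is enough. XPath 1.0 has no regex functions, so a regex has to be applied to the attribute values after the elements are fetched. The following is a minimal sketch in Python that uses the standard library's ElementTree instead of a live browser, so it can run on its own; the second div and the regex are made up for illustration. With Selenium's Python bindings, the same XPath string would be passed to driver.find_element(By.XPATH, ...) and the attribute read with get_attribute().

```python
import re
import xml.etree.ElementTree as ET

# The div from the question, wrapped in a root element so it parses as XML.
# The second div is a hypothetical extra element that should not match.
HTML = """<body>
<div class="item link-color-1 automated link-track logout-link"
     data-track-category="Logout"
     data-track-action="Logout from /myaccount/mymoney/cashier"
     data-data-automated-function="clickTracker"> logout </div>
<div class="item" data-track-category="Deposit"> deposit </div>
</body>"""

root = ET.fromstring(HTML)

# Exact match on the attribute: note the @ prefix and no contains().
exact = root.findall(".//*[@data-track-category='Logout']")

# XPath 1.0 cannot evaluate a regex, so filter the attribute values in Python.
pattern = re.compile(r"^Logout from /myaccount/")
by_regex = [
    el for el in root.iter()
    if pattern.search(el.get("data-track-action", ""))
]

print(exact[0].get("data-track-action"))
```

If a substring test is all that is needed, the XPath form would be //*[contains(@data-track-action, 'Logout')], which needs no regex at all.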

Related

Removing entire tags containing a specific term using regex

I am altering a database with approximately 500 html pages using phpmyadmin.
Several pages contain a Facebook Pixel or Google Tag that I would like to remove.
The easiest way I thought would be to search via regex the entire tag that contains some expression or term related to Facebook or Google, and replace it with blank.
An example would be
<script>
window.dataLayer = window.dataLayer || [];
function gtag() {
dataLayer.push(arguments);
}
gtag('js', new Date());
gtag('config', 'G-XXXXXXXX');
</script>
or
<script>
(window, document, 'script', 'https://connect.facebook.net/en_US/fbevents.js');
fbq('init', '9999999999999999');
fbq('track', 'salespage_xxxxxx');
</script>
Although all are unique, some have the same code or another element that makes it possible to identify each one of them.
Before running it in phpMyAdmin, I'm trying to formulate the expression using Sublime Text 3.
This is my first contact with regex and I find it fascinating, but even after following some references I can't get the search to match.
The expression I came up with after some research was
<(.*)>[\s\S]face[\s\S]<\/(.*)>
Where I thought the expression would select the entire tag containing the word "face", but it doesn't find anything.
I would like some help.
If it works, I would be able to make several other necessary changes the same way.
This regex will match a <script> tag that contains the keyword face:
<(script)>(?:(?!<\/\1>|face)[\s\S])+face(?:(?!<\/\1>)[\s\S])+<\/\1>
See example: https://regex101.com/r/LfRlBV/1
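The attempt in the question finds nothing because [\s\S] without a quantifier matches exactly one character, so it only matches when "face" is surrounded by a single character on each side. The pattern above instead steps forward one character at a time and refuses to cross the closing tag. A minimal sketch of the replacement in Python, assuming the pages can be processed as plain strings; the sample page and the function name strip_scripts are invented for illustration:

```python
import re

# Same pattern as above: a <script> block that contains "face" before its
# closing tag. The negative lookaheads stop the match from running past
# </script> into a later block.
SCRIPT_WITH_KEYWORD = re.compile(
    r"<(script)>(?:(?!</\1>|face)[\s\S])+face(?:(?!</\1>)[\s\S])+</\1>"
)


def strip_scripts(html: str) -> str:
    """Remove every attribute-less <script> block that contains 'face'."""
    return SCRIPT_WITH_KEYWORD.sub("", html)


# Hypothetical page: one Facebook Pixel block and one unrelated script.
page = (
    "<p>before</p>"
    "<script>\nfbq('init', '9999999999999999');\n"
    "// https://connect.facebook.net/en_US/fbevents.js\n</script>"
    "<script>\nconsole.log('keep me');\n</script>"
    "<p>after</p>"
)

cleaned = strip_scripts(page)
print(cleaned)
```

Note that the pattern only matches a bare <script> opening tag; a tag with attributes, such as <script async src="...">, would need the opening part of the pattern widened.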

Which regex tag to use in a Mechanize function?

I retrieved all the links from the web page containing /title/tt inside the url in a list.
my @url_links = $mech->find_all_links( url_regex => qr/title\/tt/i );
But the list is too long, so I want to filter it further in find_all_links: the link must also be inside a tag whose id starts with actor-tt (<... id="actor-tt...">). Here is where the link (/title/tt...) sits in the page source, as retrieved from cmd.exe:
<div class="filmo-row odd" id="actor-tt0361748">
<span class="year_column">
2009
</span>
<b><a href="/title/tt0361748/"
>Inglourious Basterds</a></b>
<br/>
Lt. Aldo Raine
</div>
I imagine you have to use a tag_regex but I don't know how because the command prompt doesn't seem to take tag_regex into account when I put it in.
Using HTML::TreeBuilder and HTML::Element instead of Mechanize:
use strict;
use warnings;
use feature 'say';
use HTML::TreeBuilder;
my $html_string = join "", <DATA>;
my $tree = HTML::TreeBuilder->new_from_content($html_string);
my @url_links = map { $_->attr_get_i("href") }
                map { $_->look_down(href => qr{/title/tt}) }
                $tree->look_down(id => qr/^actor-tt/);
say for @url_links;
__DATA__
<div class="filmo-row odd" id="actor-tt0361748">
<span class="year_column">
2009
</span>
<b><a href="/title/tt0361748/">Inglourious Basterds</a></b>
<br/>
Lt. Aldo Raine
</div>
<div id="not-the-right-id">
</div>
<div class="filmo-row odd" id="actor-tt0123456">
<b>Another movie</b>
</div>
<div class="filmo-row odd" id="actor-tt0123456">
the id will match, but no href in here
</div>
$tree->look_down(id => qr/^actor-tt/); finds all elements whose id starts with actor-tt. Then $_->look_down(href => qr{/title/tt}) finds all elements within them that have an href attribute matching /title/tt. Finally, $_->attr_get_i("href") returns the value of each href attribute.
You might be interested in the method new_from_url or new_from_file from HTML::TreeBuilder rather than the new_from_content I used.
WWW::Mechanize is not sophisticated enough to do what you're trying to do. It can only search links on one criterion at a time, and it converts them to WWW::Mechanize::Link objects, which do not keep their ancestry (that is, their position in the DOM tree).
Mechanize is meant to be a browser, not a scraper. It's important to pick the right tools for the job you have to do.
As Dada suggested in their answer, you can use your own parser to search for this. You can still extract the HTML out of WWW::Mechanize and then use the code they suggest. Use $mech->content or $mech->content_raw to get the HTML out.
There are several alternatives to this. While I personally like Web::Scraper for this kind of task, its interface is a bit weird and has a learning curve.
Instead, I would suggest using Mojo::UserAgent and Mojo::DOM. In fact, the handy ojo package for one-liners should be able to do this.
perl -Mojo -E 'g("https://www.imdb.com/name/nm0000093/")->dom->find("div[id^=actor-tt] a")->map(sub {say $_->attr("href")})'
Broken down, this does the following:
use Mojo::UserAgent to get that page
look at the DOM tree
find all <a>s inside <div>s that have an id that starts with actor-tt (see https://metacpan.org/pod/Mojo::DOM::CSS#SELECTORS for details)
for each of them, print out the href attribute
You can customise this as much as you want.
Please note that according to their Terms of Service, scraping IMDb is not allowed.
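The same idea (first narrow down to the containers whose id starts with actor-tt, then collect the links inside them) does not depend on Perl. The following is a rough sketch using only Python's standard html.parser, not Mechanize or Mojo; the class name ActorLinkParser and the tt9999999 link are made up for illustration.

```python
from html.parser import HTMLParser


class ActorLinkParser(HTMLParser):
    """Collect /title/tt hrefs that sit inside <div id="actor-tt...">."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # <div> nesting depth inside a matching div; 0 = outside
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            # Enter a matching div, or track a nested div inside one.
            if self.depth or (attrs.get("id") or "").startswith("actor-tt"):
                self.depth += 1
        elif tag == "a" and self.depth:
            href = attrs.get("href") or ""
            if "/title/tt" in href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1


# First div is taken from the question; the second is a hypothetical
# non-matching container whose link must be ignored.
SAMPLE = """
<div class="filmo-row odd" id="actor-tt0361748">
<span class="year_column">2009</span>
<b><a href="/title/tt0361748/">Inglourious Basterds</a></b>
<br/>
Lt. Aldo Raine
</div>
<div id="not-the-right-id"><a href="/title/tt9999999/">ignored</a></div>
"""

parser = ActorLinkParser()
parser.feed(SAMPLE)
links = parser.links
print(links)
```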

React change string to component (multiple)

I want to support a custom, wiki-like grammar. How can I do this in React without dangerouslySetInnerHTML?
For example:
"hello this is simple string [linkToSomewhere] {this is where bold goes}"
becomes
<div>
hello this is simple string <Link where="linktoSomewhere"/> <Bold string="this is where bold goes"/>
</div>
I found a way to parse the custom markdown into arrays, but no way to insert the results as React components. The arrays look like this:
link[0] = "linkToSomewhere"
bold[0] = "this is where Bold goes"
Thank you in advance!
There are a few Markdown libraries available for React. Check out:
react-markdown: https://rexxars.github.io/react-markdown/
react-remarkable: https://github.com/acdlite/react-remarkable
And here's a whole host of other libraries that can translate your Markdown document into a nicely rendered HTML: https://react.rocks/tag/Markdown
And in order to learn the Markdown syntax, there are a few cheat sheets available, such as:
http://assemble.io/docs/Cheatsheet-Markdown.html
https://guides.github.com/pdfs/markdown-cheatsheet-online.pdf
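These libraries handle standard Markdown. For the custom [link] and {bold} grammar in the question, one common approach is to turn the string into a single ordered list of tokens, rather than separate link and bold arrays, so that the original order is kept. The sketch below shows only that tokenizing step, written in Python for brevity; the token names and the function tokenize are assumptions, and the same regex logic carries over to JavaScript.

```python
import re

# [ ... ] marks a link target, { ... } marks bold text.
TOKEN = re.compile(r"\[([^\]]*)\]|\{([^}]*)\}")


def tokenize(source):
    """Split source into ordered (kind, value) pairs: text, link, or bold."""
    tokens = []
    pos = 0
    for match in TOKEN.finditer(source):
        if match.start() > pos:
            # Plain text between the previous token and this one.
            tokens.append(("text", source[pos:match.start()]))
        if match.group(1) is not None:
            tokens.append(("link", match.group(1)))
        else:
            tokens.append(("bold", match.group(2)))
        pos = match.end()
    if pos < len(source):
        tokens.append(("text", source[pos:]))
    return tokens


tokens = tokenize(
    "hello this is simple string [linkToSomewhere] {this is where bold goes}"
)
print(tokens)
```

In React, a list like this can be mapped directly to children: a text token becomes a plain string, a link token becomes a Link element, and a bold token becomes a Bold element, each with a key. No dangerouslySetInnerHTML is involved, because nothing is ever rendered as raw HTML.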

JavaScript Regex to remove certain string if a pattern is found

Let's say I have the following input string:
<div id="infoLangIcon"></div>ARA, DAN, ENGLISHinGERMAN, FRA<div id="infoPipe"></div><div id="infoRating0"></div><div id="infoPipe"></div><div id="infoMonoIcon"></div>
I want to check whether infoRating is 0 and, if so, remove that div and the div before it. The expected output is:
<div id="infoLangIcon"></div>ARA, DAN, ENGLISHinGERMAN, FRA</div><div id="infoPipe"></div><div id="infoMonoIcon"></div
Regex is not your best option here. It is not reliable when it comes to HTML.
I suggest you use DOM functions to do this (I'm giving a JavaScript example, since you have not said which language you are using). If I understood correctly, when there is an element with the ID infoRating0, you want to remove it and its previous sibling. This little snippet should do that:
if (document.getElementById('infoRating0')) {
var rating0=document.getElementById('infoRating0'),
rParent=rating0.parentNode;
rParent.removeChild(rating0.previousSibling);
rParent.removeChild(rating0);
}
Also, your HTML is invalid. You can only use an ID once in your HTML. You have two divs with the same ID (infoPipe) which you should REALLY fix. Use classes instead.

Django: How do I prepend

I'm exploring Django and got this particular problem.
How do I prepend <span class="label">Note:</span> inside {{article.content_html|safe}}?
The content of {{article.content_html|safe}} are paragraph blocks, and I just wanna add <span class="label">Note:</span> in the very first paragraph.
Thanks!
Sounds like you want to write a custom tag that uses BeautifulSoup to parse the HTML and inject the fragment.
There's no easy way to do this inside the content itself. You can, however, easily prepend the label before every article:
<span class="label">Note:</span>
{{article.content_html|safe}}
If that doesn't help, consider changing the structure of article.content_html so that you can work with it in blocks from Django templates. It would look something like this:
{{article.content_header}}
<span class="label">Note:</span>
{{article.content_html}}
If that solution is not feasible for you and you absolutely need to parse and modify the content of article.content_html, write your own custom filter that does that. You can find documentation about writing custom filters here: http://docs.djangoproject.com/en/dev/howto/custom-template-tags/#writing-custom-template-filters.
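The core of such a filter could be as small as the function below. This is only a sketch: it uses a regex rather than BeautifulSoup to stay dependency-free, so it assumes the stored HTML is simple and well formed, and the name prepend_label is made up. To use it in a template it would still have to be registered with @register.filter on a template.Library() instance, as the linked documentation describes.

```python
import re

LABEL = '<span class="label">Note:</span>'

# First opening <p> tag, with or without attributes. The \b keeps
# tags such as <pre> or <param> from matching.
FIRST_P = re.compile(r"<p\b[^>]*>", re.IGNORECASE)


def prepend_label(content_html, label=LABEL):
    """Insert label right after the first opening <p> tag, if there is one."""
    # A function replacement avoids backslash handling in the label text.
    return FIRST_P.sub(lambda m: m.group(0) + label, content_html, count=1)


print(prepend_label("<p>One</p><p>Two</p>"))
```

Content without any paragraph is returned unchanged, which seems like the safest default for a template filter.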
An alternate approach could be to do this with javascript. In jQuery, it would look something like:
var first_p_text = $("p:first").text();
$("p:first").html('<span class="label">Note:</span>' + first_p_text);
Note though that if there are other elements inside your first p, $("p:first").text() will grab the text from those as well - see http://api.jquery.com/text/
Of course, this relies on decent javascript support in the client.
jQuery is the simplest and easiest to implement. You only need one line with the prepend call:
$('p:first').prepend('<span class="label">Note:</span>');
Explanation: 'p:first' is a jQuery selector that picks the first matching paragraph in the document (unlike the ':first-child' CSS selector, which matches any element that is the first child of its parent). The prepend call then inserts the span at the start of that paragraph.
Note: If there is a paragraph on the page before your content, you may have to surround it with a div:
<div id='ilovesmybbq'>{{article.content_html|safe}}</div>
Then the jQuery call would be:
$('#ilovesmybbq p:first').prepend('<span class="label">Note:</span>');