Selenium+Python, about find_element_by_xpath h3 - python-2.7

Is there any way to identify the "Connect" button by the string "Test Engine 0728" and then click it, using find_element_by_xpath or any other method in a Python + Selenium environment? Thanks a lot!
<html>
<head>
<body>
<div class="page" id="main-page">
<div class="controls" id="Engines">
<div class="devices" id="Devices-List">
<h3 class="device-name">Test Engine 0728 </h3>
</div>
<button>Connect</button>
...

This xpath should work for you:
driver.find_element_by_xpath("//h3[contains(text(),'Test Engine 0728')]/../../button[contains(text(),'Connect')]").click()

There are certainly multiple ways to find the button.
One option is to start the XPath expression at the div with id Engines, check that its child div with id Devices-List contains an h3 with the text Test Engine 0728, and then get the button by its Connect text:
button = driver.find_element_by_xpath('//div[@id="Engines" and div[@id="Devices-List"]/h3[contains(., "Test Engine 0728")]]/button[. = "Connect"]')
button.click()
Another option is to find the div with id Devices-List, check the h3 text inside it, and take the following button sibling:
//div[@id="Devices-List" and h3[contains(., "Test Engine 0728")]]/following-sibling::button

This one also should work:
connectButtonClick = driver.find_element_by_xpath("//div[@class='controls'][@id='Engines'][contains(., 'Test Engine 0728')]//button[text()='Connect']").click()
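As a hedged sketch of how the following-sibling approach could be reused for other device names, the helper below only builds the XPath string. The name connect_button_xpath is hypothetical (it is not part of Selenium), and the helper assumes the device name contains no double quote. Note also that the find_element_by_* methods were removed in Selenium 4.3, so with a current Selenium the lookup goes through driver.find_element(By.XPATH, ...).

```python
def connect_button_xpath(device_name):
    """Return an XPath for the Connect button that follows the device list.

    Hypothetical helper; assumes device_name contains no double quote.
    """
    return (
        '//div[@id="Devices-List" and h3[contains(., "{name}")]]'
        '/following-sibling::button[normalize-space(.) = "Connect"]'
    ).format(name=device_name)


if __name__ == "__main__":
    print(connect_button_xpath("Test Engine 0728"))
```

With a live driver and `from selenium.webdriver.common.by import By`, the call would look like `driver.find_element(By.XPATH, connect_button_xpath("Test Engine 0728")).click()`.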

Related

How do I scrape nested data using selenium and Python

I basically want to scrape Litigation Paralegal under <h3 class="Sans-17px-black-85%-semibold"> and Olswang under <span class="pv-entity__secondary-title Sans-15px-black-55%">, but I can't seem to get to them. Here's the HTML:
<div class="pv-entity__summary-info">
<h3 class="Sans-17px-black-85%-semibold">Litigation Paralegal</h3>
<h4>
<span class="visually-hidden">Company Name</span>
<span class="pv-entity__secondary-title Sans-15px-black-55%">Olswang</span>
</h4>
<div class="pv-entity__position-info detail-facet m0"><h4 class="pv-entity__date-range Sans-15px-black-55%">
<span class="visually-hidden">Dates Employed</span>
<span>Feb 2016 – Present</span>
</h4><h4 class="pv-entity__duration de Sans-15px-black-55% ml0">
<span class="visually-hidden">Employment Duration</span>
<span class="pv-entity__bullet-item">1 yr 2 mos</span>
</h4><h4 class="pv-entity__location detail-facet Sans-15px-black-55% inline-block">
<span class="visually-hidden">Location</span>
<span class="pv-entity__bullet-item">London, United Kingdom</span>
</h4></div>
</div>
And here is what I've been doing at the moment with selenium in my code:
if tree.xpath('//*[@class="pv-entity__summary-info"]'):
    experience_title = tree.xpath('//*[@class="Sans-17px-black-85%-semibold"]/h3/text()')
    print(experience_title)
    experience_company = tree.xpath('//*[@class="pv-position-entity__secondary-title pv-entity__secondary-title Sans-15px-black-55%"]text()')
    print(experience_company)
My output:
Experience title : []
[]
Your XPath expressions are incorrect:
//*[@class="Sans-17px-black-85%-semibold"]/h3/text() means the text content of an h3 that is a child of an element whose class attribute is "Sans-17px-black-85%-semibold". Instead you need
//h3[@class="Sans-17px-black-85%-semibold"]/text()
which means the text content of the h3 element whose class attribute is "Sans-17px-black-85%-semibold".
In //*[@class="pv-position-entity__secondary-title pv-entity__secondary-title Sans-15px-black-55%"]text() you forgot a slash before text() (you need /text(), not just text()). Also, the target span has no class name pv-position-entity__secondary-title. You need to use
//span[@class="pv-entity__secondary-title Sans-15px-black-55%"]/text()
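To check the corrected expressions without a browser, here is a minimal sketch that uses the standard library's xml.etree.ElementTree in place of lxml. That is a swap made purely for illustration: ElementTree supports only a subset of XPath (no text() step), so the text is read from the .text attribute instead. The snippet is a trimmed copy of the HTML from the question, which happens to be well-formed XML.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the HTML from the question; it parses as XML.
SNIPPET = """
<div class="pv-entity__summary-info">
  <h3 class="Sans-17px-black-85%-semibold">Litigation Paralegal</h3>
  <h4>
    <span class="visually-hidden">Company Name</span>
    <span class="pv-entity__secondary-title Sans-15px-black-55%">Olswang</span>
  </h4>
</div>
""".strip()

root = ET.fromstring(SNIPPET)

# Name the element first, then filter on its class attribute.
title = root.find(".//h3[@class='Sans-17px-black-85%-semibold']").text
company = root.find(
    ".//span[@class='pv-entity__secondary-title Sans-15px-black-55%']"
).text

print(title)
print(company)
```

The attribute test compares the whole class string, exactly as the XPath in the answer does, so it stops matching if the page adds or reorders class names.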
You can get both of these easily with CSS selectors and I find them a lot easier to read and understand than XPath.
driver.find_element_by_css_selector("div.pv-entity__summary-info > h3").text
driver.find_element_by_css_selector("div.pv-entity__summary-info span.pv-entity__secondary-title").text
. indicates a class name
> indicates a child (one level below only)
A space indicates a descendant (any number of levels below)
Here are some references to get you started.
CSS Selectors Reference
CSS Selectors Tips
Advanced CSS Selectors

RegExp replace all but selected

So I'm trying to erase everything except the matched text in this 1900-line document with Notepad++ RegExp Find/Replace, so that only the file names remain, which should shorten it to under about 1000 lines. I know the pattern that selects the text, (?<=/images/item/)(.*)(?=" a), but I don't know how to make it erase everything that doesn't match. Here's a portion of the document.
Using Notepad++, it would find and select abyssal-scepter.gif, aegis-of-the-legion.gif, etc.
<img src="/images/item/abyssal-scepter.gif" alt="LoL Item: Abyssal Scepter"><br> <div id="id_77" class="tier-wrapper drag-items health magic-resist health-regen champ-box float-left ajax-tooltip {t:'Item',i:'77'} classic-and-dominion filter-is-dominion filter-is-classic filter-tier-advanced filter-bonus-aura filter-category-health filter-category-magic-resist filter-category-health-regen ui-draggable ui-draggable-handle">
<img src="/images/item/aegis-of-the-legion.gif" alt="LoL Item: Aegis of the Legion"><br> <div id="id_235" class="tier-wrapper drag-items ability-power movement champ-box float-left ajax-tooltip {t:'Item',i:'235'} filter-tier-advanced filter-bonus-unique-passive filter-category-ability-power filter-category-movement ui-draggable ui-draggable-handle">
<img src="/images/item/aether-wisp.gif" alt="LoL Item: Aether Wisp"><br>
<div class="info">
<div class="champ-name">Aether Wisp</div>
<div class="champ-sub">
<img src="/images/gold.png" alt="Item Cost" style="width:16px; vertical-align:middle;"> 850 / 415
</div>
</div>
</div>
<div id="id_21" class="tier-wrapper drag-items ability-power champ-box float-left ajax-tooltip {t:'Item',i:'21'} classic-and-dominion filter-is-dominion filter-is-classic filter-tier-basic filter-category-ability-power ui-draggable ui-draggable-handle">
<img src="/images/item/amplifying-tome.gif" alt="LoL Item: Amplifying Tome"><br>
<div class="info">
<div class="champ-name">Amplifying Tome</div>
<div class="champ-sub">
I'm not familiar with RegExp, so to summarize, I need it to look like this at the end of it.
abyssal-scepter.gif
aegis-of-the-legion.gif
aether-wisp.gif
amplifying-tome.gif
Thank you for your time
A Notepad++ solution:
Find what : .*?/images/item/(.*?)"|.*
Replace with : $1\n
Search mode : Regular expression (with ". matches newline" checked)
The result will have an extra linefeed at the end.
But that shouldn't pose a problem I suppose.
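If running a script is an option, the same extraction can be sketched in a few lines of Python. This is only an illustration, not a replacement for the Notepad++ steps; the sample string is a shortened stand-in for the real document and item_filenames is a hypothetical helper name.

```python
import re

# Shortened stand-in for the 1900-line document.
SAMPLE = '''<img src="/images/item/abyssal-scepter.gif" alt="LoL Item: Abyssal Scepter"><br>
<img src="/images/gold.png" alt="Item Cost" style="width:16px;"> 850 / 415
<img src="/images/item/aegis-of-the-legion.gif" alt="LoL Item: Aegis of the Legion"><br>'''

# Capture everything between /images/item/ and the closing quote.
ITEM_RE = re.compile(r'/images/item/([^"]+)"')


def item_filenames(text):
    """Return the file names under /images/item/ in document order."""
    return ITEM_RE.findall(text)


if __name__ == "__main__":
    print("\n".join(item_filenames(SAMPLE)))
```

Using [^"]+ instead of .* keeps each match inside a single src attribute, so no lookarounds or dot-matches-newline setting are needed.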
Maybe this can help, or maybe not, since you dropped the JavaScript tag from your original post:
<script type="text/javascript">
var thestring = "<img src=\"/images/item/aegis-of-the-legion.gif\" alt=\"LoL Item: Aegis of the Legion\"><br>";
var thestring2 = "<img src=\"/images/otherstuff/aegis-of-the-legion.gif\" alt=\"LoL Item: Aegis of the Legion\"><br>";
// Return the file name under /images/item/, or "" if there is none.
function ParseIt(incomingstring) {
    var pattern = /"\/images\/item\/(.*)" /;
    if (pattern.test(incomingstring)) {
        return pattern.exec(incomingstring)[1];
    }
    else {
        return "";
    }
    //return pattern.test(incomingstring) ? pattern.exec(incomingstring)[1] : "";
}
</script>
Calling ParseIt(thestring) returns "aegis-of-the-legion.gif"
Calling ParseIt(thestring2) returns ""
Since you are doing this in NP++, this works for me. In cases like this, where speed and results matter more than a specific technique, I'll usually run several regexes. First, search for > and replace it with >\n, which puts each tag on its own line for simpler processing. Then replace ^>*<.*?".*?/?([\w\d\-_]+\.\w{2,4})?".*>.*$ with $1, which extracts the filenames from the tags and removes the unneeded text. Next, to clear the tags that didn't have a filename in them, replace <.*> with an empty string. Finally, use Edit > Line Operations > Remove Empty Lines, and you'll have the result you're looking for. It's not a 100% regex solution, but this is a one-time action that only needs a simple result.

Selenium XPATH how to get text from Span tag underneath the input id tag

I have the following html snippet:
<div>
<span class="gwt-InlineLabel myinlineblock" style="display: none;" aria-hidden="true">Go to row</span>
<input id="data_configuration_view_preview_ib_row" class="gwt-IntegerBox marginleft red" type="text" size="8"/>
<span class="gwt-InlineLabel error myinlineblock marginleft" style="width: 7ex;" aria-hidden="false">Error!</span>
</div>
I am trying to locate the text Error!
I start from the input tag because it has an ID, but I am not able to get to the span tag that contains the text Error!
My XPath to start from the ID is:
//input[@id="data_configuration_view_preview_ib_row"]
I have tried:
//input[@id="data_configuration_view_preview_ib_row"]/span[contains(text(), "Error!")]
What CSS or XPath can I use to locate the text Error!?
I have managed to locate the element with the following XPath:
//input[@id="data_configuration_view_preview_ib_row"]//following-sibling::span[contains(text(), "Error!")]
Thanks, Riaz
You can use a CSS selector:
using the error class
span.error
using the id data_configuration_view_preview_ib_row
#data_configuration_view_preview_ib_row + span.error
Or you can use XPath:
using the error class
//span[contains(@class, 'error')]
using the preceding element with id data_configuration_view_preview_ib_row
//span[preceding::*[@id = 'data_configuration_view_preview_ib_row']]
using the preceding sibling with id data_configuration_view_preview_ib_row
//span[preceding-sibling::*[@id = 'data_configuration_view_preview_ib_row']]
Hope it helps..:)
Use the following-sibling axis to get the next element on the same level:
//input[@id="data_configuration_view_preview_ib_row"]/following-sibling::span
You could also use a CSS selector:
#data_configuration_view_preview_ib_row + span
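The reason the /span attempt fails is that the span is a sibling of the input, not a child. As a rough illustration of what the following-sibling axis selects, the sketch below walks the parent's children by hand with the standard library's xml.etree.ElementTree. ElementTree has no following-sibling axis, so this is a stand-in for the idea, not what Selenium does internally, and following_sibling_spans is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# The snippet from the question; it parses as XML because the input is self-closed.
SNIPPET = """
<div>
  <span class="gwt-InlineLabel myinlineblock" style="display: none;" aria-hidden="true">Go to row</span>
  <input id="data_configuration_view_preview_ib_row" class="gwt-IntegerBox marginleft red" type="text" size="8"/>
  <span class="gwt-InlineLabel error myinlineblock marginleft" style="width: 7ex;" aria-hidden="false">Error!</span>
</div>
""".strip()


def following_sibling_spans(root, element_id):
    """Return the span elements that come after the element with element_id."""
    # ".." steps up to the parent of the element that carries the id.
    parent = root.find(".//*[@id='%s']/.." % element_id)
    if parent is None:
        return []
    children = list(parent)
    position = next(
        i for i, child in enumerate(children) if child.get("id") == element_id
    )
    # Keep only the spans located after the anchor element.
    return [child for child in children[position + 1:] if child.tag == "span"]


root = ET.fromstring(SNIPPET)
texts = [
    span.text
    for span in following_sibling_spans(root, "data_configuration_view_preview_ib_row")
]
print(texts)
```

The hidden "Go to row" span is skipped because it precedes the input, which is the same reason following-sibling::span finds only the error label.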

Select every text node in a HTML document except script nodes with XPath

I am currently writing a web crawler with Scrapy, and I would like to fetch all the text displayed on the screen of every HTML document with a single XPath query.
Here is the HTML I'm working with:
<body>
<div>
<h1>Main title</h1>
<div>
<script>var grandson;</script>
<p>Paragraph</p>
</div>
</div>
<script>var child;</script>
</body>
As you can see, there are some script tags that I want to filter out when getting the text inside the body tag.
Here is my first XPath query and its result:
XPath: /body/*//text()
Result: Main title / var grandson; / Paragraph / var child;
This is not good because it also fetches the text inside the script tag.
Here is my second try:
XPath: /body/*[not(self::script)]//text()
Result: Main title / var grandson; / Paragraph
Here, the last script tag (which is body's child) is filtered, but the inner script is not.
How would you filter out all the script tags? Thanks in advance.
Try
//*[not(self::script)]/text()
This XPath does what you want:
.//text()[not(parent::script)]
It checks the parent of each text node and skips text whose parent is a script element.
A more general version, which I use on any element that contains HTML code, also skips text nested anywhere inside style and noscript:
.//text()[not(ancestor::script|ancestor::style|ancestor::noscript)]
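For comparison, the same ancestor-based filter can be sketched without XPath, using only the standard library's html.parser. This is a different technique from the Scrapy selectors in the question, shown only to make the filtering rule concrete; VisibleText and visible_text are hypothetical names.

```python
from html.parser import HTMLParser

# The HTML from the question.
SAMPLE = """
<body>
<div>
<h1>Main title</h1>
<div>
<script>var grandson;</script>
<p>Paragraph</p>
</div>
</div>
<script>var child;</script>
</body>
"""


class VisibleText(HTMLParser):
    """Collect text that has no script, style or noscript ancestor."""

    SKIPPED = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Ignore whitespace-only text and anything inside a skipped element.
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return parser.chunks


if __name__ == "__main__":
    print(visible_text(SAMPLE))
```

The depth counter plays the role of the ancestor axis: a text node is kept only while no skipped element is open above it.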

Matching text that is not html tags with regular expression

So I am trying to create a regular expression that matches text inside different kinds of html tags. It should match the bold text in both of these cases:
<div class="username_container">
<div class="popupmenu memberaction">
<a rel="nofollow" class="username offline " href="http://URL/surfergal.html" title="Surfergal is offline"><strong><!-- google_ad_section_start(weight=ignore) -->**Surfergal**<!-- google_ad_section_end --></strong></a>
</div>
<div class="username_container">
<span class="username guest"><b><a>**Advertisement**</a></b></span>
</div>
I have tried with the following regular expression without any result:
/<div class="username_container">.*?((?<=^|>)[^><]+?(?=<|$)).*?<\/div>/is
This is my first time posting here on stackoverflow so if I am doing something incredibly stupid I can only apologize.
Using regex to parse HTML is hard. See the links in the comments on your question.
What do you plan to do with these matches? Here's a quick jQuery script that logs the results to the console:
var a = [];
$('strong, b').each(function(){
a.push($(this).html());
});
console.log(a);
results:
["<!-- google_ad_section_start(weight=ignore) -->**Surfergal**<!-- google_ad_section_end -->", "<a>**Advertisement**</a>"]
http://jsfiddle.net/Mk7xf/
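If the goal is the plain text without the comments and inner tags, a parser-based sketch avoids the regex altogether. The version below uses Python's standard html.parser and is only an illustration: UsernameText and usernames are hypothetical names, and the sample is a tidied copy of the question's HTML with the containers closed and the bold markers removed.

```python
from html.parser import HTMLParser

# Tidied copy of the HTML from the question (assumption: containers are closed).
SAMPLE = """
<div class="username_container">
  <div class="popupmenu memberaction">
    <a rel="nofollow" class="username offline " href="http://URL/surfergal.html" title="Surfergal is offline"><strong><!-- google_ad_section_start(weight=ignore) -->Surfergal<!-- google_ad_section_end --></strong></a>
  </div>
</div>
<div class="username_container">
  <span class="username guest"><b><a>Advertisement</a></b></span>
</div>
"""


class UsernameText(HTMLParser):
    """Collect text found inside <div class="username_container"> blocks."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0  # > 0 while inside a username_container div
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        # Count nested divs too, so the matching </div> is found correctly.
        if self.div_depth or "username_container" in classes:
            self.div_depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth:
            self.div_depth -= 1

    def handle_data(self, data):
        # Comments never reach handle_data, so they are dropped automatically.
        if self.div_depth and data.strip():
            self.names.append(data.strip())


def usernames(html):
    parser = UsernameText()
    parser.feed(html)
    parser.close()
    return parser.names


if __name__ == "__main__":
    print(usernames(SAMPLE))
```

Because the parser reports comments and tags separately from text, the nesting differences between the two cases (strong with comments versus b wrapping a) stop mattering.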