I would like to know whether it's possible to find all elements that match a certain string, using XPath.
For example, suppose I have this page:
<html>
<head>
<title></title>
</head>
<body>
<form id="form1">
</form>
<p class="test"></p>
<p class="test"></p>
<p class="test"></p>
<p class="test"></p>
</body>
</html>
If I search for the string <form id="form1">, I would get the first form element; if I instead search for the string <p class="test"></p>, I would get all the paragraph elements. Is that possible?
Something like //*[matches(., string)]
I'm a beginner, so any suggestions will be appreciated.
Try this XPath expression:
//form[@id="form1"]
Output:
<form id="form1">
</form>
For the paragraphs:
//p[@class="test"]
And if you want a partial match:
//p[contains(@class, "tes")]
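The expressions in this answer can be tried from Python. The following is a minimal sketch, assuming the page is well-formed enough to be parsed as XML (the sample page from the question is). It uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath: attribute predicates such as [@id='form1'] work, but contains() does not, so the partial match is done with a plain Python filter instead. The variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Sample page from the question (well-formed, so it parses as XML).
PAGE = """<html>
<head><title></title></head>
<body>
<form id="form1"></form>
<p class="test"></p>
<p class="test"></p>
<p class="test"></p>
<p class="test"></p>
</body>
</html>"""

root = ET.fromstring(PAGE)

# Exact attribute matches, using the XPath subset ElementTree supports.
forms = root.findall(".//form[@id='form1']")
paragraphs = root.findall(".//p[@class='test']")

# ElementTree has no contains(), so emulate the partial match in Python.
partial = [p for p in root.iter("p") if "tes" in p.get("class", "")]

print(len(forms), len(paragraphs), len(partial))
```

A full XPath 1.0 engine, such as the one in lxml, would accept the contains() expression directly.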
I have the below HTML content:
<html>
<body>
<div>
<p><img class="img.jpg" /></p>
</div>
</body>
</html>
and I am trying to parse the HTML using the lxml parser as below:
import lxml.html as LH
root = LH.fromstring(html)
for el in root.iter('img'):
    el.attrib['src'] = el.attrib['class']
content = '<html><body>' + LH.tostring(root) + '</body></html>'
I am getting the content after parsing as below:
<html>
<body>
<div>
<p><img class="img.jpg" src="img.jpg"></p>
</div>
</body>
</html>
As you can see, the self-closing slash (/>) of the <img> tag has been removed after parsing. Is there any way I can retain all the HTML closing tags after parsing?
I have a long document whose titles are numbered in an ascending hierarchical format, for example 8.1, 8.1.1, ... 8.1.1.1.1.1.1, such as:
<h1 class="topicTitle-h1">8.12.1.1.12.1.1 title03</h1>
<h1 class="topicTitle-h1">8.1 title01</h1>
<h1 class="topicTitle-h1">8.1.1.1.1.1.1 title03</h1>
<h1 class="topicTitle-h1">8.1.1 title02</h1>
<h1 class="topicTitle-h1">8.1.1.1.1.2.1 title03</h1>
<h1 class="topicTitle-h1">8.1.1.1 title03</h1>
<h1 class="topicTitle-h1">8.1.1.3.2.3.1 title03</h1>
<h1 class="topicTitle-h1">8.1.1.1.1 title05</h1>
<h1 class="topicTitle-h1">8.1.4.2.5.9.3 title03</h1>
<h1 class="topicTitle-h1">8.1.1.1.1.1 title06</h1>
<h1 class="topicTitle-h1">8.1.11.12.14.3.1 title03</h1>
I tried to get only the title03 headings with the regular expression re.search(r'\">\d{1,3}.\d{1,3}.\d{1,3}.\d{1,3}.\d{1,3}.\d{1,3}.\d{1,3} (.*)</h1>',x), but it matches all of the titles without exception instead of only those numbered d.d.d.d.d.d.d.
Thanks in advance.
Use:
r'">\d{1,3}(?:\.\d{1,3}){6} (.*)</h1>'
Demo & explanation
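As a quick check of the pattern above, here is a minimal sketch that runs it line by line over the headings from the question. Escaping the dot makes it match a literal period, and {6} requires exactly six further numeric components after the first one. The helper name seven_level_titles is made up for this illustration.

```python
import re

# Headings copied from the question.
HEADINGS = """<h1 class="topicTitle-h1">8.12.1.1.12.1.1 title03</h1>
<h1 class="topicTitle-h1">8.1 title01</h1>
<h1 class="topicTitle-h1">8.1.1.1.1.1.1 title03</h1>
<h1 class="topicTitle-h1">8.1.1 title02</h1>
<h1 class="topicTitle-h1">8.1.1.1.1.2.1 title03</h1>
<h1 class="topicTitle-h1">8.1.1.1 title03</h1>
<h1 class="topicTitle-h1">8.1.1.3.2.3.1 title03</h1>
<h1 class="topicTitle-h1">8.1.1.1.1 title05</h1>
<h1 class="topicTitle-h1">8.1.4.2.5.9.3 title03</h1>
<h1 class="topicTitle-h1">8.1.1.1.1.1 title06</h1>
<h1 class="topicTitle-h1">8.1.11.12.14.3.1 title03</h1>"""

# One number, then exactly six ".number" groups, a space, and the title text.
PATTERN = re.compile(r'">\d{1,3}(?:\.\d{1,3}){6} (.*)</h1>')

def seven_level_titles(html):
    """Return the title text of every heading numbered with seven components."""
    titles = []
    for line in html.splitlines():
        match = PATTERN.search(line)
        if match:
            titles.append(match.group(1))
    return titles

titles = seven_level_titles(HEADINGS)
print(len(titles), set(titles))
```

Only the six headings with seven-part numbers are returned; the four-part heading 8.1.1.1, although also called title03, is skipped because it does not have the d.d.d.d.d.d.d shape.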
Try it
re.search(r'\">\d{1,3}(\.\d{1,3})* (.*)</h1>',x)
I have not been able to find advice on Stack Overflow about finding specific URLs and appending to them. I am looking to create "deep links" using a popular affiliate network within HTML content. For example, here is some HTML:
<h2>This is a title</h2>
<p>this is some text</p>
<p>link to macys</p>
<p>link to google</p>
<p>something else</p>
</body>
</html>
I want to use a regex to find just the Macys link (not the Google link) in the HTML and append the "deep link" code from the affiliate network to its URL, so that it looks like this:
<html>
<body>
<h2>This is a title</h2>
<p>this is some text</p>
<p>link to macys</p>
<p>link to google</p>
<p>something else</p>
</body>
</html>
I did a find and replace for http://macys.com, and www.macys.com, and http://www.macys.com and it works.
I have the following html file:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2 Final//EN" "http://www.w3.org/MarkUp/Wilbur/HTML32.dtd">
<html xmlns="http://www.w3.org/MarkUp/Wilbur/HTML32.dtd">
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8">
</head>
<body style="margin-left: 5%;">
<a name="pagetop"></a>
<a name="firstpage"></a>
<div>
<h3>Content to read I</h3>
<p>
Content to read II<br><br>
</p>
<img src="abc.svg" width="200" height="166" alt="">
<br><br>
<h4>Code:ABC</h4>
<!-- End Buttons -->
</div>
</body>
</html>
I want to read the content inside the two tags <p> (without the <br> elements) and <h3>.
Is there a standard way to achieve this, for example in Boost?
I have a simple text input in which I only want to allow floats and ints (note: the template is Jade):
input.form-control(type="text", ng-model='usd', ng-pattern="nums",ng-change='convert_to_btc()', placeholder="USD")
However, it doesn't work: I can still type any character into the input. Do I need to do more in order to display something, e.g. a red border if the value is incorrect? Or should those characters not even be enterable?
The pattern is a regex and not a string, so that should be fine, shouldn't it?
Here's the controller:
app.controller("AppCtrl", function AppCtrl($scope, $http, $interval) {
  // lots of other stuff
  $scope.nums = /^\-?\d+((\.|\,)\d+)?$/; // note: not a string, it's a regex
});
This is the generated HTML. Could this be the problem? The generated HTML actually has a string, not a regex!?
<input type="text" ng-model="usd" ng-pattern="/^\-?\d+((\.|\,)\d+)?$/" ng-change="convert_to_btc()" placeholder="USD" class="form-control ng-dirty ng-valid-parse ng-touched ng-invalid ng-invalid-pattern">
I hope this is what you are trying to do.
Please have a look at the link below:
http://plnkr.co/edit/BGzLbQHy0ZtHYmom8xA3
<!DOCTYPE html>
<html ng-app="">
<head>
<script data-require="angular.js@1.3.x" src="https://code.angularjs.org/1.3.13/angular.js" data-semver="1.3.13">
</script>
<style>
.ng-invalid-pattern {
border:1px solid #f00;
}
</style>
</head>
<body>
<p>Hello</p>
<form name='myform'>
<input type="text" name='ip' ng-model="usd" ng-pattern="/^\-?\d+((\.|\,)\d+)?$/"
ng-change="convert_to_btc()" placeholder="USD"/>
<p ng-show='myform.ip.$invalid'>Error</p>
</form>
</body>
</html>
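Note that ng-pattern validates the value and flags the field as invalid; it does not stop characters from being typed, which is why the example above styles .ng-invalid-pattern and shows an error message. To see which values the pattern itself accepts, independent of Angular, here is a rough Python port of the same regex. This is only a sketch: the function name is_valid_amount is made up, [.,] is an equivalent shorthand for (\.|\,), and re.ASCII is an assumption added so that \d matches only 0-9 as it does in JavaScript.

```python
import re

# Python port of /^\-?\d+((\.|\,)\d+)?$/ from the question.
# fullmatch() plays the role of the ^ and $ anchors.
NUMBER = re.compile(r'-?\d+(?:[.,]\d+)?', re.ASCII)

def is_valid_amount(text):
    """Return True if text is an optionally signed int or decimal number."""
    return NUMBER.fullmatch(text) is not None

samples = ["12", "-3", "12.5", "12,5", "abc", "1.", "1.2.3"]
results = {s: is_valid_amount(s) for s in samples}
print(results)
```

Whole numbers and decimals with either a period or a comma pass, while letters, a trailing separator, or more than one separator fail.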
If you are trying to block the user from entering letters or other characters and only allow numbers in the input, then change <input type="text" to <input type="number".
Here's a link to the Angular Doc page on inputs that should only allow numbers: input[number]