How to find an XPath with a random number value in an attribute? - regex

I have div blocks on a website like this: <div id="banner-XXX-1"></div>
I need to query this banner, where XXX is any number made of decimal digits.
How can I do that? Currently I use this:
//div[contains(@id,'banner-') and contains(@id,'-1')]
But this gives false matches if XXX starts with 1, because 'banner-1' already contains '-1'. Is there a way to write something like this: //div[contains(@id,'banner-' + <any_decimal> + '-1')]?
It seems the match operator in the popular Chrome plugin XPath Helper does not work, so I use XPath 1.0.
https://chrome.google.com/webstore/detail/xpath-helper/hgimnogjllphhhkhlmebbmlgjoejdpjl?hl=en

XPath 1.0
This XPath 1.0 expression,
//div[ starts-with(@id,'banner-')
and translate(substring(@id, 8, 3), '0123456789', '') = ''
and substring(@id, 11) = '-1']
selects all div elements whose id attribute value
starts with banner-,
followed by 3 digits (verified by a translate() trick that maps every digit to nothing, so only non-digits would remain),
followed by -1,
as requested.
XPath 2.0
This XPath 2.0 expression,
//div[matches(@id,'^banner-\d{3}-1$')]
selects all div elements whose id attribute value matches the shown regex and
starts (^) with banner-,
followed by 3 digits, (\d{3}),
and ends ($) with -1,
as requested.
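To see why positions 8 and 11 work, and how both answers differ from the contains() attempt in the question, the three predicates can be mirrored in plain Python. This is only a sketch: the sample ids and the helper names are made up for illustration, and Python slicing stands in for XPath's 1-based substring().

```python
import re

DIGITS = "0123456789"

def naive(value):
    # Mirrors contains(@id,'banner-') and contains(@id,'-1') from the question.
    return "banner-" in value and "-1" in value

def xpath1_style(value):
    # XPath positions are 1-based, so substring(@id, 8, 3) is value[7:10]
    # and substring(@id, 11) is value[10:].
    middle = value[7:10]
    no_non_digits = middle.translate(str.maketrans("", "", DIGITS)) == ""
    return value.startswith("banner-") and no_non_digits and value[10:] == "-1"

def xpath2_style(value):
    # Mirrors matches(@id, '^banner-\d{3}-1$').
    return re.search(r"^banner-\d{3}-1$", value) is not None

# Hypothetical ids; only the "banner-XXX-1" shape comes from the question.
sample_ids = ["banner-123-1", "banner-123-2", "banner-12a-1",
              "banner-1234-1", "xbanner-123-1"]

for value in sample_ids:
    print(value, naive(value), xpath1_style(value), xpath2_style(value))
```

The naive check accepts banner-123-2 because "banner-1" already contains "-1", which is exactly the false match described in the question.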

Related

Regular expression to extract the digits that come after the 36th character in a string

In JMeter, I need to extract the digits which come after the 36th character.
Example
Response: {"data":{"paymentId":"DOM1234567890111243"}}
I need to extract: 11243 (sometimes it will be only 1, 2, 3, or 4 digits)
Left boundary: DOM12345678901 keeps changing too, but the left boundary length will always be 36 characters.
Any help will be highly appreciated.
Your response data seems to be JSON, so I wouldn't rely on this "36 characters" rule, as its format might be different.
I would suggest extracting this paymentId value first and then applying a regular expression to this DOMxxx bit.
Add a JSR223 PostProcessor as a child of the request which returns the above data.
Put the following code into the "Script" area:
def dom = new groovy.json.JsonSlurper().parse(prev.getResponseData()).data.paymentId
log.info("DOM: " + dom)
def myValue = ((dom =~ ".{14}(\\d+)")[0][1]) as String
log.info("myValue: " + myValue)
vars.put("myValue", myValue)
That's it, you should be able to access the extracted data as ${myValue} where required.
More information:
Groovy: Parsing and producing JSON
Groovy: Match Operator
Apache Groovy - Why and How You Should Use It
If there isn't anything else in the string you're checking, you could use something like:
.{36}(\d+)
The first group of this regex will be the number you're looking for.
Test and explanation: https://regex101.com/r/iDOO8T/2
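Both suggestions can be checked outside JMeter. The following Python sketch is not JMeter code; it only replays the two regular expressions against the sample response from the question, with json.loads standing in for Groovy's JsonSlurper.

```python
import json
import re

# Sample response copied from the question.
response = '{"data":{"paymentId":"DOM1234567890111243"}}'

# Approach 1: parse the JSON first, then skip the 14-character "DOMxxx" prefix.
payment_id = json.loads(response)["data"]["paymentId"]
from_json = re.match(r".{14}(\d+)", payment_id).group(1)

# Approach 2: skip the first 36 characters of the raw response.
from_raw = re.match(r".{36}(\d+)", response).group(1)

print(from_json, from_raw)
```

The first approach keeps working if whitespace or extra fields are added to the JSON, while the second depends on the raw text keeping exactly the same layout.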

Using the contains function in xpath 1.0 to select numbers

I'm using scrapy and I need to scrape something like this: any number, followed by a dash, followed by any number, then a whitespace, then two letters (e.g. 1-3 mm). It seems xpath 1.0 does not allow the use of regex. Searching around, I've found some workarounds like using starts-with() and ends-with() but from what I've seen they only use it with letters. Please help.
Scrapy uses lxml internally, and lxml's XPath has support for regular expressions via EXSLT when you add the corresponding namespaces.
Scrapy does that by default so you can use re:test() within XPath expressions as a boolean for predicates.
boolean re:test(string, string, string?)
The re:test function returns true if the string given as the first argument matches the regular expression given as the second argument.
See this example Python2 session:
>>> import scrapy
>>> t = u"""<!DOCTYPE html>
... <html lang="en">
... <body>
... <p>ab-34mm</p>
... <p>102-d mm</p>
... <p>15-22 µm</p>
... <p>1-3 nm</p>
... </body>
... </html>"""
>>> selector = scrapy.Selector(text=t)
>>> selector.xpath(r'//p/text()[re:test(., "\d+-\d+\s\w{2}")]').extract()
[u'15-22 \xb5m', u'1-3 nm']
>>>
Edit: note on using EXSLT re:match
Using EXSLT re:match is a bit trickier, or at least less natural than re:test. re:match is similar to Python's re.match, which returns a MatchObject.
The signature is different from re:test:
object regexp:match(string, string, string?)
The regexp:match function returns a node set of match elements
So re:match will return <match> elements. To capture the string from these <match> elements, you need to use the function as the "outer" function, not inside predicates.
The following example chains XPath expressions,
selecting <p> paragraphs
then matching each paragraph string-value (normalized) with a regular expression containing parenthesized groups
finally extracting the result of these re:match calls
Python2 shell:
>>> for p in selector.xpath('//p'):
... print(p.xpath(ur're:match(normalize-space(.), "(\d+)-(\d+)\s(\w{2})")').extract())
...
[]
[]
[u'<match>15-22 \xb5m</match>', u'<match>15</match>', u'<match>22</match>', u'<match>\xb5m</match>']
[u'<match>1-3 nm</match>', u'<match>1</match>', u'<match>3</match>', u'<match>nm</match>']
>>>
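For comparison, the difference between a boolean test and a group-returning match can be reproduced with Python's own re module on the same paragraph texts. This is a plain-Python sketch (Python 3, no Scrapy or lxml involved), so it only illustrates the pattern, not the EXSLT functions themselves.

```python
import re

# The paragraph texts from the HTML sample above.
paragraphs = ["ab-34mm", "102-d mm", "15-22 µm", "1-3 nm"]
pattern = re.compile(r"(\d+)-(\d+)\s(\w{2})")

# Boolean use, comparable to re:test inside a predicate.
matching = [p for p in paragraphs if pattern.search(p)]

# Group extraction, comparable to what re:match returns as <match> elements.
groups = [pattern.search(p).groups() for p in matching]

print(matching)
print(groups)
```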
To do this with XPath 1.0 you can use the translate function.
translate(@test , '1234567890', '..........') will replace any digit with a dot.
If your numbers are always one digit long, you may try something like:
[translate(@test , '1234567890', '..........') = '.-. mm']
If the numbers could be longer than one digit, you may try to replace digits with nothing and test for - mm
[translate(@test , '1234567890', '') = '- mm']
But this can produce false positives. To avoid them, you will need to use substring-before(), substring-after() and string-length() to check that there was at least one digit on each side of the dash.
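The false positives and the extra length check can be illustrated in plain Python. This is a sketch only: the function names are invented, the unit is fixed to "mm" as in the answer, and str.partition stands in for substring-before() and substring-after().

```python
DIGITS = "0123456789"

def strip_digits(text):
    # Python counterpart of translate(., '1234567890', '').
    return text.translate(str.maketrans("", "", DIGITS))

def is_range_in_mm(text):
    # Step 1: with all digits removed, only "- mm" may remain.
    if strip_digits(text) != "- mm":
        return False
    # Step 2: require at least one digit on each side of the dash,
    # and nothing but "mm" after the space.
    before, _, after = text.partition("-")
    second, _, unit = after.partition(" ")
    return len(before) > 0 and len(second) > 0 and unit == "mm"

for candidate in ["1-3 mm", "12-345 mm", "- mm", "1- mm", "1-3 m5m"]:
    print(repr(candidate), strip_digits(candidate) == "- mm", is_range_in_mm(candidate))
```

Strings such as "- mm" or "1-3 m5m" pass the translate test alone, which is why the second step is needed.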

How to Match in a Strict Order Data that comes in a Random Order?

I'm quite new to regular expressions and I have the following target string resource which can sometimes differ slightly. For example, the string might be:
<TITLE>SomeTitle</TITLE>
<ITEM1>Item 1 text</ITEM>
<ITEM2>Item 2 text</ITEM2>
<ITEM3>Item 3 text</ITEM3>
And the next time the resource is requested, its output might be:
<ITEM1>Item 1 text</ITEM>
<ITEM2>Item 2 text</ITEM2>
<ITEM3>Item 3 text</ITEM3>
<TITLE>SomeTitle</TITLE>
I want to capture the data between the two tags in order of the first example, so that the match would always match "SomeTitle" first, followed by the items. So if the search string was the second example, I need an expression that can first match "SomeTitle" and then somehow "reset" the position of the match to start from the beginning so I can then match the items.
I can achieve this with two different pattern searches, but was wondering if there is a way to do this in a single search pattern? Perhaps using lookaheads/lookbehinds and conditionals?
Capture Groups inside Lookaheads
Use this:
(?s)(?=.*<TITLE>(.*?)</)(?=.*<ITEM1>(.*?)</)(?=.*<ITEM2>(.*?)</)(?=.*<ITEM3>(.*?)</)
Even when the tokens are in a random order, you can see them in the right order by examining Capture Groups 1, 2, 3 and 4.
For instance, in the online regex demo, see how the input is in a random order, but the capture groups in the right pane are in the right order.
PCRE: How to use in a programming language
The PCRE library is used in several programming languages: for instance PHP, R, Delphi, and often C. Regardless of the language, the idea is the same: retrieve the capture groups.
As an example, here is how to do it in PHP:
$regex = '~(?s)(?=.*<TITLE>(.*?)</)(?=.*<ITEM1>(.*?)</)(?=.*<ITEM2>(.*?)</)(?=.*<ITEM3>(.*?)</)~';
if (preg_match($regex, $yourdata, $m)) {
$title = $m[1];
$item1 = $m[2];
$item2 = $m[3];
$item3 = $m[4];
}
else { // sorry, no match...
}
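The same idea carries over to other regex engines that keep capture groups set inside positive lookaheads. As a sketch, here is the pattern in Python's re module (which is not PCRE, so treat the portability as an assumption tested only on this sample), applied to the second ordering from the question.

```python
import re

# Same pattern as above, split across two lines for readability.
pattern = re.compile(
    r"(?s)(?=.*<TITLE>(.*?)</)(?=.*<ITEM1>(.*?)</)"
    r"(?=.*<ITEM2>(.*?)</)(?=.*<ITEM3>(.*?)</)"
)

# Input with TITLE last, as in the question's second example.
data = """<ITEM1>Item 1 text</ITEM>
<ITEM2>Item 2 text</ITEM2>
<ITEM3>Item 3 text</ITEM3>
<TITLE>SomeTitle</TITLE>"""

match = pattern.search(data)
if match:
    # Groups come back in pattern order, not in input order.
    title, item1, item2, item3 = match.groups()
    print(title, item1, item2, item3)
else:
    print("no match")
```

Each lookahead starts from the same position, which is what "resets" the search between tags: the overall match is zero-width and only the groups carry data.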

strings and math operations in xslt

I am looking to fix up some tables using XSLT. I need to use the colspan attribute, but the code I am converting from uses namest and nameend.
example:
<entry namest="col1" nameend="col3">
I need to turn this into <td colspan="3">. I thought about setting variables and then using substring($var,4,1) to get the number at the end of col3/col1, then subtracting the digit in namest from the digit in nameend and adding one, but it didn't work.
If entry is the context node, the following expression returns the difference of the "col" values plus one which should be the colspan value you're looking for:
substring(@nameend, 4) - substring(@namest, 4) + 1
substring(@attr, 4) returns the substring of @attr starting from the fourth character until the end. The substrings are implicitly converted to numbers by the minus operator.
Test of the expression with libxml2's xmllint:
$ echo '<entry namest="col1" nameend="col3"/>' >so.xml
$ xmllint --shell so.xml
/ > cd entry
entry > xpath substring(@nameend, 4) - substring(@namest, 4) + 1
Object is a number : 3
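The arithmetic can also be checked outside XSLT. The following Python sketch uses the standard library's ElementTree on the sample entry; the colspan helper is a made-up name, and the "col" prefix length of 3 is taken from the example rather than from any table specification.

```python
import xml.etree.ElementTree as ET

entry = ET.fromstring('<entry namest="col1" nameend="col3"/>')

def colspan(element):
    # [3:] drops the "col" prefix and keeps every remaining character,
    # like substring(@attr, 4) with no length argument.
    start = int(element.get("namest")[3:])
    end = int(element.get("nameend")[3:])
    return end - start + 1

# Build the target cell with the computed span.
td = ET.Element("td", colspan=str(colspan(entry)))
print(ET.tostring(td, encoding="unicode"))
```

Taking the whole remainder of the string, rather than a single character as in substring($var,4,1), also covers column names with more than one digit such as col10.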

Find number of characters matching pattern in XSLT 1

I need to write a statement where the test passes if there is exactly one asterisk in a string from the source document.
Thus, something like
<xslt:if test="count(find('\*', @myAttribute)) = 1">
There is one asterisk in myAttribute
</xslt:if>
I need the functionality for XSLT 1. Answers for XSLT 2 will be appreciated as well, but won't get acceptance unless it's impossible in XSLT 1.
In XPath 1.0, we can do it by removing all asterisks using translate and comparing the length:
string-length(@myAttribute) - string-length(translate(@myAttribute, '*', '')) = 1
In XPath 2.0, I'd probably do this:
count(string-to-codepoints(@myAttribute)[. = string-to-codepoints('*')]) = 1
Another solution that should work in XPath 1.0:
contains(@myAttribute, '*') and not(contains(substring-after(@myAttribute, '*'), '*'))
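The three expressions should agree on every input. A small Python sketch, with invented function names mirroring each expression, makes that easy to check on a few sample strings.

```python
def one_by_length(value):
    # Length difference after removing asterisks (first XPath 1.0 expression).
    return len(value) - len(value.replace("*", "")) == 1

def one_by_codepoints(value):
    # Count matching code points (XPath 2.0 expression).
    return sum(1 for cp in map(ord, value) if cp == ord("*")) == 1

def one_by_contains(value):
    # contains(...) and not(contains(substring-after(...), '*')).
    _, found, rest = value.partition("*")
    return found == "*" and "*" not in rest

for sample in ["", "abc", "a*c", "*", "a**c", "*a*"]:
    print(repr(sample), one_by_length(sample),
          one_by_codepoints(sample), one_by_contains(sample))
```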