I am trying to determine in which column the name "Phone" appears, by checking the HTML of a web page.
The string in which I am doing the search looks like this :
<tr class="C1">
<td>Name</td>
<td>Address</td>
...
... < some more columns, but their number is not fixed >
...
<td>Phone</td>
...
... <more columns>
...
</tr>
Is it possible to determine this using regular expressions?
From the viewpoint of theoretical computer science: it is not possible, since tables can be nested, and regular expressions generally cannot cope with nested structures. To analyse the structure of an HTML text you need a Type-2 grammar in the Chomsky hierarchy, i.e. a parser; HTML is not Type-3, i.e. regular.
From a practical viewpoint, however, if you assume that the tables are not nested, you could use a regex to extract the table rows (something like <tr\b(?:(?!</tr>).)*</tr>, with the dot allowed to match newlines), then match the cells inside each row (something like <td\b(?:(?!</td>).)*</td>) to produce a list of columns, and search that list for an entry containing the string "Phone".
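A minimal Python sketch of that practical approach, assuming the rows are not nested. The sample markup fills the elided columns with made-up "City" and "Email" cells, and the names ROW_RE, CELL_RE and column_index are only illustrative:

```python
import re

# Sample row; "City" and "Email" are invented stand-ins for the elided columns.
html = """<tr class="C1">
<td>Name</td>
<td>Address</td>
<td>City</td>
<td>Phone</td>
<td>Email</td>
</tr>"""

# Tempered dot: consume any character as long as the closing tag does not start here.
ROW_RE = re.compile(r"<tr\b[^>]*>((?:(?!</tr>).)*)</tr>", re.S | re.I)
CELL_RE = re.compile(r"<td\b[^>]*>((?:(?!</td>).)*)</td>", re.S | re.I)

def column_index(html_text, header):
    """Return the 0-based index of the first cell equal to `header`, or -1."""
    for row in ROW_RE.finditer(html_text):
        cells = [c.strip() for c in CELL_RE.findall(row.group(1))]
        if header in cells:
            return cells.index(header)
    return -1

phone_col = column_index(html, "Phone")
print(phone_col)
```

The index is 0-based, so add 1 if you count columns from one. With nested tables the row regex would stop at the first inner </tr>, which is exactly the limitation described above.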
Tough task. I'm referring you to various posts that explain why parsing HTML with regex is (virtually) impossible:
RegEx match open tags except XHTML self-contained tags
https://stackoverflow.com/a/590789/290343
https://stackoverflow.com/a/133684/290343
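For comparison, here is a rough sketch of the parser route using Python's standard html.parser module. The class name HeaderCells and the shortened sample row are assumptions for illustration; the sketch reads only the first row and does not track nesting depth, so it is a starting point rather than a full table parser:

```python
from html.parser import HTMLParser

class HeaderCells(HTMLParser):
    """Collect the text of each <td> in the first <tr> encountered."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_td = False
        self._row_done = False

    def handle_starttag(self, tag, attrs):
        # Start a new cell only while we are still inside the first row.
        if tag == "td" and not self._row_done:
            self._in_td = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False
        elif tag == "tr":
            self._row_done = True

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current cell.
        if self._in_td:
            self.cells[-1] += data

sample = '<tr class="C1"><td>Name</td><td>Address</td><td>Phone</td></tr>'
parser = HeaderCells()
parser.feed(sample)
parser.close()

cells = [c.strip() for c in parser.cells]
phone_index = cells.index("Phone")  # 0-based column position
print(phone_index)
```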
So what I'm trying to do is add a ":" character into a regular expression so that it becomes part of the matched value when I'm parsing all the values.
Here's the code that I'm parsing from:
<tr>
<td>165.227.124.179</td>
<td>3128</td>
</tr>
<tr>
<td>13.56.91.112</td>
<td>443</td>
</tr>
I need to get the values inside the "td" tags and add ":" between them like so:
165.227.124.179:3128
13.56.91.112:443
I get both of the values parsed, but what I can't figure out is whether it is possible to add the ":" character between these two values inside the regular expression itself, rather than after the matched values have been parsed.
I've tried googling it real hard, but I just can't seem to find a match for the problem I have. Sorry if the question is confusing; I've gotten quite confused along the way, so feel free to ask for clarification.
Regex is not recommended for parsing HTML, but if you have to use it, try something like:
Find <td>(\d+(?:\.\d+)+)</td>\s*<td>(\d+)</td>
Replace <td>$1:$2</td>
https://regex101.com/r/SiOgKy/1
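The same find/replace can be sketched in Python with the re module. This is only an illustration using the two sample rows from the question; PAIR_RE, pairs and merged are made-up names. Note that a regex match cannot contain characters that are not in the input, so the ":" has to come from the replacement or from joining the two captured groups:

```python
import re

html = """<tr>
<td>165.227.124.179</td>
<td>3128</td>
</tr>
<tr>
<td>13.56.91.112</td>
<td>443</td>
</tr>"""

# Group 1: dotted number (the address), group 2: the port in the next cell.
PAIR_RE = re.compile(r"<td>(\d+(?:\.\d+)+)</td>\s*<td>(\d+)</td>")

# Option 1: build "host:port" strings directly from the two groups.
pairs = [f"{host}:{port}" for host, port in PAIR_RE.findall(html)]

# Option 2: the find/replace form; Python writes $1 as \1.
merged = PAIR_RE.sub(r"<td>\1:\2</td>", html)

print(pairs)
```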
I'm trying to assign a 6-digit sequence that sits in a <pre> node to a variable using the "store" command with XPath and regex, but something is wrong with my approach.
Sample text from <pre>:
"OPERACIA, KOD PODTVERZDENIA 021477"
Command:
store(//table[@id='sms_table']/tbody/tr/td/pre[matches(text(),'[0-9]{6}')], foo)
First thing to note: you should be using storeText, not store. store only records what you put in the target field; it won't look for the locator on the page. Also, the regex won't give you what you need the way you've used it. [0-9]{6} does match six consecutive digits, but inside an XPath predicate matches() only tests whether the node's text contains such a run, so the locator would select the whole <pre> node rather than the digits.
I've recently had to do pretty much the same thing. The way I did it was to split this into two commands rather than trying to process it all in one go: the first command stores the full text, and the second uses a regex to pull out the 6 digits, like below:
<tr>
<td>storeText</td>
<td>//table[@id='sms_table']/tbody/tr/td/pre</td>
<td>Text</td>
</tr>
<tr>
<td>storeEval</td>
<td>storedVars['Text'].match(/\d{6}/)</td>
<td>digits</td>
</tr>
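The second step, pulling the digits out of the already stored text, can be checked outside Selenium. The Python sketch below is only an illustration of that regex step on the sample text from the question; it is not Selenium code, and the variable names are made up:

```python
import re

# The text that the first command would have stored from the <pre> node.
text = "OPERACIA, KOD PODTVERZDENIA 021477"

# Search for the first run of six digits; keep it as a string so the
# leading zero is preserved.
m = re.search(r"\d{6}", text)
digits = m.group(0) if m else None

print(digits)
```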
I need to extract a row from HTML table which contains some substring XXX:
<some html>
<tr rn="1"...AAA...</tr><tr rn="2"...XXX...</tr><tr rn="3"...ZZZ...</tr>
<some html>
The ... may contain attributes of tr and other elements, but cannot contain other <tr> tags. The surrounding HTML code contains other tables, but they don't have an rn attribute immediately after <tr>. I need to get the whole HTML code of the row and, specifically, the value of rn:
Match 1: <tr rn="2" XXX </tr>
Match 2: 2
Obviously this RE works incorrectly, because it also extracts the first row:
(<tr rn=\"(\d+)\".*XXX.*?tr>)
I tried to add a negative lookahead in these ways:
(<tr rn(?!<tr rn)=\"(\d+)\".*XXX.*?tr>)
(<tr rn((?!<tr rn).)*=\"(\d+)\".*XXX.*?tr>)
But they also work incorrectly.
How do I do it right?
I don't know if this is the most efficient way to do this, but this should work:
(<tr rn=\"(\d+)\"(?:(?!tr>).)*?XXX.*?tr>)
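Basically, you're adding a non-capturing group that matches any character at which tr> (the end of your closing tag) does not begin, for as few repetitions as possible, until you find the XXX. That stops the match from running across a closing tag into the next row.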
Hope that makes sense
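To see the difference, here is a small Python check of the original pattern against the tempered one. The sample string is invented to follow the shape given in the question (the attributes a, b, c and the surrounding <p> tags are placeholders):

```python
import re

html = ('<p>x</p><tr rn="1" a>AAA</tr><tr rn="2" b>XXX</tr>'
        '<tr rn="3" c>ZZZ</tr><p>y</p>')

# Original attempt: the greedy .* runs from row 1 all the way to XXX.
naive = re.compile(r'(<tr rn="(\d+)".*XXX.*?tr>)')

# Tempered version: the part before XXX may not cross a "tr>".
tempered = re.compile(r'(<tr rn="(\d+)"(?:(?!tr>).)*?XXX.*?tr>)')

naive_match = naive.search(html)
good_match = tempered.search(html)

print(naive_match.group(2))  # rn of the row where the naive match starts
print(good_match.group(1))   # the whole matched row
print(good_match.group(2))   # its rn value
```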
I want to write a regular expression to parse this webpage (view-source:http://www.imdb.com/search/title?title=spiderman&title_type=feature). Basically, I want to extract all the sections between <tr class=".+"> and </tr>. The webpage is a list of movies from IMDb (http://www.imdb.com/search/title?title=spiderman&title_type=feature), and each section represents one movie. I tried the regular expression
<tr class=".+">(.+\n)+</tr>
However, it doesn't work. Also, I'm not allowed to use DOM. Does anyone have any suggestions? Thanks!
I strongly suggest you use a proper parser. But here is the regex for your case.
<tr class="(.+)">([\s\S]+?)</tr>
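A short Python sketch of how this pattern behaves. The page content below is an invented stand-in, not the real IMDb markup; the class names and cell contents are assumptions. The key points are that [\s\S] matches newlines, which a plain dot does not by default, and that the lazy +? stops at the first </tr>:

```python
import re

# Invented stand-in for the page source; not the actual IMDb HTML.
page = """<table>
<tr class="even detailed">
  <td>Spider-Man (2002)</td>
</tr>
<tr class="odd detailed">
  <td>Spider-Man 2 (2004)</td>
</tr>
</table>"""

ROW_RE = re.compile(r'<tr class="(.+)">([\s\S]+?)</tr>')

# Each match yields (class attribute, inner HTML of the row).
rows = [(cls, body.strip()) for cls, body in ROW_RE.findall(page)]

for cls, body in rows:
    print(cls, "->", body)
```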
I am working on a site that is using the unfortunate practice of wrapping <tr/> tags inside of <form/> tags for the purpose of being able to submit the contents of single rows as form posts to the server. The HTML is generated via XSL, and sometimes there is XSL flow control (<xsl:if/>, <xsl:choose/>, etc.) or <xsl:attribute/> tags between the <form/> and <tr/> tags.
Example:
<table>
<tbody>
<form id="row1_form">
<xsl:if test="test">
<xsl:attribute name="foo">bar</xsl:attribute>
</xsl:if>
<tr id="row1">
...
I am trying to write a regex that will find all the places where a "<tr" string occurs at some point after a "<form" string. The following works for this:
<form[^<]*?>[\s\w\<\:\>\/]*<tr
What I really need, though, is for the above regex to only match when the string "<table" does NOT occur between the "<form" and "<tr" strings. If the string "<table" does not occur between "<form" and "<tr", then I know that I have found an invalid placement of a form tag.
Thanks,
Matt
This regex will find a form containing a <tr with no preceding <table:
<form[^<]*(?:<(?!/?form|tr|table)[^<]*)*<tr\b
It does require that the tool support negative lookahead. Note that this regex implements Jeffrey Friedl's unrolling-the-loop efficiency technique and is quite fast.
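A quick way to try this pattern is Python's re module, which supports negative lookahead. The two sample strings are assumptions: the first is adapted from the example in the question, the second is a made-up case in which a <table> sits between the form and the row:

```python
import re

FORM_TR = re.compile(r"<form[^<]*(?:<(?!/?form|tr|table)[^<]*)*<tr\b")

# Adapted from the question: a form wrapped directly around a row.
bad = """<tbody>
<form id="row1_form">
<xsl:if test="test">
<xsl:attribute name="foo">bar</xsl:attribute>
</xsl:if>
<tr id="row1">"""

# Made-up valid case: the row belongs to a table nested inside the form.
ok = """<form id="f">
<table>
<tr id="row1">"""

bad_hit = FORM_TR.search(bad)
ok_hit = FORM_TR.search(ok)

print(bad_hit is not None)  # invalid placement is reported
print(ok_hit is None)       # a table in between suppresses the match
```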
If your regular expression engine supports negative lookarounds you can do:
<form[^<]*?>((?!<table)[\s\w\<\:\>\/])*<tr
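The pattern above can be tried in Python as well. The sample strings are made up for illustration. One thing the sketch shows is that the character class, taken over from the question, contains no = or " characters, so a tag with attributes between the form and the row (such as <xsl:if test="test">) prevents a match; extending the class would be needed for that case:

```python
import re

NO_TABLE = re.compile(r"<form[^<]*?>((?!<table)[\s\w\<\:\>\/])*<tr")

# Attribute-free XSL tags between the form and the row: should be reported.
bad = '<form id="row1_form">\n<xsl:choose>\n<xsl:when>\n<tr id="row1">'

# A table between the form and the row: should not be reported.
ok = '<form id="f">\n<table>\n<tr>'

# An intermediate tag with an attribute: the class lacks = and ", so no match.
with_attr = '<form id="row1_form">\n<xsl:if test="test">\n<tr id="row1">'

bad_hit = NO_TABLE.search(bad)
ok_hit = NO_TABLE.search(ok)
attr_hit = NO_TABLE.search(with_attr)

print(bad_hit is not None, ok_hit is None, attr_hit is None)
```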