Validating html-style table data with XSLT - xslt

I need to be able to check xml with html-style table data to ensure that it's "rectangular". For example this is rectangular (2x2)
<table>
<tr>
<td>Foo</td>
<td>Bar</td>
</tr>
<tr>
<td>Baz</td>
<td>Qux</td>
</tr>
</table>
This is not
<table>
<tr>
<td>Foo</td>
<td>Bar</td>
</tr>
<tr>
<td>Baz</td>
</tr>
</table>
This is complicated by row and column spans and the fact that I need to accept two styles of markup, either where spanned cells are included as empty td or where span cells are omitted.
<!-- good (3x2), spanned cells included -->
<table>
<tr>
<td colspan="2">Foo</td>
<td/>
<td rowspan="2">Bar</td>
</tr>
<tr>
<td>Baz</td>
<td>Qux</td>
<td/>
</tr>
</table>
<!-- also good (3x2), spanned cells omitted -->
<table>
<tr>
<td colspan="2">Foo</td>
<td rowspan="2">Bar</td>
</tr>
<tr>
<td>Baz</td>
<td>Qux</td>
</tr>
</table>
Here are a bunch of examples of bad tables where it's ambiguous how to deal with them
<!-- bad, looks like spanned cells are included but more cells in row 1 than 2 -->
<table>
<tr>
<td colspan="2">Foo</td>
<td/>
<td rowspan="2">Bar</td>
<td>BAD</td>
</tr>
<tr>
<td>Baz</td>
<td>Qux</td>
<td/>
</tr>
</table>
<!-- bad, looks like spanned cells are omitted but more cells in row 1 than 2 -->
<table>
<tr>
<td colspan="2">Foo</td>
<td rowspan="2">Bar</td>
<td>BAD</td>
</tr>
<tr>
<td>Baz</td>
<td>Qux</td>
</tr>
</table>
<!-- bad, can't tell if spanned cells are included or omitted -->
<table>
<tr>
<td colspan="2">Foo</td>
<td rowspan="2">Bar</td>
</tr>
<tr>
<td>Baz</td>
<td>Qux</td>
<td/>
</tr>
</table>
<!-- bad, looks like spanned cells are omitted but a non-emtpy cell is overspanned -->
<table>
<tr>
<td colspan="2">Foo</td>
<td rowspan="2">Bar</td>
</tr>
<tr>
<td>Baz</td>
<td>Qux</td>
<td>BAD</td>
</tr>
</table>
I already have a working XSLT 2.0 solution for this problem that involves normalizing the data to the "spanned cells included" style then validating, however, my solution is cumbersome and starts to perform poorly for tables with an area of greater than 1000 cells. My normalization and validation routines involve iterating sequentially over the cells and passing along a param of cells that should be created by spans and inserting them when I pass their coordinates in the table. I'm not happy with either of them.
I'm looking for suggestions about cleverer ways in which to achieve this validation that hopefully would have better performance profiles on large tables. I need to account for th and td but omitted th from the examples for sake of simplicity, they can be included or ignored in any answers. I'm not checking to see if thead, tbody, and/or tfoot have the same width, this can also be included or omitted. I'm currently using XSLT 2.0 but I'd be interested in 3.0 solutions if they were significantly better than a solution implemented in 2.0.

I don't think this kind of problem is suited for XSLT - especially if you have to process very large tables.
I'd suggest to develop a solution using a procedural languge - maybe using XSLT to pre- or post- process the XML.

Related

Openoffice Calc VLOOKUP with multiple sheets

I am working with this data in Apache OpenOffice 4.1.2 with the goal of a vlookup that allows for cross sheet lookup of data in Sheet1 being added to Sheet2 base on the column pn. Here is the equation I have now but its not lining up right now. Any suggestions/corrections welcome.
In Sheet 2 using this.
=VLOOKUP(A2; Sheet1.A2:Sheet1.C500; 2; 1)
From What I understand Im expecting the code to return the information in the name column from Sheet1 based on the match of the pn column across all 500 rows.
Sheet1<br>
<table>
<thead>
<tr>
|<th>code</th>|
|<th>name</th>|
|<th>pn</th>|
</tr>
</thead>
<br>
<tbody>
<tr>
|<td>111</td>|
|<td>one</td>|
|<td>101</td>|
</tr>
<br>
<tr>
|<td>112</td>|
|<td>two</td>|
|<td>102</td>|
</tr>
</table>
<br>
Sheet2<br>
<table>
<thead>
<tr>
|<th>pn</th>|
|<th>qty</th>|
|<th>cur</th>|
</tr>
</thead>
<br>
<tbody>
<tr>
|<td>102</td>|
|<td>200</td>|
|<td> $ </td>|
</tr>
<br>
<tr>
|<td>101</td>|
|<td>150</td>|
|<td> $ </td>|
</tr>
</table>

Remove string between HTML tags with TRegEx

I am designing by code a report sent by email with Outlook using HTML format.
To do that, I'm loading first a HTML template where I can insert all dynamic parts using predefined tags like [CustomerName].
<p>You will find below reports for customer [CustomerName] dated [ReportdDate]</p>
<tag-1>
<h3>TableTitleA</h3>
<table>
<thead id="t01">
<tr>
<th align='center' width='80'>Order Nr</th>
<th align='left' width='400'>Date</th>
<th align='left' width='200'>Info</th>
<th align='center' width='200'>Site Name</th>
</tr>
</thead>
<tbody>
[TableA]
</tbody>
</table>
</tag-1>
<tag-2>
<h3>TableTitleB</h3>
<table>
<thead id="t01">
<tr>
<th align='center' width='80'>Order Nr</th>
<th align='left' width='100'>Date</th>
<th align='left' width='400'>Info</th>
<th align='left' width='200'>Site Name</th>
</tr>
</thead>
<tbody>
[TableB]
</tbody>
</table>
</tag-2>
<p>Best regards</p>
This template is ready to insert two HTML tables: [TableA] and [TableB]
But sometimes a table has no data. So, I want to remove that complete HTML section. To achieve this, I have inserted fake tags:
<tag-1></tag-1> and <tag-2></tag-2>
And then removing the complete section including the two fake tags using TRegEx. This is working just fine here:
https://regex101.com/r/5OFlyC/1
But with this code in Delphi, it doesn't work as expected:
TRegEx.Replace(MessageBody.Text, '<tag-1>.*?</tag-1>', '');
Could you tell me what's wrong here?
My problem is fixed. Thanks to all of you
Just use the roSingleLine option to deal with line feeds:
MessageBody.Text := TRegEx.Replace(MessageBody.Text, '<tag-1>.*?</tag-1>', '', [roSingleLine]);
first you have to remove all the CR LF from your string and then use the expression with escape before < and >
S:=StringReplace(messagebody.Text,#13#10,'<br>',[rfReplaceAll]);
S:=TRegEx.Replace(S,'(\<tag-1\>.*?\<\/tag-1\>)','');
messagebody.text:=StringReplace(S,'<br>',#13#10,[rfReplaceAll]);

How to find the element part of the anchor tag

I am totally new to the selenium. Please accept apologies for asking daft or silly question.
I have below on the website. What I am interested is that how can I get the data-selectdate value using selenium + python . Once I have the data-selectdate value, I would like to compare this against the the date I am interested in.
You help is deeply appreciated.
Note: I am not using Beautiful soup or anything.
Cheers
<table role="grid" tabindex="0" summary="October 2018">
<thead>
<tr>
<th role="columnheader" id="dayMonday"><abbr title="Monday">Mon</abbr></th>
<th role="columnheader" id="dayTuesday"><abbr title="Tuesday">Tue</abbr></th>
<th role="columnheader" id="dayWednesday"><abbr title="Wednesday">Wed</abbr></th>
<th role="columnheader" id="dayThursday"><abbr title="Thursday">Thur</abbr></th>
<th role="columnheader" id="dayFriday"><abbr title="Friday">Fri</abbr></th>
<th role="columnheader" id="daySaturday"><abbr title="Saturday">Sat</abbr></th>
</tr>
</thead>
<tbody>
<tr>
<td role="gridcell" headers="dayMonday">
<a data-selectdate="2018-10-22T00:00:00+01:00" data-selected="false" id="day22"
class="day-appointments-available">22</a>
</td>
<td role="gridcell" headers="dayTuesday">
<a data-selectdate="2018-10-23T00:00:00+01:00" data-selected="false" id="day23"
class="day-appointments-available">23</a>
</td>
<td role="gridcell" headers="dayWednesday">
<a data-selectdate="2018-10-24T00:00:00+01:00" data-selected="false" id="day24"
class="day-appointments-available">24</a>
</td>
<td role="gridcell" headers="dayThursday">
<a data-selectdate="2018-10-25T00:00:00+01:00" data-selected="false" id="day25"
class="day-appointments-available">25</a>
</td>
<td role="gridcell" headers="dayFriday">
<a data-selectdate="2018-10-26T00:00:00+01:00" data-selected="false" id="day26"
class="day-appointments-available">26</a>
</td>
<td role="gridcell" headers="daySaturday">
<a data-selectdate="2018-10-27T00:00:00+01:00" data-selected="false" id="day27"
class="day-appointments-available">27</a>
</td>
</tr>
</tbody>
</table>
To get the values of the attribute data-selectdate you can use the following solution:
elements = driver.find_elements_by_css_selector("table[summary='October 2018'] tbody td[role='gridcell'][headers^='day']>a")
for element in elements:
print(element.get_attribute("data-selectdate"))
You can use get_attribute api of element class to read attribute value of element.
css_locator = "table tr:nth-child(1) > td[headers='dayMonday'] > a"
ele = driver.find_element_by_css_selector(css_locator)
selectdate = ele.get_attribute('data-selectdate')

Doctrine2 Nested Set rendering as a table with indentation

I have categories structure (using Gedmo Nested Tree extension for Doctrine2) like in the example:
https://github.com/Atlantic18/DoctrineExtensions/blob/master/doc/tree.md#tree-entity-example
The question is how to display all the tree as a table like this:
<table>
<tr>
<td>Category-1 name</td>
<td>Category-1 other data</td>
</tr>
<tr>
<td>Category-2 name</td>
<td>Category-2 other data</td>
</tr>
<tr>
<td><span class="indent">---</span>Subcategory-2-1 name</td>
<td>Subcategory-2-1 other data</td>
</tr>
<tr>
<td><span class="indent">---</span><span class="indent">---</span>Subcategory-2-1-1 name</td>
<td>Subcategory-2-1-1 other data</td>
</tr>
<tr>
<td>Category-3 name</td>
<td>Category-3 other data</td>
</tr>
</table>
Another words, I need to get the tree as a plain list with Level param in 1 query.
I found a way to get the list only as array (getNodesHierarchy), but I need to have it as a collection like if I called findAll()
Found the solution:
class CategoryRepository extends NestedTreeRepository
{
public function getTreeList()
{
return $this->getNodesHierarchyQuery()->getResult();
}
}

Extract attribute value from html element

Been struggling with this for a couple of hours now...
I have the following regex:
(?<=\bdata-video-id=""."">)(.*?)(title=.*?>)
The following input:
<div class="cameras">
<table class="results">
<colgroup>
<col class="col0">
<col class="col1">
</colgroup>
<thead>
<tr>
<th title="Name">
Name
</th>
<th title="Date">
Date
</th>
</tr>
</thead>
<tbody>
<tr data-video-id="1">
<td title="149 - Cam123">
149 - Cam123
</td>
<td title="Feb 18 2013">
Feb 18 2013
</td>
</tr>
<tr data-video-id="2">
<td title="150 - Cam456">
150 - Cam456
</td>
<td title="Feb 18 2013">
Feb 18 2013
</td>
</tr>
</tbody>
</table>
</div>
The regex outputs this:
<td title="149 - Cam123">
<td title="150 - Cam456">
But what I'd like to get is the contents of the title attribute of the 1st cell from every table row:
149 - Cam123
150 - Cam456
The number of rows may obviously vary but the number of columns is fixed.
Please help me fine tune the above regex.
Thanks
NOTE: The solution MUST be a regular expression. I do not have access to the code base therefore an HTML parser or any other kind of code intervention is not possible. The only way I can hook into the application is by injecting a different regex.
Based on the OP requirements that it MUST be a regex, then my suggestion would be to add a group wrapper to the inner title information:
(?<=\bdata-video-id=""."">).*?title="(.*?)">
Otherwise, the general solution is to not use a regex:
Why are you using a regex? The typical solution for this due to the complexities of the tags is to use an HTML parser
Here is a SO about this topic
Here is another even more popular response on using regex for XHTML which was pointed out by Jeff Atwood in this blogpost