RegEx Parsing for HTML attributes - one specific string

With Delphi Rio, I am using an HTML/DOM parser. I am traversing the various nodes, and the parser is returning attributes/tags. Normally these are not a problem, but for some attributes/tags, the string returned includes multiple attributes. I need to parse this string into some type of container, such as a StringList. The attribute string the parser returns already has the '<' and '>' removed.
Some examples of attribute strings are:
data-partnumber="BB3312" class=""
class="cb10"
account_number = "11432" model = "pay_plan"
The end result I want is a StringList with one or more name=value pairs.
I have not used RegEx to any real degree, but I think that I want to use RegEx. Would this be a valid approach? For a RegEx pattern, I think the pattern I want is
\w\s?=\s?\"[^"]+"
To identify multiple matches within a string, I would use TRegex.Matches. Am I overlooking something here that will cause me issues later on?
*** ADDITIONAL INFO ***
Several people have suggested to use a decent parser. I am currently using the openSource HTML/DOM parser found here: https://github.com/sandbil/HTML-Parser
In light of that, I am posting more info. Here is an HTML snippet I am parsing. Look at the line I have marked with ***** at the end. My parser is returning this as
Node.AttributeText= 'data-partnumber="B92024" data-model="pay_as_you_go" class="" '
Would a different HTML DOM parser return this as 3 different elements/attributes? If so, can someone recommend a parser?
<section class="cc02 cc02v0" data-trackas="cc02" data-ocomid="cc02">
<div class="cc02w1">
<div class="otable otable-scrolling">
<div class="otable-w1">
<table class="otable-w2">
<thead>
<tr>
<th>Product</th>
<th>Unit Price</th>
<th>Metric</th>
</tr>
</thead>
<tbody>
<tr>
<td class="cb152title"><div>MySQL Database for HeatWave-Standard-E3</div></td>
<td><div data-partnumber="B92024" data-model="pay_as_you_go" class="">$0.3536<span></span></div></td> *****
<td><div>Node per hour</div></td>
</tr>
<tr data-partnumber="B92426">
<td class="cb152title">MySQL Database—Storage</td>
<td><span data-model="pay_as_you_go" class="">$0.04<span></span></span></td>
<td>Gigabyte storage capacity per month</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</section>

The documentation for the parser you are using says TDomTreeNode has an AttributesText property that is a "string with all attributes", which you have shown examples of. But it also has an Attributes property that is "parsed attributes" provided as a TDictionary<string, string>. Have you tried looking into the values of that property yet? You should not need to use a RegEx at all, just enumerate the entries of the TDictionary instead, eg:
// TPair and TDictionary come from System.Generics.Collections
var
  Attr: TPair<string, string>;
begin
  for Attr in Node.Attributes do
  begin
    // use Attr.Key and Attr.Value as needed...
  end;
end;

(As the OP asked about using a RegEx to parse attribute=value pairs, this answers the question directly, which other users may be looking for in the future.)
RegEx based answer
A RegEx is a powerful tool for this. From the data you have provided, you can extract the attribute name and value pairs using:
(\S+)\s*=\s*(\"?)([^"]*)(\2|\s|$)
This uses grouping and can be explained as follows:
The first result group is the attribute name (it matches non-whitespace characters)
The second result group is an enclosing " if present, otherwise an empty string
The third result group is the value of the attribute
The fourth result group is the closing delimiter: the same quote as the second group, a whitespace character, or the end of the string
Because a RegEx can be applied repeatedly to the same subject, you can call MatchAgain to see whether there is another match, and so read all of the attributes in a loop.
procedure ParseAttributes(AInput: String; ATarget: TStringList);
var
  pRegEx: TPerlRegEx; // declared in System.RegularExpressionsCore
  LMatched: Boolean;
begin
  pRegEx:=TPerlRegEx.Create;
  try
    pRegEx.RegEx:='(\S+)\s*=\s*(\"?)([^"]*)(\2|\s|$)';
    pRegEx.Subject:=AInput;
    LMatched:=pRegEx.Match;
    while LMatched do
    begin
      // Group 1 is the attribute name, group 3 is the value without its quotes
      ATarget.Add(pRegEx.Groups[1]+'='+'"'+pRegEx.Groups[3]+'"');
      LMatched:=pRegEx.MatchAgain;
    end;
  finally
    pRegEx.Free;
  end;
end;
Disclaimer: I haven't tried compiling that code, but hopefully it's enough to get you started!
Practical Point: With respect to the actual problem you posed with your DOM parser, this is a task that already has existing solutions, so a practical answer may well be to use a DOM parser that works! If a RegEx is something you need for whatever reason, this one should do the job.
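For readers who want to try the pattern above before wiring it into Delphi, here is a minimal sketch in Python (chosen only because it is quick to run). The helper name parse_attributes is made up for this illustration; re.finditer plays the role of the Match/MatchAgain loop.

```python
import re

# Same pattern as in the answer above: name, optional opening quote,
# value, then the matching closing quote, whitespace, or end of string.
ATTRIBUTE_PATTERN = re.compile(r'(\S+)\s*=\s*("?)([^"]*)(\2|\s|$)')


def parse_attributes(attribute_text):
    """Return (name, value) pairs found in an attribute string (hypothetical helper)."""
    return [(m.group(1), m.group(3))
            for m in ATTRIBUTE_PATTERN.finditer(attribute_text)]


if __name__ == "__main__":
    # The three sample strings from the question.
    for sample in ('data-partnumber="BB3312" class=""',
                   'class="cb10"',
                   'account_number = "11432" model = "pay_plan"'):
        print(parse_attributes(sample))
```

The pattern is only as robust as the samples it was written for: unquoted values followed by further attributes, or single-quoted values, are not split cleanly, which is one more reason to prefer the parser's own Attributes dictionary when it is available.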

Related

XSS remediation - Improper Neutralization of Script-Related HTML Tags

I'm trying to fix some XSS errors with my code. #getEmailRecord is the line that contains the problem. How do I fix a piece of code like this? The error: Improper Neutralization of Script-Related HTML Tags in a Web Page (Basic XSS). Veracode cleansing solution: coldfusion.runtime.CFPage.HTMLEditFormat
<tr>
<td> </td>
<td class="left"><b>To: </b></td>
<td class="left">#getEmailRecord.EMAIL_TO#</td></tr>
<tr><td colspan="4"> </td></tr>
Thanks! This is my first time doing something like this so any help is much appreciated.
"Veracode cleansing solution: coldfusion.runtime.CFPage.HTMLEditFormat"
The recommended solution tells you what to do. Wrap any variable that contains user-supplied data and is output by your code in #HTMLEditFormat()#:
<td class="left">#HTMLEditFormat(getEmailRecord.EMAIL_TO)#</td></tr>
HTMLEditFormat
Description
Replaces special characters in a string with their HTML-escaped equivalents.
And if you are on ColdFusion 10 or newer, you have even more options: the EncodeFor functions.
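To make the effect of HTML escaping concrete, here is a small Python sketch. It uses Python's html.escape purely as a stand-in to show the idea; it is not ColdFusion's implementation, and the sample value is hypothetical.

```python
import html

# Hypothetical user-supplied value standing in for getEmailRecord.EMAIL_TO.
email_to = "<script>alert(1)</script>"

# html.escape replaces <, > and & (and quotes by default) with entities,
# so the browser displays the text instead of executing it as markup.
escaped = html.escape(email_to)
print(escaped)  # &lt;script&gt;alert(1)&lt;/script&gt;
```

Once escaped, the value is rendered as literal text in the table cell, which is what the Veracode finding asks for.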

How to make a plone view that inserts other smaller views of content items?

I think this should be simple. I have a folderish TTW dexterity content item (a drop box) that contains folderish TTW dexterity items (proposals). Each proposal contains TTW dexterity reviews that have fields I want to summarize.
I can easily make a view that generates a table as indicated below for any proposal with simple modifications to the folderlisting view:
[review1 link] [criterion_1 value] [criterion-2 value]...
[review2 link] [criterion_1 value] [criterion-2 value]...
.
.
I can also generate a working table view for a drop box by modifying the folderlisting view:
[proposal1 link] [column I would like to insert the above table in for this proposal]
[proposal2 link] [column I would like to insert the above table in for this proposal]
.
.
My problem is I cannot figure out how to insert the first table into the cells in the second column of the second table. I've tried two things:
Within the view template for the dropbox listing, I tried duplicating the repeat macro of the listingmacro, giving it and all its variables new names to have it iterate on each proposal. This easily accesses all of the Dublin Core schemata for each review, but I cannot get access to the dexterity fields. Everything I have tried (things that work when generating the first table) yields LocationError and AttributeError warnings. Somehow when I go down one level I lose some of the information necessary for the view template to find everything. Any suggestions?
I've also tried accessing the listing macro for the proposal, with calls like <metal use-macro="item/first_table_template_name/listing"/>. Is this even partially the right approach? It gives no errors, but also does not insert anything into my page.
Thanks.
This solution is loosely based on the examples provided by kuel: https://github.com/plone/Products.CMFPlone/blob/854be6e30d1905a7bb0f20c66fbc1ba1f628eb1b/Products/CMFPlone/skins/plone_content/folder_full_view.pt and https://github.com/plone/Products.CMFPlone/blob/b94584e2b1231c44aa34dc2beb1ed9b0c9b9e5da/Products/CMFPlone/skins/plone_content/folder_full_view_item.pt. --Thank you.
The way I found easiest to create and debug this was:
Create a minimalist template from the Plone standard template folder_listing.pt which makes just the table of summarized review data for a single proposal. The template is just for a table, with no header info or any other slots. This is a stripped version, but there is nothing above the first statement. The key statements that allowed access to the fields were of the form:
python: item.getObject().restrictedTraverse('criterion_1')
The table template:
<table class="review_summary listing">
<tbody><tr class="column_labels"><th>Review</th><th>Scholarly Merit</th><th>Benefits to Student</th><th>Clarity</th><th>Sum</th></tr>
<metal:listingmacro define-macro="listing">
<tal:foldercontents define="contentFilter contentFilter|request/contentFilter|nothing;
contentFilter python:contentFilter and dict(contentFilter) or {};
I kept all the standard definitions from the original template.
I have just removed them for brevity.
plone_view context/@@plone;">
The following tal:sum is where I did some math on my data. If you are
not manipulating the data this would not be needed. Note that I am only
looking at the first character of the choice field.
<tal:sum define="c1_list python:[int(temp.getObject().restrictedTraverse('criterion_1')[0])
for temp in batch if temp.portal_type=='ug_small_grants_review'];
c1_length python: test(len(c1_list)<1,-1,len(c1_list));
c2_list python:[int(temp.getObject().restrictedTraverse('criterion_2')[0])
for temp in batch if temp.portal_type=='ug_small_grants_review'];
c2_length python: test(len(c2_list)<1,-1,len(c2_list));
c1_avg python: round(float(sum(c1_list))/c1_length,2);
c2_avg python: round(float(sum(c2_list))/c2_length,2);
avg_sum python: c1_avg+c2_avg;
">
<tal:listing condition="batch">
<dl metal:define-slot="entries">
<tal:entry tal:repeat="item batch" metal:define-macro="entries">
<tal:block tal:define="item_url item/getURL|item/absolute_url;
item_id item/getId|item/id;
Again, this is the standard define from the folder_listing.pt
but I've left out most of it to save space here.
item_samedate python: (item_end - item_start < 1) if item_type == 'Event' else False;">
<metal:block define-slot="entry"
The following condition is key if you can have things
other than reviews within a proposal. Make sure the
item_type is proper for your review/item.
tal:condition="python: item_type=='ug_small_grants_review'">
<tr class="review_entry"><td class="entry_info">
<dt metal:define-macro="listitem"
tal:attributes="class python:test(item_type == 'Event', 'vevent', '')">
I kept all the standard stuff from folder_listing.pt here.
</dt>
<dd tal:condition="item_description">
</dd>
</td>
The following tal:comp block is used to calculate values
across the rows because we do not know the index of the
item the way the batch is iterated.
<tal:comp define = "crit_1 python: item.getObject().restrictedTraverse('criterion_1')[0];
crit_2 python: item.getObject().restrictedTraverse('criterion_2')[0];
">
<td tal:content="structure crit_1"># here</td>
<td tal:content="structure crit_2"># here</td>
<td tal:content="structure python: int(crit_1)+int(crit_2)"># here</td>
</tal:comp>
</tr>
</metal:block>
</tal:block>
</tal:entry>
</dl>
<tr>
<th>Average</th>
<td tal:content="structure c1_avg"># here</td>
<td tal:content="structure c2_avg"># here</td>
<td tal:content="structure avg_sum"># here</td>
</tr>
</tal:listing>
</tal:sum>
<metal:empty metal:define-slot="no_items_in_listing">
<p class="discreet"
tal:condition="not: folderContents"
i18n:translate="description_no_items_in_folder">
There are currently no items in this folder.
</p>
</metal:empty>
</tal:foldercontents>
</metal:listingmacro>
</tbody></table>
Create another listing template that calls this one to fill the appropriate table cell. Again, I used a modification of the folder_listing.pt. Basically within the repeat block I put the following statement in the second column of the table:
This belongs right after the </dd> tag ending the normal item listing.
</td> <td class="review_summary">
<div tal:replace="structure python:item.getObject().ug_small_grant_review_summary_table()" />
</td>
Note that "ug_small_grant_review_summary_table" is the name I gave to the template shown in more detail above.
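The arithmetic in the tal:sum block above can be checked in plain Python before it goes into the template. This is a minimal sketch under stated assumptions: the function name column_average and the sample choice strings are invented for illustration, and only the first character of each choice value is treated as the score, as in the template.

```python
def column_average(choices):
    """Average the leading digit of each choice value, rounded to 2 places.

    Mirrors the template: an empty list uses -1 as the divisor so the
    division never raises ZeroDivisionError.
    """
    scores = [int(choice[0]) for choice in choices]
    divisor = len(scores) if scores else -1
    return round(float(sum(scores)) / divisor, 2)


if __name__ == "__main__":
    # Hypothetical choice-field values; only the first character is used.
    criterion_1 = ["3 - Good", "4 - Very good"]
    criterion_2 = ["2 - Fair", "3 - Good", "3 - Good"]
    c1_avg = column_average(criterion_1)
    c2_avg = column_average(criterion_2)
    print(c1_avg, c2_avg, c1_avg + c2_avg)
```

Note that this only works while every score is a single digit; a choice such as "10 - Excellent" would be read as 1.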

CFdump and Bootstrap tooltips fight each other

I attach Bootstrap tooltips via
$("[title]").tooltip({ html: true });
When I use a <cfdump>, title tags are attached all over the place. The start of the <cfdump> html looks like this
<table class="cfdump_struct">
<tr><th class="struct" colspan="2" onClick="cfdump_toggleTable(this);" style="cursor:pointer;" title="click to collapse">struct</th></tr>
<tr>
<td class="struct" onClick="cfdump_toggleRow(this);" style="cursor:pointer;" title="click to collapse">Cause</td>
<td>
Is there a way to keep the two from stepping on each other?
You shouldn't care because cfdump shouldn't be used in production, however you could just reduce the array returned by the jQuery selector. Not sure if this is the best way to do it, but it works:
$("[title]").filter(function(){
return ($(this).closest(".cfdump_struct").length == 0);
}).tooltip({ html: true });
It runs the filter function for each item in the array returned by the selector. If it is within the CFDUMP table (signified by the .cfdump_struct class) it will not return it. You will have to extend this to other cfdump types (queries, etc) but this should get you started.
Again, it really shouldn't matter since you shouldn't be using cfdump in production code anyway.
You can see this in action here: http://jsfiddle.net/seancoyne/rc7TL/

selenium regular expression such as id=regexp:.* doesn't work

I have an .aspx page and I'm trying to locate a textbox via the selenium UI. The id is: p_lt_ctl01_pageplaceholder_p_lt_ctl00_SignUpFree_txtFirstName
I've tried: id=*._txtFirstName
and: id=glob:*_txtFirstName
Is there a better way other than CSS to locate a textbox whose id may change each time it's compiled?
You can't put wildcards in an id "selector". Either you use id=whole_id_here or don't.
Fortunately, for your case, you can use the CSS selector:
[id*=_txtFirstName]
In Selenium IDE, use it like:
css=[id*=_txtFirstName]
Example Selenium IDE source snippet:
<tr>
<td>storeText</td>
<td>css=[id*=_txtFirstName]</td>
<td>x</td>
</tr>
<tr>
<td>echo</td>
<td>${x}</td>
<td></td>
</tr>
Note: If _txtFirstName is always at the end, you can also use the CSS locator with $ instead of * (it is more restrictive, will only match if it is at the end, while * matches if it is anywhere): [id$=_txtFirstName]. (In Selenium IDE, naturally, use it like: css=[id$=_txtFirstName].)
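The difference between the two attribute selectors can be illustrated with plain string tests. This Python sketch only mimics the matching rules for non-empty fragments; it does not call Selenium, and the function names and the second sample id are made up for the illustration.

```python
def matches_contains(element_id, fragment):
    """Mimics [id*=fragment]: true when fragment occurs anywhere in the id."""
    return fragment in element_id


def matches_ends_with(element_id, fragment):
    """Mimics [id$=fragment]: true only when the id ends with fragment."""
    return element_id.endswith(fragment)


if __name__ == "__main__":
    generated_id = "p_lt_ctl01_pageplaceholder_p_lt_ctl00_SignUpFree_txtFirstName"
    other_id = "p_lt_ctl01_txtFirstName_hint"  # hypothetical sibling element
    for candidate in (generated_id, other_id):
        print(candidate,
              matches_contains(candidate, "_txtFirstName"),
              matches_ends_with(candidate, "_txtFirstName"))
```

The second id shows why $= is the safer choice when the suffix is stable: *= would also select the hypothetical hint element.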
You could try using contains in an XPath locator:
//*[contains(@id,'_txtFirstName')]

jSoup - How to get elements with background style (inline CSS)?

I'm building an app in Railo, which uses the jSoup .jar library. It all works really well in my CFML language.
Anyhow, I can grab every element with a "style" attribute doing:
<cfset variables.mySelection = variables.myDocument.select("*[style]") />
But this returns an array which contains elements that sometimes do not have a "background" or "background-image" style on them. As an example, the HTML might look like so:
<p style="color: red;">I should not be selected</p>
<p style="background: green">I **should** be selected</p>
<p style="text-align: left;">I should not be selected</p>
<p style="background-image: url('/path/to/image.jpg');">I **should** be selected</p>
So I can get these elements above, but I don't want the 1st and 3rd in my array, as they don't have a background style...do you know how I can only grab and work with these?
Please note, I'm not after a computed style, or anything that complicated; I'm just wondering if I can filter based on the properties of an inline CSS style. Perhaps some regex after the fact? I'm open to ideas!
I tried messing with :contains(background) as a key word, but I wasn't sure if that was the correct path?
Many thanks for your help.
Michael.
Try with:
variables.myDocument.select("*[style*='background']")
*= is the standard selector for matching a substring in the attribute content.
Elements els = doc.select("div[style*=dashed]");
Or
Elements elements = doc1.select("span[style*=font-weight:bold]");
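Because *= is a plain substring test, it also matches when "background" appears somewhere other than a property name, for example inside a url(). If that matters, the selected elements can be filtered afterwards by looking at the property names of the inline style. The following Python sketch shows one way to do that check on the style string; the function name has_background is hypothetical, and the naive split on ";" assumes no semicolons inside values (such as data URIs).

```python
def has_background(style):
    """Return True if an inline style declares background or background-image."""
    for declaration in style.split(";"):
        # partition splits on the first colon only, so URLs in values stay intact.
        name, separator, _value = declaration.partition(":")
        if separator and name.strip().lower() in ("background", "background-image"):
            return True
    return False


if __name__ == "__main__":
    # Inline styles from the question's four sample paragraphs.
    styles = [
        "color: red;",
        "background: green",
        "text-align: left;",
        "background-image: url('/path/to/image.jpg');",
    ]
    print([has_background(style) for style in styles])
```

The same test could be applied in CFML by looping over the elements returned by select("*[style]") and reading each element's style attribute.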