HTML labels with ocamlgraph - ocaml

Is it possible to make a graph like this one with ocamlgraph? HTML labels have to be delimited with <> instead of "" and I don't see any mention of this functionality in the documentation.

They can parse this kind of dot nodes: the documentation for the Dot_ast module of OCamlgraph has a Html of string case of the id type for this. It seems like they cannot print this kind of dot files, as the `Label node of the Dot attributes only handles direct strings.
If you need this feature, you could consider implementing it yourself (just change the files graphviz.ml and graphviz.mli), I'm sure the authors would be glad to have some contribution.

Related

Regex or Xpath for extracting nodes?

I have an XML file with the following structure;
<JobList>
<Job><subnodes/></Job>
<Job><subnodes/></Job>
</JobList>
This xml can be broken sometimes leaving a missing ending of <JobList> and missing end of </Job>.
I would like to be able to extract the <Job> nodes with full content on those that are closed with </Job>. What is the best way to do this?
To make a long story short I am using .NET and built in serializers for deserializing xml content. But since new properties are added you cannot just go back and forth between different versions as it is to strict. Mostly it works, but I would like to have a backup recovery method for this - hence the question.
The current situation is that the deserializer "crashes" the whole deserializing when a new property has been added instead of ignoring it. I am looking to manually parse it on error.
As mentioned on the comments, the ideal would be to make the xml valid, if for whatever reason that is not possible, the workaround is parsing the file as text with a regex.
A general regex for this case could be something like:
<Job>((?!<Job>).)*</Job>$
this will bring anything between a complete pair
Please notice that this will also return nodes with 'broken' inner nodes, but according to your question you are only concerned about missing and tags.

How to sanitize form values to allow text-only

I understand that if a user needs to supply HTML code as part of a form input (e.g. in a textarea) then I use an Anti-Samy policy to filter out the hazardous HTML that's not permitted.
However, I have some text-fields and text-areas which should be text-only. No HTML code at all should be inserted into the DB from these fields.
I am trying to therefore sanitize the inputs so that only raw text is inserted into the database. I believe I can do this two ways:
Use a Regex expression to filter out HTML code e.g. #REReplaceNoCase(FORM.InputField, "[^a-zA-Z\d\s:]", "", "ALL")#
Use a strict text-only Anti-Samy policy
Which option is the correct/good-practice way to remove any user inputted HTML code from a textfield. Or are there further options available to me?
While you could use AntiSamy to do it, I don't know how sensible that would be. Kinda defeats the purpose of it's flexibility, I think. I'd be curious about the overhead, even if minimal, to running that as a filter over just a regex.
Personally I'd probably opt for the regex route in this scenario. Your example appears to only strip the brackets. Is that acceptable in your situation? (understandable if it was just an example) Perhaps use something like this:
reReplace(string, "<[^>]*>", "", "ALL");

Whitelist tags exempt from escaping using Go's html/template

Pass a []byte into a template as the body of a message post on a forum-style web app. In the template, call a method to convert to string and along the way, switch out all newlines for line breaks:
<p>{{.BodyString}}</p>
...
func (p *Post) BodyString() string {
nl := regexp.MustCompile(`\n`)
return nl.ReplaceAllString(string(p.Body), `<br>`)
}
What you'll end up with:
paragraphs <br> <br>in <br> <br>this <br> <br>post
I don't want to pass the entire post in with HTML(p.Body), as it represents third party data from potentially untrustworthy sources. Is there a way to whitelist only some tags for formatting purposes using the vanilla Go1 template package?
I do think you want to parse the HTML. The HTML parser in exp/html was deemed incomplete and so removed from Go 1, although the exp tree is still in the Go source tree and can be accessed by weekly tag, for example. I don't know exactly what is incomplete. I used it for a simple task once and it met my needs.
Also of course, check the dashboard and see related SO post, Any smart method to get exp/html back after Go1?, mostly for the recomendation of http://code.google.com/p/go-html-transform/
I'm affraid the template package cannot help with this too much. If you want to remove specific (black-listed) tags (resp. the sub-tree enclosed by such tags) or allow to pass only specific tags (white-listed) then I think probably nothing less than parsing and rewriting the html AST can be a good solution. That said, one can see here and there some crazy REs trying to do the same, but I don't consider that a "good solution" and I doubt they can be a "correct" solution in the general case of a specs conforming HTML, including several legal irregularities, as it is probably ruled out of a regular grammar category problem.

How to not transform special characters to html entities with owasp antisamy

I use Owasp Anti samy with Ebay policy file to prevent XSS attacks on my website.
I also use Hibernate search to index my objects.
When I use this code:
String html = "special word: été";
// use the Ebay configuration file
Policy policy = Policy.getInstance(xssPolicyFile.getInputStream());
AntiSamy as = new AntiSamy();
CleanResults cr = as.scan(html, policy);
// result is now : "special word: été"
result = cr.getCleanHTML();
As you can see all chars "é" has been transformed to their html entity equivalent "é"
My page is on UTF-8, so I don't need this transformation. Moreover, when I index this text with Hibernate Search, it indexes the word with html entities, so I can't find word "été" on my index.
How can I force antisamy to not transform special chars to their html entity equivalent ?
thanks
PS: an issue has been opened : http://code.google.com/p/owaspantisamy/issues/detail?id=99
I ran into the same problem this morning.
I have encapsulated antisamy in a class and I use apache StringEscapeUtil from apache common-lang to restore special characters.
CleanResults cleanResults = antiSamy.scan(taintedHtml);
cleanedHtml = cleanResults.getCleanHTML();
return StringEscapeUtils.unescapeHtml(cleanedHtml)
The result is a cleaned up HTML without the HTML escaping of special characters.
Hope this helps.
Like Mohamad said it in a comment, Antisamy has just released a new directive named : entityEncodeIntlChars
here is the detail : http://code.google.com/p/owaspantisamy/source/detail?r=240
It seems that this directive solves the problem.
After scouring the AntiSamy source code, I found no way of changing this behavior apart from modifying AntiSamy.
Check out this one: http://code.google.com/p/owaspantisamy/source/browse/#svn/trunk/dotNet/current/source/owaspantisamy/html/scan
Grab the source and notice that key classes (AntiSamyDOMScanner, CleanResults) use standard framework objects (like XmlDocument). Compile and run with the binary you compiled - so that you can see everything in a debugger - as in which of the major classes actually corrupts your data. With that in hand you'll be able to either change a few properties on major objects to make it stop or inject your own post-processing to revert the wrongdoing (say with a regexp). Latter you can expose that as additional top-level property, say one named NoMess :-)
Chances are that behavior in that respect is different between languages (there's 3 in that trunk) but the same tactics will work no matter which one you have to deal with.

HTML templating in C++ and translations

I'm using HTML_Template for templating in my C++-based web app (don't ask). I chose that because it was very simple and it turns out to be a good solution.
The only problem right now is that I would like to be able to include translatable strings in the HTML templates (HTML_Template does not really support that).
Ultimately, what I would like is to have a single file that contains all the strings to be translated. It can then be given to a translator and plugged back in to the app and used depending on which language the user chose in settings.
I've been going back and forth on some options and was wondering what others felt was the best choice (or if there's a better choice that isn't listed)
Extend HTML_Template to include a tag for holding the literal string to translate. So, for example, in the HTML I would put something like
<TMPL_TRANS "this is the text to translate"/>
Use a completely separate scheme for translation and preprocess the HTML files to generate the final template files (without the special translation lingo). For example, in the pre-processed file, translatable text would look like this:
{{this is the text to translate}}
and the final would look like:
this is the text to translate
Don't do anything and let the translators find the string to translate in the html and js files themselves.
You may want to consider arrays, if not already.
A popular implementation for translating strings is to use tables and indices. One index is for the language and the second index is for the string. Create a function that returns strings based on these two indices:
const std::string& Get_String(unsigned int language_index, unsigned int string_index);
Each language would have a table of strings (or const char *). There would be a table of pointers to language tables, one for each supported language.
The biggest pain is to convert existing code to use this system.
Hope this helps.