Sanitizing user supplied XSLT


We have an application which uses XSLT to format XML data for display as XHTML.
The system is able to cope with arbitrary XML schemas, so schemas and XSLTs need to be uploaded by users of the system. Clearly this is a task which is only allowed to Admin level users. However, it's also a pretty large bulls-eye to aim at, so I'm trying to make it more secure.
I should mention that we're using Saxon 9.0 B.
Is there any standard way to sanitise user supplied XSLT? So far I have identified three possible issues, although I am conscious that there may be more which I simply haven't thought of:
1. xsl:import and the document() function can get at the server file system. This is pretty easy to lock down using a custom URI Resolver, so I'm pretty confident I have this covered.
2. Output can contain JavaScript. I'm thinking of using something like OWASP Anti-Samy to white-list the allowed output tags.
3. XSLT can call Java functions. This is the one which is currently causing me a headache. I don't want to turn the capability off altogether (although at the moment I can't even see how to do that) because we're using it. My preferred solution would be to be able to lock down the acceptable Java namespaces so that only known safe functions can be executed. I am open to other suggestions though.
The gold standard would be a standard library which just handles all known XSLT-based vulnerabilities, but failing that, any suggestions on tackling the issues listed above (especially 3) would be much appreciated.
Thanks in advance

Saxon has a configuration option to disable use of "reflexive" (dynamically loaded) extension functions. This doesn't prevent use of "integrated" extension functions which have been explicitly registered in the configuration via an API. This seems to meet your requirement of allowing the service provider to register extension functions, but not allowing the stylesheet author to do so.
You can be even more selective if you want by defining your own JavaExtensionFunctionFactory to control how extension function calls are bound. This is fairly low-level system programming and you'll probably need to study the source code to see which methods you need to override to meet your needs.
As well as document(), you need to consider collection(), unparsed-text(), and xsl:result-document. In all cases there are Saxon hooks that allow you to control the behaviour.
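To make the resolver idea concrete, here is a minimal sketch using only the standard JAXP API bundled with the JDK. It is not Saxon-specific: the Saxon switches and hooks mentioned above are not shown, and the class name `LockedDownXslt` and the deny-everything policy are assumptions for illustration. With Saxon on the classpath the factory lookup and the supported features may differ.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.XMLConstants;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.URIResolver;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper; a sketch, not a complete sanitizer.
final class LockedDownXslt {
    // Refuses every URI the stylesheet asks for. A real resolver might
    // instead allow a whitelist of known-safe locations. Throwing (rather
    // than returning null) matters: null means "use default resolution".
    static final URIResolver DENY_ALL = (href, base) -> {
        throw new TransformerException("Blocked external reference: " + href);
    };

    static String transform(String xslt, String xml) throws TransformerException {
        TransformerFactory factory = TransformerFactory.newInstance();
        // Standard JAXP switch; the JDK's built-in processor uses it to
        // restrict extension functions, other processors may treat it differently.
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        factory.setURIResolver(DENY_ALL);   // consulted for xsl:import / xsl:include
        Transformer t = factory.newTransformer(new StreamSource(new StringReader(xslt)));
        t.setURIResolver(DENY_ALL);         // consulted for document() at run time
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }
}
```

A stylesheet that stays within its own input transforms normally, while one that tries to xsl:import a file should fail at compile time with a TransformerException.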

I don't think uploading and executing anybody's XSLT on the server is something sensible to do.
There are things that one can't prevent or detect, such as Denial of Service attacks like:
Endless recursion that eats up all available memory and crashes the server with a stack overflow.
A transformation that takes many minutes or hours. As the halting problem is undecidable, we can't know whether this is an intentional perpetual loop, an accidental programmer error, or a computation that may or may not converge.
There are certainly many other exploits, such as referencing a recursively defined entity ...

Related

API design choice - namespaces in XML interface

I'm designing a new API and I'm struggling with some decisions. I've read tons of blogs on SOAP vs REST and I used the popular APIs (Paypal, Amazon, etc.) as my guidelines.
I ended up with 2 endpoints in my API: one for SOAP and one for REST (XML). The SOAP one looks pretty good, but the XML interface looks somewhat strange. I'm calling it "strange" because I ended up with namespaces in some of my tags. For example:
[sample1]
<EnvelopeRequest xmlns:c1='http://foobar/CarrierX'>
<Weight>1.0</Weight>
<PostmarkDate>5/3/2013</PostmarkDate>
<c1:ShippingMethod>Ground</c1:ShippingMethod>
<c1:Notification>a#b.com</c1:Notification>
</EnvelopeRequest>
[sample2]
<EnvelopeRequest xmlns:cs='http://foobar/SpecialCarrier'>
<Weight>1.0</Weight>
<PostmarkDate>5/3/2013</PostmarkDate>
<cs:Shape>Flat</cs:Shape>
</EnvelopeRequest>
The reason the XML interface has namespaces is because it is auto-generated from the class definition (which has some inheritance). We are using WCF btw. That works just fine for SOAP (the WSDL is derived from the same class), because SOAP hides all the ugliness in the client proxies. However, after looking at many REST/XML services, I don't think I've seen namespaces being used too often. This also kinda scares me because I'm thinking that I would love to have a JSON interface in the near future, and JSON doesn't support namespaces.
My decision to make the API SOAP friendly came from the fact that many of our customers use Enterprise solutions which thrive on SOAP. But lately, with the growing popularity of Python and Ruby, which new clients seem to adopt more often, I'm starting to second guess my initial decision. The main thing that bothers me is the namespaces in the XML interface, but is it really an issue? Are namespaces in a REST/XML API such a big no-no that I should change my design?
If I do change my design, then my (2 previous) requests would look like so:
[sample1]
<EnvelopeRequest>
<Weight>1.0</Weight>
<PostmarkDate>5/3/2013</PostmarkDate>
<CarrierX>
<ShippingMethod>Ground</ShippingMethod>
<Notification>a#b.com</Notification>
</CarrierX>
</EnvelopeRequest>
[sample2]
<EnvelopeRequest>
<Weight>1.0</Weight>
<PostmarkDate>5/3/2013</PostmarkDate>
<SpecialCarrier>
<Shape>Flat</Shape>
</SpecialCarrier>
</EnvelopeRequest>
And yes, this would allow me to have a JSON interface in the future.

Removing namespaces would be a problem if by doing so you create the possibility of ambiguity in a given message. Is it possible for someone somewhere to create an EnvelopeRequest message with a Shape element that might be interpreted (by code or by people reading the message) in more than one way? The reason to introduce namespaces is to preclude this possibility. Tools like WCF's auto-generator are not able to answer this question in the general case so they err on the side of caution.
Only you can know the set of possible valid messages. In my experience, it's usually preferable to remove namespaces for the sake of not confusing your users/clients. There are a few reasons why I might change that preference:
I expect my message format to be used widely and intermixed with other formats. (A good example is the Atom syndication format)
I'm using someone else's widely used (and namespaced) format and planning to intermix it with my own (e.g. embedding XHTML inside my message).
I expect to embed a message of a given format inside a message of the same format (e.g. XSLT stylesheets that generate XSLT stylesheets).
In that latter case, you might find it convenient (though not absolutely necessary) to use namespaces to separate the inner message from the message that is carrying it by using different prefixes. I don't think any of these cases apply very often.

I would ponder why you have namespaces in the first place; those are some strange payloads.
But, disregarding that, no, the namespaces are not a big deal. Namespaces almost inevitably run afoul of XPath and XSL (since they tend to be namespace aware), but when consuming the document wholesale, a lot of times folks just ignore the namespace component completely, so in the end there's no difference.
I would clean up the namespaces for the sake of cleaning them up semantically, but not necessarily for the sake of the consumers. From a practical stand point, it's not that big a deal.
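As a rough illustration of "ignoring the namespace component", the sketch below parses a shortened version of the question's first sample with the JDK's DOM parser and looks up the same element two ways. The class and method names are made up for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper names; only the JDK's DOM API is used.
final class NamespaceLookup {
    static final String SAMPLE =
        "<EnvelopeRequest xmlns:c1='http://foobar/CarrierX'>"
      + "<Weight>1.0</Weight>"
      + "<c1:ShippingMethod>Ground</c1:ShippingMethod>"
      + "</EnvelopeRequest>";

    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
        f.setNamespaceAware(true); // JAXP parsers are not namespace aware by default
        return f.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    // Strict consumer: matches on namespace URI plus local name.
    static String strict(Document d, String ns, String local) {
        NodeList n = d.getElementsByTagNameNS(ns, local);
        return n.getLength() == 0 ? null : n.item(0).getTextContent();
    }

    // Lenient consumer: "*" matches any namespace, so only the local name counts.
    static String lenient(Document d, String local) {
        return strict(d, "*", local);
    }
}
```

The lenient lookup finds ShippingMethod whatever namespace it is in, which is why namespaces rarely hurt such consumers. The strict lookup only finds it under the CarrierX URI, which is what protects against the ambiguity described in the other answer.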

Embedded Lua - timing out rogue scripts (e.g. infinite loop) - an example anyone?

I have embedded Lua in a C++ application. I need to be able to kill rogue (i.e. badly written scripts) from hogging resources.
I know I will not be able to cater for EVERY type of condition that causes a script to run indefinitely, so for now, I am only looking at the straightforward Lua side (i.e. scripting side problems).
I also know that this question has been asked (in various guises) here on SO. Probably the reason why it is constantly being re-asked is that as yet, no one has provided a few lines of code to show how the timeout (for the simple cases like the one I described above), may actually be implemented in working code - rather than talking in generalities, about how it may be implemented.
If anyone has actually implemented this type of functionality in a C++ with embedded Lua application, I (as well as many other people - I'm sure), will be very grateful for a little snippet that shows:
How a timeout can be set (in the C++ side) before running a Lua script
How to raise the timeout event/error (C++ /Lua?)
How to handle the error event/exception (C++ side)
Such a snippet (even pseudocode) would be VERY, VERY useful indeed

You need to address this with a combination of techniques. First, you need to establish a suitable sandbox for the untrusted scripts, with an environment that provides only those global variables and functions that are safe and needed. Second, you need to provide for limitations on memory and CPU usage. Third, you need to explicitly refuse to load pre-compiled bytecode from untrusted sources.
The first point is straightforward to address. There is a fair amount of discussion of sandboxing Lua available at the Lua users wiki, on the mailing list, and here at SO. You are almost certainly already doing this part if you are aware that some scripts are more trusted than others.
The second point is the question you are asking. I'll come back to that in a moment.
The third point has been discussed at the mailing list, but may not have been made very clearly in other media. It has turned out that there are a number of vulnerabilities in the Lua core that are difficult or impossible to address, but which depend on "incorrect" bytecode to exercise. That is, they cannot be exercised from Lua source code, only from pre-compiled and carefully patched byte code. It is straightforward to write a loader that refuses to load any binary bytecode at all.
With those points out of the way, that leaves the question of a denial of service attack either through CPU consumption, memory consumption, or both. First, the bad news. There are no perfect techniques to prevent this. That said, one of the most reliable approaches is to push the Lua interpreter into a separate process and use your platform's security and quota features to limit the capabilities of that process. In the worst case, the run-away process can be killed, with no harm done to the main application. That technique is used by recent versions of Firefox to contain the side-effects of bugs in plugins, so it isn't necessarily as crazy an idea as it sounds.
One interesting complete example is the Lua Live Demo. This is a web page where you can enter Lua sample code, execute it on the server, and see the results. Since the scripts can be entered anonymously from anywhere, they are clearly untrusted. This web application appears to be as secure as can be asked for. Its source kit is available for download from one of the authors of Lua.

Snippet is not a proper use of terminology for what an implementation of this functionality would entail, and that is why you have not seen one. You could use debug hooks to provide callbacks during execution of Lua code. However, interrupting that process after a timeout is non-trivial and dependent upon your specific architecture.
You could consider using a longjmp to a jump buffer set just prior to the lua_call or lua_pcall after catching a timeout in a lua_Hook function. Then close that Lua context and handle the exception. The timeout could be implemented in numerous ways, and you likely already have something in mind that is used elsewhere in your project.
The best way to accomplish this task is to run the interpreter in a separate process. Then use the provided operating system facilities to control the child process. Please refer to RBerteig's excellent answer for more information on that approach.
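The separate-process approach is language-neutral on the host side. The sketch below shows the shape of it in Java rather than C++, purely as an illustration: start the child, wait with a deadline, and kill it if the deadline passes. The class name and the `lua untrusted.lua` command in the usage note are hypothetical, and memory or CPU quotas would still have to come from the operating system.

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper; a sketch of "run it in a child process and kill it on timeout".
final class TimedChild {
    // Returns true if the command exited on its own, false if it had to be killed.
    static boolean runWithTimeout(List<String> command, long timeout, TimeUnit unit)
            throws IOException, InterruptedException {
        Process child = new ProcessBuilder(command)
                .redirectErrorStream(true)
                // Discard output so a chatty child cannot block on a full pipe (Java 9+).
                .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                .start();
        if (child.waitFor(timeout, unit)) {
            return true;            // finished within the deadline
        }
        child.destroyForcibly();    // runaway script: kill the whole process
        child.waitFor();            // reap it before returning
        return false;
    }
}
```

A call such as `TimedChild.runWithTimeout(List.of("lua", "untrusted.lua"), 5, TimeUnit.SECONDS)` would give the script five seconds. The main application is unaffected whichever way it ends.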

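A very naive and simple, but all-Lua, method of doing it is:
-- Lua 5.1 (uses setfenv). The limit may be in the millions range depending on your needs.
local limit = 1000000 -- example value
setfenv(code, sandbox)
local ok, err = pcall(function()
  -- Count hook: raise an error every time `limit` VM instructions have run
  debug.sethook(function() error("Timeout!") end, "", limit)
  code()
end)
debug.sethook() -- clear the hook here so it is removed even after a timeout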
I expect you can achieve the same through the C API.
However, there's a good number of problems with this method. Set the limit too low, and it can't do its job. Too high, and it's not really effective. (Can the chunk get run repeatedly?) Allow the code to call a function that blocks for a significant amount of time, and the above is meaningless. Allow it to do any kind of pcall, and it can trap the error on its own. And whatever other problems I haven't thought of yet. Here I'm also plain ignoring the warnings against using the debug library for anything (besides debugging).
Thus, if you want it reliable, you should probably go with RB's solution.
I expect it will work quite well against accidental infinite loops, the kind that beginning lua programmers are so fond of :P
For memory overuse, you could do the same with a function checking for increases in collectgarbage("count") at far smaller intervals; you'd have to merge them to get both.

How to organize import and export code for versioned files?

If an application has to be able to open (and possibly save) the file format for the last N releases, how should the code be organized sanely so that updating the file format is easy and less error-prone? Assume the file format is in XML, and functions take in objects for export and produce objects for import.
Append a number to the end of each function name, and copy/paste it and increment the number for each new version? That's like maintaining multiple versions of version-controlled functions within source code. Perhaps do some magic at build time?

Firstly, supporting import of old versions is a lot easier than export. This is because usually later versions are different because they support more features. Hence saving to an old format may well mean loss of data. Consequently, my experience has only been on supporting import of multiple versions, spanning over a decade.
XML is of course the smart solution. It is designed with this problem in mind. The key point to me is that clean code structure follows from a clean data model. Provided new versions add features and these are represented by support for additional tags, you do not really have to recode handling of existing tags at all.
Now you could change the semantics of existing tags, requiring their recoding. Solution: don't do this if you can avoid it. When you add an attribute or tag, make sure you define the default value, and then old and new data files are handled seamlessly.
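A small sketch of the default-value idea, using the JDK's DOM parser. The `settings` element, the `units` attribute and its default are invented for the example; the point is only that one reader handles files written before and after the attribute existed.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical importer for a made-up <settings> file format.
final class SettingsImport {
    // Default for an attribute that (by assumption) was added in a later version.
    static final String DEFAULT_UNITS = "metric";

    static String readUnits(String xml) throws Exception {
        Element root = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
        // Old files lack the attribute; hasAttribute distinguishes missing from empty.
        return root.hasAttribute("units") ? root.getAttribute("units") : DEFAULT_UNITS;
    }
}
```

An old file such as `<settings/>` and a new one such as `<settings units='imperial'/>` both go through the same code path, with no version check.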
So it seems to me that with care you should be able to avoid many cases where you really have significantly different code for handling the same fields in different file versions. Where this does occur, I am guessing there are "special circumstances" (that's life with software). When you design the generic solution you'll have specific use cases in mind, and such special cases may not be handled anyway.
In summary: You'll future proof most efficiently via defining the upgrade path for the data model.

A version number is probably required.
But the best thing is to actually make a design for your XML. And make sure that the XML is structured in an intuitive and natural way. Otherwise the current organisation of your code may leak into the structure of the XML, which makes the XML harder to read for future versions of your product.
When saving enumerated values, don't write the numbers, but the name of the enumerable. If some elements could occur multiple times in principle, but not in your current application, design it as an array in XML. Make sure the numbers you write are in a unit that is logical in the problem domain, and not what your application happens to use right now.
In XML written this way, it should not be hard to support legacy versions of your XML.
Edit:
If you make drastic changes, it can be helpful to just implement a legacy data object that reads the legacy xml. Then you write a conversion method to convert from the old data model to the new one. This helps you to a fresh start esp. if the old data model was badly designed.

Using XSLT to process business rules?

A coworker of mine mentioned that one use of XSLT is processing business rules. He mentioned that there were systems that allowed users to write business rules in some kind of text format, and then the program uses XSLT to process the text and apply the rules at run-time in the application.
Can someone shed some light on this subject for me?
Thanks!

Ouch. I wouldn't recommend that.
As the first responder said, XSL-T is for transforming XML. It's not a rules engine. I think it sounds like a misuse of the technology.
XSL-T transforms are not intuitive to write. If one of your goals for business rules is allowing business folks to update and maintain the rules, I can't imagine a more obtuse and difficult technology for doing so than XSL-T.

I suppose your colleague was referring to BPEL, the Business Process Execution Language. BPEL is an XML-based executable language for describing business processes.
Being an XML format, business rules may be generated or transformed using XSLT. However, I'm not familiar with BPEL so I don't know any system doing something like that.

Yes. The somewhat-like text format is called Excel, and users tend to do all kinds of complex things with it. The programmer then spends an awful lot of time trying to process it with every shiny new technology he can find, including XSLT, and finally decides to hand-code around all the inconsistencies. It is not fully automated, as no sane user trusts the programmer to get it right first time.

XSLT stands for XSL Transformations. It is used to change an XML document from one form to another.
As for systems, Microsoft BizTalk uses XSLT in mapping operations that map one XML document into another. Within the XSLT the user can make use of .NET code to do more complex processing.
I'm sure someone else will have a much nicer explanation but you can easily find out more by Googling XSLT tutorials. It's a huge topic.

It should be possible: write your rules in XML, the case data should also be in XML, and then a generic XSLT could be written that compares the case data against the rules and executes the relevant rules in the correct sequence.
The business users don't need to know XSLT, they just need to know how to write the rules.

Using strings with "general purpose" XML in WS - good or bad?

We're working now on the design of a new API for our product, which will be exposed via web services. We have a dispute whether we should use strict parameters with well defined types (my opinion) or strings that will contain XML in whatever structure needed. It is quite obvious that ideally using a strict signature is safer, and it will allow our users to use tools like wsdl2java. OTOH, our product is developing rapidly, and if the parameters of a service will have to be changed, using XML (passed as a string or anyType - not complex type, which is well defined type) will not require the change of the interface.
So, what I'm asking for is basically rule of thumb recommendations - would you prefer using strict types or flexible XML? Have you had any significant problems using either way?
Thanks,
Eran

I prefer using strict types. That gives you access to client tools that make that end of the job much easier. You also state that if the messaging changes, the string approach will not require changing the interface. Personally, I see this as a disadvantage, not an advantage. If the interface changes, you will know very quickly which clients need to be updated.

Passing strings containing XML is an extremely bad idea and asking for trouble. Use messages that have a defined schema. I had to rewrite significant portions of an app that used a lot of XML internally instead of types. It was horribly slow and impossible to figure out what was happening.
