Should output be encoded at the API or client level? - xss

We are moving our Web app architecture to being microservice based. We have an internal debate as to whether an REST API that provides content (in JSON, let's say) should be looking to encode content to make it safe, or whether the consumers that take that content and display it (in HTML, for example, or otherwise use it) should be responsible for that encoding. The use case is to prevent XSS attacks and similar.
The provider stance is "Well, we can't know how to encode it for everyone, or how you're going to use the content, so of course the consumers should encode the content."
The consumer stance is "There is one provider and multiple consumers, so it's more secure to do it once in the providing API than to hope that every consumer does it."
Are there any generally accepted best practices on this and why?

As a rule, data when passing through "internal" processes (whatever that might mean to use) should be stored or encoded in whatever "internal" format makes sense. The format chosen is typically designed to minimize encoding/decoding steps and to prevent data loss.
Then, upon output, data is encoded using whatever output format makes sense. Preventing data loss is important, but also proper escaping and formatting is key here.
So for example, with internal APIs, data in binary format may be sufficent. But when you output JSON or HTML or XML or PDF, you have to encode and escape your data appropriately to fit the output format.
The important point here is that different output formats have different concepts of "safe". What's "safe" for HTML may not be safe for JSON, and what's safe for JSON may not be safe for SQL. Data is encoded upon output specifically so that you can use the proper encoding for the task. You cannot assume that this step is done for you ahead of time, nor should you put your output function in the position to determine whether or not encoding must be done. If you stick with the rule: "output function ALWAYS encodes for safety", then you will never have to worry about data injection attacks.

I would say that the two important points are the following:
The encoding used by the provider MUST be specified with extreme clarity and precision in a reference document, so that all consumer implementors can know what to expect.
Whatever default encoding is used by the provider MUST keep all needed information, i.e. still be amenable to transcoding by any consumer who would wish to do it.
If you follow these two rules then you will have done 95% of the job for reliability and security.
As for your specific question, a good practice is a middle-ground: the provider follows by default a "generic" encoding, but consumers can ask (optionally) for a specific encoding which the provider may then apply -- this allows the provider to support a number of dumb, lightweight clients of possibly different kinds and can be extended later on with extra encodings without breaking the API.

I firmly believe it is both the consumer and the provider that need to do their part in being good citizens in the security space.
As the provider I want to make sure I deliver a secure product. I don't need to know the context in which my client is going to use my product, all I need to know is how I am going to deliver it. If my delivery is in JSON, then I can use that context to escape my data before sending it off, similarly for XML, plain text, etc. Further more there are transport methods that aid in security already. JSONP is one such delivery method. This ensures the payload is consumed appropriately.
As the consumer, which by the way in our environment no one is the final consumer, we are all providers to the final end client (the end users via a web browser mostly.). Because of this we have to also secure the data at this end. I would never trust a black box API to do this job for me, I would always make a point to ensure a secure payload. There are many tools out there, the ESAPI project from OWASP comes to mind, that will aid in the sanitization by context of data. Remember that you are eventually sending this data on to the end-user (browser) and if there is something awry you won't be able to pass the buck. Your service will be viewed as the vulnerable one regardless of where the flaw lies. Additionally, as the consumer, you may not always be able to rely on the black box provider to fix their flaws in a timely fashion. What if their support is lacking or they have higher priorities. Does that mean you continue to provide a known flaw to your end-users?
Security is about layers, and having safeguards at the source and end-points is always preferable.

Related

Deidentification using FPE Primitive Transform and ignoring certain characters

is there a way to have the de-identify DLP ignore certain characters? Currently, encrypting EMails with a custom alphabet that includes the "-" sign ends up encrypting as below. Ideally the encrypted text would all be of the format "XXX-XX-XXXX" I noticed there was a CharsToIgnore call that could be made but not sure where to put that... maybe metadata field in the API that's called (deidentify_content?) or at some other place.
Thanks!
325-7959452
31424943-6
Officially you should reconsider using FPE as it isn't considered the best option by security standards. Only use it if you are in legacy system that is going to be strict about formatting, otherwise deterministic crypto is much better for emails, another technique available in the API will give you better control.
But as for FPE, there is not the ability to skip characters when using FPE, that feature only exists for masking.

Language agnostic cookie encoding / decoding standards

I'm having difficulties to figure out what is the standard (or is there any?) for encoding/decoding cookie values regardless to backend platforms.
According to RFC 2109:
The VALUE is opaque to the user agent and may be anything the origin server chooses to send, possibly in a server-selected printable ASCII encoding. "Opaque" implies that the content is of interest and relevance only to the origin server. The content may, in fact, be readable by anyone that examines the Set-Cookie header.
which sounds like "server is the boss" and it decides whatever the encoding will apply. This makes it quite difficult to set a cookie from, say PHP backend and read it from Python or Java or whatever, without writing any manual encode/decode handling on both sides.
Let's say we have a value needs to be encoded. Russian /"печенье (*} значения"/ means "cookie value" with some additional non alpha-numeric chars in it.
Python:
Almost every WSGI server does the same and uses Python's SimpleCookie class that encodes to / decodes from octal literals even though many says that octal literals are depreciated in ECMA-262, strict mode. Wtf?
So, our raw cookie value becomes "/\"\320\277\320\265\321\207\320\265\320\275\321\214\320\265 (*} \320\267\320\275\320\260\321\207\320\265\320\275\320\270\321\217\"/"
Node.js:
Haven't tested at all but I'm just guessing a JavaScript backend would do it with native encodeURIComponent and decodeURIComponent functions that use hexadecimal escaping / unescaping?
PHP:
PHP applies urlencode to the cookie values that is similar to encodeURIComponent but not exactly the same.
So the raw value becomes; %2F%22%D0%BF%D0%B5%D1%87%D0%B5%D0%BD%D1%8C%D0%B5+%28%2A%7D+%D0%B7%D0%BD%D0%B0%D1%87%D0%B5%D0%BD%D0%B8%D1%8F%22%2F that is not even wrapped with double quotes.
However; if the JavaScript value variable has the PHP encoded value above, decodeURIComponent(value) gives /"печенье+(*}+значения"/, see "+" chars instead of spaces..
What is the situation in Java, Ruby, Perl and .NET? Which language is following (or closest) to the desired behaviour. Actually, is there any standard for this defined by W3?
I think you've got things a bit mixed up here. The server's encoding does not matter to the client, and it shouldn't. That is what RFC 2109 is trying to say here.
The concept of cookies in http is similar to this in real life: Upon paying the entrance fee to a club you get an ink stamp on your wrist. This allows you to leave and reenter the club without paying again. All you have to do is show your wrist to the bouncer. In this real life example, you don't care what it looks like, it might even be invisible in normal light - all that is important is that the bouncer recognises the thing. If you were to wash it off, you'll lose the privilege of reentering the club without paying again.
In HTTP the same thing is happening. The server sets a cookie with the browser. When the browser comes back to the server (read: the next HTTP request), it shows the cookie to the server. The server recognises the cookie, and acts accordingly. Such a cookie could be something as simple as a "WasHereBefore" marker. Again, it's not important that the browser understands what it is. If you delete your cookie, the server will just act as if it has never seen you before, just like the bouncer in that club would if you washed off that ink stamp.
Today, a lot of cookies store just one important piece of information: a session identifier. Everything else is stored server-side and associated with that session identifier. The advantage of this system is that the actual data never leaves the server and as such can be trusted. Everything that is stored client-side can be tampered with and shouldn't be trusted.
Edit: After reading your comment and reading your question yet again, I think I finally understood your situation, and why you're interested in the cookie's actual encoding rather than just leaving it to your programming language: If you have two different software environments on the same server (e.g.: Perl and PHP), you may want to decode a cookie that was set by the other language. In the above example, PHP has to decode the Perl cookie or vice versa.
There is no standard in how data is stored in a cookie. The standard only says that a browser will send the cookie back exactly as it was received. The encoding scheme used is whatever your programming language sees fit.
Going back to the real life example, you now have two bouncers one speaking English, the other speaking Russian. The two will have to agree on one type of ink stamp. More likely than not this will involve at least one of them learning the other's language.
Since the browser behaviour is standardized, you can either imitate one languages encoding scheme in all other languages used on your server, or simply create your own standardized encoding scheme in all languages being used. You may have to use lower level routines, such as PHP's header() instead of higher level routines, such as start_session() to achieve this.
BTW: In the same manner, it is the server side programming language that decides how to store server side session data. You cannot access Perl's CGI::Session by using PHP's $_SESSION array.
Regardless of the cookie being opaque to the client, it still needs to conform to the HTTP spec. rfc2616 specifies that all HTTP headers should be ASCII (ISO-8859-1). rfc5987 extends that to support other character sets, but I don't know how widely supported it is.
I prefer to encode into UTF8 and wrap with base64 encoding. It's fast, ubiquitous, and will never mangle your data at either end.
You will need to ensure an explicit conversion into UTF8 even when wrapping it. Other languages & runtimes, while supporting Unicode, may not store strings as UTF8 internally... like many Windows APIs. Python 2.x, in my experience, rarely gets Unicode strings right without explicit conversion.
ENCODE: nativeString -> utfEncode() -> base64Encode()
DECODE: base64Decode() -> utfDecode() -> nativeString
Almost every language I know of, these days, supports this. You can look for a universal single-function encode, but I err on the side of caution and choose the two-step approach... especially with foreign character sets.

How to safely store strings (i.e. password) in a C++ application?

I'm working on a wxWidgets GUI application that allows the user to upload files to an FTP server and a pair of username/password is required to access the FTP server.
As far as I know, STL strings or even char* strings are visible to end user even the program is compiled already, using hex editors or maybe string extractors like Sysinternals String Utility.
So, is there a safe/secure way to store sensitive informations inside a C++ application?
PS. I cannot use .NET for this application.
This is actually independent of the programming language used.
FTP is a protocol that transfers its password in plain text. No amount of obfuscation will change that, and an attacker can easily intercept the password as it is transmitted.
And no amount of obfuscation, no matter the protocol used, will change the fact that your application has to be able to decode that password. Any attacker with access to the application binary can reverse-engineer that decoding, yielding the password.
Once you start looking at secure protocols (like SFTP), you also get the infrastructure for secure authentication (e.g. public/private key) when looking at automated access.
Even then you are placing the responsibility of not making that key file accessable to anyone else on the file system, which - depending on the operating system and overall setup - might not be enough.
But since we're talking about an interactive application, the simplest way is to not make the authentication automatic at all, but to query the user for username and password. After all, he should know, shouldn't he?
Edit: Extending on the excellent comment by Kate Gregory, in case that users share a common "technical" (or anonymous) account accessing your server, files uploaded by your app should not be visible on the server before some kind of filtering was done by you. The common way to do this is having an "upload" directory where files can be uploaded to, but not be downloaded from. If you do not take these precautions, people will use your FTP server as turntable for all kind of illegal file sharing, and you will be the one held legally responsible for that.
I'm not sure if that is possible at all, and if, than not easy. If the password is embedded and your program can read it, everybody with enough knowledge should be able to do.
You can improve security against lowlevel attempts (like hexeditor etc.) by encrypting or obfuscating (eg two passwords which generate the real password by XOR at runtime and only at the moment you need it).
But this is no protection against serious attacks by experienced people, which might decompile you program or debug it (well, there are ways to detect that, but it's like cold-war - mutual arms race of debugging-techniques against runtime-detection of these).
edit: If one knows an good way with an acceptable amount of work to protect the data (in c++ and without gigantic and/or expensive frameworks), please correct me. I would be interested in that information.
While it's true that you cannot defend against someone who decompiles your code and figures out what you're doing, you can obscure the password a little bit so that it isn't in plain text inside the code. You don't need to do a true encryption, just anything where you know the secret. For example, reverse it, rot13 it, or interleave two literal strings such as "pswr" and "asod". Or use individual character variables (that are not initialized all together in the same place) and use numbers to set them (ascii) rather than having 'a' in your code.
In your place, I would feel that snooping the traffic to the FTP server is less work than decompiling your app and reading what the code does with the literal strings. You only need to defeat the person who opens the hex and sees the strings that are easily recognized as an ID and password. A littel obscuring will go a long way in that case.
As the others said, storing a password is never really save but if you insist you can use cryptlib for encryption and decryption.
Just a raw idea for you to consider.
Calculate the md5 or SHA-2 of your password and store it in the executable.
Then do the same for input username/password and compare with stored value.
Simple and straightforward.
http://en.wikipedia.org/wiki/MD5
http://en.wikipedia.org/wiki/SHA-2

Web Service to return complex object with optional parts

I'm trying to think of the correct design for a web service. Essentially, this service is going to perform a client search in a number of disparate systems, and return the results.
Now, a client can have various pieces of information attached - e.g. various pieces of contact information, their address(es), personal information. Some of this information may be complex to retrieve from some systems, so if the consumer isn't going to use it, I'd like them to have some way of indicating that to the web service.
One obvious approach would be to have different methods for different combinations of wanted detail - but as the combinations grow, so too do the number of methods. Another approach I've looked at is to add two string array parameters to the method call, where one array is a list of required items (e.g. I require contact information), and the other is optional items (e.g. if you're going to pull in their names anyway, you might as well return that to me).
A third approach would be to add additional methods to retrieve the detail. But that's going to explode the number of round trips if I need all the details for potentially hundreds of clients who make up the result.
To be honest, I'm not sure I like any of the above approaches. So how would you design such a generic client search service?
(Considered CW since there might not be a single "right" answer, but I'll wait and see what sort of answers arrive)
Create a "criteria" object and use that as a parameter. Such an object should have a bunch of properties to indicate the information you want. For example "IncludeAddresses" or "IncludeFullContactInformation".
The consumer is then responsible to set the right properties to true, and all combinations are possible. This will also make the code in the service easier to do. You can simply write if(criteria.IncludeAddresses){response.Addresses = GetAddresses;}
Any non-structured or semi-structured data is best handled by XML. You might pass XML data via a string or wrap it up in a class adding some functionality to it. Use XPathNavigator to go through XML. You can also use XMLDocument class although it is not too friendly to use. Anyway, you will need some kind of class to handle XML content of course.
That's why XML was invented - to handle data which structure is not clearly defined.
Regards,
Maciej

When is it best to sanitize user input?

User equals untrustworthy. Never trust untrustworthy user's input. I get that. However, I am wondering when the best time to sanitize input is. For example, do you blindly store user input and then sanitize it whenever it is accessed/used, or do you sanitize the input immediately and then store this "cleaned" version? Maybe there are also some other approaches I haven't though of in addition to these. I am leaning more towards the first method, because any data that came from user input must still be approached cautiously, where the "cleaned" data might still unknowingly or accidentally be dangerous. Either way, what method do people think is best, and for what reasons?
Unfortunately, almost no one of the participants ever clearly understands what are they talking about. Literally. Only Kibbee managed to make it straight.
This topic is all about sanitization. But the truth is, such a thing like wide-termed "general purpose sanitization" everyone is so eager to talk about is just doesn't exist.
There are a zillion different mediums, each require it's own, distinct data formatting. Moreover - even single certain medium require different formatting for it's parts. Say, HTML formatting is useless for javascript embedded in HTML page. Or, string formatting is useless for the numbers in SQL query.
As a matter of fact, such a "sanitization as early as possible", as suggested in most upvoted answers, is just impossible. As one just cannot tell in which certain medium or medium part the data will be used. Say, we are preparing to defend from "sql-injection", escaping everything that moves. But whoops! - some required fields weren't filled and we have to fill out data back into form instead of database... with all the slashes added.
On the other hand, we diligently escaped all the "user input"... but in the sql query we have no quotes around it, as it is a number or identifier. And no "sanitization" ever helped us.
On the third hand - okay, we did our best in sanitizing the terrible, untrustworthy and disdained "user input"... but in some inner process we used this very data without any formatting (as we did our best already!) - and whoops! have got second order injection in all its glory.
So, from the real life usage point of view, the only proper way would be
formatting, not whatever "sanitization"
right before use
according to the certain medium rules
and even following sub-rules required for this medium's different parts.
It depends on what kind of sanitizing you are doing.
For protecting against SQL injection, don't do anything to the data itself. Just use prepared statements, and that way, you don't have to worry about messing with the data that the user entered, and having it negatively affect your logic. You have to sanitize a little bit, to ensure that numbers are numbers, and dates are dates, since everything is a string as it comes from the request, but don't try to do any checking to do things like block keywords or anything.
For protecting against XSS attacks, it would probably be easier to fix the data before it's stored. However, as others mentioned, sometimes it's nice to have a pristine copy of exactly what the user entered, because once you change it, it's lost forever. It's almost too bad there's not a fool proof way to ensure you application only puts out sanitized HTML the way you can ensure you don't get caught by SQL injection by using prepared queries.
I sanitize my user data much like Radu...
First client-side using both regex's and taking control over allowable characters
input into given form fields using javascript or jQuery tied to events, such as
onChange or OnBlur, which removes any disallowed input before it can even be
submitted. Realize however, that this really only has the effect of letting those
users in the know, that the data is going to be checked server-side as well. It's
more a warning than any actual protection.
Second, and I rarely see this done these days anymore, that the first check being
done server-side is to check the location of where the form is being submitted from.
By only allowing form submission from a page that you have designated as a valid
location, you can kill the script BEFORE you have even read in any data. Granted,
that in itself is insufficient, as a good hacker with their own server can 'spoof'
both the domain and the IP address to make it appear to your script that it is coming
from a valid form location.
Next, and I shouldn't even have to say this, but always, and I mean ALWAYS, run
your scripts in taint mode. This forces you to not get lazy, and to be diligent about
step number 4.
Sanitize the user data as soon as possible using well-formed regexes appropriate to
the data that is expected from any given field on the form. Don't take shortcuts like
the infamous 'magic horn of the unicorn' to blow through your taint checks...
or you may as well just turn off taint checking in the first place for all the good
it will do for your security. That's like giving a psychopath a sharp knife, bearing
your throat, and saying 'You really won't hurt me with that will you".
And here is where I differ than most others in this fourth step, as I only sanitize
the user data that I am going to actually USE in a way that may present a security
risk, such as any system calls, assignments to other variables, or any writing to
store data. If I am only using the data input by a user to make a comparison to data
I have stored on the system myself (therefore knowing that data of my own is safe),
then I don't bother to sanitize the user data, as I am never going to us it a way
that presents itself as a security problem. For instance, take a username input as
an example. I use the username input by the user only to check it against a match in
my database, and if true, after that I use the data from the database to perform
all other functions I might call for it in the script, knowing it is safe, and never
use the users data again after that.
Last, is to filter out all the attempted auto-submits by robots these days, with a
'human authentication' system, such as Captcha. This is important enough these days
that I took the time to write my own 'human authentication' schema that uses photos
and an input for the 'human' to enter what they see in the picture. I did this because
I've found that Captcha type systems really annoy users (you can tell by their
squinted-up eyes from trying to decipher the distorted letters... usually over and
over again). This is especially important for scripts that use either SendMail or SMTP
for email, as these are favorites for your hungry spam-bots.
To wrap it up in a nutshell, I'll explain it as I do to my wife... your server is like a popular nightclub, and the more bouncers you have, the less trouble you are likely to have
in the nightclub. I have two bouncers outside the door (client-side validation and human authentication), one bouncer right inside the door (checking for valid form submission location... 'Is that really you on this ID'), and several more bouncers in
close proximity to the door (running taint mode and using good regexes to check the
user data).
I know this is an older post, but I felt it important enough for anyone that may read it after my visit here to realize their is no 'magic bullet' when it comes to security, and it takes all these working in conjuction with one another to make your user-provided data secure. Just using one or two of these methods alone is practically worthless, as their power only exists when they all team together.
Or in summary, as my Mum would often say... 'Better safe than sorry".
UPDATE:
One more thing I am doing these days, is Base64 encoding all my data, and then encrypting the Base64 data that will reside on my SQL Databases. It takes about a third more total bytes to store it this way, but the security benefits outweigh the extra size of the data in my opinion.
I like to sanitize it as early as possible, which means the sanitizing happens when the user tries to enter in invalid data. If there's a TextBox for their age, and they type in anything other that a number, I don't let the keypress for the letter go through.
Then, whatever is reading the data (often a server) I do a sanity check when I read in the data, just to make sure that nothing slips in due to a more determined user (such as hand-editing files, or even modifying packets!)
Edit: Overall, sanitize early and sanitize any time you've lost sight of the data for even a second (e.g. File Save -> File Open)
The most important thing is to always be consistent in when you escape. Accidental double sanitizing is lame and not sanitizing is dangerous.
For SQL, just make sure your database access library supports bind variables which automatically escapes values. Anyone who manually concatenates user input onto SQL strings should know better.
For HTML, I prefer to escape at the last possible moment. If you destroy user input, you can never get it back, and if they make a mistake they can edit and fix later. If you destroy their original input, it's gone forever.
Early is good, definitely before you try to parse it. Anything you're going to output later, or especially pass to other components (i.e., shell, SQL, etc) must be sanitized.
But don't go overboard - for instance, passwords are hashed before you store them (right?). Hash functions can accept arbitrary binary data. And you'll never print out a password (right?). So don't parse passwords - and don't sanitize them.
Also, make sure that you're doing the sanitizing from a trusted process - JavaScript/anything client-side is worse than useless security/integrity-wise. (It might provide a better user experience to fail early, though - just do it both places.)
My opinion is to sanitize user input as soon as posible client side and server side, i'm doing it like this
(client side), allow the user to
enter just specific keys in the field.
(client side), when user goes to the next field using onblur, test the input he entered
against a regexp, and notice the user if something is not good.
(server side), test the input again,
if field should be INTEGER check for that (in PHP you can use is_numeric() ),
if field has a well known format
check it against a regexp, all
others ( like text comments ), just
escape them. If anything is suspicious stop script execution and return a notice to the user that the data he enetered in invalid.
If something realy looks like a posible attack, the script send a mail and a SMS to me, so I can check and maibe prevent it as soon as posible, I just need to check the log where i'm loggin all user inputs, and the steps the script made before accepting the input or rejecting it.
Perl has a taint option which considers all user input "tainted" until it's been checked with a regular expression. Tainted data can be used and passed around, but it taints any data that it comes in contact with until untainted. For instance, if user input is appended to another string, the new string is also tainted. Basically, any expression that contains tainted values will output a tainted result.
Tainted data can be thrown around at will (tainting data as it goes), but as soon as it is used by a command that has effect on the outside world, the perl script fails. So if I use tainted data to create a file, construct a shell command, change working directory, etc, Perl will fail with a security error.
I'm not aware of another language that has something like "taint", but using it has been very eye opening. It's amazing how quickly tainted data gets spread around if you don't untaint it right away. Things that natural and normal for a programmer, like setting a variable based on user data or opening a file, seem dangerous and risky with tainting turned on. So the best strategy for getting things done is to untaint as soon as you get some data from the outside.
And I suspect that's the best way in other languages as well: validate user data right away so that bugs and security holes can't propagate too far. Also, it ought to be easier to audit code for security holes if the potential holes are in one place. And you can never predict which data will be used for what purpose later.
Clean the data before you store it. Generally you shouldn't be preforming ANY SQL actions without first cleaning up input. You don't want to subject yourself to a SQL injection attack.
I sort of follow these basic rules.
Only do modifying SQL actions, such as, INSERT, UPDATE, DELETE through POST. Never GET.
Escape everything.
If you are expecting user input to be something make sure you check that it is that something. For example, you are requesting an number, then make sure it is a number. Use validations.
Use filters. Clean up unwanted characters.
Users are evil!
Well perhaps not always, but my approach is to always sanatize immediately to ensure nothing risky goes anywhere near my backend.
The added benefit is that you can provide feed back to the user if you sanitize at point of input.
Assume all users are malicious.
Sanitize all input as soon as possible.
Full stop.
I sanitize my data right before I do any processing on it. I may need to take the First and Last name fields and concatenate them into a third field that gets inserted to the database. I'm going to sanitize the input before I even do the concatenation so I don't get any kind of processing or insertion errors. The sooner the better. Even using Javascript on the front end (in a web setup) is ideal because that will occur without any data going to the server to begin with.
The scary part is that you might even want to start sanitizing data coming out of your database as well. The recent surge of ASPRox SQL Injection attacks that have been going around are doubly lethal because it will infect all database tables in a given database. If your database is hosted somewhere where there are multiple accounts being hosted in the same database, your data becomes corrupted because of somebody else's mistake, but now you've joined the ranks of hosting malware to your visitors due to no initial fault of your own.
Sure this makes for a whole lot of work up front, but if the data is critical, then it is a worthy investment.
User input should always be treated as malicious before making it down into lower layers of your application. Always handle sanitizing input as soon as possible and should not for any reason be stored in your database before checking for malicious intent.
I find that cleaning it immediately has two advantages. One, you can validate against it and provide feedback to the user. Two, you do not have to worry about consuming the data in other places.