Warning: C++ noob
I've read multiple posts on StackOverflow about string encryption, but they don't answer my questions.
I must insert one or two hardcoded strings in my code, but I would like to make them difficult to read in plain text when debugging/reverse engineering. That's not all: my strings are URLs, so a simple packet analyzer (Wireshark) can read them.
I say difficult because I know that, when the code runs, the string is decrypted somewhere (in RAM?) as plain text and somebody can read it. So, assuming that it is not possible to completely secure my string, what is the best way of encrypting/decrypting it in C++?
I was thinking of something like this:
// I've omitted all the #include and main stuff of course...
string encryptedUrl = "Ajdu67gGHhbh34590Hb6vfu6gu"; // URL encrypted with some known algorithm
// decrypt() is a placeholder for whatever routine reverses that algorithm
URLDownloadToFile(NULL, decrypt(encryptedUrl).c_str(), "C:\\temp.txt", 0, NULL);
What about packet analysis? I'm sure there's no way to hide the URL, but maybe I'm missing something? Thank you, and sorry for my poor English!
Edit 1: What does my application do?
It's a simple login script. My application downloads a text file from a URL. This file contains an encrypted string that is read using the fstream library. The string is then decrypted and used to log in on another site. It is very weak, because there's no database, no salt, no hashing. My goal is to ensure that neither the URL nor the login string is "easy" to read from a static analysis of the binary, and that both are as hard as possible to recover with dynamic analysis (debugging, reverse engineering, etc.).
If you want to stymie packet inspectors, the bare minimum requirement is to use https with a hard-coded server certificate baked into your app.
There is no panacea for encrypting in-app content. A determined hacker with the right skills will get at the plain url, no matter what you do. The best you can hope for is to make it difficult enough that most people will just give up. The way to achieve this is to implement multiple diverse obfuscation and tripwire techniques. Including, but not limited to:
Store parts of the encrypted url and the password (preferably a one-time key) in different locations and bring them together in code.
Hide the encrypted parts in large strings of random data that look indistinguishable from the parts.
Bring the parts together piecemeal. E.g., concatenate the first and second thirds of the encrypted url into a single buffer in one initialisation function, concatenate this buffer with the last third in a different, unrelated init function, and use the final concatenation in yet another function, all called from different random places in your code.
Detect when the app is running under a debugger and have different functions trash the encrypted contents at different times.
Detection should be done at various call sites using different techniques, not by calling a single "DetectDebug" function or testing a global bool, both of which create a single point of attack.
Don't use obvious names, like, "DecryptUrl" for the relevant functions.
Harvest parts of the key from seemingly unrelated, but consistent sources. E.g., read the clock and only use a few of the high bits (high enough that they won't change for the foreseeable future, but low enough that they're not all zero), or use a random sampling of non-volatile results from initialisation code.
This is just the tip of the iceberg and will only throw novices off the scent. None of it is going to stop, or even significantly slow down, a skillful attacker, who will simply intercept calls to the SSL library using a stealth debugger. You therefore have to ask yourself:
How much is it worth to me to protect this url, and from what kind of attacker?
Can I somehow change the system design so that I don't need to secure the url?
Try XorStr [1, 2]. It's what I used to use when trying to hamper static analysis. Most results will come from game cheat forums; there is an HTML generator too.
As others have mentioned, getting the strings is still easy for anyone who puts a breakpoint on URLDownloadToFile. However, you will have made their life a bit harder if they are trying to do static analysis.
I am not sure what your URLs do, or what your goal is in all this, but XorStr + anti-debug + packing the binary will stop most amateurs from reverse engineering your application.
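For readers who have not seen the technique, here is a minimal sketch of the kind of compile-time XOR obfuscation that XorStr-style headers perform, assuming a C++14 compiler. The names ObfuscatedString, obfuscate and kKey, the single-byte key, and the example URL are all made up for illustration; this is not XorStr's actual API. Whether the plain literal really stays out of the binary depends on the compiler and optimisation level, so check the output with a strings tool.

```cpp
#include <cstddef>
#include <string>

// Hypothetical single-byte key, chosen only for this illustration.
constexpr unsigned char kKey = 0x5A;

// Holds the XOR-ed bytes of a string literal (N counts the trailing '\0').
template <std::size_t N>
struct ObfuscatedString {
    char data[N];

    // Evaluated at compile time when used to initialise a constexpr object.
    constexpr explicit ObfuscatedString(const char (&plain)[N]) : data{} {
        for (std::size_t i = 0; i < N; ++i) {
            data[i] = static_cast<char>(plain[i] ^ kKey);
        }
    }

    // Decodes at run time, only when the caller needs the plain text.
    std::string reveal() const {
        std::string out(N - 1, '\0');
        for (std::size_t i = 0; i + 1 < N; ++i) {
            out[i] = static_cast<char>(data[i] ^ kKey);
        }
        return out;
    }
};

template <std::size_t N>
constexpr ObfuscatedString<N> obfuscate(const char (&plain)[N]) {
    return ObfuscatedString<N>(plain);
}

// Placeholder URL, for illustration only.
constexpr auto kHiddenUrl = obfuscate("http://example.com/file.txt");
```

Calling kHiddenUrl.reveal() right before the download yields the original string. As noted above, a breakpoint on the download call still exposes it; this only hampers static inspection.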
I want to create a portable C++ command-line application for myself that will store my secret project information.
But I am not sure how to store information in my program: whatever I update while using it is held in memory, and when I close the program it is lost and is not available anywhere.
I want to store the information persistently. What is the best way to do it? [My application should be portable, i.e., I can carry it on my pen drive anywhere and fetch my information from the program.]
The option I found was a database, but I have certain problems with databases:
1) SQLite => If anyone gets my sqlite.db file, they will know all my secret projects.
2) MySQL or any other database server => They are not portable; they need to be installed on the system, and I would need to import and export every time on whichever system I use.
How do such applications store information in encrypted form, so that no one can read it easily?
Any help will be great.
Thanks
If you want your data to remain secret then you must encrypt it.
How you persist the data (sqlite, text file, etc.) makes no difference whatsoever.
See also:
encrypt- decrypt with AES using C/C++
This is not REALLY an answer, but it's far too long "discussion about your subject" to fit as a comment, and I'd rather break the rules by writing one "non-answer answer" (especially now that you have already accepted another answer) than write 6 comments.
First of all, if it's written in C++, it won't be truly portable in the sense that you can carry it around and plug it in anywhere you like and just access the information, because different systems will have different OSes and processor architectures. Fine if you restrict being able to "plug in" to Windows and Linux on x86 - you only need to build two copies of your code. But covering more architectures - e.g. being able to plug into an iPad or a MacBook - will require two more builds of the software. Soon you'll be looking at quite a lot of code to carry around (never mind that you need the relevant C++ compiler and development environment to build the original copy). Yes, C++ is a portable language, but it doesn't mean that the executable file will "work on anything" directly - it will need to be compiled for that architecture.
One solution here may of course be to use something other than C++ - for example Java, which only needs a Java VM on the target system - it's often available on a customer system already, so less of an issue. But that won't work on, for example, an iPad.
Another solution is to have your own webserver at home, and just connect to your server from your customer's site. That way, none of the information (except the parts you actually show the customer) ever leaves your house. Make it secure by studying internet/web-site security, and using good passwords [and of course, you could even set it up so that it's only available at certain times when you need it, and not available 24/7]. Of course, if the information is really top-secret (nuclear weapons, criminal activities, etc), you may not want to do that for fear of someone accessing it when you don't want it to be accessed. But it's also less likely to "drop out of your pocket" if it's well protected with logins and passwords.
Encrypting data is not very hard - just download the relevant library, and go from there - Crypto++ is one of those libraries.
If you store it in a database, you will need either a database that encrypts by itself, or a very good way to avoid "leaking" the clear-text information (e.g. storing files in /tmp on a Linux machine), or worse, you need to decrypt the whole database before you can access it - which means that something could, at least in theory, "slurp" your entire database.
Depending on how secret your projects are, you may also need to consider that anything you enter, for example a password, will be readable by the computer you are using - unless you bring your own computer as well [and in that case, there is some really good "encrypt my entire disk" software out there that is pretty much ready to use].
Also, if someone says "Can I plug my memory stick into your computer and run some of my software from it", I'm not sure I'd let that person do that.
In other words, your TECHNICAL challenges to write the code itself may not be the hardest nut to crack in your project - although interesting and challenging.
I'm working on a wxWidgets GUI application that allows the user to upload files to an FTP server and a pair of username/password is required to access the FTP server.
As far as I know, STL strings or even char* strings are visible to the end user even after the program has been compiled, using hex editors or string extractors like the Sysinternals Strings utility.
So, is there a safe/secure way to store sensitive information inside a C++ application?
PS. I cannot use .NET for this application.
This is actually independent of the programming language used.
FTP is a protocol that transfers its password in plain text. No amount of obfuscation will change that, and an attacker can easily intercept the password as it is transmitted.
And no amount of obfuscation, no matter the protocol used, will change the fact that your application has to be able to decode that password. Any attacker with access to the application binary can reverse-engineer that decoding, yielding the password.
Once you start looking at secure protocols (like SFTP), you also get the infrastructure for secure authentication (e.g. public/private key) when looking at automated access.
Even then, you are relying on the file system to keep that key file inaccessible to everyone else, which - depending on the operating system and overall setup - might not be enough.
But since we're talking about an interactive application, the simplest way is to not make the authentication automatic at all, but to query the user for username and password. After all, he should know, shouldn't he?
Edit: Extending on the excellent comment by Kate Gregory, in case users share a common "technical" (or anonymous) account to access your server, files uploaded by your app should not be visible on the server before you have done some kind of filtering. The common way to do this is to have an "upload" directory that files can be uploaded to, but not downloaded from. If you do not take these precautions, people will use your FTP server as a hub for all kinds of illegal file sharing, and you will be the one held legally responsible for that.
I'm not sure that is possible at all, and if it is, then it is not easy. If the password is embedded and your program can read it, anyone with enough knowledge should be able to read it too.
You can improve security against low-level attempts (such as a hex editor) by encrypting or obfuscating (e.g. two passwords which generate the real password by XOR at runtime, and only at the moment you need it).
But this is no protection against serious attacks by experienced people, who might decompile your program or debug it (well, there are ways to detect that, but it's like the Cold War - a mutual arms race of debugging techniques against runtime detection of them).
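A minimal sketch of the two-part XOR idea mentioned above, assuming a six-byte secret ("s3cret") chosen purely for illustration. The names kShareA, kShareB and combine_shares are hypothetical; kShareB was computed offline as the secret XOR kShareA, so neither array looks like a password in a hex editor.

```cpp
#include <array>
#include <cstddef>
#include <string>

// Hypothetical shares prepared offline: kShareB[i] == secret[i] ^ kShareA[i].
const std::array<unsigned char, 6> kShareA = {0x3C, 0xA7, 0x19, 0xE2, 0x5B, 0x80};
const std::array<unsigned char, 6> kShareB = {0x4F, 0x94, 0x7A, 0x90, 0x3E, 0xF4};

// Rebuilds the secret only at the moment it is needed.
std::string combine_shares() {
    std::string secret(kShareA.size(), '\0');
    for (std::size_t i = 0; i < kShareA.size(); ++i) {
        secret[i] = static_cast<char>(kShareA[i] ^ kShareB[i]);
    }
    return secret;
}
```

The returned string still sits in memory while it is in use, so this only raises the bar for someone browsing the binary, not for someone with a debugger.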
edit: If anyone knows a good way to protect the data with an acceptable amount of work (in C++ and without gigantic and/or expensive frameworks), please correct me. I would be interested in that information.
While it's true that you cannot defend against someone who decompiles your code and figures out what you're doing, you can obscure the password a little bit so that it isn't in plain text inside the code. You don't need to do a true encryption, just anything where you know the secret. For example, reverse it, rot13 it, or interleave two literal strings such as "pswr" and "asod". Or use individual character variables (that are not initialized all together in the same place) and use numbers to set them (ascii) rather than having 'a' in your code.
In your place, I would feel that snooping the traffic to the FTP server is less work than decompiling your app and reading what the code does with the literal strings. You only need to defeat the person who opens the hex and sees the strings that are easily recognized as an ID and password. A little obscuring will go a long way in that case.
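The interleaving and rot13 suggestions above can be sketched as follows. The function names are my own; "pswr" and "asod" are the two literals from the example, which interleave to "password".

```cpp
#include <cstddef>
#include <string>

// Rebuilds a string from its even-position and odd-position characters.
std::string interleave(const std::string& even, const std::string& odd) {
    std::string out;
    out.reserve(even.size() + odd.size());
    for (std::size_t i = 0; i < even.size() || i < odd.size(); ++i) {
        if (i < even.size()) out += even[i];
        if (i < odd.size()) out += odd[i];
    }
    return out;
}

// ROT13 is its own inverse: applying it twice returns the input.
std::string rot13(const std::string& in) {
    std::string out = in;
    for (char& c : out) {
        if (c >= 'a' && c <= 'z') {
            c = static_cast<char>('a' + (c - 'a' + 13) % 26);
        } else if (c >= 'A' && c <= 'Z') {
            c = static_cast<char>('A' + (c - 'A' + 13) % 26);
        }
    }
    return out;
}
```

Either interleave("pswr", "asod") or rot13("cnffjbeq") produces "password" at run time without that word appearing in the binary. This is obscuring, not encryption.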
As the others said, storing a password is never really safe, but if you insist, you can use cryptlib for encryption and decryption.
Just a raw idea for you to consider.
Calculate the md5 or SHA-2 of your password and store it in the executable.
Then do the same for input username/password and compare with stored value.
Simple and straightforward.
http://en.wikipedia.org/wiki/MD5
http://en.wikipedia.org/wiki/SHA-2
I'm writing a game that will have a lot of information (configuration, some content, etc) inside of some xml documents, as well as resource files. This will make it easier for myself and others to edit the program without having to edit the actual C++ files, and without having to recompile.
However, as the program is starting to grow there is an increase of files in the same directory as the program. So I thought of putting them inside a file archive (since they are mostly text, it goes great with compression).
My question is this: Will it be easier to compress all the files and:
1. Set a password to it (like a password-protected ZIP), then provide the password when the program needs it
2. Encrypt the archive with Crypto++ or similar
3. Modify the file header slightly as a "makeshift" encryption, and fix the file's headers while the file is loaded
I think numbers 1 and 2 are similar, but I couldn't find any information on whether zlib could handle password-protected archives.
Also note that I don't want the files inside the archive to be "extracted" into the folder while the program is using it. It should only be in the system's memory.
I think you misunderstand the possibilities offered by encryption.
As long as the program is executed on an untrusted host, it's impossible to guarantee anything.
At most, you can make it difficult (encryption, code obfuscation), or extremely difficult (self-modifying code, debug/hooks detection), for someone to reverse engineer the code, but you cannot prevent cracking. And with the Internet, it'll be available to all as soon as it's cracked by a single individual.
The same goes, truly, for preventing an individual from tampering with the configuration. Whatever the method (CRC, hash; by the way, encryption is not meant to prevent tampering), it is still possible to reverse engineer it given sufficient time and means (and motivation).
The only way to guarantee an untampered configuration would be to store it somewhere YOU control (a server), sign it (asymmetric) and have the program check the signature. But even then, it would not prevent someone from coming up with a patch that lets your program run with a user-supplied (unsigned) configuration file...
And you know the worst of it ? People will probably prefer the cracked version because freed from the burden of all those "security" measures it'll run faster...
Note: yes it is illegal, but let's be pragmatic...
Note: regarding motivation, the more clever you are with protecting the program, the more attractive it is to hackers --> it's like a brain teaser to them!
So how do you provide a secured service ?
You need to trust the person who executes the program
You need to trust the person who stores the configuration
It can only be done if you offer a thin client and execute everything on a server you trust... and even then you'll have trouble making sure that no one finds doors in your server that you didn't think about.
In your shoes, I'd simply make sure to detect light tampering with the configuration (treat it as hostile and make sure to validate the data before running anything). After all file corruption is equally likely, and if a corrupted configuration file meant a ruined client's machine, there would be hell to pay :)
If I had to choose among your three options, I'd go for Crypto++, as it fits in nicely with C++ iostreams.
But: you are
serializing your data to XML
compressing it
encrypting it
all in memory, and back again. I'd really reconsider this choice. Why not use eg. SQLite to store all your data in a file-based database (SQLite doesn't require any external database process)?
Encryption can be added through various extensions (SEE or SQLCipher). It's safe, quick, and completely transparent.
You don't get compression, but then again, by using SQLite instead of XML, this won't be an issue anyway (or so I think).
Set a password to it (like a password-protected ZIP), then provide the password when the program needs it
Firstly, you can't do this unless you are going to ask a user for the password. If that encryption key is stored in the code, don't bet against a determined reverse engineer finding it and decrypting the archive.
The one big rule is: you cannot store encryption keys in your software, because if you do, what is the point of using encryption? I can find your key.
Now, onto other points. zlib does not support encryption and as they point out, PKZip is rather broken anyway. I suspect if you were so inclined to find one, you'd probably find a zip/compression library capable of handling encryption. (ZipArchive I believe handles Zip+AES but you need to pay for that).
But I second Daniel's answer that's just displayed on my screen. Why? Encryption/compression isn't going to give you any benefit unless the user presents some form of token (password, smartcard etc) not present in your compiled binary or related files. Similarly, if you're not using up masses of disk space, why compress?
I want to verify if the text log files created by my program being run at my customer's site have been tampered with. How do you suggest I go about doing this? I searched a bunch here and google but couldn't find my answer. Thanks!
Edit: After reading all the suggestions so far, here are my thoughts. I want to keep it simple, and since the customer isn't that computer savvy, I think it is safe to embed the salt in the binary. I'll continue to search for a simple solution using the keywords "salt checksum hash" etc. and post back here once I find one.
Obligatory preamble: How much is at stake here? You must assume that tampering will be possible, but that you can make it very difficult if you spend enough time and money. So: how much is it worth to you?
That said:
Since it's your code writing the file, you can write it out encrypted. If you need it to be human readable, you can keep a second encrypted copy, or a second file containing only a hash, or write a hash value for every entry. (The hash must contain a "secret" key, of course.) If this is too risky, consider transmitting hashes or checksums or the log itself to other servers. And so forth.
This is a quite difficult thing to do, unless you can somehow protect the keypair used to sign the data. Signing the data requires a private key, and if that key is on a machine, a person can simply alter the data or create new data, and use that private key to sign the data. You can keep the private key on a "secure" machine, but then how do you guarantee that the data hadn't been tampered with before it left the original machine?
Of course, if you are protecting only data in motion, things get a lot easier.
Signing data is easy, if you can protect the private key.
Once you've worked out the higher-level theory that ensures security, take a look at GPGME to do the signing.
You may put a checksum as a prefix on each line of your file, using an algorithm like Adler-32 or something similar.
If you do not want to put binary data in your log files, use Base64 encoding to convert the checksum to non-binary data. That way, you can discard only the lines that have been tampered with.
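A minimal sketch of the per-line checksum idea, using Adler-32 as defined in RFC 1950. The names stamp_line and verify_line are hypothetical, and the checksum is written as eight hex digits rather than Base64 simply to keep the example short. Note that an unkeyed checksum only catches accidental or careless edits: anyone who knows the scheme can recompute it, which is why another answer here says the hash must include a secret key.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Adler-32 as specified in RFC 1950.
std::uint32_t adler32(const std::string& data) {
    const std::uint32_t kMod = 65521;
    std::uint32_t a = 1, b = 0;
    for (unsigned char byte : data) {
        a = (a + byte) % kMod;
        b = (b + a) % kMod;
    }
    return (b << 16) | a;
}

// Prefixes a log line with its checksum as 8 lowercase hex digits and a space.
std::string stamp_line(const std::string& line) {
    char prefix[16];
    std::snprintf(prefix, sizeof prefix, "%08x ", static_cast<unsigned>(adler32(line)));
    return prefix + line;
}

// True if the line carries a well-formed prefix that matches its content.
bool verify_line(const std::string& stamped) {
    if (stamped.size() < 9 || stamped[8] != ' ') return false;
    return stamp_line(stamped.substr(9)) == stamped;
}
```

When reading the log back, lines for which verify_line returns false can be flagged or discarded individually.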
It really depends on what you are trying to achieve, what is at stake and what the constraints are.
Fundamentally: what you are asking for is just plain impossible (in isolation).
Now, it's a matter of complicating the life of the people trying to modify the file, so that it costs them more to modify it than they could earn by doing so. Of course, it means that hackers motivated solely by the goal of cracking your protection measures will not be deterred that much...
Assuming it should work on a standalone computer (no network), it is, as I said, impossible. Whatever the process you use, whatever the key / algorithm, it is ultimately embedded in the binary, which is exposed to the scrutiny of the would-be hacker. It's possible to disassemble it, it's possible to examine it with hex editors, it's possible to probe it with different inputs, plug in a debugger, etc... Your only option is thus to make debugging / examination a pain by breaking down the logic, using debug detection to change the paths, and, if you are very good, using self-modifying code. It does not mean it'll become impossible to tamper with the process; it merely means it should become difficult enough that most attackers will give up.
If you have a network at your disposal, you can store a hash on a distant (under your control) drive, and then compare the hash. 2 difficulties here:
Storing (how to ensure it is your binary ?)
Retrieving (how to ensure you are talking to the right server ?)
And of course, in both cases, beware of man-in-the-middle attacks...
One last bit of advice: if you need security, you'll need to consult a real expert, don't rely on some strange guys (like myself) talking on a forum. We're amateurs.
It's your file, and only your program is allowed to modify it. That being the case, there is one simple solution (if you can afford to put your log files into a separate folder).
Note:
You can have all your log files placed in a separate folder. For example, in my application we have a lot of DLLs, each with its own log file, and of course the application has its own.
Then have a separate process running in the background that monitors the folder for change notifications such as
change in file size
attempt to rename the file or folder
delete the file
etc...
Based on this notification, you can certify whether the file is changed or not!
(As you and others may be guessing, even your process & dlls will change these files that can also lead to a notification. You need to synchronize this action smartly. That's it)
The Windows API for monitoring a folder is given below:
HANDLE FindFirstChangeNotification(
LPCTSTR lpPathName,
BOOL bWatchSubtree,
DWORD dwNotifyFilter
);
lpPathName:
Path to the log directory.
bWatchSubtree:
Watch subfolder or not (0 or 1)
dwNotifyFilter:
Filter conditions that satisfy a change notification wait. This parameter can be one or more of the following values.
FILE_NOTIFY_CHANGE_FILE_NAME
FILE_NOTIFY_CHANGE_DIR_NAME
FILE_NOTIFY_CHANGE_SIZE
FILE_NOTIFY_CHANGE_SECURITY
etc...
(Check MSDN)
How to make it work?
Suspect A: Our process
Suspect X: Other process or user
Inspector: The process that we created to monitor the folder.
Inspector sees a change in the folder. It asks Suspect A whether he made any change to it.
if so,
change is taken as VALID.
if not
clear indication that change is done by *Suspect X*. So NOT VALID!
File is certified to be TAMPERED.
Other than that, below are some of the techniques that may (or may not :)) help you!
Store the time stamp, along with the file size, whenever the application closes the file.
The next time you open the file, check its last-modified time and its size. If both are the same, the file has not been tampered with.
Change the file permission to read-only after you write logs into it. If some program or person wants to tamper with it, they must first change the read-only property. This action changes the date/time modified for the file.
Write only encrypted data to your log file. If someone tampers with it, then when we decrypt the data we may find some text that does not decrypt properly.
Use a compress/uncompress mechanism (compression may help you protect the file with a password).
Each way may have its own pros and cons. Strengthen the logic based on your needs. You can even try a combination of the techniques proposed.
User equals untrustworthy. Never trust untrustworthy user's input. I get that. However, I am wondering when the best time to sanitize input is. For example, do you blindly store user input and then sanitize it whenever it is accessed/used, or do you sanitize the input immediately and then store this "cleaned" version? Maybe there are also some other approaches I haven't thought of in addition to these. I am leaning more towards the first method, because any data that came from user input must still be approached cautiously, where the "cleaned" data might still unknowingly or accidentally be dangerous. Either way, what method do people think is best, and for what reasons?
Unfortunately, almost none of the participants clearly understands what they are talking about. Literally. Only Kibbee managed to get it straight.
This topic is all about sanitization. But the truth is that the broadly defined "general purpose sanitization" everyone is so eager to talk about simply doesn't exist.
There are a zillion different mediums, and each requires its own distinct data formatting. Moreover, even a single medium requires different formatting for its different parts. Say, HTML formatting is useless for JavaScript embedded in an HTML page. Or, string formatting is useless for numbers in an SQL query.
As a matter of fact, "sanitization as early as possible", as suggested in the most upvoted answers, is just impossible, because one cannot tell in which medium or part of a medium the data will be used. Say, we are preparing to defend against "SQL injection" by escaping everything that moves. But whoops! Some required fields weren't filled in, and we have to put the data back into the form instead of the database... with all the slashes added.
On the other hand, we diligently escaped all the "user input"... but in the sql query we have no quotes around it, as it is a number or identifier. And no "sanitization" ever helped us.
On the third hand - okay, we did our best in sanitizing the terrible, untrustworthy and disdained "user input"... but in some inner process we used this very data without any formatting (as we did our best already!) - and whoops! We have a second-order injection in all its glory.
So, from the real life usage point of view, the only proper way would be
formatting, not whatever "sanitization"
right before use
according to the certain medium rules
and even following sub-rules required for this medium's different parts.
It depends on what kind of sanitizing you are doing.
For protecting against SQL injection, don't do anything to the data itself. Just use prepared statements, and that way, you don't have to worry about messing with the data that the user entered, and having it negatively affect your logic. You have to sanitize a little bit, to ensure that numbers are numbers, and dates are dates, since everything is a string as it comes from the request, but don't try to do any checking to do things like block keywords or anything.
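The "ensure that numbers are numbers" step can be sketched like this, assuming C++17. parse_int_field is a hypothetical helper, not part of any framework; it accepts a request field only if the whole string is a decimal integer within the expected range, and the resulting int would then be bound to a prepared statement as a typed parameter.

```cpp
#include <charconv>
#include <optional>
#include <string>
#include <system_error>

// Accepts only a complete decimal integer within [lo, hi]; rejects everything else.
std::optional<int> parse_int_field(const std::string& text, int lo, int hi) {
    int value = 0;
    const char* first = text.data();
    const char* last = first + text.size();
    const auto result = std::from_chars(first, last, value);
    // Reject parse errors, overflow, and trailing characters after the number.
    if (result.ec != std::errc() || result.ptr != last) return std::nullopt;
    if (value < lo || value > hi) return std::nullopt;
    return value;
}
```

For example, an age field would be read with parse_int_field(input, 0, 150), and the request rejected if no value comes back. Nothing here inspects the text for SQL keywords; it only checks that the field has the type the application expects.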
For protecting against XSS attacks, it would probably be easier to fix the data before it's stored. However, as others mentioned, sometimes it's nice to have a pristine copy of exactly what the user entered, because once you change it, it's lost forever. It's almost too bad there's not a foolproof way to ensure your application only puts out sanitized HTML, the way you can ensure you don't get caught by SQL injection by using prepared queries.
I sanitize my user data much like Radu...
First, client-side, using both regexes and taking control over allowable characters
input into given form fields using JavaScript or jQuery tied to events, such as
onChange or onBlur, which removes any disallowed input before it can even be
submitted. Realize however, that this really only has the effect of letting those
users in the know, that the data is going to be checked server-side as well. It's
more a warning than any actual protection.
Second, and I rarely see this done anymore, the first check done server-side
is to check the location of where the form is being submitted from.
By only allowing form submission from a page that you have designated as a valid
location, you can kill the script BEFORE you have even read in any data. Granted,
that in itself is insufficient, as a good hacker with their own server can 'spoof'
both the domain and the IP address to make it appear to your script that it is coming
from a valid form location.
Next, and I shouldn't even have to say this, but always, and I mean ALWAYS, run
your scripts in taint mode. This forces you to not get lazy, and to be diligent about
step number 4.
Sanitize the user data as soon as possible using well-formed regexes appropriate to
the data that is expected from any given field on the form. Don't take shortcuts like
the infamous 'magic horn of the unicorn' to blow through your taint checks...
or you may as well just turn off taint checking in the first place for all the good
it will do for your security. That's like giving a psychopath a sharp knife, baring
your throat, and saying 'You really won't hurt me with that, will you?'.
And here is where I differ from most others in this fourth step, as I only sanitize
the user data that I am going to actually USE in a way that may present a security
risk, such as any system calls, assignments to other variables, or any writing to
store data. If I am only using the data input by a user to make a comparison to data
I have stored on the system myself (therefore knowing that data of my own is safe),
then I don't bother to sanitize the user data, as I am never going to use it in a way
that presents itself as a security problem. For instance, take a username input as
an example. I use the username input by the user only to check it against a match in
my database, and if true, after that I use the data from the database to perform
all other functions I might call for it in the script, knowing it is safe, and never
use the user's data again after that.
Last, is to filter out all the attempted auto-submits by robots these days, with a
'human authentication' system, such as Captcha. This is important enough these days
that I took the time to write my own 'human authentication' schema that uses photos
and an input for the 'human' to enter what they see in the picture. I did this because
I've found that Captcha type systems really annoy users (you can tell by their
squinted-up eyes from trying to decipher the distorted letters... usually over and
over again). This is especially important for scripts that use either SendMail or SMTP
for email, as these are favorites for your hungry spam-bots.
To wrap it up in a nutshell, I'll explain it as I do to my wife... your server is like a popular nightclub, and the more bouncers you have, the less trouble you are likely to have
in the nightclub. I have two bouncers outside the door (client-side validation and human authentication), one bouncer right inside the door (checking for valid form submission location... 'Is that really you on this ID'), and several more bouncers in
close proximity to the door (running taint mode and using good regexes to check the
user data).
I know this is an older post, but I felt it important enough for anyone that may read it after my visit here to realize there is no 'magic bullet' when it comes to security, and it takes all these working in conjunction with one another to make your user-provided data secure. Just using one or two of these methods alone is practically worthless, as their power only exists when they all team together.
Or in summary, as my Mum would often say... 'Better safe than sorry'.
UPDATE:
One more thing I am doing these days is Base64 encoding all my data, and then encrypting the Base64 data that will reside in my SQL databases. It takes about a third more total bytes to store it this way, but in my opinion the security benefits outweigh the extra size of the data.
I like to sanitize it as early as possible, which means the sanitizing happens when the user tries to enter invalid data. If there's a TextBox for their age and they type in anything other than a number, I don't let the keypress for the letter go through.
Then, whatever is reading the data (often a server) does a sanity check when it reads the data in, just to make sure that nothing slips in due to a more determined user (such as hand-editing files, or even modifying packets!).
Edit: Overall, sanitize early and sanitize any time you've lost sight of the data for even a second (e.g. File Save -> File Open)
The most important thing is to always be consistent in when you escape. Accidental double sanitizing is lame and not sanitizing is dangerous.
For SQL, just make sure your database access library supports bind variables, which automatically escape values. Anyone who manually concatenates user input onto SQL strings should know better.
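As a minimal sketch of what bind variables look like in practice, here is an example using Python's standard sqlite3 module. The `users` table and the `find_user` helper are assumptions made up for illustration; the point is only that the user's value is passed separately from the SQL text instead of being concatenated into it.

```python
import sqlite3

def find_user(conn, username):
    # The "?" placeholder is a bind variable: the driver passes the value
    # separately from the SQL text, so it is never parsed as SQL.
    cur = conn.execute(
        "SELECT id, username FROM users WHERE username = ?",
        (username,),
    )
    return cur.fetchone()

# Hypothetical in-memory table, just so the sketch runs on its own.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT)")
conn.execute("INSERT INTO users (username) VALUES (?)", ("alice",))
```

With this, a classic injection string such as `alice' OR '1'='1` is simply treated as an odd username that matches no row.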
For HTML, I prefer to escape at the last possible moment. If you destroy user input, you can never get it back, and if they make a mistake they can edit and fix later. If you destroy their original input, it's gone forever.
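A small sketch of escaping at output time, using `html.escape` from the Python standard library. The `render_comment` function is a hypothetical helper, not anything from a particular framework; the stored comment stays exactly as the user typed it, and only the rendered copy is escaped.

```python
import html

def render_comment(raw_comment):
    # Escape only when building the HTML output; the stored value
    # keeps the user's original text so it can still be edited later.
    return "<p>" + html.escape(raw_comment) + "</p>"
```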
Early is good, definitely before you try to parse it. Anything you're going to output later, or especially pass to other components (e.g., shell, SQL, etc.), must be sanitized.
But don't go overboard - for instance, passwords are hashed before you store them (right?). Hash functions can accept arbitrary binary data. And you'll never print out a password (right?). So don't parse passwords - and don't sanitize them.
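To illustrate why a password needs no sanitizing, here is a rough sketch that feeds the raw password bytes straight into a salted hash using the standard library's `hashlib.pbkdf2_hmac`. The function names, the 16-byte salt, and the iteration count are assumptions chosen for the example, not a recommendation from the answer above.

```python
import hashlib
import hmac
import os

def hash_password(password, salt=None, iterations=200_000):
    # The password is only encoded to bytes and hashed; it is never
    # parsed, printed, or interpolated, so any characters are acceptable.
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, iterations
    )
    return salt, digest

def verify_password(password, salt, expected, iterations=200_000):
    _, digest = hash_password(password, salt, iterations)
    # Constant-time comparison of the two digests.
    return hmac.compare_digest(digest, expected)
```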
Also, make sure that you're doing the sanitizing from a trusted process - JavaScript/anything client-side is worse than useless security/integrity-wise. (It might provide a better user experience to fail early, though - just do it both places.)
My opinion is to sanitize user input as soon as possible, both client side and server side. I do it like this:
(client side) Allow the user to enter only specific keys in the field.
(client side) When the user moves to the next field (onblur), test the input against a regexp and notify the user if something is wrong.
(server side) Test the input again: if the field should be an INTEGER, check for that (in PHP you can use is_numeric(), though it also accepts floats and exponent notation, so filter_var() with FILTER_VALIDATE_INT is stricter); if the field has a well-known format, check it against a regexp; escape all others (like text comments). If anything is suspicious, stop script execution and return a notice to the user that the data they entered is invalid.
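The server-side step above could be sketched roughly as follows. This is in Python rather than PHP, and the field names (`age`, `username`), the patterns, and the helper names are all assumptions made up for illustration.

```python
import re

# Hypothetical format rule for a username field.
USERNAME_RE = re.compile(r"[A-Za-z0-9_]{3,20}")

def parse_age(raw):
    # Accept one to three ASCII digits and nothing else.
    if not re.fullmatch(r"[0-9]{1,3}", raw):
        raise ValueError("age must be a whole number")
    return int(raw)

def validate_form(form):
    # Re-check everything on the server, whatever the client already did.
    age = parse_age(form.get("age", ""))
    username = form.get("username", "")
    if not USERNAME_RE.fullmatch(username):
        raise ValueError("username has an invalid format")
    return {"age": age, "username": username}
```

A caller would catch `ValueError`, stop processing, and send the message back to the user.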
If something really looks like a possible attack, the script sends an email and an SMS to me, so I can check and maybe prevent it as soon as possible. I just need to check the log, where I record all user inputs and the steps the script took before accepting or rejecting the input.
Perl has a taint option which considers all user input "tainted" until it's been checked with a regular expression. Tainted data can be used and passed around, but it taints any data that it comes in contact with until untainted. For instance, if user input is appended to another string, the new string is also tainted. Basically, any expression that contains tainted values will output a tainted result.
Tainted data can be thrown around at will (tainting data as it goes), but as soon as it is used by a command that has effect on the outside world, the perl script fails. So if I use tainted data to create a file, construct a shell command, change working directory, etc, Perl will fail with a security error.
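Python has no built-in taint mode, so the following is only a loose emulation of the idea, not how Perl implements it. The `Tainted` class, `untaint`, and `report_path` are hypothetical names: taint spreads through concatenation, a regular expression match is the only way to get a clean string back, and the function that would touch the outside world refuses tainted values.

```python
import re

def _raw(value):
    # Unwrap a Tainted value; plain strings pass through unchanged.
    return value.value if isinstance(value, Tainted) else value

class Tainted:
    """Wraps a string that came from outside the program."""

    def __init__(self, value):
        self.value = value

    def __add__(self, other):
        # Anything built from tainted data is itself tainted.
        return Tainted(self.value + _raw(other))

    def __radd__(self, other):
        return Tainted(_raw(other) + self.value)

def untaint(data, pattern):
    # Only a successful regex match yields an ordinary, trusted string.
    match = re.fullmatch(pattern, _raw(data))
    if match is None:
        raise ValueError("input does not match the expected pattern")
    return match.group(0)

def report_path(name):
    # Stand-in for an operation that affects the outside world.
    if isinstance(name, Tainted):
        raise PermissionError("refusing to use tainted data in a file name")
    return "reports/" + name + ".txt"
```

For example, `"sales_" + Tainted("q3")` is still tainted and `report_path` rejects it, while `report_path(untaint(..., r"[a-z0-9_]+"))` goes through.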
I'm not aware of another language that has something like "taint", but using it has been very eye-opening. It's amazing how quickly tainted data gets spread around if you don't untaint it right away. Things that are natural and normal for a programmer, like setting a variable based on user data or opening a file, seem dangerous and risky with tainting turned on. So the best strategy for getting things done is to untaint as soon as you get some data from the outside.
And I suspect that's the best way in other languages as well: validate user data right away so that bugs and security holes can't propagate too far. Also, it ought to be easier to audit code for security holes if the potential holes are in one place. And you can never predict which data will be used for what purpose later.
Clean the data before you store it. Generally you shouldn't be performing ANY SQL actions without first cleaning up input. You don't want to subject yourself to a SQL injection attack.
I sort of follow these basic rules.
Only do modifying SQL actions, such as INSERT, UPDATE, and DELETE, through POST. Never GET.
Escape everything.
If you are expecting user input to be something, make sure you check that it is that something. For example, if you are requesting a number, make sure it is a number. Use validations.
Use filters. Clean up unwanted characters.
Users are evil!
Well, perhaps not always, but my approach is to always sanitize immediately to ensure nothing risky goes anywhere near my backend.
The added benefit is that you can provide feedback to the user if you sanitize at the point of input.
Assume all users are malicious.
Sanitize all input as soon as possible.
Full stop.
I sanitize my data right before I do any processing on it. I may need to take the First and Last name fields and concatenate them into a third field that gets inserted into the database. I'm going to sanitize the input before I even do the concatenation so I don't get any kind of processing or insertion errors. The sooner the better. Even using JavaScript on the front end (in a web setup) is ideal because that will occur without any data going to the server to begin with.
The scary part is that you might even want to start sanitizing data coming out of your database as well. The recent surge of ASPRox SQL injection attacks is doubly lethal because they infect all database tables in a given database. If your database is hosted somewhere where multiple accounts share the same database, your data becomes corrupted because of somebody else's mistake, and now you've joined the ranks of those serving malware to their visitors through no initial fault of your own.
Sure this makes for a whole lot of work up front, but if the data is critical, then it is a worthy investment.
User input should always be treated as malicious before it makes its way down into the lower layers of your application. Always sanitize input as soon as possible, and never, for any reason, store it in your database before checking it for malicious intent.
I find that cleaning it immediately has two advantages. One, you can validate against it and provide feedback to the user. Two, you do not have to worry about consuming the data in other places.