multiple error reporting with menhir: which token?


I am writing a small parser with Menhir + Ocamllex and I have two requirements I cannot seem to meet at the same time:
1) I would like to keep parsing after an error (to report more errors).
2) I would like to print the token at which the error occurred.
I can do only 1) easily, by using the error token. I can also do only 2) easily, using the approach suggested for this question. However, I don't know of an easy way to achieve both.
The way I handle errors right now goes something like this:
pair:
| left = prodA SEPARATOR right = prodA { (* happy case *) }
| error SEPARATOR right = prodA { print_error_report $startpos;
(* would like to continue after the first error, just in case
there is a second error, so I report both *) }
One thing that would help me is accessing the lexbuf itself, so I could get the token directly. This would mean that instead of $startpos I pass something like $lexbuf. But as far as I can tell, there is no official way to access the lexbuf. The solution in 1 works only at the level of the caller of the parser, where the caller is itself passing lexbuf to the parser, but not within semantic actions.
Does anyone know if it is actually available somehow? Or perhaps a workaround?

Thanks to combined work by Frédéric Bour and François Pottier, there is a new version of Menhir available that supports incremental parsing. See the announcement email sent on December 17.
The idea of this incremental API is to reverse control: instead of the parser calling the lexer to process the input, you have a lower-level API where you manipulate the parser state, which returns an updated state after each consumed token (in fact, this is slightly more fine-grained, as you can observe internal reductions that do not require new tokens). In particular, you can observe whether the resulting parser state is an error, and choose to backtrack and provide a different input (depending on your error-recovery strategy) to go farther along in your input.
The general idea is that this will make it possible to implement good error-recovery and error-reporting strategies on the parser-user side, and slowly deprecate the rather inflexible "error token" mechanism.
This is already usable, but work on those features is still ongoing, and you should expect a more robust support for these new features in other releases over the following months.

Related

Working with strings in GCP-Workflows and GCP-Admin

I'm integrating a project in GCP-Workflows with GCP-Admin, but I'm having trouble working with some data. When I extract a date, it is delivered in this format: 2020-12-28T11:20:05.000Z, so I can't turn the string into an int, and apparently there is no function like substring() in GCP either. I need to use the date in an IF, checking whether it is greater or less than a reference date.
How can I do this?

Workflows is still missing some functions for now. New ones are coming very soon, but I don't know whether they will solve your problem.
Anyway, with Workflows, the correct pattern when a built-in function isn't implemented is to call an endpoint, for example a Cloud Function or a Cloud Run service, which performs the transformation for you and returns the expected result.
It's quite tedious to do, but don't hesitate to open a feature request on the issue tracker; the product team is very reactive and loves user feedback!
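As a rough illustration of the transformation such an endpoint could perform, here is a minimal Python sketch. It assumes the timestamps always look like 2020-12-28T11:20:05.000Z (UTC, millisecond precision); the function names parse_utc and is_after are made up for this example, and the HTTP wrapper around them is left out.

```python
from datetime import datetime, timezone

# Assumed input format: 2020-12-28T11:20:05.000Z (always UTC, "Z" suffix).
ISO_Z_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def parse_utc(timestamp: str) -> datetime:
    """Parse a 'Z'-suffixed timestamp into a timezone-aware UTC datetime."""
    parsed = datetime.strptime(timestamp, ISO_Z_FORMAT)
    return parsed.replace(tzinfo=timezone.utc)


def is_after(timestamp: str, reference: str) -> bool:
    """Return True if `timestamp` is strictly later than `reference`."""
    return parse_utc(timestamp) > parse_utc(reference)
```

The workflow would then only need to branch on the boolean returned by the endpoint.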

The Workflows standard library now includes a text module with functions for searching (including regular expressions), splitting, substrings and case transformations.

How do syntax highlighting tools implement automated testing?

How do syntax highlighting tools such as pygments and textmate bundle do automated testing?

Tools like this often simply resort to a large collection of snippets of text representing a chosen input and the expected output. For instance, if you look at the Pygments GitHub repository, you can see they have giant lists of text files divided into an input section and a tokens section like so:
---input---
f'{"quoted string"}'
---tokens---
'f' Literal.String.Affix
"'" Literal.String.Single
'{' Literal.String.Interpol
'"' Literal.String.Double
'quoted string' Literal.String.Double
'"' Literal.String.Double
'}' Literal.String.Interpol
"'" Literal.String.Single
'\n' Text
Since a highlighting tool reads a piece of code and then has to identify which bits of text are parts of which bits of code (is this the start of a function? is this a comment? is it a variable name?), they usually perform various processing steps that will result in a list of tokens as above, which they can then feed into the next step (insert highlights from the first Literal.String.Interpol to the next, bold any Literal.String.Single, etc. by generating the appropriate HTML or CSS or other markup relevant to the system). Checking that these tokens are generated properly from the input text is key.
Then, depending on the language the tool is built in, you might use an existing testing suite or build your own (Pygments seems to use a Python-based tool called pytest), which essentially consists of running each of the inputs through your tool in a loop, reading the output, and comparing it to the expected values. If the output doesn't match, you can display a message showing which test failed and what the input/output/expected/error values were. If an output passes, you could simply signal with a happy green checkmark. Then when the test run finishes, the developer can hopefully reason out what they broke by looking over the results.
It is often a good idea to randomize the order of these inputs so that you can be sure that each step in the test doesn't have side effects that get passed along to the next test and cause it to pass or fail incorrectly. It might also be a good idea to time the complete test run. If the whole thing was taking 12 seconds yesterday, but now it takes two minutes, we may have broken something even if all the tests technically "pass".
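The loop described above can be sketched in a few lines of Python. This is not how Pygments itself runs its tests; it is a minimal illustration under stated assumptions: toy_tokenize is a hypothetical stand-in lexer, the snippet layout mimics the input/tokens format shown earlier, and run_suite is a made-up name for a runner that shuffles the cases and times the whole run.

```python
import ast
import random
import re
import time

# One snippet in the "---input--- / ---tokens---" layout (toy token names).
SNIPPET = r"""---input---
x 42
---tokens---
'x' Name
' ' Text
'42' Number
'\n' Text
"""

TOKEN_RE = re.compile(r"(?P<Number>\d+)|(?P<Name>[A-Za-z_]\w*)|(?P<Text>\s|.)")


def toy_tokenize(source):
    """Hypothetical stand-in for a real lexer: yields (value, token_type)."""
    for match in TOKEN_RE.finditer(source):
        yield match.group(), match.lastgroup


def parse_snippet(text):
    """Split a snippet into its input text and its expected token list."""
    _, _, rest = text.partition("---input---\n")
    source, _, token_block = rest.partition("---tokens---\n")
    expected = []
    for line in token_block.splitlines():
        if line.strip():
            value_repr, token_type = line.rsplit(None, 1)
            expected.append((ast.literal_eval(value_repr), token_type))
    return source, expected


def run_suite(snippets, tokenize, seed=None):
    """Run snippets in random order; return (failures, elapsed_seconds)."""
    order = list(snippets.items())
    random.Random(seed).shuffle(order)  # shuffle to expose order-dependent state
    started = time.perf_counter()
    failures = []
    for name, text in order:
        source, expected = parse_snippet(text)
        actual = list(tokenize(source))
        if actual != expected:
            failures.append((name, expected, actual))
    return failures, time.perf_counter() - started


failures, elapsed = run_suite({"assignment": SNIPPET}, toy_tokenize)
```

Comparing the elapsed time against a previous run is one cheap way to notice the "12 seconds yesterday, two minutes today" kind of regression.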
In tools like a code highlighter, you often have a good idea of what many of the inputs and outputs will look like before you can code everything up, for instance if some spec document already exists. In that case, it may be a good idea to include tests that you know won't pass right away, but mark them with some tag (perhaps some text marker within the file that says "NOT PASSING", or naming the file in a certain way), and tell your testing suite to expect those tests to fail. Then, as you fix bugs and add features, say you fixed Bug X in your attempt to make test #144 pass. Now when you run the tests, the suite also alerts you that 10 other tests that should be failing are now passing. Congrats! You just saved yourself a lot of work trying to fix several separate problems that were actually caused by the same root issue.
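A small sketch of how "expected to fail" bookkeeping might look. The labels PASS/FAIL/XFAIL/XPASS follow the terminology pytest uses for its xfail marker, but classify and summarize are hypothetical helpers written for this example, not part of any test framework.

```python
def classify(passed: bool, expected_to_fail: bool) -> str:
    """Combine the actual outcome with the 'NOT PASSING' marker."""
    if passed:
        return "XPASS" if expected_to_fail else "PASS"
    return "XFAIL" if expected_to_fail else "FAIL"


def summarize(results):
    """`results` maps test name -> (passed, expected_to_fail)."""
    report = {"PASS": [], "FAIL": [], "XFAIL": [], "XPASS": []}
    for name, (passed, expected_to_fail) in sorted(results.items()):
        report[classify(passed, expected_to_fail)].append(name)
    return report


report = summarize({
    "test_144": (True, True),    # was marked NOT PASSING, now passes
    "test_145": (False, True),   # still failing, as expected
    "test_001": (True, False),   # ordinary passing test
    "test_002": (False, False),  # real regression
})
```

Here report["XPASS"] lists the tests whose marker can be removed, and report["FAIL"] lists the genuine regressions.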
As the codebase is updated, a developer would run and rerun the tests to ensure that any changes they make don't break tests that were working before, and would then add new tests to the collection so that each new feature, fixed edge case, etc., has a known expected output that someone won't accidentally break in the future.

What is the best way to encrypt hardcoded strings in C++?

Warning: C++ noob
I've read multiple posts on StackOverflow about string encryption. However, they don't answer my questions.
I must insert one or two hardcoded strings in my code but I would like to make it difficult to read in plain text when debugging/reverse engineering. That's not all: my strings are URLs, so a simple packet analyzer (Wireshark) can read it.
I've said difficult because I know that, when the code runs, the string is somewhere (in RAM?) decrypted as plain text and somebody can read it. So, assuming that is not possible to completely secure my string, what is the best way of encrypting/decrypting it in C++?
I was thinking of something like this:
// I've omitted all the #include and main stuff of course...
// decrypt() is a placeholder for whatever decryption routine is used.
std::string encryptedUrl = "Ajdu67gGHhbh34590Hb6vfu6gu"; // URL encrypted with some known algorithm
URLDownloadToFile(NULL, decrypt(encryptedUrl).c_str(), "C:\\temp.txt", 0, NULL);
What about packet analyzing? I'm sure there's no way to hide the URL, but maybe I'm missing something? Thank you, and sorry for my poor English!
Edit 1: What does my application do?
It's a simple login script. My application downloads a text file from a URL. This file contains an encrypted string that is read using the fstream library. The string is then decrypted and used to log in on another site. It is very weak, because there's no database, no salt, no hashing. My goal is to ensure that neither the URL nor the login string is "easy" to read from a static analysis of the binary, and that both are as hard as possible to recover with a dynamic analysis (debugging, reverse engineering, etc.).

If you want to stymie packet inspectors, the bare minimum requirement is to use https with a hard-coded server certificate baked into your app.
There is no panacea for encrypting in-app content. A determined hacker with the right skills will get at the plain url, no matter what you do. The best you can hope for is to make it difficult enough that most people will just give up. The way to achieve this is to implement multiple diverse obfuscation and tripwire techniques. Including, but not limited to:
Store parts of the encrypted url and the password (preferably a one-time key) in different locations and bring them together in code.
Hide the encrypted parts in large strings of randomness that looks indistinguishable from the parts.
Bring the parts together piecemeal. E.g., Concatenate the first and second third of the encrypted url into a single buffer from one initialisation function, concatenate this buffer with the last third in a different unrelated init function, and use the final concatenation in yet another function, all called from different random places in your code.
Detect when the app is running under a debugger and have different functions trash the encrypted contents at different times.
Detection should be done at various call sites using different techniques, not by calling a single "DetectDebug" function or testing a global bool, both of which create a single point of attack.
Don't use obvious names, like, "DecryptUrl" for the relevant functions.
Harvest parts of the key from seemingly unrelated, but consistent sources. E.g., read the clock and only use a few of the high bits (high enough that they won't change for the foreseeable future, but low enough that they're not all zero), or use a random sampling of non-volatile results from initialisation code.
This is just the tip of the iceberg and will only throw novices off the scent. None of it is going to stop, or even significantly slow down, a skillful attacker, who will simply intercept calls to the SSL library using a stealth debugger. You therefore have to ask yourself:
How much is it worth to me to protect this url, and from what kind of attacker?
Can I somehow change the system design so that I don't need to secure the url?

Try XorSTR [1, 2]. It's what I used to use when trying to hamper static analysis. Most results will come from game cheat forums; there is an HTML generator too.
However, as others have mentioned, getting the strings is still easy for anyone who puts a breakpoint on URLDownloadToFile. Still, you will have made their life a bit harder if they are trying to do static analysis.
I am not sure what your URLs do, and what your goal is in all this, but XorStr + anti-debug + packing the binary will stop most amateurs from reverse engineering your application.
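To make the general XOR idea concrete, here is a minimal sketch, written in Python for brevity rather than C++. It is not the XorStr implementation; the key, the example URL, and the names xor_bytes and reveal are all assumptions for illustration. In a real build, the obfuscated bytes would be precomputed so that the plaintext never appears in the shipped binary.

```python
from itertools import cycle


def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR every byte of `data` with the repeating `key` (its own inverse)."""
    return bytes(b ^ k for b, k in zip(data, cycle(key)))


# Hypothetical key with no zero bytes, so every plaintext byte is changed.
KEY = b"\x5a\xc3\x17"

# Computed inline only for the demo; normally this would be a precomputed
# byte literal produced by a build step.
OBFUSCATED = xor_bytes(b"https://example.com/data.txt", KEY)


def reveal() -> str:
    """Recover the plaintext at the moment it is needed."""
    return xor_bytes(OBFUSCATED, KEY).decode("ascii")
```

As the answers point out, this only hinders static inspection of the binary; the plaintext still exists in memory once reveal() has run.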

the best approaches for logging localization using c++

I am working on a multinational project where the target audience for the logs might come from two nationalities, so it is becoming important to log in more than one language. I am thinking about writing to two different log folders, one per language, every time I log something, but I am also wondering whether logging frameworks like log4cpp come with some out-of-the-box functionality for this.

As other commenters have mentioned, it sounds like you are going down the wrong track by looking to do multilingual logging.
My recommendation would be to use English (which is the standard for technical information, and which I guess is the language you know best) and to make sure that the language you use is clear, grammatically correct and unambiguous. Then if one of the technicians cannot understand it, they can very easily and efficiently run it through a machine translation engine such as Google Translate. Or indeed they could process the logs and run everything through Google Translate to append translated text, particularly if you annotate the logs to mark the language content.
Assuming that the input language is well-written, machine translation usually gives a good result which the end user can understand. If the message isn't clear, or has typos or abbreviations, then that's where machine translation fails spectacularly.

Writing logs naturally slows down execution because of the file open, seek and write operations involved.
This is one primary reason why many developers and architects suggest logging at different levels, increasing the depth of log entries as the level increases so that problems can be traced better. At higher levels, you will notice that your process slows down because more log entries are generated.
I would rather suggest using a service that can translate from one language to another.
I'm sure there are libraries, free or paid, that do this translation. You can create a small utility program that runs in the background and does the conversion during process idle time.

Well, one suggestion is to use a separate process/thread that listens for your log messages and writes them out from there.
This reduces the I/O logging time in your main process/thread, and you can make all changes related to the logging language over there.
For multilingual support, I think you can try writing with wide-character strings, though I am not sure.

Install Qt 4 and use the QObject::tr / tr() macro for strings. Write strings in whatever language you want. Hire/get a translator to localize the strings using Qt Linguist.
Please note that perfect translation is impossible, so there will be many "amusing" misunderstandings, even if your translator is a genius. So it might be a better idea to select a main language for the programming team.
--EDIT--
Didn't notice this part before:
in more than one language
One way to approach it is to implement a log reader. Instead of writing plaintext messages, you could dump message ids (generated by some kind of macro) and string arguments if the strings are formatted. The "log reader" will allow the user to select the desired language while viewing the log file, and translate messages based on their ids/arguments using a mechanism similar to QTranslator. The good thing about this approach is that you'll be able to add more languages later, so it'll be possible to retranslate old logs. The bad thing is that this format will be harder for a "normal human" to read (although you can add plaintext messages in addition to message ids and arguments), and you'll need to write a log viewer.
Qt 4 has most of this framework implemented (there are routines for dumping variants into text/data streams, and so on) along with translation tool. See QTranslator documentation and Linguist manual for more info.
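The message-id approach can be sketched independently of Qt. The following Python example is only an illustration of the idea, not QTranslator's API: the JSON line format, the DISK_FULL id, the catalogs, and the function names log_record and render are all invented for this sketch.

```python
import json

# Hypothetical per-language catalogs keyed by message id.
CATALOGS = {
    "en": {"DISK_FULL": "Disk {disk} is full ({used}% used)"},
    "de": {"DISK_FULL": "Festplatte {disk} ist voll ({used}% belegt)"},
}


def log_record(message_id, **args):
    """Serialize a language-neutral log line: message id plus arguments."""
    return json.dumps({"id": message_id, "args": args}, sort_keys=True)


def render(line, language):
    """What a log viewer would do: translate one stored line on demand."""
    record = json.loads(line)
    template = CATALOGS[language].get(record["id"])
    if template is None:
        return line  # unknown id: fall back to showing the raw record
    return template.format(**record["args"])


line = log_record("DISK_FULL", disk="sda1", used=97)
```

Because the stored line carries no language at all, adding a new catalog later makes old logs readable in that language too.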

When is it best to sanitize user input?

User equals untrustworthy. Never trust untrustworthy user's input. I get that. However, I am wondering when the best time to sanitize input is. For example, do you blindly store user input and then sanitize it whenever it is accessed/used, or do you sanitize the input immediately and then store this "cleaned" version? Maybe there are also some other approaches I haven't thought of in addition to these. I am leaning more towards the first method, because any data that came from user input must still be approached cautiously, where the "cleaned" data might still unknowingly or accidentally be dangerous. Either way, what method do people think is best, and for what reasons?

Unfortunately, almost none of the participants clearly understands what they are talking about. Literally. Only Kibbee managed to get it straight.
This topic is all about sanitization. But the truth is, the broadly termed "general purpose sanitization" everyone is so eager to talk about just doesn't exist.
There are a zillion different mediums, and each requires its own distinct data formatting. Moreover, even a single medium requires different formatting for its different parts. Say, HTML formatting is useless for JavaScript embedded in an HTML page. Or, string formatting is useless for the numbers in an SQL query.
As a matter of fact, "sanitization as early as possible", as suggested in the most upvoted answers, is just impossible, because one cannot tell in advance in which medium or medium part the data will be used. Say, we are preparing to defend against "SQL injection", escaping everything that moves. But whoops! Some required fields weren't filled in, and we have to put the data back into the form instead of the database... with all the slashes added.
On the other hand, we diligently escaped all the "user input"... but in the SQL query we have no quotes around it, as it is a number or an identifier. And no "sanitization" ever helped us.
On the third hand - okay, we did our best in sanitizing the terrible, untrustworthy and disdained "user input"... but in some inner process we used this very data without any formatting (as we did our best already!) - and whoops! We have got a second-order injection in all its glory.
So, from the real life usage point of view, the only proper way would be
formatting, not whatever "sanitization"
right before use
according to the certain medium rules
and even following sub-rules required for this medium's different parts.
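To illustrate "formatting right before use, according to the medium", here is a minimal Python sketch that formats the same raw value for three different destinations using standard-library helpers. The function names for_html, for_js_string and for_shell are made up for this example, and the sample value is arbitrary.

```python
import html
import json
import shlex

# The raw value is stored and passed around unchanged.
raw = "O'Reilly <b> & \"co\""


def for_html(value: str) -> str:
    """Format for HTML text or a quoted attribute value."""
    return html.escape(value)


def for_js_string(value: str) -> str:
    """Format as a JavaScript/JSON string literal.

    Note: this alone is not enough for embedding inside an HTML <script>
    element, which is exactly the "different parts of one medium" problem.
    """
    return json.dumps(value)


def for_shell(value: str) -> str:
    """Format as a single POSIX shell argument."""
    return shlex.quote(value)


as_html = for_html(raw)
as_js = for_js_string(raw)
as_shell = for_shell(raw)
```

Each result is only correct in its own medium, which is why no single up-front "sanitization" pass could have produced all three.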

It depends on what kind of sanitizing you are doing.
For protecting against SQL injection, don't do anything to the data itself. Just use prepared statements, and that way, you don't have to worry about messing with the data that the user entered, and having it negatively affect your logic. You have to sanitize a little bit, to ensure that numbers are numbers, and dates are dates, since everything is a string as it comes from the request, but don't try to do any checking to do things like block keywords or anything.
For protecting against XSS attacks, it would probably be easier to fix the data before it's stored. However, as others mentioned, sometimes it's nice to have a pristine copy of exactly what the user entered, because once you change it, it's lost forever. It's almost too bad there's not a foolproof way to ensure your application only puts out sanitized HTML, the way you can ensure you don't get caught by SQL injection by using prepared queries.
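A minimal sketch of the prepared-statement advice, using Python's built-in sqlite3 module with an in-memory database. The users table and the helper names add_user and find_user are assumptions for the example; the point is that the user's text is passed as a bound parameter and only the type check (number is a number) touches it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, age INTEGER)")


def add_user(conn, name, age_text):
    """Insert a user; the name is stored exactly as entered."""
    age = int(age_text)  # type check only: raises ValueError if not a number
    conn.execute("INSERT INTO users (name, age) VALUES (?, ?)", (name, age))


def find_user(conn, name):
    """Look a user up by name via a bound parameter."""
    cursor = conn.execute("SELECT name, age FROM users WHERE name = ?", (name,))
    return cursor.fetchall()


hostile = "Robert'); DROP TABLE users;--"
add_user(conn, hostile, "12")
rows = find_user(conn, hostile)
```

The hostile-looking name round-trips unchanged, and the table is still there afterwards.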

I sanitize my user data much like Radu...
First client-side using both regex's and taking control over allowable characters
input into given form fields using javascript or jQuery tied to events, such as
onChange or OnBlur, which removes any disallowed input before it can even be
submitted. Realize however, that this really only has the effect of letting those
users in the know, that the data is going to be checked server-side as well. It's
more a warning than any actual protection.
Second, and I rarely see this done anymore, the first check done server-side
is to verify the location from which the form is being submitted.
By only allowing form submission from a page that you have designated as a valid
location, you can kill the script BEFORE you have even read in any data. Granted,
that in itself is insufficient, as a good hacker with their own server can 'spoof'
both the domain and the IP address to make it appear to your script that it is coming
from a valid form location.
Next, and I shouldn't even have to say this, but always, and I mean ALWAYS, run
your scripts in taint mode. This forces you to not get lazy, and to be diligent about
step number 4.
Sanitize the user data as soon as possible using well-formed regexes appropriate to
the data that is expected from any given field on the form. Don't take shortcuts like
the infamous 'magic horn of the unicorn' to blow through your taint checks...
or you may as well just turn off taint checking in the first place for all the good
it will do for your security. That's like giving a psychopath a sharp knife, baring
your throat, and saying 'You really won't hurt me with that, will you?'
And here is where I differ from most others in this fourth step, as I only sanitize
the user data that I am going to actually USE in a way that may present a security
risk, such as any system calls, assignments to other variables, or any writing to
store data. If I am only using the data input by a user to make a comparison to data
I have stored on the system myself (therefore knowing that data of my own is safe),
then I don't bother to sanitize the user data, as I am never going to use it in a way
that presents itself as a security problem. For instance, take a username input as
an example. I use the username input by the user only to check it against a match in
my database, and if true, after that I use the data from the database to perform
all other functions I might call for it in the script, knowing it is safe, and never
use the users data again after that.
Last, is to filter out all the attempted auto-submits by robots these days, with a
'human authentication' system, such as Captcha. This is important enough these days
that I took the time to write my own 'human authentication' schema that uses photos
and an input for the 'human' to enter what they see in the picture. I did this because
I've found that Captcha type systems really annoy users (you can tell by their
squinted-up eyes from trying to decipher the distorted letters... usually over and
over again). This is especially important for scripts that use either SendMail or SMTP
for email, as these are favorites for your hungry spam-bots.
To wrap it up in a nutshell, I'll explain it as I do to my wife... your server is like a popular nightclub, and the more bouncers you have, the less trouble you are likely to have
in the nightclub. I have two bouncers outside the door (client-side validation and human authentication), one bouncer right inside the door (checking for valid form submission location... 'Is that really you on this ID'), and several more bouncers in
close proximity to the door (running taint mode and using good regexes to check the
user data).
I know this is an older post, but I felt it important enough for anyone that may read it after my visit here to realize there is no 'magic bullet' when it comes to security, and it takes all these working in conjunction with one another to make your user-provided data secure. Just using one or two of these methods alone is practically worthless, as their power only exists when they all team together.
Or in summary, as my Mum would often say... 'Better safe than sorry'.
UPDATE:
One more thing I am doing these days, is Base64 encoding all my data, and then encrypting the Base64 data that will reside on my SQL Databases. It takes about a third more total bytes to store it this way, but the security benefits outweigh the extra size of the data in my opinion.

I like to sanitize it as early as possible, which means the sanitizing happens when the user tries to enter invalid data. If there's a TextBox for their age, and they type in anything other than a number, I don't let the keypress for the letter go through.
Then, whatever is reading the data (often a server) I do a sanity check when I read in the data, just to make sure that nothing slips in due to a more determined user (such as hand-editing files, or even modifying packets!)
Edit: Overall, sanitize early and sanitize any time you've lost sight of the data for even a second (e.g. File Save -> File Open)

The most important thing is to always be consistent in when you escape. Accidental double sanitizing is lame and not sanitizing is dangerous.
For SQL, just make sure your database access library supports bind variables, which automatically escape values. Anyone who manually concatenates user input onto SQL strings should know better.
For HTML, I prefer to escape at the last possible moment. If you keep the raw input, users who make a mistake can edit and fix it later. If you destroy their original input, it's gone forever.

Early is good, definitely before you try to parse it. Anything you're going to output later, or especially pass to other components (e.g., shell, SQL), must be sanitized.
But don't go overboard - for instance, passwords are hashed before you store them (right?). Hash functions can accept arbitrary binary data. And you'll never print out a password (right?). So don't parse passwords - and don't sanitize them.
Also, make sure that you're doing the sanitizing from a trusted process - JavaScript/anything client-side is worse than useless security/integrity-wise. (It might provide a better user experience to fail early, though - just do it both places.)
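The point about passwords can be shown with a short sketch: the password goes into the hash function untouched, whatever characters it contains. This uses PBKDF2 from Python's standard library as one possible choice; the iteration count is an illustrative assumption to be tuned, and hash_password and verify_password are hypothetical helper names.

```python
import hashlib
import hmac
import os

ITERATIONS = 200_000  # illustrative work factor; tune for your hardware


def hash_password(password, salt=None):
    """Hash the raw password bytes; no sanitizing or parsing beforehand."""
    salt = os.urandom(16) if salt is None else salt
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS
    )
    return salt, digest


def verify_password(password, salt, expected_digest):
    """Recompute the hash and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected_digest)


# Characters that would worry an SQL or HTML layer are fine here.
awkward = "'; DROP TABLE users; -- <script>\x00"
salt, stored = hash_password(awkward)
```

Only the salt and the digest would be stored, so the awkward characters never reach SQL or HTML at all.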

My opinion is to sanitize user input as soon as possible, both client side and server side. I do it like this:
(client side) Allow the user to enter just specific keys in the field.
(client side) When the user moves to the next field (onblur), test the input he entered against a regexp, and notify the user if something is wrong.
(server side) Test the input again: if the field should be an INTEGER, check for that (in PHP you can use is_numeric()); if the field has a well-known format, check it against a regexp; all others (like text comments), just escape. If anything is suspicious, stop script execution and return a notice to the user that the data he entered is invalid.
If something really looks like a possible attack, the script sends a mail and an SMS to me, so I can check and maybe prevent it as soon as possible. I just need to check the log, where I record all user inputs and the steps the script took before accepting or rejecting the input.
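The server-side step of this approach might look roughly like the following Python sketch (the answer itself refers to PHP). The field names, the five-digit postal code pattern, and the validate_form function are assumptions for illustration; free-text fields are HTML-escaped as the answer describes.

```python
import html
import re

AGE_RE = re.compile(r"[0-9]{1,3}")
POSTAL_CODE_RE = re.compile(r"[0-9]{5}")  # hypothetical "well-known format"


def validate_form(form):
    """Return (clean, errors) for a dict of raw string fields."""
    clean, errors = {}, {}

    age = form.get("age", "")
    if AGE_RE.fullmatch(age):
        clean["age"] = int(age)  # integer field: must be digits only
    else:
        errors["age"] = "must be a whole number"

    postal_code = form.get("postal_code", "")
    if POSTAL_CODE_RE.fullmatch(postal_code):
        clean["postal_code"] = postal_code  # fixed format: regexp check
    else:
        errors["postal_code"] = "must be five digits"

    # Everything else (free text): just escape it.
    clean["comment"] = html.escape(form.get("comment", ""))
    return clean, errors


clean, errors = validate_form(
    {"age": "42", "postal_code": "12345", "comment": "<b>hi</b>"}
)
```

A non-empty errors dict is the signal to stop processing and send the notice back to the user.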

Perl has a taint option which considers all user input "tainted" until it's been checked with a regular expression. Tainted data can be used and passed around, but it taints any data that it comes in contact with until untainted. For instance, if user input is appended to another string, the new string is also tainted. Basically, any expression that contains tainted values will output a tainted result.
Tainted data can be thrown around at will (tainting data as it goes), but as soon as it is used by a command that has effect on the outside world, the perl script fails. So if I use tainted data to create a file, construct a shell command, change working directory, etc, Perl will fail with a security error.
I'm not aware of another language that has something like "taint", but using it has been very eye-opening. It's amazing how quickly tainted data gets spread around if you don't untaint it right away. Things that are natural and normal for a programmer, like setting a variable based on user data or opening a file, seem dangerous and risky with tainting turned on. So the best strategy for getting things done is to untaint as soon as you get some data from the outside.
And I suspect that's the best way in other languages as well: validate user data right away so that bugs and security holes can't propagate too far. Also, it ought to be easier to audit code for security holes if the potential holes are in one place. And you can never predict which data will be used for what purpose later.
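For readers who have not used Perl's taint mode, the behaviour can be imitated in a very reduced form. The sketch below is an assumption-laden toy in Python, not how Perl implements it: the Tainted class, untaint, and use_in_outside_world are invented names, and only string concatenation propagates the taint here (slicing, formatting and other operations would silently drop it).

```python
import re


class Tainted(str):
    """A string from outside that has not been checked yet."""

    def __add__(self, other):
        return Tainted("".join((self, other)))

    def __radd__(self, other):
        return Tainted("".join((other, self)))


def untaint(value, pattern):
    """Return a plain str only if the whole value matches `pattern`."""
    if re.fullmatch(pattern, value) is None:
        raise ValueError("input does not match the expected format")
    return str(value)  # str() of a str subclass yields a plain str


def use_in_outside_world(value):
    """Stand-in for opening a file, running a command, and so on."""
    if isinstance(value, Tainted):
        raise PermissionError("refusing to use tainted data")
    return "would open " + value


user_input = Tainted("report.txt")
path = "/tmp/" + user_input  # the taint spreads to the combined string
checked = untaint(path, r"/tmp/[\w.]+")
```

Calling use_in_outside_world(path) raises, while use_in_outside_world(checked) succeeds, which mirrors the "fails until untainted by a regular expression" behaviour described above.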

Clean the data before you store it. Generally you shouldn't be performing ANY SQL actions without first cleaning up input. You don't want to subject yourself to a SQL injection attack.
I sort of follow these basic rules.
Only do modifying SQL actions, such as, INSERT, UPDATE, DELETE through POST. Never GET.
Escape everything.
If you are expecting user input to be something, make sure you check that it is that something. For example, if you are requesting a number, then make sure it is a number. Use validations.
Use filters. Clean up unwanted characters.

Users are evil!
Well, perhaps not always, but my approach is to always sanitize immediately to ensure nothing risky goes anywhere near my backend.
The added benefit is that you can provide feedback to the user if you sanitize at the point of input.

Assume all users are malicious.
Sanitize all input as soon as possible.
Full stop.

I sanitize my data right before I do any processing on it. I may need to take the First and Last name fields and concatenate them into a third field that gets inserted to the database. I'm going to sanitize the input before I even do the concatenation so I don't get any kind of processing or insertion errors. The sooner the better. Even using Javascript on the front end (in a web setup) is ideal because that will occur without any data going to the server to begin with.
The scary part is that you might even want to start sanitizing data coming out of your database as well. The recent surge of ASPRox SQL injection attacks is doubly lethal because it infects all database tables in a given database. If your database is hosted somewhere where multiple accounts share the same database, your data becomes corrupted because of somebody else's mistake, and now you've joined the ranks of those serving malware to your visitors through no initial fault of your own.
Sure this makes for a whole lot of work up front, but if the data is critical, then it is a worthy investment.

User input should always be treated as malicious before it makes its way down into the lower layers of your application. Always sanitize input as soon as possible; it should not for any reason be stored in your database before being checked for malicious intent.

I find that cleaning it immediately has two advantages. One, you can validate against it and provide feedback to the user. Two, you do not have to worry about consuming the data in other places.
