I am working with institution A's code, which contains, among other things,
adopath ++ $prog
use "$prog/subDirectory/otherFile.do"
The actual do-file is something I try not to change, as I know that in its current state it works at institution A.
I am supposed to define my own profile.do to make it work; there I need to set up $prog. I cannot see how institution A set up their profile.do, or whether their path contains forbidden characters. $prog is supposed to contain the working directory, which in my case is
global prog "C:\Users\foobar\Google Drive\Cloud\PhD\Projects\Labor Supply\LIAB_QM2_9310_v1_test_dta\prog"
As I learned in another question, spaces are forbidden characters, which is why this setup definitely will not work. It was suggested that I use double quotes,
global prog ""C:\Users\foobar\Google Drive\Cloud\PhD\Projects\Labor Supply\LIAB_QM2_9310_v1_test_dta\prog""
This worked to some extent: adopath ++ $prog now ran smoothly. However, the second command, use "$prog/subDirectory/otherFile.do", now throws an error. So here are my questions:
First best: is there a different way of defining $prog in a way that allows me to run the remainder of the code without getting errors?
Second best: Is there a safe way to rewrite use "$prog/subDirectory/otherFile.do"? That is, if I rewrite it as use $prog"/subDirectory/otherFile.do", and it proceeds to work on my system, is it guaranteed to work wherever the old code used to work? Can I safely exchange that piece of code while guaranteeing continued functionality?
The second-best option:
profile.do
global prog "C:\Users\foobar\Google Drive\Cloud\PhD\Projects\Labor Supply\LIAB_QM2_9310_v1_test_dta\prog"
Institution A's code
capture adopath ++ $prog
if (_rc != 0) adopath ++ "$prog"
use "$prog/subDirectory/otherFile.do"
Being really picky, you would substitute == for != and replace 0 with the expected error code; for example, 198 (invalid syntax).
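With that change, the guard would read:

capture adopath ++ $prog
if (_rc == 198) adopath ++ "$prog"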
This solution respects the original code and, if necessary, will adequately handle the error produced by your profile.do file.
(Again, this wouldn't be a problem if your working directory had no blanks.)
I'm working on a git diff parser. The main task is to find all changed function signatures. Sometimes the chunk line with ### .... ### contains this information, but sometimes it does not.
Last time I changed a cout message in greet(); it is visible in the first image as a changed line, which is correct, but the ###... line above shows "void functOne() {", which was not changed.
The second picture is about a dummy cpp source code to test git diff.
The main questions are:
How can I list all changed functions' signatures?
Why does an unchanged function name sometimes appear?
Why does the line with ###.... sometimes contain no function name/signature at all?
Sometimes the chunk line with ### .... ###
Git calls this a hunk header (after other diff software that also calls it that).
... contains [the function name] but sometimes not.
What Git puts in the function section of a diff hunk header is produced by matching earlier lines against a particular regular expression, as described in the gitattributes documentation under xfuncname (search for that string). But note that this is a regular expression, and regular expressions are inherently less capable than parsers; there will always exist valid C++ constructs that can be parsed, but not recognized by some regular expression you can write.
If Git's built-in C++ xfuncname pattern is not adequate for your use, you can write your own pattern. But it's always going to be limited, because regular expressions can only recognize regular languages (these are CS-theoretical or informatics terms, not to be interpreted as ordinary English; for more, see, e.g., Regular vs Context Free Grammars).
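For illustration, wiring up a custom pattern looks roughly like this (the driver name mycpp and the pattern itself are placeholders, deliberately crude, not recommendations). In .gitattributes:

*.cpp   diff=mycpp

and in your git config:

[diff "mycpp"]
    xfuncname = "^[A-Za-z_].*\\(.*$"

Git then shows, in each hunk header, the nearest preceding line that matches the pattern.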
The git diff command doesn't care about any functions. Git repositories can contain any kind of text file (binary files too, but that's immaterial here), not just C++ source.
The diff command doesn't attempt to interpret the file in any way. Only a C++ compiler can fully understand a C++ file and process all function declarations.
The diff command only looks for discrete lines of text that changed and shows them together with a few unchanged lines that precede and follow them.
If the changed lines happen to be at the beginning of a function declaration, then this would include the function declaration. If they are in the middle of a long function, you only see the few preceding lines, that's it.
There are git diff options that control how many unchanged lines are shown (check git's documentation). Specifying a million lines, for example, results in the entire file getting shown, with all the changed lines marked up.
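For example, using the -U/--unified option (the exact count is arbitrary):

git diff -U1000000

shows each changed file in its entirety as a single hunk, with the changed lines marked.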
You can do that if you wish, and then try to figure out the names of all the changed functions yourself, but until you write a complete C++ compiler yourself, your heuristic parsing attempts won't be 100% correct. You might've noticed, tucked away in git diff output, an indication of what git guessed the changed function might be. But since git is also not a C++ compiler, that guess is occasionally wrong too.
First of all, I use C# 4.0 to parse the code of a VB6 application.
I have some old VB6 code and about 500+ copies of it, and I use a regular expression to grab all kinds of global variables from the code. The code is best described as "Yuck", and some poor victim still has to support it. So I'm hoping to help this poor sucker a bit by generating overviews of specific constants. (And yes, it should be rewritten, but it ain't broke, so...)
This is a sample of a code line I need to match, in this case all boolean constants:
Public Const gDemo = False 'Is this a demo version
And this is the regular expression I use at this moment:
Public\s+Const\s+g(?'Name'[a-zA-Z][a-zA-Z0-9]*)\s+=\s+(?'Value'[0-9]*)
And I think it, too, is yucky, because of the * at the end of the Value group. But if I don't use it, it only returns 'T' or 'F', and I want the whole word.
Is this the proper regex to use, or is there an even nicer-looking option?
FYI, I use similar regexes to find all string constants and all numeric constants. Those work just fine. Basically the same .BAS file is used for all 50 copies, but with different values for all these variables. By parsing all files, we get a good overview of how every version is configured.
And again, yes, we need to rebuild the whole project from scratch, since it is becoming harder to maintain these days. But it works, and we need the manpower for other tasks. It just needs the occasional tweak...
You can use: Public\s+Const\s+g(?<Name>[a-zA-Z][a-zA-Z0-9]*)\s+=\s+(?<Value>False|True)
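For completeness, consuming the matches from C# might look like this (a sketch; source is assumed to hold the contents of the .BAS file):

using System;
using System.Text.RegularExpressions;

string pattern = @"Public\s+Const\s+g(?<Name>[a-zA-Z][a-zA-Z0-9]*)\s+=\s+(?<Value>False|True)";
foreach (Match m in Regex.Matches(source, pattern))
{
    // Named groups give you the constant's name and its full boolean value.
    Console.WriteLine("{0} = {1}", m.Groups["Name"].Value, m.Groups["Value"].Value);
}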
I use Boost.Spirit.Lex and .Qi for a simple calculator project, and (as usual) it is giving me some pain to debug and use. The debug prints:
<expression>
<try>boost::spirit::multi_pass::illegal_backtracking
This exception is thrown and I can't understand why. I use macros in my code, and it would be a pain to give a minimal example, so I give the whole project. Just do "make" at the root, and then launch ./sash; a prompt will appear. If you want to test, just do "-echo $5-8".
It seems that Google doesn't turn up any similar problems with this exception...
The parser is in arithmetic/, and the call of the parser is at the end of arithmetic/evaluator.cpp
Any help is greatly appreciated.
Your code is breaking because BOOST_SPIRIT_QI_DEBUG as well as the on_error<> handler seem to use iterators after they might have been invalidated.
To be honest, I'm not completely sure how this could happen.
Background
AFAICT lexertl uses spirit::multipass<> with a split_functor input policy and a split_std_deque storage policy [1].
Now, (luckily?) the checking policy is buf_id_check which means that the iterator will check for invalidation at the time of dereference.
Iterators are expected to be invalidated if
the iterator is dereferenced, growing the buffer to >16 tokens, and the iterator is the only one referring to the shared state,
or somewhere along the line clear_queue is called explicitly (e.g. from the flush_multi_path primitive in the Spirit Repository).
Honestly I don't see any of these two conditions being met. A quick and dirty
token_iterator_type clone = iter; // just to make it non-unique...
in evaluator.cpp doesn't make a difference (ruling out reason #1)
Temporarily disabling the docheck implementation in the buf_id_check_policy made valgrind point out that on_error<> and BOOST_SPIRIT_DEBUG* cause invalid memory references. Commenting out both indeed makes all problems go away (and eval_expression now works).
However, this is likely not your preferred solution.
Proposed solution
Since
you're working on a fixed, in-memory container representing the input, so you don't really need multi_pass behaviour emulation;
you're using a trivial grammar, so you don't really benefit from lexertl, while you are getting a lot of added complexity (as you can see);
I've quickly refactored some code: https://github.com/sehe/sash-refactor/commits/master
commit dec31496 sanity - lets do without macros
4 files changed, 59 insertions(+), 146 deletions(-)
commit 6056574c dead code, excess scope, excess instantiation
5 files changed, 38 insertions(+), 62 deletions(-)
commit 99d441db remove lexer
9 files changed, 25 insertions(+), 177 deletions(-)
Now you will find that your code is generally much simpler and much shorter, no longer runs into multi_pass limits, and you can still have SPIRIT_DEBUG as well as on_error handling :) In the end:
binary size in -g3 is reduced from 16Mb to 6.5Mb
a net 263 lines of code have been removed
more importantly, it works
Here's some samples (without debug output):
$ ./sash <<< '-echo $8-9'
-1
Warning: Empty environment variable "8-9".
$ ./sash <<< '-echo $8\*9'
72
Warning: Empty environment variable "8*9".
$ ./sash <<< '-echo $8\*(9-1)'
64
Warning: Empty environment variable "8*(9-1)".
$ ./sash <<< '-echo $--+-+8\*(9-1)'
-64
Warning: Empty environment variable "--+-+8*(9-1)".
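For reference, a stripped-down, lexer-free Qi calculator along these lines might look like the following (an illustrative sketch, not the actual refactored code from the commits above):

#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>
#include <iostream>
#include <string>

namespace qi = boost::spirit::qi;

// Grammar over plain string iterators -- no lexer, no multi_pass.
template <typename It>
struct calculator : qi::grammar<It, int(), qi::space_type>
{
    calculator() : calculator::base_type(expr)
    {
        using qi::_val; using qi::_1;
        expr   = term   [_val = _1] >> *('+' >> term   [_val += _1] | '-' >> term   [_val -= _1]);
        term   = factor [_val = _1] >> *('*' >> factor [_val *= _1] | '/' >> factor [_val /= _1]);
        factor = qi::int_ [_val = _1]
               | '(' >> expr [_val = _1] >> ')'
               | '-' >> factor [_val = -_1]
               | '+' >> factor [_val = _1];
    }
    qi::rule<It, int(), qi::space_type> expr, term, factor;
};

int main()
{
    const std::string input = "8*(9-1)";
    std::string::const_iterator f = input.begin(), l = input.end();
    int result = 0;
    if (qi::phrase_parse(f, l, calculator<std::string::const_iterator>(), qi::space, result) && f == l)
        std::cout << result << "\n"; // prints 64
}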
[1] Which, despite its name, buffers previously seen tokens in a std::vector<>.
I am planning to build an Online Judge along the lines of CodeChef, TechGig, etc. Initially, I will accept solutions only in C/C++.
I have thought through a security model, but my concern right now is how to model the execution and testing part.
Method 1
The method that seems to be more popular is to redirect standard input to the executable and redirect standard output to a file, for example:
./submission.exe < input.txt > output.txt
Then compare the output.txt file with some solution.txt file character by character and report the results.
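The byte-by-byte comparison itself is a shell one-liner, for example:

cmp -s output.txt solution.txt && echo ACCEPTED || echo "WRONG ANSWER"

cmp exits with status 0 only when the files are identical.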
Method 2
A second approach that I have seen is not to allow the users to write main(). Instead, write a function that accepts some arguments in the form of strings and set a global variable as the output. For example:
//This variable should be set before returning from submissionAlgorithm()
char * output;

void submissionAlgorithm(char * input1, char * input2)
{
    //Write your code here.
}
To execute the test cases, the function submissionAlgorithm() is called repeatedly, and the output variable is checked after each call.
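The judge-side loop might then look roughly like this (a sketch; runTestCase and the calling convention are illustrative):

#include <stdio.h>
#include <string.h>

char * output;                                          /* set by the submission */
void submissionAlgorithm(char * input1, char * input2); /* submitted code */

/* Returns 1 if the submission produced the expected output. */
int runTestCase(char * input1, char * input2, const char * expected)
{
    output = NULL;
    submissionAlgorithm(input1, input2);
    return output != NULL && strcmp(output, expected) == 0;
}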
From an initial analysis, I found that Method 2 would not only be more secure (I would prevent all read and write access to the filesystem from the submitted code), but might also make the execution of test cases faster, since the computation of test results happens in memory.
I would like to know if there is any reason as to why Method 1 would be preferred over Method 2.
P.S: Of course, I would be hosting the online judge engine on a Linux Server.
Don't take this the wrong way, but you will need to look at security from a much higher perspective. The problem will not be the input and output being written to a file, and that should not affect performance too much either. But you will need to manage submissions that can actually take down your process (in the second case) or the whole system (with calls to the OS to write to disk, acquire too much memory, ...).
Disclaimer: I am by no means a security expert.
I'm attempting to construct a “heatmap” from a multi-year history stored in a git repository where the unit of granularity is individual functions. Functions should grow hotter as they change more times, more frequently, and with more non-blank lines changed.
As a start, I examined the output of
git log --patch -M --find-renames --find-copies-harder --function-context -- *.c
I looked at using Language.C from Hackage, but it seems to want a complete translation unit—expanded headers and all—rather than being able to cope with a source fragment.
The --function-context option is new in version 1.7.8. The foundation of the implementation in v1.7.9.4 is a regex:
PATTERNS("cpp",
/* Jump targets or access declarations */
"!^[ \t]*[A-Za-z_][A-Za-z_0-9]*:.*$\n"
/* C/++ functions/methods at top level */
"^([A-Za-z_][A-Za-z_0-9]*([ \t*]+[A-Za-z_][A-Za-z_0-9]*([ \t]*::[ \t]*[^[:space:]]+)?){1,}[ \t]*\\([^;]*)$\n"
/* compound type at top level */
"^((struct|class|enum)[^;]*)$",
/* -- */
"[a-zA-Z_][a-zA-Z0-9_]*"
"|[-+0-9.e]+[fFlL]?|0[xXbB]?[0-9a-fA-F]+[lL]?"
"|[-+*/<>%&^|=!]=|--|\\+\\+|<<=?|>>=?|&&|\\|\\||::|->"),
This seems to recognize boundaries reasonably well but doesn’t always leave the function as the first line of the diff hunk, e.g., with #include directives at the top or with a hunk that contains multiple function definitions. An option to tell diff to emit separate hunks for each function changed would be really useful.
This isn’t safety-critical, so I can tolerate some misses. Does that mean I likely have Zawinski’s “two problems”?
I realise this suggestion is a bit tangential, but it may help to clarify and rank requirements. This would work for C or C++ ...
Instead of trying to find text blocks which are functions and comparing them, use the compiler to make binary blocks. Specifically, for every C/C++ source file in a change set, compile it to an object. Then use the object code as a basis for comparisons.
This might not be feasible for you, but IIRC there is an option on gcc (-ffunction-sections) to compile so that each function is placed in an 'independent chunk' within the generated object code file. The linker can then pull each 'chunk' into a program individually.
So, assuming we can do this, you'll have lots of functions defined by chunks of binary code, so a simple 'heat' comparison is 'how much longer or shorter is the code between versions for any function?'
I am also thinking it might be practical to use objdump to reconstitute the assembler for the functions. I might use some regular expressions at this stage to trim off the register names, so that changes to register allocation don't cause too many false positives.
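For example (a sketch; foo.o is a placeholder and the sed pattern is crude, assuming AT&T-syntax x86 output):

objdump -d --no-show-raw-insn foo.o | sed -E 's/%[a-z0-9]+/REG/g'

This disassembles the object file and replaces every register name with a fixed token before diffing.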
I might even try to sort the assembler instructions in the function bodies, and diff them to get a pattern of "removed" vs "added" between two function implementations. This would give a measure of change which is pretty much independent of layout, and even somewhat independent of the order of some of the source.
So it might be interesting to see if two alternative implementations of the same function (i.e. from a different change set) are the same instructions :-)
This approach should also work for C++ because all names have been appropriately mangled, which should guarantee the same functions are being compared.
So, the regular expressions might be kept very simple :-)
Assuming all of this is straightforward, what might this approach fail to give you?
Side Note: This basic strategy could work for any language which targets machine code, as well as VM instruction sets like the Java VM Bytecode, .NET CLR code, etc too.
It might be worth considering building a simple parser, using one of the common tools, rather than just using regular expressions. Clearly it is better to choose something you are familiar with, or which your organisation already uses.
For this problem, a parser doesn't actually need to validate the code (I assume it is valid when it is checked in), and it doesn't need to understand the code, so it might be quite dumb.
It might throw away comments (retaining newlines), ignore the contents of text strings, and treat program text in a very simple way. It mainly needs to keep track of balanced '{' '}' and balanced '(' ')'; all the other valid program text is just individual tokens which can be passed 'straight through'.
Its output might be a separate file per function to make tracking easier.
If the language is C or C++, and the developers are reasonably disciplined, they might never use 'non-syntactic macros'. If that is the case, then the files don't need to be preprocessed.
Then the parser is mostly just looking for the function name (an identifier) at file scope followed by ( parameter-list ) { ... code ... }
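To give a flavour of how dumb it can be, here is a sketch of the brace-tracking idea in C++ (illustrative only: it ignores comments, strings, preprocessor directives and templates):

#include <cctype>
#include <string>
#include <vector>

// Collect names of file-scope definitions of the form ident(...) { ... }.
std::vector<std::string> functionNames(const std::string & src)
{
    std::vector<std::string> names;
    std::string lastIdent, candidate;
    int depth = 0;
    for (std::size_t i = 0; i < src.size(); ++i) {
        char c = src[i];
        if (std::isalpha((unsigned char)c) || c == '_') {
            std::size_t j = i;
            while (j < src.size() && (std::isalnum((unsigned char)src[j]) || src[j] == '_'))
                ++j;
            if (depth == 0)
                lastIdent = src.substr(i, j - i);   // most recent file-scope identifier
            i = j - 1;
        } else if (c == '(' && depth == 0) {
            candidate = lastIdent;                   // possible function name
        } else if (c == ';' && depth == 0) {
            candidate.clear();                       // just a declaration, not a definition
        } else if (c == '{') {
            if (depth == 0 && !candidate.empty()) {
                names.push_back(candidate);          // ident(...) { at file scope
                candidate.clear();
            }
            ++depth;
        } else if (c == '}') {
            --depth;
        }
    }
    return names;
}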
I'd SWAG it would be a few days' work using yacc & lex / flex & bison, and it might be so simple that there is no need for a parser generator.
If the code is Java, then ANTLR is a possibility, and I think there was a simple Java parser example.
If Haskell is your focus, there may be published student projects which have made a reasonable stab at a parser.