Stata: generate/replace alternatives?

I have been using Stata for several years now, along with other languages like R.
Stata is great, but one thing annoys me: the generate/replace behaviour, and especially the "... already defined" error.
It means that if a piece of code defines a variable and you want to run that code twice, the definition needs two lines:
capture drop foo
generate foo = ...
whereas it takes just one line in other languages such as R.
So is there another way to define variables, one that combines generate and replace in a single command?

I am unaware of any way to do this directly. Further, as @Roberto's comment implies, there are good reasons why simply issuing a generate command will not overwrite the contents of an existing variable (see replace).
To be able to do this while maintaining data integrity, you would need to issue two separate commands, as your question points out (explicitly dropping the existing variable before generating the new one). I see this as a way in which Stata forces the user to be clear about his/her intentions.
It might be noted that Stata is not alone in this regard. SQL Server, for example, requires the user to drop an existing table before creating a table with the same name (in the same database), does not allow multiple columns with the same name in a table, etc., and all for good reason.
However, if you are really set on being able to issue a one-liner in Stata to do what you desire, you could write a very simple program. The following should get you started:
program mkvar
    version 13
    // parse the new variable name, the = expression, and any if/in qualifiers
    syntax anything=exp [if] [in]
    // if a variable with that name already exists, drop it first
    capture confirm variable `anything'
    if !_rc {
        drop `anything'
    }
    generate `anything' `exp' `if' `in'
end
You would then naturally save the program as mkvar.ado in a directory that Stata can find (e.g., C:\ado\personal\ on Windows; if you are unsure, type sysdir), and call it using:
mkvar newvar=expression [if] [in]
Now, I haven't tested the above code much, so you may have to do a bit of debugging, but it has worked fine in the examples I've tried.
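For instance, assuming the program has been saved where Stata can find it, a quick test run from the do-file editor might look like this (a sketch using the auto dataset that ships with Stata; the variable name pricek and the expressions are arbitrary choices for illustration):
sysuse auto, clear
mkvar pricek=price/1000    // first call: pricek does not exist yet, so it is simply created
mkvar pricek=price/500     // second call: the existing pricek is dropped and re-created
Calling mkvar a second time with the same variable name no longer triggers the "already defined" error.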
On a closing note, I'd advise you to exercise caution when doing this: be vigilant about altering your data, retain a copy of your raw data while a do-file manipulates the data in memory, and so on.

Related

pre_limit_mult in synth_runner package does not work well

This might be similar to this question (pre_limit_mult in synth_runner package stata does not work); however, I am asking again because I could not find any useful tips in that link.
I am trying to use the pre_limit_mult(real) option to limit the placebo pool used for inference. The ultimate purpose is to test the validity of an estimate of the economic impact of the 1994 coup in Gambia.
However, even though I follow the practice the guidebook explains, the command does not work and presents the message below.
tsset country_id year
synth rgdpe pop rconna csh_i csh_x rgdpe(1971) rgdpe(1982) rgdpe(1993), trunit(21) trperiod(1994) keep("Gambia_outout") replace fig
local K = 2
synth_runner rgdpe pop rconna csh_i csh_x rgdpe(1971) rgdpe(1982) rgdpe(1993), trunit(21) trperiod(1994) keep("Gambia_outout") replace gen_vars pre_limit_mult(`K')
single_treatment_graphs, trlinediff(-1) raw_gname( rgdpe_raw) effects_gname(rgdpe_effects) effects_ylabels(-1500(750)1500) effects_ymax(2000) effects_ymin(-2000) do_color(bluishgray)
I received this message:
With -, gen_vars- the program needs to be able create the following variables: lead rgdpe_synth effect pre_rmspe post_rmspe.
Please make sure there are no such varaibles and that the dependent variable [rgdpe] has a short enough name that the generated vars are not too long (usually a max of 32 characters).
Please download the input data and do-file from the following link (I could not find a way to attach files directly on Stack Overflow).
(https://drive.google.com/drive/folders/1VyP2GN3NfQT6jQ9enbCvE2VTVNxZdPgc?usp=sharing)
Does anyone know how to fix it?

I need to programmatically identify all libraries and data files read by several hundred SAS program files. Can this be accomplished programmatically? [closed]

Closed. This question is opinion-based and is not currently accepting answers. Closed 7 months ago.
Scenario: We have a list of over 200 SAS files. We need to identify all SAS libraries and data sets used as inputs to these programs, and write out a table linking the SAS input data sets to the associated program files. We are not SAS programmers and are just now becoming familiar with the language. The intent is to redesign the architecture of the SAS programs' logic to be more modular.
We are conducting this analysis statically - i.e., we are not running SAS, we are attempting to extract this data purely from interrogating the code in program files themselves and we do not have access to the data files.
Solution attempted: we have parsed the SAS programs to identify inputs to SAS Procs and SAS Data steps, however there are several challenges. The approach we are using is as follows:
We have obtained a python-based parser (https://github.com/benjamincorcoran/sasdocs) that extracts key information from SAS files. We have applied it to all 200+ files and extracted parsed content into a text file. However, not all SAS syntax is supported; in particular, DataSet blocks are left as unparsed raw text, Procs with a variable number and names of arguments may be missed, and some commands, like various constructs of “set” and “merge” are missed completely by the grammar that has been implemented in the parser so far.
The parser correctly locates about 60% of the files, especially the libraries and files preceded by a "Set" statement. For reasons we do not understand, not all libraries/files preceded by a "SET" command are captured by this parser.
In addition to the "Set" command, we have observed that SAS can also reference a library/file within a Merge or Sort procedure, without a specific Set command.
We are ignoring SAS files from within the 'work' library that are created during processing; we are only concerned with external input files.
Note that we are not running these programs, we only have access to the SAS Program file sources - hence we do not have access to a SAS log.
Questions:
Is there a more direct way to accomplish this goal? Does SAS know what files it reads and writes, and is there a method of extracting a list of all libraries and files read by a given SAS program?
If there is no way to obtain this information programmatically, what are all the ways that SAS can access or reference an external library/file, other than via a SET or MERGE statement or a SORT procedure?
SAS has a procedure that does this, PROC SCAPROC. If you do have access to SAS, this is by far the best solution. You would technically need to run SAS, but even if there are errors, in theory it might work okay - the fact that the dataset doesn't exist should be okay, unless your code is data driven.
If you're unable to run the code or run anything in SAS, you'd need to do something with text analysis.
The key things to look for which would catch most of the possibilities would be (in sort of pseudo regex code):
data [lib.]dataset(could have parens but ignore them);
set( [lib.]dataset(ignore parens))* (could have multiple)
merge( [lib.]dataset(ignore parens))* (could have multiple)
update( [lib.]dataset(ignore parens))* (could have multiple)
modify( [lib.]dataset(ignore parens))* (could have multiple)
data=[lib.]dataset(ignore parens) - this is for most PROCs input, could have spaces around the equals sign
out=[lib.]dataset(ignore parens) - this is for most PROCs output, could have spaces around the equals sign
To get more than the "most" above, you'd want to analyze which PROCs were used. Each PROC can have its own output/input options, for example proc surveyselect could use various different datasets for different things, proc format uses CNTLIN and CNTLOUT, etc. You'd also have to see if there are hash tables or other objects used in the code as that has its own elements.
The other thing you could do, only caring about external files, is identify the libname statements. Once you find them, it's possible you could just look for libname.data in the program - that's how all of the datasets in the external folders (libraries) will be referred to. This won't work, though, if you are using metadata-assigned libraries, unless there are a small enough number of them that you could possibly list them all out (and you have access to SAS to find out the list).
Ultimately, your 100% solution is to hire a SAS consultant to look at the code; without being able to run the code (and thus use SCAPROC), there's not really a perfect solution.

Output to Table instead of Graph in Stata

In Stata, is there a way to redirect the data that a command produces into a table instead of a graph?
For example, if someone created a normal probability plot of data with the pnorm var_name command, is there a way to redirect the data so that instead of appearing in a graph, it appears in a table?
To add to @Noobie's answer:
Different commands work in different ways. There's no better short summary.
What you can look out for includes
generate() options that produce new variables. (There is no absolute rule that the options have this name, but that or a similar name is the most common single variety.)
Options that allow saving results to new datasets.
Saved results, especially those visible after return list or ereturn list. These can be quite elaborate, e.g. saving of matrices of counts after tabulate, as sketched below.
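To make that last point concrete, here is a small sketch (run as a do-file) using the auto data: tabulate's matcell() option is one way to capture the table of counts as a matrix, and return list shows the scalar results the command leaves behind.
sysuse auto, clear
tabulate rep78 foreign, matcell(counts)   // store the cell frequencies in a matrix named counts
return list                               // scalar saved results left behind by tabulate
matrix list counts                        // the counts are now available as a matrix, not just a display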
More broadly, Stata commands aren't functions! One characteristic of a function, as so named in many languages or programs, is that there is a result, with special cases where the result is void or null. There clearly are statistical programs which in broad terms hinge on calling functions that have results, and what you see displayed is often a side-effect of that. Stata commands don't work like that, in the sense that what a command leaves behind can take various forms. In the case of commands designed just to show something, the "result" may be a display. It's worth noting that Mata, which underlies and underpins Stata, is more recognisably a C-like language (with, e.g., many matrix extensions) that is based on functions (and much else).
Yes and no. It really depends on the command you are using. You should look at the help files first.
For instance, pnorm does not allow that. You can create the data yourself using the formula for pnorm described in the help file, where the cumulative distribution at some point is plotted against the so-called plotting position.
Other Stata commands allow you to generate the points directly. This is the case for kdensity for instance.
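For instance, here is a rough sketch of both suggestions (run as a do-file, using the auto data). Check help pnorm for the exact formula it uses; the i/(N+1) plotting position below is only a common convention, so treat this as an approximation rather than pnorm's internals:
sysuse auto, clear
sort price
quietly summarize price
generate double p_emp  = _n/(_N + 1)                        // empirical plotting position
generate double p_norm = normal((price - r(mean))/r(sd))    // standardized normal cumulative probability
list price p_emp p_norm in 1/10                             // the plotted points, now as a table
And kdensity can hand over its points directly via its generate() and nograph options:
kdensity price, generate(gx gd) nograph
list gx gd in 1/10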

How to carriage return a long local list and how to define list only once

My first question is simple, but I cannot find an answer anywhere and it's driving me crazy:
When defining a local list in Stata how do I do a carriage return if the list is really long?
The usual /// doesn't work inside double quotation marks.
For example, this doesn't work:
local reglist "lcostcrp lacres lrain ltmax ///
ltmin lrainsq lpkgmaiz lwage2 hyb gend leducavg ///
lageavg ldextn lfertskm ldtmroad"
It does work when I remove the quotation marks, but I have been warned that I should include them.
My second question is a more serious problem:
Having defined the local reglist, how can I get Stata to remember it for multiple subsequent uses (that is, not just one)?
For example:
local reglist lcostcrp lacres lrain ltmax ///
ltmin lrainsq ///
lpkgmaiz lwage2 ///
hyb gend leducavg lageavg ldextn lfertskm ldtmroad
reg lrevcrp `reglist' if lrevcrp~=.,r
mat brev=e(b)
mat lis brev
/*Here I have to define the local list again. How do I get Stata to remember
it from the first time ??? */
local reglist lcostcrp lacres lrain ltmax ///
ltmin lrainsq ///
lpkgmaiz lwage2 ///
hyb gend leducavg lageavg ldextn lfertskm ldtmroad
quietly tabstat `reglist' if lrevcrp~=., save
mat Xrev=r(StatTotal),1
mat lis Xrev
Here, I define the local reglist, then run a regression using this list and do some other stuff.
Then, when I want to get the means of all the variables in the local reglist, Stata doesn't remember it anymore and I have to define it again. This defeats the whole purpose of defining a list.
I would appreciate it if someone could show me how to define a list just once and be able to call it as many times as one likes.
The best answer to your first question is that if you are typing a long local definition interactively, then (1) you don't need to type a carriage return at all: just keep typing and Stata will wrap the line; and/or (2) there is a better way to approach the definition. I wouldn't usually type long local definitions interactively, because that is too tedious and error-prone.
The quotation marks are not essential for examples like yours, only essential for indicating strings with opening or closing spaces.
Your second question is mysterious. Stata won't forget definitions of local macros in the same program (wide sense) unless you explicitly blank out that macro, i.e. redefine it to an empty string. Here program (wide sense) means program (narrow sense), do-file, do-file editor contents, or main interactive session. You haven't explained why you think this happens. I suspect that you are doing something else, such as writing some of your code in the do-file editor and running that in combination with writing commands interactively via the command window. That runs into the difficulty alluded to: local macros are local to the program they are defined in, so (in the same example) macros defined in the do-file editor are local to that environment but invisible to the main interactive session, and vice versa.
I suggest that you try to provide an example of Stata forgetting a local macro definition that we can test for ourselves, but I am confident that you won't be able to do it.
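For what it's worth, here is a minimal self-contained illustration (a sketch using the auto data, so the variable names differ from yours) in which a local defined once is reused several times within the same do-file run:
sysuse auto, clear
local rhs mpg weight length ///
    turn displacement               // one definition, continued across lines
regress price `rhs'                 // first use
matrix b = e(b)
matrix list b
quietly tabstat `rhs', save         // second use: the local is still defined
matrix X = r(StatTotal)
matrix list X
Run as a do-file (so that /// is honoured), this works without redefining rhs, consistent with the point that locals persist within the program or do-file that defines them.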

Underlying mechanism in firing SQL Queries in Oracle

When we fire a SQL query in Oracle, such as
SELECT * FROM SOME_TABLE_NAME
what exactly happens internally? Is there a parser at work? Is it written in C/C++?
Can anybody please explain?
Thanks in advance to all.
Short answer is yes, of course there is a parser module inside Oracle that interprets the statement text. My understanding is that the bulk of Oracle's source code is in C.
For general reference:
Any SQL statement potentially goes through three steps when Oracle is asked to execute it. Often, control is returned to the client between each of these steps, although the details can depend on the specific client being used and the manner in which calls are made.
(1) Parse -- I believe the first action is actually to check whether Oracle has a cached copy of the exact statement text. If so, it can save the work of parsing your statement again. If not, it must of course parse the text, then determine an execution plan that Oracle thinks is optimal for the statement. So conceptually at least there are two entities at work in this phase -- the parser and the optimizer.
(2) Execute -- For a SELECT statement this step would generally run just enough of the execution plan to be ready to return some rows to the client. Depending on the details of the plan, that might mean running the whole thing, or might mean doing just a very small fraction of the work. For any other kind of statement, the execute phase is when all of the work is actually done.
(3) Fetch -- This is when rows are actually returned to the client. Generally the client has a predetermined fetch array size which sets the maximum number of rows that will be returned by a single fetch call. So there may be many fetches made for a single statement. Of course if the statement is one that cannot return rows, then there is no fetch step necessary.
Manasi,
I think Oracle has its own internal parser, which parses the query and tries to compile it. I don't think it is related to C or C++,
but I would need to confirm.
-Justin Samuel.