Hardcoded and optional [if] qualifiers in a Stata program using syntax

I'm learning how to use syntax when writing simple Stata programs, and I'm wondering whether it's possible to hardcode an if condition while still passing optional [if] qualifiers through syntax.
I know that a simple function can be written like:
sysuse auto
program meanprice
    syntax [if]
    mean price `if'
end
and then I can for example use some optional if statements like:
meanprice if price > 6000 & rep78 > 2
However, let's say I want to hardcode the price > 6000 condition and still be able to supply optional if conditions selectively. The reason is that the part I want to hardcode is very rigorous, and I always want to pass it through some nested programs that I'm writing without having to specify it each time.
I have tried using e.g.,
program meanprice_test
    syntax [if]
    mean price if price > 6000 `if'
end
but this clearly does not work (to my understanding, because syntax is parsing text/strings?).
Is there any simple way to achieve the desired outcome using syntax and [if]? I can think of some very tedious workarounds that I'd rather avoid.

What you are defining in Stata terms is a command, not a function.
"clearly does not work" should always be explained by giving the error message, or other explicit result that indicates a problem.
That aside, consider this:
program meanprice_test
    syntax [if/]
    // the trailing slash makes syntax strip off the word "if",
    // leaving only the condition in the local macro `if'
    if "`if'" != "" local if "& (`if')"
    mean price if price > 6000 `if'
end
. sysuse auto
(1978 Automobile Data)

. meanprice_test if foreign

Mean estimation                     Number of obs   =          9

--------------------------------------------------------------
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
       price |   8783.667   827.6595      6875.08    10692.25
--------------------------------------------------------------

. meanprice_test

Mean estimation                     Number of obs   =         23

--------------------------------------------------------------
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
       price |   9655.696    635.944     8336.829    10974.56
--------------------------------------------------------------
The problem with your code is not that syntax is parsing text [that's its job, always] but that the combination of two ifs requires more care. What you had would expand to stuff like mean price if price > 6000 if foreign, that is ... if ... if ..., which is illegal.
So, if a user supplies an if qualifier (optional for the user, but syntactically a qualifier, not an option), you need to get syntax to strip off the user-supplied if; how to do this is documented in help syntax.
Then you need to use & to combine the two if conditions. Parenthesising the user's condition helps guard against precedence surprises.
EDIT: If quoted strings are ever to be used in the user's if, then use compound double quotes in the program:
if `"`if'"' != "" local if `"& (`if')"'
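For example (a sketch, assuming meanprice_test has been redefined with the compound-quote line above), the auto data's string variable make allows a condition that itself contains double quotes, which would otherwise break the plain-quoted comparison:

* hypothetical call: the condition contains literal double quotes, so the
* program must test `if' inside compound double quotes, as in the EDIT above
meanprice_test if substr(make, 1, 5) == "Buick"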
GENERAL COMMENTS: While what you want is programmable, I think it's unnecessary and questionable practice:
For an audit trail of analysis, a do-file with a keep if statement near the beginning and a corresponding log file should suffice as a reproducible record of work on a subset of data.
For an audit trail, conversely, highly specialised programs with data-specific constraints built into the code are easy to misunderstand or overlook, especially for others using your work, or even for yourself at a later date.
Following this strategy also carries the burden of writing lots of very specific programs, which is a poor use of time and energy and of little use to others.

Related

How to allow for missing values in a summation in non-linear estimation

I am trying to do a non-linear estimation in Stata where some observations do not need all of the variables. The following is a made-up example:
nl (v1 = ({alpha=1})^({beta=1}*v2) + ({alpha})^({beta}*v3))
Sometimes there is a value of v3 and sometimes there isn't. When it is unneeded, it is coded as missing (although it's not missing in the sense that data are lacking; the data are complete). When v3 is missing, I want Stata to treat the above expression as if the v3 term weren't there, so in these cases I would want it to evaluate the expression as:
v1 = ({alpha=1})^({beta=1}*v2)
When I run this, Stata says:
starting values invalid or some RHS variables have missing values
I know the starting values are fine.
As you can see, simply recoding the missing values to zero will not work, because it doesn't zero out the term: with v3 recoded to 0, the second term becomes ({alpha})^({beta}*0) = ({alpha})^0 = 1, not 0.
Is there something I can do with a sigma summation notation where it only adds the terms for which there are non-missing values?
Thanks!
Something like this should work:
cls
sysuse auto, clear
// indicator equal to 1 when rep78 is observed and 0 when it is missing,
// so that multiplying by it removes the second term whenever rep78 is missing
gen nm_rep78 = cond(missing(rep78), 0, 1)
// copy of rep78 with missings recoded to 0, so nl never sees a missing RHS value
recode rep78 (. = 0), gen(z_rep78)
tab nm_rep78 z_rep78
nl (price = ({alpha=1})^({beta=1}*mpg) + nm_rep78*({alpha})^({beta}*z_rep78))
The idea is that you use an indicator variable to zero out the second term.
There might be a way to get nl to use factor variable notation to simplify this, but I've been testing a new cocktail recipe all afternoon and should not attempt this.

RegEx - Order of OR'd values in capture group changes results

Visual Studio / XPath / RegEx:
Given Expression:
(?<TheObject>(Car|Car Blue)) +(?<OldState>.+) +---> +(?<NewState>.+)
Given Searched String:
Car Blue Flying ---> Crashed
I expected:
TheObject = "Car Blue"
OldState = "Flying"
NewState = "Crashed"
What I get:
TheObject = "Car"
OldState = "Blue Flying"
NewState = "Crashed"
Given new RegEx:
(?<TheObject>(Car Blue|Car)) +(?<OldState>.+) +---> +(?<NewState>.+)
Result is (what I want):
TheObject = "Car Blue"
OldState = "Flying"
NewState = "Crashed"
I conceptually get what's happening under the hood; the RegEx is putting the first (left-to-right) match it finds in the OR'd list into the <TheObject> group and then goes on.
The OR'd list is built at run time, and I cannot guarantee the order in which "Car" or "Car Blue" is added to the OR'd list in the <TheObject> group. (This is a dramatically simplified OR'd list.)
I could brute force it, by sorting the OR'd list from longest to shortest, but, I was looking for something a little more elegant.
Is there a way to make <TheObject> group capture the largest it can find in the OR'd list instead of the first it finds? (Without me having to worry about the order)
Thank you,
I would normally automatically agree with an answer like ltux's, but not in this case.
You say the alternation group is generated dynamically. How frequently is it generated dynamically? If it's every user request, it's probably faster to do a quick sort (either by longest length first, or reverse-alphabetically) on the object the expression is built from than to write something that turns (Car|Car Red|Car Blue) into (Car( Red| Blue)?).
The regex may take a bit longer (you probably won't even notice a difference in the speed of the regex) but the assembly operation may be much faster (depending on the architecture of the source of your data for the alternation list).
In a simple test of an alternation with 702 options, tried three ways, the results are comparable using an option set like the one below; but none of these results take into account the time needed to build the pattern string, which grows as the complexity of the string grows.
The options are all the same (a reverse-alphabetical list running zap, yes, xerox, ..., apple), just repeated once in each of the formats tested below.
Using Google Chrome and JavaScript, I tried three (edit: four) different formats and saw consistent results for all, between 0 and 2 ms:

'Optimized factoring': a(?:4|3|2|1)?
Reverse-alphabetical sorting: (?:a4|a3|a2|a1|a)
Factoring: a(?:4)?|a(?:3)?|a(?:2)?|a(?:1)?

All are consistently coming in at 0 to 2 ms (the difference being what else my machine might be doing at the moment, I suppose).
Update: I found a way that you may be able to do this without sorting, using a lookahead like this: (?=a|a1|a2|a3|a4|a5)(.{15}|.{14}|.{13}|...|.{2}|.) where 15 is the upper bound, counting all the way down to the lower bound.
Without some restraints on this method, I feel like it can lead to a lot of problems and false positives. It would be my least preferred result. If the lookahead matches, the capture group (.{15}|...) will capture more than you'll desire on any occasion where it can. In other words, it will reach ahead past the match.
Though I made up the term Optimized Factoring in comparison to my Factoring example, I can't recommend my Factoring example syntax for any reason. Sorted would be the most logical, coupled with easier to read/maintain than exploiting a lookahead.
You haven't given much insight into your data but you may still need to sort the sub groups or factor further if the sub-options can contain spaces and may overlap, further diminishing the value of "Optimized Factoring".
Edit: To be clear, I am providing a thorough examination as to why no form of factoring is a gain here. At least not in any way that I can see. A simple Array.Sort().Reverse().Join("|") gives exactly what anyone in this situation would need.
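As a minimal sketch (plain JavaScript, matching the tests above; the options array is a hypothetical stand-in for the dynamically generated list, and escaping of regex metacharacters is omitted):

// Sort longest-first so that "Car Blue" is tried before "Car".
const options = ["Car", "Car Blue"];      // hypothetical dynamic list
const alternation = options
    .slice()                              // leave the original list intact
    .sort((a, b) => b.length - a.length)  // longest first
    .join("|");
const re = new RegExp(
    `(?<TheObject>${alternation}) +(?<OldState>.+) +---> +(?<NewState>.+)`);
console.log("Car Blue Flying ---> Crashed".match(re).groups);
// { TheObject: "Car Blue", OldState: "Flying", NewState: "Crashed" }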
In a typical backtracking regex engine, the | operator tries the alternatives from left to right and stops at the first alternative that matches. We can't change the behaviour of the | operator.
So the solution is to avoid relying on the order of alternatives. Instead of (Car Blue|Car) or (Car|Car Blue), use (Car( Blue)?).
(?<TheObject>(Car( Blue)?)) +(?<OldState>.+) +---> +(?<NewState>.+)
Then the <TheObject> group will always be Car Blue in the presence of Blue.

How do I loop over part of a variable name?

I need to use a local macro to loop over part of a variable name in Stata.
Here is what I tried to do:
local phth mep mibp mbp
tab lod_`phth'_BL
Stata will not recognize the entire variable name.
variable lod_mep not found
r(111);
If I remove the underscore after the `phth' it still does not recognize anything after the macro name.
I want to avoid using a complicated foreach loop.
Is there any way this can be done just using the simple macro?
Thanks!
Your request is a bit confusing. First, this is precisely the purpose of a loop, and second, loops in Stata are (at the "introductory level") quite simple. The following example is a bit nonsensical (and given the structure, there are easier ways of going about this), but should convey the basic idea.
// set up a similar variable name structure
sysuse auto , clear
rename (price mpg weight length) ///
    (pref_base1_suff pref_base2_suff pref_base3_suff pref_base4_suff)

// define a local macro to hold the elements to loop over
local varbases = "base1 base2 base3 base4"

// refer to the items of the local macro in a loop
foreach b of local varbases {
    summ pref_`b'_suff
}
See help foreach for the syntax of foreach. In particular, note that the structure employed above may not even be required due to Stata's varlist structure (see help varlist). For example, continuing with the code above:
foreach v of varlist pref_base?_suff {
    summ `v'
}
The wildcard ? takes the place of one character. * could be used for more flexibility. However, if your variables are not as easily identifiable using the pattern matching allowed by varlist, a loop as in the first example is simple enough -- four very short lines of code.
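Applied to the variables in the question, a minimal sketch (assuming the lod_*_BL variables exist in memory):

local phth mep mibp mbp
foreach p of local phth {
    tab lod_`p'_BL
}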
Postscript
Upon further reflection (sometimes the structure of the question anchors a certain method when an alternative approach is more straightforward): searching the help files for information on the tabulate command (help tabulate) will direct you to the following syntax: tab1 varlist [if] [in] [weight] [, tab1_options]
Given the discussion above about the use of varlists, you can simply code
tab1 lod_m*_BL
assuming, of course, that there are no other variables matching the pattern for which you do not want to report a frequency table. Alternatively,
tab1 lod_mep_BL lod_mibp_BL lod_mbp_BL
is not much longer and does the trick, albeit without the use of any sort of wildcard or macro substitution.

Formatting and displaying locals in Stata

I came across a little puzzle with Stata's locals, display, and quotes.
Consider this example:
generate var1 = 54321 in 1
local test: di %10.0gc var1[1]
Why is the call:
di "`test'"
returning
54,321
Whereas the call:
di `test'
shows
54 321
What is causing such behaviour?
Complete the sequence with
(1)
. di 54,321
54 321
(2)
. di "54,321"
54,321
display interprets (1) as an instruction to display two arguments, one by one. You get the same result with your last line because (first) the local macro test was evaluated and (second) display saw the result of that evaluation.
The difference when quotation marks are supplied is that thereby you insist that the argument is a literal string. You get the same result with your first display command for the same reasons as just given.
In short, the use of local macros here is quite incidental to the differences in results. display never sees the local macro as such; it just sees its contents after evaluation. So, what you are seeing pivots entirely on nuances in what is presented to display.
Note further that while you can use a display format in defining the contents of a local macro, that ends that story. A local does not have an attached format that sticks with it. It's just a string (which naturally may mean a string with numeric characters).
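A quick sketch of that point (standard Stata only; the commented-out line is hypothetical):

local test: di %10.0gc 54321
di `"`test'"'      // shows 54,321 -- the comma survives only as literal text
* di 2 * `test'    // would expand to di 2 * 54,321, which display reads as
*                  // two arguments (108 and 321), not one formatted number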

Application using machine learning to auto-correct custom sentences: how to begin?

Before asking my question, here is the situation:
I have some very basic knowledge about Artificial Intelligence: I know about inference engines, coding in LISP or Prolog, and a bit about neural networks, but not much. That's what I studied.
I have a project, an application which has to correct some custom sentences.
Those sentences are normal strings, which can contain a lot of different characters. Fortunately, thanks to Flex (a lexer), I have defined tokens, which makes analysis easier. An example of a string:
AZERTY AWESOME 333.222 AZERTY MAGIC P
Which gives in tokens (example):
VERB NOUN NUMBER VERB ADJECTIVE SPEC
I also use Bison to allow some combos, and reject the others:
VERB NOUN NUMBER VERB ADJECTIVE SPEC is ok
VERB VERB NOUN NUMBER ADJECTIVE SPEC is not ok
etc...
Those sentences can contain errors by the time they reach my application. These errors can have different origins; here are some examples:
AZERTY AWESOM E 333.222 AZERTY MAGIC POINT
The additional space in the word AWESOME makes the parser recognize a VERB and a SPEC instead of a NOUN (as above). So the correction would be to remove the additional space.
Other errors can be a missing space (making two words stick together), unknown tokens, unknown combos (for Bison), no spaces at all, etc.
So I began to create my application in C++, with a deterministic approach first: I created a kind of dictionary which contains every error pattern I had found previously, together with corrections for them. It works quite well; I can correct a lot of errors because I found very generic patterns. But I would like to improve on this by adding a machine learning feature to correct the remaining ones.
Let's say these fixed corrections handle 70% of the errors; I would like to raise that percentage with machine learning. It would learn from pairs of wrong and corrected sentences, and would then be able to correct by itself the sentences I wasn't able to correct (the remaining 30%).
Here is my question. I am a newbie in machine learning, even though I have already studied AI a bit, and I don't know where to begin.
My first question is: I know about neural networks, but they are used to guess, right? For example, I would give one a sentence and it would be able to tell me whether it's correct or not. But that is not what I want: I want the app to correct the sentence, not just tell me whether it's correct. The thing is, I don't really see how the application could "remove/modify" anything by itself.
In which direction would you suggest I go? Which machine learning principles/tools/technologies would you suggest for this kind of application?
I hope I have explained my problem well, and that you will be able to help me.
Well, since no other answers were posted and Benjy Kessler didn't summarize his link, I decided to post my own answer, in case it helps someone.
I decided to use the n-gram strategy, which is, I think, a good way to solve my problem. This is still just theory; I have begun coding it and it may not work, but I like the idea and think it is worth a try.
An n-gram is a UNIGRAM (one word), a BIGRAM (a pair of words), a TRIGRAM (three words), and so on.
The main idea is to give my application a lot of training data in the form of well-formed strings. For each of them, I'm going to count the n-grams (from unigrams to trigrams).
I will then get something like this (after feeding the app thousands of examples):
unigram table (figures are fake and random)

word | occurrence | prob
--------------------------
I    |        500 |  0.2
want |        645 |  0.2
a    |       2434 |  0.5
cat  |         20 |  0.1
bigram table

first word | second word | occurrence | prob
---------------------------------------------
I          | want        |        600 |  0.5
want       | a           |        500 |  0.4
a          | cat         |        100 |  0.1

Same for the trigram table.
Once I have this data, analysis becomes easier. Let's take an example:
I WANTA CAT
When analysing, the app will first see that WANTA doesn't exist in the unigram table, so it will try splitting the word until it gets a "sentence" with a good probability of occurring.
Splitting it into "WANT" and "A", the app will see that "WANT" has a good probability of occurring, and likewise "A"; then it will check bigram compatibility, which is also good, and even the trigrams with "I" and "CAT" for extra precision.
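A minimal sketch of that split-and-score step in C++ (the language of my application); the hard-coded counts are toy stand-ins for the real tables built from training data, and the score is a deliberately simplified product of unigram and bigram probabilities:

#include <iostream>
#include <map>
#include <string>
#include <utility>

// Toy probability tables standing in for the real ones built from training data.
std::map<std::string, double> uniProb = {
    {"I", 0.2}, {"WANT", 0.2}, {"A", 0.5}, {"CAT", 0.1}};
std::map<std::pair<std::string, std::string>, double> biProb = {
    {{"I", "WANT"}, 0.5}, {{"WANT", "A"}, 0.4}, {{"A", "CAT"}, 0.1}};

double uni(const std::string& w) {
    auto it = uniProb.find(w);
    return it == uniProb.end() ? 0.0 : it->second;
}
double bi(const std::string& a, const std::string& b) {
    auto it = biProb.find({a, b});
    return it == biProb.end() ? 0.0 : it->second;
}

// Score one candidate split of an unknown token seen between prev and next:
// unigram probabilities of the two halves times the bigrams linking them.
double score(const std::string& prev, const std::string& left,
             const std::string& right, const std::string& next) {
    return uni(left) * uni(right) *
           bi(prev, left) * bi(left, right) * bi(right, next);
}

int main() {
    // The unknown token WANTA, seen between "I" and "CAT".
    const std::string bad = "WANTA", prev = "I", next = "CAT";
    std::string best;
    double bestScore = 0.0;
    // Try every split point and keep the most probable split.
    for (std::size_t i = 1; i < bad.size(); ++i) {
        const std::string left = bad.substr(0, i), right = bad.substr(i);
        const double s = score(prev, left, right, next);
        if (s > bestScore) { bestScore = s; best = left + " " + right; }
    }
    std::cout << "best split: " << (best.empty() ? "(none)" : best) << "\n";
    // prints: best split: WANT A
    return 0;
}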
I think it's a good solution, one that doesn't require a huge amount of work and that will, I hope, be reasonably efficient.
I hope I have made myself clear, and that this helps anyone who was wondering about the same questions.