How to get y axis range in Stata - stata

Suppose I am using some twoway graph command in Stata. Without any action on my part, Stata will choose reasonable ranges for both the y and x axes, based on the minimum and maximum y and x values in my data and on some algorithm that decides when it would be prettier for the range to extend to a number like '0' instead of '0.0139'. Wonderful! Great.
Now suppose that after (or while) I draw my graph, I want to slap some very important text onto it, and I want to be choosy about precisely where the text appears. Having the minimum and maximum values of the displayed axes would be useful: how can I get these min and max numbers? (Either before or while calling the graph command.)
NB: I am not asking how to set the y or x axis ranges.

This issue has been a bit of a headache for me for quite some time, and I believe there is no good solution out there yet, so I wanted to write up two ways in which I was able to solve a problem similar to the one described in the post. Specifically, I used them to shade part of a graph in gray.
Define a global macro in the code generating the axis labels
This is the less elegant way to do it, but it works well. Locate the tickset_g.class file in your ado path. The graph twoway command uses this to draw the axes of any graph. There, I defined a global macro in the draw program that takes the value of the omin and omax locals after they have been set to the minimum of the axis range and the data range (the command that does this is local omin = min(.scale.min,omin), and analogously for the max), since the latter sometimes exceeds the former. You could also define the global further up in that code block to get only the axis extent. You can then access the axis range using the globals after the graph command (and use something like addplot to add to the previously drawn graph). Two caveats for this approach: using global macros is, as far as I understand, bad practice and can be dangerous, so I used names with the prefix userwritten, which I was sure no other program would use. Also, depending on your organization's policies, you may not have the administrator privileges needed to alter this file. However, it is the simpler way. If you prefer a more elegant approach along the lines of what Nick Cox suggested, then you can:
Use the undocumented gdi natscale command to define your own axis labels
The gdi commands are the internal commands that are used to generate what you see as graph output (cf. https://www.stata.com/meeting/dcconf09/dc09_radyakin.pdf). The tickset_g.class uses the gdi natscale command to generate the nice numbers of the axes. Basic documentation is available with help _natscale. Basically, you enter the minimum and maximum (e.g. from a summarize return) and a suggested number of steps, and the command returns a min, max, and delta to be used in the x|ylabel option (there are several possible ways, all rather straightforward once you have those numbers, so I won't spell them out for brevity). You'd have to adjust this approach if you use some scale transformation.
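To illustrate the kind of computation such a command performs (min, max and a number of steps in; a "nice" min, max and delta out), here is a minimal C++ sketch of the classic "nice numbers" rounding scheme. This is an assumption for illustration only: it is not Stata's actual algorithm, and the names niceNumber, niceScale and AxisScale are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical result type: a "nice" axis minimum, maximum and tick spacing.
struct AxisScale {
    double min;
    double max;
    double delta;
};

// Round x (> 0) to 1, 2, 5 or 10 times a power of ten.
// With round == true the nearest nice value is taken, otherwise the next one up.
double niceNumber(double x, bool round) {
    const double exponent = std::floor(std::log10(x));
    const double fraction = x / std::pow(10.0, exponent);  // roughly in [1, 10)
    double nice;
    if (round) {
        nice = fraction < 1.5 ? 1.0 : fraction < 3.0 ? 2.0 : fraction < 7.0 ? 5.0 : 10.0;
    } else {
        nice = fraction <= 1.0 ? 1.0 : fraction <= 2.0 ? 2.0 : fraction <= 5.0 ? 5.0 : 10.0;
    }
    return nice * std::pow(10.0, exponent);
}

// Suggest an axis range that covers [dataMin, dataMax] in about `steps` intervals.
AxisScale niceScale(double dataMin, double dataMax, int steps) {
    assert(dataMax > dataMin && steps >= 1);
    const double range = niceNumber(dataMax - dataMin, false);
    const double delta = niceNumber(range / steps, true);
    return {std::floor(dataMin / delta) * delta,
            std::ceil(dataMax / delta) * delta,
            delta};
}
```

For data running from 0.0139 to 0.97 with 5 suggested steps, this sketch proposes an axis from 0 to 1 in steps of 0.2. Stata may well choose differently, so check against what help _natscale reports.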
Hope this helps!

I like Nick's suggestion, but if you're really determined, it seems that you can find these values by inspecting the output after you set trace on. Here's some inefficient code that seems to do exactly what you want. Three notes:
When I import the log file, I get this message:
Note: Unmatched quote while processing row XXXX; this can be due to a formatting problem in the file or because a quoted data element spans multiple lines. You should carefully inspect your data after importing. Consider using option bindquote(strict) if quoted data spans multiple lines or option bindquote(nobind) if quotes are not used for binding data.
Sometimes the data fall outside of the min and max range values that are chosen for the graph's axis labels (but you can easily test for this).
The log linesize is actually important to my code below because the key values must fall on the same line as the strings that I use to identify the helpful rows.
* start a log (critical step for my solution)
cap log close _all
set linesize 255
log using "log", replace text
* make up some data:
clear
set obs 3
gen xvar = rnormal(0,10)
gen yvar = rnormal(0,.01)
* turn trace on, run the -twoway- call, and then turn trace off
set trace on
twoway scatter yvar xvar
set trace off
cap log close _all
* now read the log file in and find the desired info
import delimited "log.log", clear
egen my_string = concat(v*)
keep if regexm(my_string,"forvalues yf") | regexm(my_string,"forvalues xf")
drop if regexm(my_string,"delta")
split my_string, parse("=") gen(new)
gen axis = "vertical" if regexm(my_string,"yf")
replace axis = "horizontal" if regexm(my_string,"xf")
keep axis new*
duplicates drop
loc my_regex = "(.*[0-9]+)\((.*[0-9]+)\)(.*[0-9]+)"
gen min = regexs(1) if regexm(new3,"`my_regex'")
gen delta = regexs(2) if regexm(new3,"`my_regex'")
gen max_temp= regexs(3) if regexm(new3,"`my_regex'")
destring min max_temp delta, replace
gen max = min + delta* int((max_temp-min)/delta)
*here is the info you want:
list axis min delta max

Related

SPSS- how to make the histogram template refer to the y axis as percentage

I have an odd issue regarding the use of chart templates in SPSS (version 20), and any help will be appreciated.
I used the GUI to manually define a chart template for histograms. These are simple definitions:
1) set the x axis between 0 to 100.
2) set the y axis as percent and not as actual number of examples within each bin.
3) set the bin sizes to 5.
4) set the maximal value of the y axis to 20.
I saved the template using the File->Save ChartTemplate option after changing the definitions of one histogram.
Oddly, when I apply the template to a new histogram, only definitions 1, 3 and 4 are applied while 2 is omitted. I searched for a solution and did not find any. This is extremely frustrating since I need to waste time and effort manually resetting the axis to the right definition on every new histogram I make (which is a lot :/ ).
There might be a way to hack the template code using Notepad, but I did not see any mention of the y axis there.
Any help or comment would be much appreciated.
I can't say offhand how to set up a template to do any of those aspects, but here is an example using syntax to specify those four options.
SET SEED 10.
INPUT PROGRAM.
LOOP #i = 1 TO 500.
COMPUTE Var = RV.UNIFORM(0,90).
END CASE.
END LOOP.
END FILE.
END INPUT PROGRAM.
DATASET NAME Sim.
FORMATS Var (F3.0).
EXECUTE.
GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=Var MISSING=LISTWISE REPORTMISSING=NO
/GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
SOURCE: s=userSource(id("graphdataset"))
DATA: Var=col(source(s), name("Var"))
GUIDE: axis(dim(1), label("Var"), delta(5))
GUIDE: axis(dim(2), label("Percent in Bin"))
SCALE: linear(dim(1), min(0), max(100))
SCALE: linear(dim(2), max(20))
ELEMENT: interval(position(summary.percent.count(bin.rect(Var, binWidth(5)), base.all(acrossPanels()))))
END GPL.
And this is what the graph looks like for me (with my default chart template) in V25.

Fast R sliding window function using a RANGE rather than a PHYSICAL partition

I am trying to solve a problem: run a statistic (count; sum; mean) over an irregular time series data set, where the window-size for each line is within a given date range (preferably over a grouping column).
I have found that ORACLE SQL supports this through:
COUNT(*) OVER (
ORDER BY payment_date
RANGE BETWEEN INTERVAL '1' HOUR PRECEDING AND CURRENT ROW
)
In R, I have built functions that use lists to collect vectors of values for each row, but this is expensive and slow. The best solution I have found is by user mgahan, in his package boRingTrees:
R: fast sliding window with given coordinates
library("devtools")
install_github("mgahan/boRingTrees")
library("boRingTrees")
set.seed(1)
Trans_Dates <- as.Date(c(31,33,65,96,150,187,210,212,240,273,293,320,
32,34,66,97,151,188,211,213,241,274,294,321,
33,35,67,98,152,189,212,214,242,275,295,322),origin="2010-01-01")
Cust_ID <- c(rep(1,12),rep(2,12),rep(3,12))
Target <- rpois(36,3)
require("data.table")
data <- data.table(Trans_Dates,Cust_ID,Target)
data[,Roll:=rollingByCalcs(data=data,bylist="Cust_ID",dates="Trans_Dates",
target="Target",lower=0,upper=31,incbounds=T,stat=sum,na.rm=T,cores=1)]
However, when I run this against larger data sets, it also runs quite slowly.
What I have tried:
To use lists in loops to return window partitions, but this is very slow.
Importing users' functions, such as boRingTrees, which encapsulate the problem well but are also slow.
What I have learnt:
There is good support in R for physical partitions (up one row, group into days/weeks, etc) through zoo and rollapply, but limited support for Ranged partitions (all lines within this number of hours from a timestamp).
What I think I need:
I have come to the conclusion that I need a C function to more speedily run a sliding window over a range of dates. I have started playing with C++ in R, and these two Rcpp efforts come close (in technique) to what I think I need:
R: Rolling window function with adjustable window and step-size for irregularly spaced observations
R: fast sliding window with given coordinates
I hope this summary is a useful collation of information for people trying to solve similar problems (I found searching on this topic difficult: information is sparse, and similar things are described in very different ways). Hopefully someone can assist me in building a faster C++ solution I can run in R (inline or .cpp). Here is a sample data set (again, courtesy of mgahan):
Trans_Dates <- as.Date(c(31,33,65,96,150,187,210,212,240,273,293,320,
32,34,66,97,151,188,211,213,241,274,294,321,
33,35,67,98,152,189,212,214,242,275,295,322),origin="2010-01-01")
Cust_ID <- c(rep(1,12),rep(2,12),rep(3,12))
Val <- rpois(36,3)
require("data.table")
data <- data.table(Trans_Dates,Cust_ID,Val)
e.g:
data[,RowRollCount31:=rollingByCalcs(data=data,bylist="Cust_ID",dates="Trans_Dates", target="Val",lower=0,upper=31,incbounds=T,stat=length,na.rm=T)]
Ideally, the solution would use the 'interval' option as in the Oracle example (i.e. windows within 'x' 'hours' of each row), and also the 'group by'/'by_list' and 'stat' options that mgahan cleverly catered for.
Further reading / a good explanation of the problem:
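As a starting point, the range-based window can be sketched in plain C++ with two moving pointers and prefix sums, which makes it linear in the number of rows once the data are sorted. This is only a sketch under assumptions: the Row struct and the rangeWindowSum name are hypothetical, timestamps are plain integers (days or seconds), the input is already sorted by group and then time, and only the sum statistic is shown. It follows the Oracle RANGE semantics, so rows sharing the current row's timestamp are included.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical row layout: grouping key, integer timestamp, value to aggregate.
struct Row {
    int group;
    long long time;
    double value;
};

// For each row, sum `value` over rows of the same group whose time lies in
// [row.time - width, row.time]. Rows must be sorted by (group, time).
std::vector<double> rangeWindowSum(const std::vector<Row>& rows, long long width) {
    assert(width >= 0);
    const std::size_t n = rows.size();

    // prefix[k] holds the sum of the first k values, so any slice is one subtraction.
    std::vector<double> prefix(n + 1, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        prefix[i + 1] = prefix[i] + rows[i].value;
    }

    std::vector<double> out(n);
    std::size_t lo = 0;  // first row inside the window
    std::size_t hi = 0;  // one past the last row inside the window
    for (std::size_t i = 0; i < n; ++i) {
        if (i == 0 || rows[i].group != rows[i - 1].group) {
            lo = i;  // a new group starts: reset both pointers
            hi = i;
        }
        while (rows[lo].time < rows[i].time - width) {
            ++lo;  // drop rows that are too old
        }
        while (hi < n && rows[hi].group == rows[i].group &&
               rows[hi].time <= rows[i].time) {
            ++hi;  // take in rows up to and including the current timestamp
        }
        out[i] = prefix[hi] - prefix[lo];
    }
    return out;
}
```

The row count for each window is hi - lo, so count and mean follow from the same two pointers. Exposing this to R (for example through Rcpp) is left out here.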
https://blog.jooq.org/2016/10/31/a-little-known-sql-feature-use-logical-windowing-to-aggregate-sliding-ranges/
Many thanks in advance!

C++, determine the part that have the highest zero crosses

I'm not a specialist in signal processing. I'm doing simple processing on a 1D signal using C++. I want to know how I can determine the part that has the highest zero-crossing rate (highest frequency!). Is there a simple way or method to find the beginning and the end of this part?
This image illustrates the form of my signal, and this image is what I need to do (two indexes of beginning and end)
Edited:
Actually, I have no prior idea about the positions of the beginning and the end; they are very variable.
I can calculate the number of zero crossings, but I have no idea how to define its range
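#include <cstddef>
#include <vector>

// Count sign changes between consecutive samples (zero counts as non-negative).
int calculateZC(const std::vector<double>& signals) {
    int zcCounter = 0;
    for (std::size_t i = 0; i + 1 < signals.size(); ++i) {
        const bool currentNonNegative = signals[i] >= 0;
        const bool nextNonNegative = signals[i + 1] >= 0;
        if (currentNonNegative != nextNonNegative) {
            ++zcCounter;
        }
    }
    return zcCounter;
}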
Here is a fairly simple strategy which might give you a starting point. The outline of the algorithm is as follows:
Input: Vector of your data points {y0,y1,...}
Parameters:
Window size sigma.
A threshold 0<p<1 defining when to start looking for a region.
Output: The start- and endpoint {t0,t1} of the region with the most zero-crossings
I won't give any C++ code, but the method should be easy to implement. As an example, let us use the following function
What we want is the region between about 480 and 600, where the density of zeros is higher than in the front part. The first step in the algorithm is to calculate the positions of the zeros. You can do this with what you already have, but instead of counting, you store the values of i where you met a zero.
This will give you a list of zero positions
From this list (you can do this directly in the above for-loop!) you create a list having the same size as your input data which looks like {0,0,0,...,1,0,..,1,0,..}. Every zero-crossing position in your input data is marked with a 1.
The next step is to smooth this list with a smoothing filter of size sigma. Here, you can use what you like; in the simplest case a moving average or a Gaussian filter. The higher you choose sigma, the bigger your look-around window becomes, which measures how many zero-crossings are around a certain point. Let me give the output of this filter together with the original zero positions. Note that I used a Gaussian filter of size 10 here.
In the next step, you go through the filtered data and find the maximum value. In this case it is about 0.15. Now you choose your second parameter, which is some fraction of this maximum. Let's say p=0.6.
The final step is to go through the filtered data, and when the value rises above p times the maximum you start to remember a new region. As soon as the value drops below that threshold, you end this region and remember its start and endpoint. Once you are finished walking through the data, you are left with a list of regions, each defined by a start and an endpoint. Now you choose the region with the biggest extent and you are done.
(Optionally, you could add the filter size to each end of the final region)
For the above example, I get 11 regions as follows
{{164,173},{196,205},{220,230},{241,252},{259,271},{278,290},
{297,309},{318,327},{341,350},{458,468},{476,590}}
where the one with the biggest extent is the last one, {476,590}. The final result looks like this (with 1/2 filter region padding)
Conclusion
Please don't be discouraged by the length of my answer. I tried to explain everything in detail. The implementation is really just some loops:
one loop to create the zero-crossings list {0,0,..,1,0,...}
one nested loop for the moving average filter (or you use some library Gaussian filter). Here you can at the same time extract the maximum value
one loop to extract all regions
one loop to extract the largest region if you haven't already extracted it in the above step
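The loops listed above could be sketched in C++ roughly as follows. This is a minimal sketch under assumptions, not a reference implementation: the function name densestZeroCrossingRegion is hypothetical, a centered moving average of half-width sigma stands in for the Gaussian filter, and the optional padding of the final region is left out. The returned pair is a half-open index range [start, end), which is empty (start == end) when the signal has no zero-crossing.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Find the longest run of samples whose smoothed zero-crossing density
// exceeds p times the maximum density. Returns a half-open range [start, end).
std::pair<std::size_t, std::size_t>
densestZeroCrossingRegion(const std::vector<double>& y, std::size_t sigma, double p) {
    assert(sigma >= 1 && p > 0.0 && p < 1.0);
    const std::size_t n = y.size();
    if (n < 2) return {0, 0};

    // Loop 1: prefix sums over the crossing indicator {0,0,..,1,0,..}.
    std::vector<double> prefix(n + 1, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        const bool crossing = (i + 1 < n) && ((y[i] >= 0) != (y[i + 1] >= 0));
        prefix[i + 1] = prefix[i] + (crossing ? 1.0 : 0.0);
    }

    // Loop 2: moving average over [i - sigma, i + sigma], tracking the maximum.
    std::vector<double> density(n);
    double peak = 0.0;
    const double windowSize = static_cast<double>(2 * sigma + 1);
    for (std::size_t i = 0; i < n; ++i) {
        const std::size_t lo = (i >= sigma) ? i - sigma : 0;
        const std::size_t hi = std::min(n, i + sigma + 1);
        density[i] = (prefix[hi] - prefix[lo]) / windowSize;
        peak = std::max(peak, density[i]);
    }
    if (peak == 0.0) return {0, 0};  // no zero-crossing anywhere

    // Loops 3 and 4: collect runs above the threshold and keep the longest one.
    const double threshold = p * peak;
    std::size_t bestStart = 0, bestLength = 0, runStart = 0;
    bool inRun = false;
    for (std::size_t i = 0; i <= n; ++i) {
        const bool above = (i < n) && (density[i] > threshold);
        if (above && !inRun) {
            runStart = i;
            inRun = true;
        } else if (!above && inRun) {
            inRun = false;
            if (i - runStart > bestLength) {
                bestLength = i - runStart;
                bestStart = runStart;
            }
        }
    }
    return {bestStart, bestStart + bestLength};
}
```

Using prefix sums replaces the nested loop of the moving average with a single pass; a real Gaussian filter would need the nested loop or a library call.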

Stata seems to be ignoring my starting values in maximum likelihood estimation

I am trying to estimate a maximum likelihood model and it is running into convergence problems in Stata. The actual model is quite complicated, but it converges with no troubles in R when it is supplied with appropriate starting values. I however cannot seem to get Stata to accept the starting values I provide.
I have included a simple example below estimating the mean of a Poisson distribution. This is not the actual model I am trying to estimate, but it demonstrates my problem. I set the trace option, which allows you to see the parameters as Stata searches the likelihood surface.
Although I use init to set a starting value of 0.5, the first iteration still shows that Stata is trying a coefficient of 4.
Why is this? How can I force the estimation procedure to use my starting values?
Thanks!
generate y = rpoisson(4)
capture program drop mypoisson
program define mypoisson
args lnf mu
quietly replace `lnf' = $ML_y1*ln(`mu') - `mu' - lnfactorial($ML_y1)
end
ml model lf mypoisson (mean:y=)
ml init 0.5, copy
ml maximize, iterations(2) trace
Output:
Iteration 0:
Parameter vector:
mean:
_cons
r1 4
Added: Stata doesn't ignore the initial value. If you look at the output of the ml maximize command, the first line in the listing will be titled
initial: log likelihood =
Following the equal sign is the value of the likelihood for the parameter value set in the init statement.
I don't know how the search(off) or search(norescale) solutions affect the subsequent likelihood calculations, so these solutions might still be worthwhile.
Original "solutions":
To force a start at your initial value, add the search(off) option to ml maximize:
ml maximize, iterate(2) trace search(off)
You can also force a use of the initial value with search(norescale). See Jeff Pitblado's post at http://www.stata.com/statalist/archive/2006-07/msg00499.html.

Plotting a volatile data file with gnuplot dynamically

I've seen some similar questions out of which I have made a system which works for me but I need to optimize it because this program alone is taking up a lot of CPU load.
Here is the problem exactly.
I have an incoming signal/stream of data which I need to plot in real time. I only want a limited number of points to be displayed at a time (Say 1024 points) so I plot the data points along the y axis against an index from 0-1024 on the x-axis. The values of the incoming data range from 0-1023.
What I do currently (this is all in C++) is put the data into a circular buffer as it comes in, and each time the data gets updated (or every second/third data point), I write it out to a file and, using a pipe, plot the data from that file with gnuplot.
While this works almost perfectly, it causes a fair bit of load (Depending on the input data rate, I saw even 70% usage on both my cores of my Core 2 Duo). I'll need to be running some processor intensive code along with this short program so I feel that it is almost necessary to optimize it.
What I was hoping could be done is this: Can I only plot the differences between the current plot and the new data (Or plot each point as it comes in without replotting the whole graph such that the old item at that x index is removed).
I have a fixed number of points on the graph so replot wouldn't work. I want the old point at that x location to be removed.
Unfortunately, what you're trying to accomplish can't be done. You can mark a datafile as volatile or use the refresh keyword, but those only update the plot without re-reading the data. You want to re-read the data and then only update the differences.
There are a few things that might be helpful though.
1) Your eye can only register ~26 frames per second. So, if you have a way to make sure that you only send data to gnuplot 26 times per second, that might help.
2) How are you writing the datafiles? Are you dumping as ASCII or binary? A binary dump might be faster (both for writing and for gnuplot to read). You'll have to experiment.
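Point 1 could be sketched on the C++ side as a small rate limiter that decides whether enough time has passed to send another frame. The FrameLimiter class below is a hypothetical helper, not part of gnuplot or of the asker's code; writing the data and driving the gnuplot pipe are left out.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: accepts a frame only if at least `minInterval`
// has elapsed since the last accepted frame.
class FrameLimiter {
public:
    using Clock = std::chrono::steady_clock;

    explicit FrameLimiter(Clock::duration minInterval) : interval_(minInterval) {
        assert(minInterval > Clock::duration::zero());
    }

    // Returns true (and records `now`) if a new frame may be sent.
    bool ready(Clock::time_point now) {
        if (hasLast_ && now - last_ < interval_) {
            return false;  // too soon: skip this update
        }
        last_ = now;
        hasLast_ = true;
        return true;
    }

private:
    Clock::duration interval_;
    Clock::time_point last_{};
    bool hasLast_ = false;
};
```

In the acquisition loop, one would construct FrameLimiter limiter(std::chrono::milliseconds(40)) for roughly 25 updates per second and only write the file and replot when limiter.ready(FrameLimiter::Clock::now()) returns true; incoming samples still go into the buffer every time.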
There is one hack which will probably not make your script go faster, but you can try it (if you know a reasonable yrange to set, and are using points to plot the data)...
#set up code:
set style line 1 lc rgb "blue"
set xrange [0:1023]
set yrange [0:1]
plot NaN notitle #Only need to do this once.
set for [i=0:1023] label i+1 at i,0 point ls 1 #Labels must have tags > 0 :-(
#this part gets repeated by your C code.
#you could move a few points at a time to make it more responsive.
set label 401 at 400,0.8 #move point number 400 to a different y value
refresh #show it at its new location.
You can use gnuplot to do dynamic plotting of data as explained in their FAQ, using the reread function. It seems to run at quite a low load and automatically scrolls the graph when it reaches the end. To run at low load I found I had to add a ; sleep 1 after the awk command (in their example file dyn-ping-loop.gp) otherwise it spends too much CPU on looping on the awk processing.