How do I suppress markers in a Stata graph?

I'm using lowess to plot average cholesterol over time. Each participant had their cholesterol measured at random dates - usually not at the same time as others.
Anyway, I want the smoothed line, but I don't want the markers, especially since the markers seem to prevent scaling the y-axis from 0-500 to 0-250. Even when I go to the Graph Editor to remove the markers by hand, I still cannot rescale the y-axis.
How do I remove the markers using code only? And will doing this allow me to rescale the y-axis? Or, should I use a different command than lowess?

Graph commands have many, many options. It's a matter of going through them very carefully using help <command> and the manual. The following command draws the graph with the markers suppressed.
clear all
set more off
sysuse auto
lowess mpg weight, mean msymbol(i)

Writing here more as a statistics user rather than a statistical programmer:
Suppressing the data sounds a very bad idea, regardless of your implication that it is what you need.
lowess isn't one thing: even with one implementation (Stata), there's still the question of what bandwidth was used. Note that there are several lowess (loess, locfit) algorithms around in different programs.
That said, the short answer is that twoway lowess rather than lowess does what you ask: it draws only the smoothed line, without the scatter of data points.


Image segmentation: identify spots and scratches with irregular borders

A quick introduction: I do physics research which includes experimental measurements and numerical simulations.
Below is the image which is the result of our theoretical model
Without going into details, I'll just say that the intensity and color here represent a simulated physical quantity.
Experimental results are below
The measurement has more features and details but it also has a lot of "invalid" data which are represented by darker spots, scratches and marks which have irregular borders and can vary in size and shape. Nonetheless by comparing these two pictures we can visually identify "invalid" pixels on the second figure which is the problem I am trying to solve using a computer.
Simple thresholding by intensity won't work because the valid data can also vary in intensity. I was thinking about using a CNN, but then I realized that it would be very tedious to prepare a training dataset: there are a lot of small marks and spots that need to be labeled, and marking them manually will take a lot of time.
Is there any other solution for this problem? Or maybe there is a pretrained neural network (or maybe an SVM?) which handles a similar problem?
Let's check all options one by one taking into account the following:
you have a very specific physical process
you need accurate results (both process-wise and geometry-wise)
It will be hard to find a "ready-to-be-used" model for your specific process. Moreover, there will be a need to take some specific actions to get an accurate geometry out of it:
Background subtraction
Background subtraction will require a threshold, so for your examples and conditions it makes little sense. I produced two masks based on the subtracted background; spot the difference:
Color-based segmentation
With a properly defined threshold (let's assume we use delta_E) you can segment several areas of interest. For example, let's define three:
bright red
black/dark red
Let's compare:
Additional area:
So color-based segmentation seems to be an option, but it is better to improve the input if possible. I hope this makes sense.
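As a rough illustration of the color-based option, here is a minimal standard-library sketch. It assumes that delta_E means the simple CIE76 distance in L*a*b* space, that pixels are 8-bit sRGB triples, and that the reference color and threshold are picked by hand; the function names are made up for this example and do not come from any particular library.

```python
import math


def srgb_to_lab(rgb):
    """Convert an 8-bit sRGB triple to CIE L*a*b* (D65 white point)."""
    def linearize(channel):
        c = channel / 255.0
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    r, g, b = (linearize(c) for c in rgb)
    # Linear sRGB -> CIE XYZ
    x = 0.4124564 * r + 0.3575761 * g + 0.1804375 * b
    y = 0.2126729 * r + 0.7151522 * g + 0.0721750 * b
    z = 0.0193339 * r + 0.1191920 * g + 0.9503041 * b

    def f(t):
        delta = 6.0 / 29.0
        return t ** (1.0 / 3.0) if t > delta ** 3 else t / (3 * delta ** 2) + 4.0 / 29.0

    # Normalize by the D65 reference white before applying f
    fx, fy, fz = f(x / 0.95047), f(y / 1.0), f(z / 1.08883)
    return (116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz))


def delta_e76(lab1, lab2):
    """CIE76 color difference: Euclidean distance in L*a*b*."""
    return math.dist(lab1, lab2)


def segment(pixels, reference_rgb, threshold):
    """Return a boolean mask: True where a pixel is within `threshold` of the reference color."""
    reference = srgb_to_lab(reference_rgb)
    return [[delta_e76(srgb_to_lab(p), reference) <= threshold for p in row]
            for row in pixels]


# Hypothetical 1x2 image: one bright-red pixel and one very dark red pixel
image = [[(250, 10, 10), (30, 0, 0)]]
bright_red_mask = segment(image, reference_rgb=(255, 0, 0), threshold=10.0)
```

Running segment once per reference color (for example bright red, then black/dark red) gives one mask per area of interest; the threshold still has to be tuned against the real measurement.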

Dealing with an imbalanced dataset for multi-label classification

In my case, I have 33 labels per sample. The label tensor for a given image looks like [0,0,1,0,1,1,1,0,0,0,0,0…...33]. Some labels have very few samples and others have many. I want to predict the regression values. What would be the best approach to improve the prediction? I would like to apply a data-balancing technique, but so far I have found balancing techniques only for multi-class problems. I would be grateful if you could share what you know about this problem, or any other idea to improve the performance. Thanks in advance.
When using a single model to regress multiple values, it is usually beneficial to rescale the predicted values so that they all lie in roughly the same range.
Look, for example, at the way detection models predict (regress) bounding box coordinates: values are scaled and the net predicts only corrections.
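One way to bring the regression targets into a common range is per-column standardization. The sketch below assumes that is what "roughly the same range" amounts to here (other scalings would work too), uses only the standard library, and the helper names are made up for illustration. The model would be trained on the scaled targets and its outputs mapped back with unscale.

```python
from statistics import mean, pstdev


def fit_scaler(targets):
    """Compute a mean and standard deviation for every target column."""
    columns = list(zip(*targets))
    means = [mean(col) for col in columns]
    # Fall back to 1.0 for constant columns to avoid dividing by zero
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds


def scale(targets, means, stds):
    """Map raw targets to zero mean and unit variance per column."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in targets]


def unscale(predictions, means, stds):
    """Map model outputs back to the original units."""
    return [[v * s + m for v, m, s in zip(row, means, stds)] for row in predictions]


# Hypothetical targets: two columns with very different ranges
raw_targets = [[1.0, 100.0], [3.0, 300.0]]
means, stds = fit_scaler(raw_targets)
scaled_targets = scale(raw_targets, means, stds)
restored = unscale(scaled_targets, means, stds)
```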

How do I remove the leftmost zero (on the x-axis) when graphing a categorical variable?

hist body, discrete freq xlabel(#5, labsize(small) angle(forty_five) valuelabel) produces:
I'm graphing a categorical variable, but I can't figure out how to drop the zero from the x-axis. I've tried the documentation for xlabel() and xscale() but didn't find any winners.
The short answer is to spell out that you only want xla(1/5, stuff). How to spell out precisely which labels you want is documented in help axis_label_options.
Not the question, but this is in my view a poor graph. Go with a horizontal bar chart in which (1) the discreteness of the variable is respected; (2) the category labels are properly and readably horizontal, instead of using a most awkward device of text at 45 degrees. catplot (SSC) is one way to go. Also in Stata 13 (updated) upwards, graph hbar will do as well. You should also split the title in two lines. Even further off-topic: most consumers of this research should not care two hoots about the variable name or its question number in your survey.

Clustering a list of dates

I have a list of dates I'd like to cluster into 3 clusters. Now, I can see hints that I should be looking at k-means, but all the examples I've found so far are related to coordinates, in other words, pairs of list items.
I want to take this list of dates and append them to three separate lists indicating whether they were before, during or after a certain event. I don't have the time for this event, but that's why I'm guessing it by breaking the date/times into three groups.
Can anyone please help with a simple example on how to use something like numpy or scipy to do this?
k-means is exclusively for coordinates. And more precisely: for continuous and linear values.
The reason is the mean function. Many people overlook the role of the mean for k-means (despite it being in the name...)
On non-numerical data, how do you compute the mean?
There exist some variants for binary or categorical data. IIRC there is k-modes, for example, and there is k-medoids (PAM, partitioning around medoids).
It's unclear to me what you want to achieve overall... your data seems to be 1-dimensional, so you may want to look at the many questions here about 1-dimensional data (as the data can be sorted, it can be processed much more efficiently than multidimensional data).
In general, even if you projected your data into unix time (seconds since 1.1.1970), k-means will likely only return mediocre results for you. The reason is that it will try to make the three intervals have the same length.
Do you have any reason to suspect that "before", "during" and "after" have the same duration? If not, don't use k-means.
You may however want to have a look at KDE; and plot the estimated density. Once you have understood the role of density for your task, you can start looking at appropriate algorithms (e.g. take the derivative of your density estimation, and look for the largest increase / decrease, or estimate an "average" level, and look for the longest above-average interval).
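The "longest above-average interval" idea could be sketched as follows. This is only an illustration under stated assumptions: a Gaussian kernel, a bandwidth chosen by hand, the mean density on an evenly spaced grid as the "average" level, and made-up function names. It uses only the standard library rather than numpy or scipy.

```python
import math
from datetime import datetime, timedelta, timezone


def gaussian_kde(samples, bandwidth):
    """Return a function that evaluates a Gaussian kernel density estimate."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density


def longest_dense_interval(samples, bandwidth, grid_points=200):
    """Find the longest run of grid points whose density exceeds the mean density."""
    lo, hi = min(samples), max(samples)
    step = (hi - lo) / (grid_points - 1)
    grid = [lo + i * step for i in range(grid_points)]
    density = gaussian_kde(samples, bandwidth)
    values = [density(x) for x in grid]
    level = sum(values) / len(values)

    best, start = None, None
    # The trailing -inf sentinel closes a run that reaches the end of the grid
    for i, v in enumerate(values + [float("-inf")]):
        if v > level and start is None:
            start = i
        elif v <= level and start is not None:
            if best is None or i - start > best[1] - best[0]:
                best = (start, i)
            start = None
    return grid[best[0]], grid[best[1] - 1]


def split_dates(dates, bandwidth_seconds):
    """Split dates into before / during / after the densest stretch."""
    begin, end = longest_dense_interval([d.timestamp() for d in dates], bandwidth_seconds)
    before = [d for d in dates if d.timestamp() < begin]
    during = [d for d in dates if begin <= d.timestamp() <= end]
    after = [d for d in dates if d.timestamp() > end]
    return before, during, after


# Hypothetical data: sparse dates, then a burst every 30 minutes, then sparse again
origin = datetime(2024, 1, 1, tzinfo=timezone.utc)
offsets = [0, 10, 20, 20.5, 21, 21.5, 22, 22.5, 23, 35, 45]
dates = [origin + timedelta(hours=h) for h in offsets]
before, during, after = split_dates(dates, bandwidth_seconds=3600)
```

The result depends strongly on the bandwidth, so it is worth plotting the estimated density first, as suggested above, before trusting any automatic split.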
Here are some workaround methods that may not be the best answer but should help.
You can plot the dates as converted durations from a starting date (such as one week)
and convert the dates to number representations for time in minutes or hours from the starting point.
These would all lie along a single x-axis, but k-means should still be possible and the clustering still visible on a graph.
Here are more examples using numpy: Python k-means algorithm
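The conversion described in this answer, followed by a plain one-dimensional k-means, might look like the sketch below. It is a standard-library illustration rather than numpy code, the function names and the evenly spaced initialization are assumptions made for the example, and the caveat from the other answer still applies: k-means may split the timeline poorly when the three periods differ a lot in length.

```python
from datetime import datetime, timedelta


def hours_since(dates, start):
    """Convert dates to hours elapsed since a chosen starting point."""
    return [(d - start).total_seconds() / 3600.0 for d in dates]


def assign(values, centres):
    """Label each value with the index of its nearest centre."""
    return [min(range(len(centres)), key=lambda j: abs(v - centres[j])) for v in values]


def kmeans_1d(values, k=3, iterations=100):
    """Plain Lloyd iterations on one-dimensional data."""
    ordered = sorted(values)
    # Assumed initialization: centres spread evenly through the sorted data
    centres = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        labels = assign(values, centres)
        updated = []
        for j in range(k):
            members = [v for v, label in zip(values, labels) if label == j]
            # Keep the old centre if a cluster ends up empty
            updated.append(sum(members) / len(members) if members else centres[j])
        if updated == centres:
            break
        centres = updated
    return assign(values, centres), centres


# Hypothetical dates in three well-separated groups
start = datetime(2024, 1, 1)
dates = [start + timedelta(hours=h) for h in (0, 1, 2, 50, 51, 52, 100, 101, 102)]
labels, centres = kmeans_1d(hours_since(dates, start), k=3)
groups = [[d for d, label in zip(dates, labels) if label == j] for j in range(3)]
```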

Shape-matching of plots using non-linear least squares

What would be the best way to implement a simple shape-matching algorithm to match a plot interpolated from just 8 points (x, y) against a database of similar plots (> 12 000 entries), each plot having >100 nodes? The database has 6 categories of plots (signals measured under 6 different conditions), and the main aim is to find the right category (so for every category there are around 2000 plots to compare against).
The 8-node plot would represent actual data from measurement, but for now I am simulating this by selecting a random plot from the database, then 8 points from it, then smearing them using a Gaussian random number generator.
What would be the best way to implement non-linear least-squares to compare the shape of the 8-node plot against each plot from the database? Are there any C++ libraries you know of that could help with this?
Is it necessary to find the actual formula (f(x)) of the 8-node plot to use it with least squares, or will it be sufficient to use interpolation in requested points, such as interpolation from the gsl library?
You can certainly use least squares without knowing the actual formula. If all of your plots are measured at the same x value, then this is easy -- you simply compute the sum in the normal way:
$$\chi^2 = \sum_i \frac{(y_i - Y(x_i))^2}{\sigma_i^2}$$
where y_i is a point in your 8-node plot, sigma_i is the error on the point and Y(x_i) is the value of the plot from the database at the same x position as y_i. You can see why this is trivial if all your plots are measured at the same x value.
If they're not, you can get Y(x_i) either by fitting the plot from the database with some function (if you know it) or by interpolating between the points (if you don't know it). The simplest interpolation is just to connect the points with straight lines and find the value of the straight lines at the x_i that you want. Other interpolations might do better.
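A minimal sketch of that comparison, using straight-line interpolation and an error-weighted sum of squared differences, could look like this. It is plain Python with the standard library only; the function names and the (category, xs, ys) layout of the database entries are assumptions made for illustration, and each plot's x values are assumed to be strictly increasing.

```python
from bisect import bisect_right


def interpolate(xs, ys, x):
    """Piecewise-linear interpolation; values outside the range are clamped."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect_right(xs, x)
    x0, x1 = xs[i - 1], xs[i]
    t = (x - x0) / (x1 - x0)
    return ys[i - 1] + t * (ys[i] - ys[i - 1])


def chi_square(test_points, plot_xs, plot_ys):
    """Sum of squared, error-weighted differences between test points and one plot."""
    return sum(((y - interpolate(plot_xs, plot_ys, x)) / sigma) ** 2
               for x, y, sigma in test_points)


def best_category(test_points, database):
    """Return the category of the database plot with the smallest sum."""
    best = min(database, key=lambda entry: chi_square(test_points, entry[1], entry[2]))
    return best[0]


# Hypothetical database of (category, xs, ys) and test points of (x, y, sigma)
database = [
    ("linear", [0, 1, 2, 3], [0, 1, 2, 3]),
    ("quadratic", [0, 1, 2, 3], [0, 1, 4, 9]),
]
test_points = [(0.5, 0.5, 0.1), (2.5, 2.5, 0.1)]
match = best_category(test_points, database)
```

Scanning all 12 000 plots this way is a linear pass with 8 interpolations per plot, so no fitting routine is needed for the matching step itself.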
In my field, we use ROOT for this kind of thing. However, scipy has a great collection of functions, and it might be easier to get started with -- if you don't mind using Python.
One major problem you could have would be that the two plots are not independent. Wikipedia suggests McNemar's test in this case.
Another problem you could have is that you don't have much information in your test plot, so your results will be affected greatly by statistical fluctuations. In other words, if you only have 8 test points and two plots match, how will you know if the underlying functions are really the same, or if the 8 points simply jumped around (inside their error bars) in such a way that it looks like the plot from the database -- purely by chance! ... I'm afraid you won't really know. So the plots that test well will include false positives (low purity), and some of the plots that don't happen to test well were probably actually good matches (low efficiency).
To solve that, you would need to either use a test plot with more points or else bring in other information. If you can throw away plots from the database that you know can't match for other reasons, that will help a lot.