I have the following plot with the code:
graph bar mean_positive, over(agegroup)
How could I generate a bar plot with YA on the left, MA in the middle, and OA on the right?
bar plot
Let's guess that agegroup is a string variable, in which case
label def agegroup 1 YA 2 MA 3 OA
encode agegroup, gen(age) label(agegroup)
will get your categories in order and you can try again with age. This is a good time to be less cryptic and use longer labels and even give the exact age limits.
Another guess is that you have a numeric variable with value labels in a poor order, in which case reach first for recode.
Related
Like if i have a data frame with four columns and i want to plot any two columns of it just to visualize my data. And we can find the value of all the parameters by using this
pd.describe()
count 332.000000
mean 5645.999337
std 391.081389
min 4952.290000
25% 5294.402500
50% 5647.905000
75% 6028.805000
max 6290.980000
Now, how can we put the information that we get with this function ('pandas.describe') into the plot in just one go. Instead of using the usual 'label' function from matplotlib.
Matplotlib has the option ax.text. So you need to convert this info into text.
Here comes an example:
import pandas as pd
df=pd.DataFrame({'A':[1,2,3]})
desc=df.describe()
Describe is also a DataFrame, you can turn every column into a string list:
data1=[i for i in desc.index]
data2=[str(i) for i in desc.A]
Now you can join both with a colon in between:
text= ('\n'.join([ a +':'+ b for a,b in zip(data1,data2)]))
Then in your graph, you can input:
ax.text(pos1, pos2, text , fontsize=15)
Where pos1 and pos2are numbers for the position of your text.
Does that help?
Tell me!
I'm trying to create "Sale Rep" summaries by "Shop", where I can simply filter a column by the rep's name, them populate a total sales for each shop next to the relevant filter result.
I'm using this to filter all the Stores by Scott:
=(filter(D25:D47,A25:A47 = "Scott"))
Next, want to associate the Store/Account in F to populate with the corresponding value of E inside of G. So, G25 should populate the value of E25 ($724), G26 with E26 ($822), and F27 with E38 ($511.50)
I don't know how to write the formula correctly, but something like this is what I'm trying to do: =IF(F25=D25:D38),E25 I know that's not right, and it won't work in a fill down. But I'm basically trying to look for and copy over the correct value match of D and E inside of G. So, Misty Mountain Medicince in F27 will be matched to the value of E38 and populated in G27.
The filter is what's throwing me off, because it's not a simple fill down. And I don't know how to match filtered results from one column to a matched value in another.
Hope the screenshot helps. Screenshot of table:
Change Field Rep: Scott to Scott and you might apply:
=query(A25:E38,"select D,E where A='"&F24&"'")
// Enter the following into G25 and copy down column G
=(filter(E25:E47, D25:D47 = F25))
or
// Enter the following into G25 will expand with content in F upto row 47
=ArrayFormula(IF(F25:F47 <> 0, VLOOKUP(F25:F47, D25:E47, 2, FALSE),))
I have a series of data that I need to filter.
The df consists of one col. of information that is separated by a row with with value NaN.
I would like to join all of the rows that occur until each NaN in a new column.
For example my data looks something like:
the
car
is
red
NaN
the
house
is
big
NaN
the
room
is
small
My desired result is
B
the car is red
the house is big
the room is small
Thus far, I am approaching this problema by building a function and applying it to each row in my dataframe. See below for my working code example so far.
def joinNan(row):
newRow = []
placeholder = 'NaN'
if row is not placeholder:
newRow.append(row)
if row == placeholder:
return newRow
df['B'] = df.loc[0].apply(joinNan)
For some reason, the first row of my data is being used as the index or column title, hence why I am using 'loc[0]' here instead of a specific column name.
If there is a more straight forward way to approach this directly iterating in the column, I am open for that suggestion too.
For now, I am trying to reach my desired solution and have not found any other similiar case in Stack overflow or the web in general to help me.
I think for test NaNs is necessary use isna, then greate helper Series by cumsum and aggregate join with groupby:
df=df.groupby(df[0].isna().cumsum())[0].apply(lambda x: ' '.join(x.dropna())).to_frame('B')
#for oldier version of pandas
df=df.groupby(df[0].isnull().cumsum())[0].apply(lambda x: ' '.join(x.dropna())).to_frame('B')
Another solution is filter out all NaNs before groupby:
mask = df[0].isna()
#mask = df[0].isnull()
df['g'] = mask.cumsum()
df = df[~mask].groupby('g')[0].apply(' '.join).to_frame('B')
I am attempting to produce an overlayed -xtline- plot that distinguishes between males and females (or any number of multiple groups) by displaying different plot styles for each group. I chose to recast the xtline plot as "connected" and show males using circle markers and females as triangle markers. Taking cues from this question on Statalist, I produced code similar to what is below. When I try this solution Stata produces the "too many options" error, which is perhaps predictable given the large number of unique persons. I am aware of this solution which employs combined graphs but that is also not practical given the large number of unique individuals in my data.
Does a more simple solution to this problem exist? Does Stata have the capacity to overlay multiple -xtline- plots like it can -twoway- plots?
The code below, using publicly available data from UCLA's excellent Stata guide shows my basic code and reproduces the error:
use http://www.ats.ucla.edu/stat/stata/examples/alda/data/alcohol1_pp, clear
xtset id age
gsort -male id
qui levelsof id if !male, loc(fidlevs)
qui levelsof id if male, loc(midlevs)
qui levelsof id, loc(alllevs)
tokenize `alllevs'
loc len_f : word count `fidlevs'
loc len_m : word count `midlevs'
loc len_all : word count `alllevs'
loc start_f = `len_all' - `len_f'
forval i = 1/`len_all' {
if `i' < `start_f' {
loc m_plot_opt "`m_plot_opt' plot`i'opts(recast(connected) mcolor(black) msize(medsmall) msymbol(circle) lcolor(black) lwidth(medthin) lpattern(solid))"
}
else if `i' >= `start_f' {
loc f_plot_opt "`f_plot_opt' plot`i'opts(recast(connected) mcolor(black) msize(medsmall) msymbol(triangle) lcolor(black) lwidth(medthin) lpattern(solid))"
}
}
di "xtline alcuse, legend(off) scheme(s1mono) overlay `m_plot_opt' `f_plot_opt'"
xtline alcuse, legend(off) scheme(s1mono) overlay `m_plot_opt' `f_plot_opt'
It is difficult (for me) to separate the programming issue here from statistical or graphical views on what kind of graph works well, or at all. Even with this modest dataset there are 82 distinct identifiers, so any attempt to show them distinctly fails to be useful, if only because the resulting legend takes up most of the real estate.
There is considerable ingenuity in the question code in working through all the identifiers, but a broad-brush approach seems to work as well. Try this:
use http://www.ats.ucla.edu/stat/stata/examples/alda/data/alcohol1_pp, clear
xtset id age
separate alcuse, by(male) veryshortlabel
label var alcuse1 "male"
label var alcuse0 "female"
line alcuse? age, legend(off) sort connect(L)
Key points:
There is nothing very special about xtline. It's just a convenience wrapper. When frustrated by its wired-in choices, people often just reach for line.
To get distinct colours, distinct variables suffice, which is where separate has a role. See also this Tip.
Although the example dataset is well behaved, extra options sort connect(L) will help in some case to remove spurious connections between individuals or panels. (In extreme cases, reach for linkplot (SSC).)
This could be fine too:
line alcuse age if male || line alcuse age if !male, legend(order(1 "male" 2 "female")) sort connect(L)
I have an unbalanced panel data set (countries and years). For simplicity let's say I have one variable, x, that I am measuring. The panel data sorted first by country (a 3-digit numeric country-code) and then by year. I would like to write a .do file that generates a new variable, z_x, containing the standardized values of the variable x. The variables should be standardized by subtracting the mean from the preceding (exclusive) m time periods, and then dividing by the standard deviation from those same time periods. If this is not possible, return a missing value.
Currently, the code I am using to accomplish this is the following (edited now for clarity)
xtset weocountrycode year
sort weocountrycode year
local win_len = 5 // Defining rolling window length.
quietly: rolling sd_x=r(sd) mean_x=r(mean), window(`win_len') saving(stats_x, replace): sum x
use stats_x, clear
rename end year
save, replace
use all_data_PROCESSED_FINAL.dta, clear
quietly: merge 1:1 (weocountrycode year) using stats_x
replace sd_x = . if `x'[_n-`win_len'+1] == . | weocountrycode[_n-`win_len'+1] != weocountrycode[_n] // This and next line are for deleting values that rolling calculates when I actually want missing values.
replace mean_`x' = . if `x'[_n-`win_len'+1] == . | weocountrycode[_n-`win_len'+1] != weocountrycode[_n]
gen z_`x' = (`x' - mean_`x'[_n-1])/sd_`x'[_n-1] // calculate z-score
UPDATE:
My struggle with rolling is that when rolling is set up to use a window length 5 rolling mean, it automatically does window length 1,2,3,4 means for the first, second, third and fourth entries (when there are not 5 preceding entries available to average out). In fact, it does this in general - if the first non-missing value is on entry 5, it will do a length 1 rolling average on entry 5, length 2 rolling average on entry 6, ..... and then finally start doing length 5 moving averages on entry 9. My issue is that I do not want this, so I would like to avoid performing these calculations. Until now, I have only been able to figure out how to delete them after they are done, which is both inefficient and bothersome.
I tried adding an if clause to the -rolling- statement:
quietly: rolling sd_x=r(sd) mean_x=r(mean) if x[_n-`win_len'+1] != . & weocountrycode[_n-`win_len'+1] != weocountrycode[_n], window(`win_len') saving(stats_x, replace): sum x
But it did not fix the problem and the output is "weird" in the sense that
1) If `win_len' is equal to, say, 10, there are 15 missing values in the resulting z_x variable, instead of 9.
2) Even though there are "extra" missing values in z_x, the observations still start out as window length 1 means, then window length 2 means, etc. which makes no sense to me.
Which leads me to believe I fundamentally don't understand 1) what -rolling- is doing and 2) how an if clause works in the context of -rolling-.
Does this help?
Thanks!
I'm not sure I understand completely but I'll try to answer based on what I think your problem is, and based on a comment by #NickCox.
You say:
... when rolling is set up to use a window length 5 rolling mean...
if the first non-missing value is
on entry 5, it will do a length 1 rolling average on entry 5, length 2
rolling average on entry 6, ...
This is expected. help rolling states:
The window size refers to calendar periods, not the number of
observations. If there
are missing data (for example, because of weekends), the actual number of observations used by command may be less than
window(#).
It's not actually doing a "length 1 rolling average", but I get to that later.
Below some examples to see what rolling does:
clear all
set more off
*-------------------------- example data -----------------------------
set obs 92
gen dat = _n - 1
format dat %tq
egen seq = fill(1 1 1 1 2 2 2 2)
tsset dat
tempfile main
save "`main'"
list in 1/12, separator(4)
*------------------- Example 1. None missing ------------------------
rolling mean=r(mean), window(4) stepsize(4) clear: summarize seq, detail
list in 1/12, separator(0)
*------- Example 2. All but one value, missing in first window ------
use "`main'", clear
replace seq = . in 1/3
list in 1/8
rolling mean=r(mean), window(4) stepsize(4) clear: summarize seq, detail
list in 1/12, separator(0)
*------------- Example 3. All missing in first window --------------
use "`main'", clear
replace seq = . in 1/4
list in 1/8
rolling mean=r(mean), window(4) stepsize(4) clear: summarize seq, detail
list in 1/12, separator(0)
Note I use the stepsize option to make things much easier to follow. Because the date variable is in quarters, I set windowsize(4) and stepsize(4) so rolling is just computing averages by year. I hope that's easy to see.
Example 1 does as expected. No problem here.
Example 2 on the other hand, should be more interesting for you. We've said that what matters are calendar periods, so the mean is computed for the whole year (four quarters), even though it contains missings. There are three missings and one non-missing. summarize is computing the mean over the whole year, but summarize ignores missings, so it just outputs the mean of non-missings, which in this case is just one value.
Example 3 has missings for all four quarters of the year. Therefore, summarize outputs . (missing).
Your problem, as I understand it, is that when you face a situation like Example 2, you'd like the output to be missing. This is where I think Nick Cox's advice comes in. You could try something like:
rolling mean=r(mean) N=r(N), window(4) stepsize(4) clear: summarize seq, detail
replace mean = . if N != 4
list in 1/12, separator(0)
This says: if the number of non-missings for the window (r(N), also computed by summarize), is not the same as the window size, then replace it with missing.