Editing hbar graph (Stata) - stata

My code below produces the graph attached. However, I am trying to add two adjustments but with no luck.
1- I would like to organize the Y axis where for all industries November comes before December, rather than being arranged by which month had more jobs as in the current graph.
2- I also tried adding labels to the Y axis where it only says "Nov" & "Dec", without the additional text, and while Stata does not produce any errors, it is not changing the graph.
preserve
drop if total_jobs_industry<15
graph hbar (count) total_jobs_industry, over(month) over(industry, sort(1)) subtitle("Jobs by Industry and month", span)
restore
I know that I can change the graph with tiny details manually in Stata, but I prefer automating the process if possible.
Data example:
Example generated by -dataex-. To install: ssc install dataex
clear
input float total_jobs_industry str39 industry str8 month
11 "Architectural & Engineering Services" "Nov_2020"
11 "Architectural & Engineering Services" "Nov_2020"
11 "Architectural & Engineering Services" "Dec_2020"
11 "Architectural & Engineering Services" "Dec_2020"
11 "Architectural & Engineering Services" "Nov_2020"
11 "Architectural & Engineering Services" "Dec_2020"
11 "Architectural & Engineering Services" "Dec_2020"
11 "Architectural & Engineering Services" "Nov_2020"
38 "Computer Hardware & Software" "Dec_2020"
12 "Consulting" "Dec_2020"
63 "" "Dec_2020"
32 "IT Services" "Dec_2020"
32 "IT Services" "Nov_2020"
38 "Computer Hardware & Software" "Nov_2020"
12 "Aerospace & Defense" "Nov_2020"
12 "Accounting" "Nov_2020"
12 "Accounting" "Dec_2020"
end
When I ran with sum, instead of count, I get the graph below:
preserve
drop if total_jobs_industry<15
graph hbar (sum) total_jobs_industry, over(month) over(industry, sort(1)) subtitle("Jobs by Industry and month", span)
restore
Furthermore, this is how I create the variable that counts the number of jobs per industry:
// The variable id contains observation number running from 1 to X and nt is the total number of observations
generate id = _n
generate nt = _N
// Sorting by industry. Now n1 is the observation number within each industry group and total_jobs_industry is the total number of observations for each industry group.
sort industry
by industry: generate n1 = _n
by industry: generate total_jobs_industry = _N
order total_jobs_industry, a(industry)

This is a very puzzling question. The following list of reasons is not complete.
The post seems to mix old and new versions of itself and isn't consistent. You cannot reasonably expect us to decode such a meandering story reliably. The standard here is to present a minimal verifiable example, and that standard is not being met by this thread. See guidance here.
Neither of the graphs shown corresponds to the data given.
It is hard for me to believe that (count) makes sense for your data. It counts non-missing values, but your key variable appears to be total_jobs_industry, which already holds a total. On the other hand, working variously with (sum) and the number of observations seems to confuse quite different kinds of calculations.
There appear to be duplicate observations in your example data.
You state that you ' also tried adding labels to the Y axis where it only says "Nov" & "Dec" ' but nothing in your code shows any such attempt to comment on.
You're expecting Nov_2020 to sort before Dec_2020, which won't happen because, so far as Stata is concerned, month is just a string variable, so the fact that D sorts before N is paramount. This is the reason December sorts before November, and it has nothing to do with sorting on industry values, which affects only the ordering of the groups of bars. You're not making use of Stata's functionality for date variables.
I doubt that I can make sense of any of these problems except the last. It seems to be a limitation of graph hbar that it ignores time variable display formats, so I used value labels to ensure that Nov and Dec sort in the order you wish.
clear
input float total_jobs_industry str39 industry str8 month
11 "Architectural & Engineering Services" "Nov_2020"
11 "Architectural & Engineering Services" "Nov_2020"
11 "Architectural & Engineering Services" "Dec_2020"
11 "Architectural & Engineering Services" "Dec_2020"
11 "Architectural & Engineering Services" "Nov_2020"
11 "Architectural & Engineering Services" "Dec_2020"
11 "Architectural & Engineering Services" "Dec_2020"
11 "Architectural & Engineering Services" "Nov_2020"
38 "Computer Hardware & Software" "Dec_2020"
12 "Consulting" "Dec_2020"
63 "" "Dec_2020"
32 "IT Services" "Dec_2020"
32 "IT Services" "Nov_2020"
38 "Computer Hardware & Software" "Nov_2020"
12 "Aerospace & Defense" "Nov_2020"
12 "Accounting" "Nov_2020"
12 "Accounting" "Dec_2020"
end
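* drop exact duplicate observations
duplicates drop
* convert strings such as "Nov_2020" to monthly dates, which sort chronologically
gen mdate = monthly(month, "MY")
levelsof mdate, local(months)
* c(Mons) holds "Jan Feb ... Dec"; tokenize stores them in local macros 1 to 12
tokenize "`c(Mons)'"
foreach m of local months {
    local month = month(dofm(`m'))
    label def mdate `m' "``month''", modify
}
label val mdate mdate
set scheme s1color
graph hbar (asis) total_jobs_industry, over(mdate) over(industry, sort(1) descending)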
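The sorting point made above is not specific to Stata. As a rough illustration only, here is a minimal Python sketch (the helper name month_key is made up for this example) showing that the raw strings sort alphabetically, while a key built from year and month number sorts chronologically:

```python
# Month labels as they appear in the question's data.
labels = ["Nov_2020", "Dec_2020"]

# Plain string sorting compares characters, so "D" comes before "N".
alphabetical = sorted(labels)

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

def month_key(label):
    """Turn 'Nov_2020' into (2020, 11) so that sorting is chronological."""
    name, year = label.split("_")
    return int(year), MONTHS.index(name) + 1

chronological = sorted(labels, key=month_key)

print(alphabetical)   # ['Dec_2020', 'Nov_2020']
print(chronological)  # ['Nov_2020', 'Dec_2020']
```

The Stata code above does the equivalent: it sorts on a numeric monthly date and uses value labels only for display.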

Related

Power BI Matrix Visual Showing Row of Blank Values Even Though Source Data Does Not Have Blanks

I have two tables one with data about franchise locations (Franchise Profile Info) and one with Award data. Each franchise location is given a certain number of awards they are allowed to give out per year. Each franchise location rolls up to a larger group depending on where in the country they are located. These tables are in a 1 to 1 relationship using Franchise ID. I am trying to create a matrix with the number of awards, total utilized, and percentage utilized rolled up to group with the ability to expand the groups and see individual locations. For some reason when I add the value fields a blank row is created. There are not any blank rows in either of the original tables so I'm not sure where this is coming from.
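Franchise Profile Info table
ID  | Franchise Name | Group | Street Address  | City          | State
----+----------------+-------+-----------------+---------------+------
164 | Park's         | West  | 12 Park Dr.     | Los Angeles   | CA
365 | A & J          | East  | 243 Whiteoak Rd | Stafford      | VA
271 | Otto's         | South | 89 Main St.     | St. Augustine | FL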
Award table
ID  | Year | TotalAwards | Utilized
----+------+-------------+---------
164 | 2022 | 16          | 12
365 | 2022 | 5           | 5
271 | 2022 | 22          | 17
These tables are in a relationship with a 1-to-1 match on ID.
What I want the matrix to look like
Group | Total Awards | Utilized | %Awards Utilized
------+--------------+----------+-----------------
East  | 5            | 5        | 100%
West  | 16           | 12       | 75%
South | 22           | 17       | 77%
Instead what I'm getting is this
Group | Total Awards | Utilized | %Awards Utilized
------+--------------+----------+-----------------
East  | 5            | 5        | 100%
West  | 16           | 12       | 75%
South | 22           | 17       | 77%
      | 0            | 0        | 0%
I can't for the life of me figure out where this row is coming from. I can add in the Group and Franchise name as rows but as soon as I add any of the value columns this blank row shows up.
You have a value on the many side that does not exist on the one side. You can read a full explanation here. https://www.sqlbi.com/articles/blank-row-in-dax/
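One way to check this outside Power BI is to compare the key columns of the two tables directly. The following Python sketch is only an illustration: the IDs 164, 365 and 271 come from the question, while ID 999 is a hypothetical unmatched row added to show what such a check would report, and unmatched_ids is a made-up helper name.

```python
# IDs from the question's Franchise Profile Info table.
profile_ids = {164, 365, 271}

# Award rows from the question, plus one HYPOTHETICAL orphan row (ID 999)
# added purely to illustrate the situation the answer describes.
award_rows = [
    {"ID": 164, "TotalAwards": 16, "Utilized": 12},
    {"ID": 365, "TotalAwards": 5, "Utilized": 5},
    {"ID": 271, "TotalAwards": 22, "Utilized": 17},
    {"ID": 999, "TotalAwards": 0, "Utilized": 0},
]

def unmatched_ids(rows, known_ids):
    """Return the sorted IDs that appear in rows but not in known_ids."""
    return sorted({row["ID"] for row in rows} - known_ids)

orphans = unmatched_ids(award_rows, profile_ids)
print(orphans)  # [999]
```

Any ID reported this way has no matching row on the other side of the relationship, which is the kind of mismatch the linked article describes as the source of the blank row.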

How do I create a pivot table with weighted averages from a table in PowerBI?

I have data in the following format:
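Building       | Tenant | Type    | Floor | Sq Ft | Rent | Term Length
---------------+--------+---------+-------+-------+------+------------
1 Example Way  | Jeff   | Renewal | 5     | 100   | 100  | 6
47 Fake Street | Tom    | New     | 3     | 500   | 200  | 12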
I need to create a visualisation in PowerBI that displays a pivot table of attribute by tenant, with a weighted average (by square foot) column, like this:
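                     | Jeff          | Tom            | Weighted Average (by Sq Ft)
---------------------+---------------+----------------+----------------------------
Building             | 1 Example Way | 47 Fake Street | -
Type                 | Renewal       | New            | -
Floor                | 5             | 3              | -
Sq Ft                | 100           | 500            | 433.3333333
Rent                 | 100           | 200            | 183.3333333
Term Length (months) | 6             | 12             | 11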
I have unpivoted the original data, like this:
Tenant | Attribute            | Value
-------+----------------------+---------------
Jeff   | Building             | 1 Example Way
Jeff   | Type                 | Renewal
Jeff   | Floor                | 5
Jeff   | Sq Ft                | 100
Jeff   | Rent                 | 100
Jeff   | Term Length (months) | 6
Tom    | Building             | 47 Fake Street
Tom    | Type                 | New
Tom    | Floor                | 3
Tom    | Sq Ft                | 500
Tom    | Rent                 | 200
Tom    | Term Length (months) | 12
I can almost create what I need from the unpivoted data using a matrix (as below), but I can't calculate the weighted averages column from that matrix.
                     | Jeff          | Tom
---------------------+---------------+---------------
Building             | 1 Example Way | 47 Fake Street
Type                 | Renewal       | New
Floor                | 5             | 3
Sq Ft                | 100           | 500
Rent                 | 100           | 200
Term Length (months) | 6             | 12
I can also create a table with my attributes as headers (instead of in a column). This displays the right values and lets me calculate weighted averages (as below).
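                            | Building       | Type    | Floor | Sq Ft       | Rent        | Term Length (months)
----------------------------+----------------+---------+-------+-------------+-------------+---------------------
Jeff                        | 1 Example Way  | Renewal | 5     | 100         | 100         | 6
Tom                         | 47 Fake Street | New     | 3     | 500         | 200         | 12
Weighted Average (by Sq Ft) | -              | -       | -     | 433.3333333 | 183.3333333 | 11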
However, it's important that these values are displayed vertically instead of horizontally. This is pretty straightforward in Excel, but I can't figure out how to do it in PowerBI. I hope this is clear. Can anyone help?
Thanks!
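For reference, the weighted-average figures in the tables above can be reproduced with a few lines of arithmetic. This Python sketch is not a Power BI solution; it only shows the calculation (sum of value times square feet, divided by total square feet) that a measure would need to perform. The function name weighted_average is made up for the example, and the data are the two rows from the question.

```python
# Rows from the question's example data.
tenants = [
    {"Tenant": "Jeff", "Sq Ft": 100, "Rent": 100, "Term Length": 6},
    {"Tenant": "Tom", "Sq Ft": 500, "Rent": 200, "Term Length": 12},
]

def weighted_average(rows, value_key, weight_key="Sq Ft"):
    """Sum of value * weight divided by the sum of weights."""
    total_weight = sum(row[weight_key] for row in rows)
    weighted_sum = sum(row[value_key] * row[weight_key] for row in rows)
    return weighted_sum / total_weight

# Print one weighted average per numeric attribute.
for attribute in ("Sq Ft", "Rent", "Term Length"):
    print(attribute, round(weighted_average(tenants, attribute), 7))
```

Text attributes such as Building and Type have no weighted average, which is why the desired table shows "-" for them.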

How to filter distinct counts of text with a greater than indicator in Power BI?

I am working on a report that counts stores with different types of beverages. I am trying to get a distinct count of stores that are selling 4 or more Powerade flavors and two or more Coca-Cola flavors while maintaining a count of stores that are purchasing other products (Sprite, Dr. Pepper, etc.).
My data table is BEVSALES and the data looks like:
CustomerNo Brand Flavor
43 PWD Fruit Punch
37 Coca-Cola Vanilla
43 PWD Mixed Bry
37 Coca-Cola Cherry
44 Sprite Tropical Mix
43 PWD Strawberry
43 PWD Grape
44 Coca-Cola Cherry
17 Dr. Pepper Cherry
I am trying to make the data give me a distinct count of customers with filters that have PWD>=4 and Coca-Cola>=2, while keeping the customer count of Dr. Pepper and Sprite at 1 each. (1 customer purchasing PWD, 1 customer Purchasing Coca-Cola, etc.)
The best measure that I have been able to find is
= SUMX(BEVSALES, 1*(FIND("PWD",BEVSALES[Brand],,0)))
but I don't know how to put it together so the formula counts the stores that have more than 4 PWD and 2 Coca-Cola flavors. Any ideas?
The easiest way would be to do this in a separate query. Go to the query design and click on edit. Then choose your table, group by the column Brand, and take a distinct count of the column Flavor. The result should look like this (maybe as a new table):
GroupedBrand DistinctCountFlavor
PWD 4
Coca-Cola 2
Sprite 1
Dr. Pepper 1
Now you can access the distinct count of the flavors by brand. With an IF() statement you can check for >= 4 for PWD, and so on.
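Since the question asks for a count of customers rather than of flavors, the grouping probably needs to be by customer and brand before the threshold is applied. The Python sketch below illustrates that logic on the question's sample rows; it is not DAX or Power Query, and the names MIN_FLAVORS and qualifying_customers are invented for the example.

```python
from collections import defaultdict

# Rows from the question's BEVSALES sample: (CustomerNo, Brand, Flavor).
BEVSALES = [
    (43, "PWD", "Fruit Punch"),
    (37, "Coca-Cola", "Vanilla"),
    (43, "PWD", "Mixed Bry"),
    (37, "Coca-Cola", "Cherry"),
    (44, "Sprite", "Tropical Mix"),
    (43, "PWD", "Strawberry"),
    (43, "PWD", "Grape"),
    (44, "Coca-Cola", "Cherry"),
    (17, "Dr. Pepper", "Cherry"),
]

# Minimum number of distinct flavors a customer needs, per brand.
# Brands not listed here default to a threshold of 1.
MIN_FLAVORS = {"PWD": 4, "Coca-Cola": 2}

def qualifying_customers(rows, min_flavors):
    """Count customers per brand that reach the brand's flavor threshold."""
    flavors = defaultdict(set)  # (customer, brand) -> distinct flavors
    for customer, brand, flavor in rows:
        flavors[(customer, brand)].add(flavor)

    counts = defaultdict(int)
    for (customer, brand), seen in flavors.items():
        if len(seen) >= min_flavors.get(brand, 1):
            counts[brand] += 1
    return dict(counts)

result = qualifying_customers(BEVSALES, MIN_FLAVORS)
print(result)
```

On the sample data every brand ends up with one qualifying customer: customer 44 buys only one Coca-Cola flavor and so is not counted for Coca-Cola.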

tabstat: How to sort/order the output by a certain variable?

I gathered some NBA players' data of their triple-double games, and would like to find out who got the most explosive data on average.
The source is "Basketball Reference - Player Game Finder - Triple Doubles".(Sorry that I can't post the direct url because of the lack of reputation)
So I generated a table summarizing descriptive statistics (e.g. count mean) for several variables (pts trb ast stl blk) using:
tabstat pts trb ast stl blk, statistics(count mean) format(%9.1f) by(player)
What I get is the following table:
tabstat result:
How can I tell Stata to filter the players by count >= 10 (who got 10 or more triple-doubles ever) as a column then sort the table by pts and get:
Ideal result:
Like above, I would say Michael Jordan and James Harden are the Top 2 most explosive triple-double players and Darrell Walker is the most economic one.
Do study https://stackoverflow.com/help/mcve on how to present an example other people can work with straight away. Also, avoiding sports-specific jargon that won't be universally comprehensible and focusing more on the general programming problem would help. Fortunately, what you want seems clear nevertheless.
To do this you need to create a variable defining the order desired in advance of your tabstat call. To get it (value) labelled as you wish, use labmask (search labmask then download from the Stata Journal location given).
Here is some technique.
sysuse auto, clear
egen mean = mean(weight), by(rep78)
egen count = count(weight), by(rep78)
egen group = group(mean rep78) if count >= 5
replace group = -group
labmask group, values(rep78)
label var group "`: var label rep78'"
tabstat mpg weight , by(group) s(count mean) format(%1.0f)
Summary statistics: N, mean
  by categories of: group (Repair Record 1978)

 group |       mpg    weight
-------+--------------------
     2 |         8         8
       |        19      3354
-------+--------------------
     3 |        30        30
       |        19      3299
-------+--------------------
     4 |        18        18
       |        22      2870
-------+--------------------
     5 |        11        11
       |        27      2323
-------+--------------------
 Total |        67        67
       |        21      3030
----------------------------
Key details:
The grouping variable is based not only on the means you want to sort on but also on the original grouping variable, just in case there are ties on the means.
To get ordering from highest mean downwards, the grouping variable must be negated.
tabstat doesn't show variable labels in the body of the table. (Usually there wouldn't be enough space for them.)
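The same idea (filter groups by count, then order by descending mean, breaking ties on the original group) can be sketched in a general-purpose language. The Python below is only an illustration of the logic; the records are hypothetical values invented for the example, and ordered_groups is a made-up name.

```python
from collections import defaultdict
from statistics import mean

# HYPOTHETICAL (group, value) records, invented for illustration only.
records = [
    ("A", 30), ("A", 34), ("A", 32),
    ("B", 40), ("B", 44), ("B", 42),
    ("C", 50),                         # only one record: filtered out
    ("D", 30), ("D", 34), ("D", 32),   # ties with "A" on the mean
]

def ordered_groups(rows, min_count):
    """Groups with at least min_count records, highest mean first."""
    values = defaultdict(list)
    for group, value in rows:
        values[group].append(value)

    kept = [(mean(v), g, len(v)) for g, v in values.items() if len(v) >= min_count]
    # Negating the mean gives descending order; the group name breaks ties.
    kept.sort(key=lambda item: (-item[0], item[1]))
    return [(g, n, m) for m, g, n in kept]

table = ordered_groups(records, min_count=3)
for group, count, avg in table:
    print(group, count, avg)
```

Negating the mean in the sort key plays the same role as replace group = -group in the Stata code.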

Discounting losses in SAS

I'm writing my master's thesis on the costs of occupational injuries. As a part of the thesis I have estimated the expected wage loss for each person for every year for four years after the injury. I would like to discount the estimated losses to a specific base year (2009) in SAS.
For the year 2009 the discounted loss is just equal to the estimated loss. For 2010 and onward the discounted loss can be calculated with the netpv function:
IF year=2009 then discount_loss=wage;
IF year=2010 then discount_loss=netpv(0.1,1,0,wage);
IF year=2011 then discount_loss=netpv(0.1,1,0,0,wage);
And so forth. But starting from 2014 I would like to use the estimated wage loss for 2014 as the expected loss onward, so for instance if the estimated loss is $100, that would represent the yearly loss until retirement. Since the persons don't all have the same age, there would be too many cases to hard code, so I'm looking for a better way. There are approximately 200,000 persons in my data set with different estimated losses for each year.
The format of the (fictional) data looks like this:
id age year age_retirement wage_loss rate discount_loss
1 35 2009 65 -100 0.1 -100
1 36 2010 65 -100 0.1 -90,91
1 37 2011 65 -100 0.1 -82,64
1 38 2012 65 -100 0.1 -75,13
1 39 2013 65 -100 0.1 -68,30
1 40 2014 65 -100 0.1
The column discount_loss is the net present value of the loss in 2009, calculated as above.
I would like the loss in 2014 to represent the sum of losses for the rest of the period (until age_retirement) on the labor market. That would be -$100 per year, discounted to 2009, starting from 2014 until 2014+(65-40).
Thanks!
Use the FINANCE function for PV, Present Value.
In your situation above, you're looking for the value of 100 for 25 years of payments (65-40)=25. I'll leave the calculation of the number of years up to you.
FINANCE('PV', rate, nper, payment, <fv>, <type>);
In your case, Future Value is 0 and the type=1 as you assume payment at the beginning of the year.
The formula below calculates the present value of a series of 25 annual payments of 100 with a 10% interest rate, each paid at the beginning of the period.
value=FINANCE('PV', 0.1, 25, -100, 0, 1);
Value = 998.47440201
Reference is here:
https://support.sas.com/documentation/cdl/en/lefunctionsref/67960/HTML/default/viewer.htm#p1cnn1jwdmhce0n1obxmu4iq26ge.htm
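As a cross-check on the figure above, the standard annuity formula can be evaluated by hand. The Python sketch below is not SAS and does not call FINANCE; it applies the textbook present-value formula for equal payments, with an extra factor of (1 + rate) when payments fall at the start of each period. The function name pv_annuity is made up, and the sign convention of FINANCE (negative payment, positive result) is ignored here.

```python
def pv_annuity(rate, nper, payment, due=True):
    """Present value of nper equal payments.

    due=True means each payment falls at the start of its period
    (the "type = 1" case); due=False means the end of the period.
    """
    factor = (1 - (1 + rate) ** -nper) / rate
    if due:
        factor *= 1 + rate
    return payment * factor

value = pv_annuity(0.10, 25, 100)
print(round(value, 4))  # 998.4744
```

This agrees with the value reported above to the digits shown.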
If you are looking for speed, why not first calculate an array that contains the PV of $1 for i years, where i goes from 1 to n? Then just select the element you need and multiply. This could all be done in a data step.
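A rough sketch of that lookup idea follows, in Python rather than in a SAS data step. The rate of 0.1, the base year 2009 and the loss of -100 come from the question; the function names are invented, and the choice of 25 yearly payments from 2014 follows the earlier answer, which is an assumption about how the retirement age is applied.

```python
def discount_factors(rate, n_years):
    """factors[i] is the base-year value of 1 unit paid i years later."""
    return [(1 + rate) ** -i for i in range(n_years + 1)]

def pv_constant_loss(loss, factors, first_year, last_year):
    """Base-year value of a constant yearly loss in years first_year..last_year."""
    return loss * sum(factors[first_year:last_year + 1])

# Computed once, then reused for every person.
factors = discount_factors(0.1, 40)

# Single years, as in the question's discount_loss column (2010 is 1 year out).
loss_2010 = -100 * factors[1]

# From 2014 (5 years after 2009) for 25 yearly payments (ASSUMED count,
# taken from age_retirement - age = 65 - 40).
loss_2014_onward = pv_constant_loss(-100, factors, first_year=5, last_year=29)

print(round(loss_2010, 2))
print(round(loss_2014_onward, 2))
```

For each person only first_year, last_year and the loss change, so the factors list can be built once and indexed per row.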