Faceted graph with different colors for by(group) - stata

I have two categorical variables, industry and province, for individuals in my dataset. To create a faceted graph that allows me to see how many individuals work in an industry for each province, I am using the following code:
use https://www.stata-press.com/data/r17/nlsw88.dta
twoway (histogram industry), by(occupation)
Because I have around 20 provinces, I would like to color each province in by(province) differently. I have tried twoway (graph bar industry, over(province)), to no avail.
Is it possible in Stata to color each province differently?

twoway graph bar is illegal and graph bar industry by itself would at best show the means of a numerical variable industry.
The ideal graph to show frequencies of industry by province might depend on how many distinct values of industry there were; you say 20 distinct values of province.
There is no data example here, so it is hard to guess, but tabplot from the Stata Journal allows both plotting frequencies from a two-way table and different colours.

Related

How does CfsSubsetEval (Correlation-based Feature Selection) work in Weka?

I have a categorical dataset and am using WEKA for feature selection, with CfsSubsetEval as the attribute evaluator and the GreedyStepwise search method. I learned from this link that CFS uses Pearson correlation to measure how strongly attributes are correlated. I also found out how to calculate the Pearson correlation coefficient using this link. According to that link, the data values need to be numerical. How, then, can WEKA evaluate my categorical dataset?
The strange result is that CFS selects only 10 of the 70 attributes. Is that because the dataset is categorical? Additionally, my dataset is highly imbalanced, with an imbalance ratio of 1:9 (yes:no).
A quick question
If you go through the link, you will find the statement that the correlation coefficient measures the strength and direction of the linear relationship between two numerical variables X and Y. I understand the strength of the correlation coefficient, which varies between -1 and +1, but what about the direction? How can I get that? The variable is not a vector, so it should not have a direction.
The method correlate in the CfsSubsetEval class is used to compute the correlation between two attributes. It calls other methods, depending on the attribute types, which I've linked here:
two numeric attributes: num_num
numeric/nominal attributes: num_nom2
two nominal attributes: nom_nom
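On the side question about direction: for a correlation coefficient, "direction" refers to the sign of r, not to a vector. A positive r means Y tends to rise as X rises, and a negative r means Y tends to fall. The sketch below is a plain-Python illustration with made-up values; it is not how WEKA's num_num, num_nom2, or nom_nom methods are implemented.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Co-variation of the two variables around their means (sign carries the direction).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up values for illustration only.
x = [1, 2, 3, 4, 5]
rising = [2, 4, 5, 8, 10]    # tends to increase with x
falling = [10, 8, 5, 4, 2]   # tends to decrease with x

print(pearson_r(x, rising))   # positive: same direction
print(pearson_r(x, falling))  # negative: opposite direction
```

The magnitude of the result is the strength of the linear relationship, and the sign is its direction.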

Is there a way to build a Power BI bar chart using multiple hierarchies and avoid calculating averages of averages when drilling up?

I have a dataset with columns Country, State, City, Sales. I want to build a drill-down bar chart that drills from Country to State and then City, showing the average sales. My problem is that I can't find a workaround to stop Power BI from calculating each average as the average of the immediately lower hierarchy level. Since some States have many more cities than others, the averages at the Country level are wrong, because two States with different numbers of cities are weighted equally when summarizing to the upper level.
Is there any way to define the granularity level at which averages should be calculated, or any other workaround?
Example
example dataset
For country A, I want to show the average as 16.
Currently it takes the average of States X and Y, whose averages are 17.2 and 13, giving 15.1 as a result.
Any help on how to solve this problem will be appreciated. Thanks.
avgMeasure:= CALCULATE(AVERAGE(tbl[sales]),ALLEXCEPT(tbl,tbl[country]))
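To see why the two numbers in the question differ, here is a small Python sketch comparing the average of state averages with the average over all city rows. The city-level sales figures are hypothetical (the original example dataset is not reproduced here); they were chosen only so that the state averages match the 17.2 and 13 quoted in the question.

```python
from statistics import mean

# Hypothetical city-level sales for country A, chosen so that the state
# averages match the question (X = 17.2, Y = 13).
sales = {
    "X": [15, 16, 17, 18, 20],  # five cities
    "Y": [12, 14],              # two cities
}

# Average of averages: each state counts once, however many cities it has.
state_averages = {state: mean(values) for state, values in sales.items()}
average_of_averages = mean(list(state_averages.values()))

# Row-level average: each city counts once, so larger states weigh more.
all_rows = [value for values in sales.values() for value in values]
row_level_average = mean(all_rows)

print(average_of_averages)
print(row_level_average)
```

With these figures the average of averages is about 15.1, while the row-level average is 16, the value the question asks for. A measure that averages the ungrouped rows within the country avoids the equal weighting of states.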

Power BI - Detect Outliers and the best way to show it

I have my data like this,
Country Value
USA 100
USA 120
USA 200
UK 200
UK 210
UK 400
I need to detect outliers for each country and show them in a visual.
I tried using the box plots (Country Vs Value), but I have nearly around 3M rows and it crashes. Any suggestions on how to solve this issue in a better way would be appreciated.
Power BI has had anomaly detection since the November 2020 update.
If it's choking on the size of the data, then it might help to define an aggregated table where you group it at the level needed for the visual so that, e.g., you only have as many rows as countries for your example.
If the values within each country are roughly normally distributed, it may help to filter on the rows whose Z-score, computed against the mean and standard deviation for country(i), is greater than 3 in absolute value.
The Z-score measures how many standard deviations a data point lies from its mean (X̄). Once the outliers for each country are identified, you can draw a scatter plot showing all data points in the same plot, with a different color for each country.
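The Z-score rule can be sketched in plain Python as follows. This is an illustration of the idea rather than Power BI code: the function name flag_outliers, the threshold default, and the sample rows are all assumptions made for the example, and it uses the population standard deviation.

```python
from collections import defaultdict
from statistics import mean, pstdev

def flag_outliers(rows, threshold=3.0):
    """Return (country, value) pairs whose within-country |z| exceeds threshold."""
    by_country = defaultdict(list)
    for country, value in rows:
        by_country[country].append(value)

    outliers = []
    for country, values in by_country.items():
        mu = mean(values)
        sigma = pstdev(values)  # population standard deviation (an assumption)
        if sigma == 0:
            continue  # all values identical, nothing to flag
        for value in values:
            if abs(value - mu) / sigma > threshold:
                outliers.append((country, value))
    return outliers

# Hypothetical data: twenty ordinary values and one extreme value per country.
rows = [("USA", 100 + i % 5) for i in range(20)] + [("USA", 1000)]
rows += [("UK", 200 + i % 5) for i in range(20)] + [("UK", 2000)]

print(flag_outliers(rows))
```

Note that the rule needs enough rows per group to be meaningful: with only three values per country, as in the six-row sample in the question, no value can reach a Z-score of 3, so nothing would be flagged.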

How do I remove the leftmost zero (on the x-axis) when graphing a categorical variable?

hist body, discrete freq xlabel(#5, labsize(small) angle(forty_five) valuelabel) produces:
I'm graphing a categorical variable, but I can't figure out how to drop the zero from the x-axis. I've tried the documentation for xlabel() and xscale() but didn't find any winners.
The short answer is to spell out that you only want xla(1/5, stuff ). How to spell out precisely which labels you want is documented.
Not the question, but this is in my view a poor graph. Go with a horizontal bar chart in which (1) the discreteness of the variable is respected; (2) the category labels are properly and readably horizontal, instead of using a most awkward device of text at 45 degrees. catplot (SSC) is one way to go. Also, in Stata 13 (updated) upwards, graph hbar will do as well. You should also split the title into two lines. Even further off-topic: most consumers of this research should not care two hoots about the variable name or its question number in your survey.

Population-weighted center of a state

I have a list of states, major cities in each state, their populations, and lat/long coordinates for each. Using this, I need to calculate the latitude and longitude that corresponds to the center of a state, weighted by where the population lives.
For example, if a state has two cities, A (population 100) and B (population 200), I want the coordinates of the point that lies 2/3rds of the way between A and B.
I'm using the SAS dataset that comes installed called maps.uscity. It also has some variables called "Projected Longitude/Latitude from Radians", which I think might allow me to just take a simple average of the numbers, but I'm not sure how to get them back into unprojected coordinates.
More generally, if anyone can suggest a straightforward approach to calculating this, it would be much appreciated.
The Census Bureau has actually done these calculations, and posted the results here: http://www.census.gov/geo/www/cenpop/statecenters.txt
Details on the calculation are in this pdf: http://www.census.gov/geo/www/cenpop/calculate2k.pdf
To answer the question that was asked, it sounds like you might be looking for a weighted mean. Just use PROC MEANS and take a weighted average of each coordinate:
/* data from http://www.world-gazetteer.com/ */
data AL;
/* :$10. reads city as a character value of up to 10 characters */
input city :$10. pop lat lon;
datalines;
Birmingham 242452 33.53 86.80
Huntsville 159912 34.71 86.63
Mobile 199186 30.68 88.09
Montgomery 201726 32.35 86.28
;
proc means data=AL;
weight pop;
var lat lon;
run;
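For readers without SAS, the same weighted mean can be sketched in Python using the four Alabama cities from the data step above. The function name weighted_center is an assumption for this example. Like the PROC MEANS approach, it treats latitude and longitude as plain numbers to be averaged, which is an approximation over an area the size of a state.

```python
# City data copied from the SAS example: (city, population, lat, lon).
cities = [
    ("Birmingham", 242452, 33.53, 86.80),
    ("Huntsville", 159912, 34.71, 86.63),
    ("Mobile",     199186, 30.68, 88.09),
    ("Montgomery", 201726, 32.35, 86.28),
]

def weighted_center(rows):
    """Population-weighted mean of latitude and longitude."""
    total_pop = sum(pop for _, pop, _, _ in rows)
    center_lat = sum(pop * lat for _, pop, lat, _ in rows) / total_pop
    center_lon = sum(pop * lon for _, pop, _, lon in rows) / total_pop
    return center_lat, center_lon

center_lat, center_lon = weighted_center(cities)
print(center_lat, center_lon)
```

The longitudes are kept as positive numbers, exactly as in the SAS data; for western longitudes stored as negative values the calculation is unchanged.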
Itzy's answer is correct. The US Census's lat/lng centroids are based on population. In contrast, the USGS GNIS data's lat/lng averages are based on administrative boundaries.
The files referenced by Itzy are the 2000 US Census data. The Census Bureau is in the process of rolling out the 2010 data. The following link is a segue to all of this data.
http://www.census.gov/geo/www/tiger/
I can answer a lot of geospatial questions. I am part of a public domain geospatial team at OpenGeoCode.Org
I believe you can do this using the same method used for calculating the center of gravity of an airplane:
Establish a reference point southwest of any part of the state. Actually it doesn't matter where the reference point is, but putting it SW will keep all numbers positive in the usual x-y sense we tend to think of things.
Logically extend N-S and E-W lines from this point.
Also extend such lines from the cities.
For each city get the distance from its lines to the reference lines. These are the moment arms.
Multiply each of the distance values by the population of the city. Effectively you're getting the moment for each city.
Add all of the moments.
Add all of the populations.
Divide the total of the moments by the total of the populations and you have the center of gravity of the populations involved, with respect to the reference point.
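The steps above can be sketched in Python as follows. The function name center_of_gravity and the planar x-y coordinates are assumptions for illustration; the two cities reuse the populations from the question (A = 100, B = 200) with hypothetical positions.

```python
def center_of_gravity(cities, reference=(0.0, 0.0)):
    """cities: list of (x, y, population). Returns the population center (x, y)."""
    ref_x, ref_y = reference
    # Moment of each city = moment arm (distance from the reference lines) * population.
    moment_x = sum((x - ref_x) * pop for x, _, pop in cities)
    moment_y = sum((y - ref_y) * pop for _, y, pop in cities)
    total_pop = sum(pop for _, _, pop in cities)
    # Total moment / total population is the offset from the reference point;
    # adding the reference back gives absolute coordinates.
    return ref_x + moment_x / total_pop, ref_y + moment_y / total_pop

# Two-city example; the coordinates are hypothetical.
cities = [(0.0, 0.0, 100), (3.0, 3.0, 200)]

print(center_of_gravity(cities))
print(center_of_gravity(cities, reference=(-1.0, -1.0)))
```

Both calls give (2.0, 2.0), the point two thirds of the way from A to B, which shows that the choice of reference point does not affect the result.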