I want to create a decision tree on this data with the SAS procedure HPSPLIT:
|   | city       | studied_area     | salary |
|---|------------|------------------|--------|
| 1 | manchester | biology          | 40000  |
| 2 | london     | computer science | 50000  |
| 3 | reading    | computer science | 45000  |
Each row represents a person. The variables are the city where they got their degree, their area of study, and their current salary.
I want to create a decision tree using the first two variables to guess the salary variable.
I started by doing this :
proc hpsplit data=lib1.wagesdata seed=15531;
class salary city studied_area;
model salary = city studied_area;
grow entropy;
prune costcomplexity;
run;
I used this doc : https://support.sas.com/documentation/onlinedoc/stat/141/hpsplit.pdf
But I get the following errors:
ERROR: Character variable appeared on the MODEL statement without appearing on a CLASS statement.
ERROR: Unable to create a usable predictor variable set.
Can you explain why this happens and how I can fix it?
UPDATE: I simply added all the variables to the CLASS statement to avoid the errors, which is odd because the first example in the doc doesn't do that.
I also added a format for my salary because the output wasn't understandable.
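For reference, the entropy criterion named in the GROW statement scores a candidate split by the weighted Shannon entropy of the child nodes. A minimal Python sketch (illustrative only, not SAS) using the three sample rows, with salary treated as a class label as in the updated CLASS statement:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def split_entropy(groups):
    """Weighted average entropy of the child nodes after a split."""
    total = sum(len(g) for g in groups)
    return sum(len(g) / total * entropy(g) for g in groups)

# With the three sample rows, the root holds three distinct salary classes.
root = [40000, 50000, 45000]

# Splitting on city gives three pure one-person nodes (weighted entropy 0);
# splitting on studied_area leaves a mixed computer-science node.
by_city = [[40000], [50000], [45000]]
by_area = [[40000], [50000, 45000]]

print(entropy(root))            # log2(3) ~ 1.585 bits
print(split_entropy(by_city))   # 0.0
print(split_entropy(by_area))
```

With only three rows the tree is degenerate, of course; this just shows the quantity being minimized at each split.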
I'm using kmatch in Stata. The reason why I use kmatch is to use the command ematch to match exactly on a specific variable in addition to the propensity score matching. Here is my code:
kmatch ps treatment age sex edu (outcome), ematch(level) comsup
I think kmatch is different from pscore and psmatch2 in that propensity scores will not be automatically stored in the dataset. I wonder if there is a way to save these propensity scores and to check which individuals are included in the matched sample.
The answer is in the help file, help kmatch. Add generate[(spec)] as an option to store the propensity scores as _KM_ps. Other helpful matching results also have the _KM_ prefix. wgenerate[(spec)] generates variables containing the ready-to-use matching weights. idgenerate[(prefix)] generates variables containing the IDs (observations numbers) of the matched controls.
Here is an example.
webuse cattaneo2, clear
kmatch ps mbsmoke mmarried mage fbaby medu (bweight), ///
generate(kpscore) wgenerate(_matched) idgenerate(_controlid) ate
Try this to compare results from kmatch and teffects psmatch, keeping only the propensity scores from each.
webuse cattaneo2, clear
tempfile temp1 temp2
keep mbsmoke mmarried mage fbaby medu bweight
gen id = _n
save `temp1', replace
teffects psmatch (bweight) (mbsmoke mmarried mage fbaby medu), ///
    ate generate(_pscore)
predict te_pscore, ps
keep te_pscore id
replace te_pscore = 1 - te_pscore
save `temp2', replace
use `temp1'
kmatch ps mbsmoke mmarried mage fbaby medu (bweight), generate(kpscore) ate
rename _KM_ps k_pscore
keep k_pscore id
merge 1:1 id using `temp2'
drop _merge
list in 1/10
+---------------------------+
| id k_pscore te_psc~e |
|---------------------------|
1. | 1 .13229635 .1322963 |
2. | 2 .4204439 .4204439 |
3. | 3 .22490795 .2249079 |
4. | 4 .16333027 .1633303 |
5. | 5 .11024706 .1102471 |
|---------------------------|
6. | 6 .25395923 .2539592 |
7. | 7 .16283038 .1628304 |
8. | 8 .10881813 .1088181 |
9. | 9 .10988829 .1098883 |
10. | 10 .11608692 .1160869 |
+---------------------------+
I am trying to do a trivial task with Power BI Desktop. I have the following kind of data
| Name   | Min | Max    | Average | Median |
|--------|-----|--------|---------|--------|
| team A | 0   | 3,817  | 120     | 120    |
| team B | -10 | 1,050  | 25      | 89     |
| team C | 5   | 14,320 | 50      | 48     |
I want to create my own horizontal line with pre-defined (Start, End) points and plot on it, for each team name, the values of Min, Max, Average, and Median. When I filter by team name, the numbers and the visual should adjust accordingly.
So far I have taken the following static approach:
The example above is completely static because every point on the line is set by me. Also, if I select Team B, whose median is higher than its average, the visual does not change the position of the relative spheres (in the image I posted, I have always placed the average higher than the median, which is not true for all teams).
Thus, I would like to know whether there is any clean, well-plotted way to represent those four descriptive measures for a team name on a horizontal line that responds when I select a different team. As I noted on the attached image, the card visuals change when I change the team name, but the spheres do not move across the line.
My desired output
For Team B
While for Team C
I literally don't know if this is feasible in Power BI apart from the static approach I already did. Thank you in advance.
Regards.
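Whatever visual ends up being used, the underlying arithmetic is just normalizing each measure onto the line's (Start, End) interval. A hedged Python sketch of that mapping (the visual itself still has to be built in Power BI; this only shows where each sphere should land):

```python
def positions(stats, start=0.0, end=1.0):
    """Map Min/Max/Average/Median onto a (start, end) segment so each
    sphere's position is computed from the data instead of hard-coded."""
    lo, hi = stats["Min"], stats["Max"]
    return {name: start + (value - lo) / (hi - lo) * (end - start)
            for name, value in stats.items()}

# Team B from the table: the median (89) sits to the RIGHT of the average (25),
# which is exactly the reordering a static visual cannot do.
team_b = {"Min": -10, "Max": 1050, "Average": 25, "Median": 89}
for name, pos in positions(team_b).items():
    print(f"{name}: {pos:.3f}")
```

In Power BI terms, a measure computing this normalized position per selected team, fed to a scatter-style visual, would make the spheres move with the slicer.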
Basic premise:
'Orders' are comprised of items from multiple 'Zones'.
Customers can call in for 'Credits' (refunds) on botched 'Orders'.
There is a true many-to-many relationship here, because one order could have multiple credits called in at different times; similarly, a customer can call in once regarding multiple orders (generating only one credit memo).
'Credits' granularity is at the item level, i.e.
CREDIT | SO | ITEM | ZONE | CREDAMT
-------------------------------------------------------
42 | 1 | 56 | A | $6
42 | 1 | 52 | A | $8
42 | 1 | 62 | B | $20
42 | 2 | 56 | A | $12
'Order Details' granularity is at the zone level, i.e.
SO | ZONE | DOL_AMT
-------------------------------
1 | A | $50
1 | B | $20
1 | C | $100
2 | A | $26
I have a 'Zone' filter table that helps me sort things better and roll up into broader categories, i.e.
ZONE | TEMP | SORT
-------------------------------
A | DRY | 2
B | COLD | 3
C | DRY | 1
What I need:
I want a pair of visuals for a side by side comparison of order total by zone next to credit total by zone.
What's working:
The 'Credits' component is easy: CreditTotal = ABS(SUMX(Credits, Credits[CREDAMT])), with Zone as a legend item.
I have an Orders component that works when the zone appears in the credit memo:
Order $ by Zone =
CALCULATE (
SUM ( 'Order Details'[DOL_AMT] ),
USERELATIONSHIP ( 'Order Details'[SO], Credits[SO] ),
ALL ( Credits[CreditCategory] )
)
My issue:
Zones that didn't have a credit against them won't show up. So instead of
CREDIT | ZONE | ORDER $ BY ZONE
----------------------------------
42 | A | $76
42 | B | $20
42 | C | $100
I get
CREDIT | ZONE | ORDER $ BY ZONE
----------------------------------
42 | A | $76
42 | B | $20
I have tried to remove this filter by tacking on ALL(Zones[Zone]) and/or ALL('Order Details'[Zone]), but it doesn't help, presumably because it is reporting "all zones" actually found in the 'Credits' table. I'm hoping there's some way to ask it to report all zones in the 'Order Details' table based upon SOs in the 'Credits' table.
In case it helps, here's how the relationships are structured; as an aside, I've tried mixing and matching various combinations of active/inactive, single vs. bidirectional filtering, etc., but the current configuration is the only one that seems to remotely work as desired.
I'm grateful for any suggestions; please let me know if anything is unclear. Thank you.
I was able to get it to work by using 'Order Details'[Zone] rather than Zones[Zone] in the table visual and this measure:
Order $ by Zone =
CALCULATE (
SUM ( 'Order Details'[DOL_AMT] ),
USERELATIONSHIP ( 'Order Details'[SO], Credits[SO] )
)
Notice that regardless of your measure, there is no row in Credits corresponding to zone C, so it doesn't know what to put in the CREDIT column unless you tell it exactly how.
If you remove the CREDIT dimension column, then you don't need to swap tables as I suggested above. You can just use the measure above and then write a new measure for the CREDIT column instead:
CreditValue =
CALCULATE(
VALUES(Credits[CREDIT]),
ALL(Credits),
Credits[SO] IN VALUES('Order Details'[SO])
)
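Stepping outside DAX for a moment, the set being asked for ("every zone of every order that has a credit, with order dollars summed, whether or not that zone was itself credited") is an ordinary join. A Python/sqlite3 sketch over the sample rows reproduces the desired table, including zone C:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE credits (credit INT, so INT, item INT, zone TEXT, credamt INT);
INSERT INTO credits VALUES (42,1,56,'A',6),(42,1,52,'A',8),
                           (42,1,62,'B',20),(42,2,56,'A',12);
CREATE TABLE order_details (so INT, zone TEXT, dol_amt INT);
INSERT INTO order_details VALUES (1,'A',50),(1,'B',20),(1,'C',100),(2,'A',26);
""")

# For each credit memo, keep every zone of the orders it touches,
# even zones with no credit line (zone C appears with order $100).
rows = con.execute("""
SELECT c.credit, od.zone, SUM(od.dol_amt) AS order_amt
FROM (SELECT DISTINCT credit, so FROM credits) AS c
JOIN order_details AS od ON od.so = c.so
GROUP BY c.credit, od.zone
ORDER BY od.zone
""").fetchall()
print(rows)  # [(42, 'A', 76), (42, 'B', 20), (42, 'C', 100)]
```

This is the relationship the accepted fix emulates by driving the visual from 'Order Details'[Zone] instead of Credits.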
I would like to have a 3-way table displaying column or row percentages using three categorical variables. The command below gives the counts but I cannot find how to get percentages instead.
sysuse nlsw88
table married race collgrad, col
--------------------------------------------------------------------
| college graduate and race
| ---- not college grad ---- ------ college grad ------
married | white black other Total white black other Total
----------+---------------------------------------------------------
single | 355 256 5 616 132 53 3 188
married | 862 224 12 1,098 288 50 6 344
--------------------------------------------------------------------
How can I get percentages?
This answer will show a miscellany of tricks. The downside is that I don't know an easy way to get exactly what you ask. The upside is that all these tricks are easy to understand and often useful.
Let's use your example, which is excellent for the purpose.
. sysuse nlsw88, clear
(NLSW, 1988 extract)
Tip #1 You can calculate a percent variable for yourself. I focus on % single. In this data set married is binary, so I won't show the complementary percent.
Once you have calculated it, you can (a) rely on the fact that it is constant within the groups used to define it and (b) tabulate it directly. I find that tabdisp is underrated by users. It's billed as a programmer's command, but it is not difficult to use at all. tabdisp lets you set a display format on the fly; it also does no harm, and may be useful for other commands, to assign one directly with format.
. egen pcsingle = mean(100 * (1 - married)), by(collgrad race)
. tabdisp collgrad race, c(pcsingle) format(%2.1f)
--------------------------------------
| race
college graduate | white black other
-----------------+--------------------
not college grad | 29.2 53.3 29.4
college grad | 31.4 51.5 33.3
--------------------------------------
. format pcsingle %2.1f
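The egen group-mean trick is not Stata-specific; a small Python sketch of the same computation on made-up rows (collgrad, race, married as 0/1, standing in for the nlsw88 variables):

```python
from collections import defaultdict

# Made-up records: (collgrad, race, married).
rows = [
    ("not college grad", "white", 1), ("not college grad", "white", 0),
    ("not college grad", "black", 0), ("college grad", "white", 1),
    ("college grad", "white", 1), ("college grad", "black", 0),
]

groups = defaultdict(list)
for collgrad, race, married in rows:
    groups[(collgrad, race)].append(married)

# egen pcsingle = mean(100 * (1 - married)), by(collgrad race)
pcsingle = {cell: 100 * sum(1 - m for m in ms) / len(ms)
            for cell, ms in groups.items()}
for cell, pc in sorted(pcsingle.items()):
    print(cell, f"{pc:.1f}")
```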
Tip #2 A user-written command groups offers different flexibility. groups can be installed from SSC (strictly, must be installed before you can use it). It's a wrapper for various kinds of tables, but using list as a display engine.
. * do this installation just once
. ssc inst groups
. groups collgrad race pcsingle
+-------------------------------------------------------+
| collgrad race pcsingle Freq. Percent |
|-------------------------------------------------------|
| not college grad white 29.2 1217 54.19 |
| not college grad black 53.3 480 21.37 |
| not college grad other 29.4 17 0.76 |
| college grad white 31.4 420 18.70 |
| college grad black 51.5 103 4.59 |
|-------------------------------------------------------|
| college grad other 33.3 9 0.40 |
+-------------------------------------------------------+
We can improve on that. We can set up better header text using characteristics. (In practice, these can be less constrained than variable names but often need to be shorter than variable labels.) We can use separators by calling up standard list options.
. char pcsingle[varname] "% single"
. char collgrad[varname] "college?"
. groups collgrad race pcsingle , subvarname sepby(collgrad)
+-------------------------------------------------------+
| college? race % single Freq. Percent |
|-------------------------------------------------------|
| not college grad white 29.2 1217 54.19 |
| not college grad black 53.3 480 21.37 |
| not college grad other 29.4 17 0.76 |
|-------------------------------------------------------|
| college grad white 31.4 420 18.70 |
| college grad black 51.5 103 4.59 |
| college grad other 33.3 9 0.40 |
+-------------------------------------------------------+
Tip #3 Wire display formats into a variable by making a string equivalent. I don't illustrate this fully, but I often use it when I want to combine a display of counts with numerical results with decimal places in tabdisp. format(%2.1f) and format(%3.2f) might do fine for most variables (and incidentally the important detail is the number of decimal places) but they would lead to a display of a count of 42 as 42.0 or 42.00, which would look pretty silly. The format() option of tabdisp does not reach into the string and change the contents; it doesn't even know what the string variable contains or where it came from. So, strings just get shown by tabdisp as they come, which is what you want.
. gen s_pcsingle = string(pcsingle, "%2.1f")
. char s_pcsingle[varname] "% single"
groups has an option to save what is tabulated as a fresh dataset.
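The idea in Tip #3, freezing the numeric format into a string so the display layer cannot mangle counts, carries over directly; a quick Python analogue:

```python
pcsingle = 29.235   # a percent with decimals
count = 42          # an integer count

# A single numeric display format would hit both kinds of values:
assert "{:.1f}".format(count) == "42.0"   # a count shown as 42.0 looks silly

# Wiring the format into a string per variable freezes each one's look,
# mirroring: gen s_pcsingle = string(pcsingle, "%2.1f")
s_pcsingle = f"{pcsingle:.1f}"
s_count = str(count)
print(s_pcsingle, s_count)   # 29.2 42
```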
Tip #4 To have a total category, temporarily double up the data. The clone of the original is relabelled as a Total category. You may need to do some extra calculations, but nothing there amounts to rocket science: a smart high school student could figure it out. Here a concrete example for line-by-line study beats lengthy explanations.
. preserve
. local Np1 = _N + 1
. expand 2
(2,246 observations created)
. replace race = 4 in `Np1'/L
(2,246 real changes made)
. label def racelbl 4 "Total", modify
. drop pcsingle
. egen pcsingle = mean(100 * (1 - married)), by(collgrad race)
. char pcsingle[varname] "% single"
. format pcsingle %2.1f
. gen istotal = race == 4
. bysort collgrad istotal: gen total = _N
. * for percents of the global total, we need to correct for doubling up
. scalar alltotal = _N/2
. * the table shows percents for college & race | collgrad and for collgrad | total
. bysort collgrad race : gen pc = 100 * cond(istotal, total/alltotal, _N/total)
. format pc %2.1f
. char pc[varname] "Percent"
. groups collgrad race pcsingle pc , show(f) subvarname sepby(collgrad istotal)
+-------------------------------------------------------+
| college? race % single Percent Freq. |
|-------------------------------------------------------|
| not college grad white 29.2 71.0 1217 |
| not college grad black 53.3 28.0 480 |
| not college grad other 29.4 1.0 17 |
|-------------------------------------------------------|
| not college grad Total 35.9 76.3 1714 |
|-------------------------------------------------------|
| college grad white 31.4 78.9 420 |
| college grad black 51.5 19.4 103 |
| college grad other 33.3 1.7 9 |
|-------------------------------------------------------|
| college grad Total 35.3 23.7 532 |
+-------------------------------------------------------+
Note the extra trick of using a variable not shown explicitly to add separator lines.
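The doubling trick also translates to any language: append a copy of the data with the group variable recoded to a "Total" level, then compute the group statistic once. A Python sketch with toy data:

```python
from collections import defaultdict

# Toy rows (race, married); the appended clone is relabelled "Total",
# mirroring: expand 2 / replace race = 4 / label def racelbl 4 "Total"
rows = [("white", 1), ("white", 0), ("black", 0), ("black", 0)]
doubled = rows + [("Total", married) for _, married in rows]

cells = defaultdict(list)
for race, married in doubled:
    cells[race].append(100 * (1 - married))
pcsingle = {race: sum(v) / len(v) for race, v in cells.items()}
print(pcsingle)   # the "Total" cell is simply the overall mean
```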
I am parsing the USDA's food database and storing it in SQLite for query purposes. Each food has associated with it the quantities of the same 162 nutrients. It appears that the list of nutrients (name and units) has not changed in quite a while, and since this is a hobby project I don't expect to follow any sudden changes anyway. But each food does have a unique quantity associated with each nutrient.
So, how does one go about storing this kind of information sanely? My priorities are being friendly to multiple programming languages (Python and C++ having preference), sanity for me as the coder, and ease of retrieving nutrient sets to sum or plot over time.
The two things I had thought of so far were 162 columns (which I'm not particularly fond of, but it does make the queries simpler), or a food table that links to a nutrient_list table, which in turn links to a static table with the nutrient names and units. The second seems more flexible in case my expectations are wrong, but I wouldn't even know where to begin writing the queries for sums and time series.
Thanks
You should read up a bit on database normalization. Most of the normalization material is quite intuitive, but actually going through the definitions of the steps and seeing an example helps you understand the concepts, and it will help you greatly if you want to design a database in the future.
As for this problem, I would suggest you use 3 tables: one for the foods (let's call it foods), one for the nutrients (nutrients), and one for the specific nutrients of each food (foods_nutrients).
The foods table should have a unique index for referencing and the food's name. If the food has other data associated to it (maybe a link to a picture or a description), this data should also go here. Each separate food will get a row in this table.
The nutrients table should also have a unique index for referencing and the nutrient's name. Each of your 162 nutrients will get a row in this table.
Then you have the crossover table containing the nutrient values for each food. This table has three columns: food_id, nutrient_id and value. Each food gets 162 rows in this table, one for each nutrient.
This way, you can add or delete nutrients and foods as you like and query everything independent of programming language (well, using SQL, but you'll have to use that anyway :) ).
Let's try an example. We have 2 foods in the foods table and 3 nutrients in the nutrients table:
+------------------+
| foods |
+---------+--------+
| food_id | name |
+---------+--------+
| 1 | Banana |
| 2 | Apple |
+---------+--------+
+-------------------------+
| nutrients |
+-------------+-----------+
| nutrient_id | name |
+-------------+-----------+
| 1 | Potassium |
| 2 | Vitamin C |
| 3 | Sugar |
+-------------+-----------+
+-------------------------------+
| foods_nutrients |
+---------+-------------+-------+
| food_id | nutrient_id | value |
+---------+-------------+-------+
| 1 | 1 | 1000 |
| 1 | 2 | 12 |
| 1 | 3 | 1 |
| 2 | 1 | 3 |
| 2 | 2 | 7 |
| 2 | 3 | 98 |
+---------+-------------+-------+
Now, to get the potassium content of a banana, you'd query:
SELECT foods_nutrients.value
FROM foods_nutrients, foods, nutrients
WHERE foods_nutrients.food_id = foods.food_id
AND foods_nutrients.nutrient_id = nutrients.nutrient_id
AND foods.name = 'Banana'
AND nutrients.name = 'Potassium';
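Since Python is one of the target languages, here is the whole example exercised end-to-end with the standard-library sqlite3 module (same tables and values as above; the query is written with explicit JOINs but is equivalent):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE foods (food_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE nutrients (nutrient_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE foods_nutrients (
    food_id INTEGER REFERENCES foods(food_id),
    nutrient_id INTEGER REFERENCES nutrients(nutrient_id),
    value REAL,
    PRIMARY KEY (food_id, nutrient_id)
);
INSERT INTO foods VALUES (1, 'Banana'), (2, 'Apple');
INSERT INTO nutrients VALUES (1, 'Potassium'), (2, 'Vitamin C'), (3, 'Sugar');
INSERT INTO foods_nutrients VALUES
    (1,1,1000),(1,2,12),(1,3,1),(2,1,3),(2,2,7),(2,3,98);
""")

(potassium,) = con.execute("""
SELECT fn.value
FROM foods_nutrients AS fn
JOIN foods     ON fn.food_id = foods.food_id
JOIN nutrients ON fn.nutrient_id = nutrients.nutrient_id
WHERE foods.name = 'Banana' AND nutrients.name = 'Potassium'
""").fetchone()
print(potassium)  # 1000.0
```

Sums per food (for plotting or totals) are just a GROUP BY fn.food_id away on the same schema.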
Use the second (more normalized) approach.
You could even get away with fewer tables than you mentioned:
tblNutrients
-- NutrientID
-- NutrientName
-- NutrientUOM (unit of measure)
-- Otherstuff
tblFood
-- FoodId
-- FoodName
-- Otherstuff
tblFoodNutrients
-- FoodID (FK)
-- NutrientID (FK)
-- UOMCount
It will be a nightmare to maintain a 160+ field database.
If there is a time element involved too (can measurements change?) then you could add a date field to the nutrient and/or the foodnutrient table depending on what could change.