I have a dataset that combines 14 household surveys from 14 countries. The surveys were conducted in different years, and each has a household weight variable that is specific to that country's context (the data structure is the same across all 14 countries).
I merged them and tried to cross-tabulate country against gender_area (four values: male_rural, female_rural, male_urban, female_urban) with weights (tab country gender [aw=hhweight], m). But this cross-tabulation produces strange values for some of the countries.
For example, if I add an if condition at the end of the command (tab country gender [aw=hhweight] if abc==1, m), the row totals for some countries (KHM, NPL) are greater than their row totals without the condition, even though a condition can only select a smaller subsample. Without weights (tab country gender, m) there is no such problem, and there is no such problem if I tabulate a single country with weights.
So I wonder if there is any way for me to compare all countries using weights. I am not that familiar with the survey data commands in Stata (svyset, strata, etc.).
I tried to refer to the book Applied Survey Data Analysis, but it does not seem to cover a method for dealing with such a combination of surveys.
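One possible explanation, offered as an assumption and not as a diagnosis of this dataset: Stata rescales analytic weights so that they sum to the number of observations in the sample being tabulated, so the rescaling factor changes whenever an if condition changes the sample. The Python sketch below mimics that normalization with made-up weights for two hypothetical countries (A and B) whose weights are on very different scales. The function name and the data are illustrative only.

```python
from collections import defaultdict

def rescaled_row_totals(rows):
    """Mimic aweight-style normalization: scale the weights so they sum
    to the number of rows, then add the scaled weights within country."""
    n = len(rows)
    total_weight = sum(weight for _, weight, _ in rows)
    scale = n / total_weight
    totals = defaultdict(float)
    for country, weight, _ in rows:
        totals[country] += weight * scale
    return dict(totals)

# Hypothetical data: (country, household weight, abc flag).
data = [
    ("A", 1000.0, 0), ("A", 1000.0, 0), ("A", 1000.0, 1),
    ("B", 1.0, 1), ("B", 1.0, 1), ("B", 1.0, 1),
]

full = rescaled_row_totals(data)
subset = rescaled_row_totals([row for row in data if row[2] == 1])

# Country B's rescaled total is larger in the subsample because the
# normalization factor changed, not because B gained observations.
print(full["B"], subset["B"])
```

If this is what is happening, the things to look at would be weights that are not rescaled, or weights rescaled within each country before pooling. Whether either is appropriate depends on how each survey's weights were constructed.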
I'm practicing my Power BI skills. I've downloaded a CSV file that contains data about the Olympic Games. The dataset has many columns, such as country, athlete name, year, sport, event, the medal the athlete won, Olympic city, etc.
The problem is that I want to create a bar graph that displays the count of each medal type by country. However, if I create a "Country" by "Medal" graph from the original CSV, it does not display the correct number of medals: when a country wins a medal in a team sport (like volleyball or football), it should count as only one medal, not as the sum of the medals of all athletes on that team. This could be solved by removing the Athlete column and selecting distinct values of the "Event" column, for example by creating a table with the following formula:
Table 2 = CALCULATETABLE(ALLEXCEPT('summer (3)','summer (3)'[Athlete]),DISTINCT('summer (3)'[Event]))
However, I don't want to create a new table, because I would have serious problems with relationship between them (I have no idea how to do it, to be honest). So I want to create a measure. I created the following measure:
Medal count = COUNTX(CALCULATETABLE(ALLEXCEPT('summer (3)','summer (3)'[Athlete]),DISTINCT('summer (3)'[Event])),'summer (3)'[Medal])
It shows the correct total number of medals in Olympic Games history (until 2012). However, for every country it shows the same number for gold, silver, and bronze medals (the total number of Olympic medals, 14753). It is not filtering down to the rows for that specific country.
The same number also appears if I select any medal type from the filter option (Gold, Silver or Bronze).
I have no idea how to fix this. How can I create a measure that shows the correct count of each medal type for every country?
This is what I would do: first create an "id" column if you don't already have one, then do a distinct count on it.
The DAX for the id column should be something like this:
debug_id = CONCATENATE('Table'[Year],CONCATENATE('Table'[Sport],CONCATENATE('Table'[Discipline],CONCATENATE('Table'[Country],CONCATENATE('Table'[Event],'Table'[Medal])))))
Then you can drag and drop this field onto the x-axis (the y-axis and legend stay the same) and select Count (Distinct). If you really want a measure for this, it should be quite straightforward:
Medals count = DISTINCTCOUNT('Table'[debug_id])
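The idea behind the id column can be sketched outside Power BI. The Python snippet below is a minimal illustration, assuming rows with the same column names as in the formula above. The sample rows and the function name are hypothetical. It builds a key from every column except Athlete and counts distinct keys per country and medal.

```python
def medal_counts(rows):
    """Count distinct medal-winning events per (country, medal)."""
    seen = set()
    for row in rows:
        # Athlete is deliberately left out of the key, so a whole team
        # collapses to a single medal.
        seen.add((row["Year"], row["Sport"], row["Discipline"],
                  row["Country"], row["Event"], row["Medal"]))
    counts = {}
    for _, _, _, country, _, medal in seen:
        counts[(country, medal)] = counts.get((country, medal), 0) + 1
    return counts

# Hypothetical rows: two athletes on the same team, one individual event.
rows = [
    {"Year": 2012, "Sport": "Volleyball", "Discipline": "Volleyball",
     "Country": "BRA", "Event": "Women", "Medal": "Gold", "Athlete": "Athlete 1"},
    {"Year": 2012, "Sport": "Volleyball", "Discipline": "Volleyball",
     "Country": "BRA", "Event": "Women", "Medal": "Gold", "Athlete": "Athlete 2"},
    {"Year": 2012, "Sport": "Judo", "Discipline": "Judo",
     "Country": "BRA", "Event": "Women -48 kg", "Medal": "Gold", "Athlete": "Athlete 3"},
]

counts = medal_counts(rows)
print(counts[("BRA", "Gold")])
```

A tuple is used as the key here because plain string concatenation without a separator can make two different combinations look identical. Adding a delimiter between the parts of the concatenated id would guard against the same ambiguity.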
How can I identify groups with few observations in panel-data models?
I estimated several random-effects models using xtlogit. On average I have 26 observations per group, but some groups have only 1 observation. I want to identify those groups and exclude them from the models. Any suggestions on how to do this?
My panel data is set using: xtset countrycode year
Let's suppose your magic number for a big enough panel is 7 and that you fit a first model.
bysort countrycode : egen n_used = total(e(sample))
then gives you a count of how many observations were available and can be used, after which your criterion for a later model is if n_used >= 7
You could just go
bysort countrycode : gen n_available = _N
regardless of a model fit.
The differences are two-fold:
The last statement (using _N) ignores missing values in the variables used in a model fit, so it also counts observations that the model would drop.
If you also used if and/or in to restrict the model fit to particular subsets of observations, then e(sample) knows about that, but the _N statement does not.
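The difference between the two counts can be illustrated with a small Python sketch. It assumes hypothetical rows and variable names (y, x) and treats None as a missing value. The "used" count skips rows with a missing model variable, roughly as e(sample) would, while the "available" count takes every row.

```python
from collections import Counter

def group_sizes(rows, model_vars=None):
    """Count rows per countrycode. If model_vars is given, skip rows with
    a missing (None) value in any of those variables."""
    counts = Counter()
    for row in rows:
        if model_vars is not None and any(row[v] is None for v in model_vars):
            continue
        counts[row["countrycode"]] += 1
    return counts

# Hypothetical panel: AAA has three rows (one with missing x), BBB has one.
panel = [
    {"countrycode": "AAA", "year": 2000, "y": 1, "x": 0.5},
    {"countrycode": "AAA", "year": 2001, "y": 0, "x": None},
    {"countrycode": "AAA", "year": 2002, "y": 1, "x": 0.7},
    {"countrycode": "BBB", "year": 2000, "y": 0, "x": 0.2},
]

available = group_sizes(panel)                      # analogue of _N
used = group_sizes(panel, model_vars=["y", "x"])    # analogue of total(e(sample))

threshold = 2  # illustrative cut-off, not a recommendation
kept = [row for row in panel if used[row["countrycode"]] >= threshold]
print(available, used, len(kept))
```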
To say that a dataset is (person, year) level means that each row of that dataset has different (person, year) like this:
person year wage
Mike 2000 10
Mike 2010 30
Jack 1990 20
How can I make Stata check whether a set of variables such as (person, year) uniquely defines each row?
I want to make a log file to record
person year
only, but not display any individual information (displaying individuals' information in a log file is against the rules set by the data provider).
How could I do this?
What I thought about is using bysort in some way
bysort person year: gen num=_n
and if every num is 1, then it means (person, year) defines each row.
But if a dataset is extremely large, then checking whether every num is 1 is too tedious. Is there any smarter way?
The command isid checks whether the variables you supply jointly identify the observations uniquely. Here is an example you can try:
. webuse grunfeld, clear
. isid company
variable company does not uniquely identify the observations
r(459);
. isid company year
Note the principle: no news is good news.
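For readers who want to see the logic of such a check spelled out, here is a minimal Python sketch. It is not how isid is implemented. The function name and the small company/year panel are hypothetical. It only reports whether the key is unique, without printing any row.

```python
def identifies_uniquely(rows, key_vars):
    """Return True if the key_vars jointly take a different value
    combination in every row."""
    keys = [tuple(row[v] for v in key_vars) for row in rows]
    return len(set(keys)) == len(keys)

# Hypothetical panel: two companies observed in two years each.
panel = [
    {"company": 1, "year": 1935},
    {"company": 1, "year": 1936},
    {"company": 2, "year": 1935},
    {"company": 2, "year": 1936},
]

print(identifies_uniquely(panel, ["company"]))
print(identifies_uniquely(panel, ["company", "year"]))
```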
Another way to check for problems is through duplicates. For example, try duplicates list person year. In your case, you don't want that in the log. But what you can do first is anonymise your persons through
egen id = group(person)
and then check for duplicates on id year.
See also this FAQ.
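The anonymise-then-check idea can be sketched in Python as follows. This is an illustration under assumptions: the function name is made up, the fourth row is a hypothetical duplicate added to the example data, and ids are assigned in order of first appearance, which need not match the numbering egen's group() would produce. Only the arbitrary ids and years are reported, never the names.

```python
from collections import Counter

def duplicate_summary(rows, person_var, year_var):
    """Replace each person by an arbitrary integer id and return the
    (id, year) combinations that occur more than once."""
    ids = {}
    key_counts = Counter()
    for row in rows:
        anon_id = ids.setdefault(row[person_var], len(ids) + 1)
        key_counts[(anon_id, row[year_var])] += 1
    return {key: n for key, n in key_counts.items() if n > 1}

records = [
    {"person": "Mike", "year": 2000, "wage": 10},
    {"person": "Mike", "year": 2010, "wage": 30},
    {"person": "Jack", "year": 1990, "wage": 20},
    {"person": "Mike", "year": 2000, "wage": 15},  # hypothetical duplicate
]

duplicates = duplicate_summary(records, "person", "year")
print(duplicates)
```

An empty result means that (person, year) identifies every row.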
I'm working in SAS but using some SQL code to count the number of unique patients as well as the total number of observations for a set of indicators. Each record has a patient identifier, the facility where the patient is, and a group of binary indicators (0/1), one for each bed section (the particular place in the hospital where the patient is). In each patient record, only one bed section can have a value of 1. Patients can have multiple observations in the same bed section or in other bed sections, i.e. patients can be hospitalized more than once. The idea is to roll this dataset up by facility and count the total number of admissions for each bed section as well as the total number of people for each bed section. The people count will always be <= the observation count. Counting people was just added to my to-do list; up to this point I was only summing observations for each bed section using the code below:
proc sql;
create table fac_bedsect as
select facility,
sum(bedsect_alc) as bedsect_alc,
sum(bedsect_blind) as bedsect_blind,
sum(bedsect_gen) as bedsect_gen
from bedsect_type
group by facility;
quit;
Is there a way I can incorporate into this code the # of unique people for each bed section? Thanks.
With no knowledge of the source table(s) it is impossible to answer precisely, but the syntax for counting distinct values is as seen below. You will need to use the correct column name where I have used "patient_id":
SELECT
facility
, COUNT(DISTINCT patient_id) AS patient_count
, SUM(bedsect_alc) AS bedsect_alc
, SUM(bedsect_blind) AS bedsect_blind
, SUM(bedsect_gen) AS bedsect_gen
FROM bedsect_type
GROUP BY
facility
;
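The query above counts distinct patients per facility. If the count is needed separately for each bed section, one option is to put a CASE expression inside COUNT(DISTINCT ...). The sketch below runs that idea against an in-memory SQLite database from Python, purely as an illustration. The column name patient_id and the sample rows are assumptions, and the query has not been tested in SAS PROC SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bedsect_type ("
    "patient_id TEXT, facility TEXT, "
    "bedsect_alc INTEGER, bedsect_blind INTEGER, bedsect_gen INTEGER)"
)
# Hypothetical records: patient p1 is admitted twice to alc and once to gen.
conn.executemany(
    "INSERT INTO bedsect_type VALUES (?, ?, ?, ?, ?)",
    [
        ("p1", "F1", 1, 0, 0),
        ("p1", "F1", 1, 0, 0),
        ("p2", "F1", 0, 0, 1),
        ("p1", "F1", 0, 0, 1),
    ],
)

query = """
SELECT facility,
       SUM(bedsect_alc) AS bedsect_alc,
       COUNT(DISTINCT CASE WHEN bedsect_alc = 1 THEN patient_id END) AS alc_patients,
       SUM(bedsect_gen) AS bedsect_gen,
       COUNT(DISTINCT CASE WHEN bedsect_gen = 1 THEN patient_id END) AS gen_patients
FROM bedsect_type
GROUP BY facility
"""
result = conn.execute(query).fetchall()
print(result)
```

The CASE expression yields NULL for rows outside the bed section, and COUNT(DISTINCT ...) ignores NULLs, so each patient is counted once per bed section.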
I have a few million records with a list of names and phone numbers. I need to count how many people are associated with each unique phone number. The phone numbers are associated with both duplicate names and unique names, so for each phone number I need to count the number of distinct users. Then this needs to be mapped to a list of stores. I tried selecting distinct phones/distinct phones, but that only gives me a ratio of a distribution. For example, if there are 10 people using three phones, my ratio tells me that 3 phones are distributed among 10 people, but it doesn't tell me the actual number of people within that distribution associated with each phone. Can anyone please help me with the SAS code to get the correct count, so that I know exactly how many people are associated with the same phone number? Thanks in advance.
If you just want the number of rows that have the same phone number, use:
proc sql;
create table phone_number_counts as
select phonenumber, count(1) as count_users
from dset
group by phonenumber;
quit;
If you want to count distinct names within each phone number, i.e., if
555-123-4567 John H
555-123-4567 John H
555-123-4567 Mary Y
should result in 2, not in 3 (the first code would yield 3), then use count(distinct name) instead of count(1).
If you want something else, some example data might be helpful - ie, an example of the initial data and an example of a correct final dataset would be helpful.
I believe you're looking for count(distinct name):
proc sql;
create table phone_number_counts as
select phonenumber,
count(*) as count_rows,
count(distinct name) as unique_names
from dset
group by phonenumber;
quit;
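The difference between the two counts can be checked on the three example rows quoted earlier. The Python sketch below is only an illustration of what count(*) and count(distinct name) compute per phone number. The function name is hypothetical.

```python
from collections import defaultdict

def phone_counts(rows):
    """Return {phone: (number of rows, number of distinct names)}."""
    row_counts = defaultdict(int)
    names = defaultdict(set)
    for phone, name in rows:
        row_counts[phone] += 1
        names[phone].add(name)
    return {phone: (row_counts[phone], len(names[phone])) for phone in row_counts}

sample = [
    ("555-123-4567", "John H"),
    ("555-123-4567", "John H"),
    ("555-123-4567", "Mary Y"),
]

summary = phone_counts(sample)
print(summary["555-123-4567"])
```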