Counting different people associated with one phone number in SAS - sas

I have a few million records with a list of names and phone numbers. I need to count how many people are associated with each unique phone number. The phone numbers are associated with duplicate names and unique names. So for each phone number I need to count the number of distinct users. Then this needs to be mapped to a list of stores. I tried selecting distinct phones/distinct phones but that only gives me a ratio of a distribution. So for example, if there is 10 people using three phones, then my ratio tells me that 3 phones are distributed among 10 people, but it doesnt tell me the actual number of people withn that distribution associated with the phone. Can anyone please help me with the SAS code to get the correct count where I know exactly how many phones are associated with the same phone number. Thanks in advance.
-r

If you want just the number of distinct rows that have the same phone number, you use:
proc sql;
create table phone_number_counts as
select phonenumber, count(1) as count_users
from dset
group by phonenumber;
quit;
If you want to find out distinct names within phone number, ie, if
555-123-4567 John H
555-123-4567 John H
555-123-4567 Mary Y
should result in 2, not in 3 (the first code would yield 3), then use count(distinct name) instead of count(1).
If you want something else, some example data might be helpful - ie, an example of the initial data and an example of a correct final dataset would be helpful.

I believe you're looking for count(distinct name):
proc sql;
create table phone_number_counts as
select phonenumber,
count(*) as count_rows,
count(distinct name) as unique_names
from dset
group by phonenumber;
quit;

Related

Combining surveys with distinct analytical weights in Stata

I have a dataset which combine 14 household surveys in 14 countries. Each survey was conducted in different years and each survey has a household weight variable that only specifies to this country's context (data structure is the same across 14 countries).
Now I merged them and tried to cross tabulate the country and gender_area (four types of value: male_rural, female_rural, male_urban, female_urban) variable with weights (tab country gender [aw=hhweight], m). But I found that such a cross-tabulation would create weird values for some of the countries.
For example, if I add one if condition by the end of the tab (tab country gender [aw=hhweight] if abc==1, m), some country (KHM, NPL) 's row total would be greater than their original row total without the condition. But in this dataset, a condition would give a smaller subsample. If I don't add the weight (tab country gender, m), there is no such a problem. If I just tab one country with weight, there is no such a problem either.
So I wonder if there is any way for me to compare all countries with weight. I am not that familiar with survey data reference in Stata (svyset, strata, etc).
I tried to refer to the book Applied Survey Data Analysis, but it seems that it doesn't contain methodology to deal with such a combination.

Create a custom table inside a dax measure and count number of values of an specific column

I'm practicing my Power BI skills. I've downloaded a csv file which contains data about olympic games. The dataset has many columns, such as country, athlete name, year, sport, event, medal which the athlete has won, olympic city, etc.
The problem is that I want to create a bar graph that display country name by medal types count. However if create a graph "Country" by "Medal" from original csv it will not display the correct numbers of medals, because if a country wins a medal in a team sport (like volleyball or football) it should count as only one medal, and not the sum of all medals of all athletes in that team. This could be solved by removing Athlete column and selecting distinct values of "Event" column, like creating a table using the following formula:
Table 2 = CALCULATETABLE(ALLEXCEPT('summer (3)','summer (3)'[Athlete]),DISTINCT('summer (3)'[Event]))
However, I don't want to create a new table, because I would have serious problems with relationship between them (I have no idea how to do it, to be honest). So I want to create a measure. I created the following measure:
Medal count = COUNTX(CALCULATETABLE(ALLEXCEPT('summer (3)','summer (3)'[Athlete]),DISTINCT('summer (3)'[Event])),'summer (3)'[Medal])
It is showing the correct number of all medals in olympic games history (untill 2012). However, for every country, its showing the number of gold medal, silver medal and bronze medal with the same number (the total number of olympic medals 14753). It's not filtering by the number of rows for that specific country.
The same number also appears if I select any medal type from filter option (Gold, Silver or Bronze).
I have no idea how to fix this. How can I create a measure that shows the correct number of medal type for every country?
This is what I would do. First I will create an "id" column if I haven't had that, then I will do the distinctcount on that.
The DAX for the id column should be something like this:
debug_id = CONCATENATE(Table['Year'],CONCATENATE(Table['Sport'],CONCATENATE(Table['Discipline'],CONCATENATE(Table['Country'],CONCATENATE(Table['Event'],Table['Medal'])))))
then you can basically drag and drop this field onto the x-axis (y-axis and legends stay the same) and select Count (Distinct). If you really want the measure for this, it should be quite straight forward like:
Medals count = DISTINCTCOUNT(Table['debug_id'])

Displaying variable sets that define each row

To say that a dataset is (person, year) level means that each row of that dataset has different (person, year) like this:
person year wage
Mike 2000 10
Mike 2010 30
Jack 1990 20
How can I make Stata display exactly those (person, year) variable sets that uniquely define each row?
I want to make a log file to record
person year
only, but not display any individual information (displaying individuals' information in a log file is against the rules set by the data provider).
How could I do this?
What I thought about is using bysort in some way
bysort person year: gen num=_n
and if every num is 1, then it means (person, year) defines each row.
But if a dataset is extremely large, then checking whether every num is 1 is too tedious. Is there any smarter way?
The command isid checks whether the variables you supply do jointly specify observations uniquely. Here is an example you can try:
. webuse grunfeld, clear
. isid company
variable company does not uniquely identify the observations
r(459);
. isid company year
Note the principle: no news is good news.
Another way to check for problems is through duplicates. For example, try duplicates list person year. In your case, you don't want that in the log. But what you can do first is anonymise your persons through
egen id = group(person)
and then check for duplicates on id year.
See also this FAQ.

Count unique patients and overall observation using PROC SQL

Working in SAS but using some SQL code to count the number of unique patients but also the total number of observations for a set of indicators. Each record has a patient identifier, the facility where the patient is, and a group of binary indicators (0,1) for each bed section (the particular place in the hospital where the patient is). For each patient record, only 1 bed section can have a value of '1'. Overall, patients can have multiple observations in a bed section or in other bed sections, i.e. patients can be hospitalized > 1. The idea is to roll this data set up by facility and count the total # of admissions for each bed section but also the total people for each bed section. The people count will always be <= to the observation count. Counting people was just added to my to-do list and to this point I was only summing up observations for each bed section using the code below:
proc sql;
create table fac_bedsect as
select facility,
sum(bedsect_alc) as bedsect_alc,
sum(bedsect_blind) as bedsect_blind,
sum(bedsect_gen) as bedsect_gen
from bedsect_type
group by facility;
quit;
Is there a way I can incorporate into this code the # of unique people for each bed section? Thanks.
With no knowledge of the source table(s) it is impossible to answer precisely, but the syntax for counting distinct values is as seen below. You will need to use the correct column name where I have used "patient_id":
SELECT
facility
, COUNT(DISTINCT patient_id) AS patient_count
, SUM(bedsect_alc) AS bedsect_alc
, SUM(bedsect_blind) AS bedsect_blind
, SUM(bedsect_gen) AS bedsect_gen
FROM bedsect_type
GROUP BY
facility
;

SAS where condition exact matching

Thanks for the feedback guys, but I have to rewrite the question to make it more clear.
Say, we have a table:
Table
What I am trying to get from this table is a list of numbers which have matching FP_NDT dates to my condition, for example, I want to get a list of numbers, which only have FP_NDT not null for 2014 and 2015 and missing values for 2011, 2012 and 2013 (irrelevant of the months). So with this condition I should get only Number 4. Is it possible to do it from this table ?
PS: If I write a simple sql select statement and put a condition like
where year(FP_NDT) in (2014,2015)
it would also give me numbers 2 and 3...
Why not first summarize the data?
proc sql;
create table XX as
select number
, max(year(fp_ndt)=2011) as yr2011
, max(year(fp_ndt)=2012) as yr2012
, max(year(fp_ndt)=2013) as yr2013
, max(year(fp_ndt)=2014) as yr2014
, max(year(fp_ndt)=2015) as yr2015
from table1
group by number
;
Now it is easy to make your tests.
select * from XX
where yr2014+yr2015=2 and yr2011+yr2012+yr2013=0
;
You could use the first query as a sub-query instead of creating a physical table.
So you want names associate with both 1 and 2 and 3, but in different rows.
You can group rows by names and count the associated numbers as this:
PROC SQL;
CREATE TABLE xxx AS SELECT
name,
SUM(number=1) AS count1,
SUM(number=2) AS count2,
SUM(number=3) AS count3
FROM test GROUP BY name;
QUIT;
Then you can filter the results based on count1-count3, i.e. (count1>0 AND count2>0 AND count3>0).
Try this:
proc sql;
select *
from work.test
group by name having nmiss(number)=0;
quit;
I have found one work around which is to actually create separate data sets for each year and then inner join them with a where condition for missing and not null for needed years. However, it becomes a bit cumbersome when it comes to 60 months, for instance...