SAS Counting Keywords Across Columns - sas

I have a list of keywords: CORPORATE, REAL ESTATE, COMPETITION, TRADE, DISPUTE
I want to count how many times each of these keywords occurs across the columns practice_area1 to practice_area10. That is, I want to scan across the 10 columns and create new columns, one per keyword, each holding that keyword's count as its value, e.g. Corporate 4.
Once this step has run and the new columns exist, I want to create a new variable “Practice Group”, populated from whichever of the new variables has the highest count. The dataset below is an example of how the data should look:
Could somebody please offer me some advice on the best approach to do this?
Many Thanks
Chris

All you need to do is make an array of all the columns you want to check. Then loop over the array and, for each column, use the count function to count the word you are looking for, adding the result to a running total inside the loop.
The code below checks three columns for three values. You can apply the same pattern to as many columns and keywords as you want.
data have(drop=i);
    col1 = 'CORPORATE, REAL ESTATE, REAL ESTATE';
    col2 = 'CORPORATE, CORPORATE, TRADE, TRADE, TRADE, REAL ESTATE';
    col3 = 'TRADE, TRADE, DISPUTE,REAL ESTATE';
    array col[*] col1-col3;
    realestate = 0; /* start each counter at zero */
    trade = 0;
    corporate = 0;
    do i = 1 to dim(col);
        realestate = realestate + count(col[i], 'REAL ESTATE'); /* accumulate through the loop */
        trade = trade + count(col[i], 'TRADE');
        corporate = corporate + count(col[i], 'CORPORATE');
    end;
run;
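The answer above stops at the counts and does not cover the "Practice Group" step. As a rough, language-neutral sketch of the whole idea (written in Python rather than SAS, purely for illustration), the counting loop plus a "pick the highest count" step could look like the following. The function names `count_keywords` and `practice_group` are hypothetical, and it is an assumption that "Practice Group" should hold the name of the keyword with the highest count, with the first keyword winning ties.

```python
# Illustrative sketch only: mirrors the SAS array loop above in Python.
KEYWORDS = ["CORPORATE", "REAL ESTATE", "COMPETITION", "TRADE", "DISPUTE"]


def count_keywords(columns, keywords=KEYWORDS):
    """Count non-overlapping occurrences of each keyword across all column values."""
    counts = {kw: 0 for kw in keywords}
    for value in columns:          # loop over the columns of one row
        for kw in keywords:        # one running total per keyword
            counts[kw] += value.count(kw)
    return counts


def practice_group(counts):
    """Return the keyword with the highest count (assumption: first keyword wins ties)."""
    return max(counts, key=counts.get)


# Same sample values as col1-col3 in the SAS example.
row = [
    "CORPORATE, REAL ESTATE, REAL ESTATE",
    "CORPORATE, CORPORATE, TRADE, TRADE, TRADE, REAL ESTATE",
    "TRADE, TRADE, DISPUTE,REAL ESTATE",
]
counts = count_keywords(row)
group = practice_group(counts)
```

For the sample row this gives 3 for CORPORATE, 4 for REAL ESTATE and 5 for TRADE, so the group would be TRADE. In SAS the same second step could be written by comparing the count variables after the loop.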

Related

How to combine ifelse and left_join

I have two data sets. The first data set (AH) has the columns Account_Number, Account_Name and Market_Value, and the second data set (AH1) has Account_Number, Market_Value and Fund. I am trying to bring Fund into the first data set from the second, but only for the account numbers I want. Let's say I want the Fund for every account number that starts with 2; if the account number does not start with 2, I want the value from the Account_Number column instead. Could you please advise how I would do that?
I tried this syntax:
ifelse(AH$Account_Number == starts_with("2"),left_join(AH,AH1, by = "Account_Number),AH$Account_Number

Get total count of each distinct value

Suppose I have a column of countries in which values may repeat, for example: Spain, Spain, Italy, Spain
For each country, I want to take the number of times it appears in the column and divide it by the total number of rows. I have tried:
CountRows = DIVIDE(DISTINCTCOUNT('Report (7)'[Country]); COUNT('Report (7)'[Country]) )
Any suggestions? Do I need a new column for that?
The easiest way to achieve this type of calculation is to add a calculated column that holds the number of occurrences of the current row's country divided by the number of rows in the table.
You need to use the EARLIER function to refer to the current row's value from inside the filter.
Assuming your table is named Table1 and your column is Country, something like:
Divide(COUNTROWS(FILTER(Table1, Table1[Country] = EARLIER(Table1[Country]))),COUNTROWS(Table1))
Don't forget to format the new column as a percentage, or show some decimal places, to see the correct values.
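To make the arithmetic concrete, here is a small sketch of the same per-country share, written in Python only as an illustration of the calculation (it is not DAX, and the function name `country_shares` is hypothetical).

```python
from collections import Counter


def country_shares(countries):
    """Return, per country, its number of occurrences divided by the total row count."""
    total = len(countries)
    counts = Counter(countries)  # occurrences of each distinct value
    return {country: n / total for country, n in counts.items()}


# The example column from the question.
shares = country_shares(["Spain", "Spain", "Italy", "Spain"])
```

For the example column this yields 0.75 for Spain and 0.25 for Italy, which is the value the calculated column would show on each Spain or Italy row.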

DAX: How to achieve the effect of a join/relationship between a bitmask column and a table of bit values?

I have a table CarHistoryFact (CarHistoryFactId, CarId, CarHistoryFactTime, CarHistoryFactConditions) that tracks the historical status of cars. CarHistoryFactConditions is a 25-bit (in binary) int column that encodes the status of 25 various conditions the car may be in at a given point in time.
I have a dimension table CarConditions with a row for each of the conditions, and their base 10 bit value.
How can I implement a "relationship" between the fact and dimension, giving a list of all the conditions a given car is in at a given time?
I can come up with bit parsing code, but I'm not sure how to hook it up to the dimension table to get just the currently applicable conditions at a car-time.
Bitmask parsing in DAX can be seen here:
https://radacad.com/quick-dax-convert-number-to-binary
You can create a CROSSJOIN table in which every fact record is repeated 25 times (once per condition) and then filter out the combinations that do not apply:
CarHistoryConditions =
var temp = CROSSJOIN(CarHistoryFact ; CarConditions )
return FILTER(temp; MOD(TRUNC(CarHistoryFact[CarHistoryFactConditions] / CarConditions[bit]); 2) = 1)
Note: I assumed CarHistoryFactConditions and bit are integers, not strings of bits. You can adapt this if they are not.
The result is a table with one row per applicable condition for each car record. E.g. if car one has 2 conditions and car two has 5 conditions, you get 7 rows.
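The filter expression tests a single bit by dividing the mask by the condition's bit value, truncating, and checking whether the result is odd. A minimal sketch of that cross-join-and-filter idea, in Python purely for illustration (the function name and sample data are hypothetical, and each bit value is assumed to be a power of two such as 1, 2, 4):

```python
def applicable_conditions(fact_rows, conditions):
    """Pair each fact row with every condition whose bit is set in its mask.

    fact_rows:  iterable of (fact_id, mask) with an integer mask
    conditions: iterable of (name, bit) where bit is a power of two (assumption)
    """
    result = []
    for fact_id, mask in fact_rows:        # the "cross join": every fact ...
        for name, bit in conditions:       # ... against every condition
            if (mask // bit) % 2 == 1:     # same test as MOD(TRUNC(mask / bit); 2) = 1
                result.append((fact_id, name))
    return result


conditions = [("A", 1), ("B", 2), ("C", 4)]
facts = [(1, 5), (2, 2)]                   # 5 = binary 101, 2 = binary 010
pairs = applicable_conditions(facts, conditions)
```

Here fact 1 (mask 5) matches conditions A and C, and fact 2 (mask 2) matches only B, so three rows survive the filter out of the six cross-joined ones.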

Search through the data with a loop or nested loops in SAS

I am rather a beginner in SAS and have the following problem. I imported a big data set (my_time) into SAS, which looks as follows
I want to implement the following algorithm
For every account, look for a status equal to na; then look for the same contract one year later (one year after it gets the status na) and put the values of "my_date", "status" and "money" into three new columns "new_my_date", "new_status" and "new_money", like in
I need something like COUNTIFS in Excel. I found loops in SAS such as DO, but they do not seem intended for searching through all rows.
I do not even know which keyword I should search for.
I would be grateful for any hint.
A simple method is to sort the data, then use by-group processing (the automatic first. variables) together with a retain statement to get the desired result.
Step 1: Sort by account, date, and status
proc sort data=have;
by account my_date status;
run;
This guarantees that your data is in the order you need. Since we are only looking for year+1 after status = 'na', anything that happens in between does not matter.
Step 2: Use a data step to remember the first year when na happens for that account
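data want;
    set have;
    by account my_date status;
    /* keep these values across rows; first_na_month must be retained as well */
    retain first_na_year first_na_month first_na_account;
    /* reset the remembered values at the start of each account */
    if first.account then call missing(first_na_year, first_na_month, first_na_account);
    /* remember when this account had status na (year and month are assumed to exist in have) */
    if status in ('na', 'tna') then do;
        first_na_year = year;
        first_na_month = month;
        first_na_account = account;
    end;
    /* same account, same month, one year later, and no longer na */
    if year = first_na_year + 1
       and first_na_month = month
       and account = first_na_account
       and status not in ('na', 'tna')
    then do;
        new_status = status;
        new_my_date = my_date;
        new_money = money;
    end;
    /* subsetting IF: output only rows where all three new variables are filled */
    if cmiss(new_status, new_my_date, new_money) = 0;
    drop first:;
run;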
For each row, we check the following:
Is the status not 'na'?
Is the year one greater than the year of the last time it was 'na'?
Is the month the same as the month in which it was 'na'?
Is this the same account we're comparing?
If all are true, then we want to create the three new variables.
What's happening:
SAS is inherently a looping language, so we do not need to use a do loop here. When SAS moves to a new row, it resets the variables created in the data step to missing in the Program Data Vector (PDV) and then fills in the values read from the new row.
Since the SAS data step only goes forwards and doesn't like to go backwards, we want it to remember when na occurred for that account. retain tells SAS not to discard the value of a variable when it reads a new row.
When we are done with our comparison and have moved on to the next account, we reset these variables to missing. By-group processing allows SAS to know exactly where the first and last occurrence of the account is in the dataset.
At the end, we output only if all 3 of the new variables are not missing. cmiss counts how many of its arguments are missing, so a result of 0 means none are. Note that output is implied at the end of the data step, before the run statement, so we simply need to use an "if without then" (a subsetting if) in this case.
The final statement, drop first:;, is a simple shortcut to remove any variables that start with the phrase first. This prevents them from being shown in the final dataset.
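The sort, remember and compare pattern described above can also be sketched outside SAS. The following Python version is only an illustration of the logic, under the assumption that each row carries `account`, `year`, `month`, `status` and `money` fields (the question's actual layout is not shown); the function name `follow_ups` is hypothetical.

```python
def follow_ups(rows):
    """Return (account, year, month, status, money) for rows exactly one year
    after the most recent 'na'/'tna' row of the same account."""
    # Step 1: sort so each account's rows are contiguous and in time order.
    rows = sorted(rows, key=lambda r: (r["account"], r["year"], r["month"]))
    out = []
    current_account = None
    na_year = na_month = None              # the "retained" values
    for r in rows:
        if r["account"] != current_account:    # like first.account: reset memory
            current_account = r["account"]
            na_year = na_month = None
        if r["status"] in ("na", "tna"):       # remember when na happened
            na_year, na_month = r["year"], r["month"]
        elif na_year is not None and r["year"] == na_year + 1 and r["month"] == na_month:
            out.append((r["account"], r["year"], r["month"], r["status"], r["money"]))
    return out


sample = [
    {"account": 1, "year": 2015, "month": 3, "status": "na", "money": 10},
    {"account": 1, "year": 2015, "month": 9, "status": "ok", "money": 20},
    {"account": 1, "year": 2016, "month": 3, "status": "ok", "money": 30},
    {"account": 2, "year": 2016, "month": 3, "status": "ok", "money": 40},
]
matches = follow_ups(sample)
```

Only the 2016-03 row of account 1 qualifies here. Account 2 has the same year and month but never had status na, which is why the remembered values are reset at each new account.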

How to collapse data while retaining other variables?

I am trying to collapse my data using proc sql. However, I noticed that when I collapsed my data I lost a number of variables that I wanted to keep. I am trying to collapse my data based on the variable MRN (which is numeric). The other variables I want to keep are CITY and SITE (these are character values), and these are constant for each unique MRN, so collapsing them should be fine.
Here is the code I am using
proc sql;
create table collapsed_data as
select distinct mrn,
sum(msk_tx_yes) as msk_tx_yes,
sum(msk_cancel_tx_yes) as msk_cancel_tx_yes,
sum(msk_ca_yes) as msk_ca_yes,
sum(msk_cancel_ca_yes) as msk_cancel_ca_yes,
sum(msk_dc_yes) as msk_dc_yes,
sum(conc_psych_tx_yes) as conc_psych_tx_yes,
sum(conc_psych_ca_yes) as conc_psych_ca_yes,
sum (conc_psych_dc_yes) as conc_psych_dc_yes,
sum (conc_yes) as conc_yes,
sum (psych_yes) as psych_yes,
sum (foot_prog) as foot_prog,
sum (hand_prog) as hand_prog,
sum (surg_prog) as surg_prog,
sum (sx_yes) as sx_yes
from temp_collapsed_data
group by mrn;
quit;
I'm not sure how to use SELECT and DISTINCT together.
I thought maybe I could add the variables CITY and STATE after SELECT, while keeping DISTINCT, but it doesn't seem to work.
I want to be able to keep CITY and STATE in the new table along with the new summed variables I am making. How can I achieve this without turning CITY and STATE into dummy-coded variables? I would like to keep them as character values if possible.
Does anyone know how I can achieve this?
Your code is already correct. Just add the variables to the select statement.
proc sql;
create table collapsed_data as
select distinct mrn, city, site,
sum(msk_tx_yes) as msk_tx_yes,
sum(msk_cancel_tx_yes) as msk_cancel_tx_yes,
sum(msk_ca_yes) as msk_ca_yes,
sum(msk_cancel_ca_yes) as msk_cancel_ca_yes,
sum(msk_dc_yes) as msk_dc_yes,
sum(conc_psych_tx_yes) as conc_psych_tx_yes,
sum(conc_psych_ca_yes) as conc_psych_ca_yes,
sum (conc_psych_dc_yes) as conc_psych_dc_yes,
sum (conc_yes) as conc_yes,
sum (psych_yes) as psych_yes,
sum (foot_prog) as foot_prog,
sum (hand_prog) as hand_prog,
sum (surg_prog) as surg_prog,
sum (sx_yes) as sx_yes
from temp_collapsed_data
group by mrn;
quit;
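The distinct keyword removes duplicate rows from the result. Because city and site are not in the group by clause, SAS remerges the sums back onto every original row (it notes this in the log), and distinct then collapses those identical rows, leaving one row per MRN as long as city and site are constant within each MRN.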
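As a rough illustration of what the collapsed table should contain, here is a small Python sketch (not SAS) that groups rows by one key, carries along columns that are constant within each group, and sums the numeric columns. The function name `collapse` and the sample rows are hypothetical, and the sketch assumes the kept columns really are constant per key, taking their values from the first row seen.

```python
def collapse(rows, key, keep, sum_cols):
    """Return one row per key value: kept columns from the first row, sums of sum_cols."""
    grouped = {}
    for r in rows:
        if r[key] not in grouped:
            # First row for this key: copy the constant columns, start sums at zero.
            grouped[r[key]] = {key: r[key], **{k: r[k] for k in keep},
                               **{c: 0 for c in sum_cols}}
        for c in sum_cols:
            grouped[r[key]][c] += r[c]
    return list(grouped.values())


rows = [
    {"mrn": 1, "city": "X", "site": "S1", "sx_yes": 1, "conc_yes": 0},
    {"mrn": 1, "city": "X", "site": "S1", "sx_yes": 1, "conc_yes": 1},
    {"mrn": 2, "city": "Y", "site": "S2", "sx_yes": 0, "conc_yes": 1},
]
collapsed = collapse(rows, "mrn", ["city", "site"], ["sx_yes", "conc_yes"])
```

The result has one row for MRN 1 (sx_yes 2, conc_yes 1) and one for MRN 2, with city and site kept as character values.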