I have a series of values in a column -- the first value is a category description, the different categories are separated by a blank row. In the example below the first category is called A, second category is called T, and third category is called R.
What I want to do is to retain the first instance of the category name and create a new field name prefaced by the category. See the have/want below. Any ideas?
For example:
data example;
input have $1. want $4.;
datalines;
A
T A_T
G A_G
T
R T_R
E T_E
W T_W
R
H R_H
R R_R
;
Consider using the RETAIN statement to carry the category value forward, and the LAG function to detect when the retained value needs to be reset.
data have;
input category $1.;
datalines;
A
T
G

T
R
E
W

R
H
R
;
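data want (drop=category_retained);
set have;
length subcategory $3;
retain category_retained "";
/* First row, or first row after a blank separator: start a new category */
if lag(category) = "" then
do;
subcategory = "";
category_retained = category;
end;
/* Any other non-blank row: prefix the value with the retained category */
if lag(category) ne "" and category ne "" then
do;
subcategory = catx("_",category_retained,category);
end;
run;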
I was given an Excel file in which someone stored all of the information in a single column (var1). I need to pull information out of it, but the pieces appear in random order. The good thing is that the person put a period after each piece of information. I have read var1 into SAS.
3 examples of var1 observations:
Type = 2. Size = 4 in x 12 in. Group = ABC grouping.
Group = A and B Holdings. Type = 1.
Group = Mark H and Company.
The variable I need to pull is group. It always starts with "Group = " and ends with a period, but it can be anywhere within var1 (so you can't pick a specific period), and sometimes it may not exist. It can be any number of words long. I just need to pull the string between "Group = " and the period.
This can't be done in Excel due to the size of the dataset.
I have tried scan, find, and splitting at the period, and I am not sure what to do at this point to organize it.
How about this?
data have;
infile cards4 truncover;
input line $100.;
list;
cards4;
Type = 2. Size = 4 in x 12 in. Group = ABC grouping.
Group = A and B Holdings. Type = 1.
Group = Mark H and Company.
;;;;
run;
data more;
set have;
infile cards4;
input @1 @;
_INFILE_ = line;
length type size group $48;
_infile_= transtrn(_infile_,' = ','=');
input (_all_)(=);
list;
cards4;
Type = 2. Size = 4 in x 12 in. Group = ABC grouping.
Group = A and B Holdings. Type = 1.
Group = Mark H and Company.
;;;;
run;
proc print;
run;
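If the only field needed is Group, a shorter alternative is to locate the literal label and cut at the next period. This is a minimal sketch, not part of the answer above: it assumes the HAVE data set and LINE variable created in the first step, that the label is always spelled exactly "Group = ", and that the group name itself contains no period (a name such as "Acme Inc." would be cut short). The output data set name GROUPS is a placeholder.

```sas
data groups;
   set have;
   length group $100;
   /* Position of the label; 0 when the row has no Group entry */
   start = find(line, 'Group = ');
   /* Guard the upper bound so SUBSTR never starts past the end of LINE */
   if 0 < start <= lengthc(line) - 8 then
      /* Skip the 8-character label, then keep everything up to the next period */
      group = scan(substr(line, start + 8), 1, '.');
   drop start;
run;
```

Rows without a Group entry simply keep a blank GROUP value.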
I am using SAS Enterprise Guide.
I have a new file and I was asked to generate output.
Source:
Name feeder_in feeder_out NickName
ABBA 1,2 A,B ABBA
POLA 1,2 C,D,E CONS POLA
and the desired output:
Name feeder_final
ABBA 1
ABBA 2
ABBA A
ABBA B
POLA 1
POLA 2
CONS POLA C
CONS POLA D
CONS POLA E
I have been trying to handle this myself, but with no luck at all.
I tried
data test;
catequipment=catx(',',strip(feeder_in),strip(feeder_out));
do i=1 to countw(catequipment,',');
catequipment=catx(',',strip(feeder_in),strip(feeder_out));
do i=1 to countw(catequipment,',');
output;
end;
xequipment=newequipment;
run;
Does anyone have a clue for this?
Here's my understanding of your requirements, based on the desired output: you want your output to have one observation for each combination of NAME and FEEDER_IN, plus another observation for each combination of NICKNAME and FEEDER_OUT.
On that assumption, the code would look something like (not tested):
data want;
set have;
keep name feeder_final;
* Loop over FEEDER_IN and output one obs for each delimited value;
do i = 1 to countw(feeder_in, ',');
feeder_final = scan(feeder_in, i, ',');
output;
end;
* Move the NICKNAME value into NAME;
name = nickname;
* Loop over FEEDER_OUT and output one obs for each delimited value;
do i = 1 to countw(feeder_out, ',');
feeder_final = scan(feeder_out, i, ',');
output;
end;
run;
When transposing multiple columns you might want to also maintain the source row and column identifiers for further downstream analytics. The sequence of the values in the csv might also be important if you need to do pairwise joining on sequence position of the categorical form -- such as needing to match 1A 2B in row 1 and 1C 2D in row 2.
data have;
length name feeder_in feeder_out nickname $20;
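/* The & modifier lets a value contain single blanks, so values must be separated by two or more blanks */
input Name & feeder_in & feeder_out & NickName &;
datalines;
ABBA  1,2  A,B  ABBA
POLA  1,2  C,D,E  CONS POLA
;
run;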
data want;
_row_ + 1;
set have;
feeder = 'in ';
do seq = 1 to countw(feeder_in,',');
value = scan(feeder_in,seq,',');
OUTPUT;
end;
feeder = 'out';
do seq = 1 to countw(feeder_out,',');
value = scan(feeder_out,seq,',');
OUTPUT;
end;
keep _row_ Name feeder seq value NickName;
run;
I am trying to develop a recursive program to fill in missing string values using flat probabilities (for instance, if a variable had three possible values and one observation was missing, the missing observation would have a 33% chance of being replaced with any given value).
Note: The purpose of this post is not to discuss the merit of imputation techniques.
DATA have;
INPUT id gender $ b $ c $ x;
CARDS;
1 M Y . 5
2 F N . 4
3 N Tall 4
4 M Short 2
5 F Y Tall 1
;
/* Counts number of categories i.e. 2 */
proc sql;
SELECT COUNT(Unique(gender)) into :rescats
FROM have
WHERE Gender ~= " " ;
Quit;
%let rescats = &rescats;
%put &rescats; /*internal check */
/* Collects response categories separated by commas i.e. F,M */
proc sql;
SELECT UNIQUE gender into :genders separated by ","
FROM have
WHERE Gender ~= " "
GROUP BY Gender;
QUIT;
%let genders = &genders;
%put &genders; /*internal check */
/* Counts entries to be evaluated. In this case observations 1 - 5 */
/* Note CustomerKey is an ID variable */
proc sql;
SELECT COUNT (UNIQUE(customerKey)) into :ID
FROM have
WHERE customerkey < 6;
QUIT;
%let ID = &ID;
%put &ID; /*internal check */
data want;
SET have;
DO i = 1 to &ID; /* Control works from 1 to 5 */
seed = 12345;
/* Sets u to rand value between 0.00 and 1.00 */
u = RanUni(seed);
/* Sets rand gender to either 1 and 2 */
RandGender = (ROUND(u*(&rescats - 1)) + 1)*1;
/* PROBLEM Should if gender is missing set string value of M or F */
IF gender = ' ' THEN gender = SCAN(&genders, RandGender, ',');
END;
RUN;
The SCAN function does not create an F or M value within gender. It also appears to create new M and F variables. Additionally, the DO loop creates additional entries under CustomerKey. Is there any way to get rid of these?
I would prefer to use loops and macros to solve this. I'm not yet proficient with arrays.
Here is my attempt at tidying this up a little:
/*Changed to delimited input so that values end up in the right columns*/
DATA have;
infile cards dlm=',';
INPUT id gender $ b $ c $ x;
CARDS;
1,M,Y, ,5
2,F,N, ,4
3, ,N,Tall,4
4,M, ,Short,2
5,F,Y,Tall,1
;
/*Consolidated into 1 proc, added noprint and removed unnecessary group by*/
proc sql noprint;
/* Counts number of categories i.e. 2 */
SELECT COUNT(unique(gender)) into :rescats
FROM have
WHERE not(missing(Gender));
/* Collects response categories separated by commas i.e. F,M */
SELECT unique gender into :genders separated by ","
FROM have
WHERE not(missing(Gender))
;
Quit;
/*Removed redundant %let statements*/
%put rescats = &rescats; /*internal check */
%put genders = &genders; /*internal check */
/*Removed ID list code as it wasn't making any difference to the imputation in this example*/
data want;
SET have;
seed = 12345;
/* Sets u to rand value between 0.00 and 1.00 */
u = RanUni(seed);
/* Sets RandGender to an integer from 1 to &rescats, each equally likely */
RandGender = CEIL(u*&rescats);
IF missing(gender) THEN gender = SCAN("&genders", RandGender, ','); /*Added quotes around &genders to prevent SAS interpreting M and F as variable names*/
RUN;
Halo8:
/*Changed to delimited input so that values end up in the right columns*/
DATA have;
infile cards dlm=',';
INPUT id gender $ b $ c $ x;
CARDS;
1,M,Y, ,5
2,F,N, ,4
3, ,N,Tall,4
4,M, ,Short,2
5,F,Y,Tall,1
;
run;
Tip: You can use a dot (.) to mean a missing value for a character variable during INPUT.
Tip: DATALINES is the modern alternative to CARDS.
Tip: Data values don't have to line up, but it helps humans.
Thus this works as well:
/*Changed to delimited input so that values end up in the right columns*/
DATA have;
INPUT id gender $ b $ c $ x;
DATALINES;
1 M Y . 5
2 F N . 4
3 . N Tall 4
4 M . Short 2
5 F Y Tall 1
;
run;
Tip: Your technique requires two passes over the data.
One to determine the distinct values.
A second to apply your imputation.
Most approaches require two passes per variable processed. A hash approach can do only two passes but requires more memory.
There are many ways to determine distinct values: SORTING+FIRST., Proc FREQ, DATA Step HASH, SQL, and more.
Tip: Solutions that move data to code back to data are sometimes needed, but can be troublesome. Often the cleanest way is to let data remain data.
For example: INTO will be the wrong approach if the concatenated distinct values would require more than 64K
Tip: Data to Code is especially troublesome for continuous values and other values that are not represented exactly the same when they become code.
For example: high precision numeric values, strings with control-characters, strings with embedded quotes, etc...
This is one approach using SQL. As mentioned before, Proc SURVEYSELECT is far better for real applications.
Proc SQL;
Create table REPLACEMENTS as select distinct gender from have where gender is NOT NULL;
%let REPLACEMENT_COUNT = &SQLOBS; %* Tip: Take advantage of automatic macro variable SQLOBS;
data REPLACEMENTS;
set REPLACEMENTS;
rownum+1; * rownum needed for RANUNI matching;
run;
Proc SQL;
* Perform replacement of missing values;
Update have
set gender =
(
select gender
from REPLACEMENTS
where rownum = ceil(&REPLACEMENT_COUNT * ranuni(1234))
)
where gender is NULL
;
%let SYSLAST = have;
DM 'viewtable have' viewtable;
You don't have to be concerned about columns that have no missing values, because no replacement occurs in those. For columns that do have missing values, the list of candidate REPLACEMENTS excludes the missing value, so REPLACEMENT_COUNT is the correct count for computing the uniform probability of replacement, 1/COUNT, coded as rownum = ceil(COUNT * random).
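To see why CEIL is used here rather than the ROUND(u*(&rescats - 1)) + 1 expression from the question, it can help to tabulate both on simulated draws. This is a rough sketch, assuming three candidate values; n = 3 is an arbitrary choice for illustration, and the data set and variable names are placeholders.

```sas
data check;
   n = 3;                                   /* assumed number of distinct values */
   do i = 1 to 30000;
      u = ranuni(1234);                     /* uniform draw between 0 and 1 */
      pick_ceil  = ceil(n * u);             /* 1..n, each with probability 1/n */
      pick_round = round(u * (n - 1)) + 1;  /* end values get only half the weight */
      output;
   end;
run;

proc freq data=check;
   tables pick_ceil pick_round;
run;
```

With three values the ROUND form should come out near 25/50/25 percent, while CEIL stays near a third each. With only two values the two forms give the same distribution, which is why the difference does not show up in the gender example.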
I have two months of transaction dates for 2 customers. I need to extract only the first transaction date in each month for each customer, month by month. I don't have a clear idea of how to do this in SAS. Can anyone help? Thanks in advance.
Cust_name Vis_date
V 3/1/2016
V 8/1/2016
V 16/1/2016
V 18/1/2016
V 26/1/2016
V 27/1/2016
E 5/1/2016
E 8/1/2016
E 18/1/2016
E 19/1/2016
E 25/1/2016
E 26/1/2016
V 4/2/2016
V 8/2/2016
V 17/2/2016
V 25/2/2016
V 26/2/2016
V 27/2/2016
E 5/2/2016
E 8/2/2016
E 23/2/2016
E 24/2/2016
E 25/2/2016
E 28/2/2016
I would first get the month for each record and then sort. With that you can pull the observations out with a data step as follows:
data test.doc1;
set test.doc;
Month = month(__Vis_date);
run;
proc sort data=test.doc1;
by Cust_name Month __Vis_date;
run;
data test.doc2;
set test.doc1;
by Cust_name Month;
if first.Month then output;
run;
If you create a variable that keeps track of the month, this becomes pretty easy! Vis_date needs to be stored as a SAS date value (not a character string) for this to work, though.
data your_data2;
set your_data;
month = month(vis_date);
run;
proc sort data = your_data2;
by cust_name vis_date;
run;
proc sort nodupkey data = your_data2;
by cust_name month;
run;
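Both approaches above call MONTH() on the visit date, so the column must hold a real SAS date value rather than text. Here is a small sketch of reading the day/month/year values shown in the question with the DDMMYY10. informat. The data set name and the three sample rows are placeholders taken from the question's listing.

```sas
data your_data;
   /* The colon modifier lets list input apply the informat to values of varying width */
   input cust_name $ vis_date :ddmmyy10.;
   format vis_date ddmmyy10.;
   datalines;
V 3/1/2016
V 16/1/2016
E 5/2/2016
;
run;
```

If the data span more than one year, grouping on MONTH() alone merges the same month across years. Adding YEAR() to the BY variables avoids that.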
You can do it in a single SQL statement :
proc sql ;
create table want as
select Cust_name,
put(Vis_date,yymmn6.) as Month,
min(Vis_Date) as First_Date format=date9.
from have
group by 1,2
order by 1,2 ;
quit ;
I would like to create a variable called DATFL that would have the following value for the last observation:
DATFL
gender/scan
Here is the code :
data mix_ ;
input id $ name $ gender $ scan $;
datalines;
1 jon M F
2 jill F L
3 james F M
4 jonas M M
;
run;
data mix_3; set mix_;
length datfl datfl_ $ 50;
array m4(*) id name gender scan;
retain datfl;
do i=1 to dim(m4);
if index(m4(i) ,'M') then do;
datfl_=vname(m4(i)) ;
if missing(datfl) then datfl=datfl_;
else datfl=strip(datfl)||"/"||datfl_;
end;
end;
run;
Unfortunately, the value I get for DATFL at the last observation is 'gender/scan/gender/scan'. Because of the RETAIN statement that I used for DATFL, I end up with duplicates. At the end of this data step I was planning to use a CALL SYMPUT statement to load the last value into a macro variable, but I won't do that until I fix this issue. Can anyone give me guidance on how to prevent DATFL from having duplicate values at the end of the dataset? Cheers
sas_kappel
Don't retain DATFL. Instead, retain DATFL_.
data mix_3; set mix_;
length datfl datfl_ $ 50;
array m4(*) id name gender scan;
retain datfl_;
do i=1 to dim(m4);
if index(m4(i) ,'M') then do;
datfl_=vname(m4(i)) ;
if missing(datfl) then datfl=datfl_;
else datfl=strip(datfl)||"/"||datfl_;
end;
end;
if missing(datfl) then datfl = datfl_;
run;
It doesn't work. Let me change the dataset (mix_) so you can see that RETAIN DATFL_ is not working in this scenario.
data mix_ ;
input id $ name $ gender $ scan $;
datalines;
1 jon M M
2 Marc F L
3 james F M
4 jonas H M
;
run;
To summarize, what I want is to have the DISTINCT values of DATFL in a macro variable. The code that I proposed searches, for each record, for variables containing the letter M; if one is found, DATFL receives the name of that array variable. If there are multiple variable names, they are separated by '/'. For the next records it should do the same, BUT add only the variable names that satisfy the condition AND are not already kept in DATFL. Currently, if you run my program, I have DATFL=gender/scan/name/scan/scan at observation 4, but I would like to have DATFL=gender/scan/name, because those are the distinct values. Ultimately, I will then write the following code:
if eof then CALL SYMPUT('DATFL',datfl);
sas_kappel
Your revised data makes it much clearer what you're looking for. Here is some code that should give the correct result.
I've used the CALL CATX routine to add new values to DATFL, separated by a /. The code first checks that the relevant variable name doesn't already exist in the string.
data mix_ ;
input id $ name $ gender $ scan $;
datalines;
1 jon M M
2 Marc F L
3 james F M
4 jonas H M
;
run;
data _null_;
set mix_ end=eof;
length datfl $100; /*or whatever*/
retain datfl;
array m4{*} $ id name gender scan;
do i = 1 to dim(m4);
if index(m4{i},'M') and not index(datfl,vname(m4{i})) then call catx('/',datfl,vname(m4{i}));
end;
if eof then call symput('DATFL', datfl);
run;
%put datfl = &DATFL.;
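One detail when the value is passed to a macro variable: CALL SYMPUT stores the full declared length of DATFL, trailing blanks included, whereas CALL SYMPUTX strips leading and trailing blanks first. Here is a short sketch of the difference, using a hard-coded value and the hypothetical macro variable names RAW and TRIMMED.

```sas
data _null_;
   length datfl $100;
   datfl = 'gender/scan/name';
   call symput('RAW', datfl);       /* keeps the padding up to 100 characters */
   call symputx('TRIMMED', datfl);  /* removes leading and trailing blanks */
run;

%put [&RAW];
%put [&TRIMMED];
```

The brackets make the padding visible in the log.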