I have a data set that contains empty cells. It looks like this:
Year Volume ID
2000 999 LSE
2001 . LSE
. 555 LSE
2008 . NYSE
2010 1099 NYSE
I need to delete the rows that contain empty cells. The output should look like this:
Year Volume ID
2000 999 LSE
2010 1099 NYSE
I tried the following code:
data test;
set data;
if volume = " . " then delete;
if year= " . " then delete;
run;
But output file has 0 observations and SAS gives me
NOTE: Character values have been converted to numeric values at the
places given by (Line):(Column).
Also I tried
options missing = ' ';
data test;
set data;
if missing(cats(of _all_)) then delete;
run;
But it's not working either.
I just want to delete the rows with empty cells.
Can anyone help me solve this issue? Thanks in advance!
Options Missing only affects how things are printed or converted when going numeric -> character. In this case you have numerics, so it accomplishes nothing.
Your first code sample is mostly correct; at least, when I try it, it works. " . " is not really right, but it will convert (as the note says) to missing since none of those characters is a number.
The proper way to do this is one of the two:
data have;
input Year Volume ID $;
datalines;
2000 999 LSE
2001 . LSE
. 555 LSE
2008 . NYSE
2010 1099 NYSE
;;;;
run;
data want;
set have;
if year = . then delete;
if volume = . then delete;
run;
or
data want;
set have;
if missing(year) then delete;
if missing(volume) then delete;
run;
missing returns true if the variable is missing (which includes 28 total values, but . is the most common).
A better way to do more than one is to use the nmiss or cmiss functions (nmiss for numbers, cmiss for character or mixed type).
data want;
set have;
if nmiss(year,volume) = 0;
run;
That will return the number of missing values, which you can then test for whatever value you are looking for (in this case, zero values). You could even do:
data want;
set have;
if nmiss(of _NUMERIC_) = 0;
run;
where _NUMERIC_ is all numeric variables. (of is needed for variable lists like this to tell SAS to expect a list.)
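If the variables to check were a mix of numeric and character, the same subsetting IF could be written with cmiss instead. This is only a minimal sketch, assuming the same sample data as above; note that cmiss(of _all_) would also drop a row whose ID is blank.

```sas
/* same sample data as above, recreated so the sketch runs on its own */
data have;
input Year Volume ID $;
datalines;
2000 999 LSE
2001 . LSE
. 555 LSE
2008 . NYSE
2010 1099 NYSE
;
run;

data want;
set have;
/* keep only rows where no variable, numeric or character, is missing */
if cmiss(of _all_) = 0;
run;
```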
Your second doesn't work, by the way, because it's catting the ID variable together with the others. You could have seen this by looking at the value of that cats (ie, assign it to a variable). You could have said
if cats(of _all_) = ID then delete;
but note that this only removes rows where every variable other than ID is missing, and as several of us have shown it's probably inferior to the simpler solutions using nmiss.
You can just use a subsetting if nmiss() by checking the variables that must be populated:
data test;
set data;
if nmiss(year,volume)=0 ;
run;
Edit: This works if year and volume are numeric; if they are strings, you can use the cmiss() function.
Don't use quotes with numeric variables, e.g.:
if volume = . then delete;
Another option that works for either character or numeric:
if missing(volume) then delete;
You could use a where clause in the set statement here as well:
data new_dataset;
set old_dataset (where = (volume is not missing and year is not missing));
run;
I always enjoy using the is not missing syntax because it seems too much like writing normal English to work.
For context I'm a SAS programmer in clinical trials but I have this spec for variable ADTC.
If EC.ECDTC contains a full datetime, set ADTMC to the value of EC.ECDTC in "YYYY-MM-DD hh:mm" format. If EC.ECDTC contains a full or partial date but no time part then set ADTMC to the date part of EC.ECDTC in "YYYY-MM-DD" format. In both cases, replace any missing elements of the format with "XX", for example "2022-01-01 16:XX" or "2022-01-XX"
So currently I'm using this piece of code which is partially fine but not ideal
check=count(ecdtc,'-');
if check = 0 and ~missing(ecdtc) then adtc = cats(ecdtc,"-XX-XX");
else if check = 1 then adtc = cats(ecdtc,"-XX");
else if check = 2 then adtc = ecdtc;
Is there a way I could use perl-regular expressions to have like a template of the outline of the date/datetime and have it search through the values for that column and if they don't match to add -XX if missing day or -XX-XX if missing day and month etc. I was thinking of utilising prxchange but how do you incorporate the template so it knows to add -XX in the correct position where applicable.
SUBSTR on the left.
data want2;
set have;
length adtmc $16;
if length(ecdtc) le 10 then adtmc = 'XXXX-XX-XX'; *uppercase placeholders, as the spec asks for "XX";
else adtmc = 'XXXX-XX-XX XX:XX';
substr(adtmc,1,length(ecdtc))=ecdtc;
run;
Honestly, I wouldn't; regex are not faster for the most part than just straight-up checking with normal code, for simple things like this. If you have time pressure, or thousands or millions of rows... not a good idea, just use scan.
But that said, it's certainly possible, and somewhat interesting. We'll use PRXPOSN, which lets us iterate through the capture buffers, and "capture" each bit. This might need some tweaking, and you might need to capture/not capture the hyphens for example, but for my data this works - if your data is different, the regex will be different (and next time, post sample data!).
data have;
length ecdtc $16;
infile datalines truncover;
input #1 ecdtc $16.;
datalines;
2020-01-01 01:02
2020-01-02
2020-01
2020
junk
;;;;
run;
data want;
set have;
length adtmc $16;
array vals[3] $;
vals[1]='XXXX';
vals[2]='-XX';
vals[3]='-XX';
_rx = prxparse('/(\d{4})(-\d{2})?(-\d{2})?( \d{2}:\d{2})?/ios');
_rc = prxmatch(_rx,ecdtc); *this does the matching. Probably should check for value of _rc to make sure it matched before continuing.;
do _i = 1 to 4; *now iterate through the four capture buffers;
_rt = prxposn(_rx,_i,ecdtc);
if _i le 3 then vals[_i] = coalescec(_rt,vals[_i]);
else timepart = _rt; *we do the timepart outside the array since it needs to be catted with a space while the others do not, easier this way;
end;
adtmc = cats(of vals[*]); *cat them together now - if you do not capture the hyphen then use catx ('-',of vals[*]) instead;
if timepart ne ' ' then adtmc = catx(' ',adtmc,timepart); *and append the timepart after.;
run;
I am working on a SAS Dataset which has missing values.
I can identify whether a particular variable has missing values using IS NULL/IS MISSING operator.
Is there any alternative way, through which I can identify which variables have missing values in one shot.
Thanks in Advance
The syntax IS NULL or IS MISSING is limited to use in SQL code (also in WHERE statements or WHERE= dataset options since those essentially use the same parser.)
To test if a value is missing you can also use the MISSING() function. Or compare it to a missing value. So for character variables test if it is equal to all blanks: c=' '. For numeric you can test x=., but you also need to look out for special missing values. So you might test if x <= .z.
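To make the difference between these tests concrete, here is a small sketch. The dataset and variable names (check, dot_only, any_missing, via_function) are made up for illustration.

```sas
data check;
/* one ordinary value, the ordinary missing value, and two special missing values */
do x = 1, ., .A, .Z;
dot_only = (x = .);        /* true only for the ordinary missing value */
any_missing = (x <= .z);   /* true for every numeric missing value, since .Z sorts highest */
via_function = missing(x); /* also true for every numeric missing value */
output;
end;
run;
```

Comparing to . alone misses .A and .Z, which is why the range test or MISSING() is the safer habit.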
To get a quick summary of the number of distinct missing values for each variable you could use the NLEVELS option on PROC FREQ. Note it might not work for a large dataset with too many distinct values, as the procedure will run out of memory.
Use an array and vname to find the variables with missing values. If you want the rows with missing values, use the cmiss function.
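A sketch of the PROC FREQ approach, assuming a small test dataset called have (the data here is illustrative only):

```sas
/* small test dataset with one numeric and one character missing value */
data have;
infile datalines missover;
input id num char $ var $;
datalines;
1 . A C
2 3 D
5 6 B D
;
run;

/* print only the NLevels table: one row per variable */
ods select nlevels;
proc freq data=have nlevels;
tables _all_ / noprint;
run;
```

The NOPRINT option on the TABLES statement suppresses the individual frequency tables, so only the per-variable level counts, including the missing levels, are shown.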
data have;
infile datalines missover;
input id num char $ var $;
datalines;
1 . A C
2 3 D
5 6 B D
;
/* gives variables with missing values*/
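data want1(keep=miss);
set have;
array chars(*) _character_;
array nums(*) _numeric_;
length miss $32; /* declared after the arrays so it is not part of chars */
/* loop over each array separately, since they can differ in size */
do i=1 to dim(chars);
if missing(chars(i)) then do; miss=vname(chars(i)); output; end;
end;
do i=1 to dim(nums);
if missing(nums(i)) then do; miss=vname(nums(i)); output; end;
end;
run;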
/* gives rows with at least one missing value */
data want(drop=rows);
set have;
rows=cmiss(of id -- var);
if rows > 0;
run;
You can use proc freq table statement with missing option. It includes missing category if missing values exist. Useful for categorical data.
data example;
input A Freq;
datalines;
1 2
2 2
. 2
;
*list variables in tables statement;
proc freq data=example;
tables A / missing;
run;
You can also use PROC UNIVARIATE; it creates a MissingValues table in ODS by default if any missing values exist. Useful for numeric data.
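A minimal sketch of the PROC UNIVARIATE route, reusing the example dataset from above and asking ODS for the MissingValues table only:

```sas
data example;
input A Freq;
datalines;
1 2
2 2
. 2
;
run;

/* show only the missing-value summary for A */
ods select MissingValues;
proc univariate data=example;
var A;
run;
```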
Two options (in addition to Peter Slezák's) I can suggest are :
- Use proc means with nmiss
proc means data = ___ n nmiss;
var _numeric_;
run;
- In SAS Enterprise Guide, there is a Characterize Data task; this helps profile character variables too. (Under the hood, it is a combination of various procs, but it is an easy-to-use option.)
Hope this helps,
regards,
Sundaresh
I have some complicated string parsing which would be very difficult to accomplish using regular SAS functions because of the string value inconsistency; as a result
I think I will need to use Perl Regular Expressions. Below are 4 variables (price, date, size, bundle) which I have to create using parts of the text string. I'm having trouble getting the syntax correct, as I am new to regular expressions.
Here is a sample data set.
data have;
infile cards truncover;
input text $80.;
cards;
acq_newsale_0_CartChat_0_Flash_1192014.jpg
acq_old_3x_GadgetPotomac_7999_Flash_112014.swf
acq_sale_3xconoffer_8999_nacpg_2102014.sfw
acq_is_3X_ItsEasy_8999_NACPG_Flash_272014_728x90.hgp
awa_os_3xMZ1_FiOSPresents_FF_160x600_12252014.mov
awa_fs_0_TWCMLP_v2_switch_0_0_Static_462014_300x250.jpg
acq_fi_2x_incrediblemz1_7999_nac_flash_1192014_160x600.swf
acq_fio_3x_bringhome_6499_0_flash_12162013_728x90.swf
;run;
/* The first variable is price. It is normally located near the end or middle of the string. */
data want;
set have;
price =(input(prxchange('s/(\w+)_(\d+)_(\w+)/$2/',-1,text),8.))/100;
format price dollar8.2;
run;
Using the data set above I need to have this result:
price
0
79.99
89.99
89.99
79.99
64.99
/* Date is always a series of consecutive digits: either 6, 7 or 8. Using |, which means 'or', I thought I would be able to pull it that way. */
data want;
set have;
date=prxparse('/\d\d\d\d\d\d|\d\d\d\d\d\d\d|\d\d\d\d\d\d\d\d/',text);
run;
Using the data set above I need to have this result:
Date
1192014
112014
2102014
272014
12252014
462014
1192014
12162013
/* For size there is always an ‘x’ in the middle of the sub-string, with two or three digits on either side */
data want;
set have;
size=prxparse('/(\w+)_(\d+)'x'(\d+)_(\w+)/',text);
run;
Size
728x90
160x600
300x250
160x600
728x90
/* This is normally located towards the beginning of the string. It’s always a single-digit number followed by an x. It is never followed by additional digits but can also be just 0. */
data want;
set have;
Bundle=prxparse('/(\d+)'x'',text);
run;
Bundle
0
3x
3x
3X
3x
0
2x
3x
The final product I am looking for should look like this:
Text Date price Size Bundle
acq_newsale_0_CartChat_0_Flash_1192014.jpg 1192014 0 0
acq_old_3x_GadgetPotomac_7999_Flash_112014.swf 112014 79.99 3x
acq_sale_3xconoffer_8999_nacpg_2102014.sfw 2102014 89.99 3x
acq_is_3X_ItsEasy_8999_NACPG_Flash_272014_728x90.hgp 272014 89.99 728x90 3X
awa_os_3xMZ1_FiOSPresents_FF_160x600_12252014.mov 12252014 160x600 3x
awa_fs_0_TWCMLP_v2_switch_0_0_Static_462014_300x250.jpg 462014 300x250 0
acq_fi_2x_incrediblemz1_7999_nac_flash_1192014_160x600.swf 1192014 79.99 160x600 2x
acq_fio_3x_bringhome_6499_0_flash_12162013_728x90.swf 12162013 64.99 728x90 3x
If you're extracting, don't use PRXCHANGE. Use PRXPARSE, PRXMATCH, and PRXPOSN.
Sample usage, with date:
data have;
infile cards truncover;
input text $80.;
cards;
acq_newsale_0_CartChat_0_Flash_1192014.jpg
acq_old_3x_GadgetPotomac_7999_Flash_112014.swf
acq_sale_3xconoffer_8999_nacpg_2102014.sfw
acq_is_3X_ItsEasy_8999_NACPG_Flash_272014_728x90.hgp
awa_os_3xMZ1_FiOSPresents_FF_160x600_12252014.mov
awa_fs_0_TWCMLP_v2_switch_0_0_Static_462014_300x250.jpg
acq_fi_2x_incrediblemz1_7999_nac_flash_1192014_160x600.swf
acq_fio_3x_bringhome_6499_0_flash_12162013_728x90.swf
;
run;
data want;
set have;
rx_date = prxparse('~(\d{6,8})~io');
rc_date = prxmatch(rx_date,text);
if rc_date then datevar = prxposn(rx_date,1,text);
run;
Just enclose in parens the section you want to extract (in this case, all of it).
Date was easy - as you say, 6-8 numbers. The others may be harder. The 3x etc. bit you can probably find, depending on how strict you need to be; the price I think you'll have a very hard time finding. You need to be able to better articulate the rules. "Towards the beginning" isn't a regex rule. "The second set of digits" is; "The second to last set", perhaps might work. I'll see if I can figure out a few.
In your example data, this works. I in particular don't like the price search; that one may well fail with a more complicated set of data. You can figure out adding the decimal for yourself.
data have;
infile cards truncover;
input text $80.;
cards;
acq_newsale_0_CartChat_0_Flash_1192014.jpg
acq_old_3x_GadgetPotomac_7999_Flash_112014.swf
acq_sale_3xconoffer_8999_nacpg_2102014.sfw
acq_is_3X_ItsEasy_8999_NACPG_Flash_272014_728x90.hgp
awa_os_3xMZ1_FiOSPresents_FF_160x600_12252014.mov
awa_fs_0_TWCMLP_v2_switch_0_0_Static_462014_300x250.jpg
acq_fi_2x_incrediblemz1_7999_nac_flash_1192014_160x600.swf
acq_fio_3x_bringhome_6499_0_flash_12162013_728x90.swf
blahblah :23 blahblah
blahblahblah 23 blah blah
;
run;
data want;
set have;
rx_date = prxparse('~_(\d{6,8})[_\.]~io');
rx_price = prxparse('~_(\d+)_.*?(?=_\d+[_\.]).*?(?!_\d+[_\.])~io');
rx_bundle = prxparse('~(?!_\d+_)_(\dx)~io');
rx_size = prxparse('~_(\d+x\d+)[_\.]~io');
rx_adnum = prxparse('~\s:?(\d\d)\s~io');
rc_date = prxmatch(rx_date,text);
rc_price = prxmatch(rx_price,text);
rc_bundle = prxmatch(rx_bundle,text);
rc_size = prxmatch(rx_size,text);
rc_adnum = prxmatch(rx_adnum,text);
if rc_date then datevar = prxposn(rx_date,1,text);
if rc_price then price = prxposn(rx_price,1,text);
if rc_bundle then bundle = prxposn(rx_bundle,1,text);
if rc_size then size = prxposn(rx_size,1,text);
if rc_adnum then adnum = prxposn(rx_adnum,1,text);
run;
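For the decimal step left to the reader above, one possibility is to read the captured digits with INPUT and divide by 100. This is a sketch only: it assumes the captured price holds nothing but digits and that the last two digits are cents. The names prices and price_num are hypothetical.

```sas
data prices;
length price $8;
input price $;
/* convert captured digits such as 7999 into a numeric amount in dollars */
if not missing(price) then price_num = input(price, 8.) / 100;
format price_num dollar8.2;
datalines;
7999
0
6499
;
run;
```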
The variable upc is already defined in my cool dataset. How do I convert it to a macro variable? I am trying to combine both text and numbers. For example blah should equal upc=123;
data cool;
set cool;
blah = catx("","upc=&upc","ccc")
run;
If upc is a numeric variable and you just want to include its value into some character string then you don't need to do anything special. Concatenation function will convert it into character before concatenating automatically:
data cool;
set cool; *read the existing dataset so upc is available;
blah = catx("","upc=",upc,"ccc");
run;
The result:
upc----blah
123 upc= 123ccc
BTW, if you want to concatenate strings without blanks between them, you can use function CATS(), which strips all leading and trailing spaces from each argument.
The following test code works for my SAS 9.3 x64 PC.
Please note that:
1. symputx() provides the connection between dataset and macro variables.
2. cats() will be more appropriate than catx() if delimiting characters are not needed.
3. If you do not need to create a new data set, data _NULL_ is fine.
You can check the log to see that the correct values are being assigned.
Bill
data a;
input ID $ x y @@;
datalines;
A 1 10 A 2 20 A 3 30
;
run;
options SymbolGen MPrint MLogic MExecNote MCompileNote=all;
data _NULL_;
set a;
call symputx(cats("blah",_N_),cats(ID,x),"G");
run;
%put blah1=&blah1;
%put blah2=&blah2;
%put blah3=&blah3;
I'm new to SAS, and would greatly appreciate anyone who can help me formulate a code. Can someone please help me with formatting changing arrays based on the first column values?
So basically here's the original data:
Category Name1 Name2......... (Changes invariably)
#ofpeople 20 30
#ofproviders 10 5
#ofclaims 40 25
AmountBilled 50 100
AmountPaid 11 35
AmountDed 5 6
I would like to format the values under Name1 to infinite Name# and reformat them to dollar10.2 for any values under Category called 'AmountBilled','AmountPaid','AmountDed'.
Thank you so much for your help!
You can't conditionally format a column (like you might in excel). A variable/column has one format for the entire column. There are tricks to get around this, but they're invariably more complex than should be considered useful.
You can store the formatted value in a character variable, but it loses the ability to do math.
data have;
input category :$12. name1 name2; *$12 so that #ofproviders and AmountBilled are not truncated;
datalines;
#ofpeople 20 30
#ofproviders 10 5
#ofclaims 40 25
AmountBilled 50 100
AmountPaid 11 35
AmountDed 5 6
;;;;
run;
data want;
set have;
array names name:; *colon is wildcard (starts with);
array newnames $10 newname1-newname10; *Arbitrarily 10, can be whatever;
if substr(category,1,6)='Amount' then do;
do _t = 1 to dim(names);
newnames[_t] = put(names[_t],dollar10.2);
end;
end;
run;
You could programmatically figure out the newname1000 endpoint using PROC CONTENTS or SQL's DICTIONARY.COLUMNS / SAS's SASHELP.VCOLUMN. Alternately, you could put out the original dataset as a three column dataset with many rows for each category (was it this way to begin with prior to a PROC TRANSPOSE?) and put the character variable there (not needing an array). To me that's the cleanest option.
data have_t;
set have;
array names name:;
format nameval $10.;
do namenum = 1 to dim(names);
if substr(category,1,6)='Amount' then nameval = put(names[namenum],dollar10.2 -l);
else nameval=put(names[namenum],10. -l); *left aligning here, change this if you want otherwise;
output; *now we have (namenum) rows per line. Test for missing(name) if you want only nonmissing rows output (if not every row has same number of names);
end;
run;
proc transpose data=have_t out=want_T(drop=_name_) prefix=name;
by category notsorted;
var nameval;
run;
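Going back to the earlier suggestion of working out the array endpoint programmatically, a rough sketch with DICTIONARY.COLUMNS might look like the following. It assumes the dataset is WORK.HAVE and that every variable whose name starts with "name" belongs in the array; n_names is a made-up macro variable name.

```sas
data have;
input category :$12. name1 name2;
datalines;
#ofpeople 20 30
AmountBilled 50 100
;
run;

proc sql noprint;
/* count the variables whose names start with NAME */
select count(*) into :n_names trimmed
from dictionary.columns
where libname='WORK' and memname='HAVE' and upcase(name) like 'NAME%';
quit;

%put n_names=&n_names;
```

The macro variable could then replace the hard-coded 10 in the array statement, for example newname1-newname&n_names.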
Finally, depending on what you're actually doing with this, you may have superior options in terms of the output method. If you're doing PROC REPORT for example, you can use compute blocks to set the style (format) of the column conditionally in the report output.
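As a hedged illustration of the PROC REPORT idea, the sketch below uses CALL DEFINE in a compute block to switch the format for the Amount rows only. It assumes two name columns and a cut-down version of the sample data; with more columns you would need one CALL DEFINE per column, or a macro loop.

```sas
data have;
input category :$12. name1 name2;
datalines;
#ofpeople 20 30
AmountBilled 50 100
AmountPaid 11 35
;
run;

proc report data=have;
column category name1 name2;
define category / display;
define name1 / display;
define name2 / display;
/* compute on the last column so every value in the row is available */
compute name2;
if substr(category,1,6) = 'Amount' then do;
call define('name1', 'format', 'dollar10.2');
call define('name2', 'format', 'dollar10.2');
end;
endcomp;
run;
```

The underlying variables stay numeric here, so only the report output changes.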