The datasets include a list of numbers:
$1,000.1M
$100.5M
$1,002.3M
$23.4M
$120.3M
I want to read the variable as numeric in SAS.
The result should be:
Money(millions)
1000.1
100.5
1002.3
23.4
120.3
I used the COMMAw.d informat to read this data, but the code does not run.
The code is:
input Money(millions) COMMA9.1;
run;
How to modify it?
Thank you very much!
The COMMA informat does not expect letters like 'M'; it removes only commas, blanks, dollar signs, percent signs, dashes, and close parentheses.
You can just convert your raw string to a string containing a number by removing all characters you do not need:
data input;
length moneyRaw $200;
infile datalines;
input moneyRaw $;
datalines;
$1,000.1M
$100.5M
$1,002.3M
$23.4M
$120.3M
;
run;
data result;
set input;
* "k" modifier inverts the removed characters;
money = input(compress(moneyRaw,"0123456789.","k"),best.);
run;
Or if you know regex, you can add some intrigue to the code for anyone who reads it in the future:
data resultPrx;
set input;
moneyUpdated = prxChange("s/^\$(\d+(,\d+)*(\.\d+)?)M$/$1/",1,strip(moneyRaw));
money = input(compress(moneyUpdated,','),best.);
run;
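Since COMMA already discards the dollar sign and the commas, a third variant only has to drop the unit letter before converting. This is a minimal sketch that reuses the input data set and the moneyRaw variable from above; the data set name resultComma and the width 32 are arbitrary choices, and it assumes 'M' is the only letter that ever appears in the value.

```sas
data resultComma;
    set input;
    /* Remove only the unit letter; COMMA32. then ignores "$", "," and blanks */
    money = input(compress(moneyRaw, 'M'), comma32.);
run;
```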
I think you're best off reading it as a character and then processing it as in Dmitry's answer. But if it was a single column you could read it if you set the delimiter to M. I suspect this will work in a demo, but not in your full process.
data input;
informat moneyRaw dollar8.;
infile datalines dlm='M';
input moneyRaw ;
*moneyRaw = moneyRaw * (1000000);
format moneyRaw dollar32.;
datalines;
$1,000.1M
$100.5M
$1,002.3M
$23.4M
$120.3M
;
run;
Suppose I have two strings to convert from a SAS program name to a table number.
My goal is to convert the first, "f-2-2-7-5-vcb", to "2.2.7.5".
This should be done dynamically. For example, "f-2-2-12-1-2-hbd87q"
needs to become "2.2.12.1.2".
How can I accomplish this?
data input;
input str $ 1-20;
datalines;
f-2-3-1-5-vcb
f-2-4-1-6-rtg
f-2-3-11-1-3-hb17
;
run;
data want;
set input;
Sub=compress(substr(str,3,length(str)),,'kd') ;
run;
It's a bit of a longer way, but this works fine for me.
Use FIND() to find the first '-'.
Use REVERSE() and FIND() to find the last '-'.
Use SUBSTR() and the positions found above to remove the first and last components.
Use TRANSLATE() to convert the dashes to periods.
z=find(str, '-');
end=find(strip(reverse(str)), '-');
string = translate(substr(str, z+1, length(str) - z - end), ".", "-");
A regular expression can match the dash-delimited, digits-only sequence. The match, once extracted, can be transformed using translate.
data input;
input str $ 1-20;
rx = prxparse ("/^.*?((\d+)(-\d+)*)/");
if prxmatch(rx,str) then do;
call prxposn (rx,1,s,e);
name = substr(str,s,e);
name = translate(name,'.','-');
end;
datalines;
f-2-3-1-5-vcb
f-2-4-1-6-rtg
f-2-3-11-1-3-hb17
funky2-2-1funky
f-2-hb17
a2bfunky
;
run;
A funky situation occurs if the digits-only token sequence is preceded by a token ending with digits, or followed by a token starting with digits.
data input;
input str $ 1-20;
string=translate(prxchange('s/\w+?\-(.*)\-\w+/$1/',-1,strip(str)),'.','-');
datalines;
f-2-3-1-5-vcb
f-2-4-1-6-rtg
f-2-3-11-1-3-hb17
;
run;
You can do this in one line. Use substr to keep the text between the second word and the second-to-last word:
translate(substr(str,find(str,scan(str,2,'-')),find(str,scan(str,-1,'-'))-find(str,scan(str,2,'-'))-1),'.','-')
Step 1, find(str,scan(str,2,'-')): finds the starting position of the second word.
Step 2, find(str,scan(str,-1,'-')): finds the starting position of the last word.
Step 3, step 2 - step 1 - 1: the length of the text to copy, which ends at the second-to-last word.
substr(str,step1,step3): copies the text from the second word through the second-to-last word.
translate(...,'.','-'): replaces each '-' with '.'.
Code:
data want;
set input;
Sub=translate(substr(str,find(str,scan(str,2,'-')),find(str,scan(str,-1,'-'))-find(str,scan(str,2,'-'))-1),'.','-');
put _all_;
run;
Output:
str=f-2-3-1-5-vcb Sub=2.3.1.5
str=f-2-4-1-6-rtg Sub=2.4.1.6
str=f-2-3-11-1-3-hb17 Sub=2.3.11.1.3
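One caveat with the one-liner: find(str,scan(...)) locates the first occurrence of the word's text, which is not necessarily the word itself when the same text also appears earlier in the string (for example, a leading token such as "f2" followed by the word "2"). A sketch that sidesteps this by locating the first and last dash directly, using FINDC with the 'b' modifier (search from right to left) and the str variable from the input data set above; want2, firstDash and lastDash are hypothetical names:

```sas
data want2;
    set input;
    length Sub $20;
    firstDash = findc(str, '-');       /* position of the first dash */
    lastDash  = findc(str, '-', 'b');  /* position of the last dash, searching right to left */
    /* Extract only when there is something between the two dashes */
    if lastDash > firstDash + 1 then
        Sub = translate(substr(str, firstDash + 1, lastDash - firstDash - 1), '.', '-');
    drop firstDash lastDash;
run;
```

Rows without at least two dashes are left with a blank Sub rather than raising an invalid-argument note from SUBSTR.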
I have data with an ID field that is structured like this:
XX0000X
7 characters total: the first two and the last are letters only, with digits in between.
How can I check that the ID is structured specifically and exactly like this?
I'm not sure how to approach checking this. One possibility was the CATS function, but I'm not sure how to apply it.
You can use a combination of functions to check this, including:
CHAR()
ANYDIGIT()
ANYALPHA()
data have;
input x $10.;
cards;
AB0000X
AO000BF
1234556
ABCDEFG
AB0123Y
AB
ABCDEFGHI
;
run;
data check;
set have;
flag=0;
if lengthn(x) ne 7 then flag=1;
length letter $1;
if flag=0 then do i=1 to 7;
letter = char(x, i);
if ( i in (1,2, 7) and anyalpha(letter) ne 1 )
or i in (3:6) and anydigit(letter) ne 1 then do;
flag=1;
leave;
end;
end;
run;
Regular expressions are obviously more succinct and likely a better approach.
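Short of a regular expression, the loop can also be collapsed into a single expression with NOTALPHA and NOTDIGIT, which return 0 when every character of their argument is a letter or a digit, respectively. A minimal sketch against the have data set above; check2 and ok are hypothetical names, and note that ok has the opposite sense to flag (1 means the ID is valid):

```sas
data check2;
    set have;
    /* ok = 1 when x is exactly two letters, four digits, one letter */
    ok = ( lengthn(x) = 7
           and notalpha(substr(x, 1, 2)) = 0
           and notdigit(substr(x, 3, 4)) = 0
           and notalpha(substr(x, 7, 1)) = 0 );
run;
```

Because x is read with a length of 10, the SUBSTR calls stay in range even for short values such as "AB"; the length test is what rejects them.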
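Here is an approach using a regular expression. ^ and $ anchor the pattern so that the whole value must match, [A-Z]{2} matches the first two letters, [0-9]{4} matches the four digits in the middle, [A-Z] matches the last letter, \s* allows for the trailing blanks that pad the variable, and the i modifier ignores case. flag is 1 for a valid ID and 0 otherwise.
data want;
set have;
flag=prxmatch("/^[A-Z]{2}[0-9]{4}[A-Z]\s*$/i",x);
run;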
I have to read a tab delimited file, x'05'c (dlm='0C'x). For a few records, the delimiter is present within a string that is enclosed in double quotes. When I use '&' in the INPUT statement it works fine, but records with more than one space give an error.
Data I have to read:
1.AIRWORLDWIDE.z1234565
2.MEDICAL.y121546
3."INPUTTTFAM.ILY TRUST"
Output desired:
ID  text          text_ref
-----------------------------------
1   AIRWORLDWIDE  z1234565
2   MEDICAL       y121546
3   "INPUTTTFAM   ILY TRUST"
My program :
Data Want;
format id $char1.
text $char12.
text_ref $char12.;
informat id $char1.
text $char12.
text_ref $char12.;
length id text text_ref;
infile have dlm='0C'x dsd END=eof missover ;
input id text text_ref;
/* input id (text text_ref) (& $12.); */
run;
thanks in advance
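DSD is not the INFILE option you want here: with DSD, a delimiter inside a quoted string is treated as part of the value, so the quoted string is not split.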
filename FT15F001 temp;
data want;
infile FT15F001 dlm='.' missover;
informat id $char1. text $char12. text_ref $char12.;
input (_all_)(:);
list;
parmcards;
1.AIRWORLDWIDE.z1234565
2.MEDICAL.y121546
3."INPUTTTFAM.ILY TRUST"
;;;;
run;
proc contents varnum;
run;
proc print;
run;
I have a variable that contains a number of firms separated by the | symbol. I would like to be able to count how many firms there are, i.e., the number of | characters plus 1, and ideally identify the location of each | symbol in the string. Note there will not be more than five firms in a single variable. I was trying to use the following approach, but I ran into the fact that SAS treats the | symbol as a special operator.
pattern1 = prxparse('/|/'); /* I can't seem to get SAS to treat this as a text to compare */
start = 1;
stop = length(reassignment2); /* my list of firms is in the variable reassignment2 */
call prxnext(pattern1, start, stop, reassignment2, position, length);
ARRAY Y[5];
do J=1 to 5 while (position > 0);
Y[J]=position;
call prxnext(pattern1, start, stop, reassignment2, position, length);
end;
nfirms=j+1;
run;
I would do it somewhat differently. What you really want is not the number of | characters, but the actual firms, right? So search for those. Your code had a number of minor issues:
You must call prxmatch before using call prxnext.
Your j+1 is wrong, because the loop iterator increments one beyond the last qualifying loop value (I use j-1 because I find one more element than you do).
| is a regular expression metacharacter and must be escaped if you want to match it literally, unless it is inside [] as in my pattern.
data test;
infile datalines truncover;
input #1 reassignment2 $50.;
pattern1 = prxparse('/[^|]+/io'); /* Look for non-| characters */
start = 1;
stop = length(reassignment2); /* my list of firms is in the variable reassignment2 */
rc=prxmatch(pattern1,reassignment2);
if rc>0 then do;
ARRAY Y[5];
do J=1 by 1 until (position = 0);
call prxnext(pattern1, start, stop, reassignment2, position, length);
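if position > 0 then Y[J]=position; /* skip the final call, which returns 0 and would overrun Y with five firms */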
end;
nfirms=j-1;
end;
else nfirms=0;
put nfirms=;
datalines;
Firm1|Firm2|Firm3
Firm1|Firm2|Firm3|Firm4
Firm1
Firm1|Firm2
;;;;
run;
For completeness' sake, you could also do this easily without regular expressions, using call scan.
data test;
infile datalines truncover;
input #1 reassignment2 $50.;
array y[5];
do nfirms=1 by 1 until (position le 0);
call scan(reassignment2,nfirms,position,length,'|');
if position > 0 then y[nfirms]=position; /* skip the final call, which would overrun y with five firms */
end;
nfirms=nfirms-1; *loop ends one iteration too late;
put nfirms=;
datalines;
Firm1|Firm2|Firm3
Firm1|Firm2|Firm3|Firm4
Firm1
Firm1|Firm2
;;;;
run;
I agree with @Joe that this could be done more simply without regular expressions, though I would simplify his code a little further to exclude the use of an array.
data test;
infile datalines truncover length = reclen;
input firmlist $varying256. reclen;
i = 0;
do until(scan(firmlist,i,"|") = "");
i + 1;
end;
nfirms = i - 1;
drop i;
datalines;
Firm1|Firm2|Firm3
Firm1|Firm2|Firm3|Firm4
Firm1
Firm1|Firm2
;
run;
You said you'd also like to capture the position of the "|" character in the string, but if there are multiple firms per record there will be multiple "|" characters in the string. If you want the position of each one, an array might be a better route, though if you only want one, the index function will get you what you want. You'd use delimpos = index(firmlist,"|");.
I hope that helps!
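If both the count and every delimiter position are wanted, COUNTW and a FIND loop cover it without regular expressions. This is a sketch under the question's stated limit of five firms (so at most four delimiters); the names firms, pos, start, k and p are hypothetical. COUNTW treats consecutive delimiters as one by default, so an empty firm between two | characters would not be counted.

```sas
data firms;
    infile datalines truncover;
    input firmlist $50.;
    nfirms = countw(firmlist, '|');  /* number of |-separated words */
    array pos[4];                    /* positions of up to four | characters */
    start = 1;
    do k = 1 to dim(pos);
        p = find(firmlist, '|', start);  /* next | at or after start, 0 if none */
        if p = 0 then leave;
        pos[k] = p;
        start = p + 1;
    end;
    drop start k p;
    datalines;
Firm1|Firm2|Firm3
Firm1|Firm2|Firm3|Firm4
Firm1
Firm1|Firm2
;
run;
```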
I am creating a SAS dataset from a database that includes a VARCHAR(5) key field.
This field includes some entries that use all 5 characters and some that use fewer.
When I import this data, I would prefer to pad all the shorter entries out to use all five characters. For this example, I want to pad on the left with 0, the character zero. So, 114 would become 00114, ABCD would become 0ABCD, and EA222 would stay as it is.
I've attempted this with a simple data statement, but of course the following does not work:
data test;
set databaseinput;
format key $5.;
run;
I've tried to do this with a user-defined informat, but I don't think it's possible to specify the ranges correctly on character fields, per this SAS KB answer. Plus, I'm fairly sure proc format won't let me define the result dynamically in terms of the incoming variable.
I'm sure there's an obvious solution here, but I'm just missing it.
Here is an alternative:
data padded_data_dsn;
length key $5;
drop raw_data;
set raw_data_dsn(rename=(key=raw_data));
key = translate(right(raw_data),'0',' ');
run;
data raw_data_dsn;
length key key1 $5;
do key = '4', 'A114', 'A1140';
* REPEAT returns n+1 copies, so 4-length(key) yields 5-length(key) zeros;
if length(key) < 5 then key1 = cats(repeat('0', 4-length(key)), key);
else key1 = key;
output;
end;
run;
I'm sure someone will have a more elegant solution, but the following code works. Essentially it is padding the variable with five leading zeros, then reversing the order of this text string so that the zeros are to the right, then reversing this text string again and limiting the size to five characters, in the original order but left-padded with zeros.
data raw_data_dsn;
format key $varying5.;
key = '114'; output;
key = 'ABCD'; output;
key = 'EA222'; output;
run;
data padded_data_dsn;
format key $5.;
drop raw_data;
set raw_data_dsn(rename=(key=raw_data));
key = put(put('00000' || raw_data ,$revers10.),$revers5.);
run;
Here's what worked for me.
data b (keep = str2);
format str2 $5. ;
set a;
catlength = 4 - length(str);
cat = repeat('0', catlength);
str2 = catt(cat, str);
run;
It works by counting the length of the existing string and then building a padding string with repeat('0', 4 - length). Because REPEAT returns n+1 copies of its argument, that yields 5 - length zeros, which are then prepended to the original string.
Notice that it screws up if the original string is length 5.
Also - it won't work if the input string has a $5. format on it.
data a; /*input dataset*/
input str $;
datalines;
a
aa
aaa
aaaa
aaaaa
;
run;
data b (keep = str2);
format str2 $5. ;
set a;
catlength = 4 - length(str);
cat = repeat('0', catlength);
str2 = catt(cat, str);
run;
input:
a
aa
aaa
aaaa
aaaaa
output:
0000a
000aa
00aaa
0aaaa
0aaaa
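The five-character case can be handled by skipping the padding when there is nothing to pad. A guarded variant, sketched against the data set a above; b2 and n are hypothetical names:

```sas
data b2 (keep = str str2);
    length str2 $5;
    set a;
    n = length(str);
    /* REPEAT('0', k) returns k+1 zeros, so call it only when 4 - n is at least 0 */
    if n < 5 then str2 = cats(repeat('0', 4 - n), str);
    else str2 = str;
run;
```

With this guard, aaaaa is copied through unchanged instead of going through REPEAT with a negative count.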
I use this, but it only works with numeric values. Try other informats in the INPUT function.
data work.prueba;
format xx $5.;
xx='1234';
vv=PUT(INPUT(xx,best5.),z5.);
run;