Unwanted cut words in the output - sas

I submitted the code below, but SAS doesn't show the complete values in the output. Each value of the variable "COMPLICATION" is cut off at the first blank.
Can anyone fix it?
data complication;
length SUBJECT 8 COMPLICATION $ 15.5;
input SUBJECT COMPLICATION $ ;
datalines;
2076 Pneumonia
3585 DVT(Lower)
3630 DVT(Lower)
4585 Compartment
4599 Aspiration
4760 Acute Resp
4775 Pneumonia
2076 Heart Attack
3585 Pneumonia
3660 Heart Attack
4585 Pneumonia
4599 Pneumonia
4760 Pneumonia
4775 DVT(Lower)
2076 Renal Fail
3585 Renal Fail
3630 Pancreatit
4585 Skin Break
4599 Renal Fail
4760 Renal Fail
4775 Pneumonia
3630 Pneumonia
4775 Renal Fail
;
run;

The default value (or field) separator is a space, so the character input is 'clipped' at the first space. Use the LIST INPUT modifier & after a variable name to cause INPUT to use two spaces as the delimiter, and thus allow embedded single spaces in the field value.
The LENGTH statement should not have $15.5; change it to $15. The INPUT statement already knows about the COMPLICATION variable because it was set up with LENGTH in the prior statement, so change COMPLICATION $ to COMPLICATION &.
So you want
data complication;
length SUBJECT 8 COMPLICATION $ 15;
input SUBJECT COMPLICATION & ;
...
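
The delimiter rule described above can be imitated outside SAS. The following is only a rough Python sketch of the behaviour, not SAS code; the function names read_default and read_with_ampersand are made up for illustration, and the 15-character truncation stands in for LENGTH $ 15.

```python
import re

def read_default(line):
    """Imitate plain list input: every field ends at the first blank."""
    subject, complication = line.split()[:2]
    return int(subject), complication

def read_with_ampersand(line):
    """Imitate the & modifier: the field ends at two or more blanks or end of line."""
    subject, rest = line.strip().split(maxsplit=1)
    complication = re.split(r" {2,}", rest, maxsplit=1)[0]
    # Assumption: mimic LENGTH $ 15 by keeping at most 15 characters.
    return int(subject), complication[:15]

print(read_default("2076 Heart Attack"))
print(read_with_ampersand("2076 Heart Attack"))
```

The first call loses everything after the embedded blank, while the second keeps the whole value.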

Related

replace a variable's value if it contains a certain string

I have a variable called diagnosis and I want to replace everything that contains the word "pneumonia" with just "pneumonia"
I tried this:
replace diagnosis = "Pneumonia" if regexm (diagnosis, "pneumonia")
But I got an error: unrecognized command: regexm
I have Stata/IC version 12.1 for Windows.
Answer:
Take out the space between regexm and (diagnosis, "pneumonia").
Additional suggestions:
regexm takes a long time, so I would do something more like
replace diagnosis = "Pneumonia" if diagnosis == "pneumonia"
which achieves the same result when the value is exactly "pneumonia" (it does not catch longer strings that merely contain the word), or if you want to do this more generally you can write
replace diagnosis = strproper(diagnosis)
which has the same results in your example.
Another approach.
. clonevar newdiag = diag
. replace newdiag = "pneumonia" if strpos(strlower(diag),"pneumonia")>0
(3 real changes made)
. list, clean noobs
diag                   newdiag
pneumonia              pneumonia
Pneumonia              pneumonia
Bronchial pneumonia    pneumonia
Flu and pneumonia      pneumonia
earache                earache
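
For comparison, the same case-insensitive "contains" test can be sketched in Python. This is not Stata code; the function name normalize_diagnosis is hypothetical, and the sample values are the ones from the listing above.

```python
def normalize_diagnosis(value, keyword="pneumonia"):
    """Return the bare keyword if the value contains it (ignoring case), else the value unchanged."""
    return keyword if keyword in value.lower() else value

diagnoses = ["pneumonia", "Pneumonia", "Bronchial pneumonia",
             "Flu and pneumonia", "earache"]
cleaned = [normalize_diagnosis(d) for d in diagnoses]
print(cleaned)
```

Lowercasing before the substring test plays the role of strlower() inside strpos().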

Regular expression for Number masking with exceptions

I want to mask phone numbers in a resume which also contains dates in the form 2001, 2001-03 and percentages such as 45% 87% 78.45% 56.5%.
I only want to mask the phone numbers, and I don't need to mask them completely. Masking just 3 or 4 digits, enough to make the number hard to guess, would do the job. Kindly help me out.
Phone number formats are
9876543210
98765 43210
98765-43210
9876 543 210
9876-543-210
Here is my answer:
(([0-9][- ]*){5})(([0-9][- ]*){5})
It will match exactly 10 digits with or without - or space.
After that, you can replace the first or the third group with ***** or anything you like.
For example:
$1*****
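
As a rough illustration of that replacement, here is a Python sketch using the same pattern. Python writes the backreference as \1 rather than $1, and the function name mask_last_five is made up for this example.

```python
import re

# Same pattern as above: two groups of five digits, each digit
# optionally followed by hyphens or spaces.
PHONE = re.compile(r"(([0-9][- ]*){5})(([0-9][- ]*){5})")

def mask_last_five(text):
    """Keep the first five digits and replace the remaining five with asterisks."""
    return PHONE.sub(r"\1*****", text)

print(mask_last_five("Call 98765-43210, born 2001-03, 78.45%"))
```

One caveat of this pattern: the trailing [- ]* also consumes separators after the tenth digit, so a space that follows the number is swallowed by the replacement.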
\d{4,5}[ -]?\d{3}[ -]?\d{2,3}
Strings matched:
9876543210, 98765 43210, 98765-43210, 9876 543 210, 9876-543-210
Strings not matched:
45% 87% 78.45% 56.5%
2001, 2001-03
I feel that a more complicated regex that doesn't match invalid phone numbers is not required since the requirement is to mask valid phone numbers of the above format.
Python code:
import re

def fun(m):
    # Replace the first digit group with asterisks and keep the rest as is.
    return '*' * len(m.group(1)) + m.group(2)

string = "Resume of candidate abcd. His phone numbers are : 9876543210, 98765 43210, 98765-43210. Date of birth of the candidate is 23-10-2013. His percentage is 57%. One more number 9876 543 213 His percentage in grad school is 44%. Another number 9876-543-210"
re.sub(r'(\d{4,5})([ -]?\d{3}[ -]?\d{2,3})', fun, string)
Output:
'Resume of candidate abcd. His phone numbers are : *****43210, *****
43210, *****-43210. Date of birth of the candidate is 23-10-2013. His
percentage is 57%. One more number **** 543 213 His percentage in grad
school is 44%. Another number ****-543-210'
More about re.sub:
re.sub(pattern, repl, string, count=0, flags=0)
Return the string obtained by replacing the leftmost non-overlapping
occurrences of pattern in string by the replacement repl. If the
pattern isn’t found, string is returned unchanged. repl can be a
string or a function;
Just to help you on your way: I would use Python to do this.
Use the re module to search for number-like strings:
import re

num_re = re.compile(r'[0-9 -]{5,}')

with open('/my/file', 'r') as f:
    for l in f:
        for s in num_re.findall(l):
            # Do some additional testing here, like 'not starting with' or any other check
            l = l.replace(s, '!!!MASKED!!!')  # str.replace returns a new string
        print(l, end='')
I'm not saying that this code is finished, but it should help you on your way.
By the way, here is why I would use this approach:
You can easily add any tests you like to fix false positives.
It's readable.

Italian phone 10-digit number regex issue

I'm trying to use the regex from this site
/^([+]39)?((38[{8,9}|0])|(34[{7-9}|0])|(36[6|8|0])|(33[{3-9}|0])|(32[{8,9}]))([\d]{7})$/
for Italian mobile phone numbers, but a simple number such as 3491234567 is reported as invalid.
(I don't care about spaces, as I'll trim them.)
should pass:
349 1234567
+39 349 1234567
TODO: 0039 349 1234567
TODO: (+39) 349 1234567
TODO: (0039) 349 1234567
regex101 and regexr both pass the validation. What's wrong?
UPDATE:
To clarify:
The regex should match any number that starts with either
388/389/380 (38[{8,9}|0])|
or
347/348/349/340 (34[{7-9}|0])|
or
366/368/360 (36[6|8|0])|
or
333/334/335/336/337/338/339/330 (33[{3-9}|0])|
328/329 (32[{8,9}])
plus 7 digits ([\d]{7})
and the +39 at the start optionally ([+]39)?
The following regex appears to fulfill your requirements. I took out the syntax errors and guessed a bit, and added the missing parts to cover your TODO comments.
^(\((00|\+)39\)|(00|\+)39)?(38[890]|34[7-90]|36[680]|33[3-90]|32[89])\d{7}$
Demo: https://regex101.com/r/yF7bZ0/1
Your test cases fail to cover many of the variations captured by the regex; perhaps you'll want to beef up the test set to make sure it does what you want.
The beginning allows for an optional international prefix with or without the parentheses. The basic pattern is (00|\+)39 and it is repeated with or without parentheses around it. (Perhaps a better overall approach would be to trim parentheses and punctuation as well as whitespace before processing begins; you'll want to keep the plus as significant, of course.)
Updated with information from @Edoardo's answer; wrapped for legibility and added comments:
^                            # beginning of line
(\((00|\+)39\)|(00|\+)39)?   # country code or trunk code, with or without parentheses
(                            # followed by one of the following
 32[89]|                     # 328 or 329
 33[013-9]|                  # 33x where x != 2
 34[04-9]|                   # 34x where x not in 1,2,3
 35[01]|                     # 350 or 351
 36[068]|                    # 360 or 366 or 368
 37[019]|                    # 370 or 371 or 379
 38[089])                    # 380 or 388 or 389
\d{6,7}                      # ... followed by 6 or 7 digits
$                            # and end of line
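
A commented pattern like this can be used more or less directly in engines that support a verbose mode. Below is a minimal Python sketch with re.VERBOSE; the helper name is_italian_mobile and the whitespace-stripping step are assumptions for illustration (the question mentions trimming spaces before validating).

```python
import re

ITALIAN_MOBILE = re.compile(r"""
    ^                            # beginning of line
    (\((00|\+)39\)|(00|\+)39)?   # optional country code, with or without parentheses
    (                            # operator prefix, one of:
        32[89]    |              # 328 or 329
        33[013-9] |              # 33x where x != 2
        34[04-9]  |              # 34x where x not in 1,2,3
        35[01]    |              # 350 or 351
        36[068]   |              # 360 or 366 or 368
        37[019]   |              # 370 or 371 or 379
        38[089]                  # 380 or 388 or 389
    )
    \d{6,7}                      # followed by 6 or 7 digits
    $                            # end of line
""", re.VERBOSE)

def is_italian_mobile(raw):
    """Strip all whitespace, then test the number against the pattern."""
    compact = re.sub(r"\s+", "", raw)
    return ITALIAN_MOBILE.match(compact) is not None

for candidate in ["349 1234567", "+39 349 1234567", "(0039) 349 1234567", "312 1234567"]:
    print(candidate, is_italian_mobile(candidate))
```

Every alternative except the last needs its trailing |, otherwise two neighbouring prefixes are silently concatenated into one.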
There are obvious accidental gaps which will probably also get filled over time. Generalizing this further is likely to improve resilience toward future changes, but of course may at the same time increase the risk of false positives. Make up your mind about which is worse.
I found this and updated it with new operators and MVNO prefixes (Iliad, ho.):
^(\((00|\+)39\)|(00|\+)39)?(38[890]|34[4-90]|36[680]|33[13-90]|32[89]|35[01]|37[019])\d{6,7}$
I improved the regex by adding a case that handles spaces between the digit groups, plus an optional space after the country code:
^(\((00|\+)39\)|(00|\+)39)?\s?(38[890]|34[4-90]|36[680]|33[13-90]|32[89]|35[01]|37[019])(\s?\d{3}\s?\d{3,4}|\d{6,7})$
so, for example, it matches a phone number like (0039) 349 123 4567 or 349 123 4567.
Following this documentation:
https://it.qaz.wiki/wiki/Telephone_numbers_in_Italy
a simple regex for Italian mobile numbers without special characters is:
/^3[0-9]{8,9}$/
It matches a string starting with the digit '3' followed by 8 or 9 digits, e.g.:
3345678103
You can then add the Italian prefix, such as '+39 ' or '0039 ' (the plus sign must be escaped, and the zeros need no backslash):
/^\+39 3[0-9]{8,9}$/ --- match --> +39 3345678103
/^0039 3[0-9]{8,9}$/ --- match --> 0039 3345678103

Bookmark Lines with positive $ amounts in Notepad++

I have a text document with about 9000 lines. The data is alphanumeric. Within the document, there are approximately 150 lines I need to identify. The only common factor is that each contains a dollar amount. I've tried multiple regex searches, and just can't get it right.
INVALID PAYMENT AMT
013 1887000 CRJ 0.00 03/04/2015-01222015 - Code 938
INVALID PAYMENT AMT
019 0 ,CRJ 426.72 03/06/2015-01282015 - Code 628
In the example above, I need to bookmark the line with the 426.72. I don't care about the other 3 lines. Every line I need in the document has a positive dollar amount.
Perhaps:
(([1-9][0-9]*)\.([0-9]*[1-9][0-9]|00)*)|(0\.([0-9]*[1-9][0-9]))
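
The pattern above appears to skip small amounts such as 0.05, because its second alternative requires a digit after the first non-zero decimal. A different way to express "positive amount" is to match any number with two decimals and exclude all-zero values with a negative lookahead. The Python sketch below is only one possible formulation; it assumes every amount is written with exactly two decimal places, and the names POSITIVE_AMOUNT and lines_to_bookmark are made up for illustration.

```python
import re

# A number with two decimals that is not made up entirely of zeros.
POSITIVE_AMOUNT = re.compile(r"\b(?!0+\.00\b)\d+\.\d{2}\b")

def lines_to_bookmark(lines):
    """Return the lines that contain at least one positive amount."""
    return [line for line in lines if POSITIVE_AMOUNT.search(line)]

sample = [
    "INVALID PAYMENT AMT",
    "013 1887000 CRJ 0.00 03/04/2015-01222015 - Code 938",
    "INVALID PAYMENT AMT",
    "019 0 ,CRJ 426.72 03/06/2015-01282015 - Code 628",
]
print(lines_to_bookmark(sample))
```

Notepad++'s regex engine supports lookaheads as well, so the same pattern should be usable in the Mark dialog with "Bookmark line" checked.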

SAS DATA step / INPUT statement: reading column-based raw data AND multiple observations from single line?

I’m working with some raw data that has fixed column widths, but has all its records written into a single line (blame the data vendor, not me :-) ). I know how to use fixed column widths in the INPUT statement, and how to use @@ to read more than one observation per line, but I am having trouble when I try to do both.
As an example, here’s some code where the data has fixed column widths, but there is one line per record. This code works fine:
DATA test_1;
INPUT alpha $ 1-5 beta $ 6-10 gamma 11-15 ;
DATALINES;
a    f    1
ab   fg   12
abc  fgh  123
abcd fghi 1234
abcdefghij12345
;
RUN;
Now here’s the code for what I’m really trying to do: all the data is in one line, and I try to use the @@ notation:
DATA test_2;
INPUT alpha $ 1-5 beta $ 6-10 gamma 11-15 @@;
DATALINES;
a    f    1    ab   fg   12   abc  fgh  123  abcd fghi 1234 abcdefghij12345
;
RUN;
This fails because it just keeps reading the first 15 characters, holding that record, and re-reading from the start. Based on my understanding of the semantics of the @@ notation, I can definitely understand why this would be happening.
Is there any way I can accomplish reading fixed column data from a single line; that is, make test_2 have the same content as test_1? Perhaps through some combination of symbols in the INPUT statement, or maybe resorting to another method (with file I/O functions, PROC IMPORT, etc.)?
Have you tried specifying variable lengths using informats?
For example:
DATA test_2;
INPUT alpha $5. beta $5. gamma 5.0 @@;
DATALINES;
a    f    1    ab   fg   12   abc  fgh  123  abcd fghi 1234 abcdefghij12345
;
RUN;
From the SAS documentation:
Formatted input causes the pointer to move like that of column input
to read a variable value. The pointer moves the length that is
specified in the informat and stops at the next column.
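
The pointer behaviour described in that quote amounts to walking along the line in fixed-size steps. The following Python sketch is not SAS and only imitates the idea; the function name parse_fixed_records is hypothetical, and the 5/5/5 widths mirror the informats in the example above.

```python
def parse_fixed_records(line, widths=(5, 5, 5)):
    """Cut one long line into records of sum(widths) characters and split each into three fields."""
    record_len = sum(widths)
    records = []
    for start in range(0, len(line), record_len):
        # Pad a short final record so the slices below stay aligned.
        chunk = line[start:start + record_len].ljust(record_len)
        fields, pos = [], 0
        for width in widths:
            fields.append(chunk[pos:pos + width].strip())
            pos += width
        alpha, beta, gamma = fields
        records.append((alpha, beta, int(gamma)))
    return records

# Build the single input line from the five sample records, each field padded to 5 columns.
rows = [("a", "f", "1"), ("ab", "fg", "12"), ("abc", "fgh", "123"),
        ("abcd", "fghi", "1234"), ("abcde", "fghij", "12345")]
line = "".join(a.ljust(5) + b.ljust(5) + g.ljust(5) for a, b, g in rows)

for record in parse_fixed_records(line):
    print(record)
```

Each pass consumes exactly 15 characters and continues where the previous one stopped, which is what formatted input combined with a held line achieves.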