Strip characters using compress - sas

I'd like to strip non-ASCII characters from a variable. I've not had success with more elegant methods, so I'm using compress and nominating the characters I'd like to keep (because I don't know the ones I'd like to remove). It works, except that I'd like to keep both the " and ' characters, and I can't pass both of them into the compress function correctly.
data _null_;
_text='#AB'!!byte(13)!!'C"D';
_text_select=compress(_text,"ABCDEFGHIJKLMNOPQRSTUVWXYZ /-1234567890(),.'&?;=%:+><`[]*#","k");
put _text;
put _text_select;
run;

First off, if your concern is 'control' characters, the 'c' option is a good one.
compress(textstr,,'c');
That removes things in the early part of ASCII like line feeds, tabs, etc. (Probably the first 32 characters, from '00'x to '1F'x, and possibly '7F'x, though I've never seen an exact definition.)
If you want to keep basically 'printable characters', the 'w' option is helpful.
compress(textstr,,'kw');
Your method can be made to work, if it's the only way you can figure to do exactly what you want, by escaping the quote with another quote.
compress(_text,"ABCDEFGHIJKLMNOPQRSTUVWXYZ /-1234567890(),.'&?;=%:+><`[]*#""","k");
You could also use "p" to keep all punctuation marks. In fact, that lets you simplify the list at least a little.
data _null_;
_text='#AB'!!byte(13)!!'C"D';
_text_select=compress(_text," /-()&=%+><` []*#","knp");
put _text;
put _text_select;
run;
I'm not entirely sure of what is officially a 'punctuation mark', likely the - is also one, and possibly ().
Edit: Here's a good way to test what's kept (in the official ASCII set, ie, up to '7F'x):
data test;
length _text $255;
do _t = 1 to 255;
_text =byte(_t)||_text;
end;
_text_select=compress(_text," /-(),.'&""?;=%:+><`[]*#","kn");
put _text=;
put _text_select=;
run;
P seems to keep a number of odder characters, some of which I wouldn't call punctuation, so check its output against your own data (for instance with the test above) before relying on it.

Ruby Regex on Active Directory String

I have a string that represents multiple DNs for Active Directory but has been separated by commas instead of ;
The String:
CN=Admins,ou=App1,ou=groups,dc=pkldap,dc=internal,
CN=Auditors,ou=App1,ou=groups,dc=pkldap,dc=internal,
CN=Operators,ou=App2,ou=groups,dc=pkldap,dc=internal
I am trying to write a regex that will match both ou=App1 entries but not the ou=App2 entry, and also turn the , after dc=internal into a ;
Is this possible?
The result would be:
CN=Admins,ou=App1,ou=groups,dc=pkldap,dc=internal;
CN=Auditors,ou=App1,ou=groups,dc=pkldap,dc=internal;
Using #strip and #sub to Clean Up Your LDIF Data
Really, the "correct" answer would be to get valid LDIF in the first place, and then parse it as such with a gem like Net::LDAP. However, the changes you want to your existing file are fairly trivial. For example, we'll start by assigning the String data from your question to a variable named ldif using a here-document literal:
ldif = <<~'LDIF'
CN=Admins,ou=App1,ou=groups,dc=pkldap,dc=internal,
CN=Auditors,ou=App1,ou=groups,dc=pkldap,dc=internal,
CN=Operators,ou=App2,ou=groups,dc=pkldap,dc=internal
LDIF
You can now use String#each_line to iterate over the String, Enumerable#select with a Regexp lookahead assertion to pick out the lines you want, and String#strip plus String#sub to normalize them, storing the results in a matching_apps Array.
This all sounds much more complicated than it is. Consider the following method chain, which is really just a one-liner wrapped for readability:
matching_apps =
ldif.each_line.select { _1.match? /ou=App1(?=[,;]?$?)/ }
.map { _1.strip.sub /[,;]$/, ";" }
#=>
["CN=Admins,ou=App1,ou=groups,dc=pkldap,dc=internal;",
"CN=Auditors,ou=App1,ou=groups,dc=pkldap,dc=internal;"]
The use of String#strip and String#sub will help to ensure that all lines are normalized the way you want, including the trailing semicolons. However, those stored semicolons are likely to cause problems in subsequent steps, so I'd probably recommend removing them as well.
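One caveat about the pattern above: as written, both elements inside the lookahead are optional, so it can match the empty string and does not really constrain what follows App1. A value such as ou=App10 would therefore also be selected. If that matters for your data, a stricter lookahead is one option. The following is only a sketch under that assumption; the App10 line is made-up test data and not part of the question:

```ruby
ldif = <<~'LDIF'
  CN=Admins,ou=App1,ou=groups,dc=pkldap,dc=internal,
  CN=Auditors,ou=App1,ou=groups,dc=pkldap,dc=internal,
  CN=Operators,ou=App2,ou=groups,dc=pkldap,dc=internal
  CN=Testers,ou=App10,ou=groups,dc=pkldap,dc=internal,
LDIF

# (?=,|$) requires a comma or the end of the line right after "App1",
# so "ou=App10" (hypothetical test data) is not selected.
strict = /ou=App1(?=,|$)/

matching_apps = ldif.each_line
                    .select { |line| line.match?(strict) }
                    .map { |line| line.strip.sub(/[,;]\z/, ";") }

puts matching_apps
```

Explicit block parameters are used here instead of _1 purely for readability; the behavior of the chain is otherwise the same as in the one-liner.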
Note: You can stop reading here if you just want to solve your immediate question as originally posted. The rest of the answer covers additional considerations related to data normalization, and provides some examples on how and why you might want to strip the semicolons as well.
Why and How to Normalize without Semicolons
To remove the trailing semicolons (if present), change the replacement argument of #sub to an empty String (i.e. ""). Normalizing without the semicolons now may save you the trouble of cleaning up those lines again later, when you iterate over the Array of results stored in matching_apps.
For example, if you need to rejoin lines with commas, interpolate the lines within other String objects in subsequent steps, or do anything else where stored semicolons would come as a surprise, it's better to deal with them sooner rather than later. If you really need the trailing semicolons, it's very easy to add them back with String#concat or String interpolation, but stray characters in a String can be a source of bugs that are best avoided unless you're sure you'll always need that semicolon at the end.
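As a rough illustration of adding the semicolons back on demand, here is a minimal sketch. It assumes matching_apps already holds the DNs normalized without trailing punctuation; the names with_semicolons and concatenated exist only for this example:

```ruby
# Assumed input: DNs already stripped of trailing commas and semicolons.
matching_apps = [
  "CN=Admins,ou=App1,ou=groups,dc=pkldap,dc=internal",
  "CN=Auditors,ou=App1,ou=groups,dc=pkldap,dc=internal"
]

# Interpolation builds new Strings and leaves matching_apps untouched.
with_semicolons = matching_apps.map { |dn| "#{dn};" }

# String#concat mutates its receiver, so dup first to keep the originals clean.
concatenated = matching_apps.map { |dn| dn.dup.concat(";") }

puts with_semicolons
```

Either form works; the point is that the semicolon is added at the moment it is needed, while the stored Array stays free of it.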
Example 1: Output Where Semicolons Might be Unexpected
For example, suppose you want to use the results to format output for a command-line client where a trailing semicolon wouldn't be expected. The following works nicely because the semicolons are already stripped:
matching_apps =
ldif.each_line.select { _1.match? /ou=App1(?=[,;]?$?)/ }
.map { _1.strip.sub /[,;]$/, "" }
printf "Make the following calls:\n\n"
matching_apps.each_with_index do |dn, idx|
puts %(#{idx.succ}. ldapsearch -D '#{dn}' [opts])
end
This would print out:
Make the following calls:
1. ldapsearch -D 'CN=Admins,ou=App1,ou=groups,dc=pkldap,dc=internal' [opts]
2. ldapsearch -D 'CN=Auditors,ou=App1,ou=groups,dc=pkldap,dc=internal' [opts]
without having to first strip any trailing semicolons that might not work with the printed command, tool, or other output.
Examples of Rejoining with Commas and Semicolons
On the other hand, you can just as easily rejoin the Array elements with a comma or semicolon if you want. Consider the following two examples:
matching_apps.join ", "
#=> "CN=Admins,ou=App1,ou=groups,dc=pkldap,dc=internal, CN=Auditors,ou=App1,ou=groups,dc=pkldap,dc=internal"
p format("(%s)", matching_apps.join("; "))
#=> "(CN=Admins,ou=App1,ou=groups,dc=pkldap,dc=internal; CN=Auditors,ou=App1,ou=groups,dc=pkldap,dc=internal)"
Keep Flexibility in Mind
If the String objects in your Array still had the trailing semicolons, you'd have to do something about them. So, unless you already know what you plan to do with each String, and whether or not the semicolons will be needed, it's probably best to keep them out of matching_apps in the first place to optimize for flexibility. That's just an opinion, to be sure, but definitely one worth considering.

How to Handle Strings in SAS

Why is it that sometimes we need to wrap the string value in single quotes, sometimes double quotes, sometimes no quotes? This is extremely frustrating when I have to go from one proc to another, especially if it involves changing a file name or url dynamically. What is the logic behind this hideous monstrosity?
%let Name01 = John Smith;
%let Name02 = 'John Smith';
%let Name03 = "John Smith";
All three work.
%let Folder = /97network/read/Regions/Northeast/;
%let FileName = SalesTarget.xlsx;
proc import
datafile = "&Folder.&FileName."
dbms = xlsx
out = SymList replace;
sheet="Sheet1";
run;
Here, &Folder.&FileName. must be in double quotes.
filename OutFile "/06specialty/ATam/AMZN.csv";
proc http url = &urlAddress. method = "get" out = OutFile;
run;
Finally, if I want to download stock prices from Yahoo Finance, url = may take the address in single quotes, or &urlAddress. in no quotes, but you cannot use double quotes. OutFile can be in single or double quotes, but not no quotes. Then in the out = clause, you have OutFile, not &OutFile.
SAS strings are very simple. They are enclosed in either single or double quote characters.
'Hello there'
"Good-bye"
If the enclosing character appears in the string it needs to be doubled up.
'I don''t know'
As for your first example, it is probably your operating system that is allowing filenames to include optional quotes. On Windows and Linux the quotes can even be required in some situations, when the path includes spaces or other characters that the command shell would normally interpret as delimiters in the command line.
Adding macro logic into the program is probably a large part of your confusion. First figure out what code works for the commands you are using and then you can try to generate that code using the macro processor.
Once you introduce macro logic you need to pay attention to whether your strings are using single or double quotes. There is a big difference between how macro logic interacts with single and double quote characters. Strings that are bounded by single quote characters are ignored by the macro processor, so the macro trigger characters & and % are treated as normal characters. Strings that are bounded by double quote characters will be processed.
Your second example adds the complexity of working with URL syntax. URL strings use the & character for their own purpose, so you need to understand how SAS is going to see the code you type, and whether or not the macro processor will attempt to interpret it, to ensure that the string needed for the URL is actually created.
SAS has 50 years of history and a lot of the code is legacy. SAS is backwards compatible. You can still run code 30 years old with no issues. There are lots of oddities, such as quotes, that are there...and will always be there. SAS is kind of a conglomeration of ~300 languages (every proc is unique plus multiple meta-languages).
Since SAS will never change, best to just ignore the oddities.
One other thing. SAS runs on lots of O/Ss so every nuance there has to be accommodated in a mostly neutral way.

How do I use numeric functions to correct date typos?

I know it's easy enough to do manual corrections on date typos, but I want to automate such corrections using one or more SAS functions, given that my dataset is large and typos are frequent.
For instance, it seems that whoever created the dataset I am cleaning often transposed digits in the year of someone's birthdate (e.g., '2102' rather than '2012', '2110' instead of '2010', etc). I'm aware of string functions such as INDEX() that find certain character values or strings and then allow for the replacement of said characters in the same position (i.e., replace "ABCD" with "ABBB", regardless of the string's location in a value). Can the same process be replicated with numeric (and specifically date) values?
I don't think SAS has any functions that would check numeric values for digit patterns. I often do data cleaning and address this issue by making a character variable out of the numeric date variable, then using character functions and Perl regex to clean the character values, and then storing the cleaned values as numeric date.
For specifically date values, you could try using SAS date functions (e.g. DAY(), MONTH(), YEAR(), MDY(), etc.) to extract parts of the date value, error-check them, and put them all back together into a date value. This could be a good quick solution if you expect a limited set of typos and you roughly know what they are. For a more thorough error check, converting the numeric values to character and using char or regex functions would give you more options.
The only really concise suggestion I can imagine is using mdy (Assuming this is date, not datetime variables).
For example:
data want;
set have;
if year(datevar) > 2100 then
datevar = mdy(month(datevar),day(datevar),year(datevar)-90);
run;
would correct any '2104' to '2014'. That's a very simple correction (and may well do as much harm as good, since '2114' is also a possible typo), but things along those lines - break the date up into its pieces, verify the pieces, reconstruct using mdy.

Is there a limit to the value levels in a proc format statement?

proc format;
value $STNAME 'AL'='Alabama'
'AK'='Alaska
'AR'='Arkansas'
'AZ'='Arizona'
'CA'='California'
'CO'='Colorado'
'CT'='Connecticut'
'DC'='DistrictOfColumbia'
'DE'='Deleware'
'FL'='Florida'
'GA'='Georgia'
'HI'='Hawaii'
'IA'='Iowa'
'ID'='Idaho'
'IL'='Illinois'
'IN'='Indiana'
'KS'='Kansas'
'KY'='Knetucky'
'LA'='Louisiana'
'MA'='Massachusetts'
'MD'='Maryland'
'ME'='Maine'
'MI'='Michigan'
'MN'='Minnesota'
'MO'='Missouri'
'MS'='Mississippi'
'MT'='Montana'
'NC'='North Carolina'
'ND'='North Dakota'
'NE'='Nebraska'
'NH'='New Hampshire'
'NJ'='New Jersey'
'NM'='New Mexico'
'NY'='New York'
'NV'='Nevada'
'OH'='Ohio'
'OK'='Oklahoma'
'OR'='Oregon'
'PA'='Pennsylvania'
'RI'='Rhode Island'
'SC'='South Carolina'
'SD'='South Dakota'
'TN'='Tennessee'
'TX'='Texas'
'UT'='Utah'
'VA'='Virginia'
'VT'='Vermont'
'WA'='Washington'
'WI'='Wisconsin'
'WV'='West Virginia'
'WY'='Wyoming';
run;
It freezes up in the middle of the proc format step. If I split or shorten it, it runs fine.
Anyone aware how to get around this?
You are missing a closing quote on Alaska. I placed the code in my IDE and I could tell from the highlighting.
As long as your hard drive can hold the SAS program file, SAS does not impose a limit on the number of unique values inside a proc format or on the amount of memory needed to load it. As @Carolina has suggested, you are missing an end quote for Alaska. Without the end quote, the states after Alaska are highlighted in a different color. After you add the end quote, the highlighting after Alaska should change to a uniform color.
It might be better to use more conventional spacing for better readability.
Also, you might want spaces in 'DistrictOfColumbia', and both Kentucky and Delaware are spelt incorrectly.
Hope this helps.

SAS: Where statement not working with string value

I'm trying to use PROC FREQ on a subset of my data called dataname. I would like it to include all rows where varname doesn't equal "A.Never Used". I have the following code:
proc freq data=dataname(where=(varname NE 'A.Never Used'));
run;
I thought there might be a problem with trailing or leading blanks so I also tried:
proc freq data=dataname(where=(strip(varname) NE 'A.Never Used'));
run;
My guess is for some reason my string values are not "A.Never Used" but whenever I print the data this is the value I see.
This is a common issue in dealing with string data (and a good reason not to!). You should consider the source of your data - did it come from web forms? Then it probably contains nonbreaking spaces ('A0'x) instead of regular spaces ('20'x). Did it come from a unicode environment (say, Japanese characters are legal)? Then you may have transcoding issues.
A few options that work for a large majority of these problems:
Compress out everything but alphabet characters. where=(compress(varname,,'ka') ne 'ANeverUsed') for example. 'ka' means 'keep only' and 'alphabet characters'.
UPCASE or LOWCASE to ensure you're not running into case issues.
Use put varname HEX.; in a data step to look at the underlying characters. Every two hex digits represent one character; 20 is a space (which strip would remove). Sort by varname before doing this so that the rows you think should have this value sit next to each other, then compare them. The difference is probably some special character or multibyte character, and it should be apparent in the hex output.