Removing quotes and spaces in SAS dataset - sas

I am working in SAS EG and DI, facing a very peculiar problem.
When I look into a column of a dataset in SAS DI Studio or EG, it is appearing fine. But when I paste the data into notepad, some quotes and spaces are appearing.
The data which I am seeing in EG:
But the same data when copied into Notepad,
extra quotes and spaces are appearing like this(in 6th row):
I found this problem when I am using this field as a key in a join, the other related column values for 6th row are not going to the output as the match is failing for that 6th record.
I tried many things like tranwrd,dequote and compress but none of them is changing my result.
Can someone please help in understanding what the problem is and how can this be solved.

Take a look at what is in the column so that you can decide how to handle it. This query will show you both the character string and the Hexadecimal representation of the string.
proc sql;
select postcode,put(trim(postcode),$hex.) as hexcode,count(*) as nobs
from x
group by 1,2
;
quit;
So if you see hex characters like 0A, 0D, A0, 08 or other non-printable codes then you can figure out what is happening.
So you might see that you have POSTCODE='LS5 3BT' with HEXCODE='4C533520334254' for most of the records. But perhaps have some that look like the POSTCODE='LS5 3BT', but the value of HEXCODE is something like '0A4C533520334254' which would mean that you have a linefeed character at the beginning of the string. Or perhaps instead of space ('20'X) you have a tab ('09'X) in the middle of the string.

Related

Extract multiple substrings of numbers of a specific length from string in Google Sheets

I'd need to split or extract only numbers made of 8 digits from a string in Google Sheets.
I've tried with SPLIT or REGEXREPLACE but I can't find a way to get only the numbers of that length, I only get all the numbers in the string!
For example I'm using
=SPLIT(lower(N2),"qwertyuiopasdfghjklzxcvbnm`-=[]\;' ,./!:##$%^&*()")
but I get all the numbers while I only need 8 digits numbers.
This may be a test value:
00150412632BBHBBLD 12458 32354 1312548896 ACT inv 62345471
I only need to extract "62345471" and nothing else!
Could you please help me out?
Many thanks!
Please use the following formula for a single cell.
Drag it down for more cells.
=INDEX(TRANSPOSE(QUERY(TRANSPOSE(IF(LEN(SPLIT(REGEXREPLACE(A2&" ","\D+"," ")," "))=8,
SPLIT(REGEXREPLACE(A2&" ","\D+"," ")," "),"")),"where Col1 is not null ",0)))
Functions used:
QUERY
INDEX
TRANSPOSE
IF
LEN
SPLIT
REGEXREPLACE
If you only need to do this for one cell (or you have your heart set on dragging the formula down into individual cells), use the following formula:
=REGEXEXTRACT(" "&N2&" ","\s(\d{8})\s")
However, I suspect you want to process the eight-digit number out of all cells running N2:N. If that is the case, clear whatever will be your results column (including any headers) and place the following in the top cell of that otherwise cleared results column:
=ArrayFormula({"Your Header"; IF(N2:N="",,IFERROR(REGEXEXTRACT(" "&N2:N&" ","\s(\d{8})\s")))})
Replace the header text Your Header with whatever you want your actual header text to be. The formula will show that header text and will return all results for all rows where N2:N is not null. Where no eight-digit number is found, null will be returned.
By prepending and appending a space to the N2:N raw strings before processing, spaces before and after string components can be used to determine where only eight digits exist together (as opposed to eight digits within a longer string of digits).
The only assumption here is that there are, in fact, spaces between string components. I did not assume that the eight-digit number will always be in a certain position (e.g., first, last) within the string.
Try this, take a look at Example sheet
=FILTER(TRANSPOSE(SPLIT(B2," ")),LEN(TRANSPOSE(SPLIT(B2," ")))=8)
Or this to get them all.
=JOIN(" ,",FILTER(TRANSPOSE(SPLIT(B2," ")),LEN(TRANSPOSE(SPLIT(B2," ")))=8))
Explanation
SPLIT with the dilimiter set to " " space TRANSPOSE and FILTER TRANSPOSE(SPLIT(B2," ") with the condition1 set to LEN(TRANSPOSE(SPLIT(B2," "))) is = 8
JOIN the outputed column whith " ," to gat all occurrences of number with a length of 8
Note: to get the numbers with the length of N just replace 8 in the FILTER function with a cell refrence.
Using this on a cell worked just fine for me:
(cell_with_data)=REGEXEXTRACT(A1,"[0-9]{8}$")

Regex for values that are in between spaces

I am new to regex and having difficulty obtaining values that are caught in between spaces.
I am trying to get the values "field 1" "abc/def try" from the sameple data below just using regex
Currently im using (^.{18}\s+) to skip the first 18 characters, but am at at loss of how to do grab values with spaces between.
A1234567890 field 1 abc/def try
02021051812 12 test test 12 pass
3333G132021 no test test cancel
any help/pointers will be appreciated.
If this text has fixed-width columns, you can match and trim the column values knowing the amount of chars between start of string and the column text.
For example, this regex will work for the text you posted:
^(.*?)\s*(?<=.{19})(.*?)\s*(?<=^.{34})(.*?)\s*(?<=^.{46})
See the regex demo.
So, Column 2 starts at Position 19, Column 3 starts at Position 34 and Column 4 (end of string here) is at Position 46.
However, this regex is not that efficient, and it would be really great if the data format is fixed on the provider's side.
Given the not knowing if the data is always the same length I created the following, which will provide you with a group per column you might want to use:
^((\s{0,1}\S{1,})*)(\s{2,})((\s{0,1}\S{1,})*)(\s{2,})((\s{0,1}\S{1,})*)
Regex demo

regex in SAS - how to strip white spaces surrounding special characters?

For the address data, would anyone have any idea on how to strip white spaces around the special characters in the numeric bit of these records? For the third record, for example, I want to have "44/95" instead of "44 / 95".
The special characters I want to do this for all "/","-","|" and ",".
I am guessing that using regex is the best way but I can't think of how to do this.
data addresses1;
infile datalines ;
input #1 address $35. ;
format address $50.;
datalines;
26 32-50 CENTRE DANDENONG ROAD
9 /93-95 DANDENONG ROAD EAST
44 / 95 OUTER CRESCENT
17| 21-25 PARKHILL DRIVE
run;
I have tried something like the following code, but did not work. Can someone point me to the right direction?
data addresses2;
set addresses1;
format fixed_address fixed_address2 $255.;
address=strip(address);
fixed_address2=compbl(strip(prxchange("s/(?<=[\|.\(\)\{\}\-\:\s\*\;\.\#\&\_\/\\]) +(?=\[\|.\(\)\{\}\-\:\s\*\;\.\#\&\_\/\\])/$1/",-1,strip(fixed_address))));
run;
I have made a RegEx for you, that should Work:
\S*( ?(?![/|,-])).*(?<![[/|,-])
It selects zero or more Non-Whitesspaces then a Space not followed by any of your characters, then one or more of any character, making sure the previous char is not any of your characters. It's not elegant and you will have to strip empty maches.

Rearrange text on SAS

I can not find the way to reverse text strings.
For example I want to reverse these:
MMMM121231M34 to become 43M132121MMMM
MM1M11M1 to become 1M11M1MM
1111213111 to become 1113121111
Judging from your examples, what you mean by 'rearrange' is actually 'reverse'.
In that case, you've got the very handy reverse() function in SAS.
Used in context:
data test;
length text $32;
infile datalines;
input text $;
result=reverse(strip(text));
datalines;
MMMM121231M34
MM1M11M1
1111213111
;
run;
EDIT on #Joe's request: in the particular example above, I create the test dataset by setting a length of 32 characters for the text variable. Therefore, when reading the values from datalines, these are padded with blanks up to that total of 32 characters. Hence, when reversing that value, the result has that many blanks at the start, followed by the actual value you are looking for. By adding the strip function, you remove the excess blanks from the value of text before reversing, keeping only the "real" value in the result.

How do I concatenate cells and add extra text?

I'm very new to Calc but a relative veteran with Excel. Unfortunately I don't have the latter available to me. I'm attempting to create a new cell inline with the data I need to use like the below
AF Afghanistan
AL Albania
DZ Algeria
with an output in Column C like this
<option value="AF">Afghanistan</option>
I've tried to use the CONCATENATE function to no avail. Could someone point me in the right direction on how to achieve this in OpenOffice Calc (Version 3).
Thanks
I suppose it's a problem of escaping the quotes, since they delimit the "extra strings", too. Anyway, it should work with CONCATENATE, using this formula:
=CONCATENATE("<option value=""";A1;""">";B1;"</option>")
EDIT:
Sorry, every time messing up argument separators (with german l11n, semicolons instead of commata are used...) With an english (US) localisation, you need this version:
=CONCATENATE("<option value=""",A1,""">",B1,"</option>")
If doubling the qoutes around the first cell reference doesn't work, try to replace it with CHAR(34) (the decimal ASCII code for double quotes is 34, while 22 would be the hex value):
=CONCATENATE("<option value=",CHAR(34),A1,CHAR(34),">",B1,"</option>")
suppose 'AF' was in column A1 and 'Afghanistan' was in column C1, then this would produce the desired result
="<option value='"&A1&"'>"&C1&"</option>"
That code would give you this output
<option value='AF'>Afghanistan</option>