Comparing datasets - SAS

I have 2 datasets: one containing the columns origin_zip (numeric), destination_zip (char), and tracking_number (char), and the other containing zip.
I would like to compare these 2 datasets so I can see all the tracking_numbers and destination_zips that are not in the zip column of the second dataset.
Additionally, I would like to see all of the tracking_numbers and origin_zips where origin_zip = destination_zip.
How would I accomplish this?
origin_zip destination_zip tracking_number
12345 23456 11111
34567 45678 22222
12345 12345 33333
zip
12345
34567
23456
results_tracking_number
22222
33333

Let's start with this...I don't think this completely answers your question, but follow up with comments and I will help if I can...
data zips;
  input origin_zip $ destination_zip $ tracking_number $;
  datalines;
12345 23456 11111
34567 45678 22222
56789 12345 33333
;

data zip;
  input zip $;
  datalines;
12345
54321
34567
76543
56789
;

proc sort data=zips;
  by origin_zip;
run;

proc sort data=zip;
  by zip;
run;

data contained not_contained;
  merge zip(in=a) zips(in=b rename=(origin_zip=zip));
  by zip;
  if a and b then output contained;
  if a and not b then output not_contained;
run;
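To also get the two outputs the question actually asks for, here is a proc sql sketch against the zips and zip datasets created above (the output table names dest_not_in_zip and origin_eq_dest are made up; both zip columns are character in these test datasets, so if origin_zip is really numeric in your data, convert it first, e.g. with put(origin_zip, z5.)):
proc sql;
  /* tracking numbers whose destination_zip has no match in the lookup table */
  create table dest_not_in_zip as
    select tracking_number, destination_zip
    from zips
    where destination_zip not in (select zip from zip);
  /* tracking numbers where the shipment starts and ends in the same zip */
  create table origin_eq_dest as
    select tracking_number, origin_zip
    from zips
    where origin_zip = destination_zip;
quit;
Run against the sample data shown in the question, those two tables together give the expected results_tracking_number values 22222 and 33333.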

Related

How to use two regex capture groups to make two pandas columns

I have a dataframe column of strings and I want to extract numbers to another column:
column
1 abc123
2 def456
3 ghi789jkl012
I've used:
dataframe["newColumn"] = dataframe["column"].str.extract("(\d*\.?\d+)", expand=True)
It works, but only captures the first block of numbers to one column. My desired output is
column newColumn newColumn2
1 abc123 123 NaN
2 def456 456 NaN
3 ghi789jkl012 789 012
but I can't figure out how to do it.
Use Series.str.extractall with Series.unstack and DataFrame.add_prefix, then add to the original DataFrame with DataFrame.join:
df = dataframe.join(dataframe["column"].str.extractall("(\d*\.?\d+)")[0]
                    .unstack()
                    .add_prefix('newColumn'))
print (df)
column newColumn0 newColumn1
1 abc123 123 NaN
2 def456 456 NaN
3 ghi789jkl012 789 012
Or you can use (\d+), thank you @Manakin:
df = dataframe.join(dataframe["column"].str.extractall("(\d+)")[0]
                    .unstack()
                    .add_prefix('newColumn'))
print (df)
Can also use split with expand=True and join back to df (np here is numpy, so it needs the import):
import numpy as np

df.join(df.column.str.split('\D+', expand=True)
          .replace({None: np.NaN})
          .rename({2: 'newColumn2', 1: 'newColumn'}, axis=1)
          .iloc[:, -2:])
column newColumn newColumn2
1 abc123 123 NaN
2 def456 456 NaN
3 ghi789jkl012 789 012

SAS: Specify newline character when input a text file

I'm quite new to SAS and have a very simple problem. I have a text file which is saved like this:
123,123,345,457,4.55~123,123,345,457,4.55~123,123,345,457,4.55
So all the data is written in a single line. The ~ character denotes the newline character. My final goal is to load the text file into SAS and create a SAS dataset which should look like this:
V1 V2 V3 V4 V5
123 123 345 457 4.55
123 123 345 457 4.55
123 123 345 457 4.55
So ',' is the delimiter and '~' is the new line character.
How can I achieve this?
Thank you very much for your response.
Kind Regards
Consti
Just tell SAS to use both characters as delimiters and add @@ to the input statement to prevent it from going to a new line.
data want;
  infile cards dsd dlm=',~';
  input v1-v5 @@;
cards;
123,123,345,457,4.55~123,123,345,457,4.55~123,123,345,457,4.55
;;;;
Result
Obs v1 v2 v3 v4 v5
1 123 123 345 457 4.55
2 123 123 345 457 4.55
3 123 123 345 457 4.55
If you are reading from a file then you might also be able to use the RECFM=N option on the INFILE statement instead of the @@ on the INPUT statement, although if the one line actually has LF or CR/LF at the end then you might want to include them in the delimiter list also.
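An untested sketch of what that RECFM=N variant might look like when reading from an external file (the path is a placeholder; the delimiter list is written as a hex literal so it can hold comma, tilde, CR and LF in one string):
data want;
  /* '2C7E0D0A'x = comma, tilde, CR, LF */
  infile 'c:\mydata.txt' recfm=n dlm='2C7E0D0A'x;
  input v1-v5;
run;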
Tom's answer is correct for files that are regular and don't have issues with inconsistent rows.
If you do need to do exactly what you say, though, it's possible: you'd convert ~ to a newline in a pre-processing step. Here's one way to do that.
First, in a data step, go through the file with a dlm of ~; input the fields until you reach the end of the line, and for each field, output it to a temp file (so each line now holds just one data row).
Now you have a temp file you can read in like normal, with no ~ characters in it.
You could do this in a number of other ways; for example, literally find/replace ~ with '0D0A'x or whatever your preferred EOL character is. That is probably easier/faster to do in another language: if you have this on Unix and have access to perl, or even awk etc., you could do it more easily than in SAS.
filename test_in "c:\temp\test_dlm.txt";
filename test_t temp;

data _null_;
  infile test_in dlm='~';
  file test_t;
  length test_field $32767;
  do _n_ = 1 by 1 until (_n_ > countc(_infile_, '~'));
    input test_field :$32767. @@;
    putlog test_field;
    put test_field $;
  end;
  stop;
run;

data want;
  infile test_t dlm=',';
  input v1 v2 v3 v4 v5;
run;
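For completeness, here is an untested sketch of the other idea mentioned above, the literal find/replace of ~ with '0D0A'x, done in SAS itself: copy the file one byte at a time and swap each ~ for CR/LF. It reuses the test_in and test_t filerefs defined above, and the variable name byte is made up; afterwards test_t can be read with a plain infile/dlm=',' step just like the want step above.
data _null_;
  infile test_in recfm=n;          /* read the raw file as a byte stream     */
  file test_t recfm=n;             /* write a byte stream, no record ends    */
  input byte $char1.;              /* one byte per data step iteration       */
  if byte = '~' then put '0D0A'x;  /* replace each ~ with CR/LF              */
  else put byte $char1.;           /* copy every other byte unchanged        */
run;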

How do I force lines to be a certain length?

I have a text file that contains a very large list of 5-digit numbers. Some lines contain more than one 5-digit number, with no newline separating them:
12345
23456
34567
4567856789
67890
...
837460174975917
...
I'm trying to find a regular expression that I can use with sed that will add newlines in-between the numbers.
The desired output would be:
12345
23456
34567
45678
56789
67890
...
83746
01749
75917
...
I've played around with it a bit, but the best I can figure out is something like ^([0-9]{5}) replaced with $1\r\n. However, this adds a newline after every digit, and I'd need to remove all the blank lines afterwards, which is not optimal because of the size of this file.
Lightweight solution using fold:
Sample input:
cat filename
12345
23456
34567
4567856789
Solution using fold:
cat filename|fold -w5
12345
23456
34567
45678
56789
Update (as suggested by Kenavoz): to avoid unnecessary use of cat and a pipe:
fold -w5 filename
Using grep -o you can do this:
grep -Eo '.{5}' file
12345
23456
34567
45678
56789
67890
83746
01749
75917

Default behavior of Input Buffer in SAS while reading data from external file

Contents of a.txt
22
333
4444
55555
But when I run this code:
data numbers;
  infile 'c:\a.txt';
  input var 5.;
  /* list */ ;
run;
the data in the numbers dataset is saved as:
333
55555
** Note the format of the data in the numbers dataset and the format in a.txt.
But when I use the LIST statement, the input buffer looks somewhat like this:
RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7
2 333 3
4 55555 5
Why doesn't SAS show lines 1 and 3? And how is the input buffer being read?
Please explain.
Try adding TRUNCOVER to your INFILE statement, or remove the 5. after the variable on your INPUT statement. With the 5. informat SAS expects a 5-character field, and under the default FLOWOVER behavior it keeps reading on the next line whenever the current line is shorter than that: line 1 (22) flows into line 2 (333) and line 3 (4444) flows into line 4 (55555), which is why only records 2 and 4 appear in the buffer listing and only 333 and 55555 end up in the dataset.
data numbers;
infile 'c:\a.txt' truncover;
input var 5.;
run;
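And a sketch of the other fix mentioned above, dropping the 5. so SAS falls back to simple list input and reads just whatever is on each line:
data numbers;
  infile 'c:\a.txt';
  input var;  /* list input stops at the end of the value, so short lines are fine */
run;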
For more info, read this paper on INFILE statement options.

Delete similar lines csh

I've seen several articles on deleting duplicate lines, but I need something a little more specific. Here is an example of some raw data:
11111 AA 1 date1
11111 BB 64 date1
11111 BB 64 date2
...
11111 BB 64 date64
11111 BB 64 date1
11111 BB 64 date2
...
11111 BB 64 date64
11111 BB ## date1
11111 BB ## date2
...
11111 BB ## date##
22222 AA 1 date1
22222 BB 64 date1
22222 BB 64 date2
...
22222 BB 64 date64
22222 BB 64 date1
22222 BB 64 date2
...
22222 BB 64 date64
22222 BB ## date1
22222 BB ## date2
...
22222 BB ## date##
Note: Where ## is some number < 64.
I need to edit that file so it looks something like this:
11111 AA 1 date1
11111 BB 64 date1
11111 BB 64 date1
11111 BB ## date1
22222 AA 1 date1
22222 BB 64 date1
22222 BB 64 date1
22222 BB ## date1
I've seen several examples of using awk, sed, or ed along with regex to match the first part of a line. My confusion is with the occurrence of the "BB 64" and "BB ##" lines, and not just deleting all BB lines but the first.
Vital Info: Running this csh script on a Solaris v5.8
The AA lines are not important in this question except to know they are there (we are not doing anything with them).
Here's essentially what I've got so far (still having syntax issues from looking at examples using other languages, so if you can correct please do):
sed 'N;(\d{1,8}\sBB\s\d{1,2}.+\n);P;D' filename
If I were not getting errors due to syntax, I am sure this would delete all BB lines but the first "BB 64 date1". I think my sed regex above is based on uniq, but it only matches the first part of the line instead of the entire line, because I need the first date of each BB group (if there is more than one series of BB 64 for each 11111, 22222, etc., the output should contain an identical BB 64 line for each series [just date1]). Any ideas?
Seems like sort -k4,4 | uniq would do the trick? (or sort +3 if the Solaris version is sufficiently old.)