I wanted to see if this was doable in SAS. I have a dataset of the members of congress and want to split full name into first and last. However, occasionally they seem to list their middle initial or name. It is from a .txt file.
Norton, Eleanor Holmes [D-DC] 16 0 440 288 0
Cohen, Steve [D-TN] 15 0 320 209 0
Schakowsky, Janice D. [D-IL] 6 0 289 186 0
McGovern, James P. [D-MA] 8 1 252 139 0
Clarke, Yvette D. [D-NY] 7 0 248 166 0
Moore, Gwen [D-WI] 2 3 244 157 1
Hastings, Alcee L. [D-FL] 13 1 235 146 0
Raskin, Jamie [D-MD] 8 1 232 136 0
Grijalva, Raul M. [D-AZ] 9 1 228 143 0
Khanna, Ro [D-CA] 4 0 223 150 0
Good day,
SAS is a bit clunky when it comes to strings. However, it can be done. As others have mentioned, defining the logic is the really hard part.
Begin with some raw data...
data begin;
input raw_str $ 1-100;
cards;
Norton, Eleanor Holmes [D-DC] 16 0 440 288 0
Cohen, Steve [D-TN] 15 0 320 209 0
Schakowsky, Janice D. [D-IL] 6 0 289 186 0
McGovern, James P. [D-MA] 8 1 252 139 0
Clarke, Yvette D. [D-NY] 7 0 248 166 0
Moore, Gwen [D-WI] 2 3 244 157 1
Hastings, Alcee L. [D-FL] 13 1 235 146 0
Raskin, Jamie [D-MD] 8 1 232 136 0
Grijalva, Raul M. [D-AZ] 9 1 228 143 0
Khanna, Ro [D-CA] 4 0 223 150 0
; run;
First, select the leading names up to the first bracket, then count the number of space-separated strings:
data names;
set begin;
names_only = scan(raw_str,1,'[');
Nr_of_str = countw(names_only,' ');
run;
Assumption: the first string is the last name.
If there are only 2 strings, the first and last names are pretty easy to get with scan and substring:
data names2;
set names;
if Nr_of_str = 2 then do;
last_name = scan(names_only, 1, ', '); /* comma added as a delimiter so the trailing comma is dropped */
_FirstBlank = find(names_only, ' ');
first_name = strip(substr(names_only, _FirstBlank));
end;
run;
Assumption: there are only 3 strings.
Approach 1: the middle name has a dot in it, so filter it out.
Approach 2: the middle name is shorter than the first name:
data names3;
set names2;
if Nr_of_str > 2 then do;
last_name = scan(names_only, 1, ', '); /* this should still hold; comma delimiter drops the trailing comma */
_FirstBlank = find(names_only, ' '); /* substring approach */
first_name = strip(substr(names_only, _FirstBlank));
second_str = scan(names_only, 2, ' ');
third_str = scan(names_only, 3, ' ');
if find(second_str,'.') = 0 then /* 1st approach */
first_name = scan(names_only, 2, ' ');
else
first_name = scan(names_only, 3, ' ');
/* 2nd approach: overwrites the result of the 1st, so keep only one of the two */
if length(second_str) > length(third_str) then
first_name = second_str;
else
first_name = third_str;
end;
run;
run;
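For comparison, the same splitting rule can be written as a short, language-neutral sketch in Python. It is only an illustration under stated assumptions: the text before the first "[" holds the name, the part before the comma is the last name, and the first token after the comma is the first name (anything left over is treated as a middle name or initial). The function name split_name is hypothetical.

```python
def split_name(line):
    """Split one raw line into (last, first, middle).

    Assumptions (not from the original answer): the name ends at the
    first '[', the last name precedes the comma, and the first token
    after the comma is the first name.
    """
    names_only = line.split("[", 1)[0].strip()
    last, _, rest = names_only.partition(",")
    given = rest.split()
    first = given[0] if given else ""
    middle = " ".join(given[1:])
    return last.strip(), first, middle


if __name__ == "__main__":
    for raw in ("Norton, Eleanor Holmes [D-DC] 16 0 440 288 0",
                "Khanna, Ro [D-CA] 4 0 223 150 0"):
        print(split_name(raw))
```

Splitting on the comma first avoids having to count strings, but it would misread records where the source lists names in a different order.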
For more, see the SAS documentation on the SUBSTR and SCAN functions.
I'm trying to create a data set that will show me the duplicate transactions. The trouble I'm running into is when there are multiple orders on one order_id. I would consider the records that get assigned a 2 to be the duplicate orders.
data have;
input acct_id order_id;
datalines;
1 121
1 122
2 123
2 124
3 125
3 125
3 125
3 126
3 126
3 126
;
run;
data want;
set have;
by acct_id order_id;
if first.acct_id then order_count = 1;
else order_count =2;
run;
My desired output is below.
acct_id | order_id | order_count
1 121 1
1 122 2
2 123 1
2 124 2
3 125 1
3 125 1
3 125 1
3 126 2
3 126 2
3 126 2
I feel like what I have coded so far is close, but I can't figure it out.
data want;
set have;
by acct_id order_id notsorted;
if first.acct_id then order_count=0;
if first.order_id then order_count+1;
put acct_id order_id order_count;
run;
acct_id order_id order_count
1 121 1
1 122 2
2 123 1
2 124 2
3 125 1
3 125 1
3 125 1
3 126 2
3 126 2
3 126 2
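The counter logic of the data step above (reset on first.acct_id, increment on first.order_id) can be traced with a small Python sketch. This is only an illustration of the BY-group behaviour, not SAS itself; the function name number_orders is hypothetical, and the rows are assumed to arrive already grouped by acct_id and order_id.

```python
def number_orders(rows):
    """Number each distinct order_id within its acct_id, in input order.

    Mirrors the data step: the counter resets when acct_id changes and
    increments when a new (acct_id, order_id) group starts.
    """
    out = []
    prev_acct = prev_order = None
    count = 0
    for acct, order in rows:
        if acct != prev_acct:                          # first.acct_id
            count = 0
        if acct != prev_acct or order != prev_order:   # first.order_id
            count += 1
        out.append((acct, order, count))
        prev_acct, prev_order = acct, order
    return out


have = [(1, 121), (1, 122), (2, 123), (2, 124),
        (3, 125), (3, 125), (3, 125), (3, 126), (3, 126), (3, 126)]

if __name__ == "__main__":
    for row in number_orders(have):
        print(*row)
```

Note that first.order_id is also true whenever acct_id changes, which is why the second condition checks both variables.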
I have a dataset that looks like this
ID Model_Value Count_Model
111 24 2
222 12 9
234 88 6
111 88 8
222 24 10
222 88 17
I want it to look like this:
ID Model_12 Model_24 Model_88
111 0 2 8
222 9 10 17
234 0 0 6
I don't think I am searching online for the correct terms. I initially thought a transform might work, but I still want each row to represent the ID, not the model.
How do I go about creating this output from what I have?
Ok I believe this is it! Thank you @mjsqu !!
I was able to do this with the help of this link: http://www.sascommunity.org/mwiki/images/d/dd/PROC_Transpose_slides.pdf
data test_transpose;
/* list input: the sample rows are space-separated, so no column pointers are needed */
input ID_P Model_Value Count_Model;
cards;
111 24 2
222 12 9
234 88 6
111 88 8
222 24 10
222 88 17
;
run;
proc print data=test_transpose;
run;
proc sort data=test_transpose out=test_transpose_S;
By ID_P;
run;
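proc transpose
data = test_transpose_S
out = test_transpose_result (drop=_name_)
prefix=Model_; /* with ID Model_Value this yields Model_12, Model_24, Model_88 */
var Count_Model;
BY ID_P;
id Model_Value;
run;
/* ID/model pairs absent from the input come out as missing (.), not 0 */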
proc print data=test_transpose_result ;
run;
Output of the original sorted dataset and the transpose!
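To make the long-to-wide reshaping explicit, here is a minimal Python sketch of what the transpose does, with the one difference that absent ID/model pairs are filled with 0 as in the desired output. The function name pivot_counts and the tuple layout are assumptions made for illustration.

```python
def pivot_counts(rows):
    """Turn (id, model_value, count) rows into one dict per id.

    Every id gets an entry for every model value seen in the data;
    combinations that never occur default to 0.
    """
    models = sorted({model for _, model, _ in rows})
    wide = {}
    for id_, model, count in rows:
        wide.setdefault(id_, dict.fromkeys(models, 0))[model] = count
    return models, wide


rows = [(111, 24, 2), (222, 12, 9), (234, 88, 6),
        (111, 88, 8), (222, 24, 10), (222, 88, 17)]

if __name__ == "__main__":
    models, wide = pivot_counts(rows)
    print("ID", *(f"Model_{m}" for m in models))
    for id_ in sorted(wide):
        print(id_, *(wide[id_][m] for m in models))
```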
Here is the data I have. I use proc tabulate to present it the way it appears in Excel and to make the visualization easier. The goal is for the groups strictly below the diagonal (I know it's a rectangle; I mean the (1,1), (2,2), ..., (7,7) "diagonal") to roll up the column until they hit the diagonal or form a group size of at least 75.
1 2 3 4 5 6 7 (month variable)
(age)
1 80 90 100 110 122 141 88
2 80 90 100 110 56 14 88
3 80 90 87 45 12 41 88
4 24 90 100 110 22 141 88
5 0 1 0 0 0 0 2
6 0 1 0 0 0 0 6
7 0 1 0 0 0 0 2
8 0 1 0 0 0 0 11
I've already used if/thens to regroup certain data values, but I need a general way to do it for other sets.
Thanks in advance
desired results
1 2 3 4 5 6 7 (month variable)
(age)
1 80 90 100 110 122 141 88
2 80 90 100 110 56 14 88
3 104 90 87 45 12 41 88
4 0 94 100 110 22 141 88
5 0 0 0 0 0 0 2
6 0 0 0 0 0 0 6
7 0 0 0 0 0 0 13
8 0 0 0 0 0 0 0
Mock up some categorical data for some patients who have to be counted
data mock;
do patient_id = 1 to 2500;
month = ceil(7*ranuni(123));
age = ceil(8*ranuni(123));
output;
end;
stop;
run;
Create a tabulation of counts (N) similar to the one shown in the question:
options missing='0';
proc tabulate data=mock;
class month age;
table age,month*n=''/nocellmerge;
run;
For each month get the sub-diagonal patient count
proc sql;
/* create table subdiagonal_column as */
select month, count(*) as subdiag_col_freq
from mock
where age > month
group by month;
For each row get the pre-diagonal patient count
/* create table prediagonal_row as */
select age, count(*) as prediag_row_freq
from mock
where age > month
group by age;
quit;
Other sets can be tricky if the categorical values are not +1 monotonic. To do a similar process for non-monotonic categorical values, you will need to create surrogate variables that are +1 monotonic. For example:
data mock;
do item_id = 1 to 2500;
pet = scan ('cat dog snake rabbit hamster', ceil(5*ranuni(123)));
place = scan ('farm home condo apt tower wild', ceil(6*ranuni(123)));
output;
end;
run;
proc tabulate data=mock;
class pet place;
table pet,place*n=''/nocellmerge;
run;
proc sql;
create table unq_pets as select distinct pet from mock;
create table unq_places as select distinct place from mock;
data pets;
set unq_pets;
pet_num = _n_;
run;
data places;
set unq_places;
place_num = _n_;
run;
proc sql;
select distinct place_num, mock.place, count(*) as subdiag_col_freq
from mock
join pets on pets.pet = mock.pet
join places on places.place = mock.place
where pet_num > place_num
group by place_num
order by place_num
;
quit;
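The roll-up rule described in the question is not implemented by the counts above, so here is a hedged Python sketch of one reading of it. Assumptions: cells are processed column by column from the bottom row upward; a cell strictly below the diagonal whose running total is under 75 is zeroed and carried into the row above; once a total reaches 75 the carry starts again at zero; and whatever is left when the diagonal is reached is added to the diagonal cell. The name roll_up and the list-of-lists layout are hypothetical.

```python
THRESHOLD = 75

# counts[age - 1][month - 1], copied from the question
counts = [
    [80, 90, 100, 110, 122, 141, 88],
    [80, 90, 100, 110, 56, 14, 88],
    [80, 90, 87, 45, 12, 41, 88],
    [24, 90, 100, 110, 22, 141, 88],
    [0, 1, 0, 0, 0, 0, 2],
    [0, 1, 0, 0, 0, 0, 6],
    [0, 1, 0, 0, 0, 0, 2],
    [0, 1, 0, 0, 0, 0, 11],
]


def roll_up(table, threshold=THRESHOLD):
    """Return a copy of table with small sub-diagonal groups rolled upward."""
    result = [row[:] for row in table]
    n_rows, n_cols = len(result), len(result[0])
    for col in range(n_cols):
        carry = 0
        # rows strictly below the diagonal, visited bottom-up
        for row in range(n_rows - 1, col, -1):
            total = result[row][col] + carry
            if total >= threshold:
                result[row][col] = total
                carry = 0
            else:
                result[row][col] = 0
                carry = total
        if col < n_rows:
            result[col][col] += carry  # leftover lands on the diagonal
    return result


if __name__ == "__main__":
    for row in roll_up(counts):
        print(*row)
```

On the sample table this reproduces the desired results shown in the question; for other data, check whether groups above an already-large cell should keep rolling or stop.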
I'm looking for a suggested approach to the following that is time efficient in Pandas. Let's say I have a dataframe that looks like this:
[TimeStamp] [Val]
2017-08-19 22:28:42.000 151
2017-08-19 22:28:42.001 127
2017-08-19 22:29:42.000 149
2017-08-19 22:34:10.000 127
2017-08-19 22:35:10.000 126
2017-08-19 22:36:10.000 132
2017-08-19 22:37:10.000 129
2017-08-19 22:39:10.000 124
How would I get the duration when the Val exceeds 127?
So I'd expect an answer of:
22:28:42 -> 22:28:42.001
22:29:42 -> 22:34:10.000
22:36:10 -> 22:39:10.000
I would also like to then look at these date ranges and carry out actions like:
How many data points are there between dates where the value is above 127?
First sort your data by TimeStamp
>> df['TimeStamp'] = pd.to_datetime(df['TimeStamp'])
>> df = df.sort_values('TimeStamp')
Then find positions where Val changes to lte or gt 127
>> df['changed'] = (df['Val'] > 127).astype(int).diff().fillna(1).astype(int)
>> df
TimeStamp Val changed
0 2017-08-19 22:28:42.000 151 1
1 2017-08-19 22:28:42.001 127 -1
2 2017-08-19 22:29:42.000 149 1
3 2017-08-19 22:34:10.000 127 -1
4 2017-08-19 22:35:10.000 126 0
5 2017-08-19 22:36:10.000 132 1
6 2017-08-19 22:37:10.000 129 0
7 2017-08-19 22:39:10.000 124 -1
Above, for a particular TimeStamp:
-1 means that Val changed to less than or equal to 127
+1 means that Val changed to greater than 127
Finally construct the time intervals you need
>> pd.DataFrame({
>> 't_0': df.loc[df.changed == 1, 'TimeStamp'].reset_index(drop=True),
>> 't_n': df.loc[df.changed == -1, 'TimeStamp'].reset_index(drop=True)})
t_0 t_n
0 2017-08-19 22:28:42 2017-08-19 22:28:42.001
1 2017-08-19 22:29:42 2017-08-19 22:34:10.000
2 2017-08-19 22:36:10 2017-08-19 22:39:10.000
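For the follow-up question (how many data points fall inside each run above 127), the run logic can be made explicit in plain Python. This is a sketch using only the standard library rather than pandas, so that each step is visible; the function name runs_above and the tuple layout are assumptions. As in the answer above, a run ends at the first timestamp where Val is back at or below the threshold.

```python
from datetime import datetime

rows = [
    ("2017-08-19 22:28:42.000", 151),
    ("2017-08-19 22:28:42.001", 127),
    ("2017-08-19 22:29:42.000", 149),
    ("2017-08-19 22:34:10.000", 127),
    ("2017-08-19 22:35:10.000", 126),
    ("2017-08-19 22:36:10.000", 132),
    ("2017-08-19 22:37:10.000", 129),
    ("2017-08-19 22:39:10.000", 124),
]


def runs_above(rows, threshold=127):
    """Return (start, end, n_points) for each run with Val > threshold.

    end is the first timestamp at or below the threshold, or None if
    the data stops while the run is still open.
    """
    fmt = "%Y-%m-%d %H:%M:%S.%f"
    parsed = sorted((datetime.strptime(ts, fmt), val) for ts, val in rows)
    runs, start, count = [], None, 0
    for ts, val in parsed:
        if val > threshold:
            if start is None:
                start = ts
            count += 1
        elif start is not None:
            runs.append((start, ts, count))
            start, count = None, 0
    if start is not None:
        runs.append((start, None, count))
    return runs


if __name__ == "__main__":
    for start, end, n in runs_above(rows):
        print(start, "->", end, "points above:", n)
```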
I have a sequence of numbers from 1 000 000 to 9 999 999 (total: 9,000,000). I generated them in Excel and would like to match them against the following formats.
Last 6 digits in:
1. XXX XXX (For example, 000 000 or 111 111 or 222 222)
2. X00 000 (For example, 100 000 or 200 000 or 300 000)
3. XYY YYY (For example, 122 222 or 233 333 or 411 111)
4. XY0 000 (For example, 230 000 or 750 000 or 120 000)
5. XYZ ZZZ (For example, 231 111 or 232 222 or 233 333)
6. X00 Y00 (For example, 200 300 or 100 400 or 500 600)
7. XXX Y00 (For example, 333 300 or 666 600 or 777 700)
8. XXX YYY (For example, 111 333 or 222 555 or 555 666)
9. XX YY ZZ (For example, 11 22 33 or 22 33 44 or 44 55 66)
10. X0 Y0 Z0 (For example, 10 20 30 or 30 40 50 or 60 70 80)
Would it be possible to do this with regex or VBA in Excel 2013?
Since I don't have much knowledge of Excel, should I seek someone's help to write a simple program for such matching?
You can use VBA, but I believe you will need to set up each classification separately, and also ensure that they are in an order so as to not overlap.
Here is a partial example, showing a few VBA techniques, which you should be able to extend. I only dealt with the rightmost 6 digits and initially constructed a string; and also put each digit into an array element to make the testing formulas simpler.
Option Explicit
Function Classify(N As Long) As String
Dim I As Long
Dim S(1 To 6) As String
Dim sN As String
sN = Format(Right(N, 6), "000000")
For I = 1 To 6
S(I) = Mid(sN, I, 1)
Next I
' All six digits identical (comparing only the two halves would also match 123 123)
If sN = String(6, S(1)) Then
Classify = "XXX XXX"
ElseIf S(1) <> "0" And Mid(sN, 2) = "00000" Then
Classify = "X00 000"
ElseIf S(1) <> "0" And Mid(sN, 2) = String(5, S(2)) Then
Classify = "XYY YYY"
ElseIf S(1) <> "0" And S(2) <> "0" And S(1) <> S(2) And Mid(sN, 3) = "0000" Then
Classify = "XY0 000"
elseif ...
End If
End Function
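Since the question also asks about regex, here is a hedged Python sketch of the same first-match-wins idea using backreferences. It is not the VBA answer's method, and it rests on assumptions the question leaves open: a letter may stand for any digit, different letters may stand for the same digit (the question's own example 333 300 for XXX Y00 suggests so), and the patterns are tried in the order listed. The names PATTERNS and classify are hypothetical.

```python
import re

# (label, regex on the last six digits); order matters, first match wins
PATTERNS = [
    ("XXX XXX", r"(\d)\1{5}"),
    ("X00 000", r"\d00000"),
    ("XYY YYY", r"\d(\d)\1{4}"),
    ("XY0 000", r"\d\d0000"),
    ("XYZ ZZZ", r"\d\d(\d)\1{3}"),
    ("X00 Y00", r"\d00\d00"),
    ("XXX Y00", r"(\d)\1{2}\d00"),
    ("XXX YYY", r"(\d)\1{2}(\d)\2{2}"),
    ("XX YY ZZ", r"(\d)\1(\d)\2(\d)\3"),
    ("X0 Y0 Z0", r"\d0\d0\d0"),
]


def classify(n):
    """Return the label of the first pattern matching the last six digits of n, or None."""
    last_six = f"{n % 1_000_000:06d}"
    for label, pattern in PATTERNS:
        if re.fullmatch(pattern, last_six):
            return label
    return None


if __name__ == "__main__":
    for n in (1111111, 1230000, 1200300, 1112233, 1234567):
        print(n, classify(n))
```

Because earlier patterns shadow later ones, reordering the list changes the result for numbers that fit more than one format, so the order should be agreed on before classifying all nine million values.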