SAS assign group and accumulate

I have a dataset which has the columns Event and Time, and I need to create the columns Group and Cumulative. What I need to measure is the duration from an 'Event1_Stop' until an 'Event1_Start' appears. The last group should still sum the time, meaning the stop is ongoing and no start for the event has been recorded yet.
My data sample is:
data have;
length Event $15;
input Event $ Time;
datalines;
Event3_Start 0.2
Event2_Start 0.4
Event2_Stop 0.2
Event1_Stop 0.2
Event3_Start 0
Event4_Start 0.5
Event3_Stop 0.2
Event1_Start 0
Event4_Stop 0
Event4_Stop 0
Event1_Stop 0.3
Event3_Start 0.3
Event1_Start 0
Event3_Start 0.4
Event3_Stop 0
Event1_Stop 0.2
Event3_Start 0.2
Event2_Start 0.4
run;
The result dataset that I need to obtain is:
data expected;
length Event $15;
input Event $ Time Group Cumulative;
datalines;
Event3_Start 0.2 0 0
Event2_Start 0.4 0 0
Event2_Stop 0.2 0 0
Event1_Stop 0.2 1 0.9
Event3_Start 0 1 0
Event4_Start 0.5 1 0
Event3_Stop 0.2 1 0
Event1_Start 0 0 0
Event4_Stop 0 0 0
Event4_Stop 0 0 0
Event1_Stop 0.3 2 0.6
Event3_Start 0.3 2 0
Event1_Start 0 0 0
Event3_Start 0.4 0 0
Event3_Stop 0 0 0
Event1_Stop 0.2 3 0.8
Event3_Start 0.2 3 0
Event2_Start 0.4 3 0
run;
Thanks for your suggestions.
Regards.

Thanks to @mkeintz on the SAS forums for the solution:
data stop_to_start (keep=group cumulative);
set have end=end_of_have;
group+(event='Event1_Stop');               /* bump the group counter at each Event1_Stop */
if event='Event1_Stop' then cumulative=0;  /* restart the running total at each stop */
cumulative+time;                           /* accumulate TIME within the current group */
if end_of_have or event='Event1_Start';    /* keep one summary row per stop-to-start span */
run;
data want;
set have;
if _n_=1 or event='Event1_Start' then group=0;
cumulative=0;
/* On each Event1_Stop, read the next summary row. Variables read with SET
   are retained, so GROUP carries forward until the reset above. */
if event='Event1_Stop' then set stop_to_start;
run;
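As a quick sanity check, the solution's output can be compared against the expected rows (read in above as EXPECTED) with proc compare:
proc compare base=expected compare=want;
run;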

Related

Macro variable, %eval and character operand in sas

I have a problem with a SAS macro and a macro variable. When I use it, I get: ERROR: A character operand was found in the %EVAL function or %IF condition where a numeric operand is required.
I have something like a distribution (d1-d5) and I want to get similar variables shifted by diff (positions before the shift equal 0). Below is an example table - of course, I need to do this for a much bigger table.
Example_table
Name d1 d2 d3 d4 d5 diff
A 0.2 0.2 0.1 0.2 0.3 1
B 0.3 0.1 0.4 0.3 0 2
C 0.1 0.2 0 0.4 0.3 2
Table I want to get: (new_table)
Name n1 n2 n3 n4 n5 diff
A 0 0.2 0.2 0.1 0.2 1
B 0 0 0.3 0.1 0.4 2
C 0 0 0.1 0.2 0 2
data example_table;
input Name $ d1-d5 diff;
datalines;
A 0.2 0.2 0.1 0.2 0.3 1
B 0.3 0.1 0.4 0.3 0 2
C 0.1 0.2 0 0.4 0.3 2
;
run;
%macro distr ();
%local i;
%do i = 1 %to 5;
if &i. <= diff then n&i. = 0;
else n&i. = d%eval(&i. - diff);
/* I can't compute this %eval; it looks as if diff were a character variable, but it isn't */
%end;
%mend;
Data new_table;
Set example_table;
%distr();
run;
The macro processor knows nothing about the values of your dataset variables.
You are trying to subtract the letters diff from the value of the macro variable i. That cannot work.
You will want to use SAS code to do your data manipulation, not macro code. For example by using arrays.
data example_table;
input Name $ d1-d5 diff;
cards;
A 0.2 0.2 0.1 0.2 0.3 1
B 0.3 0.1 0.4 0.3 0 2
C 0.1 0.2 0 0.4 0.3 2
;
data want;
set example_table;
array d d1-d5;
array n n1-n5;
do index=1 to dim(n);
if 1 <= index-diff <= dim(d) then n[index]=d[index-diff];
else n[index]=0;
end;
drop index d1-d5;
run;
Results:
Obs Name diff n1 n2 n3 n4 n5
1 A 1 0 0.2 0.2 0.1 0.2
2 B 2 0 0.0 0.3 0.1 0.4
3 C 2 0 0.0 0.1 0.2 0.0
You're mixing up SAS and Macro language here, specifically:
%eval(&i. - diff)
%eval is a macro function, meaning it applies to the text of the code. diff is a SAS data step variable, meaning it has some value - but %eval only operates on the text itself. So %eval is trying to take &i (a number) and subtract from it the letters diff (not a number).
Fortunately it's pretty easy - &i is available to the SAS datastep, as a number. You can use an array to resolve the problem! First declare the array, then...
else n&i. = d[&i. - diff];
Of course, you don't need to use the macro language at all here.
data new_table;
set example_table;
array d[5] d1-d5; *technically d1-d5 is unneeded here as those are the default names;
array n[5] n1-n5; *also n1-n5 unneeded, but it is more clear;
do i = 1 to dim(d);
if i <= diff then n[i] = 0;
else n[i] = d[i - diff];
end;
run;

SAS DO Loop seems to skip records

In writing a very simple DATA step to start a new project, I ran across some strange behavior.
The only difference between set1 and set2 is the use of the variable lagscore in the equation in set1 vs. dummy in the equation in set2.
set1 produces output that appears to indicate that including lagscore causes the score and lagscore variables to be undefined in half of the iterations.
Note that I was careful to NOT call lag() more than once and I include the call in set2 just to make sure that the lag() function call is not the source of the problem.
I appreciate any explanations. I've been away from SAS for quite a while and I sense that I am missing something fundamental about how the processing occurs.
(Sorry for the difficult-to-read output. I could not figure out how to paste it and retain spacing.)
data set1;
obs=1;
score=500;
a_dist = -5.0;
b_dist = 0.1;
dummy = 0;
output;
do obs = 2 to 10;
lagscore = lag(score);
score = lagscore + 1 /(b_dist * lagscore + a_dist);
output;
end;
run;
data set2;
obs=1;
score=500;
a_dist = -5.0;
b_dist = 0.1;
dummy = 0;
output;
do obs = 2 to 10;
lagscore = lag(score);
/* score = lagscore + 1 /(b_dist * lagscore + a_dist);*/
score = dummy + 1 /(b_dist * dummy + a_dist);
output;
end;
run;
Set1 results
obs score a_dist b_dist dummy lagscore
1 500 -5 0.1 0 .
2 . -5 0.1 0 .
3 500.02 -5 0.1 0 500
4 . -5 0.1 0 .
5 500.04 -5 0.1 0 500.02
6 . -5 0.1 0 .
7 500.06 -5 0.1 0 500.04
8 . -5 0.1 0 .
9 500.08 -5 0.1 0 500.06
10 . -5 0.1 0 .
Set2 results
obs score a_dist b_dist dummy lagscore
1 500 -5 0.1 0 .
2 -0.2 -5 0.1 0 .
3 -0.2 -5 0.1 0 500
4 -0.2 -5 0.1 0 -0.2
5 -0.2 -5 0.1 0 -0.2
6 -0.2 -5 0.1 0 -0.2
7 -0.2 -5 0.1 0 -0.2
8 -0.2 -5 0.1 0 -0.2
9 -0.2 -5 0.1 0 -0.2
10 -0.2 -5 0.1 0 -0.2
The key point is that when you call the lag() function it returns a value from a queue that is initialized with missing values. The default is a queue with one item in it.
In your code:
score=500 ;
*...;
do obs = 2 to 10;
lagscore = lag(score);
score = lagscore + 1 /(b_dist * lagscore + a_dist);
output;
end;
On the first iteration of the loop (obs=2), LAGSCORE is assigned a missing value, because the queue is initialized with a missing value; the value 500 is stored in the queue. SCORE is assigned a missing value because LAGSCORE is missing, so the expression lagscore + 1 /(b_dist * lagscore + a_dist) returns missing.
On the second iteration (obs=3), LAGSCORE is assigned the value 500 (read from the queue), and the current value of SCORE (a missing value) is written to the queue. SCORE is then assigned the value 500.02 by the expression lagscore + 1 /(b_dist * lagscore + a_dist).
On the third iteration (obs=4), LAGSCORE is assigned a missing value (read from the queue) and the value 500.02 is written to the queue.
And that pattern repeats.
If I understand your intent, you don't actually need the LAG function for this sort of data creation. You can just use a DO loop with an output statement in it, and update the value of SCORE after you output each record. Something like:
data set1 ;
score = 500 ;
a_dist = -5.0 ;
b_dist = 0.1 ;
do obs = 1 to 10 ;
output ;
score = score + (1 /(b_dist * score + a_dist)) ;
end ;
run ;
Returns:
score a_dist b_dist obs
500.000 -5 0.1 1
500.022 -5 0.1 2
500.044 -5 0.1 3
500.067 -5 0.1 4
500.089 -5 0.1 5
500.111 -5 0.1 6
500.133 -5 0.1 7
500.156 -5 0.1 8
500.178 -5 0.1 9
500.200 -5 0.1 10

Merging multiple .txt files into a csv

New to Python.
I'm trying to merge multiple text files into one CSV; example below:
filename.csv
Alpha
0
0.1
0.15
0.2
0.25
0.3
text1.txt
Alpha,Beta
0,10
0.2,20
0.3,30
text2.txt
Alpha,Charlie
0.1,5
0.15,15
text3.txt
Alpha,Delta
0.1,10
0.15,20
0.2,50
0.3,10
Desired output in the CSV file:
filename.csv
Alpha Beta Charlie Delta
0 10 0 0
0.1 0 5 10
0.15 0 15 20
0.2 20 0 50
0.25 0 0 0
0.3 30 0 10
The code I've been working with (and other versions that were suggested) gives me the answer shown at the bottom of this post instead of the desired result.
import pandas
import glob
import os

def mergeData(indir="Dir Path", outdir="Dir Path"):
    dfs = []
    os.chdir(indir)
    fileList = glob.glob("*.txt")
    for filename in fileList:
        left = "/Path/Final.csv"
        right = filename
        output = "/Path/finalMerged.csv"
        leftDf = pandas.read_csv(left)
        rightDf = pandas.read_csv(right)
        mergedDf = pandas.merge(leftDf, rightDf, how='inner', on="Alpha", sort=True)
        dfs.append(mergedDf)
    outputDf = pandas.concat(dfs, ignore_index=True)
    outputDf = pandas.merge(leftDf, outputDf, how='inner', on='Alpha', sort=True, copy=False).fillna(0)
    print(outputDf)
    outputDf.to_csv(output, index=0)

mergeData()
The answer I get, however, is this instead of the desired result:
Alpha Beta Charlie Delta
0 10 0 0
0.1 0 5 0
0.1 0 0 10
0.15 0 15 0
0.15 0 0 20
0.2 20 0 0
0.2 0 0 50
0.25 0 0 0
0.3 30 0 0
0.3 0 0 10
IIUC, you can create a list of all the DataFrames (dfs), append each mergedDf to it in the loop, and finally concat all the DataFrames into one:
import pandas
import glob
import os

def mergeData(indir="dir/path", outdir="dir/path"):
    dfs = []
    os.chdir(indir)
    fileList = glob.glob("*.txt")
    for filename in fileList:
        left = "/path/filename.csv"
        right = filename
        output = "/path/filename.csv"
        leftDf = pandas.read_csv(left)
        rightDf = pandas.read_csv(right)
        mergedDf = pandas.merge(leftDf, rightDf, how='right', on="Alpha", sort=True)
        dfs.append(mergedDf)
    outputDf = pandas.concat(dfs, ignore_index=True)
    # add missing rows from leftDf (in sample Alpha - 0.25)
    # fill NaN values by 0
    outputDf = pandas.merge(leftDf, outputDf, how='left', on="Alpha", sort=True).fillna(0)
    # columns are converted to int
    outputDf[['Beta', 'Charlie']] = outputDf[['Beta', 'Charlie']].astype(int)
    print(outputDf)
    outputDf.to_csv(output, index=0)

mergeData()
Alpha Beta Charlie
0 0.00 10 0
1 0.10 0 5
2 0.15 0 15
3 0.20 20 0
4 0.25 0 0
5 0.30 30 0
EDIT:
The problem is that you changed the parameter how='left' in the second merge to how='inner':
def mergeData(indir="Dir Path", outdir="Dir Path"):
    dfs = []
    os.chdir(indir)
    fileList = glob.glob("*.txt")
    for filename in fileList:
        left = "/Path/Final.csv"
        right = filename
        output = "/Path/finalMerged.csv"
        leftDf = pandas.read_csv(left)
        rightDf = pandas.read_csv(right)
        mergedDf = pandas.merge(leftDf, rightDf, how='inner', on="Alpha", sort=True)
        dfs.append(mergedDf)
    outputDf = pandas.concat(dfs, ignore_index=True)
    # need left join, not inner
    outputDf = pandas.merge(leftDf, outputDf, how='left', on='Alpha', sort=True, copy=False).fillna(0)
    print(outputDf)
    outputDf.to_csv(output, index=0)

mergeData()
Alpha Beta Charlie Delta
0 0.00 10.0 0.0 0.0
1 0.10 0.0 5.0 0.0
2 0.10 0.0 0.0 10.0
3 0.15 0.0 15.0 0.0
4 0.15 0.0 0.0 20.0
5 0.20 20.0 0.0 0.0
6 0.20 0.0 0.0 50.0
7 0.25 0.0 0.0 0.0
8 0.30 30.0 0.0 0.0
9 0.30 0.0 0.0 10.0
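Note there is still one row per source file for each Alpha. If a single row per Alpha is wanted, as in the desired output, the frame could be collapsed before writing - a sketch:
outputDf = outputDf.groupby('Alpha', as_index=False).sum()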
import pandas as pd
data1 = pd.read_csv('samp1.csv',sep=',')
data2 = pd.read_csv('samp2.csv',sep=',')
data3 = pd.read_csv('samp3.csv',sep=',')
df1 = pd.DataFrame({'Alpha':data1.Alpha})
df2 = pd.DataFrame({'Alpha':data2.Alpha,'Beta':data2.Beta})
df3 = pd.DataFrame({'Alpha':data3.Alpha,'Charlie':data3.Charlie})
mergedDf = pd.merge(df1, df2, how='outer', on ='Alpha',sort=False)
mergedDf1 = pd.merge(mergedDf, df3, how='outer', on ='Alpha',sort=False)
a = pd.DataFrame(mergedDf1)
print(a.drop_duplicates())
output:
Alpha Beta Charlie
0 0.00 10.0 NaN
1 0.10 NaN 5.0
2 0.15 NaN 15.0
3 0.20 20.0 NaN
4 0.25 NaN NaN
5 0.30 30.0 NaN
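To show 0s instead of NaN, as in the desired output, the merged frame could also be filled before printing - a one-line tweak to the code above:
print(a.drop_duplicates().fillna(0))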

Calculation on groups after group by with Pandas

I have a data frame that is grouped by two columns - Date and Client - and I sum the amount, like so:
new_df = df.groupby(['Date','Client']).sum()
Now I get the following df:
Sum
Date Client
1/1 A 0.8
B 0.2
1/2 A 0.1
B 0.9
I want to be able to catch the fact that there is a high fluctuation between the ratio of 0.8 to 0.2, which changed to 0.1 to 0.9. What would be the most efficient way to do it? Also, I can't access the Date and Client fields when I try to do
new_df[['Date','Client']]
Why is that?
IIUC, you can use pct_change or diff. (As for the second question: after the groupby, Date and Client are levels of the index rather than columns, which is why new_df[['Date','Client']] fails; as_index=False below keeps them as ordinary columns.)
new_df = df.groupby(['Date','Client'], as_index=False).sum()
print (new_df)
Date Client Sum
0 1/1 A 0.8
1 1/1 B 0.2
2 1/2 A 0.1
3 1/2 B 0.9
new_df['pct_change'] = new_df.groupby('Date')['Sum'].pct_change()
new_df['diff'] = new_df.groupby('Date')['Sum'].diff()
print (new_df)
Date Client Sum pct_change diff
0 1/1 A 0.8 NaN NaN
1 1/1 B 0.2 -0.75 -0.6
2 1/2 A 0.1 NaN NaN
3 1/2 B 0.9 8.00 0.8
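pct_change above compares the two clients within each date. If the goal is instead to flag how each client's share swings across dates (0.8 dropping to 0.1 for A), one option is to pivot the frame so each client becomes a column and take a diff along dates - a minimal sketch, where the 0.5 threshold is an arbitrary choice:
import pandas as pd

new_df = pd.DataFrame({'Date': ['1/1', '1/1', '1/2', '1/2'],
                       'Client': ['A', 'B', 'A', 'B'],
                       'Sum': [0.8, 0.2, 0.1, 0.9]})
# one row per date, one column per client
ratio = new_df.pivot(index='Date', columns='Client', values='Sum')
# absolute change of each client's share between consecutive dates
swing = ratio.diff().abs()
print(swing > 0.5)  # True where a client's share moved by more than 0.5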

If column match with value, use gsub and print value to another column

I use some example:
INPUT:
0.6 0.7 A:0.01 - 0
C:0.01 0.1 - 0.2 0
0.7 0.02 G:0.2 - 0
0.5 0.23 0.1 T:0.05 0
0.1 0.2 0.3 0.58 0
So, if some column has a value starting with A, C, T or G, I would like to change it to "0" or "-", and change the last column to "W" (in my real data these are $34 $35 $36 $37 $38).
OUTPUT:
0.6 0.7 0 - W
0 0.1 - 0.2 W
0.7 0.02 0 - W
0.5 0.23 0.1 0 W
0.1 0.2 0.3 0.58 0
I would like to use awk.
awk '{if($34=="^:^");gsub($34,"*","0") && gsub($38,"0","W"); else print}' file
and same for other columns.
Thank you.
How about this:
$ awk '{for(i=1;i<=4;i++){if ($i ~ /A:|C:|T:|G:/){$i=0; $NF="W"}}}1' file | column -t
0.6 0.7 0 - W
0 0.1 - 0.2 W
0.7 0.02 0 - W
0.5 0.23 0.1 0 W
0.1 0.2 0.3 0.58 0
In a more readable format:
$ awk '{
for(i=1;i<=4;i++) { # Loop through the fields
if ($i ~ /A:|C:|T:|G:/) { # If current field matches pattern
$i=0 # Replace it with zero
$NF="W" # And make the last field a 'W'
}
}
}1' file | column -t
If you want to limit it to specific columns, you can use an array:
awk '{c="1,3";split(c,cols,/,/);for(i in cols){if ($cols[i] ~ /A:|C:|T:|G:/){$cols[i]=0; $NF="W"}}}1' file | column -t
What about something like this:
awk -v OFS="\t" '{if (gsub(/G:|C:|A:|T:/, "0")) print $1,$2,$3,$4,"W"; else print $0}'
And then replace values starting with 00 with zero, for example as sketched below.
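A minimal sketch of that second pass, assuming no legitimate value begins with 00:
awk -v OFS="\t" '{for(i=1;i<=NF;i++) if ($i ~ /^00/) $i=0} 1' file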
If you don't care about spacing:
$ awk 'gsub(/[ACGT][^[:space:]]+/,0){$NF="W"}1' file
0.6 0.7 0 - W
0 0.1 - 0.2 W
0.7 0.02 0 - W
0.5 0.23 0.1 0 W
0.1 0.2 0.3 0.58 0
if you do:
$ awk 'gsub(/[ACGT][^[:space:]]+/,0){$NF="W"}1' file | column -t
0.6 0.7 0 - W
0 0.1 - 0.2 W
0.7 0.02 0 - W
0.5 0.23 0.1 0 W
0.1 0.2 0.3 0.58 0