SAS DO Loop seems to skip records - sas

In writing a very simple DATA step to start a new project, I ran across some strange behavior.
The only difference between set1 and set2 is the use of the variable lagscore in the equation in set1 vs. dummy in the equation in set2.
set1 produces output that appears to indicate that including lagscore causes the score and lagscore variables to be undefined in half of the iterations.
Note that I was careful to NOT call lag() more than once and I include the call in set2 just to make sure that the lag() function call is not the source of the problem.
I appreciate any explanations. I've been away from SAS for quite a while and I sense that I am missing something fundamental in how the processing occurs.
(Sorry for the difficult to read output. I could not figure out how to paste it and retain spacing)
data set1;
obs=1;
score=500;
a_dist = -5.0;
b_dist = 0.1;
dummy = 0;
output;
do obs = 2 to 10;
lagscore = lag(score);
score = lagscore + 1 /(b_dist * lagscore + a_dist);
output;
end;
run;
data set2;
obs=1;
score=500;
a_dist = -5.0;
b_dist = 0.1;
dummy = 0;
output;
do obs = 2 to 10;
lagscore = lag(score);
/* score = lagscore + 1 /(b_dist * lagscore + a_dist);*/
score = dummy + 1 /(b_dist * dummy + a_dist);
output;
end;
run;
Set1 results
obs score a_dist b_dist dummy lagscore
1 500 -5 0.1 0 .
2 . -5 0.1 0 .
3 500.02 -5 0.1 0 500
4 . -5 0.1 0 .
5 500.04 -5 0.1 0 500.02
6 . -5 0.1 0 .
7 500.06 -5 0.1 0 500.04
8 . -5 0.1 0 .
9 500.08 -5 0.1 0 500.06
10 . -5 0.1 0 .
Set2 results
obs score a_dist b_dist dummy lagscore
1 500 -5 0.1 0 .
2 -0.2 -5 0.1 0 .
3 -0.2 -5 0.1 0 500
4 -0.2 -5 0.1 0 -0.2
5 -0.2 -5 0.1 0 -0.2
6 -0.2 -5 0.1 0 -0.2
7 -0.2 -5 0.1 0 -0.2
8 -0.2 -5 0.1 0 -0.2
9 -0.2 -5 0.1 0 -0.2
10 -0.2 -5 0.1 0 -0.2

The key point is that when you call the lag() function it returns a value from a queue that is initialized with missing values. The default is a queue with one item in it.
In your code:
score=500 ;
*...;
do obs = 2 to 10;
lagscore = lag(score);
score = lagscore + 1 /(b_dist * lagscore + a_dist);
output;
end;
In the first iteration of the loop (obs=2), LAGSCORE is assigned a missing value because the queue is initialized with a missing value, and the current value of SCORE (500) is stored in the queue. SCORE is then assigned a missing value because LAGSCORE is missing, so the expression lagscore + 1 /(b_dist * lagscore + a_dist) returns missing.
In the second iteration (obs=3), LAGSCORE is assigned the value 500 (read from the queue), and the current value of SCORE (a missing value) is written to the queue. SCORE is then assigned 500 + 1/45, about 500.02, from the expression lagscore + 1 /(b_dist * lagscore + a_dist).
In the third iteration (obs=4), LAGSCORE is assigned a missing value (read from the queue) and the value 500.02 is written to the queue. SCORE is therefore missing again.
And that pattern repeats.
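To make the queue behaviour concrete, here is a minimal Python sketch that mimics the loop. It is not SAS: the function name simulate_lag_loop, the single-slot queue variable and the use of NaN as a stand-in for a SAS missing value are my own assumptions, chosen only to show why every other record comes out missing.

```python
import math

def simulate_lag_loop(start=500.0, a_dist=-5.0, b_dist=0.1, last_obs=10):
    """Mimic the DATA step with one LAG queue slot (hypothetical helper)."""
    queue = math.nan                 # the LAG queue starts out holding "missing"
    score = start
    rows = [(1, score, math.nan)]    # obs=1 is output before the loop runs
    for obs in range(2, last_obs + 1):
        # lag(score): return what is in the queue, then store the current score
        lagscore, queue = queue, score
        # NaN propagates through the arithmetic, like a SAS missing value
        score = lagscore + 1 / (b_dist * lagscore + a_dist)
        rows.append((obs, score, lagscore))
    return rows

for obs, score, lagscore in simulate_lag_loop():
    print(obs, score, lagscore)
```

The printed rows show score as nan for obs 2, 4, 6, 8 and 10, and roughly 500.02 and 500.04 for obs 3 and 5, which is the same alternating pattern as the Set1 results.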
If I understand your intent, you don't actually need the LAG function for this sort of data creation. You can just use a DO loop with an output statement in it, and update the value of SCORE after you output each record. Something like:
data set1 ;
score = 500 ;
a_dist = -5.0 ;
b_dist = 0.1 ;
do obs = 1 to 10 ;
output ;
score = score + (1 /(b_dist * score + a_dist)) ;
end ;
run ;
Returns:
score a_dist b_dist obs
500.000 -5 0.1 1
500.022 -5 0.1 2
500.044 -5 0.1 3
500.067 -5 0.1 4
500.089 -5 0.1 5
500.111 -5 0.1 6
500.133 -5 0.1 7
500.156 -5 0.1 8
500.178 -5 0.1 9
500.200 -5 0.1 10

Related

Compute modulo between floating point numbers in C++

I have the following code to compute modulo between two floating point numbers:
auto mod(float x, float denom)
{
return x>= 0 ? std::fmod(x, denom) : denom + std::fmod(x + 1.0f, denom) - 1.0f;
}
It only works partially for negative x:
-8 0
-7.75 0.25
-7.5 0.5
-7.25 0.75
-7 1
-6.75 1.25
-6.5 1.5
-6.25 1.75
-6 2
-5.75 2.25
-5.5 2.5
-5.25 2.75
-5 3
-4.75 -0.75 <== should be 3.25
-4.5 -0.5 <== should be 3.5
-4.25 -0.25 <== should be 3.75
-4 0
-3.75 0.25
-3.5 0.5
-3.25 0.75
-3 1
-2.75 1.25
-2.5 1.5
-2.25 1.75
-2 2
-1.75 2.25
-1.5 2.5
-1.25 2.75
-1 3
-0.75 3.25
-0.5 3.5
-0.25 3.75
0 0
How can I fix it for negative x? denom is assumed to be an integer greater than 0. Note: fmod as provided by the standard library is broken for x < 0.0f.
x is in the left column, and the output is in the right column, like so:
for(size_t k = 0; k != 65; ++k)
{
auto x = 0.25f*(static_cast<float>(k) - 32);
printf("%.8g %.8g\n", x, mod(x, 4));
}
You wrote: "Note: fmod as is provided by the standard library is broken for x < 0.0f"
I guess you want the result to always be a non-negative value (1):
In mathematics, the result of the modulo operation is an equivalence class, and any member of the class may be chosen as representative; however, the usual representative is the least positive residue, the smallest non-negative integer that belongs to that class (i.e., the remainder of the Euclidean division).
The usual workaround was shown in Igor Tandetnik's comment, but that seems not to be enough, as you replied:
@IgorTandetnik That worked. Pesky signed zero though, but I guess you cannot do anything about that.
Well, consider this (2, 3):
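#include <cmath> // std::fmod, std::copysign
auto mod(double x, double denom)
{
    auto const r{ std::fmod(x, denom) };
    // Shift a negative remainder into [0, denom), then clear the sign bit so -0.0 becomes +0.0.
    return std::copysign(r < 0 ? r + denom : r, 1);
}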
1) https://en.wikipedia.org/wiki/Modulo
2) https://en.cppreference.com/w/cpp/numeric/math/copysign
3) https://godbolt.org/z/fdr9cbsYT
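For readers who want to poke at the logic without a C++ compiler, here is the same idea as a small Python sketch. math.fmod follows the C fmod convention (the result takes the sign of x), so the fix-up is the same; the function name mod simply mirrors the C++ above and denom is assumed to be positive.

```python
import math

def mod(x, denom):
    """Non-negative remainder of x / denom, assuming denom > 0."""
    r = math.fmod(x, denom)      # same sign as x, like std::fmod
    if r < 0:
        r += denom               # shift negative remainders into [0, denom)
    return math.copysign(r, 1.0) # turn a possible -0.0 into +0.0

for k in range(-20, -15):
    x = 0.25 * k
    print(x, mod(x, 4))
```

This prints 3.0, 3.25, 3.5 and 3.75 for x = -5, -4.75, -4.5 and -4.25, and 0.0 (not -0.0) for x = -4.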

SAS assign group and accumulate

I have a dataset which has columns Event and Time. I need to create columns Group and Cumulative. What I need to measure is the duration from an 'Event1_Stop' event until an 'Event1_Start' appears. The last group should still sum the time, meaning that the stop is ongoing and no start for the event has been entered.
My data sample is:
data have;
length Event $15;
input Event $ Time;
datalines;
Event3_Start 0.2
Event2_Start 0.4
Event2_Stop 0.2
Event1_Stop 0.2
Event3_Start 0
Event4_Start 0.5
Event3_Stop 0.2
Event1_Start 0
Event4_Stop 0
Event4_Stop 0
Event1_Stop 0.3
Event3_Start 0.3
Event1_Start 0
Event3_Start 0.4
Event3_Stop 0
Event1_Stop 0.2
Event3_Start 0.2
Event2_Start 0.4
run;
The result dataset that I need to obtain is:
data have;
length Event $15;
input Event $ Time Group Cumulative;
datalines;
Event3_Start 0.2 0 0
Event2_Start 0.4 0 0
Event2_Stop 0.2 0 0
Event1_Stop 0.2 1 0.9
Event3_Start 0 1 0
Event4_Start 0.5 1 0
Event3_Stop 0.2 1 0
Event1_Start 0 0 0
Event4_Stop 0 0 0
Event4_Stop 0 0 0
Event1_Stop 0.3 2 0.6
Event3_Start 0.3 2 0
Event1_Start 0 0 0
Event3_Start 0.4 0 0
Event3_Stop 0 0 0
Event1_Stop 0.2 3 0.8
Event3_Start 0.2 3 0
Event2_Start 0.4 3 0
run;
Thanks for your suggestions.
Regards.
Thanks to @mkeintz on the SAS forum for the solution:
data stop_to_start (keep=group cumulative);
set have end=end_of_have;
group+(event='Event1_Stop');
if event='Event1_Stop' then cumulative=0;
cumulative+time;
if end_of_have or event='Event1_Start' ;
run;
data want;
set have;
if _n_=1 or event='Event1_Start' then group=0;
cumulative=0;
if event='Event1_Stop' then set stop_to_start;
run;
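If it helps to see the logic outside of SAS, here is a rough single-pass Python sketch of the same idea. The function name assign_groups and the list-of-tuples layout are made up for illustration. One difference to be aware of: this sketch closes the period before looking at the Event1_Start row's time, whereas the first DATA step above adds it to the total (it is 0 in the sample data, so the results agree).

```python
def assign_groups(events):
    """events: list of (event, time) -> list of (event, time, group, cumulative)."""
    out = []
    group_no = 0      # running count of Event1_Stop rows seen so far
    stop_idx = None   # position in `out` of the currently open Event1_Stop row
    for event, time in events:
        if event == "Event1_Stop":
            group_no += 1
            stop_idx = len(out)       # this row will carry the group total
        elif event == "Event1_Start":
            stop_idx = None           # the stop period is over
        if stop_idx is None:
            out.append([event, time, 0, 0.0])
        else:
            out.append([event, time, group_no, 0.0])
            out[stop_idx][3] += time  # accumulate onto the Event1_Stop row
    return [tuple(row) for row in out]

sample = [
    ("Event3_Start", 0.2), ("Event2_Start", 0.4), ("Event2_Stop", 0.2),
    ("Event1_Stop", 0.2), ("Event3_Start", 0.0), ("Event4_Start", 0.5),
    ("Event3_Stop", 0.2), ("Event1_Start", 0.0), ("Event4_Stop", 0.0),
]
for row in assign_groups(sample):
    print(row)
```

Because the total is accumulated directly onto the Event1_Stop row, a period that is still open at the end of the data keeps its running sum without any special end-of-file handling.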

Macro variable, %eval and character operand in sas

I have a problem with a SAS macro and a macro variable. When I use it, I get the message: 'A character operand was found in the %EVAL function or %IF condition where a numeric operand is required.'
I have something like a distribution (d1-d5) and I want to get similar variables, but shifted by diff (values before diff are set to 0). Below is an example table; of course I need to do this for a much bigger table.
Example_table
Name d1 d2 d3 d4 d5 diff
A 0.2 0.2 0.1 0.2 0.3 1
B 0.3 0.1 0.4 0.3 0 2
C 0.1 0.2 0 0.4 0.3 2
Table I want to get: (new_table)
Name n1 n2 n3 n4 n5 diff
A 0 0.2 0.2 0.1 0.2 1
B 0 0 0.3 0.1 0.4 2
C 0 0 0.1 0.2 0 2
Data example_table;
Name = A B C;
d1 = 0.2 0.3 0.1;
d2 = 0.2 0.1 0.2;
d3 = 0.1 0.4 0;
d4 = 0.2 0.3 0.4;
d5 = 0.3 0 0.3;
diff = 1 2 2;
run;
%macro distr ();
%local i;
%do i = 1 %to 5;
if &i. <= diff then n&i. = 0;
else n&i. = d%eval(&i. - diff);
/* I cant compute this eval, it looks like diff is character variable..., but it doesn't */
%end;
%mend;
Data new_table;
Set example_table;
%distr();
run;
The macro processor knows nothing about the values of your dataset variables.
You are trying to subtract the letters diff from the value of the macro variable i. That cannot work.
You will want to use SAS code to do your data manipulation, not macro code. For example by using arrays.
data example_table;
input Name $ d1-d5 diff ;
cards;
A 0.2 0.2 0.1 0.2 0.3 1
B 0.3 0.1 0.4 0.3 0 2
C 0.1 0.2 0 0.4 0.3 2
;
data want;
set example_table;
array d d1-d5;
array n n1-n5;
do index=1 to dim(n);
if 1 <= index-diff <= dim(d) then n[index]=d[index-diff];
else n[index]=0;
end;
drop index d1-d5;
run;
Results:
Obs Name diff n1 n2 n3 n4 n5
1 A 1 0 0.2 0.2 0.1 0.2
2 B 2 0 0.0 0.3 0.1 0.4
3 C 2 0 0.0 0.1 0.2 0.0
You're mixing up SAS and Macro language here, specifically:
%eval(&i. - diff)
%eval is a macro function, meaning it applies to the text of the code. diff is a SAS data step variable, meaning it has some value - but %eval only operates on the text itself. So %eval is trying to take &i (a number) and subtract from it the letters diff (not a number).
Fortunately it's pretty easy - &i is available to the SAS datastep, as a number. You can use an array to resolve the problem! First declare the array, then...
else n&i. = d[&i. - diff];
Of course, you don't need to use the macro language at all here.
data new_table;
set example_table;
array d[5] d1-d5; *technically d1-d5 is unneeded here as those are the default names;
array n[5] n1-n5; *also n1-n5 unneeded, but it is more clear;
do i = 1 to dim(d);
if i <= diff then n[i] = 0;
else n[i] = d[i - diff];
end;
run;
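The shift itself is just an index offset with a bounds check, which may be easier to see in a few lines of Python. This is only an illustration of the array logic, not a replacement for the DATA step; the function name shift_right is hypothetical and the lists are 0-based, unlike SAS arrays.

```python
def shift_right(d, diff):
    """Return a copy of d shifted right by diff positions, padded with 0."""
    n = []
    for i in range(len(d)):
        src = i - diff                       # where the value for slot i comes from
        n.append(d[src] if 0 <= src < len(d) else 0)
    return n

print(shift_right([0.2, 0.2, 0.1, 0.2, 0.3], 1))
print(shift_right([0.3, 0.1, 0.4, 0.3, 0.0], 2))
```

For rows A and B of the example table this gives [0, 0.2, 0.2, 0.1, 0.2] and [0, 0, 0.3, 0.1, 0.4], which matches the table you want to get.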

How to calculate cumulative product in SAS?

I need to create a variable that holds the product of all prior values, including the one in the current obs.
data temp;
input time cond_prob;
datalines;
1 1
2 0.2
3 0.3
4 0.4
5 0.6
;
run;
Final data should be:
1 1
2 0.2 (1*0.2)
3 0.06 (0.2* 0.3)
4 0.024 (0.06 * 0.4)
5 0.0144 (0.024 *0.6)
This seems like a simple code but I can't get it to work. I can do cumulative sums but cumulative product is not working when using the same logic.
Use a RETAIN statement, so that cum_product keeps its value from one observation to the next.
I initialize it to 1 because anything multiplied by 1 stays the same.
data want;
set temp;
retain cum_product 1;
cum_product = cond_prob * cum_product;
run;
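As a quick cross-check of the expected numbers, the same running product can be sketched in Python with itertools.accumulate. The list below is just the cond_prob column from the question typed in by hand.

```python
from itertools import accumulate
from operator import mul

cond_prob = [1, 0.2, 0.3, 0.4, 0.6]

# accumulate(..., mul) carries the running product forward,
# much like the retained cum_product variable does in the DATA step
cum_product = list(accumulate(cond_prob, mul))

for time, value in enumerate(cum_product, start=1):
    print(time, round(value, 4))
```

Rounded to four decimals this prints 1, 0.2, 0.06, 0.024 and 0.0144, the values listed in the question.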

Merging multiple .txt files into a csv

*New to Python.
I'm trying to merge multiple text files into 1 csv; example below -
filename.csv
Alpha
0
0.1
0.15
0.2
0.25
0.3
text1.txt
Alpha,Beta
0,10
0.2,20
0.3,30
text2.txt
Alpha,Charlie
0.1,5
0.15,15
text3.txt
Alpha,Delta
0.1,10
0.15,20
0.2,50
0.3,10
Desired output in the csv file: -
filename.csv
Alpha Beta Charlie Delta
0 10 0 0
0.1 0 5 10
0.15 0 15 20
0.2 20 0 50
0.25 0 0 0
0.3 30 0 10
The code I've been working with and others that were provided give me an answer similar to what is at the bottom of the page
def mergeData(indir="Dir Path", outdir="Dir Path"):
    dfs = []
    os.chdir(indir)
    fileList=glob.glob("*.txt")
    for filename in fileList:
        left= "/Path/Final.csv"
        right = filename
        output = "/Path/finalMerged.csv"
        leftDf = pandas.read_csv(left)
        rightDf = pandas.read_csv(right)
        mergedDf = pandas.merge(leftDf,rightDf,how='inner',on="Alpha", sort=True)
        dfs.append(mergedDf)
    outputDf = pandas.concat(dfs, ignore_index=True)
    outputDf = pandas.merge(leftDf, outputDf, how='inner', on='Alpha', sort=True, copy=False).fillna(0)
    print (outputDf)
    outputDf.to_csv(output, index=0)
mergeData()
However, instead of the desired result, the answer I get is:
Alpha Beta Charlie Delta
0 10 0 0
0.1 0 5 0
0.1 0 0 10
0.15 0 15 0
0.15 0 0 20
0.2 20 0 0
0.2 0 0 50
0.25 0 0 0
0.3 30 0 0
0.3 0 0 10
IIUC, you can create a list of all DataFrames (dfs), append mergedDf to it in the loop, and finally concat all the DataFrames into one:
import pandas
import glob
import os
def mergeData(indir="dir/path", outdir="dir/path"):
    dfs = []
    os.chdir(indir)
    fileList=glob.glob("*.txt")
    for filename in fileList:
        left= "/path/filename.csv"
        right = filename
        output = "/path/filename.csv"
        leftDf = pandas.read_csv(left)
        rightDf = pandas.read_csv(right)
        mergedDf = pandas.merge(leftDf,rightDf,how='right',on="Alpha", sort=True)
        dfs.append(mergedDf)
    outputDf = pandas.concat(dfs, ignore_index=True)
    #add missing rows from leftDf (in sample Alpha - 0.25)
    #fill NaN values by 0
    outputDf = pandas.merge(leftDf,outputDf,how='left',on="Alpha", sort=True).fillna(0)
    #columns are converted to int
    outputDf[['Beta', 'Charlie']] = outputDf[['Beta', 'Charlie']].astype(int)
    print (outputDf)
    outputDf.to_csv(output, index=0)
mergeData()
Alpha Beta Charlie
0 0.00 10 0
1 0.10 0 5
2 0.15 0 15
3 0.20 20 0
4 0.25 0 0
5 0.30 30 0
EDIT:
The problem is that you changed the parameter how='left' in the second merge to how='inner':
def mergeData(indir="Dir Path", outdir="Dir Path"):
    dfs = []
    os.chdir(indir)
    fileList=glob.glob("*.txt")
    for filename in fileList:
        left= "/Path/Final.csv"
        right = filename
        output = "/Path/finalMerged.csv"
        leftDf = pandas.read_csv(left)
        rightDf = pandas.read_csv(right)
        mergedDf = pandas.merge(leftDf,rightDf,how='inner',on="Alpha", sort=True)
        dfs.append(mergedDf)
    outputDf = pandas.concat(dfs, ignore_index=True)
    #need left join, not inner
    outputDf = pandas.merge(leftDf, outputDf, how='left', on='Alpha', sort=True, copy=False).fillna(0)
    print (outputDf)
    outputDf.to_csv(output, index=0)
mergeData()
Alpha Beta Charlie Delta
0 0.00 10.0 0.0 0.0
1 0.10 0.0 5.0 0.0
2 0.10 0.0 0.0 10.0
3 0.15 0.0 15.0 0.0
4 0.15 0.0 0.0 20.0
5 0.20 20.0 0.0 0.0
6 0.20 0.0 0.0 50.0
7 0.25 0.0 0.0 0.0
8 0.30 30.0 0.0 0.0
9 0.30 0.0 0.0 10.0
import pandas as pd
data1 = pd.read_csv('samp1.csv',sep=',')
data2 = pd.read_csv('samp2.csv',sep=',')
data3 = pd.read_csv('samp3.csv',sep=',')
df1 = pd.DataFrame({'Alpha':data1.Alpha})
df2 = pd.DataFrame({'Alpha':data2.Alpha,'Beta':data2.Beta})
df3 = pd.DataFrame({'Alpha':data3.Alpha,'Charlie':data3.Charlie})
mergedDf = pd.merge(df1, df2, how='outer', on ='Alpha',sort=False)
mergedDf1 = pd.merge(mergedDf, df3, how='outer', on ='Alpha',sort=False)
a = pd.DataFrame(mergedDf1)
print(a.drop_duplicates())
output:
Alpha Beta Charlie
0 0.00 10.0 NaN
1 0.10 NaN 5.0
2 0.15 NaN 15.0
3 0.20 20.0 NaN
4 0.25 NaN NaN
5 0.30 30.0 NaN
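The duplicated Alpha rows in the outputs above come from concatenating the per-file merges row-wise. One way to get a single row per Alpha, sketched here under the assumption that each text file has an Alpha column plus one value column, is to keep left-merging each file onto the running result instead. To keep the example self-contained the file contents from the question are inlined as strings via io.StringIO; with real files you would pass the paths to read_csv.

```python
import io
import pandas as pd

# Contents of filename.csv and the three text files, inlined for the example
base = pd.read_csv(io.StringIO("Alpha\n0\n0.1\n0.15\n0.2\n0.25\n0.3\n"))
parts = {
    "text1.txt": "Alpha,Beta\n0,10\n0.2,20\n0.3,30\n",
    "text2.txt": "Alpha,Charlie\n0.1,5\n0.15,15\n",
    "text3.txt": "Alpha,Delta\n0.1,10\n0.15,20\n0.2,50\n0.3,10\n",
}

merged = base
for name, text in parts.items():
    right = pd.read_csv(io.StringIO(text))
    # a left join keeps every Alpha from the base file and adds one new column
    merged = merged.merge(right, how="left", on="Alpha")

merged = merged.fillna(0)                       # Alphas absent from a file become 0
value_cols = [c for c in merged.columns if c != "Alpha"]
merged[value_cols] = merged[value_cols].astype(int)
print(merged)
```

Each pass through the loop adds a column rather than rows, so the result has six rows with Beta, Charlie and Delta side by side, as in the desired output.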