SAS IF then statement - sas

Hello, for some reason my IF-THEN statement will not work in this code. What I am trying to do is: if salary is less than or equal to 30,000, set a new variable income to 'low'. Here is what I have so far.
data newdd2;
input subject group$ year salary : comma7. @@;
IF (salary <= 30,000) THEN income = 'low';
datalines;
1 A 2 53,900 2 B 2 37,400 3 A 1 49,500
4 C 2 43,900 5 B 3 38,400 6 A 3 39,500
7 A 3 53,600 8 B 2 37,700 9 C 1 49,900
10 C 2 43,300 11 B 3 57,400 12 B 3 39,500
13 B 1 33,900 14 A 2 41,400 15 C 2 49,500
16 C 1 43,900 17 B 1 39,400 18 A 3 39,900
19 A 2 53,600 20 A 2 37,700 21 C 3 42,900
22 C 2 43,300 23 B 1 57,400 24 C 3 69,500
25 C 2 33,900 26 A 2 35,300 27 A 2 47,500
28 C 2 43,900 29 B 3 38,400 30 A 1 32,500
31 A 3 53,600 32 B 2 37,700 33 C 1 41,600
34 C 2 43,300 35 B 3 57,400 36 B 3 39,500
37 B 2 33,900 38 A 2 41,400 39 C 2 79,500
40 C 1 43,900 41 C 1 29,500 42 A 3 39,900
43 A 2 53,600 44 A 2 37,500 45 C 3 42,900
46 C 2 43,300 47 B 1 47,400 48 C 3 59,500
run;
The error I keep getting is "The work dataset may be incomplete". As far as I can tell the rest of my code is correct, and I've tried a number of things without success. Thanks in advance.

You cannot use a comma in a numeric literal.
IF (salary <= 30000) THEN income = 'low';
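For readers more used to Python, here is a minimal sketch of the same idea. The comma7. informat strips the thousands separators when the data are read, so the stored salary is a plain number and the threshold has to be written as one too. The helper names parse_salary and income_band are made up for illustration and are not part of SAS.

```python
def parse_salary(text):
    """Strip thousands separators, roughly what the comma7. informat does on input."""
    return float(text.replace(",", ""))


def income_band(salary, threshold=30000):
    """Return 'low' at or below the threshold, else an empty string (a blank in SAS)."""
    return "low" if salary <= threshold else ""


# Three salaries taken from the datalines above
for text in ["53,900", "37,400", "29,500"]:
    print(text, repr(income_band(parse_salary(text))))
```

In the posted data only subject 41 (29,500) is at or below the threshold, so every other row keeps a blank income.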

Subtracting values based on a index column and using a condition in the same column in DAX

I've found a lot of material on Stack Overflow about this, but I'm still not able to reproduce it.
Sample data set.
Asset Value Index
A 10 1
B 15 1
C 20 1
A 11 2
B 17 2
C 24 2
A 18 3
B 25 3
C 30 3
What I want to do is subtract the Asset values individually based on the Index column: each row's Value minus the Value of the same Asset at the previous Index.
Ex:
Asset A [1] -> 10
Asset A [2] -> 11
11 - 10 = 1
So the table would be like this.
Asset Value Index Diff
A 10 1 0
B 15 1 0
C 20 1 0
A 11 2 1
B 17 2 2
C 24 2 4
A 18 3 7
B 25 3 8
C 30 3 6
This needs to be done using DAX.
Can you help me?
Best regards!
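I just did this and it worked.
Diff =
// Assumption: the guard below fires only when Ind = -1, i.e. when Index starts at 0.
// With Index starting at 1, as in the sample data, the first-row check would be Ind = 0.
var Assets = 'Table'[Asset]
var Ind = 'Table'[Index] - 1
Return
IF(
    Ind = -1,
    0,
    'Table'[Value]
        - CALCULATE(
            SUM('Table'[Value]),
            FILTER('Table', 'Table'[Asset] = Assets && 'Table'[Index] = Ind)
        )
)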
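To check the expected Diff column independently of DAX, the same lookup can be sketched in Python. This is only an illustration under the assumption that each (Asset, Index) pair occurs once; the function name add_diff is hypothetical.

```python
rows = [
    ("A", 10, 1), ("B", 15, 1), ("C", 20, 1),
    ("A", 11, 2), ("B", 17, 2), ("C", 24, 2),
    ("A", 18, 3), ("B", 25, 3), ("C", 30, 3),
]


def add_diff(rows):
    """Append Value minus the same asset's Value at Index - 1 (0 when there is none)."""
    lookup = {(asset, index): value for asset, value, index in rows}
    out = []
    for asset, value, index in rows:
        previous = lookup.get((asset, index - 1))
        diff = 0 if previous is None else value - previous
        out.append((asset, value, index, diff))
    return out


for row in add_diff(rows):
    print(*row)
```

Testing for a missing previous row, rather than for a particular index value, avoids depending on whether the index starts at 0 or 1.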

One variable in kg and grams. another indicates which unit; how can I get new variable in kg?

In Stata, the variable quantity holds values in both kg and grams, while unit = 1 indicates kg and unit = 2 indicates grams. How can I generate a new variable quantity_kg that converts all gram values into kg?
My existing dataset:
clear
input double(hhid quantity unit unit_price)
1 24 1 .
1 4 1 .
1 350 2 50
1 550 2 90
1 2 1 65
1 3.5 1 85
1 1 1 20
1 4 1 25
1 2 1 .
2 1 1 30
2 2 1 15
2 1 1 20
2 250 2 10
2 2 1 20
2 400 2 10
2 100 2 60
2 1 1 20
My expected dataset:
input double(hhid quantity unit unit_price quantity_kg)
1 24 1 . 24
1 4 1 . 4
1 350 2 50 .35
1 550 2 90 .55
1 2 1 65 2
1 3.5 1 85 3.5
1 1 1 20 1
1 4 1 25 4
1 2 1 . 2
2 1 1 30 1
2 2 1 15 2
2 1 1 20 1
2 250 2 10 .25
2 2 1 20 2
2 400 2 10 .40
2 100 2 60 .10
2 1 1 20 1
The code below does what you want.
This looks like household data where one typically has to do a lot of unit conversions. They are also a common source of error so I have included the best practice of defining conversion rates and unit codes in locals. If you define this at one place, then you can reuse these locals in multiple places where you convert units. It is easy to spot typos in the rows with replace as you would notice if one row said kilo_rate but then gram_unit. In this simple example it might be overkill, but if you have many units and rates, then this is a neat way to avoid errors.
clear
input double(hhid quantity unit unit_price)
1 24 1 .
1 4 1 .
1 350 2 50
1 550 2 90
1 2 1 65
1 3.5 1 85
1 1 1 20
1 4 1 25
1 2 1 .
2 1 1 30
2 2 1 15
2 1 1 20
2 250 2 10
2 2 1 20
2 400 2 10
2 100 2 60
2 1 1 20
end
*Define conversion rates and unit codes
local kilo_rate = 1
local kilo_unit = 1
local gram_rate = 0.001
local gram_unit = 2
*Create the standardized variable
gen quantity_kg = .
replace quantity_kg = quantity * `kilo_rate' if unit == `kilo_unit'
replace quantity_kg = quantity * `gram_rate' if unit == `gram_unit'
// unit 1 means kg, unit 2 means g, and 1000 g = 1 kg
generate quantity_kg = cond(unit == 1, quantity, cond(unit == 2, quantity/1000, .))
Your example doesn't have any missing values on unit, but it does no harm to imagine that they might occur.
Providing a comment by way of explanation could be anywhere between redundant and essential for third parties.
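The same pattern (one table of unit codes and conversion factors, defined once, plus a missing result for unknown codes) can be sketched in Python. The names TO_KG_DIVISOR and quantity_kg are illustrative assumptions, and dividing by 1000 follows the second answer's quantity/1000.

```python
import math

# Divisors that turn a quantity into kilograms, keyed by the unit code in the data.
TO_KG_DIVISOR = {1: 1, 2: 1000}  # 1 = kg, 2 = grams


def quantity_kg(quantity, unit):
    """Return the quantity in kg, or NaN when the unit code is unknown or missing."""
    divisor = TO_KG_DIVISOR.get(unit)
    return math.nan if divisor is None else quantity / divisor


# A few (quantity, unit) pairs from the example data
for quantity, unit in [(24, 1), (350, 2), (3.5, 1), (100, 2)]:
    print(quantity, unit, quantity_kg(quantity, unit))
```

Adding a further unit then means adding one dictionary entry rather than another conversion line.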

count and group the column by sequence

I have a dataset that has to be grouped by number as follows.
ID dept count
1 10 2
2 10 2
3 20 4
4 20 4
5 20 4
6 20 4
7 30 4
8 30 4
9 30 4
10 30 4
So for every 3rd row I need a new level; the output should be as follows.
ID dept count Level
1 10 2 1
2 10 2 1
3 20 4 1
4 20 4 1
5 20 4 2
6 20 4 2
7 30 4 1
8 30 4 1
9 30 4 2
10 30 4 2
I have tried counting the number of rows based on the dept and count.
data want;
set have;
by dept count;
if first.count then level=1;
else level+1;
run;
This generates a count, but not exactly what I am looking for:
ID dept count Level
1 10 2 1
2 10 2 1
3 20 4 1
4 20 4 1
5 20 4 2
6 20 4 2
7 30 4 1
8 30 4 1
9 30 4 2
10 30 4 2
It isn't quite clear what output you want. I've extended your input data a bit; please could you clarify what output you'd expect for this input and what the logic is for generating it?
I've made a best guess at roughly what you might be aiming for, incrementing every 3 rows with the same dept and count. Perhaps this will be enough for you to get to the answer you want.
data have;
input ID dept count;
cards;
1 10 2
2 10 2
3 20 4
4 20 4
5 20 4
6 20 4
7 30 4
8 30 4
9 30 4
10 30 4
11 30 4
12 30 4
13 30 4
14 30 4
;
run;
data want;
set have;
by dept count;
if first.count then do;
level = 0;
dummy = 0;
end;
if mod(dummy,3) = 0 then level + 1;
dummy + 1;
drop dummy;
run;
Output:
ID dept count level
1 10 2 1
2 10 2 1
3 20 4 1
4 20 4 1
5 20 4 1
6 20 4 2
7 30 4 1
8 30 4 1
9 30 4 1
10 30 4 2
11 30 4 2
12 30 4 2
13 30 4 3
14 30 4 3
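For comparison, the "new level every 3 rows within each dept/count group" logic can be sketched in Python. It assumes the rows are already sorted by dept and count, as the BY statement requires; assign_levels is a made-up name.

```python
from itertools import groupby

# (ID, dept, count), the extended input from the data step above
have = [
    (1, 10, 2), (2, 10, 2),
    (3, 20, 4), (4, 20, 4), (5, 20, 4), (6, 20, 4),
    (7, 30, 4), (8, 30, 4), (9, 30, 4), (10, 30, 4),
    (11, 30, 4), (12, 30, 4), (13, 30, 4), (14, 30, 4),
]


def assign_levels(rows, size=3):
    """Number rows within each (dept, count) group, starting a new level every `size` rows."""
    out = []
    for _, group in groupby(rows, key=lambda r: (r[1], r[2])):
        for position, row in enumerate(group):
            out.append(row + (position // size + 1,))
    return out


for row in assign_levels(have):
    print(*row)
```

Integer division of the zero-based position plays the role of the dummy counter and the mod() test in the data step.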
One way to do this is to nest the SET statement inside a DO loop, or in this case two DO loops: one to generate the LEVEL (within DEPT) and a second to count by twos. Use the LAST.DEPT flag to handle an odd number of observations.
So, suppose I modify the input to include an odd number of observations in some groups:
data have;
input ID dept count;
cards;
1 10 2
2 10 2
3 20 4
4 20 4
5 20 4
6 20 4
7 20 4
8 30 4
9 30 4
10 30 4
;
Then you can use this step to assign the LEVEL variable.
data want ;
do level=1 by 1 until(last.dept);
do sublevel=1 to 2 until(last.dept);
set have;
by dept;
output;
end;
end;
run;
Results:
Obs level sublevel ID dept count
1 1 1 1 10 2
2 1 2 2 10 2
3 1 1 3 20 4
4 1 2 4 20 4
5 2 1 5 20 4
6 2 2 6 20 4
7 3 1 7 20 4
8 1 1 8 30 4
9 1 2 9 30 4
10 2 1 10 30 4
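A rough Python counterpart of the two nested loops, assuming rows sorted by dept: divmod on the position within the group yields the level and the sublevel in one step. The function name level_and_sublevel is hypothetical.

```python
from itertools import groupby

# (ID, dept, count), with an odd number of rows in dept 20 and dept 30
have = [
    (1, 10, 2), (2, 10, 2),
    (3, 20, 4), (4, 20, 4), (5, 20, 4), (6, 20, 4), (7, 20, 4),
    (8, 30, 4), (9, 30, 4), (10, 30, 4),
]


def level_and_sublevel(rows, size=2):
    """Return (level, sublevel, ID, dept, count), counting by `size` within each dept."""
    out = []
    for _, group in groupby(rows, key=lambda r: r[1]):
        for position, row in enumerate(group):
            level, sublevel = divmod(position, size)
            out.append((level + 1, sublevel + 1) + row)
    return out


for row in level_and_sublevel(have):
    print(*row)
```

A group with an odd number of rows simply ends with a level that has only sublevel 1, which matches what the LAST.DEPT check achieves.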

Management of spell data: months spent in given state in the past 24 months

I am working with a spell dataset that has the following form:
clear all
input persid start end t_start t_end spell_type year spell_number event
1 8 9 44 45 1 1999 1 0
1 12 12 60 60 1 2000 1 0
1 1 1 61 61 1 2001 1 0
1 7 11 67 71 1 2001 2 0
1 1 4 85 88 2 2003 1 0
1 5 7 89 91 1 2003 2 1
1 8 11 92 95 2 2003 3 0
1 1 1 97 97 2 2004 1 0
1 1 3 121 123 1 2006 1 1
1 4 5 124 125 2 2006 2 0
1 6 9 126 129 1 2006 3 1
1 10 11 130 131 2 2006 4 0
1 12 12 132 132 1 2006 5 1
1 1 12 157 168 1 2009 1 0
1 1 12 169 180 1 2010 1 0
1 1 12 181 192 1 2011 1 0
1 1 12 193 204 1 2012 1 0
1 1 12 205 216 1 2013 1 0
end
lab define lab_spelltype 1 "unemployment spell" 2 "employment spell"
lab val spell_type lab_spelltype
where persid is the id of the person; start and end are the months when the yearly unemployment/employment spell starts and ends, respectively; t_start and t_end are the same measures but starting to count from 1st January 1996; event is equal to 1 for the employment entries for which the previous row was an unemployment spell.
The data is such that there are no overlapping spells during a given year, and each year contiguous spells of the same type have been merged together.
My goal is, for each row such that event is 1, to compute the number of months spent as employed in the last 6 months and 24 months.
In this specific example, what I would like to get is:
clear all
input persid start end t_start t_end spell_type year spell_number event empl_6 empl_24
1 8 9 44 45 1 1999 1 0 . .
1 12 12 60 60 1 2000 1 0 . .
1 1 1 61 61 1 2001 1 0 . .
1 7 11 67 71 1 2001 2 0 . .
1 1 4 85 88 2 2003 1 0 . .
1 5 7 89 91 1 2003 2 1 0 5
1 8 11 92 95 2 2003 3 0 . .
1 1 1 97 97 2 2004 1 0 . .
1 1 3 121 123 1 2006 1 1 0 0
1 4 5 124 125 2 2006 2 0 . .
1 6 9 126 129 1 2006 3 1 3 3
1 10 11 130 131 2 2006 4 0 . .
1 12 12 132 132 1 2006 5 1 4 7
1 1 12 157 168 1 2009 1 0 . .
1 1 12 169 180 1 2010 1 0 . .
1 1 12 181 192 1 2011 1 0 . .
1 1 12 193 204 1 2012 1 0 . .
1 1 12 205 216 1 2013 1 0 . .
end
So the idea is that I have to go back to rows preceding each event==1 entry and count how many months the individual was employed.
Can you suggest a way to obtain this final result?
Some suggested to expand the dataset, but perhaps there are better ways to tackle the problem (especially because the dataset is quite large).
EDIT
The correct labeling of the employment status is:
lab define lab_spelltype 1 "employment spell" 2 "unemployment spell"
The number of past months spent in employment (empl_6 and empl_24) and the definition of event are now correct with this label.
A solution to the problem is to:
expand the data so that it is monthly,
fill in the gap months with tsfill, and finally
use sum() and lag operators to get the running sum for the last 6 and 24 months.
See also Robert's solution for some ideas I borrowed.
Important: this is almost surely not an efficient way to solve the issue, especially if the data is large (as in my case).
However, the plus is that one actually "sees" what happens in background to make sure the final result is the one desired.
Also, importantly, this solution takes into account cases where 2 (or more) events happen within 6 (or 24) months from each other.
clear all
input persid start end t_start t_end spell_type year spell_number event
1 8 9 44 45 1 1999 1 0
1 12 12 60 60 1 2000 1 0
1 1 1 61 61 1 2001 1 0
1 7 11 67 71 1 2001 2 0
1 1 4 85 88 2 2003 1 0
1 5 7 89 91 1 2003 2 1
1 8 11 92 95 2 2003 3 0
1 1 1 97 97 2 2004 1 0
1 1 3 121 123 1 2006 1 1
1 4 5 124 125 2 2006 2 0
1 6 9 126 129 1 2006 3 1
1 10 11 130 131 2 2006 4 0
1 12 12 132 132 1 2006 5 1
1 1 12 157 168 1 2009 1 0
1 1 12 169 180 1 2010 1 0
1 1 12 181 192 1 2011 1 0
1 1 12 193 204 1 2012 1 0
1 1 12 205 216 1 2013 1 0
end
lab define lab_spelltype 1 "employment" 2 "unemployment"
lab val spell_type lab_spelltype
list
* generate Stata monthly dates
gen spell_start = ym(year,start)
gen spell_end = ym(year,end)
format %tm spell_start spell_end
list
* expand to monthly data
gen n = spell_end - spell_start + 1
expand n, gen(expanded)
sort persid year spell_number (expanded)
bysort persid year spell_number: gen month = spell_start + _n - 1
by persid year spell_number: replace event = 0 if _n > 1
format %tm month
* xtset, fill months gaps with "empty" rows, use lags and cumsum to count past months in employment
xtset persid month, monthly // %tm format
tsfill
bysort persid (month): gen cumsum = sum(spell_type) if spell_type==1
bysort persid (month): replace cumsum = cumsum[_n-1] if cumsum==.
bysort persid (month): gen m6 = cumsum-1 - L7.cumsum if event==1 // "-1" otherwise it sums also current empl month
bysort persid (month): gen m24 = cumsum-1 - L25.cumsum if event==1
drop if event==.
list persid start end year m* if event
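The expand-and-count idea can also be sketched in Python for the single person in the example. It uses t_start and t_end (months counted from January 1996) and the corrected coding where spell_type 1 means employment; the function names are illustrative, and a real dataset would need the same logic applied per persid.

```python
# (t_start, t_end, spell_type, event); spell_type 1 = employment, 2 = unemployment
spells = [
    (44, 45, 1, 0), (60, 60, 1, 0), (61, 61, 1, 0), (67, 71, 1, 0),
    (85, 88, 2, 0), (89, 91, 1, 1), (92, 95, 2, 0), (97, 97, 2, 0),
    (121, 123, 1, 1), (124, 125, 2, 0), (126, 129, 1, 1), (130, 131, 2, 0),
    (132, 132, 1, 1), (157, 168, 1, 0), (169, 180, 1, 0), (181, 192, 1, 0),
    (193, 204, 1, 0), (205, 216, 1, 0),
]


def employed_months(spells):
    """Expand the employment spells into the set of months they cover."""
    months = set()
    for t_start, t_end, spell_type, _ in spells:
        if spell_type == 1:
            months.update(range(t_start, t_end + 1))
    return months


def months_employed_before(spells, window):
    """For each event row, count employed months in the `window` months before its start."""
    employed = employed_months(spells)
    result = []
    for t_start, _, _, event in spells:
        if event:
            count = sum(1 for m in range(t_start - window, t_start) if m in employed)
            result.append((t_start, count))
    return result


print(months_employed_before(spells, 6))
print(months_employed_before(spells, 24))
```

Months with no spell at all are simply absent from the set, which is the role tsfill plays in the Stata version.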
The posted example is of little utility in developing and testing a solution so I made up fake data that has the same properties. It's bad practice to use 1 and 2 as values for an indicator so I replaced the employed indicator with 1 meaning employed, 0 otherwise. Using month and year separately is also useless so Stata monthly dates are used.
The first solution uses tsegen (from SSC) after expanding each spell to one observation per month. With panel data, all you need to do is to sum the employment indicator for the desired time window.
The second solution uses rangestat (also from SSC) and does the same computations without expanding the data at all. The idea is simple: just add the duration of previous employment spells if the end of the spell falls within the desired window. Of course, if the end of the spell falls within the window but not the start, months outside the window must be subtracted.
* fake data for 100 persons, up to 10 spells with no overlap
clear
set seed 123423
set obs 100
gen long persid = _n
gen spell_start = ym(runiformint(1990,2013),1)
expand runiformint(1,10)
bysort persid: gen spellid = _n
by persid: gen employed = runiformint(0,1)
by persid: gen spell_avg = int((ym(2015,12) - spell_start) / _N) + 1
by persid: replace spell_start = spell_start[_n-1] + ///
runiformint(1,spell_avg) if _n > 1
by persid: gen spell_end = runiformint(spell_start, spell_start[_n+1]-1)
replace spell_end = spell_start + runiformint(1,12) if mi(spell_end)
format %tm spell_start spell_end
* an event is an employment spell that immediately follows an unemployment spell
by persid: gen event = employed & employed[_n-1] == 0
* expand to one obs per month and declare as panel data
expand spell_end - spell_start + 1
bysort persid spellid: gen ym = spell_start + _n - 1
format %tm ym
tsset persid ym
* only count employment months; limit results to first month event obs
tsegen m6 = rowtotal(L(1/6).employed)
tsegen m24 = rowtotal(L(1/24).employed)
bysort persid spellid (ym): replace m6 = . if _n > 1 | !event
bysort persid spellid (ym): replace m24 = . if _n > 1 | !event
* --------- redo using rangestat, without any monthly expansion ----------------
* return to original obs but keep first month results
bysort persid spellid: keep if _n == 1
* employment end and duration for employed observations only
gen e_end = spell_end if employed
gen e_len = spell_end - spell_start + 1 if employed
foreach target in 6 24 {
// define interval bounds but only for event observations
// an out-of-sample [0,0] interval will yield no results for non-events
gen low`target' = cond(event, spell_start-`target', 0)
gen high`target' = cond(event, spell_start-1, 0)
// sum employment lengths and save earliest employment spell info
rangestat (sum) empl`target'=e_len ///
(firstnm) firste`target'=e_end firste`target'len=e_len, ///
by(persid) interval(spell_end low`target' high`target')
// remove from the count months that occur before lower bound
gen e_start = firste`target' - firste`target'len + 1
gen outside = low`target' - e_start
gen empl`target'final = cond(outside > 0, empl`target'-outside, empl`target')
replace empl`target'final = 0 if mi(empl`target'final) & event
drop e_start outside
}
* confirm that we match the -tsegen- results
assert m24 == empl24final
assert m6 == empl6final
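The no-expansion idea can be illustrated in Python by clipping each employment spell to the window and summing the overlaps. This is a sketch of the interval arithmetic only, not of rangestat itself, using the employment spells and event start months from the question's example; overlap and employed_in_window are made-up names.

```python
# Employment spells as (start, end) in months since January 1996, from the question's data
employment = [
    (44, 45), (60, 60), (61, 61), (67, 71), (89, 91), (121, 123),
    (126, 129), (132, 132), (157, 168), (169, 180), (181, 192),
    (193, 204), (205, 216),
]
event_starts = [89, 121, 126, 132]


def overlap(start, end, low, high):
    """Number of months shared by [start, end] and [low, high], both inclusive."""
    return max(0, min(end, high) - max(start, low) + 1)


def employed_in_window(event_start, window):
    """Employed months in the `window` months before event_start, without expanding spells."""
    low, high = event_start - window, event_start - 1
    return sum(overlap(start, end, low, high) for start, end in employment)


for event_start in event_starts:
    print(event_start, employed_in_window(event_start, 6), employed_in_window(event_start, 24))
```

Clipping with min and max handles a spell that starts before the window and ends inside it, which is the case the Stata code corrects with its outside variable.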

Comparison of two CSV files in Python

I want to compare two CSV files that look like the ones below.
I want to find the unmatched signals.
I need some help doing this in Python.
File 1
2
USER Name
7/31/2015 0:00
<XXXXXXX>
1 Signal_1 10
2 Signal_2 1 2 3 4 5
3 Signal_3 X 5 10 15 20 25 Y 6 11 16 21 26
1 Signal_4 20
1 Signal_5 30
2 Signal_6 6 7 8 9 10 11 12 13
2 Signal_7 55 1.05 1.6 14.1
3 Signal_8 X 30 40 50 60 40 Y 14 15 26 14 26
2 Signal_9 1 1 2 3 2
1 Signal_10 40
File 2
2
USER Name
7/31/2015 0:00
<XXXXXXX>
3 Signal_3 X 20 10 15 17 25 Y 6 11 16 21 26
1 Signal_5 5
2 Signal_7 55 1.05 1.6 14.1
1 Signal_1 10
3 Signal_8 X 30 40 50 60 40 Y 14 15 26 14 26
1 Signal_10 14
2 Signal_9 1 1 2 3 2
2 Signal_6 6 7 8 59 10 15 12 13
1 Signal_4 20
2 Signal_2 1 2 3 4 5
Result:
File 1
3 Signal_3 X 5 10 15 20 25 Y 6 11 16 21 26
1 Signal_5 30
1 Signal_10 40
2 Signal_6 6 7 8 9 10 11 12 13
File 2
3 Signal_3 X 20 10 15 17 25 Y 6 11 16 21 26
1 Signal_5 5
1 Signal_10 14
2 Signal_9 1 1 2 3 2
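If you want to check for fairly exact comparisons, you can use sets quite easily:
def sigset(fname):
    # Open in text mode so the 'Signal' substring test works on str lines
    with open(fname, 'r') as f:
        # Normalize whitespace so spacing differences do not count as mismatches
        data = set(' '.join(line.split()) for line in f
                   if 'Signal' in line)
    return data

s1 = sigset('sig1.txt')
s2 = sigset('sig2.txt')

# Lines present only in the first file
print('File 1')
for line in sorted(s1 - s2):
    print(line)
print('')
# Lines present only in the second file
print('File 2')
for line in sorted(s2 - s1):
    print(line)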
with open('Sample1.csv', 'r') as t1, open('Sample2.csv', 'r') as t2:
    fileone = t1.readlines()
    filetwo = t2.readlines()
print(fileone)
print(filetwo)
with open('update.csv', 'w') as outFile:
    # Lines that appear only in the second file
    for line in filetwo:
        if line not in fileone:
            outFile.write(line)
    # Lines that appear only in the first file
    for line in fileone:
        if line not in filetwo:
            outFile.write(line)
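Both approaches above compare whole lines. If the goal is a list of signal names that differ, a variant keyed on the name may be easier to read. This is a minimal sketch that assumes the signal name is always the second whitespace-separated field; read_signals and unmatched are hypothetical helpers.

```python
def read_signals(lines):
    """Map each signal name to its whitespace-normalized line, skipping header lines."""
    signals = {}
    for line in lines:
        fields = line.split()
        if len(fields) >= 2 and fields[1].startswith("Signal"):
            signals[fields[1]] = " ".join(fields)
    return signals


def unmatched(lines1, lines2):
    """Return names of signals whose lines differ or that appear in only one input."""
    s1, s2 = read_signals(lines1), read_signals(lines2)
    return sorted(name for name in s1.keys() | s2.keys() if s1.get(name) != s2.get(name))


file1 = ["2", "USER Name", "1 Signal_1 10", "1 Signal_5 30", "2 Signal_9 1 1   2 3 2"]
file2 = ["1 Signal_5 5", "1 Signal_1 10", "2 Signal_9 1 1 2 3 2", "1 Signal_4 20"]
print(unmatched(file1, file2))
```

Open file objects can be passed in place of the lists, since both functions only iterate over lines.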