When to use semaphore locks / unlocks vs. wait / notify?

I'm learning Promela and using SPIN to model some examples I found. This model is a food-ordering simulation: the customer orders, the cashier takes the order and sends it to the server, the server delivers the food back to the customer, and so on.
Here is a flow of the program.
The specific processes are as follows.
Here is the code I have written so far:
#define NCUSTS 3 /* number of customers */
#define NCASHIERS 1 /* number of cashiers */
#define NSERVERS 1 /* number of servers */
#define NOBODY 255
#define semaphore byte /* define a semaphore */
/*
* lock (down) and unlock (up) functions for mutex semaphores
*/
inline unlock(s) {s++;}
inline lock(s) {atomic{ s>0 ; s--}}
/*
* wait (down) and notify (up) functions for signaling semaphores
*/
inline notify(s) {s++;}
inline wait(s) {atomic{ s>0 ; s--}}
mtype = {CHILI, SANDWICH, PIZZA, NULL} ; // the types of foods (added null for resets)
mtype favorites[NCUSTS];
mtype orders[NCUSTS] = NULL;
byte ordering = NOBODY;
semaphore waitingFood[NCUSTS] = 1;
semaphore cashierOpen = 1;
semaphore serverOpen = 1;
bool waiting[NCUSTS] = false;
/*
* Process representing a customer.
* Takes in their favorite food and an integer id
* to represent them
*/
proctype Customer(mtype favorite; byte id)
{
/* customer cycle */
do
::
//Enter
printf("Customer %d Entered\n", id);
//Record
favorites[id] = favorite;
//Wait for cashier
cashierOpen > 0;
atomic{
lock(cashierOpen);
printf("Cashier selects customer %d\n", id);
ordering = id;
}
//Order
orders[id] = favorite;
printf("Customer orders %e\n", favorite);
unlock(cashierOpen);
ordering = NOBODY;
printf("Customer %d is waiting for %e\n", id, favorite);
waiting[id] = true;
wait(waitingFood[id]);
waitingFood[id] > 0;
printf("Customer %d receives food and leaves\n", id);
favorites[id] = NULL;
orders[id] = NULL;
od ;
}
/*
* Process representing a cashier
*/
proctype Cashier()
{
do
::
printf("Cashier is waiting for a customer\n");
cashierOpen < 1;
printf("Cashier takes the order of Customer %d\n", ordering);
serverOpen > 0;
printf("Cashier passes order to server\n");
od ;
}
/*
* Process representing a server
*/
proctype Server()
{
byte id;
do
::
printf("Server is waiting for order\n");
for(id : 0..2){
if
:: waiting[id] ->
lock(serverOpen);
printf("Server creates order of %e for %d\n", orders[id], id);
printf("Server delivers order of %e to %d\n", orders[id], id);
notify(waitingFood[id]);
unlock(serverOpen);
:: else ->
skip;
fi;
}
od ;
}
/*
* Sets up processes. This model creates two
* customers with the favorite foods PIZZA & CHILI.
*/
init{
atomic{
run Customer(PIZZA, 0) ;
run Customer(CHILI, 1) ;
run Cashier();
run Server();
}
}
Clearly, the program does not work as I expected. Could someone help me understand how to use semaphores here, and when to use lock / unlock versus wait / notify?

This part of your model has to be changed:
:: waiting[id] ->
...
notify(waitingFood[id]);
...
When waitingFood[id] is released by the Server, the Customer does not immediately turn waiting[id] to false, so it is possible that the Server handles the same Customer's request more than once (actually, it is very likely to happen).
In fact, by adding the following ltl property to the model:
ltl p0 { [] (waitingFood[0] < 2) };
and then checking the property, it is confirmed that the variable waitingFood can be assigned an "incorrect" value:
~$ spin -search -bfs t.pml
ltl p0: [] ((waitingFood[0]<2))
Depth= 10 States= 13 Transitions= 13 Memory= 128.195
Depth= 20 States= 620 Transitions= 878 Memory= 128.195
pan:1: assertion violated !( !((waitingFood[0]<2))) (at depth 22)
pan: wrote t.pml.trail
(Spin Version 6.4.8 -- 2 March 2018)
Warning: Search not completed
+ Breadth-First Search
+ Partial Order Reduction
Full statespace search for:
never claim + (p0)
assertion violations + (if within scope of claim)
cycle checks - (disabled by -DSAFETY)
invalid end states - (disabled by never claim)
State-vector 60 byte, depth reached 22, errors: 1
1242 states, stored
1239 nominal states (stored-atomic)
684 states, matched
1926 transitions (= stored+matched)
3 atomic steps
hash conflicts: 0 (resolved)
Stats on memory usage (in Megabytes):
0.104 equivalent memory usage for states (stored*(State-vector + overhead))
0.381 actual memory usage for states
128.000 memory used for hash table (-w24)
128.293 total actual memory usage
pan: elapsed time 0 seconds
The problematic trace is:
~$ spin -p -g -l -t t.pml
ltl p0: [] ((waitingFood[0]<2))
starting claim 4
using statement merging
Starting Customer with pid 2
1: proc 0 (:init::1) t.pml:124 (state 1) [(run Customer(PIZZA,0))]
Starting Customer with pid 3
2: proc 0 (:init::1) t.pml:125 (state 2) [(run Customer(CHILI,1))]
Starting Cashier with pid 4
3: proc 0 (:init::1) t.pml:126 (state 3) [(run Cashier())]
Starting Server with pid 5
4: proc 0 (:init::1) t.pml:127 (state 4) [(run Server())]
Server is waiting for order
5: proc 4 (Server:1) t.pml:100 (state 1) [printf('Server is waiting for order\\n')]
5: proc 4 (Server:1) t.pml:101 (state 2) [id = 0]
Server(4):id = 0
6: proc 4 (Server:1) t.pml:101 (state 3) [((id<=2))]
Cashier is waiting for a customer
7: proc 3 (Cashier:1) t.pml:83 (state 1) [printf('Cashier is waiting for a customer\\n')]
Customer 1 Entered
8: proc 2 (Customer:1) t.pml:45 (state 1) [printf('Customer %d Entered\\n',id)]
Customer 0 Entered
9: proc 1 (Customer:1) t.pml:45 (state 1) [printf('Customer %d Entered\\n',id)]
10: proc 1 (Customer:1) t.pml:48 (state 2) [favorites[id] = favorite]
favorites[0] = PIZZA
favorites[1] = 0
favorites[2] = 0
11: proc 1 (Customer:1) t.pml:51 (state 3) [((cashierOpen>0))]
12: proc 1 (Customer:1) t.pml:12 (state 4) [((cashierOpen>0))]
12: proc 1 (Customer:1) t.pml:12 (state 5) [cashierOpen = (cashierOpen-1)]
cashierOpen = 0
Cashier selects customer 0
12: proc 1 (Customer:1) t.pml:54 (state 8) [printf('Cashier selects customer %d\\n',id)]
cashierOpen = 0
12: proc 1 (Customer:1) t.pml:55 (state 9) [ordering = id]
ordering = 0
cashierOpen = 0
13: proc 1 (Customer:1) t.pml:58 (state 11) [orders[id] = favorite]
orders[0] = PIZZA
orders[1] = NULL
orders[2] = NULL
Customer orders PIZZA
14: proc 1 (Customer:1) t.pml:59 (state 12) [printf('Customer orders %e\\n',favorite)]
15: proc 1 (Customer:1) t.pml:11 (state 13) [cashierOpen = (cashierOpen+1)]
cashierOpen = 1
16: proc 1 (Customer:1) t.pml:61 (state 15) [ordering = 255]
ordering = 255
Customer 0 is waiting for PIZZA
17: proc 1 (Customer:1) t.pml:64 (state 16) [printf('Customer %d is waiting for %e\\n',id,favorite)]
18: proc 1 (Customer:1) t.pml:65 (state 17) [waiting[id] = 1]
waiting[0] = 1
waiting[1] = 0
waiting[2] = 0
19: proc 4 (Server:1) t.pml:103 (state 4) [(waiting[id])]
20: proc 4 (Server:1) t.pml:12 (state 5) [((serverOpen>0))]
20: proc 4 (Server:1) t.pml:12 (state 6) [serverOpen = (serverOpen-1)]
serverOpen = 0
Server creates order of PIZZA for 0
21: proc 4 (Server:1) t.pml:105 (state 9) [printf('Server creates order of %e for %d\\n',orders[id],id)]
Server delivers order of PIZZA to 0
22: proc 4 (Server:1) t.pml:106 (state 10) [printf('Server delivers order of %e to %d\\n',orders[id],id)]
23: proc 4 (Server:1) t.pml:17 (state 11) [waitingFood[id] = (waitingFood[id]+1)]
waitingFood[0] = 2
waitingFood[1] = 1
waitingFood[2] = 1
spin: trail ends after 23 steps
#processes: 5
favorites[0] = PIZZA
favorites[1] = 0
favorites[2] = 0
orders[0] = PIZZA
orders[1] = NULL
orders[2] = NULL
ordering = 255
waitingFood[0] = 2
waitingFood[1] = 1
waitingFood[2] = 1
cashierOpen = 1
serverOpen = 0
waiting[0] = 1
waiting[1] = 0
waiting[2] = 0
23: proc 4 (Server:1) t.pml:11 (state 14)
23: proc 3 (Cashier:1) t.pml:84 (state 2)
23: proc 2 (Customer:1) t.pml:48 (state 2)
23: proc 1 (Customer:1) t.pml:18 (state 21)
23: proc 0 (:init::1) t.pml:129 (state 6) <valid end state>
23: proc - (p0:1) _spin_nvr.tmp:2 (state 6)
5 processes created
Here are a few additional comments based on reading your model:
For the Customer, it is pointless to wait on cashierOpen > 0, because lock(cashierOpen) already does that.
There is only one ordering variable, so your model may display incorrect information as soon as cashierOpen is initialized to a value > 1.
The Customer should release cashierOpen with unlock(cashierOpen) and set ordering to NOBODY within a single atomic { } statement. Otherwise some other Customer could write to ordering between the two instructions, and the former Customer would then incorrectly overwrite that variable with NOBODY.
The array waitingFood[NCUSTS] is initialized to 1. It is unclear to me what you expect to happen when you write wait(waitingFood[id]), since the memory location already contains 1 and thus the Customer does not have to wait. Instead, I think the array should be initialized to 0, and perhaps it is also worth renaming it to reflect this change.
Again, writing waitingFood[id] > 0 after wait(waitingFood[id]) is not only pointless but, in this case, also incorrect. The semaphore should only hold the values 0 and 1 at this stage. When wait(waitingFood[id]) acquires the semaphore, waitingFood[id] is set to 0, so the line waitingFood[id] > 0 would block the Customer forever. The only reason this does not happen right now is the bug I pointed out at the beginning of this answer, which allows the Server to serve the same Customer multiple times.
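To make the lock / unlock versus wait / notify distinction concrete outside Promela, here is a rough Python sketch of the same two roles. It is not a translation of the model above: the names simulate, cashier_open, food_ready, order_placed and pending are made up for illustration. The point it tries to show is only that a mutex semaphore starts at 1 and is acquired and released by the same process, while a signaling semaphore starts at 0 and is released by a different process than the one waiting on it.

```python
import threading
from collections import deque


def simulate(foods):
    """Serve one order per customer; return the (customer id, food) pairs served."""
    n = len(foods)
    cashier_open = threading.Semaphore(1)                    # mutex: starts at 1 (free)
    food_ready = [threading.Semaphore(0) for _ in range(n)]  # signal: starts at 0
    order_placed = threading.Semaphore(0)                    # signal: counts pending orders
    pending = deque()   # hypothetical order queue; deque append/popleft are thread-safe
    served = []

    def customer(cid):
        with cashier_open:             # lock ... unlock by the SAME thread
            pending.append(cid)        # critical section: one customer orders at a time
        order_placed.release()         # notify: tell the server an order exists
        food_ready[cid].acquire()      # wait: blocks until the server notifies
        served.append((cid, foods[cid]))

    def server():
        for _ in range(n):
            order_placed.acquire()     # wait for an order
            cid = pending.popleft()
            food_ready[cid].release()  # notify exactly once per order

    threads = [threading.Thread(target=customer, args=(i,)) for i in range(n)]
    threads.append(threading.Thread(target=server))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(served)


if __name__ == "__main__":
    print(simulate(["PIZZA", "CHILI"]))
```

Because the server consumes one order_placed token per notification, it cannot notify the same customer twice for one order, which is the situation the ltl property above detects in the original model.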

Related

How can I write a query to carry a remaining balance of hours forward for load leveling a schedule?

I have a query result with a total amount of hours scheduled per week in chronological order without gaps and have a set amount of hours that can be processed each week. Any hours not processed should be carried over to one or more following weeks. The following information is available.
Week | Hours | Capacity
   1 |  2000 |      160
   2 |   100 |      160
   3 |     0 |      140
   4 |   150 |      160
   5 |   500 |      160
   6 |  1500 |      160
Each week it should reduce the new hours plus carried over hours by the Capacity but never go below zero. A positive value should carry into the following week(s).
Week | Hours | Capacity | LeftOver = (Hours + LAG(LeftOver) - Capacity)
   1 |   400 |      160 | 240 (400 +   0 - 160)
   2 |   100 |      160 | 180 (100 + 240 - 160)
   3 |     0 |      140 |  40 (  0 + 180 - 140)
   4 |    20 |      160 |   0 ( 20 +  40 - 160) (no negative, change to zero)
   5 |   500 |      160 | 340 (500 +   0 - 160)
   6 |     0 |      160 | 180 (  0 + 340 - 160)
I'm assuming this can be done with cte recursion and a running value that doesn't go below zero but I can't find any specific examples of how this would be written.

Well, you are not wrong, a recursive common table expression is indeed an option to construct a solution.
Construction of recursive queries can generally be done in steps. Run your query after every step and validate the result.
Define the "anchor" of your recursion: where does the recursion start? Here the start is defined by Week = 1.
Define a recursion iteration: what is the relation between iterations? Here that would be the incrementing week numbers: d.Week = r.Week + 1.
Avoiding negative numbers can be resolved with a case expression.
Sample data
create table data
(
Week int,
Hours int,
Capacity int
);
insert into data (Week, Hours, Capacity) values
(1, 400, 160),
(2, 100, 160),
(3, 0, 140),
(4, 20, 160),
(5, 500, 160),
(6, 0, 160);
Solution
with rcte as
(
select d.Week,
d.Hours,
d.Capacity,
case
when d.Hours - d.Capacity > 0
then d.Hours - d.Capacity
else 0
end as LeftOver
from data d
where d.Week = 1
union all
select d.Week,
d.Hours,
d.Capacity,
case
when d.Hours + r.LeftOver - d.Capacity > 0
then d.Hours + r.LeftOver - d.Capacity
else 0
end
from rcte r
join data d
on d.Week = r.Week + 1
)
select r.Week,
r.Hours,
r.Capacity,
r.LeftOver
from rcte r
order by r.Week;
Result
Week Hours Capacity LeftOver
---- ----- -------- --------
1 400 160 240
2 100 160 180
3 0 140 40
4 20 160 0
5 500 160 340
6 0 160 180
Fiddle to see things in action.
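As a cross-check outside SQL, the recurrence that the recursive CTE evaluates can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not part of the query; the function name carry_forward is made up here, and the rows are the sample data from above.

```python
def carry_forward(rows):
    """rows: (week, hours, capacity) tuples ordered by week.

    Returns (week, leftover) pairs, where leftover is the running
    balance clamped at zero, mirroring the CASE expression in the CTE.
    """
    leftover = 0  # nothing is carried into the first week (the anchor)
    result = []
    for week, hours, capacity in rows:
        leftover = max(0, hours + leftover - capacity)
        result.append((week, leftover))
    return result


sample = [
    (1, 400, 160),
    (2, 100, 160),
    (3, 0, 140),
    (4, 20, 160),
    (5, 500, 160),
    (6, 0, 160),
]

if __name__ == "__main__":
    for week, leftover in carry_forward(sample):
        print(week, leftover)
```

The leftover values come out as 240, 180, 40, 0, 340, 180, matching the Result table.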

I ended up writing a few CTEs then a recursive CTE and got what I needed. The capacity is a static number here but will be replaced later with one that takes holidays and vacations into account. Will also need to consider the initial 'LeftOver' value for the first week but could use this query with an earlier date period to find the most recent date with a zero LeftOver value then use that as a new start date, then filter out those earlier weeks in the final query.
DECLARE @StartDate date = (SELECT MAX(FirstDayOfWorkWeek) FROM dbo._Calendar WHERE Date <= GETDATE());
DECLARE @EndDate date = DATEADD(week, 12, @StartDate);
DECLARE @EmployeeQty int = (SELECT ISNULL(COUNT(*), 0) FROM Employee WHERE DefaultDepartment IN (4) AND Hidden = 0 AND DateTerminated IS NULL);
WITH hours AS (
/* GRAB ALL NEW HOURS SCHEDULED FOR EACH WEEK IN THE SELECTED PERIOD */
SELECT c.FirstDayOfWorkWeek as [Date]
, SUM(budget.Hours) as hours
FROM dbo.Project_Phase phase
JOIN dbo.Project_Budget_Labor budget on phase.ID = budget.Phase
JOIN dbo._Calendar c on CONVERT(date, phase.Date1) = c.[Date]
WHERE phase.CompletedOn IS NULL AND phase.Project <> 4266
AND phase.Date1 BETWEEN @StartDate AND @EndDate
AND budget.Department IN (4)
GROUP BY c.FirstDayOfWorkWeek
)
, weeks AS (
/* CREATE BLANK ROWS FOR EACH WEEK AND JOIN TO ACTUAL HOURS TO ELIMINATE GAPS */
/* ADD A ROW NUMBER FOR RECURSION IN NEXT CTE */
SELECT cal.[Date]
, ROW_NUMBER() OVER(ORDER BY cal.[Date]) as [rownum]
, ISNULL(SUM(hours.Hours), 0) as Hours
FROM (SELECT FirstDayOfWorkWeek as [Date] FROM dbo._Calendar WHERE [Date] BETWEEN @StartDate AND @EndDate GROUP BY FirstDayOfWorkWeek) as cal
LEFT JOIN hours on cal.[Date] = hours.[Date]
GROUP BY cal.[Date]
)
, spread AS (
/* GRAB FIRST WEEK AND USE RECURSION TO CREATE RUNNING TOTAL THAT DOES NOT DROP BELOW ZERO*/
SELECT TOP 1 [Date]
, rownum
, Hours
, @EmployeeQty * 40 as Capacity
, CONVERT(numeric(9,2), 0.00) as LeftOver
, Hours as running
FROM weeks
ORDER BY rownum
UNION ALL
SELECT curr.[Date]
, curr.rownum
, curr.Hours
, @EmployeeQty * 40 as Capacity
, CONVERT(numeric(9,2), CASE WHEN curr.Hours + prev.LeftOver - (@EmployeeQty * 40) < 0 THEN 0 ELSE curr.Hours + prev.LeftOver - (@EmployeeQty * 40) END) as LeftOver
, curr.Hours + prev.LeftOver as running
FROM weeks curr
JOIN spread prev on curr.rownum = (prev.rownum + 1)
)
SELECT spread.Hours as NewHours
, spread.LeftOver as PrevHours
, spread.Capacity
, spread.running as RunningTotal
, CASE WHEN running < Capacity THEN running ELSE Capacity END as HoursThisWeek
FROM spread

How to extract where clause as array in spark sql?

I am trying to extract where clause from SQL query.
Multiple conditions in the where clause should be returned as an array. Please help me.
Sample Input String:
select * from table where col1=1 and (col2 between 1 and 10 or col2 between 190 and 200) and col2 is not null
Output Expected:
Array("col1=1", "(col2 between 1 and 10 or col2 between 190 and 200)", "col2 is not null")
Thanks in advance.
EDIT:
To clarify my question: I would like to split all the conditions into separate items. Say my query is
select * from table where col1=1 and (col2 between 1 and 10 or col2 between 190 and 200) and col2 is not null
The output I'm expecting is like
List("col1=1", "col2 between 1 and 10", "col2 between 190 and 200", "col2 is not null")
The thing is the query may have multiple levels of conditions like
select * from table where col1=1 and (col2 =2 or(col3 between 1 and 10 or col3 between 190 and 200)) and col4='xyz'
in output each condition should be a separate item
List("col1=1","col2=2", "col3 between 1 and 10", "col3 between 190 and 200", "col4='xyz'")

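I wouldn't use a regex for this. Here's an alternative way to extract your conditions, based on Catalyst's logical plan:
// Catalyst classes used by the snippets in this answer
import org.apache.spark.sql.catalyst.expressions.{And, Expression, Predicate}
import org.apache.spark.sql.catalyst.plans.logical.Filter
val plan = df.queryExecution.logical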
val predicates: Seq[Expression] = plan.children.collect{case f: Filter =>
f.condition.productIterator.flatMap{
case And(l,r) => Seq(l,r)
case o:Predicate => Seq(o)
}
}.toList.flatten
println(predicates)
Output :
List(('col1 = 1), ((('col2 >= 1) && ('col2 <= 10)) || (('col2 >= 190) && ('col2 <= 200))), isnotnull('col2))
Here the predicates are still Expressions and hold information (tree representation).
EDIT :
As asked in comment, here's a String (user friendly I hope) representation of the predicates :)
val plan = df.queryExecution.logical
val predicates: Seq[Expression] = plan.children.collect{case f: Filter =>
f.condition.productIterator.flatMap{
case o:Predicate => Seq(o)
}
}.toList.flatten
def stringifyExpressions(expression: Expression): Seq[String] = {
expression match{
case And(l,r) => (l,r) match {
case (gte: GreaterThanOrEqual,lte: LessThanOrEqual) => Seq(s"""${gte.left.toString} between ${gte.right.toString} and ${lte.right.toString}""")
case (_,_) => Seq(l,r).flatMap(stringifyExpressions)
}
case Or(l,r) => Seq(Seq(l,r).flatMap(stringifyExpressions).mkString("(",") OR (", ")"))
case eq: EqualTo => Seq(s"${eq.left.toString} = ${eq.right.toString}")
case inn: IsNotNull => Seq(s"${inn.child.toString} is not null")
case p: Predicate => Seq(p.toString)
}
}
val stringRepresentation = predicates.flatMap{stringifyExpressions}
println(stringRepresentation)
New Output :
List('col1 = 1, ('col2 between 1 and 10) OR ('col2 between 190 and 200), 'col2 is not null)
You can keep playing with the recursive stringifyExpressions method if you want to customize the output.
EDIT 2 : In response to your own edit :
You can change the Or / EqualTo cases to the following
def stringifyExpressions(expression: Expression): Seq[String] = {
expression match{
case And(l,r) => (l,r) match {
case (gte: GreaterThanOrEqual,lte: LessThanOrEqual) => Seq(s"""${gte.left.toString} between ${gte.right.toString} and ${lte.right.toString}""")
case (_,_) => Seq(l,r).flatMap(stringifyExpressions)
}
case Or(l,r) => Seq(l,r).flatMap(stringifyExpressions)
case EqualTo(l,r) =>
val prettyLeft = if(l.resolved && l.dataType == StringType) s"'${l.toString}'" else l.toString
val prettyRight = if(r.resolved && r.dataType == StringType) s"'${r.toString}'" else r.toString
Seq(s"$prettyLeft=$prettyRight")
case inn: IsNotNull => Seq(s"${inn.child.toString} is not null")
case p: Predicate => Seq(p.toString)
}
}
This gives the 4 elements List :
List('col1=1, 'col2 between 1 and 10, 'col2 between 190 and 200, 'col2 is not null)
For the second example :
select * from table where col1=1 and (col2 =2 or (col3 between 1 and 10 or col3 between 190 and 200)) and col4='xyz'
You'd get this output (List[String] with 5 elements) :
List('col1=1, 'col2=2, 'col3 between 1 and 10, 'col3 between 190 and 200, 'col4='xyz')
Additional note: If you want to print the attribute names without the starting quote, you can handle it by printing this instead of toString :
node.asInstanceOf[UnresolvedAttribute].name
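The idea behind the last version of stringifyExpressions (recurse through And / Or nodes and emit only the leaves) can be illustrated without Spark. The sketch below is plain Python on a hand-built tree, not Catalyst: the tuple encoding ("and" / "or", left, right) and the function name flatten are assumptions made for this example only.

```python
def flatten(node):
    """Return the leaf conditions of a nested AND/OR tree, left to right.

    A node is either a leaf (a string) or a tuple (op, left, right)
    with op in {"and", "or"}. This encoding is hypothetical and only
    stands in for Catalyst's And / Or expression nodes.
    """
    if isinstance(node, tuple) and node[0] in ("and", "or"):
        _, left, right = node
        return flatten(left) + flatten(right)
    return [node]


# Tree for: col1=1 and (col2=2 or (col3 between 1 and 10
#           or col3 between 190 and 200)) and col4='xyz'
tree = (
    "and",
    ("and",
     "col1=1",
     ("or",
      "col2=2",
      ("or", "col3 between 1 and 10", "col3 between 190 and 200"))),
    "col4='xyz'",
)

if __name__ == "__main__":
    print(flatten(tree))
```

Note that a real plan represents between as an And of >= and <=, which is why the Scala version needs the extra (GreaterThanOrEqual, LessThanOrEqual) case before it recurses.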

SAS Sql Case statement - how to convert SAS data step into sql case

I am trying to rewrite a SAS data step in SAS SQL. I keep getting syntax errors for the versions I have written so far. The documentation and examples do not address the type of if/then I am working with.
Below is the original data step:
data tmp1;
set _tst1;
if put(tin, $gp.) = 'N' and
(missing(npi_num) or index(npi_num,"~") >= 1) then del1=1;
if put(tin, $gp.) = "Y" or
put(tin, $msp.) = "Y" or
put(cats(tin, npi_num), $pio.) =: "Y" or
put(cats(tin, npi_num), $cpc.) = 'Y' then del2=1;
if sum(num_elig, msr_yes, num_excl, msr_no) ^gt 0 then del3=1;
if sum(del1,del2,del3) > 0 then delete;
run;
Here is my attempt:
proc sql;
create table tst as
select *,
case
when
put(tin, $gp.) = 'N' and
(missing(npi_num) or index(npi_num,"~") >= 1)
then 1
else 0
end as del1
when
put(tin, $gp.) = "Y" or
put(tin, $msp.) = "Y" or
put(cats(tin, npi_num), $pio.) =: "Y" or
put(cats(tin, npi_num), $cpc.) = 'Y'
then 1
else 0
end as del2
when
sum(num_eligible, msr_met, num_exclusion, msr_not_met) ^gt 0
then 1
else 0
end as del3
from _tst1;
quit;

Printing odd prime every 100K primes found

I'm trying to write a program in Potion that prints every 100,000th odd prime number up to 10M. My code:
last = 3
res = (last) # create array
loop:
last += 2
prime = true
len = res length -1
i = 0
while(i<len):
v = res at(i)
if(v*v > last): break.
if(last%v == 0): prime = false, break.
i += 1
.
if(prime):
res append(last)
if(res size % 100000 == 0): last print.
if(last>9999999): break.
.
.
But this gives Segmentation fault (core dumped), I wonder what's wrong?
For reference, the working Ruby version:
res = [last=3]
loop do
last += 2
prime = true
res.each do |v|
break if v*v > last
if last % v == 0
prime = false
break
end
end
if prime
res << last
puts last if res.length % 100000 == 0
break if last > 9999999
end
end
The output should be:
1299721
2750161
4256249
5800139
7368791
8960467
And no, this is not homework; I'm just curious.

You found it out by yourself, great!
println is called say in Potion.
And it crashed in res size.
For example, use this for debugging:
rm config.inc
make DEBUG=1
bin/potion -d -Dvt example/100thoddprime.pn
and then press enter until you get to the crash.
(example/100thoddprime.pn:18): res append(last)
>
; (3, 5)
[95] getlocal 1 1 ; (3, 5)
[96] move 2 1 ; (3, 5)
[97] loadk 1 5 ; size
[98] bind 1 2 ; function size()
[99] loadpn 3 0 ; nil
[100] call 1 3Segmentation fault
so size on res returned nil, and this caused the crash.
And instead of last print, "\n" print.
Just do last say.
This came from perl6 syntax, sorry :)

My bad, I forgot to change res length -1 to res length when changing from the 0 to len (i) form, because that syntax is not recognized as a loop (it fails to receive break).
last = 3
res = (last)
loop:
last println
last += 2
prime = true
len = res length
i = 0
while(i<len):
v = res at(i)
if(v*v > last): break.
if(last%v == 0): prime = false, break.
i += 1
.
if(prime):
res append(last)
if(res length % 100000 == 0): last print, "\n" print.
if(last>9999999): break.
.
.
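For comparison, the same trial-division loop can be sketched in Python. This is a port of the logic shown in the Ruby and Potion versions, with two made-up parameters (k and limit) so that it can be checked quickly on small inputs; the function name every_kth_odd_prime is likewise an assumption.

```python
def every_kth_odd_prime(k, limit):
    """Return every k-th odd prime that is <= limit.

    Odd primes are counted starting from 3, and each candidate is
    tested only against stored primes p with p * p <= candidate.
    """
    primes = [3]       # odd primes found so far
    hits = []
    candidate = 3
    while True:
        candidate += 2
        if candidate > limit:
            break
        is_prime = True
        for p in primes:
            if p * p > candidate:
                break                  # no divisor can exist beyond sqrt(candidate)
            if candidate % p == 0:
                is_prime = False
                break
        if is_prime:
            primes.append(candidate)
            if len(primes) % k == 0:
                hits.append(candidate)
    return hits


if __name__ == "__main__":
    print(every_kth_odd_prime(10, 100))
```

With k=10 and limit=100 this yields 31 and 73, the 10th and 20th odd primes. Calling it with k=100000 and limit=9999999 corresponds to the original task, though in pure Python that run is slow.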

Stata - assign different variables depending on the value within a variable

Sorry that title is confusing. Hopefully it's clear below.
I'm using Stata and I'd like to assign the value 1 to a variable that depends on the value within a different variable. I have 20 order variables and also 20 corresponding variables. For example if order1 = 3, I'd like to assign variable3 = 1. Below is a snippet of what the final dataset would look like if I had only 3 of each variable.
Right now I'm doing this with two loops, but I have to wrap another loop around this that goes through it 9 more times, and I'm doing this for a couple hundred data files. I'd like to make it more efficient.
forvalues i = 1/20 {
forvalues j = 1/20 {
replace variable`j' = 1 if order`i'==`j'
}
}
Is it possible to use the value of order'i' to assign the variable[order`i'VALUE] directly? Then I can get rid of the j loop above. Something like this.
forvalues i = 1/20 {
replace variable[`order`i'value] = 1
}
Thanks for your help!
CLARIFICATION ADDED Feb 2nd.
I simplified my problem and the dataset too much: the suggested solutions work for what I presented, but they do not get at what I'm really attempting to do. Thank you three for your solutions, though. I was not clear enough in my post.
To clarify, my data doesn't have a one-to-one correspondence where each order# assigns a 1 to variable# if it's not missing. For example, in the first observation order1=3, so variable1 isn't supposed to get a 1; variable3 should get a 1. What I didn't include in my original post is that I'm actually checking other conditions before setting it equal to 1.
For more background, I'm counting up births of women by birth order (1st child, 2nd child, etc.) that occurred at different ages of mothers. So in the data, each row is a woman and each order# is the birth number (order1=3 means it's her third child). The corresponding variable#s are the counts (variable# means the woman has a child of birth order #). I mentioned in the post that I do this 9 times because I'm doing it for 5-year age groups (15-19, 20-24, etc.). So the first set of variable# would be counts of births by order when women were ages 15-19, the second set would be counts of births by order when women were 20-24, and so on. After this, I sum up the counts in different ways (by woman's education, geography, etc.).
So with the additional loop what I do is something more like this
forvalues k = 1/9{
forvalues i = 1/20 {
forvalues j = 1/20 {
replace variable`k'_`j' = 1 if order`i'==`j' & age`i'==`k' & birth_age`i'<36
}
}
}
Not sure if it's possible, but I wanted to simplify so I only need to cycle through each child once, without cycling through the birth orders and directly use the value in order# to assign a 1 to the correct variable. So if order1=3 and the woman had the child at the specific age group, assign variable[agegroup][3]=1; if order1=2, then variable[agegroup][2] should get a 1.
forvalues k=1/9{
forvalues i = 1/20 {
replace variable`k'_[`order`i'value] = 1 if age`i'==`k' & birth_age`i'<36
}
}

I would reshape twice. First reshape to long, then condition variable on !missing(order), then reshape back to wide.
* generate your data
clear
set obs 3
forvalues i = 1/3 {
generate order`i' = .
local k = (3 - `i' + 1)
forvalues j = 1/`k' {
replace order`i' = (`k' - `j' + 1) if (_n == `j')
}
}
list
*. list
*
* +--------------------------+
* | order1 order2 order3 |
* |--------------------------|
* 1. | 3 2 1 |
* 2. | 2 1 . |
* 3. | 1 . . |
* +--------------------------+
* I would reshape to long, then back to wide
generate id = _n
reshape long order, i(id)
generate variable = !missing(order)
reshape wide order variable, i(id) j(_j)
order order* variable*
drop id
list
*. list
*
* +-----------------------------------------------------------+
* | order1 order2 order3 variab~1 variab~2 variab~3 |
* |-----------------------------------------------------------|
* 1. | 3 2 1 1 1 1 |
* 2. | 2 1 . 1 1 0 |
* 3. | 1 . . 1 0 0 |
* +-----------------------------------------------------------+

Using a simple forvalues loop with generate and missing() is orders of magnitude faster than the other solutions proposed so far. For this problem you need only one loop to traverse the complete list of variables, not two as in the original post. Below is some code that shows both points.
*----------------- generate some data ----------------------
clear all
set more off
local numobs 60
set obs `numobs'
quietly {
forvalues i = 1/`numobs' {
generate order`i' = .
local k = (`numobs' - `i' + 1)
forvalues j = 1/`k' {
replace order`i' = (`k' - `j' + 1) if (_n == `j')
}
}
}
timer clear
*------------- method 1 (gen + missing()) ------------------
timer on 1
quietly {
forvalues i = 1/`numobs' {
generate variable`i' = !missing(order`i')
}
}
timer off 1
* ----------- method 2 (reshape + missing()) ---------------
drop variable*
timer on 2
quietly {
generate id = _n
reshape long order, i(id)
generate variable = !missing(order)
reshape wide order variable, i(id) j(_j)
}
timer off 2
*--------------- method 3 (egen, rowmax()) -----------------
drop variable*
timer on 3
quietly {
// loop over the order variables creating dummies
forvalues v=1/`numobs' {
tab order`v', gen(var`v'_)
}
// loop over the domain of the order variables
// (may need to change)
forvalues l=1/`numobs' {
egen variable`l' = rmax(var*_`l')
drop var*_`l'
}
}
timer off 3
*----------------- method 4 (original post) ----------------
drop variable*
timer on 4
quietly {
forvalues i = 1/`numobs' {
gen variable`i' = 0
forvalues j = 1/`numobs' {
replace variable`i' = 1 if order`i'==`j'
}
}
}
timer off 4
*-----------------------------------------------------------
timer list
The timed procedures give
. timer list
1: 0.00 / 1 = 0.0010
2: 0.30 / 1 = 0.3000
3: 0.34 / 1 = 0.3390
4: 0.07 / 1 = 0.0700
where timer 1 is the simple gen, timer 2 the reshape, timer 3 the egen, rowmax(), and timer 4 the original post.
The reason you need only one loop is that Stata executes a command for all observations in the dataset, from top (first observation) to bottom (last observation). For example, variable1 is generated according to whether order1 is missing or not; this is done for all observations of both variables, without an explicit loop.
I wonder if you actually need to do this. For future questions, if you have a further goal in mind, I think a good strategy is to mention it in your post.
Note: I've reused code from other posters' answers.
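To illustrate the "one loop over variables, each command applied to every observation" idea outside Stata, here is a small Python sketch. It is only an analogy: the dictionary-of-columns layout and the function name indicators are assumptions for this example, and None stands in for Stata's missing value.

```python
def indicators(orders):
    """For each orderN column, build variableN = 1 if not missing else 0.

    The only explicit loop is over the columns (variables); the
    expression on the right-hand side is applied to every row of that
    column at once, as generate does in Stata.
    """
    out = {}
    for name, column in orders.items():
        suffix = name[len("order"):]
        out["variable" + suffix] = [int(value is not None) for value in column]
    return out


# Same three-observation example as in the reshape answer
orders = {
    "order1": [3, 2, 1],
    "order2": [2, 1, None],
    "order3": [1, None, None],
}

if __name__ == "__main__":
    for name, column in indicators(orders).items():
        print(name, column)
```

Read by observation rather than by column, the result corresponds to the variab~1 to variab~3 columns in the reshape listing.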

Here's a simpler way to do it (that still requires 2 loops):
// loop over the order variables creating dummies
forvalues v=1/20 {
tab order`v', gen(var`v'_)
}
// loop over the domain of the order variables (may need to change)
forvalues l=1/3 {
egen variable`l' = rmax(var*_`l')
drop var*_`l'
}
