If I stop code from running halfway, does it impact the other half of the program? - sas

New to SAS and just wondering: say I have a program of 8 steps/procs, and towards the end there is a drop statement dropping a table created earlier. If I stop the code halfway, at, say, the 4th step/proc, would it impact the other steps at all? Especially the drop statement?
Thanks in advance!
I thought it wouldn't impact the final dataset I dropped, but I'm seeing a lot of nulls in that table, so I'm assuming it does? If anyone could confirm the above, it would be much appreciated.

Related

Google Sheets hours worked before and after specified time

I am trying to specify a time that starts a shift premium for a time card. I was so happy when I finally figured out how to get it, but then I changed a time to a first-shift time and that's when I lost all hope. I have spent hours messing with this, and my current formulas look like:
x1.05
=IF(COUNT(A20:B20)=2,MOD($B$18-A20,1)*24,"")
x1.1
=IF(COUNT(A20:B20)=2,MOD(B20-$B$18,1)*24,"")
My end goal is to have it so that first shift (7am-3pm) falls under base pay, second shift (4pm-12am) falls under x1.05, and anything past 12am falls under x1.1.
Just getting this far has been a mind blower for me and any help would be greatly appreciated.
I have moved this to its own sheet to share with everyone. I am still playing with things myself and trying different approaches, but on the sheet I have included the variables along with the wording of the contract that affects the pay scale. NOTE: the times I originally provided were just examples off the top of my head.
My Sheet
It doesn't look like your current sheet accounts for the multiple criteria you want to evaluate. First shift (x1) begins at 7:00; second shift (x1.05) begins at 15:00; and third shift (x1.10) begins at 24:00.
I added each of those directly above the columns they affect and used this formula for the Base column:
=IF(COUNT($A20:$B20)=2,IF($A20<F$18,(MOD(MIN(F$18,$B20)-$A20,1))*24,0),"")
...this for the x1.05 column:
=IF(COUNT($A20:$B20)=2,IF($A20<G$18,(MOD(MIN(G$18,$B20)-$A20,1)*24)-E20,0),"")
...and this for the x1.10 column:
=IF(COUNT($A20:$B20)=2,(MOD(MIN(E$18+1,$B20)-$A20,1)*24)-SUM(E20:F20),"")
So far, it's working as expected. One thing I didn't add is accounting for someone who starts their shift before 7 a.m. If you want those hours to be included in the x1.10 column, you could add a calculation for that to the formula there.
Here's what it looks like: [screenshot not included]
I'm working in Excel, but these formulas should all work in Google Sheets as well.

Suggestions for statistical model/approach to “Pattern recognition for non-uniform time data”

I have a dataset from which I would like to detect recurring patterns (e.g. daily, weekly, monthly). The dataset contains only a time stamp (datetime), and the spacing is non-uniform.
The observations in the data reflect the exact time when one particular person passes my window. He does this several times a day (on a single day he walks by my window approx. 10-30 times), and I am trying to see if there is any pattern (there might also be some seasonality, sudden changes in previous behavior, and other interesting stuff going on).
Does anyone have a suggestion for a statistical model/approach that might be helpful in figuring out if there is any pattern in this behavior? Hopefully, I’ll be able to predict when he will pass my window again ;)
How would you approach this?
Any help would really be appreciated.

SAS out of memory error

I'm getting a "The remote Process is out of memory" error in SAS DIS (Data Integration Studio).
Since it is possible that my approach is wrong, I'll explain the problem I'm working on and the solution I've decided on:
I have a large list of customer names which need cleaning. To achieve this, I use a .csv file containing regular expression patterns and their corresponding replacements. (I use this approach since it is easier to add new patterns to the file and upload it to the server for the deployed job to read from than to hardcode new rules and redeploy the job.)
To get my data step to make use of the rules in the file, I add the patterns and their replacements to an array in the first iteration of the data step, then apply them to my names. Something like:
DATA &_OUTPUT;
    ARRAY rule_nums{1:&NOBS} _TEMPORARY_;
    /* On the first iteration only, compile each pattern/replacement
       pair from the rules dataset into the temporary array. */
    IF (_n_ = 1) THEN
        DO i = 1 TO &NOBS;
            SET WORK.CLEANING_RULES;
            rule_nums{i} = PRXPARSE(CATS('s/', rule_string_match, '/', rule_string_replace, '/i'));
        END;
    SET WORK.CUST_NAMES;
    customer_name_clean = customer_name;
    /* Apply each compiled rule to the current name in turn. */
    DO i = 1 TO &NOBS;
        customer_name_clean = PRXCHANGE(rule_nums{i}, 1, customer_name_clean);
    END;
    DROP i;
RUN;
When I run this on around 10K rows or fewer, it always completes, and extremely quickly. If I try around 15K rows, it chokes for a very long time and eventually throws an "Out of memory" error.
To try to deal with this, I built a loop (using the SAS DIS loop transformation) in which I number the rows of my dataset first, then apply the preceding logic in batches of 10000 names at a time. After a very long time I got the same out-of-memory error, but when I checked my target table (Teradata) I noticed that it had run and loaded the data for all but the last iteration. When I switched the loop size from 10000 to 1000, I saw exactly the same behaviour.
For testing purposes I've been working with only around 500K rows, but I will soon have to handle millions and am worried about how this is going to work. For reference, the set of cleaning rules I'm applying is currently 20 rows, but it may grow to a few hundred.
Is it significantly less efficient to use a file with rules rather than hardcoding the regular expressions directly in my data step?
Is there any way to achieve this without having to loop?
Since my dataset gets overwritten on every loop iteration, how can there be an out-of-memory error for datasets that are only 1000 rows long (and about 3 columns wide)?
Ultimately, how do I solve this out of memory error?
Thanks!
The issue turned out to be that the log that the job was generating was too large. The possible solutions are to disable logging or to redirect the log to a location which can be periodically purged and/or has enough space.
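In Base SAS terms, a minimal sketch of the redirect option (the log path is just a placeholder):
/* Send the log to a file that can be periodically purged,
   then restore the default destination afterwards. */
PROC PRINTTO LOG='/path/to/logs/name_cleaning_job.log'; RUN;
/* ... the name-cleaning job runs here ... */
PROC PRINTTO; RUN;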

Sort error at end of PROC SQL for inner join

I ran the following code, and an hour later, just as it was finishing, a sort execute error occurred. Is there something wrong with my code, or are my computer's processor and RAM insufficient?
proc sql;
    create table today as
    select a.account_number, a.client_type, a.device, a.entry_date_est,
           a.entry_time_est, a.duration_seconds, a.channel_name, b.esn,
           b.service_start_date, b.service_end_date, b.product_name,
           b.billing_frequency_fee, b.plan_category, b.plan_subtype, b.plan_type
    from listen_nomiss a inner join service_nomiss b
        on (a.account_number = b.account_number)
    order by account_number;
quit;
That error is most commonly seen when you run out of utility space to perform the sort. A few suggestions for troubleshooting are available in this SAS KB post; the most useful suggestions:
options fullstimer msglevel=i; will give you a lot more information about what's going on behind the scenes, so you can troubleshoot what is causing the issue.
proc options option=utilloc; run; will tell you where the utility directory is in which your temporary files for the sort will be created. Verify that about three times the space needed for the final table is available there; due to how the sort is processed, sorting requires roughly triple the dataset's space.
options compress=yes; will save some (possibly a lot of) space if not already enabled.
proc options option=memsize; run; and proc options option=sortsize; run; will tell you how much memory is allocated to SAS, and at what size a sort is done in memory versus on disk. SORTSIZE should be about 1/3 of MEMSIZE (given the requirement of 3x space to process a sort). If your final table is just over SORTSIZE, you may be better off increasing SORTSIZE if the default is too low (same for MEMSIZE, though that one can only be set at startup).
You could also have some issues with permissions; some of the other suggestions in the KB article relate to verifying that you actually have permission to write to the utility directory, or that it exists at all.
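Put together, a minimal sketch of those diagnostics, to run before re-submitting the join:
/* Richer log output plus dataset compression. */
options fullstimer msglevel=i compress=yes;
/* Where sort utility files are written. */
proc options option=utilloc; run;
/* Memory allocated to SAS, and the in-memory sort threshold. */
proc options option=memsize; run;
proc options option=sortsize; run;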
I've had a project in the past where resources were an issue as well.
A couple of ways around it when sorting (a sketch of each follows below):
Don't forget that proc sort has a TAGSORT option, which makes it sort only the BY-statement variables first and attach everything else afterwards. Useful when many columns are not involved in the BY statement.
Indexes: if you build an index on exactly the variables in your BY statement, you can use a BY statement without sorting; SAS will rely on the index instead.
Split it up: you can split the dataset into multiple chunks and sort each chunk separately. Then you do a data step where you put them all in the SET statement. When you use a BY statement there as well, SAS will weave the records so that the result is also ordered according to the BY statement.
Note that these approaches have a performance hit (the third one perhaps to a lesser extent), and indexes can give you headaches if you don't take them into account later on (or destroy them intentionally).
One note if/when you rewrite the whole join as a SAS merge: keep in mind that a SAS merge does not by itself mimic many-to-many joins (it does one-to-one, one-to-many and many-to-one). Probably not the case here (it rarely is), but I mention it to be on the safe side.
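In code, a minimal sketch of the three approaches (the dataset names are assumptions based on the question):
/* 1. TAGSORT: sort only the BY variables plus record tags,
      then re-attach the remaining columns. */
proc sort data=work.today tagsort;
    by account_number;
run;

/* 2. Index: with an index on the BY variable, later steps can use
      BY account_number without a physical sort. */
proc datasets lib=work nolist;
    modify today;
    index create account_number;
quit;

/* 3. Split and interleave: sort each chunk separately, then a
      SET ... BY step weaves them back into one ordered dataset. */
data work.today_sorted;
    set work.chunk1_sorted work.chunk2_sorted;
    by account_number;
run;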

Simple query working for years, then suddenly very slow

I've had a query that has been running fine for about 2 years. The database table has about 50 million rows, and is growing slowly. This last week one of my queries went from returning almost instantly to taking hours to run.
Rank.objects.filter(site=Site.objects.get(profile__client=client, profile__is_active=False)).latest('id')
I have narrowed the slow query down to the Rank model. It seems to have something to do with using the latest() method. If I just ask for a queryset, it returns an empty queryset right away.
# count returns 0 and is fast
Rank.objects.filter(site=Site.objects.get(profile__client=client, profile__is_active=False)).count() == 0
# also very fast
Rank.objects.filter(site=Site.objects.get(profile__client=client, profile__is_active=False)) == []
Here are the results of running EXPLAIN. http://explain.depesz.com/s/wPh
And EXPLAIN ANALYZE: http://explain.depesz.com/s/ggi
I tried vacuuming the table; no change. There is already an index on the "site" field (a ForeignKey).
Strangely, if I run this same query for another client that already has Rank objects associated with her account, the query once again returns very quickly. So it seems that this is only a problem when there are no Rank objects for that client.
Any ideas?
Versions:
Postgres 9.1,
Django 1.4 svn trunk rev 17047
Well, you've not shown the actual SQL, so it's difficult to be sure. But the EXPLAIN output suggests the planner thinks the quickest way to find a match is to scan an index on "id" backwards until it finds the client in question.
Since you said it has been fast until recently, this is probably not a silly choice. However, there is always the chance that a particular client's record will be right at the far end of this search.
So - try two things first:
Run an analyze on the table in question, see if that gives the planner enough info.
If not, increase the stats (ALTER TABLE ... SET STATISTICS) on the columns in question and re-analyze. See if that does it.
http://www.postgresql.org/docs/9.1/static/planner-stats.html
If that's still not helping, then consider an index on (client, id) and drop the index on id (if it's not needed elsewhere). That should give you lightning-fast answers.
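In SQL terms, a minimal sketch of those steps (the table and column names are assumptions based on Django's default naming for the Rank model):
-- 1. Refresh the planner's statistics for the table.
ANALYZE app_rank;

-- 2. If that isn't enough, gather more detailed statistics on the
--    filtered column, then re-analyze.
ALTER TABLE app_rank ALTER COLUMN site_id SET STATISTICS 1000;
ANALYZE app_rank;

-- 3. Composite index, so the planner can jump straight to the newest
--    row for a given site instead of walking the id index backwards.
CREATE INDEX app_rank_site_id_id ON app_rank (site_id, id);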
latest() is normally used for date comparisons; maybe you should try ordering by id descending and then limiting to one.
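In ORM terms, a minimal sketch of that suggestion (the site variable is an assumption standing in for the Site lookup above):
# Order by id descending and take one row instead of latest('id');
# slicing keeps it a single LIMIT 1 query.
try:
    rank = Rank.objects.filter(site=site).order_by('-id')[0]
except IndexError:
    rank = None  # no Rank rows exist for this site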