Proc SQL in SAS and lag and lead

I'm struggling a bit with a problem I can't quite get my head around.
Let's say we have a few columns:
IP address, timestamp, SSN.
How would I go about finding occurrences where the same IP appears in several records, the timestamps fall within the same one-hour window (as an example of a window of time), and there are several different SSNs?
This could for example be used for received applications for whatever, where we get a lot of traffic from one location where the data given varies.
Might LAG or LEAD be a good way to go? I'm using SAS, but really only Proc SQL.
Thank you for the help!

There is some uncertainty in the "one hour window" description. It depends on your starting point: one hour from when?
Otherwise you could end up with a double cycle (sketched as a Proc SQL self-join below):
For every IP
For every timestamp
Check whether other timestamps exist for the same IP within one hour and with a different SSN
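For illustration, a rough Proc SQL version of that self-join idea; the table name applications and the variable names ip, ts and ssn are assumptions, and ts is assumed to be a SAS datetime value (measured in seconds):
proc sql;
    /* pairs of records that share an IP, lie within one hour of each other,
       and carry different SSNs */
    create table suspicious as
    select distinct a.ip, a.ts, a.ssn
    from applications as a
         inner join applications as b
         on  a.ip = b.ip
         and a.ssn ne b.ssn
         and abs(a.ts - b.ts) <= 3600;  /* 3600 seconds = 1 hour */
quit;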
A simpler solution might be to use the LAG function.
First, sort by IP and timestamp.
Second, use LAG to calculate a new column holding the time difference between each pair of consecutive rows, and flag it when the difference is less than one hour. Use this flag in the grouping of a subsequent query to identify distinct SSNs.
The problem with the latter solution is that a chain of flagged records can end up spanning more than one hour in total.
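A minimal data step sketch of that approach (again assuming a dataset applications with variables ip, ts and ssn, where ts is a datetime value):
proc sort data=applications out=applications_sorted;
    by ip ts;
run;

data flagged;
    set applications_sorted;
    by ip;
    prev_ts = lag(ts);                    /* timestamp of the previous row */
    if first.ip then prev_ts = .;         /* don't compare across different IPs */
    within_hour = (prev_ts ne . and ts - prev_ts <= 3600);  /* 1 = within one hour of the previous record */
run;

proc sql;
    /* among flagged records, IPs that show more than one distinct SSN */
    select ip, count(distinct ssn) as n_ssn
    from flagged
    where within_hour = 1
    group by ip
    having calculated n_ssn > 1;
quit;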


Power BI: Relative Time Under 5 Hours returns no data

I have a PBI Desktop dashboard I've created to pull machine data from a local SQL server. I'm using a relative date/time filter on one of the pages to drill down the data for a live feed; however, for anything under 5 hours of relative time, the data goes blank.
I use 4 log tables for the raw data, each having its own timestamp for each instance. Each is related using an ID table that contains other general information. In addition, time is related using a calculated table that creates a timeframe of all instances:
(screenshot: relationship model)
DateTable = DISTINCT(UNION(SUMMARIZE(LogFault, LogFault[Time]), SUMMARIZE(LogGood, LogGood[Time]), SUMMARIZE(LogReject, LogReject[Time]), SUMMARIZE(LogState, LogState[Time])))
(screenshots: the 5-hour relative time filter vs. the 4-hour relative time filter)
As you can see from the top right of the images, not even the times are pulled onto the page. Is there a limitation in PBI on the relative time function? That wouldn't make sense to me, given that there is a "minutes" option under relative time. Any feedback on this would be appreciated.
For those looking in the future: unfortunately Power BI Desktop, along with the service, appears to work only in the UTC time zone. So the relative date/time filter was filtering based on UTC, not my time zone (EST). To resolve this, I had to create a new calculated column next to my distinct timestamps to correct for the time zone. I then used the adjusted time for the relative time filtering, while the charts remained on the original timestamps.
UTC to EST time zone adjustment:
UTC_AdjustTZ = FORMAT(DateTable[Time]+TIME(4,0,0),"General Date")
(screenshot: chart after the fix was implemented)
Probably your filter on the Date Table doesn't reach the destination table. Normally a filter moves from the one side to the many side, then again from the one side to the many side along a chain of relationships; but
in your case, for example:
the filter goes from the Date Table to LogReject, and then it can't move on to RejectDefinitions because of the filter direction. You have two options here:
1) Change the model relationships: make LogReject the one side and RejectDefinitions the many side, if that is possible.
OR
2) Set the cross-filter direction to Both in the model.
You need to do this for all the remaining log tables (LogFault-FaultDefinitions, LogState-StateDefinitions).
I hope this solves your problem. Please check that your model is not ambiguous after making those changes.

How would one go about creating a due-by attribute in Redshift

I am currently trying to calculate due-by dates in a table by adding the SLA time to the time the request was created. From what I understand, the way to go about this is to create a table with the work days and hours and query that table to find the due date. However, Redshift does not allow one to declare variables. I was wondering how I would go about creating a work hour table in Redshift and, if that is not possible, how I would calculate the due date by other means. Thanks!
It appears that you would like to provide a timestamp and then calculate the timestamp that is 'n work hours later', most probably taking into account certain rules such as:
Weekdays: 9am-5pm
Weekends: No Hours
Holidays: Occasional weekdays with No Hours
This could be done by creating a scalar Python UDF (see Creating a scalar Python UDF - Amazon Redshift) that would be passed a 'start' timestamp and a number of hours, and would return the 'end' timestamp.
Please note that scalar UDFs cannot access tables or 'call outside' of Redshift, so the function would need to be self-contained.
There is code on the web that shows how to do the related calculation (see How to find the number of hours between two dates excluding weekends and certain holidays in Python? BusinessHours package - Stack Overflow). You would need to modify such code to add a given duration rather than compute one.
The alternate method of "creating a work hour table" would work well when trying to find the number of work hours between two timestamps, but would be a bit harder to use when trying to add work hours to a timestamp.
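For illustration only, here is a minimal sketch of such a UDF; the function name f_add_work_hours is made up, and it assumes a 9am-5pm, Monday-Friday schedule, whole hours, and no holiday handling:
CREATE OR REPLACE FUNCTION f_add_work_hours(start_ts timestamp, work_hours int)
RETURNS timestamp
STABLE
AS $$
    from datetime import timedelta

    # Assumed business rules: Mon-Fri (weekday() < 5), 09:00-17:00, no holidays.
    def in_work_time(t):
        return t.weekday() < 5 and 9 <= t.hour < 17

    ts = start_ts
    remaining = work_hours * 60          # count down in minutes
    while remaining > 0:
        ts = ts + timedelta(minutes=1)
        if in_work_time(ts):
            remaining -= 1
    return ts
$$ LANGUAGE plpythonu;
A production version would need holiday handling and a less brute-force loop, but it illustrates the self-contained approach described above.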

Filtering data by time of the day in SAS

I am a beginner in SAS and I have a data set of traffic incidents to analyse. I want to filter the data by time of day: all incidents before 18:00:00, or incidents between 9:00:00 and 18:00:00.
I have tried to find suitable code, but have not had any success. Could anybody help out with this? I'm using standard SAS, not Enterprise Guide.
Is it done with a WHERE statement? If so, how do I input the time?
I assume from your description that you have a data set with a time variable and want to subset it using a hard-coded time of day. For this, it's easiest to use a time literal with standard WHERE processing. A time literal is a time specified in quotes followed by the letter t.
For example, you can create something similar to the following, which will subset the times data set to only those observations where time is earlier than 18:00:
data times_before_6pm;
    set times;
    where time < '18:00't; /* restrict to times of day earlier than 6pm */
run;
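For the 9:00:00 to 18:00:00 window mentioned in the question, a BETWEEN condition in the WHERE statement works the same way (still assuming the variable is named time):
data times_9_to_6;
    set times;
    where time between '09:00't and '18:00't; /* keep times from 9am up to 6pm */
run;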
This assumes your times are time values and not datetime values. If they are datetime values, you'll need to extract the time portion with the TIMEPART() function, which you can do directly in the WHERE statement.
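For example, if the variable were instead a datetime value (named dt here purely for illustration), the equivalent subset would be:
data times_before_6pm;
    set times_dt;
    where timepart(dt) < '18:00't; /* compare only the time-of-day part of the datetime */
run;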
Hope this helps.

Lookup primary keys in multiple tables

The problem I'm solving has many simple solutions, but what I need is a way to reduce the time and memory needed for the process.
On the one side I have a base table with a few hundred IDs, and on the other side 40 monthly tables and counting.
Each of the monthly tables has between 500,000 and 1 million records, each with a unique ID. Each table has a few thousand variables, but I only need 10-20 of them.
I need to look up the monthly tables to find the latest table in which a particular ID from the base table occurs, and pull the variable values that I need.
The newest monthly table is recalculated every day, and many IDs from previous months may occur again, so I cannot just create an indexed dictionary (last.id and variables) once. I also can't afford to rebuild a dictionary based on all the tables every day.
(image: visual description of the tables)
I came up with some ideas but I need your help to find the most efficient concept:
Concatenate all monthly tables with the variables needed, sort ascending by ID and month, select last.id in a data step, then join or merge with the base table.
Problem: too much memory is needed to SET all the tables together.
Alternatively I used PROC APPEND in a loop; unfortunately that is not very time or memory efficient either.
Inner join the base table with each of the monthly tables separately, in a loop:
low memory use, but very time consuming.
Create a dictionary based on all months except the latest, and update it every day.
Problem: the dictionary table is large.
Now I'm looking for smart approaches to this kind of problem. Maybe hash objects... but how?
I would greatly appreciate it if you give me some feedback on this case.
Thank you!
If someone were to write some code to generate dummy data based on your specs, they might be able to provide a more specific answer to your question. But without sample data it's hard to know the best approach without trial and error.
Instead, I've paraphrased some of my old answers into a more comprehensive list of things you can check.
Below are some ways to boost performance (roughly in order of performance improvement, YMMV):
Index the fields in each table that you will be joining on or using in a where clause. Not all fields are good candidates for indexes so do a little research on how to determine this before indexing.
Reduce the number of rows as early in the process as possible (i.e. use a WHERE clause to get rid of anything you don't care about).
If the joins are still time consuming, consider replacing them with hash table lookups (see the sketch after this list).
Compression. When you build the datasets make sure you use the compress=yes option if you're not already. This will shrink the size of the table on disk resulting in less disk I/O (the slowest part of querying).
If the steps are IO intensive, consider using views rather than creating temporary tables.
Make sure you are using proc append to append datasets together to reduce IO (sounds like you are, just adding this for completeness). Append the smaller dataset to the larger dataset. Alternatively use a view to 'append' them without duplicating overhead.
Limit the columns you are processing by using a keep statement (reduces IO).
Check column lengths - make sure you're not using a field length of $255 to store something that only needs a length of $20 etc...
Use the SAS SPDE (Scalable Performance Data Engine). It allows you to partition your SAS datasets into multiple files and optionally spread them across different disks. Once your SAS datasets reach a certain size you can see performance improvements. I generally tend to use SPDE libnames any time a dataset grows > 10G. No additional SAS modules are required - this is enabled as part of Base SAS.
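To give an idea of what the hash table lookups mentioned above could look like, here is a minimal sketch; the dataset names (work.base_ids, work.monthly_all) and variable names (id, month, var1-var3) are assumptions, and monthly_all stands for the concatenated monthly data, already sorted by id and month (for example through a view):
data latest_per_id;
    if _n_ = 1 then do;
        /* load the few hundred base IDs into an in-memory hash table */
        declare hash base(dataset: "work.base_ids");
        base.defineKey("id");
        base.defineDone();
    end;
    set work.monthly_all(keep=id month var1 var2 var3);
    by id month;
    /* keep only IDs present in the base table, and only the newest month per ID */
    if base.check() = 0 and last.id;
run;
Whether this beats indexed joins depends on how much of the monthly data the keep= and any where= filtering remove, so it is worth benchmarking against the join approach on a single month first.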

Merging data with a dates table gives strange behaviour when a record field is added

I'm using Crystal Reports again after not touching it for about 8 years.
I'm having this situation...
I have one data table, and one table with just day numbers from 1 to 31.
Nothing is really linked between the two.
In my report I let the user select a reference date.
From that date I grab the maximum number of days in that month.
The report lists a row per day of that month, but there are no actual database fields in there: just the first two letters of the day name, the day number, and another formula-based field showing 'yes', 'no' or '' depending on a main record value.
So far so good.
In the group header I was adding fields from the main data table, which went fine until I added fields that rely on some CASE expressions in the query on the SQL server; Crystal Reports just reads the result out as one single record row with everything in it.
For some reason the report generation goes from 1-2 seconds to 30-40 once I add that field that just outputs 'X' or ''. (It represents things assigned to that user.)
Other reports where I'm using the same data still generate in 2 seconds.
To get this working right and to eliminate duplicate date records I'm stuck with 3 groups.
I think this isn't optimal and is the reason for the slowdown, although it wasn't there at the start.
So I was wondering:
Should I go for a subreport for the day listing?
Can I feed the subreport my date parameter?
Or is there some kind of scripted way to list a row x times without all the grouping requirements?
Synchro was right, the problem was in the actual query/view.
For some reason the view takes half a minute longer if you just add an ORDER BY on a specific field.
The "where id between 211 and 265 or id=67" condition has been moved from a joined view into the actual query.
Thanks for the hint, Synchro.