How to reduce time taken by SAS macro code to run

I need help with my monthly report SAS code below:
Firstly, the code takes too long to run even though the data is relatively small. Secondly, when it completes, I get a message that reads: "The contents of log is too large."
Please can you check what the issue is with my code?
Meaning of the macro variables:
&end_date. = the last day of the previous month, for instance 30-Apr-22.
&lastest_refrsh_dt. = the latest date the report was published.
Once the report is published, we update the config table with &end_date.
work.schedule_dt: a table that contains the update flags. If all flags are true, we proceed; if the update flags are false, we exit. On the sixth day of the month, if the flag is still false, an email that reads "data not available" is sent (sketched below).
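To make that scheduling rule concrete, here is a minimal sketch of the same decision logic. This is illustration only: the real job is SAS macro code (not shown in the post), and every name below (flags, run_report, send_email) is a placeholder.
from datetime import date

# Sketch of the scheduling rule described above; not the actual SAS macro code.
# flags: the update flags read from work.schedule_dt
# run_report / send_email: placeholder callables for the real actions
def decide(flags, today: date, run_report, send_email):
    if all(flags):
        run_report()  # every update flag is true: build and publish the report
    elif today.day >= 6:
        send_email("data not available")  # still false on the sixth day of the month
    # otherwise: exit and wait for the next scheduled run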

Normally, that message about the log is due to warnings in the log about type issues. From what you describe, it is typically caused by an issue with date interpretation.
There is nothing in this post to help beyond that. You need to open the log and find out what the message actually is; otherwise, it is speculation on our part.

Related

SAS code runs with 0 observations if called from %INCLUDE

I'd like to start by saying I'm no SAS wiz by any means.
I inherited SAS code from a team that no longer exists, written by people who no longer work here, so there is nobody around who is more familiar with how things work.
The structure of things is:
We have a SAS program that works as a scheduler, triggering a selection of smaller programs on a daily basis. It uses statements to check the time of day and, based on that, triggers programs stored on the server via an %include statement.
This has worked flawlessly for the past two years, but suddenly, as of yesterday, all the programs triggered by this scheduler are running with 0 observations.
If I manually open a program on the server (the same program that the scheduler triggers), it runs fine. If the scheduler triggers it, the log shows that the data set has 0 observations and then stops the step.
This happens for every step in a program, starting from the first one, which can be as simple as the step outlined below:
data drawdown;
    set server01.legacy_mapping_drawdown;
run;
If I run the above step manually, the log shows:
NOTE: The data set WORK.drawdown has 13643 observations and 107 variables.
If this is triggered by the %include statement, then the log reads:
NOTE: The data set WORK.drawdown has 0 observations and 107 variables.
WARNING: Data set WORK.drawdown was not replaced because this step was stopped.
I have no clue whatsoever as to why this would be happening.
The fact that this started happening on 02/02/2020 leads me to believe that the new year might have something to do with it.
The code in the scheduler hasn't been touched at all in a while, and the various programs are still being triggered; it's how they behave that changes depending on whether they are run manually or via the scheduler.
I know there is little to no technical detail here, but there isn't much to it really.
I would appreciate any ideas on this.
Thanks.

BigQueryIO - only first day table can be created, despite having CreateDisposition.CREATE_IF_NEEDED

I have a dataflow job processing data from pub/sub defined like this:
read from pub/sub -> process (my function) -> group into day windows -> write to BQ
I'm using Write.Method.FILE_LOADS because of bounded input.
My job works fine, processing lots of GBs of data, but it fails and retries forever when it gets to creating another table. The job is meant to run continuously and create day tables on its own; it does fine on the first few, but then gives me, indefinitely:
Processing stuck in step write-bq/BatchLoads/SinglePartitionWriteTables/ParMultiDo(WriteTables) for at least 05h30m00s without outputting or completing in state finish
Before this happens it also throws:
Load job <job_id> failed, will retry: {"errorResult":{"message":"Not found: Table <name_of_table> was not found in location US","reason":"notFound"}
It is indeed the right error, because this table doesn't exist. The problem is that the job should create it on its own, because of the defined option CreateDisposition.CREATE_IF_NEEDED.
The number of day tables that it creates correctly without a problem depends on the number of workers. It seems that when some worker creates one table, its CreateDisposition changes to CREATE_NEVER, causing the problem, but that's only my guess.
A similar problem was reported here, but without any definitive answer:
https://issues.apache.org/jira/browse/BEAM-3772?focusedCommentId=16387609&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16387609
The ProcessElement definition here seems to give some clues, but I cannot really say how it works with multiple workers: https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/WriteTables.java#L138
I use the 2.15.0 Apache Beam SDK.
I encountered the same issue, which is still not fixed in Beam 2.27.0 of January 2021. Therefore I had to develop a workaround: a custom PTransform which checks if the target table exists before the BigQueryIO stage. It uses the BigQuery Java client for this and a Guava cache, as well as a windowing strategy (fixed, check every 15s) to sustain heavy traffic of about 5000 elements per second. Here is the code: https://gist.github.com/matthieucham/85459eff5fdea8d115be520e2dd5ccc1
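The core of that workaround, verifying that the destination table exists (and creating it if it doesn't) before the BigQueryIO write stage, looks roughly like the sketch below. For illustration it uses the Python google-cloud-bigquery client rather than the Java client and Guava cache the answer describes, and the table ID and schema are placeholders.
from google.api_core.exceptions import NotFound
from google.cloud import bigquery

client = bigquery.Client()
verified = set()  # stands in for the Guava cache in the Java workaround

def ensure_table(table_id, schema):
    # table_id is a placeholder, e.g. "my-project.my_dataset.events_20210101"
    if table_id in verified:
        return
    try:
        client.get_table(table_id)  # raises NotFound when the table is missing
    except NotFound:
        # exists_ok guards against two workers racing to create the same table
        client.create_table(bigquery.Table(table_id, schema=schema), exists_ok=True)
    verified.add(table_id)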
There was a bug in the past that caused this error, but that particular one was fixed in commit https://github.com/apache/beam/commit/d6b4dcec5f297f5c1bd08f345f0e1e5c756775c2#diff-3f40fd931c8b8b972772724369cea310. Can you check whether the version of Beam you are running includes this commit?

What happens to my data after hitting the break key in Stata?

Suppose I had the following structure for a script called mycode.do in Stata
-some code to modify original data-
save new_data, replace
-some other code to perform calculations on new_data-
Now suppose I press the break button to stop Stata after it has saved new_data in the script. My understanding is that Stata will undo the changes made to the data if it is interrupted with the break button before it has finished. Following such an interruption, will Stata erase the new_data.dta file if it didn't exist initially (or revert it back to its original form if it already existed before mycode.do was executed)?
Stata documentation says, "After you click on Break, the state of the system is the same as if you had never issued the original command." However, it sounds as if you expect it to treat an entire do-file as a "command". I do not believe this is the case. I believe that once the save has completed, the file new_data has been replaced, and Stata is not able to revert the file to the version before the save.
The Stata Reference Manual also says, in the documentation for Stata release 13 ([R] 16.1.4, Error handling in do-files), "If you press Break while executing a do-file, Stata responds as though an error has occurred, stopping the do-file." Example 4 there discusses this further and seems to support my interpretation.
This seems to me to have interesting implications for Stata "commands" that are implemented as ado files.

How do I get the most recent report data?

I'm trying to build a tool that collects a few data points from a user usage report with
https://www.googleapis.com/admin/reports/v1/usage/{user}/all/dates/{yyyy-mm-dd}
Since the data is delayed, how do I get the most recent report? If I were to query today's date (2013-11-22), I would get something like:
Data for dates later than 2013-11-19 is not yet available. Please check back later
Is there a set number of days or hours before reports become available, or do I have to trial-and-error backwards until I get a successful response?
I believe there is a delay of about 48 hours for the reports as of right now. However, if Google is able to improve on that, you'll want your app to be able to take advantage of those improvements without any changes needed.
I suggest you make a first attempt using today's date. When that fails, parse the error response to grab the last date for which report data is available, and use that value. This way you're always making at most 2 attempts, and if Google improves the delay to 24 hours or even less, your app can take immediate advantage of that change.
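A minimal sketch of that two-attempt approach is below. It assumes you already have an OAuth 2.0 access token with the Reports API scope; the helper names and token handling are placeholders, and the error parsing simply matches the message format quoted in the question.
import re
import requests

def fetch_usage_report(user, day, access_token):
    # Endpoint from the question: .../usage/{user}/all/dates/{yyyy-mm-dd}
    url = f"https://www.googleapis.com/admin/reports/v1/usage/{user}/all/dates/{day}"
    return requests.get(url, headers={"Authorization": f"Bearer {access_token}"})

def latest_usage_report(user, today, access_token):
    # First attempt: today's date.
    resp = fetch_usage_report(user, today, access_token)
    if resp.ok:
        return resp.json()
    # Second attempt: pull the last available date out of the error message,
    # e.g. "Data for dates later than 2013-11-19 is not yet available. ..."
    match = re.search(r"later than (\d{4}-\d{2}-\d{2})", resp.text)
    if match is None:
        resp.raise_for_status()  # some other error: surface it
    resp = fetch_usage_report(user, match.group(1), access_token)
    resp.raise_for_status()
    return resp.json()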

Reliably get Latest Event Log Record with WQL

I have written an application which collects Windows logs from Linux, via the Zenoss wmi-client package.
It uses WQL to query the event log and parses the result. My problem is finding the latest entry in the log.
I stumbled across this, which tells me to use the NumberOfRecords column in a query such as this:
Select NumberOfRecords from Win32_NTEventLogFile Where LogFileName = 'Application'
and use the return value from that as the highest log record number.
My question is this: I have heard that the Windows event log is a circular buffer, that is, it overwrites its oldest entries with new ones as the log gets full. Will this have an impact on NumberOfRecords? If that happens, the "RecordNumber" property of the events will continue to increase, but the actual number of records in the event log wouldn't change (as for every entry written, one is dropped).
Can anyone shed some light on how this actually works (whether NumberOfRecords is the highest RecordNumber, or the actual number of events in the log), and perhaps suggest a solution?
Update
So we now know that NumberOfRecords won't work on its own, because the event log is a ring buffer. The MS solution is to get the oldest record number and add it to NumberOfRecords to get the actual latest record.
This is possible through the Win API, but I am calling remotely from Linux. Does anyone know how I might achieve this in my scenario?
Thanks
NumberOfRecords will not always be the max record number, because the log is circular and can be cleared; you may have 1 entry, but its record number is 1000.
The way you would do this using the Win API would be to get the oldest record number and add the number of records in the log to get the max record number. It doesn't look like Win32_NTEventLogFile has an oldest-record-number field to use.
Are you trying to get the latest record every time you query the log? You can use TimeGenerated when you query Win32_NTLogEvent to get everything generated after a given time. You can iterate that list to find your max record number.
You need the RecordNumber of the newest record, but there is no fast way to get it.
Generally, you have to:
SELECT RecordNumber FROM Win32_NTLogEvent WHERE LogFile='Application'
and find the max RecordNumber in the results. But this can take tens of seconds or minutes if the log file is big... it's very slow.
But!
You can get the number of records:
SELECT NumberOfRecords FROM Win32_NTEventlogFile WHERE LogfileName='Application'
This is very fast. Then reduce the selection to speed up the search for the newest record:
SELECT RecordNumber FROM Win32_NTLogEvent WHERE LogFile='Application' AND RecordNumber>='_number_of_records_'
The execution time of this is less than or equal to that of the general case.
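Putting the two queries together from a Linux caller, the logic looks roughly like the sketch below. Here run_wql is a hypothetical helper that sends a WQL query to the remote host (for example by shelling out to the wmic binary from the Zenoss wmi-client package) and returns the result rows as dictionaries; only the queries themselves come from the answer above.
def newest_record_number(run_wql):
    # Fast query: how many records are currently in the Application log?
    rows = run_wql("SELECT NumberOfRecords FROM Win32_NTEventlogFile "
                   "WHERE LogfileName='Application'")
    number_of_records = int(rows[0]["NumberOfRecords"])

    # Record numbers keep growing even as old entries are overwritten or the
    # log is cleared, so the newest RecordNumber is >= the current record
    # count; this narrowed query scans far fewer rows than the unfiltered one.
    rows = run_wql("SELECT RecordNumber FROM Win32_NTLogEvent "
                   "WHERE LogFile='Application' "
                   "AND RecordNumber>=%d" % number_of_records)
    return max(int(row["RecordNumber"]) for row in rows)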