Missing the obvious with inconsistently delimited data? - sas

I have built something in SAS to pull down Yahoo! Finance .csv data. The code now works fine and I have built some robust error handling into it. The problem I have had with the data, though, is that the .csv feed is unsupported and not clean.
The data is comma delimited, but some of the data also has commas in it. Some of the fields are in quotes and some are not. The length of the fields also varies wildly. A field like Market Capitalisation, for example, could run from a few million to hundreds of billions.
As a result, if you pass multiple stock metrics for multiple stocks through to the Yahoo! API at the same time, you will get rows of .csv data where each field is in a different place, is a different length and is inconsistently delimited.
I have tried multiple infile options that could handle some of these errors in isolation, but not all of them together. The only solution that works for me is to download a single stock metric for multiple stocks at the same time.
This gives me what I want, but it takes over an hour to run the data for the NASDAQ and the NYSE. Have I overlooked another method for handling this type of problem?
Thanks

This is the outline of a way to do what you are looking for. The full code would be too long to post here and is out of scope for this site.
Create a SAS program that takes a stock ticker from the SYSPARM automatic macro variable and downloads the data into a data set, named after the ticker, in a permanent library.
The SYSPARM macro variable is set to the value you pass on the command line when you call SAS:
sas.exe myprog.sas -sysparm XYZ
This would set &SYSPARM to resolve to XYZ.
Write a SAS program that merges all the ticker data sets together for further processing.
Create a program in a language like Perl or Python (or a shell script, etc.) that loops over a range of tickers and calls your SAS program, passing the ticker through SYSPARM.
Use a threading, forking, etc. package from that language to have several of these running at the same time (see the sketch below). You can probably go to some multiple of the CPU cores on your machine, as this processing will not be CPU intensive. Test values until you find one that works.
From that same language call your SAS program to merge the datasets.
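As a rough illustration of that driver in Python: it runs the per-ticker SAS program in parallel with a thread pool and then calls the merge program. The program names (myprog.sas, merge_tickers.sas), the ticker list, and the worker count are placeholders to adapt to your setup.

# Minimal driver sketch (assumed names: myprog.sas reads &SYSPARM,
# merge_tickers.sas combines the per-ticker data sets).
import subprocess
from concurrent.futures import ThreadPoolExecutor

tickers = ["AAPL", "MSFT", "GOOG"]   # replace with your full NASDAQ/NYSE list
max_workers = 8                      # tune this: the work is I/O-bound, not CPU-bound

def run_ticker(ticker):
    # one SAS session per ticker, passing the ticker via -sysparm
    subprocess.run(["sas.exe", "myprog.sas", "-sysparm", ticker], check=True)

with ThreadPoolExecutor(max_workers=max_workers) as pool:
    list(pool.map(run_ticker, tickers))

# when all per-ticker downloads have finished, merge the data sets
subprocess.run(["sas.exe", "merge_tickers.sas"], check=True)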


I need to programmatically identify all libraries and data files read by several hundred SAS program files. Can this be accomplished?

Scenario: We have a list of over 200 SAS files. We need to identify all SAS libraries and data sets used as inputs to these programs, and write out a table linking the SAS input data sets to the associated program files. We are not SAS programmers and are only now becoming familiar with the language. The intent is to redesign the logic of the SAS files to be more modular.
We are conducting this analysis statically - i.e., we are not running SAS; we are attempting to extract this information purely by interrogating the code in the program files themselves, and we do not have access to the data files.
Solution attempted: we have parsed the SAS programs to identify the inputs to SAS PROCs and SAS DATA steps, but there are several challenges. The approach we are using is as follows:
We have obtained a Python-based parser (https://github.com/benjamincorcoran/sasdocs) that extracts key information from SAS files. We have applied it to all 200+ files and extracted the parsed content into a text file. However, not all SAS syntax is supported; in particular, DataSet blocks are left as unparsed raw text, PROCs with a variable number and names of arguments may be missed, and some commands, like various constructs of "set" and "merge", are missed completely by the grammar implemented in the parser so far.
The parser correctly locates about 60% of the files, especially the libraries and files preceded by a "Set" statement. For reasons we do not understand, not all libraries/files preceded by a "SET" command are captured by this parser.
In addition to the "Set" statement, we have observed that SAS can also reference a library/file within a Merge statement or a Sort procedure, without a specific Set statement.
We are ignoring SAS files from within the 'work' library that are created during processing; we are only concerned with external input files.
Note that we are not running these programs, we only have access to the SAS Program file sources - hence we do not have access to a SAS log.
Questions:
Is there a more direct way to accomplish this goal? Does SAS understand what files it reads and writes, and is there a method of extracting a list of all libraries and files read by SAS for a given SAS program?
If there is no method of obtaining this information programmatically, what are all the ways that SAS can access or reference an external library/file, other than within a SET or MERGE statement or a SORT procedure?
SAS has a procedure that does this: PROC SCAPROC. If you do have access to SAS, this is by far the best solution. You would technically need to run SAS, but even if there are errors it might still work well enough - the fact that a dataset doesn't exist should be okay, unless your code is data driven.
If you're unable to run the code or run anything in SAS, you'd need to do something with text analysis.
The key things to look for, which would catch most of the possibilities, are the following (in sort of pseudo-regex form; a rough Python sketch of this scan follows the list):
data [lib.]dataset(could have parens but ignore them);
set( [lib.]dataset(ignore parens))* (could have multiple)
merge( [lib.]dataset(ignore parens))* (could have multiple)
update( [lib.]dataset(ignore parens))* (could have multiple)
modify( [lib.]dataset(ignore parens))* (could have multiple)
data=[lib.]dataset(ignore parens) - this is for most PROCs input, could have spaces around the equals sign
out=[lib.]dataset(ignore parens) - this is for most PROCs output, could have spaces around the equals sign
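As a rough sketch of that scan in Python (you already have a Python-based parser, so this assumes the same environment; the source directory and the patterns are simplifications and will produce some false positives, e.g. on SET statement options such as end=):

import re
from pathlib import Path

# lib.dataset or dataset (the optional libref prefix is kept)
DS = r"(?:[A-Za-z_]\w*\.)?[A-Za-z_]\w*"
PATTERNS = [
    re.compile(r"\bdata\s+(" + DS + r")", re.IGNORECASE),
    re.compile(r"\b(?:set|merge|update|modify)\s+((?:" + DS + r"\s*(?:\([^)]*\))?\s*)+)", re.IGNORECASE),
    re.compile(r"\b(?:data|out)\s*=\s*(" + DS + r")", re.IGNORECASE),
]

def datasets_in(program: Path) -> set:
    text = program.read_text(errors="ignore")
    found = set()
    for pattern in PATTERNS:
        for match in pattern.finditer(text):
            # drop dataset options, then split multi-dataset SET/MERGE lists
            cleaned = re.sub(r"\([^)]*\)", " ", match.group(1))
            found.update(name.lower() for name in re.findall(DS, cleaned))
    return found

for sas_file in Path("programs").glob("**/*.sas"):   # assumed source directory
    print(sas_file.name, sorted(datasets_in(sas_file)))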
To get beyond the "most" above, you'd want to analyze which PROCs are used. Each PROC can have its own input/output options; for example, proc surveyselect can use various datasets for different things, proc format uses CNTLIN and CNTLOUT, etc. You'd also have to check whether hash tables or other objects are used in the code, as those have their own syntax.
The other thing you could do, since you only care about external files, is identify the LIBNAME statements. Once you find them, you could simply look for libref.dataset references in the programs - that's how all of the datasets in external folders (libraries) will be referred to. This won't work, though, if you are using metadata-assigned libraries, unless there are few enough of them that you could list them all out (and you have access to SAS to find out the list).
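A minimal sketch of that approach (the file path is a placeholder; the pattern relies on SAS librefs being at most 8 characters):

import re
from pathlib import Path

code = Path("programs/example.sas").read_text(errors="ignore")  # placeholder path

# librefs declared in this program: LIBNAME <libref> ...
librefs = set(re.findall(r"\blibname\s+([A-Za-z_]\w{0,7})\b", code, re.IGNORECASE))

for libref in sorted(librefs):
    # every libref.dataset reference is a candidate external data set
    uses = re.findall(r"\b" + re.escape(libref) + r"\.([A-Za-z_]\w*)", code, re.IGNORECASE)
    print(libref, sorted({u.lower() for u in uses}))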
Ultimately, your 100% solution is to hire a SAS consultant to look at the code; without being able to run the code (and thus use SCAPROC), there's not really a perfect solution.

How can I aggregate Intel Amplifier batch results?

I'm solving a number of instances with my code and I need to find the worst hotspots, where "worst" means a hotspot across a wide range of instances. So for every instance I have collected hotspot analysis data in batch mode using amplxe-cl. Now I'd like to aggregate this data and analyze it together. Is there any way to do this with VTune?
Update:
This is not an MPI application. There are a number of different datasets (problems, instances, pick your term :-) that need to be processed by my application. Depending on the data in a single instance, the application can take very different turns while processing it, so running the application on different instances can result in different hotspots. The purpose of the aggregation, as @ArunJose_Intel guessed, is to find hotspots that are common to all runs, that are present in the processing of every kind of instance.
I can collect hotspot analysis for every instance easily using batch mode and I can inspect them individually, but I'd like to see an aggregate analysis.
Of course, I could just process them in one run one after the other, but that would take several weeks, while I can process them as individual problems in a few hours on a cluster of identical machines.
In VTune it is not possible to combine multiple GUI reports. You have an option to compare two different reports to see what has changed, but clearly this is not what you are looking for.
A workaround you could try is to create command-line reports from the VTune results you have already collected. These command-line reports can be in easily parsable formats like CSV. Once you have reports in these formats, you could write your own scripts/code to aggregate multiple of these CSV reports with whatever logic you wish.
Please find below some samples for creating command-line reports.
1) Generate a hotspots report from the r001hs result on Linux and save it to /home/test/MyReport.txt in text format:
vtune -report hotspots -result-dir r001hs -report-output /home/test/MyReport.txt
2) Generate a hotspots report in CSV format from the most recent result and save it in the current working directory. Use the format option with the csv argument and the csv-delimiter option to specify a delimiter, such as a comma:
vtune -R hotspots -report-output MyReport.csv -format csv -csv-delimiter comma
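Once you have one CSV report per instance, a small script can merge them. A rough Python sketch follows; the column names "Function" and "CPU Time" and the reports/ directory are assumptions, so check the header row of your own reports and adjust.

import csv
import glob
from collections import defaultdict

totals = defaultdict(float)

for path in glob.glob("reports/*.csv"):          # one hotspots CSV per instance
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            try:
                totals[row["Function"]] += float(row["CPU Time"])
            except (KeyError, ValueError):
                continue                         # skip rows without the expected columns

# aggregated "worst" hotspots across all instances, largest first
for func, cpu_time in sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:20]:
    print(f"{cpu_time:12.3f}  {func}")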
For more information
https://www.intel.com/content/www/us/en/develop/documentation/vtune-help/top/command-line-interface/generating-command-line-reports.html
https://www.intel.com/content/www/us/en/develop/documentation/vtune-help/top/command-line-interface/generating-command-line-reports/saving-and-formatting-reports.htm

How to use Apache Beam to process historic time series data?

I have an Apache Beam pipeline that processes multiple time series in real time. Deployed on GCP Dataflow, it combines multiple time series into windows and calculates the aggregates, etc.
I now need to perform the same operations over historic data (the same multiple time series) stretching all the way back to 2017. How can I achieve this using Apache Beam?
I understand that I need to use the windowing features of Apache Beam to calculate the aggregates, etc., but the pipeline should accept data from two years back onwards.
Effectively, I need the data as it would have been available had I deployed the same pipeline two years ago. This is needed for testing/model training purposes.
That sounds like a perfect use case for Beam's focus on event-time processing. You can run the pipeline against any legacy data and get correct results as long as the events have timestamps. Without additional context, I think you will need an explicit step in your pipeline to assign custom timestamps (from 2017) that you extract from the data. To do this you can probably use either:
context.outputWithTimestamp() in your DoFn;
WithTimestamps PTransform;
You might also need to configure the allowed timestamp skew if you run into timestamp ordering issues.
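If your pipeline happens to use the Python SDK, the equivalent of WithTimestamps is wrapping each element in a TimestampedValue. A minimal sketch, where the record shape and the parse_event_time() helper are assumptions standing in for your own data:

import apache_beam as beam
from apache_beam.transforms.window import FixedWindows, TimestampedValue

def parse_event_time(record):
    # extract the historic event time (Unix seconds, e.g. from 2017) from the record
    return float(record["event_time"])

with beam.Pipeline() as p:
    (
        p
        | "ReadHistoric" >> beam.Create([{"event_time": 1483228800.0, "value": 42.0}])
        | "AssignEventTime" >> beam.Map(lambda r: TimestampedValue(r["value"], parse_event_time(r)))
        | "Window" >> beam.WindowInto(FixedWindows(3600))   # hourly windows, as an example
        | "MeanPerWindow" >> beam.CombineGlobally(beam.combiners.MeanCombineFn()).without_defaults()
        | "Print" >> beam.Map(print)
    )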
See:
outputWithTimestamp example: https://github.com/apache/beam/blob/efcb20abd98da3b88579e0ace920c1c798fc959e/sdks/java/core/src/test/java/org/apache/beam/sdk/transforms/windowing/WindowingTest.java#L248
documentation for WithTimestamps: https://beam.apache.org/releases/javadoc/2.13.0/org/apache/beam/sdk/transforms/WithTimestamps.html#of-org.apache.beam.sdk.transforms.SerializableFunction-
similar question: Assigning to GenericRecord the timestamp from inner object
another question that may have helpful details: reading files and folders in order with apache beam

I need help in designing my C++ Console application

I have a task to complete.
There are 4000+ CSV files of two types, both related to each other.
The 2 types are:
1. Country2.csv
2. Security_Name.csv
Contents of Country2.csv:
Company Name;Security Name;;;;Final NOS;Final FFR
Contents of Security_Name.csv:
Date;Close Price;Volume
There are multiple countries and, for each country, multiple security files.
Now I need to READ them, do some CALCULATIONS, and then WRITE the output to other files.
READ
Read both the Country2.csv and Security.csv files and extract all the data from them.
For example :
Read France 2.csv, extract Security_Name, Final NOS, Final FFR
Then read Security.csv (the one matching the Security_Name) and extract Date, Close Price, Volume
Calculation
The calculations basically involve finding the median of the extracted values, which is quite simple.
For Example:
Monthly Median Traded Values
Daily Traded Value of a Security ... and so on
Write
Based on the month, I need to sort the output into two different files with the following formats:
If Month % 3 = 0
Save it as MONTH_NAME.csv in the following format:
Security name; 12-month indicator; 3-month indicator; FOT
Else
Save it as MONTH_NAME.csv in the following format:
Security Name; Monthly Median Traded Value Ratio; Number of days Volume > 0
My question is how do I design my application in such a way that it is maintainable and the flow of data throughout the execution is seamless?
So, first thing: based on the kind of data you are looking to generate, I would probably be looking at moving this data into a SQL database if at all possible. This is "one SQL query" kind of stuff, and it is far more maintainable than C++ that generates CSV files from CSV files.
Barring that, I would probably look at using datamash and/or Perl. On a Windows platform, you could do this through Cygwin or WSL. Probably less maintainable, but so much easier that it's not much of an issue.
That said, if you're looking for something moderately maintainable, C++ could work. The first thing I would do is design the input classes. It's data-centric, but it can work. It sounds like you could have a Country class, a Security class, and a SecurityClose class, or something along those lines. You can think about whether a Security should contain a collection of SecurityClose objects (data), or whether that data should just be "loose" and reference the Security it belongs to. Same with the Country->Security relationship.
Once you've decided how all that's going to look, you want something (likely a function) that can tokenize a CSV line, so that "1,2,3" gets turned into a vector<string> with the contents "1", "2", "3". Then each of your input classes should have a constructor or initializer that takes a vector<string> and populates itself. You might need to pass higher-level data along too, like the filename, if you want the security data to know which security it belongs to.
That's basically most of the battle there. Once you've pulled your data into sensibly organized classes, the rest should come more easily. And if you run into bumps, hopefully you can ask specific design or implementation questions from there.

Use UNO (the OpenOffice API) to open a spreadsheet *without* recalculation

I'm using pyuno to read an Excel spreadsheet (running on Linux). Many cells have formulas referring to add-ins that are, obviously, not available. However, the cell values are what I want.
But when I load and read the sheet, it seems those formulas are being evaluated and thus the values are being overwritten with errors.
I've tried several things, none of which have worked:
set flags AutomaticCalculation=False, MacroExecutionMode=NEVER_EXECUTE in the call to desktop.loadComponentFromURL
call document.enableAutomaticCalculation(False) on the loaded document
Any suggestions?
If the formulas don't matter, you might circumvent the problem by processing a copy of your spreadsheet in which only the values (not the formulas) are present.
To achieve this quickly, select the whole sheet content, copy, then paste special and keep only "value"; save to a new file (make sure you don't overwrite the original file, or every formula will be lost!). Your script should then be able to process this file.
This is an ugly solution, as there must be a way to do it programmatically.
Calc does not yet support using cached formula results after loading a document. LibreOffice Calc does now use cached results for .xls documents. The results are also stored in .ods files, but they are ignored while loading the document, and the formula result is re-evaluated by compiling and interpreting the saved formula.
There are some plans to add this for .ods and .xlsx too, but there are many .ods producers out there writing incorrect results into the file. So for now the only solution is to keep a second version of the document that saves only the results (or to implement this inside Calc).