Stata: saving scalars/globals/matrices to .dta file - workaround?

I compute predicted factor scores for many different measures as well as the measurement error for these factor scores, then delete the measures. I delete the measures because my data is quite large; I do not want all of the measures using RAM in the large data set where I run my analysis.
For my analysis, I regress on factors and other variables. I can correct the regression coefficients for measurement error by using the measurement error of these factor scores. However, I cannot figure out a convenient way to save the measurement error associated with each factor to a .dta file.
Why am I not running all of this in one Stata session, thus obviating the need to save the scalars/macros/matrices?
I work on a server and on my PC. The server has a lot of memory and processing power, but is very inconvenient to use. So I often break my work into two stages. First, I clean the data and reduce the variable count (in this case, computing factor scores from a large set of measures). Cleaning the data itself often takes a huge amount of memory when I use multiple reshape commands. Then I save the cleaned data to a .dta file and work on it on my PC. The cleaned data is small enough to run on my PC, and doesn't require the sort of manipulations that use an excess of RAM.
I have considered a few approaches.
1. Create a variable for the measurement error of each factor. While this can work, I don't like it for several reasons: A) It is a profligate use of memory. I need only one scalar per factor variable, but I am creating _N cells for this variable. While I may be able to make the data set fit in memory by judiciously dropping variables or using other workarounds, I want a better solution. B) It just seems conceptually wrong.
2. Create a variable that contains all the scalar values, and a second variable that contains the name for each of these scalars (i.e., the factor to which it belongs). I am having trouble making this work. How do I extract the value for each non-missing _n and put it into a Stata matrix or Mata matrix? Alternatively, how do I create a set of macros where the macro name comes from the variable containing the name and the macro value comes from the variable containing the scalar?
3. Somehow save the scalars/macros/Mata matrices/Stata matrices directly and load them after opening the .dta file. Apparently, Stata does not save scalars, macros, Mata matrices, or Stata matrices to a .dta file. So the most convenient and obvious solution doesn't exist in Stata. I have seen other people recommend putting the scalars into a Mata matrix, then loading the Mata matrix into memory and saving it as a .dta file. Then I could open this file, save it to a Mata matrix, then load the data I want to work on. All this seems needlessly complicated and I am hoping there is a better way.
I would love advice on a simpler method to save these scalars, or else a way to make one of the above approaches simpler.
This is very frustrating. While Stata is extremely powerful and easy to use for a variety of things, it also has these frustrating 'holes' where you can spend an entire day trying to make something work that you thought would be quite simple.
Stunningly, the easiest solution would be to simply copy the scalars to a spreadsheet and manually input them. This is not an automated solution, but I am realizing it would take me only maybe a quarter of an hour instead of the enormous time it is taking me to automate this.

To answer your second point:
Suppose the variable "factorname" contains the factor name and the variable "error" contains the measurement error. Then
forval i = 1/`=_N' {
    local factor_`=factorname[`i']' = error[`i']
}
will create _N locals containing the measurement errors.

Your long introduction doesn't give the reader an idea of the order of magnitude of the data you are dealing with. How many factors? The average number of measures for each of them? Observations? Also, you do not provide any code, which makes it quite hard to help you.
Anyway, I would avoid creating variables.
Doesn't estimates save whatever.dta work in your case?
If you really want to store your estimation results in macros to load in another Stata session, then no, you cannot directly save them in a .dta file.
However, you can still associate the macros with a dataset and retrieve them afterwards by defining characteristics, which are saved inside the .dta file.
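A minimal sketch of that approach (the factor names, error values, and file name below are made up): dataset characteristics are stored inside the .dta file itself, so they travel with the data.
* store each factor's measurement error as a dataset characteristic
char _dta[error_factor1] .042
char _dta[error_factor2] .038
save cleaned_data, replace

* later, in another Stata session
use cleaned_data, clear
display "`_dta[error_factor1]'"       // show the stored value
local err1 = `_dta[error_factor1]'    // put it back into a local macro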

Related

SAS Hash Tables: Is there a way to find/join on different keys or have optional keys

I frequently work with data for which the keys are not perfect, and I need to join data from a different source. I want to continue using hash objects for the speed advantage; however, when I am using a lot of data I can run into crashes (memory constraints).
A simplified overview: I have 2 different keys which are both unique but not present for every record; we will call them Key1 and Key2.
My current solution, which is not very elegant (but it works), is to do the following:
if _N_ = 1 then do;
    declare hash h1(Dataset:"DataSet1");
    h1.DefineKey("key1");
    h1.DefineData("Value");
    h1.DefineDone();

    declare hash h2(Dataset:"DataSet1");
    h2.DefineKey("key2");
    h2.DefineData("Value");
    h2.DefineDone();
end;

set DataSet2;

rc = h1.find();
if rc NE 0 then do;
    rc = h2.find();
end;
So I have exactly the same dataset in two hash tables, but with two different keys defined; if the first key is not found, then I try to find the second key.
Does anyone know of a way to make this more efficient/easier to read/less memory intensive?
Apologies if this seems a bad way to accomplish the task, I absolutely welcome criticism so I can learn!
Thanks in advance,
Adam.
I am a huge proponent of hash table lookups - they've helped me do some massive multi-hundred-million-row joins in minutes that otherwise could have taken hours.
The way you're doing it isn't a bad route. If you find yourself running low on memory, the first thing to identify is how much memory your hash table is actually using. This article by sasnrd shows exactly how to do this.
Once you've figured out how much it's using and have a benchmark, or if it doesn't even run at all because it runs out of memory, you can play around with some options to see how they improve your memory usage and performance.
1. Include only the keys and data you need
When loading your hash table, exclude any unnecessary variables. You can do this before loading the hash table, or during. You can use dataset options to help reduce table size, such as where, keep, and drop.
dcl hash h1(dataset: 'mydata(keep=key var1)');
2. Reduce the variable lengths
Long character variables take up more memory. Decreasing the length to their minimum required value will help reduce memory usage. Use the %squeeze() macro to automatically reduce all variables to their minimum required size before loading. You can find that macro here.
%squeeze(mydata, mydata_smaller);
3. Adjust the hashexp option
hashexp helps improve performance and reduce hash collisions. Larger values of hashexp will increase memory usage but may improve performance; smaller values will reduce memory usage. I recommend reading the link above, and also looking at the link at the top of this post by sasnrd, to get an idea of how it will affect your join. This value should be sized appropriately depending on the size of your table. There's no hard and fast answer as to what value you should use; my recommendation is as big as your system can handle.
dcl hash h1(dataset: 'mydata', hashexp:2);
4. Allocate more memory to your SAS session
If you often run out of memory with your hash tables, you may have too low of a memsize. Many machines have plenty of RAM nowadays, and SAS does a really great job of juggling multiple hard-hitting SAS sessions even on moderately equipped machines. Increasing this can make a huge difference, but you want to adjust this value as a last resort.
The default memsize option is 2GB. Try increasing it to 4GB, 8GB, 16GB, etc., but don't go overboard, like setting it to 0 to use as much memory as it wants. You don't want your SAS session to eat up all the memory on the machine if other users are also on it.
Temporarily setting it to 0 can be a helpful troubleshooting tool to see how much memory your hash object actually occupies if it's not running. But if it's your own machine and you're the only one using it, you can just go ham and set it to 0.
memsize can be adjusted at SAS invocation or within the SAS Configuration File directly (sasv9.cfg on 9.4, or SASV9_Option environment variable in Viya).
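For example (a sketch; the 8G figure is only an illustration), you might start SAS with sas -memsize 8G, or add the line -MEMSIZE 8G to sasv9.cfg. You can then confirm what the current session is actually using with:
proc options option=memsize;
run;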
I have a fairly similar problem that I approached slightly differently.
First: all of what Stu says is good to keep in mind, regardless of the issue.
If you are in a situation though where you can't really reduce the character variable size (remember, all numerics are 8 bytes in RAM no matter what the dataset size, so don't try to shrink them for this reason), you can approach it this way.
Build a hash table with key1 as key, key2 as data along with your actual data. Make sure that key1 is the "better" key - the one that is more fully populated. Rename Key2 to some other variable name, to make sure you don't overwrite your real key2.
Search on key1. If key1 is found, great! Move on.
If key1 is missing, then use a hiter object (hash iterator) to iterate over all of the records searching for your key2.
This is not very efficient if key2 is used a lot. Step 3 also might be better done in a different way than using a hiter - you could do a keyed set or something else for those records, for example. In my particular case, both the table and the lookup were missing key1, so it was possible to simply iterate over the much smaller subset missing key1 - if in your case that's not true, and your master table is fully populated for both keys, then this is going to be a lot slower.
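A minimal sketch of that key1-then-iterate approach (the output dataset name is made up; the other names follow the question):
data want;
    /* make the lookup variables known to the PDV; this never executes */
    if 0 then set DataSet1(rename=(key2=lk_key2));

    if _n_ = 1 then do;
        declare hash h1(dataset:"DataSet1(rename=(key2=lk_key2))");
        h1.defineKey("key1");
        h1.defineData("lk_key2", "Value");
        h1.defineDone();
        declare hiter hi("h1");
    end;

    set DataSet2;

    if h1.find() ne 0 then do;
        /* key1 lookup failed: walk the table looking for a matching key2 */
        rc = hi.first();
        do while (rc = 0);
            if lk_key2 = key2 then leave;
            rc = hi.next();
        end;
    end;
run;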
The other thing I'd consider is abandoning hash tables and using a keyed set, or a format, or something else that doesn't use RAM.
Or split your dataset:
data haskey1 nokey1;
set yourdata;
if missing(key1) then output nokey1;
else output haskey1;
run;
Then two data steps, one with a hash with key1 and one with a hash with key2, then combine the two back together.
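A minimal sketch of that split-and-recombine pattern, reusing the names above (the *_joined and combined dataset names are made up):
data haskey1_joined;
    if _n_ = 1 then do;
        declare hash h1(dataset:"DataSet1");
        h1.defineKey("key1");
        h1.defineData("Value");
        h1.defineDone();
    end;
    if 0 then set DataSet1;   /* defines Value in the PDV */
    set haskey1;
    rc = h1.find();
run;

data nokey1_joined;
    if _n_ = 1 then do;
        declare hash h2(dataset:"DataSet1");
        h2.defineKey("key2");
        h2.defineData("Value");
        h2.defineDone();
    end;
    if 0 then set DataSet1;
    set nokey1;
    rc = h2.find();
run;

data combined;
    set haskey1_joined nokey1_joined;
run;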
Which of these is the most efficient depends heavily on your dataset sizes (both master and lookup) and on the missingness of key1.

PDI aka Kettle: is it better to use "add constant" or a literal string in table input?

In a Kettle/PDI transformation, I need to write to a table the values from another table plus some static strings.
1. Table input: read records;
2. Add constants: add "status" = "A" (and other static strings);
3. Table output: write old values + status and other constants.
Is it better to add the literal in the table input "select" (select id,field1, 'A' as status from ...) or better to use an Add Constants step?
I suppose it's better to reduce the number of steps, because with "Add constants" you need to instantiate a new step.
EDIT: For "better" I mean faster and less memory consuming
My opinion is to do the minimum of transformation in the Table input step, because the philosophy of PDI is to make all of the transformation visible.
Now, if you're an expert in SQL, or have a legacy select of 200 lines with complex computations, my answer would be different.
Creating one more step in a transformation leads to a separate thread allocation, since every step runs as its own thread, as well as the allocation of at least one BlockingQueue, because rows are passed between steps in memory through these structures.
Using one more step, even one as simple as Add constants, will therefore cause additional resource allocation.
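For instance, a minimal sketch of putting the constant directly in the Table input query, so no extra step, thread, or row buffer is needed (source_table is a placeholder for the real input table):
-- 'A' is emitted as the status column directly from the query;
-- no Add constants step is required downstream.
SELECT id, field1, 'A' AS status
FROM source_table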
Fortunately, PDI is still open source.
If you are curious how this is done, this is the base transformation step implementation (and has been for a long time) -> https://github.com/pentaho/pentaho-kettle/blob/master/engine/src/main/java/org/pentaho/di/trans/step/BaseStep.java
This is an example of the code used to distribute rows between steps -> https://github.com/pentaho/pentaho-kettle/blob/master/core/src/main/java/org/pentaho/di/core/BlockingRowSet.java#L54
Sure, for a simple constant that could be added in the SQL query, PDI adds overhead. There are a lot of use cases where some operation could be made faster or less memory-consuming, but then what about the GUI and the other features PDI is actually famous for?

SAS outputting results to the input dataset (same in and out dataset name)

I could not find information about this problem, or could not specify the question correctly.
Let me ask the question with code:
Is this operation
data work.tmp;
set work.tmp;
* some changes to data here;
run;
or especially
proc sort data = work.tmp out = work.tmp;
by x;
run;
dangerous in any way, or considered a bad practice in SAS? Note the same input and output dataset names, which is my main point. Does SAS handle this situation correctly so there would be no ambiguous results with running this kind of data step/procedure?
The latter, sorting a dataset into itself, is done fairly frequently. Since sort is just re-arranging the dataset, it doesn't do any permanent harm (unless you are depending on the order being otherwise, or unless you use a where clause to filter the dataset, or rename/keep/drop options), so it's not considered bad practice, as long as tmp is in work (or a libname intended to be used as a working directory). SAS creates a temporary file to do the sort, and when it's successful it deletes the old file and renames the temporary file; there is no substantial risk of corruption.
The former, setting a dataset to itself in a data step, is usually not considered good practice. That's because a data step often does something irreversible - i.e., running it once has a different result than running it again. Thus, you risk not knowing what state your dataset is in; and while with sort you can usually rely on knowing, because most of the time you get an obvious error if it's not properly sorted, with the data step you might never know. As such, each data step should generally produce a new dataset (at least, new to that thread). There are times when it's necessary to do this, or at least would be substantially wasteful not to - perhaps a macro that sometimes does a long data step and sometimes doesn't - but usually you can program around it.
It's not dangerous in the sense that the file system will get confused, though; similar to sort, SAS will simply create a temporary file, fill the new dataset, then delete the old one and rename the temporary file.
(I leave aside mention of things like modify which must set a dataset to itself, as that has an obvious answer...)
Some examples of why this is not considered good practice. Say you're working interactively, and you have the following dataset named tmp:
data tmp;
set sashelp.class;
run;
If you were to run the below code twice, it would run fine the first time, but on the second run you would receive a warning as the variable age no longer exists on that dataset:
data tmp;
set tmp;
drop age;
run;
In this case, it's a pretty harmless example, and you are lucky enough that SAS simply gives a warning. Depending on what the data step was doing, though, it could just as easily have been something that generates an error, e.g.:
data tmp;
set tmp (rename=(age=blah));
run;
Or even worse, it may generate no ERROR or WARNING, and change the expected results like the below code:
data tmp;
set tmp;
weight = log(weight);
run;
Our intention is to apply a simple log transformation to the weight variable in preparation for modeling, but if we accidentally run the step a second time, we are calculating the log(log(weight)). No warnings or errors will be given and looking at the dataset it will not be immediately obvious that anything is wrong.
IMO, you are much better off creating iterative datasets, i.e. tmp1, tmp2, tmp3, and so on, for every process that updates the dataset in some way. Space is much cheaper than spending time debugging.

Memory Issues in C++

I am having run-time memory allocation errors with a C++ application. I have eliminated memory leaks, invalid pointer references and out-of-bounds vector assignments as the source of the issue - I am pretty sure it's something to do with memory fragmentation and I am hoping to get help with how to further diagnose and correct the problem.
My code is too large to post (about 15,000 lines - I know not huge but clearly too large to put online), so I am going to describe things with a few relevant snippets of code.
Basically, my program takes a bunch of string and numerical data sets as inputs (class objects with vector variables of type double, string, int and bool), performs a series of calculations, and then spits out the resulting numbers. I have tested and re-tested the calculations and outputs - everything is calculating as it should, and on smaller datasets things run perfectly.
However, when I scale things up, I start getting memory allocation errors, but I don't think I am even close to approaching the memory limits of my system - please see the two graphs below... my program cycles through a series of scenarios (performing identical calculations under a different set of parameters for each scenario) - in the first graph, I run 7 scenarios on a dataset of about 200 entries. As the graph shows, each "cycle" results in memory swinging up and back down to its baseline, and the overall memory usage is tiny (see the seven small blips on the right half of the bottom graph). On the second graph, I am now running a dataset of about 10,000 entries (see notes on the dataset below). In this case, I only get through 2 full cycles before getting my error (as it is trying to resize a class object for the third scenario). You can see the first two scenarios in the bottom right-half graph; a lot more memory usage than before, but still only a small fraction of available memory. And as with the smaller dataset, usage increases while my scenario runs, and then decreases back to its initial level before reaching the next scenario.
This pattern, along with other tests I have done, leads me to believe it's some sort of fragmentation problem. The error always occurs when I am attempting to resize a vector, although the particular resize operation that causes the error varies based on the dataset size. Can anyone help me understand what's going on here and how I might fix it? I can describe things in much greater detail but already felt like my post was getting long... please ask questions if you need to and I will respond/edit promptly.
Clarification on the data set
The numbers 200 and 10,000 represent the number of unique records I am analyzing. Each record contains somewhere between 75 and 200 elements/variables, many of which are then being manipulated. Further, each variable is manipulated over time and across multiple iterations (both dimensions variable). As a result, for an average "record" (the 200 to 10,000 referenced above), there could easily be as many as 200,000 values associated with it - a sample calculation:
1 Record * 75 Variables * 150 periods * 20 iterations = 225,000 unique values per record.
Offending Code (in this specific instance):
vector<LoanOverrides> LO;
LO.resize(NumOverrides + 1); // Error is occurring here. I am certain that NumOverrides is a valid numerical entry = 2985
// Sample class definition
class LoanOverrides {
public:
    string IntexDealName;
    string LoanID;
    string UniqueID;
    string PrepayRate;
    string PrepayUnits;
    double DefaultRate;
    string DefaultUnits;
    double SeverityRate;
    string SeverityUnits;
    double DefaultAdvP;
    double DefaultAdvI;
    double RecoveryLag;
    string RateModRate;
    string RateModUnits;
    string BalanceForgivenessRate;
    string BalanceForgivenessRateUnits;
    string ForbearanceRate;
    string ForbearanceRateUnits;
    double ForbearanceRecoveryRate;
    string ForbearanceRecoveryUnits;
    double BalloonExtension;
    double ExtendPctOfPrincipal;
    double CouponStepUp;
};
You have a 64-bit operating system capable of allocating large quantities of memory, but have built your application as a 32-bit application, which can only allocate a maximum of about 3GB of memory. You are trying to allocate more than that.
Try compiling as a 64-bit application. This may enable you to reach your goals. You may have to increase your pagefile size.
See if you can dispose of intermediate results earlier than you are currently doing so.
Try calculating how much memory is being used/would be used by your algorithm, and try reworking your algorithm to use less.
Try avoiding duplicating data by reworking your algorithm. I see you have a lot of reference data which, by the looks of it, isn't going to change during the application run. You could put all of that into a single vector, allocated once, and then refer to the entries via integer indexes everywhere else, rather than copying them (just guessing that you are copying them); see the sketch after this answer.
Try avoiding loading all the data at once by reworking your algorithm to work on batches.
Without knowing more about your application, it is impossible to offer better advice. But basically you are running out of memory because you are allocating a huge, huge amount of it, and based on your application and the snippets you have posted I think you can probably avoid doing so with a little thought. Good luck.
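A minimal sketch of that store-once, refer-by-index idea (this is not the poster's code; the pool class and the compact record below are made up, loosely echoing the LoanOverrides fields): repeated strings live once in a shared pool, and each record just holds small integer handles.
#include <cstddef>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

// A shared pool: each distinct string is stored once and addressed by index.
class StringPool {
public:
    std::size_t intern(const std::string& s) {
        auto it = index_.find(s);
        if (it != index_.end()) return it->second;   // already stored
        strings_.push_back(s);
        index_[s] = strings_.size() - 1;
        return strings_.size() - 1;
    }
    const std::string& get(std::size_t i) const { return strings_[i]; }

private:
    std::vector<std::string> strings_;
    std::unordered_map<std::string, std::size_t> index_;
};

// A slimmed-down record: unit strings are referenced by pool index, not copied.
struct LoanOverridesCompact {
    std::size_t prepayUnits;
    std::size_t defaultUnits;
    double defaultRate;
    double severityRate;
};

int main() {
    StringPool pool;
    std::vector<LoanOverridesCompact> overrides;
    // Thousands of records can share the same handful of unit strings.
    overrides.push_back({pool.intern("CPR"), pool.intern("CDR"), 0.06, 0.35});
    overrides.push_back({pool.intern("CPR"), pool.intern("CDR"), 0.08, 0.40});
    std::cout << pool.get(overrides[0].prepayUnits) << "\n";  // prints CPR
}
Besides cutting total memory, this replaces many small string allocations with a few larger ones, which may also help with the fragmentation the poster suspects.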

Building static (but complicated) lookup table using templates

I am currently in the process of optimizing a numerical analysis code. Within the code, there is a 200x150 element lookup table (currently a static std::vector<std::vector<double>>) that is constructed at the beginning of every run. The construction of the lookup table is actually quite complex: the values in the lookup table are constructed using an iterative secant method on a complicated set of equations. Currently, for a simulation, the construction of the lookup table is 20% of the run time (run times are on the order of 25 seconds; lookup table construction takes 5 seconds). While 5 seconds might not seem like a lot, when running our MC simulations, where we are running 50k+ simulations, it suddenly becomes a big chunk of time.
Along with some other ideas, one thing that has been floated is: can we construct this lookup table using templates at compile time? The table itself never changes. Hard-coding a large array isn't a maintainable solution (the equations that go into generating the table are constantly being tweaked), but it seems that if the table can be generated at compile time, it would give us the best of both worlds (easily maintainable, no overhead during runtime).
So, I propose the following (much simplified) scenario. Let's say you wanted to generate a static array (use whatever container suits you best: 2D C array, vector of vectors, etc.) at compile time. You have a function defined:
double f(int row, int col);
where the return value is the entry in the table, row is the lookup table row, and col is the lookup table column. Is it possible to generate this static array at compile time using templates, and how?
Usually the best solution is code generation. There you have all the freedom and you can be sure that the output is actually a double[][].
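For instance, a minimal sketch of such a generator (the header name lookup_table.h, the array name kLookupTable, and the stub body of f are made up; in practice the generator would link against the existing implementation of f), run as a pre-build step so the simulation just #includes the result:
#include <cstdio>

// Stand-in for the real table-entry function; the actual secant-method
// implementation from the existing code would be linked in instead.
double f(int row, int col) { return row * 0.01 + col * 0.001; }

int main() {
    std::FILE* out = std::fopen("lookup_table.h", "w");
    if (!out) return 1;
    std::fprintf(out, "static const double kLookupTable[200][150] = {\n");
    for (int r = 0; r < 200; ++r) {
        std::fprintf(out, "  {");
        for (int c = 0; c < 150; ++c)
            std::fprintf(out, "%.17g%s", f(r, c), c + 1 < 150 ? ", " : "");
        std::fprintf(out, "},\n");
    }
    std::fprintf(out, "};\n");
    std::fclose(out);
    return 0;
}
Because the table is regenerated whenever this program is rebuilt and rerun, the equations stay editable in ordinary source code, and the construction cost disappears from the 50k+ simulation runs.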
Save the table on disk the first time the program is run, and only regenerate it if it is missing; otherwise load it from the cache.
Include a version string in the file so it is regenerated when the code changes.
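A minimal sketch of that caching approach (the file layout, names, and the buildTable stub below are made up):
#include <fstream>
#include <string>
#include <vector>

const std::string kTableVersion = "v1";  // bump this whenever the equations change
constexpr int kRows = 200;
constexpr int kCols = 150;

// Stand-in for the expensive iterative secant-method construction.
std::vector<std::vector<double>> buildTable() {
    return std::vector<std::vector<double>>(kRows, std::vector<double>(kCols, 0.0));
}

std::vector<std::vector<double>> loadOrBuildTable(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    std::string version;
    if (in && std::getline(in, version) && version == kTableVersion) {
        // Cache hit with a matching version: read the raw doubles back in.
        std::vector<std::vector<double>> table(kRows, std::vector<double>(kCols));
        for (auto& row : table)
            in.read(reinterpret_cast<char*>(row.data()), kCols * sizeof(double));
        if (in) return table;
    }
    // Cache missing or stale: rebuild the table and rewrite the file.
    auto table = buildTable();
    std::ofstream out(path, std::ios::binary);
    out << kTableVersion << '\n';
    for (const auto& row : table)
        out.write(reinterpret_cast<const char*>(row.data()), kCols * sizeof(double));
    return table;
}

int main() {
    auto table = loadOrBuildTable("lookup_table.cache");
    return table.empty() ? 1 : 0;
}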
A couple of things here.
What you want to do is almost certainly at least partially possible.
Floating point values are invalid template arguments (they just are, don't ask why). Although you can represent rational numbers in templates using an N1/N2 representation, the math you can do on them does not encompass everything you would want: root(n), for instance, is unavailable (consider root(2), which is irrational). Unless you want a bajillion instantiations of static double variables, you'll want your value accessor to be a function. (Maybe you can come up with a new template floating-point representation that splits the exponent and mantissa, though, and then you're as well off as with the double type... have fun :P)
Metaprogramming code is hard to format in a legible way. Furthermore, by its very nature, it's rather tough to read. Even an expert is going to have a tough time analyzing a piece of TMP code they didn't write even when it's rather simple.
If an intern or anyone under senior level even THINKS about just looking at TMP code their head explodes. Although, sometimes senior devs blow up louder because they're freaking out at new stuff (making your boss feel incompetent can have serious repercussions even though it shouldn't).
All of that said...templates are a Turing-complete language. You can do "anything" with them...and by anything we mean anything that doesn't require some sort of external ability like system access (because you just can't make the compiler spawn new threads for example). You can build your table. The question you'll need to then answer is whether you actually want to.
Why not have separate programs? One that generates the table and stores it in a file, and one that loads the file and runs the simulation on it. That way, when you need to tweak the equations that generate the table, you only need to recompile that program.
If your table were a bunch of ints, then yes, you could. Maybe. But what you certainly couldn't do is generate doubles at compile time.
More importantly, I think that a plain double[][] would be better than a vector of vectors here - you're pushing a LOT of dynamic allocation for a statically sized table.