How to vectorise in SAS/IML? - sas-iml

Suppose I have
proc iml;
x = {1,2,3};
I am interested in computing CDF('F', 1, 2, ...) for every element of x. Of course, I can write a loop, but I would like to know the smart way of doing it, i.e., how do I vectorise it?
I tried to google but could not find anything, any takers?

When SAS/IML calls Base SAS functions, you can usually pass a vector in place of the scalar arguments. In your example, it's as easy as
c = cdf("F", x, 1, 2);
which evaluates the F(1, 2) CDF for every element of x in one call. For more information, you can read about how to call Base SAS functions from SAS/IML.
If the need arises, you can also pass vectors for MULTIPLE arguments, provided that the vectors are all the same size.

Related

Write multidimensional vectors (tensors) of scalars to file in C++

The description of the object I have
I have several N-dimensional containers in my code, representing tensors, whose types are defined as
std::vector<std::vector<std::vector<...<double>...>>>
These types of data structures occur in several different sizes and dimensions, and they only contain scalar numbers. The number of dimensions is known for every vector and can be accessed as e.g. tensor::dimension. Since they represent tensors, they are never "irregular": at the bottom level, vectors always contain the same number of elements, like this:
// THIS IS WHAT THEY ALWAYS LOOK LIKE
T tensor = {{1,2,3,4}, {1,2,3,4}, {1,2,3,4}}
// THIS IS WHAT NEVER HAPPENS
T tensor = {{1,2,3}, {1,2,3,4}, {1,2}}
What I want to do with this object
I want to save each of these multidimensional vectors (tensors, basically) into different files, which I can then easily load/read e.g. in Python into a numpy.array for further analysis and visualization. How can I save any of these N-dimensional std::vectors in modern C++ without explicitly defining a basic write-to-txt function with N nested loops for each vector of different dimensionality?
(Note: Solutions/advice that require/mention only standard libraries are preferred, but I'm happy to hear any other answers too!)
The only way to iterate over something in C++ is a loop, in some shape or form, so no matter what, you're going to have loops. There are no workarounds or alternatives, but that doesn't mean you actually have to write all these loops yourself, one at a time. This is why we have templates in C++. What you are looking for is a recursive template that peels away one dimension at a time, until the last one, which gets implemented for real, basically letting your compiler write every loop for you. Mission accomplished. Start with a simplistic example of writing out a plain vector:
void write_vec(const std::vector<double> &v)
{
for (const auto &value : v)
std::cout << value << std::endl;
}
The actual details of how you want to save each value, and to which files, are irrelevant here; you can adjust the above code to work in whichever way you see fit. The point is that you want to make it work for an arbitrary number of dimensions. Simply add a template with the same name, then let overload resolution do all the work for you:
template<typename T>
void write_vec(const std::vector<std::vector<T>> &v)
{
for (const auto &value : v)
write_vec(value);
}
Now, a call to write_vec(anything), where anything is any N-"deep" vector that bottoms out in a std::vector<double>, will walk its way downhill on its own and write out every double.

How to use arrays in machine learning classes?

I'm new to C++ and I think a good way for me to jump in is to build some basic models that I've built in other languages. I want to start with just Linear Regression solved using first order methods. So here's how I want things to be organized (in pseudocode).
class LinearRegression
LinearRegression:
tol = <a supplied tolerance or defaulted to 1e-5>
max_ite = <a supplied max iter or default to 1k>
fit(X, y):
// model learns weights specific to this data set
_gradient(X, y):
// compute the gradient
score(X,y):
// model uses weights learned from fit to compute accuracy of
// y_predicted to actual y
My question is: when I use the fit, score, and gradient methods, I don't actually need to pass the arrays (X and y) around or even store them anywhere, so I want to use a reference or a pointer to those structures. My problem is that if a method accepts a pointer to a 2D array, I need to supply the second dimension's size ahead of time, or use templates. If I use templates, I now have something like this for every method that accepts a 2D array:
template<std::size_t rows, std::size_t cols>
void fit(double (&X)[rows][cols], double (&y)[rows]){...}
It seems like there is likely a better way. I want my regression class to work with input of any size. How is this done in industry? I know that in some situations the array is just flattened into row- or column-major format, where just a pointer to the first element is passed, but I don't have enough experience to know what people actually use in C++.
You raised quite a few points in your question, so here are some points addressing them:
Contemporary C++ discourages working directly with heap-allocated data that you need to manually allocate or deallocate. You can use, e.g., std::vector<double> to represent vectors, and std::vector<std::vector<double>> to represent matrices. Even better would be to use a matrix class, preferably one that is already in mainstream use.
Once you use such a class, you can easily get the dimension at runtime. With std::vector, for example, you can use the size() method. Other classes have other methods. Check the documentation for the one you choose.
You probably really don't want to use templates for the dimensions.
a. If you do so, you will need to recompile each time you get a different input. Your code will be duplicated (by the compiler) to the number of different dimensions you simultaneously use. Lots of bad stuff, with little gain (in this case). There's no real drawback to getting the dimension at runtime from the class.
b. Templates (in your setting) are fitting for the type of the matrix (e.g., is it a matrix of doubles or floats), or possibly the number of dimensions (e.g., for specifying tensors).
Your regressor doesn't need to store the matrix and/or vector. Pass them by const reference. Your interface looks like that of sklearn; if you like, check the source code there. Calling fit just causes the class object to store the parameter corresponding to the prediction vector β. It doesn't copy or store the input matrix and/or vector.
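To make the "pass by const reference, store only the learned weights" point concrete, here is a hedged sketch of such a class using std::vector. The gradient-descent details (fixed learning rate, stopping rule) are illustrative assumptions on my part, not sklearn's actual implementation:

```cpp
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>; // rows are observations
using Vector = std::vector<double>;

class LinearRegression {
public:
    explicit LinearRegression(double tol = 1e-5, std::size_t max_iter = 1000)
        : tol_(tol), max_iter_(max_iter) {}

    // X and y are taken by const reference: nothing is copied or stored
    // except the learned weight vector.
    void fit(const Matrix &X, const Vector &y) {
        const std::size_t p = X[0].size();
        weights_.assign(p, 0.0);
        const double lr = 0.01; // fixed learning rate (illustrative)
        for (std::size_t it = 0; it < max_iter_; ++it) {
            const Vector g = gradient(X, y);
            double step2 = 0.0; // squared norm of this update
            for (std::size_t j = 0; j < p; ++j) {
                weights_[j] -= lr * g[j];
                step2 += (lr * g[j]) * (lr * g[j]);
            }
            if (step2 < tol_ * tol_) break; // update smaller than tol: done
        }
    }

    double predict(const Vector &x) const {
        double s = 0.0;
        for (std::size_t j = 0; j < x.size(); ++j) s += weights_[j] * x[j];
        return s;
    }

private:
    // Gradient of the mean squared error (1/n) * ||X w - y||^2.
    Vector gradient(const Matrix &X, const Vector &y) const {
        const std::size_t n = X.size(), p = X[0].size();
        Vector g(p, 0.0);
        for (std::size_t i = 0; i < n; ++i) {
            const double r = predict(X[i]) - y[i];
            for (std::size_t j = 0; j < p; ++j)
                g[j] += 2.0 * r * X[i][j] / n;
        }
        return g;
    }

    double tol_;
    std::size_t max_iter_;
    Vector weights_;
};
```

The dimensions come from the vectors themselves at runtime (X.size(), X[0].size()), so no template size parameters are needed anywhere.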

standard and efficient map between objects

I am working on a clustering problem where I have something called a distance matrix. This distance matrix is something like:
the number of nodes (g) is N (dynamic)
this matrix is symmetric (dist[i,j] == dist[j,i])
g1, g2, ... are objects (they contain strings, integers, and maybe even more...)
I want to be able to reach any value in a simple way, like dist[4][3], or even more clearly as dist(g1,g5) (here g1 and g5 may be some kind of pointer or reference)
many std algorithms will be applied to this distance matrix, like min, max, accumulate, etc.
preferably, but not mandatorily, I would like to avoid Boost or other 3rd-party libraries
What is the best standard way to declare this matrix?
You can create a two-dimensional vector like so:
std::vector<std::vector<float> > table(N, std::vector<float>(N));
Don't forget to initialize it like this: it reserves memory for N members up front, so it does not need to reallocate everything when you add more, and it does not fragment the memory.
You can access its members like so:
table[1][2] = 2.01;
It does not invoke copy constructors all the time, because the vector's index operator returns a reference to the member,
so it is pretty efficient if N does not need to change.
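If N is fixed, another option in the same spirit is a thin wrapper over a single flat std::vector<float>. The class name and lower-triangle layout below are illustrative, but the wrapper gives the dist(i,j) == dist(j,i) symmetry from the question for free:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Thin symmetric-matrix wrapper over one contiguous std::vector<float>.
// Storing only the lower triangle halves the memory and guarantees
// dist(i, j) == dist(j, i) by construction.
class DistMatrix {
public:
    explicit DistMatrix(std::size_t n) : n_(n), data_(n * (n + 1) / 2, 0.0f) {}

    float &operator()(std::size_t i, std::size_t j) { return data_[index(i, j)]; }
    float operator()(std::size_t i, std::size_t j) const { return data_[index(i, j)]; }

    std::size_t size() const { return n_; }

    // Expose the flat storage so std algorithms (min_element, accumulate,
    // ...) can run over every stored distance exactly once.
    std::vector<float> &raw() { return data_; }

private:
    // Map (i, j) into the lower-triangle layout, where i >= j.
    std::size_t index(std::size_t i, std::size_t j) const {
        if (i < j) std::swap(i, j);
        return i * (i + 1) / 2 + j;
    }

    std::size_t n_;
    std::vector<float> data_;
};
```

The raw() vector can be handed directly to std::min_element or std::accumulate, and each distance is visited exactly once rather than twice.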

Smart way to implement lookup in high-perf FORTRAN code

I'm writing a simulation in FORTRAN 77, in which I need to get a parameter value from some experimental data. The data comes from an internet database, so I have downloaded it in advance, but there is no simple mathematical model that can provide values on a continuous scale; I only have discrete data points. However, I will need to know this parameter for any value on the x axis, not only for the discrete ones I have from the database.
To simplify, you could say that I know the value of f(x) for all integer values of x, and need a way to find f(x) for any real x value (never outside the smallest or largest x I have knowledge of).
My idea was to take the data and make a linear interpolation, to be able to fetch a parameter value; in pseudo-code:
double xd = largest_data_x_lower_than(x)
double slope = (f(xd+dx)-f(xd))/dx // dx is the distance between two x values
double xtra = x-xd
double fofx = f(xd)+slope*xtra
To implement this, I need some kind of lookup for the data points. I could make the lookup for xd easy by getting values from the database for all integer x, so that xd = int(x) and dx = 1, but I still have no idea how to implement the lookup for f(xd).
What would be a good way to implement this?
The value will be fetched something like 10^7 to 10^9 times during one simulation run, so performance is critical. In other words, reading from IO each time I need a value for f(xd) is not an option.
I currently have the data points in a text file with one pair of (tab-delimited) x,f(x) on each line, so bonus points for a solution that also provides a smooth way of getting the data from there into whatever shape it needs to be.
You say that you have the values for all integers. Do you have pairs i, f(i) for all integers i from M to N? Then read the values f(i) into an array y dimensioned M:N, unless the number of values is HUGE. For real values between M and N, it is easy to index into the array and interpolate between the nearest pair of values.
And why use FORTRAN 77? Fortran 90/95/2003 have been with us for some years now...
EDIT: Answering the question in the comment, re how to read the data values only once, in FORTRAN 77, without having to pass them as an argument in a long chain of calls. Technique 1: on program startup, read them into an array that lives in a named common block. Technique 2: the first time the function that returns f(x) is called, read the values into a local variable that is also in a SAVE statement. Use a SAVEd logical to record whether or not the function is on its first call. Generally I'd prefer technique 2 as being more "local", but it's not thread-safe. If you are doing the simulation in parallel, the first technique can be done in a startup phase, before the program goes multi-threaded.
Here is an example of the use of SAVE: fortran SAVE statement. (It is in Fortran 95 notation; convert to FORTRAN 77.) Put the read of the data into the array inside the IF block.
You probably want a way to interpolate or fit your data, but you need to be more specific about, say, the dimensionality of your data, how your data behave, the pattern in which you access them (for example, maybe your next request is always near the last one), how the grid is made (evenly spaced, random, or some other fashion), and where you need the data, in order to know which method is best for you.
However, if the existing data set is very dense and near linear then you can certainly do a linear interpolation.
Using your database (file), you could create an array fvals with fvals(ii) being the function f(xmin + (ii-1) * dx). The mapping between an x-value xx and your array index is ii = floor((xx - xmin) / dx) + 1. Once you know ii, you can use the points around it for interpolation: either linear interpolation using ii and ii+1, or some higher-order polynomial interpolation. For the latter, you could use the corresponding polint routine from Numerical Recipes. See page 103.
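The index-and-interpolate scheme both answers describe is only a few lines. Here it is sketched in C++ rather than FORTRAN 77 for brevity; the names fvals, xmin, and dx match the answer above, and it assumes x stays inside the tabulated range, as the question guarantees:

```cpp
#include <cstddef>
#include <vector>

// Linear interpolation over an evenly spaced table: fvals[i] holds
// f(xmin + i * dx). The table is read from the data file once, up front;
// every lookup after that is O(1), which matters at 1e7-1e9 calls.
double interp(const std::vector<double> &fvals, double xmin, double dx, double x)
{
    // Index of the grid point at or below x (the "xd" of the pseudo-code).
    std::size_t i = static_cast<std::size_t>((x - xmin) / dx);
    if (i + 1 >= fvals.size())      // x at (or numerically past) the last
        i = fvals.size() - 2;       // point: use the final interval
    const double slope = (fvals[i + 1] - fvals[i]) / dx;
    const double xtra  = x - (xmin + i * dx);
    return fvals[i] + slope * xtra;
}
```

The same logic translates line for line into a FORTRAN function whose data array sits in a common block or behind a SAVE, as discussed above.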

C++ efficiently extracting subsets from vector of user defined structure

Let me preface this with the statement that most of my background has been with functional programming languages so I'm fairly novice with C++.
Anyhow, the problem I'm working on is that I'm parsing a csv file with multiple variable types. A sample line from the data looks as such:
"2011-04-14 16:00:00, X, 1314.52, P, 812.1, 812"
"2011-04-14 16:01:00, X, 1316.32, P, 813.2, 813.1"
"2011-04-14 16:02:00, X, 1315.23, C, 811.2, 811.1"
So what I've done is define a struct which stores each line. Each of these is then stored in a std::vector<mystruct>. Now say I want to subset this vector by column 4 into two vectors, where every element with a P goes in one and every element with a C in the other.
Now the example I gave is fairly simplified, but the actual problem involves subsetting multiple times.
My initial naive implementation was to iterate through the entire vector, creating individual subsets defined by new vectors, then subsetting those newly created vectors. Something a bit more memory-efficient might be to create an index, which would then be shrunk down.
Now my question is: is there a more efficient way (in terms of speed/memory usage) to do this, either within this std::vector<mystruct> framework, or is there some better data structure to handle this type of thing?
Thanks!
EDIT:
Basically the output I'd like is first two lines and last line separately. Another thing worth noting, is that typically the dataset is not ordered like the example, so the Cs and Ps are not grouped together.
I've used std::partition for this. It's part of the standard library, not Boost.
If you want a data structure that allows you to move elements between different instances cheaply, the one you are looking for is std::list<> and its splice() family of functions.
I understand that you don't have trouble doing this per se, but you seem to be concerned about memory usage and performance.
Depending on the size of your struct and the number of entries in the csv file, it may be advisable to use a smart pointer if you don't need to modify the partitioned data, so that the mystruct objects are not copied:
typedef std::vector<boost::shared_ptr<mystruct> > table_t;
table_t cvs_data;
If you use std::partition (as another poster suggested), you need to define a predicate that takes the indirection of the shared_ptr into account.
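A minimal sketch of the std::partition approach mentioned above, with the struct fields modeled on the sample lines from the question (field names are illustrative). std::stable_partition is used so the time ordering inside each group survives:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// One parsed csv line; the fields mirror the sample data above.
struct Row {
    std::string timestamp;
    char        symbol;  // column 2 ("X")
    double      price1;  // column 3
    char        flag;    // column 4: 'P' or 'C'
    double      price2;  // column 5
    double      price3;  // column 6
};

// Reorders rows in place so all 'P' rows come first and 'C' rows after,
// and returns the boundary. stable_partition keeps the relative (time)
// order within each group; plain std::partition is faster if that
// ordering doesn't matter.
std::vector<Row>::iterator partition_by_flag(std::vector<Row> &rows)
{
    return std::stable_partition(rows.begin(), rows.end(),
                                 [](const Row &r) { return r.flag == 'P'; });
}
```

After the call, [rows.begin(), mid) holds the 'P' subset and [mid, rows.end()) the 'C' subset, with no second vector allocated; further subsetting can partition each half-range again.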