I tried to use opt_einsum to generate a contraction path for a Fortran implementation, and I came across the expression TDOT:
https://optimized-einsum.readthedocs.io/en/stable/greedy_path.html?highlight=tdot
scaling   BLAS   current        remaining
4         TDOT   tfp,fr->tpr    tpr->tpr
I cannot find it at http://www.netlib.org/lapack/explore-html/index.html or on any other BLAS page after some searching :(
What is it in BLAS?
This is correct; TDOT is not part of BLAS. If we look at this particular expression, we see that the data needs to be reorganized before a BLAS call can happen, as the "zip" index f is not on the left- or right-hand side of each tensor.
TDOT tfp,fr->tpr
---
tmp_tpf = tfp -> tpf # Reorganize data
tmp_tpf,fr -> tpr # Standard BLAS call with `f` as the zip index
There are some libraries, such as TBLIS, which expand upon BLAS functionality and allow non-contiguous expressions by transposing data as blocks are moved from RAM to cache, for extremely high performance. TDOT is explicitly called out in the opt_einsum docs since it is generally not a good thing: the memory copy before the contraction can often be the bottleneck!
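To make that concrete, here is a hedged C++ sketch of what a TDOT step amounts to; the function name, the dimensions T, F, P, R, and the row-major layout are all assumptions for illustration, and a real implementation would replace the second loop nest with a single GEMM call:

#include <cstddef>
#include <vector>

// Hypothetical sketch of a TDOT step for tfp,fr->tpr, with made-up
// dimensions T, F, P, R and row-major storage.
void tdot(const std::vector<double>& t_tfp, const std::vector<double>& f_fr,
          std::vector<double>& out_tpr, int T, int F, int P, int R) {
    // Step 1: reorganize tfp -> tpf so the zip index f is innermost.
    // This is the memory copy that makes TDOT expensive.
    std::vector<double> tmp_tpf(static_cast<std::size_t>(T) * P * F);
    for (int t = 0; t < T; ++t)
        for (int f = 0; f < F; ++f)
            for (int p = 0; p < P; ++p)
                tmp_tpf[(t * P + p) * F + f] = t_tfp[(t * F + f) * P + p];

    // Step 2: view tmp_tpf as a (T*P) x F matrix and contract over f
    // with the F x R matrix; in practice this loop nest is one dgemm call.
    for (int tp = 0; tp < T * P; ++tp)
        for (int r = 0; r < R; ++r) {
            double acc = 0.0;
            for (int f = 0; f < F; ++f)
                acc += tmp_tpf[tp * F + f] * f_fr[f * R + r];
            out_tpr[static_cast<std::size_t>(tp) * R + r] = acc;
        }
}

The first loop nest is exactly the memory copy the docs warn about; for large tensors it can dominate the GEMM itself.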
Quick note: I'm the author of opt_einsum. I would love a PR if you get the chance!
I'm trying to use Eigen::CholmodSupernodalLLT for Cholesky decomposition; however, it seems that I cannot get matrixL() and matrixU(). How can I extract matrixL() and matrixU() from Eigen::CholmodSupernodalLLT for future use?
A partial answer to integrate what others have said.
Consider Y ~ MultivariateNormal(0, A). One may want to (1) evaluate the (log-)likelihood (a multivariate normal density), and (2) sample from such a density.
For (1), it is necessary to solve Ax = b where A is symmetric positive-definite, and to compute its log-determinant. (2) requires L such that A = L * L.transpose(), since Y ~ MultivariateNormal(0, A) can be generated as Y = L u where u ~ MultivariateNormal(0, I).
A Cholesky LLT or LDLT decomposition is useful because chol(A) can be used for both purposes. Solving Ax = b is easy given the decomposition, and the (log-)determinant can be easily derived from the (sum of the log-)components of D or of the diagonal of L. By definition, L can then be used for sampling.
So, in Eigen one can use:
Eigen::SimplicialLDLT solver(A) (or Eigen::SimplicialLLT): then solver.solve(b) solves the system, and the determinant can be computed as the product of the entries of solver.vectorD(). Useful because if A is a covariance matrix, then solver can be used for likelihood evaluations, and matrixL() for sampling.
Eigen::CholmodDecomposition does not give access to matrixL() or vectorD(), but exposes .logDeterminant() to achieve goal (1) but not (2).
Eigen::PardisoLDLT does not give access to matrixL() or vectorD() and does not expose a way to get the determinant.
In some applications, step (2), sampling, can be done at a later stage, so Eigen::CholmodDecomposition is enough. At least in my configuration, Eigen::CholmodDecomposition works 2 to 5 times faster than Eigen::SimplicialLDLT (I guess because of the permutations done under the hood to facilitate parallelization).
Example: in Bayesian spatial Gaussian process regression, the spatial random effects can be integrated out and do not need to be sampled, so MCMC can proceed swiftly with Eigen::CholmodDecomposition to achieve convergence for the unknown parameters. The spatial random effects can then be recovered in parallel using Eigen::SimplicialLDLT. Typically this is only a small part of the computations, but having matrixL() directly from CholmodDecomposition would simplify it a bit.
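To make goals (1) and (2) concrete, here is a minimal hedged sketch with Eigen::SimplicialLDLT; the toy matrix, right-hand side, and seed are made up, and the sampling step follows Eigen's factorization A = P^-1 L D L^T P:

#include <Eigen/Dense>
#include <Eigen/Sparse>
#include <iostream>
#include <random>
#include <vector>

int main() {
    // Toy sparse SPD matrix A and right-hand side b (made up).
    std::vector<Eigen::Triplet<double>> trip =
        {{0, 0, 4.0}, {1, 1, 6.0}, {2, 2, 5.0}, {0, 1, -1.0}, {1, 0, -1.0}};
    Eigen::SparseMatrix<double> A(3, 3);
    A.setFromTriplets(trip.begin(), trip.end());
    Eigen::VectorXd b = Eigen::VectorXd::Ones(3);

    Eigen::SimplicialLDLT<Eigen::SparseMatrix<double>> solver(A);

    // (1) Solve Ax = b and compute the log-determinant from D.
    Eigen::VectorXd x = solver.solve(b);
    double logdet = solver.vectorD().array().log().sum();

    // (2) Sample y ~ N(0, A): with A = P^-1 L D L^T P, a draw is
    // y = P^-1 L D^(1/2) u where u ~ N(0, I).
    std::mt19937 gen(42);
    std::normal_distribution<double> n01(0.0, 1.0);
    Eigen::VectorXd u(3);
    for (int i = 0; i < 3; ++i) u(i) = n01(gen);

    Eigen::SparseMatrix<double> L = solver.matrixL();  // extract the factor
    Eigen::VectorXd w = (solver.vectorD().array().sqrt() * u.array()).matrix();
    Eigen::VectorXd y = solver.permutationPinv() * (L * w);

    std::cout << "x = " << x.transpose() << ", log|A| = " << logdet
              << ", y = " << y.transpose() << "\n";
    return 0;
}

Note that with LDLT the fill-reducing permutation must be undone when sampling; dropping permutationPinv() would give draws whose covariance is a permuted version of A.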
You cannot do this using the given class. The class you are referencing is an equation solver (which indeed uses a Cholesky decomposition internally). To decompose your matrix you should instead use Eigen::LLT. Code example from their website:
#include <Eigen/Dense>
using namespace Eigen;

MatrixXd A(3,3);
A << 4,-1,2, -1,6,0, 2,0,5;
LLT<MatrixXd> lltOfA(A);
MatrixXd L = lltOfA.matrixL();
MatrixXd U = lltOfA.matrixU();
As reported elsewhere, it cannot be done easily.
I am copying a possible recommendation (answered by Gael Guennebaud himself), even if somewhat old:
If you really need access to the factor to do your own cooking, then
better use the built-in SimplicialL{D}LT<> class. Extracting the
factors from the supernodal internal representations of Cholmod/Pardiso
is indeed not straightforward and very rarely needed. We have to
check, but if Cholmod/Pardiso provide routines to manipulate the
factors, like applying it to a vector, then we could let
matrix{L,U}() return a pseudo expression wrapping these routines.
Developing code for extracting this is likely beyond SO, and probably a topic for a feature request.
Of course, the solution with LLT is at hand (but not the topic of the OP).
I have been trying to write a simple program to perform an FFT on a 1D input array using FFTW3. Here I am using a seismogram as an input. The output array, however, comes out containing only zeroes.
I know that the input is correct, as I have tried doing the FFT of the same input file in MATLAB, which gives correct results. There is no compilation error. I am using f95 to compile this; however, gfortran was also giving pretty much the same results. Here is the code that I wrote:
program fft
   use functions
   implicit none
   include 'fftw3.f90'
   integer nl,row,col
   double precision, allocatable :: data(:,:),time(:),amplitude(:)
   double complex, allocatable :: out(:)
   integer*8 plan

   open(1,file='test-seismogram.xy')
   nl=nlines(1,'test-seismogram.xy')

   allocate(data(nl,2))
   allocate(time(nl))
   allocate(amplitude(nl))
   allocate(out(nl/2+1))

   do row = 1,nl
      read(1,*,end=101) data(row,1),data(row,2)
      amplitude(row)=data(row,2)
   end do
101 close(1)

   call dfftw_plan_dft_r2c_1d(plan,nl,amplitude,out,FFTW_R2HC,FFTW_PATIENT)
   call dfftw_execute_dft_r2c(plan, amplitude, out)
   call dfftw_destroy_plan(plan)

   do row=1,(nl/2+1)
      print *,out(row)
   end do

   deallocate(data)
   deallocate(amplitude)
   deallocate(time)
   deallocate(out)
end program fft
The nlines() function is used to calculate the number of lines in a file, and it works correctly. It is defined in the module called functions.
This program pretty much tries to follow the example at http://www.fftw.org/fftw3_doc/Fortran-Examples.html
There might just be a very simple logical error that I am making, but I am seriously unable to figure out what is going wrong here. Any pointers would be very helpful.
This is pretty much what the whole output looks like:
.
.
.
(0.0000000000000000,0.0000000000000000)
(0.0000000000000000,0.0000000000000000)
(0.0000000000000000,0.0000000000000000)
(0.0000000000000000,0.0000000000000000)
(0.0000000000000000,0.0000000000000000)
.
.
.
My question is directly about FFTW, and there is a tag for fftw on SO, so I hope this question is not off-topic.
As explained in the comments, first by @roygvib and then by @Ross, the planning subroutines overwrite the input arrays because they try the transform many times with different parameters. I will add some practical considerations.
You claim you do care about performance. Then there are two possibilities:
You do the transform only once, as you show in your code. Then there is no point in using FFTW_MEASURE: the planning subroutine is many times slower than the actual plan-execution subroutine. Use FFTW_ESTIMATE and it will be much faster.
FFTW_MEASURE tells FFTW to find an optimized plan by actually
computing several FFTs and measuring their execution time. Depending
on your machine, this can take some time (often a few seconds).
FFTW_MEASURE is the default planning option.
FFTW_ESTIMATE specifies that, instead of actual measurements of
different algorithms, a simple heuristic is used to pick a (probably
sub-optimal) plan quickly. With this flag, the input/output arrays are
not overwritten during planning.
http://www.fftw.org/fftw3_doc/Planner-Flags.html
You do the same transform many times for different data. Then you must do the planning only once, before the first transform, and then re-use the plan. Just make the plan first, and only then fill the array with the first input data. Making the plan before every transform would make the program extremely slow.
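Here is a hedged sketch of both points using FFTW's C API from C++ (the planner semantics are identical to the Fortran interface; the length and fill values are made up):

#include <fftw3.h>

int main() {
    const int n = 1024;
    double* in = fftw_alloc_real(n);
    fftw_complex* out = fftw_alloc_complex(n / 2 + 1);

    // Plan BEFORE filling the input: with FFTW_MEASURE (or FFTW_PATIENT)
    // the contents of `in` are overwritten during planning.
    fftw_plan plan = fftw_plan_dft_r2c_1d(n, in, out, FFTW_MEASURE);

    for (int block = 0; block < 100; ++block) {
        for (int i = 0; i < n; ++i)
            in[i] = static_cast<double>(i % 7);  // stand-in for real data
        fftw_execute(plan);  // re-use the same plan for every block
    }

    fftw_destroy_plan(plan);
    fftw_free(in);
    fftw_free(out);
    return 0;
}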
I have been looking at an engineering paper which describes an old FORTRAN code for solving pipe flow equations (it's dated 1974, before FORTRAN was standardised as FORTRAN 77). On page 42 of this document, the old code calls the following subroutine:
C SYSTEM SUBROUTINE FROM UNIVAC MATH-PACK TO
C SOLVE LINEAR SYSTEM OF EQ.
CALL GJR(A,51,50,NP,NPP,$98,JC,V)
It's a bit of a long shot, but do any veterans or ancient-code buffs recall this system subroutine and its input arguments? I'm having trouble finding any information about it.
If I can adapt the old code to my current application I may rewrite this in C++ or VBA, and will be looking for an equivalent function in those languages.
I'll add to this answer if I find anything more detailed, but I have a place to start looking for the arguments to GJR.
This function is part of the Sperry UNIVAC MATH-PACK library; a full list of functions in the library can be found in http://www.dtic.mil/dtic/tr/fulltext/u2/a170611.pdf. GJR is described as "determinant; inverse; solution of simultaneous equations". Marginally helpful.
A better description comes from http://nvlpubs.nist.gov/nistpubs/jres/74B/jresv74Bn4p251_A1b.pdf
A FORTRAN subroutine, one of the Univac 1108 Math Pack programs,
available on the library tapes at the University of Maryland computing
center. It solves simultaneous equations, computes a determinant, or
inverts a matrix or any combination of the three above by using a
Gauss-Jordan elimination technique with column pivoting.
This is slightly more useful, but what we really want is "MATH-PACK, Programmer Reference", UP-7542 Rev. 1 from Sperry-UNIVAC (Unisys). I find a lot of references to this document but no full-text PDF of the document itself.
I'd take a look at the arguments in the function call, how they are set up and how the results are used, then look for equivalent routines in LAPACK or BLAS. See http://www.netlib.org/lapack/
I have a few books on piping networks including "Analysis of Flow in Pipe Networks" by Jeppson (same author as in the original PDF hosted by USU) https://books.google.com/books/about/Analysis_of_flow_in_pipe_networks.html?id=peZSAAAAMAAJ - I'll see if I can dig that up. The book may have a more portable matrix solver than the proprietary Sperry-UNIVAC library.
Update:
From p. 41 of http://ngds.egi.utah.edu/files/GL04099/GL04099_1.pdf I found documentation for the CGJR function, the complex version of GJR from the same library. It is likely that the only difference in the arguments is the variable type (COMPLEX instead of REAL):
CGJR is a subroutine which solves simultaneous equations, computes a determinant, inverts a matrix, or does any combination of these three operations, by using a Gauss-Jordan elimination technique with column pivoting.
The procedure for using CGJR is as follows:
Calling statement: CALL CGJR(A,NC,NR,N,MC,$K,JC,V)
where
A is the matrix whose inverse or determinant is to be determined. If simultaneous equations are solved, the last MC-N columns of the matrix are the constant vectors of the equations to be solved. On output, if the inverse is computed, it is stored in the first N columns of A. If simultaneous equations are solved, the last MC-N columns contain the solution vectors. A is a complex array.
NC is an integer representing the maximum number of columns of the array A.
NR is an integer representing the maximum number of rows of the array A.
N is an integer representing the number of rows of the array A to be operated on.
MC is the number of columns of the array A, representing the coefficient matrix if simultaneous equations are being solved; otherwise it is a dummy variable.
K is a statement number in the calling program to which control is returned if an overflow or singularity is detected.
1) If an overflow is detected, JC(1) is set to the negative of the last correctly completed row of the reduction, and control is then returned to statement number K in the calling program.
2) If a singularity is detected, JC(1) is set to the number of the last correctly completed row, and V is set to (0.,0.) if the determinant was to be computed. Control is then returned to statement number K in the calling program.
JC is a one-dimensional permutation array of N elements which is used for permuting the rows and columns of A if an inverse is being computed. If an inverse is not computed, this array must have at least one cell for the error return identification. On output, JC(1) is N if control is returned normally.
V is a complex variable. On input, REAL(V) is the option indicator, set as follows:
1. invert matrix
2. compute determinant
3. do 1. and 2.
4. solve system of equations
5. do 1. and 4.
6. do 2. and 4.
7. do 1., 2. and 4.
Notes on usage of row dimension arguments N and NR:
The arguments N and NR refer to the row dimensions of the A matrix.
N gives the number of rows operated on by the subroutine, while NR
refers to the total number of rows in the matrix as dimensioned by the
calling program. NR is used only in the dimension statement of the
subroutine. Through proper use of these parameters, the user may specify that only a submatrix, instead of the entire matrix, be operated on by the subroutine.
In your application (pipe flow), look at how matrix A and vector V are populated before the call to GJR and how they are used after the call.
You may be able to replace the call to GJR with a call to LAPACK's SGESV or DGESV without much difficulty.
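As an illustration of the C++ route mentioned in the question, here is a hedged sketch using DGESV through the LAPACKE C interface; the 3x3 system is made up, and the right-hand side plays the role of GJR's constant-vector columns:

#include <lapacke.h>
#include <vector>

// Solve A x = b with DGESV; link against LAPACKE and LAPACK
// (e.g. -llapacke -llapack).
int main() {
    const lapack_int n = 3, nrhs = 1;
    // Column-major storage, matching Fortran conventions.
    std::vector<double> A = {4, -1, 2,  -1, 6, 0,  2, 0, 5};
    std::vector<double> b = {1, 1, 1};  // overwritten with the solution
    std::vector<lapack_int> ipiv(n);

    lapack_int info = LAPACKE_dgesv(LAPACK_COL_MAJOR, n, nrhs, A.data(), n,
                                    ipiv.data(), b.data(), n);
    // info > 0 signals a singular matrix, roughly analogous to GJR's
    // error return to statement number K.
    return info == 0 ? 0 : 1;
}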
Aside: the Fortran community really needs a drop-in 'Rosetta library' that wraps LAPACK, etc., for replacing legacy/proprietary IBM, UNIVAC, and Numerical Recipes math functions. The ideal case would be for maintainers to replace legacy functions with de facto standard math functions, but in the real world many of these older programs are un(der)maintained and there simply isn't the will (or, as in this case, the ability) to update them.
Update 2:
I started work on a compatibility library for the Sperry MATH-PACK and STAT-PACK routines as well as a few other legacy libraries, posted at https://bitbucket.org/apthorpe/alfc
Further, I located my copy of Jeppson's Analysis of Flow in Pipe Networks, which is a slightly more legible version of the PDF of Steady Flow Analysis of Pipe Networks: An Instructional Manual, and modernized the codes listed in the text. I have posted those at https://bitbucket.org/apthorpe/jeppson_pipeflow
Note that I found a number of errors in both the code listings and in the example problems given for many of the codes. If you're trying to learn how to write a pipe flow solver based on Jeppson's paper or text, I'd strongly suggest reviewing my updated codes and test cases because they will save you hours of effort trying to understand why the code doesn't work and why you can't replicate the example cases. This took a fair amount of forensic computing to sort out.
Update 3:
The source to CGJR and DGJR can be found in http://www.dtic.mil/dtic/tr/fulltext/u2/a110089.pdf. DGJR is the closest to what you want, though it references more routines that aren't available (proprietary UNIVAC error-handling routines). It should be easy to convert DGJR to single precision and skip the proprietary calls. Otherwise, use the compatibility library mentioned above.
Libraries such as Intel MKL or AMD ACML provide an easy interface to SIMD operations on vectors, but I want to chain several functions together. Are there readily available libraries where I can register a parse tree for an expression like
log( tanh(x) + exp(x) )
and then evaluate it on all members of an array? What I want to avoid is making temporary arrays for tanh(x), exp(x), and tanh(x) + exp(x) by calling the MKL or ACML functions for tanh(), exp(), and +.
I can unroll the loop by hand and use the SSE instructions directly, but was wondering if there are C++ libraries which do this for you, i.e. which:
1. handle SIMD/SSE functions, and
2. allow building parse trees out of SIMD/SSE functions.
I am very much a newbie and have never used SSE or MKL/ACML before, just venturing out into new territory.
It may not do exactly what you want, but I suggest you take a look at macstl. It's a SIMD valarray implementation that uses template metaprogramming and can combine expressions into a single loop. You may be able to use it as-is, or perhaps as a basis for something closer to what you need.
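For intuition, this is roughly the single fused loop that such an expression-template library evaluates for log(tanh(x) + exp(x)); a plain scalar C++ sketch of my own, not macstl code, and the SIMD backend would process several elements per iteration:

#include <cmath>
#include <cstddef>

// One pass over the data, with no temporary arrays for tanh(x), exp(x),
// or their sum; a vectorizing compiler or a SIMD backend can turn this
// loop into SSE operations.
void eval(const float* x, float* y, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        y[i] = std::log(std::tanh(x[i]) + std::exp(x[i]));
}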
Have a look at Intel ArBB (Array Building Blocks). It uses a just-in-time compilation approach, IIRC. It can use vector instructions and multithreading depending on the sizes of the vectors you act upon.
Is there any way to profile the MathKernel memory usage (down to individual variables) other than paying $$$ for their Eclipse plugin (Mathematica Workbench, IIRC)?
Right now I finish execution of a program that takes multiple GBs of RAM, but the only things that should be stored are ~50 MB of data at most, yet MathKernel.exe tends to hold onto ~1.5 GB (basically, as much as Windows will give it). Is there any better way to get around this, other than saving the data I need and quitting the kernel every time?
EDIT: I've just learned of the ByteCount function (which shows some disturbing results on basic data types, but that's beside the point), but even the sum over all my variables is nowhere near the amount taken by MathKernel. What gives?
One thing a lot of users don't realize is that it takes memory to store all your inputs and outputs in the In and Out symbols, regardless of whether or not you assign an output to a variable. Out is also aliased as %, where % is the previous output, %% is the second-to-last, etc. %123 is equivalent to Out[123].
If you don't have a habit of using %, or only use it to a few levels deep, set $HistoryLength to 0 or a small positive integer, to keep only the last few (or no) outputs around in Out.
You might also want to look at the functions MaxMemoryUsed and MemoryInUse.
Of course, the $HistoryLength issue may or may not be your problem, but you haven't shared what your actual evaluation is.
If you're able to post it, perhaps someone will be able to shed more light on why it's so memory-intensive.
Here is my solution for profiling memory usage:
myByteCount[symbolName_String] :=
 Replace[ToHeldExpression[symbolName],
  Hold[x__] :>
   If[MemberQ[Attributes[x], Protected | ReadProtected],
    Sequence @@ {}, {ByteCount[
      Through[{OwnValues, DownValues, UpValues, SubValues,
         DefaultValues, FormatValues, NValues}[Unevaluated@x,
        Sort -> False]]], symbolName}]];
With[{listing = myByteCount /@ Names[]},
 Labeled[Grid[Reverse@Take[Sort[listing], -100], Frame -> True,
   Alignment -> Left],
  Column[{Style[
     "ByteCount for symbols without attributes Protected and \
ReadProtected in all contexts", 16, FontFamily -> "Times"],
    Style[Row@{"Total: ", Total[listing[[All, 1]]], " bytes for ",
      Length[listing], " symbols"}, Bold]}, Center, 1.5], Top]]
Evaluating the above produces a table of the 100 largest symbols by ByteCount.
Michael Pilat's answer is a good one, and MemoryInUse and MaxMemoryUsed are probably the best tools you have. ByteCount is rarely all that helpful: what it measures can be a huge overestimate because it ignores shared subexpressions, and it often ignores memory that isn't directly accessible through Mathematica functions, which is often a major component of memory usage.
One thing you can do in some circumstances is use the Share function, which forces subexpressions to be shared when possible. In some circumstances this can save you tens or even hundreds of megabytes. You can tell how well it's working by using MemoryInUse before and after you use Share.
Also, some innocuous-seeming things can cause Mathematica to use a whole lot more memory than you expect. Contiguous arrays of machine reals (and only machine reals) can be allocated as so-called "packed" arrays, much the way they would be allocated by C or Fortran. However, if you have a mix of machine reals and other structures (including symbols) in an array, everything has to be "boxed", and the array becomes an array of pointers, which can add a lot of overhead.
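A rough C++ analogy of that packed-versus-boxed distinction (my own illustration, not Mathematica's actual internals):

#include <memory>
#include <vector>

int main() {
    // "Packed": one contiguous buffer of machine reals, 8 bytes/element.
    std::vector<double> packed(1000000, 0.0);

    // "Boxed": an array of pointers to individually allocated values;
    // each element pays for a pointer plus a separate heap allocation.
    std::vector<std::unique_ptr<double>> boxed(1000000);
    for (auto& p : boxed) p = std::make_unique<double>(0.0);
    return 0;
}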
One way is to automate restarting the kernel when it runs out of memory. You can execute your memory-consuming code in a slave kernel, while the master kernel only takes the result of the computation and controls memory usage.