Maxima: creating a function that acts on parts of a string - regex

Context: I'm using Maxima on a platform that also uses KaTeX. For various reasons related to content management, this means that we are regularly using Maxima functions to generate the necessary KaTeX commands.
I'm currently trying to develop a group of functions that will facilitate generating different sets of strings corresponding to KaTeX commands for various symbols related to vectors.
Problem
I have written the following function makeKatexVector(x), which takes a string, list or list-of-lists and returns the same type of object, with each string wrapped in \vec{} (i.e. makeKatexVector(string) returns \vec{string} and makeKatexVector(["a","b"]) returns ["\vec{a}", "\vec{b}"] etc).
/* Flexible Make KaTeX Vector Version of List Items */
makeKatexVector(x) := block([placeHolderList : x],
    if stringp(x) /* special handling if x is just a string */
    then placeHolderList : concat("\vec{", x, "}")
    else if listp(x[1]) /* check whether it is a list of lists */
    then for j:1 thru length(x)
        do placeHolderList[j] : makelist(concat("\vec{", k, "}"), k, x[j])
    else if listp(x) /* check whether it is just a list */
    then placeHolderList : makelist(concat("\vec{", k, "}"), k, x)
    else placeHolderList : "makeKatexVector error: not a list-of-lists, a list or a string",
    return(placeHolderList));
Although I have my doubts about the efficiency or elegance of the above code, it seems to return the desired expressions; however, I would like to modify this function so that it can distinguish between single- and multi-character strings.
In particular, I'd like multi-character strings like x_1 to be returned as \vec{x}_1 and not \vec{x_1}.
In fact, I'd simply like to modify the above code so that \vec{} is wrapped around the first character of the string, regardless of how many characters there may be.
My Attempt
I was ready to tackle this with brute force (e.g. transcribing each character of a string into a list and then reassembling); however, the real programmer on the project suggested I look into "Regular Expressions". After exploring that endless rabbit hole, I found the command regex_subst; however, I can't find any Maxima documentation for it, and am struggling to reproduce the examples in the related documentation here.
Once I can work out the appropriate regex to use, I intend to implement this in the above code using an if statement, such as:
if slength(x) >1
then {regex command}
else {regular treatment}
If anyone knows of helpful resources on any of these fronts, I'd greatly appreciate any pointers at all.

Looks like you got the regex approach working, that's great. My advice about handling subscripted expressions in TeX, however, is to avoid working with names which contain underscores in Maxima, and instead work with Maxima expressions with indices, e.g. foo[k] instead of foo_k. While writing foo_k is a minor convenience in Maxima, you'll run into problems pretty quickly, and in order to straighten it out you might end up piling one complication on another.
E.g. Maxima doesn't know there's any relation between foo, foo_1, and foo_k -- those have no more in common than foo, abc, and xyz. What if there are 2 indices? foo_j_k will become something like foo_{j_k} by the preceding approach -- what if you want foo_{j, k} instead? (Incidentally the two are foo[j[k]] and foo[j, k] when represented by subscripts.) Another problematic expression is something like foo_bar_baz. Does that mean foo_bar[baz], foo[bar_baz] or foo_bar_baz?
The code for tex(x_y) yielding x_y in TeX is pretty old, so it's unlikely to go away, but over the years I've come to increasingly feel that it should be avoided. However, the last time it came up and I proposed disabling it, there were enough people who supported it that we ended up keeping it.
Something that might be helpful, there is a function texput which allows you to specify how a symbol should appear in TeX output. For example:
(%i1) texput (v, "\\vec{v}");
(%o1) "\vec{v}"
(%i2) tex ([v, v[1], v[k], v[j[k]], v[j, k]]);
$$\left[ \vec{v} , \vec{v}_{1} , \vec{v}_{k} , \vec{v}_{j_{k}} ,
\vec{v}_{j,k} \right] $$
(%o2) false
texput can modify various aspects of TeX output; you can take a look at the documentation (see ? texput).

While I didn't expect to work this out on my own, after several hours I made some progress, so I figured I'd share it here in case anyone else can benefit from the time I put in.
To load the regex package in wxMaxima, at least in the macOS version, simply type load("sregex");. I didn't have this loaded and was trying to work through our custom platform instead, which cost me several hours.
Take note that many of the arguments in the linked documentation by Dorai Sitaram occur in the reverse, or at least a different, order than they do in their corresponding Maxima versions.
Not all the "pregexp" functions exist in Maxima.
In addition, escaping special characters varied in important ways between wxMaxima, the inline Maxima compiler (running within the Ace editor) and the actual rendered version on our platform; in particular, the inline compiler often returned false for expressions that compiled properly in wxMaxima and on the platform. Because I didn't have sregex loaded in wxMaxima from the beginning, I lost a lot of time to this.
Finally, the regex expression that achieved the desired substitution, in my case, was:
regex_subst("\vec{\\1}", "([[:alpha:]])", "v_1");
which returns vec{v}_1 in wxMaxima (N.B. none of my attempts to get wxMaxima to return \vec{v}_1 with a single leading backslash were successful; fortunately, the usual escaped version \\vec{\\1} does return the desired form).
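In case it helps anyone comparing engines, here is a rough Python sketch of the same substitution; the translation of [[:alpha:]] to [A-Za-z] and the anchored variant are my own assumptions, not part of the Maxima solution:

import re

# Rough analogue of the regex_subst call above: wrap every ASCII letter.
print(re.sub(r"([A-Za-z])", r"\\vec{\1}", "v_1"))   # prints \vec{v}_1

# A variant anchored to the first character only, which would also handle
# alphabetic subscripts such as "x_a" -> \vec{x}_a.
print(re.sub(r"^(.)", r"\\vec{\1}", "x_a"))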
I have yet to adjust the code for the rest of the function, but I doubt that will be of use to anyone else, so I wanted to be sure to post an update here before anyone else took time to assist me.
Always interested in better methods / practices or any other pointers / feedback.

Related

What is the name of the data structure for and-or-lists (or and-or-trees) and where can I read about it?

I recently needed to make a data structure which was a nested list of and/or questions. Since almost every interesting thing has been discovered by someone else previously, I'm looking for the name of this data structure. It looks something like this:
‘((a b c) (b d e) (c (a b) (f a)))
The interpretation is that I want to find abc or bde or caf or caa or cbf or cba, and the list encapsulates that. At the top level each item is or'ed together, sub-lists of the top level are and'ed together, sub-lists of sub-lists are or'ed again, sub-lists of those are and'ed, and sub-lists of those or'ed, ad infinitum. Note that in my example all the lists are the same length; in my real application the lists vary in length.
The code to walk such a “tree” is relatively simple, but I’m assuming that there is a name for that type of tree and there is stuff I can read about it.
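For what it's worth, a minimal Python sketch of such a walker, assuming (per the description above) that even depths are or'ed and odd depths are and'ed; the names are illustrative only:

from itertools import product

def expansions(node, level=0):
    # Atoms stand for themselves.
    if not isinstance(node, list):
        yield (node,)
    # Even levels are or'ed: choose any one child.
    elif level % 2 == 0:
        for child in node:
            yield from expansions(child, level + 1)
    # Odd levels are and'ed: concatenate one expansion of every child.
    else:
        for combo in product(*(list(expansions(c, level + 1)) for c in node)):
            yield tuple(x for part in combo for x in part)

tree = [["a", "b", "c"], ["b", "d", "e"], ["c", ["a", "b"], ["f", "a"]]]
print(["".join(e) for e in expansions(tree)])
# ['abc', 'bde', 'caf', 'caa', 'cbf', 'cba']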
These lists are equivalent to fixed-length regular expressions (which I've seen referred to as "network expressions"), but I am particularly interested in this data structure and representations thereof.
In general (at a very high level of abstraction) it is a context-free grammar (Wiki).
If you allow it to be infinitely nested, then it is not a regular expression, because the parentheses must balance (every left must match a right).
If you consider that expressions inside parentheses are ordered -- I mean that a and b and c is equivalent to (a and b) and c -- then you get a binary expression tree (Wiki).
But for your particular case, it is probably disjunctive normal form (Wiki).
I am not sure, but my intuition says it is a regular expression again, because you have only 2 levels of nesting (the 1st for the or'ed parts and the 2nd for the and'ed parts).
The trees are also a subset of DAWGS - directed acyclic word graphs and one could construct them the same way.
In my case, I have a very small set that I have built by hand and I don't worry about getting the minimal set, but instead just want something that I can easily write down but deals with the types of simple variations I see. Basically, I have different ways of finding where I keep my .el files based upon the different directory structures of various OSes I use. (E.g. when I was working at Google, the /usr/local/emacs/site-lisp directory was actually more like /usr/local/Google/emacs/site-lisp.)
I don't need a full regex, but there are about a dozen variations, some having quite long lists of nested sub-directories (c:\users\cfclark\appData\roaming\emacs.emacs.d or some other awful thing) that I wanted to write down (and then have emacs make an automated search to find the one that is appropriate to this particular installation). And every time I go to a new job, I can simply add to the list a description of where they are in that setup.
Anyway, as that code evolved, I found that I was doing nested or's and and's, and realized that the structure generalized to the alternating or/and/or/and/... case. So my assumption is that someone must have discovered this before. I had hints of it myself several years ago, but didn't sit down to implement it. The Disjunctive Normal Form link mpasko256 gave is also particularly relevant. I don't normalize to that level; I still keep nested and's and or's rather than flattening to two levels, but I do have a distinct structure: or's at the top, then and's, then or's....

Regular Expression for whole word

First of all, I use C# 4.0 to parse the code of a VB6 application.
I have some old VB6 code and about 500+ copies of it. And I use a regular expression to grab all kinds of global variables from the code. The code is described as "Yuck" and some poor victim still has to support this. So I'm hoping to help this poor sucker a bit by generating overviews of specific constants. (And yes, it should be rewritten but it ain't broke, so...)
This is a sample of a code line I need to match, in this case all boolean constants:
Public Const gDemo = False 'Is this a demo version
And this is the regular expression I use at this moment:
Public\s+Const\s+g(?'Name'[a-zA-Z][a-zA-Z0-9]*)\s+=\s+(?'Value'[0-9]*)
And I think it too is yucky, because of the * at the end of the boolean group. But if I don't use it, it will only return 'T' or 'F', and I want the whole word.
Is this the proper regex to use as a solution, or is there an even nicer-looking option?
FYI, I use similar regexes to find all string constants and all numeric constants. Those work just fine. And basically the same .BAS file is used for all 50 copies, but with different values for all these variables. By parsing all the files, we have a good overview of how every version is configured.
And again, yes, we need to rebuild the whole project from scratch since it becomes harder to maintain these days. But it works and we need the manpower for other tasks. It just needs the occasional tweaks...
You can use: Public\s+Const\s+g(?<Name>[a-zA-Z][a-zA-Z0-9]*)\s+=\s+(?<Value>False|True)
demo
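For anyone who wants to test the pattern outside .NET, here is a quick Python equivalent (named groups written as (?P<...>); the sample line is the one from the question):

import re

pattern = re.compile(
    r"Public\s+Const\s+g(?P<Name>[a-zA-Z][a-zA-Z0-9]*)\s+=\s+(?P<Value>False|True)")
m = pattern.search("Public Const gDemo = False 'Is this a demo version")
print(m.group("Name"), m.group("Value"))   # Demo False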

How do I associate changed lines with functions in a git repository of C code?

I'm attempting to construct a “heatmap” from a multi-year history stored in a git repository where the unit of granularity is individual functions. Functions should grow hotter as they change more times, more frequently, and with more non-blank lines changed.
As a start, I examined the output of
git log --patch -M --find-renames --find-copies-harder --function-context -- *.c
I looked at using Language.C from Hackage, but it seems to want a complete translation unit -- expanded headers and all -- rather than being able to cope with a source fragment.
The --function-context option is new since version 1.7.8. The foundation of the implementation in v1.7.9.4 is a regex:
PATTERNS("cpp",
/* Jump targets or access declarations */
"!^[ \t]*[A-Za-z_][A-Za-z_0-9]*:.*$\n"
/* C/++ functions/methods at top level */
"^([A-Za-z_][A-Za-z_0-9]*([ \t*]+[A-Za-z_][A-Za-z_0-9]*([ \t]*::[ \t]*[^[:space:]]+)?){1,}[ \t]*\\([^;]*)$\n"
/* compound type at top level */
"^((struct|class|enum)[^;]*)$",
/* -- */
"[a-zA-Z_][a-zA-Z0-9_]*"
"|[-+0-9.e]+[fFlL]?|0[xXbB]?[0-9a-fA-F]+[lL]?"
"|[-+*/<>%&^|=!]=|--|\\+\\+|<<=?|>>=?|&&|\\|\\||::|->"),
This seems to recognize boundaries reasonably well but doesn’t always leave the function as the first line of the diff hunk, e.g., with #include directives at the top or with a hunk that contains multiple function definitions. An option to tell diff to emit separate hunks for each function changed would be really useful.
This isn’t safety-critical, so I can tolerate some misses. Does that mean I likely have Zawinski’s “two problems”?
I realise this suggestion is a bit tangential, but it may help to clarify and rank requirements. This would work for C or C++ ...
Instead of trying to find text blocks which are functions and comparing them, use the compiler to make binary blocks. Specifically, for every C/C++ source file in a change set, compile it to an object. Then use the object code as a basis for comparisons.
This might not be feasible for you, but IIRC there is an option on gcc (probably -ffunction-sections) to compile so that each function is compiled to an 'independent chunk' within the generated object code file. The linker can pull each 'chunk' into a program. (It is getting pretty late here, so I will look this up in the morning if you are interested in the idea.)
So, assuming we can do this, you'll have lots of functions defined by chunks of binary code, so a simple 'heat' comparison is 'how much longer or shorter is the code between versions for any function?'
I am also thinking it might be practical to use objdump to reconstitute the assembler for the functions. I might use some regular expressions at this stage to trim off the register names, so that changes to register allocation don't cause too many false positives (changes).
I might even try to sort the assembler instructions in the function bodies, and diff them to get a pattern of "removed" vs "added" between two function implementations. This would give a measure of change which is pretty much independent of layout, and even somewhat independent of the order of some of the source.
So it might be interesting to see if two alternative implementations of the same function (i.e. from different change sets) are the same instructions :-)
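To make the objdump idea concrete, here is a rough Python sketch (assuming GNU objdump output; the register-stripping regex and the scoring are deliberately crude):

import re
import subprocess
import difflib

def function_asm(objfile):
    # Disassemble an object file and split the output per function. GNU
    # objdump marks each function with a header like '0000000000000000 <name>:'.
    out = subprocess.run(["objdump", "-d", objfile],
                         capture_output=True, text=True, check=True).stdout
    funcs, name = {}, None
    for line in out.splitlines():
        header = re.match(r"[0-9a-f]+ <(.+)>:", line)
        if header:
            name = header.group(1)
            funcs[name] = []
        elif name and "\t" in line:
            # Keep only the instruction text, and blank out register names so
            # register-allocation changes don't count as churn.
            insn = line.split("\t")[-1]
            funcs[name].append(re.sub(r"%\w+", "%reg", insn))
    return funcs

def heat(old_obj, new_obj):
    # Rough per-function change measure between two compiled versions.
    a, b = function_asm(old_obj), function_asm(new_obj)
    for fn in sorted(set(a) | set(b)):
        same = difflib.SequenceMatcher(None, a.get(fn, []), b.get(fn, [])).ratio()
        print(f"{fn}: {1 - same:.2f}")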
This approach should also work for C++ because all names have been appropriately mangled, which should guarantee the same functions are being compared.
So, the regular expressions might be kept very simple :-)
Assuming all of this is straightforward, what might this approach fail to give you?
Side Note: This basic strategy could work for any language which targets machine code, as well as VM instruction sets like the Java VM Bytecode, .NET CLR code, etc too.
It might be worth considering building a simple parser, using one of the common tools, rather than just using regular expressions. Clearly it is better to choose something you are familiar with, or which your organisation already uses.
For this problem, a parser doesn't actually need to validate the code (I assume it is valid when it is checked in), and it doesn't need to understand the code, so it might be quite dumb.
It might throw away comments (retaining newlines), ignore the contents of text strings, and treat program text in a very simple way. It mainly needs to keep track of balanced '{' '}' and balanced '(' ')'; all the other valid program text is just individual tokens which can be passed 'straight through'.
Its output might be a separate file per function, to make tracking easier.
If the language is C or C++, and the developers are reasonably disciplined, they might never use 'non-syntactic macros'. If that is the case, then the files don't need to be preprocessed.
Then a parser is mostly just looking for the function name (an identifier) at file scope, followed by ( parameter-list ) { ... code ... } -- a rough sketch follows below.
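As a rough illustration of how dumb such a parser can be, here is a Python sketch that only tracks brace depth and looks for a name(...) { header; comments, strings and macros are deliberately ignored:

import re

HEADER = re.compile(r"\b([A-Za-z_][A-Za-z_0-9]*)\s*\([^;{}]*\)\s*\{")
KEYWORDS = {"if", "for", "while", "switch", "do"}

def split_functions(source):
    # Map each function name to its full text, found by balancing braces.
    funcs, pos = {}, 0
    while True:
        m = HEADER.search(source, pos)
        if not m:
            return funcs
        if m.group(1) in KEYWORDS:      # a control statement, not a definition
            pos = m.end()
            continue
        depth, i = 1, m.end()
        while i < len(source) and depth:
            if source[i] == "{":
                depth += 1
            elif source[i] == "}":
                depth -= 1
            i += 1
        funcs[m.group(1)] = source[m.start():i]
        pos = i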
I'd SWAG it would be a few days' work using yacc & lex / flex & bison, and it might be so simple that there is no need for the parser generator.
If the code is Java, then ANTLR is a possibility, and I think there was a simple Java parser example.
If Haskell is your focus, there may be student projects published which have made a reasonable stab at a parser.

Stream or Iterator to generate all strings that match a regular expression?

This is a follow-up to my previous question.
Suppose I want to generate all strings that match a given (simplified) regular expression.
It is just a coding exercise and I do not have any additional requirements (e.g. how many strings are actually generated). So the main requirement is to produce nice, clean, and simple code.
I thought about using Stream but after reading this question I am thinking about Iterator. What would you use?
The solution to this question asks for too much code for it to be practical to answer here, but the outline goes as follows.
First, you want to parse your regular expression--you can look into parser combinators for this, for example. You'll then have an evaluation tree that looks like, for example,
List(
  Constant("abc"),
  ZeroOrOne(Constant("d")),
  Constant("efg"),
  OneOf(Constant("h"), List(Constant("ij"), ZeroOrOne(Constant("klmnop")))),
  Constant("qrs"),
  AnyChar()
)
Rather than running this expression tree as a matcher, you can run it as a generator by defining a generate method on each term. For some terms, (e.g. ZeroOrOne(Constant("d"))), there will be multiple options, so you can define an iterator. One way to do this is to store internal state in each term and pass in either an "advance" flag or a "reset" flag. On "reset", the generator returns the first possible match (e.g. ""); on advance, it goes to the next one and returns that (e.g. "d") while consuming the advance flag (leaving the rest to evaluate with no flags). If there are no more items, it produces a reset instead for everything inside itself and leaves the advance flag intact for the next item. You start by running with a reset; on each iteration, you put an advance in, and stop when you get it out again.
Of course, some regex constructs like "d+" can produce infinitely many values, so you'll probably want to limit them in some way (or at some point return e.g. d...d meaning "lots"); and others have very many possible values (e.g. . matches any char, but do you really want all 64k chars, or however many Unicode code points there are?), and you may wish to restrict those also.
Anyway, this, though time-consuming, will result in a working generator. And, as an aside, you'll also have a working regex matcher, if you write a match routine for each piece of the parsed tree.
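Here is a compact sketch of that generator idea in Python, using Python generators in place of the explicit advance/reset protocol described above; the class names mirror the example tree, and AnyChar is omitted since it would enumerate an entire character set:

from dataclasses import dataclass
from itertools import product

@dataclass
class Constant:
    s: str
    def generate(self):
        yield self.s

@dataclass
class ZeroOrOne:
    term: object
    def generate(self):
        yield ""                         # the "zero" case first
        yield from self.term.generate()

@dataclass
class Seq:
    terms: list
    def generate(self):
        # One choice from each term, concatenated in order.
        for parts in product(*(t.generate() for t in self.terms)):
            yield "".join(parts)

@dataclass
class OneOf:
    alternatives: list                   # each alternative is a list of terms
    def generate(self):
        for alt in self.alternatives:
            yield from Seq(alt).generate()

tree = Seq([Constant("abc"), ZeroOrOne(Constant("d")), Constant("efg"),
            OneOf([[Constant("h")],
                   [Constant("ij"), ZeroOrOne(Constant("klmnop"))]])])
print(list(tree.generate()))
# ['abcefgh', 'abcefgij', 'abcefgijklmnop',
#  'abcdefgh', 'abcdefgij', 'abcdefgijklmnop']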

calculating user defined formulas (with c++)

We would like to have user-defined formulas in our C++ program.
e.g. the value v = x + (y - (z - 2)) / 2. Later in the program the user would define x, y and z; the program should then return the result of the calculation. Sometime later the formula may get changed, so the next time the program should parse the new formula and plug in the new values. Any ideas / hints on how to do something like this? So far I have only come to the solution of writing a parser to calculate these formulas -- maybe any ideas about that?
If it will be used frequently and if it will be extended in the future, I would almost recommend adding either Python or Lua into your code. Lua is a very lightweight scripting language which you can hook into and provide new functions, operators etc. If you want to do more robust and complicated things, use Python instead.
You can represent your formula as a tree of operations and sub-expressions. You may want to define types or constants for Operation types and Variables.
You can then easily enough write a method that recurses through the tree, applying the appropriate operations to whatever values you pass in.
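A minimal sketch of that recursion (Python for brevity; the same shape maps directly onto a C++ Node base class with a virtual evaluate method, and all names here are illustrative):

class Num:
    def __init__(self, value): self.value = value
    def eval(self, env): return self.value

class Var:
    def __init__(self, name): self.name = name
    def eval(self, env): return env[self.name]

class BinOp:
    OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
           "*": lambda a, b: a * b, "/": lambda a, b: a / b}
    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right
    def eval(self, env):
        # Recurse into both sub-expressions, then apply the operation.
        return self.OPS[self.op](self.left.eval(env), self.right.eval(env))

# v = x + (y - (z - 2)) / 2, the formula from the question:
v = BinOp("+", Var("x"),
          BinOp("/", BinOp("-", Var("y"), BinOp("-", Var("z"), Num(2))),
                Num(2)))
print(v.eval({"x": 1, "y": 10, "z": 4}))   # 5.0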
Building your own parser for this should be a straightforward job (a sketch follows after this list):
1) convert the equation from infix to postfix notation (a typical compsci assignment; I'd use a stack)
2) wait to get the values you want
3) pop the stack of postfix items, dropping in the value for each variable where needed
4) display the results
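Here is a rough sketch of steps 1 and 3 (shunting-yard conversion plus stack-based evaluation), assuming tokens are already split into single characters and variables are single letters:

import operator

PREC = {"+": 1, "-": 1, "*": 2, "/": 2}
OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}

def to_postfix(tokens):
    out, ops = [], []
    for t in tokens:
        if t.isalnum():                  # variable or number
            out.append(t)
        elif t == "(":
            ops.append(t)
        elif t == ")":
            while ops[-1] != "(":
                out.append(ops.pop())
            ops.pop()                    # discard the "("
        else:                            # operator
            while ops and ops[-1] != "(" and PREC[ops[-1]] >= PREC[t]:
                out.append(ops.pop())
            ops.append(t)
    return out + ops[::-1]

def evaluate(postfix, env):
    stack = []
    for t in postfix:
        if t in OPS:
            b, a = stack.pop(), stack.pop()
            stack.append(OPS[t](a, b))
        else:
            stack.append(env[t] if t.isalpha() else float(t))
    return stack[0]

print(evaluate(to_postfix(list("x+(y-(z-2))/2")), {"x": 1, "y": 10, "z": 4}))
# 5.0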
Using Spirit (for example) to parse (and the 'semantic actions' it provides to construct an expression tree that you can then manipulate, e.g., evaluate) seems like quite a simple solution. You can find a grammar for arithmetic expressions there for example, if needed... (it's quite simple to come up with your own).
Note: Spirit is very simple to learn, and quite adapted for such tasks.
There are generally two ways of doing it, with three possible implementations:
as you've touched on yourself, a library to evaluate formulas
compiling the formula into code
The second option here is usually done either by compiling something that can be loaded in as a kind of plugin, or it can be compiled into a separate program that is then invoked and produces the necessary output.
For C++ I would guess that a library for evaluation would probably exist somewhere so that's where I would start.
If you want to write your own, search for "formal automata" and/or "finite state machine grammar"
In general what you will do is parse the string, pushing characters onto a stack as you go. Then start popping the characters off and perform tasks based on what is popped. It's easier to code if you force equations into reverse Polish notation.
To make your life easier, I think getting this kind of input is best done through a GUI where users are restricted in what they can type in.
If you plan on doing it from the command line (that is the impression I get from your post), then you should probably define a strict set of allowable inputs (e.g. only single letter variables, no whitespace, and only certain mathematical symbols: ()+-*/ etc.).
Then, you will need to:
Read in the input char array
Parse it in order to build up a list of variables and actions
Carry out those actions - in BOMDAS order
With ANTLR you can create a parser/compiler that will interpret the user input, then execute the calculations using the Visitor pattern. A good example is here, but it is in C#. You should be able to adapt it quickly to your needs and keep C++ as your development platform.