custom regular expression parser - regex

i would like to do regular expression matching on custom alphabets, using custom commands. the purpose is to investigate equations and expressions that appear in meteorology.
So for example my alpabet an be [p, rho, u, v, w, x, y, z, g, f, phi, t, T, +, -, /] NOTE: the rho and phi are multiple characters, that should be treated as single character.
I would also like to use custom commands, such a \v for variable, i.e. not the arithmatic operators.
I would like to use other commands such as (\v). note the dot should match dx/dt, where x is a variable. similarly, given p=p(x,y,z), p' would match dp/dx, dp/dy, and dp/dz, but not dp/df. (somewhere there would be given that p = p(x,y,z)).
I would also like to be able to backtrack.
Now, i have investigated PCRE and ragel with D, i see that the first two problems are solvable, with multiple character objects defined s fixed objects. and not a character class.
However how do I address the third?
I dont see either PCRE or RAGEL admitting a way to use custom commands.
Moreover, since I would like to use backtrack I am not sure if Ragel is the correct option, as this wouuld need a stack, which means I would be using CFG.
Is there perhaps a domainspeific language to build such regex/cfg machines (for linux 64 bit if that matters)

There is nothing impossible. Just write new class with regex inside with your programming language and define new syntax. It will be your personal regular expression syntax. For example, like:
result = latex_string.match("p'(x,y,z)", "full"); // match dp/dx, dp/dy, dp/dz
result = latex_string_array.match("p'(x,y,z)", "partial"); // match ∂p/∂x, ∂p/∂y, ∂p/∂z
. . .
The method match will treat new, pseudo-regular expression inside the your class and will return the result in desirable form. You can simply make input definition as a string and/or array form. Actually, if some function have to be matched by all derivatives, you must simplify search notation to .match("p'").
One simple notice:
,
have source: \mathrm{d}y=\frac{\mathrm{d}y}{\mathrm{d}t}\mathrm{d}t, and:
,
dy=\frac{dy}{dt}dt, and finally:
,
is dy=(dy/dt)dt
The problem of generalization for latex equations meaning with regular expressions is human input factor. It is just a notation and author can select various manners of input.
The best and precise way is to analysis of formula content and creation a computation three. In this case, you will search not just notations of differentials or derivatives, but instructions to calculate differentials and derivatives, but anyway it is connected with detailed analysis of the formula string with multiple cases of writing manners.
One more thing, and good news for you! It's not necessary to define magic regex-latex multibyte letter greek alphabet. UTF-8 have ρ - GREEK SMALL LETTER RHO you can use in UI, but in search method treat it as \rho, and use simply /\\frac{d\\rho}{dx}/ regex notation.
One more example:
// search string
equation = "dU= \left(\frac{\partial U}{\partial S}\right)_{V,\{N_i\}}dS+ \left(\frac{\partial U}{\partial V}\right)_{S,\{N_i\}}dV+ \sum_i\left(\frac{\partial U}{\partial N_i}\right)_{S,V,\{N_{j \ne i}\}}dN_i";
. . .
// user input by UI
. . .
// call method
equation.equation_match("U'");// example notation for all types of derivatives for all variables
. . .
// inside the 'equation_match' method you will use native regex methods
matches1 = equation.match(/dU/); // dU
matches2 = equation.match(/\\partial U/); // ∂U
etc.
return(matches);// combination of matches

Related

OpenModelica SimulationOptions 'variableFilter' not working with '^' exceptions

To reduce size of my simulation output files, I want to give variable name exceptions instead of a list of many certain variables to the simulationsOptions/outputFilter (cf. OpenModelica Users Guide / Output) of my model. I found the regexp operator "^" to fullfill my needs, but that didn't work as expected. So I think that something is wrong with the interpretation of connected character strings when negated.
Example:
When I have any derivatives der(...) in my model and use variableFilter=der.* the output file will contain all the filtered derivatives. Since there are no other varibles beginning with character d the same happens with variableFilter=d.*. For testing I also tried variableFilter=rde.* to confirm that every variable is filtered.
When I now try to except by variableFilter=^der.*, =^rde.* or =^d.*, I get exactly the same result as without using ^. So the operator seems to be ignored in this notation.
When I otherwise use variableFilter=[^der].*, =[^rde].* or even =[^d].*, all wanted derivation variables are filtered from the ouput, but there is no difference between those three expressions above. For me it seems that every character is interpretated standalone and not as as a connected string.
Did I understand and use the regexp usage right or could this be a code bug?
Side/follow-up question: Where can I officially report this for software revision?
_
OpenModelica v.1.19.2 (64-bit)

Parsing series of formulas from a string using sympy

I have a pandas df that contains many string formulas that I would like to be able to parse and eventually solve. I came across parse_expr and initially seemed like it would work for my problem but now I'm not so sure.
An example string formula might look like this:
A = B + C; D = A*.2;
parse_expr would seem to work well if i had a system of equations and I may not be using this correctly. As it stands, parse_expr throws an "invalid syntax" error I believe because of the equal sign. Can anyone tell if its possible to solve this problem using parse_expr or if there is another approach I should try?
SymPy cannot parse a bunch of semicolon-separated formulas at once, so the string needs to be split first. It will need to be split again at =, assuming all formulas have = in them. After parsing each side of =, you can combine them with Eq, which is SymPy's equation object; or use them somehow else.
from sympy import S, Eq
str = "A = B + C; D = A*.2;"
result = [Eq(*map(S, f.split("="))) for f in str.split(";")[:-1]]
The result is [Eq(A, B + C), Eq(D, 0.2*A)]
I use S, short for sympify; parse_expr could be used similarly, and it has a few options that are not needed here.
parse_expr is based on the Python tokenizer, but it has several extensions. These extensions take the form of functions that take a list of tokens, a locals dictionary, and a globals dictionary, and return a modified list of tokens. These are passed as a tuple to parse_expr, like parse_expr(expression, transformations=(transformation1, transformation2, ...)).
It's probably easiest to just take a look at the source of the sympy.parsing.sympy_parser submodule to see the existing transformations and how they work. Some of the transformations that are there will probably be useful to you. In this case, you would want a transformation that transforms the = token into something else (actually there's already a transformation function convert_equals_sign in the sympy_parser submodule that does this). You assumedly also want to handle *. somehow.
I've also written a guide on Python tokenization which may be helpful here: https://www.asmeurer.com/brown-water-python
If your syntax is too far off from Python's then it will be challenging to use parse_expr, since it only works with Python's tokenizer. In that case, you'd need to generate your own grammar and parser (e.g., using antlr) for your DSL and parse it into something that can then be transformed into a SymPy expression.

Use of paratheses symbols in django-viewflow flow

I have been at a loss understanding the use of parentheses in django-viewflow flow code.
For example in the code below
start = (
flow.Start(views.StartView)
.Permission('shipment.can_start_request')
.Next(this.split_clerk_warehouse)
)
# clerk
split_clerk_warehouse = (
flow.Split()
.Next(this.shipment_type)
.Next(this.package_goods)
)
from here
It seems as though, a tuple containing functions is assigned to start and to split_clerk_warehouse e.t.c. What does it mean. From my best guess it would seem that the .Next functions accept a tuple as input.
NOTE I do understand the method chaining used here. I am just at a loss to understand the use of braces.
Thanks.
If I understand correctly, you wonder what the use is of the outer brackets.
Let us first write the (first, but applicable to the second) statement without outer brackets:
start = flow.Start(views.StartView).Permission('shipment.can_start_request').Next(this.split_clerk_warehouse)
This is exactly equivalent to code in your sample. But you probably agree that this is quite unreadable. It requires a user to scroll over the code, and furthermore it is a long chain of characters, without any structure. A programmer would have a hard time understanding it, especially if - later - we would also use brackets inside the parameters of calls.
So perhaps it would make sense to write it like:
start = flow.Start(views.StartView).
Permission('shipment.can_start_request').
Next(this.split_clerk_warehouse)
But this will not work: Python is a language that uses spacing as a way to attach semantics on code. As a result it will break: Python will try to parse the separate linkes as separate statements. But then what to do with the tailing dot? As a result the parser would error.
Now Python has some ways to write statements in a multi-line fashion. For example with backslashes:
start = flow.Start(views.StartView). \
Permission('shipment.can_start_request'). \
Next(this.split_clerk_warehouse)
with the backslash we specify that the next line actually belongs to the current one, and thus it is parsed like we wrote this all on a single line.
The disadvantage is that we easily can forget a backslash here, and this would again let the parser error. Furthermore this requires linear work: for every line we have to add one element.
But programming languages actually typically have a feature that programmers constantly use to group (sub)expressions together: brackets. We use it to give precedence (for example 3 * (2 + 5)), but we can use it to simply group one expression over multiple lines as well, like:
start = (
flow.Start(views.StartView)
.Permission('shipment.can_start_request')
.Next(this.split_clerk_warehouse)
)
Everything that is within the brackets belongs to the same expression, so Python will ignore the new lines.
Note that tuple literals also use brackets. For example:
() # empty tuple
(1, ) # singleton tuple (one element)
(1, 'a', 2, 5) # 4-tuple
But here we need to write a comma at the end for a singleton tuple, or multiple elements separated by comma's , (except for the empty tuple).

Hunspell/Aspell data conversion to human-readable inflection list

Is there an easy way to generate a human-readable inflection list from Hunspell/Aspell dictionary data files?
For example, I'd like to generate the following outputs (for different languages):
...
book, books
book, books, booked, booking
...
go, goes, went, gone, going
...
I looked at the Hunspell/Aspell docs, but couldn't find an API call that would do this.
There is a method that the command line one does, but it doesn't output quite in the format you're looking for. You could also do this manually if you wanted though just by some simple scripting with regex.
The format of for each set of affixes is
TYPE TAG REMOVE REPLACE MATCH
Such that where TAG matches what follows what's behind the /in a given word in the .dicfile, you can do the following (presuming you've already stripped the word of the /...):
if($word =~ /$match$/) $word =~ s/$remove$/$replace/;
Notice the $ there matching the end-of-line/word. Adjust with ^ if it's a prefix.
There are three caveats:
The $match directly from the .aff file is in almost all cases equivalent to standard regex. There are minor variations such that if the match is something like [abc-gh], you'd be better to change it to (a|b|c|-|g|h) or [abcgh-] (hunspell doesn't use hyphen as a metacharacter) otherwise it'll be interpreted as [abcdefgh] (standard regex). For a negated character class, your options are to manually move the - to the end of the expression (e.g. [^a-df] to [^adf-] or to use negative look behinds.
If $replace is 0, then you should change it to an empty string.
If your result ends with /..., you need to reprocess it again because it has a double affix.
Be careful. By my rough calculations, the dictionary I'm working on could have more than 50 million words being formed (and I wouldn't be surprised if it hits beyond 100 million).

regex matching multiple values when they might not exist

I am trying to right a preg_match_all to match horse race distance.
My source lists races as:
xmxfxy
I want to match the m value, the f value, the y value. However different races will maybe only have m, or f, or y, or two of them or even all three.
// e.g. $raw = 5f213y;
preg_match_all('/(\d{1,})m|(\d{1,})f|(\d{1,})y/', $raw, $distance);
The above sort of works, but for some reason the matches appear in unpredictable positions in the returned array. I guess it is because it is running the match 3 times for each OR. How do I match all three (that may or may not exist) in a single run.
EDIT
A full sample string is:
Hardings Catering Services Handicap (Div I) Cl6 5f213y
If I understand you correctly, you're processing listings (like the one in your question) one at a time. If that's the case, you should be using preg_match, not preg_match_all, and the regex should match the whole "distance" code, not individual components of it. Try this:
preg_match('#\b(?:(?<M>\d+)m|(?<F>\d+)f|(?<Y>\d+)y){1,3}\b#',
$raw, $distance);
The results are now stored in a one-dimensional array, but you don't need to worry about the group numbers anyway; you can access them by name instead (e.g., $distance['M'], $distance['F'], $distance['Y']).
Note that, while this regex matches codes with one, two, or three components, it doesn't require the letters to be unique. There's nothing to stop it from matching something like 1m2m3m (a weakness shared by your own approach, by the way).
you can use "?" as a conditional
preg_match_all('/((\d{1,})m)?|((\d{1,})f)?|((\d{1,})y)?/', $raw, $distance);
If I understand what you're asking correctly, you would like to get each number from these values separately? This works for me:
$input = "Hardings Catering Services Handicap (Div I) Cl6 5f213y";
preg_match_all('/((\d+)(m|f|y))/', $input, $matches);
After the preg_match_all() executes, $matches[2] holds an array of the numbers that matched (in this case, $matches[2][0] is 5 and $matches[2][1] is 213.
If all three values exist, m will be in $matches[2][0], f in $matches[2][1], and y in $matches[2][2]. If any values are missing, the next value gets bumped up a spot. It may also come in handy that $matches[3] will hold an array of the corresponding letter matched on, so if you need to check whether it was an m, f, or y, you can.
If this isn't what you're after, please provide an example of the output you would like to see for this or another sample input.