Writing grammar rules in SML using regular expressions

I want to write a converter from iCalendar to CSV in SML, so I need to write grammar rules for it. I understand that certain rules can be written by defining them as datatypes. To begin with, I am having trouble writing rules for regular expressions (terminals).
As an example, I want to write the following regex in SML:
label → [a-zA-Z0-9-]+
Can anybody tell me how to write this rule in SML?
EDIT
So far, I have declared a datatype, variables, that denotes the various variables of the grammar.
datatype variables = Label of string
I have declared a function isLabel. It takes as input s (of type string) and returns Label(s) if s satisfies the given regex (checked by testing whether the ASCII values lie in the given ranges); otherwise it raises an exception. I have a feeling that I have found a way to solve this.
Other symbols/variables of the grammar can be defined similarly in the datatype variables.

See Unix Programming with Standard ML page 163+ for an example of SML/NJ's regular expression library in action.
Steps:
Add SML/NJ Library. In the smlnj REPL use:
CM.make "$/regexp-lib.cm"
Make a regular expression Engine:
structure RE = RegExpFn (structure P = AwkSyntax
structure E = BackTrackEngine)
Define label:
val label = RE.compileString "[a-zA-Z0-9-]+"
Define a target:
val target = "ab9A-f"
Match label against target:
val match = StringCvt.scanString (RE.find label) target
Extract values from match according to program logic.

Related

HTML tokenizer algorithm

I'm trying to write a basic HTML parser which doesn't tolerate errors. I was reading the HTML5 parsing algorithm, but it's just too much information for a simple parser. I was wondering if someone had an idea of the logic for a basic tokenizer which would simply turn a small piece of HTML into a list of significant tokens. I'm more interested in the logic than the code.
std::string html = "<div id='test'> Hello <span>World</span></div>";
Tokenizer t;
t.tokenize(html);
So for the above html, I want to convert it to a list of something like this:
["<", "div", "id", "=", "test", ">", "Hello", "<", "span", ">", "World", "</", "span", ">", "</", "div", ">"]
I don't have anything for the tokenize method yet, but I was wondering whether iterating over the HTML character by character is the best way to build the list.
void Tokenizer::tokenize(std::string html){
    std::list<std::string> tokens;
    for(int i = 0; i < html.length();i++){
        char c = html[i];
        if(...){
            ...
        }
    }
}
I think what you are looking for is a lexical analyzer. Its goal is to extract all the tokens defined in your language, in this case HTML. As @IraBaxter said, you can use a lexer generator such as Lex, which is available on Linux and OS X; but you must define the rules and, for this, you need to use regular expressions.
But if you want to know about an algorithm for this problem, you can check the book by Keith D. Cooper & Linda Torczon (Engineering a Compiler), chapter 2, Scanners. This chapter talks about automata and how they can be used to create a scanner; it uses a table-driven scanner to get tokens, like you want.
The idea is that you define a DFA where you have:
A finite set of states in the recognizer, including a start state, accepting states and an error state.
An alphabet.
A transition function that determines whether a transition is valid, either by looking it up in a transition table or, if you don't want to use a table, by coding the automaton directly.
Take some time to study this chapter.
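A minimal sketch of such a table-driven scanner, written in Python for brevity. The state names, character classes and token names are made up for illustration (they do not come from the book), and it only understands names, whitespace and the characters < > / =, so quotes are rejected:

```python
def classify(ch):
    """Map a character to its class (the DFA's alphabet, grouped to keep the table small)."""
    if ch.isalnum():
        return "alnum"
    if ch.isspace():
        return "space"
    if ch in "<>/=":
        return ch
    return "other"

# Transition table: state -> {character class -> next state}.
# A missing entry stands for the error state.
TRANSITIONS = {
    "start": {"alnum": "name", "space": "ws", "<": "lt", ">": "gt", "=": "eq"},
    "name": {"alnum": "name"},
    "ws": {"space": "ws"},
    "lt": {"/": "lt_slash"},
}

# Accepting states and the (hypothetical) token type each one produces.
ACCEPTING = {
    "name": "NAME",
    "ws": "WS",
    "lt": "OPEN",
    "lt_slash": "CLOSE_OPEN",
    "gt": "GT",
    "eq": "EQ",
}

def scan(text):
    """Return a list of (token type, lexeme) pairs, dropping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        state = "start"
        last_accept = None  # (end offset, token type) of the longest match so far
        i = pos
        while i < len(text):
            nxt = TRANSITIONS.get(state, {}).get(classify(text[i]))
            if nxt is None:  # no valid transition: stop and fall back
                break
            state = nxt
            i += 1
            if state in ACCEPTING:
                last_accept = (i, ACCEPTING[state])
        if last_accept is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        end, kind = last_accept
        if kind != "WS":
            tokens.append((kind, text[pos:end]))
        pos = end
    return tokens
```

Calling scan("<div>Hello</div>") walks the table once per token; adding a token type means adding rows to the table rather than new branches in the loop.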
The other answers here are great, and you should definitely use a lexical-analyzer-generator like flex for the job. The input to such a generator is a list of rules that identify the different token types. An input file might look like this:
WHITE_SPACE [ \t\r\n]+
IDENTIFIER [a-zA-Z0-9_]+
LEFT_ANGLE "<"
The algorithm that flex uses is essentially:
Find the rule that matches the most text.
If two rules match the same length of text, choose the one that occurs earlier in the list of rules provided.
You could write this algorithm quite easily yourself using regular expressions. However, do remember that this will not be as fast as flex, since flex compiles the regular expressions away into a very fast DFA.
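A rough sketch of that longest-match, earliest-rule-wins loop using Python's re module. The rule list mirrors the three example rules; the function name tokenize and the error handling are my own assumptions, and this is not how flex is implemented internally:

```python
import re

# Rules in priority order: (token type, compiled pattern).
RULES = [
    ("WHITE_SPACE", re.compile(r"\s+")),
    ("IDENTIFIER", re.compile(r"[a-zA-Z0-9_]+")),
    ("LEFT_ANGLE", re.compile(r"<")),
]

def tokenize(text):
    """Split text into (token type, lexeme) pairs using longest match."""
    pos = 0
    tokens = []
    while pos < len(text):
        best = None  # (match length, token type)
        for kind, pattern in RULES:
            m = pattern.match(text, pos)
            # Ignore empty matches. Only a strictly longer match replaces the
            # current best, so on a tie the earlier rule is kept.
            if m and m.end() > pos and (best is None or m.end() - pos > best[0]):
                best = (m.end() - pos, kind)
        if best is None:
            raise ValueError(f"no rule matches at offset {pos}: {text[pos]!r}")
        length, kind = best
        tokens.append((kind, text[pos:pos + length]))
        pos += length
    return tokens
```

Every rule is tried at every position, which is the cost that a generated DFA avoids.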

custom regular expression parser

I would like to do regular expression matching on custom alphabets, using custom commands. The purpose is to investigate equations and expressions that appear in meteorology.
So, for example, my alphabet can be [p, rho, u, v, w, x, y, z, g, f, phi, t, T, +, -, /]. NOTE: rho and phi are multiple characters that should be treated as a single character.
I would also like to use custom commands, such as \v for a variable, i.e. not the arithmetic operators.
I would like to use other commands such as (\v). Note that the dot should match dx/dt, where x is a variable. Similarly, given p=p(x,y,z), p' would match dp/dx, dp/dy, and dp/dz, but not dp/df (it would be given somewhere that p = p(x,y,z)).
I would also like to be able to backtrack.
Now, I have investigated PCRE and Ragel with D. I see that the first two problems are solvable, with multiple-character objects defined as fixed objects and not as a character class.
However, how do I address the third?
I don't see either PCRE or Ragel admitting a way to use custom commands.
Moreover, since I would like to use backtracking, I am not sure if Ragel is the correct option, as this would need a stack, which means I would be using a CFG.
Is there perhaps a domain-specific language to build such regex/CFG machines (for 64-bit Linux, if that matters)?
Nothing is impossible. Just write a new class in your programming language that wraps a regex, and define a new syntax on top of it. It will be your personal regular expression syntax. For example:
result = latex_string.match("p'(x,y,z)", "full"); // match dp/dx, dp/dy, dp/dz
result = latex_string_array.match("p'(x,y,z)", "partial"); // match ∂p/∂x, ∂p/∂y, ∂p/∂z
. . .
The match method will interpret the new pseudo-regular expression inside your class and return the result in the desired form. The input definition can simply be a string and/or an array. In fact, if a function has to be matched by all of its derivatives, you can simplify the search notation to .match("p'").
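To make the idea concrete, here is a small Python sketch that translates "the derivatives of p with respect to x, y, z" into an ordinary regex. The function names derivative_pattern and find_derivatives are hypothetical, and only three spellings are covered (dp/dx, \frac{dp}{dx}, and the \partial or ∂ variants); real input would need more cases:

```python
import re

def derivative_pattern(func, variables):
    """Compile a regex for d<func>/d<var> in a few common spellings."""
    f = re.escape(func)
    v = "|".join(re.escape(name) for name in variables)
    d = r"(?:d|\\partial\s*|∂)"  # differential marker: d, \partial or ∂
    inline = rf"{d}{f}\s*/\s*{d}(?P<inline>{v})\b"       # e.g. dp/dx
    frac = rf"\\frac\{{{d}{f}\}}\{{{d}(?P<frac>{v})\}}"  # e.g. \frac{dp}{dx}
    return re.compile(rf"{inline}|{frac}")

def find_derivatives(func, variables, text):
    """Return the variables that func is differentiated by, in order of appearance."""
    pattern = derivative_pattern(func, variables)
    return [m.group("inline") or m.group("frac") for m in pattern.finditer(text)]
```

For instance, find_derivatives("p", ["x", "y", "z"], text) skips dp/df because f is not among the declared variables of p, which is the behaviour asked for in the question.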
One simple note: the same equation can be written with the source \mathrm{d}y=\frac{\mathrm{d}y}{\mathrm{d}t}\mathrm{d}t, or as dy=\frac{dy}{dt}dt, or finally as dy=(dy/dt)dt.
The problem with generalizing the meaning of LaTeX equations using regular expressions is the human input factor. It is just a notation, and the author can choose among various ways of writing it.
The best and most precise way is to analyze the formula content and build a computation tree. In this case, you will search not just for notations of differentials or derivatives, but for instructions to calculate differentials and derivatives. Either way, this requires detailed analysis of the formula string, covering the many ways it can be written.
One more thing, and good news for you! It's not necessary to define a magic multibyte Greek alphabet for your regex-LaTeX matching. UTF-8 has ρ (GREEK SMALL LETTER RHO), which you can use in the UI, while the search method treats it as \rho and simply uses the regex notation /\\frac{d\\rho}{dx}/.
One more example:
// search string (backslashes are doubled so the LaTeX commands survive in the string literal)
equation = "dU= \\left(\\frac{\\partial U}{\\partial S}\\right)_{V,\\{N_i\\}}dS+ \\left(\\frac{\\partial U}{\\partial V}\\right)_{S,\\{N_i\\}}dV+ \\sum_i\\left(\\frac{\\partial U}{\\partial N_i}\\right)_{S,V,\\{N_{j \\ne i}\\}}dN_i";
. . .
// user input by UI
. . .
// call method
equation.equation_match("U'");// example notation for all types of derivatives for all variables
. . .
// inside the 'equation_match' method you will use native regex methods
matches1 = equation.match(/dU/); // dU
matches2 = equation.match(/\\partial U/); // ∂U
etc.
return(matches);// combination of matches

how to ensure that a particular token is always added by the user in the input file?

input file:
parameter1 abc
parameter2 123
parameter3 xyz
If parameter2 is mandatory and the user forgets to define it, can yacc be used to report the missing variable?
I will expand on my comment and try to make a proper answer. yacc is a tool for doing syntactic analysis, that is, the analysis of the grammatical arrangement of words or tokens. Use a yacc-generated parser to recognize as valid a string of tokens like
a = b + 2
and to reject as invalid a string like
2 b a = +
The same tokens are present, but in a different, nongrammatical order.
Instead, a simple string-matching tool like grep, which uses simple regular expressions, seems to be the better choice for you. The regular expression
/^parameter2/
matches any line that starts with the string "parameter2", and the regular expression
/^parameter[0-9]\s*[0-9]+$/
matches any line that consists of a parameter numbered from 0 to 9, optional whitespace, and a string of digits. You have other options for matching across lines, matching case insensitively, and so on.
Now, if your particular problem includes validating type information for the values assigned to the parameters, e.g., parameter2 must take an integer, not a string, yacc might be useful. But, as I've written, I think it's a lot of apparatus to set up for what reads like a simple problem.
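As a sketch of the grep-style check, here is the same idea in Python. The function name has_parameter is hypothetical, and I am assuming the "name value" line format from the question's sample input:

```python
import re

def has_parameter(text, name):
    """Return True if some line starts with `name`, then blanks, then a value."""
    # [ \t]+ rather than \s+ so the match cannot run on into the next line.
    pattern = re.compile(rf"^{re.escape(name)}[ \t]+\S", re.MULTILINE)
    return pattern.search(text) is not None
```

With the sample file from the question, has_parameter(contents, "parameter2") is true; delete that line and it becomes false, which is all the check needs to report.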
You could create a syntactic rule that says parameter2 must appear exactly once in the input:
valid_file: opt_param_list param2 opt_param_list
;
The grammar would then only recognize as syntactically valid a file that contained a param2 somewhere.
However, what you're after is more of a semantic check than a syntactic check; you'd probably do better implementing the rule in the actions rather than in the grammar:
valid_file: opt_param_list
        { if (param2_specified())
              YYACCEPT;
          else
          {
              err_report("No specification for parameter2");
              YYABORT;
          }
        }
    ;
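The same split between syntax and semantics can be sketched outside yacc. In this Python illustration, parse_parameters plays the role of the grammar and validate plays the role of the action; both names, and the REQUIRED set, are assumptions made for the example:

```python
REQUIRED = {"parameter2"}  # assumed set of mandatory parameter names

def parse_parameters(text):
    """Syntactic step: every non-blank line must look like '<name> <value>'."""
    params = {}
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        parts = line.split(None, 1)
        if len(parts) != 2:
            raise ValueError(f"line {line_no}: expected '<name> <value>'")
        name, value = parts
        params[name] = value.strip()
    return params

def validate(params, required=REQUIRED):
    """Semantic step: the file parsed fine, now check that nothing mandatory is missing."""
    missing = sorted(set(required) - set(params))
    if missing:
        raise ValueError("No specification for " + ", ".join(missing))
    return params
```

Keeping the check out of the grammar means the error message can name the missing parameter instead of reporting a generic syntax error.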

Measure the "matching"?

Is there a mechanism to measure or compare how tightly a pattern corresponds to a given string? By pattern I mean a regex or something similar. For example, we have the string "foobar" and two regexes: "fooba." and ".*". Both patterns match the string. Is it possible to determine that "fooba." is a more appropriate pattern for the given string than ".*"?
There are metrics and heuristics for string 'distance'. See, for example, http://en.wikipedia.org/wiki/Edit_distance
Here is one random Java implementation that came up in a Google search.
http://www.merriampark.com/ldjava.htm
Some metrics are expensive to compute so look around and find one that fits your needs.
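For reference, a compact sketch of the classic Levenshtein edit distance in Python (this is the textbook dynamic-programming version, not the linked Java code). Note that it compares two strings, so it only helps here if you can turn each pattern into a representative string first:

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca by cb (or keep it)
            ))
        previous = current
    return previous[-1]
```

It runs in time proportional to len(a) * len(b) and keeps only two rows of the table in memory.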
As for your specific example, IIRC, regex alternation in Java tries the alternatives in order, left to right, so if you use something like
"(foobar)|(.*)", it will match the first one, and you can determine this by examining the results returned for the two capture groups.
How about this for an idea: Use the length of your regular expression: length("fooba.") > length(".*"), so "fooba." is more specific...
However, it depends on where the regular expressions come from and how precise you need to be: "fo.*|.*ba" is longer than "fooba.", so this heuristic will not always work.
What you're asking for isn't really a property of regular expressions.
Create an enum that measures "closeness", and create a class that will hold a given regex, and a closeness value. This requires you to determine which regex is considered "more close" than another.
Instantiate your various classes, and let them loose on your code, and compare the matched objects, letting the "most closeness" one rise to the top.
pseudo-code, without actually comparing anything, or resembling any sane language:
enum Closeness
Exact
PrettyClose
Decent
NotSoClose
WayOff
CouldBeAnything
mune
class RegexCloser
property Closeness Close()
property String Regex()
ssalc
var foo = new RegexCloser(Closeness := Exact, Regex := "foobar")
var bar = new RegexCloser(Closeness := CouldBeAnything, Regex := ".*")
var target = "foobar";
if Regex.Match(target, foo)
print String.Format("foo {0}", foo.Closeness)
fi
if Regex.Match(target, bar)
print String.Format("bar {0}", bar.Closeness)
fi
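If you want something that actually runs, the pseudo-code above could look roughly like this in Python. The enum members follow the pseudo-code; the function name closest and the use of full-string matching are my own assumptions:

```python
import re
from enum import IntEnum

class Closeness(IntEnum):
    # Lower value means a tighter pattern; the ranking is assigned by hand.
    EXACT = 0
    PRETTY_CLOSE = 1
    DECENT = 2
    NOT_SO_CLOSE = 3
    WAY_OFF = 4
    COULD_BE_ANYTHING = 5

def closest(target, ranked):
    """Return the (closeness, regex) pair with the best rank among those matching target."""
    hits = [(closeness, regex) for closeness, regex in ranked
            if re.fullmatch(regex, target)]
    return min(hits, key=lambda hit: hit[0], default=None)
```

The ranking still has to come from you; the code only picks the best-ranked pattern among those that match.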

Modify PL/SQL statement strings in C++

This is my use case: the input is a string representing an Oracle PL/SQL statement of arbitrary complexity. We may assume it's a single statement (not a script).
Now, several bits of this input string have to be rewritten.
E.g. table names need to be prefixed, and aggregate functions in the selection list that don't use a column alias should be assigned a default one:
SELECT SUM(ABS(x.value)),
TO_CHAR(y.ID,'111,111'),
y.some_col
FROM
tableX x,
(SELECT DISTINCT ID
FROM tableZ z
WHERE ID > 10) y
WHERE
...
becomes
SELECT SUM(ABS(x.value)) COL1,
TO_CHAR(y.ID,'111,111') COL2,
y.some_col
FROM
pref.tableX x,
(SELECT DISTINCT ID, some_col
FROM pref.tableZ z
WHERE ID > 10) y
WHERE
...
(Disclaimer: just to illustrate the issue, statement does not make sense)
Since aggregate functions might be nested and subSELECTs are a b_tch, I dare not use regular expressions. Well, actually I did and achieved 80% of success, but I do need the remaining 20%.
The right approach, I presume, is to use grammars and parsers.
I fiddled around with C++ ANTLR2 (although I do not know much about grammars or about parsing with such tools). I do not see an easy way to get the SQL bits:
list<string> *ssel = theAST.getSubSelectList(); // fantasy land
Could anybody maybe provide some pointers on how "parsing professionals" would pursue this issue?
EDIT: I am using Oracle 9i.
Maybe you can use this; it turns a select statement into an XML block:
declare
cl clob;
begin
dbms_lob.createtemporary (
cl,
true
);
sys.utl_xml.parsequery (
user,
'select e.deptno from emp e where deptno = 10',
cl
);
dbms_output.put_line (cl);
dbms_lob.freetemporary (cl);
end;
/
<QUERY>
<SELECT>
<SELECT_LIST>
<SELECT_LIST_ITEM>
<COLUMN_REF>
<SCHEMA>MICHAEL</SCHEMA>
<TABLE>EMP</TABLE>
<TABLE_ALIAS>E</TABLE_ALIAS>
<COLUMN_ALIAS>DEPTNO</COLUMN_ALIAS>
<COLUMN>DEPTNO</COLUMN>
</COLUMN_REF>
....
....
....
</QUERY>
See here: http://forums.oracle.com/forums/thread.jspa?messageID=3693276&#3693276
Now you 'only' need to parse this xml block.
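Parsing that block is not much work with a standard XML library. A sketch in Python, assuming the element names shown above; SAMPLE is trimmed to just those elements, since the real output contains more that is elided here:

```python
import xml.etree.ElementTree as ET

# Trimmed, hand-completed sample containing only the elements shown above.
SAMPLE = """<QUERY>
  <SELECT>
    <SELECT_LIST>
      <SELECT_LIST_ITEM>
        <COLUMN_REF>
          <SCHEMA>MICHAEL</SCHEMA>
          <TABLE>EMP</TABLE>
          <TABLE_ALIAS>E</TABLE_ALIAS>
          <COLUMN_ALIAS>DEPTNO</COLUMN_ALIAS>
          <COLUMN>DEPTNO</COLUMN>
        </COLUMN_REF>
      </SELECT_LIST_ITEM>
    </SELECT_LIST>
  </SELECT>
</QUERY>"""

def column_refs(xml_text):
    """Return one dict per COLUMN_REF element, mapping child tag to its text."""
    root = ET.fromstring(xml_text)
    return [{child.tag: child.text for child in ref}
            for ref in root.iter("COLUMN_REF")]
```

From there, column_refs(SAMPLE)[0]["TABLE"] gives the table name, which is the piece the question wants to prefix.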
Edit1:
Sadly I don't fully understand the needs of the OP, but I hope this can help (it is another way of asking for the 'names' of the columns of, for example, the query select count(*),max(dummy) from dual):
set serveroutput on
DECLARE
c NUMBER;
d NUMBER;
col_cnt PLS_INTEGER;
f BOOLEAN;
rec_tab dbms_sql.desc_tab;
col_num NUMBER;
PROCEDURE print_rec(rec in dbms_sql.desc_rec) IS
BEGIN
dbms_output.new_line;
dbms_output.put_line('col_type = ' || rec.col_type);
dbms_output.put_line('col_maxlen = ' || rec.col_max_len);
dbms_output.put_line('col_name = ' || rec.col_name);
dbms_output.put_line('col_name_len = ' || rec.col_name_len);
dbms_output.put_line('col_schema_name= ' || rec.col_schema_name);
dbms_output.put_line('col_schema_name_len= ' || rec.col_schema_name_len);
dbms_output.put_line('col_precision = ' || rec.col_precision);
dbms_output.put_line('col_scale = ' || rec.col_scale);
dbms_output.put('col_null_ok = ');
IF (rec.col_null_ok) THEN
dbms_output.put_line('True');
ELSE
dbms_output.put_line('False');
END IF;
END;
BEGIN
c := dbms_sql.open_cursor;
dbms_sql.parse(c,'select count(*),max(dummy) from dual ',dbms_sql.NATIVE);
dbms_sql.describe_columns(c, col_cnt, rec_tab);
for i in rec_tab.first..rec_tab.last loop
print_rec(rec_tab(i));
end loop;
dbms_sql.close_cursor(c);
END;
/
(See here for more info: http://www.psoug.org/reference/dbms_sql.html)
The OP also wants to be able to change the schema name of the table in a query. I think the easiest way to achieve that is to query the table names from user_tables, search the SQL statement for those table names and prefix them, or to do an 'alter session set current_schema = ....'.
If the source of the SQL statement strings is other coders, you could simply insist that the parts that need changing be marked by special escape conventions, e.g., write $TABLE instead of the table name, or $TABLEPREFIX where one is needed. Then finding the places that need patching can be accomplished with a substring search and replacement.
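A sketch of that substring replacement in Python; the function name fill_placeholders and the example values are assumptions, while the placeholder names are the ones suggested above:

```python
def fill_placeholders(sql, values):
    """Replace each placeholder (e.g. "$TABLEPREFIX") with its value."""
    # Longer names first, so "$TABLE" cannot clobber the start of "$TABLEPREFIX".
    for name in sorted(values, key=len, reverse=True):
        sql = sql.replace(name, values[name])
    return sql
```

For example, fill_placeholders("SELECT x FROM $TABLEPREFIXtableX x", {"$TABLEPREFIX": "pref."}) yields a statement that reads FROM pref.tableX x.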
If you really have arbitrary SQL strings and cannot get them nicely marked, you need to somehow parse the SQL string as you have observed. The XML solution certainly is one possible way.
Another way is to use a program transformation system. Such a tool can parse a string for a language instance, build ASTs, carry out analysis and transformation on ASTs, and then spit out a revised string.
The DMS Software Reengineering Toolkit is such a system. It has a PLSQL front-end parser. And it can use pattern-directed transformations to accomplish the rewrites you appear to need. For your example involving select items:
domain PLSQL.
rule use_explicit_column(e: expression):select_item -> select_item
"\e" -> "\e \column\(\e\)".
To read the rule, you need to understand that the stuff inside quote marks represents abstract trees in some computer language which we want to manipulate. What the "domain PLSQL" phrase says is, "use the PLSQL parser" to process the quoted string content, which is how it knows how to parse it. (DMS has lots of language parsers to choose from.) The terms "expression" and "select_item" are grammatical constructs from the language of interest, e.g., PLSQL in this case. See the railroad diagrams in your PLSQL reference manual.
The backslash represents escape/meta information rather than target language syntax.
What the rule says is: transform those parsed elements which are select_items that are composed solely of an expression \e, by converting each into a select_item consisting of the same expression \e and the corresponding column ( \column(\e) ), presumably based on position in the select item list for the specific table. You'd have to implement a column function that can determine the corresponding name from the position of the select item. In this example, I've chosen to define the column function to accept the expression of interest as argument; the expression is actually passed as the matched tree, and thus the column function can determine where it is in the select_items list by walking up the abstract syntax tree.
This rule handles just the select items. You'd add more rules to handle the other various cases of interest to you.
What the transformation system does for you is:
parse the language fragment of interest
build an AST
let you pattern match for places of interest (by doing AST pattern matching), but using the surface syntax of the target language
replace matched patterns by other patterns
compute arbitrary replacements (as ASTs)
regenerate source text from the modified ASTs.
While writing the rules isn't always trivial, it is what is necessary if your problem is stated as posed.
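To give a feel for the parse / match / replace / regenerate cycle on a toy scale, here is a Python sketch that handles only a select list. It is not DMS and not a SQL parser: it splits on top-level commas, treats any item ending in ")" as an expression without an alias, and the COLn naming is taken from the question's example. Both function names are hypothetical:

```python
def split_top_level(select_list):
    """Split on commas that sit outside parentheses and single-quoted strings."""
    items, depth, in_quote, start = [], 0, False, 0
    for i, ch in enumerate(select_list):
        if ch == "'":
            in_quote = not in_quote
        elif in_quote:
            continue  # characters inside a string literal are never structural
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 0:
            items.append(select_list[start:i].strip())
            start = i + 1
    items.append(select_list[start:].strip())
    return items

def add_default_aliases(select_list):
    """Append COL1, COL2, ... to items that end in ')' and so carry no alias."""
    out = []
    counter = 0
    for item in split_top_level(select_list):
        if item.endswith(")"):
            counter += 1
            item = f"{item} COL{counter}"
        out.append(item)
    return ", ".join(out)
```

Even this toy version has to track nesting depth and string literals, which is exactly where plain regular expressions give up; a real tool does the same job against the full grammar.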
The suggested XML solution is another way to build such ASTs. It doesn't have the nice pattern-matching properties, although you may be able to get a lot out of XSLT. What I don't know is whether the XML has the parse tree in complete detail; the DMS parser does provide this by design, as it is needed if you want to do arbitrary analysis and transformation.