parse user agent with Ragel - regex

I have the following line from apache (changing the apache log format is not an option):
... referrer=- user_agent=JSON-RPC Client status=200 size=44 ...
and I am trying to parse it with Ragel. I am getting all the fields extracted except the user_agent. The user guide says that strong difference -- ensures that the first machine does NOT contain the second. In my case, I want to match anything that does not contain " status=", which would signify the next field. However, my current definition (below) appears to be skipping user_agent entirely; I still get status and following fields. Am I utilizing strong difference properly?
...
referrer = ^space+ >mark %{ emit("referrer"); };
user_agent = any* -- ' status=' >mark %{ emit("user_agent"); };
status = digit+ >mark %{ emit("status"); };
...
line = (
...
space "referrer="
referrer
space "user_agent="
user_agent
space "status="
status
space "size="
...
);

I know this question is old, but I'm just getting into Ragel and this may help others.
The problem here is operator precedence. The action operators > and % are evaluated before the strong difference operator --. It's easy to fix with parentheses:
user_agent = (any* -- ' status=') >mark %{ emit("user_agent"); };
I would also use space instead of hard-coding a space, at least for consistency:
user_agent = (any* -- (space 'status=')) >mark %{ emit("user_agent"); };
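To sanity-check what the parenthesized expression is meant to capture, here is a rough Python sketch (not Ragel) of the same idea: take everything after user_agent= up to, but not including, the first " status=". The names USER_AGENT_RE and extract_user_agent are made up for this illustration.

```python
import re

# Non-greedy capture that stops at the first " status=". This mirrors the
# intent of (any* -- ' status=') followed by space "status=" in the grammar.
USER_AGENT_RE = re.compile(r"user_agent=(?P<user_agent>.*?) status=(?P<status>\d+)")

def extract_user_agent(line):
    """Return (user_agent, status), or None if the fields are not found."""
    m = USER_AGENT_RE.search(line)
    if m is None:
        return None
    return m.group("user_agent"), m.group("status")
```

Calling extract_user_agent on the sample line from the question returns the pair ('JSON-RPC Client', '200'), so the embedded space in the user agent is kept.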

You can use state charts, try this state machine:
STATUS := digit+ ' ' @{ fnext SIZE; };
SIZE := digit+;
USER_AGENT =
start: (
's' -> s1 |
[^s] @{ if (!ua_start) ua_start = p; } -> start
),
s1: (
't' -> s2 |
[^t] -> start
),
s2: (
'a' -> s3 |
[^a] -> start
),
s3: (
't' -> s4 |
[^t] -> start
),
s4: (
'u' -> s5 |
[^u] -> start
),
s5: (
's' -> s6 |
[^s] -> start
),
s6: (
'=' @{ fnext STATUS; ua_stop = p - 7; } -> final |
[^=] -> start
)
;
main := "user_agent=" USER_AGENT ;
You can then read the user agent from the range "ua_start" -> "ua_stop". For the other fields, I assume you already know how to mark and extract the characters.
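The hand-written state chart above can also be pictured as a character-by-character scanner. The following is a minimal Python sketch of that technique, not a translation of the Ragel code: it assumes the delimiter is " status=" (with the leading space), and DELIM and scan_user_agent are hypothetical names.

```python
DELIM = " status="

def scan_user_agent(text, start=0):
    """Return the text from start up to the first DELIM, or None if absent."""
    matched = 0  # how many characters of DELIM have been matched so far
    for i in range(start, len(text)):
        ch = text[i]
        if ch == DELIM[matched]:
            matched += 1
            if matched == len(DELIM):
                # i is the index of '='; the value ends where DELIM begins.
                return text[start:i - len(DELIM) + 1]
        else:
            # The space occurs only at the front of DELIM, so after a
            # mismatch we either restart on this space or start over.
            matched = 1 if ch == DELIM[0] else 0
    return None
```

The fallback on a mismatch is only this simple because the first character of the delimiter does not repeat inside it; a delimiter with internal repeats would need a proper failure table.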

Related

Defining new POSIX-like character class names in C/C++

I'm currently working on a project in C++ that uses regexes to implement HTTP RFC rules. In RFC 1945, Chapter 2.2 - Basic Rules, there are the following rules:
CHAR = <any US-ASCII character (octets 0 - 127)>
CTL = <any US-ASCII control character (octets 0 - 31) and DEL (127)>
CR = <US-ASCII CR, carriage return (13)>
LF = <US-ASCII LF, linefeed (10)>
SP = <US-ASCII SP, space (32)>
HT = <US-ASCII HT, horizontal-tab (9)>
CRLF = CR LF
LWS = [CRLF] 1*( SP | HT )
word = token | quoted-string
token = 1*<any CHAR except CTLs or tspecials>
tspecials = "(" | ")" | "<" | ">" | "@"
| "," | ";" | ":" | "\" | <">
| "/" | "[" | "]" | "?" | "="
| "{" | "}" | SP | HT
quoted-string = ( <"> *(qdtext) <"> )
qdtext = <any CHAR except <"> and CTLs, but including LWS>
What I'm interested in is the usage of character classes like [:digit:] or at least recycling the regex. The pseudo-code would become something like this (\ is already escaped and regex is already in string form):
CHAR = "\\x00-\\x7F"
CTL = "\\x00-\\x1F\\x7F"
CR = "\\r"
LF = "\\n"
SP = "\\x20"
HT = "\\t"
//here I start "recycling" old regexes
CRLF = "[:CR:][:LF:]"
LWS = "[:CRLF:]* ( [:SP:] | [:HT:] )+"
//here a declaration might happen before using the token or quoted_string class
word = "[:token:] | [:quoted_string:]"
token = "( (?= [[:CHAR:]] ) [^[:tspecial:][:CTL:]] )+"
tspecials = "()<>@,;:\\\\\\"/\\[\\]?={}[:SP:][:HT:]"
quoted_string = " ( \\" ([:qdtext:])* \\" ) "
//Little trick to allow LWS but not CTLs: https://stackoverflow.com/a/18017758/9373031
qdtext = "(?=[[:CHAR:]]) ( [:LWS:] | [^\\"[:CTL:]] )"
What I have tried so far is to store them as strings and then chain them together with +, but that looked ugly and was not very optimized. Of course I could repeat some regexes, but it became an enormous monster the further I went.
I tried googling for a while, but I found nothing about adding custom POSIX-like classes, nor anything about recycling (and optimizing?) regexes.
What I need to do is to optimize and prettify regex originating string such that they could be parsed into a new one as POSIX-like classes or in some other way (code in C/C++):
std::regex CR ("\\r");
std::regex LF ("\\n");
std::regex CRLF ("[:CR:] [:LF:]");
Option 1:
[:CR:] [:LF:] would be expanded to \\r \\n and at compilation would become: std::regex CRLF ("\\r \\n");
Option 2:
[:CR:] [:LF:] would be "expanded" as "two functions" to optimize regex at run-time.
So far I found std::ctype_base has the static methods used for classnames in the std::regex_traits<CharT>::lookup_classname function, that should be used for finding defined classnames: is it possible to extend the masks used?
You need some kind of metalanguage and a compiler for it. This is not a task for the C++ preprocessor, the compiler's constant folding, or other compile-time features alone.
With the metalanguage you describe your variant of extended RE. Your compiler then parses that description and generates input for the main project: either just a set of strings to be used as input for a conventional RE engine, or something smarter and more complex.
Tools for this task exist: http://www.nongnu.org/bnf/, flex/bison, etc. They let you not only produce a set of RE strings, but also create a whole parser for your metalanguage (you asked about optimization), if such an approach is allowed in your project.
Or you can write your own parser from scratch.
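As an illustration of the "write your own" route, here is a very small Python sketch of such a metalanguage compiler: named rules may refer to each other as [:NAME:], and a recursive expansion step turns them into one plain regex string. The rule table, PLACEHOLDER and expand are assumptions made for this example. The sketch only handles references outside bracket expressions, so a rule such as [^[:CTL:]] would need extra handling.

```python
import re

# Hypothetical rule table; each value may reference other rules as [:NAME:].
RULES = {
    "CR": r"\r",
    "LF": r"\n",
    "SP": r"\x20",
    "HT": r"\t",
    "CRLF": r"[:CR:][:LF:]",
    "LWS": r"(?:[:CRLF:])?(?:[:SP:]|[:HT:])+",
}

PLACEHOLDER = re.compile(r"\[:([A-Za-z_]+):\]")

def expand(name, rules=RULES, _stack=()):
    """Recursively replace [:NAME:] references with the referenced pattern."""
    if name in _stack:
        raise ValueError("cyclic rule: " + name)

    def substitute(m):
        ref = m.group(1)
        if ref not in rules:
            return m.group(0)  # unknown names are left untouched
        # Wrap in a non-capturing group so quantifiers apply to the whole rule.
        return "(?:" + expand(ref, rules, _stack + (name,)) + ")"

    return PLACEHOLDER.sub(substitute, rules[name])
```

The expanded string can be handed to any conventional regex engine; for example re.fullmatch(expand("LWS"), "\r\n \t") succeeds, while a bare "\r\n" does not match LWS.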

Retrieve skipped white space In antlr4 parser from listener

I'm trying to construct an object from a parsed message.
I'm using Antlr4 and C++
My issue is that I need to skip white spaces during lexing/parsing but then I have to get them back when I construct my message object in the Listener.
Here's my grammar
grammar MessageTest;
WS: ('\t' | ' ' | '\r' | '\n' )+ -> skip;
message:
messageInfo
startOfMessage
messageText+
| EOF;
messageInfo:
senderName
filingTime
receiverName
;
senderName: WORD;
filingTime: DIGITS;
receiverName: WORD;
messageText: ( WORD | DIGITS | ALLOWED_SYMBOLS)+;
startOfMessage: START_OF_MESSAGE_SYMBOL ;
START_OF_MESSAGE_SYMBOL:':';
WORD: LETTER+;
DIGITS: DIGIT+;
LPAREN: '(';
RPAREN: ')';
ALLOWED_SYMBOLS: '-'| '.' | ',' | '/' | '+' | '?';
fragment LETTER: [A-Z];
fragment DIGIT: [0-9];
So this grammar works well, my parsing tree is correct for the following message example: JOHN0120JANE:HI HOW ARE YOU?
I get this parse tree:
message (
messageInfo (
senderName (
"JOHN"
)
filingTime (
"0120"
)
receiverName (
"JANE"
)
)
startOfMessage (
":"
)
messageText (
"HI"
"HOW"
"ARE"
"YOU"
"?"
)
)
The problem is when I'm trying to retrieve the whole messageText as:
HI HOW ARE YOU? I instead get HIHOWAREYOU? from the MessageTextContext
What am I doing wrong?
The getText() retrieval functions never consider skipped or hidden tokens. But it's easy to get the original text of your input (even just a range that corresponds to a specific parse rule), by using the indexes stored in the generated tokens. Parse rule contexts contain a start and an end node, so it's easy to go from the context to the original input like this:
std::string MySQLRecognizerCommon::sourceTextForContext(ParserRuleContext *ctx, bool keepQuotes) {
return sourceTextForRange(ctx->start, ctx->stop, keepQuotes);
}
//----------------------------------------------------------------------------------------------------------------------
std::string MySQLRecognizerCommon::sourceTextForRange(tree::ParseTree *start, tree::ParseTree *stop, bool keepQuotes) {
Token *startToken = antlrcpp::is<tree::TerminalNode *>(start) ? dynamic_cast<tree::TerminalNode *>(start)->getSymbol()
: dynamic_cast<ParserRuleContext *>(start)->start;
Token *stopToken = antlrcpp::is<tree::TerminalNode *>(stop) ? dynamic_cast<tree::TerminalNode *>(stop)->getSymbol()
: dynamic_cast<ParserRuleContext *>(stop)->stop;
return sourceTextForRange(startToken, stopToken, keepQuotes);
}
//----------------------------------------------------------------------------------------------------------------------
std::string MySQLRecognizerCommon::sourceTextForRange(Token *start, Token *stop, bool keepQuotes) {
CharStream *cs = start->getTokenSource()->getInputStream();
size_t stopIndex = stop != nullptr ? stop->getStopIndex() : std::numeric_limits<size_t>::max();
std::string result = cs->getText(misc::Interval(start->getStartIndex(), stopIndex));
if (keepQuotes || result.size() < 2)
return result;
char quoteChar = result[0];
if ((quoteChar == '"' || quoteChar == '`' || quoteChar == '\'') && quoteChar == result.back()) {
if (quoteChar == '"' || quoteChar == '\'') {
// Replace any double occurrence of the quote char by a single one.
replaceStringInplace(result, std::string(2, quoteChar), std::string(1, quoteChar));
}
return result.substr(1, result.size() - 2);
}
return result;
}
This code is tailored towards use with MySQL (e.g. wrt. quoting characters), but is easy to adapt for any other use case. The essential part is to use the tokens (e.g. taken from a parse rule context) and get the original input from the character input stream.
Code taken from the MySQL Workbench code base.
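The underlying idea is independent of ANTLR and C++. As a toy Python sketch (the Token type, tokenize, joined_text and source_text are invented for this illustration and are not part of the ANTLR runtime): every token remembers its start and stop offsets in the input, so slicing the original input between the first and last token brings back the skipped whitespace.

```python
import re
from typing import NamedTuple

class Token(NamedTuple):
    text: str
    start: int  # offset of the first character in the input
    stop: int   # offset of the last character (inclusive)

# Words, digits and a few symbols; whitespace matches nothing and is skipped.
TOKEN_RE = re.compile(r"[A-Z]+|[0-9]+|[-.,/+?:]")

def tokenize(source):
    return [Token(m.group(), m.start(), m.end() - 1) for m in TOKEN_RE.finditer(source)]

def joined_text(tokens):
    """Roughly what getText() gives: token texts glued together."""
    return "".join(t.text for t in tokens)

def source_text(source, first, last):
    """Original input between two tokens, including skipped characters."""
    return source[first.start:last.stop + 1]
```

With the input "JOHN0120JANE:HI HOW ARE YOU?", the tokens after the colon join to "HIHOWAREYOU?", while source_text over the same tokens returns "HI HOW ARE YOU?".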
Seems like you want Lexical Modes.
The idea of using them is simple: when your lexer encounters START_OF_MESSAGE_SYMBOL, it has to switch its context where only one token is possible, let's say MESSAGE_TEXT token.
Once this token has been determined, the lexer's mode switches back to its default mode.
To do this you should first split your grammar into two parts, a lexer grammar and a parser grammar, since lexical modes are not allowed in a combined grammar. And then you can use
pushMode() and popMode() commands.
Here's an example:
MessageTestLexer.g4
lexer grammar MessageTestLexer;
WS: ('\t' | ' ' | '\r' | '\n' )+ -> skip;
START_OF_MESSAGE_SYMBOL:':' -> pushMode(MESSAGE_MODE); //pushing MESSAGE_MODE when START_OF_MESSAGE_SYMBOL is encountered
WORD: LETTER+;
DIGITS: DIGIT+;
LPAREN: '(';
RPAREN: ')';
ALLOWED_SYMBOLS: '-'| '.' | ',' | '/' | '+' | '?';
fragment LETTER: [A-Z];
fragment DIGIT: [0-9];
mode MESSAGE_MODE; //tokens below are related to MESSAGE_MODE only
MESSAGE_TEXT: ~('\r'|'\n')+; //consuming any characters until the end of the line (+ rather than *, so the token is never empty). You can provide your own rule
END_OF_THE_LINE: ('\r'|'\n') -> popMode; //switching back to the default mode
MessageTestParser.g4
parser grammar MessageTestParser;
options {
tokenVocab=MessageTestLexer; //declaring which lexer rules to use in this parser
}
message:
messageInfo
startOfMessage
MESSAGE_TEXT //use the token instead
| EOF;
messageInfo:
senderName
filingTime
receiverName
;
senderName: WORD;
filingTime: DIGITS;
receiverName: WORD;
startOfMessage: START_OF_MESSAGE_SYMBOL;
P.S. I did not test these grammars, but they should work.
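To show what the mode switch does, here is a small Python sketch of a two-mode lexer. It only imitates the behaviour described above and does not use ANTLR; the token names follow the grammar, while lex and the two pattern tables are hypothetical. Unlike the grammar, the sketch drops the end-of-line token instead of emitting it.

```python
import re

# Default mode: whitespace is skipped, ':' switches to message mode.
DEFAULT_RE = re.compile(
    r"(?P<WS>[ \t\r\n]+)|(?P<START_OF_MESSAGE_SYMBOL>:)"
    r"|(?P<WORD>[A-Z]+)|(?P<DIGITS>[0-9]+)"
)
# Message mode: the rest of the line is a single token, whitespace included.
MESSAGE_RE = re.compile(r"(?P<MESSAGE_TEXT>[^\r\n]+)|(?P<END_OF_THE_LINE>[\r\n])")

def lex(source):
    tokens = []
    mode = "DEFAULT"
    pos = 0
    while pos < len(source):
        pattern = DEFAULT_RE if mode == "DEFAULT" else MESSAGE_RE
        m = pattern.match(source, pos)
        if m is None:
            raise ValueError(f"unexpected character at offset {pos}")
        kind = m.lastgroup
        pos = m.end()
        if kind == "WS":
            continue                  # skipped, like "-> skip"
        if kind == "END_OF_THE_LINE":
            mode = "DEFAULT"          # like popMode
            continue
        if kind == "START_OF_MESSAGE_SYMBOL":
            mode = "MESSAGE"          # like pushMode(MESSAGE_MODE)
        tokens.append((kind, m.group()))
    return tokens
```

For "JOHN0120JANE:HI HOW ARE YOU?" the last token is ("MESSAGE_TEXT", "HI HOW ARE YOU?"), with the spaces preserved because whitespace is only skipped in the default mode.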

Regex: How to Implement Negative Lookbehind in PL/SQL

How do I match all the strings that begin with lookup. and end with _id, but where the part in between is not msg? Here are some examples:
lookup.asset_id -> should match
lookup.msg_id -> shouldn't match
lookup.whateverelse_id -> should match
I know Oracle does not support negative lookbehind (i.e. (?<!))... so I've tried to explicitly enumerate the possibilities using alternation:
if regexp_count('i_asset := lookup.asset_id;', 'lookup\.[^\(]+([^m]|m[^s]|ms[^g])_id') <> 0 then
dbms_output.put_line('match'); -- this matches as expected
end if;
if regexp_count('i_msg := lookup.msg_id;', 'lookup\.[^\(]+([^m]|m[^s]|ms[^g])_id') <> 0 then
dbms_output.put_line('match'); -- this shouldn't match
-- but it does like the previous example... why?
end if;
The second regexp_count expression shouldn't match... but it does, like the first one. Am I missing something?
EDIT
In the real use case, I have a string that contains PL/SQL code that might contain more than one lookup.xxx_id instance:
declare
l_source_code varchar2(2048) := '
...
curry := lookup.curry_id(key_val => ''CHF'', key_type => ''asset_iso'');
asset := lookup.asset_id(key_val => ''UBSN''); -- this is wrong since it does
-- not specify key_type
...
msg := lookup.msg_id(key_val => ''hello''); -- this is fine since msg_id does
-- not require key_type
';
...
end;
I need to determine whether there is at least one wrong lookup, i.e. all occurrences, except lookup.msg_id, must also specify the key_type parameter.
With lookup\.[^\(]+([^m]|m[^s]|ms[^g])_id, you are basically asking to check for a string
starting with lookup. denoted by lookup\.,
followed by at least one character different from ( denoted by [^\(]+,
followed by either -- ( | | )
one character different from m -- [^m], or
two characters: m plus no s -- m[^s], or
three characters: ms and no g -- ms[^g], and
ending in _id denoted by _id.
So, for lookup.msg_id, the first part matches obviously, the second consumes ms, and leaves the g for the first alternative of the third.
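The split can be made visible with a short Python sketch. This assumes Python's re engine treats this particular pattern the same way Oracle does for these inputs, which holds here because the strings contain neither a backslash nor a parenthesis; explain is a hypothetical helper.

```python
import re

# The pattern from the question, unchanged.
PATTERN = re.compile(r"lookup\.[^\(]+([^m]|m[^s]|ms[^g])_id")

def explain(s):
    """Return (text taken by the [^(]+ part, text taken by the alternation)."""
    m = PATTERN.search(s)
    if m is None:
        return None
    prefix = s[m.start() + len("lookup."):m.start(1)]
    return prefix, m.group(1)
```

For "lookup.msg_id" this returns ("ms", "g"): the negated class swallows ms, and the lone g satisfies the [^m] alternative, which is why the string matches.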
This could be fixed by patching up the third part to be always three characters long, like lookup\.[^\(]+([^m]..|m[^s].|ms[^g])_id. This, however, would fail for every name where the part between lookup. and _id is not at least four characters long:
WITH
Input (s, r) AS (
SELECT 'lookup.asset_id', 'should match' FROM DUAL UNION ALL
SELECT 'lookup.msg_id', 'shouldn''t match' FROM DUAL UNION ALL
SELECT 'lookup.whateverelse_id', 'should match' FROM DUAL UNION ALL
SELECT 'lookup.a_id', 'should match' FROM DUAL UNION ALL
SELECT 'lookup.ab_id', 'should match' FROM DUAL UNION ALL
SELECT 'lookup.abc_id', 'should match' FROM DUAL
)
SELECT
r, s, INSTR(s, 'lookup.msg_id') has_msg, REGEXP_COUNT(s , 'lookup\.[^\(]+([^m]..|m[^s].|ms[^g])_id') matched FROM Input
;
| R | S | HAS_MSG | MATCHED |
|-----------------|------------------------|---------|---------|
| should match | lookup.asset_id | 0 | 1 |
| shouldn't match | lookup.msg_id | 1 | 0 |
| should match | lookup.whateverelse_id | 0 | 1 |
| should match | lookup.a_id | 0 | 0 |
| should match | lookup.ab_id | 0 | 0 |
| should match | lookup.abc_id | 0 | 0 |
If you just have to make sure that there is no msg in the position in question, you might want to go for
(INSTR(s, 'lookup.msg_id') = 0) AND REGEXP_COUNT(s, 'lookup\.[^\(]+_id') <> 0
For code clarity REGEXP_INSTR(s, 'lookup\.[^\(]+_id') > 0 might be preferable…
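The same two-step check, sketched in Python for quick experimentation outside the database (is_wanted and LOOKUP_RE are names invented here): first rule out the literal lookup.msg_id, then look for any lookup.xxx_id.

```python
import re

LOOKUP_RE = re.compile(r"lookup\.[^(]+_id")

def is_wanted(s):
    """True if s contains a lookup.xxx_id and does not contain lookup.msg_id."""
    return "lookup.msg_id" not in s and LOOKUP_RE.search(s) is not None
```

Unlike the padded three-character alternation, this accepts short names such as lookup.a_id. It judges the whole string at once, though, so a string containing both a msg_id call and another lookup would be rejected.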
@j3d Just comment if further detail is required.
With the requirements still being kind of vague…
Split the string at the semicolon.
Check each substring s to comply:
WITH Input (s) AS (
SELECT ' curry := lookup.curry_id(key_val => ''CHF'', key_type => ''asset_iso'');' FROM DUAL UNION ALL
SELECT 'curry := lookup.curry_id(key_val => ''CHF'', key_type => ''asset_iso'');' FROM DUAL UNION ALL
SELECT 'asset := lookup.asset_id(key_val => ''UBSN'');' FROM DUAL UNION ALL
SELECT 'msg := lookup.msg_id(key_val => ''hello'');' FROM DUAL
)
SELECT
s
FROM Input
WHERE REGEXP_LIKE(s, '^\s*[a-z]+\s+:=\s+lookup\.msg_id\(key_val => ''[a-zA-Z0-9]+''\);$')
OR
((REGEXP_INSTR(s, '^\s*[a-z]+\s+:=\s+lookup\.msg_id') = 0)
AND (REGEXP_INSTR(s, '[(,]\s*key_type') > 0)
AND (REGEXP_INSTR(s,
'^\s*[a-z]+\s+:=\s+lookup\.[a-z]+_id\(( ?key_[a-z]+ => ''[a-zA-Z_]+?'',?)+\);$') > 0))
;
| S |
|--------------------------------------------------------------------------|
|[tab] curry := lookup.curry_id(key_val => 'CHF', key_type => 'asset_iso');|
| curry := lookup.curry_id(key_val => 'CHF', key_type => 'asset_iso');|
| msg := lookup.msg_id(key_val => 'hello');|
This would tolerate a superfluous comma right before the closing parenthesis. But if the input is syntactically correct, such a comma won't exist.
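The split-and-check idea can be prototyped in Python before committing to SQL. This is only a sketch under simplifying assumptions: statements are split on every semicolon (so semicolons inside string literals would confuse it), lookup names are lowercase letters only, and wrong_lookups and LOOKUP_CALL are hypothetical names.

```python
import re

# Captures the name between "lookup." and "_id" and the raw argument list.
LOOKUP_CALL = re.compile(r"lookup\.([a-z]+)_id\s*\(([^)]*)\)")

def wrong_lookups(source):
    """Return the lookup calls (other than msg_id) that lack key_type."""
    bad = []
    for statement in source.split(";"):
        for m in LOOKUP_CALL.finditer(statement):
            name, args = m.group(1), m.group(2)
            if name != "msg" and not re.search(r"\bkey_type\s*=>", args):
                bad.append(f"lookup.{name}_id")
    return bad
```

An empty result means every lookup in the source either is msg_id or specifies key_type; for the sample source in the question the function reports only lookup.asset_id.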

Regex Classic ASP

I've currently got a string which contains a URL, and I need to get the base URL.
The string I have is http://www.test.com/test-page/category.html
I am looking for a RegEx that will effectively remove any page/folder names at the end. The issue is that some people may enter the domain in the following formats:
http://www.test.com
www.test.co.uk/
www.test.info/test-page.html
www.test.gov/test-folder/test-page.html
It must return http://www.websitename.ext/ each time i.e. the domain name and extension (e.g. .info .com .co.uk etc) with a forward slash at the end.
Effectively it needs to return the base URL, without any page/folder names. Is there any easy way to do this with a Regular Expression?
Thanks.
My approach: Use a RegEx to extract the domain name. Then add http:// to the front and / to the end. Here's the RegEx:
^(?:http:\/\/)?([\w_]+(?:\.[\w_]+)+)(?=(?:\/|$))
Also see this answer to the question Extract root domain name from string. (It left me somewhat dissatisfied, although it pointed out the need to account for https, the port number, and user authentication info, which my RegEx does not do.)
Here is an implementation in VBScript. I put the RegEx in a constant and defined a function named GetDomainName(). You should be able to incorporate that function in your ASP page like this:
normalizedUrl = "http://" & GetDomainName(url) & "/"
You can also test my script from the command prompt by saving the code to a file named test.vbs and then passing it to cscript:
cscript test.vbs
Test Program
Option Explicit
Const REGEXPR = "^(?:http:\/\/)?([\w_]+(?:\.[\w_]+)+)(?=(?:\/|$))"
' ^^^^^^^^^ ^^^^^^ ^^^^^^^^^^ ^^^^
' A B1 B2 C
'
' A - An optional 'http://' scheme
' B1 - Followed by one or more word characters (letters, digits, underscore)
' B2 - Followed by one or more occurrences of a string
' that begins with a period that is followed by
' one or more word characters, and
' C - Terminated by a slash or the end of the string.
Function GetDomainName(sUrl)
Dim oRegex, oMatch, oMatches, oSubMatch
Set oRegex = New RegExp
oRegex.Pattern = REGEXPR
oRegex.IgnoreCase = True
oRegex.Global = False
Set oMatches = oRegex.Execute(sUrl)
If oMatches.Count > 0 Then
GetDomainName = oMatches(0).SubMatches(0)
Else
GetDomainName = ""
End If
End Function
Dim Data : Data = _
Array( _
"xhttp://www.test.com" _
, "http://www..test.com" _
, "http://www.test.com." _
, "http://www.test.com" _
, "www.test.co.uk/" _
, "www.test.co.uk/?q=42" _
, "www.test.info/test-page.html" _
, "www.test.gov/test-folder/test-page.html" _
, ".www.test.co.uk/" _
)
Dim sUrl, sDomainName
For Each sUrl In Data
sDomainName = GetDomainName(sUrl)
If sDomainName = "" Then
WScript.Echo "[ ] [" & sUrl & "]"
Else
WScript.Echo "[*] [" & sUrl & "] => [" & sDomainName & "]"
End If
Next
Expected Output:
[ ] [xhttp://www.test.com]
[ ] [http://www..test.com]
[ ] [http://www.test.com.]
[*] [http://www.test.com] => [www.test.com]
[*] [www.test.co.uk/] => [www.test.co.uk]
[*] [www.test.co.uk/?q=42] => [www.test.co.uk]
[*] [www.test.info/test-page.html] => [www.test.info]
[*] [www.test.gov/test-folder/test-page.html] => [www.test.gov]
[ ] [.www.test.co.uk/]
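For comparison, the same pattern can be tried in Python, which is handy for testing inputs quickly before porting changes back to VBScript. This sketch assumes Python's re handles this pattern the same way as the VBScript RegExp object; base_url and DOMAIN_RE are names chosen for the illustration.

```python
import re

# Same pattern as REGEXPR above, minus the escaped slashes Python does not need.
DOMAIN_RE = re.compile(r"^(?:http://)?([\w_]+(?:\.[\w_]+)+)(?=/|$)", re.IGNORECASE)

def base_url(url):
    """Return 'http://<domain>/' or None if no domain can be extracted."""
    m = DOMAIN_RE.search(url)
    if m is None:
        return None
    return "http://" + m.group(1) + "/"
```

As in the VBScript version, https URLs, port numbers and user info are not handled and yield None.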
I haven't coded Classic ASP in 12 years and this is totally untested.
result = "http://" & Split(Replace(url, "http://",""),"/")(0) & "/"

Extracting field from type Ocaml

This seems like a dumb question to ask, as when I prototype it inside of a terminal I am able to make this work. But when I use the following specific module:
http://caml.inria.fr/pub/docs/manual-ocaml/libref/Lexing.html
and this code:
(*Identifiers*)
let ws = [' ' '\t']*
let id = ['A'-'Z' 'a'-'z'] +
let map = id ws ':' ws id
let feed = '{' ws map+ ws '}'
let feeds = '[' ws feed + ws ']'
(*Entry Points *)
rule token = parse
[' ' '\t'] { token lexbuf } (* skip blanks *)
| ['\n' ] { EOL }
| feeds as expr { Feeds( expr ) }
| id as expr { Id(expr) }
| feed as expr {
let pos = Lexing.lexeme_start_p lexbuf in
let pos_bol = pos.pos_bol in
print_string (string_of_int pos_bol);
print_string "\n";
Feed(expr) }
I am getting the following error:
Error: Unbound record field label pos_bol
and I am kind of perplexed as to why this is happening. In the documentation I linked above it says that pos_bol is a field of the type Lexing.position
Sorry, I feel like this is going to have a rather obvious answer when it is answered, but thanks anyway!
In OCaml, sum constructors and record field names are scoped inside modules, like identifiers. The position record is defined inside Lexing, which isn't opened in the current scope, so instead of pos.pos_bol you should use pos.Lexing.pos_bol.