regular expression for ipv4 address in CIDR notation - c++

I am using the regular expression below to match an IPv4 address in CIDR notation:
[ \t]*(((2(5[0-5]|[0-4][0-9])|[01]?[0-9][0-9]?)\.){3}(2(5[0-5]|[0-4][0-9])|[01]?[0-9][0-9]?)(/(3[012]|[12]?[0-9])))[ \t]*
I have tested it using http://regexpal.com/ and it matches the example 192.168.5.10/24.
However, when I use the same expression in flex, it reports "unrecognized rule". Is there some limitation in flex, in that it does not support all regex features? The regex above seems pretty basic and does not use any extended features. Can someone point out why flex is not recognizing the rule?
Here is a short, self-contained example that demonstrates the problem:
IPV4ADDRESS [ \t]*(((2(5[0-5]|[0-4][0-9])|[01]?[0-9][0-9]?)\.){3}(2(5[0-5]|[0-4][0-9])|[01]?[0-9][0-9]?)(/(3[012]|[12]?[0-9])))[ \t]*
SPACE [ \t]
%x S_rule S_dst_ip
%%
%{
BEGIN S_rule;
%}
<S_rule>(dst-ip){SPACE} {
    BEGIN(S_dst_ip);
}
<S_dst_ip>\{{IPV4ADDRESS}\} {
    printf("\n\nMATCH [%s]\n\n", yytext);
    BEGIN S_rule;
}
. { ECHO; }
%%
int main(void)
{
    while (yylex() != 0)
        ;
    return 0;
}
int yywrap(void)
{
    return 1;
}
When I run flex test.l, it gives an "unrecognized rule" error. I want to match:
dst-ip { 192.168.10.5/10 }

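The "/" in your IPV4ADDRESS pattern needs to be escaped ("\/").
An unescaped "/" in a flex pattern is the trailing context operator. Flex expands {IPV4ADDRESS} inside parentheses, and trailing context cannot be grouped inside parentheses, which is why the rule is rejected.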
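To check the pattern itself outside flex, here is a minimal Python sketch. It is an illustration only: the names OCTET, PREFIX, CIDR and match_cidr are made up for this example, and Python's re module treats "/" as an ordinary character, so no escaping is needed there (unlike in flex).

```python
import re

# One octet, 0-255 (same alternatives as the pattern in the question).
OCTET = r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"
# Prefix length, 0-32.
PREFIX = r"(?:3[0-2]|[12]?[0-9])"
# Optional blanks, four dotted octets, "/", prefix length, optional blanks.
CIDR = re.compile(rf"[ \t]*((?:{OCTET}\.){{3}}{OCTET}/{PREFIX})[ \t]*")

def match_cidr(text):
    """Return the address/prefix part if the whole text is a CIDR block, else None."""
    m = CIDR.fullmatch(text)
    return m.group(1) if m else None

if __name__ == "__main__":
    for sample in ["192.168.5.10/24", " 192.168.10.5/10 ", "192.168.5.10/33"]:
        print(repr(sample), "->", match_cidr(sample))
```

If this accepts your test inputs, the remaining problem is specific to flex syntax rather than to the regular expression.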

Related

How can I match an exact string with variable text before and behind the string in PHP?

I use a small validation script that tells me when a given URL is blocked by robots.txt.
For example, take a URL like http://www.example.com/dir/test.html
My current script tells me the URL is blocked when there is a line in robots.txt like:
Disallow: /test1.html
But it also says that the URL is blocked when there are lines like:
Disallow: /tes
That's wrong.
I googled something like "regex exact string" and found lots of solutions for the problem above.
But this leads to another problem. When I check for an exact string with the URL http://www.example.com/dir/test1/page.html and robots.txt contains a line like
Disallow: /test1/page.html
my script doesn't catch it, because it looks for
Disallow: /dir/test1/page.html
and says that the target page.html is not blocked, but it is!
How can I match an exact string with variable text before and behind the string?
Here is the short version of the script:
/* example for $rules */
$rules = array("/tes", "/test", "/test1", "/test/page.html", "/test1/page.html", "/dir/test1/page.html");
/* examples for $parsed['path']: */
"dir/test.html"
"dir/test1/page.html"
"test1/page.html"
foreach ($rules as $rule) {
    // check if page is disallowed to us
    if (preg_match("/^$rule/", $parsed['path']))
        return false;
}
EDIT:
This is the whole function:
function robots_allowed($url, $useragent = false) {
    // parse url to retrieve host and path
    $parsed = parse_url($url);
    $agents = array(preg_quote('*'));
    if ($useragent)
        $agents[] = preg_quote($useragent);
    $agents = implode('|', $agents);
    // location of robots.txt file
    $robotstxt = !empty($parsed['host']) ? @file($parsed['scheme'] . "://" . $parsed['host'] . "/robots.txt") : "";
    // if there isn't a robots, then we're allowed in
    if (empty($robotstxt))
        return true;
    $rules = array();
    $ruleApplies = false;
    foreach ($robotstxt as $line) {
        // skip blank lines
        if (!$line = trim($line))
            continue;
        // following rules only apply if User-agent matches $useragent or '*'
        if (preg_match('/^\s*User-agent: (.*)/i', $line, $match)) {
            $ruleApplies = preg_match("/($agents)/i", $match[1]);
        }
        if ($ruleApplies && preg_match('/^\s*Disallow:(.*)/i', $line, $regs)) {
            // an empty rule implies full access - no further tests required
            if (!$regs[1])
                return true;
            // add rules that apply to array for testing
            $rules[] = preg_quote(trim($regs[1]), '/');
        }
    }
    foreach ($rules as $rule) {
        // check if page is disallowed to us
        if (preg_match("/^$rule/", $parsed['path']))
            return false;
    }
    // page is not disallowed
    return true;
}
The URL comes from user input.
Try matching everything at once and avoid the array:
/(?:\/?dir\/)?\/?tes(?:(?:t(?:1)?)?(?:\.html|(?:\/page\.html)?))/
https://regex101.com/r/VxL30W/1
(?: /?dir / )?
/?tes
(?:
(?:
t
(?: 1 )?
)?
(?:
\.html
|
(?: /page \. html )?
)
)
I've found a solution to match /test or /test/hello or /test/ but not to match /testosterone or /hellotest:
(?:\/test$|\/test\/)
With PHP variables:
if (preg_match("/(?:" . $rule . "$|" . $rule . "\/)/", $parsed['path']))
Based on the function above.
https://regex101.com/r/DFVR5T/3
Can I use (?:\/ ...) or is that wrong?
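For comparison, here is a minimal Python sketch of the same idea: a rule counts as a hit only if it is followed by the end of the path or by a "/". The function name is_blocked is hypothetical, and this only mirrors the regex from the answer above; it is not meant as a complete robots.txt matcher.

```python
import re

def is_blocked(path, rules):
    """Return True if any rule occurs in path as a whole path segment.

    A rule matches only when it is followed by the end of the path or
    by a "/", so "/test" matches "/test/hello" but not "/testosterone".
    """
    for rule in rules:
        # re.escape plays the role of preg_quote in the PHP version.
        pattern = re.compile(re.escape(rule) + r"(?:$|/)")
        if pattern.search(path):
            return True
    return False

if __name__ == "__main__":
    print(is_blocked("/dir/test1/page.html", ["/test1/page.html"]))
    print(is_blocked("/dir/test.html", ["/tes"]))
```

Using search rather than an anchored match is what allows variable text (such as "/dir") in front of the rule.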

Glib regex for matching whole word?

For matching a whole word, the regex \bword\b should suffice. Yet the following code always returns 0 matches:
try {
    string pattern = "\bhtml\b";
    Regex wordRegex = new Regex (pattern, RegexCompileFlags.CASELESS, RegexMatchFlags.NOTEMPTY);
    MatchInfo matchInfo;
    string lineOfText = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">";
    wordRegex.match (lineOfText, RegexMatchFlags.NOTEMPTY, out matchInfo);
    stdout.printf ("Match count is: %d\n", matchInfo.get_match_count ());
} catch (RegexError regexError) {
    stderr.printf ("Regex error: %s\n", regexError.message);
}
This should be working as testing the \bhtml\b pattern returns one match for the provided string in testing engines. But on this program it returns 0 matches. Is the code wrong? What regex in Glib would be used to match a whole word?
It looks like you have to escape the backslash too. In an ordinary string literal, "\b" is the backspace escape, so the regex engine never sees the word-boundary assertion:
try {
    string pattern = "\\bhtml\\b";
    Regex wordRegex = new Regex (pattern, RegexCompileFlags.CASELESS, RegexMatchFlags.NOTEMPTY);
    MatchInfo matchInfo;
    string lineOfText = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">";
    wordRegex.match (lineOfText, RegexMatchFlags.NOTEMPTY, out matchInfo);
    stdout.printf ("Match count is: %d\n", matchInfo.get_match_count ());
} catch (RegexError regexError) {
    stderr.printf ("Regex error: %s\n", regexError.message);
}
Output:
Match count is: 1
You can simplify your code with regular expression literals:
Regex regex = /\bhtml\b/i;
You don't have to quote backslashes in the regular expression literal syntax. (Forward slashes would be problematic, though.)
Full example:
void test_match (string text, Regex regex) {
    MatchInfo match_info;
    if (regex.match (text, RegexMatchFlags.NOTEMPTY, out match_info)) {
        stdout.printf ("Match count is: %d\n", match_info.get_match_count ());
    }
    else {
        stdout.printf ("No match");
    }
}
int main () {
    Regex regex = /\bhtml\b/i;
    test_match ("<!DOCTYPE html PUBLIC>", regex);
    return 0;
}
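The same string-escaping pitfall exists in other languages. As a hedged illustration (this is Python's re module, not GLib, and the helper name count_matches is made up for the example), the sketch below compares a plain string literal, a doubled backslash, and a raw string:

```python
import re

LINE = '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN">'

def count_matches(pattern):
    """Count case-insensitive matches of pattern in LINE."""
    return len(re.findall(pattern, LINE, flags=re.IGNORECASE))

plain = "\bhtml\b"      # "\b" is a backspace character here, not a word boundary
escaped = "\\bhtml\\b"  # the regex engine receives \bhtml\b
raw = r"\bhtml\b"       # raw string: same pattern as the escaped form

if __name__ == "__main__":
    print(count_matches(plain), count_matches(escaped), count_matches(raw))
```

Only the escaped and raw forms reach the regex engine as a word-boundary pattern; "XHTML" is not counted because there is no boundary between "X" and "H".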

Match any UTF-8 letter in javacc regular expression

How can I write a regular expression in JavaCC that matches any UTF-8 letter? I read that some regular expression engines support \p{L}, but this doesn't work in JavaCC, or I am doing something wrong.
TOKEN : { < #UTF_LETTER : \p{L} > }
OK, so here is an answer, thanks to Wiktor Stribiżew. You need to paste the pL character ranges (the ranges behind \p{L}) into the JavaCC code yourself, like this:
TOKEN : { < #UTF_LETTER : [
"a"-"z",
"A"-"Z",
"\u00AA",
"\u00B5",
"\u00BA",
"\u00C0"-"\u00D6",
"\u00D8"-"\u00F6",
"\u00F8"-"\u02C1",
"\u02C6"-"\u02D1",
"\u02E0"-"\u02E4",
"\u02EC",
"\u02EE",
"\u0370"-"\u0374",
"\u0376",
"\u0377",
"\u037A"-"\u037D",
"\u037F",
"\u0386",
"\u0388"-"\u038A",
"\u038C",
"\u038E"-"\u03A1",
"\u03A3"-"\u03F5",
"\u03F7"-"\u0481",
"\u048A"-"\u052F",
"\u0531"-"\u0556",
"\u0559",
"\u0561"-"\u0587",
"\u05D0"-"\u05EA",
"\u05F0"-"\u05F2",
"\u0620"-"\u064A",
"\u066E",
"\u066F",
"\u0671"-"\u06D3",
"\u06D5",
"\u06E5",
"\u06E6",
"\u06EE",
"\u06EF",
"\u06FA"-"\u06FC",
"\u06FF",
"\u0710",
"\u0712"-"\u072F",
"\u074D"-"\u07A5",
"\u07B1",
"\u07CA"-"\u07EA",
"\u07F4",
"\u07F5",
"\u07FA",
"\u0800"-"\u0815",
"\u081A",
"\u0824",
"\u0828",
"\u0840"-"\u0858",
"\u08A0"-"\u08B2",
"\u0904"-"\u0939",
"\u093D",
"\u0950",
"\u0958"-"\u0961",
"\u0971"-"\u0980",
"\u0985"-"\u098C",
"\u098F",
"\u0990",
"\u0993"-"\u09A8",
"\u09AA"-"\u09B0",
"\u09B2",
"\u09B6"-"\u09B9",
"\u09BD",
"\u09CE",
"\u09DC",
"\u09DD",
"\u09DF"-"\u09E1",
"\u09F0",
"\u09F1",
"\u0A05"-"\u0A0A",
"\u0A0F",
"\u0A10",
"\u0A13"-"\u0A28",
"\u0A2A"-"\u0A30",
"\u0A32",
"\u0A33",
"\u0A35",
"\u0A36",
"\u0A38",
"\u0A39",
"\u0A59"-"\u0A5C",
"\u0A5E",
"\u0A72"-"\u0A74",
"\u0A85"-"\u0A8D",
"\u0A8F"-"\u0A91",
"\u0A93"-"\u0AA8",
"\u0AAA"-"\u0AB0",
"\u0AB2",
"\u0AB3",
"\u0AB5"-"\u0AB9",
"\u0ABD",
"\u0AD0",
"\u0AE0",
"\u0AE1",
"\u0B05"-"\u0B0C",
"\u0B0F",
"\u0B10",
"\u0B13"-"\u0B28",
"\u0B2A"-"\u0B30",
"\u0B32",
"\u0B33",
"\u0B35"-"\u0B39",
"\u0B3D",
"\u0B5C",
"\u0B5D",
"\u0B5F"-"\u0B61",
"\u0B71",
"\u0B83",
"\u0B85"-"\u0B8A",
"\u0B8E"-"\u0B90",
"\u0B92"-"\u0B95",
"\u0B99",
"\u0B9A",
"\u0B9C",
"\u0B9E",
"\u0B9F",
"\u0BA3",
"\u0BA4",
"\u0BA8"-"\u0BAA",
"\u0BAE"-"\u0BB9",
"\u0BD0",
"\u0C05"-"\u0C0C",
"\u0C0E"-"\u0C10",
"\u0C12"-"\u0C28",
"\u0C2A"-"\u0C39",
"\u0C3D",
"\u0C58",
"\u0C59",
"\u0C60",
"\u0C61",
"\u0C85"-"\u0C8C",
"\u0C8E"-"\u0C90",
"\u0C92"-"\u0CA8",
"\u0CAA"-"\u0CB3",
"\u0CB5"-"\u0CB9",
"\u0CBD",
"\u0CDE",
"\u0CE0",
"\u0CE1",
"\u0CF1",
"\u0CF2",
"\u0D05"-"\u0D0C",
"\u0D0E"-"\u0D10",
"\u0D12"-"\u0D3A",
"\u0D3D",
"\u0D4E",
"\u0D60",
"\u0D61",
"\u0D7A"-"\u0D7F",
"\u0D85"-"\u0D96",
"\u0D9A"-"\u0DB1",
"\u0DB3"-"\u0DBB",
"\u0DBD",
"\u0DC0"-"\u0DC6",
"\u0E01"-"\u0E30",
"\u0E32",
"\u0E33",
"\u0E40"-"\u0E46",
"\u0E81",
"\u0E82",
"\u0E84",
"\u0E87",
"\u0E88",
"\u0E8A",
"\u0E8D",
"\u0E94"-"\u0E97",
"\u0E99"-"\u0E9F",
"\u0EA1"-"\u0EA3",
"\u0EA5",
"\u0EA7",
"\u0EAA",
"\u0EAB",
"\u0EAD"-"\u0EB0",
"\u0EB2",
"\u0EB3",
"\u0EBD",
"\u0EC0"-"\u0EC4",
"\u0EC6",
"\u0EDC"-"\u0EDF",
"\u0F00",
"\u0F40"-"\u0F47",
"\u0F49"-"\u0F6C",
"\u0F88"-"\u0F8C",
"\u1000"-"\u102A",
"\u103F",
"\u1050"-"\u1055",
"\u105A"-"\u105D",
"\u1061",
"\u1065",
"\u1066",
"\u106E"-"\u1070",
"\u1075"-"\u1081",
"\u108E",
"\u10A0"-"\u10C5",
"\u10C7",
"\u10CD",
"\u10D0"-"\u10FA",
"\u10FC"-"\u1248",
"\u124A"-"\u124D",
"\u1250"-"\u1256",
"\u1258",
"\u125A"-"\u125D",
"\u1260"-"\u1288",
"\u128A"-"\u128D",
"\u1290"-"\u12B0",
"\u12B2"-"\u12B5",
"\u12B8"-"\u12BE",
"\u12C0",
"\u12C2"-"\u12C5",
"\u12C8"-"\u12D6",
"\u12D8"-"\u1310",
"\u1312"-"\u1315",
"\u1318"-"\u135A",
"\u1380"-"\u138F",
"\u13A0"-"\u13F4",
"\u1401"-"\u166C",
"\u166F"-"\u167F",
"\u1681"-"\u169A",
"\u16A0"-"\u16EA",
"\u16F1"-"\u16F8",
"\u1700"-"\u170C",
"\u170E"-"\u1711",
"\u1720"-"\u1731",
"\u1740"-"\u1751",
"\u1760"-"\u176C",
"\u176E"-"\u1770",
"\u1780"-"\u17B3",
"\u17D7",
"\u17DC",
"\u1820"-"\u1877",
"\u1880"-"\u18A8",
"\u18AA",
"\u18B0"-"\u18F5",
"\u1900"-"\u191E",
"\u1950"-"\u196D",
"\u1970"-"\u1974",
"\u1980"-"\u19AB",
"\u19C1"-"\u19C7",
"\u1A00"-"\u1A16",
"\u1A20"-"\u1A54",
"\u1AA7",
"\u1B05"-"\u1B33",
"\u1B45"-"\u1B4B",
"\u1B83"-"\u1BA0",
"\u1BAE",
"\u1BAF",
"\u1BBA"-"\u1BE5",
"\u1C00"-"\u1C23",
"\u1C4D"-"\u1C4F",
"\u1C5A"-"\u1C7D",
"\u1CE9"-"\u1CEC",
"\u1CEE"-"\u1CF1",
"\u1CF5",
"\u1CF6",
"\u1D00"-"\u1DBF",
"\u1E00"-"\u1F15",
"\u1F18"-"\u1F1D",
"\u1F20"-"\u1F45",
"\u1F48"-"\u1F4D",
"\u1F50"-"\u1F57",
"\u1F59",
"\u1F5B",
"\u1F5D",
"\u1F5F"-"\u1F7D",
"\u1F80"-"\u1FB4",
"\u1FB6"-"\u1FBC",
"\u1FBE",
"\u1FC2"-"\u1FC4",
"\u1FC6"-"\u1FCC",
"\u1FD0"-"\u1FD3",
"\u1FD6"-"\u1FDB",
"\u1FE0"-"\u1FEC",
"\u1FF2"-"\u1FF4",
"\u1FF6"-"\u1FFC",
"\u2071",
"\u207F",
"\u2090"-"\u209C",
"\u2102",
"\u2107",
"\u210A"-"\u2113",
"\u2115",
"\u2119"-"\u211D",
"\u2124",
"\u2126",
"\u2128",
"\u212A"-"\u212D",
"\u212F"-"\u2139",
"\u213C"-"\u213F",
"\u2145"-"\u2149",
"\u214E",
"\u2183",
"\u2184",
"\u2C00"-"\u2C2E",
"\u2C30"-"\u2C5E",
"\u2C60"-"\u2CE4",
"\u2CEB"-"\u2CEE",
"\u2CF2",
"\u2CF3",
"\u2D00"-"\u2D25",
"\u2D27",
"\u2D2D",
"\u2D30"-"\u2D67",
"\u2D6F",
"\u2D80"-"\u2D96",
"\u2DA0"-"\u2DA6",
"\u2DA8"-"\u2DAE",
"\u2DB0"-"\u2DB6",
"\u2DB8"-"\u2DBE",
"\u2DC0"-"\u2DC6",
"\u2DC8"-"\u2DCE",
"\u2DD0"-"\u2DD6",
"\u2DD8"-"\u2DDE",
"\u2E2F",
"\u3005",
"\u3006",
"\u3031"-"\u3035",
"\u303B",
"\u303C",
"\u3041"-"\u3096",
"\u309D"-"\u309F",
"\u30A1"-"\u30FA",
"\u30FC"-"\u30FF",
"\u3105"-"\u312D",
"\u3131"-"\u318E",
"\u31A0"-"\u31BA",
"\u31F0"-"\u31FF",
"\u3400"-"\u4DB5",
"\u4E00"-"\u9FCC",
"\uA000"-"\uA48C",
"\uA4D0"-"\uA4FD",
"\uA500"-"\uA60C",
"\uA610"-"\uA61F",
"\uA62A",
"\uA62B",
"\uA640"-"\uA66E",
"\uA67F"-"\uA69D",
"\uA6A0"-"\uA6E5",
"\uA717"-"\uA71F",
"\uA722"-"\uA788",
"\uA78B"-"\uA78E",
"\uA790"-"\uA7AD",
"\uA7B0",
"\uA7B1",
"\uA7F7"-"\uA801",
"\uA803"-"\uA805",
"\uA807"-"\uA80A",
"\uA80C"-"\uA822",
"\uA840"-"\uA873",
"\uA882"-"\uA8B3",
"\uA8F2"-"\uA8F7",
"\uA8FB",
"\uA90A"-"\uA925",
"\uA930"-"\uA946",
"\uA960"-"\uA97C",
"\uA984"-"\uA9B2",
"\uA9CF",
"\uA9E0"-"\uA9E4",
"\uA9E6"-"\uA9EF",
"\uA9FA"-"\uA9FE",
"\uAA00"-"\uAA28",
"\uAA40"-"\uAA42",
"\uAA44"-"\uAA4B",
"\uAA60"-"\uAA76",
"\uAA7A",
"\uAA7E"-"\uAAAF",
"\uAAB1",
"\uAAB5",
"\uAAB6",
"\uAAB9"-"\uAABD",
"\uAAC0",
"\uAAC2",
"\uAADB"-"\uAADD",
"\uAAE0"-"\uAAEA",
"\uAAF2"-"\uAAF4",
"\uAB01"-"\uAB06",
"\uAB09"-"\uAB0E",
"\uAB11"-"\uAB16",
"\uAB20"-"\uAB26",
"\uAB28"-"\uAB2E",
"\uAB30"-"\uAB5A",
"\uAB5C"-"\uAB5F",
"\uAB64",
"\uAB65",
"\uABC0"-"\uABE2",
"\uAC00"-"\uD7A3",
"\uD7B0"-"\uD7C6",
"\uD7CB"-"\uD7FB",
"\uF900"-"\uFA6D",
"\uFA70"-"\uFAD9",
"\uFB00"-"\uFB06",
"\uFB13"-"\uFB17",
"\uFB1D",
"\uFB1F"-"\uFB28",
"\uFB2A"-"\uFB36",
"\uFB38"-"\uFB3C",
"\uFB3E",
"\uFB40",
"\uFB41",
"\uFB43",
"\uFB44",
"\uFB46"-"\uFBB1",
"\uFBD3"-"\uFD3D",
"\uFD50"-"\uFD8F",
"\uFD92"-"\uFDC7",
"\uFDF0"-"\uFDFB",
"\uFE70"-"\uFE74",
"\uFE76"-"\uFEFC",
"\uFF21"-"\uFF3A",
"\uFF41"-"\uFF5A",
"\uFF66"-"\uFFBE",
"\uFFC2"-"\uFFC7",
"\uFFCA"-"\uFFCF",
"\uFFD2"-"\uFFD7",
"\uFFDA"-"\uFFDC"
] > }
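A list like this does not have to be maintained by hand. The following is a minimal Python sketch, under the assumption that "letter" means Unicode general category L* and that only code points up to U+FFFF are needed; the function names are hypothetical. Its output depends on the Unicode version bundled with the Python interpreter, so it may not match the list above entry for entry.

```python
import unicodedata

def letter_ranges(limit=0xFFFF):
    """Collect inclusive (first, last) ranges of code points in category L*."""
    ranges = []
    start = None
    for cp in range(limit + 1):
        is_letter = unicodedata.category(chr(cp)).startswith("L")
        if is_letter and start is None:
            start = cp                      # a new run of letters begins
        elif not is_letter and start is not None:
            ranges.append((start, cp - 1))  # the run ended at the previous code point
            start = None
    if start is not None:
        ranges.append((start, limit))
    return ranges

def javacc_item(lo, hi):
    """Format one range as a JavaCC character-list item."""
    def fmt(cp):
        return '"\\u%04X"' % cp
    return fmt(lo) if lo == hi else fmt(lo) + "-" + fmt(hi)

def javacc_char_list(ranges):
    """Join all items, one per line, ready to paste between [ and ]."""
    return ",\n".join(javacc_item(lo, hi) for lo, hi in ranges)

if __name__ == "__main__":
    print(javacc_char_list(letter_ranges()))
```

The printed items can be pasted between the square brackets of the token definition.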

Regular Expression how to add " to beginning and end of word

I have a json string that looks something like this:
{
operations: [
validateAddressAndBRClassification, vintageValidateAddressAndGeo, deleteShareInformation, validateCountry, validatePostcode, pageThroughAddress, getFromRelationships, getToRelationships, getUnresolvedAddresses, validateCity, getVisits, getOperationalAddress, getGeo, looseAddressSearch, getServiceInformation, getAddress, getInternalProp, validateAddress, disputeBR, validateAddressAndGeo, supportedCountries, internalMx, getBRProfileHistory, databaseStatus, getBRProfile, addressLookup, getDataSourceMetaInformation, getBRDisputeHistory, validateStateProv, getGeopoliticalElementList, validatePostal, postalLookup, signalDataSourceChange, setInternalProp, getShareInformation, addVisit, addressSearch, getRelatedAddresses, ping, showOperations
]
}
I need to add a double quote " to the beginning and end of each word. What would the regular expression for this be?
So I need it to look like this:
{
"operations": [
"validateAddressAndBRClassification", "vintageValidateAddressAndGeo", "deleteShareInformation", "etc"
]
}
I'm not sure which programming language you are using, but in JavaScript you could achieve it with the following line:
string = string.replace(/(\w+)/g, "\"$1\"");
Example: https://jsfiddle.net/wLyL0hvk/
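The same substitution can be sketched in Python. This assumes, as in the question, that every token is a bare word with no existing quotes and no values (such as numbers) that should stay unquoted; the function name quote_words is made up for the example.

```python
import json
import re

def quote_words(text):
    """Wrap every run of word characters in double quotes."""
    return re.sub(r"(\w+)", r'"\1"', text)

if __name__ == "__main__":
    raw = "{ operations: [ ping, showOperations ] }"
    quoted = quote_words(raw)
    print(quoted)
    # The result now parses as JSON.
    print(json.loads(quoted))
```

Parsing the result with json.loads is a quick way to confirm that the quoting produced valid JSON.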

Regex for strings in Bibtex

I'm trying to parse BibTeX files using lex/yacc. Strings in the BibTeX database can be surrounded by quotes "..." or by braces {...}.
But every entry is also enclosed in braces. How do I differentiate between an entry and a string surrounded by braces?
@Book{sweig42,
    Author = { Stefan Sweig },
    title = { The impossible book },
    publisher = { Dead Poet Society},
    year = 1942,
    month = mar
}
you have various options:
lexer start conditions (from a Lex tutorial)
building on the ideas from greg ward, enhance your lex rules with start conditions ('modes' as they are called in the referenced source).
specifically, you would have the start conditions BASIC ENTRY STRING and the following rules (example taken and slightly enhanced from here):
%START BASIC ENTRY STRING
%%
/* Lexical grammar, mode 1: top-level */
<BASIC>AT @ { BEGIN ENTRY; }
<BASIC>NEWLINE \n
<BASIC>COMMENT \%[^\n]*\n
<BASIC>WHITESPACE [\ \r\t]+
<BASIC>JUNK [^@\n\ \r\t]+
/* Lexical grammar, mode 2: in-entry */
<ENTRY>NEWLINE \n
<ENTRY>COMMENT \%[^\n]*\n
<ENTRY>WHITESPACE [\ \r\t]+
<ENTRY>NUMBER [0-9]+
<ENTRY>NAME [a-z0-9\!\$\&\*\+\-\.\/\:\;\<\>\?\[\]\^\_\`\|]+ { if (stricmp(yytext, "comment")==0) { BEGIN STRING; } }
<ENTRY>LBRACE \{ { if (delim == '\0') { delim='}'; } else { blevel=1; BEGIN STRING; } }
<ENTRY>RBRACE \} { BEGIN BASIC; }
<ENTRY>LPAREN \( { BEGIN STRING; delim=')'; plevel=1; }
<ENTRY>RPAREN \)
<ENTRY>EQUALS =
<ENTRY>HASH \#
<ENTRY>COMMA ,
<ENTRY>QUOTE \" { BEGIN STRING; blevel=0; plevel=0; }
/* Lexical grammar, mode 3: strings */
<STRING>LBRACE \{ { if (blevel>0) {blevel++;} }
<STRING>RBRACE \} { if (blevel>0) { blevel--; if (blevel == 0) { BEGIN ENTRY; } } }
<STRING>LPAREN \( { if (plevel>0) { plevel++;} }
<STRING>RPAREN \) { if (plevel>0) { plevel--; if (plevel == 0) { BEGIN ENTRY; } } }
<STRING>QUOTE \" { BEGIN ENTRY; }
please note that the rule set is by no means complete but should get you started. more details to be found here.
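the brace-level counting used by the rules above can be sketched outside lex as well. the following python snippet is only an illustration under simplifying assumptions: it handles a single entry, ignores quoted strings and escapes, and the function name braced_values is made up. depth 1 is the entry itself, depth 2 and deeper is a string value.

```python
def braced_values(entry_text):
    """Return the brace-delimited field values of one BibTeX entry."""
    values = []
    depth = 0
    start = None
    for i, ch in enumerate(entry_text):
        if ch == "{":
            depth += 1
            if depth == 2:
                start = i + 1          # first character inside a string value
        elif ch == "}":
            if depth == 2 and start is not None:
                values.append(entry_text[start:i].strip())
                start = None
            depth -= 1                 # nested braces just lower the depth again
    return values

if __name__ == "__main__":
    entry = """@Book{sweig42,
    Author = { Stefan Sweig },
    title = { The impossible book },
    publisher = { Dead Poet Society},
    year = 1942,
    month = mar
}"""
    print(braced_values(entry))
```

braces nested inside a value (depth 3 and up) stay part of that value, which is the same job the blevel counter does in the STRING start condition.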
btparse
These docs explain in a fairly detailed fashion the intricacies of parsing the bibtex format and come with a python parser.
biblex
you might also be interested in employing the unix toolchain of biblex and bibparse. these tools generate and parse a bibtex token stream, respectively.
more info can be found here.
best regards, carsten