Converting Traditional Logic to XSLT - regex
I'm trying to convert the simple code below into XSLT. I tried to use regular expressions with XSLT 2.0, but somehow it is not working. Could anyone please advise? Thank you.
string[256] Fname;
integer ctr;
Fname="";
ctr=0;
while ctr <= len(#FirstName) do
begin
if mid(#FirstName,ctr,1) = "A" |
mid(#FirstName,ctr,1) = "B" |
mid(#FirstName,ctr,1) = "C" |
mid(#FirstName,ctr,1) = "D" |
mid(#FirstName,ctr,1) = "E" |
mid(#FirstName,ctr,1) = "F" |
mid(#FirstName,ctr,1) = "G" |
mid(#FirstName,ctr,1) = "H" |
mid(#FirstName,ctr,1) = "I" |
mid(#FirstName,ctr,1) = "J" |
mid(#FirstName,ctr,1) = "K" |
mid(#FirstName,ctr,1) = "L" |
mid(#FirstName,ctr,1) = "M" |
mid(#FirstName,ctr,1) = "N" |
mid(#FirstName,ctr,1) = "O" |
mid(#FirstName,ctr,1) = "P" |
mid(#FirstName,ctr,1) = "Q" |
mid(#FirstName,ctr,1) = "R" |
mid(#FirstName,ctr,1) = "S" |
mid(#FirstName,ctr,1) = "T" |
mid(#FirstName,ctr,1) = "U" |
mid(#FirstName,ctr,1) = "V" |
mid(#FirstName,ctr,1) = "W" |
mid(#FirstName,ctr,1) = "X" |
mid(#FirstName,ctr,1) = "Y" |
mid(#FirstName,ctr,1) = "Z" |
mid(#FirstName,ctr,1) = "a" |
mid(#FirstName,ctr,1) = "b" |
mid(#FirstName,ctr,1) = "c" |
mid(#FirstName,ctr,1) = "d" |
mid(#FirstName,ctr,1) = "e" |
mid(#FirstName,ctr,1) = "f" |
mid(#FirstName,ctr,1) = "g" |
mid(#FirstName,ctr,1) = "h" |
mid(#FirstName,ctr,1) = "i" |
mid(#FirstName,ctr,1) = "j" |
mid(#FirstName,ctr,1) = "k" |
mid(#FirstName,ctr,1) = "l" |
mid(#FirstName,ctr,1) = "m" |
mid(#FirstName,ctr,1) = "n" |
mid(#FirstName,ctr,1) = "o" |
mid(#FirstName,ctr,1) = "p" |
mid(#FirstName,ctr,1) = "q" |
mid(#FirstName,ctr,1) = "r" |
mid(#FirstName,ctr,1) = "s" |
mid(#FirstName,ctr,1) = "t" |
mid(#FirstName,ctr,1) = "u" |
mid(#FirstName,ctr,1) = "v" |
mid(#FirstName,ctr,1) = "w" |
mid(#FirstName,ctr,1) = "x" |
mid(#FirstName,ctr,1) = "y" |
mid(#FirstName,ctr,1) = "z" |
mid(#FirstName,ctr,1) = "-"
then
Fname = Fname + mid(#FirstName,ctr,1);
ctr = ctr+1;
end
Fname = trimleft(Fname,"-");
if len(Fname) = 2 then
Fname = Fname + "-";
if len(Fname) = 1 then
Fname = Fname + "--";
if len(Fname) > 50 then
Fname = left(Fname,50);
if len(Fname) = 0 then
Fname = "UNKNOWN";
#FirstName = Fname;
Solution:
The same logic is applied to another field called PostalCode. Here is what I tried, but somehow the regular expression is not working. I am trying to fix it in the meantime, and am posting here as well for expert solutions.
<xsl:if test="contains(substring(string(/OrdersToFulfill/Order/OrderHeader/BillTo/Address/PostalCode),1,1),'[^a-zA-Z1-9. ]')">
<xsl:variable name="vPostalCode"
select="concat(string(/OrdersToFulfill/Order/OrderHeader/BillTo/Address/PostalCode),substring(string(/OrdersToFulfill/Order/OrderHeader/BillTo/Address/PostalCode),1,1))"/>
<xsl:element name="PostalCode">
<xsl:if test="string-length($vPostalCode) &lt; 9 ">
<xsl:value-of
select="substring($vPostalCode,1, 5)"/>
</xsl:if>
<xsl:if test="string-length(string(/OrdersToFulfill/Order/OrderHeader/BillTo/Address/PostalCode)) > 10 ">
<xsl:value-of
select="substring($vPostalCode,1, 10)"/>
</xsl:if>
</xsl:element>
</xsl:if>
OK, I think I've worked out what your code does: it constructs a string containing all the letters and hyphens from an input string, and removes everything else (digits included, since they are not in your list). It then pads with hyphens to a minimum length of three, and truncates to a maximum length of 50. (Why couldn't you have told us that?)
Also, if you tried to write the code and it didn't work then you should show us the code so we can tell you where you went wrong.
The first part of the problem can be done using
replace($in, "[^A-Za-z\-]", "")
Padding with hyphens to length 3 can be done with
if (string-length($s) lt 3)
then substring(concat($s, "---"), 1, 3)
else $s
Truncation to a maximum of 50 characters can be done with
substring($s, 1, 50)
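To check the logic end to end, here is a minimal Python sketch of the whole routine (not XSLT, just a way to test the steps). It assumes the behaviour of the question's loop, which keeps letters and hyphens only, and it also covers the two steps the snippets above leave out: stripping leading hyphens and returning "UNKNOWN" for an empty result. The function name normalize_first_name is made up for illustration.

```python
import re


def normalize_first_name(raw: str) -> str:
    """Mirror the question's loop: filter, trim, pad, truncate."""
    # Keep only ASCII letters and hyphens (the question's list has no digits).
    name = re.sub(r"[^A-Za-z-]", "", raw)
    # trimleft(Fname, "-"): drop leading hyphens only.
    name = name.lstrip("-")
    # An empty result becomes the literal "UNKNOWN" rather than "---".
    if not name:
        return "UNKNOWN"
    # Pad one- and two-character names with hyphens up to length 3.
    if len(name) < 3:
        name = name.ljust(3, "-")
    # Cap the result at 50 characters.
    return name[:50]
```

In XPath 2.0 the same order of operations applies: test for the empty string before padding, otherwise an empty name turns into "---".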
Related
How to handle multi-line rules for parsing BNF grammar using Boost Spirit Qi
Assuming I have a BNF grammar like this:

<code> ::= <letter><digit> | <letter><digit><code>
<letter> ::= a | b | c | d | e
           | f | g | h | i
<digit> ::= 0 | 1 | 2 |
            3 | 4

If you look at the <letter> rule, its continuation line starts with the |, whereas in the <digit> rule the | appears at the end of the previous line. I also don't want to use a particular symbol to mark the end of a rule. How do I check whether a rule has ended using Boost Spirit Qi? I have just gone through the tutorial on the Boost page and am wondering how to handle this.
According to Wikipedia, BNF syntax can only represent a rule in one line, whereas in EBNF a terminating character, the semicolon “;”, marks the end of a rule.

So the simple answer is: the input isn't BNF. If you want to support it anyway (at your own peril :)) you'll have to make it so. So, let's write a simplistic BNF grammar, literally mapping from Wikipedia's BNF:

<syntax> ::= <rule> | <rule> <syntax>
<rule> ::= <opt-whitespace> "<" <rule-name> ">" <opt-whitespace> "::=" <opt-whitespace> <expression> <line-end>
<opt-whitespace> ::= " " <opt-whitespace> | ""
<expression> ::= <list> | <list> <opt-whitespace> "|" <opt-whitespace> <expression>
<line-end> ::= <opt-whitespace> <EOL> | <line-end> <line-end>
<list> ::= <term> | <term> <opt-whitespace> <list>
<term> ::= <literal> | "<" <rule-name> ">"
<literal> ::= '"' <text1> '"' | "'" <text2> "'"
<text1> ::= "" | <character1> <text1>
<text2> ::= '' | <character2> <text2>
<character> ::= <letter> | <digit> | <symbol>
<letter> ::= "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" | "J" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R" | "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z" | "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" | "j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" | "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z"
<digit> ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
<symbol> ::= "|" | " " | "!" | "#" | "$" | "%" | "&" | "(" | ")" | "*" | "+" | "," | "-" | "." | "/" | ":" | ";" | ">" | "=" | "<" | "?" | "@" | "[" | "\" | "]" | "^" | "_" | "`" | "{" | "}" | "~"
<character1> ::= <character> | "'"
<character2> ::= <character> | '"'
<rule-name> ::= <letter> | <rule-name> <rule-char>
<rule-char> ::= <letter> | <digit> | "-"

It could look like this:

    template <typename Iterator>
    struct BNF: qi::grammar<Iterator, Ast::Syntax()> {
        BNF(): BNF::base_type(start) {
            using namespace qi;
            start = skip(blank) [ _rule % +eol ];

            _rule       = _rule_name >> "::=" >> _expression;
            _expression = _list % '|';
            _list       = +_term;

            _term       = _literal | _rule_name;
            _literal    = '"' >> *(_character - '"') >> '"'
                        | "'" >> *(_character - "'") >> "'";
            _character  = alnum | char_("\"'| !#$%&()*+,./:;>=<?@]\\^_`{}~[-");
            _rule_name  = '<' >> (alpha >> *(alnum | char_('-'))) >> '>';

            BOOST_SPIRIT_DEBUG_NODES(
                (_rule)(_expression)(_list)(_term)
                (_literal)(_character)
                (_rule_name))
        }

      private:
        qi::rule<Iterator, Ast::Syntax()> start;
        qi::rule<Iterator, Ast::Rule(), qi::blank_type> _rule;
        qi::rule<Iterator, Ast::Expression(), qi::blank_type> _expression;
        qi::rule<Iterator, Ast::List(), qi::blank_type> _list;
        // lexemes
        qi::rule<Iterator, Ast::Term()> _term;
        qi::rule<Iterator, Ast::Name()> _rule_name;
        qi::rule<Iterator, std::string()> _literal;
        qi::rule<Iterator, char()> _character;
    };

Now it will parse your sample (corrected to be BNF):

    std::string const input = R"(<code> ::= <letter><digit> | <letter><digit><code>
    <letter> ::= "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i"
    <digit> ::= "0" | "1" | "2" | "3" | "4"
    )";

Prints:

    code ::= {<letter>, <digit>} | {<letter>, <digit>, <code>}
    letter ::= {a} | {b} | {c} | {d} | {e} | {f} | {g} | {h} | {i}
    digit ::= {0} | {1} | {2} | {3} | {4}
    Remaining: "
    "

Support Line-Wrapped Rules

The best way is to not accept them, since the grammar wasn't designed for it, unlike e.g. EBNF.

You can force the issue by doing a negative look-ahead in the skipper:

    _skipper = blank | (eol >> !_rule);
    start = skip(_skipper) [ _rule % +eol ];

For technical reasons (Boost Spirit skipper issues) that doesn't compile, so we need to feed it a placeholder skipper inside the look-ahead:

    _blank = blank;
    _skipper = blank | (eol >> !skip(_blank.alias()) [ _rule ]);
    start = skip(_skipper.alias()) [ _rule % +eol ];

Now it parses the same but with various line-breaks:

    std::string const input = R"(<code> ::= <letter><digit> | <letter><digit><code>
    <letter> ::= "a" | "b" | "c" | "d" | "e"
               | "f" | "g" | "h" | "i"
    <digit> ::= "0" | "1" | "2" |
                "3" | "4"
    )";

Printing:

    code ::= {<letter>, <digit>} | {<letter>, <digit>, <code>}
    letter ::= {a} | {b} | {c} | {d} | {e} | {f} | {g} | {h} | {i}
    digit ::= {0} | {1} | {2} | {3} | {4}

FULL LISTING

    //#define BOOST_SPIRIT_DEBUG
    #include <boost/spirit/include/qi.hpp>
    #include <boost/fusion/adapted.hpp>
    #include <fmt/ranges.h>
    #include <fmt/ostream.h>
    #include <iomanip>
    #include <iostream>
    #include <list>
    #include <string>
    namespace qi = boost::spirit::qi;

    namespace Ast {
        struct Name : std::string {
            using std::string::string;
            using std::string::operator=;

            friend std::ostream& operator<<(std::ostream& os, Name const& n) {
                return os << '<' << n.c_str() << '>';
            }
        };

        using Term       = boost::variant<Name, std::string>;
        using List       = std::list<Term>;
        using Expression = std::list<List>;

        struct Rule {
            Name name; // lhs
            Expression rhs;
        };

        using Syntax = std::list<Rule>;
    }

    BOOST_FUSION_ADAPT_STRUCT(Ast::Rule, name, rhs)

    namespace Parser {
        template <typename Iterator>
        struct BNF: qi::grammar<Iterator, Ast::Syntax()> {
            BNF(): BNF::base_type(start) {
                using namespace qi;
                // A line break is skipped only when no new rule starts after it
                _blank = blank;
                _skipper = blank | (eol >> !skip(_blank.alias()) [ _rule ]);
                start = skip(_skipper.alias()) [ _rule % +eol ];

                _rule       = _rule_name >> "::=" >> _expression;
                _expression = _list % '|';
                _list       = +_term;

                _term       = _literal | _rule_name;
                _literal    = '"' >> *(_character - '"') >> '"'
                            | "'" >> *(_character - "'") >> "'";
                _character  = alnum | char_("\"'| !#$%&()*+,./:;>=<?@]\\^_`{}~[-");
                _rule_name  = '<' >> (alpha >> *(alnum | char_('-'))) >> '>';

                BOOST_SPIRIT_DEBUG_NODES(
                    (_rule)(_expression)(_list)(_term)
                    (_literal)(_character)
                    (_rule_name))
            }

          private:
            using Skipper = qi::rule<Iterator>;
            Skipper _skipper, _blank;

            qi::rule<Iterator, Ast::Syntax()> start;
            qi::rule<Iterator, Ast::Rule(), Skipper> _rule;
            qi::rule<Iterator, Ast::Expression(), Skipper> _expression;
            qi::rule<Iterator, Ast::List(), Skipper> _list;
            // lexemes
            qi::rule<Iterator, Ast::Term()> _term;
            qi::rule<Iterator, Ast::Name()> _rule_name;
            qi::rule<Iterator, std::string()> _literal;
            qi::rule<Iterator, char()> _character;
        };
    }

    int main() {
        Parser::BNF<std::string::const_iterator> const parser;

        std::string const input = R"(<code> ::= <letter><digit> | <letter><digit><code>
    <letter> ::= "a" | "b" | "c" | "d" | "e"
               | "f" | "g" | "h" | "i"
    <digit> ::= "0" | "1" | "2" |
                "3" | "4"
    )";

        auto it = input.begin(), itEnd = input.end();

        Ast::Syntax syntax;
        if (parse(it, itEnd, parser, syntax)) {
            for (auto& rule : syntax)
                fmt::print("{} ::= {}\n", rule.name, fmt::join(rule.rhs, " | "));
        } else {
            std::cout << "Failed\n";
        }

        if (it != itEnd)
            std::cout << "Remaining: " << std::quoted(std::string(it, itEnd)) << "\n";
    }
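The look-ahead idea (a line break only ends a rule when the next line starts a new rule) can also be tried out without Spirit. Below is a rough Python sketch of that rule-boundary test as a line-based preprocessing step. It is a stand-in for illustration, not the Qi skipper itself; join_wrapped_rules and RULE_START are made-up names, and the rule-name pattern is an assumption modelled on the _rule_name rule above.

```python
import re

# A line starts a new rule when it begins with "<name> ::=" (assumed shape).
RULE_START = re.compile(r"\s*<[A-Za-z][A-Za-z0-9-]*>\s*::=")


def join_wrapped_rules(text: str) -> list[str]:
    """Merge continuation lines into the rule they belong to."""
    rules: list[str] = []
    for line in text.splitlines():
        if not line.strip():
            continue  # blank lines never carry content
        if RULE_START.match(line) or not rules:
            rules.append(line.strip())        # a new rule begins here
        else:
            rules[-1] += " " + line.strip()   # wrapped line: glue to previous
    return rules
```

Both wrapping styles from the question come out the same: it does not matter whether the | ends the first line or starts the continuation line, because only the look at the start of the next line decides.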
PL/SQL: regexp_like for string not start with letters
For regexp_like running on Oracle Database 11g: I want a pattern that matches a string that does not start with AM or AP. The string is usually a few letters followed by an underscore and then other letters or underscores. For example:

String : AM_HTCEVOBLKHS_BX [false]
String : AP_HTCEVOBLKHSPBX [false]
String : BM_HTCEVOBLKHS_BX [true]
String : A_HTCEVODSAP_DSSD [true]
String : A_HTCEVOB_A_CDSED [true]
String : MP_HTCEVOBLKHS_BX [true]

Can you make this pattern? My current solution doesn't work:

BEGIN
  IF regexp_like('AM_HTCEVOBLKHS_BX','[^(AM)(AP)]+_.*') THEN
    dbms_output.put_line('TRUE');
  ELSE
    dbms_output.put_line('FALSE');
  END IF;
END;
/
Why do you need a regexp? Why not use a simple substr?

with t1 as (
  select 'AM_HTCEVOBLKHS_BX' as f1 from dual union all
  select 'AP_HTCEVOBLKHSPBX' from dual union all
  select 'BM_HTCEVOBLKHS_BX' from dual union all
  select 'A_HTCEVODSAP_DSSD' from dual union all
  select 'A_HTCEVOB_A_CDSED' from dual union all
  select 'MP_HTCEVOBLKHS_BX' from dual union all
  select null from dual union all
  select '1' from dual
)
select f1,
       case when substr(f1, 1, 2) in ('AM', 'AP') then 'false' else 'true' end as check_result
from t1
If you have a table of patterns then:

Oracle 11g R2 Schema Setup:

CREATE TABLE strings ( string ) AS
SELECT 'AM_HTCEVOBLKHS_BX' FROM DUAL
UNION ALL SELECT 'AP_HTCEVOBLKHSPBX' FROM DUAL
UNION ALL SELECT 'BM_HTCEVOBLKHS_BX' FROM DUAL
UNION ALL SELECT 'A_HTCEVODSAP_DSSD' FROM DUAL
UNION ALL SELECT 'A_HTCEVOB_A_CDSED' FROM DUAL
UNION ALL SELECT 'MP_HTCEVOBLKHS_BX' FROM DUAL;

CREATE TABLE patterns ( pattern ) AS
SELECT '^AM' FROM DUAL
UNION ALL SELECT '^AP' FROM DUAL;

Query 1:

-- Negative Matches:
SELECT string
FROM strings s
LEFT OUTER JOIN patterns p
ON ( REGEXP_LIKE( string, pattern ) )
WHERE p.pattern IS NULL

Results:

| STRING            |
|-------------------|
| BM_HTCEVOBLKHS_BX |
| A_HTCEVODSAP_DSSD |
| A_HTCEVOB_A_CDSED |
| MP_HTCEVOBLKHS_BX |

Query 2:

-- Positive Matches:
SELECT DISTINCT string
FROM strings s
INNER JOIN patterns p
ON ( REGEXP_LIKE( string, pattern ) )

Results:

| STRING            |
|-------------------|
| AM_HTCEVOBLKHS_BX |
| AP_HTCEVOBLKHSPBX |

Query 3:

-- All Matches:
SELECT string,
       CASE WHEN REGEXP_LIKE( string,
                   ( SELECT LISTAGG( pattern, '|' ) WITHIN GROUP ( ORDER BY NULL )
                     FROM patterns ) )
            THEN 'True'
            ELSE 'False'
       END AS Matched
FROM strings s

Results:

| STRING            | MATCHED |
|-------------------|---------|
| AM_HTCEVOBLKHS_BX | True    |
| AP_HTCEVOBLKHSPBX | True    |
| BM_HTCEVOBLKHS_BX | False   |
| A_HTCEVODSAP_DSSD | False   |
| A_HTCEVOB_A_CDSED | False   |
| MP_HTCEVOBLKHS_BX | False   |

If you want to pass the pattern as a single string then:

Query 4:

-- Negative Matches:
SELECT string
FROM strings
WHERE NOT REGEXP_LIKE( string, '^(AM|AP)' )

Results:

| STRING            |
|-------------------|
| BM_HTCEVOBLKHS_BX |
| A_HTCEVODSAP_DSSD |
| A_HTCEVOB_A_CDSED |
| MP_HTCEVOBLKHS_BX |

Query 5:

-- Positive Matches:
SELECT string
FROM strings
WHERE REGEXP_LIKE( string, '^(AM|AP)' )

Results:

| STRING            |
|-------------------|
| AM_HTCEVOBLKHS_BX |
| AP_HTCEVOBLKHSPBX |

Query 6:

-- All Matches:
SELECT string,
       CASE WHEN REGEXP_LIKE( string, '^(AM|AP)' )
            THEN 'True'
            ELSE 'False'
       END AS Matched
FROM strings

Results:

| STRING            | MATCHED |
|-------------------|---------|
| AM_HTCEVOBLKHS_BX | True    |
| AP_HTCEVOBLKHSPBX | True    |
| BM_HTCEVOBLKHS_BX | False   |
| A_HTCEVODSAP_DSSD | False   |
| A_HTCEVOB_A_CDSED | False   |
| MP_HTCEVOBLKHS_BX | False   |
Try this:

^([B-Z][A-Z]*|A[A-LNOQ-Z]?|A[A-Z]{2,})_[A-Z_]+$

The idea is to describe every possible start of the string.

(                 # a group
  [B-Z][A-Z]*     # the first character is not an "A"
|                 # OR
  A[A-LNOQ-Z]?    # a single "A", or an "A" followed by a letter other than "P" or "M"
|                 # OR
  A[A-Z]{2,}      # an "A" followed by two or more letters
)                 # close the group

^ and $ are anchors meaning "start of the string" and "end of the string".
I think you just need this:

not regexp_like( field, '^(AM_)|^(AP_)' )

Because REGEXP_LIKE only checks whether the pattern matches somewhere in the string, the regular expression doesn't need anything more.
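For anyone wondering why the pattern in the question misbehaves: [^(AM)(AP)] is a character class, so it means "any one character that is not (, ), A, M or P", not "not the words AM or AP". A small Python sketch shows the difference. It uses re.search as a stand-in for REGEXP_LIKE (both look for a match anywhere in the string), which is an assumption about equivalence for these simple patterns only; starts_ok is a made-up helper name.

```python
import re

# The question's pattern: a negated character class, not a negated word list.
BROKEN = re.compile(r"[^(AM)(AP)]+_.*")
# Anchored alternation: matches only strings that start with AM or AP.
PREFIX = re.compile(r"^(AM|AP)")


def starts_ok(s: str) -> bool:
    """True when the string does not start with AM or AP."""
    return PREFIX.search(s) is None


samples = {
    "AM_HTCEVOBLKHS_BX": False,
    "AP_HTCEVOBLKHSPBX": False,
    "BM_HTCEVOBLKHS_BX": True,
    "A_HTCEVODSAP_DSSD": True,
    "A_HTCEVOB_A_CDSED": True,
    "MP_HTCEVOBLKHS_BX": True,
}
results = {s: starts_ok(s) for s in samples}
```

The broken pattern reports a match for AM_HTCEVOBLKHS_BX because "HTCEVOBLKHS_" further along the string satisfies it, which is why negating the anchored prefix test is the simpler route.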
Explain BNF syntax for NID in RFC 2141
I am having trouble understanding some BNF syntax from RFC 2141. The line is <NID> ::= <let-num> [ 1,31<let-num-hyp> ]. I think it means that <NID> is a symbol for a string constrained by two rules:

The string must begin with a single occurrence of any of the <let-num> characters.
This character may be followed by 0-31 occurrences* of any of the <let-num-hyp> characters.

Am I reading this correctly? Because, if I am, some of the implications are a bit confusing.

*equivalent to "optionally, 1-31 occurrences"

The complete BNF syntax for a <NID> (Namespace Identifier) in RFC 2141 is:

<NID> ::= <let-num> [ 1,31<let-num-hyp> ]
<let-num-hyp> ::= <upper> | <lower> | <number> | "-"
<let-num> ::= <upper> | <lower> | <number>
<upper> ::= "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" | "J" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R" | "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z"
<lower> ::= "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" | "j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" | "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z"
<number> ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
You've interpreted it correctly. What are the confusing implications? <NID> ::= <let-num> [ 1,31<let-num-hyp> ] means one occurrence of <let-num> followed optionally by up to 31 occurrences of <let-num-hyp>. Taking into account the other definitions, this means a string of at least one character and at most 32 characters, consisting of letters of either case, numerals, and hyphens, with the first character not allowed to be a hyphen.
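That reading translates directly into a regular expression: one <let-num>, then zero to 31 <let-num-hyp>. A minimal Python sketch, where NID_RE and is_valid_nid are made-up names and only the grammar quoted above is checked (nothing else from the RFC):

```python
import re

# <let-num> followed by 0..31 <let-num-hyp>: 1 to 32 characters in total.
NID_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9-]{0,31}")


def is_valid_nid(candidate: str) -> bool:
    """Check a string against the <NID> production only."""
    return NID_RE.fullmatch(candidate) is not None
```

So "isbn" and "a-" pass, while "-isbn", the empty string, and anything longer than 32 characters do not.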
code optimization
I must write a function "to_string" which receives this datatype

datatype prop = Atom of string | Not of prop | And of prop*prop | Or of prop*prop;

and returns a string. Example:

show And(Atom("saturday"),Atom("night")) = "(saturday & night)"

My function is working, but I have two problems:

The interpreter tells me -> Warning: match nonexhaustive
I think I can write the function with local functions for all the types (Not, And, Or) and avoid duplicate code, but I don't know how.

Here is my code:

datatype prop = Atom of string | Not of prop | And of prop*prop | Or of prop*prop;

fun show(Atom(alpha)) = alpha
  | show(Not(Atom(alpha))) = "(- "^alpha^" )"
  | show(Or(Atom(alpha),Atom(beta))) = "( "^alpha^" | "^beta^" )"
  | show(Not(Or(Atom(alpha),Atom(beta)))) = "(- ( "^alpha^" | "^beta^" ))"
  | show(Or(Not(Atom(alpha)),Atom(beta))) = "( (-"^alpha^") | "^beta^" )"
  | show(Or(Atom(alpha),Not(Atom(beta)))) = "( "^alpha^" | (-"^beta^") )"
  | show(Or(Not(Atom(alpha)),Not(Atom(beta)))) = "( (-"^alpha^") | (-"^beta^") )"
  | show(And(Atom(alpha),Atom(beta))) = "( "^alpha^" & "^beta^" )"
  | show(Not(And(Atom(alpha),Atom(beta)))) = "(- ( "^alpha^" & "^beta^" ))"
  | show(And(Not(Atom(alpha)),Atom(beta))) = "( (-"^alpha^") & "^beta^" )"
  | show(And(Atom(alpha),Not(Atom(beta)))) = "( "^alpha^" & (-"^beta^") )"
  | show(And(Not(Atom(alpha)),Not(Atom(beta)))) = "( (-"^alpha^") & (-"^beta^") )";

Thanks a lot for your help.
The general rule is as follows: if you have a recursive data type, you should use a recursive function to transform it. Your match expression is not exhaustive because there are a lot of variants you can't handle, e.g. And(And(Atom("a"), Atom("b")), Atom("c")). You should rewrite the function with recursive calls to itself, e.g. replace the Not(Atom(alpha)) match with Not(expr):

show(Not(expr)) = "(- " ^ show(expr) ^ " )"

I'm sure you can figure out the rest (you'll have two recursive calls for and/or).
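To see the shape of the recursive version without giving away the SML, here is the same idea transliterated into Python: one case per constructor, and each case calls show on its sub-terms. The class and field names are my own, and the spacing of the output follows the clauses in the question, so treat it as a sketch rather than the expected homework answer.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Atom:
    name: str


@dataclass(frozen=True)
class Not:
    arg: "Prop"


@dataclass(frozen=True)
class And:
    left: "Prop"
    right: "Prop"


@dataclass(frozen=True)
class Or:
    left: "Prop"
    right: "Prop"


Prop = Union[Atom, Not, And, Or]


def show(p: Prop) -> str:
    """One branch per constructor; nesting is handled by recursion."""
    if isinstance(p, Atom):
        return p.name
    if isinstance(p, Not):
        return "(- " + show(p.arg) + " )"
    if isinstance(p, And):
        return "( " + show(p.left) + " & " + show(p.right) + " )"
    if isinstance(p, Or):
        return "( " + show(p.left) + " | " + show(p.right) + " )"
    raise TypeError(f"not a proposition: {p!r}")
```

Four branches cover every possible nesting, which is also what makes the SML match exhaustive: four clauses, one per constructor, each with a variable where the question's code has a nested Atom pattern.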
Regular Expression Period Issue
((https?|ftp)://|www.)(\S+[^.*])

I would like this expression to check for periods in succession. If it finds two or more periods back to back, the expression should fail. On the other hand, if it succeeds, I want it to match every character and/or symbol up until the first whitespace encountered. In other words:

www.yahoo..com should fail

On a related note: I realize that this expression is very basic in terms of judging valid URL structure. I have another "more intelligent" regular expression in place that precedes the one above. The posted one is meant to check the validity of the URL that is passed from the initial regular expression via preg_match_all.
You may want to check out FILTER_VALIDATE_URL (see http://php.net/manual/en/book.filter.php) instead of using a regex to validate your URLs. Here's example usage:

$url = "http://www.example.com";
if (!filter_var($url, FILTER_VALIDATE_URL)) {
    echo "URL is not valid";
} else {
    echo "URL is valid";
}
You can do something like this:

((https?|ftp)\:\/\/|www.)((?:[\w\-]+\.)*[\w\-]+)

This will not yet check for valid URLs, even if you skip double dots. I'd advise not to use regex if the language you're using (PHP?) has other means of validating a URL. The RFC states the following:

; URL schemeparts for ip based protocols:
ip-schemepart = "//" login [ "/" urlpath ]
login = [ user [ ":" password ] "@" ] hostport
hostport = host [ ":" port ]
host = hostname | hostnumber
hostname = *[ domainlabel "." ] toplabel
domainlabel = alphadigit | alphadigit *[ alphadigit | "-" ] alphadigit
toplabel = alpha | alpha *[ alphadigit | "-" ] alphadigit
alphadigit = alpha | digit
hostnumber = digits "." digits "." digits "." digits
port = digits
user = *[ uchar | ";" | "?" | "&" | "=" ]
password = *[ uchar | ";" | "?" | "&" | "=" ]
urlpath = *xchar ; depends on protocol see section 3.1

; HTTP
httpurl = "http://" hostport [ "/" hpath [ "?" search ]]
hpath = hsegment *[ "/" hsegment ]
hsegment = *[ uchar | ";" | ":" | "@" | "&" | "=" ]
search = *[ uchar | ";" | ":" | "@" | "&" | "=" ]

; Miscellaneous definitions
lowalpha = "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" | "j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" | "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z"
hialpha = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" | "J" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R" | "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z"
alpha = lowalpha | hialpha
digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
safe = "$" | "-" | "_" | "." | "+"
extra = "!" | "*" | "'" | "(" | ")" | ","
national = "{" | "}" | "|" | "\" | "^" | "~" | "[" | "]" | "`"
punctuation = "<" | ">" | "#" | "%" | <">
reserved = ";" | "/" | "?" | ":" | "@" | "&" | "="
hex = digit | "A" | "B" | "C" | "D" | "E" | "F" | "a" | "b" | "c" | "d" | "e" | "f"
escape = "%" hex hex
unreserved = alpha | digit | safe | extra
uchar = unreserved | escape
xchar = unreserved | reserved | escape
digits = 1*digit
Using negative lookahead is an easy way if your engine supports it:

(?!.*\.\.)((https?|ftp)\:\/\/|www.)(\S+[^.*])

Otherwise, you have to be more specific:

^((https?|ftp)\:\/\/|www.)((\.[^.]|[^.\s])+[^.*])($|\s+)
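One thing to watch with the lookahead version: .* scans the whole rest of the line, so a ".." somewhere after the URL would also reject it. Limiting the lookahead to non-whitespace keeps the check inside the URL token. Here is a small Python sketch of that variant; the pattern is my own simplification (escaped dot after www, plain \S+ for the rest) and find_url is a made-up helper, so adjust before dropping it into preg_match_all.

```python
import re
from typing import Optional

# (?!\S*\.\.) rejects a token containing ".." but ignores later text.
URL_RE = re.compile(r"(?!\S*\.\.)((?:https?|ftp)://|www\.)\S+")


def find_url(text: str) -> Optional[str]:
    """Return the first URL-like token without consecutive dots, if any."""
    match = URL_RE.search(text)
    return match.group(0) if match else None
```

With this, "www.yahoo..com" is rejected, while "http://a.b/c wait.." still yields "http://a.b/c" because the trailing dots sit outside the URL.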