Converting Traditional Logic to XSLT - regex

I am trying to convert the simple code below into XSLT. I tried to use regular expressions with XSLT 2.0, but somehow it is not working. Could anyone please advise? Thank you.
string[256] Fname;
integer ctr;
Fname="";
ctr=0;
while ctr <= len(#FirstName) do
begin
if mid(#FirstName,ctr,1) = "A" |
mid(#FirstName,ctr,1) = "B" |
mid(#FirstName,ctr,1) = "C" |
mid(#FirstName,ctr,1) = "D" |
mid(#FirstName,ctr,1) = "E" |
mid(#FirstName,ctr,1) = "F" |
mid(#FirstName,ctr,1) = "G" |
mid(#FirstName,ctr,1) = "H" |
mid(#FirstName,ctr,1) = "I" |
mid(#FirstName,ctr,1) = "J" |
mid(#FirstName,ctr,1) = "K" |
mid(#FirstName,ctr,1) = "L" |
mid(#FirstName,ctr,1) = "M" |
mid(#FirstName,ctr,1) = "N" |
mid(#FirstName,ctr,1) = "O" |
mid(#FirstName,ctr,1) = "P" |
mid(#FirstName,ctr,1) = "Q" |
mid(#FirstName,ctr,1) = "R" |
mid(#FirstName,ctr,1) = "S" |
mid(#FirstName,ctr,1) = "T" |
mid(#FirstName,ctr,1) = "U" |
mid(#FirstName,ctr,1) = "V" |
mid(#FirstName,ctr,1) = "W" |
mid(#FirstName,ctr,1) = "X" |
mid(#FirstName,ctr,1) = "Y" |
mid(#FirstName,ctr,1) = "Z" |
mid(#FirstName,ctr,1) = "a" |
mid(#FirstName,ctr,1) = "b" |
mid(#FirstName,ctr,1) = "c" |
mid(#FirstName,ctr,1) = "d" |
mid(#FirstName,ctr,1) = "e" |
mid(#FirstName,ctr,1) = "f" |
mid(#FirstName,ctr,1) = "g" |
mid(#FirstName,ctr,1) = "h" |
mid(#FirstName,ctr,1) = "i" |
mid(#FirstName,ctr,1) = "j" |
mid(#FirstName,ctr,1) = "k" |
mid(#FirstName,ctr,1) = "l" |
mid(#FirstName,ctr,1) = "m" |
mid(#FirstName,ctr,1) = "n" |
mid(#FirstName,ctr,1) = "o" |
mid(#FirstName,ctr,1) = "p" |
mid(#FirstName,ctr,1) = "q" |
mid(#FirstName,ctr,1) = "r" |
mid(#FirstName,ctr,1) = "s" |
mid(#FirstName,ctr,1) = "t" |
mid(#FirstName,ctr,1) = "u" |
mid(#FirstName,ctr,1) = "v" |
mid(#FirstName,ctr,1) = "w" |
mid(#FirstName,ctr,1) = "x" |
mid(#FirstName,ctr,1) = "y" |
mid(#FirstName,ctr,1) = "z" |
mid(#FirstName,ctr,1) = "-"
then
Fname = Fname + mid(#FirstName,ctr,1);
ctr = ctr+1;
end
Fname = trimleft(Fname,"-");
if len(Fname) = 2 then
Fname = Fname + "-";
if len(Fname) = 1 then
Fname = Fname + "--";
if len(Fname) > 50 then
Fname = left(Fname,50);
if len(Fname) = 0 then
Fname = "UNKNOWN";
#FirstName = Fname;
Solution:
The same logic is applied to another field called PostalCode. Here is what I tried, but somehow the regular expression is not working. I am trying to fix it, and meanwhile I am posting it here for expert solutions.
<xsl:if test="contains(substring(string(/OrdersToFulfill/Order/OrderHeader/BillTo/Address/PostalCode),1,1),'[^a-zA-Z1-9. ]')">
<xsl:variable name="vPostalCode"
select="concat(string(/OrdersToFulfill/Order/OrderHeader/BillTo/Address/PostalCode),substring(string(/OrdersToFulfill/Order/OrderHeader/BillTo/Address/PostalCode),1,1))"/>
<xsl:element name="PostalCode">
<xsl:if test="string-length($vPostalCode) &lt; 9 ">
<xsl:value-of
select="substring($vPostalCode,1, 5)"/>
</xsl:if>
<xsl:if test="string-length(string(/OrdersToFulfill/Order/OrderHeader/BillTo/Address/PostalCode)) > 10 ">
<xsl:value-of
select="substring($vPostalCode,1, 10)"/>
</xsl:if>
</xsl:element>
</xsl:if>

OK, I think I've worked out what your code does: it constructs a string containing all the letters and hyphens from an input string, and removes everything else. It then pads with hyphens to a minimum length of three, and truncates to a maximum length of 50. (Why couldn't you have told us that?)
Also, if you tried to write the code and it didn't work, then you should show us the code so we can tell you where you went wrong.
The first part of the problem can be done using
replace($in, "[^A-Za-z\-]", "")
Padding with hyphens to length 3 can be done with
if (string-length($s) lt 3)
then substring(concat($s, "---"), 1, 3)
else $s
Truncation to a maximum of 50 characters can be done with
substring($s, 1, 50)
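For comparison, here is a minimal Python sketch of the whole routine, following the posted loop step by step (filter, trim leading hyphens, default, pad, truncate). The function name `normalize_first_name` is hypothetical, and it assumes that `trimleft(Fname,"-")` strips every leading hyphen. The posted loop keeps only ASCII letters and hyphens; add `0-9` to the character class if digits should survive too.

```python
import re


def normalize_first_name(raw: str) -> str:
    """Mirror the posted loop: filter, trim, default, pad, truncate."""
    # Keep only ASCII letters and hyphens, as the posted loop does.
    name = re.sub(r"[^A-Za-z-]", "", raw)
    # Assumed meaning of trimleft(Fname, "-"): drop all leading hyphens.
    name = name.lstrip("-")
    # Nothing left: use the placeholder from the posted code.
    if not name:
        return "UNKNOWN"
    # Pad one- and two-character names with hyphens up to length 3.
    if len(name) < 3:
        name = (name + "--")[:3]
    # Truncate to at most 50 characters.
    return name[:50]
```

Unlike the padding expression above, an empty result becomes "UNKNOWN" rather than "---", which is what the posted code does.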

Related

How to handle multi-line rules when parsing a BNF grammar using Boost Spirit Qi

Assuming I have a BNF grammar like this
<code> ::= <letter><digit> | <letter><digit><code>
<letter> ::= a | b | c | d | e
| f | g | h | i
<digit> ::= 0 | 1 | 2 | 3 |
4
If you look at the <letter> rule, its continuation line starts with the |, whereas the continuation of the <digit> rule starts with a production, the | appearing at the end of the previous line. I also don't want to use a particular symbol to mark the end of a rule.
How do I check whether a rule has ended, using Boost Spirit Qi for the implementation?
I have just gone through the tutorial on the Boost page and am wondering how to handle this.
Wikipedia
BNF syntax can only represent a rule in one line, whereas in EBNF a terminating character, the semicolon character “;” marks the end of a rule.
So the simple answer is: the input isn't BNF.
Iff you want to support it anyway (at your own peril :)), you'll have to make it so. So, let's write a simplistic BNF grammar, mapping literally from the Wikipedia BNF:
<syntax> ::= <rule> | <rule> <syntax>
<rule> ::= <opt-whitespace> "<" <rule-name> ">" <opt-whitespace> "::=" <opt-whitespace> <expression> <line-end>
<opt-whitespace> ::= " " <opt-whitespace> | ""
<expression> ::= <list> | <list> <opt-whitespace> "|" <opt-whitespace> <expression>
<line-end> ::= <opt-whitespace> <EOL> | <line-end> <line-end>
<list> ::= <term> | <term> <opt-whitespace> <list>
<term> ::= <literal> | "<" <rule-name> ">"
<literal> ::= '"' <text1> '"' | "'" <text2> "'"
<text1> ::= "" | <character1> <text1>
<text2> ::= '' | <character2> <text2>
<character> ::= <letter> | <digit> | <symbol>
<letter> ::= "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" | "J" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R" | "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z" | "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" | "j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" | "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z"
<digit> ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
<symbol> ::= "|" | " " | "!" | "#" | "$" | "%" | "&" | "(" | ")" | "*" | "+" | "," | "-" | "." | "/" | ":" | ";" | ">" | "=" | "<" | "?" | "@" | "[" | "\" | "]" | "^" | "_" | "`" | "{" | "}" | "~"
<character1> ::= <character> | "'"
<character2> ::= <character> | '"'
<rule-name> ::= <letter> | <rule-name> <rule-char>
<rule-char> ::= <letter> | <digit> | "-"
It could look like this:
template <typename Iterator>
struct BNF: qi::grammar<Iterator, Ast::Syntax()> {
    BNF(): BNF::base_type(start) {
        using namespace qi;
        start = skip(blank) [ _rule % +eol ];

        _rule       = _rule_name >> "::=" >> _expression;
        _expression = _list % '|';
        _list       = +_term;
        _term       = _literal | _rule_name;
        _literal    = '"' >> *(_character - '"') >> '"'
                    | "'" >> *(_character - "'") >> "'";
        _character  = alnum | char_("\"'| !#$%&()*+,./:;>=<?@]\\^_`{}~[-");
        _rule_name  = '<' >> (alpha >> *(alnum | char_('-'))) >> '>';

        BOOST_SPIRIT_DEBUG_NODES(
            (_rule)(_expression)(_list)(_term)
            (_literal)(_character)
            (_rule_name))
    }

  private:
    qi::rule<Iterator, Ast::Syntax()> start;
    qi::rule<Iterator, Ast::Rule(), qi::blank_type> _rule;
    qi::rule<Iterator, Ast::Expression(), qi::blank_type> _expression;
    qi::rule<Iterator, Ast::List(), qi::blank_type> _list;
    // lexemes
    qi::rule<Iterator, Ast::Term()> _term;
    qi::rule<Iterator, Ast::Name()> _rule_name;
    qi::rule<Iterator, std::string()> _literal;
    qi::rule<Iterator, char()> _character;
};
Now it will parse your sample (corrected to be BNF):
std::string const input = R"(<code> ::= <letter><digit> | <letter><digit><code>
<letter> ::= "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i"
<digit> ::= "0" | "1" | "2" | "3" | "4"
)";
Live On Compiler Explorer
Prints:
code ::= {<letter>, <digit>} | {<letter>, <digit>, <code>}
letter ::= {a} | {b} | {c} | {d} | {e} | {f} | {g} | {h} | {i}
digit ::= {0} | {1} | {2} | {3} | {4}
Remaining: "
"
Support Line-Wrapped Rules
The best way is to not accept them - since the grammar wasn't designed for it unlike e.g. EBNF.
You can force the issue by doing a negative look-ahead in the skipper:
_skipper = blank | (eol >> !_rule);
start = skip(_skipper) [ _rule % +eol ];
For technical reasons (Boost spirit skipper issues) that doesn't compile, so we need to feed it a placeholder skipper inside the look-ahead:
_blank = blank;
_skipper = blank | (eol >> !skip(_blank.alias()) [ _rule ]);
start = skip(_skipper.alias()) [ _rule % +eol ];
Now it parses the same but with various line-breaks:
std::string const input = R"(<code> ::= <letter><digit> | <letter><digit><code>
<letter> ::= "a" | "b" | "c" | "d" | "e"
| "f" | "g" | "h" | "i"
<digit> ::= "0" | "1" | "2" | "3" |
"4"
)";
Printing:
code ::= {<letter>, <digit>} | {<letter>, <digit>, <code>}
letter ::= {a} | {b} | {c} | {d} | {e} | {f} | {g} | {h} | {i}
digit ::= {0} | {1} | {2} | {3} | {4}
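The idea behind the look-ahead skipper (a line break is only whitespace unless a new rule starts right after it) can also be sketched outside Spirit as a preprocessing pass. The following Python sketch is not the Qi implementation; the names `RULE_START` and `join_wrapped_rules` are hypothetical, and it assumes a rule always begins with `<name> ::=` at the start of a line.

```python
import re

# A line opens a new rule when it looks like "<name> ::=".
RULE_START = re.compile(r"\s*<[A-Za-z][A-Za-z0-9-]*>\s*::=")


def join_wrapped_rules(text: str) -> list[str]:
    """Fold continuation lines into the rule they belong to."""
    rules: list[str] = []
    for line in text.splitlines():
        if not line.strip():
            continue  # blank lines never carry content
        if RULE_START.match(line) or not rules:
            rules.append(line.strip())  # a new rule begins here
        else:
            rules[-1] += " " + line.strip()  # continuation of the previous rule
    return rules


wrapped = '''<code> ::= <letter><digit> | <letter><digit><code>
<letter> ::= "a" | "b" | "c" | "d" | "e"
| "f" | "g" | "h" | "i"
<digit> ::= "0" | "1" | "2" | "3" |
"4"
'''
unwrapped = join_wrapped_rules(wrapped)
```

After this pass every rule sits on one line, so a strict one-rule-per-line BNF parser can take over.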
FULL LISTING
Compiler Explorer
//#define BOOST_SPIRIT_DEBUG
#include <boost/spirit/include/qi.hpp>
#include <boost/fusion/adapted.hpp>
#include <fmt/ranges.h>
#include <fmt/ostream.h>
#include <iomanip>
#include <iostream>
#include <list>
#include <string>
namespace qi = boost::spirit::qi;

namespace Ast {
    struct Name : std::string {
        using std::string::string;
        using std::string::operator=;

        friend std::ostream& operator<<(std::ostream& os, Name const& n) {
            return os << '<' << n.c_str() << '>';
        }
    };

    using Term       = boost::variant<Name, std::string>;
    using List       = std::list<Term>;
    using Expression = std::list<List>;

    struct Rule {
        Name name; // lhs
        Expression rhs;
    };

    using Syntax = std::list<Rule>;
}

BOOST_FUSION_ADAPT_STRUCT(Ast::Rule, name, rhs)

namespace Parser {
    template <typename Iterator>
    struct BNF: qi::grammar<Iterator, Ast::Syntax()> {
        BNF(): BNF::base_type(start) {
            using namespace qi;
            // A line break counts as whitespace unless a new rule follows it
            _blank   = blank;
            _skipper = blank | (eol >> !skip(_blank.alias()) [ _rule ]);
            start    = skip(_skipper.alias()) [ _rule % +eol ];

            _rule       = _rule_name >> "::=" >> _expression;
            _expression = _list % '|';
            _list       = +_term;
            _term       = _literal | _rule_name;
            _literal    = '"' >> *(_character - '"') >> '"'
                        | "'" >> *(_character - "'") >> "'";
            _character  = alnum | char_("\"'| !#$%&()*+,./:;>=<?@]\\^_`{}~[-");
            _rule_name  = '<' >> (alpha >> *(alnum | char_('-'))) >> '>';

            BOOST_SPIRIT_DEBUG_NODES(
                (_rule)(_expression)(_list)(_term)
                (_literal)(_character)
                (_rule_name))
        }

      private:
        using Skipper = qi::rule<Iterator>;
        Skipper _skipper, _blank;
        qi::rule<Iterator, Ast::Syntax()> start;
        qi::rule<Iterator, Ast::Rule(), Skipper> _rule;
        qi::rule<Iterator, Ast::Expression(), Skipper> _expression;
        qi::rule<Iterator, Ast::List(), Skipper> _list;
        // lexemes
        qi::rule<Iterator, Ast::Term()> _term;
        qi::rule<Iterator, Ast::Name()> _rule_name;
        qi::rule<Iterator, std::string()> _literal;
        qi::rule<Iterator, char()> _character;
    };
}

int main() {
    Parser::BNF<std::string::const_iterator> const parser;

    std::string const input = R"(<code> ::= <letter><digit> | <letter><digit><code>
<letter> ::= "a" | "b" | "c" | "d" | "e"
| "f" | "g" | "h" | "i"
<digit> ::= "0" | "1" | "2" | "3" |
"4"
)";

    auto it = input.begin(), itEnd = input.end();

    Ast::Syntax syntax;
    if (parse(it, itEnd, parser, syntax)) {
        for (auto& rule : syntax)
            fmt::print("{} ::= {}\n", rule.name, fmt::join(rule.rhs, " | "));
    } else {
        std::cout << "Failed\n";
    }

    if (it != itEnd)
        std::cout << "Remaining: " << std::quoted(std::string(it, itEnd)) << "\n";
}
Also Live On Coliru (without libfmt)

PL/SQL: regexp_like for a string that does not start with certain letters

This is for regexp_like running on Oracle Database 11g. I want a pattern that matches a string that does not start with AM or AP. The string is usually a few letters followed by an underscore and then other letters or underscores.
For example :
String : AM_HTCEVOBLKHS_BX [false]
String : AP_HTCEVOBLKHSPBX [false]
String : BM_HTCEVOBLKHS_BX [true]
String : A_HTCEVODSAP_DSSD [true]
String : A_HTCEVOB_A_CDSED [true]
String : MP_HTCEVOBLKHS_BX [true]
Can you write this pattern?
My current solution doesn't work:
BEGIN
IF regexp_like('AM_HTCEVOBLKHS_BX','[^(AM)(AP)]+_.*') THEN
dbms_output.put_line('TRUE');
ELSE
dbms_output.put_line('FALSE');
END IF;
END;
/
Why do you need a regexp? Why not use a simple substr?
with t1 as
(select 'AM_HTCEVOBLKHS_BX' as f1
from dual
union all
select 'AP_HTCEVOBLKHSPBX'
from dual
union all
select 'BM_HTCEVOBLKHS_BX'
from dual
union all
select 'A_HTCEVODSAP_DSSD'
from dual
union all
select 'A_HTCEVOB_A_CDSED'
from dual
union all
select 'MP_HTCEVOBLKHS_BX' from dual
union all
select null from dual
union all
select '1' from dual)
select f1,
case
when substr(f1, 1, 2) in ('AM', 'AP') then
'false'
else
'true'
end as check_result
from t1
If you have a table of patterns then:
SQL Fiddle
Oracle 11g R2 Schema Setup:
CREATE TABLE strings ( string ) AS
SELECT 'AM_HTCEVOBLKHS_BX' FROM DUAL
UNION ALL SELECT 'AP_HTCEVOBLKHSPBX' FROM DUAL
UNION ALL SELECT 'BM_HTCEVOBLKHS_BX' FROM DUAL
UNION ALL SELECT 'A_HTCEVODSAP_DSSD' FROM DUAL
UNION ALL SELECT 'A_HTCEVOB_A_CDSED' FROM DUAL
UNION ALL SELECT 'MP_HTCEVOBLKHS_BX' FROM DUAL;
CREATE TABLE patterns ( pattern ) AS
SELECT '^AM' FROM DUAL
UNION ALL SELECT '^AP' FROM DUAL;
Query 1:
-- Negative Matches:
SELECT string
FROM strings s
LEFT OUTER JOIN
patterns p
ON ( REGEXP_LIKE( string, pattern ) )
WHERE p.pattern IS NULL
Results:
| STRING |
|-------------------|
| BM_HTCEVOBLKHS_BX |
| A_HTCEVODSAP_DSSD |
| A_HTCEVOB_A_CDSED |
| MP_HTCEVOBLKHS_BX |
Query 2:
-- Positive Matches:
SELECT DISTINCT
string
FROM strings s
INNER JOIN
patterns p
ON ( REGEXP_LIKE( string, pattern ) )
Results:
| STRING |
|-------------------|
| AM_HTCEVOBLKHS_BX |
| AP_HTCEVOBLKHSPBX |
Query 3:
-- All Matches:
SELECT string,
CASE WHEN REGEXP_LIKE( string,
( SELECT LISTAGG( pattern, '|' ) WITHIN GROUP ( ORDER BY NULL )
FROM patterns )
)
THEN 'True'
ELSE 'False'
END AS Matched
FROM strings s
Results:
| STRING | MATCHED |
|-------------------|---------|
| AM_HTCEVOBLKHS_BX | True |
| AP_HTCEVOBLKHSPBX | True |
| BM_HTCEVOBLKHS_BX | False |
| A_HTCEVODSAP_DSSD | False |
| A_HTCEVOB_A_CDSED | False |
| MP_HTCEVOBLKHS_BX | False |
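The LISTAGG trick in Query 3 (join every stored pattern with `|` and test once) is easy to try outside the database. Below is a minimal Python sketch of the same idea; it uses Python's `re` module rather than Oracle's regex engine, and the variable names are only illustrative. For these simple anchored patterns the two engines should agree.

```python
import re

strings = [
    "AM_HTCEVOBLKHS_BX",
    "AP_HTCEVOBLKHSPBX",
    "BM_HTCEVOBLKHS_BX",
    "A_HTCEVODSAP_DSSD",
    "A_HTCEVOB_A_CDSED",
    "MP_HTCEVOBLKHS_BX",
]
patterns = ["^AM", "^AP"]

# Equivalent of LISTAGG(pattern, '|'): one alternation of all patterns.
combined = re.compile("|".join(patterns))

# "Matched" column of Query 3.
matched = {s: combined.search(s) is not None for s in strings}

# Negative matches, as in Query 1.
negative = [s for s in strings if not matched[s]]
```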
If you want to pass the pattern as a single string then:
Query 4:
-- Negative Matches:
SELECT string
FROM strings
WHERE NOT REGEXP_LIKE( string, '^(AM|AP)' )
Results:
| STRING |
|-------------------|
| BM_HTCEVOBLKHS_BX |
| A_HTCEVODSAP_DSSD |
| A_HTCEVOB_A_CDSED |
| MP_HTCEVOBLKHS_BX |
Query 5:
-- Positive Matches:
SELECT string
FROM strings
WHERE REGEXP_LIKE( string, '^(AM|AP)' )
Results:
| STRING |
|-------------------|
| AM_HTCEVOBLKHS_BX |
| AP_HTCEVOBLKHSPBX |
Query 6:
-- All Matches:
SELECT string,
CASE WHEN REGEXP_LIKE( string, '^(AM|AP)' )
THEN 'True'
ELSE 'False'
END AS Matched
FROM strings
Results:
| STRING | MATCHED |
|-------------------|---------|
| AM_HTCEVOBLKHS_BX | True |
| AP_HTCEVOBLKHSPBX | True |
| BM_HTCEVOBLKHS_BX | False |
| A_HTCEVODSAP_DSSD | False |
| A_HTCEVOB_A_CDSED | False |
| MP_HTCEVOBLKHS_BX | False |
Try this:
^([B-Z][A-Z]*|A[A-LNOQ-Z]?|A[A-Z]{2,})_[A-Z_]+$
The idea is to describe all possible starts of the string.
( # a group
[B-Z][A-Z]* # the first character is not an "A"
| # OR
A[A-LNOQ-Z]? # a single "A", or an "A" followed by a letter other than "P" or "M"
| # OR
A[A-Z]{2,} # an "A" followed by two or more letters
) # close the group
^ and $ are anchors and mean "start of the string" and "end of the string".
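A quick way to check this pattern against the six sample strings from the question is a small script. This is a sketch using Python's `re` module as a stand-in for Oracle's engine; the pattern uses only character classes, alternation, and bounded repetition, which both support.

```python
import re

PATTERN = re.compile(r"^([B-Z][A-Z]*|A[A-LNOQ-Z]?|A[A-Z]{2,})_[A-Z_]+$")

# Expected results, copied from the question.
samples = {
    "AM_HTCEVOBLKHS_BX": False,
    "AP_HTCEVOBLKHSPBX": False,
    "BM_HTCEVOBLKHS_BX": True,
    "A_HTCEVODSAP_DSSD": True,
    "A_HTCEVOB_A_CDSED": True,
    "MP_HTCEVOBLKHS_BX": True,
}

results = {s: PATTERN.search(s) is not None for s in samples}

# A prefix longer than two letters is accepted by the third alternative,
# even when it begins with "AM".
longer_prefix = PATTERN.search("AMX_FOO") is not None
```

Note the last line: a string such as AMX_FOO matches, so the pattern rejects only the exact prefixes AM_ and AP_.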
I think you need just this:
not regexp_like( field, '^(AM_)|^(AP_)' )
Because regexp_like only checks whether the pattern matches somewhere in the string, you don't need anything more in the regular expression.

Explain BNF syntax for NID in RFC 2141

I am having trouble understanding some BNF syntax from RFC 2141.
The line is <NID> ::= <let-num> [ 1,31<let-num-hyp> ]. I think it means that <NID> is a symbol for a string constrained by two rules:
The string must begin with a single occurrence of any of the <let-num> characters.
This character may be followed by 0-31 occurrences* of any of the <let-num-hyp> characters.
Am I reading this correctly? Because, if I am, some of the implications are a bit confusing.
*equivalent to "optionally, 1-31 occurrences"
The complete BNF syntax for a <NID> (Namespace Identifier) in RFC 2141 is:
<NID> ::= <let-num> [ 1,31<let-num-hyp> ]
<let-num-hyp> ::= <upper> | <lower> | <number> | "-"
<let-num> ::= <upper> | <lower> | <number>
<upper> ::= "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" |
"I" | "J" | "K" | "L" | "M" | "N" | "O" | "P" |
"Q" | "R" | "S" | "T" | "U" | "V" | "W" | "X" |
"Y" | "Z"
<lower> ::= "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" |
"i" | "j" | "k" | "l" | "m" | "n" | "o" | "p" |
"q" | "r" | "s" | "t" | "u" | "v" | "w" | "x" |
"y" | "z"
<number> ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" |
"8" | "9"
You've interpreted it correctly. What are the confusing implications?
<NID> ::= <let-num> [ 1,31<let-num-hyp> ]
means one occurrence of <let-num> followed optionally by up to 31 occurrences of <let-num-hyp>.
Taking into account the other definitions, this means a string of at least one character and at most 32 characters, consisting of letters of either case, numerals, and hyphens, with the first character not allowed to be a hyphen.
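That reading translates directly into a regular expression: one letter or digit, then zero to 31 letters, digits, or hyphens. Here is a minimal Python sketch; the names `NID_RE` and `is_valid_nid` are hypothetical, and the classes are spelled out as ASCII ranges to match the <upper>, <lower>, and <number> definitions.

```python
import re

# <let-num> followed by 0..31 <let-num-hyp>, i.e. 1 to 32 characters in total.
NID_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9-]{0,31}")


def is_valid_nid(candidate: str) -> bool:
    """True when the whole string fits the <NID> production."""
    return NID_RE.fullmatch(candidate) is not None
```

One implication worth noting: the grammar forbids a leading hyphen but not a trailing one, so "ab-" is accepted.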

code optimization

I must write a function "to_string" which receives this datatype
datatype prop = Atom of string | Not of prop | And of prop*prop | Or of prop*prop;
and returns a string.
Example
show
And(Atom("saturday"),Atom("night")) =
"(saturday & night)"
My function works, but I have two problems:
The interpreter tells me -> Warning: match nonexhaustive
I think I can write the function with local functions for all the constructors (Not, And, Or) and avoid duplicated code, but I don't know how.
Here is my code:
datatype prop = Atom of string | Not of prop | And of prop*prop | Or of prop*prop;
fun show(Atom(alpha)) = alpha
| show(Not(Atom(alpha))) = "(- "^alpha^" )"
| show(Or(Atom(alpha),Atom(beta))) = "( "^alpha^" | "^beta^" )"
| show(Not(Or(Atom(alpha),Atom(beta)))) = "(- ( "^alpha^" | "^beta^" ))"
| show(Or(Not(Atom(alpha)),Atom(beta))) = "( (-"^alpha^") | "^beta^" )"
| show(Or(Atom(alpha),Not(Atom(beta)))) = "( "^alpha^" | (-"^beta^") )"
| show(Or(Not(Atom(alpha)),Not(Atom(beta)))) = "( (-"^alpha^") | (-"^beta^") )"
| show(And(Atom(alpha),Atom(beta))) = "( "^alpha^" & "^beta^" )"
| show(Not(And(Atom(alpha),Atom(beta)))) = "(- ( "^alpha^" & "^beta^" ))"
| show(And(Not(Atom(alpha)),Atom(beta))) = "( (-"^alpha^") & "^beta^" )"
| show(And(Atom(alpha),Not(Atom(beta)))) = "( "^alpha^" & (-"^beta^") )"
| show(And(Not(Atom(alpha)),Not(Atom(beta)))) = "( (-"^alpha^") & (-"^beta^") )";
Thanks a lot for your help.
The general rule is as follows: if you have a recursive data type, you should use a recursive function to transform it.
Your match expression is not exhaustive because there are a lot of variants you can't handle, e.g. And(And(Atom("a"), Atom("b")), Atom("c")).
You should rewrite the function with recursive calls to itself, e.g. replace the Not(Atom(alpha)) match with Not(expr):
show(Not(expr)) = "(- " ^ show(expr) ^ " )"
I'm sure you can figure out the rest (you'll have two recursive calls for and/or).
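The shape of the recursive solution, one case per constructor with each case calling the function on its sub-terms, is shown here as a Python sketch rather than SML. The class and field names are hypothetical stand-ins for the datatype constructors, and the output spacing follows the "(saturday & night)" example from the question, which is an assumption about the desired format.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Atom:
    name: str


@dataclass(frozen=True)
class Not:
    operand: "Prop"


@dataclass(frozen=True)
class And:
    left: "Prop"
    right: "Prop"


@dataclass(frozen=True)
class Or:
    left: "Prop"
    right: "Prop"


Prop = Union[Atom, Not, And, Or]


def show(p: Prop) -> str:
    """One case per constructor; sub-terms are rendered recursively."""
    if isinstance(p, Atom):
        return p.name
    if isinstance(p, Not):
        return "(-" + show(p.operand) + ")"
    if isinstance(p, And):
        return "(" + show(p.left) + " & " + show(p.right) + ")"
    if isinstance(p, Or):
        return "(" + show(p.left) + " | " + show(p.right) + ")"
    raise TypeError(f"not a proposition: {p!r}")
```

Four cases cover every possible nesting, which is why the equivalent four-clause SML function no longer triggers the nonexhaustive-match warning.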

Regular Expression Period Issue

((https?|ftp)://|www.)(\S+[^.*])
I would like this expression to check for . in succession to each other. If it finds two or more periods back to back, the expression should fail. On the other hand, if it succeeds, I want it to match every character and/or symbol up until the first white space encountered.
In other words:
www.yahoo..com should fail
On a related note: I realize that this expression is very basic in terms of judging valid URL structure. I have another "more intelligent" regular expression in place that precedes the one above. The purpose of the posted one is meant to check the validity of the URL that is passed from the initial regular expression via preg_match_all.
You may want to check out FILTER_VALIDATE_URL (see http://php.net/manual/en/book.filter.php) instead of using a regex to validate your URLs.
Here's example usage:
$url = "http://www.example.com";
if(!filter_var($url, FILTER_VALIDATE_URL))
{
echo "URL is not valid";
}
else
{
echo "URL is valid";
}
You can do something like this:
((https?|ftp)\:\/\/|www.)((?:[\w\-]+\.)*[\w\-]+)
This will not yet check for valid URLs, even though it rules out double dots. I'd advise against using a regex if the language you're using (PHP?) has other means of validating a URL.
The RFC (RFC 1738) states the following:
; URL schemeparts for ip based protocols:
ip-schemepart = "//" login [ "/" urlpath ]
login = [ user [ ":" password ] "@" ] hostport
hostport = host [ ":" port ]
host = hostname | hostnumber
hostname = *[ domainlabel "." ] toplabel
domainlabel = alphadigit | alphadigit *[ alphadigit | "-" ] alphadigit
toplabel = alpha | alpha *[ alphadigit | "-" ] alphadigit
alphadigit = alpha | digit
hostnumber = digits "." digits "." digits "." digits
port = digits
user = *[ uchar | ";" | "?" | "&" | "=" ]
password = *[ uchar | ";" | "?" | "&" | "=" ]
urlpath = *xchar ; depends on protocol see section 3.1
; HTTP
httpurl = "http://" hostport [ "/" hpath [ "?" search ]]
hpath = hsegment *[ "/" hsegment ]
hsegment = *[ uchar | ";" | ":" | "@" | "&" | "=" ]
search = *[ uchar | ";" | ":" | "@" | "&" | "=" ]
; Miscellaneous definitions
lowalpha = "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" |
"i" | "j" | "k" | "l" | "m" | "n" | "o" | "p" |
"q" | "r" | "s" | "t" | "u" | "v" | "w" | "x" |
"y" | "z"
hialpha = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" |
"J" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R" |
"S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z"
alpha = lowalpha | hialpha
digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" |
"8" | "9"
safe = "$" | "-" | "_" | "." | "+"
extra = "!" | "*" | "'" | "(" | ")" | ","
national = "{" | "}" | "|" | "\" | "^" | "~" | "[" | "]" | "`"
punctuation = "<" | ">" | "#" | "%" | <">
reserved = ";" | "/" | "?" | ":" | "@" | "&" | "="
hex = digit | "A" | "B" | "C" | "D" | "E" | "F" |
"a" | "b" | "c" | "d" | "e" | "f"
escape = "%" hex hex
unreserved = alpha | digit | safe | extra
uchar = unreserved | escape
xchar = unreserved | reserved | escape
digits = 1*digit
Using negative lookahead is an easy way if your engine supports it:
(?!.*\.\.)((https?|ftp)\:\/\/|www.)(\S+[^.*])
Otherwise, you have to be more specific:
^((https?|ftp)\:\/\/|www.)((\.[^.]|[^.\s])+[^.*])($|\s+)
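The negative look-ahead approach can be tried out with a short script. The Python sketch below is a variation on the pattern above, not a copy: it assumes each candidate URL is tested on its own, narrows the look-ahead from `.*` to `\S*` so that only the URL token itself is scanned for "..", and escapes the dot in `www\.`. The names `URL_TOKEN` and `is_acceptable` are hypothetical.

```python
import re

# Reject any token containing "..", then require a scheme or "www." prefix
# followed by non-whitespace characters.
URL_TOKEN = re.compile(r"(?!\S*\.\.)(?:(?:https?|ftp)://|www\.)\S+")


def is_acceptable(candidate: str) -> bool:
    """True when the whole candidate matches and has no consecutive dots."""
    return URL_TOKEN.fullmatch(candidate) is not None
```

As with the original expression, this only screens out consecutive periods; it does not establish that the URL is valid.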