I would like to trim() a column and replace any run of whitespace, including Unicode space separators, with a single space. The idea is to sanitize usernames so that two users cannot have deceptively similar names such as foo bar (SPACE U+0020) vs. foo bar (NO-BREAK SPACE U+00A0).
Until now I've used SELECT regexp_replace(TRIM('some string'), '[\s\v]+', ' ', 'g'); it collapses spaces, tabs and carriage returns, but it lacks support for Unicode space separators.
I would have added \h to the regexp, but PostgreSQL doesn't support it (nor \p{Zs}):
SELECT regexp_replace(TRIM('some string'), '[\s\v\h]+', ' ', 'g');
Error in query (7): ERROR: invalid regular expression: invalid escape \ sequence
We are running PostgreSQL 12 (12.2-2.pgdg100+1) in a Debian 10 docker container, using UTF-8 encoding, and support emojis in usernames.
Is there a way to achieve something similar?
Based on the POSIX "space" character class (class shorthand \s in Postgres regular expressions), UNICODE "Spaces", some space-like "Format characters", and some additional non-printing characters (including two more from Wiktor's post), I condensed this custom character class:
'[\s\u00a0\u180e\u2007\u200b-\u200f\u202f\u2060\ufeff]'
So use:
SELECT trim(regexp_replace('some string', '[\s\u00a0\u180e\u2007\u200b-\u200f\u202f\u2060\ufeff]+', ' ', 'g'));
Note: trim() comes after regexp_replace(), so it covers converted spaces.
It's important to include the basic space class \s (short for [[:space:]]) to cover all current (and future) basic space characters.
We might include more characters. Or start by stripping all characters encoded with 4 bytes. Because UNICODE is dark and full of terrors.
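For anyone who wants to apply the same normalization in application code before the value reaches the database, here is a minimal Python sketch of the idea. The function name normalize_spaces is my own, and the character class is simply copied from the one above. Python's \s already covers some of the listed code points, so the overlap is redundant but harmless.

```python
import re

# Same members as the custom class above. Python's own \s already matches
# some of them (for example U+00A0), so repeating them changes nothing.
SPACE_LIKE = re.compile(
    r"[\s\u00a0\u180e\u2007\u200b-\u200f\u202f\u2060\ufeff]+"
)


def normalize_spaces(name: str) -> str:
    """Collapse every run of space-like characters to one space, then trim."""
    # Trimming comes last so that converted characters at the ends go too.
    return SPACE_LIKE.sub(" ", name).strip(" ")
```

With this, "foo bar" written with a regular space and "foo bar" written with a NO-BREAK SPACE normalize to the same string, which is the property the username check needs.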
Consider this demo:
SELECT d AS decimal, to_hex(d) AS hex, chr(d) AS glyph
, '\u' || lpad(to_hex(d), 4, '0') AS unicode
, chr(d) ~ '\s' AS in_posix_space_class
, chr(d) ~ '[\s\u00a0\u180e\u2007\u200b-\u200f\u202f\u2060\ufeff]' AS in_custom_class
FROM (
-- TAB, SPACE, NO-BREAK SPACE, OGHAM SPACE MARK, MONGOLIAN VOWEL SEPARATOR, NARROW NO-BREAK SPACE
-- MEDIUM MATHEMATICAL SPACE, WORD JOINER, IDEOGRAPHIC SPACE, ZERO WIDTH NO-BREAK SPACE
SELECT unnest('{9,32,160,5760,6158,8239,8287,8288,12288,65279}'::int[])
UNION ALL
SELECT generate_series (8192, 8202) AS dec -- UNICODE "Spaces"
UNION ALL
SELECT generate_series (8203, 8207) AS dec -- First 5 space-like UNICODE "Format characters"
) t(d)
ORDER BY d;
decimal | hex | glyph | unicode | in_posix_space_class | in_custom_class
---------+------+----------+---------+----------------------+-----------------
9 | 9 | | \u0009 | t | t
32 | 20 | | \u0020 | t | t
160 | a0 | | \u00a0 | f | t
5760 | 1680 | | \u1680 | t | t
6158 | 180e | | \u180e | f | t
8192 | 2000 | | \u2000 | t | t
8193 | 2001 | | \u2001 | t | t
8194 | 2002 | | \u2002 | t | t
8195 | 2003 | | \u2003 | t | t
8196 | 2004 | | \u2004 | t | t
8197 | 2005 | | \u2005 | t | t
8198 | 2006 | | \u2006 | t | t
8199 | 2007 | | \u2007 | f | t
8200 | 2008 | | \u2008 | t | t
8201 | 2009 | | \u2009 | t | t
8202 | 200a | | \u200a | t | t
8203 | 200b | | \u200b | f | t
8204 | 200c | | \u200c | f | t
8205 | 200d | | \u200d | f | t
8206 | 200e | | \u200e | f | t
8207 | 200f | | \u200f | f | t
8239 | 202f | | \u202f | f | t
8287 | 205f | | \u205f | t | t
8288 | 2060 | | \u2060 | f | t
12288 | 3000 | | \u3000 | t | t
65279 | feff | | \ufeff | f | t
(26 rows)
Tool to generate the character class:
SELECT '[\s' || string_agg('\u' || lpad(to_hex(d), 4, '0'), '' ORDER BY d) || ']'
FROM (
SELECT unnest('{9,32,160,5760,6158,8239,8287,8288,12288,65279}'::int[])
UNION ALL
SELECT generate_series (8192, 8202)
UNION ALL
SELECT generate_series (8203, 8207)
) t(d)
WHERE chr(d) !~ '\s'; -- not covered by \s
[\s\u00a0\u180e\u2007\u200b\u200c\u200d\u200e\u200f\u202f\u2060\ufeff]
db<>fiddle here
Related, with more explanation:
Trim trailing spaces with PostgreSQL
You may construct a bracket expression including the whitespace characters from \p{Zs} Unicode category + a tab:
REGEXP_REPLACE(col, '[\u0009\u0020\u00A0\u1680\u2000-\u200A\u202F\u205F\u3000]+', ' ', 'g')
It will replace all occurrences of one or more horizontal whitespace characters (matched by \h in the regex flavors that support it) with a regular space char.
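If you would rather derive that bracket expression than type it by hand, a small script can enumerate the Zs category. This is only a sketch: the helper names are mine, and the result depends on the Unicode version bundled with your Python interpreter (recent versions list 17 Zs characters, all below U+FFFF, which is why four hex digits are enough here).

```python
import sys
import unicodedata


def space_separators():
    """Return every code point whose general category is Zs."""
    return [
        cp for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)) == "Zs"
    ]


def _format_run(first, last):
    # A single code point, or a first-last range for consecutive ones.
    if first == last:
        return f"\\u{first:04x}"
    return f"\\u{first:04x}-\\u{last:04x}"


def to_bracket_expression(code_points):
    """Build a bracket expression, collapsing consecutive code points."""
    parts = []
    run_start = prev = None
    for cp in sorted(code_points):
        if prev is not None and cp == prev + 1:
            prev = cp  # still inside the current run
            continue
        if run_start is not None:
            parts.append(_format_run(run_start, prev))
        run_start = prev = cp
    if run_start is not None:
        parts.append(_format_run(run_start, prev))
    return "[" + "".join(parts) + "]"


# Zs plus the tab character, as in the expression above.
print(to_bracket_expression([0x09] + space_separators()))
```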
Compiling blank characters from several sources, I've ended up with the following pattern which includes tabulations (U+0009 / U+000B / U+0088-008A / U+2409-240A), word joiner (U+2060), space symbol (U+2420 / U+2423), braille blank (U+2800), tag space (U+E0020) and more:
[\x0009\x000B\x0088-\x008A\x00A0\x1680\x180E\x2000-\x200F\x202F\x205F\x2060\x2409\x240A\x2420\x2423\x2800\x3000\xFEFF\xE0020]
And in order to effectively transform blanks including multiple consecutive spaces and those at the beginning/end of a column, here are the 3 queries to be executed in sequence (assuming column "text" from "mytable")
-- transform all Unicode blanks/spaces into a "regular" one (U+20) only on lines where "text" matches the pattern
UPDATE
mytable
SET
text = regexp_replace(text, '[\x0009\x000B\x0088-\x008A\x00A0\x1680\x180E\x2000-\x200F\x202F\x205F\x2060\x2409\x240A\x2420\x2423\x2800\x3000\xFEFF\xE0020]', ' ', 'g')
WHERE
text ~ '[\x0009\x000B\x0088-\x008A\x00A0\x1680\x180E\x2000-\x200F\x202F\x205F\x2060\x2409\x240A\x2420\x2423\x2800\x3000\xFEFF\xE0020]';
-- then squeeze multiple spaces into one
UPDATE mytable SET text=regexp_replace(text, '[ ]+ ',' ','g') WHERE text LIKE '%  %';
-- and finally, trim leading/ending spaces
UPDATE mytable SET text=trim(both ' ' FROM text) WHERE text LIKE ' %' OR text LIKE '% ';
Assuming I have a BNF grammar like this
<code> ::= <letter><digit> | <letter><digit><code>
<letter> ::= a | b | c | d | e
| f | g | h | i
<digit> ::= 0 | 1 | 2 | 3 |
4
If you look at the <letter> rule, its continuation line starts with |, whereas in the <digit> rule the | appears at the end of the previous line and the continuation starts with the production. I also don't want to use a particular symbol to mark the end of a rule.
How do I check whether a rule has ended, using Boost Spirit Qi for the implementation?
I have just gone through the tutorial on the Boost page and am wondering how to handle this.
Wikipedia
BNF syntax can only represent a rule in one line, whereas in EBNF a terminating character, the semicolon character “;” marks the end of a rule.
So the simple answer is: the input isn't BNF.
If you want to support it anyway (at your own peril :)) you'll have to make it so. So, let's write a simplistic BNF grammar, literally mapping from the Wikipedia BNF:
<syntax> ::= <rule> | <rule> <syntax>
<rule> ::= <opt-whitespace> "<" <rule-name> ">" <opt-whitespace> "::=" <opt-whitespace> <expression> <line-end>
<opt-whitespace> ::= " " <opt-whitespace> | ""
<expression> ::= <list> | <list> <opt-whitespace> "|" <opt-whitespace> <expression>
<line-end> ::= <opt-whitespace> <EOL> | <line-end> <line-end>
<list> ::= <term> | <term> <opt-whitespace> <list>
<term> ::= <literal> | "<" <rule-name> ">"
<literal> ::= '"' <text1> '"' | "'" <text2> "'"
<text1> ::= "" | <character1> <text1>
<text2> ::= '' | <character2> <text2>
<character> ::= <letter> | <digit> | <symbol>
<letter> ::= "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" | "J" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R" | "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z" | "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" | "j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" | "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z"
<digit> ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
<symbol> ::= "|" | " " | "!" | "#" | "$" | "%" | "&" | "(" | ")" | "*" | "+" | "," | "-" | "." | "/" | ":" | ";" | ">" | "=" | "<" | "?" | "@" | "[" | "\" | "]" | "^" | "_" | "`" | "{" | "}" | "~"
<character1> ::= <character> | "'"
<character2> ::= <character> | '"'
<rule-name> ::= <letter> | <rule-name> <rule-char>
<rule-char> ::= <letter> | <digit> | "-"
It could look like this:
template <typename Iterator>
struct BNF: qi::grammar<Iterator, Ast::Syntax()> {
BNF(): BNF::base_type(start) {
using namespace qi;
start = skip(blank) [ _rule % +eol ];
_rule = _rule_name >> "::=" >> _expression;
_expression = _list % '|';
_list = +_term;
_term = _literal | _rule_name;
_literal = '"' >> *(_character - '"') >> '"'
| "'" >> *(_character - "'") >> "'";
_character = alnum | char_("\"'| !#$%&()*+,./:;>=<?@]\\^_`{}~[-");
_rule_name = '<' >> (alpha >> *(alnum | char_('-'))) >> '>';
BOOST_SPIRIT_DEBUG_NODES(
(_rule)(_expression)(_list)(_term)
(_literal)(_character)
(_rule_name))
}
private:
qi::rule<Iterator, Ast::Syntax()> start;
qi::rule<Iterator, Ast::Rule(), qi::blank_type> _rule;
qi::rule<Iterator, Ast::Expression(), qi::blank_type> _expression;
qi::rule<Iterator, Ast::List(), qi::blank_type> _list;
// lexemes
qi::rule<Iterator, Ast::Term()> _term;
qi::rule<Iterator, Ast::Name()> _rule_name;
qi::rule<Iterator, std::string()> _literal;
qi::rule<Iterator, char()> _character;
};
Now it will parse your sample (corrected to be BNF):
std::string const input = R"(<code> ::= <letter><digit> | <letter><digit><code>
<letter> ::= "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i"
<digit> ::= "0" | "1" | "2" | "3" | "4"
)";
Live On Compiler Explorer
Prints:
code ::= {<letter>, <digit>} | {<letter>, <digit>, <code>}
letter ::= {a} | {b} | {c} | {d} | {e} | {f} | {g} | {h} | {i}
digit ::= {0} | {1} | {2} | {3} | {4}
Remaining: "
"
Support Line-Wrapped Rules
The best way is not to accept them, since the grammar, unlike e.g. EBNF, wasn't designed for them.
You can force the issue by doing a negative look-ahead in the skipper:
_skipper = blank | (eol >> !_rule);
start = skip(_skipper) [ _rule % +eol ];
For technical reasons (Boost spirit skipper issues) that doesn't compile, so we need to feed it a placeholder skipper inside the look-ahead:
_blank = blank;
_skipper = blank | (eol >> !skip(_blank.alias()) [ _rule ]);
start = skip(_skipper.alias()) [ _rule % +eol ];
Now it parses the same but with various line-breaks:
std::string const input = R"(<code> ::= <letter><digit> | <letter><digit><code>
<letter> ::= "a" | "b" | "c" | "d" | "e"
| "f" | "g" | "h" | "i"
<digit> ::= "0" | "1" | "2" | "3" |
"4"
)";
Printing:
code ::= {<letter>, <digit>} | {<letter>, <digit>, <code>}
letter ::= {a} | {b} | {c} | {d} | {e} | {f} | {g} | {h} | {i}
digit ::= {0} | {1} | {2} | {3} | {4}
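The rule behind that skipper ("a line break is insignificant unless a new rule starts right after it") is not specific to Spirit. As an illustration only, here is the same decision made with a regex lookahead in Python. It is a text preprocessing step rather than a parser, the names are mine, and it assumes rule names made of letters, digits and hyphens. It would also collapse whitespace inside quoted literals, which a real parser must not do.

```python
import re

# A newline ends a rule only when the next line begins with "<name> ::=".
RULE_START = re.compile(r"\n(?=[ \t]*<[A-Za-z][A-Za-z0-9-]*>[ \t]*::=)")


def split_rules(text):
    """Split BNF text into rules, folding wrapped lines into one line each."""
    rules = []
    for chunk in RULE_START.split(text.strip()):
        # Collapse remaining line breaks and runs of blanks to single spaces.
        rules.append(" ".join(chunk.split()))
    return rules


SAMPLE = '''<code> ::= <letter><digit> | <letter><digit><code>
<letter> ::= "a" | "b" | "c" | "d" | "e"
 | "f" | "g" | "h" | "i"
<digit> ::= "0" | "1" | "2" | "3" |
 "4"
'''

for rule in split_rules(SAMPLE):
    print(rule)
```

Both wrapping styles from the question (leading | and trailing |) end up on a single line per rule, because neither continuation line looks like the start of a rule.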
FULL LISTING
Compiler Explorer
//#define BOOST_SPIRIT_DEBUG
#include <boost/spirit/include/qi.hpp>
#include <boost/fusion/adapted.hpp>
#include <fmt/ranges.h>
#include <fmt/ostream.h>
#include <iomanip>
#include <iostream>
#include <list>
#include <string>
namespace qi = boost::spirit::qi;
namespace Ast {
struct Name : std::string {
using std::string::string;
using std::string::operator=;
friend std::ostream& operator<<(std::ostream& os, Name const& n) {
return os << '<' << n.c_str() << '>';
}
};
using Term = boost::variant<Name, std::string>;
using List = std::list<Term>;
using Expression = std::list<List>;
struct Rule {
Name name; // lhs
Expression rhs;
};
using Syntax = std::list<Rule>;
}
BOOST_FUSION_ADAPT_STRUCT(Ast::Rule, name, rhs)
namespace Parser {
template <typename Iterator>
struct BNF: qi::grammar<Iterator, Ast::Syntax()> {
BNF(): BNF::base_type(start) {
using namespace qi;
_blank = blank;
_skipper = blank | (eol >> !skip(_blank.alias()) [ _rule ]);
start = skip(_skipper.alias()) [ _rule % +eol ];
_rule = _rule_name >> "::=" >> _expression;
_expression = _list % '|';
_list = +_term;
_term = _literal | _rule_name;
_literal = '"' >> *(_character - '"') >> '"'
| "'" >> *(_character - "'") >> "'";
_character = alnum | char_("\"'| !#$%&()*+,./:;>=<?@]\\^_`{}~[-");
_rule_name = '<' >> (alpha >> *(alnum | char_('-'))) >> '>';
BOOST_SPIRIT_DEBUG_NODES(
(_rule)(_expression)(_list)(_term)
(_literal)(_character)
(_rule_name))
}
private:
using Skipper = qi::rule<Iterator>;
Skipper _skipper, _blank;
qi::rule<Iterator, Ast::Syntax()> start;
qi::rule<Iterator, Ast::Rule(), Skipper> _rule;
qi::rule<Iterator, Ast::Expression(), Skipper> _expression;
qi::rule<Iterator, Ast::List(), Skipper> _list;
// lexemes
qi::rule<Iterator, Ast::Term()> _term;
qi::rule<Iterator, Ast::Name()> _rule_name;
qi::rule<Iterator, std::string()> _literal;
qi::rule<Iterator, char()> _character;
};
}
int main() {
Parser::BNF<std::string::const_iterator> const parser;
std::string const input = R"(<code> ::= <letter><digit> | <letter><digit><code>
<letter> ::= "a" | "b" | "c" | "d" | "e"
| "f" | "g" | "h" | "i"
<digit> ::= "0" | "1" | "2" | "3" |
"4"
)";
auto it = input.begin(), itEnd = input.end();
Ast::Syntax syntax;
if (parse(it, itEnd, parser, syntax)) {
for (auto& rule : syntax)
fmt::print("{} ::= {}\n", rule.name, fmt::join(rule.rhs, " | "));
} else {
std::cout << "Failed\n";
}
if (it != itEnd)
std::cout << "Remaining: " << std::quoted(std::string(it, itEnd)) << "\n";
}
Also Live On Coliru (without libfmt)
I'm trying to convert the simple code below into XSLT. I tried to use regular expressions with XSLT 2.0, but somehow it is not working. Could anyone please advise? Thank you.
string[256] Fname;
integer ctr;
Fname="";
ctr=0;
while ctr <= len(#FirstName) do
begin
if mid(#FirstName,ctr,1) = "A" |
mid(#FirstName,ctr,1) = "B" |
mid(#FirstName,ctr,1) = "C" |
mid(#FirstName,ctr,1) = "D" |
mid(#FirstName,ctr,1) = "E" |
mid(#FirstName,ctr,1) = "F" |
mid(#FirstName,ctr,1) = "G" |
mid(#FirstName,ctr,1) = "H" |
mid(#FirstName,ctr,1) = "I" |
mid(#FirstName,ctr,1) = "J" |
mid(#FirstName,ctr,1) = "K" |
mid(#FirstName,ctr,1) = "L" |
mid(#FirstName,ctr,1) = "M" |
mid(#FirstName,ctr,1) = "N" |
mid(#FirstName,ctr,1) = "O" |
mid(#FirstName,ctr,1) = "P" |
mid(#FirstName,ctr,1) = "Q" |
mid(#FirstName,ctr,1) = "R" |
mid(#FirstName,ctr,1) = "S" |
mid(#FirstName,ctr,1) = "T" |
mid(#FirstName,ctr,1) = "U" |
mid(#FirstName,ctr,1) = "V" |
mid(#FirstName,ctr,1) = "W" |
mid(#FirstName,ctr,1) = "X" |
mid(#FirstName,ctr,1) = "Y" |
mid(#FirstName,ctr,1) = "Z" |
mid(#FirstName,ctr,1) = "a" |
mid(#FirstName,ctr,1) = "b" |
mid(#FirstName,ctr,1) = "c" |
mid(#FirstName,ctr,1) = "d" |
mid(#FirstName,ctr,1) = "e" |
mid(#FirstName,ctr,1) = "f" |
mid(#FirstName,ctr,1) = "g" |
mid(#FirstName,ctr,1) = "h" |
mid(#FirstName,ctr,1) = "i" |
mid(#FirstName,ctr,1) = "j" |
mid(#FirstName,ctr,1) = "k" |
mid(#FirstName,ctr,1) = "l" |
mid(#FirstName,ctr,1) = "m" |
mid(#FirstName,ctr,1) = "n" |
mid(#FirstName,ctr,1) = "o" |
mid(#FirstName,ctr,1) = "p" |
mid(#FirstName,ctr,1) = "q" |
mid(#FirstName,ctr,1) = "r" |
mid(#FirstName,ctr,1) = "s" |
mid(#FirstName,ctr,1) = "t" |
mid(#FirstName,ctr,1) = "u" |
mid(#FirstName,ctr,1) = "v" |
mid(#FirstName,ctr,1) = "w" |
mid(#FirstName,ctr,1) = "x" |
mid(#FirstName,ctr,1) = "y" |
mid(#FirstName,ctr,1) = "z" |
mid(#FirstName,ctr,1) = "-"
then
Fname = Fname + mid(#FirstName,ctr,1);
ctr = ctr+1;
end
Fname = trimleft(Fname,"-");
if len(Fname) = 2 then
Fname = Fname + "-";
if len(Fname) = 1 then
Fname = Fname + "--";
if len(Fname) > 50 then
Fname = left(Fname,50);
if len(Fname) = 0 then
Fname = "UNKNOWN";
#FirstName = Fname;
Solution:
The same logic is applied to another field called PostalCode. Here is what I tried, but somehow the regular expression is not working. I am still trying to fix it, and am posting it here in the meantime for expert solutions.
<xsl:if test="contains(substring(string(/OrdersToFulfill/Order/OrderHeader/BillTo/Address/PostalCode),1,1),'[^a-zA-Z1-9. ]')">
<xsl:variable name="vPostalCode"
select="concat(string(/OrdersToFulfill/Order/OrderHeader/BillTo/Address/PostalCode),substring(string(/OrdersToFulfill/Order/OrderHeader/BillTo/Address/PostalCode),1,1))"/>
<xsl:element name="PostalCode">
<xsl:if test="string-length($vPostalCode) < 9 ">
<xsl:value-of
select="substring($vPostalCode,1, 5)"/>
</xsl:if>
<xsl:if test="string-length(string(/OrdersToFulfill/Order/OrderHeader/BillTo/Address/PostalCode)) > 10 ">
<xsl:value-of
select="substring($vPostalCode,1, 10)"/>
</xsl:if>
</xsl:element>
</xsl:if>
OK, I think I've worked out what your code does: it constructs a string containing all the letters and hyphens from an input string, and removes everything else. It then strips leading hyphens, pads with hyphens to a minimum length of three, truncates to a maximum length of 50, and substitutes "UNKNOWN" if nothing is left. (Why couldn't you have told us that?)
Also, if you tried to write the code and it didn't work, then you should show us the code so we can tell you where you went wrong.
The first part of the problem can be done using
replace($in, "[^A-Za-z\-]", "")
Padding with hyphens to length 3 can be done with
if (string-length($s) lt 3)
then substring(concat($s, "---"), 1, 3)
else $s
Truncation to a maximum of 50 characters can be done with
substring($s, 1, 50)
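To check an XSLT translation against the original behaviour, it can help to have a reference implementation to compare outputs with. Below is a Python sketch of the pseudocode from the question as I read it. The function name is mine, I am assuming trimleft(Fname,"-") strips all leading hyphens, and only letters and the hyphen are kept because those are the only characters the pseudocode tests for.

```python
import re


def clean_first_name(raw: str) -> str:
    """Reference version of the FirstName clean-up from the question."""
    # Keep only ASCII letters and hyphens, as the long if-chain does.
    name = re.sub(r"[^A-Za-z-]", "", raw)
    # Assumed meaning of trimleft(Fname, "-"): drop leading hyphens.
    name = name.lstrip("-")
    if not name:
        return "UNKNOWN"
    if len(name) < 3:
        # Pad one- and two-character names with hyphens to length three.
        name = name.ljust(3, "-")
    # Truncate to at most 50 characters.
    return name[:50]
```

For example, clean_first_name("Al") gives "Al-", and an input with no letters at all gives "UNKNOWN".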
So, I've been working on my assembler again, and this time I'm stuck on the floating-point registers. Basically, there are 32 FP registers, and I want to match them when I write F0, F1, F2, ..., F31. I wrote the following in my lexer:
REG
: ('R0'|'r0')
| ('AT'|'at')
| ('v'[0-1]|'V'[0-1])
| ('a'[0-3]|'A'[0-3])
| ('t'[0-9]|'T'[0-9])
| ('s'[0-9]|'S'[0-8])
| ('k'[0-1]|'K'[0-1])
| ('GP'|'gp')
| ('SP'|'sp')
| ('FP'|'fp')
| ('ra'|'RA')
| ('f'[0-31]|'F'[0-31])+
;
Basically, every other register here works without any problems, but F0-F31 does not. I tested it and noticed that it only matches F0-F3, not anything higher. That makes sense in hindsight, but I couldn't work out how to match values of 10 and above. I also tried some workarounds like adding more [0-9] after the others, but that didn't help, because it would then also match values like F36 or F39. So, any idea how I could handle this?
Thanks in Advance.
The class [0-31] matches a single character: 0, 1, 2, 3, or 1 (again). To emphasise: regular expression character classes do not match numeric values, only (text) characters.
To match F0, F1, F2, ..., F31 (and f0, f1, f2, ..., f31), do something like this:
FREG
: [fF] ( [0-9] // matches f0..f9 (and F0..F9)
| [1-2] [0-9] // matches f10..f29 (and F10..F29)
| '3' [01] // matches f30 or f31 (and F30 or F31)
)
;
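The same point can be checked quickly outside ANTLR. Here is a small Python sketch (the names FREG, NAIVE and is_fp_register are just for illustration) that contrasts the character class from the question with the digit-by-digit alternatives:

```python
import re

# Digit-by-digit version: 10..29, then 30..31, then a single digit 0..9.
FREG = re.compile(r"[fF](?:[12][0-9]|3[01]|[0-9])")

# The class from the question: one character out of 0, 1, 2, 3 (and 1 again).
NAIVE = re.compile(r"[fF][0-31]")


def is_fp_register(token: str) -> bool:
    """True when the whole token is one of f0..f31 or F0..F31."""
    return FREG.fullmatch(token) is not None


for token in ("F0", "f9", "F10", "f31", "F32", "F36"):
    print(token, is_fp_register(token), NAIVE.fullmatch(token) is not None)
```

The naive pattern accepts F0 to F3 only, which matches what you observed, while the digit-by-digit version accepts exactly the 32 register names in either case.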
Your complete REG rule could be written as follows:
REG
: [rR] '0'
| 'AT' | 'at'
| [vV] [01]
| [aA] [0-3]
| [tT] [0-9]
| [sS] [0-9]
| [kK] [01]
| 'GP' | 'gp'
| 'SP' | 'sp'
| 'FP' | 'fp'
| 'RA' | 'ra'
| [fF] ( [0-9] | [1-2] [0-9] | '3' [01] )
;
Note that [01] and [0-1] match the same: either '0' or '1'. Also be aware that 'ra' | 'RA' does not match 'Ra'. If you want 'Ra' and 'rA' to match as well, write it like this: [rR] [aA].
((https?|ftp)://|www.)(\S+[^.*])
I would like this expression to check for consecutive . characters. If it finds two or more periods back to back, the expression should fail. On the other hand, if it succeeds, I want it to match every character and/or symbol up to the first whitespace encountered.
In other words:
www.yahoo..com should fail
On a related note: I realize that this expression is very basic in terms of judging valid URL structure. I have another "more intelligent" regular expression in place that precedes the one above. The purpose of the posted one is to check the validity of the URL that is passed on from the initial regular expression via preg_match_all.
You may want to check out FILTER_VALIDATE_URL (see http://php.net/manual/en/book.filter.php) instead of using a regex to validate your URLs.
Here's example usage:
$url = "http://www.example.com";
if(!filter_var($url, FILTER_VALIDATE_URL))
{
echo "URL is not valid";
}
else
{
echo "URL is valid";
}
You can do something like this:
((https?|ftp)\:\/\/|www\.)((?:[\w\-]+\.)*[\w\-]+)
This will not yet check for valid URLs, even if it rejects double dots. I'd advise against using regex if the language you're using (PHP?) has other means of validating a URL.
RFC 1738 states the following:
; URL schemeparts for ip based protocols:
ip-schemepart = "//" login [ "/" urlpath ]
login = [ user [ ":" password ] "@" ] hostport
hostport = host [ ":" port ]
host = hostname | hostnumber
hostname = *[ domainlabel "." ] toplabel
domainlabel = alphadigit | alphadigit *[ alphadigit | "-" ] alphadigit
toplabel = alpha | alpha *[ alphadigit | "-" ] alphadigit
alphadigit = alpha | digit
hostnumber = digits "." digits "." digits "." digits
port = digits
user = *[ uchar | ";" | "?" | "&" | "=" ]
password = *[ uchar | ";" | "?" | "&" | "=" ]
urlpath = *xchar ; depends on protocol see section 3.1
; HTTP
httpurl = "http://" hostport [ "/" hpath [ "?" search ]]
hpath = hsegment *[ "/" hsegment ]
hsegment = *[ uchar | ";" | ":" | "@" | "&" | "=" ]
search = *[ uchar | ";" | ":" | "@" | "&" | "=" ]
; Miscellaneous definitions
lowalpha = "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" |
"i" | "j" | "k" | "l" | "m" | "n" | "o" | "p" |
"q" | "r" | "s" | "t" | "u" | "v" | "w" | "x" |
"y" | "z"
hialpha = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" |
"J" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R" |
"S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z"
alpha = lowalpha | hialpha
digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" |
"8" | "9"
safe = "$" | "-" | "_" | "." | "+"
extra = "!" | "*" | "'" | "(" | ")" | ","
national = "{" | "}" | "|" | "\" | "^" | "~" | "[" | "]" | "`"
punctuation = "<" | ">" | "#" | "%" | <">
reserved = ";" | "/" | "?" | ":" | "@" | "&" | "="
hex = digit | "A" | "B" | "C" | "D" | "E" | "F" |
"a" | "b" | "c" | "d" | "e" | "f"
escape = "%" hex hex
unreserved = alpha | digit | safe | extra
uchar = unreserved | escape
xchar = unreserved | reserved | escape
digits = 1*digit
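To show how the hostname part of that grammar rules out double dots by itself, here is a sketch that transcribes the hostname, domainlabel and toplabel productions into a Python regex. The variable names mirror the grammar, is_hostname is my own helper, and this covers only hostname (not hostnumber, port, or the rest of the URL).

```python
import re

ALPHADIGIT = "[A-Za-z0-9]"
# domainlabel = alphadigit | alphadigit *[ alphadigit | "-" ] alphadigit
DOMAINLABEL = rf"{ALPHADIGIT}(?:[A-Za-z0-9-]*{ALPHADIGIT})?"
# toplabel = alpha | alpha *[ alphadigit | "-" ] alphadigit
TOPLABEL = rf"[A-Za-z](?:[A-Za-z0-9-]*{ALPHADIGIT})?"
# hostname = *[ domainlabel "." ] toplabel
HOSTNAME = re.compile(rf"(?:{DOMAINLABEL}\.)*{TOPLABEL}")


def is_hostname(text: str) -> bool:
    """True when the whole string fits the hostname production."""
    return HOSTNAME.fullmatch(text) is not None


for candidate in ("www.yahoo.com", "www.yahoo..com", "bad-.com"):
    print(candidate, is_hostname(candidate))
```

Every label must contain at least one character, so an empty label between two dots can never match, and www.yahoo..com is rejected without any special double-dot check.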
Using negative lookahead is an easy way if your engine supports it:
(?!.*\.\.)((https?|ftp)\:\/\/|www\.)(\S+[^.*])
Otherwise, you have to be more specific:
^((https?|ftp)\:\/\/|www\.)((\.[^.]|[^.\s])+[^.*])($|\s+)
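As a quick way to experiment with the lookahead idea, here is a Python sketch. It is a variation, not the exact pattern above: the lookahead uses \S* instead of .* so that a double dot later in the surrounding text does not reject an otherwise fine URL, and the dot after www is escaped. The name find_urls is mine, and trailing punctuation such as a sentence-ending period is not trimmed.

```python
import re

# Reject the candidate when ".." occurs anywhere before the next whitespace,
# then require a scheme or "www." prefix and take everything up to whitespace.
URL = re.compile(r"(?!\S*\.\.)(?:(?:https?|ftp)://|www\.)\S+")


def find_urls(text: str):
    """Return all URL-like tokens in text that contain no consecutive dots."""
    return [match.group(0) for match in URL.finditer(text)]


print(find_urls("see http://example.com/a.b and www.yahoo.com now"))
print(find_urls("www.yahoo..com"))
```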