PostgreSQL regex to match uppercase, Unicode-aware

The title sums it up pretty well. I'm looking for a regular expression that matches any Unicode uppercase character, for use with the Postgres ~ operator.
The obvious way doesn't work:
=> select 'A' ~ '[[:upper:]]';
?column?
----------
t
(1 row)
=> select 'Ó' ~ '[[:upper:]]';
?column?
----------
t
(1 row)
=> select 'Ą' ~ '[[:upper:]]';
?column?
----------
f
(1 row)
I'm using Postgresql 9.1 and my locale is set to pl_PL.UTF-8. The ordering works fine.
=> show LC_CTYPE;
lc_ctype
-------------
pl_PL.UTF-8
(1 row)

The regexp engine of PG 9.1 and older versions does not correctly classify characters whose codepoint doesn't fit in one byte.
Since the codepoint of 'Ó' is 211, it gets that one right, but the codepoint of 'Ą' is 260, beyond 255.
PG 9.2 is better at this, though still not 100% right for all alphabets. See this commit in the PostgreSQL source code, and particularly these parts of the comment:
remove the hard-wired limitation to not consider wctype.h results for
character codes above 255
and
Still, we can push it up to U+7FF (which I chose as the limit of
2-byte UTF8 characters), which will at least make Eastern Europeans
happy pending a better solution
Unfortunately this was not backported to 9.1.
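So on a 9.2 server with the same locale, the failing check from the question should come back true; 'Ą' is U+0104, well below the U+7FF limit quoted above. A quick check one could run (expected result, not verified here):
select 'Ą' ~ '[[:upper:]]';  -- expected: t on 9.2, still f on 9.1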

I've found that Perl regular expressions handle Unicode properly.
create extension plperl;
create function is_letter_upper(text) returns boolean
immutable strict language plperl
as $$
use feature 'unicode_strings';
return $_[0] =~ /^\p{IsUpper}$/ ? "true" : "false";
$$;
Tested on postgres 9.2 with perl 5.16.2.
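A quick sanity check of the function, assuming a UTF-8 client encoding (expected results shown as comments):
select is_letter_upper('Ą');  -- expected: t
select is_letter_upper('ą');  -- expected: f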

Related

Postgres: regexp_replace & trim

I need to remove '.0' at the end of the string but I have some issues.
In PG 8.4 I had this expression and it worked fine.
select regexp_replace('10.1.2.3.0', '(\\\\.0)+$', '');
and the result was
'10.1.2.3' - a good result.
But after PG was updated to a 9.x version the result is
'10.1.2.3.0' - the unchanged input string, which is not OK.
I also tried the trim function;
in this case it is OK:
select trim('.0' from '10.1.2.3.0');
result is '10.1.2.3' - ok
but when I have 10 at the end of the string I get an unexpected result:
select trim('.0' from '10.1.2.3.10.0');
or
select trim('.0' from '10.1.2.3.10');
result is 10.1.2.3.1 - the 0 is trimmed from the 10
Can somebody suggest a solution and explain what is wrong with the trim function, and what changed in regexp_replace in the latest versions?
I would suggest doing something like this:
select (case when col like '%.0' then left(col, length(col) - 2)
else col
end)
This will work in all versions of Postgres and you don't need to worry about regular expression parsing.
As for the regular expression version, both of these work for me (on recent versions of Postgres):
select regexp_replace('10.1.2.3.0', '(\.0)+$', '');
select regexp_replace('10.1.2.3.0', '([.]0)+$', '');
I suspect the problem with the earlier version is the string parsing with the backslash escape character -- you can use square brackets instead of backslash and the pattern should work in any version.
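To see what the pattern actually looks like after string parsing, here is a small check, assuming the 9.x default of standard_conforming_strings = on (the default changed to on in 9.1):
show standard_conforming_strings;                      -- on
select '(\\\\.0)+$';                                   -- yields (\\\\.0)+$ : taken literally now, so the old pattern looks for actual backslashes
select regexp_replace('10.1.2.3.0', '(\.0)+$', '');    -- 10.1.2.3
select regexp_replace('10.1.2.3.0', E'(\\.0)+$', '');  -- 10.1.2.3, using escape string syntax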

How to search all CJK chars in vim?

I can search a CJK char (such as 小) by using a unicode code point:
/\%u5c0f
/[\u5c0f]
I cannot search all CJK chars by using [\u4E00-\u9FFF], because the Vim manual says:
:help /[]
NOTE: The other backslash codes mentioned above do not work inside []!
Is there a way to do the job?
It seems that Vim ranges are somehow limited to the same high byte, because /[\u4E00-\u4eFF] works fine. If you don't mind the mess, try:
/[\u4e00-\u4eff\u4f00-\u4fff\u5000-\u50ff\u5100-\u51ff\u5200-\u52ff\u5300-\u53ff\u5400-\u54ff\u5500-\u55ff\u5600-\u56ff\u5700-\u57ff\u5800-\u58ff\u5900-\u59ff\u5a00-\u5aff\u5b00-\u5bff\u5c00-\u5cff\u5d00-\u5dff\u5e00-\u5eff\u5f00-\u5fff\u6000-\u60ff\u6100-\u61ff\u6200-\u62ff\u6300-\u63ff\u6400-\u64ff\u6500-\u65ff\u6600-\u66ff\u6700-\u67ff\u6800-\u68ff\u6900-\u69ff\u6a00-\u6aff\u6b00-\u6bff\u6c00-\u6cff\u6d00-\u6dff\u6e00-\u6eff\u6f00-\u6fff\u7000-\u70ff\u7100-\u71ff\u7200-\u72ff\u7300-\u73ff\u7400-\u74ff\u7500-\u75ff\u7600-\u76ff\u7700-\u77ff\u7800-\u78ff\u7900-\u79ff\u7a00-\u7aff\u7b00-\u7bff\u7c00-\u7cff\u7d00-\u7dff\u7e00-\u7eff\u7f00-\u7fff\u8000-\u80ff\u8100-\u81ff\u8200-\u82ff\u8300-\u83ff\u8400-\u84ff\u8500-\u85ff\u8600-\u86ff\u8700-\u87ff\u8800-\u88ff\u8900-\u89ff\u8a00-\u8aff\u8b00-\u8bff\u8c00-\u8cff\u8d00-\u8dff\u8e00-\u8eff\u8f00-\u8fff\u9000-\u90ff\u9100-\u91ff\u9200-\u92ff\u9300-\u93ff\u9400-\u94ff\u9500-\u95ff\u9600-\u96ff\u9700-\u97ff\u9800-\u98ff\u9900-\u99ff\u9a00-\u9aff\u9b00-\u9bff\u9c00-\u9cff\u9d00-\u9dff\u9e00-\u9eff\u9f00-\u9fff]
I played around with this quite a bit and in vim the following seems to find all the Kanji characters in my Kanji/Pinyin/English text:
[^!-~0-9 aāáǎăàeēéěèiīíǐĭìoōóǒŏòuūúǔùǖǘǚǜ]
Vim cannot actually do this by itself, since you aren’t given access to Unicode properties like \p{Han}.
As of Unicode v6.0, the range of codepoints for characters in the Han script is:
2E80-2E99 2E9B-2EF3 2F00-2FD5 3005-3005 3007-3007 3021-3029 3038-303B 3400-4DB5 4E00-9FCB F900-FA2D FA30-FA6D FA70-FAD9 20000-2A6D6 2A700-2B734 2B740-2B81D 2F800-2FA1D
Whereas with Unicode v6.1, the range of Han codepoints has changed to:
2E80-2E99 2E9B-2EF3 2F00-2FD5 3005-3005 3007-3007 3021-3029 3038-303B 3400-4DB5 4E00-9FCC F900-FA6D FA70-FAD9 20000-2A6D6 2A700-2B734 2B740-2B81D 2F800-2FA1D
I also seem to recall that Vim has difficulties expressing astral code points, which are needed for this to work correctly. For example, using the flexible \x{HHHHHH} notation from Java 7 or Perl, you would have:
[\x{2E80}-\x{2E99}\x{2E9B}-\x{2EF3}\x{2F00}-\x{2FD5}\x{3005}-\x{3005}\x{3007}-\x{3007}\x{3021}-\x{3029}\x{3038}-\x{303B}\x{3400}-\x{4DB5}\x{4E00}-\x{9FCC}\x{F900}-\x{FA6D}\x{FA70}-\x{FAD9}\x{20000}-\x{2A6D6}\x{2A700}-\x{2B734}\x{2B740}-\x{2B81D}\x{2F800}-\x{2FA1D}]
Notice that the last part of the range is \x{2F800}-\x{2FA1D}, which is beyond the BMP. But what you really need is \p{Han} (meaning, \p{Script=Han}). This again shows that regex dialects that don’t support at least Level 1 of UTS#18: Basic Unicode Support are inadequate for working with Unicode. Vim’s regexes are inadequate for basic Unicode work.
EDITED TO ADD
Here’s the program that dumps out the ranges of code points that apply to any given Unicode script.
#!/usr/bin/env perl
#
# uniscrange - given a Unicode script name, print out the ranges of code
# points that apply.
# Tom Christiansen <tchrist@perl.com>
use strict;
use warnings;
use Unicode::UCD qw(charscript);
for my $arg (@ARGV) {
print "$arg: " if @ARGV > 1;
dump_range($arg);
}
sub dump_range {
my($scriptname) = @_;
my $alist = charscript($scriptname);
unless ($alist) {
warn "Unknown script '$scriptname'\n";
return;
}
for my $aref (@$alist) {
my($start, $stop, $name) = @$aref;
die "got $name, not $scriptname\n" unless $name eq $scriptname;
printf "%04X-%04X ", $start, $stop;
}
print "\n";
}
Its answers depend on which version of Perl — and thus, which version of Unicode — you’re running it against.
$ perl5.8.8 ~/uniscrange Latin Greek
Latin: 0041-005A 0061-007A 00AA-00AA 00BA-00BA 00C0-00D6 00D8-00F6 00F8-01BA 01BB-01BB 01BC-01BF 01C0-01C3 01C4-0241 0250-02AF 02B0-02B8 02E0-02E4 1D00-1D25 1D2C-1D5C 1D62-1D65 1D6B-1D77 1D79-1D9A 1D9B-1DBF 1E00-1E9B 1EA0-1EF9 2071-2071 207F-207F 2090-2094 212A-212B FB00-FB06 FF21-FF3A FF41-FF5A
Greek: 0374-0375 037A-037A 0384-0385 0386-0386 0388-038A 038C-038C 038E-03A1 03A3-03CE 03D0-03E1 03F0-03F5 03F6-03F6 03F7-03FF 1D26-1D2A 1D5D-1D61 1D66-1D6A 1F00-1F15 1F18-1F1D 1F20-1F45 1F48-1F4D 1F50-1F57 1F59-1F59 1F5B-1F5B 1F5D-1F5D 1F5F-1F7D 1F80-1FB4 1FB6-1FBC 1FBD-1FBD 1FBE-1FBE 1FBF-1FC1 1FC2-1FC4 1FC6-1FCC 1FCD-1FCF 1FD0-1FD3 1FD6-1FDB 1FDD-1FDF 1FE0-1FEC 1FED-1FEF 1FF2-1FF4 1FF6-1FFC 1FFD-1FFE 2126-2126 10140-10174 10175-10178 10179-10189 1018A-1018A 1D200-1D241 1D242-1D244 1D245-1D245
$ perl5.10.0 ~/uniscrange Latin Greek
Latin: 0041-005A 0061-007A 00AA-00AA 00BA-00BA 00C0-00D6 00D8-00F6 00F8-01BA 01BB-01BB 01BC-01BF 01C0-01C3 01C4-0293 0294-0294 0295-02AF 02B0-02B8 02E0-02E4 1D00-1D25 1D2C-1D5C 1D62-1D65 1D6B-1D77 1D79-1D9A 1D9B-1DBE 1E00-1E9B 1EA0-1EF9 2071-2071 207F-207F 2090-2094 212A-212B 2132-2132 214E-214E 2184-2184 2C60-2C6C 2C74-2C77 FB00-FB06 FF21-FF3A FF41-FF5A
Greek: 0374-0375 037A-037A 037B-037D 0384-0385 0386-0386 0388-038A 038C-038C 038E-03A1 03A3-03CE 03D0-03E1 03F0-03F5 03F6-03F6 03F7-03FF 1D26-1D2A 1D5D-1D61 1D66-1D6A 1DBF-1DBF 1F00-1F15 1F18-1F1D 1F20-1F45 1F48-1F4D 1F50-1F57 1F59-1F59 1F5B-1F5B 1F5D-1F5D 1F5F-1F7D 1F80-1FB4 1FB6-1FBC 1FBD-1FBD 1FBE-1FBE 1FBF-1FC1 1FC2-1FC4 1FC6-1FCC 1FCD-1FCF 1FD0-1FD3 1FD6-1FDB 1FDD-1FDF 1FE0-1FEC 1FED-1FEF 1FF2-1FF4 1FF6-1FFC 1FFD-1FFE 2126-2126 10140-10174 10175-10178 10179-10189 1018A-1018A 1D200-1D241 1D242-1D244 1D245-1D245
You can use the corelist -a Unicode command to see which version of Unicode goes with which version of Perl. Here is selected output:
$ corelist -a Unicode
v5.8.8 4.1.0
v5.10.0 5.0.0
v5.12.2 5.2.0
v5.14.0 6.0.0
v5.16.0 6.1.0
I don't understand the "same high byte problem" but it seems like it does not apply (at least not for me, VIM 7.4) when you actually enter the character to build up the ranges.
I usually search from U+3400(㐀) to U+9FCC(鿌) to capture Chinese characters in Japanese texts.
U+3400(㐀) is beginning of "CJK Unified Ideographs Extension A"
U+4DC0 - U+4DFF "Yijing Hexagram Symbols" is in between but not excluded for the sake of simplicity.
U+9FCC(鿌) is the end of "CJK Unified Ideographs"
Please note that the Japanese writing uses "々" as a kanji repetition symbol which is not part of this block. You can find it in the Block "Japanese Symbols and Punctuation."
/[㐀-鿌]
An (almost?) complete set of Chinese characters with extensions
/[㐀-鿌豈-龎𠀀-𪘀]
This range includes:
CJK Unified Ideographs Extension A
Yijing Hexagram Symbols (shouldn't be part of it)
CJK Unified Ideographs (main part)
CJK Compatibility Ideographs
CJK Unified Ideographs Extension B,
CJK Unified Ideographs Extension C,
CJK Unified Ideographs Extension D,
CJK Compatibility Ideographs Supplement
Bonus for people working on content in Japanese language:
Hiragana goes from U+3041 to U+3096
/[ぁ-ゟ]
Katakana
/[゠-ヿ]
Kanji Radicals
/[⺀-⿕]
Japanese Symbols and Punctuation.
Note that this range also includes 々(repetition of last kanji) and 〆(abbreviation for shime「しめ」). You might want to add them to your range to find words.
[　-〿]
Miscellaneous Japanese Symbols and Characters
/[ㇰ-ㇿ㈠-㉃㊀-㍿]
Alphanumeric and Punctuation (Full Width)
[！-～]
sources:
http://www.fileformat.info/info/unicode/char/9fcc/index.htm
http://www.localizingjapan.com/blog/2012/01/20/regular-expressions-for-japanese-text/comment-page-1/#comment-46891
In some simple cases, I use this to search for Chinese characters. It also matches Japanese, Russian and other non-ASCII characters.
[^\x00-\xff]

Finding and removing Non-ASCII characters from an Oracle Varchar2

We are currently migrating one of our Oracle databases to UTF8 and we have found a few records that are near the 4000 byte VARCHAR2 limit.
When we try to migrate these records they fail, as they contain characters that become multibyte UTF8 characters.
What I want to do within PL/SQL is locate these characters to see what they are and then either change them or remove them.
I would like to do:
SELECT REGEXP_REPLACE(COLUMN, '[^[:ascii:]]', '')
but Oracle does not implement the [:ascii:] character class.
Is there a simple way doing what I want to do?
I think this will do the trick:
SELECT REGEXP_REPLACE(COLUMN, '[^[:print:]]', '')
If you use the ASCIISTR function to convert the Unicode to literals of the form \nnnn, you can then use REGEXP_REPLACE to strip those literals out, like so...
UPDATE table SET field = REGEXP_REPLACE(ASCIISTR(field), '\\[[:xdigit:]]{4}', '')
...where field and table are your field and table names respectively.
I wouldn't recommend it for production code, but it makes sense and seems to work:
SELECT REGEXP_REPLACE(COLUMN,'[^' || CHR(1) || '-' || CHR(127) || '],'')
The select may look like the following sample:
select nvalue from table
where length(asciistr(nvalue))!=length(nvalue)
order by nvalue;
In a single-byte ASCII-compatible encoding (e.g. Latin-1), ASCII characters are simply bytes in the range 0 to 127. So you can use something like [\x80-\xFF] to detect non-ASCII characters.
There's probably a more direct way using regular expressions. With luck, somebody else will provide it. But here's what I'd do without needing to go to the manuals.
Create a PL/SQL function that receives your input string and returns a VARCHAR2.
In the PL/SQL function, do an ASCIISTR() of your input. Use PL/SQL because that may return a string longer than 4000 bytes, and you have 32K available for VARCHAR2 in PL/SQL.
That function converts the non-ASCII characters to \xxxx notation, so you can use regular expressions to find and remove those. Then return the result.
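A hypothetical sketch of that approach (the function name and sizes are mine, not from the answer; the \XXXX-stripping regex is the same idea as in the ASCIISTR answer above):
CREATE OR REPLACE FUNCTION strip_non_ascii(p_in IN VARCHAR2) RETURN VARCHAR2 IS
  l_work VARCHAR2(32767);
BEGIN
  -- ASCIISTR turns non-ASCII characters into \XXXX escapes; doing it in
  -- PL/SQL avoids the 4000-byte limit of SQL VARCHAR2
  l_work := ASCIISTR(p_in);
  -- drop the escapes: a literal backslash followed by 4 hex digits
  RETURN REGEXP_REPLACE(l_work, '\\[[:xdigit:]]{4}', '');
END strip_non_ascii;
/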
The following also works:
select dump(a,1016), a from (
SELECT REGEXP_REPLACE (
CONVERT (
'3735844533120%$03  ',
'US7ASCII',
'WE8ISO8859P1'),
'[^!#/\.,;:<>#$%&()_=[:alnum:][:blank:]]') a
FROM DUAL);
I had a similar issue and blogged about it here.
I started with the regular expression for alpha numerics, then added in the few basic punctuation characters I liked:
select dump(a,1016), a, b
from
(select regexp_replace(COLUMN,'[[:alnum:]/''%()> -.:=;[]','') a,
COLUMN b
from TABLE)
where a is not null
order by a;
I used dump with the 1016 variant to give out the hex characters I wanted to replace, which I could then use in a utl_raw.cast_to_varchar2.
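For illustration only (not from the blog post), here is roughly what that workflow looks like, assuming an AL32UTF8 database; the C2A0 byte pair (a no-break space) stands in for whatever hex bytes your own DUMP output shows:
SELECT DUMP('bad' || UNISTR('\00A0') || 'value', 1016) FROM DUAL;  -- shows the character set and the hex bytes
SELECT REPLACE('bad' || UNISTR('\00A0') || 'value',
               UTL_RAW.CAST_TO_VARCHAR2(HEXTORAW('C2A0')),        -- rebuild the offending character from its hex
               ' ')
FROM DUAL;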
I found the answer here:
http://www.squaredba.com/remove-non-ascii-characters-from-a-column-255.html
CREATE OR REPLACE FUNCTION O1DW.RECTIFY_NON_ASCII(INPUT_STR IN VARCHAR2)
RETURN VARCHAR2
IS
str VARCHAR2(2000);
act number :=0;
cnt number :=0;
askey number :=0;
OUTPUT_STR VARCHAR2(2000);
begin
str := '^'||TO_CHAR(INPUT_STR)||'^';
cnt:=length(str);
for i in 1 .. cnt loop
askey :=0;
select ascii(substr(str,i,1)) into askey
from dual;
if askey < 32 or askey >=127 then
str := '^'||REPLACE(str, CHR(askey), '');
end if;
end loop;
OUTPUT_STR := trim(ltrim(rtrim(trim(str),'^'),'^'));
RETURN (OUTPUT_STR);
end;
/
Then run this to update your data
update o1dw.rate_ipselect_p_20110505
set NCANI = RECTIFY_NON_ASCII(NCANI);
Try the following:
-- To detect
select 1 from dual
where regexp_like(trim('xx test text æ¸¬è© ¦ “xmx” number²'),'['||chr(128)||'-'||chr(255)||']','in')
-- To strip out
select regexp_replace(trim('xx test text æ¸¬è© ¦ “xmxmx” number²'),'['||chr(128)||'-'||chr(255)||']','',1,0,'in')
from dual
You can try something like the following to search for columns containing non-ASCII characters:
select * from your_table where your_col <> asciistr(your_col);
I had a similar requirement (to avoid the ugly ORA-31061: XDB error: special char to escaped char conversion failed), but had to keep the line breaks.
I tried this from an excellent comment
'[^ -~|[:space:]]'
but got ORA-12728: invalid range in regular expression.
But it led me to my solution:
select t.*, regexp_replace(deta, '[^[:print:]|[:space:]]', '#') from
(select '- <- strangest thing here, and I want to keep line break after
-' deta from dual ) t
In my TOAD tool this displays with the offending character replaced by '#'.
The pattern replaces everything that (^ =>) is not in the sets of printing ([:print:]) or whitespace ([:space:]) characters.
Thanks, this worked for my purposes. BTW there is a missing single-quote in the example above.
REGEXP_REPLACE (COLUMN, '[^' || CHR (32) || '-' || CHR (127) || ']', ' ')
I used it in a word-wrap function. Occasionally there was an embedded NewLine/ NL / CHR(10) / 0A in the incoming text that was messing things up.
The answer given by Francisco Hayoz is the best. Don't use PL/SQL functions if SQL can do it for you.
Here is a simple test in Oracle 11.2.0.3:
select s
, regexp_replace(s,'[^'||chr(1)||'-'||chr(127)||']','') "rep ^1-127"
, dump(regexp_replace(s,'['||chr(127)||'-'||chr(225)||']','')) "rep 127-255"
from (
select listagg(c, '') within group (order by c) s
from (select 127+level l,chr(127+level) c from dual connect by level < 129))
And "rep 127-255" is
Typ=1 Len=30: 226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255
i.e. for some reason this version of Oracle does not replace chr(226) and above.
Using '['||chr(127)||'-'||chr(225)||']' gives the desired result.
If you need to replace other characters, just add them to the regex above, or use nested replace/regexp_replace if the replacement is different than '' (the null string).
Please note that whenever you use
regexp_like(column, '[A-Z]')
Oracle's regexp engine will match certain characters from the Latin-1 range as well: this applies to all characters that look similar to ASCII characters like Ä->A, Ö->O, Ü->U, etc., so that [A-Z] is not what you know from other environments like, say, Perl.
Instead of fiddling with regular expressions, try changing to the NVARCHAR2 datatype prior to the character set upgrade.
Another approach: instead of cutting away part of the fields' contents you might try the SOUNDEX function, provided your database contains European (i.e. Latin-1) characters only. Or you could just write a function that translates characters from the Latin-1 range into similar-looking ASCII characters, like
å => a
ä => a
ö => o
of course only for text blocks exceeding 4000 bytes when transformed to UTF-8.
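A minimal sketch of that translate idea (your_col and your_table are placeholders; extend both character lists to whatever Latin-1 letters actually occur in your data):
SELECT TRANSLATE(your_col, 'åäöÅÄÖüéè', 'aaoAAOuee') FROM your_table;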
As noted in this comment, and this comment, you can use a range.
Using Oracle 11, the following works very well:
SELECT REGEXP_REPLACE(dummy, '[^ -~|[:space:]]', '?') AS dummy FROM DUAL;
This will replace anything outside that printable range as a question mark.
This will run as-is so you can verify the syntax with your installation.
Replace dummy and dual with your own column/table.
Do this, it will work.
trim(replace(ntwk_slctor_key_txt, chr(0), ''))
I'm a bit late in answering this question, but had the same problem recently (people cut and paste all sorts of stuff into a string and we don't always know what it is).
The following is a simple character whitelist approach:
SELECT est.clients_ref
,TRANSLATE (
est.clients_ref
, 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890#$%^&*()_+-={}|[]:";<>?,./'
|| REPLACE (
TRANSLATE (
est.clients_ref
,'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890#$%^&*()_+-={}|[]:";<>?,./'
,'~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~'
)
,'~'
)
,'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890#$%^&*()_+-={}|[]:";<>?,./'
)
clean_ref
FROM edms_staging_table est

escaping bracket in postgresql query

I am trying to escape a bracket in a pattern-matching expression for PostgreSQL 8.2.
The clause looks something like:
WHERE field SIMILAR TO '%UPC=\[ R%%(\mLE)%'
but I keep getting:
ERROR: invalid regular expression: brackets [] not balanced
Try this:
select '%UPC=\[ R%%(\mLE)%';
WARNING: nonstandard use of escape in a string literal
LINE 1: select '%UPC=\[ R%%(\mLE)%';
^
HINT: Use the escape string syntax for escapes, e.g., E'\r\n'.
?column?
------------------
%UPC=[ R%%(mLE)%
(1 row)
You need to put Postgres in standard-conforming-strings mode instead of backward-compatible mode:
set standard_conforming_strings=1;
select '%UPC=\[ R%%(\mLE)%';
?column?
--------------------
%UPC=\[ R%%(\mLE)%
(1 row)
Or you need to use escape string syntax which works regardless of mode:
set standard_conforming_strings=1;
select E'%UPC=\\[ R%%(\\mLE)%';
?column?
--------------------
%UPC=\[ R%%(\mLE)%
(1 row)
set standard_conforming_strings=0;
select E'%UPC=\\[ R%%(\\mLE)%';
?column?
--------------------
%UPC=\[ R%%(\mLE)%
(1 row)
You can set this in postgresql.conf for all databases, with ALTER DATABASE for a single database, with ALTER USER for a single user or group of users, or with SET for the current connection.
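Sketched out, with placeholder names for the database and user:
alter database mydb set standard_conforming_strings = on;
alter user myuser set standard_conforming_strings = on;
set standard_conforming_strings = on;  -- current connection only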

A regex for version number parsing

I have a version number of the following form:
version.release.modification
where version, release and modification are either a set of digits or the '*' wildcard character. Additionally, any of these numbers (and any preceding .) may be missing.
So the following are valid and parse as:
1.23.456 = version 1, release 23, modification 456
1.23 = version 1, release 23, any modification
1.23.* = version 1, release 23, any modification
1.* = version 1, any release, any modification
1 = version 1, any release, any modification
* = any version, any release, any modification
But these are not valid:
*.12
*123.1
12*
12.*.34
Can anyone provide me a not-too-complex regex to validate and retrieve the release, version and modification numbers?
I'd express the format as:
"1-3 dot-separated components, each numeric except that the last one may be *"
As a regexp, that's:
^(\d+\.)?(\d+\.)?(\*|\d+)$
[Edit to add: this solution is a concise way to validate, but it has been pointed out that extracting the values requires extra work. It's a matter of taste whether to deal with this by complicating the regexp, or by processing the matched groups.
In my solution, the groups capture the "." characters. This can be dealt with using non-capturing groups as in ajborley's answer.
Also, the rightmost group will capture the last component, even if there are fewer than three components, and so for example a two-component input results in the first and last groups capturing and the middle one undefined. I think this can be dealt with by non-greedy groups where supported.
Perl code to deal with both issues after the regexp could be something like this:
@version = ();
@groups = ($1, $2, $3);
foreach (@groups) {
next if !defined;
s/\.//;
push @version, $_;
}
($major, $minor, $mod) = (@version, "*", "*");
Which isn't really any shorter than splitting on "."
]
Use regex and now you have two problems. I would split the thing on dots ("."), then make sure that each part is either a wildcard or a set of digits (regex is perfect for that). If the thing is valid, you just return the correct chunk of the split.
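A sketch of that split-then-check idea, expressed in PostgreSQL since that is what most of this page uses (extra rules, e.g. that '*' may only be the last chunk, are left out here):
select bool_and(part ~ '^(\d+|\*)$') as looks_valid
from unnest(string_to_array('1.23.*', '.')) as part;  -- true for '1.23.*', false for e.g. '12*'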
Thanks for all the responses! This is ace :)
Based on OneByOne's answer (which looked the simplest to me), I added some non-capturing groups (the '(?:' parts - thanks to VonC for introducing me to non-capturing groups!), so the groups that do capture only contain the digits or * character.
^(?:(\d+)\.)?(?:(\d+)\.)?(\*|\d+)$
Many thanks everyone!
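To see the captured groups, here is a quick check in PostgreSQL (keeping with the rest of this page; any regex engine with group extraction should behave the same way):
select regexp_matches('1.23.456', '^(?:(\d+)\.)?(?:(\d+)\.)?(\*|\d+)$');  -- expected {1,23,456}
select regexp_matches('1.*',      '^(?:(\d+)\.)?(?:(\d+)\.)?(\*|\d+)$');  -- expected {1,NULL,*}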
This might work:
^(\*|\d+(\.\d+){0,2}(\.\*)?)$
At the top level, "*" is a special case of a valid version number. Otherwise, it starts with a number. Then there are zero, one, or two ".nn" sequences, followed by an optional ".*". This regex would accept 1.2.3.* which may or may not be permitted in your application.
The code for retrieving the matched sequences, especially the (\.\d+){0,2} part, will depend on your particular regex library.
My 2 cents: I had this scenario: I had to parse version numbers out of a string literal.
(I know this is very different from the original question, but googling to find a regex for parsing version number showed this thread at the top, so adding this answer here)
So the string literal would be something like: "Service version 1.2.35.564 is running!"
I had to parse the 1.2.35.564 out of this literal. Taking a cue from @ajborley, my regex is as follows:
(?:(\d+)\.)?(?:(\d+)\.)?(?:(\d+)\.\d+)
A small C# snippet to test this looks like below:
void Main()
{
Regex regEx = new Regex(@"(?:(\d+)\.)?(?:(\d+)\.)?(?:(\d+)\.\d+)", RegexOptions.Compiled);
Match version = regEx.Match("The Service SuperService 2.1.309.0) is Running!");
version.Value.Dump("Version using RegEx"); // Prints 2.1.309.0
}
I had a requirement to search/match version numbers that follow the Maven convention, or even just a single digit, but with no qualifier in any case. It was peculiar; it took me some time, then I came up with this:
'^[0-9][0-9.]*$'
This makes sure the version:
Starts with a digit
Can have any number of digits
Contains only digits and '.'
One drawback is that the version can even end with '.', but it can handle versions of indefinite length (crazy versioning, if you want to call it that).
Matches:
1.2.3
1.09.5
3.4.4.5.7.8.8.
23.6.209.234.3
If you are not happy with the trailing '.', maybe you can combine this with endsWith logic.
Don't know what platform you're on but in .NET there's the System.Version class that will parse "n.n.n.n" version numbers for you.
I've seen a lot of answers, but... I have a new one. It works for me, at least. I've added a new restriction: version numbers can't start (major, minor or patch) with a zero followed by other digits.
01.0.0 is not valid
1.0.0 is valid
10.0.10 is valid
1.0.0000 is not valid
^(?:(0\\.|([1-9]+\\d*)\\.))+(?:(0\\.|([1-9]+\\d*)\\.))+((0|([1-9]+\\d*)))$
It's based on a previous one, but I see this solution as better... for me ;)
Enjoy!!!
I tend to agree with the split suggestion.
I've created a "tester" for your problem in Perl:
#!/usr/bin/perl -w
@strings = ( "1.2.3", "1.2.*", "1.*","*" );
%regexp = ( svrist => qr/(?:(\d+)\.(\d+)\.(\d+)|(\d+)\.(\d+)|(\d+))?(?:\.\*)?/,
onebyone => qr/^(\d+\.)?(\d+\.)?(\*|\d+)$/,
greg => qr/^(\*|\d+(\.\d+){0,2}(\.\*)?)$/,
vonc => qr/^((?:\d+(?!\.\*)\.)+)(\d+)?(\.\*)?$|^(\d+)\.\*$|^(\*|\d+)$/,
ajb => qr/^(?:(\d+)\.)?(?:(\d+)\.)?(\*|\d+)$/,
jrudolph => qr/^(((\d+)\.)?(\d+)\.)?(\d+|\*)$/
);
foreach my $r (keys %regexp){
my $reg = $regexp{$r};
print "Using $r regexp\n";
foreach my $s (@strings){
print "$s : ";
if ($s =~m/$reg/){
my ($main, $maj, $min,$rev,$ex1,$ex2,$ex3) = ("any","any","any","any","any","any","any");
$main = $1 if ($1 && $1 ne "*") ;
$maj = $2 if ($2 && $2 ne "*") ;
$min = $3 if ($3 && $3 ne "*") ;
$rev = $4 if ($4 && $4 ne "*") ;
$ex1 = $5 if ($5 && $5 ne "*") ;
$ex2 = $6 if ($6 && $6 ne "*") ;
$ex3 = $7 if ($7 && $7 ne "*") ;
print "$main $maj $min $rev $ex1 $ex2 $ex3\n";
}else{
print " nomatch\n";
}
}
print "------------------------\n";
}
Current output:
> perl regex.pl
Using onebyone regexp
1.2.3 : 1. 2. 3 any any any any
1.2.* : 1. 2. any any any any any
1.* : 1. any any any any any any
* : any any any any any any any
------------------------
Using svrist regexp
1.2.3 : 1 2 3 any any any any
1.2.* : any any any 1 2 any any
1.* : any any any any any 1 any
* : any any any any any any any
------------------------
Using vonc regexp
1.2.3 : 1.2. 3 any any any any any
1.2.* : 1. 2 .* any any any any
1.* : any any any 1 any any any
* : any any any any any any any
------------------------
Using ajb regexp
1.2.3 : 1 2 3 any any any any
1.2.* : 1 2 any any any any any
1.* : 1 any any any any any any
* : any any any any any any any
------------------------
Using jrudolph regexp
1.2.3 : 1.2. 1. 1 2 3 any any
1.2.* : 1.2. 1. 1 2 any any any
1.* : 1. any any 1 any any any
* : any any any any any any any
------------------------
Using greg regexp
1.2.3 : 1.2.3 .3 any any any any any
1.2.* : 1.2.* .2 .* any any any any
1.* : 1.* any .* any any any any
* : any any any any any any any
------------------------
^(?:(\d+)\.)?(?:(\d+)\.)?(\*|\d+)$
Perhaps a more concise one could be :
^(?:(\d+)\.){0,2}(\*|\d+)$
This can then be enhanced to 1.2.3.4.5.* or restricted exactly to X.Y.Z using * or {2} instead of {0,2}
This should work for what you stipulated. It hinges on the wild card position and is a nested regex:
^((\*)|([0-9]+(\.((\*)|([0-9]+(\.((\*)|([0-9]+)))?)))?))$
For parsing version numbers that follow these rules:
- Are only digits and dots
- Cannot start or end with a dot
- Cannot be two dots together
This one did the trick for me.
^(\d+)((\.{1}\d+)*)(\.{0})$
Valid cases are:
1, 0.1, 1.2.1
Another try:
^(((\d+)\.)?(\d+)\.)?(\d+|\*)$
This gives the three parts in groups 4,5,6 BUT:
They are aligned to the right. So the first non-null one of 4,5 or 6 gives the version field.
1.2.3 gives 1,2,3
1.2.* gives 1,2,*
1.2 gives null,1,2
* gives null,null,*
1.* gives null,1,*
My take on this, as a good exercise - vparse, which has a tiny source, with a simple function:
function parseVersion(v) {
var m = v.match(/\d*\.|\d+/g) || [];
v = {
major: +m[0] || 0,
minor: +m[1] || 0,
patch: +m[2] || 0,
build: +m[3] || 0
};
v.isEmpty = !v.major && !v.minor && !v.patch && !v.build;
v.parsed = [v.major, v.minor, v.patch, v.build];
v.text = v.parsed.join('.');
return v;
}
Sometimes version numbers might contain alphanumeric minor information (e.g. 1.2.0b or 1.2.0-beta). In this case I am using this regex:
([0-9]{1,4}(\.[0-9a-z]{1,6}){1,5})
(?ms)^((?:\d+(?!\.\*)\.)+)(\d+)?(\.\*)?$|^(\d+)\.\*$|^(\*|\d+)$
It matches exactly your first 6 examples, and rejects the other 4.
group 1: major or major.minor or '*'
group 2 if exists: minor or *
group 3 if exists: *
You can remove '(?ms)'.
I used it to indicate that this regexp is applied to multiple lines through QuickRex.
This matches 1.2.3.* too
^(\*|\d+(\.\d+){0,2}(\.\*)?)$
I would propose the less elegant:
((\*|\d+(\.\d+)?(\.\*)?)|\d+\.\d+\.\d+)
Keep in mind that regexps are greedy, so if you are just searching within the version number string and not within a bigger text, use ^ and $ to mark the start and end of your string.
The regexp from Greg seems to work fine (just gave it a quick try in my editor), but depending on your library/language the first part can still match the "*" within the wrong version numbers. Maybe I am missing something, as I haven't used Regexp for a year or so.
This should make sure you can only find correct version numbers:
^(\*|\d+(\.\d+)*(\.\*)?)$
edit: actually greg added them already and even improved his solution, I am too slow :)
It seems pretty hard to have a regex that does exactly what you want (i.e. accept only the cases that you need, reject all others, and return some groups for the three components). I've given it a try and come up with this:
^(\*|(\d+(\.(\d+(\.(\d+|\*))?|\*))?))$
IMO (I've not tested extensively) this should work fine as a validator for the input, but the problem is that this regex doesn't offer a way of retrieving the components. For that you still have to do a split on period.
This solution is not all-in-one, but most times in programming it doesn't need to be. Of course this depends on other restrictions that you might have in your code.
Specifying XSD elements:
<xs:simpleType>
<xs:restriction base="xs:string">
<xs:pattern value="[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}(\..*)?"/>
</xs:restriction>
</xs:simpleType>
One more solution:
^[1-9][\d]*(.[1-9][\d]*)*(.\*)?|\*$
I found this, and it works for me:
/(\^|\~?)(\d|x|\*)+\.(\d|x|\*)+\.(\d|x|\*)+
/^([1-9]{1}\d{0,3})(\.)([0-9]|[1-9]\d{1,3})(\.)([0-9]|[1-9]\d{1,3})(\-(alpha|beta|rc|HP|CP|SP|hp|cp|sp)[1-9]\d*)?(\.C[0-9a-zA-Z]+(-U[1-9]\d*)?)?(\.[0-9a-zA-Z]+)?$/
A normal version: ([1-9]{1}\d{0,3})(\.)([0-9]|[1-9]\d{1,3})(\.)([0-9]|[1-9]\d{1,3})
A Pre-release or patched version: (\-(alpha|beta|rc|EP|HP|CP|SP|ep|hp|cp|sp)[1-9]\d*)? (Extension Pack, Hotfix Pack, Coolfix Pack, Service Pack)
Customized version: (\.C[0-9a-zA-Z]+(-U[1-9]\d*)?)?
Internal version: (\.[0-9a-zA-Z]+)?