ANTLR4 - Match values over 9? - regex

I've been working on my assembler again, and this time I'm stuck on the floating-point registers. There are 32 fp registers, and I want the lexer to match them when I write F0, F1, F2, ..., F31. I wrote the following into my lexer:
REG
: ('R0'|'r0')
| ('AT'|'at')
| ('v'[0-1]|'V'[0-1])
| ('a'[0-3]|'A'[0-3])
| ('t'[0-9]|'T'[0-9])
| ('s'[0-9]|'S'[0-8])
| ('k'[0-1]|'K'[0-1])
| ('GP'|'gp')
| ('SP'|'sp')
| ('FP'|'fp')
| ('ra'|'RA')
| ('f'[0-31]|'F'[0-31])+
;
Every other register here works without any problems, but F0-F31 does not. When I tested it, I noticed that it only matches F0-F3 and nothing higher. In hindsight that was fairly obvious, but I couldn't work out how to match values of 10 and above. I also tried workarounds such as adding another [0-9] after the first class, but that didn't help, because it would then also match invalid names like F36 or F39. Any idea how I could handle this?
Thanks in advance.

The class [0-31] matches a single character: 0, 1, 2, 3, or 1 (again). To emphasise: regular expression classes do not match numeric values, but (text) characters.
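To make the point concrete, here is a small check in Python, whose re module reads [0-31] the same way. This is only an illustration outside ANTLR, and the names `cls` and `accepted` are mine:

```python
import re

# [0-31] is the range '0'-'3' plus the single character '1'.
cls = re.compile(r"[0-31]")

# Which single digits does the class accept?
accepted = [c for c in "0123456789" if cls.fullmatch(c)]
print(accepted)  # ['0', '1', '2', '3']

# A class always consumes exactly one character, so "31" is not matched.
print(bool(cls.fullmatch("31")))  # False
```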
To match F0, F1, F2, ..., F31 (and f0, f1, f2, ..., f31), do something like this:
FREG
: [fF] ( [0-9] // matches f0..f9 (and F0..F9)
| [1-2] [0-9] // matches f10..f29 (and F10..F29)
| '3' [01] // matches f30 or f31 (and F30 or F31)
)
;
Your complete REG rule could be written as follows:
REG
: [rR] '0'
| 'AT' | 'at'
| [vV] [01]
| [aA] [0-3]
| [tT] [0-9]
| [sS] [0-9]
| [kK] [01]
| 'GP' | 'gp'
| 'SP' | 'sp'
| 'FP' | 'fp'
| 'RA' | 'ra'
| [fF] ( [0-9] | [1-2] [0-9] | '3' [01] )
;
Note that [01] and [0-1] match the same: either '0' or '1'. Also be aware that 'ra' | 'RA' does not match 'Ra'. If you want 'Ra' and 'rA' to match as well, write it like this: [rR] [aA].
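If you want to sanity-check the F-register alternative outside ANTLR, a rough Python translation could look like this. It is a sketch under the assumption that re.fullmatch stands in for the lexer matching a whole token; `FREG`, `is_freg`, `RA_STRICT` and `RA_ANY` are names I made up:

```python
import re

# Python translation of: [fF] ( [0-9] | [1-2] [0-9] | '3' [01] )
FREG = re.compile(r"[fF](?:[0-9]|[12][0-9]|3[01])")

def is_freg(text):
    """True if the whole string is one of f0..f31 or F0..F31."""
    return FREG.fullmatch(text) is not None

# The two ways of writing the RA register discussed above.
RA_STRICT = re.compile(r"RA|ra")      # only 'RA' and 'ra'
RA_ANY = re.compile(r"[rR][aA]")      # also 'Ra' and 'rA'

print(all(is_freg(f"F{n}") for n in range(32)))  # True
print(is_freg("F36"))                            # False
print(bool(RA_STRICT.fullmatch("Ra")))           # False
print(bool(RA_ANY.fullmatch("Ra")))              # True
```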


Remove all Unicode space separators in PostgreSQL?

I would like to trim() a column and replace any run of white space and Unicode space separators with a single space. The idea is to sanitize usernames, so that two users cannot have deceptively similar names such as foo bar (SPACE u+20) vs foo bar (NO-BREAK SPACE u+A0).
Until now I've used SELECT regexp_replace(TRIM('some string'), '[\s\v]+', ' ', 'g'); it removes spaces, tabs and carriage returns, but it lacks support for Unicode space separators.
I would have added \h to the regexp, but PostgreSQL doesn't support it (nor \p{Zs}):
SELECT regexp_replace(TRIM('some string'), '[\s\v\h]+', ' ', 'g');
Error in query (7): ERROR: invalid regular expression: invalid escape \ sequence
We are running PostgreSQL 12 (12.2-2.pgdg100+1) in a Debian 10 docker container, using UTF-8 encoding, and support emojis in usernames.
Is there a way to achieve something similar?
Based on the Posix "space" character-class (class shorthand \s in Postgres regular expressions), UNICODE "Spaces", some space-like "Format characters", and some additional non-printing characters (finally added two more from Wiktor's post), I condensed this custom character class:
'[\s\u00a0\u180e\u2007\u200b-\u200f\u202f\u2060\ufeff]'
So use:
SELECT trim(regexp_replace('some string', '[\s\u00a0\u180e\u2007\u200b-\u200f\u202f\u2060\ufeff]+', ' ', 'g'));
Note: trim() comes after regexp_replace(), so it covers converted spaces.
It's important to include the basic space class \s (short for [[:space:]]) to cover all current (and future) basic space characters.
We might include more characters. Or start by stripping all characters encoded with 4 bytes. Because UNICODE is dark and full of terrors.
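If you also need the same cleanup on the application side, the idea carries over to other regex engines. Below is a minimal sketch in Python using the same character class. Note the assumptions: Python's \s already covers some of these code points for str patterns (listing them again is harmless), and `SPACE_LIKE` and `normalize_spaces` are hypothetical names:

```python
import re

# Same custom class as in the SQL above, with + to collapse whole runs.
SPACE_LIKE = re.compile(
    r"[\s\u00a0\u180e\u2007\u200b-\u200f\u202f\u2060\ufeff]+"
)

def normalize_spaces(name):
    """Replace each run of space-like characters by one space, then trim."""
    # Trimming comes last, so converted characters at the ends are covered.
    return SPACE_LIKE.sub(" ", name).strip(" ")

# SPACE vs NO-BREAK SPACE end up as the same username.
print(normalize_spaces("foo bar") == normalize_spaces("foo\u00a0bar"))  # True
```

As in the SQL version, trimming after the replacement matters: a leading NO-BREAK SPACE is first turned into a plain space and only then stripped.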
Consider this demo:
SELECT d AS decimal, to_hex(d) AS hex, chr(d) AS glyph
, '\u' || lpad(to_hex(d), 4, '0') AS unicode
, chr(d) ~ '\s' AS in_posix_space_class
, chr(d) ~ '[\s\u00a0\u180e\u2007\u200b-\u200f\u202f\u2060\ufeff]' AS in_custom_class
FROM (
-- TAB, SPACE, NO-BREAK SPACE, OGHAM SPACE MARK, MONGOLIAN VOWEL SEPARATOR, NARROW NO-BREAK SPACE
-- MEDIUM MATHEMATICAL SPACE, WORD JOINER, IDEOGRAPHIC SPACE, ZERO WIDTH NO-BREAK SPACE
SELECT unnest('{9,32,160,5760,6158,8239,8287,8288,12288,65279}'::int[])
UNION ALL
SELECT generate_series (8192, 8202) AS dec -- UNICODE "Spaces"
UNION ALL
SELECT generate_series (8203, 8207) AS dec -- First 5 space-like UNICODE "Format characters"
) t(d)
ORDER BY d;
decimal | hex | glyph | unicode | in_posix_space_class | in_custom_class
---------+------+----------+---------+----------------------+-----------------
9 | 9 | | \u0009 | t | t
32 | 20 | | \u0020 | t | t
160 | a0 |   | \u00a0 | f | t
5760 | 1680 |   | \u1680 | t | t
6158 | 180e | ᠎ | \u180e | f | t
8192 | 2000 |   | \u2000 | t | t
8193 | 2001 |   | \u2001 | t | t
8194 | 2002 |   | \u2002 | t | t
8195 | 2003 |   | \u2003 | t | t
8196 | 2004 |   | \u2004 | t | t
8197 | 2005 |   | \u2005 | t | t
8198 | 2006 |   | \u2006 | t | t
8199 | 2007 |   | \u2007 | f | t
8200 | 2008 |   | \u2008 | t | t
8201 | 2009 |   | \u2009 | t | t
8202 | 200a |   | \u200a | t | t
8203 | 200b | ​ | \u200b | f | t
8204 | 200c | ‌ | \u200c | f | t
8205 | 200d | ‍ | \u200d | f | t
8206 | 200e | ‎ | \u200e | f | t
8207 | 200f | ‏ | \u200f | f | t
8239 | 202f |   | \u202f | f | t
8287 | 205f |   | \u205f | t | t
8288 | 2060 | ⁠ | \u2060 | f | t
12288 | 3000 |   | \u3000 | t | t
65279 | feff | | \ufeff | f | t
(26 rows)
Tool to generate the character class:
SELECT '[\s' || string_agg('\u' || lpad(to_hex(d), 4, '0'), '' ORDER BY d) || ']'
FROM (
SELECT unnest('{9,32,160,5760,6158,8239,8287,8288,12288,65279}'::int[])
UNION ALL
SELECT generate_series (8192, 8202)
UNION ALL
SELECT generate_series (8203, 8207)
) t(d)
WHERE chr(d) !~ '\s'; -- not covered by \s
[\s\u00a0\u180e\u2007\u200b\u200c\u200d\u200e\u200f\u202f\u2060\ufeff]
Related, with more explanation:
Trim trailing spaces with PostgreSQL
You may construct a bracket expression including the whitespace characters from \p{Zs} Unicode category + a tab:
REGEXP_REPLACE(col, '[\u0009\u0020\u00A0\u1680\u2000-\u200A\u202F\u205F\u3000]+', ' ', 'g')
It will replace all occurrences of one or more horizontal whitespace characters (matched by \h in other regex flavors that support it) with a regular space char.
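If you want to double-check which code points belong to the Zs category, Python's unicodedata module can list them. This is only a cross-check, not part of the PostgreSQL solution, and the result depends on the Unicode database bundled with your Python version; `zs` is a name I chose:

```python
import sys
import unicodedata

# Every code point whose general category is Zs (space separator).
zs = [
    cp for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Zs"
]

print(" ".join(f"U+{cp:04X}" for cp in zs))

# TAB is a control character (Cc), not a space separator, which is why
# the bracket expression above has to add \u0009 explicitly.
print(unicodedata.category("\t"))  # Cc
```

With a recent Python this should print the same code points as the bracket expression above, minus the tab.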
Compiling blank characters from several sources, I've ended up with the following pattern which includes tabulations (U+0009 / U+000B / U+0088-008A / U+2409-240A), word joiner (U+2060), space symbol (U+2420 / U+2423), braille blank (U+2800), tag space (U+E0020) and more:
[\x0009\x000B\x0088-\x008A\x00A0\x1680\x180E\x2000-\x200F\x202F\x205F\x2060\x2409\x240A\x2420\x2423\x2800\x3000\xFEFF\xE0020]
And in order to effectively transform blanks, including multiple consecutive spaces and those at the beginning/end of a column, here are the 3 queries to be executed in sequence (assuming column "text" from "mytable"):
-- transform all Unicode blanks/spaces into a "regular" one (U+20) only on lines where "text" matches the pattern
UPDATE
mytable
SET
text = regexp_replace(text, '[\x0009\x000B\x0088-\x008A\x00A0\x1680\x180E\x2000-\x200F\x202F\x205F\x2060\x2409\x240A\x2420\x2423\x2800\x3000\xFEFF\xE0020]', ' ', 'g')
WHERE
text ~ '[\x0009\x000B\x0088-\x008A\x00A0\x1680\x180E\x2000-\x200F\x202F\x205F\x2060\x2409\x240A\x2420\x2423\x2800\x3000\xFEFF\xE0020]';
-- then squeeze runs of two or more spaces into one (only on lines that contain such a run)
UPDATE mytable SET text=regexp_replace(text, ' {2,}', ' ', 'g') WHERE text ~ ' {2,}';
-- and finally, trim leading/ending spaces
UPDATE mytable SET text=trim(both ' ' FROM text) WHERE text LIKE ' %' OR text LIKE '% ';

regex numbers in arithmetic expression

I want to capture all numbers in a string
for example:
+================+============+
| string | match |
+================+============+
| 5*-33 = 75.3 | 5|-33|75.3 |
+----------------+------------+
| s44+2=7 | 2|7 |
+----------------+------------+
| ii2*-5 = 46 | -5|46 |
+----------------+------------+
| -2*-2.1 = 0.1 | -2|-2.1|0.1|
+================+============+
I tried the following expression, but it does not work with signed numbers.
\b([0-9]+(\.\d+)?)\b
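Don't forget the optional -. It is not a digit, so you have to match it explicitly, and it has to go before the word boundary, because \b never matches between two non-word characters such as * and -.
(-?\b\d+(\.\d+)?)\b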
Of course, this will have issues with valid expressions such as:
4-3
But that seems to be a different problem.
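Here is a quick check of the sample strings from the question in Python. One thing to watch: \b cannot match between * and -, so a pattern that starts with \b-? never picks up the minus in 5*-33. The sketch therefore puts the optional minus before the boundary. `NUMBER` and `numbers` are hypothetical names:

```python
import re

# Optional minus first, then a word boundary, then digits with an
# optional fractional part. (?:...) keeps findall returning whole matches.
NUMBER = re.compile(r"-?\b\d+(?:\.\d+)?\b")

def numbers(expr):
    """Return every (possibly signed) number in the expression as text."""
    return NUMBER.findall(expr)

print(numbers("5*-33 = 75.3"))   # ['5', '-33', '75.3']
print(numbers("s44+2=7"))        # ['2', '7']
print(numbers("4-3"))            # ['4', '-3']
```

The last line shows the limitation mentioned above: in 4-3 the binary minus is read as a sign.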

Remove certain letters in foma

I am trying to write a rule to remove the non-start [a | e | h | i | o | u | w | y] letters in a string. The rule should keep the first letter, but remove given letters in other locations.
For example,
vave -> vv
aeiou -> a
My code is as below:
?* [ a | e | h | i | o | u | w | y ]+:0 ?* [ a | e | h | i | o | u | w | y ]+:0;
However, when applying the rule on vaavaa, it returns
vaav
vava
vava
vav
vava
vava
vav
vvaa
vva
vva
vv
while vv is what I want.
Please share some advice. Thanks!
You may use this regex for search:
(?!^)[aehiouwy]+
and replace each match with the empty string "".
RegEx Details:
(?!^): Negative lookahead to make sure the match does not begin at the start of the string
[aehiouwy]+: Match one or more of these letters inside [...]
You can also use a capturing group and alternation:
^(.)|[aehiouwy]+
and replace each match with \1
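Both regex approaches can be tried quickly in Python. Keep in mind this is a generic regex engine, not foma, so it only shows what the two patterns do. The function names are made up, and the second one relies on Python 3.5+ treating an unmatched group in the replacement as empty:

```python
import re

def strip_with_lookahead(word):
    # (?!^) refuses any match that would begin at position 0.
    return re.sub(r"(?!^)[aehiouwy]+", "", word)

def strip_with_capture(word):
    # Either capture the first character and put it back via \1, or
    # match a run of the letters, for which group 1 is empty.
    return re.sub(r"^(.)|[aehiouwy]+", r"\1", word)

for w in ("vave", "aeiou", "vaavaa"):
    print(w, "->", strip_with_lookahead(w), strip_with_capture(w))
```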

Match single character not enclosed by braces

I am making a property list syntax definition file (tmLanguage) for practice. It's in Sublime Text's YAML format, but I'll be using it in VS Code.
I need to match all characters (including unterminated { and }) that are not enclosed by {}.
I have tried using negative lookahead and lookbehind assertions, but that only excludes the first and last character inside the braces.
(?<!{).(?!})
Adding a greedy quantifier to consume all characters just matches the full line.
(?<!{).+(?!})
Adding a lazy quantifier just matches everything except the first character after the {. It also matches {} exactly.
(?<!{).+?(?!})
| Test | Expected Matches |
| ----------------- | ----------------------- |
| `{Ctrl}{Shift}D` | `D` |
| `D{Ctrl}{Shift}` | `D` |
| `{Ctrl}D{Shift}` | `D` |
| `{Ctrl}{Shift}D{` | `D` `{` |
| `{Ctrl}{Shift}D}` | `D` `}` |
| `D}{Ctrl}{Shift}` | `D` `}` |
| `D{{Ctrl}{Shift}` | `D` `{` |
| `{Shift` | `{` `S` `h` `i` `f` `t` |
| `Shift}` | `S` `h` `i` `f` `t` `}` |
| `{}` | `{` `}` |
Sample text file: https://raw.githubusercontent.com/spikespaz/windows-tiledwm/master/hotkeys.conf
Full syntax highlighter:
# [PackageDev] target_format: plist, ext: tmLanguage
---
name: Hotkeys Config
scopeName: source.hks
fileTypes: ["hks", "conf"]
uuid: c4bcacab-0067-43db-a1d7-7a74fffe2989
patterns:
- name: keyword.operator.assignment
  match: \=
- name: constant.language
  match: "null"
- name: constant.numeric.integer
  match: \{(?:Alt|Ctrl|Shift|Super)\}
- name: constant.numeric.float
  match: \{(?:Menu|RMenu|LMenu|Tab|Enter|PgUp|PgDown|Ins|Del|Home|End|PrntScr|Esc|Back|Space|F[0-12]|Up|Down|Left|Right)\}
- name: comment.line
  match: \#.*
...
You can use the following RegEx to match:
(?:{(Ctrl|Shift|Alt)})+
Then simply replace the matches with an empty string; what remains is the text you are after.
The RegEx is fairly self-explanatory, but here's a short description:
The outer non-capturing group wraps one of your modifier keys in curly brackets (the inner group that lists the key names is a capturing one). The plus sign '+' at the right means it repeats that one or more times.
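To see how the removal approach behaves on the test table from the question, here is a small Python sketch. It is not a tmLanguage rule, it only knows the three modifier names from the pattern above, and `MODIFIERS` and `outside_braces` are names I invented:

```python
import re

# Braces are escaped for clarity; + means only non-empty runs are matched.
MODIFIERS = re.compile(r"(?:\{(?:Ctrl|Shift|Alt)\})+")

def outside_braces(text):
    """Drop every {Ctrl}/{Shift}/{Alt} block and return what is left."""
    return MODIFIERS.sub("", text)

for sample in ("{Ctrl}{Shift}D", "{Ctrl}{Shift}D{", "D}{Ctrl}{Shift}", "{Shift", "{}"):
    print(repr(sample), "->", repr(outside_braces(sample)))
```

Unterminated braces survive because the pattern only matches a complete `{Name}` block.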

REGEX: why '^([a-z] | a)$' does not match 'a'?

I never used regular expressions before and I was testing some examples.
What I don't understand is why the regular expression ^([a-z] | a)$ doesn't match the string 'a'.
As I understand it, [a-z] is equivalent to (a | b | c | ... | y | z), so
[a-z] | a must be equivalent to (a | b | c | ... | y | z) | a, which is the same
as saying (a | b | c | ... | y | z) or [a-z].
For that reason a string str matches ^([a-z] | a)$ iff it matches ^[a-z]$.
That's why I don't understand why that regular expression doesn't match string 'a' or 'e' for example.
PS: I was testing this in this page.
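Spaces matter in regular expressions: as written, the first alternative requires a letter followed by a space, and the second requires a space followed by a. Remove the spaces around the pipe (|) and it should work.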
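You can see the effect of the spaces directly in Python. This is just an illustration (the page you tested on may use a different engine), and the variable names are mine:

```python
import re

spaced = re.compile(r"^([a-z] | a)$")   # spaces are part of the pattern
tight = re.compile(r"^([a-z]|a)$")      # spaces removed

print(bool(spaced.match("a")))    # False
print(bool(spaced.match("a ")))   # True: a letter followed by a space
print(bool(spaced.match(" a")))   # True: a space followed by 'a'
print(bool(tight.match("a")))     # True

# With re.VERBOSE, Python ignores unescaped whitespace in the pattern,
# so the spaced version then behaves like the tight one.
verbose = re.compile(r"^([a-z] | a)$", re.VERBOSE)
print(bool(verbose.match("a")))   # True
```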