From the table below
+-----------------------------------------------------+
| Student Info |
|+----------------+--------------+-------------+ |
|| Name | Highschooled | County | |
|+----------------+--------------+-------------+ |
|| Rob | Y | LA | |
|+----------------+--------------+-------------+ |
| |
+-----------------------------------------------------+
I want to parse the values from the columns.
I tried the regular expressions below in Golang, but something is amiss.
This regular expression matches only the first column:
`\|\|([[:word:][:space:]]+\|)+?`
And this one greedily matches the first two columns as one:
`^\|((\|[[:word:][:space:]]+)+?)\| +\|`
Here's my workspace: https://regex101.com/r/sXQdq1/1
You did not list a language, so I will demo in Python.
You can find the table elements that start with || and parse between the | for the data fields. This can be zipped together for a dict of the data.
Given:
tbl='''\
+-----------------------------------------------------+
| Student Info |
|+----------------+--------------+-------------+ |
|| Name | Highschooled | County | |
|+----------------+--------------+-------------+ |
|| Rob | Y | LA | |
|+----------------+--------------+-------------+ |
| |
+-----------------------------------------------------+'''
You can do:
import re
pat=r'^\|\|([^|]+)\|([^|]+)\|([^|]+)\|'
>>> dict(zip(*re.findall(pat, tbl, flags=re.M)))
{' Name ': ' Rob ', ' Highschooled ': ' Y ', ' County ': ' LA '}
If you don't want the surrounding white space:
>>> {k.strip():v.strip() for k,v in zip(*re.findall(pat, tbl, flags=re.M))}
{'Name': 'Rob', 'Highschooled': 'Y', 'County': 'LA'}
If you want a more specific regex, you could do THIS which will only match a table with Student Info at the top.
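If the number of columns is not known in advance, the fixed three-group pattern can be swapped for a per-line scan. The following is only a minimal sketch and not part of the original answer: the helper name `parse_table` is made up for illustration, and it assumes that data rows are exactly the lines starting with `||` and that the first such line is the header.

```python
import re

tbl = '''\
+-----------------------------------------------------+
| Student Info |
|+----------------+--------------+-------------+ |
|| Name | Highschooled | County | |
|+----------------+--------------+-------------+ |
|| Rob | Y | LA | |
|+----------------+--------------+-------------+ |
| |
+-----------------------------------------------------+'''


def parse_table(text):
    """Return one dict per data row, keyed by the header row's cells."""
    rows = []
    for line in text.splitlines():
        # Only lines starting with "||" carry cell values (assumption).
        if not line.startswith("||"):
            continue
        # Everything between pipes is a cell; drop whitespace-only padding.
        cells = [c.strip() for c in re.findall(r"[^|]+", line)]
        rows.append([c for c in cells if c])
    if not rows:
        return []
    header, data = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in data]


records = parse_table(tbl)
print(records)
```

Because the cells are found per line rather than by a fixed number of groups, adding a fourth column to the table would not require changing the pattern.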
I would like to trim() a column and replace any run of whitespace, including Unicode space separators, with a single space. The idea is to sanitize usernames, preventing two users from having deceptively similar names such as foo bar (SPACE U+0020) vs. foo bar (NO-BREAK SPACE U+00A0).
Until now I've used SELECT regexp_replace(TRIM('some string'), '[\s\v]+', ' ', 'g'); it handles spaces, tabs and carriage returns, but it lacks support for Unicode space separators.
I would have added to the regexp \h, but PostgreSQL doesn't support it (neither \p{Zs}):
SELECT regexp_replace(TRIM('some string'), '[\s\v\h]+', ' ', 'g');
Error in query (7): ERROR: invalid regular expression: invalid escape \ sequence
We are running PostgreSQL 12 (12.2-2.pgdg100+1) in a Debian 10 docker container, using UTF-8 encoding, and support emojis in usernames.
Is there a way to achieve something similar?
Based on the Posix "space" character-class (class shorthand \s in Postgres regular expressions), UNICODE "Spaces", some space-like "Format characters", and some additional non-printing characters (finally added two more from Wiktor's post), I condensed this custom character class:
'[\s\u00a0\u180e\u2007\u200b-\u200f\u202f\u2060\ufeff]'
So use:
SELECT trim(regexp_replace('some string', '[\s\u00a0\u180e\u2007\u200b-\u200f\u202f\u2060\ufeff]+', ' ', 'g'));
Note: trim() comes after regexp_replace(), so it covers converted spaces.
It's important to include the basic space class \s (short for [[:space:]]) to cover all current (and future) basic space characters.
We might include more characters. Or start by stripping all characters encoded with 4 bytes. Because UNICODE is dark and full of terrors.
Consider this demo:
SELECT d AS decimal, to_hex(d) AS hex, chr(d) AS glyph
, '\u' || lpad(to_hex(d), 4, '0') AS unicode
, chr(d) ~ '\s' AS in_posix_space_class
, chr(d) ~ '[\s\u00a0\u180e\u2007\u200b-\u200f\u202f\u2060\ufeff]' AS in_custom_class
FROM (
-- TAB, SPACE, NO-BREAK SPACE, OGHAM SPACE MARK, MONGOLIAN VOWEL, NARROW NO-BREAK SPACE
-- MEDIUM MATHEMATICAL SPACE, WORD JOINER, IDEOGRAPHIC SPACE, ZERO WIDTH NON-BREAKING SPACE
SELECT unnest('{9,32,160,5760,6158,8239,8287,8288,12288,65279}'::int[])
UNION ALL
SELECT generate_series (8192, 8202) AS dec -- UNICODE "Spaces"
UNION ALL
SELECT generate_series (8203, 8207) AS dec -- First 5 space-like UNICODE "Format characters"
) t(d)
ORDER BY d;
decimal | hex | glyph | unicode | in_posix_space_class | in_custom_class
---------+------+----------+---------+----------------------+-----------------
9 | 9 | | \u0009 | t | t
32 | 20 | | \u0020 | t | t
160 | a0 | | \u00a0 | f | t
5760 | 1680 | | \u1680 | t | t
6158 | 180e | | \u180e | f | t
8192 | 2000 | | \u2000 | t | t
8193 | 2001 | | \u2001 | t | t
8194 | 2002 | | \u2002 | t | t
8195 | 2003 | | \u2003 | t | t
8196 | 2004 | | \u2004 | t | t
8197 | 2005 | | \u2005 | t | t
8198 | 2006 | | \u2006 | t | t
8199 | 2007 | | \u2007 | f | t
8200 | 2008 | | \u2008 | t | t
8201 | 2009 | | \u2009 | t | t
8202 | 200a | | \u200a | t | t
8203 | 200b | | \u200b | f | t
8204 | 200c | | \u200c | f | t
8205 | 200d | | \u200d | f | t
8206 | 200e | | \u200e | f | t
8207 | 200f | | \u200f | f | t
8239 | 202f | | \u202f | f | t
8287 | 205f | | \u205f | t | t
8288 | 2060 | | \u2060 | f | t
12288 | 3000 | | \u3000 | t | t
65279 | feff | | \ufeff | f | t
(26 rows)
Tool to generate the character class:
SELECT '[\s' || string_agg('\u' || lpad(to_hex(d), 4, '0'), '' ORDER BY d) || ']'
FROM (
SELECT unnest('{9,32,160,5760,6158,8239,8287,8288,12288,65279}'::int[])
UNION ALL
SELECT generate_series (8192, 8202)
UNION ALL
SELECT generate_series (8203, 8207)
) t(d)
WHERE chr(d) !~ '\s'; -- not covered by \s
[\s\u00a0\u180e\u2007\u200b\u200c\u200d\u200e\u200f\u202f\u2060\ufeff]
Related, with more explanation:
Trim trailing spaces with PostgreSQL
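If usernames are also sanitized in application code before they reach the database, the same custom class can be applied there. This is a rough sketch under assumptions: the function name `normalize_username` is hypothetical, and Python's `\s` is not identical to the Postgres `\s` (for `str` patterns it already covers several of the listed characters, such as NO-BREAK SPACE), so the extra code points are kept mainly to mirror the class above.

```python
import re

# Same code points as the custom Postgres class; \s covers the basic spaces.
SPACE_LIKE = re.compile(r"[\s\u00a0\u180e\u2007\u200b-\u200f\u202f\u2060\ufeff]+")


def normalize_username(name):
    """Collapse each run of space-like characters to one space, then trim."""
    # Replace first, trim second, so converted spaces at the ends are removed too.
    return SPACE_LIKE.sub(" ", name).strip(" ")


print(normalize_username("foo\u00a0bar") == normalize_username("foo bar"))
```

As in the SQL version, trimming happens after the replacement, so a leading NO-BREAK SPACE or a trailing zero-width character does not survive.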
You may construct a bracket expression including the whitespace characters from \p{Zs} Unicode category + a tab:
REGEXP_REPLACE(col, '[\u0009\u0020\u00A0\u1680\u2000-\u200A\u202F\u205F\u3000]+', ' ', 'g')
It will replace all occurrences of one or more horizontal whitespace characters (matched by \h in other regex flavors that support it) with a regular space char.
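One way to double-check the code points in that bracket expression is to list the Zs category from a Unicode database. The sketch below uses Python's `unicodedata` module for that; the result depends on the Unicode version bundled with the interpreter (recent versions classify U+180E as a format character rather than a space separator), so treat it as a cross-check and not as an authoritative list.

```python
import sys
import unicodedata

# Every code point whose general category is Zs (space separator).
zs = [cp for cp in range(sys.maxunicode + 1)
      if unicodedata.category(chr(cp)) == "Zs"]

# Build a bracket expression in the \uXXXX notation used above, plus a tab.
char_class = "[\\u0009" + "".join(f"\\u{cp:04X}" for cp in zs) + "]"

print(len(zs))
print(char_class)
```

On a current Python 3 this lists the same characters as the expression above, with the range \u2000-\u200A written out one code point at a time.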
Compiling blank characters from several sources, I've ended up with the following pattern which includes tabulations (U+0009 / U+000B / U+0088-008A / U+2409-240A), word joiner (U+2060), space symbol (U+2420 / U+2423), braille blank (U+2800), tag space (U+E0020) and more:
[\x0009\x000B\x0088-\x008A\x00A0\x1680\x180E\x2000-\x200F\x202F\x205F\x2060\x2409\x240A\x2420\x2423\x2800\x3000\xFEFF\xE0020]
And in order to effectively transform blanks including multiple consecutive spaces and those at the beginning/end of a column, here are the 3 queries to be executed in sequence (assuming column "text" from "mytable")
-- transform all Unicode blanks/spaces into a "regular" one (U+20) only on lines where "text" matches the pattern
UPDATE
mytable
SET
text = regexp_replace(text, '[\x0009\x000B\x0088-\x008A\x00A0\x1680\x180E\x2000-\x200F\x202F\x205F\x2060\x2409\x240A\x2420\x2423\x2800\x3000\xFEFF\xE0020]', ' ', 'g')
WHERE
text ~ '[\x0009\x000B\x0088-\x008A\x00A0\x1680\x180E\x2000-\x200F\x202F\x205F\x2060\x2409\x240A\x2420\x2423\x2800\x3000\xFEFF\xE0020]';
-- then squeeze multiple spaces into one
UPDATE mytable SET text=regexp_replace(text, '[ ]+ ',' ','g') WHERE text LIKE '% %';
-- and finally, trim leading/ending spaces
UPDATE mytable SET text=trim(both ' ' FROM text) WHERE text LIKE ' %' OR text LIKE '% ';
I want to capture all numbers in a string
for example:
+================+============+
| string | match |
+================+============+
| 5*-33 = 75.3 | 5|-33|75.3 |
+----------------+------------+
| s44+2=7 | 2|7 |
+----------------+------------+
| ii2*-5 = 46 | -5|46 |
+----------------+------------+
| -2*-2.1 = 0.1 | -2|-2.1|0.1|
+================+============+
I tried the following expression, but it does not work with signed numbers.
\b([0-9]+(\.\d+)?)\b
Regexr
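Don't forget the optional -. Note that - is not a word character, so a \b placed directly in front of it fails whenever the sign follows an operator or starts the string. Put the optional sign before the word boundary instead:
-?\b\d+(\.\d+)?\b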
Of course, this will have issues with valid expressions such as:
4-3
But that seems to be a different problem.
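To check a pattern against the rows of the table in the question, a small script helps. This is a sketch in Python (the question does not name a language); it uses a pattern with the optional sign placed before the word boundary, and the name `extract_numbers` is made up for illustration.

```python
import re

# Optional sign, then a number that starts and ends on a word boundary.
NUMBER = re.compile(r"-?\b\d+(?:\.\d+)?\b")


def extract_numbers(expr):
    """Return every signed integer or decimal found in expr, as strings."""
    return NUMBER.findall(expr)


examples = ["5*-33 = 75.3", "s44+2=7", "ii2*-5 = 46", "-2*-2.1 = 0.1"]
for expr in examples:
    print(expr, "->", "|".join(extract_numbers(expr)))
```

Digits glued to letters (s44, ii2) are skipped because there is no word boundary between a letter and a digit. For 4-3 the sketch returns 4 and -3, which is the ambiguity mentioned above.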
I have a Hive table column that contains strings separated by '-', and I need to extract the part between the first and the last occurrence of '-'.
+-----------------+
| col1 |
+-----------------+
| abc-123-na-00-sf|
| 123-abc-01-sd |
| 123-abcd-sd |
+-----------------+
Required output:
+-----------+
| col1 |
+-----------+
| 123-na-00 |
| abc-01 |
| abcd |
+-----------+
Please suggest some regex to extract the desired output.
Thanks
with t as (select explode(array('abc-123-na-00-sf','123-abc-01-sd','123-abcd-sd')) as str)
select regexp_extract (str,'-(.*)-',1)
from t
;
123-na-00
abc-01
abcd
or
with t as (select explode(array('abc-123-na-00-sf','123-abc-01-sd','123-abcd-sd')) as str)
select regexp_extract (str,'(?<=-).*(?=-)',0)
from t
;
123-na-00
abc-01
abcd
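Both queries rely on .* being greedy: the match starts at the first '-' and .* then extends as far as possible while still leaving a final '-' to match, which is the last one in the string. A quick way to see this outside Hive is the Python sketch below; the helper name `between_dashes` is hypothetical, and returning None when there are fewer than two dashes is a choice made for this example only.

```python
import re


def between_dashes(value):
    """Return the text between the first and last '-', or None if absent."""
    # Greedy .* pushes the closing '-' to the last dash in the string.
    match = re.search(r"-(.*)-", value)
    return match.group(1) if match else None


for s in ["abc-123-na-00-sf", "123-abc-01-sd", "123-abcd-sd"]:
    print(between_dashes(s))
```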
This is for regexp_like running on Oracle Database 11g. I want a pattern that matches a string that does not start with AM or AP. The string is usually a few letters followed by an underscore and then more letters or underscores.
For example :
String : AM_HTCEVOBLKHS_BX [false]
String : AP_HTCEVOBLKHSPBX [false]
String : BM_HTCEVOBLKHS_BX [true]
String : A_HTCEVODSAP_DSSD [true]
String : A_HTCEVOB_A_CDSED [true]
String : MP_HTCEVOBLKHS_BX [true]
Can you suggest such a pattern?
My current solution doesn't work:
BEGIN
IF regexp_like('AM_HTCEVOBLKHS_BX','[^(AM)(AP)]+_.*') THEN
dbms_output.put_line('TRUE');
ELSE
dbms_output.put_line('FALSE');
END IF;
END;
/
Why do you need a regexp? Why not use a simple substr?
with t1 as
(select 'AM_HTCEVOBLKHS_BX' as f1
from dual
union all
select 'AP_HTCEVOBLKHSPBX'
from dual
union all
select 'BM_HTCEVOBLKHS_BX'
from dual
union all
select 'A_HTCEVODSAP_DSSD'
from dual
union all
select 'A_HTCEVOB_A_CDSED'
from dual
union all
select 'MP_HTCEVOBLKHS_BX' from dual
union all
select null from dual
union all
select '1' from dual)
select f1,
case
when substr(f1, 1, 2) in ('AM', 'AP') then
'false'
else
'true'
end as check_result
from t1
If you have a table of patterns then:
SQL Fiddle
Oracle 11g R2 Schema Setup:
CREATE TABLE strings ( string ) AS
SELECT 'AM_HTCEVOBLKHS_BX' FROM DUAL
UNION ALL SELECT 'AP_HTCEVOBLKHSPBX' FROM DUAL
UNION ALL SELECT 'BM_HTCEVOBLKHS_BX' FROM DUAL
UNION ALL SELECT 'A_HTCEVODSAP_DSSD' FROM DUAL
UNION ALL SELECT 'A_HTCEVOB_A_CDSED' FROM DUAL
UNION ALL SELECT 'MP_HTCEVOBLKHS_BX' FROM DUAL;
CREATE TABLE patterns ( pattern ) AS
SELECT '^AM' FROM DUAL
UNION ALL SELECT '^AP' FROM DUAL;
Query 1:
-- Negative Matches:
SELECT string
FROM strings s
LEFT OUTER JOIN
patterns p
ON ( REGEXP_LIKE( string, pattern ) )
WHERE p.pattern IS NULL
Results:
| STRING |
|-------------------|
| BM_HTCEVOBLKHS_BX |
| A_HTCEVODSAP_DSSD |
| A_HTCEVOB_A_CDSED |
| MP_HTCEVOBLKHS_BX |
Query 2:
-- Positive Matches:
SELECT DISTINCT
string
FROM strings s
INNER JOIN
patterns p
ON ( REGEXP_LIKE( string, pattern ) )
Results:
| STRING |
|-------------------|
| AM_HTCEVOBLKHS_BX |
| AP_HTCEVOBLKHSPBX |
Query 3:
-- All Matches:
SELECT string,
CASE WHEN REGEXP_LIKE( string,
( SELECT LISTAGG( pattern, '|' ) WITHIN GROUP ( ORDER BY NULL )
FROM patterns )
)
THEN 'True'
ELSE 'False'
END AS Matched
FROM strings s
Results:
| STRING | MATCHED |
|-------------------|---------|
| AM_HTCEVOBLKHS_BX | True |
| AP_HTCEVOBLKHSPBX | True |
| BM_HTCEVOBLKHS_BX | False |
| A_HTCEVODSAP_DSSD | False |
| A_HTCEVOB_A_CDSED | False |
| MP_HTCEVOBLKHS_BX | False |
If you want to pass the pattern as a single string then:
Query 4:
-- Negative Matches:
SELECT string
FROM strings
WHERE NOT REGEXP_LIKE( string, '^(AM|AP)' )
Results:
| STRING |
|-------------------|
| BM_HTCEVOBLKHS_BX |
| A_HTCEVODSAP_DSSD |
| A_HTCEVOB_A_CDSED |
| MP_HTCEVOBLKHS_BX |
Query 5:
-- Positive Matches:
SELECT string
FROM strings
WHERE REGEXP_LIKE( string, '^(AM|AP)' )
Results:
| STRING |
|-------------------|
| AM_HTCEVOBLKHS_BX |
| AP_HTCEVOBLKHSPBX |
Query 6:
-- All Matches:
SELECT string,
CASE WHEN REGEXP_LIKE( string, '^(AM|AP)' )
THEN 'True'
ELSE 'False'
END AS Matched
FROM strings
Results:
| STRING | MATCHED |
|-------------------|---------|
| AM_HTCEVOBLKHS_BX | True |
| AP_HTCEVOBLKHSPBX | True |
| BM_HTCEVOBLKHS_BX | False |
| A_HTCEVODSAP_DSSD | False |
| A_HTCEVOB_A_CDSED | False |
| MP_HTCEVOBLKHS_BX | False |
Try this:
^([B-Z][A-Z]*|A[A-LNOQ-Z]?|A[A-Z]{2,})_[A-Z_]+$
The idea is to describe all possible starts of the string.
( # a group
[B-Z][A-Z]* # the first character is not an "A"
| # OR
A[A-LNOQ-Z]? # a single "A", or an "A" followed by a letter other than "P" or "M"
| # OR
A[A-Z]{2,} # an "A" followed by two or more letters
) # close the group
^ and $ are anchors and mean "start of the string" and "end of the string".
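The pattern can be checked against the sample strings from the question outside Oracle. The sketch below does that in Python, which for this particular pattern uses the same syntax; it also compares the result with simply negating a prefix test. The names `explicit_ok` and `prefix_ok` are made up for this example.

```python
import re

# The pattern above: enumerate every allowed start, then "_" and the rest.
EXPLICIT = re.compile(r"^([B-Z][A-Z]*|A[A-LNOQ-Z]?|A[A-Z]{2,})_[A-Z_]+$")
# Alternative: detect the forbidden prefixes and negate the result.
FORBIDDEN_PREFIX = re.compile(r"^(AM|AP)")


def explicit_ok(s):
    return EXPLICIT.search(s) is not None


def prefix_ok(s):
    return FORBIDDEN_PREFIX.search(s) is None


samples = {
    "AM_HTCEVOBLKHS_BX": False,
    "AP_HTCEVOBLKHSPBX": False,
    "BM_HTCEVOBLKHS_BX": True,
    "A_HTCEVODSAP_DSSD": True,
    "A_HTCEVOB_A_CDSED": True,
    "MP_HTCEVOBLKHS_BX": True,
}
for s, expected in samples.items():
    print(s, explicit_ok(s), prefix_ok(s), expected)
```

The two checks agree on these samples, but they are not equivalent in general: the explicit pattern also validates the rest of the string (uppercase letters and underscores only), while the negated prefix test accepts anything that does not begin with AM or AP.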
I think you need just this:
not regexp_like( field, '^(AM_)|^(AP_)' )
Because regexp_like only tests whether the pattern matches somewhere in the string, the expression does not need to describe the rest of the string.