Hive: Extract string between first and last occurrence of a character - regex

I have a Hive table column whose values are strings separated by '-', and I need to extract the substring between the first and last occurrence of '-'.
+-----------------+
| col1 |
+-----------------+
| abc-123-na-00-sf|
| 123-abc-01-sd |
| 123-abcd-sd |
+-----------------+
Required output:
+-----------+
| col1 |
+-----------+
| 123-na-00 |
| abc-01 |
| abcd |
+-----------+
Please suggest some regex to extract the desired output.
Thanks

with t as (select explode(array('abc-123-na-00-sf','123-abc-01-sd','123-abcd-sd')) as str)
select regexp_extract (str,'-(.*)-',1)
from t
;
123-na-00
abc-01
abcd
or
with t as (select explode(array('abc-123-na-00-sf','123-abc-01-sd','123-abcd-sd')) as str)
select regexp_extract (str,'(?<=-).*(?=-)',0)
from t
;
123-na-00
abc-01
abcd
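Both patterns rely on the greedy .* quantifier: it consumes as much as it can, so the match starts at the first '-' and ends at the last one. As a minimal sketch outside Hive, the same idea can be checked in Python; the function names below are made up for illustration, and the second function is a non-regex alternative rather than something from the answer above.

```python
import re

def between_first_and_last(s, sep="-"):
    """Return the text between the first and last `sep`, or None if `sep` occurs fewer than twice."""
    # Greedy (.*) grabs as much as possible, so the match starts at the
    # first separator and ends at the last one.
    m = re.search(re.escape(sep) + "(.*)" + re.escape(sep), s)
    return m.group(1) if m else None

def between_first_and_last_no_regex(s, sep="-"):
    """Same result using plain string search (hypothetical alternative)."""
    first, last = s.find(sep), s.rfind(sep)
    if first == -1 or first == last:
        return None
    return s[first + len(sep):last]

rows = ["abc-123-na-00-sf", "123-abc-01-sd", "123-abcd-sd"]
for row in rows:
    print(between_first_and_last(row), between_first_and_last_no_regex(row))
```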

Related

What is the maximum length of a regex expression?

In PostgreSQL, I want to exclude rows if the desc field contains any forbidden words.
items:
| id | desc |
|----|------------------|
| 1 | apple foo cat bar|
| 2 | foo bar |
| 3 | foocatbar |
| 4 | foo dog bar |
The list of forbidden words is stored in another table; it currently has 400 words to check.
forbidden_word_table:
| word |
|---------|
| apple |
| boy |
| cat |
| dog |
| .... |
SQL query:
select id, desc
from items
where
desc !~* (select '\y(' || string_agg(word, '|') || ')\y' from forbidden_word_table)
I am checking if desc does not match the regex expression:
desc !~* '\y(apple|boy|cat|dog|.............)\y'
Results:
| id | desc |
|----|------------------|
| 2 | foo bar |
| 3 | foocatbar |
** Row 3 is not excluded, since cat does not appear there as a separate word.
My forbidden_word_table will likely grow by many rows, so the above regex will become a very long expression.
Do regular expressions have a maximum length limit (in bytes or characters)? I'm afraid my regex matching approach will stop working if forbidden_word_table keeps growing.
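It seems that Wiktor Stribiżew is right about "catastrophic backtracking".
I'd suggest using ILIKE and ANY instead. Note that '%word%' matches substrings rather than whole words, so a row such as foocatbar is excluded as well: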
SELECT *
FROM items i
WHERE NOT i."desc" ILIKE ANY
(
SELECT '%' || word || '%'
FROM forbidden_word_table
);
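The two approaches do not filter the same rows: the \y(...)\y regex matches whole words only, while ILIKE '%word%' matches any substring. A minimal Python sketch of the difference, using the sample rows from the question (Python's \b stands in for PostgreSQL's \y here, and the variable names are made up for illustration):

```python
import re

forbidden = ["apple", "boy", "cat", "dog"]
items = {
    1: "apple foo cat bar",
    2: "foo bar",
    3: "foocatbar",
    4: "foo dog bar",
}

# Whole-word check, analogous to desc !~* '\y(apple|boy|cat|dog)\y'
word_re = re.compile(
    r"\b(?:" + "|".join(re.escape(w) for w in forbidden) + r")\b",
    re.IGNORECASE,
)
whole_word_ok = [i for i, d in items.items() if not word_re.search(d)]

# Substring check, analogous to NOT desc ILIKE ANY ('%word%', ...)
substring_ok = [
    i for i, d in items.items()
    if not any(w.lower() in d.lower() for w in forbidden)
]

print(whole_word_ok)  # ids kept by the whole-word filter
print(substring_ok)   # ids kept by the substring filter
```

With this data the whole-word filter keeps rows 2 and 3, while the substring filter keeps only row 2.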

Regexp to match tabular data

From below table
+-----------------------------------------------------+
| Student Info |
|+----------------+--------------+-------------+ |
|| Name | Highschooled | County | |
|+----------------+--------------+-------------+ |
|| Rob | Y | LA | |
|+----------------+--------------+-------------+ |
| |
+-----------------------------------------------------+
I want to parse the values from the columns.
I tried the regular expressions below in Golang, but something is amiss.
This regular expression matches only the first column:
`\|\|([[:word:][:space:]]+\|)+?`
And this one greedily matches the first two columns as one:
`^\|((\|[[:word:][:space:]]+)+?)\| +\|`
Here's my workspace: https://regex101.com/r/sXQdq1/1
You did not list a language, so I will demo in Python.
You can find the table rows that start with || and capture the text between the | characters as the data fields. Zipping the header row with the data row then gives a dict of the data.
Given:
tbl='''\
+-----------------------------------------------------+
| Student Info |
|+----------------+--------------+-------------+ |
|| Name | Highschooled | County | |
|+----------------+--------------+-------------+ |
|| Rob | Y | LA | |
|+----------------+--------------+-------------+ |
| |
+-----------------------------------------------------+'''
You can do:
import re
pat=r'^\|\|([^|]+)\|([^|]+)\|([^|]+)\|'
>>> dict(zip(*re.findall(pat, tbl, flags=re.M)))
{' Name ': ' Rob ', ' Highschooled ': ' Y ', ' County ': ' LA '}
If you don't want the surrounding white space:
>>> {k.strip():v.strip() for k,v in zip(*re.findall(pat, tbl, flags=re.M))}
{'Name': 'Rob', 'Highschooled': 'Y', 'County': 'LA'}
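The pattern above hard-codes three columns. As a hedged alternative (plain string splitting rather than the regex approach shown above), here is a sketch that handles any number of columns and any number of data rows. The parse_rows helper is hypothetical, and it assumes that data rows start with || and that the last real cell of a row is never empty:

```python
tbl = """\
+-----------------------------------------------------+
| Student Info                                        |
|+----------------+--------------+-------------+      |
|| Name           | Highschooled | County      |      |
|+----------------+--------------+-------------+      |
|| Rob            | Y            | LA          |      |
|+----------------+--------------+-------------+      |
|                                                     |
+-----------------------------------------------------+"""

def parse_rows(text):
    """Return a list of cell lists, one per line that starts with '||'."""
    rows = []
    for line in text.splitlines():
        if not line.startswith("||"):
            continue  # skip borders, the title line and blank frame lines
        cells = [c.strip() for c in line[2:].split("|")]
        # Drop the empty strings produced by the outer frame on the right.
        while cells and cells[-1] == "":
            cells.pop()
        rows.append(cells)
    return rows

header, *data = parse_rows(tbl)
records = [dict(zip(header, row)) for row in data]
print(records)
```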
If you want a more specific regex, you could also require the Student Info header at the top, so that only a table with that title matches.

Remove all Unicode space separators in PostgreSQL?

I would like to trim() a column and replace any run of white space and Unicode space separators with a single space. The idea is to sanitize usernames, so that two users cannot have deceptively similar names such as foo bar (SPACE, U+0020) and foo bar (NO-BREAK SPACE, U+00A0).
Until now I've used SELECT regexp_replace(TRIM('some string'), '[\s\v]+', ' ', 'g'); it handles spaces, tabs and carriage returns, but it lacks support for Unicode space separators.
I would have added \h to the regexp, but PostgreSQL doesn't support it (nor \p{Zs}):
SELECT regexp_replace(TRIM('some string'), '[\s\v\h]+', ' ', 'g');
Error in query (7): ERROR: invalid regular expression: invalid escape \ sequence
We are running PostgreSQL 12 (12.2-2.pgdg100+1) in a Debian 10 docker container, using UTF-8 encoding, and support emojis in usernames.
Is there a way to achieve something similar?
Based on the POSIX "space" character class (class shorthand \s in Postgres regular expressions), the Unicode "Spaces", some space-like "Format characters", and some additional non-printing characters (plus two more from Wiktor's post), I condensed this custom character class:
'[\s\u00a0\u180e\u2007\u200b-\u200f\u202f\u2060\ufeff]'
So use:
SELECT trim(regexp_replace('some string', '[\s\u00a0\u180e\u2007\u200b-\u200f\u202f\u2060\ufeff]+', ' ', 'g'));
Note: trim() comes after regexp_replace(), so it covers converted spaces.
It's important to include the basic space class \s (short for [[:space:]]) to cover all current (and future) basic space characters.
We might include more characters. Or start by stripping all characters encoded with 4 bytes. Because UNICODE is dark and full of terrors.
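If the same sanitizing step is also needed in application code, the character class carries over more or less unchanged. A minimal Python sketch follows; the normalize_username name is hypothetical. Note that in Python 3 str patterns \s already matches NO-BREAK SPACE, unlike the PostgreSQL behaviour shown in this answer, so some of the explicit entries are redundant there but harmless.

```python
import re

# Same custom class as in the SQL above: \s plus space-like and
# zero-width characters that \s does not cover in PostgreSQL.
SPACE_LIKE = re.compile(
    r"[\s\u00a0\u180e\u2007\u200b-\u200f\u202f\u2060\ufeff]+"
)

def normalize_username(name):
    """Collapse every run of space-like characters to one space, then trim."""
    # Trimming comes after the replacement so converted spaces are covered too.
    return SPACE_LIKE.sub(" ", name).strip()

print(normalize_username("foo\u00a0bar") == normalize_username("foo bar"))
```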
Consider this demo:
SELECT d AS decimal, to_hex(d) AS hex, chr(d) AS glyph
, '\u' || lpad(to_hex(d), 4, '0') AS unicode
, chr(d) ~ '\s' AS in_posix_space_class
, chr(d) ~ '[\s\u00a0\u180e\u2007\u200b-\u200f\u202f\u2060\ufeff]' AS in_custom_class
FROM (
-- TAB, SPACE, NO-BREAK SPACE, OGHAM SPACE MARK, MONGOLIAN VOWEL, NARROW NO-BREAK SPACE
-- MEDIUM MATHEMATICAL SPACE, WORD JOINER, IDEOGRAPHIC SPACE, ZERO WIDTH NON-BREAKING SPACE
SELECT unnest('{9,32,160,5760,6158,8239,8287,8288,12288,65279}'::int[])
UNION ALL
SELECT generate_series (8192, 8202) AS dec -- UNICODE "Spaces"
UNION ALL
SELECT generate_series (8203, 8207) AS dec -- First 5 space-like UNICODE "Format characters"
) t(d)
ORDER BY d;
decimal | hex | glyph | unicode | in_posix_space_class | in_custom_class
---------+------+----------+---------+----------------------+-----------------
9 | 9 | | \u0009 | t | t
32 | 20 | | \u0020 | t | t
160 | a0 |   | \u00a0 | f | t
5760 | 1680 |   | \u1680 | t | t
6158 | 180e | ᠎ | \u180e | f | t
8192 | 2000 |   | \u2000 | t | t
8193 | 2001 |   | \u2001 | t | t
8194 | 2002 |   | \u2002 | t | t
8195 | 2003 |   | \u2003 | t | t
8196 | 2004 |   | \u2004 | t | t
8197 | 2005 |   | \u2005 | t | t
8198 | 2006 |   | \u2006 | t | t
8199 | 2007 |   | \u2007 | f | t
8200 | 2008 |   | \u2008 | t | t
8201 | 2009 |   | \u2009 | t | t
8202 | 200a |   | \u200a | t | t
8203 | 200b | ​ | \u200b | f | t
8204 | 200c | ‌ | \u200c | f | t
8205 | 200d | ‍ | \u200d | f | t
8206 | 200e | ‎ | \u200e | f | t
8207 | 200f | ‏ | \u200f | f | t
8239 | 202f |   | \u202f | f | t
8287 | 205f |   | \u205f | t | t
8288 | 2060 | ⁠ | \u2060 | f | t
12288 | 3000 |   | \u3000 | t | t
65279 | feff | | \ufeff | f | t
(26 rows)
Tool to generate the character class:
SELECT '[\s' || string_agg('\u' || lpad(to_hex(d), 4, '0'), '' ORDER BY d) || ']'
FROM (
SELECT unnest('{9,32,160,5760,6158,8239,8287,8288,12288,65279}'::int[])
UNION ALL
SELECT generate_series (8192, 8202)
UNION ALL
SELECT generate_series (8203, 8207)
) t(d)
WHERE chr(d) !~ '\s'; -- not covered by \s
[\s\u00a0\u180e\u2007\u200b\u200c\u200d\u200e\u200f\u202f\u2060\ufeff]
Related, with more explanation:
Trim trailing spaces with PostgreSQL
You may construct a bracket expression including the whitespace characters from \p{Zs} Unicode category + a tab:
REGEXP_REPLACE(col, '[\u0009\u0020\u00A0\u1680\u2000-\u200A\u202F\u205F\u3000]+', ' ', 'g')
It will replace all occurrences of one or more horizontal whitespace characters (matched by \h in other regex flavors that support it) with a regular space char.
Compiling blank characters from several sources, I've ended up with the following pattern which includes tabulations (U+0009 / U+000B / U+0088-008A / U+2409-240A), word joiner (U+2060), space symbol (U+2420 / U+2423), braille blank (U+2800), tag space (U+E0020) and more:
[\x0009\x000B\x0088-\x008A\x00A0\x1680\x180E\x2000-\x200F\x202F\x205F\x2060\x2409\x240A\x2420\x2423\x2800\x3000\xFEFF\xE0020]
To transform blanks effectively, including multiple consecutive spaces and those at the beginning or end of a column, here are the 3 queries to execute in sequence (assuming column "text" from "mytable"):
-- transform all Unicode blanks/spaces into a "regular" one (U+20) only on lines where "text" matches the pattern
UPDATE
mytable
SET
text = regexp_replace(text, '[\x0009\x000B\x0088-\x008A\x00A0\x1680\x180E\x2000-\x200F\x202F\x205F\x2060\x2409\x240A\x2420\x2423\x2800\x3000\xFEFF\xE0020]', ' ', 'g')
WHERE
text ~ '[\x0009\x000B\x0088-\x008A\x00A0\x1680\x180E\x2000-\x200F\x202F\x205F\x2060\x2409\x240A\x2420\x2423\x2800\x3000\xFEFF\xE0020]';
-- then squeeze multiple spaces into one
UPDATE mytable SET text=regexp_replace(text, '[ ]+ ',' ','g') WHERE text LIKE '% %';
-- and finally, trim leading/ending spaces
UPDATE mytable SET text=trim(both ' ' FROM text) WHERE text LIKE ' %' OR text LIKE '% ';

how to get proxy from string

I have this string:
76.125.85.66:16805 | 0.238 | Little Rock | AR | Unknown | United
States69.207.212.76:49233 | 0.274 | Sayre | PA | 18840 | United
States96.42.127.190:25480 | 0.292 | Sartell | MN | 56377 | United States
and here is the code I use to get the proxy from it:
Dim ip As String = "76.125.85.66:16805 | 0.238 | Little Rock | AR | Unknown | United States69.207.212.76:49233 | 0.274 | Sayre | PA | 18840 | United States96.42.127.190:25480 | 0.292 | Sartell | MN | 56377 | United States"
ip = Regex.Match(ip, "\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\:\d{2,5}\b", RegexOptions.Singleline).ToString
RichTextBox1.Text = ip
It only shows the first proxy, 76.125.85.66:16805, but I want it to show all of them:
76.125.85.66:16805
69.207.212.76:49233
96.42.127.190:25480
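Use the Regex.Matches() method instead, and remove the leading word boundary: there is no boundary between "States" and the digit that follows it, so \b would prevent the second and third proxies from matching.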
You could write it as follows:
For Each m As Match In Regex.Matches(ip, "(?:\d{1,3}\.){3}\d{1,3}:\d+")
Console.WriteLine(m.Value)
Next
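The effect of the leading \b is easy to reproduce in other regex engines. Here is a small Python sketch using the string from the question (re.findall plays the role of Regex.Matches; the variable names are made up for illustration):

```python
import re

text = (
    "76.125.85.66:16805 | 0.238 | Little Rock | AR | Unknown | United States"
    "69.207.212.76:49233 | 0.274 | Sayre | PA | 18840 | United States"
    "96.42.127.190:25480 | 0.292 | Sartell | MN | 56377 | United States"
)

# Without a leading \b: every ip:port pair is found.
proxies = re.findall(r"(?:\d{1,3}\.){3}\d{1,3}:\d+", text)

# With a leading \b: "s" and "6" are both word characters, so there is
# no boundary inside "States69..." and only the first proxy matches.
with_boundary = re.findall(r"\b(?:\d{1,3}\.){3}\d{1,3}:\d+", text)

print(proxies)
print(with_boundary)
```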
Use this pattern to return multiple results for a given expression:
public ArrayList HRefs(string incomingHtml)
{
ArrayList arrayList = new ArrayList();
string pattern = "href\\s*=\\s*(?:\"(?<1>[^\"]*)\"|(?<1>\\S+))";
for (Match match = Regex.Match(incomingHtml, pattern, RegexOptions.IgnoreCase | RegexOptions.Compiled); match.Success; match = match.NextMatch())
{
string str = match.Groups[1].Value;
arrayList.Add(str);
}
return arrayList;
}

PL/SQL: regexp_like for string not start with letters

I am using regexp_like on Oracle Database 11g. I want a pattern that matches a string that does not start with AM or AP. The string is usually a few letters followed by an underscore and then other letters or underscores.
For example :
String : AM_HTCEVOBLKHS_BX [false]
String : AP_HTCEVOBLKHSPBX [false]
String : BM_HTCEVOBLKHS_BX [true]
String : A_HTCEVODSAP_DSSD [true]
String : A_HTCEVOB_A_CDSED [true]
String : MP_HTCEVOBLKHS_BX [true]
Can you suggest such a pattern?
My current solution doesn't work:
BEGIN
IF regexp_like('AM_HTCEVOBLKHS_BX','[^(AM)(AP)]+_.*') THEN
dbms_output.put_line('TRUE');
ELSE
dbms_output.put_line('FALSE');
END IF;
END;
/
Why do you need a regexp? Why not use a simple substr?
with t1 as
(select 'AM_HTCEVOBLKHS_BX' as f1
from dual
union all
select 'AP_HTCEVOBLKHSPBX'
from dual
union all
select 'BM_HTCEVOBLKHS_BX'
from dual
union all
select 'A_HTCEVODSAP_DSSD'
from dual
union all
select 'A_HTCEVOB_A_CDSED'
from dual
union all
select 'MP_HTCEVOBLKHS_BX' from dual
union all
select null from dual
union all
select '1' from dual)
select f1,
case
when substr(f1, 1, 2) in ('AM', 'AP') then
'false'
else
'true'
end as check_result
from t1
If you have a table of patterns then:
Oracle 11g R2 Schema Setup:
CREATE TABLE strings ( string ) AS
SELECT 'AM_HTCEVOBLKHS_BX' FROM DUAL
UNION ALL SELECT 'AP_HTCEVOBLKHSPBX' FROM DUAL
UNION ALL SELECT 'BM_HTCEVOBLKHS_BX' FROM DUAL
UNION ALL SELECT 'A_HTCEVODSAP_DSSD' FROM DUAL
UNION ALL SELECT 'A_HTCEVOB_A_CDSED' FROM DUAL
UNION ALL SELECT 'MP_HTCEVOBLKHS_BX' FROM DUAL;
CREATE TABLE patterns ( pattern ) AS
SELECT '^AM' FROM DUAL
UNION ALL SELECT '^AP' FROM DUAL;
Query 1:
-- Negative Matches:
SELECT string
FROM strings s
LEFT OUTER JOIN
patterns p
ON ( REGEXP_LIKE( string, pattern ) )
WHERE p.pattern IS NULL
Results:
| STRING |
|-------------------|
| BM_HTCEVOBLKHS_BX |
| A_HTCEVODSAP_DSSD |
| A_HTCEVOB_A_CDSED |
| MP_HTCEVOBLKHS_BX |
Query 2:
-- Positive Matches:
SELECT DISTINCT
string
FROM strings s
INNER JOIN
patterns p
ON ( REGEXP_LIKE( string, pattern ) )
Results:
| STRING |
|-------------------|
| AM_HTCEVOBLKHS_BX |
| AP_HTCEVOBLKHSPBX |
Query 3:
-- All Matches:
SELECT string,
CASE WHEN REGEXP_LIKE( string,
( SELECT LISTAGG( pattern, '|' ) WITHIN GROUP ( ORDER BY NULL )
FROM patterns )
)
THEN 'True'
ELSE 'False'
END AS Matched
FROM strings s
Results:
| STRING | MATCHED |
|-------------------|---------|
| AM_HTCEVOBLKHS_BX | True |
| AP_HTCEVOBLKHSPBX | True |
| BM_HTCEVOBLKHS_BX | False |
| A_HTCEVODSAP_DSSD | False |
| A_HTCEVOB_A_CDSED | False |
| MP_HTCEVOBLKHS_BX | False |
If you want to pass the pattern as a single string then:
Query 4:
-- Negative Matches:
SELECT string
FROM strings
WHERE NOT REGEXP_LIKE( string, '^(AM|AP)' )
Results:
| STRING |
|-------------------|
| BM_HTCEVOBLKHS_BX |
| A_HTCEVODSAP_DSSD |
| A_HTCEVOB_A_CDSED |
| MP_HTCEVOBLKHS_BX |
Query 5:
-- Positive Matches:
SELECT string
FROM strings
WHERE REGEXP_LIKE( string, '^(AM|AP)' )
Results:
| STRING |
|-------------------|
| AM_HTCEVOBLKHS_BX |
| AP_HTCEVOBLKHSPBX |
Query 6:
-- All Matches:
SELECT string,
CASE WHEN REGEXP_LIKE( string, '^(AM|AP)' )
THEN 'True'
ELSE 'False'
END AS Matched
FROM strings
Results:
| STRING | MATCHED |
|-------------------|---------|
| AM_HTCEVOBLKHS_BX | True |
| AP_HTCEVOBLKHSPBX | True |
| BM_HTCEVOBLKHS_BX | False |
| A_HTCEVODSAP_DSSD | False |
| A_HTCEVOB_A_CDSED | False |
| MP_HTCEVOBLKHS_BX | False |
Try this:
^([B-Z][A-Z]*|A[A-LNOQ-Z]?|A[A-Z]{2,})_[A-Z_]+$
The idea is to describe all possible starts of the string.
( # a group
[B-Z][A-Z]* # The first character is not an "A"
| # OR
A[A-LNOQ-Z]? # a single "A", or an "A" followed by a letter other than "P" or "M"
| # OR
A[A-Z]{2,} # an "A" followed by two or more letters
) # close the group
^ and $ are anchors and mean "start of the string" and "end of the string".
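As a quick sanity check, the pattern can be run against the six sample strings from the question. This is only a sketch in Python (whose regex syntax accepts this pattern unchanged), not an Oracle test; the variable names are made up for illustration.

```python
import re

pattern = re.compile(r"^([B-Z][A-Z]*|A[A-LNOQ-Z]?|A[A-Z]{2,})_[A-Z_]+$")

# Sample strings and expected results, taken from the question.
samples = {
    "AM_HTCEVOBLKHS_BX": False,
    "AP_HTCEVOBLKHSPBX": False,
    "BM_HTCEVOBLKHS_BX": True,
    "A_HTCEVODSAP_DSSD": True,
    "A_HTCEVOB_A_CDSED": True,
    "MP_HTCEVOBLKHS_BX": True,
}

results = {s: bool(pattern.match(s)) for s in samples}
for s, matched in results.items():
    print(s, matched)
```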
I think you need just this:
not regexp_like( field, '^(AM_)|^(AP_)' )
Because regexp_like only tests whether the pattern matches somewhere in the string, nothing more is needed in the regular expression.