postgresql: regexp_split_to_table - how to split text by delimiters


I need to split my text to table by delimiters '<=' and '=>', for example
select regexp_split_to_table('kik plz <= p1 => and <= p2 => too. A say <=p1 =>','regexp');
The result must be:
table:
--------------
1 | 'kik plz '
2 | '<= p1 =>'
3 | ' and '
4 | '<= p2 =>'
5 | ' too. A say '
6 | '<=p1 =>'
I think the answer is in positional patterns, but my skills are not enough.
select regexp_split_to_table('kik plz <= p1 => and <= p2 => too. A say <=p1 =>', '((\s)(?=<=))|((\s)(?!=>))')
This returns the wrong result.

select regexp_split_to_table(
replace(
replace('kik plz<= p1 =>and<= p2 =>too. A say <=p1 =>', '<=', E'\001<=')
, '=>', E'=>\001')
, E'\001');
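The same mark-then-split idea can be sketched outside the database. The Python version below is only an illustration of the technique, not the PostgreSQL implementation: the helper name `split_keep_delimited` is made up for this example, and it assumes the control character `\x01` never occurs in the input. It uses the string from the question (with its original spaces) and drops the empty piece that a delimiter at the very end would otherwise leave behind.

```python
SENTINEL = "\x01"  # assumption: this character never appears in the input


def split_keep_delimited(text, opener="<=", closer="=>"):
    """Split text so that every opener...closer span becomes its own piece."""
    # Put a sentinel before each opener and after each closer.
    marked = text.replace(opener, SENTINEL + opener)
    marked = marked.replace(closer, closer + SENTINEL)
    # Split on the sentinel and drop empty pieces (e.g. after a trailing closer).
    return [piece for piece in marked.split(SENTINEL) if piece]


parts = split_keep_delimited("kik plz <= p1 => and <= p2 => too. A say <=p1 =>")
for number, piece in enumerate(parts, start=1):
    print(number, repr(piece))
```

Because the delimiters are kept inside the pieces, no lookahead or lookbehind is needed, which is why the plain replace calls are enough.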

Related

extract all numbers in a string

How can I extract all numbers in a string?
Sample inputs:
7nr-6p
12c-18L
12nr-24L
11nr-12p
Expected Outputs:
{7,6}
{12,18}
{12,24}
etc...
The following is tested with the first one, 7nr-6p:
select regexp_split_to_array('7nr-6p', '[^0-9]') AS new_volume from mytable;
Gives: {7,"","",6,""} // Why is a numeric-only match returning spaces?
select regexp_matches('7nr-6p', '[0-9]*'::text) from mytable;
Gives: {7} // Why isn't this continuing?
select regexp_matches('7nr-6p', '\d'::text) from mytable;
Gives: {7}
select NULLIF(regexp_replace('7nr-6p', '\D',',','g'), '')::text from mytable;
Gives: 7,,,6,

The following query:
select regexp_split_to_array(regexp_replace('7nr-6p', '^[^0-9]*|[^0-9]*$', '', 'g'), '[^0-9]+')
AS new_volume from mytable;
This "trims" the leading and trailing non-digits and then splits on the remaining runs of non-digits.
select regexp_matches('7nr-6p', '[0-9]*'::text) from mytable;
Gives: {7} // Why isn't this continuing?
Because without the 'g' flag, regexp_matches stops at the first match.
Add the 'g' flag, and use + instead of * so that the pattern cannot match an empty string:
select regexp_matches('7nr-6p', '[0-9]+'::text, 'g') from mytable;
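The first-match versus all-matches distinction is not specific to PostgreSQL. As a rough analogy only (this is Python's `re` module, not how `regexp_matches` is implemented), `re.search` behaves like the call without a flag and `re.findall` like the call with 'g'. The sketch also shows why a `*` quantifier is a poor fit here: it is allowed to match the empty string between digits.

```python
import re

text = "7nr-6p"

# Like regexp_matches without a flag: only the first match is returned.
first_only = re.search(r"[0-9]+", text).group()

# Like regexp_matches with 'g': scanning continues after each match.
all_numbers = re.findall(r"[0-9]+", text)

# With '*', positions that hold no digit still "match" as empty strings.
with_star = re.findall(r"[0-9]*", text)

print(first_only)
print(all_numbers)
print(with_star)
```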

You can replace all text and then split:
SELECT regexp_split_to_array(
regexp_replace('7nr-6p', '[a-zA-Z]', '','g'),
'[^0-9]'
)
This returns {7,6}

SELECT id, (regexp_matches(string, '\d+', 'g'))[1]::int AS nr
FROM (
VALUES
(1, '7nr-6p')
, (2, '12c-18L')
, (3, '12nr-24L')
, (4, '11nr-12p')
) tbl(id, string);
Result:
id | nr
----+----
1 | 7
1 | 6
2 | 12
2 | 18
3 | 12
3 | 24
4 | 11
4 | 12
I wanted them in a single cell so I could extract them as needed
SELECT id, trim(regexp_replace(string, '\D+', ',', 'g'), ',') AS nrs
FROM (
VALUES
(1, '7nr-6p')
, (2, '12c-18L')
, (3, '12nr-24L')
, (4, '11nr-12p')
) tbl(id, string);
Result:
id | nrs
----+-------
1 | 7,6
2 | 12,18
3 | 12,24
4 | 11,12
dbfiddle here
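For comparison, the single-cell variant (replace every run of non-digits with a comma, then trim the commas at the ends) looks like this in Python. It is a sketch of the same two steps rather than a translation of the query; the function name `digits_csv` is invented for the example.

```python
import re

rows = [(1, "7nr-6p"), (2, "12c-18L"), (3, "12nr-24L"), (4, "11nr-12p")]


def digits_csv(value):
    """Collapse each run of non-digits into one comma, then trim the ends."""
    return re.sub(r"\D+", ",", value).strip(",")


result = [(row_id, digits_csv(value)) for row_id, value in rows]
for row_id, numbers in result:
    print(row_id, numbers)
```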

Here is a more robust solution
CREATE OR REPLACE FUNCTION get_ints_from_text(TEXT) RETURNS int[] AS $$
select array_remove(regexp_split_to_array($1,'[^0-9]+','i'),'')::int[];
$$ LANGUAGE SQL IMMUTABLE;
Example
select get_ints_from_text('7nr-6p'); -- 7,6
-- also resilient in situations like
select get_ints_from_text('-7nr--6p'); -- 7,6
Here is a link to try
http://sqlfiddle.com/#!17/c6ac7/2
I feel that wrapping this functionality in an immutable function is prudent. It is a pure function: it does not modify data, and it returns the same result for the same input. Marking such a function IMMUTABLE lets the planner pre-evaluate calls with constant arguments and allows the function to be used in index expressions.
By using a function we also benefit from abstraction. There is one source to update should this functionality need to improve in the future.
For more information about immutable functions see
https://www.postgresql.org/docs/10/static/sql-createfunction.html
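To make the logic of the function easy to check outside the database, here is a hedged Python equivalent of the same two steps: split on runs of non-digits, then discard the empty pieces. It reuses the name `get_ints_from_text` for readability only. Note that, like the SQL version, it treats a minus sign as a separator, so negative numbers come out as positive.

```python
import re


def get_ints_from_text(text):
    """Return every run of digits in text as an int, in order of appearance."""
    pieces = re.split(r"[^0-9]+", text)  # split on runs of non-digits
    return [int(piece) for piece in pieces if piece]  # drop empty pieces


print(get_ints_from_text("7nr-6p"))
print(get_ints_from_text("-7nr--6p"))
```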

Store regexp_matches result in variables

In a PL/pgSQL function I use regexp_matches. The result should go into two variables, db_datum_von_string and db_datum_bis_string. However, the second is always null, even if the pattern matches two strings.
The first value is correct. I guess the issue comes from res[2], but I can't figure out what it is.
Note that SELECT regexp_matches('1234ffdsafdsa 4554 ', '[0-9]+', 'g') works perfectly and returns 2 rows.
CREATE OR REPLACE FUNCTION ajl_TEST_datum_check(
arg_datum_id integer,
req_datum integer,
flag integer
) RETURNS text AS $$
DECLARE
db_datum text;
db_datum_von_string text;
db_datum_bis_string text;
temp integer;
BEGIN
SELECT datum.datum INTO db_datum
FROM datum
WHERE datum_id = arg_datum_id;
--- the issue starts here
SELECT res[1], res[2] INTO db_datum_von_string, db_datum_bis_string
FROM (SELECT regexp_matches(db_datum, '[0-9]+', 'g') res) y;
--- end of trouble
IF db_datum_bis_string IS NULL THEN
RETURN db_datum_von_string;
ELSE
RETURN TRUE;
END IF;
END;
$$ LANGUAGE plpgsql;

Your query returns two rows of one-element array values:
SELECT res[1], res[2]
FROM (
SELECT regexp_matches('1234ffdsafdsa 4554 ', '[0-9]+', 'g') res
) y;
res | res
------+-----
1234 |
4554 |
(2 rows)
You should aggregate the elements of those rows into a single array:
SELECT res[1], res[2]
FROM (
SELECT array_agg(res) res
FROM (
SELECT unnest(regexp_matches('1234ffdsafdsa 4554 ', '[0-9]+', 'g')) res
) y
) x;
res | res
------+------
1234 | 4554
(1 row)
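The shape problem can be modelled in a few lines of Python. This is an analogy under stated assumptions, not PostgreSQL semantics: each "row" is represented as a one-element list, and `None` stands in for NULL.

```python
import re

text = "1234ffdsafdsa 4554 "

# One single-element list per match, like the rows returned with 'g'.
rows = [[match] for match in re.findall(r"[0-9]+", text)]

# Every row holds exactly one element, so a second element never exists.
second_per_row = [row[1] if len(row) > 1 else None for row in rows]

# Flattening first (the unnest + array_agg step) gives one indexable list.
flat = [value for row in rows for value in row]
von, bis = flat[0], flat[1]

print(rows)
print(second_per_row)
print(von, bis)
```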

using Oracle REGEXP_INSTR to find exact word

I want to return the following position from the strings using REGEXP_INSTR.
I am looking for the word car with exact match in the following strings.
car,care,oscar - 1
care,car,oscar - 6
oscar,care,car - 12
something like
SELECT REGEXP_INSTR('car,care,oscar', 'car', 1, 1) "REGEXP_INSTR" FROM DUAL;
I am not sure what kind of escape operators to use.

A simpler solution is to surround the source string and search string with commas and find the position using INSTR.
SELECT INSTR(',' || 'car,care,oscar' || ',', ',car,') "INSTR" FROM DUAL;
Example:
SQL Fiddle
with x(y) as (
SELECT 'car,care,oscar' from dual union all
SELECT 'care,car,oscar' from dual union all
SELECT 'oscar,care,car' from dual union all
SELECT 'car' from dual union all
SELECT 'cart,care,oscar' from dual
)
select y, ',' || y || ',' , instr(',' || y || ',',',car,')
from x
| Y | ','||Y||',' | INSTR(','||Y||',',',CAR,') |
|-----------------|-------------------|----------------------------|
| car,care,oscar | ,car,care,oscar, | 1 |
| care,car,oscar | ,care,car,oscar, | 6 |
| oscar,care,car | ,oscar,care,car, | 12 |
| car | ,car, | 1 |
| cart,care,oscar | ,cart,care,oscar, | 0 |
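The comma-wrapping trick carries over to any language with a substring search. A minimal Python sketch, with the hypothetical helper name `find_item`, reproduces the table above. Since `str.find` is 0-based and returns -1 when nothing is found, adding 1 yields the same 1-based position (or 0) that INSTR reports.

```python
def find_item(csv_text, word):
    """1-based position of word as a whole comma-separated item, or 0."""
    wrapped = "," + csv_text + ","
    # str.find returns -1 when absent, so adding 1 gives 0, like INSTR.
    return wrapped.find("," + word + ",") + 1


samples = ["car,care,oscar", "care,car,oscar", "oscar,care,car", "car", "cart,care,oscar"]
positions = [find_item(sample, "car") for sample in samples]
print(positions)
```

The position found in the wrapped string equals the position of the word in the original string, because the match starts on the comma that sits one character before the word.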

The following query handles all scenarios. It returns the starting position if the string begins with car, or the whole string is just car. It returns the starting position + 1 if ,car, is found or if the string ends with ,car to account for the comma.
SELECT
CASE
WHEN REGEXP_LIKE('car,care,oscar', '^car,|^car$') THEN REGEXP_INSTR('car,care,oscar', '^car,|^car$', 1, 1)
WHEN REGEXP_LIKE('car,care,oscar', ',car,|,car$') THEN REGEXP_INSTR('car,care,oscar', ',car,|,car$', 1, 1)+1
ELSE 0
END "REGEXP_INSTR"
FROM DUAL;
SQL Fiddle demo with the various possibilities

I like Noel's answer, as it performs very well. Another approach is to turn the delimiter-separated string into separate rows:
pm.nodes = 'a;b;c;d;e;f;g'
(select regexp_substr(pm.nodes,'[^;]+', 1, level)
from dual
connect by regexp_substr(pm.nodes, '[^;]+', 1, level) is not null)

tab-delimited file output inconsistent

I am attempting to write elements from a nested list to individual lines in a file, with each element separated by tab characters. Each of the nested lists is of the following form:
('A', 'B', 'C', 'D')
The final output should be of the form:
A B C D
E F G H
. . . .
. . . .
However, my output seems to have reproducible inconsistencies such that the output is of the general form:
A B C D
E F G H
I J K L
M N O P
. . . .
. . . .
I've inspected the lists before writing and they seem identical in form. The code I'm using to write is:
with open("letters.txt", 'w') as outfile:
    outfile.writelines('\t'.join(line) + '\n' for line in letter_list)
Importantly, if I replace '\t' with, for example, '|', the file is created without such inconsistencies. I know whitespace parsing can become an issue for certain file I/O operations, but I don't know how to troubleshoot it here.
Thanks for the time.
EDIT: Here is some actual input data (in nested-list form) and output:
IN
[('5', '+', '5752624-5752673', 'alt_region_8161'), ('1', '+', '621461-622139', 'alt_region_67'), ('1', '+', '453907-454063', 'alt_region_60'), ('1', '+', '539611-539815', 'alt_region_61'), ('4', '+', '14610049-14610103', 'alt_region_6893'), ('4', '+', '14610049-14610144', 'alt_region_6895'), ('4', '+', '14610049-14610144', 'alt_region_6897'), ('4', '+', '14610049-14610144', 'alt_region_6896')]
OUT
4 + 12816011-12816087 alt_region_6808
1 + 21214720-21214747 alt_region_2377
4 + 9489968-9490833 alt_region_7382
1 + 12121545-12126263 alt_region_650
4 + 9489968-9490811 alt_region_7381
4 + 12816011-12816087 alt_region_6807
1 + 2032338-2032740 alt_region_157
5 + 4695084-4695628 alt_region_9316
1 + 22294677-22295134 alt_region_2424
1 + 22294677-22295139 alt_region_2425
1 + 22294677-22295139 alt_region_2426
1 + 22294677-22295139 alt_region_2427
1 + 22294677-22295134 alt_region_2422
1 + 22294677-22295134 alt_region_2423
1 + 22294384-22295198 alt_region_2428
1 + 22294384-22295198 alt_region_2429
5 + 20845105-20845211 alt_region_9784
5 + 20845105-20845206 alt_region_9783
3 + 2651447-2651889 alt_region_5562
EDIT: Thanks to everyone who commented. Sorry if the question was poorly phrased. I appreciate the help in clarifying the issue (or, apparently, non-issue).

There are no spaces (' ') in your output, only tabs ('\t').
>>> print(repr('1 + 21214720-21214747 alt_region_2377'))
'1\t+\t21214720-21214747\talt_region_2377'
  ^^ ^^                 ^^
Tabs are not equivalent to a fixed number of spaces (in most editors). Rather, they move the character following the tab to the next available multiple of x characters from the left margin, where x varies - x is most commonly 8, though it is 4 here on SO.
>>> for i in range(7):
...     print('x'*i+'\tx')
    x
x   x
xx  x
xxx x
xxxx    x
xxxxx   x
xxxxxx  x
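To see the effect without depending on how a particular editor or terminal renders tabs, `str.expandtabs` can simulate a given tab width. The sketch below is only a demonstration of the tab-stop rule described above; the helper name `render` is invented for it. It records the column where the second `x` lands for widths 4 and 8.

```python
def render(tabsize):
    """Expand the tab in 'x' * i + '\\tx' for i = 0..6 at the given tab width."""
    return [("x" * i + "\tx").expandtabs(tabsize) for i in range(7)]


for tabsize in (4, 8):
    print("tab width", tabsize)
    for line in render(tabsize):
        print(line)

# Column (0-based) of the character that follows the tab.
columns_4 = [line.rindex("x") for line in render(4)]
columns_8 = [line.rindex("x") for line in render(8)]
print(columns_4)
print(columns_8)
```

With a width of 8 every line lines up, while with a width of 4 the lines with four or more leading characters jump to the next stop. That jump is the "inconsistency" seen in the question.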
If you want your output to appear aligned to the naked eye, you should use string formatting:
>>> for line in data:
...     print('{:4} {:4} {:20} {:20}'.format(*line))
5    +    5752624-5752673      alt_region_8161
1    +    621461-622139        alt_region_67
1    +    453907-454063        alt_region_60
1    +    539611-539815        alt_region_61
4    +    14610049-14610103    alt_region_6893
4    +    14610049-14610144    alt_region_6895
4    +    14610049-14610144    alt_region_6897
4    +    14610049-14610144    alt_region_6896
Note, however, that this will not necessarily be readable by code that expects a tab-separated value file.
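If the file is meant for other programs rather than for reading by eye, keeping the tabs is the right choice. A quick way to convince yourself that the data is intact is to write it and read it back. The sketch below uses the standard `csv` module with a tab delimiter and an in-memory `io.StringIO` buffer as a stand-in for the real file, with two rows taken from the sample input.

```python
import csv
import io

letter_list = [
    ("5", "+", "5752624-5752673", "alt_region_8161"),
    ("1", "+", "621461-622139", "alt_region_67"),
]

buffer = io.StringIO()  # stands in for the real output file
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerows(letter_list)

# Read the same text back and compare it with what was written.
buffer.seek(0)
read_back = [tuple(row) for row in csv.reader(buffer, delimiter="\t")]
print(read_back == letter_list)
```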

In some text editors, tabs are displayed like that. The contents of the file are correct, it's just a matter of how the file is displayed on screen. It happens with tabs but not with | which is why you don't see it happening when you use |.

posix regexp to split a table

I'm currently working on data migration in PostgreSQL. Since I'm new to posix regular expressions, I'm having some trouble with a simple pattern and would appreciate your help.
I want a regular expression that splits my table on each alphanumeric character in a column, e.g. when a column contains the string 'abc' I'd like to split it into 3 rows: ['a', 'b', 'c']. I need a regexp for that.
The second case is a little more complicated: I'd like to split an expression such as '105AB' into ['105A', '105B']. That is, I'd like to copy the number at the beginning of the string and split on the uppercase letters, joining the number with exactly one uppercase letter in each resulting row.
The function I'll be using is probably regexp_split_to_table(string, regexp).
I'm intentionally providing very little data not to confuse anyone, since what I posted is the essence of the problem. If you need more information please comment.

The first was already solved by you:
select regexp_split_to_table(s, ''), i
from (values
('abc', 1),
('def', 2)
) s(s, i);
regexp_split_to_table | i
-----------------------+---
a | 1
b | 1
c | 1
d | 2
e | 2
f | 2
In the second case you don't say whether the numerics are always the first three characters:
select
left(s, 3) || regexp_split_to_table(substring(s from 4), ''), i
from (values
('105AB', 1),
('106CD', 2)
) s(s, i);
?column? | i
----------+---
105A | 1
105B | 1
106C | 2
106D | 2
For a variable number of numerics:
select n || a, i
from (
select
substring(s, '^\d{1,3}') n,
regexp_split_to_table(substring(s, '[A-Z]+'), '') a,
i
from (values
('105AB', 1),
('106CD', 2)
) s(s, i)
) s;
?column? | i
----------+---
105A | 1
105B | 1
106C | 2
106D | 2
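The variable-length case boils down to two steps: capture the leading digits and the trailing uppercase letters, then pair the number with each letter. A small Python sketch of that logic follows. The function name `expand_code` is hypothetical, and unlike the `\d{1,3}` pattern above it accepts any number of leading digits and requires the whole string to have the digits-then-letters shape.

```python
import re


def expand_code(code):
    """Turn '105AB' into ['105A', '105B']; return [] if the shape differs."""
    match = re.fullmatch(r"(\d+)([A-Z]+)", code)
    if match is None:
        return []
    number, letters = match.groups()
    # Pair the shared numeric prefix with each uppercase letter.
    return [number + letter for letter in letters]


print(expand_code("105AB"))
print(expand_code("106CD"))
```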
