Can I make my Alteryx RegEx parse conditional?

I receive messages with the fields below, and I want to group and extract the user inputs. The majority of submissions contain all fields, and the regex works great. The problem comes when someone removes lines, for example when they only need to fill in down to Amount 1.
Name:
Number:
Amount:
Old Code:
Code 1:
Amount 1:
Code 2:
Amount 2:
Code 3:
Amount 3:
Code 4:
Amount 4:
I'm using Alteryx to parse the message contents, and my current regex works, but I want to be ready for the unavoidable inconsistency in user submissions:
Name:(.+)\sNumber:(.+)\sAmount:(.+)\sOld Code:(.+)\sCode 1:(.+)\sAmount 1:(.+)\sCode 2:(.*?)\sAmount 2:(.*?)\sCode 3:(.*?)\sAmount 3:(.*?)\sCode 4:(.*?)\sAmount 4:(.*?[^-]*)
Is it possible to have Alteryx return parsed results from a message even if a listed field is deleted?
Alteryx issue with new cascading regex

You can wrap the lines in cascading, nested optional groups so the regex matches whatever is valid up to a point.
This expects the form lines to be in order. If they are not, a different type of regex is needed: an out-of-order regex (see the bottom regex).
Both of these regexes are written for Perl 5.10.
(?-ms)Name:(.*)(?:\s+Number:(.*)(?:\s+Amount:(.*)(?:\s+Old[ ]+Code:(.*)(?:\s+Code[ ]+1:(.*)(?:\s+Amount[ ]+1:(.*)(?:\s+Code[ ]+2:(.*)(?:\s+Amount[ ]+2:(.*)(?:\s+Code[ ]+3:(.*)(?:\s+Amount[ ]+3:(.*)(?:\s+Code[ ]+4:(.*)(?:\s+Amount[ ]+4:(.*?[^-]*))?)?)?)?)?)?)?)?)?)?)?
https://regex101.com/r/9oKXEE/1
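As a rough illustration of the cascading idea, here is a minimal Python sketch cut down to three of the fields. Python's re module stands in for the Alteryx regex engine, and the parse helper and the sample messages are made up for this example.

```python
import re

# Shortened three-field version of the cascading pattern: each optional
# non-capturing group wraps everything that follows it, so the match simply
# stops at the first missing line.
PATTERN = re.compile(
    r"Name:(.*)"
    r"(?:\s+Number:(.*)"
    r"(?:\s+Amount:(.*))?)?"
)

def parse(message):
    """Return one stripped value per field, or None for fields that are absent."""
    m = PATTERN.search(message)
    if m is None:
        return None
    return tuple(g.strip() if g is not None else None for g in m.groups())

# Hypothetical submissions: one complete, one with the last line deleted.
full = "Name: Ann\nNumber: 42\nAmount: 10.50"
truncated = "Name: Ann\nNumber: 42"

print(parse(full))
print(parse(truncated))
```

Groups for deleted lines come back as None, which is easy to turn into nulls downstream.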
For out-of-order matching, use this
(?m-s)\A(?:[\S\s]*?(?:(?(1)(?!))^\h*Name\h*:\h*(.*)|(?(2)(?!))^\h*Number\h*:\h*(.*)|(?(3)(?!))^\h*Amount\h*:\h*(.*)|(?(4)(?!))^\h*Old\h*Code\h*:\h*(.*)|(?(5)(?!))^\h*Code\h*1\h*:\h*(.*)|(?(6)(?!))^\h*Amount\h*1\h*:\h*(.*)|(?(7)(?!))^\h*Code\h*2\h*:\h*(.*)|(?(8)(?!))^\h*Amount\h*2\h*:\h*(.*)|(?(9)(?!))^\h*Code\h*3\h*:\h*(.*)|(?(10)(?!))^\h*Amount\h*3\h*:\h*(.*)|(?(11)(?!))^\h*Code\h*4\h*:\h*(.*)|(?(12)(?!))^\h*Amount\h*4\h*:\h*(.*?))){1,12}
https://regex101.com/r/f2rG1v/1

In this situation you don't need to reach for regex straight away, and given the inconsistent data it could take a while to perfect a single expression.
You can do it this way instead:
- Add a RecordID first.
- Then use a Text to Columns tool with a newline (\n) delimiter. Configure this to "Split to Rows".
- Then use a second Text to Columns tool to split on the delimiter ":".
That will handle additional rows being entered, and so on. At that stage you can work out how to clean up the results (filter to remove null lines, multi-row to tag records, cross-tab to create a table, etc.). If you want to flag any unknown rows, you can add a Text Input with the required rows and use Find/Replace or Join to separate the data.
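The same split-first idea can be sketched outside Alteryx. The Python below is only an analogy for the tool chain above: REQUIRED_FIELDS is a hypothetical subset of the form labels, and split_message is a made-up helper name.

```python
# Hypothetical subset of the expected form labels (plays the role of the
# Text Input holding the required rows).
REQUIRED_FIELDS = ["Name", "Number", "Amount", "Old Code", "Code 1", "Amount 1"]

def split_message(message):
    """Split a message into rows, then each row on the first ':'.

    Returns two dicts: fields with a known label and fields with an
    unknown label, so unexpected rows can be flagged separately.
    """
    known, unknown = {}, {}
    for line in message.splitlines():      # "split to rows"
        if ":" not in line:
            continue                        # drop blank or label-free lines
        label, value = line.split(":", 1)   # split on the first delimiter only
        label, value = label.strip(), value.strip()
        target = known if label in REQUIRED_FIELDS else unknown
        target[label] = value
    return known, unknown

# Hypothetical message: fields out of order, a blank line, and an extra row.
known, unknown = split_message("Number: 42\nName: Ann\n\nNote: rush")
print(known, unknown)
```

Because each row is handled independently, missing or reordered lines do not affect the rows that are present.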

Related

EOLs in a string from a form

This is mostly a logic question rather than a question about code.
I decided to make a 'bulleted list trimmer': some kind of a script which deletes things like 2.1.1 from the beginning of list elements. In a text like the following:
Federal law № 296-FR «Carbon emission restrictions» from 02.07.2021.
President's executive order № 176 from 19.04.2017 «Approval of ecological safety strategy 2020-2025».
You paste it into a form, click 'submit' and get back a list stripped of numbers. I thought it would be easy.
I wrote this code:
if (isset($_POST['textinput']))
{$textinput = $_POST['textinput'];}
$textoutput = preg_replace('#^\s*\d+\.?\d*#','__PHP_EOL__',$textinput);
Then I hooked it up to a form with a textinput textarea. The pattern should find spaces/tabs, then one or more digits, then an optional ., then zero or more digits, all at the beginning of a line. And here is the problem.
There are no EOLs in the input textarea and the corresponding $_POST element, so the '^' anchor helps only once.
If I remove '^' from the regexp, it cuts out every part that matches \s*\d+\.?\d*, including federal regulation numbers and dates.
I suppose I need to get EOLs into the $textinput string somehow, but I don't know whether that is possible. So I'm asking for ideas on how to make my 'bullet list trimmer' work.
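One thing worth checking: browsers submit textarea line breaks as CRLF, and without a multiline modifier '^' anchors only at the very start of the string, which would explain a single match. In PHP that is the m pattern modifier. The Python sketch below shows the same idea with re.MULTILINE; the numbered prefixes in the sample text and the slightly extended pattern (which also accepts prefixes like 2.1.1.) are assumptions made for the example.

```python
import re

# With re.MULTILINE, ^ matches at the start of every line, not only at the
# start of the string. [ \t] is used instead of \s so that the pattern can
# never swallow a line break and merge two lines.
PREFIX = re.compile(r"^[ \t]*\d+(?:\.\d+)*\.?[ \t]+", re.MULTILINE)

def trim_numbering(text):
    """Remove leading list numbers such as '1.1' or '2.1.1.' from each line."""
    return PREFIX.sub("", text)

# Hypothetical textarea content, with the CRLF line break a browser sends.
sample = (
    "1.1 Federal law № 296-FR from 02.07.2021.\r\n"
    "  2.1.1. Executive order № 176 from 19.04.2017."
)

print(trim_numbering(sample))
```

Numbers and dates in the middle of a line are left alone because the pattern is anchored to the start of each line.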

Extract text up to the Nth character in a string

How can I extract the text up to the 4th instance of a character in a column?
I'm selecting text out of a column called filter_type up to the fourth > character.
To accomplish this, I've been trying to find the position of the fourth > character, but it's not working:
select substring(filter_type from 1 for position('>' in filter_type))
You can use the pattern matching function in Postgres.
First figure out a pattern to capture everything up to the fourth > character.
To start your pattern you should create a sub-group that captures non > characters, and one > character:
([^>]*>)
Then capture that four times to get to the fourth instance of >
([^>]*>){4}
Then, you will need to wrap that in a group so that the match brings back all four instances:
(([^>]*>){4})
Finally, add a start-of-string anchor for good measure, so that it only matches from the beginning of the string (not in the middle):
^(([^>]*>){4})
Here's a working regex101 example of that!
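To see what the first group captures, here is a small Python sketch of the same pattern. Python's re engine is not the one Postgres uses, so treat it as an illustration of the pattern rather than of Postgres itself; up_to_fourth and the sample strings are made up.

```python
import re

# Same pattern as above: group 1 wraps four repetitions of
# "any run of non-'>' characters followed by one '>'".
FIRST_FOUR = re.compile(r"^(([^>]*>){4})")

def up_to_fourth(value):
    """Return the text up to and including the 4th '>', or None if there are fewer."""
    m = FIRST_FOUR.match(value)
    return m.group(1) if m else None

print(up_to_fourth("a>b>c>d>e>f"))  # four or more '>' characters
print(up_to_fourth("a>b>c"))        # fewer than four, so no match
```

Group 2 only holds the last repetition, which is why the outer group is needed to get all four segments back.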
Once the pattern returns what you want in the first group (the match information panel on the right side of the online regex tester shows this), you need to select it in the SQL.
In Postgres, the substring function has an option to use a regex pattern to extract text out of the input using a 'from' statement in the substring.
To finish, put it all together!
select substring(filter_type from '^(([^>]*>){4})')
from filter_table
See a working sqlfiddle here
If you want to match the entire string whenever there are fewer than four instances of >, use this regular expression:
^(([^>]*>){4}|.*)
You can also use a simple, non-regex solution:
SELECT array_to_string((string_to_array(filter_type, '>'))[1:4], '>')
The above query:
- splits your string into an array, using '>' as the delimiter
- selects only the first 4 elements
- transforms the array back to a string
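The three steps map directly onto split, slice and join. A minimal Python analogue, with first_segments as a made-up helper name:

```python
def first_segments(value, delimiter=">", count=4):
    """Split on the delimiter, keep the first `count` pieces, join them back."""
    return delimiter.join(value.split(delimiter)[:count])

print(first_segments("a>b>c>d>e"))  # more than four pieces
print(first_segments("a>b"))        # fewer than four pieces: returned unchanged
```

Unlike the regex with {4}, this keeps the whole string when there are fewer than four delimiters, and it does not leave a trailing '>'.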
substring(filter_type from '^(([^>]*>){4})')
This form of substring lets you extract the portion of a string that matches a regex pattern.
You can also split the string, then choose the N'th element inside the result list. For example:
SELECT SPLIT_PART('aa,bb,cc', ',', 2)
will return: bb.
This function is defined as:
SPLIT_PART(string, delimiter, position)
In order to look at this problem, I did the following (all of the code below is available on the fiddle here):
CREATE TABLE s
(
a TEXT
);
I then created an SQL function to generate random strings as follows.
CREATE FUNCTION f() RETURNS TEXT LANGUAGE SQL AS
$$
SELECT STRING_AGG(SUBSTR('abcdef>', CEIL(RANDOM() * 7)::INTEGER, 1), '')
FROM GENERATE_SERIES(1, 40)
$$;
I got the code from here and modified it so that it would produce strings with lots of > characters for testing purposes.
I then manually inserted a few strings at the beginning so that a quick look would tell me if the code was working as anticipated.
INSERT INTO s VALUES
('afsad>adfsaf>asfasf>afasdX>asdffs>asfdf>'),
('23433>433453>4>4559>455>3433>'),
('adfd>adafs>afadsf>'), -- only 3 '>'s!
('babedacfab>feaefbf>fedabbcbbcdcfefefcfcd'),
('e>>>>>'), -- edge case - multiple terminal '>'s
('aaaaaaa'); -- edge case - no '>'s whatsoever
The reason I put in the records with fewer than 4 >s is because the accepted answer (see discussion at the end of this answer) puts forward a solution which should return the entire string if this is the case!
On the fiddle, I then added 50,000 records as follows:
INSERT INTO s
SELECT f() FROM GENERATE_SERIES(1, 50000);
I also created a table s on a home laptop (16GB RAM, 500MB NVMe SSD) and populated it with 40,000,000 (40M) records; the times for that run are also shown.
Now, my reading of the question is that we need to extract the string up to but not including the 4th > character.
The first solution (from treecon) was this one (I also show them running on the fiddle, but to save space here, I've only included the partial output of EXPLAIN (ANALYZE, BUFFERS, VERBOSE)) - the times shown are typical over a few runs:
EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT
ARRAY_TO_STRING((STRING_TO_ARRAY(a, '>'))[1:4], '>'),
a
FROM s;
Result (only key parts included):
Seq Scan on public.s
Execution Time: 81.807 ms
40M Time: 46 seconds
A regex solution which works (significantly faster):
EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT
SUBSTRING(a FROM '^(?:[^>]*>){0,3}[^>]*'),
a
FROM s;
Result:
Seq Scan on public.s
Execution Time: 74.757 ms
40M Time: 32 seconds
The accepted answer fails on many levels (see the fiddle). It leaves a > at the end and fails on various strings even when modified. Also, the solution proposed to include strings with fewer than 4 >s (i.e. ^(([^>]*>){4}|.*)) merely returns the original string (see end of fiddle).
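For a quick look at how the {0,3} pattern treats the hand-written test rows, here is a Python sketch. It uses Python's re engine rather than the Postgres one, so it only illustrates the pattern; before_fourth is a made-up helper name.

```python
import re

# Up to three "segment plus '>'" groups, then one more run of non-'>' text:
# everything before the 4th '>', or the whole string if there are fewer than 4.
UP_TO_FOURTH = re.compile(r"^(?:[^>]*>){0,3}[^>]*")

def before_fourth(value):
    """Return the text before the 4th '>' (the whole string if there is none)."""
    return UP_TO_FOURTH.match(value).group(0)

# The manually inserted rows from the test table above.
samples = [
    "afsad>adfsaf>asfasf>afasdX>asdffs>asfdf>",
    "23433>433453>4>4559>455>3433>",
    "adfd>adafs>afadsf>",
    "babedacfab>feaefbf>fedabbcbbcdcfefefcfcd",
    "e>>>>>",
    "aaaaaaa",
]

for s in samples:
    print(repr(before_fourth(s)))
```

Every part of the pattern can match the empty string, so the match never fails and no fallback alternative is needed.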

Is there an efficient way to scrape substrings from column values in Postgres?

I have a column called user_response, on which I want to do a variety of operations, such as taking out words contained in quotes and taking out the part of the string after a colon (:).
One such operation is this:
Let's say for a record
user_response = "My company: 'XYZ Co.' has allowed to use:: the following \n \n kind of product: RealMadridTShirts"
Now, I want to scrape the part of the string after last colon(:). Hence, my output should be RealMadridTShirts
I could achieve this somehow with the following hack:
SELECT reverse(split_part(reverse(user_response), ' :', 1))
However, this is grossly inefficient, especially when I have to do it over 500,000 rows. It's not an operation that I will be doing throughout the day; it is for a once-a-day load, but even then the load is becoming very expensive.
Coming from Oracle, I know I could have used the INSTR and SUBSTR functions to achieve this more elegantly (without having to reverse the string and all).
Also, what if I had to scrape the text after the second last colon?
Find the string after the last colon, right?
My company: 'XYZ Co.' has allowed to use:: the following \n \n kind of product: RealMadridTShirts
It's trivial with a regular expression:
regress=> SELECT (regexp_matches(
'My company: ''XYZ Co.'' has allowed to use:: the following \n \n kind of product: RealMadridTShirts',
'.*:(.*?)$')
)[1];
regexp_matches
--------------------
RealMadridTShirts
(1 row)
The apparent lack of a function to request the position of a string counting from a particular starting point makes it harder to do without using a regexp, but as a regexp is sure to be the fastest way to solve this I doubt that's an issue.
Your bigger problem is likely to be that you're scanning so much data. That's never going to be fast.
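For the follow-up about the second-to-last colon, the logic is easier to see with plain string splitting from the right. The Python below is a sketch of that logic only, not a Postgres solution; after_nth_last is a made-up helper and the sample text is a shortened version of the one in the question.

```python
def after_nth_last(text, sep=":", n=1):
    """Return the text after the n-th separator counted from the end.

    Returns None when the text contains fewer than n separators.
    """
    parts = text.rsplit(sep, n)   # at most n splits, starting from the right
    if len(parts) <= n:
        return None
    return sep.join(parts[-n:]).strip()

response = "My company: 'XYZ Co.' has allowed to use:: the following kind of product: RealMadridTShirts"

print(after_nth_last(response))        # after the last colon
print(after_nth_last(response, n=2))   # after the second-to-last colon
```

Splitting from the right avoids reversing the string twice, which is the part of the original hack that looks wasteful.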

variable number of capturing groups

I have an XPath expression which I want to use to extract the city and date from a td that contains a string of this kind:
City(may contain spaces and may be missing, but the following space is always present) on 2013/07/20
So far, I got to the following solution for extracting the date, which works partially:
//path/to/my/td/text()/replace(.,'(.*) on (.*)','$2')
This works when City is present, but when City is missing I get "on 2013/07/20" as a result.
I think this is because the first capturing group fails and so the number of groups is different.
How can I get this expression to work?
I did not fully check your regex, but it looks fine at first sight. Anyway, you can also go an easier way if you only want to get the date by extracting the text after "on ":
//path/to/my/td/text()/substring-after(.,'on ')
edit: or you may go the substring-way and select the last 10 characters of the content:
//path/to/my/td/text()/substring(., string-length(.) - 9)
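If both the city and the date are needed, the city can be made optional in a single pattern. The sketch below uses Python's re module instead of XPath, purely to show the shape of such a pattern; split_city_date is a made-up helper, and the date format is assumed to be always YYYY/MM/DD as in the example.

```python
import re

# Optional city (lazy, may be empty), then the literal "on ", then the date.
# Requiring the date format keeps a city such as "London" from being cut
# at its own "on".
CITY_DATE = re.compile(r"\s*(.*?)\s*\bon (\d{4}/\d{2}/\d{2})\s*")

def split_city_date(text):
    """Return (city or None, date), or None if the text does not fit the shape."""
    m = CITY_DATE.fullmatch(text)
    if m is None:
        return None
    city, date = m.groups()
    return (city or None, date)

print(split_city_date("New York on 2013/07/20"))
print(split_city_date(" on 2013/07/20"))
```

The number of capturing groups stays fixed at two; a missing city simply produces an empty first group.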

Regex to remove footer using wildcards

OK, this is well beyond my limited knowledge of regular expressions. We receive a report from a banking entity in a fixed-width text file format. Unfortunately their system exports page headers with the data file, and these must be removed before processing on our end. The page headers start and end with the same text, but the content changes (dates and page numbers). A typical one looks like:
00007xxxxx LAST1,FIRST1 111111 20120930
ABCD EXPORT RPT 10/04/12 at 10/04/12 16:20 Seq 1501 Page 16
MRK014 Report Date: 10/04/12
Acct# Name SH. Balance QTR (YYYYMMDD)
----------------------------------------------------------------------------------------------------
00007xxxxx LAST2,FIRST2 222222 20120930
So each header starts with "ABCD" (actually the name of the bank, just removed here for privacy) and ends with the row of -------------------.
What I need to get it down to is the customer data on two rows (00007xxxxx - those account numbers change per person).
So I need to select from the " ABCD" to the end of the "---" to remove that block of text.
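Try this regex. This is Java code, but you can use the same pattern in your language. The line group is repeated lazily and at least one dash is required, so the match stops at the first row of dashes instead of running on to the end of the file.
str = str.replaceAll("ABCD.*[\\n\\r]+(?:.*[\\n\\r]+)*?-+[\\n\\r]*", "");
Here str contains your data above. I assume lines are separated by \n.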
To make sure you are removing the correct part of the report, I would go with a more specific regex pattern.
Use regex pattern
(?<=[\n\r])ABCD\s+EXPORT\s+RPT\s[^-]+[\n\r]\-+[\n\r]+
and replace each match with empty string.
However if your environment does not support regex lookbehind, then you have to use pattern:
([\n\r])ABCD\s+EXPORT\s+RPT\s[^-]+[\n\r]\-+[\n\r]+
and replace each match with first group.
For example in JavaScript it would be:
str.replace(/([\n\r])ABCD\s+EXPORT\s+RPT\s[^-]+[\n\r]\-+[\n\r]+/g, "$1")
Test this code here.
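For comparison, here is a minimal Python sketch of the same header removal, anchored on line starts rather than on a captured newline. The pattern is a variation written for this example, not one of the patterns above, and strip_headers is a made-up helper; the sample report reuses the lines from the question with a shortened dash row.

```python
import re

# From a line starting with ABCD, lazily across lines (DOTALL), up to and
# including the first line that starts with a run of dashes.
HEADER = re.compile(r"^[ \t]*ABCD\b.*?^-{5,}[ \t]*\r?\n?", re.MULTILINE | re.DOTALL)

def strip_headers(report):
    """Remove every page header block from the report text."""
    return HEADER.sub("", report)

report = "\n".join([
    "00007xxxxx LAST1,FIRST1 111111 20120930",
    "ABCD EXPORT RPT 10/04/12 at 10/04/12 16:20 Seq 1501 Page 16",
    "MRK014 Report Date: 10/04/12",
    "Acct# Name SH. Balance QTR (YYYYMMDD)",
    "-" * 40,
    "00007xxxxx LAST2,FIRST2 222222 20120930",
]) + "\n"

print(strip_headers(report))
```

The lazy quantifier matters here: a greedy one would run to the last dash row in the file and delete the data between headers.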