Regex to remove footer using wildcards - regex

Ok - this is well beyond my limited knowledge of regular expressions. We receive a report from a banking entity in a fixed with text file format. Unfortunately their system exports page headers with the data file that must be removed before processing on our end. The page headers start and end with the same text but the content changes (dates and page numbers). A typical one looks like:
00007xxxxx LAST1,FIRST1 111111 20120930
ABCD EXPORT RPT 10/04/12 at 10/04/12 16:20 Seq 1501 Page 16
MRK014 Report Date: 10/04/12
Acct# Name SH. Balance QTR (YYYYMMDD)
----------------------------------------------------------------------------------------------------
00007xxxxx LAST2,FIRST2 222222 20120930
So each header starts with "ABCD" (actually the name of the bank, just removed here for privacy) and ends with the row of -------------------.
What I need to get it down to is the customer data on two rows (00007xxxxx - those account numbers change per person).
So I need to select from the " ABCD" to the end of the "---" to remove that block of text.

Try this regex.. This is a Java code.. You can use the given pattern in your language..
str = str.replaceAll("ABCD((.*?)[\n\r])+(\\-*)", "");
Where str contains your above data.. Lines are separated by \n I assume..

To ensure you are removing correct part of report I would go with more complicated regex pattern.
Use regex pattern
(?<=[\n\r])ABCD\s+EXPORT\s+RPT\s[^-]+[\n\r]\-+[\n\r]+
and replace each match with empty string.
However if your environment does not support regex lookbehind, then you have to use pattern:
([\n\r])ABCD\s+EXPORT\s+RPT\s[^-]+[\n\r]\-+[\n\r]+
and replace each match with first group.
For example in JavaScript it would be:
str.replace(/([\n\r])ABCD\s+EXPORT\s+RPT\s[^-]+[\n\r]\-+[\n\r]+/g, "$1")
Test this code here.

Related

Parse a log file to fetch some values in a line

I am reading a log file where i am trying to fetch some values from lines which contains a substring "edited by:" and ending with " bye".
This is how a log file is designed.
Error nothing reported
19-06-2021 LOGGER:INFO edited by : James Cooper Person Administrator bye. //Line 2
No data match.
19-06-2021 LOGGER:INFO edited by : Harry Rhodes Person External bye. //Line 4
.......
So i am trying to fetch:
James Cooper Person Administrator //from line 2
Harry Rhodes Person External //from line 4
And assign them to variables in my tcl program.
I am assuming the fetched lines are in a list name line2.
like
set splitList[$line2 ' ']
set agent [lindex $splitList 0]
set firstName [lindex $splitList 1]
set lastName [lindex $splitList 2]
set role [lindex $splitList 3]
I understand that having the fetched or extracted lines from log file in a list is not a good idea as they are unstructured input. Using Tcl list functions can lead to weird things when they aren't in proper Tcl list format.
I am very new to tcl. And don't have much idea using regex in tcl.
So I tried extracting values from the matched line using regex. Suppose line2 is a variable holding the extracted matched line2 from the log file,
regexp -- {edited by:(.*) bye.$} $line2 match agent
I was able to get the expected output like below.
Person Harry Rhodes External
However, on this extracted string I don't know how I can further drill to get my variables assigned values. Any suggestion on this approach or any other functions which are present in tcl library which can help me with this task please let me know.
Updated the question by editing the log format. The format of the log file was not correct.
To err on the safe side, I would modify the regex to look for whitespace ([[:space:]]) between words, using * (= "any amount") and + (= "at least one") as appropriate and storing each variable in a capturing group (surrounded by parentheses ()):
edited[[:space:]]+by[[:space:]]*:[[:space:]]*([^[:space:]]*)[[:space:]]+([^[:space:]]*)[[:space:]]+([^[:space:]]*)[[:space:]]+([^[:space:]]*)[[:space:]]+bye.$
Please note that [^[:space:]] matches any character except whitespace.
Regex101 demo: https://regex101.com/r/78l4HJ/1
First off, taking apart the name of a person into its components is extremely difficult. For example, some people have a multi-word family name. (Yes, I know specific examples of this.) Other people put the parts in different orders. Can you avoid splitting the name?
The other parts of parsing that substring are easier as we can assume that agent and role will not have spaces in. The trick to this RE is that \w+ matches a “word” character sequence, \s+ matches a space character sequence (more robustly than a single space), and .*? matches anything, but as little of it as possible.
regexp {^\s*(\w+)\s+(.*?)\s+(\w+)\s*$} $substring -> agent name role
OK, that's great for the substring, but what about the whole line? It's really just a matter of adjusting the anchors. (\y matches a word boundary.)
regexp {\yedited by:\s*(\w+)\s+(.*?)\s+(\w+)\s+bye\y} $line -> agent name role
It's often not a good idea to feed more than a line at a time into a regular expression search, not unless you need to. Fortunately your records are newline-delimited so that's not a problem here.

Can I make my Alteryx RegEx parse conditional?

I receive messages with the fields below. I want to group and extract the user inputs. Majority of submissions contain all fields and the regex works great. Problem comes in when someone removes additional lines if let's say they only need to fill in down to Amount 1
Name:
Number:
Amount:
Old Code:
Code 1:
Amount 1:
Code 2:
Amount 2:
Code 3:
Amount 3:
Code 4:
Amount 4:
I'm using Alteryx to parse the message contents and have success with my current regex but want to be ready for unavoidable user submission inconsistency
Name:(.+)\sNumber:(.+)\sAmount:(.+)\sOld Code:(.+)\sCode 1:(.+)\sAmount 1:(.+)\sCode 2:(.*?)\sAmount 2:(.*?)\sCode 3:(.*?)\sAmount 3:(.*?)\sCode 4:(.*?)\sAmount 4:(.*?[^-]*)
Is it possible to have Alteryx return parsed results from a message even if a listed field is deleted?
Alteryx issue with new cascading regex
Anyway, you can always do a cascading nested optional grouping around the
lines to just match what's valid up to a point.
This expects the form lines to be in order. If it's not, a different type
of regex is needed - an out-of-order regex ( see the bottom regex ) .
Both these regex are for Perl 5.10
(?-ms)Name:(.*)(?:\s+Number:(.*)(?:\s+Amount:(.*)(?:\s+Old[ ]+Code:(.*)(?:\s+Code[ ]+1:(.*)(?:\s+Amount[ ]+1:(.*)(?:\s+Code[ ]+2:(.*)(?:\s+Amount[ ]+2:(.*)(?:\s+Code[ ]+3:(.*)(?:\s+Amount[ ]+3:(.*)(?:\s+Code[ ]+4:(.*)(?:\s+Amount[ ]+4:(.*?[^-]*))?)?)?)?)?)?)?)?)?)?)?
https://regex101.com/r/9oKXEE/1
For out-of-order matching, use this
(?m-s)\A(?:[\S\s]*?(?:(?(1)(?!))^\h*Name\h*:\h*(.*)|(?(2)(?!))^\h*Number\h*:\h*(.*)|(?(3)(?!))^\h*Amount\h*:\h*(.*)|(?(4)(?!))^\h*Old\h*Code\h*:\h*(.*)|(?(5)(?!))^\h*Code\h*1\h*:\h*(.*)|(?(6)(?!))^\h*Amount\h*1\h*:\h*(.*)|(?(7)(?!))^\h*Code\h*2\h*:\h*(.*)|(?(8)(?!))^\h*Amount\h*2\h*:\h*(.*)|(?(9)(?!))^\h*Code\h*3\h*:\h*(.*)|(?(10)(?!))^\h*Amount\h*3\h*:\h*(.*)|(?(11)(?!))^\h*Code\h*4\h*:\h*(.*)|(?(12)(?!))^\h*Amount\h*4\h*:\h*(.*?))){1,12}
https://regex101.com/r/f2rG1v/1
In this situation, you don't need to use Regex straight off the bat and given the inconsistent data it could take a while to perfect one regex term...
You can do it this way instead:
- RecordID first,
- Then you can use a Text 2 Columns with a new-line (\n) delimiter. Configure this to "Split to Rows".
- You can then use a Text to Columns to split on the delimter ":".
That will handle additional rows entered etc. At that stage, you can figure out how to clean up the results (filter to remove null lines, multi-row to tag records, cross-tab to create a table etc...). If you want to flag any unknown rows, you can have a Text Input with the required rows and use Find/Replace or Join to separate the data.

Regex to insert space with certain characters but avoid date and time

I made a regex which inserts a space where ever there is any of the characters
-:\*_/;, present for example JET*AIRWAYS\INDIA/858701/IDBI 05/05/05;05:05:05 a/c should beJET* AIRWAYS\ INDIA/ 858701/ IDBI 05/05/05; 05:05:05 a/c
The regex I used is (?!a\/c|w\/d|m\/s|s\/w|m\/o)(\D-|\D:|\D\*|\D_|\D\\|\D\/|\D\;)
I have added some words exceptions like a/c w/d etc. \D conditions given to avoid date/time values getting separated, but this created an issue, the numbers followed by the above mentioned characters never get split.
My requirement is
1. Insert a space after characters -:\*_/;,
2. but date and time should not get split which may have / :
3. need exception on words like a/c w/d
The following is the full code
Private Function formatColon(oldString As String) As String
Dim reg As New RegExp: reg.Global = True: reg.Pattern = "(?!a\/c|w\/d|m\/s|s\/w|m\/o)(\D-|\D:|\D\*|\D_|\D\\|\D\/|\D\;)" '"(\D:|\D/|\D-|^w/d)"
Dim newString As String: newString = reg.Replace(oldString, "$1 ")
formatColon = XtraspaceKill(newString)
End Function
I would use 3 replacements.
Replace all date and time special characters with a special macro that should never be found in your text, e.g. for 05/15/2018 4:06 PM, something based on your name:
05MANUMOHANSLASH15MANUMOHANSLASH2018 4MANUMOHANCOLON06 PM
You can encode exceptions too, like this:
aMANUMOHANSLASHc
Now run your original regex to replace all special characters.
Finally, unreplace the macros MANUMOHANSLASH and MANUMOHANCOLON.
Meanwhile, let me tell you why this is complicated in a single regex.
If trying to do this in a single regex, you have to ask, for each / or :, "Am I a part of a date or time?"
To answer that, you need to use lookahead and lookbehind assertions, the latter of which Microsoft has finally added support for.
But given a /, you don't know if you're between the first and second, or second and third parts of the date. Similar for time.
The number of cases you need to consider will render your regex unmaintainably complex.
So please just use a few separate replacements :-)

Extract text up to the Nth character in a string

How can I extract the text up to the 4th instance of a character in a column?
I'm selecting text out of a column called filter_type up to the fourth > character.
To accomplish this, I've been trying to find the position of the fourth > character, but it's not working:
select substring(filter_type from 1 for position('>' in filter_type))
You can use the pattern matching function in Postgres.
First figure out a pattern to capture everything up to the fourth > character.
To start your pattern you should create a sub-group that captures non > characters, and one > character:
([^>]*>)
Then capture that four times to get to the fourth instance of >
([^>]*>){4}
Then, you will need to wrap that in a group so that the match brings back all four instances:
(([^>]*>){4})
and put a start of string symbol for good measure to make sure it only matches from the beginning of the String (not in the middle):
^(([^>]*>){4})
Here's a working regex101 example of that!
Once you have the pattern that will return what you want in the first group element (which you can tell at the online regex on the right side panel), you need to select it back in the SQL.
In Postgres, the substring function has an option to use a regex pattern to extract text out of the input using a 'from' statement in the substring.
To finish, put it all together!
select substring(filter_type from '^(([^>]*>){4})')
from filter_table
See a working sqlfiddle here
If you want to match the entire string whenever there are less than four instances of >, use this regular expression:
^(([^>]*>){4}|.*)
You can also use a simple, non-regex solution:
SELECT array_to_string((string_to_array(filter_type, '>'))[1:4], '>')
The above query:
splits your string into an array, using '>' as delimeter
selects only the first 4 elements
transforms the array back to a string
substring(filter_type from '^(([^>]*>){4})')
This form of substring lets you extract the portion of a string that matches a regex pattern.
You can also split the string, then choose the N'th element inside the result list. For example:
SELECT SPLIT_PART('aa,bb,cc', ',', 2)
will return: bb.
This function is defined as:
SPLIT_PART(string, delimiter, position)
In order to look at this problem, I did the following (all of the code below is available on the fiddle here):
CREATE TABLE s
(
a TEXT
);
I then created a PL/pgSQL function to generate random strings as follows.
CREATE FUNCTION f() RETURNS TEXT LANGUAGE SQL AS
$$
SELECT STRING_AGG(SUBSTR('abcdef>', CEIL(RANDOM() * 7)::INTEGER, 1), '')
FROM GENERATE_SERIES(1, 40)
$$;
I got the code from here and modified it so that it would produce strings with lots of > characters for testing purposes.
I then manually inserted a few strings at the beginning so that a quick look would tell me if the code was working as anticipated.
INSERT INTO s VALUES
('afsad>adfsaf>asfasf>afasdX>asdffs>asfdf>'),
('23433>433453>4>4559>455>3433>'),
('adfd>adafs>afadsf>'), -- only 3 '>'s!
('babedacfab>feaefbf>fedabbcbbcdcfefefcfcd'),
('e>>>>>'), -- edge case - multiple terminal '>'s
('aaaaaaa'); -- edge case - no '>'s whatsoever
The reason I put in the records with fewer than 4 >s is because the accepted answer (see discussion at the end of this answer) puts forward a solution which should return the entire string if this is the case!
On the fiddle, I then added 50,000 records as follows:
INSERT INTO s
SELECT f() FROM GENERATE_SERIES(1, 50000);
I also created a table s on a home laptop (16GB RAM, 500MB NVMe SSD) and populated it with 40,000,000 (50M) records - times also shown.
Now, my reading of the question is that we need to extract the string up to but not including the 4th > character.
The first solution (from treecon) was this one (I also show them running on the fiddle, but to save space here, I've only included the partial output of EXPLAIN (ANALYZE, BUFFERS, VERBOSE)) - the times shown are typical over a few runs:
EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT
ARRAY_TO_STRING((STRING_TO_ARRAY(a, '>'))[1:4], '>'),
a
FROM s;
Result (only key parts included):
Seq Scan on public.s
Execution Time: 81.807 ms
40M Time: 46 seconds
A regex solution which works (significantly faster):
EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT
SUBSTRING(a FROM '^(?:[^>]*>){0,3}[^>]*'),
a
FROM s;
Result:
Seq Scan on public.s
Execution Time: 74.757 ms
40M Time: 32 seconds
The accepted answer fails on many levels (see the fiddle). It leaves a > at the end and fails on various strings even when modified. Also, the solution proposed to include strings with fewer than 4 >s (i.e. ^(([^>]*>){4}|.*)) merely returns the original string (see end of fiddle).

Regex query: how can I search PDFs for a phrase where words in that phrase appear on more than one line?

I am trying to set up an index page for the weekly magazine I work on. It is to show readers the names of
companies mentioned in that weeks' issue, plus the page numbers they are appear on.
I want to search all the PDF files for the week, where one PDF = one magazine page (originally made in
Adobe InDesign CS3 and Adobe InCopy CS3).
I have set up a list of companies I want to search for and, using PowerGREP and using delimited regular
expressions, I am able to find most page numbers where a company is mentioned. However, where a
company name contains two or more words, the search I am running will not pick up instances where the
name appears over more than one line.
For example, when looking for "CB Richard Ellis" and "Cushman & Wakefield", I got no result when the
text appeared like this:
DTZ beat BNP PRE, CB [line break here]
Richard Ellis and Cushman & [line break here]
Wakefield to secure the contract. [line end here]
Could someone advise me on how to write a regular expression that will ignore white space between
words and ignore line endings OR one that will look for the words including all types of white space (ie uneven
spaces between words; spaces at the end of lines or line endings; and tabs (I am guessing that this info is
imbedded somehow in PDF files).
Here is a sample of the set of terms I have asked PowerGREP to search for:
\bCB Richard Ellis\b
\bCB Richard Ellis Hotels\b
\bCentaur Services\b
\bChapman Herbert\b
\bCharities Property Fund\b
\bChetwoods Architects\b
\bChurch Commissioners\b
\bClive Emson\b
\bClothworkers’ Company\b
\bColliers CRE\b
\bCombined English Stores Group\b
\bCommercial Estates Group\b
\bConnells\b
\bCooke & Powell\b
\bCordea Savills\b
\bCrown Estate\b
\bCushman & Wakefield\b
\bCWM Retail Property Advisors\b
[Note that there is a delimited hard return between each \b at the end of each phrase and beginnong of the next phrase.]
By the way, I am a production journalist and not usually involved in finding IT-type solutions and am
finding it difficult to get to grips with the technical language on the PowerGREP site.
Thanks for assistance
Alison
You have hard-coded spaces in your names. Replace them with \s+ and you should be OK.
E.g.:
CB\s+Richard\s+Ellis
What's happening is, when you have a forced line break it doesn't have that space (" ") character anymore. Instead it has \n or \r\n. Using \s+ means that you are looking for any whitespace character, including carriage-returns and linefeeds, in quantity of one or more.
The regex for matching spaces is \s, so it would be
\bCB\s+Richard\s+Ellis\b
(\s+ = match at least one whitespace). Line breaks are \n (newline) and \r (return), depending on your OS. So form a group using [] including all [\r\n\s] would result in:
\bCB[\r\n\s]+Richard[\r\n\s]+Ellis\b