Handle initials in Postgresql - regex

I'd like to keep initials (max two letters) together when there is punctuation or whitespace between them.
I have the following snippet that handles almost everything, but I am having trouble joining initials that are separated by punctuation and a space. For instance, this works in regular regex engines, but not in PostgreSQL:
SELECT regexp_replace('R Z ELEMENTARY SCHOOL',
'(\b[A-Za-z]{1,2}\b)\s+\W*(?=[a-zA-Z]{1,2}\b)',
'\1')
The outcome should be "RZ ELEMENTARY SCHOOL". Other examples will include:
A & D ALTERNATIVE EDUCATION
J. & H. KNOWLEDGE DEVELOPMENT
A. - Z. EVOLUTION IN EDUCATION
The transformation should be as follows:
AD ALTERNATIVE EDUCATION
JH KNOWLEDGE DEVELOPMENT
AZ EVOLUTION IN EDUCATION
How to achieve this in Postgresql?
Thanks

Building on your current regex, I can recommend
SELECT REGEXP_REPLACE(
REGEXP_REPLACE('J. & H. KNOWLEDGE DEVELOPMENT', '\m([[:alpha:]]{1,2})\M\s*\W*(?=[[:alpha:]]{1,2}\M)', '\1'),
'^([[:alpha:]]+)\W+',
'\1 '
)
See the online demo, yielding
regexp_replace
JH KNOWLEDGE DEVELOPMENT
It is a two-step solution. The first regex matches
\m([[:alpha:]]{1,2})\M - a whole one- or two-letter word captured into Group 1 (\m is a leading word boundary, and \M is a trailing word boundary)
\s* - zero or more whitespaces
\W* - zero or more non-word chars
(?=[[:alpha:]]{1,2}\M) - a positive lookahead that requires a whole one- or two-letter word immediately to the right of the current position.
The match is replaced with the contents of Group 1 (\1). Since no 'g' flag is passed, only the first match in the string is replaced.
The second regex matches a run of letters at the start of the string (Group 1) and replaces all non-word chars right after it with a single space.
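If you want to sanity-check the two steps outside the database, here is a minimal Python sketch of the same idea. It is not the PostgreSQL code: Python's re module has no \m / \M, so \b is used for both boundaries, [A-Za-z] stands in for [[:alpha:]], and join_initials is just a hypothetical helper name. count=1 mimics REGEXP_REPLACE without the 'g' flag.

```python
import re

# Step 1: glue a 1-2 letter word to the following 1-2 letter word,
# dropping the whitespace/punctuation in between.
# Assumption: \b replaces PostgreSQL's \m and \M, [A-Za-z] replaces [[:alpha:]].
JOIN_INITIALS = re.compile(r'\b([A-Za-z]{1,2})\b\s*\W*(?=[A-Za-z]{1,2}\b)')

# Step 2: replace the non-word chars after the leading letter run with one space.
TIDY_LEAD = re.compile(r'^([A-Za-z]+)\W+')


def join_initials(name: str) -> str:
    """Hypothetical helper mirroring the nested REGEXP_REPLACE calls."""
    # count=1: only the first match is replaced, like REGEXP_REPLACE without 'g'.
    merged = JOIN_INITIALS.sub(r'\1', name, count=1)
    return TIDY_LEAD.sub(r'\1 ', merged, count=1)


for sample in (
    'R Z ELEMENTARY SCHOOL',
    'A & D ALTERNATIVE EDUCATION',
    'J. & H. KNOWLEDGE DEVELOPMENT',
    'A. - Z. EVOLUTION IN EDUCATION',
):
    print(join_initials(sample))
```

For the four sample names from the question this prints the expected RZ / AD / JH / AZ forms. Names with three or more initials would need the first step applied globally, which in PostgreSQL means passing 'g' as the fourth argument.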

Related

Capture a word `+` same word again but with a prefix

To all the Regex gurus
Any idea how to handle this beast
string = 'Position_Name [+|-|/|*] PrevYear Position_Name'
I'm looking for a regex that matches both occurrences of Position_Name. It is similar to a duplicate, but not really a dupe, since the first occurrence is followed by a special character and then by itself with a prefix (here: 'PrevYear'). Position_Name is dynamic and could be any word (e.g. Profit, Sales, etc.), but PrevYear stays constant.
So how could I identify the lines where a position is mentioned twice with a math symbol in the middle (for now), and then capture those three elements? The plus could also be a / (divided by), a minus sign -, or a multiply *, which is what [+|-|/|*] is meant to represent in my example.
PS: I do not mind programming this in two steps ... so first matching and then capturing - but still would need the regex to find these little gems (in hundreds of lines).
Elegantly finding dupes is not the problem eg via \b(\w+) \1\b but I have come to realize my capabilities are not sufficient for that combo.
Thanks for any hints and support.
You can use
\b(\w+)\b\s*[-+/*]\s*PrevYear\s*\1\b
See the regex demo. Details
\b - a word boundary
(\w+) - Group 1: one or more word chars
\b - a word boundary
\s*[-+/*]\s* - a -, +, / or * enclosed with zero or more whitespaces
PrevYear - a fixed word
\s* - zero or more whitespaces
\1 - same value as captured in Group 1
\b - a word boundary.
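Since the question mentions Python-style two-step processing (match first, capture later), here is a small sketch of how the pattern could be used to do both at once. The only change from the pattern above is an extra pair of parentheses around [-+/*] so the operator is captured too; find_prev_year_pairs is a made-up helper name.

```python
import re

# Same pattern as above, with the operator wrapped in a second capturing group.
PREV_YEAR = re.compile(r'\b(\w+)\b\s*([-+/*])\s*PrevYear\s*\1\b')


def find_prev_year_pairs(line: str) -> list[tuple[str, str]]:
    """Return (position_name, operator) for every 'X op PrevYear X' in the line."""
    return [(m.group(1), m.group(2)) for m in PREV_YEAR.finditer(line)]


print(find_prev_year_pairs('Sales + PrevYear Sales'))    # [('Sales', '+')]
print(find_prev_year_pairs('Profit/PrevYear Profit'))    # [('Profit', '/')]
print(find_prev_year_pairs('Sales - PrevYear Profit'))   # []
```

The last line shows why the backreference matters: the names differ, so nothing is reported.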

Regex to identify for values other than alphanumeric values which can have hyphen or dot in between them but not at beginning or at end

I am new to the regular expressions. I have seen other quite close posts with a similar question but as you are aware in RegEx even dot matters a lot so here I am posting this question to seek help in this particular scenario.
My SQL column value can have a-z, A-Z, and 0-9
It can have a dot(.) and hyphen(-) in between. These 2 things cannot be at the beginning or at the end.
It cannot have space or tabs or any blanks anywhere in the column value.
It cannot start or end with any special characters; not even dots or hyphens.
I wrote this query which covers the 1st, 2nd, and 3rd points but fails in the 4th case.
select * from test_db.xtmp_testtable_invalidchars042321_rg where (sl_id REGEXP '[^[:alnum:]].+$')
**Table column input values**
RaghavGupta
.RaghavGupta
#Raghav.Gupta
"Raghav Gupta"
Raghav Gupta
Raghav#Gupta
Raghav$Gupta
Raghav%Gupta
Raghav*Gupta
Raghav.Gupta
RaghavGupta
RaghavGupta$
RaghavGupta.
RaghavGupta[]
**Query Result**
RaghavGupta
.RaghavGupta
#Raghav.Gupta
"Raghav Gupta"
Raghav Gupta
Raghav#Gupta
Raghav$Gupta
Raghav%Gupta
Raghav*Gupta
Raghav.Gupta
"RaghavGupta "
RaghavGupta[]
You can use NOT with the matching regex:
select * from test_db.xtmp_testtable_invalidchars042321_rg where (sl_id NOT REGEXP '^[[:alnum:]]+([.-][[:alnum:]]+)*$')
The pattern matches
^ - start of string
[[:alnum:]]+ - one or more alphanumeric chars ([:alnum:] is a POSIX character class that matches letters and/or digits)
([.-][[:alnum:]]+)* - (a capturing group that matches) zero or more repetitions of
[.-] - a . or -
[[:alnum:]]+ - one or more alphanumeric chars
$ - end of string.
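To check the rule against the sample values before running it on the table, a rough Python equivalent can help. This is only a sketch: Python's re does not support POSIX classes such as [[:alnum:]], so [A-Za-z0-9] is assumed to be equivalent for this data, and is_valid_id is a hypothetical name. 'Raghav-Gupta' is an extra value added here to exercise the hyphen case.

```python
import re

# Alphanumeric runs joined by single dots or hyphens; nothing else allowed.
# Assumption: [A-Za-z0-9] stands in for the POSIX class [[:alnum:]].
VALID_ID = re.compile(r'[A-Za-z0-9]+(?:[.-][A-Za-z0-9]+)*')


def is_valid_id(value: str) -> bool:
    # fullmatch plays the role of the ^ ... $ anchors in the SQL pattern.
    return VALID_ID.fullmatch(value) is not None


samples = [
    'RaghavGupta', '.RaghavGupta', '#Raghav.Gupta', 'Raghav Gupta',
    'Raghav#Gupta', 'Raghav.Gupta', 'Raghav-Gupta', 'RaghavGupta$',
    'RaghavGupta.', 'RaghavGupta[]',
]

# Mirrors the NOT REGEXP query: list only the offending values.
for value in samples:
    if not is_valid_id(value):
        print(repr(value))
```

Only RaghavGupta, Raghav.Gupta and Raghav-Gupta pass; everything else is printed as invalid.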

Regex to detect preferred stock symbols

To start off, regex is probably the least talented aspect within my programming belt, this is what I have so far:
\D{1,5}(PR)\D+$
\D{1,5} because common stock symbols are always a maximum of 5 letters
(PR) because that is part of the pattern that needs to be searched (more below in the background info)
\D+$ because I'm trying to match any single letter at the end of the string
A small tidbit of background
Preferred stock symbols are not standardized and so every platform, exchange, etc has their own way to display them. Having said that, most display a special character in their name, which makes those guys easy to detect. The characters are
[] {'.', '/', '-', ' ', '+'};
The trickier ones all have a similar pattern:
{symbol}PR{0}
{symbol}p{0}
{symbol}P{0}
Where 0 is just any single letter A-Z
Here is a sample data set for the trickier ones:
PSAPRZ
PSApA
PSApZ
PSAPA
PSAPZ
My regex seems to be working for the first one, since I'm specifically looking for (PR) and matching any single letter character at the end, but I can't for the life of me figure out how to also detect the patterns that end in p{0} or P{0} in the same regex. I completely gave up trying to incorporate finding the special symbols because I can easily just do a string.Contains on the target string for any of those chars. The more important part is figuring out these trickier ones.
How do I get my regex statement to also detect the p{0} and P{0} matches within the same regex statement?
Edit 1
If you're curious at the madness of different possibilities, including the "easy to detect" versions, grab a popcorn, here you go :)
PSA.PA
PSA.PR.A
PSA/PA
PSAPRA
PSA-A
PSA PRA
PSA.PRA
PSA.PA
PSA+A
PSA/PRA
PSApA
PSAPA
PSA-PA
This should do it:
^[A-Z]{1,5}([Pp]|PR)[A-Z]$
Explanation:
^ - anchor at start
[A-Z]{1,5} - one to five uppercase letters
([Pp]|PR) - capture group used for: uppercase P or lowercase p or uppercase PR
[A-Z] - one uppercase letter
$ - anchor at end
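A quick way to see which alternative fired is to read the capture group. The sketch below uses Python's re purely for illustration (the question does not say which language is in use), and preferred_marker is a hypothetical helper name.

```python
import re

TRICKY = re.compile(r'^[A-Z]{1,5}([Pp]|PR)[A-Z]$')


def preferred_marker(symbol: str):
    """Return the matched marker ('P', 'p' or 'PR'), or None if no match."""
    m = TRICKY.match(symbol)
    return m.group(1) if m else None


for symbol in ('PSAPRZ', 'PSApA', 'PSApZ', 'PSAPA', 'PSAPZ', 'PSA'):
    print(symbol, preferred_marker(symbol))
```

The five tricky samples report PR, p, p, P and P, while the plain symbol PSA gives None. Keep in mind that any common symbol that happens to end in P plus a letter would also match, so this is a heuristic.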
UPDATE after EDIT 1 in question. To support the odd formats that put a ., /, -, + or whitespace before the P/PR marker, use this:
^[A-Z]{1,5}[.\/\s\+\-]?([Pp]|PR\.?)[A-Z]$
Explanation:
^ - anchor at start
[A-Z]{1,5} - one to five uppercase letters
[.\/\s\+\-]? - an optional single character: ., /, whitespace, + or -
([Pp]|PR\.?) - capture group used for: uppercase P, or lowercase p, or uppercase PR followed by an optional .
[A-Z] - one uppercase letter
$ - anchor at end
Note on anchors: Use ^...$ anchors if you only have the stock symbol in the string. If you have text with a stock symbol anywhere within, use word boundaries \b...\b instead.
Learn more about regex: https://twiki.org/cgi-bin/view/Codev/TWikiPresentation2018x10x14Regex
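It can be worth running the updated pattern over the whole list from Edit 1. In the Python sketch below, two samples come back unmatched, PSA-A and PSA+A, because the pattern always requires a P/p/PR marker. WITH_BARE_SEP is a hypothetical variant (not part of the original answer) that also accepts a bare - or + as the marker; whether a bare separator always means "preferred" on your platforms is an assumption you would need to verify.

```python
import re

# The updated pattern from above, unchanged.
EXTENDED = re.compile(r'^[A-Z]{1,5}[.\/\s\+\-]?([Pp]|PR\.?)[A-Z]$')

# Hypothetical variant: additionally allow a bare '-' or '+' before the series letter.
WITH_BARE_SEP = re.compile(r'^[A-Z]{1,5}(?:[./\s+-]?(?:[Pp]|PR\.?)|[-+])[A-Z]$')

SAMPLES = [
    'PSA.PA', 'PSA.PR.A', 'PSA/PA', 'PSAPRA', 'PSA-A', 'PSA PRA',
    'PSA.PRA', 'PSA+A', 'PSA/PRA', 'PSApA', 'PSAPA', 'PSA-PA',
]

unmatched = [s for s in SAMPLES if not EXTENDED.match(s)]
print(unmatched)  # ['PSA-A', 'PSA+A']

still_unmatched = [s for s in SAMPLES if not WITH_BARE_SEP.match(s)]
print(still_unmatched)  # []
```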

Regular Expression for checking subword between capture groups

Talking about regex, I am facing the problem of removing the hyphenated repetitions at the beginning of a composed word.
For example:
wo-wo-wo-wonder -> wonder
hi-hi-hi-hi -> hi
wo-wo-wo -> wo
f-f-f-fight -> fight
So, for every word inside a text, I want to replace words where the main word (wonder) is preceded by a partial or total repetition of itself (wo-wo-wo, but also wonder-wonder-wonder).
At the same time, composed words like bi-linear or pre-trained MUST NOT be replaced, because in this case the hyphenated prefix (pre) is not part of the main word (train).
I've seen this solution [Python find all occurrences of hyphenated word and replace at position] and apparently it can be a good solution.
But my problem is quite different, because I don't want to impose constraints on the length of the hyphenated prefix, and at the same time I want to check that the prefix is part of the main word.
This is the Regex I am actually using but as explained, it doesn't solve my full problem.
re.sub(r'(?<!\S)(\w{1,3})(?:-\1)*-(\w+)(?!\S)', '\\2', s)
Use
r'(?<!\S)(\w+)(?:-\1)*-(\1)'
or
r'\b(\w+)(?:-\1)*-(\1)'
See the regex demo
Details
(?<!\S) - a whitespace boundary (if you use \b, a word boundary)
(\w+) - Group 1: any one or more word chars
(?:-\1)* - 0 or more repetitions of - and Group 1 value
- - a hyphen
(\1) - Group 2: same value as in Group 1.
Python sample re.sub:
s = re.sub(r'(?<!\S)(\w+)(?:-\1)*-(\1)', r'\2', s)
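Here is a self-contained version of that re.sub call that you can run against the examples from the question. collapse_stutter is just a hypothetical wrapper name.

```python
import re

# Group 1 is the repeated chunk; Group 2 is the same chunk at the start of the main word.
STUTTER = re.compile(r'(?<!\S)(\w+)(?:-\1)*-(\1)')


def collapse_stutter(text: str) -> str:
    """Drop leading hyphenated repetitions, keeping the main word intact."""
    return STUTTER.sub(r'\2', text)


for s in ('wo-wo-wo-wonder', 'hi-hi-hi-hi', 'wo-wo-wo', 'f-f-f-fight',
          'wonder-wonder-wonder', 'bi-linear', 'pre-trained'):
    print(s, '->', collapse_stutter(s))
```

bi-linear and pre-trained stay untouched because the part after the hyphen does not start with the part before it. One caveat: a legitimate compound whose second part starts with its prefix, such as re-read, would be collapsed as well (to read), so you may want an exception list for those.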

Excel 2007 VBA RegEx Help Needed

I'm working on an Excel 2007 VBA project that my client wants done yesterday and I need to use RegEx to locate strings within some pretty challenging data. This is my first exposure to RegEx so I'm stuck doing something I think is simple (maybe not) and I'm clueless.
I've added the reference to the VBScript RegEx engine (5.5) and RegEx is working O.K. in Excel - I just don't know how to construct the pattern statement. I need to locate occurrences of the word "trust" in a range of cells on a worksheet. In some of my data this word has been abbreviated "Tr". I have constructed the following RegEx statement to locate the word "trust" and all words that start with a space and contain "tr".
"trust| tr"
Unfortunately, this matches any word that contains "tr", like "trail", "tree", and so on. What I want to match is " tr" - meaning it has a leading space, the "tr", and nothing else in the word. Can somebody tell me what I need to do to make this happen?
I'm also going to need RegEx patterns for street addresses, city, state, and zip plus last name and first name. If there's a resource someone can point me to for these expressions I'd appreciate the help. I'm sorry to ask the group this question without spending the proper amount of time educating myself, but this is a time-sensitive project for which I need your expertise.
Thanks In Advance -
PS - Here a sample of data that I'm working with. I have this type of data present in 5 columns over 4,000 rows.
Jones Family **Trust**
3420 E Ave of the Ftns
3420 E Avenue of the Fountain
320 E ARROWHEAD **TRAILHEAD**
501 S 29TH ST
PO BOX 13422
71343 W Paradise Dr
152035 S 29TH ST
124 Owl Grove Pl
Johnson **Tr**
1900 E Arrowhead **Trl**
1900 E ARROWHEAD **TRL**
This is a sample from a column that predominantly contains street addresses. Other columns contain client names without addresses. So not every cell contains data that starts with a number.
I would rewrite your expression so that it finds trust and tr only where they are not preceded or followed by other letters, using the \b word boundary assertion. \b matches at a position that is aptly called a "word boundary".
There are three different positions that qualify as word boundaries:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
For more information on word boundaries, see regular-expressions.info. I'm not affiliated with that site.
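To make the three boundary positions concrete, here is a tiny sketch that lists where \b matches in one of your sample values. It uses Python's re rather than the VBScript engine, on the assumption that \b follows the same rules in both; boundary_positions is a hypothetical helper name.

```python
import re


def boundary_positions(text: str) -> list[int]:
    """Return every index in text where the zero-width \\b assertion matches."""
    return [m.start() for m in re.finditer(r'\b', text)]


# 'Johnson Tr': start of string, after 'Johnson', before 'Tr', end of string.
print(boundary_positions('Johnson Tr'))  # [0, 7, 8, 10]
```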
\b(?:trust|tr)\b
After viewing the above, if you're still set on requiring the tr to be preceded by a space, then use this: \b(?:trust|\str)\b
Examples
Live Demo
https://regex101.com/r/xM4fR9/1
Note: I am assuming you're using the case insensitive flag for this
Explanation
NODE                     EXPLANATION
----------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
----------------------------------------------------------------------
  (?:                      group, but do not capture:
----------------------------------------------------------------------
    trust                    'trust'
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    tr                       'tr'
----------------------------------------------------------------------
  )                        end of grouping
----------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
----------------------------------------------------------------------
Or
The \b(?:trust|tr)\b expression isn't the most efficient, but it is readable.
A functionally identical, but more efficient regular expression would be:
\btr(?:ust)?\b
Here we're still using the \b word boundary, but we've just made the ust part of the word trust optional with the (?: ... )? construct.
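If you want to convince yourself that the two expressions behave the same on your data, a quick check like the one below can help. It is written in Python rather than VBA, so treat it as a sketch: as far as I know the VBScript 5.5 engine handles \b and (?:...) the same way, with IgnoreCase = True taking the place of re.IGNORECASE. mentions_trust is a hypothetical helper name.

```python
import re

READABLE = re.compile(r'\b(?:trust|tr)\b', re.IGNORECASE)
COMPACT = re.compile(r'\btr(?:ust)?\b', re.IGNORECASE)


def mentions_trust(cell: str) -> bool:
    """True if the cell contains 'trust' or 'tr' as a whole word."""
    return COMPACT.search(cell) is not None


cells = [
    'Jones Family Trust',
    '320 E ARROWHEAD TRAILHEAD',
    '501 S 29TH ST',
    'Johnson Tr',
    '1900 E Arrowhead Trl',
    '1900 E ARROWHEAD TRL',
]

for cell in cells:
    # Both patterns should agree on every cell.
    agree = bool(READABLE.search(cell)) == mentions_trust(cell)
    print(f'{cell!r}: {mentions_trust(cell)} (patterns agree: {agree})')
```

Only 'Jones Family Trust' and 'Johnson Tr' are reported as matches; TRAILHEAD and Trl are rejected because the word continues after tr.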