Remove a String Located between Character - regex

I have an email column in my table. How can I update the email column by removing everything from the underscore _ up to (but not including) the #?
Example:
Input: andri_pasardigital#zcode.id
Output: andri#zcode.id

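You can use this pattern: _(.*)#
Note that the match includes both the _ and the #, so replace it with # rather than with an empty string.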
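As a minimal sketch of that substitution, here is the idea in Python rather than SQL. The function name `strip_between_regex` is made up for illustration, and the pattern uses `[^#]*` instead of `.*` so that it stops at the first `#`; that is a deliberate variation on the pattern above, not the original answer's exact regex.

```python
import re

def strip_between_regex(email: str) -> str:
    # Match from the first "_" up to and including the next "#",
    # then put the "#" back so only the unwanted part disappears.
    return re.sub(r"_[^#]*#", "#", email, count=1)

print(strip_between_regex("andri_pasardigital#zcode.id"))  # andri#zcode.id
```

Strings without an underscore before the `#` are returned unchanged.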

This can also be done without a regex, using the PostgreSQL string functions: https://www.postgresql.org/docs/current/static/functions-string.html
Something along the lines of
create table T (id serial primary key, email text);
insert into T (email) values ('bal_hussa#palo.enf');
followed by
select substring(email from 1 for position('_' in email) - 1) ||
       substring(email from position('#' in email))
from T;
produces
bal#palo.enf
so it could also answer the question. To update the column in place, the same expression can be used in the SET clause of an UPDATE statement.
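The position-based idea (keep what precedes the underscore, then append everything from the `#` onwards) can be sketched in Python as well. The helper name `strip_between_slices` is hypothetical, and the guard for missing delimiters is an addition of this sketch, not something the SQL above does.

```python
def strip_between_slices(email: str) -> str:
    start = email.find("_")   # index of the first underscore, -1 if absent
    end = email.find("#")     # index of the first "#", -1 if absent
    if start == -1 or end == -1 or end < start:
        return email          # nothing to remove
    # Keep the text before "_" and everything from "#" onwards.
    return email[:start] + email[end:]

print(strip_between_slices("bal_hussa#palo.enf"))  # bal#palo.enf
```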

Related

How to create Elasticsearch tokens for a unique use case for a text value?

I have a use case where I want to tokenise an email ID into two categories of tokens:
Category 1
words separated by punctuation, plus their prefix tokens
Category 2
words separated by punctuation, plus their prefix tokens, with the punctuation included
For example, for the email ona.ki#gl.co
I want to know whether the following tokens can be produced:
on, ona, ona., ona.k, ona.ki, ona.ki#, ona.ki#g, ona.ki#gl, ona.ki#gl., ona.ki#gl.c, ona.ki#gl.co
.k, .ki, .ki#, .ki#g, .ki#gl, .ki#gl., .ki#gl.c, .ki#gl.co
ki, ki#, ki#g, ki#gl, ki#gl., ki#gl.c, ki#gl.co
#g, #gl, #gl., #gl.c, #gl.co
gl, gl., gl.c, gl.co
.c, .co
co
Use case example for ona.ki#gl.co
ona.k - should match
na.k - should not match
.ki# - should match
ki# - should match
i# - should not match
The reason I want to tokenise this way: consider two documents with the text values
ona.ki#gl.com
mona.gh#gl.com
When the user types on, ona, ... I want to fetch and show only ona.ki#gl.com, not the other one.
Thanks in advance.
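To make the desired token set concrete, here is a small Python sketch that enumerates it outside Elasticsearch. It assumes the rule implied by the list above: a token may start at the beginning of a word or at a punctuation character, and is any prefix of the remaining text with at least two characters. The function name `edge_tokens` and the `min_len` parameter are illustrative; this is not an Elasticsearch analyzer configuration.

```python
import re

def edge_tokens(text, min_len=2):
    """Prefixes (min_len chars or longer) of every suffix that starts
    at a word start or at a punctuation character."""
    tokens = []
    # Each match is either a whole alphanumeric word or one punctuation char.
    for m in re.finditer(r"[A-Za-z0-9]+|[^A-Za-z0-9]", text):
        suffix = text[m.start():]
        for end in range(min_len, len(suffix) + 1):
            tokens.append(suffix[:end])
    return tokens

tokens = set(edge_tokens("ona.ki#gl.co"))
print("ona.k" in tokens, "na.k" in tokens, ".ki#" in tokens, "i#" in tokens)
```

Under this rule, `on` is a token of ona.ki#gl.com but not of mona.gh#gl.com, which is the behaviour asked for.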

How to extract real name, first name, last name, and create a nickname from names that have acronyms, aristocratic titles, academic titles/degrees

I am trying to extract the name, first name and last name, and to create a nickname from the first name and a nickname from the last name, from a name list that contains acronyms, aristocratic titles, academic titles/degrees, and nicknames inside parentheses or after a slash or backslash, in LibreOffice Calc using the REGEX function. This is the expected result
The rules are:
Expected Name: remove the nickname inside parentheses or after a slash/backslash if present, and remove job titles, academic titles and degrees. Keep R or R., Ra or Ra., a single letter with or without a dot (e.g. A., A, M.), and name acronyms with or without a dot (e.g. Muh., Moh).
Firstname: (from Expected Name) extract the first name, including the dot in an acronym (e.g. R., Ra., Muh.) and a single letter with or without a dot (e.g. A., A, M.).
Lastname: (from Expected Name) extract the last name, including any dot.
Nickname-fname: from the input name, extract the nickname inside parentheses or after a slash/backslash if present. If there is none, extract the first name, provided it is not a single character, an acronym with or without a dot, "I Gede", "I Gusti", "I Made", "Ni Luh Putu", or "Ni Putu". Otherwise use the next word in the name, even if it is the last word. If the name consists of only one word, extract it even if it is a single character.
Nickname-lname: from the input name, extract the nickname inside parentheses or after a slash/backslash if present. If there is none, extract the last name, provided it is not a single character or an acronym with or without a dot. Otherwise use the previous word in the name, even if it is the first word. If the name consists of only one word, extract it even if it is a single character.
I tried the following:
=REGEX($A2,"\(.*?\)|\([^)]*\)|\\[^\\]*$|\/[^\/]*$|,.*","")
to extract the real name. It removes the nickname inside parentheses or after a slash/backslash, with or without spaces, as in
( Nita ) in Yunita ( Nita )
( Nita ) in Yunita( Nita )
(Nita) in Yunita (Nita)
(Nita) in Yunita(Nita)
( Nita) in Yunita ( Nita)
( Nita) in Yunita( Nita)
(Nita ) in Yunita (Nita )
(Nita ) in Yunita(Nita )
\Nita in Yunita\Nita
\ Nita in Yunita\ Nita
\Nita in Yunita \Nita
\ Nita in Yunita \ Nita
...
and so on
It also removes academic degrees like Ph.D., but it fails on Ra. Ayu S. Ph.D. (because there is no comma after S.). It also fails on Prof. and Dr., while I want to keep Ra..
=REGEX($A23,"(?:^|(?:[.!?]\s))([\w.]+)")
to extract the first name, but it fails on Moh.Ali (yes, without a space after the dot).
=REGEX($A23,"\b(?<last>[\w\[.\]\\]+)$")
to extract the last name, but it also fails on Moh.Ali.
=REGEX($A2,"(\(.*?\)|\([^)]*\)|\\[^\\]*$|\/[^\/]*$)|(\b(?<first>\w+)$)")
to create a nickname from the nickname given inside parentheses or after a slash or backslash. If no nickname is given, it should be created from the first name, provided that is not a single character, an acronym with or without a dot, "I Gede", "I Gusti", "I Made", "Ni Luh Putu", or "Ni Putu". If the first name does not meet the condition, the next word in the name should be used, even if it is the last word. If the name consists of only one word, that word should be used even if it is a single character. The regex fails in most cases.
=REGEX($A2,"(\(.*?\)|\([^)]*\)|\\[^\\]*$|\/[^\/]*$)|(\b(?<last>\w+)$)")
to create a nickname from the nickname given inside parentheses or after a slash or backslash. If no nickname is given, it should be created from the last name, provided that is not a single character or an acronym with or without a dot. If the last name does not meet the condition, the previous word in the name should be used, even if it is the first word. If the name consists of only one word, that word should be used even if it is a single character. It fails in most cases too.
This is the result of the regex:
I'm new to regex and stuck. Please help.
The parsing rules are rather elaborate, especially for nicknames, so
I suggest using a multi-step process involving 2 helper columns and
the non-advanced (POSIX ERE plus \b) features of the available ICU
regular expressions combined with a few spreadsheet formulas.
If you copy the formulas listed and commented below to columns B
through H in your spreadsheet you should find that
conditional formatting highlights a total of 5 cells, all nicknames:
1 for Jasmine, 2 for Imah, and 2 for Nur.. For the first 3
the expected result contradicts the nickname rules; removing
the dot in Nur. is left as an exercise.
I used the input as it is but in cases like this pre-processing is
the key to avoid complex or special-case regexes, e.g. insert a blank
in names like M.Ali, or prefix a comma to comma-less Ph.D..
Aside: When you post source data as an image rather than plaintext
what's the likelihood of someone OCR'ing the image and editing the
text?
Step 1
Add auxiliary column AuxSplit separating the input string into its
3 main groups -- name, titles, nickname (last 2 optional) -- and
joining them with #. NB: # is used for this reply but it
must be a non-meta character not appearing in any input string.
REGEX(A22;"(.*?)( *(Ph\.D\.)?|(,[^\\/(]*?))?( *[\\/(].*)? *$";"$1#$2#$5")
where
$ anchors regex to end of string
$5 is the captured optional nickname (incl. any delimiters and blanks)
$2 the optional suffixed titles (incl. any comma and blanks);
note that the comma-less Ph.D is special-cased
$1 the name (incl. any leading blanks)
Sample content: Jasmine Fianna#, B.A., M.A.#/Jasmine
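To experiment with the Step 1 pattern outside the spreadsheet, here is a rough Python sketch. It assumes Python's `re` module behaves like the ICU engine for this particular pattern (only syntax common to both is used), and the input string is reconstructed from the sample content above rather than taken from the original sheet. The name `aux_split` is made up.

```python
import re

# Pattern copied from the Step 1 formula.
SPLIT = re.compile(r"(.*?)( *(Ph\.D\.)?|(,[^\\/(]*?))?( *[\\/(].*)? *$")

def aux_split(name: str) -> str:
    m = SPLIT.match(name)  # assumes a single-line input string
    # Groups that took no part in the match are None in Python,
    # so they are replaced by empty strings before joining.
    return "#".join(m.group(i) or "" for i in (1, 2, 5))

print(aux_split("Jasmine Fianna, B.A., M.A./Jasmine"))
# Jasmine Fianna#, B.A., M.A.#/Jasmine
```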
Step 2
Add auxiliary column AuxNick extracting an explicit nickname (if any)
from 3rd main group (after last #) in AuxSplit.
REGEX(G22;".*#[\\/( ]*([^ )]*)[ )]*$";"$1")
Capturing string after \ or / or between (), trimming off blanks.
Sample content: Agus
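The Step 2 extraction can be tried the same way. This Python sketch again assumes `re` matches like ICU for this pattern; `aux_nick` is a hypothetical name, and the fallback mirrors the REGEX rule that an unmatched replacement returns the text unchanged.

```python
import re

# Pattern copied from the Step 2 formula.
NICK = re.compile(r".*#[\\/( ]*([^ )]*)[ )]*$")

def aux_nick(aux_split_value: str) -> str:
    m = NICK.match(aux_split_value)
    # No match: return the input unchanged, as REGEX with a replacement does.
    return m.group(1) if m else aux_split_value

print(aux_nick("Jasmine Fianna#, B.A., M.A.#/Jasmine"))  # Jasmine
```

When the third group is empty, as in `Agus##`, the result is an empty string, which is what the `LEN(H22)` checks in Step 5 rely on.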
Step 3
Extract ExpectedName from 1st group (up to first #) in AuxSplit
stripping off any prefixed titles.
REGEX(G22;"^((Prof|Dr)\.[ ]*)?([^#]*).*";"$3")
Note that these titles are special-cased as they have the same form
as an abbreviated name.
Sample content: Q. Ranita El
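A Python sketch of Step 3, under the same assumption that `re` and ICU agree on this pattern. The input `Dr. Q. Ranita El##` is a made-up AuxSplit value chosen to reproduce the sample content, and `expected_name` is an illustrative name.

```python
import re

# Pattern copied from the Step 3 formula.
PREFIX = re.compile(r"^((Prof|Dr)\.[ ]*)?([^#]*).*")

def expected_name(aux_split_value: str) -> str:
    # Group 3 is the text after an optional Prof./Dr. prefix, up to the first "#".
    return PREFIX.match(aux_split_value).group(3)

print(expected_name("Dr. Q. Ranita El##"))  # Q. Ranita El
```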
Step 4
From ExpectedName extract Firstname,
REGEX(B22;"^(Ra?\.?|[^.]+\.|[^ ]+)")
stripping off R/R./Ra/Ra. and blanks from start of string,
and Lastname
REGEX(B22;"[^ .]*[^ ]$")
extracting the last character sequence after blank/dot;
any trailing blanks were removed in step 1.
Note: with M.Ali as M. Ali simplify Lastname formula to
REGEX(B22;"[^ ]+$")
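The two Step 4 patterns can be checked against a few names from the question with a Python sketch like the following. It assumes Python's `re` gives the same matches as ICU here; the function names are hypothetical.

```python
import re

FIRST = re.compile(r"^(Ra?\.?|[^.]+\.|[^ ]+)")  # from the Firstname formula
LAST = re.compile(r"[^ .]*[^ ]$")               # from the Lastname formula

def first_name(name: str):
    m = FIRST.search(name)
    return m.group(0) if m else None

def last_name(name: str):
    m = LAST.search(name)
    return m.group(0) if m else None

for n in ("Moh.Ali", "Q. Ranita El", "Ra. Ayu S."):
    print(n, "->", first_name(n), "|", last_name(n))
```

One thing worth checking against the real data: the alternative `[^.]+\.` may span blanks, so a name whose first dot comes late (for example `Ayu S.`) would match in full.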
Step 5
Extract Lastnick from AuxNick, LastName, or ExpectedName
IF(LEN(H22);H22;IF(ISNA(REGEX(D22;"\b([^ ]+\.|[^ .])$"));D22;REGEX(B22;".*?([^ ]+)[ ]+([^ ]*\.|[^.])$";"$1")))
IF(LEN(H22);H22 : use explicit nickname if present
;IF(ISNA(REGEX(D22;"…";D22 : else use Lastname unless abbrev. or 1-char.
;REGEX(B22;"…";"$1"))) : else use previous word in name, exploiting the
fact that an unmatched replacement returns the entire string
and Firstnick from AuxNick, FirstName, or ExpectedName
IF(LEN(H22);H22;IF(ISNA(REGEX(B22;"^([^ ]+\.|[^ ]\b|I (Gede|Gusti|Made)\b|Ni (Luh )?Putu\b)"));C22;REGEX(B22;"^(I (Gede|Gusti|Made)\b|Ni (Luh )?Putu\b|.*?)[ .]\b([^ ]+).*";"$4")))
IF(LEN(H22);H22 : use explicit nickname if present
;IF(ISNA(REGEX(B22;"…";C22 : else use FirstName unless abbrev. / 1-char. / special case
;REGEX(B22;"…";"$4"))) : else use next word in name, exploiting the
fact that an unmatched replacement returns the entire string
TODO: remove trailing dot from Nur.
Recall the (LibreOffice 6.2+) REGEX syntax:
Syntax: REGEX( Text ; Expression [ ; [ Replacement ] [ ; Flags|Occurrence ] ] )
Expression: A text representing the regular expression, using
ICU (https://unicode-org.github.io/icu/userguide/strings/regexp.html)
regular expressions. If there is no match and Replacement is not
given, #N/A is returned.
Replacement: Optional. The replacement text and references to capture
groups. If there is no match, Text is returned unmodified.

How to stop Regex Search look ahead if keyword group is found (CLOSED)

I have the following strings, on which I need to run a regex search that extracts only account IDs and avoids extracting transaction-related IDs -
Transaction ID 989898989
Trx no. 989898989
Account ID 1234567890
Account Number 1234567890
Acnt No. 1234567890
Account # 1234567890
ID 1234567890
I have created a regex that extracts the account ID from text like this by taking the 3rd group of the match.
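import re

# Replace with each of the sample strings above in turn
txt = "Account ID 1234567890"
# \s (not /s) matches whitespace; the dot in "No." is escaped
re1 = r"(No\.|#|Number|ID)(\s)(\d{10,12})"
rg = re.compile(re1, re.IGNORECASE | re.DOTALL)
m = rg.search(txt)
if m:
    print(m.group(3))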
If I run this code, the number is extracted from every string. But I want the search to stop if the word "transaction" or "trx" is found in the string. I tried using a negative lookahead but was unable to find a solution.
The solution I am expecting: the code above should print the number for every string except those that contain the word "transaction" or "trx".
In other words, I want a regex that stops looking for the remaining groups once "transaction" is found.
Something like this -
(?!transaction)(\s)(No.|#|Number|ID)(\s)(\d{10,12})
Please help!
Solution - using a conditional expression in the regex
(transaction|trx)?(?(1)|\d{3,12})
Explanation -
(transaction|trx)? => 1st group, made optional so that the search can go on when neither word is present
(?(1)|\d{3,12}) => conditional, where ?(1) checks whether the first group matched: if it did, the pattern before the '|' pipe (empty here) is used, otherwise the pattern after the '|'
After that just run => m.group()
and it will return either the number or the word.
In the business logic, try to cast the value to INT: if the cast succeeds, the account ID was extracted correctly; if not, the extracted value is the keyword and the string should be skipped.
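A minimal sketch of that approach in Python, assuming group 1 is optional (written `(transaction|trx)?`) so that the conditional has a "not matched" branch to fall into. The function name `extract_account_id` is made up, and the digit check stands in for the "cast to INT" step.

```python
import re

# If group 1 (the keyword) matched, the conditional adds nothing;
# otherwise it requires a run of 3 to 12 digits.
PATTERN = re.compile(r"(transaction|trx)?(?(1)|\d{3,12})", re.IGNORECASE)

def extract_account_id(text):
    m = PATTERN.search(text)
    if m and m.group().isdigit():
        return m.group()
    return None  # keyword came first, or there was no number

samples = [
    "Transaction ID 989898989", "Trx no. 989898989",
    "Account ID 1234567890", "Account Number 1234567890",
    "Acnt No. 1234567890", "Account # 1234567890", "ID 1234567890",
]
for s in samples:
    print(s, "->", extract_account_id(s))
```

This relies on the keyword appearing before the number, as in the sample strings; a string with the number first would still yield the number.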

regex to match some string

I am working on a project that needs to match certain strings in its output.
Here is a sample:
user code timestamp Action Name S#TPLC Field Name User code group profile
SNGLASK 2012-05-30-20.33.53.003000 Insert User I TEST5 DISPLAY
SNGLASK 2012-05-23-22.06.44.422000 Change Password RSO part U LERAPR SNGCHIS FULL_AUTH
SNGLASK 2012-05-30-20.34.39.066000 Insert User Group Profil I *NONE
Basically, I have an application that needs to understand that, in each row, whatever follows whitespace belongs to the next column.
After the action name, everything can be treated as "others".
Hence I have come up with the regex format below:
REGEX = ^([^\s]+)\s+([^\s]+)\s+([^\s]+)\s(.*)$
FORMAT = userCode::"$1" TimeStamp::"$2" ActionName::"$3" Others::"$4"
The strategy is to recognize each string and then skip the whitespace after it. However, this only works up to the action name, because the action name itself may contain spaces.
So my problem is: how can I use a regex to recognize the whole action name, so that I get "Insert User" as one value and "Change Password RSO part" as another?
Match multi-word values like this:
((\S+\s)+)
which says one or more words, each followed by one whitespace character.
So the regex would be:
^((\S+\s)+)\s+(\S+)\s+((\S+\s)+)\s+(.*)$
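For comparison, here is a different way to handle the multi-word action name, sketched in Python. It rests on an assumption the question does not confirm: that columns are separated by two or more spaces while the words inside an action name are separated by exactly one. The sample row below is typed with that spacing, and `parse_row` and its field names are illustrative.

```python
import re

# Assumption: 2+ spaces separate columns, single spaces separate words
# inside the action name.
ROW = re.compile(r"^(\S+)\s+(\S+)\s+(\S+(?: \S+)*)\s{2,}(.*)$")

def parse_row(line):
    m = ROW.match(line)
    if not m:
        return None
    keys = ("userCode", "TimeStamp", "ActionName", "Others")
    return dict(zip(keys, m.groups()))

row = "SNGLASK  2012-05-30-20.33.53.003000  Insert User  I  TEST5  DISPLAY"
print(parse_row(row)["ActionName"])  # Insert User
```

If the real output uses fixed-width columns instead, slicing by column position would be more reliable than any regex.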

REGEXP_EXTRACT () every word except ‘,’ in a field

I’d like to select every country, without the ‘,’, from a data field that looks like this
Japan,Singapore,Italy,France
My code looks like this: REGEXP_EXTRACT(country,'([^,]*)'). It works, but only the first country is selected. How can I change it to select all of them?
I slightly changed the regex to ([^,]+) so that a country name must contain at least one character. Using * also produces empty matches, so that only every other match contains a country name.
Important is the /g flag at the end, which makes the regex match globally.
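The difference between `*` and `+` can be seen in a short Python sketch, where `re.findall` plays the role of a global match. This illustrates the regex behaviour only; it is not Data Studio syntax, and whether REGEXP_EXTRACT can return more than one match is not shown here.

```python
import re

countries = "Japan,Singapore,Italy,France"

with_star = re.findall(r"[^,]*", countries)  # also yields empty strings
with_plus = re.findall(r"[^,]+", countries)  # only non-empty runs

print(with_star)
print(with_plus)  # ['Japan', 'Singapore', 'Italy', 'France']
```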
If you are looking to extract all the characters except , then it can be achieved using either of the REGEXP_REPLACE Calculated Fields below:
1) Replace , with (space)
REGEXP_REPLACE(country, ",", " ")
2) Remove ,
REGEXP_REPLACE(country, ",", "")