Regular expression to consider special characters in a string - regex

I have to split data into tokens on spaces, but I must not split it on special characters. The regex I have right now is
(\w*[-*#+=;:\/,~_ ]*\w+)
With this, when I process the string
1-CHECK ON BLOCKS BELOW IF MARKET CORRECTION ARE LOADED: PCORP:BLOCK=ANCTRLG&V5PTCLG; AF55722 BRTBMWA-3289 (AF55722) in block ANCTRLG (Product ID: CAAZ 107 4493 R1A10 ) AF55736 BRTBMWA-3290 (AF55726)in block V5PTCLG (Product ID: CAAZ 107 4260 R2A08 ) IF MARKET CORRECTIONS ARE LOADED THEN V5 INTERFACE PROPERTY MUST BE DEFINED AS FOLLOW : MUXFIM : ACC-OFF (Accelerate Alligment is not active) WLL : ACC-ON (Accelerate Alligment is active ) : EXAPC:V5ID=v5id,PROP=ACC-OFF;
it tokenizes the string on spaces, but it also splits it on special characters. For example,
: EXAPC:V5ID=v5id is tokenized to : EXAPC, :V5ID and =v5id, whereas I want it to split as : and EXAPC:V5ID=v5id
I want to avoid this. Any ideas or help will be appreciated.

Your regex matches "an optional word, then an optional run of special characters, then another word". When a match contains two words, there is no way for a special character to appear before the first word, and every match has to end in a word.
What you're probably looking for is ([-*#+=;:\/,~_ \w]+).
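A quick way to compare the two patterns is to run them in Python. This is only a sketch: the question does not say which regex flavor is used, so Python's `re` module is an assumption here. Note that the character class suggested above contains a space, so it would also match across spaces; the sketch drops the space from the class so that tokens still break on whitespace.

```python
import re

# Fragment of the sample string from the question.
text = ": EXAPC:V5ID=v5id"

# Pattern from the question: optional word, special characters, word.
original = re.compile(r"\w*[-*#+=;:\/,~_ ]*\w+")

# One character class for word and special characters, without the space,
# so that whitespace still separates tokens (an adjustment, not the
# answer's exact pattern).
single_class = re.compile(r"[-*#+=;:\/,~_\w]+")

original_tokens = original.findall(text)
single_class_tokens = single_class.findall(text)

print(original_tokens)      # [': EXAPC', ':V5ID', '=v5id']
print(single_class_tokens)  # [':', 'EXAPC:V5ID=v5id']
```

If the only real requirement is "split on whitespace", `text.split()` gives the same tokens for this fragment without any regex.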

Related

Special characters in EBS Search Strings?

I am working on the EBS configuration side of the SAP ERP system where I am trying to define Search Strings for the MT940 format (as per SAP SPRO activity "Define Search String for Electronic Bank Statement", for instance see this blog post).
I am trying to create a search pattern that is able to identify special characters in the MT940 format, for example ?/!/>, etc.
My search pattern: \C*######\C*
The text that I use to identify the mapping:
:86:306?00CCY RECD?20/BI/**?651234?**/BO/DE652004ED
In this case, I defined:
\C* to look for special characters; these will be skipped based on the mapping.
# (repeated six times) to look for a sequence of 6 digits.
My results from the test:
1 651234
2 652004
3 651234
4 652004
The result I am looking for, based on the search pattern defined: 651234
I understand that the repetition is caused by the * symbol. However, if I leave that symbol out, the search pattern ends in an error.
My problem is that I cannot work out how to make SAP search strings identify special characters. Furthermore, how can I identify whether a character is a letter?
Below is the Search String definition from the SAP documentation of SPRO activity "Define Search String for Electronic Bank Statement":
String for searches in text. A search string consists of normal characters (that is, letters and digits) and other characters:
| Or
( ) Grouping
+ Repeats the previous character once or several times
* "Zero" or repeats the previous character several times
? Any individual character you want
# Any of the digits 0 to 9
^ Start of a line
$ End of a line
\ Escape symbol
Examples:
The search string "ab" fits each position in a character string in which the letter "b" follows the letter "a".
The search string "(A+|B)+C" fits "AC", "BC", "AAAAAC" or "ABAAC".
"(A+|B+)C" fits "AC", "BC" and "AAAAAC", but not "ABAAC".
"\*C" fits "*C"; the effect of the escape symbol is that "*" is not interpreted as a special character.
This is the first time I raise a question, therefore, I want to apologize if the format is not correct or the text was too long.
Many thanks for your time and help!
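The operators in the quoted documentation mostly coincide with common regex syntax, except that ? stands for any single character and # for a digit. As a rough sanity check outside SAP, the documented examples can be replayed in Python. The helper names `sap_to_python` and `fits` are hypothetical, the translation covers only the operators listed above, and nothing here confirms how SAP itself evaluates a search string or what \C means there.

```python
import re

def sap_to_python(search_string):
    """Translate the documented operators to Python regex syntax (hypothetical helper)."""
    out = []
    i = 0
    while i < len(search_string):
        ch = search_string[i]
        if ch == "\\" and i + 1 < len(search_string):
            # Escape symbol: the next character is taken literally.
            out.append(re.escape(search_string[i + 1]))
            i += 2
            continue
        if ch == "#":
            out.append("[0-9]")        # any of the digits 0 to 9
        elif ch == "?":
            out.append(".")            # any individual character
        elif ch in "|()+*^$":
            out.append(ch)             # same meaning in Python
        else:
            out.append(re.escape(ch))  # normal character
        i += 1
    return "".join(out)

def fits(search_string, text):
    """True if the whole text matches the translated search string."""
    return re.fullmatch(sap_to_python(search_string), text) is not None

# Examples from the quoted documentation.
print(fits("(A+|B)+C", "ABAAC"))   # True
print(fits("(A+|B+)C", "ABAAC"))   # False
print(fits("\\*C", "*C"))          # True

# Six digits on their own occur twice in the sample text from the question.
sample = ":86:306?00CCY RECD?20/BI/**?651234?**/BO/DE652004ED"
print(re.findall(sap_to_python("######"), sample))      # ['651234', '652004']
# In this emulation, escaped question marks narrow it down to one hit.
print(re.findall(sap_to_python("\\?######\\?"), sample))  # ['?651234?']
```

Whether an escaped delimiter such as \? behaves the same way inside SAP would have to be verified in the test function of the SPRO activity.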

How to extract real name, first name, lastname, and create nickname from name that have acronyms, aristocratic title, academic titles/degrees

I am trying to extract the name, first name, and last name, and to create nickname/firstname and nickname/lastname values, from a list of names that contain acronyms, aristocratic titles, academic titles/degrees, and nicknames inside parentheses or after a slash or backslash, in LibreOffice Calc using the REGEX function. This is the expected result
The rules are:
Expected Name: remove the nickname inside parentheses or after a slash or backslash if it exists, as well as job titles, academic titles and degrees, but keep (R) or (R.), (Ra) or (Ra.), a single letter with or without a dot, e.g. (A.), (A), (M.), and name acronyms with or without a dot, e.g. (Muh.), (Moh), etc.
Firstname: (from Expected Name) extract the first name, including the dot in an acronym, e.g. (R.), (Ra.), (Muh.), and a single letter with or without a dot, e.g. (A.), (A), (M.)
Lastname: (from Expected Name) extract the last name, including any dot
Nickname-fname: from Input Name, extract the nickname inside parentheses or after a slash or backslash if it exists. If it does not exist, extract the first name, provided it is not: one character, an acronym with or without a dot, "I Gede", "I Gusti", "I Made", "Ni Luh Putu", or "Ni Putu". Otherwise, use the next word in the name, even if it is the last word. If the name consists of only one word, extract it even if it is only one character
Nickname-lname: from Input Name, extract the nickname inside parentheses or after a slash or backslash if it exists. If it does not exist, extract the last name, provided it is not: one character or an acronym with or without a dot. Otherwise, use the previous word in the name, even if it is the first word. If the name consists of only one word, extract it even if it is only one character
I tried the following:
=REGEX($A2,"\(.*?\)|\([^)]*\)|\\[^\\]*$|\/[^\/]*$|,.*","")
to extract the real name. It removes a nickname inside parentheses or after a slash or backslash, with or without spaces, like
( Nita ) in Yunita ( Nita )
( Nita ) in Yunita( Nita )
(Nita) in Yunita (Nita)
(Nita) in Yunita(Nita)
( Nita) in Yunita ( Nita)
( Nita) in Yunita( Nita)
(Nita ) in Yunita (Nita )
(Nita ) in Yunita(Nita )
\Nita in Yunita\Nita
\ Nita in Yunita\ Nita
\Nita in Yunita \Nita
\ Nita in Yunita \ Nita
...
and so on
It also removes academic degrees like Ph.D., but it fails on Ra. Ayu S. Ph.D. (because there is no comma after S.). It also fails on Prof. and Dr., and I want to keep Ra.
=REGEX($A23,"(?:^|(?:[.!?]\s))([\w.]+)")
to extract the first name, but it fails on Moh.Ali (yes, without a space after the dot).
=REGEX($A23,"\b(?<last>[\w\[.\]\\]+)$")
to extract the last name, but it also fails on Moh.Ali.
=REGEX($A2,"(\(.*?\)|\([^)]*\)|\\[^\\]*$|\/[^\/]*$)|(\b(?<first>\w+)$)")
to create a nickname from the given nickname inside parentheses or after a slash or backslash. If no nickname exists, create one from the first name, provided it is not: one character, an acronym with or without a dot, "I Gede", "I Gusti", "I Made", "Ni Luh Putu", or "Ni Putu". If the first name does not meet this condition, use the next word in the name, even if it is the last word. If the name consists of only one word, use it even if it is only one character. This regex fails in most cases.
=REGEX($A2,"(\(.*?\)|\([^)]*\)|\\[^\\]*$|\/[^\/]*$)|(\b(?<last>\w+)$)")
to create a nickname from the given nickname inside parentheses or after a slash or backslash. If no nickname exists, create one from the last name, provided it is not: one character or an acronym with or without a dot. If the last name does not meet this condition, use the previous word in the name, even if it is the first word. If the name consists of only one word, use it even if it is only one character. It fails in most cases too.
This is the result of the regex:
But I'm new to regex and stuck. Please help me
The parsing rules are rather elaborate, especially for nicknames, so
I suggest using a multi-step process involving 2 helper columns and
the non-advanced (POSIX ERE plus \b) features of the available ICU
regular expressions combined with a few spreadsheet formulas.
If you copy the formulas listed and commented below to columns B
through H in your spreadsheet you should find that
conditional formatting highlights a total of 5 cells, all nicknames:
1 for Jasmine, 2 for Imah, and 2 for Nur.. For the first 3
the expected result contradicts the nickname rules; removing
the dot in Nur. is left as an exercise.
I used the input as it is but in cases like this pre-processing is
the key to avoid complex or special-case regexes, e.g. insert a blank
in names like M.Ali, or prefix a comma to comma-less Ph.D..
Aside: When you post source data as an image rather than plaintext
what's the likelihood of someone OCR'ing the image and editing the
text?
Step 1
Add auxiliary column AuxSplit separating the input string into its
3 main groups -- name, titles, nickname (last 2 optional) -- and
joining them with #. NB: # is used for this reply but it
must be a non-meta character not appearing in any input string.
REGEX(A22;"(.*?)( *(Ph\.D\.)?|(,[^\\/(]*?))?( *[\\/(].*)? *$";"$1#$2#$5")
where
$ anchors regex to end of string
$5 is the captured optional nickname (incl. any delimiters and blanks)
$2 the optional suffixed titles (incl. any comma and blanks);
note that the comma-less Ph.D is special-cased
$1 the name (incl. any leading blanks)
Sample content: Jasmine Fianna#, B.A., M.A.#/Jasmine
Step 2
Add auxiliary column AuxNick extracting an explicit nickname (if any)
from 3rd main group (after last #) in AuxSplit.
REGEX(G22;".*#[\\/( ]*([^ )]*)[ )]*$";"$1")
Capturing string after \ or / or between (), trimming off blanks.
Sample content: Agus
Step 3
Extract ExpectedName from 1st group (up to first #) in AuxSplit
stripping off any prefixed titles.
REGEX(G22;"^((Prof|Dr)\.[ ]*)?([^#]*).*";"$3")
Note that these titles are special-cased as they have the same form
as an abbreviated name.
Sample content: Q. Ranita El
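To see what Steps 1 to 3 produce without opening a spreadsheet, the three expressions can be replayed in Python. This is a sketch under assumptions: Python's `re` module stands in for the ICU engine used by LibreOffice (the constructs used here are basic enough that they should agree, but that is not verified against Calc), `count=1` mimics REGEX replacing only the first match, and the function name `split_name` and the input strings built from names in the thread are illustrative.

```python
import re

# Expressions copied from Steps 1-3; "#" is the separator chosen in the answer.
SPLIT = re.compile(r"(.*?)( *(Ph\.D\.)?|(,[^\\/(]*?))?( *[\\/(].*)? *$")
NICK = re.compile(r".*#[\\/( ]*([^ )]*)[ )]*$")
NAME = re.compile(r"^((Prof|Dr)\.[ ]*)?([^#]*).*")

def split_name(raw):
    """Return (AuxSplit, AuxNick, ExpectedName) for one input name."""
    aux_split = SPLIT.sub(r"\1#\2#\5", raw, count=1)   # Step 1
    aux_nick = NICK.sub(r"\1", aux_split, count=1)     # Step 2
    expected = NAME.sub(r"\3", aux_split, count=1)     # Step 3
    return aux_split, aux_nick, expected

for raw in ["Jasmine Fianna, B.A., M.A./Jasmine",
            "Ra. Ayu S. Ph.D.",
            "Yunita (Nita)"]:
    print(split_name(raw))
```

For the first input this yields the AuxSplit value quoted in Step 1, with Jasmine as the explicit nickname and Jasmine Fianna as the expected name; for the comma-less Ph.D. case the nickname is empty and Ra. is kept.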
Step 4
From ExpectedName extract Firstname,
REGEX(B22;"^(Ra?\.?|[^.]+\.|[^ ]+)")
stripping off R/R./Ra/Ra. and blanks from start of string,
and Lastname
REGEX(B22;"[^ .]*[^ ]$")
extracting the last character sequence after blank/dot;
any trailing blanks were removed in step 1.
Note: with M.Ali as M. Ali simplify Lastname formula to
REGEX(B22;"[^ ]+$")
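The two Step 4 expressions can be tried the same way. Again this is only a Python sketch standing in for LibreOffice's ICU engine; `first_last` is a hypothetical helper, and the inputs are names that appear elsewhere in this thread.

```python
import re

# Expressions copied from Step 4.
FIRST = re.compile(r"^(Ra?\.?|[^.]+\.|[^ ]+)")
LAST = re.compile(r"[^ .]*[^ ]$")

def first_last(expected_name):
    """Return (Firstname, Lastname); None where REGEX would return #N/A."""
    first = FIRST.search(expected_name)
    last = LAST.search(expected_name)
    return (first.group(0) if first else None,
            last.group(0) if last else None)

print(first_last("Ra. Ayu S."))    # ('Ra.', 'S.')
print(first_last("Moh.Ali"))       # ('Moh.', 'Ali')
print(first_last("Q. Ranita El"))  # ('Q.', 'El')
```

Because the alternatives are tried from left to right, names that merely start with "R" or "Ra", or that contain a dot later in the string, are worth testing separately before relying on the Firstname expression.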
Step 5
Extract Lastnick from AuxNick, LastName, or ExpectedName
IF(LEN(H22);H22;IF(ISNA(REGEX(D22;"\b([^ ]+\.|[^ .])$"));D22;REGEX(B22;".*?([^ ]+)[ ]+([^ ]*\.|[^.])$";"$1")))
IF(LEN(H22);H22 : use explicit nickname if present
;IF(ISNA(REGEX(D22;"…";D22 : else use Lastname unless abbrev. or 1-char.
;REGEX(B22;"…";"$1"))) : else use previous word in name, exploiting the
fact that an unmatched replacement returns the entire string
and Firstnick from AuxNick, FirstName, or ExpectedName
IF(LEN(H22);H22;IF(ISNA(REGEX(B22;"^([^ ]+\.|[^ ]\b|I (Gede|Gusti|Made)\b|Ni (Luh )?Putu\b)"));C22;REGEX(B22;"^(I (Gede|Gusti|Made)\b|Ni (Luh )?Putu\b|.*?)[ .]\b([^ ]+).*";"$4")))
IF(LEN(H22);H22 : use explicit nickname if present
;IF(ISNA(REGEX(B22;"…";C22 : else use FirstName unless abbrev. / 1-char. / special case
;REGEX(B22;"…";"$4"))) : else use next word in name, exploiting the
fact that an unmatched replacement returns the entire string
TODO: remove trailing dot from Nur.
Recall the (LibreOffice 6.2+) REGEX syntax:
Syntax: REGEX( Text ; Expression [ ; [ Replacement ] [ ; Flags|Occurrence ] ] )
Expression: A text representing the regular expression, using
[ICU](https://unicode-org.github.io/icu/userguide/strings/regexp.html)
regular expressions. If there is no match and Replacement is not
given, #N/A is returned.
Replacement: Optional. The replacement text and references to capture
groups. If there is no match, Text is returned unmodified.

simple regex to matching multiple word with spaces/multiple space or no spaces and special characters

I have a string that is delimited by a comma.
The first 3 fields are static.
Fields 4-20 are dynamic and can contain any string even if it has special characters but cannot be empty.
Field 21 is static
Field 22 is dynamic and can contain any string even if it has special characters.
Fields 23,24 are static.
I need to make sure the string matches the above criteria, but I am wondering how to allow fields 4-20 to contain special characters while not being blank (17 fields in total between 4 and 20).
If I remove the requirement of the special characters this seems to work:
Field1\,Field2\,Field3\,+([\w\s\,]+)F21/C\,[\w\s\,]+(F/23\,)(Field24)
with this string
Field1,Field2,Field3,F4,f5,6f 1,f72,f8,F9,F10,F1,f12,f13,f14,f15,f16,f17,f18,f19,f20,F21/C,F22,F/23,Field24
Is there a way to accomplish this with fields 4-20 having special characters and not being empty like "" or " " or am I pushing it too far?
I know I can parse it in C#, but I'm experimenting with regex and it seems pretty powerful.
Thanks
I did not fully understand the problem, but I think this is what you want:
s1,s2,s3,([^ ,]+,){17}s21,[^ ,]+,s23,s24
Replace each sX with the relevant static field.
example:
https://regex101.com/r/EaAPKH/1
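A small Python check of this idea against the sample string from the question (a sketch; the static field values are taken from that sample). One thing it brings out: `[^ ,]+` rejects any field that contains a space, and the sample has the field "6f 1". The `relaxed` pattern below is a variation, not part of the answer above: it allows spaces inside a field but still requires at least one non-blank character.

```python
import re

sample = ("Field1,Field2,Field3,F4,f5,6f 1,f72,f8,F9,F10,F1,f12,f13,f14,"
          "f15,f16,f17,f18,f19,f20,F21/C,F22,F/23,Field24")

# The answer's pattern with the static fields filled in.
strict = re.compile(
    r"Field1,Field2,Field3,(?:[^ ,]+,){17}F21/C,[^ ,]+,F/23,Field24")

# Variation: a field is any run without commas that contains at least
# one character that is neither whitespace nor a comma.
FIELD = r"[^,]*[^\s,][^,]*"
relaxed = re.compile(
    r"Field1,Field2,Field3,(?:" + FIELD + r",){17}F21/C," + FIELD + r",F/23,Field24")

def is_valid(pattern, line):
    """True if the whole line matches the pattern."""
    return pattern.fullmatch(line) is not None

print(is_valid(strict, sample))                           # False, "6f 1" has a space
print(is_valid(relaxed, sample))                          # True
print(is_valid(relaxed, sample.replace(",f5,", ", ,")))   # False, blank field
print(is_valid(relaxed, sample.replace(",f5,", ",f#5&!,")))  # True, special characters
```

Neither pattern can cope with a dynamic field that itself contains a comma; that case would need quoting rules and a real CSV parser.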

How to match the following?

The data I want to parse has columns with the following format:
Character Big Medium Meaning ImageCode Small Constitutens Lesson Frame Strokes JH JTPL Heisig Story koohiiStory1 koohiiStory2 On-Reading Kun-Reading Examples:
All of those are separated by tabs \t (even though it may not look like it in the browser). Also notice that there is a colon : at the end of each line. The problem is that the columns koohiiStory2 and Examples may or may not exist, and there may also be cases in which the data is corrupt and there is a tab inside Heisig Story, but those are the minority.
What I'm trying to match is the values for On-Reading, Kun-Reading and Examples. All of these are distinct from the rest because they don't use standard English characters (romaji); they use Japanese characters instead, with the exception of perhaps a few commas or dots. It is also guaranteed that either Kun-Reading or Examples will end with a colon :, that On-Reading and Kun-Reading will exist, and that all three columns will be consecutive.
Here is some sample data.
How can I parse that to return this?
Alright, I'll give it a shot.
Since the content you expect is mostly non-ASCII characters between a dot followed by a space or tab and a colon, this should work:
(?<=\.(\s|\t)) // Positive lookbehind for a 'dot' + 'space or tab'
[^\w]+ // Any non words
(?=\:) // Positive lookahead for a ':'
Working sample on regex101
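If the same expression is used from Python 3, one caveat applies: there \w matches Japanese letters by default, so [^\w]+ only behaves as intended with the re.ASCII flag. The sketch below uses a made-up input line (the question's sample data is not reproduced here) with the story ending in a dot, followed by tab-separated readings and a trailing colon.

```python
import re

# Made-up line for illustration: story text, then On-Reading, Kun-Reading
# and Examples separated by tabs, ending with a colon.
line = "A tall building.\tコウ\tたか.い\t高い:"

pattern = r"(?<=\.(\s|\t))[^\w]+(?=\:)"

# With re.ASCII, \w means [a-zA-Z0-9_], so kana and kanji count as "non-word".
match = re.search(pattern, line, flags=re.ASCII)
fields = match.group(0).split("\t") if match else []
print(fields)  # ['コウ', 'たか.い', '高い']

# Without the flag, Python treats kana and kanji as word characters,
# so the same pattern finds nothing in this line.
unicode_match = re.search(pattern, line)
print(unicode_match)  # None
```

Rows where the story does not end in a dot, or where a stray tab sits inside the story, would still need separate handling.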

Select until next dot followed by \s?

I could use some help writing a regex. I have the following text:
DEFINE BROWSE BW_SC20SDAN
&ANALYZE-SUSPEND _UIB-CODE-BLOCK _DISPLAY-FIELDS BW_SC20SDAN C-Win _FREEFORM
QUERY BW_SC20SDAN NO-LOCK DISPLAY
ZTYACC.prime COLUMN-LABEL "" FORMAT "X(35)"
ZUNACT.sec COLUMN-LABEL " " FORMAT "X(30)"
INFDON.sep COLUMN-LABEL "" FORMAT "99/99/9999"
IF INFDON.top THEN "S" ELSE (IF INFDON.REPORT THEN "R" ELSE (IF INFDON.prime <> "" THEN INFDON.prime ELSE "")) COLUMN-LABEL "R" FORMAT "X(1)"
/* _UIB-CODE-BLOCK-END */
&ANALYZE-RESUME
WITH SEPARATORS SIZE 83.57 BY 5.08
BGCOLOR 15 FGCOLOR 1 FONT 6 FIT-LAST-COLUMN.
I have to find this whole block in a text file, so far I have this regex:
(?:DEFINE|DEF)\s([\w\s]*)BROWSE\s+([\w-]+)\s+([^.]*)\.
My problem is that it selects only this:
DEFINE BROWSE BW_SC20SDAN
&ANALYZE-SUSPEND _UIB-CODE-BLOCK _DISPLAY-FIELDS BW_SC20SDAN C-Win _FREEFORM
QUERY BW_SC20SDAN NO-LOCK DISPLAY
ZTYACC.
whereas I want to select up to the final period. Basically, the rule I want to apply is "until next dot followed by \s".
But I can't figure out how to write this regex.
Allow "non-dot" [^.] OR "dots not followed by space" \.(?!\s):
DEF(INE)?\s([\w\s]*)BROWSE\s+([\w-]+)\s+(([^.]|\.(?!\s))*)\.
Note also the simplification of the leading term.
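The difference between the two patterns can be checked with a short Python script. This is a sketch: the block is abbreviated from the question, and the trailing DEFINE VARIABLE line is a made-up follow-up statement added to show that the match stops at the right period.

```python
import re

source = (
    "DEFINE BROWSE BW_SC20SDAN\n"
    "&ANALYZE-SUSPEND _UIB-CODE-BLOCK _DISPLAY-FIELDS BW_SC20SDAN C-Win _FREEFORM\n"
    "  QUERY BW_SC20SDAN NO-LOCK DISPLAY\n"
    '      ZTYACC.prime COLUMN-LABEL "" FORMAT "X(35)"\n'
    '      ZUNACT.sec COLUMN-LABEL " " FORMAT "X(30)"\n'
    "/* _UIB-CODE-BLOCK-END */\n"
    "&ANALYZE-RESUME\n"
    "    WITH SEPARATORS SIZE 83.57 BY 5.08\n"
    "         BGCOLOR 15 FGCOLOR 1 FONT 6 FIT-LAST-COLUMN.\n"
    "DEFINE VARIABLE i AS INTEGER NO-UNDO.\n"   # hypothetical next statement
)

# Pattern from the question: stops at the very first dot.
original = re.compile(r"(?:DEFINE|DEF)\s([\w\s]*)BROWSE\s+([\w-]+)\s+([^.]*)\.")

# Pattern from the answer: a dot is allowed inside the body unless
# whitespace follows it.
fixed = re.compile(
    r"DEF(INE)?\s([\w\s]*)BROWSE\s+([\w-]+)\s+(([^.]|\.(?!\s))*)\.")

short = original.search(source).group(0)
full = fixed.search(source)

print(short.splitlines()[-1].strip())          # ZTYACC.
print(full.group(3))                           # BW_SC20SDAN
print(full.group(0).splitlines()[-1].strip())  # BGCOLOR 15 FGCOLOR 1 FONT 6 FIT-LAST-COLUMN.
```

Dots inside 83.57, 5.08 and ZTYACC.prime are followed by a non-space character, so the body consumes them and the match only ends at the period before the line break.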
Probably the most readable way to do that is
(?:DEFINE|DEF)\s([\w\s]*)BROWSE[\S\s]+?\.\s
The ? makes the + quantifier lazy, so it matches as little as possible and stops at the first period that is followed by whitespace.
If you have the option to use an ungreedy regex library, the simplest yet closest to what you specified would be
DEFINE\s+BROWSE.*?\.\s
Note, however, that the trailing whitespace may not be there at the end of your input text, leaving the last statement unmatched.
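The lazy variants and the end-of-input caveat can be illustrated in Python (a sketch on an abbreviated copy of the block). Two assumptions are worth stating: in Python the `.*?` form needs re.DOTALL so that the dot can cross line breaks, and the lookahead `(?=\s|$)` is one possible way, not mentioned in the answers above, to accept a final period with nothing after it.

```python
import re

block = (
    "DEFINE BROWSE BW_SC20SDAN\n"
    "  QUERY BW_SC20SDAN NO-LOCK DISPLAY\n"
    '      ZTYACC.prime COLUMN-LABEL "" FORMAT "X(35)"\n'
    "    WITH SEPARATORS SIZE 83.57 BY 5.08\n"
    "         BGCOLOR 15 FGCOLOR 1 FONT 6 FIT-LAST-COLUMN."
)
with_newline = block + "\n"

lazy_any = re.compile(r"(?:DEFINE|DEF)\s([\w\s]*)BROWSE[\S\s]+?\.\s")
lazy_dot = re.compile(r"DEFINE\s+BROWSE.*?\.\s", re.DOTALL)
no_dotall = re.compile(r"DEFINE\s+BROWSE.*?\.\s")
# Period followed by whitespace OR by the end of the input.
lazy_eof = re.compile(r"DEFINE\s+BROWSE.*?\.(?=\s|$)", re.DOTALL)

print(lazy_any.search(with_newline).group(0) == with_newline)  # True
print(lazy_dot.search(with_newline).group(0) == with_newline)  # True
print(no_dotall.search(with_newline))                          # None
print(lazy_dot.search(block))                                  # None, no trailing whitespace
print(lazy_eof.search(block).group(0) == block)                # True
```

With the lookahead the trailing whitespace is no longer part of the match, which may or may not matter depending on what is done with the matched block afterwards.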
You may find it useful to have a lexer (scanner) like flex or ANTLR tokenize your string. This approach has the advantage that the lexer takes care of the white space and lets you specify the form of the block of interest in more detail.