Special characters in EBS Search Strings? - regex

I am working on the EBS configuration side of the SAP ERP system where I am trying to define Search Strings for the MT940 format (as per SAP SPRO activity "Define Search String for Electronic Bank Statement", for instance see this blog post).
I am trying to create a search pattern that is able to identify special characters in the MT940 format, for example ?/!/>, etc.
My search pattern: \C*######\C*
The text that I use to identify the mapping:
:86:306?00CCY RECD?20/BI/**?651234?**/BO/DE652004ED
In this case, I defined:
\C* to look for special characters; this part is skipped by the mapping.
# to look for a sequence of 6 numbers.
My results from the test:
1 651234
2 652004
3 651234
4 652004
The result I want to achieve with this search pattern: 651234
I understand that the repetition is caused by the * symbol. However, if I leave that symbol out, the search pattern ends in an error.
My problem is that I cannot work out how to make SAP Search Strings identify special characters. Furthermore, how can I identify a letter?
Below is the Search String definition from the SAP documentation of SPRO activity "Define Search String for Electronic Bank Statement":
String for searches in text. A search string consists of normal characters (that is, letters and digits) and other characters:
| Or
( ) Grouping
+ Repeats the previous character once or several times
* "Zero" or repeats the previous character several times
? Any individual character you want
# Any of the digits 0 to 9
^ Start of a line
$ End of a line
\ Escape symbol
Examples:
The search string "ab" fits each position in a character string in which the letter "b" follows the letter "a".
The search string "(A+|B)+C" fits "AC", "BC", "AAAAAC" or "ABAAC".
"(A+|B+)C" fits "AC", "BC" and "AAAAAC", but not "ABAAC".
"\*C" fits "*C"; the effect of the escape symbol is that "*" is not interpreted as a special character.
This is the first time I raise a question, therefore, I want to apologize if the format is not correct or the text was too long.
Many thanks for your time and help!
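One way to reason about the pattern is to read the documented operators literally and replay them in an ordinary regex engine. The following is a rough Python sketch, not SAP code: sap_to_python is a hypothetical helper, and it assumes the operators behave exactly as listed in the quoted documentation ("#" is a digit, "?" is any character, "\" makes the next character literal). Under that reading, "\C" is simply a literal "C", so "\C*" can match nothing at all and the pattern reduces to "any six digits".

```python
import re

def sap_to_python(search_string):
    """Translate a search string (documented operators only) to Python regex syntax."""
    out = []
    i = 0
    while i < len(search_string):
        ch = search_string[i]
        if ch == "\\" and i + 1 < len(search_string):
            # Escape symbol: the next character is taken literally.
            out.append(re.escape(search_string[i + 1]))
            i += 2
            continue
        if ch == "#":
            out.append(r"\d")          # any of the digits 0 to 9
        elif ch == "?":
            out.append(".")            # any individual character
        elif ch in "|()+*^$":
            out.append(ch)             # same meaning in both syntaxes
        else:
            out.append(re.escape(ch))  # ordinary letter or digit
        i += 1
    return "".join(out)

# Sample line copied from the question as posted.
line = ":86:306?00CCY RECD?20/BI/**?651234?**/BO/DE652004ED"

loose = sap_to_python(r"\C*######\C*")  # "\C" is an escaped, literal "C"
strict = sap_to_python(r"\?######\?")   # "\?" is a literal question mark

print(loose, re.findall(loose, line))
print(strict, re.findall(strict, line))
```

In this sketch the first pattern finds both six-digit runs (651234 and 652004), while the version with escaped question marks finds only ?651234?, whose surrounding characters the mapping could then drop. The documented list has no operator for "any letter", so under the same literal reading a letter would have to be written as an alternation such as (A|B|C) over the letters expected.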

Related

Parse a log file to fetch some values in a line

I am reading a log file and trying to fetch some values from lines which contain the substring "edited by:" and end with " bye".
This is how a log file is designed.
Error nothing reported
19-06-2021 LOGGER:INFO edited by : James Cooper Person Administrator bye. //Line 2
No data match.
19-06-2021 LOGGER:INFO edited by : Harry Rhodes Person External bye. //Line 4
.......
So I am trying to fetch:
James Cooper Person Administrator //from line 2
Harry Rhodes Person External //from line 4
And assign them to variables in my tcl program.
I am assuming the fetched lines are in a list named line2, like:
set splitList [split $line2 " "]
set agent [lindex $splitList 0]
set firstName [lindex $splitList 1]
set lastName [lindex $splitList 2]
set role [lindex $splitList 3]
I understand that having the fetched or extracted lines from log file in a list is not a good idea as they are unstructured input. Using Tcl list functions can lead to weird things when they aren't in proper Tcl list format.
I am very new to Tcl and don't have much experience with regex in Tcl.
So I tried extracting values from the matched line using regex. Suppose line2 is a variable holding the extracted matched line2 from the log file,
regexp -- {edited by:(.*) bye.$} $line2 match agent
I was able to get the expected output like below.
Person Harry Rhodes External
However, I don't know how to drill further into this extracted string to assign values to my variables. Any suggestions on this approach, or on other functions in the Tcl library that can help with this task, would be welcome.
Updated the question by editing the log format. The format of the log file was not correct.
To err on the safe side, I would modify the regex to look for whitespace ([[:space:]]) between words, using * (= "any amount") and + (= "at least one") as appropriate and storing each variable in a capturing group (surrounded by parentheses ()):
edited[[:space:]]+by[[:space:]]*:[[:space:]]*([^[:space:]]*)[[:space:]]+([^[:space:]]*)[[:space:]]+([^[:space:]]*)[[:space:]]+([^[:space:]]*)[[:space:]]+bye.$
Please note that [^[:space:]] matches any character except whitespace.
Regex101 demo: https://regex101.com/r/78l4HJ/1
First off, taking apart the name of a person into its components is extremely difficult. For example, some people have a multi-word family name. (Yes, I know specific examples of this.) Other people put the parts in different orders. Can you avoid splitting the name?
The other parts of parsing that substring are easier as we can assume that agent and role will not have spaces in. The trick to this RE is that \w+ matches a “word” character sequence, \s+ matches a space character sequence (more robustly than a single space), and .*? matches anything, but as little of it as possible.
regexp {^\s*(\w+)\s+(.*?)\s+(\w+)\s*$} $substring -> agent name role
OK, that's great for the substring, but what about the whole line? It's really just a matter of adjusting the anchors. (\y matches a word boundary.)
regexp {\yedited by\s*:\s*(\w+)\s+(.*?)\s+(\w+)\s+bye\y} $line -> agent name role
It's often not a good idea to feed more than a line at a time into a regular expression search, not unless you need to. Fortunately your records are newline-delimited so that's not a problem here.
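For comparison, the same line-by-line approach can be sketched in Python against the log format shown in the question. This is only an illustration under stated assumptions: parse_line and LINE_RE are hypothetical names, whitespace around the colon is treated as optional, and the last two words of the payload are assumed to be the agent and role so that the name itself never has to be split.

```python
import re

# Tolerates optional whitespace around the colon, as in "edited by : ...".
LINE_RE = re.compile(r"\bedited by\s*:\s*(.*?)\s+bye\b")

def parse_line(line):
    """Return (name, agent, role) for a matching log line, else None."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    # Assumption: the last two words are agent and role; the rest is the name.
    fields = m.group(1).rsplit(None, 2)
    if len(fields) != 3:
        return None
    return tuple(fields)

log = """Error nothing reported
19-06-2021 LOGGER:INFO edited by : James Cooper Person Administrator bye. //Line 2
No data match.
19-06-2021 LOGGER:INFO edited by : Harry Rhodes Person External bye. //Line 4"""

# Feed one line at a time into the regular expression, skipping non-matches.
records = [r for r in map(parse_line, log.splitlines()) if r is not None]
for name, agent, role in records:
    print(name, "|", agent, "|", role)
```

Splitting from the right keeps a multi-word name in one piece, which sidesteps the name-splitting problem described above.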

Regular expression to consider special characters in a string

I have to split data into tokens at spaces, but I must not split it at special characters. Right now the regex I have is
(\w*[-*#+=;:\/,~_ ]*\w+)
With this when I process the string
1-CHECK ON BLOCKS BELOW IF MARKET CORRECTION ARE LOADED: PCORP:BLOCK=ANCTRLG&V5PTCLG; AF55722 BRTBMWA-3289 (AF55722) in block ANCTRLG (Product ID: CAAZ 107 4493 R1A10 ) AF55736 BRTBMWA-3290 (AF55726)in block V5PTCLG (Product ID: CAAZ 107 4260 R2A08 ) IF MARKET CORRECTIONS ARE LOADED THEN V5 INTERFACE PROPERTY MUST BE DEFINED AS FOLLOW : MUXFIM : ACC-OFF (Accelerate Alligment is not active) WLL : ACC-ON (Accelerate Alligment is active ) : EXAPC:V5ID=v5id,PROP=ACC-OFF;
it tokenizes the string at spaces, but it also tokenizes it at special characters. For example,
: EXAPC:V5ID=v5id is tokenized to : EXAPC, :V5ID and =v5id, whereas I want it to be split into : and EXAPC:V5ID=v5id
I want to avoid this. Any ideas or help would be appreciated.
Your regex matches "an optional word, then an optional list of special characters, then another word". In case you have two words, there is no option of having a special character before the first word.
What you're probably looking for is ([-*#+=;:\/,~_ \w]+).
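The difference is easy to check on the fragment from the question. The sketch below uses Python's re module purely as a test bench and makes one assumption that goes beyond the answer: the space is left out of the character class (the no_space pattern), so that a token can contain the listed special characters but still ends at whitespace.

```python
import re

sample = ": EXAPC:V5ID=v5id"

# Pattern from the question: optional word, optional specials, then a word.
original = re.compile(r"(\w*[-*#+=;:\/,~_ ]*\w+)")

# Assumed variant: one class of word characters and specials, without the space.
no_space = re.compile(r"[-*#+=;:/,~_\w]+")

print(original.findall(sample))
print(no_space.findall(sample))
print(sample.split())  # plain whitespace split, for comparison
```

Characters that are not in the class, such as & or parentheses, still end a token with this variant. If every non-space character should stay inside its token, a plain whitespace split is the simpler choice.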

Swift 3: extract regex matches with non matching parts

I want to analyze a string by many different patterns for numbers, dates and other strings. So I have an array of patterns I want to check in that order.
let patterns = [... "\\d{6}", "\\d{4}", "\\d" ] // to be extended :-)
let s = "IMG_123456_2006.10.03-13.52.59 Testfile_2009_5"
Starting with the first item in patterns, I need to search the string s. If a match is found, the string should be split into the matched parts (e.g. "2006" and "2009") and the non-matching parts. The remaining parts are then searched with the next pattern, and so on. Assuming I had already defined a pattern for the date/time in the middle and placed it first in the array, the split string should look like:
"IMG_", "123456", "_", "2006.10.03-13.52.59", " Testfile_", "2009", "_", "5"
Can I use built-in functionality of regex.matches, or do I have to write everything on my own?
I have already been able to find a match. But then I have to use the ranges to split the string and repeat this for the remaining parts until no further matches are found. This needs a lot more work than I would expect given the results in match.numberOfRanges. Is there a compact solution?
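The procedure described above (try the patterns in order, keep each match as its own piece, and search only the still-unmatched pieces with the next pattern) can be sketched as follows. Python is used here only to illustrate the algorithm, not as a Swift answer. split_by_patterns is a hypothetical helper, the date/time pattern is an assumption standing in for the one the question leaves undefined, and the patterns are assumed to contain no capturing groups of their own.

```python
import re

def split_by_patterns(text, patterns):
    """Split text into matched and unmatched pieces, trying patterns in order."""
    # Each segment is (piece, matched); matched pieces are never searched again.
    segments = [(text, False)]
    for pattern in patterns:
        # The outer capturing group makes re.split keep the matches.
        splitter = re.compile("(" + pattern + ")")
        next_segments = []
        for piece, matched in segments:
            if matched:
                next_segments.append((piece, True))
                continue
            # re.split alternates: unmatched, matched, unmatched, ...
            for index, part in enumerate(splitter.split(piece)):
                if part:  # drop empty strings at the edges
                    next_segments.append((part, index % 2 == 1))
        segments = next_segments
    return [piece for piece, _ in segments]

patterns = [
    r"\d{4}\.\d{2}\.\d{2}-\d{2}\.\d{2}\.\d{2}",  # assumed date/time pattern
    r"\d{6}",
    r"\d{4}",
    r"\d",
]
s = "IMG_123456_2006.10.03-13.52.59 Testfile_2009_5"
parts = split_by_patterns(s, patterns)
print(parts)
```

Because matched pieces are flagged and skipped, a later pattern such as \d cannot break up a date or a six-digit number found earlier.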

How to match the following?

The data I want to parse has columns with the following format:
Character Big Medium Meaning ImageCode Small Constitutens Lesson Frame Strokes JH JTPL Heisig Story koohiiStory1 koohiiStory2 On-Reading Kun-Reading Examples:
All of those are separated by tabs \t (even though it may not look like it in the browser). Also notice that there is a colon : at the end of each line. The problem is that the columns koohiiStory2 and Examples may or may not exist, and there may also be cases in which the data is corrupt and there is a tab inside Heisig Story, but those are the minority.
What I'm trying to match is the values for On-Reading, Kun-Reading and Examples. All of these are distinct from the rest because they don't use standard English characters (romaji) but Japanese characters instead, with the exception of perhaps a few commas or dots. It is also guaranteed that either Kun-Reading or Examples will end with a colon :, that On-Reading and Kun-Reading will exist, and that all three columns will be consecutive.
Here is some sample data.
How can I parse that to return this?
Alright, I'll give it a shot.
Since the content you expect is mostly non-ASCII characters located between a dot followed by a space or tab and a colon:
(?<=\.(\s|\t)) // Positive lookbehind for a 'dot' + 'space or tab'
[^\w]+ // Any non words
(?=\:) // Positive lookahead for a ':'
Working sample on regex101

Select until next dot followed by \s?

I could use some help writing a regex. I have the following text:
DEFINE BROWSE BW_SC20SDAN
&ANALYZE-SUSPEND _UIB-CODE-BLOCK _DISPLAY-FIELDS BW_SC20SDAN C-Win _FREEFORM
QUERY BW_SC20SDAN NO-LOCK DISPLAY
ZTYACC.prime COLUMN-LABEL "" FORMAT "X(35)"
ZUNACT.sec COLUMN-LABEL " " FORMAT "X(30)"
INFDON.sep COLUMN-LABEL "" FORMAT "99/99/9999"
IF INFDON.top THEN "S" ELSE (IF INFDON.REPORT THEN "R" ELSE (IF INFDON.prime <> "" THEN INFDON.prime ELSE "")) COLUMN-LABEL "R" FORMAT "X(1)"
/* _UIB-CODE-BLOCK-END */
&ANALYZE-RESUME
WITH SEPARATORS SIZE 83.57 BY 5.08
BGCOLOR 15 FGCOLOR 1 FONT 6 FIT-LAST-COLUMN.
I have to find this whole block in a text file, so far I have this regex:
(?:DEFINE|DEF)\s([\w\s]*)BROWSE\s+([\w-]+)\s+([^.]*)\.
My problem is that it selects only this :
DEFINE BROWSE BW_SC20SDAN
&ANALYZE-SUSPEND _UIB-CODE-BLOCK _DISPLAY-FIELDS BW_SC20SDAN C-Win _FREEFORM
QUERY BW_SC20SDAN NO-LOCK DISPLAY
ZTYACC.
whereas I want to select up to the final period. Basically, the rule I want to apply is "until the next dot followed by \s".
But I can't figure out how to write this regex.
Allow "non-dot" [^.] OR "dots not followed by space" \.(?!\s):
DEF(INE)?\s([\w\s]*)BROWSE\s+([\w-]+)\s+(([^.]|\.(?!\s))*)\.
Note also the simplification of the leading term.
Probably the most readable way to do that is
(?:DEFINE|DEF)\s([\w\s]*)BROWSE[\S\s]+?\.\s
The ? makes the + operator lazy, so it matches as little as possible and stops at the first period that is followed by whitespace.
If your regex library supports non-greedy (lazy) matching, the simplest pattern that is still close to what you specified would be
DEFINE\s+BROWSE.*?\.\s
Note, however, that the trailing whitespace may not be there at the end of your input text, leaving the last statement unmatched.
You may find it useful to have a lexer (scanner) like flex or ANTLR tokenize your string. This approach has the advantage that the lexer takes care of the white space and lets you specify the form of the block of interest in more detail.
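As a rough illustration of the "dot followed by whitespace" rule, here is a Python sketch run against a shortened copy of the block from the question. BLOCK_RE is a hypothetical name, and the lookahead (?=\s|\Z) is an assumption added so that a final period at the very end of the input still counts, which is the edge case mentioned above.

```python
import re

# Lazy body, terminated by a period followed by whitespace or end of input.
# re.DOTALL lets "." span line breaks.
BLOCK_RE = re.compile(
    r"(?:DEFINE|DEF)\s+([\w\s]*?)BROWSE\s+([\w-]+)\s+(.*?)\.(?=\s|\Z)",
    re.DOTALL,
)

text = '''DEFINE BROWSE BW_SC20SDAN
QUERY BW_SC20SDAN NO-LOCK DISPLAY
ZTYACC.prime COLUMN-LABEL "" FORMAT "X(35)"
INFDON.sep COLUMN-LABEL "" FORMAT "99/99/9999"
WITH SEPARATORS SIZE 83.57 BY 5.08
BGCOLOR 15 FGCOLOR 1 FONT 6 FIT-LAST-COLUMN.'''

m = BLOCK_RE.search(text)
if m:
    print("browse name:", m.group(2))
    print("last line:", m.group(0).splitlines()[-1])
```

Dots inside identifiers such as ZTYACC.prime or numbers such as 83.57 are skipped because a non-space character follows them. A period followed by a space inside a quoted string would still end the match early, which is one reason a real lexer can be the more robust choice.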