Select until next dot followed by \s? - regex

I could use some help writing a regex. I have the following text:
DEFINE BROWSE BW_SC20SDAN
&ANALYZE-SUSPEND _UIB-CODE-BLOCK _DISPLAY-FIELDS BW_SC20SDAN C-Win _FREEFORM
QUERY BW_SC20SDAN NO-LOCK DISPLAY
ZTYACC.prime COLUMN-LABEL "" FORMAT "X(35)"
ZUNACT.sec COLUMN-LABEL " " FORMAT "X(30)"
INFDON.sep COLUMN-LABEL "" FORMAT "99/99/9999"
IF INFDON.top THEN "S" ELSE (IF INFDON.REPORT THEN "R" ELSE (IF INFDON.prime <> "" THEN INFDON.prime ELSE "")) COLUMN-LABEL "R" FORMAT "X(1)"
/* _UIB-CODE-BLOCK-END */
&ANALYZE-RESUME
WITH SEPARATORS SIZE 83.57 BY 5.08
BGCOLOR 15 FGCOLOR 1 FONT 6 FIT-LAST-COLUMN.
I have to find this whole block in a text file. So far I have this regex:
(?:DEFINE|DEF)\s([\w\s]*)BROWSE\s+([\w-]+)\s+([^.]*)\.
My problem is that it selects only this:
DEFINE BROWSE BW_SC20SDAN
&ANALYZE-SUSPEND _UIB-CODE-BLOCK _DISPLAY-FIELDS BW_SC20SDAN C-Win _FREEFORM
QUERY BW_SC20SDAN NO-LOCK DISPLAY
ZTYACC.
whereas I want to select everything up to the final period. Basically, the rule I want to apply is "until the next dot followed by \s",
but I can't figure out how to write this regex.

Allow "non-dot" [^.] OR "dots not followed by space" \.(?!\s):
DEF(INE)?\s([\w\s]*)BROWSE\s+([\w-]+)\s+(([^.]|\.(?!\s))*)\.
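Note also the simplification of the leading term to DEF(INE)?. Because this adds a capturing group, the later group numbers shift by one; write DEF(?:INE)? if you want to keep the original numbering.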
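As a rough check of the pattern above, here is a minimal sketch in Python (the question does not say which regex engine is in use, so Python's re module is an assumption). The sample text is an abbreviated copy of the block from the question, followed by a hypothetical next statement, and the sketch compares the question's pattern with the revised one.

```python
import re

# Abbreviated copy of the block from the question, followed by a
# hypothetical next statement to show where the match stops.
text = (
    "DEFINE BROWSE BW_SC20SDAN\n"
    "QUERY BW_SC20SDAN NO-LOCK DISPLAY\n"
    'ZTYACC.prime COLUMN-LABEL "" FORMAT "X(35)"\n'
    "WITH SEPARATORS SIZE 83.57 BY 5.08\n"
    "BGCOLOR 15 FGCOLOR 1 FONT 6 FIT-LAST-COLUMN.\n"
    "DEFINE VARIABLE x AS INTEGER NO-UNDO.\n"
)

# Pattern from the question: [^.]* stops at the very first dot.
original = re.compile(r"(?:DEFINE|DEF)\s([\w\s]*)BROWSE\s+([\w-]+)\s+([^.]*)\.")
# Pattern from the answer: dots not followed by whitespace are allowed.
revised = re.compile(r"DEF(INE)?\s([\w\s]*)BROWSE\s+([\w-]+)\s+(([^.]|\.(?!\s))*)\.")

short = original.search(text).group(0)
full = revised.search(text)

print(short.splitlines()[-1])          # last line matched by the question's pattern
print(full.group(0).splitlines()[-1])  # last line matched by the revised pattern
print(full.group(3))                   # browse name (group 3 because of DEF(INE)?)
```

The revised pattern stops at the first period that is followed by whitespace, so the hypothetical second statement is left out of the match.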

Probably the most readable way to do that is
(?:DEFINE|DEF)\s([\w\s]*)BROWSE[\S\s]+?\.\s
The trailing ? makes the + quantifier lazy, so it matches as little as possible and stops at the first period that is followed by whitespace.

If your regex library supports lazy (non-greedy) quantifiers, the simplest pattern that stays closest to what you specified would be
DEFINE\s+BROWSE.*?\.\s
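Note, however, that . must be allowed to match newlines (dot-all / single-line mode) for this to span a multi-line block, and that the trailing whitespace may not be there at the end of your input text, leaving the last statement unmatched.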
You may find it useful to have a lexer (scanner) like flex or ANTLR tokenize your string. This approach has the advantage that the lexer takes care of the white space and lets you specify the form of the block of interest in more detail.
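The two caveats of the lazy pattern can be seen in a small Python sketch (Python's re module is an assumption here, and the sample is an abbreviated copy of the question's block). It shows that the pattern needs dot-all mode to cross line breaks, and that replacing the trailing \s with a lookahead for "whitespace or end of input" also catches a final statement with nothing after its period.

```python
import re

# Abbreviated block from the question; note there is no whitespace
# after the final period.
block = (
    "DEFINE BROWSE BW_SC20SDAN\n"
    "QUERY BW_SC20SDAN NO-LOCK DISPLAY\n"
    'ZTYACC.prime COLUMN-LABEL "" FORMAT "X(35)"\n'
    "WITH SEPARATORS SIZE 83.57 BY 5.08."
)

# Without DOTALL, "." cannot cross a newline, so nothing matches
# even when a trailing newline is present.
without_dotall = re.search(r"DEFINE\s+BROWSE.*?\.\s", block + "\n")

# With DOTALL but a mandatory \s, the last statement is missed
# when the input ends right after the period.
needs_space = re.search(r"DEFINE\s+BROWSE.*?\.\s", block, re.DOTALL)

# Accept "whitespace or end of input" after the period instead.
tolerant = re.search(r"DEFINE\s+BROWSE.*?\.(?=\s|$)", block, re.DOTALL)

print(without_dotall, needs_space, tolerant is not None)
```

Using a lookahead rather than consuming the whitespace also keeps the trailing whitespace out of the match itself.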

Related

Special characters in EBS Search Strings?

I am working on the EBS configuration side of the SAP ERP system where I am trying to define Search Strings for the MT940 format (as per SAP SPRO activity "Define Search String for Electronic Bank Statement", for instance see this blog post).
I am trying to create a search pattern that is able to identify special characters in the MT940 format, for example ?, ! or >.
My search pattern: \C*######\C*
The text that I use to identify the mapping:
:86:306?00CCY RECD?20/BI/**?651234?**/BO/DE652004ED
In this case, I defined:
\C* to look for special characters; these will be skipped based on the mapping.
###### to look for a sequence of 6 digits.
My results from the test:
1 651234
2 652004
3 651234
4 652004
The result I want to achieve with this search pattern is 651234 only.
I do understand that the repetition is caused by the * symbol. However, if I leave that symbol out, the search pattern ends in an error.
My problem is that I cannot work out how to make SAP search strings identify special characters. Furthermore, how can I identify whether a character is a letter?
Below is the Search String definition from the SAP documentation of SPRO activity "Define Search String for Electronic Bank Statement":
String for searches in text. A search string consists of normal characters (that is, letters and digits) and other characters:
| Or
( ) Grouping
+ Repeats the previous character once or several times
* "Zero" or repeats the previous character several times
? Any individual character you want
# Any of the digits 0 to 9
^ Start of a line
$ End of a line
\ Escape symbol
Examples:
The search string "ab" fits each position in a character string in which the letter "b" follows the letter "a".
The search string "(A+|B)+C" fits "AC", "BC", "AAAAAC" or "ABAAC".
"(A+|B+)C" fits "AC", "BC" and "AAAAAC", but not "ABAAC".
"\*C" fits "*C"; the effect of the escape symbol is that "*" is not interpreted as a special character.
This is the first time I raise a question, therefore, I want to apologize if the format is not correct or the text was too long.
Many thanks for your time and help!
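
The documented examples above behave like ordinary regular expressions, which a short Python sketch can mirror. This is only an analogy: Python's re module is not the SAP search-string engine, and the symbol translation in the comments is an assumption based on the quoted table. The last part tries one idea on the sample line from the question: anchor the six digits between two escaped (literal) question marks instead of relying on "*". By the documented escape rule the SAP spelling might be \?######\?, but that has not been tested in SAP.

```python
import re

# Assumed translation of the documented SAP symbols into Python syntax:
#   "#" -> r"\d" (any digit), "?" -> "." (any single character);
#   "|", "( )", "+", "*", "^", "$" and "\" keep their meaning.
doc_examples = {
    r"(A+|B)+C": ["AC", "BC", "AAAAAC", "ABAAC"],
    r"(A+|B+)C": ["AC", "BC", "AAAAAC"],
    r"\*C": ["*C"],
}
all_fit = all(
    re.fullmatch(pattern, sample)
    for pattern, samples in doc_examples.items()
    for sample in samples
)
# The documentation says "(A+|B+)C" does not fit "ABAAC".
rejected = re.fullmatch(r"(A+|B+)C", "ABAAC") is None

# Sample line from the question; six digits enclosed in literal "?".
line = ":86:306?00CCY RECD?20/BI/**?651234?**/BO/DE652004ED"
hits = re.findall(r"\?(\d{6})\?", line)

print(all_fit, rejected, hits)
```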

Tcl - How to Add Text after last character through regex?

I need a tip or suggestion, ideally with an example, on how to add a .txt extension after the last character of a variable's output line.
For example:
set txt " ONLINE ENGLISH COURSE - LESSON 5 "
set result [concat "$txt" .txt]
Print:
Note that there are spaces at the start, in the middle and at the end of the phrase in the variable (txt). The spaces at the start and in the middle must be kept, but the last space, after the end of the sentence, should be replaced with the extension [.txt].
Tcl's built-in concat command does not achieve the desired effect.
The expected result was something like this:
ONLINE ENGLISH COURSE - LESSON 5.txt
I know I could remove spaces with string map, but I don't know how to remove just the last one on the line so that I can append the text [.txt].
If anyone can point me to one or more solutions, thank you in advance.
set result "[string trimright $txt].txt"
or
set result [regsub {\s*$} $txt ".txt"]
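
For comparison, the same two approaches can be sketched in Python (a substitute language, not part of the Tcl answer). The sketch shows a right-trim followed by concatenation, and a regex substitution anchored at the end of the string; the variable names are made up for illustration.

```python
import re

txt = " ONLINE ENGLISH COURSE - LESSON 5 "

# Counterpart of [string trimright $txt]: drop trailing whitespace only,
# leaving the leading and inner spaces untouched.
result_trim = txt.rstrip() + ".txt"

# Counterpart of the regsub call. count=1 matters in Python: without it
# the empty match at the very end of the string would be replaced as well.
result_regex = re.sub(r"\s*$", ".txt", txt, count=1)

print(repr(result_trim))
print(repr(result_regex))
```

Both variants keep the leading space and only touch the whitespace at the end of the line.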

Regex to insert space with certain characters but avoid date and time

I made a regex which inserts a space wherever any of the characters
-:\*_/;, is present. For example, JET*AIRWAYS\INDIA/858701/IDBI 05/05/05;05:05:05 a/c should become JET* AIRWAYS\ INDIA/ 858701/ IDBI 05/05/05; 05:05:05 a/c
The regex I used is (?!a\/c|w\/d|m\/s|s\/w|m\/o)(\D-|\D:|\D\*|\D_|\D\\|\D\/|\D\;)
I have added exceptions for some words such as a/c and w/d. The \D conditions are there to keep date/time values from being separated, but this created an issue: numbers followed by the above-mentioned characters never get split.
My requirement is
1. Insert a space after characters -:\*_/;,
2. but date and time should not get split which may have / :
3. need exception on words like a/c w/d
The following is the full code
Private Function formatColon(oldString As String) As String
    Dim reg As New RegExp
    reg.Global = True
    reg.Pattern = "(?!a\/c|w\/d|m\/s|s\/w|m\/o)(\D-|\D:|\D\*|\D_|\D\\|\D\/|\D\;)" '"(\D:|\D/|\D-|^w/d)"
    Dim newString As String
    newString = reg.Replace(oldString, "$1 ")
    formatColon = XtraspaceKill(newString)
End Function
I would use 3 replacements.
Replace all date and time special characters with a special macro that should never be found in your text, e.g. for 05/15/2018 4:06 PM, something based on your name:
05MANUMOHANSLASH15MANUMOHANSLASH2018 4MANUMOHANCOLON06 PM
You can encode exceptions too, like this:
aMANUMOHANSLASHc
Now run your original regex to replace all special characters.
Finally, replace the macros MANUMOHANSLASH and MANUMOHANCOLON back with / and :.
Meanwhile, let me tell you why this is complicated in a single regex.
If trying to do this in a single regex, you have to ask, for each / or :, "Am I a part of a date or time?"
To answer that, you need to use lookahead and lookbehind assertions, the latter of which Microsoft has finally added support for.
But given a /, you don't know if you're between the first and second, or second and third parts of the date. Similar for time.
The number of cases you need to consider will render your regex unmaintainably complex.
So please just use a few separate replacements :-)
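
The three replacements described above can be sketched as follows. This is a minimal Python illustration rather than the VBA from the question; the date, time and exception-word patterns are deliberately simple assumptions, and the placeholders are private-use Unicode characters standing in for the MANUMOHANSLASH / MANUMOHANCOLON macros.

```python
import re

# Placeholders from the Unicode private-use area, assumed never to occur
# in the input (they play the role of MANUMOHANSLASH / MANUMOHANCOLON).
SLASH, COLON = "\ue000", "\ue001"

# Deliberately simple, assumed shapes for dates, times and exception words.
DATE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")
TIME = re.compile(r"\b\d{1,2}:\d{2}(?::\d{2})?\b")
WORDS = re.compile(r"\b(?:a/c|w/d|m/s|s/w|m/o)\b", re.IGNORECASE)
# Any of - : \ * _ / ; , that is directly followed by a non-space character.
SPECIAL = re.compile(r"([-:\\*_/;,])(?=\S)")


def protect(match):
    """Hide / and : inside a matched date, time or exception word."""
    return match.group(0).replace("/", SLASH).replace(":", COLON)


def format_separators(text):
    # 1. Replace the separators that must survive with placeholders.
    for pattern in (DATE, TIME, WORDS):
        text = pattern.sub(protect, text)
    # 2. Insert a space after every remaining special character.
    text = SPECIAL.sub(r"\1 ", text)
    # 3. Put the original characters back.
    return text.replace(SLASH, "/").replace(COLON, ":")


print(format_separators(r"JET*AIRWAYS\INDIA/858701/IDBI 05/05/05;05:05:05 a/c"))
```

Keeping each step as its own small pattern makes it easy to add further exception words or date formats without touching the main replacement.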

How to match the following?

The data I want to parse has columns with the following format:
Character Big Medium Meaning ImageCode Small Constitutens Lesson Frame Strokes JH JTPL Heisig Story koohiiStory1 koohiiStory2 On-Reading Kun-Reading Examples:
All of those are separated by tabs \t (even though it may not look like it on the browser). Also notice at the end of each line there is a colon :. The problem is that the columns koohiiStory2 and examples may or may not exist and there may also be cases in which the data is corrupt and there is a tab inside Heisig Story but those are the minority.
What I'm trying to match is the values for On-Reading, Kun-Reading and Examples. All of these are distinct from the rest because they use Japanese characters rather than standard English characters (romaji), with the exception of perhaps a few commas or dots. It is also guaranteed that either Kun-Reading or Examples will end with a colon :, that On-Reading and Kun-Reading will exist, and that all three columns will be consecutive.
Here is some sample data.
How can I parse that to return this?
Alright, I'll give it a shot.
Since the content you expect is mostly non-ASCII characters sitting between a dot followed by a space or tab and a colon, you can use:
(?<=\.(\s|\t)) // Positive lookbehind for a 'dot' + 'space or tab'
[^\w]+ // Any non words
(?=\:) // Positive lookahead for a ':'
Working sample on regex101

Regex replace filename in javascript

I'm having trouble with a regular expression. I have several images whose file names need changing. I've done them by hand; it was quick, easy and painless. However, I wanted to know what I would need to do as a simple regex replacement using JavaScript, and that's where it doesn't quite work out. The image is called "multi blossom 02.png" and it's going to be resized and saved out as a JPEG with the name "iOS_multi_BLOSSOM_2048.jpg". The others are of the same form but have different nouns: winter, leaf, circus etc.
The file-name is structured as follows:
"multi" at the start (lower case),
white space,
the noun (lower case),
white-space,
a number (that may have a preceding 0 and may be one or two digits),
file extension which may be .png or .psd (lowercase).
It then needs to be changed to:
iOS_multi (camel case as written),
noun (UPPERCASE),
2048 (new fixed size),
new file extension .jpg(lowercase).
I know that ([a-z]+\s) matches "multi" and that (\s\d+.[a-z]+$) will match the numbers and file extension, but have no idea how to successfully match the bit in the middle as well. And do the uppercase on the noun. But I'm sure there is someone else that does. Thank you.
In JavaScript there is no replacement-string syntax for uppercasing a captured group, so a plain string replacement cannot do this. You can either pass a function as the second argument to replace, or use the match method, which returns an array that you can then manipulate:
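var oldImageName = "multi blossom 02.png";
// Capture the noun; the number and extension are matched but discarded.
var matches = oldImageName.match(/^multi (\w+) \d{1,2}\.(?:png|psd)$/);
// match() returns null when the name does not fit the pattern.
var newImageName = matches ? "iOS_multi_" + matches[1].toUpperCase() + "_2048.jpg" : oldImageName;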
Note: this assumes that the "noun" is a single word with no spaces
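
The same capture-and-uppercase step can also be written as a substitution whose replacement is computed by a function. The sketch below uses Python rather than JavaScript (a substitute language for illustration); the function name rename is hypothetical, and the pattern is anchored so that names of a different shape are returned unchanged.

```python
import re

# Same shape as the pattern in the answer, anchored to the whole name.
PATTERN = re.compile(r"^multi (\w+) \d{1,2}\.(?:png|psd)$")


def rename(old_name):
    """Build the new file name; names that do not fit are returned as-is."""
    return PATTERN.sub(
        # The replacement is computed per match, so the noun can be uppercased.
        lambda m: "iOS_multi_" + m.group(1).upper() + "_2048.jpg",
        old_name,
    )


print(rename("multi blossom 02.png"))
```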
I was searching for "javascript Regex to replace characters that Windows doesn't accept in a filename" but found nothing,
so here is a regex to strip the characters that the Windows file system does not allow in a filename (/ \ : * ? < > | "):
var originalFileName='some filename:with"forbidden/>\? chars.in';
var strippedFileName=originalFileName.replace(/[\/\\:*?<>|"]+/g, "");
console.log(strippedFileName);