Replace special characters in file with regex - regex

I'm trying to replace a part of a file that contains special characters and text, but the regex is not working correctly.
The regexp should work like (\e&l8D\n)+\f$ in other regex engines:
[System.IO.File]::ReadAllText(
"C:\tmp\text.prn",
[System.Text.Encoding]::GetEncoding('cp866')
)
-replace '('+[char]0x001b+'&l8D'+"`n"+')+'+"`f",''
Part of file:
&l8D│ N18-10│ │30/07 │31/07 16:00│20:30│ │
&l8DL-------+-----------+-----------+-----------+-----+----------------------------
&l8D
&l8D&l16D(3R(s1p14v0s3b4101T(s3B
(s0B(s0S(3R(s0p10.00h12.0v0s0b3T: (s3B(3R(s1p16v0s3b4101T 1.030(s0B(3R(s0p10.00h12.0v0s0b3T - (s3B(3R(s1p16v0s3b4101T 1.063(s0B(3R(s0p10.00h12.0v0s0b3T .(s3B(s0B(s3B
&l8D(s0B
&l8D
&l8D
&l8D
&l8D
&l8D
In the file, each &l8D sequence is preceded by the ESC character (ASCII 1B), which is not visible here, and the last character is FF (ASCII 0C).
Expected result:
&l8D│ N18-10│ │30/07 │31/07 16:00│20:30│ │
&l8DL-------+-----------+-----------+-----------+-----+----------------------------
&l8D
&l8D&l16D(3R(s1p14v0s3b4101T(s3B
(s0B(s0S(3R(s0p10.00h12.0v0s0b3T: (s3B(3R(s1p16v0s3b4101T 1.030(s0B(3R(s0p10.00h12.0v0s0b3T - (s3B(3R(s1p16v0s3b4101T 1.063(s0B(3R(s0p10.00h12.0v0s0b3T .(s3B(s0B(s3B
&l8D(s0B
Interactive example https://regex101.com/r/15uuEj/1
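For comparison, the intended pattern can be sketched in Python on a small made-up stand-in for the .prn content (the sample data below is hypothetical, not the real file). The optional \r is an assumption in case the file uses CRLF line endings, which the plain \n in the pattern above would not cover.

```python
import re

ESC, FF = "\x1b", "\x0c"

# Hypothetical miniature of the file: two content lines, five empty
# ESC&l8D lines, then a form feed as the very last character.
data = (
    f"{ESC}&l8D| N18-10 |\n"
    f"{ESC}&l8D{ESC}(s0B\n"
    + f"{ESC}&l8D\n" * 5
    + FF
)

# One or more "ESC&l8D<newline>" lines followed by FF at the end of the text.
# \r? is an assumption that tolerates Windows line endings.
trailer = re.compile(r"(?:\x1b&l8D\r?\n)+\f$")

cleaned = trailer.sub("", data)
print(repr(cleaned))
```

Only the trailing run of empty ESC&l8D lines and the form feed are removed; the lines that carry content are left alone.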

Remove all values in a string before a backslash

I have this string
AttendanceList
XXXXXX
US\abraham
EU\sarah
US\gerber
When I try to use -replace, it removes every character listed in the square brackets (including those in the first line, AttendanceList):
$attendance_new = $attendance -replace "[EU\\]", "" -replace "[US\\]", ""
echo $attendance_new
AttndancLit
XXXXXX
abraham
arah
grbr
I was hoping to get this sample output (and possibly append the string "_IN" to all values):
AttendanceList
XXXXXX
abraham_IN
sarah_IN
gerber_IN
I'm new to regex and still trying to figure out the regex code for special characters
You can use
$attendance_new = $attendance -replace '(?m)^(?:US|EU)\\(.*)', '$1_IN'
See this demo (.NET regex demo here). Details:
(?m) - multiline option enabling ^ to match start of any line position
^ - line start
(?:US|EU) - EU or US
\\ - a \ char
(.*) - Group 1: any zero or more chars other than a line feed char (note you might need to replace it with ([^\r\n]*) if you start getting weird results)
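The same pattern can be tried outside PowerShell. The following is a minimal Python sketch of the replacement described above; the variable names are only illustrative, and Python writes the group reference as \1 instead of $1.

```python
import re

attendance = "AttendanceList\nXXXXXX\nUS\\abraham\nEU\\sarah\nUS\\gerber"

# (?m) lets ^ match at the start of every line, (?:US|EU)\\ matches the
# domain prefix plus the backslash, and (.*) captures the rest of the line.
attendance_new = re.sub(r"(?m)^(?:US|EU)\\(.*)", r"\1_IN", attendance)

print(attendance_new)
```

Lines that do not start with US\ or EU\ (AttendanceList and XXXXXX here) are left untouched, because the pattern is anchored to the line start and requires the prefix.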

Replace pattern in list in boost build/b2/bjam

How can I replace a pattern in a list of strings in Boost Build?
In GNU make that could be done using substitution for changing file extension, or patsubst in general.
Here is an example using the rule "replace-list" from builtin module regex:
SWIG_SOURCES = [ glob *.i ] ;
import regex ;
SWIG_GENERATED_CPP_FILES = [ regex.replace-list $(SWIG_SOURCES) : \\.i$ : _wrap.cpp ] ;
Say the file example_file.i is located in the directory. Its name will be added to the list SWIG_SOURCES by glob and will become example_file_wrap.cpp in the list SWIG_GENERATED_CPP_FILES.
The \\ mean that . is a literal dot; without them, . would match any character.
The $ matches the end of the string.
More information is in the documentation of the regex builtin module.
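For readers who want to check the pattern itself, here is a rough Python analogy of the same substitution. It is not Boost Build code, and the file names are made up; it only illustrates why the escaped dot and the $ anchor matter.

```python
import re

# Hypothetical file names standing in for the result of the glob.
swig_sources = ["example_file.i", "lib.inc.i"]

# \. is a literal dot and $ anchors the match at the end of the name,
# so only the final ".i" extension is rewritten.
generated = [re.sub(r"\.i$", "_wrap.cpp", name) for name in swig_sources]

print(generated)
```

With the anchor in place, the ".i" inside "lib.inc.i" that belongs to ".inc" is not touched.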

Powershell Regex expression to get part of a string

I would like to take part of a string to use it elsewhere. For example, I have the following strings:
Project XYZ is the project name - 20-12-11
I would like to get the value "XYZ is the project name" from the string. The word "Project" and character "-" before the number will always be there.
I think a lookaround regular expression would work here since "Project" and "-" are always there:
(?<=Project ).+?(?= -)
A lookaround can be useful for cases that deal with getting a sub string.
Explanation:
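(?<= = opens a positive lookbehind
Project = starting string (including the trailing space)
) = closes the positive lookbehind
.+? = matches one or more characters in between, as few as possible (lazy)
(?= = opens a positive lookahead
 - = ending string (a space followed by a hyphen)
) = closes the positive lookahead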
Example in PowerShell:
Function GetProjectName($InputString) {
$regExResult = $InputString | Select-String -Pattern '(?<=Project ).+?(?= -)'
$regExResult.Matches[0].Value
}
$projectName = GetProjectName -InputString "Project XYZ is the project name - 20-12-11"
Write-Host "Result = '$($projectName)'"
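The lookaround pattern is not specific to PowerShell. As a minimal sketch, the same expression works with Python's re module, since the lookbehind has a fixed width; the function name below is only illustrative.

```python
import re

def get_project_name(input_string):
    # The lookbehind and lookahead assert the surrounding text
    # without including it in the match.
    match = re.search(r"(?<=Project ).+?(?= -)", input_string)
    return match.group(0) if match else None

project_name = get_project_name("Project XYZ is the project name - 20-12-11")
print(f"Result = '{project_name}'")
```

Returning None when nothing matches avoids the error that indexing into an empty match would otherwise raise.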
here is yet another regex version. [grin] it may be easier to understand since it uses somewhat basic regex patterns.
what it does ...
defines the input string
defines the prefix to match on
    this will keep only what comes after it.
defines the suffix to match on
    this part will keep only what is before it.
trigger the replace
    the part in the () is what will be placed into the 1st capture group.
show what was kept
the code ...
$InString = 'Project XYZ is the project name - 20-12-11'
# "^" = start of string
$Prefix = '^project '
# ".+" = one or more of any character
# "$" = end of string
$Suffix = ' - .+$'
# "$1" holds the content of the 1st [and only] capture group
$OutString = $InString -replace "$Prefix(.+)$Suffix", '$1'
$OutString
# define the input string
$str = 'Project XYZ is the project name - 20-12-11'
# use regex (-match) with the .*? regex pattern
# this pattern means (.) any char, (*) any number of times, (?) as few times as possible (lazy)
# the brackets capture only the text between "Project " and " - "
$str -match 'Project (.*?) - '
# show result (the first capturing group)
$matches[1]
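The difference between a lazy and a greedy quantifier only shows up when the delimiter occurs more than once. The sketch below uses a made-up project string with two " - " separators to illustrate this; it is not taken from the question.

```python
import re

# Hypothetical input with two " -" separators.
s = "Project XYZ - phase 2 - 20-12-11"

# Lazy: stops at the first " -" that lets the rest of the pattern match.
lazy = re.search(r"Project (.*?) -", s).group(1)

# Greedy: grabs as much as possible and backs off to the last " -".
greedy = re.search(r"Project (.*) -", s).group(1)

print(lazy)
print(greedy)
```

Which of the two is right depends on whether the project name itself may contain the separator.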

Optimize the regex for multiline matching, both in steps and time

In reference to Regex - should match newlines as well as should end at the first occurence of a particular format
I am trying to read the body of the mail from logs (some of them are more than 500 lines long).
Sample data looks like: BodyOftheMail_Script = [ BEGIN 500 lines END ]
I've tried the following regular expressions:
+------------------------------------------------------------------------+---------+--------+
| Regexp                                                                 | Steps   | Time   |
+------------------------------------------------------------------------+---------+--------+
| BodyOftheMail_Script\s=\s\[\sBEGIN\s{0,}((?s)[\s\S]*?)(?=\s{1,}END\s]) | 1015862 | ~474ms |
| BodyOftheMail_Script\s=\s\[\sBEGIN\s{0,}((?s)[\w\W]*?)(?=\s{1,}END\s]) | 1015862 | ~480ms |
| BodyOftheMail_Script\s=\s\[\sBEGIN\s{0,}((?s).*?)(?=\s{1,}END\s])      | 1015862 | ~577ms |
| BodyOftheMail_Script\s=\s\[\sBEGIN\s{0,}((.|\n)*?)(?=\s{1,}END\s\])    | 1681711 | ~829ms |
+------------------------------------------------------------------------+---------+--------+
Is there a faster way (more optimal regexp) to match this?
Enhancing the pattern
The most efficient of the 5 expressions turned out to be
BodyOftheMail_Script\s=\s\[\sBEGIN\s*(\S*(?:\s++(?!END\s])\S*)*)\s+END\s]
See the regex demo
The part I modified is \S*(?:\s++(?!END\s])\S*)*:
\S* - 0 or more non-whitespace characters
(?:\s++(?!END\s])\S*)* - 0 or more occurrences of
    \s++(?!END\s]) - 1+ whitespace characters (matched possessively so that the lookahead check could only be performed once after all the 1+ whitespaces are matched) not followed with END, 1 whitespace and ] char
    \S* - 0 or more non-whitespace characters
Why not a mere BodyOftheMail_Script\s=\s\[\sBEGIN\s*(.*?)\s+END\s] with re.DOTALL? The \s*(.*?)\s+END\s] will work as follows: 0+ whitespaces will be matched at once, then (.*?) will be skipped the first time, then \s+END\s] pattern will be tried. If \s+END\s] is not matched, .*? will grab one char and again let the subsequent patterns try to match the string. And so on. It might take a lot of backtracking steps to reach the end of a match (if it is there, else, it might end in a timeout sooner than later).
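To see that the two approaches capture the same text, here is a small sketch using only the standard re module. Because possessive quantifiers are not available in every version of re, the \s++ is replaced by the lookahead-plus-backreference idiom (?=(\s+))\2, which behaves atomically in the same way; that substitution is an assumption of this sketch, not part of the pattern above.

```python
import re

text = 'BodyOftheMail_Script = [ BEGIN some text\nhere and\nhere, too \nEND ]'

# Lazy version: .*? grows one character at a time and retries \s+END\s]
# after every step.
lazy = re.compile(r'BodyOftheMail_Script\s=\s\[\sBEGIN\s*(.*?)\s+END\s]', re.DOTALL)

# Unrolled version: consume a non-whitespace chunk, then a whole whitespace run,
# and check for the END delimiter only once per whitespace run.
# (?=(\s+))\2 stands in for the possessive \s++ (group 2 is just a helper).
unrolled = re.compile(
    r'BodyOftheMail_Script\s=\s\[\sBEGIN\s*'
    r'(\S*(?:(?=(\s+))\2(?!END\s])\S*)*)'
    r'\s+END\s]'
)

lazy_body = lazy.search(text).group(1)
unrolled_body = unrolled.search(text).group(1)

print(repr(lazy_body))
print(lazy_body == unrolled_body)
```

Both patterns return the body without the whitespace that precedes END; they differ only in how many positions the engine has to test along the way.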
Performance comparison
Since the number of steps at regex101.com is not direct proof that one pattern is more efficient than another, I decided to run performance tests using the Python PyPi regex library. See the code below.
The results were obtained on a PC with 16GB RAM and an Intel Core i5-9400F CPU; results were consistent across PyPi regex versions 2.5.77 and 2.5.82:
┌──────────┬──────────────────────────────┐
│ Regex    │ Time taken (s, 100,000 runs) │
├──────────┼──────────────────────────────┤
│ OP 1     │ 0.5606743000000001           │
│ OP 2     │ 0.5524994999999999           │
│ OP 3     │ 0.5026944                    │
│ OP 4     │ 0.7502984000000001           │
│ WS_1     │ 0.25729479999999993          │
│ WS_2     │ 0.3680949                    │
└──────────┴──────────────────────────────┘
Conclusions:
The worst OP regex is the one that contains the notorious (.|\n)*? pattern. It is one of the most inefficient patterns I have seen in my regex life, and it always causes issues across all languages. Please never use it in your patterns.
The first three OP patterns are comparable, but it is clear that the common workarounds for making . match any char, [\w\W] and [\s\S], should be avoided if there is a way to make . match any char with a modifier, such as (?s) or regex.DOTALL. The native (?s). solution is a tiny bit more efficient.
My suggestion appears to be twice as fast as the best OP pattern because it matches the string from the left-hand delimiter to the right-hand delimiter in chunks, only stopping to check for the right-hand delimiter after grabbing a chunk of non-whitespace text and the whitespace that follows it.
The .*? construct expands one char at a time whenever the current position is not the start of the right-hand delimiter, so its efficiency decreases as strings get longer.
The Python testing code:
import regex
import timeit
# Sample log entry used for all benchmarks
text = 'BodyOftheMail_Script = [ BEGIN some text\nhere and\nhere, too \nEND ]'
# The four patterns from the question (OP 1-4)
regex_pattern_1=regex.compile(r'BodyOftheMail_Script\s=\s\[\sBEGIN\s{0,}((?s)[\s\S]*?)(?=\s{1,}END\s])')
regex_pattern_2=regex.compile(r'BodyOftheMail_Script\s=\s\[\sBEGIN\s{0,}((?s)[\w\W]*?)(?=\s{1,}END\s])')
regex_pattern_3=regex.compile(r'BodyOftheMail_Script\s=\s\[\sBEGIN\s{0,}((?s).*?)(?=\s{1,}END\s])')
regex_pattern_4=regex.compile(r'BodyOftheMail_Script\s=\s\[\sBEGIN\s{0,}((.|\n)*?)(?=\s{1,}END\s\])')
# WS_1: unrolled pattern with a possessive quantifier
regex_pattern_WS_1=regex.compile(r'BodyOftheMail_Script\s=\s\[\sBEGIN\s*(\S*(?:\s++(?!END\s])\S*)*)\s+END\s]')
# WS_2: plain lazy .*? with DOTALL
regex_pattern_WS_2 = regex.compile(r'BodyOftheMail_Script\s=\s\[\sBEGIN\s*(.*?)\s+END\s]', regex.DOTALL)
# Time 100,000 findall calls per pattern; the "# =>" comments are the measured seconds
print(timeit.timeit("p.findall(text)", 'from __main__ import text, regex_pattern_1 as p', number=100000))
# => 0.5606743000000001
print(timeit.timeit("p.findall(text)", 'from __main__ import text, regex_pattern_2 as p', number=100000))
# => 0.5524994999999999
print(timeit.timeit("p.findall(text)", 'from __main__ import text, regex_pattern_3 as p', number=100000))
# => 0.5026944
print(timeit.timeit("p.findall(text)", 'from __main__ import text, regex_pattern_4 as p', number=100000))
# => 0.7502984000000001
print(timeit.timeit("p.findall(text)", 'from __main__ import text, regex_pattern_WS_1 as p', number=100000))
# => 0.25729479999999993
print(timeit.timeit("p.findall(text)", 'from __main__ import text, regex_pattern_WS_2 as p', number=100000))
# => 0.3680949
Unless you missed some important details in your question, I don't see any reason to overcomplicate things. Why not use a simple BodyOftheMail_Script = \[ BEGIN.*?END \]? You have your start indicator BodyOftheMail_Script = [ BEGIN, your end indicator END ], and you want to match everything in between in a non-greedy way with .*?. Of course, if we're talking about Python, it requires the re.DOTALL flag so that . also matches newlines (re.MULTILINE only changes how ^ and $ behave, so it is not needed here):
import re
regexp = re.compile(r'BodyOftheMail_Script = \[ BEGIN.*?END \]', re.DOTALL)
The first rule of regexps: do not overcomplicate ;) Someone will read it after you.
Using the same comparison script as in @Wictor's answer, I got the following results:
OP 1 0.24152620000000002
OP 2 0.28501820000000005
OP 3 0.20582650000000002
OP 4 0.3379188999999999
WS 0.16937669999999994
Subj 0.10387990000000014
Replacing the literal spaces with \s is possible and does not really change the speed (but if you only have spaces in the actual file, then just use a space; do not overcomplicate).
Also, if you want, you can add a capturing group to get the content directly. It adds ~0.02s for me, so it will most probably be faster to trim each result afterwards instead of using a regexp group.
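The "trim afterwards" idea could look roughly like the sketch below. The helper name and the PREFIX/SUFFIX constants are illustrative assumptions, and no timing claim is made for this exact code.

```python
import re

PREFIX = 'BodyOftheMail_Script = [ BEGIN'
SUFFIX = 'END ]'

# No capturing group: findall returns the whole match for each entry.
pattern = re.compile(r'BodyOftheMail_Script = \[ BEGIN.*?END \]', re.DOTALL)

def extract_bodies(log_text):
    # Cut the known delimiters off by length, then strip surrounding whitespace.
    return [m[len(PREFIX):-len(SUFFIX)].strip() for m in pattern.findall(log_text)]

text = 'BodyOftheMail_Script = [ BEGIN some text\nhere and\nhere, too \nEND ]'
bodies = extract_bodies(text)
print(bodies)
```

Slicing by length works here because both delimiters are fixed literal strings; with \s-based delimiters of variable width, a capturing group would be the simpler choice.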

Get package references from a package body file

I am extracting some information about package body files, and now I need to get the package references (packages invoked) in the same file. How to do this in Notepad++ with regex?
I understand that it's possible with regex by marking a search with
pac_\w*
And unmark lines, but I need only the package names, not the lines.
For example if I have this code portion:
pac_test1.function1(...);
if pac_finally.f_result then
pac_execute.p_result;
v_load := pac_gui.f_show_result(pnum1, pnum2);
.
.
I expect to get this:
pac_test1
pac_finally
pac_execute
pac_gui
Or desired:
pac_test1, pac_finally, pac_execute, pac_gui
Notepad++ may not be the right tool for this job. The typical approach would be to search for something like pac_[^.]+, but a Notepad++ replacement starts from the entire line and ends with some replacement of that line. Lines which have no matches would need to be removed, and that is tricky.
So I recommend using a scripting language like PHP. Here is a PHP script which can find all matches:
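// Sample input from the question
$script = "pac_test1.function1(...);
if pac_finally.f_result then
pac_execute.p_result;
v_load := pac_gui.f_show_result(pnum1, pnum2);";
// Collect every package name: "pac_" followed by one or more non-dot characters
preg_match_all("/pac_[^.]+/", $script, $matches);
// Print the matches as an array, then as a comma-separated list
print_r($matches[0]);
echo implode(",", $matches[0]);
This prints: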
Array
(
[0] => pac_test1
[1] => pac_finally
[2] => pac_execute
[3] => pac_gui
)
pac_test1,pac_finally,pac_execute,pac_gui
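The same extraction can be sketched in Python. This version uses pac_\w+ (the pattern from the question) rather than pac_[^.]+, so a package name that is followed by a space or a line break instead of a dot does not run on into the following text; the order-preserving de-duplication is an addition that the question did not ask for.

```python
import re

script = """pac_test1.function1(...);
if pac_finally.f_result then
pac_execute.p_result;
v_load := pac_gui.f_show_result(pnum1, pnum2);"""

# findall returns every non-overlapping match in order of appearance;
# dict.fromkeys drops repeated names while keeping that order.
packages = list(dict.fromkeys(re.findall(r"pac_\w+", script)))

joined = ", ".join(packages)
print(joined)
```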
Ctrl+H
Find what: (?:^|\G).*?(pac_\w+)(?:(?!pac_).)*(\R|\z)?
Replace with: $1, (with a trailing space after the comma)
check Wrap around
check Regular expression
UNCHECK . matches newline
Replace all
Explanation:
(?:^|\G) # beginning of line OR restart from last match position
.*? # 0 or more any character but newline, not greedy
(pac_\w+) # group 1, pac_ followed by 1 or more word characters, the package
(?:(?!pac_).)* # Tempered greedy token, matches any text that does not contain pac_
(\R|\z)? # optional group 2, any kind of linebreak or end of file
Replacement:
$1, # content of group 1, package, a comma and a space
Given:
pac_test1.function1(...); pac_test2
if pac_finally.f_result then
pac_execute.p_result;
v_load := pac_gui.f_show_result(pnum1, pnum2);
Result for given example:
pac_test1, pac_test2, pac_finally, pac_execute, pac_gui,