Appending Text on the Line Below a Regular Expression Match - regex

I am trying to add text that reads "[value = xxx]" below every line that contains the word "Letters" and also append a comma to the line containing the word "Letters" and I thought using a Regular Expression in Notepad++ would work but I can't quite figure it out. Also, matches are not spaced regularly (ie. It's not as simple as adding "[value = xxx]" to every 3rd row).
What I have currently looks like:
Properties = "_2nastlsgb",
Letters = "#,S"
textline2
textline3
Properties = "_1,N",
Letters = "A"
I would like for the end result to be something like:
Properties = "_2nastlsgb",
Letters = "#,S",
[value = xxx]
textline2
textline3
Properties = "_1,N",
Letters = "A",
[value = xxx]
I'm really close with the following but it ends up just a bit off:
Find What: letter(.*)
Replace with: \1,\n\t\t\t\t[Value = ###]
Result:
Properties = "_2nastlsgb",
s = "#,S",
[Value = ###]
textline2
textline3
Properties = "_1,N",
s = "A",
[Value = ###]
Any help would be appreciated.

Try using:
^(.*?)(Letters.*)
And replace with:
$1$2,\n$1[Value = ###]
This regex will take the indent of the Letters and apply it to the Value as well.
The issue with your regex was that it was replacing letter and not placing it back, thus the lone s.

Related

Optional parts of regex pattern in vba

I am trying to build regex pattern for the text like that
numb contra: 1.29151306 number mafo: 66662308
numb contra 1.30789668number mafo 60.046483
numb contra/ 1.29154056 number mafo: 666692638
numb contra 137459625
mafo: 666692638
mafo: 666692638 numb contra/ 1.29154056
Here's the pattern I could build
contra?.\s+?(\d+\.?\d+)(.+mafo.?\s+(\d+\.?\d+))?
It works fine for all the lines except the last one. How can I implement all the possibilities to include the last line too?
Please have a look at this link
https://regex101.com/r/pSThAU/1
All is OK as for contra but not as for mafo
I think the key here is to make your regexp do less and your vba do more. What I think I see here is either the word 'mafo' or 'contra' and a number following. Don't know what order or whether each is present or how many times. So you can scan each of your strings for ALL occurrences with a regexp like this:
(?:^|[^A-Z])(?:(mafo)|(contra))[^A-Z]\s*(\d*\.?\d+)
Then process it with some VBA code like this that I created in Excel:
Sub BreakItUp()
Dim rg As RegExp, scanned As MatchCollection, eachMatch As Match, i As Long, col As Long
Set rg = New RegExp
rg.Pattern = "(?:^|[^A-Z])(?:(mafo)|(contra))[^A-Z]\s*(\d*\.?\d+)"
rg.IgnoreCase = True
rg.Global = True
i = 1
Do While (Not IsEmpty(ActiveSheet.Cells(i, 1).Value))
Set scanned = rg.Execute(ActiveSheet.Cells(i, 1).Value)
col = 2
For Each eachMatch In scanned
ActiveSheet.Cells(i, col).Value = eachMatch.SubMatches(0) & eachMatch.SubMatches(1)
ActiveSheet.Cells(i, col + 1).Value2 = "'" & eachMatch.SubMatches(2)
col = col + 2
Next eachMatch
i = i + 1
Loop
End Sub
That MatchCollection object will get one item for each Match that occurs and the subMatches array contains each capturing group. You should be able write your own logic within this processing loop to interpret what was extracted. When I ran it on your data it created all the fields in blue:
Notice I added a line to your data that had two contra entries and one mafo and it found all the occurrences. You should be able to modify this to interpret the meanings.

Conditionally extracting the beginning of a regex pattern

I have a list of strings containing the names of actors in a movie that I want to extract. In some cases, the actor's character name is also included which must be ignored.
Here are a couple of examples:
# example 1
input = 'Levan Gelbakhiani as Merab\nAna Javakishvili as Mary\nAnano Makharadze'
expected_output = ['Levan Gelbakhiani', 'Ana Javakishvili', 'Anano Makharadze']
# example 2
input = 'Yoosuf Shafeeu\nAhmed Saeed\nMohamed Manik'
expected_output = ['Yoosuf Shafeeu', 'Ahmed Saeed', 'Mohamed Manik']
Here is what I've tried to no avail:
import re
output = re.findall(r'(?:\\n)?([\w ]+)(?= as )?', input)
output = re.findall(r'(?:\\n)?([\w ]+)(?: as )?', input)
output = re.findall(r'(?:\\n)?([\w ]+)(?:(?= as )|(?! as ))', input)
The \n in the input string are new line characters. We can make use of this fact in our regex.
Essentially, each line always begins with the actor's name. After the the actor's name, there could be either the word as, or the end of the line.
Using this info, we can write the regex like this:
^(?:[\w ]+?)(?:(?= as )|$)
First, we assert that we must be at the start of the line ^. Then we match some word characters and spaces lazily [\w ]+?, until we see (?:(?= as )|$), either as or the end of the line.
In code,
output = re.findall(r'^(?:[\w ]+?)(?:(?= as )|$)', input, re.MULTILINE)
Remember to use the multiline option. That is what makes ^ and $ mean "start/end of line".
You can do this without using regular expression as well.
Here is the code:
output = [x.split(' as')[0] for x in input.split('\n')]
I guess you can combine the values obtained from two regex matches :
re.findall('(?:\\n)?(.+)(?:\W[a][s].*?)|(?:\\n)?(.+)$', input)
gives
[('Levan Gelbakhiani', ''), ('Ana Javakishvili', ''), ('', 'Anano Makharadze')]
from which you filter the empty strings out
output = list(map(lambda x : list(filter(len, x))[0], output))
gives
['Levan Gelbakhiani', 'Ana Javakishvili', 'Anano Makharadze']

Python and Regex with special characters

I can't get my regex to work as desired in my Python 3 code.
I am trying to parse a file find a specific pattern (the exact pattern is Total Optimized)
I am doing this because the file can contain lines which say """Total Optimization (Active)""" and other permutations. I have tried the following lines. None work
PkOp = re.compile(r'Total Optimized\t\d')
PkOp = re.compile(r'Total Optimized\t\d')
PkOp = re.compile(r'Total Optimized\t[^(Active)]')
My basic code (which is simplified here) to just print the matching line out. If I got that working I would then choose the array item I wanted such as
PkOp = PkOpArray[4]
App = re.compile(r'Appliance\s(Active)')
PkOp = re.compile(r"Total Optimized\t\d")
with open("SteelheadMetric2.txt","r") as f:
with open("mydumbfile.txt","w") as o:
for line in f:
line = line.lstrip()
matches = PkOp.findall(line)
for firestick in matches:
PkOpArray = line.split()
PkOp = PkOpArray
print(PkOp)
Mostly I get this error
matches = PkOp.findall(line)
AttributeError: 'list' object has no attribute 'findall'
If I remove the slash characters I can get it to show lines with 'Total Optimization' or 'Appliance' whatever. I just can't be more specific in what I want.
What am I missing? It works fine if I just compile a text string but to use special regex like whitespace, tab digit it fails. The regex checks out in notepad++
When you write PkOp = PkOpArray you have just changed your regex into a list.
If you delete that line, and change your print(PkOp) to print(PkOpArray), it should fix your problem, assuming the rest of your code is correct.

Changing substring in a String

I've got a variable "Variable" in VBScript that will receive different values, based on names that come from xml files i don't trust. I can't let "Variable" have forbidden caracters on it (<, >, :, ", /, \, |, ?, * ) or characters with accents (I think they are called accent in english) like (Á, á, É, é, Â, â, Ê, ê, ñ, ã).
So, my question is: How can I create a script that studies and replace these possible multiple possible characters in the variable I have? I'm using a Replace function found in MSDN Library, but it won't let me alter many characters in the way I'm using it.
Example:
(Assuming a Node.Text value of "Example A/S")
For Each Node In xmlDoc.SelectNodes("//NameUsedToRenameFile")
Variable = Node.Text
Next
Result = Replace(Variable, "<", "-")
Result = Replace(Variable, "/", "-")
WScript.Echo Result
This Echo above returns me "Example A-S", but if I change my Replaces order, like:
Result = Replace(Variable, "/", "-")
Result = Replace(Variable, "<", "-")
I get a "Example A/S". How should I program it to be prepared to any possible characters? Thanks!
As discussed, it might be easier to do things the other way around; create a list of allowed characrters as VBScript is not so good at handling unicode like characters; whilst the characters you have listed may be fine, you may run into issues with certain character sets. here's an example routine that could help your cause:
Consider this command:
wscript.echo ValidateStr("This393~~_+'852Is0909A========Test|!:~#$%####")
Using the sample routine below, it should produce the following results:
This393852Is0909ATest
The sample routine:
Function ValidateStr (vsVar)
Dim vsAllowed, vscan, vsaScan, vsaCount
vsAllowed = "ABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890"
ValidateStr = ""
If vartype(vsvar) = vbString then
If len(vsvar) > 0 then
For vscan = 1 To Len(vsvar)
vsValid = False
vsaCount = 1
Do While vsaValid = false and vsaCount <= len(vsAllowed)
If UCase(Mid(vsVar, vscan, 1)) = Mid(vsAllowed, vsaCount, 1) Then vsValid = True
vsaCount = vsaCount + 1
Loop
If vsValid Then ValidateStr = ValidateStr & Mid(vsVar, vscan,1)
Next
End If
End If
End Function
I hope this helps you with your quest. Enjoy!
EDIT: If you wish to continue with your original path, you will need to fix your replace command - it is not working because you are resetting it after each line. You'll need to pump in variable the first time, then use result every subsequent time..
You had:
Result = Replace(Variable, "/", "-")
Result = Replace(Variable, "<", "-")
You need to change this to:
Result = Replace(Variable, "/", "-")
Result = Replace(Result, "<", "-")
Result = Replace(Result, ...etc..)
Result = Replace(Result, ...etc..)
Edit: You could try Ansgar's Regex, as the code is by far more simple, but I am not sure it will work if as an example you had simplified Chinese characters in your string.
I agree with Damien that replacing everything but known-good characters is the better approach. I would, however, use a regular expression for this, because it greatly simplifies the code. I would also recommend to not remove "bad" characters, but to replace them with a known-good placeholder (an underscore for instance), because removing characters might yield undesired results.
Function SanitizeString(str)
Set re = New RegExp
re.Pattern = "[^a-zA-Z0-9]"
re.Global = True
SanitizeString = re.Replace(str, "_")
End Function

Regex regular expression to remove lines which start with certain text

I know it may be quite easily for you.
I have a text which contains 40 lines, I want to remove lines which starts with a constant text.
Please check below data.
When I used (?mn)[\+CMGL:].*($) it removes the whole text , when I use (?mn)[\+CMGL:].*(\r) , it only leaves the first line.
+CMGL: 0,1,,159
07910201956905F0440B910201532762F20008709021225282808
+CMGL: 1,1,,159
07910201956905F0240B910201915589F7000860013222244480
+CMGL: 2,1,,151
07910201956905F0240B910201851177F6000850218122415
+CMGL: 3,1,,159
07910201956905F0440B910201532762F200087090311
+CMGL: 4,1,,159
07910221020020F0440B910221741514F40008802041120481808C050
I want to remove all lines that starts with +CMGL , and leave only other line.
Thanks...
Why do you need Regex for this? String.StartsWith was created for this purpose.
Dim result = lines.Where(Function(l) Not l.StartsWith("+CMGL")).ToList()
Edit: If you don't have "lines" but a text which contains NewLine-characters:
Dim result = text.Split({ControlChars.CrLf, ControlChars.Lf}, StringSplitOptions.None).
Where(Function(l) Not l.StartsWith("+CMGL")).ToList()
If you want it to be converted back to a string:
Dim text = String.Join(Environment.NewLine, result)