Changing substring in a String - regex

I've got a variable "Variable" in VBScript that will receive different values, based on names that come from XML files I don't trust. I can't let "Variable" contain forbidden characters (<, >, :, ", /, \, |, ?, *) or accented characters such as Á, á, É, é, Â, â, Ê, ê, ñ, ã.
So, my question is: how can I write a script that finds and replaces any of these characters in the variable? I'm using the Replace function from the MSDN Library, but the way I'm using it, only one of the replacements takes effect.
Example:
(Assuming a Node.Text value of "Example A/S")
For Each Node In xmlDoc.SelectNodes("//NameUsedToRenameFile")
Variable = Node.Text
Next
Result = Replace(Variable, "<", "-")
Result = Replace(Variable, "/", "-")
WScript.Echo Result
The Echo above returns "Example A-S", but if I change the order of my Replace calls, like:
Result = Replace(Variable, "/", "-")
Result = Replace(Variable, "<", "-")
I get "Example A/S". How should I write it so that it handles any of the possible characters? Thanks!

As discussed, it might be easier to do things the other way around: create a list of allowed characters, as VBScript is not so good at handling Unicode-like characters. Whilst the characters you have listed may be fine, you may run into issues with certain character sets. Here's an example routine that could help your cause:
Consider this command:
wscript.echo ValidateStr("This393~~_+'852Is0909A========Test|!:~#$%####")
Using the sample routine below, it should produce the following results:
This393852Is0909ATest
The sample routine:
Function ValidateStr(vsVar)
    Dim vsAllowed, vsScan, vsaCount, vsValid
    vsAllowed = "ABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890"
    ValidateStr = ""
    If VarType(vsVar) = vbString Then
        If Len(vsVar) > 0 Then
            For vsScan = 1 To Len(vsVar)
                vsValid = False
                vsaCount = 1
                ' Stop scanning the allowed list as soon as the character is found
                Do While vsValid = False And vsaCount <= Len(vsAllowed)
                    If UCase(Mid(vsVar, vsScan, 1)) = Mid(vsAllowed, vsaCount, 1) Then vsValid = True
                    vsaCount = vsaCount + 1
                Loop
                If vsValid Then ValidateStr = ValidateStr & Mid(vsVar, vsScan, 1)
            Next
        End If
    End If
End Function
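For comparison, the same allow-list idea can be sketched in Python. This is only an illustration, not a translation of the routine above: the names ALLOWED and validate_str are made up for this example, and the set lists both letter cases explicitly instead of upper-casing each character before comparing.

```python
import string

# Characters allowed to survive (assumed set: ASCII letters and digits).
ALLOWED = set(string.ascii_letters + string.digits)

def validate_str(value):
    """Return value with every character outside ALLOWED removed."""
    if not isinstance(value, str):
        return ""
    return "".join(ch for ch in value if ch in ALLOWED)

print(validate_str("This393~~_+'852Is0909A========Test|!:~#$%####"))
# prints: This393852Is0909ATest
```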
I hope this helps you with your quest. Enjoy!
EDIT: If you wish to continue with your original path, you will need to fix your Replace calls. They are not working because every line starts again from Variable, so each call discards the result of the previous one. You'll need to pass in Variable the first time, then use Result every subsequent time.
You had:
Result = Replace(Variable, "/", "-")
Result = Replace(Variable, "<", "-")
You need to change this to:
Result = Replace(Variable, "/", "-")
Result = Replace(Result, "<", "-")
Result = Replace(Result, ...etc..)
Result = Replace(Result, ...etc..)
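The same chaining can be written as a loop, so that each replacement works on the output of the previous one. Here is a minimal Python sketch of that idea; FORBIDDEN and replace_forbidden are hypothetical names, and the character list is taken from the question.

```python
# Forbidden characters listed in the question: < > : " / \ | ? *
FORBIDDEN = '<>:"/\\|?*'

def replace_forbidden(name, placeholder="-"):
    result = name
    for ch in FORBIDDEN:
        # Feed the previous result back in, never the original input.
        result = result.replace(ch, placeholder)
    return result

print(replace_forbidden("Example A/S"))  # prints: Example A-S
```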
Edit: You could try Ansgar's regex, as the code is far simpler, but I am not sure it will work if, for example, you had simplified Chinese characters in your string.

I agree with Damien that replacing everything but known-good characters is the better approach. I would, however, use a regular expression for this, because it greatly simplifies the code. I would also recommend not removing "bad" characters, but replacing them with a known-good placeholder (an underscore, for instance), because removing characters might yield undesired results.
Function SanitizeString(str)
Set re = New RegExp
re.Pattern = "[^a-zA-Z0-9]"
re.Global = True
SanitizeString = re.Replace(str, "_")
End Function
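The question also mentions accented characters. With the pattern above they simply become underscores. If keeping the base letter is preferable, one option is to decompose the text first and drop the combining marks. The Python sketch below shows both behaviours; the function name sanitize_string and the fold_accents switch are assumptions made for this illustration, not part of the VBScript answer.

```python
import re
import unicodedata

def sanitize_string(text, fold_accents=False):
    if fold_accents:
        # NFKD splits an accented letter into base letter + combining mark;
        # dropping the combining marks leaves the plain base letter.
        decomposed = unicodedata.normalize("NFKD", text)
        text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace everything that is not an ASCII letter or digit.
    return re.sub(r"[^a-zA-Z0-9]", "_", text)

print(sanitize_string("Example A/S"))                        # Example_A_S
print(sanitize_string("\u00c1cido \u00f1"))                  # _cido__
print(sanitize_string("\u00c1cido \u00f1", fold_accents=True))  # Acido_n
```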

Related

Changing formulas on the fly with VBA RegEx

I'm trying to change formulas in Excel; I need to change the row number in the formulas.
I'm trying to use a regex replace to do this. I use a loop to iterate through the rows of the sheet and need to change the formula for the row being processed at the time. Here is an example of the code:
For i = 2 To rows_aux
DoEvents
Formula_string= "=IFS(N19='Z001';'xxxxxx';N19='Z007';'xxxxxx';0=0;'xxxxxxx')"
Formula_string_new = regEx.Replace(Formula_string, "$1" & i)
wb.Cells(i, 33) = ""
wb.Cells(i, 33).Formula = Formula_string_new
.
.
.
Next i
I need to replace the row references, but not the ones inside single or double quotes. Example:
If i = 2, I want the new string to be this:
"=IFS(N2='Z001';'xxxxxx';N2='Z007';'xxxxxx';0=0;'xxxxxxx')"
I'm trying to use this regex:
([a-zA-Z]+)(\d+)
But it's changing everything in quotes too. Like this:
If i = 2:
"=IFS(N2='Z2';'xxxxxx';N2='Z2';'xxxxxx';0=0;'xxxxxxx')"
If anyone can help me, I will be very grateful!
Thanks in advance.
As others have written, there are probably better ways to write this code. But for a regex that will capture just the Column letter in capturing group #1, try:
\$?\b(XF[A-D]|X[A-E][A-Z]|[A-W][A-Z]{2}|[A-Z]{2}|[A-Z])\$?(?:104857[0-6]|10485[0-6]\d|1048[0-4]\d{2}|104[0-7]\d{3}|10[0-3]\d{4}|[1-9]\d{1,5}|[1-9])d?
Note that it will NOT include the $ absolute addressing token, but could be altered if that were necessary.
Note that you can avoid the loop completely with:
Formula_string = "=IFS(N19=""Z001"",""xxxxxx"",N$19=""Z007"",""xxxxxx"",0=0,""xxxxxxx"")"
Formula_string_new = regEx.Replace(Formula_string, "$1" & firstRow)
With Range(wb.Cells(firstRow, 33), wb.Cells(lastRow, 33))
.Clear
.Formula = Formula_string_new
End With
When we write a formula to a range like this, the references will automatically adjust the way you were doing in your loop.
Depending on unstated factors, you may want to use the FormulaLocal property instead of the Formula property.
Edit:
To make this a little more robust, in case there happens to be, within the quote marks, a string that exactly mimics a valid address, you can try checking to be certain that a quote (single or double) neither precedes nor follows the target.
Pattern: ([^"'])\$?\b(XF[A-D]|X[A-E][A-Z]|[A-W][A-Z]{2}|[A-Z]{2}|[A-Z])\$?(?:104857[0-6]|10485[0-6]\d|1048[0-4]\d{2}|104[0-7]\d{3}|10[0-3]\d{4}|[1-9]\d{1,5}|[1-9])d?\b(?!['"])
Replace: "$1$2" & i
However, this is not "bulletproof" as various combinations of included data might match. If it is a problem, let me know and I'll come up with something more robust.
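A different way to leave quoted text alone is to let the regex match quoted strings as well and hand them back unchanged, rewriting only the cell references found outside them. The Python sketch below illustrates that technique; it is not the pattern given above. TOKEN and set_row are hypothetical names, the reference pattern is deliberately loose (one to three letters followed by digits), and $ absolute addressing and quoted sheet names are not handled.

```python
import re

# Alternative 1 captures a quoted string; alternative 2 captures a
# column (group 2) and row (group 3) outside of any quoted string.
TOKEN = re.compile(r"""('[^']*'|"[^"]*")|\b([A-Za-z]{1,3})(\d+)\b""")

def set_row(formula, row):
    def swap(match):
        if match.group(1) is not None:
            return match.group(1)          # quoted text: keep as-is
        return match.group(2) + str(row)   # cell reference: new row number
    return TOKEN.sub(swap, formula)

src = "=IFS(N19='Z001';'xxxxxx';N19='Z007';'xxxxxx';0=0;'xxxxxxx')"
print(set_row(src, 2))
# prints: =IFS(N2='Z001';'xxxxxx';N2='Z007';'xxxxxx';0=0;'xxxxxxx')
```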
If you can identify some unique features, like in the example a preceding bracket ( or semicolon ; and a trailing equals sign =, then this might work:
Sub test()
Dim s As String, sNew As String, i As Long
Dim Regex As Object
Set Regex = CreateObject("vbscript.regexp")
With Regex
.Global = True
.MultiLine = False
.IgnoreCase = True
.Pattern = "([(;][a-zA-Z]{1,3})(\d+)="
End With
i = 1
s = "=IFS(NANA19='Z001';'xxxxxx';NA19='Z007';'xxxxxx';0=0;'xxxxxxx')"
sNew = Regex.Replace(s, "$1" & i & "=")
Debug.Print s & vbCr & sNew
End Sub

Replace Invalid Email Format with a Valid One

In Power Query, I have a list of emails that includes invalid emails. I am looking to use M codes to identify and "fix" them. For example, my email list would include something like "1234.my_email_gmail_com#error.invalid.com"
I am looking for Power Query to find similar email addresses, then produce an output of a valid email. For the example above, it should be "my_email#gmail.com"
Essentially, I want to do the following:
Remove the digits at the front (number of digits varies)
Remove the "#error.invalid.com"
Replace the first underscore "_" from the right to "."
Replace the second underscore "_" from the right to "#"
I'm still new to Power Query, especially with M codes. I appreciate any help and guidance I can get.
Try the function cleanEmailAddress below:
let
cleanEmailAddress = (invalidEmailAddress as text) as text =>
let
removeLeadingNumbers = Text.AfterDelimiter(invalidEmailAddress, "."), // Assumes invalid numbers are followed by "." which itself also needs removing.
removeInvalidDomain = Text.BeforeDelimiter(removeLeadingNumbers, "#"),
replaceLastOccurrence = (someText as text, oldText as text, newText as text) as text =>
let
lastPosition = Text.PositionOf(someText, oldText, Occurrence.Last),
replaced = if lastPosition >= 0 then Text.ReplaceRange(someText, lastPosition, Text.Length(oldText), newText) else someText
in replaced,
overwriteTopLevelDomainSeparator = replaceLastOccurrence(removeInvalidDomain, "_", "."),
overwriteAtSymbol = replaceLastOccurrence(overwriteTopLevelDomainSeparator, "_", "#")
in overwriteAtSymbol,
cleaned = cleanEmailAddress("1234.my_email_gmail_com#error.invalid.com")
in
cleaned
Regarding:
"Remove the digits at the front (number of digits varies)"
Your question doesn't mention what to do with the leading . (which remains if you remove the leading digits), but your expected output ("my_email#gmail.com") suggests it should be removed. Email addresses which do not have . immediately after the leading digits, will return an error (and the logic for removeLeadingNumbers expression will need to be improved).
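The four steps from the question can also be sketched outside Power Query. The Python version below is only an illustration of the logic (clean_email is a hypothetical name). Unlike removeLeadingNumbers above, it strips the leading digits with a regex and treats the following dot as optional, so an address without the dot is still processed.

```python
import re

def clean_email(raw):
    # Drop "#error.invalid.com": keep only what precedes the first "#".
    candidate = raw.split("#", 1)[0]
    # Remove the leading digits and, if present, the dot after them.
    candidate = re.sub(r"^\d+\.?", "", candidate)
    # Split on the last two underscores: local part, domain, top-level domain.
    parts = candidate.rsplit("_", 2)
    if len(parts) < 3:
        return candidate  # fewer than two underscores: nothing to rebuild
    local, domain, tld = parts
    return local + "#" + domain + "." + tld

print(clean_email("1234.my_email_gmail_com#error.invalid.com"))
# prints: my_email#gmail.com
```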
This seems to work too:
let
Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
#"Added Custom" = Table.AddColumn(Source, "Valid", each Text.ReplaceRange(Text.ReplaceRange(Text.BetweenDelimiters([Column1],".","#"),Text.PositionOf(Text.BetweenDelimiters([Column1],".","#"),"_",Occurrence.Last),1,"."),Text.PositionOf(Text.ReplaceRange(Text.BetweenDelimiters([Column1],".","#"),Text.PositionOf(Text.BetweenDelimiters([Column1],".","#"),"_",Occurrence.Last),1,"."),"_",Occurrence.Last),1,"#"))
in
#"Added Custom"

Why does Find/Replace zRngResult.Find work fine, but RegEx myRegExp.Execute(zRngResult) mess up the range.Start?

I wish to select and add comments after certain words, e.g. “not”, “never”, “don’t”, in sentences in a Word document with VBA. Find/Replace with wildcards works fine, but “Use wildcards” cannot be selected together with “Match case”. The RegEx can use “IgnoreCase=True”, but the selection of the word is not reliable when there is more than one comment in a sentence. The Range.Start seems to be getting modified in a way that I cannot understand.
A similar question was asked in June 2010. https://social.msdn.microsoft.com/Forums/office/en-US/f73ca32d-0af9-47cf-81fe-ce93b13ebc4d/regex-selecting-a-match-within-the-document?forum=worddev
Is there a new/different way of solving this problem?
Any suggestion will be appreciated.
The code using RegEx follows:
Function zRegExCommentor(zPhrase As String, tComment As String) As Long
Dim sTheseSentences As Sentences
Dim rThisSentenceToSearch As Word.Range, rThisSentenceResult As Word.Range
Dim myRegExp As RegExp
Dim myMatches As MatchCollection
Options.CommentsColor = wdByAuthor
Set myRegExp = New RegExp
With myRegExp
.IgnoreCase = True
.Global = False
.Pattern = zPhrase
End With
Set sTheseSentences = ActiveDocument.Sentences
For Each rThisSentenceToSearch In sTheseSentences
Set rThisSentenceResult = rThisSentenceToSearch.Duplicate
rThisSentenceResult.Select
Do
DoEvents
Set myMatches = myRegExp.Execute(rThisSentenceResult)
If myMatches.Count > 0 Then
rThisSentenceResult.Start = rThisSentenceResult.Start + myMatches(0).FirstIndex
rThisSentenceResult.End = rThisSentenceResult.Start + myMatches(0).Length
rThisSentenceResult.Select
Selection.Comments.Add Range:=Selection.Range
Selection.TypeText Text:=tComment & "{" & zPhrase & "}"
rThisSentenceResult.Start = rThisSentenceResult.Start + 1 'so as not to find the same phrase again and again
rThisSentenceResult.End = rThisSentenceToSearch.End
rThisSentenceResult.Select
End If 'If myMatches.Count > 0 Then
Loop While myMatches.Count > 0
Next 'For Each rThisSentenceToSearch In sTheseSentences
End Function
Relying on Range.Start or Range.End for position in a Word document is not reliable due to how Word stores non-printing information in the text flow. For some kinds of things you can work around it using Range.TextRetrievalMode, but the non-printing characters inserted by Comments aren't affected by these settings.
I must admit I don't understand why Word's built-in Find with wildcards won't work for you - no case matching shouldn't be a problem. For instance, based on the example: "Never has there been, never, NEVER, a total drought.":
FindText:="[nN][eE][vV][eE][rR]"
Will find all instances of n-e-v-e-r regardless of the capitalization. The brackets let you define a set of characters, in this case the lower and upper case form of each letter in the search term.
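A pattern like this can be generated rather than typed by hand. The following Python sketch (the helper name word_wildcard is hypothetical) builds one two-character set per letter. It writes each set without a comma, since inside brackets a comma would itself count as a matchable character. Characters that have a special meaning in Word wildcards are not escaped here.

```python
def word_wildcard(word):
    parts = []
    for ch in word:
        if ch.lower() != ch.upper():
            # Letter with two cases: allow either form.
            parts.append("[" + ch.lower() + ch.upper() + "]")
        else:
            # Anything else (digits, apostrophes, ...) is copied unchanged.
            parts.append(ch)
    return "".join(parts)

print(word_wildcard("never"))  # prints: [nN][eE][vV][eE][rR]
```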
The workarounds described in my MSDN post you link to are pretty much all you can do if you insist on RegEx:
Using the Office Open XML (or possibly Word 2003 XML) file format will let you use RegEx and standard XML processing tools to find the information, add comment "tags" into the Word XML, close it all up... And when the user sees the document it will all be there.
If you need to be doing this in the Word UI, a slightly different approach should work (assuming you're targeting Word 2003 or later): Work through the document on a range-by-range basis (by paragraph, perhaps). Read the XML representation of the text into memory using the Range.WordOpenXML property, perform the RegEx search, add comments as WordOpenXML, then write the WordOpenXML back into the document using the InsertXML method, replacing the original range (paragraph). Since you'd be working with the Paragraph object, Range.Start won't be a factor.

Replacing a word with another in a string if a condition is met

I am trying to get some help with a function on replacing two words in a string with another word if a condition is true.
The condition is: if the word 'poor' follows 'not', then replace the whole string 'not ... poor' with 'rich'. The problem is that I don't know how to write the function; more specifically, how to check whether the word 'poor' follows 'not', and then what I have to write to make the replacement. I am pretty new to Python, so maybe it is a stupid question, but I hope someone will help me.
I want the function to do something like this:
string = 'I am not that poor'
new_string = 'I am rich'
Doubtless the regular expression pattern could be improved, but a quick and dirty way to do this is with Python's re module:
import re
# Raw string so that \s reaches the regex engine unchanged
patt = r'not\s+(.+\s)?poor'
s = 'I am not that poor'
sub_s = re.sub(patt, 'rich', s)
print('%s -> %s' % (s, sub_s))
s2 = 'I am not poor'
sub_s2 = re.sub(patt, 'rich', s2)
print('%s -> %s' % (s2, sub_s2))
s3 = 'I am poor not'
sub_s3 = re.sub(patt, 'rich', s3)
print('%s -> %s' % (s3, sub_s3))
Output:
I am not that poor -> I am rich
I am not poor -> I am rich
I am poor not -> I am poor not
The regular expression pattern patt matches the word 'not' followed by whitespace, then (optionally) other characters followed by a whitespace character, and then the word 'poor'.
Step One: Determine where the 'not' and 'poor' are inside your string (check out https://docs.python.org/2.7/library/stdtypes.html#string-methods)
Step Two: Compare the locations of 'not' and 'poor' that you just found. Does 'poor' come after 'not'? How could you tell? Are there any extra edge cases you should account for?
Step Three: If your conditions are not met, do nothing. If they are, everything between and including 'not' and 'poor' must be replaced by 'rich'. I'll leave you to decide how to do that, given the above documentation link.
Good luck, and happy coding!
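For readers who want to check their attempt, the three steps can be sketched with plain string methods. This is one possible reading of the steps, not the only one; replace_not_poor is a hypothetical name, and because str.find works on substrings it will also match inside longer words such as 'nothing' or 'poorly'.

```python
def replace_not_poor(text):
    # Step one: locate 'not', then look for 'poor' only after it.
    not_pos = text.find("not")
    if not_pos == -1:
        return text
    poor_pos = text.find("poor", not_pos + len("not"))
    # Step two: 'poor' follows 'not' only if it was found after it.
    if poor_pos == -1:
        return text
    # Step three: replace everything from 'not' through 'poor'.
    return text[:not_pos] + "rich" + text[poor_pos + len("poor"):]

print(replace_not_poor("I am not that poor"))  # prints: I am rich
print(replace_not_poor("I am poor not"))       # prints: I am poor not
```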
This is something I came up with. Works for your example, but will need tweaks (what if there is more than 1 word between not and poor).
my_string = 'I am not that poor'
print(my_string)
my_list = my_string.split(' ')
poor_pos = my_list.index('poor')
# Check whether 'not' is one or two words before 'poor'
if 'not' in (my_list[poor_pos - 1], my_list[poor_pos - 2]):
    not_pos = my_list.index('not')
    del my_list[not_pos:poor_pos+1]
    my_list.append('rich')
print(" ".join(my_list))
Output:
I am not that poor
I am rich

vbscript function clean string only allow certain characters

Until now I've been manually adding characters to replace that break my code. I'd like to be a bit more proactive, so I found this function that is supposed to replace everything EXCEPT valid characters. My first hurdle is that it doesn't seem to work. The code below is my full test file and the MsgBox comes up blank.
My second question is about performance. This function handles very, very large strings. Will this method be considerably slower? Anyone recommend anything else?
Function CleanUp (input)
Dim objRegExp, outputStr
Set objRegExp = New Regexp
objRegExp.IgnoreCase = True
objRegExp.Global = True
objRegExp.Pattern = "((?![a-zA-Z0-9]).)+"
outputStr = objRegExp.Replace(input, "-")
objRegExp.Pattern = "\-+"
outputStr = objRegExp.Replace(outputStr, "-")
CleanUp = outputStr
End Function
MsgBox (CleanUp("Test"))
Edit: I'm stupid and just saw the variable mixup I did which was causing it to return nothing. It is working now. Will still accept input for performance question or better suggestions.
You can simplify it even further.
objRegExp.Pattern = "[^\w+]"
I don't know what the expected result is for your example, but maybe you can try this pattern instead:
objRegExp.Pattern = "[^a-zA-Z0-9]"
Hope it works
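The suggested patterns do not behave identically, which a quick experiment makes visible. The Python sketch below is only an illustration (clean_up mirrors the function name from the question, and re.ASCII is used so that \w means [A-Za-z0-9_] as it does in VBScript). Adding a + after the character class replaces a whole run of invalid characters with a single dash, which makes the second pass that collapses repeated dashes unnecessary.

```python
import re

def clean_up(text):
    # One pass: each run of non-alphanumeric characters becomes one dash.
    return re.sub(r"[^a-zA-Z0-9]+", "-", text)

print(clean_up("Hello,   World!!"))  # prints: Hello-World-

# Inside a character class, + is a literal plus sign, and \w includes the
# underscore, so "[^\w+]" leaves both "_" and "+" untouched.
print(re.sub(r"[^\w+]", "-", "a_b+c d", flags=re.ASCII))  # prints: a_b+c-d
```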