Regex extract between triple double quotes and newlines - regex

For example, I want to parse a Python file, take the text between triple double quotes, and build an HTML table from that text.
A text block looks like this, for example:
"""
Replaces greater than operator ('>') with 'NOT BETWEEN 0 AND #'
Replaces equals operator ('=') with 'BETWEEN # AND #'
Tested against:
* Microsoft SQL Server 2005
* MySQL 4, 5.0 and 5.5
* Oracle 10g
* PostgreSQL 8.3, 8.4, 9.0
Requirement:
* Microsoft Access
Notes:
* Useful to bypass weak and bespoke web application firewalls that
filter the greater than character
* The BETWEEN clause is SQL standard. Hence, this tamper script
should work against all (?) databases
>>> tamper('1 AND A > B--')
'1 AND A NOT BETWEEN 0 AND B--'
>>> tamper('1 AND A = B--')
'1 AND A BETWEEN B AND B--'
"""
The HTML table must be a simple table containing 5 columns:
Column 1: everything between """ and \n, if the next line is empty
Column 2: everything between Tested against: and \n, if the next line is empty, or between Requirement: and \n, if the next line is empty
Column 3: everything between Notes: and \n, if the next line is empty
Column 4: everything between >>> and \n
Column 5: everything between the end of column 4 and \n
So the result must be:
Replaces greater than operator ('>') with 'NOT BETWEEN 0 AND #'
Replaces equals operator ('=') with 'BETWEEN # AND #'
Microsoft SQL Server 2005
MySQL 4, 5.0 and 5.5
Oracle 10g
PostgreSQL 8.3, 8.4, 9.0
or
Microsoft Access
Useful to bypass weak and bespoke web application firewalls that
filter the greater than character
The BETWEEN clause is SQL standard. Hence, this tamper script
should work against all (?) databases
tamper('1 AND A > B--')
tamper('1 AND A = B--')
'1 AND A NOT BETWEEN 0 AND B--'
'1 AND A BETWEEN B AND B--'
What kind of syntax can I use to extract that?
I will use VBScript.RegExp.
Set fso = CreateObject("Scripting.FileSystemObject")
txt = fso.OpenTextFile("C:\path\to\your.py").ReadAll
Set re = New RegExp
re.Pattern = """([^""]*)"""
re.Global = True
For Each m In re.Execute(txt)
WScript.Echo m.SubMatches(0)
Next

Your question is quite broad, so I'll just outline a way to deal with this. Otherwise I'd have to write the whole script for you, which isn't going to happen.
Extract everything between the docquotes, using a regular expression like this:
Set re1 = New RegExp
re1.Pattern = """""""([\s\S]*?)"""""""
For Each m In re1.Execute(txt)
docstr = m.SubMatches(0)
Next
Note that you need to set re1.Global to True if you have more than 1 docstring in your file and want all of them processed. Otherwise you'll get just the first match.
Remove leading and trailing whitespace with a second regular expression:
Set re2 = New RegExp
re2.Pattern = "^\s*|\s*$"
re2.Global = True 'find all matches
docstr = re2.Replace(docstr, "")
You can't use Trim for this, because the function handles only spaces, not other whitespace.
Either split the string at 2+ consecutive line breaks to get the doc sections, or use another regular expression to extract them:
Set re3 = New RegExp
re3.Pattern = "([\s\S]*?)\r\n\r\n" & _
"Tested against:\r\n([\s\S]*?)\r\n\r\n" & _
...
For Each m In re3.Execute(docstr) 'match against the trimmed docstring
descr = m.SubMatches(0)
tested = m.SubMatches(1)
...
Next
Continue breaking down the sections until you have the elements you want to display. Then build the HTML from these elements.
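For comparison, the same three steps (extract, trim, split into sections) can be sketched in Python. This is only an illustration under assumptions: the function name docstring_sections is made up here, and it assumes the sections inside a docstring are separated by at least one blank line, as described in the question.

```python
import re

def docstring_sections(source):
    """Return one list of sections per triple-double-quoted block in source."""
    result = []
    # Step 1: lazily capture everything between a pair of triple double quotes.
    for body in re.findall(r'"""([\s\S]*?)"""', source):
        # Step 2: strip leading and trailing whitespace of any kind.
        body = body.strip()
        # Step 3: split wherever two or more line breaks follow each other
        # (blank lines may contain spaces or tabs).
        sections = re.split(r'(?:\r?\n[ \t]*){2,}', body)
        result.append(sections)
    return result
```

Each section still carries its heading line (for example "Tested against:"), so a further split on single line breaks would be needed before building the table cells.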

Related

How to find and format multiple matching words with Regex in Word document using VB script

I have a Word document in which I have to format certain words using VBScript. The text can be as follows:
hello <bu ABC bu>, We are pleased to confirm our offer of employment to you. The terms and conditions that will apply to your employment with are set forth in this letter and Exhibit A attached hereto and incorporated herein by reference together, the “Agreement”
You have been offered and accepted the position of , presently reporting to . Your start date is expected to be
The words written inside the <bu ... bu> tags need to be bold and underlined. Currently I have a VBScript that finds the text given as an argument and makes it bold and underlined as required.
But to make the script more dynamic, I want it to match a regular expression pattern that I have written: (?<=(<bu))[a-zA-Z0-9 -:/\[]()]+(?=(bu>))
The script I have written:
Option Explicit
Function Macro1()
Dim strFilePath
strFilePath = "C:\Users\<UserID>\Documents\OfferLetterTemplate.docx"
Dim strTextToReplace
strTextToReplace = "<bu XYZ bu>"
Dim Word, objDoc, objSelection
Set Word = CreateObject("Word.Application")
Word.Visible = True
Dim wordfile
Set wordfile = Word.Documents.Open(strFilePath)
Set objDoc = Word.ActiveDocument
Set objSelection = Word.Selection
objSelection.Find.Forward = True
objSelection.Find.MatchWholeWord = False
objSelection.Find.ClearFormatting
objSelection.Find.Replacement.ClearFormatting
objSelection.Find.Replacement.Font.Bold = True
objSelection.Find.Replacement.Font.Underline = True
objSelection.Find.Text = strTextToReplace
objSelection.Find.Replacement.Text = ""
objSelection.Find.Execute , , , , , , , 0, , , 2
wordfile.save
Word.Quit
End Function
call Macro1
Can someone help me search for the regex given above and format all the matching occurrences at once?

Find and Replace REGEX results with new string

I want to replace a 10-digit pictureID number with a single text string in my WP database (table wp_posts, field post_content):
pictureid=0001234567 (where the last 7 digits are different for every photo)
with a single value:
sourceids=2518
When I query for the pictureID numbers with REGEXP, it seems to return all the records I want to change.
SELECT * FROM `wp_posts` WHERE `post_content` REGEXP 'pictureid=000[0-9][0-9][0-9][0-9][0-9][0-9][0-9]'
Next: how do I change the pictureID in the records found to sourceids=2518?
I tried
update wp_posts set post_content = replace(post_content, 'REGEXP 'pictureid=000[0-9][0-9][0-9][0-9][0-9][0-9][0-9]'','sourceids=2518');
but this doesn't work.
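Use REGEXP_REPLACE on the post_content column inside the UPDATE, for example: UPDATE wp_posts SET post_content = REGEXP_REPLACE(post_content, 'pictureid=000[0-9][0-9][0-9][0-9][0-9][0-9][0-9]', 'sourceids=2518'); Note that REGEXP_REPLACE is only available in MySQL 8.0 and later, or MariaDB 10.0.5 and later.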
Sorry, the reply was poorly formatted, so I will do it this way.
It's not working. I did the following to test REGEXP:
SELECT * FROM wp_posts WHERE post_content REGEXP 'pictureid=0001119708' = WORKING
SELECT * FROM wp_posts WHERE post_content REGEXP 'pictureid=000[0-9][0-9][0-9][0-9][0-9][0-9][0-9]' = WORKING
Then I tried to replace 'pictureid=000#######' (where # is any digit, for example 0001234567) with the single value 'sourceids=2518':
SELECT * FROM wp_posts WHERE post_content REGEXP_REPLACE ('pictureid=000[0-9][0-9][0-9][0-9][0-9][0-9][0-9]','sourceids=2518') = NOT WORKING
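To make the intended transformation concrete, here is a minimal Python sketch of the same replacement applied to a single post_content string. It is not a database update: the function name replace_picture_ids is hypothetical, and the pattern simply mirrors the one used in the REGEXP query above.

```python
import re

# Same pattern as in the SQL query: "pictureid=000" followed by 7 digits.
PICTURE_ID = re.compile(r'pictureid=000[0-9]{7}')

def replace_picture_ids(post_content):
    """Replace every 10-digit pictureid parameter with the fixed sourceids value."""
    return PICTURE_ID.sub('sourceids=2518', post_content)
```

A REGEXP_REPLACE call in SQL does the same thing per row: it returns the modified string, so it belongs on the right-hand side of SET in an UPDATE, not in a WHERE clause.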

Why does Find/Replace zRngResult.Find work fine, but RegEx myRegExp.Execute(zRngResult) mess up the range.Start?

I wish to select certain words, e.g. “not”, “never”, “don’t”, in the sentences of a Word document and add comments after them with VBA. Find/Replace with wildcards works fine, but “Use wildcards” cannot be selected together with “Match case”. RegEx can set “IgnoreCase=True”, but the selection of the word is not reliable when there is more than one comment in a sentence. Range.Start seems to be modified in a way that I cannot understand.
A similar question was asked in June 2010. https://social.msdn.microsoft.com/Forums/office/en-US/f73ca32d-0af9-47cf-81fe-ce93b13ebc4d/regex-selecting-a-match-within-the-document?forum=worddev
Is there a new/different way of solving this problem?
Any suggestion will be appreciated.
The code using RegEx follows:
Function zRegExCommentor(zPhrase As String, tComment As String) As Long
Dim sTheseSentences As Sentences
Dim rThisSentenceToSearch As Word.Range, rThisSentenceResult As Word.Range
Dim myRegExp As RegExp
Dim myMatches As MatchCollection
Options.CommentsColor = wdByAuthor
Set myRegExp = New RegExp
With myRegExp
.IgnoreCase = True
.Global = False
.Pattern = zPhrase
End With
Set sTheseSentences = ActiveDocument.Sentences
For Each rThisSentenceToSearch In sTheseSentences
Set rThisSentenceResult = rThisSentenceToSearch.Duplicate
rThisSentenceResult.Select
Do
DoEvents
Set myMatches = myRegExp.Execute(rThisSentenceResult)
If myMatches.Count > 0 Then
rThisSentenceResult.Start = rThisSentenceResult.Start + myMatches(0).FirstIndex
rThisSentenceResult.End = rThisSentenceResult.Start + myMatches(0).Length
rThisSentenceResult.Select
Selection.Comments.Add Range:=Selection.Range
Selection.TypeText Text:=tComment & "{" & zPhrase & "}"
rThisSentenceResult.Start = rThisSentenceResult.Start + 1 'so as not to find the same phrase again and again
rThisSentenceResult.End = rThisSentenceToSearch.End
rThisSentenceResult.Select
End If 'If myMatches.Count > 0 Then
Loop While myMatches.Count > 0
Next 'For Each rThisSentenceToSearch In sTheseSentences
End Function
Relying on Range.Start or Range.End for position in a Word document is not reliable due to how Word stores non-printing information in the text flow. For some kinds of things you can work around it using Range.TextRetrievalMode, but the non-printing characters inserted by Comments aren't affected by these settings.
I must admit I don't understand why Word's built-in Find with wildcards won't work for you - no case matching shouldn't be a problem. For instance, based on the example: "Never has there been, never, NEVER, a total drought.":
FindText:="[nN][eE][vV][eE][rR]"
This will find all instances of n-e-v-e-r regardless of capitalization. The brackets let you define a set of characters, in this case the lower- and upper-case forms of each letter in the search term. (Don't separate the letters with commas: inside brackets a comma is matched literally.)
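Writing one bracket set per letter by hand gets tedious for a longer word list. The following Python sketch generates such a wildcard string from a plain word; the helper name word_wildcard_any_case is hypothetical, and it assumes the word contains no characters that are special in Word's wildcard syntax (those would need escaping, which this sketch does not do).

```python
def word_wildcard_any_case(term):
    """Build a Word wildcard pattern that matches term in any capitalization."""
    parts = []
    for ch in term:
        if ch.isalpha():
            # One bracket set per letter: lower-case and upper-case form.
            parts.append('[' + ch.lower() + ch.upper() + ']')
        else:
            # Non-letters (e.g. the apostrophe in "don't") are copied as-is.
            parts.append(ch)
    return ''.join(parts)
```

The resulting string can then be pasted into FindText with wildcards enabled.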
The workarounds described in my MSDN post you link to are pretty much all you can do if you insist on RegEx:
Using the Office Open XML (or possibly Word 2003 XML) file format will let you use RegEx and standard XML processing tools to find the information, add comment "tags" into the Word XML, close it all up... And when the user sees the document it will all be there.
If you need to be doing this in the Word UI a slightly different approach should work (assuming you're targeting Word 2003 or later): Work through the document on a range-by-range basis (by paragraph, perhaps). Read the XML representation of the text into memory using the Range.WordOpenXML property, perform the RegEx search, add comments as WordOpenXML, then write the WordOpenXML back into the document using the InsertXML method, replacing the original range (paragraph). Since you'd be working with the Paragraph object Range.Start won't be a factor.

Parsing Transact SQL with RegEx

I'm quite inexperienced with RegEx - just an occasional straightforward RegEx for a programming task that I worked out by trial and error, but now I have a serious RegEx challenge:
I have about 970 text files containing Sybase Transact SQL snippets, and I need to find every table name in those files and preface the table name with ' #'. So my options are to either spend a week editing the files by hand or write a script or application using RegEx (Python 3 or Delphi-PCRE) that will perform this task.
The rules are as follows:
Table names are ALWAYS upperCase - so I'm only looking for upperCase words;
Column names, SQL expressions and variables are ALWAYS lowerCase;
SQL keywords, Table aliases and column values CAN BE upperCase, but must NOT be prefixed with ' #';
Table aliases (must not be prefixed) will always have whiteSpace preceding them until the end of the previous word, which will be a table name.
Column values (must not be prefixed) will either be numerical values or characters enclosed in quotes.
Here is some sample text requiring application of all the above mentioned rules:
update SYBASE_TABLE
set ok = convert(char(10),MB.limit)
from MOVE_BOOKS MB, PEOPLEPLACES PPL
where MB.move_num = PPL.move_num
AND PPL.mot_ind = 'B'
AND PPL.trade_type_ind = 'P'
So far I've only gotten this far (not too far...):
(?-i)[[:upper:]]
Any help would be most appreciated.
TIA,
MN
This is not doable with a simple regex replacement. You will not be able to distinguish between upper case words that are tables, that sit inside string literals, or that are commented out:
update TABLE set x='NOT_A_TABLE' where y='NOT TABLES EITHER'
-- AND NO TABLES HERE AS WELL
EDIT
You seem to think that determining if a word is inside a string literal or not is easy, then consider SQL like this:
-- a quote: '
update TABLE set x=42 where y=666
-- another quote: '
or
update TABLE set x='not '' A '''' table' where y=666
EDIT II
Okay, I (obsessively) hammered on the fact that a simple regex replacement is not doable, but I haven't offered a (possible) solution yet. What you could do is create some sort of "hybrid lexer" based on a couple of different regexes: scan through the input and, at each position, try to match either a comment, a string literal, a keyword, or a capitalized word. If none of these 4 patterns matches, just consume a single character and repeat the process.
A little demo in Python:
#!/usr/bin/env python3
import re
sql = """
UPDATE SYBASE_TABLE
SET ok = convert(char(10),MB.limit) -- ignore me!
from MOVE_BOOKS MB, PEOPLEPLACES PPL
where MB.move_num = PPL.move_num
-- comment '
AND PPL.mot_ind = 'B '' X'
-- another comment '
AND PPL.trade_type_ind = 'P -- not a comment'
"""
regex = r"""(?xs)               # x = enable inline comments, s = enable DOT-ALL
  (--[^\r\n]*)                  # [1] comments
  |                             # OR
  ('(?:''|[^\r\n'])*')          # [2] string literal
  |                             # OR
  (\b(?:AND|UPDATE|SET)\b)      # [3] keywords
  |                             # OR
  ([A-Z][A-Z_]*)                # [4] capitalized word
  |                             # OR
  .                             # [5] fall through: matches any char
"""
output = ''
for m in re.finditer(regex, sql):
    # prepend a `#` if group(4) matched
    if m.group(4):
        output += '#'
    # append the matched text (any of the groups!)
    output += m.group()
# print the adjusted SQL
print(output)
which produces:
UPDATE #SYBASE_TABLE
SET ok = convert(char(10),#MB.limit) -- ignore me!
from #MOVE_BOOKS #MB, #PEOPLEPLACES #PPL
where #MB.move_num = #PPL.move_num
-- comment '
AND #PPL.mot_ind = 'B '' X'
-- another comment '
AND #PPL.trade_type_ind = 'P -- not a comment'
This may not be the exact output you want, but I'm hoping the script is simple enough for you to adjust to your needs.
Good luck.
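One possible adjustment, sketched under assumptions rather than offered as a finished solution: the demo above also prefixes table aliases, which the question's rules forbid. The variant below treats an upper-case word as an alias when it is directly followed by a dot (alias use, as in MB.limit) or when it follows a table name with only spaces or tabs in between (alias declaration, as in MOVE_BOOKS MB). The function name prefix_tables, the keyword list, and the group names are illustrative; schema-qualified names such as DB.TABLE are not handled.

```python
import re

TOKEN = re.compile(r"""(?x)
    (?P<comment>--[^\r\n]*)                              # SQL line comment
  | (?P<string>'(?:''|[^\r\n'])*')                       # string literal
  | (?P<keyword>\b(?:AND|OR|UPDATE|SET|FROM|WHERE)\b)    # upper-case keywords (extend as needed)
  | (?P<upper>\b[A-Z][A-Z_0-9]*\b)                       # upper-case word
  | (?P<space>[ \t]+)                                    # spaces and tabs
  | (?P<other>.)                                         # anything else, one char at a time
""", re.S)

def prefix_tables(sql, prefix="#"):
    out = []
    prev_was_table = False  # True right after a table name (spaces allowed in between)
    for m in TOKEN.finditer(sql):
        kind, text = m.lastgroup, m.group()
        if kind == "upper":
            followed_by_dot = sql[m.end():m.end() + 1] == "."
            if prev_was_table or followed_by_dot:
                # alias declaration ("MOVE_BOOKS MB") or alias use ("MB.limit")
                prev_was_table = False
            else:
                text = prefix + text
                prev_was_table = True
        elif kind != "space":
            # any other token ends the "table name just seen" state
            prev_was_table = False
        out.append(text)
    return "".join(out)
```

On the sample from the question this prefixes SYBASE_TABLE, MOVE_BOOKS and PEOPLEPLACES and leaves MB, PPL, AND and the quoted values alone.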

What's Regular Expression for update Assembly build number in AssemblyInfo.cs file?

I'm writing a VS 2008 macro to replace the assembly version in the AssemblyInfo.cs file. According to MSDN, the assembly version must be written using the following pattern.
major.minor[.build[.revision]]
Example
1.0
1.0.1234
1.0.1234.0
I need to dynamically generate the build number for the AssemblyInfo.cs file and use a regular expression to replace the old build number with the newly generated one.
Do you have a regular expression for this? Moreover, a build number inside a commented-out statement, as in the code below, must not be replaced. Finally, don't forget to check your regex against inline comments.
Don't replace any commented build number
//[assembly: AssemblyVersion("0.1.0.0")]
/*[assembly: AssemblyVersion("0.1.0.0")]*/
/*
[assembly: AssemblyTrademark("")]
[assembly: AssemblyCulture("")]
[assembly: ComVisible(false)]
[assembly: AssemblyVersion("0.1.0.0")]
*/
Replace build numbers that are not commented
[assembly: AssemblyVersion("0.1.0.0")] // inline comment
/* inline comment */ [assembly: AssemblyVersion("0.1.0.0")]
[assembly: /*inline comment*/AssemblyVersion("0.1.0.0")]
Hint.
Please try your regex at Online Regular Expression Testing Tool
This is somewhat crude, but you could do the following.
Search for:
^{\[assembly\: :w\(\"0\.1\.}\*
Replace with:
\1####
Where #### is your replacement string.
This regex works as follows:
It starts by searching for lines beginning with \[assembly\: (^ indicates the beginning of a line, backslashes escape special characters), followed by...
...some alphabetic identifier :w, followed by...
...an opening parenthesis \(, followed by...
...the beginning of the version string, in quotes \"0\.1\., finally followed by...
...an asterisk \*.
Steps 1-4 are captured as the first tagged expression using the curly braces { } surrounding them.
The replacement string emits the tagged expression verbatim with \1, so that it is not harmed, followed by your replacement string, some ####.
Commented lines are ignored as they do not start with [assembly: . Subsequent in-line comments are left untouched as they are not captured by the regex.
If this isn't exactly what you need, it's fairly straightforward to experiment with the regex to capture and/or replace different parts of the line.
I doubt using regular expressions will do you much good here. While it could be possible to formulate an expression that matches "uncommented" assembly version attributes it will be hard to maintain and understand.
You are making it very very hard on yourself with the syntax that you present. What about enforcing a coding standard on your AssemblyInfo.cs file that says that lines should always be commented out with a beginning // and forbid inline comments? Then it should be easy enough to parse it using a StreamReader.
If you can't do that, then there's only one parser that is guaranteed to handle all of your edge cases, and that's the C# compiler. How about just compiling your assembly and then reflecting over it to detect the version number?
var asm = Assembly.LoadFile("foo.dll"); // LoadFile expects an absolute path
var version = asm.GetName().Version; // version of the loaded assembly, not the executing one
If you're simply interested in incrementing your build number you should have a look at this question: Can I automatically increment the file build version when using Visual Studio?
You can achieve the same effect much more easily by downloading and installing the MSBuild Extension Pack and adding the following line at the top of your .csproj file:
<Import Project="$(MSBuildExtensionsPath)\ExtensionPack\MSBuild.ExtensionPack.VersionNumber.targets"/>
This will automatically use the current date (MMdd) as the build number and increment the revision number for you. To override the major and minor versions, which are set to 1.0 by default, just add the following anywhere in the .csproj file:
<PropertyGroup>
<AssemblyMajorVersion>2</AssemblyMajorVersion>
<AssemblyFileMajorVersion>1</AssemblyFileMajorVersion>
</PropertyGroup>
You can further customize how the build number and revision are generated, and even set company, copyright, etc., by setting other properties; see this page for the list of properties.
I just found the answer to my question, but it is a very complicated and very long regex. However, I use it only once per solution, so it doesn't affect overall performance. Please look at my complete source code.
Module EnvironmentEvents.vb
Public Module EnvironmentEvents
Private Sub BuildEvents_OnBuildBegin(ByVal Scope As EnvDTE.vsBuildScope, ByVal Action As EnvDTE.vsBuildAction) Handles BuildEvents.OnBuildBegin
If DTE.Solution.FullName.EndsWith(Path.DirectorySeparatorChar & "[Solution File Name]") Then
If Scope = vsBuildScope.vsBuildScopeSolution And Action = vsBuildAction.vsBuildActionRebuildAll Then
AutoGenerateBuildNumber()
End If
End If
End Sub
End Module
Module AssemblyInfoHelp.vb
Public Module AssemblyInfoHelper
ReadOnly AssemblyInfoPath As String = Path.Combine("Common", "GlobalAssemblyInfo.cs")
Sub AutoGenerateBuildNumber()
'Declare required variables
Dim solutionPath As String = Path.GetDirectoryName(DTE.Solution.Properties.Item("Path").Value)
Dim globalAssemblyPath As String = Path.Combine(solutionPath, AssemblyInfoPath)
Dim globalAssemblyContent As String = ReadFileContent(globalAssemblyPath)
Dim rVersionAttribute As Regex = New Regex("\[[\s]*(\/\*[\s\S]*?\*\/)?[\s]*assembly[\s]*(\/\*[\s\S]*?\*\/)?[\s]*:[\s]*(\/\*[\s\S]*?\*\/)?[\s]*AssemblyVersion[\s]*(\/\*[\s\S]*?\*\/)?[\s]*\([\s]*(\/\*[\s\S]*?\*\/)?[\s]*\""([0-9]+)\.([0-9]+)(\.([0-9]+))?(\.([0-9]+))?\""[\s]*(\/\*[\s\S]*?\*\/)?[\s]*\)[\s]*(\/\*[\s\S]*?\*\/)?[\s]*\]")
Dim rVersionInfoAttribute As Regex = New Regex("\[[\s]*(\/\*[\s\S]*?\*\/)?[\s]*assembly[\s]*(\/\*[\s\S]*?\*\/)?[\s]*:[\s]*(\/\*[\s\S]*?\*\/)?[\s]*AssemblyInformationalVersion[\s]*(\/\*[\s\S]*?\*\/)?[\s]*\([\s]*(\/\*[\s\S]*?\*\/)?[\s]*\""([0-9]+)\.([0-9]+)(\.([0-9]+))?[\s]*([^\s]*)[\s]*(\([\s]*Build[\s]*([0-9]+)[\s]*\))?\""[\s]*(\/\*[\s\S]*?\*\/)?[\s]*\)[\s]*(\/\*[\s\S]*?\*\/)?[\s]*\]")
'Find Version Attribute for Updating Build Number
Dim mVersionAttributes As MatchCollection = rVersionAttribute.Matches(globalAssemblyContent)
Dim mVersionAttribute As Match = GetFirstUnCommentedMatch(mVersionAttributes, globalAssemblyContent)
Dim gBuildNumber As Group = mVersionAttribute.Groups(9)
Dim newBuildNumber As String
'Replace Version Attribute for Updating Build Number
If (gBuildNumber.Success) Then
newBuildNumber = GenerateBuildNumber(gBuildNumber.Value)
globalAssemblyContent = globalAssemblyContent.Substring(0, gBuildNumber.Index) + newBuildNumber + globalAssemblyContent.Substring(gBuildNumber.Index + gBuildNumber.Length)
End If
'Find Version Info Attribute for Updating Build Number
Dim mVersionInfoAttributes As MatchCollection = rVersionInfoAttribute.Matches(globalAssemblyContent)
Dim mVersionInfoAttribute As Match = GetFirstUnCommentedMatch(mVersionInfoAttributes, globalAssemblyContent)
Dim gBuildNumber2 As Group = mVersionInfoAttribute.Groups(12)
'Replace Version Info Attribute for Updating Build Number
If (gBuildNumber2.Success) Then
If String.IsNullOrEmpty(newBuildNumber) Then
newBuildNumber = GenerateBuildNumber(gBuildNumber2.Value)
End If
globalAssemblyContent = globalAssemblyContent.Substring(0, gBuildNumber2.Index) + newBuildNumber + globalAssemblyContent.Substring(gBuildNumber2.Index + gBuildNumber2.Length)
End If
WriteFileContent(globalAssemblyPath, globalAssemblyContent)
End Sub
Function GenerateBuildNumber(Optional ByVal oldBuildNumber As String = "0") As String
oldBuildNumber = Int16.Parse(oldBuildNumber) + 1
Return oldBuildNumber
End Function
Private Function GetFirstUnCommentedMatch(ByRef mc As MatchCollection, ByVal content As String) As Match
Dim rSingleLineComment As Regex = New Regex("\/\/.*$")
Dim rMultiLineComment As Regex = New Regex("\/\*[\s\S]*?\*\/")
Dim mSingleLineComments As MatchCollection = rSingleLineComment.Matches(content)
Dim mMultiLineComments As MatchCollection = rMultiLineComment.Matches(content)
For Each m As Match In mc
If m.Success Then
For Each singleLine As Match In mSingleLineComments
If singleLine.Success Then
If m.Index >= singleLine.Index And m.Index + m.Length <= singleLine.Index + singleLine.Length Then
GoTo NextAttribute
End If
End If
Next
For Each multiLine As Match In mMultiLineComments
If multiLine.Success Then
If m.Index >= multiLine.Index And m.Index + m.Length <= multiLine.Index + multiLine.Length Then
GoTo NextAttribute
End If
End If
Next
Return m
End If
NextAttribute:
Next
Return Nothing
End Function
End Module
Thank you, everybody.
PS. Special thanks to RegExr: Online Regular Expression Testing Tool (http://gskinner.com/RegExr/), the best online regex tool I have ever used.
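The core idea of the macro above (collect all candidate attributes, collect all comment spans, then pick the first candidate that is not inside a comment) can be sketched more compactly in Python. This is an illustrative sketch only: the names VERSION_ATTR, COMMENT and bump_build are made up here, the attribute pattern allows just one inline comment (before the attribute name), and, like the macro, it would be fooled by // or /* appearing inside a string literal.

```python
import re

# [assembly: AssemblyVersion("major.minor[.build[.revision]]")], optionally with
# one /* ... */ comment between the colon and the attribute name.
VERSION_ATTR = re.compile(
    r'\[\s*assembly\s*:\s*(?:/\*(?:(?!\*/).)*\*/\s*)?AssemblyVersion\s*'
    r'\(\s*"(\d+)\.(\d+)(?:\.(\d+))?(?:\.(\d+))?"\s*\)\s*\]',
    re.S,
)
# Single-line and multi-line comments, scanned left to right in one pass.
COMMENT = re.compile(r'//[^\r\n]*|/\*.*?\*/', re.S)

def bump_build(source):
    """Increment the build number of the first uncommented AssemblyVersion."""
    comment_spans = [m.span() for m in COMMENT.finditer(source)]
    for m in VERSION_ATTR.finditer(source):
        # Skip attributes that lie entirely inside a comment.
        inside = any(s <= m.start() and m.end() <= e for s, e in comment_spans)
        if inside or m.group(3) is None:
            continue
        new_build = str(int(m.group(3)) + 1)
        # Splice the new build number over the old one, leaving the rest untouched.
        return source[:m.start(3)] + new_build + source[m.end(3):]
    return source
```

Scanning both comment styles with a single alternation avoids treating a /* that appears inside a // comment as the start of a block comment.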