I am trying to build a regex pattern for text like this:
numb contra: 1.29151306 number mafo: 66662308
numb contra 1.30789668number mafo 60.046483
numb contra/ 1.29154056 number mafo: 666692638
numb contra 137459625
mafo: 666692638
mafo: 666692638 numb contra/ 1.29154056
Here's the pattern I could build
contra?.\s+?(\d+\.?\d+)(.+mafo.?\s+(\d+\.?\d+))?
It works fine for all the lines except the last one. How can I implement all the possibilities to include the last line too?
Please have a look at this link
https://regex101.com/r/pSThAU/1
contra is matched correctly, but mafo is not.
I think the key here is to make your regex do less and your VBA do more. What I see in your data is either the word 'mafo' or the word 'contra' with a number following it. I don't know the order, whether each one is present, or how many times each occurs. So you can scan each of your strings for ALL occurrences with a regex like this:
(?:^|[^A-Z])(?:(mafo)|(contra))[^A-Z]\s*(\d*\.?\d+)
Then process it with some VBA code like this that I created in Excel:
Sub BreakItUp()
    Dim rg As RegExp, scanned As MatchCollection, eachMatch As Match, i As Long, col As Long
    Set rg = New RegExp
    rg.Pattern = "(?:^|[^A-Z])(?:(mafo)|(contra))[^A-Z]\s*(\d*\.?\d+)"
    rg.IgnoreCase = True
    rg.Global = True
    i = 1
    Do While (Not IsEmpty(ActiveSheet.Cells(i, 1).Value))
        Set scanned = rg.Execute(ActiveSheet.Cells(i, 1).Value)
        col = 2
        For Each eachMatch In scanned
            ActiveSheet.Cells(i, col).Value = eachMatch.SubMatches(0) & eachMatch.SubMatches(1)
            ActiveSheet.Cells(i, col + 1).Value2 = "'" & eachMatch.SubMatches(2)
            col = col + 2
        Next eachMatch
        i = i + 1
    Loop
End Sub
The MatchCollection object gets one item for each match that occurs, and the SubMatches array of each match contains its capturing groups. You should be able to write your own logic within this processing loop to interpret what was extracted. When I ran it on your data it created all the fields in blue:
Notice I added a line to your data that had two contra entries and one mafo and it found all the occurrences. You should be able to modify this to interpret the meanings.
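For anyone who wants to try the scan-all-occurrences idea outside Excel, here is a minimal sketch in Python rather than VBA. It reuses the same pattern with case-insensitive matching; the function name `scan_line` is made up for illustration and is not part of the answer above.

```python
import re

# Same pattern as in the answer; re.IGNORECASE plays the role of rg.IgnoreCase = True.
PATTERN = re.compile(
    r"(?:^|[^A-Z])(?:(mafo)|(contra))[^A-Z]\s*(\d*\.?\d+)",
    re.IGNORECASE,
)


def scan_line(text):
    """Return (keyword, number) pairs for every mafo/contra occurrence, in order."""
    pairs = []
    for m in PATTERN.finditer(text):
        keyword = m.group(1) or m.group(2)  # exactly one of the two groups matched
        pairs.append((keyword, m.group(3)))
    return pairs
```

Because every occurrence comes back as its own pair, the order of mafo and contra in the line no longer matters, which is what the single all-in-one pattern could not handle.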
I have thousands of mapping patterns that I need to convert. The attached image shows the source values that I need to convert into target values.
The rules I have been able to decipher so far are:
If a dash ('-') comes just before a value, it is converted to a pipe ('|').
If there are multiple dashes in the middle, e.g. 4 dashes, they are converted to 4 pipes and 3 dashes, as shown in the second example, to show 3 empty fields (|-|-|-|).
If there are multiple dashes at the end, e.g. 3 dashes, they are converted to 3 pipes and 3 dashes, as shown in the first example, to show 3 empty fields without a pipe at the end (|-|-|-).
There is never a dash at the beginning.
There are 8 values in total. Each |-| is considered an empty value. Fields are separated by pipes.
I am looking for a way to convert the source values into the intended target values using any software.
This quick user defined function appears to meet your requirements without regular expressions.
Function mapSource(str As String)
    Dim tmp As Variant, i As Long
    'strip leading hyphens
    Do While Left(str, 1) = Chr(45) And Len(str) > 0: str = Right(str, Len(str) - 1): Loop
    'split str to a maximum of 8 array elements
    tmp = Split(str, Chr(45), 8)
    'preserve an array of 8 elements
    ReDim Preserve tmp(7)
    'replace empty array elements with hyphens
    For i = LBound(tmp) To UBound(tmp)
        If tmp(i) = vbNullString Then tmp(i) = Chr(45)
    Next i
    'rejoin array into str
    str = Join(tmp, Chr(124))
    'output mapped str
    mapSource = str
End Function
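If you would rather run the conversion outside Excel, the same steps can be sketched in Python. This is only an illustration of the approach above, not a tested replacement; the name `map_source` and the sample inputs are made up, since the original image of source and target values is not available here.

```python
def map_source(src):
    """Split on hyphens into at most 8 fields and mark empty fields with '-'."""
    # strip leading hyphens, then split into at most 8 pieces (7 splits)
    fields = src.lstrip("-").split("-", 7)
    # pad to exactly 8 fields, mirroring ReDim Preserve tmp(7)
    fields += [""] * (8 - len(fields))
    # empty fields become a single hyphen; fields are joined with pipes
    return "|".join(field if field else "-" for field in fields)
```

With a hypothetical input such as `A-B----C`, the four middle dashes yield three empty fields, and the result is padded with empty fields up to eight values.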
I have a bunch of data that contains any number of 5-digit strings in completely inconsistent formats, and I want to extract these 5-digit strings (in bold). I am not interested in strings with fewer or more than 5 digits. As an example, this is the kind of data I have in my file:
Cell A1: "1. 76589 - wholesale activities. 2. 33476 - general"
Cell A2: "WHOLESALE ACTIVITIES (76589). SHIPPING (12235). REAL ESTATE ACTIVITIES (67333)"
Cell A3: "1. 33476 General. 658709 annual road. Unknown 563"
I've tried the usual SEARCH/FIND, MIN, LEFT/RIGHT/MID functions, but I am not sure how to get them to produce the result I need, and even text-to-columns didn't give me a clean result.
Thanks in advance.
Here is a macro that will split your line into the columns as you requested.
The range being processed is whatever you have selected.
The results are written into the adjacent columns on the same row.
Depending on your worksheet setup, you may want to "clear out" the rows where the results are going before executing the extraction code.
You can also write code to select the data to be processed automatically. Plenty of examples on this forum.
Option Explicit
Sub Extract5Digits()
    Dim R As Range, C As Range
    Dim RE As Object, MC As Object, M As Object
    Dim I As Long
    Set R = Selection
    Set RE = CreateObject("vbscript.regexp")
    With RE
        .Global = True
        .Pattern = "\b\d{5}\b"
        For Each C In R
            If .test(C.Text) = True Then
                I = 0
                Set MC = .Execute(C.Text)
                For Each M In MC
                    I = I + 1
                    C.Offset(0, I) = M
                Next M
            End If
        Next C
    End With
End Sub
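To check what the `\b\d{5}\b` pattern picks up from the sample cells before running the macro, a quick sketch in Python can help. The helper name `extract_five_digit` is made up for illustration; the pattern is the one used in the macro above.

```python
import re

# \b on both sides rejects runs of digits that are shorter or longer than five
FIVE_DIGITS = re.compile(r"\b\d{5}\b")


def extract_five_digit(text):
    """Return every standalone 5-digit string in text, in order of appearance."""
    return FIVE_DIGITS.findall(text)
```

On the sample data, the 6-digit 658709 and the 3-digit 563 are skipped because of the word boundaries, while numbers wrapped in parentheses are still found.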
With Excel functions alone this is impossible.
The best way is to use the VBScript Regular Expressions 5.5 library in VBA.
Let's consider this example:
+---+--------------------------------------------------------------+
| | A |
+---+--------------------------------------------------------------+
| 1 | Cell A3: "1. 33476 General. 658709 annual road. Unknown 563" |
| 2 | 33476 |
+---+--------------------------------------------------------------+
From the Excel file hit Alt + F11, then go to Tools > References and select "Microsoft VBScript Regular Expressions 5.5".
Then you can use the following function definition:
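Public Function Get5DigitsNumer(search_str As String) As String
    Dim regEx As New VBScript_RegExp_55.RegExp
    Dim matches As VBScript_RegExp_55.MatchCollection
    ' default result when no 5-digit number is found
    Get5DigitsNumer = ""
    ' \b keeps the match from landing inside a longer run of digits such as 658709
    regEx.Pattern = "\b[0-9]{5}\b"
    regEx.Global = True
    If regEx.Test(search_str) Then
        Set matches = regEx.Execute(search_str)
        ' the pattern has no capturing group, so read the whole match
        Get5DigitsNumer = matches(0).Value
    End If
End Function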
You can then call it like this:
Sub PatternExtractor()
    Range("A2").Value = Get5DigitsNumer(Range("A1"))
End Sub
This takes the value of cell A1, extracts the first 5-digit number, and saves the result into cell A2.
At the moment I don't have any idea how this code could handle a cell that contains more than one 5-digit number, like "1. 76589 - wholesale activities. 2. 33476 - general" in your example.
I suggest you have a look at this answer. The pattern is different, but the question is very similar to yours.
The only way you can do it is by writing a regex in VBA. I would recommend that you look at this question.
I've got a little problem with regards to iterating the filename of the txt files. I've got a filename format that goes like this: <date>-<year>_filename-<number>.txt. The problem is that when <number> reaches 9, the filename stops iterating.
The filenames look like this:
31-2014_filename-1
31-2014_filename-2
31-2014_filename-3
31-2014_filename-4
31-2014_filename-5
31-2014_filename-6
31-2014_filename-7
31-2014_filename-8
31-2014_filename-9
31-2014_filename-10
The function only detects up to 9. Anything beyond that number is ignored.
Below is the code
Dim lastreport As Integer = 1
Public Sub GetLastNo(ByVal filePath As String)
    Dim lastFile As String = 1
    Dim files() As String = Directory.GetFiles(filePath, "*.txt")
    For Each File As String In files
        File = Path.GetFileNameWithoutExtension(File)
        Dim numbers As MatchCollection = Regex.Matches(File, "(?<num>[\d]+)")
        For Each number In numbers
            number = CInt(number.ToString())
            If number > 0 And number < 1000 And number > lastFile Then
                lastFile = number
            End If
            lastreport = number
        Next
    Next
End Sub
Here it is:
(?<num>\d+(?=$))
The lookahead (?=$) makes sure that the digits are immediately followed by the end of the string, so only the last set of digits in the name is matched.
It would really help to see some real filenames, including some that fail to match (your description is not completely unambiguous: for example what is <date> if it does not include the year?).
But assuming files like:
30May-2014_Stuff-1.txt
30May-2014_Stuff-3.txt
30May-2014_Stuff-5.txt
30May-2014_Stuff-7.txt
30May-2014_Stuff-9.txt
30May-2014_Stuff-11.txt
then using the .NET regex engine (from PowerShell (PSH) here as quicker to test with):
(?<num>\d+)$
should match the final digits of the filename without its extension ($ matches the end of the string; the extension-free name is BaseName in PSH):
dir | foreach { if ($_.BaseName -match '(?<num>\d+)$') { $matches['num'] } }
gives:
1
11
3
5
7
9
So all filenames are matched, and the final number of their basenames is matched by group "num" of the regex.
I think there is something else going on in your approach: I would suggest changing to only get a single match per filename (and use Regex.Match rather than Matches to be consistent).
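The single-match-per-filename idea can be sketched as follows. This is Python rather than VB.NET, purely for illustration; `last_report_number` is a made-up name, and it assumes the extension has already been stripped from each name.

```python
import re

# Anchoring at the end captures only the final run of digits in the name
TRAILING_NUMBER = re.compile(r"(?P<num>\d+)$")


def last_report_number(basenames):
    """Return the highest trailing number among the names, or 0 if none has one."""
    highest = 0
    for name in basenames:
        m = TRAILING_NUMBER.search(name)  # at most one match per filename
        if m:
            highest = max(highest, int(m.group("num")))
    return highest
```

Comparing the numbers as integers matters here: compared as strings, "9" sorts after "10", which is one way a counter can appear to stop at 9.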
Set Regex = New RegExp
Regex.Pattern = """[^""]*""|[^,]*"
Regex.Global = True
' I have a For loop here to loop through records
text = Cells.Item(r, 7).Value
For Each Match In Regex.Execute(text)
    count = count + 1
Next Match
This is my regex code, and here is the table I am pulling the data from.
When I run the code in debug mode, the PCBaa count comes up as two, c3 and c4 come up as 14, and C6-c36 come up as 36. Is my regex code wrong for extracting the codes between the commas?
OK, I have tried that myself. First off, it seems you don't reset the count value to 0 after each line. That could be intentional, but just so you know.
The second thing is that the regular expression works nearly fine but always gives you double the count, because it matches a zero-length string at the end of each match.
So for the last line (C6-C26) it matches:
1) "C6" 2) "" 3) "C7" 4) "" ... and so on.
To be honest, I'm a little surprised myself and don't know exactly why that's the case for now.
But the solution is pretty easy: since you don't want any zero-length strings in the result (so they don't get counted), you simply exchange the * for a +, which tells the regular expression to match only if there's at least one character.
So your regular expression string should look like:
Regex.Pattern = """[^""]+""|[^,]+"
Why you got a count of 14 for c3, c4 surprises me... I got 4, which makes sense because of the double counting due to the zero-length matches.
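The doubling can be reproduced outside VBA. The sketch below uses Python's re module, which shows the same behaviour for these two patterns; the names `STAR`, `PLUS` and `count_matches` are made up for illustration. A likely explanation is that after a real match the next attempt starts at the comma, where `[^,]*` succeeds with zero characters, and the same happens again at the end of the string.

```python
import re

STAR = re.compile(r'"[^"]*"|[^,]*')  # original pattern: may match nothing at all
PLUS = re.compile(r'"[^"]+"|[^,]+')  # fixed pattern: needs at least one character


def count_matches(pattern, text):
    """Count every match of pattern in text, including zero-length ones."""
    return len(pattern.findall(text))
```

For a cell containing `C3,C4`, the `*` version yields four matches (two of them empty) and the `+` version yields two, which matches the count of 4 reported above.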
Is that possible? Probably not? How can I then find all exact occurrences of a match and the according page numbers?
EDIT:
I have the regex working properly. What I need is for each match to get all the pages it appears on.
Example:
regex = \b\d{3}\b
123 appears on page 1,4,20
243 appears on page 3,5,7
523 appears on page 9
How can I get that information (all the pages a match occurs on?)
This is for creating some kind of index automatically.
EDIT 2:
I got a basic working version, snippet:
Set Matches = regExp.Execute(ActiveDocument.range.Text)
For Each Match In Matches
    Set range = ActiveDocument.range(Match.FirstIndex, Match.FirstIndex + Len(Match.Value))
    page = range.Information(wdActiveEndAdjustedPageNumber)
The problem is that Match.FirstIndex does not always point to the first character of the match in ActiveDocument.range. Word tables mess this up, as ActiveDocument.range.Text contains characters that are not part of the text but represent something in the table.
I think this probably fits better in SuperUser.
The answer to the question is "yes."
Selection.Information(wdActiveEndAdjustedPageNumber)
The above property in VBA will get you the page number of a selection.
Also, VBA can do some regular expression work.
This turned out to be rather complex, and I can't say whether my solution works for every document. The main issue, as indicated in the question, is that Match.FirstIndex cannot be used to determine where the match actually is within the MS Word document. This is because the regex matching is done on the range.Text property (a String), and that string contains a different number of characters than the Range object does, so the indexes don't line up.
So my solution is this: for each match, I do a Find for that match in the whole document. The Find method gives a Range object from which the correct page can be determined.
In my special case the same thing could appear as different values. Example: 343 in my case would be the same as Prefix-343. A second issue was that the matches must be sorted, e.g. 123 before 324, regardless of which one occurs first in the document.
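The merging and sorting part of this can be sketched on its own. The example below is Python, not Word VBA, and assumes the text of each page is already available as a list of strings, which is a hypothetical input; in Word itself the page number has to come from the Range returned by Find, as described above. The name `build_index` and the simplified normalisation (add the R- prefix, drop an optional one-digit suffix) are assumptions for illustration.

```python
import re
from collections import defaultdict

# Optional "R-" prefix, the core number, and an optional one-digit suffix
R_NUMBER = re.compile(r"\b(?:R-)?(\d{2}-\d{4,5})(?:-\d)?\b")


def build_index(pages):
    """Map each normalised R-number to the sorted list of pages it appears on."""
    index = defaultdict(set)
    for page_no, text in enumerate(pages, start=1):
        for m in R_NUMBER.finditer(text):
            # "87-1000" and "R-87-1000-1" both end up under the key "R-87-1000"
            index["R-" + m.group(1)].add(page_no)
    # sort by key so the index is ordered regardless of where matches occur
    return {key: sorted(found) for key, found in sorted(index.items())}
```

Using a set per key takes care of the "already in the array" checks, and sorting the keys at the end replaces the separate sort step.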
If you require the sort functionality, you will also need the following two modules:
SortDictionary Function:
http://www.cpearson.com/excel/CollectionsAndDictionaries.htm
Module "modQSortInPlace":
http://www.cpearson.com/Zips/modQSortInPlace.zip
If no sorting is needed you don't need them, but you must remove the corresponding function call SortDictionary Dict, True from my code.
Now to my code. Some parts can be removed, especially the formatting part, which is specific to my case. Also, if your match is "unique", e.g. has no prefix variants, you can simplify the code too. You will need to reference the "Microsoft Scripting Runtime" library.
Option Explicit
Sub ExtractRNumbers()
    Dim Dict As Scripting.Dictionary
    Set Dict = CreateObject("Scripting.dictionary")
    Dim regExp, Match, Matches
    Dim rNumber As String
    Dim range As range
    Set regExp = CreateObject("VBScript.RegExp")
    regExp.Pattern = "\b(R-)?\d{2}-\d{4,5}(-\d)?\b"
    regExp.IgnoreCase = False
    regExp.Global = True
    ' determine main section, only extract R-Numbers from main section
    ' and not the Table of contents as example
    ' main section = section with most characters
    Dim section As section
    Dim maxSectionSize As Long
    Dim sectionSize As Long
    Dim sectionIndex As Integer
    Dim currentIndex As Integer
    maxSectionSize = 0
    currentIndex = 1
    For Each section In ActiveDocument.Sections
        sectionSize = Len(section.range.text)
        If sectionSize > maxSectionSize Then
            maxSectionSize = sectionSize
            sectionIndex = currentIndex
        End If
        currentIndex = currentIndex + 1
    Next
    Set Matches = regExp.Execute(ActiveDocument.Sections(sectionIndex).range.text)
    For Each Match In Matches
        ' If the document contains tables, ActiveDocument.range.Text will contain
        ' BEL characters (chr(7)) that probably define the table structure. The issue
        ' is that Match.FirstIndex then does not point to the actual first character
        ' of a match in the document.
        ' There are also other (unknown) things that lead to the same issue, i.e.
        ' Match.FirstIndex cannot be used to find the actual "matching word" within the
        ' document. Because of that, the commented-out approach below does not work on a generic document
        ' Set range = ActiveDocument.range(Match.FirstIndex, Match.FirstIndex + Len(Match.Value))
        ' page = range.Information(wdActiveEndAdjustedPageNumber)
        ' Maybe there is a simpler solution, but this works more or less,
        ' the exception being tables again. See http://support.microsoft.com/kb/274003
        ' After a match is found, the whole document is searched using the Find method.
        ' For each find result the page number is put into an array (if it is not in the array yet)
        ' Then the match is formatted properly.
        ' After formatting, it is checked whether the match was already found previously
        '
        ' If not, we add a new entry to the dictionary (key = formatted match, value = array of page numbers)
        '
        ' If the match was already found before (but potentially in a different format, e.g. R-87-1000 vs 87-1000),
        ' all additional pages are added to the already found pages.
        Set range = ActiveDocument.Sections(sectionIndex).range
        With range.Find
            .text = Match.Value
            .MatchWholeWord = True
            .MatchCase = True
            .Wrap = wdFindStop
        End With
        Dim page As Variant
        Dim pages() As Integer
        Dim index As Integer
        index = 0
        ReDim pages(0)
        Do While range.Find.Execute() = True
            page = range.Information(wdActiveEndAdjustedPageNumber)
            If Not IsInArray(page, pages) Then
                ReDim Preserve pages(index)
                pages(index) = page
                index = index + 1
            End If
        Loop
        ' FORMAT TO PROPER R-NUMBER: This is specific to my case
        rNumber = Match.Value
        If Not rNumber Like "R-*" Then
            rNumber = "R-" & rNumber
        End If
        ' remove possible batch number from the R-number
        If Len(rNumber) > 11 Then
            rNumber = Left(rNumber, Len(rNumber) - 2)
        End If
        ' END FORMAT
        If Not Dict.Exists(rNumber) Then
            Dict.Add rNumber, pages
        Else
            Dim existingPages() As Integer
            existingPages = Dict(rNumber)
            For Each page In pages
                If Not IsInArray(page, existingPages) Then
                    ' add additional pages. This means that the previous match
                    ' was formatted differently, e.g. R-87-1000 vs 87-1000
                    ReDim Preserve existingPages(UBound(existingPages) + 1)
                    existingPages(UBound(existingPages)) = page
                    Dict(rNumber) = existingPages
                End If
            Next
        End If
    Next
    'sort dictionary by key (R-Number)
    SortDictionary Dict, True
    Dim fso
    Set fso = CreateObject("Scripting.FileSystemObject")
    Dim stream
    ' Create a TextStream.
    Set stream = fso.CreateTextFile(ActiveDocument.Path & "\" & ActiveDocument.Name & "-rNumbers.txt", True)
    Dim key As Variant
    Dim output As String
    Dim i As Integer
    For Each key In Dict.Keys()
        output = key & vbTab
        pages = Dict(key)
        For i = LBound(pages) To UBound(pages)
            output = output & pages(i) & ", "
        Next
        output = Left(output, Len(output) - 2)
        stream.WriteLine output
    Next
    Set Dict = Nothing
    stream.Close
End Sub
Private Function IsInArray(page As Variant, pages As Variant) As Boolean
    Dim i As Integer
    IsInArray = False
    For i = LBound(pages) To UBound(pages)
        If pages(i) = page Then
            IsInArray = True
            Exit For
        End If
    Next
End Function