Replacing everything but the matching regex string - regex

I've searched for this answer but haven't found an answer that exactly works.
I have the following pattern where the hashes are any digit: 102###-###:#####-### or 102###-###:#####-####
It must start with 102 and the last set in the pattern can either be 3 or 4 digits.
The problem is that I can have a string with 1 to 5 of these patterns in it, with any sort of characters in between (spaces, letters, etc.). The regex I posted below matches the patterns well, but I am trying to select everything that is NOT this pattern so I can remove it. The end goal is to extract all the patterns and output them comma delimited (Pattern, Pattern, Pattern). How do I accomplish this with regex? Perhaps there is a better way than this approach? Thanks. This is using VBA.
Regex For Pattern:(\D102\d{3}-\d{3}:\d{5}-\d{3,4}\D)
String Example: type:102456-345:56746-234 102456-345:56746-2343 FollowingCell#:102456-345:56746-234 exampletext##$% 102456-345:56746-2345 stuff

There is no need to match everything you don't want just to remove it; that is the harder route. Match what you do need and do whatever you want with it.
See regex in use here
(?<!\d)102\d{3}-\d{3}:\d{5}-\d{3,4}(?!\d)
See code in use here
Imports System.Text.RegularExpressions
Module Module1
    Sub Main()
        Dim sourcestring As String = "type:102456-345:56746-234 102456-345:56746-2343 FollowingCell#:102456-345:56746-234 exampletext##$% 102456-345:56746-2345 stuff"
        Dim re As Regex = New Regex("(?<!\d)102\d{3}-\d{3}:\d{5}-\d{3,4}(?!\d)")
        Dim mc As MatchCollection = re.Matches(sourcestring)
        For Each m As Match In mc
            Console.WriteLine(m.Groups(0).Value)
        Next
    End Sub
End Module
Result:
102456-345:56746-234
102456-345:56746-2343
102456-345:56746-234
102456-345:56746-2345

I am trying to select everything that is NOT this pattern so I can remove it. The end goal is to extract all the patterns and just have all the patterns comma delimited as the output
If you want to extract the patterns, then just do that, without removing everything around them. Example in Python: (Posted before the question's language was specified, but I'm sure the same can be done in VBA.)
>>> import re
>>> p = r"102\d{3}-\d{3}:\d{5}-\d{3,4}"
>>> text = "type:102456-345:56746-234 102456-345:56746-2343 FollowingCell#:102456-345:56746-234 exampletext##$% 102456-345:56746-2345 stuff"
>>> ",".join(re.findall(p, text))
'102456-345:56746-234,102456-345:56746-2343,102456-345:56746-234,102456-345:56746-2345'
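The two answers differ in one detail: the Python pattern above has no digit guards, while the .NET pattern wraps the match in (?<!\d) and (?!\d). Here is a minimal sketch of what the guards change, assuming Python's re module. The `noisy` string and the helper name `extract_ids` are made up for illustration and are not from the question.

```python
import re

# Same pattern as the .NET answer: lookarounds reject matches glued to other digits.
GUARDED = re.compile(r"(?<!\d)102\d{3}-\d{3}:\d{5}-\d{3,4}(?!\d)")
# Pattern from the Python answer above, without the guards.
UNGUARDED = re.compile(r"102\d{3}-\d{3}:\d{5}-\d{3,4}")


def extract_ids(text, pattern=GUARDED):
    """Return all matches joined with commas."""
    return ",".join(pattern.findall(text))


# Hypothetical input: the second candidate has a fifth trailing digit,
# the third is preceded by an extra digit.
noisy = "a 102456-345:56746-234 b 102456-345:56746-23456 c 9102456-345:56746-2345"

print(extract_ids(noisy))             # guarded: only the first candidate
print(extract_ids(noisy, UNGUARDED))  # unguarded: also picks up parts of the other two
```

Whether the guards are wanted depends on the data. If an ID can never sit next to other digits, both patterns should behave the same.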

Related

RegEx vb.net MatchCollection is full of Empty Strings

I am very new to using Regular Expressions. (I have read Ben Forta's book and have learned from asking a previous question on here. I am trying to put my learning into practice.)
So, given this string: "[Class 4C] Physics 101 [~2] [#14 Worthington 5] FW"
I'd like these results:
"[Class 4C]"
"Physics 101"
"[~2]"
"[#14 Worthington 5]"
"FW"
I'm using this vb.net code:
Imports System.Text.RegularExpressions
---
Dim txt As String = "[Class 4C] Physics 101 [~2][#14 Worthington 5] FW"
Dim mc As MatchCollection = Regex.Matches(txt, "((?<=\])|(?=\[))")
Dim m As Match
For Each m In mc
Debug.Print(m.Value)
Next m
The result is a System.Text.RegularExpressions.MatchCollection containing 6 empty strings.
Using RegEx Storm, I see something I can work with in the 'Split List' but what I'm getting in the MatchCollection is the data in the "Table" view.
How do I access the array shown in the Split List please? (Or do I need to use a different pattern?)
Thank you to the fourth bird who (in the comments) helped me reach the solution. In case it helps anyone else, here is the answer and updated code:
Regex.Matches returns a collection that contains all of the matches produced by applying the specified regular expression pattern. This will normally return only part(s) of a string. As that is often what's needed, example code will often show this way of using regular expressions.
However, my requirement was to return ALL of the string, split into sections as defined by the RegEx. This is done using the RegEx.Split function.
The result is what is referred to in RegEx Storm as the 'Split List'.
Here is the modified vb.net code. It splits the string, filters out empty strings and trims off any leading and trailing whitespace:
Imports System.Text.RegularExpressions
---
Dim text As String = "[Class 4C] Physics 101 [~2][#14 Worthington 5] FW"
Dim pattern As String = "((?<=\])|(?=\[))"
Dim matches() As String = Regex.Split(text, pattern)
For Each match As String In matches
Dim trimmedMatch As String = Trim(match)
If trimmedMatch.Length > 0 Then
' Do things here
Debug.Print(trimmedMatch)
End If
Next
Thanks again the fourth bird. Really appreciated.
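For comparison, the same match-versus-split distinction can be sketched in Python. This assumes Python 3.7 or later, where re.split accepts zero-width matches. It is not a translation of the .NET API, only the same idea: matching a zero-width pattern yields empty strings, while splitting on it yields the text in between.

```python
import re

text = "[Class 4C] Physics 101 [~2][#14 Worthington 5] FW"
pattern = r"(?<=\])|(?=\[)"  # zero-width: just after "]" or just before "["

# Matching returns the zero-width matches themselves, i.e. empty strings.
matches = re.findall(pattern, text)

# Splitting returns the text between those positions.
parts = [p.strip() for p in re.split(pattern, text)]
parts = [p for p in parts if p]  # drop empty pieces

print(parts)
```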

Match number between 2 date

I have a text and I need to extract a number that is between 2 dates. I can't show the full text so I will only use the part I need, but keep in mind it's part of a bigger text.
12/14/2020 355345 12/14/2020
From that, I need to get '355345'. I currently don't have anything to show of what I was doing, because I was working on getting the text before a sentence until I realized that this is the only place where a number sits between 2 dates.
Thanks!
Here's a snippet that might help:
Suppose the input is this:
Imports System.Text
Imports System.Text.RegularExpressions
'...
Dim input As New StringBuilder
input.AppendLine("12/14/2020 355345 12/14/2020")
input.AppendLine("12/13/2020 425345 12/13/2020")
input.AppendLine("12/20/2020 93488557 12/20/2020")
input.AppendLine("12/21/2020 4 12/21/2020")
input.AppendLine("12/20/2020 3443 12/20/2020")
'...
Use RegEx to extract the numbers between the two dates as follows:
Dim patt = "(\d+\/\d+\/\d+)\s?(\d+)\s?(\d+\/\d+\/\d+)"
For Each m In Regex.
Matches(input.ToString, patt, RegexOptions.Multiline).
Cast(Of Match)
Console.WriteLine(m.Groups(2).Value)
Next
This will capture three groups. Example for the first match:
m.Groups(1).Value : 12/14/2020 the first date.
m.Groups(2).Value : 355345 the number in between.
m.Groups(3).Value : 12/14/2020 the second date.
If you have no use for the captured dates, then there is no need to group them; use the following pattern instead:
Dim patt = "\d+\/\d+\/\d+\s?(\d+)\s?\d+\/\d+\/\d+"
For Each m In Regex.
Matches(input.ToString, patt, RegexOptions.Multiline).
Cast(Of Match)
Console.WriteLine(m.Groups(1).Value)
Next
And you will get the number between the two dates in Group 1.
The output of both is:
355345
425345
93488557
4
3443
Also, using explicit quantifiers in the pattern is a good idea, as Mr. #AndrewMorton mentioned in his appreciated comments. They skip things like 1234/239994/2293 in the input:
Dim patt = "\d{1,2}/\d{1,2}/\d{4}\s(\d{1,})\s\d{1,2}/\d{1,2}/\d{4}"
For Each m In Regex.
Matches(input.ToString, patt, RegexOptions.Multiline).
Cast(Of Match)
Console.WriteLine(m.Groups(1).Value)
Next
The quantifiers-way test is here.
If you can safely check for numbers and slashes, then a pattern like this should work:
\d\d/\d\d/\d\d\d\d +(\d+) +\d\d/\d\d/\d\d\d\d
...where capture group 1 would hold the number being sought. If you need to validate that the values are actually dates, well... you can do it with regex to a degree, but the pattern becomes very difficult to read.
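One way around the readability problem is to let the regex find date-shaped text and leave the actual date check to a date parser. Here is a minimal Python sketch of that idea. The sample lines mirror the input used above, the last line and the helper name `is_date` are made up for illustration, and month/day/year order is an assumption.

```python
import re
from datetime import datetime

# Assumed sample input, mirroring the lines used in the answer above.
text = "\n".join([
    "12/14/2020 355345 12/14/2020",
    "12/13/2020 425345 12/13/2020",
    "12/20/2020 93488557 12/20/2020",
    "12/21/2020 4 12/21/2020",
    "13/45/2020 999 13/45/2020",  # date-shaped, but not a real date
])

# Group 1: first date, group 2: the number, group 3: second date.
PATTERN = re.compile(r"(\d{1,2}/\d{1,2}/\d{4}) +(\d+) +(\d{1,2}/\d{1,2}/\d{4})")


def is_date(value):
    """True if value parses as month/day/year."""
    try:
        datetime.strptime(value, "%m/%d/%Y")
        return True
    except ValueError:
        return False


# Keep the number only when both surrounding values are real dates.
numbers = [
    m.group(2)
    for m in PATTERN.finditer(text)
    if is_date(m.group(1)) and is_date(m.group(3))
]
print(numbers)
```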

How to get all sub-strings of a specific format from a string

I have a large string and I want to get all sub-strings of format [[someword]] from it.
Meaning, get all words (list) which are wrapped in opening and closing square brackets.
One way to do this is to split the string on spaces and then filter the resulting list, but the problem is that [[someword]] does not always stand alone as a word: it might have a comma, a space or a period right before or after it.
What is the best way to do this?
I will appreciate a solution in Scala but as this is more of a programming problem, I will convert your solution to Scala if it's in some other language I know e.g. Python.
This question is different from the marked duplicate because the regex needs to be able to accommodate characters other than English characters in between the brackets.
You can use this (?<=\[{2})[^[\]]+(?=\]{2}) regex to match and extract all the words you need that are contained in double square brackets.
Here is a Python solution,
import re
s = 'some text [[someword]] some [[some other word]]other text '
print(re.findall(r'(?<=\[{2})[^[\]]+(?=\]{2})', s))
Prints,
['someword', 'some other word']
I have never worked in Scala, but here is a solution in Java. Since Scala runs on the JVM and can use Java classes, this may help.
String s = "some text [[someword]] some [[some other word]]other text ";
Pattern p = Pattern.compile("(?<=\\[{2})[^\\[\\]]+(?=\\]{2})");
Matcher m = p.matcher(s);
while(m.find()) {
System.out.println(m.group());
}
Prints,
someword
some other word
Let me know if this is what you were looking for.
Scala solution:
val text = "[[someword1]] test [[someword2]] test 1231"
val pattern = "\\[\\[([\\p{L}\\p{N}]+)\\]\\]".r // match bracketed words; \p{N} is needed because the samples end in a digit; group 1 holds the content
val values = pattern
.findAllIn(text)
.matchData
.map(_.group(1)) //get 1st group
.toList
println(values)
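Since the question stresses non-English characters, here is a short Python check of the capture-group variant on such input. The sample text is hypothetical. The character class simply excludes brackets, so any script (and inner spaces) is accepted.

```python
import re

# Hypothetical sample with non-English words inside the brackets.
text = "See [[Straße]], [[日本語]] and [[naïve idea]]. Not this: [single]"

# Capture anything except brackets between "[[" and "]]".
words = re.findall(r"\[\[([^\[\]]+)\]\]", text)
print(words)
```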

vba regex: how to extract exact recurring match at start, between, at the end of a string

I'm trying with no luck to extract a recurring word inside a string using RegEx in Excel VBA.
Following an example:
Sub RegExTest()
 Dim re As Object
 Dim el As Object
 Const strText As String = "Fld,Fld,Fld,Fld,Fld,aFld1,bFld,cFld,Fld"
 Debug.Print strText
 With CreateObject("VBScript.RegExp")
  .Global = True
  .MultiLine = False
  .IgnoreCase = False
  .pattern = "(^Fld\,|\,Fld\,|\,Fld$)"
  If .Test(strText) Then
   Set re = .Execute(strText)
  End If
 End With
 For Each el In re
  Debug.Print el
 Next
End Sub
Result:
Fld,Fld,Fld,Fld,Fld,aFld1,bFld,cFld,Fld
Fld,
,Fld,
,Fld,
,Fld
The result that I get (4 elements) is not what I expect (6 elements).
I'm sure it is about a wrong pattern definition.
Can someone please help with the correct pattern?
Thanks in advance
The problem here is that your matches are overlapping. By that I mean the comma in Fld\, has already been consumed by the first match, so your second Fld can't match \,Fld\,.
If you double up your commas, you can see that you get the expected number of matches.
The solution here is to use lookaheads to capture your matches. If you absolutely need the trailing commas for some reason, just append them to the relevant matches.
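The answer does not spell out a pattern, so the following Python sketch is one possible reading of it and not the answerer's own regex. It uses only a lookahead for the trailing delimiter, which the VBScript engine should also accept as far as I know (lookbehind is the construct it lacks).

```python
import re

text = "Fld,Fld,Fld,Fld,Fld,aFld1,bFld,cFld,Fld"

# Original pattern: each match consumes its trailing comma, so the next
# "Fld" has no leading comma left to match against.
consuming = re.findall(r"^Fld,|,Fld,|,Fld$", text)

# Lookahead variant: the trailing comma (or end of string) is checked
# but not consumed, so it is still available as the next match's leading comma.
lookahead = re.findall(r"(?:^|,)Fld(?=,|$)", text)

print(len(consuming), len(lookahead))

# Strip the leading comma to get the bare words.
bare = [m.lstrip(",") for m in lookahead]
print(bare)
```

The first count reproduces the 4 elements from the question. The second finds all 6 exact occurrences and still skips aFld1, bFld and cFld.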

Using regex to find paragraphs in VBA Excel

I am trying to use regex to 'extract' paragraphs in a document. Each paragraph is preceded and followed by a '-' on a separate line and each paragraph starts with a number.
For example
-
1. This is a paragraph
It may go over multiple lines
-
Ideally, I would like to not include the '-', but it doesn't really matter as I will be placing it in a string and running another regex against it (One that I know works)
The code I am trying to use is basically as follows
Dim matchPara as Object
Dim regex as Object
Dim theMatch as Object
Dim matches as Object
Dim fileName as String
Dim fileNo as Integer
Dim document as String
matchPara = "-?(\d.*?)?-"
Set regex = CreateObject("VBScript.RegExp")
regex.Pattern = matchPara
regex.Global = True
regex.Multiline = True
fileName = "C:\file.txt"
fileNo = FreeFile
Open fileName For Input As #fileNo
document = Input$(LOF(fileNo), fileNo)
set matches = regex.Execute(document)
For Each theMatch in matches
MsgBox(theMatch.Value)
Next theMatch
Close #fileNo
I have tested this regex on regex101 and it appeared to do what I wanted. I have also tested it without the grouping
-?\d.*?-
However when I run the code the theMatch.Value only ever contains a single '-'. After some messing around with the regex I got it to display the first line of text but never any more than the first line.
I have checked the length of theMatch.Value with:
MsgBox(len(theMatch.Value))
and placed the contents of theMatch.Value in a cell on the worksheet to see if it was being cut off in the message box, but both theories were proved wrong.
I am at a complete loss now and I am beginning to suspect it is possibly a VBA thing and not a regex thing. There is no requirement to use regex, I just assumed it would be the easiest thing to do.
The paragraphs contain data that I am trying to extract. So the idea was to regex each paragraph out, place it in a string, and then run other regexes to get the information that I need. Some paragraphs won't contain the data that I need, so the idea was to loop through each individual paragraph and handle errors better if the data I need wasn't in that paragraph (i.e. get what I can and drop the rest with an error message).
This simple approach does not use Regex. It assumes the data is in column A and the paragraphs are placed in column B:
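Sub paragraph_no_regex()
    Dim s As String
    Dim ary
    Dim a As Variant
    Dim i As Long
    ' Join all constant (non-formula) cells of column A into one string
    With Application.WorksheetFunction
        s = .TextJoin(" ", False, Columns(1).SpecialCells(2))
    End With
    ' Every "-" starts a new paragraph
    ary = Split(s, "-")
    i = 1
    For Each a In ary
        Cells(i, 2) = a
        i = i + 1
    Next a
End Sub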
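Sub F()
    ' Early binding: requires a reference to "Microsoft VBScript Regular Expressions 5.5"
    Dim re As New RegExp
    Dim sMatch As String
    Dim document As String
    ' (.|\n) lets the lazy group span line breaks, which "." alone does not
    re.Pattern = "-\n((.|\n)+?)\n-"
    'Getting document
    document = ...
    ' First match, first capture group: the paragraph text without the dashes
    sMatch = re.Execute(document)(0).SubMatches(0)
End Sub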
If you need the dashes -, then just include them in the capture group (the outer parentheses).
This RegEx matches your description and successfully extracts paragraphs (as tested on regex101.com):
matchPara = "-\n\d+\.\s*((?:.|\n)+?)\s*\n-"
It needs the 'global' flag but not the 'multiline' flag. Instead, the end-of-line token is matched in the regex. The main point is that the innermost matching group will match any character including the end-of-line (given as an alternative) but does so in a non-greedy way ("+?"). It doesn't care about word boundaries as this is not necessary here. Also, "-" is not a special character where used in the regex so it doesn't have to be escaped.
As an added benefit, leading and trailing whitespace is cut off ("\s*" outside the group).
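The behaviour of this kind of pattern can be checked outside VBA with a short Python sketch. Two assumptions are made here that are not in the answer: the sample file contents are invented, and the closing dash is moved into a lookahead so that a single '-' line can end one paragraph and start the next.

```python
import re

# Hypothetical file contents: a single "-" line separates paragraphs.
document = (
    "-\n"
    "1. This is a paragraph\n"
    "It may go over multiple lines\n"
    "-\n"
    "2. Another paragraph\n"
    "-\n"
)

# (?:.|\n)+? spans line breaks lazily; the closing dash is only looked at,
# not consumed, so it can also open the next paragraph.
pattern = re.compile(r"-\n\d+\.\s*((?:.|\n)+?)\s*\n(?=-(?:\n|\Z))")

paragraphs = pattern.findall(document)
for p in paragraphs:
    print(repr(p))
```

If every paragraph in the real file has its own opening and closing dash line, the original pattern with a consumed trailing "-" should work just as well.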