Remove tweet regular expressions from string of text - regex

I have an Excel sheet filled with tweets. Several entries contain #blah-type strings, among other things. I need to keep the rest of the text and remove the #blah part. For example, "#villos hey dude" needs to be transformed into "hey dude". This is what I've done so far:
Sub Macro1()
'
' Macro1 Macro
'
Dim counter As Integer
Dim strIN As String
Dim newstring As String
For counter = 1 To 46
Cells(counter, "E").Select
ActiveCell.FormulaR1C1 = strIN
StripChars (strIN)
newstring = StripChars(strIN)
ActiveCell.FormulaR1C1 = StripChars(strIN)
Next counter
End Sub
Function StripChars(strIN As String) As String
Dim objRegex As Object
Set objRegex = CreateObject("vbscript.regexp")
With objRegex
.Pattern = "^#?(\w){1,15}$"
.ignorecase = True
StripChars = .Replace(strIN, vbNullString)
End With
End Function
Moreover there are also entries like this one: Ÿ³é‡ï¼Ÿã€€åˆã‚ã¦çŸ¥ã‚Šã¾ã—ãŸã€‚ shiftã—ãªãŒã‚‰ã‚¨ã‚¯ã‚¹ãƒ
I need them gone too! Ideas?

For every line in the spreadsheet run the following regex on it: ^(#.+?)\s+?(.*)$
If the line matches the regex, the text you are interested in will be in the second capturing group. (Group numbering usually starts at 1 because position 0 holds the entire match; in VBScript's RegExp the second group is SubMatches(1).) The first capturing group will contain the leading #blah token, if you need that too.
However, this will not match tweets that do not start with a # token. In that situation, the only way to distinguish regular tweets from the junk you are not interested in is to restrict the tweet to alphanumerics, but this means that tweets containing any non-alphanumeric characters will be missed. The following regex will work if that is not an issue for you:
^(?:(#.+?)\s+?)?([\w\t ]+)$
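A minimal VBA sketch of the stripping step, assuming the goal is simply to drop every #blah token wherever it appears in the cell text. The function name StripTags is hypothetical, and the pattern is an assumption for illustration rather than the regex from the answer above:

```vba
' Hypothetical helper: removes every "#word" token from a tweet.
Function StripTags(ByVal tweet As String) As String
    Dim re As Object
    Set re = CreateObject("VBScript.RegExp")
    re.Global = True          ' replace every token, not just the first one
    re.Pattern = "#\w+\s*"    ' "#", then word characters, then any trailing whitespace
    StripTags = Trim$(re.Replace(tweet, ""))
End Function
```

In the macro, this would be applied to the cell's value (for example `Cells(counter, "E").Value = StripTags(Cells(counter, "E").Value)`) instead of assigning an empty variable to the cell first.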

Related

Using regex to find paragraphs in VBA Excel

I am trying to use regex to 'extract' paragraphs in a document. Each paragraph is preceded and followed by a '-' on a separate line, and each paragraph starts with a number.
For example
-
1. This is a paragraph
It may go over multiple lines
-
Ideally, I would like to not include the '-', but it doesn't really matter as I will be placing it in a string and running another regex against it (One that I know works)
The code I am trying to use is basically as follows
Dim matchPara as Object
Dim regex as Object
Dim theMatch as Object
Dim matches as Object
Dim fileName as String
Dim fileNo as Integer
Dim document as String
matchPara = "-?(\d.*?)?-"
Set regex = CreateObject("VBScript.RegExp")
regex.Pattern = matchPara
regex.Global = True
regex.Multiline = True
fileName = "C:\file.txt"
fileNo = FreeFile
Open fileName For Input As #fileNo
document = Input$(LOF(fileNo), fileNo)
set matches = regex.Execute(document)
For Each theMatch in matches
MsgBox(theMatch.Value)
Next theMatch
Close #fileNo
I have tested this regex on regex101 and it appeared to do what I wanted. I have also tested it without the grouping
-?\d.*?-
However when I run the code the theMatch.Value only ever contains a single '-'. After some messing around with the regex I got it to display the first line of text but never any more than the first line.
I have checked the length of theMatch.Value with:
MsgBox(len(theMatch.Value))
and placed the contents of theMatch.Value in a cell on the worksheet to see if it was being cut off in the message box, but both theories were proved wrong.
I am at a complete loss now and I am beginning to suspect it is possibly a VBA thing and not a regex thing. There is no requirement to use regex, I just assumed it would be the easiest thing to do.
The paragraphs contain data that I am trying to extract. The idea was to pull each paragraph out with a regex, place it in a string, and then run other regexes to get the information I need. Some paragraphs won't contain that data, so I wanted to loop through each paragraph individually and handle errors better when the data is missing (i.e., get what I can and drop the rest with an error message).
This simple approach does not use Regex. It assumes the data is in column A and the paragraphs are placed in column B:
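Sub paragraph_no_regex()
    Dim s As String
    Dim ary As Variant, a As Variant
    Dim i As Long
    With Application.WorksheetFunction
        ' Join all constant (non-formula) cells in column A; 2 = xlCellTypeConstants
        s = .TextJoin(" ", False, Columns(1).SpecialCells(2))
    End With
    ' Each "-" separator starts a new paragraph
    ary = Split(s, "-")
    i = 1
    For Each a In ary
        Cells(i, 2) = a
        i = i + 1
    Next a
End Sub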
Sub F()
Dim re As New RegExp
Dim sMatch As String
Dim document As String
re.Pattern = "-\n((.|\n)+?)\n-"
'Getting document
document = ...
sMatch = re.Execute(document)(0).SubMatches(0)
End Sub
If you need the dashes (-), just include them in the capture group (the outer parentheses).
This RegEx matches your description and successfully extracts paragraphs (as tested on regex101.com):
matchPara = "-\n\d+\.\s*((?:.|\n)+?)\s*\n-"
It needs the 'global' flag but not the 'multiline' flag; instead, the end-of-line character is matched explicitly in the regex. The main point is that the innermost group matches any character including the end-of-line (given as an alternative, because "." does not match a newline in VBScript regexes) but does so in a non-greedy way ("+?"). It doesn't care about word boundaries, as they are not necessary here. Also, "-" is not a special character where it is used in the regex, so it doesn't have to be escaped.
As an added benefit, leading and trailing whitespace is cut off ("\s*" outside the group).
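As a rough sketch of how the pieces fit together, the function below collects every paragraph into a Collection. The name ExtractParagraphs is hypothetical, and the pattern is a variation that rests on assumptions: it uses [\s\S] to match across line breaks, tolerates Windows line endings with \r?, and uses a lookahead for the closing dash so that two paragraphs can share one '-' line. It also assumes that no line inside a paragraph starts with '-'.

```vba
' Hypothetical helper: returns each numbered paragraph found between "-" lines.
Function ExtractParagraphs(ByVal text As String) As Collection
    Dim re As Object, m As Object
    Dim result As New Collection

    Set re = CreateObject("VBScript.RegExp")
    re.Global = True    ' find every paragraph, not just the first
    ' "-" line, then a paragraph starting with a digit, up to (not including) the next "-" line
    re.Pattern = "-\r?\n(\d[\s\S]*?)(?=\r?\n-)"

    For Each m In re.Execute(text)
        result.Add m.SubMatches(0)    ' paragraph text without the dashes
    Next m
    Set ExtractParagraphs = result
End Function
```

Each item in the returned Collection can then be passed to the second regex on its own, which makes it straightforward to skip paragraphs that lack the data.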

RegEx - VBA Finding splitting cell with two Uppercase [duplicate]

I'm new to VBA and would like some help with using RegEx; I hope someone can enlighten me on what I'm doing wrong. I'm currently trying to split a date into its individual day, month and year, and possible delimiters include ",", "-" and "/".
Function formattedDate(inputDate As String) As String
Dim dateString As String
Dim dateStringArray() As String
Dim day As Integer
Dim month As String
Dim year As Integer
Dim assembledDate As String
Dim monthNum As Integer
Dim tempArray() As String
Dim pattern As String()
Dim RegEx As Object
dateString = inputDate
Set RegEx = CreateObject("VBScript.RegExp")
pattern = "(/)|(,)|(-)"
dateStringArray() = RegEx.Split(dateString, pattern)
' .... code continues
This is what I am currently doing. However, something seems to go wrong in the RegEx.Split call, as it causes my code to hang and not process any further.
To just confirm, I did something simple:
MsgBox("Hi")
pattern = "(/)|(,)|(-)"
dateStringArray() = RegEx.Split(dateString, pattern)
MsgBox("Bye")
The "Hi" message box pops up, but the "Bye" message box never does, and the code further down doesn't seem to get executed at all, which led me to suspect that RegEx.Split is causing it to get stuck.
Can I check if I'm actually using RegEx.Split the right way? According to MSDN here, Split(String, String) returns an array of strings as well.
Thank you!
Edit: I'm trying not to explore the CDate() function as I am trying not to depend on the locale settings of the user's computer.
To split a string with a regular expression in VBA:
Public Function SplitRe(Text As String, Pattern As String, Optional IgnoreCase As Boolean) As String()
Static re As Object
If re Is Nothing Then
Set re = CreateObject("VBScript.RegExp")
re.Global = True
re.MultiLine = True
End If
re.IgnoreCase = IgnoreCase
re.Pattern = Pattern
SplitRe = Strings.Split(re.Replace(text, ChrW(-1)), ChrW(-1))
End Function
Usage example:
Dim v
v = SplitRe("a,b/c;d", "[,;/]")
Splitting by a regex is definitely nontrivial to implement compared to other regex operations, so I don't blame you for being stumped!
If you wanted to implement it yourself, it helps to know that the Match objects returned by RegExp.Execute in Microsoft VBScript Regular Expressions 5.5 have a FirstIndex property and a Length property, so you can loop through the matches and pick out all the substrings between the end of one match (or the start of the string) and the start of the next match (or the end of the string).
If you don't want to implement it yourself, I've also implemented a RegexSplit UDF using those same RegExp objects on my GitHub.
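The loop described above might look roughly like the following. This is only a sketch, not the UDF mentioned in the answer: the name RegexSplitSketch is hypothetical, and it assumes the pattern never matches an empty string.

```vba
' Hypothetical helper: splits text on every match of pattern.
Function RegexSplitSketch(ByVal text As String, ByVal pattern As String) As Variant
    Dim re As Object, matches As Object, m As Object
    Dim parts() As String
    Dim i As Long, pos As Long

    Set re = CreateObject("VBScript.RegExp")
    re.Global = True
    re.Pattern = pattern
    Set matches = re.Execute(text)

    ReDim parts(0 To matches.Count)    ' n separators give n + 1 pieces
    pos = 1                            ' VBA string positions are 1-based
    For Each m In matches
        ' FirstIndex is 0-based, so the separator starts at FirstIndex + 1
        parts(i) = Mid$(text, pos, m.FirstIndex + 1 - pos)
        pos = m.FirstIndex + m.Length + 1
        i = i + 1
    Next m
    parts(i) = Mid$(text, pos)         ' remainder after the last separator
    RegexSplitSketch = parts
End Function
```

For the date question, a call such as `RegexSplitSketch(dateString, "[/,-]")` would return the pieces between the delimiters.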
Quoting an example from the documentation of VbScript Regexp:
https://msdn.microsoft.com/en-us/library/y27d2s18%28v=vs.84%29.aspx
Function SubMatchTest(inpStr)
    Dim retStr
    Dim oRe, oMatch, oMatches
    Set oRe = New RegExp
    ' Look for an e-mail address (not a perfect RegExp)
    oRe.Pattern = "(\w+)@(\w+)\.(\w+)"
    ' Get the Matches collection
    Set oMatches = oRe.Execute(inpStr)
    ' Get the first item in the Matches collection
    Set oMatch = oMatches(0)
    ' Create the results string.
    ' The Match object is the entire match - dragon@xyzzy.com
    retStr = "Email address is: " & oMatch & vbNewLine
    ' Get the sub-matched parts of the address.
    retStr = retStr & "Email alias is: " & oMatch.SubMatches(0) ' dragon
    retStr = retStr & vbNewLine
    retStr = retStr & "Organization is: " & oMatch.SubMatches(1) ' xyzzy
    SubMatchTest = retStr
End Function
To test, call:
MsgBox(SubMatchTest("Please send mail to dragon@xyzzy.com. Thanks!"))
In short, your Pattern needs to match the various parts you want to extract, with the separators in between, maybe something like:
"(\d+)[/,-](\d+)[/,-](\d+)"
The whole match will be in oMatch, while the numbers (\d+) will end up in oMatch.SubMatches(0) to oMatch.SubMatches(2). The hyphen goes last in the character class so that it is read as a literal character rather than as a range operator.
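Applied to the date question, the submatch approach could be sketched as below. The function name DateParts is hypothetical, the anchors ^ and $ are an added assumption (the whole string is the date), and the pattern only handles purely numeric parts, so month names would need a different sub-pattern.

```vba
' Hypothetical helper: returns the three numeric parts of a date string,
' or an empty array when the string does not match.
Function DateParts(ByVal s As String) As Variant
    Dim re As Object, ms As Object
    Set re = CreateObject("VBScript.RegExp")
    ' Hyphen is last in the class so it is a literal, not a range operator
    re.Pattern = "^(\d+)[/,-](\d+)[/,-](\d+)$"
    Set ms = re.Execute(s)
    If ms.Count = 0 Then
        DateParts = Array()
    Else
        With ms(0)
            DateParts = Array(.SubMatches(0), .SubMatches(1), .SubMatches(2))
        End With
    End If
End Function
```

Checking ms.Count first avoids the runtime error that indexing an empty match collection would raise.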

Extract text using word VBA regex then save to a variable as string

I am trying to create code in Word VBA that will automatically save (as PDF) and name a document based on its content, which is in text and not fields. Luckily the formatting is standardized and I already know how to save it. I tested my regex elsewhere to make sure it pulls what I am looking for. The trouble is I need to extract the matched statement, convert it to a string, and save it to a variable (so I have something to pass on to the code where it names the document).
The part of the document I need to match is below, from the start of "Program" through the end of the line and looks like:
Program: Program Name (abr)
and the regex I worked out for this is "Program:[^\n]"
The code I have so far is below, but I don't know how to execute the regex in the active document, convert the output to a string and save to an object:
Sub RegExProgram()
Dim regEx
Dim pattern As String
Set regEx = CreateObject("VBScript.RegExp")
regEx.IgnoreCase = True
regEx.Global = False
regEx.pattern = "Program\:[^\n]"
(missing code here)
End Sub
Any ideas are welcome, and I am sorry if this is simple and I am just overlooking something obvious. This is my first VBA project, and most of the resources I can find suggest replacing using regex, not saving extracted text as string. Thank you!
Try this:
You can find documentation for the RegExp class here.
Dim regEx as Object
Dim matchCollection As Object
Dim extractedString As String
Set regEx = CreateObject("VBScript.RegExp")
With regEx
.IgnoreCase = True
.Global = False ' Only look for 1 match; False is actually the default.
.Pattern = "Program: ([^\r]+)" ' Word separates lines with CR (\r)
End With
' Pass the text of your document as the text to search through to regEx.Execute().
' For a quick test of this statement, pass "Program: Program Name (abr)"
set matchCollection = regEx.Execute(ActiveDocument.Content.Text)
' Extract the first submatch's (capture group's) value -
' e.g., "Program Name (abr)" - and assign it to variable extractedString.
extractedString = matchCollection(0).SubMatches(0)
I've modified your regex based on the assumption that you want to capture everything after Program: through the end of the line; your original regex would only have captured Program:<space>.
Enclosing [^\r]+ (all chars. through the end of the line) in (...) defines a so-called subexpression (a.k.a. capture group), which allows selective extraction of only the substring of interest from what the overall pattern captures.
The .Execute() method, to which you pass the string to search in, always returns a collection of matches (Match objects).
Since the .Global property is set to False in your code, the output collection has (at most) 1 entry (at index 0) in this case.
If the regular expression has subexpressions (1 in our case), then each entry of the match collection has a nonempty .SubMatches collection, with one entry for each subexpression, but note that the .SubMatches entries are strings, not Match objects.
Match objects have the properties .FirstIndex, .Length, and .Value (the captured string). Since .Value is the default property, it is sufficient to access the object itself without referencing .Value: instead of the more verbose matchCollection(0).Value to access the full captured string, you can use the shortcut matchCollection(0). (Again, by contrast, .SubMatches entries are strings only.)
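One caveat with indexing the collection directly is that matchCollection(0) raises a runtime error when nothing matched. A guarded variant might look like this; the function name ExtractProgramName and the docText parameter are hypothetical, and in Word the caller would pass ActiveDocument.Content.Text:

```vba
' Hypothetical helper: returns the text after "Program: " up to the end of
' the line, or an empty string when the label is not found.
Function ExtractProgramName(ByVal docText As String) As String
    Dim re As Object, matches As Object
    Set re = CreateObject("VBScript.RegExp")
    re.IgnoreCase = True
    re.Pattern = "Program: ([^\r]+)"    ' Word ends paragraphs with CR (\r)
    Set matches = re.Execute(docText)
    If matches.Count > 0 Then
        ExtractProgramName = matches(0).SubMatches(0)
    End If
End Function
```

An empty return value then signals that the document did not contain the expected line, which the naming code can check before saving.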
If you're just looking for a string that starts with "Program:" and want to go to the end of the line from there, you don't need a regular expression:
Public Sub ReadDocument()
    Dim aLine As Paragraph
    Dim aLineText As String
    Dim my_str As String
    Dim start As Long
    For Each aLine In ActiveDocument.Paragraphs
        aLineText = aLine.Range.Text
        start = InStr(aLineText, "Program:")
        If start > 0 Then
            ' Everything from "Program:" to the end of the paragraph
            my_str = Mid(aLineText, start)
        End If
    Next aLine
End Sub
This reads through the document line by line, and stores your match in the variable "my_str" when it encounters a line that has the match.
Lazier version:
a = Split(ActiveDocument.Range.Text, "Program:")
If UBound(a) > 0 Then
extractedString = Trim(Split(a(1), vbCr)(0))
End If
If I remember correctly, paragraphs in Word end with vbCr ( \r not \n )

Find specific instance of a match in string using RegEx

I am very new to RegEx and I can't seem to find what I'm looking for. I have a string such as:
[cmdSubmitToDatacenter_Click] in module [Form_frm_bk_UnsubmittedWires]
and I want to get everything within the first set of brackets as well as the second set of brackets. If there is a way to do this with one pattern so that I can just loop through the matches, that would be great. If not, that's fine. I just need to be able to get the different sections of text separately. So far, the following is all I have come up with, but it just returns the whole string minus the first opening bracket and the last closing bracket:
[\[-\]]
(Note: I'm using the replace function, so this might be the reverse of what you are expecting.)
In my research, I have discovered that there are different RegEx engines. I'm not sure the name of the one that I'm using, but I'm using it in MS Access.
If you're using Access, you can use the VBScript Regular Expressions Library to do this. For example:
Const SOME_TEXT = "[cmdSubmitToDatacenter_Click] in module [Form_frm_bk_UnsubmittedWires]"
Dim re
Set re = CreateObject("VBScript.RegExp")
re.Global = True
re.Pattern = "\[([^\]]+)\]"
Dim m As Object
For Each m In re.Execute(SOME_TEXT)
Debug.Print m.Submatches(0)
Next
Output:
cmdSubmitToDatacenter_Click
Form_frm_bk_UnsubmittedWires
Here is what I ended up using, as it made it easier to get the individual values returned. I set a reference to Microsoft VBScript Regular Expressions 5.5 so that I could get IntelliSense help.
Public Sub GetText(strInput As String)
Dim regex As RegExp
Dim colMatches As MatchCollection
Dim strModule As String
Dim strProcedure As String
Set regex = New RegExp
With regex
.Global = True
.Pattern = "\[([^\]]+)\]"
End With
Set colMatches = regex.Execute(strInput)
With colMatches
strProcedure = .Item(0).submatches.Item(0)
strModule = .Item(1).submatches.Item(0)
End With
Debug.Print "Module: " & strModule
Debug.Print "Procedure: " & strProcedure
Set regex = Nothing
End Sub

How to change case of matching letter with a VBA regex Replace?

I have a column of lists of codes like the following.
2.A.B, 1.C.D, A.21.C.D, 1.C.D.11.C.D
6.A.A.5.F.A, 2.B.C.H.1
8.ABC.B, A.B.C.D
12.E.A, 3.NO.T
A.3.B.C.x, 1.N.N.9.J.K
I want to find all instances of two single upper-case letters separated by a period, but only those that follow a number less than 6. I want to remove the period between the letters and convert the second letter to lower case. Desired output:
2.Ab, 1.Cd, A.21.C.D, 1.Cd.11.C.D
6.A.A.5.Fa, 2.Bc.H.1
8.ABC.B, A.B.C.D
12.E.A, 3.NO.T
A.3.Bc.x, 1.Nn.9.J.K
I have the following code in VBA.
Sub fixBlah()
Dim re As VBScript_RegExp_55.RegExp
Set re = New VBScript_RegExp_55.RegExp
re.Global = True
re.Pattern = "\b([1-5]\.[A-Z])\.([A-Z])\b"
For Each c In Selection.Cells
c.Value = re.Replace(c.Value, "$1$2")
Next c
End Sub
This removes the period, but doesn't handle the lower-case requirement. I know in other flavors of regular expressions, I can use something like
re.Replace(c.Value, "$1\L$2\E")
but this does not have the desired effect in VBA. I tried googling for this functionality, but I wasn't able to find anything. Is there a way to do this with a simple re.Replace() statement in VBA?
If not, how would I go about achieving this otherwise? The pattern matching is complex enough that I don't even want to think about doing this without regular expressions.
[I have a solution I worked up, posted below, but I'm hoping someone can come up with something simpler.]
Here is a workaround that uses the properties of each individual regex match to make the VBA Replace() function replace only the text from the match and nothing else.
Sub fixBlah2()
Dim re As VBScript_RegExp_55.RegExp, Matches As VBScript_RegExp_55.MatchCollection
Dim M As VBScript_RegExp_55.Match
Dim tmpChr As String, pre As String, i As Integer
Set re = New VBScript_RegExp_55.RegExp
re.Global = True
re.Pattern = "\b([1-5]\.[A-Z])\.([A-Z])\b"
For Each c In Selection.Cells
'Count of number of replacements made. This is used to adjust M.FirstIndex
' so that it still matches correct substring even after substitutions.
i = 0
Set Matches = re.Execute(c.Value)
For Each M In Matches
tmpChr = LCase(M.SubMatches.Item(1))
If M.FirstIndex > 0 Then
pre = Left(c.Value, M.FirstIndex - i)
Else
pre = ""
End If
c.Value = pre & Replace(c.Value, M.Value, M.SubMatches.Item(0) & tmpChr, _
M.FirstIndex + 1 - i, 1)
i = i + 1
Next M
Next c
End Sub
For reasons I don't quite understand, if you specify a start index in Replace(), the output starts at that index as well, so the pre variable is used to capture the first part of the string that gets clipped off by the Replace function.
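This is the documented behavior of VBA's Replace function: the returned string begins at the position given by the start argument, so everything before it is dropped. A small demonstration, with a hypothetical procedure name:

```vba
Sub ReplaceStartDemo()
    Dim s As String
    s = "abcabc"
    ' Start at position 4 and replace one occurrence: the first three characters are dropped
    Debug.Print Replace(s, "b", "X", 4, 1)                ' prints aXc
    ' Prepending the clipped prefix restores the full string
    Debug.Print Left$(s, 3) & Replace(s, "b", "X", 4, 1)  ' prints abcaXc
End Sub
```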
So this question is old, but I do have another workaround. I use a double regex so to speak, where the first engine looks for the match as an execute, then I loop through each of those items and replace with a lowercase version. For example:
Sub fixBlah()
    Dim re As VBScript_RegExp_55.RegExp
    Dim re2 As VBScript_RegExp_55.RegExp
    Dim ToReplace As Object
    Dim LcaseVersion As String
    Dim ItemCt As Integer
    Dim c As Range
    Set re = New VBScript_RegExp_55.RegExp
    Set re2 = New VBScript_RegExp_55.RegExp
    re.Global = True
    re.Pattern = "\b([1-5]\.[A-Z])\.([A-Z])\b"
    re2.Global = True
    For Each c In Selection.Cells
        Set ToReplace = re.Execute(c.Value)
        ' This generates a list of items that match. Now lowercase the second letter and replace.
        For ItemCt = 0 To ToReplace.Count - 1
            With ToReplace.Item(ItemCt)
                ' Drop the period and lowercase the second letter, e.g. "2.A.B" becomes "2.Ab"
                LcaseVersion = .SubMatches(0) & LCase(.SubMatches(1))
                ' Look for that specific item; the dots are escaped so they match literally
                re2.Pattern = "\b" & Replace(.Value, ".", "\.") & "\b"
            End With
            c.Value = re2.Replace(c.Value, LcaseVersion)
        Next ItemCt
    Next c
End Sub
I hope this helps!