Searching multiple .txt files for all occurrences of a string? - regex

I am trying to create a tool that will search 300+ .txt files for a string that may be used several times in each file.
I want to be able to go through each file and get the string between the delimiters at each occurrence.
It sounds a bit twisted, I know; I have been scratching my head for hours while testing code.
What I have tried
I read through each file and check whether it contains my search text at least once. If it does, I add the file's full path to a list.
Dim FileNamesList As New List(Of String)
Dim occurList As New List(Of String)
Dim textSearch As String = TextBox1.Text.ToLower
'check each file to see if it even contains textbox1.text
'if it does, then add matching files to list
For Each f As FileInfo In dir.GetFiles("*.txt")
Dim tmpRead = File.ReadAllText(f.FullName).ToLower
Dim tIndex As Integer = tmpRead.IndexOf(textSearch)
If tIndex > -1 Then
FileNamesList.Add(f.FullName)
End If
Next
Then I thought, oh, now all I need to do is go through each string in that 'approved' files list and add the entire contents of each to a new list.
Then I go through each in 'that' list and get string between two delimiters.
And... I just get lost from there...
Here is the get string between delimiters I have tried using.
Private Function GetStringBetweenTags(ByVal startIdentifer As String, ByVal endIndentifier As String, ByVal textsource As String) As String
Dim idLength As Int16 = startIdentifer.Length
Dim s As String = textsource
Try
s = s.Substring(s.IndexOf(startIdentifer) + idLength)
s = s.Substring(0, s.IndexOf(endIndentifier))
'MsgBox(s)
Catch
End Try
Return s
End Function
In simple terms...
I have 300 .txt files
Some may contain a string that I am after
I want the substring of each string
Normally I am fine and never need to ask questions, but there are too many forceptions going on.
Logical Example
== Table.txt ==
print("I am tony")
print("pineapple")
print("brown cows")
log("cable ties")
log("bad ocd")
log("bingo")
== Cherry.txt ==
print("grapes")
print("pie")
print("apples")
log("laugh")
log("tuna")
log("gonuts")
== Tower.txt ==
print("tall")
print("clouds")
print("nomountain")
log("goggles?")
log("kuwait")
log("india")
I want to end up with a list of the text inside only the print calls from all 3 files.
I haven't found any other thread about this, probably because it's stupid.
So I should end up with
== ResultList ==
I am tony
pineapple
brown cows
grapes
pie
apples
tall
clouds
nomountain

RegEx is probably your best choice for something like this. For instance:
Dim results As New List(Of String)()
Dim r As New Regex("print\(""(.*)""\)")
For Each path As String In filePaths
    Dim contents As String = File.ReadAllText(path)
    For Each m As Match In r.Matches(contents)
        If m.Success Then
            results.Add(m.Groups(1).Value)
        End If
    Next
Next
As you can see, the code loops through a list of file paths. For each one, it loads the entire contents of the file into a string. It then searches the file contents string for all matches to the following regular expression pattern: print\("(.*)"\). It then loops through all of those pattern matches and grabs the value of the first capture group from each one. Those are added to the results list, which contains your desired strings. Here's the meaning of the parts of the RegEx:
print - Matches the literal text "print" (anywhere in the input, not only at the start of a line)
\( - The next character after "print" must be an opening parenthesis (the backslash is an escape character)
" - The next character after the opening parenthesis must be a double quote (in the VB source it is written twice, "", so that VB doesn't think it's the end of the string literal).
(.*) - The parentheses define this as a capturing group (so that we can pull out just this value from the matches). The .* means zero or more characters of any kind except a newline, as many as possible.
"\) - Matching strings must end with a double quote followed by a closing parenthesis.

Use Regex:
Imports System.Text.RegularExpressions
Module Module1
Sub Main()
Dim input1 As String = _
"print(""I am tony"") " + _
"print(""pineapple"") " + _
"print(""brown cows"") " + _
"log(""cable ties"") " + _
"log(""bad ocd"") " + _
"log(""bingo"")"
Dim input2 As String = _
"print(""grapes"") " + _
"print(""pie"") " + _
"print(""apples"") " + _
"log(""laugh"") " + _
"log(""tuna"") " + _
"log(""gonuts"")"
Dim input3 As String = _
"print(""tall"") " + _
"print(""clouds"") " + _
"print(""nomountain"") " + _
"log(""goggles?"") " + _
"log(""kuwait"") " + _
"log(""india"")"
Dim pattern As String = "print\(""([^""]*)""\)"
Dim expr As Regex = New Regex(pattern, RegexOptions.Singleline)
Dim data As List(Of String) = New List(Of String)()
'run the same expression over each input and collect the first capture group
For Each source As String In {input1, input2, input3}
For Each mat As Match In expr.Matches(source)
data.Add(mat.Groups(1).Value)
Next mat
Next source
End Sub
End Module
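The two answers above differ in one detail of the pattern: the first captures with .* and the second with [^""]*. The following is a minimal sketch of why that can matter when two print calls share a line, as they do in the concatenated input strings of the second answer. The module name GreedyDemo, the helper CountMatches and the sample line are made up for illustration.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module GreedyDemo
    ' Hypothetical helper: number of matches of a pattern in a string.
    Function CountMatches(input As String, pattern As String) As Integer
        Return Regex.Matches(input, pattern).Count
    End Function

    Sub Main()
        ' Two print calls on one line (illustrative sample, not from a real file).
        Dim line As String = "print(""pie"") print(""apples"")"

        ' Greedy .* runs to the LAST closing quote on the line: 1 match,
        ' and its capture group spans both calls.
        Console.WriteLine(CountMatches(line, "print\(""(.*)""\)"))

        ' [^""]* cannot cross a double quote, so each call matches separately: 2 matches.
        Console.WriteLine(CountMatches(line, "print\(""([^""]*)""\)"))

        For Each m As Match In Regex.Matches(line, "print\(""([^""]*)""\)")
            Console.WriteLine(m.Groups(1).Value)
        Next
    End Sub
End Module
```

When every print call sits on its own line, as in the question's sample files, both patterns should give the same results, because . does not match a newline by default.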

Related

UDF (Regular expression) to match a string variants with some exclusions

I need to use (Regular expression) on the string Mod* followed by a specific one character e.g. "A" , like:
Mod A , Mod_A , Module xx A , Modules (A & B) and so on.
But, with the following conditions:
(1)- if the cell contains any of (Modif* or Moder* or Modr*) and Mod* Plus my specific character then the result is True
(2)- if the cell contains any of (Modif* or Moder* or Modr*) and not Mod* Plus my specific character then the result is False
Please see this example and the expected result:
Item Description | Expected Result of RegexMatch
new modified of module A 1 | TRUE
new modification of mod A | TRUE
new moderate of mod_A | TRUE
to modules (A & B) | TRUE
new modified and moderate A 1 | FALSE
new modification of A | FALSE
new moderate of modify | FALSE
to modules (D & E) | FALSE
Public Function RegexMatch(str) As Boolean
Dim tbx2 As String: tbx2 = "A" 'ActiveSheet.TextBox2.Value
Static re As New RegExp
re.Pattern = "\b[M]od(?!erate).*\b[" & tbx2 & "]\b"
re.IgnoreCase = True
RegexMatch = re.Test(str)
End Function
Thanks in advance for your kind help.
Not sure if I understand your requirements correctly: you want rows that contain a word that starts with "mod", but words starting with "Modif", "Moder" or "Modr" don't count. Additionally, a module character (e.g. "A") needs to be present as a word of its own.
I usually get dizzy when I see longer regex terms, so I try to program some lines of code instead. The following function replaces special characters like "(" or "_" with blanks, splits the string into words and checks the content word by word. Easy to understand, easy to adapt:
Function CheckModul(s As String, modulChar As String) As Boolean
Dim words() As String
words = Split(replaceSpecialChars(s), " ")
Dim i As Long, hasModul As Boolean, hasModulChar As Boolean
For i = 0 To UBound(words)
Dim word As String
word = UCase(words(i))
If word Like "MOD*" _
And Not word Like "MODIF*" _
And Not word Like "MODER*" _
And Not word Like "MODR*" Then
hasModul = True
End If
If word = modulChar Then
hasModulChar = True
End If
Next
CheckModul = hasModul And hasModulChar
End Function
Function replaceSpecialChars(ByVal s As String) As String
Dim i As Long
replaceSpecialChars = s
For i = 1 To Len(replaceSpecialChars)
If Mid(replaceSpecialChars, i, 1) Like "[!0-9A-Za-z]" Then Mid(replaceSpecialChars, i) = " "
Next
End Function
Tested as a UDF with your data.

Vba: Regular expression to count the number of words in a string delimited by special characters

I need some help writing a regular expression to count the number of words in a string when the words are separated by special characters such as ., -, +, /, Tab, etc. The count should exclude the special characters. (Please note the data is an HTML string, which needs to be placed into a spreadsheet.)
**Original String** **End Result**
Ex : One -> 1
One. -> 1
One Two -> 2
One.Two -> 2
One Two. -> 2
One.Two. -> 2
One.Tw.o -> 3
Updated
I think you asked a valuable question and this downvoting is not fair!
Function WCount(ByVal strWrd As String) As Long
'Variable declaration
Dim Delimiters() As Variant
Dim Delimiter As Variant
'Initialization
Delimiters = Array("+", "-", ".", "/", Chr(13), Chr(9)) 'Define your delimiter characters here.
'Core
For Each Delimiter In Delimiters
strWrd = Replace(strWrd, Delimiter, " ")
Next Delimiter
strWrd = Trim(strWrd)
'Collapse runs of spaces into a single space
Do While InStr(1, strWrd, "  ") > 0
strWrd = Replace(strWrd, "  ", " ")
Loop
WCount = UBound(Split(strWrd, " ")) + 1
End Function
________________
You can use this function as a UDF in Excel formulas or call it from other VBA code.
Using in formula
=WCOUNT("One.Two.Three.") or =WCOUNT($A$1), assuming your string is in cell A1.
Using in VBA
(Assuming your string is held in a variable, here s.)
Sub test()
Dim s As String
s = "One.Two.Three."
Debug.Print WCount(s)
End Sub
Regards.
Update
I have tested your text.
Copy your text into a cell in Excel.
The code has been updated to handle line-break and tab characters and now counts the words in your string correctly.
Try this code, all necessary comments are in code:
Sub SpecialSplit()
    Dim i As Long
    Dim str As String
    Dim arr() As String
    Dim delimiters As Variant
    Dim delimiter As Variant
    'here you define all special delimiters you want to use
    delimiters = Array(".", "+", "-", "/")
    For i = 1 To 9
        str = Cells(i, 1).Value
        'here we replace all special delimiters with a space to simplify
        For Each delimiter In delimiters
            str = Replace(str, delimiter, " ")
        Next
        'trimming protects us from the situation where the last character is a delimiter and we would get an additional empty string
        str = Trim(str)
        arr = Split(str)
        Cells(i, 2).Value = UBound(arr) - LBound(arr) + 1
    Next
End Sub
With your posted data, the following RegExp works correctly. Put this in a general module in the Visual Basic Editor.
Public Function CountWords(strInput As String) As Long
Dim objMatches
With CreateObject("VBScript.RegExp")
.Global = True
.MultiLine = True
.Pattern = "\w+"
Set objMatches = .Execute(strInput)
CountWords = objMatches.Count
End With
End Function
Use it like a normal worksheet formula. For example, assuming the data is in cell A1, the formula would be:
=CountWords(A1)
For your information, it can also be achieved with a plain formula if the set of delimiter characters is fixed, like so:
=LEN(TRIM(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(TRIM(A1),"."," "),","," "),"-"," "),"+"," "),"/"," "),"\"," ")))-LEN(SUBSTITUTE(TRIM(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(TRIM(A1),"."," "),","," "),"-"," "),"+"," "),"/"," "),"\"," "))," ",""))+1
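For comparison, here is a minimal sketch of the same \w+ counting idea outside Excel. It is written in VB.NET with System.Text.RegularExpressions rather than as a VBA UDF, and the module name WordCountDemo is made up for illustration. It only shows how the pattern treats the sample strings from the question. Note that \w also matches the underscore, so "One_Two" would count as a single word.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module WordCountDemo
    ' Counts maximal runs of word characters (letters, digits, underscore).
    Function CountWords(input As String) As Integer
        Return Regex.Matches(input, "\w+").Count
    End Function

    Sub Main()
        ' Sample strings taken from the question's table.
        Dim samples As String() = {"One", "One.", "One Two", "One.Two",
                                   "One Two.", "One.Two.", "One.Tw.o"}
        For Each sample As String In samples
            Console.WriteLine(sample & " -> " & CountWords(sample))
        Next
    End Sub
End Module
```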

Extracting Parenthetical Data Using Regex

I have a small sub that extracts parenthetical data (including parentheses) from a string and stores it in cells adjacent to the string:
Sub parens()
Dim s As String, i As Long
Dim c As Collection
Set c = New Collection
s = ActiveCell.Value
ary = Split(s, ")")
For i = LBound(ary) To UBound(ary) - 1
bry = Split(ary(i), "(")
c.Add "(" & bry(1) & ")"
Next i
For i = 1 To c.Count
ActiveCell.Offset(0, i).NumberFormat = "#"
ActiveCell.Offset(0, i).Value = c.Item(i)
Next i
End Sub
I am now trying to replace this with some Regex code. I am NOT a regex expert. I want to create a pattern that looks for an open parenthesis followed by zero or more characters of any type followed by a close parenthesis.
I came up with:
\((.+?)\)
My current new code is:
Sub qwerty2()
Dim inpt As String, outpt As String
Dim MColl As MatchCollection, temp2 As String
Dim regex As RegExp, L As Long
inpt = ActiveCell.Value
MsgBox inpt
Set regex = New RegExp
regex.Pattern = "\((.+?)\)"
Set MColl = regex.Execute(inpt)
MsgBox MColl.Count
temp2 = MColl(0).Value
MsgBox temp2
End Sub
The code has at least two problems:
It will only get the first match in the string.(Mcoll.Count is always 1)
It will not recognize zero characters between the parentheses. (I think the .+? requires at least one character)
Does anyone have any suggestions ??
By default, the RegExp Global property is False, so Execute stops after the first match. You need to set it to True.
As for the regex, to match zero or more chars as few as possible, you need *?, not +?. Note that both are lazy (match as few as necessary to find a valid match), but + requires at least one char, while * allows matching zero chars (an empty string).
Thus, use
Set regex = New RegExp
regex.Global = True
regex.Pattern = "\((.*?)\)"
Alternatively, you can use
regex.Pattern = "\(([^()]*)\)"
where [^()] is a negated character class matching any char but ( and ), zero or more times (due to * quantifier), matching as many such chars as possible (* is a greedy quantifier).
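The two patterns behave the same on simple input but differ once parentheses are nested. Here is a minimal sketch of that difference, written in VB.NET rather than VBA: in .NET, Regex.Matches returns every match, so there is no Global flag to set. The module name ParenDemo, the helper AllMatches and the sample string are made up for illustration.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module ParenDemo
    ' Hypothetical helper: returns the full text of every match.
    Function AllMatches(input As String, pattern As String) As List(Of String)
        Dim found As New List(Of String)()
        For Each m As Match In Regex.Matches(input, pattern)
            found.Add(m.Value)
        Next
        Return found
    End Function

    Sub Main()
        Dim sample As String = "a (b) c () d (e(f)g)"

        ' Lazy .*? stops at the first ")", so a nested "(" ends up inside the match.
        ' Prints: (b) | () | (e(f)
        Console.WriteLine(String.Join(" | ", AllMatches(sample, "\((.*?)\)")))

        ' [^()]* never crosses a parenthesis, so only the innermost pair matches.
        ' Prints: (b) | () | (f)
        Console.WriteLine(String.Join(" | ", AllMatches(sample, "\(([^()]*)\)")))
    End Sub
End Module
```

Both patterns accept the empty pair (), which was the second problem raised in the question.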

VB.NET Modify Matching with Access Database

I'm working on a dictionary program.
For example, to get a line from the database, I am supposed to type it into the English textbox exactly as it is stored.
But if I type the same sentence with a line break in the middle, no match is found, even though the text is otherwise identical: no extra comma or any other modification except the new line.
I have failed to modify the code so that the program matches a sentence regardless of new lines between its words.
The code of the translation function:
Function getexactsubtitles(ByVal content As String) As String
Try
Dim count As Integer = 0
Dim connectionall As New OleDb.OleDbConnection(connectionString)
Using adp As New OleDbDataAdapter
Using tbl As New DataTable
Using cmd As OleDbCommand = New OleDbCommand("SELECT * FROM [Subtitles]", connectionall)
cmd.Parameters.AddWithValue("#english", String.Concat("%", content, "%"))
adp.SelectCommand = cmd
If adp.Fill(tbl) > 0 Then
For Each row As DataRow In tbl.Rows
Dim en As String = row.Item("English").Replace("?", "\?")
Dim ar As String = row.Item("Arabic")
count += Regex.Matches(content, en, RegexOptions.IgnoreCase).Count
content = Regex.Replace(content, en, ar, RegexOptions.IgnoreCase)
Next
End If
End Using
End Using
End Using
MsgBox(IIf(count = 0, "لم يتم العثور على أى تطابق تام داخل قاعدة البيانات", "عدد الحالات المطابقة تماماً التى عثر عليها داخل قاعدة البيانات (" & " " & count & " " & ") من الجمل والكلمات"))
Return content
Catch ex As Exception
MsgBox(ex.Message)
End Try
getexactsubtitles = ""
End Function
Please notice this part:
Using cmd As OleDbCommand = New OleDbCommand("SELECT * FROM [Subtitles]", connectionall)
cmd.Parameters.AddWithValue("#english", String.Concat("%", content, "%"))
This is my attempt to do the job, but for some reason it is not working.
Code for the translation button:
Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
    TextBox2.Text = getexactsubtitles(TextBox1.Text)
End Sub
If I understand correctly, you are using Regex.Replace() and the pattern is stored in the first column of a table.
And all you need is for " " to also match newlines.
I think you should consider replacing " " with \s+ in your pattern, since \s matches whitespace, including any of these characters: [\r\n\t\f ].
The "+" in \s+ is to repeat one or more times. That way, it can match
How are you [space] + [cr] + [lf]
today tommy?
Solution
In the example you provided,
[English] | [Arabic]
how\s+are\s+you\s+today\s+tommy\? | asking about tommy's health
Alternative: if you can't store that pattern in the database, do the replacement in your code (this is less efficient, because it is repeated for every row on every call)
Dim en As String = row.Item("English").Replace("?", "\?").Replace(" ","\s+")
Also, I noticed you are escaping the question marks. You should also be aware there might be some other metacharacters. So please consider using Regex.Escape() instead.
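Combining the two suggestions, here is a minimal sketch of building the pattern from a plain stored phrase. It relies on the fact that Regex.Escape also escapes whitespace, so a space comes out as a backslash followed by a space, which can then be swapped for \s+. The module name PatternDemo, the helper BuildPattern and the sample strings are assumptions for illustration, not part of the original program.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module PatternDemo
    ' Hypothetical helper: escapes every metacharacter in the stored phrase,
    ' then lets each escaped space ("\ ") match any run of whitespace.
    Function BuildPattern(english As String) As String
        Return Regex.Escape(english).Replace("\ ", "\s+")
    End Function

    Sub Main()
        Dim stored As String = "How are you today tommy?"
        ' The same sentence typed with a line break in the middle.
        Dim typed As String = "how are you" & Environment.NewLine & "today tommy?"

        Dim pattern As String = BuildPattern(stored)
        Console.WriteLine(pattern)
        Console.WriteLine(Regex.IsMatch(typed, pattern, RegexOptions.IgnoreCase))
        Console.WriteLine(Regex.Replace(typed, pattern, "asking about tommy's health",
                                        RegexOptions.IgnoreCase))
    End Sub
End Module
```

One thing to keep in mind with this approach: the replacement text passed to Regex.Replace treats $ specially, so a translation containing a dollar sign would need extra care.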

Get/split text inside brackets/parentheses

I have a list of words, such as:
gram (g)
kilogram (kg)
pound (lb)
I'm just wondering how I would get the text within the brackets, for example the "g" in "gram (g)", and store it in a new string.
Possibly using regex?
Thanks.
Use split function ..
strArr = str.Split("(") ' splitting 'gram (g)' returns an array ["gram ", "g)"], index 0 and 1
strArr2 = strArr(1).Split(")") ' splitting 'g)' returns an array ["g", ""]
The string is in
strArr2(0)
Edit
you want getAbbrev and getAbbrev2 to be arrays
try
Dim getAbbrev As String() = Str.Split("(")
Dim getAbbrev2 As String() = getAbbrev(1).Split(")")
To do it without declaring arrays you can do
"gram (g)".Split("(")(1).Split(")")(0)
but that's unreadable
Edit
You have some very trivial errors. I would suggest you strengthen your understanding of objects and declarations first. Then you can look into invoking methods. I'd rather have you understand it than just give it to you. Re-read the book you have or look for a basic tutorial.
Dim unit As String = "gram (g)" 'make sure this is the actual string you are getting, not sure where you are supposed to get the string value from
Dim getAbbrev As String() = unit.Split("(") 'use unit not Str - Str does not exist
Dim getAbbrev2 As String() = getAbbrev(1).Split(")") 'VB indexes arrays with parentheses, not square brackets
For the last line, reference getAbbrev2 instead of the unknown abbrev2.
Fun with Regular Expressions (I'm really not an expert here, but tested and works)
Imports System.Text.RegularExpressions
.....
Dim charsToTrim() As Char = { "("c, ")"c }
Dim test as String = "gram (g)" + Environment.NewLine +
"kilogram (kg)" + Environment.NewLine +
"pound (lb)"
Dim pattern as String = "\([a-zA-Z0-9]*\)"
Dim r As Regex = new Regex(pattern, RegexOptions.IgnoreCase)
Dim m As Match = r.Match(test)
While(m.Success)
System.Diagnostics.Debug.WriteLine("Match" + "=" + m.Value.ToString())
Dim tempText as String = m.Value.ToString().Trim(charsToTrim)
System.Diagnostics.Debug.WriteLine("String Trimmed" + "=" + tempText)
m = m.NextMatch()
End While
You can split at the space and remove the parentheses from the second token (by replacing them with an empty string).
A regex is also an option, and a very simple one. Its pattern is
\w+\s+\((\w+)\)
which means: a word, then at least one whitespace character, then an opening parenthesis, then (inside the regex's own capturing parentheses) a word, and finally a closing parenthesis. The inner parentheses are capturing parentheses, which make it possible to refer to the unit: g, kg, lb.
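As a minimal sketch of how the capturing group is read back in VB.NET (the module name UnitDemo and the helper GetAbbreviation are made up for illustration):

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module UnitDemo
    ' Hypothetical helper: returns the text inside the parentheses,
    ' or Nothing when the entry does not have the "word (word)" shape.
    Function GetAbbreviation(entry As String) As String
        Dim m As Match = Regex.Match(entry, "\w+\s+\((\w+)\)")
        Return If(m.Success, m.Groups(1).Value, Nothing)
    End Function

    Sub Main()
        For Each entry As String In {"gram (g)", "kilogram (kg)", "pound (lb)"}
            ' Groups(0) is the whole match; Groups(1) is the first capturing group.
            Console.WriteLine(GetAbbreviation(entry))
        Next
    End Sub
End Module
```

This prints g, kg and lb on separate lines for the three sample entries.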