Splitting the string except for symbols - regex

I tried these two snippets:
Dim splitQuery() As String = Regex.Split(TextBoxQuery.Text, "\s+")
and
Dim splitQuery() As String = TextBoxQuery.Text.Split(New Char() {" "c})
My example query is "a dog ." (notice there is a single space between "dog" and "."). When I check the length of splitQuery, it gives me 3, and the split words are "a", "dog", and ".".
How can I stop it from counting "." and other symbols as words? I want only alphanumeric words/terms to be stored in my splitQuery array. Thanks.

I suggest doing that in 2 steps:
Use txt = Regex.Replace(TextBoxQuery.Text, "\W+$", "", RegexOptions.RightToLeft) to remove the non-word characters from the end of the string
Then, split with \s+: splits = Regex.Split(txt, "\s+")
If you prefer to split with any non-word chars, you may use
splits = Regex.Split(Regex.Replace(TextBoxQuery.Text, "^\W+|\W+$", ""), "\W+")
Here, Regex.Replace(TextBoxQuery.Text, "^\W+|\W+$", "") removes non-word chars both at the start and end of string.
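The trim-then-split approach above can be wrapped in a small helper. This is only a sketch: the module name WordSplitter and the function SplitWords are made up for illustration, and the empty-input guard is an addition, since Regex.Split returns a single empty string when nothing is left after trimming.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module WordSplitter
    ' Hypothetical helper: returns only the word-like terms of a query.
    Function SplitWords(ByVal query As String) As String()
        ' Strip non-word characters from both ends of the string.
        Dim trimmed As String = Regex.Replace(query, "^\W+|\W+$", "")
        ' Regex.Split on an empty string yields one empty entry,
        ' so return an empty array in that case instead.
        If trimmed.Length = 0 Then Return New String() {}
        ' Split on any run of non-word characters (spaces, dots, and so on).
        Return Regex.Split(trimmed, "\W+")
    End Function

    Sub Main()
        Dim words As String() = SplitWords("a dog .")
        Console.WriteLine(words.Length)            ' 2
        Console.WriteLine(String.Join(",", words)) ' a,dog
    End Sub
End Module
```

Note that \w also matches the underscore, so a term such as "my_dog" stays in one piece with this pattern.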

You should also be able to create a string of unwanted characters, trim them from the ends of the input, and split with StringSplitOptions.RemoveEmptyEntries.
Dim unwanted As String = "./?!#"
Dim splitQuery() As String = yourString.Trim(unwanted.ToCharArray()).Split(New Char() {" "c}, StringSplitOptions.RemoveEmptyEntries)

I would tackle this problem in two parts.
I would split up the text by spaces like you're doing
I would then run through that list of words and remove any query terms that are non-alphanumeric.
The following is an example of that:
Imports System.Collections
Imports System.Text.RegularExpressions
' ... Your Other Code ...
' A function to determine if a string is AlphaNumeric
Private Function IsAlphaNum(ByVal strInputText As String) As Boolean
Dim IsAlpha As Boolean = False
If System.Text.RegularExpressions.Regex.IsMatch(strInputText, "^[a-zA-Z0-9]+$") Then
IsAlpha = True
Else
IsAlpha = False
End If
Return IsAlpha
End Function
' A function to get the words from the textbox
Private Function GetWords() As String()
' Get a raw list of all words separated by spaces
Dim splitQuery() As String = Regex.Split(TextBoxQuery.Text, "\s+")
' ArrayList to place all words into:
Dim alWords As New ArrayList()
' Loop all words and check them:
For Each word As String In splitQuery
If(IsAlphaNum(word)) Then
' Word is alphanumeric
' Add it to the list of alphanumeric words
alWords.add(word)
End If
Next
' Convert the ArrayList of words to a primitive array of strings
Dim words As String() = CType(alWords.ToArray(GetType(String)), String())
' Return the list of filtered words
return words
End Function
This code does the following:
splits up the textbox's text
declares an ArrayList for the filtered query terms/words
loops through all the words in the split up array of terms/words
it then checks if the term is alphanumeric
If the term is alphanumeric, it is added to the ArrayList. If it's not alphanumeric, the term is disregarded.
Finally, it casts the terms/words in the ArrayList back to a normal String array and returns.
Because this solution uses an ArrayList, it requires System.Collections as an import.

Related

Vba: Regular expression to count the number of words in a string delimited by special characters

I need help writing a regular expression to count the number of words in a string when the words are separated by special characters such as ., -, +, /, Tab, etc. The count should exclude the special characters. (Please note that the data is an HTML string that needs to be placed into a spreadsheet.)
**Original String** **End Result**
Ex : One -> 1
One. -> 1
One Two -> 2
One.Two -> 2
One Two. -> 2
One.Two. -> 2
One.Tw.o -> 3
Updated
I think you asked a valuable question and this downvoting is not fair!
Function WCount(ByVal strWrd As String) As Long
'Variable declaration
Dim Delimiters() As Variant
Dim Delimiter As Variant
'Initialization
Delimiters = Array("+", "-", ".", "/", Chr(13), Chr(9)) 'Define your delimiter characters here.
'Core
For Each Delimiter In Delimiters
strWrd = Replace(strWrd, Delimiter, " ")
Next Delimiter
strWrd = Trim(strWrd)
Do While InStr(1, strWrd, "  ") > 0
strWrd = Replace(strWrd, "  ", " ")
Loop
WCount = UBound(Split(strWrd, " ")) + 1
End Function
________________
You can use this function as a UDF in Excel formulas or call it from other VBA code.
Using in formula
=WCOUNT("One.Two.Three.") or =WCOUNT($A$1), assuming your string is in cell A1.
Using in VBA
(Assuming your string is passed in a String variable, here named s.)
Sub test()
Dim s As String
s = "One.Two.Three."
Debug.Print WCount(s)
End Sub
Regards.
Update
I have tested your text by copying it into a cell in Excel.
The code has been updated to handle line-break and tab characters, and it now counts the words in your string correctly.
Try this code, all necessary comments are in code:
Sub SpecialSplit()
Dim i As Long
Dim str As String
Dim arr() As String
Dim delimiters As Variant
Dim delimiter As Variant
'here you define all special delimiters you want to use
delimiters = Array(".", "+", "-", "/")
For i = 1 To 9
str = Cells(i, 1).Value
'here we replace all special delimiters with a space to simplify splitting
For Each delimiter In delimiters
str = Replace(str, delimiter, " ")
Next
'trim the ends and collapse repeated spaces, so a leading, trailing or doubled delimiter does not produce an empty string
str = Application.WorksheetFunction.Trim(str)
arr = Split(str)
Cells(i, 2).Value = UBound(arr) - LBound(arr) + 1
Next
End Sub
With your posted data, the following RegExp works correctly. Put this in a general (standard) module in the Visual Basic Editor.
Public Function CountWords(strInput As String) As Long
Dim objMatches
With CreateObject("VBScript.RegExp")
.Global = True
.MultiLine = True
.Pattern = "\w+"
Set objMatches = .Execute(strInput)
CountWords = objMatches.Count
End With
End Function
You use it like a normal formula. For example, assuming the data is in cell A1, the formula would be:
=CountWords(A1)
For your information, this can also be achieved with a plain formula if the set of delimiter characters is fixed, like so:
=LEN(TRIM(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(TRIM(A1),"."," "),","," "),"-"," "),"+"," "),"/"," "),"\"," ")))-LEN(SUBSTITUTE(TRIM(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(TRIM(A1),"."," "),","," "),"-"," "),"+"," "),"/"," "),"\"," "))," ",""))+1

.Net Regular Expression(Regex)

VB.NET separate strings using regex split?
I'm having a logic error with the pattern string variable; the error occurs after I extend the string from "(-)" to "(-)(+)(/)(*)".
Dim input As String = txtInput.Text
Dim pattern As String = "(-)(+)(/)(*)"
Dim substrings() As String = Regex.Split(input, pattern)
For Each match As String In substrings
lstOutput.Items.Add(match)
Next
When my pattern string is "(-)", it works fine and this is my output:
input: dog-
output: dog
-
This is my desired output, but something is wrong with the code: I get an error after changing the pattern to "(-)(+)(/)(*)", and also with
"(-)" + "(+)" + "(/)" + "(*)"
input: dog+cat/tree
output: dog
+
cat
/
tree
When the textbox input contains a space character, the listbox should still show:
input: dog+cat/ tree
output: dog
+
cat
/
tree
You need a character class, not a sequence of subpatterns inside separate capturing groups:
Dim pattern As String = "([+/*-])"
This pattern will match and capture into Group 1 (and thus, all the captured values will be part of the resulting array) a char that is either a +, /, * or -. Note the position of the hyphen: since it is the last char in the character class, it is treated as a literal -, not a range operator.
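As a rough, self-contained sketch of how the captured operators end up in the result array: the module and function names below are invented for illustration, and the \s* on either side of the group is an addition (not part of the pattern above) meant to drop stray spaces such as the one in "dog+cat/ tree".

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module OperatorSplit
    ' Hypothetical helper: splits on + / * - and keeps each operator as its own item.
    Function SplitKeepingOperators(ByVal input As String) As String()
        ' \s* on both sides swallows optional spaces around the operator;
        ' only the operator itself sits inside the capturing group, so only
        ' the operator is added to the result.
        Return Regex.Split(input, "\s*([+/*-])\s*")
    End Function

    Sub Main()
        For Each part As String In SplitKeepingOperators("dog+cat/ tree")
            Console.WriteLine(part)
        Next
        ' Prints dog, +, cat, / and tree on separate lines.
    End Sub
End Module
```

When the input ends with an operator, as in "dog-", Regex.Split also returns a final empty string after the operator, so you may want to skip empty items before adding them to the listbox.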

regex for String contains comma semicolon and Carriage return

I need to check if an input string contains a semicolon or a comma or carriage return or all of them
Dim Input As String = "1298-673-4192,A08Z-931-468A;"
Dim pattern as string ="^[a-zA-Z0-9 \r , ; ]*$"
Dim regex As New Regex(pattern)
regex.Ismatch(Input)
I get False even though a comma, a semicolon, and a carriage return are present in the string.
This will work:
Dim Input As String = "1298-673-4192,A08Z-931-468A;"
Dim pattern as string ="[\r,;]+"
Dim regex As New Regex(pattern)
regex.Ismatch(Input)
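I removed ^ and $ because they force the pattern to match the entire string (^ means beginning of string while $ means end). Your input contains hyphens, which are not in the character class, so the anchored pattern can never match it.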
I changed * to + because * checks for 0 or more while + checks for 1 or more.
I removed the spaces because they were not necessary.
I removed the a-zA-Z0-9 because that will also match all alphanumeric characters.
Before (With string begin/end characters, spaces, alphanumerical and *).
After (Without all the stuff that causes it to break).
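To make the difference between the two patterns concrete, here is a small sketch. The module name SeparatorCheck and the function ContainsSeparator are placeholders, and the + quantifier is left out because IsMatch only needs one matching character.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module SeparatorCheck
    ' Hypothetical helper: True if the value contains a comma, a semicolon or a carriage return.
    Function ContainsSeparator(ByVal value As String) As Boolean
        Return Regex.IsMatch(value, "[\r,;]")
    End Function

    Sub Main()
        Dim sample As String = "1298-673-4192,A08Z-931-468A;"
        ' Anchored original: every character must be in the class, and "-" is not.
        Console.WriteLine(Regex.IsMatch(sample, "^[a-zA-Z0-9 \r , ; ]*$")) ' False
        ' Unanchored class: one matching character anywhere is enough.
        Console.WriteLine(ContainsSeparator(sample))                        ' True
        Console.WriteLine(ContainsSeparator("1298-673-4192"))               ' False
    End Sub
End Module
```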

Scan a file for a string of words ignoring extra whitespaces using VB.NET

I am searching a file for a string of words. For example "one two three". I have been using:
Dim text As String = File.ReadAllText(filepath)
For each phrase in phrases
index = text.IndexOf(phrase, StringComparison.OrdinalIgnoreCase)
If index >= 0 Then
Exit For
End If
Next
and it worked fine, but now I have discovered that some files might contain the target phrases with more than one whitespace character between words.
For example, my code finds
"one two three" but fails to find "one   two  three".
Is there a way I can use regular expressions, or any other technique, to find the phrase even if the gap between words is more than one whitespace character?
I know I could use
Dim text As String = File.ReadAllText(filepath)
For each phrase in phrases
text = text.Replace("  ", " ")
index = text.IndexOf(phrase, StringComparison.OrdinalIgnoreCase)
If index >= 0 Then
Exit For
End If
Next
But I wanted to know if there is a more efficient way to accomplish that.
You can make a function to remove any double spaces.
Option Strict On
Option Explicit On
Option Infer Off
Public Class Form1
Private Sub Form1_Load(sender As Object, e As EventArgs) Handles MyBase.Load
Dim testString As String = "one  two   three    four     five  six"
Dim excessSpacesGone As String = RemoveExcessSpaces(testString)
'one two three four five six
Clipboard.SetText(excessSpacesGone)
MsgBox(excessSpacesGone)
End Sub
Function RemoveExcessSpaces(source As String) As String
Dim result As String = source
Do
result = result.Replace("  ", " ")
Loop Until result.IndexOf("  ") = -1
Return result
End Function
End Class
Comments in the code will explain the code
Dim inputStr As String = "This contains one   Two    three and some other words" '<--- this would be the input from the file
inputStr = Regex.Replace(inputStr, "\s{2,}", " ") '<--- Replace extra white spaces if any
Dim searchStr As String = "one two three" '<--- be the string to be searched
searchStr = Regex.Replace(searchStr, "\s{2,}", " ") '<--- Replace extra white spaces if any
If UCase(inputStr).Contains(UCase(searchStr)) Then '<--- check if input contains search string
MsgBox("contains") '<-- display message if it contains
End If
You could convert your phrases into regular expressions with \s+ between each word, and then check the text for matches against that. e.g.
Dim text = "This contains one Two three"
Dim phrases = {
"one two three"
}
' Splits each phrase into words and create the regex from the words.
For each phrase in phrases.Select(Function(p) String.Join("\s+", p.Split({" "c}, StringSplitOptions.RemoveEmptyEntries)))
If Regex.IsMatch(text, phrase, RegexOptions.IgnoreCase) Then
Console.WriteLine("Found!")
Exit For
End If
Next
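Note that this doesn't check for word boundaries at the beginning/end of the phrase, so "This contains someone two threesome" would also match. If you don't want that, add "\b" (a word boundary) at both ends of the regex. Adding "\s" instead would also reject that text, but it would miss a phrase at the very start or end of the text.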
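Putting those pieces together, a whole-word, whitespace-tolerant search might look like the sketch below. BuildPhrasePattern and ContainsPhrase are hypothetical names, and the Regex.Escape call is an extra precaution for phrases that contain regex metacharacters. The \b anchors assume each phrase starts and ends with a word character.

```vbnet
Imports System
Imports System.Linq
Imports System.Text.RegularExpressions

Module PhraseSearch
    ' Hypothetical helper: builds a whitespace-tolerant, whole-word pattern for a phrase.
    Function BuildPhrasePattern(ByVal phrase As String) As String
        Dim words As String() = phrase.Split(New Char() {" "c}, StringSplitOptions.RemoveEmptyEntries)
        ' Escape each word so characters like "." or "+" are matched literally.
        Dim escaped As String() = words.Select(Function(w) Regex.Escape(w)).ToArray()
        ' \b anchors the phrase at word boundaries; \s+ allows any run of whitespace between words.
        Return "\b" & String.Join("\s+", escaped) & "\b"
    End Function

    ' Hypothetical helper: case-insensitive search for the phrase in the text.
    Function ContainsPhrase(ByVal text As String, ByVal phrase As String) As Boolean
        Return Regex.IsMatch(text, BuildPhrasePattern(phrase), RegexOptions.IgnoreCase)
    End Function

    Sub Main()
        Console.WriteLine(ContainsPhrase("This contains one   Two    three", "one two three"))     ' True
        Console.WriteLine(ContainsPhrase("This contains someone two threesome", "one two three")) ' False
    End Sub
End Module
```

Because \s also matches tabs and line breaks, a phrase that wraps across two lines in the file is found as well.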

how to remove double characters and spaces from string

Please let me know how to remove double spaces and repeated characters from the string below.
String = Test----$$$$19****45####  Nothing
Clean String = Test-$19*45# Nothing
I have used the regex "\s+", but it only removes the double spaces. I have tried other regex patterns, but they are too complex. Please help me.
I am using VB.NET.
What you'll want to do is create a backreference to any character, and then remove the following characters that match that backreference. It's usually possible using the pattern (.)\1+, which should be replaced with just that backreference (once). It depends on the programming language how it's exactly done.
Dim text As String = "Test###_&aa&&&"
Dim result As String = New Regex("(.)\1+").Replace(text, "$1")
result will now contain Test#_&a&. Alternatively, you can use a lookaround to not remove that backreference in the first place:
Dim text As String = "Test###_&aa&&&"
Dim result As String = New Regex("(?<=(.))\1+").Replace(text, "")
Edit: included examples
For a faster alternative try:
Dim text As String = "Test###_&aa&&&"
Dim sb As New StringBuilder(text.Length)
Dim lastChar As Char
For Each c As Char In text
If c <> lastChar Then
sb.Append(c)
lastChar = c
End If
Next
Console.WriteLine(sb.ToString())
Here is a perl way to substitute all multiple non word chars by only one:
my $String = 'Test----$$$$19****45#### Nothing';
$String =~ s/(\W)\1+/$1/g;
print $String;
output:
Test-$19*45# Nothing
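The same non-word-only pattern should carry over to VB.NET, which is what the question asked about. The sketch below is an illustration rather than a tested drop-in, and the names RepeatCollapser and CollapseRepeatedSymbols are made up. Unlike (.)\1+, the (\W) group leaves doubled letters and digits alone.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module RepeatCollapser
    ' Hypothetical helper: collapses runs of the same non-word character into one.
    Function CollapseRepeatedSymbols(ByVal text As String) As String
        ' (\W) captures one non-word character; \1+ matches its immediate repeats.
        ' The whole run is replaced by the single captured character.
        Return Regex.Replace(text, "(\W)\1+", "$1")
    End Function

    Sub Main()
        Console.WriteLine(CollapseRepeatedSymbols("Test----$$$$19****45####  Nothing")) ' Test-$19*45# Nothing
        Console.WriteLine(CollapseRepeatedSymbols("book  keeper!!"))                     ' book keeper!
    End Sub
End Module
```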
Here's how it would look in Java...
String raw = "Test----$$$$19****45#### Nothing";
String cleaned = raw.replaceAll("(.)\\1+", "$1");
System.out.println(raw);
System.out.println(cleaned);
prints
Test----$$$$19****45#### Nothing
Test-$19*45# Nothing