parse out the number value in a string using vb.net - regex

I have two different strings.
www.ncbi.nlm.nih.gov/myncbi/browse/collection/40918026/?sort=date&direction=descending
and
https://www.ncbi.nlm.nih.gov/sites/myncbi/john.smith.1/bibliography/47926757/public/?sort=date&direction=descending
I need the number in the path segment after the word collection or bibliography. I know I can split on "/", but if the string starts with http the positions shift: the number would be at position 5 in one string and 6 in the other. Is there a better way using regex? I know I could put together a bunch of code that searches for either word and then handles each case differently, but I'm looking for a cleaner way to pull the number out.
I'm using
Dim str() As String = TextBox1.Text.Split("/")
For i As Integer = 0 To str.Length - 1
    If Regex.IsMatch(str(i), "^[0-9 ]+$") Then
        MessageBox.Show(str(i))
    End If
Next
But hoped for something cleaner

Try with this regex: (?:collection|bibliography)\/(\d+)
The desired number will be in the first capturing group.
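As a rough illustration of the capture-group approach, the sketch below applies the same pattern to both URLs from the question. It is written in JavaScript (already used elsewhere on this page) rather than VB.NET, and `extractId` is a hypothetical helper name. In VB.NET the equivalent would be reading `Groups(1).Value` from the result of `Regex.Match`.

```javascript
// Hypothetical helper: returns the digits that follow "collection/" or
// "bibliography/", or null when the URL contains neither.
function extractId(url) {
  const match = /(?:collection|bibliography)\/(\d+)/.exec(url);
  return match ? match[1] : null;
}

const urls = [
  'www.ncbi.nlm.nih.gov/myncbi/browse/collection/40918026/?sort=date&direction=descending',
  'https://www.ncbi.nlm.nih.gov/sites/myncbi/john.smith.1/bibliography/47926757/public/?sort=date&direction=descending',
];

for (const url of urls) {
  console.log(extractId(url));
}
```

Because the pattern keys on the word before the number, it does not matter whether the string starts with http or how many path segments precede it.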

A similar but simpler alternative without splitting:
As per your examples (assuming one eight-digit number surrounded by "/"):
Dim Result As String = Regex.Match(TextBox1.Text, "\/\d{8}\/").Value.Replace("/", String.Empty)
Result will contain your number if matched, else String.Empty
Reference: Regex.Match Method
Example alternatives:
Only match numbers with length of 8 to 10 digits enclosed in "/": "\/\d{8,10}\/"
Only match numbers with length of 4 or more digits enclosed in "/": "\/\d{4,}\/"
Match numbers of any length enclosed in "/": "\/\d+\/"
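The variants above differ only in the quantifier, so they can be sketched as one function with a minimum-length parameter. This is a JavaScript sketch, not the VB.NET from the answer; `firstNumericSegment` and `minDigits` are hypothetical names, and a capture group is used instead of stripping the slashes afterwards.

```javascript
// Hypothetical helper: finds the first all-digit path segment with at least
// `minDigits` digits and returns it without the surrounding slashes.
// Returns an empty string when nothing matches.
function firstNumericSegment(url, minDigits = 1) {
  const pattern = new RegExp(`/(\\d{${minDigits},})/`);
  const match = pattern.exec(url);
  return match ? match[1] : '';
}

const collectionUrl =
  'www.ncbi.nlm.nih.gov/myncbi/browse/collection/40918026/?sort=date&direction=descending';
const datedUrl = 'example.org/2024/collection/40918026/';

console.log(firstNumericSegment(collectionUrl));   // any length
console.log(firstNumericSegment(datedUrl, 1));     // picks up the earlier "2024"
console.log(firstNumericSegment(datedUrl, 8));     // skips the short segment
```

The second call shows the trade-off of this approach: without a keyword to anchor on, any earlier numeric segment is returned first, so a tighter length requirement is the only safeguard.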

Related

Access multiple captures of one capture group in substitution string

Suppose I have the regex (\d)+.
In .Net I can access all captures of this capture group using the match.Groups[1].Captures.
Can I also access these captures in a substitution string?
So, for example, for the input string 523, I need to use 5, 2 and 3 in my substitution string (and not just 3, which is $1).
If you intend to capture each digit in its own capturing group, then you need to write a separate capturing group for every digit, like this:
(\d)(\d)(\d)
NOTE: This does not scale very well, and you cannot match numbers of any length other than 3 digits. In other words, there is no match on 23, and on 345667 the pattern only ever matches three digits at a time!
A good page with a long and detailed explanation of why this can't be done with (\d)+ can be found here:
https://www.regular-expressions.info/captureall.html
So if this is indeed what you want then you need to craft your own loop that searches the string for every digit separately.
If, on the other hand, you need to capture the number and not the individual digits, then you have simply put the + sign in the wrong position. I think you should write:
(\d+)
I think the OP wants to get every single digit match separately.
Perhaps this will help you then:
' Needs: Imports System.Collections.Specialized and Imports System.Text.RegularExpressions
' strName is the input string to search (not defined in this snippet)
' Create a list to put the resulting matches in
Dim ResultList As StringCollection = New StringCollection()
Dim RegexObj As New Regex("(\d)")
Dim MatchResult As Match = RegexObj.Match(strName)
While MatchResult.Success
    ResultList.Add(MatchResult.Groups(1).Value)
    ' Console.WriteLine(MatchResult.Groups(1).Value)
    MatchResult = MatchResult.NextMatch()
End While
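The difference between a repeated group and per-digit matching can be seen in a few lines. This is a JavaScript sketch of the same idea, not .NET code: JavaScript has no Captures collection, so it only demonstrates the "last capture wins" behaviour of $1 and the loop-style workaround. The variable names are illustrative.

```javascript
const input = '523';

// With a repeated group, only the last iteration survives in $1.
const lastOnly = input.replace(/(\d)+/, '[$1]');

// Matching one digit at a time yields every digit as its own match.
const digits = Array.from(input.matchAll(/\d/g), (m) => m[0]);

// The replacement text can then be assembled by hand from the collected digits.
const rebuilt = digits.join('-');

console.log(lastOnly);
console.log(digits);
console.log(rebuilt);
```

In other words, the substitution string alone cannot reach the earlier captures; the digits have to be collected in code and the replacement built from them.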

Regexp to match the whole string if there is/isn't a specific number in it

I am trying to find a way to have a regexp match a whole delimited string in case it fulfils one of the conditions:
The string should not contain number 1 (as a single digit, not 11 or 12)
The string contains number 1 (as a single digit, not 11 or 12)
The strings can be like the following format:
1,2,wo,9,5
1
wo,1,11
I have tried the following regexp:
/^.*\b(1)\b.*$
/^((?!1).)*$
I am trying to match the whole string and I would like to substitute the whole string if one of the conditions is met.
This regex will find all strings which have an occurrence of 1 as a single digit:
/^.*\b1\b.*$/
When you find a match, you can replace the whole string with the word 'true' using String.replace:
const strings = ['1,2,wo,9,5','1','wo,1,11'];
strings.forEach(s => console.log(s.replace(/^.*\b1\b.*$/, 'true')));
If you just wanted to replace the 1 with something, you could use a much simpler regex, /\b1\b/. To replace all occurrences, use the g flag:
const strings = ['1,2,wo,1,5','1','wo,1,11'];
strings.forEach(s => console.log(s.replace(/\b1\b/g, 'true')));
If you want to find strings that don't include 1 as a single digit, you can use a negative lookahead i.e.
^(?!.*\b1\b.*$).*$
and again use String.replace to replace the whole string with something e.g.
const strings = ['1,2,wo,9,5','1','wo,1,11','45,x,z,23'];
strings.forEach(s => console.log(s.replace(/^(?!.*\b1\b.*$).*$/, 'false')));
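If the goal is only to decide which of the two conditions a string meets, a single test with /\b1\b/ covers both cases, since one is the negation of the other. A minimal sketch, with `hasSingleOne` and `samples` as illustrative names:

```javascript
// True when the string contains 1 as a standalone token (not 11, 12, ...).
const hasSingleOne = (s) => /\b1\b/.test(s);

const samples = ['1,2,wo,9,5', '1', 'wo,1,11', '45,x,z,23', '11,12'];

for (const s of samples) {
  console.log(s, hasSingleOne(s) ? 'contains 1' : 'does not contain 1');
}
```

Note that \b treats letters, digits and underscores as word characters, so a 1 glued to a letter (for example a1) would not count as a standalone 1 under this test.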

Regex words with letters, numbers, optional special characters in any order

I've been using some help on here for a while now but cannot find anything specific to my requirement. I need to pick out whole words which contain at least 6 letters and/or numbers (combined, not each), with optional 'special' characters. All in any order, so A12345, 12345A, 1-2-345-A, 12A45B and so-on.
I've done a fiddle here. I'm almost there (though it could be done better), but I can't work out why it needs at least 6 numbers to get a match. Is it because the letters are all optional with *?
This is VBA so no access to look behinds. The special characters will only ever be 'within' the match, not start or end (will never be -1234-A- for example).
I think this is what you are looking for:
[a-z0-9/-]{6,}
That will match a run of at least 6 characters, in any order, drawn from a to z, 0 to 9, - and /. Note that the - is at the end of the character class. You can have it in the middle, but then you need to escape it. Also, / will need to be escaped if your delimiters are also /.
update
As Wiktor noted this would also capture ------ which may not be what you want. I would suggest simply cleaning out all optional characters, and then running the above regex. I would delete my answer since I'm not providing exactly what was being asked, but it would be a workable solution so it may have value.
You could do a regex replacement to remove all non letters/numbers, and then check that the length of the resulting string is 6 or more:
Dim input As String = "A-1234-B"
Dim pattern As String = "[^A-Za-z0-9]+"
Dim replacement As String = ""
Dim rgx As New Regex(pattern)
Dim result As String = rgx.Replace(input, replacement)
Console.WriteLine(result.Length) ' 6
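The two ideas above (restrict the allowed characters, then count only letters and digits) can be combined into one check per word. The following is a JavaScript sketch under the question's stated assumption that special characters never start or end a word; `qualifies` and `pickWords` are hypothetical names, and only - and / are treated as special characters here.

```javascript
// A word qualifies when it starts and ends with a letter or digit, contains
// only letters, digits, "-" or "/", and has at least 6 letters/digits in total.
function qualifies(word) {
  const shape = /^[A-Za-z0-9](?:[A-Za-z0-9\/-]*[A-Za-z0-9])?$/;
  if (!shape.test(word)) return false;
  return word.replace(/[^A-Za-z0-9]/g, '').length >= 6;
}

// Splits text on whitespace and keeps the qualifying words.
function pickWords(text) {
  return text.split(/\s+/).filter(qualifies);
}

console.log(pickWords('ref A-1234-B and 12A45 plus ------ or 1-2-345-A'));
```

Counting after stripping avoids the ------ problem, because a run of special characters contributes nothing to the length.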

Excel VBA Regex Check For Repeated Strings

I have some user input that I want to validate for correctness. The user should input 1 or more sets of characters, separated by commas.
So these are valid input
COM1
COM1,COM2,1234
these are invalid
COM -- only 3 characters
COM1,123 -- one set is only 3 characters
COM1.1234,abcd -- a dot separator not comma
I googled for a regex pattern to this and found a possible pattern that tested for a recurring instance of any 3 characters, and I modified like so
/^(.{4,}).*\1$/
but this is not finding matches.
I can manage the last comma that may or may not be there before passing to the test so that it is always there.
Preferably, I would like to test for letters (any case) and numbers only, but I can live with any characters.
I know I could easily do this in straight VBA by splitting the input on a comma delimiter and looping through each character of each array element, but regex seems more efficient, and I will have more cases that have slightly different patterns, so parameterising the regex for that would be a better design.
TIA
I believe this does what you want:
^([A-Za-z0-9]{4},)*[A-Za-z0-9]{4}$
It matches the start of the line, followed by zero or more groups of four letters or digits each ending with a comma, followed by one final group of four letters or digits and the end of the line. Note that | is not needed inside a character class; there it would match a literal pipe.
You can play around with it here: https://regex101.com/r/Hdv65h/1
The regular expression
"^[\w]{4}(,[\w]{4})*$"
should work.
You can check whether it works for all your cases using the following Sub, assuming your test strings are in cells A1 through A5 on the spreadsheet:
Sub findPattern()
    ' Requires a reference to "Microsoft VBScript Regular Expressions 5.5"
    Dim regEx As New RegExp
    Dim mat As MatchCollection
    Dim i As Integer
    Dim val As String
    regEx.Global = True
    regEx.IgnoreCase = True
    regEx.Pattern = "^[\w]{4}(,[\w]{4})*$"
    For i = 1 To 5
        val = Trim(Cells(i, 1).Value)
        Set mat = regEx.Execute(val)
        If mat.Count = 0 Then
            MsgBox "No match found for " & val
        Else
            MsgBox "Match found for " & val
        End If
    Next i
End Sub
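Since the question mentions parameterising the pattern for other cases, here is one way that could look. It is a JavaScript sketch rather than VBA; `makeListValidator` and `setLength` are hypothetical names, and it restricts sets to letters and digits (unlike \w, which also accepts underscores).

```javascript
// Builds a validator for comma-separated sets of exactly `setLength`
// letters or digits, e.g. "COM1,COM2,1234" for setLength = 4.
function makeListValidator(setLength) {
  const set = `[A-Za-z0-9]{${setLength}}`;
  return new RegExp(`^${set}(?:,${set})*$`);
}

const fourCharSets = makeListValidator(4);

const inputs = ['COM1', 'COM1,COM2,1234', 'COM', 'COM1,123', 'COM1.1234,abcd'];
for (const input of inputs) {
  console.log(input, fourCharSets.test(input) ? 'valid' : 'invalid');
}
```

The same approach should carry over to VBA by concatenating the length into the Pattern string before calling Execute or Test.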

Regular expressions string replacement of individual match within file

I have written a small program to whir through a textfile and find and replace regex where 9 digits \d{9}. It works fine, except what I need is a little more complicated.
I am finding the right data correctly. theFile is just a string with the text file streamread into it. I do this and then create and write it to another file.
But I need to find each string match individually, and replace that match with only the last 5 digits of that individual number (currently this is just replacing with FOUND). Keeping the file otherwise identical.
I am not sure of the best way of doing this. Would I have to split the text into an array of strings rather than one big string? (It's quite a big file.)
Any questions let me know, thanks in advance.
Dim regexString As String = "(\d{9})"
Dim replacement1 As String = "FOUND"
Dim rgx As New Regex(regexString)
Try
    theFile = rgx.Replace(theFile, replacement1)
Catch
End Try
Instead of matching all nine digits with a single group (\d{9}), split the pattern into two groups: the first matches 4 digits and the second 5. Then, in the replacement, use only the second group, which holds the last 5 digits:
Dim k = "abcd 123456789 abcf"
Dim ptn = "(\d{4})(\d{5})"
Dim result = Regex.Replace(k, ptn, "$2")
This approach leaves sequences of fewer than 9 consecutive digits unchanged. If you also have sequences of more than 9 digits and don't want to change them, you need a pattern with word boundaries:
Dim ptn = "(\b\d{4})(\d{5}\b)"
which anchors the two groups to a sequence of exactly nine digits, provided the sequence is not directly adjacent to letters, other digits or underscores.
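To make the effect of the boundaries concrete, the sketch below runs the replacement on a string containing a nine-digit run, a ten-digit run, and a nine-digit run glued to letters. It is a JavaScript sketch, not the VB.NET from the answer, and it uses one capture group instead of two. The second pattern, using lookarounds instead of \b, is an alternative not taken from the answer; it assumes a regex engine with lookbehind support (which .NET and current JavaScript engines have).

```javascript
const text = 'abcd 123456789 abcf 1234567890 x123456789y';

// Word boundaries: only digit runs delimited by non-word characters match.
const withBoundaries = text.replace(/\b\d{4}(\d{5})\b/g, '$1');

// Lookarounds: only "no digit before, no digit after" is required,
// so a run adjacent to letters is also shortened.
const withLookarounds = text.replace(/(?<!\d)\d{4}(\d{5})(?!\d)/g, '$1');

console.log(withBoundaries);
console.log(withLookarounds);
```

Both variants leave the ten-digit run alone; they differ only in how they treat digits that touch letters, so the choice depends on what the file actually contains.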
The question appears to ask for matches on exactly nine digits, with the first four removed, i.e. to replace the nine digits with the last five.
Splitting the regular expression in the question into two parts, for the unwanted and the wanted parts gives
regexString = "\d{4}(\d{5})"
which captures the wanted five digits, so then the replacement is
replacement1 ="$1"
Or, in some other regular expression implementations, it would be replacement1 ="\1". Additionally, the replace method in some regular expression systems may have extra options (parameters) for replacing the first, the n-th, or all occurrences.
Suppose there are more than nine digits and only the final five are wanted. In this case the regular expression can be written as one of the following (as different regular expression languages support different features). The replacement expression is the same as above.
regexString = "\d{4,}(\d{5})"
regexString = "\d\d\d\d+(\d{5})"
regexString = "\d\d\d\d\d*(\d{5})"
Because quantifiers are normally "greedy", the \d{5} should always match the final 5 digits, but it may be worth ending the regular expression with ...(\d{5})([^\d]|$) and changing the replacement to $1$2. That way it explicitly requires a trailing non-digit or the end of the string.
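The greedy behaviour described above can be checked on a small sample. This JavaScript sketch compares the plain pattern with the variant that adds the trailing guard; the sample string and variable names are illustrative only.

```javascript
const sample = 'a 12345678 b 123456789 c 1234567890123 d';

// Greedy \d{4,} consumes as many digits as it can, then gives back five.
const trimmed = sample.replace(/\d{4,}(\d{5})/g, '$1');

// Same idea with an explicit trailing non-digit (or end of string),
// which is put back through $2.
const trimmedGuarded = sample.replace(/\d{4,}(\d{5})([^\d]|$)/g, '$1$2');

console.log(trimmed);
console.log(trimmedGuarded);
```

On this input both patterns give the same result: the eight-digit run is untouched, and the longer runs are reduced to their last five digits.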