Regular expressions string replacement of individual match within file - regex

I have written a small program that runs through a text file and replaces every match of a regex for nine digits, \d{9}. It works fine, but what I need is a little more complicated.
I am finding the right data correctly. theFile is just a string with the whole text file read into it from a stream. After the replacement I create another file and write the string to it.
But I need to find each match individually and replace it with only the last 5 digits of that particular number (currently every match is just replaced with FOUND), keeping the file otherwise identical.
I am not sure what the best way of doing this is. Would I have to split the text into an array of strings rather than one big string? (It's quite a big file.)
Any questions let me know, thanks in advance.
Dim regexString As String = "(\d{9})"
Dim replacement1 As String = "FOUND"
Dim rgx As New Regex(regexString)
Try
theFile = rgx.Replace(theFile, replacement1)
Catch
End try

Instead of the single pattern \d{9}, split it into two groups: the first four digits long, the second five digits long. Then, in the replacement, use only the second group, which holds the last five digits:
Dim k = "abcd 123456789 abcf"
Dim ptn = "(\d{4})(\d{5})"
Dim result = Regex.Replace(k, ptn, "$2")
This approach leaves sequences of fewer than nine consecutive digits unchanged. If you also have sequences of more than nine digits and don't want to change them, you need a pattern with word boundaries,
Dim ptn = "(\b\d{4})(\d{5}\b)"
which pins the two groups to a sequence of exactly nine digits.
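For comparison, here is a minimal sketch of the same two-group idea in Python rather than VB.NET (the language switch is my assumption, and the name keep_last_five is made up for illustration). Note that Python's re.sub writes the group reference as \2 where .NET uses $2.

```python
import re

# Two groups pinned by word boundaries: four unwanted digits, five wanted digits.
NINE_DIGITS = re.compile(r"\b(\d{4})(\d{5})\b")

def keep_last_five(text: str) -> str:
    # \2 is the second group, i.e. the last five digits of each nine-digit run.
    return NINE_DIGITS.sub(r"\2", text)

print(keep_last_five("abcd 123456789 abcf"))
```

Runs of eight or ten digits are left untouched because the word boundaries cannot match inside a longer run of digits.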

The question appears to ask for matches on exactly nine digits, with the first four removed, i.e. to replace the nine digits with the last five.
Splitting the regular expression in the question into two parts, one for the unwanted digits and one for the wanted digits, gives
regexString = "\d{4}(\d{5})"
which captures the wanted five digits, so then the replacement is
replacement1 = "$1"
In some other regular expression implementations it would be replacement1 = "\1". The replace method in some regular expression systems may also have additional options (parameters) for replacing the first, the n-th, or all occurrences.
Suppose there are more than nine digits and only the final five are wanted. In this case the regular expression can be written as one of the following (as different regular expression languages support different features). The replacement expression is the same as above.
regexString = "\d{4,}(\d{5})"
regexString = "\d\d\d\d+(\d{5})"
regexString = "\d\d\d\d\d*(\d{5})"
Because regular expressions are normally "greedy", the \d{5} should always match the final 5 digits, but it may be worth finishing the regular expression with ...(\d{5})([^\d]|$) and changing the replacement to $1$2. That way it looks for a trailing non-digit or the end of the string.
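The "nine or more digits, keep the final five" variant with the trailing non-digit check can be sketched as follows. This is in Python, not the VB.NET of the question, and the name keep_final_five is hypothetical; Python writes the replacement as \1\2 instead of $1$2.

```python
import re

# Four or more leading digits, the final five digits, then a non-digit or end of string.
LONG_RUN = re.compile(r"\d{4,}(\d{5})(\D|$)")

def keep_final_five(text: str) -> str:
    # Group 1 holds the final five digits; group 2 puts back the trailing character, if any.
    return LONG_RUN.sub(r"\1\2", text)

print(keep_final_five("id 1234567890123 end"))
```

Capturing the trailing character and re-emitting it is what keeps the rest of the text identical; a lookahead would do the same job in engines that support one.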

Related

parse out the number value in a string using vb.net

I have two different strings.
www.ncbi.nlm.nih.gov/myncbi/browse/collection/40918026/?sort=date&direction=descending
and
https://www.ncbi.nlm.nih.gov/sites/myncbi/john.smith.1/bibliography/47926757/public/?sort=date&direction=descending
I need the number that is in the block after the word collection or bibliography. I know that I can split on the "/" characters, but if the string starts with http the positions will not be the same: the number would be in position 5 in one string and position 6 in the other. Is there a better way using regex? I know I can put together a bunch of code that searches for either word and then does something different for each, but I'm looking for a cleaner way to pull the number out.
I'm using
Dim str() As String = TextBox1.Text.Split("/")
For i As Integer = 0 To str.Length - 1
If Regex.IsMatch(str(i), "^[0-9 ]+$") Then
MessageBox.Show(str(i).ToString)
End If
Next
but I hoped for something cleaner.
Try this regex: (?:collection|bibliography)\/(\d+)
The desired number will be in the first capturing group.
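As a quick check of that pattern against both example URLs, here is a small Python sketch (Python is my choice here, not the VB.NET of the question, and extract_id is a made-up name). The slash does not need escaping in Python.

```python
import re

# Either keyword, a slash, then the digits we want in group 1.
ID_AFTER_KEYWORD = re.compile(r"(?:collection|bibliography)/(\d+)")

def extract_id(url: str):
    match = ID_AFTER_KEYWORD.search(url)
    # Return None when neither keyword is followed by a number.
    return match.group(1) if match else None

for url in (
    "www.ncbi.nlm.nih.gov/myncbi/browse/collection/40918026/?sort=date&direction=descending",
    "https://www.ncbi.nlm.nih.gov/sites/myncbi/john.smith.1/bibliography/47926757/public/?sort=date&direction=descending",
):
    print(extract_id(url))
```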
A similar but simpler alternative approach without splitting, as per your examples (assuming one eight-digit number surrounded by "/"):
Dim Result As String = Regex.Match(TextBox1.Text, "\/\d{8}\/").Value.Replace("/", String.Empty)
Result will contain your number if it matched, else String.Empty.
Reference: Regex.Match Method
Example alternatives:
Only match numbers with length of 8 to 10 digits enclosed in "/": "\/\d{8,10}\/"
Only match numbers with length of 4 or more digits enclosed in "/": "\/\d{4,}\/"
Match numbers of any length enclosed in "/": "\/\d+\/"
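The quantifier variations above differ only in the digit count, so they can be folded into one helper. A rough Python sketch, assuming a capturing group is acceptable in place of stripping the slashes afterwards (slash_enclosed_number and its parameters are hypothetical names):

```python
import re

def slash_enclosed_number(text, min_digits=1, max_digits=None):
    # Build /(\d{min,max})/ ; an empty upper bound means "min or more".
    upper = "" if max_digits is None else str(max_digits)
    pattern = r"/(\d{%d,%s})/" % (min_digits, upper)
    match = re.search(pattern, text)
    # Mirror the VB.NET answer: empty string when nothing matches.
    return match.group(1) if match else ""

url = "https://www.ncbi.nlm.nih.gov/sites/myncbi/john.smith.1/bibliography/47926757/public/"
print(slash_enclosed_number(url, 8, 8))
```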

Regex words with letters, numbers, optional special characters in any order

I've been using some help on here for a while now but cannot find anything specific to my requirement. I need to pick out whole words which contain at least 6 letters and/or numbers (combined, not each), with optional 'special' characters. All in any order, so A12345, 12345A, 1-2-345-A, 12A45B and so-on.
I've done a fiddle here. I'm almost there (though it could be done better), but I can't work out why it needs at least 6 numbers to get a match. Is it because the letters are all optional with *?
This is VBA, so no access to lookbehinds. The special characters will only ever be 'within' the match, not at the start or end (it will never be -1234-A-, for example).
I think this is what you are looking for:
[a-z0-9/-]{6,}
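That will match a run of at least 6 characters, in any order, where each character is a to z, 0 to 9, - or /. Note that the - is at the end of the character class. You can have it in the middle, but then you need to escape it. Also, / will need to be escaped if your delimiters are also /. As written the class is lowercase only, so turn on the ignore-case option (or add A-Z) to cover examples such as A12345.
Update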
As Wiktor noted this would also capture ------ which may not be what you want. I would suggest simply cleaning out all optional characters, and then running the above regex. I would delete my answer since I'm not providing exactly what was being asked, but it would be a workable solution so it may have value.
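If a single pattern is still preferred, one possible variation (my own sketch, not part of the answer above) is to require an alphanumeric at each step and allow at most one - or / between neighbours. That assumes separators never appear doubled, which the question does not state. It is shown in Python for testing; it uses no lookbehind. The name find_codes is hypothetical.

```python
import re

# One alphanumeric, then at least five more, each optionally preceded by a single - or /.
WORD = re.compile(r"\b[A-Za-z0-9](?:[-/]?[A-Za-z0-9]){5,}\b")

def find_codes(text):
    # No capturing groups, so findall returns the whole matches.
    return WORD.findall(text)

print(find_codes("ref A12345 and 1-2-345-A but not ------ or AB-12"))
```

Because every repetition must consume a letter or digit, a run such as ------ can no longer match.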
You could do a regex replacement to remove all non letters/numbers, and then check that the length of the resulting string is 6 or more:
Dim input As String = "A-1234-B"
Dim pattern As String = "[^A-Za-z0-9]+"
Dim replacement As String = ""
Dim rgx As New Regex(pattern)
Dim result As String = rgx.Replace(input, replacement)
Console.WriteLine(result.Length) ' 6
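The same strip-then-count check, sketched in Python so it can be run against a few of the examples from the question (has_six_alphanumerics and its minimum parameter are names I made up):

```python
import re

def has_six_alphanumerics(word, minimum=6):
    # Drop everything that is not a letter or digit, then count what is left.
    stripped = re.sub(r"[^A-Za-z0-9]+", "", word)
    return len(stripped) >= minimum

for candidate in ("A-1234-B", "1-2-345-A", "------", "AB-12"):
    print(candidate, has_six_alphanumerics(candidate))
```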

Excel VBA Regex Check For Repeated Strings

I have some user input that I want to validate for correctness. The user should input 1 or more sets of characters, separated by commas.
So these are valid input
COM1
COM1,COM2,1234
these are invalid
COM -- only 3 characters
COM1,123 -- one set is only 3 characters
COM1.1234,abcd -- a dot separator not comma
I googled for a regex pattern for this and found a possible pattern that tested for a recurring instance of any 3 characters, which I modified like so
/^(.{4,}).*\1$/
but this is not finding matches.
I can manage the last comma that may or may not be there before passing to the test so that it is always there.
Preferably, I would like to test for letters (any case) and numbers only, but I can live with any characters.
I know I could easily do this in straight VBA by splitting the input on a comma delimiter and looping through each character of each array element, but regex seems more efficient, and I will have more cases that have slightly different patterns, so parameterising the regex for those would be better design.
TIA
I believe this does what you want:
^([A-Za-z0-9]{4},)*[A-Za-z0-9]{4}$
It's a line beginning, followed by zero or more groups of four letters or numbers each ending with a comma, followed by one group of four letters or numbers, followed by an end-of-line. (Inside a character class | is a literal pipe rather than alternation, so it is left out here.)
You can play around with it here: https://regex101.com/r/Hdv65h/1
The regular expression
"^[\w]{4}(,[\w]{4})*$"
should work.
You can check whether it works for all your cases with the following procedure, assuming your test strings are in cells A1 through A5 of the spreadsheet:
Sub findPattern()
    ' Early binding: requires a reference to "Microsoft VBScript Regular Expressions 5.5".
    Dim regEx As New RegExp
    regEx.Global = True
    regEx.IgnoreCase = True
    ' Note that \w also matches the underscore.
    regEx.Pattern = "^[\w]{4}(,[\w]{4})*$"
    Dim i As Integer
    Dim val As String
    Dim mat As MatchCollection
    For i = 1 To 5
        val = Trim(Cells(i, 1).Value)
        Set mat = regEx.Execute(val)
        If mat.Count = 0 Then
            MsgBox "No match found for " & val
        Else
            MsgBox "Match found for " & val
        End If
    Next
End Sub
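Since the question mentions parameterising the pattern for other cases, here is a rough sketch of that idea. It is written in Python rather than VBA so it can be tested quickly, and make_group_validator, is_valid and their parameters are hypothetical names, not anything from the answers above.

```python
import re

def make_group_validator(group_len=4, chars="A-Za-z0-9"):
    # One group of exactly group_len allowed characters...
    group = "[%s]{%d}" % (chars, group_len)
    # ...followed by zero or more comma-prefixed groups of the same shape.
    return re.compile("%s(?:,%s)*" % (group, group))

def is_valid(text, validator):
    # fullmatch plays the role of the ^ and $ anchors.
    return validator.fullmatch(text) is not None

four_chars = make_group_validator()
for sample in ("COM1", "COM1,COM2,1234", "COM", "COM1,123", "COM1.1234,abcd"):
    print(sample, is_valid(sample, four_chars))
```

The same string-building approach carries over to VBA by concatenating the group length into regEx.Pattern.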

Regular expression which will match if there is no repetition

I would like to construct a regular expression which will match a password only if no character or group of characters repeats 4 or more times in a row.
I have come up with a regex which matches if there is a character or group of characters repeated 4 times:
(?:([a-zA-Z\d]{1,})\1\1\1)
Is there any way to match only if the string doesn't contain such repetitions? I tried the approach suggested in Regular expression to match a line that doesn't contain a word?, as I thought some combination of positive/negative lookaheads would do it, but I haven't found a working example yet.
By repetition I mean a sequence of any number of characters, anywhere in the string.
Example - should not match
aaaaxbc
abababab
x14aaaabc
Example - should match
abcaxaxaz
(a is here 4 times but it is not problem, I want to filter out repeating patterns)
That link was very helpful, and I was able to use it to create the regular expression from your original expression.
^(?:(?!(?<char>[a-zA-Z\d]+)\k<char>{3,}).)+$
or
^(?:(?!([a-zA-Z\d]+)\1{3,}).)+$
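To check the second (numbered-group) form against the examples from the question, here is a short Python sketch; Python is my choice for the test harness and is_acceptable is a made-up name. The pattern itself is unchanged.

```python
import re

# At every position, refuse to continue if a group repeated four or more times starts here.
NO_REPEATS = re.compile(r"^(?:(?!([a-zA-Z\d]+)\1{3,}).)+$")

def is_acceptable(password):
    return NO_REPEATS.match(password) is not None

for candidate in ("aaaaxbc", "abababab", "x14aaaabc", "abcaxaxaz"):
    print(candidate, is_acceptable(candidate))
```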
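Nota bene: this solution doesn't answer the question exactly; it rejects more than was asked for, because it refuses any string in which a single character occurs four or more times anywhere, not only consecutive repetitions.
-----
In Python: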
import re

# Raw string, so the backreferences and \Z need no extra escaping.
pat = r'(?:(.)(?!.*?\1.*?\1.*?\1.*\Z))+\Z'
regx = re.compile(pat)

for s in (':1*2-3=4#',
          ':1*1-3=4#5',
          ':1*1-1=4#5!6',
          ':1*1-1=1#',
          ':1*2-a=14#a~7&1{g}1'):
    m = regx.match(s)
    if m:
        print(m.group())
    else:
        print('--No match--')
result
:1*2-3=4#
:1*1-3=4#5
:1*1-1=4#5!6
--No match--
--No match--
This gives the regex engine a lot of work, because for each character it passes over, the pattern must verify that the same character does not occur three more times in the rest of the string.
But it works, apparently.

Python: RE only captures first and last match

I'm trying to make a Regular Expression that captures the following:
- XX or XX:XX, up to 6 repetitions (XX:XX:XX:XX:XX:XX), where X is a hexadecimal number.
In other words, I'm trying to capture MAC addresses that can range from 1 to 6 bytes.
regex = re.compile("^([0-9a-fA-F]{2})(?:(?:\:([0-9a-fA-F]{2})){0,5})$")
The problem is that if I enter, for example, "11:22:33", it only captures the first match and the last, which results in ["11", "33"].
The question: is there any way to make the {0,5} repetition capture every repetition, and not just the last one?
Thanks!
Not in Python, no. But you can first check the format with your regex, and then simply split the string at ":":
result = s.split(':')
Also note that you should always write regular expressions as raw strings (otherwise you get problems with escaping). And your outer non-capturing group does nothing.
Technically there is a way to do it with regex only, but the regex is quite horrible:
r"^([0-9a-fA-F]{2})(?::([0-9a-fA-F]{2}))?(?::([0-9a-fA-F]{2}))?(?::([0-9a-fA-F]{2}))?(?::([0-9a-fA-F]{2}))?(?::([0-9a-fA-F]{2}))?$"
But here you would always get six groups; the ones that did not take part in the match are None.
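The validate-first, split-afterwards approach from the start of this answer can be sketched like this (parse_mac_bytes is a hypothetical name, and returning None for invalid input is just one possible convention):

```python
import re

# One byte, then up to five more bytes, each preceded by a colon. No capturing groups needed.
MAC_PREFIX = re.compile(r"[0-9a-fA-F]{2}(?::[0-9a-fA-F]{2}){0,5}")

def parse_mac_bytes(s):
    # fullmatch replaces the ^ and $ anchors of the original pattern.
    if MAC_PREFIX.fullmatch(s) is None:
        return None
    return s.split(":")

print(parse_mac_bytes("11:22:33"))
```

The regex only decides whether the input is well formed; the split does the actual extraction, which sidesteps the "last capture wins" behaviour of repeated groups.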