I am trying to extract a hex value from a string of the form "VALUE: num,num,num,HEX,num,num".
I have the following:
% set STRINGTOPARSE "VALUE: 12,12,13,2,9,5271256369606C00,0,0"
% regexp {(,[0-9A-Z]+,)+} $STRINGTOPARSE result1 result2 result3
1
% puts $result1
,12,
% puts $result2
,12,
% puts $result3
I believed the pattern {(,[0-9A-Z]+,)+} would be sufficient to pull the HEX value out of the string above,
but instead the first result is ",12," and not the HEX value I want. What have I done wrong?
You might want to use split instead:
set result [lindex [split $STRINGTOPARSE ","] 5]
regexp is not giving you the result you are looking for because the first (leftmost) part of the string that matches is ,12,; regexp returns that match and does not look any further.
You could use regexp to do it, but it will be more messy... one possible way would be to match each comma:
regexp {^(?:[^,]*,){5}([0-9A-F]+),} $STRINGTOPARSE -> hex
Where (?:[^,]*,){5} matches the first 5 non-comma parts with their commas, and ([0-9A-F]+) then grabs the hex value you're looking for.
I think the problem is that you expect [0-9A-Z] to require at least one letter, which is not the case. It matches any single character within the character class, so [0-9A-Z]+ succeeds as soon as one digit or one letter matches.
If you wanted a regex that only matches a run of characters containing both digits and letters, you would have to use some lookaheads (using classes alone would be even messier):
regexp {\y(?=[0-9]*[A-Z])(?=[A-Z]*[0-9])[0-9A-Z]+\y} $STRINGTOPARSE hex
But... this might look even more complex than before, so I would advise sticking to splitting instead :)
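For comparison, here is a minimal Python sketch of the two approaches from this answer: splitting on commas, and skipping a fixed number of comma-terminated fields with a regex. The variable names are illustrative, and the field position (index 5) is taken from the sample string in the question rather than from any stated format rule.

```python
import re

line = "VALUE: 12,12,13,2,9,5271256369606C00,0,0"

# Approach 1: split on commas and take the sixth field (index 5).
hex_by_split = line.split(",")[5]

# Approach 2: skip five comma-terminated fields, then capture the hex run.
m = re.match(r"(?:[^,]*,){5}([0-9A-F]+),", line)
hex_by_regex = m.group(1) if m else None

print(hex_by_split, hex_by_regex)
```

Both variants depend on the hex value always being the sixth field; if the position can change, neither is reliable as written.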
I have a regexp with a non-greedy modifier which does not seem to work. I have tried so many variations of the regexp and various other ways I could think of, without success, that I am losing my head
I want to remove all the empty strings embedded in the string s below. With my regexp I was expecting to remove all the things that matched something=""
s = 'a,b="cde",f="",g="hi",j=""'
puts s; puts s.gsub( /,.+?="",?/ , "," ).chomp(','); nil
Expected:
a,b="cde",g="hi"
What I get:
a,g="hi"
Why isn't the .+? non-greedy in the gsub regexp above?
It works if I constrain the . to a set of characters [\w\d_-], but that forces me to do assumptions:
puts s; puts s.gsub( /,[\w\d_-]+?=""/ , "" ).chomp(','); nil
# outputs:
a,b="cde",f="",g="hi",j=""
a,b="cde",g="hi"
It also works if I use a negated character class like:
puts s; puts s.gsub( /,[^,]+?="",?/ , "," ).chomp(','); nil
# outputs:
a,b="cde",f="",g="hi",j=""
a,b="cde",g="hi"
But still I do not understand why it did not work in the first case.
A regex engine scans from left to right and takes the leftmost match it can find. Your regex ,.+?="",? starts at the first comma in the string a,b="cde",f="",g="hi",j="", the one between a and b. The lazy .+? then expands only as far as it needs to reach the first ="", which is the one after ,f. The first match is therefore ,b="cde",f="", and that is why you get a,g="hi".
What you want is ,[^=]+?="",? which matches one or more characters that are not an equals sign before ="", and you'll get a,b="cde",g="hi" as the result.
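The same behaviour can be reproduced outside Ruby. The following Python sketch (an illustration, not part of the original answer) shows what the lazy pattern actually matches first, and the effect of excluding the equals sign; `rstrip` stands in for Ruby's `chomp` here.

```python
import re

s = 'a,b="cde",f="",g="hi",j=""'

# Lazy '.+?' still starts at the first comma and grows until '=""' follows,
# so the first match swallows b="cde" as well.
first_lazy = re.search(r',.+?="",?', s).group(0)

# Excluding '=' keeps each match inside a single key.
fixed = re.sub(r',[^=]+?="",?', ",", s).rstrip(",")

print(first_lazy)
print(fixed)
```

Laziness only controls how far a quantifier extends once a match attempt has started; it never moves the starting point to the right.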
I have two different strings.
www.ncbi.nlm.nih.gov/myncbi/browse/collection/40918026/?sort=date&direction=descending
and
https://www.ncbi.nlm.nih.gov/sites/myncbi/john.smith.1/bibliography/47926757/public/?sort=date&direction=descending
I need the number in the path segment that follows the word collection or bibliography. I know I can split on the "/" slashes, but if the URL starts with http the positions shift, so the number would be at position 5 in one string and 6 in the other. Is there a better way using regex? I know I could put together a bunch of code that searches for either word and then handles each case differently, but I'm looking for a cleaner way to pull it out.
I'm using
Dim str() As String = TextBox1.Text.Split("/")
For i As Integer = 0 To str.Length - 1
If Regex.IsMatch(str(i), "^[0-9 ]+$") Then
MessageBox.Show(str(i).ToString)
End If
Next
But hoped for something cleaner
Try with this regex: (?:collection|bibliography)\/(\d+)
The desired number will be in the first capturing group.
See demo
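As a quick check of the pattern above, here is a small Python sketch that runs it against both URLs from the question. The list and variable names are made up for the illustration; the pattern itself is the one suggested in the answer (Python does not need the slash escaped).

```python
import re

urls = [
    "www.ncbi.nlm.nih.gov/myncbi/browse/collection/40918026/?sort=date&direction=descending",
    "https://www.ncbi.nlm.nih.gov/sites/myncbi/john.smith.1/bibliography/47926757/public/?sort=date&direction=descending",
]

# Anchor on the keyword, then capture the digits that follow the slash.
pattern = re.compile(r"(?:collection|bibliography)/(\d+)")

ids = []
for url in urls:
    m = pattern.search(url)
    ids.append(m.group(1) if m else None)

print(ids)
```

Because the match is anchored on the keyword rather than on a segment position, the scheme prefix and the number of path segments no longer matter.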
A similar, but simple alternative approach without splitting:
As per your examples (assuming one eight-digit number surrounded by "/"):
Dim Result As String = Regex.Match(TextBox1.Text, "\/\d{8}\/").Value.Replace("/", String.Empty)
Result will contain your number if matched, else String.Empty
Reference: Regex.Match Method
Example alternatives:
Only match numbers with length of 8 to 10 digits enclosed in "/": "\/\d{8,10}\/"
Only match numbers with length of 4 or more digits enclosed in "/": "\/\d{4,}\/"
Match numbers of any length enclosed in "/": "\/\d+\/"
I have some user input that I want to validate for correctness. The user should input 1 or more sets of characters, separated by commas.
So these are valid input
COM1
COM1,COM2,1234
these are invalid
COM -- only 3 characters
COM1,123 -- one set is only 3 characters
COM1.1234,abcd -- a dot separator not comma
I googled for a regex pattern to this and found a possible pattern that tested for a recurring instance of any 3 characters, and I modified like so
/^(.{4,}).*\1$/
but this is not finding matches.
I can manage the last comma that may or may not be there before passing to the test so that it is always there.
Preferably, I would like to test for letters (any case) and numbers only, but I can live with any characters.
I know I could easily do this in straight VBA by splitting the input on a comma delimiter and looping through each character of each array element, but regex seems more efficient, and I will have more cases that have slightly different patterns, so parameterising the regex for that would be better design.
TIA
I believe this does what you want:
^([A-Za-z0-9]{4},)*[A-Za-z0-9]{4}$
It's a line beginning, followed by zero or more groups of four letters or digits each ending with a comma, followed by one group of four letters or digits and an end-of-line.
You can play around with it here: https://regex101.com/r/Hdv65h/1
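The regular expression
"^[\w]{4}(,[\w]{4})*$"
should work. Note that \w also matches an underscore, so use [A-Za-z0-9] instead if only letters and digits are allowed.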
You can check whether it works for all your cases with the following macro, assuming your test strings are in cells A1 through A5 of the spreadsheet:
Sub findPattern()
    ' Requires a reference to "Microsoft VBScript Regular Expressions 5.5"
    Dim regEx As New RegExp
    regEx.Global = True
    regEx.IgnoreCase = True
    regEx.Pattern = "^[\w]{4}(,[\w]{4})*$"
    Dim i As Integer
    Dim txt As String
    Dim mat As MatchCollection
    For i = 1 To 5
        ' Trim stray spaces so they do not break the anchored pattern
        txt = Trim(Cells(i, 1).Value)
        Set mat = regEx.Execute(txt)
        If mat.Count = 0 Then
            MsgBox "No match found for " & txt
        Else
            MsgBox "Match found for " & txt
        End If
    Next i
End Sub
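Since the question asks about parameterising the pattern for similar cases, here is one way that could look, sketched in Python rather than VBA. The helper name `is_valid_list` and its `size` parameter are hypothetical, and the sketch assumes each group must be exactly `size` letters or digits, as in the answers above.

```python
import re

def is_valid_list(text: str, size: int = 4) -> bool:
    """True if text is one or more comma-separated alphanumeric groups of exactly `size` characters."""
    group = rf"[A-Za-z0-9]{{{size}}}"          # e.g. [A-Za-z0-9]{4}
    return re.fullmatch(rf"{group}(?:,{group})*", text) is not None

for sample in ["COM1", "COM1,COM2,1234", "COM", "COM1,123", "COM1.1234,abcd"]:
    print(sample, is_valid_list(sample))
```

Building the pattern from one reusable group fragment keeps the first group and the repeated groups in sync when the rule changes.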
My TCL script:
set test {
a for apple
b for ball
c for cat
number n1
numbers 2,3,4,5,6
d for doctor
e for egg
number n2
numbers 56,4,5,5
}
set lines [split $test \n]
set data [join $lines :]
if { [regexp {number n1.*(numbers .*)} $data x y]} {
puts "numbers are : $y"
}
Current output if I run the above script:
C:\Documents and Settings\Owner\Desktop>tclsh stack.tcl
numbers are : numbers 56,4,5,5:
C:\Documents and Settings\Owner\Desktop>
Expected output:
In the regexp in the script, if I specify "number n1", it should print "numbers are : numbers 2,3,4,5,6".
If I specify "number n2", it should print "numbers are : numbers 56,4,5,5:".
Right now it always prints the last numbers line (numbers 56,4,5,5:) as output. How can I resolve this issue?
Thanks,
Kumar
Try using
regexp {number n1.*?(numbers .*)\n} $test x y
(note that I'm matching against test. There is no need to replace the newlines.)
There are two differences from your pattern.
The question mark after the first star makes the match non-greedy.
There is a newline character after the capturing parentheses.
Your pattern told regexp to match from the first occurrence of number n1 up to the last occurrence of numbers, and it did. This is because the .* match between them was greedy, i.e. it matched as many characters as it could, which meant it went past the first numbers.
Making the match non-greedy means that the pattern will match from the first occurrence of number n1 up to the following occurrence of numbers, which was what you wanted.
After numbers, there is another .* match which is a bit troublesome. If it were greedy, it would match everything up to the end of the variable content. If it were non-greedy, it wouldn't match any characters, since matching a zero-length string satisfies the match. Another problem is that the Tcl RE engine doesn't really allow for switching back from non-greedy mode.
You can fix this by forcing the pattern to match one character past the text that you want the .* to match, making the zero-length match invalid. Matching a newline (\n) or space (\s) character should work. (This of course means that there must be a newline / other space character after every data field: if a numbers field is the last character range in the variable that field can't be located.)
Documentation: regular expression syntax, regexp
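The greedy versus non-greedy difference can also be seen in a short Python sketch. This is only an analogy: Python's re module decides greediness per quantifier, so it does not reproduce the Tcl-specific behaviour described above, and the function name `numbers_after` is invented for the example.

```python
import re

test = """
a for apple
b for ball
c for cat
number n1
numbers 2,3,4,5,6
d for doctor
e for egg
number n2
numbers 56,4,5,5
"""

def numbers_after(label: str, text: str):
    # '.*?' (with DOTALL) stops at the first 'numbers' after the label;
    # '[^\n]*' then takes the rest of that line only.
    pattern = rf"number {re.escape(label)}.*?(numbers [^\n]*)"
    m = re.search(pattern, text, re.DOTALL)
    return m.group(1) if m else None

# Greedy version for contrast: '.*' runs on to the last 'numbers' line.
greedy = re.search(r"number n1.*(numbers [^\n]*)", test, re.DOTALL).group(1)

print(numbers_after("n1", test))
print(numbers_after("n2", test))
print(greedy)
```

Bounding the captured part with a negated class instead of a second `.*` avoids having to force a trailing newline into the match.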
To use a Tcl variable in a regular expression is easy. On one level anyway: you put the regular expression in double quotes so that you have standard Tcl variable substitution inside it prior to it being passed to the RE engine:
# ...
set target "n1"
if { [regexp "number $target.*(numbers .*)" $data x y]} {
# ...
The hard part is that you've got to remember that switching to "…" from {…} will affect the whole of that word, and that the substitutions are of regular expression fragments. We usually recommend using {…} because that's easier to get consistently and unconfusingly right in the majority of cases.
Let's illustrate how this can get annoying. In your specific case, you may want to actually use this:
if { [regexp "number $target\[^:\]*:(numbers \[^:\]*)" $data x y]} {
The character sets here exclude the : (which you've — unnecessarily — used as a newline replacement) but because […] is also standard Tcl metasyntax, you have to backslash-quote it. (Things get even more annoying when you want to always use the contents of the variable as a literal even though they might include RE metasyntax characters; you need a regsub call to tidy things up. And you start to potentially make Tcl's RE cache less efficient too.)
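The point about variable contents being treated as regular expression fragments is not specific to Tcl. Here is a hedged Python illustration, where `re.escape` plays the role of the tidy-up step mentioned above; the sample `data` string and the function name `numbers_for` are invented for the example.

```python
import re

# Colon-joined data in the style of the question, with one extra made-up record.
data = "number n1:numbers 2,3,4,5,6:number nX1:numbers 7,8:"

def numbers_for(target: str, joined: str):
    # re.escape makes the variable's contents match literally.
    pattern = rf"number {re.escape(target)}:(numbers [^:]*)"
    m = re.search(pattern, joined)
    return m.group(1) if m else None

# Without escaping, the '.' in "n.1" is a wildcard and matches "nX1".
unescaped = re.search(r"number n.1:(numbers [^:]*)", data).group(1)

print(numbers_for("n1", data))
print(numbers_for("n.1", data))
print(unescaped)
```

If the target only ever holds plain identifiers such as n1 or n2, escaping changes nothing, but it keeps the pattern safe when that assumption stops holding.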
Matlab documentation states that it is possible to replace the Nth occurrence of the pattern in regexprep. I am failing to see how to implement it and google is not returning anything useful.
http://www.weizmann.ac.il/matlab/techdoc/ref/regexprep.html
Basically the string I have is :,:,1 and I want to replace the second occurrence of : with an arbitrary number. Based on the documentation:
regexprep(':,:,4',':','AnyNumber','N')
I do not understand how the N option should be used. I have tried 'N',2 or just '2'.
Note that the position of the : could be anywhere.
I realize there are other ways of doing this other than regexprep but I don't like having a problem linger.
Thanks for the help!
regexprep(':,:,4',':','AnyNumber',2)
The above works.
According to the MATLAB documentation, the general syntax of regexprep is:
newStr = regexprep(str,expression,replace,option1,...optionM);
It looks in str, finds the matches of expression, and replaces each matching string with replace. There are 9 available options: eight of them are fixed strings and one is an integer. The integer tells regexprep which occurrence of the match to replace.
The following code sets up all the parameters as variables, finds the number of matches, and uses that information to replace only the last occurrence.
str = ':,:,4';
expression= ':';
replace = num2str(floor(rand()*10));
% generate a single digit random number converted to string
idx = regexp(str, expression); % start indices of all matches; length(idx) is the match count
newStr = regexprep(str, expression, replace, length(idx)) % replace only the last match
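For readers without MATLAB, the same idea (replace only the N-th match) can be sketched in Python. This is not a port of regexprep: the helper `replace_nth` is hypothetical, counts occurrences from 1, and inserts the replacement literally without interpreting backreferences.

```python
import re

def replace_nth(pattern: str, repl: str, text: str, n: int) -> str:
    """Replace only the n-th (1-based) match of pattern in text."""
    matches = list(re.finditer(pattern, text))
    if not 1 <= n <= len(matches):
        return text                      # no such occurrence: leave text unchanged
    m = matches[n - 1]
    return text[:m.start()] + repl + text[m.end():]

second = replace_nth(":", "7", ":,:,4", 2)
print(second)
```

Passing the number of matches as `n` replaces the last occurrence, which mirrors the `length(idx)` trick in the MATLAB code above.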