tcl regular expression, attempting to pull out a string between two patterns - regex

Greetings!
I am trying to use tcl regular expressions to strip off unwanted characters and keep the desired string.
The 4 basic string types are
I34/pAVDD_3
I32/pDVDD_15_2
I999/pAGND
I3/pDOUT_LG0
What I want to capture is what's in-between the p and the end of the string or the last underscore & number if it exists. With the strings above I want to capture AVDD, DVDD_15, AGND, and DOUT_LG0.
I thought I had it with [p](\w*)?[_][\d*] but it doesn't work with I3/pDOUT_LG0, and after quite a while of trying different things, I can't find a pattern that will work.
Thanks!

How about
regexp {p(?:(\w+)_\d|(\w+))$} $str -> c1 c2
set result $c1$c2
One or the other will be empty, so the result is a simple concatenation of them.
Another possible solution is to strip off the unwanted parts:
regsub -all {.+p|_\d$} $str {}
Documentation:
regexp,
regsub,
Syntax of Tcl regular expressions
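For readers who want to check the first pattern outside Tcl, here is a minimal Python sketch of the same idea. The function name extract_name is made up for illustration, and Python's re engine is not identical to Tcl's, so treat it as an approximation of the answer rather than a drop-in equivalent.

```python
import re

# Same alternation as the Tcl pattern above: either "<name>_<digit>" at the
# end of the string, or just "<name>" at the end of the string.
PATTERN = re.compile(r"p(?:(\w+)_\d|(\w+))$")

def extract_name(s):
    """Return the text after 'p', minus an optional trailing _<digit>."""
    m = PATTERN.search(s)
    if m is None:
        return None
    # Only one of the two groups takes part in any given match.
    return m.group(1) or m.group(2)

for s in ["I34/pAVDD_3", "I32/pDVDD_15_2", "I999/pAGND", "I3/pDOUT_LG0"]:
    print(s, "->", extract_name(s))
```

As in the Tcl version, the first alternative only strips a single trailing digit after the last underscore.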

Related

how to use Perl Regx to parse [key=value] if value has multiple data

I could not solve below problem so I used Perl script to parse
without regular expression, but I believe there's a regular expression for it.
Input String (there's no newline):
ObjectAddress=120.146.128.250,ObjectName=psyseds-tt1y,ObjectClass=SCM F5,ObjectDescription=,Aliases=psyseds-tt1y.site.com.,NameService=A,PTR,DynamicDNSUpdate=A,PTR,CNAME
Expected Output:
ObjectAddress=120.146.128.250
ObjectName=psyseds-tt1y
ObjectClass=SCM F5
ObjectDescription=
Aliases=psyseds-tt1y.site.com.
NameService=A,PTR
DynamicDNSUpdate=A,PTR,CNAME
I tried some regular expression to parse string, but I failed to parse
since it has multiple items with , separated value.
For example, NameService has two value A,PTR.
Please help me to build regular expression to parse above.
(.+?=.*?) does not pick up multiple values.
In general, it doesn't seem that your format is unambiguous — something like A=B,C=D could mean either that A maps to B and C maps to D, or that A maps to B,C=D — but for a good approximation, you can write:
my @output = split /,(?=\w+=)/, $input;
this will split $input on commas (,), with the added restriction that the comma must be followed by one or more "word characters" (\w — letters, digits, underscores) plus an equals sign. (This is called a lookahead assertion.)
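The same lookahead split can be sketched in Python, using the input string from the question. The name split_pairs is hypothetical, and the sketch inherits the ambiguity noted above: a value that itself contains ",word=" would be split.

```python
import re

def split_pairs(s):
    # Split only on commas that are immediately followed by "word=".
    # The lookahead (?=...) checks for the key without consuming it.
    return re.split(r",(?=\w+=)", s)

data = ("ObjectAddress=120.146.128.250,ObjectName=psyseds-tt1y,"
        "ObjectClass=SCM F5,ObjectDescription=,Aliases=psyseds-tt1y.site.com.,"
        "NameService=A,PTR,DynamicDNSUpdate=A,PTR,CNAME")

for pair in split_pairs(data):
    print(pair)
```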
You can match with this regex
(?<=^|,)(?<key>.*?)=(?<value>.*?)(?=,|$)
You can now access values by their group names.

Regexp trouble in TCL

I have a question about regexp in Tcl.
How can I find and change some text in a Tcl string variable with the regexp function?
Example of the text:
/folder/folder2/test-c+a+t -test1 -test2
I want to receive:
/folder/folder2/test-d+o+g
Or for example it can be just:
test-c+a+t
and I want to receive:
test-d+o+g
Sorry for this addition:
In this situation:
/test-c+a+t/folder2/test-c+a+t -test1 -test2
I want to receive:
/test-c+a+t/folder2/test-d+o+g -test1 -test2
% set old {/folder/folder2/test-c+a+t -test1 -test2}
/folder/folder2/test-c+a+t -test1 -test2
% set new [regsub {(test)-c\+a\+t.*} $old {\1-d+o+g}]
/folder/folder2/test-d+o+g
Note the literal + symbols need to be escaped because they are regular expression quantifiers.
http://tcl.tk/man/tcl8.5/TclCmd/re_syntax.htm
http://tcl.tk/man/tcl8.5/TclCmd/regsub.htm
In the specific case you mention here you would do better to use string map. Regular expressions are more flexible though so it all depends how specific your task is.
set modified [string map {test-c+a+t test-d+o+g} $original]
Otherwise, there is no substitute for learning how to use regular expression syntax. It is useful pretty much all the time, so read the manual page, try various expressions and re-read the manual when you fail to match what you expected. Also try out sed, awk and grep for learning to use regexps.
Either use string map or use regsub (possibly with the -all flag). Here are some examples of the two approaches:
set myString [string map [list "test-c+a+t" "test-d+o+g"] $myString]
set myString [regsub -all "***=test-c+a+t" $myString "test-d+o+g"]
### Or equivalently, for older Tcl versions...
regsub -all "***=test-c+a+t" $myString "test-d+o+g" myString
The string map can apply multiple changes in one sweep (the mapping a b b a would swap all a and b characters) but it only ever replaces literal strings and always replaces everything it can. The regsub command can do much more complex transformations and can be much more selective about what it replaces, but it does require you to use regular expressions and it is slower in the case where a string map can do an equivalent job. However, the special leading ***= in the pattern means that the rest of the pattern is a literal string.
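The literal-versus-pattern trade-off can be illustrated with a small Python sketch. The function names are invented for this example. The second function also addresses the asker's follow-up case (change only the occurrence after the last slash), which the answers above do not cover directly; it assumes the trailing arguments contain no slash.

```python
import re

OLD, NEW = "test-c+a+t", "test-d+o+g"

def replace_all(s):
    # Literal replacement of every occurrence, no regex involved.
    return s.replace(OLD, NEW)

def replace_in_last_component(s):
    # re.escape neutralises the '+' quantifiers; the lookahead only accepts
    # an occurrence that has no '/' between it and the end of the string.
    pattern = re.escape(OLD) + r"(?=[^/]*$)"
    return re.sub(pattern, NEW, s)

print(replace_all("/folder/folder2/test-c+a+t -test1 -test2"))
print(replace_in_last_component("/test-c+a+t/folder2/test-c+a+t -test1 -test2"))
```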

Annotating mismatches in regular expression

I need to "annotate" each mismatch of a regular expression with an X character. For example, if I have a text file like:
Line1Name: this is a (string).
Line2Name: (a string)
Line3Name this is a line without parenthesis
Line4Name: (a string 2)
Now following regular expression will match everything before a :
^[^:]+(?=:)
so the result will be
Line1Name:
Line2Name:
Line4Name:
However I would need to annotate the mismatch at the 3rd line, having this output:
Line1Name:
Line2Name:
X
Line4Name:
Is this possible with regular expressions?
If you have a look at what a regular expression is, you will realize that it is not possible to do logical operations with a regex alone. Quoting Wikipedia:
In computing, a regular expression provides a concise and flexible means to “match” (specify and recognize) strings of text, such as particular characters, words, or patterns of characters.
emphasis mine – simply put, a regex is a fancy way to find a string; it either does (it matches), or not.
To achieve what you are after, you need some kind of logic switch that operates on the match / not-match result of your regex search and triggers an action. You haven’t specified in what environment you are using your regex, so providing a solution is a bit pointless, but as an example, this would do what you are trying to do in pure bash:
# assuming your string is in $str
result="$([[ $str =~ ^[^:]+: ]] && echo "${str%%:*}" || echo "X")"
and this does the same thing in a language supporting your regex pattern (Ruby):
# assuming your string is in str
result = str[/^[^:]+(?=:)/] || "X"
As a side note, your example code does not match the output: you are using a lookahead for the colon, which excludes it in the match, but your output includes it. I’ve opted for sticking with your regex over your output pattern in my examples, thus excluding the colon from the result.
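The same match-or-annotate logic, applied to every line of the sample file, could look like this in Python. The function name annotate is hypothetical, and, like the examples above, the sketch keeps the asker's lookahead and therefore leaves the colon out of the result.

```python
import re

# The asker's pattern; re.match already anchors at the start of the string,
# so the leading ^ is not needed here.
PREFIX = re.compile(r"[^:]+(?=:)")

def annotate(lines):
    out = []
    for line in lines:
        m = PREFIX.match(line)
        # The regex only reports match / no match; the "X" is ordinary logic.
        out.append(m.group(0) if m else "X")
    return out

sample = [
    "Line1Name: this is a (string).",
    "Line2Name: (a string)",
    "Line3Name this is a line without parenthesis",
    "Line4Name: (a string 2)",
]
print("\n".join(annotate(sample)))
```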

Regular expression literal-text span

Is there any way to indicate to a regular expression a block of text that is to be searched for explicitly? I ask because I have to match a very very long piece of text which contains all sorts of metacharacters (and has to match exactly), followed by some flexible stuff (enough to merit the use of a regex), followed by more text that has to be matched exactly.
Rinse, repeat.
Needless to say, I don't really want to have to run through the entire thing and have to escape every metacharacter. That just makes it a bear to read. Is there a way to wrap those portions so that I don't have to do this?
Edit:
Specifically, I am using Tcl, and by "metacharacters", I mean that there's all sorts of long strings like "**$^{*$%\)". I would really not like to escape these. I mean, it would add thousands of characters to the string. Does Tcl regexp have a literal-text span metacharacter?
The normal way of doing this in Tcl is to use a helper procedure to do the escaping, like this:
proc re_escape str {
# Every non-word char gets a backslash put in front
regsub -all {\W} $str {\\&}
}
set awkwardString "**$^{*$%\\)"
regexp "simpleWord *[re_escape $awkwardString] *simpleWord" $largeString
Where you have a whole literal string, you have two other alternatives:
regexp "***=$literal" $someString
regexp "(?q)$literal" $someString
However, both of these only permit patterns that are pure literals; you can't mix patterns and literals that way.
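For comparison, the escape-then-concatenate approach looks like this in Python, where re.escape plays the role of the re_escape helper. This is only a sketch of the technique in another language: the sample text and the name contains_phrase are invented, and the awkward string is the one from the question.

```python
import re

# The literal from the question, written as a raw string so the backslash
# stays a backslash.
awkward = r"**$^{*$%\)"

# Only the literal part is escaped; the surrounding " *" stays a real pattern.
pattern = re.compile(r"simpleWord *" + re.escape(awkward) + r" *simpleWord")

def contains_phrase(text):
    return pattern.search(text) is not None

print(contains_phrase(r"xx simpleWord  **$^{*$%\) simpleWord yy"))
print(contains_phrase("simpleWord simpleWord"))
```

Mixing escaped literals with real pattern fragments this way avoids the limitation of ***= and (?q), which apply to the whole expression.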
No, Tcl does not have such a feature.
If you're concerned about readability you can use variables and commands to build up your expression. For example, you could do something like:
set fixed1 {.*?[]} ;# match the literal five-byte sequence .*?[]
set fixed2 {???} ;# match the literal three byte sequence ???
set pattern "this.*and.*that"
regexp "[re_escape $fixed1]$pattern[re_escape $fixed2]" $someString
You would need to supply the definition for re_escape but the solution should be pretty obvious.
A Tcl regular expression can be specified with the q metasyntactical directive to indicate that the expression is literal text:
% set string {this string contains *emphasis* and 2+2 math?}
% puts [regexp -inline -all -indices {*} $string]
couldn't compile regular expression pattern: quantifier operand invalid
% puts [regexp -inline -all -indices {(?q)*} $string]
{21 21} {30 30}
This does however apply to the entire expression.
What I would do is to iterate over the returned indices, using them as arguments to [string range] to extract the other stuff you're looking for.
I believe Perl and Java support the \Q \E escape, so
\Q.*.*()\E
..will actually match the literal ".*.*()"
OR
Bit of a hack, but replace the literal section with some text which does not need escaping and which will not appear elsewhere in your searched string. Then build the regex using this metacharacter-free text. A 100-digit random sequence, for example. Then when your regex matches at a certain position and length in the doctored string, you can calculate whereabouts it should appear in the original string and what length it should be.

How to return the first five digits using Regular Expressions

How do I return the first 5 digits of a string of characters in Regular Expressions?
For example, if I have the following text as input:
15203 Main Street
Apartment 3 63110
How can I return just "15203".
I am using C#.
This isn't really the kind of problem that's ideally solved by a single-regex approach -- the regex language just isn't especially meant for it. Assuming you're writing code in a real language (and not some ill-conceived embedded use of regex), you could do perhaps (examples in perl)
# Capture all the digits into an array
my @digits = $str =~ /(\d)/g;
# Then take the first five and put them back into a string
my $first_five_digits = join "", @digits[0..4];
or
# Copy the string, removing all non-digits
(my $digits = $str) =~ tr/0-9//cd;
# And cut off all but the first five
$first_five_digits = substr $digits, 0, 5;
If for some reason you really are stuck doing a single match, and you have access to the capture buffers and a way to put them back together, then wdebeaum's suggestion works just fine, but I have a hard time imagining a situation where you can do all that, but don't have access to other language facilities :)
It would depend on your flavor of regex and coding language (C#, Perl, etc.), but in C# you'd do something like
string rX = @"\D+";
input = Regex.Replace(input, rX, "");
return input.Substring(0, 5);
Note: I'm not sure about that regex (others here may have a better one), but basically, since a regex itself doesn't "replace" anything and only matches patterns, you'd have to look for any non-digit characters; once you'd matched those, you'd replace them with your language's version of the empty string (string.Empty or "" in C#), and then grab the first 5 characters of the resulting string.
You could capture each digit separately and put them together afterwards, e.g. in Perl:
$str =~ /(\d)\D*(\d)\D*(\d)\D*(\d)\D*(\d)/;
$digits = $1 . $2 . $3 . $4 . $5;
I don't think a regular expression is the best tool for what you want.
Regular expressions are to match patterns... the pattern you are looking for is "a(ny) digit"
Your logic external to the pattern is "five matches".
Thus, you either want to loop over the first five digit matches, or capture five digits and merge them together.
But look at that Perl example -- that's not one pattern -- it's one pattern repeated five times.
Can you do this via a regular expression? Just like parsing XML -- you probably could, but it's not the right tool.
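The "loop over the first five digit matches" idea can be sketched in a few lines of Python. The function name first_five_digits and the parameter n are made up for the example; the regex only finds single digits, and the counting is done by ordinary code.

```python
import re
from itertools import islice

def first_five_digits(text, n=5):
    # The pattern is just "a digit"; islice stops after the first n matches,
    # so the rest of the text is never scanned.
    digits = (m.group(0) for m in re.finditer(r"\d", text))
    return "".join(islice(digits, n))

print(first_five_digits("15203 Main Street\nApartment 3 63110"))
```

If the text holds fewer than five digits, this returns whatever digits were found instead of raising an error.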
Not sure this is best solved by regular expressions since they are used for string matching and usually not for string manipulation (in my experience).
However, you could make a call to:
strInput = Regex.Replace(strInput, @"\D+", "");
to remove all non-digit characters and then just return the first 5 characters.
If you are wanting just a straight regex expression which does all this for you I am not sure it exists without using the regex class in a similar way as above.
A different approach -
#copy over
$temp = $str;
#Remove non-numbers
$temp =~ s/\D//g;
#Get the first 5 numbers, exactly.
$temp =~ /(\d{5})/;
#Grab the match- ASSUMES that there will be a match.
$first_digits = $1;
result =~ s/^(\d{5}).*/$1/
This replaces the whole text, which must start with exactly five digits (\d{5}) followed by any number of any characters (.*), with $1, which is whatever was captured inside the (), that is, the first five digits.
If you want the first 5 characters of any kind:
result =~ s/^(.{5}).*/$1/
Use whatever programming language you are using to evaluate this.
i.e.
regex.replace(text, "^(.{5}).*", "$1");