Non-greedy match from end of string with regsub - regex

I have a folder path like following:
/h/apps/new/app/k1999
I want to remove the /app/k1999 part with the following regular expression:
set folder "/h/apps/new/app/k1999"
regsub {\/app.+$} $folder "" new_folder
But the result is /h: too many elements are being removed.
I thought I should use non-greedy matching, so I changed the code to:
regsub {\/app.+?$} $folder "" new_folder
but the result is still /h.
What's wrong with the above code?

Non-greedy simply means that the quantifier tries to match as few characters as possible and increases that amount only if the rest of the regex fails to match. The opposite, greedy, means that it tries to match as many characters as it can and reduces that amount if the rest of the regex fails to match.
$ in a regex means the end of the string. Therefore something.+$ and something.+?$ are equivalent; one simply does more retries than the other before it matches.
In your case /app.+ is matched by /apps and this is the first occurrence of /app in your string. You can fix it by being more explicit and adding the / that follows /app:
regsub {/app/.+$} $folder "" new_folder
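To see that greediness is not the culprit, here is a minimal sketch that runs all three patterns on the path from the question. The variable names greedy, lazy and fixed are only for illustration.

```tcl
set folder "/h/apps/new/app/k1999"

# Both forms start matching at the first "/app" (inside "/apps"),
# and the trailing $ forces the match to run to the end of the string.
regsub {/app.+$}  $folder "" greedy
regsub {/app.+?$} $folder "" lazy

# Requiring the slash right after "app" skips over "/apps".
regsub {/app/.+$} $folder "" fixed

puts [list $greedy $lazy $fixed]
# prints: /h /h /h/apps/new
```

The first two results are identical because the quantifier only controls where the match ends, never where it starts.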

If you are looking to match app as a whole word, you can make use of the word boundaries that in Tcl are \m and \M:
\m matches only at the beginning of a word
\M matches only at the end of a word
We only need the \M as / is a non-word character and we do not need \m:
set folder "/h/apps/new/app/k1999"
regsub {/app\M.+$} $folder "" newfolder
puts $newfolder
See IDEONE demo
Result: /h/apps/new (we remove everything from a whole word app up to the end.)
If you want to remove just a part of the string inside the path, you can use negated class [^/]+ to make sure you only match a subpart of a path:
regsub {/app/[^/]+} $folder "" newfolder
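The difference between the two patterns only shows up when something follows k1999. A small sketch, assuming a hypothetical longer path with an extra logs component (not part of the question):

```tcl
# Hypothetical path with one more component after k1999.
set folder "/h/apps/new/app/k1999/logs"

# \M rejects "/apps" (the "s" continues the word), then .+$ eats the rest.
regsub {/app\M.+$} $folder "" truncated

# [^/]+ stops at the next slash, so only "/app/k1999" is removed.
regsub {/app/[^/]+} $folder "" excised

puts $truncated   ;# /h/apps/new
puts $excised     ;# /h/apps/new/logs
```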

The regular expression engine always starts matching as soon as it can; the greediness doesn't affect this. This means that in this case, it always starts matching too soon; you want the last match, not the first one.
If you use regexp -all -indices -inline, you can find out where the last match starts. That lets you then remove the part you actually don't want (e.g., by replacing it with an empty string):
set folder "/h/apps/new/app/k1999"
set indices [regexp -all -indices -inline {/app} $folder]
# This gets this value: {2 5} {11 14}
# If we have indices (that is, if we had a match), we can do the rest of our processing
if {[llength $indices] > 0} {
    # Get the '11'; the first sub-element of the last element
    set index [lindex $indices end 0]
    # Replace '/app/k1999' with the empty string
    set newfolder [string replace $folder $index end ""]
} else {
    set newfolder $folder; # In case there's no match...
}
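The same steps could be wrapped in a small helper. This is only a sketch: the name cutFromLastMatch is made up here, and it assumes the pattern has no capturing groups (with groups, regexp -all -inline -indices also returns the sub-match ranges, so the last list element would no longer be the last whole match).

```tcl
# Hypothetical helper: remove everything from the start of the last
# match of $re to the end of $str. Assumes $re has no capturing groups.
proc cutFromLastMatch {re str} {
    set indices [regexp -all -indices -inline -- $re $str]
    if {[llength $indices] == 0} {
        return $str            ;# no match: leave the string alone
    }
    set start [lindex $indices end 0]
    return [string replace $str $start end]
}

puts [cutFromLastMatch {/app} "/h/apps/new/app/k1999"]   ;# /h/apps/new
puts [cutFromLastMatch {/app} "/usr/local/bin"]          ;# /usr/local/bin
```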

You can use a regular expression substitution operation to remove a directory suffix from a path name, but that doesn't mean you should.
file join {*}[lmap dir [file split $folder] {if {$dir ne {app}} {set dir} break}]
# -> /h/apps/new
A path name is a string, but more properly it's a list of directory names:
file split $folder
# -> / h apps new app k1999
What you want is the sublist of directory names up to, but not including, the directory named "app".
lmap dir [file split $folder] {if {$dir ne {app}} {set dir} break}
# -> / h apps new
(The directory name can be tested however you wish; a couple of possibilities are {$dir ni {foo app bar}} to stop at any of several alternative names, or {![string match app-* $dir]} to stop at any name beginning with "app-".)
And when you've gotten the list of directory names you wanted, you join the elements of it back to a path name again, as above.
So why should you do it this way instead of by using a regular expression substitution operation? This question illustrates the problem well. Unless one is an RE expert or takes great care to read the documentation, one is likely to formulate a regular expression based on a hunch. In the worst case, it works the first time. If not, one is tempted to tinker with it until it does. And any sufficiently ununderstood (yep, that is a word) RE will seem to work most of the time with occasional false positives and negatives to keep things interesting.
Split it, truncate it, join it. Can't go wrong. And if it does, it goes obviously wrong, forcing you to fix it.
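As a sketch of "split it, truncate it, join it" in reusable form, the loop can go into a proc. The name pathBefore and its parameters are hypothetical, and it uses a plain foreach instead of lmap so the empty-result case can be handled explicitly, since file join needs at least one argument.

```tcl
# Hypothetical helper: return the part of $path that precedes the first
# directory named exactly $stopName, or the whole path if it never occurs.
proc pathBefore {path stopName} {
    set kept {}
    foreach dir [file split $path] {
        if {$dir eq $stopName} break
        lappend kept $dir
    }
    if {[llength $kept] == 0} {
        return ""              ;# path was empty or started with $stopName
    }
    return [file join {*}$kept]
}

puts [pathBefore "/h/apps/new/app/k1999" app]   ;# /h/apps/new
puts [pathBefore "/h/apps/new" app]             ;# /h/apps/new
```

Because the comparison is against whole directory names, apps is never mistaken for app.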
Documentation: break, file, if, lmap, set

Related

Tcl Remove all characters after a string without removing the string

In Tcl, is there a way to trim out all characters AFTER a designated string? I have seen lots of posts on removing everything after and including the string, but not what I am hoping to do. I have a script that searches for file names with the suffix .sv, but some of them are .sv. followed by a bunch of random stuff, and I don't need the random stuff as it is not relevant to me.
I have experimented with different regsub and string trim commands but they always remove the .sv as well.
The results being appended to a list are similar to the following:
test_module_1.sv.random_stuff
test_module_2.sv.random_stuff
test_module_3.sv.random_stuff
test_module_4.sv.random_stuff
test_module_5.sv.random_stuff
etc etc
You can place matched parts of a regex pattern in the replacement when you use regsub. An example:
regsub {(\.sv).*} $str {\1} new
This matches .sv and anything after it, and replaces the whole match with the first captured group (the part between parentheses, in this case .sv), so that an input of example.sv.random becomes example.sv.
However, you can also easily replace with .sv like so:
regsub {\.sv.*} $str {.sv} new
Or another approach not involving replacing would be to get the part of the string up until the .sv part:
string range $str 0 [expr {[string first ".sv" $str]+2}]
Here [string first ".sv" $str] gets the position of .sv in the string (if there are multiple, it will get the first), adds 2 characters (sv after . are 2 chars long) to it and string range gets all characters up to and including .sv.
Or if you want to stick with regexes:
regexp {.+?\.sv} $str match
$match will contain the result string. The expression used grabs all characters up to and including .sv.
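Applied to a whole list of names, the regsub variant might look like the sketch below. The list contents and the variable name cleaned are made up for illustration; plain.txt is included to show that a name without .sv passes through unchanged.

```tcl
# Hypothetical input list; only the first entry has a trailing suffix.
set names {test_module_1.sv.random_stuff test_module_2.sv plain.txt}

set cleaned {}
foreach name $names {
    # Keep ".sv" (group 1) and drop whatever follows it.
    # If there is no match, regsub stores the original string.
    regsub {(\.sv).*} $name {\1} name
    lappend cleaned $name
}

puts $cleaned
# prints: test_module_1.sv test_module_2.sv plain.txt
```

With the string first approach, keep in mind that string first returns -1 when .sv is absent, so that case needs its own check before calling string range.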

Match numbers in square brackets

I am trying to match the following strings in a file:
test/test/abc_xyz[2][0]/abc
test/test/abc_xyz[2]/abc
test/test/abc_xyz/abc
I tried the following using regexp within a TCL script:
set match [regexp {^.*\/*\_xyz(|\[[0-9]{1,}\]|\[[0-9]{1,}\]\[[0-9]{1,}\])\/abc} $line extracted_string ]
Using this regular expression I managed to extract these lines:
"test/test/abc_xyz[2]/abc"
"test/test/abc_xyz/abc"
But I couldn't find any way to extract lines similar to this:
test/test/abc_xyz[2][0]/abc
Could anybody tell me what I may be missing?
In this case, you've got this sequence — [, digit(s), ] — that you want to match zero or more times. That leads to this: (?:\[\d+\])* as a general RE. Using that in place of your more complicated piece, we can get:
set match [regexp {^.*\/*\_xyz((?:\[\d+\])*)\/abc} $line extracted_string substring ]
That's shorter, and much easier to understand. I've also added substring in there to get the bracketed sequence. Dealing with that might be best done with a judicious string map of the brackets to spaces:
set numbers [string map {"[" " " "]" " "} $substring]
Now you can use list operations (llength, lindex, foreach, etc.) on it…
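Putting the pieces together, a sketch along these lines returns the bracketed numbers as a proper Tcl list. The proc name bracketIndices is hypothetical, and the pattern is shortened to the part after _xyz for readability.

```tcl
# Hypothetical helper: return the numbers found in the [n] groups between
# "_xyz" and "/abc". Returns an empty list when there are no groups
# (or when the line does not match at all).
proc bracketIndices {line} {
    if {![regexp {_xyz((?:\[\d+\])*)/abc} $line -> substring]} {
        return {}
    }
    # Turn "[2][0]" into " 2  0 ", then normalise it into a clean list.
    return [list {*}[string map {"[" " " "]" " "} $substring]]
}

foreach line {
    {test/test/abc_xyz[2][0]/abc}
    {test/test/abc_xyz[2]/abc}
    {test/test/abc_xyz/abc}
} {
    set nums [bracketIndices $line]
    puts "[llength $nums] number(s): [join $nums ,]"
}
```

The three sample lines yield two, one and zero numbers respectively.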
Try this updated pattern: ^.*\/*\_xyz(\[\d+\])?\/abc
Explanation:
Your problem is in the (|\[[0-9]{1,}\]|\[[0-9]{1,}\]\[[0-9]{1,}\]) block.
I think you want a bracketed number such as [05] to appear once between xyz and /abc:
(\[\d+\]) selects [05]
? makes the group optional (zero or one occurrence); use * instead of ? if the group can repeat, as in abc_xyz[2][0]
This may help. demo

Tcl regexp does not escape asterisk (*)

In my script I get a string that looks like this:
Reading thisfile.txt
"lib" maps to directory somedir/work.
"superlib" maps to directory somedir/work.
"anotherlib" maps to directory somedir/anotherlib.
** Error: (errorcode) Cannot access file "somedir/anotherlib". <--
No such file or directory. (errno = ENOENT) <--
Reading anotherfile.txt
.....
But the two marked lines with the error code only appear from time to time.
I'm trying to use a regular expression to get the lines from after Reading thisfile.txt to the line before either Reading anotherfile.txt or, if it is there, before **.
So result should in every case look like this:
"lib" maps to directory somedir/work.
"superlib" maps to directory somedir/work.
"anotherlib" maps to directory somedir/anotherlib.
I have tried it with this regexp:
set pattern ".*Reading thisfile.txt\n(.*)\n.*Reading .*$"
Then I do
regexp -all $pattern $data -> result
But that only works if there is no error message.
So I'm trying to look for the *.
set pattern ".*Reading thisfile.txt\n(.*)\n.*\[\*|Reading\].*$"
But that also does not work. The part with ** Error is still there.
I wonder why. This one doesn't even compile:
set pattern ".*Reading thisfile.txt\n(.*)\n.*\*?.*Reading .*$"
Any idea how I can find the * and stop the match before it?
From the way you wrote your regex, you will have to use braces:
set pattern {.*Reading thisfile\.txt\n(.*)\n.*\*?.*Reading .*$}
If you used quotes, you would have had to use:
set pattern ".*Reading thisfile\\.txt\n(.*)\n.*\\*?.*Reading .*$"
i.e. basically put a second backslash to escape the first ones.
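The effect of the quoting style can be checked directly. A minimal sketch, with variable names chosen only for illustration:

```tcl
# Braces: the backslashes reach the regex engine untouched.
set braced {\*\.}
# Quotes: Tcl's own backslash substitution runs first, leaving "*.".
set quoted "\*\."
# Quotes with doubled backslashes: the same string as the braced form.
set doubled "\\*\\."

puts [list [string length $braced] [string length $quoted] [string equal $braced $doubled]]
# prints: 4 2 1

# The braced pattern matches a literal "**" at the start of the error line.
puts [regexp {^\*\*} "** Error: (errorcode)"]
# prints: 1
```

This is also why "\[\*|Reading\]" in the question did not behave as intended: after substitution the engine sees [*|Reading], which is a bracket expression (a set of single characters) rather than escaped brackets.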
The above will be able to grab something; albeit everything between the first and the last Reading.
If you want to match from Reading thisfile.txt to the next line beginning with asterisk, then you could use:
set pattern {^Reading thisfile\.txt\n(.*?)\n(?=^Reading|^\*)}
regexp -all -lineanchor -- $pattern $data -> result
(?=^Reading|^\*) is a positive lookahead and I changed your (.*) to (.*?) so that you really get all the occurrences and not from the first to the last Reading.
The positive lookahead succeeds if either Reading or * comes next, each at the start of a line.
-lineanchor makes ^ match at every beginning of line instead of at the start of the string.
codepad demo
I forgot to mention that if you have more than one match, you will have to store the results of the regexp and use the -inline modifier instead of using the above construct (else you'll get only the last submatch)...
set results [regexp -all -inline -lineanchor -- $pattern $data]
foreach {main sub} $results {
    puts $sub
}
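For a self-contained check, the sketch below rebuilds a shortened version of the sample input from the question (the data is abbreviated and the variable names are illustrative) and collects the captured blocks in a list.

```tcl
# Shortened sample input, joined with newlines.
set data [join {
    {Reading thisfile.txt}
    {"lib" maps to directory somedir/work.}
    {"anotherlib" maps to directory somedir/anotherlib.}
    {** Error: (errorcode) Cannot access file "somedir/anotherlib".}
    {No such file or directory. (errno = ENOENT)}
    {Reading anotherfile.txt}
} \n]

set pattern {^Reading thisfile\.txt\n(.*?)\n(?=^Reading|^\*)}

# -inline returns {whole-match submatch} pairs; keep only the submatches.
set blocks {}
foreach {whole block} [regexp -all -inline -lineanchor -- $pattern $data] {
    lappend blocks $block
}

puts [lindex $blocks 0]
```

With this input, the single captured block should contain the two "maps to directory" lines and stop before the ** Error line.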
I'm unfamiliar with Tcl, but the following regex will give you matches in which the 1st capture group contains the filename and the 2nd capture group contains all the lines you want:
^Reading ([^\n]*)\n((?:[^\n]|\n(?!Reading|\*\*))*)
Debuggex Demo
Basically the (?:[^\n]|\n(?!Reading|\*\*))* is saying "Match anything that isn't a new-line character or a new-line character not followed by either Reading or **".
From what I gather from Jerry's answer, you'd define that in Tcl like so:
set pattern {^Reading ([^\n]*)\n((?:[^\n]|\n(?!Reading|\*\*))*)}

Hierarchical path RegExp

I have to remove a known "level" from a hierarchical path using a regular expression.
In other terms, I want to go from 'a/b/X/c/d' to 'a/b/c/d', where X can be at any level of the path.
Using Javascript as an example, I have crafted the following:
str = str.replace(/^(?:(.+\/)|)X(?:$|\/(.+$))/, "$1$2")
which works fine when X is either the root or is in the middle of the path, but leaves a trailing slash when X comes last in the path. I could make a subsequent replace to handle those instances, but would it be possible to create a better RegEx that matches all the cases?
Thanks.
Edit: To clarify, all levels of the path might contain any number of characters and I'm only interested in removing a level only if it matches X exactly.
Search: \bX/|/X(?=$)
Replace: Empty String
In the Regex Demo, see the substitutions at the bottom.
Input
a/b/X/c/d
X/a/b/c/d
a/b/c/d/X
Output
a/b/c/d
a/b/c/d
a/b/c/d
Explanation
\b assert word boundary
X/ match X/
OR |
Match /X, if the lookahead (?=$) can assert that what follows is the end of the string

Remove characters and numbers from a string in perl

I'm trying to rename a bunch of files in my directory and I'm stuck at the regex part of it.
I want to remove certain characters from a filename which appear at the beginning.
Example1: _00-author--book_revision_
Expected: Author - Book (Revision)
So far, I am able to use regex to remove underscores and capitalize the first letter:
$newfile =~ s/_/ /g;
$newfile =~ s/^[0-9]//g;
$newfile =~ s/^[0-9]//g;
$newfile =~ s/^-//g;
$newfile = ucfirst($newfile);
This is not a good method. I need help in removing all characters until you hit the first letter, and when you hit the first '-' I want to add a space before and after '-'.
Also when I hit the second '-' I want to replace it with '('.
Any guidance, tips or even suggestions on taking the right approach is much appreciated.
So do you want to capitalize all the components of the new filename, or just the first one? Your question is inconsistent on that point.
Note that if you are on Linux, you probably have the rename command, which will take a perl expression and use it to rename files for you, something like this:
rename 'my ($a,$b,$r);$_ = "$a - $b ($r)"
if ($a, $b, $r) = map { ucfirst $_ } /^_\d+-(.*?)--(.*?)_(.*?)_$/' _*
Your instructions and your example don't match.
According to your instructions,
s/^[^\pL]+//; # Remove everything until first letter.
s/-/ - /; # Replace first "-" with " - "
s/-[^-]*\K-/(/; # Replace second "-" with "("
According to your example,
s/^[^\pL]+//;
s/--/ - /;
s/_/ (/;
s/_/)/;
s/(?<!\pL)(\pL)/\U$1/g;
$filename =~ s,^_\d+-(.*?)--(.*?)_(.*?)_$,\u\1 - \u\2 (\u\3),;
My Perl interpreter (using strict and warnings) says that this is better written as:
$filename =~ s,^_\d+-(.*?)--(.*?)_(.*?)_$,\u$1 - \u$2 (\u$3),;
The first one is probably too sed-like for its taste! (Of course both versions work just the same.)
Explanation (as requested by stema):
$filename =~ s/
^ # matches the start of the line
_\d+- # matches an underscore, one or more digits and a hyphen-minus
(.*?)-- # matches (non-greedily) anything before two consecutive hyphen-minus
# and captures the entire match (as the first capture group)
(.*?)_ # matches (non-greedily) anything before a single underscore and
# captures the entire match (as the second capture group)
(.*?)_ # does the same as the one before (but captures the match as the
# third capture group obviously)
$ # matches the end of the line
/\u$1 - \u$2 (\u$3)/x;
The \u${1..3} in the replacement specification simply tells Perl to insert capture groups 1 to 3 with their first character made upper-case. If you wanted to make an entire captured group upper-case, you would have to use \U instead.
The x flag turns on verbose mode, which tells the Perl interpreter that we want to use # comments, so it will ignore these (and any white space in the regular expression, so if you want to match a space you have to use either \s or an escaped space, "\ "). Unfortunately I couldn't figure out how to tell Perl to ignore white space in the replacement specification, which is why I've written that on a single line.
(Also note that I've changed my s terminator from , to / - Perl barked at me if I used the , with verbose mode turned on ... not exactly sure why.)
If they all follow that format then try:
my ($author, $book, $revision) = $newfile =~ /-(.*?)--(.*?)_(.*?)_/;
print ucfirst($author ) . " - $book ($revision)\n";