Tcl Remove all characters after a string without removing the string - regex

In Tcl, is there a way to trim out all characters AFTER a designated string? I have seen lots of posts on removing everything after and including the string, but not what I am hoping to do. I have a script that searches for file names with the suffix .sv, but some of them are .sv.**bunch of random stuff**, and I don't need the random stuff as it is not relevant to me.
I have experimented with different regsub and string trim commands, but they always remove the .sv as well.
The results being appended to a list are similar to the following:
test_module_1.sv.random_stuff
test_module_2.sv.random_stuff
test_module_3.sv.random_stuff
test_module_4.sv.random_stuff
test_module_5.sv.random_stuff
etc etc

You can reuse matched parts of the regex pattern in the replacement when you use regsub. An example:
regsub {(\.sv).*} $str {\1} new
This will remove .sv and anything after it, if anything follows, and replace it with the first matched group, that is, the part between parentheses, in this case .sv, so that an input of example.sv.random will become example.sv.
However, you can also easily replace with .sv like so:
regsub {\.sv.*} $str {.sv} new
Or another approach not involving replacing would be to get the part of the string up until the .sv part:
string range $str 0 [expr {[string first ".sv" $str]+2}]
Here [string first ".sv" $str] gets the position of .sv in the string (the first occurrence, if there are several), we add 2 to it ("sv" after the "." is 2 characters long), and string range returns all characters up to and including .sv.
Or if you want to stick with regexes:
regexp {.+?\.sv} $str match
$match will contain the result string. The expression used grabs all characters up to and including .sv.
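For a quick cross-check of all three approaches, here is a sketch in Python (str.find and slicing stand in for string first and string range; the slice end is +3 because Python slice ends are exclusive, where Tcl's string range end index is inclusive):

```python
import re

s = "test_module_1.sv.random_stuff"

# regsub {(\.sv).*} $str {\1}  ->  re.sub with a backreference
assert re.sub(r"(\.sv).*", r"\1", s) == "test_module_1.sv"

# string range $str 0 [string first ".sv" $str]+2  ->  slicing on str.find
assert s[: s.find(".sv") + 3] == "test_module_1.sv"

# regexp {.+?\.sv} $str match  ->  a non-greedy match up to and including .sv
assert re.match(r".+?\.sv", s).group() == "test_module_1.sv"
```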

Related

TCL multi capture group for simplified csv string parsing with regexp

I'm trying to parse a simplified CSV format with Tcl's regexp. I chose regexp over split to perform a rudimentary format-compliance test.
My problem is that I want to use a count quantifier but want to exclude the ',' from the match.
My test line:
set line "2017/08/21 16:06:20.0, REALTIME, late by 0.3, EOS450D, 1/640, F/8.0, ISO 100, Partial 450D 0.0%"
So far I have:
regexp -all {(?:([^\,]*)\,){8}} $line dummy date tm off cam exp fnum iso com
My thought process is:
Get a match group for all characters that are not comma up to the next comma.
Now I want to match this 8 times, so I put it into a non-capturing group followed by a counting quantifier. But that defeats the purpose, as now nothing is captured. What I need is a way to make the match go through the CSV 8 times and capture the text but not the commas.
My CSV is simplified in the following.
No quoted strings in the CSV
No empty entries in CSV
I've checked google for csv matching but most hits were too blown up due to allowing special cases in the CSV content.
Thanks,
Gert
In the regexp command, the interaction between the -all switch and the match variables is that the values captured in the last iteration of matching are used to fill the variables. This means that you can't fill eight variables by having one capture group and iteratively matching it eight times.
Your regular expression doesn't match anyway, since it requires a comma after the last field.
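Both points are easy to verify with a quick sketch; Python's re engine behaves the same way as Tcl's on this pattern:

```python
import re

line = ("2017/08/21 16:06:20.0, REALTIME, late by 0.3, EOS450D, "
        "1/640, F/8.0, ISO 100, Partial 450D 0.0%")

# Eight repetitions each require a trailing comma, but the line only has
# seven commas, so the pattern cannot match at all.
assert re.search(r"(?:([^,]*),){8}", line) is None

# With seven repetitions it matches, but the single capture group only
# retains the text from the last iteration.
m = re.search(r"(?:([^,]*),){7}", line)
assert m.group(1) == " ISO 100"
```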
For this particular example, you could use the invocation
% regexp -all -inline {[^,]+} $line
{2017/08/21 16:06:20.0} { REALTIME} { late by 0.3} { EOS450D} { 1/640} { F/8.0} { ISO 100} { Partial 450D 0.0%}
This means to match all groups of characters that aren't commas (note that the comma isn't special: you don't need to escape it) and return them as a list.
As you noted, this is the same as using
% split $line ,
(which is also about five times faster).
You didn't want to use split because you wanted to do some validation: it is unclear what forms of validation you wanted to do, but you can easily validate the number of fields found:
% set fields [split $line ,]
% if {[llength $fields] != 8} {puts stderr "wrong number of fields"}
You can store the fields in variables and validate them separately, which is a lot easier to get right than trying to validate them all at the same time while extracting them:
lassign $fields date tm off cam exp fnum iso com
if {![regexp {ISO\s+\d+} $iso]} {puts stderr "in search of valid ISO"}
The best method is still to split the data string using the csv package. Even if you just want to use this simplified CSV now, sooner than you think you might want to, say, allow fields with commas in them.
package require csv
set fields [::csv::split $line]
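For comparison, Python's standard csv module plays the same role as Tcllib's csv package; a minimal sketch (skipinitialspace handles the blank after each comma):

```python
import csv
import io

line = ("2017/08/21 16:06:20.0, REALTIME, late by 0.3, EOS450D, "
        "1/640, F/8.0, ISO 100, Partial 450D 0.0%")

# csv.reader copes with quoting rules that a hand-rolled regex would miss
fields = next(csv.reader(io.StringIO(line), skipinitialspace=True))
assert len(fields) == 8
assert fields[0] == "2017/08/21 16:06:20.0"
assert fields[7] == "Partial 450D 0.0%"
```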
Documentation:
csv (package),
if,
lassign,
llength,
package,
puts,
regexp,
set,
split,
Syntax of Tcl regular expressions
ETA: Getting rid of leading/trailing whitespace. This is a bit unusual, since CSV data is usually arranged to be fields of strictly significant text separated by a separator character. If there is anything to be trimmed, it is usually done when saving the data.
A good way is to put the matched groups through an lmap/string trim filter:
lmap field [regexp -all -inline {[^,]+} $line] {string trim $field}
Another way is to get rid of whitespace around commas first, and then split:
split [regsub -all {\s*,\s*} $line ,] ,
You can use the Tcllib variant of split that splits by regular expression:
package require textutil
::textutil::splitx $line {\s*,\s*}
You can also swap out the earlier regular expression for [^\s,][^,]*[^\s,] (which will not match fields of fewer than two characters). This is a regular expression that is on the verge of becoming too complex to be useful.

Non-greedy match from end of string with regsub

I have a folder path like following:
/h/apps/new/app/k1999
I want to remove the /app/k1999 part with the following regular expression:
set folder "/h/apps/new/app/k1999"
regsub {\/app.+$} $folder "" new_folder
But the result is /h: too many elements are being removed.
I noticed that I should use non-greedy matching, so I change the code to:
regsub {\/app.+?$} $folder "" new_folder
but the result is still /h.
What's wrong with the above code?
Non-greedy simply means that the engine will try to match the fewest characters possible and increase that number if the whole regex didn't match. The opposite, greedy, means it will try to match as many characters as it can and reduce that number if the whole regex didn't match.
$ in a regex means the end of the string. Therefore something.+$ and something.+?$ are equivalent; it is just that one does more retries before it matches.
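A minimal sketch of that equivalence, in Python for convenience (greedy/non-greedy semantics are the same in Tcl's engine):

```python
import re

s = "/h/apps/new/app/k1999"

# Both patterns start at the first place "/app" occurs (inside "/apps"),
# and because $ pins the match to the end of the string, greedy and
# non-greedy produce exactly the same match.
greedy = re.search(r"/app.+$", s)
lazy = re.search(r"/app.+?$", s)
assert greedy.group() == lazy.group() == "/apps/new/app/k1999"
assert greedy.start() == lazy.start() == 2
```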
In your case /app.+ is matched by /apps and this is the first occurrence of /app in your string. You can fix it by being more explicit and adding the / that follows /app:
regsub {/app/.+$} $folder "" new_folder
If you are looking to match app as a whole word, you can make use of the word boundaries that in Tcl are \m and \M:
\m matches only at the beginning of a word
\M matches only at the end of a word
We only need the \M as / is a non-word character and we do not need \m:
set folder "/h/apps/new/app/k1999"
regsub {/app\M.+$} $folder "" newfolder
puts $newfolder
See IDEONE demo
Result: /h/apps/new (we remove everything from a whole word app up to the end.)
If you want to remove just a part of the string inside the path, you can use negated class [^/]+ to make sure you only match a subpart of a path:
regsub {/app/[^/]+} $folder "" newfolder
The regular expression engine always starts matching as soon as it can; the greediness doesn't affect this. This means that in this case, it always starts matching too soon; you want the last match, not the first one.
If you use regexp -all -indices -inline, you can find out where the last match starts. That lets you then remove the part you actually don't want (e.g., by replacing it with an empty string):
set folder "/h/apps/new/app/k1999"
set indices [regexp -all -indices -inline {/app} $folder]
# This gets this value: {2 5} {11 14}
# If we have indices — if we had a match — we can do the rest of our processing
if {[llength $indices] > 0} {
# Get the '11'; the first sub-element of the last element
set index [lindex $indices end 0]
# Replace '/app/k1999' with the empty string
set newfolder [string replace $folder $index end ""]
} else {
set newfolder $folder; # In case there's no match...
}
You can use a regular expression substitution operation to remove a directory suffix from a path name, but that doesn't mean you should.
file join {*}[lmap dir [file split $folder] {if {$dir ne {app}} {set dir} break}]
# -> /h/apps/new
A path name is a string, but more properly it's a list of directory names:
file split $folder
# -> / h apps new app k1999
What you want is the sublist of directory names up to, but not including, the directory named "app".
lmap dir [file split $folder] {if {$dir ne {app}} {set dir} break}
# -> / h apps new
(The directory name can be tested however you wish; a couple of possibilities are {$dir ni {foo app bar}} to skip a set of alternative names, or {![string match app-* $dir]} to skip any name beginning with "app-".)
And when you've gotten the list of directory names you wanted, you join the elements of it back to a path name again, as above.
So why should you do it this way instead of by using a regular expression substitution operation? This question illustrates the problem well. Unless one is an RE expert or takes great care to read the documentation, one is likely to formulate a regular expression based on a hunch. In the worst case, it works the first time. If not, one is tempted to tinker with it until it does. And any sufficiently ununderstood (yep, that is a word) RE will seem to work most of the time with occasional false positives and negatives to keep things interesting.
Split it, truncate it, join it. Can't go wrong. And if it does, it goes obviously wrong, forcing you to fix it.
Documentation: break, file, if, lmap, set

Tcl regexp does not escape asterisk (*)

In my script I get a string that looks like this:
Reading thisfile.txt
"lib" maps to directory somedir/work.
"superlib" maps to directory somedir/work.
"anotherlib" maps to directory somedir/anotherlib.
** Error: (errorcode) Cannot access file "somedir/anotherlib". <--
No such file or directory. (errno = ENOENT) <--
Reading anotherfile.txt
.....
But the two marked lines with the error code only appear from time to time.
I'm trying to use a regular expression to get the lines from after Reading thisfile.txt to the line before either Reading anotherfile.txt or, if it is there, before **.
So result should in every case look like this:
"lib" maps to directory somedir/work.
"superlib" maps to directory somedir/work.
"anotherlib" maps to directory somedir/anotherlib.
I have tried it with this regexp:
set pattern ".*Reading thisfile.txt\n(.*)\n.*Reading .*$"
Then I do
regexp -all $pattern $data -> result
But that only works if there is no error message.
So I'm trying to look for the *.
set pattern ".*Reading thisfile.txt\n(.*)\n.*\[\*|Reading\].*$"
But that also does not work. The part with ** Error is still there.
I wonder why. This one doesn't even compile:
set pattern ".*Reading thisfile.txt\n(.*)\n.*\*?.*Reading .*$"
Any idea how I can find the lines and match the * literally?
From the way you wrote your regex, you will have to use braces:
set pattern {.*Reading thisfile\.txt\n(.*)\n.*\*?.*Reading .*$}
If you used quotes, you would have had to use:
set pattern ".*Reading thisfile\\.txt\n(.*)\n.*\\*?.*Reading .*$"
i.e. basically put a second backslash to escape the first ones.
The above will be able to grab something; albeit everything between the first and the last Reading.
If you want to match from Reading thisfile.txt to the next line beginning with asterisk, then you could use:
set pattern {^Reading thisfile\.txt\n(.*?)\n(?=^Reading|^\*)}
regexp -all -lineanchor -- $pattern $data -> result
(?=^Reading|^\*) is a positive lookahead and I changed your (.*) to (.*?) so that you really get all the occurrences and not from the first to the last Reading.
The positive lookahead will match if either Reading or * is ahead and are both starting on a new line.
-lineanchor makes ^ match at every beginning of line instead of at the start of the string.
codepad demo
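For readers without Tcl at hand, the same behavior can be sketched in Python: re.M stands in for -lineanchor, and re.S makes . cross newlines, which Tcl's . does by default:

```python
import re

data = ("Reading thisfile.txt\n"
        '"lib" maps to directory somedir/work.\n'
        '"superlib" maps to directory somedir/work.\n'
        '** Error: (errorcode) Cannot access file "somedir/anotherlib".\n'
        "Reading anotherfile.txt\n")

pattern = r"^Reading thisfile\.txt\n(.*?)\n(?=^Reading|^\*)"
m = re.search(pattern, data, re.M | re.S)

# The lazy group stops right before the line starting with "**"
assert m.group(1) == ('"lib" maps to directory somedir/work.\n'
                      '"superlib" maps to directory somedir/work.')
```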
I forgot to mention that if you have more than one match, you will have to collect the results of regexp with the -inline switch instead of using the above construct (else you'll get only the last submatch)...
set results [regexp -all -inline -lineanchor -- $pattern $data]
foreach {main sub} $results {
puts $sub
}
I'm unfamiliar with Tcl, but the following regex will give you matches of which the 1st capture group contains the filename and the 2nd capture group contains all the lines you want:
^Reading ([^\n]*)\n((?:[^\n]|\n(?!Reading|\*\*))*)
Debuggex Demo
Basically the (?:[^\n]|\n(?!Reading|\*\*))* is saying "Match anything that isn't a new-line character or a new-line character not followed by either Reading or **".
What I'm getting from Jerry's answer is that you'd define that in Tcl like so:
set pattern {^Reading ([^\n]*)\n((?:[^\n]|\n(?!Reading|\*\*))*)}

Regular expression literal-text span

Is there any way to indicate to a regular expression a block of text that is to be searched for literally? I ask because I have to match a very, very long piece of text which contains all sorts of metacharacters (and has to match exactly), followed by some flexible stuff (enough to merit the use of a regex), followed by more text that has to be matched exactly.
Rinse, repeat.
Needless to say, I don't really want to have to run through the entire thing and have to escape every metacharacter. That just makes it a bear to read. Is there a way to wrap those portions so that I don't have to do this?
Edit:
Specifically, I am using Tcl, and by "metacharacters", I mean that there's all sorts of long strings like "**$^{*$%\)". I would really not like to escape these. I mean, it would add thousands of characters to the string. Does Tcl regexp have a literal-text span metacharacter?
The normal way of doing this in Tcl is to use a helper procedure to do the escaping, like this:
proc re_escape str {
# Every non-word char gets a backslash put in front
regsub -all {\W} $str {\\&}
}
set awkwardString "**$^{*$%\\)"
regexp "simpleWord *[re_escape $awkwardString] *simpleWord" $largeString
Where you have a whole literal string, you have two other alternatives:
regexp "***=$literal" $someString
regexp "(?q)$literal" $someString
However, both of these only permit patterns that are pure literals; you can't mix patterns and literals that way.
No, Tcl does not have such a feature.
If you're concerned about readability you can use variables and commands to build up your expression. For example, you could do something like:
set fixed1 {.*?[]} ;# match the literal five-byte sequence .*?[]
set fixed2 {???} ;# match the literal three byte sequence ???
set pattern "this.*and.*that"
regexp "[re_escape $fixed1]$pattern[re_escape $fixed2]"
You would need to supply the definition for re_escape but the solution should be pretty obvious.
A Tcl regular expression can be specified with the q metasyntactical directive to indicate that the expression is literal text:
% set string {this string contains *emphasis* and 2+2 math?}
% puts [regexp -inline -all -indices {*} $string]
couldn't compile regular expression pattern: quantifier operand invalid
% puts [regexp -inline -all -indices {(?q)*} $string]
{21 21} {30 30}
This does however apply to the entire expression.
What I would do is to iterate over the returned indices, using them as arguments to [string range] to extract the other stuff you're looking for.
I believe Perl and Java support the \Q \E escape. so
\Q.*.*()\E
..will actually match the literal ".*.*()"
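Python has no \Q...\E either, but its re.escape plays the same role as the re_escape helper shown earlier; a minimal sketch:

```python
import re

# re.escape backslash-protects every regex metacharacter in the string
awkward = r"**$^{*$%\)"
pattern = re.escape(awkward)
assert re.search(pattern, "prefix **$^{*$%\\) suffix") is not None
assert re.search(pattern, "no such literal here") is None
```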
OR
Bit of a hack, but replace the literal section with some text which does not need escaping and that will not appear elsewhere in your searched string. Then build the regex using this metacharacter-free text. A 100-digit random sequence, for example. Then, when your regex matches at a certain position and length in the doctored string, you can calculate whereabouts it should appear in the original string and what length it should be.

How to cycle through delimited tokens with a Regular Expression?

How can I create a regular expression that will grab delimited text from a string? For example, given a string like
text ###token1### text text ###token2### text text
I want a regex that will pull out ###token1###. Yes, I do want the delimiter as well. By adding another group, I can get both:
(###(.+?)###)
/###(.+?)###/
if you want the ###'s then you need
/(###.+?###)/
The ? means non-greedy; if you didn't have the ?, then it would grab too much.
e.g. '###token1### text text ###token2###' would all get grabbed.
My initial answer had a * instead of a +. * means 0 or more; + means 1 or more. * was wrong because it would allow ###### as a valid thing to find.
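A minimal sketch of the greedy/non-greedy difference, in Python (these constructs behave the same way in Perl and Tcl):

```python
import re

s = "text ###token1### text text ###token2### text text"

# Non-greedy stops at the first closing ###, so each token comes out separately
assert re.findall(r"###(.+?)###", s) == ["token1", "token2"]

# Greedy runs to the last ### and swallows everything in between
assert re.findall(r"###(.+)###", s) == ["token1### text text ###token2"]
```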
For playing around with regular expressions, I highly recommend http://www.weitz.de/regex-coach/ for Windows. You can type in the string you want and your regular expression and see what it's actually doing.
Your selected text will be stored in \1 or $1 depending on where you are using your regular expression.
In Perl, you actually want something like this:
$text = 'text ###token1### text text ###token2### text text';
while($text =~ m/###(.+?)###/g) {
print $1, "\n";
}
Which will give you each token in turn within the while loop. The (.+?) ensures that you get the shortest bit between the delimiters, preventing it from thinking the token is 'token1### text text ###token2'.
Or, if you just want to save them, not loop immediately:
@tokens = $text =~ m/###(.+?)###/g;
Assuming you want to match ###token2### as well...
/###.+###/
Use () groups and \1, \2, ... backreferences. A naive example that assumes the text within the tokens is always delimited by #:
text (#+.+#+) text text (#+.+#+) text text
The stuff in the () can then be grabbed by using \1 and \2 (\1 for the first group, \2 for the second) in the replacement expression, assuming you're doing a search/replace in an editor. For example, the replacement expression could be:
token1: \1, token2: \2
For the above example, that should produce:
token1: ###token1###, token2: ###token2###
If you're using a regexp library in a program, you'd presumably call a function to get at the contents of the first and second tokens, which you've indicated with the ()s around them.
Well, when you are using delimiters such as this, basically you just grab the first delimiter, then anything that does not match the ending delimiter, followed by the ending delimiter. A special caution: in cases like the example above, [^#] alone would not work for checking that the end delimiter is absent, since a single # would cause the regex to fail (i.e. "###foo#bar###"). In the case above, the regex to parse it would be the following, assuming empty tokens are allowed (if not, change * to +):
###([^#]|#[^#]|##[^#])*###
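A quick check of that pattern, sketched in Python (the alternation behaves the same in most regex flavors; the inner group is written as non-capturing here so findall returns whole matches):

```python
import re

pattern = r"###(?:[^#]|#[^#]|##[^#])*###"

# A single interior "#" no longer breaks the match...
assert re.search(pattern, "###foo#bar###").group() == "###foo#bar###"

# ...and matching still stops at the first full ### delimiter
s = "###token1### text ###token2###"
assert re.findall(pattern, s) == ["###token1###", "###token2###"]
```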