How to remove duplicate characters in a string strictly using regexp in TCL?
e.g., I have a string like aabbcddeffghh and I need only the unique characters, "abcdefgh". I tried lsort -unique, and I am able to get the unique characters:
join [lsort -unique [split $mystring {}]]
but I need to do this using the regexp command only.
You can't remove all non-consecutive double characters from a string with just Tcl's regsub command. It doesn't support access to back-references in lookahead sequences, which means that any scheme for removal will necessarily run into problems with overlapping match regions.
The simplest fix is to wrap in a while loop (with an empty body), using the fact that regsub will return the number of substitutions performed when it's given a variable to store the result in (last argument to it below):
set str "mississippi mud pie"
while {[regsub -all {(.)(.*)\1+} $str {\1\2} str]} {}
puts $str; # Prints "misp ude"
Try this one:
regsub -linestop -lineanchor -all {([a-z])\1+} $subject {\1} result
or
regsub -linestop -nocase -lineanchor -all {([a-z])\1+} $subject {\1} result
Explanation
{
( # Match the regular expression below and capture its match into backreference number 1
[a-z] # Match a single character in the range between “a” and “z”
)
\1 # Match the same text as most recently matched by capturing group number 1
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
}
regsub -all {(.)(?=.*\1)} $subject {} result
It uses a look-ahead to check if there are any more instances of the character. If there are, it removes the character.
You will always retain the last character. It is not possible to do look-behinds in TCL without extra libraries.
More information about look-arounds: Regex tutorial - Lookahead and Lookbehind Zero-Width Assertions
Edit: Hmmm... Seems to be a bug with backreferences in Tcl 8.5. {(.).*\1} matches, but not {(.)(?=.*\1)}. It complains about Invalid backreference number. I can't see any solution to this without a backreference inside a look-ahead.
It might just be the version I tested it on (ideone.com/pFS0Q). I can't find any other version of a Tcl interpreter online to test.
Related
I have some string with format
class(amber#good)
class(Back1#notgood)
class(back#good)
and I want to use regexp to get value of these string
Expected answer:
amber
Back1
back
And here's my cmd:
set string "class(amber#good)"
regexp -all {^\\([a-zA-z_0-9].\#$} $string $match
puts $match
But the answer is not what I expected
You can use
regexp {\(([^()#]+)} $string - match
See the Tcl demo online.
The \(([^()#]+) regex matches
\( - a ( char
([^()#]+) - Capturing group 1 (match): any one or more chars other than parentheses and #.
The hyphen is used since the whole-match value is not necessary, we are only interested to get Group 1 value.
Sometimes using regular expressions is error prone and/or overkill.
Here's an alternate answer using split:
lindex [split $string "()#"] 1
I have a (probably very basic) question about how to construct a (perl) regex, perl -pe 's///g;', that would find/replace multiple instances of a given character/set of characters in a specified string. Initially, I thought the g "global" flag would do this, but I'm clearly misunderstanding something very central here. :/
For example, I want to eliminate any non-alphanumeric characters in a specific string (within a larger text corpus). Just by way of example, the string is identified by starting with [ followed by #, possibly with some characters in between.
[abc#def"ghi"jkl'123]
The following regex
s/(\[[^\[\]]*?#[^\[\]]*?)[^a-zA-Z0-9]+?([^\[\]]*?)/$1$2/g;
will find the first " and if I run it three times I have all three.
Similarly, what if I want to replace the non-alphanumeric characters with something else, let's say an X.
s/(\[[^\[\]]*?#[^\[\]]*?)[^a-zA-Z0-9]+?([^\[\]]*?)/$1X$2/g;
does the trick for one instance. But how can I find all of them in one go?
The reason your code doesn't work is that /g doesn't rescan the string after a substitution. It finds all non-overlapping matches of the given regex and then substitutes the replacement part in.
In [abc#def"ghi"jkl'123], there is only a single match (which is the [abc#def" part of the string, with $1 = '[abc#def' and $2 = ''), so only the first " is removed.
After the first match, Perl scans the remaining string (ghi"jkl'123]) for another match, but it doesn't find another [ (or #).
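A minimal sketch of this behaviour, using the sample string from the question (the variable names are only illustrative). Since s///g returns the number of substitutions it made, the count shows that a single pass finds just one match:

```perl
use strict;
use warnings;

# Sample string from the question.
my $text = q([abc#def"ghi"jkl'123]);

# The question's substitution. In scalar context s///g returns the number
# of replacements made.
my $count = ($text =~ s/(\[[^\[\]]*?#[^\[\]]*?)[^a-zA-Z0-9]+?([^\[\]]*?)/$1$2/g);

print "substitutions: $count\n";   # substitutions: 1
print "result: $text\n";           # result: [abc#defghi"jkl'123]
```

Only the first " is gone, because the rest of the string contains no further [ ... # prefix for the pattern to match.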
I think the most straightforward solution is to use a nested search/replace operation. The outer match identifies the string within which to substitute, and the inner match does the actual replacement.
In code:
s{ \[ [^\[\]\#]* \# \K ([^\[\]]*) (?= \] ) }{ $1 =~ tr/a-zA-Z0-9//cdr }xe;
Or to replace each match by X:
s{ \[ [^\[\]\#]* \# \K ([^\[\]]*) (?= \] ) }{ $1 =~ tr/a-zA-Z0-9/X/cr }xe;
We match a prefix of [, followed by 0 or more characters that are not [ or ] or #, followed by #.
\K is used to mark the virtual beginning of the match (i.e. everything matched so far is not included in the matched string, which simplifies the substitution).
We match and capture 0 or more characters that are not [ or ].
Finally we match a suffix of ] in a look-ahead (so it's not part of the matched string either).
The replacement part is executed as a piece of code, not a string (as indicated by the /e flag). Here we could have used $1 =~ s/[^a-zA-Z0-9]//gr or $1 =~ s/[^a-zA-Z0-9]/X/gr, respectively, but since each inner match is just a single character, it's also possible to use a transliteration.
We return the modified string (as indicated by the /r flag) and use it as the replacement in the outer s operation.
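The two substitutions above can be tried out with a short script. This is only a sketch: the sample line and the variable names are made up for illustration, and the /g flag is an addition here so that more than one bracketed string per line is handled.

```perl
use strict;
use warnings;
use v5.14;   # the /r flag needs Perl 5.14 or later

# Hypothetical input with two bracketed strings.
my $line = q(ah [abc#def"ghi"jkl'123] oh [x#y_z!]);

# Outer match: find "[...#", keep it via \K, capture up to the closing "]".
# Inner tr///: delete (d) everything in the complement (c) of a-zA-Z0-9.
(my $stripped = $line) =~ s{ \[ [^\[\]\#]* \# \K ([^\[\]]*) (?= \] ) }
                           { $1 =~ tr/a-zA-Z0-9//cdr }xeg;

# Same outer match, but replace each non-alphanumeric character with X.
(my $masked = $line) =~ s{ \[ [^\[\]\#]* \# \K ([^\[\]]*) (?= \] ) }
                         { $1 =~ tr/a-zA-Z0-9/X/cr }xeg;

say $stripped;   # ah [abc#defghijkl123] oh [x#yz]
say $masked;     # ah [abc#defXghiXjklX123] oh [x#yXzX]
```

The text outside the brackets, and the part before #, is left alone in both cases.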
So...I'm going to suggest a marvelously computationally inefficient approach to this. Marvelously inefficient, but possibly still faster than a variable-length lookbehind would be...and also easy (for you):
The \K causes everything before it to be dropped....so only the character after it is actually replaced.
perl -pe 'while (s/\[[^]]*#[^]]*\K[^]a-zA-Z0-9]//){}' file
Basically we just have an empty loop that executes until the search and replace replaces nothing.
Slightly improved version:
perl -pe 'while (s/\[[^]]*?#[^]]*?\K[^]a-zA-Z0-9](?=[^]]*?])//){}' file
The (?=) verifies that its content exists after the match without being part of the match. This is a variable-length lookahead (what we're missing going the other direction). I also made the *s lazy with the ? so we get the shortest match possible.
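As a rough illustration of how the loop behaves, here is the same substitution in a small script instead of a one-liner. The pass counter and the sample line are assumptions added for the demonstration, and the final ] in the lookahead is escaped purely for readability:

```perl
use strict;
use warnings;

# Sample line built around the string from the question.
my $line = q(ah [abc#def"ghi"jkl'123] oh);

# Each successful s/// removes exactly one non-alphanumeric character that
# sits between "[...#" and the closing "]". Loop until nothing is replaced.
my $passes = 0;
$passes++ while $line =~ s/\[[^]]*?#[^]]*?\K[^]a-zA-Z0-9](?=[^]]*?\])//;

print "$passes passes: $line\n";   # 3 passes: ah [abc#defghijkl123] oh
```

Every pass rescans from the start of the string, which is where the inefficiency mentioned above comes from.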
Here is another approach. Capture precisely the substring that needs work, and in the replacement part run a regex on it that cleans it of non-alphanumeric characters
use warnings;
use strict;
use feature 'say';
my $var = q(ah [abc#def"ghi"jkl'123] oh); #'
say $var;
$var =~ s{ \[ [^\[\]]*? \#\K ([^\]]+) }{
(my $v = $1) =~ s{[^0-9a-zA-Z]}{}g;
$v
}ex;
say $var;
The lone $v as the last statement is needed so that the block returns the modified string and not the number of substitutions, which is what the s/// operator itself returns. This can be improved by using the /r modifier, which returns the changed string and leaves the original untouched (so it doesn't attempt to modify $1, which is read-only):
$var =~ s{ \[ [^\[\]]*? \#\K ([^\]]+) }{
$1 =~ s/[^0-9a-zA-Z]//gr;
}ex;
The \K is there so that everything matched before it is "dropped" from the match: that text stays in the string untouched, so we don't need to capture it in order to put it back. The /e modifier makes the replacement part be evaluated as code.
The code in the question doesn't work because everything matched is consumed, and (under /g) the search continues from the position after the last match, attempting to find that whole pattern again further down the string. That fails and only that first occurrence is replaced.
The problem with matches that we want to leave in the string can often be remedied by \K (used in all current answers), which excludes everything matched before it from the text that gets replaced.
#!/usr/bin/perl
my @allFiles=`ls *.gz`;
for my $file (@allFiles) {
if ($file =~ '0000*.gz') {
print $file;
}
}
I am trying this above code to print all filenames that have a prefix of 0000. Like 00001.gz, 00002.gz etc
A close equivalent to the shell's wildcard * in regex is .*. The * quantifier means that the pattern before it matches "zero or more" times and . means "any character," see Regular Expressions in perlre. But as it seems that you want something after the zeros then use .+ instead, to match any character one-or-more times. To match a literal period escape it, \.
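To make the difference concrete, here is a small sketch that applies both readings to a few made-up file names (the names are hypothetical and chosen only to show the effect). As a regex, 0000*.gz means "000", then zero or more "0", then any one character, then "gz":

```perl
use strict;
use warnings;

# Hypothetical file names, not taken from the question.
my @names = ('00001.gz', '0000.gz', '000xgz', '10000a.gz');

# The shell wildcard text used as a regex.
my @wildcard_as_regex = grep { /0000*.gz/ } @names;

# The anchored regex with .+ and an escaped period.
my @intended = grep { /^0000.+\.gz/ } @names;

print "wildcard text as regex: @wildcard_as_regex\n";   # 0000.gz 000xgz
print "anchored regex:         @intended\n";            # 00001.gz
```

Note that the wildcard text used as a regex does not even match 00001.gz, which is why the loop in the question prints nothing for such files.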
Next, there is no reason to use an external command for what you do. In Perl
my @allFiles = glob "*.gz";
The documentation is linked at the end.
Finally, please always enable warnings and strict.
Altogether
use warnings;
use strict;
my @allFiles = glob "*.gz";
foreach my $file (@allFiles) {
if ($file =~ /^0000.+\.gz/) {
print "$file\n";
}
}
The regex pattern matches: 0000 at the beginning of the string (^), followed by any character (.) matched one-or-more times (+), then a literal period (\.) and literal gz. Note that .+ means that the . matches one-or-more times, it need not be the same character.
Adjust to what best suits your actual need and the directory content. For example, if you want files with only digits following zeros, you need /^0000\d+\.gz/. To catch a file 00001a.gz as well you need to allow for non-digits after a string of digits, for instance by /^0000\d+.*\.gz/.
See perlretut for a regex tutorial and glob, or better File::Glob for things with spaces in names.
There are other ways to do this. For example, you need to filter a list of file names, so
my @files = grep { /^0000.+\.gz/ } glob "*.gz";
The glob is in the list context imposed by grep so it returns the list of all files that it matches. (In the scalar context it iterates through them.) The code in grep's block runs for each and if it evaluates to true that element passes. It is the same regex, applied by default to $_ variable that is the implicit iterator (and aliased to the currently processed element). So grep returns the desired list.
For your specific example even just this will do
my @files = glob "0000[0-9].gz";
print "$_\n" for @files;
This fetches all files with a single digit following 0000, then .gz.
See the list of accepted meta characters in the linked File::Glob docs.
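The glob-only and grep-plus-regex variants can be compared in a scratch directory. The following is a sketch under assumptions: the file names are invented, and File::Temp and Cwd (both core modules) are used only to set up and leave a temporary directory.

```perl
use strict;
use warnings;
use Cwd qw(getcwd);
use File::Temp qw(tempdir);

# Work in a throwaway directory so the real one is not touched.
my $start = getcwd();
my $dir   = tempdir(CLEANUP => 1);
chdir $dir or die "cannot chdir to $dir: $!";

# Hypothetical files for the demonstration.
for my $name (qw(00001.gz 00002.gz 000012.gz 0000a.gz 12345.gz)) {
    open my $fh, '>', $name or die "cannot create $name: $!";
    close $fh;
}

# Glob pattern: exactly one digit after 0000.
my @by_glob = glob "0000[0-9].gz";

# Regex filter: one or more digits after 0000.
my @by_grep = grep { /^0000\d+\.gz/ } glob "*.gz";

chdir $start or die "cannot chdir back to $start: $!";

print "glob only:  @{[ sort @by_glob ]}\n";
print "grep+regex: @{[ sort @by_grep ]}\n";
```

With these files, the glob pattern picks up 00001.gz and 00002.gz, while the regex with \d+ also accepts 000012.gz.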
Below I have copied the code I had written. I don't know what the line flag does.
set value "hi this is venkat345
hi this is venkat435
hi this is venkat567"
regexp -all -line -- {(venkat.+)$} $value a b
puts "Full Match: $a"
puts "Sub Match1: $b"
The above code gives the following output
Full Match: venkat567
Sub Match1: venkat567
Can anyone explain when and where I should use the -line flag in Tcl regular expressions?
The man page has defined it well I believe:
-line
Enables newline-sensitive matching. By default, newline is a completely ordinary character with no special meaning. With this flag, [^ bracket expressions and . never match newline, ^ matches an empty string after any newline in addition to its normal function, and $ matches an empty string before any newline in addition to its normal function. This flag is equivalent to specifying both -linestop and -lineanchor, or the (?n) embedded option (see the re_syntax manual page).
If you want to understand it another way, . and [^ ... ] usually match newlines, for example:
regexp -- {^....$} "ab\nc"
returns 1 (meaning the regexp matches the string, counting \n as 1 character), but using the -line switch will prevent . from matching \n.
Similarly:
regexp -- {^[^abc]+$} "de\nf"
will also return 1 because the negated class [^abc] is able to match a character that is not abc, which includes \n.
The second function of the -line switch makes ^ match at every beginning of line instead of matching only at the start of the whole string, and makes $ match at every end of line instead of matching only at the end of the whole string.
% set text {abc
abc}
abc
abc
% regexp -- {^abc$} $text
0
% regexp -line -- {^abc$} $text
1
As for the when and where, it will depend on what you are trying to do. Based on your sample code, it would seem to me that you need to get all the usernames beginning with venkat that can appear at the end of any line. Since you want to match many, you will need to use the -all and -inline switches to get the matched strings, and I would recommend to change the regexp a bit:
set value "hi this is venkat345
hi this is venkat435
hi this is venkat567"
# I removed the capture group and changed . to \S to match non-space characters
set results [regexp -all -inline -line -- {venkat\S+$} $value]
puts $results
# venkat345 venkat435 venkat567
-line just makes sure your . will never match a newline.
According to the Tcl regexp documentation:
-line
Enables newline-sensitive matching. By default, newline is a
completely ordinary character with no special meaning. With this flag,
‘[^’ bracket expressions and ‘.’ never match newline, ‘^’ matches an
empty string after any newline in addition to its normal function, and
‘$’ matches an empty string before any newline in addition to its
normal function. This flag is equivalent to specifying both -linestop
and -lineanchor, or the (?n) embedded option (see METASYNTAX, below).
Here is the output without -line option:
Full Match: venkat345
hi this is venkat435
hi this is venkat567
Sub Match1: venkat345
hi this is venkat435
hi this is venkat567
Without -line, . also matches newlines, so .+ runs from the first venkat through all the remaining lines to the end of the value string.
I've got a regular expression with capture groups that matches what I want in a broader context. I then take capture group $1 and use it for my needs. That's easy.
But how to use capture groups with s/// when I just want to replace the content of $1, not the entire regex, with my replacement?
For instance, if I do:
$str =~ s/prefix (something) suffix/42/
prefix and suffix are removed. Instead, I would like something to be replaced by 42, while keeping prefix and suffix intact.
As I understand, you can use look-ahead or look-behind that don't consume characters. Or save data in groups and only remove what you are looking for. Examples:
With look-ahead:
s/your_text(?=ahead_text)//;
Grouping data:
s/(your_text)(ahead_text)/$2/;
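Applied to the example from the question, both ideas might look like the sketch below (variable names are illustrative). One detail worth noting in the grouping variant: the replacement uses ${1} with braces, because $142 would be read as capture group number 142.

```perl
use strict;
use warnings;

my $str = 'prefix something suffix';

# Grouping: capture the parts to keep and put them back around the new text.
(my $grouped = $str) =~ s/(prefix )something( suffix)/${1}42$2/;

# Lookaround: assert the surrounding text without making it part of the match.
(my $lookaround = $str) =~ s/(?<=prefix )something(?= suffix)/42/;

print "$grouped\n";      # prefix 42 suffix
print "$lookaround\n";   # prefix 42 suffix
```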
If you only need to replace one capture then using @LAST_MATCH_START and @LAST_MATCH_END (with use English; see perldoc perlvar) together with substr might be a viable choice:
use English qw(-no_match_vars);
$your_string =~ m/aaa (bbb) ccc/;
substr $your_string, $LAST_MATCH_START[1], $LAST_MATCH_END[1] - $LAST_MATCH_START[1], "new content";
# replaces "bbb" with "new content"
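A self-contained version of the same idea, as a sketch: the sample string is made up, and the offsets are only used if the match succeeded, since they would otherwise be left over from an earlier match.

```perl
use strict;
use warnings;
use English qw(-no_match_vars);

my $your_string = 'aaa bbb ccc';

if ($your_string =~ m/aaa (bbb) ccc/) {
    # Offsets of capture group 1 within the string.
    my $start  = $LAST_MATCH_START[1];
    my $length = $LAST_MATCH_END[1] - $start;

    # Four-argument substr replaces that span in place.
    substr($your_string, $start, $length, 'new content');
}

print "$your_string\n";   # aaa new content ccc
```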
This is an old question but I found the below easier for replacing lines that start with >something to >something_else. Good for changing the headers for fasta sequences
while ($filelines=~ />(.*)\s/g){
unless ($1 =~ /else/i){
$filelines =~ s/($1)/$1\_else/;
}
}
I use something like this:
s/(?<=prefix)(group)(?=suffix)/$1 =~ s|text|rep|gr/e;
Example:
In the following text I want to normalize the whitespace, but only after ::=:
some text ::=   a  b    c d     e ;
Which can be achieved with:
s/(?<=::=)(.*)/$1 =~ s|\s+| |gr/e
Results with:
some text ::= a b c d e ;
Explanation:
(?<=::=): Look-behind assertion to match ::=
(.*): Everything after ::=
$1 =~ s|\s+| |gr: With the captured group normalize whitespace. Note the r modifier which makes sure not to attempt to modify $1 which is read-only. Use a different sub delimiter (|) to not terminate the replacement expression.
/e: Treat the replacement text as a perl expression.
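The pieces explained above can be run together as follows. This is a sketch with an invented input line; the irregular spacing before ::= is there to show that only the text after ::= is normalized.

```perl
use strict;
use warnings;
use v5.14;   # the /r flag needs Perl 5.14 or later

# Hypothetical input with uneven whitespace on both sides of ::=.
my $line = 'some  text ::=   a  b    c d     e ;';

# Capture everything after ::= and collapse each whitespace run to one space.
(my $normalized = $line) =~ s/(?<=::=)(.*)/$1 =~ s|\s+| |gr/e;

say $normalized;   # some  text ::= a b c d e ;
```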
Use lookaround assertions. Quoting the documentation:
Lookaround assertions are zero-width patterns which match a specific pattern without including it in $&. Positive assertions match when their subpattern matches, negative assertions match when their subpattern fails. Lookbehind matches text up to the current match position, lookahead matches text following the current match position.
If the beginning of the string has a fixed length, you can thus do:
s/(?<=prefix)(your capture)(?=suffix)/$1/
However, ?<= does not work for variable length patterns (starting from Perl 5.30, it accepts variable length patterns whose length is smaller than 255 characters, which enables the use of |, but still prevents the use of *). The work-around is to use \K instead of (?<=):
s/.*prefix\K(your capture)(?=suffix)/$1/
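For instance, with a prefix whose length varies, the \K form could be used like this. It is a sketch only: the pattern pre\w*\s+ and the sample string are assumptions standing in for a real variable-length prefix.

```perl
use strict;
use warnings;

# The prefix ends in a variable amount of whitespace.
my $str = 'prefix   something suffix';

# Everything matched before \K is kept; only "something" is replaced.
(my $replaced = $str) =~ s/pre\w*\s+\Ksomething(?=\s+suffix)/42/;

print "$replaced\n";   # prefix   42 suffix
```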