TCL regexp pattern search - regex

I am trying to match a pattern like the one below:
abc(xxxx):efg(xxxx):xyz(xxxx), where each x is a digit [0-9]
I used:
set string "my string is abc(xxxx):efg(xxxx):xyz(xxxx)"
regexp abc(....):efg(....):xyz(....) $string result_str
it returns 0. Can anyone help?

The problem you've got is that ( and ) have special meaning to regular expressions in Tcl (and many other RE engines besides) in that they denote a capturing sub-RE. To make the characters “normal”, they have to be escaped with a backslash, and that means that it's best to put the regular expression in braces (because backslashes are general Tcl metacharacters).
Thus:
% set string "my string is abc(xxxx):efg(xxxx):xyz(xxxx)"
% regexp {abc\(....\):efg\(....\):xyz\(....\)} $string
1
If you want to also capture the contents of those parentheses, you need a slightly more complex RE:
regexp {abc\((....)\):efg\((....)\):xyz\((....)\)} $string \
all abc_bit efg_bit xyz_bit
Note that those .... sequences always match exactly four characters, but it's better to be more specific. To match any number of digits in each case:
regexp {abc\((\d+)\):efg\((\d+)\):xyz\((\d+)\)} $string -> abc efg xyz
When using regexp to extract bits of a string, it's pretty common to use -> as a (rather strange) variable name for the whole string match; it looks mnemonically like it's saying “send the pieces extracted to these variables”.
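For comparison, the same escaping rule can be sketched in Python, whose `re` module also treats unescaped parentheses as a capturing group. This is only an illustration and not part of the Tcl answer; the sample input with real digits is made up.

```python
import re

# Hypothetical sample input; the digits are invented for illustration.
text = "my string is abc(1234):efg(5678):xyz(9012)"

# Unescaped parentheses form groups, so the literal "(" in the text is never matched.
unescaped = re.search(r"abc(....):efg(....):xyz(....)", text)

# Escaped parentheses match literally; the inner (\d+) groups capture the digits.
escaped = re.search(r"abc\((\d+)\):efg\((\d+)\):xyz\((\d+)\)", text)

print(unescaped)         # None
print(escaped.groups())  # ('1234', '5678', '9012')
```

A raw string (`r"..."`) plays the same role here as the braces do in Tcl: it keeps the backslashes from being interpreted before the regex engine sees them.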

I have not worked with Tcl, but it seems you need to escape the ( and ). Also, if you are sure that the x's will be digits, use \d{4} instead of the four dots (....). Based on this, the updated regex you could try is
abc\(\d{4}\):efg\(\d{4}\):xyz\(\d{4}\)

Related

Regex detect if a matched comma(,) does not lie in a regex

I am trying to figure out a way to determine if my matched comma(,) does not lie inside a regex. Basically, i do not want to match my character if it lies in a regex.
The regex i have come up with is ,(?<!.+\/)(?!.+\/) but its not quite working.
Any ideas?
I want to skip /some,regex/ but match any other commas.
Edit:
Live example: http://rubular.com/r/WjrwSnmzyP
Here is the regex that will work for you:
,(?!\s)(?=(?:(?:[^/]*\/){2})*[^/]*$)
Live Demo: http://rubular.com/r/37buDdg1tW
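Explanation: The lookahead (?=...) requires the comma to be followed by an EVEN number of forward slashes (/) up to the end of the string, and (?!\s) additionally requires that the comma is not followed by whitespace. Hence a comma (,) between two slash (/) characters will NOT be matched, because it is followed by an odd number of slashes, while commas outside will be matched, since those are followed by an even number of / characters.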
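A minimal Python sketch of that even-number-of-slashes check, using the pattern from this answer. The input string is hypothetical, and the approach assumes slashes appear only as regex delimiters and always in pairs.

```python
import re

# Pattern from the answer above: a comma not followed by whitespace and
# followed by an even number of "/" characters up to the end of the string.
OUTSIDE_COMMA = re.compile(r",(?!\s)(?=(?:(?:[^/]*/){2})*[^/]*$)")

text = "a,b /x,y/ c,d"  # hypothetical input
positions = [m.start() for m in OUTSIDE_COMMA.finditer(text)]
print(positions)  # [1, 11]
```

The comma at index 6 sits between the two slashes, so only one slash follows it and the lookahead fails.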
A curious thing about regular expressions is that if you want to use them to ignore "something" that is within "something else", you need to match that "something else", prefer matches of it, and then either silently discard or reproduce those matches.
For example, in order to remove all commas from a string unless they are in a regular expression literal—
In Perl:
my $s = "/foo,bar/,baz";
$s =~ s{(/(?:[^/\\]|\\.)+/)|,}{\1}g;
In ECMAScript:
var s = "/foo,bar/,baz";
s = s.replace(/(\/([^\/\\]|\\.)+\/)|,/g, "$1");
or
s = s.replace(new RegExp("(/([^/\\\\]|\\\\.)+/)|,", "g"), "$1");
Note that I am capturing the match for the regular expression literal in the string value, and reproducing it (\1 or $1) if it matched. (If the other part of the alternation – the standalone comma – matched, the empty string is captured, so this simple approach suffices here.)
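The same match-and-reproduce idea can be sketched in Python. The function name is made up for illustration; a replacement callback is used so that an unset group 1 is handled explicitly.

```python
import re

# Group 1 matches a whole /regex literal/ (allowing backslash escapes);
# the second alternative matches a bare comma.
LITERAL_OR_COMMA = re.compile(r"(/(?:[^/\\]|\\.)+/)|,")

def strip_commas_outside_literals(text: str) -> str:
    # Put a matched literal back unchanged; a bare comma leaves group 1 unset.
    return LITERAL_OR_COMMA.sub(lambda m: m.group(1) or "", text)

print(strip_commas_outside_literals("/foo,bar/,baz"))  # /foo,bar/baz
```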
For further reading I recommend “Mastering Regular Expressions” by Jeffrey E. F. Friedl. Two rather enlightening example chapters, each from a different edition, are available for free online.

Regular expression for number search

I need a regular expression that will find a number(s) that is not inside parenthesis.
Example abcd 1 (35) (df)
It would only see the 1.
Is this very complex? I've tried and had no luck.
Thanks for any help
An easy solution is to first remove the unwanted values:
my $string = "abcd 12 (35) (df) 2311,22";
$string =~ s/\(\d+\)//g; # remove numbers within parens
my @numbers = $string =~ /\d+/g; # extract the numbers
This is quite hard but something like this will probably do:
^(?:\()(\d+)(?:[^)])|(?:[^(0-9]|^)(\d+)(?:[^)0-9]|^)|(?:[^(])(\d+)(?:\))$
The problem is to match (123, 123) and also to not match the string 123 as the number 2 between the non-parentheses characters 1 and 3. Also there are probably some edge cases for start of and end of string.
My suggestion is not to rely on a single regex for this. Instead, use a regex that matches numbers and then use the match positions to check that the surrounding characters are not parentheses.
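One way to read that suggestion, sketched in Python: find every run of digits, then look at the characters immediately around each match. The function name and the exact rejection rule (skip a number that touches a parenthesis on either side) are assumptions, not something the answer specifies.

```python
import re

def numbers_outside_parens(text: str) -> list:
    found = []
    for m in re.finditer(r"\d+", text):
        before = text[m.start() - 1] if m.start() > 0 else ""
        after = text[m.end()] if m.end() < len(text) else ""
        # Skip a number that touches "(" on the left or ")" on the right.
        if before == "(" or after == ")":
            continue
        found.append(m.group())
    return found

print(numbers_outside_parens("abcd 1 (35) (df)"))  # ['1']
```

This only inspects direct neighbours, so a number in the middle of a longer parenthesised list such as (1 2 3) would still be reported; handling that would need real nesting-depth tracking.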
The regular expression would be:
^[a-z]+ ([0-9]+) \([0-9]+\) \([a-z]+\)$
The result is the first (and only) matching group of the regex.
You may want to remove the ^ and $ if the expression can appear anywhere in the text rather than being the whole content of a single line. You can also use [a-zA-Z] or [[:alpha:]]. This depends on the regular expression engine you use and, of course, the content you want to match.
Example perl code:
if (m/^[a-z]+ ([0-9]+) \([0-9]+\) \([a-z]+\)$/) {
print("$1\n");
}
Please note that your question does not contain enough information to make a good answer possible (you did not say anything about the general format of your expression, for example whether you want to match integers or floating-point numbers).
How about
/(?:^|[^\d(])(\d+)(?:[^\d)]|$)/
? This matches a string of digits (\d+) that is
preceded by the beginning of the string, or by a character that is neither a digit nor an open parenthesis ((?:^|[^\d(])), and
followed by the end of the string, or by a character that is neither a digit nor a close parenthesis ((?:[^\d)]|$)).

Replace repeating characters with one with a regex

I need a regex to collapse repeated occurrences of particular characters: if one of these characters is repeated, replace the run with a single one.
/[\s.'-,{2,0}]
These are the characters that, when repeated, I need to replace with a single instance of the same character.
Is this the regex you're looking for?
/([\s.',-])\1+/
Okay, now that will match it. If you're using Perl, you can replace it using the following expression:
s/([\s.',-])\1+/$1/g
Edit: If you're using :ahem: PHP, then you would use this syntax:
$out = preg_replace('/([\s.\',-])\1+/', '$1', $in);
The () group matches the character, and the \1 means that the same thing it just matched in the parentheses occurs at least once more. In the replacement, the $1 refers to the match in the first set of parentheses. The hyphen is placed last in the character class so that it is taken literally; written as '-, it would define a range from ' to , that does not include the hyphen itself.
Note: this is Perl-Compatible Regular Expression (PCRE) syntax.
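The backreference technique carries over to Python's `re` module more or less unchanged. A small sketch follows; the function name and test string are invented.

```python
import re

def collapse_repeats(text: str) -> str:
    # ([\s.',-]) captures one listed character; \1+ matches further copies of it.
    # The hyphen comes last in the class so it is a literal, not a range.
    return re.sub(r"([\s.',-])\1+", r"\1", text)

print(collapse_repeats("a  b....,,--''"))  # a b.,-'
```

Because `\1` must repeat the exact character that was captured, a mixed run such as `.,` is left alone; only runs of the same character collapse.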
From the perlretut man page:
Matching repetitions
The examples in the previous section display an annoying weakness. We were only matching 3-letter words, or chunks of words of 4 letters or less. We'd like to be able to match words or, more generally, strings of any length, without writing out tedious alternatives like \w\w\w\w|\w\w\w|\w\w|\w.
This is exactly the problem the quantifier metacharacters ?, *, +, and {} were created for. They allow us to delimit the number of repeats for a portion of a regexp we consider to be a match. Quantifiers are put immediately after the character, character class, or grouping that we want to specify. They have the following meanings:
a? means: match 'a' 1 or 0 times
a* means: match 'a' 0 or more times, i.e., any number of times
a+ means: match 'a' 1 or more times, i.e., at least once
a{n,m} means: match at least "n" times, but not more than "m" times.
a{n,} means: match at least "n" or more times
a{n} means: match exactly "n" times
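As a quick check of the quantifier meanings listed above, here is a hedged Python sketch. Python's `re` uses the same quantifier syntax; the `checks` table is made up for illustration. Each pattern is paired with strings it should and should not match in full.

```python
import re

# Each entry pairs a pattern with strings that should / should not match in full.
checks = {
    r"a?": (["", "a"], ["aa"]),
    r"a*": (["", "a", "aaaa"], ["b"]),
    r"a+": (["a", "aaa"], [""]),
    r"a{2,3}": (["aa", "aaa"], ["a", "aaaa"]),
    r"a{2,}": (["aa", "aaaaa"], ["a"]),
    r"a{2}": (["aa"], ["a", "aaa"]),
}

def quantifiers_behave(table: dict) -> bool:
    for pattern, (accepted, rejected) in table.items():
        if not all(re.fullmatch(pattern, s) for s in accepted):
            return False
        if any(re.fullmatch(pattern, s) for s in rejected):
            return False
    return True

print(quantifiers_behave(checks))  # True
```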
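As others said, it depends on your regex engine, but here is a small example of how you could do this:
s/([ _,.-])\1*/\1/g
With sed:
$ echo "foo , bar" | sed 's/\([ _,.-]\)\1*/\1/g'
foo , bar
$ echo "foo,. bar" | sed 's/\([ _,.-]\)\1*/\1/g'
foo,. bar
(The hyphen is last in the bracket expression so that it is literal; written as _-, it would be read as a reversed, invalid range.)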
Using JavaScript as mentioned in a comment, and assuming (it's not too clear from your question) the characters you want to replace are space characters, ., ', -, and ,:
var str = 'a b....,,';
str = str.replace(/(\s){2}|(\.){2}|('){2}|(-){2}|(,){2}/g, '$1$2$3$4$5');
// Now str === 'a b..,'
If I understand correctly, you want to do the following: given a set of characters, replace any multiple occurrence of each of them with a single character. Here's how I would do it in perl:
perl -pi.bak -e "s/\.{2,}/\./g; s/\-{2,}/\-/g; s/'{2,}/'/g" text.txt
If, for example, text.txt originally contains:
Here is . and here are 2 .. that should become a single one. Here's
also a double -- that should become a single one. Finally here we have
three ''' which should be substituted with one '.
it is modified as follows:
Here is . and here are 2 . that should become a single one. Here's
also a double - that should become a single one. Finally here we have
three ' which should be substituted with one '.
I simply use the same replacement regex for each character in the set: for example
s/\.{2,}/\./g;
replaces 2 or more occurrences of a dot character with a single dot. I concatenate several of these expressions, one for each character of your original set.
There may be more compact ways of doing this, but, I think this is simple and it works :)
I hope it helps.

Regular expression to replace spaces with dashes within a sub string.

I've been struggling to find a way to replace spaces with dashes in a string but only spaces that are within a particular part of the string.
Source:
ABC *This is a sub string* DEF
My attempt at a regular expression:
/\s/g
If I use the regular expression to match spaces and replace I get the following result:
ABC-*This-is-a-sub-string*-DEF
But I only want to replace spaces within the text surrounded by the two asterisks.
Here is what I'm trying to achieve:
ABC *This-is-a-sub-string* DEF
I'm not sure what type of regular expressions I'm using, as I'm using the find and replace in TextMate with the Regular Expressions option enabled.
It's important to note that the strings that I will be running this regular expression search and replace on will have different text but it's just the spaces within the asterisks that I want to match.
Any help will be appreciated.
To identify spaces that are surrounded by asterisks, the key observation is that, if asterisks appear only in pairs, the spaces you are looking for are always followed by an odd number of asterisks.
The regex
\ (?=[^*]*\*([^*]*\*[^*]*\*)*[^*]*$)
will match the ones that should be replaced. TextMate would have to support look-ahead assertions for this to work.
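A short Python sketch of this odd-number-of-asterisks lookahead, applied to the sample string from the question. It assumes, as the answer does, that asterisks only ever appear in pairs; the inner group is made non-capturing, which does not change what is matched.

```python
import re

# A space followed by an odd number of "*" before the end of the string.
INSIDE_SPACE = re.compile(r" (?=[^*]*\*(?:[^*]*\*[^*]*\*)*[^*]*$)")

text = "ABC *This is a sub string* DEF"
result = INSIDE_SPACE.sub("-", text)
print(result)  # ABC *This-is-a-sub-string* DEF
```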
s/(?<!\*)\s(?!\*)(?!$)/-/g
If TextMate supports Perl-style regex commands (I have no experience with it at all, sorry), this is a one-liner that should work.
Try this one:
/(?<=\*.*)\s(?=.*\*)/g
Note that it relies on a variable-length lookbehind, which many regex engines do not support; in JavaScript, lookbehind is only available from ES2018 onward, so it will not work in older JS engines.
Try this: \s(\*[^*]*\*)\s. It will match *This is a sub string* in group 1. Then replace to -$1-.
Use this regexp to get spaces from within asterisks
(.)(*(.(\ ).)*)(.)
Take 4th element of the array provided by regex {4} and replace it with dashes.
I find this site very good for creating regular expressions.
It depends on your programming language but in many of them you can use lambda functions with your regular expression replacement statements and thereby perform further replacement on substrings.
Here's an example in Python:
import re
string = "ABC *This is a sub string* DEF"
# Rewrite only the text between a pair of asterisks, swapping spaces for dashes.
new_string = re.sub(r"\*(.*?)\*", lambda m: '*' + m.group(1).replace(" ", "-") + '*', string)
That should give you ABC *This-is-a-sub-string* DEF.

Term with no alphanumeric characters before or after

I am trying to write a regular expression that matches all occurrences of a specified word, but must not have any alphanumeric characters prefixed or suffixed.
For example, searching for the term "cat" should not return terms like "catalyst".
Here is what I have so far:
"?<!([a-Z0-9])*?TERMPLACEHOLDER?!([a-Z0-9])*?"
This should return the word "TERMPLACEHOLDER" on its own.
Any ideas?
Thanks.
How about:
\bTERMPLACEHOLDER\b
You could use word boundaries: \bTERMPLACEHOLDER\b
A quick test in Javascript:
var a = "this cat is not a catalyst";
console.log(a.match(/\bcat\b/));
Returns just "cat".
You may be looking for word boundaries. From there, you can use wildcards like \w*? on either side of the word if you want to make it match partial words.
Search for any word containing "MYWORD"
\b\w*?MYWORD\w*?\b
Search for any word ending in "ING"
\b\w*?ING\b
Search for any word starting with "TH"
\bTH\w*?\b
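These word-boundary patterns can be tried out in Python, where `\b` and `\w` behave the same way for ASCII text like this. The sample sentences are made up, and lowercase search terms are used to match them.

```python
import re

text = "this cat is not a catalyst"

whole_word = re.findall(r"\bcat\b", text)          # only the standalone word
containing = re.findall(r"\b\w*?cat\w*?\b", text)  # any word containing "cat"
ending = re.findall(r"\b\w*?ing\b", "sing a song while walking")
starting = re.findall(r"\bth\w*?\b", "the thin moth")

print(whole_word)  # ['cat']
print(containing)  # ['cat', 'catalyst']
print(ending)      # ['sing', 'walking']
print(starting)    # ['the', 'thin']
```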
Be careful when you say "word" to refer to a substring you want to find. On the regular expression side, "word" has a different meaning: it is a character class.
Define the 'literal' string you would like to find (not a word). This can be anything: sentences, punctuation, newline combinations. Example: "find this \exact phrase <> !abc".
Since this is going to be part of a regular expression (not the whole regex), you can escape the special regular expression metacharacters that might be embedded.
$string = 'foo.bar';                         # the string you want to find
$string =~ s/[.*+?|()\[\]{}^\$\\]/\\$&/g;    # escape metachars
Now the 'literal' string is ready to be inserted into the regular expression. Note that if you want to individually allow classes or want metachars in the string, you would have to escape this yourself.
$sample =~ /(?<![^\W_])$string(?![^\W_])/ig; # find the string globally
(expanded)
/
(?<![^\W_]) # assertion: No alphanumeric character behind us
$string # the 'string' we want to find
(?![^\W_]) # assertion: No alphanumeric character in front of us
/ig
Perl sample -
use strict;
use warnings;
my $string = 'foo.bar';
my $sample = 'foo.bar and !fooAbar and afoo.bar.foo.bar';
# Quote string metacharacters
$string =~ s/[.*+?|()\[\]{}^\$\\]/\\$&/g;
# Globally find the string in the sample target
while ( $sample =~ /(?<![^\W_])$string(?![^\W_])/ig )
{
print substr($sample, 0, $-[0]), "-->'",
substr($sample, $-[0], $+[0] - $-[0]), "'\n";
}
Output -
-->'foo.bar'
foo.bar and !fooAbar and afoo.bar.-->'foo.bar'
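For readers who do not use Perl, roughly the same approach can be sketched in Python. `re.escape` stands in for the manual metacharacter escaping, and the function name is hypothetical. The lookarounds are the ones from this answer.

```python
import re

def find_standalone(literal: str, sample: str) -> list:
    # re.escape neutralises metacharacters such as "." in the literal.
    # [^\W_] means "a letter or digit", so the lookarounds reject
    # alphanumeric neighbours on either side of the match.
    pattern = re.compile(
        r"(?<![^\W_])" + re.escape(literal) + r"(?![^\W_])",
        re.IGNORECASE,
    )
    return [m.start() for m in pattern.finditer(sample)]

sample = "foo.bar and !fooAbar and afoo.bar.foo.bar"
print(find_standalone("foo.bar", sample))  # [0, 34]
```

The two offsets correspond to the two matches in the Perl output above: the one at the start of the sample and the final one preceded by a dot.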