Stripping out angle brackets with a regular expression - regex

Having the following string:
<str1> <str2> <str3>
I am trying to write a regex, in C, that yields the following 3 strings:
str1
str2
str3
I am trying to use the following regex but it doesn't seem to be yielding what I am aiming for:
<[a-zA-Z0-9]*.>
According to http://www.myregextester.com/index.php, that regex is going to yield
<str1>
<str2>
<str3>
How can I take out the <>'s?
Also, I'd like to match ONLY strings with this format, i.e., with exactly 3 <>'s, no more, no less. How should I approach that?

This can be easily done with a perl-compatible regular expression like this:
<([^>]+)>
You just have to make sure to tell your library to search for all matches, instead of trying to match the regular expression against the whole string. You will end up with str1, str2 and str3 as the first group match then.
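To illustrate the "search for all matches" point, here is a minimal sketch in Python (swapped in for illustration; the question targets C, where the loop over matches would be written by hand). The variable names are just placeholders.

```python
import re

# Input string from the question.
text = "<str1> <str2> <str3>"

# findall scans the whole string and, because the pattern has exactly one
# capture group, returns that group's contents for every match.
names = re.findall(r"<([^>]+)>", text)
print(names)  # ['str1', 'str2', 'str3']
```

Matching the pattern once against the whole string would only ever give the first group, which is why the "find all" call matters.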

What if you change it to <([a-zA-Z0-9]*.)>?

<([a-zA-Z0-9]*?)> - Then you access the first capture group of each match (index 1; index 0 holds the whole match). There is a PHP example of how to access the capture group at http://www.myregextester.com/index.php
You can do this with positive lookaround ( http://www.regular-expressions.info/lookaround.html ) like this (?<=<)[a-zA-Z0-9]*(?=>) but not all flavours support it ( http://www.regular-expressions.info/refflavors.html )
Disclaimer: Don't copy Regex from the web if you do not understand how and why it works. This includes answers on Stack Overflow!
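As a rough sketch of the lookaround variant, assuming a flavour that supports lookbehind (Python's re module does, for fixed-width lookbehind like this one):

```python
import re

text = "<str1> <str2> <str3>"

# (?<=<) and (?=>) only check that the brackets are there; they are not part
# of the match, so each result is the bare name with no capture group needed.
names = re.findall(r"(?<=<)[a-zA-Z0-9]*(?=>)", text)
print(names)  # ['str1', 'str2', 'str3']
```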

This regex would do the job:
/^<([a-zA-Z0-9]*?)>\s<([a-zA-Z0-9]*?)>\s<([a-zA-Z0-9]*?)>$/
example in Perl:
#!/usr/bin/perl
use 5.10.1;
use strict;
use warnings;
my $re = qr/^<([a-zA-Z0-9]*?)>\s<([a-zA-Z0-9]*?)>\s<([a-zA-Z0-9]*?)>$/;
my $str = q!<str1> <str2> <str3>!;
$str =~ s/$re/$1\n$2\n$3/;
say $str;
Output:
str1
str2
str3
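The same anchored pattern also covers the "exactly 3, no more, no less" requirement. A minimal Python sketch, where extract_three is a hypothetical helper name and fullmatch stands in for the ^...$ anchors:

```python
import re

# Same three-group pattern as above; fullmatch requires the whole line to fit.
THREE_GROUPS = re.compile(
    r"<([a-zA-Z0-9]*?)>\s<([a-zA-Z0-9]*?)>\s<([a-zA-Z0-9]*?)>"
)

def extract_three(line):
    """Return the three names, or None if the line has any other shape."""
    m = THREE_GROUPS.fullmatch(line)
    return list(m.groups()) if m else None

print(extract_three("<str1> <str2> <str3>"))  # ['str1', 'str2', 'str3']
print(extract_three("<str1> <str2>"))         # None
```

Lines with two or four bracketed names are rejected because the character class cannot cross a bracket or a space.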

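You can simply use a regex like the one below to run a check on text in angle brackets.
var text = "<span></span>";
var check = /[^>]\d+[a-z]*[^<]/;
// test() returns true when the pattern matches anywhere in text
var valid = check.test(text);
// the branches below belong inside the body of a validation function
if (!valid) {
return false;
}
else
{
return true;
}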

Related

Matching consecutive characters from a string using regex

I'm not sure how to title this question, so moving along...
I'd like to be able to match a portion of a string that is a subset of a larger string. For example:
MatchPartOfThisString -> Reference string
fThisDiff -> string I'd like to be able to say matches 5 consecutive characters in
I suppose I could loop through the first string, taking the minimum number of consecutive matches from the reference string, and see if the other string matches each of the matches I get from systematically trying to match:
if(fThisDiff =~ /Match/) {
do something...;
}
if(fThisDiff =~ /atchP/) {
do something...;
}
if(fThisDiff =~ /tchPa/) {
do something...;
}
etc.
I'd like to do this more elegantly though, if there is a way to interpret portions of the reference string repeatedly with a singular regex. I do not think this is the case, but I'd like confirmation regardless.
Here is a basic take on it, by hand with builtin tools.
Build a regex pattern with alternation of substrings of desired length from your reference string.
use warnings;
use strict;
use feature 'say';
sub get_alt_re {
my ($str, $len) = @_;
$len //= 1;
my @substrings;
foreach my $beg (0 .. length($str)-$len) {
push @substrings, substr($str, $beg, $len);
}
return '(' . join('|', map quotemeta, @substrings) . ')';
}
my $ref = q(MatchPartOfThisString);
my $target = q(fThisDiff);
my $re = get_alt_re($ref, 5);
my @m = $target =~ /$re/g;
say for @m;
Prints the line fThis.
The code should be made more robust and general. Then, it is fairly easily modified to match for a range of lengths (not only one, 5 above). Further, it can use libraries for subtasks (those repeated calls to substr beg for C code). But this demonstrates that a basic solution can be rather simple.
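For readers outside Perl, the alternation approach can be sketched in Python roughly as follows. The function name alternation_pattern is made up for this example, and re.escape plays the role of quotemeta.

```python
import re

def alternation_pattern(reference, length):
    """Build a regex matching any `length`-character substring of `reference`."""
    pieces = [reference[i:i + length]
              for i in range(len(reference) - length + 1)]
    if not pieces:
        raise ValueError("length must be between 1 and len(reference)")
    # Escape each piece so regex metacharacters in the reference stay literal.
    return re.compile("|".join(re.escape(p) for p in pieces))

pattern = alternation_pattern("MatchPartOfThisString", 5)
matches = pattern.findall("fThisDiff")
print(matches)  # ['fThis']
```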
There's no simple way to do this with regex features, but a couple CPAN modules can help construct such a regex in this case.
use strict;
use warnings;
use String::Substrings 'substrings';
use Data::Munge 'list2re';
my $match_string = 'MatchPartOfThisString';
my $re = list2re substrings $match_string, 5;
my $subject = 'fThisDiff';
if ($subject =~ m/($re)/) {
print "Matched $1 from $match_string in $subject\n";
}
The best approach would be to use the longest common substring algorithm (not to be confused with the similarly-named longest common subsequence algorithm) then check its length.
use feature qw( say );
use String::LCSS_XS qw( lcss );
my $longest = lcss("MatchPartOfThisString", "fThisDiff");
say length($longest);
If you have really long strings and you want to squeeze out every millisecond, a tailored version of the algorithm that quits as soon as the target length is found and that avoids building the string would be faster.
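A tailored version along those lines might look like the sketch below. It is written in Python for illustration and is not the String::LCSS_XS implementation; share_run is a hypothetical name. It uses the standard dynamic-programming recurrence for common substrings, keeps only one previous row, and returns as soon as a run of the target length is seen.

```python
def share_run(a, b, target):
    """Return True if a and b share at least `target` consecutive characters."""
    if target <= 0:
        return True
    # prev[j] is the length of the common run ending at the previous
    # character of `a` and at b[j-1].
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        cur = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                cur[j] = prev[j - 1] + 1
                if cur[j] >= target:
                    return True  # stop early; the substring is never built
        prev = cur
    return False

print(share_run("MatchPartOfThisString", "fThisDiff", 5))  # True
print(share_run("MatchPartOfThisString", "fThisDiff", 6))  # False
```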

validating initialization variables in perl constructor using ref

I'm trying to split a string, using regular expressions, only where exactly 1 colon (:) exists. The problem is that a colon can occur multiple times consecutively, and I'm only interested in instances where a colon is neither preceded nor followed by another colon. Any other character could precede or follow the colon.
Example string:
my $example_string = "_Fruit|Apple:~Vegetable|Carrot:~fruitfunc|Package::User::Today:~datefunct|{~date}";
Expected result:
my @result_array = ("_Fruit|Apple","~Vegetable|Carrot","~fruitfunc|Package::User::Today","~datefunct|{~date}");
What I've tried so far is a combination of negation and group regular expressions...one example that got me close:
This cuts off 1 character before and after each colon:
my @result_array = split(/[^:][:][^:]/g, $example_string);
@result_array = [
'_targetfund|tes',
'rowcountmax|10',
'test|YE',
'fruit|appl',
'date|\'12/31/2016\''
];
I was playing around with https://regex101.com/, thought maybe there was a way to return $1 within the same regex or something which could be done recursively.
Any help would be appreciated
Maybe overkill, but I would use lookarounds:
split /(?<!:):(?!:)/, $str;
demo
use 5.014;
use warnings;
use Test::More;
my $str = "_Fruit|Apple:~Vegetable|Carrot:~fruitfunc|Package::User::Today:~datefunct|{~date}";
my @wanted = ("_Fruit|Apple","~Vegetable|Carrot","~fruitfunc|Package::User::Today","~datefunct|{~date}");
my @arr = split /(?<!:):(?!:)/, $str;
is_deeply(\@arr, \@wanted);
done_testing(1);
#ok 1
#1..1
You can use look-around assertions, i.e. split on a colon that is neither preceded nor followed by another colon:
#!/usr/bin/perl
use warnings;
use strict;
use Test::Deep;
my $example_string = "_Fruit|Apple:~Vegetable|Carrot:~fruitfunc|Package::User::Today:~datefunct|{~date}";
my $result_array = ["_Fruit|Apple","~Vegetable|Carrot","~fruitfunc|Package::User::Today","~datefunct|{~date}"];
cmp_deeply( [ split /(?<!:):(?!:)/, $example_string ], $result_array );
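The same look-around pattern carries over to other flavours that support lookbehind. A short Python sketch for comparison (variable names are illustrative):

```python
import re

example = ("_Fruit|Apple:~Vegetable|Carrot:"
           "~fruitfunc|Package::User::Today:~datefunct|{~date}")

# (?<!:) rejects a colon that follows a colon; (?!:) rejects one that
# precedes a colon. Only isolated colons are used as split points.
fields = re.split(r"(?<!:):(?!:)", example)
for field in fields:
    print(field)
```

The double colons in Package::User::Today fail one of the two assertions each, so that field stays in one piece.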
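This one should do the job: :(?=~)
It splits on any colon that is followed by ~, which works for the example because every field after the first starts with ~.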

Sliding window pattern match in perl or matlab regular expressions

I am trying to use either Perl or MATLAB to parse a few numbers out of a single line of text. My text line is:
t10_t20_t30_t40_
In MATLAB, I used the following script:
str = 't10_t20_t30_t40_';
a = regexp(str,'t(\d+)_t(\d+)','match')
and it returns
a =
't10_t20' 't30_t40'
What I want is for it to also return 't20_t30', since this obviously is a match. Why doesn't regexp scan it?
I thus turned to Perl, and wrote the following in Perl:
#!/usr/bin/perl -w
$str = "t10_t20_t30_t40_";
while($str =~ /(t\d+_t\d+)/g)
{
print "$1\n";
}
and the result is the same as in MATLAB:
t10_t20
t30_t40
but I really wanted "t20_t30" to also be in the results.
Can anyone tell me how to accomplish that? Thanks!
[update with a solution]:
With help from colleagues, I identified a solution using the so-called "look-around assertion" afforded by Perl.
#!/usr/bin/perl -w
$str = "t10_t20_t30_t40_";
while($str =~ m/(?=(t\d+_t\d+))/g)
{print "$1\n";}
The key is to use a zero-width look-ahead assertion. When Perl (or a similar regex engine) scans a string with a global match, each search resumes where the previous match ended, so characters consumed by one match are not available to the next. That is why t20_t30 never shows up with the plain pattern: t20 was already consumed by the first match. A look-ahead consumes nothing, so no substring is excluded from subsequent searches (see the working code above). With the "global" modifier (i.e. m//g), the search starts at position zero and, after each zero-width match, moves forward by one character, so every position in the string gets tried.
This is explained in more detail in this blog post.
The expression (?=t\d+_t\d+) matches any 0-width string followed by t\d+_t\d+, and this creates the actual "sliding window". This effectively returns ALL t\d+_t\d+ patterns in $str without any exclusion, since every position in $str is a 0-width string. The additional pair of parentheses in (?=(t\d+_t\d+)) captures the pattern while it is doing the sliding match, and thus returns the desired sliding window outcome.
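The look-ahead trick is not specific to Perl. As a rough illustration, the same pattern in Python (used here only for comparison) produces the overlapping windows:

```python
import re

text = "t10_t20_t30_t40_"

# The look-ahead itself matches the empty string at a position; the capture
# group inside it records what would have matched starting there.
windows = re.findall(r"(?=(t\d+_t\d+))", text)
print(windows)  # ['t10_t20', 't20_t30', 't30_t40']
```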
Using Perl:
#!/usr/bin/perl
use Data::Dumper;
use Modern::Perl;
my $re = qr/(?=(t\d+_t\d+))/;
my @l = 't10_t20_t30_t40' =~ /$re/g;
say Dumper(\@l);
Output:
$VAR1 = [
't10_t20',
't20_t30',
't30_t40'
];
Once the regexp algorithm has found a match, the matched characters are not considered for further matches (and usually, this is what one wants, e.g. .* is not supposed to match every conceivable contiguous substring of this post). A workaround would be to start the search again one character after the first match, and collect the results:
str = 't10_t20_t30_t40_';
sub_str = str;
reg_ex = 't(\d+)_t(\d+)';
start_idx = 0;
all_start_indeces = [];
all_end_indeces = [];
off_set = 0;
%// While there are matches later in the string and the first match of the
%// remaining string is not the last character
while ~isempty(start_idx) && (start_idx < numel(str))
%// Calculate offset to original string
off_set = off_set + start_idx;
%// extract string starting at first character after first match
sub_str = sub_str((start_idx + 1):end);
%// find further matches
[start_idx, end_idx] = regexp(sub_str, reg_ex, 'once');
%// save match if any
if ~isempty(start_idx)
all_start_indeces = [all_start_indeces, start_idx + off_set];
all_end_indeces = [all_end_indeces, end_idx + off_set];
end
end
display(all_start_indeces)
display(all_end_indeces)
matched_strings = arrayfun(@(st, en) str(st:en), all_start_indeces, all_end_indeces, 'uniformoutput', 0)
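The restart-the-search workaround is language independent. A compact Python sketch of the same loop, where overlapping_matches is a hypothetical helper name:

```python
import re

def overlapping_matches(pattern, text):
    """Collect matches, restarting one character after each match start."""
    regex = re.compile(pattern)
    found = []
    pos = 0
    while pos <= len(text):
        m = regex.search(text, pos)
        if m is None:
            break
        found.append(m.group(0))
        # Step past the start of this match, not its end, so that a match
        # overlapping the current one can still be found.
        pos = m.start() + 1
    return found

print(overlapping_matches(r"t\d+_t\d+", "t10_t20_t30_t40_"))
# ['t10_t20', 't20_t30', 't30_t40']
```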

regular expression url rewrite based on folder

I need to be able to take /calendar/MyCalendar.ics, where MyCalendar.ics could be just about anything with an ICS extension, and rewrite it to /feeds/ics/ics_classic.asp?MyCalendar.ics
Thanks
Regular expressions are meant for searching/matching text. Usually you use a regex to tell a text manipulation tool what to search for, and then use a tool-specific way to tell it what to replace the matched text with.
Regex syntax uses parentheses to define capture groups inside the whole search pattern. Many search and replace tools use capture groups to define which part of the match to keep or replace.
We can take the Java Pattern and Matcher classes as example. To complete your task with the Java Matcher you can use the following code:
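import java.util.regex.Matcher;
import java.util.regex.Pattern;
// url is assumed to hold the incoming path, e.g. "/calendar/MyCalendar.ics"
Pattern p = Pattern.compile("/calendar/(.*\\.(?i)ics)");
Matcher m = p.matcher(url);
String rewrittenUrl = "";
if (m.matches()) {
// group(1) is the text captured by the parentheses
rewrittenUrl = "/feeds/ics/ics_classic.asp?" + m.group(1);
}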
This will find the requested pattern but will only take the first regex group for creating the new string.
Here is a link to regex replacement information in (imho) a very good regex information site: http://www.regular-expressions.info/refreplace.html
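For comparison, a minimal Python sketch of the capture-group approach. The function name rewrite_ics_url is made up for this example, and unlike the inline (?i) in the Java pattern, re.IGNORECASE here applies to the whole pattern, including /calendar/.

```python
import re

def rewrite_ics_url(url):
    """Rewrite /calendar/<name>.ics to the feed URL; None if it does not match."""
    m = re.fullmatch(r"/calendar/(.*\.ics)", url, flags=re.IGNORECASE)
    if m is None:
        return None
    # Only the first capture group (the file name) goes into the new URL.
    return "/feeds/ics/ics_classic.asp?" + m.group(1)

print(rewrite_ics_url("/calendar/MyCalendar.ics"))
# /feeds/ics/ics_classic.asp?MyCalendar.ics
```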
C:\x>perl foo.pl
Before: a=/calendar/MyCalendar.ics
After: a=/feeds/ics/ics_classic.asp?MyCalendar.ics
...or how about this way?
(regex kind of seems like overkill for this problem)
b=/calendar/MyCalendar.ics
index=9
c=MyCalendar.ics (might want to add check for ending with '.ics')
d=/feeds/ics/ics_classic.asp?MyCalendar.ics
Here's the code:
C:\x>type foo.pl
my $a = "/calendar/MyCalendar.ics";
print "Before: a=$a\n";
my $match = (
$a =~ s|^.*/([^/]+)\.ics$|/feeds/ics/ics_classic.asp?$1.ics|i
);
if( ! $match ) {
die "Expected path/filename.ics instead of \"$a\"";
}
print "After: a=$a\n";
print "\n";
print "...or how about this way?\n";
print "(regex kind of seems like overkill for this problem)\n";
my $b = "/calendar/MyCalendar.ics";
my $index = rindex( $b, "/" ); #find last path delim.
my $c = substr( $b, $index+1 );
print "b=$b\n";
print "index=$index\n";
print "c=$c (might want to add check for ending with '.ics')\n";
my $d = "/feeds/ics/ics_classic.asp?" . $c;
print "d=$d\n";
C:\x>
General thoughts:
If you do solve this with a regex, a semi-tricky bit is making sure your capture group (the parens) excludes the path separator.
Some things to consider:
Are your path separators always forward slashes?
Regex seems like overkill for this; the simplest thing I can think of is to get the index of your last path separator and do simple string manipulation (2nd part of sample program).
Libraries often have routines for parsing paths. In Java I'd look at the java.io.File object, for example, specifically
getName()
Returns the name of the file or directory denoted by
this abstract pathname. This is just the last name in
the pathname's name sequence
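The path-library idea looks much the same in other languages. A small Python sketch, assuming the separators are always forward slashes (posixpath is used for that reason) and using a hypothetical function name:

```python
import posixpath

def rewrite_by_basename(path):
    """Build the feed URL from the last path component; None if not .ics."""
    name = posixpath.basename(path)  # text after the last "/"
    if not name.lower().endswith(".ics"):
        return None
    return "/feeds/ics/ics_classic.asp?" + name

print(rewrite_by_basename("/calendar/MyCalendar.ics"))
# /feeds/ics/ics_classic.asp?MyCalendar.ics
```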

Regex line by line: How to match triple quotes but not double quotes

I need to check to see if a string of many words / letters / etc, contains only 1 set of triple double-quotes (i.e. """), but can also contain single double-quotes (") and double double-quotes (""), using a regex. Haven't had much success thus far.
A regex with negative lookahead can do it:
(?!.*"{3}.*"{3}).*"{3}.*
I tried it with these lines of java code:
String good = "hello \"\"\" hello \"\" hello ";
String bad = "hello \"\"\" hello \"\"\" hello ";
String regex = "(?!.*\"{3}.*\"{3}).*\"{3}.*";
System.out.println( good.matches( regex ) );
System.out.println( bad.matches( regex ) );
...with output:
true
false
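The same negative-lookahead pattern can be tried in Python as a quick cross-check. Here fullmatch mirrors Java's String.matches, which also requires the whole string to match; the helper name is hypothetical.

```python
import re

# Reject lines containing two triple-quote runs, then require one.
EXACTLY_ONE_TRIPLE = re.compile(r'(?!.*"{3}.*"{3}).*"{3}.*')

def has_one_triple_quote(line):
    return EXACTLY_ONE_TRIPLE.fullmatch(line) is not None

print(has_one_triple_quote('hello """ hello "" hello '))   # True
print(has_one_triple_quote('hello """ hello """ hello '))  # False
```

Since . does not match a newline by default, this fits the line-by-line use in the question.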
Try using the number of occurrences operator to match exactly three double-quotes.
\"{3}
["]{3}
[\"]{3}
I've quickly checked using http://www.regextester.com/, seems to work fine.
How you correctly compile the regex in your language of choice may vary, though!
Depends on your language, but you should only need to match for three double quotes (e.g., /\"{3}/) and then count the matches to see if there is exactly one.
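A sketch of the count-the-matches approach in Python (count_triple_quotes is an illustrative name). Matches are non-overlapping, so a run of six quotes counts as two.

```python
import re

def count_triple_quotes(line):
    """Count non-overlapping runs of three double quotes in a line."""
    return len(re.findall(r'"{3}', line))

print(count_triple_quotes('hello """ hello "" hello') == 1)   # True
print(count_triple_quotes('hello """ hello """ hello') == 1)  # False
```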
There are probably plenty of ways to do this, but a simple one is to merely look for multiple occurrences of triple quotes then invert the regular expression. Here's an example from Perl:
use strict;
use warnings;
my $match = 'hello """ hello "" hello';
my $no_match = 'hello """ hello """ hello';
my $regex = '[\"]{3}.*?[\"]{3}';
if ($match !~ /$regex/) {
print "Matched as it should!\n";
}
if ($no_match !~ /$regex/) {
print "You shouldn't see this!\n";
}
Which outputs:
Matched as it should!
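Basically, you are telling it to find the thing you DON'T want, then inverting the truth. Note that this check also accepts a line with no triple quotes at all, so pair it with a positive match for "{3} if exactly one set is required. Hope that makes sense. I can help you convert the example to another language if you need help.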
This may be a good start for you.
^(\"([^\"\n\\]|\\[abfnrtv?\"'\\0-7]|\\x[0-9a-fA-F])*\"|'([^'\n\\]|\\[abfnrtv?\"'\\0-7]|\\x[0-9a-fA-F])*'|\"\"\"((?!\"\"\")[^\\]|\\[abfnrtv?\"'\\0-7]|\\x[0-9a-fA-F])*\"\"\")$
See it in action at regex101.com.