Wrong result when trying to match exactly one time - regex

I want to check whether a string occurs an exact number of times in another string, but I ran into a problem.
use strict;
use warnings;
my $test="abc1234abc5678abcdef910";
my $cut_seq="abc";
print $test,"\tone time\n" if($test=~/$cut_seq{1}/);
print $test,"\tmore than one times\n" if($test=~/$cut_seq{1,}/);
I expected the result:
abc1234abc5678abcdef910 more than one times
But the result showed as:
abc1234abc5678abcdef910 one time
abc1234abc5678abcdef910 more than one times
I also tried this:
print $test,"\tone time\n" if($test=~/$cut_seq{0,1}/);
print $test,"\tone time\n" if($test=~/$cut_seq{1,1}/);
print $test,"\tmore than one times\n" if($test=~/$cut_seq{1,}/);
But nothing changed. I wonder why it can't match an exact number of times. If something that matches two times also matches one time, then what is the difference between {1}, {1,}, {1,1} and {1,2}? I don't see the point of having these different forms.

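If something matches two times, it also matches one time: a regex match succeeds as soon as the pattern is found once anywhere in the string. A quantifier such as {1} only limits how many consecutive repetitions a single match may contain (and here it applies only to the final c of abc, not to the whole sequence), so it cannot count occurrences across the string. That's why your "one time" match always kicks in.
The easiest approach, I think, is to simply split at your $cut_seq and check the number of resulting elements.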
my $test="abc1234abc5678abcdef910";
my $cut_seq="abc";
my @elts= split /$cut_seq/, $test;
print scalar(@elts)-1," times\n";
P.S. This does not count a $cut_seq at the end of the string, because split drops trailing empty fields. Sorry! You'll have to append something that cannot be part of your sequence, like:
my $test="abc1234abc5678abcdef910";
my $cut_seq="abc";
my @elts= split /$cut_seq/, $test . chr(0);
print scalar(@elts)-1," times\n";
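For comparison, here is a minimal Python sketch of the same idea: count the occurrences first, then compare the count. The helper name count_occurrences is made up for illustration, and the sketch assumes the sequence should be treated as literal text (hence re.escape) and counted without overlaps.

```python
import re

def count_occurrences(needle, haystack):
    """Count non-overlapping occurrences of a literal substring."""
    return len(re.findall(re.escape(needle), haystack))

test = "abc1234abc5678abcdef910"
cut_seq = "abc"

count = count_occurrences(cut_seq, test)
print(count)        # prints 3
print(count == 1)   # prints False: the sequence does not occur exactly once

# A quantified pattern still matches, because a search only needs to
# find the pattern once somewhere in the string.
print(bool(re.search(r"(?:abc){1}", test)))  # prints True
```

Counting matches directly also handles a sequence at the very end of the string, so no sentinel character is needed.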

Related

SUM multiple values after a substring within all cells in a column in Google Sheets

For an open source chat analyser in Google Sheets, I need to extract all numeric values after a substring (Example), then total them.
For example, if a cell contains Example1 another text 123 Example500 text, Example1 and Example500 should be extracted out, and their numeric values summed to 501.
This is complicated further by needing to obtain the total for a column of messages.
What I've tried already:
=REGEXEXTRACT(A1, "Example(\d+)"): This only extracts the first matching value, but works!
=SUM(SPLIT(A1, "Example")): This works for messages that only include my target string, but falls apart when other strings are included. The output could possibly be filtered to results that start with a number, but this is very messy and possibly a red herring.
CONCATENATEing all my cells together, then searching for numbers. This is error-prone due to additional numbers within messages.
Another idea is to replace each Example(\d+) with $1 (the captured digits) followed by a space, while every other character, matched by the . alternative, is replaced by just the space (regex101 demo). This works because $1 is unset when the right side of the alternation matches. Then split on spaces and sum the numbers; any other digits have already been removed. If Example is a placeholder, replace it with e.g. [[:alpha:]]+ to match one or more alphabetic characters.
=IF(ISTEXT(A1);SUM(SPLIT(REGEXREPLACE(A1;"Example(\d+)|.";"$1 ");" "));0)
I added IF(ISTEXT(A1);...) so that only text in the source cell is processed (to avoid errors); if the cell is empty or does not contain text, the result is 0. Remove it if the cell always contains text.
Edit from @TheMaster: As an array formula, we can use BYROW:
=BYROW(A:A; LAMBDA(row; IF(ISTEXT(row); SUM(SPLIT(
REGEXREPLACE(row;"Example(\d+)|.";"$1 ");" "));)))
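Outside of Sheets, the same extraction can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names and the tag parameter are hypothetical, and it collects the captured digits directly with re.findall rather than using the replace-and-split trick above.

```python
import re

def sum_tagged_numbers(message, tag="Example"):
    """Sum every number that directly follows `tag` in one message."""
    pattern = re.escape(tag) + r"(\d+)"
    return sum(int(digits) for digits in re.findall(pattern, message))

def sum_column(messages, tag="Example"):
    """Total over a column of cells, skipping anything that is not text."""
    return sum(sum_tagged_numbers(m, tag) for m in messages if isinstance(m, str))

cell = "Example1 another text 123 Example500 text"
print(sum_tagged_numbers(cell))  # prints 501
```

Stray numbers such as 123 are ignored because only digits immediately after the tag are captured.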
try:
=LAMBDA(x, REGEXEXTRACT(A1, "(\w+)\d+")&
SUMPRODUCT(IF(IFERROR(REGEXMATCH(x, "\w+\d+")),
REGEXEXTRACT(x, "\w+(\d+)"), )))(SPLIT(A1, " "))
update 1:
=LAMBDA(x, REGEXEXTRACT(A1, "(\D+)\d+")&
SUMPRODUCT(IF(IFERROR(REGEXMATCH(x, "\D+\d+")),
REGEXEXTRACT(x, "\D+(\d+)"), )))(SPLIT(A1, " "))
update 2:
=INDEX(LAMBDA(xx, REGEXEXTRACT(xx, "(\D+)\d+")&
BYROW(LAMBDA(x, IF(IFERROR(REGEXMATCH(x, "\D+\d+")),
REGEXEXTRACT(x, "\D+(\d+)"), ))(SPLIT(xx, " ")), LAMBDA(x, SUMPRODUCT(x))))
(A1:INDEX(A:A, MAX((A:A<>"")*ROW(A:A)))))
if you start from A2 just change A1: to A2:

Parse comma-separated values in an argument list that is itself separated by commas

So I have this regex:
=([0-9A-Za-z_-]+),?
and I have a string like:
foo=bar,pine=apple,tree,bar=bie
or
foo=bar,pine=apple,tree
or
pine=apple,tree
The regex works when each key has only one value.
But since the list of values for a key can itself contain commas, the regex fails: my code does half of what I want but never gets the second value.
How do I fix my regex to capture both values regardless of where the key is in the string: alone, between two others, or at the end?
I tried a few things but couldn't figure it out.
Attempt 1:
=([0-9A-Za-z,_-]+),=?
In this case it matches the value in the middle, but it fails on the others because the trailing = does not exist.
Attempt 2:
=[0-9A-Za-z_-]+([,]+[0-9A-Za-z_-]*),?
This also matches bar,pine and tree,bar, for example.
EDIT::
This seems to work, maybe:
=('[0-9A-Za-z,_-]+'),*|=([0-9A-Za-z_-]+),*
if I use quotes for multiple values.
You can split on variable names - that will leave only the values:
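// Also consume the comma before each key so values come back without trailing commas.
s := regexp.MustCompile(",?[^,\\s]+=").Split("foo=bar,pine=apple,tree,bar=bie", -1)
fmt.Println(s)
// => [ bar apple,tree bie] (the first element is an empty string, from before "foo=")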
Go Demo
Regex Demo
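The split-on-keys idea carries over to other languages. Below is a rough Python sketch; the function name split_values is hypothetical, and the pattern also consumes the comma in front of each key, which is an assumption added here so that values come back without a trailing comma.

```python
import re

# A key is a run of non-comma, non-space characters followed by "=".
# The optional leading comma belongs to the separator, not to the value.
KEY = re.compile(r",?[^,\s]+=")

def split_values(arg_list):
    """Return the value part of every key=value entry, commas inside values kept."""
    parts = KEY.split(arg_list)
    # Splitting leaves an empty string in front of the first key; drop it.
    if parts and parts[0] == "":
        parts = parts[1:]
    return parts

print(split_values("foo=bar,pine=apple,tree,bar=bie"))
# prints ['bar', 'apple,tree', 'bie']
```

Each multi-value entry such as apple,tree can then be split on commas in a second step if the individual values are needed.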

How can I tell if there are three or more characters between matches in a regex?

I'm using Ruby 2.1. I have this logic that looks for consecutive pairs of strings in a bigger string
results = line.scan(/\b((\S+?)\b.*?\b(\S+?))\b/)
My question is, how do I iterate over the list of results and print out whether there are three or more characters between the two strings? For instance if my string were
"abc def"
The above would produce
[["abc def", "abc", "def"]]
and I'd like to know whether there are three or more characters between "abc" and "def."
Use a quantifier for the spaces in between: \b((\S+?)\b\s{3,}\b(\S+?))\b
Also, the inner boundaries are not really needed:
\b((\S+?)\s{3,}(\S+?))\b
A straightforward way to check this is by running a separate regex:
results.each { |x| p x[0][/\S+?\b(.*?)\b\S+?/, 1].size }
This prints the size of the gap for each result.
Another way is to take the size of the captured groups and subtract them:
results = []
line.scan(/\b((\S+?)\b.*?\b(\S+?))\b/) do |s, group1, group2|
results << $~ if s.size - group1.size - group2.size >= 3
end
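The capture-the-gap approach can also be sketched in Python. The function name wide_gap_pairs and the min_gap parameter are made up for this example, and it assumes the two strings are separated by whitespace only, so the gap is captured in its own group and measured directly.

```python
import re

def wide_gap_pairs(line, min_gap=3):
    """Return (left, right, gap_length) for consecutive pairs with a wide gap."""
    pairs = []
    # Group 2 holds the whitespace between the two non-space runs.
    for match in re.finditer(r"(\S+)(\s+)(\S+)", line):
        left, gap, right = match.groups()
        if len(gap) >= min_gap:
            pairs.append((left, right, len(gap)))
    return pairs

print(wide_gap_pairs("abc   def"))  # prints [('abc', 'def', 3)]
print(wide_gap_pairs("abc def"))    # prints []
```

As with scan in the question, the pairs do not overlap: once a word has been used as the right-hand side of a pair, the next pair starts after it.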

regex Match a capture group's items only once

So I'm trying to split a string in several options, but those options are allowed to occur only once. I've figured out how to make it match all options, but when an option occurs twice or more it matches every single option.
Example string: --split1 testsplit 1 --split2 test split 2 --split3 t e s t split 3 --split1 split1 again
Regex: /-{1,2}(split1|split2|split3) [\w|\s]+/g
Right now it is matching all cases and I want it to match --split1, --split2 and --split3 only once (so --split1 split1 again will not be matched).
I'm probably missing something really straight forward, but anyone care to help out? :)
Edit:
Decided to handle the extra occurrences in a script rather than through regex, for easier error handling. Thanks for the help!
EDIT: Somehow I ended up here from the PHP section, hence the PHP code. The same principles apply to any other language, however.
I realise that OP has said they have found a solution, but I am putting this here for future visitors.
function splitter(string $str, int $splits, $split = "--split")
{
    $a = array();
    for ($i = $splits; $i > 0; $i--) {
        if (strpos($str, "$split{$i} ") !== false) {
            $a[] = substr($str, strpos($str, "$split{$i} ") + strlen("$split{$i} "));
            $str = substr($str, 0, strpos($str, "$split{$i} "));
        }
    }
    return array_reverse($a);
}
This function will take the string to be split, as well as how many segments there will be. Use it like so:
$array = splitter($str, 3);
It will split the string around the $split parameter.
The parameters are used as follows:
$str
The string that you want to split. In your instance it is: --split1 testsplit 1 --split2 test split 2 --split3 t e s t split 3 --split1 split1 again.
$splits
This is how many elements of the array you wish to create. In your instance, there are 3 distinct splits.
If a split is not found, then it will be skipped. For instance, if you were to have --split1 and --split3 but no --split2 then the array will only be split twice.
$split
This is the string that will be the delimiter of the array. Note that it must be as specified in the question. This means that if you want to split using --myNewSplit then it will append that string with a number from 1 to $splits.
All elements end with a space since the function looks for $split and you have a space before each split. If you don't want to have the trailing whitespace then you can change the code to this:
$a[] = trim(substr($str, strpos($str, "$split{$i} ") + strlen("$split{$i} ")));
Also, notice that strpos looks for a space after the delimiter. Again, if you don't want the space then remove it from the string.
The reason I have used a function is that it will make it flexible for you in the future if you decide that you want to have four splits or change the delimiter.
Obviously, if you no longer want a numerically changing delimiter then the explode function exists for this purpose.
-{1,2}((split1)|(split2)|(split3)) [\w|\s]+
Something like this? This will, in this case, create 3 arrays which all will have an array of elements of the same name in them. Hope this helps
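For future visitors, here is a small Python sketch of the "handle repeats in the script" route that the asker ended up taking: match every option with the regex, then keep only the first value seen for each name. The function name first_occurrences is hypothetical, and the character class is written as [\w\s] on the assumption that the | inside the original class was not meant literally.

```python
import re

OPTION = re.compile(r"-{1,2}(split1|split2|split3) ([\w\s]+)")

def first_occurrences(text):
    """Map each option name to the value of its first occurrence only."""
    found = {}
    for match in OPTION.finditer(text):
        name = match.group(1)
        value = match.group(2).strip()
        # setdefault leaves an existing entry alone, so repeats are ignored.
        found.setdefault(name, value)
    return found

example = ("--split1 testsplit 1 --split2 test split 2 "
           "--split3 t e s t split 3 --split1 split1 again")
print(first_occurrences(example))
```

With the example string from the question, the second --split1 is matched by the regex but never stored.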

find many matches in nucleotide sequence with a regex

I have a gene sequence (see below), and I want to find all open reading frames (starting with ATG and ending with TAG).
I have tried this:
my $file = ('ACCCTGCCCAAAATCCCCCCGATCGATAGAGCTAAATGGCCCATGATGCATCGACTAGCTAGCTAAAATGTCGATCGATACAGCTAATAG');
while($file =~ /(ATG\w+?TAG)/g){
print $1;
}
but it only gives
ATGGCCCATGATGCATCGACTAGATGTCGATCGATACAGCTAATAG
How can I get every one?
The trick to finding all occurrences is to use a zero-width assertion, which prevents the match from "eating" our characters: (?=ATG\w+?TAG).
The problem with this is that we'll get empty matches, so the solution is to use a group:
(?=(ATG\w+?TAG)). You will find all occurrences in group 1.
Group 1 output:
ATGGCCCATGATGCATCGACTAG
ATGATGCATCGACTAG
ATGCATCGACTAG
ATGTCGATCGATACAGCTAATAG
Online demo
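The same lookahead trick can be tried out in Python as a quick check. This is only a sketch that reuses the sequence and pattern from above; re.findall returns the contents of group 1 for every position where the lookahead succeeds.

```python
import re

seq = ("ACCCTGCCCAAAATCCCCCCGATCGATAGAGCTAAATGGCCCATGATGCATCGACTAG"
       "CTAGCTAAAATGTCGATCGATACAGCTAATAG")

# The lookahead consumes nothing, so the search resumes one character
# later and overlapping candidates are all reported.
orfs = re.findall(r"(?=(ATG\w+?TAG))", seq)
for orf in orfs:
    print(orf)
```

This prints the four candidates listed above, one per line.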
The result is OK; simply separate the matches in the output:
print "$1\n";
You are getting two matches. To see them, I suggest you print some separator between them:
print "$1\n";
Then we get the output:
ATGGCCCATGATGCATCGACTAG
ATGTCGATCGATACAGCTAATAG
If you want to find frames that also occur inside another, you must make sure not to consume too many characters. Work around that with a lookahead:
/ATG(?=([ACTG]*?TAG))/g;
Then print "ATG$1\n". Output:
ATGGCCCATGATGCATCGACTAG
ATGATGCATCGACTAG
ATGCATCGACTAG
ATGTCGATCGATACAGCTAATAG
If you want to have the start and stop codons in the same frame don't forget to filter the results to the only ones with a length multiple of 3:
print "ATG$1\n" if (length($1)%3) == 0 ;
If you want to check the six frames available in one sequence, don't forget to check also the complementary chain:
$comp_chain = reverse($chain) ;
$comp_chain =~ tr/ATCG/TAGC/ ;
You will then obtain the open reading frames from the six reading frames available in a single sequence.
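Put together, the length filter and the complementary strand might look like this in Python. It is a sketch under assumptions: the function names are made up, the sequence is assumed to contain only A, C, G and T, and candidates are found with the same lazy lookahead as above before being filtered by length.

```python
import re

COMPLEMENT = str.maketrans("ATCG", "TAGC")

def reverse_complement(seq):
    """Reverse the sequence, then swap each base for its complement."""
    return seq[::-1].translate(COMPLEMENT)

def in_frame_orfs(seq):
    """ATG...TAG candidates whose total length is a multiple of three."""
    candidates = re.findall(r"(?=(ATG[ACGT]*?TAG))", seq)
    return [orf for orf in candidates if len(orf) % 3 == 0]

def orfs_both_strands(seq):
    """In-frame candidates on the given strand and on its reverse complement."""
    return in_frame_orfs(seq) + in_frame_orfs(reverse_complement(seq))

print(in_frame_orfs("CCATGAAATAGGG"))   # prints ['ATGAAATAG']
print(orfs_both_strands("CTATTTCAT"))   # prints ['ATGAAATAG']
```

Because every ATG position is tried, all three reading frames of each strand are covered, which gives the six frames mentioned above.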