find a string pattern in a string array - regex

I need to count the occurrences of specified patterns in the input strands and produce a report for each pattern.
The input string would look like: 1 AA AATTCGAA end
The 1 signifies the number of patterns to search for, AA is the pattern, and what follows is the strand to search for AA in.
My idea is:
public static void main(String[] args){
Scanner s = new Scanner(System.in);
System.out.println("How many patterns do you want and enter patterns and DNA Sequence(type 'end' to signify end):");
String DNA = s.nextLine();
process(DNA);
}
public static void process(String DNA){
String number = DNA.replaceFirst(".*?(\\d+).*", "$1");
int N = Integer.parseInt(number);
DNA.toUpperCase();
String[] DNAarray;
DNAarray = DNA.split(" ");
for(int i=1; i<N; i++){
int count=0;
for(int j =0; j < DNAarray.length; j++) {
if(DNAarray[i+N].contains(DNAarray[i])){
count= count++;
}
}
System.out.println("Pattern:"+DNAarray[i]+ "Count:"+count);
}

This should do it:
using System;
using System.Text.RegularExpressions;
public class Program
{
public static void Main()
{
Console.WriteLine(PatternCount("1 AA AADDRRSSAA"));
}
public static int PatternCount(string sDNA) {
Regex reParts = new Regex("(\\d+)\\s(\\w\\w)\\s(\\w+)");
Match m = reParts.Match(sDNA);
if (m.Success)
{
return Regex.Matches(m.Groups[3].Value, m.Groups[2].Value).Count;
}
else
return 0;
}
}
The first regex splits the input into count, pattern and data. (I'm not sure why you want to limit the number of patterns to search for. This code ignores that; modify it to suit your needs.)
The second regex is simply the wanted pattern, and "Matches" counts the number of occurrences. Work from here.
Regards
(I feel good today, doing people's work ;))
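Since the question is in Java, here is a minimal Java sketch of the same idea. It assumes the input has the shape "N p1 ... pN strand ... end" and that overlapping occurrences should be counted (hence the zero-width lookahead). The class and method names (PatternCounter, countOverlapping, report) are made up for illustration and are not from the original posts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; names are illustrative only.
final class PatternCounter {

    // Counts occurrences of `pattern` in `strand`, including overlapping ones,
    // by matching a zero-width lookahead at every position.
    static int countOverlapping(String strand, String pattern) {
        Matcher m = Pattern.compile("(?=" + Pattern.quote(pattern) + ")").matcher(strand);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // Parses "N p1 ... pN strand ... end" (assumed format) and returns
    // one report line per pattern.
    static List<String> report(String input) {
        String[] tokens = input.trim().toUpperCase().split("\\s+");
        int n = Integer.parseInt(tokens[0]);

        // Everything after the N patterns, up to the "end" marker, is a strand.
        List<String> strands = new ArrayList<>();
        for (int i = n + 1; i < tokens.length && !tokens[i].equals("END"); i++) {
            strands.add(tokens[i]);
        }

        List<String> lines = new ArrayList<>();
        for (int i = 1; i <= n; i++) {
            int count = 0;
            for (String strand : strands) {
                count += countOverlapping(strand, tokens[i]);
            }
            lines.add("Pattern: " + tokens[i] + " Count: " + count);
        }
        return lines;
    }

    public static void main(String[] args) {
        report("2 AA TT AAATTCGAA end").forEach(System.out::println);
    }
}
```

If only non-overlapping occurrences are wanted, compiling Pattern.quote(pattern) without the lookahead wrapper should do it.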

There is really no need to supply the number of searches. In fact, this can be done
with a single regex. I can't remember whether .NET supports the \G anchor,
but it is not really necessary anyway. I left it in.
Each match:
Finds a new key.
Captures the key's substring matches at the end.
Advances the search position by just the key.
So, sit in a find loop.
On each match, print the 'Key' capture buffer,
then print the count of the 'Values' capture collection.
That's all there is to it.
The regex will search for overlapping keys. To change it to exclusive keys,
change the = to : as shown in the comments.
You can also make it a little more specific. For example, change all the \w's to [A-Z], etc.
The regex:
(?:
     ^ [ \d]*
  |  \G
)
(?<Key> \w+ )                    #_(1)
[ ]+
(?=
     (?: \w+ [ ]+ )*
     (?= \w )
     (?:
          (?=                    # <- Change the = to : to get non-overlapped matches
               (?<Values> \1 )   #_(2)
          )
       |  .
     )*
     $
)
This is a perl test case
# $str = '2 6 AA TT PP AAATTCGAA';
# $count = 0;
#
# while ( $str =~ /(?:^[ \d]*|\G)(\w+)[ ]+(?=(?:\w+[ ]+)*(?=\w)(?:(?=(\1)(?{ $count++ }))|.)*$)/g )
# {
# print "search = '$1'\n";
# print "found = '$count'\n";
# $count = 0;
#
# }
#
# Output >>
#
# search = 'AA'
# found = '3'
# search = 'TT'
# found = '1'
# search = 'PP'
# found = '0'
#
#

Regex - get list comma separated allow spaces before / after the comma

I am trying to extract an image list from a commit message:
String commitMsg = "#build #images = image-a, image-b,image_c, imaged , image-e #setup=my-setup fixing issue with px"
I want to get a list that contains:
["image-a", "image-b", "image_c", "imaged", "image-e"]
NOTES:
A) should allow a single space before/after the comma (,)
B) ensure that #images = exists but exclude it from the group
C) I am also searching for other parameters such as #build and #setup, so I need to ignore them when looking for #images
What I have so far is:
/(?i)#images\s?=\s?<HERE IS THE MISSING LOGIC>/
I use find() method:
def matcher = commitMsg =~ /(?i)#images\s?=\s?([^,]+)/
if(matcher.find()){
println(matcher[0][1])
}
You can use
(?i)(?:\G(?!^)\s?,\s?|#images\s?=\s?)(\w+(?:-\w+)*)
See the regex demo. Details:
(?i) - case insensitive mode on
(?:\G(?!^)\s?,\s?|#images\s?=\s?) - either the end of the previous regex match and a comma enclosed with single optional whitespaces on both ends, or #images string and a = char enclosed with single optional whitespaces on both ends
(\w+(?:-\w+)*) - Group 1: one or more word chars followed with zero or more repetitions of - and one or more word chars.
See a Groovy demo:
String commitMsg = "#build #images = image-a, image-b,image_c, imaged , image-e #setup=my-setup fixing issue with px"
def re = /(?i)(?:\G(?!^)\s?,\s?|#images\s?=\s?)(\w+(?:-\w+)*)/
def res = (commitMsg =~ re).collect { it[1] }
print(res)
Output:
[image-a, image-b, image_c, imaged, image-e]
An alternative Groovy code:
String commitMsg = "#build #images = image-a, image-b,image_c, imaged , image-e #setup=my-setup fixing issue with px"
def re = /(?i)(?:\G(?!^)\s?,\s?|#images\s?=\s?)(\w+(?:-\w+)*)/
def matcher = (commitMsg =~ re).collect()
for(m in matcher) {
println(m[1])
}
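For anyone on plain Java rather than Groovy, the same \G-based pattern can be driven with a find() loop. This is only a sketch: the class and method names (ImageListExtractor, extract) are invented for illustration, and the pattern is the one given above, with backslashes doubled for a Java string literal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; names are illustrative only.
final class ImageListExtractor {

    // Same regex as in the answer above: either continue right after the
    // previous match (\G, but not at the start of input) past a comma,
    // or start at "#images =".
    private static final Pattern IMAGES = Pattern.compile(
            "(?i)(?:\\G(?!^)\\s?,\\s?|#images\\s?=\\s?)(\\w+(?:-\\w+)*)");

    // Collects group 1 of every successive match.
    static List<String> extract(String commitMsg) {
        List<String> images = new ArrayList<>();
        Matcher m = IMAGES.matcher(commitMsg);
        while (m.find()) {
            images.add(m.group(1));
        }
        return images;
    }

    public static void main(String[] args) {
        String commitMsg = "#build #images = image-a, image-b,image_c, imaged , image-e "
                + "#setup=my-setup fixing issue with px";
        System.out.println(extract(commitMsg));
    }
}
```

The chain stops at #setup because neither a comma nor a new #images marker follows image-e, so \G has nothing to continue from.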

Function to check a value in a String

I'm working with an application that tries to check the location of a user in a User Directory.
I have strings similar to:
CN=John Mayor,OU=Users,OU=NA,OU=Local,DC=domain,DC=application,DC=com
or
CN=Annette Luis Morgant,OU=Users,OU=CH,OU=Local,DC=domain,DC=application,DC=com
I'm trying to filter in javascript the string in order to print out ONLY the value of the second "OU".
So for the first case it will be "NA", for the second case it will be "CH".
I have been trying to use substring and trim or something similar, but I keep confusing myself!
Can you help me?
Thanks!!!!
edit-----
This is what I was trying to do:
public class SplitUser {
public static void main(String[] args) {
String MyStringContent = "CN=John Mayor,OU=Users,OU=NA,OU=Local,DC=domain,DC=application,DC=com";
String[] arrSplit = MyStringContent.split(",");
for (int i=0; i < arrSplit.length; i++)
{
System.out.println(arrSplit[i]);
}
//System.out.println(arrSplit[2]);
String p = arrSplit[2].substring(3, arrSplit[2].length());
System.out.println(p);
}}
You could try with this:
(?:,|^)OU=[^,]+(?=.*?,OU=([^,]+)).*
It will work even if there are some other FOO=BAR values inserted among the "OU"'s
Explained:
(?:,|^) # start by begin of line or ','
OU=
[^,]+ #Anything but a ',' 1 or more times
#We have found 1 OU, let's find the next
(?= # Lookahead expression
.*? # Anything (ungreedy)
,OU=
([^,]+)
)
.* # Anything, just to match the whole line
# and avoid multiple matches for the same line (g flag)
Demo here
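Since the code in the question is Java, here is a small sketch of how the regex above might be applied there. The class and method names (SecondOu, secondOu) are hypothetical, and the sketch assumes the values never contain escaped commas.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; names are illustrative only.
final class SecondOu {

    // The regex from the answer above: match the first OU, then capture
    // the value of the next OU via the lookahead (group 1).
    private static final Pattern SECOND_OU =
            Pattern.compile("(?:,|^)OU=[^,]+(?=.*?,OU=([^,]+)).*");

    // Returns the value of the second OU, or an empty Optional if the
    // string has fewer than two OU components.
    static Optional<String> secondOu(String dn) {
        Matcher m = SECOND_OU.matcher(dn);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String dn = "CN=John Mayor,OU=Users,OU=NA,OU=Local,DC=domain,DC=application,DC=com";
        System.out.println(secondOu(dn).orElse("(none)"));
    }
}
```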

Search for substring and store another part of the string as variable in perl

I am revamping an old mail tool and adding MIME support. I have a lot of it working but I'm a perl dummy and the regex stuff is losing me.
I had:
foreach ( @{$body} ) {
next if /^$/;
if ( /NEMS/i ) {
/.*?(\d{5,7}).*/;
$nems = $1;
next;
}
if ( $delimit ) {
next if (/$delimit/ && ! $tp);
last if (/$delimit/ && $tp);
$tp = 1, next if /text.plain/;
$tp = 0, next if /text.html/;
s/<[^>]*>//g;
$newbody .= $_ if $tp;
} else {
s/<[^>]*>//g;
$newbody .= $_ ;
}
} # End Foreach
Now I have $body_text as the plain text mail body thanks to MIME::Parser. So now I just need this part to work:
foreach ( @{$body_text} ) {
next if /^$/;
if ( /NEMS/i ) {
/.*?(\d{5,7}).*/;
$nems = $1;
next;
}
} # End Foreach
The actual challenge is to find NEMS=12345 or NEMS=1234567 and set $nems=12345 if found. I think I have a very basic syntax problem with the test because I'm not exposed to perl very often.
A coworker suggested:
foreach (split(/\n/,$body_text)){
next if /^$/;
if ( /NEMS/i ) {
/.*?(\d{5,7}).*/;
$nems = $1;
next;
}
}
Which seems to be working, but it may not be the preferred way?
edit:
So this is the most current version based on tips here and testing:
foreach (split(/\n/,$body_text)){
next if /^$/;
if ( /NEMS/i ) {
/^\s*NEMS\s*=\s*(\d+)/i;
$nems = $1;
next;
}
}
Match the last two digits as optional, capture the first five, and assign the capture directly:
($nems) = /(\d{5}) (?: \d{2} )?/x; # /x allows spaces inside
The construct (?: ) only groups what's inside, without capture. The ? after it means to match that zero or one time. We need parens so that it applies to that subpattern only. So the last two digits are optional -- five digits or seven digits match. I removed the unneeded .*? and .*
However, by what you say it appears that the whole thing can be simplified
if ( ($nems) = /^\s*NEMS \s* = \s* (\d{5}) (?:\d{2})?/ix ) { next }
where there is now no need for if (/NEMS/) and I've adjusted to the clarification that NEMS is at the beginning and that there may be spaces around =. Then you can also say
my $nems;
foreach ( split /\n/, $body_text ) {
# ...
next if ($nems) = /^\s*NEMS\s*=\s*(\d{5})(?:\d{2})?/i;
# ...
}
which incorporates the clarification that the new $body_text is a multiline string.
Clearly $nems must be declared outside of the loop, and I indicate that.
This allows yet more digits to follow; it will match on 8 digits as well (but capture only the first five). This is what the trailing .* in your regex implies.
Edit: It's been clarified that there can only be 5 or 7 digits. The regex can then be tightened to check whether the input is as expected, but it should work as it stands, too.
A few notes; let me know if more would be helpful:
The match operator returns a list in list context, so we need the parens in ($nems) = /.../;
The ($nems) = /.../ syntax is a nice shortcut for ($nems) = $_ =~ /.../;
If you are matching on a variable other than $_ then you need the whole thing.
You always want to start Perl programs with
use warnings 'all';
use strict;
This directly helps and generally results in better code.
The latest clarification of the problem states that all digits following = need to be captured into $nems (there may be 5, 7, 8, 9, or 10 digits, but not 6). Then the regex is simply
($nems) = /^\s*NEMS\s*=\s*(\d+)/i;
where \d+ means one or more digits, so it matches a string of digits (the match fails if there are none).
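For comparison, the same extraction can be sketched in Java without splitting the body into lines first, by using MULTILINE so that ^ matches at every line start. This is only an illustration: the names (NemsExtractor, findNems) are invented, and [ \t]* is used in place of \s* so that the match cannot run across a line break, which the line-by-line Perl loop rules out implicitly.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; names are illustrative only.
final class NemsExtractor {

    // ^ anchors at each line start (MULTILINE); NEMS is matched
    // case-insensitively; group 1 is the run of digits after "=".
    private static final Pattern NEMS = Pattern.compile(
            "^[ \\t]*NEMS[ \\t]*=[ \\t]*(\\d+)",
            Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);

    // Returns the digits of the first NEMS=... line, if any.
    static Optional<String> findNems(String bodyText) {
        Matcher m = NEMS.matcher(bodyText);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String body = "Hello\n  nems = 1234567 some text\nBye\n";
        System.out.println(findNems(body).orElse("(not found)"));
    }
}
```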

Couchbase xdcr regex - How do I exclude keys using regex?

I am trying to exclude certain documents from being transported to ES using XDCR.
I have the following regex that filters ABCD and IJ
https://regex101.com/r/gI6sN8/11
Now, I want to use this regex in the XDCR filtering
^(?!.*(ABCD|IJ)).*$
How do I exclude keys using regex?
EDIT:
What if I want to select everything that doesn't contain ABCDE and ABCHIJ?
I tried
https://regex101.com/r/zT7dI4/1
edit:
Sorry, after looking at it further, this method is invalid. For instance, [^B] allows an A to get by, letting AABCD slip through (since it will match AA first, then match BCD with [^AI]+). Please disregard this post.
Demo here shows below method is invalid
(disregard this)
You could use a posix style trick to exclude words.
Below is to exclude ABCD and IJ.
You get a sense of the pattern from this.
Basically, you put all the first letters into a negative class
as the first in the alternation list, then handle each word
in a separate alternation.
^(?:[^AI]+|(?:A(?:[^B]|$)|AB(?:[^C]|$)|ABC(?:[^D]|$))|(?:I(?:[^J]|$)))+$
Demo
Expanded
^
(?:
     [^AI]+
  |
     (?:                       # Handle 'ABCD'
          A
          (?: [^B] | $ )
       |  AB
          (?: [^C] | $ )
       |  ABC
          (?: [^D] | $ )
     )
  |
     (?:                       # Handle 'IJ'
          I
          (?: [^J] | $ )
     )
)+
$
Hopefully one day there will be built-in support for inverting the match expression. In the meantime, here's a Java 8 program that generates regular expressions for inverted prefix matching using basic regex features supported by the Couchbase XDCR filter.
This should work as long as your key prefixes are somehow delimited from the remainder of the key. Make sure to include the delimiter in the input when modifying this code.
Sample output for red:, reef:, green: is:
^([^rg]|r[^e]|g[^r]|re[^de]|gr[^e]|red[^:]|ree[^f]|gre[^e]|reef[^:]|gree[^n]|green[^:])
File: NegativeLookaheadCheater.java
import java.util.*;
import java.util.stream.Collectors;
public class NegativeLookaheadCheater {
public static void main(String[] args) {
List<String> input = Arrays.asList("red:", "reef:", "green:");
System.out.println("^" + invertMatch(input));
}
private static String invertMatch(Collection<String> literals) {
int maxLength = literals.stream().mapToInt(String::length).max().orElse(0);
List<String> terms = new ArrayList<>();
for (int i = 0; i < maxLength; i++) {
terms.addAll(terms(literals, i));
}
return "(" + String.join("|", terms) + ")";
}
private static List<String> terms(Collection<String> words, int index) {
List<String> result = new ArrayList<>();
Map<String, Set<Character>> prefixToNextLetter = new LinkedHashMap<>();
for (String word : words) {
if (word.length() > index) {
String prefix = word.substring(0, index);
prefixToNextLetter.computeIfAbsent(prefix, key -> new LinkedHashSet<>()).add(word.charAt(index));
}
}
prefixToNextLetter.forEach((literalPrefix, charsToNegate) -> {
result.add(literalPrefix + "[^" + join(charsToNegate) + "]");
});
return result;
}
private static String join(Collection<Character> collection) {
return collection.stream().map(c -> Character.toString(c)).collect(Collectors.joining());
}
}
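To see what the generated expression does to individual keys, here is a small check written as a sketch. It hard-codes the sample output shown above for red:, reef: and green:, and it assumes the filter applies the regex with partial-match semantics, modelled here with Matcher.find(). The class and method names (InvertedPrefixDemo, passesFilter) are made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
final class InvertedPrefixDemo {

    // Sample output quoted above for the prefixes red:, reef:, green:
    private static final Pattern NOT_EXCLUDED = Pattern.compile(
            "^([^rg]|r[^e]|g[^r]|re[^de]|gr[^e]|red[^:]|ree[^f]|gre[^e]"
                    + "|reef[^:]|gree[^n]|green[^:])");

    // True when the key does NOT start with one of the excluded prefixes.
    // Assumes partial-match semantics, modelled with find().
    static boolean passesFilter(String key) {
        return NOT_EXCLUDED.matcher(key).find();
    }

    public static void main(String[] args) {
        for (String key : new String[] {"red:1", "reef:abc", "green:x", "blue:1", "reed:1", "re"}) {
            System.out.println(key + " -> " + passesFilter(key));
        }
    }
}
```

Note that a key shorter than the point where it diverges from every prefix (for example "re") matches none of the alternatives and is filtered out as well, which is one reason the prefixes need to include a delimiter.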

Solution required to create Regex pattern

I am developing a Windows application in C#. I have been searching for a solution to my problem of creating a Regex pattern. I want to create a Regex pattern matching either of the following strings:
XD=(111111) XT=( 588.466)m3 YT=( .246)m3 G=( 3.6)V N=(X0000000000) M=(Y0000000000) O=(Z0000000000) Date=(06.01.01)Time=(00:54:55) Q=( .00)m3/hr
XD=(111 ) XT=( 588.466)m3 YT=( .009)m3 G=( 3.6)V N=(X0000000000) M=(Y0000000000) O=(Z0000000000) Date=(06.01.01)Time=(00:54:55) Q=( .00)m3/hr
The specific requirement is that I need all the values from the strings given above, each of which is a collection of key/value pairs. I would also like to know which of the two approaches, Regex pattern matching or substring extraction, is better in terms of efficiency and performance for this problem.
Thank you all in advance, and let me know if more details are required.
I don't know C#, so there probably is a better way to build a key/value array. I constructed a regex and handed it to RegexBuddy which generated the following code snippet:
StringCollection keyList = new StringCollection();
StringCollection valueList = new StringCollection();
StringCollection unitList = new StringCollection();
try {
Regex regexObj = new Regex(
@"(?<key>\b\w+) # Match an alphanumeric identifier
\s*=\s* # Match a = (optionally surrounded by whitespace)
\( # Match a (
\s* # Match optional whitespace
(?<value>[^()]+) # Match the value string (anything except parens)
\) # Match a )
(?<unit>[^\s=]+ # Match an optional unit (anything except = or space)
\b # which must end at a word boundary
(?!\s*=) # and not be an identifier (i. e. followed by =)
)? # and is optional, as mentioned.",
RegexOptions.IgnorePatternWhitespace);
Match matchResult = regexObj.Match(subjectString);
while (matchResult.Success) {
keyList.Add(matchResult.Groups["key"].Value);
valueList.Add(matchResult.Groups["value"].Value);
unitList.Add(matchResult.Groups["unit"].Value);
matchResult = matchResult.NextMatch();
}
} catch (ArgumentException ex) {
// Syntax error in the regular expression
}
Regex re=new Regex(@"(\w+)\=\(([\d\s\.]+)\)");
MatchCollection m=re.Matches(s);
m[0].Groups will have { XD=(111111), XD, 111111 }
m[1].Groups will have { XT=( 588.466), XT, 588.466 }
String[] rows = { "XD=(111111) XT=( 588.466)m3 YT=( .246)m3 G=( 3.6)V N=(X0000000000) M=(Y0000000000) O=(Z0000000000) Date=(06.01.01)Time=(00:54:55) Q=( .00)m3/hr",
"XD=(111 ) XT=( 588.466)m3 YT=( .009)m3 G=( 3.6)V N=(X0000000000) M=(Y0000000000) O=(Z0000000000) Date=(06.01.01)Time=(00:54:55) Q=( .00)m3/hr" };
foreach (String s in rows) {
MatchCollection Pair = Regex.Matches(s, @"
(\S+) # Match all non-whitespace before the = and store it in group 1
= # Match the =
(\([^)]+\S+) # Match the part in brackets and following non-whitespace after the = and store it in group 2
", RegexOptions.IgnorePatternWhitespace);
foreach (Match item in Pair) {
Console.WriteLine(item.Groups[1] + " => " + item.Groups[2]);
}
Console.WriteLine();
}
Console.ReadLine();
If you also want to extract the units, use this regex:
@"(\S+)=(\([^)]+(\S+))"
I added a set of brackets around the trailing \S+, so you will find the unit in item.Groups[3]
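For readers who are not on .NET, here is a rough Java sketch of the named-group approach from the first answer (key, value, optional unit). The regex is that answer's pattern collapsed onto one line. The class and method names (KeyValueParser, parse) and the choice of a map from key to a {value, unit} pair are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; names are illustrative only.
final class KeyValueParser {

    // key = ( value ) unit?  The unit must end at a word boundary and must
    // not be followed by "=", so "Time" after Date=(...) is not taken as a unit.
    private static final Pattern PAIR = Pattern.compile(
            "(?<key>\\b\\w+)\\s*=\\s*\\(\\s*(?<value>[^()]+)\\)"
                    + "(?<unit>[^\\s=]+\\b(?!\\s*=))?");

    // Returns key -> {value, unit} in input order; unit is "" when absent.
    static Map<String, String[]> parse(String line) {
        Map<String, String[]> result = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(line);
        while (m.find()) {
            String unit = m.group("unit"); // null when the optional group did not match
            result.put(m.group("key"),
                    new String[] {m.group("value").trim(), unit == null ? "" : unit});
        }
        return result;
    }

    public static void main(String[] args) {
        String row = "XD=(111 ) XT=( 588.466)m3 YT=( .009)m3 G=( 3.6)V N=(X0000000000) "
                + "M=(Y0000000000) O=(Z0000000000) Date=(06.01.01)Time=(00:54:55) Q=( .00)m3/hr";
        parse(row).forEach((k, v) -> System.out.println(k + " => " + v[0] + " " + v[1]));
    }
}
```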