Split a DNA sequence into a list of codons with D

DNA strings consist of an alphabet of four characters: A, C, G, and T.
Given a string,
ATGTTTAAA
I would like to split it into its constituent codons
ATG TTT AAA
codons = ["ATG","TTT","AAA"]
Codons encode amino acids, and the code is redundant (http://en.wikipedia.org/wiki/DNA_codon_table).
I have a DNA string in D and would like to split it into a range
of codons and later translate/map the codons to amino acids.
std.algorithm has a splitter function, which requires a delimiter, and the
std.regex splitter function requires a regex to split the string.
Is there an idiomatic approach to splitting a string without a delimiter?

Looks like you are looking for chunks:
import std.range : chunks;
import std.encoding : AsciiString;
import std.algorithm : map;
AsciiString ascii(string literal)
{
return cast(AsciiString) literal;
}
void main()
{
auto input = ascii("ATGTTTAAA");
auto codons = input.chunks(3);
auto aminoacids = codons.map!(
(codon) {
if (codon == ascii("ATG"))
return "M";
// ...
}
);
}
Please note that I am using AsciiString (http://dlang.org/phobos/std_encoding.html#.AsciiString) here instead of plain string literals. This avoids the costly UTF-8 decoding that is done for string and is never needed for an actual DNA sequence. I remember this making a notable performance difference in similar bioinformatics code.

If you just want groups of 3 characters, you can use std.range.chunks.
import std.conv : to;
import std.range : chunks;
import std.algorithm : map, equal;
enum seq = "ATGTTTAAA";
auto codons = seq.chunks(3).map!(x => x.to!string);
assert(codons.equal(["ATG", "TTT", "AAA"]));
The foreach type of the chunks is Take!string, so you may or may not need the map!(x => x.to!string), depending on how you want to use the results.
For example, if you just want to print them:
import std.stdio : writeln;
foreach (codon; "ATGTTTAAA".chunks(3)) { writeln(codon); }

import std.algorithm;
import std.regex;
import std.stdio;
int main()
{
auto seq = "ATGTTTAAA";
auto rex = regex(r"[ACGT]{3}");
auto codons = matchAll(seq, rex).map!"a[0]";
writeln(codons);
return 0;
}
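For readers who want to see the chunk-then-map idea outside D, here is a rough Python sketch of the same two steps (split into fixed-size pieces, then look each piece up). It is only an illustration: the names codons, translate, and PARTIAL_TABLE are made up for this example, the table holds just three of the 64 codons, and mapping unknown codons to "X" is an arbitrary choice.

```python
# Illustrative sketch (Python, not D): split into codons, then map to amino acids.
# PARTIAL_TABLE is deliberately incomplete; a full codon table has 64 entries.
PARTIAL_TABLE = {"ATG": "M", "TTT": "F", "AAA": "K"}


def codons(seq, size=3):
    """Return consecutive fixed-size chunks; the last one may be shorter."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]


def translate(seq):
    """Map each codon to its amino acid letter, or "X" if it is not in the table."""
    return "".join(PARTIAL_TABLE.get(codon, "X") for codon in codons(seq))


print(codons("ATGTTTAAA"))     # ['ATG', 'TTT', 'AAA']
print(translate("ATGTTTAAA"))  # MFK
```

As with chunks in D, no delimiter is involved; the split positions come purely from the chunk size.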

Related

Replace 3rd+ occurrence

How can I replace the third and later occurrences of a string? I want to keep the first and second newlines but replace all others with an empty string. I am using Dart, but the question itself is language-agnostic.
I want to convert
a
b
c
d
e
f
to
a
b
cdef
Use the following:
const s = 'a\nb\nc\nd\ne\nf'
const chunks = s.split('\n')
console.log( [chunks[0], chunks[1], chunks.slice(2).join("") ].join("\n"))
This splits on the newline character first, then keeps the first two items and joins the rest with an empty string; the three pieces are then joined with newlines.
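Since the question is language-agnostic, the same split-and-rejoin idea can be sketched in Python as well. This is a small variation rather than a translation: it relies on the maxsplit argument of str.split so that everything after the second newline stays in one piece. The function name keep_first_newlines and its keep parameter are invented for this example.

```python
def keep_first_newlines(text, keep=2):
    """Keep the first `keep` newlines and delete every later one."""
    # Split at most `keep` times, so the last piece holds the whole remainder.
    parts = text.split("\n", keep)
    if len(parts) > keep:
        # Only the remainder can still contain newlines; strip them there.
        parts[-1] = parts[-1].replace("\n", "")
    return "\n".join(parts)


print(keep_first_newlines("a\nb\nc\nd\ne\nf"))
```

Input with fewer than three lines comes back unchanged, because there is no remainder to flatten.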
It is maybe not the best solution, but it works:
void main(List<String> arguments) {
final input = '''
a
b
c
d
e
f
''';
final output = input.replaceFirstMapped(
RegExp(r"(([^\n]+\n){2})((.|\n)+)"),
(m) => "${m.group(1)!}${m.group(3)!.replaceAll("\n", "")}",
);
print(output);
}

Filter strings with regex

I need to filter (select) strings that follow certain rules, print them, and count the number of filtered strings. The input is a big string and I need to apply the following rules to each line:
line must not contain any of ab, cd, pq, or xy
line must contain any of the vowels
line must contain a letter that repeats itself, like aa, ff, yy etc
I'm using the regex crate and it provides regex::RegexSet so I can combine multiple rules. The rules I added are as follows
let regexp = regex::RegexSet::new(&[
r"^((?!ab|cd|pq|xy).)*", // rule 1
r"((.)\1{9,}).*", // rule 3
r"(\b[aeiyou]+\b).*", // rule 2
])
But I don't know how to use these rules to filter the lines and iterate over them.
pub fn p1(lines: &str) -> u32 {
lines
.split_whitespace().filter(|line| { /* regex filter goes here */ })
.map(|line| println!("{}", line))
.count() as u32
}
Also the compiler says that the crate doesn't support look-around, including look-ahead and look-behind.
If you're looking to use a single regex, then doing this via the regex crate (which, by design, and as documented, does not support look-around or backreferences) is probably not possible. You could use a RegexSet, but implementing your third rule would require using a regex that lists every repetition of a Unicode letter. This would not be as bad if you were okay limiting this to ASCII, but your comments suggest this isn't acceptable.
So I think your practical options here are to either use a library that supports fancier regex features (such as fancy-regex for a pure Rust library, or pcre2 if you're okay using a C library), or write just a bit more code:
use regex::Regex;
fn main() {
let corpus = "\
baz
ab
cwm
foobar
quux
foo pq bar
";
let blacklist = Regex::new(r"ab|cd|pq|xy").unwrap();
let vowels = Regex::new(r"[aeiouy]").unwrap();
let it = corpus
.lines()
.filter(|line| !blacklist.is_match(line))
.filter(|line| vowels.is_match(line))
.filter(|line| repeated_letter(line));
for line in it {
println!("{}", line);
}
}
fn repeated_letter(line: &str) -> bool {
let mut prev = None;
for ch in line.chars() {
if prev.map_or(false, |prev| prev == ch) {
return true;
}
prev = Some(ch);
}
false
}
Playground link: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=c0928793474af1f9c0180c1ac8fd2d47
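To show what the third rule looks like in an engine that does support backreferences, here is a hedged sketch using Python's re module. It is not a suggestion to switch languages, only an illustration of the (.)\1 pattern that the regex crate rejects. The names FORBIDDEN, VOWEL, DOUBLED, is_selected, and select are invented here, and, like repeated_letter above, the repeat check accepts any repeated character, not only letters.

```python
import re

FORBIDDEN = re.compile(r"ab|cd|pq|xy")  # rule 1: none of these pairs may appear
VOWEL = re.compile(r"[aeiouy]")         # rule 2: at least one vowel (y included, as above)
DOUBLED = re.compile(r"(.)\1")          # rule 3: a character immediately followed by itself


def is_selected(line):
    """Return True when the line passes all three rules."""
    return (not FORBIDDEN.search(line)
            and bool(VOWEL.search(line))
            and bool(DOUBLED.search(line)))


def select(text):
    """Return the lines of `text` that pass all three rules."""
    return [line for line in text.splitlines() if is_selected(line)]


corpus = "baz\nab\ncwm\nfoobar\nquux\nfoo pq bar\n"
selected = select(corpus)
print(selected, len(selected))  # ['foobar', 'quux'] 2
```

Keeping the three rules as separate patterns mirrors the chained filter calls in the Rust version and avoids the look-ahead from the question entirely.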

Jmeter Regular expression match number

I have two values to correlate, and I am able to capture them in two parameters successfully. I am taking random values using -1 as the match number, but what I actually want is this: if my first value randomly takes match number 7, my second value should take the same match number, 7.
Please help me simulate this.
Unfortunately (as you've discovered), JMeter picks the random match independently for each extractor. What you'll need to do is capture every potential value (with a match number of -1) for both var1 and var2. Then, after your regex extractors, add a BeanShell PostProcessor that gets a random number n and picks the nth var1 and var2:
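import java.util.Random;
int matchCount = Integer.parseInt(vars.get("var1_name_matchNr"));
// Match variables are numbered from 1, so shift the 0-based random index by one
String random_number = String.valueOf(new Random().nextInt(matchCount) + 1);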
vars.put("var1_name_chosen",vars.get("var1_name_" + random_number));
vars.put("var2_name_chosen",vars.get("var2_name_" + random_number));
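The core of this approach is that a single random match number is drawn once and then used for both variables. The following Python sketch models only that logic, with a plain dict standing in for JMeter's vars object; the names pick_pair, fake_vars, id, and link are hypothetical and exist only for this example.

```python
import random


def pick_pair(variables, first, second, rng=random):
    """Pick one random match number and read both extractor results at it."""
    count = int(variables[first + "_matchNr"])
    n = rng.randint(1, count)  # match numbers run from 1 to count, inclusive
    return variables[f"{first}_{n}"], variables[f"{second}_{n}"]


# Hypothetical variables, shaped like the output of two extractors that captured all matches.
fake_vars = {
    "id_matchNr": "3", "id_1": "10", "id_2": "20", "id_3": "30",
    "link_matchNr": "3", "link_1": "/a", "link_2": "/b", "link_3": "/c",
}
print(pick_pair(fake_vars, "id", "link"))
```

Whatever number is drawn, the two returned values always come from the same match, which is the property the question asks for.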
If I understood correctly, you want to extract a random regex value and put it into two variables. If so, I would suggest doing something like this:
After you get the random regex value, add a BeanShell element that copies the value you got with the regex into the second variable.
So if your variable in the regex extractor is "foo1", just add a BeanShell sampler with:
vars.put("foo2", vars.get("foo1"));
EDIT:
This would be better as a Java sampler, but I think it should work in a BeanShell sampler as well.
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.jmeter.samplers.SampleResult;
import org.apache.jmeter.threads.JMeterContextService;
import java.util.ArrayList;
import java.util.Random;
String previousResponse = JMeterContextService.getContext()
.getPreviousResult().getResponseDataAsString();
String locationLinkRegex = "\"locationId\": (.+?),";
String myLocationId = RegexMethod(previousResponse, locationLinkRegex,
true);
String myLocationLink = RegexMethod(
previousResponse,
"\"locationId\": ".concat(myLocationId).concat(
", \"locationLink\":(.+?)\""), false);
JMeterContextService.getContext().getVariables()
.put("locationId", myLocationId);
JMeterContextService.getContext().getVariables()
.put("locationLink", myLocationLink);
private static String RegexMethod(String response, String regex,
Boolean random) {
Random ran = new Random();
String result = "No matcher!";
ArrayList<String> allMatches = new ArrayList<String>();
if (random) {
Matcher m = Pattern.compile(regex, Pattern.UNICODE_CASE).matcher(
response);
while (m.find()) {
// Collect only the captured id, not the whole match
allMatches.add(m.group(1));
}
result = allMatches.get(ran.nextInt(allMatches.size()));
} else {
Matcher m = Pattern.compile(regex, Pattern.UNICODE_CASE).matcher(
response);
m.find();
result = m.group(1);
}
return result;
}
Exception handling needs to be implemented as well...
EDIT2:
And the Regex method in a form that returns both values as CSV (can be used only if locationId is unique):
private static String RegexMethod(String response, String regex) {
    Random ran = new Random();
    ArrayList<String> allMatches = new ArrayList<String>();
    // Find every locationId:
    Matcher m1 = Pattern.compile(regex, Pattern.UNICODE_CASE).matcher(response);
    while (m1.find()) {
        allMatches.add(m1.group(1));
    }
    if (allMatches.isEmpty()) {
        return "No matcher!";
    }
    String id = allMatches.get(ran.nextInt(allMatches.size()));
    // Find the locationLink that belongs to the chosen id and return the CSV string:
    Matcher m2 = Pattern.compile("\"locationId\": " + Pattern.quote(id)
            + ", \"locationLink\":(.+?)\"", Pattern.UNICODE_CASE).matcher(response);
    return m2.find() ? id + "," + m2.group(1) : id + ",";
}

Regex for custom parsing

Regex isn't my strongest point. Let's say I need a custom parser that strips a string of any letters, extra minus signs, and extra decimal points.
For example, input string is "--1-2.3-gf5.47", the parser would return
"-12.3547".
I could only come up with variations of this:
string.replaceAll("[^(\\-?)(\\.?)(\\d+)]", "")
which removes the letters but retains everything else. Any pointers?
More examples:
Input: -34.le.78-90
Output: -34.7890
Input: df56hfp.78
Output: 56.78
Some rules:
Consider only the first negative sign before the first number, everything else can be ignored.
I'm trying to do this using Java.
Assume the negative sign, if there is one, will always occur before the
decimal point.
Just tested this on ideone and it seemed to work. The comments should explain the code well enough. You can copy/paste this into Ideone.com and test it if you'd like.
It might be possible to write a single regex pattern for it, but you're probably better off implementing something simpler/more readable like below.
The three examples you gave print out:
--1-2.3-gf5.47 -> -12.3547
-34.le.78-90 -> -34.7890
df56hfp.78 -> 56.78
import java.util.*;
import java.lang.*;
import java.io.*;
/* Name of the class has to be "Main" only if the class is public. */
class Ideone
{
public static void main (String[] args) throws java.lang.Exception
{
System.out.println(strip_and_parse("--1-2.3-gf5.47"));
System.out.println(strip_and_parse("-34.le.78-90"));
System.out.println(strip_and_parse("df56hfp.78"));
}
public static String strip_and_parse(String input)
{
//remove anything not a period or digit (including hyphens) for output string
String output = input.replaceAll("[^\\.\\d]", "");
//add a hyphen to the beginning of 'output' if the original string started with one
if (input.startsWith("-"))
{
output = "-" + output;
}
//if the string contains a decimal point, remove all but the first one by splitting
//the output string into two strings and removing all the decimal points from the
//second half
if (output.indexOf(".") != -1)
{
output = output.substring(0, output.indexOf(".") + 1)
+ output.substring(output.indexOf(".") + 1, output.length()).replaceAll("[^\\d]", "");
}
return output;
}
}
In terms of regex, the secondary, tertiary, etc., decimals seem tough to remove. However, this one should remove the additional dashes and alphas: (?<=.)-|[a-zA-Z]. (Hopefully the syntax is the same in Java; this is a Python regex but my understanding is that the language is relatively uniform).
That being said, it seems like you could just run a pretty short "finite state machine"-type piece of code to scan the string and rebuild the reduced string yourself like this:
a = "--1-2.3-gf5.47"
new_a = ""
dash = False
dot = False
nums = '0123456789'
for char in a:
    if char in nums:
        new_a = new_a + char # record a match to nums
        dash = True # since we saw a number first, turn on the dash flag, we won't use any dashes from now on
    elif char == '-' and not dash:
        new_a = new_a + char # if we see a dash and haven't seen anything else yet, we append it
        dash = True # activate the flag
    elif char == '.' and not dot:
        new_a = new_a + char # take the first dot
        dot = True # put up the dot flag
(Again, sorry for the syntax; I think you need some curly brackets around the statements vs. Python's indentation-only style.)

change string to match regex

Is it possible (or why is it not possible) to convert an input string into a string that matches a regex at the least Levenshtein distance?
For example, if 1234 is the string and ^([0-9]{6})$ is the regex, I need output something like 123412 (the output string matches the regex and is at distance 2 from the original string; there may be other such strings, but the first result will do).
How can I do this? (No brute force.)
Edit:
Alternatively, can I get only the Levenshtein distance (without the matching string)?
Or what other information, apart from a boolean (match or no match), can a regex give?
If you know about finite automata, you could construct one that represents your regex (there are libraries to do so). Then you run it on your string (1234) and end up in some state. From this state you do a breadth-first search until you reach an accept state. While searching, you keep track of which transitions (characters) you pass over. Those characters give you the shortest string (or one of the shortest) that completes your input into a match for the regex.
Added link: you may have a look at http://www.brics.dk/automaton/, which is an automaton library implemented at Aarhus University (BSD license).
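To make the idea concrete without any library, here is a minimal Python sketch, assuming the automaton is given as a plain dict of transitions. The function shortest_completion and the hand-built DFA for ^[0-9]{6}$ are illustrative only and are not part of the brics library. Note that this only appends characters to the input, so it finds the shortest completion, not the minimum Levenshtein distance in general.

```python
from collections import deque


def shortest_completion(transitions, start, accepting, prefix):
    """Return the shortest suffix that turns `prefix` into an accepted string.

    `transitions` maps state -> {char: next_state}. Returns None if the prefix
    falls off the automaton or no accepting state can be reached from it.
    """
    state = start
    for ch in prefix:
        state = transitions.get(state, {}).get(ch)
        if state is None:
            return None
    # Breadth-first search: the first accepting state found is the nearest one.
    queue = deque([(state, "")])
    seen = {state}
    while queue:
        current, suffix = queue.popleft()
        if current in accepting:
            return suffix
        for ch, nxt in sorted(transitions.get(current, {}).items()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, suffix + ch))
    return None


# Hand-built DFA for ^[0-9]{6}$: state i means "i digits read so far".
digits = "0123456789"
dfa = {i: {d: i + 1 for d in digits} for i in range(6)}
print(shortest_completion(dfa, 0, {6}, "1234"))  # 00
```

Here 1234 is completed to 123400, which is two insertions away from the input. Because transitions are visited in sorted order, ties are broken toward the smallest characters.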
Update: I have built what you seek with the automaton implementation from above. First, the ExtendedOperations class, which is in the same package as the other automaton classes because I needed to access some methods.
package dk.brics.automaton;
public class ExtendedOperations {
//Taken from Automaton.run and modified to just return state instead of accept (bool)
static State endState(String s, Automaton a)
{
if (!a.deterministic) a.determinize();
State p = a.initial;
for (int i = 0; i < s.length(); i++) {
p = p.step(s.charAt(i));
if (p == null) return null;
}
return p;
}
public static String getShortestCompletion(Automaton a, String partlyInput)
{
State e = endState(partlyInput, a);
if (e == null) return null;
return BasicOperations.getShortestExample(e, true);
}
}
Second, a small test sample:
package subsetautomaton;
import dk.brics.automaton.*;
public class Main {
public static void main(String[] args) {
RegExp re = new RegExp("[a-zA-Z0-9._%+-]+\\@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,6}");
Automaton a = re.toAutomaton();
System.out.println(ExtendedOperations.getShortestCompletion(a, "a"));
}
}
The example is a naive email address regular expression. Notice that ^ is implicit in the regular expression, and so is $. Also, @ is escaped with \ because it means 'any string' in this implementation.
The result of the example above is: @-.AA