matlab regexprep multiple strings with multiple numbers - regex

given a cell array of strings, I want to build one regexprep rule, so that different string types are replaced by a certain number. I.e:
my_cell = {'ok', 'ok', 'bad', 'broken', 'bad', 'broken', 'ok'};
I know how to replace each string type one by one, i.e:
my_cell = regexprep(my_cell,'ok$','1');
but ideally I would like to build one rule, so that ok will be replaced with 1, bad will be replaced with 0 and broken will be replaced with -1.
any hints on how to do this?

How about:
>> my_cell = regexprep(my_cell,{'ok$','bad$','broken$'},{'1','0','-1'});

There's documentation here: http://www.mathworks.co.uk/help/techdoc/ref/regexprep.html
It gives the syntax as: s = regexprep('str', 'expr', 'repstr')
It also says: "If both expr and repstr are cell arrays of strings, then expr and repstr must contain the same number of elements, and regexprep pairs each repstr element with its matching element in expr."
Therefore you could try something like this:
my_cell = regexprep(my_cell, {'^ok$', '^bad$', '^broken$'}, {'1', '0', '-1'});
(Untested)

Related

Find group of strings starting and ending by a character using regular expression

I have a string, and I want to extract, using regular expressions, groups of characters that are between the character : and the other character /.
typically, here is a string example I'm getting:
'abcd:45.72643,4.91203/Rou:hereanotherdata/defgh'
and so, I want to retrieved, 45.72643,4.91203 and also hereanotherdata
As they are both between characters : and /.
I tried with this syntax in a easier string where there is only 1 time the pattern,
[tt]=regexp(str,':(\w.*)/','match')
tt = ':45.72643,4.91203/'
but it works only if the pattern happens once. If I use it in string containing multiples times the pattern, I get all the string between the first : and the last /.
How can I mention that the pattern will occur multiple time, and how can I retrieve it?
Use lookaround and a lazy quantifier:
regexp(str, '(?<=:).+?(?=/)', 'match')
Example (Matlab R2016b):
>> str = 'abcd:45.72643,4.91203/Rou:hereanotherdata/defgh';
>> result = regexp(str, '(?<=:).+?(?=/)', 'match')
result =
1×2 cell array
'45.72643,4.91203' 'hereanotherdata'
In most languages this is hard to do with a single regexp. Ultimately you'll only ever get back the one string, and you want to get back multiple strings.
I've never used Matlab, so it may be possible in that language, but based on other languages, this is how I'd approach it...
I can't give you the exact code, but a search indicates that in Matlab there is a function called strsplit, example...
C = strsplit(data,':')
That should will break your original string up into an array of strings, using the ":" as the break point. You can then ignore the first array index (as it contains text before a ":"), loop the rest of the array and regexp to extract everything that comes before a "/".
So for instance...
'abcd:45.72643,4.91203/Rou:hereanotherdata/defgh'
Breaks down into an array with parts...
1 - 'abcd'
2 - '45.72643,4.91203/Rou'
3 - 'hereanotherdata/defgh'
Then Ignore 1, and extract everything before the "/" in 2 and 3.
As John Mawer and Adriaan mentioned, strsplit is a good place to start with. You can use it for both ':' and '/', but then you will not be able to determine where each of them started. If you do it with strsplit twice, you can know where the ':' starts :
A='abcd:45.72643,4.91203/Rou:hereanotherdata/defgh';
B=cellfun(#(x) strsplit(x,'/'),strsplit(A,':'),'uniformoutput',0);
Now B has cells that start with ':', and has two cells in each cell that contain '/' also. You can extract it with checking where B has more than one cell, and take the first of each of them:
C=cellfun(#(x) x{1},B(cellfun('length',B)>1),'uniformoutput',0)
C =
1×2 cell array
'45.72643,4.91203' 'hereanotherdata'
Starting in 16b you can use extractBetween:
>> str = 'abcd:45.72643,4.91203/Rou:hereanotherdata/defgh';
>> result = extractBetween(str,':','/')
result =
2×1 cell array
{'45.72643,4.91203'}
{'hereanotherdata' }
If all your text elements have the same number of delimiters this can be vectorized too.

Regex: allow for the occurrence of a certain character up to one time

I want to search for a specific (DNA) string 'AGCTAGCT' and allow for the occurrence of one (and only one) mismatch (signified as 'N').
The following are matches (no or one N):
AGCTAGCT
NGCTAGCT
AGCNAGCT
The following are not matches (two or more Ns):
AGNTAGCN
AGNTANCN
Use negative lookahead at the start to check for the strings whether it contains two N's or not.
^(?!.*?N.*N)[AGCTN]{8}$
I assumed that you string contains only A,G,C,T,N letters.
^(?!.*?N.*N)[AGCTN]+$
Or simply like this,
^(?!.*?N.*N).+$
DEMO
In any language you could do something like this
var count = str.match(/N/g).length; // just count the number of N in the string
if(count == 1 || count == 0) { // and compare it
// str valid
}
If you only want a regex, you could use this regex
/^[^N]*N?[^N]*$/
You can test if the string matches the above regex or not.
if you are using python, you can make it without regex:
myList = []
for word in dna :
if word.count('N') < 2 :
myList.append(word)
and now, if you want to generate all the DNA, i dont know how DNA takes letters, but this can save you:
import itertools
letters = ['A', 'G', 'C', 'T', 'N']
for letter in itertools.permutations(letters):
print ''.join(letter)
then, you will have all the permutations you can have from the four letters.
I think a regular expression is not the best choice for doing this. I say that because (at least to my knowledge) there is no easy way to express an arbitrary string to match with at most one mistake, other than explicitly considering all the possible mistakes.
being said that, it'd be something like this
AGCTAGCT|NGCTAGCT|ANCTAGCT|AGNTAGCT|AGCNAGCT|AGCTNGCT|AGCTANCT|AGCTAGNT|AGCTAGCN
maybe it can be simplified a bit.
EDIT
Given that N is a mismatch, a regular expression to accept what you want should replace each N with the wrong alternatives.
AGCTAGCT|[GCT]GCTAGCT|A[ACT]CTAGCT|AG[AGT]TAGCT|AGC[AGC]AGCT
|AGCT[GCT]GCT|AGCTA[ACT]CT|AGCTAG[AGT]T|AGCTAGC[AGC]
Simplifying...
(A(G(C(T(A(G(C(T|[AGC])|[AGT]T)|[ACT]CT)|[GCT]GCT)|[AGC]AGCT)|[AGT]TAGCT)|[ACT]CTAGCT)|[GCT]GCTAGCT)
Demo replacing N with wrong choices https://regex101.com/r/bB0gX1/1.

Subsetting a string based on pre- and suffix

I have a column with these type of names:
sp_O00168_PLM_HUMAM
sp_Q8N1D5_CA158_HUMAN
sp_Q15818_NPTX1_HUMAN
tr_Q6FGH5_Q6FGH5_HUMAN
sp_Q9UJ99_CAD22_HUMAN
I want to remove everything before, and including, the second _ and everything after, and including, the third _.
I do not which to remove based on number of characters, since this is not a fixed number.
The output should be:
PLM
CA158
NPTX1
Q6FGH5
CAD22
I have played around with these, but don't quite get it right..
library(stringer)
str_sub(x,-6,-1)
That’s not really a subset in programming terminology1, it’s a substring. In order to extract partial strings, you’d usually use regular expressions (pretty much regardless of language); in R, this is accessible via sub and other related functions:
pattern = '^.*_.*_([^_]*)_.*$'
result = sub(pattern, '\\1', strings)
1 Aside: taking a subset is, as the name says, a set operation, and sets are defined by having no duplicate elements and there’s no particular order to the elements. A string by contrast is a sequence which is a very different concept.
Another possible regular expression is this:
sub("^(?:.+_){2}(.+?)_.+", "\\1", vec)
# [1] "PLM" "CA158" "NPTX1" "Q6FGH5" "CAD22"
where vec is your vector of strings.
A visual explanation:
> gsub(".*_.*_(.*)_.*", "\\1", "sp_O00168_PLM_HUMAM")
[1] "PLM"

Part of as string from a string using regular expressions

I have a string of 5 characters out of which the first two characters should be in some list and next three should be in some other list.
How could i validate them with regular expressions?
Example:
List for First two characters {VBNET, CSNET, HTML)}
List for next three characters {BEGINNER, EXPERT, MEDIUM}
My Strings are going to be: VBBEG, CSBEG, etc.
My regular expression should find that the input string first two characters could be either VB, CS, HT and the rest should also be like that.
Would the following expression work for you in a more general case (so that you don't have hardcoded values): (^..)(.*$)
- returns the first two letters in the first group, and the remaining letters in the second group.
something like this:
^(VB|CS|HT)(BEG|EXP|MED)$
This recipe works for me:
^(VB|CS|HT)(BEG|EXP|MED)$
I guess (VB|CS|HT)(BEG|EXP|MED) should do it.
If your strings are as well-defined as this, you don't even need regex - simple string slicing would work.
For example, in Python we might say:
mystring = "HTEXP"
prefix = mystring[0:2]
suffix = mystring[2:5]
if (prefix in ['HT','CS','VB']) AND (suffix in ['BEG','MED','EXP']):
pass # valid!
else:
pass # not valid. :(
Don't use regex where elementary string operations will do.

Regular expressions, what a trouble!

I need your kind help to resolve this question.
I state that I am not able to use regolar expressions with Oracle PL/SQL, but I promise that I'll study them ASAP!
Please suppose you have a table with a column called MY_COLUMN of type VARCHAR2(4000).
This colums is populated as follows:
Description of first no.;00123457;Description of 2nd number;91399399119;Third Descr.;13456
You can see that the strings are composed by couple of numbers (which may begin with zero), and strings (containing all alphanumeric characters, and also dot, ', /, \, and so on):
Description1;Number1;Description2;Number2;Description3;Number3;......;DescriptionN;NumberN
Of course, N is not known, this means that the number of couples for every record can vary from record to record.
In every couple the first element is always the number (which may begin with zero, I repeat), and the second element is the string.
The field separator is ALWAYS semicolon (;).
I would like to transform the numbers as follows:
00123457 ===> 001-23457
91399399119 ===> 913-99399119
13456 ===> 134-56
This means, after the first three digits of the number, I need to put a dash "-"
How can I achieve this using regular expressions?
Thank you in advance for your kind cooperation!
I don't know Oracle/PL/SQL, but I can provide a regex:
([[:digit:]]{3})([[:digit:]]+)
matches a number of at least four digits and remembers the first three separately from the rest.
RegexBuddy constructs the following code snippet from this:
DECLARE
result VARCHAR2(255);
BEGIN
result := REGEXP_REPLACE(subject, '([[:digit:]]{3})([[:digit:]]+)', '\1-\2', 1, 0, 'c');
END;
If you need to make sure that those numbers are always directly surrounded by ;, you can alter this slightly:
(^|;)([[:digit:]]{3})([[:digit:]]+)(;|$)
However, this will not work if two numbers can directly follow each other (12345;67890 will only match the first number). If that's not a problem, use
result := REGEXP_REPLACE(subject, '(^|;)([[:digit:]]{3})([[:digit:]]+)(;|$)', '\1\2-\3\4', 1, 0, 'c');