Parse a string of pre-specified format in MATLAB


I would like to parse a string in MATLAB that has a prespecified format. For instance, I would like to parse the following string,
s = 'a(1,2)+b_c15_d.ext'
so that I can get the substring before '(' (i.e., 'a'), the numbers between the parentheses (i.e., 1 and 2), and so on (the substring 'b' between the '+' and '_', and the substring 'd' before the extension).
I can do it using regexp and split, but is there a more convenient way of doing this?

I could do it using regexp and 'split', but I hope there is a more convenient way of doing so.
textscan offers a simple, perhaps convenient, alternative to regexp and split.
a = textscan(s,'%c(%d,%d)+%c_%c%d_%c.ext');
The results are in order as:
>> a =
'a' [1] [2] 'b' 'c' [15] 'd'
If your substrings can be longer than one character, change %c to a suitable %[] (character string) specifier, depending on your needs. For example:
a = textscan(s,'%[^(](%d,%d)+%[^_]_%[^_]_%[^.].ext')
The %[^(] above collects characters until the string ends or a '(' is reached, and places the result in the first cell of the output. The '(' itself is still left in the input, so the literal '(' that follows in the format string reads it and discards it.
sscanf is another option, but manipulating the output format is more involved.
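For comparison, the same fixed layout can also be captured by a single pattern with named groups. The following is a minimal Python sketch rather than MATLAB code; the group names and the helper parse_name are made up for illustration, and it assumes the fields never contain the delimiter characters themselves.

```python
import re

# Hypothetical layout: <name>(<int>,<int>)+<part>_<letters><int>_<part>.ext
PATTERN = re.compile(
    r"(?P<name>[^(]+)\((?P<i>\d+),(?P<j>\d+)\)\+"
    r"(?P<first>[^_]+)_(?P<tag>[A-Za-z]+)(?P<num>\d+)_(?P<last>[^.]+)\.ext$"
)

def parse_name(s):
    """Return the fields of s as a dict, converting the numeric ones to int."""
    m = PATTERN.match(s)
    if m is None:
        raise ValueError("string does not follow the expected format: " + s)
    fields = m.groupdict()
    for key in ("i", "j", "num"):
        fields[key] = int(fields[key])
    return fields

result = parse_name("a(1,2)+b_c15_d.ext")
print(result)
```

Named groups make it harder to mix up the output positions, at the cost of a longer pattern.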

Using regexp and split really isn't that inconvenient; you can do it nicely in 2 lines:
r = regexp(s,'[()\+_(_c)(.ext)]', 'split');
r = r(~cellfun(@isempty, r))
>> r = {'a' '1,2' 'b' '15' 'd'}
I'm not sure if this is similar to what you'd already tried, since you didn't post any code.

Related

Find group of strings starting and ending by a character using regular expression

I have a string, and I want to extract, using regular expressions, groups of characters that are between the character : and the other character /.
Typically, here is an example of the string I'm getting:
'abcd:45.72643,4.91203/Rou:hereanotherdata/defgh'
So I want to retrieve 45.72643,4.91203 and also hereanotherdata, as they are both between the characters : and /.
I tried this syntax on an easier string where the pattern occurs only once,
[tt]=regexp(str,':(\w.*)/','match')
tt = ':45.72643,4.91203/'
but it works only if the pattern occurs once. If I use it on a string containing the pattern multiple times, I get everything between the first : and the last /.
How can I specify that the pattern will occur multiple times, and how can I retrieve each occurrence?

Use lookaround and a lazy quantifier:
regexp(str, '(?<=:).+?(?=/)', 'match')
Example (Matlab R2016b):
>> str = 'abcd:45.72643,4.91203/Rou:hereanotherdata/defgh';
>> result = regexp(str, '(?<=:).+?(?=/)', 'match')
result =
1×2 cell array
'45.72643,4.91203' 'hereanotherdata'
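The difference between the greedy attempt from the question and the lazy lookaround pattern can be reproduced in other regex engines as well. Here is a small Python sketch, assuming the same example string; the variable names are only illustrative.

```python
import re

s = "abcd:45.72643,4.91203/Rou:hereanotherdata/defgh"

# Greedy: .* runs to the LAST '/', so everything in between is one match.
greedy = re.findall(r":(.*)/", s)

# Lazy with lookaround: .+? stops at the FIRST '/' after each ':'.
lazy = re.findall(r"(?<=:).+?(?=/)", s)

# Equivalent alternative: a negated character class instead of a lazy quantifier.
negated = re.findall(r":([^/]+)/", s)

print(greedy)
print(lazy)
print(negated)
```

The negated class is often easier to read than the lookaround form, since it states directly that the captured text may not contain the closing delimiter.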

In most languages this is hard to do with a single regexp. Ultimately you'll only ever get back the one string, and you want to get back multiple strings.
I've never used Matlab, so it may be possible in that language, but based on other languages, this is how I'd approach it...
I can't give you the exact code, but a search indicates that in Matlab there is a function called strsplit, example...
C = strsplit(data,':')
That should break your original string up into an array of strings, using the ":" as the break point. You can then ignore the first array element (as it contains the text before the first ":"), loop over the rest of the array, and use a regexp to extract everything that comes before a "/".
So for instance...
'abcd:45.72643,4.91203/Rou:hereanotherdata/defgh'
Breaks down into an array with parts...
1 - 'abcd'
2 - '45.72643,4.91203/Rou'
3 - 'hereanotherdata/defgh'
Then ignore 1, and extract everything before the "/" in 2 and 3.
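The split-then-trim idea described above can be sketched in a few lines. This is a Python illustration, not MATLAB code; the function name between and its parameters are hypothetical.

```python
def between(text, start=":", end="/"):
    """Return the pieces of text found after each start and before the next end."""
    # Drop the first piece: it is the text before the first start delimiter.
    pieces = text.split(start)[1:]
    # Keep only pieces that actually contain an end delimiter, cut at the first one.
    return [p.split(end, 1)[0] for p in pieces if end in p]

parts = between("abcd:45.72643,4.91203/Rou:hereanotherdata/defgh")
print(parts)
```

A piece with no closing delimiter is skipped, so a trailing ":" with nothing terminating it does not produce a spurious result.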

As John Mawer and Adriaan mentioned, strsplit is a good place to start. You could split on both ':' and '/' at once, but then you would not be able to tell which delimiter each piece followed. If you apply strsplit twice, you can still tell where each ':' was:
A='abcd:45.72643,4.91203/Rou:hereanotherdata/defgh';
B=cellfun(@(x) strsplit(x,'/'),strsplit(A,':'),'uniformoutput',0);
Now every cell of B after the first holds a piece that followed a ':', and any piece that contained a '/' has itself been split into more than one sub-cell. You can extract the result by keeping the cells of B that have more than one element and taking the first element of each:
C=cellfun(@(x) x{1},B(cellfun('length',B)>1),'uniformoutput',0)
C =
1×2 cell array
'45.72643,4.91203' 'hereanotherdata'

Starting in R2016b you can use extractBetween:
>> str = 'abcd:45.72643,4.91203/Rou:hereanotherdata/defgh';
>> result = extractBetween(str,':','/')
result =
2×1 cell array
{'45.72643,4.91203'}
{'hereanotherdata' }
If all your text elements have the same number of delimiters this can be vectorized too.

Regex: allow for the occurrence of a certain character up to one time

I want to search for a specific (DNA) string 'AGCTAGCT' and allow for the occurrence of one (and only one) mismatch (signified as 'N').
The following are matches (no or one N):
AGCTAGCT
NGCTAGCT
AGCNAGCT
The following are not matches (two or more Ns):
AGNTAGCN
AGNTANCN

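Use a negative lookahead at the start to reject any string that contains two or more N's.
^(?!.*?N.*N)[AGCTN]{8}$
I assumed that your string contains only the letters A, G, C, T and N. If the length is not fixed at 8:
^(?!.*?N.*N)[AGCTN]+$
Or simply like this,
^(?!.*?N.*N).+$
Note that these patterns only limit the number of N's; they do not check the other letters against AGCTAGCT.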

In any language you could count the N's and compare the count. In JavaScript, for example:
var count = (str.match(/N/g) || []).length; // match() returns null when there is no N
if (count <= 1) {
    // str is valid
}
If you only want a regex, you could use this regex
/^[^N]*N?[^N]*$/
You can test if the string matches the above regex or not.
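Counting the N's only limits the number of mismatches; it does not check that the remaining letters agree with the target sequence. A possible way to combine both checks is sketched below in Python. The function name matches_with_one_n is made up for illustration, and the sketch assumes that 'N' is the only mismatch marker.

```python
import re

def matches_with_one_n(candidate, target="AGCTAGCT"):
    """True if candidate equals target, with at most one position replaced by 'N'."""
    if candidate.count("N") > 1:
        return False
    # Each position accepts either the expected base or the mismatch marker.
    pattern = "".join("[" + re.escape(base) + "N]" for base in target)
    return re.fullmatch(pattern, candidate) is not None

for seq in ["AGCTAGCT", "NGCTAGCT", "AGCNAGCT", "AGNTAGCN", "AGNTANCN"]:
    print(seq, matches_with_one_n(seq))
```

The count check does the "at most one" part and the generated pattern does the positional comparison, which keeps the regex itself trivial.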

If you are using Python, you can do it without a regex:
myList = []
for word in dna:  # dna is assumed to be a list of candidate strings
    if word.count('N') < 2:
        myList.append(word)
If you also want to generate candidate strings, itertools can help. I don't know which letter combinations are valid DNA, but this prints every ordering of the five letters:
import itertools
letters = ['A', 'G', 'C', 'T', 'N']
for letter in itertools.permutations(letters):
    print(''.join(letter))
Note that permutations uses each letter exactly once per result; for sequences with repeated letters you would need itertools.product(letters, repeat=n) instead.

I think a regular expression is not the best choice for doing this. I say that because (at least to my knowledge) there is no easy way to express "match this arbitrary string with at most one mistake" other than explicitly listing all the possible mistakes.
That being said, it would be something like this:
AGCTAGCT|NGCTAGCT|ANCTAGCT|AGNTAGCT|AGCNAGCT|AGCTNGCT|AGCTANCT|AGCTAGNT|AGCTAGCN
Maybe it can be simplified a bit.
EDIT
Given that N is a mismatch, a regular expression to accept what you want should replace each N with the wrong alternatives.
AGCTAGCT|[GCT]GCTAGCT|A[ACT]CTAGCT|AG[AGT]TAGCT|AGC[AGC]AGCT
|AGCT[GCT]GCT|AGCTA[ACT]CT|AGCTAG[AGT]T|AGCTAGC[AGC]
Simplifying...
(A(G(C(T(A(G(C(T|[AGC])|[AGT]T)|[ACT]CT)|[GCT]GCT)|[AGC]AGCT)|[AGT]TAGCT)|[ACT]CTAGCT)|[GCT]GCTAGCT)
Demo replacing N with wrong choices https://regex101.com/r/bB0gX1/1.
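Writing out the alternation by hand is error-prone, so it may be easier to generate it. The Python sketch below builds the unsimplified form (one alternative per position, with that position replaced by the "wrong" bases). The function name one_mismatch_regex is hypothetical, and the sketch assumes the target consists only of letters from the given alphabet.

```python
import re

def one_mismatch_regex(target, alphabet="ACGT"):
    """Compile a regex matching target exactly or with one wrong base."""
    alternatives = [target]  # the exact match comes first
    for i, base in enumerate(target):
        wrong = "".join(ch for ch in alphabet if ch != base)
        # Same string, but position i must be one of the other bases.
        alternatives.append(target[:i] + "[" + wrong + "]" + target[i + 1:])
    return re.compile("|".join(alternatives))

rx = one_mismatch_regex("AGCTAGCT")
print(rx.pattern)
print(bool(rx.fullmatch("TGCTAGCT")), bool(rx.fullmatch("TTCTAGCT")))
```

The generated pattern grows linearly with the length of the target, so this stays manageable for short motifs.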

Split a concatenated field delimited by pipe

I have a field whose value is a concatenated set of fields delimited by | (pipe),
Note: the escape character is also a pipe.
Given:
AB|||1|BC||DE
Required:
["AB|","1","BC|DE"]
How can I split the given string into an array or list without iterating character by character (i.e. using regex or any other method) to get what is required?

If there's an unused character you can substitute for the doubled-pipe you could do this:
groovy:000> s = "AB|||1|BC||DE"
===> AB|||1|BC||DE
groovy:000> Arrays.asList(s.replaceAll('\\|\\|', '#').split('\\|'))*.replaceAll(
'#', '|')
===> [AB|, 1, BC|DE]
Cleaned up with a magic char sequence and using tokenize it would look like:
pipeChars = 'ZZ' // or whatever
s.replaceAll('\\|\\|', pipeChars).tokenize('|')*.replaceAll(pipeChars, '|')
Of course this assumes that it's valid to go left-to-right across the string grouping the pipes into pairs, so each pair becomes a single pipe in the output, and the left-over pipes become the delimiters. When you start with something like
['AB|', '|1', 'BC|DE']
which gets encoded as
AB|||||1|BC||DE
then the whole encoding scheme falls apart: it's entirely unclear how to group the pairs of pipes in order to recover the original values. 'X|||||Y' could have been generated by ['X|','|Y'] or ['X||', 'Y'] or ['X', '||Y'], and there is no way to know which it was.
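The left-to-right pairing described above can also be expressed directly as a regex, without a placeholder character. The following Python sketch (not Groovy) is one way to do it; the function name split_escaped_pipes is made up, and the sketch assumes that empty fields never occur, since they would be silently dropped.

```python
import re

def split_escaped_pipes(s):
    """Split on single pipes, treating each '||' pair (read left to right) as a literal pipe."""
    # A field is a run of doubled pipes and non-pipe characters.
    fields = re.findall(r"(?:\|\||[^|])+", s)
    return [f.replace("||", "|") for f in fields]

print(split_escaped_pipes("AB|||1|BC||DE"))
print(split_escaped_pipes("AB|||||1|BC||DE"))
```

The second call shows the ambiguity in practice: the encoded form of ['AB|', '|1', 'BC|DE'] decodes to a different list, because the pairing cannot know where the first field was meant to end.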

You could use the split('|') method, but from what you provided it looks like the '|' character can also appear in the field values. Is there any chance you can change the delimiter to a character that does not occur in the values?

Selectively deleting commas and splitting strings

I have a string that looks like
txt = '"EMB","iShares J,P. Morg",110.81,N/A'
I'm using strsplit(txt,','); to break this into separate strings based on the comma delimiter. However I want to ignore the comma between the 'J' and the 'P', since it's not a delimiter; it's just part of the name.
Is there a way I can say "if a comma is between two quotes but there are other characters between the quotes, delete the comma"?

Here's an equivalent regexp one-liner:
C = regexp(txt, '("[^"]*")|([^,"]+)', 'match')
The result is a cell array with already split strings. Unfortunately, I don't have MATLAB R2013, so I cannot benchmark this versus strsplit.

A silly (but functional) answer:
inquotes=false;
keep=true(1,length(txt));
for v=1:length(txt)
    if (txt(v)=='"')
        inquotes=~inquotes;
    elseif (txt(v)==',' && inquotes)
        keep(v)=false;
    end
end
txt=txt(keep);
tt=strsplit(txt,',');
This removes every comma that falls inside quotes, so that you can then use strsplit. That is what I understood you want to do, correct?
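Since the input follows the usual CSV quoting convention, a CSV parser handles the quoted comma without any manual bookkeeping. As a point of comparison, here is a minimal Python sketch using the standard csv module on the string from the question; unlike the loop above, it keeps the comma inside the name and strips the surrounding quotes.

```python
import csv

txt = '"EMB","iShares J,P. Morg",110.81,N/A'

# csv.reader expects an iterable of lines, so wrap the single string in a list.
fields = next(csv.reader([txt]))
print(fields)
```

All fields come back as strings, so numeric columns such as 110.81 still need an explicit conversion afterwards.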

lua gsub special replacement producing invalid capture index

I have a piece of lua code (executing in Corona):
local loginstr = "emailAddress={email} password={password}"
print(loginstr:gsub( "{email}", "tester@test.com" ))
This code generates the error:
invalid capture index
While I now know it is because of the curly braces not being specified appropriately in the gsub pattern, I don't know how to fix it.
How should I form the gsub pattern so that I can replace the placeholder string with the email address value?
I've looked around on all the lua-oriented sites I can find but most of the documentation seems to revolve around unassociated situations.

As I've suggested in the comments above, when the e-mail is encoded as a URL parameter, the %40 used to encode the '@' character will be interpreted as a capture index. Since the search pattern doesn't have any captures (let alone 40 of them), this will cause a problem.
There are two possible solutions: you can either decode the encoded string, or encode your replacement string to escape the '%' character in it. Depending on what you are going to do with the end result, you may need to do both.
The following routine (which I picked up from here; not tested) can decode an encoded string:
function url_decode(str)
    str = string.gsub(str, "+", " ")
    str = string.gsub(str, "%%(%x%x)",
        function(h) return string.char(tonumber(h,16)) end)
    str = string.gsub(str, "\r\n", "\n")
    return str
end
For escaping the % character in string str, you can use:
str:gsub("%%", "%%%%")
The '%' character is escaped as '%%', and it needs to be escaped in both the search pattern and the replacement string (hence the number of % characters in the replacement).

Are you sure your problem isn't that you're trying to gsub on loginurl rather than loginstr?
Your code gives me this error (see http://ideone.com/wwiZk):
lua: prog.lua:2: attempt to index global 'loginurl' (a nil value)
and that sounds similar to what you're seeing. Just fixing it to use the right variable:
print(loginstr:gsub( "{email}", "tester@test.com" ))
says (see http://ideone.com/mMj0N):
emailAddress=tester@test.com password={password}
as desired.

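I had the same problem with a '%' in the replacement value, so you need to escape the value with value:gsub("%%", "%%%%").
Example of replacing "SOME_VALUE" in JSON (the extra parentheses keep only gsub's first return value, so that its match count is not passed on as the replacement limit):
local resultJSON = json:gsub("\"SOME_VALUE\"", (value:gsub("%%", "%%%%")))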
