How can I tell if there are three or more characters between matches in a regex? - regex

I'm using Ruby 2.1. I have this logic that looks for consecutive pairs of strings in a bigger string
results = line.scan(/\b((\S+?)\b.*?\b(\S+?))\b/)
My question is, how do I iterate over the list of results and print out whether there are three or more characters between the two strings? For instance if my string were
"abc def"
The above would produce
[["abc def", "abc", "def"]]
and I'd like to know whether there are three or more characters between "abc" and "def."

Use a quantifier for the spaces in between: \b((\S+?)\b\s{3,}\b(\S+?))\b
Also, the inner boundaries are not really needed:
\b((\S+?)\s{3,}(\S+?))\b

A straightforward way to check this is by running a separate regex:
results.select!{|x|p x[/\S+?\b(.*?)\b\S+?/,1].size}
This prints the size of the gap for each result.
Another way is to take the size of the captured groups and subtract them:
results = []
line.scan(/\b((\S+?)\b.*?\b(\S+?))\b/) do |s, group1, group2|
  results << $~ if s.size - group1.size - group2.size >= 3
end
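For readers outside Ruby, the subtract-the-sizes idea can be sketched in Python. This is only an illustration: the function name pairs_with_gap and the min_gap parameter are made up here, and the pattern is copied from the question.

```python
import re

# Same pattern as the Ruby scan in the question.
PAIR = re.compile(r"\b((\S+?)\b.*?\b(\S+?))\b")

def pairs_with_gap(line, min_gap=3):
    """Return (whole, first, second) tuples whose gap is at least min_gap characters."""
    kept = []
    for whole, first, second in PAIR.findall(line):
        # The gap is whatever is left of the whole match once both words are removed.
        gap = len(whole) - len(first) - len(second)
        if gap >= min_gap:
            kept.append((whole, first, second))
    return kept

print(pairs_with_gap("abc def"))    # one space between the words
print(pairs_with_gap("abc   def"))  # three spaces between the words
```

Only the second call keeps its pair, because the first gap is a single character.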

Related

Get just X number of strings from a comma separated cell

I'm having real trouble finding a succinct solution to this simple problem. Currently I have cells which contain many comma-separated items. I just want the first 5.
i.e. cell A1 =
text, another string, something else, here's another one, guess what another string here, and another, hello i'm another string, another string etc, etc, etccccc
and I'm trying to grab just the first 5 strings.
Beyond that, I wonder if I can incorporate a formula such as =LEN(A1)>20
Currently I do this with numerous formulas: =IFERROR(INDEX( SPLIT(C31,","),1)) then =IFERROR(INDEX( SPLIT(C31,","),2)) etc., then run the LEN formula above.
Is there a simpler solution? Thanks so much.
Try,
=split(replace(A1, find("|", SUBSTITUTE(A1, ", ", "|", 5)), len(A1), ""), ", ", false)
For Excel, with data in A1, in B1 enter:
=TRIM(MID(SUBSTITUTE($A1,",",REPT(" ",999)),COLUMNS($A:A)*999-998,999))
and copy across.
To get all 5 substrings into a single cell, use:
=LEFT(A1,FIND(CHAR(1),SUBSTITUTE(A1,",",CHAR(1),5))-1)
=ARRAY_CONSTRAIN(SPLIT(A1,","),1,5)
=REGEXEXTRACT(A1,"((?:.*?,){5})")
=REGEXEXTRACT(A1,REPT("(.*?),",5))
SPLIT to split by delimiter
ARRAY_CONSTRAIN to constrain the array
REGEX1 to extract 5 comma separated values
. Any character
.*?, Any character repeated unlimited number of times (? as little as possible) followed by a ,
{5} Quantifier
REPT to repeat strings
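Outside a spreadsheet, the same split-then-constrain idea can be sketched in Python. This is a minimal illustration rather than a replacement for the formulas above; the function name first_items and its parameters are hypothetical.

```python
def first_items(cell, n=5, sep=","):
    """Split on the delimiter, keep the first n pieces, and trim surrounding spaces."""
    return [item.strip() for item in cell.split(sep)[:n]]

cell = ("text, another string, something else, here's another one, "
        "guess what another string here, and another, hello i'm another string, "
        "another string etc, etc, etccccc")

items = first_items(cell)
print(items)

# The follow-up length check from the question, applied to each kept item.
print([len(item) > 20 for item in items])
```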

Find group of strings starting and ending by a character using regular expression

I have a string, and I want to extract, using regular expressions, groups of characters that are between the character : and the other character /.
Typically, here is an example of a string I'm getting:
'abcd:45.72643,4.91203/Rou:hereanotherdata/defgh'
So I want to retrieve 45.72643,4.91203 and also hereanotherdata,
as they are both between the characters : and /.
I tried this syntax on an easier string where the pattern occurs only once,
[tt]=regexp(str,':(\w.*)/','match')
tt = ':45.72643,4.91203/'
but it works only if the pattern occurs once. If I use it on a string containing the pattern multiple times, I get everything between the first : and the last /.
How can I specify that the pattern will occur multiple times, and how can I retrieve each occurrence?
Use lookaround and a lazy quantifier:
regexp(str, '(?<=:).+?(?=/)', 'match')
Example (Matlab R2016b):
>> str = 'abcd:45.72643,4.91203/Rou:hereanotherdata/defgh';
>> result = regexp(str, '(?<=:).+?(?=/)', 'match')
result =
1×2 cell array
'45.72643,4.91203' 'hereanotherdata'
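The same lookaround pattern carries over to other regex engines. As a rough Python sketch (the variable names are just for illustration), re.findall returns every non-overlapping match:

```python
import re

text = "abcd:45.72643,4.91203/Rou:hereanotherdata/defgh"

# (?<=:) requires a ':' just before the match, (?=/) a '/' just after it;
# the lazy .+? stops at the first '/' instead of running to the last one.
result = re.findall(r"(?<=:).+?(?=/)", text)
print(result)  # ['45.72643,4.91203', 'hereanotherdata']
```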
In most languages this is hard to do with a single regexp. Ultimately you'll only ever get back the one string, and you want to get back multiple strings.
I've never used Matlab, so it may be possible in that language, but based on other languages, this is how I'd approach it...
I can't give you the exact code, but a search indicates that in Matlab there is a function called strsplit, example...
C = strsplit(data,':')
That should break your original string up into an array of strings, using the ":" as the break point. You can then ignore the first array index (as it contains the text before the first ":"), loop over the rest of the array, and use a regexp to extract everything that comes before a "/".
So for instance...
'abcd:45.72643,4.91203/Rou:hereanotherdata/defgh'
Breaks down into an array with parts...
1 - 'abcd'
2 - '45.72643,4.91203/Rou'
3 - 'hereanotherdata/defgh'
Then ignore 1, and extract everything before the "/" in 2 and 3.
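The split-and-trim steps described above can be sketched in Python as follows. This is an assumption-laden illustration, not Matlab code: the function name between is made up, and pieces that contain no "/" are simply skipped.

```python
def between(text, start=":", end="/"):
    """Collect the text that follows each `start` up to the next `end`."""
    found = []
    # Skip the first piece: it is whatever came before the first ':'.
    for piece in text.split(start)[1:]:
        head, sep, _tail = piece.partition(end)
        if sep:  # only keep pieces that actually contain a '/'
            found.append(head)
    return found

print(between("abcd:45.72643,4.91203/Rou:hereanotherdata/defgh"))
```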
As John Mawer and Adriaan mentioned, strsplit is a good place to start. You can use it for both ':' and '/' at once, but then you will not be able to tell which delimiter each piece followed. If you apply strsplit twice, you can tell which pieces follow a ':':
A='abcd:45.72643,4.91203/Rou:hereanotherdata/defgh';
B=cellfun(@(x) strsplit(x,'/'),strsplit(A,':'),'uniformoutput',0);
Now every cell of B after the first corresponds to a piece that followed a ':', and each one is itself a cell array of the parts separated by '/'. You can extract the results by checking where B has more than one cell and taking the first element of each:
C=cellfun(@(x) x{1},B(cellfun('length',B)>1),'uniformoutput',0)
C =
1×2 cell array
'45.72643,4.91203' 'hereanotherdata'
Starting in R2016b you can use extractBetween:
>> str = 'abcd:45.72643,4.91203/Rou:hereanotherdata/defgh';
>> result = extractBetween(str,':','/')
result =
2×1 cell array
{'45.72643,4.91203'}
{'hereanotherdata' }
If all your text elements have the same number of delimiters this can be vectorized too.

Regex for pattern matching last instance with a range of values

I need to split a string on a repeated character, but only on the last 3 instances of that character, keeping any earlier instances (zero or more) as part of the preceding text.
ex1: "Hi####there" after regex (and splitting) "Hi#","there"
ex2: "Hi###there" after regex (and splitting) "Hi","there"
Splitting on #{3} does not give the result I want: "Hi","#there"
edit:
I simplified my question a bit; serialization and the character I am using are not relevant. I'm just interested in a regex that matches the last x characters in a run of y repeated characters. (#{3})(\w+) results in only returning hi# and the rest of the string is lost.
This was tested in JavaScript and works:
"Hi####there".split(/###(?=[^#])/); // Output = > ["Hi#", "there"]
"Hi###there".split(/###(?=[^#])/); // Output = > ["Hi", "there"]
// Tested on win7 with Chrome 45+
(?=[^#]) ... Lookahead to check if there is no # after the final/splitter ###
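The same lookahead should behave the same way in other regex engines. A small Python sketch, assuming nothing beyond the standard re module:

```python
import re

# Split on '###' only when the next character is not another '#',
# so any extra leading '#' characters stay with the left-hand piece.
SPLITTER = re.compile(r"###(?=[^#])")

print(SPLITTER.split("Hi####there"))  # ['Hi#', 'there']
print(SPLITTER.split("Hi###there"))   # ['Hi', 'there']
```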

Regex: allow for the occurrence of a certain character up to one time

I want to search for a specific (DNA) string 'AGCTAGCT' and allow for the occurrence of one (and only one) mismatch (signified as 'N').
The following are matches (no or one N):
AGCTAGCT
NGCTAGCT
AGCNAGCT
The following are not matches (two or more Ns):
AGNTAGCN
AGNTANCN
Use a negative lookahead at the start to reject any string that contains two N's.
^(?!.*?N.*N)[AGCTN]{8}$
I assumed that your string contains only the letters A, G, C, T and N.
^(?!.*?N.*N)[AGCTN]+$
Or simply like this,
^(?!.*?N.*N).+$
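To try the lookahead against the examples from the question, here is a minimal Python sketch. The helper name at_most_one_n is made up, and re.fullmatch stands in for the ^ and $ anchors. Like the patterns above, it only limits the number of N's and the alphabet; it does not compare the other letters against AGCTAGCT.

```python
import re

# (?!.*N.*N) fails as soon as two N's can be found anywhere in the string.
PATTERN = re.compile(r"(?!.*N.*N)[AGCTN]{8}")

def at_most_one_n(seq):
    """True if seq is 8 letters from AGCTN and contains no more than one N."""
    return PATTERN.fullmatch(seq) is not None

for seq in ["AGCTAGCT", "NGCTAGCT", "AGCNAGCT", "AGNTAGCN", "AGNTANCN"]:
    print(seq, at_most_one_n(seq))
```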
In any language you could do something like this
var count = (str.match(/N/g) || []).length; // count the N's; match() returns null when there are none
if (count <= 1) { // zero or one N
  // str valid
}
If you only want a regex, you could use this regex
/^[^N]*N?[^N]*$/
You can test if the string matches the above regex or not.
If you are using Python, you can do it without a regex:
myList = []
for word in dna:
    if word.count('N') < 2:
        myList.append(word)
If you want to generate candidate strings (I don't know the rules for how DNA letters combine), this may help:
import itertools
letters = ['A', 'G', 'C', 'T', 'N']
for letter in itertools.permutations(letters):
    print(''.join(letter))
This gives you every ordering of the five symbols in the list, each used once.
I think a regular expression is not the best choice for doing this. I say that because (at least to my knowledge) there is no easy way to express an arbitrary string to match with at most one mistake, other than explicitly considering all the possible mistakes.
That said, it would be something like this:
AGCTAGCT|NGCTAGCT|ANCTAGCT|AGNTAGCT|AGCNAGCT|AGCTNGCT|AGCTANCT|AGCTAGNT|AGCTAGCN
Maybe it can be simplified a bit.
EDIT
Given that N is a mismatch, a regular expression to accept what you want should replace each N with the wrong alternatives.
AGCTAGCT|[GCT]GCTAGCT|A[ACT]CTAGCT|AG[AGT]TAGCT|AGC[AGC]AGCT
|AGCT[GCT]GCT|AGCTA[ACT]CT|AGCTAG[AGT]T|AGCTAGC[AGC]
Simplifying...
(A(G(C(T(A(G(C(T|[AGC])|[AGT]T)|[ACT]CT)|[GCT]GCT)|[AGC]AGCT)|[AGT]TAGCT)|[ACT]CTAGCT)|[GCT]GCTAGCT)
Demo replacing N with wrong choices: https://regex101.com/r/bB0gX1/1
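Writing that alternation by hand is error-prone, so it may be easier to generate it. The following Python sketch builds the unsimplified form for any target; the function name one_mismatch_pattern and the alphabet parameter are assumptions for illustration, and the target is assumed to contain only plain letters from that alphabet.

```python
import re

def one_mismatch_pattern(target, alphabet="ACGT"):
    """Build a regex that accepts target exactly or with one wrong letter."""
    alternatives = [target]
    for i, correct in enumerate(target):
        # Character class of every letter that would be a mismatch at position i.
        wrong = "".join(c for c in alphabet if c != correct)
        alternatives.append(f"{target[:i]}[{wrong}]{target[i + 1:]}")
    return re.compile("|".join(alternatives))

pattern = one_mismatch_pattern("AGCTAGCT")
for seq in ["AGCTAGCT", "TGCTAGCT", "AGCTAGCA", "TTCTAGCT"]:
    print(seq, pattern.fullmatch(seq) is not None)
```

The last sequence has two wrong letters, so it is the only one rejected.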

Part of a string from a string using regular expressions

I have a string of 5 characters out of which the first two characters should be in some list and next three should be in some other list.
How could I validate them with regular expressions?
Example:
List for first two characters {VBNET, CSNET, HTML}
List for next three characters {BEGINNER, EXPERT, MEDIUM}
My Strings are going to be: VBBEG, CSBEG, etc.
My regular expression should check that the first two characters of the input string are one of VB, CS or HT, and that the remaining three are likewise taken from the second list.
Would the following expression work for you in a more general case (so that you don't have hardcoded values): (^..)(.*$)
- returns the first two letters in the first group, and the remaining letters in the second group.
something like this:
^(VB|CS|HT)(BEG|EXP|MED)$
This recipe works for me:
^(VB|CS|HT)(BEG|EXP|MED)$
I guess (VB|CS|HT)(BEG|EXP|MED) should do it.
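If the two lists may change, the alternation can be built from them instead of being typed out. A minimal Python sketch, where the list contents come from the question and the names PREFIXES, SUFFIXES and build_validator are hypothetical:

```python
import re

PREFIXES = ["VB", "CS", "HT"]
SUFFIXES = ["BEG", "EXP", "MED"]

def build_validator(prefixes, suffixes):
    """Compile (p1|p2|...)(s1|s2|...) from the two lists, escaping each entry."""
    prefix_part = "|".join(re.escape(p) for p in prefixes)
    suffix_part = "|".join(re.escape(s) for s in suffixes)
    return re.compile(f"({prefix_part})({suffix_part})")

validator = build_validator(PREFIXES, SUFFIXES)

# fullmatch plays the role of the ^ and $ anchors in the answers above.
for code in ["VBBEG", "CSBEG", "XXBEG", "VBBEGX"]:
    print(code, validator.fullmatch(code) is not None)
```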
If your strings are as well-defined as this, you don't even need regex - simple string slicing would work.
For example, in Python we might say:
mystring = "HTEXP"
prefix = mystring[0:2]
suffix = mystring[2:5]
if prefix in ['HT', 'CS', 'VB'] and suffix in ['BEG', 'MED', 'EXP']:
    pass  # valid!
else:
    pass  # not valid. :(
Don't use regex where elementary string operations will do.