Mysterious no-match in regular expression - regex

Imagine I have a cell array with two filenames:
filenames{1,1} = 'SMCSx0noSat48VTFeLeakTrace.txt';
filenames{2,1} = 'SMCSx0NoSat48VTrace.txt';
I want to get the filename which starts with 'SMCSx0' and contains the filterword 'NoSat48VTrace':
%// case 1
expression = 'SMCSx0';
filterword = 'NoSat48VTrace';
regs = regexp(filenames, ['^' expression '.*\' filterword '.*\.txt$'])
mask = ~cellfun(@isempty,regs);
file = filenames(mask)
It works, I get:
file =
'SMCSx0NoSat48VTrace.txt'
But for whatever reason, changing the filterword to 'noSat48VTFeLeakTrace' doesn't get me the other file:
%// case 2
expression = 'SMCSx0';
filterword = 'noSat48VTFeLeakTrace';
regs = regexp(filenames, ['^' expression '.*\' filterword '.*\.txt$'])
mask = ~cellfun(@isempty,regs);
file = filenames(mask)
which is absolutely the same as before, but
file =
Empty cell array: 0-by-1
I've actually been using these lines in a function for months without problems. But now I have added some files to my folder that are not found, though their names are similar to the earlier ones. Any hints?
It is actually supposed to work without including Trace in the filterword, which it does for the first case; that's why I put .*\ into the regex.
%// case 1
expression = 'SMCSx0';
filterword = 'NoSat48V';
... works

'^' expression '.*\'
The \ near the end combines with the first character of the filterword, so \n is interpreted as a newline character:
SMCSx0.*\noSat48VTFeLeakTrace.*\.txt$
This worked fine with the other filterword because NoSat48VTrace has an upper case N and \N is interpreted as simply N.
Get rid of the \, you don't need it.
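The same pitfall can be reproduced outside MATLAB. The following is a minimal sketch in Python's re module (not MATLAB's regexp), assuming the two filenames from the question; find_files is a hypothetical helper name. It escapes the filterword with re.escape instead of prefixing a backslash by hand, so no character of the filterword can be read as part of an escape sequence.

```python
import re

filenames = ["SMCSx0noSat48VTFeLeakTrace.txt", "SMCSx0NoSat48VTrace.txt"]


def find_files(names, prefix, filterword):
    """Return names that start with prefix, contain filterword and end in .txt."""
    # re.escape neutralises any regex metacharacters in the literal parts.
    pattern = "^" + re.escape(prefix) + ".*" + re.escape(filterword) + r".*\.txt$"
    return [name for name in names if re.search(pattern, name)]


# With the stray backslash the pattern text contains "\n", which means newline.
broken = "^SMCSx0.*\\" + "noSat48VTFeLeakTrace" + r".*\.txt$"
broken_hits = [name for name in filenames if re.search(broken, name)]

fixed_hits = find_files(filenames, "SMCSx0", "noSat48VTFeLeakTrace")
short_hits = find_files(filenames, "SMCSx0", "NoSat48V")
print(broken_hits, fixed_hits, short_hits)
```

Only the lowercase case is reproduced here; how an escape such as \N is treated differs between regex engines.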

You have an extra backslash in there:
regs = regexp(filenames, ['^' expression '.*\' filterword '.*\.txt$'])
^^^
|||
remove it and it should give the expected result.

Related

Regex for IFC with array attributed

IFC is a variation of STEP files used for construction projects. The IFC file contains information about the building being constructed. The file is text based and it is easy to read. I am trying to parse this information into a Python dictionary.
The general format of each line will be similar to the following
2334=IFCMATERIALLAYERSETUSAGE(#2333,.AXIS2.,.POSITIVE.,-180.);
Ideally this should be parsed into #2334, IFCMATERIALLAYERSETUSAGE, #2333,.AXIS2.,.POSITIVE.,-180.
I found a solution for part of the problem in "Regex includes two matches in first match":
https://regex101.com/r/RHIu0r/10
However, in some cases the data contains arrays instead of single values, as in the example below:
2335=IFCRELASSOCIATESMATERIAL('2ON6$yXXD1GAAH8whbdZmc',#5,$,$,(#40,#221,#268,#281),#2334);
This case needs to be parsed as #2335, IFCRELASSOCIATESMATERIAL, '2ON6$yXXD1GAAH8whbdZmc', #5,$,$, [#40,#221,#268,#281],#2334
where [#40,#221,#268,#281] is stored in a single variable as an array.
The array can be in the middle or be the last variable.
Would you be able to assist in creating a regular expression to obtain the desired results?
I have created https://regex101.com/r/mqrGka/1 with cases to test
Here's a solution that continues from the point you reached with the regular expression in the test cases:
file = """\
#1=IFCOWNERHISTORY(#89024,#44585,$,.NOCHANGE.,$,$,$,1190720890);
#2=IFCSPACE(';;);',#1,$);some text);
#2=IFCSPACE(';;);',#1,$);
#2885=IFCRELAGGREGATES('1gtpBVmrDD_xsEb7NuFKc8',#5,$,$,#2813,(#2840,#2846,#2852,#2858,#2879));
#2334=IFCMATERIALLAYERSETUSAGE(#2333,.AXIS2.,.POSITIVE.,-180.);
#2335=IFCRELASSOCIATESMATERIAL('2ON6$yXXD1GAAH8whbdZmc',#5,$,$,(#40,#221,#268,#281),#2334);
""".splitlines()
import re
d = dict()
for line in file:
    m = re.match(r"^#(\d+)\s*=\s*([a-zA-Z0-9]+)\s*\(((?:'[^']*'|[^;'])+)\);", line, re.I|re.M)
    attr = m.group(3) # attribute list string
    values = [m.group(2)] # first value is the entity type name
    while attr:
        start = 1
        if attr[0] == "'": start += attr.find("'", 1) # don't split at comma within string
        if attr[0] == "(": start += attr.find(")", 1) # don't split item within parentheses
        end = attr.find(",", start) # search for a comma / end of item
        if end < 0: end = len(attr)
        value = attr[1:end-1].split(",") if attr[0] == "(" else attr[:end]
        if value[0] == "'": value = value[1:-1] # remove quotes
        values.append(value)
        attr = attr[end+1:] # remove current attribute item
    d[m.group(1)] = values # store into dictionary

Regular expressions string splits

I'm trying to create a pattern that lets me split a string on commas while ignoring commas inside expressions within curly brackets.
My existing code works great if only one group of curly bracket expressions exists in the string.
Dim expression As New Regex(",(?=(?:[^\{]*\{[^\{]*\})*(?![^\}]*\}))")
Try
    parts = expression.Split(sortString)
    For Each Item In parts
        If Not Item Is Nothing Then
            result.Add(Item)
        End If
    Next
    Return result
If I pass the string
{IIF(Hemo.Site = "LV",1,IIF(Hemo.Site = "SVC",2,IIF(Hemo_Pressures.Site = "AO",3,4)))},Site DESC,Pressure1 ASC
It works: the curly bracket grouping is ignored and each string after it is broken out by the comma split.
The problem is that it begins to fail when I need to accommodate multiple groupings of curly bracket expressions in my string.
This fails:
{IIF(Hemo.Site = "LV",1,IIF(Hemo.Site = "SVC",2,IIF(Hemo_Pressures.Site = "AO",3,4)))},Site DESC,{IIF(Hemo.Site = "LV",1,IIF(Hemo.Site = "SVC",2,IIF(Hemo.Site = "AO",3,4)))}, Pressure1 ASC
One of the groupings is ignored as it should be, but the other grouping of curly brackets is not, resulting in a dirty collection.
I would appreciate a second pair of eyes on this.
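The pattern ,(?![^{]*\}) seems to work for me, as long as there are no loose { or } characters floating around in the text.
Broken down, it reads as:
, : a literal comma
(?! : start of a zero-width negative lookahead
[^{] : any character other than "{"
* : zero or more times
\} : a literal "}"
) : end of the lookahead
In other words, a comma is a split point unless a "}" follows it with no "{" in between.
You can use " *,(?![^{]*\}) *" (with a space before each *) to trim the spaces immediately before and after commas if required.
On your test string, it produces the following splits: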
{IIF(Hemo.Site = "LV",1,IIF(Hemo.Site = "SVC",2,IIF(Hemo_Pressures.Site = "AO",3,4)))}
Site DESC
{IIF(Hemo.Site = "LV",1,IIF(Hemo.Site = "SVC",2,IIF(Hemo.Site = "AO",3,4)))}
Pressure1 ASC
However, it will fail to properly split strings like
Apple, Ban}ana, Carrot {1,2,3}, Frog, Cat, 1,2,3
due to the loose } in "Ban}ana".
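For readers who want to try the lookahead outside .NET, here is a minimal sketch using Python's re module; the helper name split_outside_braces is hypothetical, and the sort string is the failing input from the question. It also reproduces the stray-brace limitation described above.

```python
import re

# A comma is a split point unless a "}" follows it with no "{" in between.
SPLIT = re.compile(r",(?![^{]*\})")


def split_outside_braces(text):
    """Split on commas outside {...} groups and trim surrounding spaces."""
    return [part.strip() for part in SPLIT.split(text)]


sort_string = (
    '{IIF(Hemo.Site = "LV",1,IIF(Hemo.Site = "SVC",2,IIF(Hemo_Pressures.Site = "AO",3,4)))},'
    'Site DESC,'
    '{IIF(Hemo.Site = "LV",1,IIF(Hemo.Site = "SVC",2,IIF(Hemo.Site = "AO",3,4)))}, Pressure1 ASC'
)
parts = split_outside_braces(sort_string)

# The loose "}" makes the first comma look as if it were inside a group.
stray = split_outside_braces("Apple, Ban}ana, Carrot {1,2,3}, Frog, Cat, 1,2,3")
print(parts)
print(stray)
```

The pattern does not track nesting of braces; it only looks at which brace character comes next, which is why unbalanced input breaks it.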

Finding the shortest repetitive pattern in a string

I was wondering if there is a way to do pattern matching in Octave / MATLAB. I know Maple 10 has commands to do this, but I'm not sure what I need to do in Octave / MATLAB. So if a number was 12341234123412341234, the pattern match would be 1234. I'm trying to find the shortest pattern that upon repetition generates the whole string.
Please note: the numbers (only numbers will be used) won't be this simple. Also, I won't know the pattern ahead of time (that's what I'm trying to find). Please see the Maple 10 example below which shows that the pattern isn't known ahead of time but the command finds the pattern.
Example of Maple 10 pattern matching:
ns:=convert(12341234123412341234,string);
ns := "12341234123412341234"
StringTools:-PrimitiveRoot(ns);
"1234"
How can I do this in Octave / Matlab?
P.S.: I'm using Octave 3.8.1
To find the shortest pattern that upon repetition generates the whole string, you can use regular expressions as follows:
result = regexp(str, '^(.+?)(?=\1*$)', 'match');
Some examples:
>> str = '12341234123412341234';
>> result = regexp(str, '^(.+?)(?=\1*$)', 'match')
result =
'1234'
>> str = '1234123412341234123';
>> result = regexp(str, '^(.+?)(?=\1*$)', 'match')
result =
'1234123412341234123'
>> str = 'lullabylullaby';
>> result = regexp(str, '^(.+?)(?=\1*$)', 'match')
result =
'lullaby'
>> str = 'lullaby1lullaby2lullaby1lullaby2';
>> result = regexp(str, '^(.+?)(?=\1*$)', 'match')
result =
'lullaby1lullaby2'
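The same regular expression also works in other engines that allow a backreference inside a lookahead. As a rough sketch, here is the equivalent in Python's re module; primitive_root is a hypothetical name borrowed from the Maple command in the question.

```python
import re


def primitive_root(s):
    """Shortest prefix whose repetition produces the whole string s."""
    # Lazily grow the prefix until the rest of s consists only of copies of it.
    m = re.match(r"^(.+?)(?=\1*$)", s)
    return m.group(1) if m else s


for text in ["12341234123412341234", "1234123412341234123", "lullabylullaby"]:
    print(text, "->", primitive_root(text))
```

Because the quantifier is lazy, the first prefix that satisfies the lookahead is the shortest one; if nothing shorter works, the whole string is returned.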
I'm not sure if this can be accomplished with regular expressions. Here is a script that will do what you need in the case of a repeated word called pattern.
It loops through the characters of a string called str, trying to match against another string called pattern. If matching fails, the pattern string is extended as needed.
EDIT: I made the code more compact.
str = 'lullabylullabylullaby';
pattern = str(1);
matchingState = false;
sPtr = 1;
pPtr = 1;
while sPtr <= length(str)
    if str(sPtr) == pattern(pPtr) %// if match succeeds, keep looping through pattern string
        matchingState = true;
        pPtr = pPtr + 1;
        pPtr = mod(pPtr-1,length(pattern)) + 1;
    else %// if match fails, extend pattern string and start again
        if matchingState
            sPtr = sPtr - 1; %// don't change str index when transitioning out of matching state
        end
        matchingState = false;
        pattern = str(1:sPtr);
        pPtr = 1;
    end
    sPtr = sPtr + 1;
end
display(pattern);
The output is:
pattern =
lullaby
Note:
This doesn't allow arbitrary delimiters between occurrences of the pattern string. For example, if str = 'lullaby1lullaby2lullaby1lullaby2';, then
pattern =
lullaby1lullaby2
This also allows the pattern to end mid-way through a cycle without changing the result. For example, str = 'lullaby1lullaby2lullaby1'; would still result in
pattern =
lullaby1lullaby2
To fix this you could add the lines
if pPtr ~= length(pattern)
    pattern = str;
end
Another approach is as follows:
Determine the length of the string, and find all possible factors of the string length value.
For each possible factor length, reshape the string and check for a repeated substring.
To find all possible factors, see this solution on SO. The next step can be performed in many ways, but I implement it in a simple loop, starting with the smallest factor length.
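function repeat = repeats_in_string(str)
ns = numel(str);
nf = find(rem(ns, 1:ns) == 0); %// all factors of the string length
for ii=1:numel(nf)
    repeat = str(1:nf(ii)); %// candidate pattern of length nf(ii)
    %// one chunk per row; 'rows' compares whole rows rather than single characters
    if all(ismember(reshape(str,nf(ii),[])',repeat,'rows'))
        break;
    end
end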
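The factor-length idea can also be sketched without reshaping. The following Python version is an illustration under the same assumption (the pattern length must divide the string length); shortest_repeat is a hypothetical name. It compares whole chunks by rebuilding the string from the candidate prefix.

```python
def shortest_repeat(s):
    """Shortest prefix of s that, repeated a whole number of times, equals s."""
    n = len(s)
    for size in range(1, n + 1):
        # Only lengths that divide n can tile the string exactly.
        if n % size == 0 and s[:size] * (n // size) == s:
            return s[:size]
    return s  # only reached for the empty string


print(shortest_repeat("12341234123412341234"))
print(shortest_repeat("1221"))
```

The second call is a case where checking only that every character occurs in the candidate would wrongly accept "12"; comparing whole chunks avoids that.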
This problem is a great Rorschach test for your approach to problem solving. I'll add a signal engineering solution, which should be simple since the signal is expected to be perfectly repetitive, assuming this holds: find the shortest pattern that upon repetition generates the whole string.
In the following, the str fed to the function is actually a column vector of floats, not a string; the original string has been converted with str2num(str2mat(str)'):
function res=findshortestrepel(str)
[~,ii] = max(fft(str-mean(str))); %// index of the dominant frequency
res = str(1:round(numel(str)/(ii-1))); %// one period of the signal
I performed a small test, comparing this to the regexp solution and found it to be faster overall (blue squares), although somewhat inconsistently, and only if you don't consider the time required to convert the string into a vector of floats (green squares). However I did not pursue this further (not breaking records with this):
Times in sec.

Matlab: regexprep with variable

I have an array a: a list of identified words that should be matched in an array b and replaced by an empty string. newB is the desired result.
The value of a might vary according to an input file.
I am trying to use regexprep but it is not working well.
e.g.:
a = {'apple';'banana';'orange'}; % a might be also ‘watermelon’, ‘papaya’ etc
b = {'1 apple = 2 kiwi';'1 fig = 1 banana';'1 orange = 3 strawberry'};
newB = {' = 2 kiwi';'1 fig = ';' = 3 strawberry'};
From your example it seems like you want to remove a special word and a number, the appropriate regular expression for this is (for word = 'apple'): '\d+ apple'. Building the regular expression from all the words in a, using sprintf:
re = sprintf('\\d+ %s|',a{:}); %// adding | operator to select between expressions
re(end)=[]; %// discard the last '|'
The resulting regular expression is
re =
'\d+ apple|\d+ banana|\d+ orange'
Now the actual replacement:
newB = regexprep(b,re,'')
Resulting with
newB =
' = 2 kiwi'
'1 fig = '
' = 3 strawberry'
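The same approach of building one alternation from a variable word list carries over to other languages. Here is a minimal sketch in Python, assuming the data from the question; re.escape is added as a precaution in case a word from the input file contains regex metacharacters, which the original answer does not cover.

```python
import re

a = ["apple", "banana", "orange"]  # may come from an input file
b = ["1 apple = 2 kiwi", "1 fig = 1 banana", "1 orange = 3 strawberry"]

# Build "\d+ apple|\d+ banana|\d+ orange"; join avoids a trailing "|".
pattern = "|".join(r"\d+ " + re.escape(word) for word in a)

new_b = [re.sub(pattern, "", item) for item in b]
print(pattern)
print(new_b)
```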

I can't figure out what is wrong with my regex?

Below is the code in question:
ID = null;
Table = null;
Match CMD = Regex.Match(CommandString, @"create update command for (^[A-Za-z0-9 ]+$) Where_ID_=_(^[0-9]+$)"); //create update command for MARKSWhere_ID_=_11
if (CMD.Success)
{
    Table = CMD.Groups[1].Value;
    ID = CMD.Groups[2].Value;
    return true;
}
This returns false every time when
CommandString = "create update command for MARKS Where_ID_=_5"
Why?
In the regular expression you used, ^ denotes the beginning of the input string and $ denotes the end of the input string. Placed inside the groups in the middle of the pattern, they can never match.
Removing ^ and $ from the regular expression will give you what you want.
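The effect of the misplaced anchors is not specific to .NET. The following is a small sketch in Python's re module, using the command string from the question, that compares the original pattern with the one where ^ and $ are removed from the groups.

```python
import re

command = "create update command for MARKS Where_ID_=_5"

# Anchors inside the groups: ^ would have to match in the middle of the input.
anchored = r"create update command for (^[A-Za-z0-9 ]+$) Where_ID_=_(^[0-9]+$)"
# Same pattern with the anchors removed from the groups.
plain = r"create update command for ([A-Za-z0-9 ]+) Where_ID_=_([0-9]+)"

anchored_match = re.search(anchored, command)
plain_match = re.search(plain, command)

print(anchored_match)
print(plain_match.group(1), plain_match.group(2))
```

If the whole command must match and nothing else, a single ^ at the very start and a single $ at the very end of the pattern would be the place for the anchors.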
[A-Za-z0-9]{0,10} will allow at most 10 characters, each one a letter from a to z or A to Z or a digit from 0 to 9.
So it is a good habit to bound the number of characters the regex may match.