Matching innermost braces with regex or strpos? - regex

I have a sort of mini parsing syntax I made up to help me streamline my view code in cakephp. Basically I have created a table helper which, when given a dataset and (optionally) a set of options for how to format the data will render out a table, as opposed to me looping though the data and editing it manually.
It allows the user to be as complex or as simple as they like, it can get pretty powerful. However, In order to achieve this I had to make a simple parsing syntax. As a quick example the user would do something like so:
$this->Table->data = $userData;
$this->Table->elements['td']['data'] = array(
'{:User.username:}',
'{:User.created:}' => array('Time::nice')
);
echo $this->Table->render();
And when rendering the table would then generate:
<table>
<tbody>
<tr><td>rich97</td><td>Sun 21st 02:30pm</td></tr>
</tbody>
</table>
The problem occurs then I try to nest the braces like so:
{:User.levels.iconClasses.{:User.access:}:}
Is there anyway I can only get the inner most brackets on the first time round and loop though until there are no matches? Or even do it in one go? Or even better use strpos?
Here is my regex as it stands:
'/\{\:([^}]+)\:\}/'

Just add the opening brace to your negated character class:
'/\{:([^{}]+):\}/'

var $validate= array(
'name'=>array(
'notEmpty' =>array(
'rule'=>'notEmpty',
'message'=>'Please Enter The Name'
),
'isUnique' =>array(
'rule'=>'isUnique',
'message'=>'Name Already Exist'
)
),
'address'=>array(
'rule'=>'notEmpty',
'message'=>'Please Enter The Address')
);

Related

Remove columns by name based on pattern

How can I remove a large number of columns by name based on a pattern?
A data set exported from Jira has a ton of extra columns that I've no interest in. 400 Log entries, 50 Comments, dozens of links or attachments. Problem is that they get random numbers assigned which means that removing them with hardcoded column names will not work. That would look like this and break as the numbers change:
= Table.RemoveColumns(#"Previous Step",{"Watchers", "Watchers_10", "Watchers_11", "Watchers_12", "Watchers_13", "Watchers_14", "Watchers_15", "Watchers_16", "Watchers_17", "Watchers_18", "Watchers_19", "Watchers_20", "Watchers_21", "Watchers_22", "Watchers_23", "Watchers_24", "Watchers_25", "Watchers_26", "Watchers_27", "Watchers_28", "Log Work", "Log Work_29", "Log Work_30", "Log Work_31", "Log Work_32", ...
How can I remove a large number of columns by using a pattern in the name? i.e. remove all "Log Work" columns.
The best way I've found is to use List.FindText on Table.ColumnNames to get a list of column names dynamically based on target string:
= Table.RemoveColumns(#"Previous Step", List.FindText(Table.ColumnNames(#"Previous Step"), "Log Work")
This works by first grabbing the full list of Column Names and keeping only the ones that match the search string. That's then sent to RemoveColumns as normal.
Limitation appears to be that FindText doesn't offer complex pattern matching.
Of course, when you want to remove a lot of different patterns, having individual steps isn't very interesting. A way to combine this is to use List.Combine to join the resulting column names together.
That becomes:
= Table.RemoveColumns(L, List.Combine({ List.FindText(Table.ColumnNames(L), "Watchers_"), List.FindText(Table.ColumnNames(L), "Log Work"), List.FindText(Table.ColumnNames(L), "Comment"), List.FindText(Table.ColumnNames(L), "issue link"), List.FindText(Table.ColumnNames(L), "Attachment")} ))
SO what's actually written there is:
Table.RemoveColumns(PreviousStep, List.Combine({ foundList1, foundlist2, ... }))
Note the { } that signifies a list! You need to use this as List.Combine only accepts a single argument which is itself already a List of lists. And the Combine call is required here.
Also note the L here instead of #"Previous Step". That's used to make the entire thing more readable. Achieved by inserting a step named "L" that just has = #"Promoted Headers".
This allows relatively maintainable removal of multiple columns by name, but it's far from perfect.

How to return a value when a cell matches something without a certain set of characters? Using regexmatch

I'm using IFS and REGEXMATCH to assign values to each cell. The desired outcome is to return 'Info' if a cell contains 'how' 'where' 'what' 'who' but at the same time does not contain 'buy' 'coupon' 'discount' and 'deals'. If the cell contains 'buy' 'coupon' 'discount' and 'deals', it should return 'Commercial'.
For example: cell containing 'where to go' should be Info,
cell containing 'how to buy' should be 'Commercial' because of the 'buy' keyword
My current formula looks like this:
=IFS(REGEXMATCH(J9,"**how|where|what|who**<>buy<>coupon<>discount<>deals"),"Info",REGEXMATCH(J9,"buy|coupon|discount|deals"),"Commercial")
The problem here is it is returning 'Info' for cells like 'how to buy' when it should return 'Commercial'. It only works for 'who', but not for 'how' 'where' and 'what'.
Any thoughts what could be wrong?
I'm still new to regex, would appreciate it if anyone could help me fix this!
You can use the following formula instead
=IF(AND(REGEXMATCH(J8,"how|where|what|who"),NOT(REGEXMATCH(J8,"buy|coupon|discount|deals"))),"Info","Commercial")
You cannot use <> in a regex. We use the NOT instead.
Functions used:
REGEXMATCH
IF
AND
NOT

Regex (re2 googlesheets) multiple values in multiline cell

Getting stuck on how to read and pretty up these values from a multiline cell via arrayformula.
Im using regex as preceding line can vary.
just formulas please, no custom code
The first column looks like a set of these:
```
[config]
name = the_name
texture = blah.dds
cost = 1000
[effect0]
value = 1000
type = ATTR_A
[effect1]
value = 8
type = ATTR_B
[feature0]
name = feature_blah
[components]
0 = comp_one,1
[resources]
res_one = 1
res_five = 1
res_four = 1
<br/>
Where to be useful elsewhere, at minimum it needs each [tag] set ([effect\d], [feature\d], ect) to be in one column each, for example the 'effects' column would look like:
ATTR_A:1000,ATTR_B:8
and so on.
Desired output can also be seen in the included spreadsheet
<br/>
<b>Here is the example spreadsheet:</b>
https://docs.google.com/spreadsheets/d/1arMaaT56S_STTvRr2OxCINTyF-VvZ95Pm3mljju8Cxw/edit?usp=sharing
**Current REGEXREPLACE**
Kinda works, finds each 'type' and 'value' great, just cant figure out how to extract just that from the rest, tried capture (and non-capturing) groups before and after but didnt work
=ARRAYFORMULA(REGEXREPLACE($A3:$A,"[\n.][effect\d][\n.](.)\n(.)","1:$1 2:$2"))
**Current SUBSTITUTE + REGEXEXTRACT + REGEXREPLACE**
A different approach entirely, also kinda works, longer form though and left with having to parse the values out of that string, where got stuck again. Idea was to use this to simplify, then regexreplace like above. Getting stuck removing content around the final matches though, and if can do that then above approach is fine too.
// First ran a substitute
=ARRAYFORMULA(SUBSTITUTE(SUBSTITUTE($A3:$A,char(10),";"),";;",char(10)))
// Then variation of this (gave up on single line 'effect/d' so broke it up to try and get it working)
=ARRAYFORMULA(IF(A3:A<>"",IFERROR(REGEXEXTRACT(A3:A,"(?m)^(?:[effect0]);(.)$")&";;")&""&IFERROR(REGEXEXTRACT(A3:A,"(?m)^(?:[effect1]);(.)$")&";;")&""&IFERROR(REGEXEXTRACT(A3:A,"(?m)^(?:[effect2]);(.)$")&";;"),""))
// Then use regexreplace like above
=ARRAYFORMULA(REGEXREPLACE($B3:$B,"value = (.);type = (.);;","1:$1 2:$2"))
**--EDIT--**
Also, as my updated 'Desired Output' sheet shows (see timestamped comment below), bonus kudos if you can also extract just the values of matching 'type's to those extra columns (see spreadsheet).
All good if you cant though, just realized would need that too for lookups.
**--END OF EDIT--**
<br/>
Ive tried dozens of things, discarding each in turn, had a quick look in version history to grab out two promising attempts and shared them in separate sheets.
One of these also used SUBSTITUTE to simplify input column, im happy for a solution using either RAW or the SUBSTITUTE results.
<br/>
**Potentially Useful links:**
https://github.com/google/re2/wiki/Syntax
<br/>
<b>Just some more words:</b>
I also have looked at dozens of stackoverflow and google support pages, so tried both REGEXEXTRACT and REGEXREPLACE, both promising but missing that final tweak. And i tried dozens of tweaks already on both.
Any help would be great, and hopefully help others in future since examples with spreadsheets are great since every new REGEX seems to be a new adventure ;)
<br/>
P.S. if we can think of better title for OP, please say in comment or your answer :)
paste in B3:
=ARRAYFORMULA(SUBSTITUTE(TRIM(TRANSPOSE(QUERY(TRANSPOSE(
IF(C3:E<>"", C2:E2&":"&C3:E, )),,999^99))), " ", ", "))
paste in C3:
=ARRAYFORMULA(IFNA(REGEXEXTRACT($A3:$A, "(\d+)\ntype = "&C2)))
paste in D3:
=ARRAYFORMULA(IFNA(REGEXEXTRACT($A3:$A, "(\d+)\ntype = "&D2)))
paste in E3:
=ARRAYFORMULA(IFNA(REGEXEXTRACT($A3:$A, "(\d+)\ntype = "&E2)))
paste in F3:
=ARRAYFORMULA(IFNA(REGEXEXTRACT(A3:A, "\[feature\d+\]\nname = (.*)")))
paste in G3:
=ARRAYFORMULA(IFNA(REGEXEXTRACT(A3:A, "\[components\]\n\d+ = (.*)")))
paste in H3:
=ARRAYFORMULA(IFNA(REGEXREPLACE(INDEX(SPLIT(REGEXEXTRACT(
REGEXREPLACE(A3:A, "\n", ", "), "\[resources\], (.*)"), "["),,1), ", , $", )))
spreadsheet demo
This was a fun exercise. :-)
Caveat first: I have added some "input data". Examples:
[feature1]
name = feature_active_spoiler2
[components]
0 = spoiler,1
1 = spoilerA, 2
So the output has "extra" output.
See the tab ADW's Solution.

cts:value-match on xs:dateTime() type in Marklogic

I have a variable $yearMonth := "2015-02"
I have to search this date on an element Date as xs:dateTime.
I want to use regex expression to find all files/documents having this date "2015-02-??"
I have path-range-index enabled on ModifiedInfo/Date
I am using following code but getting Invalid cast error
let $result := cts:value-match(cts:path-reference("ModifiedInfo/Date"), xs:dateTime("2015-02-??T??:??:??.????"))
I have also used following code and getting same error
let $result := cts:value-match(cts:path-reference("ModifiedInfo/Date"), xs:dateTime(xs:date("2015-02-??"),xs:time("??:??:??.????")))
Kindly help :)
It seems you are trying to use wild card search on Path Range index which has data type xs:dateTime().
But, currently MarkLogic don't support this functionality. There are multiple ways to handle this scenario:
You may create Field index.
You may change it to string index which supports wildcard search.
You may run this workaround to support your existing system:
for $x in cts:values(cts:path-reference("ModifiedInfo/Date"))
return if(starts-with(xs:string($x), '2015-02')) then $x else ()
This query will fetch out values from lexicon and then you may filter your desired date.
You can solve this by combining a couple cts:element-range-querys inside of an and-query:
let $target := "2015-02"
let $low := xs:date($target || "-01")
let $high := $low + xs:yearMonthDuration("P1M")
return
cts:search(
fn:doc(),
cts:and-query((
cts:element-range-query("country", ">=", $low),
cts:element-range-query("country", "<", $high)
))
)
From the cts:element-range-query documentation:
If you want to constrain on a range of values, you can combine multiple cts:element-range-query constructors together with cts:and-query or any of the other composable cts:query constructors, as in the last part of the example below.
You could also consider doing a cts:values with a cts:query param that searches for values between for instance 2015-02-01 and 2015-03-01. Mind though, if multiple dates occur within one document, you will need to post filter manually after all (like in option 3 of Navin), but it could potentially speed up post-filtering a lot..
HTH!

Regular expression to get HTML table contents

I've stumbled on a bit of challenge here: how to get the contents of a table in HTML with the help of a regular expression. Let's say this is our table:
<table someprop=2 id="the_table" otherprop="val">
<tr>
<td>First row, first cell</td>
<td>Second cell</td>
</tr>
<tr>
<td>Second row</td>
<td>...</td>
</tr>
<tr>
<td>Another row, first cell</td>
<td>Last cell</td>
</tr>
</table>
I already found a method that works, but involves multiple regular expression to be executed in steps:
Get the right table and put it's rows in back-reference 1 (there may be more than one in the document):
<table[^>]*?id="the_table"[^>]*?>(.*?)</table>
Get the rows of the table and put the cells in back-reference 1:
<tr.*?>(.*?)</tr>
And lastly fetch the cell contents in back-reference 1:
<td.*?>(.*?)</td>
Now this is all good, but it would be infinitely more awesome to do this all using one fancy regular expression... Does someone know if this is possible?
There really isn’t a possible regex solution that works for an arbitrary number of table data and puts each cell into a separate back reference. That’s because with backreferences, you need to have a distinct open paren for each backref you want to create, and you don’t know how many cells you have.
There’s nothing wrong with using looping of one or another sort to pull out the data. For example, on the last one, in Perl it would be this, given that $tr already contains the row you need:
#td = ( $tr =~ m{<td.*?>(.*?)</td>}sg );
Now $td[0] will contain the first <td>, $td[1] will contain the second one, etc. If you wanted a two-dimensional array, you might wrap that in a loop like this to populate a new #cells variable:
our $table; # assume has full table in it
my #cells;
while(my($tr) =~ $table = m{<tr.*?>(.*?)</tr>}sg) {
push #cells, [ $tr =~ m{<td.*?>(.*?)</td>}sg ];
}
Now you can do two-dimensional addressing, allowing for $cells[0][0], etc. The outer explicit loop processes the table a row at a time, and the inner implicit loop pulls out all the cells.
That will work on the canned sample data you showed. If that’s good enough for you, then great. Use it and move on.
What Could Ever Be Wrong With That?
However, there are actually quite a few assumptions in your patterns about the contents of your data, ones I don’t know that you’re aware of. For one thing, notice how I’ve used /s so that it doesn’t get stuck on newlines.
But the main problem is that minimal matches aren’t always quite what you want here. At least, not in the general case. Sometimes they aren’t as minimal as you think, matching more than you want, and sometimes they just don’t match enough.
For example, a pattern like <i>(.*?)</i> will get more than you want if the string is:
<i>foo<i>bar</i>ness</i>
Because you will end up matching the string <i>foo<i>bar</i>.
The other common problem (and not counting the uncommon ones) is that a pattern like <tag.*?> may match too little, such as with
<img alt=">more" src="somewhere">
Now if you use a simplistic <img.*?> on that, you would only capture <img alt=">, which is of course wrong.
I think the last major remaining problem is that you have to altogether ignore certain things in parsing. The simplest demo of this embedded comments (also <script>, <style>, andCDATA`), since you could have something like
<i> some <!-- secret</i> --> stuff </i>
which will throw off something like <i>(.*?)</i>.
There are ways around all these, of course. Once you’ve done so, and it is really quite a bit of effort, you’ll find that you have built yourself a real parser, completely with a lot of auxiliary logic, not just one pattern.
Even then you are only processing well-formed input strings. Error recovery and failing softly is an entirely different art.
This answer was added before it was known the the OP needed a solution for c++...
Since using regex to parse html is technically wrong, I'll offer a better solution. You could use js to get the data and put it into a two dimensional array. I use jQuery in the example.
var data = [];
$('table tr').each(function(i, n){
var $tr = $(n);
data[i] = [];
$tr.find('td').text(function(j, text){
data[i].push(text);
});
});
jsfiddle of the example: http://jsfiddle.net/gislikonrad/twzM7/
EDIT
If you want a plain javascript way of doing this (not using jQuery), then this might be more for you:
var data = [];
var rows = document.getElementById('the_table').getElementsByTagName('tr');
for(var i = 0; i < rows.length; i++){
var d = rows[i].getElementsByTagName('td');
data[i] = [];
for(var j = 0; j < d.length; j++){
data[i].push(d[j].innerText);
}
}
Both these functions return the data the same way.