Google Sheets / Google Data Studio - RegEx

Google Sheets / Google Data Studio - RegEx - regex

I got cells in a Google Sheet, which consist of some combined data to track workout progress. They look something like this:
80kg-3x5, 100kg-1x3
For a given exercise, i.e. hang snatch above, it means what actual work loads I did for that exercise on a given date, with weights and the related set x reps separated by commas. So for one exercise, I might have only one work load, or several (which are then comma separated). I have them in a single cell to keep the data tidy, and reduce time when entering the data after a workout.
Now to analyze the data, I need to somehow separate the comma separated values. An example using the sample cell data above, would be total volume for that exercise, with an expression like this:
Sum( (digit before 'kg') * (digit before 'x') * (digit after 'x') + Same expression before, if comma ',' exists after first expression (multiple loads for the exercise) )
It should be a trivial task, but I haven't touched the functions in google sheet or data studio that much, and I had a surprisingly difficult time figuring out a way to either loop through the content in a cell with appropriate regex, or other ways. I could do this easily in python and then any other visualization software, but the point for going this way using drive tools is that it saves a lot of time (if it works...). I can either implement it in google sheet, or in data studio as a new calculated column from the import, whichever makes it possible.

If you are looking to write a custom function, something like this may do the trick (though it needs work for better error-handling)
function workoutProgress(string) {
if (string == '' || string == null || string == undefined) { return 'error';}
var stringArray = string.split(",");
var sum = 0;
var digitsArray, digitsProduct;
if ( stringArray.length > 0) {
for (var element in stringArray) {
digitsArray = stringArray[element].match(/\d{1,}/g);
digitsProduct = digitsArray.reduce(function(product, digit){ return product*digit;});
sum += digitsProduct;
}
}
return sum;
}

It can be achieved using the RegEx Calculated Field below where Field represents the respective field name; each row represents a single workload (for example 80kg-3x5), thus the below accounts for 5 workloads (more can be added, for example a 6th could be added by copy-pasting the 5th line and incrementing he number in curly brackets by one - that is, changing {4} to {5}):
(CAST(REGEXP_EXTRACT(Field,"^(\\d+)kg")AS NUMBER) * CAST(REGEXP_EXTRACT(Field,"^\\d+kg-(\\d+)")AS NUMBER) * CAST(REGEXP_EXTRACT(Field,"^\\d+kg-\\d+x(\\d+)")AS NUMBER)) +
(NARY_MAX(CAST(REGEXP_EXTRACT(Field,"^(?:\\d+kg-\\d+x\\d+,\\s){1}(\\d+)kg")AS NUMBER),0) * NARY_MAX(CAST(REGEXP_EXTRACT(Field,"^(?:\\d+kg-\\d+x\\d+,\\s){1}\\d+kg-(\\d+)")AS NUMBER),0) * NARY_MAX(CAST(REGEXP_EXTRACT(Field,"^(?:\\d+kg-\\d+x\\d+,\\s){1}\\d+kg-\\d+x(\\d+)")AS NUMBER),0)) +
(NARY_MAX(CAST(REGEXP_EXTRACT(Field,"^(?:\\d+kg-\\d+x\\d+,\\s){2}(\\d+)kg")AS NUMBER),0) * NARY_MAX(CAST(REGEXP_EXTRACT(Field,"^(?:\\d+kg-\\d+x\\d+,\\s){2}\\d+kg-(\\d+)")AS NUMBER),0) * NARY_MAX(CAST(REGEXP_EXTRACT(Field,"^(?:\\d+kg-\\d+x\\d+,\\s){2}\\d+kg-\\d+x(\\d+)")AS NUMBER),0)) +
(NARY_MAX(CAST(REGEXP_EXTRACT(Field,"^(?:\\d+kg-\\d+x\\d+,\\s){3}(\\d+)kg")AS NUMBER),0) * NARY_MAX(CAST(REGEXP_EXTRACT(Field,"^(?:\\d+kg-\\d+x\\d+,\\s){3}\\d+kg-(\\d+)")AS NUMBER),0) * NARY_MAX(CAST(REGEXP_EXTRACT(Field,"^(?:\\d+kg-\\d+x\\d+,\\s){3}\\d+kg-\\d+x(\\d+)")AS NUMBER),0)) +
(NARY_MAX(CAST(REGEXP_EXTRACT(Field,"^(?:\\d+kg-\\d+x\\d+,\\s){4}(\\d+)kg")AS NUMBER),0) * NARY_MAX(CAST(REGEXP_EXTRACT(Field,"^(?:\\d+kg-\\d+x\\d+,\\s){4}\\d+kg-(\\d+)")AS NUMBER),0) * NARY_MAX(CAST(REGEXP_EXTRACT(Field,"^(?:\\d+kg-\\d+x\\d+,\\s){4}\\d+kg-\\d+x(\\d+)")AS NUMBER),0))
Editable Google Data Studio Report, Embedded Data Source, Editable Data Set (Google Sheets) and a GIF to elaborate, so feel free to change the name of the field (at the Data Source) to adapt the field to the Calculated Field:

Related

Google Sheets: How can I extract partial text from a string based on a column of different options?

Goal: I have a bunch of keywords I'd like to categorise automatically based on topic parameters I set. Categories that match must be in the same column so the keyword data can be filtered.
e.g. If I have "Puppies" as a first topic, it shouldn't appear as a secondary or third topic otherwise the data cannot be filtered as needed.
Example Data: https://docs.google.com/spreadsheets/d/1TWYepApOtWDlwoTP8zkaflD7AoxD_LZ4PxssSpFlrWQ/edit?usp=sharing
Video: https://drive.google.com/file/d/11T5hhyestKRY4GpuwC7RF6tx-xQudNok/view?usp=sharing
Parameters Tab: I will add words in columns D-F that change based on the keyword data set and there will often be hundreds, if not thousands, of options for larger data sets.
Categories Tab: I'd like to have a formula or script that goes down the columns D-F in Parameters and fills in a corresponding value (in Categories! columns D-F respectively) based on partial match with column B or C (makes no difference to me if there's a delimiter like a space or not. Final data sheet should only have one of these columns though).
Things I've Tried:
I've tried a bunch of things. Nested IF formula with regexmatch works but seems clunky.
e.g. this formula in Categories! column D
=IF(REGEXMATCH($B2,LOWER(Parameters!$D$3)),Parameters!$D$3,IF(REGEXMATCH($B2,LOWER(Parameters!$D$4)),Parameters!$D$4,""))
I nested more statements changing out to the next cell in Parameters!D column (as in , manually adding $D$5, $D$6 etc) but this seems inefficient for a list thousands of words long. e.g. third topic will get very long once all dog breed types are added.
Any tips?
Functionality I haven't worked out:
if a string in Categories B or C contains more than one topic in the parameters I set out, is there a way I can have the first 2 to show instead of just the first one?
e.g. Cell A14 in Categories, how can I get a formula/automation to add both "Akita" & "German Shepherd" into the third topic? Concatenation with a CHAR(10) to add to new line is ideal format here. There will be other keywords that won't have both in there in which case these values will just show up individually.
Since this data set has a bunch of mixed breeds and all breeds are added as a third topic, it would be great to differentiate interest in mixes vs pure breeds without confusion.
Any ideas will be greatly appreciated! Also, I'm open to variations in layout and functionality of the spreadsheet in case you have a more creative solution. I just care about efficiently automating a tedious task!!

Try using custom function:
To create custom function:
1.Create or open a spreadsheet in Google Sheets.
2.Select the menu item Tools > Script editor.
3.Delete any code in the script editor and copy and paste the code below into the script editor.
4.At the top, click Save save.
To use custom function:
1.Click the cell where you want to use the function.
2.Type an equals sign (=) followed by the function name and any input value — for example, =DOUBLE(A1) — and press Enter.
3.The cell will momentarily display Loading..., then return the result.
Code:
function matchTopic(p, str) {
var params = p.flat(); //Convert 2d array into 1d
var buildRegex = params.map(i => '(' + i + ')').join('|'); //convert array into series of capturing groups. Example (Dog)|(Puppies)
var regex = new RegExp(buildRegex,"gi");
var results = str.match(regex);
if(results){
// The for loops below will convert the first character of each word to Uppercase
for(var i = 0 ; i < results.length ; i++){
var words = results[i].split(" ");
for (let j = 0; j < words.length; j++) {
words[j] = words[j][0].toUpperCase() + words[j].substr(1);
}
results[i] = words.join(" ");
}
return results.join(","); //return with comma separator
}else{
return ""; //return blank if result is null
}
}
Example Usage:
Parameters:
First Topic:
Second Topic:
Third Topic:
Reference:
Custom Functions

I've added a new sheet ("Erik Help") with separate formulas (highlighted in green currently) for each of your keyword columns. They are each essentially the same except for specific column references, so I'll include only the "First Topic" formula here:
=ArrayFormula({"First Topic";IF(A2:A="",,IFERROR(REGEXEXTRACT(LOWER(B2:B&C2:C),JOIN("|",LOWER(FILTER(Parameters!D3:D,Parameters!D3:D<>""))))) & IFERROR(CHAR(10)&REGEXEXTRACT(REGEXREPLACE(LOWER(B2:B&C2:C),IFERROR(REGEXEXTRACT(LOWER(B2:B&C2:C),JOIN("|",LOWER(FILTER(Parameters!D3:D,Parameters!D3:D<>""))))),""),JOIN("|",LOWER(FILTER(Parameters!D3:D,Parameters!D3:D<>""))))))})
This formula first creates the header (which can be changed within the formula itself as you like).
The opening IF condition leaves any row in the results column blank if the corresponding cell in Column A of that row is also blank.
JOIN is used to form a concatenated string of all keywords separated by the pipe symbol, which REGEXEXTRACT interprets as OR.
IFERROR(REGEXEXTRACT(LOWER(B2:B&C2:C),JOIN("|",LOWER(FILTER(Parameters!D3:D,Parameters!D3:D<>""))))) will attempt to extract any of the keywords from each concatenated string in Columns B and C. If none is found, IFERROR will return null.
Then a second-round attempt is made:
& IFERROR(CHAR(10)&REGEXEXTRACT(REGEXREPLACE(LOWER(B2:B&C2:C),IFERROR(REGEXEXTRACT(LOWER(B2:B&C2:C),JOIN("|",LOWER(FILTER(Parameters!D3:D,Parameters!D3:D<>""))))),""),JOIN("|",LOWER(FILTER(Parameters!D3:D,Parameters!D3:D<>"")))))
Only this time, REGEXREPLACE is used to replace the results of the first round with null, thus eliminating them from being found in round two. This will cause any second listing from the JOIN clause to be found, if one exists. Otherwise, IFERROR again returns null for round two.
CHAR(10) is the new-line character.
I've written each of the three formulas to return up to two results for each keyword column. If that is not your intention for "First Topic" and "Second Topic" (i.e., if you only wanted a maximum of one result for each of those columns), just select and delete the entire round-two portion of the formula shown above from the formula in each of those columns.

POWER BI: Creating a calculated field based on multiple filters

I am fairly new to Power Bi and have searched in various online help forums but was unsuccessful in finding the one similar to mine. Hence posting this here. Not sure if this is a fairly straightforward one or complicated (as I think!)
I have 3 columns: 'Event', 'Screen' and 'Time' (Similar to below)
I want do a single calculation as below:
(2*count of "NameSubmitted" occurrences in Event) - (AVG Time of corresponding "NameSubmitted" (from Event) * count of "NameSubmitted" occurrences in Event)
+
(2*count of "AddressAdded" occurrences in Event) - (AVG Time of corresponding "AddAddress" (from screen) * count of "AddressAdded" occurrences in Event)
+
(2*count of "OrderCreated" occurrences in Event) - (AVG Time of corresponding sum of "Orders"+"OrderDetail"+"OrderConfirmation" (from screen) * count of "OrderCreated" occurrences in Event)`
My approach:
I have tried to create a new column with the following IF() function calculation but in vain (Started like below) and been receiving the following error:
Calc =
CALCULATE(
(2*SUM(IF(Table[Event] = "NameSubmitted",1,BLANK())))
- AVERAGE(IF(Table[Event] = "NameSubmitted")
)) .....
Error: The SUM function only accepts a column reference as an argument
Any help is much appreciated.

I'm not going to give a complete answer, but the last piece should look something like this:
= 2 * COUNTROWS(FILTER(Table1, Table1[Event] = "OrderCreated"))
- CALCULATE(AVERAGE(Table1[Time]),
Table1[Screen] IN {"Orders", "OrderDetail", "OrderConfirmation"})
* COUNTROWS(FILTER(Table1, Table1[Event] = "OrderCreated"))
Note: It's more efficient if you factor out the COUNTROWS():
= COUNTROWS(FILTER(Table1, Table1[Event] = "OrderCreated")) *
(2 - CALCULATE(AVERAGE(Table1[Time]),
Table1[Screen] IN {"Orders", "OrderDetail", "OrderConfirmation"}))
You should be able to get the rest from there.
There are a variety of ways to count occurrences instead of using COUNTROWS(). The closest to your attempt would probably be like this:
= SUMX(Table1, IF(Table1[Event]="NameSubmitted", 1, BLANK()))

Excel, duplicates in string, single cell iteration

I'm trying to extract certain pieces of data from a very long string within a single cell. For the sake of this exercise, this is the data I have in cell A1.
a:2:{s:15:"info_buyRequest";a:5:{s:4:"uenc";s:252:"WN0aW9uYWwuaHRlqdyZ2dC1hdD0lN0JhZHR5cGUlN0QmdnQtcHRpPSU3QmFkd29yZHNfcHJvZHVjdHRhcmdldGlkJTdEJiU3Qmlnbm9y,";s:7:"product";s:4:"1253";s:8:"form_key";s:16:"wyfg89N";s:7:"options";a:6:{i:10144;s:5:"73068";i:10145;s:5:"63085";i:10141;s:5:"73059";i:10143;s:5:"73064";i:13340;s:5:"99988";i:10142;s:5:"73063";}s:3:"qty";s:1:"1";}s:7:"options";a:6:{i:0;a:7:{s:5:"label";s:5:"Color";s:5:"value";s:11:"White";s:11:"print_value";s:11:"White";s:9:"option_id";s:5:"10144";s:11:"option_type";s:9:"drop_down";s:12:"option_value";s:5:"73068";s:11:"custom_view";b:0;}i:1;a:7:{s:5:"label";s:4:"Trim";s:5:"value";s:11:"Black";s:11:"print_value";s:11:"Black";s:9:"option_id";s:5:"10145";s:11:"option_type";s:9:"drop_down";s:12:"option_value";s:5:"63085";s:11:"custom_view";b:0;}i:2;a:7:{s:5:"label";s:7:"Material";s:5:"value";s:15:"Vinyl";s:11:"print_value";s:15:"Vinyl";s:9:"option_id";s:5:"10141";s:11:"option_type";s:9:"drop_down";s:12:"option_value";s:5:"73059";s:11:"custom_view";b:0;}i:3;a:7:{s:5:"label";s:6:"Orientation";s:5:"value";s:17:"Left Side";s:11:"print_value";s:17:"Left Side";s:9:"option_id";s:5:"10143";s:11:"option_type";s:9:"drop_down";s:12:"option_value";s:5:"73064";s:11:"custom_view";b:0;}i:4;a:7:{s:5:"label";s:12:"Table";s:5:"value";s:16:"YES! Add Table";s:11:"print_value";s:16:"YES! Add Table";s:9:"option_id";s:5:"13340";s:11:"option_type";s:9:"drop_down";s:12:"option_value";s:5:"99988";s:11:"custom_view";b:0;}i:5;a:7:{s:5:"label";s:8:"Shipping";s:5:"value";s:20:"Front Door Delivery";s:11:"print_value";s:20:"Front Door Delivery";s:9:"option_id";s:5:"10142";s:11:"option_type";s:9:"drop_down";s:12:"option_value";s:5:"73063";s:11:"custom_view";b:0;}}}
The end result, would be to separate the values for Color, Trim, Material Orientation, etc.
The formula I was using is this:
=MID(LEFT(A4,FIND("print_value",A4)-9),FIND("Color",A4)+25,LEN(A4))
This basically looks in between two points and trims out the fat. It works, but only for the first iteration of "print_value". If I were to use this searching for "Trim"...
=MID(LEFT(A4,FIND("print_value",A4)-9),FIND("Trim",A4)+25,LEN(A4))
...I get an empty result. This happens because print_value is duplicate and not unique to the string. Excel doesn't understand what point to apply its function to and poops itself.
Even though there are unique factors within this string that I could essentially attach myself to (and arrive at the desired result), I CAN NOT use them as they will not be consistent and will render the formula useless when applied to other cells.
That said, here is what I need. Within this formula, I need a way to either A) tell the formula which iteration of print_value to find or B) change print_value to print_value(1,2,3,4, etc) and then run my trimming formula.

Few options based on this link:
1) VBA - Using a User Defined Function
If you're new to these then follow this tutorial.
Function FindN(sFindWhat As String, _
sInputString As String, N As Integer) As Integer
Dim J As Integer
Application.Volatile
FindN = 0
For J = 1 To N
FindN = InStr(FindN + 1, sInputString, sFindWhat)
If FindN = 0 Then Exit For
Next
End Function
2) Using a Formula
=FIND(CHAR(1),SUBSTITUTE(A1,"c",CHAR(1),3))
c is the character you want to find
A1 is the text you want to look in
3 is the nth instance

Regex with SQL Server 2008 CLR performance issues

I am trying to understand why is it taking so long to execute a simple query.
In my local machine it takes 10 seconds but in production it takes 1 min.
(I imported the database from production into my local database)
select *
from JobHistory
where dbo.LikeInList(InstanceID, 'E218553D-AAD1-47A8-931C-87B52E98A494') = 1
The table DataHistory is not indexed and it has 217,302 rows
public partial class UserDefinedFunctions
{
[SqlFunction]
public static bool LikeInList([SqlFacet(MaxSize = -1)]SqlString value, [SqlFacet(MaxSize = -1)]SqlString list)
{
foreach (string val in list.Value.Split(new char[] { ',' }, StringSplitOptions.None))
{
Regex re = new Regex("^.*" + val.Trim() + ".*$", RegexOptions.IgnoreCase);
if (re.IsMatch(value.Value))
{
return(true);
}
}
return (false);
}
};
And the issue is that if a table has 217k rows then I will be calling that function 217,000 times! not sure how I can rewrite this thing.
Thank you

There are several issues with this code:
Missing (IsDeterministic = true, IsPrecise = true) in [SqlFunction] attribute. Doing this (mainly just the IsDeterministic = true part) will allow the SQLCLR UDF to participate in parallel execution plans. Without setting IsDeterministic = true, this function will prevent parallel plans, just like T-SQL UDFs do.
Return type is bool instead of SqlBoolean
RegEx call is inefficient: using an instance method once is expensive. Switch to using the static Regex.IsMatch instead
RegEx pattern is very inefficient: wrapping the search string in "^.*" and ".*$" will require the RegEx engine to parse and retain in memory as the "match", the entire contents of the value input parameter, for every single iteration of the foreach. Yet the behavior of Regular Expressions is such that simply using val.Trim() as the entire pattern would yield the exact same result.
(optional) If neither input parameter will ever be over 4000 characters, then specify a MaxSize of 4000 instead of -1 since NVARCHAR(4000) is much faster than NVARCHAR(MAX) for passing data into, and out of, SQLCLR objects.

Best way to compare phone numbers using Regex

I have two databases that store phone numbers. The first one stores them with a country code in the format 15555555555 (a US number), and the other can store them in many different formats (ex. (555) 555-5555, 5555555555, 555-555-5555, 555-5555, etc.). When a phone number unsubscribes in one database, I need to unsubscribe all references to it in the other database.
What is the best way to find all instances of phone numbers in the second database that match the number in the first database? I'm using the entity framework. My code right now looks like this:
using (FusionEntities db = new FusionEntities())
{
var communications = db.Communications.Where(x => x.ValueType == 105);
foreach (var com in communications)
{
string sRegexCompare = Regex.Replace(com.Value, "[^0-9]", "");
if (sMobileNumber.Contains(sRegexCompare) && sRegexCompare.Length > 6)
{
var contact = db.Contacts.Where(x => x.ContactID == com.ContactID).FirstOrDefault();
contact.SMSOptOutDate = DateTime.Now;
}
}
}
Right now, my comparison checks to see if the first database contains at least 7 digits from the second database after all non-numeric characters are removed.
Ideally, I want to be able to apply the regex formatting to the point in the code where I get the data from the database. Initially I tried this, but I can't use replace in a LINQ query:
var communications = db.Communications.Where(x => x.ValueType == 105 && sMobileNumber.Contains(Regex.Replace(x.Value, "[^0-9]", "")));

Comparing phone numbers is a bit beyond the capability of regex by design. As you've discovered there are many ways to represent a phone number with and without things like area codes and formatting. Regex is for pattern matching so as you've found using the regex to strip out all formatting and then comparing strings is doable but putting logic into regex which is not what it's for.
I would suggest the first and biggest thing to do is sort out the representation of phone numbers. Since you have database access you might want to look at creating a new field or table to represent a phone number object. Then put your comparison logic in the model.
Yes it's more work but it keeps the code more understandable going forward and helps cleanup crap data.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js