Google Script: Match RegEx into 2D array - regex

I'm trying to extract information from Gmail into Google Spreadsheet. The information in the email has a table structure with the following columns List of Products, QTY Sold and the Subtotal for each product. These repeat N times.
When accesing the information using message.getPlainBody() I get the following text:
Product
Quantity
Price
Chocolate
1
$8.58
Apples
2
$40.40
Bananas
1
$95.99
Candy
1
$4.99
Subtotal:
$149.96
Progress
First I tried to use a regular expression to identify each row with all his elements:
Product name: Any amount of characters that don't include ':' (.*)[^:]
QTY Sold: Any number \d*
Anything that looks like a SubTotal [$]\d*.\d*
Wrapping everything up it looks like this
function ExtractDetail(message){
var mainbody = message.getPlainBody();
//RegEx
var itemListRegex = new RegExp(/(.*)[^:][\r\n]+(\d*[\r\n]+[$](\d*\.\d*)[\r\n]+/g);
var itemList = mainbody.match(itemListRegex);
Logger.log(itemList);
}
And so far it works:
itemList: Chocolate 1 $8.58 ,Apples 2 $40.40 ,Bananas 1 $95.99
,Candy 1 $4.99
However, I'm getting the following result:
[Chocolate 1 $8.58]
[Apples 2 $40.40]
[Bananas 1 $95.99]
[Candy 1 $4.99]
Instead of:
[Chocolate] [ 1 ] [$8.58]
[Apples] [ 2 ] [$40.40]
[Bananas] [ 1 ] [$95.99]
[Candy] [ 1 ] [$4.99]
Question
My question is, how can I append a new row in a way that it each row corresponds to each match found and that each column corresponds to each property?
How do I turn the result of each match into an array? Is it possible or should I change my approach?
Update:
Since the result of my current attemp is a large string I'm trying to find other options. This one poped up:
var array = Array.from(mainbody.matchAll(itemListRegex), m => m[1]);
Source: How do you access the matched groups in a JavaScript regular expression?
I'm still working on it. I still need to find how to add more columns and for some reason it starts on 'Apples' (following the examples), leaving 'Chocolates' behind.
Log:
Logger.log('array: ' + array);

If you want to use matchAll like Array.from(mainbody.matchAll(itemListRegex), m => m[1]), how about this modification?
In this case, /(.*[^:])[\r\n]+(\d*)[\r\n]+([$]\d*\.\d*)[\r\n]/g is used as the regex.
Modified script:
const itemListRegex = /(.*[^:])[\r\n]+(\d*)[\r\n]+([$]\d*\.\d*)[\r\n]/g;
var array = Array.from(mainbody.matchAll(itemListRegex), ([,b,c,d]) => [b,Number(c),d]);
Result:
[
["Chocolate",1,"$8.58"],
["Apples",2,"$40.40"],
["Bananas",1,"$95.99"],
["Candy",1,"$4.99"]
]
The result is the same with TheMaster's answer.
Test of script:
const mainbody = `
Product
Quantity
Price
Chocolate
1
$8.58
Apples
2
$40.40
Bananas
1
$95.99
Candy
1
$4.99
Subtotal:
$149.96
`;
const itemListRegex = /(.*[^:])[\r\n]+(\d*)[\r\n]+([$]\d*\.\d*)[\r\n]/g;
var array = Array.from(mainbody.matchAll(itemListRegex), ([,b,c,d]) => [b,Number(c),d]);
console.log(array)
Note:
About how can I append a new row in a way that it each row corresponds to each match found and that each column corresponds to each property?, this means for putting the values to Spreadsheet? If it's so, can you provide a sample result you expect?
References:
matchAll()
Array.from()

Map and split the resulting array by \new lines:
const data = `Product
Quantity
Price
Chocolate
1
$8.58
Apples
2
$40.40
Bananas
1
$95.99
Candy
1
$4.99
Subtotal:
$149.96`;
const itemListRegex = /.*[^:][\r\n]+\d*[\r\n]+\$\d*\.\d*(?=[\r\n]+)/g;
const itemList = data.match(itemListRegex);
console.info(itemList.map(e => e.split(/\n/)));//map and split

Related

Google Sheets Search and Sum in two lists

I have a Google Sheets question I was hoping someone could help with.
I have a list of about 200 keywords which looks like the ones below:
**List 1**
Italy City trip
Italy Roundtrip
Italy Holiday
Hungary City trip
Czechia City trip
Croatia Montenegro Roundtrip
....
....
And I then have another list with jumbled keywords with around 1 million rows. The keywords in this list don't exactly match with the first list. What I need to do is search for the keywords in list 1 (above) in list 2 (below) and sum all corresponding cost values. As you can see in the list below the keywords from list 1 are in the second list but with other keywords around them. For example, I need a formula that will search for "Italy City trip" from list 1, in list 2 and sum the cost when that keyword occurs. In this case, it would be 6 total. Adding the cost of "Italy City trip April" and "Italy City trip June" together.
**List 2** Cost
Italy City trip April 1
Italy City trip June 5
Next week Italy Roundtrip 4
Italy Holiday next week 1
Hungary City holiday trip 9
....
....
I hope that makes sense.
Any help would be greatly appreciated
try:
=ARRAYFORMULA(QUERY({IFNA(REGEXEXTRACT(PROPER(C1:C),
TEXTJOIN("|", 1, SORT(PROPER(A1:A), 1, 0)))), D1:D},
"select Col1,sum(Col2)
where Col1 is not null
group by Col1
label sum(Col2)''", 0))
You want to establish whether keywords in one list (List#1) can be found in another list (List#2).
List#2 is 1,000,000 rows long, so I would recommend segmenting the list so that execution times are not exceeded. That's something you will be able to establish by trial and error.
The solution is to use the javascript method indexOf.
Paraphrasing from w3schools: indexOf() returns the position of the first occurrence of a specified value in a string. If the value is not found, it returns -1. So testing if (idx !=-1){ will only return List#1 values that were found in List#2. Note: The indexOf() method is case sensitive.
function so5864274503() {
var ss = SpreadsheetApp.getActiveSpreadsheet();
var srcname = "source";
var tgtname = "target";
var sourceSheet = ss.getSheetByName(srcname);
var targetSheet = ss.getSheetByName(tgtname);
// get the source list
var sourceLR = sourceSheet.getLastRow();
var srcData = sourceSheet.getRange(1,1,sourceLR).getValues();
//get the target list
var targetLR = targetSheet.getLastRow();
var tgtlist = targetSheet.getRange(1,1,targetLR,2).getValues();
var totalcostvalues = [];
// start looping through the keywords (list 1)
for (var s = 0;s<srcData.length;s++){
var totalcost = 0;
var value = srcData[s][0]
// start looping through the strings (List 2)
for (var i=0;i<tgtlist.length;i++){
// set cost to zero
var cumcost = 0;
// use indexOf to test if keyword is in the string
var idx = tgtlist[i][0].indexOf(value);
// value of -1 = no match, value >-1 indicates posuton in the string where the key word was found
if (idx !=-1){
var cost = tgtlist[i][1]
cumcost = cumcost + cost;
totalcost = totalcost+cost
}
}//end of loop - list2
//Logger.log("DEBUG: Summary: "+value+", totalcost = "+totalcost)
totalcostvalues.push([totalcost])
}// end of loop - list1
//Logger.log(totalcostvalues); //DEBUG
sourceSheet.getRange(1,2,sourceLR).setValues(totalcostvalues);
}
I also got this one, but it's case sensitive a bit
function myFunction() {
var ss = SpreadsheetApp.getActive();
var sheet1 = ss.getSheets()[0];
var sheet2 = ss.getSheets()[1];
var valuesSheet1 = sheet1.getRange(2,1, (sheet1.getLastRow()-1), sheet1.getLastColumn()).getValues();
var valuesCol1Sheet1 = valuesSheet1.map(function(r){return r[0]});
var valuesCol2Sheet1 = valuesSheet1.map(function(r){return r[1]});
Logger.log(valuesCol2Sheet1);
var valuesSheet2 = sheet2.getRange(2,1, (sheet2.getLastRow()-1)).getValues();
var valuesCol1Sheet2 = valuesSheet2.map(function(r){return r[0]});
for (var i = 0; i<= valuesCol1Sheet2.length-1; i++){
var price = 0;
valuesCol1Sheet1.forEach(function(elt,index){
var position = elt.toLowerCase().indexOf(valuesCol1Sheet2[i].toLowerCase());
if(position >-1){
price = price + valuesCol2Sheet1[index];
}
});
sheet2.getRange((i+2),2).setValue(price);
};
}

Return top value ordered by another column

Suppose I have a table as follows:
TableA =
DATATABLE (
"Year", INTEGER,
"Group", STRING,
"Value", DOUBLE,
{
{ 2015, "A", 2 },
{ 2015, "B", 8 },
{ 2016, "A", 9 },
{ 2016, "B", 3 },
{ 2016, "C", 7 },
{ 2017, "B", 5 },
{ 2018, "B", 6 },
{ 2018, "D", 7 }
}
)
I want a measure that returns the top Group based on its Value that work inside or outside a Year filter context. That is, it can be used in a matrix visual like this (including the Total row):
It's not hard to find the maximal value using DAX:
MaxValue = MAX(TableA[Value])
or
MaxValue = MAXX(TableA, TableA[Value])
But what is the best way to look up the Group that corresponds to that value?
I've tried this:
Top Group = LOOKUPVALUE(TableA[Group],
TableA[Year], MAX(TableA[Year]),
TableA[Value], MAX(TableA[Value]))
However, this doesn't work for the Total row and I'd rather not have to use the Year in the measure if possible (there are likely other columns to worry about in a real scenario).
Note: I am providing a couple solutions in the answers below, but I'd love to see any other approaches as well.
Ideally, it would be nice if there were an extra argument in the MAXX function that would specify which column to return after finding the maximum, much like the MAXIFS Excel function has.
Another way to do this is through the use of the TOPN function.
The TOPN function returns entire row(s) instead of a single value. For example, the code
TOPN(1, TableA, TableA[Value])
returns the top 1 row of TableA ordered by TableA[Value]. The Group value associated with that top Value is in the row, but we need to be able to access it. There are a couple of possibilities.
Use MAXX:
Top Group = MAXX(TOPN(1, TableA, TableA[Value]), TableA[Group])
This finds the maximum Group from the TOPN table in the first argument. (There is only one Group value, but this allows us to covert a table into a single value.)
Use SELECTCOLUMNS:
Top Group = SELECTCOLUMNS(TOPN(1, TableA, TableA[Value]), "Group", TableA[Group])
This function usually returns a table (with the columns that are specified), but in this case, it is a table with a single row and a single column, which means the DAX interprets it as just a regular value.
One way to do this is to store the maximum value and use that as a filter condition.
For example,
Top Group =
VAR MaxValue = MAX(TableA[Value])
RETURN MAXX(FILTER(TableA, TableA[Value] = MaxValue), TableA[Group])
or similarly,
Top Group =
VAR MaxValue = MAX(TableA[Value])
RETURN CALCULATE(MAX(TableA[Group]), TableA[Value] = MaxValue)
If there are multiple groups with the same maximum value the measures above will pick the first one alphabetically. If there are multiple and you want to show all of them, you could use a concatenate iterator function:
Top Group =
VAR MaxValue = MAX(TableA[Value])
RETURN CONCATENATEX(
CALCULATETABLE(
VALUES(TableA[Group]),
TableA[Value] = MaxValue
),
TableA[Group],
", "
)
If you changed the 9 in TableA to an 8, this last measure would return A, B rather than A.

Unique list from dynamic range table with possible blanks

I have an Excel table in sheet1 in which column A:
Name of company Company 1 Company 2 Company
3 Company 1 Company 4 Company 1 Company
3
I want to extract a unique list of company names to sheet2 also in column A. I can only do this with help of a helper column if I dont have any blanks between company names but when I do have I get one more company which is a blank.
Also, I've researched but the example was for non-dynamic tables and so it doesn't work because I don't know the length of my column.
I want in Sheet2 Column A:
Name of company Company 1 Company 2 Company 3
Company 4
Looking for the solution that requires less computational power Excel or Excel-VBA. The final order which they appear in sheet2 don't really matter.
Using a slight modification to Recorder-generated code:
Sub Macro1()
Sheets("Sheet1").Range("A:A").Copy Sheets("Sheet2").Range("A1")
Sheets("Sheet2").Range("A:A").RemoveDuplicates Columns:=1, Header:=xlYes
With Sheets("Sheet2").Sort
.SortFields.Clear
.SortFields.Add Key:=Range("A2:A" & Rows.Count) _
, SortOn:=xlSortOnValues, Order:=xlAscending, DataOption:=xlSortNormal
.SetRange Range("A2:A" & Rows.Count)
.Header = xlGuess
.MatchCase = False
.Orientation = xlTopToBottom
.SortMethod = xlPinYin
.Apply
End With
End Sub
Sample Sheet1:
Sample Sheet2:
The sort removes the blanks.
EDIT#1:
If the original data in Sheet1 was derived from formulas, then using PasteSpecial will remove unwanted formula copying. There is also a final sweep for empty cells:
Sub Macro1_The_Sequel()
Dim rng As Range
Sheets("Sheet1").Range("A:A").Copy
Sheets("Sheet2").Range("A1").PasteSpecial Paste:=xlPasteValues
Sheets("Sheet2").Range("A:A").RemoveDuplicates Columns:=1, Header:=xlYes
Set rng = Sheets("Sheet2").Range("A2:A" & Rows.Count)
With Sheets("Sheet2").Sort
.SortFields.Clear
.SortFields.Add Key:=rng, SortOn:=xlSortOnValues, Order:=xlAscending, DataOption:=xlSortNormal
.SetRange rng
.Header = xlGuess
.MatchCase = False
.Orientation = xlTopToBottom
.SortMethod = xlPinYin
.Apply
End With
Call Kleanup
End Sub
Sub Kleanup()
Dim N As Long, i As Long
With Sheets("Sheet2")
N = .Cells(Rows.Count, "A").End(xlUp).Row
For i = N To 1 Step -1
If .Cells(i, "A").Value = "" Then
.Cells(i, "A").Delete shift:=xlUp
End If
Next i
End With
End Sub
All of these answers use VBA. The easiest way to do this is to use a pivot table.
First, select your data, including the header row, and go to Insert -> PivotTable:
Then you will get a dialog box. You don't need to select any of the options here, just click OK. This will create a new sheet with a blank pivot table. You then need to tell Excel what data you're looking for. In this case, you only want the Name of company in the Rows section. On the right-hand side of Excel you will see a new section named PivotTable Fields. In this section, simply click and drag the header to the Rows section:
This will give a result with just the unique names and an entry with (blank) at the bottom:
If you don't want to use the Pivot Table further, simply copy and paste the result rows you're interested in (in this case, the unique company names) into a new column or sheet to get just those without the pivot table attached. If you want to keep the pivot table, you can right click on Grand Total and remove that, as well as filter the list to remove the (blank) entry.
Either way, you now have your list of unique results without blanks and it didn't require any formulas or VBA, and it took relatively few resources to complete (far fewer than any VBA or formula solution).
Here's another method using Excel's built-in Remove Duplicates feature, and a programmed method to remove the blank lines:
EDIT
I have deleted the code using the above methodology as it takes too long to run. I have replaced it with a method that uses VBA's collection object to compile a unique list of companies.
The first method, on my machine, took about two seconds to run; the method below: about 0.02 seconds.
Sub RemoveDups()
Dim wsSrc As Worksheet, wsDest As Worksheet
Dim rRes As Range
Dim I As Long, S As String
Dim vSrc As Variant, vRes() As Variant, COL As Collection
Set wsSrc = Worksheets("sheet1")
Set wsDest = Worksheets("sheet2")
Set rRes = wsDest.Cells(1, 1)
'Get the source data
With wsSrc
vSrc = .Range(.Cells(1, 1), .Cells(.Rows.Count, 1).End(xlUp))
End With
'Collect unique list of companies
Set COL = New Collection
On Error Resume Next
For I = 2 To UBound(vSrc, 1) 'Assume Row 1 is the header
S = CStr(Trim(vSrc(I, 1)))
If Len(S) > 0 Then COL.Add S, S
Next I
On Error GoTo 0
'Populate results array
ReDim vRes(0 To COL.Count, 1 To 1)
'Header
vRes(0, 1) = vSrc(1, 1)
'Companies
For I = 1 To COL.Count
vRes(I, 1) = COL(I)
Next I
'set results range
Set rRes = rRes.Resize(UBound(vRes, 1) + 1)
'Write the results
With rRes
.EntireColumn.Clear
.Value = vRes
.EntireColumn.AutoFit
'Uncomment the below line if you want
'.Sort key1:=.Columns(1), order1:=xlAscending, MatchCase:=False, Header:=xlYes
End With
End Sub
NOTE: You wrote you didn't care about the order, but if you want to Sort the results, that added about 0.03 seconds to the routine.
With two sheets named 1 and 2
Inside sheet named: 1
+----+-----------------+
| | A |
+----+-----------------+
| 1 | Name of company |
| 2 | Company 1 |
| 3 | Company 2 |
| 4 | |
| 5 | Company 3 |
| 6 | Company 1 |
| 7 | |
| 8 | Company 4 |
| 9 | Company 1 |
| 10 | Company 3 |
+----+-----------------+
Result in sheet named: 2
+---+-----------------+
| | A |
+---+-----------------+
| 1 | Name of company |
| 2 | Company 1 |
| 3 | Company 2 |
| 4 | Company 3 |
| 5 | Company 4 |
+---+-----------------+
Use this code in a regular module:
Sub extractUni()
Dim objDic
Dim Cell
Dim Area As Range
Dim i
Dim Value
Set Area = Sheets("1").Range("A2:A10") 'this is where your data is located
Set objDic = CreateObject("Scripting.Dictionary") 'use a Dictonary!
For Each Cell In Area
If Not objDic.Exists(Cell.Value) Then
objDic.Add Cell.Value, Cell.Address
End If
Next
i = 2 '2 because the heading
For Each Value In objDic.Keys
If Not Value = Empty Then
Sheets("2").Cells(i, 1).Value = Value 'Store the data in column D below the heading
i = i + 1
End If
Next
End Sub
The code return the date unsorted, just the way data appears.
if you want a sorted list, just add this code before the las line:
Dim sht As Worksheet
Set sht = Sheets("2")
sht.Activate
With sht.Sort
.SetRange Range("A:A")
.Header = xlYes
.MatchCase = False
.Orientation = xlTopToBottom
.SortMethod = xlPinYin
.Apply
End With
This way the result will be always sorted.
(The subrutine would be like this)
Sub extractUni()
Dim objDic
Dim Cell
Dim Area As Range
Dim i
Dim Value
Set Area = Sheets("1").Range("A2:A10") 'this is where your data is located
Set objDic = CreateObject("Scripting.Dictionary") 'use a Dictonary!
For Each Cell In Area
If Not objDic.Exists(Cell.Value) Then
objDic.Add Cell.Value, Cell.Address
End If
Next
i = 2 '2 because the heading
For Each Value In objDic.Keys
If Not Value = Empty Then
Sheets("2").Cells(i, 1).Value = Value 'Store the data in column D below the heading
i = i + 1
End If
Next
Dim sht As Worksheet
Set sht = Sheets("2")
sht.Activate
With sht.Sort
.SetRange Range("A:A")
.Header = xlYes
.MatchCase = False
.Orientation = xlTopToBottom
.SortMethod = xlPinYin
.Apply
End With
End Sub
If you have any question about the code, I will glad to explain.

Split Pandas Column by values that are in a list

I have three lists that look like this:
age = ['51+', '21-30', '41-50', '31-40', '<21']
cluster = ['notarget', 'cluster3', 'allclusters', 'cluster1', 'cluster2']
device = ['htc_one_2gb','iphone_6/6+_at&t','iphone_6/6+_vzn','iphone_6/6+_all_other_devices','htc_one_2gb_limited_time_offer','nokia_lumia_v3','iphone5s','htc_one_1gb','nokia_lumia_v3_more_everything']
I also have column in a df that looks like this:
campaign_name
0 notarget_<21_nokia_lumia_v3
1 htc_one_1gb_21-30_notarget
2 41-50_htc_one_2gb_cluster3
3 <21_htc_one_2gb_limited_time_offer_notarget
4 51+_cluster3_iphone_6/6+_all_other_devices
I want to split the column into three separate columns based on the values in the above lists. Like so:
age cluster device
0 <21 notarget nokia_lumia_v3
1 21-30 notarget htc_one_1gb
2 41-50 cluster3 htc_one_2gb
3 <21 notarget htc_one_2gb_limited_time_offer
4 51+ cluster3 iphone_6/6+_all_other_devices
First thought was to do a simple test like this:
ages_list = []
for i in ages:
if i in df['campaign_name'][0]:
ages_list.append(i)
print ages_list
>>> ['<21']
I was then going to convert ages_list to a series and combine it with the remaining two to get the end result above but i assume there is a more pythonic way of doing it?
the idea behind this is that you'll create a regular expression based on the values you already have , for example if you want to build a regular expressions that capture any value from your age list you may do something like this '|'.join(age) and so on for all the values you already have cluster & device.
a special case for device list becuase it contains + sign that will conflict with the regex ( because + means one or more when it comes to regex ) so we can fix this issue by replacing any value of + with \+ , so this mean I want to capture literally +
df = pd.DataFrame({'campaign_name' : ['notarget_<21_nokia_lumia_v3' , 'htc_one_1gb_21-30_notarget' , '41-50_htc_one_2gb_cluster3' , '<21_htc_one_2gb_limited_time_offer_notarget' , '51+_cluster3_iphone_6/6+_all_other_devices'] })
def split_df(df):
campaign_name = df['campaign_name']
df['age'] = re.findall('|'.join(age) , campaign_name)[0]
df['cluster'] = re.findall('|'.join(cluster) , campaign_name)[0]
df['device'] = re.findall('|'.join([x.replace('+' , '\+') for x in device ]) , campaign_name)[0]
return df
df.apply(split_df, axis = 1 )
if you want to drop the original column you can do this
df.apply(split_df, axis = 1 ).drop( 'campaign_name', axis = 1)
Here I'm assuming that a value must be matched by regex but if this is not the case you can do your checks , you got the idea

Removing duplicates from the data

I already loaded 20 csv files with function:
tbl = list.files(pattern="*.csv")
list_of_data = lapply(tbl, read.csv)
I combined all of those filves into one:
all_data = do.call(rbind.fill, list_of_data)
In the new table is a column called "Accession". After combining many of the names (Accession) are repeated. And I would like to remove all of the duplicates.
Another problem is that some of those "names" are ALMOST the same. The difference is that there is name and after become the dot and the number.
Let me show you how it looks:
AT3G26450.1 <--
AT5G44520.2
AT4G24770.1
AT2G37220.2
AT3G02520.1
AT5G05270.1
AT1G32060.1
AT3G52380.1
AT2G43910.2
AT2G19760.1
AT3G26450.2 <--
<-- = Same sample, different names. Should be treated as one. So just ignore dot and a number after.
Tried this one:
all_data$CleanedAccession = str_extract(all_data$Accession, "^[[:alnum:]]+")
all_data = subset(all_data, !duplicated(CleanedAccession))
Error in `$<-.data.frame`(`*tmp*`, "CleanedAccession", value = character(0)) :
You can use this command to both subset and rename the values:
subset(transform(alldata, Ascension = sub("\\..*", "", Ascension)),
!duplicated(Ascension))
Ascension
1 AT3G26450
2 AT5G44520
3 AT4G24770
4 AT2G37220
5 AT3G02520
6 AT5G05270
7 AT1G32060
8 AT3G52380
9 AT2G43910
10 AT2G19760
What about
df <- data.frame( Accession = c("AT3G26450.1",
"AT5G44520.2",
"AT4G24770.1",
"AT2G37220.2",
"AT3G02520.1",
"AT5G05270.1",
"AT1G32060.1",
"AT3G52380.1",
"AT2G43910.2",
"AT2G19760.1",
"AT3G26450.2"))
df[!duplicated(unlist(lapply(strsplit(as.character(df$Accession),
".", fixed = T), "[", 1))), ]