I have a function which gets a string from another website, and once I extract it I end up with the following string:
IFX TMP2134567 1433010010 WT33 PARTIAL 2014-11-26 09:43:58 IFX TEMP12345 1433010003 SW80 PARTIAL 2014-11-26 09:43:10 IFX AP RETERM 007 1418310108 MB01 CONFIRMED 2014-07-03 09:48:37
In this case it's 3 records with 6 fields each, and everything is separated by a space. How can I read the string and put the records into a structure and array so I can access them?
The fields would be set up like this:
IFX
TMP2134567 (this field may contain a space)
1433010010
WT33
PARTIAL
2014-11-26 09:43:58.
So if we use " " as the separator we would get 7 items, since the 6th field is a date/time with a space in it. I could also live with 7, since I can either put 6 and 7 back together or store date and time separately.
My question: is there a way to do this with 6 fields, and if I have to use 7, how would I do that? I tried valuelist but that does not work.
I know a couple of things about my list: the 1st field is always 3 characters, the 4th is always 4 characters, and each record ends with a date/time in the format YYYY-MM-DD HH:MM:SS.
To make it a bit more complicated, I just found that the 2nd field can contain spaces, as in the 3rd record, where it is "AP RETERM 007".
Another option is to create a JSON string with your data like this, and then deserialize it.
<cfsavecontent variable="sampledata">
IFX TMP2134567 1433010010 WT33 PARTIAL 2014-11-26 09:43:58 IFX TEM P12345 1433010003 SW80 PARTIAL 2014-11-26 09:43:10 IFX AP RETERM 007 1418310108 MB01 CONFIRMED 2014-07-03 09:48:37</cfsavecontent>
<cfset asJson = ReReplaceNoCase(sampledata,"\s*(.{3}) (.*?) (\d+) (.{4}) ([^\s]*) (\d+-\d+-\d+ \d+:\d+:\d+)\s*",'["\1","\2","\3","\4","\5","\6"],',"ALL")>
<!--- Replace the last comma in the generated string with a closing bracket --->
<cfset asJson = "[" & ReReplace(asJson,",$","]","ALL")>
<cfset result_array = DeSerializeJSON(asJson)>
<cfdump var="#result_array#">
You can access the data simply with the resulting array.
So here's how I understand it
3 characters
Variable string
All digits
4 characters
I assume this value never contains a space
Date/Time
Based on assuming a "yes" to my question above, this solution works:
<cfscript>
raw = " IFX TMP2134567 1433010010 WT33 PARTIAL 2014-11-26 09:43:58 IFX TEMP12345 1433010003 SW80 PARTIAL 2014-11-26 09:43:10 IFX AP RETERM 007 1418310108 MB01 CONFIRMED 2014-07-03 09:48:37";
recordPattern = "(\S+)\s+([\w\s]+)\s+(\d+)\s+(\S+)\s+(\S+)\s+(\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})";
keys = ["a","b","c","d","e","f"];
records = getRecordsFromString(raw, recordPattern, keys);
writeDump(records);
function getRecordsFromString(raw, pattern, keys){
    var offset = 1;
    var records = [];
    while (true) {
        // use the pattern argument rather than the outer recordPattern variable
        var result = getRecord(raw, pattern, keys, offset);
        offset = result.offset;
        if (!offset) break;
        arrayAppend(records, result.record);
    }
    return records;
}
function getRecord(raw, recordPattern, keys, offset){
    var match = reFind(recordPattern, raw, offset, true);
    if (arrayLen(match.pos) != arrayLen(keys)+1){
        return {record="", offset=0};
    }
    // a fresh local struct per record, so the array entries do not share one reference
    var record = {};
    var keyIdx = 1;
    for (var key in keys){
        keyIdx++;
        record[key] = mid(raw, match.pos[keyIdx], match.len[keyIdx]);
    }
    // resume right after this match; the match may start later than offset
    return {record=record, offset=match.pos[1]+match.len[1]};
}
</cfscript>
Obviously you will need to tweak the recordPattern and keys to suit your actual needs.
And if you don't understand the regular expression usage there, do yourself a favour and read up on it. I do a series on "regular expressions in CFML" on my blog, which would be an adequate starting point.
I am trying to match the 1st character using regex in Google Apps Script (GAS). When I insert ^, the code doesn't work. If my id is "D", I need words starting with "D"; when the id is "DEL", I need all words starting with "DEL".
function getAirportMatch(e) {
    var airportlist = "DELHI VIDP, MINDELIHM HEID, DELHI VIDD, LIHELD HDEL";
    var id = "DEL";
    var regExp = new RegExp("^(?:"+id+")","gm"); // "i" is for case insensitive
    var airport = regExp.exec(airportlist);
    Logger.log(airport);
}
To match terms starting with something, use the word boundary \b and match everything up to the next comma (or the end, whichever comes first):
"\b"+id+"[^,]*"
If you simply want to see which airport names start with DEL you can simply make use of the startsWith method:
var airportlist = "DELHI VIDP, MINDELIHM HEID, DELHI VIDD, LIHELD HDEL";
var airportarray = airportlist.split(', ');
var id = "DEL";
airportarray.forEach(e => {if (e.startsWith(id.toString()))console.log(e)})
The snippet above creates an array with all the airport names and then checks if any of them start with the id you have supplied.
Reference
String.prototype.startsWith().
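Returning to the word-boundary suggestion from the first answer, here is a minimal sketch of that pattern in plain JavaScript. The helper name findAirports and the escaping of the id are illustrative additions, not part of the original answer. Note that \b matches at the start of any word, so an id such as "H" would pick up the second word of an entry ("HEID", "HDEL") rather than only the start of an entry.

```javascript
// Hypothetical helper: returns every term that starts with `id`,
// running up to the next comma or the end of the string.
function findAirports(list, id) {
  // Escape regex metacharacters so the id is matched literally.
  const escaped = id.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const re = new RegExp("\\b" + escaped + "[^,]*", "g");
  return list.match(re) || [];
}

const airportlist = "DELHI VIDP, MINDELIHM HEID, DELHI VIDD, LIHELD HDEL";
console.log(findAirports(airportlist, "DEL")); // [ 'DELHI VIDP', 'DELHI VIDD' ]
```

Inside a string literal the backslash has to be doubled ("\\b"), otherwise JavaScript reads "\b" as a backspace character.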
I am wondering if there is a way to handle situations where not every group has a match. In my case I have a text which I am trying to parse, but one record is missing some elements, so my pattern skips over data that I want to keep.
<FONT FACE="Arial,Helvetica" size=2>1260 CORONA POINTE STE 120<br/>CORONA, CA 92879<br/><br/></font></td></tr><tr valign="top"><td></td><td><FONT FACE="Arial,Helvetica" size=2>2124 MAIN ST STE 100<br/>HUNTINGTON BEACH, CA 92648<br/>00610922 Miller, David S - Branch/Division Manager<br><br/><br/></font></td></tr><tr valign="top"><td></td><td>
The pattern I am using is below, and it only creates one match:
/<FONT FACE="Arial,Helvetica" size=2>(.*?)<br\/>(.*?)<br\/>.*?License_id=(\d*?)">.*?<\/A>(.*?)<br>/gm
If I use this pattern I get 2 matches:
/<FONT FACE="Arial,Helvetica" size=2>(.*?)<br\/>(.*?)<br\/>/gm
In my case the source of the problem is that I am trying to match License_id= as well as the name, neither of which is available in the first record.
So what I am looking for is a way to return an empty match or something similar, so that a missing element does not offset my data.
I am using JavaScript / NodeJS.
This way is done in 2 or 3 steps.
It first gets the record, from a FONT tag to just before the next FONT tag.
Then it removes all the tags from the record by replacing with a newline.
That makes each content section that is left, a separate line.
It then splits the string on newline to get into an array.
The last 2 things are optional, take your pick.
var html = "<FONT FACE=\"Arial,Helvetica\" size=2>1260 CORONA POINTE STE 120<br/>CORONA, CA 92879<br/><br/></font></td></tr><tr valign=\"top\"><td></td><td><FONT FACE=\"Arial,Helvetica\" size=2>2124 MAIN ST STE 100<br/>HUNTINGTON BEACH, CA 92648<br/>00610922 Miller, David S - Branch/Division Manager<br><br/><br/></font></td></tr><tr valign=\"top\"><td></td><td>";
var rxTag = new RegExp( "(?:\\s*<(?:(?:(?:(script|style|object|embed|applet|noframes|noscript|noembed)(?:\\s+(?:\"[\\S\\s]*?\"|'[\\S\\s]*?'|(?:(?!/>)[^>])?)+)?\\s*>)[\\S\\s]*?</\\1\\s*(?=>))|(?:/?[\\w:]+\\s*/?)|(?:[\\w:]+\\s+(?:\"[\\S\\s]*?\"|'[\\S\\s]*?'|[^>]?)+\\s*/?)|\\?[\\S\\s]*?\\?|(?:!(?:(?:DOCTYPE[\\S\\s]*?)|(?:\\[CDATA\\[[\\S\\s]*?\\]\\])|(?:--[\\S\\s]*?--)|(?:ATTLIST[\\S\\s]*?)|(?:ENTITY[\\S\\s]*?)|(?:ELEMENT[\\S\\s]*?))))>\\s*)+", "g" );
var rxRecord = new RegExp( "<font(?:\"[\\S\\s]*?\"|'[\\S\\s]*?'|[^>]?)+>(?:(?!<font(?:\"[\\S\\s]*?\"|'[\\S\\s]*?'|[^>]?)+>)[\\S\\s])*", "gi");
var match;
while ( match = rxRecord.exec( html ) )
{
var rec = match[0];
var sData;
sData = rec.replace( rxTag, "\r\n" );
sData = sData.trim();
console.log( sData );
var ary = [];
ary = sData.split( /\r?\n/ );
console.log( ary );
}
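If a single regular expression is preferred, another option is to wrap the part that may be missing in an optional non-capturing group, so its inner capture groups come back as undefined instead of breaking the match. The sketch below is only an illustration under assumptions: the sample text contains no License_id= link, so the optional part here matches a run of digits followed by a name, and the names rxRecordOptional and parseRecords are made up for the example.

```javascript
const html =
  '<FONT FACE="Arial,Helvetica" size=2>1260 CORONA POINTE STE 120<br/>CORONA, CA 92879<br/><br/></font>' +
  '<FONT FACE="Arial,Helvetica" size=2>2124 MAIN ST STE 100<br/>HUNTINGTON BEACH, CA 92648<br/>' +
  '00610922 Miller, David S - Branch/Division Manager<br><br/><br/></font>';

// Groups 1 and 2 are required; groups 3 and 4 sit inside an optional (?: ... )? wrapper.
const rxRecordOptional =
  /<FONT FACE="Arial,Helvetica" size=2>(.*?)<br\/>(.*?)<br\/>(?:(\d+) (.*?)<br>)?/g;

function parseRecords(source) {
  const records = [];
  for (const m of source.matchAll(rxRecordOptional)) {
    records.push({
      street: m[1],
      city: m[2],
      licenseId: m[3] || "", // undefined when the optional part is absent
      name: m[4] || "",
    });
  }
  return records;
}

console.log(parseRecords(html));
```

Every record then has the same four keys, with empty strings where the optional part was missing, so later fields are never shifted.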
I need help with three regular expressions for date validation. The date formats to validate against should be:
- MMyy
- ddMMyy
- ddMMyyyy
Further:
I want the regular expressions to match the exact number of digits in the formats above. For instance, January should be 01, NOT 1:
060117 // ddMMyy format: Ok
06117 // ddMMyy format: NOT Ok
Hyphens and slashes are NOT allowed, like: 06-01-17, or 06/01/17.
Below are the regexes that I use. I cannot get them quite right though.
string regex_MMyy = @"^(1[0-2]|0[1-9]|\d)(\d{2})$";
string regex_ddMMyy = @"^(0[1-9]|[12]\d|3[01])(1[0-2]|0[1-9]|\d)(\d{2})$";
string regex_ddMMyyyy = @"^(0[1-9]|[12]\d|3[01])(1[0-2]|0[1-9]|\d)(\d{4})$";
var test_MMyy_1 = Regex.IsMatch("0617", regex_MMyy); // Pass
var test_MMyy_2 = Regex.IsMatch("617", regex_MMyy); // Pass, do NOT want this to pass.
var test_ddMMyy_1 = Regex.IsMatch("060117", regex_ddMMyy); // Pass
var test_ddMMyy_2 = Regex.IsMatch("06117", regex_ddMMyy); // Pass, do NOT want this to pass.
var test_ddMMyyyy_1 = Regex.IsMatch("06012017", regex_ddMMyyyy); // Pass
var test_ddMMyyyy_2 = Regex.IsMatch("0612017", regex_ddMMyyyy); // Pass, do NOT want this to pass.
(If anyone could take allowed days for each month, and leap years into account, that would be a huge bonus :)).
Thanks,
Best Regards
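One likely cause of the unwanted passes is the trailing |\d alternative in the month group, which accepts a single-digit month. The sketch below drops that alternative; it is written in JavaScript rather than C#, but the pattern syntax used here is the same in both. The calendar check (days per month, leap years) is only applied to the ddMMyyyy format, because a two-digit year leaves the century open. The function names are illustrative.

```javascript
// Month is strictly two digits: 01-09 or 10-12.
const reMMyy = /^(0[1-9]|1[0-2])(\d{2})$/;
const reDdMMyy = /^(0[1-9]|[12]\d|3[01])(0[1-9]|1[0-2])(\d{2})$/;
const reDdMMyyyy = /^(0[1-9]|[12]\d|3[01])(0[1-9]|1[0-2])(\d{4})$/;

// True only if day/month/year is a real calendar date (handles leap years).
function isRealDate(day, month, year) {
  const d = new Date(0);
  d.setUTCFullYear(year, month - 1, day);
  // An impossible date such as 31 April rolls over, so the parts no longer agree.
  return d.getUTCFullYear() === year &&
         d.getUTCMonth() === month - 1 &&
         d.getUTCDate() === day;
}

function isValidDdMMyyyy(text) {
  const m = reDdMMyyyy.exec(text);
  return m !== null && isRealDate(Number(m[1]), Number(m[2]), Number(m[3]));
}

console.log(reMMyy.test("0617"), reMMyy.test("617"));         // true false
console.log(reDdMMyy.test("060117"), reDdMMyy.test("06117")); // true false
console.log(isValidDdMMyyyy("29022016"), isValidDdMMyyyy("29022017")); // true false
```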
I am trying to remove everything inside this string:
[JGMORGAN - BANK2] n° 10 NEWYORK, n° 222 CAEN, MONTELLIER, VANNES / TARARTA TIs
1303222074, 1403281851 & 1307239335 et Cloture TIs 1403277567,
1410315029
Except the following numbers:
1303222074
1403281851
1307239335
1403277567
1410315029
I have built a regex to match them:
1[0-9]{9}
But I have not figured out how to do the opposite, that is, match everything except those numbers.
Google Sheets uses the RE2 regex engine, which doesn't support many useful features that would help you do that. So a basic workaround can help you:
match what you want to preserve first and capture it:
pattern: [0-9]*(?:[0-9]{0,9}[^0-9]+)*(?:([0-9]{9,})|[0-9]*\z)
replacement: $1 (with a space after)
So probably something like this:
=TRIM(REGEXREPLACE("[JGMORGAN - BANK2] n° 10 NEWYORK, n° 222 CAEN, MONTELLIER, VANNES / TARARTA TIs 1303222074, 1403281851 & 1307239335 et Cloture TIs 1403277567, 1410315029"; "[0-9]*(?:[0-9]{0,9}[^0-9]+)*(?:([0-9]{9,})|[0-9]*\z)"; "$1 "))
You can also do this with dynamic native functions:
=REGEXEXTRACT(A1,rept("(\d{10}).*",counta(split(regexreplace(A1,"\d{10}","#"),"#"))-1))
Basically, it first splits by the desired string to figure out how many occurrences there are, then repeats the regex to dynamically create that number of capture groups, leaving you in the end with only those values.
First of all, thank you Casimir for your help. It gave me an idea for something that would not be possible with built-in functions and a strong regex alone, lol.
I found out that I can write a custom function for my own purposes (yes, I'm not very "up to date").
It's not very well coded and it returns duplicates. But rather than fixing it properly, I use the built-in UNIQUE() function on top of it to get rid of them; it's ugly and I'm lazy, but it does the job, that is, it returns a list of all matches of one specific regex (which is: 1[0-9]{9}). Here it is:
function ti_extract(input) {
var tab_tis = new Array();
var tab_strings = new Array();
tab_tis.push(input.match(/1[0-9]{9}/)); // get the TI and insert in tab_tis
var string_modif = input.replace(tab_tis[0], " "); // modify source string (remove everything except the TI)
tab_strings.push(string_modif); // insert this new string in the table
var v = 0;
var patt = new RegExp(/1[0-9]{9}/);
var fin = patt.test(tab_strings[v]);
var first_string = tab_strings[v];
do {
first_string = tab_strings[v]; // string 0, or the string with the first removed TI
tab_tis.push(first_string.match(/1[0-9]{9}/)); // analyze the string and get the new TI to put it in the table
var string_modif2 = first_string.replace(tab_tis[v], " "); // modify the string again to remove the new TI from the old string
tab_strings.push(string_modif2);
v += 1;
}
while(v < 15)
return tab_tis;
}
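A shorter sketch of the same idea, assuming the goal is simply every match of 1[0-9]{9}: with the global flag, String.prototype.match returns all matches in one call, so there is no fixed loop count and no repeated entries to clean up afterwards. The name tiExtract is illustrative.

```javascript
// Returns every 10-digit number starting with 1, in order of appearance.
function tiExtract(input) {
  // With the "g" flag, match() yields an array of all matches, or null if none.
  return String(input).match(/1[0-9]{9}/g) || [];
}

const sample = "TIs 1303222074, 1403281851 & 1307239335 et Cloture TIs 1403277567, 1410315029";
console.log(tiExtract(sample));
```

As with the original pattern, a longer run of digits that happens to contain a 1 followed by nine digits would also produce a match.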
Items I have:
A large list A of strings in column A (unsorted)
name1 pattern1 pattern4
name5 pattern2
name4 pattern4
name2 pattern3 pattern1
name4 pattern4
A large list B of different string patterns that I want to remove from string in column A (include punctuation and special characters)
pattern1
pattern2
pattern3
Once I compare each pattern in B with the string in A, it should output:
name1 pattern4
name5
name4 pattern4
name2
name4 pattern4
Now I have 2 difficulties. First, I have a very simple test code that assumes there is only 1 pattern in the list. The program executes error-free, but nothing happens in my Google spreadsheet, and I can't explain why.
function removeS(){
var sheet = SpreadsheetApp.getActiveSheet();
var range = sheet.getRange("A1:A");
var data = range.getValues();
for(i in data){
data[i].toString().replace(pattern,"");
}
}
Secondly, is there any way to accomplish my task without a nested loop (one loop through everything in column A and another through the list of patterns)? It seems so inefficient, as I am dealing with large data. In an Excel macro you can do something like:
With ActiveSheet.UsedRange
.Replace pattern1, ""
.Replace pattern2, ""
which takes care of the need for nested loops, although it takes manual work to add the patterns.
Here is an option, although I'm not sure there is a more elegant way than nested loops without converting the returned spreadsheet values from a 2D to a 1D array.
I set a constant for the last row of the patterns column, assuming it was the shorter of the two columns (see comments in code for the rationale).
function cleanMe(){
var sheet = SpreadsheetApp.getActiveSheet();
var range = sheet.getRange("A1:A" + sheet.getLastRow());
var data = range.getValues();
// get the array of patterns (all ranges returned as 2d array)
// because .getLastRow() or .getDataRange returns the last row in the spreadsheet with data
// not the last row of the range with data
// hardcoded the last row in column B so as not to
// have to use conditions to check if values exist in range
var patternLastRow = 3;
var patterns = sheet.getRange("B1:B" + patternLastRow).getValues();
// 2d array to replace data in row A using range.setValues(newRange)
var newRange = [];
for(var i = 0; i < data.length; i++){
// use encodeURIComponent to contend with special characters that would need escaping
var newValue = encodeURIComponent(data[i][0].toString());
for(var p = 0; p < patterns.length; p++){
var pattern = encodeURIComponent(patterns[p][0]);
var index = newValue.indexOf(pattern);
if(index >=0){
newValue = newValue.replace(pattern,'');
}
}
newRange.push([decodeURIComponent(newValue)]);
}
range.setValues(newRange);
}
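One way to avoid the inner loop, sketched here as an alternative rather than a change to the code above, is to escape every pattern and join them into a single alternation, so each cell is processed by one replace call that also removes every occurrence. The helpers escapeRegExp and removePatterns are hypothetical names, they work on flat arrays of strings, and collapsing the leftover whitespace is an assumption based on the expected output in the question.

```javascript
// Escape regex metacharacters so punctuation in a pattern is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Removes every occurrence of every pattern from every value.
function removePatterns(values, patterns) {
  const alternatives = patterns
    .map(String)
    .filter(p => p !== "")
    .sort((a, b) => b.length - a.length) // longest first, so a shorter prefix cannot win
    .map(escapeRegExp);
  if (alternatives.length === 0) return values.map(String);
  const re = new RegExp(alternatives.join("|"), "g");
  return values.map(v => String(v).replace(re, "").replace(/\s+/g, " ").trim());
}

const columnA = ["name1 pattern1 pattern4", "name5 pattern2", "name4 pattern4",
                 "name2 pattern3 pattern1", "name4 pattern4"];
const columnB = ["pattern1", "pattern2", "pattern3"];
console.log(removePatterns(columnA, columnB));
```

In a spreadsheet script the 2D result of getValues() would first be flattened (for example data.map(row => row[0])) and the cleaned values wrapped again as single-element rows before calling setValues().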