How can I strip a href attribute without the query? - regex

Using Google Sheets, I'd like to grab a URL without a possible query from a "href" attribute. For example, get https://test.com from Test1 or Test1.
I've used the regex answer offered in https://stackoverflow.com/a/40426187/4829915 to remove the query string, and then extracted the actual URL.
Is there a way to do it in one formula?
Please see below what I did. In all of these examples the final output is https://test.com
A B C
1 \?[^\"]+ href="(.+)"
2 Test1 =REGEXREPLACE(A2, B$1, "") =REGEXEXTRACT(B2, C$1)
3 Test2 =REGEXREPLACE(A3, B$1, "") =REGEXEXTRACT(B3, C$1)
4 Test3 =REGEXREPLACE(A4, B$1, "") =REGEXEXTRACT(B4, C$1)

In this answer, I would like to propose 2 patterns. In the 1st pattern, it uses REGEXEXTRACT. In the 2nd pattern, it uses a custom function using Google Apps Script (This is a sample.).
Pattern 1: Using formula
=REGEXEXTRACT(A2, C1)
where C1 is href="(.+?)[\?"]
Pattern 2: Using custom function
When you use this, please copy and paste the script to the script editor. Then please use it at a cell like =getUrl(A2).
function getUrl(value) {
var obj = XmlService.parse(value.replace(/&/g, ";"));
var url = obj.getRootElement().getAttribute("href").getValue();
return url.split("?")[0];
}
Results:
References:
REGEXEXTRACT
XmlService

Related

How to simplify this google sheets regex sequence?

I want to make the following transformation to a set of datas in my google spreadsheets :
6 views -> 6
73K views -> 73000
3650 -> 3650
163K views -> 163000
1.2K views -> 1200
52.5K -> 52500
All the datas are in a column and depending on the case I need to apply a specific transformation.
I tried to put all the regex in one formula but I failed. I always had a case over two regular expressions etc.
Anyaway I end up making these regex one case by one case in different columns. It works fine but I feel like it could slowdown the sheet since I except a lot of data coming into this sheet.
Here is the sheet : spreadsheet
Thank you for your help !
Use regexreplace(), like this:
=arrayformula(
iferror( 1 /
value(
regexreplace(
regexreplace(trim(A2:A), "\s*K", "e3"),
" views", ""
)
)
^ -1 )
)
See your sample spreadsheet.
replace 'views' using regex: /(?<=(\d*\.?\d+\K?)) views/gi
To replace 'K' with or without decimal value, first, detect K then replace K with an empty string and multiply by 1000.
use call back function as:
txt.replace(/(?<=(\d*\.?\d+\K?)) views/gi, '').replace(/(?<=\d)\.?\d+K/g, x => x.replace(/K/gi, '')*1000)
code:
arr = [`6 views`,
`73K views`,
`3650`,
`163K views`,
`1.2K views`,
`52.5K`];
arr.forEach(txt => {
console.log(txt.replace(/(?<=(\d*\.?\d+\K?)) views/gi, '').replace(/(?<=\d)\.?\d+K/g, x => x.replace(/K/gi, '')*1000))
})
Output:
6
73000
3650
163000
1200
52500
Say your inputs are in column A. Empty cells allowed. In any other column,
=arrayformula(if(A2:A<>"",value(substitute(substitute(A2:A," views",""),"K","e3")),))
works.
Adjust the range A2:A as needed.
Also note that non-empty cells with empty strings are ignored.
Basically, since Google Sheet's regex engine doesn't support look around, it is more efficient to take advantage of the rather strict patterns in your application and use substitute() instead.

calculate range of values entered by user Custom function Google Appscript

I want to use arrayformula for my custom function if possible because I want to input a range of values
I also get this error: TypeError: Cannot read property "0" from null.
Also, this: Service invoked too many times in a short time: exec qps. Try Utilities.sleep(1000) between calls
var regExp = new RegExp("Item: ([^:]+)(?=\n)");
var matches=new regExp(input);
return matches[0];
}
Really appreciated some help
Edit:
Based on the second picture, I also try using this regex formula to find word start with "Billing address"
But for the first picture, I used regex formula to find word start with "Item"
The error appears the same for both custom function.
If you want to use a custom function which finds all the strings that start with Item or item and extracts the contents from after the finding, you can use the code provided below. The regular expression is checked by using the match() function and returns the desired result; otherwise, it will return null.
function ITEM(input) {
var regEx = /(?:I|i)tem\s*(.*)$/;
var matches = input.match(regEx);
if (matches && matches.length > 1) {
return matches[1];
} else {
return null;
}
}
If you want to use the RegExp like you did in the code you have shared, you should use \\ instead of \.
For checking and verifying the regular expressions you can use this site.
The Service invoked too many times in a short time: exec qps. Try Utilities.sleep(1000) between calls. error message you are getting is due to the fact that you are trying to call the custom function on too many cells - for example dragging the custom function on too many cells at once. You can check more about this error message here.

Using Regex in Pig in hadoop

I have a CSV file containing user (tweetid, tweets, userid).
396124436476092416,"Think about the life you livin but don't think so hard it hurts Life is truly a gift, but at the same it is a curse",Obey_Jony09
396124436740317184,"“#BleacherReport: Halloween has given us this amazing Derrick Rose photo (via #amandakaschube, #ScottStrazzante) http://t.co/tM0wEugZR1” yes",Colten_stamkos
396124436845178880,"When's 12.4k gonna roll around",Matty_T_03
Now I need to write a Pig Query that returns all the tweets that include the word 'favorite', ordered by tweet id.
For this I have the following code:
A = load '/user/pig/tweets' as (line);
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'(.*)[,”:-](.*)[“,:-](.*)')) AS (tweetid:long,msg:chararray,userid:chararray);
C = filter B by msg matches '.*favorite.*';
D = order C by tweetid;
How does the regular expression work here in splitting the output in desired way?
I tried using REGEX_EXTRACT instead of REGEX_EXTRACT_ALL as I find that much more simpler, but couldn't get the code working except for extracting just the tweets:
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT(line,'[,”:-](.*)[“,:-]',1)) AS (msg:chararray);
the above alias gets me the tweets, but if I use REGEX_EXTRACT to get the tweet_id, I do not get the desired o/p: B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT(line,'(.*)[,”:-]',1)) AS (tweetid:long);
(396124554353197056,"Just saw #samantha0wen and #DakotaFears at the drake concert #waddup")
(396124554172432384,"#Yutika_Diwadkar I'm just so bright 😁")
(396124554609033216,"#TB23GMODE i don't know, i'm just saying, why you in GA though? that's where you from?")
(396124554805776385,"#MichaelThe_Lion me too 😒")
(396124552540852226,"Happy Halloween from us 2 #maddow & #Rev_AlSharpton :) http://t.co/uC35lDFQYn")
grunt>
Please help.
Can't comment, but from looking at this and testing it out, it looks like your quotes in the regex are different from those in the csv.
" in the csv
” in the regex code.
To get the tweetid try this:
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT(line,'.*(,")',1)) AS (tweetid:long);

regex to return all values not just first found one

I'm learning Pig Latin and am using regular expressions. Not sure if the regex is language agnostic or not but here is what I'm trying to do.
If I have a table with two fields: tweet id and tweet, I'd like to go through each tweet and pull out all mentions up to 3.
So if a tweet goes something like "#tim bla #sam #joe something bla bla" then the line item for that tweet will have tweet id, tim, sam, joe.
The raw data has twitter ids not the actual handles so this regex seems to return a mention (.*)#user_(\\S{8})([:| ])(.*)
Here is what I have tried:
a = load 'data.txt' AS (id:chararray, tweet:chararray);
b = foreach a generate id, LOWER(tweet) as tweet;
// filter data so only tweets with mentions
c = FILTER b BY tweet MATCHES '(.*)#user_(\\S{8})([:| ])(.*)';
// try to pull out the mentions.
d = foreach c generate id,
REGEX_EXTRACT(tweet, '((.*)#user_(\\S{8})([:| ])(.*)){1}',3) as mention1,
REGEX_EXTRACT(tweet, '((.*)#user_(\\S{8})([:| ])(.*)){1,2}',3) as mention2,
REGEX_EXTRACT(tweet, '((.*)#user_(\\S{8})([:| ])(.*)){2,3}',3) as mention3;
e = limit d 20;
dump e;
So in that try I was playing with quantifiers, trying to return the first, second and 3rd instance of a match in a tweet {1}, {1,2}, {2,3}.
That did not work, mention 1-3 are just empty.
So I tried changing d:
d = foreach c generate id,
REGEX_EXTRACT(tweet, '(.*)#user_(\\S{8})([:| ])(.*)',2) as mention1,
REGEX_EXTRACT(tweet, '(.*)#user_(\\S{8})([:| ])(.*)#user_(\\S{8})([:| ])(.*)',5) as mention2,
REGEX_EXTRACT(tweet, '(.*)#user_(\\S{8})([:| ])(.*)#user_(\\S{8})([:| ])(.*)#user_(\\S{8})([:| ])(.*)',8) as mention3,
But, instead of returning each user mentioned, this returned the same mention 3 times. I had expected that by cutting n pasting the expression again I'd get the second match, and pasting it a 3rd time would get the 3rd match.
I'm not sure how well I've managed to word this question but to put it another way, imagine that the function regex_extract() returned an array of matched terms. I would like to get mention[0], mention[1], mention[2] on a single line item.
Whenever you use PATTERN_EXTRACT or PATTERN_EXTRACT_ALL udf, keep in mind that it is just pure regex handled by Java.
It is easier to test the regex through a local Java test. Here is the regex I found to be acceptable :
Pattern p = Pattern.compile("#(\\S+).*?(?:#(\\S+)(?:.*?#(\\S+))?)?");
String input = "So if a tweet goes something like #tim bla #sam #joe #bill something bla bla";
Matcher m = p.matcher(input);
if(m.find()){
for(int i=0; i<=m.groupCount(); i++){
System.out.println(i + " -> " + m.group(i));
}
}
With this regex, if there is at least a mention, it will returns three fields, the seconds and/or third being null if a second/third mention is not found.
Therefore, you may use the following PIG code :
d = foreach c generate id, REGEX_EXTRACT_ALL(
tweet, '#(\\S+).*?(?:#(\\S+)(?:.*?#(\\S+))?)?');
You do not even need to filter the data first.

How to use RegExReplace in Google Spreadsheet

I'm trying to remove beggining numbers from a column in a Google Docs spreadsheet using regex. I can't get RegExReplace function to work. This is the error I get when I run/debug the code:
Missing ) after argument list. (line 14)
This is a part of my code (line 14 is the RegExReplace function line, bolded):
regexFormat = "^[0-9]+$";
replVal = value.RegExReplace(value; regexFormat; ""); //error here
rplc.setValue(replVal);
This is the official syntax: RegExReplace( text ; regular_expression ; replacement )
Anyone knows how to use this function? Thanks!
I don't know why the docs list a semicolon, but if you are doing it as a spreadsheet function, you still need to use commas. Try the following:
=REGEXREPLACE("What-A Crazy str3ng", "\W", "")
Which as expected, yields
WhatACrazystr3ng
I've found another solution for replacing with regexp in Google Docs Script:
var replace = '';//because we want to remove matching text
var regexp2 = new RegExp("[0-9]*[\.]*");//an example of regexp to do the job
var valcurat = value.replace(regexp2, replace);//working
As I did not find any solution for RegExReplace, I changed the method with replace(regexp, new_text). This one works.
This is just a guess but if the function is Javaish, maybe there are 2 forms.
Form 1:
myvar = RegExReplace(value; regexFormat; "");
Form2:
myvar = value.RegExReplace(regexFormat; "");