IMPORTDATA in Google Sheets and XPath - regex

In Google Sheets, I need to scrape Amazon's share price from this page, using the XPath below: https://www.boursorama.com/cours/AMZN/
//*[@id="main-content"]/div/section[1]/header/div/div/div[1]/div[1]/div/div[1]/span[1]
I can do it with this formula, and it works:
=IMPORTXML("https://www.boursorama.com/cours/AMZN/", "//*[@id=""main-content""]/div/section[1]/header/div/div/div[1]/div[1]/div/div[1]/span[1]")
BUT the IMPORTXML function has a daily limit.
To avoid Google's daily limit, I would like to use this more elegant method, which should look something like this:
=REGEXEXTRACT(QUERY(TRANSPOSE(IMPORTDATA(
"https://www.boursorama.com/cours/AMZN/"));
"where Col1 contains 'basp:""'");"(\d+.*)""") <-- here is the line where something is wrong
I'm not used to working with regex. Can someone help me with it?

try:
=REGEXREPLACE(QUERY(FLATTEN(IMPORTDATA(
"https://www.boursorama.com/cours/AMZN")),
"where Col1 contains 'data-ist-bid-price>'", 0),
"</?\S+[^<>]*>", )*1

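To see what the REGEXREPLACE pattern in the answer above does, here is a minimal Python sketch of the same tag-stripping step. The sample HTML line is made up for illustration (the real page markup may differ), and Python's re module stands in for the RE2 engine that Google Sheets uses.

```python
import re

# Same pattern as in the formula: an opening or closing tag with optional attributes.
TAG_PATTERN = re.compile(r"</?\S+[^<>]*>")

def strip_tags_to_number(line: str) -> float:
    """Remove HTML tags from one line and convert what is left to a number."""
    text = TAG_PATTERN.sub("", line).strip()
    return float(text)  # plays the role of the trailing *1 in the formula

# Hypothetical line, shaped like the one the QUERY filter is meant to keep.
sample = '<span class="c-instrument c-instrument--bid" data-ist-bid-price>182.41</span>'
price = strip_tags_to_number(sample)
```

If the text left after stripping the tags is not a plain number (for example, it uses a space as a thousands separator), the conversion fails, just as *1 would return an error in the sheet.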
Related

replace expression format xx-xx-xxxx_12345678

IDENTIFIER
31-03-2022_13636075
01-04-2022_13650262
04-04-2022_13663174
05-04-2022_13672025
20220099001
11614491_R
10781198
00000000000
11283627_P
11614491_R
-1
How can I remove only the "XX-XX-XXXX_" part from certain values of a column in SSIS, WITHOUT affecting values that don't have this format? For example, "21-05-2022_12345678" should become "12345678", but the other values should stay as they are. These are just a few examples from the many rows in this column, so I want only the values in this format to be affected.
SELECT REVERSE(substring(REVERSE('09-03-2022_13481330'),0,CHARINDEX('_',REVERSE('09-03-2022_13481330'),0)))
result
13481330
but this also affects other values. Also, this is in SSMS, not SSIS, because I am not sure how to turn this expression into SSIS code.
Update : Corrected code in SSIS goes as following:
(FINDSTRING(IDENTIFIER,"__-__-____[_]",1) == 1) ? SUBSTRING(IDENTIFIER,12,LEN(IDENTIFIER) - 11) : IDENTIFIER
Do you have access to the SQL source? You can do this in SQL by using LIKE and crafting a match pattern with the single-character wildcard _. See the example below:
DECLARE @Value VARCHAR(50) = '09-03-2022_13481330'
SELECT CASE WHEN @Value LIKE '__-__-____[_]%' THEN
SUBSTRING(@Value,12,LEN(@Value)-11) ELSE @Value END
See the Microsoft documentation on LIKE and single-character wildcards.
If you don't have access to the source SQL, it gets a bit trickier: you might need to use a regex in a Script Task, or there may be an expression you can apply.
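The conditional strip described above can be sketched in Python to check the logic against the sample values from the question. This is only an illustration, not SSIS code. It assumes the prefix is always digits in the DD-MM-YYYY_ layout, which is stricter than the LIKE pattern, where _ matches any single character.

```python
import re

# Assumption: the prefix is two digits, two digits, four digits, then "_".
DATE_PREFIX = re.compile(r"^\d{2}-\d{2}-\d{4}_")

def strip_date_prefix(identifier: str) -> str:
    """Drop a leading date prefix; return other values unchanged."""
    return DATE_PREFIX.sub("", identifier, count=1)

# Sample values taken from the IDENTIFIER column in the question.
identifiers = [
    "31-03-2022_13636075",
    "01-04-2022_13650262",
    "20220099001",
    "11614491_R",
    "10781198",
    "-1",
]
cleaned = [strip_date_prefix(value) for value in identifiers]
```

Only the first two values change, because the pattern is anchored at the start of the string and requires the full date layout before the underscore.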

HOW TO MAKE IF AND SEARCH STATEMENT GOOGLE SPREADSHEETS

I have a dropdown (cell J23), and when a number is selected in the dropdown, I want to show the matching result (which is the percent cell).
I used the FILTER, SEARCH and IF functions.
When I ran the IF function on its own, it worked. But when I combined them, it didn't work.
Here's my formula:
=filter(F10:G,search(J23,IF(J23 < 10, J23, IF(J23 = 10, 10, IF(J23 > 10, J23))), F10:F))
If you need my file, you can access my Google Sheets.
Your spreadsheet goal is hard to understand from your current formula. I think that you want the following...
In K21:
=IFERROR(VLOOKUP(J21,F6:G8,2,FALSE))
In K23:
=IFERROR(VLOOKUP(J23,F10:G,2,FALSE))
try:
=ARRAYFORMULA(IFERROR(QUERY(TO_TEXT(A5:C), "where 9=9 "&
IF(I6="",," and lower(Col1) contains '"&LOWER(I6)&"'")&
IF(J6="",," and lower(Col2) contains '"&LOWER(J6)&"'")&
IF(K6="",," and lower(Col3) contains '"&LOWER(K6)&"'")), "no match"))
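The QUERY formula above builds its where clause by concatenation: it starts from a condition that is always true (9=9) and appends one "and ... contains" term per non-empty filter cell. A minimal Python sketch of that string-building idea follows. The function name and the dictionary argument are hypothetical stand-ins for the cells I6, J6 and K6.

```python
def build_where(filters: dict) -> str:
    """Build a QUERY-style where clause from optional text filters."""
    clause = "where 9=9"  # always-true base, so every added term can start with "and"
    for column, needle in filters.items():
        if needle:  # blank filter cells add nothing, like IF(I6="",,...)
            clause += f" and lower({column}) contains '{needle.lower()}'"
    return clause

# Hypothetical filter values: Col2 is left blank and is therefore skipped.
example = build_where({"Col1": "Abc", "Col2": "", "Col3": "x"})
```

With every filter blank, the clause stays at "where 9=9" and the query returns all rows.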

Using regextract and importrange together make a "duplicate" formula

I have a spreadsheet where I check whether the data (a website) already exists on the master sheet.
=if(countif(importrange("Spreadsheet Key","Leads!N:N"),K2)>0,"COMPANY EXISTS!","")
But the above formula is not dynamic enough. If a company's site ends in .co.uk but is registered on the master sheet under .com, it won't show "COMPANY EXISTS!"
So I changed the formula to look at the words before and after the "." in a website address.
=ARRAYFORMULA(REGEXEXTRACT(UNIQUE(SUBSTITUTE(importrange("Spreadsheet Key","Leads!N:N"),"www.","")), "([0-9A-Za-z-]+)\."))
But it doesn't work when I try to combine it with IF and COUNTIF.
=if(COUNTIF(ARRAYFORMULA(REGEXEXTRACT(SUBSTITUTE(importrange("Spreadsheet Key","Leads!N:N"),"www.",""), "([0-9A-Za-z-]+)\."),L2:L)>0,"Company Exist!",""))
It shows 'Wrong number of arguments to IF. Expected between 2 and 3 arguments, but got 1 arguments'
Can anyone help me out on where I am making the mistake?
Spreadsheet link- https://docs.google.com/spreadsheets/d/1La3oOWiM5KpzRY0MLLEUQC25LzDuQlqTjgFp-VlS8Bo/edit#gid=0
Edit: I made a mistake earlier and didn't specify which cell it is checking against.
Your formula has a typo: the ARRAYFORMULA and the COUNTIF are not closed correctly (the closing parenthesis of ARRAYFORMULA should come before the comma of COUNTIF). So change this:
=if(COUNTIF(ARRAYFORMULA(REGEXEXTRACT(SUBSTITUTE(importrange("Spreadsheet Key","Leads!N:N"),"www.",""), "([0-9A-Za-z-]+)\."),L2:L)>0,"Company Exist!",""))
To this:
=if(COUNTIF(ARRAYFORMULA(REGEXEXTRACT(SUBSTITUTE(IMPORTRANGE("Spreadsheet Key","Leads!N:N"),"www.",""), "([0-9A-Za-z-]+)\.")),L3:L)>0,"Company Exist!","")
I hope this has helped you. Let me know if you need anything else or if you did not understand something. :)
try:
=ARRAYFORMULA(IFNA(IF(IFNA(REGEXEXTRACT(SUBSTITUTE(IMPORTRANGE(
"1bnz7Y_xVN9Jo80aCBBeMBMJBnMDHkbZQUWnmL20CRi8", "Leads!N:N"),
"www.", ), "([0-9A-Za-z-]+)\."))>0, "Company Exist!", )))

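The matching idea in the formulas above (strip "www.", keep the label before the first dot, then look for that label in the master list) can be sketched in Python as follows. The helper names and the sample master list are hypothetical. The regular expression is the one used in the question.

```python
import re

# Same pattern as in the formulas: a run of letters, digits or hyphens before a dot.
LABEL = re.compile(r"([0-9A-Za-z-]+)\.")

def company_key(site: str):
    """Return the first dot-terminated label after dropping 'www.', or None."""
    match = LABEL.search(site.replace("www.", ""))
    return match.group(1) if match else None

def company_exists(site: str, master_sites: list) -> bool:
    """True when the site's label also appears among the master sheet sites."""
    key = company_key(site)
    return key is not None and any(company_key(m) == key for m in master_sites)

# Hypothetical master list standing in for Leads!N:N.
master = ["www.example.com", "acme.org"]
found = company_exists("example.co.uk", master)
missing = company_exists("www.other.com", master)
```

Comparing labels rather than full addresses is what lets a .co.uk site match a .com entry, at the cost of treating unrelated companies that share a label as the same.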
How can I strip a href attribute without the query?

Using Google Sheets, I'd like to grab a URL without a possible query from a "href" attribute. For example, get https://test.com from Test1 or Test1.
I've used the regex answer offered in https://stackoverflow.com/a/40426187/4829915 to remove the query string, and then extracted the actual URL.
Is there a way to do it in one formula?
Please see below what I did. In all of these examples the final output is https://test.com
    A       B                            C
1           \?[^\"]+                     href="(.+)"
2   Test1   =REGEXREPLACE(A2, B$1, "")   =REGEXEXTRACT(B2, C$1)
3   Test2   =REGEXREPLACE(A3, B$1, "")   =REGEXEXTRACT(B3, C$1)
4   Test3   =REGEXREPLACE(A4, B$1, "")   =REGEXEXTRACT(B4, C$1)
In this answer, I would like to propose 2 patterns. The 1st pattern uses REGEXEXTRACT. The 2nd pattern uses a custom function written in Google Apps Script (this is a sample).
Pattern 1: Using formula
=REGEXEXTRACT(A2, C1)
where C1 is href="(.+?)[\?"]
Pattern 2: Using custom function
When you use this, please copy and paste the script to the script editor. Then please use it at a cell like =getUrl(A2).
function getUrl(value) {
var obj = XmlService.parse(value.replace(/&/g, ";"));
var url = obj.getRootElement().getAttribute("href").getValue();
return url.split("?")[0];
}
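Both patterns can be tried outside Sheets with a small Python sketch. The anchor tags below are hypothetical examples of what column A might hold. The first function reuses the regular expression from Pattern 1. The second mirrors the custom function by parsing the tag and splitting on "?", and it escapes bare ampersands as &amp; so the XML parser accepts them, which is a choice made for this sketch.

```python
import re
import xml.etree.ElementTree as ET

# Pattern 1: capture lazily after href=" and stop at the first ? or closing quote.
HREF = re.compile(r'href="(.+?)[\?"]')

def url_by_regex(tag: str) -> str:
    """Extract the href value without its query string using one regex."""
    return HREF.search(tag).group(1)

def url_by_parsing(tag: str) -> str:
    """Parse the tag as XML, read href, and drop everything after '?'."""
    element = ET.fromstring(tag.replace("&", "&amp;"))
    return element.get("href").split("?")[0]

# Hypothetical inputs: one plain link and one with a query string.
plain = '<a href="https://test.com">Test1</a>'
with_query = '<a href="https://test.com?a=1&b=2">Test2</a>'
results = [url_by_regex(plain), url_by_regex(with_query),
           url_by_parsing(plain), url_by_parsing(with_query)]
```

The regex version needs no parsing and works on fragments, while the parsing version expects each cell to contain one well-formed element.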
References:
REGEXEXTRACT
XmlService

AWQL - how can i use a regular expressions or something similar?

I am querying the AdWords API with the following AWQL query (which works fine):
SELECT AccountDescriptiveName, CampaignId, CampaignName, AdGroupId, AdGroupName, KeywordText, KeywordMatchType, MaxCpc, Impressions, Clicks, Cost, Conversions, ConversionsManyPerClick, ConversionValue
FROM KEYWORDS_PERFORMANCE_REPORT
WHERE CampaignStatus IN ['ACTIVE', 'PAUSED']
AND AdGroupStatus IN ['ENABLED', 'PAUSED']
AND Status IN ['ACTIVE', 'PAUSED']
AND AdNetworkType1 IN ['SEARCH'] AND Impressions > 0
DURING 20140501,20140531
Now I want to exclude some campaigns:
we have a convention for our new campaigns that the campaign name begins with three digits followed by an underscore, e.g. "100_brand_all".
So I want to get only these new campaigns.
I tried lots of different variations of STARTS_WITH, but only exact strings work, and I need a pattern to match!
I have already read https://developers.google.com/adwords/api/docs/guides/awql?hl=en and, according to it, it should be possible to use a WHERE expression like this:
CampaignName STARTS_WITH ['0','1','2','3']
But that doesn't work!
Any other ideas on how I can achieve this?
Well, why don't you run a campaign performance report first, then process that (get the campaign IDs you want or don't want), then use those in "CampaignId IN [campaign ids here]" or "CampaignId NOT_IN [campaign ids]"?
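The two-step approach suggested above can be sketched in Python: filter the campaign names client-side with a regular expression, then build the CampaignId clause for the keyword report. The campaign rows and the function name are hypothetical, and fetching the campaign report itself is left out. The bracketed list of unquoted IDs follows the form given in the answer.

```python
import re

# Naming convention from the question: three digits followed by an underscore.
NEW_CAMPAIGN = re.compile(r"^\d{3}_")

def campaign_id_clause(campaigns: list) -> str:
    """Build a 'CampaignId IN [...]' condition for campaigns matching the convention."""
    ids = [str(cid) for cid, name in campaigns if NEW_CAMPAIGN.match(name)]
    # Note: an empty list yields "CampaignId IN []", which should be handled by the caller.
    return "CampaignId IN [" + ", ".join(ids) + "]"

# Hypothetical (CampaignId, CampaignName) rows from a campaign performance report.
campaigns = [(101, "100_brand_all"), (102, "old_brand"), (103, "205_generic_de")]
clause = campaign_id_clause(campaigns)
```

The resulting condition would be appended to the WHERE part of the keyword report query with AND.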