Remove specific characters from string to tidy up URLs [duplicate] - regex

This question already has answers here:
Extracting rootdomains from URL string in Google Sheets
(3 answers)
Closed 2 years ago.
Hi, I have a column of messy URLs in Google Sheets that I'm trying to clean up. I want all the website links to be in the same format so that I can run a duplicate check on them.
For example, I have a list of URLs with various prefixes: http, http://, https://, etc. I am trying to use the REGEXREPLACE function to remove every http variant from the column entries, but I cannot get it to work. This is what I have:
Before:
http://www.website1.com/
https://website2.com/
https://www.website3.com/
And I want - After:
website1.com
website2.com
website3.com
It is OK if this takes several formulas, and thus several columns, to reach the end result.

Try:
=ARRAYFORMULA(IFERROR(REGEXEXTRACT(INDEX(SPLIT(
REGEXREPLACE(A1:A, "https?://www\.|https?://|www\.", ), "/"),,1),
"\.(.+\..+)"), INDEX(IFERROR(SPLIT(
REGEXREPLACE(A1:A, "https?://www\.|https?://|www\.", ), "/")),,1)))
Or, shorter:
=INDEX(IFERROR(REGEXEXTRACT(A1:A, "^(?:https?:\/\/)?(?:www\.)?([^\/]+)")))

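You can try the following formula:
=ArrayFormula(regexreplace(LEFT(P1:P3,LEN(P1:P3)-1),"(.*//www\.)|(.*//)",""))
Adjust the ranges as needed. Note that LEFT(..., LEN(...)-1) drops the last character, so this assumes every URL ends with a trailing slash.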
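For readers who want to check the pattern outside Sheets, here is a minimal Python sketch of the same idea as the shorter REGEXEXTRACT formula: skip an optional scheme and an optional "www.", then keep everything up to the first slash. The function name clean_url and the lowercasing step are assumptions added for illustration, not part of the answers above.

```python
import re

# Optional scheme, optional "www.", then capture everything up to the first "/".
HOST = re.compile(r"^(?:https?://)?(?:www\.)?([^/]+)")

def clean_url(url):
    """Return the host without scheme or leading 'www.' ('' if nothing matches)."""
    match = HOST.match(url.strip())
    # Lowercasing is an extra normalisation step so duplicates compare equal.
    return match.group(1).lower() if match else ""

urls = [
    "http://www.website1.com/",
    "https://website2.com/",
    "https://www.website3.com/",
]
cleaned = [clean_url(u) for u in urls]
print(cleaned)
```

Because the scheme and "www." groups are both optional and anchored at the start, entries that are already clean pass through unchanged.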

Related

Regexmatch in Google Sheet to find 2 sections of my website

I'm trying to find the number of users coming to 2 different parts of my website:
/blog/
resources.company
To do this I use REGEXMATCH with | separator:
=SUMPRODUCT(('raw data'!$B$2:$B),REGEXMATCH('RAW - New Users'!$A$2:$A,"/blog/|resources.company"))
However, when I check with this formula:
=SUMPRODUCT(('raw data'!$B$2:$B),REGEXMATCH('RAW - New Users'!$A$2:$A,"/blog/") + REGEXMATCH('RAW - New Users'!$A$2:$A,"resources.company"))
I find more users.
I'm pretty sure the . is messing up the first formula. I've tried /blog/|resources\.company, but it didn't help. How can I change my first formula so that REGEXMATCH finds everything that contains "resources.company" as well as "/blog/"?
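One possible explanation for the difference, offered as an assumption since the source does not confirm it: adding two REGEXMATCH results counts a row twice when it matches both patterns, whereas the single alternation counts it once. The Python sketch below reproduces that effect on made-up data (the page paths and user counts are hypothetical).

```python
import re

# Hypothetical sample data: landing pages and their user counts.
pages = ["/blog/post-1", "resources.company/guide", "resources.company/blog/news", "/pricing"]
users = [10, 20, 5, 7]

either = re.compile(r"/blog/|resources\.company")
blog = re.compile(r"/blog/")
resources = re.compile(r"resources\.company")

# Single alternation: each row counts at most once.
or_total = sum(u for p, u in zip(pages, users) if either.search(p))

# Sum of two matches: a row matching both patterns gets a weight of 2.
sum_total = sum(
    u * (bool(blog.search(p)) + bool(resources.search(p)))
    for p, u in zip(pages, users)
)

print(or_total, sum_total)
```

If this is what is happening in the sheet, the first formula would already be the correct count and the check formula the inflated one.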

How to simplify this google sheets regex sequence?

I want to apply the following transformation to a set of data in my Google spreadsheet:
6 views -> 6
73K views -> 73000
3650 -> 3650
163K views -> 163000
1.2K views -> 1200
52.5K -> 52500
All the data is in one column, and depending on the case I need to apply a specific transformation.
I tried to put all the regexes in one formula but failed: there was always one case that clashed with another regular expression.
Anyway, I ended up handling the regexes case by case in different columns. It works fine, but I feel it could slow down the sheet, since I expect a lot of data coming into it.
Here is the sheet: spreadsheet
Thank you for your help!
Use regexreplace(), like this:
=arrayformula(
  iferror( 1 /
    value(
      regexreplace(
        regexreplace(trim(A2:A), "\s*K", "e3"),
        " views", ""
      )
    )
  ^ -1 )
)
See your sample spreadsheet.
In JavaScript, first remove ' views' using a regex: / views/gi
To handle 'K', with or without a decimal value, match the number together with the K, then drop the K and multiply the number by 1000.
Use a callback function:
txt.replace(/ views/gi, '').replace(/(\d*\.?\d+)K/gi, (m, n) => Math.round(n * 1000))
Code:
const arr = [`6 views`,
`73K views`,
`3650`,
`163K views`,
`1.2K views`,
`52.5K`];
arr.forEach(txt => {
// Strip the " views" suffix, then expand "<number>K" to number * 1000.
// Math.round guards against floating-point noise such as 1.1 * 1000.
console.log(txt.replace(/ views/gi, '').replace(/(\d*\.?\d+)K/gi, (m, n) => Math.round(n * 1000)))
})
Output:
6
73000
3650
163000
1200
52500
Say your inputs are in column A. Empty cells are allowed. In any other column,
=arrayformula(if(A2:A<>"",value(substitute(substitute(A2:A," views",""),"K","e3")),))
works.
Adjust the range A2:A as needed.
Also note that non-empty cells with empty strings are ignored.
Basically, since Google Sheets' regex engine (RE2) doesn't support lookaround, it is more efficient to take advantage of the rather strict patterns in your data and use substitute() instead.
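The substitute() idea above (drop " views", turn "K" into the exponent "e3", then parse the result as a number) can be sketched outside Sheets as follows. This is only an illustration in Python; the function name parse_views is hypothetical, and returning None for blank input is an assumption that mirrors the formula leaving empty cells empty.

```python
def parse_views(text):
    """Convert strings such as '73K views' or '1.2K' to an integer count."""
    cleaned = text.replace(" views", "").strip()
    if not cleaned:
        return None  # blank input stays blank
    # "73K" becomes "73e3", which float() parses as 73000.0.
    return round(float(cleaned.replace("K", "e3")))

samples = ["6 views", "73K views", "3650", "163K views", "1.2K views", "52.5K"]
results = [parse_views(s) for s in samples]
print(results)
```

Parsing "1.2e3" as a single literal avoids the small floating-point error that multiplying 1.2 by 1000 separately could introduce for some values.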

How to query in Mongo for a String based on expressions [duplicate]

This question already has answers here:
Matching a Forward Slash with a regex
(9 answers)
Closed 3 years ago.
I have a lot of data in MongoDB. I want to query on a string field whose value contains a URL:
"url" : "http://some-host/api/name/service/list/1234/xyz"
I got a record count when I executed the query below:
db.mycollection.find({url:"http://some-host/api/name/service/list/1234/xyz"}).count()
I want to get all the records that match
some-host/api/name/service/list/
I tried the samples below:
db.mycollection.find({url:"some-host/api/name/service/list/"}).count()
Got zero records
db.mycollection.find({url:/.*"some-host/api/name/service/list/".*/}).count()
Got error
db.mycollection.find({"url":/.*"some-host/api/name/service/list/".*/}).count()
Got error
db.mycollection.find({"url":/.*some-host/api/name/service/list/.*/}).count()
Got Error
db.mycollection.find({"url":/.*some-host//api//name//service//list//.*/}).count()
Got ...
...
Then no response
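Did you try something like this?
db.mycollection.find({'url': {'$regex': 'sometext'}})
When '$regex' is given as a string, the forward slashes in the URL need no escaping; inside a /.../ regex literal, each one must be written as \/.
Please also check here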
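To illustrate what an unanchored regex match does with this URL fragment, here is a small pure-Python sketch that runs without a database. The documents in docs are made up, and the pymongo filter mentioned in the comment is an assumption about how the same pattern would be passed from Python, not something taken from the answer.

```python
import re

needle = "some-host/api/name/service/list/"

# re.escape protects characters such as "-" so they match literally.
# With pymongo, the analogous filter would presumably be
# {"url": {"$regex": re.escape(needle)}} (assumption, not run here).
pattern = re.compile(re.escape(needle))

# Hypothetical documents standing in for the collection.
docs = [
    {"url": "http://some-host/api/name/service/list/1234/xyz"},
    {"url": "http://some-host/api/name/other/1"},
]

# search() matches anywhere in the string, so no leading or trailing ".*" is needed.
matches = [d for d in docs if pattern.search(d["url"])]
print(len(matches))
```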

HiveQL: Parse strings and count

I am using HiveQL to work with millions of rows of domain name text data stored in HDFS. The following is a hand-selected subset to illustrate lexical diversity. There are duplicate entries.
dnsvm.mgmtsubnet.mgmtvcn.oraclevcn.com.
mgmtsubnet.mgmtvcn.oraclevcn.com.
asdf.mgmtvcn.oraclevcn.com.
dnsvm.mgmtsubnet.mgmtvcn.oraclevcn.com.
localhost.
a.localhost.
img.pulsemgr.com.
36.136.154.156.in-addr.arpa.
accounts.spotify.com.
_dmarc.ixia-devops.com.
&eventtype=close&reason=4&duration=35.
&eventtype=close&reason=3&duration=10336.
I am trying to get a count of rows grouped by the last two levels of the domain, where sometimes the 2nd level is absent (e.g. localhost.). For example:
domain_root count
oraclevcn.com. 4
localhost. 1
a.localhost. 1
pulsemgr.com. 1
in-addr.arpa. 1
spotify.com. 1
ixia-devops.com. 1
It would also be nice to see how to filter out domains where the 2nd level is absent.
I am not sure where to start. I have seen use of the SPLIT() function, but that may not be robust since there could be many levels to a domain name, for example: a.b.c.d.e.f.g.h.i etc.
Any ideas or implementations are appreciated.
Below is a query using regexp_extract:
select domain_root, count(*)
from (
  -- domain_name and my_table are placeholders for your column and table names
  select regexp_extract(domain_name, '[A-Za-z0-9-]+\\.[A-Za-z0-9-]+\\.$', 0) as domain_root
  from my_table
) A
group by A.domain_root
The regex extracts the last two dot-terminated labels, allowing alphanumeric characters and the special character '-'. The backslashes are doubled because Hive string literals treat a single backslash as an escape character.
Hope this helps.
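As a quick way to see what the regex in the query above returns for the sample rows, the following Python sketch applies the same pattern and counts the results. It only illustrates the regex, not Hive itself; the function name domain_root is hypothetical, and returning an empty string when nothing matches is meant to mimic regexp_extract.

```python
import re
from collections import Counter

# Same pattern as in the query: two dot-terminated labels at the end of the string.
ROOT = re.compile(r"[A-Za-z0-9-]+\.[A-Za-z0-9-]+\.$")

def domain_root(name):
    """Return the last two labels, or '' when the pattern does not match."""
    match = ROOT.search(name)
    return match.group(0) if match else ""

rows = [
    "dnsvm.mgmtsubnet.mgmtvcn.oraclevcn.com.",
    "mgmtsubnet.mgmtvcn.oraclevcn.com.",
    "asdf.mgmtvcn.oraclevcn.com.",
    "dnsvm.mgmtsubnet.mgmtvcn.oraclevcn.com.",
    "localhost.",
    "a.localhost.",
    "img.pulsemgr.com.",
    "36.136.154.156.in-addr.arpa.",
    "accounts.spotify.com.",
    "_dmarc.ixia-devops.com.",
    "&eventtype=close&reason=4&duration=35.",
    "&eventtype=close&reason=3&duration=10336.",
]

counts = Counter(domain_root(r) for r in rows)
for root, n in counts.most_common():
    print(repr(root), n)
```

Rows with a single label (localhost.) and the non-domain rows fall into the empty-string bucket, so filtering on a non-empty domain_root is one way to drop them.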

AWQL - how can i use a regular expressions or something similar?

I am querying the AdWords API via the following AWQL query (which works fine):
SELECT AccountDescriptiveName, CampaignId, CampaignName, AdGroupId, AdGroupName, KeywordText, KeywordMatchType, MaxCpc, Impressions, Clicks, Cost, Conversions, ConversionsManyPerClick, ConversionValue
FROM KEYWORDS_PERFORMANCE_REPORT
WHERE CampaignStatus IN ['ACTIVE', 'PAUSED']
AND AdGroupStatus IN ['ENABLED', 'PAUSED']
AND Status IN ['ACTIVE', 'PAUSED']
AND AdNetworkType1 IN ['SEARCH'] AND Impressions > 0
DURING 20140501,20140531
Now I want to exclude some campaigns:
We have a convention for our new campaigns that the campaign name begins with three digits followed by an underscore, e.g. "100_brand_all".
So I want to get only these new campaigns.
I tried lots of different variations of STARTS_WITH, but only exact strings work, and I need a pattern to match!
I have already read https://developers.google.com/adwords/api/docs/guides/awql?hl=en and, according to it, it should be possible to use a WHERE expression like this:
CampaignName STARTS_WITH ['0','1','2','3']
But that doesn't work!
Any other ideas on how I can achieve this?
Well, why don't you run a campaign performance report first, process that (to get the campaign IDs you want or don't want), and then use those in a "CampaignId IN [campaign ids here]" or "CampaignId NOT_IN [campaign ids]" condition?
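The two-step approach suggested above could look roughly like this in Python: filter the campaign report rows client-side with a regex, then build the IN condition from the matching IDs. The rows in report_rows are made up, and fetching the report from the API is left out entirely; only the filtering and string building are shown.

```python
import re

# Naming convention from the question: three digits followed by an underscore.
NEW_CAMPAIGN = re.compile(r"^\d{3}_")

# Hypothetical rows, as they might come back from a campaign performance report.
report_rows = [
    {"CampaignId": 111, "CampaignName": "100_brand_all"},
    {"CampaignId": 222, "CampaignName": "legacy brand"},
    {"CampaignId": 333, "CampaignName": "205_generic_de"},
]

# Keep only the campaigns whose names follow the convention.
ids = [row["CampaignId"] for row in report_rows if NEW_CAMPAIGN.match(row["CampaignName"])]

# Build the condition to append to the keyword report query.
where_clause = "CampaignId IN [" + ", ".join(str(i) for i in ids) + "]"
print(where_clause)
```

The resulting condition can then be added to the WHERE part of the keyword report query in place of a pattern match on CampaignName.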