Regex Match Kusto


I have two tables: one with a complete list of URLs, and another with regex patterns for those URLs (nearly 100 values), each with a corresponding topic. I want to create a third table that maps each URL to a topic based on the regex pattern.
I found that Kusto offers 'matches regex', but it cannot be used at the row level. Ideally, I want to create a function that takes a URL and outputs the corresponding Topic.
Table1:
| URL |
Table2:
|URL Regex| Topic|
Output:
|URL | Topic|
Let me know if the logic below needs any tuning for it to work.
Query:
.create-or-alter function with findTopic(Path:string) {
toscalar(Table2
| extend TopicName=case (Path matches regex URLRegex, Topic,"Not Found")
| project Topic)
}
Table1
| extend Topic=findTopic(Path)

Regular expressions can't come from a dynamic source, such as another table. In Kusto, regular expressions must be string scalars.
In your case this isn't a problem, since there are only about 100 different topics. You can maintain a stored function that does the URL categorization:
.create-or-alter function GetUrlTopic(Url:string)
{
case(
Url matches regex @"https://bing.com.*", "Search",
Url matches regex @"https://stackoverflow.com.*", "Q&A",
"N/A")
}
Example:
let Uris=datatable(Url:string)
[
"https://bing.com/foo/bar",
"https://bing.com/1/2",
"https://microsoft.com",
"https://stackoverflow.com/q/1",
"https://stackoverflow.com/q/2"
];
Uris
| extend Topic=GetUrlTopic(Url)
Result:
Url | Topic
https://bing.com/foo/bar | Search
https://bing.com/1/2 | Search
https://microsoft.com | N/A
https://stackoverflow.com/q/1 | Q&A
https://stackoverflow.com/q/2 | Q&A
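For comparison, the same "first matching pattern wins" idea can be sketched outside Kusto. This is only an illustration in Python, not part of the Kusto answer; the names TOPIC_PATTERNS and get_url_topic are made up for the example, and the two patterns are the ones from the function above with the dots escaped.

```python
import re

# Hypothetical pattern table: (compiled regex, topic) pairs, checked in order.
TOPIC_PATTERNS = [
    (re.compile(r"https://bing\.com.*"), "Search"),
    (re.compile(r"https://stackoverflow\.com.*"), "Q&A"),
]


def get_url_topic(url: str) -> str:
    """Return the topic of the first pattern that matches, else "N/A"."""
    for pattern, topic in TOPIC_PATTERNS:
        if pattern.search(url):
            return topic
    return "N/A"


urls = [
    "https://bing.com/foo/bar",
    "https://microsoft.com",
    "https://stackoverflow.com/q/1",
]
for url in urls:
    print(url, "|", get_url_topic(url))
```

Because the list is ordered, a more specific pattern should be placed before a more general one, just as in the case() call.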

Related

How to simplify this google sheets regex sequence?

I want to apply the following transformation to a set of data in my Google spreadsheet:
6 views -> 6
73K views -> 73000
3650 -> 3650
163K views -> 163000
1.2K views -> 1200
52.5K -> 52500
All the data is in one column, and depending on the case I need to apply a specific transformation.
I tried to put all the regexes in one formula but failed: I always ran into a case that involved two regular expressions.
Anyway, I ended up applying these regexes case by case in different columns. It works fine, but I feel it could slow down the sheet, since I expect a lot of data coming into it.
Here is the sheet : spreadsheet
Thank you for your help !

Use regexreplace(), like this:
=arrayformula(
iferror( 1 /
value(
regexreplace(
regexreplace(trim(A2:A), "\s*K", "e3"),
" views", ""
)
)
^ -1 )
)
See your sample spreadsheet.

Remove ' views' using the regex /(?<=(\d*\.?\d+\K?)) views/gi
To handle 'K', with or without a decimal value, first detect the K, then replace it with an empty string and multiply by 1000.
Use a callback function:
txt.replace(/(?<=(\d*\.?\d+\K?)) views/gi, '').replace(/(?<=\d)\.?\d+K/g, x => x.replace(/K/gi, '')*1000)
code:
arr = [`6 views`,
`73K views`,
`3650`,
`163K views`,
`1.2K views`,
`52.5K`];
arr.forEach(txt => {
console.log(txt.replace(/(?<=(\d*\.?\d+\K?)) views/gi, '').replace(/(?<=\d)\.?\d+K/g, x => x.replace(/K/gi, '')*1000))
})
Output:
6
73000
3650
163000
1200
52500

Say your inputs are in column A. Empty cells allowed. In any other column,
=arrayformula(if(A2:A<>"",value(substitute(substitute(A2:A," views",""),"K","e3")),))
works.
Adjust the range A2:A as needed.
Also note that non-empty cells with empty strings are ignored.
Basically, since the Google Sheets regex engine doesn't support lookaround, it is more efficient to take advantage of the rather strict patterns in your data and use substitute() instead.
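The same substitute-then-scale logic can be sketched in Python for anyone preprocessing the data before it reaches the sheet. This is a minimal sketch that assumes the inputs follow exactly the shapes listed in the question; parse_views is a hypothetical helper name.

```python
from decimal import Decimal


def parse_views(text: str) -> int:
    """Convert strings such as '1.2K views' or '3650' to an integer count."""
    cleaned = text.replace(" views", "").strip()
    if cleaned.upper().endswith("K"):
        # Decimal avoids float rounding surprises for values like 1.2 * 1000.
        return int(Decimal(cleaned[:-1]) * 1000)
    return int(cleaned)


samples = ["6 views", "73K views", "3650", "163K views", "1.2K views", "52.5K"]
for sample in samples:
    print(sample, "->", parse_views(sample))
```

Anything that does not fit these shapes (for example an "M" suffix) raises an error here rather than being converted silently.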

How to conditionally transform text in a column in power query?

I am building a workbook in Power BI and need to conditionally append text to column A if it meets a certain criterion. Specifically, if column A does not end with ".html", I want to append ".html" to it.
A sample of the data would look like this:
URL | Visits
site.com/page1.html | 5
site.com/page2.html | 12
site.com/page3 | 15
site.com/page4.html | 8
where the desired output would look like this:
URL | Visits
site.com/page1.html | 5
site.com/page2.html | 12
site.com/page3.html | 15
site.com/page4.html | 8
I have tried using the code:
#"CurrentLine" = Table.TransformColumns(#"PreviousLine", {{"URL", each if Text.EndsWith([URL],".html") = false then _ & ".html" else "URL", type text}})
But that returns an error "cannot apply field access to the type Text".
I can achieve the desired output in a very roundabout way if I use an AddColumn to store the criteria value, and then another AddColumn to store the new appended value, but this seems like an extremely overkill way to approach doing a single transformation to a column. (I am specifically looking to avoid this as I have about 10 or so transformations and don't want to have so many columns to add and cleanup if there is a more succinct way of coding)

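You don't want [URL] inside Text.EndsWith. Within Table.TransformColumns, _ is already the text value of the cell, not a row, so field access on it fails. Try this: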
= Table.TransformColumns(#"PreviousLine",
{{"URL", each if Text.EndsWith(_, ".html") then _ else _ & ".html", type text}}
)
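The rule itself is small enough to check outside Power Query. The Python sketch below applies the same condition to the sample rows from the question; ensure_html_suffix is a hypothetical name used only for this illustration.

```python
def ensure_html_suffix(url: str) -> str:
    """Append '.html' only when the URL does not already end with it."""
    return url if url.endswith(".html") else url + ".html"


# Sample rows from the question: (URL, Visits).
rows = [
    ("site.com/page1.html", 5),
    ("site.com/page2.html", 12),
    ("site.com/page3", 15),
    ("site.com/page4.html", 8),
]
fixed = [(ensure_html_suffix(url), visits) for url, visits in rows]
for url, visits in fixed:
    print(url, "|", visits)
```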

AWQL - how can i use a regular expressions or something similar?

I am querying the AdWords API via the following AWQL query (which works fine):
SELECT AccountDescriptiveName, CampaignId, CampaignName, AdGroupId, AdGroupName, KeywordText, KeywordMatchType, MaxCpc, Impressions, Clicks, Cost, Conversions, ConversionsManyPerClick, ConversionValue
FROM KEYWORDS_PERFORMANCE_REPORT
WHERE CampaignStatus IN ['ACTIVE', 'PAUSED']
AND AdGroupStatus IN ['ENABLED', 'PAUSED']
AND Status IN ['ACTIVE', 'PAUSED']
AND AdNetworkType1 IN ['SEARCH'] AND Impressions > 0
DURING 20140501,20140531
Now I want to exclude some campaigns:
We have a convention for our new campaigns that the campaign name begins with three digits followed by an underscore, e.g. "100_brand_all".
So I want to get only these new campaigns.
I tried lots of different variations of STARTS_WITH, but only exact strings work, and I need a pattern to match!
I already read https://developers.google.com/adwords/api/docs/guides/awql?hl=en and, according to it, it should be possible to use a WHERE expression like this:
CampaignName STARTS_WITH ['0','1','2','3']
But that doesn't work!
Any other ideas how I can achieve this?

Why not run a campaign performance report first, process it to get the campaign IDs you want (or don't want), and then use those in a condition such as "CampaignId IN [campaign ids here]" or "CampaignId NOT_IN [campaign ids]"?
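A rough sketch of that two-step approach in Python: filter the campaign rows client-side with a regex, then build the condition for the keyword report. The rows here are invented sample data standing in for a downloaded campaign report, the helper names are hypothetical, and the exact IN list syntax for numeric IDs is assumed to mirror the bracketed lists in the query above.

```python
import re

# Naming convention from the question: three digits followed by an underscore.
NEW_CAMPAIGN = re.compile(r"^[0-9]{3}_")

# Hypothetical rows, as they might come back from a campaign performance report.
campaigns = [
    {"CampaignId": 111, "CampaignName": "100_brand_all"},
    {"CampaignId": 222, "CampaignName": "old_brand"},
    {"CampaignId": 333, "CampaignName": "205_generic_de"},
]


def new_campaign_ids(rows):
    """Return the IDs of campaigns whose name follows the convention."""
    return [row["CampaignId"] for row in rows if NEW_CAMPAIGN.match(row["CampaignName"])]


def campaign_id_clause(ids):
    """Build a 'CampaignId IN [...]' condition for the second query."""
    return "CampaignId IN [" + ", ".join(str(i) for i in ids) + "]"


print(campaign_id_clause(new_campaign_ids(campaigns)))
```

The resulting condition would then be added to the WHERE part of the keyword report query.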

sparql exact match regex

I'm using the following SPARQL query to extract from DBpedia the pages that use a specific infobox:
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbpedia: <http://dbpedia.org/property/>
PREFIX res:<http://dbpedia.org/resource/>
SELECT DISTINCT *
WHERE {
?page dbpedia:wikiPageUsesTemplate ?template .
?page rdfs:label ?label .
FILTER (regex(?template, 'Infobox_artist')) .
FILTER (lang(?label) = 'en')
}
LIMIT 100
In this line of the query :
FILTER (regex(?template, 'Infobox_artist')) .
I get all the infoboxes that start with artist, such as artist_discography, and others that I don't need. My question is: how can I use a regex to get only the infoboxes that match exactly "infobox_artist"?

As it is a regex you should be able to restrict the search as follows:
FILTER (regex(?template, '^Infobox_artist$')) .
In a regex, ^ matches the beginning of the string and $ matches the end.
NB: I've not used SPARQL, so this may well not work.
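The effect of the anchors can be checked quickly in Python. The template names below are made-up examples, not DBpedia data. Note also that the last check uses a full URI: an anchored pattern has to cover the whole value it is tested against, so it does not match a bare local name pattern.

```python
import re

# Hypothetical template names for illustration only.
names = ["Infobox_artist", "Infobox_artist_discography", "Infobox_musical_artist"]

# Unanchored: matches any name that contains the text.
unanchored = [n for n in names if re.search("Infobox_artist", n)]

# Anchored with ^ and $: matches only the exact name.
anchored = [n for n in names if re.search("^Infobox_artist$", n)]

print(unanchored)
print(anchored)

# Against a full URI the anchored pattern no longer matches.
uri = "http://dbpedia.org/resource/Template:Infobox_artist"
print(bool(re.search("^Infobox_artist$", uri)))
```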

While the approach suggested by @beny23 works, it is really very inefficient. Using a regex for what is essentially an exact-value match puts a (potentially) unnecessary burden on the endpoint being queried. This is bad practice.
The value of ?template is a URI, so you really should use a value comparison (or even inline the value, as @cygri demonstrated):
SELECT DISTINCT * {
?page dbpedia:wikiPageUsesTemplate ?template .
?page rdfs:label ?label .
FILTER (lang(?label) = 'en')
FILTER (?template = <http://dbpedia.org/resource/Template:Infobox_artist> )
}
LIMIT 100
You can still easily adapt this query string in code to work with different types of infoboxes. Also: depending on which toolkit you use to create and execute SPARQL queries, you may have some programmatic alternatives to make query reuse even easier.
For example, you can create a "prepared query" which you can reuse, and set a binding to a particular value before executing it. For example, in Sesame you could do something like this:
String q = "SELECT DISTINCT * { " +
" ?page dbpedia:wikiPageUsesTemplate ?template . " +
" ?page rdfs:label ?label . " +
" FILTER (lang(?label) = 'en') " +
" } LIMIT 100 ";
TupleQuery query = conn.prepareTupleQuery(SPARQL, q);
URI infoboxArtist = f.createURI(DBPedia.NAMESPACE, "Template:Infobox_artist");
query.setBinding("template", infoboxArtist);
TupleQueryResult result = query.evaluate();
(As an aside: showing example using Sesame because I'm on the Sesame development team, but no doubt other SPARQL/RDF toolkits have similar functionality)

If all you want to do is a direct string comparison, then you don't need a regex! This is simpler and faster:
SELECT DISTINCT * {
?page dbpedia:wikiPageUsesTemplate
<http://dbpedia.org/resource/Template:Infobox_artist> .
?page rdfs:label ?label .
FILTER (lang(?label) = 'en')
}
LIMIT 100

How to do ANDing of conditions in a regular expression?

I want to match and modify part of a string if the following conditions are true:
I want to capture information about a project, such as project duration, client, technologies used, etc.
So I want to select a string starting with the word "project"; the string may also start with other words, like "details of project", "project details" or "project #1".
The regex should first look for the word "project", and it should select the string only when some or all of the following words are found after the word "project".
1) client
2) duration
3) environment
4) technologies
5) role
I want to select a string if it matches at least 2 of the above words. Words can appear in any order and if the string contains ANY two or three of these words, then the string should get selected.
I have sample text given below.
Details of Projects :
*Project #1: CVC – Customer Value Creation (Sep 2007 – till now) Time
Warner Cable is the world's leading
media and entertainment company, Time
Warner Cable (TWC) makes coaxial
quiver.
Client : Time Warner Cable,US. ETL
Tool : Informatica 7.1.4
Database : Oracle 9i.
Role : ETL Developer/Team Lead.
O/S : UNIX.
Responsibilities: Created Test Plan and Test Case Book. Peer reviewed team members Mappings. Documented Mappings. Leading the Development Team. Sending Reports to onsite. Bug fixing for Defects, Data and Performance related.
Details of Project #2: MYER – Sales
Analysis system (Nov 2005 – till now)
Coles Myer is one of Australia's largest retailers with more than 2,000 stores throughout Australia,
Client : Coles Myer
Retail, Australia. ETL Tool :
Informatica 7.1.3 Database : Oracle
8i. Role : ETL Developer. O/S :
UNIX. Responsibilities: Extraction,
Transformation and Loading of the data
using Informatica. Understanding the
entire source system.
Created and Run Sessions and
Workflows. Created Sort files using
Syncsort Application.*
Does anyone know how to achieve this using regular expressions?
Any clues or regular expressions are welcome!
Many thanks!

(client|duration|environment|technologies|role).+(?!\1)(client|duration|environment|technologies|role)

I would break it down into a few simpler regexes to get these results. The first would select only the chunk of text between projects: (?=Project #).*(?<=Project #)
With the match that this produces, I would run a separate regex to ask if it contains any of those words: client | duration | environment | technologies | role
If this comes back with at least 2 distinct matches, you know to select the original string!
Edit:
// Requires: using System.Linq; using System.Text.RegularExpressions;
string originalText; // assign the text to search before use
// Each match runs from one "Project #" up to (but not including) the next one.
MatchCollection projectDescriptions = Regex.Matches(originalText, @"(?=Project #).(?:(?!Project #).)*", RegexOptions.IgnoreCase | RegexOptions.Singleline);
foreach (Match projectDescription in projectDescriptions)
{
// Match whole keywords only; spaces inside the pattern would be matched literally.
MatchCollection keyWordMatches = Regex.Matches(projectDescription.Value, @"\b(?:client|duration|environment|technologies|role)\b", RegexOptions.IgnoreCase);
// Count distinct keywords, ignoring case.
int distinctKeywords = keyWordMatches.Cast<Match>().Select(m => m.Value.ToLowerInvariant()).Distinct().Count();
if (distinctKeywords >= 2)
{
//At this point, do whatever you need to with the original projectDescription match, the Match object will give you the index etc of the match inside the original string.
}
}

Maybe you need to break the requirement into two steps: first, extract the key/value pairs from your string, then apply your filter.
string input = @"Project #...";
Regex projects = new Regex(@"(?<key>\S+).:.(?<value>.*?\.)");
foreach (Match project in projects.Matches(input))
{
Console.WriteLine ("{0} : {1}",
project.Groups["key" ].Value,
project.Groups["value"].Value);
}

Try
^(details of )?project.*?((client|duration|environment|technologies|role).*?){2}.*$
One note: This will also match if only one of the terms appears twice.
In C#:
foundMatch = Regex.IsMatch(subjectString, @"\A(?:(details of )?project.*?((client|duration|environment|technologies|role).*?){2}.*)\Z", RegexOptions.Singleline | RegexOptions.IgnoreCase);
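To avoid counting a repeated term twice, one option is to split the text into project chunks and count distinct keywords per chunk instead of doing everything in one pattern. The Python sketch below shows that idea under some assumptions: each project is taken to start at "Project #<number>", and the sample text and function names are invented for the illustration.

```python
import re

KEYWORDS = re.compile(r"\b(client|duration|environment|technologies|role)\b", re.IGNORECASE)
# Assumption: every project section starts with "Project #<number>".
PROJECT_HEADER = re.compile(r"project\s*#\d+", re.IGNORECASE)


def project_chunks(text):
    """Split the text into chunks, each starting at a project header."""
    starts = [m.start() for m in PROJECT_HEADER.finditer(text)]
    return [text[s:e] for s, e in zip(starts, starts[1:] + [len(text)])]


def distinct_keywords(chunk):
    """Return the set of distinct keywords in a chunk, lower-cased."""
    return {m.group(1).lower() for m in KEYWORDS.finditer(chunk)}


def select_projects(text, minimum=2):
    """Keep only chunks with at least `minimum` distinct keywords."""
    return [c for c in project_chunks(text) if len(distinct_keywords(c)) >= minimum]


sample = (
    "Project #1: CVC. Client : Time Warner Cable. Role : ETL Developer. "
    "Project #2: Internal tool. Role : Tester. Role : Reviewer."
)
for chunk in select_projects(sample):
    print(chunk.strip())
```

In this sample only the first chunk is selected, because the second one repeats "Role" and so contains a single distinct keyword.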
