How to remove underscores in field names with logstash?

I'm thinking about using the mutate filter and the rename option, but I don't know about the corresponding regex to achieve that:
filter {
mutate {
rename => {
"any_field_with_underscore" => "anyfieldwithunderscore" # i don't know how to write regex for this ...
}
}
}
Can anyone help?

There is no indication in the docs that rename{} takes a regexp.
I've seen this done with a ruby{} filter.
As requested, here's some untested Ruby:
begin
keys = event.to_hash.keys
keys.each{|key|
if ( key =~ /_/ )
newkey = key.gsub(/_/, '')
event[newkey] = event.remove(key)
end
}
rescue Exception => e
event['logstash_ruby_exception'] = 'underscores: ' + e.message
end

To build on Alain's answer,
In Logstash >= 5.x, an event object accessor is enforced:
filter {
ruby {
code => "
begin
keys = event.to_hash.keys
keys.each{|key|
if ( key =~ /http_/ )
newkey = key.gsub(/http_/, '')
event.set(newkey, event.remove(key))
end
}
rescue Exception => e
event.set('logstash_ruby_exception', 'underscores: ' + e.message)
end
"
}
}
Also see this feature request that would do the same
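For readers who want to see the renaming logic in isolation, here is a minimal Python sketch of the same loop over a plain dictionary. It is not Logstash code: the function name strip_underscores and the sample event are hypothetical, and the collision check is an added assumption about what you might want when two keys collapse to the same name.

```python
def strip_underscores(event):
    """Return a copy of `event` with underscores removed from every key."""
    renamed = {}
    for key, value in event.items():
        new_key = key.replace("_", "")
        if new_key in renamed:
            # Two different keys collapsed to the same name (e.g. "a_b" and "ab").
            raise ValueError(f"key collision on {new_key!r}")
        renamed[new_key] = value
    return renamed


# Hypothetical sample event, for illustration only.
event = {"any_field_with_underscore": 1, "message": "hello"}
print(strip_underscores(event))
```

Building a new dictionary rather than mutating the one being iterated mirrors the Ruby answers, which take a snapshot of the keys before renaming.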

Related

Is there a way to generate the AWS Console URLs for CloudWatch Log Group filters?

I would like to send my users directly to a specific log group and filter but I need to be able to generate the proper URL format. For example, this URL
https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#logsV2:log-groups/log-group/
%252Fmy%252Flog%252Fgroup%252Fgoes%252Fhere/log-events/$3FfilterPattern$3D$255Bincoming_ip$252C$2Buser_name$252C$2Buser_ip$2B$252C$2Btimestamp$252C$2Brequest$2B$2521$253D$2B$2522GET$2B$252Fhealth_checks$252Fall$2B*$2522$252C$2Bstatus_code$2B$253D$2B5*$2B$257C$257C$2Bstatus_code$2B$253D$2B429$252C$2Bbytes$252C$2Burl$252C$2Buser_agent$255D$26start$3D-172800000
will take you to a log group named /my/log/group/goes/here and filter messages with this pattern for the past 2 days:
[incoming_ip, user_name, user_ip , timestamp, request != "GET /health_checks/all *", status_code = 5* || status_code = 429, bytes, url, user_agent]
I can decode part of the URL (see below), but I don't know what some of the other characters should be, and this doesn't really look like any standard encoding to me. Does anyone know an encoder/decoder for this URL format?
%252F == /
$252C == ,
$255B == [
$255D == ]
$253D == =
$2521 == !
$2522 == "
$252F == /
$257C == |
$2B == +
$26 == &
$3D == =
$3F == ?

First of all, I'd like to thank the others for the clues. What follows is a complete explanation of how Logs Insights links are constructed.
Overall it's just a weirdly encoded serialization of an object structure, which works like this:
The part after ?queryDetail= is the object representation, with {} represented by ~()
The object is walked down to its primitive values, and each of them is transformed as follows:
encodeURIComponent(value), so that all special characters are transformed to %xx
replace(/%/g, "*"), so that this encoding is not affected by the top-level ones
if the value is a string, it is prefixed with an unmatched single quote
To illustrate:
"Hello world" -> "Hello%20world" -> "Hello*20world" -> "'Hello*20world"
Arrays of transformed primitives are joined using ~ and likewise put inside a ~() construct
Then, once the primitives are transformed, the object's keys and values are joined using "~".
After that the string is escape()d (note that encodeURIComponent() is not used here, as it doesn't transform ~ in JS).
After that ?queryDetail= is prepended.
And finally this string is encodeURIComponent()ed and, as a cherry on top, % is replaced with $.
Let's see how it works in practice. Say these are our query parameters:
const expression = `fields @timestamp, @message
    | filter @message not like 'example'
    | sort @timestamp asc
    | limit 100`;
const logGroups = ["/application/sample1", "/application/sample2"];
const queryParameters = {
end: 0,
start: -3600,
timeType: "RELATIVE",
unit: "seconds",
editorString: expression,
isLiveTrail: false,
source: logGroups,
};
Firstly primitives are transformed:
const expression = "'fields*20*40timestamp*2C*20*40message*0A*20*20*20*20*7C*20filter*20*40message*20not*20like*20'example'*0A*20*20*20*20*7C*20sort*20*40timestamp*20asc*0A*20*20*20*20*7C*20limit*20100";
const logGroups = ["'*2Fapplication*2Fsample1", "'*2Fapplication*2Fsample2"];
const queryParameters = {
end: 0,
start: -3600,
timeType: "'RELATIVE",
unit: "'seconds",
editorString: expression,
isLiveTrail: false,
source: logGroups,
};
Then, object is joined using ~ so we have object representation string:
const objectString = "~(end~0~start~-3600~timeType~'RELATIVE~unit~'seconds~editorString~'fields*20*40timestamp*2C*20*40message*0A*20*20*20*20*7C*20filter*20*40message*20not*20like*20'example'*0A*20*20*20*20*7C*20sort*20*40timestamp*20asc*0A*20*20*20*20*7C*20limit*20100~isLiveTrail~false~source~(~'*2Fapplication*2Fsample1~'*2Fapplication*2Fsample2))"
Now we escape() it:
const escapedObject = "%7E%28end%7E0%7Estart%7E-3600%7EtimeType%7E%27RELATIVE%7Eunit%7E%27seconds%7EeditorString%7E%27fields*20*40timestamp*2C*20*40message*0A*20*20*20*20*7C*20filter*20*40message*20not*20like*20%27example%27*0A*20*20*20*20*7C*20sort*20*40timestamp*20asc*0A*20*20*20*20*7C*20limit*20100%7EisLiveTrail%7Efalse%7Esource%7E%28%7E%27*2Fapplication*2Fsample1%7E%27*2Fapplication*2Fsample2%29%29"
Now we append ?queryDetail= prefix:
const withQueryDetail = "?queryDetail=%7E%28end%7E0%7Estart%7E-3600%7EtimeType%7E%27RELATIVE%7Eunit%7E%27seconds%7EeditorString%7E%27fields*20*40timestamp*2C*20*40message*0A*20*20*20*20*7C*20filter*20*40message*20not*20like*20%27example%27*0A*20*20*20*20*7C*20sort*20*40timestamp*20asc*0A*20*20*20*20*7C*20limit*20100%7EisLiveTrail%7Efalse%7Esource%7E%28%7E%27*2Fapplication*2Fsample1%7E%27*2Fapplication*2Fsample2%29%29"
Finally we URL-encode it and replace % with $, et voilà:
const result = "$3FqueryDetail$3D$257E$2528end$257E0$257Estart$257E-3600$257EtimeType$257E$2527RELATIVE$257Eunit$257E$2527seconds$257EeditorString$257E$2527fields*20*40timestamp*2C*20*40message*0A*20*20*20*20*7C*20filter*20*40message*20not*20like*20$2527example$2527*0A*20*20*20*20*7C*20sort*20*40timestamp*20asc*0A*20*20*20*20*7C*20limit*20100$257EisLiveTrail$257Efalse$257Esource$257E$2528$257E$2527*2Fapplication*2Fsample1$257E$2527*2Fapplication*2Fsample2$2529$2529"
And putting it all together:
function getInsightsUrl(queryDefinitionId, start, end, expression, sourceGroup, timeType = 'ABSOLUTE', region = 'eu-west-1') {
const p = m => escape(m);
const s = m => escape(m).replace(/%/gi, '*');
const queryDetail
= p('~(')
+ p("end~'")
+ s(end.toUTC().toISO()) // converted using Luxon
+ p("~start~'")
+ s(start.toUTC().toISO()) // converted using Luxon
// Or use UTC instead of Local
+ p(`~timeType~'${timeType}~tz~'Local~editorString~'`)
+ s(expression)
+ p('~isLiveTail~false~queryId~\'')
+ s(queryDefinitionId)
+ p("~source~(~'") + s(sourceGroup) + p(')')
+ p(')');
return `https://${region}.console.aws.amazon.com/cloudwatch/home?region=${region}#logsV2:logs-insights${escape(`?queryDetail=${queryDetail}`).replace(/%/gi, '$')}`;
}
Of course, the reverse operation can be performed as well.
That's all folks. Have fun, take care and try to avoid doing such weird stuff yourselves. :)
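The encoding steps described in this answer can also be sketched in Python. This is an approximation under stated assumptions, not AWS's implementation: all function names are hypothetical, the input is assumed to be a flat dict of ASCII strings, numbers, booleans and lists of those, and JavaScript's escape() and encodeURIComponent() are emulated with urllib.parse.quote and hand-picked safe characters.

```python
from urllib.parse import quote

# Characters JavaScript's encodeURIComponent() leaves alone, beyond the
# letters, digits and "-_.~" that urllib.parse.quote never escapes.
URI_COMPONENT_SAFE = "!*'()"


def encode_primitive(value):
    """Encode one leaf value: numbers and booleans as-is, strings as '<escaped>."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    escaped = quote(value, safe=URI_COMPONENT_SAFE).replace("%", "*")
    return "'" + escaped


def encode_object(obj):
    """Serialize a flat dict (values may be lists of primitives) to ~(...) form."""
    parts = []
    for key, value in obj.items():
        if isinstance(value, list):
            encoded = "(" + "".join("~" + encode_primitive(v) for v in value) + ")"
        else:
            encoded = encode_primitive(value)
        parts.append(f"{key}~{encoded}")
    return "~(" + "~".join(parts) + ")"


def js_escape(text):
    """Approximate JavaScript escape() for ASCII input."""
    # escape() keeps @*_+-./ untouched; quote() never escapes "~", so do it by hand.
    return quote(text, safe="@*_+-./").replace("~", "%7E")


def insights_fragment(obj):
    """Build the part of the URL that follows '#logsV2:logs-insights'."""
    with_query_detail = "?queryDetail=" + js_escape(encode_object(obj))
    return quote(with_query_detail, safe=URI_COMPONENT_SAFE).replace("%", "$")


# Hypothetical query parameters, modelled on the worked example above.
query = {
    "end": 0,
    "start": -3600,
    "timeType": "RELATIVE",
    "unit": "seconds",
    "source": ["/application/sample1", "/application/sample2"],
}
print(insights_fragment(query))
```

Keeping each stage in its own function makes it easy to compare the intermediate strings with the ones shown in the walkthrough.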

I had to do a similar thing to generate a back link to the logs for a lambda and did the following hackish thing to create the link:
const link = `https://${process.env.AWS_REGION}.console.aws.amazon.com/cloudwatch/home?region=${process.env.AWS_REGION}#logsV2:log-groups/log-group/${process.env.AWS_LAMBDA_LOG_GROUP_NAME.replace(/\//g, '$252F')}/log-events/${process.env.AWS_LAMBDA_LOG_STREAM_NAME.replace('$', '$2524').replace('[', '$255B').replace(']', '$255D').replace(/\//g, '$252F')}`

A colleague of mine figured out that the encoding is nothing special. It is the standard URI percent encoding but applied twice (2x). In javascript you can use the encodeURIComponent function to test this out:
let inp = 'https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#logsV2:log-groups/log-group/'
console.log(encodeURIComponent(inp))
console.log(encodeURIComponent(encodeURIComponent(inp)))
This piece of javascript produces the expected output on the second encoding stage:
https%3A%2F%2Fconsole.aws.amazon.com%2Fcloudwatch%2Fhome%3Fregion%3Dus-east-1%23logsV2%3Alog-groups%2Flog-group%2F
https%253A%252F%252Fconsole.aws.amazon.com%252Fcloudwatch%252Fhome%253Fregion%253Dus-east-1%2523logsV2%253Alog-groups%252Flog-group%252F
Caution
At least some bits use the double encoding, not the whole link though. Otherwise all special characters would occupy 4 characters after double encoding, but some still occupy only 2 characters. Hope this helps anyway ;)
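As a rough Python sketch of that observation, the pair of helpers below encodes a value twice and swaps % for $, and decodes a fragment from the log-group URL format shown in the question. The function names are hypothetical, and the decoder assumes every $ in the fragment is a substituted % and that a bare + stands for a space, which is what the question's URL suggests.

```python
from urllib.parse import quote, unquote, unquote_plus


def aws_double_encode(text):
    """Percent-encode twice, then swap '%' for '$'."""
    return quote(quote(text, safe=""), safe="").replace("%", "$")


def aws_decode(fragment):
    """Undo the '$' substitution and up to two layers of percent-encoding."""
    once = unquote(fragment.replace("$", "%"))
    # Second pass: remaining %xx escapes are decoded and '+' becomes a space.
    return unquote_plus(once)


print(aws_double_encode("/my/log/group/goes/here"))
print(aws_decode("$255Bincoming_ip$252C$2Buser_name$255D"))
```

Parts of the URL that were encoded only once (such as $3FfilterPattern$3D) pass through the second decoding step unchanged, so the same decoder handles both kinds.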

My complete JavaScript solution based on @isaias-b's answer, which also adds a timestamp filter on the logs:
const logBaseUrl = 'https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#logsV2:log-groups/log-group';
const encode = text => encodeURIComponent(text).replace(/%/g, '$');
const awsEncode = text => encodeURIComponent(encodeURIComponent(text)).replace(/%/g, '$');
const encodeTimestamp = timestamp => encode('?start=') + awsEncode(new Date(timestamp).toJSON());
const awsLambdaLogBaseUrl = `${logBaseUrl}/${awsEncode('/aws/lambda/')}`;
const logStreamUrl = (logGroup, logStream, timestamp) =>
`${awsLambdaLogBaseUrl}${logGroup}/log-events/${awsEncode(logStream)}${timestamp ? encodeTimestamp(timestamp) : ''}`;

I have created a bit of Ruby code that seems to satisfy the CloudWatch URL parser. I'm not sure why you have to double escape some things and then replace % with $ in others. I'm guessing there is some reason behind it but I couldn't figure out a nice way to do it, so I'm just brute forcing it. If you have something better, or know why they do this, please add a comment.
NOTE: The filter I tested with is kinda basic and I'm not sure what might need to change if you get really fancy with it.
require 'cgi'
# Basic URL that is the same across all requests
url = 'https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#logsV2:log-groups/log-group/'
# CloudWatch log group
log_group = '/aws/my/log/group'
# Either specify the instance you want to search...
instance = '/log-events/i-xxxxxxxxxxxx'
# ...or leave it out to search all instances:
# instance = '/log-events'
# The filter to apply.
filter = '[incoming_ip, user_name, user_ip , timestamp, request, status_code = 5*, bytes, url, user_agent]'
# Start time. There might be an End time as well but my queries haven't used
# that yet so I'm not sure how it's formatted. It should be pretty similar
# though.
hours = 48
start = "&start=-#{hours*60*60*1000}"
# This will get you the final URL
final = url + CGI.escape(CGI.escape(log_group)) + instance + '$3FfilterPattern$3D' + CGI.escape(CGI.escape(filter)).gsub('%','$') + CGI.escape(start).gsub('%','$')

A bit late but here is a python implementation
import re
import urllib.parse
def get_cloud_watch_search_url(search, log_group, log_stream, region=None):
    """Return a properly formatted URL string for searching CloudWatch logs
    search = '{$.message: "You are amazing"}'
    log_group = the group of messages you want to search
    log_stream = the stream of logs to search
    """
    url = f'https://{region}.console.aws.amazon.com/cloudwatch/home?region={region}'
    def aws_encode(value):
        """The heart of this is that AWS likes to quote things twice with some substitution"""
        value = urllib.parse.quote_plus(value)
        value = re.sub(r"\+", " ", value)
        return re.sub(r"%", "$", urllib.parse.quote_plus(value))
    bookmark = '#logsV2:log-groups'
    bookmark += '/log-group/' + aws_encode(log_group)
    bookmark += "/log-events/" + log_stream
    bookmark += re.sub(r"%", "$", urllib.parse.quote("?filterPattern="))
    bookmark += aws_encode(search)
    return url + bookmark
This then allows you to quickly verify it.
>>> real = 'https://us-west-2.console.aws.amazon.com/cloudwatch/home?region=us-west-2#logsV2:log-groups/log-group/$252Fapp$252Fdjango/log-events/production$3FfilterPattern$3D$257B$2524.msg$253D$2522$2525s$2525s+messages+to+$2525s+pk$253D$2525d...$2522$257D'
>>> constructed = get_cloud_watch_search_url(search='{$.msg="%s%s messages to %s pk=%d..."}', log_group='/app/django', log_stream='production', region='us-west-2')
>>> real == constructed
True

I encountered this problem recently when I wanted to generate cloudwatch insights URL. Typescript version below:
import { v4 } from "uuid"; // assumption: v4() below comes from the uuid package
export function getInsightsUrl(
start: Date,
end: Date,
query: string,
sourceGroup: string,
region = "us-east-1"
) {
const p = (m: string) => escape(m);
// encodes inner values
const s = (m: string) => escape(m).replace(/\%/gi, "*");
const queryDetail =
p(`~(end~'`) +
s(end.toISOString()) +
p(`~start~'`) +
s(start.toISOString()) +
p(`~timeType~'ABSOLUTE~tz~'UTC~editorString~'`) +
s(query) +
p(`~isLiveTail~false~queryId~'`) +
s(v4()) +
p(`~source~(~'`) +
s(sourceGroup) +
p(`))`);
return (
`https://console.aws.amazon.com/cloudwatch/home?region=${region}#logsV2:logs-insights` +
escape("?queryDetail=" + queryDetail).replace(/\%/gi, "$")
);
}
Github GIST

A Python solution based on @Pål Brattberg's answer:
cloudwatch_log_template = "https://{AWS_REGION}.console.aws.amazon.com/cloudwatch/home?region={AWS_REGION}#logsV2:log-groups/log-group/{LOG_GROUP_NAME}/log-events/{LOG_STREAM_NAME}"
log_url = cloudwatch_log_template.format(
AWS_REGION=AWS_REGION, LOG_GROUP_NAME=CLOUDWATCH_LOG_GROUP, LOG_STREAM_NAME=LOG_STREAM_NAME
)
Make sure to substitute illegal characters first (see OP) if you used any.

I encountered this problem recently when I wanted to generate cloudwatch insights URL. PHP version below:
<?php
function getInsightsUrl($region = 'ap-northeast-1') {
// https://stackoverflow.com/questions/67734825/why-is-laravels-carbon-toisostring-different-from-javascripts-toisostring
$start = now()->subMinutes(2)->format('Y-m-d\TH:i:s.v\Z');
$end = now()->addMinutes(2)->format('Y-m-d\TH:i:s.v\Z');
$filter = 'INFO';
$logStream = 'xxx_backend_web';
$sourceGroup = '/ecs/xxx_backend_prod';
// $sourceGroup = '/aws/ecs/xxx_backend~\'/ecs/xxx_backend_dev'; // multiple source group
$query =
"fields @timestamp, @message \n" .
"| sort @timestamp desc\n" .
"| filter @logStream like '$logStream'\n" .
"| filter @message like '$filter'\n" .
"| limit 20";
$queryDetail = urlencode(
("~(end~'") .
($end) .
("~start~'") .
($start) .
("~timeType~'ABSOLUTE~tz~'Local~editorString~'") .
($query) .
("~isLiveTail~false~queryId~'") .
("~source~(~'") .
($sourceGroup) .
("))")
);
$queryDetail = preg_replace('/\%/', '$', urlencode("?queryDetail=" . $queryDetail));
return
"https://console.aws.amazon.com/cloudwatch/home?region=${region}#logsV2:logs-insights"
. $queryDetail;
}

A coworker came up with the following JavaScript solution.
import JSURL from 'jsurl';
const QUERY = {
end: 0,
start: -3600,
timeType: 'RELATIVE',
unit: 'seconds',
editorString: "fields @timestamp, @message, @logStream, @log\n| sort @timestamp desc\n| limit 200\n| stats count() by bin(30s)",
source: ['/aws/lambda/simpleFn'],
};
function toLogsUrl(query) {
return `#logsV2:logs-insights?queryDetail=${JSURL.stringify(query)}`;
}
toLogsUrl(QUERY);
// #logsV2:logs-insights?queryDetail=~(end~0~start~-3600~timeType~'RELATIVE~unit~'seconds~editorString~'fields*20*40timestamp*2c*20*40message*2c*20*40logStream*2c*20*40log*0a*7c*20sort*20*40timestamp*20desc*0a*7c*20limit*20200*0a*7c*20stats*20count*28*29*20by*20bin*2830s*29~source~(~'*2faws*2flambda*2fsimpleFn))

I HAVE to elevate @WayneB's answer above because it just works. No encoding required: just follow his template. I just confirmed it works for me. Here's what he said in one of the comments above:
"Apparently there is an easier link which does the encoding/replacement for you: https://console.aws.amazon.com/cloudwatch/home?region=${process.env.AWS_REGION}#logEventViewer:group=${logGroup};stream=${logStream}"
Thanks for this answer Wayne - just wish I saw it sooner!

Since Python contributions relate to log-groups, and not to log-insights, this is my contribution. I guess that I could have done better with the inner functions though, but it is a good starting point:
from datetime import datetime, timedelta
from urllib.parse import quote
def get_aws_cloudwatch_log_insights(query_parameters, aws_region):
    def quote_string(input_str):
        return f"""{quote(input_str, safe="~()'*").replace('%', '*')}"""
    def quote_list(input_list):
        quoted_list = ""
        for item in input_list:
            if isinstance(item, str):
                item = f"'{item}"
            quoted_list += f"~{item}"
        return f"({quoted_list})"
    params = []
    for key, value in query_parameters.items():
        if key == "editorString":
            value = "'" + quote(value)
            value = value.replace('%', '*')
        elif isinstance(value, str):
            value = "'" + value
        if isinstance(value, bool):
            value = str(value).lower()
        elif isinstance(value, list):
            value = quote_list(value)
        params += [key, str(value)]
    object_string = quote_string("~(" + "~".join(params) + ")")
    scaped_object = quote(object_string, safe="*").replace("~", "%7E")
    with_query_detail = "?queryDetail=" + scaped_object
    result = quote(with_query_detail, safe="*").replace("%", "$")
    final_url = f"https://{aws_region}.console.aws.amazon.com/cloudwatch/home?region={aws_region}#logsV2:logs-insights{result}"
    return final_url
Example:
aws_region = "eu-west-1"
query = """fields @timestamp, @message
| filter @message not like 'example'
| sort @timestamp asc
| limit 100"""
log_groups = ["/application/sample1", "/application/sample2"]
query_parameters = {
"end": datetime.utcnow().isoformat(timespec='milliseconds') + "Z",
"start": (datetime.utcnow() - timedelta(days=2)).isoformat(timespec='milliseconds') + "Z",
"timeType": "ABSOLUTE",
"unit": "seconds",
"editorString": query,
"isLiveTrail": False,
"source": log_groups,
}
print(get_aws_cloudwatch_log_insights(query_parameters, aws_region))

Yet another Python solution:
from urllib.parse import quote
def aws_quote(s):
    return quote(quote(s, safe="")).replace("%", "$")
def aws_cloudwatch_url(region, log_group, log_stream):
    return "/".join([
        f"https://{region}.console.aws.amazon.com/cloudwatch/home?region={region}#logsV2:log-groups",
        "log-group",
        aws_quote(log_group),
        "log-events",
        aws_quote(log_stream),
    ])
aws_cloudwatch_url("ap-southeast-2", "/var/log/syslog", "process/pid=1")
https://ap-southeast-2.console.aws.amazon.com/cloudwatch/home?region=ap-southeast-2#logsV2:log-groups/log-group/$252Fvar$252Flog$252Fsyslog/log-events/process$252Fpid$253D1

Regex help need to match an ampersand OR and end of string

I'm trying to create a regex to match part of a URL
The possible URLs might be
www.mysite.com?userid=123xy
www.mysite.com?userid=123x&username=joe
www.mysite.com?tag=xyz&userid=1ww45
www.mysite.com?tag=xyz&userid=1g3x5&username=joe
I'm trying to match the userid=123456
So far I have
Dim r As New Regex("[&?]userID.*[?&]")
Debug.WriteLine(r.Match(strUrl))
But this is only matching lines 2 and 4.
Can anyone help?

(?<=[?&]userid=)[^&#\s]*
Output:
123xy
123x
1ww45
1g3x5
A few points:
This works both if you are matching one URL at a time and if you have a whitespace-separated set.
This matches the userid value only. It uses a positive look-behind assertion, which is not included in the match, since you only care about the value.
The fragment part, if present, will be ignored (e.g. if the URL looked like this: www.mysite.com?tag=xyz&userid=1ww45#top)
If the case of userid doesn't matter, use RegexOptions.IgnoreCase.
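The same look-behind pattern can be tried out in Python, as a quick sketch. The name USERID is a placeholder, and re.IGNORECASE plays the role of RegexOptions.IgnoreCase; Python accepts this look-behind because "[?&]userid=" has a fixed width.

```python
import re

# Look-behind for "?userid=" or "&userid=", then everything up to the next
# "&", "#" or whitespace.
USERID = re.compile(r"(?<=[?&]userid=)[^&#\s]*", re.IGNORECASE)

urls = [
    "www.mysite.com?userid=123xy",
    "www.mysite.com?userid=123x&username=joe",
    "www.mysite.com?tag=xyz&userid=1ww45",
    "www.mysite.com?tag=xyz&userid=1g3x5&username=joe",
]

for url in urls:
    match = USERID.search(url)
    print(match.group(0) if match else None)
```

Because the look-behind is not part of the match, group(0) is the id itself and no capture group is needed.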

I got it:
[&?]userID=[^\s&#]+

PHP solution:
"/[\\?&]userid=([^&]*)/"
Tests:
$tests = [
[
"regex" => "/[\\?&]userid=([^&]*)/",
"expected" => "123xy",
"inputs" => [
"www.mysite.com?userid=123xy",
"www.mysite.com?userid=123xy&username=joe",
"www.mysite.com?tag=xyz&userid=123xy",
"www.mysite.com?tag=xyz&userid=123xy&username=joe"
]
]
];
foreach ($tests as $test) {
$regex = $test['regex'];
$expected = $test['expected'];
foreach ($test['inputs'] as $input) {
if (!preg_match($regex, $input, $match)) {
throw new Exception("Regex '{$regex}' doesn't match for input '{$input}' or error has occured.");
}
$matched = $match[1];
if ($matched !== $expected) {
throw new Exception("Found '{$matched}' instead of '{$expected}'.");
}
echo "Matched '{$matched}' in '{$input}'." . PHP_EOL;
}
}
Results:
Matched '123xy' in 'www.mysite.com?userid=123xy'.
Matched '123xy' in 'www.mysite.com?userid=123xy&username=joe'.
Matched '123xy' in 'www.mysite.com?tag=xyz&userid=123xy'.
Matched '123xy' in 'www.mysite.com?tag=xyz&userid=123xy&username=joe'.

You can use the regex: .*?(userid=\d+).*
.*? - is a non-greedy way to express: everything that comes before (userid=\d+)
Python example:
import re
a = 'www.mysite.com?userid=12345'
b = 'www.mysite.com?userid=12345&username=joe'
mat = re.match(r'.*?(userid=\d+).*', a)
print(mat.group(1)) # prints userid=12345
mat = re.match(r'.*?(userid=\d+).*', b)
print(mat.group(1)) # prints userid=12345
Link to Fiddler

how to store a regex-capture-group as a variable in vim script?

i'm trying to write a vimscript to refactor some legacy code.
roughly i have a lot of files in this format
$this['foo'] = array();
{
$this['foo']['id'] = 123;
$this['foo']['name'] = 'name here';
$this['foo']['name2'] = 'name here2';
$this['foo']['name3'] = 'name here3';
}
I want to reformat this into
$this['foo'] = array(
'id' => 123,
'name' => 'name here',
'name2' => 'name here2',
'name3' => 'name here3',
);
where foo is variable.
I'm trying to match
$this['foo'] = array()
{
with this regex
/\zs\$this\[.*\]\ze = array()\_s{;
so i can execute this code
# move cursor down two lines, visual select the contents of the block { }
jjvi{
# use variable, parent_array to replace
s/\= parent_array . '\[\([^=]\+\)] = \(.*\);'/'\1' => \2,
but of course i need to let parent_array = /\zs$this[(.*)]\ze = array(); which isnt the right syntax apparently...
TL;DR
function Refactor()
# what is the proper syntax to do this assignment ?
let parent_array = /\zs\$this\[.*\]\ze = array()\_s{;
if (parent_array)
jjvi{
'<,'>s/\= parent_array . '\[\([^=]\+\)] = \(.*\);'/'\1' => \2,
endif
endfunction
EDIT* fixed escaping as per commenter FDinoff

Assuming there's only one such match in a line, and you want the first such line:
let pattern = '\$this\[.*\]\ze = array()\_s{;'
if search(pattern, 'cW') > 0
let parent_array = matchstr(getline('.'), pattern)
endif
This first locates the next matching line, then extracts the matching text. Note that this moves the cursor, but with the 'n' flag to search(), this can be avoided.

Regular expression to match word pairs joined with colons

I don't know regular expressions at all. Can anybody help me with one very simple regular expression, which is
extracting 'word:word' from a sentence, e.g. "Java Tutorial Format:Pdf With Location:Tokyo Javascript"?
A small modification:
the first 'word' is from a list but the second is anything: "word1 in [ABC, FGR, HTY]".
Guys, the situation demands a little more modification.
The matching form can be "word11:word12 word13 ..." up to the next "word21: ...".
Things are becoming complex by the second... I have to learn regex :(
Thanks in advance.

You can use the regex:
\w+:\w+
Explanation:
\w - single char which is either a letter(uppercase or lowercase), digit or a _.
\w+ - one or more of above char..basically a word
so \w+:\w+
would match a pair of words separated by a colon.

Try \b(\S+?):(\S+?)\b. Group 1 will capture "Format" and group 2, "Pdf".
A working example:
<html>
<head>
<script type="text/javascript">
function test() {
var re = /\b(\S+?):(\S+?)\b/g; // without 'g' matches only the first
var text = "Java Tutorial Format:Pdf With Location:Tokyo Javascript";
var match = null;
while ( (match = re.exec(text)) != null) {
alert(match[1] + " -- " + match[2]);
}
}
</script>
</head>
<body onload="test();">
</body>
</html>
A good reference for regexes is https://developer.mozilla.org/en/Core_JavaScript_1.5_Reference/Global_Objects/RegExp

Use this snippet :
$str=" this is pavun:kumar hello world bk:systesm" ;
if ( preg_match_all ( '/(\w+\:\w+)/',$str ,$val ) )
{
print_r ( $val ) ;
}
else
{
print "Not matched \n";
}

Continuing Jaú's function with your additional requirement:
function test() {
var words = ['Format', 'Location', 'Size'],
text = "Java Tutorial Format:Pdf With Location:Tokyo Language:Javascript",
match = null;
var re = new RegExp( '(' + words.join('|') + '):(\\w+)', 'g');
while ( (match = re.exec(text)) != null) {
alert(match[1] + " = " + match[2]);
}
}
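Both follow-up requirements from the question (keys restricted to a list, and values that run up to the next "word:") can be sketched in Python as below. The function names are hypothetical, and the second pattern assumes that any "word:" token starts a new pair.

```python
import re


def pairs_for_keys(text, keys):
    """Find key:value pairs whose key is one of `keys`; the value is one word."""
    alternation = "|".join(re.escape(k) for k in keys)
    return re.findall(rf"\b({alternation}):(\w+)", text)


def pairs_until_next_key(text):
    """Find key:value pairs whose value runs up to the next 'word:' or the end."""
    # Lazy value, stopped by a look-ahead for whitespace + "word:" or end of text.
    return re.findall(r"(\w+):(.*?)(?=\s+\w+:|$)", text)


print(pairs_for_keys(
    "Java Tutorial Format:Pdf With Location:Tokyo Language:Javascript",
    ["Format", "Location", "Size"],
))
print(pairs_until_next_key(
    "Java Tutorial Format:Pdf With Location:Tokyo Javascript"
))
```

re.escape guards against keys that contain regex metacharacters, which the string concatenation in the JavaScript version does not do.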

I am currently solving that problem in my nodejs app and found that this is, what I guess, suitable for colon-paired wordings:
([\w]+:)("(([^"])*)"|'(([^'])*)'|(([^\s])*))
It also matches quoted value. like a:"b" c:'d e' f:g
Example coding in es6:
const regex = /([\w]+:)("(([^"])*)"|'(([^'])*)'|(([^\s])*))/g;
const str = `category:"live casino" gsp:S1aik-UBnl aa:"b" c:'d e' f:g`;
let m;
while ((m = regex.exec(str)) !== null) {
// This is necessary to avoid infinite loops with zero-width matches
if (m.index === regex.lastIndex) {
regex.lastIndex++;
}
// The result can be accessed through the `m`-variable.
m.forEach((match, groupIndex) => {
console.log(`Found match, group ${groupIndex}: ${match}`);
});
}
Example coding in PHP
$re = '/([\w]+:)("(([^"])*)"|\'(([^\'])*)\'|(([^\s])*))/';
$str = 'category:"live casino" gsp:S1aik-UBnl aa:"b" c:\'d e\' f:g';
preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);
// Print the entire match result
var_dump($matches);
You can check/test your regex expressions using this online tool: https://regex101.com
Btw, if not deleted by regex101.com, you can browse that example coding here

Here's the non-regex way: in your favourite language, split on whitespace, go through the elements, check for ":", and print them if found. E.g. in Python:
>>> s="Java Tutorial Format:Pdf With Location:Tokyo Javascript"
>>> for i in s.split():
...     if ":" in i:
...         print(i)
...
Format:Pdf
Location:Tokyo
You can do further checks to make sure it's really "someword:someword" by splitting again on ":" and checking that there are 2 elements in the resulting list, e.g.
>>> for i in s.split():
...     if ":" in i:
...         a=i.split(":")
...         if len(a) == 2:
...             print(i)
...
Format:Pdf
Location:Tokyo

([^:]+):(.+)
Meaning: (everything except : one or more times), :, (any character one or more times)
You'll find good manuals on the net... Maybe it's time for you to learn...

Scala Regex enable Multiline option

I'm learning Scala, so this is probably pretty noob-irific.
I want to have a multiline regular expression.
In Ruby it would be:
MY_REGEX = /com:Node/m
My Scala looks like:
val ScriptNode = new Regex("""<com:Node>""")
Here's my match function:
def matchNode( value : String ) : Boolean = value match
{
case ScriptNode() => System.out.println( "found" + value ); true
case _ => System.out.println("not found: " + value ) ; false
}
And I'm calling it like so:
matchNode( "<root>\n<com:Node>\n</root>" ) // doesn't work
matchNode( "<com:Node>" ) // works
I've tried:
val ScriptNode = new Regex("""<com:Node>?m""")
And I'd really like to avoid having to use java.util.regex.Pattern. Any tips greatly appreciated.

This is a very common problem when first using Scala Regex.
When you use pattern matching in Scala, it tries to match the whole string, as if the pattern were anchored with "^" and "$" (and without multi-line mode, which would make ^ and $ match at line breaks).
The way to do what you want would be one of the following:
def matchNode( value : String ) : Boolean =
(ScriptNode findFirstIn value) match {
case Some(v) => println( "found" + v ); true
case None => println("not found: " + value ) ; false
}
Which would find the first instance of ScriptNode inside value, and return that instance as v (if you want the whole string, just print value). Or else:
val ScriptNode = new Regex("""(?s).*<com:Node>.*""")
def matchNode( value : String ) : Boolean =
value match {
case ScriptNode() => println( "found" + value ); true
case _ => println("not found: " + value ) ; false
}
Which would print the whole of value. In this example, (?s) activates dotall matching (i.e., "." also matches newlines), and the .* before and after the searched-for pattern ensures it will match any string containing that pattern. If you wanted "v" as in the first example, you could do this:
val ScriptNode = new Regex("""(?s).*(<com:Node>).*""")
def matchNode( value : String ) : Boolean =
value match {
case ScriptNode(v) => println( "found" + v ); true
case _ => println("not found: " + value ) ; false
}
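The distinction between matching the whole string and searching inside it is not specific to Scala. As an analogy only (Python's re module, not scala.util.matching.Regex), the sketch below uses re.fullmatch where Scala uses a pattern-match case and re.search where Scala uses findFirstIn; the helper names are hypothetical.

```python
import re

text = "<root>\n<com:Node>\n</root>"


def whole_match(pattern, value):
    """True if the pattern matches the entire string (like a Scala `case` pattern)."""
    return re.fullmatch(pattern, value) is not None


def contains(pattern, value):
    """True if the pattern matches anywhere in the string (like findFirstIn)."""
    return re.search(pattern, value) is not None


print(whole_match(r"<com:Node>", text))          # whole string must match
print(contains(r"<com:Node>", text))             # substring search
print(whole_match(r".*<com:Node>.*", text))      # "." stops at newlines
print(whole_match(r"(?s).*<com:Node>.*", text))  # (?s): "." also matches newlines
print(contains(r"(?m)^<com:Node>$", text))       # (?m): ^ and $ match at line breaks
```

The last two lines show why (?s) and (?m) are not interchangeable: the first changes what "." matches, the second only changes where ^ and $ can match.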

Just a quick and dirty addendum: the .r method on RichString converts all strings to scala.util.matching.Regex, so you can do something like this:
"""(?s)a.*b""".r replaceAllIn ( "a\nb\nc\n", "A\nB" )
And that will return
A
B
c
I use this all the time for quick and dirty regex-scripting in the scala console.
Or in this case:
def matchNode( value : String ) : Boolean = {
val ScriptNode = """(?s).*(<com:Node>).*""".r
value match {
case ScriptNode(v) => System.out.println( "found" + v ); true
case _ => System.out.println("not found: " + value ) ; false
}
}
Just my attempt to reduce the use of the word new in code worldwide. ;)

Just a small addition: you tried to use the (?m) (multiline) flag (although it might not be suitable here), but here is the right way to use it:
e.g. instead of
val ScriptNode = new Regex("""<com:Node>?m""")
use
val ScriptNode = new Regex("""(?m)<com:Node>""")
But again the (?s) flag is more suitable in this question (adding this answer only because the title is "Scala Regex enable Multiline option")
