How do I create a Scala Regex that is compiled using Java Pattern.COMMENTS? - regex

I want to create several rather complex regular expressions for my Scala code that take advantage of the Pattern.COMMENTS flag. I want to do something vaguely like this:
val regex = """my
(complex|hideous) # either is appropriate
pattern
(might)? # optional
look like this
""".r
(With the .r at the end of the string giving me all of Scala's Regex goodness)
Unfortunately, using .r doesn't give me any way to tell the Regex to use java.util.regex.Pattern.COMMENTS. Is there a way to create a scala.util.matching.Regex that compiles its source string with comments turned on?

According to the documentation, you should be able to use inline modifiers:
val regex = """(?x)my
(complex|hideous) # either is appropriate
pattern
(might)? # optional
look like this
""".r
See also the Java documentation for regex comments.
An inline modifier enables its option from the point where it is written onward. Placed at the start, it applies to the whole regular expression.
See also regular-expressions.info for a further explanation.
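One consequence of comments mode is easy to miss: whitespace inside the pattern is ignored, so the pattern above matches the words run together, not separated by spaces. Java's Pattern.COMMENTS and Python's (?x) behave the same way on this point, so here is a minimal sketch in Python (not Scala) that makes it easy to try. The names as_written and with_spaces are illustrative only.

```python
import re

# The pattern from the answer, unchanged: in verbose mode the line breaks and
# spaces inside the pattern are dropped, as are the "# ..." comments.
as_written = re.compile(r"""(?x)my
(complex|hideous) # either is appropriate
pattern
(might)? # optional
look like this
""")

# Variant that spells out the whitespace it expects in the input.
with_spaces = re.compile(r"""(?x)
my \s+
(complex|hideous) \s+   # either is appropriate
pattern \s+
(might \s+)?            # optional
look \s+ like \s+ this
""")

squeezed = "mycomplexpatternlooklikethis"
spaced = "my hideous pattern might look like this"

print(bool(as_written.fullmatch(squeezed)))  # -> True
print(bool(as_written.fullmatch(spaced)))    # -> False
print(bool(with_spaces.fullmatch(spaced)))   # -> True
```

If the input is expected to contain spaces, they have to be written explicitly in the pattern, for example as \s+ or an escaped space.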

Related

regular expression matching filename with multiple extensions

Is there a regular expression to match the some.prefix part of both of the following filenames?
xyz can be any character of [a-z0-9-_\ ]
some.prefix part can be any character in [a-zA-Z0-9-_\.\ ].
I intentionally included a . in some.prefix.
some.prefix.xyz.xyz
some.prefix.xyz
I have tried many combinations. For example:
(?P<prefix>[a-zA-Z0-9-_\.]+)(?:\.[a-z0-9]+\.gz|\.[a-z0-9]+)
It works with abc.def.csv by catching abc.def, but fails to catch it in abc.def.csv.gz.
I primarily use Python, but I thought the regex itself should apply to many languages.
Update: It's not possible, see discussion with @nowox below.
I think your regex works pretty well. I recommend trying it on regex101 with your example:
https://regex101.com/r/dV6cE8/3
The expression
(?i)^[ \w-]+\.[ \w-]+
should work in your case:
som e.prefix.xyz.xyz
^^^^^^^^^^^^
some.prefix.xyz
^^^^^^^^^^^
abc.def.csv.gz
^^^^^^^
And in Python you can use:
import re
text = """som e.prefix.xyz.xyz
some.prefix.xyz
abc.def.csv.gz"""
# The inline flag must come first: Python 3.11+ rejects '^(?i)...'
print(re.findall(r'(?i)^[ \w-]+\.[ \w-]+', text, re.MULTILINE))
Which will display:
['som e.prefix', 'some.prefix', 'abc.def']
I think you may be a bit confused about your requirements. To summarize, you have a pathname made of characters and dots, such as:
foo.bar.baz.0
foobar.tar.gz
f.o.o.b.a.r
How would you separate these strings into a basename and an extension? We recognize some known patterns here: .tar.gz is definitely an extension, but is the extension of the first name .bar.baz.0, or only .0?
The answer is not easy, and no regex in the world can guess the correct answer every time without some hints.
For example, you can list the acceptable extensions or set some criteria:
An extension matches the regex \.\w{1,4}$
Several extensions may be concatenated together: (\.\w{1,4}){1,4}$
The remainder is called the basename
From this you can build this regular expression:
(?P<basename>.*?)(?P<extension>(?:\.\w{1,4}){1,4})$
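To see how these criteria behave, here is a small Python sketch that applies the expression above to the sample names. The helper name split_name and the choice to return an empty extension when nothing matches are assumptions made for illustration.

```python
import re

# Same expression as above: a lazy basename, then one to four short extensions.
SPLIT = re.compile(r"(?P<basename>.*?)(?P<extension>(?:\.\w{1,4}){1,4})$")

def split_name(filename):
    """Return (basename, extension); extension is '' if the rule finds none."""
    m = SPLIT.match(filename)
    if m is None:
        return filename, ""
    return m.group("basename"), m.group("extension")

for name in ["foobar.tar.gz", "foo.bar.baz.0", "some.prefix.xyz.xyz"]:
    print(name, "->", split_name(name))
# foobar.tar.gz -> ('foobar', '.tar.gz')
# foo.bar.baz.0 -> ('foo', '.bar.baz.0')
# some.prefix.xyz.xyz -> ('some.prefix', '.xyz.xyz')
```

Note that some.prefix only survives as the basename because "prefix" is longer than four characters. With shorter parts, as in foo.bar.baz.0, the rule takes as many trailing parts as it is allowed, which is exactly the ambiguity described above.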
Try this: [a-z0-9-_\\]+\.[a-z0-9-_\\]+[a-zA-Z0-9-_\.\\]+

Regex Jersey Rest Service

I have the following regex in jersey, that works:
/artist_{artistUID: [1-9][0-9]*}
However, if I do
/{artistUID: [artist_][1-9][0-9]*}
it does not. I do not understand how the regexes are built and cannot find any good documentation for them. What I want to do is something like this:
/{artistUID: ([uartist_]|[artist_])[1-9][0-9]*}
to recognize terms like "artist_123" and "uartist_123" and store them in the artistUID value.
You can use an alternation group, (...|...), rather than a character class, [...], which matches a single character from the set defined inside it.
Use
/{artistUID: (uartist|artist)_[1-9][0-9]*}
Or to make it shorter, use a ? quantifier after u to make it optional:
/{artistUID: u?artist_[1-9][0-9]*}
See the regex demo
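The difference between the character class and the alternation can be checked outside Jersey. The following is a minimal Python sketch, assuming the path-parameter regex has to match the whole segment (re.fullmatch stands in for that here); the variable names are illustrative.

```python
import re

# [artist_] is a character class: exactly ONE of the characters a, r, t, i, s, _
char_class = re.compile(r"[artist_][1-9][0-9]*")
# The two suggested fixes from the answer.
alternation = re.compile(r"(uartist|artist)_[1-9][0-9]*")
optional_u = re.compile(r"u?artist_[1-9][0-9]*")

print(bool(char_class.fullmatch("artist_123")))    # -> False
print(bool(char_class.fullmatch("a123")))          # -> True
print(bool(alternation.fullmatch("uartist_123")))  # -> True
print(bool(optional_u.fullmatch("artist_123")))    # -> True
```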

Regular expression to extract part of a file path using the logstash grok filter

I am new to regular expressions but I think people here may give me valuable inputs. I am using the logstash grok filter in which I can supply only regular expressions.
I have a string like this
/app/webpf04/sns882A/snsdomain/logs/access.log
I want to use a regular expression to get the sns882A part from the string, which is the substring after the third "/", how can I do that?
I am restricted to regex as grok only accepts regex. Is it possible to use regex for this?
Yes, you can use a regular expression to get what you want via grok:
/[^/]+/[^/]+/(?<field1>[^/]+)/
Or, as a plain regex with a numbered group:
/\w*\/\w*\/(\w*)\/
You can also test with:
http://www.regextester.com/
Searching for "regex tester" will turn up other tools with different interfaces.
If you are indeed using Perl, then you should use the File::Spec module, like this:
use strict;
use warnings;
use File::Spec;
my $path = '/app/webpf04/sns882A/snsdomain/logs/access.log';
my @path = File::Spec->splitdir($path);
print $path[3], "\n";
output
sns882A
This is how I would do it in Perl:
my ($name) = ($fullname =~ m{^(?:/.*?){2}/(.*?)/});
EDIT:
If your framework does not support Perl-ish non-grouping groups (?:xyz), this regex should work instead:
^/.*?/.*?/(.*?)/
If you are concerned about performance of .*?, this works as well:
^/[^/]+/[^/]+/([^/]+)/
One more note: all of the regexes above match the string /app/webpf04/sns882A/.
But the matched string is different from the first capturing group, which is sns882A in all three cases.
Same answer, with a small bug fix. If you don't put ^ at the start, the regex can also match again later in the string (try longer paths with more / in the input). To fix this, add ^ at the start, as below. ^ anchors the match to the start of the input line, and group 1 is your answer.
^/[^/]+/[^/]+/([^/]+)/
If your input may be a URI as well as a plain path, use the version below (it handles both).
^.*?/[^/]+/[^/]+/([^/]+)/
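The effect of the leading ^ can be illustrated with a short Python sketch. It uses the unnamed-group form of the pattern, since grok's (?<field1>...) syntax is not what Python's re module accepts; the sample path "/a/b/c/d/e/f/g/h" is made up for the demonstration.

```python
import re

path = "/app/webpf04/sns882A/snsdomain/logs/access.log"

anchored = re.compile(r"^/[^/]+/[^/]+/([^/]+)/")
unanchored = re.compile(r"/[^/]+/[^/]+/([^/]+)/")

# Group 1 is the third path component.
print(anchored.search(path).group(1))  # -> sns882A

# On a longer path, the unanchored form finds a second match further along.
longer = "/a/b/c/d/e/f/g/h"
print(unanchored.findall(longer))  # -> ['c', 'g']
print(anchored.findall(longer))    # -> ['c']
```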

Regular Expression to find multiple instances of %%{ANYTHING}%%

SomeRandomText=%EXAMPLE1%,MoreRandomText=%%ONE%%!!%%TWO%%,YetMoreRandomText=%%THREE%%%FOUR%!!%FIVE%\%%SIX%%
I need a regular expression that pulls out anything wrapped in '%%', so that it matches only the following:
%%ONE%%
%%TWO%%
%%THREE%%
%%SIX%%
I've tried lots of different methods and am sure there is a way to achieve this, but I'm struggling so far. I mostly end up with a pattern that matches everything from the first %% to the last %% in the string, which is not what I want. I think I need something like forward lookups, but I'm struggling to implement them.
You need a non-greedy match, using the ? modifier:
%%.*?%%
See it working online: rubular
This can also be done by restricting what is allowed between the %s.
%%[^%]*%%
This is more widely supported than non-greedy matching. Note, however, that it won't match %%A%B%%. If necessary, that can be handled with some modifications:
%%([^%]|%[^%])*%%
Or equivalently
%%(%?[^%])*%%
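As a quick check, here is a Python sketch that runs the patterns from these answers against the sample string from the question. The groups are written as non-capturing, (?:...), only because re.findall would otherwise return the group contents instead of the whole match; that change is an adaptation for Python, not part of the original answers.

```python
import re

text = (r"SomeRandomText=%EXAMPLE1%,MoreRandomText=%%ONE%%!!%%TWO%%,"
        r"YetMoreRandomText=%%THREE%%%FOUR%!!%FIVE%\%%SIX%%")

lazy = r"%%.*?%%"                           # non-greedy
negated = r"%%[^%]*%%"                      # no % allowed inside
single_ok_a = r"%%(?:[^%]|%[^%])*%%"        # a lone % is allowed inside
single_ok_b = r"%%(?:%?[^%])*%%"            # equivalent, shorter

for pattern in (lazy, negated, single_ok_a, single_ok_b):
    print(re.findall(pattern, text))
# Each line prints ['%%ONE%%', '%%TWO%%', '%%THREE%%', '%%SIX%%']

# The patterns differ only when a single % appears between the markers.
print(re.findall(negated, "%%A%B%%"))      # -> []
print(re.findall(single_ok_b, "%%A%B%%"))  # -> ['%%A%B%%']
```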

Notepad++ replace with reg expression?

I have a big list with links and other data in it. I want to filter out all the other data and have a list with just the links.
Example of the current list:
32,2012-01-04 06:44:44,http://link.com/link
33,2012-01-04 06:44:45,http://link.com/link,{Text|textext|text},http://link.com/link|http://link.com/link|http://link.com/link
Notepad++ offers find-and-replace functionality using regular expressions. You can access this feature with Ctrl+H.
If you're actually asking for a regular expression to do this, you can use something like this to match URLs:
\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))
which I found here.
Additionally you can test out changes to your regex easily at http://gskinner.com/RegExr/
Using the input you provided, here's a pattern you can use on http://www.regexr.com/.
You'll need to make sure the global (/g) flag is on.
Expression:
.*?(http.*?)[,|\n]
Input:
32,2012-01-04 06:44:44,http://link.com/link1
33,2012-01-04 06:44:45,http://link.com/link2,{Text|textext|text},http://link.com/link3|http://link.com/link4|http://link.com/link5
Substitution:
$1\n
Output:
http://link.com/link1
http://link.com/link2
http://link.com/link3
http://link.com/link4
http://link.com/link5
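The same expression and substitution can be tried outside a regex tester. This is a minimal Python sketch rather than a Notepad++ recipe: Python writes the backreference as \1 instead of $1, re.sub replaces every match without a global flag, and the input is assumed to end with a newline.

```python
import re

text = (
    "32,2012-01-04 06:44:44,http://link.com/link1\n"
    "33,2012-01-04 06:44:45,http://link.com/link2,{Text|textext|text},"
    "http://link.com/link3|http://link.com/link4|http://link.com/link5\n"
)

# Inside [...] the | is a literal pipe, so a link ends at a comma,
# a pipe, or a newline.
links_only = re.sub(r".*?(http.*?)[,|\n]", r"\1\n", text)
print(links_only, end="")
```

Each match swallows any non-link text in front of the URL, which is why only the links remain after the substitution.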