Regex capturing named groups in a language that doesn't support them using a meta regex? - regex

I am using Haskell and can't seem to find a regex package that supports named groups, so I have to implement them myself somehow.
Basically, a user of my API would supply a regex with named groups and get the captured groups back in a map,
so
/(?P<name>[a-z]*)/hhhh/(?P<surname>[a-z]*)/jjj on /foo/hhhh/bar/jjj
would give
[("name","foo"),("surname","bar")]
I am writing a trivial implementation of a specification, working with relatively small strings, so for now performance is not a main concern.
To solve this, I thought I'd write a meta regex that will apply on the user's regex
/(?P<name>[a-z]*)/hhhh/(?P<surname>[a-z]*)/jjj
to extract the names of groups and replace them with nothing to get
0 -> name
1 -> surname
and the regex becomes
/([a-z]*)/hhhh/([a-z]*)/jjj
then apply it to the string and use the index to group names with matched.
Two questions:
does it seem like a good idea?
what is the meta regex that I need to capture and replace the named groups syntax
for those unfamiliar with named groups http://www.regular-expressions.info/named.html
Note: all I need from named groups is that the user can give names to matches, so a subset of the named-group feature that provides only this is fine.

The more generally you want to apply your solution, the more complex the problem becomes. For instance, in your approach you want to remove the named groups and use the indices to match. This seems like a good start, but you have to consider a few things:
If you replace (?P<name>blah) with (blah), then you also have to replace any backreference by name, such as (?P=name), with \1 or \2 or whatever the group's index is.
What happens if the user includes non-named groups as well? For example: ([a-z]{3})/(?P<name>[a-z]*)/hhhh/(?P<surname>[a-z]*)/jjj on /foo/hhhh/bar/jjj. In this case your numbering will not work, because group 1 is the user-defined non-named group.
See this post for some inspiration, as it seems others have successfully tried the same (albeit in Java):
Regex Named Groups in Java
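The index bookkeeping described above can be sketched as follows. This is a minimal illustration in Python rather than Haskell, and it is not a complete regex parser: the names GROUP_OPEN, strip_named_groups and match_named are made up for this example, indices are 1-based (as regex group numbers usually are), and parentheses inside character classes such as [(] as well as named backreferences are not handled. The meta regex counts every capturing group, named or not, so that unnamed groups do not shift the numbering.

```python
import re

# Meta regex over the user's pattern. It recognises, in order:
#   an escaped character (left untouched, so "\(" is not counted as a group),
#   a named group opener "(?P<name>",
#   a plain capturing "(" (one that is not followed by "?").
GROUP_OPEN = re.compile(r"\\.|\(\?P<(\w+)>|\((?!\?)")

def strip_named_groups(pattern):
    """Return (pattern with names removed, {group index: name})."""
    names = {}
    index = 0

    def rewrite(m):
        nonlocal index
        if m.group(0).startswith("\\"):
            return m.group(0)          # escaped character: keep as-is
        index += 1                     # every capturing group gets a number
        if m.group(1):
            names[index] = m.group(1)  # remember the name for this number
        return "("

    return GROUP_OPEN.sub(rewrite, pattern), names

def match_named(pattern, text):
    """Match with the stripped pattern and pair each name with its capture."""
    plain, names = strip_named_groups(pattern)
    m = re.search(plain, text)
    if m is None:
        return None
    return [(name, m.group(i)) for i, name in sorted(names.items())]

print(match_named(r"/(?P<name>[a-z]*)/hhhh/(?P<surname>[a-z]*)/jjj",
                  "/foo/hhhh/bar/jjj"))
```

Because the stripped pattern contains only numbered groups, the same two-step idea should carry over to any regex package that supports plain capturing groups.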

Perhaps you should use parser combinators. This looks sufficiently complicated that it would be cleaner and more maintainable to step out and use Parsec or Attoparsec, instead of trying to push regexes further towards parsing.

Related

Regex, Grafana Loki, Promtail: Parsing a timestamp from logs using regex

I want to parse a timestamp from logs to be used by loki as the timestamp.
I'm a total noob when it comes to regex.
The log file is from "endlessh", which is essentially a tarpit/honeypot for SSH attackers.
It looks like this:
2022-04-03 14:37:25.101991388 2022-04-03T12:37:25.101Z CLOSE host=::ffff:218.92.0.192 port=21590 fd=4 time=20.015 bytes=26
2022-04-03 14:38:07.723962122 2022-04-03T12:38:07.723Z ACCEPT host=::ffff:218.92.0.192 port=64475 fd=4 n=1/4096
What I want to match, using regex, is the second timestamp present there, since it's a UTC timestamp and should be parseable by promtail.
I've tried different approaches, but just couldn't get it right at all.
So first of all I need a regex that matches the timestamp I want.
But secondly, I somehow need to turn it into a regex that exposes the value in some way?
The docs offer this example:
.*level=(?P<level>[a-zA-Z]+).*ts=(?P<timestamp>[T\d-:.Z]*).*component=(?P<component>[a-zA-Z]+)
Afaik, those are named groups, and that is all that it takes to expose the value for me to use it in the config?
Would be nice if someone can provide a solution for the regex, and an explanation of what it does :)
You could for example create a specific pattern to match the first part, and capture the second part:
^\d{4}-\d{2}-\d{2} \d\d:\d\d:\d\d\.\d+\s+(?P<timestamp>\d{4}-\d{2}-\d{2}T\d\d:\d\d:\d\d\.\d+Z)\b
Or, if the format is always the same, use a very broad pattern that repeats an exact number of non-whitespace parts and captures only the part that you want to keep.
^(?:\S+\s+){2}(?P<timestamp>\S+)
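As a quick sanity check, the specific pattern can be tried against the two sample log lines. The sketch below uses Python's re module purely for illustration; Promtail uses its own regex engine, so this only confirms the pattern's logic, not the Promtail configuration. The names PATTERN, LINES and extract_timestamp are invented for the example.

```python
import re

# The first, more specific pattern from the answer, with a named group.
PATTERN = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d\d:\d\d:\d\d\.\d+\s+"
    r"(?P<timestamp>\d{4}-\d{2}-\d{2}T\d\d:\d\d:\d\d\.\d+Z)\b"
)

LINES = [
    "2022-04-03 14:37:25.101991388 2022-04-03T12:37:25.101Z CLOSE "
    "host=::ffff:218.92.0.192 port=21590 fd=4 time=20.015 bytes=26",
    "2022-04-03 14:38:07.723962122 2022-04-03T12:38:07.723Z ACCEPT "
    "host=::ffff:218.92.0.192 port=64475 fd=4 n=1/4096",
]

def extract_timestamp(line):
    """Return the UTC timestamp captured by the named group, or None."""
    m = PATTERN.search(line)
    return m.group("timestamp") if m else None

for line in LINES:
    print(extract_timestamp(line))
```

The named group is what exposes the value: whatever the group called timestamp captures is the text available under that name after the match.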

Negating a regex query

I have looked at multiple posts about this, and am still having issues.
I am attempting to write a regex query that finds the names of S3 buckets that do not follow the naming scheme we want. The scheme we want is as follows:
test-bucket-logs**-us-east-1**
The bolded part is optional. Meaning, the following two are valid bucket names:
test-bucket-logs
test-bucket-logs-us-east-1
Now, what I want to do is negate this. So I want to catch all buckets that do not follow the scheme above. I have successfully formed a query that will match for the naming scheme, but am having issues forming one that negates it. The regex is below:
^(.*-bucket-logs)(-[a-z]{2}-[a-z]{4,}-\d)?$
So some more valid bucket names:
example-bucket-logs-ap-northeast-1
something-bucket-logs-eu-central-1
Invalid bucket names (we want to match these):
Iscrewedthepooch
test-bucket-logs-us-ee
bucket-logs-us-east-1
Thank you for the help.
As Mr. Barmar said, probably the best approach in these circumstances is to solve it programmatically. You could write the usual regex for matching the right pattern, and exclude the names it matches from the collection.
But you can try this:
^(?:.(?!-bucket-logs-[a-z]{2}-[a-z]{4,}-\d|-bucket-logs$))*$
which is a typical solution using a negative lookahead, (?!...), a zero-width assertion that captures nothing. Basically, it states that you want every line in which no character is followed by the valid pattern.
EDITED
As Ibrahim pointed out (thank you!), there was a small issue with my first regex. I fixed it and I think it is OK now. I had forgotten to make the last part of the inner regex optional (?).
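Both approaches can be sketched in Python, which is an assumption since the question does not name a language. The first function is the programmatic route: match the valid scheme with the question's own regex and keep whatever fails. The second pattern, NEGATED, is a variant written for this example (it is not the regex from the answer above): it wraps the whole valid scheme in a single negative lookahead anchored at the start of the line.

```python
import re

# The question's regex for valid bucket names.
VALID = re.compile(r"^(.*-bucket-logs)(-[a-z]{2}-[a-z]{4,}-\d)?$")

def invalid_buckets(names):
    """Programmatic negation: keep the names the valid regex rejects."""
    return [n for n in names if not VALID.match(n)]

# Single-regex variant: succeed only when the valid scheme does NOT
# match the whole line.
NEGATED = re.compile(r"^(?!.*-bucket-logs(?:-[a-z]{2}-[a-z]{4,}-\d)?$).*$")

BUCKETS = [
    "test-bucket-logs",
    "test-bucket-logs-us-east-1",
    "example-bucket-logs-ap-northeast-1",
    "Iscrewedthepooch",
    "test-bucket-logs-us-ee",
    "bucket-logs-us-east-1",
]

print(invalid_buckets(BUCKETS))
print([n for n in BUCKETS if NEGATED.match(n)])
```

On the sample names from the question, both lists should contain the same three invalid buckets.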

Regex capture words inside tags

Given an XML document, I'd like to be able to pick out individual key/value pairs from a particular tag:
<aaa>key0:val0 key1:val1 key2:val2</aaa>
I'd like to get back
key0:val0
key1:val1
key2:val2
So far I have
(?<=<aaa>).*(?=<\/aaa>)
Which will match everything inside, but as one result.
I also have
[^\s][\w]*:[\w]*[^\s] which will also match correctly in groups on this:
key0:val0 key1:val1 key2:val2
But not with the tags. I believe this is an issue with searching for subgroups and I'm not sure how to get around it.
Thanks!
You cannot combine the two expressions in the way you want, because you have to match each occurrence of "key:value".
So in what you came up with - (?<=<abc>)([\w]*:[\w]*[\s]*)+(?=<\/abc>) - there are two matching groups. The bigger one matches everything inside the tags, while the other matches a single "key:value" occurrence. The regex engine cannot give you each individual occurrence, because a repeated group only keeps its last match. So it just gives you the last one.
If you think in Python, on the match object obtained after applying your regex, you will have access to matcher.group(1) and matcher.group(2), because you have two matching ( ) groups in the regex.
But what you want is the n occurrences of "key:value". So it's easier to just run the simpler \w+:\w+ regex on the string inside the tags.
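A minimal sketch of that two-step approach in Python: first isolate the text inside the tag, then collect every key:value occurrence with findall. The function name extract_pairs and its tag parameter are invented for the example, and a real XML parser would be more robust than the first regex for anything beyond flat, well-formed input like this.

```python
import re

def extract_pairs(xml, tag="aaa"):
    """Return every key:value token found inside the first <tag>...</tag>."""
    tag = re.escape(tag)
    inner = re.search(rf"<{tag}>(.*?)</{tag}>", xml, re.DOTALL)
    if inner is None:
        return []
    # findall returns all non-overlapping matches, one per pair.
    return re.findall(r"\w+:\w+", inner.group(1))

print(extract_pairs("<aaa>key0:val0 key1:val1 key2:val2</aaa>"))
```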
I uploaded this one at parsemarket, and I'm not sure it's what you are looking for, but maybe something like this:
(<aaa>)((\w+:\w+\s)*(\w+:\w+)*)(<\/aaa>)
AFAIK, unless you know how many k:v pairs are in the tags, you can't capture all of them in one regex. So, if there are only three, you could do something like this:
<(?:aaa)>(\w+:\w+\s*)+(\w+:\w+\s*)+(\w+:\w+\s*)+<(?:\/aaa)>
But I would think you would want to do some sort of loop with whatever language you are using. Or, as some of the comments suggest, use the parser classes in the language. I've used BeautifulSoup in Python for HTML.

Find replace named groups regexp in Geany

I am trying to change public methods to protected methods for methods that have a comment.
This is because I am using phpunit to test some of those methods, but they really don't need to be public, so I'd like to switch them on the production server and switch back when testing.
Here is the method declaration:
public function extractFile($fileName){ //TODO: change to protected
This is the regexp:
(?<ws>^\s+)(?<pb>public)(?<fn>[^/\n]+)(?<cm>//TODO: change to protected)
If I replace it with:
\1protected\3\//TODO: change back to public for testing
It seems to be working, but what I cannot get to work is referring to the groups by name in the replacement. I have to use \1 to get the first group. Why name the groups if you can't access them in the replacement text? I tried things like <ws>, $ws, $ws, but that doesn't work.
What is the replacing text if I want to replace \1 with the <ws> named group?
The ?<ws> named group syntax is the same as that used by .NET/Perl. For those regex engines the replacement string reference for the named group is ${ws}. This means your replacement string would be:
${ws}protected${fn}\//TODO: change back to public for testing
The \k<ws> reference mentioned by m.buettner is only used for backreferences in the actual regex.
Extra Information:
It seems like Geany also allows use of Python style named groups:
?P<ws> is the capturing syntax
\g<ws> is the replacement string syntax
(?P=ws) is the regex backreference syntax
EDIT:
It looks like my hope for a solution didn't pan out. From the manual,
A subpattern can be named in one of three ways: (?<name>...) or (?'name'...) as in Perl, or (?P<name>...) as in Python. References to capturing parentheses from other parts of the pattern, such as backreferences, recursion, and conditions, can be made by name as well as by number.
And further down:
Back references to named subpatterns use the Perl syntax \k<name> or \k'name' or the Python syntax (?P=name).
and
A subpattern that is referenced by name may appear in the pattern before or after the reference.
So, my inference of the syntax for using named groups was correct. Unfortunately, they can only be used in the matching pattern. That answers your question "Why name groups...?".
How stupid is this? If you go to all the trouble to implement named groups and their usage in the matching pattern, why not also implement usage in the replacement string?
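For comparison, this is how the Python-style syntax listed above behaves in Python's own re module, where named references do work in the replacement string. It says nothing about Geany's engine; it is only a sketch of the \g<name> replacement syntax, with PATTERN, REPLACEMENT and the sample line chosen for illustration.

```python
import re

# The question's pattern, rewritten with Python-style (?P<name>...) groups.
PATTERN = re.compile(
    r"(?P<ws>^\s+)(?P<pb>public)(?P<fn>[^/\n]+)"
    r"(?P<cm>//TODO: change to protected)",
    re.MULTILINE,
)

# \g<name> refers to a named group inside a Python replacement template.
REPLACEMENT = r"\g<ws>protected\g<fn>//TODO: change back to public for testing"

line = "    public function extractFile($fileName){ //TODO: change to protected"
print(PATTERN.sub(REPLACEMENT, line))
```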

Regex to extract part of a url

I'm being lazy tonight and don't want to figure this one out. I need a regex to match 'jeremy.miller' and 'scottgu' from the following inputs:
http://codebetter.com/blogs/jeremy.miller/archive/2009/08/26/talking-about-storyteller-and-executable-requirements-on-elegant-code.aspx
http://weblogs.asp.net/scottgu/archive/2009/08/25/clean-web-config-files-vs-2010-and-net-4-0-series.aspx
Ideas?
Edit
Chris Lutz did a great job of meeting the requirements above. What if these were the inputs so you couldn't use 'archive' in the regex?
http://codebetter.com/blogs/jeremy.miller/
http://weblogs.asp.net/scottgu/
Would this be what you're looking for?
'/([^/]+)/archive/'
Captures the piece before "archive" in both cases. Depending on regex flavor you'll need to escape the /s for it to work. As an alternative, if you don't want to match the archive part, you could use a lookahead, but I don't like lookaheads, and it's easier to match a lot and just capture the parts you need (in my opinion), so if you prefer to use a lookahead to verify that the next part is archive, you can write one yourself.
EDIT: As you update your question, my idea of what you want is becoming fuzzier. If you want a new regex to match the second cases, you can just pluck the appropriate part off the end, with the same / conditions as before:
'/([^/]+)/$'
If you specifically want either the text jeremy.miller or scottgu, regardless of where they occur in a URL, but only as "words" in the URL (i.e. not scottgu2), try this, once again with the / caveat:
'/(jeremy\.miller|scottgu)/'
As yet a third alternative, if you want the field after the domain name, unless that field is "blogs", it's going to get hairy, especially with the / caveat:
'http://[^/]+/(?:blogs/)?([^/]+)/'
This will match the domain name, an optional blogs field, and then the desired field. The (?:) syntax is a non-capturing group, which means it's just like regular parentheses, but won't capture the value, so the only value captured is the value you want. (?:) has a risk of varying depending on your particular regex flavor. I don't know what language you're asking for, but I predominantly use Perl, so this regex should pretty much do it if you're using PCRE. If you're using something different, look into non-capturing groups.
Wow. That's a lot of talking about regexes. I need to shut up and post already.
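The last pattern can be checked against all four URLs from the question. This sketch uses Python, which is an assumption since the question never names a language; in Python the slashes need no escaping. USER_FIELD and extract_user are names invented for the example.

```python
import re

# Domain, an optional "blogs/" field (non-capturing), then the wanted field.
USER_FIELD = re.compile(r"http://[^/]+/(?:blogs/)?([^/]+)/")

URLS = [
    "http://codebetter.com/blogs/jeremy.miller/archive/2009/08/26/"
    "talking-about-storyteller-and-executable-requirements-on-elegant-code.aspx",
    "http://weblogs.asp.net/scottgu/archive/2009/08/25/"
    "clean-web-config-files-vs-2010-and-net-4-0-series.aspx",
    "http://codebetter.com/blogs/jeremy.miller/",
    "http://weblogs.asp.net/scottgu/",
]

def extract_user(url):
    """Return the first path field after the domain, skipping 'blogs'."""
    m = USER_FIELD.match(url)
    return m.group(1) if m else None

for url in URLS:
    print(extract_user(url))
```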
Try this one:
/\/([\w\.]+)\/archive/