I have trouble matching multiple groups, some of which are optional. I've tried variations of greedy/non-greedy, but can't get it to work.
As input, I have cells which look like this:
SEPA Overboeking IBAN: AB1234 BIC: LALA678 Naam: John Smith Omschrijving: Hello hello Kenmerk: 03-05-2019 23:12 533238
I want to split these up into groups of IBAN, BIC, Naam, Omschrijving, Kenmerk.
For this example, this yields: AB1234; LALA678; John Smith; Hello hello; 03-05-2019 23:12 533238.
To obtain this, I've used:
.*IBAN: (.*)\s+BIC: (.*)\s+Naam: (.*)\s+Omschrijving: (.*)\s+Kenmerk: (.*)
This works perfectly as long as all these groups are present in the input. Some cells, however, don't have the "Omschrijving" and/or "Kenmerk" part. As output, I would like empty groups when they're not present. Right now, nothing is matched at all.
Help would be greatly appreciated!
N.B.: I'm working in KNIME (an open-source data analysis tool).
I was able to split your input using the following regular expression:
^.*
\s+IBAN\:\s*(?<IBAN>.*?)
\s+BIC\:\s*(?<BIC>.*?)
\s+Naam\:\s*(?<Naam>.*?)
(?:\s+Omschrijving\:\s*(?<Omschrijving>.*?))?
(?:\s+Kenmerk\:\s*(?<Kenmerk>.*?))?
$
This requires your fields to follow the given order and treats IBAN, BIC and Naam as required; Omschrijving and Kenmerk are optional. I am pretty sure this can still be optimized, but it produces the expected output, which should be fine for you (or at least a starting point).
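If you want to sanity-check the pattern outside KNIME, here is a quick sketch in plain Python; only the named-group syntax differs (Python uses (?P<name>...) where KNIME's Java flavour uses (?<name>...)), and the sample lines are taken from the question:

import re

# The same pattern, written in Python's verbose mode.
pattern = re.compile(r"""
    ^.*
    \s+IBAN:\s*(?P<IBAN>.*?)
    \s+BIC:\s*(?P<BIC>.*?)
    \s+Naam:\s*(?P<Naam>.*?)
    (?:\s+Omschrijving:\s*(?P<Omschrijving>.*?))?
    (?:\s+Kenmerk:\s*(?P<Kenmerk>.*?))?
    $
""", re.VERBOSE)

full = ("SEPA Overboeking IBAN: AB1234 BIC: LALA678 Naam: John Smith "
        "Omschrijving: Hello hello Kenmerk: 03-05-2019 23:12 533238")
partial = "SEPA Overboeking IBAN: AB1234 BIC: LALA678 Naam: John Smith"

print(pattern.match(full).groupdict())
# {'IBAN': 'AB1234', 'BIC': 'LALA678', 'Naam': 'John Smith',
#  'Omschrijving': 'Hello hello', 'Kenmerk': '03-05-2019 23:12 533238'}
print(pattern.match(partial).groupdict())
# The optional fields come back as None ("empty") when they are absent.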
For evaluation and testing in KNIME, I used Palladian's Regex Extractor node, which provides a nice preview functionality.
I added an example workflow to my NodePit Space. It contains some example lines, parses them, and produces the output described above.
I am using den4b ReNamer to rename a lot of files that follow a specific pattern. The program allows me to use regular expressions (https://www.den4b.com/wiki/ReNamer:Regular_Expressions).
I am stuck trying to conjure up an expression for a specific pattern.
My current RegEx:
Expression: ^(com\.)(([\w\s]*\.){0,4})([\w\s]*)$
Replace: \L$1\L$2\u$4
Note: \L lower-cases the sub-expression that follows it and \u upper-cases the next character, as defined in the table in the documentation linked above.
Here are a few example strings so you can get an idea of the input:
Android File Transfer.svg
Angular Console.svg
Au.Edu.Uq.Esys.Escript.svg
Avidemux.svg
Blackmagic Fusion8.svg
Broken Sword.svg
Browser360 Beta.svg
Btsync GUI.svg
Buttercup Desktop.svg
Calc.svg
Calibre EBook Edit.svg
Calibre Viewer.svg
Call Of Duty.svg
com.GitHub.Plugarut.Pwned Checker.svg
com.GitHub.Plugarut.Wingpanel Monitor.svg
com.GitHub.Rickybas.Date Countdown.svg
com.GitHub.Spheras.Desktopfolder.svg
com.GitHub.Themix Project.Oomox.svg
com.GitHub.Unrud.Remote Touchpad.svg
com.GitHub.Unrud.Video Downloader.svg
com.GitHub.Weclaw1.Image Roll.svg
com.GitHub.Zelikos.Rannum.svg
com.Gitlab.Miridyan.Mt.svg
com.Inventwithpython.Flippy.svg
com.Neatdecisions.Detwinner.svg
com.Rafaelmardojai.Share Preview.svg
com.Rafaelmardojai.Webfont Kit Generator.svg
Distributor Logo Antix.svg
Distributor Logo Archlabs.svg
Distributor Logo Dragonflybsd.svg
DOSBox.svg
Drawio.svg
Drweb GUI.svg
For this question I am focused on the strings that begin with com.xxx.xxx.
Since I can't only target those names in Renamer, the expression has to "play nice" with the other input file names and correctly leave them alone. That's why I've prefixed my expression with ^(com\.)
What I want:
Transform the entire string to lower case except for the last period separated part of the string.
Strip white space from the entire string.
For instance:
Original: com.GitHub.Alcadica.Develop.svg
After my Regex: com.github.alcadica.Develop.svg
What I want: com.github.alcadica.Develop.svg
This specific file is correctly renamed. What I'm having trouble with are names that have spaces in any part of the string. I can't figure out how to strip whitespace:
Original: com.Belmoussaoui.Read it Later.svg
After my Regex: com.belmoussaoui.Read it Later.svg
What I want: com.belmoussaoui.ReaditLater.svg
Here is a hypothetical example because I couldn't find a file with more than four parts. I want my pattern to be robust enough to handle this:
Original: com.Shatteredpixel.Another Level.Next.Pixel Dungeon.svg
After my Regex: com.shatteredpixel.another level.next.Pixel Dungeon.svg
What I want: com.shatteredpixel.anotherlevel.next.PixelDungeon.svg
Note that since I'm not using any kind of programming language, I don't have access to common string operations like trim, etc. I can, however, stack expressions. But this would create more overhead and since I am renaming thousands of files at a time I'd ideally like to keep it to one find/replace expression.
Any help would be greatly appreciated. Please let me know if I can provide any more information to make this more clear.
Edit:
I got it to work by stacking several find-and-replace rules. Really inefficient, but it works. (Thanks to Jeremy in the comments for the idea.)
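For reference, here is the intended rule pinned down in plain Python. ReNamer cannot run this (it has no scripting), so it is only a specification to test candidate expressions against; the function name is mine:

def target_name(name: str) -> str:
    """Lower-case every dot-separated part except the last one before
    .svg, and strip spaces; leave files not starting with com. alone."""
    if not name.startswith("com.") or not name.endswith(".svg"):
        return name
    parts = [p.replace(" ", "") for p in name[:-4].split(".")]
    parts[:-1] = [p.lower() for p in parts[:-1]]
    return ".".join(parts) + ".svg"

assert target_name("com.GitHub.Alcadica.Develop.svg") == "com.github.alcadica.Develop.svg"
assert target_name("com.Belmoussaoui.Read it Later.svg") == "com.belmoussaoui.ReaditLater.svg"
assert target_name("com.Shatteredpixel.Another Level.Next.Pixel Dungeon.svg") \
       == "com.shatteredpixel.anotherlevel.next.PixelDungeon.svg"
assert target_name("Android File Transfer.svg") == "Android File Transfer.svg"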
I am using Fluentd, Elasticsearch and Kibana to organize logs. Unfortunately, these logs are not written using any standard format like Apache's, so I had to come up with the regex for the format myself. I used this site to verify that it works: http://fluentular.herokuapp.com/.
The logs have roughly this format here:
DEBUG: 24.04.2014 16:00:00 [SingleActivityStrategy] Start Activitiy 'barbecue' zu verabeiten.
the format regex I am using is as follows:
format /(?<pri>([INFO]|[DEBUG]|[ERROR])+)...(?<date>(\d{2}\.\d{2}\.\d{4})).(?<time>(\d{2}:\d{2}:\d{2})).\[(?<subject>(.*))\].(?<msg>(.*))/
Now, judging by that website, which is supposed to test Fluentd's regex behaviour specifically, the output SHOULD be this:
Record
Key Value
pri DEBUG
date 24.04.2014
subject SingleActivityStrategy
msg Start Activitiy 'barbecue' zu verabeiten.
Instead, though, I get this (bug?) where pri is always shortened to DEBU. The same happens for ERROR, which becomes ERRO; only INFO stays INFO. I am not very experienced with regular expressions and find it hard to believe that this is a bug, but it confuses me, and any help is greatly appreciated.
I'm not sure I can post the complete config file, because I don't personally own these log files and I am trying to keep this at a level where my boss won't get mad at me for posting sensitive information. Should it definitely be needed, I will post it later, after asking him how much I can reveal.
In general, the logs always look roughly like this:
First the priority, which is either DEBUG, ERROR or INFO; next the date; next what we call the subject, which is always written in [ ]; and finally just a message.
Here is a link to Fluentular with the format I am using and a test string that produces the right result in Fluentular, but not in my config file:
Fluentular
Sorry, I couldn't make it work as a regular clickable link.
Another link to test out regex with my format and test string is this one:
http://rubular.com/r/dfXOkQYNXP
tl;dr version:
My td-agent format regex cuts off the last letter, although Fluentular says it shouldn't. My fault or a bug?
How the regex would look if you're trying to match the data specifically:
(INFO|DEBUG|ERROR)\:\s+(\d{2}\.\d{2}\.\d{4})\s(\d{2}:\d{2}:\d{2})\s\[(.*)\](.*)
In your format string, you were using . and ... where your spaces and colon should be. I'm not too sure why this works in Fluentular, but you should match the \: explicitly, along with each space between the values.
So you'd be looking at the following regular expression with the Fluentd fields (which are the group names):
(?<pri>(INFO|ERROR|DEBUG))\:\s+(?<date>(\d{2}\.\d{2}\.\d{4}))\s(?<time>(\d{2}:\d{2}:\d{2}))\s\[(?<subject>(.*))\]\s(?<msg>(.*))
Meaning your td-agent.conf should look like:
<source>
type tail
path /var/log/foo/bar.log
pos_file /var/log/td-agent/foo-bar.log.pos
tag foo.bar
format /(?<pri>(INFO|ERROR|DEBUG))\:\s+(?<date>(\d{2}\.\d{2}\.\d{4}))\s(?<time>(\d{2}:\d{2}:\d{2}))\s\[(?<subject>(.*))\]\s(?<msg>(.*))/
</source>
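The truncation itself is ordinary regex backtracking, not a Fluentd bug, and you can reproduce it outside td-agent. A quick check in Python, which backtracks the same way here (Python writes named groups as (?P<...>) instead of (?<...>)):

import re

LINE = ("DEBUG: 24.04.2014 16:00:00 [SingleActivityStrategy] "
        "Start Activitiy 'barbecue' zu verabeiten.")

# Original pattern: [DEBUG] is a character class, and the '...' after the
# priority must consume three characters. Only ': ' (two characters) follows
# 'DEBUG', so the engine backtracks and the third dot steals the 'G'.
old = re.compile(r"(?P<pri>([INFO]|[DEBUG]|[ERROR])+)..."
                 r"(?P<date>(\d{2}\.\d{2}\.\d{4}))")
print(old.match(LINE).group("pri"))   # DEBU

# Fixed pattern (the prefix of the full one above): spell out the colon
# and the whitespace explicitly.
new = re.compile(r"(?P<pri>INFO|ERROR|DEBUG):\s+(?P<date>\d{2}\.\d{2}\.\d{4})")
print(new.match(LINE).group("pri"))   # DEBUG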
I would also take a look at Logstash vs. Fluentd. I like Logstash far more, because you create Grok filters to match the kind of data you want, which makes formatting your fields much easier since you are working through an abstraction layer; but you will essentially get the same data.
And I would watch out when using sites like Rubular, as they are fairly particular about multi-line matching and the like. I'd suggest something like RegExr, which gives immediate feedback and also lets you set global and multiline matching.
First, I'm using EditPadPro for my regex cleaning, so any answers given should work within that environment.
I get a large spreadsheet full of data that I have to clean every day. I've managed to get it down to a couple of different regexes that I run, and this works... but I'm curious to see if it's possible to reduce down to a single regex.
Here is some sample data:
3-CPC_114851_70095_70095_CAN-bre
3-CPC_114851_70095_70095_CAN
b11-ao1-113775-bre
b7-ao-114441
b7-ao-114441-bre
b7-ao1-114441
b7-ao1-114441-bre
http://go.nlvid.com/results1/?http://bo
go.nlv/results1/?click
b4-sm-1359
b6-sm-1356-bre
1359_195_1453814569-bre
1356_104_1456856729
b15-rad-8905
b15-rad-8905-bre
Here is how the above data needs to end up:
114851-bre
114851
113775-bre
114441
114441-bre
114441
114441-bre
http://go.nlvid.com/results1/
go.nlv/results1/
sm-1359
sm-1356-bre
sm-1359-bre
sm-1356
rad-8905
rad-8905-bre
So, there are numerous rules, such as:
In cases of more than 2 underscores, the result needs to contain only the value immediately after the first underscore, and everything from the dash onwards.
In cases where the string contains "-ao-", "-ao1-", everything prior to the final numeric string should be removed.
If a question mark is present, everything from the mark onwards should be removed.
If the string contains "-sm-" or "-rad-", everything prior to those alpha strings should be removed.
If the string contains 2 underscores, everything after the first numeric string up to a dash (if present) should be removed, and the string "sm-" should be prepended.
Additionally there is other data that must be left untouched, including but not limited to:
113535|24905|24905
as well as many variations on this pattern of xxxxxx|yyyyy|zzzzz (and not always those string lengths)
This may be asking way too much of regex, I'm not sure as I'm not great with it. But I've seen some pretty impressive things done with it, so I thought I'd put this out to the community and see what you come back with.
Jonathan, I can wrap all of those into one regex except the last one (where you prepend sm- to a string that does not contain sm). That one is not possible in this context, because we cannot capture "sm" to reuse in the replacement, and because there is no "conditional replacement" syntax in EPP.
That being said, you can achieve what you want in EPP with two regexes and one macro to chain the two.
Here is how.
The solution below is tested in EPP.
Regex 1
Press Ctrl + Shift + F to enter Search / Replace mode
Enter the following Search and Replace in the appropriate boxes
At the top right of the Search bar, click the Favorite Searches pull-down, select "Add", give it a name, e.g. Regex 1
Search:
(?mx)^
(?=(?:[^_\r\n]*?_){3})[^_\r\n]+?_([^_\r\n]+)[^-\r\n]+(-[^\r\n]+)?
|
[^\r\n]*?-ao1?-\D*([^\r\n]+)
|
([^\r\n?]*)(?=\?)[^\r\n]+
|
[^\r\n]*?-((?:sm|rad)-[^\r\n]+)
Replace:
\1\2\3\4\5
Regex 2
Same 1-2-3 steps as above.
Search:
^(?!(?:[^_\r\n]*?_){3})(?=(?:[^_\r\n]*?_){2})(\d+)(?:[^-\r\n]+(-[^\r\n]+)?)
Replace:
sm-\1\2
Chaining Regex 1 and Regex 2
Top menu: Macros, Record Macro, give it a name.
Click the Favorite searches pulldown, select Regex 1
Hit Replace All.
Click the Favorite searches pulldown, select Regex 2
Hit Replace All.
Macros, Stop recording.
Whenever you want to run your sequence of replacements, invoke it by name from the Macros menu.
Testing This
I have tested my "Jonathan macro" on your input. Here is the result:
114851-bre
114851
113775-bre
114441
114441-bre
114441
114441-bre
http://go.nlvid.com/results1/
go.nlv/results1/
sm-1359
sm-1356-bre
sm-1359-bre
sm-1356
rad-8905
rad-8905-bre
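Outside EPP, you can cross-check the two patterns and their chaining, for example in Python (from Python 3.5 on, groups that did not participate in the match are replaced with empty strings in the substitution, matching what EPP does with \1\2\3\4\5):

import re

# Regex 1, transcribed verbatim; the free-spacing layout carries over via (?x).
REGEX1 = re.compile(r"""(?mx)^
(?=(?:[^_\r\n]*?_){3})[^_\r\n]+?_([^_\r\n]+)[^-\r\n]+(-[^\r\n]+)?
|
[^\r\n]*?-ao1?-\D*([^\r\n]+)
|
([^\r\n?]*)(?=\?)[^\r\n]+
|
[^\r\n]*?-((?:sm|rad)-[^\r\n]+)
""")

# Regex 2; re.M is needed here because EPP treats ^ as a line anchor by default.
REGEX2 = re.compile(
    r"^(?!(?:[^_\r\n]*?_){3})(?=(?:[^_\r\n]*?_){2})(\d+)(?:[^-\r\n]+(-[^\r\n]+)?)",
    re.M)

def clean(text: str) -> str:
    # Same order as the macro: Regex 1 first, then Regex 2.
    text = REGEX1.sub(r"\1\2\3\4\5", text)
    return REGEX2.sub(r"sm-\1\2", text)

print(clean("3-CPC_114851_70095_70095_CAN-bre"))  # 114851-bre
print(clean("1359_195_1453814569-bre"))           # sm-1359-bre
print(clean("113535|24905|24905"))                # left untouched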
Try this:
Toggle the Search Panel: SHIFT+CTRL+F
SEARCH: .*?((?:sm-|rad-)?(?:(?:\d+|[\w\.]+\/.*?))(?:-\w+)?$)
REPLACE: $1
Check REGEX and WORDS
Click Replace All or hit CTRL+ALT+F3
I'm having some trouble with my regex. I have lines like this:
SomeText#"C:\\","Shadow Copy Components:\\","E:\\",""
SomeText#"D:\\"
SomeText#"E:\\","Shadow Copy Components:\\"
SomeText#"SET SNAP_ID=serv.a.x.com_1380312019","BACKUP H:\\ USING \\\\?\\GLOBALROOT\\Device\\HarddiskVolumeShadowCopy47\\ OPTIONS:ALT_PATH_PREFIX=c:\\VERITAS\\NetBackup\\temp\\_vrts_frzn_img_3200\"
SomeText#"SET SNAP_ID=serv.a.x.com_1380312019","BACKUP Y:\\Libs USING \\\\?\\GLOBALROOT\\Device\\HarddiskVolumeShadowCopy47\\ OPTIONS:ALT_PATH_PREFIX=c:\\VERITAS\\NetBackup\\temp\\_vrts_frzn_img_3200\"
What I would like is a group named jobFileList containing, for each line:
"C:\\","Shadow Copy Components:\\","E:\\",""
"D:\\"
"E:\\","Shadow Copy Components:\\"
H:\\
Y:\\Libs
You can see I only want the file list; sometimes it's just the full text after the # mark, and sometimes there is a lot of extra text that I need to remove.
The fact is I can't use a script in this case, so I need to do this with only ONE regexp; I can't just do a string replace on the leftovers after the regex.
What I did is:
SomeText(#.*BACKUP (?P<jobFileList>.*?) .*)?(#(?P<jobFileList>.*))?
But it seems I can't use the same group name twice :( If I replace the second jobFileList with another name, it works perfectly, but that's not what I need.
Thanks for your help.
EDIT:
I can also have some lines like:
SomeText#/ahol5d72_1_2
SomeText#/p7ol4a1p_1_2
SomeText#Gvadag04SANDsk_Daily
SomeText#/bck_reco_a9ol5765_1_2_827497669
In all these cases I need all the text after the # mark.
A version which doesn't rely on the double quotes after the double backslash:
SomeText#(?:(.*?BACKUP) )?(?P<jobFileList>(?(1)[^ ]*|.*$))
This: (?(1)[^ ]*|.*$) is a conditional group, which is supported in Python 2.7.5 (it probably works in later versions too, but I don't know about earlier ones). If there's BACKUP, it grabs all the non-spaces; if there's no BACKUP, it grabs everything up to the end of the string.
regex101 demo
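A minimal demonstration of the conditional group in Python, using lines from the question:

import re

pat = re.compile(r"SomeText#(?:(.*?BACKUP) )?(?P<jobFileList>(?(1)[^ ]*|.*$))")

samples = [
    r'SomeText#"D:\\"',
    r'SomeText#"SET SNAP_ID=serv.a.x.com_1380312019","BACKUP H:\\ USING '
    r'\\\\?\\GLOBALROOT\\Device\\HarddiskVolumeShadowCopy47\\ OPTIONS:'
    r'ALT_PATH_PREFIX=c:\\VERITAS\\NetBackup\\temp\\_vrts_frzn_img_3200\"',
    r'SomeText#/ahol5d72_1_2',
]
for s in samples:
    print(pat.search(s).group("jobFileList"))
# "D:\\"
# H:\\
# /ahol5d72_1_2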
EDIT: As per the comment, the regex that worked after timmalos' modifications:
\#(?P<G>.*?[^E]BACKUP\s)?(?P<G2>f:\\\\Mailbox\\\)?(?P<jobFileList>(?(G)(?(G2)[^\]|\S)*|.*))
This is possible to match with a single regular expression; however, I know nothing of Splunk. Maybe this will help:
("?[A-Z]:\\\\(?:".+|\S+)?)
Live demonstration here
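For what it's worth, the same pattern checked in Python against lines from the question:

import re

pat = re.compile(r'("?[A-Z]:\\\\(?:".+|\S+)?)')

for line in [
    r'SomeText#"C:\\","Shadow Copy Components:\\","E:\\",""',
    r'SomeText#"D:\\"',
    r'SomeText#"SET SNAP_ID=serv.a.x.com_1380312019","BACKUP Y:\\Libs USING '
    r'\\\\?\\GLOBALROOT\\Device\\HarddiskVolumeShadowCopy47\\ OPTIONS:'
    r'ALT_PATH_PREFIX=c:\\VERITAS\\NetBackup\\temp\\_vrts_frzn_img_3200\"',
]:
    print(pat.search(line).group(1))
# "C:\\","Shadow Copy Components:\\","E:\\",""
# "D:\\"
# Y:\\Libs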
Is there an easy way to generate a human-readable inflection list from Hunspell/Aspell dictionary data files?
For example, I'd like to generate the following outputs (for different languages):
...
book, books
book, books, booked, booking
...
go, goes, went, gone, going
...
I looked at the Hunspell/Aspell docs, but couldn't find an API call that would do this.
There is a way the command-line tool does this, but it doesn't output in quite the format you're looking for. You could also do it manually, if you wanted, with some simple regex scripting.
The format for each set of affixes is:
TYPE TAG REMOVE REPLACE MATCH
Where TAG matches one of the flags after the / in a given word in the .dic file, you can do the following (presuming you've already stripped the /... from the word):
if ($word =~ /$match$/) { $word =~ s/$remove$/$replace/; }
Notice the $ there, matching the end of the word. Use ^ at the start instead if it's a prefix.
There are three caveats:
The $match taken directly from the .aff file is in almost all cases equivalent to standard regex. There are minor variations: if the match is something like [abc-gh], you'd do better to change it to (a|b|c|-|g|h) or [abcgh-] (Hunspell doesn't use the hyphen as a metacharacter), otherwise it will be interpreted as [abcdefgh] by a standard regex engine. For a negated character class, your options are to manually move the - to the end of the expression (e.g. [^a-df] to [^adf-]) or to use negative lookbehinds.
If $replace is 0, then you should change it to an empty string.
If your result ends with /..., you need to process it again, because it has a double affix.
Be careful: by my rough calculations, the dictionary I'm working on could produce more than 50 million word forms (and I wouldn't be surprised if it goes beyond 100 million).
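Putting the pieces together, a minimal Python sketch of the manual approach. It assumes single-character flags, handles SFX/PFX rule lines only, ignores the character-class quirks from caveat 1 and the two-fold affixes from caveat 3, and the dictionary file names are placeholders:

import re

def parse_aff(path):
    """Collect affix rules as (TYPE, TAG, REMOVE, REPLACE, MATCH) tuples."""
    rules = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            # Rule lines have five fields; the four-field SFX/PFX header
            # lines (with the cross-product flag) are skipped.
            if len(parts) >= 5 and parts[0] in ("SFX", "PFX"):
                rules.append(tuple(parts[:5]))
    return rules

def inflect(word, flags, rules):
    """Yield the word plus every form produced by an affix rule whose tag
    is among the word's flags and whose MATCH condition applies."""
    yield word
    for rtype, tag, remove, replace, match in rules:
        if tag not in flags:
            continue
        if remove == "0":              # '0' means "remove nothing"
            remove = ""
        if replace == "0":             # caveat 2: '0' means empty string
            replace = ""
        if rtype == "SFX" and re.search(match + "$", word):
            yield re.sub(remove + "$", replace, word, count=1)
        elif rtype == "PFX" and re.search("^" + match, word):
            yield re.sub("^" + remove, replace, word, count=1)

rules = parse_aff("en_US.aff")         # placeholder file names
with open("en_US.dic", encoding="utf-8") as f:
    next(f)                            # the first line is the entry count
    for entry in f:
        word, _, flags = entry.strip().partition("/")
        forms = dict.fromkeys(inflect(word, flags, rules))  # dedupe, keep order
        print(", ".join(forms))        # e.g. "book, books"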