Regex for multiline .ini file - regex

I am trying to write a regular expression that can parse an INI file.
However, the INI values may span multiple lines.
For example:
Wert1=Hallo
dsadasd
Wert2=Hi
Wert3=Heinirch Volland
I tried this regex, but it doesn't work:
/.*=(.*)^.*=/gsm

You could use this PCRE regex:
/^.*=.*[^=]*$/gm
Try it here.
This relies on the single-line flag (s) being absent, so be careful not to set it. The multiline flag (m) is also necessary, and the global flag (g) can be used if appropriate.
This matches from the start of a line containing an equals sign (^.*=.*), then matches as many whole lines without an equals sign as it can ([^=]*$, where [^=] also matches linefeeds).
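To see the pattern at work outside PCRE, here is a minimal Python sketch. It assumes the sample data from the question; the names INI_TEXT, ENTRY and parse_entries are made up for this illustration, and splitting each match at its first = is an extra step that is not part of the regex itself.

```python
import re

INI_TEXT = "Wert1=Hallo\ndsadasd\nWert2=Hi\nWert3=Heinirch Volland"

# MULTILINE makes ^ and $ match at line boundaries. DOTALL is deliberately
# left off so that "." stops at the end of a line.
ENTRY = re.compile(r'^.*=.*[^=]*$', re.MULTILINE)

def parse_entries(text):
    """Return a dict of key -> (possibly multiline) value."""
    entries = {}
    for match in ENTRY.finditer(text):
        # Split the matched block at its first "=" (hypothetical helper step).
        key, _, value = match.group(0).partition('=')
        entries[key] = value
    return entries

print(parse_entries(INI_TEXT))
```

A continuation line that itself contains an equals sign would be treated as a new entry, so this only suits files where that cannot happen.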

You appear to be using Perl. Have you considered using Config::IniFiles? That module will handle parsing INI-type files for you, and has support for multiline parameters using heredoc syntax:
Parameter=<<EOT
value/line 1
value/line 2
EOT
Or, if you enable it with Config::IniFiles->new(..., -allowcontinue => 1);, continuation lines:
[Section]
Parameter=this parameter \
spreads across \
a few lines

I guess you are trying to get all the INI values. To do that, you can use this regex pattern:
/^(.*)=(.*)/gm
You can then access the results through the capture groups: the first group holds the key and the second holds the value.
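As a rough illustration of reading the two groups, here is a short Python sketch using the sample data from the question (INI_TEXT and pairs are names chosen for this example). Note that the continuation line dsadasd contains no = and is therefore not captured by this pattern.

```python
import re

INI_TEXT = "Wert1=Hallo\ndsadasd\nWert2=Hi\nWert3=Heinirch Volland"

# Group 1 is the key, group 2 is the rest of the line after the "=".
pairs = re.findall(r'^(.*)=(.*)', INI_TEXT, re.MULTILINE)

for key, value in pairs:
    print(key, '->', value)
```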


Multiple replace regex in one Apache-NiFi statement

I have a CSV file in the following format.
id,mobile
1,02146477474
2,08585377474
3,07646474637
4,02158789566
5,04578599525
I want to add a new column that contains just the leading 3 digits of the number for specific prefixes, and the string NOT_VALID in all other cases. So the result should be:
id,number,provider
1,02146477474,021
2,08585377474,085
3,07646474637,NOT_VALID
4,02158789566,021
5,04578599525,NOT_VALID
I can use the following regex to do that replacement for one prefix, but I would like to handle all possible conversions in one step, using the UpdateRecord processor.
${field.value:replaceFirst('085[0-9]+','085')}
When I use something like this:
${field.value:replaceFirst('085[0-9]+','085'):or(${field.value:replaceFirst('086[0-9]+','086')})}
This replaces every value with false.
NiFi uses Java regex.
Since you are using record processing, this should work for you:
${field.value:replaceFirst('^(021|085)?.*','$1')}
The group ( ) catches 021 or 085 at the beginning of the string (^); the ? makes the group optional.
The replacement, $1, is the content of the first group.
PS: Sites like https://regex101.com/ help you understand regexes.
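The behaviour of the pattern can be approximated in Python, where \1 plays the role of $1. This is only a sketch and not NiFi Expression Language: the function name provider is hypothetical, and the NOT_VALID fallback is an assumption added here, because the optional group simply leaves an empty string when no listed prefix is present.

```python
import re

def provider(mobile):
    # Same pattern as the NiFi expression; r'\1' stands in for '$1'.
    prefix = re.sub(r'^(021|085)?.*', r'\1', mobile, count=1)
    # When the optional group did not match, prefix is the empty string.
    # Falling back to NOT_VALID is an assumption made for this sketch.
    return prefix or 'NOT_VALID'

for number in ["02146477474", "08585377474", "07646474637"]:
    print(number, provider(number))
```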

Regex Adding a URL path except the current one I'm at

I'm trying to add something along the lines of this regex logic.
For Input:
reading/
reading/123
reading/456
reading/789
I want the regex to match only
reading/123
reading/456
reading/789
Excluding reading/.
I've tried reading\/* but that doesn't work because it includes reading/
In Hugo, you must escape the backslashes: \\/\\d+. The \\d+ requires at least one digit after the slash, which excludes reading/ on its own.
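The effect of requiring digits after the slash can be checked with a small Python sketch. It uses the inputs from the question; in a Python raw string the backslashes do not need doubling, which is specific to Hugo's configuration syntax. The names paths, pattern and matched are chosen for this example.

```python
import re

paths = ["reading/", "reading/123", "reading/456", "reading/789"]

# At least one digit must follow the slash, so "reading/" alone is rejected.
pattern = re.compile(r'reading/\d+')

matched = [p for p in paths if pattern.fullmatch(p)]
print(matched)
```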

How to find "complicated" URLs in a text file

I'm using the following regex to find URLs in a text file:
/http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+/
It outputs the following:
http://rda.ucar.edu/datasets/ds117.0/.
http://rda.ucar.edu/datasets/ds111.1/.
http://www.discover-earth.org/index.html).
http://community.eosdis.nasa.gov/measures/).
Ideally they would print out this:
http://rda.ucar.edu/datasets/ds117.0/
http://rda.ucar.edu/datasets/ds111.1/
http://www.discover-earth.org/index.html
http://community.eosdis.nasa.gov/measures/
Any ideas on how I should tweak my regex?
Thank you in advance!
UPDATE - Example of the text would be:
this is a test http://rda.ucar.edu/datasets/ds117.0/. and I want this to be copied over http://rda.ucar.edu/datasets/ds111.1/. http://www.discover-earth.org/index.html). http://community.eosdis.nasa.gov/measures/).
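This trims trailing ) and . characters from each line of your output.
import re
# your_output holds the text printed by the original regex, one URL per line
regx = re.compile(r'(?m)[\.\)]+$')  # trailing dots and closing parentheses
print(regx.sub('', your_output))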
And this regex seems to work for extracting the URLs from your original sample text:
https?:[\S]*\/(?:\w+(?:\.\w+)?)?
(Edited from https?:[\S]*\/)
A Python script could look something like this:
import re
ss = """ this is a test http://rda.ucar.edu/datasets/ds117.0/. and I want this to be copied over http://rda.ucar.edu/datasets/ds111.1/. http://www.discover-earth.org/index.html). http://community.eosdis.nasa.gov/measures/). """
regx = re.compile(r'https?:[\S]*\/(?:\w+(?:\.\w+)?)?')
for m in regx.findall(ss):
    print(m)
So for the urls you have here:
https://regex101.com/r/uSlkcQ/4
Pattern explanation:
Protocols (e.g. https://)
^[A-Za-z]{3,9}:(?://)
Look for recurring runs of [\-;:&=\+\$,\w]+, each optionally followed by a dot (www.sub.domain.com)
(?:[\-;:&=\+\$,\w]+\.?)+
Look for recurring /[\-;:&=\+\$,\w\.]+ (/some.path/to/somewhere)
(?:\/[\-;:&=\+\$,\w\.]+)+
Now, for your special case: ensure that the last character is not a dot or a closing parenthesis, using a negative lookahead
(?!\.|\)).
The full pattern is then
^[A-Za-z]{3,9}:(?://)(?:[\-;:&=\+\$,\w]+\.?)+(?:\/[\-;:&=\+\$,\w\.]+)+(?!\.|\)).
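A quick way to check the full pattern is to run it over the four output lines from the question. The sketch below does that in Python; FULL, raw_lines and cleaned are names made up for this example. Because the pattern is anchored with ^, it is applied to each already-extracted line rather than to running text.

```python
import re

# The full pattern from above, split over two string literals for readability.
FULL = re.compile(
    r'^[A-Za-z]{3,9}:(?://)(?:[\-;:&=\+\$,\w]+\.?)+'
    r'(?:\/[\-;:&=\+\$,\w\.]+)+(?!\.|\)).'
)

raw_lines = [
    "http://rda.ucar.edu/datasets/ds117.0/.",
    "http://rda.ucar.edu/datasets/ds111.1/.",
    "http://www.discover-earth.org/index.html).",
    "http://community.eosdis.nasa.gov/measures/).",
]

# match() anchors at the start of each line; group(0) is the cleaned URL.
cleaned = [FULL.match(line).group(0) for line in raw_lines]
for url in cleaned:
    print(url)
```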
There are a few things to improve or change in your existing regex to allow this to work:
http[s]? can be changed to https?. They're identical. No use putting s in its own character class
[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),] You can shorten this entire thing and combine character classes instead of using | between them. This not only improves performance, but also allows you to combine certain ranges into existing character class tokens. Simplifying this, we get [a-zA-Z0-9$-_#.&+!*\(\),]
We can go one step further: a-zA-Z0-9_ is the same as \w. So we can replace those in the character class to get [\w$-#.&+!*\(\),]. As written, the leftover $-# is a reversed range that many engines reject; the next point deals with that hyphen.
In the original regex we have $-_. This creates a range, so it actually includes everything between $ and _ in the ASCII table. This causes unwanted characters to be matched: $%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_. There are a few options to fix this:
[-\w$#.&+!*\(\),] Place - at the start of the character class
[\w$#.&+!*\(\),-] Place - at the end of the character class
[\w$\-#.&+!*\(\),] Escape - such that you have \- instead
You don't need to escape ( and ) in the character class: [\w$#.&+!*(),-]
[0-9a-fA-F][0-9a-fA-F] You don't need to specify [0-9a-fA-F] twice. Just use a quantifier like so: [0-9a-fA-F]{2}
(?:%[0-9a-fA-F][0-9a-fA-F]) The non-capture group isn't actually needed here, so we can drop it (it adds another step that the regex engine needs to perform, which is unnecessary)
So the result of just simplifying your existing regex is the following:
https?://(?:[$\w#.&+!*(),-]|%[0-9a-fA-F]{2})+
Now you'll notice it doesn't match / so we need to add that to the character class. Your regex was matching this originally because it has an improper range $-_.
https?://(?:[$\w#.&+!*(),/-]|%[0-9a-fA-F]{2})+
Unfortunately, even with this change, it'll still match ). at the end. That's because your regex isn't told to stop matching after /. Even if you implement that, it will no longer match file names like index.html. So a better solution is needed. If you give me a couple of days, I'm working on a fully functional RFC-compliant regex that matches URLs. I figured, in the meantime, I would at least explain why your regex isn't working as you'd expect it to.
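Both observations can be reproduced with a short Python sketch: the accidental $-_ range matches characters such as / and -, and the simplified pattern still swallows the trailing ). after a URL. The variable names and the sample sentence are made up for this illustration.

```python
import re

# Inside a character class, "$-_" is a range from "$" (0x24) to "_" (0x5F).
accidental_range = re.compile(r'[$-_]')
print(bool(accidental_range.fullmatch('/')))  # "/" (0x2F) lies inside the range
print(bool(accidental_range.fullmatch('-')))  # so does "-" (0x2D)

# The simplified pattern, with "/" added and "-" moved to the end of the class.
simplified = re.compile(r'https?://(?:[$\w#.&+!*(),/-]|%[0-9a-fA-F]{2})+')

text = 'see http://www.discover-earth.org/index.html). for details'
found = simplified.findall(text)
print(found)  # the trailing ")." is still part of the match
```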
Thanks all for the responses. A coworker ended up helping me with it. Here is the solution:
import re
# des is the text being searched
des_links = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', des)
for i in des_links:
    # Drop the last "/" and everything after it
    tmps = "/".join(i.split('/')[0:-1])
    print(tmps)

How to parse word using grok

I am very new to grok syntax. I have lines in this format:
/app-name/version/code_suffix/sync
for example:
/my-app/v1/O03_ABCD/sync
/my-app/v1/O04/sync
and I need to parse the code, which always consists of 3 characters. I tried something using:
http://grokconstructor.appspot.com/do/match
but with no success
This regex matches each part of your format and puts it in a named capturing group:
/(?<appName>[^/]*)/(?<version>[^/]*)/(?<code>[^\W_]{3})(?:_(?<suffix>[^/]*))?/sync
You can try it here, and it also works on grokConstructor.
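For a quick check outside grok, the same pattern can be tried in Python. This is only a sketch: Python's re module spells named groups (?P<name>...) rather than (?<name>...), and LINE and samples are names chosen for this example.

```python
import re

# Same structure as the grok pattern, with Python's (?P<name>...) syntax.
LINE = re.compile(
    r'/(?P<appName>[^/]*)/(?P<version>[^/]*)'
    r'/(?P<code>[^\W_]{3})(?:_(?P<suffix>[^/]*))?/sync'
)

samples = ["/my-app/v1/O03_ABCD/sync", "/my-app/v1/O04/sync"]
for line in samples:
    # groupdict() maps each group name to its captured text (None if absent).
    print(LINE.fullmatch(line).groupdict())
```

When the optional _suffix part is missing, the suffix group is simply unset.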

Find a tag with a specific suffix using preg_match

I'm looking for tags that end in a specific way using regular expressions in PHP. However all my attempts either result in too much or too little.
For example, in the following string I'd like to match 'bar' because it is in a tag that ends with 'suffix'.
preg_match_all("/<(.*?)suffix>/", "<foo> <barsuffix> <baz>", $matches);
However, the above line captures 'foo> <bar' instead.
Use this regex instead: /<([^>]*)suffix>/.
It will not allow things like foo> <bar, because [^>] ensures that the match cannot contain >.
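The difference between the two patterns is easy to reproduce. The sketch below uses Python's re module rather than PHP's preg functions, purely to illustrate the matching behaviour; the variable names are made up for this example.

```python
import re

subject = "<foo> <barsuffix> <baz>"

# The lazy ".*?" still starts at the first "<" and may run across ">".
lazy = re.findall(r'<(.*?)suffix>', subject)

# "[^>]*" cannot cross a ">", so the match has to stay inside one tag.
negated = re.findall(r'<([^>]*)suffix>', subject)

print(lazy)
print(negated)
```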