How to exclude lines with ... in regular expression - regex

I have the following table of contents and sections in my file:
1.2 Purpose .................... 8
1.3 System Overview ............ 8
1.4 Document Overview .......... 8
1.5 Definitions and Acronyms ......... 9
2.1.3.3.8 FOO
2.1.3.3.9 BAR
2.1.4 TEST
I'd like to extract the section names and ignore the lines that are part of the table of contents.
I've been trying this regular expression:
^((?:\d{1,2}\.)+(?:\d{1,2})+)\s.+(?!\.\.\.).*$
However, I keep capturing the table of contents lines.
How can I exclude the lines with the .... strings?
Thanks!

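The problem is that your negative lookahead only rules out dots at the one position where it is placed; it never looks further along the line. Because the preceding .+ is greedy, it can consume the entire line, so the lookahead ends up being tested at the end of the line, where it trivially succeeds. Consider instead: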
^(\d{1,2}(?:\.\d{1,2})*)\s*[^.]*(?!.*\.{3}).*$
#                                  ^^
...the characters with the carets below them are critical: they make the negative lookahead apply not only at that specific position, but anywhere after it as well.
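To see the effect on the sample from the question, here is a minimal Python sketch. It assumes the file is processed one line at a time; the variable names and the way the section name is sliced out after group 1 are illustrative choices, not part of the original answer.

```python
import re

# Pattern from the answer above, applied to one line at a time.
SECTION = re.compile(r'^(\d{1,2}(?:\.\d{1,2})*)\s*[^.]*(?!.*\.{3}).*$')

text = """\
1.2 Purpose .................... 8
1.3 System Overview ............ 8
1.4 Document Overview .......... 8
1.5 Definitions and Acronyms ......... 9
2.1.3.3.8 FOO
2.1.3.3.9 BAR
2.1.4 TEST
"""

sections = []
for line in text.splitlines():
    m = SECTION.match(line)
    if m:
        number = m.group(1)              # e.g. "2.1.4"
        name = line[m.end(1):].strip()   # rest of the line after the number
        sections.append((number, name))

print(sections)
```

Matching line by line is deliberate: \s* and [^.]* can both match newline characters, so running the same pattern over the whole text in multiline mode could let one match run across two lines.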

Related

Multiple replace regex in one Apache-NiFi statement

I have a csv in following format.
id,mobile
1,02146477474
2,08585377474
3,07646474637
4,02158789566
5,04578599525
I want to add a new column containing just the leading 3 digits of the number for specific prefixes, and the string NOT_VALID for all other rows. So the result should be:
id,number,provider
1,02146477474,021
2,08585377474,085
3,07646474637,NOT_VALID
4,02158789566,021
5,04578599525,NOT_VALID
I can use the following regex for a single replacement, but I would like to handle all possible conversions in one step, using the UpdateRecord processor.
${field.value:replaceFirst('085[0-9]+','085')}
When I use something like this:
${field.value:replaceFirst('085[0-9]+','085'):or(${field.value:replaceFirst('086[0-9]+','086')})}
This replaces all with false.
NiFi uses Java regex.
Since you are using record processing, this should work for you:
${field.value:replaceFirst('^(021|085)?.*','$1')}
The group () with the optional quantifier ? captures 021 or 085 at the beginning of the string (^).
The replacement, $1, is the first group; it is empty when neither prefix is present.
PS: Sites like https://regex101.com/ help with understanding regex.
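The behaviour of that replacement can be checked outside NiFi. The sketch below uses Python's re module as a stand-in for Java regex (for this simple pattern the two behave the same way); the provider function and the NOT_VALID fallback are illustrative assumptions added here, since the expression above yields an empty string for other prefixes.

```python
import re

# Same pattern as in the NiFi expression above.
PREFIX = re.compile(r'^(021|085)?.*')

def provider(mobile: str) -> str:
    # count=1 mirrors replaceFirst; an unmatched optional group expands to "".
    prefix = PREFIX.sub(r'\1', mobile, count=1)
    # Hypothetical fallback step, not part of the NiFi expression itself.
    return prefix or "NOT_VALID"

rows = ["02146477474", "08585377474", "07646474637", "02158789566", "04578599525"]
providers = [provider(m) for m in rows]
print(providers)
```

In NiFi itself, the empty-string case would need an additional step to become NOT_VALID, which the answer above does not cover.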

How to extract all IMDb ID's from string

I have a block of text in which I want to search for IMDb links and, if one is found, extract the IMDb ID.
Here is an example string:
http://www.imdb.com/Title/tt2618986
http://www.google.com/tt2618986
https://www.imdb.com/Title/tt2618986
http://www.imdb.com/title/tt1979376/?ref_=nv_sr_1?ref_=nv_sr_1
I want to extract only the numeric ID (for example, 2618986) from lines 1, 3 and 4.
Here is the regex I am currently using, without any luck:
(?:http|https)://(?:.*\.|.*)imdb.com/(?:t|T)itle(?:\?|/)(..\d+)(.+)?
https://regex101.com/r/ERtoRz/1
If you are interested in extracting only the numeric ID, i.e. 2618986, none of the comments quite nail it, since they match tt2618986. Building on @The fourth bird's answer, you need to split tt2618986 into two parts: tt and 2618986. So instead of a single ([a-zA-Z0-9]+), use [a-zA-Z]+([0-9]+).
^https?://www\.imdb\.com/[Tt]itle[?/][a-zA-Z]+([0-9]+)
You can then extract the 2618986 part by calling group 1.
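This expression might simply extract the desired digits, provided case-insensitive matching is enabled (the pattern spells title in lowercase, while some of the sample URLs use Title):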
^(?:https?://)(?:www\.)?imdb\.com/title/[a-z]+([0-9]+).*$
If you wish to explore, simplify, or modify the expression, it is explained in the top-right panel of regex101.com, where you can also watch how it matches against some sample inputs.
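As a quick check of the first expression against the sample from the question, here is a small Python sketch. The MULTILINE flag is an assumption made so that ^ matches at the start of every line in the block of text; the variable names are illustrative.

```python
import re

# First pattern from the answers above; group 1 holds only the digits.
IMDB_ID = re.compile(
    r'^https?://www\.imdb\.com/[Tt]itle[?/][a-zA-Z]+([0-9]+)',
    re.MULTILINE,
)

text = """\
http://www.imdb.com/Title/tt2618986
http://www.google.com/tt2618986
https://www.imdb.com/Title/tt2618986
http://www.imdb.com/title/tt1979376/?ref_=nv_sr_1?ref_=nv_sr_1
"""

# findall returns the contents of group 1 for every matching line.
ids = IMDB_ID.findall(text)
print(ids)
```

The google.com line is skipped because the host is part of the pattern.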

How do I use regex to return text following specific prefixes?

I'm using an application called Firemon which uses regex to pull text out of various fields. I'm not sure which regex flavor it uses; I can't find a reference to this in the documentation.
My raw text will always be in the following format:
CM: 12345
APP: App Name
BZU: Dept Name
REQ: First Last
JST: Text text text text.
CM will always be an integer, JST will be a sentence that may span multiple lines, and the other fields will be strings of 1-2 words. There is always a return after each section.
The application, Firemon, has me create a regex entry for each field. Something simple that looks for each prefix and then a return should work, because I return after each value. I've tried several variations, such as "BZU:\s*(.*)", but can't seem to find something that works.
EDIT: To be clear I'm trying to get the value after each prefix. Firemon has a section for each field. "APP" for example is a field. I need a regex example to find "APP:" and return the text after it. So something as simple as regex that identifies "APP:", and grabs everything after the : and before the return would probably work.
You can use (?=\w+ )(.*)
The positive lookahead will remove the prefix and the space character from the match groups, so each match gives you the text after the space.
I am a little late to the game, but maybe this is still an issue.
In the more recent versions of FireMon, sample regexes are provided. For instance:
jst:\s*([^;]*?)\s*;
will match on:
jst:anything in here;
and result in
anything in here
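For the prefix-per-line layout described in the question, one way to grab the text after a prefix and before the return is sketched below in Python. FireMon's regex flavor is not documented in the thread, so this is only an illustration of the pattern; the field_value helper is hypothetical, and the sketch captures just the first line of a value (a multi-line JST would need more).

```python
import re

raw = """\
CM: 12345
APP: App Name
BZU: Dept Name
REQ: First Last
JST: Text text text text.
"""

def field_value(prefix, text):
    # ^PREFIX: then optional spaces/tabs, then capture up to the end of the line.
    # [ \t]* is used instead of \s* so an empty value cannot spill onto the next line.
    pattern = re.compile(r'^' + re.escape(prefix) + r':[ \t]*(.*)$', re.MULTILINE)
    m = pattern.search(text)
    return m.group(1) if m else None

app = field_value("APP", raw)
cm = field_value("CM", raw)
print(app, cm)
```

The equivalent raw pattern for the APP field would be ^APP:[ \t]*(.*)$ with multiline matching enabled, with the value in group 1.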

regex using positive lookahead

My source data text looks something like this:
a1,a2,a3
a4,a5,a6
a7,a8,a9
test="1"
b1,b2,b3
b4,b5,b6
b7,b8,b9
test="2"
c1,c2,c3
c4,c5,c6
c7,c8,c9
test="3"
I need to parse this so the end result looks like this (appropriate “test” field included in each row):
a1,a2,a3,1
a4,a5,a6,1
a7,a8,a9,1
b1,b2,b3,2
b4,b5,b6,2
b7,b8,b9,2
c1,c2,c3,3
c4,c5,c6,3
c7,c8,c9,3
...etc
This is what I started with; it captures the fields correctly:
(?<f1>.*?),(?<f2>.*?),(?<f3>.*?)\s+
I understand I need to use lookarounds to capture and include the “test” field on each line.
So something like this added (using a positive lookahead)…
(?<f1>.*?),(?<f2>.*?),(?<f3>.*?)\s+(?=test="(?<test>.*?)")
This seems close, but it does not yield all rows of data; it returns only the last row with its test value, as if it were consuming the lookahead row.
This expression and its captured groups are fed into a .NET application that inserts the captured groups as fields in a database table. The number of fields is always static (4 in the example above: field1=f1, field2=f2, field3=f3, field4=test), but the number of records is variable.
Any guidance would be appreciated.
Parsing your data to extract the relevant values
You are almost there, but you need to allow the lookahead to skip the rows between the current one and the test line:
(?ms)(?<f1>\w+),(?<f2>\w+),(?<f3>\w+)\R(?=.*?^test="(?<test>\d+)")
\R matches any kind of newline, and (?ms) is the inline way of turning on the multiline and dot-matches-newline modifiers, so that .*?^test can skip over every line up to the next test line. If your engine does not support \R, \r?\n covers the usual line endings.
Again, your issue was that \s+ forced the lookahead to apply on the line right after the one you were matching.
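The same idea can be tried in Python as a rough sketch. Two adaptations are assumptions of this example rather than part of the answer: Python's re module has no \R, so \r?\n is used, and named groups are written (?P<name>...) instead of (?<name>...).

```python
import re

data = """\
a1,a2,a3
a4,a5,a6
a7,a8,a9
test="1"
b1,b2,b3
b4,b5,b6
b7,b8,b9
test="2"
c1,c2,c3
c4,c5,c6
c7,c8,c9
test="3"
"""

# The lazy .*? inside the lookahead skips ahead to the nearest following test line.
ROW = re.compile(
    r'(?P<f1>\w+),(?P<f2>\w+),(?P<f3>\w+)\r?\n(?=.*?^test="(?P<test>\d+)")',
    re.MULTILINE | re.DOTALL,
)

rows = [m.group("f1", "f2", "f3", "test") for m in ROW.finditer(data)]
for row in rows:
    print(",".join(row))
```

Because the lookahead consumes nothing, each data row is matched on its own and the test lines are never part of a match.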

Custom markdown tag

I'm using dflydev's markdown which is based on michelf's project to transform Markdown into HTML.
My site is RTL by default and I'd like to add a custom tag to allow left-aligning paragraphs,
something like:
regular text, right aligned.
<- some text that will be aligned to the left
<--
fenced text that will be aligned to the left
<--
I'm trying to build the regex pattern to catch those blocks:
For <- ... I have: /^<- ([^\n]+)/
For the fenced block I couldn't get a working pattern
I'd like to get help on the fenced block regex and on improving the one-line regex I already have.
Thanks!
This would match your second case (the fenced block), as long as the multiline and dot-matches-newline modifiers are enabled:
^<--.*?<--$
For your first group I would use something like this instead:
^<-[^-][^\r\n]*?$
For your first case you can use
/^<-(.*?)$/
and get the first group.
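For the second case use the following, with the m and s modifiers so that ^ and $ match at line boundaries and . also matches newlines
/^<--(.*?)<--$/ms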
and get the first group.
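To illustrate how the two kinds of pattern behave on the sample from the question, here is a short Python sketch (the library in question is PHP, so this only demonstrates the regexes, not the parser integration). The flag choices, the capture group added to the one-line pattern, and the variable names are assumptions of this example.

```python
import re

source = """\
regular text, right aligned.
<- some text that will be aligned to the left
<--
fenced text that will be aligned to the left
<--
"""

# Fenced block: MULTILINE anchors ^ and $ per line, DOTALL lets . cross newlines.
FENCED = re.compile(r'^<--(.*?)<--$', re.MULTILINE | re.DOTALL)
# One-line tag: [^-] after "<-" keeps the "<--" fence lines from matching.
ONE_LINE = re.compile(r'^<-([^-][^\r\n]*?)$', re.MULTILINE)

fenced = [m.strip() for m in FENCED.findall(source)]
one_line = [m.strip() for m in ONE_LINE.findall(source)]
print(fenced)
print(one_line)
```

Without the [^-] guard, a pattern like ^<-(.*?)$ in multiline mode would also match the fence lines themselves.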