Regex select names in next lines after match (until...) - regex

I have a text file with different levels (structured by tabs) and I need to select certain values from it. Here is an example. I have tried this for a very long time, but I can't find a solution.
Connection
Match
Fridolin
Marten
Connection
Inventory
Fill Up
Fill Up
Match
Peter
Marcus
Storage
Room 1
Room 2
Room 3
Match
Albert
Jonas
Hans
List
Match
Peter
Marcus
I want to select every name in the lines following "Match" (which has the same number of tabs in front of it) until the next level (a different number of tabs) starts. In this case I want to select the names that are listed after the word "Match", until (for example) "Connection" appears and the number of tabs in front of it (the level) changes. The names that follow "Match" are always on the same level. I can't use multiline for this.
Match
Fridolin
Marten
Connection
(?<=Match[\r\n]+\t\t?\t?\t?\t?)([ a-zA-ZäöüÄÖÜßé0-9\.-/\-])+
I already have this regex, which selects at least the first name that follows "Match". I don't know how to select the subsequent names and stop when the level changes.

Try this:
(?<=Match)\n(\s+)\w+(?:\n\1\w+)+
online demo
The regular expression matches as follows:
Node      Explanation
(?<=      look behind to see if there is:
Match       'Match'
)         end of look-behind
\n        '\n' (newline)
(         group and capture to \1:
\s+         whitespace (\n, \r, \t, \f, and " ") (1 or more times (matching the most amount possible))
)         end of \1
\w+       word characters (a-z, A-Z, 0-9, _) (1 or more times (matching the most amount possible))
(?:       group, but do not capture (1 or more times (matching the most amount possible)):
\n          '\n' (newline)
\1          what was matched by capture \1
\w+         word characters (a-z, A-Z, 0-9, _) (1 or more times (matching the most amount possible))
)+        end of grouping

Try:
\bMatch\n((\s*).*\n(?:\2.*(?:\n|\Z))*)
Regex demo.
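This will match Match and the following newline, then capture the leading whitespace of the next line as capturing group 2 (capturing group 1 is the whole block of lines). The backreference \2 is then used to match the following lines that start with the same whitespace. Note that a more deeply indented line also starts with that whitespace, so it is matched as well.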

Related

Extract the last path-segments of a URI or path using RegEx

I am trying to extract the last section of the following string :
"/subscriptions/5522233222-d762-666e-555a-e6666666666/resourcegroups/rg-sql-Belguim-01/providers/Microsoft.Compute/snapshots/vm-sql-image-v3.3-pre-sysprep-Oct-2021-BG"
I want to capture:
"snapshots/vm-sql-image-v3.3-pre-sysprep-Oct-2021-BG"
I tried below with no luck:
(\w*?\/\w*?)$
How to pull this off using regex?
Use
[^\/]+\/[^\/]+$
See regex proof.
EXPLANATION
--------------------------------------------------------------------------------
[^\/]+ any character except: '\/' (1 or more
times (matching the most amount possible))
--------------------------------------------------------------------------------
\/ '/'
--------------------------------------------------------------------------------
[^\/]+ any character except: '\/' (1 or more
times (matching the most amount possible))
--------------------------------------------------------------------------------
$ before an optional \n, and the end of the
string
Your issues
(\w*?/\w*?)$ is for simple or empty last 2 segments (tested), e.g.
matched hello/world/subscriptions123/snap_shots capturing subscriptions123/snap_shots
matched /1/2// capturing the last 2 empty segments
OK was:
capture-group
/ to match the last path-separator before end ($)
\w*? intended to match the path-segment of any length
What to improve:
\w*? (lazy, zero or more) is too permissive because it also matches empty segments; use the + quantifier to require at least one character
\w is for word-meta-character, does not match hyphens or dots (OK for snapshot, not for given last segment)
Quick-fixed
(\w+/[\w\.-]+)$ (tested)
added dot \. and hyphen - to character-set containing \w
Simple but solid
(snapshots/[^\/]+)$ (tested)
second-to-last path segment assumed to be the fixed constant snapshots
[^\/] any character except (^) slash in last segment
Note: the slash doesn't need to be escaped as \/ (as in Ryszard's answer) unless / is also used as the pattern delimiter
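The patterns from both answers can be checked quickly in Python. This is only a sketch: the helper name last_two_segments is made up for illustration, and the path is the one from the question.

```python
import re

path = ("/subscriptions/5522233222-d762-666e-555a-e6666666666/resourcegroups/"
        "rg-sql-Belguim-01/providers/Microsoft.Compute/snapshots/"
        "vm-sql-image-v3.3-pre-sysprep-Oct-2021-BG")

def last_two_segments(p):
    """Return the last two slash-separated segments, or None if there are fewer."""
    m = re.search(r"[^/]+/[^/]+$", p)
    return m.group(0) if m else None

print(last_two_segments(path))

# The original attempt finds nothing: \w cannot match the '-' and '.'
# characters in the last segment, so the pattern never reaches the end.
print(re.search(r"(\w*?/\w*?)$", path))

# The quick fix adds '.' and '-' to the character class for the last segment.
print(re.search(r"(\w+/[\w.-]+)$", path).group(1))
```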

Basic Regular expression for matching all lines except given set of lines

Can someone explain using basic regular expression (not lookahead like extensions please) to match all content except matching set of lines
For example if I want to match everything in content except first three lines, I can think of doing this in two steps:
(.*\n){3} matches first three lines
Match everything except lines matched in last step
I tried expression like:
[^(.*\n){3}].*\n
But this isn't working.
How to do the second step ?
This is basic to me:
^(?:.*\n){3}\K[\w\W]*
See proof.
EXPLANATION
--------------------------------------------------------------------------------
^ the beginning of the string
--------------------------------------------------------------------------------
(?: group, but do not capture (3 times):
--------------------------------------------------------------------------------
.* any character except \n (0 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
\n '\n' (newline)
--------------------------------------------------------------------------------
){3} end of grouping
--------------------------------------------------------------------------------
\K reset the start of the reported match (the text matched so far is left out of the result)
--------------------------------------------------------------------------------
[\w\W]* any character of: word characters (a-z, A-
Z, 0-9, _), non-word characters (all but a-
z, A-Z, 0-9, _) (0 or more times (matching
the most amount possible))
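\K is a Perl/PCRE extension, and Python's built-in re module does not support it. As a rough equivalent, the sketch below uses a capturing group for the part after the first three lines. The function name after_first_three_lines and the sample content are assumptions made for illustration.

```python
import re

# Skip exactly three lines, then capture everything that remains.
# [\w\W]* matches any character, including newlines.
SKIP_THREE = re.compile(r"\A(?:.*\n){3}([\w\W]*)")

def after_first_three_lines(text):
    """Return the text after the first three lines, or None if there are fewer."""
    m = SKIP_THREE.match(text)
    return m.group(1) if m else None

content = "line 1\nline 2\nline 3\nline 4\nline 5\n"
print(repr(after_first_three_lines(content)))
```

With a capturing group the first three lines are still part of the overall match; only group 1 holds the wanted remainder.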

REGEXEXTRACT() both integers and numbers with decimals

I'm working in Google Sheets and wondering if it's possible to use one REGEXEXTRACT() function to extract both whole and decimal numbers.
I have a column with:
1 Ml/ 2 Ml
2 Ml/ 2.02 Ml
3 Ml/ 4.01 Ml
and want a column with:
2
2.02
4.01
The first value could be 2.00 as well.
I was wondering if this is possible specifically with regular expressions. I know how to do it without. I currently have REGEXEXTRACT(cell#, "\/\D(\d+)\D")
Thanks!
I think all you need as a pattern is:
(\d+(?:\.\d+)?) Ml$
( - 1st capture group.
\d+ - One or more digits.
(?: - Open non-capture group.
\.\d+ - A literal dot followed by one or more digits.
)? - Close non-capture group and make it optional.
) - Close 1st capture group.
Ml$ - Match "Ml" literally, up to the end-of-string anchor ($).
Add this to an ARRAYFORMULA() like:
=ARRAYFORMULA(REGEXEXTRACT(A1:A3,"(\d+(?:\.\d+)?) Ml$"))
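To see what the pattern captures outside of Sheets, here is a small Python sketch. Google Sheets uses a different regex engine, but this pattern only relies on syntax common to both, so the behaviour should carry over; treat that as an assumption. The helper name last_volume is hypothetical, and the rows are the ones from the question.

```python
import re

# One or more digits, an optional decimal part, then " Ml" at the end.
VOLUME = re.compile(r"(\d+(?:\.\d+)?) Ml$")

rows = ["1 Ml/ 2 Ml", "2 Ml/ 2.02 Ml", "3 Ml/ 4.01 Ml"]

def last_volume(cell):
    """Return the number in front of the final " Ml", or None if absent."""
    m = VOLUME.search(cell)
    return m.group(1) if m else None

print([last_volume(r) for r in rows])
```

The values come back as text, as they do with REGEXEXTRACT; converting them to numbers is a separate step.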
without Regex
We want to grab a value encapsulated by slash-space on the left and space on the right:
=TRIM(MID(A1,FIND("/ ",A1)+2,FIND(" ",A1,FIND("/ ",A1)+2)-(FIND("/ ",A1)+2)))
(Both Excel and Google Sheets should work the same way. If we have to grab multiple instances, I would use Regex.)
use:
=ARRAYFORMULA(IFNA(REGEXEXTRACT(A1:A2, " (\d+\.\d+) Ml"),
REGEXEXTRACT(A1:A2, " (\d+) Ml")))
You can also try this simpler formula, which takes care of errors and returns the results as numbers at the same time.
=ArrayFormula(IFERROR(
REGEXEXTRACT(A1:A,"/ (.*) ")*1))
Use
=REGEXEXTRACT(A1, "(\d[\d.]*)\s*Ml$")
See proof
Explanation
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
\d digits (0-9)
--------------------------------------------------------------------------------
[\d.]* any character of: digits (0-9), '.' (0
or more times (matching the most amount
possible))
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0 or
more times (matching the most amount
possible))
--------------------------------------------------------------------------------
Ml 'Ml'
--------------------------------------------------------------------------------
$ before an optional \n, and the end of the
string

Notepad++: Extracting all words from a very long string that contains a set of rounded brackets

I have a large .txt file written in German. It is a transcript of many people speaking. When an abbreviate form of a word is used, the correct form of the word is written around it, or inside it, in brackets. I would like to extract, as a list, all such examples that exist in this .txt. I have tried a few Regex but I can't seem to get it to highlight the entire "word".
Any ideas?
Here is a part of the .txt with the words I'd like extracted highlighted:
Ich hab(e) am Achtundzwanzigsten achten neunzehnhundertneunzig Geburtstag. Also wenn ich mich beschreiben sollte, dann muss ich sagen freundlich, unkompliziert und bescheiden. Hallo wie gehts (geht es) dir. Na was machst (machst du) den jetzt heut(e). Und, eh, hm, was noch? Stör(e) ich? Ja das is(t), eh, so, würd(e) ich das so sagen....
Thanks!
If I understand your needs correctly, how about:
(\w+\(\w+\))| \([\w\s]+\)
Explanation:
The regular expression:
(?-imsx:(\w+\(\w+\))| \([\w\s]+\))
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
\w+ word characters (a-z, A-Z, 0-9, _) (1 or
more times (matching the most amount
possible))
----------------------------------------------------------------------
\( '('
----------------------------------------------------------------------
\w+ word characters (a-z, A-Z, 0-9, _) (1 or
more times (matching the most amount
possible))
----------------------------------------------------------------------
\) ')'
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
' '
----------------------------------------------------------------------
\( '('
----------------------------------------------------------------------
[\w\s]+ any character of: word characters (a-z, A-
Z, 0-9, _), whitespace (\n, \r, \t, \f,
and " ") (1 or more times (matching the
most amount possible))
----------------------------------------------------------------------
\) ')'
----------------------------------------------------------------------
) end of grouping
This regular expression finds everything between ( and ), parentheses included, and also everything before the ( back to the preceding space character:
[^ ]*\([^)]*\)
Now to transform your text into a nice list:
open find/replace dialog (Ctrl-H)
Find what: .*?([^ ]*\([^)]*\))
Replace with: \1\n
"Regular expression" with "Matches newline" checked
Press "Replace All" with cursor at the beginning of file (Ctrl-Home)
Ignore or delete last line
Now you have a nice list of all these words each on separate line.
Notepad++ uses a regex flavor that may not be POSIX compliant, and hence does not support word boundaries (at least v5.9.2 does not).
Try this regular expression:
[^\s]*\([^)]*\)[^\s\.\,\;\?\!]*
[^\s]* : detects beginning of a word by not matching any whitespace before a word (tab, space, etc..)
\([^)]*\) : matches the brackets and its content
[^\s\.\,\;\?\!]* : detects ending of a word by not matching any whitespace or possible punctuation symbols.
You can extend this by adding more punctuation marks before or after the word (like quotes).
Successfully tested this on Notepad++ v5.9.2 on your sample text.
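For readers who would rather script the extraction than use the Notepad++ dialog, the same pattern can be tried in Python. This is a sketch only: the excerpt is a shortened version of the sample text, and Python's re engine is not the one Notepad++ uses.

```python
import re

# Non-whitespace before "(", the bracketed part, then the rest of the word
# up to whitespace or one of the listed punctuation marks.
WORD_WITH_BRACKETS = re.compile(r"[^\s]*\([^)]*\)[^\s.,;?!]*")

excerpt = (
    "Ich hab(e) am Achtundzwanzigsten Geburtstag. Hallo wie gehts (geht es) dir. "
    "Na was machst (machst du) den jetzt heut(e). Stör(e) ich? Ja das is(t), eh, so."
)

found = WORD_WITH_BRACKETS.findall(excerpt)
print("\n".join(found))
```

Bracketed corrections that stand apart from the word, such as "(geht es)", are returned without the abbreviated word in front of them, because the pattern stops at whitespace.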

Perl, delete everything after first three characters

I promise you all I've searched the site for about two hours now. I've found several that should have worked, but they didn't.
I have a line that consists of a varying amount of numbers separated by spaces. I want to delete everything after the third number.
I should say that everything I've been writing has been assuming that \S\s\S\s\S would match the first three numbers, with spaces between 1 and 2, and 2 and 3.
I anticipated the following working:
s/^.*?[\S\s\S\s\S].{5}//s;
but it did the exact opposite of what I wanted.
I would like 2 3 0 4 5 6 7 1 0 1 2 to become 2 3 0
I would really prefer to keep it substitution. I've tried look-behind as one person mentioned and I had no luck. Should I be saving the first 3 numbers as a string before I'm trying these commands?
EDIT:
I should have clarified that these numbers could be in the form 1.57 or 1.00E01 as well. I had integers when I was trying to get that to just baseline work.
\S\s\S\s\S will indeed match three non-space characters separated by space characters. However, ^.*?[\S\s\S\s\S].{5} does something completely different:
^ matches the beginning of the line.
.*? matches characters until the next match can start (not as many as it can). Since you specify /s, . will match newline as well.
[\S\s\S\s\S] is a character class, and so is the same as [\S\s]—match either \S or \s, which is to say anything.
.{5} will match five characters.
Since [\S\s] and . with /s match the same things, the .*? will never match any characters as it wants to match as little as possible. Thus, this is the same as s/^.{6}//s—delete the first six characters from the string. As you can see, that's not what you wanted!
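The equivalence described above can be checked with a short Python sketch. Python's re handles this particular pattern the same way as Perl does, with re.S standing in for the /s modifier; the variable names are made up for illustration.

```python
import re

line = "2 3 0 4 5 6 7 1 0 1 2"

# [\S\s\S\s\S] is a character class, so the repeats add nothing: it is [\S\s].
broken = re.sub(r"^.*?[\S\s\S\s\S].{5}", "", line, flags=re.S)

# The claimed equivalent: delete the first six characters.
six_chars = re.sub(r"^.{6}", "", line, flags=re.S)

print(repr(broken))
print(broken == six_chars)
```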
One way to keep the first three numbers is to explicitly match them: s/^(\d \d \d).*/$1/s. Here, \d matches a single digit (0–9), with literal spaces in between them. We match the first three followed by anything at all, and then replace the whole match (since it ends in .*, that's the whole string) with just the bit in between parentheses, i.e. the first three numbers. If your numbers can be more than one digit long, then s/^(\d+ \d+ \d+).*/$1/s will do what you want; if you can have arbitrary space-like characters (space, tab, newline) separating them, then s/^(\d\s\d\s\d).*/$1/s is what you want (or \s+ if you can have multiple spaces). If you want to catch lines which have things other than digits, you can use \S or \S+, just as you were.
Another approach, using lookbehind, would be s/(?<=^\d \d \d).*//s. In other words, delete any characters which are preceded by ^\d \d \d—the beginning of the string followed by three space-separated numbers. There's no real advantage to this approach—I'd probably do it the other way—but since you mentioned lookbehind, here's how you can do it. (Again, things like s/(?<=^\S\s\S\s\S).*//s are more general.)
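Both approaches translate directly to Python, which may help when experimenting. The sketch below is not the Perl code itself: the function names are hypothetical, \1 plays the role of $1, and re.S stands in for /s.

```python
import re

def keep_first_three(line):
    """Capture the first three whitespace-separated tokens and drop the rest."""
    return re.sub(r"^(\S+\s+\S+\s+\S+).*", r"\1", line, flags=re.S)

def keep_first_three_lookbehind(line):
    """Lookbehind variant; only handles single digits separated by single spaces."""
    return re.sub(r"(?<=^\d \d \d).*", "", line, flags=re.S)

print(keep_first_three("2 3 0 4 5 6 7 1 0 1 2"))
print(keep_first_three("1.57 1.00E01 3 4 5"))
print(keep_first_three_lookbehind("2 3 0 4 5 6 7 1 0 1 2"))
```

A line with fewer than three tokens does not match the first pattern and is returned unchanged.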
So match the first three numbers explicitly, and drop everything else.
s/^([\dE.]+)\s+([\dE.]+)\s+([\dE.]+).*$/$1 $2 $3/;
This works as follows:
$ perl -MYAPE::Regex::Explain -E 'say YAPE::Regex::Explain->new(q{^([\dE.]+)\s+([\dE.]+)\s+([\dE.]+).*$})->explain;'
The regular expression:
(?-imsx:^([\dE.]+)\s+([\dE.]+)\s+([\dE.]+).*$)
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
^ the beginning of the string
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
[\dE.]+ any character of: digits (0-9), 'E', '.'
(1 or more times (matching the most
amount possible))
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
\s+ whitespace (\n, \r, \t, \f, and " ") (1 or
more times (matching the most amount
possible))
----------------------------------------------------------------------
( group and capture to \2:
----------------------------------------------------------------------
[\dE.]+ any character of: digits (0-9), 'E', '.'
(1 or more times (matching the most
amount possible))
----------------------------------------------------------------------
) end of \2
----------------------------------------------------------------------
\s+ whitespace (\n, \r, \t, \f, and " ") (1 or
more times (matching the most amount
possible))
----------------------------------------------------------------------
( group and capture to \3:
----------------------------------------------------------------------
[\dE.]+ any character of: digits (0-9), 'E', '.'
(1 or more times (matching the most
amount possible))
----------------------------------------------------------------------
) end of \3
----------------------------------------------------------------------
.* any character except \n (0 or more times
(matching the most amount possible))
----------------------------------------------------------------------
$ before an optional \n, and the end of the
string
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
(Updated in consideration of the changes the OP made to the original specification.)
Your code where you say s/^.*?[\S\s\S\s\S].{5}//s;
I would write as: s/^(\S\s\S\s\S).*$/$1/
You're forgetting to capture the part that you want to keep and to use $1 in the replacement, and having a .* at the beginning could lead to the starting numbers being removed instead of the trailing numbers.
Also, I'm not sure if you have some guarantee of single digit numbers, or of single whitespace characters, so you could write the code with s/^(\S+\s+\S+\s+\S+).*$/$1/ to capture all of the spaces and all of the digits.
Let me know if I need to clarify that a little more.
Here's a website I find super helpful for Perl regex: http://pubcrawler.org/perl-reference.html
The question is, why do you want to do this with a regexp? It seems easier to me with:
substr $string, 0, 5;
or, if you really want to (I didn't test):
s/^(.{5})(.*)/$1/
Parentheses allow you to "remember" patterns; this is the way to say that you want to replace pretty much everything with just the first part of the pattern (the first five characters). This pattern will match any line of text and leave just the first 5 characters. You may want to modify it to match 3 digits with spaces between them.
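The fixed-width idea only works while every number is a single character. A sketch in Python of both the fixed-width cut and a field-based alternative follows; the function name first_three_fields is made up, and splitting on whitespace is a swapped-in technique rather than anything proposed in the answers.

```python
def first_three_fields(line):
    """Keep the first three whitespace-separated fields, joined by single spaces."""
    return " ".join(line.split()[:3])

single_digits = "2 3 0 4 5 6 7 1 0 1 2"
mixed = "1.57 1.00E01 3 4"

# Fixed-width cut: fine for single digits, wrong for longer numbers.
print(repr(single_digits[:5]))
print(repr(mixed[:5]))

# Field-based cut: independent of how wide each number is.
print(first_three_fields(single_digits))
print(first_three_fields(mixed))
```

Note that the field-based version collapses runs of whitespace into single spaces, which may or may not matter for the data at hand.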