Notepad++: Extracting all words from a very long string that contains a set of rounded brackets - regex

I have a large .txt file written in German. It is a transcript of many people speaking. When an abbreviated form of a word is used, the correct form of the word is written around it, or inside it, in brackets. I would like to extract, as a list, all such examples that exist in this .txt. I have tried a few regexes, but I can't seem to get any of them to highlight the entire "word".
Any ideas?
Here is a part of the .txt with the words I'd like extracted highlighted:
Ich hab(e) am Achtundzwanzigsten achten neunzehnhundertneunzig Geburtstag. Also wenn ich mich beschreiben sollte, dann muss ich sagen freundlich, unkompliziert und bescheiden. Hallo wie gehts (geht es) dir. Na was machst (machst du) den jetzt heut(e). Und, eh, hm, was noch? Stör(e) ich? Ja das is(t), eh, so, würd(e) ich das so sagen....
Thanks!

If I understand your needs correctly, how about:
(\w+\(\w+\))| \([\w\s]+\)
Explanation:
The regular expression:
(?-imsx:(\w+\(\w+\))| \([\w\s]+\))
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
\w+ word characters (a-z, A-Z, 0-9, _) (1 or
more times (matching the most amount
possible))
----------------------------------------------------------------------
\( '('
----------------------------------------------------------------------
\w+ word characters (a-z, A-Z, 0-9, _) (1 or
more times (matching the most amount
possible))
----------------------------------------------------------------------
\) ')'
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
' '
----------------------------------------------------------------------
\( '('
----------------------------------------------------------------------
[\w\s]+ any character of: word characters (a-z, A-
Z, 0-9, _), whitespace (\n, \r, \t, \f,
and " ") (1 or more times (matching the
most amount possible))
----------------------------------------------------------------------
\) ')'
----------------------------------------------------------------------
) end of grouping

This regular expression finds everything between ( and ), brackets included, together with any non-space characters immediately before the ( (that is, back to the preceding space character):
[^ ]*\([^)]*\)
Now to transform your text into a nice list:
open find/replace dialog (Ctrl-H)
Find what: .*?([^ ]*\([^)]*\))
Replace with: \1\n
"Regular expression" with "Matches newline" checked
Press "Replace All" with cursor at the beginning of file (Ctrl-Home)
Ignore or delete last line
Now you have a nice list of all these words each on separate line.
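For anyone who prefers a script to the dialog, the Replace All step above can be sketched in Python. This is only an illustration under assumptions: the function name and the sample sentence are made up, and `re.DOTALL` stands in for the "Matches newline" checkbox.

```python
import re

def bracket_words_as_lines(text):
    """Mimic the Replace All step: keep each bracketed word, one per line."""
    # .*? lazily swallows everything up to the next bracketed word;
    # group 1 is the word itself, written back followed by a newline.
    return re.sub(r".*?([^ ]*\([^)]*\))", r"\1\n", text, flags=re.DOTALL)

# Hypothetical short sample in the style of the transcript.
print(bracket_words_as_lines("Ich hab(e) am heut(e). Ende"))
```

As in Notepad++, whatever follows the last bracketed word is left over as a final line that can be ignored or deleted.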

Notepad++ uses a regex flavor that may not be POSIX compliant and does not support word boundaries (at least v5.9.2 does not support them).
Try this regular expression:
[^\s]*\([^)]*\)[^\s\.\,\;\?\!]*
[^\s]* : detects beginning of a word by not matching any whitespace before a word (tab, space, etc..)
\([^)]*\) : matches the brackets and its content
[^\s\.\,\;\?\!]* : detects ending of a word by not matching any whitespace or possible punctuation symbols.
You can extend this by adding more punctuation marks before or after the word (like quotes).
Successfully tested this on Notepad++ v5.9.2 on your sample text.
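Outside Notepad++, the same pattern can be tried in a few lines of Python. This is a minimal sketch, not part of the tested Notepad++ solution; the function name and the shortened sample text are assumptions made for the illustration.

```python
import re

# Pattern from this answer: optional word start, the bracketed part,
# then trailing characters that are neither whitespace nor punctuation.
BRACKET_WORD = re.compile(r"[^\s]*\([^)]*\)[^\s.,;?!]*")

def extract_bracket_words(text):
    """Return every token containing a (...) group, in order of appearance."""
    return BRACKET_WORD.findall(text)

# Shortened, hypothetical excerpt in the style of the transcript.
sample = "Ich hab(e) heut(e) Zeit. Hallo wie gehts (geht es) dir. Ja das is(t), eh, so."
print("\n".join(extract_bracket_words(sample)))
```

Note that for "gehts (geht es)" only the bracketed part is returned, because the pattern stops at the space before the opening bracket.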

Related

Regex select names in next lines after match (until...)

I have a text file with different levels (structured by tabs) and I need to select certain values out of it. Here is an example. I tried this for a very long time, but I can't find any solution.
Connection
Match
Fridolin
Marten
Connection
Inventory
Fill Up
Fill Up
Match
Peter
Marcus
Storage
Room 1
Room 2
Room 3
Match
Albert
Jonas
Hans
List
Match
Peter
Marcus
I want to select every name in the following lines after "Match" (which has the same amount of tabs in front of it) until the next level (different amount of tabs) starts. In this case I want to select the names that are listed after the word "Match". Until (for example) "Connection" pops up and the amount of tabs in front of it (level) changes. The Names that follow "Match" are always on the same level. I can't use multiline for this.
Match
Fridolin
Marten
Connection
(?<=Match[\r\n]+\t\t?\t?\t?\t?)([ a-zA-ZäöüÄÖÜßé0-9\.-/\-])+
I already have this regex, which selects at least the first name that follows "Match". I don't know how to select the next names and stop if the level changes.
Try this:
(?<=Match)\n(\s+)\w+(?:\n\1\w+)+
online demo
The regular expression matches as follows:
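NODE      EXPLANATION
----------------------------------------------------------------------
(?<=      look behind to see if there is:
----------------------------------------------------------------------
Match     'Match'
----------------------------------------------------------------------
)         end of look-behind
----------------------------------------------------------------------
\n        '\n' (newline)
----------------------------------------------------------------------
(         group and capture to \1:
----------------------------------------------------------------------
\s+       whitespace (\n, \r, \t, \f, and " ") (1 or more times
          (matching the most amount possible))
----------------------------------------------------------------------
)         end of \1
----------------------------------------------------------------------
\w+       word characters (a-z, A-Z, 0-9, _) (1 or more times
          (matching the most amount possible))
----------------------------------------------------------------------
(?:       group, but do not capture (1 or more times (matching the
          most amount possible)):
----------------------------------------------------------------------
\n        '\n' (newline)
----------------------------------------------------------------------
\1        what was matched by capture \1
----------------------------------------------------------------------
\w+       word characters (a-z, A-Z, 0-9, _) (1 or more times
          (matching the most amount possible))
----------------------------------------------------------------------
)+        end of grouping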
Try:
\bMatch\n((\s*).*\n(?:\2.*(?:\n|\Z))*)
Regex demo.
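This matches Match and the following newline, then captures the leading whitespace of the next line as group 2. The backreference \2 is then used to match the following lines that start with the same whitespace. The whole block of matched lines is captured as group 1.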
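The backreference idea used in both answers can be tried out in Python. The sketch below is an adaptation, not either answer's exact regex: it uses a named group for the indentation and requires a non-whitespace character right after it, so that deeper or shallower lines end the block. The sample data is hypothetical, since the tabs of the original example did not survive, and it assumes the names are indented with tabs or spaces and that lines end in \n.

```python
import re

# "indent" captures the leading whitespace of the first name line;
# (?P=indent)\S demands exactly the same indentation on following lines.
MATCH_BLOCK = re.compile(r"Match\n((?P<indent>[ \t]+)\S.*(?:\n(?P=indent)\S.*)*)")

def names_after_match(text):
    """Return one list of names per 'Match' block."""
    blocks = []
    for m in MATCH_BLOCK.finditer(text):
        blocks.append([line.strip() for line in m.group(1).splitlines()])
    return blocks

# Hypothetical sample with an assumed tab structure.
SAMPLE = (
    "Connection\n"
    "\tMatch\n"
    "\t\tFridolin\n"
    "\t\tMarten\n"
    "\tConnection\n"
    "Inventory\n"
    "\tFill Up\n"
    "\tMatch\n"
    "\t\tPeter\n"
    "\t\tMarcus\n"
    "Storage\n"
    "\tRoom 1\n"
)
print(names_after_match(SAMPLE))
```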

Basic Regular expression for matching all lines except given set of lines

Can someone explain how to use a basic regular expression (no extensions such as lookahead, please) to match all content except a given set of lines?
For example if I want to match everything in content except first three lines, I can think of doing this in two steps:
(.*\n){3} matches first three lines
Match everything except lines matched in last step
I tried expression like:
[^(.*\n){3}].*\n
But this isn't working.
How to do the second step ?
This is basic to me:
^(?:.*\n){3}\K[\w\W]*
See proof.
EXPLANATION
--------------------------------------------------------------------------------
^ the beginning of the string
--------------------------------------------------------------------------------
(?: group, but do not capture (3 times):
--------------------------------------------------------------------------------
.* any character except \n (0 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
\n '\n' (newline)
--------------------------------------------------------------------------------
){3} end of grouping
--------------------------------------------------------------------------------
\K reset the start of the reported match (what was matched before \K is not included)
--------------------------------------------------------------------------------
[\w\W]* any character of: word characters (a-z, A-
Z, 0-9, _), non-word characters (all but a-
z, A-Z, 0-9, _) (0 or more times (matching
the most amount possible))
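\K is a Perl/PCRE feature, and Python's built-in re module does not support it. As a rough equivalent, a capturing group can take over the role of "everything after \K". The sketch below assumes \n line endings; the function name and the parameter n are made up for the illustration.

```python
import re

def after_first_lines(text, n=3):
    """Return everything after the first n lines, or '' if there are fewer."""
    # (?:.*\n){n} consumes the first n lines; group 1 holds the rest.
    m = re.match(r"(?:.*\n){%d}([\w\W]*)" % n, text)
    return m.group(1) if m else ""

print(after_first_lines("a\nb\nc\nd\ne\n"))
```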

How can I only match a leading space followed by a non numeric in Regex

I need some assistance, I have been at this for hours now. I am not winning.
I need to match a space only if it's followed by a non-numeric character (which I will replace with blank to remove it from the string).
I have tried this ^[^\s+]+\D and it works to some extent.
If I have the string " JLABCD-1 836397-BTD56517", it correctly returns the string without the leading space, which is what I want: "JLABCD-1 836397-BTD56517"
If I have " BefhMS JLZARL-1 836397-BTD56517", it returns this: "JLZARL-1 836397-BTD56517"
But if I don't have a space before the first word, I want it to ignore all other spaces.
If I have "_JLABCD-1 836397-BTD56517", I want to return "JLABCD-1 836397-BTD56517" or the original string as it is. Not "836397-BTD56517" which is what I am getting at the moment.
Is this possible with Regex?
Use a look ahead:
"^ +(?=\D)"
but it seems you just want to match any leading spaces. If so, just use:
"^ +"
The negated (due to its first character being ^) character class [^\s+] in your regex matches anything not whitespace or a +.
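A quick way to check the lookahead version against the strings from the question is a small Python sketch (the function name is made up for the illustration):

```python
import re

def strip_leading_spaces(s):
    """Remove leading spaces, but only if a non-digit follows them."""
    return re.sub(r"^ +(?=\D)", "", s)

for s in (" JLABCD-1 836397-BTD56517",
          " BefhMS JLZARL-1 836397-BTD56517",
          "_JLABCD-1 836397-BTD56517"):
    print(repr(strip_leading_spaces(s)))
```

One detail worth knowing: with several spaces in front of a digit, backtracking lets the lookahead accept a space as the non-digit, so all but the last space are removed ("  123" becomes " 123").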
Use
^\s+(\D)
Replace with $1, it is a backreference to the capturing group (\D). Or \1 if $1 does not work.
See proof.
Explanation
--------------------------------------------------------------------------------
^ the beginning of the string
--------------------------------------------------------------------------------
\s+ whitespace (\n, \r, \t, \f, and " ") (1 or
more times (matching the most amount
possible))
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
\D non-digits (all but 0-9)
--------------------------------------------------------------------------------
) end of \1

a regex for cleaning quotes between quotes

I'm trying to write a regex that clears double quotes inside double quotes of a shortcode attribute.
I wrote this regex
\="(.*?)\"
and it matches the string between quotes http://regex101.com/r/jW0uC4
But when I have attribute value that also contains double quotes it fails http://regex101.com/r/pL9bI0
So, how can I improve the regex so that it captures only the string between =" and the last "?
Thanks in advance
This regex matches the sample text you provided:
/="(.*?)"(?=\s*(?:[a-z]+=|]))/
Explanation:
=" '="'
( group and capture to \1:
.*? any character except \n (0 or more times
(matching the least amount possible))
) end of \1
" '"'
(?= look ahead to see if there is:
\s* whitespace (\n, \r, \t, \f, and " ") (0
or more times (matching the most amount
possible))
(?: group, but do not capture:
[a-z]+ any character of: 'a' to 'z' (1 or
more times (matching the most amount
possible))
= '='
| OR
] ']'
) end of grouping
) end of look-ahead
But user errors are hard to fix and this regex may not work in all cases (for example if text contains an = character). You should make sure user input is escaped properly.
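To see how the lookahead version could be used for the actual cleanup, here is a minimal Python sketch. The shortcode string, the attribute names and the function name are hypothetical (the original samples are only available behind the regex101 links), and removing the inner quotes is just one possible cleanup.

```python
import re

# The closing quote only counts if it is followed by the next
# attribute (name=) or by the closing ] of the shortcode.
ATTR = re.compile(r'="(.*?)"(?=\s*(?:[a-z]+=|\]))')

def strip_inner_quotes(shortcode):
    """Remove double quotes found inside attribute values."""
    return ATTR.sub(lambda m: '="' + m.group(1).replace('"', "") + '"', shortcode)

# Hypothetical shortcode with quotes inside the first attribute value.
print(strip_inner_quotes('[quote text="He said "hello" to me" author="Bob"]'))
```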

I can't find proper regexp

I have the following file (it follows this scheme, but is much longer):
LSE ZTX
SWX ZURN
LSE ZYT
NYSE CGI
There are 2 words (e.g. LSE ZTX) in every line, with optional spaces and/or tabs at the beginning and at the end, and always some between the words.
Could someone help me to match these 2 words with regexp? Following the example I wish to have LSE in $1 and ZTX in $2 for the first line, SWX in $1 and ZURN in $2 for the second etc.
I have tried something like:
$line =~ /(\t|\s)*?(.*?)(\t|\s)*?(.*?)/msgi;
$line =~ /[\t*\s*]?(.*?)[\t*\s*]?(.*?)/msgi;
I don't know how to express that there could be either spaces or tabs (or both of them mixed, e.g. \t\s\t).
If you want to just match the two first words, the most basic thing is to just match any sequence of characters that are not whitespace:
my ($word1, $word2) = $line =~ /\S+/g;
This will capture the first two words in $line into the variables, if they exist. Note that parentheses are not required when using the /g modifier. Use an array instead if you want to capture all existing matches.
Since there are always two words, you don't need to match the entire line, so the simplest regex would be:
/(\w+)\s+(\w+)/
I think this is what you want
^\s*([A-Z]+)\s+([A-Z]+)
See it here on Regexr. You will find the first code of a row in group 1 and the second in group 2. \s is a whitespace character; it includes e.g. spaces, tabs and newline characters.
In Perl it is something like this:
($code1, $code2) = $line =~ /^\s*([A-Z]+)\s+([A-Z]+)/i;
I think you are reading the text file row by row, so you don't need the modifiers s and m, and g is also not needed.
In case the codes are not only ASCII letters, then replace [A-Z] with \p{L}. \p{L} is a Unicode property that will match every letter in every language.
\s also matches tabs, so your regex could look like:
$line =~ /^\s*([A-Z]+)\s+([A-Z]+)/;
the first word is in the first group ($1) and the second in $2.
You can change [A-Z] to whatever's more convenient with your needs.
Here is the explanation from YAPE::Regex::Explain
The regular expression:
(?-imsx:^\s*([A-Z]+)\s+([A-Z]+))
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
^ the beginning of the string
----------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0 or
more times (matching the most amount
possible))
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
[A-Z]+ any character of: 'A' to 'Z' (1 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
\s+ whitespace (\n, \r, \t, \f, and " ") (1 or
more times (matching the most amount
possible))
----------------------------------------------------------------------
( group and capture to \2:
----------------------------------------------------------------------
[A-Z]+ any character of: 'A' to 'Z' (1 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
) end of \2
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
With option "Multiline" this Regex:
^\s*(?<word1>\S+)\s+(?<word2>\S+)\s*$
Will give you N matches each containing 2 groups named:
- word1
- word2
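The same idea can be sketched in Python, with two adaptations that are assumptions of this example rather than part of the answer: Python's re module spells named groups (?P<name>...), and [ \t] is used instead of \s so that a match cannot spill over a line break in multiline mode. The function name and the sample text are made up.

```python
import re

PAIR = re.compile(
    r"^[ \t]*(?P<word1>\S+)[ \t]+(?P<word2>\S+)[ \t]*$",
    re.MULTILINE,
)

def word_pairs(text):
    """Return one (word1, word2) tuple per matching line."""
    return [(m.group("word1"), m.group("word2")) for m in PAIR.finditer(text)]

# Hypothetical input mixing spaces and tabs around the codes.
text = "  LSE ZTX\nSWX\tZURN \n LSE ZYT\nNYSE CGI\n"
print(word_pairs(text))
```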
^\s*([A-Z]{3,4})\s+([A-Z]{3,4})\s*$
What this does
^ // Matches the beginning of a string
\s* // Matches whitespace (e.g. spaces or tabs) zero or more times
([A-Z]{3,4}) // Matches 3 or 4 letters A-Z and captures them to $1
\s+ // Then matches at least one tab or space
([A-Z]{3,4}) // Matches 3 or 4 letters A-Z and captures them to $2
\s* // Allows the optional trailing spaces or tabs mentioned in the question
$ // Matches the end of a string
You can use split here:
use strict;
use warnings;
while (<DATA>) {
my ( $word1, $word2 ) = split;
print "($word1, $word2)\n";
}
__DATA__
LSE ZTX
SWX ZURN
LSE ZYT
NYSE CGI
Output:
(LSE, ZTX)
(SWX, ZURN)
(LSE, ZYT)
(NYSE, CGI)
Assuming the spaces at the start of the line are what you use to identify your codes you want, try this:
Split your string up at newlines, then try this regex:
^\s+(\w+\s+){2}$
This will only match lines that start with some space, followed by a (word - some space - word), then end with some space.
# ^ --> String start
# \s+ --> One or more whitespace characters
# (\w+\s+){2} --> A (word followed by some space)x2
# $ --> String end.
However, if you want to capture the codes alone, try this:
$line =~ /^\s*(\w+)\s+(\w+)/;
# \s* --> Zero or more whitespace,
# (\w+) --> Followed by a word (group #1),
# \s+ --> Followed by some whitespace,
# (\w+) --> Followed by a word (group #2),
This will match all your codes
/[A-Z]+/