Match multiple line text (from 1 to n lines) until certain new line regex - regex

I created regex for matching such pattern:
<some text>
yyyy.MM.dd SOME TEXT decimal decimal
yyyy.MM.dd
some sentence
some sentence
some sentence (there can be from 1 to n comment lines; the block ends at the last line before one that starts with yyyy.MM.dd SOME TEXT decimal decimal)
yyyy.MM.dd SOME TEXT decimal decimal
yyyy.MM.dd
some sentence
some sentence
some sentence
...
<some text>
The regex:
((\d{4}\.\d{2}\.\d{2})\s([a-zA-Z\s]{0,})\s(\-{0,1}((\d{1}\,\d{2})|(\d{1,}\ \d{3}\,\d{2})))\s(\-{0,1}((\d{1}\,\d{2})|(\d{1,}\ \d{3}\,\d{2}))\s)(\d{4}\.\d{2}\.\d{2}))
This matches only the first 2 lines. I can't match the multiline sentences that follow, up to (but excluding) the next yyyy.MM.dd SOME TEXT decimal decimal line.
This is the test data for matching:
2020.11.01 SOME TEXT -17,30 83 016,86
2020.10.30
Some text that should be
matched 20.01.2020 as
multiline text
until now
2020.11.01 SOME TEXT -27,30 81 016,86
2020.10.30
Some text that should be
matched 20.01.2020 as
multiline text
until now
...
it should match like this:
1.
2020.11.01 SOME TEXT -17,30 83 016,86
2020.10.30
Some text that should be
matched 20.01.2020 as
multiline text
until now
2.
2020.11.01 SOME TEXT -27,30 81 016,86
2020.10.30
Some text that should be
matched 20.01.2020 as
multiline text
until now
For me it matches like this:
1.
2020.11.01 SOME TEXT -17,30 83 016,86
2020.10.30
2.
2020.11.01 SOME TEXT -27,30 81 016,86
2020.10.30
How can I match from 1 to many multiline lines WITHOUT 'yyyy.MM.dd SOME TEXT decimal decimal' on the next line?

For the example data, you can match the first 2 lines with a date-like pattern, followed by all the lines that do not start with a date-like pattern.
Note that \d{4}\.\d{2}\.\d{2} does not validate a date itself. To get a more precise match, this page has more detailed examples.
^\d{4}\.\d{2}\.\d{2} .*\r?\n\d{4}\.\d{2}\.\d{2}\b.*(?:\r?\n(?!\d{4}\.\d{2}\.\d{2}\b).*)*
Regex demo
Or, if you first want to match 1 or more lines that start with a date-like pattern, followed by the lines that do not:
^\d{4}\.\d{2}\.\d{2} \S.*(?:\r?\n\d{4}\.\d{2}\.\d{2}\b.*)+(?:\r?\n(?!\d{4}\.\d{2}\.\d{2}\b).*)*
Explanation
^ Start of the string (or of each line when the multiline flag is enabled, which is needed to match every record)
\d{4}\.\d{2}\.\d{2} \S.* Match a date-like pattern followed by a space, at least one non-whitespace char (for SOME TEXT in the example) and the rest of the line
(?:\r?\n\d{4}\.\d{2}\.\d{2}\b.*)+ Repeat 1+ times, matching lines that start with a date-like pattern
(?: Non capture group (to repeat as a whole)
\r?\n Match a newline
(?!\d{4}\.\d{2}\.\d{2}\b) Assert not a datelike format directly to the right
.* If the previous assertion is true, match the whole line
)* Optionally repeat all lines that do not start with a datelike pattern (If there should be at least 1 line, change the quantifier to +)
Regex demo
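As a rough check of the second pattern, here is a minimal Python sketch run against the test data from the question. Python is my choice for illustration only (the question does not name a language), and the names `RECORD`, `text` and `records` are hypothetical. The multiline flag is assumed so that `^` can match at the start of every record.

```python
import re

# Second pattern from the answer, split over three lines for readability.
# re.MULTILINE lets ^ match at the start of every line, not just the string.
RECORD = re.compile(
    r"^\d{4}\.\d{2}\.\d{2} \S.*"               # header line: date, space, text
    r"(?:\r?\n\d{4}\.\d{2}\.\d{2}\b.*)+"       # one or more further date-led lines
    r"(?:\r?\n(?!\d{4}\.\d{2}\.\d{2}\b).*)*",  # lines that do not start with a date
    re.MULTILINE,
)

text = """2020.11.01 SOME TEXT -17,30 83 016,86
2020.10.30
Some text that should be
matched 20.01.2020 as
multiline text
until now
2020.11.01 SOME TEXT -27,30 81 016,86
2020.10.30
Some text that should be
matched 20.01.2020 as
multiline text
until now"""

# Each match is one full record: two date lines plus its comment lines.
records = [m.group(0) for m in RECORD.finditer(text)]
for number, record in enumerate(records, start=1):
    print(f"{number}.")
    print(record)
```

The date inside "matched 20.01.2020 as" does not end a record, because the lookahead is only tested directly after a line break.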

Related

AutoHotkey: How to extract text between two words with multiple occurrences in a large text document

Using Autohotkey, I would like to copy a large text file to the clipboard, extract text between two repeated words, delete everything else, and paste the parsed text. I am trying to do this to a large text file with 80,000+ lines of text where the start and stop words repeat 100s of times.
Any help would be greatly appreciated!
Input Text Example
Delete this text
Delete this text
StartWord
Apples Oranges
Pears Grapes
StopWord
Delete this text
Delete this text
StartWord
Peas Carrots
Peas Carrots
StopWord
Delete this text
Delete this text
Desired Output Text
Apples Oranges
Pears Grapes
Peas Carrots
Peas Carrots
I think I found a regex statement to extract text between two words, but don't know how to make it work for multiple instances of the start and stop words. Honestly, I can't even get this to work.
!c::
Send, ^c
Fullstring = %clipboard%
RegExMatch(Fullstring, "StartWord *\K.*?(?= *StopWord)", TrimmedResult)
Clipboard := %TrimmedResult%
Send, ^v
return
You can start the match at StartWord, and then match all lines that do not start with either StartWord or StopWord
^StartWord\s*\K(?:\R(?!StartWord|StopWord).*)+
^ Start of string (start of each line with the multiline option, which is needed to find every StartWord)
StartWord\s*\K Match StartWord and optional whitespace chars, then forget what is matched so far using \K
(?: Non capture group to repeat as a whole
\R Match a newline
(?!StartWord|StopWord).* Negative lookahead, assert that the line does not start with StartWord or StopWord, then match the whole line
)+ Close the non capture group and repeat 1 or more times to match at least a single line
See a regex demo.
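The same idea can be tried outside AutoHotkey. The sketch below is a Python adaptation, not the answer's exact pattern: Python's `re` module supports neither `\K` nor `\R`, so a capture group stands in for `\K` and `\r?\n` for `\R`. The names `BLOCK`, `text`, `blocks` and `result` are hypothetical.

```python
import re

# Adapted pattern: group 1 captures what \K would have left as the match,
# and \r?\n replaces \R. re.MULTILINE lets ^ match at every line start.
BLOCK = re.compile(
    r"^StartWord\s*((?:\r?\n(?!StartWord|StopWord).*)+)",
    re.MULTILINE,
)

text = """Delete this text
Delete this text
StartWord
Apples Oranges
Pears Grapes
StopWord
Delete this text
Delete this text
StartWord
Peas Carrots
Peas Carrots
StopWord
Delete this text
Delete this text"""

# Every captured block begins with a line break, so strip it before joining.
blocks = [m.group(1).strip("\r\n") for m in BLOCK.finditer(text)]
result = "\n".join(blocks)
print(result)
```

Iterating over all matches is what handles the repeated start and stop words; a single match call would return only the first block.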
This is only slightly different from @Thefourthbird's solution.
You can match the following regular expression with the global, multiline and dot-all flags set (see note 1 below):
^StartWord\R+\K.*?\R(?=\R*^StopWord\R)
Demo
The regular expression can be broken down as follows:
^StartWord     # match 'StartWord' at the beginning of a line
\R+            # match >= 1 line terminators to avoid matching empty lines
               # below
\K             # reset start of match to current location and discard
               # all previously-matched characters
.*?            # match >= 0 characters lazily
\R             # match a line terminator
(?=            # begin a positive lookahead
  \R*          # match >= 0 line terminators to avoid matching empty lines
               # above
  ^StopWord\R  # match 'StopWord' at the beginning of a line followed
               # by a line terminator
)              # end positive lookahead
1. Click on /gms at the link to obtain explanations of the effects of each of the three flags.

Find the first letter and sign of a sentence with Regex

Find the first letter and sign of a sentence with Regex.
At the beginning of the sentence can sometimes be letters and sometimes numbers.
15. Lorem ipsum is placeholder text
B. Lorem ipsum is placeholder text
C.Lorem ipsum is placeholder text
D . Lorem ipsum is placeholder text
E,Lorem ipsum is placeholder text
I wrote something like this:
[\dga-zA-Z.]{1\s}
Demo with regex101
But it doesn't work right for every sentence. Moreover, it does not detect if there is a space between the first letter/digit and the sign with the sentence.
Where am I making a mistake?
Also, in terms of performance, does it make more sense to use regex or plain PHP for such scenarios?
Hello, this matches all of the examples you provided:
([A-Za-z\d ]+)(\.|,)
What this does is the following:
It matches lowercase and uppercase letters, digits or spaces. It should find one or more of those (the + sign).
It should end with a dot or a comma: (\.|,). Note: in regex, the dot must be escaped.
If that doesn't do the trick, comment below.
Edit: demo here: click
The following regex will match a single letter or multiple digits placed at the beginning of a sentence and followed by either a single period or a comma:
^(([a-zA-Z]{1}|[0-9]+)\s*[.,]{1})(.*)$
This is the breakdown:
^ # Asserts position at start of the line
[a-zA-Z]{1}|[0-9]+ # Match a single alphabetic character or one or more digits
\s* # Matches whitespace characters between 0 and unlimited times
[.,]{1} # Matches a single period or comma character literal
.* # Matches the rest of the text
$ # Asserts position at end of the line
Group 1 - will return both the letter/numbers and the period/comma (including potential spaces). This is in case you need to get both for some reason.
Group 2 - will return only the letter or numbers at the start of the sentence, which I assume you'll actually be looking for most of the time.
Group 3 - will return the rest of the text.
The regex will need to be modified depending on what you want, for example if you don't want a match when there are spaces after the letter/digits at the start of the sentence, or if you want to allow more separator characters. Let me know if you have any additional constraints you'd like this regex to conform to.
See the DEMO
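To see what the three groups hold, here is a small Python sketch (Python is an assumption made for illustration; the question itself mentions PHP). It applies the regex above to two of the sample lines. The helper `split_prefix` is a hypothetical name.

```python
import re

# The regex from the answer, unchanged.
PATTERN = re.compile(r"^(([a-zA-Z]{1}|[0-9]+)\s*[.,]{1})(.*)$")

def split_prefix(line):
    """Return (group 1, group 2, group 3) for a line, or None if no match."""
    m = PATTERN.match(line)
    return m.groups() if m else None

with_space = split_prefix("D . Lorem ipsum is placeholder text")
with_digits = split_prefix("15. Lorem ipsum is placeholder text")
print(with_space)
print(with_digits)
```

Group 1 keeps any whitespace between the label and the sign, while group 2 holds the bare label.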
Use: ^[\da-zA-Z]+\h*[.,]
Demo
Explanation:
^ # beginning of line
[\da-zA-Z]+ # 1 or more letter or digit
\h* # 0 or more horizontal spaces
[.,] # a dot or a comma
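A quick way to check this pattern against all five sample lines is sketched below in Python (an assumption for illustration only). Python's `re` module does not support `\h`, so `[ \t]*` stands in for horizontal whitespace, and the two capture groups are added here only to make the label and the sign easy to inspect. `PREFIX`, `lines` and `prefixes` are hypothetical names.

```python
import re

# [ \t]* replaces \h*, which Python's re module does not know.
# The capture groups are an addition for inspection, not part of the answer.
PREFIX = re.compile(r"^([\da-zA-Z]+)[ \t]*([.,])")

lines = [
    "15. Lorem ipsum is placeholder text",
    "B. Lorem ipsum is placeholder text",
    "C.Lorem ipsum is placeholder text",
    "D . Lorem ipsum is placeholder text",
    "E,Lorem ipsum is placeholder text",
]

prefixes = []
for line in lines:
    m = PREFIX.match(line)
    if m:
        # (label, sign) for every line that starts with a label and a sign
        prefixes.append((m.group(1), m.group(2)))
print(prefixes)
```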

Regex to match a pattern followed by some string

I have following text. I want to capture the pattern ddd-dd-ddd followed by all text until I again hit a ddd-dd-ddd.
I am trying to use this regex
\b[0-9]{3}-[0-9]{2}-[0-9]{3}\b.*
It matches 982-99-122 followed by the sentence until it hits a line feed. Then the second number, 586-33-453, is matched, followed by the text on the same line, but it fails to capture the text that continues on the next line.
Or, if I remove the line feeds from this string, it captures only the first number, 982-99-122, together with the whole rest of the string, i.e. it does not match the second number, 586-33-453, separately.
How should I fix both these issues: 1. when line feeds are part of the string, and 2. when the string does not have line feeds?
982-99-122 (FCC 333/22) lube oil service pump 1b discharge lube oil service pump
aaa bb dsdsd
586-33-453 Matches exactly 3 times 0-e single character in the range between
dfldfldflkdf 545-66-666 sdkjsl () jdfkjd-kfdkf sdfl
848-99-040 sdsd"" df
dfdf
It seems you want
\b([0-9]{3}-[0-9]{2}-[0-9]{3})\b([\s\S]*?)(?=\b[0-9]{3}-[0-9]{2}-[0-9]{3}\b|$)
See the regex demo
Details
\b - word boundary
([0-9]{3}-[0-9]{2}-[0-9]{3}) - 3 digits, -, 2 digits, - and 3 digits
\b - word boundary
([\s\S]*?) - Group 2: any 0+ chars, as few as possible
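(?=\b[0-9]{3}-[0-9]{2}-[0-9]{3}\b|$) - a positive lookahead that requires 3 digits, -, 2 digits, - and 3 digits as a whole word, or the end of the string, immediately to the right of the current location. The lookahead must not be made optional with a trailing ?, otherwise the lazy Group 2 would always stay empty.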
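A minimal Python sketch of this approach on the sample text follows. Python is an assumption (the question does not name a language), and the lookahead is written without a trailing `?` so that the lazy group has to run up to the next code or the end of the string. `CODE`, `ENTRY`, `text` and `entries` are hypothetical names.

```python
import re

CODE = r"\b[0-9]{3}-[0-9]{2}-[0-9]{3}\b"
# Group 1: the code. Group 2: everything up to the next code or end of string.
ENTRY = re.compile("(" + CODE + r")([\s\S]*?)(?=" + CODE + r"|$)")

text = (
    "982-99-122 (FCC 333/22) lube oil service pump 1b discharge lube oil service pump\n"
    "aaa bb dsdsd\n"
    "586-33-453 Matches exactly 3 times 0-e single character in the range between\n"
    "dfldfldflkdf 545-66-666 sdkjsl () jdfkjd-kfdkf sdfl\n"
    "848-99-040 sdsd\"\" df\n"
    "dfdf"
)

# [\s\S] matches line breaks too, so the text may span several lines.
entries = [(m.group(1), m.group(2).strip()) for m in ENTRY.finditer(text)]
for code, body in entries:
    print(code, "->", repr(body))
```

Because `[\s\S]` does not care about line feeds, the same pattern covers both cases from the question: text with line feeds and text without them.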

Select digits on the end of line

I need to replace only digits at the end of line with semicolon ; using RegEx in Notepad++.
Before:
ddd 66 ffff 5
d 44 dds 55
After:
ddd 66 ffff;
d 44 dds;
I'm trying to find digits at the end of lines with expression
($)(\d+)
but Notepad++ can't find anything by use of this expression. How to achieve this?
Find:
\s\d+$
Replace:
;
\d+ will match one or more digits. $ will match the end of the line; it is a zero-width assertion, so the line break itself is not consumed and will not be replaced in a find/replace operation. So \d+$ will match one or more digits immediately followed by the end of the line.
I included \s (a single whitespace character) because it looks like you want to replace the space preceding the digits as well.
Note that you will need to do "Replace All" for this to work like you want. (because each regex match is for one instance only)
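Outside Notepad++, the same find/replace can be sketched in Python (an assumption for illustration; the variable names are hypothetical). `re.sub` replaces every match, which corresponds to "Replace All", and `re.MULTILINE` makes `$` match at the end of each line.

```python
import re

text = "ddd 66 ffff 5\nd 44 dds 55"

# Same find/replace as above: a whitespace char plus trailing digits -> ";".
result = re.sub(r"\s\d+$", ";", text, flags=re.MULTILINE)
print(result)
```

The digits in the middle of each line (66 and 44) are left alone because they are not followed by the end of the line.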
Try this find/replace:
find:
^(.*) \d+$
replace:
\1;
The find regex above matches anything up to and excluding a final space followed by at least one digit. If a given line does not end with a space followed by one or more digits, the regex should not match. The replacement is the capture group (what is in parentheses), which is everything up to but excluding the final space and number, followed by a semicolon.

Perl multiline regex for first 3 individual items

I am trying to read a regex format in Perl. Sometimes instead of a single line I also see the format in 3 lines.
For the below single line format I can regex as
/^\s*(.*)\s+([a-zA-Z0-9._]+)\s+(\d+)\s+(.*)/
to get the first 3 individual items in line
Hi There FirstName.LastName 10 3/23/2011 2:46 PM
Below is the multi-line format I see. I am trying to use something like
/^\s*(.*)\n*\n*|\s+([a-zA-Z0-9._]+)\s+(\d+)\s+(.*)$/m
to get the individual items, but it doesn't seem to work.
Hi There
FirstName-LastName 8 7/17/2015 1:15 PM
Testing - 12323232323 Hello There
Any suggestions? Is multi-line regex possible?
NOTE: In the same output I can see either single-line or multi-line entries, or both, so the output can look like this:
Hello Line1 FirstName.LastName 10 3/23/2011 2:46 PM
Hello Line2
Line2FirstName-LastName 8 7/17/2015 1:15 PM
Testing - 12323232323 Hello There
Hello Line3 Line3FirstName.LastName 8 3/21/2011 2:46 PM
You can for sure apply regex over multiple lines.
I've used the negated word class \W+ between words to match spaces and newlines between words (\W is equal to [^a-zA-Z0-9_]).
The chat is viewed as a repeated \w+\W+ block.
If you provide more specific input/output cases, I can refine the example code:
#!/usr/bin/env perl
use strict;
use warnings;
my $input = <<'__END__';
Hi There
FirstName-LastName 8 7/17/2015 1:15 PM
Testing - 12323232323 Hello There
__END__
my ($chat,$username,$chars,$timestamp) = $input =~ m/(?im)^\s*((?:\w+\W+)+)(\w+[-,\.]\w+)\W+(\d+)\W+([0-1]?\d\/[0-3]?\d\/[1-2]\d{3}\s+[0-2]?\d:[0-5]?\d\s?[ap]m)/;
$chat =~ s/\s+$//; #remove trailing spaces
print "chat -> ${chat}\n";
print "username -> ${username}\n";
print "chars -> ${chars}\n";
print "timestamp -> ${timestamp}\n";
Legend
m/^.../: a match operator (not a substitution), anchored at the start of a line
(?im): case-insensitive and multiline modes (^ and $ also match at the start and end of each line)
\s*: match zero or more whitespace chars (spaces, tabs, line breaks or form feeds)
((?:\w+\W+)+): (match group $chat) one or more repetitions of a word \w+ (letters, digits, '_') followed by non-word characters \W+ (everything that is not \w, including the newline \n). Trailing whitespace is stripped afterwards.
(\w+[-,\.]\w+): (match group $username) this is the weak point. If the username is not made of two regex words separated by a dash '-', a comma ',' or (UPDATE) a dot '.', the whole regex will not work properly (I took the separators from the examples in your question, since the format is not specified directly).
(\d+): (match group $chars) a number composed of one or more digits
([0-1]?\d\/[0-3]?\d\/[1-2]\d{3}\s+[0-2]?\d:[0-5]?\d\s?[ap]m): (match group $timestamp) this is longer than the others, so it is split up:
[0-1]?\d\/[0-3]?\d\/[1-2]\d{3} matches a date composed of a month (with an optional leading zero), a day (with an optional leading zero) and a year from 1000 to 2999 (a relaxed constraint :)
[0-2]?\d:[0-5]?\d\s?[ap]m matches the time: hour:minutes, an optional space and 'pm, PM, am, AM, Am, Pm...' thanks to the case-insensitive modifier above
You can test it online here
Your regex says:
^\s*(.*)\n*\n* # line starts with optional space followed by anything
| # or
\s+([a-zA-Z0-9._]+)\s+(\d+)\s+(.*)$ # spaces followed by any words followed by spaces, digits, spaces, anything at the end of the line
Consider this:
/^From|To$/
Alternation has the lowest precedence, so it splits the entire pattern, anchors included, into alternatives.
The above is really saying: find a line that starts with 'From', or a line that ends with 'To'.
Compare to this:
/^(From|To)$/
Above will find lines that only have 'From' or 'To'
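The difference between the two patterns can be checked with a few test strings. The sketch below uses Python rather than Perl so it can be run on its own (an assumption for illustration; alternation precedence is the same in both). `samples`, `loose` and `strict` are hypothetical names.

```python
import re

samples = ["From", "To", "From here", "Go To", "Fro", "Tom"]

# Without grouping, | splits the whole pattern: "^From" OR "To$".
loose = [s for s in samples if re.search(r"^From|To$", s)]

# With grouping, both anchors apply to each alternative.
strict = [s for s in samples if re.search(r"^(From|To)$", s)]

print(loose)
print(strict)
```

The same applies to the pattern in the question: its top-level | makes everything before it one alternative and everything after it another, so the capture groups on the right-hand side stay empty whenever the left-hand side matches.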