regex: extract text blocks, defined beginning, undefined end

regex: extract text blocks, defined beginning, undefined end - regex

i have text like this:
Date: 01.02.2015 //<-stable format
something
something more
some random more
Date: 02.02.2015
something random
i dont know
so i have many such blocks. Starts with Date... ends with next Date... start.
The text in the lines in the block could be anything, but not Date... format
I need an array at the end, with such blocks:
array[0] = "Date: 01.02.2015
something
something more
some random more"
array[1] = "Date: 02.02.2015
something random
i dont know"
for now i add some unique splitter before Date... than split by the splitter.
Question: is it possible to get such blocks only by regex?
(i use VBA to parse the text, RegExp object)

Instead of split just match using
\bDate:\s\d{1,2}\.\d{1,2}\.\d{4}[\s\S]*?(?=\nDate:|$)
See demo.
https://regex101.com/r/uF4oY4/77
Syntax explanation (from the linked site):
\b assert position at a word boundary: (^\w|\w$|\W\w|\w\W)
Date: matches the characters Date: literally (case sensitive)
\s matches any whitespace character (equal to [\r\n\t\f\v ])
\d{1,2} matches a digit (equal to [0-9]) between 1 and 2 times, as many times as possible, giving back as needed (greedy)
. matches the character . literally (case sensitive)
\d{1,2} matches a digit (equal to [0-9]) between 1 and 2 times, as many times as possible, giving back as needed (greedy)
. matches the character . literally (case sensitive)
\d{4} matches a digit (equal to [0-9]) exactly 4 times
\s matches any whitespace character (equal to [\r\n\t\f\v ])
\S matches any non-whitespace character (equal to [^\r\n\t\f\v ])
*? Quantifier — Matches between zero and unlimited times, as few times as possible, expanding as needed (lazy) , what specified in previous brackets
?= Positive Lookahead - Assert that the following Regex matches
\nDate Option 1
\n matches a line-feed (newline) character (ASCII 10)
Date matches the characters Date: literally (case sensitive)
$: Option 2 - $ asserts position at the end of the string, or before the line terminator right at the end of the string (if any)

Related

Validating emails in file with batch

I have a file with emails and I need to validate them.
The sequence is:
First name.
Dot.
Last name.
Number (optional - for same names).
static string domain(#utp.ac.pa).
I wrote this:
egrep -E [a-z]\.+[a-z][0-9]*#["utp.ac.pa"] test.txt
It should match this email: "anell.zheng#utp.ac.pa"
But it is also matching:
test4#utp.ac.pa
2anell#utp.ac.pa
Although they don't follow the sequence. What am I doing wrong?

Your regex doesn't even match the first email. If I understand your requirements correctly, this should work:
[A-Za-z]+\.[A-Za-z]+[0-9]*#utp\.ac\.pa
Note that to match a dot, it needs to be escaped (i.e., \.) because . matches any character.
You can get rid of A-Z if you don't want to match upper-case letters.
Try it online.
Let me know if this isn't what you want.

Regex: ^[A-Za-z]+\.[A-Za-z]+(?:_\d+)*#utp\.ac\.pa$
Demo
Regex Details:
^ asserts position at start of a line
Match a single character present in the list below [A-Za-z]+
. matches the character . literally (case sensitive)
Match a single character present in the list below [A-Za-z]+
Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
Non-capturing group (?:_\d+)*
Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
_ matches the character _ literally (case sensitive)
\d+ matches a digit (equal to [0-9])
Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
#utp matches the characters #utp literally (case sensitive)
. matches the character . literally (case sensitive)
ac matches the characters ac literally (case sensitive)
. matches the character . literally (case sensitive)
pa matches the characters pa literally (case sensitive)
$ asserts position at the end of a line

BASH -- grep'ing for perl-regex [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 5 years ago.
Beginner here and I'm trying to understand this. Can someone please break down the part in between the single quotes and describe what it does?
grep -oP '(?<=\S\/1\.\d.\s)[345]\d+'
Many thanks in advance!

Positive Lookbehind (?<=\S/1.\d.\s) Assert that the Regex below matches
\S matches any non-whitespace character (equal to [^\r\n\t\f\v ])
\/ matches the character / literally (case sensitive)
1 matches the character 1 literally (case sensitive)
\. matches the character . literally (case sensitive)
\d matches a digit (equal to [0-9])
. matches any character (except for line terminators)
\s matches any whitespace character (equal to [\r\n\t\f\v ])
Match a single character present in the list below [345]
345 matches a single character in the list 345 (case sensitive)
\d+ matches a digit (equal to [0-9])
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
Output simply copied from https://regex101.com/r/HfJSNm/1 : very handy to test/share/have automatic explications on regexes.

Regex / grep match on: "not this" and "that"

From a Linux command line, I would like to find all the instances in multiple files where I do not reference a figure reference with Fig..
So I'm looking each line for when I don't preface \ref{fig with exactly Fig. .
Fig. \ref{fig:myFigure}
A sentence with Fig. \ref{fig:myFigure} there.
\ref{fig:myFigure}
A sentence with \ref{fig:myFigure} there.
The regex should ignore cases (1) and (2), but find cases (3) and (4).

You can use Negative Lookahead like:
^((?!Fig\. {0,1}\\ref\{fig).)*$
https://regex101.com/r/wSw9iI/2
Negative Lookahead (?!Fig\.\s*\\ref\{fig)
Assert that the Regex below does not match
Fig matches the characters Fig literally (case sensitive)
\. matches the character . literally (case sensitive)
\s* matches any whitespace character (equal to [\r\n\t\f\v ])
* Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
\\ matches the character \ literally (case sensitive)
ref matches the characters ref literally (case sensitive)
\{ matches the character { literally (case sensitive)
fig matches the characters fig literally (case sensitive)

Use RegEx to match URL [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.
Closed 7 years ago.
Improve this question
I'm writing a regex to pull a URL out of an auto-generated email from my monitoring system. For example:
https://mon.contoso.com/mon/call.py?fn=edit&num=1389896156
I need a regex to match:
https://mon.contoso.com/mon/call.py?fn=edit&num=XXXXXXXXX
whereby the "x"'s always change. I run into an issue with the "?". The point of this is to append the URL to a field in JIRA.

Pattern p = new Pattern("https://mon.contoso.com/mon/call.py?fn=edit&num=(\d+)")
Matcher m = p.matcher(inputEmail);
return m.matches() ? m.group(1) : "";
This returns num if it is numeric, otherwise you might want to use \w instead of \d. If you want the whole URL, remove the group() parameter.

You don't indicate what language you're working in.
In Python and JavaScript, this regex will identify a variety of URLs:
/\[[^\]\n]+\](?:\([^\)\n]+\)|\[[^\]\n]+\])|(?:\/\w+\/|.:\\|\w*:\/\/|\.+\/[./\w\d]+|(?:\w+\.\w+){2,})[./\w\d:/?#\[\]#!$&'()*+,;=\-~%]*/gi
You can refer to this regex101 test for examples of the regex in use.
Explanation:
/\[[^\]\n]+\](?:\([^\)\n]+\)|\[[^\]\n]+\])|(?:\/\w+\/|.:\\|\w*:\/\/|\.+\/[./\w\d]+|(?:\w+\.\w+){2,})[./\w\d:/?#\[\]#!$&'()*+,;=\-~%]*/gi
1st Alternative: \[[^\]\n]+\](?:\([^\)\n]+\)|\[[^\]\n]+\])
\[ matches the character [ literally
[^\]\n]+ match a single character not present in the list below
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
\] matches the character ] literally
\n matches a line-feed (newline) character (ASCII 10)
\] matches the character ] literally
(?:\([^\)\n]+\)|\[[^\]\n]+\]) Non-capturing group
1st Alternative: \([^\)\n]+\)
\( matches the character ( literally
[^\)\n]+ match a single character not present in the list below
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
\) matches the character ) literally
\n matches a line-feed (newline) character (ASCII 10)
\) matches the character ) literally
2nd Alternative: \[[^\]\n]+\]
\[ matches the character [ literally
[^\]\n]+ match a single character not present in the list below
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
\] matches the character ] literally
\n matches a line-feed (newline) character (ASCII 10)
\] matches the character ] literally
2nd Alternative: (?:\/\w+\/|.:\\|\w*:\/\/|\.+\/[./\w\d]+|(?:\w+\.\w+){2,})[./\w\d:/?#\[\]#!$&'()*+,;=\-~%]*
(?:\/\w+\/|.:\\|\w*:\/\/|\.+\/[./\w\d]+|(?:\w+\.\w+){2,}) Non-capturing group
1st Alternative: \/\w+\/
\/ matches the character / literally
\w+ match any word character [a-zA-Z0-9_]
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
\/ matches the character / literally
2nd Alternative: .:\\
. matches any character (except newline)
: matches the character : literally
\\ matches the character \ literally
3rd Alternative: \w*:\/\/
\w* match any word character [a-zA-Z0-9_]
Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
: matches the character : literally
\/ matches the character / literally
\/ matches the character / literally
4th Alternative: \.+\/[./\w\d]+
\.+ matches the character . literally
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
\/ matches the character / literally
[./\w\d]+ match a single character present in the list below
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
./ a single character in the list ./ literally
\w match any word character [a-zA-Z0-9_]
\d match a digit [0-9]
5th Alternative: (?:\w+\.\w+){2,}
(?:\w+\.\w+){2,} Non-capturing group
Quantifier: {2,} Between 2 and unlimited times, as many times as possible, giving back as needed [greedy]
\w+ match any word character [a-zA-Z0-9_]
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
\. matches the character . literally
\w+ match any word character [a-zA-Z0-9_]
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
[./\w\d:/?#\[\]#!$&'()*+,;=\-~%]* match a single character present in the list below
Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
./ a single character in the list ./ literally
\w match any word character [a-zA-Z0-9_]
\d match a digit [0-9]
:/?# a single character in the list :/?# literally
\[ matches the character [ literally
\] matches the character ] literally
#!$&'()*+,;= a single character in the list #!$&'()*+,;= literally (case insensitive)
\- matches the character - literally
~% a single character in the list ~% literally
g modifier: global. All matches (don't return on first match)
i modifier: insensitive. Case insensitive match (ignores case of [a-zA-Z])

Regex to match URL end-of-line or "/" character

I have a URL, and I'm trying to match it to a regular expression to pull out some groups. The problem I'm having is that the URL can either end or continue with a "/" and more URL text. I'd like to match URLs like this:
http://server/xyz/2008-10-08-4
http://server/xyz/2008-10-08-4/
http://server/xyz/2008-10-08-4/123/more
But not match something like this:
http://server/xyz/2008-10-08-4-1
So, I thought my best bet was something like this:
/(.+)/(\d{4}-\d{2}-\d{2})-(\d+)[/$]
where the character class at the end contained either the "/" or the end-of-line. The character class doesn't seem to be happy with the "$" in there though. How can I best discriminate between these URLs while still pulling back the correct groups?

To match either / or end of content, use (/|\z)
This only applies if you are not using multi-line matching (i.e. you're matching a single URL, not a newline-delimited list of URLs).
To put that with an updated version of what you had:
/(\S+?)/(\d{4}-\d{2}-\d{2})-(\d+)(/|\z)
Note that I've changed the start to be a non-greedy match for non-whitespace ( \S+? ) rather than matching anything and everything ( .* )

You've got a couple regexes now which will do what you want, so that's adequately covered.
What hasn't been mentioned is why your attempt won't work: Inside a character class, $ (as well as ^, ., and /) has no special meaning, so [/$] matches either a literal / or a literal $ rather than terminating the regex (/) or matching end-of-line ($).

/(.+)/(\d{4}-\d{2}-\d{2})-(\d+)(/.*)?$
1st Capturing Group (.+)
.+ matches any character (except for line terminators)
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
2nd Capturing Group (\d{4}-\d{2}-\d{2})
\d{4} matches a digit (equal to [0-9])
{4} Quantifier — Matches exactly 4 times
- matches the character - literally (case sensitive)
\d{2} matches a digit (equal to [0-9])
{2} Quantifier — Matches exactly 2 times
- matches the character - literally (case sensitive)
\d{2} matches a digit (equal to [0-9])
{2} Quantifier — Matches exactly 2 times
- matches the character - literally (case sensitive)
3rd Capturing Group (\d+)
\d+ matches a digit (equal to [0-9])
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
4th Capturing Group (.*)?
? Quantifier — Matches between zero and one times, as many times as possible, giving back as needed (greedy)
.* matches any character (except for line terminators)
* Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
$ asserts position at the end of the string

In Ruby and Bash, you can use $ inside parentheses.
/(\S+?)/(\d{4}-\d{2}-\d{2})-(\d+)(/|$)
(This solution is similar to Pete Boughton's, but preserves the usage of $, which means end of line, rather than using \z, which means end of string.)

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

regex: extract text blocks, defined beginning, undefined end - regex

Related

Validating emails in file with batch

BASH -- grep'ing for perl-regex [duplicate]

Regex / grep match on: "not this" and "that"

Use RegEx to match URL [closed]

Regex to match URL end-of-line or "/" character

Categories

Resources