complex regular expression question on stop set [duplicate] - regex

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 5 years ago.
What regular expression would find a header that starts with a number, such as "1. Humility"?
Here's a screenshot of the sample data: http://www.knowledgenotebook.com/issue/sampleData.html
Thanks.

I don't know which regex flavor you're using, so I'll assume it's Perl-compatible.
You should always post some example data in case your understanding of the regex is unclear.
Breaking down what your 'stop signs' are:
## left out of regex, this could be anything up here
##
(?: # Start of non-capture group START sign
\d+\. # 1 or more digits followed by '.'
| # or
\(\d+\) # '(' followed by 1 or more digits followed by ')'
# note that \( could be the start of capture group 1 in bizarro world (e.g. POSIX basic syntax)
) # End group
\s? # 0 or 1 whitespace (includes \n)
[^\n<]+ # 1 or more characters that are neither \n nor '<' (the STOP signs)
It seems you want all characters after the group, up to but not including the
very next \n or the very next '<'. In that case you should get rid of the \s?:
because \s includes newline, if it matches a newline there, the match will carry on
into the following line until [^\n<]+ is satisfied.
(?:\d+\.|\(\d+\))[^\n<]+
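To see what dropping \s? changes, both patterns can be run over a small sample. This is only a sketch using Python's re module; the sample text is made up for illustration and is not the asker's data.

```python
import re

# Hypothetical sample text (an assumption, not the asker's data).
text = "1. Humility<br>\nSome body text\n(2) Courage\nMore body text\n3.\nOrphaned line\n"

# Original pattern: \s? may swallow a newline right after the number.
with_ws = re.compile(r"(?:\d+\.|\(\d+\))\s?[^\n<]+")
# Suggested pattern: the header text must follow on the same line.
without_ws = re.compile(r"(?:\d+\.|\(\d+\))[^\n<]+")

print(with_ws.findall(text))     # the lone "3." drags in the following line
print(without_ws.findall(text))  # the lone "3." is not matched at all
```

With \s? in place, a number sitting alone on its line pulls the next line into the match; without it, only headers with text on the same line are found.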
Edit - After viewing your sample, it appears that you are searching unrendered HTML
pasted into HTML content. In that case the header appears to be:
'1. Self-Knowledge&lt;br&gt;' which, when the entities are converted, would be
1. Self-Knowledge<br>
Self-Knowledge
Superior leadership ...
You can add the entity to the mix so that all your bases are covered (i.e. entity, \n, <):
((?:\d+\.|\(\d+\)))[^\S\n]+((?:(?!&lt;|[\n<]).)+)
Where:
Capture group 1 = '1.'
Capture group 2 = 'Self-Knowledge'
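The two capture groups can be checked with a short Python sketch. It assumes the entity meant above is `&lt;` (the escaped form of '<' in unrendered HTML); the helper name `split_header` and the sample strings are hypothetical.

```python
import re

# Assumption: the "entity" stop sign is "&lt;", alongside \n and a literal '<'.
HEADER = re.compile(r"((?:\d+\.|\(\d+\)))[^\S\n]+((?:(?!&lt;|[\n<]).)+)")

def split_header(s):
    """Return (number, title) for the first header found in s, or None."""
    m = HEADER.search(s)
    return (m.group(1), m.group(2)) if m else None

print(split_header("1. Self-Knowledge&lt;br&gt;"))
print(split_header("(2) Courage<br>"))
print(split_header("3. Patience\nnext line"))
```

Each of the three stop signs (entity, '<', newline) ends the title in turn.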
Other than that, I don't know what it could be.


Regex in Notepad++ to remove certain CRLFs

Given this sample data:
00-1234T|`CRLF`
Data|Commments|`CRLF`
12-3456|Some data|Notes|`CRLF`
65-8436ZZ|Data|`CRLF`
|`CRLF`
45-4576AA|Some data|Comments|`CRLF`
98-4392REV|Data|`CRLF`
|`CRLF`
00-5432|Some Data|Some Comments|
(I added the "CRLF"s to each line to more clearly illustrate what is there and what needs to be replaced)
Each record should only have three pipes in a line, with a CRLF after the third pipe. So lines 1, 4, and 7 (pre-find/replace) need to be fixed, which means any CRLF before the third pipe needs to be replaced with a placeholder, "#CRLF#".
The closest I've been able to come up with is ^((?:[^\v|]*\|){3})(.+), which will match (highlight) lines 3 & 4, 6 & 7, and 9 & 10. My expectation (requirement) is to find the CRLFs in lines 2, 5, & 8 and replace those with "#CRLF#".
[UPDATE]
After sleeping on this question, I woke up realizing that, for the purpose of more accurately finding the beginning of a given record - whether on one line or multiple - I should add that the first column will always start with the pattern [0-9][0-9]-[0-9][0-9][0-9][0-9] and possibly have up to three alphanumeric characters after that.
I modified the sample data above to reflect that.
Ctrl+H
Find what: \R(?!\d\d-\d{4}\w{0,3}\|)
Replace with: #CRLF#
CHECK Wrap around
CHECK Regular expression
Replace all
Explanation:
\R # any kind of line break (i.e. \r, \n, \r\n);
if you want to match only Windows EOL, use \r\n
(?! # negative lookahead, make sure what follows is not:
\d\d-\d{4} # 2 digits, a dash, 4 digits
\w{0,3} # 0 to 3 word characters
\| # a pipe
) # end lookahead
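Outside Notepad++, the same idea can be tried in Python. This is a sketch, not part of the original answer: Python's re module has no \R, so \r\n is spelled out, and the sample data from the question is rebuilt with real CRLFs.

```python
import re

# The question's sample data, with the `CRLF` markers turned into real \r\n.
data = (
    "00-1234T|\r\n"
    "Data|Commments|\r\n"
    "12-3456|Some data|Notes|\r\n"
    "65-8436ZZ|Data|\r\n"
    "|\r\n"
    "45-4576AA|Some data|Comments|\r\n"
    "98-4392REV|Data|\r\n"
    "|\r\n"
    "00-5432|Some Data|Some Comments|"
)

# A CRLF that is NOT followed by the start of a new record (NN-NNNN, up to 3 word chars, pipe).
stray_crlf = re.compile(r"\r\n(?!\d\d-\d{4}\w{0,3}\|)")
fixed = stray_crlf.sub("#CRLF#", data)

for record in fixed.split("\r\n"):
    print(record)
```

Every remaining line then holds one complete record with three pipes.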
This should get you started.
The regex just captures the parts between pipes then re-writes on the substitution.
Any CRLFs are not captured and get stripped out.
But this is very simplistic and may need to change if your input is any more complex.
(?m)^([^|\r\n]*)[\r\n]*\|[\r\n]*([^|\r\n]*)[\r\n]*\|[\r\n]*([^|\r\n]*)[\r\n]*\|[\r\n]*
Replace using: $1|$2|$3|\n
https://regex101.com/r/WzDLwf/1
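As a rough check of this capture-and-rewrite approach, the same pattern and replacement can be run with Python's re module (Python writes the replacement as \1|\2|\3|\n instead of $1|$2|$3|\n). The input is a shortened copy of the question's sample; note that this variant drops the stray CRLFs rather than inserting a placeholder.

```python
import re

# First records of the question's sample data, with real CRLFs.
data = (
    "00-1234T|\r\nData|Commments|\r\n"
    "12-3456|Some data|Notes|\r\n"
    "65-8436ZZ|Data|\r\n|\r\n"
)

# Three pipe-terminated fields; line breaks around the pipes are matched but not captured.
three_fields = re.compile(
    r"(?m)^([^|\r\n]*)[\r\n]*\|[\r\n]*([^|\r\n]*)[\r\n]*\|[\r\n]*([^|\r\n]*)[\r\n]*\|[\r\n]*"
)
rewritten = three_fields.sub(r"\1|\2|\3|\n", data)
print(rewritten)
```

Empty fields survive the rewrite, so the third record keeps its empty last column.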
updated answer
To answer your updated question, if you need to make it like mail merge,
it could also be done like this (as an alternative to Toto's method).
(?m)
(?:
^ \d{2} - \d{4} [^|\r\n]* \|
| \G
)
(?: [^|\r\n]* \| )*
\K
[\r\n]+ (?! [\r\n]* (?: ^ \d{2} - \d{4} | $ ) )
https://regex101.com/r/qK4SJP/1

Get data between two pipes present in payload [duplicate]

This question already has answers here:
regex to match substring after nth occurence of pipe character
(3 answers)
Closed 2 years ago.
I have recently started learning regex in Ruby and I want to extract specific data from a payload.
My payload looks something like this:
2021-02-01T16:06:06.703Z CEF:0|ABCD|Sample text|Numbers|Sample random Text |This value is random and i want to take this value out from payload|9|rest of the payload
Since my data is present between pipes (||), I wrote this regex:
(?<=\|)[^|]++(?=\|)
But the problem is that this regex matches every value that sits between two pipes.
Can anyone help me extract the value between the 5th pipe and the 6th pipe?
You wish to extract the text that is between the 5th and 6th pipe. You can do that with the following regular expression.
r = /\A(?:[^|]*\|){5}\K[^|]*(?=\|)/
str = "2021-02-01T16:06:06.703Z CEF:0|ABCD|Sample text|Numbers|Sample random Text |My dog has fleas|9|rest of the payload"
str[r] #=> "My dog has fleas"
"a|b|c|d|e|My dog has fleas"[r]
#=> nil
We can write the regular expression in free-spacing mode to make it self-documenting. Free-spacing mode causes Ruby's regex engine to remove all comments and spaces before parsing the expression (which means that any spaces that are intended need to be protected, by escaping them, by putting them in a character class, etc.).
/
\A # match beginning of the string
(?: # begin a non-capture group
[^|]* # match any character other than a pipe zero or more times
\| # match a pipe
){5} # end non-capture group and execute it 5 times
\K # discard all previous matches and reset the start of the
# match to the current location
[^|]* # match any character other than a pipe zero or more times
(?= # begin a positive lookahead to assert that the next
# character is a pipe
\| # match a pipe
)
/x # invoke free-spacing mode
Another way is to remove \K and add a capture group:
str[/\A(?:[^|]*\|){5}([^|]*)(?=\|)/, 1]
#=> "My dog has fleas"
Of course, you don't need to use a regular expression for this:
str.count('|') > 5 && str.split('|')[5]
#=> "My dog has fleas"
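For comparison, here is a sketch of the same extraction in Python. Python's re module has no \K, so it follows the capture-group variant; the function name `field_after_fifth_pipe` is made up for this example.

```python
import re

# Skip five pipe-terminated fields, capture the sixth, and require a closing pipe.
SIXTH_FIELD = re.compile(r"(?:[^|]*\|){5}([^|]*)\|")

def field_after_fifth_pipe(s):
    """Return the text between the 5th and 6th pipe, or None if there is no 6th pipe."""
    m = SIXTH_FIELD.match(s)  # match() anchors at the start, like \A
    return m.group(1) if m else None

payload = ("2021-02-01T16:06:06.703Z CEF:0|ABCD|Sample text|Numbers|"
           "Sample random Text |My dog has fleas|9|rest of the payload")
print(field_after_fifth_pipe(payload))
print(field_after_fifth_pipe("a|b|c|d|e|My dog has fleas"))
```

As in the Ruby version, a string with only five pipes yields no result, because the field must be closed by a sixth pipe.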

Regex match all characters up to number if there is one [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 6 years ago.
I'm trying to figure out how to write a regex that will match every character up to, but not including, the first number in the character sequence, if there is one.
Ex:
Input: abc123
Output: abc
Input: #$%##<>#<123
Output: #$%##<>#<
Input: abc
Output: abc
Input: abc #####-122
Output: abc #####-
You can use:
/^([^\d\n]+)\d*.*$/gm
This will also handle scenarios where you have multiple sets of numbers in a string. Example here.
Explanation:
^ # define the start of the string
( # open capture group
[^\d\n]+ # match anything that isn't a digit or a newline that occurs once or more
) # close capture group
\d* # zero or more digits
.* # anything zero or more times
$ # define the end of the string
g # global
m # multi line
Because the quantifier is greedy, the capture group takes as many non-digit characters as it can and stops as soon as a digit, a newline, or the end of the string is encountered.
[Update] Try this regex:
([^0-9\n]+)[0-9]?.*
Regex explanation:
( capturing group starts
[^0-9\n] match a single character other than a digit or a newline
+ match one or more times
) capturing group ends
[0-9] match a single digit (0-9)
? match zero or one time
.* if any, match all remaining characters other than newline
Thanks @Robbie Averill for clarifying the OP's requirement. Here is the demo.
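I did not accept any answer because the correct answer was left in the comments: "^\D+"
I am working in Java, so putting it all together I got:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

Pattern p = Pattern.compile("^\\D+");
Matcher m = p.matcher("Testing123Testing");
// find() must succeed before group() is called; the pattern has no capture group, so use group()
String extracted = m.find() ? m.group() : "";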
Use the character class feature: [...]
Identify numbers: [0-9]
Negate the class: [^0-9]
Allow as many as you like: [^0-9]*
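The negated class built up in these steps can be tried against the inputs from the question. This is a small Python sketch; the function name `leading_non_digits` is made up for this example.

```python
import re

def leading_non_digits(s):
    """Return everything before the first digit (the whole string if it has no digit)."""
    # [^0-9]* can match the empty string, so match() always succeeds here.
    return re.match(r"[^0-9]*", s).group(0)

for sample in ["abc123", "#$%##<>#<123", "abc", "abc #####-122"]:
    print(repr(leading_non_digits(sample)))
```

Using * rather than + means a string that starts with a digit gives an empty result instead of no match.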

Regular expression model

Hey guys, I am new to regular expressions. I found a regular expression like this:
preg_match("/^(1[-\s.])?(\()?\d{3}(?(2)\))[-\s.]?\d{3}[-\s.]?\d{4}$/",$number)
preg_match("/^
(1[-\s.])? # optional '1-', '1.' or '1' followed by whitespace
( \( )? # optional opening parenthesis
\d{3} # the area code
(?(2) \) ) # if there was opening parenthesis, close it
[-\s.]? # followed by '-' or '.' or space
\d{3} # first 3 digits
[-\s.]? # followed by '-' or '.' or space
\d{4} # last 4 digits
$/x",$number);
I found this explanation on a tutorial website. I just need to know why (?(2)\)) is used here: why is a question mark (the optional symbol) applied at the beginning, and what is the purpose of (2) in that code?
I'm sorry if this question is of a low standard, since I am a newbie. Any help would be appreciated. Thanks. :)
The (?(2)\)) is a conditional (an if clause) that checks whether the 2nd capture group took part in the match. Group 2 is the optional opening parenthesis, so a closing \) is required only when an opening one was matched; otherwise nothing is required at that point.
You should be able to see a break down of your regex at Regex101. It's pretty useful to see what the regex is doing at all points and it's easy to tweak a regex from there.
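The effect of the conditional can also be checked directly. Python's re module supports the same (?(2)...) syntax, so here is a minimal sketch using the pattern from the question; the helper name `is_phone` and the test numbers are made up.

```python
import re

# Same pattern as in the question; (?(2)\)) demands ')' only if group 2 matched '('.
PHONE = re.compile(r"^(1[-\s.])?(\()?\d{3}(?(2)\))[-\s.]?\d{3}[-\s.]?\d{4}$")

def is_phone(number):
    return PHONE.match(number) is not None

for n in ["555-123-4567", "(555) 123-4567", "1-555-123-4567",
          "(555-123-4567", "555)123-4567"]:
    print(n, is_phone(n))
```

The last two numbers fail because their parentheses are unbalanced: one opens without closing, the other closes without opening.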

Matching percentages

I've been trying to enhance some code which determines whether a string is a valid percentage.
I decided that it was time to finally have a hundred problems, and learned regex.
I've been using this web regex tester to build my pattern.
I'm trying to do this rather loosely, such that valid percentages may be integer or decimal, positive or negative, include commas or not, and have any amount of whitespace at the beginning and end, as well as around the optional negative sign and the required percentage sign.
So far, I have \s*-?\s*\d+(,\d+)*(?:\.\d*)?\s*%\s*, which matches almost all of my test cases correctly:
0
0
0
% 0
- 0 %
20948.924780%
315%
2,456,875 %
2,104.86%
89fqyf0gp948y1-%ghghpq98fy92,.?><
, , , ,,,, 0,0,000,00,00,,,0
, , , ,,,, 0,0,000,00,00,,,0%
000000000,00000000000 %
000000000,00000000000,00000000000 %
000000000,00000000000,00000000000,00000000000.00000000000 %
These are not in any particular order, some pass and some fail, but only one is incorrect. In , , , ,,,, 0,0,000,00,00,,,0%, the last 0%\n is a match, but the whole line should be invalid. Start and end indicators do not seem to have the effect I had assumed, as a $ makes only the last example match, while a ^ at the beginning makes no matches register.
It may be something small, but as someone who only learned regex yesterday, it's far beyond my reach.
Thanks!
Start and end indicators do not seem to have the effect I had assumed, as a $ makes only the last example match, while a ^ at the beginning makes no matches register.
Those anchors should be working. However, it does depend on the regex engine and the options whether they match line begins/ends or file begins/ends. On RegExr, you'd have to check the multiline option: http://regexr.com?380p9 - in programming, use the m flag.
It could be done like this.
Edit: So after realizing it's a per-line thing, this is the regex now.
Note(s) -
Uses multiline mode like Bergi's.
Also, you CANNOT just use the \s whitespace class in this.
It doesn't matter what mode is used, \s WILL match CRLF if it can, which means
-
000,000000.22
%
will match because it satisfies all the conditions.
[^\S\r\n] means match whitespace except CRLF characters. It could be replaced with
[^\S\n] in the real world. The initial input on that tester used \r\n linebreaks.
Good Luck!!
# ^[^\S\r\n]*-?[^\S\r\n]*(?:(?:\.\d+)|(?:\d+(?:,\d+)*(?:\.\d*)?))[^\S\r\n]*%[^\S\r\n]*$
^ # BOL
[^\S\r\n]*
-? # optional -
[^\S\r\n]*
(?: # group
(?: \. \d+ ) # .number
| # or
(?: # group
\d+ # number
(?: , \d+ )* # optional many ,number
(?: \. \d* )? # optional . optional number
) # end group
) # end group
[^\S\r\n]*
% # %
[^\S\r\n]*
$ # EOL
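The difference between \s and [^\S\r\n] can be seen in a short Python sketch. The names `STRICT`, `LOOSE` and `is_percentage` are made up; each candidate is tested as a whole string with fullmatch, which stands in for the ^ and $ anchors.

```python
import re

WS = r"[^\S\r\n]*"  # any whitespace except CR and LF

# Line-safe version from the answer (anchors replaced by fullmatch below).
STRICT = re.compile(rf"{WS}-?{WS}(?:\.\d+|\d+(?:,\d+)*(?:\.\d*)?){WS}%{WS}")
# The question's pattern, where \s is free to cross line breaks.
LOOSE = re.compile(r"\s*-?\s*\d+(?:,\d+)*(?:\.\d*)?\s*%\s*")

def is_percentage(line):
    return STRICT.fullmatch(line) is not None

for line in ["0", "% 0", "- 0 %", "20948.924780%", "2,104.86%",
             ", , , ,,,, 0,0,000,00,00,,,0%"]:
    print(repr(line), is_percentage(line))

spread_out = "-\n000,000000.22\n%"
print(LOOSE.fullmatch(spread_out) is not None)  # \s spans the line breaks
print(is_percentage(spread_out))
```

The three-line input is accepted by the \s version but rejected once the whitespace class excludes line breaks.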