Using regex to extract text inside two characters

I'm trying to extract some text from a set of strings. The strings come in three forms:
X | A | Y
A | Y
A
where A is the text I want to extract. I've tried (?:\|)(.*?)(?:\|), which only works on the first case. I have been trying to combine several options I've seen in other questions, but with no luck so far: as soon as I match one case, the other cases stop matching.

If I understand you correctly, try:
(?:.*?\|([^\|]+)\|.*?)|(^[^\|]+)
The result will be in either capturing group 1 or group 2
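A minimal Python sketch of how the two capture groups might be consumed. It assumes Python's re module, and the helper name extract_field is hypothetical (neither comes from the question). The pattern is the one above; surrounding whitespace is trimmed afterwards.

```python
import re

# Same pattern as in the answer: group 1 is the text between two pipes,
# group 2 is the leading text when there are fewer than two pipes.
FIELD = re.compile(r'(?:.*?\|([^|]+)\|.*?)|(^[^|]+)')

def extract_field(line):
    """Return the wanted field with whitespace trimmed, or None if nothing matches."""
    m = FIELD.search(line)
    if m is None:
        return None
    # Exactly one of the two groups participates in a successful match.
    value = m.group(1) if m.group(1) is not None else m.group(2)
    return value.strip()
```

With the three forms from the question, the helper returns "A" for each of "X | A | Y", "A | Y" and "A".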

I think this will work: (?:^|(?<=\|))\s*(A)\s*(?:(?=\|)|$)
It captures the A substring in group 1.
This is definitely a case where you need assertions.
I don't think it will work without them.
Explained:
(?:
^ # BOS
| # or,
(?<= \| ) # | behind
)
\s* # optional wsp trim
( A ) # (1), What you're looking for
\s* # optional wsp trim
(?:
(?= \| ) # | ahead
| # or,
$ # EOS
)
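The assertion-based pattern treats A as a known literal. A small Python sketch of that idea follows, assuming Python's re module; the helper name find_bounded and its target parameter are hypothetical. The target is escaped so that it is matched literally.

```python
import re

def find_bounded(line, target):
    """Return the match for `target` only when it is a whole pipe-delimited field."""
    pattern = (
        r'(?:^|(?<=\|))'                 # start of string, or a pipe just behind
        r'\s*(' + re.escape(target) + r')\s*'
        r'(?:(?=\|)|$)'                  # a pipe just ahead, or end of string
    )
    return re.search(pattern, line)
```

The lookbehind is one character wide, so it is acceptable to engines that only allow fixed-width lookbehinds.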

How do I only extract numbers between 2 specific words using regex

I am trying to pull numbers between 2 specific words using regex. The problem is that they are multiline. I am trying to extract these from a PDF so it has to be between these 2 words only
WORD1:
(23)
(56)
(78)
END
I tried \((.*?)\), which pulls the numbers between the parentheses, but I need it to search only between the words WORD1 and END instead of the whole PDF.
Is there a way to do it?
Expected Output:
23
56
78
Use the \G construct
(?s)(?:(WORD1:)(?=(?:(?!WORD1:|END).)*?\d(?:(?!WORD1:|END).)*END)|(?!^)\G)(?:(?!\d|WORD1:|END).)*?\K\d+
https://regex101.com/r/il00WG/1
Explained
(?s) # Dot-all inline modifier
(?:
( WORD1: ) # (1), Flag start of new set
(?= # Lookahead, must be a digit before the END
(?:
(?! WORD1: | END )
.
)*?
\d
(?:
(?! WORD1: | END )
.
)*
END
)
| # OR,
(?! ^ )
\G # Start where last match left off
)
(?:
(?! \d | WORD1: | END ) # Go past non-digits
.
)*?
\K # Ignore previous match up to here
\d+ # Digits, the only match
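Python's built-in re module supports neither \G nor \K, so the pattern above cannot be used there as written. As a substitute technique (not the answer's method), a two-pass sketch can first isolate each WORD1: ... END block and then collect the parenthesised numbers inside it. The function name numbers_between is hypothetical.

```python
import re

# Pass 1: the text between WORD1: and the nearest following END (lazy, dot matches newlines).
BLOCK = re.compile(r'WORD1:(.*?)END', re.DOTALL)
# Pass 2: digits wrapped in parentheses.
NUMBER = re.compile(r'\((\d+)\)')

def numbers_between(text):
    """Collect parenthesised numbers that sit between WORD1: and END."""
    found = []
    for block in BLOCK.finditer(text):
        found.extend(NUMBER.findall(block.group(1)))
    return found
```

Numbers outside the delimiters, such as a "(99)" after END, are not returned because the second pass only ever sees the captured block.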
You need to include the g (global) and m (multiline) modifiers in your regex to match what you need.
https://regex101.com/r/c3VLdq/1
(\(.*?\))/gm
m modifier: multiline. Causes ^ and $ to match the beginning/end of each line.
I had a similar issue; what I used were lookahead (?=) and lookbehind (?<=).
So in your case it would look like this (if lookbehind is supported):
(?<=WORD1:\n)(.*\n)+(?=END)
Note the newline after WORD1:. If it is omitted, the result will start at the line break.
Tested here:
https://regex101.com/r/qxPQqq/4
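A short Python sketch of the lookaround approach, assuming Unix line endings and Python's re module (the function name numbers_in_block is hypothetical). The lookarounds return the whole block as one match, so a second step is still needed to pull out the individual numbers.

```python
import re

# Lookbehind for "WORD1:\n" (fixed width), then whole lines, up to a line starting with END.
BLOCK = re.compile(r'(?<=WORD1:\n)(?:.*\n)+(?=END)')

def numbers_in_block(text):
    """Return the digit runs found between the WORD1: line and the END line."""
    m = BLOCK.search(text)
    if m is None:
        return []
    return re.findall(r'\d+', m.group(0))
```

The line group is greedy, so if several lines begin with END the match extends to the last one.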

Notepad++ regex to find single character bounded by |

I'm having trouble coming up with the regex I need to do this find/replace in Notepad++. I'm fine with needing a couple of separate searches to complete the process.
Basically I need to add a | at the beginning and end of every line from a CSV, plus replace all the , with |. Then, on any value with only 1 character, I need to put two spaces around the character on each side ("A" becomes " A ")
Source:
col1,col2,col3,col4,col5,col6
name,desc,something,else,here,too
another,,three,,,
single,characters,here,a,b,c
last,line,here,,almost,
Results:
|col1|col2|col3|col4|col5|col6|
|name|desc|something|else|here|too|
|another||three||||
|single|characters|here| a | b | c |
|last|line|here||almost||
Adding the | to the beginning and the end of the line is simple enough, and replacing , with | is obviously straightforward. But I can't come up with the regex to find |x| where x is limited to a single character. I'm sure it is simple, but I'm new to regex.
Regex:
(?:(^)|(?!^)\G)(?:([^\r\n,]{2,})|([^\r\n,]))?(?:(,$)|(,)|($))
Replacement string:
(?{1}|)(?{2}\2)(?{3} \3 )(?{4}||)(?{5}|)(?{6}|)
Ugly, dirty and long but works.
Regex Explanation:
(?: # Start of non-capturing group (a)
(^) # Assert beginning of line (CP #1)
| # Or
(?!^) # Not at beginning of line
\G # Match at previous matched position
) # End of non-capturing group (a)
(?: # Start of non-capturing group (b)
([^\r\n,]{2,}) # Match a string of two or more characters (any except \r, \n or `,`) (CP #2)
| # Or
([^\r\n,]) # Match one-char string (CP #3)
)? # Optional - End of non-capturing group (b)
(?: # Start of non-capturing group (c)
(,$) # Match `,$` (CP #4)
| # Or
(,) # Match single comma (CP #5)
| # Or
($) # Assert end of line (CP #6)
) # End of non-capturing group (c)
Three Step Solution:
Pattern: ^.+$ Replacement: |$0|
Pattern: , Replacement: |
Pattern: (?<=\|)([^|\r\n])(?=\|) Replacement: $0 padded with spaces on both sides
The first replace adds | at the beginning and at the end, and replaces commas:
Search: ^|$|,
Replace: |
The second replace adds space around single character matches:
Search: (?<=[|])([^|])(?=[|])
Replace: $1
Add spaces to the left and to the right of $1.
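Outside Notepad++, the same multi-step idea can be sketched in Python. This assumes Python's re module; the function name csv_line_to_pipes and the pad parameter are hypothetical, and the default of one space per side follows the Results shown in the question.

```python
import re

# A single non-pipe character with a pipe directly on each side.
SINGLE_CHAR = re.compile(r'(?<=\|)([^|])(?=\|)')

def csv_line_to_pipes(line, pad=" "):
    """Wrap a CSV line in pipes, swap commas for pipes, and pad one-character fields."""
    row = "|" + line.replace(",", "|") + "|"
    # A function replacement avoids any escaping issues with the padding string.
    return SINGLE_CHAR.sub(lambda m: pad + m.group(1) + pad, row)
```

Empty fields are left untouched because the character class requires exactly one character between the pipes.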

Find an item in the text with exceptions[Regular Expression]

Please help me create a regular expression that matches the "|" character everywhere except inside parentheses.
example|example (example(example))|example|example|example(example|example|example(example|example))|example
The match should select the 5 "|" characters that are outside the parentheses. Note that the contents of the parentheses should remain unchanged, including the "|" characters within them.
Considering you want to match pipes that are outside any set of parentheses, with nested sets, here's the pattern to achieve what you want:
Regex:
(?x) # Allow comments in regex (ignore whitespace)
(?: # Repeat *
[^(|)]*+ # Match every char except ( ) or |
( # 1. Group 1
\( # Opening paren
(?: # chars inside:
[^()]++ # a. everything inside parens except nested parens
| # or
(?1) # b. nested parens (recurse group 1)
)*+ # repeated any number of times (possessive)
\) # Until closing paren.
)?+ # (end of group 1)
)*+ #
\K # Keep text out of match
\| # Match a pipe
One-liner:
(?:[^(|)]*+(\((?:[^()]++|(?1))*+\))?+)*+\K\|
This pattern uses some advanced features:
Possessive quantifiers
Recursion
Resetting the match start
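Recursion and \K are not available in Python's built-in re module, so a Python version would need a different technique. The sketch below is a plain character scan with a nesting-depth counter, not the answer's pattern; the function name pipes_outside_parens is hypothetical.

```python
def pipes_outside_parens(text):
    """Return the indexes of every '|' that is not inside any parentheses."""
    depth = 0
    positions = []
    for index, char in enumerate(text):
        if char == "(":
            depth += 1
        elif char == ")":
            # Ignore stray closing parens instead of letting the depth go negative.
            depth = max(depth - 1, 0)
        elif char == "|" and depth == 0:
            positions.append(index)
    return positions
```

On the example string from the question this yields five positions, which is the count the asker expects.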

How do I perform this regex in order to extract the value of the variable

I would like to extract the value of individual variables paying attention to the different ways they have been defined. For example, for dtime we want to extract 0.004. It also has to be able to interpret exponential numbers, like for example for variable vis it should extract 10e-6.
The problem is that each variable has its own number of white spaces between the variable name and the equal sign (I don't have control over how they have been coded).
Text to test:
dtime = 0.004D0
case = 0
newrun = 1
periodic = 0
iscalar = 1
ieddy = 1
mg_level = 5
nstep = 20000
vis = 10e-6
ak = 10e-6
g = 9.81D0
To extract dtime's value this REGEX works:
(?<=dtime =\s)[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?
To extract vis's value this REGEX works:
(?<=vis =\s)[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?
The problem is that I need to know the exact number of spaces between the variable name and the equal sign. I tried using \s+ but it does not work, why?
(?<=dtime\s+=\s)[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?
If you are using PHP, Perl, or more generally PCRE, you can use the \K escape to solve this problem:
dtime\s+=\s\K[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?
Notice the \K after \s: it tells the engine to drop everything matched before it from the reported match, as if it had never been matched.
Edit: If you can't use lookbehinds or \K to discard what was matched, I think you need to capture the number in a capturing group:
dtime\s*=\s*([-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?)
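A Python sketch of the capturing-group approach, assuming Python's re module. The helper name read_value and the added \b word boundary are not from the answer; the boundary is there so that a short name such as g does not match inside mg_level.

```python
import re

# Number pattern from the question, with the exponent in a non-capturing group.
NUMBER = r'[-+]?[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?'

def read_value(text, name):
    """Return the numeric text assigned to `name`, or None if it is absent."""
    pattern = r'\b' + re.escape(name) + r'\s*=\s*(' + NUMBER + r')'
    m = re.search(pattern, text)
    return m.group(1) if m else None
```

The value is returned as a string; a trailing Fortran-style suffix such as D0 is simply left out of the capture.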
(?<=dtime\s+=\s) is a variable-length lookbehind because of \s+. Most (not all) engines support only fixed-length lookbehinds.
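Python's built-in re module is one such engine, and it rejects the pattern at compile time. A small sketch of that check follows (the helper name compiles is hypothetical).

```python
import re

def compiles(pattern):
    """Report whether the re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        # Raised, for instance, when a lookbehind is not fixed-width.
        return False

fixed_width = r'(?<=dtime =\s)[0-9]+'       # literal text plus one \s: fixed width
variable_width = r'(?<=dtime\s+=\s)[0-9]+'  # \s+ makes the width variable
```

Here compiles(fixed_width) is true and compiles(variable_width) is false.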
Also, your regex requires a digit before the exponential form, so if there is no digit, it won't match. Something like this might work -
# dtime\s*=\s*([-+]?[0-9]*\.?[0-9]*(?:[eE][-+]?[0-9]+)?)
dtime \s* = \s*
( # (1)
[-+]? [0-9]* \.? [0-9]*
(?: [eE] [-+]? [0-9]+ )?
)
Edit: After review, I see you're trying to fold multiple optional forms into one regex.
I don't think that is really straightforward. Just as a point of interest, this is probably a baseline:
# dtime\s*=\s*([-+]?(?(?=[\d.]+)(\d*\.\d+|\d+\.\d*|\d+|(?!))|)(?(?=[eE][-+]?\d+)([eE][-+]?\d+)|))(?(2)|(?(3)|(?!)))
dtime \s* = \s*
( # (1 start)
[-+]? # optional -+
(?(?= # conditional check for \d*\.\d*
[\d.]+
)
( # (2 start), yes, force a match on one of these
\d* \. \d+ # \. \d+
| \d+ \. \d* # \d+ \.
| \d+ # \d+
| (?!) # or, Fail the match, the '.' dot is there without a number
) # (2 end)
| # no, match nothing
)
(?(?= # conditional check for [eE] [-+]? \d+
[eE] [-+]? \d+
)
( [eE] [-+]? \d+ ) # (3), yes, force a match on it
| # no, match nothing
)
) # (1 end)
(?(2) # Conditional check - did we match something? One of grp2 or grp3 or both
| (?(3)
| (?!) # Did not match a number, Fail the match
)
)
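Lookahead conditionals of the form (?(?=...)...) are not available in Python's re module. A simpler alternative, sketched here as an assumption and not as an equivalent of the pattern above, is to spell out the allowed mantissa forms so that at least one digit is always required. Unlike the conditional version, it does not accept an exponent on its own. The names NUMBER_RE and is_number are hypothetical.

```python
import re

NUMBER_RE = re.compile(r'''
    [-+]?                   # optional sign
    (?: \d+ \.? \d*         # digits, optionally followed by a dot and more digits
      | \. \d+              # or a leading dot followed by digits
    )
    (?: [eE] [-+]? \d+ )?   # optional exponent
''', re.VERBOSE)

def is_number(token):
    """True when the whole token is a number in one of the accepted forms."""
    return NUMBER_RE.fullmatch(token) is not None
```

A lone "." or a bare exponent such as "e5" is rejected, while "5." and ".5" are both accepted.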

regex to match postgresql bytea

In PostgreSQL, there is a BLOB datatype called bytea. It's just an array of bytes.
bytea literals are output in the following way:
'\\037\\213\\010\\010\\005`Us\\000\\0001.fp3\'\\223\\222%'
See PostgreSQL docs for full definition of the format.
I'm trying to construct a Perl regular expression which will match any such string.
It should also match standard ANSI SQL string literals, like 'Joe', 'Joe''s Mom', 'Fish Called ''Wendy'''
It should also match the backslash-escaped variant: 'Joe\'s Mom'.
The first approach (shown below) works only for some bytea representations.
s{ ' # Opening apostrophe
(?: # Start group
[^\\\'] # Anything but a backslash or an apostrophe
| # or
\\ . # Backslash and anything
| # or
\'\' # Double apostrophe
)* # End of group
' # Closing apostrophe
}{LITERAL_REPLACED}xgo;
For others (longer ones, with many escaped apostrophes), Perl gives this warning:
Complex regular subexpression recursion limit (32766) exceeded at ./sqa.pl line 33, <> line 1.
So I am looking for a better (but still regex-based) solution, it probably requires some regex alchemy (avoiding backreferences and all).
OK, here is the best solution I could put together, thanks to Leon and hobbs.
Note: This is not the solution I was looking for! It still makes Perl fail with warning "recursion limit (32766) exceeded", for some long strings. (try to stuff 400k random bytes into a bytea field, then export with pg_dump --inserts).
However, it matches most bytea strings (as they appear in SQL code and in server logs), and ANSI SQL string literals. For example:
'\014cS\0059\036a4JEd\021o\005t\0015K7'
'\\037\\213\\010\\010\\005`Us\\000\\0001.fp3\'\\223\\222%'
' Joe''s Mom friend\'s dog is called \'Fluffy'''
And here's the regex:
m{
' # opening apostrophe
(?> # start non-backtracking group
[^\\']+ # anything but a backslash or an apostrophe, one or more times
| # or
(?: # group of
\\ \\? [0-7]{3} # one or two backslashes and three octal digits
)+ # one or more times
| # or
'' # double apostrophe
| # or
\\ [\\'] # backslash-escaped apostrophe or backslash
)* # end of group
' # closing apostrophe
}x;
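For comparison, a rough Python rendering of the same alternation, assuming Python's re module. It uses the looser "backslash plus any character" branch from the question instead of the octal-specific branch, and single-character alternatives instead of an atomic group. The names SQL_LITERAL and replace_literals are hypothetical, and the sketch says nothing about how Perl behaves on very long inputs.

```python
import re

SQL_LITERAL = re.compile(r"""
    '               # opening apostrophe
    (?:
        [^\\']      # anything but a backslash or an apostrophe
      | \\ .        # a backslash followed by any character
      | ''          # doubled apostrophe
    )*
    '               # closing apostrophe
""", re.VERBOSE | re.DOTALL)

def replace_literals(sql, placeholder="LITERAL_REPLACED"):
    """Swap every quoted literal in `sql` for a placeholder."""
    # A function replacement keeps backslashes in the placeholder literal.
    return SQL_LITERAL.sub(lambda m: placeholder, sql)
```

Each alternative starts with a different kind of character, which keeps backtracking between the branches small.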
If you don't care about correctness, at least for now, couldn't you just try to match against regular quoted string literals? Probably something like
m{
(?> # start of a quote group
' # opening apostrophe
(?> # start non-backtracking group
[^\\']+ # anything but a backslash or an apostrophe, one or more times
| # or
\\ . # backslash-escaped something
)* # end of group
' # closing apostrophe
)+ # end of a quote group, many of these
}x;
First of all, it seems like you're trying to do two very different things in one regexp:
Matching it for correctness.
Unquoting it.
To match it, you could try something like this:
m{ ^ # Start of string
' # Opening apostrophe
(?> # Start non-backtracking group
[^\\\'] # Anything but a backslash or an apostrophe
| # or
\\ # A backslash, followed by
(?: # Start group
\d{3} # 3 digits
|
. # one other character
) # end group
| # or
'' # Double apostrophe
)* # End of group
' # Closing apostrophe
$ # End of string
}xms;