I am trying, in Python 3.7, to recognize patterns in PDF documents by extracting elements with regular expressions. My problem is that I need only the first match of the regular expression, but when I use my regex it finds both occurrences.
"FECHA DE EMISION ","26/03/2021 "
"Comuna: ","Valparaiso "
"FECHA DE EMISION ","26/03/2021 "
The regex I am using is:
(FECHA\sDE\sEMISION.*)
The result I need is just the first match of the regex to get:
"FECHA DE EMISION ","26/03/2021 "
Note that the two matches have identical content.
I also tried to refer to capture group 1 with \g<1>, but it didn't work for me. I think it has to do with my not using a lazy (non-greedy) quantifier.
Note also that I cannot solve this with Python code or other Python functionality. I specifically call re.findall and cannot add anything else, which is why I need an expression that by itself returns only the first match.
Any idea how to solve it?
If you could use a PCRE/Onigmo/Boost regex engine or the PyPI regex module, you could get the match value directly using
\A[\s\S]*?\K"FECHA\sDE\sEMISION.*
where \K makes the regex engine "forget" the text matched so far. See this regex demo.
Since you are bound to use a pattern for re.findall, you can use
\A[\s\S]*?("FECHA\sDE\sEMISION.*)
See the regex demo.
Details:
\A - unambiguous start of string
[\s\S]*? - any zero or more chars, as few as possible
("FECHA\sDE\sEMISION.*) - Capturing group 1: "FECHA DE EMISION with any whitespace between the words and then the rest of the line.
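A minimal sketch of how the anchored pattern behaves with re.findall, assuming the extracted text looks like the three sample lines from the question (the variable names are illustrative only):

```python
import re

# Sample text from the question: the same line appears twice.
text = (
    '"FECHA DE EMISION ","26/03/2021 "\n'
    '"Comuna: ","Valparaiso "\n'
    '"FECHA DE EMISION ","26/03/2021 "'
)

# Original pattern: findall returns one entry per occurrence.
all_matches = re.findall(r'(FECHA\sDE\sEMISION.*)', text)

# Anchored pattern: \A can only match at the very start of the string,
# so there is at most one overall match and findall yields one group value.
first_only = re.findall(r'\A[\s\S]*?("FECHA\sDE\sEMISION.*)', text)

print(len(all_matches))
print(first_only)
```

Because the pattern has exactly one capturing group, re.findall returns the group contents rather than the whole match, which is why the lazily skipped prefix does not appear in the result.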
I need to create two regexes.
One, for catching these type of strings:
/xyz-courses/test/test
/abc-courses/test-abc/test-xyz
/abc-courses/test-abc/test-xyz?itsok=yes
But I don't want to match strings where the word fixed comes right before -courses:
/fixed-courses/test/test
/fixed-courses/test-abc/test-xyz
/fixed-courses/test-abc/test-xyz?itsok=yes
I have created the following regex, which works fine for the first set, but I am not sure how to exclude the case where the word fixed precedes -courses:
/([^/]+)-courses/([^/]+)/([^/]+)$
Second, I need a regex that negates the one from the previous step, that is, one that matches everything the first regex does not.
I tried:
[^/([^/]+)-courses/([^/]+)/([^/]+)]$
But this is reported as invalid by every regex checker I tried.
You may use this regex to disallow fixed- before courses:
^/((?!fixed-)[^/-]+)-courses/([^/]+)/([^/]+)$
RegEx Demo
(?!fixed-) is a negative lookahead that will fail the match if fixed- appears right after / and before courses/.
For second part use this to negate first regex:
^/(?!((?!fixed-)[^/-]+)-courses/([^/]+)/([^/]+)$).+
RegEx Demo 2
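As a rough check of both patterns, here is a small Python sketch that runs them against the paths from the question. Python is an assumption here (the question does not name an engine); the names ALLOW and NEGATE are made up for the example.

```python
import re

# First regex: any "<word>-courses" path except "fixed-courses".
ALLOW = re.compile(r'^/((?!fixed-)[^/-]+)-courses/([^/]+)/([^/]+)$')
# Second regex: the first one wrapped in a negative lookahead.
NEGATE = re.compile(r'^/(?!((?!fixed-)[^/-]+)-courses/([^/]+)/([^/]+)$).+')

wanted = [
    '/xyz-courses/test/test',
    '/abc-courses/test-abc/test-xyz',
    '/abc-courses/test-abc/test-xyz?itsok=yes',
]
unwanted = [
    '/fixed-courses/test/test',
    '/fixed-courses/test-abc/test-xyz',
    '/fixed-courses/test-abc/test-xyz?itsok=yes',
]

# Collect which paths each pattern accepts.
allow_hits = [p for p in wanted + unwanted if ALLOW.search(p)]
negate_hits = [p for p in wanted + unwanted if NEGATE.search(p)]

print(allow_hits)
print(negate_hits)
```

Note that [^/-]+ does not allow a hyphen in the word before -courses, so a path such as /my-own-courses/a/b would not match the first regex and would be matched by the second.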
I have the following string: thisIs/My-7777-Any-other-text. It can also look like this: thisIs/My-7777
I am looking to extract My-7777 in both scenarios using regex. So essentially I am looking to extract everything between the first forward slash and the second hyphen (the second hyphen may not exist). I tried the following regex, which wasn't quite right:
(?<=\/)(.*)(?=-)
You could use a capture group
^[^\/]*\/([^-]*-[^-]*)
^ Start of string
[^\/]*\/ Match optional chars other than /, then match /
( Capture group
[^-]*-[^-]* Match a - between optional chars that are not -
) Close capture group
regex demo
Without an anchor, and not allowing / before and after -
[^\/]*\/([^-\/]*-[^-\/]*)
Regex demo
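To illustrate the difference between the two capture-group patterns, here is a short Python sketch. Python is an assumption, since the question does not name an engine, and the third sample string with an extra / is borrowed from later in this thread.

```python
import re

# Anchored variant: group 1 may run across a later "/".
anchored = re.compile(r'^[^/]*/([^-]*-[^-]*)')
# Unanchored variant: group 1 stops at "/" as well as at "-".
unanchored = re.compile(r'[^/]*/([^-/]*-[^-/]*)')

samples = ['thisIs/My-7777-Any-other-text', 'thisIs/My-7777']
results = [anchored.search(s).group(1) for s in samples]

# A string with a second "/" shows where the two patterns differ.
extra = 'thisIs/My-7777/more-text-here'
anchored_extra = anchored.search(extra).group(1)
unanchored_extra = unanchored.search(extra).group(1)

print(results)
print(anchored_extra, unanchored_extra)
```

For the two strings from the question both patterns give the same group value; they only diverge when another / follows the part you want.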
If we take into account the structure of your current input strings, you can use
(?<=\/)[^-]+-[^-]+
See the regex demo.
If your strings are more complex and look like thisIs/My-7777/more-text-here, and you actually want to match from the first /, then you may use
^[^\/]+\/\K[^\/-]+-[^\/-]+ ## PHP, PCRE, Boost (Notepad++), Onigmo (Ruby)
(?<=^[^\/]+\/)[^\/-]+-[^\/-]+ ## JS (except IE & Safari), .NET, Python (PyPi regex)
See this regex demo or this regex demo. Note that \n is added to the negated character classes in the demo because the demo input is a single multiline string. In real-life input, if a newline char can occur, add it to each negated character class to keep the match on one line.
This one works for me. Try it with case-insensitive matching enabled:
Find what: .*?/|-any.*
Replace with: (leave empty)
Output should be: My-7777
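The same find-and-replace idea can be sketched in Python with re.sub, assuming the case-insensitive option corresponds to re.IGNORECASE. This is only an illustration of the approach above, not a general solution: it depends on the literal text -any following the part to keep.

```python
import re

# Remove everything up to the first "/" and everything from "-any" onward.
pattern = re.compile(r'.*?/|-any.*', re.IGNORECASE)

samples = ['thisIs/My-7777-Any-other-text', 'thisIs/My-7777']
cleaned = [pattern.sub('', s) for s in samples]

print(cleaned)
```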
Within Sublime Text I'm trying to match a single double quote followed by the HTML tag <br>. Any string can come after the HTML tag, and the double quote must not be preceded by another double quote.
I've gotten as far as having my regex meet my expectations when testing in https://regex101.com/r/HHNB1E/4.
This is my regex: ^((?!").)*{"<br>}.*$.
However, when I put this into Sublime Text it throws the error "Ran out of stack space trying to match the regular expression". I'm assuming my regex is inefficient given I am not very experienced with them.
Example results expected:
foobar""<br> - No match
foobar"<br> - Match
""<br>baz - No match
"<br>baz - Match
foo<br>baz - No Match
Do I need to improve the regex for efficiency or am I doing it completely wrong?
The error you are seeing is most likely caused by the capturing group. Turn it into a non-capturing group like this:
^(?:(?!").)*"<br>.*$
But...
I'm trying to match a single double quote...
If you are trying to match one double quotation mark that is immediately followed by <br> and is not preceded by another " then chances are you need lookarounds:
"(?<!"")(?=<br>)
Above is a fast solution to your problem.
Try it like this using a negated character class instead of a tempered greedy token/negative lookahead:
^[^"]*"<br>.*$
Demo
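The regex can be further simplified if you do not need to select the whole line: [^"]"<br>. Note that this version requires some character before the quote, so it will not match a quote at the very start of a line (as in "<br>baz); (?:^|[^"])"<br> covers that case.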
If you need the full line and the above does not work I'd recommend running the pattern via grep
grep -P '^[^"]*"<br>.*$'
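As a sanity check, the five expected results from the question can be run against both the negated character class pattern and the lookaround pattern. The sketch below uses Python's re module, which is an assumption: Sublime Text uses a different engine, so stack usage and performance there may differ.

```python
import re

# Whole-line pattern using a negated character class.
line_pattern = re.compile(r'^[^"]*"<br>.*$')
# Lookaround pattern that matches only the quote itself.
quote_pattern = re.compile(r'"(?<!"")(?=<br>)')

# Expected outcome for each example line from the question.
cases = {
    'foobar""<br>': False,
    'foobar"<br>': True,
    '""<br>baz': False,
    '"<br>baz': True,
    'foo<br>baz': False,
}

line_results = {s: bool(line_pattern.search(s)) for s in cases}
quote_results = {s: bool(quote_pattern.search(s)) for s in cases}

print(line_results == cases, quote_results == cases)
```

The two patterns are not equivalent in general: the whole-line pattern rejects any line with an earlier double quote before the "<br> part, while the lookaround pattern only looks at the character right before the quote.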
I'm trying to match first occurrence of window.location.replace("http://stackoverflow.com") in some HTML string.
Especially I want to capture the URL of the first window.location.replace entry in whole HTML string.
So for capturing the URL I formulated these two rules:
it should be after this string: window.location.redirect("
it should be before this string ")
To achieve it I think I need to use lookbehind (for 1st rule) and lookahead (for 2nd rule).
I ended up with this regex:
.+(?<=window\.location\.redirect\(\"?=\"\))
It doesn't work. I'm not even sure that it is legal to mix both rules like I did.
Can you please help me translate my rules into a regex? Other ways of doing this (without lookahead or lookbehind) are also appreciated.
The pattern you wrote is not the one you need, as it matches something very different from what you expect: the text window.location.redirect("=") inside a string like window.location.redirect("=") something. It will also only work in PCRE/Python if you remove the ? before \" (lookbehinds must be fixed-width there). It will work with the ? in .NET regex.
If it is JS, you cannot use a lookbehind in engines that predate ES2018, as they do not support it.
Instead, use a capturing group around the unknown part you want to get:
/window\.location\.redirect\("([^"]*)"\)/
or
/window\.location\.redirect\("(.*?)"\)/
See the regex demo
Omitting the /g modifier makes the regex match only the first occurrence. Access the value you need in Group 1.
The ([^"]*) captures 0+ characters other than a double quote (URLs you need should not have it). If these URLs you have contain a ", you should use the second approach as (.*?) will match any 0+ characters other than a newline up to the first ").
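For comparison, here is a sketch of the same capturing-group approach in Python, where re.search plays the role of a JS regex without /g and stops at the first occurrence. The HTML snippet and the second URL are made up for the example.

```python
import re

# Hypothetical HTML with two redirect calls; only the first is wanted.
html = (
    '<script>window.location.redirect("http://stackoverflow.com")</script>\n'
    '<script>window.location.redirect("http://example.com")</script>'
)

# Variant 1: negated character class, assumes the URL contains no double quote.
negated = re.search(r'window\.location\.redirect\("([^"]*)"\)', html)
# Variant 2: lazy dot, stops at the first '")' on the same line.
lazy = re.search(r'window\.location\.redirect\("(.*?)"\)', html)

first_url = negated.group(1)
lazy_url = lazy.group(1)

print(first_url)
print(lazy_url)
```

In real code it is worth checking that re.search did not return None before calling group(1).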