Regex to extract slash-separated segments that may themselves contain slashes

I'm trying to extract the element definitions from an XPath string using a regex.
However, some element definitions contain the slash separator themselves.
Sample of xpath:
/primary[#classCode='ABC']/subject[#typeCode='123/a'][organizer/code[#codeSystem='12.35.1.1/b']]/component[#typeCode='RET']/text()
I expect the result:
primary[#classCode='ABC']
subject[#typeCode='123/a'][organizer/code[#codeSystem='12.35.1.1/b']]
component[#typeCode='RET']
text()
Trying something simple, like
(?<=/)(.*?)(?=/)
or similar variations is not adequate.
Is there a regex that splits this without further processing of the string?

I don't know what your use case is, but I hope this helps.
Regex demo
Regex: \/.*?[\]\)](?=\/|$)
1. \/.*?[\]\)] matches a / and then everything up to the first ] or ) that satisfies the lookahead in step 2
2. (?=\/|$) is a positive lookahead for either a / or $ (end of string)
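As a rough sketch, assuming Python's re module (the question does not name a language), the pattern above can be tried with re.findall. Each match keeps its leading slash, which is stripped here; the variable names are only for illustration.

```python
import re

# The sample XPath from the question, split across lines for readability.
xpath = ("/primary[#classCode='ABC']"
         "/subject[#typeCode='123/a'][organizer/code[#codeSystem='12.35.1.1/b']]"
         "/component[#typeCode='RET']/text()")

# A slash, then lazily anything up to a ] or ) that is followed by / or end of string.
pattern = re.compile(r"/.*?[\])](?=/|$)")

# Each match still starts with the separating slash, so strip it.
segments = [m.lstrip("/") for m in pattern.findall(xpath)]
for seg in segments:
    print(seg)
```

This relies on every segment ending in ] or ), as in the sample; a plain segment such as /foo/bar would not be split off on its own.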

Depending on the language, there are better ways to split an XPath than a regex, but if you have to use a regex, you could try this:
(?<=\/|^)(.*?(?:\[.*?\])*)(?=\/|$)
Lookbehind (?<= includes / or starting anchor ^
(.*?(?:\[.*?\])*) is used to extract each segment in the path
(?:\[.*?\]) is a non-capturing group to match anything present within [ and ]
The * quantifier is applied to that group because an XPath segment can contain more than one predicate, such as subject[][] in your example.
Lookahead (?=\/|$) includes / or ending anchor $
Regex101 Demo
// Output:
primary[#classCode='ABC']
subject[#typeCode='123/a'][organizer/code[#codeSystem='12.35.1.1/b']]
component[#typeCode='RET']
text()
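The pattern above uses a lookbehind with alternatives of different widths, which some engines do not accept (Python's re module, for one, only allows fixed-width lookbehinds). As an alternative sketch that is not the answerer's method, the same split can be done without a regex by tracking bracket depth. The function name split_xpath is hypothetical, and the sketch assumes brackets are balanced and never appear inside quoted values.

```python
def split_xpath(path):
    """Split on '/' only when not inside [...] (assumes balanced brackets)."""
    segments = []
    current = []
    depth = 0
    for ch in path:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "/" and depth == 0:
            # A top-level slash ends the current segment (skip empty ones).
            if current:
                segments.append("".join(current))
            current = []
        else:
            current.append(ch)
    if current:
        segments.append("".join(current))
    return segments


xpath = ("/primary[#classCode='ABC']"
         "/subject[#typeCode='123/a'][organizer/code[#codeSystem='12.35.1.1/b']]"
         "/component[#typeCode='RET']/text()")

for seg in split_xpath(xpath):
    print(seg)
```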


Using Regex to extract a specific xml tag

I have this xml string
<aof xmlns="http://tsng.jun.net/jppos/conig/hello"><num>3</num><desc>addy02</desc><tpcs>5</tpcs></aof>'
I need to extract 5 using regex.
What I have done is:
regex = re.compile(r'tag+</.+>\s*(.+)\s*<.+>')
Where tag is 'tpcs'
but it's returning an empty result.
Can someone please help?
Don't use regexps for XML / HTML! Read this, one of the most voted & highest ranked answers on this site!
Use XPath instead:
//tpcs/text()
or (namespace-agnostic):
//*[local-name()='tpcs']/text()
will print 5, as expected.
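A minimal sketch of the parser-based approach, assuming Python's standard xml.etree.ElementTree. ElementTree only supports a subset of XPath (no text() or local-name()), so this sketch maps the document's namespace to a prefix instead; the prefix a is an arbitrary choice.

```python
import xml.etree.ElementTree as ET

xml_string = ('<aof xmlns="http://tsng.jun.net/jppos/conig/hello">'
              '<num>3</num><desc>addy02</desc><tpcs>5</tpcs></aof>')

root = ET.fromstring(xml_string)

# Map an arbitrary prefix to the default namespace used by the document.
ns = {"a": "http://tsng.jun.net/jppos/conig/hello"}

# Find the first <tpcs> element anywhere below the root and read its text.
tpcs = root.find(".//a:tpcs", ns)
value = tpcs.text if tpcs is not None else None
print(value)
```

For full XPath expressions such as //*[local-name()='tpcs']/text(), a library with complete XPath 1.0 support would be needed.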
As posted in the comments, this regex does the trick :
(?<=<tpcs>).*?(?=<\/tpcs>)
As seen in this demo.
Explanation :
(?<=<tpcs>) is a positive lookbehind (?<=...), it asserts that a certain string, <tpcs> is placed before the string to match.
.*? the dot matches any character (except a newline by default), and the * repeats it zero or more times. The trailing ? makes the quantifier lazy, so it matches as few characters as possible and stops at the first occurrence of what comes next.
(?=<\/tpcs>) is a positive lookahead (?=...); it asserts that </tpcs> comes right after the matched string.
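For illustration only, here is how the lookaround pattern could be applied in Python, the language used in the question. The variable names are assumptions.

```python
import re

xml_string = ('<aof xmlns="http://tsng.jun.net/jppos/conig/hello">'
              '<num>3</num><desc>addy02</desc><tpcs>5</tpcs></aof>')

# Lookbehind for the opening tag, lazy match, lookahead for the closing tag.
match = re.search(r"(?<=<tpcs>).*?(?=</tpcs>)", xml_string)
value = match.group(0) if match else None
print(value)
```

The lookbehind here has a fixed width, which Python's re module requires.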

Regex: ignore characters that follow

I'd like to know how I can ignore characters that follow a particular pattern in a regex.
I tried positive lookaheads, but they do not work because they preserve those characters for other matches, while I want them to be just... discarded.
For example, a part of my regex is: (?<DoubleQ>\"\".*?\"\")|(?<SingleQ>\".*?\")
in order to match some "key-parts" of this string:
This is a ""sample text"" just for "testing purposes": not to be used anywhere else.
I want to capture the entire ""sample text"", but then I want to "extract" only sample text and the same with testing purposes. That is, I want the group to match to be ""sample text"", but then I want the full match to be sample text. I partially achieved that with the use of the \K option:
(?<DoubleQ>\"\"\K.*?\"\")|(?<SingleQ>\"\K.*?\")
Which ignores the first "" (or ") from the full match but takes it into account when matching the group. How can I ignore the following "" (")?
Note: positive lookahead does not work: it does not ignore characters from the following matches, it just does not include them in the current match.
Thanks a lot.
I hope I understood your question correctly. You want to match the whole string including the quotes, but replace or extract only the expression without the quotes, right?
You can typically use the regex replace functionality to extract just a part of the match.
This is the regex:
""?(.*?)""?
And this the replace expression:
$1
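A small sketch of this capture-and-replace idea, assuming Python's re module, where the replacement reference is written \1 rather than $1. The sample string is the one from the question.

```python
import re

text = ('This is a ""sample text"" just for "testing purposes": '
        'not to be used anywhere else.')

# One or two opening quotes, a lazily captured body, one or two closing quotes.
pattern = re.compile(r'""?(.*?)""?')

# findall returns only the captured group, i.e. the text without quotes.
extracted = pattern.findall(text)
print(extracted)

# sub replaces each quoted span with its captured body.
unquoted = pattern.sub(r"\1", text)
print(unquoted)
```

The quotes are consumed by the match, so they are not available to later matches, but they stay out of the captured group.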

Negative lookahead to match server directories not properly working

Given the following three example server paths, I am trying to create a skip list for my FTP client using PCRE regular expressions, but I can't seem to get the desired result.
/subdir-level-1/subdir-level-2/.../Author1_-_Title1-(1234)-Publisher1
/subdir-level-1/subdir-level-2/.../Author2_-_Title2_(5678)-PUBLiSHER2
/subdir-level-1/subdir-level-2/.../Author3_-_Title3-4951-publisher3
I want to skip all folders (not paths) that do not end with
-Publisher1
I am trying to create a working pattern with the help of this online help and this regex tester, but I can't get any further than this negative lookahead pattern
.*-(?!Publisher1)
But with this pattern all lines match, because in each of them the substring up to the pattern does not contain the pattern.
/subdir/subdir/.../Author1_-_Title1-(1234) -Publisher1
/subdir/subdir/.../Author2_-_Title2_(5678) -PUBLiSHER2
/subdir/subdir/.../Author3_-_Title3-4951 -publisher3
What is my mistake, and what would the correct pattern be to match only the second and third lines (the ones to skip) while keeping the first line?
EDIT to make it clearer what to highlight and what not.
Everything from the beginning of the path to the last slash must be ignored (allowed).
Everything after the last slash that matches the defined regex must be skipped.
EDIT to present an advanced pattern matching only the red part
[^/]*(?<!-Publisher2)$
Debuggex Demo
The regex which you have used is:
.*-(?!Publisher1)
Here is what is wrong with it.
This regex matches any line that contains a - that is not followed by Publisher1. Notice that your paths contain other hyphens, for example between the author and the title, or after the title. Each path therefore has at least one hyphen that is not followed by Publisher1, so every line matches. The hyphen needs to be inside the lookaround, together with Publisher1.
So you might move the hyphen inside the parentheses and write the regex like this:
^.*(?!-Publisher1)
but this will not work either: .* is greedy and consumes the whole line, so the lookahead is tested at the end of the string, where nothing follows and the negative lookahead always succeeds. So we use a negative lookbehind instead:
.*(?<!-Publisher1)
What now? It still does not work. Why?
Because a negative lookbehind looks back from the current position and only checks that the text immediately before it is not -Publisher1.
This is a bit involved, so bear with me.
Suppose your string is
/subdir/subdir/.../Author1_-_Title1-(1234)-Publisher1
We do a negative lookbehind for -Publisher1. From the position after the final 1, i.e. at the end of the string, -Publisher1 is visible when we look back, so the negative lookbehind fails there. The engine then backtracks one character to the left, to a position where it can no longer see -Publisher1 when looking back, because from there only "-Publisher" is visible. The condition is now satisfied, so the regex still matches almost the whole string.
So it is essential to bind the lookbehind to the end of the string, so that the engine cannot move one character to the left to find a match.
Final regex:
.*(?<!-Publisher1)$
demo here : http://regex101.com/r/lE1vW2
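The backtracking described above can be observed with a short sketch. It assumes Python's re module rather than the PCRE engine of the FTP client, but both handle this fixed-width lookbehind in the same way; the paths are the ones from the question.

```python
import re

paths = [
    "/subdir/subdir/.../Author1_-_Title1-(1234)-Publisher1",
    "/subdir/subdir/.../Author2_-_Title2_(5678)-PUBLiSHER2",
    "/subdir/subdir/.../Author3_-_Title3-4951-publisher3",
]

unanchored = re.compile(r".*(?<!-Publisher1)")
anchored = re.compile(r".*(?<!-Publisher1)$")

# Without $, the engine gives back one character and still matches.
partial = unanchored.match(paths[0]).group(0)
print(partial)

# With $, only paths that do not end in -Publisher1 match (these are skipped).
skipped = [p for p in paths if anchored.search(p)]
for p in skipped:
    print(p)
```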
This should suit your needs:
^.*(?<!-Publisher1)$
Debuggex Demo
I want to skip all folders that do not end with -Publisher1
You can use this negative lookahead based regex:
^(?!.*?-Publisher1$).+$
Working Demo
You could use the following regex in order to exclude lines containing Publisher1:
^((?!Publisher1).)*$
Online demo: http://regex101.com/r/gD8jK0
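The last two patterns are not equivalent: one tests whether the path ends with -Publisher1, the other whether Publisher1 appears anywhere. A sketch in Python makes the difference visible; the second path below is a hypothetical example invented for illustration and does not come from the question.

```python
import re

ends_with = re.compile(r"^(?!.*?-Publisher1$).+$")   # skip unless it ends with -Publisher1
contains = re.compile(r"^((?!Publisher1).)*$")       # skip unless it contains Publisher1

keep = "/subdir/subdir/.../Author1_-_Title1-(1234)-Publisher1"
# Hypothetical path: Publisher1 occurs in a parent folder, not at the end.
tricky = "/Publisher1-archive/Author2_-_Title2_(5678)-PUBLiSHER2"

for path in (keep, tricky):
    print(path,
          "skip" if ends_with.search(path) else "keep",
          "skip" if contains.search(path) else "keep")
```

Only the first pattern marks the hypothetical path as one to skip, because the second one is satisfied by Publisher1 appearing anywhere in the path.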

Regex to match a specific number in a string

I'm trying to fix a regex I created.
I have a URL like this:
http://www.demo.it/prodotti/822/Panasonic-TXP46G20E.html
and I have to match the product ID (822).
I wrote this regex
(?<=prodotti\/).*(?<=\/)
and the result is "822/"
The value I want to match is always a group of digits between two slashes.
You're almost there!
Simply use:
(?<=prodotti\/).*?(?=\/)
instead of:
(?<=prodotti\/).*(?<=\/)
And you're good ;)
See it working here on regex101.
I've actually just changed two things:
1. Replaced your lookbehind ((?<=\/)) with the corresponding lookahead ((?=\/)), so it asserts that a / comes AFTER the last character consumed by .* instead of including it.
2. Changed the greediness of your pattern by using .*? instead of .*. Without that change, for a URL with several / after prodotti/, the match would not stop at the first one.
For example, given the input string http://www.demo.it/prodotti/822/Panasonic/TXP46G20E.html, it would have matched 822/Panasonic.
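To compare the three variants side by side, here is a brief sketch assuming Python's re module. The helper name first_match is hypothetical.

```python
import re

def first_match(pattern, text):
    """Return the first match of pattern in text, or None."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

url = "http://www.demo.it/prodotti/822/Panasonic-TXP46G20E.html"
url_nested = "http://www.demo.it/prodotti/822/Panasonic/TXP46G20E.html"

# Original attempt: the trailing lookbehind keeps the slash in the match.
original = first_match(r"(?<=prodotti/).*(?<=/)", url)
# Corrected: lazy quantifier plus lookahead stops before the first slash.
fixed = first_match(r"(?<=prodotti/).*?(?=/)", url)
# Lookahead but still greedy: runs up to the last slash.
greedy = first_match(r"(?<=prodotti/).*(?=/)", url_nested)

print(original, fixed, greedy)
```

If the ID is always numeric, a pattern such as (?<=prodotti/)\d+ would be a stricter alternative.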

Match Sequence using RegEx After a Specified Character

The initial string is [image:salmon-v5-09-14-2011.jpg]
I would like to capture the text "salmon-v5-09-14-2011.jpg" and used GSkinner's RegEx Tool
The closest I can get to my desired output is using this RegEx:
:([\w+.-]+)
The problem is that this sequence includes the colon and the output becomes
:salmon-v5-09-14-2011.jpg
How can I capture the desired output without the colon. Thanks for the help!
Use a look-behind:
(?<=:)[\w+.-]+
A look-behind (coded as (?<=someregex)) is a zero-width match, so it asserts, but does not capture, the match.
Also, your regex may be able to be simplified to this:
(?<=:)[^\]]+
which simply grabs anything between (but not including) a : and a ]
If you are always looking at strings in that format, I would use this pattern:
(?<=\[image:)[^\]]+
This looks behind for [image:, then matches until the closing ]
Your regex is correct; the tool you're using is simply highlighting the entire match rather than just your capture group. Hover over the match to see what "group 1" actually is.
If you want a slightly more robust regex you could try :([^\]]+) which will allow for any characters other than ] to appear in the file name portion.
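Both approaches, the capture group and the look-behind, can be checked with a quick sketch. It assumes Python's re module rather than the GSkinner tool mentioned in the question.

```python
import re

text = "[image:salmon-v5-09-14-2011.jpg]"

# Capture group: the full match includes the colon, group 1 does not.
grouped = re.search(r":([\w+.-]+)", text)
full_match = grouped.group(0)
group_one = grouped.group(1)

# Look-behind: the full match itself excludes the prefix.
looked = re.search(r"(?<=\[image:)[^\]]+", text)
filename = looked.group(0)

print(full_match)
print(group_one)
print(filename)
```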