I'm trying to write a regex for use in Calibre (python) to find ebooks that have the series name in brackets in the title. I have a custom column with the series name and title separated by a "~", for example:
"The Series~The Book Title (The Series)"
Best I can come up with finds anything with at least one letter from the series name in brackets in the title:
(.+)~.*[\(\1\)].*
I only want to find those that have the whole of the first part of the string in brackets at the end of the second part, it can contain extra info.
Thanks.
This works in Notepad++:
(.+)~[^\(]*\(\1\).*
I'm not sure it will work the same in python, but regexp processors are usually very similar, so try it out.
Your regex is pretty close, you can change a little your regex and have this:
(.+?)~.*[([]\1[)\]].*
Working demo
This will match strings like:
The Series~The Book Title (The Series)
The Series~The Book Title [The Series]
However, if you just want to match words with paretheses, then you can have:
(.+?)~.*[(]\1[)].*
or
(.+?)~.*\(\1\).*
Working demo
Thanks for the suggestions. They work perfectly in the python demo but for some unknown reason don't work in Calibre. Seems like one character is the most it will match from the capture group. Must be a limitation in the regex system Calibre uses.
Related
I am at the beginning of learning Regex, and I use every opportunity to understand how it's working. Currently I am trying to extract dates from a text file (which is in fact a vnt-file type from my mobile phone). It looks like following:
BEGIN:VNOTE
VERSION:1.1
BODY;ENCODING=QUOTED-PRINTABLE;CHARSET=UTF-8:18.07.=0A14.08.=0A15.09.=0A15.10.=
=0A13.11.=0A13.12.=0A12.01.=0A03.02. Grippe=0A06.03.=0A04.04.2015=0A0=
5.05.2015=0A03.06.2015=0A03.07.2015=0A02.08.2015=0A30.08.2015=0A28.09=
17.11.2017=0A
DCREATED:20171118T095601
X-IRMC-LUID:150
END:VNOTE
I want to extract all dates, so that the final list is like that:
18.07.
14.08.
15.09.
15.10.
and so on. If the date has also a year, it should also be displayed.
I almost found out how to detect the dates by the following regex:
.+(\d\d\.\d\d\.(2015|2016|2017)?).+
But it only detect very few of the dates. The result is this:
BEGIN:VNOTE
VERSION:1.1
15.10.
04.04.2015
30.08.2015
24.01.2016
DCREATED:20171118T075601
X-IRMC-LUID:150
END:VNOTE
Then I tried to add a question mark which makes the .+ not greedy, as far as I read in tutorials. Then the regex looks like:
.+?(\d\d\.\d\d\.(2015|2016|2017)?).+?
But the result is still not what I am looking for:
BEGIN:VNOTE
VERSION:1.1
21.03.20.04.18.05.18.06.18.07.14.08.15.09.15.10.
13.11.13.12.12.01.03.02.06.03.04.04.20150A0=
03.06.201503.07.201502.08.201530.08.20150A28.09=
28.10.201525.11.201528.12.201524.01.20160A
DCREATED:20171118T075601
X-IRMC-LUID:150
END:VNOTE
For someone who is familiar with regex I am pretty sure this is very easy to solve, but I don't get it. It's very confusing when you are new to regex. I tried to find a hint in some tutorials or stackoverflow posts, but all I found is this: Notepad++ how to extract only the text field which is needed?
But it doesn't work for me. I assume it might have something to do with the fact that my text file is not one single line.
I have my example on regex101 too.
I would be very thankful if maybe someone can give me a hint what else I can try.
Edit: I would like to detect the dates with the regex and as a result have a list with only the dates (maybe it is called substitute?)
Edit 2: Sorry for not mentioning it earlier: I just want to use the regex in e.g. Notepad++ or an online regex test website. Just to get the result of the dates and save the result in a new txt-file. I don't want to use the regex in an programming language. My apologies for not being precisely before.
Edit 3: The result should be a list with the dates, and each date in a new line:
I want to extract all dates, so that the final list is like that:
18.07.
14.08.
15.09.
15.10.
I suggest this pattern:
(?:.*?|\G)(\d\d\.\d\d\.(?:\d{4})?)
This makes use of the \G flag that, in this case, allows for multiple matches from the very start of the match without letting any single unmatched character in the text, thus allowing the removal of all but what's wanted.
If you want to remove the extra matches as well, add |.* at the end:
(?:.*?|\G)(\d\d\.\d\d\.(?:\d{4})?)|.*
regex101 demo
In N++, make sure the options underlined are selected, and that the cursor is at the beginning. In the picture below, I replaced then undid the replacement, only to show that matches were identified (16 replacements).
You can try using the following pattern:
\d{2}\.\d{2}\.(?:\d{4})?
This will match day.month dates of the form 18.07., but it also allows such a date to be followed by a four digit year, e.g. 18.07.2017. While it would be nice to make the pattern more restrictive, to avoid false fire matches, I do not see anything obvious which can be added to the above pattern. Follow the demo link below to see the pattern in action.
Demo
I'm awful with RegEx and am in a time crunch. I'm trying to come up with a rule that will pull out instances of text captured between brackets that also include the phrase ".xl" in them.
Example String:
C:\Users[chris.xlm]\Desktop[Test1.xlsx]Sheet1'![$C$4]
What would get captured from the expression would be:
1. chris.xlm
2. Test1.xlsx
The pattern:
\[([^]]+?\.xl.*?)\]
should accomplish what you need.
The pattern grabs everything before and after any presence of .xl if it is in the text, including the full extension.
Revised thanks to C Perkin's comment.
I need to build a RegEx expression which gets its text strings from the Title field of my Database. I.e. the complete strings being searched are: Mr. or Ms. or Dr. or Sr. etc.
Unfortunately this field was a free field and anything could be written into it. e.g.: M. ; A ; CFO etc.
The expression needs to match on everything except: Mr. ; Ms. ; Dr. ; Sr. (NOTE: The list is a bit longer but for simplicity I keep it short.)
WHAT I HAVE TRIED SO FAR:
This is what I am using successfully on on another field:
^(?!(VIP)$).* (This will match every string except "VIP")
I rewrote that expression to look like this:
^(?!(Mr.|Ms.|Dr.|Sr.)$).*
Unfortunately this did not work. I assume this is because because of the "." (dot) is a reserved symbol in RegEx and needs special handling.
I also tried:
^(?!(Mr\.|Ms\.|Dr\.|Sr\.)$).*
But no luck as well.
I looked around in the forum and tested some other solutions but could not find any which works for me.
I would like to know how I can build my formula to search the complete (short) string and matches everything except "Mr." etc. Any help is appreciated!
Note: My Question might seem unusual and seems to have many open ends and possible errors. However the rest of my application is handling those open ends. Please trust me with this.
If you want your string simply to not start with one of those prefixes, then do this:
^(?!([MDS]r|Ms)\.).*$
The above simply ensures that the beginning of the string (^) is not followed by one of your listed prefixes. (You shouldn't even need the .*$ but this is in case you're using some engine that requires a complete match.)
If you want your string to not have those prefixes anywhere, then do:
^(.(?!([MDS]r|Ms)\.))*$
The above ensures that every character (.) is not followed by one of your listed prefixes, to the end (so the $ is necessary in this one).
I just read that your list of prefixes may be longer, so let me expand for you to add:
^(.(?!(Mr|Ms|Dr|Sr)\.))*$
You say entirely of the prefixes? Then just do this:
^(?!Mr|Ms|Dr|Sr)\.$
And if you want to make the dot conditional:
^(?!Mr|Ms|Dr|Sr)\.?$
^
Through this | we can define any number prefix pattern which we gonna match with string.
var pattern = /^(Mrs.|Mr.|Ms.|Dr.|Er.).?[A-z]$/;
var str = "Mrs.Panchal";
console.log(str.match(pattern));
this may do it
/(?!.*?(?:^|\W)(?:(?:Dr|Mr|Mrs|Ms|Sr|Jr)\.?|Miss|Phd|\+|&)(?:\W|$))^.*$/i
from that page I mentioned
Rather than trying to construct a regex that matches anything except Mr., Ms., etc., it would be easier (if your application allows it) to write a regex that matches only those strings:
/^(Mr|Ms|Dr|Sr)\.$/
and just swap the logic for handling matching vs non-matching strings.
re.sub(r'^([MmDdSs][RSrs]{1,2}|[Mm]iss)\.{0,1} ','',name)
Hopefully someone can help me out. Been all over google now.
I'm doing some zone-ocr of documents, and want to extract some text with regex. It is always like this:
"Til: Name Name Name org.nr 12323123".
I want to extract the name-part, it can be 1-4 names, but "Til:" and "org.nr" is always before and after.
Anyone?
If you can't use capturing groups (check your documentation) you can try this:
(?<=Til:).*?(?=org\.nr)
This solution is using look behind and lookahead assertions, but those are not supported from every regex flavour. If they are working, this regex will return only the part you want, because the parts in the assertions are not matched, it checks only if the patterns in the assertions are there.
Use the pattern:
Til:(.*)org\.nr
Then take the second group to get the content between the parenthesis.
I'm a Regex noob and am pretty sure I'm not going about this in the most efficient way - wanted to get some advice.
I have a Regex expression ((\w+\b.*?){100}){1} which selects the first 100 words of my string, the length of which varies.
What I want is to select the entire string except for the first 100 words.
Is there syntax I can add to my current expression to do this, or am I better off trying to directly select the rest of the text instead.
Also, if anyone has any good resources for improving my Regex knowledge, i'd be very appreciative. Thus far I've found http://gskinner.com/RegExr/ to be very helpful.
Thanks in advance!
If you use this, you can refer to everything else as group 3 noted as $3
This one will treat hyphenated words as one word.
(\w+(-\w+|\b).*?){100}(.*)
Regex training Here