Regex - Find the Shortest Match Possible

The Problem
Given the following:
\plain\f2 This is the first part of the note. This is the second part of the note. This is the \plain\f2\fs24\cf6{\txfielddef{\*\txfieldstart\txfieldtype1\txfieldflags144\txfielddataval44334\txfielddata 35003800380039000000}{\*\txfielddatadef\txfielddatatype1\txfielddata 340034003300330034000000}{\*\txfieldtext 20{\*\txfieldend}}{\field{\*\fldinst{ HYPERLINK "44334" }}{\fldrslt{20}}}}\plain\f2\fs24 part of the note.
I'd like to produce this:
\plain\f2 This is the first part of the note. This is the second part of the note. This is the third part of the note.
What I've Tried
The example input/output is a very simplified version of the data I need to parse and it would be nice to have a way to parse the data programmatically. I have a PHP application and I've been trying to use regex to match the segments that are important and then filter out the parts of the string that aren't required. Here's what I've come up with so far:
/\\plain.*?\\field{\\\*\\fldinst{ HYPERLINK "(.*?)" }}{\\fldrslt{(.*?)}}}}\\plain.*? /gm
regex101: https://regex101.com/r/ILLZU6/2
It almost matches what I want, but it grabs the longest possible match instead of the shortest. I want it to match only one \\plain before the \\field{.... Maybe after the \\plain, I could match anything except for a space? How would I go about doing that?
I'm no regex expert, but my use-case really calls for it. (Otherwise, I'd just write code to handle everything.) Any help would be much appreciated!

(?:(?!\\plain).)* will match any run of characters that does not contain a match for \\plain, because the negative lookahead is checked before each character is consumed. Here's the regex implementing this:
/\\plain(?:(?!\\plain).)*\\field{\\\*\\fldinst{ HYPERLINK "(.*?)" }}{\\fldrslt{(.*?)}}}}\\plain.*? /gm
regex101: https://regex101.com/r/ILLZU6/5
Also, you can replace the space at the end with (?: |$) if you want to allow the end of the text to trigger it as well as a space:
/\\plain(?:(?!\\plain).)*\\field{\\\*\\fldinst{ HYPERLINK "(.*?)" }}{\\fldrslt{(.*?)}}}}\\plain.*?(?: |$)/gm
regex101: https://regex101.com/r/ILLZU6/4
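For anyone who wants to try this outside regex101, here is a minimal Python sketch of the same idea. The sample text is a shortened, made-up stand-in for the question's input, the braces are escaped to be safe in Python's re module, and the lookup table that maps the HYPERLINK id to the word "third" is purely hypothetical (the question doesn't say where that word comes from).

```python
import re

# Shortened, hypothetical sample in the same shape as the question's input.
text = (r'\plain\f2 This is the \plain\f2\fs24\cf6{\txfielddef{x}'
        r'{\field{\*\fldinst{ HYPERLINK "44334" }}{\fldrslt{20}}}}'
        r'\plain\f2\fs24 part of the note.')

# Shared tail of both patterns (braces escaped for Python's re).
TAIL = (r'\\field\{\\\*\\fldinst\{ HYPERLINK "(.*?)" \}\}'
        r'\{\\fldrslt\{(.*?)\}\}\}\}\\plain.*? ')

# Lazy version from the question: still starts at the FIRST \plain.
lazy = re.compile(r'\\plain.*?' + TAIL)
# Version from the answer: the lookahead refuses to cross another \plain.
tempered = re.compile(r'\\plain(?:(?!\\plain).)*' + TAIL)

lazy_match = lazy.search(text)
tempered_match = tempered.search(text)

# Hypothetical mapping from HYPERLINK id to replacement text.
lookup = {'44334': 'third'}
result = tempered.sub(lambda m: lookup[m.group(1)] + ' ', text)
print(result)
```

The lazy pattern starts at the first \plain because a regex engine always prefers the leftmost starting position; laziness only shortens the right-hand end of the match.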

Related

Regex for extracting each word between hyphens

I am learning regex and trying to write a pattern that exactly matches each of the strings without the '-' so that I can iterate over the groups and print the respective strings.
I have a string that looks like "Abcd001-wd2s-vwe1-20180e3103.txt"
I was able to write a regex for extracting Abcd001, wd2s and .txt from above text as shown below
(\A[^-]+)=> Abcd001
(-[^-]+-)=> wd2s
(\..*)=>.txt
However, I was unable to come up with the correct pattern for extracting the exact strings vwe1 and 20180e3103
It will be really helpful if you can guide me on this or if there is a better approach to achieve this?
Please note: [^-.]+ may give me all the words separately, but I am looking for an option where I have a group defined for each of these strings so that it's a one-to-one mapping.
Thanks!
To get vwe1 or 20180e3103 from the example data, you might use a quantifier {2} or {3} to repeat matching one or more word characters followed by a hyphen: (?:\w+-){2}.
Then you could capture in a group ([^-.]+) matching not a hyphen or a dot.
(?:\w+-){2}([^-.]+)
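A quick way to check this is a short Python sketch using the file name from the question. The last pattern, with one group per piece, is my own assumption about what "one to one mapping" might look like; it is not from the answer.

```python
import re

name = "Abcd001-wd2s-vwe1-20180e3103.txt"

# (?:\w+-){n} skips n hyphen-terminated words; the group captures the next one.
third = re.match(r'(?:\w+-){2}([^-.]+)', name).group(1)
fourth = re.match(r'(?:\w+-){3}([^-.]+)', name).group(1)

# Assumed alternative: a single pattern with one capture group per piece.
one_pass = re.fullmatch(r'([^-]+)-([^-]+)-([^-]+)-([^-.]+)(\..*)', name)

print(third, fourth)
print(one_pass.groups())
```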
Try the below regex
/\-([^\)]+)\-/gmi;
Also check the similar implementation:
https://stackoverflow.com/a/50336050/8179245

How to replace all lines based on previous lines in Notepad++?

I have an XML code:
<Line1>Matched_text Other_text</Line1>
<Line2>Text_to_replace</Line2>
How do I tell Notepad++ to find Matched_text and replace Text_to_replace with Replaced_text? There are several similar blocks of code, each with exactly the same Matched_text but different Other_text and Text_to_replace. I want to replace them all at once.
My idea is to put
Matched_text*<Line2>*</Line2>
in the Find field, and
Matched_text*<Line2>Replaced_text</Line2>
in the Replace field. I know that \1 in regex might be useful, but I don't know where to start.
The actual code is:
<Name>Matched_text, Other_text</Name>
<IsBillable>false</IsBillable>
<Color>-Text_to_replace</Color>
The regex you're looking for is something like the following.
Find: (Matched_text[\w,\s<>\/]*<Color>-).*(</Color>)
Replace: \1Replaced_text\2
Broken down:
`()` is how you tell regex that you want to keep things (for use in \1, \2, etc.); these are called capture groups in regex land.
`Matched_text[\w,\s<>\/]*` means you want your anchor `Matched_text` and everything after it up to the next part of the expression.
`<Color>-).*(</Color>)` selects everything between <Color>- and </Color> for replacement.
If you have any questions about the expression, I highly recommend looking at a regex cheatsheet.
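If you'd rather test the expression outside Notepad++, here is a rough Python sketch of the same find/replace. The second block in the sample (Something_else, Keep_me) is made up to show that blocks without Matched_text are left alone.

```python
import re

# First block is from the question; the second block is a hypothetical control.
xml = """<Name>Matched_text, Other_text</Name>
<IsBillable>false</IsBillable>
<Color>-Text_to_replace</Color>
<Name>Something_else, Other_text</Name>
<IsBillable>true</IsBillable>
<Color>-Keep_me</Color>
"""

# Group 1 keeps everything from the anchor up to "<Color>-",
# group 2 keeps the closing tag; only the text in between is swapped.
pattern = re.compile(r'(Matched_text[\w,\s<>/]*<Color>-).*(</Color>)')
result = pattern.sub(r'\1Replaced_text\2', xml)
print(result)
```

Note that the character class does not include a hyphen, so the scan stops at the first - after the anchor. That works for this sample, but it would not if Other_text itself contained a hyphen.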

Replacing char in a String with Regular Expression

I got a string like this:
PREFIX-('STRING WITH SPACES TO REPLACE')
and i need this:
PREFIX-('STRING_WITH_SPACES_TO_REPLACE')
I'm using Notepad++ for the regex search and replace, but I'm sure every other editor capable of regex replacements can do it too.
I'm using:
PREFIX-\('(.*)(\s)(.*)'\)
for search and
PREFIX-('\1_\3')
for replace
but that replaces only one space from the string.
The regex search feature in Notepad++ is very, very weak. The only way I can see to do this in NPP is to manually select the part of the text you want to work on, then do a standard find/replace with the In selection box checked.
Alternatively, you can run the document through an external script, or you can get a better editor. EditPad Pro has the best regex support I've ever seen in an editor. It's not free, but it's worth paying for. In EPP all I had to do was this:
search: ((?:PREFIX-\('|\G)[^\s']+)\s+
replace: $1_
EDIT: \G matches the position where the previous match ended, or the beginning of the input if there was no previous match. In other words, the first time you apply the regex, \G acts like \A. You can prevent that by adding a negative lookahead, like so:
((?:PREFIX-\('|(?!\A)\G)[^\s']+)\s+
If you want to prevent a match at the very beginning of the text no matter what it starts with, you can move the lookahead outside the group:
(?!\A)((?:PREFIX-\('|\G)[^\s']+)\s+
And, just in case you were wondering, a lookbehind will work just as well as a lookahead:
((?:PREFIX-\('|(?<!\A)\G)[^\s']+)\s+
You have to keep matching from the beginning of the string until you can match no more.
find /(PREFIX-\('[^\s']*)\s([^']*'\))/
replace $1_$2
In Perl, for example: while (s/(PREFIX-\('[^\s']*)\s([^']*'\))/$1_$2/) {}
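The same repeat-until-nothing-changes loop can be sketched in Python. The function name underscore_spaces is just something I made up for illustration.

```python
import re

# Same pattern as above: group 1 ends just before the first remaining space,
# group 2 is the rest of the quoted string.
pattern = re.compile(r"(PREFIX-\('[^\s']*)\s([^']*'\))")

def underscore_spaces(text):
    """Replace one space per pass until the pattern stops matching."""
    while True:
        text, count = pattern.subn(r'\1_\2', text)
        if count == 0:
            return text

print(underscore_spaces("PREFIX-('STRING WITH SPACES TO REPLACE')"))
```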
How about using Replace all for about 20 times? Or until you're sure no string contains more spaces
Due to the nature of regex, it's not possible to do this in one step with a normal regular expression.
But if I were in your place, I would do the replacement in several steps:
Find such patterns and mark them with a special character
(like replacing STRING WITH SPACES TO REPLACE with #STRING WITH SPACES TO REPLACE#).
Replace #([^#\s]*)\s with #\1_ several times.
Remove the markers!
I studied the regex tool in Notepad++ a little because I didn't know its capabilities.
I conclude that it isn't powerful enough to do what you want.
You will need to learn and use a programming language that has real regex capability. There are a number of them. Personally, I use Python. It would take a minute to do what you want with it.
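For what it's worth, a one-pass Python version might look like the sketch below. It assumes the quoted string never contains an apostrophe, and the function name is made up.

```python
import re

def underscore_quoted(text):
    """Replace every whitespace character inside PREFIX-('...') with '_'."""
    return re.sub(
        r"(PREFIX-\(')([^']*)('\))",
        # Rebuild the match, rewriting only the middle group.
        lambda m: m.group(1) + re.sub(r'\s', '_', m.group(2)) + m.group(3),
        text,
    )

print(underscore_quoted("PREFIX-('STRING WITH SPACES TO REPLACE')"))
```

Using a function as the replacement lets the inner text be rewritten in one go, so no repeated passes are needed.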
You'd have to run the replace several times, once for each space, but this regex will work:
/(?<=PREFIX-\(')([^\s]+)\s+/g
Replace with
\1_ or $1_
See it working at http://refiddle.com/10z

Regex href match a number

Well, here I am back at regex and my poor understanding of it. Spent more time learning it and this is what I came up with:
/(.*)
I basically want the number in this string:
510973
My regex is almost good? my original was:
"/<a href=\"travis.php?theTaco(.*)\">(.*)<\/a>/";
But sometimes it returned me huge strings. So, I just want to get numbers only.
I searched through other posts but there is such a large amount of unrelated material, please give an example, resource, or a link directing to a very related question.
Thank you.
Try using a HTML parser provided by the language you are using.
Reason why your first regex fails:
[0-9999999] is not what you think. It is the same as [0-9], which matches one digit. To match a number you need [0-9]+. Also, .* is greedy and will try to match as much as it can. You can use .*? to make it non-greedy. Since you are trying to match a number again, use [0-9]+ again instead of .*. Also, if the two numbers you are capturing will be the same, you can just match the first and use a back reference \1 for the second one.
And there are a few regex meta-characters which you need to escape like ., ?.
Try:
<a href=\"travis\.php\?theTaco=([0-9]+)\">\1<\/a>
To capture a number, you don't use a range like [0-99999], you capture by digit. Something like [0-9]+ is more like what you want for that section. Also, escaping is important like codaddict said.
Others have already mentioned some issues regarding your regex, so I won't bother repeating them.
There are also issues regarding how you specified what it is you want. You can simply match via
/theTaco=(\d+)/
and take the first capturing group. You have not given us enough information to know whether this suits your needs.
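Here is a small Python sketch comparing the strict and the loose pattern. The HTML sample is my guess at what the stripped input looked like, pieced together from the regexes and the number quoted in the question, so treat it as an assumption.

```python
import re

# Assumed sample; the original markup was not preserved in the question.
html = '<a href="travis.php?theTaco=510973">510973</a>'

# Strict: whole anchor tag, with \1 requiring the link text to repeat the number.
strict = re.search(r'<a href="travis\.php\?theTaco=([0-9]+)">\1</a>', html)

# Loose: only the query parameter.
loose = re.search(r'theTaco=(\d+)', html)

print(strict.group(1), loose.group(1))
```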

Regular expression question

I have some text like this:
dagGeneralCodes$_ctl1$_ctl0
Some text
dagGeneralCodes$_ctl2$_ctl0
Some text
dagGeneralCodes$_ctl3$_ctl0
Some text
dagGeneralCodes$_ctl4$_ctl0
Some text
I want to create a regular expression that extracts the last occurrence of dagGeneralCodes$_ctl[number]$_ctl0 from the text above.
the result should be: dagGeneralCodes$_ctl4$_ctl0
Thanks in advance
Wael
This should do it:
.*(dagGeneralCodes\$_ctl\d\$_ctl0)
The .* at the front is greedy so initially it will grab the entire input string. It will then backtrack until it finds the last occurrence of the text you want.
Alternatively you can just find all the matches and keep the last one, which is what I'd suggest.
Also, the specifics depend on the language you're doing this in. In Java, for example, you will need to use DOTALL mode so that . matches newlines, because ordinarily it doesn't. Some languages call this single-line mode (Ruby calls it multiline mode). JavaScript has a slightly different workaround for this, and so on.
You can use:
[\d\D]*(dagGeneralCodes\$_ctl\d+\$_ctl0)
I'm using [\d\D] instead of . to make it match new-line as well. The * is used in a greedy way so that it will consume all but the last occurrence of dagGeneralCodes$_ctl[number]$_ctl0.
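To see the difference the newline handling makes, here is a short Python sketch using the text from the question. It uses \d+ so multi-digit control numbers would also match, and re.DOTALL as Python's way of letting . match newlines.

```python
import re

text = """dagGeneralCodes$_ctl1$_ctl0
Some text
dagGeneralCodes$_ctl2$_ctl0
Some text
dagGeneralCodes$_ctl3$_ctl0
Some text
dagGeneralCodes$_ctl4$_ctl0
Some text
"""

target = r'dagGeneralCodes\$_ctl\d+\$_ctl0'

# Greedy prefix with DOTALL: .* swallows everything, then backtracks to the last hit.
with_dotall = re.search(r'.*(' + target + r')', text, re.DOTALL).group(1)

# Without DOTALL, .* cannot cross a newline, so the first line wins.
without_dotall = re.search(r'.*(' + target + r')', text).group(1)

# [\d\D] matches any character, newline included, with no flag needed.
with_class = re.search(r'[\d\D]*(' + target + r')', text).group(1)

# Alternative: collect every match and keep the last one.
last_of_all = re.findall(target, text)[-1]

print(with_dotall, without_dotall, with_class, last_of_all)
```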
I really like using this Regular Expression Cheatsheet; it's free, a single page, and printed, fits on my cube wall.