Problems matching multiline separator with C++ 2011 regex (ECMAScript) - c++

I'm trying to make a regular expression (for use with C++ 2011 std::regex, ECMAScript mode) to parse this content:
1
Second Line of Data for this entry
Here is the content
2
Second Line of Data for this entry
Here is the content that can be multiline
It is multiline for entry two
3
Second Line of Data for this entry
More content
Notice the '2' entry here: its content has a line feed (\n) that continues onto the next line, while a double \n\n separates it from the next entry.
I've tried this regex:
(\\d)\n(.*)\n(.*)\n\n
But it doesn't do what I would expect using clang 3.3.

Notes
Your example data seems to have a white-space after the number 3
It's not clear if you need CRLF or just LF for line feeds.
Be careful about the characters after the last entry: any extra whitespace could cause the regex to fail to match the last item.
You have to be careful of what . (dot) matches, in this case we use [\s\S] to get around not matching newlines.
A RegEx that will deal with the extra white-space and line feeds is as follows:
RegEx (\d+)\s*\r?\n(.*?)\r?\n([\s\S]*?)(?:\r?\n\r?\n|$)
Escaped "(\\d+)\\s*\\r?\\n(.*?)\\r?\\n([\\s\\S]*?)(?:\\r?\\n\\r?\\n|$)"
If you know your input will always have LF line endings, it's a little simpler:
RegEx (\d+)\s*\n(.*?)\n([\s\S]*?)(?:\n\n|$)
Escaped "(\\d+)\\s*\\n(.*?)\\n([\\s\\S]*?)(?:\\n\\n|$)"
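For reference, here is a minimal sketch of how the LF-only expression might be used from C++11. The Entry struct and parse_entries are names made up for this illustration. The sketch assumes the input uses bare \n line endings and that $ matches only at the very end of the input (no multiline flag is set). A raw string literal keeps the pattern free of doubled backslashes.

```cpp
#include <regex>
#include <string>
#include <vector>

// One parsed record: the number line, the second line, and the
// (possibly multiline) content. Hypothetical type for this sketch.
struct Entry {
    std::string number;
    std::string second_line;
    std::string content;
};

// Hypothetical helper: splits LF-terminated input into entries
// using the LF-only pattern from the answer above.
std::vector<Entry> parse_entries(const std::string& text) {
    // [\s\S] stands in for "any character, newlines included",
    // because a dot stops at line terminators.
    static const std::regex entry_re(
        R"re((\d+)\s*\n(.*?)\n([\s\S]*?)(?:\n\n|$))re");

    std::vector<Entry> entries;
    for (std::sregex_iterator it(text.begin(), text.end(), entry_re), end;
         it != end; ++it) {
        entries.push_back({(*it)[1].str(), (*it)[2].str(), (*it)[3].str()});
    }
    return entries;
}
```

Run on the sample data from the question (with no trailing newline after the last entry), this should produce three entries, and the second one keeps both content lines in its third capture group.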

Related

Matching lines containing Unicode line break chars with a dot pattern in Notepad++ regex

I'm using the following Regex to search for a string in each line of a document. Every line is encapsulated with þ.
^þ.*(SEARCHSTRING).*þ$
But I came across a discrepancy in my count. Running the regex over the below two example lines of data will only get one hit when I'd like to capture both. This is because of the Line Separator Character. My regex believes this to be a new line when in fact it is simply a new line indicator. Is there any way around this?
þ
SEARCHSTRINGþ
þ#SEARCHSTRINGþ
In Notepad++, . matches any char that is not a Unicode line break char.
If you need to match a line that is a chunk of chars other than LF and CR, use
^þ[^\r\n]*(SEARCHSTRING)[^\r\n]*þ$
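The same idea can be sketched outside Notepad++ with std::wregex. This is only an illustration under a few assumptions: each line is already held in its own std::wstring, þ is written as \u00FE, the line separator is U+2028, and line_contains_search_string is a made-up helper name. Because std::regex_match has to consume the whole string, the ^ and $ anchors are left out.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true when the whole line is wrapped in U+00FE
// and contains SEARCHSTRING somewhere in between.
bool line_contains_search_string(const std::wstring& line) {
    // [^\r\n] is a negated class, so it accepts every character except
    // CR and LF, including U+2028 (LINE SEPARATOR), which a dot may reject.
    static const std::wregex re(
        L"\u00FE[^\\r\\n]*(SEARCHSTRING)[^\\r\\n]*\u00FE");
    return std::regex_match(line, re);
}
```

Both example lines from the question should be accepted by this check, the one containing the line separator included.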

Regular expression for removing the first line of text

I have googled and checked several resources, but the last 2 hours of trial and error is no good.
I have many hundreds of files in which I need to remove the first line of text.
For now I have this regular expression to get the first line of text:
(\A[^\n]*\n)
But I want to have a condition in my Regular expression, that the first line ALSO MUST contain this GLOBALS["\x61\156\x75\156\x61"]
Because not ALL first lines in every file should be replaced.
Is that possible?
To remove that first line containing the GLOBALS["\x61\156\x75\156\x61"] as a literal string, use
\A.*GLOBALS\["\\x61\\156\\x75\\156\\x61"].*\r?\n
Note that . matches any character but a newline, and \r? will also match Windows-style line breaks (if you have any). Backslashes must be doubled if you need to match a literal backslash. The square bracket [ is also a regex metacharacter, and must also be escaped (\[).
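If the files are processed from a program rather than an editor, the escaping rules above can be sketched in C++ roughly as follows. This is an assumption-laden illustration, not the original answer's method: std::regex in ECMAScript mode has no \A, so the match_continuous flag is used instead to pin the match to the start of the text, and strip_marked_first_line is a made-up helper name.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: drops the first line only when it contains the
// literal text GLOBALS["\x61\156\x75\156\x61"]; otherwise returns the
// input unchanged.
std::string strip_marked_first_line(const std::string& text) {
    // \[ and \] match literal brackets; each \\ matches one literal backslash.
    static const std::regex first_line(
        R"re(.*GLOBALS\["\\x61\\156\\x75\\156\\x61"\].*\r?\n)re");

    std::smatch m;
    // match_continuous requires the match to begin at the first character,
    // standing in for \A. The dot cannot cross a line break, so only the
    // first line is ever inspected.
    if (std::regex_search(text, m, first_line,
                          std::regex_constants::match_continuous)) {
        return m.suffix().str();
    }
    return text;
}
```

A file whose marker appears only on a later line is left untouched, because the dot never crosses the first line break.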

Notepad++ Regex Issue - Remove Number in Line Replace with HTML

I'm a regex newbie so this has been a lot of trial and error, but for some reason I can only get this to work sometimes and I'm not sure why. Let me lay out what I'm doing. I have a text file that looks like this:
1.Some Text Here
A paragraph of words here.
2.Some More Text Here
A paragraph of words here.
I use this code to find the lines with a number at the beginning:
^[0-9]+.([^.]*)$
Then I replace it with this:
<h2>$1</h2>\r\r
The problem I'm running into is that it usually grabs the line starting with the number but for some reason it will grab the line with the number and the paragraph below it. So instead of putting the </h2> at the end of the line it puts it at the end of the paragraph below.
I displayed all symbols to see if it had something to do with carriage/line returns but everything looks identical from line to line. The paragraph is on its own line and I see CRLF at the end of each line.
The expression [^.] (i.e. not a literal dot) matches newlines.
Don't match newlines in your capture:
^[0-9]+\.([^.\r\n]*)
Note that I also escaped the dot following the numbers, making it match a literal dot (a naked dot matches any character).
Use \2 instead of $2 and check "wrap around". Tested on Notepad++ 5.9.3 (UNICODE).
Not sure what version of Notepad++ you're using, but your version of the regex works fine for the example that you have ... I use 6.7.9.2.
I can reproduce with the following text. Notice the paragraph for line 1 doesn't end in a period.
1.Some Text Here[CR][LF]
A paragraph of words here[CR][LF]
2.Some Text Here[CR][LF]
A paragraph of words here.[CR][LF]
Your regex matches from a line that begins with digits up to the end of any later line, as long as no period intervenes, so it can span more than one line. I would recommend this regex instead: ^[0-9]+\.([^\r\n]*)\r\n
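The difference between the two character classes can be checked with a small C++ sketch. first_heading_capture is a hypothetical helper written for this illustration, and it only looks at the first heading, so the ^ anchor only has to match at the start of the input.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the text of capture group 1 for the first
// match of `pattern` in `text`, or an empty string when nothing matches.
std::string first_heading_capture(const std::string& pattern,
                                  const std::string& text) {
    const std::regex re(pattern);
    std::smatch m;
    return std::regex_search(text, m, re) ? m[1].str() : std::string();
}
```

With the reproducing text above (where the first paragraph has no trailing period), a capture of ([^.]*) runs across the CRLF into the following lines, while ([^.\r\n]*) stops at the end of the heading and yields just "Some Text Here".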

Regex expression to select pair of lines

I have a logfile. Its general format is
log text 1 <br/>
log text 2 <br/>
Error: xxxxxx <br/>
error description (1 line only) <br/>
log text 3 <br/>
log text 4 <br/>
.... <br/>
Error: xxxxxx <br/>
error description (1 line only) <br/>
log text 5 <br/>
.... <br/>
I would like to select and extract each pair of lines describing an error. (Each error has two lines. The first line always has Error: as a keyword. The word Error does not occur anywhere else in the logfile.)
How do I do it using regex or any other way? I can use MacOS, Unix or Windows XP. MacOS preferred.
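Using grep on Mac or a Unix-based OS (-A1 prints one line after each match; matching Error: case-sensitively keeps the lowercase "error description" lines from matching on their own):
grep -A1 'Error:' inputfile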
A regular expression to find those 2 lines is for example:
^.*?Error.*(?:\r?\n|\r).*$
^ ... start each search at beginning of a line.
.*? ... match any character except carriage return and line-feed zero or more times, non-greedy. Non-greedy means as few characters as possible. In other words, stop at the first occurrence of Error and not at the last occurrence.
Error ... this word must be found in the first of 2 lines to match.
.* ... match any character except carriage return and line-feed zero or more times greedy. Greedy means now match as many characters as possible.
(?:\r?\n|\r) ... is a non-marking group matching either carriage return + line-feed (DOS/Windows text files), or only line-feed (UNIX text files), or only carriage return (old Mac text files).
.* ... match any character except carriage return and line-feed 0 or more times greedy.
$ ... anchor for end of line. Line termination is not included in matched string.
In other words this expression matches an entire line containing anywhere the word Error, the line terminator of this first line and everything on next line up to end of line, but not matching also the line terminator of this second line.
UltraEdit is a shareware text editor available for Windows, Linux and Mac.
Using this Perl regular expression in Search - Find dialog of UltraEdit with advanced find option List lines containing string enabled results in getting all found 2 line strings written to a window listing all found lines.
Opening the context menu of this window (right click on Windows) and clicking on Copy to Clipboard results in copying all found lines to system clipboard.
Pressing Ctrl+N to open a new file, Ctrl+V to paste the copied lines, and Ctrl+S to save the new file results finally in having a file with the wanted information.
Another method is using the UltraEdit script FindStringsToNewFile with the reduced regular expression search string Error.*(?:\r?\n|\r).*. This script writes all found strings starting with keyword Error and ending at end of next line directly to a new file.
One more note:
Whether a . (dot) also matches newline characters such as carriage return and line-feed depends on a flag. In UltraEdit the flag is set by default so that a dot does not match newline characters. With (?s) at the beginning of a Perl regular expression search string, the flag is changed and the dot also matches newline characters for that search. With (?-s) at the beginning of a search string, the dot can be made not to match newline characters if the application's internal default is the opposite.
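For anyone who would rather script this than use an editor, the reduced expression Error.*(?:\r?\n|\r).* can be tried from C++ as well. This is a sketch only: extract_error_pairs is a made-up helper name, and it assumes the whole log fits in a std::string and that the dot does not match CR or LF.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects every match that starts at the keyword
// Error and runs to the end of the following line.
std::vector<std::string> extract_error_pairs(const std::string& log) {
    static const std::regex pair_re(R"re(Error.*(?:\r?\n|\r).*)re");

    std::vector<std::string> pairs;
    for (std::sregex_iterator it(log.begin(), log.end(), pair_re), end;
         it != end; ++it) {
        // The line terminator of the second line is not part of the match.
        pairs.push_back(it->str());
    }
    return pairs;
}
```

Each returned string holds the Error line, the line break, and the description line. Matching is case-sensitive, so the lowercase "error description" lines do not start a match of their own.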

UltraEdit: Deleting all lines under a certain length with \n and or \r

I have a massive text file and want to remove all lines that are less than 6 characters long.
I tried the following search string (Regular expressions - Perl)
^.{0,5}\n\r$ -- string not found
^.{0,5}\n\r -- string not found
^.{0,5}$ -- leaves blank lines
^.{0,5}$\n\r -- string not found
^.{0,5}$\r -- leaves blank lines
^.{0,5}$\r\n -- **worked**
My question is: why should the last one work and the 4th one not work? Why should the 5th one leave blank lines?
Thanks.
Because ^.{0,5}$\n\r is not the same as ^.{0,5}$\r\n.
\n\r is a linefeed followed by carriage return.
\r\n is a carriage return followed by linefeed - a popular line ending combination of characters. Specifically \r\n is used by the MS-DOS and Windows family of operating systems, among others.
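The ordering point is easy to verify with a tiny C++ sketch. The helper name found is made up for this illustration, and the sample text is assumed to use Windows line endings.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true when `pattern` matches somewhere in `text`.
bool found(const std::string& pattern, const std::string& text) {
    return std::regex_search(text, std::regex(pattern));
}
```

On a text such as "abc\r\nlonger line\r\n", found("\\r\\n", text) is true, while found("\\n\\r", text) is false: every line feed there is followed by an ordinary character or by the end of the text, never by a carriage return.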
In multiline mode, ^ is a metacharacter that matches Begin of String and
can also match after a newline.
Likewise, $ matches End of String and at these positions too:
immediately before or immediately after a \r\n pair, or
immediately before or immediately after a lone \n.
$ will try to match before newline if it can (depends on other parts of the regex).
You can use that to your advantage with a regex like this:
^.{0,5}$(\r?\n)* which will match the end of the line AND any successive line breaks that follow.