I have a logfile. Its general format is
log text 1
log text 2
Error: xxxxxx
error description (1 line only)
log text 3
log text 4
....
Error: xxxxxx
error description (1 line only)
log text 5
....
I would like to select and extract each pair of lines that describes an error. Each error spans two lines; the first line always contains the keyword Error:, and the word Error does not occur anywhere else in the logfile.
How do I do this using a regex or any other way? I can use MacOS, Unix or Windows XP; MacOS is preferred.
Using grep on macOS or any Unix-based OS (-A1 prints each matching line plus the one line after it):
grep -A1 'Error:' inputfile
A regular expression to find those 2 lines is for example:
^.*?Error.*(?:\r?\n|\r).*$
^ ... start each search at beginning of a line.
.*? ... match any character except carriage return and line-feed zero or more times, non-greedy. Non-greedy means as few characters as possible; in other words, stop at the first occurrence of Error on the line and not at the last one.
Error ... this word must be found in the first of 2 lines to match.
.* ... match any character except carriage return and line-feed zero or more times greedy. Greedy means now match as many characters as possible.
(?:\r?\n|\r) ... is a non-capturing group matching either carriage return + line-feed (DOS/Windows text files), or only line-feed (UNIX text files), or only carriage return (old MAC text files).
.* ... match any character except carriage return and line-feed 0 or more times greedy.
$ ... anchor for end of line. Line termination is not included in matched string.
In other words, this expression matches an entire line containing the word Error anywhere, the line terminator of this first line, and everything on the next line up to its end, without the line terminator of the second line.
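As a rough illustration of how the expression behaves outside an editor, here is a minimal sketch using Python's re module. The sample log text and the helper name extract_error_pairs are made up for this example, and the sample uses plain line-feed line endings.

```python
import re

# Hypothetical log content following the format described in the question.
LOG = (
    "log text 1\n"
    "log text 2\n"
    "Error: disk full\n"
    "could not write /tmp/a\n"
    "log text 3\n"
    "Error: timeout\n"
    "no reply from server\n"
    "log text 5\n"
)

# Same expression as above; re.MULTILINE makes ^ and $ work per line.
ERROR_PAIR = re.compile(r"^.*?Error.*(?:\r?\n|\r).*$", re.MULTILINE)


def extract_error_pairs(text):
    """Return every error line together with the description line after it."""
    return ERROR_PAIR.findall(text)


pairs = extract_error_pairs(LOG)
for pair in pairs:
    print(pair)
```

Each returned string holds both lines of one error, separated by the original line terminator.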
UltraEdit is a shareware text editor available for Windows, Linux and Mac.
Using this Perl regular expression in the Search - Find dialog of UltraEdit with the advanced find option List lines containing string enabled writes all matched 2-line strings to a window listing the found lines.
Opening the context menu of this window (right click on Windows) and clicking on Copy to Clipboard results in copying all found lines to system clipboard.
Pressing Ctrl+N to open a new file, Ctrl+V to paste the copied lines, and Ctrl+S to save the new file results finally in having a file with the wanted information.
Another method is using the UltraEdit script FindStringsToNewFile with the reduced regular expression search string Error.*(?:\r?\n|\r).*. This script writes all found strings starting with keyword Error and ending at end of next line directly to a new file.
One more note:
Whether a . (dot) also matches newline characters such as carriage return and line-feed depends on a flag. In UltraEdit the default is that a dot does not match newline characters. With (?s) at the beginning of a Perl regular expression search string the flag is changed, and the dot then also matches newline characters for this search. With (?-s) at the beginning of a search string the dot can be stopped from matching newline characters if the internal default of the application is the opposite.
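The effect of the flag can be seen in a small sketch. It uses Python's re module as a stand-in for the editor's Perl engine, which is an assumption of this example; in Python the dot excludes only the line-feed by default.

```python
import re

TEXT = "first line\nsecond line"

# Without the flag, the dot stops at the line-feed.
default_match = re.search(r"first.*", TEXT).group()

# With (?s) at the start, the dot also matches newline characters.
dotall_match = re.search(r"(?s)first.*", TEXT).group()

print(repr(default_match))
print(repr(dotall_match))
```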
I have an unformatted xml file in which I would like to delete tags of a specific name that contain some value.
Example:
<XmlElement1>
</XmlElement1>
<XmlElement2 ... >
...
<Xml1SubElement someParameter="...SearchTerm..."/>
...
</XmlElement2>
<XmlElement3/>
... stands for random characters and random multiple lines
In the example above I would like to delete all XmlElement2 elements that contain "SearchTerm" in the body. In other words, select all text between <XmlElement2 and </XmlElement2> across multiple lines where SearchTerm is in the middle, and replace it with "".
I'm using UltraEdit on MacOS and am flexible with what tools to use.
Your help is much appreciated!
The Perl regular expression search string for this task can be for example:
(?s)^[\t ]*<XmlElement2(?:.(?!</XmlElement2>))+?SearchTerm.+?</XmlElement2>[\t ]*(?:\r?\n|\r)
Explanation:
(?s) ... flag to match newline characters also by dot in search expression.
^[\t ]* ... start search at beginning of a line and match 0 or more tabs or spaces.
<XmlElement2 ... the start tag of the element to remove if it contains SearchTerm.
(?:.(?!</XmlElement2>))+? ... a non-capturing group that matches any character one or more times non-greedy, as long as the string after the current character is not </XmlElement2>. The negative lookahead (?!</XmlElement2>) prevents selecting a block that starts with <XmlElement2 and runs across one or more </XmlElement2> and <XmlElement2 tags until SearchTerm is found somewhere later in the file.
SearchTerm ... string which must be found inside element XmlElement2.
.+? ... any character (including newline characters) one or more times non-greedy. Non-greedy means here to stop matching characters on next occurrence of </XmlElement2> and not on last occurrence of </XmlElement2> in file.
</XmlElement2> ... the end tag of the XML element to remove if it contains SearchTerm.
[\t ]*(?:\r?\n|\r) ... 0 or more tabs or spaces and either DOS/Windows (carriage return + line-feed) or UNIX (just line-feed) or MAC (just carriage return) line ending.
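The following sketch applies the same expression with Python's re module instead of UltraEdit, which is an assumption of this example. The sample XML and its id attributes are invented for illustration; re.MULTILINE is passed so that ^ matches at the start of every line.

```python
import re

# Hypothetical input: only the second XmlElement2 contains SearchTerm.
XML = (
    '<XmlElement1>\n'
    '</XmlElement1>\n'
    '<XmlElement2 id="a">\n'
    '  <Xml1SubElement someParameter="keep this"/>\n'
    '</XmlElement2>\n'
    '<XmlElement2 id="b">\n'
    '  <Xml1SubElement someParameter="xx SearchTerm xx"/>\n'
    '</XmlElement2>\n'
    '<XmlElement3/>\n'
)

PATTERN = re.compile(
    r"(?s)^[\t ]*<XmlElement2(?:.(?!</XmlElement2>))+?SearchTerm"
    r".+?</XmlElement2>[\t ]*(?:\r?\n|\r)",
    re.MULTILINE,
)

# Replace every matching element, including its line ending, with nothing.
cleaned = PATTERN.sub("", XML)
print(cleaned)
```

The first XmlElement2 survives because the lookahead stops the search for SearchTerm at its own end tag.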
PS: The Perl regular expression replace was tested with UltraEdit for Windows v22.20.0.49 on Windows XP and v25.20.0.88 on Windows 7 as I don't have a Mac.
I have a massive text file and want to remove all lines that are less than 6 characters long.
I tried the following search string (Regular expressions - Perl)
^.{0,5}\n\r$ -- string not found
^.{0,5}\n\r -- string not found
^.{0,5}$ -- leaves blank lines
^.{0,5}$\n\r -- string not found
^.{0,5}$\r -- leaves blank lines
^.{0,5}$\r\n -- **worked**
My question is: why does the last one work while the 4th one does not? And why does the 5th one leave blank lines?
Thanks.
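Because ^.{0,5}$\n\r is not the same as ^.{0,5}$\r\n.
\n\r is a linefeed followed by a carriage return.
\r\n is a carriage return followed by a linefeed, a popular line ending combination of characters. Specifically, \r\n is used by the MS-DOS and Windows family of operating systems, among others.
That also explains the blank lines: in a file with \r\n line endings, ^.{0,5}$\r removes the short text and the carriage return but leaves the linefeed behind, and ^.{0,5}$ leaves the whole line ending in place.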
In multiline mode, ^ is a metacharacter that matches Begin of String and can also match after a newline.
Likewise, $ matches End of String and these too:
           \r\n
          ^    ^
here -----+-or-+
or
           \n
          ^  ^
here -----+or+
$ will try to match before the newline if it can (this depends on the other parts of the regex).
You can use that to your advantage with a regex like
^.{0,5}$(\r?\n)* which will match the end of the line AND any successive line breaks.
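To see the difference between the two character orders, here is a minimal sketch in Python. One assumption to note: Python's $ recognizes only \n as a line ending, so the sketch leaves $ out and spells the line ending explicitly, using [^\r\n] in place of the dot. The sample text and the helper remove_short_lines are made up.

```python
import re

# Hypothetical CRLF-terminated sample; "ab" and "12345" are under 6 characters.
TEXT = "ab\r\nlong enough\r\n12345\r\nsix666\r\n"


def remove_short_lines(text, line_ending):
    """Delete lines of 0 to 5 characters followed by the given line ending."""
    pattern = r"^[^\r\n]{0,5}" + line_ending
    return re.sub(pattern, "", text, flags=re.MULTILINE)


with_crlf = remove_short_lines(TEXT, r"\r\n")  # carriage return, then linefeed
with_lfcr = remove_short_lines(TEXT, r"\n\r")  # wrong order, never matches here
with_cr_only = remove_short_lines(TEXT, r"\r")  # leaves the linefeed behind

print(repr(with_crlf))
print(repr(with_lfcr))
print(repr(with_cr_only))
```

Only the \r\n variant removes the short lines completely; the \r variant leaves a lone linefeed where each short line used to be, which shows up as a blank line.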
If I have a line of text that I want to remove from a text file in Notepad, and it is always formatted like this
[text]:
except that the words in the text area change, what is a regular expression I could create to remove the whole section with the search and replace function in Notepad?
To delete the entire line starting with [any text]: you can use: ^[\t ]*\[.*?\]:.*?\r\n
Explanation:
^ ... start search at beginning of a line (in this case).
[\t ]* ... find 0 or more tabs or spaces.
\[ ... find the opening square bracket as literal character.
.*? ... find 0 or more characters except the new line characters carriage return and line-feed, non-greedy, which means as few characters as possible to get a positive match, i.e. stop matching at the first occurrence of the following ] in the search expression.
\]: ... find the closing square bracket as literal character and a colon.
.*?\r\n ... find 0 or more characters except the new line characters and finally also the carriage return and line-feed terminating the line.
The search string ^[\t ]*\[.*?\]:.*?$ would find also the complete line, but without matching also the line termination.
The replace string is for both search strings an empty string.
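A short sketch of the first search string, run through Python's re module rather than an editor (an assumption of this example). The sample text is invented and uses \r\n line endings, as the expression expects.

```python
import re

# Hypothetical text with two "[...]:" lines, one of them indented by a tab.
TEXT = "[intro]:\r\nkeep me\r\n\t[notes]: trailing words\r\nkeep me too\r\n"

# The search string from above; re.MULTILINE lets ^ match at every line start.
LINE_WITH_TERMINATOR = re.compile(r"^[\t ]*\[.*?\]:.*?\r\n", re.MULTILINE)

# Replace with an empty string to delete the whole line.
cleaned = LINE_WITH_TERMINATOR.sub("", TEXT)
print(repr(cleaned))
```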
If by removing the entire section, you mean remove the [text]: up to the next [otherText]:, you can try this:
\[text\]:((?!\[[^\]]*\]:).)*
Remember to set the flag for ". matches newline".
This regex first matches your section title. Then, starting right after the title, it uses a negative lookahead before each character to check whether the text at that position looks like a section title. If it does, the match ends there.
Note: Remember that a replace-all would replace every occurrence of the matched pattern. In other words, if you have more than one such section, all of them are replaced.
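Here is a minimal sketch of that approach in Python, where re.DOTALL plays the role of the ". matches newline" flag. The section names and the helper remove_section are hypothetical.

```python
import re

# Hypothetical file content with three sections.
TEXT = "[alpha]:\nline a1\nline a2\n[beta]:\nline b1\n[gamma]:\nline c1\n"


def remove_section(text, title):
    """Remove "[title]:" and everything up to the next "[...]:" title."""
    pattern = re.compile(
        r"\[" + re.escape(title) + r"\]:((?!\[[^\]]*\]:).)*",
        re.DOTALL,  # the dot must match newlines to span several lines
    )
    return pattern.sub("", text)


without_beta = remove_section(TEXT, "beta")
without_gamma = remove_section(TEXT, "gamma")
print(without_beta)
```

When the section is the last one in the file, the match simply runs to the end of the text.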
I recently received a tab separated file that has 60 fields. Each field can have any character in it. The export I received also has linefeeds and carriage returns in some of the fields. This is causing the tab separated file to not import correctly. Is there a way to remove linebreaks and carriage returns if the line does not have 59 tabs on it? There may or may not be data between each tab.
Sample File
Line 3,4,5 is the issue I'm trying to fix.
Warning: I'm assuming that there are no tabs within a column's data. If there are, then you need something far more capable than what I have here.
The following works with the sample input provided:
First, replace all of the line breaks with a character that doesn't occur anywhere in your file. You can even use characters that you can't type with your keyboard.
Find what: (\r\n?|\n)
Replace with: \xB6
Then, match your 60-field rows and give them line-breaks (I'm going with Windows-style):
Find what: ^(([^\t]*\t){59}[^\t\xB6]*)\xB6
Replace with: $1\r\n
I'm making one huge assumption here: that column 60 never contains a line break. If this is false, then you're going to have some of column 60's data ending up in column 1 of the next record.
Now, if you don't like that paragraph symbol showing up in your data, you can either purge it or replace it with whatever you like:
Find what: \xB6
Replace with:
Explanation of matching patterns:
(\r\n?|\n) matches any of the three kinds of line breaks: a single \r, a single \n, or the Windows-style \r\n. Wikipedia has a whole article about this.
See http://regex101.com/r/iB6fK9 to explore the ^(([^\t]*\t){59}[^\t\xB6]*)\xB6 pattern.
I'm matching the beginning of the line with ^ at the start.
I have a group of zero or more characters that are not a tab, followed by a tab, that I match exactly 59 times with ([^\t]*\t){59}. That gets us the first 59 tab-separated columns. Only column 59 is captured by this group.
For column 60, I match zero or more characters that are neither a tab nor our special character with [^\t\xB6]*.
I capture the 60 columns with parentheses, but I leave our special character outside of the captured group so that it gets replaced with the \r\n that we insert with the $1\r\n replacement.
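The three replacements can be sketched in Python as follows. This is only an approximation of the editor workflow: Python's re.sub works on the unmodified string, so ^ would match only once after the line breaks are gone, and the sketch instead anchors each record at the end of the previous one with match(flat, pos). The helper repair_tsv, its parameters, and the 3-field sample are hypothetical, and the same assumptions apply (no tabs inside fields, no line break in the last column, placeholder absent from the data).

```python
import re

PLACEHOLDER = "\xb6"  # pilcrow; assumed not to occur anywhere in the data


def repair_tsv(text, field_count=60, filler=""):
    """Join records that were split by line breaks inside fields."""
    # Step 1: replace every kind of line break with the placeholder.
    flat = re.sub(r"\r\n?|\n", PLACEHOLDER, text)

    # Step 2: a record is (field_count - 1) tab-terminated fields plus a last
    # field that contains neither a tab nor the placeholder.
    record = re.compile(
        r"((?:[^\t]*\t){%d}[^\t%s]*)%s"
        % (field_count - 1, PLACEHOLDER, PLACEHOLDER)
    )
    parts = []
    pos = 0
    while True:
        match = record.match(flat, pos)  # anchored where the last record ended
        if match is None:
            break
        parts.append(match.group(1) + "\r\n")  # Windows-style line break
        pos = match.end()
    parts.append(flat[pos:])  # whatever could not be matched stays as-is

    # Step 3: purge (or replace) the placeholders left inside fields.
    return "".join(parts).replace(PLACEHOLDER, filler)


# Hypothetical 3-field sample: the second record is broken inside field 2.
SAMPLE = "a\tb\tc\r\nd\tbro\r\nken\te\r\nf\tg\th\r\n"
repaired = repair_tsv(SAMPLE, field_count=3)
print(repr(repaired))
```

Passing a filler such as a space keeps a visible separator where a line break was removed from inside a field.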
What I understand from your question is that you want to remove the Windows \r\n from your file. To do this you can use the Replace dialog (Ctrl+H).
Under Search Mode select Extended (\n, \r, ...), then enter \r\n in "Find what" and leave "Replace with" empty (or replace it with whatever you want).
I'd do:
Find what: ^((?:[^\t]*\t[^\t]*){1,58})[\r\n]+
Replace with: $1
This will replace the line break with nothing if there are fewer than 59 occurrences of the \t character in a line.
I'm trying to make a regular expression (for use with C++ 2011 std::regex, ECMAScript mode) to parse this content:
1
Second Line of Data for this entry
Here is the content
2
Second Line of Data for this entry
Here is the content that can be multiline
It is multiline for entry two
3
Second Line of Data for this entry
More content
Notice the '2' entry here: its content continues past a line break (\n) onto the next line, while a double \n\n separates it from the next entry.
I've tried this regex:
(\\d)\n(.*)\n(.*)\n\n
But it doesn't do what I would expect using clang 3.3.
Notes
Your example data seems to have a white-space character after the number 3.
It's not clear whether you need CRLF or just LF line endings.
Be careful about the characters after the last entry: any extra whitespace could cause the regex to fail to match the last item.
You have to be careful about what . (dot) matches; in this case we use [\s\S] to get around the dot not matching newlines.
A RegEx that will deal with the extra white-space and line feeds is as follows:
RegEx (\d+)\s*\r?\n(.*?)\r?\n([\s\S]*?)(?:\r?\n\r?\n|$)
Escaped "(\\d+)\\s*\\r?\\n(.*?)\\r?\\n([\\s\\S]*?)(?:\\r?\\n\\r?\\n|$)"
If you know you will always have an input with LF line endings it's a little more simple:
RegEx (\d+)\s*\n(.*?)\n([\s\S]*?)(?:\n\n|$)
Escaped "(\\d+)\\s*\\n(.*?)\\n([\\s\\S]*?)(?:\\n\\n|$)"
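To check what the expression captures, here is a small sketch that runs the CRLF-tolerant variant against the sample data. It uses Python's re module rather than std::regex, on the assumption that both engines treat this particular pattern the same way; the variable names are made up.

```python
import re

# The sample from the question, with the stray space after the 3.
DATA = (
    "1\n"
    "Second Line of Data for this entry\n"
    "Here is the content\n"
    "\n"
    "2\n"
    "Second Line of Data for this entry\n"
    "Here is the content that can be multiline\n"
    "It is multiline for entry two\n"
    "\n"
    "3 \n"
    "Second Line of Data for this entry\n"
    "More content"
)

ENTRY = re.compile(r"(\d+)\s*\r?\n(.*?)\r?\n([\s\S]*?)(?:\r?\n\r?\n|$)")

# Each tuple is (number, second line, content).
entries = ENTRY.findall(DATA)
for number, second_line, content in entries:
    print(number, repr(second_line), repr(content))
```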