RegEx to match text outside of square and curved brackets - regex

I have a log file that I need to search through. The text outside of the brackets is always the same, so to be as precise as I can, I would like to ignore the text inside both square and curved brackets.
So the log is;
[09/May/2022:11:04:05 +0000] NSMMReplicationPlugin - process_postop: Failed to apply update (6278e86f000627440000) error (-1). Aborting replication session(conn=5361 op=6)
I do not need the date inside the [ ] at the start, nor the numbers and text inside the two sets of ( ) on either side of the (-1); the (-1) itself is key to the search.
I just need to search on;
NSMMReplicationPlugin - process_postop: Failed to apply update error (-1). Aborting replication session
I cannot figure this out! I tried a lazy quantifier, but because the text I need contains ( ), it didn't work;
(?<=NSMMReplicationPlugin - process_postop: Failed to apply update)(.*)(?=error (-1). Aborting replication session)
I somehow need to escape the ( ) to get the -1 part?

We can use the following pattern, which divides the text into 4 groups, and keep only groups 2 and 4.
(\[.*\])(.*)(\(\w*\))(.*)
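To get groups 2 and 4 we use \2\4 or $2$4, depending on the flavour of regex. Note that \(\w*\) only matches parentheses containing word characters, so on the sample line it removes (6278e86f000627440000) but leaves the trailing (conn=5361 op=6) in group 4.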
See https://regex101.com/r/v6izps/1 for an example.
Depending on the language and the tools there may not be any need to put brackets around the 1st and 3rd groups.
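As a rough illustration in Python (an assumption, since the question does not name a language or tool), the sketch below shows two things: a direct search in which the literal parentheses and the dot are escaped, and the four-group pattern above applied with \2\4. The function names is_failed_update and strip_groups are hypothetical.

```python
import re

LINE = (
    "[09/May/2022:11:04:05 +0000] NSMMReplicationPlugin - process_postop: "
    "Failed to apply update (6278e86f000627440000) error (-1). "
    "Aborting replication session(conn=5361 op=6)"
)

# Direct search: \( \) and \. are literal characters here, and
# \([^)]*\) skips whatever ID sits inside the first pair of parentheses.
SEARCH = re.compile(
    r"NSMMReplicationPlugin - process_postop: Failed to apply update "
    r"\([^)]*\) error \(-1\)\. Aborting replication session"
)

# The four-group pattern from the answer; only groups 2 and 4 are kept.
GROUPS = re.compile(r"(\[.*\])(.*)(\(\w*\))(.*)")


def is_failed_update(line):
    """Return True if the line is the 'error (-1)' replication failure."""
    return SEARCH.search(line) is not None


def strip_groups(line):
    """Drop the leading [date] and the first (word-only) parenthesised ID."""
    return GROUPS.sub(r"\2\4", line)


if __name__ == "__main__":
    print(is_failed_update(LINE))
    print(strip_groups(LINE))
```

The direct search never needs to rewrite the line, which may be the simpler option if the goal is only to find matching entries.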

Related

Regex to capture two groups from record

I am working on an ETL to handle the parsing of machine generated logs. These logs resemble flattened json files as csv files. The payload of the json (and its length) depend on the log type, for example error, alarm, ...
Every so often, a corrupt line occurs in the log files. These corrupt lines combine two lines into a single line and start with the special character \x00. As such, these corrupt lines can be identified. Still, I would like to retrieve and separate these two lines from the corrupt line.
Data example (the corrupt line is line 3):
log file
2019.09.12 07:32:00,121,INIED
2019.09.12 09:21:50,611,ALARM ,E,303,ARM 2 VACUUM ERROR !!
\x00 2019.09.12 10:04:46,611,ALARM ,O,501, Check machine 2019.09.12 10:06:22,611,ALARM ,E,303,ARM 2 VACUUM ERROR !!
2019.09.12 10:52:38,209,RESUM
Ideally the corrupt record \x00 2019.09.12 10:04:46,611,ALARM ,O,501, Check machine 2019.09.12 10:06:22,611,ALARM ,E,303,ARM 2 VACUUM ERROR !! would be retrieved as
group 1: 2019.09.12 10:04:46,611,ALARM ,O,501, Check machine
group 2: 2019.09.12 10:06:22,611,ALARM ,E,303,ARM 2 VACUUM ERROR !!
I started with the capturing group \d{4}.\d{2}.\d{2} \d{2}:\d{2}:\d{2}(.*) to get everything after the timestamps. This seemed the easiest method, as I cannot assume that the length of the line is fixed (due to the flattened json).
Questions:
I am unsure how to terminate my capturing group. I was thinking to use the end of the line or the next timestamp it finds. Any advice to combine these clauses?
In addition, this method removes the timestamps themselves from the capturing group. Should I use a different method?
As you were thinking, you should include in your capturing group the end of the line and timestamp combined in an OR clause.
In your expression, since you want the timestamp and text together, you don't want a capturing group with just (.*) but with the entire expression (\d{4}.\d{2}.\d{2} \d{2}:\d{2}:\d{2}.*)
So the combination of these two would be:
(\d{4}.\d{2}.\d{2} \d{2}:\d{2}:\d{2}.*?)(?:$|(?=\d{4}.\d{2}.\d{2} \d{2}:\d{2}:\d{2}))
The OR clause is a non-capturing group consisting of the end of the line '$' and a positive lookahead for the next timestamp. The lazy .*? makes the capturing group stop at whichever of the two comes first.
You can use the site https://regexr.com/ to test and validate expressions, you should try it.
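A minimal Python sketch of the expression above, assuming the corrupt line is already available as a string. The dots in the date are escaped here so they only match a literal '.', which is a small tightening of the original pattern; the helper name split_records is hypothetical.

```python
import re

TIMESTAMP = r"\d{4}\.\d{2}\.\d{2} \d{2}:\d{2}:\d{2}"

# Capture a timestamp plus everything after it, lazily, up to the end of
# the line or up to (but not including) the next timestamp.
RECORD = re.compile(rf"({TIMESTAMP}.*?)(?:$|(?={TIMESTAMP}))")


def split_records(line):
    """Return every timestamped record found in a (possibly corrupt) line."""
    return [record.strip() for record in RECORD.findall(line)]


if __name__ == "__main__":
    corrupt = (
        "\x00 2019.09.12 10:04:46,611,ALARM ,O,501, Check machine "
        "2019.09.12 10:06:22,611,ALARM ,E,303,ARM 2 VACUUM ERROR !!"
    )
    for record in split_records(corrupt):
        print(record)
```

Because the same function returns a one-element list for a normal line, it could be applied to every line of the file rather than only to the corrupt ones.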

Regex for Notepad++

I have a log for an app that executes every minute or so and at the end it reports number of records it failed with like this: TaskBlahBlah: [0] records failed
Up until now I would simply search through the whole document for the string ] records failed and would visually identify the lines with more than zero records. Is there a way to use regex and search for any non-zero value specifically so I don't have to visually go through the list and potentially miss something?
I tried applying some regex in Notepad++ but it seems that either I did it wrong or Notepad++ has a 'different' regex or something.
thank you
EDIT: just to list some of the things I tried:
[1-9][0-9]|[1-9]
\[[1-9][0-9]\] records failed|\[[1-9]\] records failed
For some reason it picks up things like [1] records failed but not [10] records failed
I guess this should get what you want (in Notepad++, enter it without the surrounding slashes, which are only delimiters):
/\[[1-9]\d*\] records failed/
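For checking the pattern outside Notepad++, here is a small Python sketch. [1-9]\d* requires a first digit other than zero followed by any number of further digits, so [0] is skipped while [1] and [10] both match. The function name failed_lines and the sample task names are made up for illustration.

```python
import re

# A non-zero count: first digit 1-9, then any further digits.
NONZERO_FAILED = re.compile(r"\[[1-9]\d*\] records failed")


def failed_lines(lines):
    """Return only the lines that report one or more failed records."""
    return [line for line in lines if NONZERO_FAILED.search(line)]


if __name__ == "__main__":
    sample = [
        "TaskBlahBlah: [0] records failed",
        "TaskBlahBlah: [1] records failed",
        "TaskBlahBlah: [10] records failed",
    ]
    for line in failed_lines(sample):
        print(line)
```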

Find all instances where count is less than

I have a big log file which contains IDs. If the ID is present in the log more than 5 times - it's a success. If it's less - I want to know which ID it is.
Ultimately I need a way in Notepad++ that would give me a list of all IDs ([0-9]{10}) where the instance of that is 5 or less.
Is this somehow possible?
Edit: The format of the file is a standard log4j log, so it has a ton of other data. Example (ID in this case is 12345678901234567)
[08-08-2015 02:08:00] [INFO ] Service [329]: Attempting to substitute message ID with 12345678901234567
[08-08-2015 02:08:00] [DEBUG] ParsedBlock [49]: 3296825 => 12345678901234567
[08-08-2015 02:08:00] [DEBUG] LifeCycle [149]: All messages have not yet been sent. Waiting another 2000 milliseconds. [Send: false]
[08-08-2015 02:08:00] [DEBUG] LifeCycle$5 [326]: Running 5, 2592
Since you're in Notepad++ in the first place, you can take advantage of its functionality outside of Search. Be sure you do all this in a copy of the file, not the original, since it makes changes to the file. Since you haven't answered about the format of the file, I'm assuming the file is just the IDs, one on each line.
The first step is to sort the IDs so all the duplicates appear contiguously: Edit -> Line Operations -> Sort Lines As Integers Ascending
Then do this Search/Replace (with Search Mode set to regex):
Search: (\d{17}\r\n)\1{5,}|(\d{17}\r\n)\2*
Replace: $2
You'll be left with only the IDs that occur 5 or fewer times.
Explanation:
The first half of the alternation (\d{17}\r\n)\1{5,} matches any IDs that repeat 6 or more times. The second half (\d{17}\r\n)\2* matches any other IDs, capturing the first instance in group #2. Then the replace puts back that group with $2.
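If leaving Notepad++ is an option, counting can also be done directly on the raw log without sorting or editing it. The sketch below is an alternative technique, not the sort-and-replace method above; it assumes the IDs are 17-digit numbers, as in the example lines, and the function name rare_ids is hypothetical.

```python
import re
from collections import Counter

# Assumption: an ID is a standalone run of exactly 17 digits.
ID_PATTERN = re.compile(r"\b\d{17}\b")


def rare_ids(lines, max_count=5):
    """Return the IDs that occur max_count times or fewer, sorted."""
    counts = Counter(
        found for line in lines for found in ID_PATTERN.findall(line)
    )
    return sorted(i for i, n in counts.items() if n <= max_count)


if __name__ == "__main__":
    log = [
        "[08-08-2015 02:08:00] [INFO ] Service [329]: Attempting to "
        "substitute message ID with 12345678901234567",
        "[08-08-2015 02:08:00] [DEBUG] ParsedBlock [49]: 3296825 => "
        "12345678901234567",
        "[08-08-2015 02:08:00] [DEBUG] LifeCycle$5 [326]: Running 5, 2592",
    ]
    print(rare_ids(log))
```

Shorter numbers such as 3296825 or the bracketed line numbers are ignored because they do not have exactly 17 digits.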

How can I combine matching rules in vim into one?

I have a log file in which I have DEBUG, NORMAL and CRITICAL entries, as well as some info lines that do not start with the usual (for this type of log) prefix, e.g. [20130313:123412]
[210313:100114] NORMAL: this is normal log
[210313:100114] DEBUG: ../../common/
Additional info:
number of ....
I would like to remove both DEBUG entries as well as those that do not start with [
I know I can do that with:
:g/DEBUG/d
and
:g!/^\[/d
How can I combine these into one? Or use a regex properly?
Convert them both to positive or negative rules (as appropriate), and then you can use \| ("or") to match one or the other.
:g/^[^\[]\|DEBUG/d
That would do it: ^[^\[] matches lines starting with anything other than [, and DEBUG matches lines containing DEBUG.
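The same alternation can be tried outside vim. This Python sketch only illustrates the logic of the combined rule (in Python the "or" is a plain |, not \|); the function name filter_log is hypothetical.

```python
import re

# Drop a line if it starts with something other than '[' or contains DEBUG.
DROP = re.compile(r"^[^\[]|DEBUG")


def filter_log(lines):
    """Keep only the lines that the combined rule would not delete."""
    return [line for line in lines if not DROP.search(line)]


if __name__ == "__main__":
    sample = [
        "[210313:100114] NORMAL: this is normal log",
        "[210313:100114] DEBUG: ../../common/",
        "Additional info:",
        "number of ....",
    ]
    for line in filter_log(sample):
        print(line)
```

As in the vim command, a completely empty line is kept, because ^[^\[] needs at least one character to match.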

Regular Expression to Clean a numbered list

I've only just started playing with Regex and seem to be a little stuck! I have written a bulk find and replace using multiline in TextSoap. It is for cleaning up recipes that I have OCR'd, and because there are both Ingredients and Directions I cannot simply change a "1 " to become "1. ", as this could rewrite "1 Tbsp" as "1. Tbsp".
I therefore did a check to see if the following two lines (possibly with extra rows) were the next sequential numbers, using this code as the find:
^(1) (.*)\n?((\n))(^2 (.*)\n?(\n)^3 (.*)\n?(\n))
^(2) (.*)\n?((\n))(^3 (.*)\n?(\n)^4 (.*)\n?(\n))
^(3) (.*)\n?((\n))(^4 (.*)\n?(\n)^5 (.*)\n?(\n))
^(4) (.*)\n?((\n))(^5 (.*)\n?(\n)^6 (.*)\n?(\n))
^(5) (.*)\n?((\n))(^6 (.*)\n?(\n)^7 (.*)\n?(\n))
and the following as the replace for each of the above:
$1. $2 $3 $4$5
My problem is that although it works as I wanted it to, it will never perform the task for the last three numbers...
An example of the text I want to clean up:
1 This is the first step in the list
2 Second lot if instructions to run through
3 Doing more of the recipe instruction
4 Half way through cooking up a storm
5 almost finished the recipe
6 Serve and eat
And what I want it to look like:
1. This is the first step in the list
2. Second lot if instructions to run through
3. Doing more of the recipe instruction
4. Half way through cooking up a storm
5. almost finished the recipe
6. Serve and eat
Is there a way to check the previous line or two above to run this backwards? I have looked at lookahead and lookbehind and I am somewhat confused at that point. Does anybody have a method to clean up my numbered list or help me with the regex I desire please?
dan1111 is right. You may run into trouble with similar looking data. But given the sample you provided, this should work:
^(\d+)\s+([^\r\n]+)(?:[\r\n]*) // search
$1. $2\r\n\r\n // replace
If you're not using Windows, remove the \rs from the replace string.
Explanation:
^ // beginning of the line
(\d+) // capture group 1. one or more digits
\s+ // any spaces after the digit. don't capture
([^\r\n]+) // capture group 2. all characters up to any EOL
(?:[\r\n]*) // consume additional EOL, but do not capture
Replace:
$1. // group 1 (the digit), then period and a space
$2 // group 2
\r\n\r\n // two EOLs, to create a blank line
// (remove both \r for Linux)
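A hedged Python adaptation of the search and replace above. Two details differ from the original and are assumptions: the replacement ends with a single \n so the result has no blank lines, and [ \t]+ is used instead of \s+ so the match cannot run across a line break. The function name number_steps is hypothetical.

```python
import re

# Start of line, the step number, spaces or tabs, the rest of the line,
# then any run of line endings (consumed but not kept).
STEP = re.compile(r"^(\d+)[ \t]+([^\r\n]+)[\r\n]*", re.MULTILINE)


def number_steps(text):
    """Turn '1 Some step' into '1. Some step', one step per line."""
    return STEP.sub(r"\1. \2\n", text)


if __name__ == "__main__":
    recipe = (
        "1 This is the first step in the list\n"
        "2 Second lot if instructions to run through\n"
        "\n"
        "3 Doing more of the recipe instruction\n"
    )
    print(number_steps(recipe), end="")
```

Like the original pattern, this rewrites every line that starts with a number, so an ingredient line such as "1 Tbsp salt" would also become "1. Tbsp salt".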
What about this?
1 Tbsp salt
2 Tsp sugar
3 Eggs
You have run into a major limitation of regexes: they don't work well when your data can't be strictly defined. You may intuitively know what are ingredients and what are steps, but it isn't easy to go from that to a reliable set of rules for an algorithm.
I suggest you instead think about an approach that is based on position within the file. A given cookbook usually formats all the recipes the same: such as, the ingredients come first, followed by the list of steps. This would probably be an easier way to tell the difference.
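One way the position-based idea could look in Python is sketched below. It assumes each recipe contains a line reading exactly "Directions" before the steps; that heading text, and the function name number_directions, are assumptions for illustration and would need adjusting to the actual OCR output.

```python
import re

# A leading step number followed by spaces or tabs.
STEP = re.compile(r"^(\d+)[ \t]+")


def number_directions(text, heading="Directions"):
    """Add '.' after leading numbers, but only below the heading line."""
    out = []
    in_directions = False
    for line in text.splitlines():
        if line.strip() == heading:
            in_directions = True
        elif in_directions:
            line = STEP.sub(r"\1. ", line)
        out.append(line)
    return "\n".join(out)


if __name__ == "__main__":
    recipe = (
        "Ingredients\n"
        "1 Tbsp salt\n"
        "2 Tsp sugar\n"
        "Directions\n"
        "1 Mix everything\n"
        "2 Serve and eat"
    )
    print(number_directions(recipe))
```

Lines above the heading are left untouched, so ingredient quantities are not turned into step numbers.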