Find all instances where count is less than - regex

I have a big log file which contains IDs. If an ID appears in the log more than 5 times, it's a success. If it appears fewer times, I want to know which ID it is.
Ultimately I need a way in Notepad++ to get a list of all IDs ([0-9]{10}) that occur 5 times or fewer.
Is this somehow possible?
Edit: The format of the file is a standard log4j log, so it has a ton of other data. Example (ID in this case is 12345678901234567)
[08-08-2015 02:08:00] [INFO ] Service [329]: Attempting to substitute message ID with 12345678901234567
[08-08-2015 02:08:00] [DEBUG] ParsedBlock [49]: 3296825 => 12345678901234567
[08-08-2015 02:08:00] [DEBUG] LifeCycle [149]: All messages have not yet been sent. Waiting another 2000 milliseconds. [Send: false]
[08-08-2015 02:08:00] [DEBUG] LifeCycle$5 [326]: Running 5, 2592

Since you're in Notepad++ in the first place, you can take advantage of its functionality outside of Search. Be sure you do all this in a copy of the file, not the original, since it makes changes to the file. Since you haven't answered about the format of the file, I'm assuming the file is just the IDs, one on each line.
The first step is to sort the IDs so all the duplicates appear contiguously: Edit -> Line Operations -> Sort Lines As Integers Ascending
Then do this Search/Replace (with Search Mode set to regex):
Search: (\d{17}\r\n)\1{5,}|(\d{17}\r\n)\2*
Replace: $2
You'll be left with only the IDs that occur 5 or fewer times.
Explanation:
The first half of the alternation (\d{17}\r\n)\1{5,} matches any IDs that repeat 6 or more times. The second half (\d{17}\r\n)\2* matches any other IDs, capturing the first instance in group #2. Then the replace puts back that group with $2.
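If stepping outside Notepad++ is an option, the same count can be taken straight from the raw log4j file, without reducing it to one ID per line and sorting first. The following is only a minimal sketch: it assumes 17-digit IDs as in the example lines, and the function name rare_ids and its parameters are made up for illustration.

```python
import re
from collections import Counter

def rare_ids(log_text, max_count=5, id_digits=17):
    """Return {id: count} for IDs that occur max_count times or fewer."""
    # Lookarounds keep the match from landing inside a longer run of digits.
    pattern = re.compile(rf"(?<!\d)\d{{{id_digits}}}(?!\d)")
    counts = Counter(pattern.findall(log_text))
    return {id_: n for id_, n in counts.items() if n <= max_count}
```

Note that this counts every occurrence, so an ID mentioned twice on consecutive lines counts as two.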

Related

Regex to capture two groups from record

I am working on an ETL to handle the parsing of machine generated logs. These logs resemble flattened json files as csv files. The payload of the json (and its length) depend on the log type, for example error, alarm, ...
Every so often, a corrupt line occurs in the log files. These corrupt lines combine two lines into a single line and start with the special character \x00. As such, these corrupt lines can be identified. Still, I would like to retrieve and separate these two lines from the corrupt line.
Data example (the corrupt line is line 3):
log file
2019.09.12 07:32:00,121,INIED
2019.09.12 09:21:50,611,ALARM ,E,303,ARM 2 VACUUM ERROR !!
\x00 2019.09.12 10:04:46,611,ALARM ,O,501, Check machine 2019.09.12 10:06:22,611,ALARM ,E,303,ARM 2 VACUUM ERROR !!
2019.09.12 10:52:38,209,RESUM
Ideally the corrupt record \x00 2019.09.12 10:04:46,611,ALARM ,O,501, Check machine 2019.09.12 10:06:22,611,ALARM ,E,303,ARM 2 VACUUM ERROR !! would be retrieved as
group 1: 2019.09.12 10:04:46,611,ALARM ,O,501, Check machine
group 2: 2019.09.12 10:06:22,611,ALARM ,E,303,ARM 2 VACUUM ERROR !!
I started with the capturing group \d{4}.\d{2}.\d{2} \d{2}:\d{2}:\d{2}(.*) to get everything after the timestamps. This seemed the easiest method, as I cannot assume that the length of the line is fixed (due to the flattened json).
Questions:
I am unsure how to terminate my capturing group. I was thinking to use the end of the line or the next timestamp it finds. Any advice to combine these clauses?
In addition, this method removes the timestamps themselves from the capturing group. Should I use a different method?
As you were thinking, you should include in your capturing group the end of the line and timestamp combined in an OR clause.
In your expression, since you want the timestamp and text together, you don't want a capturing group with just (.*) but with the entire expression (\d{4}.\d{2}.\d{2} \d{2}:\d{2}:\d{2}.*)
So the combination of these two would be:
(\d{4}.\d{2}.\d{2} \d{2}:\d{2}:\d{2}.*?)(?:$|(?=\d{4}.\d{2}.\d{2} \d{2}:\d{2}:\d{2}))
The OR clause is a non-capturing group made up of the end of the line '$' and a positive lookahead for the next timestamp.
You can test and validate expressions on a site such as https://regexr.com/.
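For the ETL side, here is a minimal Python sketch of the same expression applied to a corrupt line. Two things differ from the pattern above and are my own assumptions: the dots in the date are escaped so they only match a literal ".", and the trailing space left on the first record is stripped. The helper name split_records is hypothetical.

```python
import re

# Timestamp such as 2019.09.12 10:04:46 (dots escaped to match literally).
TS = r"\d{4}\.\d{2}\.\d{2} \d{2}:\d{2}:\d{2}"
# A record runs lazily from one timestamp to the next timestamp or end of line.
RECORD = re.compile(rf"({TS}.*?)(?:$|(?={TS}))")

def split_records(line):
    """Return every timestamped record found in a single (possibly corrupt) line."""
    return [record.strip() for record in RECORD.findall(line)]
```

A well-formed line simply comes back as a one-item list, so the same function could be applied to every line of the file.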

Can regex be used to find this pattern?

I need to parse a large amount of data in a log file. Ideally I can do this by splitting the file into a list where each item is an individual log entry.
Every time a log entry is made it is prefixed with a string following this pattern:
"4404: 21:42:07.433 - After this point there could be anything (including new line characters and such). However, as soon as the prefix repeats that indicates a new log entry."
4404 can be any number, but is always followed by a :.
21:42:07.433 is 21 hours, 42 minutes, 7 seconds and 433 milliseconds.
I don't know much about regex, but is it possible to identify this pattern using it?
I figured something like this would work...
"*: [0-24]:[0:60]:[0:60].[0-1000] - *"
However, it just throws an exception and I fear I'm not on the right track at all.
List<string> split_content = Regex.Matches(file_content, @"*: [0-24]:[0:60]:[0:60].[0-1000] - *").Cast<Match>().Select(m => m.Value).ToList();
The following expression would split a string according to your pattern:
\d+: \d{2}:\d{2}:\d{2}\.\d{3}
Add a ^ at the beginning if your delimiting string always starts a line (and use the m flag, RegexOptions.Multiline in .NET). Capturing the log chunks with a regex would be more elaborate; I'd suggest just splitting (with Regex.Split) if you have the whole log content in memory at once.
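To illustrate the idea outside .NET, here is a minimal Python sketch that cuts the text at every position where the prefix starts a line, so each entry keeps its own prefix and any embedded newlines. It assumes the prefix always begins a line (the ^ variant above); the function name split_entries is made up for this example.

```python
import re

# Prefix such as "4404: 21:42:07.433" at the start of a line.
PREFIX = re.compile(r"^\d+: \d{2}:\d{2}:\d{2}\.\d{3}", re.MULTILINE)

def split_entries(text):
    """Split log text into entries, each beginning with its own prefix."""
    starts = [m.start() for m in PREFIX.finditer(text)]
    bounds = starts + [len(text)]
    # Slice from each prefix to the next one; multi-line entries stay intact.
    return [text[a:b].rstrip("\n") for a, b in zip(bounds, bounds[1:])]
```

Anything before the first prefix is dropped, which may or may not be what you want for a file that starts mid-entry.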

How to convert a *.txt file (copy/pasted variables) into a tabular format

I have a bunch of variables (roughly 80) which I copy+paste into my editor (get those variables from a different *.txt file). After this, it looks a bit messy like
ka15 1-2 tre15 3-4 hsha15 5
juso15 6
kl15 7-9 kkjs15 10
but I'd like to have it structured to get a better idea of what's going on inside the code. I also have to strip away the 15 from each variable. Ideally I would get something like
ka 1-2 tre 3-4 hsha 5
juso 6 kl 7-9 kkjs 10
Is there a clever way to achieve this? I am using SAS Enterprise Guide Editor and VSCode but couldn't find a way. For example, when I find and replace the 15, I wish I could replace it with a tab, but I couldn't find that option in either editor. Any ideas to get this automated, or at least not do everything by hand?
I found a hacky solution to your problem, if anyone finds a better solution, I'll delete mine, but here it goes ¯\_(ツ)_/¯:
1) Copy all content of file(for example I copied yours twice):
ka15 1-2 tre15 3-4 hsha15 5
juso15 6
kl15 7-9 kkjs15 10
ka15 1-2 tre15 3-4 hsha15 5
juso15 6
kl15 7-9 kkjs15 10
2) Ctrl+H and replace all 15 with nothing (leave empty) using Ctrl+Alt+Enter.
3) Ctrl+F and turn on Regular expressions in the search box. Now type \s to match whitespace; it should select one whitespace character after every word. Now select all occurrences with Alt+Enter and press Backspace followed by Enter. This will delete the spaces between the words and place one word on each line, like so:
ka
1-2
tre
3-4
hsha
...
Press Escape to remove multiple cursors.
4) Press Ctrl+F again and type $ in the search box. This will select the end of every line. Again, press Alt+Enter to select all occurrences and press Space 5-8 times. Notice, however, that the cursors are not properly aligned. Press Escape to remove multiple cursors.
5) Place the cursor a few spaces from the first word. Then hold Ctrl+Alt+↓ to add multiple cursors below the first one. Then press Shift+End to select all the whitespace to the end of every line and press Delete to delete it. Press Delete again to align all words on one line separated by n spaces.
6) Unfortunately, I couldn't find a regex for the last part. The cursor should be placed after every 6th variable, but I solved it by placing the cursor next to every 7th word and pressing Enter.
I usually don't type too much like this, but I liked the problem you had. It was more puzzle than a problem to me.
I've come up with 3 regex's that will do what you want. In order to run them all sequentially you will need the regreplace extension or similar.
This goes in your settings:
"regreplace.on-save": false,
"regreplace.commands": [
    {
        "name": "Transform Data to Table Format, step 1",
        "regexp": "([a-zA-Z]+|[\\d-]+)(15)?(\\s[\r\n]?)*",
        "replace": "$1 \n",
        "priority": 1,
    },
    {
        "name": "Transform Data to Table Format, step 2",
        "regexp":
            "(([\\S-] {6})(.*))|(([\\S-]{2} {5})(.*))|(([\\S-]{3} {4})(.*))|(([\\S-]{4} {3})(.*))",
        "replace": "$2$5$8$11",
        "priority": 2,
    },
    {
        "name": "Transform Data to Table Format, step 3",
        "regexp":
            "((.*)\n)((.*)\n)((.*)\n)((.*)\n)((.*)\n)((.*?)(\\s*\\n))",
        "replace": "$2$4$6$8$10$12\n",
        "priority": 3,
    }
],
It creates a rule for each of the three regex steps. All three rules can be run sequentially by running the regreplace.regreplace command.
The regex's are designed to look good with data items up to 4 characters long but could be easily modified for longer items.
In step 1, increase the number of spaces before the \n in the replace rule to something like 16 or so.
In step 2, you will have to follow the pattern of the regex groups, like (([\\S-]{4} {3})(.*)), to modify them. A 13-character variable might require something like (([\\S-]{13} {3})(.*)) as the last group and (([\\S-] {15})(.*)) as the first in the sequence, modifying all the other groups in order. Let me know if you need help with that.
Step 3 needs no modification unless you want to change how many data items appear on each line - right now there are 3 variables with their data on each line hence 6 groups in that regex.
It does not matter how many data-value pairs are in any row prior to running the command.
[Two items of caution: First, there should not be any empty lines before the start of the data, although if necessary you could add a regex as the first rule to remove empty lines. Empty lines within the data or at the end are not a problem.
Second, the extension cannot be run on selected text only, so you will have to place your data at the top of an empty file to convert it and then copy it elsewhere if you wish.]
There is also the replace rules extension, which works like regreplace and, according to the docs, can run on a selection only, but it didn't work for me here for some unknown reason. It does have a nicer interface though: all the regexes could go into a single rule, which could then be run independently.
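If running a small script on the pasted text is acceptable, the whole transformation can also be done outside the editor. This is only a sketch under the assumption that the tokens strictly alternate between a variable name and its value; the function name tabulate and its parameters are invented for illustration.

```python
def tabulate(text, suffix="15", pairs_per_row=3):
    """Strip the suffix from each name and lay out name/value pairs in aligned columns."""
    tokens = text.split()
    cells = []
    # Even positions are names, odd positions are their values (assumed).
    for name, value in zip(tokens[0::2], tokens[1::2]):
        if name.endswith(suffix):
            name = name[: -len(suffix)]
        cells += [name, value]
    width = max((len(c) for c in cells), default=0) + 1
    per_row = 2 * pairs_per_row
    rows = [cells[i:i + per_row] for i in range(0, len(cells), per_row)]
    return "\n".join("".join(c.ljust(width) for c in row).rstrip() for row in rows)
```

Calling print(tabulate(pasted_text)) on the sample from the question gives two rows of three pairs each; replacing ljust with "\t".join would give tab-separated output instead.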

Regex for Notepad++

I have a log for an app that executes every minute or so and at the end it reports number of records it failed with like this: TaskBlahBlah: [0] records failed
Up until now I would simply search through the whole document for the string ] records failed and visually identify the lines with more than zero records. Is there a way to use regex to search specifically for any non-zero value, so I don't have to go through the list visually and potentially miss something?
I tried applying some regex in Notepad++, but it seems that either I did it wrong or Notepad++ has a 'different' regex flavor or something.
Thank you
EDIT: just to list some of the things I tried:
[1-9][0-9]|[1-9]
\[[1-9][0-9]\] records failed|\[[1-9]\] records failed
For some reason it picks up things like [1] records failed but not [10] records failed
I guess this should get what you want (in Notepad++, enter it without the surrounding / delimiters):
/\[[1-9]\d*\] records failed/
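The same expression can be used to filter the log in a script. A minimal Python sketch follows; the function name failing_lines is hypothetical, and it assumes the count is always written without leading zeros, as in the example line.

```python
import re

# A non-zero count: first digit 1-9, then any number of further digits.
FAILED = re.compile(r"\[([1-9]\d*)\] records failed")

def failing_lines(log_text):
    """Return the lines that report one or more failed records."""
    return [line for line in log_text.splitlines() if FAILED.search(line)]
```

Because \d* allows any number of trailing digits, this picks up [1], [10] and [100] alike, which is what the two-branch attempts in the question were missing.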

What would be the best (runtime performance) application or pattern or code or library for matching string patterns

I have been trying to figure out a decent way of matching string patterns. I will try my best to provide as much information as I can regarding what I am trying to do.
The simplest way to put it is that there are some specified patterns, and we want to know which of these patterns match completely or partially for a given request. The specified patterns hardly change. There are about 10K requests per day, but the results have to be provided ASAP, so runtime performance is the highest priority.
I have been thinking of using assembly-compiled regular expressions in C# for this, but I am not sure if I am headed in the right direction.
Scenario:
Data File:
Let's assume that data is provided as an XML request in a known schema format. It has anywhere between 5-20 rows of data. Each row has 10-30 columns. Each of the columns can only have data in a pre-defined pattern. For example:
A1 - Will be "3 digits" followed by a "." followed by "2 digits" - [0-9]{3}.[0-9]{2}
A2 - Will be "1 character" followed by "digits" - [A-Z][0-9]{4}
The sample would be something like:
<Data>
<R1>
<A1>123.45</A1>
<A2>A5567</A2>
<A4>456EV</A4>
<An>xxx</An>
</R1>
</Data>
Rule File:
Rule ID   A1            A2
1001      [0-9]{3}.45   A55[0-8]{2}
2002      12[0-9].55    [X-Z][0-9]{4}
3055      [0-9]{3}.45   [X-Z][0-9]{4}
Rule Location - I am planning to store the Rule IDs in some sort of bit mask.
So the rule IDs are then listed as locations in a string:
Rule ID   Location (from left to right)
1001      1
2002      2
3055      3
Pattern file: (This is not the final structure, but just a thought)
Column   Pattern         Rule Location
A1       [0-9]{3}.45     101
A1       12[0-9].55      010
A2       A55[0-8]{2}     100
A2       [X-Z][0-9]{4}   011
Now let's assume that SOMEHOW (not sure how I am going to limit the search to save time) I run the regex and make sure that the A1 column is only matched against A1 patterns and the A2 column against A2 patterns. I would end up with the following results for "Rule Location":
Column   Pattern         Rule Location
A1       [0-9]{3}.45     101
A2       A55[0-8]{2}     100
Doing AND on each of the locations gives me location 1 - 1001 - Complete match.
Doing XOR on each of the locations gives me location 3 - 3055 - Partial match. (I am purposely not doing an OR, because that would have returned 1001 and 3055 as the result, which would be wrong for a partial match.)
The final results I am looking for are:
1001 - Complete Match
3055 - Partial Match
Start Edit_1: Explanation on Matching results
Complete Match - This occurs when all of the patterns in a given Rule are matched.
Partial Match - This occurs when NOT all of the patterns in a given Rule are matched, but at least one pattern matches.
Example Complete Match (AND):
Rule ID 1001 matched for A1 (101) and A2 (100). If you look at the first character in 101 and 100 it is "1". When you do an AND - 1 AND 1 the result is 1. Thus position 1 i.e. 1001 is a Complete Match.
Example Partial Match (XOR):
Rule ID 3055 matched for A1 (101). If you look at the last character in 101 and 100 it is "1" and "0". When you do an XOR - 1 XOR 0 the result is 1. Thus position 3 i.e. 3055 is a Partial Match.
End Edit_1
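The AND/XOR bookkeeping described in this edit can be sketched with Python integers, which have arbitrary precision, so a mask that is 100K bits wide is still a single value. This is only an illustration: the names RULE_IDS and classify are made up, and the partial match is computed as "set in at least one mask but not in all" (OR AND NOT AND). For two columns that gives the same result as XOR; with three or more columns a plain XOR would report parity rather than "some but not all", which is why the sketch avoids it.

```python
RULE_IDS = ["1001", "2002", "3055"]  # bit positions 1..3, read left to right

def classify(masks, rule_ids):
    """Return (complete, partial) rule IDs from per-column bit-mask strings."""
    width = len(rule_ids)
    all_mask = (1 << width) - 1  # start with every bit set for the AND
    any_mask = 0                 # start with no bits set for the OR
    for mask in masks:
        bits = int(mask, 2)
        all_mask &= bits
        any_mask |= bits
    partial_mask = any_mask & ~all_mask

    def ids(mask):
        # The leftmost character of the string is the highest bit.
        return [rid for i, rid in enumerate(rule_ids)
                if (mask >> (width - 1 - i)) & 1]

    return ids(all_mask), ids(partial_mask)
```

With the two masks from the example, classify(["101", "100"], RULE_IDS) yields 1001 as the complete match and 3055 as the partial match.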
Input:
The data will be provided in some sort of XML request. It can be one big request with 100K Data nodes or 100K requests with one data node only.
Rules:
The matching values have to be initially saved as some sort of pattern to make them easier to write and edit. Let's assume that there are approximately 100K rules.
Output:
Need to know which rules matched completely and partially.
Preferences:
I would prefer doing as much of the coding as I can in C#. However if there is a major performance boost, I can use a different language.
The "Input" and "Output" are my requirements; how I manage to get the "Output" does not matter. It has to be fast; let's say each Data node has to be processed in approximately 1 second.
Questions:
Are there any existing patterns or frameworks to do this?
Is using Regex the right path, specifically Assembly Compiled Regex?
If I end up using Regex, how can I specify for A1 patterns to only match against the A1 column?
If I do specify rule locations in a bit type pattern, how do I process ANDs and XORs when it grows to be 100K characters long?
I am looking for any suggestions or options that I should consider.
Thanks..
The regular expression API only tells you when a pattern fully matched, not when it partially matched. What you therefore need is some variation on a regular expression API that lets you try to match multiple regular expressions at once, and at the end can tell you which matched fully and which matched partially. Ideally one that lets you precompile a set of patterns so you can avoid compilation at runtime.
If you had that, then you could match your A1 patterns against the A1 column, your A2 patterns against the A2 column, and so on. Then do something with the list of partial and full matches.
The bad news is that I don't know of any software out there that implements this.
The good news is that the strategy described in http://swtch.com/~rsc/regexp/regexp1.html should be able to implement this. In particular the State sets can be extended to have information about your current state in multiple patterns at the same time. This extended set of State sets will result in a more complex state diagram (because you're tracking more stuff), and a more complex return at the end (you're returning a set of State sets), but runtime won't be changed a bit, whether you're matching one pattern or 50.
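To be clear, the following is not the state-set approach from that article. It is a much simpler sketch that shows only the routing part: patterns are precompiled once and grouped by column, so an A1 pattern is only ever tried against the A1 value, and a rule counts as complete or partial depending on how many of its columns matched. The RULES table mirrors the example in the question, with the dot escaped (my assumption) so it matches a literal "."; build_index and match_row are hypothetical names.

```python
import re

RULES = {
    "1001": {"A1": r"[0-9]{3}\.45", "A2": r"A55[0-8]{2}"},
    "2002": {"A1": r"12[0-9]\.55", "A2": r"[X-Z][0-9]{4}"},
    "3055": {"A1": r"[0-9]{3}\.45", "A2": r"[X-Z][0-9]{4}"},
}

def build_index(rules):
    """Group patterns by column; each distinct pattern is compiled once."""
    by_column = {}
    for rule_id, columns in rules.items():
        for column, pattern in columns.items():
            by_column.setdefault(column, {}).setdefault(pattern, set()).add(rule_id)
    return {column: [(re.compile(p), ids) for p, ids in patterns.items()]
            for column, patterns in by_column.items()}

def match_row(row, index, rules):
    """Return (complete, partial) rule IDs for one row of column values."""
    hits = {rule_id: 0 for rule_id in rules}
    for column, value in row.items():
        # Only patterns registered for this column are tried against it.
        for regex, rule_ids in index.get(column, []):
            if regex.fullmatch(value):
                for rule_id in rule_ids:
                    hits[rule_id] += 1
    complete = sorted(r for r, n in hits.items() if n == len(rules[r]))
    partial = sorted(r for r, n in hits.items() if 0 < n < len(rules[r]))
    return complete, partial
```

Sharing one compiled regex between rules that use the same pattern (1001 and 3055 on A1 here) keeps the number of match attempts tied to the number of distinct patterns per column rather than the number of rules; whether that is fast enough for 100K rules would have to be measured.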