Regex to capture two groups from record

I am working on an ETL to handle the parsing of machine-generated logs. These logs resemble JSON files flattened into CSV files. The payload of the JSON (and its length) depends on the log type, for example error, alarm, ...
Every so often, a corrupt line occurs in the log files. These corrupt lines combine two lines into a single line and start with the special character \x00, so they can be identified. Still, I would like to retrieve and separate the two lines from the corrupt line.
Data example (the corrupt line is line 3):
log file
2019.09.12 07:32:00,121,INIED
2019.09.12 09:21:50,611,ALARM ,E,303,ARM 2 VACUUM ERROR !!
\x00 2019.09.12 10:04:46,611,ALARM ,O,501, Check machine 2019.09.12 10:06:22,611,ALARM ,E,303,ARM 2 VACUUM ERROR !!
2019.09.12 10:52:38,209,RESUM
Ideally the corrupt record \x00 2019.09.12 10:04:46,611,ALARM ,O,501, Check machine 2019.09.12 10:06:22,611,ALARM ,E,303,ARM 2 VACUUM ERROR !! would be retrieved as
group 1: 2019.09.12 10:04:46,611,ALARM ,O,501, Check machine
group 2: 2019.09.12 10:06:22,611,ALARM ,E,303,ARM 2 VACUUM ERROR !!
I started with the capturing group \d{4}.\d{2}.\d{2} \d{2}:\d{2}:\d{2}(.*) to get everything after the timestamps. This seemed the easiest method, as I cannot assume that the length of the line is fixed (due to the flattened JSON).
Questions:
I am unsure how to terminate my capturing group. I was thinking to use the end of the line or the next timestamp it finds. Any advice to combine these clauses?
In addition, this method removes the timestamps themselves from the capturing group. Should I use a different method?

As you suspected, the capturing group should be terminated by either the end of the line or the next timestamp, combined in an OR clause.
Since you want the timestamp and the text together, the capturing group should wrap the entire expression rather than just (.*). The dots in the date are also escaped here so that they match a literal dot instead of any character:
(\d{4}\.\d{2}\.\d{2} \d{2}:\d{2}:\d{2}.*)
Combining the two gives:
(\d{4}\.\d{2}\.\d{2} \d{2}:\d{2}:\d{2}.*?)(?:$|(?=\d{4}\.\d{2}\.\d{2} \d{2}:\d{2}:\d{2}))
The OR clause is a non-capturing group consisting of the end of the line ($) and a positive lookahead for the next timestamp. The .* is made lazy (.*?) so that the group stops at whichever of the two comes first. Note that group 1 of the first record keeps the space that precedes the second timestamp.
You can test and validate expressions at https://regexr.com/.
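As a rough illustration, the combined expression can be tried in Python against the corrupt line from the question. This is only a sketch: the names TS, RECORD and split_records are made up for the example, and it assumes \x00 is a literal NUL character at the start of the line.

```python
import re

# Timestamp such as "2019.09.12 10:04:46"; the dots are escaped so they match literally.
TS = r"\d{4}\.\d{2}\.\d{2} \d{2}:\d{2}:\d{2}"

# A record starts at a timestamp and runs lazily up to the next timestamp
# or the end of the line, whichever comes first.
RECORD = re.compile(rf"({TS}.*?)(?:$|(?={TS}))")

def split_records(line):
    """Return the records found in one (possibly corrupt) log line."""
    # strip() drops the space left in front of the second timestamp.
    return [record.strip() for record in RECORD.findall(line)]

# Assumption: the corrupt line begins with a literal NUL character.
corrupt = ("\x00 2019.09.12 10:04:46,611,ALARM ,O,501, Check machine "
           "2019.09.12 10:06:22,611,ALARM ,E,303,ARM 2 VACUUM ERROR !!")
records = split_records(corrupt)
for record in records:
    print(record)
```

A healthy line goes through the same function and comes back as a single record, so the corrupt lines may not need special treatment.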

Related

How can I define RegEx to remove a particular part in a specified line of code?

I am attempting to remove .nc1 at the end of a line. As a steel fabricator, I receive .nc1 files in batches. We run into issues with our files where line 5 in the example below has an unnecessary .nc1 extension at the end. The problem is that I cannot simply replace the value, as it appears in line 2 as well.
In the example below, I want to remove the .nc1 extension from line 5 and keep line 2 as is. The removal will be applied to all of my .nc1 files in a batch via find/replace.
ST
** BB233.nc1
F88
BB233
BB233.nc1
1000
A992
1
W21X201
Change to this
ST
** BB233.nc1
F88
BB233
BB233
1000
A992
1
W21X201
I was looking into positive and/or negative lookahead/lookbehind but didn't have much luck in making it work. I am a novice when it comes to regex.
Match .nc1 only at the end of lines that start with whitespace, capture the part you want to keep, and put it back. This effectively deletes .nc1:
Search: ^(\s+.*)\.nc1$
Replace: $1
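For batch processing outside an editor, the same search/replace could be sketched in Python. The sample text below is an assumption: it supposes that the data lines in the real files are indented (which is what the whitespace condition relies on) while the ** comment line is not. The names PATTERN, sample and cleaned are invented for the example.

```python
import re

# Same expression as the search above; re.MULTILINE makes ^ and $ anchor
# at the start and end of every line rather than of the whole string.
PATTERN = re.compile(r"^(\s+.*)\.nc1$", re.MULTILINE)

# Assumed layout: data lines are indented, the "**" comment line is not.
sample = "ST\n** BB233.nc1\n  F88\n  BB233\n  BB233.nc1\n  1000\n"

# \1 puts back everything captured before ".nc1".
cleaned = PATTERN.sub(r"\1", sample)
print(cleaned)
```

If the files can contain blank lines, replacing \s+ with [ \t]+ keeps the match from starting on a blank line and running into the next one.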

Can regex be used to find this pattern?

I need to parse a large amount of data in a log file. Ideally I can do this by splitting the file into a list where each item is an individual log entry.
Every time a log entry is made it is prefixed with a string following this pattern:
"4404: 21:42:07.433 - After this point there could be anything (including new line characters and such). However, as soon as the prefix repeats that indicates a new log entry."
4404 can be any number, but it is always followed by a :.
21:42:07.433 is 21 hours, 42 minutes, 7 seconds and 433 milliseconds.
I don't know much about regex, but is it possible to identify this pattern using it?
I figured something like this would work...
"*: [0-24]:[0:60]:[0:60].[0-1000] - *"
However, it just throws an exception and I fear I'm not on the right track at all.
List<string> split_content = Regex.Matches(file_content, @"*: [0-24]:[0:60]:[0:60].[0-1000] - *").Cast<Match>().Select(m => m.Value).ToList();
The following expression would split a string according to your pattern:
\d+: \d{2}:\d{2}:\d{2}\.\d{3}
Add a ^ at the beginning if your delimiting string always starts a line (and use the m flag, RegexOptions.Multiline in .NET). Capturing the log chunks with a regex would be more elaborate; I'd suggest just splitting (with Regex.Split) if you have the whole log content in memory at once.
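The splitting idea can be sketched in Python as follows. This is a variation rather than the exact expression above: the prefix is wrapped in a lookahead so that each entry keeps its own prefix instead of losing it to the split. Splitting on a zero-width match needs Python 3.7 or later, and ENTRY_START, split_entries and the sample log are made up for the illustration.

```python
import re

# The prefix from the answer, anchored at line starts. Wrapping it in a
# lookahead makes the split zero-width, so the prefix stays with its entry.
ENTRY_START = re.compile(r"^(?=\d+: \d{2}:\d{2}:\d{2}\.\d{3})", re.MULTILINE)

def split_entries(text):
    """Split log text into entries, each starting with its prefix."""
    # The split at position 0 yields an empty leading chunk, which is dropped.
    return [chunk for chunk in ENTRY_START.split(text) if chunk]

# Hypothetical log content with one multi-line entry.
log = ("4404: 21:42:07.433 - first line\n"
       "second line of the same entry\n"
       "17: 21:42:08.001 - next entry\n")
entries = split_entries(log)
print(len(entries))
```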

Parsing a CSV file that has newline characters in one of its columns in AWS Athena / AWS Glue catalog

I have sample data like below:
id,log,code,sequence
100,sample <(>&<)> O sample ? PILE UP - 3 sample,20,7^M$
101,sample- 4/52$
sample$
CM,21,7^M$
102,sample AT 3PM,22,4^M$
In the second row (id=101), the log column contains newline characters, which turn one row into three lines.
I've enabled the ":set list" option in vim to show newline ($) and carriage return (^M) characters.
To handle newline characters, AWS suggested OpenCSVSerde.
I tried using OpenCSVSerde serialization with escapeChar=\\, quoteChar=\", separatorChar=,
Nonetheless, it shows the data as five rows, whereas I need three rows.
When I query in Athena, id=101 shows only the first line and the rest is missing:
id,log,code,sequence
101,sample- 4/52
Any tips or examples on how to handle multiline values in a CSV column?
I'm exploring custom classifiers but no luck yet.
According to the documentation at https://docs.aws.amazon.com/athena/latest/ug/csv.html, OpenCSVSerde does not support line breaks.
I see that you are trying to put some kind of log there.
Your options are:
Clean up the log so that it does not include the line breaks. Or,
Use RegexSerDe, which is not useful if your log format keeps changing. Or,
If neither is an option, you can change your format from CSV to Parquet or something else where there are no line break issues.
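The first option could look something like the Python sketch below. It leans on an assumption taken from the vim output in the question: real rows end with CRLF (^M$), while the breaks inside the log column are bare LF ($). If the files do not follow that convention, this approach does not apply. The function name and the sample string are invented for the example.

```python
import re

def join_embedded_newlines(raw):
    """Replace every bare LF (one not preceded by CR) with a space."""
    return re.sub(r"(?<!\r)\n", " ", raw)

# Assumed convention: CRLF ends a row, a bare LF is a break inside a field.
raw = ("id,log,code,sequence\r\n"
       "100,sample PILE UP - 3 sample,20,7\r\n"
       "101,sample- 4/52\nsample\nCM,21,7\r\n"
       "102,sample AT 3PM,22,4\r\n")
cleaned = join_embedded_newlines(raw)
print(cleaned.splitlines())
```

When reading the file in Python, open it with newline="" (or in binary mode) so that the CRLF endings are not translated before the substitution runs.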

RegEx optional group matching

I'm trying to figure out how to create a regex that would encompass both of the following lines:
02-09-16 08:57PM 24768 Invoice - Copy.docx
05-14-16 08:49PM <DIR> Bin
Both are the result of a directory listing. The first is a file, which includes the file size. The second is a directory, which has no size but includes the type <DIR>.
This allows me to capture all of the data into named groups, but the first line's size is captured into the Type field:
(?<Date>\S+)\s+(?<Time>\S+)\s+(?<Type>\S+)\s+(?<Name>.+)
If possible, I'd like to end up with both a Type and a Size. I'm not sure how to look for both of these at the same time but ignore one or the other if one is found.
Update: Based on Wiktor's response I've updated the regex and gotten closer:
(?<Date>\S+)\s+(?<Time>\S+)\s+(?:(?<Type>\S+)|\d+)\s+(?<Name>.+)
Using this I can easily parse both lines. However, the first line's 24768 ends up in the Type group. Is it possible to have both a Type and an additional Size group? The logic would be something like: if you run into characters ('<DIR>', for example), that is the Type; if you run into numbers (24768), that is the Size.
Just group the size and type captures into a non-capturing or-group:
^(?<Date>\S+)\s+(?<Time>\S+)\s+(?:(?<Size>\d+)|(?<Type>\S+))\s+(?<Name>.+)$
The size field will pick up the digits, else you get a type.
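A quick way to check the behaviour is a small Python sketch. Python spells named groups (?P<Name>...) instead of (?<Name>...), so the expression is adapted accordingly; LISTING and parse are names chosen for the example.

```python
import re

# Same structure as the expression above, with Python's (?P<...>) syntax.
# Size is tried first, so an all-digit field never falls through to Type.
LISTING = re.compile(
    r"^(?P<Date>\S+)\s+(?P<Time>\S+)\s+"
    r"(?:(?P<Size>\d+)|(?P<Type>\S+))\s+(?P<Name>.+)$"
)

def parse(line):
    """Return the named groups as a dict, or None if the line does not match."""
    match = LISTING.match(line)
    return match.groupdict() if match else None

file_row = parse("02-09-16 08:57PM 24768 Invoice - Copy.docx")
dir_row = parse("05-14-16 08:49PM <DIR> Bin")
print(file_row)
print(dir_row)
```

The group that did not take part in the match comes back as None, which makes it easy to tell files from directories.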

Find all instances where count is less than

I have a big log file which contains IDs. If an ID is present in the log more than 5 times, it's a success. If it appears fewer times than that, I want to know which ID it is.
Ultimately I need a way in Notepad++ that would give me a list of all IDs ([0-9]{10}) that occur 5 times or fewer.
Is this somehow possible?
Edit: The format of the file is a standard log4j log, so it has a ton of other data. Example (ID in this case is 12345678901234567)
[08-08-2015 02:08:00] [INFO ] Service [329]: Attempting to substitute message ID with 12345678901234567
[08-08-2015 02:08:00] [DEBUG] ParsedBlock [49]: 3296825 => 12345678901234567
[08-08-2015 02:08:00] [DEBUG] LifeCycle [149]: All messages have not yet been sent. Waiting another 2000 milliseconds. [Send: false]
[08-08-2015 02:08:00] [DEBUG] LifeCycle$5 [326]: Running 5, 2592
Since you're in Notepad++ in the first place, you can take advantage of its functionality outside of Search. Be sure you do all this in a copy of the file, not the original, since it makes changes to the file. Since you haven't answered about the format of the file, I'm assuming the file is just the IDs, one on each line.
The first step is to sort the IDs so all the duplicates appear contiguously: Edit -> Line Operations -> Sort Lines As Integers Ascending
Then do this Search/Replace (with Search Mode set to regex):
Search: (\d{17}\r\n)\1{5,}|(\d{17}\r\n)\2*
Replace: $2
You'll be left with only the IDs that occur 5 or fewer times.
Explanation:
The first half of the alternation (\d{17}\r\n)\1{5,} matches any IDs that repeat 6 or more times. The second half (\d{17}\r\n)\2* matches any other IDs, capturing the first instance in group #2. Then the replace puts back that group with $2.
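To see the alternation at work outside Notepad++, here is a small Python sketch that applies the same expression to a sorted list of IDs. It keeps the answer's assumptions (17-digit IDs, one per line, CRLF line endings); the function name rare_ids and the sample IDs are made up.

```python
import re

# Same expression as the search above.
PATTERN = re.compile(r"(\d{17}\r\n)\1{5,}|(\d{17}\r\n)\2*")

def rare_ids(ids):
    """Return the distinct IDs that occur five times or fewer."""
    # Sorting makes duplicates contiguous, like the Notepad++ sort step.
    text = "".join(f"{i}\r\n" for i in sorted(ids))
    # Group 2 is only set by the second alternative; runs of six or more
    # matched by the first alternative are replaced with nothing.
    kept = PATTERN.sub(lambda m: m.group(2) or "", text)
    return kept.split()

# Hypothetical IDs: one seen six times, one five times, one once.
ids = ["1" * 17] * 6 + ["2" * 17] * 5 + ["3" * 17]
result = rare_ids(ids)
print(result)
```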