Can regex be used to find this pattern?

I need to parse a large amount of data in a log file. Ideally I'd do this by splitting the file into a list where each entry in the list is an individual log entry.
Every time a log entry is made it is prefixed with a string following this pattern:
4404: 21:42:07.433 - After this point there could be anything (including newline characters and such). However, as soon as the prefix repeats, that indicates a new log entry.
4404 can be any number, but is always followed by a colon.
21:42:07.433 is a timestamp: 21 hours, 42 minutes, 7 seconds, 433 milliseconds.
I don't know much about regex, but is it possible to identify this pattern using it?
I figured something like this would work...
"*: [0-24]:[0:60]:[0:60].[0-1000] - *"
However, it just throws an exception and I fear I'm not on the right track at all.
List<string> split_content = Regex.Matches(file_content, @"*: [0-24]:[0:60]:[0:60].[0-1000] - *").Cast<Match>().Select(m => m.Value).ToList();

The following expression would split a string according to your pattern:
\d+: \d{2}:\d{2}:\d{2}\.\d{3}
Add a ^ at the beginning if your delimiting string always starts a line (and use the m flag, RegexOptions.Multiline in .NET). Capturing the log chunks with a regex would be more elaborate; I'd suggest just splitting (with Regex.Split) if you have your log content in memory all at once.
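As a minimal sketch of the split (the question uses C#'s Regex.Split; Python's re module behaves the same way here, and the sample log text below is invented):

```python
import re

# Invented sample log: two entries, the first spanning multiple lines.
log = ("4404: 21:42:07.433 - first entry\n"
       "continuation of the first entry\n"
       "4405: 21:42:08.120 - second entry\n")

# Wrapping the delimiter in a capturing group makes re.split keep it,
# so each prefix can be stitched back onto its entry body.
parts = re.split(r"(\d+: \d{2}:\d{2}:\d{2}\.\d{3})", log)
entries = [parts[i] + parts[i + 1] for i in range(1, len(parts), 2)]
# entries[0] is the whole first entry, including its continuation line.
```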

Related

Seqkit - manipulate regex for parsing ID

I am trying to use seqkit rmdup to remove duplicated sequences from my protein fasta files. However, it's only the accession numbers which are duplicated and not the description or sequences. See example below.
Host_331002_c0_seq1 95 1381 2 +
Host_331002_c0_seq1 1873 2112 1 +
So basically I want to set a flag which will stop at the first tab when searching the identifiers (stop after Host_331002_c0_seq1); otherwise I won't get any duplicates in my output file. This flag should fix it, but I am not sure how to adjust the regex.
--id-regexp string regular expression for parsing ID (default "^(\\S+)\\s?")
Could you assist with this issue?
I have just started learning programming and I am not certain how to change that.
A regex to match zero or more characters up to (and excluding) the first tab is:
^[^\t]*
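A quick sanity check of that idea in Python (seqkit's default --id-regexp uses a capturing group, so presumably the pattern should be wrapped in parentheses when passed to the flag; + is used here instead of * so the group never captures an empty ID; the tab-separated headers mirror the question's example):

```python
import re

# Headers that differ only after the first tab.
h1 = "Host_331002_c0_seq1\t95\t1381\t2\t+"
h2 = "Host_331002_c0_seq1\t1873\t2112\t1\t+"

# Everything up to (and excluding) the first tab, in a capturing group.
id_re = re.compile(r"^([^\t]+)")
ids = {id_re.match(h).group(1) for h in (h1, h2)}
# Both headers yield the same ID, so rmdup would treat them as duplicates.
```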

regex - extract strings at specifc positions

I have a huge fixed-width string that looks something like below:
B100000DA3F19C Android 600 AND 2011-08-29 15:03:21.537
352a0D21ffd800000a3a95911801700e iPad 600 iOS 2011-08-29 19:35:12.753
.
.
.
I need to extract the first part (id) and the fourth part (device type - "AND" or "iOS"). The first column starts at position 0 and ends at position 51 on every line. The fourth part starts at 168 and ends at 171 on every line. Each line is 244 characters long. If this is complicated, the other option is to delete everything in this file except the id and device type. This single file has around 800K records and measures 180 MB, but Notepad++ seems to be handling it okay.
I tried doing a SQL Server import data but even though the Preview looks fine, when the data gets inserted into the table, it is not accurate.
I have the following so far which gives me the first 51 characters -
^(.{51}).*
It would be great if I could write one regex that keeps the id and device type and deletes the rest.
Well if you are certain it is always at that position a very simple way is this:
^(.{51}).{117}(.{3})
The parentheses are the captures (the results you are getting out), while the curly braces are the repetition counts.
EDIT: Use the following to explicitly discard the rest of the line:
^(.{51}).{117}(.{3}).*$
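A sketch of how that substitution behaves, using Python's re on a synthetic 244-character line (the field contents and padding are made up; the replacement \1\2 keeps only the two captured fields):

```python
import re

# Synthetic fixed-width line: 51-char id field, 117 chars of other
# columns, 3-char device type, then the rest; 244 chars in total.
line = ("B100000DA3F19C".ljust(51)
        + "Android 600".ljust(117)
        + "AND"
        + " 2011-08-29 15:03:21.537".ljust(73))
assert len(line) == 244

# Keep only the two captured groups; .*$ discards the remainder.
kept = re.sub(r"^(.{51}).{117}(.{3}).*$", r"\1\2", line)
```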

Is boost:regex block size limited?

I have quite a big text file to parse with boost::regex. To make the process easier, I first decided to split the big file into blocks, for later parsing of each block.
I use the following regex string for that:
FIRST1.*?FIRST2.*?FIRST3((.*?\r*\n*)*)LAST1.*?LAST2.*?LAST3
It allows me to get everything I want between "FIRST1 FIRST2 FIRST3" and "LAST1 LAST2 LAST3".
Between them there are many lines with lots of text (more than 20,000 bytes), and it doesn't work. If I split the text into 2 parts (part1 ~10,000 bytes and part2 ~10,000 bytes) and try this regular expression with:
FIRSTS part1 LASTS - everything parses well
FIRSTS part2 LASTS - everything parses well
FIRSTS part1part2 LASTS - it breaks.
I thought it might be a boost::regex limitation, and tried it in an online regex tester; the result is the same.
It looks like part1part2 is too big for the regex to handle. Is that true? Is there a size limit for regex, or am I just messing something up?
UPD:
I also found the maximum size. It finds the substring if it spans characters [106-12131], but if I add even one character anywhere in the substring, it can't find it. So the limit is 12025.
You probably should not be using regex here.
I'd show you the Spirit way to do this efficiently, but you don't show relevant code, so I'll wait.
That said, at least make the groups non-capturing (e.g. here ((.*?\r*\n*)*)) and consider using cmatch instead of smatch.
Oh, this might be a case of catastrophic backtracking:
((.*?\r*\n*)*)
Try something like this:
(.+?[\r\n]+)*
Make it non-capturing too:
(?:.+?[\r\n]+)*
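A quick illustration in Python, whose backtracking engine has the same failure mode (the marker text and block size here are invented): the rewritten repetition consumes at least one character per iteration and anchors each iteration at a line break, which removes the ambiguity that makes ((.*?\r*\n*)*) blow up.

```python
import re

# A block far larger than the ~12 KB where the original pattern gave up.
body = "a line of text in the block\n" * 5000
text = "FIRST1 x FIRST2 x FIRST3\n" + body + "LAST1 x LAST2 x LAST3"

# Each repetition matches one non-empty line plus its line break(s),
# so there is only one way to carve up the body; no exponential
# re-splitting between the nested quantifiers.
pattern = re.compile(
    r"FIRST1.*?FIRST2.*?FIRST3\n((?:.+?[\r\n]+)*)LAST1.*?LAST2.*?LAST3")
m = pattern.search(text)
```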

Specific regex to detect error string

I am parsing a text log, where each line contains an id closed in parenthesis and one or more (possibly hundreds) chunks of data (alphanumeric, always 20 chars), such as this:
id=(702831), data1=(Ub9fS97Hkc570Vvqkdy1), data2=(Hd7t553df8mnOa84wTcF)
id=(702832), data1=(Ba6FGoP5Dzxwmb6JhJ5a)
At this point of the program, I am not interested in the data, just in quickly fetching all the ids. The problem is that due to the noisy communication channel an error may appear, denoted by the string Error, which can be anywhere on the line. The goal is to ignore these lines.
What worked for me so far was a simple negative lookahead:
^id=\((\d+)\),(?!.*Error)
But I forgot, that there is some tiny probability, that this Error string may actually appear as a valid sequence of characters somewhere in the data, which has backfired on me just now.
The only way to distinguish between valid and invalid appearance of the Error string in the data chunk is to check for the length. If it's 20 characters, then it was this rare valid occurrence and I want to keep it (unless the Error is elsewhere on the line), if it's longer, I want to discard the line.
Is it still possible to treat this situation with a regular expression or is it already too much for the regex monster?
Thanks a lot.
Edit: Adding examples of error lines - all these should be ignored.
iErrord=(702831), data1=(Ub9fS97Hkc570Vvqkdy1), data2=(Hd7t553df8mnOa84wTcF)
id=(7028Error32), data1=(Ba6FGoP5Dzxwmb6JhJ5a)
id=(702833), daErrorta1=(hF6eDpLxbnFS5PfKaCds)
id=(702834), data1=(bx5EsH7BCsk6dMzpQDErrorKA)
However, this one should not be ignored; the Error is just incidentally contained in the data part. Currently it is ignored too:
id=(702834), data1=(bx5EsH6dMzpQDErrorKA)
Alright, it's not exactly what you were thinking about, but here's a suggestion:
Can't you simply match the lines following the pattern, undisturbed by an Error somewhere?
Here's the regexp that'll do it:
^id=\((\d+)\), (data\d+=\([a-zA-Z\d]{20}\)(, )?)+$
If Error appears anywhere on the line (except in the middle of a chunk of data), the regexp will not match it, so you get the wanted result: the line is ignored.
If this doesn't suit you, you have to add more lookahead and lookbehind groups. I'll try to do that and edit if I write a good regexp.
Since your chunks of data are always 20 characters long, if one is 25 characters this means there is an error in it. Therefore you could check whether there is a chunk of that length, then check whether Error appears outside of parentheses. If so, you shouldn't match the line. If not, it's valid.
Something like
(?![^)]*Error)id=\((\d+)(?!.*(?:\(.{25}\)|\)[^(]*Error))
might do the trick.
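Both suggestions can be checked against the question's samples; a quick sketch in Python, whose re module has the same lookahead semantics as this pattern needs:

```python
import re

# Answer 1: match only well-formed lines (an Error breaks the format).
well_formed = re.compile(r"^id=\((\d+)\), (data\d+=\([a-zA-Z\d]{20}\)(, )?)+$")
# Answer 2: reject lines with Error outside a chunk or with a 25-char chunk.
lookahead = re.compile(r"(?![^)]*Error)id=\((\d+)(?!.*(?:\(.{25}\)|\)[^(]*Error))")

good = "id=(702834), data1=(bx5EsH6dMzpQDErrorKA)"  # Error inside a valid 20-char chunk
bad = [
    "iErrord=(702831), data1=(Ub9fS97Hkc570Vvqkdy1), data2=(Hd7t553df8mnOa84wTcF)",
    "id=(7028Error32), data1=(Ba6FGoP5Dzxwmb6JhJ5a)",
    "id=(702833), daErrorta1=(hF6eDpLxbnFS5PfKaCds)",
    "id=(702834), data1=(bx5EsH7BCsk6dMzpQDErrorKA)",
]

# Both patterns keep the good line (capturing its id) and reject the rest.
for rx in (well_formed, lookahead):
    assert rx.match(good).group(1) == "702834"
    assert all(rx.match(line) is None for line in bad)
```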

PowerShell isolating parts of strings

I have no experience with regular expressions and would love some help and suggestions on a possible solution to deleting parts of file names contained in a csv file.
Problem:
A list of exported file names contains a random unique identifier that I need isolated. The unique identifier has no predictable pattern, however the aspects which need removing do. Each file name ends with one of the following variations:
V, -V, or %20V, followed by a random number sequence with possible spaces, additional "-" or "_", and ending with .PDF
examples:
GTD-LVOE-43-0021 V10 0.PDF
GTD-LVOE-43-0021-V34-2.PDF
GTD-LVOE-43-0021_V02_9.PDF
GTD-LVOE-43-0021 V49.9.PDF
Solution:
My plan was to write a script that selects the first occurrence of a V from the end of the string and deletes it and everything to the right of it. The file names can then be cleaned up by deleting any "-", "_", or whitespace at the end of the string.
Question:
How can I do this with a regular expression and is my line of thinking even close to the right approach to solving this?
REGEX: [\s\-_]V.*?\.PDF
Might do the trick. You'd still need to replace away any leading - and _, but it should get you down the path, hopefully.
This reads as follows:
Start with a whitespace, - or _, followed by a V; then take everything up to the first .PDF.
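The pattern is portable across regex engines; a minimal check against the question's four examples in Python's re (PowerShell's -replace would apply the same pattern, since .NET regex accepts this syntax too):

```python
import re

names = [
    "GTD-LVOE-43-0021 V10 0.PDF",
    "GTD-LVOE-43-0021-V34-2.PDF",
    "GTD-LVOE-43-0021_V02_9.PDF",
    "GTD-LVOE-43-0021 V49.9.PDF",
]

# Delete the separator, the V, and everything through .PDF, then trim
# any separators still left at the end of the name.
cleaned = [re.sub(r"[\s\-_]V.*?\.PDF", "", n).rstrip(" -_") for n in names]
```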