Ignore certain part of line while matching using regex/grep - regex

I have plenty of log files that all share the same pattern, DATE TIME USER TEXT, as follows:
2015-09-19 21:19:13 Daniel you should use gpt
In the above example, "Daniel" is just a random username, and whatever comes after is text that "Daniel" wrote: "you should use gpt".
What I am after is a way to ignore everything to the left of the username ("Daniel"), including the username itself, because I never want to match a username, and then match what I need with a regex. I only need to match within the actual TEXT the USER wrote.
These log files contain IRC chat logs from several different IRC servers and tens if not hundreds of different rooms, logged over the years.
All of these log files are under the same folder, without any sub-folders, so applying the grep to * will do.
I need to be able to grep for a specific username (on every run it will be a different username, and I will edit the grep accordingly, of course) wherever that username was mentioned (highlighted) in the chat lines, but not when that user was the one writing the line, only when mentioned by others.
The following should match because a USER (Jacob) other than Daniel mentioned him (Remember, Jacob here is just a USER):
2015-09-19 21:19:13 Jacob you should read a book Daniel
The following should not match because the USER who wrote the line is the one being mentioned:
2015-09-19 21:19:13 Daniel my name is also Daniel
The following should not match because the relevant USER does not appear within the TEXT:
2015-09-19 21:19:13 Daniel you should use gpt
The pattern always remains intact; the only things that can change are the date and time values, the length of the USER, and obviously the TEXT.
The delimiters are spaces only, as in the example, which is an actual copy and paste.

Try this with GNU grep:
grep -Po '^([^ \t]+[ \t]+){3}\K.*' file
Output:
you should use gpt
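The grep above only strips the first three fields; it does not by itself exclude lines written by the mentioned user. A minimal Python sketch of the full filter, assuming the fields are separated by single spaces as in the samples (the function name `mentions_by_others` is made up for illustration):

```python
import re

def mentions_by_others(lines, username):
    """Yield lines where `username` appears in TEXT but is not the author."""
    # \b keeps "Daniel" from matching inside e.g. "Danielle"; this assumes
    # the username starts and ends with a word character.
    word = re.compile(r'\b' + re.escape(username) + r'\b')
    for line in lines:
        parts = line.rstrip('\n').split(' ', 3)  # DATE, TIME, USER, TEXT
        if len(parts) < 4:
            continue  # skip lines that do not follow the pattern
        _date, _time, user, text = parts
        if user != username and word.search(text):
            yield line.rstrip('\n')

sample = [
    "2015-09-19 21:19:13 Jacob you should read a book Daniel",
    "2015-09-19 21:19:13 Daniel my name is also Daniel",
    "2015-09-19 21:19:13 Daniel you should use gpt",
]
hits = list(mentions_by_others(sample, "Daniel"))
print(hits)
```

Only the line written by Jacob is kept, which matches the three cases described in the question.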

Related

grep first n lines only

I'm having trouble grepping the right date from a letter stored as a document.
The goal is to grep the document's creation date and not any other date within the text.
Usually the document holds information about the company, my address, customer number, bill number....
and the date on which it was created.
There may be a greeting and/or body text, possibly containing dates as well.
Often the date at the beginning of the document is written differently, as follows.
December 1999 instead of 3.12.1999, for example.
If I grep the date using the pattern
'(([0-9][0-9]{,1}\.)\s+('Januar'|'Februar'|'März'|'April'|'Mai'|'Juni'|'Juli'|'August'|'September'|'Oktober'|'November'|'Dezember')\s+([1-9][0-9][0-9][0-9]{1,}))'
I sometimes get the wrong date as the creation date. The reason is that dates are written differently across the documents.
Example 1 is what I usually get, and it works fine because I search for the creation date with the correct pattern.
Example 2 is the problem: I get a date, but it is NOT the creation date, which would be the first date. Instead I get another date from the text that happens to match the pattern.
Example 1
Example 2
I could use a different pattern, '(([0-9][0-9]{,1}\.)([0-9][0-9]{,1}\.)([1-9][0-9][0-9][0-9]{1,}))', to grep the correct date in example 2, but then I would get the same issue for example 1.
My idea was to search only the first n lines: if the pattern matches there, take that date; otherwise use a different pattern.
I can't figure out how to make pdfgrep look at the first n lines only, which would let me fall back to a different pattern.
Does anybody have an idea how to fix this?
Cheers, bdream
With GNU grep:
-m NUM: Stop reading a file after NUM matching lines.
As an alternative to GNU grep, learn to use GNU gawk, which is designed for such tasks.
Consider also learning Python or GNU Guile (then read SICP).
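The "first n lines, then fall back" idea from the question could look roughly like this in Python. This is a sketch that assumes the PDF text has already been extracted to plain lines (for example with a tool such as pdftotext); the function name `creation_date`, the default `n`, and the sample documents are made up for illustration, and the two patterns are simplified versions of the ones in the question.

```python
import re
from itertools import islice

MONTHS = ("Januar|Februar|März|April|Mai|Juni|Juli|August|"
          "September|Oktober|November|Dezember")
# 3.12.1999 style
NUMERIC = re.compile(r'\b[0-9]{1,2}\.[0-9]{1,2}\.[1-9][0-9]{3}\b')
# 3. Dezember 1999 style
WRITTEN = re.compile(r'\b[0-9]{1,2}\.\s+(?:' + MONTHS + r')\s+[1-9][0-9]{3}\b')

def creation_date(lines, n=10):
    """Return the first date found in the first n lines, or None.

    The numeric pattern is tried on the whole head first; the written-out
    pattern is only used if the numeric one finds nothing there.
    """
    head = list(islice(lines, n))
    for pattern in (NUMERIC, WRITTEN):
        for line in head:
            match = pattern.search(line)
            if match:
                return match.group(0)
    return None

doc1 = ["Firma GmbH", "Rechnung Nr. 123", "3. Dezember 1999",
        "Zahlbar bis 17.12.1999"]
doc2 = ["Firma GmbH", "03.12.1999", "Lieferung am 5. Januar 2000"]
print(creation_date(doc1, n=3))
print(creation_date(doc2, n=3))
```

Limiting the search to the head of the document keeps later dates in the body text from being picked up; the order of the two patterns is a design choice and can be swapped.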

Regex for Current NTUSER.DAT files

I am trying to come up with a regex (PCRE) that finds current Windows NTUSER.DAT files when cycling through a file list (valid NTUSER.DAT files are the ones that are in the correct path for use by Windows).
I am trying to exclude any NTUSER.DAT files that have been copied by a user and placed in a different location (e.g. on the Desktop). In the following sample data, the first 4 results are valid, the next 3 are invalid:
\Users\John Thomas Hamilton\ntuser.dat
\Users\Default\NTUSER.DAT
\Users\Mary Thomas\NTUSER.DAT
\Users\UpdatusUser\NTUSER.DAT
\Users\John Thomas Hamilton\Desktop\My Stuff\Windows\Users\Default\NTUSER.DAT
\Users\John Thomas Hamilton\Desktop\My Stuff\Windows\Users\Student\NTUSER.DAT
\Users\John Thomas Hamilton\Desktop\My Stuff\My stuff to sort\Tech Support Fix it\NTUSER.DAT
Currently the best/simplest regex I have is:
\\USERS\\[A-Z0-9]+\\NTUSER.DAT$
but of course there are plenty of valid Windows file name characters other than letters and numbers that could exist in the user name.
I think I need to search up to the first occurrence of the next folder separator "\" and then, if it is not followed by NTUSER.DAT, reject the path. I have not had any luck doing this, so any help would be appreciated.
Well, assuming you have a valid file list, this would work:
^\\Users\\[^\\]+?\\NTUSER\.DAT$
Make sure you ignore case.
The secret is using [^\\]+? instead of .+? so that the match can only go exactly one folder level deep. The dot in the file name is escaped so that it matches only a literal ".".
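A quick way to check this against the sample data is a short Python script. This is only a sketch; the question asks for PCRE, but this particular pattern behaves the same way in Python's `re` module.

```python
import re

# One folder level under \Users, then the hive file; case-insensitive.
CURRENT_NTUSER = re.compile(r'^\\Users\\[^\\]+\\NTUSER\.DAT$', re.IGNORECASE)

paths = [
    r"\Users\John Thomas Hamilton\ntuser.dat",
    r"\Users\Default\NTUSER.DAT",
    r"\Users\Mary Thomas\NTUSER.DAT",
    r"\Users\UpdatusUser\NTUSER.DAT",
    r"\Users\John Thomas Hamilton\Desktop\My Stuff\Windows\Users\Default\NTUSER.DAT",
    r"\Users\John Thomas Hamilton\Desktop\My Stuff\Windows\Users\Student\NTUSER.DAT",
    r"\Users\John Thomas Hamilton\Desktop\My Stuff\My stuff to sort\Tech Support Fix it\NTUSER.DAT",
]

valid = [p for p in paths if CURRENT_NTUSER.match(p)]
for p in valid:
    print(p)
```

Because `[^\\]+` cannot cross a backslash, the copies under `Desktop` are rejected even though they end in `\Users\Default\NTUSER.DAT`.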

Get spamassassin to drop emails containing a specific REGEX in attached filenames

newbie asking first question :)
I'm running a mail server (Ubuntu/Postfix/Dovecot) with SpamAssassin. Most of the known spam is flagged (RBLs, and obvious UCE) except for this particular malspam in attached zip files like "order_info_654321.zip", "paymet_document_123456.zip", and so on, when it doesn't fit any other SA rules. I'd like to procure a rule which drops the matching offenders into oblivion.
After fiddling with regex101.com, I've come up with an expression that matches these patterns exclusively:
/\w+[_][0-9]{6}.zip$/img
Question is... How to format it all, get it to work, and where to put it? So far, I edited /etc/spamassassin/local.cf, added this to the bottom, and restarted:
mimeheader TROJAN_ATTACHED Content-Type =~ /\w+[_][0-9]{6}.zip$/img
describe ZIP_ATTACHED email contains a zip trojan attachment
score TROJAN_ATTACHED 99.
But it doesn't seem to do the magic. Where else can I look for this?
Thank you all,
Keijo.-
Your regex is wrong. You do not need a $ char at the end, because filename strings are not necessarily at the end of the Content-Type header. Instead, you can use a word boundary anchor, \b. In my rules I have the following, and it works perfectly:
mimeheader MIME_FAIL Content-Type =~ /\.(ade|adp|bat|chm|cmd|com|cpl|exe|hta|ins|isp|jse|lib|lnk|mde|msc|msp|mst|pif|scr|sct|shb|sys|vb|vbe|vbs|vxd|wsc|wsf|wsh|reg)\b/i
describe MIME_FAIL Blacklisted file extension detected
score MIME_FAIL 5
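The effect of the trailing `$` can be reproduced outside SpamAssassin. The sketch below uses Python's `re` module and a made-up header value (real headers vary, and SpamAssassin itself uses Perl regexes), so treat it as an illustration of the anchoring problem only.

```python
import re

# Hypothetical Content-Type value; the filename is followed by a quote,
# so it is not at the very end of the string.
header = 'application/zip; name="order_info_654321.zip"'

anchored = re.compile(r'\w+_[0-9]{6}.zip$', re.IGNORECASE)
bounded = re.compile(r'\w+_[0-9]{6}\.zip\b', re.IGNORECASE)

print(bool(anchored.search(header)))  # False: the closing quote follows ".zip"
print(bool(bounded.search(header)))   # True: \b matches between "p" and the quote
```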
First up, SA doesn't drop e-mails by default, but it can score them so high on spam content that they never show up in anyone's inbox. Second, the "ingredients" I started with were incorrect, and they also broke SA's ability to function at all.
This actually did the trick when added into /etc/spamassassin/local.cf:
full TROJAN_ZIPUNDS /\w*[_][\d]{1,6}\.zip/img
score TROJAN_ZIPUNDS 99
describe TROJAN_ZIPUNDS RM zip attached trojan underscore
Even though these spammers switched from zip to rar, from underscores to dashes, to different filenames, and so on, creating rules to counter them became simple after succeeding with the first one. Here's what I added too:
full TROJAN_RARDASH /\w*[-][\d]{1,6}\.rar/img
score TROJAN_RARDASH 99
describe TROJAN_RARDASH RM rar attached trojan dash
Also, as first described, I needed to specifically block certain zip file names, which soon morphed to rar and dashes, so morphing the regex and appending it as a rule triad to SpamAssassin's local.cf (and restarting) is currently holding, until the next spam wave :-)
Finally, this is a very very blunt workaround, so anyone with expertise on the subject is more than welcome to chime in.
You are using the wrong mime header to check for the filename. Use this instead:
mimeheader TROJAN_ATTACHED Content-Disposition =~ /\w+[_][0-9]{6}.zip/img
Also make sure you have the MimeHeader plugin loaded.
loadplugin Mail::SpamAssassin::Plugin::MIMEHeader

preg match email and name from to

I want to extract the name and email address from the following formats (also, if you know any other format that mail applications use when sending emails, please mention it in a comment :))
How can I get the name and email from strings like these (it is a single string and can be in any of the following formats):
- jon435#hotmail.com
- james jon435#hotmail.com
- "James Jordan" <jon435#hotmail.com> (gmail format)
- janne - jon44#hotmail.com (possible format)
The answer is straightforward, at least for the email portion. The rest can be special-cased away.
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")#(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
Proof I'm not insane.
If you only have those strings, it is going to require more work than a simple regular expression. For instance, your first example doesn't include the full name, it is only the e-mail, thus, you would have to use the Microsoft Live ID API to retrieve that information...and that turns out to be really hard.
What exactly are you trying to do? Perhaps there is another way?
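One rough way to handle the "special-cased" name portion mentioned in the first answer is to locate the address first and treat whatever is left as the name. The sketch below is in Python rather than PHP's preg functions, assumes the `#` in the sample strings stands in for `@`, and uses a deliberately loose address pattern instead of the full RFC-style expression; the function name `split_name_email` is made up.

```python
import re

# Loose address pattern: non-space, non-bracket, non-quote runs around "@".
ADDRESS = re.compile(r'[^\s<>"]+@[^\s<>"]+')

def split_name_email(raw):
    """Return (name, email), with name == '' when only an address is given."""
    match = ADDRESS.search(raw)
    if match is None:
        return None
    email = match.group(0)
    # Everything outside the address, minus quotes, brackets and dashes.
    leftover = raw[:match.start()] + raw[match.end():]
    name = leftover.strip(' \t"<>-')
    return name, email

samples = [
    'jon435@hotmail.com',
    'james jon435@hotmail.com',
    '"James Jordan" <jon435@hotmail.com>',
    'janne - jon44@hotmail.com',
]
for s in samples:
    print(split_name_email(s))
```

This does not validate the address; it only separates the two parts for the four formats listed in the question.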

Use cases for regular expression find/replace

I recently discussed editors with a co-worker. He uses one of the less popular editors and I use another (I won't say which ones since it's not relevant and I want to avoid an editor flame war). I was saying that I didn't like his editor as much because it doesn't let you do find/replace with regular expressions.
He said he's never wanted to do that, which was surprising since it's something I find myself doing all the time. However, off the top of my head I wasn't able to come up with more than one or two examples. Can anyone here offer some examples of times when they've found regex find/replace useful in their editor? Here's what I've been able to come up with since then as examples of things that I've actually had to do:
Strip the beginning of a line off of every line in a file that looks like:
Line 25634 :
Line 632157 :
Taking a few dozen files with a standard header which is slightly different for each file and stripping the first 19 lines from all of them all at once.
Piping the result of a MySQL select statement into a text file, then removing all of the formatting junk and reformatting it as a Python dictionary for use in a simple script.
In a CSV file with no escaped commas, replace the first character of the 8th column of each row with a capital A.
Given a bunch of GDB stack traces with lines like
#3 0x080a6d61 in _mvl_set_req_done (req=0x82624a4, result=27158) at ../../mvl/src/mvl_serv.c:850
strip out everything from each line except the function names.
Does anyone else have any real-life examples? The next time this comes up, I'd like to be more prepared to list good examples of why this feature is useful.
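For reference, the first and last items in the list above can be sketched as regex replacements in Python. The text after each `Line N :` prefix is invented for the example, and the stack-frame pattern assumes the GDB format shown in the question.

```python
import re

# 1) Strip the "Line 25634 : " prefix from every line.
numbered = "Line 25634 : first entry\nLine 632157 : second entry"
stripped = re.sub(r'^Line \d+ : ?', '', numbered, flags=re.MULTILINE)
print(stripped)

# 2) Reduce each GDB frame to its function name. The address part is
#    optional because some frames are printed without one.
frames = ("#3 0x080a6d61 in _mvl_set_req_done (req=0x82624a4, result=27158) "
          "at ../../mvl/src/mvl_serv.c:850")
FRAME = re.compile(r'^#\d+\s+(?:0x[0-9a-fA-F]+\s+in\s+)?(\S+)\s*\(.*$',
                   re.MULTILINE)
names = FRAME.sub(r'\1', frames)
print(names)
```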
Just last week, I used regex find/replace to convert a CSV file to an XML file.
Simple enough to do really, just chop up each field (luckily it didn't have any escaped commas) and push it back out with the appropriate tags in place of the commas.
Regexes make it easy to replace whole words using word boundaries.
(\b\w+\b)
So you can replace unwanted words in your file without disturbing words like Scunthorpe.
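A small Python sketch of the same idea, using a harmless word; the sample sentence is made up:

```python
import re

text = "The cat sat; do not concatenate cat with category."

# \b on both sides means only the standalone word is replaced.
result = re.sub(r'\bcat\b', 'dog', text)
print(result)  # The dog sat; do not concatenate dog with category.
```

A plain substring replace would also have rewritten "concatenate" and "category".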
Yesterday I took a create table statement I made for an Oracle table and converted the fields to setString() method calls using JDBC and PreparedStatements. The table's field names were mapped to my class properties, so regex search and replace was the perfect fit.
Create Table text:
...
field_1 VARCHAR2(100) NULL,
field_2 VARCHAR2(10) NULL,
field_3 NUMBER(8) NULL,
field_4 VARCHAR2(100) NULL,
....
My Regex Search:
/([a-z0-9_]+) .*/
My Replacement:
pstmt.setString(1, \1);
The result:
...
pstmt.setString(1, field_1);
pstmt.setString(1, field_2);
pstmt.setString(1, field_3);
pstmt.setString(1, field_4);
....
I then went through and manually set the position int for each call and changed the method to setInt() (and others) where necessary, but that worked handy for me. I actually used it three or four times for similar field to method call conversions.
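The same conversion can be reproduced in Python. This is a sketch: it uses a pattern whose character class includes digits and whose quantifier sits inside the group, so that the whole field name (for example `field_1`) is captured.

```python
import re

ddl = """field_1 VARCHAR2(100) NULL,
field_2 VARCHAR2(10) NULL,
field_3 NUMBER(8) NULL,
field_4 VARCHAR2(100) NULL,"""

# Capture the field name, discard the rest of the line.
calls = re.sub(r'^([a-z0-9_]+) .*$', r'pstmt.setString(1, \1);', ddl,
               flags=re.MULTILINE)
print(calls)
```

As in the answer, the position argument and the setter name still need manual adjustment afterwards.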
I like to use regexps to reformat lists of items like this:
int item1
double item2
to
public void item1(int item1){
}
public void item2(double item2){
}
This can be a big time saver.
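A sketch of that reformatting as a single substitution in Python, assuming each input line is exactly "type name":

```python
import re

fields = "int item1\ndouble item2"

# \1 is the type, \2 is the name; \n in the template inserts a line break.
methods = re.sub(r'^(\w+) (\w+)$', r'public void \2(\1 \2){\n}', fields,
                 flags=re.MULTILINE)
print(methods)
```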
I use it all the time when someone sends me a list of patient visit numbers in a column (say 100-200) and I need them in a '0000000444','000000004445' format. It works wonders for me!
I also use it to pull out email addresses in an email. I send out group emails often and all the bounced returns come back in one email. So, I regex to pull them all out and then drop them into a string var to remove from the database.
I even wrote a little dialog prog to apply regex to my clipboard. It grabs the contents applies the regex and then loads it back into the clipboard.
One thing I use it for in web development all the time is stripping some text of its HTML tags. This might need to be done to sanitize user input for security, or for displaying a preview of a news article. For example, if you have an article with lots of HTML tags for formatting, you can't just do LEFT(article_text,100) + '...' (plus a "read more" link) and render that on a page at the risk of breaking the page by splitting apart an HTML tag.
Also, I've had to strip img tags in database records that link to images that no longer exist. And let's not forget web form validation. If you want to make sure a user has entered a correct email address (syntactically speaking) into a web form, this is about the only way of checking it thoroughly.
I've just pasted a long character sequence into a string literal, and now I want to break it up into a concatenation of shorter string literals so it doesn't wrap. I also want it to be readable, so I want to break only after spaces. I select the whole string (minus the quotation marks) and do an in-selection-only replace-all with this regex:
/.{20,60} /
...and this replacement:
/$0"¶ + "/
...where the pilcrow is an actual newline, and the number of spaces varies from one incident to the next. Result:
String s = "I recently discussed editors with a co-worker. He uses one "
+ "of the less popular editors and I use another (I won't say "
+ "which ones since it's not relevant and I want to avoid an "
+ "editor flame war). I was saying that I didn't like his "
+ "editor as much because it doesn't let you do find/replace "
+ "with regular expressions.";
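The same find/replace can be tried outside an editor. In this Python sketch the matched text corresponds to `$0` and the replacement appends the closing quote, a newline, and the `+ "` that opens the next literal; the four-space indent is an arbitrary choice.

```python
import re

text = ("I recently discussed editors with a co-worker. He uses one of the "
        "less popular editors and I use another (I won't say which ones "
        "since it's not relevant and I want to avoid an editor flame war).")

SEPARATOR = '"\n    + "'

# Greedy .{20,60} followed by a space: break at the last space that keeps
# each piece at 61 characters or fewer.
body = re.sub(r'.{20,60} ', lambda m: m.group(0) + SEPARATOR, text)
java_literal = 'String s = "' + body + '";'
print(java_literal)
```

A lambda is used for the replacement so that the quote and newline characters are inserted literally rather than being interpreted as template escapes.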
The first thing I do with any editor is try to figure out its regex oddities. I use it all the time. Nothing really crazy, but it's handy when you've got to copy/paste stuff between different types of text - SQL <-> PHP is the one I do most often - and you don't want to fart around making the same change 500 times.
Regex is very handy any time I am trying to replace a value that spans multiple lines. Or when I want to replace a value with something that contains a line break.
I also like that you can match things in a regular expression and not replace the full match using the $# syntax to output the portion of the match you want to maintain.
I agree with you on points 3, 4, and 5 but not necessarily points 1 and 2.
In some cases 1 and 2 are easier to achieve using an anonymous keyboard macro.
By this I mean doing the following:
Position the cursor on the first line
Start a keyboard macro recording
Modify the first line
Position the cursor on the next line
Stop recording.
Now all that is needed to modify the next line is to repeat the macro.
I could live without support for regex but could not live without anonymous keyboard macros.