Extract Text From CSV - regex

I want to grab the regular expressions out of the snort rules.
Here's an example of the text that I've saved as a csv - https://rules.emergingthreats.net/open/snort-2.9.0/rules/emerging-exploit.rules
The file contains multiple rules like this one:
#by Akash Mahajan
#
alert udp $EXTERNAL_NET any -> $HOME_NET 14000 (msg:"ET EXPLOIT Borland VisiBroker Smart Agent Heap Overflow"; content:"|44 53 52 65 71 75 65 73 74|"; pcre:"/[0-9a-zA-Z]{50}/R"; reference:bugtraq,28084; reference:url,aluigi.altervista.org/adv/visibroken-adv.txt; reference:url,doc.emergingthreats.net/bin/view/Main/2007937; classtype:successful-dos; sid:2007937; rev:4;)
From every rule, I want only the text that appears after "pcre", without the quotes, extracted and printed to a new file. For example:
pcre:"/[0-9a-zA-Z]{50}/R";
So, from this line above, I want to end up with the below text;
/[0-9a-zA-Z]{50}/R
I want this for every place "pcre" appears in the whole file.
I've been messing around with grep, awk, and sed. I just can't figure it out. I'm fairly new to this.
Could anyone give me some tips?
Thanks

With GNU sed:
$ sed -n -r 's/.*\<pcre:"([^"]+).*/\1/p' file
/[0-9a-zA-Z]{50}/R

You can do this using grep. The catch is that grep cannot print only a capture group; it can only print the whole matched text.
To get around this, you need to use look-ahead and look-behind.
Lookahead (?=foo)
Asserts that what immediately follows the current position in the string is foo
Lookbehind (?<=foo)
Asserts that what immediately precedes the current position in the string is foo
┌─ print file to standard output
│                    ┌─ has pcre:" before matching group (look-behind)
│                    │                ┌─ has "; after matching group (look-ahead)
cat file | grep -Po '(?<=pcre:\")(.*?)(?=\";)'
                 ││              └─ what we want (matching group, lazy so it stops at the first ";)
                 │└─ print only the matched part
                 └─ use Perl-compatible regular expressions
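The same extraction can be sketched in Python, where a capture group can be returned directly, so no look-around is needed. This is only an illustration: the names PCRE_OPTION, extract_pcres and sample_rule are made up here, the sample rule is abbreviated from the one in the question, and the pattern assumes each pcre option ends with "; on the same line.

```python
import re

# Captures everything between pcre:" and the closing "; of that option.
# The lazy .*? stops at the first "; so later quoted options are not swallowed.
PCRE_OPTION = re.compile(r'pcre:"(.*?)";')


def extract_pcres(rules_text):
    """Return every pcre pattern found in the given Snort rules text."""
    return PCRE_OPTION.findall(rules_text)


# Abbreviated copy of the rule from the question (reference URLs dropped).
sample_rule = (
    'alert udp $EXTERNAL_NET any -> $HOME_NET 14000 '
    '(msg:"ET EXPLOIT Borland VisiBroker Smart Agent Heap Overflow"; '
    'content:"|44 53 52 65 71 75 65 73 74|"; '
    'pcre:"/[0-9a-zA-Z]{50}/R"; '
    'reference:bugtraq,28084; classtype:successful-dos; sid:2007937; rev:4;)'
)

for pattern in extract_pcres(sample_rule):
    print(pattern)  # prints /[0-9a-zA-Z]{50}/R
```

To process the whole file, read it into a string and write the returned list to a new file, one pattern per line.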

Related

How to remove specific characters in notepad++ with regex?

This is data present in my .txt file
+919000009998 SMS +919888888888
+919000009998 MMS +91988 88888 88
+919000009998 MMS abcd google
+919000009998 MMS amazon
I want to convert my .txt like this
919000009998 SMS 919888888888
919000009998 MMS 919888888888
919000009998 MMS abcd google
919000009998 MMS amazon
That is, I want to remove the + symbol, and also remove the spaces in the third column, but only if it is a number. If it is a string, nothing should change.
Is there a regex for this that I can use in search and replace in Notepad++?
Ctrl+H
Find what: \+|(?<=\d)\h+(?=\d)
Replace with: LEAVE EMPTY
check Wrap around
check Regular expression
Replace all
Explanation:
\+ # + sign
| # OR
(?<=\d) # positive lookbehind, make sure we have a digit before
\h+ # 1 or more horizontal spaces
(?=\d) # positive lookahead, make sure we have a digit after
All the previous answers will work perfectly.
However, I'm adding this just in case you need it:
If for some reason the third column contains non-phone values separated by spaces (a street address comes to mind, e.g. +919000009998 MMS street foo nº 123 4º-B), you may use this regex instead. It joins the digits only when the third column starts with +:
Search: ^[+](\S+\s+\S+\s++)(?:([^+][^\n]*)|[+])|\G\s*(\d+)
Replace by: \1\2\3
That will avoid joining the 3 and 4 in my previous example.
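For checking the simpler pattern outside Notepad++, here is a minimal Python sketch of the same substitution. Python's re module has no \h, so [ \t] is used as a stand-in; CLEANUP and clean_line are illustrative names, not part of any answer above.

```python
import re

# A literal "+", or horizontal whitespace that sits between two digits.
# [ \t] stands in for \h, which Python's re module does not support.
CLEANUP = re.compile(r'\+|(?<=\d)[ \t]+(?=\d)')


def clean_line(line):
    """Drop '+' signs and any spaces/tabs found between two digits."""
    return CLEANUP.sub('', line)


rows = [
    '+919000009998 SMS +919888888888',
    '+919000009998 MMS +91988 88888 88',
    '+919000009998 MMS abcd google',
    '+919000009998 MMS amazon',
]

for row in rows:
    print(clean_line(row))
```

As with the Notepad++ version, this joins any two digit groups separated by spaces, which is the street-address case the second answer guards against.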

Parse a text file with no newline using RegEx

I have a text file like below. Every record has 12 fields which are separated by |, but there is no record delimiter like a newline, and every record starts with 555. I am trying to parse it with RegEx.
555|abc|user|2|20120914055204696|20120914055204718|0||||21|33555|def|udp|2|20120914055204696|20120914055204718|0||||22|33555|abc|user|2|20120914055204696|20120914055204718|0||||23|33
I tried with 555(\|.*?\|){12}(\d\d), but it did not work. Can anyone please help me with this?
You can use
555(?:\|[^|]*){11}(?=$|555)
It will match these records in the input string:
555|abc|user|2|20120914055204696|20120914055204718|0||||21|33
555|def|udp|2|20120914055204696|20120914055204718|0||||22|33
555|abc|user|2|20120914055204696|20120914055204718|0||||23|33
The regex 555(?:\|[^|]*){11}(?=$|555) matches:
555 - literal 555
(?:\|[^|]*){11} - 11 occurrences of | followed by any number of characters other than |
(?=$|555) - up to (but not returning as part of the match) end of string or 555.
555(?:\|[^|]*?){11}\d\d
You need to remove the second \| from your pattern. See demo:
https://regex101.com/r/sS2dM8/31
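As a quick check of the first pattern, here is a small Python sketch that splits the sample string from the question into records and then into fields. The names RECORD, parse_records and data are illustrative only, and the sketch assumes, as the question states, that every record starts with 555 and has 12 fields.

```python
import re

# 555, then eleven "|field" parts, ending right before the next 555 or at
# the end of the string (the lookahead is not part of the match).
RECORD = re.compile(r'555(?:\|[^|]*){11}(?=$|555)')


def parse_records(text):
    """Return each record as a list of its 12 pipe-separated fields."""
    return [match.split('|') for match in RECORD.findall(text)]


data = (
    '555|abc|user|2|20120914055204696|20120914055204718|0||||21|33'
    '555|def|udp|2|20120914055204696|20120914055204718|0||||22|33'
    '555|abc|user|2|20120914055204696|20120914055204718|0||||23|33'
)

for fields in parse_records(data):
    print(fields)
```

Note that a field which itself contains 555 next to the record boundary could confuse the lookahead, so this is only safe when the data rules that out.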

Remove Line numbers from Notepad++ file

I have received a very long file. It has 1000+ lines of SQL code. Each line starts with a line number.
14 PROCEDURE sp_processRuleset(pop_id IN NUMBER);
15
16 -- clear procedure for preview mode to clean crom_population_member_temp table and global variables
17 PROCEDURE sp_commit; -- 28-Oct-09 J.Luo
18
19 -- The rule Set string for the Derived Population Member Preview
20 -- The preview mode will set gv_context_ruleSet by setContext_ruleSet,
21 -- sp_processRuleset uses gv_context_ruleSet to build derived population instead of getting rules from crom_rule_set table
22 gv_context_ruleSet VARCHAR2(32767) := NULL; -- 27-Oct-09 J.Luo
23 -- The population Role Id for the Derived Population Member Preview
I want to remove only the line numbers using Notepad++'s Find+Replace functionality. Is there a regex available to get this done?
Using a regex is the easiest way.
Another handy way (scrolling through 1K lines is not much, IMO) is block selection: hold the ALT key and drag the mouse over the line-number column, then delete the selection.
You can use this regex:
^\d+
Open Replace window with CTRL+H and run Replace All with these settings:
Find what: ^\s*\d+
Replace with: (empty)
Search mode: Regular expression
Notes:
\s can also be [[:space:]] or [ \t]
\d can also be [[:digit:]] or [0-9]
If the edited question is accurate, the \s* that matches leading whitespace may not be needed.
If the numbers are followed by a colon, you can use this one instead:
^\d+:
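If the file is easier to process with a script than in the editor, the same idea can be sketched in Python. This is an assumption-laden sketch: LINE_NUMBER and strip_line_numbers are made-up names, [ \t]* is used instead of \s* so the pattern cannot reach across line breaks, and the optional single space after the number is removed as well, which the Notepad++ patterns above do not do.

```python
import re

# Optional leading spaces/tabs, the line number, and at most one space after it.
LINE_NUMBER = re.compile(r'^[ \t]*\d+ ?')


def strip_line_numbers(text):
    """Remove the leading line number from every line of text."""
    return '\n'.join(LINE_NUMBER.sub('', line) for line in text.splitlines())


numbered = (
    '14 PROCEDURE sp_processRuleset(pop_id IN NUMBER);\n'
    '15\n'
    '17 PROCEDURE sp_commit; -- 28-Oct-09 J.Luo'
)

print(strip_line_numbers(numbered))
```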

Matching only 5xx using regex

I want to find all numbers that are in between 500-599. I'm very new to regex, I came up with this :
5[0-9][0-9]+
This is working fine, matching 566,577,500. But it also matches 6578. Which I don't want.
Edit:
Here is my file contents:
asd 554
sad
sads
dsa
456
sa
d
dsa
asda
d500
521
519 asdasd
524 asdasdsdsadsdasd sadsadsadasdsd asdsa dsa dsadsad sad asdas dsa sad sad asds a 543
As many suggested I tried :
grep "^5[0-9]{2}$" test
which isn't finding any numbers at all!
How do I put a constraint on this?
If you want to match 5xx only when it is the whole line, and not when 5xx occurs as part of a longer number such as x5xx, use:
^5\d{2}$
\d = Digit
^ = beginning of line
$ = end of line
EDIT:
Based on additional details in the question, you have a variable number of spaces at the beginning of the line, so, you want the following instead:
\s*5\d{2}\s
This matches 5xx with optional whitespace before it and one whitespace character after it.
With grep the easiest way is to use -w to only match whole words:
grep --color=always -w "5[0-9][0-9]" test
Remove the + sign:
5[0-9][0-9]
This will match a "5" followed by exactly two digits. Without anchors or word boundaries it can still match inside a longer number such as 6578.
You have to describe a bit more accurately what you want to happen with e.g. 6578? If you want 578 in the output (because after "6" there is a sequence of characters matching your format 5xx) you can simply do
grep -o "5[0-9][0-9]"
Note that unlike other answers, the -o flag emits multiple numbers from a single line if needed.
If, on the other hand, you want to match words of format 5xx, you can add -w flag, too:
grep -o -w "5[0-9][0-9]"
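For more complex matching rules, use the -E flag (extended regular expressions) instead, possibly with a much more complex regex. This is also why grep "^5[0-9]{2}$" found nothing: without -E, grep treats {2} as literal characters rather than as a repetition count.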
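The whole-word behaviour of grep -o -w can be approximated in Python with \b word boundaries. The sketch below uses a shortened version of the file contents from the question plus the 6578 case; FIVE_HUNDREDS and find_5xx are illustrative names.

```python
import re

# \b on both sides means 5xx must not be glued to other digits or letters,
# so neither "6578" nor "d500" produces a match.
FIVE_HUNDREDS = re.compile(r'\b5\d{2}\b')


def find_5xx(text):
    """Return every standalone number from 500 to 599 found in text."""
    return FIVE_HUNDREDS.findall(text)


sample = (
    'asd 554\n'
    'sad\n'
    '456\n'
    'd500\n'
    '521\n'
    '519 asdasd\n'
    '524 asdasdsdsadsdasd sad asds a 543\n'
    '6578\n'
)

print(find_5xx(sample))  # prints ['554', '521', '519', '524', '543']
```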

regex tutorial, How can I improve this

I needed a utility function earlier today to strip some data out of a file and wrote an appalling regular expression to do it. The input was a file with lots of lines in the format:
<address> <11 * ascii character value> <11 characters>
00C4F244 75 6C 74 73 3E 3C 43 75 72 72 65 ults><Curre
I wanted to strip out everything bar the 11 characters at the end and used the following expression:
"^[0-9A-F+]{8}[\\s]{2}[0-9A-F\\s]{34}"
This matched to the bits I didn't want which I then removed from the original string. I'd like to see how you'd do this but the particular areas I couldn't get working were:
1: having the regex engine return the characters I wanted rather than the characters I didn't and
2: finding a way of repeating the match on a single ascii value followed by the space (eg "75 " = [0-9A-F]{2}[\s]{1}?) and repeating that 11 times rather than grabbing 34 characters.
Looking at it again the easiest thing to do would be to match to the last 11 characters of each input line but this isn't very flexible and in the interests of learning regex I would like to see how you can match through from the start of the sequence.
Edit: Thanks guys, this is what I wanted:
"(?:^[0-9A-F]{8} )(?:[0-9A-F]{2} ){11} (.*)"
Wish I could turn more than one of you green.
As the file has a fixed format, you could use this regular expression to just match the last 11 characters.
^.{44}(.{11})
Last eleven is:
...........$
or:
.{11}$
Matching a hex byte + space and repeat eleven times:
([0-9A-Fa-f]{2} ){11}
1) ^[0-9A-F+]{8}[\s]{2}[0-9A-F\s]{34}(.*)
Parens are used for grouping with extraction. How you retrieve it depends on your language context, but now some sort of $1 is set to everything after the initial pattern.
2) ^[0-9A-F+]{8}[\s]{2}(?:[0-9A-F]{2}\s){11}\s(.*)
(?:) is grouping without extraction. So (?:[0-9A-F]{2}\s){11} treats the subpattern (two hex digits followed by a space) as a unit and looks for it repeated 11 times.
I'm assuming PCRE here, by the way.
The address and ascii char value are all hex so:
^[0-9A-F\s]{42}
Matching the end of the line would be
.{11}$
To match only the end, you can use a positive lookbehind.
"(?<=(^[0-9A-F+]{8}[\\s]{2}[0-9A-F\\s]{34}))(.*?)$"
This would match any character until the end of the line, providing that it is preceded by the "look behind" expression.
(?<=....) defines a condition that must be met before matching is possible.
I am a bit short of time, but if you search the net for any tutorial that contains the words "regex" and "lookbehind", you will find good stuff (if a regex tutorial covers look-ahead/look-behind, it will usually be pretty complete and advanced).
Another piece of advice is to get a regex training tool and play with it. Have a look at this excellent Regex designer.
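If you're using Perl, you could also use unpack() to get each element.
my @data;
open my $fh, '<', $filename or die "Cannot open $filename: $!";
while (my $line = <$fh>) {
    # a8 = address, xx = skip 2 spaces, (a2x)11 = eleven hex pairs, x = skip 1 space, a11 = text
    my ($address, @list) = unpack 'a8xx(a2x)11xa11', $line;
    my $str = pop @list;    # the last element is the 11-character text
    # unpack the hexadecimal bytes
    my $data = join '', map { pack 'H2', $_ } @list;
    die unless $data eq $str;    # sanity check: decoded bytes must equal the text
    push @data, [$address, $data, $str];
}
close $fh;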
I also went ahead and converted the 11 hexadecimal codes back into a string, using pack().
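For comparison, here is a rough Python sketch of the "match from the start" approach, with a capture group returning the wanted characters and a repeated group for the eleven hex bytes. It assumes the layout implied by the original pattern (address, whitespace, eleven "XX " values, whitespace, 11 characters); DUMP_LINE, parse_dump_line and the sample line's exact spacing are assumptions made for this illustration.

```python
import re

DUMP_LINE = re.compile(
    r'^([0-9A-F]{8})\s+'           # 8-digit hex address
    r'((?:[0-9A-F]{2}\s){11})\s*'  # eleven hex byte values, each followed by a space
    r'(.{11})$'                    # the 11 characters we actually want
)


def parse_dump_line(line):
    """Return (address, decoded_bytes, text), or None if the line does not match."""
    match = DUMP_LINE.match(line)
    if match is None:
        return None
    address, hex_bytes, text = match.groups()
    # Turn "75 6C 74 ..." back into characters, like the pack() step in Perl.
    raw = bytes.fromhex(''.join(hex_bytes.split()))
    return address, raw.decode('ascii', errors='replace'), text


# Spacing here is assumed; the whitespace in the question may have been collapsed.
sample = '00C4F244  75 6C 74 73 3E 3C 43 75 72 72 65  ults><Curre'

print(parse_dump_line(sample))  # prints ('00C4F244', 'ults><Curre', 'ults><Curre')
```

Using \s+ and \s* between the columns rather than fixed counts keeps the sketch tolerant of one or two separating spaces.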