How to cut/get all patterns of RE from one line - regex

How does one get all instances, and only the instances of a regular expression contained within a single line or string?
For example, suppose the output (all one single line) from a webpage is:
<Table border=1 cellpadding=2><TR><TH><font size=2>LAN IP BLOCK</font></TH><TH><font size=2>CUST_NAME</font></TH> <TH><font size=2>ID
</TH></TR><TR><TD><font size=2>10.4.4.0 / 29</font></TD><TD><font size=2>Customer data</font></TD><TD><font size=2></font></TD></T
TD><font size=2>10.1.1.0 / 27</font></TD><TD><font size=2>Customer</font></TD><TD><font size=2></font></TD></TR></Table><p>
I'd like to get every instance of the IP CIDR data. I know I'll have to use an IP address RE (and I believe I can figure or find that out), but how do I simply get EACH instance and remove all other text? I'd like to do this on the command line with grep/sed etc., but I'm thinking I may need to use Python. I know I could use Perl, but I'd have to get that installed.

The grep options -o and -E are what you are looking for:
grep -oE "pattern1|pattern2|pattern3|pattern4|...|patternN" input_file
From man grep:
-o, --only-matching
Print only the matched (non-empty) parts of a matching line,
with each such part on a separate output line.
-E, --extended-regexp
Interpret PATTERN as an extended regular expression
(-E is specified by POSIX.)
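As a rough illustration of -o and -E applied to the question's data, the sketch below pulls every CIDR-style token out of a single line. The pattern is a deliberately loose assumption (it does not check that each octet is at most 255), and extract_cidrs is a hypothetical helper name.

```shell
# Print each "a.b.c.d / nn" token found on stdin, one per output line.
# Assumption: the pattern is intentionally loose and does not validate octet ranges.
extract_cidrs() {
  grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3} */ *[0-9]{1,2}'
}

# Sample line shortened from the HTML in the question.
printf '%s\n' '<TD><font size=2>10.4.4.0 / 29</font></TD><TD><font size=2>10.1.1.0 / 27</font></TD>' | extract_cidrs
```

Because each match lands on its own line, the result can be piped straight into sort, uniq, or a while-read loop.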

Related

Use "sed" to Remove Capture Group 1 From All Lines In a File

I currently have a file with lines like the below:
ABCD123RTY,steve_tyler#gmail.com,10.20.30.142,2021-08-20T14:49:51.035Z
ABCD123QWE,thisguy#hotmail.com,10.20.30.245,2021-08-20T14:10:22.254Z
ABCD123DFG,calvin_hobbes2#netnet,10.20.30.l6,2021-08-20T15:30:34.480Z
My goal is to remove everything from the "#" to the next comma, such that it instead looks like the below:
ABCD123RTY,steve_tyler,10.20.30.142,2021-08-20T14:49:51.035Z
ABCD123QWE,thisguy,10.20.30.245,2021-08-20T14:10:22.254Z
ABCD123DFG,calvin_hobbes2,10.20.30.l6,2021-08-20T15:30:34.480Z
I'm not that experienced with utilizing sed and RegEx expressions. In playing around on a testing website, I came up with the below RegEx string, in which capture group 1 is perfectly matching to what I want to remove:
regex101.com Test
How would I go about putting this in a "sed" command against a given input file, and writing the results to a new output file? I had tried the below most recently:
sed 's/(#.+?),//' input.csv > input_Corrected.csv
Just as another note, I'm doing this in a bash script in which I have an API call generating the "input.csv" file, and then want to run this sed command to clean up the data format to match my needs.
You can use
sed 's/#[^,]*,/,/' input.csv > input_Corrected.csv
sed 's/#[^,]*//' input.csv > input_Corrected.csv
The POSIX BRE pattern #[^,]*, matches a #, then zero or more characters other than a comma, and then a comma. Use the first command if a comma MUST follow the match; it replaces the match with a comma so the field separator is kept. The second command drops the trailing comma from the pattern and uses an empty replacement, so it also works when the # part is the last field on the line.
See the online demo:
s='ABCD123RTY,steve_tyler#gmail.com,10.20.30.142,2021-08-20T14:49:51.035Z
ABCD123QWE,thisguy#hotmail.com,10.20.30.245,2021-08-20T14:10:22.254Z
ABCD123DFG,calvin_hobbes2#netnet,10.20.30.l6,2021-08-20T15:30:34.480Z'
sed 's/#[^,]*,/,/' <<< "$s"
Output:
ABCD123RTY,steve_tyler,10.20.30.142,2021-08-20T14:49:51.035Z
ABCD123QWE,thisguy,10.20.30.245,2021-08-20T14:10:22.254Z
ABCD123DFG,calvin_hobbes2,10.20.30.l6,2021-08-20T15:30:34.480Z
You can use the regular expression below to remove the domain part of valid email addresses only. The -E option is needed because the pattern uses the unescaped (...), + and {...} syntax of extended regular expressions:
sed -E 's/#([a-zA-Z0-9_.-]+)\.([a-zA-Z]{2,5})//g' input.csv > input_Corrected.csv
However, your file also contains "calvin_hobbes2#netnet", which is not a valid email address. To meet your requirement, use the command below instead, which removes everything from the # up to the next comma on every line:
sed "s/#[^,]*//g" input.csv > input_Corrected.csv
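To compare the strict and the permissive substitution side by side, here is a small sketch. The function names are hypothetical, and the use of -E with the bracket expression [a-zA-Z0-9_.-] is a choice made for this sketch rather than something taken from the question.

```shell
# Strict: strip "#domain.tld" only when it looks like a real domain.
# Assumption: -E and the bracket expression [a-zA-Z0-9_.-] are choices made for this sketch.
strip_valid_domains() {
  sed -E 's/#[a-zA-Z0-9_.-]+\.[a-zA-Z]{2,5}//g'
}

# Permissive: strip everything from "#" up to (not including) the next comma.
strip_any_suffix() {
  sed 's/#[^,]*//g'
}

sample='ABCD123QWE,thisguy#hotmail.com,10.20.30.245
ABCD123DFG,calvin_hobbes2#netnet,10.20.30.l6'

printf '%s\n' "$sample" | strip_valid_domains   # leaves "#netnet" in place
printf '%s\n' "$sample" | strip_any_suffix      # cleans both rows
```

The strict variant is only useful when malformed entries should be left alone for later inspection; for the cleanup described in the question, the permissive variant is the one that handles every row.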

How do I lowercase all MD5 hashes inside of a file?

I'm reposting this question with more context and examples because admittedly my last post was rushed. I'm trying to find ALL MD5 hashes (based off the regex) and simply lowercase them (regardless of format, and regardless of what else is on the line).
I previously posted a similar question related to lowercasing emails and this was the solved answer (maybe this can assist in finding the answer).
sed -e 's/^\([^#]*\)#/\L\1#/' file
The MD5 Regex I use is: [a-f0-9]{32}
I'll provide some examples now (each line is unique - there's NOT a set pattern)
Input data:
fwefwe:few32rfwe:3r2frewg:-::d3ewStack:D077F244DEF8A70E5EA758BD8352FCD8
fwefwe:few33rfwe:3r2frewg:-::dsasaewStack:06D80EB0C50B49A509B49F2424E8C805
fwefwe:few34rfwe:3r2f3213ef::2d3ewStack:F1BDF5ED1D7AD7EDE4E3809BD35644B0
fwefwe:few35rf32re4frewgre3frewg:-::d3ewStack:DDE2C7AD63AD86D6A18DE781205D194F
Output data:
fwefwe:few32rfwe:3r2frewg:-::d3ewStack:d077f244def8a70e5ea758bd8352fcd8
fwefwe:few33rfwe:3r2frewg:-::dsasaewStack:06d80eb0c50b49a509b49f2424e8c805
fwefwe:few34rfwe:3r2f3213ef::2d3ewStack:f1bdf5ed1d7ad7ede4e3809bd35644b0
fwefwe:few35rf32re4frewgre3frewg:-::d3ewStack:dde2c7ad63ad86d6a18de781205d194f
You may use
sed -E 's/[A-F0-9]{32}/\L&/g' file
If you also want to modify the existing file add -i option:
sed -i -E 's/[A-F0-9]{32}/\L&/g' file
The point is that you need the -E option to enable POSIX ERE syntax, which lets you write the {...} quantifier without escaping. The POSIX BRE equivalent looks like sed -i 's/[A-F0-9]\{32\}/\L&/g' file. Note that \L in the replacement is a GNU sed extension.
Also, I added g flag to modify all match occurrences on a line.
You do it the same way as you did when lowercasing the email. Use the regexp to match part of the line, then use a back-reference in the replacement to pick that up.
sed -e 's/[a-fA-F0-9]\{32\}/\L&/' file
In the replacement, & is replaced with whatever matched the regexp, and \L lowercases it.
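A minimal sketch of the same idea wrapped in a function, run on a line that carries two hashes so the effect of the g flag is visible. It assumes GNU sed (for \L); lowercase_md5 and the sample line are hypothetical.

```shell
# Lowercase each 32-character hex run on stdin.
# Assumption: GNU sed, because \L in the replacement is a GNU extension.
lowercase_md5() {
  sed -E 's/[A-Fa-f0-9]{32}/\L&/g'
}

# Two hashes on one line: the g flag makes sed convert both of them.
printf '%s\n' 'x:D077F244DEF8A70E5EA758BD8352FCD8 y:06D80EB0C50B49A509B49F2424E8C805' | lowercase_md5
```

Without the g flag only the first hash on each line would be converted.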

Extract Source IP from log files

I want to extract "srcip=x.x.x.x" from a log file in bash. My log file looks like this:
2019:06:23-17:50:03 myhost ulogd[5692]: id="2021" severity="info" sys="SecureNet" sub="packetfilter" name="Packet dropped (GEOIP)" action="drop" fwrule="60019" initf="eth0" srcmac="3c:1e:04:92:6f:fb" dstmac="00:50:56:97:7c:af" srcip="185.53.91.50" dstip="192.168.50.10" proto="6" length="44" tos="0x00" prec="0x00" ttl="235" srcport="54522" dstport="5038" tcpflags="SYN"
I wrote awk '{print $15}' to extract srcip, but the problem is that the position of srcip is not the same in each line. How can I extract srcip=x.x.x.x regardless of its position?
With any sed in any shell on every UNIX box:
$ sed -n 's/.*\(srcip="[^"]*"\).*/\1/p' file
srcip="185.53.91.50"
The following command provides the result you expect
grep -o -P 'srcip="(\d{1,3}[.]){3}\d{1,3}"' log
The option o is to print only the matched parts. The option P is to use perl-compatible regular expressions. The regex is matching srcip=<ipv4> and log is the name of the file you want to extract content from.
Here is a link to regex101 for an explanation for the regex: https://regex101.com/r/hjuZlM/2
An awk version
awk -F"srcip=" '{split($2,a," ");print FS a[1]}' file
srcip="185.53.91.50"
Split the line using the key word, then get the next field after split.
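If only the address itself is wanted, without the srcip= key or the quotes, one possible sketch is to match the key with grep -oE and then cut out the quoted value. This avoids -P; extract_srcip is a hypothetical name, and the sample line is shortened from the question's log.

```shell
# Print only the srcip value (no key, no quotes) for each matching line on stdin.
# The position of srcip within the line does not matter.
extract_srcip() {
  grep -oE 'srcip="[^"]*"' | cut -d'"' -f2
}

printf '%s\n' 'fwrule="60019" srcip="185.53.91.50" dstip="192.168.50.10" proto="6"' | extract_srcip
```

The pattern does not validate the address; it takes whatever sits between the quotes after srcip=.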

Regex command line change format of each line

I have a file that contains lines in a format similar to this...
/data/file.geojson?10,20,30,40
/data/file.geojson?bbox=-5.20751953125,49.05227025601607,3.0322265625,56.46249048388979
/data/file.geojson?bbox=-21.46728515625,45.99696161820381,19.2919921875,58.88194208135912
/data/file.geojson?bbox=-2.8482055664062496,54.38935426009769,-0.300750732421875,55.158473983815306
/data/file.geojson?bbox=-21.46728515625,45.99696161820381,19.2919921875,58.88194208135912
/data/file.geojson?bbox=-21.46728515625,45.99696161820381,19.2919921875,58.88194208135912
I've tried a combination of grep, sed, gawk, and |(pipes) to try and pattern match and then change the format to be more like this...
[10,40],[30,40],[30,20][10,20],
[-5.20751953125,56.46249048388979],[3.0322265625,56.46249048388979].....
Hopefully you get the idea from the first line so I don't have to type out all the examples manually!
I've got the hang of regex to match the co-ordinates. In fact the input file is the result of extracting from apache access logs. It might be easier to read/understand answers if they just match positive integer numbers, I will then be able to slot in a more complicated pattern to match the right range.
To arrange the results the way you wish, it is important to be able to access the last four values per line.
No pattern matching is required if you use awk. You can split the input strings by a set of delimiters and reassemble the resulting fields. 40 can be accessed as $(NF), 30 as $(NF-1) and so on.
awk -F'[?,=]' '
{printf "[%s,%s],[%s,%s],[%s,%s],[%s,%s]\n",
$(NF-3),$(NF),$(NF-1),$(NF),
$(NF-1),$(NF-2),$(NF-3),$(NF-2)
}' file
I'm using ?, , or = as the field delimiters. This makes it simple to access the columns of interest.
Output:
[10,40],[30,40],[30,20],[10,20]
[-5.20751953125,56.46249048388979],[3.0322265625,56.46249048388979],[3.0322265625,49.05227025601607],[-5.20751953125,49.05227025601607]
[-21.46728515625,58.88194208135912],[19.2919921875,58.88194208135912],[19.2919921875,45.99696161820381],[-21.46728515625,45.99696161820381]
[-2.8482055664062496,55.158473983815306],[-0.300750732421875,55.158473983815306],[-0.300750732421875,54.38935426009769],[-2.8482055664062496,54.38935426009769]
[-21.46728515625,58.88194208135912],[19.2919921875,58.88194208135912],[19.2919921875,45.99696161820381],[-21.46728515625,45.99696161820381]
[-21.46728515625,58.88194208135912],[19.2919921875,58.88194208135912],[19.2919921875,45.99696161820381],[-21.46728515625,45.99696161820381]
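The same split-and-reassemble idea can also be sketched in plain POSIX shell, without awk. This is only an illustrative alternative: bbox_corners is a hypothetical name, and it assumes every input line ends in exactly four comma-separated values preceded by a ? or =.

```shell
# Read lines on stdin, keep the text after the last "?" or "=",
# split it on commas, and print the four corner pairs.
# Assumption: each line ends in exactly four comma-separated values.
bbox_corners() {
  while IFS= read -r line; do
    coords=${line##*[?=]}          # strip everything up to the last ? or =
    printf '%s\n' "$coords" | {
      IFS=, read -r x1 y1 x2 y2    # split the four values
      printf '[%s,%s],[%s,%s],[%s,%s],[%s,%s]\n' \
        "$x1" "$y2" "$x2" "$y2" "$x2" "$y1" "$x1" "$y1"
    }
  done
}

printf '%s\n' '/data/file.geojson?10,20,30,40' | bbox_corners
```

For large files the awk version is likely to be faster, since this loop starts a subshell for every line.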
Btw, also sed can be used here:
sed -r 's/.*[?=]([^,]+),([^,]+),([^,]+),(.*)/[\1,\4],[\3,\4],[\3,\2],[\1,\2]/' file
The command captures each of the four trailing numbers in its own capturing group and reassembles them in the replacement part.
Not all versions of sed support the + quantifier. The most compatible version would look like this :)
sed 's/.*[?=]\([^,]\{1,\}\),\([^,]\{1,\}\),\([^,]\{1,\}\),\(.*\)/[\1,\4],[\3,\4],[\3,\2],[\1,\2]/' file
sed strips off items prior to numbers, then awk splits on comma and outputs in different order. Assuming data is in a file called "td.txt"
sed 's/^[^0-9-]*//' td.txt|awk -F, '{print "["$1","$4"],["$3","$4"],["$3","$2"],["$1","$2"],"}'
This might work for you (GNU sed):
sed -r 's/^.*\?[^-0-9]*([^,]*),([^,]*),([^,]*),([^,]*)/[\1,\4],[\3,\4],[\3,\2],[\1,\2]/' file
Or with more toothpicks:
sed 's/^.*?[^-0-9]*\([^,]*\),\([^,]*\),\([^,]*\),\([^,]*\)/[\1,\4],[\3,\4],[\3,\2],[\1,\2]/' file
You can use the following to match:
(\/data\/file\.geojson\?(?:bbox=)?)([0-9.-]+),([0-9.-]+),([0-9.-]+),([0-9.-]+)
And replace with the following:
$1[$2,$3],[$4,$5]
See DEMO

Grep/Sed between two tags with multiline

I have many files from which I need to get information.
Example of my files:
first file content:
"test This info i need grep</singleline>"
and
second file content (with two lines):
"test This info=
i need grep too</singleline>"
As a result, I need to extract this text: from the first file, "This info i need grep", and from the second file, "This info= i need grep too".
For the first file I use:
grep -o 'test .*</singleline>' * | sed -e 's/test \(.*\)<\/singleline>/\1/'
and successfully get "This info i need grep", but I cannot get the information from the second file using the same command.
Please help me rewrite the command or suggest another one.
If you insist on using grep, you can use:
grep -Pzo 'test(\n|.)*(?=</singleline>)' test.txt
To understand the meaning of each flag, use grep --help:
-P, --perl-regexp
PATTERN is a Perl regular expression
-o, --only-matching
show only the part of a line matching PATTERN
-z, --null-data
a data line ends in 0 byte, not newline
I'd use pcregrep, which can match multiline regexes:
pcregrep -Mo 'test \K((?s).)*?(?=</singleline>)' filename
The tricks are:
-M allows pcregrep to match on more than one line,
-o makes it print only the match,
\K throws away the part of the match that comes before it,
(?=</singleline>) is a lookahead term that matches an empty string if (and only if) it is followed by </singleline>, and
((?s).)*? matches any characters non-greedily, which is to say that if there are several occurrences of </singleline> in the file, it matches up to the closest one rather than the furthest. If this is not desired, remove the ?. (?s) enables the s option locally so that . also matches newlines, which it does not do by default.
Thanks to @CasimiretHippolyte for pointing out the ((?s).) alternative to (.|\n).
It looks like you're parsing quoted-printable encoded text, where a "soft" line break (one that is an artifact from fixed-line-width formatting) is indicated with a line-terminating = (directly before the \n).
Since in a later comment you also expressed the desire to print each match as a single line, I suggest the following 2-pass approach:
use awk to remove the soft line breaks
then use grep on the result
awk '/=$/ { printf "%s", substr($0, 1, length($0)-1); next } 1' file |
grep -Po 'test .*?(?=</singleline>)'
Tip of the hat to Wintermute's helpful answer for the non-greedy quantifier, *?, and both Wintermute's and Maroun Maroun's helpful answer for the positive look-ahead assertion, (?=...).
Note that the awk command removes the line-ending = (along with the newline); replace the substr call with just $0 to retain it.
Since strings of interest are first converted back to their original single-line representations:
The matches are printed in their original form.
You can use regular (GNU) grep with line-by-line matching; contrast this with:
  needing to read the entire file at once, as in Maroun Maroun's helpful answer.
  Note that, as of this writing, * must be replaced with *? in his answer for it to work correctly in files with multiple matches.
  needing to install another utility, pcregrep, as in Wintermute's helpful answer.
  additionally, the matches would have to be cleaned up to be single-line (something you didn't originally state as a requirement).
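The two-pass idea can be sketched end to end as follows. This is a hedged variation rather than the exact commands above: the second pass uses sed instead of grep -P so that it does not depend on PCRE support, and the function names and the sample input (with a space before the trailing =) are hypothetical.

```shell
# Pass 1: join lines that end in a soft line break ("=") with the next line,
# dropping just the trailing "=".
join_soft_breaks() {
  awk '/=$/ { printf "%s", substr($0, 1, length($0) - 1); next } 1'
}

# Pass 2: print the text between "test " and "</singleline>" on each line.
# Assumption: at most one such pair per line, since .* in sed is greedy.
extract_between() {
  sed -n 's/.*test \(.*\)<\/singleline>.*/\1/p'
}

printf 'test This info =\ni need grep too</singleline>\n' | join_soft_breaks | extract_between
```

If a line can contain several test ... </singleline> pairs, the non-greedy grep -Po form is the better fit for the second pass.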