SED / RegEx Puzzle - regex

I have a file with many log lines like the example below. What I'd like to do is add a CR after each piece of process information. I figured I'd do this with sed, using the command below:
sed -rn 's/([0-9]+) \(([a-z._0-9]+)\) ([0-9]+) ([0-9]+)/ \2,\1,\3,\4 \n/gp' < file
This partially works, but I still get the "Total: 3266 #015" from the log, which appears at the end of each line. I didn't expect this, as it doesn't get matched by the regular expression.
I've tested the regular expression on the available websites, and it always looks good and finds what I'd expect; it's just that when I combine it with sed I don't quite get the result I was expecting.
Any help or suggestions would be most appreciated,
Thanks
Andy
This is a single line of the stats
1 (init) 3686400 123 148 (klogd) 3690496 116 16364 (memlogger.sh) 3686400 144 17 0 225 (dropbear) 1847296 113 242 (mini_httpd) 2686976 167 281 (snmpd) 4812800 231 283 (logmuxd) 2514944 262 284 (watchdog) 3551232 82 285 (controld) 5259264 610 287 (setupd) 5120000 436 289 (checkpoold) 3424256 129 296 (trap_sender_d) 3457024 165 298 (watch) 3686400 114 299 (processwatchdog) 3420160 119 314 (timerd) 3637248 219 315 (init) 3686400 116 16365 (cat) 3694592 120 Total: 3266 #015

Just remove the "Total:" first. sed's s command only rewrites the text it matches; anything the regex never matches (like the trailing "Total: ...") stays in the pattern space and gets printed along with the rest:
sed -rn 's/ +Total:.*//;
s/([0-9]+) +\(([a-z._0-9]+)\) +([0-9]+) +([0-9]+)/ \2,\1,\3,\4\n/gp'
You can also match the "Total:" optionally:
sed -rn 's/([0-9]+) +\(([a-z._0-9]+)\) +([0-9]+) +([0-9]+)( *Total:.*)?/ \2,\1,\3,\4\n/gp'
# ------------^
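As a sanity check, here is the first form run on a shortened sample line (the long stats line is abbreviated; assumes GNU sed for -r and for \n in the replacement):

```shell
# Shortened sample line (the real one has many more "pid (name) num num" entries).
line='1 (init) 3686400 123 148 (klogd) 3690496 116 Total: 3266 #015'

# Strip the trailing "Total: ..." first, then reformat each process entry.
out=$(printf '%s\n' "$line" |
  sed -rn 's/ +Total:.*//; s/([0-9]+) +\(([a-z._0-9]+)\) +([0-9]+) +([0-9]+)/ \2,\1,\3,\4\n/gp')
printf '%s\n' "$out"
```

Each process entry now lands on its own line, and the "Total:" tail is gone.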

Related

Separate Numerical Values Based On Pattern Match

I need to separate 100s values from 90s values... sed may not be the best way to accomplish this, but regardless, I am trying to separate the 90s from the 100s by inserting a space between the two numbers.
Code:
sed 's/1[0-9][0-9]/ 1[0-9][0-9]/g' file
Data File:
99100 93 96 95 94 93 96 98100
Current Result:
99 1[0-9][0-9] 93 96 95 94 93 96 98 1[0-9][0-9]
Expected Result:
99 100 93 96 95 94 93 96 98 100
You may replace with &, which stands for the whole match:
s='99100 93 96 95 94 93 96 98100'
echo "$s" | sed 's/1[0-9][0-9]/ &/g'
See the online demo, result: 99 100 93 96 95 94 93 96 98 100.
See sed reference:
Also, the replacement can contain unescaped & characters which reference the whole matched portion of the pattern space.
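A quick check of the & substitution (plain POSIX sed, no extensions needed):

```shell
# Insert a space before every three-digit number starting with 1.
s='99100 93 96 95 94 93 96 98100'
out=$(printf '%s\n' "$s" | sed 's/1[0-9][0-9]/ &/g')
printf '%s\n' "$out"   # 99 100 93 96 95 94 93 96 98 100
```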
gawk solution:
awk -v FPAT='9[0-9]|1[0-9][0-9]' '{ r=$1; for(i=2;i<=NF;i++) r=r FS $i; print r }' file
The output:
99 100 93 96 95 94 93 96 98 100
-v FPAT='9[0-9]|1[0-9][0-9]' - a pattern defining what a field value looks like (a 90s or 100s number)
r=$1 - capture the 1st field as the initial value
for(i=2;i<=NF;i++) - iterate through the remaining fields, appending each one separated by FS

Grep each line of a text file in another tab separated file [duplicate]

This question already has answers here:
Inner join on two text files
(5 answers)
Closed 6 years ago.
I have a text file1 that has some id's like:
c10013_g2_i1|m.63|vomeronasal type-1 receptor 4-like
c10015_g1_i1|m.409|vomeronasal type-1 receptor 1-like
I used grep '^[^|]*' file1 to extract the string before | from file1.
I want each of these grepped strings to match lines from another file2 and return the whole line when matched. file2 looks like this:
c10013_g2_i1 781 622.2 73 5.95 5.16
c10014_g1_i1 213 58.67 3 2.59 2.25
c10014_g2_i1 341 182.35 4 1.11 0.96
c10015_g1_i1 404 245.23 16 3.31 2.87
c10017_g1_i1 263 105.37 6 2.89 2.5
Finally the result should look like:
c10013_g2_i1|m.63|vomeronasal type-1 receptor 4-like 781 622.2 73 5.95 5.16
c10015_g1_i1|m.409|vomeronasal type-1 receptor 1-like 404 245.23 16 3.31 2.87
You can use awk:
awk 'FNR == NR {
split($0, a, /[|]/)
seen[a[1]] = $0
next
}
$1 in seen {
$1 = seen[$1]
print
}' file1 file2
c10013_g2_i1|m.63|vomeronasal type-1 receptor 4-like 781 622.2 73 5.95 5.16
c10015_g1_i1|m.409|vomeronasal type-1 receptor 1-like 404 245.23 16 3.31 2.87
For structured text, awk is the king of tools.
$ awk 'NR==FNR{split($0,v,"|");a[v[1]]=$0; next}
$1 in a{k=$1; $1=""; print a[k] $0}' file1 file2
c10013_g2_i1|m.63|vomeronasal type-1 receptor 4-like 781 622.2 73 5.95 5.16
c10015_g1_i1|m.409|vomeronasal type-1 receptor 1-like 404 245.23 16 3.31 2.87
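A self-contained reproduction of the lookup-table approach above (sample files written to a temp directory; works with any POSIX awk):

```shell
# Write the two sample files from the question to a scratch directory.
dir=$(mktemp -d)
cat > "$dir/file1" <<'EOF'
c10013_g2_i1|m.63|vomeronasal type-1 receptor 4-like
c10015_g1_i1|m.409|vomeronasal type-1 receptor 1-like
EOF
cat > "$dir/file2" <<'EOF'
c10013_g2_i1 781 622.2 73 5.95 5.16
c10014_g1_i1 213 58.67 3 2.59 2.25
c10015_g1_i1 404 245.23 16 3.31 2.87
EOF

# First pass stores file1 lines keyed by the part before '|';
# second pass swaps the key for the stored line and prints.
out=$(awk 'FNR == NR { split($0, a, /[|]/); seen[a[1]] = $0; next }
           $1 in seen { $1 = seen[$1]; print }' "$dir/file1" "$dir/file2")
printf '%s\n' "$out"
```

Only the two ids present in file1 survive; c10014_g1_i1 is dropped.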
Sounds like you're trying to join on the first field of each file. There's actually a join command that can do this. You'll need to change file1 slightly (join works on spaces):
cat file1 | sed 's/^\([^|]*\)[|]/\1 |/' | sort > file1-delimited
Then you can join them:
cat file2 | sort | join file1-delimited -
c10013_g2_i1 |m.63|vomeronasal type-1 receptor 4-like 781 622.2 73 5.95 5.16
c10015_g1_i1 |m.409|vomeronasal type-1 receptor 1-like 404 245.23 16 3.31 2.87
This should get you 95% of the way there, but the format might not be perfect.

How can I extract Twitter @handles from a text with RegEx?

I'm looking for an easy way to create lists of Twitter @handles based on SocialBakers data (copy/paste into TextMate).
I've tried using the following RegEx, which I found here on StackOverflow, but unfortunately it doesn't work the way I want it to:
^(?!.*@([\w+])).*$
While the expression above deletes all lines without @handles, I'd like the RegEx to delete everything before and after the @handle, as well as lines without @handles.
Example:
1
katyperry KATY PERRY (@katyperry)
Followings 158
Followers 82 085 596
Rating
5
Worst012345678910Best
2
justinbieber Justin Bieber (@justinbieber)
254 399
74 748 878
2
Worst012345678910Best
3
taylorswift13 Taylor Swift (@taylorswift13)
245
70 529 992
Desired result:
@katyperry
@justinbieber
@taylorswift13
Thanks in advance for any help!
Something like this:
cat file | perl -ne 'while(s/(@[a-z0-9_]+)//gi) { print $1,"\n"}'
This will also work if you have lines with multiple @handles in them.
A Twitter handle regex is @\w+. So, to remove everything else, you need to match and capture the pattern, use a backreference to this capture group, and then just match any other character:
(@\w+)|.
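If GNU grep is available, its -o flag gives the same per-match extraction as the perl loop; a sketch on two sample lines:

```shell
# -o prints each match on its own line; lines without a handle produce nothing.
out=$(printf '%s\n' 'katyperry KATY PERRY (@katyperry)' 'Followings 158' |
  grep -oE '@[A-Za-z0-9_]+')
printf '%s\n' "$out"   # @katyperry
```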
Use DOTALL mode to also match newline symbols. Replace with $1 (or \1, depending on the tool you are using).
See demo
Straight regex, tested in the Caret editor:
@.*[^)]
The above searches for an @ plus any characters after it, excluding a trailing closing parenthesis.
@.*\b
The above does the same thing in the Caret text editor.
How to awk and sed this:
Get usernames as well:
$ awk '/@.*/ {print}' test
katyperry KATY PERRY (@katyperry)
justinbieber Justin Bieber (@justinbieber)
taylorswift13 Taylor Swift (@taylorswift13)
Just the handle:
$ awk -F "(" '/@.*/ {print$2}' test | sed 's/)//g'
@katyperry
@justinbieber
@taylorswift13
A look at the test file:
$ cat test
1
katyperry KATY PERRY (@katyperry)
Followings 158
Followers 82 085 596
Rating
5
Worst012345678910Best
2
justinbieber Justin Bieber (@justinbieber)
254 399
74 748 878
2
Worst012345678910Best
3
taylorswift13 Taylor Swift (@taylorswift13)
245
70 529 992
Bash Version:
$ bash --version
GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin14)
Copyright (C) 2007 Free Software Foundation, Inc.

Match datetime format with Bash REGEX

I have data with this datetime format in bash:
28/11/13 06:20:05 (dd/mm/yy hh:mm:ss)
I need to reformat it like:
2013-11-28 06:20:05 (MySQL datetime format)
I am using the following regex:
regex='([0-9][0-9])/([0-9][0-9])/([0-9][0-9])\s([0-9][0-9]/:[0-9][0-9]:[0-9][0-9])'
if [[$line=~$regex]]
then
$line='20$3-$2-$1 $4';
fi
This produces an error:
./filename: line 10: [[09:34:38=~([0-9][0-9])/([0-9][0-9])/([0-9][0-9])\s([0-9][0-9]/:[0-9][0-9]:[0-9][0-9])]]: No such file or directory
UPDATE:
I want to read this file "line by line", parse it and insert data in mysql database:
'filenameX':
27/11/13 12:20:05 9984 2885 260 54 288 94 696 1852 32 88 27 7 154
27/11/13 13:20:05 9978 2886 262 54 287 93 696 1854 32 88 27 7 154
27/11/13 14:20:05 9955 2875 262 54 287 93 696 1860 32 88 27 7 154
27/11/13 15:20:04 9921 2874 261 54 284 93 692 1868 32 88 27 7 154
27/11/13 16:20:09 9896 2864 260 54 283 92 689 1880 32 88 27 7 154
27/11/13 17:20:05 9858 2858 258 54 279 92 683 1888 32 88 27 7 154
27/11/13 18:20:04 9849 2853 258 54 279 92 683 1891 32 88 27 7 154
27/11/13 19:20:04 9836 2850 257 54 279 93 683 1891 32 88 27 7 154
27/11/13 20:20:05 9826 2845 257 54 279 93 683 1892 32 88 27 7 154
27/11/13 21:20:05 9820 2847 257 54 278 93 682 1892 32 88 27 7 154
27/11/13 22:20:04 9810 2844 257 54 277 93 681 1892 32 88 27 7 154
27/11/13 23:20:04 9807 2843 257 54 276 93 680 1892 32 88 27 7 154
28/11/13 00:20:05 9809 2843 257 54 276 93 680 1747 29 87 17 6 139
28/11/13 01:20:04 9809 2842 257 54 276 93 680 1747 29 87 17 6 139
28/11/13 02:20:05 9809 2843 256 54 276 93 679 1747 29 87 17 6 139
28/11/13 03:20:04 9808 2842 256 54 276 93 679 1747 29 87 17 6 139
28/11/13 04:20:05 9808 2842 256 54 276 93 679 1747 29 87 17 6 139
28/11/13 05:20:39 9807 2842 256 54 276 93 679 1747 29 87 17 6 139
28/11/13 06:20:05 9804 2840 256 54 276 93 679 1747 29 87 17 6 139
Script:
#!/bin/bash
echo "Start!"
while IFS=' ' read -ra ADDR;
do
for line in $(cat results)
do
regex='([0-9][0-9])/([0-9][0-9])/([0-9][0-9]) ([0-9][0-9]:[0-9][0-9]:[0-9]$
if [[ $line =~ $regex ]]; then
$line="20${BASH_REMATCH[3]}-${BASH_REMATCH[2]}-${BASH_REMATCH[1]} ${BASH_REMATCH[4]}"
fi
echo "insert into table(time, total, caracas, anzoategui) values('$line', '$line', '$line', '$line', '$line');"
done | mysql -user -password database;
done < filenameX
Result:
time | total | caracas | anzoategui |
0000-00-00 00:00:00 | 9 | 9 | 9 |
2027-11-13 00:00:00 | 15 | 15 | 15 |
Note: This answer was accepted based on fixing the bash-focused approach in the OP. For a simpler, awk-based solution see the last section of this answer.
Try the following:
line='28/11/13 06:20:05' # sample input
regex='([0-9][0-9])/([0-9][0-9])/([0-9][0-9]) ([0-9][0-9]:[0-9][0-9]:[0-9][0-9])'
if [[ $line =~ $regex ]]; then
line="20${BASH_REMATCH[3]}-${BASH_REMATCH[2]}-${BASH_REMATCH[1]} ${BASH_REMATCH[4]}"
fi
echo "$line" # -> '2013-11-28 06:20:05'
As for why your code didn't work:
As @anubhava pointed out, you need at least 1 space to the right of [[ and to the left of ]].
Whether \s works in a bash regex is platform-dependent (Linux: yes; OSX: no), so a single, literal space is the safer choice here.
Your variable assignment was incorrect ($line = ...) - when assigning to a variable, never prefix the variable name with $.
Your backreferences were incorrect ($1, ...): to refer to capture groups (subexpressions) in a bash regex you have to use the special ${BASH_REMATCH[@]} array variable; ${BASH_REMATCH[0]} contains the entire string that matched, ${BASH_REMATCH[1]} contains what the first capture group matched, and so on; by contrast, $1, $2, ... refer to the 1st, 2nd, ... argument passed to a shell script or function.
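Putting those fixes together, the corrected conversion can be exercised in one go (wrapped in bash -c so it runs even from a POSIX sh):

```shell
# [[ =~ ]] and BASH_REMATCH are bash features, hence the explicit bash -c.
out=$(bash -c '
  line="28/11/13 06:20:05"
  regex="([0-9][0-9])/([0-9][0-9])/([0-9][0-9]) ([0-9][0-9]:[0-9][0-9]:[0-9][0-9])"
  if [[ $line =~ $regex ]]; then
    line="20${BASH_REMATCH[3]}-${BASH_REMATCH[2]}-${BASH_REMATCH[1]} ${BASH_REMATCH[4]}"
  fi
  printf "%s\n" "$line"
')
printf '%s\n' "$out"   # 2013-11-28 06:20:05
```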
Update, to address the OP's updated question:
I think the following does what you want:
# Read input file and store each col. value in separate variables.
while read -r f1 f2 f3 f4 f5 f6 f7 f8 f9 f10 f11 f12 f13 f14 f15; do
# Concatenate the first 2 cols. to form a date + time string.
dt="$f1 $f2"
# Parse and reformat the date + time string.
regex='([0-9][0-9])/([0-9][0-9])/([0-9][0-9]) ([0-9][0-9]:[0-9][0-9]:[0-9][0-9])'
if [[ "$dt" =~ $regex ]]; then
dt="20${BASH_REMATCH[3]}-${BASH_REMATCH[2]}-${BASH_REMATCH[1]} ${BASH_REMATCH[4]}"
fi
# Echo the SQL command; all of them are piped into a `mysql` command
# at the end of the loop.
# !! Fill the $f<n> variables in as needed - I don't know which ones you need.
# !! Make sure the number of column names matches the number of values.
# !! Your original code had 4 column names, but 5 values, causing an error.
echo "insert into table(time, total, caracas, anzoategui) values('$dt', '$f3', '$f4', '$f5');"
done < filenameX | mysql -user -password database
Afterthought: The above solution is based on improvements to the OP's code; below is a streamlined solution that is a one-liner based on awk (spread across multiple lines for readability - tip of the hat to @twalberg for the awk-based date reformatting):
awk -v sq=\' '{
split($1, tkns, "/");
dt=sprintf("20%s-%s-%s", tkns[3], tkns[2], tkns[1]);
printf "insert into table(time,total,caracas,anzoategui) values(%s,%s,%s,%s);",
sq dt " " $2 sq, sq $3 sq, sq $4 sq, sq $5 sq
}' filenameX | mysql -user -password database
Note: To make quoting inside the awk program simpler, a single quote is passed in via variable sq (-v sq=\').
Perl is handy here.
dt="28/11/13 06:20:05"
perl -MTime::Piece -E "say Time::Piece->strptime('$dt', '%d/%m/%y %T')->strftime('%Y-%m-%d %T')"
2013-11-28 06:20:05
This does the trick without any overly complicated regex invocations:
echo "28/11/13 06:20:05" | awk -F'[/ ]' \
'{printf "20%s-%s-%s %s\n", $3, $2, $1, $4}'
Or, as suggested by @fedorqui in the comments, if the source of your timestamp is date, you can just give it the formatting options you want...
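A quick check of the awk one-liner (any POSIX awk):

```shell
# Split on both '/' and space, then reassemble in ISO order with a "20" century prefix.
out=$(echo "28/11/13 06:20:05" |
  awk -F'[/ ]' '{printf "20%s-%s-%s %s\n", $3, $2, $1, $4}')
printf '%s\n' "$out"   # 2013-11-28 06:20:05
```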
Spaces are mandatory in BASH so use:
[[ "$line" =~ $regex ]] && echo "${line//\//-}"
Also you cannot use \s in BASH so use this regex:
regex='([0-9][0-9])/([0-9][0-9])/([0-9][0-9]) ([0-9][0-9]:[0-9][0-9]:[0-9][0-9])'
Thanks all for the samples above. Here are a few attempts at appending a "T" between date and time, and where they fall short.
"T" not appended (GNU sed does not support \d, so the pattern never matches):
$line='"2020-11-26 10:20:01.000000","the size of the table is 3.5" (inches)","2020-12-11 10:20:02"'
$echo "$line" | sed -r 's#(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2})#\2T\1#g'
"2020-11-26 10:20:01.000000","the size of the table is 3.5" (inches)","2020-12-11 10:20:02"
"T" appended only to middle of first column and not any other column with date format in the row
$awk '/[0-9]{4}-(0[1-9]|1[0-2])-(0[1-9]|[1-2][0-9]|3[0-1]) (2[0-3]|[01][0-9]):[0-5][0-9]*/{print}' test_file |sed -e 's/\s/\T/'
"2020-11-26T10:20:01.000000","the size of the table is 3.5" (inches)","2020-12-11 10:20:02"
example from above with grouping
$ line='"2020-11-26 10:20:01.000000","the size of the table is 3.5" (inches)","2020-12-11 10:20:02"'
$ regex='([0-9][0-9])-([0-9][0-9])-([0-9][0-9]) ([0-9][0-9]:[0-9][0-9]:[0-9][0-9])'
$ if [[ $line =~ $regex ]]; then line="20${BASH_REMATCH[3]}-${BASH_REMATCH[2]}-${BASH_REMATCH[1]}T${BASH_REMATCH[4]}"; fi
$ echo "$line"
2026-11-20T10:20:01
#...the intention is to append "T" between date and time (same field) on all fields within huge csv file with millions of records, not just first column, all having same date format YYYY-MM-DD HH24:MI:SS

How do you extract IP addresses from files using a regex in a linux shell?

How to extract a text part by regexp in a linux shell? Let's say I have a file where every line contains an IP address, but at a different position. What is the simplest way to extract those IP addresses using common unix command-line tools?
You could use grep to pull them out.
grep -o '[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}' file.txt
Most of the examples here will match on 999.999.999.999 which is not technically a valid IP address.
The following will match on only valid IP addresses (including network and broadcast addresses).
grep -E -o '(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)' file.txt
Omit the -o if you want to see the entire line that matched.
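To see the difference, feed both a valid and an out-of-range address through it; the bogus one produces no match at all:

```shell
# Each octet alternative only admits 0-255, so 999.999.999.999 cannot match.
out=$(printf '%s\n' 'host 192.168.0.1 ok' 'bogus 999.999.999.999 here' |
  grep -E -o '(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)')
printf '%s\n' "$out"   # 192.168.0.1
```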
This works fine for me in access logs.
cat access_log | egrep -o '([0-9]{1,3}\.){3}[0-9]{1,3}'
Let's break it part by part.
[0-9]{1,3} means one to three occurrences of the range mentioned in []. In this case it is 0-9, so it matches patterns like 10 or 183.
Followed by a '.'. We need to escape this, as '.' is a metacharacter with special meaning in regular expressions.
So now we are at patterns like '123.', '12.', etc.
This pattern repeats itself three times (with the '.'), so we enclose it in brackets:
([0-9]{1,3}\.){3}
And lastly, the pattern repeats itself, but this time without the '.'. That is why we kept it separate in the 3rd step: [0-9]{1,3}
If the ips are at the beginning of each line as in my case use:
egrep -o '^([0-9]{1,3}\.){3}[0-9]{1,3}'
where '^' is an anchor that tells to search at the start of a line.
I usually start with grep, to get the regexp right.
# [multiple failed attempts here]
grep '[0-9]*\.[0-9]*\.[0-9]*\.[0-9]*' file # good?
grep -E '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' file # good enough
Then I'd try and convert it to sed to filter out the rest of the line. (After reading this thread, you and I aren't going to do that anymore: we're going to use grep -o instead)
sed -ne 's/.*\([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\).*/\1/p' # FAIL
That's when I usually get annoyed with sed for not using the same regexes as anyone else. So I move to perl.
$ perl -nle '/[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}/ and print $&'
Perl's good to know in any case. If you've got a teeny bit of CPAN installed, you can even make it more reliable at little cost:
$ perl -MRegexp::Common=net -nE '/$RE{net}{IPV4}/ and say $&' file(s)
You can use sed. But if you know perl, that might be easier, and more useful to know in the long run:
perl -ne '/(\d+\.\d+\.\d+\.\d+)/ && print "$1\n"' < file
I wrote a little script to see my log files better, it's nothing special, but might help a lot of the people who are learning perl. It does DNS lookups on the IP addresses after it extracts them.
You can use some shell helper I made:
https://github.com/philpraxis/ipextract
included them here for convenience:
#!/bin/sh
ipextract ()
{
egrep --only-matching -E '(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)'
}
ipextractnet ()
{
egrep --only-matching -E '(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)/[[:digit:]]+'
}
ipextracttcp ()
{
egrep --only-matching -E '[[:digit:]]+/tcp'
}
ipextractudp ()
{
egrep --only-matching -E '[[:digit:]]+/udp'
}
ipextractsctp ()
{
egrep --only-matching -E '[[:digit:]]+/sctp'
}
ipextractfqdn ()
{
egrep --only-matching -E '[a-zA-Z0-9]+[a-zA-Z0-9\-\.]*\.[a-zA-Z]{2,}'
}
Load it / source it (when stored in ipextract file) from shell:
$ . ipextract
Use them:
$ ipextract < /etc/hosts
127.0.0.1
255.255.255.255
$
For some example of real use:
ipextractfqdn < /var/log/snort/alert | sort -u
dmesg | ipextractudp
For those who want a ready solution for getting IP addresses from an apache log and listing how many times each IP address has visited the website, use this line:
grep -Eo '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' error.log | sort | uniq -c | sort -nr > occurences.txt
Nice method to ban hackers. Next you can:
Delete lines with fewer than 20 visits
Using a regexp, cut everything up to the single space so you will have only IP addresses
Using a regexp, cut the last octet (1-3 digits) of the IP addresses so you will have only network addresses
Add deny from and a space at the beginning of each line
Put the result file as .htaccess
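The counting pipeline can be sanity-checked on a tiny fabricated log before pointing it at a real access_log:

```shell
# Three fake log lines: two visits from 10.0.0.1, one from 10.0.0.2.
out=$(printf '%s\n' '10.0.0.1 - GET /index.html' \
                    '10.0.0.1 - GET /about.html' \
                    '10.0.0.2 - GET /index.html' |
  grep -Eo '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' |
  sort | uniq -c | sort -nr)
printf '%s\n' "$out"
#       2 10.0.0.1
#       1 10.0.0.2
```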
grep -E -o "([0-9]{1,3}[\.]){3}[0-9]{1,3}"
I'd suggest perl. (\d+\.\d+\.\d+\.\d+) should probably do the trick.
EDIT: Just to make it more like a complete program, you could do something like the following (not tested):
#!/usr/bin/perl -w
use strict;
while (<>) {
if (/(\d+\.\d+\.\d+\.\d+)/) {
print "$1\n";
}
}
This handles one IP per line. If you have more than one IP per line, you need to use the /g option. man perlretut gives you a more detailed tutorial on regular expressions.
All of the previous answers have one or more problems. The accepted answer allows ip numbers like 999.999.999.999. The currently second most upvoted answer requires prefixing with 0 such as 127.000.000.001 or 008.008.008.008 instead of 127.0.0.1 or 8.8.8.8. Apama has it almost right, but that expression requires that the ipnumber is the only thing on the line, no leading or trailing space allowed, nor can it select ip's from the middle of a line.
I think the correct regex can be found on http://www.regextester.com/22
So if you want to extract all ip-adresses from a file use:
grep -Eo "(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])" file.txt
If you don't want duplicates use:
grep -Eo "(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])" file.txt | sort | uniq
Please comment if there still are problems in this regex. It's easy to find many wrong regexes for this problem; I hope this one has no real issues.
Everyone here is using really long-winded regular expressions, but actually understanding POSIX regex will allow you to use a small grep command like this for printing IP addresses.
grep -Eo "(([0-9]{1,3})\.){3}([0-9]{1,3})"
(Side note)
This doesn't ignore invalid IPs but it is very simple.
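For example, the short pattern happily extracts an out-of-range address too, which is the trade-off being made:

```shell
# The {1,3} quantifier accepts any 1-3 digit run, so 999 passes through.
out=$(printf '%s\n' 'a 10.20.30.40 b' 'c 999.123.123.123 d' |
  grep -Eo '(([0-9]{1,3})\.){3}([0-9]{1,3})')
printf '%s\n' "$out"
# 10.20.30.40
# 999.123.123.123
```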
I have tried all the answers, but all of them had one or more problems, of which I list a few:
Some detected 123.456.789.111 as valid IP
Some don't detect 127.0.00.1 as valid IP
Some don't detect IP that start with zero like 08.8.8.8
So here I post a regex that works on all above conditions.
Note: I have extracted more than 2 million IPs without any problem with the following regex.
(?:(?:1\d\d|2[0-5][0-5]|2[0-4]\d|0?[1-9]\d|0?0?\d)\.){3}(?:1\d\d|2[0-5][0-5]|2[0-4]\d|0?[1-9]\d|0?0?\d)
I wrote an informative blog article about this topic: How to Extract IPv4 and IPv6 IP Addresses from Plain Text Using Regex.
In the article there's a detailed guide of the most common different patterns for IPs, often required to be extracted and isolated from plain text using regular expressions.
This guide is based on CodVerter's IP Extractor source code tool for handling IP addresses extraction and detection when necessary.
If you wish to validate and capture IPv4 Address this pattern can do the job:
\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)[.]){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b
or to validate and capture IPv4 Address with Prefix ("slash notation"):
\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)[.]){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?/[0-9]{1,2})\b
or to capture subnet mask or wildcard mask:
(255|254|252|248|240|224|192|128|0)[.](255|254|252|248|240|224|192|128|0)[.](255|254|252|248|240|224|192|128|0)[.](255|254|252|248|240|224|192|128|0)
or to filter out subnet mask addresses you do it with regex negative lookahead:
\b((?!(255|254|252|248|240|224|192|128|0)[.](255|254|252|248|240|224|192|128|0)[.](255|254|252|248|240|224|192|128|0)[.](255|254|252|248|240|224|192|128|0)))(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)[.]){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b
For IPv6 validation you can go to the article link I have added at the top of this answer.
Here is an example for capturing all the common patterns (taken from CodVerter's IP Extractor Help Sample):
If you wish you can test the IPv4 regex here.
You could use awk, as well. Something like ...
awk '{i=1; if (NF > 0) do {if ($i ~ /regexp/) print $i; i++;} while (i <= NF);}' file
May require cleaning; just a quick and dirty response to show basically how to do it with awk.
The awk example above didn't work for me, and I needed to do it with awk specifically, so I came up with this method:
$ awk '{match($0,/[0-9]{1,3}+\.[0-9]{1,3}+\.[0-9]{1,3}+\.[0-9]{1,3}+/); ip = substr($0,RSTART,RLENGTH); print ip}' your_sample_file.log
You can also just use pipes if you're getting the data from somewhere else. Eg, ipconfig
I also realized the method matches invalid IP addresses.
Here is an extended version that only matches valid IPv4 Addresses:
$ awk 'match($0, /(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)/) {print substr($0, RSTART, RLENGTH)}' sample_file.log
Hope it helps someone else.
It's a REALLY brute-force type of solution, and I haven't had time to handle things like subnet masks.
Since many awk variants lack backreferences in regex, {n,m} range notation, the FPAT feature, or an array target for match(), I have to try my best to emulate some of that functionality here.
The regex itself is very basic, and it's very much intentional, since each of candidates that passed through the first layer filter will then be fed into the ip4 validation function to ensure values are in range.
Additionally, I use a second array to handle the duplicate scenario (although it's only de-duped in the ASCII string sense - leading zeros, for now, will show up multiple times for each unique ASCII string representation of it).
I know it's ultra brute-force and unseemly of a solution - there's only so much lemonade I can make out of the lemons I have.
echo "${bbbbbbbb}" \
\
| mawk 'function validIP4(_,__,___) {
__^=__=___=4;--__
if(--___!=gsub("[.]","_",_)) {
return !___ }
++___
do {
if ((+_<-_)||(__<+_)||(--___<-___)) {
_="[|]"
break
} } while (sub("^[^_]+[_]","",_))
return _!="[|]"
} BEGIN { FS = RS = "^$"
__=(__= (__="[0]*([012]?[0-9])?[0-9][.]")__)__
sub("...$","",__)
} END {
gsub(/[^0-9.]+/,OFS)
gsub(__,"=&~")
gsub(/[~][^0-9.=~]+[=]/,"~=")
gsub(/^[^=~]+[=~]|[=~][^=~]+$/,"")
split($(_<_),___,"[=~]+")
for(_ in ___) {
if ( ! (____[__=___[_]]++)) {
if (validIP4(__)) {
print (__) } } } }' \
\
| gsort -t'.' -k 1,1n -k 2,2n -k 3,3n -k 4,4n \
| gcat -n \
| rs -t -c$'\n' -C= 0 4 \
| column -s= -t \
| lgp3 5
1 00.69.84.243 76 23.108.43.3 151 79.127.56.148 226 172.241.192.165
2 00.71.110.228 77 23.108.43.19 152 80.48.119.28 227 172.245.220.154
3 00.105.215.18 78 23.108.43.55 153 80.76.60.2 228 175.196.182.58
4 00.123.2.171 79 23.108.43.94 154 80.244.229.102 229 176.74.9.62
5 00.123.228.2 80 23.108.43.120 155 81.8.52.78 230 176.214.97.55
6 00.201.223.164 81 23.108.43.208 156 83.166.241.233 231 177.128.44.131
7 01.51.106.70 82 23.108.43.244 157 85.25.4.28 232 177.129.53.114
8 01.144.14.232 83 23.108.75.98 158 85.25.91.156 233 178.88.185.2
9 01.148.85.50 84 23.108.75.164 159 85.25.91.161 234 180.180.171.123
10 01.174.10.170 85 23.225.64.59 160 85.25.117.171 235 180.183.15.198
11 02.64.120.219 86 36.37.177.186 161 85.25.150.32 236 180.250.153.129
12 02.68.128.214 87 36.94.161.219 162 85.25.201.22 237 181.36.230.242
13 02.129.196.242 88 37.48.82.87 163 85.195.104.71 238 181.191.141.43
14 02.134.127.15 89 37.144.180.52 164 85.208.211.163 239 182.253.186.140
15 03.28.246.130 90 41.65.236.56 165 85.209.149.130 240 185.24.233.208
16 03.73.194.2 91 41.65.251.86 166 88.119.195.35 241 185.61.152.137
17 03.80.77.1 92 41.79.65.241 167 91.107.15.221 242 185.74.7.51
18 03.81.77.194 93 41.161.92.138 168 91.188.246.246 243 185.93.205.236
19 03.97.200.52 94 41.164.68.42 169 93.184.8.74 244 185.138.114.113
20 3.120.173.144 95 41.164.68.194 170 94.16.15.100 245 186.3.85.131
21 03.134.97.233 96 41.205.24.155 171 94.75.76.3 246 186.5.117.82
22 03.148.72.192 97 43.255.113.232 172 94.228.204.229 247 186.46.168.42
23 03.150.113.147 98 45.5.68.18 173 95.181.150.121 248 186.96.50.39
24 03.159.46.18 99 45.5.68.25 174 95.181.151.105 249 186.154.211.106
25 03.162.181.132 100 45.43.63.230 175 110.74.200.177 250 186.167.48.138
26 03.177.45.7 101 45.67.212.99 176 112.163.123.242 251 186.202.176.153
27 03.177.45.10 102 45.67.230.13 177 113.161.59.136 252 186.233.186.60
28 03.177.45.11 103 45.71.203.110 178 115.87.196.88 253 186.251.71.193
29 03.217.169.100 104 45.87.249.80 179 116.212.155.229 254 187.217.54.84
30 03.232.215.194 105 45.122.233.76 180 117.54.114.101 255 188.94.225.177
31 04.208.138.14 106 45.131.213.170 181 117.54.114.102 256 188.95.89.81
32 04.244.75.205 107 45.158.158.29 182 117.54.114.103 257 188.133.153.143
33 5.39.189.39 108 45.179.193.70 183 119.82.241.21 258 188.138.89.50
34 05.149.219.201 109 45.183.142.126 184 120.72.20.225 259 188.138.90.226
35 5.149.219.201 110 45.184.103.68 185 121.1.41.162 260 188.166.218.243
36 5.189.229.42 111 45.184.155.7 186 123.31.30.100 261 190.128.225.115
37 07.151.182.247 112 45.189.113.63 187 125.25.33.241 262 190.217.7.73
38 07.154.221.245 113 45.189.117.237 188 125.25.206.28 263 190.217.19.243
39 07.244.242.103 114 45.192.141.247 189 133.242.146.103 264 192.3.219.94
40 08.177.248.47 115 45.229.32.190 190 137.74.93.21 265 192.99.38.64
41 08.177.248.213 116 45.250.65.15 191 137.184.57.245 266 192.140.42.83
42 08.177.248.217 117 46.99.146.232 192 139.5.151.182 267 192.155.107.59
43 8.210.83.33 118 46.243.220.70 193 139.59.233.24 268 192.254.104.201
44 8.213.128.19 119 46.246.80.6 194 139.255.58.212 269 194.5.193.183
45 8.213.128.30 120 47.74.114.83 195 140.238.19.26 270 194.114.128.149
46 8.213.128.41 121 47.88.79.154 196 151.106.13.221 271 194.233.67.98
47 8.213.128.106 122 47.91.44.217 197 151.106.18.126 272 194.233.69.41
48 8.213.128.123 123 47.243.75.115 198 152.26.229.67 273 194.233.73.103
49 8.213.128.131 124 47.254.28.2 199 152.32.143.109 274 194.233.73.104
50 8.213.128.149 125 49.156.47.162 200 153.122.106.94 275 194.233.73.105
51 8.213.128.152 126 50.195.227.153 201 153.122.107.129 276 194.233.73.107
52 8.213.128.158 127 50.235.149.74 202 154.85.35.235 277 194.233.73.109
53 8.213.128.171 128 50.250.56.129 203 154.95.36.182 278 194.233.88.38
54 8.213.128.172 129 51.68.199.120 204 154.236.162.59 279 195.80.49.3
55 8.213.128.202 130 51.77.141.29 205 154.236.168.179 280 195.80.49.4
56 8.213.128.214 131 51.81.32.81 206 154.236.177.101 281 195.80.49.5
57 8.213.129.23 132 51.159.3.223 207 154.236.179.226 282 195.80.49.6
58 8.213.129.36 133 51.178.182.23 208 157.100.26.69 283 195.80.49.7
59 8.213.129.51 134 54.80.246.241 209 159.65.69.186 284 195.80.49.253
60 8.213.129.57 135 61.9.48.169 210 159.65.133.175 285 195.80.49.254
61 8.213.129.243 136 61.9.53.157 211 159.203.13.121 286 195.158.30.232
62 8.214.41.50 137 62.75.219.49 212 160.16.242.164 287 197.149.247.82
63 8.218.213.95 138 62.75.229.77 213 161.22.34.142 288 197.243.20.178
64 09.200.156.102 139 62.78.84.159 214 164.132.137.241 289 198.46.200.70
65 13.237.147.45 140 62.138.8.42 215 167.71.207.46 290 198.229.231.13
66 20.47.108.204 141 62.204.35.69 216 167.86.81.208 291 212.112.113.178
67 20.113.24.12 142 63.161.104.189 217 167.249.180.42 292 212.154.234.46
68 23.19.7.136 143 66.29.154.103 218 168.205.100.36 293 212.174.44.87
69 23.19.10.93 144 66.29.154.105 219 169.57.1.85 294 213.32.75.44
70 23.81.127.253 145 69.163.252.140 220 170.81.35.26 295 213.230.69.193
71 23.105.78.193 146 76.118.227.8 221 170.83.60.19 296 213.230.71.230
72 23.105.78.252 147 77.83.86.65 222 170.155.5.235 297 213.230.90.106
73 23.105.86.52 148 77.83.87.217 223 171.233.151.214 298 221.159.192.122
74 23.108.42.228 149 77.104.97.3 224 172.241.156.1 299 222.158.197.138
75 23.108.42.238 150 77.236.243.125 225 172.241.192.104 300 222.252.23.5
cat ip_address.txt | grep '^[0-9]\{1,3\}[.][0-9]\{1,3\}[.][0-9]\{1,3\}[.][0-9]\{1,3\}[,].*$\|^.*[,][0-9]\{1,3\}[.][0-9]\{1,3\}[.][0-9]\{1,3\}[.][0-9]\{1,3\}[,].*$\|^.*[,][0-9]\{1,3\}[.][0-9]\{1,3\}[.][0-9]\{1,3\}[.][0-9]\{1,3\}$'
Let's assume the file is comma-delimited and the IP address can be at the beginning, at the end, or somewhere in the middle.
The first regexp looks for an exact match of an IP address at the beginning of the line.
The second regexp, after the or, looks for an IP address in the middle. We match it in such a way that the number that follows a comma must be exactly 1 to 3 digits, so falsy IPs like 12345.12.34.1 are excluded.
The third regexp looks for the IP address at the end of the line.
I wanted to get only IP addresses that began with "10", from any file in a directory:
grep -o -nr "\b10\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}" /var/www
If you are not given a specific file and you need to extract IP address then we need to do it recursively.
grep command -> searches a text or file for a given pattern and displays the matching string.
grep -roE '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' | grep -oE '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}'
-r We can search the entire directory tree i.e. the current directory and all levels of sub-directories. It denotes recursive searching.
-o Print only the matching string
-E Use extended regular expression
If we had not used the second grep command after the pipe, we would have got the IP address along with the path of the file it is in.
For CentOS 6.3:
ifconfig eth0 | grep 'inet addr' | awk '{print $2}' | awk 'BEGIN {FS=":"} {print $2}'
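On newer releases ifconfig is often absent in favor of the ip command; a rough equivalent with the same extract-the-address idea (eth0 is an assumption, substitute your interface), demonstrated here on a captured sample line so it runs anywhere:

```shell
# On a live system:  ip -4 addr show eth0 | awk '/inet /{split($2, a, "/"); print a[1]}'
# The awk part, demonstrated on a sample line of `ip addr` output:
sample='    inet 192.168.1.10/24 brd 192.168.1.255 scope global eth0'
out=$(printf '%s\n' "$sample" | awk '/inet /{split($2, a, "/"); print a[1]}')
printf '%s\n' "$out"   # 192.168.1.10
```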