Extracting just a phone number from a file - regex

I'm sure the answer to this is already online, but I don't know what I am looking for. I just started a course in Unix/Linux, and my dad asked me to make him something for his work. He has a text file, and on every fourth line there is a 10-digit number somewhere. How do I make a list of just the numbers? I assume the file looks something like this:
Random junk
Random junk fake number 1234567809
Random junk
My phone number is 1234567890 and it is here random numbers 32131;1231
Random junk
Random junk another fake number 2345432345
Random junk
Just kidding my phone number is here 1234567890 the date is mon:1231:31231
I assume it's something like grep [1-9].\{9\} file, but how do I get just lines 4, 8, 12, etc.? I tested it and I get the phone numbers on every line. Also, how do I get just the number, not the whole line?
Any help will be greatly appreciated, even if it's just pointing me in the right direction so I can research it myself. Thanks.

You can do it in two steps:
$ awk '!(NR%4)' file | grep -Eo '[0-9]{10}'
1234567890
1234567890
awk '!(NR%4)' file prints those lines whose line number is a multiple of 4. It is the same as saying awk '(NR%4==0) {print}' file.
grep -Eo '[0-9]{10}' prints the runs of 10 consecutive digits. Note that -o is for "just print the matches" and -E enables extended regular expressions.
Or also
$ awk '!(NR%4)' file | grep -Eo '[1-9][0-9]{9}' # requires the first digit to be non-zero
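Note that '[0-9]{10}' will also match the first 10 digits of a longer digit run. If that could happen in your data, GNU grep's \b word boundaries restrict the match to exactly 10 digits (a sketch):
$ awk '!(NR%4)' file | grep -Eo '\b[0-9]{10}\b'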

Using GNU sed:
sed -nr '0~4{s/.*\b([0-9]{10})\b.*/\1/p}' inputfile
Saying 0~4 selects every 4th line starting from line 0, i.e. lines 4, 8, 12, and so on. The substitution part is rather obvious.
For your sample input, it'd produce:
1234567890
1234567890
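The first~step address is a GNU sed extension. With a POSIX sed you can land on every 4th line by consuming three lines with the n command per cycle (a sketch; it assumes the number has a non-digit on each side):
sed -n 'n;n;n;s/.*[^0-9]\([0-9]\{10\}\)[^0-9].*/\1/p' inputfile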

Since you are looking for one number per line, an awk solution would involve
awk '!(NR%4) && match($0, /[[:digit:]]{10}/){print substr($0, RSTART, RLENGTH)}' file
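With GNU awk, match() also accepts a third array argument holding the matched text, which shortens the extraction a little (a sketch):
gawk '!(NR%4) && match($0, /[[:digit:]]{10}/, m) {print m[0]}' file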

Using perl:
$ perl -nle 'print /([0-9]{10})/ if !($.%4)' file
1234567890
1234567890
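If a selected line could hold more than one 10-digit number, the /g modifier prints them all (a sketch):
$ perl -nle 'print for /([0-9]{10})/g if !($.%4)' file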

To solve this properly, you should first know what the length of a phone number should be. You should also have your code recognize area codes, and the digits a phone number can start with; that way you will keep only the most plausible numbers. But if I write "My number is 028 2233 5674... Just kidding, it's 028 2233 9873.", then the code will consider both numbers correct. So solving this completely, when there are fake numbers in the text, is nearly impossible. An intelligent filter can only keep the numbers that are most likely to be real.
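For instance, to keep only candidates written in the spaced layout of the example above, something like this sketch would do (the leading-0, 3+4+4-digit format is purely an assumption for illustration):
grep -Eo '\b0[0-9]{2} [0-9]{4} [0-9]{4}\b' file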

Related

Regular expression suitable for use with egrep for a repeated occurrence of a pair of characters

I need to print lines that contain nothing but a single occurrence of laughter, where laughter is defined as a string of the form Hahahahahahahahahahahahahaha!, with arbitrarily many ha's.
What I have is
egrep "^Ha.*[ha*][!$]" myfile.txt
and it prints
Hahahahahahahahahahahahahaha!
Hahahahahahahahahahahahahaha! nklddln
and myfile.txt contains
kaka
linux.student.cs.uwaterloo.ca
So How is going on Man.
I am now just trying to test the stuff with Ubuntu
Digit in the line 1
Regular<title> expression</title> stuff must be working
alright just testing linux.studEnt.cs.Uwaterloo.ca
liNUx.student.cs.uwaterlOO.Ca so the things
We need to</title> have more thn ten<title> lines that have more tha twenty characters
So, the assignment needs to be done very quickly
This cs247 line contains the course code cs246
The course code cs246 is in the cs247 line
All these lines in this text file are for testing only
We need to<title></title> thouroughly check the cs246 assignment
heheheheehe man this work will take some time
you have to be quick as we need to get it done before deadline
Okey kaka g whats the situation
Man How are you?
course code is CS246
course CS246
Hahahahahahahahahahahahahaha!
ghjf akf Hahahahahahahahahahahahahaha!
dhgD Hahahahahahahahahahahahahaha! jwef
Hahahahahahahahahahahahahaha! nklddln
hufwf Hahahahahahahahahahahahahaha!
course cs246
12345678901234567890
1234567890123456789
I do not want the second line to be printed, as it contains an extra word. Use of egrep is a must.
You need to anchor the pattern with $ after the ! and spell out the repetition as (ha)+ so that nothing else can appear on the line:
egrep '^Ha(ha)+!$' file
Hahahahahahahahahahahahahaha!
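Equivalently, grep's -x option anchors the pattern to the whole line, so the ^ and $ can be dropped (a sketch):
egrep -x 'Ha(ha)+!' myfile.txt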

Bash - count a pattern and print the line containing the pattern

Hello everyone! While I was reading the discussion "Count number of occurrences of a pattern in a file (even on same line)", I wondered if I could add the line containing the pattern next to the count values.
Somehow I wasn't able to add a comment on that discussion, so I'm posting a new question. Can somebody enlighten me?
There must have been some misunderstanding, so let me give an example.
Let's say I have a DNA sequence like the one below and want to find out how many 'CG's are present in each line.
ACAAAGAACTCAAGAAGTTGGACCCCAGAGAACCAAATAACCCTATTAAA
AATTCGGAACAGAGATAAACAAAGAATTCTCAACTGAGGAAACTTGAATG
GGATTTTTTTTTAAGATTCACTTATTTTTATTTTCTGCATGAGTGTTTGC
CTCGATGTATGTACATATACGACATGTGTACGTGGTGCGCAAGTAAGCAG
Additionally, I want to print each line (not the pattern) along with the pattern counts.
0 ACAAAGAACTCAAGAAGTTGGACCCCAGAGAACCAAATAACCCTATTAAA
1 AATTCGGAACAGAGATAAACAAAGAATTCTCAACTGAGGAAACTTGAATG
0 GGATTTTTTTTTAAGATTCACTTATTTTTATTTTCTGCATGAGTGTTTGC
4 CTCGATGTATGTACATATACGACATGTGTACGTGGTGCGCAAGTAAGCAG
I hope the example above helps to understand the question better.
Thank you!
You can do:
printf 'pattern' | tee >(sed 's/$/ : /') | grep -cf - input.txt
This uses tee and process substitution.
Example:
% cat file.txt
foobar
spamegg
foo
% printf 'foo' | tee >(sed 's/$/ : /') | grep -cf - file.txt
foo : 2
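Note that grep -c counts matching lines, counting each line once, not total occurrences. For the per-line counts shown in the question, awk's gsub is handy, since it returns the number of replacements it made (a minimal sketch; replacing CG with itself leaves the line unchanged):
% awk '{print gsub(/CG/, "CG"), $0}' file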
cat fileName | grep pattern | uniq -c
I just found a really simple and elegant solution using EXCEL.
The formula goes like below...
=(LEN(B2)-LEN(SUBSTITUTE(B2,"CG","")))/2
What this formula does is count the total length of the string in a cell and the length after removing the pattern ("CG" in this case), then subtract the two. Since each "CG" is replaced by nothing, 2 characters disappear per occurrence, so you get the number of occurrences by dividing the difference by the length of your pattern, which is 2 in this case.
For example, the following sequence contains 50 characters and 13 CG's.
CAGTGCACACAACACATGTACGCGCGCGCGCGCGCGCGCGCGCGCGTGTG 50
After substituting "CG" with blanks, you get 24 characters.
CAGTGCACACAACACATGTATGTG 24
To count the "CG" occurrences:
(50-24)/2 = 13
If you are looking for "CAG", enter "CAG" instead of "CG" and divide by 3.
How simple is that!
You can see the original post at the following link:
http://fiveminutelessons.com/learn-microsoft-excel/count-occurrences-single-character-cell-excel
English is not my primary language, so please excuse any errors in my writing.
People are geniuses!
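As an aside, the same length-difference trick works in plain bash with parameter expansion (a sketch using the sequence above):
s='CAGTGCACACAACACATGTACGCGCGCGCGCGCGCGCGCGCGCGCGTGTG'
t=${s//CG/}                       # delete every occurrence of CG
echo $(( (${#s} - ${#t}) / 2 ))   # (50 - 24) / 2 = 13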

How to use awk and grep on 300GB .txt file?

I have a huge .txt file, 300GB to be more precise, and I would like to put all the distinct strings from the first column that match my pattern into a different .txt file.
awk '{print $1}' file_name | grep -o '/ns/.*' | awk '!seen[$0]++' > test1.txt
This is what I've tried, and as far as I can see it works fine but the problem is that after some time I get the following error:
awk: program limit exceeded: maximum number of fields size=32767
FILENAME="file_name" FNR=117897124 NR=117897124
Any suggestions?
The error message tells you:
line 117897124 has too many fields (>32767).
You'd better check it out:
sed -n '117897124{p;q}' file_name
Use cut to extract 1st column:
cut -d ' ' -f 1 < file_name | ...
Note: You may change ' ' to whatever the field separator is. The default is $'\t'.
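With that in place, the original pipeline could become, for example:
cut -d ' ' -f 1 < file_name | grep -o '/ns/.*' | awk '!seen[$0]++' > test1.txt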
The 'number of fields' is the number of 'columns' in the input file, so if one of the lines is really long, then that could potentially cause this error.
I suspect that the awk and grep steps could be combined into one:
sed -n 's/\(^pattern...\).*/\1/p' some_file | awk '!seen[$0]++' > test1.txt
That might evade the awk problem entirely: the sed command replaces the entire line with just the leading text that matches the pattern, and prints the line only if the substitution succeeded.
Seems to me that your awk implementation hit one of its internal limits at record 117,897,124. The limits can vary according to your implementation and your OS.
Maybe a sane way to approach this problem is to use split to break the large file into smaller ones, with no more than 100,000,000 records each.
In case you don't want to split the file, you could look for the limits file corresponding to your awk implementation. Maybe you can set the number-of-records value to unlimited, although I believe that is not a good idea, as you might end up using a lot of resources...
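A sketch of the split approach, reusing the original pipeline on each chunk (the chunk_ prefix is arbitrary):
split -l 100000000 file_name chunk_    # produces chunk_aa, chunk_ab, ...
for f in chunk_*; do
    awk '{print $1}' "$f" | grep -o '/ns/.*'
done | awk '!seen[$0]++' > test1.txt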
If you have enough free space on disk (Vim creates a temporary .swp file), I suggest using Vim. Vim's regex dialect differs slightly from standard regex, but you can convert standard regex to Vim regex with this tool: http://thewebminer.com/regex-to-vim
The error message says your input file contains too many fields for your awk implementation. Just change the field separator to be the same as the record separator and you'll only have 1 field per line and so avoid that problem, then merge the rest of the commands into one:
awk 'BEGIN{FS=RS} {sub(/[[:space:]].*/,"")} /\/ns\// && !seen[$0]++' file_name
If that's a problem then try:
awk 'BEGIN{FS=RS} {sub(/[[:space:]].*/,"")} /\/ns\//' file_name | sort -u
There may be an even simpler solution but since you haven't posted any sample input and expected output, we're just guessing.

Calculating in NP++ with regular expressions

My document has x|y| at the beginning of each line, where x and y are integers between 0 and 300. E.g.:
1|1|text1
1|2|text2
1|3|text3
Now I want to make the following simple change: the second number on every line should be decreased by 1. So the above lines should be changed to
1|0|text1
1|1|text2
1|2|text3
Is that possible?
Okay, so here's something funny you can do, assuming the text is formatted like you indicated:
Do a first search and replace, replacing (\d+\|(\d+)\|.*) with \1\n####\2. This will grab each line's second number and output
1|1|text1
####1
1|2|text2
####2
1|3|text3
####3
Do a second search and replace to update the index of the row following the #### marker, replacing
####(\d+)\n(\d+\|)\d+(\|.*)
with
\2\1\3
You need to update the first and last lines manually, and you're good to go!
Since the numbers in the second column follow one another sequentially, each row can simply take its value from the row above.
A perl way to do the job:
perl -i.back -ape 's/\|(\d+)\|/"|".($1-1)."|"/e' in.txt
This replaces every second number with that number minus one, directly in the file (in.txt).
The original file is saved beforehand as in.txt.back.
Input file before:
1|1|text1
1|2|text2
1|3|text3
2|1|text4
3|1|text5
3|2|text6
after:
1|0|text1
1|1|text2
1|2|text3
2|0|text4
3|0|text5
3|1|text6
Regex alone can't do math. If the file is huge, I suggest you look into a simple Python script or a small C++ program to do the parsing job for you. Good luck.
missing awk ??? awwwww !!
awk -F'|' '{num = $2-1; print $1 "|" num "|" $3}' text.txt
where text.txt is the file containing your contents.
You can write the output into a file using the > operator, like this:
awk -F'|' '{num = $2-1; print $1 "|" num "|" $3}' text.txt > final.txt

SED: Inserting an existing pattern, to several other places on the same line

Again a SED question from me :)
So, same as last time, I'm wrestling with phone numbers. This time the problem is a bit different.
I have this kind of organization currently in my text file:
Areacode: List of phone numbers:
4444 NUM:111111 NUM:2222222 NUM:33333333
5555 NUM:1111111 NUM:2222 NUM:3333333 NUM:44444444 NUM:5555555
Now, every area code can have an unknown number of phone numbers, and the phone numbers are not fixed in length either.
What I would like to know is how I could combine the area code and each phone number, to get something like this:
4444-111111, 4444-2222222, 4444-33333333
My first idea was again to add a line break before each phone number, match these sections with regex, and then just append the first remembered item to the second, the first to the third, and so on:
\1-\2, \1-\3, etc.
But of course, since sed can only remember 9 back-references and there can be more than 10 numbers on one line, this doesn't work. The non-fixed length of the number list also makes it a no-go.
I'm again looking primarily for a SED solution, as I've been trying to get proficient with it, but more efficient solutions with other tools are of course definitely welcome!
$ cat input.txt | sed '1d;s/NUM:/ /g' | awk '{for(i=2;i<=NF;i++)printf("%s-%s%s", $1, $i, i==NF?"\n":",")}'
4444-111111,4444-2222222,4444-33333333
5555-1111111,5555-2222,5555-3333333,5555-44444444,5555-5555555
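The sed step can also be folded into awk itself, and using ", " as the separator reproduces the spacing of your desired output (a sketch):
awk 'NR>1{for(i=2;i<=NF;i++){sub(/^NUM:/,"",$i); printf "%s-%s%s", $1, $i, (i==NF?"\n":", ")}}' input.txt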
This might work for you:
sed '1d;:a;s/^\(\S*\)\(.*\)NUM:/\1\2,\1-/;ta;s/[^,]*,//;s/ //g' file
4444-111111,4444-2222222,4444-33333333
5555-1111111,5555-2222,5555-3333333,5555-44444444,5555-5555555
or:
awk 'NR>1{gsub(/NUM:/,","$1"-");sub(/[^,]*,/,"");gsub(/ /,"");print}' file
4444-111111,4444-2222222,4444-33333333
5555-1111111,5555-2222,5555-3333333,5555-44444444,5555-5555555
TXR:
#(collect)
#area #(coll :mintimes 1)NUM:#{num /[0-9]+/}#(end)
#(output)
#(rep)#area-#num, #(last)#area-#num#(end)
#(end)
#(end)
Run:
$ txr phone.txr phone.txt
4444-111111, 4444-2222222, 4444-33333333
5555-1111111, 5555-2222, 5555-3333333, 5555-44444444, 5555-5555555
$ cat phone.txt
Areacode: List of phone numbers:
4444 NUM:111111 NUM:2222222 NUM:33333333
5555 NUM:1111111 NUM:2222 NUM:3333333 NUM:44444444 NUM:5555555