Phone Numbers in separate lines in UNIX - regex

In UNIX:
I have a sample file and I want all the phone numbers starting with 987 extracted into another file as a list.
That means if a single row contains two phone numbers, they should end up on separate lines.
Sample File Contents
ajfhvjfdhvjdfb jfbhfb fg 9871177454 9563214578 shgfsehfgvhb vhf 9877745212
sjdjfgsfhvg b 9874789645 sfjkvhbjfbg shgfhbfg 2563145278
9874561231

This should work,
echo "ajfhvjfdhvjdfb jfbhfb fg 9871177454 9563214578 shgfsehfgvhb vhf 9877745212 sjdjfgsfhvg b 9874789645 sfjkvhbjfbg shgfhbfg 2563145278 9874561231" > sample.txt
egrep -o '987([0-9]+)' sample.txt
returns,
9871177454
9877745212
9874789645
9874561231
Or, to be specific about 10-digit phone numbers:
egrep -o '987([0-9]{7})' sample.txt
returns the same results for this input.
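If the data can also contain longer digit runs, a word-boundary variant keeps the match to exactly 10 digits (\b is a GNU grep extension, so this sketch assumes GNU grep/egrep):
egrep -o '\b987[0-9]{7}\b' sample.txt
For the sample above this prints the same four numbers, but a 10-digit match is no longer taken from inside a longer digit run.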

Related

Regular Expression to match against first character and file extension

I'm using Bash to try to write a command that gets every file where the first character is not 'a' and the file does not end with '.html', but I cannot seem to get both conditions to work properly.
So far I can get my regex to match all the files that start with 'a' and end with '.html' and remove them, but the issue I cannot solve is when the file starts with 'a' and ends with a different file extension: my regex seems to ignore that second requirement and hides the file regardless.
cat inputfile.txt | sed -n '/^[^a].*[^html$]/p'
Input File Contents:
123
anapple.html
456
theapple.html
789
nottrue.html
apple.csv
12
Output:
123
456
theapple.html
789
nottrue.html
12
Instead of trying to write a pattern that matches the rows to keep, write a pattern that matches the rows to remove, and use grep -v to print all the lines that don't match it.
grep -v '^a.*\.html$' inputfile.txt

Dos script to extract X number of lines

I am trying to make a script to:
- Ask the user for a customer number (max 8 digits)
- Search a very large text file (Source.txt) for that number
- Extract the 19 lines of text above the customer number (everything as is, including empty lines)
- Extract the line containing the customer number itself (line 20 in this case)
- Extract the next 30 lines below the customer number
- Save all extracted output in Output.txt
Basically like copying a block of text and pasting it into a new text file.
In the source text file, the customer number does not sit at a fixed line number.
You can use standard Linux command-line utilities (they work on Windows too) like cat, grep and output redirection (in a bash script, for example) as follows.
# read and validate customer number (stdin, parameter, ...)
cat Source.txt | grep -B 19 -A 30 '12345678' > Output.txt
where 12345678 is the customer number, -B specifies the number of lines printed before the match and -A the number of lines printed after it.
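A minimal sketch of the "read and validate customer number" step, assuming bash; the script name and the prompt text are placeholders, while Source.txt and Output.txt are the files named above:
$ cat extract_block.sh
#!/usr/bin/env bash
# Prompt for the customer number and require 1 to 8 digits.
read -rp "Customer number: " cust
if [[ ! "$cust" =~ ^[0-9]{1,8}$ ]]; then
    echo "Invalid customer number (expected up to 8 digits)" >&2
    exit 1
fi
# 19 lines before the match, the matching line itself, and 30 lines after.
grep -B 19 -A 30 -- "$cust" Source.txt > Output.txt
Note that if the customer number occurs more than once, grep writes one block per occurrence.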

Extracting just a phone number from a file

I'm sure the answer to this is already online, but I don't know what I am looking for. I just started a course in Unix/Linux and my dad asked me to make something for his work. He has a text file, and on every fourth line there is a 10-digit number somewhere. How do I make a list of just the numbers? I assume the file looks something like this:
Random junk
Random junk fake number 1234567809
Random junk
My phone number is 1234567890 and it is here random numbers 32131;1231
Random junk
Random junk another fake number 2345432345
Random junk
Just kidding my phone number is here 1234567890 the date is mon:1231:31231
I assume it's something like grep [1-9].\{9\} file, but how do I get just lines 4, 8, 12, etc.? I tested it and I get the phone numbers from every line. Also, how do I get just the number and not the whole line?
Any help will be greatly appreciated, even if it's just pointing me in the right direction so I can research it myself. Thanks.
You can do it in two steps:
$ awk '!(NR%4)' file | grep -Eo '[0-9]{10}'
1234567890
1234567890
awk '!(NR%4)' file prints those lines whose number is multiple of 4. It is the same as saying awk '(NR%4==0) {print}' file.
grep -Eo '[0-9]{10}' prints the runs of exactly 10 digits. Note that -o means "print only the matched parts" and -E enables extended regular expressions.
Or also:
$ awk '!(NR%4)' file | grep -Eo '[1-9][0-9]{9}'   # require a non-zero first digit
Using GNU sed:
sed -nr '0~4{s/.*\b([0-9]{10})\b.*/\1/p}' inputfile
Saying 0~4 produces every 4th line starting from the 0th line, i.e. produces every 4th line in the file. The substitution part is rather obvious.
For your sample input, it'd produce:
1234567890
1234567890
Since you are looking for one number per line, an awk solution would involve
awk '!(NR%4) && match($0, /[[:digit:]]{10}/){print substr($0, RSTART, RLENGTH)}' file
Using perl:
$ perl -nle 'print /([0-9]{10})/ if !($.%4)' file
1234567890
1234567890
To solve this, you should first decide how long a phone number is. You should also consider which area codes and which leading digits your code should recognize, so that you filter only the most plausible numbers. But if I write "My number is 028 2233 5674... Just kidding, it's 028 2233 9873.", then the code will consider both numbers correct. So, completely solving this when there are fake numbers in the text is nearly impossible, but an intelligent filter will keep the candidates that are most likely to be real.
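As a small example of that kind of filtering, GNU grep's \b word boundaries can be combined with the fixed length and the every-4th-line selection, so that digit runs embedded in longer tokens or longer digit strings are not picked up (the 10-digit length and the non-zero first digit are assumptions carried over from the question and the earlier answers):
awk '!(NR%4)' file | grep -Eo '\b[1-9][0-9]{9}\b'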

How to get complementary lines from two text files?

File file1.txt has
123 foo
234 bar
...
File file2.txt has
123 foo
333 foobar
234 bar
...
I want to get all lines in file1.txt and not in file2.txt. The two files are hundreds of MB large and contain non-ASCII characters. What's a fast way to do this?
For good performance with large files, don't read much of the file into memory; work with what's on disk as much as possible.
String-matching can be done efficiently with hashing.
One strategy (a short awk sketch follows after the steps):
- Scan file2.txt line by line. For each line:
  - Hash the string for the line. The hashing algorithm you use does matter; djb2 is one example but there are many.
  - Put the key into a hash-set structure. Do not keep the string data.
- Scan file1.txt line by line. For each line:
  - Hash the string for the line.
  - If the hash key is not found in the set built from file2.txt, write the string data for this line to the output where you're tracking the different lines (e.g. standard output or another file). The hash didn't match, so this line appears in file1.txt but not in file2.txt.
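A minimal sketch of that set-based idea using awk (which other answers here already use); it keys an associative array on the whole line rather than on an explicit hash value, which is simpler but keeps the line text in memory, and the output file name is just a placeholder:
# Build a set from file2.txt, then print the lines of file1.txt not in it.
awk 'NR==FNR { seen[$0] = 1; next } !($0 in seen)' file2.txt file1.txt > only_in_file1.txt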
Lines, specifically?
fgrep -vxf file2.txt file1.txt
Here -f file2.txt takes the patterns from file2.txt, -x matches whole lines only, -v inverts the match so that only the lines of file1.txt not present in file2.txt are printed, and fgrep treats the patterns as fixed strings rather than regular expressions. "Hundreds of MB" is not so much.
I would solve this task this way (in Perl):
$ cat complementary.pl
my %f;
# Remember every line of file2 (the second argument) as a hash key.
open(F, "$ARGV[1]") or die "Can't open file2: $ARGV[1]\n";
$f{$_} = 1 while (<F>);
close(F);
# Print the lines of file1 (the first argument) that were not seen in file2.
open(F, "$ARGV[0]") or die "Can't open file1: $ARGV[0]\n";
while (<F>) {
    print if not defined $f{$_};
}
close(F);
Example of usage:
$ cat file1.txt
100 a
200 b
300 c
$ cat file2.txt
200 b
100 a
400 d
$ perl complementary.pl file1.txt file2.txt
300 c

SED: Inserting an existing pattern, to several other places on the same line

Again a SED question from me :)
So, same as last time, I'm wrestling with phone numbers. This time the problem is a bit different.
I have this kind of organization currently in my text file:
Areacode: List of phone numbers:
4444 NUM:111111 NUM:2222222 NUM:33333333
5555 NUM:1111111 NUM:2222 NUM:3333333 NUM:44444444 NUM:5555555
Now, every area code can have an unknown number of phone numbers, and the phone numbers are not fixed in length.
What I would like to know, is how could I combine areacode and phone number, to look something like this:
4444-111111, 4444-2222222, 4444-33333333
My first idea was again to add a line break before each phone number, match these sections with a regex, and then just prepend the first remembered group to the second, the first to the third, and so on:
\1-\2, \1-\3, etc.
But of course, since sed can only remember 9 back-references and there can be more than 10 numbers on one line, this doesn't work. The non-fixed length of the phone number list also made this a no-go.
I'm again looking primarily for the sed option, as I've been trying to get proficient with it, but more efficient solutions with other tools are of course definitely welcome!
$ cat input.txt | sed '1d;s/NUM:/ /g' | awk '{for(i=2;i<=NF;i++)printf("%s-%s%s", $1, $i, i==NF?"\n":",")}'
4444-111111,4444-2222222,4444-33333333
5555-1111111,5555-2222,5555-3333333,5555-44444444,5555-5555555
This might work for you:
sed '1d;:a;s/^\(\S*\)\(.*\)NUM:/\1\2,\1-/;ta;s/[^,]*,//;s/ //g' file
4444-111111,4444-2222222,4444-33333333
5555-1111111,5555-2222,5555-3333333,5555-44444444,5555-5555555
or:
awk 'NR>1{gsub(/NUM:/,","$1"-");sub(/[^,]*,/,"");gsub(/ /,"");print}' file
4444-111111,4444-2222222,4444-33333333
5555-1111111,5555-2222,5555-3333333,5555-44444444,5555-5555555
TXR:
#(collect)
#area #(coll :mintimes 1)NUM:#{num /[0-9]+/}#(end)
#(output)
#(rep)#area-#num, #(last)#area-#num#(end)
#(end)
#(end)
Run:
$ txr phone.txr phone.txt
4444-111111, 4444-2222222, 4444-33333333
5555-1111111, 5555-2222, 5555-3333333, 5555-44444444, 5555-5555555
$ cat phone.txt
Areacode: List of phone numbers:
4444 NUM:111111 NUM:2222222 NUM:33333333
5555 NUM:1111111 NUM:2222 NUM:3333333 NUM:44444444 NUM:5555555