Clean up a file of phone numbers that are not properly formatted - regex

I have a file with nearly 10,000 phone numbers in it, and many were not formatted properly, i.e. not as 123-456-7890. Although I've cleaned up most of them, I still have one pattern I'm not sure how to handle. I used sed to clean up most of it, and I don't mind using either sed or awk (although I use sed more often than awk) to get one of the last groups (2,306 lines) formatted properly.
Example: 123 4567890 (3 tab 7) needs to be 123-456-7890 (3 dash 3 dash 4).
I know I can find the pattern and replace the tab easily enough using:
sed "^[0-9][0-9][0-9]\t[0-9][0-9][0-9][0-9][0-9][0-9][0-9]/s/\t/-/" infile.txt > outfile.txt
However, if I could augment the instruction to also split up the 7 digits that are grouped together at the same time, it would make it easier to clean up what's left after this round. I've done a fair amount of searching, but I couldn't get anything I found from the suggested list (when I typed in the subject) to work before following through with posting the question.

Use extended regular expressions and capturing groups:
sed -E 's/^([0-9]{3})\t([0-9]{3})([0-9]{4})$/\1-\2-\3/' infile.txt > outfile.txt
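For example, on a line that contains only the tab-separated number, this should give:
printf '123\t4567890\n' | sed -E 's/^([0-9]{3})\t([0-9]{3})([0-9]{4})$/\1-\2-\3/'
123-456-7890
Drop the ^ and $ anchors if the number is embedded in other text on the line.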

Basically, something like this will work for a phone number on its own:
sed 's/\([0-9]\)[^0-9]*/\1/g;s/\(...\)\(...\)\(....\)/\1-\2-\3/' YourFile
Now, your phone numbers are certainly associated with other info, so the extraction and filtering will need to be more specific.
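For instance, run against the sample value by itself, it should print:
printf '123\t4567890\n' | sed 's/\([0-9]\)[^0-9]*/\1/g;s/\(...\)\(...\)\(....\)/\1-\2-\3/'
123-456-7890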

An awk version:
echo "123 4567890" | awk '{gsub(/[^0-9]/,"");print substr($0,1,3)"-"substr($0,4,3)"-"substr($0,7,3)}'
123-456-789
It just removes all non numbers, then print it out in groups of three.

Related

Extract all links between ' and ' in a text file, using CLI (Linux)

I have a very big text (.sql) file, and I want to get all the links out of it into a nice clean text file, where the links appear one per line.
I have found the following command
grep -Eo "https?://\S+?\.html" filename.txt > newFile.txt
from anubhava, which nearly works for me; link:
Extract all URLs that start with http or https and end with html from text file
Unfortunately, it does not quite work:
Problem 1: In the above link, the webpages end with .html. Not so in my case. They do not have a common ending, so I just have to finish before the second ' symbol.
Problem 2: I do not want it to copy the ' symbol.
To give an example (because I think I am explaining this rather badly):
Say, my file says things like this:
Not him old music think his found enjoy merry. Listening acuteness dependent at or an. 'https://I_want_this' Apartments thoroughly unsatiable terminated sex how themselves. She are ten hours wrong walls stand early. 'https://I_want_this_too'. Domestic perceive on an ladyship extended received do. Why jennings our whatever his learning gay perceive. Is against no he without subject. Bed connection unreserved preference partiality not unaffected. Years merit trees so think in hoped we as.
I would want
https://I_want_this
https://I_want_this_too
as the outputfile.
Sorry for the easy question, but I am new to this whole thing and grep/sed etc. are not so easy for me to understand, esp. when I want it to search for special characters, such as /,'," etc.
You can use a GNU grep command like
grep -Po "'\Khttps?://[^\s']+" file
Details:
-P enables the PCRE regex engine
-o outputs only the matches, not the whole matching lines
'\Khttps?://[^\s']+ - matches a ', then omits it from the match with \K, then matches http, then an optional s, then ://, and then one or more chars other than whitespace and ' chars.
Here is a demo:
#!/bin/bash
s="Not him old music think his found enjoy merry. Listening acuteness dependent at or an. 'https://I_want_this' Apartments thoroughly unsatiable terminated sex how themselves. She are ten hours wrong walls stand early. 'https://I_want_this_too'. Domestic perceive on an ladyship extended received do. Why jennings our whatever his learning gay perceive. Is against no he without subject. Bed connection unreserved preference partiality not unaffected. Years merit trees so think in hoped we as."
grep -Po "'\Khttps?://[^\s']+" <<< "$s"
Output:
https://I_want_this
https://I_want_this_too
With your shown samples, please try the following awk code. It was written and tested in GNU awk, but should work in any awk.
awk '
{
  while (match($0, /\047https?:\/\/[^\047]*/)) {
    print substr($0, RSTART+1, RLENGTH-1)
    $0 = substr($0, RSTART+RLENGTH)
  }
}
' Input_file
Explanation: Simply put, the main program runs a while loop around awk's match function. The match function uses the regex \047https?:\/\/[^\047]* (which matches 'http or 'https followed by :// and everything up to the next occurrence of '), and each iteration prints the matched substring without its leading quote, then strips the matched portion from the line.
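Assuming Input_file contains the sample paragraph from the question, the output should be the same two URLs:
https://I_want_this
https://I_want_this_too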

Deleting the unmatched portion using sed

I'm having a text file containing data in the following format:
2020-01-01 00:00:00 #gibberish - key1:{value1}, unwanted key2:{value2}, unwanted key3:{value3}
I wanted to collect the timestamp in the beginning and key-value pairs alone. Like the following
2020-01-01 00:00:00,key1:{value1},key2:{value2},key3:{value3}
I'm able to write a regex script that can select the required values (works in visual studio code)
^([0-9 :-]+)|([0-9A-z,_-]+):\{(.*?)\}
(first pattern selects the timestamp and second part selects the key-value pattern)
Now, how can I select the un-matched part and delete it using sed ?
Note: I tried using egrep to match the required pattern and write it to a new file, but every matched string is written on a new line instead of staying on the same line. That is not useful to me.
egrep -o '^([0-9 :-]+)|([0-9A-z,_-]+):\{(.*?)\}' source.txt > target.txt
Going from last to first, I can comment that:
egrep: yes, that is the designed behavior - egrep is probably not what you want to use.
sed: it is important to note that sed uses POSIX regular expressions, which are simpler and much more limited than what people expect from regular expressions these days. Most of the new-style (enhanced, Perl-compatible, etc.) regular expression work in the last few decades was done in Perl, which is readily available on UNIX systems and is probably what you want to use (but also note that on macOS, like all Apple-distributed UNIX programs, the perl binary is pretty outdated; it will probably still do what you want, but be warned).
Your regular expression uses the range [A-z], which is weird and doesn't work in my egrep or sed - I understand what you want to do, but it shouldn't work on systems that actually use character sets (I'm not sure what Visual Studio Code is doing with this range, but it seems bonkers to me). You probably meant to use [A-Za-z].
I would have written this thing, using Perl, like so:
perl -nle '@res = ();
  while (m/^([0-9 :-]+\d)|([0-9A-Za-z,_-]+:\{[^}]+\})/g) {
    push @res, "$1$2";
  }
  print join ",", @res' < source.txt > target.txt
With your shown samples, could you please try the following. It is written and tested in GNU awk, in case you are OK with that.
awk '
match($0, /[0-9]{4}-[0-9]{2}-[0-9]{2}[[:space:]]+([0-9]{2}:){2}[0-9]{2}/) {
  val = ""
  printf("%s ", substr($0, RSTART, RLENGTH))
  while (match($0, /key[0-9]+:{value[0-9]+}(,|$)/)) {
    val = (val ? val OFS : "") substr($0, RSTART, RLENGTH)
    $0 = substr($0, RSTART+RLENGTH)
  }
  print val
}
' Input_file
This might work for you (GNU sed):
sed -E 's/\S+/\n&/3g;s#.*#echo "&"|sed "1b;/:{.*}/!d;s/, *$//"#e;s/ *\n/,/g' file
Split each line into lines of tokens (keeping the date and time as the first of these lines).
Remove any line (apart from the first) that does not contain the pattern :{...}.
Flatten the lines by replacing the introduced newlines by , separator.
sed -rn 's/([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}[[:space:]]([[:digit:]]{2}:){2}[[:digit:]]{2})(.*)(key1.*,)(.*)(key2.*,)(.*)(key3.*$)/\1,\4\6\8/p' <<< "2020-01-01 00:00:00 #gibberish - key1:{value1}, unwanted key2:{value2}, unwanted key3:{value3}"
Enable extended regular expressions with sed -r (or -E), then split the string into 8 sections using parentheses. Substitute the line with the 1st, 4th, 6th and 8th sections and print.
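With the sample line supplied in the here-string, this should print:
2020-01-01 00:00:00,key1:{value1},key2:{value2},key3:{value3}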

Get list of strings between certain strings in bash

Given a text file (.tex) which may contain strings of the form "\cite{alice}", "\cite{bob}", and so on, I would like to write a bash script that stores the content within brackets of each such string ("alice" and "bob") in a new text file (say, .txt).
In the output file I would like to have one line for each such content, and I would also like to avoid repetitions.
Attempts:
I thought about combining grep and cut.
From other questions and answers that I have seen on Stack Exchange, I think that (modulo reading up on cut a bit more) I could manage to get at least one such content per line. However, I do not know how to get all occurrences on a single line if there are several such strings in it, and I have not seen any question or answer giving hints in this direction.
I have tried using sed as well. Yesterday I read this guide to see if I was missing some basic sed command, but I did not see any straightforward way to do what I want (the guide did mention that sed is Turing complete, so I am sure there is a way to do this only with sed, but I do not see how).
What about:
grep -oP '(?<=\\cite{)[^}]+(?=})' sample.tex | sort -u > cites.txt
-P with GNU grep interprets the regexp as a Perl-compatible one (for lookbehind and lookahead groups)
-o "prints only the matched (non-empty) parts of a matching line, with each such part on a separate output line" (see manual)
The regexp matches curly-brace-free text preceded by \cite{ (positive lookbehind group (?<=\\cite{)) and followed by a right curly brace (positive lookahead group (?=})).
sort -u sorts and removes duplicates
For more details about lookahead and lookbehind groups, see Regular-Expressions.info dedicated page.
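For instance, a quick sanity check on a made-up line (not from your real file):
printf '%s\n' 'see \cite{alice} and \cite{bob}, also \cite{alice}' | grep -oP '(?<=\\cite{)[^}]+(?=})' | sort -u
alice
bob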
You can use grep -o and postprocess its output:
grep -o '\\cite{[^{}]*}' file.tex |
sed 's/\\cite{\([^{}]*\)}/\1/'
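A quick check of this pipeline on a made-up line (with sort -u added to drop the repeated alice):
printf '%s\n' 'see \cite{alice} and \cite{bob}, also \cite{alice}' | grep -o '\\cite{[^{}]*}' | sed 's/\\cite{\([^{}]*\)}/\1/' | sort -u
alice
bob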
If there can only ever be a single \cite on an input line, just a sed script suffices.
sed -n 's/.*\\cite{\([^{}]*\)}.*/\1/p' file.tex
(It's by no means impossible to refactor this into a script which extracts multiple occurrences per line; but good luck understanding your code six weeks from now.)
As usual, add sort -u to remove any repetitions.
Here's a brief Awk attempt:
awk -v RS='\' '/^cite\{/ {
  split($0, g, /[{}]/)
  cite[g[2]]++ }
END { for (cit in cite) print cit }' file.tex
This conveniently does not print any duplicates, and trivially handles multiple citations per line.
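Note that for (cit in cite) visits the stored citations in no particular order; pipe the output through sort if you want them sorted.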

regex - multiply $1 by 10

I want to replace the results of this:
(something=)([\-\d\.]*)
with this:
nowitis=($2*10)
but instead of getting
nowitis=(80)
i get
nowitis=(8*10)
How to solve it?
In sed, for example (note that sed has no \d shorthand, so use 0-9 for digits):
echo "something=123" | sed -r 's/(something=)([-0-9.]*)/\1\2*10)/'
something=123*10)
echo "something=123" | sed -r 's/(something=)([-0-9.]*)/\1\20/'
something=1230
Multiplication by 10 is just appending a zero to the number; sed doesn't calculate results.
A note on the character class: in [-0-9.] the - sign comes first, so it can't be read as part of a range like A-Z (well, it could have meant a range from \0 to something, but it doesn't); as the first or last character it needs no escaping. Similarly, a dot inside a bracket expression is always literal (if it were the wildcard, the whole class could be reduced to just that wildcard), so it doesn't need to be escaped either.
Let's suppose you are on a POSIX system with Perl available.
echo "something= 8" | perl -pe 's/\w\s*=\s*\K-?\d+(\.\d+)?/$&*10/ge'
something= 80
What you want to do is not possible with regular expressions alone, because they cannot do arithmetic, e.g. compute 8*10. One way is to use an interpreter that can.
Perl has a nice feature for this: the /e modifier. It evaluates the replacement as code; here that code is $& * 10, where $& is the text matched by the pattern.
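It also handles negative numbers and floats, for example:
echo "something=10.2" | perl -pe 's/\w\s*=\s*\K-?\d+(\.\d+)?/$&*10/ge'
something=102
echo "something=-3.15" | perl -pe 's/\w\s*=\s*\K-?\d+(\.\d+)?/$&*10/ge'
something=-31.5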
The input string can be like:
something=10.2
something=-3.15
So there can be negative numbers and float numbers.
I have the PhpStorm IDE and I'm using its find & replace function with regex.
That works fine, but it has no multiplication.
So I think I could do it in a couple of runs.
For example, in the next run I would find my results and then move the dot by one place.
I read the PCRE docs and didn't find a multiplication option.
It would be easier to write a script, even in PHP, to do it properly.
But I thought it could be done more easily.

Repeating a regex pattern

First, I don't know if this is actually possible but what I want to do is repeat a regex pattern.
The pattern I'm using is:
sed 's/[^-\t]*\t[^-\t]*\t\([^-\t]*\).*/\1/' films.txt
An input of
250. 7.9 Shutter Island (2010) 110,675
Will return:
Shutter Island (2010)
I'm matching all non-tabs (250.), then a tab, then all non-tabs (7.9), then a tab. Next I capture the film title as a backreference, then match all remaining chars (110,675).
It works fine, but I'm learning regex and this looks ugly: the regex [^-\t]*\t is repeated right after itself. Is there any way to repeat it the way you can repeat a single character, like a{2,2}?
I've tried ([^-\t]*\t){2,2} (and variations) but I'm guessing that is trying to match [^-\t]*\t\t?
Also if there is any way to make my above code shorter and cleaner any help would be greatly appreciated.
This works for me:
sed 's/\([^\t]*\t\)\{2\}\([^\t]*\).*/\2/' films.txt
If your sed supports -r you can get rid of most of the escaping:
sed -r 's/([^\t]*\t){2}([^\t]*).*/\2/' films.txt
Change the first 2 to select different fields (0-3).
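For example, assuming the sample line is tab-separated:
printf '250.\t7.9\tShutter Island (2010)\t110,675\n' | sed -r 's/([^\t]*\t){2}([^\t]*).*/\2/'
Shutter Island (2010)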
This will also work:
sed 's/[^\t]\+/\n&/3;s/.*\n//;s/\t.*//' films.txt
Change the 3 to select different fields (1-4).
To use repetition braces and grouping parentheses properly with sed's default (basic) regexes, you have to escape them with backslashes, like
sed 's/\([^-\t]*\t\)\{3\}.*/\1/' films.txt
Yes, this command will work properly with your example.
If that feels annoying, you can use the -r option, which enables extended regex mode, and forget about the backslash escapes on the brackets.
sed -r 's/([^-\t]*\t){3}.*/\1/' films.txt
I found that this is almost the same as Dennis Williamson's answer, but I'm leaving it here because it is a shorter expression that does the same thing.
I think you might be going about this the wrong way. If you're simply wanting to extract the name of the film, and it's release year, then you could try this regex:
(?:\t)[\w ()]+(?:\t)
As seen in place here:
http://regexr.com?2sd3a
Note that it matches a tab character at the beginning and end of the actual desired string; the (?:...) groups are non-capturing, so the tabs are not captured, although they are still part of the overall match.
You can repeat things by putting them in parenthesis, like this:
([^-\t]*\t){2,2}
And the full pattern to match the title would be this:
([^-\t]*\t){2,2}([^-\t]+).*
You said you tried it. I'm not sure what is different, but the above worked for me on your sample data.
Why do things the hard way?
$ awk '{$1=$2=$NF=""}1' file
Shutter Island (2010)
If this is a tab-separated file with a regular format, I'd use cut instead of sed:
cut -d' ' -f3 films.txt
Note there's a single tab between the quotes after -d; it can be typed at the shell prompt by pressing ctrl+v first, i.e. ctrl+v then ctrl+i.
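For a quick check (cut's default delimiter is already a tab, so plain -f3 works too):
printf '250.\t7.9\tShutter Island (2010)\t110,675\n' | cut -f3
Shutter Island (2010)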