How do I force lines to be a certain length? - regex

I have a text file that contains a very large list of 5-digit numbers. Some lines contain more than one 5-digit number with no newline separating them:
12345
23456
34567
4567856789
67890
...
837460174975917
...
I'm trying to find a regular expression that I can use with sed that will add newlines in-between the numbers.
The desired output would be:
12345
23456
34567
45678
56789
67890
...
83746
01749
75917
...
I've played around with it a bit, but the best I can come up with is something like ^([0-9]{5}) replaced with $1\r\n. However, this adds a newline after every number, and I'd need to remove all the blank lines afterwards, which is not optimal because of the size of this file.

Lightweight solution using fold:
Sample input:
cat filename
12345
23456
34567
4567856789
Solution using fold:
cat filename|fold -w5
12345
23456
34567
45678
56789
Update (as suggested by Kenavoz): to avoid the unnecessary use of cat and a pipe:
fold -w5 filename

Using grep -o you can do this:
grep -Eo '.{5}' file
12345
23456
34567
45678
56789
67890
83746
01749
75917
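Since the question asked for sed specifically, here is a minimal sketch of the same split. It assumes GNU sed, which treats \n in the replacement as a newline (other sed implementations may not). The function name split_five is only illustrative.

```shell
# Assumption: GNU sed (\n in the replacement text inserts a newline).
split_five() {
    # 1st command: put a newline after every run of 5 digits.
    # 2nd command: drop the extra newline left at the end of the line,
    #              so no blank lines are produced.
    sed -E 's/[0-9]{5}/&\n/g; s/\n$//' "$@"
}

printf '12345\n4567856789\n837460174975917\n' | split_five
```

Because the trailing newline is removed in the same pass, there is no need for a second run over the file to delete blank lines.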

Related

Phone Numbers in separate lines in UNIX

On UNIX:
I have a sample file, and I want all the phone numbers starting with 987 written to another file as a list.
That means that if a single row contains 2 phone numbers, they should be on separate lines.
Sample File Contents
ajfhvjfdhvjdfb jfbhfb fg 9871177454 9563214578 shgfsehfgvhb vhf 9877745212
sjdjfgsfhvg b 9874789645 sfjkvhbjfbg shgfhbfg 2563145278
9874561231
This should work,
echo "ajfhvjfdhvjdfb jfbhfb fg 9871177454 9563214578 shgfsehfgvhb vhf 9877745212 sjdjfgsfhvg b 9874789645 sfjkvhbjfbg shgfhbfg 2563145278 9874561231" > sample.txt
egrep -o '987([0-9]+)' sample.txt
returns,
9871177454
9877745212
9874789645
9874561231
or to be specific for 10 digit phone numbers,
egrep -o '987([0-9]{7})' sample.txt
returns similar results.
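Note that 987[0-9]{7} also matches the first ten digits of a longer run of digits. Below is a sketch that only accepts standalone 10-digit numbers, assuming a grep that supports -w together with -o (GNU grep does). The function name extract_987 is hypothetical.

```shell
extract_987() {
    # -o prints each match on its own line.
    # -w only accepts matches that are not adjacent to other
    #    letters, digits or underscores (whole "words").
    grep -Eow '987[0-9]{7}' "$@"
}

# The 11-digit number at the end is deliberately not reported.
printf 'fg 9871177454 9563214578 vhf 9877745212\nb 9874789645 x 98745612310\n' |
    extract_987
```

To get the list into another file, as the question asks, redirect the output: extract_987 sample.txt > phones.txt (the output file name is just an example).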

egrep command for lines that have one or more instance of 1234 but no other numbers?

So I'm fairly new to regular expressions and I'm wondering how this would be implemented as an egrep command.
I basically want to look for lines in a file that have one or more instances of "1234", but no other numbers. (non-digit characters are allowed).
Examples:
1234 - valid
12341234 - valid
12345 - invalid (since 5 is there)
You can use grep to extract the lines that contain 1234, then replace 1234 with something that doesn't appear in the input, then remove lines that still contain any digits, and replace the special string back by 1234:
< input-file grep 1234 \
| sed 's/1234/\x1/g' \
| grep -v '[0-9]' \
| sed 's/\x1/1234/g'
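The same remove-and-check idea can be sketched as a single awk process, which avoids having to pick a placeholder byte that never occurs in the input. This is only a sketch; the function name only_1234 is hypothetical.

```shell
only_1234() {
    awk '{
        s = $0
        n = gsub(/1234/, "", s)   # n = how many times 1234 occurred
    }
    # Print the original line if 1234 occurred at least once and
    # no digit is left once every 1234 has been removed from the copy.
    n > 0 && s !~ /[0-9]/' "$@"
}

printf '1234 ok\n12341234\n12345\nnot 2 valid 1234\nno numbers\n' | only_1234
```

The substitution is done on a copy (s), so the line that is printed is the untouched original.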
So, we want to select lines that have 1234 one or more times but no other digits:
grep -E '^([^[:digit:]]*1234)+[^[:digit:]]*$' file
How it works
The regex begins with ^ and ends with $. That means that it must match the whole line.
Inside the regex are two parts:
([^[:digit:]]*1234)+ matches one or more 1234 with no other digits.
[^[:digit:]]* matches any non-digits that follow the last 1234.
Historically, one would use [0-9] to match digits. In some locales, however, a range such as [0-9] can match characters other than the ASCII digits, so we use the POSIX class [:digit:] instead.
Example
Let's use this test file:
$ cat file
this 1234 is valid
12341234 valid
not valid 12345
not 2 valid 1234 line
no numbers so not valid
Here is the result:
$ grep -E '^([^[:digit:]]*1234)+[^[:digit:]]*$' file
this 1234 is valid
12341234 valid
If you want no other digit after your 1234 block:
egrep '\<(1234)+(\>|[^0-9])' *
       --        --           --> word delimiters
          ----                --> the word you're looking for
                    ------    --> non digit characters
               -              --> one or more times
If you want only "words" made up by the "1234" block, then you can egrep this:
egrep '\<(1234)+\>' *
       --       --    --> word delimiters
          ----        --> the word you're looking for
               -      --> one or more times.

How to find lines with multiple occurrences of a(ny) word in a file?

I want to find lines that have multiple occurrences of a(ny) word. For example, if the input text is
John is a teacher, who is not highly paid.
abc abcde
James lives in Detroit.
abc abc abcde
Paul has 2 dogs and 2 cats.
The output should be
John is a teacher, who is not highly paid.
abc abc abcde
Paul has 2 dogs and 2 cats.
The first line has "is" repeated, the second output line has "abc" repeated, and the last line has "2" repeated.
^(?=.*\b(\w+)\b.*\b\1\b).*$
Try this. See demo:
https://www.regex101.com/r/rG7gX4/6
Use this with grep -P
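For concreteness, the full command might look like the sketch below. It assumes a grep built with PCRE support (the -P option of GNU grep); the function name repeated_word_lines is hypothetical.

```shell
repeated_word_lines() {
    # (\w+) captures a whole word; \b\1\b requires that same word to
    # appear again later on the line, again as a whole word.
    grep -P '^(?=.*\b(\w+)\b.*\b\1\b).*$' "$@"
}

printf 'abc abcde\nabc abc abcde\nPaul has 2 dogs and 2 cats.\n' |
    repeated_word_lines
```

The \b anchors are what keep "abc abcde" from matching: "abc" inside "abcde" is not followed by a word boundary.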
Here is a simple way to do it in awk
awk '{f=0;delete a;for (i=1;i<=NF;i++) if (a[$i]++) f=1} f' file
John is a teacher, who is not highly paid.
abc abc abcde
Paul has 2 dogs and 2 cats.
It loops through every word and counts them in array a.
If any word is found more than once, it sets flag f.
If flag f is true, it does the default action: print the line.
To see how many:
awk '{f=0;delete a;for (i=1;i<=NF;i++) if (a[$i]++) f=1} f {for (i in a) if (a[i]>1) printf "%sx\"%s\"-",a[i],i;print $0}' file
2x"is"-John is a teacher, who is not highly paid.
2x"abc"-abc abc abcde
2x"2"-Paul has 2 dogs and 2 cats.
Some improvements: ignore case, and remove every . and , from the words.
awk '{f=0;delete a;for (i=1;i<=NF;i++) {w=tolower($i);gsub(/[.,]/,"",w);if (a[w]++) f=1}} f' file

How to replace a string and number with the same string and random number using sed/awk?

I have a text file that has, amongst other data, many occurrences of a string with a random number (between 0-n) appended. For example:
string1
string33
string10
and so on.
I want to be able to replace each of those with the same string but with a random number (between 0-n) appended. For example:
string2
string9
string12
I've tried this nawk script but I can only get it to replace all numbers in the file.
nawk 'BEGIN{OFS=FS="";srand()}{for(i=1;i<=NF;i++)sub(/[0-9]/,("string")int(10*rand()),$i)}1' infile > outfile
Plus, it replaces each number, so a double-digit number could end up being three or four. For example:
stringstring4
stringstring30
stringstring3string6
Can anyone help me get the desired output?
You were almost there:
$ nawk 'BEGIN{OFS=FS="";srand()}{for(i=1;i<=NF;i++)sub(/[0-9]/,int(10*rand()),$i)}1' infile
string5
string89
string57
$ nawk 'BEGIN{OFS=FS="";srand()}{for(i=1;i<=NF;i++)sub(/[0-9]/,int(10*rand()),$i)}1' file
string6
string08
string37
Note I just replaced this:
sub(/[0-9]/,("string")int(10*rand()),$i)
            ^^^^^^^^^^
with:
sub(/[0-9]/,int(10*rand()),$i)
Update
I think I wasn't clear in my question. I want it to only replace
instances of "string + number" not all numbers in the file as there is
other data within the file
Keep it simple, no need to set FS:
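awk 'BEGIN{srand()} {for(i=1;i<=NF;i++) sub(/string[0-9]*/,"string"int(10*rand()),$i)}1' file
Look for "string" followed by any number of digits and replace it with "string" plus a random digit. The srand() call seeds the generator; without it, some awk implementations (gawk, for example) produce the same sequence on every run.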
Test
$ cat a
string1
string33
string10
hello2
$ awk 'BEGIN{srand()} {for(i=1;i<=NF;i++) sub(/string[0-9]*/,"string"int(10*rand()),$i)}1' a
string1
string7
string4
hello2
$ awk 'BEGIN{srand()} {for(i=1;i<=NF;i++) sub(/string[0-9]*/,"string"int(10*rand()),$i)}1' a
string9
string3
string5
hello2
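The question asks for a random number between 0 and n, and a line may contain more than one "string + number". A sketch covering both, under the assumption that n is passed in as the first argument; the function name randomize_suffix and its argument order are hypothetical.

```shell
randomize_suffix() {
    n=$1; shift                     # upper bound (inclusive), then files
    awk -v n="$n" 'BEGIN { srand() }
    {
        out = ""; rest = $0
        # Walk the line left to right so that every occurrence
        # gets its own random number.
        while (match(rest, /string[0-9]+/)) {
            out  = out substr(rest, 1, RSTART - 1) "string" int((n + 1) * rand())
            rest = substr(rest, RSTART + RLENGTH)
        }
        print out rest
    }' "$@"
}

printf 'string1\nstring33 and string10\nhello2\n' | randomize_suffix 50
```

Here [0-9]+ is used rather than [0-9]* so that a bare "string" with no number is left alone, and int((n + 1) * rand()) yields integers from 0 through n.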

SED: Inserting an existing pattern, to several other places on the same line

Again a SED question from me :)
So, same as last time, I'm wrestling with phone numbers. This time the problem is a bit different.
I currently have this kind of organization in my text file:
Areacode: List of phone numbers:
4444 NUM:111111 NUM:2222222 NUM:33333333
5555 NUM:1111111 NUM:2222 NUM:3333333 NUM:44444444 NUM:5555555
Now, every area code can have an unknown number of phone numbers, and the phone numbers are not fixed in length.
What I would like to know, is how could I combine areacode and phone number, to look something like this:
4444-111111, 4444-2222222, 4444-33333333
My first idea was to add again a line break before each phone number and to match these sections with regex, and then just add the first remembered item to second, and first to third:
\1-\2, \1-\3, etc
But of course, since sed can only remember 9 groups (\1 through \9) and there can be more than 10 numbers on one line, this doesn't work. Moreover, the variable number of phone numbers per line makes this a no-go.
I'm again primarily looking for a sed solution, as I've been trying to get proficient with it, but more efficient solutions with other tools are of course welcome!
$ cat input.txt | sed '1d;s/NUM:/ /g' | awk '{for(i=2;i<=NF;i++)printf("%s-%s%s", $1, $i, i==NF?"\n":",")}'
4444-111111,4444-2222222,4444-33333333
5555-1111111,5555-2222,5555-3333333,5555-44444444,5555-5555555
This might work for you:
sed '1d;:a;s/^\(\S*\)\(.*\)NUM:/\1\2,\1-/;ta;s/[^,]*,//;s/ //g' file
4444-111111,4444-2222222,4444-33333333
5555-1111111,5555-2222,5555-3333333,5555-44444444,5555-5555555
or:
awk 'NR>1{gsub(/NUM:/,","$1"-");sub(/[^,]*,/,"");gsub(/ /,"");print}' file
4444-111111,4444-2222222,4444-33333333
5555-1111111,5555-2222,5555-3333333,5555-44444444,5555-5555555
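The answers above join the numbers with a bare comma, while the question shows ", " as the separator. A spelled-out awk sketch that produces that exact format; the function name join_area is hypothetical.

```shell
join_area() {
    awk 'NR > 1 {                      # skip the header line
        line = ""
        for (i = 2; i <= NF; i++) {    # field 1 is the area code
            num = $i
            sub(/^NUM:/, "", num)      # strip the NUM: prefix
            line = line (i > 2 ? ", " : "") $1 "-" num
        }
        print line
    }' "$@"
}

printf 'Areacode: List of phone numbers:\n4444 NUM:111111 NUM:2222222 NUM:33333333\n' |
    join_area
```

Looping over the fields means there is no limit on how many numbers a line can hold, which is the restriction the \1 to \9 approach ran into.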
TXR:
@(collect)
@area @(coll :mintimes 1)NUM:@{num /[0-9]+/}@(end)
@(output)
@(rep)@area-@num, @(last)@area-@num@(end)
@(end)
@(end)
Run:
$ txr phone.txr phone.txt
4444-111111, 4444-2222222, 4444-33333333
5555-1111111, 5555-2222, 5555-3333333, 5555-44444444, 5555-5555555
$ cat phone.txt
Areacode: List of phone numbers:
4444 NUM:111111 NUM:2222222 NUM:33333333
5555 NUM:1111111 NUM:2222 NUM:3333333 NUM:44444444 NUM:5555555