Basically what I have is a text file (file.txt), which contains lines of numbers (lines aren't necessarily the same length) e.g.
1 2 3 4
5 6 7 8
9 10 11 12 13
What I need to do is write new files with each of these numbers deleted, one at a time, with replacement (i.e. each new file is the original minus a single number), e.g. the first new file will contain
2 3 4 <--- 1st element removed
5 6 7 8
9 10 11 12 13
and the 7th file will contain
1 2 3 4
5 6 8 <--- 7th element removed here
9 10 11 12 13
To generate these, I'm looping through each line, and then each element in each line. E.g. for the 7th file, where I remove the third element of the second line, I'm trying to do this by reading in the line, removing the appropriate element, then reinserting the new line:
$lineNo is 2 (second line)
$line is 5 6 7 8
with cut, I remove the third number, making $newline 5 6 8
Then I try to replace the line $lineNo in file.txt with $newline using sed:
sed -n '$lineNo s/.*/'$newline'/' > file.txt
This is totally not working. I get an error
sed: can't read 25.780000: No such file or directory
(where 25.780000 is a number in my text file. It looks like it's trying to use $newline to read files or something)
I have reason to suspect my way of stating which line to replace isn't working either :(
My question is, a) is there a better way to do this rather than sed, and b) if sed is the way to go, what am I doing wrong?
Thanks!!
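To answer (b) first, since it explains the error: that sed command fails for three separate reasons. $lineNo sits inside single quotes, so the shell never expands it; $newline is unquoted, so after expansion it splits into words, the first of which is glued onto the sed script and the rest of which sed treats as input filenames (hence "can't read 25.780000"); and > file.txt truncates the input file before sed even starts. A corrected form, quoting the variables and editing in place (this assumes GNU sed for -i), would be:

lineNo=2
newline='5 6 8'
sed -i "${lineNo}s/.*/$newline/" file.txt

or, to keep the original intact, redirect to a different file: sed "${lineNo}s/.*/$newline/" file.txt > file_7.txt. As for (a), the answers below take approaches that avoid the read-modify-rewrite cycle entirely.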
filename=file.txt
i=1
while [[ -s $filename ]]; do
    new=file_$i.txt
    awk 'NR==1 {if (NF==1) next; else sub(/^[^ ]+ /, "")} 1' $filename > $new
    ((i++))
    filename=$new
done
This leaves a space at the beginning of the first line of each new file, and when a line becomes empty the line is removed. The loop ends when the last generated file is empty.
Update due to requirement clarification:
words=$(wc -w < file.txt)
for ((i=1; i<=words; i++)); do
    awk -v n=$i '
        words < n && n <= words+NF { $(n-words) = "" }
        { words += NF; print }
    ' file.txt > file_$i.txt
done
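Run against the sample file.txt this produces file_1.txt through file_13.txt. One caveat worth knowing: assigning "" to a field makes awk rebuild the record with OFS between all fields, including the now-empty one, so a file where a middle number was removed carries a doubled space, e.g.:

$ cat file_7.txt
1 2 3 4
5 6  8
9 10 11 12 13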
Unless I misunderstood the question, the following should work, although it will be pretty slow if your files are large:
#! /bin/bash

remove_by_value()
{
    local TO_REMOVE=$1
    while read line; do
        out=
        for word in $line; do [ "$word" = "$TO_REMOVE" ] || out="$out $word"; done
        echo "${out/ }"
    done < $2
}

remove_by_position()
{
    local NTH=$1
    while read line; do
        out=
        for word in $line; do
            ((--NTH == 0)) || out="$out $word"
        done
        echo "${out/ }"
    done < $2
}

FILE=$1
shift

for number; do
    echo "Removing $number"
    remove_by_position $number "$FILE"
done
This will dump all the output to stdout, but it should be trivial to change it so the output for each removed number is redirected (e.g. with remove_by_position $number $FILE > $FILE.$$ && mv $FILE.$$ $FILE.$number and proper quoting). Run it as, say,
$ bash script.sh file.txt $(seq 11)
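Spelled out, that per-number redirection could look like this (same function and variables as in the script above):

for number; do
    remove_by_position "$number" "$FILE" > "$FILE.$$" &&
        mv "$FILE.$$" "$FILE.$number"
done

Each run reads the untouched $FILE, so every output file differs from the original by exactly one element.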
I have to admit that I'm a bit surprised at how short the other solutions are.
#!/bin/bash
#
file=$1
lines=$(wc -l < $file)
out=0

dropFromLine () {
    file=$1
    row=$2
    to=$((row-1))
    from=$((row+1))
    linecontent=($(sed -n "${row}p" $file))
    # echo " linecontent: " ${linecontent[@]}
    linelen=${#linecontent[@]}
    # echo " linelength: " $linelen
    for n in $(seq 0 $((linelen-1)))
    do
        (
            # lines before the target row, unchanged
            if [[ $row -gt 1 ]]; then sed -n "1,${to}p" $file; fi
            # the target row, minus element $n
            for i in $(seq 0 $((linelen-1)))
            do
                if [[ $n != $i ]]
                then
                    echo -n ${linecontent[$i]}" "
                fi
            done
            echo
            # echo "mod - drop " ${linecontent[$n]}
            # lines after the target row, unchanged
            sed -n "$from,${lines}p" $file
        ) > outfile-${out}.txt
        out=$((out+1))
    done
}

for row in $(seq 1 $lines)
do
    dropFromLine $file $row
done
invocation:
./dropFromRow.sh num.dat
num.dat:
1 2 3 4
5 6 7 8
9 10 11
result:
outfile-0.txt outfile-1.txt outfile-2.txt outfile-3.txt outfile-4.txt outfile-5.txt
outfile-6.txt outfile-7.txt outfile-8.txt outfile-9.txt outfile-10.txt
samples:
asux:~/proj/mini/forum > cat outfile-0.txt
2 3 4
5 6 7 8
9 10 11
asux:~/proj/mini/forum > cat outfile-1.txt
1 3 4
5 6 7 8
9 10 11
One way using perl:
Content of file.txt:
1 2 3 4
5 6 7 8
9 10 11 12 13
Content of script.pl:
use warnings;
use strict;
## Read all input to a scalar variable as a single string.
my $str;
{
    local $/ = undef;
    $str = <>;
}
## Loop for each number found.
while ( $str =~ m/(\d+)(?:\h*)?/g ) {
    ## Open file for writing. The name of the file will be
    ## the number matched in previous regexp.
    open my $fh, q[>], ($1 . q[.txt]) or
        die qq[Couldn't create file $1.txt\n];
    ## Print everything prior to matched string plus everything
    ## after matched string.
    printf $fh qq[%s%s], $`, $';
    ## Close file.
    close $fh;
}
Run it like:
perl script.pl file.txt
Show files created:
ls [0-9]*.txt
With output:
10.txt 11.txt 12.txt 13.txt 1.txt 2.txt 3.txt 4.txt 5.txt 6.txt 7.txt 8.txt 9.txt
Show content of one of them:
cat 9.txt
Output:
1 2 3 4
5 6 7 8
10 11 12 13
This must be very basic, but I can't find a way to solve it.
I have a script like this:
#!/bin/bash
seqFolder="/raw_data/data"
seqmode="paired"
Input=$(basename ${seqFolder});
if [ $seqmode = paired ]; then
    for x in $seqFolder/*; do
        if [[ "$x" =~ .*\.fastq.gz$ ]]; then
            z=$(basename $x 1_001.fastq.gz)
            echo $z
            echo "file of this iteration $z"1_001.fastq.gz" $z"2_001.fastq.gz""
        fi
    done
fi
When I run this script I get this:
MG-AB-17_S17_R
file of this iteration MG-AB-17_S17_R1_001.fastq.gz MG-AB-17_S17_R2_001.fastq.gz
MG-AB-17_S17_R2_001.fastq.gz
file of this iteration MG-AB-17_S17_R2_001.fastq.gz1_001.fastq.gz MG-AB-17_S17_R2_001.fastq.gz2_001.fastq.gz
MG-AB-81_S74_R
file of this iteration MG-AB-81_S74_R1_001.fastq.gz MG-AB-81_S74_R2_001.fastq.gz
MG-AB-81_S74_R2_001.fastq.gz
file of this iteration MG-AB-81_S74_R2_001.fastq.gz1_001.fastq.gz MG-AB-81_S74_R2_001.fastq.gz2_001.fastq.gz
Files in: /raw_data/data are these 4 (this is just example):
MG-AB-17_S17_R1_001.fastq.gz
MG-AB-17_S17_R2_001.fastq.gz
MG-AB-81_S74_R1_001.fastq.gz
MG-AB-81_S74_R2_001.fastq.gz
The issue is that I don't want my variable $z to be MG-AB-17_S17_R2_001.fastq.gz or MG-AB-81_S74_R2_001.fastq.gz because files like these:
MG-AB-17_S17_R2_001.fastq.gz1_001.fastq.gz
MG-AB-17_S17_R2_001.fastq.gz2_001.fastq.gz
...
don't exist in directory /raw_data/data
I was thinking that .fastq.gz$ in "$x" =~ .*\.fastq.gz$ would ensure that, but it seems that's not the case. Can you please advise.
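For what it's worth, my reading of the script is that the glob $seqFolder/* matches the R2 files as well, and basename "$x" 1_001.fastq.gz only strips that suffix when it is actually present, so $z comes out wrong for the R2 files. A minimal sketch of a fix (assuming every pair follows the *1_001.fastq.gz / *2_001.fastq.gz naming) is to glob only the R1 files:

# Iterate over R1 files only; derive the R2 name from the shared prefix.
for x in "$seqFolder"/*1_001.fastq.gz; do
    z=$(basename "$x" 1_001.fastq.gz)
    echo "file of this iteration ${z}1_001.fastq.gz ${z}2_001.fastq.gz"
done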
You're looking for the break statement.
$ cat /tmp/so3651.sh
#!/usr/bin/env bash
for i in 1 2 3 4 5; do
    echo "i: $i"
    if [[ -f /tmp/foo$i ]]; then
        echo "found it!"
        break
    fi
done
If the file doesn't exist, it produces:
$ /tmp/so3651.sh
i: 1
i: 2
i: 3
i: 4
i: 5
But when a file does exist, it stops early:
$ touch /tmp/foo4
$ /tmp/so3651.sh
i: 1
i: 2
i: 3
i: 4
found it!
I have a large file, about 10 GB. I have a vector of line numbers which I would like to use to split the file. Ideally I would like to accomplish this using command-line utilities. For example:
File:
1 2 3
4 5 6
7 8 9
10 11 12
13 14 15
16 17 18
Vector of line numbers:
2 5
Desired output:
File 1:
1 2 3
File 2:
4 5 6
7 8 9
10 11 12
File 3:
13 14 15
16 17 18
Using awk:
$ awk -v v="2 5" ' # space-separated vector if indexes
BEGIN {
n=split(v,t) # reshape vector to a hash
for(i=1;i<=n;i++)
a[t[i]]
i=1 # filename index
}
{
if(NR in a) { # file record counter in the vector
close("file" i) # close previous file
i++ # increase filename index
}
print > ("file" i) # output to file
}' file
Sample output:
$ cat file2
4 5 6
7 8 9
10 11 12
Very slightly different from James's and kvantour's solutions: passing the vector to awk as a "file"
vec="2 5"
awk '
    NR == FNR {nr[$1]; next}
    FNR == 1 {filenum = 1; f = FILENAME "." filenum}
    FNR in nr {
        close(f)
        f = FILENAME "." ++filenum
    }
    {print > f}
' <(printf "%s\n" $vec) file
$ ls -l file file.*
-rw-r--r-- 1 glenn glenn 48 Jul 17 10:02 file
-rw-r--r-- 1 glenn glenn 7 Jul 17 10:09 file.1
-rw-r--r-- 1 glenn glenn 23 Jul 17 10:09 file.2
-rw-r--r-- 1 glenn glenn 18 Jul 17 10:09 file.3
This might work for you:
csplit -z file 2 5
or if you want regexp:
csplit -z file /2/ /5/
With the default values, the output files will be named xxnn where nn starts at 00 and is incremented by 1.
N.B. The -z option prevents empty elided files.
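Applied to the sample file, this yields three pieces (GNU csplit reports the size in bytes of each file it creates):

$ csplit -z file 2 5
6
21
18
$ cat xx01
4 5 6
7 8 9
10 11 12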
Here is a little awk that does the trick for you:
awk -v v="2 5" 'BEGIN{v=" 1 "v" "}
index(v," "FNR" ") { close(f); f=FILENAME "." (++i) }
{ print > f }' file
This will create files of the form: file.1, file.2, file.3, ... The BEGIN block prepends 1 to the vector so that the very first input line also triggers a new filename, which is what starts file.1.
OK, I've gone totally mental this morning, and I came up with a Sed program (with functions, loops, and all) that generates a Sed script to do what you want.
Usage:
put the script in a file (e.g. make.sed) and chmod +x it;
then use it as the script for this Sed command sed "$(./make.sed <<< '1 4')" inputfile¹
Note that ./make.sed <<< '1 4' generates the following sed script:
1,1{w file.1
be};1,4{w file.2
be};1,${w file.3
be};:e
¹ Unfortunately I misread the question, so my script works taking the line number of the last line of each block that you want to write to file, so your 2 5 has to be changed to 1 4 to be fed to my script.
#!/usr/bin/env -S sed -Ef
###########################################################
# Main
# make a template sed script, in which we only have to increase
# the number of each numbered output file, each of which is marked
# with a trailing \x0
b makeSkeletonAndMarkNumbers
:skeletonMade
# try putting a stencil on the rightmost digit of the first marked number on
# the line and loop, otherwise exit
b stencilLeastDigitOfNextMarkedNumber
:didStencilLeastDigitOfNextMarkedNumber?
t nextNumberStenciled
b exit
# continue processing next number by adding 1
:nextNumberStenciled
b numberAdd1
:numberAdded1
# try putting a stencil on the rightmost digit of the next marked number on
# the line and loop, otherwise we're done with the first marked number, we can
# clean its marker, and we can loop
b stencilNextNumber
:didStencilNextNumber?
t nextNumberStenciled
b removeStencilAndFirstMarker
:removeStencilAndFirstMarkerDone
b stencilLeastDigitOfNextMarkedNumber
###########################################################
# puts a \n on each side of the first digit marked on the right by \x0
:stencilLeastDigitOfNextMarkedNumber
tr
:r
s/([0-9])\x0;/\n\1\n\x0;/1
b didStencilLeastDigitOfNextMarkedNumber?
###########################################################
# makes desired sed script skeleton from space-separated numbers
:makeSkeletonAndMarkNumbers
s/$/ $/
s/([1-9]+|\$) +?/1,\1{w file.0\x0;be};/g
s/$/:e/
b skeletonMade
###########################################################
# moves the stencil to the next number followed by \x0
:stencilNextNumber
trr
:rr
s/\n(.)\n([^\x0]*\x0[^\x0]+)([0-9])\x0/\1\2\n\3\n\x0/
b didStencilNextNumber?
###########################################################
# +1 with carry to last digit on the line enclosed in between two \n characters
:numberAdd1
#i\
#\nprima della somma:
#l
:digitPlus1
h
s/.*\n([0-9])\n.*/\1/
y/0123456789/1234567890/
G
s/(.)\n(.*)\n.\n/\2\n\1\n/
trrr
:rrr
/[0-9]\n0\n/s/(.)\n0\n/\n\1\n0/
t digitPlus1
# the following line can be problematic for lines starting with number
/[^0-9]\n0\n/s/(.)\n0\n/\n\1\n10/
b numberAdded1
###########################################################
# remove stencil and first marker on line
:removeStencilAndFirstMarker
s/\n(.)\n/\1/
s/\x0//
b removeStencilAndFirstMarkerDone
###########################################################
:exit
# a bit of post processing the `w` command has to be followed
# by the filename, then by a newline, so we change the appropriate `;`s to `\n`.
s/(\{[^;]+);/\1\n/g
I have a text file like
Apples
Big 7
Small 6
Apples
Good 5
Bad 3
Oranges
Big 4
Small 2
Good 1
Bad 5
How do I get to a specific section of this file and then do a grep? For example, if I need to find how many Good Oranges there are, how do I do it from the command line with this file as input, using say awk?
You could use the range operator like this:
awk '/Apples/,/^$/ { if (/Good/) print $2}' file
would print how many good apples there are:
5
The range operator , evaluates to true at a line matching the first condition and stays true through the next line matching the second condition. The second pattern /^$/ matches a blank line. This means that only the relevant section will be tested for the property Good, Bad, etc.
I'm assuming that your original input file wasn't double-spaced? If it was, the method above can be patched to skip every other line:
awk '!(NR%2){next} /Oranges/,/^$/ { if (/Good/) print $2}' file
When the record number NR is even, NR%2 is 0 and !(0) is true, so every other line will be skipped. (The parentheses matter: ! binds tighter than % in awk, so !NR%2 would always evaluate to 0.)
You could use Bash to read from the file line by line in a loop.
while read -a fruit; do
    [ ${#fruit[@]} -eq 1 ] && name=${fruit[0]}
    case $name in
        Oranges) [ "${fruit[0]}" = "Good" ] && echo ${fruit[1]};;
    esac
done < file
You could also make this a function and pass it arguments to get trait information about any fruit.
read_fruit (){
    while read -a fruit; do
        [ ${#fruit[@]} -eq 1 ] && name=${fruit[0]}
        case $name in
            $1) [ "${fruit[0]}" = "$2" ] && echo ${fruit[1]};;
        esac
    done < file
}
Use:
read_fruit Apples Small
result:
6
When you have name/value pairs, it's usually best to first build an array indexed by the name and containing the value, then you can just print whatever you're interested in using the appropriate name(s) to index the array:
$ awk 'NF==1{key=$1} {val[key,$1]=$2} END{print val["Oranges","Good"]}' file
1
$ awk 'NF==1{key=$1} {val[key,$1]=$2} END{print val["Apples","Bad"]}' file
3
or if you're looking for the starting point to implement a more complete/complex set of requirements here's one way:
$ awk '
NF {
    if (NF==1) {
        key=$1
        keys[key]
    }
    else {
        val[key,$1]=$2
        names[$1]
    }
}
END {
    for (key in keys)
        for (name in names)
            print key, name, val[key,name]
}
' file
Apples Big 7
Apples Bad 3
Apples Good 5
Apples Small 6
Oranges Big 4
Oranges Bad 5
Oranges Good 1
Oranges Small 2
To test @JohnB's theory that a shell script would be faster than an awk script if there were thousands of files, I copied the OP's input file 5,000 times into a tmp directory, then ran these 2 equivalent scripts on them (the bash one based on John's answer in this thread, and an awk one that does the same as the bash one):
$ cat tst.sh
for file in "$#"; do
while read -r field1 field2 ; do
[ -z "$field2" ] && name="$field1"
case $name in
Oranges) [ "$field1" = "Good" ] && echo "$field2";;
esac
done < "$file"
done
$ cat tst.awk
NF==1 { fruit=$1 }
fruit=="Oranges" && $1=="Good" { print $2 }
and here are the results of running both on those 5,000 files:
$ time ./tst.sh tmp/* > bash.out
real 0m6.490s
user 0m2.792s
sys 0m3.650s
$ time awk -f tst.awk tmp/* > awk.out
real 0m2.262s
user 0m0.311s
sys 0m1.934s
The 2 output files were identical.
I have two files. One is a SALESORDERLIST, which goes like this
ProductID;ProductDesc
1,potatoes 1 kg.
2,tomatoes 2 k
3,bottles of whiskey 2 un.
4,bottles of beer 40 gal
The (ProductID;ProductDesc) header is actually not in the file, so disregard it.
In another file, POSSIBLEUNITS, I have (you guessed it) the possible units and their equivalencies:
u;u.;un;un.;unit
k;k.;kg;kg.,kilograms
This is my first day with regular expressions and I would like to know how I can get the entries in SALESORDERLIST whose units appear in POSSIBLEUNITS. In my example, I would like to exclude entry 4, since 'gal' is not listed in the POSSIBLEUNITS file.
I say regex, since I have a further criterion that needs to be matched:
egrep "^[0-9]+;{1}[^; ][a-zA-Z ]+" SALESORDERLIST
From those resultant entries, I want to get those ending in valid units.
Thanks!
One way of achieving what you want is:
cat SALESORDERLIST | egrep "\b(u|u\.|un|un\.|unit|k|k\.|kg|kg\.|kilograms)\b"
1,potatoes 1 kg.
2,tomatoes 2 k
3,bottles of whiskey 2 un.
The metacharacter \b is an anchor that allows you to perform a "whole words only" search using
a regular expression in the form of \bword\b.
http://www.regular-expressions.info/wordboundaries.html
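If you would rather not hard-code the alternation, you can build it from POSSIBLEUNITS itself and anchor it to the end of the line, which also covers the "ending in valid units" criterion. A sketch (treating both ';' and the stray ',' in the kg line as separators):

# Build "u|u\.|un|...|kilograms" from POSSIBLEUNITS, escaping the dots
# so they match literally, then require each line to end in one of the units.
pattern=$(tr ';,' '\n' < POSSIBLEUNITS | sed 's/\./\\./g' | paste -sd'|')
egrep "^[0-9]+,.* ($pattern)$" SALESORDERLIST

On the sample data this prints entries 1-3 and drops the '40 gal' line.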
One way would be to create a bash script, say called findunit.sh:
while read line
do
    match=$(egrep -E "^[0-9]+,{1}[^, ][a-zA-Z ]+" <<< $line)
    name=${match##* }
    # echo "$name..."
    found=$(egrep "$name" /pathtofile/units.txt)
    # echo "xxx$found"
    [ -n "$found" ] && echo $line
done < $1
Then run with:
findunit.sh SALESORDERLIST
My output from this is:
1,potatoes 1 kg.
2,tomatoes 2 k
3,bottles of whiskey 2 un.
An example of doing it completely in bash:
declare -A units

while read line; do
    while [ -n "$line" ]; do
        i=`expr index $line ";"`
        if [[ $i == 0 ]]; then
            units[$line]=1
            break
        fi
        units[${line:0:$((i-1))}]=1
        line=${line#*;}
    done
done < POSSIBLEUNITS

while read line; do
    unit=${line##* }
    if [[ ${units[$unit]} == 1 ]]; then
        echo $line
    fi
done < SALESORDERLIST
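Note that the stray ',' in the second POSSIBLEUNITS line means this loop stores the final chunk as the single key kg.,kilograms, so kg. itself never becomes a key and the potatoes line would be dropped; splitting on both ';' and ',' (as in the tr-based sketch above) avoids that.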
Suppose I have a string "123456789".
I want to extract the 3rd, 6th, and 8th element. I guess I can use
cut -3, -6, -8
But suppose this gives
368
and I want to separate the digits with spaces to get
3 6 8
What should I do?
Actually shell parameter expansion lets you do substring slicing directly, so you could just do:
x='123456789'
echo "${x:3:1}" "${x:6:1}" "${x:8:1}"
Update
To do this over an entire file, read the line in a loop:
while read x; do
    echo "${x:3:1}" "${x:6:1}" "${x:8:1}"
done < file
(By the way, bash slicing is zero-indexed, so if you want the digits '3', '6' and '8' you'd really want ${x:2:1}, ${x:5:1} and ${x:7:1}.)
You can use the sed tool and issue this command in your terminal:
sed -r "s/^..(.)..(.).(.).*$/\1 \2 \3/"
Explained RegEx: http://regex101.com/r/fH7zW6
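A quick check on the sample string:

$ echo '123456789' | sed -r "s/^..(.)..(.).(.).*$/\1 \2 \3/"
3 6 8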
To "generalize" this on a file you can pipe it after a cat like so:
cat file.txt|sed -r "s/^..(.)..(.).(.).*$/\1 \2 \3/"
A Perl one-liner:
perl -lne '@A = split //; print "$A[2] $A[5] $A[7]"' file
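Run against a file containing 123456789 this prints 3 6 8: split // explodes each line into one character per array element, and Perl arrays are zero-indexed, hence indices 2, 5 and 7.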
Using cut:
$ cat input
1234567890
2345678901
3456789012
4567890123
5678901234
$ cut -b3,6,8 --output-delimiter=" " input
3 6 8
4 7 9
5 8 0
6 9 1
7 0 2
The -b option selects only the specified bytes. The output delimiter can be specified using --output-delimiter.
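Note that --output-delimiter is a GNU cut extension. If your cut lacks it, a roughly equivalent awk sketch is:

awk '{print substr($0,3,1), substr($0,6,1), substr($0,8,1)}' input

substr is 1-indexed, so 3, 6 and 8 here really are the 3rd, 6th and 8th characters.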