Split file by vector of line numbers - regex

I have a large file, about 10GB. I have a vector of line numbers which I would like to use to split the file. Ideally I would like to accomplish this using command-line utilities. As a reproducible example:
File:
1 2 3
4 5 6
7 8 9
10 11 12
13 14 15
16 17 18
Vector of line numbers:
2 5
Desired output:
File 1:
1 2 3
File 2:
4 5 6
7 8 9
10 11 12
File 3:
13 14 15
16 17 18

Using awk:
$ awk -v v="2 5" ' # space-separated vector of indexes
BEGIN {
n=split(v,t) # reshape vector to a hash
for(i=1;i<=n;i++)
a[t[i]]
i=1 # filename index
}
{
if(NR in a) { # file record counter in the vector
close("file" i) # close previous file
i++ # increase filename index
}
print > ("file" i) # output to file
}' file
Sample output:
$ cat file2
4 5 6
7 8 9
10 11 12

Very slightly different from James's and kvantour's solutions: passing the vector to awk as a "file"
vec="2 5"
awk '
NR == FNR {nr[$1]; next}
FNR == 1 {filenum = 1; f = FILENAME "." filenum}
FNR in nr {
close(f)
f = FILENAME "." ++filenum
}
{print > f}
' <(printf "%s\n" $vec) file
$ ls -l file file.*
-rw-r--r-- 1 glenn glenn 48 Jul 17 10:02 file
-rw-r--r-- 1 glenn glenn 7 Jul 17 10:09 file.1
-rw-r--r-- 1 glenn glenn 23 Jul 17 10:09 file.2
-rw-r--r-- 1 glenn glenn 18 Jul 17 10:09 file.3
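With a 10GB input it may be worth confirming that the pieces add up to the original. The following is a minimal sketch of that check, not part of any answer above: it splits the sample file with a condensed awk of the same kind and compares the rejoined pieces with the input. The scratch directory and the `part.` prefix are assumptions made for the illustration.

```shell
# Sketch: split a sample file at lines 2 and 5, then confirm nothing was lost.
tmp=$(mktemp -d)    # scratch directory (assumption for this demo)
printf '%s\n' '1 2 3' '4 5 6' '7 8 9' '10 11 12' '13 14 15' '16 17 18' > "$tmp/file"

awk -v v="2 5" -v prefix="$tmp/part." '
BEGIN { n = split(v, t); for (i = 1; i <= n; i++) start[t[i]]; k = 1 }
NR in start { close(prefix k); k++ }   # a listed line starts the next piece
{ print > (prefix k) }
' "$tmp/file"

# Concatenate the pieces in numeric order (a glob would sort part.10 before part.2).
i=1
while [ -f "$tmp/part.$i" ]; do
    cat "$tmp/part.$i"
    i=$((i + 1))
done > "$tmp/rejoined"

# cmp -s is silent and only sets the exit status.
if cmp -s "$tmp/rejoined" "$tmp/file"; then result="identical"; else result="different"; fi
echo "$result"
```

On the real file, only the split and the `cmp` step are needed; the sample data is there to make the sketch self-contained.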

This might work for you:
csplit -z file 2 5
or, if you want to split at lines matching a regexp (the patterns are matched against line contents, not line numbers):
csplit -z file /2/ /5/
With the default values, the output files will be named xxnn where nn starts at 00 and is incremented by 1.
N.B. The -z option elides (removes) empty output files.
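As a small illustration of the line-number form, here is a sketch run on the sample data from the question. It assumes GNU csplit (for -z); the scratch directory and the `part.` prefix are choices made for the demo, not something the answer prescribes.

```shell
tmp=$(mktemp -d)    # scratch directory (assumption for this demo)
printf '%s\n' '1 2 3' '4 5 6' '7 8 9' '10 11 12' '13 14 15' '16 17 18' > "$tmp/file"

# -s: do not print the size of each piece
# -z: remove empty pieces (GNU extension)
# -f: output prefix instead of the default "xx"
# -n: number of digits in the suffix instead of the default 2
csplit -s -z -f "$tmp/part." -n 1 "$tmp/file" 2 5

# Pieces are numbered from 0: part.0 (line 1), part.1 (lines 2-4), part.2 (lines 5-6)
ls "$tmp"
```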

Here is a little awk that does the trick for you:
awk -v v="2 5" 'BEGIN{v=" 1 "v" "}
index(v," "FNR" ") { close(f); f=FILENAME "." (++i) }
{ print > f }' file
This will create files of the form: file.1, file.2, file.3, ...

Ok, I've gone totally mental this morning, and I came up with a Sed program (with functions, loops, and all) to generate a Sed script to make what you want.
Usage:
put the script in a file (e.g. make.sed) and chmod +x it;
then use it as the script for this Sed command sed "$(./make.sed <<< '1 4')" inputfile¹
Note that ./make.sed <<< '1 4' generates the following sed script:
1,1{w file.1
be};1,4{w file.2
be};1,${w file.3
be};:e
¹ Unfortunately I misread the question, so my script works taking the line number of the last line of each block that you want to write to file, so your 2 5 has to be changed to 1 4 to be fed to my script.
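The conversion described in the footnote is just "subtract one from each number". A minimal sketch, assuming the vector is a space-separated string of start lines as in the question (the variable names are made up for the illustration):

```shell
vec="2 5"    # first line of each new piece, as in the question
ends=""
for n in $vec; do
    # last line of the previous piece = first line of the next piece - 1
    ends="${ends:+$ends }$((n - 1))"
done
echo "$ends"    # prints: 1 4
```

The resulting string is what would be fed to make.sed in place of '1 4'.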
#!/usr/bin/env -S sed -Ef
###########################################################
# Main
# make a template sed script, in which we only have to increase
# the number of each numbered output file, each of which is marked
# with a trailing \x0
b makeSkeletonAndMarkNumbers
:skeletonMade
# try putting a stencil on the rightmost digit of the first marked number on
# the line and loop, otherwise exit
b stencilLeastDigitOfNextMarkedNumber
:didStencilLeastDigitOfNextMarkedNumber?
t nextNumberStenciled
b exit
# continue processing next number by adding 1
:nextNumberStenciled
b numberAdd1
:numberAdded1
# try putting a stencil on the rightmost digit of the next marked number on
# the line and loop, otherwise we're done with the first marked number, we can
# clean its marker, and we can loop
b stencilNextNumber
:didStencilNextNumber?
t nextNumberStenciled
b removeStencilAndFirstMarker
:removeStencilAndFirstMarkerDone
b stencilLeastDigitOfNextMarkedNumber
###########################################################
# puts a \n on each side of the first digit marked on the right by \x0
:stencilLeastDigitOfNextMarkedNumber
tr
:r
s/([0-9])\x0;/\n\1\n\x0;/1
b didStencilLeastDigitOfNextMarkedNumber?
###########################################################
# makes desired sed script skeleton from space-separated numbers
:makeSkeletonAndMarkNumbers
s/$/ $/
s/([1-9]+|\$) +?/1,\1{w file.0\x0;be};/g
s/$/:e/
b skeletonMade
###########################################################
# moves the stencil to the next number followed by \x0
:stencilNextNumber
trr
:rr
s/\n(.)\n([^\x0]*\x0[^\x0]+)([0-9])\x0/\1\2\n\3\n\x0/
b didStencilNextNumber?
###########################################################
# +1 with carry to last digit on the line enclosed in between two \n characters
:numberAdd1
#i\
#\nbefore the sum:
#l
:digitPlus1
h
s/.*\n([0-9])\n.*/\1/
y/0123456789/1234567890/
G
s/(.)\n(.*)\n.\n/\2\n\1\n/
trrr
:rrr
/[0-9]\n0\n/s/(.)\n0\n/\n\1\n0/
t digitPlus1
# the following line can be problematic for lines starting with number
/[^0-9]\n0\n/s/(.)\n0\n/\n\1\n10/
b numberAdded1
###########################################################
# remove stencil and first marker on line
:removeStencilAndFirstMarker
s/\n(.)\n/\1/
s/\x0//
b removeStencilAndFirstMarkerDone
###########################################################
:exit
# a bit of post processing the `w` command has to be followed
# by the filename, then by a newline, so we change the appropriate `;`s to `\n`.
s/(\{[^;]+);/\1\n/g

Related

grep single digit occurs one time in line

I need help with a grep command that matches lines in which some digit occurs exactly once.
My solution doesn't work:
egrep "^(\s*[1]\s*)(\s*[^1]+\s*)+$|^(\s*[^1]\s*)(\s*[1]+\s*)+$|^(\s*[2]\s*)(\s*[^2]+\s*)+$|^(\s*[^2]\s*)(\s*[2]+\s*)+$|^(\s*[3]\s*)(\s*[^3]+\s*)+$|^(\s*[^3]\s*)(\s*[3]+\s*)+$|^(\s*[4]\s*)(\s*[^4]+\s*)+$|^(\s*[^4]\s*)(\s*[4]+\s*)+$|^(\s*[5]\s*)(\s*[^5]+\s*)+$|^(\s*[^5]\s*)(\s*[5]+\s*)+$|^(\s*[6]\s*)(\s*[^6]+\s*)+$|^(\s*[^6]\s*)(\s*[6]+\s*)+$|^(\s*[7]\s*)(\s*[^7]+\s*)+$|^(\s*[^7]\s*)(\s*[7]+\s*)+$|^(\s*[8]\s*)(\s*[^8]+\s*)+$|^(\s*[^8]\s*)(\s*[8]+\s*)+$|^(\s*[9]\s*)(\s*[^9]+\s*)+$|^(\s*[^9]\s*)(\s*[9]+\s*)+$"
For example, in this text:
012 210 5
6343 232 5 3423
345 689 7 986 543012 210 5
grep highlights only the second line.
I want grep to highlight every line, because in each line some digit occurs exactly once: in the first line it is 5, in the second line it is 5, and in the third line it is 7.
A pattern that detects if a digit is unique on a line (if I'm understanding the question correctly):
For the digit 5:
^[^5]*(5)[^5]*$
^ // start of line
[^5]* // any char not 5, 0-or-more
(5) // 5
[^5]* // any char not 5, 0-or-more
$ // end of line
To test all digits, it becomes:
^(?:[^0]*(0)[^0]*|[^1]*(1)[^1]*)$ and so on for all digits. The digit is captured in the group belonging to whichever alternative matched.
Demo
Steps: 509 steps
Flags: g, m
I'm really unsure what the expected output should be (PLEASE UPDATE THE QUESTION WITH IT), but here is an attempt using GNU awk. First, the test data:
$ cat foo
012 210 5
6343 232 5 3423
345 689 7 986 543012 210 5
234 12 43
Then:
$ awk -F '' '{
delete a
for(i=1;i<=NF;i++)
if($i~/[0-9]/)
a[$i]++
for(i in a)
if(a[i]==1 && match($0, "[^" i "]*" i "[^" i "]*")) {
print $0
next # second data line has 2 matches
}
}' foo
012 210 5
6343 232 5 3423
345 689 7 986 543012 210 5
234 12 43
Then again, it's shorter just to:
$ awk '{for(i=0;i<=9;i++)if(gsub(i,i,$0)==1){print;next}}' foo
I'm not absolutely sure what you're after, but if it's matching lines that only contain one instance of a digit, try this:
[^0]*0[^0]*|[^1]*1[^1]*|[^2]*2[^2]*|[^3]*3[^3]*|[^4]*4[^4]*|[^5]*5[^5]*|[^6]*6[^6]*|[^7]*7[^7]*|[^8]*8[^8]*|[^9]*9[^9]*
or grepified
grep -x "[^0]*0[^0]*\|[^1]*1[^1]*\|[^2]*2[^2]*\|[^3]*3[^3]*\|[^4]*4[^4]*\|[^5]*5[^5]*\|[^6]*6[^6]*\|[^7]*7[^7]*\|[^8]*8[^8]*\|[^9]*9[^9]*"
(-x makes grep match the full line.)
The regex uses 10 identical alternations, one for each digit. Each alternation makes sure that zero or more of anything but the digit starts the line, matches the one allowed digit, and makes sure that zero or more of anything but the digit ends the line.
See it here at regex101.
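Since the ten alternations differ only in the digit, the pattern can also be generated instead of typed out. A sketch of that, assuming a grep that supports -E and -x; the sample lines include two that should not match (one where every digit repeats, one with no digits at all):

```shell
# Build "[^0]*0[^0]*|[^1]*1[^1]*|...|[^9]*9[^9]*" in a loop.
pattern=""
for d in 0 1 2 3 4 5 6 7 8 9; do
    pattern="${pattern:+$pattern|}[^$d]*$d[^$d]*"
done

# -x: the pattern must match the whole line; -E: extended regex, so | needs no backslash
matches=$(printf '%s\n' '012 210 5' '6343 232 5 3423' '234 12 43' '1122' 'no digits' |
    grep -xE "$pattern")
printf '%s\n' "$matches"
```

Only the first three sample lines are printed: each contains at least one digit that occurs exactly once.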

Removing Leading 0 and applying Regex to Sed

I have several file names, for ease I've put them in a file as follows:
01.action1.txt
04action2.txt
12.action6.txt
2.action3.txt
020.action9.txt
10action4.txt
15action7.txt
021action10.txt
11.action5.txt
18.action8.txt
As you can see, the formats aren't consistent. What I'm trying to do is extract the leading numbers from these file names: 1, 4, 12, 2, 20, etc.
I have the following regex
(\.)?action\d{1,}.txt
This successfully matches .action[number].txt, but I also need to match the leading 0 and use the whole thing in a sed substitution that replaces the match with nothing, so that I'm left with only the leading numbers. I'm having trouble matching the leading 0 and applying the whole thing to sed.
Thanks
With GNU sed:
sed -r 's/0*([0-9]*).*/\1/' file
Output:
1
4
12
2
20
10
15
21
11
18
See: The Stack Overflow Regular Expressions FAQ
I don't know if the below awk is helpful but it works as well:
awk '{print $1 + 0}' file
1
4
12
2
20
10
15
21
11
18
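If the names are already in shell variables (for example while looping over files), the same extraction can be done without sed or awk. A sketch using only POSIX parameter expansion; `strip_leading_number` is a hypothetical helper name, not something from the answers above:

```shell
strip_leading_number() {
    # Hypothetical helper: print the leading number of "$1" without leading zeros.
    num=${1%%[!0-9]*}    # cut everything from the first non-digit onwards
    while [ "${#num}" -gt 1 ] && [ "${num#0}" != "$num" ]; do
        num=${num#0}     # drop one leading zero at a time, keeping a lone "0"
    done
    printf '%s\n' "$num"
}

for name in 01.action1.txt 04action2.txt 020.action9.txt 10action4.txt; do
    strip_leading_number "$name"
done
```

For the four sample names this prints 1, 4, 20 and 10, one per line.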

Bash script to split a file by grep everything till the second time match in a column into one file and the rest into another

I am trying to split a file with data like
2 0.2345
58 0.3608
59 0.3504
60 0.4175
65 0.3995
66 0.3972
67 0.4411
411 0.3455
2 1.3867
3 1.4532
4 1.2925
5 1.2473
6 1.2605
7 1.2463
8 1.1667
9 1.1312
10 1.1502
11 1.1190
12 1.0346
13 1.0291
409 0.8025
410 0.8695
411 0.9154
For this kind of data, I am trying to split this into two files:
File 1 : 2 -411 (first Column match)
File 2 : 2-411 (second occurrence in the first column)
For this, I wrote these two one liners:
awk '1;/411/{exit}' $1 > File1_$1 ;
awk '/411/,0' $1 | awk '{if (NR!=1) {print}}' > File2_$1
The problem is that if there is a match of "411" (as in "67 0.4411") on the second column, my script prematurely cuts from that line.
I am unable to restrict the match to the first column; 411 can occur any number of times in the second column, and those occurrences are not of interest.
Any help would be greatly appreciated.
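An idea could be to use this combination of commands:
awk '{ if ($1 >= 2 && $1 <= 411) print $0 }{if ($1=="411") exit}' input > f1
then
grep -v -x -F -f f1 input > f2
(-F and -x make grep treat each line of f1 as a literal, whole-line pattern rather than as a regex.)
If your input file is bigger, you should repeat the second step.
I don't know anything about Bash, but for the regex I think you should indicate that the line begins with 411, like this: ^411 (\b411 only requires a word boundary, so it would still match 0.411).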
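The split can also be done in a single awk pass by comparing the first field instead of matching a regex against the whole line. This is a sketch under the assumption that the first block ends at the first line whose first column equals 411; the `key`, `out1` and `out2` variables and the shortened sample data are illustrative.

```shell
tmp=$(mktemp -d)    # scratch directory (assumption for this demo)
printf '%s\n' '2 0.2345' '67 0.4411' '411 0.3455' \
              '2 1.3867' '410 0.8695' '411 0.9154' > "$tmp/input"

awk -v key=411 -v out1="$tmp/File1" -v out2="$tmp/File2" '
{ print > (second ? out2 : out1) }   # write the line to the current block
$1 == key { second = 1 }             # after the first key in column 1, switch files
' "$tmp/input"

cat "$tmp/File2"
```

Because only `$1` is compared, a value such as 0.4411 in the second column never triggers the switch.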

Substring in UNIX

Suppose I have a string "123456789".
I want to extract the 3rd, 6th, and 8th element. I guess I can use
cut -3, -6, -8
But this gives
368
Suppose I want to separate them by a white space to get
3 6 8
What should I do?
Actually shell parameter expansion lets you do substring slicing directly, so you could just do:
x='123456789'
echo "${x:3:1}" "${x:6:1}" "${x:8:1}"
Update
To do this over an entire file, read the line in a loop:
while read x; do
echo "${x:3:1}" "${x:6:1}" "${x:8:1}"
done < file
(By the way, bash slicing is zero-indexed, so if you want the numbers '3', '6' and '8' you'd really want ${x:2:1} ${x:5:1} and ${x:7:1}.)
You can use the sed tool and issue this command in your terminal:
sed -r "s/^..(.)..(.).(.).*$/\1 \2 \3/"
Explained RegEx: http://regex101.com/r/fH7zW6
To "generalize" this on a file you can pipe it after a cat like so:
cat file.txt|sed -r "s/^..(.)..(.).(.).*$/\1 \2 \3/"
Perl one-liner.
perl -lne '@A = split //; print "$A[2] $A[5] $A[7]"' file
Using cut:
$ cat input
1234567890
2345678901
3456789012
4567890123
5678901234
$ cut -b3,6,8 --output-delimiter=" " input
3 6 8
4 7 9
5 8 0
6 9 1
7 0 2
The -b option selects only the specified bytes. The output delimiter can be specified using --output-delimiter.
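Where cut's --output-delimiter is not available (it is a GNU option), awk's substr can do the same job; print separates its arguments with a space by default. A minimal sketch on two of the sample lines:

```shell
# substr(s, start, length) is 1-indexed, so positions 3, 6 and 8 are used directly.
out=$(printf '%s\n' 1234567890 2345678901 |
    awk '{ print substr($0, 3, 1), substr($0, 6, 1), substr($0, 8, 1) }')
printf '%s\n' "$out"
```

This prints "3 6 8" and "4 7 9".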

remove single elements in a text file in bash

Basically what I have is a text file (file.txt), which contains lines of numbers (lines aren't necessarily the same length) e.g.
1 2 3 4
5 6 7 8
9 10 11 12 13
What I need to do is write new files with each of these numbers deleted, one at a time, with replacement e.g. the first new file will contain
2 3 4 <--- 1st element removed
5 6 7 8
9 10 11 12 13
and the 7th file will contain
1 2 3 4
5 6 8 <--- 7th element removed here
9 10 11 12 13
To generate these, I'm looping through each line, and then each element in each line. E.g. for the 7th file, where I remove the third element of the second line, I'm trying to do this by reading in the line, removing the appropriate element, then reinserting this new line
$lineNo is 2 (second line)
$line is 5 6 7 8
with cut, I remove the third number, making $newline 5 6 8
Then I try to replace the line $lineNo in file.txt with $newline using sed:
sed -n '$lineNo s/.*/'$newline'/' > file.txt
This is totally not working. I get an error
sed: can't read 25.780000: No such file or directory
(where 25.780000 is a number in my text file. It looks like it's trying to use $newline to read files or something)
I have reason to suspect my way of stating which line to replace isn't working either :(
My question is, a) is there a better way to do this rather than sed, and b) if sed is the way to go, what am I doing wrong?
Thanks!!
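On part (b), judging from the command as posted, three things appear to go wrong: the single quotes stop the shell from expanding $lineNo; the unquoted $newline is split into several words, so sed takes the later words as file names (hence "can't read 25.780000"); and -n suppresses all output while > file.txt overwrites the file. A sketch of the substitution with the quoting fixed, writing to a new file (the file_7.txt name is just for the illustration):

```shell
tmp=$(mktemp -d)    # scratch directory (assumption for this demo)
printf '%s\n' '1 2 3 4' '5 6 7 8' '9 10 11 12 13' > "$tmp/file.txt"

lineNo=2
newline='5 6 8'

# Double quotes let the shell expand both variables inside one sed script argument.
# This assumes $newline contains no "/" or "&", which holds for plain numbers.
sed "${lineNo}s/.*/${newline}/" "$tmp/file.txt" > "$tmp/file_7.txt"

cat "$tmp/file_7.txt"
```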
filename=file.txt
i=1
while [[ -s $filename ]]; do
new=file_$i.txt
awk 'NR==1 {if (NF==1) next; else sub(/^[^ ]+ /, "")} 1' $filename > $new
((i++))
filename=$new
done
This leaves a space at the beginning of the first line for each new file, and when a line becomes empty the line is removed. The loop ends when the last generated file is empty.
Update due to requirement clarification:
words=$(wc -w < file.txt)
for ((i=1; i<=words; i++)); do
awk -v n=$i '
words < n && n <= words+NF {$(n-words) = "" }
{words += NF; print}
' file.txt > file_$i.txt
done
Unless I misunderstood the question, the following should work, although it will be pretty slow if your files are large:
#! /bin/bash
remove_by_value()
{
local TO_REMOVE=$1
while read line; do
out=
for word in $line; do [ "$word" = "$TO_REMOVE" ] || out="$out $word"; done
echo "${out/ }"
done < $2
}
remove_by_position()
{
local NTH=$1
while read line; do
out=
for word in $line; do
((--NTH == 0)) || out="$out $word"
done
echo "${out/ }"
done < $2
}
FILE=$1
shift
for number; do
echo "Removing $number"
remove_by_position $number "$FILE"
done
This will dump all the output to stdout, but it should be trivial to change it so the output for each removed number is redirected (e.g. with remove_by_position $number $FILE > $FILE.$$ && mv $FILE.$$ $FILE.$number and proper quoting). Run it as, say,
$ bash script.sh file.txt $(seq 11)
I have to admit, that I'm a bit surprised how short the other solutions are.
#!/bin/bash
#
file=$1
lines=$(cat $file | wc -l)
out=0
dropFromLine () {
file=$1
row=$2
to=$((row-1))
from=$((row+1))
linecontent=($(sed -n "${row}p" $file))
# echo " linecontent: " ${linecontent[@]}
linelen=${#linecontent[@]}
# echo " linelength: " $linelen
for n in $(seq 0 $linelen)
do
(
if [[ $row > 1 ]] ; then sed -n "1,${to}p" $file ;fi
for i in $(seq 0 $linelen)
do
if [[ $n != $i ]]
then
echo -n ${linecontent[$i]}" "
fi
done
echo
# echo "mod - drop " ${linecontent[$n]}
sed -n "$from,${lines}p" $file
) > outfile-${out}.txt
out=$((out+1))
done
}
for row in $(seq 1 $lines)
do
dropFromLine $file $row
done
invocation:
./dropFromRow.sh num.dat
num.dat:
1 2 3 4
5 6 7 8
9 10 11
result:
outfile-0 outfile-10 outfile-12 outfile-2 outfile-4 outfile-6 outfile-8
outfile-1 outfile-11 outfile-13 outfile-3 outfile-5 outfile-7 outfile-9
samples:
asux:~/proj/mini/forum > cat outfile-0
2 3 4
5 6 7 8
9 10 11
asux:~/proj/mini/forum > cat outfile-1
1 3 4
5 6 7 8
9 10 11
One way using perl:
Content of file.txt:
1 2 3 4
5 6 7 8
9 10 11 12 13
Content of script.pl:
use warnings;
use strict;
## Read all input to a scalar variable as a single string.
my $str;
{
local $/ = undef;
$str = <>;
}
## Loop for each number found.
while ( $str =~ m/(\d+)(?:\h*)?/g ) {
## Open file for writing. The name of the file will be
## the number matched in previous regexp.
open my $fh, q[>], ($1 . q[.txt]) or
die qq[Couldn't create file $1.txt\n];
## Print everything prior to matched string plus everything
## after matched string.
printf $fh qq[%s%s], $`, $';
## Close file.
close $fh;
}
Run it like:
perl script.pl file.txt
Show files created:
ls [0-9]*.txt
With output:
10.txt 11.txt 12.txt 13.txt 1.txt 2.txt 3.txt 4.txt 5.txt 6.txt 7.txt 8.txt 9.txt
Show content of one of them:
cat 9.txt
Output:
1 2 3 4
5 6 7 8
10 11 12 13