I have a large file, about 10GB. I have a vector of line numbers at which I would like to split the file. Ideally I would like to accomplish this using command-line utilities. As an example:
File:
1 2 3
4 5 6
7 8 9
10 11 12
13 14 15
16 17 18
Vector of line numbers:
2 5
Desired output:
File 1:
1 2 3
File 2:
4 5 6
7 8 9
10 11 12
File 3:
13 14 15
16 17 18
Using awk:
$ awk -v v="2 5" ' # space-separated vector of indexes
BEGIN {
n=split(v,t) # reshape vector to a hash
for(i=1;i<=n;i++)
a[t[i]]
i=1 # filename index
}
{
if(NR in a) { # file record counter in the vector
close("file" i) # close previous file
i++ # increase filename index
}
print > ("file" i) # output to file
}' file
Sample output:
$ cat file2
4 5 6
7 8 9
10 11 12
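For a 10GB input it may be worth checking that nothing was lost. A minimal sketch, assuming the file1, file2, file3 naming from the script above and a throwaway directory made with mktemp: the pieces, concatenated in order, should compare equal to the original.

```shell
# Sketch only: work in a scratch directory so nothing real is overwritten.
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf '%s\n' '1 2 3' '4 5 6' '7 8 9' '10 11 12' '13 14 15' '16 17 18' > file

# Same splitting logic as above, condensed.
awk -v v="2 5" '
BEGIN { n = split(v, t); for (i = 1; i <= n; i++) a[t[i]]; i = 1 }
NR in a { close("file" i); i++ }
{ print > ("file" i) }' file

# The pieces, concatenated in order, should be byte-identical to the input.
if cat file1 file2 file3 | cmp -s - file; then status=identical; else status=different; fi
echo "$status"
```

With many pieces the cat argument list has to be built in numeric order, since a plain glob such as file* sorts file10 before file2 (and would also pick up the input itself).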
Very slightly different from James's and kvantour's solutions: passing the vector to awk as a "file"
vec="2 5"
awk '
NR == FNR {nr[$1]; next}
FNR == 1 {filenum = 1; f = FILENAME "." filenum}
FNR in nr {
close(f)
f = FILENAME "." ++filenum
}
{print > f}
' <(printf "%s\n" $vec) file
$ ls -l file file.*
-rw-r--r-- 1 glenn glenn 48 Jul 17 10:02 file
-rw-r--r-- 1 glenn glenn 7 Jul 17 10:09 file.1
-rw-r--r-- 1 glenn glenn 23 Jul 17 10:09 file.2
-rw-r--r-- 1 glenn glenn 18 Jul 17 10:09 file.3
This might work for you:
csplit -z file 2 5
or if you want regexp:
csplit -z file /2/ /5/
With the default values, the output files will be named xxnn where nn starts at 00 and is incremented by 1.
N.B. The -z option (--elide-empty-files) suppresses empty output files.
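A quick sketch of the line-number form on the question's sample, assuming GNU csplit (-z is a GNU option). The part. prefix and the scratch directory are illustrative choices, not part of the answer above:

```shell
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf '%s\n' '1 2 3' '4 5 6' '7 8 9' '10 11 12' '13 14 15' '16 17 18' > file

vec="2 5"
# -s: do not print the byte counts, -z: drop empty pieces,
# -f: prefix for the output names (the default prefix is xx)
csplit -s -z -f part. file $vec     # $vec is left unquoted on purpose

pieces=$(echo part.*)
second=$(cat part.01)
printf '%s\n' "$pieces" "$second"
```

Each line number starts a new piece, so the piece before it ends one line earlier, which is the layout asked for in the question.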
Here is a little awk that does the trick for you:
awk -v v="2 5" 'BEGIN{v=" 1 "v" "}
index(v," "FNR" ") { close(f); f=FILENAME "." (++i) }
{ print > f }' file
This will create files of the form: file.1, file.2, file.3, ...
Ok, I've gone totally mental this morning, and I came up with a Sed program (with functions, loops, and all) to generate a Sed script to make what you want.
Usage:
put the script in a file (e.g. make.sed) and chmod +x it;
then use it as the script for this Sed command sed "$(./make.sed <<< '1 4')" inputfile¹
Note that ./make.sed <<< '1 4' generates the following sed script:
1,1{w file.1
be};1,4{w file.2
be};1,${w file.3
be};:e
¹ Unfortunately I misread the question, so my script works taking the line number of the last line of each block that you want to write to file, so your 2 5 has to be changed to 1 4 to be fed to my script.
#!/usr/bin/env -S sed -Ef
###########################################################
# Main
# make a template sed script, in which we only have to increase
# the number of each numbered output file, each of which is marked
# with a trailing \x0
b makeSkeletonAndMarkNumbers
:skeletonMade
# try putting a stencil on the rightmost digit of the first marked number on
# the line and loop, otherwise exit
b stencilLeastDigitOfNextMarkedNumber
:didStencilLeastDigitOfNextMarkedNumber?
t nextNumberStenciled
b exit
# continue processing next number by adding 1
:nextNumberStenciled
b numberAdd1
:numberAdded1
# try putting a stencil on the rightmost digit of the next marked number on
# the line and loop, otherwise we're done with the first marked number, we can
# clean its marker, and we can loop
b stencilNextNumber
:didStencilNextNumber?
t nextNumberStenciled
b removeStencilAndFirstMarker
:removeStencilAndFirstMarkerDone
b stencilLeastDigitOfNextMarkedNumber
###########################################################
# puts a \n on each side of the first digit marked on the right by \x0
:stencilLeastDigitOfNextMarkedNumber
tr
:r
s/([0-9])\x0;/\n\1\n\x0;/1
b didStencilLeastDigitOfNextMarkedNumber?
###########################################################
# makes desired sed script skeleton from space-separated numbers
:makeSkeletonAndMarkNumbers
s/$/ $/
s/([1-9]+|\$) +?/1,\1{w file.0\x0;be};/g
s/$/:e/
b skeletonMade
###########################################################
# moves the stencil to the next number followed by \x0
:stencilNextNumber
trr
:rr
s/\n(.)\n([^\x0]*\x0[^\x0]+)([0-9])\x0/\1\2\n\3\n\x0/
b didStencilNextNumber?
###########################################################
# +1 with carry to last digit on the line enclosed in between two \n characters
:numberAdd1
#i\
#\nbefore the sum:
#l
:digitPlus1
h
s/.*\n([0-9])\n.*/\1/
y/0123456789/1234567890/
G
s/(.)\n(.*)\n.\n/\2\n\1\n/
trrr
:rrr
/[0-9]\n0\n/s/(.)\n0\n/\n\1\n0/
t digitPlus1
# the following line can be problematic for lines starting with number
/[^0-9]\n0\n/s/(.)\n0\n/\n\1\n10/
b numberAdded1
###########################################################
# remove stencil and first marker on line
:removeStencilAndFirstMarker
s/\n(.)\n/\1/
s/\x0//
b removeStencilAndFirstMarkerDone
###########################################################
:exit
# a bit of post processing the `w` command has to be followed
# by the filename, then by a newline, so we change the appropriate `;`s to `\n`.
s/(\{[^;]+);/\1\n/g
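For comparison, a much less adventurous way to get an equivalent sed script is to build it with a shell loop. This is a different technique from the Sed program above, sketched under the assumptions that the vector holds the first line of each new block (as in the question), is ascending, and contains no 1. The file.N names mirror the ones used above.

```shell
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf '%s\n' '1 2 3' '4 5 6' '7 8 9' '10 11 12' '13 14 15' '16 17 18' > inputfile

vec="2 5"          # first line of each new block
script=""
start=1
n=1
for v in $vec; do
  # one "first,last w file" command per line: w reads the name up to the newline
  script="${script}${start},$((v - 1))w file.${n}
"
  start=$v
  n=$((n + 1))
done
script="${script}${start},\$w file.${n}"   # last block runs to the end of input

printf '%s\n' "$script"
sed -n "$script" inputfile
```

sed opens every w file when it starts, so a very long vector may run into the open-files limit.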
I need help with a grep command:
it should match lines in which some digit occurs exactly once.
My solution doesn't work:
egrep "^(\s*[1]\s*)(\s*[^1]+\s*)+$|^(\s*[^1]\s*)(\s*[1]+\s*)+$|^(\s*[2]\s*)(\s*[^2]+\s*)+$|^(\s*[^2]\s*)(\s*[2]+\s*)+$|^(\s*[3]\s*)(\s*[^3]+\s*)+$|^(\s*[^3]\s*)(\s*[3]+\s*)+$|^(\s*[4]\s*)(\s*[^4]+\s*)+$|^(\s*[^4]\s*)(\s*[4]+\s*)+$|^(\s*[5]\s*)(\s*[^5]+\s*)+$|^(\s*[^5]\s*)(\s*[5]+\s*)+$|^(\s*[6]\s*)(\s*[^6]+\s*)+$|^(\s*[^6]\s*)(\s*[6]+\s*)+$|^(\s*[7]\s*)(\s*[^7]+\s*)+$|^(\s*[^7]\s*)(\s*[7]+\s*)+$|^(\s*[8]\s*)(\s*[^8]+\s*)+$|^(\s*[^8]\s*)(\s*[8]+\s*)+$|^(\s*[9]\s*)(\s*[^9]+\s*)+$|^(\s*[^9]\s*)(\s*[9]+\s*)+$"
For example, in this text:
012 210 5
6343 232 5 3423
345 689 7 986 543012 210 5
grep highlights only the second line.
I want grep to highlight every line, because in each line some digit occurs exactly once: 5 in the first line, 5 in the second line, and 7 in the third line.
A pattern that detects if a digit is unique on a line (if I'm understanding the question correctly):
For the digit 5:
^[^5]*(5)[^5]*$
^ // start of line
[^5]* // any char not 5, 0-or-more
(5) // 5
[^5]* // any char not 5, 0-or-more
$ // end of line
To test all digits, it becomes:
^(?:[^0]*(0)[^0]*|[^1]*(1)[^1]*)$ and so on for all digits. The digit is captured in the group of whichever alternative matched (group 1 for 0, group 2 for 1, and so on).
Demo
Steps: 509 steps
Flags: g, m
I'm really unsure what the expected output should be (PLEASE UPDATE IT PROPERLY TO THE QUESTION), but here is an attempt using GNU awk. First, the test data:
$ cat foo
012 210 5
6343 232 5 3423
345 689 7 986 543012 210 5
234 12 43
Then:
$ awk -F '' '{
delete a
for(i=1;i<=NF;i++)
if($i~/[0-9]/)
a[$i]++
for(i in a)
if(a[i]==1 && match($0, "[^" i "]*" i "[^" i "]*")) {
print $0
next # second data line has 2 matches
}
}' foo
012 210 5
6343 232 5 3423
345 689 7 986 543012 210 5
234 12 43
Then again, it's shorter just to:
$ awk '{for(i=0;i<=9;i++)if(gsub(i,i,$0)==1){print;next}}' foo
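If it helps to see which digits qualify on each line, the same gsub counting trick can report them. A minimal sketch, assuming any POSIX awk; the third sample line and the - placeholder are made up for illustration:

```shell
result=$(printf '%s\n' '012 210 5' '6343 232 5 3423' '1221' |
  awk '{
    out = ""
    for (d = 0; d <= 9; d++) {
      tmp = $0                      # count on a copy of the line
      if (gsub(d, "", tmp) == 1)    # gsub returns the number of replacements
        out = out d
    }
    print (out == "" ? "-" : out)   # "-" when no digit is unique
  }')
printf '%s\n' "$result"
```

This prints 5 for the first line, 56 for the second (both 5 and 6 occur once there) and - for the third.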
I'm not absolutely sure what you're after, but if it's matching lines that only contain one instance of a digit, try this:
[^0]*0[^0]*|[^1]*1[^1]*|[^2]*2[^2]*|[^3]*3[^3]*|[^4]*4[^4]*|[^5]*5[^5]*|[^6]*6[^6]*|[^7]*7[^7]*|[^8]*8[^8]*|[^9]*9[^9]*
or grepified
grep -x "[^0]*0[^0]*\|[^1]*1[^1]*\|[^2]*2[^2]*\|[^3]*3[^3]*\|[^4]*4[^4]*\|[^5]*5[^5]*\|[^6]*6[^6]*\|[^7]*7[^7]*\|[^8]*8[^8]*\|[^9]*9[^9]*"
(-x makes grep match the full line.)
The regex uses ten alternatives of the same shape, one for each digit. Each alternative:
makes sure zero or more of anything but the digit starts the line,
matches the one allowed digit,
makes sure zero or more of anything but the digit ends the line.
See it here at regex101.
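Since the ten alternatives differ only in the digit, the pattern can also be generated instead of typed. A small sketch, assuming a grep that supports -E and -x; the pattern and matches variable names and the 1221 test line are illustrative:

```shell
pattern=""
for d in 0 1 2 3 4 5 6 7 8 9; do
  # append "|" only when pattern is already non-empty
  pattern="${pattern:+$pattern|}[^${d}]*${d}[^${d}]*"
done

# -E: extended regex, so the alternation is a plain |
# -x: the whole line must match, -c: count matching lines
matches=$(printf '%s\n' '012 210 5' '6343 232 5 3423' '1221' |
  grep -E -x -c "$pattern")
echo "$matches"
```

The first two lines match (5 is unique in each); 1221 does not, because both of its digits occur twice.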
Why isn't my regex working? It just returns back the original file. My file looks like this (for a few hundred lines):
1 Germany 1765 0 Equal
2 Argentina 1631 0 Equal
3 Colombia 1488 1 Up
4 Netherlands 1456 -1 Down
5 Belgium 1444 0 Equal
6 Brazil 1291 1 Up
7 Uruguay 1243 -1 Down
8 Spain 1228 -1 Down
9 France 1202 1 Up
...
192 US Virgin Islands 28 -1 Down
And I want this:
Germany,1
Argentina,2
Colombia,3
...
US Virgin Islands,192
This is the regex I tried:
sed 's/\([0-9]*\)\t\([a-zA-Z]*\)/\2,\1/g' <fifa.csv >fifa.csv
But it just returns the original file.
EDIT:
Now I tried
sed 's/\([0-9]*\)\t\([a-zA-Z]*\)/\2,\1/g' <fifa.csv >fifa.csv
and got
,1 Germany,,1765Equal,0,
,2 Argentina,,1631Equal,0,
,3 Colombia,,1488Up,1,
,4 Netherlands,,1456-Down,1,
,5 Belgium,,1444Equal,0,
You could try the below sed command if the fields are tab-separated.
sed 's/^\([0-9]\+\)\t\([^\t]*\).*/\2,\1/' file
Add the inline-edit option -i to save the changes made.
sed -i 's/^\([0-9]\+\)\t\([^\t]*\).*/\2,\1/' file
^ anchors the match at the start of the line. \+ repeats the preceding item one or more times; sed uses BRE by default, so the + has to be escaped to get that meaning (this is a GNU extension). [^\t]* matches zero or more characters other than a tab.
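A small sketch of the command on two sample rows, assuming GNU sed (for \+ and \t) and tab-separated input; the scratch directory and fifa.tmp are illustrative names. It writes to a temporary file rather than redirecting back into the input.

```shell
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf '1\tGermany\t1765\t0\tEqual\n192\tUS Virgin Islands\t28\t-1\tDown\n' > fifa.csv

# Write to a scratch file first: with "sed ... <fifa.csv >fifa.csv" the shell
# truncates fifa.csv before sed gets to read it.
sed 's/^\([0-9]\+\)\t\([^\t]*\).*/\2,\1/' fifa.csv > fifa.tmp && mv fifa.tmp fifa.csv

result=$(cat fifa.csv)
printf '%s\n' "$result"
```

Because the second group stops at the next tab rather than at the next space, multi-word names such as US Virgin Islands come through intact.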
The following is what you are looking for. The -i option specifies that files are to be edited in-place.
sed -i 's/^\([0-9]\+\)\t\([^\t]*\).*/\2,\1/' fifa.csv
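awk -F'\t' '{print $2 "," $1}' YourFile
Not sed, but easier to manage. The -F'\t' assumes tab-separated fields, so that a multi-word name such as US Virgin Islands stays in one field.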
I have a tab-delimited file such as the one below. I want to print a specific number of records with the smallest values in each group. A group starts at a line with E in the last column, and within a group the records are sorted by that column, so the smallest values are the last lines of the group. For example, I want to print the two lines (records) that are furthest from each E line: the last two records of the group that starts at Jack, and the last two of the group that starts at Gareth.
Jack 2 98 E
Jones 6 25 8.11
Mike 8 11 5.22
Jasmine 5 7 4
Simran 5 7 3
Gareth 1 85 E
Jones 4 76 178.32
Mark 11 12 157.3
Steve 17 8 88.5
Clarke 3 7 12.3
Vid 3 7 2.3
I want my result to be
Jasmine 5 7 4
Simran 5 7 3
Clarke 3 7 12.3
Vid 3 7 2.3
The number of records can differ from group to group. I tried with grep
grep -B 2 F$ inputfile.txt
But it repeats the results with E and also does not work with the last record.
quick & dirty:
kent$ awk '/E$/&&a&&b{print b RS a;a=b="";next}{b=a;a=$0}END{print b RS a}' file
Jasmine 5 7 4
Simran 5 7 3
Clarke 3 7 12.3
Vid 3 7 2.3
Using arrays of arrays in GNU Awk version 4, you can try
gawk -vnum=2 -f e.awk input.txt
where e.awk is:
$4=="E" {
N[j++]=i
i=0
}
{
l[j][++i]=$0
}
END {
N[j]=i; ngr=j
for (i=1; i<=ngr; i++) {
m=N[i]
for (j=m-num+1; j<=m; j++)
print l[i][j]
}
}
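For awks without arrays of arrays, the same idea can be sketched with one flat buffer per group. This is a variation rather than the script above: it assumes the E line itself should never be printed, and emit, buf and n are names invented for the example.

```shell
result=$(printf '%s\t%s\t%s\t%s\n' \
    Jack 2 98 E  Jones 6 25 8.11  Mike 8 11 5.22  Jasmine 5 7 4  Simran 5 7 3 \
    Gareth 1 85 E  Jones 4 76 178.32  Mark 11 12 157.3  Steve 17 8 88.5 \
    Clarke 3 7 12.3  Vid 3 7 2.3 |
  awk -v num=2 '
    # print the last num buffered records, then reset the buffer
    function emit(   k) {
      for (k = (n > num ? n - num + 1 : 1); k <= n; k++) print buf[k]
      n = 0
    }
    $NF == "E" { emit(); next }   # an E line closes the previous group
    { buf[++n] = $0 }             # buffer the records of the current group
    END { emit() }                # the last group has no closing E line
  ')
printf '%s\n' "$result"
```

Only one group is held in memory at a time, and a group with fewer than num records is printed in full.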
I don't see an F in your last column. But assuming you want to get the 2 lines above every line ending in E:
grep -B2 'E$' <(cat inputfile.txt;echo "E")|sed "/E$\|^--/d"
Should do the trick
'E$' look for an "E" at the end of a line
the -B2 gets the 2 lines before as well
<(cat inputfile.txt;echo "E") adds an "E" as the last line so that the final group is matched as well (this does not change the actual file)
sed "/E$\|^--/d" delete all lines ending in "E" or beginning with "--" (separator of grep)
awk '$2 ~/5|3/ && $3 ~/7/' file
Jasmine 5 7 4
Simran 5 7 3
Clarke 3 7 12.3
Vid 3 7 2.3
I have a whitespace delimited file with a variable number of entries on each line. I want to replace the first two whitespaces with commas to create a comma delimited file with three columns.
Here's my input:
a b 1 2 3 3 2 1
c d 44 55 66 2355
line http://google.com 100 200 300
ef jh 77 88 99
z y 2 3 33
And here's my desired output:
a,b,1 2 3 3 2 1
c,d,44 55 66 2355
line,http://google.com,100 200 300
ef,jh,77 88 99
z,y,2 3 33
I'm trying to use perl regular expressions in a sed command but I can't quite get it to work. First I try capturing a word, followed by a space, then another word, but that only works for lines 1, 2, and 5:
$ cat test | sed -r 's/(\w)\s+(\w)\s+/\1,\2,/'
a,b,1 2 3 3 2 1
c,d,44 55 66 2355
line http://google.com 100 200 300
ef jh 77 88 99
z,y,2 3 33
I also try capturing whitespace, a word, and then more whitespace, but that gives me the same result:
$ cat test | sed -r 's/\s+(\w)\s+/,\1,/'
a,b,1 2 3 3 2 1
c,d,44 55 66 2355
line http://google.com 100 200 300
ef jh 77 88 99
z,y,2 3 33
I also try doing this with the .? wildcard, but that does something funny to line 4.
$ cat test | sed -r 's/\s+(.?)\s+/,\1,/'
a,b,1 2 3 3 2 1
c,d,44 55 66 2355
line http://google.com 100 200 300
ef jh,,77 88 99
z,y,2 3 33
Any help is much appreciated!
How about this:
sed -e 's/\s\+/,/' | sed -e 's/\s\+/,/'
It's probably possible with a single sed command, but this is sure an easy way :)
My output:
a,b,1 2 3 3 2 1
c,d,44 55 66 2355
line,http://google.com,100 200 300
ef,jh,77 88 99
z,y,2 3 33
Try this:
sed -r 's/\s+(\S+)\s+/,\1,/'
Just replaced \w (one "word" char) with \S+ (one or more non-space chars) in one of your attempts.
You can provide multiple commands to a single instance of sed by just providing multiple -e arguments.
To do the first two, just use:
sed -e 's/\s\+/,/' -e 's/\s\+/,/'
This basically runs both commands on the line in sequence, the first doing the first block of whitespace, the second doing the next.
The following transcript shows this in action:
pax$ echo 'a b 1 2 3 3 2 1
c d 44 55 66 2355
line http://google.com 100 200 300
ef jh 77 88 99
z y 2 3 33
' | sed -e 's/\s\+/,/' -e 's/\s\+/,/'
a,b,1 2 3 3 2 1
c,d,44 55 66 2355
line,http://google.com,100 200 300
ef,jh,77 88 99
z,y,2 3 33
Sed s/// supports a way to say which occurrence of a pattern to replace: just add the n to the end of the command to replace only the nth occurrence. So, to replace the first and second occurrences of whitespace, just use it this way:
$ sed 's/  */,/1;s/  */,/2' input
a,b ,1 2 3 3 2 1
c,d ,44 55 66 2355
line,http://google.com 100,200 300
ef,jh ,77 88 99
z,y 2,3 33
EDIT: reading the other proposed solutions, I noticed that the 1 and 2 after s/  */,/ are not only unnecessary but plainly wrong. By default, s/// replaces just the first occurrence of the pattern. So, if we have two identical s/// commands in sequence, the second one runs on the already-modified line and they replace what were originally the first and the second occurrence. What you need is just
$ sed 's/  */,/;s/  */,/' input
(Note that you can put two sed commands in one expression if you separate them by a semicolon. Some sed implementations do not accept the semicolon after the s/// command; use a newline to separate the commands, in this case.)
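A sketch of the same two-step substitution written with a POSIX character class, so that tabs and runs of spaces are covered too. It assumes a sed that accepts -E (GNU and BSD sed both do), and the third sample line has extra spaces added for illustration:

```shell
result=$(printf '%s\n' 'a b 1 2 3 3 2 1' \
                       'line http://google.com 100 200 300' \
                       'ef   jh 77 88 99' |
  sed -E -e 's/[[:space:]]+/,/' -e 's/[[:space:]]+/,/')
printf '%s\n' "$result"
```

Using two -e options sidesteps the question of whether a given sed accepts a semicolon after s///.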
A Perl solution is:
perl -pe '$_=join ",", split /\s+/, $_, 3' some.file
Not sure about sed/perl, but here's an (ugly) awk solution. It just prints fields 1-2, separated by commas, then the remaining fields separated by spaces:
awk '{
printf("%s,%s,", $1, $2) # first two fields, each followed by a comma
for (i=3; i<=NF; i++)
printf("%s%s", $i, (i<NF ? " " : "")) # no trailing space after the last field
printf("\n")
}' myfile.txt