Replace first two whitespace occurrences with a comma using sed - regex

I have a whitespace delimited file with a variable number of entries on each line. I want to replace the first two whitespaces with commas to create a comma delimited file with three columns.
Here's my input:
a b 1 2 3 3 2 1
c d 44 55 66 2355
line http://google.com 100 200 300
ef jh 77 88 99
z y 2 3 33
And here's my desired output:
a,b,1 2 3 3 2 1
c,d,44 55 66 2355
line,http://google.com,100 200 300
ef,jh,77 88 99
z,y,2 3 33
I'm trying to use perl regular expressions in a sed command but I can't quite get it to work. First I try capturing a word, followed by a space, then another word, but that only works for lines 1, 2, and 5:
$ cat test | sed -r 's/(\w)\s+(\w)\s+/\1,\2,/'
a,b,1 2 3 3 2 1
c,d,44 55 66 2355
line http://google.com 100 200 300
ef jh 77 88 99
z,y,2 3 33
I also try capturing whitespace, a word, and then more whitespace, but that gives me the same result:
$ cat test | sed -r 's/\s+(\w)\s+/,\1,/'
a,b,1 2 3 3 2 1
c,d,44 55 66 2355
line http://google.com 100 200 300
ef jh 77 88 99
z,y,2 3 33
I also try doing this with the .? wildcard, but that does something funny to line 4.
$ cat test | sed -r 's/\s+(.?)\s+/,\1,/'
a,b,1 2 3 3 2 1
c,d,44 55 66 2355
line http://google.com 100 200 300
ef jh,,77 88 99
z,y,2 3 33
Any help is much appreciated!

How about this:
sed -e 's/\s\+/,/' | sed -e 's/\s\+/,/'
It's probably possible with a single sed command, but this is sure an easy way :)
My output:
a,b,1 2 3 3 2 1
c,d,44 55 66 2355
line,http://google.com,100 200 300
ef,jh,77 88 99
z,y,2 3 33

Try this:
sed -r 's/\s+(\S+)\s+/,\1,/'
Just replaced \w (one "word" char) with \S+ (one or more non-space chars) in one of your attempts.

You can provide multiple commands to a single instance of sed by just providing multiple -e arguments.
To do the first two, just use:
sed -e 's/\s\+/,/' -e 's/\s\+/,/'
This basically runs both commands on the line in sequence, the first doing the first block of whitespace, the second doing the next.
The following transcript shows this in action:
pax$ echo 'a b 1 2 3 3 2 1
c d 44 55 66 2355
line http://google.com 100 200 300
ef jh 77 88 99
z y 2 3 33
' | sed -e 's/\s\+/,/' -e 's/\s\+/,/'
a,b,1 2 3 3 2 1
c,d,44 55 66 2355
line,http://google.com,100 200 300
ef,jh,77 88 99
z,y,2 3 33

Sed s/// supports a way to say which occurrence of a pattern to replace: just add the n to the end of the command to replace only the nth occurrence. So, to replace the first and second occurrences of whitespace, just use it this way:
$ sed 's/  */,/1;s/  */,/2' input
a,b ,1 2 3 3 2 1
c,d ,44 55 66 2355
line,http://google.com 100,200 300
ef,jh ,77 88 99
z,y 2,3 33
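EDIT: reading the other proposed solutions, I noticed that the 1 and 2 after s/  */,/ are not only unnecessary but plainly wrong. By default, s/// replaces only the first occurrence of the pattern. Once the first command has turned the first run of spaces into a comma, the run that was originally second is the first one left, so two identical s/// commands in sequence replace the first and the second occurrence. (The pattern is two spaces followed by *, so that at least one space is required.) What you need is just
$ sed 's/  */,/;s/  */,/' input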
(Note that you can put two sed commands in one expression if you separate them by a semicolon. Some sed implementations do not accept the semicolon after the s/// command; use a newline to separate the commands, in this case.)
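The same idea can be sketched without the GNU-only \s and \+ by using a POSIX character class, which also treats tabs as whitespace. The function name first_two_to_commas is only an illustration, not something from the answers above.

```shell
# Replace the first two runs of whitespace on each line with commas.
# [[:space:]][[:space:]]* means "one or more whitespace characters".
first_two_to_commas() {
  sed -e 's/[[:space:]][[:space:]]*/,/' \
      -e 's/[[:space:]][[:space:]]*/,/' "$@"
}

printf '%s\n' 'a b 1 2 3 3 2 1' 'line http://google.com 100 200 300' |
  first_two_to_commas
```

Each -e expression replaces only the first match it finds, so the second expression picks up the run of whitespace that the first one left behind.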

A Perl solution is:
perl -pe '$_=join ",", split /\s+/, $_, 3' some.file

Not sure about sed/perl, but here's an (ugly) awk solution. It just prints fields 1-2, separated by commas, then the remaining fields separated by space:
awk '{
printf("%s,", $1)
printf("%s,", $2)
for (i=3; i<=NF; i++)
printf("%s ", $i)
printf("\n")
}' myfile.txt
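A shorter awk variant is possible if the line is edited in place instead of being rebuilt field by field. This is only a sketch: it assumes the fields are separated by spaces or tabs and that lines have no leading whitespace, and the function name is made up for the example.

```shell
# sub() replaces only the first match in $0, so calling it twice
# turns the first two runs of blanks into commas and leaves the rest.
first_two_fields_csv() {
  awk '{ sub(/[ \t]+/, ","); sub(/[ \t]+/, ","); print }' "$@"
}

printf '%s\n' 'c d 44 55 66 2355' 'z y 2 3 33' | first_two_fields_csv
```

Unlike the printf loop, this leaves no trailing space at the end of each line.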

Related

grep single digit occurs one time in line

I need help with a grep command that matches lines in which some digit occurs exactly once.
My solution doesn't work:
egrep "^(\s*[1]\s*)(\s*[^1]+\s*)+$|^(\s*[^1]\s*)(\s*[1]+\s*)+$|^(\s*[2]\s*)(\s*[^2]+\s*)+$|^(\s*[^2]\s*)(\s*[2]+\s*)+$|^(\s*[3]\s*)(\s*[^3]+\s*)+$|^(\s*[^3]\s*)(\s*[3]+\s*)+$|^(\s*[4]\s*)(\s*[^4]+\s*)+$|^(\s*[^4]\s*)(\s*[4]+\s*)+$|^(\s*[5]\s*)(\s*[^5]+\s*)+$|^(\s*[^5]\s*)(\s*[5]+\s*)+$|^(\s*[6]\s*)(\s*[^6]+\s*)+$|^(\s*[^6]\s*)(\s*[6]+\s*)+$|^(\s*[7]\s*)(\s*[^7]+\s*)+$|^(\s*[^7]\s*)(\s*[7]+\s*)+$|^(\s*[8]\s*)(\s*[^8]+\s*)+$|^(\s*[^8]\s*)(\s*[8]+\s*)+$|^(\s*[9]\s*)(\s*[^9]+\s*)+$|^(\s*[^9]\s*)(\s*[9]+\s*)+$"
For example, in this text:
012 210 5
6343 232 5 3423
345 689 7 986 543012 210 5
grep highlights only the second line.
I want grep to highlight every line, because in each line some digit occurs exactly once: in the first line it is 5, in the second line it is 5, and in the third line it is 7.
A pattern that detects if a digit is unique on a line (if I'm understanding the question correctly):
For the digit 5:
^[^5]*(5)[^5]*$
^ // start of line
[^5]* // any char not 5, 0-or-more
(5) // 5
[^5]* // any char not 5, 0-or-more
$ // end of line
To test all digits, it becomes:
^(?:[^0]*(0)[^0]*|[^1]*(1)[^1]*)$ and so on for all digits. The matching digit is captured in the group belonging to its alternative.
Demo
Steps: 509 steps
Flags: g, m
I'm really unsure what the expected output should be (PLEASE UPDATE IT PROPERLY TO THE QUESTION), but here using GNU awk. First test data:
$ cat foo
012 210 5
6343 232 5 3423
345 689 7 986 543012 210 5
234 12 43
Then:
$ awk -F '' '{
delete a
for(i=1;i<=NF;i++)
if($i~/[0-9]/)
a[$i]++
for(i in a)
if(a[i]==1 && match($0, "[^" i "]*" i "[^" i "]*")) {
print $0
next # second data line has 2 matches
}
}' foo
012 210 5
6343 232 5 3423
345 689 7 986 543012 210 5
234 12 43
Then again, it's shorter just to do:
$ awk '{for(i=0;i<=9;i++)if(gsub(i,i,$0)==1){print;next}}' foo
I'm not absolutely sure what you're after, but if it's matching lines that only contain one instance of a digit, try this:
[^0]*0[^0]*|[^1]*1[^1]*|[^2]*2[^2]*|[^3]*3[^3]*|[^4]*4[^4]*|[^5]*5[^5]*|[^6]*6[^6]*|[^7]*7[^7]*|[^8]*8[^8]*|[^9]*9[^9]*
or grepified
grep -x "[^0]*0[^0]*\|[^1]*1[^1]*\|[^2]*2[^2]*\|[^3]*3[^3]*\|[^4]*4[^4]*\|[^5]*5[^5]*\|[^6]*6[^6]*\|[^7]*7[^7]*\|[^8]*8[^8]*\|[^9]*9[^9]*"
(-x makes grep match the full line.)
The regex uses 10 identical alternations, one for each digit. Each alternation makes sure zero or more of anything but the digit starts the line, matches the one allowed digit, and then makes sure zero or more of anything but the digit ends the line.
See it here at regex101.
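Since the ten alternations differ only in the digit, the pattern can also be generated in a loop rather than typed out. The following is a sketch using grep -E, with hypothetical helper names build_pattern and lines_with_unique_digit.

```shell
# Build "[^0]*0[^0]*|[^1]*1[^1]*|...|[^9]*9[^9]*" one digit at a time.
build_pattern() {
  pattern=''
  for d in 0 1 2 3 4 5 6 7 8 9; do
    # Prepend "previous|" only once there is a previous alternative.
    pattern="${pattern:+$pattern|}[^$d]*$d[^$d]*"
  done
  printf '%s\n' "$pattern"
}

# -x forces a whole-line match, so some digit must occur exactly once.
lines_with_unique_digit() {
  grep -Ex "$(build_pattern)" "$@"
}

printf '%s\n' '012 210 5' '1122' '234 12 43' | lines_with_unique_digit
```

The line 1122 is dropped because no digit occurs exactly once in it; the other two lines are printed.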

How to grep any word that appears between 2 and 4 times?

My file is:
ab 12ab 1cd uu 88 ab 33 33 1 1
ab cd uu 88 88 33 33 33 cw ab
I need to extract the words and numbers that appear 2-4 times, i.e. {2,4}.
I've tried many regex lines, and even regex101,
but I can't put my finger on what's not working.
this is the closest I've got so far:
egrep -o '[\w]{2,4}' A1
With basic regular expressions, grep does not treat {} as an interval (it would have to be written \{2,4\}). You have to use extended regular expressions.
Use the -E option:
-E, --extended-regexp
Interpret pattern as an extended regular expression (i.e. force grep to behave as egrep).
Also use -w to match words, so that it matches entire words instead of partial ones:
-w, --word-regexp
The expression is searched for as a word (as if surrounded by `[[:<:]]' and `[[:>:]]'; see re_format(7)).
Example
$ grep -Ewo "\w{2,4}" file
ab
12ab
1cd
uu
88
ab
33
33
ab
cd
uu
88
88
33
33
33
cw
ab
Note
You can eliminate the unnecessary cat by passing the file to grep directly.
You were very close. Inside a bracket expression [], \w is treated literally rather than as the word-character class, so move it outside the []:
egrep -o '\w{2,4}'
Also egrep is deprecated in favor of grep -E, and you don't need the cat as grep takes file(s) as argument(s):
grep -Eo '\w{2,4}' file.txt
I would use awk for it:
awk '{for(i=1;i<=NF;i++)a[$i]++}
END{for(x in a)if(a[x]>1&&a[x]<5)print x}' file
It scans the whole file and prints the words whose number of occurrences (in the whole file) is in the range [2,4].
Output is:
uu
ab
88
1
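The same per-file count can be sketched as a pipeline of standard tools, which may be easier to adapt. It assumes words are separated by spaces, tabs or newlines; the function name words_seen_2_to_4 is invented for the example.

```shell
# Put one word per line, count identical words, keep counts from 2 to 4.
words_seen_2_to_4() {
  cat "$@" |
    tr -s ' \t' '\n\n' |   # blanks become newlines, repeats are squeezed
    LC_ALL=C sort |        # uniq needs identical lines to be adjacent
    uniq -c |              # prefix each distinct word with its count
    awk '$1 >= 2 && $1 <= 4 { print $2 }'
}

printf '%s\n' 'ab 12ab 1cd uu 88 ab 33 33 1 1' \
              'ab cd uu 88 88 33 33 33 cw ab' | words_seen_2_to_4
```

On the sample input this prints 1, 88, ab and uu, one per line and in sorted order. 33 is left out because it occurs five times.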
Using AWK, this solution counts the word instances per line not per file:
awk '{delete array; for(i = 1; i <= NF; i++) array[$i]+=1; for(i in array) if(array[i] >= 2 && array[i] <= 4) printf "%s ", i; printf "\n" }' input.txt
delete clears the array for each new line. Each field is used as an array index, and its count is incremented by one. Finally, the indexes (fields) whose counts are between 2 and 4 inclusive are printed.
Output:
ab 1 33
ab 88 33
Perl implementation for a file small enough to process its content as a single string:
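use strict;
use warnings;
$/ = undef;                 # slurp mode: read the whole input at once
$_ = <>;
@_ = /(\b\w+\b)/gs;         # collect every word
my %h; $h{$_}++ for @_;     # count occurrences of each word
for (keys %h) {
print "$_\n" if $h{$_} >= 2 and $h{$_} <= 4;
}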
Save it into a script.pl and run:
perl script.pl < file
Of course, you can pass the code via -e option as well: perl -e 'the code' < file.
Input
ab 12ab 1cd uu 88 ab 33 33 1 1
ab cd uu 88 88 33 33 33 cw ab
Output
88
uu
ab
1
There is no 33 in the output, since it occurs 5 times in the input.
The code reads the file in slurp mode into the default variable ($_), then collects all the words (\w+ with word boundaries around it) into the @_ array. Then it counts the number of times each word occurred in the file and stores the result in the %h hash. The final block prints only the items that occurred 2, 3, or 4 times, no more and no less.
Note, in Perl you should always use strict; and use warnings; in order to detect issues at early phase.

Bash script to split a file by grep everything till the second time match in a column into one file and the rest into another

I am trying to split a file with data like
2 0.2345
58 0.3608
59 0.3504
60 0.4175
65 0.3995
66 0.3972
67 0.4411
411 0.3455
2 1.3867
3 1.4532
4 1.2925
5 1.2473
6 1.2605
7 1.2463
8 1.1667
9 1.1312
10 1.1502
11 1.1190
12 1.0346
13 1.0291
409 0.8025
410 0.8695
411 0.9154
For this kind of data, I am trying to split this into two files:
File 1 : 2 -411 (first Column match)
File 2 : 2-411 (second occurrence in the first column)
For this, I wrote these two one liners:
awk '1;/411/{exit}' $1 > File1_$1 ;
awk '/411/,0' $1 | awk '{if (NR!=1) {print}}' > File2_$1
The problem is that if there is a match of "411" (as in "67 0.4411") on the second column, my script prematurely cuts from that line.
I can't work out how to restrict the match to the first column; 411 may occur any number of times in the second column, and those occurrences are not of interest.
Any help would be greatly appreciated.
One idea is to use this combination of commands:
awk '{ if ($1 >= 2 && $1 <= 411) print $0 }{if ($1=="411") exit}' input > f1
then
grep -v -f f1 input > f2
If your input file is bigger, you may need to repeat step 2.
I don't know much about Bash, but for the regex I think you should anchor the match to the start of the line, for example ^411 rather than a bare 411.
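Another way to avoid matching 411 in the second column is to compare the first field directly instead of using a regex. The sketch below does the whole split in a single awk pass; the function name and its argument order are assumptions made for the example.

```shell
# Usage: split_at_first_key KEY INPUT OUT1 OUT2
# Lines up to and including the first one whose column 1 equals KEY
# go to OUT1; everything after that line goes to OUT2.
split_at_first_key() {
  awk -v key="$1" -v out1="$3" -v out2="$4" '
    seen { print > out2; next }   # already past the first KEY line
    { print > out1 }              # still in the first block
    $1 == key { seen = 1 }        # compare column 1 only
  ' "$2"
}
```

It could be called as split_at_first_key 411 "$1" "File1_$1" "File2_$1". A value such as 0.4411 in the second column is never compared, so it cannot end the first block early.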

Regex for soccer data

Why isn't my regex working? It just returns back the original file. My file looks like this (for a few hundred lines):
1 Germany 1765 0 Equal
2 Argentina 1631 0 Equal
3 Colombia 1488 1 Up
4 Netherlands 1456 -1 Down
5 Belgium 1444 0 Equal
6 Brazil 1291 1 Up
7 Uruguay 1243 -1 Down
8 Spain 1228 -1 Down
9 France 1202 1 Up
...
192 US Virgin Islands 28 -1 Down
And I want this:
Germany,1
Argentina,2
Colombia,3
...
US Virgin Islands,192
This is the regex I tried:
sed 's/\([0-9]*\)\t\([a-zA-Z]*\)/\2,\1/g' <fifa.csv >fifa.csv
But it just returns the original file.
EDIT:
Now I tried
sed 's/\([0-9]*\)\t\([a-zA-Z]*\)/\2,\1/g' <fifa.csv >fifa.csv
and got
,1 Germany,,1765Equal,0,
,2 Argentina,,1631Equal,0,
,3 Colombia,,1488Up,1,
,4 Netherlands,,1456-Down,1,
,5 Belgium,,1444Equal,0,
You could try the below sed command if the fields are tab-separated.
sed 's/^\([0-9]\+\)\t\([^\t]*\).*/\2,\1/' file
Add the inline-edit option -i to save the changes made.
sed -i 's/^\([0-9]\+\)\t\([^\t]*\).*/\2,\1/' file
^ anchors the match at the start of the line. \+ repeats the preceding item one or more times; sed uses BRE by default, so the + must be escaped to get that meaning (this is a GNU extension). [^\t]* matches zero or more characters that are not a tab.
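For a sed that does not understand \t or \+, the same substitution can be sketched with a literal tab kept in a shell variable. This assumes the file really is tab-separated; name_rank_csv is a made-up name.

```shell
# Swap the rank (field 1) and the name (field 2) and drop the rest.
name_rank_csv() {
  tab=$(printf '\t')   # a literal tab character
  sed "s/^\([0-9][0-9]*\)${tab}\([^${tab}]*\).*/\2,\1/" "$@"
}

printf '1\tGermany\t1765\t0\tEqual\n192\tUS Virgin Islands\t28\t-1\tDown\n' |
  name_rank_csv
```

When running it on a file, send the output to a different file from the input: a redirection like >fifa.csv empties fifa.csv before sed gets to read it.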
The following is what you are looking for. The -i option specifies that files are to be edited in-place.
sed -i 's/^\([0-9]\+\)\t\([^\t]*\).*/\2,\1/' fifa.csv
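awk -F'\t' '{print $2 "," $1}' YourFile
Not sed, but easier to manage. Setting the field separator to a tab (assuming the file is tab-separated) keeps multi-word names such as US Virgin Islands in one field.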

SED- combine matching regex lines to make a csv file

I was wondering if it is possible to use sed to create a csv file by combining multiple lines together onto a single line separated by commas.
For example I have written a sed statement that retrieves the lines I want.
sed -n -e '/ENTITIES/,/ENDSEC/p' | sed -n -e '/ 8/{n;p;}' -e '/ 10/{n;p;}' -e '/ 20/{n;p;}' -e '/ 11/{n;p;}' -e '/ 21/{n;p;}' < Test.txt > out.csv
Which produces the output;
0
4.93
9.04
27.9
23.4
0
34.56
0.77
66.65
19.50
0
55.26
47.29
53.42
19.75
0
-18.22
44.35
19.74
53.28
But I would Like it to output;
0,4.93,9.04,27.9,23.4
0,34.56,0.77,66.65,19.50
0,55.26,47.29,53.42,19.75
0,-18.22,44.35,19.74,53.28
Is there any way to do this without a pipe? I'd rather not invoke another command, as the files I process are upwards of 100 million lines or so.
Thanks in advance for your help!
To add, here is a portion of my input file;
More Stuff Above
AcDbBlockEnd
0
ENDSEC
0
SECTION
2
ENTITIES
0
LINE
5
1B1
330
1F
100
AcDbEntity
8
0
100
AcDbLine
10
4.933855223957067
20
9.042372500389475
30
0.0
11
27.92566226775641
21
23.49207557886149
31
0.0
0
LINE
5
1B2
330
1F
100
AcDbEntity
8
0
100
AcDbLine
10
34.56437535704545
20
0.778745874786317
30
0.0
11
66.65564369957746
21
19.50612180407816
31
0.0
0
LINE
5
1B3
330
1F
100
AcDbEntity
8
0
100
AcDbLine
10
55.26446832764479
20
47.29118282642324
30
0.0
11
53.42718194719286
21
19.75092411476788
31
0.0
0
LINE
5
1B4
330
1F
100
AcDbEntity
ENDSEC
0
More stuff below.
Something like this might be what you're looking for, but as jaypal said, without seeing the input it's somewhat of a guess.
sed -n '
/ENTITIES/,/ENDSEC/p
/ 8/{n;h}
/ 10/{n;H}
/ 20/{n;H}
/ 11/{n;H}
/ 21/{n;H;g;s/\n/,/g;p}
' Test.txt > out.csv
With comments:
sed -n '
/ENTITIES/,/ENDSEC/p
/ 8/{n;h} # store next line in hold space
/ 10/{n;H} # append next line to hold space (after newline)
/ 20/{n;H} # ditto
/ 11/{n;H} # ditto
/ 21/{n;H; # ditto
g; # put hold space into pattern space
s/\n/,/g; # substitute commas for newlines
p # print it
}
' Test.txt > out.csv
Just pipe your sed output to paste:
sed 'your long sed command' | paste -d, - - - - -
The result will be
0,4.93,9.04,27.9,23.4
0,34.56,0.77,66.65,19.50
0,55.26,47.29,53.42,19.75
0,-18.22,44.35,19.74,53.28
Got it, thanks to ooga! Before, I lacked an understanding of hold space vs. pattern space; now it has all become clear!
sed -n '
/ENTITIES/,/ENDSEC/{
/ 8/{n;h;};
/ 10/{n;H;};
/ 20/{n;H;};
/ 11/{n;H;};
/ 21/{n;H;g;s/\n/,/g;p};
}
' < Test.dxf > out.csv
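To see the hold-space technique in isolation, here is a cut-down sketch with only three group codes (8, 10 and 20) and a tiny made-up input. The function name join_group_values is hypothetical, and the patterns assume each group code sits alone on its line, optionally preceded by spaces.

```shell
# n  : replace the pattern space with the next input line (the value)
# h  : copy the pattern space to the hold space, starting a new record
# H  : append a newline plus the pattern space to the hold space
# g  : copy the hold space back into the pattern space
join_group_values() {
  sed -n '
    /^ *8$/{n;h;}
    /^ *10$/{n;H;}
    /^ *20$/{n;H;g;s/\n/,/g;p;}
  ' "$@"
}

printf ' 8\n0\n 10\n4.93\n 20\n9.04\n' | join_group_values
```

This prints 0,4.93,9.04. The values pile up in the hold space one per line, and the last command turns the embedded newlines into commas before printing.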