sed, awk or grep: how to find the exact line number of a match (regex)

Here is an example of my datafile.txt:
jones
dave
mike
dave
nathan
ben
james
jim
dave
dave
jones
bill
john
I am using grep to find the string dave, which works fine:
grep "dave" datafile.txt >> duplicate.txt
I need to find the line number on which each occurrence of dave was found:
the first match is on line 2
the next dave is on line 4
the next dave is on line 9
the next dave is on line 10
A second query should report the number of lines since the previous occurrence:
the first match is 0
the second match comes 2 lines later
the third match comes 5 lines later
the fourth match comes 1 line later
So I need both the exact line number and the distance from the previous match.

A simple awk command can do the work for you:
$ awk '/dave/{print NR}' input
2
4
9
10
What it does:
/dave/ selects the lines that contain dave
{print NR} prints NR, the current line number.
And for the distance between matches:
$ awk '/dave/{print prev?NR-prev:0; prev=NR}' input
0
2
5
1
What it does:
prev holds the line number of the previous line that matched /dave/
prev?NR-prev:0 prints NR-prev if prev is set, otherwise 0
prev=NR stores the current line number in prev
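Both requests can also be answered in a single pass. The following is a minimal sketch, not part of the original answer; it recreates the sample data in a temporary file that stands in for datafile.txt, and the variable names datafile and result are only illustrative.

```shell
# Recreate the sample data in a temporary file (stand-in for datafile.txt).
datafile=$(mktemp)
printf '%s\n' jones dave mike dave nathan ben james jim dave dave jones bill john > "$datafile"

# For each line containing "dave", print its line number followed by the
# distance from the previous match (0 for the first match).
result=$(awk '/dave/ { print NR, (prev ? NR - prev : 0); prev = NR }' "$datafile")
printf '%s\n' "$result"

rm -f "$datafile"
```

Each output line holds the line number first and the gap second, so the two columns correspond to the two separate awk commands above.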

Related

grep single digit occurs one time in line

I need help with a grep command that finds lines
in which some single digit occurs exactly once.
My solution doesn't work:
egrep "^(\s*[1]\s*)(\s*[^1]+\s*)+$|^(\s*[^1]\s*)(\s*[1]+\s*)+$|^(\s*[2]\s*)(\s*[^2]+\s*)+$|^(\s*[^2]\s*)(\s*[2]+\s*)+$|^(\s*[3]\s*)(\s*[^3]+\s*)+$|^(\s*[^3]\s*)(\s*[3]+\s*)+$|^(\s*[4]\s*)(\s*[^4]+\s*)+$|^(\s*[^4]\s*)(\s*[4]+\s*)+$|^(\s*[5]\s*)(\s*[^5]+\s*)+$|^(\s*[^5]\s*)(\s*[5]+\s*)+$|^(\s*[6]\s*)(\s*[^6]+\s*)+$|^(\s*[^6]\s*)(\s*[6]+\s*)+$|^(\s*[7]\s*)(\s*[^7]+\s*)+$|^(\s*[^7]\s*)(\s*[7]+\s*)+$|^(\s*[8]\s*)(\s*[^8]+\s*)+$|^(\s*[^8]\s*)(\s*[8]+\s*)+$|^(\s*[9]\s*)(\s*[^9]+\s*)+$|^(\s*[^9]\s*)(\s*[9]+\s*)+$"
For example, in this text
012 210 5
6343 232 5 3423
345 689 7 986 543012 210 5
grep colors only the second line.
I want grep to color every line, because in each line some digit occurs exactly once: in the first line it is 5, in the second line it is 5, and in the third line it is 7.
A pattern that detects if a digit is unique on a line (if I'm understanding the question correctly):
For the digit 5:
^[^5]*(5)[^5]*$
^ // start of line
[^5]* // any char not 5, 0-or-more
(5) // 5
[^5]* // any char not 5, 0-or-more
$ // end of line
To test all digits, it becomes:
^(?:[^0]*(0)[^0]*|[^1]*(1)[^1]*)$ and so on for all digits. The digit is captured by the group of whichever alternative matched.
(Demo run with the flags g and m.)
I'm really unsure what the expected output should be (please update the question accordingly), but here is a version using GNU awk. First, the test data:
$ cat foo
012 210 5
6343 232 5 3423
345 689 7 986 543012 210 5
234 12 43
Then:
$ awk -F '' '{
    delete a
    for(i=1;i<=NF;i++)
        if($i~/[0-9]/)
            a[$i]++
    for(i in a)
        if(a[i]==1 && match($0, "[^" i "]*" i "[^" i "]*")) {
            print $0
            next # second data line has 2 matches
        }
}' foo
012 210 5
6343 232 5 3423
345 689 7 986 543012 210 5
234 12 43
Then again, it's shorter just to do:
$ awk '{for(i=0;i<=9;i++)if(gsub(i,i,$0)==1){print;next}}' foo
I'm not absolutely sure what you're after, but if it's matching lines that only contain one instance of a digit, try this:
[^0]*0[^0]*|[^1]*1[^1]*|[^2]*2[^2]*|[^3]*3[^3]*|[^4]*4[^4]*|[^5]*5[^5]*|[^6]*6[^6]*|[^7]*7[^7]*|[^8]*8[^8]*|[^9]*9[^9]*
or grepified
grep -x "[^0]*0[^0]*\|[^1]*1[^1]*\|[^2]*2[^2]*\|[^3]*3[^3]*\|[^4]*4[^4]*\|[^5]*5[^5]*\|[^6]*6[^6]*\|[^7]*7[^7]*\|[^8]*8[^8]*\|[^9]*9[^9]*"
(-x makes grep match the full line.)
The regex uses 10 identical alternations, one for each digit. Each alternation:
makes sure zero or more of anything but the digit starts the line,
matches the one allowed digit,
makes sure zero or more of anything but the digit ends the line.
See it here at regex101.
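As a sketch of the same idea, the ten alternations can be generated in a loop instead of being typed out. This is only an illustration; it assumes a grep that supports -E and -x, and the third input line and the variable names pattern and result are made up for the example.

```shell
# Build "[^0]*0[^0]*|[^1]*1[^1]*|...|[^9]*9[^9]*" one digit at a time.
pattern=""
for d in 0 1 2 3 4 5 6 7 8 9; do
    pattern="${pattern:+$pattern|}[^$d]*$d[^$d]*"
done

# -E enables unescaped | alternation; -x requires a whole-line match.
# The third line has no digit that occurs exactly once, so it is dropped.
result=$(printf '%s\n' '012 210 5' '6343 232 5 3423' '1221 33' | grep -E -x "$pattern")
printf '%s\n' "$result"
```

Using -E avoids the \| escapes of the basic-regex version; the pattern itself is otherwise the same.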

Bash script to split a file: everything up to the second match in a column goes into one file, and the rest into another

I am trying to split a file with data like
2 0.2345
58 0.3608
59 0.3504
60 0.4175
65 0.3995
66 0.3972
67 0.4411
411 0.3455
2 1.3867
3 1.4532
4 1.2925
5 1.2473
6 1.2605
7 1.2463
8 1.1667
9 1.1312
10 1.1502
11 1.1190
12 1.0346
13 1.0291
409 0.8025
410 0.8695
411 0.9154
For this kind of data, I am trying to split this into two files:
File 1 : 2 -411 (first Column match)
File 2 : 2-411 (second occurrence in the first column)
For this, I wrote these two one liners:
awk '1;/411/{exit}' $1 > File1_$1 ;
awk '/411/,0' $1 | awk '{if (NR!=1) {print}}' > File2_$1
The problem is that if "411" also matches in the second column (as in "67 0.4411"), my script cuts prematurely at that line.
I am unable to restrict the match to the first column only; 411 can occur any number of times in the second column, and those occurrences are not of interest.
Any help would be greatly appreciated.
One idea is to use this combination of commands:
awk '{ if ($1 >= 2 && $1 <= 411) print $0 }{if ($1=="411") exit}' input > f1
then
grep -v -f f1 input > f2
If your input file is bigger, you should repeat step 2.
I don't know anything about Bash, but for the regex I think you should indicate that the line begins with 411, for example with ^411. (\b411 only requires a word boundary, so it would still match inside 0.411.)
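The split can also be done in one awk pass that compares only the first field, so values such as 0.4411 in the second column are ignored. This is a hedged sketch rather than any poster's method: it assumes exactly two blocks, each ending with 411 in the first column, and the names workdir, File1, File2, end and seen are chosen for the example.

```shell
workdir=$(mktemp -d)
printf '%s\n' '2 0.2345' '67 0.4411' '411 0.3455' \
              '2 1.3867' '3 1.4532' '411 0.9154' > "$workdir/input"

# Write each line to File1 until a line whose FIRST field equals 411 has
# been written; every later line goes to File2.
awk -v end=411 -v f1="$workdir/File1" -v f2="$workdir/File2" '
    { print > (seen ? f2 : f1) }
    $1 == end { seen = 1 }
' "$workdir/input"

first=$(cat "$workdir/File1")
second=$(cat "$workdir/File2")
printf 'File1:\n%s\nFile2:\n%s\n' "$first" "$second"

rm -rf "$workdir"
```

Because the flag is set only after the line has been printed, the closing 411 line of the first block stays in the first file.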

How to find lines with multiple occurrences of a(ny) word in a file?

I want to find lines that have multiple occurrences of a(ny) word. For example, if the input text is
John is a teacher, who is not highly paid.
abc abcde
James lives in Detroit.
abc abc abcde
Paul has 2 dogs and 2 cats.
The output should be
John is a teacher, who is not highly paid.
abc abc abcde
Paul has 2 dogs and 2 cats.
The first line has "is" repeated, the second output line has "abc" repeated, and the last line has "2" repeated.
^(?=.*\b(\w+)\b.*\b\1\b).*$
Try this. See the demo:
https://www.regex101.com/r/rG7gX4/6
Use it with grep -P.
Here is a simple way to do it in awk
awk '{f=0;delete a;for (i=1;i<=NF;i++) if (a[$i]++) f=1} f' file
John is a teacher, who is not highly paid.
abc abc abcde
Paul has 2 dogs and 2 cats.
It loops through every word and counts them in array a.
If any word is found more than once, it sets the flag f.
If the flag f is true, awk performs the default action and prints the line.
To see how many:
awk '{f=0;delete a;for (i=1;i<=NF;i++) if (a[$i]++) f=1} f {for (i in a) if (a[i]>1) printf "%sx\"%s\"-",a[i],i;print $0}' file
2x"is"-John is a teacher, who is not highly paid.
2x"abc"-abc abc abcde
2x"2"-Paul has 2 dogs and 2 cats.
Some improvements: ignore case, and remove every . and , from each word.
awk '{f=0;delete a;for (i=1;i<=NF;i++) {w=tolower($i);gsub(/[.,]/,"",w);if (a[w]++) f=1}} f' file
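A variant that treats every non-word character as a separator, rather than stripping only . and , might look like the sketch below. It is an illustration under the assumption that a "word" means a run of letters, digits and underscores; the variable names line, w, seen, dup and result are not from the answers above.

```shell
result=$(printf '%s\n' \
    'John is a teacher, who is not highly paid.' \
    'abc abcde' \
    'James lives in Detroit.' \
    'abc abc abcde' \
    'Paul has 2 dogs and 2 cats.' |
awk '{
    line = tolower($0)                     # compare case-insensitively
    gsub(/[^a-zA-Z0-9_]+/, " ", line)      # punctuation becomes a separator
    n = split(line, w, " ")                # split the cleaned copy into words
    split("", seen)                        # portable way to clear the array
    dup = 0
    for (i = 1; i <= n; i++)
        if (seen[w[i]]++) dup = 1          # word already seen on this line
    if (dup) print                         # print the original, unmodified line
}')
printf '%s\n' "$result"
```

Working on a copy of the line keeps the printed output identical to the input while the comparison ignores case and punctuation.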

Regex for soccer data

Why isn't my regex working? It just returns back the original file. My file looks like this (for a few hundred lines):
1 Germany 1765 0 Equal
2 Argentina 1631 0 Equal
3 Colombia 1488 1 Up
4 Netherlands 1456 -1 Down
5 Belgium 1444 0 Equal
6 Brazil 1291 1 Up
7 Uruguay 1243 -1 Down
8 Spain 1228 -1 Down
9 France 1202 1 Up
...
192 US Virgin Islands 28 -1 Down
And I want this:
Germany,1
Argentina,2
Colombia,3
...
US Virgin Islands,192
This is the regex I tried:
sed 's/\([0-9]*\)\t\([a-zA-Z]*\)/\2,\1/g' <fifa.csv >fifa.csv
But it just returns the original file.
EDIT:
Now I tried
sed 's/\([0-9]*\)\t\([a-zA-Z]*\)/\2,\1/g' <fifa.csv >fifa.csv
and got
,1 Germany,,1765Equal,0,
,2 Argentina,,1631Equal,0,
,3 Colombia,,1488Up,1,
,4 Netherlands,,1456-Down,1,
,5 Belgium,,1444Equal,0,
You could try the below sed command if the fields are tab-separated.
sed 's/^\([0-9]\+\)\t\([^\t]*\).*/\2,\1/' file
Add the inline-edit option -i to save the changes made.
sed -i 's/^\([0-9]\+\)\t\([^\t]*\).*/\2,\1/' file
^ anchors the match at the start of the line. \+ repeats the previous character one or more times; sed uses BRE by default, so in GNU sed the + must be escaped to get this meaning. [^\t]* matches zero or more characters other than a tab.
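For comparison, here is a sketch of the same substitution in extended-regex mode, where + and the grouping parentheses need no backslashes. It assumes a sed that accepts -E and tab-separated input; the tab character is passed in through a shell variable (named tab here for illustration) instead of relying on sed understanding \t.

```shell
# A literal tab character, so the command does not depend on sed's \t support.
tab=$(printf '\t')

# Capture the leading rank and the second tab-separated field, drop the rest
# of the line, and emit "name,rank".
result=$(printf '1\tGermany\t1765\t0\tEqual\n4\tNetherlands\t1456\t-1\tDown\n' |
    sed -E "s/^([0-9]+)${tab}([^${tab}]*).*/\2,\1/")
printf '%s\n' "$result"
```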
The following is what you are looking for. The -i option specifies that files are to be edited in-place.
sed -i 's/^\([0-9]\+\)\t\([^\t]*\).*/\2,\1/' fifa.csv
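awk -F'\t' '{print( $2 "," $1)}' YourFile
Not sed, but easier to manage. (-F'\t' assumes tab-separated fields, so that multi-word names such as US Virgin Islands stay in one field.)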
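A small sketch of the awk approach on tab-separated input, including a multi-word country name. The sample rows are taken from the question; the tab separator is an assumption about the real file, and result is an illustrative variable name.

```shell
# -F'\t' splits on tabs only; OFS=',' joins the printed fields with a comma.
result=$(printf '1\tGermany\t1765\t0\tEqual\n192\tUS Virgin Islands\t28\t-1\tDown\n' |
    awk -F'\t' -v OFS=',' '{ print $2, $1 }')
printf '%s\n' "$result"
```

When running this against a real file, write the output to a different file rather than redirecting onto the input: the shell truncates the output file before the command gets to read it.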

How to print a range of lines with sed except the one matching the range-end pattern?

I wonder if there is a sed-only way to print a range of lines, determined by patterns to be matched, except the one last line matching the end pattern.
Consider following example. I have a file
line 1
line 2
line 3
ABC line 4
+ line 5
+ line 6
+ line 7
line 8
line 9
line 10
line 11
line 12
I want to get everything starting with ABC (including) and all the lines beginning with a +:
ABC line 4
+ line 5
+ line 6
+ line 7
I tried it with
sed -n '/ABC/I,/^[^+]/ p' file
but this gives one line too much:
ABC line 4
+ line 5
+ line 6
+ line 7
line 8
What's the easiest way (sed-only) to leave this last line out?
There might be better ways but I could come up with this sed 1 liner:
sed -rn '/ABC/,/^[^+]/{/(ABC|^\+)/!d;p;}' file
Another sed 1 liner is
sed -n '/ABC/,/^[^+]/{x;/^$/!p;}' file
One more sed 1 liner (and probably better)
sed -n '/ABC/I{h;:A;$!n;/^+/{H;$!bA};g;p;}' file
The easiest way (I'll learn something new if anyone can solve this with one call to sed), is to add an extra sed at the end, i.e.
sed -n '/ABC/I,/^[^+]/ p' file | sed '$d'
ABC line 4
+ line 5
+ line 6
+ line 7
Cheating, I know, but that is the beauty of the Unix pipe philosophy. Keep whittling down your data until you get what you want ;-)
I hope this helps.
This might work for you:
sed '/^ABC/{:a;n;/^\(ABC\|+\)/ba};d' file
EDIT: to allow adjacent ABC sections.
Well, you have selected your answer. But why aren't you using /^(ABC|\+)/ ? Or am I misunderstanding your requirement?
If you want to find those + lines AFTER a search for ABC is found
awk '/ABC/{f=1;print} f &&/^\+/ { print }' file
This is much simpler to understand than crafting cryptic sed expressions. When ABC is found, set a flag. When a line starting with + is found and the flag is set, print the line.
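Note that the flag in the command above is never cleared, so a + line appearing later in the file would also be printed. A possible variant that resets the flag at the first line not starting with + is sketched here; it is an assumption about the intended behaviour, it matches ABC case-sensitively (unlike the /ABC/I used in the question), and the extra "+ line 9" input line is made up to show the difference.

```shell
result=$(printf '%s\n' 'line 3' 'ABC line 4' '+ line 5' '+ line 6' 'line 8' '+ line 9' |
    awk '/ABC/   { f = 1; print; next }   # start of a block: print it, set flag
         !/^[+]/ { f = 0 }                # first non-+ line ends the block
         f')                              # inside a block: print the + line
printf '%s\n' "$result"
```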