I have the following lines in a text file 'file.txt'
String1 ABCDEFGHIJKL
String2 DCEGIJKLQMAB
I want to print the characters corresponding to 'String1' in another text file 'text.txt' like this
ABCDEFGHIJKL
Here, I don't want to use any line numbers. Any suggestions using the 'sed' command? I tried matching between 'String1' and 'String2', but couldn't come up with a command that excludes 'String1'. The following code excludes only 'String2'.
sed -n '/^string1/,/^string2/{p;/^string2/q}' file.txt | sed '$d' > text.txt
awk '$1=="String1" { print $2 }' file.txt > text.txt
Where the first space delimited field equals "String1", print the second field. Redirect the output to text.txt.
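Since the question asked for sed, here is a minimal sed sketch of the same idea. It assumes the key and the value are separated by exactly one space, as in the sample; the scratch directory (workdir) is hypothetical and only keeps the example self-contained.

```shell
# Hypothetical scratch directory so the example does not touch real files.
workdir=$(mktemp -d)
printf '%s\n' 'String1 ABCDEFGHIJKL' 'String2 DCEGIJKLQMAB' > "$workdir/file.txt"

# -n suppresses automatic printing; the p flag prints a line only when the
# substitution succeeded, i.e. when the line began with "String1 ".
sed -n 's/^String1 //p' "$workdir/file.txt" > "$workdir/text.txt"

cat "$workdir/text.txt"
```

Because the substitution itself removes the key, no second sed or line number is needed.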
Use GNU grep:
grep -Po 'String1\s+\K.*' in_file
Here, grep uses the following options:
-P : Use Perl regexes.
-o : Print the matches only (1 match per line), not the entire lines.
\K : Cause the regex engine to "keep" everything it had matched prior to the \K and not include it in the match. Specifically, ignore the preceding part of the regex when printing the match.
SEE ALSO:
grep manual
perlre - Perl regular expressions
I am trying to chain multiple grep patterns to find a number within a grepped string.
I have a text file like this:
This is the first sample line 1
this is the second sample line
another line
total lines: 3 tot
I am trying to find a way to get just the number of total lines. So the output here should be "3"
Here are the things I've tried:
grep "total lines: [0-9]" myfile.txt
grep "total lines" myfile.txt | grep "[0-9]"
You could use sed:
sed -En 's/^total lines: ([0-9]+).*/\1/p' myfile.txt
-E extended regular expressions
-n suppress automatic printing
Match ^total lines: ([0-9]+).* (capture the number)
\1 replace the whole line with the captured number
p print the result
1st solution: Using GNU grep, try the following. The -o option prints only the matched text, and -P enables PCRE. The regex matches ^total lines: at the start of a line, then \K discards what has been matched so far so that it is left out of the output. It then matches 1 or more digits, with a positive lookahead to make sure they are followed by whitespace and tot.
grep -oP '^total lines: \K[0-9]+(?=\s+tot)' Input_file
2nd solution: With the samples shown, a single awk command is enough. It finds lines containing total lines: and prints the second-to-last field of each.
awk '/total lines: /{print $(NF-1)}' Input_file
3rd solution: Using awk's match function. It matches total lines: [0-9]+ tot, then removes everything except the digits from the matched text.
awk 'match($0,/total lines: [0-9]+ tot/){val=substr($0,RSTART,RLENGTH);gsub(/[^0-9]+/,"",val);print val}' Input_file
Do you have to use grep?
$ wc -l < myfile.txt
If you mean that the file has a line in it formatted as
total lines: 3 tot
Then refer to https://unix.stackexchange.com/questions/13466/can-grep-output-only-specified-groupings-that-match and use something like:
grep -Po 'total lines: \K\d+' myfile.txt
Notes:
Perl regex is not my forte, so the \K\d+ part might not work.
This may be doable without -P, but I cannot test from this Windows computer.
regex101.com helped me test the above line, so it may work.
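On the second note above: a sketch that appears to work without -P, assuming a grep that supports -o (GNU and BSD grep both do, although -o is not in POSIX). The variable names are made up for the example.

```shell
# Sample input from the question.
input=$(printf '%s\n' 'This is the first sample line 1' 'this is the second sample line' 'another line' 'total lines: 3 tot')

# First grep: keep only the label plus the digits ("total lines: 3").
# Second grep: keep only the digits at the end of that match.
total=$(printf '%s\n' "$input" | grep -o 'total lines: [0-9][0-9]*' | grep -o '[0-9][0-9]*$')

echo "$total"
```

The first line of the sample also ends in a digit, but it never reaches the second grep because it lacks the "total lines: " label.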
The problem with relying on the pattern of the last line and using grep/sed to find it is that if any other line in the file contains the same pattern, you need additional logic to filter it out.
E.g. consider the input file below.
line001
total lines: 883 tot
This is the first sample line 1
this is the second sample line
another line
total lines: 883 tot
Assuming your file format is constant (i.e. the second-to-last line is blank and the last line contains the total count), you can skip pattern matching and count the rows directly with the awk command below.
awk 'END { print NR - 2 }' myfile.txt
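To make the concern above concrete, here is a sketch that restricts the substitution to the last line with sed's $ address, so an earlier line that merely looks like the total is ignored. It assumes the total is always on the last line; the data is the sample file above, with a blank line assumed before the final line.

```shell
input=$(printf '%s\n' 'line001' 'total lines: 883 tot' 'This is the first sample line 1' 'this is the second sample line' 'another line' '' 'total lines: 883 tot')

# Pattern-only match: both the stray line and the real total are printed.
all_matches=$(printf '%s\n' "$input" | sed -n 's/^total lines: \([0-9][0-9]*\).*/\1/p')

# The $ address applies the substitution to the last input line only.
last_only=$(printf '%s\n' "$input" | sed -n '$s/^total lines: \([0-9][0-9]*\).*/\1/p')

printf 'all matches:\n%s\nlast line only: %s\n' "$all_matches" "$last_only"
```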
You can use the following awk to get the third field on a line that starts with total lines: and stop processing the file further:
awk '/^total lines:/{print $3; exit}' file
You can use the following GNU grep:
# Extract a non-whitespace chunk after a certain pattern
grep -oP '^total lines:\s*\K\S+' file
# Extract a number after a pattern
grep -oP '^total lines:\s*\K\d+(?:\.\d+)?' file
Details:
^ - start of string
total lines: - a literal string
\s* - any zero or more whitespace chars
\K - match reset operator discarding all text matched so far
\S+ - one or more non-whitespace chars
\d+(?:\.\d+)? - one or more digits and then an optional sequence of . and one or more digits.
I am trying to add 5 blank line spaces in a text file (text.txt) before and after string pattern matches. I used the following to get spaces after the 'string' match which worked for me-
sed '/string/{G;G;G;G;G;}' text.txt
I want to apply the same sed command to obtain 5 blank lines before the 'string'. Here I don't want spaces, but rather blank lines before and after the match. Any suggestions?
sed -r 's/(^.*)(string)(.*$)/\1\n\n\n\n\n\2\n\n\n\n\n\3/' text.txt
Use -r or -E to enable extended regular expressions. Split the line into three sections, then replace it with the first section, 5 newlines, the second section, 5 newlines, and finally the third section.
Use this Perl one-liner:
perl -pe 's/string/\n\n\n\n\n$&\n\n\n\n\n/' text.txt
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-p : Loop over the input one line at a time, assigning it to $_ by default. Add print $_ after each loop iteration.
s/PATTERN/REPLACEMENT/ : change PATTERN to REPLACEMENT.
$& : matched pattern.
\n : newline character.
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
perldoc perlrequick: Perl regular expressions quick start
For a single string match:
$ sed -e '/string/{ s/^/\n\n\n\n\n/; s/$/\n\n\n\n\n/ }' text.txt
For multiple strings, assuming same requirements:
$ sed -E '/(string1|string2|string3)/{ s/^/\n\n\n\n\n/; s/$/\n\n\n\n\n/ }' text.txt
This might work for you:
sed '/string/{G;s/\(string\)\(.*\)\(.\)/\3\3\3\3\3\1\3\3\3\3\3\2/}' file
Match on string and append a newline with G. The substitution then captures that trailing newline as \3 and reuses it to place 5 newlines on either side of the match.
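Another sketch that avoids \n in the replacement (not every sed accepts it there): swap in the empty hold space, print it five times, swap back, then append it five times with G. It assumes nothing else in the script writes to the hold space; the sample text is made up.

```shell
# x     swap pattern space and (empty) hold space
# p x5  print the now-empty pattern space: 5 blank lines before the match
# x     swap the matching line back in
# G x5  append newline + empty hold space: 5 blank lines after the match
out=$(printf '%s\n' 'before' 'a string here' 'after' |
  sed '/string/{x;p;p;p;p;p;x;G;G;G;G;G;}')

printf '%s\n' "$out"
```

Unlike the substitution-based answers, this keeps the matching line whole and puts the blank lines around the line rather than around the matched word.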
And an awk version:
awk '{if(/string1|string2|.../){printf "\n\n\n\n\n%s\n\n\n\n\n",$0}else{print}}' file
I want to find a list of words that contain six or more consonants in a row from a number of text files.
I'm pretty new to the Unix terminal, but this is what I have tried:
cat *.txt | grep -Eo "\w+" | grep -i "[^AEOUIaeoui]{6}"
I use the cat command here because it will otherwise include the file names in the next pipe. I use the second pipe to get a list of all the words in the text files.
The problem is the last pipe, I want to somehow get it to grep 6 consonants in a row, it doesn't need to be the same one. I would know one way of solving the problem, but that would create a command longer that this entire post.
For the last grep you also need the -E switch - or you need to escape the curly braces:
cat *.txt | grep -Eo "\w+" | grep -Ei "[^AEOUIaeoui]{6}"
cat *.txt | grep -Eo "\w+" | grep -i "[^AEOUIaeoui]\{6\}"
I use the cat command here because it will otherwise include the file names in the next pipe
You can disable this using the -h flag:
grep -hEo "\w+" *.txt | grep -Ei "[^AEOUIaeoui]{6}"
You can use
grep -hEio '[[:alpha:]]*[b-df-hj-np-tv-z]{6}[[:alpha:]]*' *.txt
Regex details
[[:alpha:]]* - any zero or more letter
[b-df-hj-np-tv-z]{6} - six English consonant letters on end
[[:alpha:]]* - any zero or more letter.
The grep options make the search case insensitive (-i), print only the matched text (-o), and suppress the filenames (-h). The -E option enables POSIX ERE syntax; without it, you would need to escape {6} as \{6\}.
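A small comparison sketch, using a made-up sample line: a negated vowel class also counts digits and underscores as "non-vowels", while the explicit consonant class does not. LC_ALL=C is set here as an assumption to keep the bracket ranges ASCII-only; \w in grep -E is an extension supported by GNU and BSD grep.

```shell
sample='rhythms Knightsbridge id_1234567 strengths'

# Negated vowel class: id_1234567 qualifies because digits are not vowels.
loose=$(printf '%s\n' "$sample" | LC_ALL=C grep -Eo '\w+' | LC_ALL=C grep -Ei '[^aeiou]{6}')

# Explicit consonant class: only runs of six real consonant letters match.
strict=$(printf '%s\n' "$sample" | LC_ALL=C grep -Eio '[[:alpha:]]*[b-df-hj-np-tv-z]{6}[[:alpha:]]*')

printf 'loose:\n%s\nstrict:\n%s\n' "$loose" "$strict"
```

Note that the consonant class treats y as a consonant, which is why rhythms matches; strengths has at most five consonants in a row and is skipped by both.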
Use this Perl one-liner:
perl -lne 'print for grep { /[^aeoui]{6}/i } /\b([a-z]+)\b/ig' in_file.txt
Example:
cat > in_file.txt <<EOF
the abcdfghi aBcdfghi.
ABCDFGHI234
abcdEfgh
EOF
perl -lne 'print for grep { /[^aeoui]{6}/i } /\b([a-z]+)\b/ig' in_file.txt
Output:
abcdfghi
aBcdfghi
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
The regex uses these modifiers:
/g : Multiple matches.
/i : Case-insensitive matches.
/\b([a-z]+)\b/ig : Match words that consist of 1 or more letters only ([a-z]+), with words boundary \b on both sides. This way, ABCDFGHI234 does not match, but all 3 words in line 1 (the, abcdfghi, aBcdfghi) match. This may be important for some applications. Note that not all answers in this thread use the word boundary around letters, and thus do not make the distinction shown in this example.
/[^aeoui]{6}/i : Match 6 or more consecutive non-vowels. Non-vowels here resolve exactly to consonants, because the previous regex selected for words made of letters only, that is, vowels and consonants.
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
perldoc perlre: Perl regular expressions (regexes)
perldoc perlre: Perl regular expressions (regexes): Quantifiers; Character Classes and other Special Escapes; Assertions; Capture groups
perldoc perlrequick: Perl regular expressions quick start
Get all words containing 6 or more consonants in a row in a given directory
cat *.txt | grep -Eo "\w+" | grep -E "[^AEOUIaeoui]{6,}"
We can use grep -Eo (-E Extended regex, -o output ONLY matching)
cat *.txt will output all of the data from all txt files in the current directory
grep -Eo "\w+" will output all of the words from an input in the form of one word per line
We can use Regex to search for strings that contain a pattern:
[^LISTOFCHARACTERS] Any character but LISTOFCHARACTERS
{6,} 6 or more
Is it possible to have a regex that parses only a1bcdea1 from this line a1bcdea1ABCa1DEFa1 ?
This grep command does not work:
$ cat txtfile
a1bcdea1ABCa1DEFa1
$ grep -oE "[A-Z,a-z]1.*?[A-Z,a-z]1" txtfile
a1bcdea1ABCa1DEFa1
I want the output of grep to be only a1bcdea1.
EDIT:
It is obvious that I can just use grep -o "a1bcdea1" for the above line, but consider if one has several thousands of lines and the goal is to match FIRST [A-Z,a-z]1.*?[A-Z,a-z]1 for each single line.
How about using a ^ start anchor and restricting character set used:
grep -o '^[A-Za-z]1[A-Za-z]*1'
If you expect more digits or other characters in between, go with this
grep -oP '^[A-Za-z]1.*?[A-Za-z]1'
Lazy matching requires Perl-compatible mode. If the match does not begin at the start of the line, go with this
grep -oP '^.*?\K[A-Za-z]1.*?[A-Za-z]1'
\K resets beginning of the reported match and is a PCRE feature as well.
Here is a gnu awk solution using split function:
awk '(n = split($0, a, /[a-zA-Z]1/, b)) > 1 {print b[1] a[2] b[2]}' file
a1bcdea1
This awk command splits each line on regex /[a-zA-Z]1/ and stores split tokens in array a and delimiters in array b.
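If gawk is not available, here is a sketch of a similar idea in POSIX awk: find the first [A-Za-z]1, then search only the remainder of the line for the next one, which emulates the lazy match. The function name first_pair and the awk variable names are made up for the example.

```shell
first_pair() {
  awk '{
    if (match($0, /[A-Za-z]1/)) {            # opening delimiter
      start = RSTART; open_len = RLENGTH
      rest = substr($0, start + open_len)    # text after the opening delimiter
      if (match(rest, /[A-Za-z]1/))          # first closing delimiter in the rest
        print substr($0, start, open_len + RSTART + RLENGTH - 1)
    }
  }'
}

first=$(printf '%s\n' 'a1bcdea1ABCa1DEFa1' | first_pair)
echo "$first"
```

Lines without two delimiters print nothing, which matches the behavior of grep -o.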
I need to get X to Y in the file with multiple occurrences, each time it matches an occurrence it will save to a file.
Here is an example file (demo.txt):
\x00START how are you? END\x00
\x00START good thanks END\x00
sometimes random things\x00\x00 inbetween it (ignore this text)
\x00START thats nice END\x00
And now after running a command each file (/folder/demo1.txt, /folder/demo2.txt, etc) should have the contents between \x00START and END\x00 (\x00 is null) in addition to 'START' but not 'END'.
/folder/demo1.txt should say "START how are you? ", /folder/demo2.txt should say "START good thanks".
So basically it should pipe "how are you?" and using 'echo' I can prepend the 'START'.
It's worth keeping in mind that I am dealing with a very large binary file.
I am currently using
sed -n -e '/\x00START/,/END\x00/ p' demo.txt > demo1.txt
but that's not working as expected (it's getting lines before the '\x00START' and doesn't stop at the first 'END\x00').
If you have GNU awk, try:
awk -v RS='\0START|END\0' '
length($0) {printf "START%s\n", $0 > ("folder/demo"++i".txt")}
' demo.txt
RS='\0START|END\0' defines a regular expression acting as the [input] Record Separator which breaks the input file into records by strings (byte sequences) between \0START and END\0 (\0 represents NUL (null char.) here).
Using a multi-character, regex-based record separator is NOT POSIX-compliant; GNU awk supports it (as does mawk in general, but seemingly not with NUL chars.).
Pattern length($0) ensures that the associated action ({...}) is only executed if the record is nonempty.
{printf "START%s\n", $0 > ("folder/demo"++i".txt")} outputs each nonempty record, preceded by "START", into file folder/demo{n}.txt, where {n} represents a sequence number starting with 1.
You can use grep for that:
grep -Po "START\s+\K.*?(?=END)" file
how are you?
good thanks
thats nice
Explanation:
-P To allow Perl regex
-o To extract only matched pattern
\K Discard everything matched so far from the reported match (similar in effect to a lookbehind)
(?=something) Positive lookahead
EDIT: To match \00 as START and END may appear in between:
echo -e '\00START hi how are you END\00' | grep -aPo '\00START\K.*?(?=END\00)'
hi how are you
EDIT2: The solution using grep only matches within a single line; for multi-line content it's better to use perl instead. The syntax will be very similar:
echo -e '\00START hi \n how\n are\n you END\00' | perl -ne 'BEGIN{undef $/ } /\A.*?\00START\K((.|\n)*?)(?=END)/gm; print $1'
hi
how
are
you
What's new here:
undef $/ Undefine INPUT separator $/ which defaults to '\n'
(.|\n)* Dot matches almost any character, but it does not match \n, so we need to add it here.
/gm Modifiers, g for global m for multi-line
I would translate the nulls into newlines so that grep can find your wanted text on a clean line by itself:
tr '\000' '\n' < yourfile.bin | grep "^START"
from there you can take it into sed as before.
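A sketch of how that pipeline might continue, assuming each START ... END block sits on a single line once the NULs are translated (multi-line blocks would need a different approach). The scratch directory and the demoN.txt naming are stand-ins for the /folder/demoN.txt files in the question.

```shell
# Hypothetical scratch directory holding a small copy of the sample data.
workdir=$(mktemp -d)
printf '\000START how are you? END\000\n\000START good thanks END\000\nsometimes random things\000\000 inbetween it\n\000START thats nice END\000\n' > "$workdir/demo.txt"

# 1. tr:  turn every NUL into a newline so each block is on its own line.
# 2. sed: keep lines of the form START...END and drop the trailing END.
# 3. awk: write the Nth surviving line to demoN.txt, closing each file.
tr '\000' '\n' < "$workdir/demo.txt" |
  sed -n 's/^\(START.*\)END$/\1/p' |
  awk -v dir="$workdir" '{ out = dir "/demo" NR ".txt"; print > out; close(out) }'

cat "$workdir/demo1.txt"
```

Anchoring on both ^START and END$ is what skips the "random things" text, since those fragments end up on lines of their own.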