AWK: Write lines into multiple files - regex

I'm trying to extract sequences from a FASTA file using awk.
The file looks like this, and it contains 703 sequences. I want to extract each of them to a separate file.
>sequence_1
AACTTGGCCTT
>sequence_2
AACTTGGCCTT
.
.
.
I'm using this awk script:
awk '/>/ {OUT=substr($0,2) ".fasta"}; OUT {print >OUT}' file.fasta
...which works, but only for the first 16 sequences, and then I get an error saying:
.fasta makes too many open files
input record number 35, file file.fasta
source line number 1

You would need to close files when you're done. Try:
awk '/>/ {close(OUT); OUT=substr($0,2) ".fasta"}; OUT {print > OUT}' file.fasta
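If the headers can contain characters that are awkward in file names (slashes, spaces and so on), a sanitizing variant along these lines should also work; the gsub() pattern here is only an example of what you might want to replace:
awk '/^>/ {
    close(OUT)
    name = substr($0, 2)
    gsub(/[\/ ]/, "_", name)   # example: turn "/" and spaces into "_"
    OUT = name ".fasta"
}
OUT { print > OUT }' file.fasta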

Related

Regex to add a character at the beginning of a particular line in a file

I have a file with n lines, and I want to find one particular line and edit it without printing the file contents on the screen. The file is created dynamically, so I can't count on the spacing. That's why I want to use a regex for this.
My file is:
hey retry=3
hello so
password so
And I want to make it as:
#hey retry=3
hello so
password so
I tried all these:
sed 's/password[ \t]+requisite[ \t]+pam_pwquality.so/s/^/#/' test1
x='/password[ \t]+requisite[ \t]+pam_pwquality.so/'
sed -i -e "s/\($x\)/#\1/" test1
re="^[password][[ :blank: ]]*[requisite][[ :blank:]]*[pam_pwquality.so][[ :blank:]]*[retry=3]"
But no changes in the file.
I would use awk:
awk '$1=="password" && $2=="requisite" && $3=="pam_deny.so" {
$0="#"$0
}1' file
awk splits the line into fields separated by one or more whitespace characters (which includes tabs). That makes it simple to check the content of the individual fields.
With gawk you can change the file in place:
gawk -i inplace '$1=="password" && $2=="requisite" && $3=="pam_deny.so" {
$0="#"$0
}1' file
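For illustration, the non-inplace version behaves roughly like this on a small made-up sample (both sample lines are invented for the test):
$ printf '%s\n' 'password requisite pam_deny.so' 'session optional pam_keyinit.so' > sample
$ awk '$1=="password" && $2=="requisite" && $3=="pam_deny.so" {$0="#"$0}1' sample
#password requisite pam_deny.so
session optional pam_keyinit.so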
grep -n "password requisite pam_deny.so"
man grep states
-n, --line-number
Prefix each line of output with the 1-based line number within its input file. (-n is specified by POSIX.)

Sed: Search and replace on a 4GB one-line file

OS: 14.04
sed: 4.2.4
I have multiple large files (2-4 GB) that I want to perform some simple manipulations on. The entire file is in one line, which makes me wonder how to perform sed operations on it.
There are three things I want to do with each file:
1) Remove all [ characters
2) Remove all ] characters
3) Replace all occurrences of },{ with }{.
So far I have tried sed -e 's/},{/}{/g' file.json > file_new.json with and without the g option, without any luck. I have also tried sed -e 's/\[//g' file.json > file_new.json without any luck. I only get a duplicate file.
Any ideas?
With gnu awk:
awk 'BEGIN{FS="},{";OFS="}{";RS="[][]";ORS=""}$1=$1' file
Perhaps faster with perl (must be tested):
perl -0135 -pe 's/},{/}{/g;y/][//d' file
Where 135 is the octal code for the character ]. The -0 option defines the record separator (instead of being read line by line, the file is read in chunks ending at each ]).
The goal of these two approaches is to avoid loading the whole file into memory.
To store the result in a file:
You can redirect the output.
awk 'BEGIN{FS="},{";OFS="}{";RS="[][]";ORS=""}$1=$1' file > result
or
perl -0135 -pe 's/},{/}{/g;y/][//d' file > result
You can use command line options:
awk -i inplace -v INPLACE_SUFFIX=.bak 'BEGIN{FS="},{";OFS="}{";RS="[][]";ORS=""}$1=$1' file
or
perl -0135 -pi'*.bak' -e 's/},{/}{/g;y/][//d' file
(These two commands create a backup of the original file with the extension .bak; if you want to change the source file in place without a backup, remove -v INPLACE_SUFFIX=.bak for gawk and '*.bak' for perl.)
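Before committing either command to a multi-gigabyte run, it may be worth checking the transformation on a tiny sample first (the JSON fragment below is made up just for the test):
$ printf '[{"a":1},{"b":2}]' > sample.json
$ awk 'BEGIN{FS="},{";OFS="}{";RS="[][]";ORS=""}$1=$1' sample.json
{"a":1}{"b":2}
$ perl -0135 -pe 's/},{/}{/g;y/][//d' sample.json
{"a":1}{"b":2}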
When I've got huge single-line files like that, for which the usual line-based tools won't work, I usually turn to: tr!
1) Remove all [ characters
2) Remove all ] characters
That's easy:
tr -d '[]' < file > strippedfile
(This might not work with a really, really old SysV version of tr, but it should be fine with any modern version.)
3) Replace all occurrences of },{ with }{.
That's trickier, because you care about context, so it's really a job for sed. One kludge I've used is to use tr to temporarily change some other character to a newline -- that is, to temporarily change the huge single-line file into a multi-line file -- then run sed, and finally change it back to a single-line file. Something like
tr '{' '\n' < file | sed 's/},$/}/' | tr '\n' '{' > newfile
This last step works only if the original file contains no newlines. You could run it through tr -d '\n' first to be sure.
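Put together on the same kind of made-up sample (and assuming the original file really contains no newlines), the whole round trip might look like this:
$ printf '[{"a":1},{"b":2}]' | tr -d '[]' | tr '{' '\n' | sed 's/},$/}/' | tr '\n' '{'
{"a":1}{"b":2}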
Try this to place a newline at the end of the file:
echo "" >> file
sed 'whatever' file
Many UNIX tools will simply not recognize a file with no ending newline as a text file, and so will not operate on it, so that MAY be your problem. If that doesn't work, then edit your question to include a concise, testable example of your file.
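As a quick illustration of why the trailing newline matters (the sample file is made up): wc -l counts newline characters, so a file without a final newline looks "short" to line-oriented tools until you append one:
$ printf 'one long line with no newline' > sample
$ wc -l sample
0 sample
$ echo "" >> sample
$ wc -l sample
1 sample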

Delete rows with extra delimiter from csv file in unix

I have a csv file with 3 columns separated by the ',' delimiter. Some values have , in the data, and I would like to remove the whole record. Please suggest whether I can do this using sed/awk/grep commands.
Input file :
monitor,display,45
keyboard,input,20
loud,speaker,output,20
mount,input,20
Expected Output :
monitor,display,45
keyboard,input,20
mount,input,20
I used the grep command to filter out rows with extra commas:
grep -v '.*,.*,.*,.*' input_file > output_file
The pattern matches any line that contains at least three commas, and -v excludes the records which match it, so only the three-field rows remain.
Below is how you can do the same using awk; basically you want the records in which there are exactly 3 fields:
$ awk -F, 'NF==3 {print $0}' data1.txt
monitor,display,45
keyboard,input,20
mount,input,20
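If you also want to keep the malformed rows somewhere for review, a variant along these lines writes them to a second file (the output file names are just examples):
awk -F, 'NF==3 {print > "clean.csv"; next} {print > "bad_rows.csv"}' input_file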

Parse target file based on source file contents

I am trying to search for lines in FileB (which is comma separated) that contain content from lines in FileA. I originally tried using grep but it does not seem to care for some of the characters in FileA. I do not assume that the CSV formatting would matter much, well at least to grep.
$ grep -f FileA FileB
grep: Unmatched [ or [^
I am open to using any generally available Linux command, Perl or Python. There is no specific expression that can be matched, which is the reason for using the content from FileA to match on. Below are some example lines that are in FileA that we want to match in FileB.
page=--&id='`([{^~
page=&rows_select=%' and '%'='
l=admin&x=&id=&pagex=http://.../search/cache?ei=utf-&p=change&fr=mailc&u=http://sub.domain.com/cache.aspx?q=change&d=&mkt=en-us&setlang=en-us&w=afe,dbfcd&icp=&.intl=us&sit=dbajdy.alt
The lines in FileB that contain the above strings will contain additional characters, i.e. the strings in the two files will not be a one-for-one match. For example, if FileA contains abc and FileB contains 012abc*(), then 012abc*() would print.
A simple python solution would be:
with open('filea', 'r') as fa, open('fileb', 'r') as fb:
    patterns = fa.readlines()
    for line in fb:
        # lines from fb keep their trailing newline, as do the patterns
        # from readlines(), so exact-line matches compare equal
        if line in patterns:
            print(line, end='')
which would store the whole pattern file in memory, and compare each line of the other file against the list.
But why wouldn't you just use diff? I'd have to look at the manpage, but I'm pretty sure there's a way to make it show the similarities between two files. After googling:
Using diff to find the portions of many files that are the same? (bizzaro-diff, or inverse-diff)
https://unix.stackexchange.com/questions/1079/output-the-common-lines-similarities-of-two-text-files-the-opposite-of-diff
they give that solution:
diff --unchanged-group-format='## %dn,%df
%<' --old-group-format='' --new-group-format='' \
--changed-group-format='' a.txt b.txt
Untested solution. Logic:
Store each line of FILEA in a lines array
For each line of FILEB, check whether any stored line appears as a part of it, i.e. index(..) returns > 0
If so, print that line from FILEB
awk 'NR==FNR{lines[$0]++;next}{for (line in lines) {if (index($0,line)>0) {print $0}}}' FILEA FILEB
Use fgrep (or equivalently grep -F). That interprets the patterns (the contents of FileA) as literal strings to search for instead of regular expressions.
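A quick check with the example from the question (the sample files here are made up to match it):
$ printf '%s\n' 'abc' > FileA
$ printf '%s\n' '012abc*(),' 'no match here' > FileB
$ grep -F -f FileA FileB
012abc*(),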

How can I remove all characters in each line after the first space in a text file?

I have a large log file from which I need to extract file names.
The file looks like this:
/path/to/loremIpsumDolor.sit /more/text/here/notAlways/theSame/here
/path/to/anotherFile.ext /more/text/here/differentText/here
.... about 10 million times
I need to extract the file names like this:
loremIpsumDolor.sit
anotherFile.ext
I figure my first strategy is to find/replace all /path/to/ with ''. But I'm stuck on how to remove all characters after the space.
Can you help?
sed 's/ .*//' file
It doesn't take any more than that. The transformed output appears on standard output, of course.
In theory, you could also use awk to grab the filename from each line as:
awk '{ print $1 }' input_file.log
That, of course, assumes that there are no spaces in any of the filenames. awk defaults to looking for whitespace as the field delimiters, so the above snippet would take the first "field" from your log file (your filename) for each line, and output it.
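If you also want to drop the leading directories in the same pass (the desired output above is loremIpsumDolor.sit, not the full path), a combined awk one-liner along these lines should work, again assuming the file names themselves contain no spaces:
awk -F/ '{ sub(/ .*/, ""); print $NF }' input_file.log
which would print:
loremIpsumDolor.sit
anotherFile.ext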
Pass it to cut:
cut '-d ' -f1 yourfile
a bash-only solution:
while read path otherstuff; do
    echo "${path##*/}"
done < filename