Combine multiple lines of text documents into one - python-2.7

I have thousands of text documents, each with a varying number of lines of text. I want to combine all the lines into one single line in each document individually. For example:
abcd
efgh
ijkl
should become
abcd efgh ijkl
I tried using sed commands, but they aren't quite achieving what I want, as the number of lines in each document varies. Please suggest what I can do. I am working with Python on Ubuntu. One-line commands would be of great help. Thanks in advance!

If you place your script in the same directory as your files, the following code should work.
import os

count = 0
# Assumes the script lives in the same directory as the .txt files,
# so each file can be listed and opened by bare name.
for doc in os.listdir('.'):
    if doc.endswith(".txt"):
        with open(doc, 'r') as f:
            single_line = ''.join(f)                      # concatenate every line
        single_space = ' '.join(single_line.split())      # collapse newlines/whitespace runs into single spaces
        with open("new_doc{}.txt".format(count), "w") as out:
            out.write(single_space)
        count += 1
@inspectorG4dget's code is more compact than mine, and thus I think it's better. I tried to make mine as user-friendly as possible. Hope it helps!

Using Python isn't necessary; this does the trick:
% echo `cat input.txt` > output.txt
To apply to a bunch of files, you can use a loop. E.g. if you're using bash:
for inputfile in /path/to/directory/with/files/* ; do
    echo `cat "${inputfile}"` > "${inputfile}2"
done
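Note that the unquoted backticks subject each file's contents to word splitting and glob expansion (a stray * in the text would be expanded). A minimal alternative sketch using paste, which joins lines with a chosen delimiter and avoids those side effects:
for inputfile in /path/to/directory/with/files/* ; do
    # -s serializes all lines of one file into a single line; -d ' ' joins them with spaces
    paste -s -d ' ' "${inputfile}" > "${inputfile}2"
done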

Assuming all your files are in one directory, have a .txt extension, and you have access to a Linux box with bash, you can use tr like this:
for i in *.txt ; do tr '\n' ' ' < "$i" > "$i.one"; done
for every "file.txt", this will produce a "file.txt.one" with all the text on one line.
If you want a solution that operates on the files directly you can use GNU sed (NOTE: THIS WILL CLOBBER YOUR STARTING FILES - MAKE A BACKUP OF THE DIRECTORY BEFORE TRYING THIS):
sed -i -n 'H;${x;s|\n| |g;p};' *.txt
Here H appends each line to the hold space; on the last line ($), x swaps the accumulated text into the pattern space, the embedded newlines are replaced with spaces, and p prints the result. (Because H prepends a newline to the initially empty hold space, the output carries a single leading space.)
If your files aren't in the same directory, you can use find with -exec:
find . -name "*.txt" -exec YOUR_COMMAND \{\} \;
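Since the tr one-liner relies on shell redirection, plugging it into -exec needs a small sh -c wrapper. A sketch (the _ fills the shell's $0 slot so the filename lands in $1):
find . -name "*.txt" -exec sh -c 'tr "\n" " " < "$1" > "$1.one"' _ {} \;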
If this doesn't work, maybe a few more details about what you're trying to do would help.

Related

How to rename multiple files with multiple letters and numbers combinations and sizes using bash or regex?

I've been trying to rename a list of files, but it's been quite difficult...
The 41 filenames are:
BEIII_S29_pear_derep.fasta
BEII_S15_pear_derep.fasta
BEI_S1_pear_derep.fasta
MB211III_S30_pear_derep.fasta
MB211II_S16_pear_derep.fasta
MB211I_S2_pear_derep.fasta
...
and I need to rename to:
BEIII.fas
BEII.fas
BEI.fas
MB211III.fas
MB211II.fas
MB211I.fas
I tried using a for loop:
for i in *_S[0-9]{1,2}_pear_derep.fasta; do newfile="$(basename $i _S[0-9]{1,2}_pear_derep.fasta)"; echo $newfile; cp ${newfile}_S[0-9]{1,2}_pear_derep.fasta ${newfile}.fas; done;
It didn't work, so then I tried:
rename 's/([A-Z]*[0-9]*[I]{1,4})_[A-Z][0-9]_[a-z]_[a-z]{1,5}(\.fasta).*/$1$2/g' *
That didn't work either, so then:
for file in *.fas; do newfile=$(echo "$file" | sed -re 's/S_[0-9][0-9](\.)/\./g') mv -v $file $newfile; done;
None of them worked.
The thing here is that I have to use a regex to KEEP a variable beginning, which varies but matches
[A-Z]{2}[0-9]{3}[I]{1,3}
then everything else is excluded:
S[0-9]{1,2}_[a-z]{4}_[a-z]{5}
and the extension changes from .fasta to .fas.
Could someone help me please?
Thank you Guys
You should make sure that the *\.fasta glob targets every file you need. Echo the mv command first (see the dry run below) or make a copy of the directory and try it there.
for i in *\.fasta; do
    mv "$i" "${i/_*/}.fas"
done
The substitution ${i/_*/} removes the first _ and everything after it.
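For instance, a dry run that only prints the commands it would execute:
for i in *\.fasta; do
    echo mv "$i" "${i/_*/}.fas"    # inspect the output, then drop the echo
done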
The regexp in your rename attempt is missing a bunch of quantifiers. Also, it doesn't change the extension from .fasta to .fas. You should also anchor it to the beginning and end of the filename. There's no need for the g modifier, since you're only doing one replacement per name.
rename 's/^([A-Z]*[0-9]*I{1,4})_[A-Z][0-9]*_[a-z]*_[a-z]{1,5}\.fasta$/$1.fas/' *
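If your rename is the Perl one, its -n flag gives a safe preview, printing what would be renamed without touching anything:
rename -n 's/^([A-Z]*[0-9]*I{1,4})_[A-Z][0-9]*_[a-z]*_[a-z]{1,5}\.fasta$/$1.fas/' *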

Regex match for file and rename + overwrite old file

I'm trying to make a bash script to rename files which match my regex; if they match, I want to rename them using the regex and overwrite an old existing file.
I want to do this because on computer 1 I have a file, and on computer 2 I change the file. Later I go back to computer 1 and it flags a copy conflict, so it saves both versions.
Example file:
acl_cam.MYI
Example file after conflict:
acl_cam (Example conflit with .... on 2015-08-20).MYI
I tried a lot of things like rename, mv, and a couple of other scripts, but they didn't work.
The regex I think I should use:
(.*)/s\(.*\)\.(.*)
then rename the file to value1.value2, replacing the old file (acl_cam.MYI), and do this for all files/directories from where it started.
Can you guys help me with this one?
The issue you have, if I understand your question correctly, has two parts: (1) what is the correct regex to match the error string and produce a filename? and (2) how do you use the returned filename to move/remove the offending file?
If the string at issue is:
acl_cam (Example conflit with .... on 2015-08-20).MYI
and you need to return the MySQL file name, then a regex similar to the following will work:
[ ][(].*[)]
The stream editor sed is about as good as anything else to return the filename from your string. Example:
$ printf "acl_cam (Example conflit with .... on 2015-08-20).MYI\n" | \
sed -e 's/[ ][(].*[)]//'
acl_cam.MYI
(shown with line continuation above)
Then it is up to you how you move or delete the file. The remaining question is where the information (the error string) is currently stored and how you have access to it. If you have a file full of these errors, then you could do something like the following:
while read -r line; do
    victim=$( printf "%s\n" "$line" | sed -e 's/[ ][(].*[)]//' )
    ## to move the file to /path/to/old
    [ -e "$victim" ] && mv "$victim" /path/to/old
done < "$myerrorfilename"
(you could also feed the string to sed as a here-string)
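For reference, that here-string form (bash-specific) looks like:
victim=$(sed -e 's/[ ][(].*[)]//' <<< "$line")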
You could also just delete the file if that suits your purpose. However, more information is needed to clarify how/where that information is stored and what exactly you want to do with it to provide any more specifics. Let me know if you have further questions.
Final solution for this question for people who are interested:
for i in *; do
    # Wildcard check: does the current filename contain "(Exemplaar"?
    if [[ $i == *"(Exemplaar"* ]]
    then
        # Recover the original name (strip the " (Exemplaar ...)" conflict suffix)
        NewFileName=$(echo "$i" | sed -E -e 's/[ ][(].*[)]//')
        # Remove the original file
        rm "$NewFileName"
        # Copy the conflict file to the original file name
        cp -a "$i" "$NewFileName"
        # Delete the conflict file
        rm "$i"
        echo "Replaced file: $NewFileName with: $i"
    fi
done
I used this code to replace my database conflict files created by Dropbox syncing between different computers.
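A side note on the loop above: the rm/cp/rm sequence can be collapsed into a single mv, since mv overwrites the destination and keeps the file's data and attributes. A minimal sketch of the same loop body:
NewFileName=$(sed -E 's/[ ][(].*[)]//' <<< "$i")
mv -f "$i" "$NewFileName"    # overwrite the original with the conflict copy in one step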

How to find lines using patterns in a file in UNIX

I am trying to use a .txt file with around 5000 patterns (spaced with a line) to search through another file of 18000 lines for any matches. So far I've tried every form of grep and awk I can find on the internet and it's still not working, so I am completely stumped.
Here's some text from each file.
Pattern.txt
rs2622590
rs925489
rs2798334
rs6801957
rs6801957
rs13137008
rs3807989
rs10850409
rs2798269
rs549182
There are no extra spaces or anything.
File.txt
snpid hg18chr bp a1 a2 zscore pval CEUmaf
rs3131972 1 742584 A G 0.289 0.7726 .
rs3131969 1 744045 A G 0.393 0.6946 .
rs3131967 1 744197 T C 0.443 0.658 .
rs1048488 1 750775 T C -0.289 0.7726 .
rs12562034 1 758311 A G -1.552 0.1207 0.09167
rs4040617 1 769185 A G -0.414 0.6786 0.875
rs4970383 1 828418 A C 0.214 0.8303 .
rs4475691 1 836671 T C -0.604 0.5461 .
rs1806509 1 843817 A C -0.262 0.7933 .
The file.txt was downloaded directly from a med directory.
I'm pretty new to UNIX so any help would be amazing!
Edit: I have definitely tried every single thing you guys are recommending and the result is blank. Am I maybe missing a syntax issue or something in my text files?
P.S. I know there are matches, as individual greps work. I'll move this question to unix.stackexchange. Thanks for your answers, guys; I'll try them all out.
Issue solved: I was obviously dealing with DOS carriage returns (CRLF line endings). I didn't know about this before, so thank you to everyone who answered. For future users who are having this issue, here is the solution that worked:
dos2unix *
awk 'NR==FNR{p[$0];next} $1 in p' Patterns.txt File.txt > Output.txt
You can use grep -Fw here:
grep -Fw -f Pattern.txt File.txt
Options used are:
-F - Fixed-string search, to treat input as non-regex
-w - Match full words only
-f file - Read pattern from a file
I don't know if it's what you want or not, but this will print every line from File.txt whose first field equals a string from Patterns.txt:
awk 'NR==FNR{p[$0];next} $1 in p' Patterns.txt File.txt
If that is not what you want, tell us what you do want. If it is what you want but doesn't produce the output you expect, then one or both of your files contains control characters, courtesy of being created in Windows, so run dos2unix or similar on them both first.
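A quick way to check for those Windows line endings before reaching for dos2unix (the $'\r' syntax assumes bash or another shell with ANSI-C quoting):
grep -c $'\r' Patterns.txt File.txt    # a nonzero count means CR/LF line endings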
Use a shell script that takes the file containing your patterns as an argument and hands it to fgrep:
#!/bin/bash
# Usage: ./script.sh Pattern.txt  ($1 is the pattern file; File.txt is searched)
FILENAME=$1
fgrep -f "$FILENAME" File.txt

in a text file that is a list of paths, insert directory immediately before files with certain extension

I have a list of files in files.txt, a hugely simplified example
$FOO%foo\bar\biz.asmx
%FOO%foo\bar\biz.cs
%FOO%baz\bar\foo\biz.asmx
It is my desire to insert App_Code in the path of .asmx files like:
$FOO%foo\bar\App_code\biz.asmx
%FOO%foo\bar\biz.cs
%FOO%baz\bar\foo\App_Code\biz.asmx
Though I'm on a Windows box, I have GnuWin32, which gives me sed/awk/grep and other fancy stuff.
I'm not wedded to a particular solution, but am interested in the sed/awk route for my own enlightenment.
I have tried:
sed "s/\\([:alnum:]*)\.asmx/App_Code\/{1}/"
which I had thought would capture any alphanumeric characters (the filename) following a path separator and followed by .asmx, and then replace the match with App_Code\ plus the contents of the group.
Something is off, as it never finds what I want. I'm struggling with the docs and examples; advice and guidance would be appreciated.
Quoting on Windows is a pain so put the following script into a file called appcode.awk:
BEGIN {
    FS = OFS = "\\"
}
$NF ~ /[.]asmx/ {
    $NF = "App_code" OFS $NF
}
{
    print
}
And run like:
$ awk -f appcode.awk file
$FOO%foo\bar\App_code\biz.asmx
%FOO%foo\bar\biz.cs
%FOO%baz\bar\foo\App_code\biz.asmx
Using awk:
awk -F\\ '/\.asmx/ {$NF="App_Code\\"$NF}1' OFS=\\ file
$FOO%foo\bar\App_Code\biz.asmx
%FOO%foo\bar\biz.cs
%FOO%baz\bar\foo\App_Code\biz.asmx
Using sed:
sed -r 's/(\\\w+\.asmx)/\\App_Code\1/' files.txt
Output:
$FOO%foo\bar\App_Code\biz.asmx
%FOO%foo\bar\biz.cs
%FOO%baz\bar\foo\App_Code\biz.asmx
EDIT
As suggested by sudo_O, the capture group can be dropped and & (the whole match) used instead:
sed -r 's/\\\w+\.asmx/\\App_Code&/' files.txt

Shell Script - list files, read files and write data to new file

I have a specific question about shell scripting.
Simple scripting is no problem for me, but I am new to this and want to build a simple database file.
So, what I want to do is:
- Search for file types (e.g. .nfo) <-- should be no problem :)
- Read each found file and pick out some strings inside it
- Write those strings to a new file, with each found file's information on one row
I hope I explained my "project" well.
My problem now is understanding how to tell the script to search for the files, then read each of them and use some of the information inside to write to a new file.
I will explain a bit better.
I am searching for files and that gives me back:
file1.nfo
file2.nfo
file3.nfo
OK, now from each of those files I need the information between two tags, e.g.:
file1.nfo:
<user>test1</user>
file2.nfo:
<user>test2</user>
so in the new file there should now be:
file1.nfo:test1
file2.nfo:test2
OK so:
find -name *.nfo > /test/database.txt
prints out the list of files.
and
sed -n '/<user*/,/<\/user>/p' file1.nfo
gives me back the complete file and not only the information between <user> and </user>.
I am trying to go step by step and reading a lot, but it seems to be very difficult.
What am I doing wrong, and what would be the best way to list all the files and write each filename plus the content between the two tags to a new file?
EDIT-NEW:
OK, here is an update with more information.
I have learned a lot now and searched the web for my problems. I can find a lot of information, but I don't know how to put it together so that I can use it.
Working with awk now, I get back the filename and the string.
Here now is the complete information (I thought I could go on by myself with a bit of help, but I can't :( )
Here is an example of: /test/file1.nfo
<string1>STRING 1</string1>
<string2>STRING 2</string2>
<string3>STRING 3</string3>
<string4>STRING 4</string4>
<personal informations>
<hobby>Baseball</hobby>
<hobby>Basketball</hobby>
</personal informations>
Here is an example of /test/file2.nfo:
<string1>STRING 1</string1>
<string2>STRING 2</string2>
<string3>STRING 3</string3>
<string4>STRING 4</string4>
<personal informations>
<hobby>Soccer</hobby>
<hobby>Traveling</hobby>
</personal informations>
The file I want to create has to look like this:
STRING 1:::/test/file1.nfo:::Date of file:::STRING 4:::STRING 3:::Baseball, Basketball:::STRING 2
STRING 1:::/test/file2.nfo:::Date of file:::STRING 4:::STRING 3:::Soccer, Traveling:::STRING 2
"Date of file" should be the creation date of the file, so that I can see how old the file is.
So, that's what I need, and it seems not easy.
Thanks a lot.
UPDATE: ERROR with -printf
find: unrecognized: -printf
Usage: find [PATH]... [OPTIONS] [ACTIONS]
Search for files and perform actions on them.
First failed action stops processing of current file.
Defaults: PATH is current directory, action is '-print'
-follow Follow symlinks
-xdev Don't descend directories on other filesystems
-maxdepth N Descend at most N levels. -maxdepth 0 applies
actions to command line arguments only
-mindepth N Don't act on first N levels
-depth Act on directory *after* traversing it
Actions:
( ACTIONS ) Group actions for -o / -a
! ACT Invert ACT's success/failure
ACT1 [-a] ACT2 If ACT1 fails, stop, else do ACT2
ACT1 -o ACT2 If ACT1 succeeds, stop, else do ACT2
Note: -a has higher priority than -o
-name PATTERN Match file name (w/o directory name) to PATTERN
-iname PATTERN Case insensitive -name
-path PATTERN Match path to PATTERN
-ipath PATTERN Case insensitive -path
-regex PATTERN Match path to regex PATTERN
-type X File type is X (one of: f,d,l,b,c,...)
-perm MASK At least one mask bit (+MASK), all bits (-MASK),
or exactly MASK bits are set in file's mode
-mtime DAYS mtime is greater than (+N), less than (-N),
or exactly N days in the past
-mmin MINS mtime is greater than (+N), less than (-N),
or exactly N minutes in the past
-newer FILE mtime is more recent than FILE's
-inum N File has inode number N
-user NAME/ID File is owned by given user
-group NAME/ID File is owned by given group
-size N[bck] File size is N (c:bytes,k:kbytes,b:512 bytes(def.))
+/-N: file size is bigger/smaller than N
-links N Number of links is greater than (+N), less than (-N),
or exactly N
-prune If current file is directory, don't descend into it
If none of the following actions is specified, -print is assumed
-print Print file name
-print0 Print file name, NUL terminated
-exec CMD ARG ; Run CMD with all instances of {} replaced by
file name. Fails if CMD exits with nonzero
-delete Delete current file/directory. Turns on -depth option
The pat1,pat2 notation of sed is line-based. Think of it like this: pat1 sets an enable flag for the commands and pat2 clears it, but pat2 is not even tested on the line where pat1 matched. So when both tags sit on the same line, the range never closes, and in your case everything from the <user> line onward is printed. See grymoire's sed howto for more.
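A line-oriented substitution avoids the range problem entirely; a minimal sketch against the sample file:
sed -n 's:.*<user>\(.*\)</user>.*:\1:p' file1.nfo
which prints just test1.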
An alternative to sed, in this case, would be to use a grep that supports look-around assertions, e.g. GNU grep:
find . -type f -name '*.nfo' | xargs grep -oP '(?<=<user>).*(?=</user>)'
If grep doesn't support -P, you can use a combination of grep and sed:
find . -type f -name '*.nfo' | xargs grep -o '<user>.*</user>' | sed 's:</\?user>::g'
Output:
./file1.nfo:test1
./file2.nfo:test2
Note: you should be aware of the issues involved with passing files on to xargs, and perhaps use -exec ... instead.
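For example, the -exec form, which sidesteps whitespace-in-filename problems; {} + batches names much like xargs does, and -H forces the filename prefix even when only one file is passed:
find . -type f -name '*.nfo' -exec grep -oHP '(?<=<user>).*(?=</user>)' {} +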
It so happens that grep outputs in the format you need and is enough for a one-liner.
By default a grep '' *.nfo will output something like:
file1.nfo:random data
file1.nfo:<user>test1</user>
file1.nfo:some more random data
file2.nfo:not needed
file2.nfo:<user>test2</user>
file2.nfo:etc etc
By adding the -P option (Perl regex) and an actual pattern, you can restrict the output to the relevant matching lines:
grep -P "<user>\w+<\/user>" *.nfo
output:
file1.nfo:<user>test1</user>
file2.nfo:<user>test2</user>
Now the -o option (only show what matched) saves the day, but we'll need a slightly more advanced regex, since the tags themselves are not wanted:
grep -oP "(?<=<user>)\w+(?=<\/user>)" *.nfo > /test/database.txt
output of cat /test/database.txt:
file1.nfo:test1
file2.nfo:test2
Explained RegEx here: http://regex101.com/r/oU2wQ1
And your whole script just became a single command.
Update:
If you don't have the --perl-regexp option try:
grep -oE "<user>\w+<\/user>" *.nfo|sed 's#</?user>##g' > /test/database.txt
All you need is:
find -name '*.nfo' | xargs awk -F'[><]' '{print FILENAME,$3}'
If you have more in your file than just what you show in your sample input then this is probably all you need:
... awk -F'[><]' '/<user>/{print FILENAME,$3}' file
Try this (untested):
> outfile
find -name '*.nfo' -printf "%p %Tc\n" |
while read -r fname tstamp
do
    awk -v tstamp="$tstamp" -F'[><]' -v OFS=":::" '
        { a[$2] = a[$2] sep[$2] $3; sep[$2] = ", " }
        END {
            print a["string1"], FILENAME, tstamp, a["string4"], a["string3"], a["hobby"], a["string2"]
        }
    ' "$fname" >> outfile
done
The above will only work if your file names do not contain spaces. If they can, we'd need to tweak the loop.
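One possible tweak, assuming GNU find: emit a tab between the name and the timestamp, which is far less likely to appear in a filename than a space, and split on it:
find . -name '*.nfo' -printf '%p\t%Tc\n' |
while IFS=$'\t' read -r fname tstamp
do
    printf '%s -> %s\n' "$fname" "$tstamp"    # substitute the awk call from above
done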
Alternative if your find doesn't support -printf (suggestion - seriously consider getting a modern "find"!):
> outfile
find -name '*.nfo' -print |
while IFS= read -r fname
do
    tstamp=$(stat -c '%y' "$fname")    # %y = last modification time, matching find's %Tc above
    awk -v tstamp="$tstamp" -F'[><]' -v OFS=":::" '
        { a[$2] = a[$2] sep[$2] $3; sep[$2] = ", " }
        END {
            print a["string1"], FILENAME, tstamp, a["string4"], a["string3"], a["hobby"], a["string2"]
        }
    ' "$fname" >> outfile
done
If you don't have stat, then search for alternatives to get a timestamp from a file, or consider parsing the output of ls -l; it's unreliable, but if it's all you've got...
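One such alternative, if GNU date is present: its -r option prints a file's last modification time:
tstamp=$(date -r "$fname")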