Merging files in Linux - uniq

I am using Cygwin to merge multiple files. However, I wanted to know if my approach is correct or not. This is both a question and a discussion :)
First, a little info about the files I have:
Both files contain ASCII as well as non-ASCII characters.
File1 has 7899097 lines in it and a size of ~70.9 MB
File2 has 14344391 lines in it and a size of ~136.6 MB
File Encoding Information:
$ file -bi file1.txt
text/x-c++; charset=unknown-8bit
$ file -bi file2.txt
text/x-c++; charset=utf-8
$ file -bi output.txt
text/x-c++; charset=unknown-8bit
This is the method I am following to merge the two files, sort them and then remove all the duplicate entries:
I create a temp folder and place both the text files inside it.
I run the following commands to merge the two files, keeping a line break between them:
for file in *.txt; do
    cat "$file" >> output.txt
    echo >> output.txt
done
The resulting output.txt file has 22243490 lines in it and a size of ~207.5 MB
Now, if I run the sort command on it as shown below, I get an error because there are non-ASCII characters (possibly Unicode wide characters) present inside it:
sort -u output.txt
string comparison failed: Invalid or incomplete multibyte or wide character
So, I set the environment variable LC_ALL to C and then run the command as follows:
cat output.txt | sort -u | uniq >> result.txt
And result.txt has 22243488 lines in it and a size of ~207.5 MB.
So, result.txt is practically the same as output.txt.
Now, I already know that there are many duplicate entries in output.txt, so why are the above commands not able to remove them?
Also, considering the large size of the files, I wanted to know whether this is an efficient method to merge multiple files, sort them and then remove the duplicates.

Hmm, I'd use
cat file1.txt file2.txt other-files.* | recode enc1..enc2 | sort | uniq > file3.txt
but watch out - this could cause problems with some big file sizes, counted in gigabytes (or bigger); with hundreds of megabytes it should probably go fine. If I wanted real efficiency, e.g. with really huge files, I'd first remove duplicates within each single file, sort each one, merge them one by one, and then sort again and remove duplicate lines from the result (a rough sketch of that is below the link). Theoretically, uniq -c and a grep filter could also remove duplicates. Try to avoid falling into some unneeded sophistication of the solution :)
http://catb.org/~esr/writings/unix-koans/two_paths.html
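A minimal sketch of that per-file approach, assuming GNU sort and reusing the file names from the question as placeholders; sort -m merges inputs that are already sorted, so the final pass is cheap:
export LC_ALL=C                      # plain byte comparison, avoids the multibyte errors
sort -u file1.txt > file1.sorted     # sort and deduplicate each file on its own
sort -u file2.txt > file2.sorted
sort -m -u file1.sorted file2.sorted > result.txt   # merge the sorted files, dropping cross-file duplicates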
edited:
mv file1.txt file1_iso1234.txt
mv file2.txt file2_latin7.txt
for f in file*.txt; do recode "$(echo "$f" | cut -d'_' -f2 | cut -d'.' -f1)..utf8" < "$f"; done | sort | uniq > finalfile.txt

Related

What is the fastest way to remove a number from the beginning of so many files?

I have 1000 files each having one million lines. Each line has the following form:
a number,a text
I want to remove all of the numbers from the beginning of every line of every file, including the comma.
Example:
14671823,aboasdyflj -> aboasdyflj
What I'm doing is:
os.system("sed -i -- 's/^.*,//g' data/*")
and it works fine but it's taking a huge amount of time.
What is the fastest way to do this?
I'm coding in Python.
This is much faster:
cut -f2 -d ',' data.txt > tmp.txt && mv tmp.txt data.txt
On a file with 11 million rows it took less than one second.
To use this on several files in a directory, use:
TMP=/pathto/tmpfile
for file in dir/*; do
    cut -f2 -d ',' "$file" > "$TMP" && mv "$TMP" "$file"
done
A thing worth mentioning is that it often takes much longer to do things in place than to use a separate file. I tried your sed command but switched from in-place editing to a temporary file, roughly as sketched below; the total time went down from 26s to 9s.
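A rough sketch of that variant (the data/ path is taken from the question, and the substitution is the same one the question uses):
TMP=$(mktemp)
for file in data/*; do
    # same substitution as in the question; note that .* is greedy, so it cuts up to the last comma
    sed 's/^.*,//g' "$file" > "$TMP" && mv "$TMP" "$file"
done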
I would use GNU awk (to leverage its -i inplace editing of files) with , as the field separator; no expensive regex manipulation is needed:
awk -F, -i inplace '{print $2}' file.txt
For example, if the filenames have a common prefix like file, you can use shell globbing:
awk -F, -i inplace '{print $2}' file*
awk will treat each file as a different argument while applying the in-place modifications.
As a side note, you could simply run the shell command directly in the shell instead of wrapping it in os.system(), which is insecure and, by the way, discouraged in favor of subprocess.
This is probably pretty fast and native Python: it reduces loops and uses csv.reader & csv.writer, which are compiled in most implementations:
import csv, os, glob

for f1 in glob.glob("*.txt"):
    f2 = f1 + ".new"
    with open(f1) as fr, open(f2, "w", newline="") as fw:
        # keep only the second field; each row handed to writerows must itself be a list
        csv.writer(fw).writerows([row[1]] for row in csv.reader(fr))
    os.remove(f1)
    os.rename(f2, f1)  # move the new file back onto the old one
maybe the writerows part could be even faster by using map & operator.itemgetter (with import operator added) to remove the inner loop; passing a slice to itemgetter keeps each result a one-element list, which is what writerows expects:
csv.writer(fw).writerows(map(operator.itemgetter(slice(1, 2)), csv.reader(fr)))
Also:
it's portable to all systems, including Windows without MSYS installed
it stops with an exception in case of a problem, avoiding destroying the input
the temporary file is created in the same filesystem on purpose, so the delete+rename is very fast (as opposed to moving the temp file onto the input across filesystems, which would require shutil.move and would copy the data)
You can take advantage of your multicore system, along with the other users' tips on handling a single file faster.
import multiprocessing
import os
import queue

FILES = ['a', 'b', 'c', 'd']
CORES = 4

q = multiprocessing.Queue(len(FILES))
for f in FILES:
    q.put(f)

def handler(q, i):
    while True:
        try:
            f = q.get(block=False)
        except queue.Empty:
            return
        # each worker strips the first field of one file at a time
        os.system("cut -f2 -d ',' {f} > tmp{i} && mv tmp{i} {f}".format(**locals()))

processes = [multiprocessing.Process(target=handler, args=(q, i)) for i in range(CORES)]
[p.start() for p in processes]
[p.join() for p in processes]
print("Done!")

How can I combine multiple text files, remove duplicate lines and split the remaining lines into several files of certain length?

I have a lot of relatively small files with about 350.000 lines of text.
For example:
File 1:
asdf
wetwert
ddghr
vbnd
...
sdfre
File 2:
erye
yren
asdf
jkdt
...
uory
As you can see line 3 of file 2 is a duplicate of line 1 in file 1.
I want a program / Notepad++ Plugin that can check and remove these duplicates in multiple files.
The next problem I have is that I want all lists to be combined into large 1.000.000 line files.
So, for example, I have these files:
648563 lines
375924 lines
487036 lines
I want them to result in these files:
1.000.000 lines
511.523 lines
And the last 2 files must consist of only unique lines.
How can I possibly do this? Can I use some programs for this? Or a combination of multiple Notepad++ Plugins?
I know GSplit can split a file of 1.536.243 lines into files of 1.000.000 and 536.243 lines, but that is not enough, and it doesn't remove duplicates.
I do want to create my own Notepad++ plugin or program if needed, but I have no idea how and where to start.
Thanks in advance.
You have asked about Notepad++ and are thus using Windows. On the other hand, you said you want to create a program if needed, so I guess the main goal is to get the job done.
This answer uses Unix tools - on Windows, you can get those with Cygwin.
To run the commands, you have to type (or paste) them in the terminal / console.
cat file1 file2 file3 | sort -u | split -l1000000 - outfile_
cat reads the files and echoes them, normally to the screen; the pipe | takes the output of the command on its left and feeds it to the command on its right.
sort obviously sorts them, and the switch -u tells it to remove duplicate lines.
The output is then piped to split, which is told by the switch -l1000000 to split after 1000000 lines. The - (with spaces around it) tells it to read its input not from a file but from "standard input"; the output of sort -u in this case. The last word, outfile_, can be changed by you, if you want.
Written like it is, this will result in files like outfile_aa, outfile_ab and so on - you can modify this with the last word in this command.
If you have all the files in one directory, and nothing else is in there, you can use * instead of listing all the files:
cat * | sort -u | split -l1000000 - outfile_
If the files might contain empty lines, you might want to remove them. Otherwise, they'll be sorted to the top and your first file will not have the full 1.000.000 values:
cat file1 file2 file3 | grep -v '^\s*$' | sort -u | split -l1000000 - outfile_
This will also remove lines that consist only of whitespace.
grep filters its input using regular expressions. -v inverts the filter; normally, grep keeps only lines that match, but with -v it keeps only lines that don't match. ^\s*$ matches all lines that consist of nothing but zero or more whitespace characters (like spaces or tabs).
If you need to do this regularly, you can write a script so you don't have to remember the details:
#!/bin/sh
cat * | sort -u | split -l1000000 - outfile_
Save this as a file (for example combine.sh), make it executable with chmod +x combine.sh, and run it with
./combine.sh

Bash: Batch Rename Appending File Extension

I have a bunch of temperature logger data files in .csv format. The proprietary temp-logger software saves them with weird useless names. I want to name the files by their serial numbers (S/N). The S/N can be found in each of the files (in several places).
So, I need to extract the S/N and change the name of the file to {S/N}.csv.
I'm almost there, but can't figure out how to get the ".csv" file extension onto the end.
Here's my code:
for i in *.csv; do grep -Eo "S\/N\: [0-9]+" "$i" | cut -c 6- | head -1 | xargs mv "$i" ; done
Note the "cut" and "head" commands are necessary to get just the S/N number from the regular expression return, and to take only one (the S/N is listed several times in the file).
If anyone has a more elegant solution, I'd love to see it. All I really need though is to get that ".csv" onto the end of my new file names.
Thanks!
You can do it with xargs, but it's simpler to skip it and call mv directly. (You're only renaming one file per call to xargs anyway.)
for i in *.csv; do
    ser_num=$(grep -Eo "S\/N\: [0-9]+" "$i" | cut -c 6- | head -1)
    mv "$i" "$ser_num.csv"
done
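If you did want to keep xargs, a sketch of one way (reusing your grep pattern as-is) is to use -I to name a placeholder, so the .csv extension can be appended to it:
for i in *.csv; do
    # -I{} substitutes the serial number read from stdin, letting us tack .csv onto it
    grep -Eo "S\/N\: [0-9]+" "$i" | cut -c 6- | head -1 | xargs -I{} mv "$i" "{}.csv"
done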

linux pipe with egrep not working as expected

I have this small piece of code
egrep -oh '([A-Z]+)_([A-Z]+)_([A-Z]+)' -R /path | uniq | sort
I use this script to dig for environment variables inside files stored in a common directory. I don't want to display any duplicates; I just want the name of each variable that is being used.
Needless to say, the regex works; the matched words are the ones composed of three groups of uppercase letters in the form *_*_*. The problem is that uniq doesn't seem to be doing anything: the variables are just printed out as egrep finds them.
Not even uniq -u does the trick.
Is the pipe itself the problem ?
uniq requires its input to be sorted if you want it to work in this manner. From the man page: (emphasis mine)
DESCRIPTION: Filter adjacent matching lines
So you could put a sort before the uniq in the pipeline, but that is not necessary; you can simply use the -u flag of sort to output only unique lines:
egrep -oh '([A-Z]+)_([A-Z]+)_([A-Z]+)' -R /path | sort -u
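A tiny illustration of why plain uniq appears to do nothing here (the variable names are made up, matching your *_*_* pattern): uniq only collapses adjacent duplicates, while sort -u sees them all.
$ printf 'FOO_BAR_BAZ\nAAA_BBB_CCC\nFOO_BAR_BAZ\n' | uniq
FOO_BAR_BAZ
AAA_BBB_CCC
FOO_BAR_BAZ
$ printf 'FOO_BAR_BAZ\nAAA_BBB_CCC\nFOO_BAR_BAZ\n' | sort -u
AAA_BBB_CCC
FOO_BAR_BAZ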

Find matches between 2 files

I'm trying to output matching lines in 2 files using AWK. I made it easier by making 2 files with just one column each; they're phone numbers. I found many people asking the same question and getting the answer to use:
awk 'NR==FNR{a[$1];next}$1 in a{print $1}' file1 file2
The problem I encountered was that it simply doesn't want to work. The first file is small (~5MB) and the second file is considerably larger (~250MB).
I have some general knowledge of AWK and know that the above script should work, yet I'm unable to figure out why it's not.
Is there any other way I can achieve the same result?
GREP is a nice tool, but it clogs up the RAM and dies within seconds due to the file size.
I did run some spot checks to find out whether there are matches: I took random numbers from the smaller file, grep'd them through the big one, and did find matches, so I'm sure that there are.
Any help is appreciated!
[edit as requested by #Jaypal]
Sample code from both files :
File1:
01234567895
01234577896
01234556894
File2:
01234642784
02613467246
01234567895
Output:
01234567895
What I get:
xxx#xxx:~$ awk 'NR==FNR{a[$1];next}$1 in a{print $1}' file1 file2
xxx#xxx:~$
Update
The problem happens to be the kind of file you were using. Apparently it came from a DOS system and has many \r (carriage return) characters in it. To solve it, "sanitize" the files with:
dos2unix
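For example (a sketch using the file names from your question; tr is shown as a fallback in case dos2unix isn't installed):
dos2unix file1 file2

# or, without dos2unix, strip the carriage returns manually
tr -d '\r' < file1 > file1.clean && mv file1.clean file1
tr -d '\r' < file2 > file2.clean && mv file2.clean file2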
Former answer
Your awk is pretty fine. However, you can also compare files with grep -f:
grep -f file1 file2
This takes each line of file1 as a pattern and prints the lines of file2 that match it.
You can add options to make the matching stricter:
grep -wFf file1 file2
-w matches whole words only
-F matches fixed strings (no regex interpretation).
Examples
$ cat a
hello
how are
you
I am fine areare
$ cat b
hel
are
$ grep -f b a
hello
how are
I am fine areare
$ grep -wf b a
how are