I have a huge CSV file (about 60 GB) that is compressed and stored as a multipart (spanned) gzip file in an S3 bucket. The individual files look like:
file.csv.gz
file.csv1.gz
file.csv2.gz
file.csv3.gz
....
file.csv15.gz
Each part is 512 MB. I need to decompress them into a single CSV file.
How can I do that?
cat file.csv.gz file.csv1.gz ... file.csv15.gz | gzip -dc > file.csv
Note that you can't use file.csv*.gz, because the shell's glob ordering is not what you want; for example, it puts file.csv10.gz before file.csv2.gz. If you renamed them to file.csv00.gz, file.csv01.gz, etc., then you could just:
cat file.csv*.gz | gzip -dc > file.csv
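If you would rather not download the parts to disk first, you can stream them straight from S3 and decompress on the fly. A minimal sketch using the AWS CLI (s3://my-bucket/path/ is a placeholder, and it assumes the CLI is configured with access to the bucket):
# stream the parts from S3 in the correct order and decompress the concatenated stream
(
  aws s3 cp s3://my-bucket/path/file.csv.gz -
  for i in $(seq 1 15); do
    aws s3 cp "s3://my-bucket/path/file.csv${i}.gz" -
  done
) | gzip -dc > file.csv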
While processing huge files (~100 GB), we sometimes need to check the first/last few lines (header and trailer lines).
The easy option is to download the entire file locally using
gsutil cp gs://bucket_name/file_name .
and then use the head/tail commands to check the header/trailer lines, but this is not feasible: it is time consuming and incurs the cost of pulling the data out of the cloud.
It is the same as performing:
gsutil cat gs://bucket_name/file_name | head -1
The other options are to create an external table in GCP, visualize the data in Data Studio, or read it from a Dataproc cluster/VM.
Is there any other quick option just to check the header/trailer lines from Cloud Storage?
gsutil cat -r
is the key here.
It outputs just the specified byte range of the object. Offsets start at 0.
Eg.
To return bytes from the 10th to the 100th position of the file:
gsutil cat -r 10-100 gs://bucket_name/file_name
To return bytes from the 100th position to the end of the file:
gsutil cat -r 100- gs://bucket_name/file_name
To return the last 10 bytes of the file:
gsutil cat -r -10 gs://bucket_name/file_name
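To answer the original question directly, you can combine a small range read with head/tail so only a few kilobytes leave Cloud Storage. A minimal sketch, assuming the header and trailer lines fit within the first/last 4 KB (widen the range if your lines are longer):
# header line: fetch only the first 4 KB and keep the first line
gsutil cat -r 0-4095 gs://bucket_name/file_name | head -1
# trailer line: fetch only the last 4 KB and keep the last line
gsutil cat -r -4096 gs://bucket_name/file_name | tail -1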
I would like to upload files of a given folder to a bucket using gsutil rsync. However, instead of uploading all files, I would like to exclude files that are below a certain size. The Unix rsync command offers the option --min-size=SIZE. Is there an equivalent for the gsutil tool? If not, is there an easy way of excluding small files?
Ok so the easiest solution I found is to move the small files into a subdirectory, and then use rsync (without the -r option). The code for moving the files:
from glob import glob
from os import makedirs
from os.path import getsize, join
import shutil

def filter_images(source, limit):
    # collect every .tiff file smaller than `limit` megabytes
    imgs = [img for img in glob(join(source, "*.tiff")) if getsize(img) / (1024 * 1024.0) < limit]
    if not imgs:
        return
    # move the small files into a "filtered" subdirectory so a non-recursive rsync skips them
    filtered_dir = join(source, "filtered")
    makedirs(filtered_dir, exist_ok=True)
    for img in imgs:
        shutil.move(img, filtered_dir)
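For completeness, the upload step then looks roughly like this. It is only a sketch: the folder and bucket path are placeholders, and omitting -r means gsutil rsync only syncs the top-level files, so everything moved into the filtered/ subdirectory is left out:
# sync only the top-level (large enough) files; filtered/ is not recursed into without -r
gsutil rsync ./source_folder gs://my-bucket/destination_path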
gsutil rsync doesn't have this option. You can do it manually by scripting it and sending the files one by one, but that is not very efficient. I propose this command:
find . -type f -size -4000c | xargs -I{} gsutil cp {} gs://my-bucket/path
Here only the files below 4k will be copied (use -size +4000c instead if you only want files above 4k, which matches the original question). A more efficient variant is sketched after this list. Here is the list of find size units:
c for bytes
w for two-byte words
k for Kilobytes
M for Megabytes
G for Gigabytes
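Invoking gsutil once per file is slow when there are many files. A sketch of a faster variant, assuming your gsutil supports cp -I (which reads the list of files to copy from stdin) and that gs://my-bucket/path is a placeholder:
# copy only files of roughly 4 KB or more, feeding the whole list to one parallel gsutil run
find . -type f -size +4000c | gsutil -m cp -I gs://my-bucket/path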
I am trying to grep a gzipped file on S3/AWS and write the output to a new location with the same file name.
I am using the command below against S3; is this the right way to write the streaming output of the first cat command to the output location?
hadoop fs -cat s3://analytics/LZ/2017/03/03/test_20170303-000000.tar.gz | zgrep -a -E '*word_1*|*word_2*|word_3|word_4' | hadoop fs -put - s3://prod/project/test/test_20170303-000000.tar.gz
Given that you are playing with Hadoop, why not run the code in the cluster? Scanning for strings inside a .gz file is commonplace, though I don't know about .tar files.
I'd personally use the -copyToLocal and -copyFromLocal commands to copy the file to the local FS and work there. The trouble with things like -cat is that a lot gets logged by the Hadoop client code, so a pipe is likely to pick up too much extraneous cruft.
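A minimal sketch of that copy-locally approach, reusing the paths from the question (the matches.txt output name is just illustrative, the regex is simplified since grep -E already matches substrings, and note that grepping a .tar.gz also matches bytes inside the tar headers):
# pull the archive down, grep it locally, then push the matches back up
hadoop fs -copyToLocal s3://analytics/LZ/2017/03/03/test_20170303-000000.tar.gz .
zgrep -a -E 'word_1|word_2|word_3|word_4' test_20170303-000000.tar.gz > matches.txt
hadoop fs -copyFromLocal matches.txt s3://prod/project/test/matches.txt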
I am trying to write a simple script to verify the HDFS and local filesystem checksums.
On HDFS I get:
[m#x01tbipapp3a ~]$ hadoop fs -checksum /user/m/file.txt
/user/m/file.txt MD5-of-0MD5-of-512CRC32C 000002000000000000000000755ca25bd89d1a2d64990a68dedb5514
On the local file system, I get:
[m#x01tbipapp3a ~]$ cksum file.txt
3802590149 26276247 file.txt
[m#x01tbipapp3a ~]$ md5sum file.txt
c1aae0db584d72402d5bcf5cbc29134c  file.txt
Now how do I compare them? I tried converting the HDFS checksum from hex to decimal to see if it matches the cksum output, but it does not...
Is there a way to compare the 2 checksums using any algorithm?
thanks
Starting with Hadoop 3.1, HDFS checksums can be made comparable with local ones, but the comparison depends on how you put the file into HDFS in the first place. By default, HDFS computes CRC32C checksums per chunk and reports an MD5 of the per-block MD5s of those chunk checksums (that is the MD5-of-0MD5-of-512CRC32C label above).
This means you can't easily reproduce that checksum for a local copy with standard tools. You can instead write the file initially with a CRC32 checksum type:
hdfs dfs -Ddfs.checksum.type=CRC32 -put myFile /tmp
Then, to get the checksum:
hdfs dfs -Ddfs.checksum.combine.mode=COMPOSITE_CRC -checksum /tmp/myFile
For the local copy:
crc32 myFile
If you didn't upload the file with a CRC32 checksum, or don't want to upload it again with one, you can instead upload the local copy you want to compare against (which will get the default CRC32C checksum):
hdfs dfs -put myFile /tmp
And compare the two files on HDFS with:
hdfs dfs -checksum /tmp/myFile
hdfs dfs -checksum /tmp/myOtherFile
Reference:
https://community.cloudera.com/t5/Community-Articles/Comparing-checksums-in-HDFS/ta-p/248617
This is not a solution but a workaround which can be used.
Local File Checksum:
cksum test.txt
HDFS checksum:
hadoop fs -cat /user/test/test.txt > tmp.txt
cksum tmp.txt
You can compare them.
Hope it helps.
I was also confused because the MD5 was not matching; it turned out the Hadoop checksum is not a simple MD5, it's an MD5 of MD5s of CRC32C checksums :-)
see this
http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-user/201508.mbox/%3CCAMm20=5K+f3ArVtoo9qMSesjgd_opdcvnGiDTkd3jpn7SHkysg#mail.gmail.com%3E
and this
http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-user/201103.mbox/%3CAANLkTinR8zM2QUb+T-drAC6PDmwJm8qcdxL48hzxBNoi#mail.gmail.com%3E
Piping the results of a cat'd hdfs file to md5sum worked for me:
$ hadoop fs -cat /path/to/hdfs/file.dat|md5sum
cb131cdba628676ce6942ee7dbeb9c0f -
$ md5sum /path/to/localFilesystem/file.txt
cb131cdba628676ce6942ee7dbeb9c0f /path/to/localFilesystem/file.txt
This would not be recommended for massive files.
I used a workaround for this: a simple script to compare checksums of the local and HDFS file systems using md5sum. I have mounted my HDFS file system locally at /hdfs.
# checksums of the files on the HDFS mount and of the local copies
md5sum /hdfs/md5test/* | awk '{print $1}' > /root/hdfsfile.txt
md5sum /test/* | awk '{print $1}' > /root/localfile.txt
# mail an alert if the two checksum lists differ
if ! diff /root/localfile.txt /root/hdfsfile.txt > /dev/null 2>&1;
then
    /bin/mail -s "checksum difference between local and hdfs files" user@xyz.com < /dev/null
fi
I am using Cygwin to merge multiple files. However, I wanted to know if my approach is correct or not. This is both a question and a discussion :)
First, a little info about the files I have:
Both files contain ASCII as well as non-ASCII characters.
File1 has 7899097 lines in it and a size of ~70.9 MB
File2 has 14344391 lines in it and a size of ~136.6 MB
File Encoding Information:
$ file -bi file1.txt
text/x-c++; charset=unknown-8bit
$ file -bi file2.txt
text/x-c++; charset=utf-8
$ file -bi output.txt
text/x-c++; charset=unknown-8bit
This is the method I am following to merge the two files, sort them and then remove all the duplicate entries:
I create a temp folder and place both the text files inside it.
I run the following commands to merge the two files while keeping a line break between them:
for file in *.txt; do
    cat "$file" >> output.txt
    echo >> output.txt
done
The resulting output.txt file has 22243490 lines in it and a size of 207.5 MB.
Now, if I run the sort command on it as shown below, I get an error since there are non-ASCII characters (maybe Unicode, wide characters) present in it:
sort -u output.txt
string comparison failed: Invalid or incomplete multibyte or wide character
So, I set the environment variable LC_ALL to C and then ran the command as follows:
cat output.txt | sort -u | uniq >> result.txt
And result.txt has 22243488 lines in it and a size of 207.5 MB.
So, result.txt is practically the same as output.txt.
Now, I already know that there are many duplicate entries in output.txt, so why are the above commands not able to remove them?
Also, considering the large size of the files, I wanted to know whether this is an efficient method to merge multiple files, sort them and then deduplicate them.
Hmm, I'd use
cat file1.txt file2.txt other-files.* | recode enc1..enc2 | sort | uniq > file3.txt
but watch out: this could cause problems with big files, in the gigabytes or larger; with hundreds of megabytes it should probably go fine. If I wanted real efficiency, e.g. with really huge files, I'd first remove the duplicates within each file, sort each file, merge them one after another, and then sort and remove duplicate lines again. Theoretically, a uniq -c and grep filter could remove duplicates. Try to avoid falling into some unneeded sophistication of the solution :)
http://catb.org/~esr/writings/unix-koans/two_paths.html
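A minimal sketch of that "deduplicate each file first, then merge" idea, assuming GNU sort; setting LC_ALL=C sidesteps the multibyte comparison error from the question (at the cost of sorting by raw bytes):
# deduplicate each input on its own, then merge the already-sorted, deduplicated results
export LC_ALL=C
sort -u file1.txt > file1.sorted
sort -u file2.txt > file2.sorted
sort -m -u file1.sorted file2.sorted > result.txt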
edited:
mv file1.txt file1_iso1234.txt
mv file2.txt file2_latin7.txt
# recode each file from the encoding embedded in its name to UTF-8, then sort and deduplicate
for line in file*.txt; do
    recode "$(echo "$line" | cut -d'_' -f2 | cut -d'.' -f1)..utf8" < "$line"
done | sort | uniq > finalfile.txt