I am trying to write a simple script to verify the HDFS and local filesystem checksums.
On HDFS I get:
[m@x01tbipapp3a ~]$ hadoop fs -checksum /user/m/file.txt
/user/m/file.txt MD5-of-0MD5-of-512CRC32C **000002000000000000000000755ca25bd89d1a2d64990a68dedb5514**
On the local file system, I get:
[m@x01tbipapp3a ~]$ cksum file.txt
**3802590149 26276247** file.txt
[m@x01tbipapp3a ~]$ md5sum file.txt
**c1aae0db584d72402d5bcf5cbc29134c** file.txt
Now how do I compare them? I tried converting the HDFS checksum from hex to decimal to see whether it matches the cksum output, but it does not.
Is there a way to compare the two checksums using any algorithm?
thanks
Starting from Hadoop 3.1, HDFS checksums can be compared against local files. However, the comparison depends on how you put the file into HDFS in the first place. By default, HDFS checksums each 512-byte chunk with CRC32C, and hadoop fs -checksum reports an MD5 of the per-block MD5s of those chunk CRCs (that's the MD5-of-0MD5-of-512CRC32C label in the output above).
This means that you can't easily compare that checksum with one of a local copy. You can instead write the file initially with a plain CRC32 checksum:
hdfs dfs -Ddfs.checksum.type=CRC32 -put myFile /tmp
Then, to get the checksum:
hdfs dfs -Ddfs.checksum.combine.mode=COMPOSITE_CRC -checksum /tmp/myFile
For the local copy:
crc32 myFile
If you didn't upload the file with a CRC32 checksum, or don't want to upload it again, you can also upload the local copy you want to compare against with the default CRC32C checksum:
hdfs dfs -put myFile /tmp
And compare the two files on HDFS with:
hdfs dfs -checksum /tmp/myFile
hdfs dfs -checksum /tmp/myOtherFile
Reference:
https://community.cloudera.com/t5/Community-Articles/Comparing-checksums-in-HDFS/ta-p/248617
This is not a solution but a workaround that can be used.
Local File Checksum:
cksum test.txt
HDFS checksum:
hadoop fs -cat /user/test/test.txt > tmp.txt
cksum tmp.txt
You can then compare the two values.
Hope it helps.
I was also confused because the MD5 was not matching. It turned out the Hadoop checksum is not a simple MD5; it's an MD5 of MD5s of CRC32C chunk checksums :-)
see this
http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-user/201508.mbox/%3CCAMm20=5K+f3ArVtoo9qMSesjgd_opdcvnGiDTkd3jpn7SHkysg@mail.gmail.com%3E
and this
http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-user/201103.mbox/%3CAANLkTinR8zM2QUb+T-drAC6PDmwJm8qcdxL48hzxBNoi@mail.gmail.com%3E
Piping the results of a cat'd hdfs file to md5sum worked for me:
$ hadoop fs -cat /path/to/hdfs/file.dat|md5sum
cb131cdba628676ce6942ee7dbeb9c0f -
$ md5sum /path/to/localFilesystem/file.txt
cb131cdba628676ce6942ee7dbeb9c0f /path/to/localFilesystem/file.txt
This would not be recommended for massive files.
I used a workaround for this: a simple script that compares checksums of local and HDFS files using md5sum. I have mounted my HDFS file system locally at /hdfs.
md5sum /hdfs/md5test/* | awk '{print $1}' > /root/hdfsfile.txt
md5sum /test/* | awk '{print $1}' > /root/localfile.txt
if ! diff /root/localfile.txt /root/hdfsfile.txt > /dev/null 2>&1;
then
/bin/mail -s "checksum difference between local and hdfs files" user@xyz.com < /dev/null
fi
While processing huge files (~100 GB), we sometimes need to check the first/last few lines (header and trailer lines).
The easy option is to download the entire file locally using
gsutil cp gs://bucket_name/file_name .
and then use the head/tail commands, but that is not feasible: it is time-consuming, and extracting data from cloud storage has an associated cost.
It is the same as performing:
gsutil cat gs://bucket_name/file_name | head -1
Other options are to create an external table in GCP, OR visualize the file in Data Studio, OR read it from a Dataproc cluster/VM.
Is there any other quick option just to check the header/trailer lines from cloud storage?
gsutil cat -r
is the key here.
It outputs just the specified byte range of the object. Offsets start at 0.
E.g.
To return bytes from the 10th to the 100th position of the file:
gsutil cat -r 10-100 gs://bucket_name/file_name
To return bytes from the 100th position until the end of the file:
gsutil cat -r 100- gs://bucket_name/file_name
To return the last 10 bytes of the file:
gsutil cat -r -10 gs://bucket_name/file_name
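The three range forms can be illustrated on local data with Python slicing. A sketch, assuming (per the gsutil docs) that `start-end` is inclusive on both ends, `start-` runs to end of file, and `-N` means the last N bytes:

```python
def byte_range(data, spec):
    """Mimic gsutil cat -r range specs on an in-memory byte string."""
    if spec.startswith("-"):                  # "-10": last 10 bytes
        return data[-int(spec[1:]):]
    start, _, end = spec.partition("-")
    if end == "":                             # "100-": offset 100 to EOF
        return data[int(start):]
    return data[int(start):int(end) + 1]      # "10-100": inclusive range
```

For the header use case, fetching a modest leading range and piping it through head -1 avoids streaming the whole object.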
I have a huge csv file (about 60 GB), which is zipped and stored as a multipart (spanned) gzip file in an S3 bucket. The individual files are like:
file.csv.gz
file.csv1.gz
file.csv2.gz
file.csv3.gz
....
file.csv15.gz
Each file has a size of 512 MB. I need to unzip them into a single csv file.
How can I do that?
cat file.csv.gz file.csv1.gz ... file.csv15.gz | gzip -dc > file.csv
Note that you can't use file.csv*.gz, because the order will not be what you want, for example putting file.csv10.gz before file.csv2.gz. If you renamed them to file.csv00.gz, file.csv01.gz, etc., then you could just:
cat file.csv*.gz | gzip -dc > file.csv
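This works because concatenated gzip members form a valid gzip stream: decompressing the concatenation yields the concatenation of the parts. A small in-memory check of that property with Python's gzip module (no real files assumed):

```python
import gzip

# Compress two parts independently, as a spanned upload would.
part1 = gzip.compress(b"id,name\n1,alice\n")
part2 = gzip.compress(b"2,bob\n")

# Decompressing the concatenated members yields the concatenated data,
# which is exactly what `cat *.gz | gzip -dc` relies on.
combined = gzip.decompress(part1 + part2)
assert combined == b"id,name\n1,alice\n2,bob\n"
```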
I would like to upload files of a given folder to a bucket using gsutil rsync. However, instead of uploading all files, I would like to exclude files that are below a certain size. The Unix rsync command offers the option --min-size=SIZE. Is there an equivalent for the gsutil tool? If not, is there an easy way of excluding small files?
Ok so the easiest solution I found is to move the small files into a subdirectory, and then use rsync (without the -r option). The code for moving the files:
from glob import glob
from os import makedirs
from os.path import getsize, join
import shutil

def filter_images(source, limit):
    # Collect .tiff files smaller than `limit` megabytes.
    imgs = [img for img in glob(join(source, "*.tiff"))
            if getsize(img) / (1024 * 1024.0) < limit]
    if len(imgs) == 0:
        return
    # Move them into a "filtered" subdirectory so rsync skips them.
    filtered_dir = join(source, "filtered")
    makedirs(filtered_dir)
    for img in imgs:
        shutil.move(img, filtered_dir)
You don't have this option. You can do it manually by scripting this and sending the files one by one, but it's not very efficient. I propose this command:
find . -type f -size -4000c | xargs -I{} gsutil cp {} gs://my-bucket/path
Here only files smaller than 4000 bytes will be copied. The find size units are:
c for bytes
w for two-byte words
k for Kilobytes
M for Megabytes
G for Gigabytes
I am trying to grep a zipped file on S3/AWS and write the output to a new location with the same file name.
I am using the command below on S3; is this the right way to stream the output of the first cat command through to the destination?
hadoop fs -cat s3://analytics/LZ/2017/03/03/test_20170303-000000.tar.gz | zgrep -a -E '*word_1*|*word_2*|word_3|word_4' | hadoop fs -put - s3://prod/project/test/test_20170303-000000.tar.gz
Given you are playing with Hadoop, why not run the code in the cluster? Scanning for strings inside a .gzip file is commonplace, though I don't know about .tar files.
I'd personally use the -copyToLocal and -copyFromLocal commands to copy it to the local FS and work there. The trouble with things like -cat is that a lot gets logged by the Hadoop client code, so a pipe is likely to pick up too much extraneous cruft.
I'm working on a management script for Docker containers. Right now the user has to configure certain variables before using it. Often, these variables are already defined in the Dockerfile so the default should be to read those values.
I'm having some trouble with the array format used in these Dockerfiles. If I have the volume definition VOLUME ["/root/", "/var/log/"], the script should be able to figure out /root/ and /var/log/. I haven't been able to accomplish this yet.
So far I have been able to get "/root/", ", " and "/var/log/" out of the file using grep VOLUME Dockerfile | cut -c 8- | grep -o -P '(?<=").+?(?=")', but this still includes the ", ", which should be left out.
Does anyone have suggestions about how to parse this properly?
awk to the rescue!
$ echo VOLUME ["/root/", "/var/log/"] |
awk -F'[ ,\\[\\]]' '/VOLUME/{for(i=3;i<=NF;i+=2) print $i}'
/root/
/var/log/
By setting the delimiters you can extract all the fields.
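If the management script can shell out to Python instead of awk, the same extraction can be done with a regular expression. A sketch, assuming the JSON-array form of VOLUME shown in the question (it simply pulls every quoted string off VOLUME lines):

```python
import re

def parse_volumes(dockerfile_text):
    """Extract quoted paths from VOLUME ["...", "..."] instructions."""
    volumes = []
    for line in dockerfile_text.splitlines():
        if line.strip().startswith("VOLUME"):
            # Grab each double-quoted string; commas and brackets fall away.
            volumes.extend(re.findall(r'"([^"]+)"', line))
    return volumes
```

Matching only the quoted substrings sidesteps the ", " problem the lookbehind/lookahead regex ran into.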