Reading first few lines from files in google cloud storage - google-cloud-platform

While processing huge files (~100 GB), we sometimes need to check just the first/last few lines (header and trailer lines).
The easy option is to download the entire file locally using
gsutil cp gs://bucket_name/file_name .
and then use the head/tail commands to check the header/trailer lines, which is not feasible: it is time consuming and there is a cost associated with pulling the data out of the cloud.
It is the same as performing
gsutil cat gs://bucket_name/file_name | head -1
The other options are to create an external table in GCP, visualize the data in Data Studio, or read the file from a Dataproc cluster/VM.
Is there any other quick option just to check the header/trailer lines of a file in Cloud Storage?

gsutil cat -r
is the key here.
It outputs just the specified byte range of the object. Offsets start at 0.
E.g.
To return bytes from the 10th through the 100th position of the file:
gsutil cat -r 10-100 gs://bucket_name/file_name
To return bytes from the 100th position to the end of the file:
gsutil cat -r 100- gs://bucket_name/file_name
To return the last 10 bytes of the file:
gsutil cat -r -10 gs://bucket_name/file_name
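If you prefer to do this from code rather than the CLI, the Cloud Storage client library supports the same ranged reads. A minimal Python sketch, assuming the google-cloud-storage package is installed and that a 4 KB window is enough to contain the header/trailer line:

from google.cloud import storage

client = storage.Client()
blob = client.bucket("bucket_name").blob("file_name")
blob.reload()  # fetch metadata so blob.size is populated

header = blob.download_as_bytes(start=0, end=4095)               # first 4 KB
trailer = blob.download_as_bytes(start=max(blob.size - 4096, 0)) # last 4 KB

print(header.split(b"\n")[0].decode())                   # header line
print(trailer.rstrip(b"\n").split(b"\n")[-1].decode())   # trailer line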

Related

Grab latest AWS S3 Folder Object name with AWS CLI

I tried using this post to look for the last modified file then awk for the folder it's contained in: Get last modified object from S3 using AWS CLI
But this isn't ideal for over 1,000 folders and, according to the documentation, it should be failing. I have 2,000+ folder objects I need to search through. My desired folder will always begin with a D followed by a set of incrementing numbers, e.g. D1200.
The results from the answer led me to creating this call which works:
aws s3 ls main.test.staging/General_Testing/Results/ --recursive | sort | tail -n 1 | awk '{print $4}'
but it takes over 40 seconds to search through thousands of files, and I then need to regex-parse the output to find the folder object rather than the last file modified within it. Also, if I try to do this to find my desired folder (which is the object right after the Results object):
aws s3 ls main.test.staging/General_Testing/Results/ | sort | tail -1
Then my output will be D998 because the sort function will order folder names like this:
D119
D12
D13
Technically D12 sorts after D119 because the comparison is character by character, and the 2 in D12 is greater than the corresponding 1 in D119. Following this logic, there's no way I can use that call to reliably retrieve the highest-numbered folder and therefore the last one created. Something to note is that folder objects that contain files do not have a Last Modified tag that one can use to query.
To be clear about my question: what call can I use to look through a large number of S3 objects to find the highest-numbered folder object? Preferably the answer is fast, works with 1,000+ objects, and won't require a regex breakdown.
I wonder whether you can use a list of CommonPrefixes to overcome your problem of having many folders?
Try this command:
aws s3api list-objects-v2 --bucket main.test.staging --delimiter '/' --prefix 'General_Testing/Results/' --query CommonPrefixes --output text
(Note that it uses s3api rather than s3.)
It should provide a list of 'folders'. I don't know whether it has a limit on the number of 'folders' returned.
As for sorting D119 before D2, this is because it is sorting strings. The output is perfectly correct when sorting strings.
To sort by the number portion, you can likely use "version sorting" (e.g. GNU sort -V). See: How to sort strings that contain a common prefix and suffix numerically from Bash?
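If you would rather do the listing and the numeric sort in code, here is a rough boto3 sketch (bucket and prefix are taken from the question; the paginator handles more than 1,000 results, and the numeric key avoids a regex):

import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

prefixes = []
for page in paginator.paginate(Bucket="main.test.staging",
                               Prefix="General_Testing/Results/",
                               Delimiter="/"):
    prefixes += [cp["Prefix"] for cp in page.get("CommonPrefixes", [])]

def folder_number(prefix):
    name = prefix.rstrip("/").rsplit("/", 1)[-1]   # e.g. "D1200"
    return int(name[1:]) if name[1:].isdigit() else -1

print(max(prefixes, key=folder_number))   # highest-numbered D folder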

Exclude small files when using gsutil rsync

I would like to upload files of a given folder to a bucket using gsutil rsync. However, instead of uploading all files, I would like to exclude files that are below a certain size. The Unix rsync command offers the option --min-size=SIZE. Is there an equivalent for the gsutil tool? If not, is there an easy way of excluding small files?
OK, so the easiest solution I found is to move the small files into a subdirectory and then use rsync (without the -r option). The code for moving the files:
from glob import glob
from os import makedirs
from os.path import getsize, join
import shutil

def filter_images(source, limit):
    # move .tiff files smaller than `limit` MB into a "filtered" subdirectory
    imgs = [img for img in glob(join(source, "*.tiff"))
            if getsize(img) / (1024 * 1024.0) < limit]
    if len(imgs) == 0:
        return
    filtered_dir = join(source, "filtered")
    makedirs(filtered_dir)
    for img in imgs:
        shutil.move(img, filtered_dir)
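Hypothetical usage (the path, bucket and 1 MB threshold are placeholders, not from the original post):

filter_images("/path/to/images", limit=1)   # move .tiff files under 1 MB into ./filtered
# then: gsutil rsync /path/to/images gs://my-bucket/path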
You don't have this option. You can do it manually by scripting it and sending the files one by one, but it's not very efficient. I propose this command:
find . -type f -size -4000c | xargs -I{} gsutil cp {} gs://my-bucket/path
Here only the files below 4000 bytes (~4 KB) will be copied. Here is the list of find size units:
c for bytes
w for two-byte words
k for Kilobytes
M for Megabytes
G for Gigabytes
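If you would rather do the filtering in Python (and skip the small files rather than copy them), here is a rough sketch using the Cloud Storage client library; the bucket name and threshold are placeholders:

import os
from google.cloud import storage

def upload_large_files(source_dir, bucket_name, min_bytes=4000):
    # upload only files at or above min_bytes, skipping the small ones
    bucket = storage.Client().bucket(bucket_name)
    for name in os.listdir(source_dir):
        path = os.path.join(source_dir, name)
        if os.path.isfile(path) and os.path.getsize(path) >= min_bytes:
            bucket.blob(name).upload_from_filename(path)

upload_large_files(".", "my-bucket")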

zgrep in hadoop streaming

I am trying to grep a zipped file on S3/AWS and write the output to a new location with the same file name.
I am using the command below on S3; is this the right way to stream the output of the first cat command to the output location?
hadoop fs -cat s3://analytics/LZ/2017/03/03/test_20170303-000000.tar.gz | zgrep -a -E '*word_1*|*word_2*|word_3|word_4' | hadoop fs -put - s3://prod/project/test/test_20170303-000000.tar.gz
Given that you are playing with Hadoop, why not run the code in the cluster? Scanning for strings inside a .gz file is commonplace, though I don't know about .tar files.
I'd personally use the -copyToLocal and -copyFromLocal commands to copy the file to the local FS and work there. The trouble with things like -cat is that a lot gets logged by the Hadoop client code, so a pipe is likely to pick up too much extraneous cruft.
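If you'd rather avoid the pipe entirely, one option is to stream the archive from S3 in code. A rough boto3/tarfile sketch (bucket, key and search words are taken from the question; plain substring matching is swapped in for the grep patterns):

import tarfile
import boto3

s3 = boto3.client("s3")
obj = s3.get_object(Bucket="analytics",
                    Key="LZ/2017/03/03/test_20170303-000000.tar.gz")

words = ("word_1", "word_2", "word_3", "word_4")
matches = []
with tarfile.open(fileobj=obj["Body"], mode="r|gz") as tar:   # streaming, no seek needed
    for member in tar:
        f = tar.extractfile(member)
        if f is None:          # skip directories and special entries
            continue
        for line in f:
            if any(w in line.decode("utf-8", "replace") for w in words):
                matches.append(line)

# matches could then be written back to S3, e.g. with s3.put_object(...)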

Is it possible to exclude files older than x time from aws s3 sync?

I'm trying to use the aws s3 CLI command to sync files (and then delete the local copy) from the server to an S3 bucket, but I can't find a way to exclude newly created files which are still in use on the local machine.
Any ideas?
This should work:
find /path/to/local/SyncFolder -mtime +1 -print0 | sed -z 's/^/--include=/' | xargs -0 /usr/bin/aws s3 sync /path/to/local/SyncFolder s3://remote.sync.folder --exclude '*'
There's a trick here: we're not excluding the files we don't want, we're excluding everything and then including the files we want. Why? Because either way, we're probably going to have too many parameters to fit into the command line. We can use xargs to split up long lines into multiple calls, but we can't let xargs split up our excludes list, so we have to let it split up our includes list instead.
So, starting from the beginning, we have a find command. -mtime +1 finds all the files that are older than a day, and -print0 tells find to delimit each result with a null byte instead of a newline, in case some of your files have newlines in their names.
Next, sed adds --include= to the start of each filename (so each argument becomes --include=/path/to/file), and the -z option is included to let sed know to use null bytes instead of newlines as delimiters.
Finally, xargs will feed all those include options to the end of our aws command, calling aws multiple times if need be. The -0 option is just like sed's -z option, telling it to use null bytes instead of newlines.
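If the shell quoting gets awkward, here is a rough Python alternative that uploads only files last modified more than a day ago (boto3 assumed; paths and bucket are the ones from the answer above, and unlike aws s3 sync this sketch does not skip files that are already up to date):

import os
import time
import boto3

s3 = boto3.client("s3")
source = "/path/to/local/SyncFolder"
cutoff = time.time() - 24 * 60 * 60   # only files older than one day

for root, _, files in os.walk(source):
    for name in files:
        path = os.path.join(root, name)
        if os.path.getmtime(path) < cutoff:
            key = os.path.relpath(path, source)
            s3.upload_file(path, "remote.sync.folder", key)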
To my knowledge you can only include/exclude based on filename, so the only way I see is a really dirty hack.
You could run a bash script to rename all files below your threshold, prefixing/postfixing them like TOO_NEW_%Filename%, and then run the CLI with:
--exclude 'TOO_NEW_*'
But no, don't do that.
Most likely ignoring the newer files is the default behavior. We can read in aws s3 sync help:
The default behavior is to ignore same-sized items unless the local version is newer than the S3 version.
If you'd like to change the default behaviour, you have the following parameters to use:
--size-only (boolean) Makes the size of each key the only criteria used to decide whether to sync from source to destination.
--exact-timestamps (boolean) When syncing from S3 to local, same-sized
items will be ignored only when the timestamps match exactly. The
default behavior is to ignore same-sized items unless the local version
is newer than the S3 version.
To see what files are going to be updated, run the sync with --dryrun.
Alternatively, use find to list all the files which need to be excluded, and pass them to the --exclude parameter.

Compare HDFS Checksum to Local File System Checksum

I am trying to write a simple script to verify the HDFS and local filesystem checksums.
On HDFS I get:
[m@x01tbipapp3a ~]$ hadoop fs -checksum /user/m/file.txt
/user/m/file.txt MD5-of-0MD5-of-512CRC32C 000002000000000000000000755ca25bd89d1a2d64990a68dedb5514
On the local file system, I get:
[m@x01tbipapp3a ~]$ cksum file.txt
3802590149 26276247 file.txt
[m@x01tbipapp3a ~]$ md5sum file.txt
c1aae0db584d72402d5bcf5cbc29134c file.txt
Now how do I compare them? I tried converting the HDFS checksum from hex to decimal to see if it matches the cksum output, but it does not.
Is there a way to compare the two checksums using any algorithm?
Thanks.
Starting from Hadoop 3.1, checksums that can be compared with local ones can be produced in HDFS. However, the comparison depends on how you put the file into HDFS in the first place. By default, HDFS uses CRC32C chunk checksums, and hadoop fs -checksum reports an MD5 of the per-block MD5s of those chunk checksums (the MD5-of-0MD5-of-512CRC32C value shown above).
This means that you can't easily compare that checksum with the checksum of a local copy. You can instead write the file with a CRC32 checksum in the first place:
hdfs dfs -Ddfs.checksum.type=CRC32 -put myFile /tmp
Then, to get the checksum:
hdfs dfs -Ddfs.checksum.combine.mode=COMPOSITE_CRC -checksum /tmp/myFile
For the local copy:
crc32 myFile
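If you don't have a crc32 utility handy, a small Python sketch with zlib gives the whole-file CRC-32 in hex, which should match the composite CRC reported above (assuming the composite value is the plain CRC-32 of the file, as when the file was written with dfs.checksum.type=CRC32):

import zlib

crc = 0
with open("myFile", "rb") as f:
    for chunk in iter(lambda: f.read(1024 * 1024), b""):
        crc = zlib.crc32(chunk, crc)
print(format(crc & 0xFFFFFFFF, "08x"))   # hex CRC-32 of the whole file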
If you didn't upload the file with a CRC32 checksum, or don't want to upload it again, you can also just upload the local copy you want to compare against using the default CRC32C checksum:
hdfs dfs -put myFile /tmp
And compare the two files on HDFS with:
hdfs dfs -checksum /tmp/myFile
hdfs dfs -checksum /tmp/myOtherFile
Reference:
https://community.cloudera.com/t5/Community-Articles/Comparing-checksums-in-HDFS/ta-p/248617
This is not a solution but a workaround which can be used.
Local File Checksum:
cksum test.txt
HDFS checksum:
hadoop fs -cat /user/test/test.txt > tmp.txt
cksum tmp.txt
You can compare them.
Hope it helps.
I was also confused because the MD5 was not matching; it turned out the Hadoop checksum is not a simple MD5, it's an MD5 of MD5s of CRC32C checksums :-)
see this
http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-user/201508.mbox/%3CCAMm20=5K+f3ArVtoo9qMSesjgd_opdcvnGiDTkd3jpn7SHkysg#mail.gmail.com%3E
and this
http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-user/201103.mbox/%3CAANLkTinR8zM2QUb+T-drAC6PDmwJm8qcdxL48hzxBNoi#mail.gmail.com%3E
Piping the results of a cat'd hdfs file to md5sum worked for me:
$ hadoop fs -cat /path/to/hdfs/file.dat|md5sum
cb131cdba628676ce6942ee7dbeb9c0f -
$ md5sum /path/to/localFilesystem/file.txt
cb131cdba628676ce6942ee7dbeb9c0f /path/to/localFilesystem/file.txt
This would not be recommended for massive files.
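The same comparison can be scripted. A rough Python sketch that hashes both the HDFS stream and the local file in chunks (the paths are the placeholders from the example above):

import hashlib
import subprocess

def md5_of_stream(stream, chunk_size=1024 * 1024):
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()

# md5 of the HDFS file, streamed through `hadoop fs -cat`
proc = subprocess.Popen(["hadoop", "fs", "-cat", "/path/to/hdfs/file.dat"],
                        stdout=subprocess.PIPE)
hdfs_md5 = md5_of_stream(proc.stdout)
proc.wait()

# md5 of the local file
with open("/path/to/localFilesystem/file.txt", "rb") as f:
    local_md5 = md5_of_stream(f)

print(hdfs_md5 == local_md5)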
I used a workaround for this: a simple script to compare the checksums of the local and HDFS file systems using md5sum. I have mounted my HDFS file system locally as /hdfs.
md5sum /hdfs/md5test/* | awk '{print $1}' > /root/hdfsfile.txt
md5sum /test/* | awk '{print $1}' > /root/localfile.txt
if ! diff /root/localfile.txt /root/hdfsfile.txt > /dev/null 2>&1;
then
    /bin/mail -s "checksum difference between local and hdfs files" user@xyz.com < /dev/null
fi