zgrep in hadoop streaming - amazon-web-services

I am trying to grep a gzipped file on S3/AWS and write the output to a new location with the same file name.
I am using the command below on S3; is this the right way to pipe the streaming output from the first cat command to the output location?
hadoop fs -cat s3://analytics/LZ/2017/03/03/test_20170303-000000.tar.gz | zgrep -a -E '*word_1*|*word_2*|word_3|word_4' | hadoop fs -put - s3://prod/project/test/test_20170303-000000.tar.gz

Given you are playing with Hadoop, why not run the code in the cluster? Scanning for strings inside a .gz file is commonplace, though I don't know about .tar files.
I'd personally use the -copyToLocal and -copyFromLocal commands to copy the file to the local FS and work there. The trouble with things like -cat is that a lot gets logged by the Hadoop client code, so a pipe is likely to pick up too much extraneous cruft.
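As an aside, the pattern in the question mixes shell-glob wildcards (*word_1*) into an extended regex; grep -E already matches substrings, so plain alternation is enough. A minimal local sketch of the filtering step (the file contents and word names here are invented):

```shell
# build a small gzipped sample file in a scratch directory
tmp=$(mktemp -d)
printf 'word_1 here\nnothing\nword_3 too\n' | gzip > "$tmp/sample.gz"

# grep -E matches substrings, so no leading/trailing * is needed
matches=$(gzip -dc "$tmp/sample.gz" | grep -a -E 'word_1|word_2|word_3|word_4')
echo "$matches"   # prints the word_1 and word_3 lines
rm -rf "$tmp"
```

The same pattern would drop in where the question pipes through zgrep.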

Related

Exclude small files when using gsutil rsync

I would like to upload files of a given folder to a bucket using gsutil rsync. However, instead of uploading all files, I would like to exclude files that are below a certain size. The Unix rsync command offers the option --min-size=SIZE. Is there an equivalent for the gsutil tool? If not, is there an easy way of excluding small files?
Ok so the easiest solution I found is to move the small files into a subdirectory, and then use rsync (without the -r option). The code for moving the files:
from glob import glob
from os import makedirs
from os.path import getsize, join
import shutil

def filter_images(source, limit):
    # collect .tiff files smaller than `limit` megabytes
    imgs = [img for img in glob(join(source, "*.tiff"))
            if getsize(img) / (1024 * 1024.0) < limit]
    if len(imgs) == 0:
        return
    # move them into a "filtered" subdirectory, excluded from the rsync
    filtered_dir = join(source, "filtered")
    makedirs(filtered_dir)
    for img in imgs:
        shutil.move(img, filtered_dir)
gsutil rsync doesn't have this option. You can do it manually by scripting this and sending file by file, but that's not very efficient. I propose this command instead:
find . -type f -size -4000c | xargs -I{} gsutil cp {} gs://my-bucket/path
Here only files below 4000 bytes will be copied. These are the size units find understands:
c for bytes
w for two-byte words
k for Kilobytes
M for Megabytes
G for Gigabytes
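As a quick sanity check of the size test (file names invented), you can create files of known sizes and watch which ones find selects:

```shell
tmp=$(mktemp -d)
truncate -s 1000 "$tmp/small.dat"   # 1000 bytes: below the -4000c threshold
truncate -s 8000 "$tmp/big.dat"     # 8000 bytes: above it

# -size -4000c selects files strictly smaller than 4000 bytes
found=$(find "$tmp" -type f -size -4000c)
echo "$found"   # only small.dat
rm -rf "$tmp"
```

Swapping echo for the gsutil cp invocation via xargs gives the proposed command.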

Parsing volume array of Dockerfile in bash

I'm working on a management script for Docker containers. Right now the user has to configure certain variables before using it. Often, these variables are already defined in the Dockerfile so the default should be to read those values.
I'm having some trouble with the array format used in these Dockerfiles. If I have the volume definition VOLUME ["/root/", "/var/log/"], the script should be able to figure out /root/ and /var/log/; I haven't been able to accomplish this yet.
So far I have been able to get "/root/", ", " and "/var/log/" out of the file using grep VOLUME Dockerfile | cut -c 8- | grep -o -P '(?<=").+?(?=")' but this still includes the ", ", which should be left out.
Does anyone have suggestions about how to parse this properly?
awk to the rescue!
$ echo VOLUME ["/root/", "/var/log/"] |
awk -F'[ ,\\[\\]]' '/VOLUME/{for(i=3;i<=NF;i+=2) print $i}'
/root/
/var/log/
by setting the delimiters you can extract all the fields.
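If PCRE lookarounds are giving trouble, a grep-only sketch also works (assuming the VOLUME line uses double quotes): match the quoted tokens themselves rather than the text between arbitrary quotes, then strip the quotes:

```shell
line='VOLUME ["/root/", "/var/log/"]'

# grab each "quoted" token, then drop the quotes
volumes=$(printf '%s\n' "$line" | grep -o '"[^"]*"' | tr -d '"')
echo "$volumes"
```

This prints /root/ and /var/log/ on separate lines, skipping the ", " separators because they sit outside the quoted regions.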

Is it possible to exclude from aws S3 sync files older than x time?

I'm trying to use aws s3 CLI command to sync files (then delete a local copy) from the server to S3 bucket, but can't find a way to exclude newly created files which are still in use in local machine.
Any ideas?
This should work:
find /path/to/local/SyncFolder -mtime +1 -print0 | sed -z 's/^/--include=/' | xargs -0 /usr/bin/aws s3 sync /path/to/local/SyncFolder s3://remote.sync.folder --exclude '*'
There's a trick here: we're not excluding the files we don't want, we're excluding everything and then including the files we want. Why? Because either way, we're probably going to have too many parameters to fit into the command line. We can use xargs to split up long lines into multiple calls, but we can't let xargs split up our excludes list, so we have to let it split up our includes list instead.
So, starting from the beginning, we have a find command. -mtime +1 finds all the files that are older than a day, and -print0 tells find to delimit each result with a null byte instead of a newline, in case some of your files have newlines in their names.
Next, sed adds the --include=/ option to the start of each filename, and the -z option is included to let sed know to use null bytes instead of newlines as delimiters.
Finally, xargs will feed all those include options to the end of our aws command, calling aws multiple times if need be. The -0 option is just like sed's -z option, telling it to use null bytes instead of newlines.
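The pipeline's plumbing can be sanity-checked without touching aws by substituting echo for the aws command; the filenames here are invented:

```shell
# two fake filenames, null-delimited, as find -print0 would emit them
out=$(printf 'old1.log\0old2.log\0' |
  sed -z 's/^/--include=/' |
  xargs -0 echo aws s3 sync . s3://bucket --exclude '*')
echo "$out"
```

Because xargs invokes the command directly (no shell), the literal '*' survives unglobbed, and each filename arrives as its own --include option appended to the end.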
To my knowledge you can only include/exclude based on filename, so the only way I see is a really dirty hack.
You could run a bash script to rename all files below your threshold, prefixing them like TOO_NEW_%Filename%, and run the CLI with:
--exclude 'TOO_NEW_*'
But no don't do that.
Most likely ignoring the newer files is the default behavior. We can read in aws s3 sync help:
The default behavior is to ignore same-sized items unless the local version is newer than the S3 version.
If you'd like to change the default behaviour, you have the following parameters to use:
--size-only (boolean) Makes the size of each key the only criteria used to decide whether to sync from source to destination.
--exact-timestamps (boolean) When syncing from S3 to local, same-sized
items will be ignored only when the timestamps match exactly. The
default behavior is to ignore same-sized items unless the local version
is newer than the S3 version.
To see what files are going to be updated, run the sync with --dryrun.
Alternatively, use find to list all the files which need to be excluded, and pass them to the --exclude parameter.

Compare HDFS Checksum to Local File System Checksum

I am trying to write a simple script to verify the HDFS and local filesystem checksums.
On HDFS i get -
[m@x01tbipapp3a ~]$ hadoop fs -checksum /user/m/file.txt
/user/m/file.txt MD5-of-0MD5-of-512CRC32C 000002000000000000000000755ca25bd89d1a2d64990a68dedb5514
On the Local File System, I get -
[m@x01tbipapp3a ~]$ cksum file.txt
3802590149 26276247 file.txt
[m@x01tbipapp3a ~]$ md5sum file.txt
c1aae0db584d72402d5bcf5cbc29134c file.txt
Now how do I compare them? I tried converting the HDFS checksum from hex to decimal to see if it matches the cksum output, but it does not...
Is there a way to compare the 2 checksums using any algorithm?
thanks
Starting from Hadoop 3.1, checksum comparisons with local files can be performed against HDFS. However, the comparison depends on how you put the file into HDFS in the first place. By default, HDFS uses CRC32C chunk checksums, and its -checksum output is the MD5 of the MD5s of those individual chunk checksums.
This means that you can't easily compare that checksum with one of a local copy. You can instead write the file initially with a CRC32 checksum:
hdfs dfs -Ddfs.checksum.type=CRC32 -put myFile /tmp
Then, to get the checksum:
hdfs dfs -Ddfs.checksum.combine.mode=COMPOSITE_CRC -checksum /tmp/myFile
For the local copy:
crc32 myFile
If you didn't upload the file with a CRC32 checksum, or don't want to upload it again that way, you can also upload the local copy you want to compare against with the default CRC32C checksum:
hdfs dfs -put myFile /tmp
And compare the two files on HDFS with:
hdfs dfs -checksum /tmp/myFile and hdfs dfs -checksum /tmp/myOtherFile.
Reference:
https://community.cloudera.com/t5/Community-Articles/Comparing-checksums-in-HDFS/ta-p/248617
This is not a solution but a workaround which can be used.
Local File Checksum:
cksum test.txt
HDFS Checksum :
hadoop fs -cat /user/test/test.txt > tmp.txt
cksum tmp.txt
You can compare them.
Hope it helps.
I was also confused because the md5 was not matching. It turned out the Hadoop checksum is not a simple md5; it's an MD5 of MD5s of CRC32C checksums :-)
see this
http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-user/201508.mbox/%3CCAMm20=5K+f3ArVtoo9qMSesjgd_opdcvnGiDTkd3jpn7SHkysg#mail.gmail.com%3E
and this
http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-user/201103.mbox/%3CAANLkTinR8zM2QUb+T-drAC6PDmwJm8qcdxL48hzxBNoi#mail.gmail.com%3E
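A toy illustration of why a digest-of-digests never matches a plain digest; this mimics only the shape of Hadoop's scheme (chunking, then hashing the list of chunk hashes), not its exact algorithm or chunk size:

```shell
tmp=$(mktemp -d)
printf 'some sample data for checksumming' > "$tmp/f"

# plain digest of the whole file
whole=$(md5sum "$tmp/f" | awk '{print $1}')

# digest-of-digests: split into 8-byte chunks, hash each, hash the hash list
split -b 8 "$tmp/f" "$tmp/chunk_"
composite=$(md5sum "$tmp"/chunk_* | awk '{print $1}' | md5sum | awk '{print $1}')

echo "$whole"
echo "$composite"   # differs from $whole
rm -rf "$tmp"
```

The two values never agree, which is exactly the mismatch the poster ran into when comparing hadoop fs -checksum against a local md5sum.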
Piping the results of a cat'd hdfs file to md5sum worked for me:
$ hadoop fs -cat /path/to/hdfs/file.dat|md5sum
cb131cdba628676ce6942ee7dbeb9c0f -
$ md5sum /path/to/localFilesystem/file.txt
cb131cdba628676ce6942ee7dbeb9c0f /path/to/localFilesystem/file.txt
This would not be recommended for massive files.
I used a workaround for this: a simple script that compares checksums of the local and HDFS file systems using md5sum. I mounted my HDFS file system locally at /hdfs.
md5sum /hdfs/md5test/* | awk '{print $1}' > hdfsfile.txt
md5sum /test/* | awk '{print $1}' > localfile.txt
if ! diff /root/localfile.txt /root/hdfsfile.txt > /dev/null 2>&1
then
    /bin/mail -s "checksum difference between local and hdfs files" user@xyz.com < /dev/null
fi

Bash, Netcat, Pipes, perl

Background: I have a fairly simple bash script that I'm using to generate a CSV log file. As part of that bash script I poll other devices on my network using netcat. The netcat command returns a stream of information that I can pipe into a grep command to get to certain values I need in the CSV file. I save that return value from grep into a bash variable and then at the end of the script, I write out all saved bash variables to a CSV file. (Simple enough.)
The change I'd like to make is the amount of netcat commands I have to issue for each piece of information I want to save off. With each issued netcat command I get ALL possible values returned (so each time returns the same data and is burdensome on the network). So, I'd like to only use netcat once and parse the return value as many times as I need to create the bash variables that can later be concatenated together into a single record in the CSV file I'm creating.
Specific Question: Using bash syntax, if I pass the output of the netcat command to a file using > (versus the current grep method), I get a file with each entry on its own line (presumably separated with \n as the EOL record separator -- easy for perl regex). However, if I save the output of netcat directly to a bash variable and echo that variable, all of the data is jumbled together, so it is cumbersome to parse out (not so easy).
I have played with two options: First, I think a perl one-liner may be a good solution here, but I'm not sure how to best execute it. Pseudocode might be to save the netcat output to a bash variable and then somehow figure out how to parse it with perl (not straightforward, though).
The second option would be to use bash's > and send netcat's output to a file. This would be easy to process with perl and regex given the \n EOL, but that would require opening an external file and passing it to a perl script for processing AND then somehow passing its return value back into the bash script as a bash variable for entry into the CSV file.
I know I'm missing something simple here. Is there a way I can force a newline entry into the bash variable from netcat and then repeatedly run a perl one-liner against that variable to create each of the CSV variables I need -- all within the same bash script? Sorry for the long question.
The second option would be to use bash's > and send netcat's output to
a file. This would be easy to process with perl and Regex given the \n
EOL, but that would require opening an external file and passing it to
a perl script for processing AND then somehow passing its return value
back into the bash script as a bash variable for entry into the CSV
file.
This is actually a fairly common idiom: save the output from netcat in
a temporary file, then use grep or awk or perl or what-have-you as
many times as necessary to extract data from that file:
# create a temporary file and arrange to have it
# deleted when the script exits.
tmpfile=$(mktemp tmpXXXXXX)
trap "rm -f $tmpfile" EXIT
# dump data from netcat into the
# temporary file.
nc somehost someport > $tmpfile
# extract some information into variable `myvar`
myvar=$(awk '/something/ {print $4}' $tmpfile)
That last line demonstrates how to get the output of something (in this case, an awk script) into a variable. If you were using perl to extract some information you could do the same thing.
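On the "jumbled" variable from the question: command substitution preserves the newlines; they only collapse when the variable is expanded unquoted. Quoting the expansion lets you parse one capture repeatedly, with no temporary file at all (printf stands in for nc here, and the field names are invented):

```shell
# stand-in for: data=$(nc somehost someport)
data=$(printf 'temp: 41\nfan: 900\nstate: ok\n')

echo $data      # unquoted: newlines collapse to spaces
echo "$data"    # quoted: the three lines survive

# parse the same capture as many times as needed
temp=$(echo "$data" | sed -n 's/^temp: //p')
fan=$(echo "$data" | sed -n 's/^fan: //p')
echo "$temp,$fan"   # 41,900
```

Each sed (or perl, or awk) pass reads the quoted variable as if it were the original multi-line file.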
You could also just write the whole script in perl, which might make your life easier.