Flume sink copying garbage data in HDFS

While copying data from a local path to an HDFS sink, I am getting some garbage data in the file at the HDFS location.
My config file for Flume:
# spool.conf: A single-node Flume configuration
# Name the components on this agent
a1.sources = s1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.s1.type = spooldir
a1.sources.s1.spoolDir = /home/cloudera/spool_source
a1.sources.s1.channels = c1
# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = flumefolder/events
a1.sinks.k1.hdfs.filetype = Datastream
#Format to be written
a1.sinks.k1.hdfs.writeFormat = Text
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
I am copying the file from the local path "/home/cloudera/spool_source" to the HDFS path "flumefolder/events".
Flume command:
flume-ng agent --conf-file spool.conf --name a1 -Dflume.root.logger=INFO,console
File "salary.txt" at local path "/home/cloudera/spool_source" is:
GR1,Emp1,Jan,31,2500
GR3,Emp3,Jan,18,2630
GR4,Emp4,Jan,31,3000
GR4,Emp4,Feb,28,3000
GR1,Emp1,Feb,15,2500
GR2,Emp2,Feb,28,2800
GR2,Emp2,Mar,31,2800
GR3,Emp3,Mar,31,3000
GR1,Emp1,Mar,15,2500
GR2,Emp2,Apr,31,2630
GR3,Emp3,Apr,17,3000
GR4,Emp4,Apr,31,3200
GR7,Emp7,Apr,21,2500
GR11,Emp11,Apr,17,2000
At the target path "flumefolder/events", the data is copied with garbage values as:
1 W��ȩGR1,Emp1,Jan,31,2500W��ȲGR3,Emp3,Jan,18,2630W��ȷGR4,Emp4,Jan,31,3000W��ȻGR4,Emp4,Feb,28,3000W��ȽGR1,Emp1,Feb,15,2500W����GR2,Emp2,Feb,28,2800W����GR2,Emp2,Mar,31,2800W����GR3,Emp3,Mar,31,3000W����GR1,Emp1,Mar,15,2500W����GR2,Emp2,
What is wrong in my configuration file spool.conf? I am unable to figure it out.

Flume configuration is case sensitive, so change the filetype line to fileType, and fix the Datastream value too, as it is also case sensitive:
a1.sinks.k1.hdfs.fileType = DataStream
With your current setup the misspelled property is ignored, so the default of a SequenceFile is being used, hence the odd characters.
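For reference, a corrected sink block, keeping the path and names from the question with only the fileType line changed:
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = flumefolder/events
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text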

Related

No FileSystem for scheme "s3" when trying to read a list of files with Spark from EC2

I'm trying to provide a list of files for spark to read as and when it needs them (which is why I'd rather not use boto or whatever else to pre-download all the files onto the instance and only then read them into spark "locally").
os.environ['PYSPARK_SUBMIT_ARGS'] = "--master local[3] pyspark-shell"
spark = SparkSession.builder.getOrCreate()
spark.sparkContext._jsc.hadoopConfiguration().set('fs.s3.access.key', credentials['AccessKeyId'])
spark.sparkContext._jsc.hadoopConfiguration().set('fs.s3.secret.key', credentials['SecretAccessKey'])
spark.read.json(['s3://url/3521.gz', 's3://url/2734.gz'])
No idea what local[3] is about but without this --master flag, I was getting another exception:
Exception: Java gateway process exited before sending the driver its port number.
Now, I'm getting this:
Py4JJavaError: An error occurred while calling o37.json.
: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3"
...
Not sure what o37.json refers to here but it probably doesn't matter.
I saw a bunch of answers to similar questions suggesting an addition of flags like:
os.environ['PYSPARK_SUBMIT_ARGS'] = "--packages com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.7.2 pyspark-shell"
I tried prepending and appending it to the other flag, but it doesn't work, and neither do the many variations I see in other answers and elsewhere on the internet (with different packages and versions), for example:
os.environ['PYSPARK_SUBMIT_ARGS'] = '--master local[*] --jars spark-snowflake_2.12-2.8.4-spark_3.0.jar,postgresql-42.2.19.jar,mysql-connector-java-8.0.23.jar,hadoop-aws-3.2.2,aws-java-sdk-bundle-1.11.563.jar'
A typical example of reading files from S3 is shown below.
Additionally, you can go through this answer to ensure the minimal structure and necessary modules are in place:
java.io.IOException: No FileSystem for scheme: s3
Read Parquet - S3
import os
import configparser
from pyspark import SparkContext
from pyspark.sql import SQLContext

os.environ['PYSPARK_SUBMIT_ARGS'] = "--packages=com.amazonaws:aws-java-sdk-bundle:1.11.375,org.apache.hadoop:hadoop-aws:3.2.0 pyspark-shell"
sc = SparkContext.getOrCreate()
sql = SQLContext(sc)
hadoop_conf = sc._jsc.hadoopConfiguration()
# read the AWS credentials from the local credentials file
config = configparser.ConfigParser()
config.read(os.path.expanduser("~/.aws/credentials"))
access_key = config.get("****", "aws_access_key_id")
secret_key = config.get("****", "aws_secret_access_key")
session_key = config.get("****", "aws_session_token")
# use the s3a connector with temporary (session) credentials
hadoop_conf.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
hadoop_conf.set("fs.s3a.access.key", access_key)
hadoop_conf.set("fs.s3a.secret.key", secret_key)
hadoop_conf.set("fs.s3a.session.token", session_key)
s3_path = "s3a://xxxx/yyyy/zzzz/"
sparkDF = sql.read.parquet(s3_path)
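The same s3a setup should also cover the original question's case of reading a list of gzipped JSON files; a minimal sketch, reusing the hadoop_conf configuration above and substituting the s3a scheme into the question's placeholder URLs:
sparkJSON = sql.read.json(['s3a://url/3521.gz', 's3a://url/2734.gz'])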

How to store data on hdfs using flume with existing schema file

I have JSON data coming from a source and I want to dump it on HDFS using Flume in Avro format, for which I already have an avsc file. I am using the following configuration for the sink, but it is not picking up my avsc file and creates its own schema instead:
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.serializer = org.apache.flume.serialization.AvroEventSerializer$Builder
agent1.sinks.sink1.hdfs.path = /tmp/data
agent1.sinks.sink1.schemaURL = file:///home/tmp/schema.avsc
log4j.appender.flume.AvroSchemaUrl = file:///home/tmp/schema.avsc
agent1.sinks.sink1.hdfs.rollSize = 1024
agent1.sinks.sink1.hdfs.rollCount = 0
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.sinks.sink1.channel = channel1
How can I do this?
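A hedged sketch of one possible direction: instead of a sink-level schemaURL property, the schema location is often passed to the Avro serializer through the flume.avro.schema.url event header, which a static interceptor on the source can inject (the source name source1 and interceptor name i1 below are hypothetical placeholders for your actual source):
agent1.sources.source1.interceptors = i1
agent1.sources.source1.interceptors.i1.type = static
agent1.sources.source1.interceptors.i1.key = flume.avro.schema.url
agent1.sources.source1.interceptors.i1.value = file:///home/tmp/schema.avsc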

Amazon S3 concatenate small files

Is there a way to concatenate small files which are less than 5 MB on Amazon S3?
Multipart Upload is not OK because the files are too small.
It is not an efficient solution to pull down all these files and do the concatenation.
So, can anybody tell me some APIs to do this?
Amazon S3 does not provide a concatenate function. It is primarily an object storage service.
You will need some process that downloads the objects, combines them, then uploads them again. The most efficient way to do this would be to download the objects in parallel, to take full advantage of available bandwidth. However, that is more complex to code.
I would recommend doing the processing "in the cloud" to avoid having to download the objects across the Internet. Doing it on Amazon EC2 or AWS Lambda would be more efficient and less costly.
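A minimal sketch of that download, combine, and re-upload process with boto3; the bucket name, prefix, and merged key below are hypothetical placeholders, and pagination of the listing is ignored for brevity:
import boto3

s3 = boto3.client('s3')
bucket = 'my-bucket'           # hypothetical bucket
prefix = 'small-files/'        # hypothetical prefix holding the small objects
merged_key = 'merged/all.txt'  # hypothetical destination key
# download every object under the prefix and collect the raw bytes
parts = []
for obj in s3.list_objects_v2(Bucket=bucket, Prefix=prefix).get('Contents', []):
    parts.append(s3.get_object(Bucket=bucket, Key=obj['Key'])['Body'].read())
# upload the concatenated bytes back as a single object
s3.put_object(Bucket=bucket, Key=merged_key, Body=b''.join(parts))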
Based on #wwadge's comment I wrote a Python script.
It bypasses the 5 MB limit by uploading a dummy object slightly bigger than 5 MB and then appending each small file as if it were the last part. At the end it strips the dummy part out of the merged file.
import boto3
import os
bucket_name = 'multipart-bucket'
merged_key = 'merged.json'
mini_file_0 = 'base_0.json'
mini_file_1 = 'base_1.json'
dummy_file = 'dummy_file'
s3_client = boto3.client('s3')
s3_resource = boto3.resource('s3')
# we need to have a garbage/dummy file with size > 5MB
# so we create and upload this
# this key will also be the key of final merged file
with open(dummy_file, 'wb') as f:
    # slightly > 5MB
    f.seek(1024 * 5200)
    f.write(b'0')
with open(dummy_file, 'rb') as f:
    s3_client.upload_fileobj(f, bucket_name, merged_key)
os.remove(dummy_file)
# get the number of bytes of the garbage/dummy-file
# needed to strip out these garbage/dummy bytes from the final merged file
bytes_garbage = s3_resource.Object(bucket_name, merged_key).content_length
# for each small file you want to concat
# when this loop has finished, merged.json will contain
# (merged.json + base_0.json + base_1.json)
for key_mini_file in ['base_0.json', 'base_1.json']:  # include more files if you want
    # initiate multipart upload with merged.json object as target
    mpu = s3_client.create_multipart_upload(Bucket=bucket_name, Key=merged_key)
    part_responses = []
    # perform multipart copy where merged.json is the first part
    # and the small file is the second part
    for n, copy_key in enumerate([merged_key, key_mini_file]):
        part_number = n + 1
        copy_response = s3_client.upload_part_copy(
            Bucket=bucket_name,
            CopySource={'Bucket': bucket_name, 'Key': copy_key},
            Key=merged_key,
            PartNumber=part_number,
            UploadId=mpu['UploadId']
        )
        part_responses.append(
            {'ETag': copy_response['CopyPartResult']['ETag'], 'PartNumber': part_number}
        )
    # complete the multipart upload
    # content of merged will now be merged.json + mini file
    response = s3_client.complete_multipart_upload(
        Bucket=bucket_name,
        Key=merged_key,
        MultipartUpload={'Parts': part_responses},
        UploadId=mpu['UploadId']
    )
# get the number of bytes from the final merged file
bytes_merged = s3_resource.Object(bucket_name, merged_key).content_length
# initiate a new multipart upload
mpu = s3_client.create_multipart_upload(Bucket=bucket_name, Key=merged_key)
# do a single copy from the merged file specifying byte range where the
# dummy/garbage bytes are excluded
response = s3_client.upload_part_copy(
    Bucket=bucket_name,
    CopySource={'Bucket': bucket_name, 'Key': merged_key},
    Key=merged_key,
    PartNumber=1,
    UploadId=mpu['UploadId'],
    CopySourceRange='bytes={}-{}'.format(bytes_garbage, bytes_merged-1)
)
# complete the multipart upload
# after this step the merged.json will contain (base_0.json + base_1.json)
response = s3_client.complete_multipart_upload(
    Bucket=bucket_name,
    Key=merged_key,
    MultipartUpload={'Parts': [
        {'ETag': response['CopyPartResult']['ETag'], 'PartNumber': 1}
    ]},
    UploadId=mpu['UploadId']
)
If you already have a >5MB object that you want to add smaller parts to, then skip creating the dummy file and the final byte-range copy. Also, I have no idea how this performs on a large number of very small files; in that case it might be better to download each file, merge them locally and then upload.
Edit: Didn't see the 5MB requirement. This method will not work because of this requirement.
From https://ruby.awsblog.com/post/Tx2JE2CXGQGQ6A4/Efficient-Amazon-S3-Object-Concatenation-Using-the-AWS-SDK-for-Ruby:
While it is possible to download and re-upload the data to S3 through
an EC2 instance, a more efficient approach would be to instruct S3 to
make an internal copy using the new copy_part API operation that was
introduced into the SDK for Ruby in version 1.10.0.
Code:
require 'rubygems'
require 'aws-sdk'
s3 = AWS::S3.new()
mybucket = s3.buckets['my-multipart']
# First, let's start the Multipart Upload
obj_aggregate = mybucket.objects['aggregate'].multipart_upload
# Then we will copy into the Multipart Upload all of the objects in a certain S3 directory.
mybucket.objects.with_prefix('parts/').each do |source_object|
  # Skip the directory object
  unless (source_object.key == 'parts/')
    # Note that this section is thread-safe and could greatly benefit from parallel execution.
    obj_aggregate.copy_part(source_object.bucket.name + '/' + source_object.key)
  end
end
obj_completed = obj_aggregate.complete()
# Generate a signed URL to enable a trusted browser to access the new object without authenticating.
puts obj_completed.url_for(:read)
Limitations (among others)
With the exception of the last part, there is a 5 MB minimum part size.
The completed Multipart Upload object is limited to a 5 TB maximum size.

How to force store(overwrite) in HDFS using Mapreduce

How do I overwrite existing output in HDFS from a MapReduce program?
In Pig this can be done with:
rmf /user/cloudera/outputfiles/citycount
STORE rel into '/user/cloudera/outputfiles/citycount';
Similarly, is there any way to achieve the same in a MapReduce program?
You can do it like this in your driver module:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
String pathin = args[0];
String pathout = args[1];
fs.delete(new Path(pathout), true);
// it will delete the output folder if the folder already exists.

How to transfer data from one system to another system's HDFS (connected through LAN) using Flume?

I have a computer on a LAN connection. I need to transfer data from that system to another system's HDFS location using Flume.
I have tried using the IP address of the sink system, but it didn't work. Please help.
Regards,
Athiram
This can be achieved by using the Avro mechanism.
Flume has to be installed on both machines. A config file with the following contents has to be created and run on the source system, where the logs are generated.
a1.sources = tail-file
a1.channels = c1
a1.sinks=avro-sink
a1.sources.tail-file.channels = c1
a1.sinks.avro-sink.channel = c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.sources.tail-file.type = spooldir
a1.sources.tail-file.spoolDir =<location of spool directory>
a1.sources.tail-file.channels = c1
a1.sinks.avro-sink.type = avro
a1.sinks.avro-sink.hostname = <IP Address of destination system where the data has to be written>
a1.sinks.avro-sink.port = 44444
A config file with the following contents has to be created and run on the destination system, where the data has to be written to HDFS.
a2.sources = avro-collection-source
a2.sinks = hdfs-sink
a2.channels = mem-channel
a2.sources.avro-collection-source.channels = mem-channel
a2.sinks.hdfs-sink.channel = mem-channel
a2.channels.mem-channel.type = memory
a2.channels.mem-channel.capacity = 1000
a2.sources.avro-collection-source.type = avro
a2.sources.avro-collection-source.bind = 0.0.0.0
a2.sources.avro-collection-source.port = 44444
a2.sinks.hdfs-sink.type = hdfs
a2.sinks.hdfs-sink.hdfs.writeFormat = Text
a2.sinks.hdfs-sink.hdfs.filePrefix = testing
a2.sinks.hdfs-sink.hdfs.path = hdfs://localhost:54310/user/hduser/
Now, the data from the log files in the source system will be written to HDFS on the destination system.
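To start the two agents, run flume-ng on each machine, destination first so that its Avro source is listening; the config file names destination.conf and source.conf are hypothetical:
flume-ng agent --conf-file destination.conf --name a2 -Dflume.root.logger=INFO,console
flume-ng agent --conf-file source.conf --name a1 -Dflume.root.logger=INFO,console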
Regards,
Athiram