How to read a file in Apache Samza from the local file system and HDFS

I'm looking for an approach in Apache Samza to read a file from the local file system or HDFS,
and then apply filters, aggregations, where conditions, order by, and group by to batches of data.
Any help would be appreciated.

You should create a system for each source of data you want to use. For example, to read from a local file, create a system with the FileReaderSystemFactory; for HDFS, create a system with the HdfsSystemFactory. Then you can use the regular process callback or windowing to process your data.

You can feed your Samza job using a standard Kafka producer. To make this easier, you can use Logstash: create a Logstash configuration in which you specify:
the input, as a local file or HDFS
filters (optional), where you can do basic filtering, aggregation, etc.
a Kafka output with the specific topic you want to feed
I used this approach to feed my Samza job from a local file.
Another approach is to use Kafka Connect:
http://docs.confluent.io/2.0.0/connect/

C++ Boost Log: How to use configuration file to rotate and zip logs?

I need to use the Boost.Log library to rotate logs and zip the rotated logs, and I want to define this in a configuration file like this one:
# Logging core settings section. May be omitted if no parameters specified within it.
[Core]
DisableLogging=false
Filter="%Severity% > 3"
# Sink settings sections
[Sinks.MySink1]
# Sink destination type
Destination=Console
# Formatter string. Optional, by default only log record message text is written.
Format="<%TimeStamp%> - %Message%"
FileName="%N.log"
RotationSize=1000000000
How can I implement this?
I couldn't find a way to combine rotating and compressing in a configuration file.
It is recommended to use an external service like logrotate to compress log files. Compression is a time-consuming process, and you don't want the logging library to block your application progress while it compresses log files. For this reason Boost.Log does not support compression out of the box.
If for some reason you still want to perform compression within your process, you can do this by implementing a custom file collector that derives from the collector interface. The collector's store_file method has to perform all actions regarding moving and compressing files, removing the old files, etc. It is called by the sink backend when it is time to rotate the log file currently being written. You set your collector on the sink backend by calling the text_file_backend::set_file_collector method.
To integrate with settings files, you will need to register a sink factory that will use the parsed settings container to create and configure a file sink with your custom file collector. This is described here.

Continuously write to S3 file

I want to store user action logs continuously to an S3 file for the duration of a session.
Requirements:
a single file per session
continuous write operations to S3
the file should be downloadable at the end of the session
I don't want to create a new file within a single session; I want to keep updating the same file. Please suggest only AWS solutions.
Do I need to create a stream and use it with S3, or use an intermediate storage system and push to S3 once in a while?
Objects in Amazon S3 are immutable -- they cannot be modified after they are created.
From your description, a good solution would be to use Amazon Kinesis Data Firehose. Your app can stream data to the Firehose and it will combine data together based on size or time. A long session might therefore produce multiple output files, so you would need a separate process that combines those files together into a single file.
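The "separate process that combines those files" could look roughly like the sketch below. It is only an illustration: the keys and payloads are hypothetical, and it assumes the output object keys embed a sortable timestamp so that lexicographic key order matches chronological order.

```python
from typing import Dict


def combine_parts(parts: Dict[str, bytes]) -> bytes:
    """Concatenate object bodies in lexicographic key order.

    `parts` maps an object key to that object's bytes. Sorting by key gives
    chronological order only if the keys embed a sortable timestamp, which
    is an assumption about how the output files are named.
    """
    return b"".join(parts[key] for key in sorted(parts))


# Hypothetical keys and payloads for one session, deliberately out of order.
session_parts = {
    "session-42/2020/01/01/10/part-0002": b"click:checkout\n",
    "session-42/2020/01/01/09/part-0001": b"click:home\n",
}
combined = combine_parts(session_parts)
print(combined.decode())
```

In a real job the dictionary would be filled by listing and downloading the session's objects, and the combined bytes would be uploaded as a new object, since existing S3 objects cannot be appended to.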

How to properly (scale-ably) read many ORC files into spark

I'd like to use EMR and Spark to process an AWS S3 inventory report generated in ORC format that has many ORC files (hundreds) and the total size of all the data is around 250GB.
Is there a specific or best-practice way to read all the files into one Dataset? It seems like I can pass the sqlContext.read().orc() method a list of files, but I wasn't sure whether this would scale/parallelize properly if I pass it a large list of hundreds of files.
What is the best practice way of doing this? Ultimately my goal is to have the contents of all the files in one dataset so that I can run a sql query on the dataset and then call .map on the results for subsequent processing on that result set.
Thanks in advance for your suggestions.
Just specify the folder where your ORC files are located. Spark will automatically detect all of them and put them into a single DataFrame.
sparkSession.read.orc("s3://bucket/path/to/folder/with/orc/files")
You shouldn't need to worry much about scalability, since everything is handled by Spark based on the default config provided by EMR for the selected EC2 instance type. You can experiment with the number of slave nodes and their instance type, though.
Besides that, I would suggest setting maximizeResourceAllocation to true, so that executors are configured to use the maximum resources on each slave node.

How exactly does Spark on EMR read from S3?

Just a few simple questions on the actual mechanism behind reading a file on s3 into an EMR cluster with Spark:
Does spark.read.format("com.databricks.spark.csv").load("s3://my/dataset/").where($"state" === "WA") communicate the whole dataset into the EMR cluster's local HDFS and then perform the filter after? Or does it filter records when bringing the dataset into the cluster? Or does it do neither? If this is the case, what's actually happening?
The official documentation lacks an explanation of what's going on (or if it does have an explanation, I cannot find it). Can someone explain, or link to a resource with such an explanation?
I can't speak for the closed-source AWS one, but the ASF s3a: connector does its work in S3AInputStream.
Reading data is via HTTPS, which has a slow startup time and, if you need to stop the download before the GET is finished, forces you to abort the TCP stream and create a new one.
To keep this cost down, the code has features like:
Lazy seek: when you do a seek(), it updates its internal pointer but doesn't issue a new GET until you actually do a read.
Choosing whether to abort() or read to the end of a GET based on how much is left.
Has 3 IO modes:
"sequential", GET content range is from (pos, EOF). Best bandwidth, worst performance on seek. For: CSV, .gz, ...
"random": small GETs, min(block-size, length(read)). Best for columnar data (ORC, Parquet) compressed in a seekable format (snappy)
"adaptive" (new last week, based on some work from microsoft on the Azure WASB connector). Starts off sequential, as soon as you do a backwards seek switches to random IO
The code is all there, improvements welcome. (The current perf work, especially random IO, is based on TPC-DS benchmarking of ORC data on Hive, BTW.)
Assuming you are reading CSV and filtering there, it'll be reading the entire CSV file and filtering. This is horribly inefficient for large files. It is best to import into a columnar format and use predicate pushdown, so that the layers below can seek around the file, filtering and reading only the needed columns.
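To make the lazy-seek idea from the feature list concrete, here is a toy model in Python. It is not the S3AInputStream code: the class name, the in-memory `data` standing in for the remote object, and the `get_count` counter are all made up for illustration. It only shows that seek() is free and that a new GET is opened on read() only when the open stream is not already at the requested position.

```python
class LazySeekStream:
    """Toy model of a lazy-seek input stream over a remote object.

    `data` stands in for the remote object; `get_count` counts simulated
    ranged GET requests. Illustration only, not the real connector code.
    """

    def __init__(self, data: bytes) -> None:
        self._data = data
        self._target = 0          # position requested by the caller
        self._stream_pos = None   # position of the open GET, None if none
        self.get_count = 0

    def seek(self, pos: int) -> None:
        # Lazy: only remember the target, do not contact the store.
        self._target = pos

    def read(self, n: int) -> bytes:
        # Open a new ranged GET only if the open one is not at the target.
        if self._stream_pos != self._target:
            self.get_count += 1
            self._stream_pos = self._target
        chunk = self._data[self._stream_pos:self._stream_pos + n]
        self._stream_pos += len(chunk)
        self._target = self._stream_pos
        return chunk


stream = LazySeekStream(b"0123456789")
stream.seek(8)
stream.seek(2)            # two seeks, still no GET issued
first = stream.read(3)    # the first GET happens here
second = stream.read(2)   # continues on the same GET
print(first, second, stream.get_count)
```

Repeated seeks before a read cost nothing in this model, which is the behaviour that keeps columnar readers from paying one HTTPS request per seek.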
Loading data from S3 (s3://-) usually goes via EMRFS in EMR.
EMRFS accesses S3 directly (not via HDFS).
When Spark loads data from S3, it is stored as a Dataset in the cluster according to its StorageLevel (memory or disk).
Finally, Spark filters the loaded data.
When you specify files located on S3 they are read into the cluster. The processing happens on the cluster nodes.
However, this may be changing with S3 Select, which is now in preview.

Limit MQFTE file transfer to one file at a time

I have a MQFTE setup where we are receiving files from an external vendor. The files get dumped on a server in DMZ and we have an MQFTE agent that picks the files from that server and drops to our server.
We receive files in "sets", i.e. each incoming file has an associated XML file that describes and contains metadata about the file, e.g. applicationform.pdf and applicationform.xml. The final application stores the PDF file based on the data/metadata in the XML.
Since the trigger is fired for each incoming file, we check in the trigger whether or not we've received the XML file and the content file (e.g. PDF).
However, I don't think this is the best approach, as it adds a lot of bookkeeping code to handle concurrency issues when both files arrive at the same time. Is there a way to:
1) Restrict the trigger so that it only fires when both files have arrived? In my research this is not possible.
2) Configure the agent on the server so that it only receives one file at a time? Looking at the documentation, it seems like this can be achieved, but only on the agent initiating the transfer, not on the agent receiving the transfer. The documentation hints at monitorMaxResourcesInPoll and the -bs parameter, but that would be on the source agent, I guess. Since the agent is shared with multiple systems, this would impact them as well.
Also, I would appreciate any tips and suggestions or even alternative solutions to best meet the requirement.
I don't think there is a way to check for both files existing before the monitor triggers. What some users do is send all of the files they want to transfer, and then finally put a 'marker' file in the directory which the resource monitor looks for. Because the marker file is only written after all other files are ready to be sent, the monitor only transfers the files when they're all there.
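The marker-file logic can be sketched in a few lines of Python. This is not MQFTE configuration; it only illustrates the rule "do nothing until the marker exists". The marker name READY and the function name are hypothetical.

```python
from typing import Iterable, List


def files_ready_for_transfer(names: Iterable[str], marker: str = "READY") -> List[str]:
    """Return the payload files to transfer, or [] if the marker is absent.

    `marker` is a hypothetical file name that the sender writes last,
    after every other file in the set is complete.
    """
    names = list(names)
    if marker not in names:
        return []  # the set is still being written; do nothing yet
    return sorted(n for n in names if n != marker)


# Only the PDF has arrived so far: nothing is transferred.
incomplete = files_ready_for_transfer(["applicationform.pdf"])

# The sender has written both files and then the marker.
complete = files_ready_for_transfer(
    ["applicationform.xml", "READY", "applicationform.pdf"]
)
print(incomplete, complete)
```

Because the sender controls when the marker appears, the receiving side no longer needs its own checks for the PDF and XML arriving at the same time.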
In answer to 2), you could set maxDestinationTransfers to 1 on the destination agent to limit it to receiving a single transfer at a time. If a transfer contains multiple files, they will be transferred in sequence, so the destination is really only receiving one file at a time. monitorMaxResourcesInPoll simply limits the number of files the monitoring agent parses in the source directory per monitor poll. You could set that to 1, but if you want to transfer the PDF and the XML file in the same transfer you'd need to set it to 2. It's probably not the setting you want to use.