Compressed file ingestion using Flume - hdfs

Can I ingest any type of compressed file (say zip, bzip, lz4, etc.) into HDFS using Flume NG 1.3.0? I am planning to use spoolDir. Any suggestions, please?

You can ingest any type of file; you just need to select an appropriate deserializer.
The configuration below works for compressed files. Adjust the options as needed:
agent.sources = src-1
agent.channels = c1
agent.sinks = k1
agent.sources.src-1.type = spooldir
agent.sources.src-1.channels = c1
agent.sources.src-1.spoolDir = /tmp/myspooldir
agent.sources.src-1.deserializer=org.apache.flume.sink.solr.morphline.BlobDeserializer$Builder
agent.channels.c1.type = file
agent.sinks.k1.type = hdfs
agent.sinks.k1.channel = c1
agent.sinks.k1.hdfs.path = /user/myevents/
agent.sinks.k1.hdfs.filePrefix = events-
agent.sinks.k1.hdfs.fileType = CompressedStream
agent.sinks.k1.hdfs.round = true
agent.sinks.k1.hdfs.roundValue = 10
agent.sinks.k1.hdfs.roundUnit = minute
agent.sinks.k1.hdfs.codeC = snappyCodec
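
To illustrate why the deserializer choice matters for compressed input, here is a minimal Python sketch. It does not use Flume at all: line_deserialize and blob_deserialize are hypothetical stand-ins that only mimic, in a simplified way, the difference between a line-oriented deserializer (split on newlines, decode as text) and a blob deserializer (whole file as one event body).

```python
import gzip
from typing import List


def line_deserialize(raw: bytes) -> List[str]:
    """Simplified stand-in for a line-oriented deserializer:
    split on newlines and decode each piece as UTF-8 text."""
    return [chunk.decode("utf-8", errors="replace") for chunk in raw.split(b"\n")]


def blob_deserialize(raw: bytes) -> List[bytes]:
    """Simplified stand-in for a blob deserializer:
    the whole file becomes a single event body, bytes untouched."""
    return [raw]


original = b"first log line\nsecond log line\n" * 1000
compressed = gzip.compress(original)

# Blob route: the bytes arrive unchanged, so the file still decompresses.
restored = b"".join(blob_deserialize(compressed))
blob_ok = gzip.decompress(restored) == original

# Line route: binary data is not valid text, so decoding alters the bytes
# and the reassembled payload no longer matches the compressed file.
rebuilt = "\n".join(line_deserialize(compressed)).encode("utf-8")
line_ok = rebuilt == compressed

print("blob route intact:", blob_ok)
print("line route intact:", line_ok)
```

The line route fails here because the gzip header already contains bytes that are not valid UTF-8, which is why a binary-safe deserializer is the natural choice for zip, bzip, or lz4 files.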

You can leave the file uncompressed at the source and let Flume compress the data as it is written to HDFS, using one of the compression codecs Flume provides.
Avro sources and sinks also support compression, in case you are planning to use them.

I wrote a custom source component, which resolved the issue. The custom source can be used to ingest any kind of file.

Batch Transform output not included in Amazon SageMaker Model Monitor job

I am trying to do data quality monitoring for a Batch Transform job on a very simple .csv dataset with five inputs plus one target variable. Both the data capture configuration and the monitoring job completed without error. However, the monitoring job reported a violation on the number of data columns (five instead of six; the baseline is a dataset similar to the input). Any clues about that?
I successfully enabled data capture for the batch transform job using this sample code:
transformer = Transformer(
    model_name=model_name,
    instance_count=1,
    instance_type="ml.m5.large",
    accept="text/csv",
    output_path=output_path,
    sagemaker_session=pipeline_session,
)
transformer.transform(
    data=input_path,
    data_type="S3Prefix",
    content_type="text/csv",
    batch_data_capture_config=BatchDataCaptureConfig(
        destination_s3_uri=f"s3://{default_bucket}/{s3_prefix_destination}",
        generate_inference_id=True,
    ),
    split_type="Line",
)
This correctly produces two folders inside my S3 destination path, named input/ and output/. For the data quality monitoring job I used this:
data_quality_model_monitor = DefaultModelMonitor(
    role=role,
    instance_count=1,
    instance_type="ml.t3.medium",
    volume_size_in_gb=50,
    max_runtime_in_seconds=3600,
)
schedule = data_quality_model_monitor.create_monitoring_schedule(
    monitor_schedule_name="schedule",
    batch_transform_input=BatchTransformInput(
        data_captured_destination_s3_uri=f"s3://{default_bucket}/{s3_prefix_destination}",
        destination="/opt/ml/processing/input",
        dataset_format=MonitoringDatasetFormat.csv(header=False),
    ),
    output_s3_uri=f"s3://{default_bucket}/{s3_prefix_output}",
    statistics=f"s3://{default_bucket}/{s3_prefix_statistics}",
    constraints=f"s3://{default_bucket}/{s3_prefix_constraints}",
    schedule_cron_expression=CronExpressionGenerator.hourly(),
    enable_cloudwatch_metrics=True,
)
It reports a violation on data columns (five instead of six). It seems that the monitoring job analyzes only the manifest file contained in the folder input/, missing the files inside output/. Probably there is an option to enable output/ control, but I did not find it.

Single source with multiple sinks vs. flatMap

I'm using Kinesis Data Analytics on Flink to do stream processing.
The use case I'm working on is to read records from a single Kinesis stream and, after some transformations, write them to multiple S3 buckets. One source record might end up in multiple S3 buckets, because it contains a lot of information that needs to be split across them.
I tried achieving this using multiple sinks.
private static <T> SinkFunction<T> createS3SinkFromStaticConfig(String path, Class<T> type) {
    OutputFileConfig config = OutputFileConfig
            .builder()
            .withPartSuffix(".snappy.parquet")
            .build();
    final StreamingFileSink<T> sink = StreamingFileSink
            .forBulkFormat(new Path(s3SinkPath + "/" + path), createParquetWriter(type))
            .withBucketAssigner(new S3BucketAssigner<T>())
            .withOutputFileConfig(config)
            .withRollingPolicy(new RollingPolicy<T>(DEFAULT_MAX_PART_SIZE, DEFAULT_ROLLOVER_INTERVAL))
            .build();
    return sink;
}
public static void main(String[] args) throws Exception {
    DataStream<PIData> input = createSourceFromStaticConfig(env)
            .map(new JsonToSourceDataMap())
            .name("jsonToInputDataTransformation");
    input.map(value -> value)
            .name("rawData")
            .addSink(createS3SinkFromStaticConfig("raw_data", InputData.class))
            .name("s3Sink");
    input.map(FirstConverter::convertInputData)
            .addSink(createS3SinkFromStaticConfig("firstOutput", Output1.class));
    input.map(SecondConverter::convertInputData)
            .addSink(createS3SinkFromStaticConfig("secondOutput", Output2.class));
    input.map(ThirdConverter::convertInputData)
            .addSink(createS3SinkFromStaticConfig("thirdOutput", Output3.class));
    // and so on; there are around 10 buckets.
}
However, I saw a big performance impact from this, including a large CPU spike compared to a job with just one sink. The scale I'm looking at is around 100k records per second.
Other notes:
I'm using the bulk format writer since I want to write files in Parquet format. I tried increasing the checkpointing interval from 1 minute to 3 minutes, assuming that writing files to S3 every minute might be causing issues, but this didn't help much.
As I'm new to Flink and stream processing, I'm not sure whether this much performance impact is expected or whether there is something I can do better.
Would using a flatMap operator and then having a single sink be better?
When you had a very simple pipeline with a single source and a single sink, something like this:
source -> map -> sink
then the Flink scheduler was able to optimize the execution, and the entire pipeline ran as a sequence of function calls within a single task -- with no serialization or network overhead. Flink 1.12 can apply this operator chaining optimization to more complex topologies -- perhaps including the one you have now with multiple sinks -- but I don't believe this was possible with Flink 1.11 (which is what KDA is currently based on).
I don't see how using a flatmap would make any difference.
You can probably optimize your serialization/deserialization. See https://flink.apache.org/news/2020/04/15/flink-serialization-tuning-vol-1.html.

Google Cloud Platform: Speech to Text Conversion of Large Media Files

I'm trying to extract text from an mp4 media file downloaded from YouTube. Since I'm already using Google Cloud Platform, I thought I'd give Google Cloud Speech a try.
After all the installation and configuration, I copied the following code snippet to get started:
with io.open(file_name, 'rb') as audio_file:
    content = audio_file.read()
audio = types.RecognitionAudio(content=content)
config = types.RecognitionConfig(
    encoding=enums.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code='en-US')
response = client.long_running_recognize(config, audio)
But I got the following error regarding file size:
InvalidArgument: 400 Inline audio exceeds duration limit. Please use a GCS URI.
Then I read that I should use streams for large media files. So, I tried the following code snippet:
with io.open(file_name, 'rb') as audio_file:
    content = audio_file.read()
# In practice, stream should be a generator yielding chunks of audio data.
stream = [content]
requests = (types.StreamingRecognizeRequest(audio_content=chunk) for chunk in stream)
config = types.RecognitionConfig(
    encoding=enums.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code='en-US')
streaming_config = types.StreamingRecognitionConfig(config=config)
responses = client.streaming_recognize(streaming_config, requests)
But I still got the following error:
InvalidArgument: 400 Invalid audio content: too long.
So, can anyone please suggest an approach to transcribe an mp4 file and extract the text? I don't have any complex requirement for very large media files; a file will be 10-15 minutes long at most. Thanks.
The error message means that the file is too big and you need to first copy the media file to Google Cloud Storage and then specify a Cloud Storage URI such as gs://bucket/path/mediafile.
The key to using a Cloud Storage URI is:
RecognitionAudio audio =
RecognitionAudio.newBuilder().setUri(gcsUri).build();
The following code shows how to specify a GCS URI for input. Google has a complete example on GitHub.
public static void syncRecognizeGcs(String gcsUri) throws Exception {
    // Instantiates a client with GOOGLE_APPLICATION_CREDENTIALS
    try (SpeechClient speech = SpeechClient.create()) {
        // Builds the request for remote FLAC file
        RecognitionConfig config =
                RecognitionConfig.newBuilder()
                        .setEncoding(AudioEncoding.FLAC)
                        .setLanguageCode("en-US")
                        .setSampleRateHertz(16000)
                        .build();
        RecognitionAudio audio = RecognitionAudio.newBuilder().setUri(gcsUri).build();
        // Use blocking call for getting audio transcript
        RecognizeResponse response = speech.recognize(config, audio);
        List<SpeechRecognitionResult> results = response.getResultsList();
        for (SpeechRecognitionResult result : results) {
            // There can be several alternative transcripts for a given chunk of speech. Just use the
            // first (most likely) one here.
            SpeechRecognitionAlternative alternative = result.getAlternativesList().get(0);
            System.out.printf("Transcription: %s%n", alternative.getTranscript());
        }
    }
}
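
Since the question uses Python, here is a rough Python counterpart of the GCS approach. It is a sketch under several assumptions: it targets the 2.x google-cloud-speech client (the question's types/enums style is from an older release), the audio track has already been extracted from the mp4 and uploaded as 16 kHz FLAC, and the helper names gcs_uri and transcribe_gcs are made up for illustration. It uses long_running_recognize rather than the synchronous recognize call shown above, because synchronous recognition is documented as limited to roughly one minute of audio, which is shorter than the 10-15 minute files in the question.

```python
def gcs_uri(bucket: str, object_name: str) -> str:
    """Build the gs:// URI used to reference audio stored in Cloud Storage."""
    return f"gs://{bucket}/{object_name.lstrip('/')}"


def transcribe_gcs(uri: str, timeout_s: int = 1800) -> str:
    """Transcribe audio stored in Cloud Storage using the asynchronous API.

    Assumes google-cloud-speech 2.x and valid application credentials.
    """
    # Imported lazily so gcs_uri() works without the client library installed.
    from google.cloud import speech

    client = speech.SpeechClient()
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.FLAC,  # assumed format
        sample_rate_hertz=16000,                               # assumed rate
        language_code="en-US",
    )
    audio = speech.RecognitionAudio(uri=uri)

    # Starts a server-side operation and blocks until it finishes or times out.
    operation = client.long_running_recognize(config=config, audio=audio)
    response = operation.result(timeout=timeout_s)

    # Keep the first (most likely) alternative of each result.
    return " ".join(
        result.alternatives[0].transcript
        for result in response.results
        if result.alternatives
    )


# The bucket and object names below are placeholders.
print(gcs_uri("my-bucket", "audio/talk.flac"))
```

Calling transcribe_gcs(gcs_uri("my-bucket", "audio/talk.flac")) would then return the transcript as a single string, provided the bucket, object, and credentials actually exist.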

Apache Flume taking more time than copyFromLocal command

I have a 24 GB folder in my local file system. My task is to move that folder to HDFS. I did it in two ways.
1) hdfs dfs -copyFromLocal /home/data/ /home/
This took around 15 minutes to complete.
2) Using Flume.
Here is my agent configuration:
spool_dir.sources = src-1
spool_dir.channels = channel-1
spool_dir.sinks = sink_to_hdfs
# source
spool_dir.sources.src-1.type = spooldir
spool_dir.sources.src-1.channels = channel-1
spool_dir.sources.src-1.spoolDir = /home/data/
spool_dir.sources.src-1.fileHeader = false
# HDFS sinks
spool_dir.sinks.sink_to_hdfs.type = hdfs
spool_dir.sinks.sink_to_hdfs.hdfs.fileType = DataStream
spool_dir.sinks.sink_to_hdfs.hdfs.path = hdfs://192.168.1.71/home/user/flumepush
spool_dir.sinks.sink_to_hdfs.hdfs.filePrefix = customevent
spool_dir.sinks.sink_to_hdfs.hdfs.fileSuffix = .log
spool_dir.sinks.sink_to_hdfs.hdfs.batchSize = 1000
spool_dir.channels.channel-1.type = file
spool_dir.channels.channel-1.checkpointDir = /home/user/spool_dir_checkpoint
spool_dir.channels.channel-1.dataDirs = /home/user/spool_dir_data
spool_dir.sources.src-1.channels = channel-1
spool_dir.sinks.sink_to_hdfs.channel = channel-1
This step took almost an hour to push the data to HDFS.
As far as I know, Flume is distributed, so shouldn't it load data faster than the copyFromLocal command?
If you're looking simply at read and write operations, Flume is going to be at least 2x slower with your configuration because you're using a file channel: every file read from disk is encapsulated into a Flume event (in memory) and then serialized back down to disk via the file channel. The sink then reads the event back from the file channel (disk) before pushing it up to HDFS.
You also haven't set a blob deserializer on your spoolDir source, so it reads one line at a time from your source files, wraps each line in a Flume event, and writes it to the file channel. Paired with the HDFS sink's default rollXXX values, you'll get a new file in HDFS every 10 events, 30 seconds, or 1 KB (whichever comes first), rather than one file per input file as you would with copyFromLocal.
All of these factors add up to give you slower performance. If you want more comparable performance, you should use the BlobDeserializer on the spoolDir source, coupled with a memory channel (but understand that a memory channel doesn't guarantee delivery of an event if the JVM process is terminated prematurely).
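
To get a feel for how much the default roll settings fragment the output, here is a small Python sketch. It is a simplified model rather than Flume's actual logic: count_hdfs_files is a hypothetical helper that rolls a file once an event-count or byte-size threshold is reached, treats 0 as "trigger disabled", and ignores time-based rolling entirely.

```python
def count_hdfs_files(event_sizes, roll_count=10, roll_size=1024):
    """Count output files when a sink rolls on event count or byte size.

    A threshold of 0 disables that trigger. Time-based rolling is ignored.
    """
    files = 0
    events_in_file = 0
    bytes_in_file = 0
    for size in event_sizes:
        events_in_file += 1
        bytes_in_file += size
        hit_count = roll_count > 0 and events_in_file >= roll_count
        hit_size = roll_size > 0 and bytes_in_file >= roll_size
        if hit_count or hit_size:
            # Close the current file and start a new one.
            files += 1
            events_in_file = 0
            bytes_in_file = 0
    if events_in_file:
        files += 1  # the last, partially filled file
    return files


# One event per line, assuming 100-byte lines (an illustrative figure).
lines = [100] * 1000
print(count_hdfs_files(lines))                              # defaults: 10 events / 1 KB
print(count_hdfs_files(lines, roll_count=0, roll_size=0))   # rolling disabled
```

Under these assumptions, a thousand 100-byte lines end up in 100 separate files with the default thresholds and in a single file once both triggers are disabled, which gives a rough idea of the per-file overhead a 24 GB line-by-line ingest would incur.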
Apache Flume is not intended for moving or copying folders from local file system to HDFS. Flume is meant for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store. (Reference: Flume User Guide)
If you want to move large files or directories, you should use hdfs dfs -copyFromLocal as you have already mentioned.

Flume HDFS Sink generates lots of tiny files on HDFS

I have a toy setup sending log4j messages to HDFS using Flume. I'm not able to configure the HDFS sink to avoid creating many small files. I thought I could configure the HDFS sink to create a new file every time the file size reaches 10 MB, but it is still creating files of around 1.5 KB.
Here is my current flume config:
a1.sources=o1
a1.sinks=i1
a1.channels=c1
#source configuration
a1.sources.o1.type=avro
a1.sources.o1.bind=0.0.0.0
a1.sources.o1.port=41414
#sink config
a1.sinks.i1.type=hdfs
a1.sinks.i1.hdfs.path=hdfs://localhost:8020/user/myName/flume/events
#never roll-based on time
a1.sinks.i1.hdfs.rollInterval=0
#10MB=10485760
a1.sinks.il.hdfs.rollSize=10485760
#never roll base on number of events
a1.sinks.il.hdfs.rollCount=0
#channel config
a1.channels.c1.type=memory
a1.channels.c1.capacity=1000
a1.channels.c1.transactionCapacity=100
a1.sources.o1.channels=c1
a1.sinks.i1.channel=c1
The problem is a typo in your configuration:
#sink config
a1.sinks.i1.type=hdfs
a1.sinks.i1.hdfs.path=hdfs://localhost:8020/user/myName/flume/events
#never roll-based on time
a1.sinks.i1.hdfs.rollInterval=0
#10MB=10485760
a1.sinks.il.hdfs.rollSize=10485760
#never roll base on number of events
a1.sinks.il.hdfs.rollCount=0
In the rollSize and rollCount lines, you wrote il (with a lowercase letter L) instead of i1 (with the digit one).
If you enable DEBUG logging, you will see something like:
[SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.sink.hdfs.BucketWriter.shouldRotate:465) - rolling: rollSize: 1024, bytes: 1024
Because of the il typo, the default rollSize of 1024 is being used.
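
A typo like this can be caught mechanically before starting the agent. The following Python sketch is a hypothetical helper, not part of Flume: undeclared_components scans a properties file and reports every property whose source, channel, or sink name does not appear in the agent's declaration line (for example a1.sinks=i1). It only handles the simple key=value layout shown in the question.

```python
KINDS = ("sources", "channels", "sinks")


def undeclared_components(config_text):
    """Return (line_no, agent, kind, name) for each property that refers to a
    component name missing from the agent's declaration line."""
    declared = {}    # (agent, kind) -> set of declared component names
    properties = []  # (line_no, agent, kind, name)
    for line_no, raw in enumerate(config_text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        parts = key.split(".")
        if len(parts) == 2 and parts[1] in KINDS:
            # Declaration line such as "a1.sinks = i1 i2".
            declared.setdefault((parts[0], parts[1]), set()).update(value.split())
        elif len(parts) >= 3 and parts[1] in KINDS:
            # Property line such as "a1.sinks.i1.hdfs.rollSize = ...".
            properties.append((line_no, parts[0], parts[1], parts[2]))
    return [p for p in properties if p[3] not in declared.get((p[1], p[2]), set())]


sample = """\
a1.sources=o1
a1.sinks=i1
a1.channels=c1
a1.sinks.i1.type=hdfs
a1.sinks.i1.hdfs.rollInterval=0
a1.sinks.il.hdfs.rollSize=10485760
a1.sinks.il.hdfs.rollCount=0
"""

for line_no, agent, kind, name in undeclared_components(sample):
    print(f"line {line_no}: '{name}' is not declared in {agent}.{kind}")
```

Run against the excerpt above, it flags the two il lines while leaving the correctly spelled i1 properties alone.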
HDFS Sink has a property hdfs.batchSize (default 100) which describes "number of events written to file before it is flushed to HDFS". I think that's your problem here.
Also consider checking the other HDFS Sink properties.
This could be happening because of the memory channel and its capacity. I guess it's dumping data to HDFS as soon as the channel becomes full. Did you try using a file channel instead of a memory channel?