Fluentd S3 output plugin not recognizing index - amazon-web-services

I am facing problems while using the S3 output plugin with Fluentd.
s3_object_key_format %{path}%{time_slice}_%{index}.%{file_extension}
Using %{index} at the end never resolves to _0, _1; I always end up with log file names like
sflow_.log
while I need sflow_0.log.

Can you paste your fluent.conf? It's hard to find the problem without the full config. File creation is mainly controlled by the time slice format and the buffer configuration:
<buffer>
#type file or memory
path /fluentd/buffer/s3
timekey_wait 1m
timekey 1m
chunk_limit_size 64m
</buffer>
time_slice_format %Y%m%d%H%M
With the above, a new file is created every minute; if within that minute the chunk limit is reached (or the buffer is flushed for any other reason), another file is created with index 1 under the same minute.
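For reference, here is a rough sketch of how those pieces fit together in a single match block; the tag, bucket name and paths are placeholders, not taken from the original question:
<match sflow.**>
  @type s3
  s3_bucket my-log-bucket
  path sflow
  s3_object_key_format %{path}%{time_slice}_%{index}.%{file_extension}
  time_slice_format %Y%m%d%H%M
  <buffer time>
    @type file
    path /fluentd/buffer/s3
    timekey 1m
    timekey_wait 1m
    chunk_limit_size 64m
  </buffer>
</match>
With a setup along these lines, the first object flushed for a minute should get index 0, and any further flush within the same minute should get 1, 2, and so on.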


What is the cause of these google beta transcoding service job validation errors

"failureReason": "Job validation failed: Request field config is
invalid, expected an estimated total output size of at most 400 GB
(current value is 1194622697155 bytes).",
The actual input file was only 8 seconds long. It was created using the Safari MediaRecorder API on macOS.
"failureReason": "Job validation failed: Request field
config.editList[0].startTimeOffset is 0s, expected start time less
than the minimum duration of all inputs for this atom (0s).",
The actual input file was 8 seconds long. It was created using the desktop Chrome MediaRecorder API, with mimeType "webm; codecs=vp9", on macOS.
Note that Stack Overflow wouldn't allow me to include the tag google-cloud-transcoder suggested by "Getting Support" https://cloud.google.com/transcoder/docs/getting-support?hl=sr
Like Faniel mentioned, your first issue is that your video was less than 10 seconds, which is below the API's 10-second minimum.
Your second issue is that the "Duration" information is likely missing from the EBML headers of your .webm file. When you record with MediaRecorder, the duration of your video is set to N/A in the file headers because it is not known in advance. This means the Transcoder API will treat the length of your video as Infinity / 0. Some consider this a bug in Chromium.
To confirm this is your issue, you can use ts-ebml or ffprobe to inspect the headers of your video. You can also use these tools to repair the headers. Read more about this here and here.
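For example, ffprobe can report the container duration directly (the file name is a placeholder):
ffprobe -v error -show_entries format=duration -of default=noprint_wrappers=1 input.webm
If this prints duration=N/A, the duration metadata is missing and the headers need to be repaired before transcoding.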
You can also try running the Transcoder API against this demo .webm, which has its duration information set correctly.
This Google documentation states that the input file's length must be at least 5 seconds and that it should be stored in Cloud Storage (for example, gs://bucket/inputs/file.mp4). A job validation error can occur when the inputs are not properly packaged and do not contain duration metadata, or contain incorrect duration metadata. When the inputs are not properly packaged, we can explicitly specify startTimeOffset and endTimeOffset in the job config to set the correct duration. If the estimated total output size, computed from the bitrate in the job config and the duration (in seconds) reported by ffprobe, exceeds 400 GB, the job fails validation. We can use the following formula to estimate the output size.
estimatedTotalOutputSizeInBytes = bitrateBps * outputDurationInSec / 8;
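As a quick worked example (the numbers are purely illustrative): an output encoded at 8 Mbps for 60 seconds comes to
estimatedTotalOutputSizeInBytes = 8000000 * 60 / 8 = 60000000 (about 60 MB)
which is far below the 400 GB cap, so an estimate like the 1194622697155 bytes in the error above usually points to a wildly wrong duration rather than a genuinely huge output.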
Thanks for the question and feedback. The Transcoder API currently has a minimum duration of 10 seconds which may be why the job wasn't successful.

Is there a maximum concurrency for AWS s3 multipart uploads?

Referring to the docs, you can specify the number of concurrent connections when pushing large files to Amazon Web Services S3 using the multipart uploader. While it does say the concurrency defaults to 5, it does not specify a maximum, or whether the size of each chunk is derived from the total filesize / concurrency.
I trawled through the source code and the comment is pretty much the same as the docs:
Set the concurrency level to use when uploading parts. This affects
how many parts are uploaded in parallel. You must use a local file as
your data source when using a concurrency greater than 1
So my functional build looks like this (the variables are defined, by the way; this is just condensed for the example):
use Aws\Common\Exception\MultipartUploadException;
use Aws\S3\Model\MultipartUpload\UploadBuilder;

$uploader = UploadBuilder::newInstance()
    ->setClient($client)
    ->setSource($file)
    ->setBucket($bucket)
    ->setKey($file)
    ->setConcurrency(30)
    ->setOption('CacheControl', 'max-age=3600')
    ->build();
Works great, except a 200 MB file takes 9 minutes to upload... with 30 concurrent connections? That seemed suspicious, so I upped concurrency to 100 and the upload time was 8.5 minutes. Such a small difference could just be the connection and not the code.
So my question is whether there's a concurrency maximum, what it is, and whether you can specify the size of the chunks or the chunk size is calculated automatically. My goal is to get a 500 MB file to transfer to AWS S3 within 5 minutes, so I need to optimize that if possible.
Looking through the source code, it looks like 10,000 is the maximum number of concurrent connections. There is no automatic calculation of chunk sizes based on concurrent connections, but you could set those yourself if needed for whatever reason.
I set the chunk size to 10 MB with 20 concurrent connections and it seems to work fine. On a real server I got a 100 MB file to transfer in 23 seconds, much better than the 3 1/2 to 4 minutes it was getting in the dev environments. Interesting, but those are the stats, should anyone else come across this same issue.
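For a rough sense of scale (back-of-the-envelope only):
100 MB / 23 s ≈ 4.3 MB/s ≈ 35 Mbps
500 MB / 4.3 MB/s ≈ 115 s (just under 2 minutes)
so at that rate the 500 MB in 5 minutes target is comfortably within reach.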
This is what my builder ended up being:
$uploader = UploadBuilder::newInstance()
    ->setClient($client)
    ->setSource($file)
    ->setBucket($bucket)
    ->setKey($file)
    ->setConcurrency(20)
    ->setMinPartSize(10485760)
    ->setOption('CacheControl', 'max-age=3600')
    ->build();
I may need to up that max cache, but as of yet this works acceptably. The key was moving the processing code to the server and not relying on the weakness of my dev environments, no matter how powerful the machine or how high-class the internet connection.
We can abort the upload at any point, halting all operations. We can also set the concurrency and the minimum part size:
$uploader = UploadBuilder::newInstance()
    ->setClient($client)
    ->setSource('/path/to/large/file.mov')
    ->setBucket('mybucket')
    ->setKey('my-object-key')
    ->setConcurrency(3)
    ->setMinPartSize(10485760)
    ->setOption('CacheControl', 'max-age=3600')
    ->build();

try {
    $uploader->upload();
    echo "Upload complete.\n";
} catch (MultipartUploadException $e) {
    $uploader->abort();
    echo "Upload failed.\n";
}

Flume HDFS Sink generates lots of tiny files on HDFS

I have a toy setup sending log4j messages to HDFS using Flume. I'm not able to configure the HDFS sink to avoid many small files. I thought I could configure the HDFS sink to create a new file every time the file size reaches 10 MB, but it is still creating files of around 1.5 KB.
Here is my current flume config:
a1.sources=o1
a1.sinks=i1
a1.channels=c1
#source configuration
a1.sources.o1.type=avro
a1.sources.o1.bind=0.0.0.0
a1.sources.o1.port=41414
#sink config
a1.sinks.i1.type=hdfs
a1.sinks.i1.hdfs.path=hdfs://localhost:8020/user/myName/flume/events
#never roll-based on time
a1.sinks.i1.hdfs.rollInterval=0
#10MB=10485760
a1.sinks.il.hdfs.rollSize=10485760
#never roll base on number of events
a1.sinks.il.hdfs.rollCount=0
#channel config
a1.channels.c1.type=memory
a1.channels.c1.capacity=1000
a1.channels.c1.transactionCapacity=100
a1.sources.o1.channels=c1
a1.sinks.i1.channel=c1
It is a typo in your conf.
#sink config
a1.sinks.i1.type=hdfs
a1.sinks.i1.hdfs.path=hdfs://localhost:8020/user/myName/flume/events
#never roll-based on time
a1.sinks.i1.hdfs.rollInterval=0
#10MB=10485760
a1.sinks.il.hdfs.rollSize=10485760
#never roll base on number of events
a1.sinks.il.hdfs.rollCount=0
In the 'rollSize' and 'rollCount' lines, you wrote il instead of i1.
Try running with DEBUG logging, and you will see something like:
[SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.sink.hdfs.BucketWriter.shouldRotate:465) - rolling: rollSize: 1024, bytes: 1024
Because of the il typo, the default rollSize value of 1024 is being used.
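For clarity, the corrected lines should read:
a1.sinks.i1.hdfs.rollSize=10485760
a1.sinks.i1.hdfs.rollCount=0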
HDFS Sink has a property hdfs.batchSize (default 100) which describes "number of events written to file before it is flushed to HDFS". I think that's your problem here.
Consider also checking all the other HDFS Sink properties.
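If batching turns out to be the issue, the property is set like any other sink property; the value here is only illustrative:
a1.sinks.i1.hdfs.batchSize=1000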
This can possibly happen because of the memory channel and its capacity. I guess it's dumping data to HDFS as soon as its capacity becomes full. Did you try using a file channel instead of memory?
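A minimal file channel configuration would look roughly like this; the directories are placeholders:
a1.channels.c1.type=file
a1.channels.c1.checkpointDir=/path/to/flume/checkpoint
a1.channels.c1.dataDirs=/path/to/flume/data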

How to handle FTE queued transfers

I have an FTE monitor with '*.txt' as the trigger condition. Whenever a text file lands at the source, FTE transfers the file to the destination, but when 10 files land at the source at the same time, FTE triggers 10 transfer requests simultaneously and all the transfers get queued and stuck.
Please suggest how to handle this scenario.
OK, I have just tested this case:
I want to transfer four *.xml files from a directory right when they appear in that directory. So I have the monitor set to *.xml and the transfer pattern set to *.xml (see the commands below).
Created with following commands:
fteCreateTransfer -sa AGENT1 -sm QM.FTE -da AGENT2 -dm QM.FTE -dd c:\\workspace\\FTE_tests\\OUT -de overwrite -sd delete -gt /var/IBM/WMQFTE/config/QM.FTE/FTE_TEST_TRANSFER.xml c:\\workspace\\FTE_tests\\IN\\*.xml
fteCreateMonitor -ma AGENT1 -mn FTE_TEST_TRANSFER -md c:\\workspace\\FTE_tests\\IN -mt /var/IBM/WMQFTE/config/TQM.FTE/FTE_TEST_TRANSFER.xml -tr match,*.xml
I got three different results depending on configuration changes:
1) Commands just as they are above, default agent.properties:
4 transfers appeared in the transfer log
all 4 transfers tried to transfer all four XML files
3 of them finished with partial success because the agent couldn't delete the source files
1 finished with success, transferring all files and deleting all source files
Well, with transfer type File to File the final state is in fact OK: four files in the destination directory, because the previous files are overwritten. But with File to Queue I got 16 messages in the destination queue.
2) fteCreateMonitor command modified with the parameter "-bs 100", default agent.properties:
only one transfer appeared in the transfer log
the transfer finished with a partial success result
it tried to transfer 16 files (each XML four times)
the agent was not able to delete any file, so the source files remained in the source directory
So in sum I got the same total number of files transferred (16) as in the first result, and the source files were not even deleted.
3) Commands just as they are above, agent.properties modified with the parameter "monitorMaxResourcesInPoll=1":
only one transfer appeared in the transfer log
the transfer finished with a success result
it tried to transfer four files and succeeded
the agent was able to delete all source files
So I was able to get the expected result only with this setting. But I am still not sure about the appropriateness of setting the monitorMaxResourcesInPoll parameter to "1".
Therefore for me the answer is: add
monitorMaxResourcesInPoll=1
to agent.properties. But this is in collision with other answers posted here, so I am a little bit confused now.
tested on version 7.0.4.4
Check the box that says "Batch together the file transfers when multiple trigger files are found in one poll interval" (screen three).
Make sure that you set the maxFilesForTransfer in the agent.properties file to a value that is large enough for you, but be careful as this will affect all transfers.
You can also set monitorMaxResourcesInPoll=1 in the agent.properties file. I don't recommend this for 2 reasons: 1) it will affect all monitors 2) it may make it so that you can never catch up on all the files you have to transfer depending on your volume and poll interval.
Set your "Batch together the file transfers..." to a value more than 10:
Max Batch Size = 100
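On the command line this corresponds to the -bs (batch size) option of fteCreateMonitor used in the answer above; adapted to the *.txt trigger from the question, it would look roughly like this (the agent name, monitor name and paths are placeholders):
fteCreateMonitor -ma AGENT1 -mn TXT_MONITOR -md /path/to/source -mt /path/to/transfer_template.xml -tr match,*.txt -bs 100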

Limiting the lifetime of a file in Python

Hello there,
I'm looking for a way to limit the lifetime of a file in Python, that is, to make a file which will be automatically deleted 5 minutes after its creation.
Problem:
I have a Django based webpage which has a service that generates plots (from user submitted input data) which are showed on the web page as .png images. The images get stored on the disk upon creation.
Image files are created on a per-session basis, should only be available for a limited time after the user has seen them, and should be deleted 5 minutes after they have been created.
Possible solutions:
I've looked at Python's tempfile, but that is not what I need, because the user should be able to return to the page containing the image without waiting for it to be generated again. In other words, it shouldn't be destroyed as soon as it is closed.
The other way that comes to mind is to call some sort of external bash script which would delete files older than 5 minutes.
Does anybody know a preferred way of doing this?
Ideas can also include changing the logic of showing/generating the image files.
You should write a Django custom management command to delete old files that you can then call from cron.
If you want no files older than 5 minutes, then you need to call it every 5 minutes, of course. And yes, it would run unnecessarily when there are no users, but that shouldn't worry you too much.
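A minimal sketch of such a command, assuming a hypothetical plots directory and a command named delete_old_plots:
# myapp/management/commands/delete_old_plots.py
import os
import time
from django.core.management.base import BaseCommand

PLOT_DIR = "/path/to/plots"  # hypothetical location of the generated .png images
MAX_AGE_SECONDS = 5 * 60

class Command(BaseCommand):
    help = "Delete plot images older than five minutes"

    def handle(self, *args, **options):
        now = time.time()
        for name in os.listdir(PLOT_DIR):
            path = os.path.join(PLOT_DIR, name)
            # only remove regular files older than the cutoff
            if os.path.isfile(path) and now - os.path.getmtime(path) > MAX_AGE_SECONDS:
                os.remove(path)
                self.stdout.write("Removed %s" % path)
A crontab entry such as */5 * * * * /path/to/venv/bin/python /path/to/project/manage.py delete_old_plots then runs it every 5 minutes.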
OK, that might be a good approach, I guess...
You can write a script that checks your directory, deletes outdated files, and picks the oldest of the remaining files. Calculate how much time has passed since that file was created, work out the remaining time until it should be deleted, and then call the sleep function with that remaining time. When the sleep ends and another loop begins, there will be (at least) one file to delete. If there are no files in the directory, set the sleep time to 5 minutes.
That way you ensure that each file is deleted roughly 5 minutes after creation, but when lots of files are created simultaneously the sleep time shrinks and your function starts checking the files more and more often. To avoid that, you can add a small latency to the sleep before starting another loop; for example, if the oldest file is 4 minutes old, you can set the sleep to 60+30 seconds (adding 30 seconds to all time calculations).
An example:
from datetime import datetime
import time
import os

WATCH_DIR = '/path/to/your/directory'

def clearDirectory():
    while True:
        _time_list = []
        _now = time.mktime(datetime.now().timetuple())
        for _name in os.listdir(WATCH_DIR):
            _f = os.path.join(WATCH_DIR, _name)  # os.listdir returns bare names, so build the full path
            if os.path.isfile(_f):
                _f_time = os.path.getmtime(_f)  # file creation/modification time
                if _now - _f_time >= 300:
                    os.remove(_f)  # delete outdated file (older than 5 minutes)
                else:
                    _time_list.append(_f_time)  # remember files that are still too young
        # after checking all files, sleep until the oldest remaining file turns 5 minutes old;
        # if there are no files left, sleep the full 300 seconds
        _sleep_time = (300 - (_now - min(_time_list))) if _time_list else 300
        time.sleep(_sleep_time)
But as I said, if files are created often, it is better to add a latency to the sleep time:
time.sleep(_sleep_time + 30) # sleep 30 seconds more so some other files might be outdated during that time too...
Also, it is better to read the documentation of the getmtime function for details.