I'm using Kinesis Data Analytics on Flink to do stream processing.
The usecase that I'm working on is to read records from a single Kinesis stream and after some transformations write to multiple S3 buckets. One source record might end up in multiple S3 buckets. We need to write to multiple buckets since the source record contains a lot of information which needs to be split to multiple S3 buckets.
I tried achieving this using multiple sinks.
private static <T> SinkFunction<T> createS3SinkFromStaticConfig(String path, Class<T> type) {
OutputFileConfig config = OutputFileConfig
.builder()
.withPartSuffix(".snappy.parquet")
.build();
final StreamingFileSink<T> sink = StreamingFileSink
.forBulkFormat(new Path(s3SinkPath + "/" + path), createParquetWriter(type))
.withBucketAssigner(new S3BucketAssigner<T>())
.withOutputFileConfig(config)
.withRollingPolicy(new RollingPolicy<T>(DEFAULT_MAX_PART_SIZE, DEFAULT_ROLLOVER_INTERVAL))
.build();
return sink;
}
public static void main(String[] args) throws Exception {
DataStream<PIData> input = createSourceFromStaticConfig(env)
.map(new JsonToSourceDataMap())
.name("jsonToInputDataTransformation");
input.map(value -> value)
.name("rawData")
.addSink(createS3SinkFromStaticConfig("raw_data", InputData.class))
.name("s3Sink");
input.map(FirstConverter::convertInputData)
.addSink(createS3SinkFromStaticConfig("firstOutput", Output1.class));
input.map(SecondConverter::convertInputData)
.addSink(createS3SinkFromStaticConfig("secondOutput", Output2.class));
input.map(ThirdConverter::convertInputData)
.addSink(createS3SinkFromStaticConfig("thirdOutput", Output3.class));
//and so on; There are around 10 buckets.
}
However, I saw a big performance impact due to this. I saw a big CPU spike due to this (as compared to one with just one sink). The scale that I'm looking at is around 100k records per second.
Other notes:
I'm using bulk format writer since I want to write files in parquet format. I tried increasing the checkpointing interval from 1-minute to 3-minutes assuming writing files to s3 every minute might be causing issues. But this didn't help much.
As I'm new to flink and stream processing, I'm not sure if this much performance impact is expected or is there something I can do better?
Would using a flatmap operator and then having a single sink be better?
When you had a very simple pipeline with a single source and a single sink, something like this:
source -> map -> sink
then the Flink scheduler was able to optimize the execution, and the entire pipeline ran as a sequence of function calls within a single task -- with no serialization or network overhead. Flink 1.12 can apply this operator chaining optimization to more complex topologies -- perhaps including the one you have now with multiple sinks -- but I don't believe this was possible with Flink 1.11 (which is what KDA is currently based on).
I don't see how using a flatmap would make any difference.
You can probably optimize your serialization/deserialization. See https://flink.apache.org/news/2020/04/15/flink-serialization-tuning-vol-1.html.
Related
I am trying to get the object list from a bucket I have that has probably billions of objects in it, in hundreds of folders, each with xx millions of objects.
The question I have is, why does using s3 resource.objects.filter take so long? And is there a way to use it and still get the same result as using client?
test results below
s3 = boto3.resource("s3")
s3bucket = s3.Bucket("myBucket")
s3objects = s3bucket.objects.filter(Prefix="folder1/hello09_") # Also tried adding page_size(1000) no change
# folder1/ has xxx million objects
# folder1/hello09_ has xx million objects in it (ex. s3://myBucket/folder1/hello09_file0000001)
# takes about 21 minutes
tmp = [s3obj.key for s3obj in s3objects]
# also takes about 20 minutes
tmp = []
for s3obj in s3obj_list:
tmp.append(s3obj.key)
# takes about 2 minutes
# wrote a s3client loop that gets 1000 files at a time
# client = boto3.client('s3')
# client.list_objects_v2(Bucket="myBucket",Prefix="folder1/hello09_")
# loop with NextContinuationToken until no more, 1000 objects each
# takes about 31 minutes
# same s3 client loop but this time changed the prefix to be only up to the deliminator
# The extra 10 minutes might be due to appending the results into an array, key by key?
# client = boto3.client('s3')
# client.list_objects_v2(Bucket="myBucket",Prefix="folder1/")
# loop with NextContinuationToken until no more, 1000 objects each
My best guess is that bucket.objects.filter queries ALL objects up to the last deliminator(folder1/) and THEN filters with whatever is after the last deliminator(hello09_). Instead of straight up querying the whole filter path(folder1/hello09_) like client does.
I thought resource was only querying 1 file at a time or something when used like an array or looped over instead of batch grabbing 1000 files from it (can you do that?). But I had a similar situation where I had dozens of subPrefixes each with 1 file in it, and resource.bucket.objects.filter performed the same as client.list_objects_v2.
Is this a bug in the boto3 filter code, or a feature that can be circumvented so resource can still be used with the same performance as client.
UPDATE:
I didn't know I could get such detailed logs, thanks Anon Coward.
So I guess I was wrong, it was sending the correct filtered request.
No idea how to read the log, but the various errors, retry requests, region redirectors, dns checks, etc are there. I don't think that those should cause the extra 18 minutes. But I have no idea anyways. Maybe some sort of background overhead work or prep of data so that it can be consumed? Vs Client where there are no errors.
Also no option to use S3 Inventory Report at the moment. Being too slow and not being realtime being one of problems.
So does that mean the only options are to use client? As S3 resource has some sort of internal efficiency or overhead problems when dealing with large number of objects, since it seems to work fine with small number of objects (same speed as client)?
I was hoping some sort of settings change could make resource as performant, but if its deep internal then maybe no go. Which is a shame considering how easy it is use Resource and not needing to manage multiple calls with continuationTokens.
botocore.endpoint [DEBUG] Sending http request: <AWSPreparedRequest stream_output=False, method=GET,
url=https://myBucket.s3.xxxx.amazonaws.com/?prefix=myFolder1%2Fhello09_&encoding-type=url,
headers={'User-Agent': b'Boto3/1.20.24 Python/3.8.0 Windows Botocore/1.27.59 Resource', 'X-Amz-Date': b'xxx',
'X-Amz-Content-SHA256': b'xxx', 'Authorization': b'xxx', 'amz-sdk-invocation-id': b'xxx', 'amz-sdk-request': b'attempt=1'}>
...
botocore.parsers [DEBUG] Response headers: ...
...
[DEBUG] Event needs-retry.s3.ListObjects: calling handler <botocore.retryhandler.RetryHandler object at 0x000000????>
botocore.retryhandler [DEBUG] No retry needed.
UPDATE2:
Just for some closure I looked really closely at the http requests of both resource and client and there are differences. Resource does not specify which list, and uses markers instead of continuation Tokens, maybe S3 calculating where to continue from contributes to the slowness? Maybe the list method contributes to the slowness? (especially if list_objectsv1 is slower than v2 or harder to consume). Then there is what AnonCoward said, that resource inherently is slower because it performs more API calls and creates millions of (unneeded?) objects.
# client
url=https://myBucket.s3.xxx.amazonaws.com/
?list-type=2&
prefix=myFolder1%2Fhello_&
continuation-token=xxx
encoding-type=url,
# resource
url=https://myBucket.s3.xxx.amazonaws.com/
?
prefix=myFolder1%2Fhello_&
marker=myFolder1%2Fhello_world001&
encoding-type=url,
I've tried to use Spark AQE for dynamically coalescing shuffle partitions before writing. On default, spark creates too many files with small sizes. However, AQE feature claims that enabling it will optimize this and merge small files into bigger ones. This is critical for aws s3 users like me because having too many small files causes network congestion when trying to read the small files later.
Here is my spark configuration:
[('spark.executor.extraJavaOptions', '-XX:+UseG1GC'),
('spark.executor.id', 'driver'),
('spark.driver.extraJavaOptions', '-XX:+UseG1GC'),
('spark.driver.memory', '16g'),
('spark.sql.adaptive.enabled', 'true'),
('spark.app.name', 'pyspark-shell'),
('spark.sql.adaptive.coalescePartitions.minPartitionNum', '5'),
('spark.app.startTime', '1614929855179'),
('spark.sql.adaptive.coalescePartitions.enabled', 'true'),
('spark.driver.port', '34447'),
('spark.executor.memory', '16g'),
('spark.driver.host', '2b7345ffcf3e'),
('spark.rdd.compress', 'true'),
('spark.serializer.objectStreamReset', '100'),
('spark.master', 'local[*]'),
('spark.submit.pyFiles', ''),
('spark.submit.deployMode', 'client'),
('spark.app.id', 'local-1614929856024'),
('spark.ui.showConsoleProgress', 'true')]
The required parameters for AQE are all enabled, I also see AdaptiveSparkPlan isFinalPlan=true in the execution plan. When I run a small task (read a csv, do some calculations, do a join operation and write into parquet), it still generates too many small sized files in the parquet folder. Am i missing something or this feature is not doing what it promised?
When I'm inserting rows on BigQuery using writeTableRows, performance is really bad compared to InsertAllRequest. Clearly, something is not setup correctly.
Use case 1: I wrote a Java program to process 'sample' Twitter stream using Twitter4j. When a tweet comes in I write it to BigQuery using this:
insertAllRequestBuilder.addRow(rowContent);
When I run this program from my Mac, it inserts about 1000 rows per minute directly into BigQuery table. I thought I could do better by running a Dataflow job on the cluster.
Use case 2: When a tweet comes in, I write it to a topic of Google's PubSub. I run this from my Mac which sends about 1000 messages every minute.
I wrote a Dataflow job that reads this topic and writes to BigQuery using BigQueryIO.writeTableRows(). I have a 8 machine Dataproc cluster. I started this job on the master node of this cluster with DataflowRunner. It's unbelievably slow! Like 100 rows every 5 minutes or so. Here's a snippet of the relevant code:
statuses.apply("ToBQRow", ParDo.of(new DoFn<Status, TableRow>() {
#ProcessElement
public void processElement(ProcessContext c) throws Exception {
TableRow row = new TableRow();
Status status = c.element();
row.set("Id", status.getId());
row.set("Text", status.getText());
row.set("RetweetCount", status.getRetweetCount());
row.set("FavoriteCount", status.getFavoriteCount());
row.set("Language", status.getLang());
row.set("ReceivedAt", null);
row.set("UserId", status.getUser().getId());
row.set("CountryCode", status.getPlace().getCountryCode());
row.set("Country", status.getPlace().getCountry());
c.output(row);
}
}))
.apply("WriteTableRows", BigQueryIO.writeTableRows().to(tweetsTable)//
.withSchema(schema)
.withMethod(BigQueryIO.Write.Method.FILE_LOADS)
.withTriggeringFrequency(org.joda.time.Duration.standardMinutes(2))
.withNumFileShards(1000)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));
What am I doing wrong? Should I use a 'SparkRunner'? How do I confirm that it's running on all nodes of my cluster?
With BigQuery you can either:
Stream data in. Low latency, up to 100k rows per second, has a cost.
Batch data in. Way higher latency, incredible throughput, totally free.
That's the difference you are experiencing. If you only want to ingest 1000 rows, batching will be noticeably slower. The same with 10 billion rows will be way faster thru batching, and at no cost.
Dataflow/Bem's BigQueryIO.writeTableRows can either stream or batch data in.
With BigQueryIO.Write.Method.FILE_LOADS the pasted code is choosing batch.
I have 24GB folderin my local file system. My task is to move that folder to HDFS. Two ways I did it.
1) hdfs dfs -copyFromLocal /home/data/ /home/
This took around 15mins to complete.
2) Using Flume.
Here is my agent
spool_dir.sources = src-1
spool_dir.channels = channel-1
spool_dir.sinks = sink_to_hdfs
# source
spool_dir.sources.src-1.type = spooldir
spool_dir.sources.src-1.channels = channel-1
spool_dir.sources.src-1.spoolDir = /home/data/
spool_dir.sources.src-1.fileHeader = false
# HDFS sinks
spool_dir.sinks.sink_to_hdfs.type = hdfs
spool_dir.sinks.sink_to_hdfs.hdfs.fileType = DataStream
spool_dir.sinks.sink_to_hdfs.hdfs.path = hdfs://192.168.1.71/home/user/flumepush
spool_dir.sinks.sink_to_hdfs.hdfs.filePrefix = customevent
spool_dir.sinks.sink_to_hdfs.hdfs.fileSuffix = .log
spool_dir.sinks.sink_to_hdfs.hdfs.batchSize = 1000
spool_dir.channels.channel-1.type = file
spool_dir.channels.channel-1.checkpointDir = /home/user/spool_dir_checkpoint
spool_dir.channels.channel-1.dataDirs = /home/user/spool_dir_data
spool_dir.sources.src-1.channels = channel-1
spool_dir.sinks.sink_to_hdfs.channel = channel-1
This step took almost an hour to push data to HDFS.
As per my knowledge Flume is distributed, so should not it be that Flume should load data faster than copyFromLocal command.
If you're looking simple at read and write operations flume is going to be at least 2x slower with your configuration as you're using a file channel - every file read from disk is encapsulated into a flume event (in memory) and then serialized back down to disk via the file channel. The sink then reads the event back from the file channel (disk) before pushing it up to hdfs.
You also haven't set a blob deserializer on your spoolDir source (so it's reading one line at a time from your source files, wrapping in a flume Event and then writing to the file channel), so paired with the HDFS Sink default rollXXX values, you'll be getting a file in hdfs per 10 events / 30s / 1k rather than a file per input file that you'd get with copyFromLocal.
All of these factors add up to give you slower performance. If you want to get a more comparable performance, you should use the BlobDeserializer on the spoolDir source, coupled with a memory channel (but understand that a memory channel doesn't guarantee delivery of an event in the event of the JRE being prematurely terminated.
Apache Flume is not intended for moving or copying folders from local file system to HDFS. Flume is meant for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store. (Reference: Flume User Guide)
If you want to move large files or directories, you should use hdfs dfs -copyFromLocal as you have already mentioned.
Referring to the docs, you can specify the number of concurrent connection when pushing large files to Amazon Web Services s3 using the multipart uploader. While it does say the concurrency defaults to 5, it does not specify a maximum, or whether or not the size of each chunk is derived from the total filesize / concurrency.
I trolled the source code and the comment is pretty much the same as the docs:
Set the concurrency level to use when uploading parts. This affects
how many parts are uploaded in parallel. You must use a local file as
your data source when using a concurrency greater than 1
So my functional build looks like this (the vars are defined by the way, this is just condensed for example):
use Aws\Common\Exception\MultipartUploadException;
use Aws\S3\Model\MultipartUpload\UploadBuilder;
$uploader = UploadBuilder::newInstance()
->setClient($client)
->setSource($file)
->setBucket($bucket)
->setKey($file)
->setConcurrency(30)
->setOption('CacheControl', 'max-age=3600')
->build();
Works great except a 200mb file takes 9 minutes to upload... with 30 concurrent connections? Seems suspicious to me, so I upped concurrency to 100 and the upload time was 8.5 minutes. Such a small difference could just be connection and not code.
So my question is whether or not there's a concurrency maximum, what it is, and if you can specify the size of the chunks or if chunk size is automatically calculated. My goal is to try to get a 500mb file to transfer to AWS s3 within 5 minutes, however I have to optimize that if possible.
Looking through the source code, it looks like 10,000 is the maximum concurrent connections. There is no automatic calculations of chunk sizes based on concurrent connections but you could set those yourself if needed for whatever reason.
I set the chunk size to 10 megs, 20 concurrent connections and it seems to work fine. On a real server I got a 100 meg file to transfer in 23 seconds. Much better than the 3 1/2 to 4 minute it was getting in the dev environments. Interesting, but thems the stats, should anyone else come across this same issue.
This is what my builder ended up being:
$uploader = UploadBuilder::newInstance()
->setClient($client)
->setSource($file)
->setBucket($bicket)
->setKey($file)
->setConcurrency(20)
->setMinPartSize(10485760)
->setOption('CacheControl', 'max-age=3600')
->build();
I may need to up that max cache but as of yet this works acceptably. The key was moving the processor code to the server and not relying on the weakness of my dev environments, no matter how powerful the machine is or high class the internet connection is.
We can abort the process during upload and can halt all the operations and abort the upload at any instance. We can set Concurrency and minimum part size.
$uploader = UploadBuilder::newInstance()
->setClient($client)
->setSource('/path/to/large/file.mov')
->setBucket('mybucket')
->setKey('my-object-key')
->setConcurrency(3)
->setMinPartSize(10485760)
->setOption('CacheControl', 'max-age=3600')
->build();
try {
$uploader->upload();
echo "Upload complete.\n";
} catch (MultipartUploadException $e) {
$uploader->abort();
echo "Upload failed.\n";
}