I've tried to use Spark AQE to dynamically coalesce shuffle partitions before writing. By default, Spark creates too many small files. However, the AQE feature claims that enabling it will optimize this and merge small files into bigger ones. This is critical for AWS S3 users like me, because having too many small files causes network congestion when reading them later.
Here is my spark configuration:
[('spark.executor.extraJavaOptions', '-XX:+UseG1GC'),
('spark.executor.id', 'driver'),
('spark.driver.extraJavaOptions', '-XX:+UseG1GC'),
('spark.driver.memory', '16g'),
('spark.sql.adaptive.enabled', 'true'),
('spark.app.name', 'pyspark-shell'),
('spark.sql.adaptive.coalescePartitions.minPartitionNum', '5'),
('spark.app.startTime', '1614929855179'),
('spark.sql.adaptive.coalescePartitions.enabled', 'true'),
('spark.driver.port', '34447'),
('spark.executor.memory', '16g'),
('spark.driver.host', '2b7345ffcf3e'),
('spark.rdd.compress', 'true'),
('spark.serializer.objectStreamReset', '100'),
('spark.master', 'local[*]'),
('spark.submit.pyFiles', ''),
('spark.submit.deployMode', 'client'),
('spark.app.id', 'local-1614929856024'),
('spark.ui.showConsoleProgress', 'true')]
The required parameters for AQE are all enabled, and I also see AdaptiveSparkPlan isFinalPlan=true in the execution plan. When I run a small task (read a CSV, do some calculations, do a join and write to Parquet), it still generates too many small files in the Parquet folder. Am I missing something, or is this feature not doing what it promised?
I am trying to do data quality monitoring for a Batch transform job on a very simple .csv dataset with five inputs + one target variable. Both data capture configuration and monitoring job ended without error. However, the monitoring job reported a violation on data columns (five instead of six, the baseline is a similar dataset to the input). Any clues about that?
I successfully enabled data capture for the batch transform job using this sample code:
transformer = Transformer(
model_name=model_name,
instance_count=1,
instance_type="ml.m5.large",
accept="text/csv",
output_path=output_path,
sagemaker_session=pipeline_session
)
transformer.transform(
data=input_path,
data_type="S3Prefix",
content_type="text/csv",
batch_data_capture_config=BatchDataCaptureConfig(
destination_s3_uri= f"s3://{default_bucket}/{s3_prefix_destination}",
generate_inference_id=True,
),
split_type="Line"
)
It correctly produces two folders, input/ and output/, inside my S3 destination path. For the data quality monitoring job I used this:
data_quality_model_monitor = DefaultModelMonitor(
role=role,
instance_count=1,
instance_type="ml.t3.medium",
volume_size_in_gb=50,
max_runtime_in_seconds=3600,
)
schedule = data_quality_model_monitor.create_monitoring_schedule(
monitor_schedule_name="schedule",
batch_transform_input=BatchTransformInput(
data_captured_destination_s3_uri=f"s3://{default_bucket}/{s3_prefix_destination}",
destination="/opt/ml/processing/input",
dataset_format=MonitoringDatasetFormat.csv(header=False),
),
output_s3_uri=f's3://{default_bucket}/{s3_prefix_output}',
statistics=f"s3://{default_bucket}/{s3_prefix_statistics}",
constraints=f"s3://{default_bucket}/{s3_prefix_constraints}",
schedule_cron_expression=CronExpressionGenerator.hourly(),
enable_cloudwatch_metrics=True,
)
It reports a violation on data columns (five instead of six). It seems that the monitoring job analyzes only the manifest file contained in the folder input/, missing the files inside output/. Probably there is an option to enable output/ control, but I did not find it.
I'm using Kinesis Data Analytics on Flink to do stream processing.
The use case I'm working on is to read records from a single Kinesis stream and, after some transformations, write to multiple S3 buckets. One source record might end up in multiple S3 buckets. We need to write to multiple buckets because the source record contains a lot of information that needs to be split across them.
I tried achieving this using multiple sinks.
private static <T> SinkFunction<T> createS3SinkFromStaticConfig(String path, Class<T> type) {
OutputFileConfig config = OutputFileConfig
.builder()
.withPartSuffix(".snappy.parquet")
.build();
final StreamingFileSink<T> sink = StreamingFileSink
.forBulkFormat(new Path(s3SinkPath + "/" + path), createParquetWriter(type))
.withBucketAssigner(new S3BucketAssigner<T>())
.withOutputFileConfig(config)
.withRollingPolicy(new RollingPolicy<T>(DEFAULT_MAX_PART_SIZE, DEFAULT_ROLLOVER_INTERVAL))
.build();
return sink;
}
public static void main(String[] args) throws Exception {
DataStream<PIData> input = createSourceFromStaticConfig(env)
.map(new JsonToSourceDataMap())
.name("jsonToInputDataTransformation");
input.map(value -> value)
.name("rawData")
.addSink(createS3SinkFromStaticConfig("raw_data", InputData.class))
.name("s3Sink");
input.map(FirstConverter::convertInputData)
.addSink(createS3SinkFromStaticConfig("firstOutput", Output1.class));
input.map(SecondConverter::convertInputData)
.addSink(createS3SinkFromStaticConfig("secondOutput", Output2.class));
input.map(ThirdConverter::convertInputData)
.addSink(createS3SinkFromStaticConfig("thirdOutput", Output3.class));
//and so on; There are around 10 buckets.
}
However, I saw a big performance impact from this: a large CPU spike compared to the same job with just one sink. The scale that I'm looking at is around 100k records per second.
Other notes:
I'm using the bulk format writer since I want to write files in Parquet format. I tried increasing the checkpointing interval from 1 minute to 3 minutes, assuming that writing files to S3 every minute might be causing issues. But this didn't help much.
As I'm new to Flink and stream processing, I'm not sure whether this much performance impact is expected or whether there is something I can do better.
Would using a flatmap operator and then having a single sink be better?
When you had a very simple pipeline with a single source and a single sink, something like this:
source -> map -> sink
then the Flink scheduler was able to optimize the execution, and the entire pipeline ran as a sequence of function calls within a single task -- with no serialization or network overhead. Flink 1.12 can apply this operator chaining optimization to more complex topologies -- perhaps including the one you have now with multiple sinks -- but I don't believe this was possible with Flink 1.11 (which is what KDA is currently based on).
I don't see how using a flatmap would make any difference.
You can probably optimize your serialization/deserialization. See https://flink.apache.org/news/2020/04/15/flink-serialization-tuning-vol-1.html.
When I insert rows into BigQuery using writeTableRows, performance is really bad compared to InsertAllRequest. Clearly, something is not set up correctly.
Use case 1: I wrote a Java program to process 'sample' Twitter stream using Twitter4j. When a tweet comes in I write it to BigQuery using this:
insertAllRequestBuilder.addRow(rowContent);
When I run this program from my Mac, it inserts about 1000 rows per minute directly into BigQuery table. I thought I could do better by running a Dataflow job on the cluster.
Use case 2: When a tweet comes in, I write it to a topic of Google's PubSub. I run this from my Mac which sends about 1000 messages every minute.
I wrote a Dataflow job that reads this topic and writes to BigQuery using BigQueryIO.writeTableRows(). I have an 8-machine Dataproc cluster. I started this job on the master node of this cluster with DataflowRunner. It's unbelievably slow! Like 100 rows every 5 minutes or so. Here's a snippet of the relevant code:
statuses.apply("ToBQRow", ParDo.of(new DoFn<Status, TableRow>() {
@ProcessElement
public void processElement(ProcessContext c) throws Exception {
TableRow row = new TableRow();
Status status = c.element();
row.set("Id", status.getId());
row.set("Text", status.getText());
row.set("RetweetCount", status.getRetweetCount());
row.set("FavoriteCount", status.getFavoriteCount());
row.set("Language", status.getLang());
row.set("ReceivedAt", null);
row.set("UserId", status.getUser().getId());
row.set("CountryCode", status.getPlace().getCountryCode());
row.set("Country", status.getPlace().getCountry());
c.output(row);
}
}))
.apply("WriteTableRows", BigQueryIO.writeTableRows().to(tweetsTable)//
.withSchema(schema)
.withMethod(BigQueryIO.Write.Method.FILE_LOADS)
.withTriggeringFrequency(org.joda.time.Duration.standardMinutes(2))
.withNumFileShards(1000)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));
What am I doing wrong? Should I use a 'SparkRunner'? How do I confirm that it's running on all nodes of my cluster?
With BigQuery you can either:
Stream data in. Low latency, up to 100k rows per second, has a cost.
Batch data in. Way higher latency, incredible throughput, totally free.
That's the difference you are experiencing. If you only want to ingest 1000 rows, batching will be noticeably slower. The same with 10 billion rows will be way faster through batching, and at no cost.
Dataflow/Beam's BigQueryIO.writeTableRows can either stream or batch data in.
With BigQueryIO.Write.Method.FILE_LOADS the pasted code is choosing batch.
I have a two-machine EMR cluster with PySpark installed, reading data from S3. The code is a very simple filter and transform operation using sqlContext.readStream.text to fetch data from the bucket. The bucket is ~10TB large and has around 75k files organized by bucket/year/month/day/hour/* with * representing up to 20 files of 128MB in size. I started the streaming task by providing the bucket s3://bucket_name/dir/ and letting PySpark read all files in it. It has now been almost 2 hours, the job hasn't even started consuming data from S3, and the network traffic as reported by Ganglia is minimal.
I'm scratching my head about why this process is so slow and how I can speed it up, since currently the machines I'm paying for are basically idle.
When I use .status and .lastProgress to track the status I get the following responses respectively:
{'isDataAvailable': False,
'isTriggerActive': True,
'message': 'Getting offsets from FileStreamSource[s3://bucket_name/dir]'}
and
{'durationMs': {'getOffset': 207343, 'triggerExecution': 207343},
'id': '******-****-****-****-*******',
'inputRowsPerSecond': 0.0,
'name': None,
'numInputRows': 0,
'processedRowsPerSecond': 0.0,
'runId': '******-****-****-****-*******',
'sink': {'description': 'FileSink[s3://dest_bucket_name/results/file_name.csv]'},
'sources': [{'description': 'FileStreamSource[s3://bucket_name/dir]',
'endOffset': None,
'inputRowsPerSecond': 0.0,
'numInputRows': 0,
'processedRowsPerSecond': 0.0,
'startOffset': None}],
'stateOperators': [],
'timestamp': '2018-02-19T22:31:13.385Z'}
Any ideas of what could be causing the data consumption to take so long? Is this normal behaviour? Am I doing something wrong? Any tips on how can this process be improved?
Any help is greatly appreciated. Thanks.
Spark checks for files in the source folder and tries to discover partitions by checking whether the sub-folder names match the pattern "column-name=column-value".
Since your data is partitioned by date, the files should be laid out something like this: s3://bucket_name/dir/year=2018/month=02/day=19/hour=08/data-file.
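To make the suggested layout concrete, here is a minimal sketch of how an existing key could be mapped to the "column-name=column-value" form. The function name, the `base_dir` default and the column names (year, month, day, hour) are assumptions taken from the layout described in the question, not part of any Spark or AWS API. It only computes the new key name; actually moving the objects in S3 would be a separate copy step.

```python
from pathlib import PurePosixPath

# Assumed partition columns, in the order they appear in the existing layout
# (dir/year/month/day/hour/file). These names are illustrative.
PARTITION_COLUMNS = ("year", "month", "day", "hour")


def to_hive_style_key(key: str, base_dir: str = "dir") -> str:
    """Rewrite 'dir/2018/02/19/08/file' as 'dir/year=2018/month=02/day=19/hour=08/file'."""
    parts = PurePosixPath(key).parts
    if not parts or parts[0] != base_dir:
        raise ValueError(f"key does not start with {base_dir!r}: {key!r}")

    values, filename = parts[1:-1], parts[-1]
    if len(values) != len(PARTITION_COLUMNS):
        raise ValueError(f"expected {len(PARTITION_COLUMNS)} partition levels in {key!r}")

    # Label each directory level with its column name.
    labelled = [f"{col}={val}" for col, val in zip(PARTITION_COLUMNS, values)]
    return "/".join([base_dir, *labelled, filename])


print(to_hive_style_key("dir/2018/02/19/08/data-file"))
```

Keys that do not have exactly four levels between the base directory and the file name are rejected rather than guessed at.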
Referring to the docs, you can specify the number of concurrent connections when pushing large files to Amazon Web Services S3 using the multipart uploader. While it does say the concurrency defaults to 5, it does not specify a maximum, or whether the size of each chunk is derived from the total file size / concurrency.
I trawled through the source code and the comment is pretty much the same as the docs:
Set the concurrency level to use when uploading parts. This affects
how many parts are uploaded in parallel. You must use a local file as
your data source when using a concurrency greater than 1
So my functional build looks like this (the vars are defined by the way, this is just condensed for example):
use Aws\Common\Exception\MultipartUploadException;
use Aws\S3\Model\MultipartUpload\UploadBuilder;
$uploader = UploadBuilder::newInstance()
->setClient($client)
->setSource($file)
->setBucket($bucket)
->setKey($file)
->setConcurrency(30)
->setOption('CacheControl', 'max-age=3600')
->build();
Works great, except a 200 MB file takes 9 minutes to upload... with 30 concurrent connections? That seemed suspicious to me, so I upped concurrency to 100 and the upload time was 8.5 minutes. Such a small difference could just be down to the connection and not the code.
So my question is whether there's a concurrency maximum, what it is, and whether you can specify the size of the chunks or the chunk size is calculated automatically. My goal is to get a 500 MB file transferred to AWS S3 within 5 minutes, and I'd like to improve on that if possible.
Looking through the source code, it looks like 10,000 is the maximum number of concurrent connections. There is no automatic calculation of chunk sizes based on concurrent connections, but you could set those yourself if needed for whatever reason.
I set the chunk size to 10 MB with 20 concurrent connections and it seems to work fine. On a real server I got a 100 MB file to transfer in 23 seconds. Much better than the 3 1/2 to 4 minutes it was taking in the dev environments. Interesting, but thems the stats, should anyone else come across this same issue.
This is what my builder ended up being:
$uploader = UploadBuilder::newInstance()
->setClient($client)
->setSource($file)
->setBucket($bucket)
->setKey($file)
->setConcurrency(20)
->setMinPartSize(10485760)
->setOption('CacheControl', 'max-age=3600')
->build();
I may need to up that max cache but as of yet this works acceptably. The key was moving the processor code to the server and not relying on the weakness of my dev environments, no matter how powerful the machine is or high class the internet connection is.
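For anyone choosing these numbers, the arithmetic behind a part size and concurrency setting can be sketched as below. This is an illustration in Python rather than part of the PHP SDK: `plan_multipart` is a hypothetical helper, and the "waves" figure assumes every part takes about the same time, which real uploads will not match exactly. The two limits used are S3's own: each part except the last must be at least 5 MiB, and one upload can have at most 10,000 parts.

```python
MIB = 1024 * 1024
S3_MIN_PART_SIZE = 5 * MIB   # S3 minimum size for every part except the last
S3_MAX_PARTS = 10_000        # S3 maximum number of parts in one multipart upload


def plan_multipart(file_size: int, part_size: int, concurrency: int) -> tuple[int, int]:
    """Return (number of parts, number of upload 'waves') for the given settings."""
    if file_size <= 0 or concurrency <= 0:
        raise ValueError("file_size and concurrency must be positive")
    if part_size < S3_MIN_PART_SIZE:
        raise ValueError("part_size is below the S3 minimum of 5 MiB")

    # Ceiling division: the last part may be smaller than part_size.
    parts = (file_size + part_size - 1) // part_size
    if parts > S3_MAX_PARTS:
        raise ValueError("too many parts; increase part_size")

    # Rough model: 'concurrency' parts are in flight at a time.
    waves = (parts + concurrency - 1) // concurrency
    return parts, waves


# The settings from the builder above (10485760-byte parts, concurrency 20)
# applied to a 500 MiB file.
print(plan_multipart(500 * MIB, 10485760, 20))
```

With those settings a 500 MiB file splits into 50 parts, so raising concurrency beyond 50 would have nothing extra to parallelise.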
You can abort the upload at any point, which halts all of its in-progress operations. You can also set the concurrency and the minimum part size.
$uploader = UploadBuilder::newInstance()
->setClient($client)
->setSource('/path/to/large/file.mov')
->setBucket('mybucket')
->setKey('my-object-key')
->setConcurrency(3)
->setMinPartSize(10485760)
->setOption('CacheControl', 'max-age=3600')
->build();
try {
$uploader->upload();
echo "Upload complete.\n";
} catch (MultipartUploadException $e) {
$uploader->abort();
echo "Upload failed.\n";
}