Not able to perform wide HBase table scans - mapreduce

I'm facing a problem due to a BAD table design in HBase: millions of records end up under the same row key (one column family). Up to about 2.5M records per row I was able to run MapReduce jobs with Spark by scanning a single row, but now some of the rows are reaching 5 or 6 million records, and whenever I perform a scan or get, all my region servers go down within a couple of minutes. I'm working with HDP 2.2 and HBase 0.98.4.2.2.
So far I've tried:
confPoints.setInt("hbase.rpc.timeout",6000000)
...
scanPoints.setBatch(1000)
Before creating a new table with a new row key design, I really need to process this data. I'm new to HBase, so some of these suggestions may sound naive, but:
Would increasing the Java heap size help in any way?
Is there a possibility of splitting a row into 2 or more rows?
Can I run a MapReduce over the raw stored data in HDFS without passing through HBase?
Any other idea?
Thanks!
EDITED:
Actually, I think the second option is not feasible, because HBase doesn't let you update records, only delete and re-create them.
EDITED 2:
Each record in a row is about tens of bytes. The problem with having millions of records per row is that when I try to scan such a row, the region servers start to go down one by one after a couple of minutes. Maybe trying to get a row of approximately 512 MB is too big for my cluster configuration: 6 nodes with 8 GB each.
Searching the HBase logs, the only exception I can find is this one:
2015-08-25 15:07:19,722 DEBUG [RS_OPEN_REGION-ip-XXX-XX-XX-XXX:60020-0] handler.OpenRegionHandler: Opened my-hbase-table,20150807.33,1439222912086.e731d603bb5d1f0d593736eab922069c. on ip-XXX-XX-XX-XXX.eu-west-1.compute.internal,60020,1440528949321
2015-08-25 15:07:19,724 INFO [RS_OPEN_REGION-ip-XXX-XX-XX-XXX:60020-1] regionserver.HRegion: Replaying edits from hdfs://ip-XXX-XX-XX-XX2.eu-west-1.compute.internal:8020/apps/hbase/data/data/default/my-hbase-table/3bc481ff534f0907e6b99d5eff1793f5/recovered.edits/0000000000011099011
2015-08-25 15:07:19,725 DEBUG [RS_OPEN_REGION-ip-XXX-XX-XX-XXX:60020-2] zookeeper.ZKAssign: regionserver:60020-0x24f65d7e5df025c, quorum=ip-XXX-XX-XX-XX2.eu-west-1.compute.internal:2181,ip-XXX-XX-XX-XXX.eu-west-1.compute.internal:2181,ip-XXX-XX-XX-XX3.eu-west-1.compute.internal:2181, baseZNode=/hbase-unsecure Transitioned node 4945982779c1cba7b1726e77a45d405a from RS_ZK_REGION_OPENING to RS_ZK_REGION_OPENED
2015-08-25 15:07:19,725 DEBUG [RS_OPEN_REGION-ip-XXX-XX-XX-XXX:60020-2] handler.OpenRegionHandler: Transitioned 4945982779c1cba7b1726e77a45d405a to OPENED in zk on ip-XXX-XX-XX-XXX.eu-west-1.compute.internal,60020,1440528949321
2015-08-25 15:07:19,726 DEBUG [RS_OPEN_REGION-ip-XXX-XX-XX-XXX:60020-2] handler.OpenRegionHandler: Opened my-hbase-table,20150727.33,1438203991635.4945982779c1cba7b1726e77a45d405a. on ip-XXX-XX-XX-XXX.eu-west-1.compute.internal,60020,1440528949321
2015-08-25 15:07:19,733 DEBUG [RS_OPEN_REGION-ip-XXX-XX-XX-XXX:60020-1] zookeeper.ZKAssign: regionserver:60020-0x24f65d7e5df025c, quorum=ip-XXX-XX-XX-XX2.eu-west-1.compute.internal:2181,ip-XXX-XX-XX-XXX.eu-west-1.compute.internal:2181,ip-XXX-XX-XX-XX3.eu-west-1.compute.internal:2181, baseZNode=/hbase-unsecure Attempting to retransition opening state of node 3bc481ff534f0907e6b99d5eff1793f5
2015-08-25 15:07:19,734 DEBUG [RS_OPEN_REGION-ip-XXX-XX-XX-XXX:60020-1] regionserver.HRegion: Applied 0, skipped 1, firstSequenceidInLog=11099011, maxSequenceidInLog=11099011, path=hdfs://ip-XXX-XX-XX-XX2.eu-west-1.compute.internal:8020/apps/hbase/data/data/default/my-hbase-table/3bc481ff534f0907e6b99d5eff1793f5/recovered.edits/0000000000011099011
2015-08-25 15:07:19,734 DEBUG [RS_OPEN_REGION-ip-XXX-XX-XX-XXX:60020-1] regionserver.HRegion: Empty memstore size for the current region my-hbase-table,20150824.33,1440473855617.3bc481ff534f0907e6b99d5eff1793f5.
2015-08-25 15:07:19,737 DEBUG [RS_OPEN_REGION-ip-XXX-XX-XX-XXX:60020-1] regionserver.HRegion: Deleted recovered.edits file=hdfs://ip-XXX-XX-XX-XX2.eu-west-1.compute.internal:8020/apps/hbase/data/data/default/my-hbase-table/3bc481ff534f0907e6b99d5eff1793f5/recovered.edits/0000000000011099011
2015-08-25 15:07:19,759 DEBUG [RS_OPEN_REGION-ip-XXX-XX-XX-XXX:60020-1] wal.HLogUtil: Written region seqId to file:hdfs://ip-XXX-XX-XX-XX2.eu-west-1.compute.internal:8020/apps/hbase/data/data/default/my-hbase-table/3bc481ff534f0907e6b99d5eff1793f5/recovered.edits/11099013_seqid ,newSeqId=11099013 ,maxSeqId=11099010
2015-08-25 15:07:19,761 INFO [RS_OPEN_REGION-ip-XXX-XX-XX-XXX:60020-1] regionserver.HRegion: Onlined 3bc481ff534f0907e6b99d5eff1793f5; next sequenceid=11099013
2015-08-25 15:07:19,764 DEBUG [RS_OPEN_REGION-ip-XXX-XX-XX-XXX:60020-1] zookeeper.ZKAssign: regionserver:60020-0x24f65d7e5df025c, quorum=ip-XXX-XX-XX-XX2.eu-west-1.compute.internal:2181,ip-XXX-XX-XX-XXX.eu-west-1.compute.internal:2181,ip-XXX-XX-XX-XX3.eu-west-1.compute.internal:2181, baseZNode=/hbase-unsecure Attempting to retransition opening state of node 3bc481ff534f0907e6b99d5eff1793f5
2015-08-25 15:07:19,773 INFO [PostOpenDeployTasks:3bc481ff534f0907e6b99d5eff1793f5] regionserver.HRegionServer: Post open deploy tasks for region=my-hbase-table,20150824.33,1440473855617.3bc481ff534f0907e6b99d5eff1793f5.
2015-08-25 15:07:19,773 DEBUG [PostOpenDeployTasks:3bc481ff534f0907e6b99d5eff1793f5] regionserver.CompactSplitThread: Small Compaction requested: system; Because: Opening Region; compaction_queue=(0:1), split_queue=0, merge_queue=0
2015-08-25 15:07:19,774 DEBUG [regionserver60020-smallCompactions-1440529300855] compactions.RatioBasedCompactionPolicy: Selecting compaction from 4 store files, 0 compacting, 4 eligible, 10 blocking
2015-08-25 15:07:19,774 DEBUG [regionserver60020-smallCompactions-1440529300855] compactions.ExploringCompactionPolicy: Exploring compaction algorithm has selected 0 files of size 0 starting at candidate #-1 after considering 3 permutations with 0 in ratio
2015-08-25 15:07:19,774 DEBUG [regionserver60020-smallCompactions-1440529300855] compactions.RatioBasedCompactionPolicy: Not compacting files because we only have 0 files ready for compaction. Need 3 to initiate.
2015-08-25 15:07:19,775 DEBUG [regionserver60020-smallCompactions-1440529300855] regionserver.CompactSplitThread: Not compacting my-hbase-table,20150824.33,1440473855617.3bc481ff534f0907e6b99d5eff1793f5. because compaction request was cancelled
2015-08-25 15:07:19,787 INFO [PostOpenDeployTasks:3bc481ff534f0907e6b99d5eff1793f5] catalog.MetaEditor: Updated row my-hbase-table,20150824.33,1440473855617.3bc481ff534f0907e6b99d5eff1793f5. with server=ip-XXX-XX-XX-XXX.eu-west-1.compute.internal,60020,1440528949321
2015-08-25 15:07:19,787 INFO [PostOpenDeployTasks:3bc481ff534f0907e6b99d5eff1793f5] regionserver.HRegionServer: Finished post open deploy task for my-hbase-table,20150824.33,1440473855617.3bc481ff534f0907e6b99d5eff1793f5.
2015-08-25 15:07:19,788 DEBUG [RS_OPEN_REGION-ip-XXX-XX-XX-XXX:60020-1] zookeeper.ZKAssign: regionserver:60020-0x24f65d7e5df025c, quorum=ip-XXX-XX-XX-XX2.eu-west-1.compute.internal:2181,ip-XXX-XX-XX-XXX.eu-west-1.compute.internal:2181,ip-XXX-XX-XX-XX3.eu-west-1.compute.internal:2181, baseZNode=/hbase-unsecure Transitioning 3bc481ff534f0907e6b99d5eff1793f5 from RS_ZK_REGION_OPENING to RS_ZK_REGION_OPENED
2015-08-25 15:07:19,791 DEBUG [RS_OPEN_REGION-ip-XXX-XX-XX-XXX:60020-1] zookeeper.ZKAssign: regionserver:60020-0x24f65d7e5df025c, quorum=ip-XXX-XX-XX-XX2.eu-west-1.compute.internal:2181,ip-XXX-XX-XX-XXX.eu-west-1.compute.internal:2181,ip-XXX-XX-XX-XX3.eu-west-1.compute.internal:2181, baseZNode=/hbase-unsecure Transitioned node 3bc481ff534f0907e6b99d5eff1793f5 from RS_ZK_REGION_OPENING to RS_ZK_REGION_OPENED
2015-08-25 15:07:19,791 DEBUG [RS_OPEN_REGION-ip-XXX-XX-XX-XXX:60020-1] handler.OpenRegionHandler: Transitioned 3bc481ff534f0907e6b99d5eff1793f5 to OPENED in zk on ip-XXX-XX-XX-XXX.eu-west-1.compute.internal,60020,1440528949321
2015-08-25 15:07:19,791 DEBUG [RS_OPEN_REGION-ip-XXX-XX-XX-XXX:60020-1] handler.OpenRegionHandler: Opened my-hbase-table,20150824.33,1440473855617.3bc481ff534f0907e6b99d5eff1793f5. on ip-XXX-XX-XX-XXX.eu-west-1.compute.internal,60020,1440528949321
2015-08-25 15:07:20,344 INFO [B.DefaultRpcServer.handler=3,queue=3,port=60020] regionserver.HRegionServer: Client tried to access missing scanner 1
2015-08-25 15:07:20,346 DEBUG [B.DefaultRpcServer.handler=3,queue=3,port=60020] ipc.RpcServer: B.DefaultRpcServer.handler=3,queue=3,port=60020: callId: 36 service: ClientService methodName: Scan size: 25 connection: 172.31.40.100:42285
org.apache.hadoop.hbase.UnknownScannerException: Name: 1, already closed?
at org.apache.hadoop.hbase.regionserver.HRegionServer.scan(HRegionServer.java:3150)
at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:29994)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2078)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108)
at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:114)
at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:94)
at java.lang.Thread.run(Thread.java:745)
EDITED 3:
I've tried doing range scans within a row with a ColumnRangeFilter, and it works without bringing down any region servers:
scan 'my-table', {STARTROW=>'row-key',ENDROW=>'row-key', FILTER=> ColumnRangeFilter.new(Bytes.toBytes('first_possible_column_prefix'),true,Bytes.toBytes('another_possible_column_prefix'),false)}
The equivalent code in Spark, however, brings the region servers down, same behavior as before:
val scanPoints = new Scan()
scanPoints.setStartRow((queryDate+"."+venueId).getBytes())
scanPoints.setStopRow((queryDate+"."+venueId+"1").getBytes())
scanPoints.setFilter(new ColumnRangeFilter(Bytes.toBytes("first_possible_column_prefix"),true,Bytes.toBytes("another_possible_column_prefix"),false))
...
val confPoints = HBaseConfiguration.create()
confPoints.set(TableInputFormat.INPUT_TABLE, Utils.settings.HBaseWifiVisitorsTableName)
confPoints.set("hbase.zookeeper.quorum", Utils.settings.zQuorum);
confPoints.setInt("zookeeper.session.timeout", 6000000)
confPoints.set("hbase.zookeeper.property.clientPort", Utils.settings.zPort);
confPoints.set("zookeeper.znode.parent",Utils.settings.HBaseZNode)
confPoints.set("hbase.master", Utils.settings.HBaseMaster)
confPoints.set("hbase.mapreduce.scan.column.family","positions")
confPoints.setLong("hbase.client.scanner.max.result.size",2147483648L)
confPoints.setLong("hbase.server.scanner.max.result.size",2147483648L)
confPoints.setInt("hbase.rpc.timeout",6000000)
confPoints.setInt("hbase.client.operation.timeout",6000000)
confPoints.set(TableInputFormat.SCAN, convertScanToString(scanPoints))
...
val rdd = sc.newAPIHadoopRDD(confPoints, classOf[TableInputFormat], classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable], classOf[org.apache.hadoop.hbase.client.Result]).cache()
If I could get this Spark job to work, I could iterate over the whole row in column-range intervals and process it entirely.
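For reference, here is a minimal sketch of that iteration idea, assuming the rest of the configuration and the convertScanToString helper from the snippet above: one bounded scan per column-range interval, so no single request has to return the whole ~512 MB row. The columnPrefixes boundaries and the processChunk function are hypothetical placeholders, not part of the original job:
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{Result, Scan}
import org.apache.hadoop.hbase.filter.ColumnRangeFilter
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes

// Hypothetical column-prefix boundaries that slice the wide row into intervals.
val columnPrefixes = Seq("a", "e", "i", "m", "q", "u", "z~")
val rowKey = queryDate + "." + venueId  // same row key as in the job above

columnPrefixes.sliding(2).foreach { case Seq(from, to) =>
  // One scan per interval: restricted to a single row and a column range.
  val scan = new Scan()
  scan.setStartRow(Bytes.toBytes(rowKey))
  scan.setStopRow(Bytes.toBytes(rowKey + "1"))
  scan.setFilter(new ColumnRangeFilter(Bytes.toBytes(from), true, Bytes.toBytes(to), false))
  scan.setBatch(1000)  // also cap the number of columns returned per Result

  val conf = HBaseConfiguration.create()
  conf.set(TableInputFormat.INPUT_TABLE, Utils.settings.HBaseWifiVisitorsTableName)
  conf.set("hbase.zookeeper.quorum", Utils.settings.zQuorum)
  // ...plus the remaining settings from the snippet above (client port, znode parent, timeouts)
  conf.set(TableInputFormat.SCAN, convertScanToString(scan))

  val chunk = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
    classOf[ImmutableBytesWritable], classOf[Result])
  processChunk(chunk)  // hypothetical per-interval processing
}
Each interval produces a small RDD, so the per-request result size stays bounded while the full row is still covered end to end.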

Related

Process replayd killed by jetsam reason highwater

Recently I added a broadcast upload extension to my host app to implement system-wide screen casting. I found that the broadcast upload extension sometimes stops for an unknown reason. If I debug the broadcast upload extension process in Xcode, it stops without hitting a breakpoint (if the extension is killed for exceeding the 50 MB memory limit, it does stop at a breakpoint, and Xcode points out that it was killed for the 50 MB memory limit). For more information, I read the console log line by line. Finally, I found a significant line:
osanalyticshelper Process replayd [26715] killed by jetsam reason highwater
It looks like the ReplayKit serving process 'replayd' is killed by jetsam, and the reason is 'highwater'. So I searched the internet for more information and found a post:
https://www.jianshu.com/p/30f24bb91222
After reading that, I checked the JetsamEvent report on the device and found that when the 'replayd' process was killed it was occupying 100 MB of memory. Is there a 100 MB memory limit for the 'replayd' process? How can I prevent it from occupying more than 100 MB of memory?
Furthermore, I found that this problem often occurs if the previous extension process was stopped via RPBroadcastSampleHandler's finishBroadcastWithError method. If I stop the extension via the Control Center button, this rarely occurs.
For comparison, when the 'Wemeet' app stops its broadcast upload extension, it rarely causes this problem. I compared the console log when 'Wemeet' stops its broadcast extension with the log when my app stops its broadcast extension, and found that this line is different:
Wemeet:
mediaserverd MEDeviceStreamClient.cpp:429 AQME Default-InputOutput: client stopping: <ZenAQIONodeClient#0x1080f7a40, sid:0x3456e, replayd(30213), 'prim'>; running count now 0
My app:
mediaserverd MEDeviceStreamClient.cpp:429 AQME Default-InputOutput: client stopping: <ZenAQIONodeClient#0x107e869a0, sid:0x3464b, replayd(30232), 'prim'>; running count now 3
As we can see, the 'running count' is different.

GCP Composer - Airflow webserver shutdown constantly

I'm using GCP Composer with the newest image version, composer-1.16.1-airflow-1.10.15.
My webservers are dying from time to time because of some missing cache files:
{cli.py:1050} ERROR - [Errno 2] No such file or directory
Does anybody know how to solve it?
Additional info:
Workers:
Node count: 3, Disk size: 20 GB, Machine type: n1-standard-1
Web server configuration:
Machine type composer-n1-webserver-8 (8 vCPU, 7.6 GB memory)
Configuration overrides:
UPDATE 27.04.2021
I've managed to find the place responsible for killing the webserver:
https://github.com/apache/airflow/blob/4aec433e48dcc66c9c7b74947c499260ab6be9e9/airflow/bin/cli.py#L1032-L1138
GCP Composer uses the Celery Executor underneath - so during the check it tries to read some cache files that have already been removed by the workers?
I've found it! And I'll report the bug to the GCP Composer team.
So if the config webserver.reload_on_plugin_change=True, then the CLI goes into this section:
https://github.com/apache/airflow/blob/4aec433e48dcc66c9c7b74947c499260ab6be9e9/airflow/bin/cli.py#L1118-L1138
# if we should check the directory with the plugin,
if self.reload_on_plugin_change:
    # compare the previous and current contents of the directory
    new_state = self._generate_plugin_state()
    # If changed, wait until its content is fully saved.
    if new_state != self._last_plugin_state:
        self.log.debug(
            '[%d / %d] Plugins folder changed. The gunicorn will be restarted the next time the '
            'plugin directory is checked, if there is no change in it.',
            num_ready_workers_running, num_workers_running
        )
        self._restart_on_next_plugin_check = True
        self._last_plugin_state = new_state
    elif self._restart_on_next_plugin_check:
        self.log.debug(
            '[%d / %d] Starts reloading the gunicorn configuration.',
            num_ready_workers_running, num_workers_running
        )
        self._restart_on_next_plugin_check = False
        self._last_refresh_time = time.time()
        self._reload_gunicorn()

def _generate_plugin_state(self):
    """
    Generate dict of filenames and last modification time of all files in settings.PLUGINS_FOLDER
    directory.
    """
    if not settings.PLUGINS_FOLDER:
        return {}
    all_filenames = []
    for (root, _, filenames) in os.walk(settings.PLUGINS_FOLDER):
        all_filenames.extend(os.path.join(root, f) for f in filenames)
    plugin_state = {f: self._get_file_hash(f) for f in sorted(all_filenames)}
    return plugin_state
It generates the list of files to check by calling os.walk(settings.PLUGINS_FOLDER).
At the same time, gcsfuse decides to delete some of these files, and the error happens: a file is not found.
So disabling webserver.reload_on_plugin_change does the trick - but this option is really convenient, so I'll create a bug ticket for Google.

Too many open files error while reindexing btc blockchain by electrs

I'm using the electrs backend API documentation - https://github.com/Blockstream/electrs - to build a BTC blockchain index engine and local HTTP API on Linux.
During the indexing process an error occurred (I repeated the whole process more than once and the error occurred at the same place each time - by my interpretation, always at the end of the reading process, or moments after it, to be precise):
DEBUG - writing 1167005 rows to RocksDB { path: "./db/mainnet/newindex/txstore" }, flush=Disable
TRACE - parsing 50331648 bytes
TRACE - fetched 101 blocks
DEBUG - writing 1144149 rows to RocksDB { path: "./db/mainnet/newindex/txstore" }, flush=Disable
TRACE - fetched 104 blocks
DEBUG - writing 1221278 rows to RocksDB { path: "./db/mainnet/newindex/txstore" }, flush=Disable
TRACE - skipping block 00000000000000000006160011df713a63b3bedc361b60bad660d5a76434ad59
TRACE - skipping block 00000000000000000005d70314d0dd3a31b0d44a5d83bc6c66a4aedbf8cf6207
TRACE - skipping block 00000000000000000001363a85233b4e4a024c8c8791d9eb0e7942a75be0d4de
TRACE - skipping block 00000000000000000008512cf84870ff39ce347e7c83083615a2731e34a3a956
TRACE - skipping block 0000000000000000000364350efd609c8b140d7b9818f15e19a17df9fc736971
TRACE - skipping block 0000000000000000000cc0a4fd1e418341f5926f0a6a5c5e70e4e190ed4b2251
TRACE - fetched 23 blocks
DEBUG - writing 1159426 rows to RocksDB { path: "./db/mainnet/newindex/txstore" }, flush=Disable
DEBUG - writing 1155416 rows to RocksDB { path: "./db/mainnet/newindex/txstore" }, flush=Disable
DEBUG - writing 232110 rows to RocksDB { path: "./db/mainnet/newindex/txstore" }, flush=Disable
DEBUG - starting full compaction on RocksDB { path: "./db/mainnet/newindex/txstore" }
DEBUG - finished full compaction on RocksDB { path: "./db/mainnet/newindex/txstore" }
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Error { message: "IO error: While open a file for random read: ./db/mainnet/newindex/txstore/000762.sst: Too many open files" }', src/new_index/db.rs:192:44
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Aborted (core dumped)
The size of the db directory (where the indexes are stored) is over 450 GB. My open files limit is 1048576 (checked with ulimit -aH), so the problem is probably not there. I checked the https://github.com/Blockstream/esplora/issues/133 issue and it didn't help. Any ideas what went wrong?
EDIT:
The soft limit (checked with "ulimit -n") was 1024 - that was the source of the problem. Setting it to 65000 solved it. I set it with "ulimit -n 65000", which only lasts for the current terminal session. I changed /etc/security/limits.conf, but the changes were not applied globally.

Fluentd S3 output plugin not recognizing index

I am facing problems while using the S3 output plugin with fluentd.
s3_object_key_format %{path}%{time_slice}_%{index}.%{file_extension}
Using %{index} at the end never resolves to _0, _1; I always end up with log file names like
sflow_.log
while I need sflow_0.log.
Regards,
Can you paste your fluent.conf? It's hard to find the problem without the full info. File creation is mainly controlled by the time slice flag and the buffer configuration:
<buffer>
#type file or memory
path /fluentd/buffer/s3
timekey_wait 1m
timekey 1m
chunk_limit_size 64m
</buffer>
time_slice_format %Y%m%d%H%M
With the above, you create a file every minute, and within that minute, if your buffer limit is reached or for any other reason, another file is created with index 1 under the same minute.

'Premature end of Content-Length' with Spark Application using s3a

I'm writing a Spark-based application which works on a pretty huge dataset stored on S3. It's about 15 TB in size uncompressed. The data is laid out across many small LZO-compressed files, varying from 10-100 MB.
By default the job spawns 130k tasks while reading the dataset and mapping it to the schema.
It then fails after around 70k task completions and ~20 task failures.
Exception:
WARN lzo.LzopInputStream: IOException in getCompressedData; likely LZO corruption.
org.apache.http.ConnectionClosedException: Premature end of Content-Length delimited message body
Looks like the s3 connection is getting closed prematurely.
I have tried nearly 40 different combinations of configurations.
To summarize them: 1 to 3 executors per node; 18 GB to 42 GB --executor-memory; 3 to 5 --executor-cores; 1.8 GB to 4.0 GB spark.yarn.executor.memoryOverhead; both the Kryo and the default Java serializers; spark.memory.storageFraction from 0.5 down to 0.35; the default, 130,000 and 200,000 partitions for the bigger dataset; and spark.sql.shuffle.partitions from the default (200) up to 2001.
And most importantly: the fs.s3a.connection.maximum property from 100 to 2048.
[This seems to be the most relevant property for the exception.]
[In all cases, the driver was set to memory = 51 GB, cores = 12, with the MEMORY_AND_DISK_SER level for caching.]
Nothing worked!
If I run the program with half of the bigger dataset (7.5 TB), it finishes successfully in 1.5 hours.
What could I be doing wrong?
How do I determine the optimal value for fs.s3a.connection.maximum?
Is it possible that the s3 clients are getting GCed?
Any help will be appreciated!
Environment:
AWS EMR 5.7.0, 60 x i2.2xlarge SPOT Instances (16 vCPU, 61GB RAM, 2 x 800GB SSD), Spark 2.1.0
YARN is used as resource manager.
Code:
It's a fairly simple job, doing something like this:
val sl = StorageLevel.MEMORY_AND_DISK_SER
sparkSession.sparkContext.hadoopConfiguration.set("io.compression.codecs", "com.hadoop.compression.lzo.LzopCodec")
sparkSession.sparkContext.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sparkSession.sparkContext.hadoopConfiguration.setInt("fs.s3a.connection.maximum", 1200)
val dataset_1: DataFrame = sparkSession
.read
.format("csv")
.option("delimiter", ",")
.schema(<schema: StructType>)
.csv("s3a://...")
.select("ID") //15 TB
dataset_1.persist(sl)
print(dataset_1.count())
val tmp = dataset_1.groupBy("ID").agg(count("*").alias("count_id"))
val tmp2 = tmp.groupBy("count_id").agg(count("*").alias("count_count_id"))
tmp2.write.csv(…)
dataset_1.unpersist()
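As a side note, here is a minimal sketch (an assumption of mine, not something from the original job) of wiring the same s3a properties in at session-construction time via Spark's spark.hadoop.* prefix, so they are in place before the first S3A filesystem instance is created; the values are just the ones already used above, and the app name is a placeholder:
import org.apache.spark.sql.SparkSession

// Hypothetical alternative wiring: pass the Hadoop/s3a properties through the
// spark.hadoop.* prefix when building the session instead of mutating
// sparkContext.hadoopConfiguration afterwards.
val sparkSession = SparkSession.builder()
  .appName("s3a-lzo-job")  // placeholder name
  .config("spark.hadoop.io.compression.codecs", "com.hadoop.compression.lzo.LzopCodec")
  .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
  .config("spark.hadoop.fs.s3a.connection.maximum", "1200")
  .getOrCreate()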
Full Stacktrace:
17/08/21 20:02:36 INFO compress.CodecPool: Got brand-new decompressor [.lzo]
17/08/21 20:06:18 WARN lzo.LzopInputStream: IOException in getCompressedData; likely LZO corruption.
org.apache.http.ConnectionClosedException: Premature end of Content-Length delimited message body (expected: 79627927; received: 19388396
at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:180)
at org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:137)
at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
at com.amazonaws.event.ProgressInputStream.read(ProgressInputStream.java:151)
at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
at com.amazonaws.services.s3.model.S3ObjectInputStream.read(S3ObjectInputStream.java:155)
at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
at com.amazonaws.event.ProgressInputStream.read(ProgressInputStream.java:151)
at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
at com.amazonaws.util.LengthCheckInputStream.read(LengthCheckInputStream.java:108)
at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
at com.amazonaws.services.s3.model.S3ObjectInputStream.read(S3ObjectInputStream.java:155)
at org.apache.hadoop.fs.s3a.S3AInputStream.read(S3AInputStream.java:160)
at java.io.DataInputStream.read(DataInputStream.java:149)
at com.hadoop.compression.lzo.LzopInputStream.readFully(LzopInputStream.java:73)
at com.hadoop.compression.lzo.LzopInputStream.getCompressedData(LzopInputStream.java:321)
at com.hadoop.compression.lzo.LzopInputStream.decompress(LzopInputStream.java:261)
at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
at java.io.InputStream.read(InputStream.java:101)
at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:180)
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:186)
at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
at org.apache.spark.sql.execution.datasources.HadoopFileLinesReader.hasNext(HadoopFileLinesReader.scala:50)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.next(InMemoryRelation.scala:99)
at org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.next(InMemoryRelation.scala:91)
at org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:364)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1021)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:996)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:936)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:996)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:700)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
EDIT: We have another service which consumes exactly the same logs, and it works just fine. But it uses the old "s3://" scheme and is based on Spark 1.6. I'll try using "s3://" instead of "s3a://".