org.apache.hadoop.hbase.snapshot.CorruptedSnapshotException: Couldn't read snapshot info from:file:/tmp/hbase-cloudera/hbase - mapreduce

Hi, I am trying to export an HBase table snapshot to my local HDFS so that I can run MapReduce on it.
I took a snapshot of the HBase table using the command below:
snapshot 'FundamentalAnalytic','FundamentalAnalyticSnapshot'
When I run the list_snapshots command, I can also see my snapshot.
I then exported the snapshot to my local HDFS directory using the command below, and the copy completed successfully:
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot FundamentalAnalyticSnapshot -copy-to /tmp -mappers 16
Finally, I have to run MapReduce on the snapshot, so below is my driver code to configure that job:
TableMapReduceUtil.initTableSnapshotMapperJob(snapshotName, // snapshot name
        scan,                // Scan instance to control CF and attribute selection
        DefaultMapper.class, // mapper class
        NullWritable.class,  // mapper output key
        Text.class,          // mapper output value
        job,
        true,
        new Path("/home/cloudera/archive/data/default/FundamentalAnalytic/bc95715f67e52547e86b5b096a1f1cb5/cf/d29205a44623434eba2d100a94d8ebfb_SeqId_4_"));
This is where I get the error.
I don't know which path I have to pass as the last argument to the initTableSnapshotMapperJob method.
When I run this code, I get the error below:
org.apache.hadoop.hbase.snapshot.CorruptedSnapshotException: Couldn't read snapshot info from:file:/tmp/hbase-cloudera/hbase/.hbase-snapshot/FundamentalAnalyticSnapshot/.snapshotinfo
at org.apache.hadoop.hbase.snapshot.SnapshotDescriptionUtils.readSnapshotInfo(SnapshotDescriptionUtils.java:294)
at org.apache.hadoop.hbase.snapshot.RestoreSnapshotHelper.copySnapshotForScanner(RestoreSnapshotHelper.java:818)
at org.apache.hadoop.hbase.mapreduce.TableSnapshotInputFormatImpl.setInput(TableSnapshotInputFormatImpl.java:355)
at org.apache.hadoop.hbase.mapreduce.TableSnapshotInputFormat.setInput(TableSnapshotInputFormat.java:204)
at org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.initTableSnapshotMapperJob(TableMapReduceUtil.java:335)
at com.thomsonretuers.hbase.HBaseToFileDriver.run(HBaseToFileDriver.java:128)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at com.thomsonretuers.hbase.HBaseToFileDriver.main(HBaseToFileDriver.java:75)
Caused by: java.io.FileNotFoundException: File file:/tmp/hbase-cloudera/hbase/.hbase-snapshot/FundamentalAnalyticSnapshot/.snapshotinfo does not exist
One quick question about snapshots: if I take a snapshot and run a full table scan against it, will the scan on the snapshot impact region server performance?

I solved it by using the correct paths.
Create the snapshot:
snapshot 'FundamentalAnalytic','FundamentalAnalyticSnapshot'
Export the snapshot to local HDFS:
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot FundamentalAnalyticSnapshot -copy-to /tmp -mappers 16
Driver job configuration to run MapReduce on the HBase snapshot:
String snapshotName = "FundamentalAnalyticSnapshot";
Path restoreDir = new Path("hdfs://quickstart.cloudera:8020/tmp");
String hbaseRootDir = "hdfs://quickstart.cloudera:8020/hbase";
TableMapReduceUtil.initTableSnapshotMapperJob(snapshotName, // snapshot name
        scan,                // Scan instance to control CF and attribute selection
        DefaultMapper.class, // mapper class
        NullWritable.class,  // mapper output key
        Text.class,          // mapper output value
        job,
        true,
        restoreDir);         // temporary restore directory for the snapshot
Also, running MapReduce against an HBase snapshot reads the snapshot's HFiles directly and skips scanning the live table, so there is no impact on the region servers.
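For completeness, here is a minimal driver sketch showing how the variables above can be wired together; the configuration setup, the Scan, and the hbase.rootdir setting are illustrative assumptions that are not part of the original answer, and imports are omitted as in the snippets above.
Configuration conf = HBaseConfiguration.create();
conf.set("hbase.rootdir", hbaseRootDir);     // assumption: the cluster's HBase root directory
Job job = Job.getInstance(conf, "ScanFundamentalAnalyticSnapshot");
job.setJarByClass(HBaseToFileDriver.class);
Scan scan = new Scan();
scan.addFamily(Bytes.toBytes("cf"));         // illustrative: limit the scan to one column family
TableMapReduceUtil.initTableSnapshotMapperJob(snapshotName,
        scan,
        DefaultMapper.class,                 // mapper class
        NullWritable.class,                  // mapper output key
        Text.class,                          // mapper output value
        job,
        true,                                // add HBase dependency jars
        restoreDir);                         // temporary restore directory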

Related

AWS Glue enableUpdateCatalog not creating new partitions after successful job run

I am having a problem where I have set enableUpdateCatalog=True and updateBehavior=LOG to update my Glue table, which has one partition key. After the job runs, no new partitions are added to my Glue Catalog table, even though the data in S3 is separated by the partition key I used. How do I get the job to partition my Glue Catalog table automatically?
Currently I have to manually run boto3 create_partition to create partitions on my Glue Catalog table. I want the job to be able to create partitions automatically as it discovers them in the S3 path, separated by the partition keys.
Code:
additionalOptions = {
    "enableUpdateCatalog": True,
    "updateBehavior": "LOG"}
additionalOptions["partitionKeys"] = ["partition_key0", "partition_key1"]
my_df = glueContext.write_dynamic_frame_from_catalog(frame=last_transform, database=<dst_db_name>,
                                                     table_name=<dst_tbl_name>, transformation_ctx="DataSink1",
                                                     additional_options=additionalOptions)
job.commit()
PS: I am currently using the PARQUET format.
Am I missing any permissions that need to be added to my job so that it can create the partitions itself?
I got it to work by adding useGlueParquetWriter: 'true' to the catalog table's properties. I have also added
format_options = {
'useGlueParquetWriter': True
}
in the write_dynamic_frame.from_catalog calls.
These steps got it to start working :)
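Putting the question's options and this answer's additions together, here is a hedged sketch of the sink call; <dst_db_name> and <dst_tbl_name> stay as placeholders, and the parameter placement follows what is described above rather than a verified API reference:
additionalOptions = {
    "enableUpdateCatalog": True,             # let the sink add new partitions to the Catalog table
    "updateBehavior": "LOG",
    "partitionKeys": ["partition_key0", "partition_key1"],
}
format_options = {
    "useGlueParquetWriter": True,            # per the answer above; also set as a property on the Catalog table
}
my_df = glueContext.write_dynamic_frame_from_catalog(
    frame=last_transform,
    database="<dst_db_name>",                # placeholder
    table_name="<dst_tbl_name>",             # placeholder
    transformation_ctx="DataSink1",
    additional_options=additionalOptions,
    format_options=format_options,
)
job.commit()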

Amazon redshift query aborts automatically after 1 hour

I have around 500 GB of compressed data in Amazon S3 and want to load it into Amazon Redshift. For that, I created a table in AWS Athena and I am trying to load its data into an internal table in Amazon Redshift.
Loading this much data into Amazon Redshift takes more than an hour. The problem is that when I fire the load query, it gets aborted after 1 hour. I tried it 2-3 times, but it keeps getting aborted after 1 hour. I am using the Aginity tool to fire the query; Aginity shows the query as still running and the loader keeps spinning.
More Details:
The Redshift cluster has 12 nodes with 2 TB of space per node, of which I have used 1.7 TB.
The S3 files are not all the same size: one of them is 250 GB, while others are in the MB range.
I am using the command
create table table_name as select * from athena_schema.table_name
and it stops after exactly 1 hour.
Note: I have set the query timeout in Aginity to 90,000 seconds.
I know this is an old thread, but for anyone coming here because of the same issue, I've realised that, at least in my case, the problem was the Aginity client; it is not related to Redshift or its Workload Manager, but only to that third-party client. In short, use a different client such as SQL Workbench and run the COPY command from there.
Hope this helps!
Carlos C.
More information about my environment:
Redshift:
Cluster type: Multi Node
Cluster: ds2.xlarge
Nodes: 4
Cluster version: 1.0.4852
Client Environment:
Aginity Workbench for Redshift
Version 4.9.1.2686 (build 05/11/17)
Microsoft Windows NT 6.2.9200.0 (64-bit)
Network:
Connected to OpenVPN, via SSH Port tunneling.
The connection is not being dropped and remains active; the issue only affects the COPY command.
Command:
copy tbl_XXXXXXX
from 's3://***************'
iam_role 'arn:aws:iam::***************:role/***************';
S3 Structure:
120 files of 6.2 GB each, plus 20 files of 874 MB each.
Output:
ERROR: 57014: Query (22381) cancelled on user's request
Statistics:
Start: ***************
End: ***************
Duration: 3,600.2420863
I'm not sure if the following answer will solve your exact problem of a timeout at exactly 1 hour.
But based on my experience, loading data into Redshift via the COPY command is the best and fastest way, so I feel the timeout issue shouldn't happen at all in your case.
The COPY command in Redshift can load data from S3, from EMR (as in the first example), or via SSH.
e.g. Simple copy:
copy sales
from 'emr://j-SAMPLE2B500FC/myoutput/part-*'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
delimiter '\t' lzop;
e.g. Using a manifest:
copy customer
from 's3://mybucket/cust.manifest'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
manifest;
PS: If you use a manifest and divide your data into multiple files, it will be even faster, as Redshift loads the files in parallel.
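For reference, the cust.manifest file referenced above is just a small JSON document listing the S3 objects to load; a minimal sketch with placeholder keys:
{
  "entries": [
    {"url": "s3://mybucket/cust.part.000", "mandatory": true},
    {"url": "s3://mybucket/cust.part.001", "mandatory": true}
  ]
}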

Run Redshift Queries Periodically

I have started researching Redshift. It is defined as a "Database" service in AWS. From what I have learnt so far, we can create tables and ingest data from S3 or from external sources like Hive into a Redshift database (cluster). We can also use a JDBC connection to query these tables.
My questions are:
Is there a place within the Redshift cluster where we can store our queries and run them periodically (e.g. daily)?
Can we store a query in an S3 location and use it to write output to another S3 location?
Can we load a DB2 table unload file with a mixture of binary and string fields into Redshift directly, or do we need an intermediate process to convert the data into something like CSV?
I have done some Googling about this. If you have links to resources, that would be very helpful. Thank you.
I used the cursor method with the psycopg2 library in Python. The sample code is given below; you have to set all the Redshift credentials in an env_vars file.
You can set your queries using cursor.execute. Here I mention one UPDATE query, so put your own query in its place (you can run multiple queries). After that, add this Python file to crontab, or any other scheduler, so that your queries run periodically (see the example entry after the code).
import psycopg2
import sys
import env_vars

conn_string = "dbname=%s port=%s user=%s password=%s host=%s " % (
    env_vars.RedshiftVariables.REDSHIFT_DW,
    env_vars.RedshiftVariables.REDSHIFT_PORT,
    env_vars.RedshiftVariables.REDSHIFT_USERNAME,
    env_vars.RedshiftVariables.REDSHIFT_PASSWORD,
    env_vars.RedshiftVariables.REDSHIFT_HOST)

conn = psycopg2.connect(conn_string)
cursor = conn.cursor()
cursor.execute("""UPDATE database.demo_table SET Device_id = '123' WHERE Device = 'IPHONE' OR Device = 'Apple';""")
conn.commit()
conn.close()
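For example, a crontab entry along these lines (the paths are placeholders) runs the script every day at 02:00:
0 2 * * * /usr/bin/python3 /home/ec2-user/run_redshift_queries.py >> /var/log/redshift_queries.log 2>&1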

Do we have mysqli_affected_rows equivalent for Amazon Redshift Copy command?

I just wanted to know if it is possible to get the number of records loaded/failed by an Amazon Redshift COPY command fired through a PHP script.
You are probably looking for the PG_LAST_COPY_COUNT function.
From the documentation:
Returns the number of rows that were loaded by the last COPY command executed in the current session. PG_LAST_COPY_COUNT is updated with the last COPY ID, which is the query ID of the last COPY that began the load process, even if the load failed. The query ID and COPY ID are updated when the COPY command begins the load process.
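As an illustration, here is a hedged sketch using psycopg2 (as in an earlier answer in this thread); the connection details, table, bucket, and IAM role are placeholders, and the same two statements work from any PostgreSQL-compatible client, including PHP. The key point is that the COPY and the PG_LAST_COPY_COUNT() query must run on the same connection, since the counter is tracked per session:
import psycopg2

conn = psycopg2.connect("dbname=mydb port=5439 user=admin password=secret host=example-cluster.redshift.amazonaws.com")
cursor = conn.cursor()

# The COPY and the count query must be issued in the same session.
cursor.execute("copy tbl_demo from 's3://my-bucket/prefix/' iam_role 'arn:aws:iam::123456789012:role/MyRedshiftRole';")
cursor.execute("select pg_last_copy_count();")
loaded_rows = cursor.fetchone()[0]
print("Rows loaded by the last COPY:", loaded_rows)

conn.commit()
conn.close()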

What's wrong with this HIVE script to export from DynamoDB to S3 in an AWS Data Pipeline?

Is there a problem with the Hive script below, or is this another issue, possibly related to the version of Hive installed by AWS Data Pipeline?
The first part of my AWS Data Pipeline must export large tables from DynamoDB to S3 for later processing with EMR. The DynamoDB table I'm using for testing is only a few rows long, so I know the data is formatted correctly.
The script associated with the AWS Data Pipeline "Export DynamoDB to S3" building block works correctly for tables that contain only primitive types, but it does not export array types. (Reference: http://archive.cloudera.com/cdh/3/hive/language_manual/data-manipulation-statements.html)
I pulled out all the Data Pipeline-specific pieces and am now trying to get the following minimal example, based on the DynamoDB docs, to work. (Reference: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/EMR_Hive_Commands.html)
-- Drop table
DROP table dynamodb_table;
--http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/EMR_Hive_Commands.html
CREATE EXTERNAL TABLE dynamodb_table (song string, artist string, id string, genres array<string>)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "InputDB",
"dynamodb.column.mapping" = "song:song,artist:artist,id:id,genres:genres");
INSERT OVERWRITE DIRECTORY 's3://umami-dev/output/colmap/'
SELECT * FROM dynamodb_table;
Here is the stack trace / EMR error I see when running the above script:
Diagnostic Messages for this Task:
java.io.IOException: IO error in map input file hdfs://172.31.40.150:9000/mnt/hive_0110/warehouse/dynamodb_table
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:244)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:218)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:441)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:377)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.io.IOException: java.lang.NullPointerException
at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:276)
at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:79)
at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:33)
at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:108)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:238)
... 9 more
Caused by: java.lang.NullPointerException
at org.apache.hadoop.dynamodb.read.AbstractDynamoDBRecordReader.scan(AbstractDynamoDBRecordReader.java:176)
at org.apache.hadoop.hive.dynamodb.read.HiveDynamoDBRecordReader.fetchItems(HiveDynamoDBRecordReader.java:87)
at org.apache.hadoop.hive.dynamodb.read.HiveDynamoDBRecordReader.next(HiveDynamoDBRecordReader.java:44)
at org.apache.hadoop.hive.dynamodb.read.HiveDynamoDBRecordReader.next(HiveDynamoDBRecordReader.java:25)
at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:274)
... 13 more
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
MapReduce Jobs Launched:
Job 0: Map: 1 HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec
Command exiting with ret '255'
I have already tried a few things to debug - creating an external table with formatting and using a few different JSON SerDes - but none have been successful. I'm not sure what to try next.
Many thanks.
I answered my own question by creating an EMR cluster and using Hue to quickly run Hive queries in the Amazon environment.
The solution was to change the format of the items in DynamoDB: what was originally a List of Strings is now a StringSet. After that, my Hive tables could successfully operate on the array.
Logically speaking, I may lose the ordering of the strings, since a List is ordered but a Set is not; that doesn't matter for my specific problem.
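To make the change concrete, here is an illustrative before/after of the attribute in DynamoDB's JSON notation (the values are made up):
Before, as a List of Strings:  {"genres": {"L": [{"S": "rock"}, {"S": "jazz"}]}}
After, as a StringSet:         {"genres2": {"SS": ["rock", "jazz"]}}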
Here's the relevant chunk of the final functioning Hive script -
-- depends on genres2 being a StringSet (or not existing)
CREATE EXTERNAL TABLE sauce (id string, artist string, song string, genres2 array<string>)
STORED BY "org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler"
TBLPROPERTIES ("dynamodb.table.name" = "InputDB",
"dynamodb.column.mapping" = "id:id,artist:artist,song:song,genres2:genres2");
-- s3 location for export to
CREATE EXTERNAL TABLE pasta (id int, artist string, song string, genres array<string>)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY '|'
LOCATION "s3n://umami-dev/tmp2";
-- do the export
INSERT OVERWRITE TABLE pasta
SELECT * FROM sauce;