Forced region split in HBase not causing any splits - amazon-web-services

I have an EMR cluster running HBase on s3. I have a table with the following configuration
hbase.regionserver.region.split.policy = org.apache.hadoop.hbase.regionserver.DisabledRegionSplitPolicy
I have disabled the split policy because I want to run split commands manually.
So I have a region say 'e85b1fe7c708500a7ae44427a76b3391' whose size is 14GB. I issue the following split command on the region:
split 'e85b1fe7c708500a7ae44427a76b3391'
The command runs successfully on the hbase shell, but no region split occurs. Can anyone help me on this.

Even though a force-split (such as the 'split' command in hbase shell) should be expected to expected to ignore 'DisabledRegionSplitPolicy', and try splitting a region, I have found, at least in the versions 1.4.x, that the policy is stopping the force-split.
You may try changing to some other policy, say 'ConstantSizeRegionSplitPolicy'. If this doesn't fix, then the issue could be something else. In that case, enable debug level in logging both in master, and in the target region server (which is hosting the target region), and troubleshoot.

Based on information available in https://hbase.apache.org/book.html#manual_region_splitting_decisions it states The DisabledRegionSplitPolicy policy blocks manual region splitting..
If you want to avoid manual splits, you can increase the size for auto-split to something very large.

Related

AWS Service Quota: How to get service quota for Amazon S3 using boto3

I get the error "An error occurred (NoSuchResourceException) when calling the GetServiceQuota operation:" while trying running the following boto3 python code to get the value of quota for "Buckets"
client_quota = boto3.client('service-quotas')
resp_s3 = client_quota.get_service_quota(ServiceCode='s3', QuotaCode='L-89BABEE8')
In the above code, QuotaCode "L-89BABEE8" is for "Buckets". I presumed the value of ServiceCode for Amazon S3 would be "s3" so I put it there but I guess that is wrong and throwing error. I tried finding the documentation around ServiceCode for S3 but could not find it. I even tried with "S3" (uppercase 'S' here), "Amazon S3" but that didn't work as well.
What I tried?
client_quota = boto3.client('service-quotas') resp_s3 = client_quota.get_service_quota(ServiceCode='s3', QuotaCode='L-89BABEE8')
What I expected?
Output in the below format for S3. Below example is for EC2 which is the output of resp_ec2 = client_quota.get_service_quota(ServiceCode='ec2', QuotaCode='L-6DA43717')
I just played around with this and I'm seeing the same thing you are, empty responses from any service quota list or get command for service s3. However s3 is definitely the correct service code, because you see that come back from the service quota list_services() call. Then I saw there are also list and get commands for AWS default service quotas, and when I tried that it came back with data. I'm not entirely sure, but based on the docs I think any quota that can't be adjusted, and possibly any quota your account hasn't requested an adjustment for, will probably come back with an empty response from get_service_quota() and you'll need to run get_aws_default_service_quota() instead.
So I believe what you need to do is probably run this first:
client_quota.get_service_quota(ServiceCode='s3', QuotaCode='L-89BABEE8')
And if that throws an exception, then run the following:
client_quota.get_aws_default_service_quota(ServiceCode='s3', QuotaCode='L-89BABEE8')

CloudWatch - Delete logs after transfered

I have a CloudWatch set up on my EC2 instance to transfer logs to specific log groups.
In time, those logs can grow quite big in size so I wanted to delete them for example, on weekly basis.
I was wondering if there is any option of setting up auto-cleanup from EC2 instance, of transferred logs using Cloudwatch?
What would be the best way to achieve that?
To remove the logfiles from EC2 running Linux, you have two choices:
If you're using logfiles that already rotate based on time or other value, you can use the auto_removal option to delete them after the log agent is finished. See docs.
If you're using a file that's constantly updated, you'll need to use logrotate, which is a program invoked by CRON that will rename, compress, and delete old files. There's a good intro doc here.
If you use logrotate, here's an example config that I've found useful for high-volume log sources. It performs a rotate if the file reaches 100 megabytes, rather than just doing it every day (you'll need to run it from cron.hourly to make that useful). Most important, it enables copytruncate, which will truncate the file in-place, allowing the program to continue writing to it.
/var/log/filename.log {
rotate 7
daily
maxsize 100M
nodateext
missingok
notifempty
copytruncate
compress
delaycompress
}

Errors importing large CSV file to DynamoDB using Lambda

I want to import a large csv file (around 1gb with 2.5m rows and 50 columns) into a DynamoDb, so have been following this blog from AWS.
However it seems I'm up against a timeout issue. I've got to ~600,000 rows ingested, and it falls over.
I think from reading the CloudWatch log that the timeout is occurring due to the boto3 read on the CSV file (it opens the entire file first, iterates through and batches up for writing)... I tried to reduce the file size (3 columns, 10,000 rows as a test), and I got a timeout after 2500 rows.
Any thoughts here?!
TIA :)
I really appreciate the suggestions (Chris & Jarmod). After trying and failing to break things programmatically into smaller chunks, I decided to look at the approach in general.
Through research I understood there were 4 options:
Lambda Function - as per the above this fails with a timeout.
AWS Pipeline - Doesn't have a template for importing CSV to DynamoDB
Manual Entry - of 2.5m items? no thanks! :)
Use an EC2 instance to load the data to RDS and use DMS to migrate to DynamoDB
The last option actually worked well. Here's what I did:
Create an RDS database (I used the db.t2.micro tier as it was free) and created a blank table.
Create an EC2 instance (free Linux tier) and:
On the EC2 instance: use SCP to upload the CSV file to the ec2 instance
On the EC2 instance: Firstly Sudo yum install MySQL to get the tools needed, then use mysqlimport with the --local option to import the CSV file to the rds MySQL database, which took literally seconds to complete.
At this point I also did some data cleansing to remove some white spaces and some character returns that had crept into the file, just using standard SQL queries.
Using DMS I created a replication instance, endpoints for the source (rds) and target (dynamodb) databases, and finally created a task to import.
The import took around 4hr 30m
After the import, I removed the EC2, RDS, and DMS objects (and associated IAM roles) to avoid any potential costs.
Fortunately, I had a flat structure to do this against, and it was only one table. I needed the cheap speed of the dynamodb, otherwise, I'd have stuck to the RDS (I almost did halfway through the process!!!)
Thanks for reading, and best of luck if you have the same issue in the future.

Where are the EMR logs that are placed in S3 located on the EC2 instance running the script?

The question: Imagine I run a very simple Python script on EMR - assert 1 == 2. This script will fail with an AssertionError. The log the contains the traceback containing that AssertionError will be placed (if logs are enabled) in an S3 bucket that I specified on setup, and then I can read the log containing the AssertionError when those logs get dropped into S3. However, where do those logs exist before they get dropped into S3?
I presume they would exist on the EC2 instance that the particular script ran on. Let's say I'm already connected to that EC2 instance and the EMR step that the script ran on had the ID s-EXAMPLE. If I do:
[n1c9#mycomputer cwd]# gzip -d /mnt/var/log/hadoop/steps/s-EXAMPLE/stderr.gz
[n1c9#mycomputer cwd]# cat /mnt/var/log/hadoop/steps/s-EXAMPLE/stderr
Then I'll get an output with the typical 20/01/22 17:32:50 INFO Client: Application report for application_1 (state: ACCEPTED) that you can see in the stderr log file you can access on EMR:
So my question is: Where is the log (stdout) to see the actual AssertionError that was raised? It gets placed in my S3 bucket indicated for logging about 5-7 minutes after the script fails/completes, so where does it exist in EC2 before that? I ask because getting to these error logs before they are placed on S3 would save me a lot of time - basically 5 minutes each time I write a script that fails, which is more often than I'd like to admit!
What I've tried so far: I've tried checking the stdout on the EC2 machine in the paths in the code sample above, but the stdout file is always empty:
What I'm struggling to understand is how that stdout file can be empty if there's an AssertionError traceback available on S3 minutes later (am I misunderstanding how this process works?). I also tried looking in some of the temp folders that PySpark builds, but had no luck with those either. Additionally, I've printed the outputs of the consoles for the EC2 instances running on EMR, both core and master, but none of them seem to have the relevant information I'm after.
I also looked through some of the EMR methods for boto3 and tried the describe_step method documented here: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/emr.html#EMR.Client.describe_step - which, for failed steps, have a FailureDetails json dict response. Unfortunately, this only includes a LogFile key which links to the stderr.gz file on S3 (even in that file doesn't exist yet) and a Message key which contain a generic Exception in thread.. message, not the stdout. Am I misunderstanding something about the existence of those logs?
Please feel free to let me know if you need any more information!
It is quite normal that with log collecting agents, the actual logs files doesn't actually grow, but they just intercept stdout to do what they need.
Most probably when you configure to use S3 for the logs, the agent is configured to either read and delete your actual log file, or maybe create a symlink of the log file to somewhere else, so that file is actually never writen when any process open it for write.
maybe try checking if there is any symlink there
find -L / -samefile /mnt/var/log/hadoop/steps/s-EXAMPLE/stderr
but it can be something different from a symlink to achieve the same logic, and I ddint find anything in AWS docs, so most probably is not intended that you will have both S3 and files at the same time and maybe you wont find it
If you want to be able to check your logs more frequently, you may want to think about installing a third party logs collector (logstash, beats, rsyslog,fluentd) and ship logs to SolarWinds Loggly, logz.io, or set up a ELK (Elastic search, logstash, kibana)
You can check this article from Loggly, or create a free acount in logz.io and check the lots of free shippers that they support

Pointing multiple projects' log sinks to one bucket

I have a few GCP projects with log sinks to different storage buckets. I'd like to combine them into a single bucket. But the stackdriver export doesn't add any distinguishing information to the object names it creates; they all look like cloudaudit.googleapis.com/activity/2017/11/14/00:00:00_00:59:59_S0.json
What will happen if I start pushing them all to a single bucket? Will the different project sinks overwrite each other's objects? Is there any way to distinguish which project created the logs just from the object?
If not, I guess I should switch to pubsub sinks, and then write some code that produces objects with more desirable names. Are there any established patterns or examples for doing this?
Update: I filed https://issuetracker.google.com/issues/69371200 for this issue.
To enable this, just select custom destination on the sink and point to the bucket with this format: storage.googleapis.com/[BUCKET_ID].
I've just enabled this in a couple of my projects, as I'm curious to see the results when exporting to a bucket. However, I have been using a single BQ sink for all my projects, and the tables created have all the logs mixed, so no logs lost when using a single BQ sink.
I'm assuming for a GCS sink will work in the same way, but I'll tell you in a couple of days.
If a single bucket sink does not work, you can always use a single BQ sink (that will help in analyzing the logs), and when you no longer want to have them in BQ, export them and store the files wherever you want.
Also, since you'll be writing to your sink constantly, you can't use nearline or coldline, so the storage pricing is better in BQ than a regional bucket (0.02 USD/GB in BQ vs somewhere between 0.02 and 0.35 USD/GB for regional storage, depending on the region; BQ has 10GB free monthly, GCS 5GB).
I would generally recommend using a BQ sink, but I'll tell you what happens with my bucket logs.
Update:
A few hours later, and I've verified that shared bucket sinks work pretty much as you would expect. It concatenates logs chronologically regardless of the project origin, and only creates a single file for each time window. Hope this helps! (I still prefer BQ as a log sink...)
Update 2:
For the behavior you seek in the feature request, I would use BQ, but you could just as easily grep the project ID and separate the logs:
grep '"logName":"projects/<your-project-id>/' mixed-log.json > single-project-log.json
Or just get a cloud function triggered by bucket updates (so, every time you receive a log file in the sink) to run this for you.
Or namespace you buckets and have a cloud function moving them to wherever you need as soon as they are written.
The possibilities are endless!
If you have an organization or folder which includes all the projects that you want to collect logs from, then you can create a sink that collects from all projects in that org/folder.
Unfortunatlely, you cannot do this from the Cloud Console. Instead you must use gcloud with the --organization or --folder option or the API.