How can I merge HDFS edit log files? - hdfs

I have 20,005 edit log files on the NameNode, which seems like a large number to me. Is there a way I can merge them into the fsimage? I have restarted the NameNode, but it did not help.

If you do not have HA enabled for NN, then you need to have a Secondary NameNode that does this.
If you have HA enabled, then your Standby NN does this.
If you have one of those, check its logs to see what is happening and why checkpointing fails. It is possible that it does not have enough RAM and you need to increase the heap size of that role, but verify that in the logs first.
If you do not have one of those running alongside the NN, then fix that and checkpointing will happen automatically. The relevant configs that affect checkpoint timing are:
dfs.namenode.checkpoint.period (default: 3600s)
dfs.namenode.checkpoint.txns (default: 1 million txn)
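For reference, these are set in hdfs-site.xml, roughly like this (the values shown are just the defaults listed above; tune them as needed):
<property>
  <name>dfs.namenode.checkpoint.period</name>
  <value>3600</value> <!-- checkpoint at least every hour -->
</property>
<property>
  <name>dfs.namenode.checkpoint.txns</name>
  <value>1000000</value> <!-- or after 1,000,000 uncheckpointed transactions, whichever comes first -->
</property>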
You can also run the following commands to force a checkpoint manually, but this is only a temporary fix:
hdfs dfsadmin -safemode enter
hdfs dfsadmin -rollEdits
hdfs dfsadmin -saveNamespace
hdfs dfsadmin -safemode leave
Note: after entering safemode, HDFS is read-only until you leave safemode.

Related

CloudWatch - Delete logs after they are transferred

I have CloudWatch set up on my EC2 instance to transfer logs to specific log groups.
Over time, those logs can grow quite large, so I want to delete them, for example, on a weekly basis.
Is there any option to set up auto-cleanup of the transferred logs from the EC2 instance using CloudWatch?
What would be the best way to achieve that?
To remove the log files from an EC2 instance running Linux, you have two choices:
If you're using log files that already rotate based on time or some other value, you can use the auto_removal option to delete them after the log agent is finished with them. See the docs, and the config sketch after the logrotate example below.
If you're using a file that's constantly updated, you'll need to use logrotate, a program invoked by cron that will rename, compress, and delete old files. There's a good intro doc here.
If you use logrotate, here's an example config that I've found useful for high-volume log sources. It performs a rotation if the file reaches 100 megabytes, rather than just doing it every day (you'll need to run it from cron.hourly to make that useful). Most importantly, it enables copytruncate, which truncates the file in place, allowing the program to continue writing to it.
/var/log/filename.log {
rotate 7
daily
maxsize 100M
nodateext
missingok
notifempty
copytruncate
compress
delaycompress
}
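For the first option, here's a rough sketch of what auto_removal looks like in the CloudWatch agent's config file (the file path, log group, and stream name are made up for illustration):
{
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/myapp/app-*.log",
            "log_group_name": "my-app-logs",
            "log_stream_name": "{instance_id}",
            "auto_removal": true
          }
        ]
      }
    }
  }
}
With auto_removal enabled, the agent deletes a rotated file once it has finished reading it.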

More efficient use of aws s3 sync?

Lately, we've noticed that our AWS bill has been higher than usual. It's due to adding an aws s3 sync task to our regular build process. The build process generates around 3,000 files. After the build, we run aws s3 sync to upload them en masse into a bucket. The problem is that this is monetarily expensive. Each upload run is costing us ~$2 (we think), and this adds up to a monthly bill that raises eyebrows.
All but maybe 1 or 2 of those files actually change from build to build. The rest are always the same. Yet aws s3 sync sees that they all changed and uploads the whole lot.
The documentation says that aws s3 sync compares the file's last modified date and byte size to determine if it should upload. The build server creates all those files brand-new every time, so the last modified date is always changed.
What I'd like to do is get it to compute a checksum or a hash of each file and then use that hash to compare the files. Amazon S3 already has the ETag field, which can be an MD5 hash of the file. But the aws s3 sync command doesn't use the ETag.
Is there a way to use etag? Is there some other way to do this?
The end result is that I'd only like to upload the 1 or 2 files that are actually different (and save a tremendous cost).
The aws s3 sync command has a --size-only parameter.
From aws s3 sync options:
--size-only (boolean) Makes the size of each key the only criteria used to decide whether to sync from source to destination.
This will likely avoid re-uploading files that are regenerated with the same content, since their size is unchanged (note that it will also skip a genuinely changed file whose size happens to stay the same).
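For example (the local path and bucket name are placeholders):
aws s3 sync ./build s3://my-build-bucket/site --size-only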
As an alternative to s3 sync or cp, you could use s5cmd:
https://github.com/peak/s5cmd
It can sync files based on size and date when they differ, and it can reach transfer speeds of up to 4.6 GB/s.
Example of the sync command:
AWS_REGION=eu-west-1 /usr/local/bin/s5cmd -stats cp -u -s --parents s3://bucket/folder/* /home/ubuntu
S3 charges $0.005 per 1,000 PUT requests (doc), so 3,000 uploads work out to roughly $0.015; it's extremely unlikely that this is costing you $2 per build. Maybe $2 per day if you're running 50-100 builds a day, but that's still not much.
If you really are paying that much per build, you should enable CloudTrail events and see what is actually writing that much (for that matter, maybe you've created some sort of recursive CloudTrail event log).
The end result is that I'd only like to upload the 1 or 2 files that are actually different
Are these files the artifacts produced by your build? If yes, why not just add a build step that copies them explicitly?
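For instance, a build step along these lines would upload just the known outputs (the file names and bucket are hypothetical):
aws s3 cp build/app.bundle.js s3://my-deploy-bucket/releases/app.bundle.js
aws s3 cp build/app.bundle.js.map s3://my-deploy-bucket/releases/app.bundle.js.map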
The issue I ran into was using the wildcard * in the --include option. One wildcard was fine, but when I added a second one, such as /log., it looked like sync tried to download everything to compare, which used a lot of CPU and network bandwidth.

distcp causing skewness in HDFS

I have a folder (around 2 TB in size) in HDFS, which was created using the save method from Apache Spark. It is almost evenly distributed across nodes (I checked this using hdfs fsck).
When I try to distcp this folder (intra-cluster) and run hdfs fsck on the destination folder, it turns out to be highly skewed, that is, a few nodes have a lot of blocks whereas others have very few blocks stored on them. This skewness in HDFS is causing performance issues.
We tried moving the data using mv from source to destination (intra-cluster), and this time the skewness in the destination was fine, that is, the data was evenly distributed.
Is there any way to reduce the skewness in HDFS when using distcp?
The number of mappers in the distcp job was equal to the number of nodes that ended up heavily loaded.
So I increased the number of mappers, using distcp's -m option, to the number of machines in the cluster, and the output was much less skewed.
An added benefit: the distcp job also completed much faster than it used to.
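For example, on a hypothetical 40-node cluster the copy would look something like:
hadoop distcp -m 40 /source/folder /destination/folder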

When is data deleted from data nodes in case of hdfs dfs -rmr on a folder?

We know that when we run the rmr command, an edit log entry is created. Do the DataNodes wait for the FsImage to be updated before purging the data, or does that happen concurrently? Is there any precondition around acknowledgement of the transaction by the JournalNodes? I am just trying to understand how HDFS edits work when there is a massive change in disk usage. How long will it take before 'hdfs dfs -du -s -h /folder' and 'hdfs dfsadmin -report' reflect the decrease in size? We tried deleting 2 TB of data, and after 1 hour the DataNodes' local folder (/data/yarn/datanode) still had not shrunk by 2 TB.
After you delete data from HDFS, Hadoop keeps it in the trash folder, and you need to run the command below to free the disk space:
hadoop fs -expunge
Then the space will be released by HDFS.
Or you can run the command below when deleting the data to skip the trash:
hadoop fs -rmr -skipTrash /folder
This will not move the data into the trash.
Note: A file remains in /trash for a configurable amount of time. After the expiry of its life in /trash, the NameNode deletes the file from the HDFS namespace.
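That retention time is controlled by fs.trash.interval (in minutes) in core-site.xml; a minimal sketch, assuming you want a 6-hour trash window:
<property>
  <name>fs.trash.interval</name>
  <value>360</value> <!-- minutes a deleted file stays in the trash before it is permanently removed -->
</property>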

Backup MySQL Amazon RDS

I am trying to set up a replica outside of AWS, with the master running on AWS RDS, and I do not want any downtime on the master. I have set up my slave node, and now I want to back up the current database, which is on AWS RDS.
mysqldump -h RDS_ENDPOINT -u root -p --skip-lock-tables --single-transaction --flush-logs --hex-blob --master-data=2 --all-databases > /root/dump.sql
I tested it on my VM and it worked fine, but when trying it against RDS it gives me this error:
mysqldump: Couldn't execute 'FLUSH TABLES WITH READ LOCK': Access denied for user 'root'@'%' (using password: YES) (1045)
Is it because I do not have the SUPER privilege, and how do I fix this problem? Can someone please suggest a solution?
RDS does not allow even the master user the SUPER privilege, and this is required in order to execute FLUSH TABLES WITH READ LOCK. (This is an unfortunate limitation of RDS).
The failing statement is being generated by the --master-data option, which is, of course, necessary if you want to be able to learn the precise binlog coordinates where the backup begins. FLUSH TABLES WITH READ LOCK acquires a global read lock on all tables, which allows mysqldump to START TRANSACTION WITH CONSISTENT SNAPSHOT (as it does with --single-transaction) and then SHOW MASTER STATUS to obtain the binary log coordinates, after which it releases the global read lock because it has a transaction that will keep the visible data in a state consistent with that log position.
RDS breaks this mechanism by denying the SUPER privilege and providing no obvious workaround.
There are some hacky options available to properly work around this, none of which may be particularly attractive:
do the backup during a period of low traffic. If the binlog coordinates have not changed between the time you start the backup and the time the backup has begun writing data to the output file or destination server (assuming you used --single-transaction), then this will work, because you know the coordinates didn't change while the process was running (see the sketch after this list).
observe the binlog position on the master right before starting the backup, and use these coordinates with CHANGE MASTER TO. If your master's binlog_format is set to ROW then this should work, though you will likely have to skip past a few initial errors; you should not see any further errors after that. This works because row-based replication is very deterministic and will stop if it tries to insert something that's already there or delete something that's already gone. Once past the errors, you will be at the true binlog coordinates where the consistent snapshot actually started.
as in the previous item, but, after restoring the backup try to determine the correct position by using mysqlbinlog --base64-output=decode-rows --verbose to read the master's binlog at the coordinates you obtained, checking your new slave to see which of the events must have already been executed before the snapshot actually started, and using the coordinates determined this way to CHANGE MASTER TO.
use an external process to obtain a read lock on each and every table on the server, which will stop all writes; observe that the binlog position from SHOW MASTER STATUS has stopped incrementing, start the backup, and release those locks.
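As a rough sketch of the first option (the endpoint and credentials are placeholders; in practice you would supply the password non-interactively):
# Record the coordinates before and after the dump and compare them.
# SHOW MASTER STATUS only needs REPLICATION CLIENT, which the RDS master user does have.
mysql -h RDS_ENDPOINT -u root -p -e "SHOW MASTER STATUS\G" > coords_before.txt
mysqldump -h RDS_ENDPOINT -u root -p --skip-lock-tables --single-transaction \
  --hex-blob --all-databases > /root/dump.sql
mysql -h RDS_ENDPOINT -u root -p -e "SHOW MASTER STATUS\G" > coords_after.txt
diff coords_before.txt coords_after.txt && echo "binlog coordinates unchanged; safe to use with CHANGE MASTER TO"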
If you use any of these approaches other than perhaps the last one, it's especially critical that you do table comparisons to be certain your slave is identical to the master once it is running. If you hit subsequent replication errors... then it wasn't.
Probably the safest option -- but also maybe the most annoying, since it seems like it should not be necessary -- is to begin by creating an RDS read replica of your RDS master. Once it is up and synchronized with the master, you can stop replication on the RDS read replica by executing an RDS-provided stored procedure, CALL mysql.rds_stop_replication (introduced in RDS 5.6.13 and 5.5.33), which doesn't require the SUPER privilege.
With the RDS replica slave stopped, take your mysqldump from the RDS read replica, which will now have an unchanging data set on it as of a specific set of master coordinates. Restore this backup to your off-site slave and then use the RDS read replica's master coordinates from SHOW SLAVE STATUS Exec_Master_Log_Pos and Relay_Master_Log_File as your CHANGE MASTER TO coordinates.
The value shown in Exec_Master_Log_Pos on a slave is the start of the next transaction or event to be processed, and that's exactly where your new slave needs to start reading on the master.
Then you can decommission the RDS read replica once your external slave is up and running.
Thanks Michael, I think the most correct solution, and the one recommended by AWS, is to do the replication using a read replica as the source, as explained here.
With an RDS master, an RDS read replica, and an instance with MySQL ready, the steps to get an external slave are:
On master, increase binlog retention period.
mysql> CALL mysql.rds_set_configuration('binlog retention hours', 12);
On read replica stop replication to avoid changes during the backup.
mysql> CALL mysql.rds_stop_replication;
On the read replica, note down the binlog status (Master_Log_File and Read_Master_Log_Pos).
mysql> SHOW SLAVE STATUS;
On the server instance, take a backup and import it (using mydumper, as suggested by Max, can speed up the process).
mysqldump -h RDS_READ_REPLICA_IP -u root -p YOUR_DATABASE > backup.sql
mysql -u root -p YOUR_DATABASE < backup.sql
On the server instance, set it up as a slave of the RDS master.
mysql> CHANGE MASTER TO MASTER_HOST='RDS_MASTER_IP',MASTER_USER='myrepladmin', MASTER_PASSWORD='pass', MASTER_LOG_FILE='mysql-bin-changelog.313534', MASTER_LOG_POS=1097;
Replace MASTER_LOG_FILE and MASTER_LOG_POS with the values of Master_Log_File and Read_Master_Log_Pos you saved earlier. You also need a user on the RDS master for the slave to replicate with.
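If that replication user doesn't exist yet, a minimal sketch of creating it on the RDS master (the name and password follow the CHANGE MASTER TO example above; tighten the host mask to your slave's IP if you can):
mysql> CREATE USER 'myrepladmin'@'%' IDENTIFIED BY 'pass';
mysql> GRANT REPLICATION SLAVE ON *.* TO 'myrepladmin'@'%';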
mysql> START SLAVE;
On the server instance, check whether replication is working.
mysql> SHOW SLAVE STATUS;
On the RDS read replica, resume replication.
mysql> CALL mysql.rds_start_replication;
For the RDS binlog position, you can use mydumper with --lock-all-tables: it will use LOCK TABLES ... READ just to get the binlog coordinates and then release the locks, instead of using FLUSH TABLES WITH READ LOCK.
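A sketch of what that might look like (the endpoint, credentials, and output directory are placeholders):
mydumper --host RDS_ENDPOINT --user root --password 'secret' --lock-all-tables --outputdir /backup/rds_dump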
Michael's answer is extremely helpful and focuses on the main sticking point: you simply cannot GRANT the required SUPER privilege on RDS, and therefore you can't use the --master-data flag that would make things so much easier.
I read that it may be possible to work around this by creating or modifying a Database Parameter Group via the API, but I think using the RDS procedures is a better option.
The multi-tiered replication approach works well, though, and can include tiers outside RDS/VPC so it's possible to replicate from "Classic" EC2 to VPC using this method.
A lot of the necessary functionality is only in later releases of MySQL 5.5 and 5.6, and I strongly recommend you run the same version on all the DBs involved in the replication stack, so you may have to do an upgrade of your old DB before all of this, which means yet more tedium and replication and so on.
I faced a similar problem; a quick workaround is:
Create an EBS volume to get extra space, or extend the current EBS volume on EC2 (or, if you already have spare space, use that).
Use the mysqldump command without the --master-data or --flush-logs options to generate a complete (full) backup of the DB.
mysqldump -h hostname --routines -uadmin -p12344 test_db > filename.sql
Here admin is the DB user, 12344 is the password, and test_db is the database name.
The above backs up one single DB; if you need to back up all DBs, specify --all-databases instead of a database name.
Create a cron job for this command to run once a day so the dump is generated automatically.
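A crontab entry along these lines would do it (the schedule, credentials, and backup path are placeholders following the example above):
0 2 * * * mysqldump -h RDS_ENDPOINT --routines -uadmin -p12344 test_db > /mnt/backup/test_db_$(date +\%F).sql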
Please note that this will incur extra cost if your DB size is huge, as it creates a complete DB dump.
hope this helps
You need to:
1- Create a read replica on AWS.
2- Make sure that this instance is catching up with the master.
3- Stop the replication and get the log_file and log_position parameters by running
show slave status \G
4- Dump the database and use the parameters logged in step 3 to start the replication on your own server.
5- Start the slave.
Detailed instructions here
Either things have changed since @Michael - sqlbot's response, or there is a misunderstanding going on here (it could be on my part).
You can use COPY to import a CSV file into RDS, at least on the PostgreSQL version; you just need to use FROM STDIN instead of naming the file directly,
which means you end up piping things like:
cat data.csv | psql postgresql://server:5432/mydb -U user -c "COPY \"mytable\" FROM STDIN DELIMITER ',' "