Vora Manager 1.3 log rotation - vora

Is there any log rotation in Vora 1.3? After two months of running Vora 1.3 I realized I'm almost out of disk space on my nodes because /var/log/vora-manager is around 46 GB. So I had to stop it, delete the logs and restart.
But maybe I missed some setting?
Edit 1: The log file is supposed to be stored in /var/log/vora/vora-manager, not the folder I mentioned above, but I still saw a huge log file there. The path /var/log/vora-manager is also mentioned on line 178 of the control.py script that is supposed to start a vora-manager worker.

You are right: the vora-manager log file is not written to the standard /var/log/vora directory; instead it is written to /var/log/vora-manager. This has been corrected in Vora 1.4.
The logs should rotate based on the vora_manager_log_max_file_size variable, which is also set in Ambari.
Something must have gone wrong when Vora tried to rotate the logs. I suggest you search your log file for the following line and check whether it is followed by some kind of error:
vora.vora-manager-master: [c.740b0d26] : Running['sudo', '-i', '-u',
'root', '/usr/sbin/logrotate', '/etc/logrotate.d/vora-manager-master']
You can also change the verbosity of the logger by setting the vora_manager_log_level config variable in Ambari from INFO to WARNING. Be aware that this will hide the log rotation messages.
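If the rotation itself is failing, it can also help to look at the logrotate policy that command points to. Below is a minimal sketch of what /etc/logrotate.d/vora-manager-master could contain; it is an assumption for illustration only, since the real file is generated from the vora_manager_log_max_file_size setting and may differ:
# hypothetical policy for illustration; compare with the file Vora actually generates
/var/log/vora-manager {
    size 100M
    rotate 5
    compress
    missingok
    notifempty
    copytruncate
}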

Related

Attempting to put a CloudWatch Agent onto an On-Premise server. Issues with cwagent-otel-collector

As the title says, I am attempting to put a CloudWatch Agent (CW agent) on my on-premises server (OPS).
After running this line of code that I got from the AWS User Guide to start the CW agent:
& $Env:ProgramFiles\Amazon\AmazonCloudWatchAgent\amazon-cloudwatch-agent-ctl.ps1 -m ec2 -a start
I got this error:
****** processing cwagent-otel-collector ******
cwagent-otel-collector will not be started as it has not been configured yet.
****** processing amazon-cloudwatch-agent ******
AmazonCloudWatchAgent has been started
I did not know what this was, so I searched and found that someone else who had this issue had not created a config file.
I did create a config file (named config.json by default) using the configuration wizard and I am still having the issue.
I have tried looking into a number of pages on that user guide, but nothing has resolved the issue.
Thank you in advance for any assistance you can provide.
This message is informational, not an error.
The CloudWatch agent is bundled with the AWS OpenTelemetry (Otel) collector agent, so there are actually two agents, and each has its own configuration file. If you provide a config for one and not the other, only the configured one will be started. This is expected behavior.
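If you want both agents running, each one needs its own configuration applied. As a rough illustration only (the paths and flags are worth double-checking against the AWS documentation for your installation), applying the wizard-generated config.json to the CloudWatch agent on Windows looks something like this, with -m set to onPremise rather than ec2 for an on-premises server:
# hypothetical paths; adjust to wherever the agent and config.json actually live
& "${Env:ProgramFiles}\Amazon\AmazonCloudWatchAgent\amazon-cloudwatch-agent-ctl.ps1" -a fetch-config -m onPremise -s -c file:"${Env:ProgramFiles}\Amazon\AmazonCloudWatchAgent\config.json"
The cwagent-otel-collector side will stay stopped until it is given a configuration of its own.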
Thank you for taking the time to answer. I have since resolved the issue (recently).
Everything from the command I was using to the path where the file resided was incorrect.
Starting over and going through all the steps again with background information helped.
The first installation combined with learning everything for the first time produced the issue.
To anyone having this issue: when you hit a wall like this, I recommend starting over. I know it is not what anyone wants to do, but in the end it saved time.

Getting a GetHDFS error (Admin Yield Error)

I am getting this error on my GetHDFS processor and I don't understand why. I set the Hadoop Configuration Resources, Kerberos Principal and Kerberos Keytab, and there are files in the path; I just checked via SuperPuTTY and it's a valid path.
Currently the GetHDFS is just linked to a LogAttribute processor, as I am trying to get each step working before moving on to the next.
Overall process: GetHDFS -> PutEmail. I am trying to print out a count of the rows of the file at that path (CSV).
It looks like your core-site.xml has a default filesystem that references a hostname, 'gwhdpdevnnha', that is not reachable from where NiFi is running. You can troubleshoot this outside of NiFi by checking whether you can ping that hostname from a terminal.
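For illustration, the default filesystem is the fs.defaultFS property (fs.default.name in older Hadoop versions) in core-site.xml; the value below is only a hypothetical example of the kind of entry to look for:
<property>
  <name>fs.defaultFS</name>
  <!-- hypothetical value: NiFi must be able to resolve and reach this host or HA nameservice -->
  <value>hdfs://gwhdpdevnnha</value>
</property>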

How to remove banned.users in Hadoop (Hortonworks)

I have an HDP (Hortonworks) 2.5.3 cluster, and MapReduce jobs in YARN are failing with the error:
java.io.IOException: DistCp failure: Job job_1498784032636_0015 has failed:
Application application_1498784032636_0015 failed 2 times due to AM Container for appattempt_1498784032636_0015_000002 exited with exitCode: -1000
For more detailed output, check the application tracking page: http://asterdart0005.labs.teradata.com:8088/cluster/app/application_1498784032636_0015 Then click on links to logs of each attempt.
Diagnostics: Application application_1498784032636_0015 initialization failed (exitCode=255) with output:
main : command provided 0
main : run as user is hdfs
main : requested yarn user is hdfs
Requested user hdfs is banned
Later I googled, and it seems the hdfs user is a banned user, as per the configuration in the file /etc/hadoop/conf/container-executor.cfg on each node. Here is the content of the file:
yarn.nodemanager.local-dirs=/hadoop/yarn/local
yarn.nodemanager.log-dirs=/hadoop/yarn/log
yarn.nodemanager.linux-container-executor.group=hadoop
banned.users=hdfs,yarn,mapred,bin
min.user.id=500
I have modified the file on all nodes (namenode, edge and data nodes) as below:
yarn.nodemanager.local-dirs=/hadoop/yarn/local
yarn.nodemanager.log-dirs=/hadoop/yarn/log
yarn.nodemanager.linux-container-executor.group=hadoop
#banned.users=hdfs,yarn,mapred,bin
min.user.id=500
and restarted all services in HDFS, YARN and MapReduce2 through Ambari. After restarting, my jobs are failing with the same error, and when I checked the /etc/hadoop/conf/container-executor.cfg content, it looks like it was reset to its initial state, as below:
yarn.nodemanager.local-dirs=/hadoop/yarn/local
yarn.nodemanager.log-dirs=/hadoop/yarn/log
yarn.nodemanager.linux-container-executor.group=hadoop
banned.users=hdfs,yarn,mapred,bin
min.user.id=500
Any idea what the solution is here, to remove the users from the banned users list?
The first thing to note is that you cannot comment out the banned.users line; instead, set the correct users in the value of the banned.users list (i.e. if you do not want to ban the user hdfs, change banned.users=hdfs,yarn,mapred,bin to banned.users=yarn,mapred,bin). If you comment out the banned.users list, then hdfs, yarn and mapred will be banned by default anyway.
Another thing: you can follow the steps below to propagate the change to all nodes.
Go to the Ambari server node.
Modify /var/lib/ambari-server/resources/common-services/YARN/<version>/package/templates/container-executor.cfg.j2 to configure the banned users.
Restart the Ambari server and all Ambari agents.
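After Ambari regenerates the file on the nodes, /etc/hadoop/conf/container-executor.cfg should end up looking roughly like this, i.e. your existing values with hdfs removed from banned.users:
yarn.nodemanager.local-dirs=/hadoop/yarn/local
yarn.nodemanager.log-dirs=/hadoop/yarn/log
yarn.nodemanager.linux-container-executor.group=hadoop
banned.users=yarn,mapred,bin
min.user.id=500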

Wrong event time in CloudWatch log events

I found the solution after searching, but I'm leaving this here in case somebody runs into a similar kind of confusion. See the resolution at the end.
I'm trying to figure out why AWS CloudWatch log service fails to understand the right timestamp for my log events. Currently all my events are being saved under Time 2017-01-01 no matter what the actual timestamp in the event is.
I'm feeding the log from syslog, where Docker saves the logged events, and I configured Docker to write the timestamp in the format:
170105/103242 (%y%m%d/%H%M%S)
I configured awslogs service with parameters:
datetime_format = %y%m%d/%H%M%S
I restarted the service and hit the server, but still, when I go to CloudWatch and look at the log entries, even entries that do start with the timestamp 170105/103242 are saved as events belonging to the date 2017-01-01, containing all events between 01-01 and 01-05.
When I look at the awslogs.log I can see following lines:
2017-01-05 11:05:28,633 - cwlogs.push - INFO - 29223 - MainThread - Missing or invalid value for use_gzip_http_content_encoding config. Defaulting to using gzip encoding.
2017-01-05 11:05:28,633 - cwlogs.push - INFO - 29223 - MainThread - Using default logging configuration.
This makes me think that the configuration probably isn't actually reading/using the datetime_format, but I don't understand why it ends up falling back to the default. I tried putting
use_gzip_http_content_encoding = true
under general settings, but it doesn't change the errors.
I am running out of ideas - has anyone managed to configure the awslogs agent in a way where the datetime_format is actually used correctly?
Edit:
I'm currently hacking more console logging into the local python2.7 push.py to see what is going on :)
RESOLVED:
OK, the problem was that I came into this project after the initial setup had been created, and I had the impression that the logger was configured to use the .conf file at:
/etc/awslogs/awslogs.conf
which was dynamically populated.
The environment had a script that passed this location to awslogs-agent-setup.py, which was supposed to make the agent read its configuration from there.
However, this script didn't actually do what it was supposed to do, and when the service started it actually read the config from
/var/awslogs/etc/awslogs.conf
which contained the default values.
So the actual resolution was to change the datetime_format parameter in the default config and forget about the config I thought the service was using.
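As a side note, one quick way to convince yourself that the format string itself was never the problem is to check it against a sample timestamp with a small standalone Python sketch (not part of the agent):
# Sanity check: does the configured datetime_format actually parse the log timestamps?
from datetime import datetime

sample = "170105/103242"       # timestamp as Docker/syslog writes it
fmt = "%y%m%d/%H%M%S"          # the datetime_format value from awslogs.conf

print(datetime.strptime(sample, fmt))  # prints 2017-01-05 10:32:42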
Add logging to /var/awslogs/lib/python2.7/site-packages/cwlogs/push.py and see how the actual config parameters are interpreted.
You will probably find out that the service is actually using configuration file at default location:
/var/awslogs/etc/awslogs.conf
and hence you have to edit configuration values there for them to be actually read.
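For reference, datetime_format is a per-log-stream setting rather than a [general] one, so in that file it sits inside the section for the monitored log. A sketch with hypothetical group and stream names (check the exact keys against the agent reference for your version):
[/var/log/syslog]
# hypothetical group/stream names for illustration
file = /var/log/syslog
log_group_name = my-log-group
log_stream_name = my-log-stream
datetime_format = %y%m%d/%H%M%S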

One or more services have started or stopped unexpectedly SPTimerService (SPTimerV4)

I have stopped and restarted the services (SharePoint Administration & SharePoint Timer Service).
I cleaned the configuration cache using the steps below.
Summary of the steps to clear the timer job:
Stop SharePoint Timer service on all servers in the farm.
Browse to C:\ProgramData\Microsoft\SharePoint\Config\{GUID}, where the {GUID} folder contains a bunch of XML files and NOT the files with a ".PERSISTEDFILE" extension.
Delete all the XML files
Update the contents of the Cache.ini file to just say “1” (without quotes).
Restart the SharePoint Timer service on each server
Reanalyze the issue in Health Analyzer
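For reference, here is a rough PowerShell sketch of those steps (it assumes the default config-cache location and the SPTimerV4 service name from the title; verify both before running, on each server in the farm):
# Rough sketch only: verify paths and service name first.
Stop-Service SPTimerV4

$configRoot = "C:\ProgramData\Microsoft\SharePoint\Config"
$cacheDir = Get-ChildItem $configRoot -Directory |
    Where-Object { Test-Path (Join-Path $_.FullName "Cache.ini") }

# Delete only the XML files, then reset Cache.ini to contain just "1".
Get-ChildItem $cacheDir.FullName -Filter *.xml | Remove-Item
Set-Content -Path (Join-Path $cacheDir.FullName "Cache.ini") -Value "1"

Start-Service SPTimerV4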
Does anyone know why this keeps occurring and how I can stop it?
First of all, try checking your ULS logs and see if any errors arise.
Secondly, try checking the Event Viewer on your SharePoint server to see if any errors are shown, and make sure you have enough disk space available.
You might also want to check this: Clearing Timer Services.
If you see any error, post it here.
Hope it helps.
Yotam.