I created an image classifier built on ResNet50 to identify dog breeds, using SageMaker Studio. Tuning and training are done and I deployed the model, but prediction against the endpoint fails. I believe this is related to the PID of the worker, because that is the first warning I see.
The CloudWatch log output below says the worker PID is not available yet, and soon after the worker dies.
timestamp,message,logStreamName
1648240674535,"2022-03-25 20:37:54,107 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Initializing plugins manager...",AllTraffic/i-055c5d00e53e84b93
1648240674535,"2022-03-25 20:37:54,188 [INFO ] main org.pytorch.serve.ModelServer - ",AllTraffic/i-055c5d00e53e84b93
1648240674535,Torchserve version: 0.4.0,AllTraffic/i-055c5d00e53e84b93
1648240674535,TS Home: /opt/conda/lib/python3.6/site-packages,AllTraffic/i-055c5d00e53e84b93
1648240674535,Current directory: /,AllTraffic/i-055c5d00e53e84b93
1648240674535,Temp directory: /home/model-server/tmp,AllTraffic/i-055c5d00e53e84b93
1648240674535,Number of GPUs: 0,AllTraffic/i-055c5d00e53e84b93
1648240674535,Number of CPUs: 1,AllTraffic/i-055c5d00e53e84b93
1648240674535,Max heap size: 6838 M,AllTraffic/i-055c5d00e53e84b93
1648240674535,Python executable: /opt/conda/bin/python3.6,AllTraffic/i-055c5d00e53e84b93
1648240674535,Config file: /etc/sagemaker-ts.properties,AllTraffic/i-055c5d00e53e84b93
1648240674535,Inference address: http://0.0.0.0:8080,AllTraffic/i-055c5d00e53e84b93
1648240674535,Management address: http://0.0.0.0:8080,AllTraffic/i-055c5d00e53e84b93
1648240674535,Metrics address: http://127.0.0.1:8082,AllTraffic/i-055c5d00e53e84b93
1648240674535,Model Store: /.sagemaker/ts/models,AllTraffic/i-055c5d00e53e84b93
1648240674535,Initial Models: model.mar,AllTraffic/i-055c5d00e53e84b93
1648240674535,Log dir: /logs,AllTraffic/i-055c5d00e53e84b93
1648240674535,Metrics dir: /logs,AllTraffic/i-055c5d00e53e84b93
1648240674535,Netty threads: 0,AllTraffic/i-055c5d00e53e84b93
1648240674535,Netty client threads: 0,AllTraffic/i-055c5d00e53e84b93
1648240674535,Default workers per model: 1,AllTraffic/i-055c5d00e53e84b93
1648240674535,Blacklist Regex: N/A,AllTraffic/i-055c5d00e53e84b93
1648240674535,Maximum Response Size: 6553500,AllTraffic/i-055c5d00e53e84b93
1648240674536,Maximum Request Size: 6553500,AllTraffic/i-055c5d00e53e84b93
1648240674536,Prefer direct buffer: false,AllTraffic/i-055c5d00e53e84b93
1648240674536,Allowed Urls: [file://.*|http(s)?://.*],AllTraffic/i-055c5d00e53e84b93
1648240674536,Custom python dependency for model allowed: false,AllTraffic/i-055c5d00e53e84b93
1648240674536,Metrics report format: prometheus,AllTraffic/i-055c5d00e53e84b93
1648240674536,Enable metrics API: true,AllTraffic/i-055c5d00e53e84b93
1648240674536,Workflow Store: /.sagemaker/ts/models,AllTraffic/i-055c5d00e53e84b93
1648240674536,"2022-03-25 20:37:54,195 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Loading snapshot serializer plugin...",AllTraffic/i-055c5d00e53e84b93
1648240675536,"2022-03-25 20:37:54,217 [INFO ] main org.pytorch.serve.ModelServer - Loading initial models: model.mar",AllTraffic/i-055c5d00e53e84b93
1648240675536,"2022-03-25 20:37:55,505 [INFO ] main org.pytorch.serve.wlm.ModelManager - Model model loaded.",AllTraffic/i-055c5d00e53e84b93
1648240675786,"2022-03-25 20:37:55,515 [INFO ] main org.pytorch.serve.ModelServer - Initialize Inference server with: EpollServerSocketChannel.",AllTraffic/i-055c5d00e53e84b93
1648240675786,"2022-03-25 20:37:55,569 [INFO ] main org.pytorch.serve.ModelServer - Inference API bind to: http://0.0.0.0:8080",AllTraffic/i-055c5d00e53e84b93
1648240675786,"2022-03-25 20:37:55,569 [INFO ] main org.pytorch.serve.ModelServer - Initialize Metrics server with: EpollServerSocketChannel.",AllTraffic/i-055c5d00e53e84b93
1648240675786,"2022-03-25 20:37:55,569 [INFO ] main org.pytorch.serve.ModelServer - Metrics API bind to: http://127.0.0.1:8082",AllTraffic/i-055c5d00e53e84b93
1648240675786,Model server started.,AllTraffic/i-055c5d00e53e84b93
1648240676036,"2022-03-25 20:37:55,727 [WARN ] pool-2-thread-1 org.pytorch.serve.metrics.MetricCollector - worker pid is not available yet.",AllTraffic/i-055c5d00e53e84b93
1648240676036,"2022-03-25 20:37:55,812 [INFO ] pool-2-thread-1 TS_METRICS - CPUUtilization.Percent:100.0|#Level:Host|#hostname:container-0.local,timestamp:1648240675",AllTraffic/i-055c5d00e53e84b93
1648240676036,"2022-03-25 20:37:55,813 [INFO ] pool-2-thread-1 TS_METRICS - DiskAvailable.Gigabytes:38.02598190307617|#Level:Host|#hostname:container-0.local,timestamp:1648240675",AllTraffic/i-055c5d00e53e84b93
1648240676036,"2022-03-25 20:37:55,813 [INFO ] pool-2-thread-1 TS_METRICS - DiskUsage.Gigabytes:12.715518951416016|#Level:Host|#hostname:container-0.local,timestamp:1648240675",AllTraffic/i-055c5d00e53e84b93
1648240676036,"2022-03-25 20:37:55,814 [INFO ] pool-2-thread-1 TS_METRICS - DiskUtilization.Percent:25.1|#Level:Host|#hostname:container-0.local,timestamp:1648240675",AllTraffic/i-055c5d00e53e84b93
1648240676036,"2022-03-25 20:37:55,815 [INFO ] pool-2-thread-1 TS_METRICS - MemoryAvailable.Megabytes:29583.98046875|#Level:Host|#hostname:container-0.local,timestamp:1648240675",AllTraffic/i-055c5d00e53e84b93
1648240676036,"2022-03-25 20:37:55,815 [INFO ] pool-2-thread-1 TS_METRICS - MemoryUsed.Megabytes:1355.765625|#Level:Host|#hostname:container-0.local,timestamp:1648240675",AllTraffic/i-055c5d00e53e84b93
1648240676036,"2022-03-25 20:37:55,816 [INFO ] pool-2-thread-1 TS_METRICS - MemoryUtilization.Percent:5.7|#Level:Host|#hostname:container-0.local,timestamp:1648240675",AllTraffic/i-055c5d00e53e84b93
1648240676036,"2022-03-25 20:37:55,994 [INFO ] W-9000-model_1-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000",AllTraffic/i-055c5d00e53e84b93
1648240676036,"2022-03-25 20:37:55,994 [INFO ] W-9000-model_1-stdout MODEL_LOG - [PID]48",AllTraffic/i-055c5d00e53e84b93
1648240676036,"2022-03-25 20:37:55,994 [INFO ] W-9000-model_1-stdout MODEL_LOG - Torch worker started.",AllTraffic/i-055c5d00e53e84b93
1648240676036,"2022-03-25 20:37:55,994 [INFO ] W-9000-model_1-stdout MODEL_LOG - Python runtime: 3.6.13",AllTraffic/i-055c5d00e53e84b93
1648240676036,"2022-03-25 20:37:55,999 [INFO ] W-9000-model_1 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.ts.sock.9000",AllTraffic/i-055c5d00e53e84b93
1648240676286,"2022-03-25 20:37:56,006 [INFO ] W-9000-model_1-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000.",AllTraffic/i-055c5d00e53e84b93
1648240676286,"2022-03-25 20:37:56,111 [INFO ] W-9000-model_1-stdout MODEL_LOG - Backend worker process died.",AllTraffic/i-055c5d00e53e84b93
1648240676286,"2022-03-25 20:37:56,111 [INFO ] W-9000-model_1-stdout MODEL_LOG - Traceback (most recent call last):",AllTraffic/i-055c5d00e53e84b93
1648240676286,"2022-03-25 20:37:56,111 [INFO ] W-9000-model_1-stdout MODEL_LOG - File ""/opt/conda/lib/python3.6/site-packages/ts/model_service_worker.py"", line 182, in <module>",AllTraffic/i-055c5d00e53e84b93
1648240676286,"2022-03-25 20:37:56,111 [INFO ] W-9000-model_1-stdout MODEL_LOG - worker.run_server()",AllTraffic/i-055c5d00e53e84b93
1648240676286,"2022-03-25 20:37:56,111 [INFO ] W-9000-model_1-stdout MODEL_LOG - File ""/opt/conda/lib/python3.6/site-packages/ts/model_service_worker.py"", line 154, in run_server",AllTraffic/i-055c5d00e53e84b93
1648240676286,"2022-03-25 20:37:56,111 [INFO ] W-9000-model_1-stdout MODEL_LOG - self.handle_connection(cl_socket)",AllTraffic/i-055c5d00e53e84b93
1648240676286,"2022-03-25 20:37:56,112 [INFO ] W-9000-model_1-stdout MODEL_LOG - File ""/opt/conda/lib/python3.6/site-packages/ts/model_service_worker.py"", line 116, in handle_connection",AllTraffic/i-055c5d00e53e84b93
1648240676286,"2022-03-25 20:37:56,112 [INFO ] W-9000-model_1-stdout MODEL_LOG - service, result, code = self.load_model(msg)",AllTraffic/i-055c5d00e53e84b93
1648240676286,"2022-03-25 20:37:56,112 [INFO ] epollEventLoopGroup-5-1 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED",AllTraffic/i-055c5d00e53e84b93
1648240676286,"2022-03-25 20:37:56,112 [INFO ] W-9000-model_1-stdout MODEL_LOG - File ""/opt/conda/lib/python3.6/site-packages/ts/model_service_worker.py"", line 89, in load_model",AllTraffic/i-055c5d00e53e84b93
1648240676286,"2022-03-25 20:37:56,112 [INFO ] W-9000-model_1-stdout MODEL_LOG - service = model_loader.load(model_name, model_dir, handler, gpu, batch_size, envelope)",AllTraffic/i-055c5d00e53e84b93
1648240676286,"2022-03-25 20:37:56,112 [INFO ] W-9000-model_1-stdout MODEL_LOG - File ""/opt/conda/lib/python3.6/site-packages/ts/model_loader.py"", line 110, in load",AllTraffic/i-055c5d00e53e84b93
1648240676286,"2022-03-25 20:37:56,112 [INFO ] W-9000-model_1-stdout MODEL_LOG - initialize_fn(service.context)",AllTraffic/i-055c5d00e53e84b93
1648240676286,"2022-03-25 20:37:56,113 [WARN ] W-9000-model_1 org.pytorch.serve.wlm.BatchAggregator - Load model failed: model, error: Worker died.",AllTraffic/i-055c5d00e53e84b93
1648240676286,"2022-03-25 20:37:56,113 [INFO ] W-9000-model_1-stdout MODEL_LOG - File ""/home/model-server/tmp/models/23b30361031647d08792d32672910688/handler_service.py"", line 51, in initialize",AllTraffic/i-055c5d00e53e84b93
1648240676286,"2022-03-25 20:37:56,113 [INFO ] W-9000-model_1-stdout MODEL_LOG - super().initialize(context)",AllTraffic/i-055c5d00e53e84b93
1648240676286,"2022-03-25 20:37:56,113 [WARN ] W-9000-model_1 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-model_1-stderr",AllTraffic/i-055c5d00e53e84b93
1648240676286,"2022-03-25 20:37:56,113 [WARN ] W-9000-model_1 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-model_1-stdout",AllTraffic/i-055c5d00e53e84b93
1648240676286,"2022-03-25 20:37:56,113 [INFO ] W-9000-model_1-stdout MODEL_LOG - File ""/opt/conda/lib/python3.6/site-packages/sagemaker_inference/default_handler_service.py"", line 66, in initialize",AllTraffic/i-055c5d00e53e84b93
1648240676286,"2022-03-25 20:37:56,113 [INFO ] W-9000-model_1-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-model_1-stdout",AllTraffic/i-055c5d00e53e84b93
1648240676536,"2022-03-25 20:37:56,114 [INFO ] W-9000-model_1 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 1 seconds.",AllTraffic/i-055c5d00e53e84b93
1648240676536,"2022-03-25 20:37:56,416 [INFO ] W-9000-model_1-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-model_1-stderr",AllTraffic/i-055c5d00e53e84b93
1648240676536,"2022-03-25 20:37:56,461 [INFO ] W-9000-model_1 ACCESS_LOG - /169.254.178.2:39848 ""GET /ping HTTP/1.1"" 200 9",AllTraffic/i-055c5d00e53e84b93
1648240677787,"2022-03-25 20:37:56,461 [INFO ] W-9000-model_1 TS_METRICS - Requests2XX.Count:1|#Level:Host|#hostname:container-0.local,timestamp:null",AllTraffic/i-055c5d00e53e84b93
1648240677787,"2022-03-25 20:37:57,567 [INFO ] W-9000-model_1-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000",AllTraffic/i-055c5d00e53e84b93
1648240677787,"2022-03-25 20:37:57,568 [INFO ] W-9000-model_1-stdout MODEL_LOG - [PID]86",AllTraffic/i-055c5d00e53e84b93
1648240677787,"2022-03-25 20:37:57,568 [INFO ] W-9000-model_1-stdout MODEL_LOG - Torch worker started.",AllTraffic/i-055c5d00e53e84b93
1648240677787,"2022-03-25 20:37:57,568 [INFO ] W-9000-model_1-stdout MODEL_LOG - Python runtime: 3.6.13",AllTraffic/i-055c5d00e53e84b93
1648240677787,"2022-03-25 20:37:57,568 [INFO ] W-9000-model_1 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.ts.sock.9000",AllTraffic/i-055c5d00e53e84b93
1648240677787,"2022-03-25 20:37:57,569 [INFO ] W-9000-model_1-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000.",AllTraffic/i-055c5d00e53e84b93
1648240677787,"2022-03-25 20:37:57,642 [INFO ] W-9000-model_1-stdout MODEL_LOG - Backend worker process died.",AllTraffic/i-055c5d00e53e84b93
1648240677787,"2022-03-25 20:37:57,642 [INFO ] W-9000-model_1-stdout MODEL_LOG - Traceback (most recent call last):",AllTraffic/i-055c5d00e53e84b93
1648240677787,"2022-03-25 20:37:57,642 [INFO ] epollEventLoopGroup-5-2 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED",AllTraffic/i-055c5d00e53e84b93
1648240677787,"2022-03-25 20:37:57,642 [INFO ] W-9000-model_1-stdout MODEL_LOG - File ""/opt/conda/lib/python3.6/site-packages/ts/model_service_worker.py"", line 182, in <module>",AllTraffic/i-055c5d00e53e84b93
1648240677787,"2022-03-25 20:37:57,643 [INFO ] W-9000-model_1-stdout MODEL_LOG - worker.run_server()",AllTraffic/i-055c5d00e53e84b93
1648240677787,"2022-03-25 20:37:57,643 [INFO ] W-9000-model_1-stdout MODEL_LOG - File ""/opt/conda/lib/python3.6/site-packages/ts/model_service_worker.py"", line 154, in run_server",AllTraffic/i-055c5d00e53e84b93
1648240677787,"2022-03-25 20:37:57,643 [WARN ] W-9000-model_1 org.pytorch.serve.wlm.BatchAggregator - Load model failed: model, error: Worker died.",AllTraffic/i-055c5d00e53e84b93
1648240677787,"2022-03-25 20:37:57,643 [INFO ] W-9000-model_1-stdout MODEL_LOG - self.handle_connection(cl_socket)",AllTraffic/i-055c5d00e53e84b93
1648240677787,"2022-03-25 20:37:57,643 [INFO ] W-9000-model_1-stdout MODEL_LOG - File ""/opt/conda/lib/python3.6/site-packages/ts/model_service_worker.py"", line 116, in handle_connection",AllTraffic/i-055c5d00e53e84b93
1648240677787,"2022-03-25 20:37:57,643 [WARN ] W-9000-model_1 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-model_1-stderr",AllTraffic/i-055c5d00e53e84b93
1648240677787,"2022-03-25 20:37:57,643 [INFO ] W-9000-model_1-stdout MODEL_LOG - service, result, code = self.load_model(msg)",AllTraffic/i-055c5d00e53e84b93
1648240677787,"2022-03-25 20:37:57,643 [WARN ] W-9000-model_1 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-model_1-stdout",AllTraffic/i-055c5d00e53e84b93
1648240677787,"2022-03-25 20:37:57,643 [INFO ] W-9000-model_1-stdout MODEL_LOG - File ""/opt/conda/lib/python3.6/site-packages/ts/model_service_worker.py"", line 89, in load_model",AllTraffic/i-055c5d00e53e84b93
1648240677787,"2022-03-25 20:37:57,643 [INFO ] W-9000-model_1-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-model_1-stdout",AllTraffic/i-055c5d00e53e84b93
1648240678037,"2022-03-25 20:37:57,643 [INFO ] W-9000-model_1 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 1 seconds.",AllTraffic/i-055c5d00e53e84b93
1648240679288,"2022-03-25 20:37:57,991 [INFO ] W-9000-model_1-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-model_1-stderr",AllTraffic/i-055c5d00e53e84b93
1648240679288,"2022-03-25 20:37:59,096 [INFO ] W-9000-model_1-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000",AllTraffic/i-055c5d00e53e84b93
1648240679288,"2022-03-25 20:37:59,097 [INFO ] W-9000-model_1-stdout MODEL_LOG - [PID]114",AllTraffic/i-055c5d00e53e84b93
Model tuning and training came out alright, so I'm not sure why it won't predict. Someone mentioned that it might be due to the entry point script, but I don't know what would cause it to fail at prediction after deployment if it can predict fine during training.
Entry point script:
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.models as models
import torchvision.transforms as transforms
import json
import copy
import argparse
import os
import logging
import sys
from tqdm import tqdm
from PIL import ImageFile
import smdebug.pytorch as smd
ImageFile.LOAD_TRUNCATED_IMAGES = True
logger=logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler(sys.stdout))
def test(model, test_loader, criterion, hook):
    model.eval()
    running_loss=0
    running_corrects=0
    hook.set_mode(smd.modes.EVAL)
    for inputs, labels in test_loader:
        outputs=model(inputs)
        loss=criterion(outputs, labels)
        _, preds = torch.max(outputs, 1)
        running_loss += loss.item() * inputs.size(0)
        running_corrects += torch.sum(preds == labels.data)
    ##total_loss = running_loss // len(test_loader)
    ##total_acc = running_corrects.double() // len(test_loader)
    ##logger.info(f"Testing Loss: {total_loss}")
    ##logger.info(f"Testing Accuracy: {total_acc}")
    logger.info("New test acc")
    logger.info(f'Test set: Accuracy: {running_corrects}/{len(test_loader.dataset)} = {100*(running_corrects/len(test_loader.dataset))}%)')
def train(model, train_loader, validation_loader, criterion, optimizer, hook):
    epochs=50
    best_loss=1e6
    image_dataset={'train':train_loader, 'valid':validation_loader}
    loss_counter=0
    hook.set_mode(smd.modes.TRAIN)
    for epoch in range(epochs):
        logger.info(f"Epoch: {epoch}")
        for phase in ['train', 'valid']:
            if phase=='train':
                model.train()
                logger.info("Model Trained")
            else:
                model.eval()
            running_loss = 0.0
            running_corrects = 0
            for inputs, labels in image_dataset[phase]:
                outputs = model(inputs)
                loss = criterion(outputs, labels)
                if phase=='train':
                    optimizer.zero_grad()
                    loss.backward()
                    optimizer.step()
                    logger.info("Model Optimized")
                _, preds = torch.max(outputs, 1)
                running_loss += loss.item() * inputs.size(0)
                running_corrects += torch.sum(preds == labels.data)
            epoch_loss = running_loss // len(image_dataset[phase])
            epoch_acc = running_corrects // len(image_dataset[phase])
            if phase=='valid':
                logger.info("Model Validating")
                if epoch_loss<best_loss:
                    best_loss=epoch_loss
                else:
                    loss_counter+=1
                    logger.info(loss_counter)
            '''logger.info('{} loss: {:.4f}, acc: {:.4f}, best loss: {:.4f}'.format(phase,
                                                                                 epoch_loss,
                                                                                 epoch_acc,
                                                                                 best_loss))'''
            if phase=="train":
                logger.info("New epoch acc for Train:")
                logger.info(f"Epoch {epoch}: Loss {loss_counter/len(train_loader.dataset)}, Accuracy {100*(running_corrects/len(train_loader.dataset))}%")
            if phase=="valid":
                logger.info("New epoch acc for Valid:")
                logger.info(f"Epoch {epoch}: Loss {loss_counter/len(train_loader.dataset)}, Accuracy {100*(running_corrects/len(train_loader.dataset))}%")
        ##if loss_counter==1:
        ##    break
        ##if epoch==0:
        ##    break
    return model
def net():
    model = models.resnet50(pretrained=True)
    for param in model.parameters():
        param.requires_grad = False
    model.fc = nn.Sequential(
        nn.Linear(2048, 128),
        nn.ReLU(inplace=True),
        nn.Linear(128, 133))
    return model
def create_data_loaders(data, batch_size):
    train_data_path = os.path.join(data, 'train')
    test_data_path = os.path.join(data, 'test')
    validation_data_path=os.path.join(data, 'valid')
    train_transform = transforms.Compose([
        transforms.RandomResizedCrop((224, 224)),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        ])
    test_transform = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
        ])
    train_data = torchvision.datasets.ImageFolder(root=train_data_path, transform=train_transform)
    train_data_loader = torch.utils.data.DataLoader(train_data, batch_size=batch_size, shuffle=True)
    test_data = torchvision.datasets.ImageFolder(root=test_data_path, transform=test_transform)
    test_data_loader = torch.utils.data.DataLoader(test_data, batch_size=batch_size, shuffle=True)
    validation_data = torchvision.datasets.ImageFolder(root=validation_data_path, transform=test_transform)
    validation_data_loader = torch.utils.data.DataLoader(validation_data, batch_size=batch_size, shuffle=True)
    return train_data_loader, test_data_loader, validation_data_loader
def main(args):
    logger.info(f'Hyperparameters are LR: {args.lr}, Batch Size: {args.batch_size}')
    logger.info(f'Data Paths: {args.data}')
    train_loader, test_loader, validation_loader=create_data_loaders(args.data, args.batch_size)
    model=net()
    hook = smd.Hook.create_from_json_file()
    hook.register_hook(model)
    criterion = nn.CrossEntropyLoss(ignore_index=133)
    optimizer = optim.Adam(model.fc.parameters(), lr=args.lr)
    logger.info("Starting Model Training")
    model=train(model, train_loader, validation_loader, criterion, optimizer, hook)
    logger.info("Testing Model")
    test(model, test_loader, criterion, hook)
    logger.info("Saving Model")
    torch.save(model.cpu().state_dict(), os.path.join(args.model_dir, "model.pth"))
if __name__=='__main__':
    parser=argparse.ArgumentParser()
    '''
    TODO: Specify any training args that you might need
    '''
    parser.add_argument(
        "--batch-size",
        type=int,
        default=64,
        metavar="N",
        help="input batch size for training (default: 64)",
    )
    parser.add_argument(
        "--test-batch-size",
        type=int,
        default=1000,
        metavar="N",
        help="input batch size for testing (default: 1000)",
    )
    parser.add_argument(
        "--epochs",
        type=int,
        default=5,
        metavar="N",
        help="number of epochs to train (default: 10)",
    )
    parser.add_argument(
        "--lr", type=float, default=0.01, metavar="LR", help="learning rate (default: 0.01)"
    )
    parser.add_argument(
        "--momentum", type=float, default=0.5, metavar="M", help="SGD momentum (default: 0.5)"
    )
    # Container environment
    parser.add_argument("--hosts", type=list, default=json.loads(os.environ["SM_HOSTS"]))
    parser.add_argument("--current-host", type=str, default=os.environ["SM_CURRENT_HOST"])
    parser.add_argument("--model-dir", type=str, default=os.environ["SM_MODEL_DIR"])
    parser.add_argument("--data", type=str, default=os.environ["SM_CHANNEL_TRAINING"])
    parser.add_argument("--num-gpus", type=int, default=os.environ["SM_NUM_GPUS"])
    args=parser.parse_args()
    main(args)
To test the model on the endpoint, I sent an image using the following code:
from sagemaker.serializers import IdentitySerializer
import base64
predictor.serializer = IdentitySerializer("image/png")
with open("Akita_00282.jpg", "rb") as f:
    payload = f.read()
response = predictor.predict(payload)
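The model-serving workers are dying either because they cannot load your model or because they cannot deserialize the payload you send them. The traceback in your log ends inside the initialize call of handler_service.py, which is the model-loading step. The "worker pid is not available yet" warning is logged before the first worker has started ([PID]48 appears a moment later), so it is most likely not the cause.
Note that you have to provide a model_fn implementation. The SageMaker PyTorch documentation explains how to adapt inference scripts for SageMaker deployment. If you do not want to override the input_fn, predict_fn, and/or output_fn handlers, the inference toolkit provides default implementations of them.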
I installed Cygnus using RPMs on the FIWARE CentOS-7-x64 image and I can't start it as a service. Here are my logs:
[centos@cygnus-mongo conf]$ sudo service cygnus start
Starting cygnus (via systemctl): Job for cygnus.service failed. See 'systemctl status cygnus.service' and 'journalctl -xn' for details.
[FAILED]
[centos@cygnus-mongo conf]$ sudo journalctl -xn
-- Logs begin at mer. 2015-10-07 07:48:29 UTC, end at mer. 2015-10-07 10:02:35 UTC. --
oct. 07 10:02:20 cygnus-mongo.novalocal su[5700]: pam_unix(su:session): session closed for user cygnus
oct. 07 10:02:22 cygnus-mongo.novalocal cygnus[5695]: cat: /var/run/cygnus/cygnus_mongo.pid: No such file or directory
oct. 07 10:02:22 cygnus-mongo.novalocal cygnus[5695]: [FAILED]
oct. 07 10:02:22 cygnus-mongo.novalocal cygnus[5695]: rm: cannot remove ‘/var/run/cygnus/cygnus_mongo.pid’: No such file or directory
oct. 07 10:02:22 cygnus-mongo.novalocal systemd[1]: cygnus.service: control process exited, code=exited status=1
oct. 07 10:02:22 cygnus-mongo.novalocal systemd[1]: Failed to start SYSV: cygnus.
-- Subject: Unit cygnus.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit cygnus.service has failed.
--
-- The result is failed.
oct. 07 10:02:22 cygnus-mongo.novalocal systemd[1]: Unit cygnus.service entered failed state.
oct. 07 10:02:34 cygnus-mongo.novalocal dhclient[1064]: DHCPREQUEST on eth0 to 192.168.111.71 port 67 (xid=0x761299ef)
oct. 07 10:02:34 cygnus-mongo.novalocal dhclient[1064]: DHCPACK from 192.168.111.71 (xid=0x761299ef)
oct. 07 10:02:35 cygnus-mongo.novalocal sudo[5774]: centos : TTY=pts/0 ; PWD=/usr/cygnus/conf ; USER=root ; COMMAND=/bin/journalctl -xn
In fact, the directory /var/run/cygnus was not created. Is it supposed to be created automatically?
Here are my configuration files:
agent_mongo.conf
cygnusagent.sources = http-source
cygnusagent.sinks = mongo-sink
cygnusagent.channels = mongo-channel
#=============================================
# source configuration
# channel name where to write the notification events
cygnusagent.sources.http-source.channels = mongo-channel
# source class, must not be changed
cygnusagent.sources.http-source.type = org.apache.flume.source.http.HTTPSource
# listening port the Flume source will use for receiving incoming notifications
cygnusagent.sources.http-source.port = 5050
# Flume handler that will parse the notifications, must not be changed
cygnusagent.sources.http-source.handler = com.telefonica.iot.cygnus.handlers.OrionRestHandler
# URL target
cygnusagent.sources.http-source.handler.notification_target = /notify
# Default service (service semantic depends on the persistence sink)
cygnusagent.sources.http-source.handler.default_service = def_serv
# Default service path (service path semantic depends on the persistence sink)
cygnusagent.sources.http-source.handler.default_service_path = def_servpath
# Number of channel re-injection retries before a Flume event is definitely discarded (-1 means infinite retries)
cygnusagent.sources.http-source.handler.events_ttl = 10
# Source interceptors, do not change
cygnusagent.sources.http-source.interceptors = ts gi
# TimestampInterceptor, do not change
cygnusagent.sources.http-source.interceptors.ts.type = timestamp
# GroupingInterceptor, do not change
cygnusagent.sources.http-source.interceptors.gi.type = com.telefonica.iot.cygnus.interceptors.GroupingInterceptor$Builder
# Grouping rules for the GroupingInterceptor, put the right absolute path to the file if necessary
# See the doc/design/interceptors document for more details
cygnusagent.sources.http-source.interceptors.gi.grouping_rules_conf_file = /usr/cygnus/conf/grouping_rules.conf
# ============================================
# OrionMongoSink configuration
# sink class, must not be changed
cygnusagent.sinks.mongo-sink.type = com.telefonica.iot.cygnus.sinks.OrionMongoSink
# channel name from where to read notification events
cygnusagent.sinks.mongo-sink.channel = mongo-channel
# FQDN/IP:port where the MongoDB server runs (standalone case) or comma-separated list of FQDN/IP:port pairs where the MongoDB replica set members run
cygnusagent.sinks.mongo-sink.mongo_hosts = 127.0.0.1:27017
# a valid user in the MongoDB server (or empty if authentication is not enabled in MongoDB)
cygnusagent.sinks.mongo-sink.mongo_username =
# password for the user above (or empty if authentication is not enabled in MongoDB)
cygnusagent.sinks.mongo-sink.mongo_password =
# prefix for the MongoDB databases
cygnusagent.sinks.mongo-sink.db_prefix = kura_
# prefix for the MongoDB collections
cygnusagent.sinks.mongo-sink.collection_prefix = kura_
# true if collection names are based on a hash, false for human-readable collections
cygnusagent.sinks.mongo-sink.should_hash = false
#=============================================
# mongo-channel configuration
# channel type (must not be changed)
cygnusagent.channels.mongo-channel.type = memory
# capacity of the channel
cygnusagent.channels.mongo-channel.capacity = 1000
# maximum number of events that can be sent per transaction
cygnusagent.channels.mongo-channel.transactionCapacity = 100
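The source, sink and channel wiring in agent_mongo.conf can be sanity-checked with a short script. This is only a sketch: it assumes a simple key = value parser is adequate for this file, the helper names parse_properties and undeclared_channels are hypothetical, and the embedded text is a subset of the configuration above.

```python
# Subset of agent_mongo.conf that defines the wiring between components.
CONF = """\
cygnusagent.sources = http-source
cygnusagent.sinks = mongo-sink
cygnusagent.channels = mongo-channel
cygnusagent.sources.http-source.channels = mongo-channel
cygnusagent.sinks.mongo-sink.channel = mongo-channel
cygnusagent.channels.mongo-channel.type = memory
"""


def parse_properties(text):
    """Parse 'key = value' lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def undeclared_channels(props, agent):
    """Return (key, channel) pairs that reference a channel the agent never declares."""
    declared = set(props.get(f"{agent}.channels", "").split())
    prefix = agent + "."
    problems = []
    for key, value in props.items():
        if not key.startswith(prefix):
            continue
        parts = key[len(prefix):].split(".")
        if len(parts) == 3 and parts[0] == "sources" and parts[2] == "channels":
            used = value.split()   # a source may write to several channels
        elif len(parts) == 3 and parts[0] == "sinks" and parts[2] == "channel":
            used = [value]         # a sink reads from exactly one channel
        else:
            continue
        problems += [(key, name) for name in used if name not in declared]
    return problems


props = parse_properties(CONF)
print(undeclared_channels(props, "cygnusagent"))
```

An empty list means every channel referenced by a source or sink is declared under cygnusagent.channels, which suggests the wiring in this file is not the reason the service fails to start.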
cygnus_instance_mongo.conf :
# Who to run cygnus as. Note that you may need to use root if you want
# to run cygnus in a privileged port (<1024)
CYGNUS_USER=cygnus
# Where is the config folder
CONFIG_FOLDER=/usr/cygnus/conf
# Which is the config file
CONFIG_FILE=/usr/cygnus/conf/agent_mongo.conf
# Name of the agent. The name of the agent is not trivial, since it is the base for the Flume parameters
# naming conventions, e.g. it appears in .sources.http-source.channels=...
AGENT_NAME=cygnusagent
# Name of the logfile located at /var/log/cygnus. It is important to use the extension '.log' so that log rotation works properly
LOGFILE_NAME=cygnus.log
# Administration port. Must be unique per instance
ADMIN_PORT=8081
# Polling interval (seconds) for the configuration reloading
POLLING_INTERVAL=30
Edit: adding the logs from launching Cygnus as a standalone application:
[centos@cygnus-mongo iot]$ ./cygnus.sh
+ exec /usr/lib/jvm/java-1.6.0-openjdk.x86_64/bin/java -Xmx20m -Dflume.root.logger=INFO,console -cp '/usr/cygnus/conf:/usr/cygnus/lib/*:/usr/cygnus/plugins.d/cygnus/lib/*:/usr/cygnus/plugins.d/cygnus/libext/*' -Djava.library.path= com.telefonica.iot.cygnus.nodes.CygnusApplication -f /usr/cygnus/conf/agent_mongo.conf -n cygnusagent
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/cygnus/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/cygnus/plugins.d/cygnus/lib/cygnus-0.8.2-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
2015-10-08 15:50:32,629 (main) [INFO - com.telefonica.iot.cygnus.nodes.CygnusApplication.main(CygnusApplication.java:235)] Starting a Jetty server listening on port 8081 (Management Interface)
2015-10-08 15:50:32,655 (main) [INFO - org.mortbay.log.Slf4jLog.info(Slf4jLog.java:67)] Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog
2015-10-08 15:50:32,656 (main) [INFO - com.telefonica.iot.cygnus.nodes.CygnusApplication.main(CygnusApplication.java:238)] Starting Cygnus application
2015-10-08 15:50:32,656 (Thread-1) [INFO - org.mortbay.log.Slf4jLog.info(Slf4jLog.java:67)] jetty-6.1.26
2015-10-08 15:50:32,684 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.node.PollingPropertiesFileConfigurationProvider.start(PollingPropertiesFileConfigurationProvider.java:61)] Configuration provider starting
2015-10-08 15:50:32,694 (conf-file-poller-0) [INFO - org.apache.flume.node.PollingPropertiesFileConfigurationProvider$FileWatcherRunnable.run(PollingPropertiesFileConfigurationProvider.java:133)] Reloading configuration file:/usr/cygnus/conf/agent_mongo.conf
2015-10-08 15:50:32,714 (conf-file-poller-0) [WARN - org.apache.flume.conf.FlumeConfiguration.<init>(FlumeConfiguration.java:101)] Configuration property ignored: cygnusagent.sinks.mongo-sink.mongo_username =
2015-10-08 15:50:32,714 (conf-file-poller-0) [INFO - org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.addProperty(FlumeConfiguration.java:1016)] Processing:mongo-sink
2015-10-08 15:50:32,715 (conf-file-poller-0) [INFO - org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.addProperty(FlumeConfiguration.java:1016)] Processing:mongo-sink
2015-10-08 15:50:32,715 (conf-file-poller-0) [WARN - org.apache.flume.conf.FlumeConfiguration.<init>(FlumeConfiguration.java:101)] Configuration property ignored: cygnusagent.sinks.mongo-sink.mongo_password =
2015-10-08 15:50:32,715 (conf-file-poller-0) [INFO - org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.addProperty(FlumeConfiguration.java:930)] Added sinks: mongo-sink Agent: cygnusagent
2015-10-08 15:50:32,716 (conf-file-poller-0) [INFO - org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.addProperty(FlumeConfiguration.java:1016)] Processing:mongo-sink
2015-10-08 15:50:32,716 (conf-file-poller-0) [INFO - org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.addProperty(FlumeConfiguration.java:1016)] Processing:mongo-sink
2015-10-08 15:50:32,716 (conf-file-poller-0) [INFO - org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.addProperty(FlumeConfiguration.java:1016)] Processing:mongo-sink
2015-10-08 15:50:32,716 (conf-file-poller-0) [INFO - org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.addProperty(FlumeConfiguration.java:1016)] Processing:mongo-sink
2015-10-08 15:50:32,731 (Thread-1) [INFO - org.mortbay.log.Slf4jLog.info(Slf4jLog.java:67)] Started SocketConnector@0.0.0.0:8081
2015-10-08 15:50:32,744 (conf-file-poller-0) [INFO - org.apache.flume.conf.FlumeConfiguration.validateConfiguration(FlumeConfiguration.java:140)] Post-validation flume configuration contains configuration for agents: [cygnusagent]
2015-10-08 15:50:32,745 (conf-file-poller-0) [INFO - org.apache.flume.node.AbstractConfigurationProvider.loadChannels(AbstractConfigurationProvider.java:150)] Creating channels
2015-10-08 15:50:32,758 (conf-file-poller-0) [INFO - org.apache.flume.channel.DefaultChannelFactory.create(DefaultChannelFactory.java:40)] Creating instance of channel mongo-channel type memory
2015-10-08 15:50:32,765 (conf-file-poller-0) [INFO - org.apache.flume.node.AbstractConfigurationProvider.loadChannels(AbstractConfigurationProvider.java:205)] Created channel mongo-channel
2015-10-08 15:50:32,766 (conf-file-poller-0) [INFO - org.apache.flume.source.DefaultSourceFactory.create(DefaultSourceFactory.java:39)] Creating instance of source http-source, type org.apache.flume.source.http.HTTPSource
2015-10-08 15:50:32,782 (conf-file-poller-0) [INFO - com.telefonica.iot.cygnus.handlers.OrionRestHandler.<init>(OrionRestHandler.java:75)] Cygnus version (0.8.2.UNKNOWN)
2015-10-08 15:50:32,808 (conf-file-poller-0) [INFO - com.telefonica.iot.cygnus.handlers.OrionRestHandler.configure(OrionRestHandler.java:141)] Startup completed
2015-10-08 15:50:32,836 (conf-file-poller-0) [INFO - org.apache.flume.sink.DefaultSinkFactory.create(DefaultSinkFactory.java:40)] Creating instance of sink: mongo-sink, type: com.telefonica.iot.cygnus.sinks.OrionMongoSink
2015-10-08 15:50:32,856 (conf-file-poller-0) [INFO - org.apache.flume.node.AbstractConfigurationProvider.getConfiguration(AbstractConfigurationProvider.java:119)] Channel mongo-channel connected to [http-source, mongo-sink]
2015-10-08 15:50:32,872 (conf-file-poller-0) [INFO - org.apache.flume.node.Application.startAllComponents(Application.java:138)] Starting new configuration:{ sourceRunners:{http-source=EventDrivenSourceRunner: { source:org.apache.flume.source.http.HTTPSource{name:http-source,state:IDLE} }} sinkRunners:{mongo-sink=SinkRunner: { policy:org.apache.flume.sink.DefaultSinkProcessor#7caba647 counterGroup:{ name:null counters:{} } }} channels:{mongo-channel=org.apache.flume.channel.MemoryChannel{name: mongo-channel}} }
2015-10-08 15:50:32,872 (conf-file-poller-0) [INFO - org.apache.flume.node.Application.startAllComponents(Application.java:145)] Starting Channel mongo-channel
2015-10-08 15:50:32,968 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.register(MonitoredCounterGroup.java:110)] Monitoried counter group for type: CHANNEL, name: mongo-channel, registered successfully.
2015-10-08 15:50:32,968 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.start(MonitoredCounterGroup.java:94)] Component type: CHANNEL, name: mongo-channel started
2015-10-08 15:50:32,969 (conf-file-poller-0) [INFO - org.apache.flume.node.Application.startAllComponents(Application.java:173)] Starting Sink mongo-sink
2015-10-08 15:50:32,970 (conf-file-poller-0) [INFO - org.apache.flume.node.Application.startAllComponents(Application.java:184)] Starting Source http-source
2015-10-08 15:50:32,972 (lifecycleSupervisor-1-4) [INFO - com.telefonica.iot.cygnus.interceptors.GroupingInterceptor.initialize(GroupingInterceptor.java:92)] Grouping rules read:
2015-10-08 15:50:32,974 (conf-file-poller-0) [INFO - org.apache.flume.node.Application.stopAllComponents(Application.java:101)] Shutting down configuration: { sourceRunners:{http-source=EventDrivenSourceRunner: { source:org.apache.flume.source.http.HTTPSource{name:http-source,state:IDLE} }} sinkRunners:{mongo-sink=SinkRunner: { policy:org.apache.flume.sink.DefaultSinkProcessor#7caba647 counterGroup:{ name:null counters:{} } }} channels:{mongo-channel=org.apache.flume.channel.MemoryChannel{name: mongo-channel}} }
2015-10-08 15:50:32,975 (conf-file-poller-0) [INFO - org.apache.flume.node.Application.stopAllComponents(Application.java:105)] Stopping Source http-source
2015-10-08 15:50:32,978 (lifecycleSupervisor-1-1) [INFO - com.telefonica.iot.cygnus.sinks.OrionMongoBaseSink.start(OrionMongoBaseSink.java:139)] [mongo-sink] Startup completed
2015-10-08 15:50:32,984 (lifecycleSupervisor-1-4) [ERROR - com.telefonica.iot.cygnus.interceptors.GroupingInterceptor.parseGroupingRules(GroupingInterceptor.java:165)] Error while parsing the Json-based grouping rules file. Details=null
2015-10-08 15:50:32,984 (lifecycleSupervisor-1-4) [WARN - com.telefonica.iot.cygnus.interceptors.GroupingInterceptor.initialize(GroupingInterceptor.java:98)] Grouping rules syntax has errors
2015-10-08 15:50:33,030 (lifecycleSupervisor-1-4) [INFO - org.mortbay.log.Slf4jLog.info(Slf4jLog.java:67)] jetty-6.1.26
2015-10-08 15:50:33,081 (lifecycleSupervisor-1-4) [INFO - org.mortbay.log.Slf4jLog.info(Slf4jLog.java:67)] Started SocketConnector#0.0.0.0:5050
2015-10-08 15:50:33,082 (lifecycleSupervisor-1-4) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.register(MonitoredCounterGroup.java:110)] Monitoried counter group for type: SOURCE, name: http-source, registered successfully.
2015-10-08 15:50:33,082 (lifecycleSupervisor-1-4) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.start(MonitoredCounterGroup.java:94)] Component type: SOURCE, name: http-source started
2015-10-08 15:50:33,083 (conf-file-poller-0) [INFO - org.apache.flume.lifecycle.LifecycleSupervisor.unsupervise(LifecycleSupervisor.java:171)] Stopping component: EventDrivenSourceRunner: { source:org.apache.flume.source.http.HTTPSource{name:http-source,state:START} }
2015-10-08 15:50:33,083 (conf-file-poller-0) [INFO - org.mortbay.log.Slf4jLog.info(Slf4jLog.java:67)] Stopped SocketConnector#0.0.0.0:5050
2015-10-08 15:50:33,185 (conf-file-poller-0) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.stop(MonitoredCounterGroup.java:139)] Component type: SOURCE, name: http-source stopped
2015-10-08 15:50:33,185 (conf-file-poller-0) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.stop(MonitoredCounterGroup.java:145)] Shutdown Metric for type: SOURCE, name: http-source. source.start.time == 1444319433082
2015-10-08 15:50:33,185 (conf-file-poller-0) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.stop(MonitoredCounterGroup.java:151)] Shutdown Metric for type: SOURCE, name: http-source. source.stop.time == 1444319433185
2015-10-08 15:50:33,186 (conf-file-poller-0) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.stop(MonitoredCounterGroup.java:167)] Shutdown Metric for type: SOURCE, name: http-source. src.append-batch.accepted == 0
2015-10-08 15:50:33,186 (conf-file-poller-0) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.stop(MonitoredCounterGroup.java:167)] Shutdown Metric for type: SOURCE, name: http-source. src.append-batch.received == 0
2015-10-08 15:50:33,186 (conf-file-poller-0) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.stop(MonitoredCounterGroup.java:167)] Shutdown Metric for type: SOURCE, name: http-source. src.append.accepted == 0
2015-10-08 15:50:33,186 (conf-file-poller-0) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.stop(MonitoredCounterGroup.java:167)] Shutdown Metric for type: SOURCE, name: http-source. src.append.received == 0
2015-10-08 15:50:33,187 (conf-file-poller-0) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.stop(MonitoredCounterGroup.java:167)] Shutdown Metric for type: SOURCE, name: http-source. src.events.accepted == 0
2015-10-08 15:50:33,187 (conf-file-poller-0) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.stop(MonitoredCounterGroup.java:167)] Shutdown Metric for type: SOURCE, name: http-source. src.events.received == 0
2015-10-08 15:50:33,187 (conf-file-poller-0) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.stop(MonitoredCounterGroup.java:167)] Shutdown Metric for type: SOURCE, name: http-source. src.open-connection.count == 0
2015-10-08 15:50:33,187 (conf-file-poller-0) [INFO - org.apache.flume.source.http.HTTPSource.stop(HTTPSource.java:172)] Http source http-source stopped. Metrics: SOURCE:http-source{src.events.accepted=0, src.events.received=0, src.append.accepted=0, src.append-batch.accepted=0, src.open-connection.count=0, src.append-batch.received=0, src.append.received=0}
2015-10-08 15:50:33,187 (conf-file-poller-0) [INFO - org.apache.flume.node.Application.stopAllComponents(Application.java:115)] Stopping Sink mongo-sink
2015-10-08 15:50:33,188 (conf-file-poller-0) [INFO - org.apache.flume.lifecycle.LifecycleSupervisor.unsupervise(LifecycleSupervisor.java:171)] Stopping component: SinkRunner: { policy:org.apache.flume.sink.DefaultSinkProcessor#7caba647 counterGroup:{ name:null counters:{} } }
2015-10-08 15:50:33,189 (conf-file-poller-0) [INFO - org.apache.flume.node.Application.stopAllComponents(Application.java:125)] Stopping Channel mongo-channel
2015-10-08 15:50:33,190 (conf-file-poller-0) [INFO - org.apache.flume.lifecycle.LifecycleSupervisor.unsupervise(LifecycleSupervisor.java:171)] Stopping component: org.apache.flume.channel.MemoryChannel{name: mongo-channel}
2015-10-08 15:50:33,190 (conf-file-poller-0) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.stop(MonitoredCounterGroup.java:139)] Component type: CHANNEL, name: mongo-channel stopped
2015-10-08 15:50:33,190 (conf-file-poller-0) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.stop(MonitoredCounterGroup.java:145)] Shutdown Metric for type: CHANNEL, name: mongo-channel. channel.start.time == 1444319432968
2015-10-08 15:50:33,190 (conf-file-poller-0) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.stop(MonitoredCounterGroup.java:151)] Shutdown Metric for type: CHANNEL, name: mongo-channel. channel.stop.time == 1444319433190
2015-10-08 15:50:33,190 (conf-file-poller-0) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.stop(MonitoredCounterGroup.java:167)] Shutdown Metric for type: CHANNEL, name: mongo-channel. channel.capacity == 1000
2015-10-08 15:50:33,190 (conf-file-poller-0) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.stop(MonitoredCounterGroup.java:167)] Shutdown Metric for type: CHANNEL, name: mongo-channel. channel.current.size == 0
2015-10-08 15:50:33,191 (conf-file-poller-0) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.stop(MonitoredCounterGroup.java:167)] Shutdown Metric for type: CHANNEL, name: mongo-channel. channel.event.put.attempt == 0
2015-10-08 15:50:33,191 (conf-file-poller-0) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.stop(MonitoredCounterGroup.java:167)] Shutdown Metric for type: CHANNEL, name: mongo-channel. channel.event.put.success == 0
2015-10-08 15:50:33,191 (conf-file-poller-0) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.stop(MonitoredCounterGroup.java:167)] Shutdown Metric for type: CHANNEL, name: mongo-channel. channel.event.take.attempt == 1
2015-10-08 15:50:33,191 (conf-file-poller-0) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.stop(MonitoredCounterGroup.java:167)] Shutdown Metric for type: CHANNEL, name: mongo-channel. channel.event.take.success == 0
2015-10-08 15:50:33,191 (conf-file-poller-0) [INFO - org.apache.flume.node.Application.startAllComponents(Application.java:138)] Starting new configuration:{ sourceRunners:{http-source=EventDrivenSourceRunner: { source:org.apache.flume.source.http.HTTPSource{name:http-source,state:START} }} sinkRunners:{mongo-sink=SinkRunner: { policy:org.apache.flume.sink.DefaultSinkProcessor#7caba647 counterGroup:{ name:null counters:{runner.backoffs.consecutive=0} } }} channels:{mongo-channel=org.apache.flume.channel.MemoryChannel{name: mongo-channel}} }
2015-10-08 15:50:33,191 (conf-file-poller-0) [INFO - org.apache.flume.node.Application.startAllComponents(Application.java:145)] Starting Channel mongo-channel
2015-10-08 15:50:33,192 (lifecycleSupervisor-1-2) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.start(MonitoredCounterGroup.java:94)] Component type: CHANNEL, name: mongo-channel started
2015-10-08 15:50:33,192 (conf-file-poller-0) [INFO - org.apache.flume.node.Application.startAllComponents(Application.java:173)] Starting Sink mongo-sink
2015-10-08 15:50:33,193 (conf-file-poller-0) [INFO - org.apache.flume.node.Application.startAllComponents(Application.java:184)] Starting Source http-source
2015-10-08 15:50:33,193 (lifecycleSupervisor-1-1) [INFO - com.telefonica.iot.cygnus.sinks.OrionMongoBaseSink.start(OrionMongoBaseSink.java:139)] [mongo-sink] Startup completed
2015-10-08 15:50:33,194 (lifecycleSupervisor-1-6) [INFO - com.telefonica.iot.cygnus.interceptors.GroupingInterceptor.initialize(GroupingInterceptor.java:92)] Grouping rules read:
2015-10-08 15:50:33,194 (lifecycleSupervisor-1-6) [ERROR - com.telefonica.iot.cygnus.interceptors.GroupingInterceptor.parseGroupingRules(GroupingInterceptor.java:165)] Error while parsing the Json-based grouping rules file. Details=null
2015-10-08 15:50:33,194 (lifecycleSupervisor-1-6) [WARN - com.telefonica.iot.cygnus.interceptors.GroupingInterceptor.initialize(GroupingInterceptor.java:98)] Grouping rules syntax has errors
2015-10-08 15:50:33,195 (lifecycleSupervisor-1-6) [INFO - org.mortbay.log.Slf4jLog.info(Slf4jLog.java:67)] jetty-6.1.26
2015-10-08 15:50:33,197 (lifecycleSupervisor-1-6) [INFO - org.mortbay.log.Slf4jLog.info(Slf4jLog.java:67)] Started SocketConnector#0.0.0.0:5050
2015-10-08 15:50:33,197 (lifecycleSupervisor-1-6) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.start(MonitoredCounterGroup.java:94)] Component type: SOURCE, name: http-source started
Cygnus is supposed to create /var/run/cygnus/ when it starts. You can check the path specification here, and the directory creation and PID assignment here.
I wonder what the permissions on your /var/run are. Maybe they are too restrictive for the cygnus user.
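One quick way to check this, as a rough sketch: the helper below prints the ownership and mode of a directory and says whether the user running it can write there. The name `check_dir` is made up for illustration, and I'm assuming the PID directory is /var/run/cygnus as described above.

```shell
# Hypothetical helper: show a directory's permissions and report whether
# the current user can write to it.
check_dir() {
    dir="$1"
    if [ ! -d "$dir" ]; then
        echo "missing: $dir"
    elif [ -w "$dir" ]; then
        ls -ldL "$dir"          # -L follows a symlink such as /var/run -> /run
        echo "writable: $dir"
    else
        ls -ldL "$dir"
        echo "not writable: $dir"
    fi
}

check_dir /var/run
check_dir /var/run/cygnus
```

The result only tells you something if the script is run as the same user the service runs as (presumably cygnus). Run as root, the write test will always pass.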
Anyway, are you able to run Cygnus as a standalone application (not as a service) without errors? That is, by running this command:
$ /usr/cygnus/bin/cygnus-flume-ng agent --conf /usr/cygnus/conf -f /usr/cygnus/conf/agent_mongo.conf -n cygnusagent -Dflume.root.logger=INFO,console