Data Science/Engineering (Dev/Prod) Environment

Data Science/Engineering (Dev/Prod) Environment - google-cloud-platform

I am going to create environments. For now i have gcp machine and i run jupyter in there. Everytime, i need start it, and with 3 people it is hard to work in same environment. I know, there is docker, jupyter hub, but did not find and suitable roadmap to create dev/prod environment.
My aim to create dev and production environment. Everything should be on GCP.
Any suggested path ?
Thanks

You can take a look at the best practices for enterprise organizations. In order to properly split resources it's often advised to use different projects. However, depending on the GCP product, you could also use versions, such as with App Engine (see this StackOverflow thread).

Related

Machine Learning (NLP) on AWS. Cloud9? SageMaker? EC2-AMI?

I have finally arrived in the cloud to put my NLP work to the next level, but I am a bit overwhelmed with all the possibilities I have. So I am coming to you for advice.
Currently I see three possibilities:
SageMaker
Jupyter Notebooks are great
It's quick and simple
saves a lot of time spent on managing everything, you can very easily get the model into production
costs more
no version control
Cloud9
EC2(-AMI)
Well, that's where I am for now. I really like SageMaker, although I don't like the lack of version control (at least I haven't found anything for now).
Cloud9 seems just to be an IDE to an EC2 instance.. I haven't found any comparisons of Cloud9 vs SageMaker for Machine Learning. Maybe because Cloud9 is not advertised as an ML solution. But it seems to be an option.
What is your take on that question? What have I missed? What would you advise me to go for? What is your workflow and why?

I am looking for an easy work environment where I can quickly test my models, exactly. And it won't be only me working on it, it's a team effort.
Since you are working as a team I would recommend to use sagemaker with custom docker images. That way you have complete freedom over your algorithm. The docker images are stored in ecr. Here you can upload many versions of the same image and tag them to keep control of the different versions(which you build from a git repo).
Sagemaker also gives the execution role to inside the docker image. So you still have full access to other aws resources (if the execution role has the right permissions)
https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/scikit_bring_your_own/scikit_bring_your_own.ipynb
In my opinion this is a good example to start because it shows how sagemaker is interacting with your image.
Some notes on other solutions:
The problem of every other solution you posted is you want to build and execute on the same machine. Sure you can do this but keep in mind, that gpu instances are expensive and therefore you might only switch to the cloud when the code is ready to run.
Some other notes
Jupyter Notebooks in general are not made for collaborative programming. I think they want to change this with jupyter lab but this is still in development and sagemaker only use the notebook at the moment.
EC2 is cheaper as sagemaker but you have to do more work. Especially if you want to run your model as docker images. Also with sagemaker you can easily build an endpoint for model inference which would be even more complex to realize with ec2.
Cloud 9 I never used this service and but on first glance it seems good to develop on, but the question remains if you want to do this on a gpu machine. Because you're using ec2 as instance you have the same advantage/disadvantage.

One thing I'd like to call out first is SageMaker notebook is not the only IDE environment in which you can interact with other components of SageMaker such as training and hosting. In fact you can make API calls to SageMaker training/hosting through Cloud9 or any IDEs you've installed on EC2 or even your laptop, as long as you have AWS SDK or SageMaker Python SDK installed.
Regarding the choice of the IDE, it's really up to your particular needs. SageMaker notebook is Jupyter based (now also supports JupyterLab beta), ML focused, and fully managed. Hundreds of Python packages that are commonly used in ML, as well as Tensorflow, Keras, MxNet, SageMaker Python SDK, etc., are preinstalled and automatically maintained for you. It also integrates more closely with other components of SageMaker as one can imagine.
Cloud9 is a managed IDE too but it is for general purpose rather than ML specific. If you want to use Jupyter on cloud9 it requires extra work from your side. It does not preinstall and maintain the version of common ML/DL related packages like SageMaker notebook does.

implement a tool that uses the technologies - Jenkins, Docker, Docker Swarm, and AWS

I want to implement a tool that uses the technologies - Jenkins, Docker, Docker Swarm, and AWS - to achieve a deployment tool that our team of developers can use to manage instances and deploys.
Please recommend what technologies should we (both administrators and developers) be using, what needs to be built and what sorts of machines must be having.
Any help here would be much appreciated.

Your question is too generic to provide a specific answer, as there are different approaches to implement what you are trying to achieve. IMHO the best approach would be to talk with your existing dev team & administrators and come up with a solution which all parties find easy to manage and maintain container based environment rather than specifying several specific technologies.
Each tool you have mentioned has different capabilities and also there are other tools that provide the same features which would be more ideal for your situation. (Thats why proper understanding between Devs and admins are necessary on what you really want to achieve.) .
Since you have asked about what kind of machines you must be having (I suppose this is on AWS env) try Core OS on AWS instances. CoreOS (Container Linux) will be the best option to manage and run your container based environments. [About CoreOS]

Jenkins can run in a docker container and issue docker commands to deploy new docker containers that reside in the same swarm as jenkins. You also need to hook into a software repo like git. Jenkins Blue Ocean is something you could look at for pipe-lining your dev->build->test->deploy->maintain pipes. Also, Travis-ci, github, JIRA, and Dockerhub are useful components to what you are trying to achieve.

AWS AMI export to local Ubuntu

I have got a running AMI Ubuntu instance with CI/CD pipeline with Jenkins, Tomcat, Selenium, Sonar, Nagios etc .I have used elastic ip directly in different configuration to get it working.
Is there any easy and direct way to export this image to my local Ubuntu apart from AWS export/import service.

For what it's worth I don't agree with the down votes. Cloud is fast becoming "infrastructure as code" and I place emphasis on the code aspect. As the differences between programming and infrastructure management begin to fade (see FaaS - functions as a service) there needs to be room for having that conversation in this forum IMHO.
That said, let's move on to your questions. AWS provides instance export functionality though this solution may not fit all use cases. Two links below illustrate how to perform that export. YMMV. Good luck!
http://docs.aws.amazon.com/cli/latest/reference/ec2/create-instance-export-task.html
http://docs.aws.amazon.com/vm-import/latest/userguide/vmexport.html

Medium Hadoop / Spark Cluster Administration

Please let me know if this question is more appropriate for a different channel but I was wondering what the recommended tools are for being able to install, configure and deploy hadoop/spark across a large number of remote servers. I'm already familiar with how to setup all of the software but I'm trying to determine what I should start using that would allow me to easily deploy across a large number of servers. I've started to look into configuration management tools (ie. chef, puppet, ansible) but was wondering what the best and most user friendly option to start off with is out there. I also do not want to use spark-ec2. Should I be creating homegrown scripts to loop through a hosts file containing IP? Should I use pssh? pscp? etc. I want to just be able to ssh with as many servers as needed and install all of the software.

If you have some experience in scripting language then you can go for chef. The recipes are already available for deployment and configuration of cluster and it's very easy to start with.
And if wants to do it by your own then you can use sshxcute java API which runs the script on remote server. You can build up the commands there and pass them to sshxcute API to deploy the cluster.

Check out Apache Ambari. Its a great tool for central management of configs, adding new nodes, monitoring the cluster, etc. This would be your best bet.

Is it good practice to use a VM for Django projects?

So I was looking at the Getting Started with Django http://gettingstartedwithdjango.com/ tutorial, and everything was done in a vm. The author set up a vm, and then created a virtualenv in the vm. Is this good practice to get started on a django project, or software projects in general? Why the need for a vm? What happens if I have more than one project - should I use two vms? Or just create additional virtualenvs in the original vm?
I'm still a student in school, and I'm working on my own personal side projects, so it'd be useful to get some input on how things are really done in the real world.
Thanks!

You do not need VMs. You can get through just fine using virtualenv with an environment for each project - especially just starting out in Django.
In the future, one of the times you may need a separate VM environment for your project is if it has a lot of unique infrastructure needs. It's much easier to setup a VM, setup the unique environment, and not have to alter it when you want to work on other projects.
Another common reason I see people using VMs is when they have a Windows machine but want to develop in Linux. It's easy to spin up a Linux VM and work there since Linux is more programmer friendly.

It's subjective. I leverage virtualenv and virtualenvwrapper for my development, which I do on Linux. There are instance where you might need to leverage two separate VMs...it just depends, although I haven't encountered this.
There's no unwritten rule that says you have to use a VM. Python (and many other languages/frameworks) simply work better on Linux, so many people will leverage VMs to run Linux on Windows or Mac to do their development in that environment.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js