lyft/Cartography on EC2, is it possible? - amazon-web-services

I've been trying to run Cartography on my EC2 instance for the last 2 days. I have no previous knowledge of Neo4j, and following their installation process doesn't work.
First I tried to install Neo4j using the RPM instructions from the Neo4j website, but had no success accessing Neo4j on port 7474. Error: Connection refused.
Then I gave up trying to make Neo4j work on a plain EC2 installation and used their Marketplace AMI. It works like a charm, but I don't know what is being installed on that AMI. So I decided to install and run Cartography on this instance.
My first problem was installing Python, pip and Java correctly. After getting everything working, I discovered that the Neo4j bolt port was using my public IP, not localhost. After that I was finally able to execute Cartography, but now it's giving me the following error:
neobolt.exceptions.ClientError: Supplied bookmark [FB:kcwQ40omSYgvSzKPpCQTXDOcCBSQ] does not conform to pattern neo4j:bookmark:v1:tx
Has anyone actually been able to use this? Every step along the way requires some specific libraries.
Thanks!

I maintain cartography and hope I can help (wish I'd seen this earlier, though, haha).
A few things to check:
Are you using Neo4j 4.x? cartography currently only supports 3.5.x.
To run for one AWS account,
AWS_PROFILE=profilename cartography --neo4j-uri <uri for your neo4j instance; usually bolt://localhost:7687>
To run multiple accounts, set up an AWS config file and run
AWS_CONFIG_FILE=/path/to/your/aws/config cartography --neo4j-uri <uri for your neo4j instance; usually bolt://localhost:7687> --aws-sync-all-profiles
(see https://github.com/lyft/cartography/blob/master/docs/setup/install.md#cartography-installation)
If you have more questions, feel free to open a GitHub issue or start a thread on our Slack (we can also talk about more specialized setups there, like running in containers).
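As a rough illustration of the checks and the two invocations above, here is a minimal Python sketch that tests a Neo4j version string against the 3.5.x requirement and assembles the environment and argument list for each case. The helper names (neo4j_version_supported, build_cartography_run, run_cartography) are hypothetical and not part of cartography; only the --neo4j-uri and --aws-sync-all-profiles flags and the AWS_PROFILE / AWS_CONFIG_FILE variables come from the answer.

```python
import os
import subprocess


def neo4j_version_supported(version: str) -> bool:
    """Return True for 3.5.x, the only series the answer says is supported."""
    parts = version.strip().split(".")
    return len(parts) >= 2 and parts[0] == "3" and parts[1] == "5"


def build_cartography_run(neo4j_uri, profile=None, config_file=None):
    """Return (extra_env, argv) for a single profile or for all profiles.

    Hypothetical helper: pass `profile` for one AWS account, or
    `config_file` to sync every profile listed in that AWS config file.
    """
    if (profile is None) == (config_file is None):
        raise ValueError("pass exactly one of profile or config_file")
    argv = ["cartography", "--neo4j-uri", neo4j_uri]
    if profile is not None:
        return {"AWS_PROFILE": profile}, argv
    return {"AWS_CONFIG_FILE": config_file}, argv + ["--aws-sync-all-profiles"]


def run_cartography(extra_env, argv):
    """Run cartography with the extra variables layered on the current environment."""
    return subprocess.run(argv, env={**os.environ, **extra_env}, check=True)
```

For example, run_cartography(*build_cartography_run("bolt://localhost:7687", profile="profilename")) would correspond to the single-account command, assuming cartography is on the PATH.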

Related

Dataproc custom image: Cannot complete creation

For a project, I have to create a Dataproc cluster that uses one of the outdated image versions (for example, 1.3.94-debian10) containing the Apache Log4j 2 vulnerabilities. The goal is to trigger the related alert (DATAPROC_IMAGE_OUTDATED) in order to check how SCC works (it is just for a test environment).
I tried to run the command gcloud dataproc clusters create dataproc-cluster --region=us-east1 --image-version=1.3.94-debian10 but got the following message ERROR: (gcloud.dataproc.clusters.create) INVALID_ARGUMENT: Selected software image version 1.3.94-debian10 is vulnerable to remote code execution due to a log4j vulnerability (CVE-2021-44228) and cannot be used to create new clusters. Please upgrade to image versions >=1.3.95, >=1.4.77, >=1.5.53, or >=2.0.27. For more information, see https://cloud.google.com/dataproc/docs/guides/recreate-cluster, which makes sense, in order to protect the cluster.
I did some research and discovered that I will have to create a custom image with that version and create the cluster from it. The problem is that I have read the documentation and looked for tutorials, but I still can't understand how to get started or how to run the file generate_custom_image.py, for example, since I am not comfortable with Cloud Shell (I prefer the console).
Can someone help? Thank you

Best practice to install Tableau Server as IaC

I am trying to figure out the best practice for creating a new server on AWS EC2.
As a test case I chose Tableau Server. The first time, as the Tableau docs recommend, I did the install myself, but I would like to keep everything as automated as possible. The idea is: if the EC2 instance gets destroyed, how can I recover everything quickly?
I am using Terraform to keep all the AWS infrastructure as code, but the installation itself is not automated yet.
For that I have two options: Ansible (which I have never used), or, in this particular case, Tableau's automated install script in Python, which I could add to the EC2 launch template configuration and then bring up with Terraform in minutes.
Which one should I choose, and why? Both seem to accomplish the final goal.
It also raises some doubts, such as:
This brings the server up with a full installation of the software, but to get all the users and the whole Tableau setup back I have to bring up a snapshot anyway, right? Is there any other tool to do that?
Then, if the manual install of the software is fast enough, why should I use IaC to keep the install as code instead of just documenting the installation script and keeping only the infrastructure as code?

AWS Glue Development Endpoint Not Working properly

I am trying to use a development endpoint to interactively run and edit ETL scripts, but there seem to be issues with the endpoint right after it is created: I get errors in the Scala/Python REPL and am also unable to set up an SSH tunnel to the remote interpreter.
Let me explain exactly what I did. I created a development endpoint in the AWS console with all the default configurations. While creating it I provided only three things: the 'Development endpoint name', the 'IAM Role' and my public SSH key.
Right after creating the endpoint I connect to the Spark/Python REPL. I am able to connect successfully, but within a couple of minutes the REPL starts throwing errors without my writing a single line of code. This happens in every REPL available on the development endpoint.
Also, when I try to set up SSH tunneling to the remote interpreter to connect my local Zeppelin notebook, it fails with "bind: Cannot assign requested address".
A couple of things do work, though:
I am able to SSH to the endpoint.
I created a SageMaker notebook in AWS Glue attached to this development endpoint, and it seems to work fine, although it adds extra cost and I don't want to keep using it.
Can anyone please tell me what I am doing wrong? Am I missing any important steps that need to be done on the machine right after creating the development endpoint?
Thanks in advance!
I'm not sure about this error, but if you are working with smaller datasets you may want to use the Docker implementation instead, as it adds no extra cost and lets you carry on with your development.
You can refer to this blog post on how to set it up:
https://towardsdatascience.com/develop-glue-jobs-locally-using-docker-containers-bffc9d95bd1

PrestoDB EMR Server refused connection

I have set up an EMR cluster in AWS with PrestoDB installed on it. Earlier I was able to query with PrestoDB, but after a restart it stopped working and started giving the following error:
"Error running command: Server refused connection: http://ip-*---.us-west-2.compute.internal:8889/v1/statement"
I have looked into all the configuration files and nothing seems to be wrong. I have also cross-checked the Hive configuration files, without success.
Could anyone who has encountered a similar issue help me?
Yes, you will have to restart Presto on all machines.
In addition, I would suggest trying to install open source Presto on EMR using presto-admin. It has a lot of functions that will help you avoid such issues.
Updating and maintaining the cluster is easy with presto-admin.
I know this is an old question, but I've run into this as well.
The likely reason is that you only restarted the Presto server on the coordinating node. You have to ssh into each core node and restart the Presto server there as well.
If this is a persistent Presto cluster, you would probably benefit from installing presto-admin. It's kind of a pain to set up the first time, but it makes this stuff much easier once it's in place.
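To make the "restart on every node" step concrete, here is a minimal Python sketch that builds one ssh command per node. It is only an illustration: the restart command is an assumption (it differs between EMR releases, so substitute whatever restarts Presto on your nodes), the hadoop login user is the usual EMR default but may not match your setup, and the function names are hypothetical.

```python
import shlex

# Assumption: this is how Presto is restarted on your nodes; the exact
# service command varies between EMR releases.
RESTART_COMMAND = "sudo systemctl restart presto-server"


def restart_commands(hosts, user="hadoop", restart_command=RESTART_COMMAND):
    """Build one ssh argv per node (the coordinator and every core node)."""
    return [["ssh", f"{user}@{host}", restart_command] for host in hosts]


def as_shell_lines(commands):
    """Render the argv lists as copy-pasteable shell lines."""
    return [shlex.join(argv) for argv in commands]
```

Each argv list can be passed to subprocess.run, or printed with as_shell_lines and executed by hand, once the host list contains the coordinator and all core nodes.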

How to work with a local development server and deploy to a production server in django?

I want to work locally on my Django (1.7) project and regularly deploy updates to a production server. How would you do this? I have not found anything about this in the docs. I am confused, because it seems like many people would want to do this and there should be some kind of standard solution. Or am I getting the whole workflow wrong?
I should note that I'm not expecting a step-by-step guide. I am just trying to understand the concept.
Assuming you already have your deployment server set up and all you need to do is push code to it, you can just use git as a form of deployment.
Digital Ocean has a good tutorial at this link https://www.digitalocean.com/community/tutorials/how-to-set-up-automatic-deployment-with-git-with-a-vps
Push sources to a git repository from a dev machine.
Pull sources on the production server. Restart uwsgi/whatever.
There is no standard way of doing this, so no, it cannot be included with Django or be thoroughly described in the docs.
If you're using a PaaS, how you deploy depends on the PaaS. Ditto for a container system like Docker: you must follow the rules of that particular container.
If you're old-school and can ssh into a server you can rsync a snapshot of the code to the correct place after everything else is taken care of: database, ports, webserver setup etc. That's what I do, and I control stuff with bash scripts utilizing a makefile.
REMOTEHOST=user@yourbox
REMOTEPATH=yourpath
REMOTE=$REMOTEHOST:$REMOTEPATH
make rsync REMOTE_URI=$REMOTE
ssh $REMOTEHOST make -C $REMOTEPATH deploy
My "deploy" action is a monster, but it might be as simple as something that touches the wsgi file in use in order to reload the site. My medium-complexity ones clean out stale files, run collectstatic and then reload the site. The really complex ones create a timestamped virtualenv, a cloned database and remote code tree, and a new server setup that points to this, run connection tests on the remote and, if they succeed, switch the main site to point to the new versioned site, then email me the version that is now in production, with the git hash and timestamp.
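A medium-complexity deploy action like the one described could look roughly like the Python sketch below. This is not the author's actual script: the function names are hypothetical, "stale files" is assumed here to mean leftover .pyc files, and touching the wsgi file only reloads the site if the server is configured to watch that file.

```python
import subprocess
import sys
from pathlib import Path


def remove_stale_bytecode(project_dir: Path) -> int:
    """Delete leftover .pyc files and return how many were removed."""
    stale = list(project_dir.rglob("*.pyc"))
    for pyc in stale:
        pyc.unlink()
    return len(stale)


def collectstatic(project_dir: Path) -> None:
    """Run Django's collectstatic without prompting for confirmation."""
    subprocess.run(
        [sys.executable, "manage.py", "collectstatic", "--noinput"],
        cwd=project_dir,
        check=True,
    )


def touch_wsgi(wsgi_file: Path) -> None:
    """Update the wsgi file's mtime so a server watching it reloads the site."""
    wsgi_file.touch()


def deploy(project_dir: Path, wsgi_file: Path) -> None:
    """Clean stale files, collect static assets, then trigger a reload."""
    remove_stale_bytecode(project_dir)
    collectstatic(project_dir)
    touch_wsgi(wsgi_file)
```

On the remote, a make target could then call something like deploy(Path("yourpath"), Path("yourpath/project/wsgi.py")), where both paths are placeholders.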
Lots of good solutions. Heroku has a good tutorial: https://devcenter.heroku.com/articles/getting-started-with-django
Check out a general guide for deploying to multiple PaaS providers here: http://www.paascheatsheet.com