I am running a job on an EC2 instance at startup with user data.
When the job starts I head to /var/log and start tailing cloud-init-output.log, and it shows me the correct output data. However, after I let the job run overnight and my PC goes to sleep, I get disconnected from my SSH session, and when I open the cloud-init-output.log file again, none of the data that I was seeing while tailing it is there anymore; instead it just shows things like kernel information.
Any ideas why this is happening, or how I can access the logs that I was seeing while tailing the file?
You haven't mentioned the exact operating system, but since you are using tail it is a Linux-based one. The default cloud-init logging configuration is described in this document, and unless that specific distribution has changed it, it will most likely contain a line like the one below,
which basically sends both stderr and stdout from all init scripts to /var/log/cloud-init-output.log:
output: {all: '| (umask 0026; tee -a /var/log/cloud-init-output.log)'}
Now, there are two main reasons for the behaviour you are seeing:
First of all, cloud-init runs only once, during the instance initialisation and bootstrap right after the instance has been created.
The only information added to the log later on is written after reboots, and it's mostly static. For example, on Amazon Linux 2 it looks like this:
Cloud-init v. 19.3-44.amzn2 running 'init-local' at Fri, 24 Sep 2021 11:59:49 +0000. Up 7.03 seconds.
Cloud-init v. 19.3-44.amzn2 running 'init' at Fri, 24 Sep 2021 11:59:51 +0000. Up 9.39 seconds.
ci-info: ++++++++++++++++++++++++++++++++++++++Net device info+++++++++++++++++++++++++++++++++++++++
ci-info: +--------+------+-----------------------------+---------------+--------+-------------------+
ci-info: | Device | Up | Address | Mask | Scope | Hw-Address |
ci-info: +--------+------+-----------------------------+---------------+--------+-------------------+
ci-info: | eth0 | True | 172.31.36.56 | 255.255.240.0 | global | 0a:d1:9d:ab:5e:9f |
ci-info: | eth0 | True | fe80::8d1:9dff:feab:5e9f/64 | . | link | 0a:d1:9d:ab:5e:9f |
ci-info: | lo | True | 127.0.0.1 | 255.0.0.0 | host | . |
ci-info: | lo | True | ::1/128 | . | host | . |
ci-info: +--------+------+-----------------------------+---------------+--------+-------------------+
ci-info: ++++++++++++++++++++++++++++++++Route IPv4 info++++++++++++++++++++++++++++++++
ci-info: +-------+-----------------+-------------+-----------------+-----------+-------+
ci-info: | Route | Destination | Gateway | Genmask | Interface | Flags |
ci-info: +-------+-----------------+-------------+-----------------+-----------+-------+
ci-info: | 0 | 0.0.0.0 | 172.31.32.1 | 0.0.0.0 | eth0 | UG |
ci-info: | 1 | 169.254.169.254 | 0.0.0.0 | 255.255.255.255 | eth0 | UH |
ci-info: | 2 | 172.31.32.0 | 0.0.0.0 | 255.255.240.0 | eth0 | U |
ci-info: +-------+-----------------+-------------+-----------------+-----------+-------+
ci-info: +++++++++++++++++++Route IPv6 info+++++++++++++++++++
ci-info: +-------+-------------+---------+-----------+-------+
ci-info: | Route | Destination | Gateway | Interface | Flags |
ci-info: +-------+-------------+---------+-----------+-------+
ci-info: | 9 | fe80::/64 | :: | eth0 | U |
ci-info: | 11 | local | :: | eth0 | U |
ci-info: | 12 | ff00::/8 | :: | eth0 | U |
ci-info: +-------+-------------+---------+-----------+-------+
Cloud-init v. 19.3-44.amzn2 running 'modules:config' at Fri, 24 Sep 2021 11:59:52 +0000. Up 10.80 seconds.
Cloud-init v. 19.3-44.amzn2 running 'modules:final' at Fri, 24 Sep 2021 11:59:53 +0000. Up 11.47 seconds.
Cloud-init v. 19.3-44.amzn2 finished at Fri, 24 Sep 2021 11:59:53 +0000. Datasource DataSourceEc2. Up 11.59 seconds
This means that any script you add to user-data runs only once, during the instance bootstrap; unless it's set to re-spawn or run in the background, it will most likely exit after that, and it won't be executed again upon reboot.
The second reason is that you are tailing the log, and tail only outputs the last lines of a file. When your SSH connection breaks and you initiate a new one, you are seeing the last lines at that point in time. You should still be able to see the information you were observing the previous day by using less or cat, or by grepping the log file for it (unless the cloud-init configuration has been altered and it's overwriting the information in the log, but that's unlikely).
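For example, assuming your job prints lines containing some recognisable string (the 'my-job' pattern below is only a placeholder), you could page through the whole file or filter it:
# view the whole log from the beginning
less /var/log/cloud-init-output.log
# or pull out only your job's lines ('my-job' is a placeholder pattern)
grep -i 'my-job' /var/log/cloud-init-output.log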
That being said, based on the above, user-data is not the right place for jobs or scripts that you want to run and log continuously. You should look into other methods of starting those jobs post-bootstrap, for example SSM Run Command, and log any information you need in a dedicated log file which you can then ship to CloudWatch.
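If you do keep the job in user-data for now, a minimal sketch of the logging part (the script path and log file name below are placeholders, not anything from your setup) is to give the job its own log file instead of relying on cloud-init-output.log:
#!/bin/bash
# user-data sketch: run the long-lived job in the background and send its
# stdout/stderr to a dedicated log file (both paths are placeholders)
nohup /opt/myjob/run.sh >> /var/log/myjob.log 2>&1 &
That dedicated file survives independently of cloud-init and can later be shipped to CloudWatch by the CloudWatch agent.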
Related
I'm trying to build this repository and am stuck at building some third-party dependencies (note that I'm very new and have basically no knowledge of C++ / CMake).
I'm strictly following the installation guide provided in the repo and am stuck trying to build ngp with the following command:
cmake ./thirdparty/instant-ngp -B build_ngp
I receive the following error message:
CMake Error at /opt/cmake-3.25.2-linux-x86_64/share/cmake-3.25/Modules/CMakeDetermineCUDACompiler.cmake:603 (message):
Failed to detect a default CUDA architecture.
Compiler output:
Call Stack (most recent call first):
CMakeLists.txt:11 (project)
I'm running Ubuntu 20.04 and have an NVIDIA 2080 Super installed in the computer.
cmake --version --> 3.25.2
nvidia-smi output:
Tue Feb 14 13:40:27 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12 Driver Version: 525.85.12 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... On | 00000000:2D:00.0 On | N/A |
| 0% 39C P8 20W / 250W | 449MiB / 8192MiB | 1% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1386 G /usr/lib/xorg/Xorg 53MiB |
| 0 N/A N/A 1971 G /usr/lib/xorg/Xorg 120MiB |
| 0 N/A N/A 2099 G /usr/bin/gnome-shell 36MiB |
| 0 N/A N/A 2542 G /usr/lib/firefox/firefox 226MiB |
+-----------------------------------------------------------------------------+
Would appreciate any kind of help and please let me know if I can provide more valuable info.
Background:
We are creating a SaaS app using a Vue front-end, Django/DRF back-end, and PostgreSQL, all running in a Docker environment. The benchmarks below were run on our local dev machines.
The process to register a new "owner" is rather complex. It does the following:
Create tenant and schema
Run migrations (done in the create schema process)
Create MinIO bucket
Load "production" fixtures
Run sync_permissions
Create an owner instance in the newly created schema
We are seeing some significant differences in processing times for some of the above steps running the registration process in different ways. In trying to figure out our issue, we have tried the following four methods to invoke the registration process:
from the Vue front-end hitting the API endpoint
from a REST client (Talend)
from the APIBrowser (provided by DRF)
(in some cases) via manage.py
We tried it from the REST client to try to eliminate Vue as the culprit, but we got similar times between Vue and the REST client.
We also saw similar times between the APIBrowser and the manage.py method, so in the tables below, we are comparing Talend to APIBrowser (or manage.py).
The issue:
Here are the processing times (in seconds) for several of the steps listed above:
|---------------------|--------|------------|--------|
| Process | Talend | APIBrowser | Factor |
|---------------------|--------|------------|--------|
| Create Tenant | 11.853 | 1.185 | 10.0 |
|---------------------|--------|------------|--------|
| Create MinIO Bucket | 0.386 | 0.273 | 1.4 |
|---------------------|--------|------------|--------|
| Load Fixtures | 0.926 | 0.215 | 4.3 |
|---------------------|--------|------------|--------|
| Sync Permissions | 61.115 | 5.390 | 11.3 |
|---------------------|--------|------------|--------|
| Overall | 74.280 | 7.053 | 10.5 |
|---------------------|--------|------------|--------|
In both cases (Talend and APIBrowser), it is running the exact same code. We don't understand why the REST client method takes more than 10 times as long as running from APIBrowser.
We then tried to get down to finer detail in our benchmark timing. We focused on the first step and quickly noticed that the process of running migrate_schemas was the issue. Here's a list of processing times (in seconds) for each migration file it processed. This time, we ran the second pass via manage.py instead of APIBrowser, but as mentioned previously, those times were comparable.
|---------------------|--------|-----------|--------|
| Migration file | Talend | manage.py | Factor |
|---------------------|--------|-----------|--------|
| activity_log.0001 | 0.133 | 0.013 | 10.2 |
| countries.0001 | 0.086 | 0.013 | 6.6 |
| contenttypes.0001 | 0.178 | 0.022 | 8.1 |
| contenttypes.0002 | 0.159 | 0.033 | 4.8 |
| auth.0001 | 0.530 | 0.092 | 5.8 |
| auth.0002 | 0.124 | 0.022 | 5.6 |
| auth.0003 | 0.090 | 0.023 | 3.9 |
| auth.0004 | 0.097 | 0.027 | 3.6 |
| auth.0005 | 0.126 | 0.016 | 7.9 |
| auth.0006 | 0.079 | 0.006 | 13.2 |
| auth.0007 | 0.079 | 0.011 | 7.2 |
| auth.0008 | 0.100 | 0.011 | 9.1 |
| auth.0009 | 0.085 | 0.014 | 6.1 |
| auth.0010 | 0.121 | 0.015 | 8.1 |
| auth.0011 | 0.087 | 0.018 | 4.8 |
| users.0001 | 0.871 | 0.115 | 7.6 |
| admin.0001 | 0.270 | 0.035 | 7.7 |
| admin.0002 | 0.093 | 0.022 | 4.2 |
| admin.0003 | 0.091 | 0.024 | 3.8 |
| authtoken.0001 | 0.193 | 0.036 | 5.4 |
| authtoken.0002 | 0.395 | 0.090 | 4.4 |
| clients.0001 | 0.537 | 0.082 | 6.5 |
| clients.0002 | 0.519 | 0.145 | 3.6 |
| projects.0001 | 0.475 | 0.062 | 7.7 |
| projects.0002 | 0.293 | 0.062 | 4.7 |
| sessions.0001 | 0.191 | 0.023 | 8.3 |
| tasks.0001 | 0.241 | 0.122 | 2.0 |
| tenants.0001 | 0.086 | 0.017 | 5.1 |
|---------------------|--------|-----------|--------|
| Total time: | 10.404 | 1.618 | 6.4 |
|---------------------|--------|-----------|--------|
Our Theory:
We think it must have something to do with Talend (and Vue) initiating the process from a different domain (as will be the case when the site is live), whereas with APIBrowser the request originates from the same domain the endpoint is defined on.
That means, in our local environment, running from Vue, we are on local.dev and it hits the local.api endpoint. But running from APIBrowser, we go directly to local.api, then fill in the data on the form and POST it.
Our theory is that it must be affecting how files are accessed. The migrate_schemas process has to open many .py files. And the worst culprit, SyncPermissions, is processing many .yaml files where we have defined our default permission structure utilized by each tenant. I should point out that the LoadFixtures process also opens external .yaml files, but in this case, it only has one file to process, so the difference is minimized.
It may be like the difference between opening an image file in code vs. a template showing an image via HTML. In the HTML version, it's essentially another request on the server - which surely takes longer than programmatically opening an image on disk.
What we don't understand is why opening files in these processes would be affected by the two methods of initiating the process. Obviously, since the site will have to run in Vue, having the registration process take 70 seconds when we know it could be done in only 7 seconds is unacceptable.
Note:
I realize it is the norm here on SO to include code for the process in question, but in this case both methods run the exact same code, which is why I decided not to post several hundred lines of it here.
Edit (in response to #Iain Shelvington)
The process starts in the post() method of TenantRegister view:
class TenantRegister(APIView):
    def post(self, request, *args, **kwargs):
        ...
        tenant_data = request.data.pop('tenant', dict())
        tenant_serializer = TenantSaveSerializer(data=tenant_data)
        tenant_serializer.is_valid(raise_exception=True)
        tenant = tenant_serializer.create(tenant_serializer.validated_data)
        ...
...which calls the create() method of TenantSaveSerializer:
class TenantSaveSerializer(serializers.ModelSerializer):
    class Meta:
        model = Tenant
        fields = '__all__'

    def create(self, validated_data):
        ...
        tenant = Tenant.objects.create(**validated_data)
        ...
        if has_schema and tenant.auto_create_schema:
            try:
                tenant.create_schema(check_if_exists=True, verbosity=self.verbosity)
                post_schema_sync.send(sender=Tenant, tenant=tenant)
            except Exception:
                # We failed creating the schema, delete what
                # was created and re-raise the exception.
                tenant.delete(force_drop=True)
                raise
        else:
            # Although we are not using the schema functions directly,
            # the signal might be registered by a listener.
            schema_needs_to_be_sync.send(sender=Tenant, tenant=self)
        return tenant
...which calls the create_schema() method on the Tenant model instance:
def create_schema(self, check_if_exists=False, sync_schema=True,
                  verbosity=1):
    connection = connections[get_tenant_database_alias()]
    cursor = connection.cursor()

    # Create the schema.
    cursor.execute('CREATE SCHEMA "%s"' % self.schema_name)
    call_command(
        'migrate_schemas',
        tenant=True,
        schema_name=self.schema_name,
        interactive=False,
        verbosity=verbosity)
    connection.set_schema_to_public()
    return True
As for the timing of each migration, my colleague captured those. I believe he just set verbosity to a higher value and the migrate_schemas process produced the timed output.
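For reference, I believe the command-line equivalent is roughly the following (this relies on the standard Django behaviour of printing the elapsed time for each migration once verbosity is greater than 1, which migrate_schemas passes through):
# run the tenant migrations with extra verbosity to get per-migration timings
python manage.py migrate_schemas --verbosity 2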
I am using Airflow on Cloud Composer and I have some issues with the Airflow scheduler (it is slow or stops).
Version: composer-1.10.4-airflow-1.10.6
I launched a "huge" collection run (because I will sometimes need it) with Airflow to test the scalability of my pipelines.
The result is that my scheduler apparently only schedules the DAGs with few tasks, while the tasks of the big DAGs are not scheduled. Do you have any insights or advice about that?
Here is some information about my current configuration:
Cluster config:
10 cluster nodes, 20 vCPUs, 160 GB memory
airflow config:
core
store_serialized_dags: True
dag_concurrency: 160
store_dag_code: True
min_file_process_interval: 30
max_active_runs_per_dag: 1
dagbag_import_timeout: 900
min_serialized_dag_update_interval: 30
parallelism: 160
scheduler
processor_poll_interval: 1
max_threads: 8
dag_dir_list_interval: 30
celery
worker_concurrency: 16
webserver
default_dag_run_display_number: 5
workers: 2
worker_refresh_interval: 120
airflow scheduler DagBag parsing (airflow list_dags -r):
DagBag loading stats for /home/airflow/gcs/dags
Number of DAGs: 27
Total task number: 32229
DagBag parsing time: 22.468404
---------------+--------------------+---------+----------+-----------------------
file | duration | dag_num | task_num | dags
---------------+--------------------+---------+----------+-----------------------
/folder__dags/dag1 | 1.83547 | 1 | 1554 | dag1
/folder__dags/dag2 | 1.717692 | 1 | 3872 | dag2
/folder__dags/dag3 | 1.53 | 1 | 3872 | dag3
/folder__dags/dag4 | 1.391314 | 1 | 210 | dag4
/folder__dags/dag5 | 1.267788 | 1 | 3872 | dag5
/folder__dags/dag6 | 1.250022 | 1 | 1554 | dag6
/folder__dags/dag7 | 1.0973419999999998 | 1 | 2904 | dag7
/folder__dags/dag8 | 1.081566 | 1 | 3146 | dag8
/folder__dags/dag9 | 1.019032 | 1 | 3872 | dag9
/folder__dags/dag10 | 0.98541 | 1 | 1554 | dag10
/folder__dags/dag11 | 0.959722 | 1 | 160 | dag11
/folder__dags/dag12 | 0.868756 | 1 | 2904 | dag12
/folder__dags/dag13 | 0.81513 | 1 | 160 | dag13
/folder__dags/dag14 | 0.69578 | 1 | 14 | dag14
/folder__dags/dag15 | 0.617646 | 1 | 294 | dag15
/folder__dags/dag16 | 0.588876 | 1 | 210 | dag16
/folder__dags/dag17 | 0.563712 | 1 | 160 | dag17
/folder__dags/dag18 | 0.55615 | 1 | 726 | dag18
/folder__dags/dag19 | 0.553248 | 1 | 140 | dag19
/folder__dags/dag20 | 0.55149 | 1 | 168 | dag20
/folder__dags/dag21 | 0.543682 | 1 | 168 | dag21
/folder__dags/dag22 | 0.530684 | 1 | 168 | dag22
/folder__dags/dag23 | 0.498442 | 1 | 484 | dag23
/folder__dags/dag24 | 0.46574 | 1 | 14 | dag24
/folder__dags/dag25 | 0.454656 | 1 | 28 | dag25
/create_conf | 0.022272 | 1 | 20 | create_conf
/airflow_monitoring | 0.006782 | 1 | 1 | airflow_monitoring
---------------+--------------------+---------+----------+------------------------
Thank you for your help
The Airflow scheduler processes files in the DAGs directory with a round-robin scheduling algorithm, and this can cause long delays between tasks, because the scheduler will not be able to enqueue a task whose dependencies recently completed until its round robin returns to the enclosing DAG's module. Multiple DAG objects can be defined in the same Python module; although this is generally discouraged from a fault-isolation perspective, it may be necessary here to reduce that scheduling delay.
Sometimes the best approach is to restart the scheduler:
Get cluster credentials as described in the official documentation
Run the following command to restart the scheduler:
kubectl get deployment airflow-scheduler -o yaml | kubectl replace --force -f -
Additionally, please restart the Airflow web server. Sometimes broken, invalid, or resource-intensive DAGs can cause web server crashes, restarts, or complete downtime. One way to force a restart is to remove or upgrade one of the PyPI packages in your environment, which triggers an environment update.
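As a rough sketch (please double-check the flag against the current gcloud reference; the environment name, location, and package below are placeholders), a PyPI package update looks like this and results in the web server being redeployed:
# placeholder environment, location, and package; triggers an environment update
gcloud composer environments update my-composer-env \
    --location us-central1 \
    --update-pypi-package "scipy==1.4.1"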
Exceeding API usage limits/quotas
To avoid exceeding API usage limits/quotas or avoid running too many simultaneous processes, you can define Airflow pools in the Airflow web UI and associate tasks with existing pools in your DAGs. Refer to the Airflow documentation.
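For example (the pool name and slot count below are made up), a pool can also be created from the Airflow 1.10 CLI, and tasks opt into it via the pool argument on their operators (e.g. pool='api_pool'):
# create a pool named "api_pool" with 4 slots; tasks reference it with pool='api_pool'
airflow pool -s api_pool 4 "limit concurrent external API calls"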
Check the logs in the Logging section -> Cloud Composer Environment and look for any errors or warnings such as: cannot import module, DagNotFound in DagModel.
Please have a look at my earlier answer regarding memory. Quoting the official documentation:
DAG execution is RAM limited. Each task execution starts with two Airflow processes: task execution and monitoring. Currently, each node can take up to 6 concurrent tasks. More memory can be consumed, depending on the size of the DAG.
Moreover, I would like to share with you an interesting article on Medium, regarding calculations for resource requests.
I hope you find the above pieces of information useful.
I want to start a minikube cluster on a specific network/network adapter in VirtualBox, so that I can launch other VMs in the same network, like below:
+-------+ +------+ +----------------+
| | | | | |
| VM2 | | VM1 | | Minikube |
| | | | | Cluster |
| | | | | |
+---+---+ +---+--+ +------------+---+
| | |
| | |
| +------+------------+ |
+--+ | |
| 192.168.10.0/24 +-----+
+-------------------+
But I don't see many options for networking in the minikube start CLI.
Is it possible to start minikube like that, or is there any trick to set it up like the above?
When it comes to adjusting networking with minikube start, you can use the following option:
--host-only-cidr string The CIDR to be used for the minikube VM (only supported with Virtualbox driver) (default "192.168.99.1/24")
As you can see in the table here, by default the NAT option doesn't give you access to the minikube VM either from the host or from other guests (VMs), but you can additionally set up port forwarding, which is well described in this article.
Although the mentioned minikube start command doesn't support many options for modifying the networking of the default VM, you can easily adjust it by adding an additional bridged adapter once the minikube VM has been created, using either the VirtualBox GUI or the vboxmanage command-line tool, as some users suggest here and here.
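For instance (the CIDR and the host interface name below are just examples, and the VM has to be powered off before VBoxManage will change its NICs), you could either pick the host-only CIDR at creation time or attach a bridged adapter afterwards:
# choose the host-only network at creation time (VirtualBox driver only)
minikube start --host-only-cidr "192.168.10.1/24"
# or add an extra bridged adapter to the existing VM (interface name is an example)
minikube stop
VBoxManage modifyvm "minikube" --nic3 bridged --bridgeadapter3 eth0
minikube start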
I have checked again; the minikube cluster is attached to 2 networks:
NAT
Host-Only Network(vboxnet1)
Since it is already connected to an adapter, I can attach the VM to the existing adapter and use it like below:
+--------+ +---------------------+
| | | Minikube |
| | | |
| VM | | eth1 eth0 |
| | | + + |
| | +---------------------+
+---+----+ | |
| | |
| | |
| +------------v------+ |
| | | v
+------->+ vboxnet1 | NAT
| 192.168.99.0/24 |
| |
+-------------------+
Any other suggestions are welcome
I am having an issue in a clean Karaf (4.0.3) installing both camel-jetty and activemq (for example, activemq-client or activemq-broker) together. It doesn't matter in which order the features are installed. The second one hangs during install, with no information displayed in the Karaf log beyond the fact that the install has begun.
Has anyone seen this before? Is there a workaround? I tried having the activemq-broker in its own instance, but my app that uses both camel-jetty and JMS still needs to have the ActiveMQ connector initialized, and thus I need to load the activemq bundles/features anyway.
Here is the output of each install performed separately; when they are performed one after the other, the second always hangs the Karaf instance. There don't appear to be any bundles in common.
karaf#root()> feature:install activemq-client
karaf#root()> list
START LEVEL 100 , List Threshold: 50
ID | State | Lvl | Version | Name
------------------------------------------------------------------------
52 | Active | 80 | 5.12.1 | activemq-osgi
53 | Active | 80 | 3.3.0 | Commons Net
54 | Active | 80 | 2.4.2 | Apache Commons Pool
55 | Active | 80 | 1.0.1 | geronimo-j2ee-management_1.1_spec
56 | Active | 80 | 1.1.1 | geronimo-jms_1.1_spec
57 | Active | 80 | 1.1.1 | geronimo-jta_1.1_spec
58 | Active | 80 | 3.4.6 | ZooKeeper Bundle
63 | Active | 80 | 2.2.11.1 | Apache ServiceMix :: Bundles :: jaxb-impl
70 | Active | 80 | 3.18.0 | Apache XBean :: Spring
71 | Active | 80 | 0.6.4 | JAXB2 Basics - Runtime
(Performed clean karaf launch in between installs to get bundle listings)
karaf#root()> feature:install camel-jetty
karaf#root()> list
START LEVEL 100 , List Threshold: 50
ID | State | Lvl | Version | Name
--------------------------------------------------------------------------------
55 | Active | 80 | 2.12.2 | camel-core
56 | Active | 80 | 2.12.2 | camel-http
57 | Active | 80 | 2.12.2 | camel-jetty
58 | Active | 80 | 2.12.2 | camel-karaf-commands
59 | Active | 80 | 1.8.0 | Commons Codec
63 | Active | 80 | 3.1.0.7 | Apache ServiceMix :: Bundles :: commons-httpclient