Background:
We are building a SaaS app with a Vue front end, a Django/DRF back end, and PostgreSQL, all running in a Docker environment. The benchmarks below were run on our local dev machines.
The process to register a new "owner" is rather complex. It does the following:
Create tenant and schema
Run migrations (done in the create schema process)
Create MinIO bucket
Load "production" fixtures
Run sync_permissions
Create an owner instance in the newly created schema
We are seeing some significant differences in processing times for some of the above steps running the registration process in different ways. In trying to figure out our issue, we have tried the following four methods to invoke the registration process:
from the Vue front-end hitting the API endpoint
from a REST client (Talend)
from the APIBrowser (provided by DRF)
(in some cases) via manage.py
We tried it from the REST client to try to eliminate Vue as the culprit, but we got similar times between Vue and the REST client.
We also saw similar times between the APIBrowser and the manage.py method, so in the tables below, we are comparing Talend to APIBrowser (or manage.py).
The issue:
Here are the processing times for several of the steps listed above:
|---------------------|--------|------------|--------|
| Process | Talend | APIBrowser | Factor |
|---------------------|--------|------------|--------|
| Create Tenant | 11.853 | 1.185 | 10.0 |
|---------------------|--------|------------|--------|
| Create MinIO Bucket | 0.386 | 0.273 | 1.4 |
|---------------------|--------|------------|--------|
| Load Fixtures | 0.926 | 0.215 | 4.3 |
|---------------------|--------|------------|--------|
| Sync Permissions | 61.115 | 5.390 | 11.3 |
|---------------------|--------|------------|--------|
| Overall | 74.280 | 7.053 | 10.5 |
|---------------------|--------|------------|--------|
In both cases (Talend and APIBrowser), it is running the exact same code. We don't understand why the REST client method takes more than 10 times as long as running from APIBrowser.
We then tried to get down to finer detail in our benchmark timing. We focused on the first step and quickly noticed that it was the process of running migrate_schemas that was the issue. Here's a list of processing times for each migration file it processed. This time, we ran the second pass via manage.py instead of APIBrowser, but as mentioned previously, those times were comparable.
|---------------------|--------|-----------|--------|
| Migration file | Talend | manage.py | Factor |
|---------------------|--------|-----------|--------|
| activity_log.0001 | 0.133 | 0.013 | 10.2 |
| countries.0001 | 0.086 | 0.013 | 6.6 |
| contenttypes.0001 | 0.178 | 0.022 | 8.1 |
| contenttypes.0002 | 0.159 | 0.033 | 4.8 |
| auth.0001 | 0.530 | 0.092 | 5.8 |
| auth.0002 | 0.124 | 0.022 | 5.6 |
| auth.0003 | 0.090 | 0.023 | 3.9 |
| auth.0004 | 0.097 | 0.027 | 3.6 |
| auth.0005 | 0.126 | 0.016 | 7.9 |
| auth.0006 | 0.079 | 0.006 | 13.2 |
| auth.0007 | 0.079 | 0.011 | 7.2 |
| auth.0008 | 0.100 | 0.011 | 9.1 |
| auth.0009 | 0.085 | 0.014 | 6.1 |
| auth.0010 | 0.121 | 0.015 | 8.1 |
| auth.0011 | 0.087 | 0.018 | 4.8 |
| users.0001 | 0.871 | 0.115 | 7.6 |
| admin.0001 | 0.270 | 0.035 | 7.7 |
| admin.0002 | 0.093 | 0.022 | 4.2 |
| admin.0003 | 0.091 | 0.024 | 3.8 |
| authtoken.0001 | 0.193 | 0.036 | 5.4 |
| authtoken.0002 | 0.395 | 0.090 | 4.4 |
| clients.0001 | 0.537 | 0.082 | 6.5 |
| clients.0002 | 0.519 | 0.145 | 3.6 |
| projects.0001 | 0.475 | 0.062 | 7.7 |
| projects.0002 | 0.293 | 0.062 | 4.7 |
| sessions.0001 | 0.191 | 0.023 | 8.3 |
| tasks.0001 | 0.241 | 0.122 | 2.0 |
| tenants.0001 | 0.086 | 0.017 | 5.1 |
|---------------------|--------|-----------|--------|
| Total time: | 10.404 | 1.618 | 6.4 |
|---------------------|--------|-----------|--------|
Our Theory:
We think it must have something to do with Talend (and Vue) initiating the process from a different domain (as it will be when the site is live), but in the case of APIBrowser, it starts from the actual endpoint (i.e. the same domain) that the endpoint is defined for.
That means, in our local environment, running from Vue, we are on local.dev and it hits the local.api endpoint. But running from APIBrowser, we go directly to local.api, then fill in the data on the form and POST it.
Our theory is that it must be affecting how files are accessed. The migrate_schemas process has to open many .py files. And the worst culprit, SyncPermissions, is processing many .yaml files where we have defined our default permission structure utilized by each tenant. I should point out that the LoadFixtures process also opens external .yaml files, but in this case, it only has one file to process, so the difference is minimized.
It may be like the difference between opening an image file in code vs. a template showing an image via HTML. In the HTML version, it's essentially another request on the server - which surely takes longer than programmatically opening an image on disk.
What we don't understand is why opening files in these processes would be affected by the two methods of initiating the process. Obviously, since the site will have to run in Vue, having the registration process take 70 seconds when we know it could be done in only 7 seconds is unacceptable.
Note:
I realize it is the norm here in SO to include code for the process in question, but in this case, both processes are running the exact same code - which is why I decided not to post several hundred lines of code here.
Edit (in response to @Iain Shelvington)
The process starts in the post() method of TenantRegister view:
class TenantRegister(APIView):
    def post(self, request, *args, **kwargs):
        ...
        tenant_data = request.data.pop('tenant', dict())
        tenant_serializer = TenantSaveSerializer(data=tenant_data)
        tenant_serializer.is_valid(raise_exception=True)
        tenant = tenant_serializer.create(tenant_serializer.validated_data)
        ...
...which calls the create() method of TenantSaveSerializer:
class TenantSaveSerializer(serializers.ModelSerializer):
    class Meta:
        model = Tenant
        fields = '__all__'

    def create(self, validated_data):
        ...
        tenant = Tenant.objects.create(**validated_data)
        ...
        if has_schema and tenant.auto_create_schema:
            try:
                tenant.create_schema(check_if_exists=True, verbosity=self.verbosity)
                post_schema_sync.send(sender=Tenant, tenant=tenant)
            except Exception:
                # We failed creating the schema, delete what
                # was created and re-raise the exception.
                tenant.delete(force_drop=True)
                raise
        else:
            # Although we are not using the schema functions directly,
            # the signal might be registered by a listener.
            # Pass the tenant instance; `self` here is the serializer.
            schema_needs_to_be_sync.send(sender=Tenant, tenant=tenant)
        return tenant
...which calls the create_schema() method on the Tenant model instance:
def create_schema(self, check_if_exists=False, sync_schema=True,
                  verbosity=1):
    connection = connections[get_tenant_database_alias()]
    cursor = connection.cursor()
    # Create the schema.
    cursor.execute('CREATE SCHEMA "%s"' % self.schema_name)
    call_command(
        'migrate_schemas',
        tenant=True,
        schema_name=self.schema_name,
        interactive=False,
        verbosity=verbosity)
    connection.set_schema_to_public()
    return True
As for the timing of each migration, my colleague did those. I believe he said he just set verbosity to a higher value and the migrate_schemas process produced the timed output.
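For the coarser per-step numbers in the first table, a small timing helper is enough. The following is a minimal sketch, not our actual instrumentation: the names `timed`, `timings`, and `report` are hypothetical, and the `time.sleep` calls merely stand in for the real registration steps.

```python
import time
from contextlib import contextmanager

# Collected (label, seconds) pairs, in the order the steps finished.
timings = []


@contextmanager
def timed(label):
    """Record the wall-clock duration of the wrapped block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings.append((label, time.perf_counter() - start))


def report(rows):
    """Format the collected timings as plain-text lines, one per step."""
    return ["%-22s %8.3f s" % (label, seconds) for label, seconds in rows]


# Hypothetical usage: the sleeps stand in for the real registration steps.
with timed("Create Tenant"):
    time.sleep(0.01)
with timed("Sync Permissions"):
    time.sleep(0.02)

for line in report(timings):
    print(line)
```

Wrapping each step the same way in both invocation paths keeps the two columns of the table directly comparable.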
I am using Airflow on Cloud Composer and am having some issues with the Airflow scheduler (it is slow or stops).
Version: composer-1.10.4-airflow-1.10.6
I launched a "huge" data collection run (because I will sometimes need one) with Airflow to test the scalability of my pipelines.
The result is that my scheduler apparently only schedules the DAGs with few tasks, and the tasks of the big DAGs are not scheduled. Do you have any insights or advice about that?
Here is information about my current configuration:
Cluster config:
10 cluster nodes, 20 vCPUs, 160 GB memory
airflow config:
core
    store_serialized_dags: True
    dag_concurrency: 160
    store_dag_code: True
    min_file_process_interval: 30
    max_active_runs_per_dag: 1
    dagbag_import_timeout: 900
    min_serialized_dag_update_interval: 30
    parallelism: 160
scheduler
    processor_poll_interval: 1
    max_threads: 8
    dag_dir_list_interval: 30
celery
    worker_concurrency: 16
webserver
    default_dag_run_display_number: 5
    workers: 2
    worker_refresh_interval: 120
airflow scheduler DagBag parsing (airflow list_dags -r):
DagBag loading stats for /home/airflow/gcs/dags
Number of DAGs: 27
Total task number: 32229
DagBag parsing time: 22.468404
---------------+--------------------+---------+----------+-----------------------
file | duration | dag_num | task_num | dags
---------------+--------------------+---------+----------+-----------------------
/folder__dags/dag1 | 1.83547 | 1 | 1554 | dag1
/folder__dags/dag2 | 1.717692 | 1 | 3872 | dag2
/folder__dags/dag3 | 1.53 | 1 | 3872 | dag3
/folder__dags/dag4 | 1.391314 | 1 | 210 | dag4
/folder__dags/dag5 | 1.267788 | 1 | 3872 | dag5
/folder__dags/dag6 | 1.250022 | 1 | 1554 | dag6
/folder__dags/dag7 | 1.0973419999999998 | 1 | 2904 | dag7
/folder__dags/dag8 | 1.081566 | 1 | 3146 | dag8
/folder__dags/dag9 | 1.019032 | 1 | 3872 | dag9
/folder__dags/dag10 | 0.98541 | 1 | 1554 | dag10
/folder__dags/dag11 | 0.959722 | 1 | 160 | dag11
/folder__dags/dag12 | 0.868756 | 1 | 2904 | dag12
/folder__dags/dag13 | 0.81513 | 1 | 160 | dag13
/folder__dags/dag14 | 0.69578 | 1 | 14 | dag14
/folder__dags/dag15 | 0.617646 | 1 | 294 | dag15
/folder__dags/dag16 | 0.588876 | 1 | 210 | dag16
/folder__dags/dag17 | 0.563712 | 1 | 160 | dag17
/folder__dags/dag18 | 0.55615 | 1 | 726 | dag18
/folder__dags/dag19 | 0.553248 | 1 | 140 | dag19
/folder__dags/dag20 | 0.55149 | 1 | 168 | dag20
/folder__dags/dag21 | 0.543682 | 1 | 168 | dag21
/folder__dags/dag22 | 0.530684 | 1 | 168 | dag22
/folder__dags/dag23 | 0.498442 | 1 | 484 | dag23
/folder__dags/dag24 | 0.46574 | 1 | 14 | dag24
/folder__dags/dag25 | 0.454656 | 1 | 28 | dag25
/create_conf | 0.022272 | 1 | 20 | create_conf
/airflow_monitoring | 0.006782 | 1 | 1 | airflow_monitoring
---------------+--------------------+---------+----------+------------------------
Thank you for your help
The Airflow scheduler processes files in the DAGs directory using a round-robin scheduling algorithm. This can cause long delays between tasks, because the scheduler cannot enqueue a task whose dependencies recently completed until the round robin returns to the enclosing DAG's module. Multiple DAG objects can be defined in the same Python module; this is generally discouraged from a fault-isolation perspective, but it may be necessary to define multiple DAGs per module.
Sometimes the best approach is to restart the scheduler:
Get cluster credentials as described in official documentation
Run the following command to restart the scheduler:
kubectl get deployment airflow-scheduler -o yaml | kubectl replace --force -f -
Additionally, please restart the Airflow web server. Sometimes broken, invalid, or resource-intensive DAGs can cause web server crashes, restarts, or complete downtime. One way to restart it is to remove or upgrade one of the PyPI packages in your environment.
Exceeding API usage limits/quotas
To avoid exceeding API usage limits/quotas or avoid running too many simultaneous processes, you can define Airflow pools in the Airflow web UI and associate tasks with existing pools in your DAGs. Refer to the Airflow documentation.
Check the logs in Logging section -> Cloud Composer Environment and look for any errors or warnings like: cannot import module, DagNotFound in DagModel.
Please have a look at my earlier answer regarding memory. Referring to the official documentation:
DAG execution is RAM limited. Each task execution starts with two
Airflow processes: task execution and monitoring. Currently, each node
can take up to 6 concurrent tasks. More memory can be consumed,
depending on the size of the DAG.
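To put that figure next to the configuration in the question, here is a rough back-of-the-envelope sketch. It assumes one Celery worker per cluster node, which the question does not state, so treat the result as indicative only.

```python
# Values copied from the question's configuration.
cluster_nodes = 10
worker_concurrency = 16   # [celery] worker_concurrency
parallelism = 160         # [core] parallelism

# Figure quoted from the documentation above.
documented_tasks_per_node = 6

# ASSUMPTION: one Celery worker per cluster node.
configured_slots = cluster_nodes * worker_concurrency
documented_slots = cluster_nodes * documented_tasks_per_node

# The scheduler cannot run more tasks than either limit allows.
effective_limit = min(parallelism, configured_slots)

print(configured_slots, documented_slots, effective_limit)  # 160 60 160
```

Under that assumption the configuration allows 160 concurrent tasks, while the quoted guidance corresponds to about 60 for the same cluster, so memory pressure on the workers is worth ruling out.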
Moreover, I would like to share with you an interesting article on Medium, regarding calculations for resource requests.
I hope you find the above pieces of information useful.
I have a software application which communicates with a graphical user interface using the MQTT protocol.
The first piece of software (let's call it app) publishes messages and the second (let's call it gui) subscribes to topics.
App software
The app software is a C++ application which embeds an MQTT client based on mosquittopp.
It sends messages on different topics in order to notify the gui software.
Gui software
The gui software is a Qt application which embeds an MQTT client, also based on mosquittopp.
When a message is received on a topic, a signal messageReceived is emitted.
The signal messageReceived is connected to a slot Message_received_from_topic.
All messages from all topics are processed into Message_received_from_topic slot.
Presenting the issue
A gui screen is composed of several fields. A field is updated when a message is received on the associated topic.
When gui stays on the same screen, receiving messages through MQTT allows fields to be updated without reloading the full screen.
However, when changing to a screen which contains many fields, some glitches appear because the screen is refreshed before all fields are generated.
In order to avoid those glitches, I want to lock screen refreshing until all fields have been updated at least once, and then unlock it.
That's why not receiving MQTT messages in the right order is problematic.
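For illustration, the locking idea can also be expressed so that it does not depend on arrival order: keep the refresh locked until every expected topic has been seen once. This is only a sketch in Python for brevity (the real gui is C++/Qt); the class name `RefreshGate` and its methods are hypothetical and not part of mosquittopp or Qt.

```python
class RefreshGate:
    """Keeps screen refresh locked until every expected topic has updated once."""

    def __init__(self, expected_topics):
        # Topics that still have to deliver their first message.
        self._pending = set(expected_topics)

    def on_message(self, topic):
        """Mark `topic` as updated; return True once nothing is pending."""
        self._pending.discard(topic)
        return self.unlocked

    @property
    def unlocked(self):
        return not self._pending


# Hypothetical screen with two fields; arrival order does not matter.
gate = RefreshGate(["widget/input/value/9", "widget/input/value/10"])
first = gate.on_message("widget/input/value/10")
second = gate.on_message("widget/input/value/9")
print(first, second)  # False True
```

Such a gate needs the list of expected topics per screen up front, which is the trade-off compared with bracketing the updates between two widget/refresh messages.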
Issue analysis
The issue I'm facing is that messages are not received in order by gui.
Here is a table that presents how messages are sent and received.
+----------------------------------------------+----------------------------------------------+----------------------------------------------+
| Sent by **app** | On broker (using mosquitto_sub) | Received by **gui** |
+---------------------------+------------------+---------------------------+------------------+---------------------------+------------------+
| Topic | Message | Topic | Message | Topic | Message |
+---------------------------+------------------+---------------------------+------------------+---------------------------+------------------+
| widget/refresh | 0 | widget/refresh | 0 | widget/refresh | 0 |
| widget/input/value/9 | {"label":"2020"} | widget/input/value/9 | {"label":"2020"} | widget/refresh | 1 |
| widget/input/value/10 | {"label":"02"} | widget/input/value/10 | {"label":"02"} | widget/input/value/9 | {"label":"2020"} |
| widget/input/value/11 | {"label":"20"} | widget/input/value/11 | {"label":"20"} | widget/input/position/9 | 1 |
| widget/input/value/12 | {"label":"15"} | widget/input/value/12 | {"label":"15"} | widget/input/selection/9 | true |
| widget/input/value/13 | {"label":"06"} | widget/input/value/13 | {"label":"06"} | widget/input/value/10 | {"label":"02"} |
| widget/input/position/12 | 0 | widget/input/position/12 | 0 | widget/input/position/10 | 0 |
| widget/input/selection/12 | false | widget/input/selection/12 | false | widget/input/selection/10 | true |
| widget/input/position/13 | 0 | widget/input/position/13 | 0 | widget/input/value/11 | {"label":"20"} |
| widget/input/selection/13 | false | widget/input/selection/13 | false | widget/input/position/11 | 0 |
| widget/input/position/9 | 0 | widget/input/position/9 | 0 | widget/input/selection/11 | true |
| widget/input/selection/9 | true | widget/input/selection/9 | true | widget/input/value/12 | {"label":"15"} |
| widget/input/position/10 | 0 | widget/input/position/10 | 0 | widget/input/position/12 | 0 |
| widget/input/selection/10 | true | widget/input/selection/10 | true | widget/input/selection/12 | false |
| widget/input/position/11 | 0 | widget/input/position/11 | 0 | widget/input/value/13 | {"label":"06"} |
| widget/input/selection/11 | true | widget/input/selection/11 | true | widget/input/position/13 | 0 |
| widget/input/position/9 | 1 | widget/input/position/9 | 1 | widget/input/selection/13 | false |
| widget/refresh | 1 | widget/refresh | 1 | | |
+---------------------------+------------------+---------------------------+------------------+---------------------------+------------------+
Here the messages are received in the wrong order, and one message on topic "widget/input/position/9" is missing.
All publications and subscriptions are using QoS 2.
Questions
Does anybody know why messages are not received in the right order by gui when they are correctly received using mosquitto_sub?
Is there any configuration to apply to the mosquitto broker in order to preserve message order?
I am using a window function to get the difference in the values of a column (downloads) between two dates. I'd also like to get the product of that difference multiplied by the size of the file to get the bytes downloaded for the period.
With the help of this community, I am able to get the number of downloads but cannot find the correct syntax to get the product of downloads * size.
Table 'files'
+---------------+------------------------+------+-----------+------------+
| site | full_path | size | downloads | date_stamp |
+---------------+------------------------+------+-----------+------------+
| Lawrenceville | lr1/dir1/subdir1/file1 | 1000 | 7 | 2019-08-08 |
| Lawrenceville | lr1/dir1/subdir1/file1 | 1010 | 9 | 2019-08-15 |
| Lawrenceville | lr1/dir1/subdir1/file2 | 1213 | 5 | 2019-08-08 |
| Lawrenceville | lr1/dir1/subdir1/file2 | 2000 | 5 | 2019-08-15 |
| Lawrenceville | lr1/dir2/subdir1/file1 | 2213 | 5 | 2019-08-15 |
| Rennes | rr1/dir1/subdir1/file3 | 200 | 3 | 2019-08-08 |
| Rennes | rr1/dir1/subdir1/file3 | 201 | 4 | 2019-08-15 |
+---------------+------------------------+------+-----------+------------+
SELECT site, sum(diff) FROM (SELECT site, downloads - lag(downloads, 1) OVER (PARTITION BY site, full_path ORDER BY date_stamp) AS diff FROM files WHERE date_stamp IN ('2019-08-15', '2019-08-08')) group by site
produces this:
+---------------+-----------+
| site | downloads |
+---------------+-----------+
| Lawrenceville | 2 |
| Rennes | 1 |
+---------------+-----------+
I have tried:
SELECT site, sum(diff), sum(sum(diff)*bytes) FROM (SELECT site, downloads - lag(downloads, 1), size OVER (PARTITION BY site, full_path ORDER BY date_stamp) AS diff, bytes FROM files WHERE date_stamp IN ('2019-08-15', '2019-08-08')) group by site
sqlite3.OperationalError: near "(": syntax error
Ideally I want this output:
+---------------+-----------+----------+
| site | downloads | bytes |
+---------------+-----------+----------+
| Lawrenceville | 2 | 2020 |
| Rennes | 1 | 201 |
+---------------+-----------+----------+
Lawrenceville had 2 downloads of file lr1/dir1/subdir1/file1 which is 1010 bytes (on 2019-08-15). File lr1/dir1/subdir1/file2 had no downloads for that period. It would be nice to include files lr1/dir1/subdir1/file2 and lr1/dir2/subdir1/file1 but they get excluded by the window function. I can get them with a separate query.
Rennes has 1 download of file rr1/dir1/subdir1/file3
If your current query works, then you only need the max() window function in the subquery:
SELECT site, sum(diff) downloads, sum(diff) * size bytes
FROM (
    SELECT
        site,
        downloads - lag(downloads, 1) OVER (PARTITION BY site, full_path ORDER BY date_stamp) AS diff,
        max(size) OVER (PARTITION BY site, full_path) AS size
    FROM files
    WHERE date_stamp IN ('2019-08-15', '2019-08-08')
)
GROUP BY site
See the demo.
Results:
| site | downloads | bytes |
| ------------- | --------- | ----- |
| Lawrenceville | 2 | 2020 |
| Rennes | 1 | 201 |
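As a hedged variation, the sketch below loads the sample table into an in-memory SQLite database from Python and uses `sum(diff * max_size)` instead of `sum(diff) * size`. Multiplying per file before aggregating avoids relying on which row's size SQLite picks for a non-aggregated column in a grouped query. The alias `max_size` is my own naming, and window functions need SQLite 3.25 or later.

```python
import sqlite3

# Sample rows copied from the 'files' table in the question.
rows = [
    ("Lawrenceville", "lr1/dir1/subdir1/file1", 1000, 7, "2019-08-08"),
    ("Lawrenceville", "lr1/dir1/subdir1/file1", 1010, 9, "2019-08-15"),
    ("Lawrenceville", "lr1/dir1/subdir1/file2", 1213, 5, "2019-08-08"),
    ("Lawrenceville", "lr1/dir1/subdir1/file2", 2000, 5, "2019-08-15"),
    ("Lawrenceville", "lr1/dir2/subdir1/file1", 2213, 5, "2019-08-15"),
    ("Rennes", "rr1/dir1/subdir1/file3", 200, 3, "2019-08-08"),
    ("Rennes", "rr1/dir1/subdir1/file3", 201, 4, "2019-08-15"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE files (site TEXT, full_path TEXT, size INTEGER, "
    "downloads INTEGER, date_stamp TEXT)"
)
conn.executemany("INSERT INTO files VALUES (?, ?, ?, ?, ?)", rows)

# diff is NULL for the first row of each file, so those rows drop out of the sums.
query = """
SELECT site, sum(diff) AS downloads, sum(diff * max_size) AS bytes
FROM (
    SELECT site,
           downloads - lag(downloads, 1)
               OVER (PARTITION BY site, full_path ORDER BY date_stamp) AS diff,
           max(size) OVER (PARTITION BY site, full_path) AS max_size
    FROM files
    WHERE date_stamp IN ('2019-08-15', '2019-08-08')
)
GROUP BY site
ORDER BY site
"""
result = conn.execute(query).fetchall()
print(result)  # [('Lawrenceville', 2, 2020), ('Rennes', 1, 201)]
```

On the sample data both forms give the same numbers; they would differ once more than one file per site has a non-zero difference.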
I've got a supervised data set with 6836 instances, and I need to know the predictions of my model for all the instances, not only for a test set.
I followed the train-test approach (2/3-1/3) to find my TPR and FPR rates, and I have the predictions for my test set (1/3), but I need the predictions for all 6836 instances.
How can I do it?
Thanks!
In the Classify tab in Weka Explorer there should be a button labeled 'More options...'. In that dialog you should be able to output predictions as plain text. If you use cross-validation rather than a percentage split, you will get predictions for all instances in a table like this:
+-------+--------+-----------+-------+------------+
| inst# | actual | predicted | error | prediction |
+-------+--------+-----------+-------+------------+
| 1 | 2:no | 1:yes | + | 0.926 |
| 2 | 1:yes | 1:yes | | 0.825 |
| 1 | 2:no | 1:yes | + | 0.636 |
| 2 | 1:yes | 1:yes | | 0.808 |
| ... | ... | ... | ... | ... |
+-------+--------+-----------+-------+------------+
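The reason cross-validation covers every instance is that each instance falls into the test fold exactly once. The following sketch illustrates only that property. It is not how Weka builds its folds (Weka also randomizes and stratifies), the function name `k_fold_test_indices` is hypothetical, and 10 folds is chosen for illustration.

```python
def k_fold_test_indices(n_instances, k):
    """Split indices 0..n_instances-1 into k contiguous test folds."""
    base, extra = divmod(n_instances, k)
    folds, start = [], 0
    for fold in range(k):
        # The first `extra` folds take one additional instance each.
        size = base + (1 if fold < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


folds = k_fold_test_indices(6836, 10)
covered = sorted(i for fold in folds for i in fold)
print(len(folds), len(covered))  # 10 6836
```

Because the folds are disjoint and together cover all indices, collecting the test predictions of every fold yields one prediction per instance.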
If you don't want to do cross-validation, you can also create a data set containing all your data (training + test) and supply it as the test set. Then you can go to 'More options...' and output the predictions as Campino already described.
I have two models, Version and Description.
class Version(models.Model):
    version_name = models.CharField(max_length=100)
    version_value = models.IntegerField()
    url = models.CharField(max_length=240)

class Description(models.Model):
    version = models.ForeignKey(Version)
    lang = models.CharField(max_length=8)
    content = models.TextField()
And a DescriptionSerializer.
class DescriptionSerializer(serializers.ModelSerializer):
    version_name = serializers.RelatedField(source='version')

    class Meta:
        model = Description
        fields = ('version_name', 'content')
They store the descriptions of different versions in different languages.
E.g.
Version
+----+--------------+---------------+---------------------+
| id | version_name | version_value | url |
+----+--------------+---------------+---------------------+
| 1 | 1.0.0 | 1 | http://abc.net.tw/ |
| 2 | 1.0.1 | 2 | http://abc.net.tw/2 |
| 3 | 1.0.2 | 3 | http://abc.net.tw/3 |
| 4 | 1.0.3 | 4 | http://abc.net.tw/4 |
| 7 | 1.1.0 | 5 | http://abc.net.tw/5 |
| 8 | 1.1.1 | 6 | http://abc.net.tw/6 |
+----+--------------+---------------+---------------------+
Description
+------------+-------+---------+
| version_id | lang | content |
+------------+-------+---------+
| 1 | en_US | English |
| 1 | zh_TW | Chinese |
| 1 | es_ES | Spanish |
| 2 | en_US | English |
| 2 | zh_TW | Chinese |
| 2 | es_ES | Spanish |
| 3 | en_US | English |
| 3 | zh_TW | Chinese |
| 3 | es_ES | Spanish |
| 4 | en_US | English |
| 7 | en_US | English |
| 8 | en_US | English |
| 4 | es_ES | Spanish |
| 7 | es_ES | Spanish |
+------------+-------+---------+
I'm using Django REST framework to implement a web API that returns the description of each version in a given language. If a description in that language doesn't exist, the English version should be used instead.
I can use the following SQL to retrieve the desired result. I've read DRF's docs on RelatedField and reverse relations, but I still can't figure out how to use Django's ORM to do the same thing and use it with Django REST framework's serializer.
select
coalesce(d.id, d2.id), coalesce(d.version_id, d2.version_id), coalesce(d.lang, d2.lang), coalesce(d.content, d2.content)
from
version v
left outer join description d on v.id = d.version_id and d.lang='zh_TW'
left outer join description d2 on v.id = d2.version_id and d2.lang='en_US'
Please advise how to do it in django.
You can't use the Django ORM for everything. There are numerous things you can't do with it. For those cases you either use straight SQL (from django.db import connection, transaction, etc.) or, if the query results can be mapped onto the models you have described, you can use raw queries.
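Another option, offered here only as a sketch, is to fetch the rows for both languages with one ordinary query and apply the fallback in Python. The function name `pick_descriptions` is hypothetical. The rows are plain tuples here; in Django they could come from something like `Description.objects.filter(lang__in=[lang, "en_US"]).values_list("version_id", "lang", "content")`, which is an assumption about how you would wire it up.

```python
def pick_descriptions(rows, lang, fallback="en_US"):
    """Return {version_id: (lang, content)}, preferring `lang` over `fallback`."""
    chosen = {}
    for version_id, row_lang, content in rows:
        if row_lang == lang:
            # The requested language always wins, even if seen after the fallback.
            chosen[version_id] = (row_lang, content)
        elif row_lang == fallback:
            # Keep the fallback only if nothing has been chosen yet.
            chosen.setdefault(version_id, (row_lang, content))
    return chosen


# A few rows taken from the Description table in the question.
rows = [
    (1, "en_US", "English"),
    (1, "zh_TW", "Chinese"),
    (4, "en_US", "English"),
    (4, "es_ES", "Spanish"),
]
picked = pick_descriptions(rows, "zh_TW")
print(picked)  # {1: ('zh_TW', 'Chinese'), 4: ('en_US', 'English')}
```

Unlike the SQL with its left joins, this returns no entry for a version that has neither the requested language nor English.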