I've created a project in Xcode 7 that generates code coverage data.
Inside its DerivedData folder, I can run llvm-cov show:
/usr/local/opt/llvm/bin/llvm-cov show -instr-profile Build/Intermediates/CodeCoverage/testetestes/Coverage.profdata Build/Intermediates/CodeCoverage/testetestes/Products/Debug-iphonesimulator/testetestes.framework/testetestes
This will produce an output like this:
/Users/marcelofabri/Desktop/testetestes/testetestes/Example.swift:
| 1|//
| 2|// Example.swift
| 3|// testetestes
| 4|//
| 5|// Created by Marcelo Fabri on 09/06/15.
| 6|// Copyright © 2015 Marcelo Fabri. All rights reserved.
| 7|//
| 8|
| 9|import UIKit
| 10|
| 11|class Example: NSObject {
1| 12| func testando() {
1| 13| if let url = NSURL(string: "dasdas") {
1| 14| print("ae \(url)")
0| 15| } else {
0| 16| print("oi")
0| 17| }
1| 18| }
| 19|}
/Users/marcelofabri/Desktop/testetestes/testetestes/OutraClasse.swift:
| 1|//
| 2|// OutraClasse.swift
| 3|// testetestes
| 4|//
| 5|// Created by Marcelo Fabri on 18/06/15.
| 6|// Copyright © 2015 Marcelo Fabri. All rights reserved.
| 7|//
| 8|
| 9|import UIKit
| 10|
| 11|class OutraClasse: NSObject {
| 12|
1| 13| func outroTestando() {
1| 14| if let numero = Int("123") {
1| 15| print("ae \(numero)")
0| 16| } else {
0| 17| print("oi")
0| 18| }
1| 19| }
| 20|
| 21|}
However, I'd like to get .gcov files, since that's what most tools use. Is there a way to do this without parsing the output and creating the .gcov files manually?
According to Apple, gcov is not part of Xcode 7's coverage support. Gcov was gcc legacy that stayed around until a replacement appeared; apparently they dropped support for the legacy gcov file format in favor of the new intermediate format, profdata. I did some research of my own and didn't find any tool that converts profdata back to gcov. However, there is Slather from Venmo. Slather can generate coverage reports in Gutter JSON, Cobertura XML, HTML and plain text, and it can also integrate with popular services like Coveralls. Currently it works only with gcov, but they have an issue open and a PR pending for profdata support. They usually move fast, so it will likely be merged into master soon.
Also, if you decide to write your own tool, there are several approaches you may consider:
Converting the plain-text output of llvm-cov show (see the sketch below)
Converting the binary profdata format by following the format documentation
Helping the Slather folks introduce cross-converting from their model back into gcov, once profdata support is merged in
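To illustrate the first approach, here is a minimal sketch (not a complete converter; the file-header detection, the regex and the summary output are my own assumptions) that parses the plain-text output of llvm-cov show into per-file, per-line execution counts:

import re
import sys
from collections import defaultdict

# Matches llvm-cov show source lines such as "      1|   13|    if let url = ..."
# (the count column is empty for lines that are not instrumented).
LINE_RE = re.compile(r'^\s*(\d*)\|\s*(\d+)\|(.*)$')

def parse_llvm_cov_show(path):
    """Parse captured `llvm-cov show` output into {source_file: {line_no: count}}."""
    coverage = defaultdict(dict)
    current_file = None
    with open(path) as f:
        for raw in f:
            line = raw.rstrip('\n')
            # File headers look like "/path/to/Example.swift:"
            if line.endswith(':') and '|' not in line:
                current_file = line[:-1]
                continue
            match = LINE_RE.match(line)
            if match and current_file:
                count, line_no = match.group(1), int(match.group(2))
                if count:  # an empty count column means the line is not instrumented
                    coverage[current_file][line_no] = int(count)
    return coverage

if __name__ == '__main__':
    for source, lines in parse_llvm_cov_show(sys.argv[1]).items():
        hit = sum(1 for c in lines.values() if c > 0)
        print('%s: %d/%d instrumented lines executed' % (source, hit, len(lines)))

Redirect the llvm-cov show output to a file, feed that file to the script, and build whatever .gcov-like layout you need from the returned mapping.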
Background:
We are creating a SaaS app using a Vue front-end, Django/DRF backend and PostgreSQL, all running in a Docker environment. The benchmarks below were run on our local dev machines.
The process to register a new "owner" is rather complex. It does the following:
Create tenant and schema
Run migrations (done in the create schema process)
Create MinIO bucket
Load "production" fixtures
Run sync_permissions
Create an owner instance in the newly created schema
We are seeing some significant differences in processing times for some of the above steps running the registration process in different ways. In trying to figure out our issue, we have tried the following four methods to invoke the registration process:
from the Vue front-end hitting the API endpoint
from a REST client (Talend)
from the APIBrowser (provided by DRF)
(in some cases) via manage.py
We tried it from the REST client to try to eliminate Vue as the culprit, but we got similar times between Vue and the REST client.
We also saw similar times between the APIBrowser and the manage.py method, so in the tables below, we are comparing Talend to APIBrowser (or manage.py).
The issue:
Here are the processing times (in seconds) for several of the steps listed above:
|---------------------|--------|------------|--------|
| Process             | Talend | APIBrowser | Factor |
|---------------------|--------|------------|--------|
| Create Tenant       | 11.853 |      1.185 |   10.0 |
| Create MinIO Bucket |  0.386 |      0.273 |    1.4 |
| Load Fixtures       |  0.926 |      0.215 |    4.3 |
| Sync Permissions    | 61.115 |      5.390 |   11.3 |
|---------------------|--------|------------|--------|
| Overall             | 74.280 |      7.053 |   10.5 |
|---------------------|--------|------------|--------|
In both cases (Talend and APIBrowser), it is running the exact same code. We don't understand why the REST client method takes more than 10 times as long as running from APIBrowser.
We then tried to get down to finer detail in our benchmark timing. We focused on the first step and quickly noticed that it was the process of running migrate_schemas that was the issue. Here's a list of processing times (in seconds) for each migration file it processed. This time, we ran the second pass via manage.py instead of APIBrowser, but as mentioned previously, those times were comparable.
|---------------------|--------|-----------|--------|
| Migration file | Talend | manage.py | Factor |
|---------------------|--------|-----------|--------|
| activity_log.0001 | 0.133 | 0.013 | 10.2 |
| countries.0001 | 0.086 | 0.013 | 6.6 |
| contenttypes.0001 | 0.178 | 0.022 | 8.1 |
| contenttypes.0002 | 0.159 | 0.033 | 4.8 |
| auth.0001 | 0.530 | 0.092 | 5.8 |
| auth.0002 | 0.124 | 0.022 | 5.6 |
| auth.0003 | 0.090 | 0.023 | 3.9 |
| auth.0004 | 0.097 | 0.027 | 3.6 |
| auth.0005 | 0.126 | 0.016 | 7.9 |
| auth.0006 | 0.079 | 0.006 | 13.2 |
| auth.0007 | 0.079 | 0.011 | 7.2 |
| auth.0008 | 0.100 | 0.011 | 9.1 |
| auth.0009 | 0.085 | 0.014 | 6.1 |
| auth.0010 | 0.121 | 0.015 | 8.1 |
| auth.0011 | 0.087 | 0.018 | 4.8 |
| users.0001 | 0.871 | 0.115 | 7.6 |
| admin.0001 | 0.270 | 0.035 | 7.7 |
| admin.0002 | 0.093 | 0.022 | 4.2 |
| admin.0003 | 0.091 | 0.024 | 3.8 |
| authtoken.0001 | 0.193 | 0.036 | 5.4 |
| authtoken.0002 | 0.395 | 0.090 | 4.4 |
| clients.0001 | 0.537 | 0.082 | 6.5 |
| clients.0002 | 0.519 | 0.145 | 3.6 |
| projects.0001 | 0.475 | 0.062 | 7.7 |
| projects.0002 | 0.293 | 0.062 | 4.7 |
| sessions.0001 | 0.191 | 0.023 | 8.3 |
| tasks.0001 | 0.241 | 0.122 | 2.0 |
| tenants.0001 | 0.086 | 0.017 | 5.1 |
|---------------------|--------|-----------|--------|
| Total time: | 10.404 | 1.618 | 6.4 |
|---------------------|--------|-----------|--------|
Our Theory:
We think it must have something to do with Talend (and Vue) initiating the process from a different domain (as it will be when the site is live), whereas the APIBrowser starts from the same domain the endpoint is defined on.
That means, in our local environment, running from Vue, we are on local.dev and it hits the local.api endpoint. But running from APIBrowser, we go directly to local.api, then fill in the data on the form and POST it.
Our theory is that it must be affecting how files are accessed. The migrate_schemas process has to open many .py files. And the worst culprit, SyncPermissions, is processing many .yaml files where we have defined our default permission structure utilized by each tenant. I should point out that the LoadFixtures process also opens external .yaml files, but in this case, it only has one file to process, so the difference is minimized.
It may be like the difference between opening an image file in code vs. a template showing an image via HTML. In the HTML version, it's essentially another request on the server - which surely takes longer than programmatically opening an image on disk.
What we don't understand is why opening files in these processes would be affected by the two methods of initiating the process. Obviously, since the site will have to run in Vue, having the registration process take 70 seconds when we know it could be done in only 7 seconds is unacceptable.
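For anyone who wants to reproduce this kind of measurement, here is a minimal sketch of the per-step timing described above (the wrapped steps are placeholders, not our actual registration code):

import time
from contextlib import contextmanager

@contextmanager
def timed(label, results):
    """Store the wall-clock duration of the wrapped block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start

if __name__ == '__main__':
    timings = {}
    # In the real view/command these blocks would wrap the actual steps
    # (create tenant/schema, create bucket, load fixtures, sync permissions);
    # the sleeps below are placeholders so the sketch runs on its own.
    with timed('Create Tenant', timings):
        time.sleep(0.1)
    with timed('Sync Permissions', timings):
        time.sleep(0.2)
    for label, seconds in timings.items():
        print('%-20s %.3f s' % (label, seconds))

Wrapping each step like this keeps the instrumentation independent of which client initiates the request.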
Note:
I realize it is the norm here on SO to include code for the process in question, but in this case both processes are running the exact same code, which is why I decided not to post several hundred lines of code here.
Edit (in response to @Iain Shelvington)
The process starts in the post() method of the TenantRegister view:
class TenantRegister(APIView):
    def post(self, request, *args, **kwargs):
        ...
        tenant_data = request.data.pop('tenant', dict())
        tenant_serializer = TenantSaveSerializer(data=tenant_data)
        tenant_serializer.is_valid(raise_exception=True)
        tenant = tenant_serializer.create(tenant_serializer.validated_data)
        ...
...which calls the create() method of TenantSaveSerializer:
class TenantSaveSerializer(serializers.ModelSerializer):
    class Meta:
        model = Tenant
        fields = '__all__'

    def create(self, validated_data):
        ...
        tenant = Tenant.objects.create(**validated_data)
        ...
        if has_schema and tenant.auto_create_schema:
            try:
                tenant.create_schema(check_if_exists=True, verbosity=self.verbosity)
                post_schema_sync.send(sender=Tenant, tenant=tenant)
            except Exception:
                # We failed creating the schema, delete what
                # was created and re-raise the exception.
                tenant.delete(force_drop=True)
                raise
        else:
            # Although we are not using the schema functions directly,
            # the signal might be registered by a listener.
            schema_needs_to_be_sync.send(sender=Tenant, tenant=self)
        return tenant
...which calls the create_schema() method on the Tenant model instance:
def create_schema(self, check_if_exists=False, sync_schema=True,
                  verbosity=1):
    connection = connections[get_tenant_database_alias()]
    cursor = connection.cursor()

    # Create the schema.
    cursor.execute('CREATE SCHEMA "%s"' % self.schema_name)
    call_command(
        'migrate_schemas',
        tenant=True,
        schema_name=self.schema_name,
        interactive=False,
        verbosity=verbosity)
    connection.set_schema_to_public()
    return True
As for the timing of each migration, my colleague produced those numbers. I believe he said he just set verbosity to a higher value and the migrate_schemas process produced the timed output.
I have a data set that tracks the location change for a person. The data set includes person ID, a serial number for records within person ID, start datetime for the current location, leave datetime for this location, current location code and the prior location code.
My goal is to reorganize the data set so that each row shows how long the person actually stayed at a location, and the rows should still stay in ascending date/time order.
See a snippet of the data set below as an example:
| id| rec_no| enter_datetime| leave_datetime| loc| prior_loc|
---------------------------------------------------------------------------------
| 1| 1| 1/10/2009 6:27 pm|1/10/2009 6:29 pm| SICU^6108| 64^6422|
| 1| 2| 1/10/2009 6:29 pm|1/13/2009 5:26 pm| SICU^6108| SICU^6108|
| 1| 3| 1/13/2009 5:26 pm|1/16/2009 5:24 pm| 64^6440| SICU^6108|
| 1| 4| 1/16/2009 5:24 pm|1/16/2009 5:34 pm| SICU^SICX| 64^6440|
...
...
| 1| 8| 2/25/2009 3:45 pm|2/25/2009 3:58 pm| 64^6418| 64^6438|
| 1| 9| 2/25/2009 3:58 pm|3/9/2009 3:16 pm| 64^6418| 64^6418|
| 1| 10| 3/9/2009 3:16 pm|3/9/2009 3:16 pm| 64^6418| 64^6418|
The first two rows show that this patient stayed at "SICU^6108" until 1/13/2009 5:26 pm. So these two rows should be combined into one.
The last 3 records show that the stay at "64^6418" lasted until 3/9/2009 3:16 pm. Hence these last three rows should be combined into one.
Rows with rec_no = 3 and rec_no = 4 should stay as they are.
The end goal of the data set should be like:
| id| rec_no| enter_datetime| leave_datetime| loc| prior_loc|
-----------------------------------------------------------------------------
| 1| 1| 1/10/2009 6:27 pm|1/13/2009 5:26 pm| SICU^6108| 64^6422|
| 1| 3| 1/13/2009 5:26 pm|1/16/2009 5:24 pm| 64^6440| SICU^6108|
| 1| 4| 1/16/2009 5:24 pm|1/16/2009 5:34 pm| SICU^SICX| 64^6440|
...
...
| 1| 8| 2/25/2009 3:45 pm| 3/9/2009 3:16 pm| 64^6418| 64^6438|
I am using SAS. I am thinking I should use lag/lead functions to get the date/time value of the next row (or rows) when the location stays the same. The issue I am having is that you don't know how many records to look down to get the correct end (or start) datetime for a location. In the example provided, the first two rows only require looking ahead 1 row, but the last three records require looking down 2 rows.
What about sorting in reverse datetime order to achieve this?
proc sort data = have;
    by id descending enter_datetime;
run;

data want;
    set have;
    by id;
    retain final_leave_dt;

    /* reset the carried-over leave time at the start of each id */
    if first.id then call missing(final_leave_dt);

    /* remember the latest leave time seen for the current stay */
    if missing(final_leave_dt) then final_leave_dt = leave_datetime;

    /* reading backwards, the row where loc changes marks the start of the stay */
    if loc ~= prior_loc then do;
        leave_datetime = final_leave_dt;
        output;
        call missing(final_leave_dt);
    end;

    drop final_leave_dt;
run;

/* put the combined rows back in ascending date/time order */
proc sort data = want;
    by id enter_datetime;
run;
I can check for a range of values in SQL with the BETWEEN operator:
MySQL [distributor]> select prod_name, prod_price from products where prod_price between 3.49 and 11.99;
+---------------------+------------+
| prod_name | prod_price |
+---------------------+------------+
| Fish bean bag toy | 3.49 |
| Bird bean bag toy | 3.49 |
| Rabbit bean bag toy | 3.49 |
| 8 inch teddy bear | 5.99 |
| 12 inch teddy bear | 8.99 |
| 18 inch teddy bear | 11.99 |
| Raggedy Ann | 4.99 |
| King doll | 9.49 |
| Queen doll | 9.49 |
+---------------------+------------+
9 rows in set (0.005 sec)
I referred to the Django docs and found gte, gt, lt and lte, but no between.
How can I achieve the BETWEEN functionality?
Use the range lookup in the Django ORM: products.objects.filter(prod_price__range=(3.49, 11.99)). See the Django documentation on the range field lookup for more info.
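For completeness, range is inclusive and renders as SQL BETWEEN, matching the MySQL query above, and it is equivalent to combining gte and lte. A small sketch, assuming the same hypothetical products model implied above:

# assumes a model along the lines of:
# class products(models.Model):
#     prod_name = models.CharField(max_length=255)
#     prod_price = models.DecimalField(max_digits=6, decimal_places=2)

between = products.objects.filter(prod_price__range=(3.49, 11.99))
explicit = products.objects.filter(prod_price__gte=3.49, prod_price__lte=11.99)

# range renders as "prod_price BETWEEN 3.49 AND 11.99"; the gte/lte pair
# renders as ">= 3.49 AND <= 11.99" and returns the same rows.
print(between.query)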
Suppose we are given a dataset ("DATA") like:
YEAR | FIRST NAME | LAST NAME | VARIABLES
2008 | JOY | ANDERSON | spark|python|scala; 45;w/o sports;w datascience
2008 | STEVEN | JOHNSON | Spark|R; 90|56
2006 | NIHA | DIVA | w/o sports
and we have another dataset ("RESULT") like:
YEAR | FIRST NAME | LAST NAME
1992 | EMMA | CENA
2008 | JOY | ANDERSON
2008 | STEVEN | ANDERSON
2006 | NIHA | DIVA
and so on.
The output ("RESULT") should be:
YEAR | FIRST NAME | LAST NAME | SUBJECT | SCORE | SPORTS | DATASCIENCE
1992 | EMMA | CENA | | | |
2008 | JOY | ANDERSON | SPARK | 45 | FALSE | TRUE
2008 | JOY | ANDERSON | PYTHON | 45 | FALSE | TRUE
2008 | JOY | ANDERSON | SCALA | 45 | FALSE | TRUE
2008 | STEVEN | ANDERSON | | | |
2006 | NIHA | DIVA | | | FALSE |
2008 | STEVEN | JOHNSON | SPARK | 90 | |
2008 | STEVEN | JOHNSON | SPARK | 56 | |
2008 | STEVEN | JOHNSON | R | 90 | |
2008 | STEVEN | JOHNSON | R | 56 | |
and so on.
Please note that there are some rows in DATA which are not present in RESULT and vice versa. For example, "2008,STEVEN,JOHNSON" is not present in RESULT but is present in DATA, and such entries should still be added to the RESULT dataset. The columns {SUBJECT, SCORE, SPORTS, DATASCIENCE} come from my intuition that "spark" refers to the SUBJECT, and so on.
I hope you understand my query. I am using spark-shell with Spark DataFrames.
Note that "Spark" and "spark" should be considered the same.
As explained in the comments, you can implement some of the tricky logic as in the answers to "splitting row in multiple row in spark-shell".
data:
val df = List(
    ("2008","JOY ","ANDERSON ","spark|python|scala;45;w/o sports;w datascience"),
    ("2008","STEVEN ","JOHNSON ","Spark|R;90|56"),
    ("2006","NIHA ","DIVA ","w/o sports")
  ).toDF("YEAR","FIRST NAME","LAST NAME","VARIABLE")
I only highlight the relatively tricky parts; you can figure out the details yourself. I suggest handling the "w" and "w/o" tags separately. Furthermore, you have to explode the languages and the scores in separate expressions. This gives:
val sep = ";;"  // assumed: any separator that never occurs in the data
val step1 = df.withColumn("backrefReplace",
      split(regexp_replace('VARIABLE,"^([A-z|]+)?;?([\\d\\|]+)?;?(w.*)?$","$1"+sep+"$2"+sep+"$3"),sep))
    .withColumn("letter",explode(split('backrefReplace(0),"\\|")))
    .select('YEAR,$"FIRST NAME",$"LAST NAME",'VARIABLE,'letter,
      explode(split('backrefReplace(1),"\\|")).as("digits"),
      'backrefReplace(2).as("tags")
    )
which gives
scala> step1.show(false)
+----+----------+---------+----------------------------------------------+------+------+------------------------+
|YEAR|FIRST NAME|LAST NAME|VARIABLE |letter|digits|tags |
+----+----------+---------+----------------------------------------------+------+------+------------------------+
|2008|JOY |ANDERSON |spark|python|scala;45;w/o sports;w datascience|spark |45 |w/o sports;w datascience|
|2008|JOY |ANDERSON |spark|python|scala;45;w/o sports;w datascience|python|45 |w/o sports;w datascience|
|2008|JOY |ANDERSON |spark|python|scala;45;w/o sports;w datascience|scala |45 |w/o sports;w datascience|
|2008|STEVEN |JOHNSON |Spark|R;90|56 |Spark |90 | |
|2008|STEVEN |JOHNSON |Spark|R;90|56 |Spark |56 | |
|2008|STEVEN |JOHNSON |Spark|R;90|56 |R |90 | |
|2008|STEVEN |JOHNSON |Spark|R;90|56 |R |56 | |
|2006|NIHA |DIVA |w/o sports | | |w/o sports |
+----+----------+---------+----------------------------------------------+------+------+------------------------+
Then you have to handle capitalisation and the tags. For the tags, you can write relatively generic code using explode and pivot, but you have to do some cleaning to match your exact result. Here is an example:
List(("a;b;c")).toDF("str")
  .withColumn("char",explode(split('str,";")))
  .groupBy('str)
  .pivot("char")
  .count
  .show()
+-----+---+---+---+
| str| a| b| c|
+-----+---+---+---+
|a;b;c| 1| 1| 1|
+-----+---+---+---+
Read more about pivot here
The final step is simply to do a left join with the second dataset (your initial "RESULT").
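A rough sketch of that join, shown in PySpark for brevity (the Scala DataFrame API is analogous); the processed DataFrame below is a stand-in for the result of the steps above:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("result-join").getOrCreate()

# Stand-ins: `processed` represents DATA after the explode/pivot steps above,
# `result` is the original RESULT dataset.
processed = spark.createDataFrame(
    [("2008", "JOY", "ANDERSON", "SPARK", "45", False, True)],
    ["YEAR", "FIRST NAME", "LAST NAME", "SUBJECT", "SCORE", "SPORTS", "DATASCIENCE"])
result = spark.createDataFrame(
    [("1992", "EMMA", "CENA"), ("2008", "JOY", "ANDERSON")],
    ["YEAR", "FIRST NAME", "LAST NAME"])

joined = result.join(processed, on=["YEAR", "FIRST NAME", "LAST NAME"], how="left")
joined.show()

Note that a plain left join only keeps the keys present in RESULT; if rows that exist only in DATA (such as 2008, STEVEN, JOHNSON in the expected output) must also appear, use how="outer" on the same keys.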
I've got a supervised data set with 6836 instances, and I need to know the predictions of my model for all the instances, not only for a test set.
I followed the train-test approach (2/3-1/3) to get my TPR and FPR rates, and I've got the predictions for my test set (1/3), but I need the predictions for all 6836 instances.
How can I do it?
Thanks!
In the Classify tab in the Weka Explorer there should be a button that says 'More options...'. If you go in there, you should be able to output predictions as plain text. If you use cross-validation rather than a percentage split, you will get predictions for all instances in a table like this:
+-------+--------+-----------+-------+------------+
| inst# | actual | predicted | error | prediction |
+-------+--------+-----------+-------+------------+
| 1 | 2:no | 1:yes | + | 0.926 |
| 2 | 1:yes | 1:yes | | 0.825 |
| 1 | 2:no | 1:yes | + | 0.636 |
| 2 | 1:yes | 1:yes | | 0.808 |
| ... | ... | ... | ... | ... |
+-------+--------+-----------+-------+------------+
If you don't want to do cross-validation, you can also create a data set containing all your data (training + test) and supply it as the test set. Then you can go to 'More options...' and output the predictions as Campino already answered.