I have an MDNode* (found using DebugInfoFinder). I want to find out all the other MDNodes in the module that use it, but its use list appears to be empty. How do I find that?
For example, I have something like:
...
!5 = metadata !{i32 100}
...
!8 = metadata !{i32 101, metadata !5}
...
If I have metadata !{i32 100}, how do I get a reference to metadata !{i32 101, metadata !5}?
Since the "use" of !5 is !8, another MDNode, this is intentional.
A MDNode is not considered to be a "user"; note that while Instruction is-a User, MDNode is not. Metadata cannot affect code generation, by design. Had MDNode been a "user" of values, dead values whose only use is in metadata could not be killed, which goes against the design.
Pragmatically, this means that to perform interesting analyses on metadata you will need to construct this usage graph yourself from the module. If this sounds expensive, worry not: DebugInfoFinder already more or less does this, so you can replace it with your own analysis (at the same cost) that collects the information you need.
In the Fairness and Explainability with SageMaker Clarify example, I am running a bias analysis on the 'Sex' facet, where the facet value is 0 and the label is 0:
bias_config = clarify.BiasConfig(label_values_or_threshold=[0],
                                 facet_name='Sex',
                                 facet_values_or_threshold=[0],
                                 group_name='Age')
This raises 2 questions:
How would I use it to detect bias on a multi-label dataset? (I tried label_values_or_threshold=[0,1] but it didn't work.) Would I need to re-run the job each time for a different label?
Similarly, if I want to detect bias for multiple facets (i.e. 'Sex' and 'Age'), would I need to run the bias detection job for each facet_name?
How would we use it to detect bias on a multi-label dataset? (I tried label_values_or_threshold=[0,1] but it didn't work.) Would we need to re-run the job each time for a different label?
By "multi-label" do you mean "categorical" label or "multi-tags" label?
Clarify supports categorical labels. For example, if the label value is one of the enums "Dog", "Cat", "Fish", then you can specify label_values_or_threshold=["Dog", "Cat"] and Clarify will split the dataset into an advantaged group (samples with label value "Dog" or "Cat") and a disadvantaged group (samples with label value "Fish").
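In SDK terms, that would look roughly like the sketch below (the dataset and its column names are hypothetical, for illustration only):
from sagemaker import clarify

# Hypothetical dataset: a categorical label with values "Dog", "Cat", "Fish"
# and a numeric 'Sex' facet column. Samples labelled "Dog" or "Cat" form one
# group; everything else ("Fish") forms the other.
bias_config = clarify.BiasConfig(
    label_values_or_threshold=["Dog", "Cat"],
    facet_name='Sex',
    facet_values_or_threshold=[0],
)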
Clarify doesn't support multi-tag labels. By multi-tag I mean, for example, a dataset like the one below, where
features are N sentences extracted from a web page
label is N tags describing what the web page is about, like:
feature1, feature2, feature3, ..., label
"pop", "beatles", "jazz", ..., "music, beatles"
"iphone", "android", "browser", ..., "computer, internet, design"
"php", "python", "java", ..., "programming, java, web, internet"
Similarly, if we wanted to detect bias for multiple facets (i.e. 'Sex' and 'Age'), would we need to run the bias detection job for each facet_name?
Clarify supports multiple facets in a single run, although the configuration is not exposed by the SageMaker Python SDK API.
If you use the Processing Job API and compose analysis_config.json yourself, you can provide a list of facet objects in the facet configuration entry (see Configure the Analysis). Example:
...
"facet": [
{
"name_or_index" : "Sex",
"value_or_threshold": [0]
},
{
"name_or_index" : "Age",
"value_or_threshold": [40]
}
],
...
If you have to use the SageMaker Python SDK API, then a workaround is to append additional facets to the analysis config (not recommended, but currently there is no better way):
bias_config = clarify.BiasConfig(label_values_or_threshold=[0],
                                 facet_name='Sex',
                                 facet_values_or_threshold=[0])
bias_config.analysis_config['facet'].append({
    'name_or_index': 'Age',
    'value_or_threshold': [40],
})
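The patched bias_config can then be passed to a Clarify processor as usual. A rough sketch, where the S3 paths, headers, role, and session are placeholders:
from sagemaker import clarify

data_config = clarify.DataConfig(
    s3_data_input_path="s3://my-bucket/input/dataset.csv",  # placeholder
    s3_output_path="s3://my-bucket/clarify-output/",        # placeholder
    label="Label",
    headers=["Label", "Sex", "Age"],                        # placeholder header list
    dataset_type="text/csv",
)

processor = clarify.SageMakerClarifyProcessor(
    role=role,                     # assumes an existing execution role
    instance_count=1,
    instance_type="ml.m5.xlarge",
    sagemaker_session=session,     # assumes an existing sagemaker.Session()
)

# A single job then computes pre-training bias metrics for both facets.
processor.run_pre_training_bias(
    data_config=data_config,
    data_bias_config=bias_config,
)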
I'm trying to perform an S3 sync between prefixes in buckets in different accounts using boto3. My attempt proceeds by listing the objects in the source bucket/prefix in account A, listing the objects in the destination bucket/prefix in account B, and copying those objects in the former that have an ETag not matching the ETag of an object in the latter. This seems like the right way to do it.
But, it seems that even if the copy operation is successful, the ETag of the destination object is different each time I perform a copy. Specifically,
>>> # Here is the source object: {'Key': 'blah/blah/file_20210328_232250.parquet', 'LastModified': datetime.datetime(2021, 3, 28, 23, 38, 2, tzinfo=tzutc()), 'ETag': '"ba230f7a358cf1bee6c98250089da435"', 'Size': 52319, 'StorageClass': 'STANDARD'}
>>> client.copy_object(
        CopySource={"Bucket": "source-bucket-in-acct-a", "Key": "blah/blah/file_20210328_232250.parquet"},
        Bucket="dest-bucket-in-acct-b",
        Key="blah/blah/file_20210328_232250.parquet"
    )
... 'CopyObjectResult': {'ETag': '"84f11f744cf996e16a3af0d6d2fbee07"', 'LastModified': datetime.datetime(2021, 4, 20, 2, 23, 40, tzinfo=tzutc())}}
Notice that the ETag has changed. If I run the copy again, it will again have a different ETag. I've tried all manner of additional parameters to the copy request (MetadataDirective="COPY", etc.). I might be missing a parameter that preserves the ETag, but my understanding is that the ETag is derived from the object's data, not its metadata.
Now, it says in the AWS documentation that the ETags are identical for a successful non-multipart copy operation, which this is, but this does not seem to be the case. It is not a multipart copy and I've checked the actual data; they are identical. Hence, my question:
How can an object's ETag change, if not for an unsuccessful copy?
Based on the comments.
The calculation of the ETag hash for an object is not consistent, so it can't be fully relied on for checking the integrity of objects. From the AWS blog:
The ETag isn't always an MD5 digest, so it can't always be used for verifying the integrity of uploaded files.
This is because the calculation of the ETag depends on how the object was created and encrypted:
Whether the ETag is an MD5 digest depends on how the object was created and encrypted.
See: https://teppen.io/2018/10/23/aws_s3_verify_etags/#calculating-an-s3-etag-using-python
Note: if you simply copy the file via the AWS S3 web console, the part size is 16 MB.
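Following the approach in that link, here is a minimal sketch of computing the ETag S3 would assign for a given part size (it assumes the object is not encrypted with SSE-KMS; the function name is mine):
import hashlib

def expected_s3_etag(path, part_size=16 * 1024 * 1024):
    # Single-part (plain PUT) uploads get the MD5 of the data; multipart
    # uploads get the MD5 of the concatenated part MD5s plus "-<part count>".
    whole = hashlib.md5()
    part_md5s = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(part_size)
            if not chunk:
                break
            whole.update(chunk)
            part_md5s.append(hashlib.md5(chunk).digest())
    if len(part_md5s) <= 1:
        return whole.hexdigest()
    return hashlib.md5(b"".join(part_md5s)).hexdigest() + f"-{len(part_md5s)}"
Comparing an object against the ETags computed for the likely part sizes (e.g. 8 MB for CLI uploads, 16 MB for console copies) is more reliable than expecting ETags to stay identical across copies.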
I want to store arbitrary key value pairs. For example,
{:foo "bar" ; string
:n 12 ; long
:p 1.2 ; float
}
In Datomic, I'd like to store it as something like:
[{:kv/key "foo"
:kv/value "bar"}
{:kv/key "n"
:kv/value 12}
{:kv/key "p"
:kv/value 1.2}]
The problem is that :kv/value can only have one type in Datomic. One solution is to split :kv/value into :kv/value-string, :kv/value-long, :kv/value-float, etc. It comes with its own issues, like making sure only one value attribute is used at a time. Suggestions?
If you could give more details on your specific use-case it might be easier to figure out the best answer. At this point it is a bit of a mystery why you may want to have an attribute that can sometimes be a string, sometimes an int, etc.
From what you've said so far, your only real answer is to have different attributes like value-string etc. This is like a SQL DB, where each table column has only one type and you would need different columns to store a string, an integer, etc.
As your problem shows, any tool (such as a DB) is designed with certain assumptions. In this case the DB assumes that each "column" (attribute in Datomic) is always of the same type. The DB also assumes that you will (usually) want to have data in all columns/attrs for each record/entity.
In your problem you are contradicting both of these assumptions. While you can still use the DB to store information, you will have to write custom functions to ensure only one attribute (value-string, value-int, etc.) is in use at a time. You probably want custom insertion functions like "insert-str-val", "insert-int-val", etc., as well as custom read functions like "read-str-val" and so on. It might also be a good idea to have a validation function that accepts any record/entity and verifies that one and only one "type" is in use at any given time.
You can emulate a key-value store with heterogeneous values by making :kv/key a :db.unique/identity attribute, and by making :kv/value either bytes-typed or string-typed and encoding the values in the format you like (e.g. Fressian / Nippy for :db.type/bytes, edn / JSON for :db.type/string). I advise that you set :db/index to false for :kv/value in this case.
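For concreteness, a minimal sketch of the schema this implies, shown for the bytes-typed variant (attribute names follow the question):
[{:db/ident       :kv/key
  :db/valueType   :db.type/string
  :db/cardinality :db.cardinality/one
  :db/unique      :db.unique/identity}
 {:db/ident       :kv/value
  :db/valueType   :db.type/bytes     ; or :db.type/string for an edn/JSON encoding
  :db/cardinality :db.cardinality/one
  :db/index       false}]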
Notes:
You will have limited query power, as the values will not be indexed and will need to be de-serialized for each query.
If you want to run transaction functions which read or write the values (e.g for data migrations), you should make your encoding / decoding library available to the Transactor as well.
If the values are large (say, over 20kb), don't store them in Datomic; use a complementary storage service like AWS S3 and store a URL.
This is a theoretical question. I know that I could very easily solve the stated example problem by using some AWS graph database. I really need to let the Lambda function itself work on a bigger data structure that is held in memory, not outsource the graph calculations or anything like that. The graph is just used as an example.
The Setting:
Let's say I want to use AWS Lambda for one of my projects.
This project provides an API to search for the shortest path in a never-changing graph from vertex A to vertex B.
Since those requests are stateless, it would be perfect for something like AWS Lambda.
The usage pattern:
This service is not used that often, let's say about 10 times a day. But when someone uses it, they will probably use it several times in a short period of time.
The problem:
The graph to work on is static; it doesn't change, EVER. But it is quite big, and constructing it from some XML-like data takes a few seconds. In the program, the graph consists of a few thousand instances of a vertex class, and every vertex has a set of adjacent vertices (it is an undirected graph).
The question:
How would I implement this with AWS Lambda? (Example code would be Java, but since this is theoretical and about AWS rather than Java, it shouldn't matter.)
Of course I could just construct this graph from XML every time the service is requested, but spending several seconds each time to construct a never-changing object structure is not acceptable.
I could persist the graph structure to ephemeral storage (/tmp) so it could be reused on subsequent requests, as long as they happen within the roughly 4:30-minute window (let's call those 4:30 minutes a SESSION) during which AWS keeps the Lambda instance environment alive. But loading a serialized version would probably also take some time.
I want to kind of persist the memory structure itself for subsequent requests during the same "session".
Reconstructing the Graph for every "session" would not be a problem if it can be used for subsequent requests within that window.
How would you solve this problem of working on a never-changing structure that takes time to construct? Or am I missing something completely?
EDIT:
Okay, I found out that you can put the construction of the graph into the initialization code and store it in a variable. That solves the question of how to reuse the graph for subsequent requests to the same container.
But the bigger question still remains: is it possible to reuse a NEVER-changing (aka stateless) data structure that lives in memory?
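For illustration, a minimal sketch of that initialization approach, shown in Python only because the language does not matter here (build_graph, shortest_path, and the event fields are hypothetical placeholders):
from my_graph_lib import build_graph, shortest_path  # hypothetical module

# Module-level initialization runs once per container (cold start); the
# resulting object stays in memory and is reused by every warm invocation.
GRAPH = build_graph("graph.xml")

def handler(event, context):
    # Per-request work is cheap: the graph is already constructed.
    return {"path": shortest_path(GRAPH, event["from"], event["to"])}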
I'm not sure this question is suitable for Stack Overflow since, strictly speaking, it is not about programming. But it is definitely not about servers (Server Fault) either.
Since it is a static graph, you can compute all the paths beforehand and store them in an appropriate data format. The Lambda can then use a simple lookup to fetch the pre-calculated path. Exactly what this format looks like may vary, and depending on the size of the graph the storage method may also vary; see the examples below.
For file-based storage you could use JSON, either as a flat key/value map or nested (shown further below). In the flat form the key contains the start and end vertices and the value is a list of intermediate vertices. If two vertices are not connected, there is no corresponding key, e.g.
{
    "AB": ["C", "D", "E"],
    "AC": [],          // empty list indicates a direct connection
    "AD": ["C"],
    "AE": ["C", "D"],
    "BA": ["E", "D", "C"],
    // etc
}
If you would like to know the path from A to B you simply generate the key AB and the path will be C, D and E.
You can also use a nested JSON format:
{
    "A": {
        "B": ["C", "D", "E"],
        "C": [],           // empty list indicates a direct connection between A and C
        "D": ["C"],
        "E": ["C", "D"]
    },
    "B": {
        "A": ["E", "D", "C"]
        // etc
    }
    // etc
}
Here, you find the path from A to B by looking up the value A.B.
For large graphs / data sets, DynamoDB could be a suitable option on the AWS stack. One way of modelling it would be a composite primary key, with the start vertex as the partition key and the end vertex as the sort key. The path could then be a list of strings containing the intermediate vertices, as in the file example.
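A minimal sketch of such a lookup with boto3, where the table name and the "start", "end", and "via" attribute names are made up for illustration:
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("PrecomputedPaths")  # hypothetical table

# Composite primary key: partition key = start vertex, sort key = end vertex.
response = table.get_item(Key={"start": "A", "end": "B"})
item = response.get("Item")

# "via" holds the list of intermediate vertices; a missing item means the
# two vertices are not connected.
path = item["via"] if item else None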
When I run a query like "select count(x), y group by y", Calcite does all the calculations in memory, so with enough data it can run out of memory. Is there a way to do aggregations using some other storage? There is a Spark option, but when I enable it I get a null pointer exception. Is that option meant to use Spark to calculate the results, and how does it work?
I would like to share a bit of my understanding of this.
Firstly, Calcite is a data manipulation engine that specialises in SQL optimisation, so it primarily focuses on figuring out the best execution plan.
There are quite a few adapters for Calcite, and you can of course choose to push the aggregation down to the backend to execute, e.g. push the aggregation down to a backend MySQL database.
In the case of the CSV adapter, I do think Calcite generates the execution details to run the aggregation itself; as you suggested, probably all in memory, and if the CSV file is large enough there will be an OOM.
And yes, the Spark option, if turned on, will enable Calcite to generate Spark code instead of plain Java code to execute the physical plan, and I assume it will to some extent solve the OOM you mentioned.
Unfortunately, I haven't found an official introduction to using Spark to run Calcite other than some test specs:
CalciteAssert.that()
.with(CalciteAssert.Config.SPARK)
.query("select *\n"
+ "from (values (1, 'a'), (2, 'b'))")
.returns("EXPR$0=1; EXPR$1=a\n"
+ "EXPR$0=2; EXPR$1=b\n")
.explainContains("SparkToEnumerableConverter\n"
+ " SparkValues(tuples=[[{ 1, 'a' }, { 2, 'b' }]])");