In a Roslyn Analyzer/Source Generator, is there any way to distinguish between direct and transitive assembly references?

AIUI, transitive references are an MSBuild rather than a Roslyn feature, and are to do with Package/ProjectReferences rather than assembly references. I believe the compiler just gets a set of assembly references supplied to it and no information about where they came from. As a result, I don't see any way to distinguish between direct and transitive references from within an analyzer or source generator (i.e. given a CodeAnalysis.Compilation object). Compilation.References includes the transitive ones, as does Compilation.ReferencedAssemblyNames.
Does anybody know a way to get at this information? Is it possible to get hold of an MsBuild Project object, from which it could be gleaned?

There's no way to get at this information; as you observed, MSBuild is doing this, and by the time Roslyn gets the information it's just a flat list. We don't have any way to reach back to the MSBuild instance, for various technical reasons. If you clarify why you need this, we might be able to give some advice on next steps.

Related

C++: How to write a program that finds all instances where function X, variable Y, or object Z are called?

Here's some background on what I'm trying to achieve.
I need to parse C++ source code to find all instances where function X is called. This seems doable with libclang, as mentioned in this post: Find all references of specific function declaration in libclang (Python) (though the answer implies it isn't as simple as you might think).
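For reference, my understanding is that the libclang approach would look roughly like this with the C API (the input file name and compile flags below are placeholders; real flags would come from the build system):

    #include <clang-c/Index.h>
    #include <cstdio>
    #include <cstring>

    // Report every call expression whose callee is spelled "X".
    static CXChildVisitResult visit(CXCursor c, CXCursor, CXClientData) {
      if (clang_getCursorKind(c) == CXCursor_CallExpr) {
        CXString name = clang_getCursorSpelling(c);
        if (std::strcmp(clang_getCString(name), "X") == 0) {
          CXFile file;
          unsigned line, col;
          clang_getSpellingLocation(clang_getCursorLocation(c),
                                    &file, &line, &col, nullptr);
          CXString path = clang_getFileName(file);
          std::printf("%s:%u:%u: call to X\n",
                      clang_getCString(path), line, col);
          clang_disposeString(path);
        }
        clang_disposeString(name);
      }
      return CXChildVisit_Recurse;
    }

    int main() {
      CXIndex index = clang_createIndex(0, 0);
      const char *args[] = {"-std=c++17"};  // placeholder flags
      CXTranslationUnit tu = clang_parseTranslationUnit(
          index, "input.cpp", args, 1, nullptr, 0, CXTranslationUnit_None);
      if (tu) {
        clang_visitChildren(clang_getTranslationUnitCursor(tu), visit, nullptr);
        clang_disposeTranslationUnit(tu);
      }
      clang_disposeIndex(index);
      return 0;
    }

Matching by spelling is the naive version; to be robust against overloads and shadowing, one would instead compare clang_getCursorReferenced against the declaration of interest (e.g. via its USR).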
However, the problem with libclang is that many people advise against using it on Windows. I can't use it on Linux because I'm hoping to use it on existing Visual C++ code that uses the Windows API.
Faced with this barrier, I asked a colleague, and he suggested I simply search the source code using regular expressions. I have my doubts that this is easy.
Can someone tell me if this approach is recommended?
Edit, to address the comment asking what my goal is: I need to do this programmatically because I'm trying to integrate it into an infrastructure that checks where the code was edited and then outputs which end-user functionality is affected by that edit and thus needs to be rechecked. If I were to do this manually via the "Find References" option in an IDE, I would have to follow references through multiple levels until I reached the end-user level, which is a lot of work for a large code base and prone to error.

Creating a modular language in LLVM?

I'm developing a new language in LLVM using the C++ API which compiles down to target the C ABI.
I would like to support modular compilation by allowing end users to build what are effectively static libraries. I noticed the LLVM C++ API has an llvm::Linker class that I can use during compilation to combine source files (llvm::Module); however, I want to guarantee library compatibility between separate compilation runs, via metadata version numbers or at least via the publicly exposed interface.
Much of the information available on metadata in LLVM suggests that it should only be used for extended information that would not break correctness when silently removed.
I wouldn't think this would be a deal breaker as it could be global metadata, but it would be good to get a second opinion on that point.
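For concreteness, here is the kind of thing I have in mind. As far as I can tell, module flags are global metadata with defined merge semantics when llvm::Linker combines modules, so a version mismatch can be made a hard link error rather than something silently dropped (the flag name here is my own invention):

    #include "llvm/IR/Constants.h"
    #include "llvm/IR/LLVMContext.h"
    #include "llvm/IR/Metadata.h"
    #include "llvm/IR/Module.h"

    // Stamp an ABI version into a module. With Error behavior, linking two
    // modules whose "mylang-abi" values differ fails loudly instead of
    // silently producing an incompatible binary.
    void stampAbiVersion(llvm::Module &mod, uint32_t version) {
      mod.addModuleFlag(llvm::Module::Error, "mylang-abi", version);
    }

    // Read the version back from a loaded module; returns 0 if absent.
    uint32_t readAbiVersion(llvm::Module &mod) {
      if (auto *ci = llvm::mdconst::extract_or_null<llvm::ConstantInt>(
              mod.getModuleFlag("mylang-abi")))
        return static_cast<uint32_t>(ci->getZExtValue());
      return 0;
    }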
I also know there is a method in IRReader to parseIRFile so I can load some previously built bc files. I would be curious if it would be reasonable practice to include size and CRC information for comparison when loading these files.
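Roughly what I'm picturing for the loading side, with the expected size and CRC coming from a manifest my compiler would emit alongside the library (that manifest is hypothetical). LLVM ships a crc32 helper in Support/CRC.h, so the check is cheap:

    #include "llvm/ADT/StringExtras.h"
    #include "llvm/IR/LLVMContext.h"
    #include "llvm/IR/Module.h"
    #include "llvm/IRReader/IRReader.h"
    #include "llvm/Support/CRC.h"
    #include "llvm/Support/MemoryBuffer.h"
    #include "llvm/Support/SourceMgr.h"
    #include "llvm/Support/raw_ostream.h"
    #include <memory>

    // Load a previously built .bc file, verifying size and CRC first.
    std::unique_ptr<llvm::Module>
    loadLibrary(llvm::StringRef path, uint64_t expectedSize,
                uint32_t expectedCrc, llvm::LLVMContext &ctx) {
      auto bufOrErr = llvm::MemoryBuffer::getFile(path);
      if (!bufOrErr)
        return nullptr;
      llvm::MemoryBuffer &buf = **bufOrErr;
      if (buf.getBufferSize() != expectedSize ||
          llvm::crc32(llvm::arrayRefFromStringRef(buf.getBuffer())) !=
              expectedCrc)
        return nullptr;  // stale or corrupted library; rebuild instead

      llvm::SMDiagnostic err;
      auto mod = llvm::parseIR(buf.getMemBufferRef(), err, ctx);
      if (!mod)
        err.print("mylang", llvm::errs());
      return mod;
    }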
My language has concepts similar to C#'s, including interfaces. I figure I could allow modular compilation by importing/exporting an interface type along with external functions (much like C++, I don't restrict the language to only methods of classes).
This approach allows me to include language specific information in the interface without needing to encode it in the IR as both the library and the calling code would be required to build with the interface. This again requires the interfaces to be compatible.
One language feature that would require extended information would be named parameters in functions.
My language is very type-safe and also mandates named parameters, so there is no predetermined function parameter order. This allows call sites to be more explicit, lets the compiler catch erroneous parameter usage, and gives authors more liberty in determining default parameters, as they are not restricted to the last parameters of the function.
The compiler will need to know names, modifiers, defaults, etc. of these parameters to correctly map calls at compile time, so I figure the interface approach would work well here.
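As a sketch of what such an interface might carry (all names here are hypothetical; this is just the shape of the data, not a committed format):

    #include <string>
    #include <vector>

    // One entry per parameter of an exported function. Because parameters
    // are matched by name at call sites, the serialized order only fixes
    // the ABI order the compiler lowers calls to.
    struct ParamInfo {
      std::string name;         // mandatory parameter name
      std::string type;         // language-level type, not the LLVM type
      bool hasDefault = false;
      std::string defaultExpr;  // source form of the default, if any
    };

    // One entry per function exported through an interface.
    struct FunctionInfo {
      std::string symbol;             // linker-level name in the emitted .bc
      std::vector<ParamInfo> params;  // ABI order used when lowering calls
    };

At a call site, the compiler would look up the FunctionInfo, match the named arguments against params, substitute defaultExpr for anything omitted, and emit an ordinary positional call in ABI order.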
TL;DR
Does LLVM have any predefined facilities for building static libraries?
Are version numbers, sizes, and CRCs reasonable use cases for LLVM's metadata?
This is probably not QUITE an answer... Or at least not a complete answer.
I like this question, as I'm going to need a solution in the future too (some time in the next few months or years) for my Pascal compiler. It supports "units", which are meant to be separately compiled objects, but currently what I do is simply drag in the source file and compile it into the main llvm::Module. That's neither efficient nor flexible (I can't use the linker to choose between the "Linux" and "Windows" versions of some code, for example; not that I think there's a 5% chance my compiler will work on Windows without modification anyway...)
However, I'm not sure storing the "object" file as LLVM IR would be the right thing to do. I was thinking a better way would be to store your AST in some serialized form: then you don't depend on LLVM versions changing the IR format, and you can add whatever metadata you like. There won't be much difference between generating LLVM IR from that during your link phase and building the IR at compile time and then reading it back to check that the metadata is correct. [The slow part, as you may have already found out, is the optimisation and MC generation, and you'd still have to do that either way.]
As I said at the start, I'm not sure this is an answer, but it's my thoughts so far on the subject. Now I'll go back to adding debug symbol support to my Pascal compiler... Before Christmas, I couldn't see the source in GDB. Now I can step, but no viewing of variables yet...

What's the best way to make a Roslyn analyzer configurable?

I'm playing around with making an analyzer for Roslyn. The one I'm making is a diagnostic that finds methods that are too long. I'd like to make whatever is considered 'too long' configurable, preferably one configuration for an entire solution or project. What would be the best way to go about this?
The only option I have in mind is to search the assembly for a particular configuration attribute. This would require an attribute for each project in the solution. Also it requires the user of the diagnostic to reference a library specific to the diagnostic that defines this attribute.
Is this a good idea, and what are the other options?
You can pass additional files to the analyzers; these can then be reached from the analysis context. But this approach is not very well developed in Roslyn yet. For example, if the file changes, the analyzers are not notified of the change.
For an example you can check out the SonarLint repository.
Also, keep an eye on this GitHub issue, where there is an ongoing discussion of how parameters and data sharing should be done in upcoming Roslyn versions.

Is it possible to inject code into translation unit immediately before compilation

I build my C++ code base with MSVC++ 2008 and 2010. Is it even possible to get a translation unit, analyze it, insert some code if necessary, and then pass it on to the compilation process? The original source code should not be affected.
Of course, it should be transparent to a developer building the project; in the end, it only affects object files. Visual Studio is very powerful, so I would guess there is some kind of plugin API or hook to do this. Please give me a hint.
I don't believe this is possible as you describe it, though I don't know for sure. It would certainly be non-trivial. The only similar project that springs to mind is OpenMP, but I got the impression that Microsoft was the one who implemented their version of it.
I could see a template engine such as Cheetah sufficing, though. You would likely give up bells and whistles like code completion and IntelliSense.
Basically, you would set up the files to use a custom compiler to generate the new code in another file. The C++ compiler would then compile the generated files. I don't think it would be elegant or pleasant to use, to be frank.
I've used CMake to do similar things, though I did not target it as a general tool; I wrote a one-off for some content generation.
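To make the custom-tool approach concrete, the generator itself can be trivial. Here's a hypothetical pre-build injector (the marker comment and the generated call are made up); the build step would then compile the generated file instead of the original:

    #include <fstream>
    #include <iostream>
    #include <string>

    // Usage: inject <in.cpp> <out.gen.cpp>
    // Copies the input verbatim, splicing instrumentation after each
    // "// @inject" marker. The original source is never modified.
    int main(int argc, char **argv) {
      if (argc != 3) {
        std::cerr << "usage: inject <in.cpp> <out.gen.cpp>\n";
        return 1;
      }
      std::ifstream in(argv[1]);
      std::ofstream out(argv[2]);
      std::string line;
      while (std::getline(in, line)) {
        out << line << '\n';
        if (line.find("// @inject") != std::string::npos)
          out << "    log_trace(__FILE__, __LINE__);\n";  // generated code
      }
      return 0;
    }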
Maybe if you actually describe some of the specifics of what you want to do we can offer a better solution.

Open-source C++ scanning library

Rationale: In my day-to-day C++ code development, I frequently need to answer basic questions such as who calls what in a very large C++ code base that is frequently changing. But I also need some automated way to identify exactly what the code is doing around a particular area of code. "grep" tools such as Cscope are useful (and I use them heavily already), but are not C++-language-aware: they don't give any way to identify the types and kinds of lexical environment of a given use of a type or function in such a way that is conducive to automation (even if said automation is limited to "read-only" operations such as code browsing and navigation, but I'm asking for much more than that below).
Question: Does there already exist an open-source C/C++-based library (native, not managed, not Microsoft- or Linux-specific) that can statically scan or analyze a large tree of C++ code, and can produce result sets that answer detailed questions such as:

What functions are called by some supplied function?
What functions make use of this supplied type?
Ditto the above questions if C++ classes or class templates are involved.

The result set should provide some sort of "handle". I should be able to feed that handle back to the library to perform the following types of introspection:

What is the byte offset into the file where the reference was made?
What is the reference into the abstract syntax tree (AST) of that reference, so that I can inspect surrounding code constructs? Each AST entity would also have file path, byte-offset, and type-info data associated with it, so that I could recursively walk up the graph of callers or referrers to do useful operations.
The answer should meet the following requirements:

API: The API exposed must be either (a) C or C++, probably "C handle" or C++-class-instance-based (and if so, it must be generic C or C++ code, not Microsoft- or Linux-specific code constructs, unless that is needed to meet specifics of the given platform), or (b) based on command-line standard input and standard output.

C++ aware: Is not limited to C code, but understands C++ language constructs in minute detail, including awareness of inter-class inheritance relationships and C++ templates.

Fast: Should scan large code bases significantly faster than compiling the entire code base from scratch. This probably needs to be relaxed, but only if the "incremental result retrieval" and "resilient to small code changes" requirements below are fully met.

Provides result counts: I should be able to ask "How many results would you provide for some request (and no, don't send me all of the results)?" and get an answer in under 3 seconds, rather than having to retrieve all results for any given question. If it takes too long to get that answer, it wastes development time. This is coupled with the next requirement.

Incremental result retrieval: I should be able to ask "Give me just the next N results of this request", along with a handle to the result set, so that I can ask the question repeatedly and pull out the results incrementally in stages. This means I should not have to wait for the entire result set before seeing some subset of the results, and I should be able to cancel the operation safely once I have seen enough. Reason: I need to answer the question "What is the build or development impact of changing some particular function signature?"

Resilient to small code changes: If I change a header or source file, I should not have to wait for the entire code base to be rescanned; only that header or source file should be rescanned, and rescanning should be quick. E.g., don't do what Cscope requires, which is to rescan the entire code base for small changes. It is understood that if you change a header, scanning can take longer, since other files that include that header would have to be rescanned.

IDE agnostic: Is text-editor agnostic (don't make me use a specific text editor; I've made my choice already, thank you!).

Platform agnostic: Is platform-agnostic (don't make me use it only on Linux or only on Windows; I have to use both platforms in my daily grind, and I have code sandboxes on both, so the tool needs to be useful on both).

Non-binary: Should not cost me anything other than the time to download and compile the library and all of its dependencies. Not trial-ware.

Actively supported: Help requests sent to mailing lists or associated forums are likely to get a response in less than 2 days.

Network agnostic: Databases the library builds should be usable directly over a network from 32-bit and 64-bit systems, both Linux and Windows, interchangeably and at the same time, and should not embed hardcoded filesystem paths that would otherwise "root" the database to a particular network.

Build-environment agnostic: Does not require intimate knowledge of my build environment, with the notable exception of possibly requiring the compiler-supplied CPP macro definitions (e.g. -Dmacro=value).
I would say that the Clang Index is a close fit; however, I don't think it stores data in a database.
Anyway, the Clang framework offers what you need to build a tool tailored to your needs, if only because of its C, C++, and Objective-C parsing/indexing capabilities. And since it's provided as a set of reusable libraries... it was crafted to be developed on!
I have to admit that I haven't used either, because I work with a lot of Microsoft-specific code that uses Microsoft compiler extensions that I don't expect them to understand, but the two open-source analyzers I'm aware of are Mozilla Pork and the Clang Analyzer.
If you are looking for the results of code analysis (metrics, graphs, ...), why not use a tool instead of an API? If you can, I suggest you take a look at Understand.
It's not free (there's a trial version), but I found it very useful.
Maybe Doxygen with Graphviz could answer some of your constraints, but not all; for example, Doxygen's analysis is not incremental.