Graph Data Structure for Static Analysis

Static code analysis examines a program without running it. By analyzing the structure and logic of the code, it helps enforce coding standards and guidelines and identifies errors early in the development process. SonarQube, Coverity, and Checkmarx are popular tools for this, but using them on a codebase can be tough and can lead to a poor signal-to-noise ratio.

At Trilogy, we are looking to build rules that can be integrated into the development workflow, i.e. the IDE and the CI/CD pipeline, so that most code defects are caught early and migrating or maintaining old code becomes easier. We tried to use these tools for that purpose but found them quite limiting:

Syntax Over Semantics: Most of these tools still rely primarily on a tree structure, the abstract syntax tree (AST), which captures syntactic details and hierarchy but may not fully represent the deeper semantic relationships and behaviors of the code.
Inaccurate Alerts: Without dynamic context, the analysis might flag code as problematic (false positives) or miss subtle issues that only occur during execution (false negatives).
Control vs. Data Flow: While ASTs capture the control structure of the code, they often fall short in modeling data flow accurately, which is essential for detecting issues like aliasing or subtle bugs involving variable states.
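
To make the "Syntax Over Semantics" point concrete, here is a small sketch using Python's standard `ast` module. The sample program and variable names are made up for illustration. It lists every identifier the tree contains, and shows that two reads of `total` that refer to different variables (a parameter and a module-level variable) look the same in the tree.

```python
import ast

# Illustrative program: `total` is both a module-level variable and a parameter.
source = """\
total = 0

def add(total):
    return total + 1

print(total)
"""

tree = ast.parse(source)

# Collect every identifier node with its line number and syntactic context.
names = [
    (node.id, node.lineno, type(node.ctx).__name__)
    for node in ast.walk(tree)
    if isinstance(node, ast.Name)
]
names.sort(key=lambda item: (item[1], item[0]))

for name in names:
    print(name)
# ('total', 1, 'Store')
# ('total', 4, 'Load')   <- reads the parameter of add()
# ('print', 6, 'Load')
# ('total', 6, 'Load')   <- reads the module-level variable
```

Both reads of `total` are plain `Name` nodes with a `Load` context. Nothing in the node itself says which definition each one refers to, so that has to be resolved separately.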

At Trilogy, we wanted to detect code smells such as Feature Envy, Middle Man, and Spaghetti Code, which was not possible using these tools for the following reasons:

Limited Context: ASTs primarily represent the syntactic structure of code. While they can show method calls and property accesses, they often lack the broader context of how and why a method interacts with data from another class.
Delegation Ambiguity: An AST might reveal simple delegation patterns, but determining whether the delegation is unnecessary or a deliberate design decision (e.g., for abstraction or future-proofing) is challenging.
Cross-Cutting Concerns: An AST focuses on the local structure of a single file or module and may not capture the broader interdependencies and coupling between components.

For this, we decided to build an intermediate data structure that keeps the trees but also incorporates binding data. Binding data provides the semantic context that the AST alone does not offer. This includes information about variable declarations, type information, scope resolution, and how identifiers (e.g., methods, variables) connect to their definitions. It “binds” references in the code to their corresponding definitions, allowing for a deeper understanding of how different parts of the code interact.

By merging an AST with binding data, a CodeGraph provides a comprehensive view of both the syntactic and semantic aspects of code. This integrated graph is particularly useful for control flow analysis, allowing developers and tools to detect issues, optimize performance, and understand complex code interactions more effectively.
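
As a rough illustration of the idea (not the actual implementation), the sketch below builds a tiny graph from Python's `ast` module and adds one kind of binding edge on top of the tree edges. The `CodeGraph` class, `build_code_graph`, and the edge kinds `"child"` and `"binds_to"` are hypothetical names, and the binding step is deliberately simplistic: it only links calls to module-level functions with the same name.

```python
import ast
from dataclasses import dataclass, field


@dataclass
class CodeGraph:
    """Hypothetical container: AST nodes plus typed edges between them."""
    nodes: list = field(default_factory=list)
    edges: list = field(default_factory=list)  # (kind, source index, target index)

    def add_node(self, node):
        self.nodes.append(node)
        return len(self.nodes) - 1


def build_code_graph(source):
    tree = ast.parse(source)
    graph = CodeGraph()
    index = {}  # id(ast object) -> node index in the graph

    # Syntactic layer: register each AST object once. Some helper objects
    # (e.g. expression contexts) may be shared between several parents.
    for node in ast.walk(tree):
        if id(node) not in index:
            index[id(node)] = graph.add_node(node)
    for node in ast.walk(tree):
        for child in ast.iter_child_nodes(node):
            graph.edges.append(("child", index[id(node)], index[id(child)]))

    # Semantic layer (simplified binding data): link each call to a
    # module-level function definition with the same name.
    definitions = {
        node.name: index[id(node)]
        for node in tree.body
        if isinstance(node, ast.FunctionDef)
    }
    for node in ast.walk(tree):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            target = definitions.get(node.func.id)
            if target is not None:
                graph.edges.append(("binds_to", index[id(node)], target))
    return graph


source = """\
def area(w, h):
    return w * h

def report():
    return area(2, 3)
"""

graph = build_code_graph(source)
bindings = [
    (graph.nodes[src].lineno, graph.nodes[dst].name)
    for kind, src, dst in graph.edges
    if kind == "binds_to"
]
print(bindings)  # [(5, 'area')]
```

A real binder would also have to handle scopes, imports, methods, and types. The point here is only that tree edges and binding edges can live side by side in one graph.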

Detecting Feature Envy

The CodeGraph can highlight how methods interact with data across classes. If a method frequently accesses properties or calls methods from another class, the graph will show heavy interconnections. To classify this pattern as feature envy, you need to implement specific heuristics or metrics, such as coupling analysis or dependency ratios, that quantify whether a method is overly reliant on another class's data.
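
One possible dependency-ratio heuristic is sketched below. It assumes the access edges have already been extracted from the graph. The edge format, the sample data, and the thresholds `min_foreign` and `min_ratio` are all hypothetical and would need tuning on a real codebase.

```python
from collections import Counter

# Hypothetical access edges extracted from a CodeGraph:
# (method, class that owns the method, class that owns the accessed member)
ACCESS_EDGES = [
    ("Invoice.total", "Invoice", "Invoice"),
    ("Invoice.total", "Invoice", "Invoice"),
    ("Invoice.shipping_label", "Invoice", "Customer"),
    ("Invoice.shipping_label", "Invoice", "Customer"),
    ("Invoice.shipping_label", "Invoice", "Customer"),
    ("Invoice.shipping_label", "Invoice", "Invoice"),
]


def feature_envy_candidates(edges, min_foreign=3, min_ratio=0.6):
    """Return (method, envied class, ratio) for methods dominated by foreign accesses."""
    own = Counter()
    foreign = {}  # method -> Counter of foreign classes it touches
    for method, home, target in edges:
        if home == target:
            own[method] += 1
        else:
            foreign.setdefault(method, Counter())[target] += 1

    results = []
    for method, targets in foreign.items():
        envied, count = targets.most_common(1)[0]
        total = own[method] + sum(targets.values())
        ratio = count / total
        if count >= min_foreign and ratio >= min_ratio:
            results.append((method, envied, ratio))
    return results


candidates = feature_envy_candidates(ACCESS_EDGES)
print(candidates)  # [('Invoice.shipping_label', 'Customer', 0.75)]
```

Here `Invoice.shipping_label` makes three of its four accesses on `Customer`, so it is reported as a candidate. Whether it really is feature envy still needs a human or a stricter rule to decide.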

Detecting Middleman Classes

A middleman class, which primarily delegates calls without adding substantial logic, might appear in the CodeGraph as a node with numerous outgoing delegation edges and little internal computation. Detecting such classes will require rules to measure the value added by the class versus the number of delegated calls. The graph can provide the data, but interpreting it as a middleman smell demands extra logic.
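
A minimal sketch of such a rule, working directly on Python's `ast` for brevity: it treats a method as pure delegation when its body is a single `return self.<field>.<method>(...)`, and reports the share of such methods per class. The function names, the sample class, and the 0.5 threshold are assumptions made for illustration.

```python
import ast


def is_pure_delegation(func):
    """True if the method body is a single `return self.<field>.<method>(...)`."""
    if len(func.body) != 1 or not isinstance(func.body[0], ast.Return):
        return False
    call = func.body[0].value
    if not isinstance(call, ast.Call) or not isinstance(call.func, ast.Attribute):
        return False
    receiver = call.func.value  # expected shape: self.<field>
    return (
        isinstance(receiver, ast.Attribute)
        and isinstance(receiver.value, ast.Name)
        and receiver.value.id == "self"
    )


def delegation_ratio(class_def):
    """Share of methods (constructor excluded) that only forward a call."""
    methods = [
        node for node in class_def.body
        if isinstance(node, ast.FunctionDef) and node.name != "__init__"
    ]
    if not methods:
        return 0.0
    delegating = sum(is_pure_delegation(m) for m in methods)
    return delegating / len(methods)


source = """\
class AccountFacade:
    def __init__(self, account):
        self.account = account

    def balance(self):
        return self.account.balance()

    def deposit(self, amount):
        return self.account.deposit(amount)

    def owner_name(self):
        return self.account.owner().strip().title()
"""

class_def = ast.parse(source).body[0]
ratio = delegation_ratio(class_def)
is_middleman_candidate = ratio >= 0.5  # hypothetical threshold
print(round(ratio, 2), is_middleman_candidate)  # 0.67 True
```

Two of the three methods only forward the call, while `owner_name` adds some logic of its own. The ratio is just the raw signal, and deciding whether the delegation is a deliberate abstraction remains a separate judgment.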

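Furthermore, we also augmented this graph with the flows between statements: data flow, such as a variable's definition and its subsequent usage, and control flow, such as the order in which code executes through conditional branches and loops. This is the kind of dependency information a program dependence graph (PDG) represents.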
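
To illustrate the data-flow part, here is a small sketch that derives definition-to-use edges with Python's `ast` module. It is a simplification: it only handles straight-line code with plain assignments, where the most recent assignment is the only definition that can reach a later use. Branches and loops would need a proper reaching-definitions analysis. The function name `def_use_edges` is hypothetical.

```python
import ast


def def_use_edges(source):
    """Return (variable, definition line, use line) for straight-line code.

    Simplification: no branches or loops, so the most recent assignment to a
    name is the only definition that can reach a later use.
    """
    last_def = {}
    edges = []
    for stmt in ast.parse(source).body:
        names = [n for n in ast.walk(stmt) if isinstance(n, ast.Name)]
        # Uses are resolved first, so `x = x + 1` reads the previous x.
        for name in names:
            if isinstance(name.ctx, ast.Load) and name.id in last_def:
                edges.append((name.id, last_def[name.id], name.lineno))
        for name in names:
            if isinstance(name.ctx, ast.Store):
                last_def[name.id] = name.lineno
    return sorted(edges, key=lambda e: (e[2], e[0]))


source = """\
price = 10
qty = 3
total = price * qty
total = total + 1
print(total)
"""

edges = def_use_edges(source)
for edge in edges:
    print(edge)
# ('price', 1, 3)
# ('qty', 2, 3)
# ('total', 3, 4)
# ('total', 4, 5)
```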

Benefits of the CodeGraph

Enhanced Precision in Analysis:
Combining the CodeGraph with the PDG allows static analysis tools to understand not only the structure and relationships of code elements but also how data and control flow through the program. This can improve the detection of complex bugs and vulnerabilities.
Comprehensive View for Code Smell Detection:
With the merged graph, you can analyze both the high-level design (from the CodeGraph) and the detailed dependencies (from the PDG). This dual perspective is useful when identifying issues like feature envy, middleman classes, or shotgun surgery since you can measure both structural complexity and the extent of interdependencies.
Facilitates Program Slicing and Refactoring:
The integrated graph supports program slicing—isolating portions of code relevant to specific computations or behaviors—which aids debugging, security analysis, and informed refactoring decisions.
Improved Optimization Opportunities:
By understanding both the structural organization and the underlying dependencies, developers can better optimize code by pinpointing redundant or overly complex functions and simplifying them.
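
Program slicing, mentioned above, can be sketched as a backward traversal over dependence edges. The edge table below is hand-written for a made-up eight-statement program (shown in the comments), and `backward_slice` is a hypothetical helper. In practice the edges would come from the merged graph.

```python
from collections import deque

# Hypothetical dependence edges: statement -> statements it depends on
# (data dependencies and control dependencies are mixed together here).
DEPENDS_ON = {
    1: [],        # price = read_price()
    2: [],        # qty = read_qty()
    3: [],        # log("start")
    4: [1, 2],    # total = price * qty
    5: [4],       # if total > 100:
    6: [4, 5],    #     total = total * 0.9
    7: [],        # log("done")
    8: [4, 6],    # print(total)
}


def backward_slice(depends_on, criterion):
    """Return every statement that can affect `criterion`, inclusive."""
    seen = {criterion}
    queue = deque([criterion])
    while queue:
        current = queue.popleft()
        for dep in depends_on.get(current, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return sorted(seen)


slice_for_print = backward_slice(DEPENDS_ON, 8)
print(slice_for_print)  # [1, 2, 4, 5, 6, 8]
```

The two logging statements (3 and 7) drop out of the slice because nothing on the path to `print(total)` depends on them, which is the kind of isolation that helps when debugging or planning a refactoring.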

For further reading, I have also published a patent that discusses the creation of the graph in detail: https://patents.google.com/patent/US10915304B1/