In the previous blog, I illustrated how even the simple question of “What do the TODOs say?” becomes complex when it is answered in a way that highlights the most urgent results. An engineering manager doesn’t just need a high-level answer, but the ability to understand and investigate where that answer comes from. Thus, even one very simple question needs additional processing beyond the traditional RAG/LLM framework. Of course, engineering managers want to stay on top of all of their code and to ask many questions, which multiplies the problem.
The naive approach of throwing every question at a single LLM is akin to asking every question of the smartest engineer. This might be a good strategy a priori, but there are many things that any generalist will simply not know about the code. Good engineers rely on static analysis tools and documentation in addition to the code itself, as well as external resources, e.g. databases of vulnerable open source libraries and the latest regulations in each geography. Additionally, the questions need to be re-asked periodically, as the codebase is updated, regulations change, new vulnerabilities are discovered, etc.
The challenge in understanding code using AI begins with knowing the right questions to ask and how to ask them. This is prompt engineering, but it is more than playing with wording. To get high quality answers from an LLM, conceptual questions must be broken down into questions that are more explicit and concrete. For example, at Flux, we’ve found we get the best results from LLMs when we don’t ask questions such as, “What type of UI does this repo have?” but instead break it down into questions about frontend frameworks, configuration files for display, styling libraries, etc.
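As a concrete illustration of that decomposition, here is a minimal sketch in Python; the broad question, the sub-questions, and the build_prompts helper are hypothetical examples rather than Flux’s actual prompts.

```python
# Illustrative sketch: decomposing a broad, conceptual question into explicit,
# concrete sub-questions before sending them to an LLM. The wording of the
# sub-questions below is hypothetical, not Flux's production prompts.

BROAD_QUESTION = "What type of UI does this repo have?"

SUB_QUESTIONS = [
    "Which frontend frameworks (e.g. React, Angular, Vue) appear in the dependency files?",
    "Which configuration files control how the UI is built or displayed?",
    "Which styling libraries or approaches (e.g. Tailwind, styled-components, plain CSS) are used?",
]

def build_prompts(code_context: str) -> list[str]:
    """Pair each concrete sub-question with the relevant repository excerpts."""
    return [
        f"Using only the repository excerpts below, answer:\n{question}\n\n---\n{code_context}"
        for question in SUB_QUESTIONS
    ]

# The individual answers can then be synthesized back into a response to the
# broad question, either with a final LLM call or with simple aggregation.
```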
Additionally, while some questions can be answered using just the code as an input data source, many require ancillary information. For example, git commit history is necessary to identify the individuals who committed certain code, and employee directory information ties a git handle to a developer, and a developer to a department or team. Git commit history is helpful for many questions, and thus becomes a vital data source. Similarly, external information about vulnerabilities should be imported and linked to the parts of the code that use the affected libraries, giving a deeper understanding of how they are used within the codebase. Thus, an effective code understanding platform involves a symphony of interconnected tools and techniques.
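To illustrate what that linking can look like, here is a small sketch that joins git commit authors to a hypothetical employee directory; the EMPLOYEE_DIRECTORY structure, its field names, and the commits_by_team helper are assumptions for illustration only.

```python
import subprocess
from collections import defaultdict

# Hypothetical employee directory mapping a git author email to a developer and
# a team; in practice this would come from an HR system or identity provider.
EMPLOYEE_DIRECTORY = {
    "alice@example.com": {"name": "Alice", "team": "Payments"},
    "bob@example.com": {"name": "Bob", "team": "Platform"},
}

def commits_by_team(repo_path: str, limit: int = 200) -> dict[str, list[str]]:
    """Group recent commit subjects by the team of their author."""
    log = subprocess.run(
        ["git", "-C", repo_path, "log", f"-n{limit}", "--format=%ae|%s"],
        capture_output=True, text=True, check=True,
    ).stdout
    grouped: dict[str, list[str]] = defaultdict(list)
    for line in log.splitlines():
        email, _, subject = line.partition("|")
        team = EMPLOYEE_DIRECTORY.get(email, {}).get("team", "Unknown")
        grouped[team].append(subject)
    return dict(grouped)
```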
Accordingly, at Flux, we spend considerable time evaluating tools to ensure that we are selecting the best-in-breed. For example, there is a plethora of tools that give insight into code quality, but many are better suited to an individual developer or a small group reviewing a given pull request. For us, it’s not just about the tools; it’s about how to leverage domain knowledge to pull out the appropriate code and ancillary information to feed the LLM. If an engineering leader learns about a vulnerable library in their code from a static analysis tool, their next step is deeper research to learn how integrated it is into the codebase, how it is being used, etc. Runtime performance can be evaluated by adding logging, traces, etc., which augments the code with information about its behavior at runtime, e.g. race conditions and runtime bugs. At Flux, we use the static analysis tool SonarQube, which feeds several different assessments into our code quality evaluation. We extract the relevant SonarQube chunks and pass them along as-is, without additional processing to reformat them into a different structure. If there’s no need for the LLM, we don’t pass the information through it. Sometimes what comes out of the static analysis tool is all that you need.
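As a rough sketch of that extraction step, the snippet below pulls high-severity findings from SonarQube’s Web API and keeps only a few fields; the fetch_relevant_issues helper, the severity filter, and the fields retained are illustrative assumptions, not our production pipeline.

```python
import requests

def fetch_relevant_issues(base_url: str, token: str, project_key: str) -> list[dict]:
    """Pull high-severity issues for a project via SonarQube's Web API and keep
    only the fields we would forward unchanged (message, file, line)."""
    response = requests.get(
        f"{base_url}/api/issues/search",
        params={"componentKeys": project_key, "severities": "BLOCKER,CRITICAL"},
        auth=(token, ""),  # SonarQube tokens are sent as the username with an empty password
        timeout=30,
    )
    response.raise_for_status()
    return [
        {"message": issue["message"], "file": issue["component"], "line": issue.get("line")}
        for issue in response.json().get("issues", [])
    ]

# If the question can be answered directly from these findings, they are reported
# as-is; only when synthesis is needed do they get passed on to the LLM.
```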
The complexity inherent in reliably orchestrating interconnected parts demands modularity. Committing to best-in-breed tools in a rapidly evolving ecosystem requires the flexibility to swap out components without impacting the rest of the system. We have put significant thought into how to augment the code passed to the LLM with additional context from static analysis reports. Often only a fraction of a report is actually relevant to the question, and optimal results rely on extracting it. Additional post-processing may be needed to get the report into the best format for the LLM, e.g. aggregating or filtering low-level details. Unlike rigid APIs, LLMs can take a wide variety of data formats, which makes joining data sources much easier and more tolerant of changes in the underlying formats. Together, this allows us to optimize our component tools with less overhead.
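One way to picture that modularity is a small provider interface that any data source can implement; the ContextProvider protocol, the SonarContext class, and the build_llm_input function below are a sketch under those assumptions, not our actual architecture.

```python
from typing import Protocol

class ContextProvider(Protocol):
    """Anything that can contribute context for a question about a file:
    static analysis, git history, a vulnerability feed, etc. Implementations
    can be swapped without touching the rest of the pipeline."""
    def relevant_context(self, question: str, file_path: str) -> str: ...

class SonarContext:
    """Wraps a static analysis report and exposes only the relevant fraction."""
    def __init__(self, issues: list[dict]):
        self.issues = issues

    def relevant_context(self, question: str, file_path: str) -> str:
        # Keep only findings for this file, and aggregate the low-level detail
        # into one line per issue before it reaches the LLM.
        lines = [
            f"{issue['file']}:{issue.get('line', '?')} {issue['message']}"
            for issue in self.issues
            if issue["file"].endswith(file_path)
        ]
        return "\n".join(lines) or "No matching static analysis findings."

def build_llm_input(question: str, file_path: str, code_chunk: str,
                    providers: list[ContextProvider]) -> str:
    """Assemble the question, the code, and whatever context each provider extracts."""
    context = "\n\n".join(p.relevant_context(question, file_path) for p in providers)
    return f"{question}\n\nCode ({file_path}):\n{code_chunk}\n\nAdditional context:\n{context}"
```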
In the next post, I’ll introduce a taxonomy for determining how questions and code should be parceled out to the LLM, depending on the class of question.
Rachel Lomasky is the Chief Data Scientist at Flux, where she continuously identifies and operationalizes AI so Flux users can understand their codebases. In addition to a PhD in Computer Science, Rachel applies her 15+ years of professional experience to augment generative AI with classic machine learning. She regularly organizes and speaks at AI conferences internationally - keep up with her at her LinkedIn here.