Let’s say you wanted to create a fine-tuned LLM that (1) fixes code and (2) optimizes code. But you aren’t GitHub and don’t have access to a large amount of training data. You could perform some model arbitrage from a larger model, but you can also find data in more interesting places. Fortunately, we already have tools built specifically for (1) finding errors in code and (2) optimizing code. Compilers.
Compilers give us orders of magnitude more tokens to train on in the form of equivalent blocks of code. That data might also give the model helpful examples for:
Optimizing code. Compilers don’t just package code into runnable binaries; they also optimize it, transforming code into faster but still equivalent code. For example, a compiler might inline a function, remove dead code, hoist loop invariants, or eliminate duplicate computations. Take a handful of these rules and you end up with many copies of (practically provable) equivalent code.
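As a rough sketch of how you might harvest those pairs, the snippet below shells out to clang to emit textual LLVM IR for the same file at -O0 and -O2 and writes the two versions side by side. The corpus/ directory, the pairs.jsonl output format, and the choice of clang are assumptions for illustration, not a prescribed pipeline.

```python
# Sketch: build (unoptimized IR, optimized IR) training pairs with clang.
# Assumes clang is on PATH; paths and output format are illustrative.
import json
import subprocess
from pathlib import Path

def emit_ir(source: Path, opt_level: str) -> str:
    """Compile a C file to textual LLVM IR at the given optimization level."""
    result = subprocess.run(
        ["clang", "-S", "-emit-llvm", f"-O{opt_level}", str(source), "-o", "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

def build_pairs(corpus_dir: str, out_path: str = "pairs.jsonl") -> None:
    with open(out_path, "w") as out:
        for source in Path(corpus_dir).glob("**/*.c"):
            try:
                unoptimized = emit_ir(source, "0")
                optimized = emit_ir(source, "2")
            except subprocess.CalledProcessError:
                continue  # skip files that don't compile standalone
            out.write(json.dumps({"input": unoptimized, "target": optimized}) + "\n")

if __name__ == "__main__":
    build_pairs("corpus/")
```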
Better transpilation. Compilers might also give us an interesting encoder/decoder between human-readable code and intermediate representations (IR) like LLVM IR or WebAssembly. Using a corpus of code, you can compile to WebAssembly to get a bunch of code-WebAssembly pairs. If we can learn a good decoder from WebAssembly back into human-readable code, we might be able to go from human-readable Language X to WebAssembly to a human-readable version in Language Y. That could be powerful. You might be able to transpile a whole library of code to a different language (hand-wavy, I know).
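Here’s a hedged sketch of the pair-generation step: compile each file to WebAssembly with clang’s wasm32 target and disassemble it to the text format with wasm2wat from the WABT toolkit. The flags, paths, and output format are illustrative and will likely need adjusting for a real toolchain.

```python
# Sketch: build (source code, WebAssembly text) pairs.
# Assumes clang with a wasm-capable linker and WABT's wasm2wat are installed;
# flags and paths are illustrative, not a recommended setup.
import json
import subprocess
import tempfile
from pathlib import Path

def to_wat(source: Path) -> str:
    """Compile a C file to WebAssembly, then disassemble it to the .wat text format."""
    with tempfile.TemporaryDirectory() as tmp:
        wasm_path = Path(tmp) / "out.wasm"
        subprocess.run(
            ["clang", "--target=wasm32", "-nostdlib", "-Wl,--no-entry",
             "-Wl,--export-all", "-o", str(wasm_path), str(source)],
            check=True,
        )
        result = subprocess.run(
            ["wasm2wat", str(wasm_path)], capture_output=True, text=True, check=True,
        )
        return result.stdout

def build_pairs(corpus_dir: str, out_path: str = "wasm_pairs.jsonl") -> None:
    with open(out_path, "w") as out:
        for source in Path(corpus_dir).glob("**/*.c"):
            try:
                pair = {"source": source.read_text(), "wat": to_wat(source)}
            except subprocess.CalledProcessError:
                continue  # skip files that don't build standalone
            out.write(json.dumps(pair) + "\n")

if __name__ == "__main__":
    build_pairs("corpus/")
```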
Finding errors. It’s possible that models can learn from some of the static analysis that already happens inside compilers. That might help LLMs output correct code more of the time.
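One way to mine that signal is to run just the compiler’s front-end checks over a corpus and keep whatever diagnostics come back, yielding (code, compiler diagnostic) pairs. The sketch below uses clang’s -fsyntax-only mode, which runs the checks without generating code; the paths and output format are again assumptions.

```python
# Sketch: collect (code, compiler diagnostics) pairs for error-fixing data.
# Assumes clang is on PATH; paths and output format are illustrative.
import json
import subprocess
from pathlib import Path

def diagnostics(source: Path) -> str:
    """Run clang's front-end checks only (no codegen) and return its diagnostics."""
    result = subprocess.run(
        ["clang", "-fsyntax-only", "-Wall", "-Wextra", str(source)],
        capture_output=True, text=True,
    )
    return result.stderr  # clang writes warnings and errors to stderr

def build_pairs(corpus_dir: str, out_path: str = "diagnostics.jsonl") -> None:
    with open(out_path, "w") as out:
        for source in Path(corpus_dir).glob("**/*.c"):
            diags = diagnostics(source)
            if not diags.strip():
                continue  # nothing to learn from files with clean output
            out.write(json.dumps({"code": source.read_text(), "diagnostics": diags}) + "\n")

if __name__ == "__main__":
    build_pairs("corpus/")
```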
Considerations
Platform-specific or architecture-specific optimizations are most likely unwanted and should be filtered out (a rough filtering sketch follows these considerations).
It’s important to understand what intermediate language you’re optimizing: LLVM IR, Microsoft’s Common Intermediate Language, or something else.
Readability is very important, and this data won’t help with that directly. But maybe the model can learn more general rules from it.
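As one hedged example of what that filtering might look like for LLVM IR: drop samples that contain target-specific intrinsics or inline assembly. The markers below are illustrative heuristics, not an exhaustive list.

```python
# Sketch: crude filter for architecture-specific LLVM IR samples.
# The markers are illustrative heuristics, not an exhaustive check.
from pathlib import Path

ARCH_SPECIFIC_MARKERS = (
    "llvm.x86.",       # x86-only intrinsics
    "llvm.aarch64.",   # AArch64-only intrinsics
    "llvm.arm.",       # 32-bit ARM intrinsics
    "asm sideeffect",  # inline assembly is inherently target-specific
    "module asm",
)

def is_portable(ir_text: str) -> bool:
    """Return True if the IR shows no obvious architecture-specific constructs."""
    return not any(marker in ir_text for marker in ARCH_SPECIFIC_MARKERS)

def filter_corpus(ir_dir: str) -> list[Path]:
    """Keep only .ll files that look architecture-neutral."""
    return [p for p in Path(ir_dir).glob("**/*.ll") if is_portable(p.read_text())]
```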