Introduction

Test 2 Code with LLMs (T2C) is a project that explores how well Large Language Models (LLMs) can generate code from various types of tests, including unit, integration, and acceptance tests. The idea is that, given a directory containing all the tests, T2C should be able to understand the context and generate code that satisfies the requirements the tests define. A usage example would be:

Bash
t2c generate --tests ./tests --output ./generated_code

where ./tests is a directory containing the test files, and ./generated_code is the directory where the generated code will be saved.
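
As an illustration only, a command line of this shape could be parsed with Python's standard `argparse` module. This is a minimal sketch, not T2C's actual implementation: the `build_parser` helper and the help strings are assumptions, and only the `generate` subcommand and the `--tests` and `--output` options come from the example above.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for a hypothetical `t2c generate` command."""
    parser = argparse.ArgumentParser(prog="t2c")
    subcommands = parser.add_subparsers(dest="command", required=True)

    generate = subcommands.add_parser(
        "generate", help="generate code from a directory of tests"
    )
    # Directory containing the test files.
    generate.add_argument("--tests", type=Path, required=True)
    # Directory where the generated code will be saved.
    generate.add_argument("--output", type=Path, required=True)
    return parser


# Parse the same arguments as in the usage example above.
args = build_parser().parse_args(
    ["generate", "--tests", "./tests", "--output", "./generated_code"]
)
```

Here `args.tests` and `args.output` are `Path` objects, so later steps can list the test files or create the output directory without further string handling.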

If the experiment succeeds, one possible application is integration into CI/CD pipelines. This could increase productivity considerably, since code generation is delegated entirely to an LLM running on a remote machine while the dev team focuses on engineering the next requirements into tests. The pipeline would look something like this:

flowchart TD
    Push(Push to feature branch) --> Trigger((Pipeline triggers))
    Trigger --> CodeGen(Generate code based on tests)
    CodeGen --> Test(Test generated code with provided tests)
    Test --> Check{tests pass?}
    Check -->|yes| PR(Pull Request on the develop branch)
    Check -->|no| Check2{tries > upperBound?}
    Check2 -->|yes| Fail(Pipeline fails)
    Fail --> Warning(Notify failure to developers)
    Check2 -->|no| CodeGen
    PR --> Continue(...)
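
The generate/test/retry loop in the flowchart can be sketched in a few lines of Python. This is a hedged illustration under stated assumptions: `run_pipeline`, `generate`, `tests_pass`, and `upper_bound` are hypothetical names, and the two callables stand in for the LLM call and the test runner, which are not specified here.

```python
from typing import Callable, Tuple


def run_pipeline(
    generate: Callable[[int], str],
    tests_pass: Callable[[str], bool],
    upper_bound: int = 3,
) -> Tuple[bool, int]:
    """Regenerate code until the tests pass or the try budget is exceeded.

    Returns (succeeded, tries_used).
    """
    tries = 0
    while True:
        tries += 1
        code = generate(tries)      # hypothetical LLM call
        if tests_pass(code):        # hypothetical test runner
            return True, tries      # flowchart: open a Pull Request
        # Mirrors the "tries > upperBound?" check literally, so at most
        # upper_bound + 1 attempts are made before giving up.
        if tries > upper_bound:
            return False, tries     # flowchart: pipeline fails, notify devs


# Stub example: the "model" only produces passing code on its third attempt.
result = run_pipeline(
    generate=lambda attempt: f"attempt {attempt}",
    tests_pass=lambda code: code == "attempt 3",
)
```

Passing the generator and the test runner as callables keeps the loop independent of any particular model or test framework, which fits the plan of comparing several models.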

Project Objectives

The project aims to demonstrate how the role of a software developer can shift towards that of a test provider, with the LLM handling code generation based on those tests. The project will pursue this goal through two main approaches:

  • research the best combination of test kinds to let the model succeed in code generation;
  • create a pipeline infrastructure that could be exploited in case of a successful experiment.

The project will use case studies of increasing complexity, such as:

  • Tic Tac Toe
  • Snake
  • Space Invaders

Moreover, experiments will be run across different models to find the one that gives the best results. The following models will be used:

  • Mistral
  • DeepSeek R1
  • SmolLM2
  • Qwen3
  • GitHub Copilot
  • Gemini Flash (as long as the API free-tier limits allow)