Comparing AI-Generated Code in Different Programming Languages

Updated Nov 22, 2024 • 17 min read

Since most AI-powered tools have their roots in Python, is this the language where AI code generation works best?

It’s safe to say that Python is the language of AI. The vast majority of AI-powered tools are built with Python, so an interesting question arises: do they favor Python while generating code? Will they perform better at creating a project in Python compared to other programming languages?

In one of my projects, I wanted to test this hypothesis by comparing the quality of the code generated by AI tools when the only variable is the programming language used.

Tools of choice: AutoGPT and GPT-Engineer

AutoGPT and GPT-Engineer looked the most promising in my initial testing.

For this research, each of them is asked to create a REST API for movies in 3 programming languages: Java, JavaScript and Python.

For each language, the project is generated 3 times so that the result can be averaged over the attempts. The prompt isn’t overly complex, but it is demanding enough that the tools are unlikely to implement it fully and correctly. The idea is that it will expose imperfections in the implementations, and potential differences in their severity depending on the selected programming language.
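
To make the test setup concrete, the run matrix can be sketched in a few lines of Python. This is only an illustration, not the script used for the research: the names `TOOLS`, `STACKS`, `ATTEMPTS` and `build_runs` are hypothetical, and the framework names (Spring, Express, FastAPI) are the ones mentioned in the results further down. Only the opening line of the prompt is templated here; the rest of the prompt is the same for every run.

```python
from itertools import product

# Tools, languages and frameworks named in this article.
TOOLS = ["GPT-Engineer", "AutoGPT"]
STACKS = {"Java": "Spring", "JavaScript": "Express", "Python": "FastAPI"}
ATTEMPTS = 3

# Opening line of the prompt; the remaining requirements never change.
PROMPT_HEAD = (
    "Create a REST API for movies. "
    "Use {language} language with {framework} framework."
)


def build_runs():
    """Return one entry per (tool, language, attempt) combination."""
    runs = []
    for tool, (language, framework), attempt in product(
        TOOLS, STACKS.items(), range(1, ATTEMPTS + 1)
    ):
        runs.append(
            {
                "tool": tool,
                "language": language,
                "attempt": attempt,
                "prompt_head": PROMPT_HEAD.format(
                    language=language, framework=framework
                ),
            }
        )
    return runs


if __name__ == "__main__":
    for run in build_runs():
        print(run["tool"], run["language"], run["attempt"], run["prompt_head"])
```

With two tools, three languages and three attempts each, this yields 18 generated projects in total.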

The prompt is always identical; the only difference is the programming language and the framework named in it. The tools are evaluated on how well they fulfill the requirements from the prompt. The prompt template is the following:

Create a REST API for movies. Use <language_name> language with <framework_name> framework.

For the project structure, follow the best practices in <framework_name> framework.

For data persistence use PostgresQL.

There are 3 resources: movie, actor and director. Name the database tables in plural form.

Each director has a first name, a last name, date of birth (timestamp) and a list of movies directed (relation one-to-many with movies table).

Each actor has a first name, a last name, date of birth (timestamp) and a list of movies in which they played (relation many-to-many with movies table).

Each movie has a title, a genre, release date (timestamp), production cost (in US dollars), total revenue (in US dollars), a director (relation many-to-one with directors table) and a list of actors who played in it (relation many-to-many with actors table). Movie genre can only be one the following values: “COMEDY”, “DRAMA”, “THRILLER” or “HORROR”.

Additionally, each record must include a unique ID generated by the database.

All properties for all resources are required (non-nullable).

For each of the resources, implement CRUD operations: list all records, get a record by ID, create a record, update a record by ID, delete a record by ID.

For POST and PUT or PATCH endpoints (creating and updating records) add input validation, ensuring that the data provided by the API client is complete (no data is missing) and of correct type.

For GET endpoints used for listing all records, include pagination mechanism.

Add a health-check endpoint under “/healthcheck” route.

Cover the project code completely with unit tests.

No endpoint implementation can be left blank or left with a comment of type “TODO add implementation here”. No other requirement from this prompt can be left blank or left with a comment of type “TODO add implementation here”. All of the program logic must be fully implemented and the final result must be a complete, fully working solution.
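
For reference, the data model described in the prompt can be sketched roughly as follows. This is a hand-written illustration in plain Python, not output from either tool. The class and field names (for example `production_cost_usd`, `director_id`, `actor_ids`) and the `validate_movie_payload` helper are assumptions made for the example, and a real solution would map these onto database tables through an ORM.

```python
from dataclasses import dataclass, field
from datetime import datetime
from decimal import Decimal
from enum import Enum
from typing import Any, Dict, List


class Genre(Enum):
    """The only genre values allowed by the prompt."""
    COMEDY = "COMEDY"
    DRAMA = "DRAMA"
    THRILLER = "THRILLER"
    HORROR = "HORROR"


@dataclass
class Director:
    id: int  # generated by the database in a real implementation
    first_name: str
    last_name: str
    date_of_birth: datetime
    # One-to-many with movies. Kept out of == and repr because
    # Movie points back at Director.
    movies: List["Movie"] = field(default_factory=list, compare=False, repr=False)


@dataclass
class Actor:
    id: int
    first_name: str
    last_name: str
    date_of_birth: datetime
    # Many-to-many with movies.
    movies: List["Movie"] = field(default_factory=list, compare=False, repr=False)


@dataclass
class Movie:
    id: int
    title: str
    genre: Genre
    release_date: datetime
    production_cost_usd: Decimal
    total_revenue_usd: Decimal
    director: Director  # many-to-one
    actors: List[Actor] = field(default_factory=list)  # many-to-many


# Hypothetical shape of a JSON body for creating a movie.
MOVIE_FIELD_TYPES: Dict[str, Any] = {
    "title": str,
    "genre": str,
    "release_date": str,  # assumed to arrive as an ISO 8601 string
    "production_cost": (int, float),
    "total_revenue": (int, float),
    "director_id": int,
    "actor_ids": list,
}


def validate_movie_payload(payload: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the payload is valid."""
    errors = []
    for name, expected in MOVIE_FIELD_TYPES.items():
        value = payload.get(name)
        if value is None:
            errors.append(f"{name}: missing")
        elif isinstance(value, bool) or not isinstance(value, expected):
            errors.append(f"{name}: wrong type")
    genre = payload.get("genre")
    if isinstance(genre, str) and genre not in Genre.__members__:
        errors.append("genre: must be one of " + ", ".join(Genre.__members__))
    return errors
```

The validation helper only checks completeness and basic types, which is the minimum the prompt asks for on the create and update endpoints.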

TL;DR test results

Both GPT-Engineer and AutoGPT generated good-quality code and a decent repository to start a Java Spring-based project.
With JavaScript, GPT-Engineer did well, but the AutoGPT output required a lot of fixing to get it running, making it too time-consuming.
Surprisingly, Python did not come out on top for either tool. GPT-Engineer missed several requirements but overall still provided a decent starting point for a project. AutoGPT, however, failed with Python, generating almost nothing of value.

How GPT-Engineer performed

GPT-Engineer handled my task best with the use of Java, followed by Python, and then JavaScript in last place.

It is worth noting that the differences in code quality were not striking. In all cases the generated codebases required at least a few tweaks, and in some cases missing files or parts of the code had to be added manually, based on the examples generated by gpt-engineer.

Fortunately, filling in the gaps is easy thanks to the generated project structure and examples, and can be almost seamless with the help of a tool like GitHub Copilot, which can suggest the missing pieces based on those already existing. In all of the tested languages gpt-engineer can help develop a solid base for a new project.

Keep reading for more details or scroll down to see how AutoGPT performed.

Java

The generated code is pretty good, and gpt-engineer can be considered a good tool for starting a Java Spring-based project (as long as you keep the initial prompt as explicit as possible).
The code quality is good, and the generated application can be started easily after minor tweaks.
All data transfer objects (DTOs) were generated in each attempt exactly as described. An embedded H2 database was used, and its configuration was fully correct, with no tweaks needed.
In every attempt, 3 or 4 unit test classes were generated, with 2-3 tests inside each class. The tests were quite simple in terms of both test coverage and functional coverage.
All mentioned CRUD endpoints were generated every time with very simple logic inside the 'controllers' (mainly just pointing to the appropriate service call). In one attempt, it even added a Swagger config for the REST endpoints, which was not specified in the prompt.
The health check endpoint was generated with a simple “OK/NOK” reply.
The structure and quality of the generated code are pretty much the same as with a Spring Initializr run, the one difference being that an extra config was sometimes needed to run the gpt-engineer generated code.

JavaScript

The amount of missing files and unfulfilled requirements is significant. However, the existing parts of the code are syntactically and logically correct and well-structured. Combining the various parts of the requirements from different iterations on the prompt may serve as a good base for starting a new project.
The code quality is fine, but many of the expected files are left with blank content or missing.
In all 3 attempts GPT-Engineer used the specified language and framework.
The code is well-organized into properly named directories, according to common conventions in the Express framework.
All attempts failed to code an actual connection to the database. In two attempts, appropriate models were generated for the entities using the Sequelize library. One implementation was rather laconic, and only one could be considered solid, containing the constraints from the requirements: auto-generated IDs, non-nullable fields and relationships.
Only one attempt resulted in implementing the full logic for all CRUD operations on the resources. In other attempts, the logic was only implemented for one resource, while the controllers for remaining resources were either blank or completely missing.
In two attempts input validation was added for the CREATE and UPDATE operations. Only one attempt included a basic pagination mechanism in endpoints listing the records. In all attempts, the health-check route was included.
In all 3 attempts test files were created, but they were left effectively empty, containing only a TODO comment saying that tests should be added there.

Python

The overall number of missing files and unfulfilled requirements is not negligible, although more tolerable than with JavaScript. The existing parts of the code are syntactically and logically correct and for the most part well-structured. Combining the various parts of the requirements from different iterations on the prompt may definitely serve as a good base for starting a new project.
The overall code quality is quite good. Some of the expected files are left with blank content or missing, but noticeably less than with JavaScript.
In all 3 attempts gpt-engineer used the specified language and framework. In two cases the code was well-organized into properly named directories, according to common conventions in the FastAPI framework. In one case, there was no structure whatsoever: the project code was roughly split into a few files, all placed in the top-level working directory.
Only one attempt provided an actual connection to the database. However, in all attempts the appropriate entity schemas and database models were generated, including correct data types of entity properties, non-nullable constraints, auto-generated IDs and relationships in the database models.
Just like with JavaScript, only one attempt provided the full logic for all CRUD operations on the resources. For the other two attempts, the logic was only implemented for one resource, while the controllers for remaining resources were either blank or completely missing.
Again strikingly similar to JavaScript, in two attempts dedicated classes were generated solely for validating the inputs, separately for the CREATE and UPDATE operations.
All attempts included a basic pagination mechanism in endpoints listing the records. In all attempts, the health check route was included.
Notably, most endpoints included error handling around the coded interactions with the database, even though this was not part of the requirements in the prompt.
In all 3 attempts test files were created, but they were left mostly empty, either with an example test for one of the routes for each resource, or example tests for all operations on a single resource, leaving the rest to be filled in by the developer.
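
As an illustration of what a "basic pagination mechanism" for a list endpoint amounts to, here is a minimal sketch in plain Python. The `paginate` helper and its `page` and `size` parameters are hypothetical names chosen for this example, not code taken from the generated projects, and the in-memory list stands in for a database query.

```python
from typing import Any, Dict, Sequence


def paginate(records: Sequence[Any], page: int = 1, size: int = 10) -> Dict[str, Any]:
    """Return one page of records plus the metadata a client needs to navigate."""
    if page < 1 or size < 1:
        raise ValueError("page and size must be positive integers")
    start = (page - 1) * size          # index of the first record on this page
    items = list(records[start:start + size])
    total = len(records)
    total_pages = (total + size - 1) // size  # ceiling division
    return {
        "items": items,
        "page": page,
        "size": size,
        "total": total,
        "total_pages": total_pages,
    }


if __name__ == "__main__":
    movies = [f"movie-{i}" for i in range(1, 26)]
    print(paginate(movies, page=3, size=10))
```

Against a SQL database, the same arithmetic typically becomes `LIMIT size OFFSET (page - 1) * size`.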

How AutoGPT performed

General note – by design, AutoGPT is highly interactive and generates the best quality code when a human developer is actively engaged in the development cycle, providing feedback after each step of the repository generation. However, it is possible to accept upfront all suggestions generated by AutoGPT, making it an autonomous tool, and this is the approach I used here.

Providing my feedback would make the results of this comparison useless, but just keep in mind that AutoGPT will achieve better results than mine when you work together with it.

Java

All in all, the generated code is pretty good, and AutoGPT can be considered a good tool to ramp up a Java Spring-based project.
The code quality is fine. In all 3 attempts, with some minor tweaks, the application can be started easily.
The biggest issue I encountered on every attempt was that AutoGPT tried to use the Maven CLI as a code generation tool. If Maven was not installed and accessible in the terminal, it would loop, and wouldn’t exit the loop without explicit feedback. To exit the loop, you need to either point it to a Maven installation tutorial (which also has some limitations) or point it to Spring Initializr and extract the initial skeleton from there.
All data transfer objects (DTOs) were generated in all attempts exactly as described. An embedded H2 database was used, and its configuration was fully generated, with no need for tweaks.
In every attempt, 1 or 2 unit test classes were generated, with 2-3 tests inside each class. The tests were quite simple in terms of both test coverage and functional coverage.
All mentioned CRUD endpoints were generated every time with very simple logic inside the 'controllers' (mainly just pointing to the appropriate service call).
In two attempts, the health check endpoint was generated with a simple “OK/NOK” reply; in one attempt it was skipped.
As with gpt-engineer, the structure and quality of the generated code were the same as with a Spring Initializr run, the only difference being that an extra config sometimes needs to be added.

JavaScript

The number of missing files and unfulfilled requirements is large. The existing parts of the code are syntactically correct and, for the most part, also logically correct, even if not always properly connected into a working application. In this state, building a solid project base by combining the various parts of the requirements from different iterations could pose a challenge, and the time and effort required from the user might be disproportionate to the result.
The general code quality is fine; however, the overall performance is mediocre due to the large amount of missing or logically unconnected code.
In all 3 attempts it used the specified language and framework. When it comes to the project structure, in two attempts the code is roughly organized into properly named directories and files, according to common conventions in the Express framework. In the third attempt, the tool barely generated any code at all, just two files: one for the server setup and one for tests, both almost completely empty.
Only one attempt provided an actual connection to the database. An appropriate file for setting up the database was included, but it wasn’t used in the entry file. In another attempt an SQL file with database setup was generated, but not used anywhere further in the codebase.
In contrast to gpt-engineer, no appropriate models or entity files were generated in any attempt. Only one of the generated projects included a package.json file with required dependencies.
Two attempts resulted in implementing the full logic for all CRUD operations on the resources. In the other attempt, the logic was omitted completely and replaced with TODO comments.
It is worth noting that one attempt included error handling around the coded interactions with the database in all endpoints, even though this was not part of the requirements in the prompt.
No input validation was generated for the CREATE and UPDATE operations in any attempt. Only one attempt included a basic pagination mechanism in endpoints listing the records. In all attempts, the health check route was included.
In two attempts, test files were created and included an example test for a single route on each of the resources, leaving the rest to be implemented by the developer.

Python

The general code quality generated for this task is by far the worst of all. The only requirement fulfilled in all three attempts was using the Python language. The result is useless.

In all attempts the project structure was very poor. All generated projects consisted of a few files placed in the base directory, only roughly structuring the code into a few logical subparts. Despite theoretically correct syntax of the generated code, a lot of files/imports are missing, and there’s no logic connecting the existing files into an actual coherent project.
In two attempts there was no use of a database for data persistence. The API data is saved in lists placed directly in the entry file, and would be erased once the execution is finished.
Only one attempt provided an actual connection to the database; however, in that same attempt no endpoints were created for interacting with the API.
In all attempts some entity models were generated, but in every case they were incorrect or incomplete.
Only one attempt resulted in implementing the full logic for all CRUD operations on the resources, but it used only a list in the entry file for storage. In another attempt, the only generated endpoint was the health check. One attempt failed to generate any endpoint at all.
No input validation was generated for the CREATE and UPDATE operations in any attempt. No pagination mechanism was generated. A health check route was included in two attempts.
No meaningful test files were generated in any attempt.
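
The difference between keeping records in a list and persisting them can be shown with a small sketch. Note the assumptions: SQLite (from the Python standard library) stands in for the PostgreSQL database requested in the prompt, purely so the example runs without a server, and the `save_movie`, `list_movies` and `demo` helpers are hypothetical names, not code produced by AutoGPT.

```python
import os
import sqlite3
import tempfile
from contextlib import closing


def save_movie(db_path: str, title: str) -> int:
    """Insert a movie into a file-backed database and return its generated ID."""
    with closing(sqlite3.connect(db_path)) as conn:
        with conn:  # commits the transaction on success
            conn.execute(
                "CREATE TABLE IF NOT EXISTS movies ("
                "id INTEGER PRIMARY KEY AUTOINCREMENT, title TEXT NOT NULL)"
            )
            cursor = conn.execute(
                "INSERT INTO movies (title) VALUES (?)", (title,)
            )
            return cursor.lastrowid


def list_movies(db_path: str) -> list:
    """Read all movie titles back through a fresh connection."""
    with closing(sqlite3.connect(db_path)) as conn:
        rows = conn.execute("SELECT title FROM movies ORDER BY id").fetchall()
    return [title for (title,) in rows]


def demo() -> list:
    """Write with one connection and read with another, as separate requests would."""
    with tempfile.TemporaryDirectory() as tmp:
        db_path = os.path.join(tmp, "movies.db")
        save_movie(db_path, "First")
        save_movie(db_path, "Second")
        return list_movies(db_path)


if __name__ == "__main__":
    print(demo())
```

Because every call opens its own connection, the records survive independently of any variable in the entry file, which is exactly what a module-level list cannot offer.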

Summary

When it comes to creating a REST API, AutoGPT handles the task very differently depending on the programming language used.

With Java, the overall project quality was quite good and only required a few corrections before being used as a new project base. The projects generated with JavaScript were of noticeably worse quality, leaving the developer much more work in order to create a solid project from the generated content. Quite surprisingly, the codebase generated with Python was of the worst quality and could not be used even as a blueprint for a good project base.

Once again, keep in mind that my results are biased, as AutoGPT is a tool you’re supposed to cooperate with and give feedback to in order to get the best results. This test was only meant to point out the differences in tool performance depending on the programming language used, and overall, the result is that Java, not Python, was the easiest language for the AI tools to generate a codebase in.