AI Tools Comparison: How to Fast-Track Your Project Setup
In a world where time and performance are the be-all and end-all, AI tools truly move the needle. They streamline workflows and boost productivity, allowing developers to achieve more in less time.
The question here is: how effective are AI tools in speeding up the initial stages of projects? Can they create good-quality boilerplate code?
Our team researched 6 AI tools – GPT-engineer, Smol-dev, MetaGPT, AutoGPT, ChatDev, and GPT-pilot – to determine how useful they truly are in creating landing pages, mobile apps, and entire web applications.
In our research endeavor, we decided to start a project with an AI tool like GPT-Engineer. To avoid AI implementation errors, it's crucial to consider the nuances of each tool and how they align with your project requirements.
Alok Ranjan, Software Engineering Manager at Dropbox
Accelerating project setup with AI tools
AI tools can improve the speed of the development process. They achieve this by automating repetitive and time-consuming tasks, such as generating boilerplate code or setting up basic project structures.
Employing AI tools in the development process offers many advantages:
- Speed and Efficiency: AI tools automate repetitive tasks and analyze data swiftly, freeing developers to tackle complex problems sooner. This leads to quicker project completion and problem-solving.
- Cost Optimization: By automating routine tasks, AI reduces labor costs and optimizes resource use. This results in lower operational costs and makes projects more financially viable.
- Increased Capacity: AI expands capacity by handling multiple tasks, enabling smaller teams to undertake larger projects. This allows for efficient use of limited resources, expanding project potential.
Comparison of AI tools' capabilities
We aim to expedite the initial stages of projects with AI-powered tools, allowing us to focus on delivering value to our clients rather than spending time on writing boilerplate code. The tools presented below may also serve as educational resources for learning about AI-powered technology. The primary goal of this comparison is to assist readers in selecting the most suitable tool for their software development needs.
AI tools score across categories
The table below presents scores for each tested tool across several categories. Despite our best efforts to maintain objectivity in scoring, it is challenging to eliminate one’s personal perspective when interacting with such novel, powerful, and oftentimes surprising technology. As this research marks the beginning of our exploration into this topic at Netguru, we anticipate and welcome further development and updates to this document by other developers who have the opportunity to experiment with these or other AI-powered tools. For each category in the score table, you will find the rationale behind the scores and a brief description of each tool’s performance in that aspect.
The scores used:
| Score | Label |
|-------|-----------|
| 0 | Bad |
| 1 | OK |
| 2 | Good |
| 3 | Very good |
Table with AI tools comparison
Category: Generated code quality
A general note on code quality generated by AI-powered tools is that users must undergo an iterative process of “try → assess → improve → try again” to learn how to formulate prompts to achieve the desired results. However, a consistent pattern across all these tools quickly emerges: very explicit and detailed prompts yield the best results.
GPT-engineer GOOD
The generated code often lacks certain configuration files or may have minor flaws in terms of program logic or syntax. These issues can be relatively easily addressed either by refining the initial prompt and rerunning the program or by manual corrections. Despite the non-interactive nature of this tool, the code quality is generally good.
Smol-dev OK
The generated code frequently lacks configuration files or exhibits moderate flaws in terms of program logic and syntax. Rectifying these issues typically requires significant effort from the user.
MetaGPT OK
Considering that MetaGPT focuses more on the end-to-end process flow than solely on code generation (it is capable of generating user stories, competitive analyses, requirements, data structures, documents, etc.), its generated code is not bad but not the greatest either. It often defaults to generating Python code with Flask when no language or tech stack is specified. The quality of the generated code is decent, sometimes contingent on the specificity of the initial prompt.
AutoGPT GOOD
The quality of the generated code varies greatly depending on the user’s interaction with the tool. As this tool is highly interactive, the user plays an active role in shaping the codebase at each step of its creation. With experienced developers, it is possible to achieve high-quality code, albeit with considerable effort.
ChatDev OK
Despite the involvement of multiple virtual agents, each with a specific role, the final code quality tends to be somewhat disappointing. While the generated codebase is typically free of logical bugs, it often lacks certain files, contains syntax errors, or lacks a proper structure.
GPT-pilot GOOD
We tested it primarily with Java and Node.js, and the quality of the generated code was pretty good. Notably, the debugging features are fairly effective: if an error occurs when running the code, GPT-pilot provides relevant suggestions and even implements fixes.
Additionally, it prompts for the developer’s confirmation or manual interaction and provides hints on where to make necessary adjustments.
Category: Interactivity
GPT-engineer OK
This tool operates in a non-interactive manner, generating code based on an initial prompt file. However, users can adopt an iterative approach by reviewing the generated code, refining the prompt, and rerunning the process.
Smol-dev OK
Similar to GPT-engineer, this tool functions in a non-interactive manner, generating the code based on an initial prompt file. Users can iterate on their work by reviewing the generated code, improving the prompt, and running the process again.
MetaGPT OK
MetaGPT operates as a non-interactive tool, requiring careful consideration of the quality and quantity of information provided in the initial prompt to obtain better-quality code/documents.
AutoGPT VERY GOOD
Distinguished by its high interactivity, AutoGPT actively engages users throughout every step of the code creation process. The tool clearly outlines each step, provides the reasoning behind its actions, presents the anticipated changes in the code, and solicits user acceptance or feedback at each juncture. This level of interaction allows for precise and granular work with the tool, albeit demanding considerable attention and engagement from the user.
ChatDev GOOD
This tool offers two modes of operation: autonomous and “human in the loop”. In the latter mode, users do not actively participate in the code generation but can provide feedback at the end, through which they can request any required corrections. However, this feature may not function optimally.
GPT-pilot VERY GOOD
Among the listed solutions, GPT-pilot stands out for its exceptional interactivity, particularly as it can be integrated as a Visual Studio Code extension. Offering a consistent user experience, it facilitates real-time file creation, provides code validations, and offers helpful hints for manual interventions. Additionally, users have the flexibility to backtrack to any step in the workflow and restart the generation process from that point.
Category: Ease of use
GPT-engineer VERY GOOD
Extremely straightforward to run, and the way it works is easy to understand. A single iteration cycle with this tool lasts only a few minutes, with the majority of time spent on crafting the prompt. Subsequently, it requires no additional input from the user. It boasts an almost entirely flat learning curve.
Smol-dev VERY GOOD
Highly intuitive to run and grasp its functionality. A single work cycle with this tool consumes just a few minutes, most of which will be spent on writing the prompt. Once completed, it requires no additional input from the user. The learning curve is nearly nonexistent.
MetaGPT GOOD
The tool is relatively simple to use and notably swift in terms of processing time: it usually takes only a few minutes (5-6) to generate both the documentation (user stories, data diagrams, design diagrams, etc.) and the code.
AutoGPT GOOD
Despite resembling a conversation with a human, users must learn how to phrase prompts effectively to avoid unexpected behavior from the tool, e.g. overwriting existing project files or prematurely terminating the agent (before the task is actually finished). The learning curve is moderately steep.
ChatDev VERY GOOD
The ease of use depends on whether the user opts for an automatic or “human-in-the-loop” approach. However, since this tool is relatively autonomous regardless of the chosen mode, the learning curve is rather flat.
GPT-pilot VERY GOOD
Installation is exceptionally straightforward, given its availability as an out-of-the-box extension for Visual Studio Code. Once installed, interaction is mainly prompt-based, with infrequent manual intervention supplemented by helpful hints on what to do. Debugging is largely automated, with a high probability of issue resolution (Java, NodeJS).
Category: Overall user experience
GPT-engineer VERY GOOD
The tool generates complex, high-quality code, requiring minimal effort from the user. Consequently, the overall user experience is very good.
Smol-dev OK
Although the tool demands little effort from the user, it also yields somewhat unsatisfactory results. Hence, the overall user experience is acceptable, but has room for improvement.
MetaGPT GOOD
A pleasant tool to use, straightforward, and highly reliable, with rare occurrences of errors. Its speed in generating code/documentation contributes to an overall positive user experience, allowing for swift iterations.
AutoGPT VERY GOOD
Interaction with this tool is highly engaging, making it perhaps the most exciting and interesting tool on this list, albeit at the cost of increased effort. In some cases, the quality of the final result may not align with the effort the user exerted, leading to variable overall user experiences ranging from excellent to somewhat disappointing.
ChatDev GOOD
This tool sets high expectations due to its main concept of simulating a “tiny software house” with multiple agents. While not overly difficult to use and offering some customization, ChatDev often falls short of delivering results commensurate with its description, compared to other tools. Thus, although observing the tool at work can be quite interesting, the final user experience is, at best, good.
GPT-pilot VERY GOOD
The tool generates quality code with a highly satisfactory user experience. Being packaged as a plugin makes it fun and easy to use.
AI tools comparison summary
After considering the performance of the tested tools across various categories, two stand out: GPT-engineer and AutoGPT. Both of them are capable of generating good or very good quality code, which is an essential factor when using them for project setup or writing boilerplate code, thereby saving the developer’s time and effort.
GPT-engineer impresses with its efficient effort-to-code-quality ratio. The only thing the user needs to do is write a well-detailed prompt file, and the tool takes care of the rest. The rapid generation process allows for quick adjustments if flaws are detected in the final output: correcting the appropriate part of the prompt and running the program again.
AutoGPT, while capable of generating high-quality code, requires a lot of attention from the user to achieve optimal results. As mentioned above, this tool keeps the user “in the loop” of generating the code by presenting its goals, reasoning, and proposed actions, seeking user approval before execution. This transparency makes it a perfect educational tool for understanding similar AI-driven tools.
When it comes to tasks extending beyond code generation, MetaGPT deserves consideration. It is capable of generating fairly good documentation (e.g. user stories, data diagrams, design diagrams, etc.) when provided with high-quality information in the prompts.
Combining the strengths of other tools, GPT-pilot emerges as an easy-to-use option that delivers quality code and offers a pleasant interaction experience.
Comparing AI tools for project base generation in various programming languages
Given that the majority of AI-powered tools are built with Python, it’s likely (partially confirmed by prior research observations) that they may exhibit a bias towards generating code in Python. Consequently, they might perform better when tasked with creating projects in Python rather than other programming languages.
Research aim
This part of the research aims to assess the capabilities of various AI-powered tools to create a project in different programming languages, starting from an identical initial prompt.
The goal is to provide a clear comparison of code quality generated by each tool when the only difference is the programming language. This comparison should indicate the usability of each particular tool with a given programming language.
AutoGPT and GPT-engineer compared in 3 programming languages
After testing out a few tools in the previous research, two were selected for this task: AutoGPT and GPT-engineer, both considered the most promising. In this research, each tool was tasked with creating a REST API for movies in 3 programming languages: Java, JavaScript, and Python. For each language, the project was generated 3 times to evaluate the average result rather than relying on a single attempt. The prompt was of medium complexity, designed to challenge the tools to ensure correct implementation and expose potential imperfections and differences in their severity based on the selected programming language. The prompt remained identical, the only difference being the programming language and its framework indicated in the prompt. Tested tools were evaluated based on their ability to fulfill the prompt requirements. The prompt template is as follows:
Prompt template:

> Create a REST API for movies. Use <language_name> language with <framework_name> framework. For the project structure, follow the best practices in <framework_name> framework. For data persistence use PostgresQL. There are 3 resources: movie, actor, and director. Name the database tables in plural form. Each director has a first name, a last name, a date of birth (timestamp), and a list of movies directed (relation one-to-many with movies table). Each actor has a first name, a last name, a date of birth (timestamp), and a list of movies in which they played (relation many-to-many with movies table). Each movie has a title, a genre, release date (timestamp), production cost (in US dollars), total revenue (in US dollars), a director (relation many-to-one with directors table), and a list of actors who played in it (relation many-to-many with actors table). Movie genre can only be one of the following values: “COMEDY”, “DRAMA”, “THRILLER”, or “HORROR”. Additionally, each record must include a unique ID generated by the database. All properties for all resources are required (non-nullable). For each of the resources, implement CRUD operations: list all records, get a record by ID, create a record, update a record by ID, and delete a record by ID. For POST and PUT or PATCH endpoints (creating and updating records) add input validation, ensuring that the data provided by the API client is complete (no data is missing) and of the correct type. For GET endpoints used for listing all records, include the pagination mechanism. Add a health-check endpoint under the “/healthcheck” route. Cover the project code completely with unit tests. No endpoint implementation can be left blank or left with a comment of type “TODO add implementation here”. No other requirement from this prompt can be left blank or left with a comment of type “TODO add implementation here”. All of the program logic must be fully implemented and the final result must be a complete, fully working solution.
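To make the data model in this prompt concrete, below is a minimal sketch (our own illustration, not output from any of the tested tools) of how the described entities, relations, and constraints could be expressed in Python with SQLAlchemy. The table names, genre values, and non-nullable columns come straight from the prompt; the ORM choice and identifier names are assumptions.

```python
# Minimal sketch of the prompt's data model using SQLAlchemy (assumed ORM).
import enum

from sqlalchemy import (
    Column, Integer, String, DateTime, Numeric, Enum, ForeignKey, Table,
)
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

# Many-to-many join table between movies and actors.
movie_actors = Table(
    "movie_actors",
    Base.metadata,
    Column("movie_id", ForeignKey("movies.id"), primary_key=True),
    Column("actor_id", ForeignKey("actors.id"), primary_key=True),
)


class Genre(enum.Enum):
    # Genre is restricted to exactly these four values, per the prompt.
    COMEDY = "COMEDY"
    DRAMA = "DRAMA"
    THRILLER = "THRILLER"
    HORROR = "HORROR"


class Director(Base):
    __tablename__ = "directors"  # plural table names, per the prompt

    id = Column(Integer, primary_key=True)  # database-generated unique ID
    first_name = Column(String, nullable=False)
    last_name = Column(String, nullable=False)
    date_of_birth = Column(DateTime, nullable=False)
    # One director -> many movies.
    movies = relationship("Movie", back_populates="director")


class Actor(Base):
    __tablename__ = "actors"

    id = Column(Integer, primary_key=True)
    first_name = Column(String, nullable=False)
    last_name = Column(String, nullable=False)
    date_of_birth = Column(DateTime, nullable=False)
    # Many actors <-> many movies.
    movies = relationship("Movie", secondary=movie_actors, back_populates="actors")


class Movie(Base):
    __tablename__ = "movies"

    id = Column(Integer, primary_key=True)
    title = Column(String, nullable=False)
    genre = Column(Enum(Genre), nullable=False)
    release_date = Column(DateTime, nullable=False)
    production_cost = Column(Numeric, nullable=False)  # US dollars
    total_revenue = Column(Numeric, nullable=False)    # US dollars
    director_id = Column(Integer, ForeignKey("directors.id"), nullable=False)
    director = relationship("Director", back_populates="movies")
    actors = relationship("Actor", secondary=movie_actors, back_populates="movies")
```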
GPT-engineer
Java
Overall, the code quality is satisfactory. In all 3 attempts, the generated application could be easily started with minor tweaks.
All data transfer objects (DTOs) were consistently generated exactly as described. The tool utilized an H2 Embedded database with configurations generated without requiring any adjustments to start the application.
Regarding unit tests, three out of four classes were generated in each attempt, with 2-3 tests inside each class. However, the tests were quite simple in terms of test coverage and functional coverage.
All mentioned CRUD endpoints were generated each time with very simple logic inside the 'controllers' (mainly just pointing to the appropriate service call). In one of the 3 attempts, Swagger config was even added for the REST endpoints (although not explicitly requested in the prompt).
Additionally, a health check endpoint was included with a simple “OK/NOK“ reply.
The generated code structure and quality closely resemble what one would expect from a run on https://start.spring.io/ (the only difference being that sometimes we needed to add some extra config to be able to run the GPT-engineer generated code).
Overall, the generated code is pretty good and GPT-engineer can be considered as a reliable tool to kick off a Java Spring-based project (provided the initial prompt is detailed – example above).
JavaScript
The overall code quality is good but many expected files are either blank or missing. In all 3 attempts GPT-engineer used the specified language and framework. Both the project structure and the code are well-organized into properly named directories, following common conventions in the Express framework.
In no attempt was an actual connection to the database coded. In two attempts, appropriate models were generated for the entities using the Sequelize library; one implementation was rather bare-bones, and only the other could be considered solid, containing the constraints from the requirements: auto-generated IDs, non-nullable fields, and relationships.
Only one attempt resulted in implementing the full logic for all CRUD operations on the resources. For the other two attempts, the logic was only implemented for one resource, while the controllers for the remaining resources were either blank or completely missing.
In two attempts input validation was added for the CREATE and UPDATE operations. Only one attempt included a basic pagination mechanism in endpoints listing the records. In all attempts, the health-check route was included.
Although test files were created in all three attempts, they were left empty, just containing a TODO comment informing that the tests should be added there.
Summing up, the number of missing files and unfulfilled requirements is significant. However, the existing parts of the code are syntactically and logically correct and well-structured. Combining the various parts of the requirements from different iterations on the prompt may serve as a good base for starting a new project.
Python
The overall code quality is quite good; some of the expected files are blank or missing, but noticeably fewer than in the projects generated in JavaScript. In all 3 attempts GPT-engineer used the specified language and framework. When it comes to the project structure, in two cases the code is well-organized into properly named directories, according to common conventions in the FastAPI framework. In one case, there was no structure whatsoever: the project code was roughly split into a few files, all placed in the top-level working directory.
Only in one attempt was an actual connection to the database coded. However, in all attempts appropriate entity schemas and database models were generated, including correct data types for entity properties, non-nullable constraints, auto-generated IDs, and relationships in the database models.
Similar to the JavaScript projects, only one attempt resulted in implementing the full logic for all CRUD operations on the resources. For the other two attempts, the logic was only implemented for one resource, while the controllers for the remaining resources were either blank or missing.
Once more, strikingly similar to the JavaScript projects, in two attempts input validation was added for the CREATE and UPDATE operations, with dedicated classes generated for the sole purpose of validating the inputs, separately for each operation. All attempts included a basic pagination mechanism in endpoints listing the records, and in all attempts the health-check route was included. Notably, most endpoints included error-handling mechanisms around the coded database interactions, even though this was not part of the prompt requirements.
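For a sense of what these generated pieces looked like in spirit, here is a minimal sketch assuming FastAPI and Pydantic (the stack targeted in these Python runs). The separate CREATE/UPDATE validation classes, offset-based pagination, and the /healthcheck route mirror the behavior described above, but all names are ours, not the tool's literal output.

```python
# Sketch of per-operation validation classes, pagination, and health check
# in FastAPI/Pydantic; illustrative names, not actual GPT-engineer output.
from datetime import datetime

from fastapi import FastAPI, Query
from pydantic import BaseModel

app = FastAPI()


class DirectorCreate(BaseModel):
    # CREATE payload: all fields required (non-nullable), as the prompt demands.
    first_name: str
    last_name: str
    date_of_birth: datetime


class DirectorUpdate(BaseModel):
    # Separate UPDATE payload class, mirroring the dedicated per-operation
    # validation classes described above.
    first_name: str | None = None
    last_name: str | None = None
    date_of_birth: datetime | None = None


@app.get("/directors")
def list_directors(offset: int = Query(0, ge=0), limit: int = Query(10, ge=1, le=100)):
    # Basic offset/limit pagination; a real implementation would query
    # the database here instead of returning an empty page.
    return {"offset": offset, "limit": limit, "items": []}


@app.get("/healthcheck")
def healthcheck():
    return {"status": "OK"}
```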
Test files were generated in all attempts. They were mostly empty; however, each contained either an example test for one of the routes of each resource or example tests for all operations on a single resource, leaving the rest to be filled in by the developer.
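For reference, a sketch of the kind of single example test those files contained, assuming pytest with FastAPI's TestClient; the "main" module path is hypothetical.

```python
# Illustration of the single example test the generated test files contained.
from fastapi.testclient import TestClient

from main import app  # "main" is a hypothetical entry module name

client = TestClient(app)


def test_healthcheck_returns_ok():
    # The generated suites typically stopped at one such example,
    # leaving the remaining routes for the developer to cover.
    response = client.get("/healthcheck")
    assert response.status_code == 200
    assert response.json() == {"status": "OK"}
```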
In summary, the overall number of missing files and unfulfilled requirements is not negligible, although more tolerable than in the case of the projects generated in JavaScript. The existing parts of the code are syntactically and logically correct and, for the most part, well-structured. Combining the various parts of the requirements from different iterations on the prompt may serve as a good base for starting a new project.
Summary for GPT-engineer
Results of this research suggest that when it comes to creating a REST API, GPT-engineer handles the task best with the use of Java programming language, followed by Python, and JavaScript in the last place. It is worth noting that the differences in the code quality were not striking. In all cases, the generated codebases required at least a few corrections, in some cases even manually adding some missing files or parts of the code, based on the examples generated by GPT-engineer. Fortunately, such a process of “filling in the gaps” can be done very easily thanks to the generated project structure and examples, and can be almost seamless with the help of a tool like GitHub Copilot, which can suggest the missing pieces based on those already existing. In all of the tested languages, GPT-engineer can help develop a solid base for a new project.
AutoGPT
General note for AutoGPT: by design, this tool is highly interactive and generates the best quality code when a human developer is actively engaged in the development cycle, providing feedback to the AI agent after each step of the repository generation. However, it is possible to accept upfront all suggestions generated by AutoGPT, making it an autonomous tool. The second approach was used in this research. Even though this may lead to less satisfactory results in code quality, this research aims to reveal the performance differences of the tool resulting solely from the programming language / framework it is requested to use. Any human input would make the results useless for that purpose. While reviewing the notes below, the reader should keep in mind that this tool has much more potential when used in the first, cooperative mode.
Java
In general, the code quality is OK. In all 3 attempts, the generated application could be started easily with simple tweaks.
The biggest issue we encountered on every attempt is that AutoGPT always tries to use the Maven CLI as a code generation tool; if Maven is not installed and accessible in the terminal, it enters a loop and, without explicit feedback, won't exit it. To exit the loop, you either need to point it to a Maven installation tutorial (which also has some limitations) or point it to https://start.spring.io/ and extract the initial skeleton from there.
All data transfer objects (DTOs) were generated in all attempts exactly as described. The H2 embedded database was used, and its configurations were generated without requiring adjustments to start the application.
When it comes to unit tests, 1 or 2 classes were generated in every attempt, with 2 or 3 tests inside each class. The tests were quite simple in terms of test coverage and functional coverage.
All mentioned CRUD endpoints were generated every time with very simple logic inside the 'controllers' (mainly just pointing to the appropriate service call).
In 2 of the 3 attempts, the health check endpoint was generated with a simple “OK/NOK“ reply, in 1 attempt it was simply skipped.
The generated code structure and quality closely resemble that of a https://start.spring.io/ run (the only difference being that sometimes we need to add some extra config to be able to run the AutoGPT generated code).
All in all, the generated code is pretty good, and AutoGPT can be considered a recommendable tool to ramp up a Java Spring-based project (provided the initial prompt is as explicit as possible – see the example above – and that we follow up with feedback rather than only answering 'y').
JavaScript
The general code quality is OK; however, the overall performance is mediocre due to large amounts of missing or logically disconnected code. In all 3 attempts AutoGPT used the specified language and framework. When it comes to the project structure, in two attempts the code was roughly organized into properly named directories and files, according to common conventions in the Express framework. In the third attempt, the tool barely generated any code at all – it was literally two files, one for the server setup and the other one for tests, both almost empty.
Only in 1 attempt was an actual connection to the database coded successfully; even then, although the appropriate file for setting up the database was included, it was ultimately not used in the entry file. In another attempt, an SQL file with a database setup was generated, also unused anywhere further in the codebase. In contrast to GPT-engineer, no appropriate models or entity files were generated in any attempt. Also, only one of the generated projects included a package.json file with the required dependencies.
Two attempts resulted in implementing the full logic for all CRUD operations on the resources. For the other attempt, the logic was completely omitted (TODO comments). It is worth noting that 1 attempt included error-handling mechanisms around the coded interactions with a database in all endpoints, even though it was not a part of the requirements in the prompt.
No input validation was generated for the CREATE and UPDATE operations in any attempt. Only 1 attempt included a basic pagination mechanism in endpoints listing the records. In all attempts, the health-check route was included.
In 2 attempts test files were created and included an example test for a single route on each of the resources, leaving the rest to be implemented by the developer.
Summing up, the number of missing files and unfulfilled requirements is large. The existing parts of the code are syntactically correct and, for the most part, also logically correct, even if not always properly connected into working software. In this state, creating a solid project base by combining the various parts of the requirements from different iterations could pose a challenge, and the required time and effort might be disproportionate to the benefit.
Python
The general code quality generated for this task is by far the worst in this research. The only requirement fulfilled in all 3 attempts was using the Python language. In all cases, the project structure was very poor: all generated projects consisted of a few files placed in the base directory, only roughly organizing the code into a few logical subparts. Despite the theoretically correct syntax of the generated code, many files, imports, and pieces of logic connecting the existing files into a coherent project were missing.
In 2 iterations, there was no trace of any attempt to use a database for data persistence: the API data was saved in lists placed directly in the entry file and would be erased once the program execution finished. Only in 1 attempt was an actual connection to the database coded successfully; however, in that same attempt, no actual endpoints were created to interact with the API. In all iterations some entity models were generated, but in all cases they were incorrect or incomplete.
Only 1 attempt resulted in implementing the full logic for all CRUD operations on the resources; however, as mentioned before, a list in the entry file was used for their storage. In another attempt, the only generated endpoint was the health check. The last one failed to generate any endpoint at all.
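To illustrate the pattern described above, here is a minimal sketch (ours, not AutoGPT's literal output) of in-memory list "persistence" in the entry file, assuming FastAPI as specified for the Python runs; all data vanishes when the process exits, which is exactly why this fails the prompt's persistence requirement.

```python
# Sketch of the in-memory storage anti-pattern observed in the generated code.
from fastapi import FastAPI

app = FastAPI()

# "Persistence": a module-level list that lives only as long as the process.
movies: list[dict] = []


@app.post("/movies")
def create_movie(movie: dict):
    movie["id"] = len(movies) + 1  # imitation of a database-generated ID
    movies.append(movie)
    return movie


@app.get("/movies")
def list_movies():
    return movies
```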
No input validation was generated for the CREATE and UPDATE operations in any attempt. No pagination mechanism was generated. A health-check route was included in two attempts.
No meaningful test files were generated in any attempt.
Summing up, the quality of the generated code is way below expectations.
Summary for AutoGPT
Results of this research suggest that when it comes to creating a REST API, AutoGPT performs differently depending on the programming language used. With Java, the overall project quality was quite good and only required a few corrections before being used as a new project base. The projects generated with JavaScript were of noticeably worse quality, leaving the developer much more work in order to create a solid project from the generated content. Quite surprisingly, the codebase generated with the use of Python was definitely the worst and could not be used even as a blueprint for a good project base.
Once again, the reader should keep in mind that the results obtained in this autonomous mode were expected to be much less useful and satisfying than those achievable with AutoGPT used interactively, and served only to point out the differences in the tool's performance based on the programming language used.
Results and implications: GPT-engineer is the top AI tool for project setup
Both GPT-engineer and AutoGPT stood out in generating good or excellent quality code, but GPT-engineer is the one we recommend for project setup. It proved to have the best overall workflow, and its iterative nature allows for efficient refinement of code through prompt adjustments.
Across the board, GPT-engineer exhibited very good ease of use and provided a top-level user experience, producing impressive code complexity with minimal user input.
It demonstrated proficiency across different programming languages, particularly excelling with Java.
It’s a useful tool for speeding up project setup.
Moving forward, you can leverage these findings to streamline your software development processes, with GPT-engineer serving as a dependable foundation for project initialization.
For best results, remember to:
- Ensure detailed and explicit prompts for AI-powered tools to enhance the quality of generated code. Clear prompts can guide the AI agents to produce more accurate results.
- Consider combining the strengths of multiple AI tools to compensate for individual tool limitations. For instance, you might want to use GPT-engineer for initial setup and AutoGPT for fine-tuning and optimization.
- Continuously evaluate and update AI tool usage based on performance and advancements in the tools themselves. This ensures that the selected tools align with project requirements and evolving technology landscapes.
- Maintain human oversight throughout the code generation process, especially in autonomous modes, to address any discrepancies and ensure the generated code meets project standards.
Deploying AI in business is a strategic move towards optimizing operational efficiency and driving innovative solutions.
While AI-powered tools show promise in expediting project setup, careful consideration of their capabilities, modes of operation, and integration with human input is essential for achieving satisfactory results across different programming languages.