Automated Test Creation with GPT-Engineer: A Comparative Experiment

Updated Mar 1, 2025 • 19 min read

GPT-Engineer promises to generate automated tests with minimal human intervention. However, like any AI solution, the quality of its output depends greatly on how well it’s guided.

To explore this, I set up an experiment to find the most effective way to use GPT-Engineer for automated test creation. My goal was to determine which test case format yields the most reliable results: Mermaid diagrams, Behavior-Driven Development (BDD) scenarios, or standard test cases. I’ll analyze the process, evaluate the tool's performance, and share recommendations to help you get the most out of GPT-Engineer’s capabilities.

Research focus: Finding the optimal test case format

The main goal of my research was to identify which test case format would allow GPT-Engineer to generate the most accurate and reliable automated tests. I tested three formats: Mermaid diagrams, Behavior-Driven Development (BDD) scenarios, and standard test cases. In each case, GPT-Engineer was asked to use TypeScript and Playwright to automate user interactions in a web application, such as logging in, adding items to the cart, and completing a purchase. The generated tests were then evaluated against three key objectives.

Key Objectives:

Accuracy: How precise and reliable are the tests generated by GPT-Engineer? Do they capture the intended functionality without errors?
Efficiency: Does GPT-Engineer save time and effort compared to manually writing tests?
Effectiveness: How well do the generated tests meet industry standards in terms of quality and completeness?

Test case formats explored

Mermaid

Mermaid is a text-based syntax for describing flowcharts and other diagrams, which makes it a compact, visual way to represent workflows and processes. In this experiment, I used Mermaid diagrams to map out user flows for a web application, expecting GPT-Engineer to translate these visual workflows into functional test scripts.

Automate all the scenarios listed below as mermaid diagrams with “TypeScript” and “Playwright”:

 graph TD

 %% Login Functionality
 subgraph "1. Login Functionality"
 direction TB
 %% Standard User Login
 A1[Go to  --> A2[Enter Username: standard_user]

 A2 --> A3[Enter Password: secret_sauce]
 A3 --> A4[Click Login Button]
 A4 --> A5{Is User Logged in?}
 A5 --> |Yes| A6[Redirect to Inventory Page]
 A5 --> |No| A7[Display Error: Invalid Credentials]

 %% Locked Out User Login
 B1[Go to  --> B2[Enter Username: locked_out_user]
 B2 --> B3[Enter Password: secret_sauce]
 B3 --> B4[Click Login Button]
 B4 --> B5[Display Error: Sorry, this user has been locked out.]

 %% Problem User Login
 C1[Go to  --> C2[Enter Username: problem_user]
 C2 --> C3[Enter Password: secret_sauce]
 C3 --> C4[Click Login Button]
 C4 --> C5[Redirect to Inventory Page]
 C5 --> C6[Check if Images/Functionality Are Broken]

 %% Performance Glitch User Login
 D1[Go to  --> D2[Enter Username: performance_glitch_user]
 D2 --> D3[Enter Password: secret_sauce]
 D3 --> D4[Click Login Button]
 D4 --> D5[Experience Login Delay]
 D5 --> D6[Redirect to Inventory Page]

 %% Invalid User Login
 E1[Go to  --> E2[Enter Username: invalid_user]
 E2 --> E3[Enter Password: invalid_password]
 E3 --> E4[Click Login Button]
 E4 --> E5[Display Error: Username and password do not match any user in this service.]

 %% Username Missing
 F1[Go to  --> F2[Leave Username Empty]
 F2 --> F3[Enter Password: secret_sauce]
 F3 --> F4[Click Login Button]
 F4 --> F5[Display Error: Username is required.]

 %% Password Missing
 G1[Go to  --> G2[Enter Username: standard_user]
 G2 --> G3[Leave Password Empty]
 G3 --> G4[Click Login Button]
 G4 --> G5[Display Error: Password is required.]

 %% Product List Display
 H1[Logged in as any user] --> H2[Navigate to Inventory Page]
 H2 --> H3{Is Product List Displayed?}
 H3 --> |Yes| H4[Verify Product List Displays Correctly]
 H3 --> |No| H5[Report Missing/Error in Product List]

 %% Add to Cart Functionality
 I1[Inventory Page - Click Add to Cart on any product] --> I2[Navigate to Cart]
 I2 --> I3{Is Product in Cart?}
 I3 --> |Yes| I4[Verify Product Appears in Cart]
 I3 --> |No| I5[Report Product Missing/Error]

 %% Remove from Cart Functionality
 J1[Cart - Click Remove Button on any product] --> J2[Navigate to Cart]
 J2 --> J3{Is Product Removed from Cart?}
 J3 --> |Yes| J4[Verify Product Successfully Removed]
 J3 --> |No| J5[Report Product Removal Failed/Error]

 %% View Cart Contents
 K1[Click Cart Icon] --> K2{Are Cart Contents Displayed?}
 K2 --> |Yes| K3[Verify Cart Displays Selected Products]
 K2 --> |No| K4[Report Cart Contents Missing/Error]

 %% Proceed to Checkout
 L1[Cart Page - Click Checkout Button] --> L2[Redirect to Checkout Information Page]

 %% Successful Checkout
 M1[Checkout Info - Enter First Name: John] --> M2[Enter Last Name: Doe]
 M2 --> M3[Enter Zip/Postal Code: 12345]
 M3 --> M4[Click Continue Button]
 M4 --> M5[Review Order]
 M5 --> M6[Click Finish Button]
 M6 --> M7[Verify "Thank You for Your Order" Confirmation Message]

 %% Checkout with Missing Information
 N1[Checkout Info - Leave Required Fields Empty] --> N2[Click Continue Button]
 N2 --> N3[Display Error: Fields Are Required]

 O1[Menu - Click Logout Button] --> O2[Redirect to Login Page]

 %% Invalid Login
 P1[Login - Enter Invalid Credentials] --> P2[Click Login Button]
 P2 --> P3[Display Error: Username and Password Do Not Match]

 %% Missing Checkout Information
 Q1[Checkout Info - Leave Required Fields Empty] --> Q2[Click Continue Button]
 Q2 --> Q3[Display Error: Fields Are Required]

In the first case, GPT-Engineer did not deliver the expected results. While it successfully generated some page objects and test files, it only included a few of the required test cases. Instead of covering all the scenarios as mapped out, it inserted placeholder comments like, “// ... Other add to cart tests will follow the same pattern ...” or “// ... Other login tests will follow the same pattern ...”. This meant that large portions of the flow were left incomplete, requiring manual intervention to finish the tests.

 test.describe('Login Functionality', () => {
   let loginPage: LoginPage;
   let inventoryPage: InventoryPage;

   test.beforeEach(async ({ page }) => {
     loginPage = new LoginPage(page);
     inventoryPage = new InventoryPage(page);
     await loginPage.visit('/');
   });

   test('Standard User Login', async () => {
     await loginPage.login('standard_user', 'secret_sauce');
     expect(await inventoryPage.isProductListDisplayed()).toBeTruthy();
   });

   test('Locked Out User Login', async () => {
     await loginPage.login('locked_out_user', 'secret_sauce');
     expect(await loginPage.getErrorMessage()).toContain('locked out');
   });

   // ... Other login tests will follow the same pattern ...
 });
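
The login flows that the placeholder comment skipped are mostly the same steps with different data, so one way to fill the gap by hand is a table-driven helper. The sketch below is an illustration under stated assumptions, not GPT-Engineer output: `LoginPageLike` mirrors only the `login` and `getErrorMessage` calls visible in the generated snippet, `checkLoginError` is a hypothetical helper, and the expected message fragments are copied from the Mermaid prompt rather than verified against the site.

```typescript
// Narrow view of the generated LoginPage: only the two calls used in the snippet above.
// Assumption: getErrorMessage() resolves to the visible error text.
interface LoginPageLike {
  login(username: string, password: string): Promise<void>;
  getErrorMessage(): Promise<string>;
}

interface LoginErrorCase {
  title: string;
  username: string;
  password: string;
  expectedFragment: string; // taken from the Mermaid prompt, not verified against the site
}

// The negative login flows from the diagram, expressed as data.
const loginErrorCases: LoginErrorCase[] = [
  { title: 'Locked Out User Login', username: 'locked_out_user', password: 'secret_sauce', expectedFragment: 'locked out' },
  { title: 'Invalid User Login', username: 'invalid_user', password: 'invalid_password', expectedFragment: 'do not match' },
  { title: 'Username Missing', username: '', password: 'secret_sauce', expectedFragment: 'Username is required' },
  { title: 'Password Missing', username: 'standard_user', password: '', expectedFragment: 'Password is required' },
];

// Hypothetical helper: runs one case and reports whether the expected text appeared.
async function checkLoginError(loginPage: LoginPageLike, c: LoginErrorCase): Promise<boolean> {
  await loginPage.login(c.username, c.password);
  const message = await loginPage.getErrorMessage();
  return message.includes(c.expectedFragment);
}
```

In a Playwright spec, a `for (const c of loginErrorCases)` loop around `test(c.title, ...)` would register one test per row, so adding a new login scenario becomes a one-line change.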

In the checkout process, GPT-Engineer added a comment saying, “// Assuming the user is already logged in and on the inventory page,” even though a login function was explicitly defined and should have been used. This assumption led to incomplete tests and missed steps in the flow. Overall, the results were underwhelming. While GPT-Engineer managed to generate some structure, it fell short in delivering a complete and accurate set of tests.


 test.beforeEach(async ({ page }) => {
   // Assuming the user is already logged in and on the inventory page
   inventoryPage = new InventoryPage(page);
   cartPage = new CartPage(page);
   await page.goto('/inventory.html');
 });
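
Rather than assuming a logged-in session, the setup can perform the login itself. What follows is a minimal sketch under stated assumptions: `loginAs` is a hypothetical helper, `PageLike` and `LocatorLike` declare only the slice of Playwright's `Page` and `Locator` that it calls (so the example stays self-contained), and the `#user-name`, `#password`, and `#login-button` selectors are assumed rather than taken from the generated code.

```typescript
// Only the parts of Playwright's Locator and Page that this helper calls.
interface LocatorLike {
  fill(value: string): Promise<void>;
  click(): Promise<void>;
}
interface PageLike {
  goto(url: string): Promise<unknown>;
  locator(selector: string): LocatorLike;
}

// Hypothetical helper: log in through the UI so later steps start from a real session.
// The selectors are assumptions about the login form's markup.
async function loginAs(page: PageLike, username: string, password: string): Promise<void> {
  await page.goto('/'); // relies on a configured base URL, like the relative paths in the generated code
  await page.locator('#user-name').fill(username);
  await page.locator('#password').fill(password);
  await page.locator('#login-button').click();
}
```

Calling `await loginAs(page, 'standard_user', 'secret_sauce')` at the top of `beforeEach` could then stand in for the direct `page.goto('/inventory.html')`, so the cart and checkout tests exercise the same path a real user takes.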

Behavior-Driven Development (BDD) Scenarios

Behavior-Driven Development (BDD) is a testing approach that focuses on specifying the behavior of software through scenarios written in plain language. These scenarios describe how users interact with the application, which makes BDD an ideal format for bridging the gap between non-technical stakeholders and developers. In this experiment, I used BDD scenarios to outline key user interactions in the web application.

Create automated tests base on the test cases written below in BDD. Test on "https://www.saucedemo.com/". Use "Playwright", "Typescript" and follow the test cases:

Test Case #1: User Login, Product Selection, and Checkout Process

Feature: User can log in, select products, add them to the cart, and complete the checkout process successfully.

Scenario: User successfully logs in, selects products, and completes checkout
   Given the user is on the login page
   When the user enters a valid username "standard_user"
   And the user enters a valid password "secret_sauce"
   And the user clicks on the login button
   Then the user should be redirected to the products page

Scenario: User selects products and adds them to the cart
   Given the user is on the products page
   When the user browses the list of products
   And the user selects a product by clicking the "Add to cart" button
   And the user selects another product and adds it to the cart
   Then the cart icon should show the correct number of items

Scenario: User views cart and proceeds to checkout
   Given the user has added products to the cart
   When the user clicks on the shopping cart icon
   Then the user should see all added products listed in the cart
   When the user clicks on the "Checkout" button
   Then the user should be taken to the checkout information page

Scenario: User enters checkout information
   Given the user is on the checkout information page
   When the user enters First Name, Last Name, and Postal Code
   And the user clicks the "Continue" button
   Then the user should be taken to the overview page with a summary of the order

Scenario: User completes the purchase
   Given the user is on the order overview page
   When the user reviews the order summary
   And the user clicks the "Finish" button
   Then the user should see a confirmation message indicating the completion of the purchase

Scenario: User logs out
   Given the user has completed the purchase
   When the user clicks on the menu button
   And the user selects "Logout"
   Then the user should be redirected to the login page


Test Case #2: Add and Remove Item from Cart

Feature: User can add and remove items from the cart, and the UI correctly reflects the cart's state.

Scenario: User adds an item to the cart and removes it
   Given the user is on the products page
   When the user selects a product by clicking the "Add to cart" button
   Then the cart icon should update to show "1" item
 
   When the user clicks on the cart icon
   And the user clicks the "Remove" button next to the product
   Then the product should be removed, and the cart icon should show "0" items

Scenario: UI correctly shows the number of items in the cart
   Given the user has removed all items from the cart
   When the user navigates back to the products page
   Then the cart icon should consistently show "0" items across different pages


Test Case #3: Add Multiple Items and Complete Checkout

Feature: User can add multiple different items to the cart, complete the checkout, and verify the cart is empty after checkout.

Scenario: User adds multiple items to the cart and completes the checkout
   Given the user is on the products page
   When the user selects multiple products by clicking on their "Add to cart" buttons
   Then the cart icon should update to show the correct number of added items

   When the user clicks on the cart icon
   Then all selected products should be listed in the cart

   When the user clicks on the "Checkout" button
   And the user enters First Name, Last Name, and Postal Code on the checkout information page
   And the user clicks the "Continue" button
   Then the user should be taken to the order overview page with a summary of the selected items

   When the user reviews the order summary
   And the user clicks the "Finish" button
   Then a confirmation message should be displayed indicating the completion of the purchase

   When the user clicks on the "Back to Products" button if available
   Then the cart icon should show "0" items, indicating that the checkout process was completed, and the cart is now empty

In this case, the results were even more disappointing, suggesting that BDD might not be the best fit for GPT-Engineer. The tool struggled to interpret the BDD scenarios correctly, managing to implement only the first scenario while leaving the rest incomplete. Instead of generating full test scripts, it inserted placeholder comments like, “// Additional scenarios would follow the same pattern...” and “// Implement each scenario following the same pattern as the previous file...”. This left significant gaps in the test coverage and required manual effort to fill in the missing scenarios.

import { test, expect } from '@playwright/test';

test.describe('Add and Remove Item from Cart', () => {
 // Implement each scenario following the same pattern as the previous file...
 });

In addition to incomplete scenarios, GPT-Engineer failed to include essential elements like page objects and Playwright configuration. This made the output difficult to use and would have required significant refactoring. Overall, this was the least effective result, and it would have been easier to create the tests manually rather than using GPT-Engineer with the BDD approach.

Standard test cases

Standard test cases follow a structured format, detailing clear steps, expected results, and specific objectives for each test. This method provides a straightforward, step-by-step guide for testing, making it easy to outline user actions and expected outcomes. The goal was to see how well GPT-Engineer could interpret these highly detailed, step-by-step instructions and turn them into automated test scripts.

Create automated tests base on the test cases written below. Test on "https://www.saucedemo.com/". Use "Playwright", "Typescript" and follow the test cases:

Test case #1
Objective: Verify that a user can log in, select products, add them to the cart, and complete the checkout process successfully.
Precondition:
 - The tester has valid login credentials.
 - The application is up and running.
Test Steps:
 - Login:
Navigate to "https://www.saucedemo.com/".
Enter a valid username (e.g., "standard_user").
Enter the corresponding password (e.g., "secret_sauce").
Click on the login button.
Verify redirection to the products page.
 - Select Products:
Browse through the list of products.
Select a product by clicking on its "Add to cart" button.
Select another product and add it to the cart as well.
Verify that the cart icon shows the correct number of items.
 - View Cart and Checkout:
Click on the shopping cart icon to view the selected  products.
Verify that all added products are listed in the cart.
Click on the "Checkout" button.
 - Enter Checkout Information:
Enter First Name, Last Name, and Postal Code in the input fields.
Click on the "Continue" button.
Verify that the user is taken to the overview page with a summary of the order.
 - Complete Purchase:
Review the order summary.
Click on the "Finish" button.
Verify that a confirmation message is displayed, indicating the completion of the purchase.
 - Logout:
Click on the menu button on the top left corner.
Select "Logout".
Verify redirection to the login page.
 - Expected Result:
The user is able to successfully log in, select products, view them in the cart, provide necessary information during checkout, complete the purchase, and finally log out. A confirmation message regarding the successful purchase is displayed at the end of the process.

Test case #2
Objective:
Verify that a user can add an item to the cart, remove it, and confirm that the UI correctly reflects the number of items in the cart.

Preconditions:
The tester has valid login credentials.
The application is up and running.
The user is logged in and on the products page.
Test Steps:
1. Add an Item to the Cart:
Select a product by clicking on its "Add to cart" button.
Verify: The cart icon updates to show "1" item.
2. Remove the Item from the Cart:
Click on the cart icon to navigate to the cart page.
Click on the "Remove" button next to the product.
Verify: The product is removed, and the cart icon shows "0" items.
3. Verify UI Updates:
Navigate back to the products page.
Verify: The cart icon consistently shows "0" items across different pages.
Expected Result:
The UI correctly displays the number of items in the cart after adding and removing an item, showing "0" when the cart is empty.


Test case #3

Objective:
Verify that a user can add multiple different items to the cart, proceed to checkout, complete the purchase, and verify the checkout process is successful.

Preconditions:
The tester has valid login credentials.
The application is up and running.
The user is logged in and on the products page.
Test Steps:
1. Add Multiple Items to the Cart:
Select several products by clicking on their "Add to cart" buttons.
Verify: The cart icon updates to show the correct number of added items.
2. View Cart:
Click on the cart icon to view the selected products.
Verify: All selected products are listed in the cart.
3. Proceed to Checkout:
Click on the "Checkout" button.
Enter First Name, Last Name, and Postal Code in the checkout information page.
Click on the "Continue" button.
Verify: The user is taken to the order overview page with a summary of the selected items.
4. Complete Purchase:
Review the order summary.
Click on the "Finish" button.
Verify: A confirmation message is displayed, indicating the completion of the purchase.
5. Verify Checkout Completion:
Click on the "Back to Products" button if available.
Verify: The cart icon shows "0" items, indicating that the checkout process was completed, and the cart is now empty.
Expected Result:
The user is able to add multiple different items to the cart, proceed to checkout, successfully complete the purchase, and the cart is empty after checkout completion. A confirmation message regarding the successful purchase is displayed at the end of the process.

The standard test cases approach yielded the most satisfying results. GPT-Engineer successfully generated the necessary page objects and used them effectively in the code. It covered all the required steps and checks, producing tests that required only minimal refactoring. However, achieving this level of accuracy depended heavily on how precise the input was—clear, detailed descriptions of the test objectives, steps, and verification points were essential for success. The more specific the input, the better the output from GPT-Engineer.

That said, GPT-Engineer still relied on older patterns such as page.waitForSelector(), which Playwright discourages in favor of locators and web-first assertions, so some additional refactoring was needed. Despite these minor issues, the standard test cases approach proved to be the most reliable, with only a few tweaks needed to bring the generated tests up to the desired level.

 async verifyRedirectionToProducts() {
   await this.page.waitForSelector('.inventory_list');
 }
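
One possible refactoring replaces the page-level wait with a locator-based one. This is a sketch, not the final code from the experiment: `ProductsPage` is a hypothetical page object, and `WaitableLocator` and `LocatorSource` declare only the Playwright calls it needs so the example stays self-contained. In a spec file, the web-first assertion `await expect(page.locator('.inventory_list')).toBeVisible()` from `@playwright/test` is usually the more idiomatic choice.

```typescript
// Only the parts of Playwright's Locator and Page that this page object calls.
interface WaitableLocator {
  waitFor(options?: { state?: 'attached' | 'detached' | 'visible' | 'hidden' }): Promise<void>;
}
interface LocatorSource {
  locator(selector: string): WaitableLocator;
}

// Hypothetical page object; the class name is an assumption for illustration.
class ProductsPage {
  private readonly page: LocatorSource;

  constructor(page: LocatorSource) {
    this.page = page;
  }

  // Locator-based replacement for page.waitForSelector('.inventory_list').
  async verifyRedirectionToProducts(): Promise<void> {
    await this.page.locator('.inventory_list').waitFor({ state: 'visible' });
  }
}
```

Because the page object only depends on `locator()`, a real Playwright `Page` can be passed in unchanged, and the method can also be exercised against a small fake in a unit test.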

Common problems encountered

Throughout the experiment, several issues were identified across different test case formats.

With Mermaid diagrams, GPT-Engineer failed to generate all the required tests. It often inserted comments suggesting that additional tests should follow the same pattern, rather than completing them. It also made incorrect assumptions, such as presuming users were already logged in, instead of using the provided login functions.

In the BDD scenarios, GPT-Engineer performed similarly, only implementing a few tests and leaving placeholder comments for the rest. It also neglected to add critical elements like page objects and the necessary Playwright configuration, making the output incomplete.

The standard test cases approach was the most reliable but still had its flaws. GPT-Engineer used older patterns such as page.waitForSelector(), which Playwright discourages in favor of locators and web-first assertions, so the output required refactoring to align with modern best practices.

Common issues across all approaches included the use of outdated library versions and reliance on discouraged functions, particularly for assertions. GPT-Engineer also failed to fully leverage built-in methods for locators and assertions, resulting in less-than-optimal test code.

Recommended steps for creating effective test cases with GPT-Engineer

To get the best results from GPT-Engineer, it’s important to provide clear and detailed input. Below are some key guidelines to follow when creating test cases:

Be precise: Clearly specify what you want to test. The more specific your instructions, the more accurate the generated tests will be.
Add a meaningful summary: Include a concise summary that describes the test’s objective, which will be used as the test description.
Define detailed test steps: Outline each step of the test in detail, leaving no room for ambiguity.
Specify assertions: Clearly state what needs to be verified at each stage of the test to ensure accurate results.
Break down larger flows: Divide complex processes into smaller, manageable parts to help GPT-Engineer follow the flow correctly.
Define expected results: Be explicit about the desired outcomes of the test.
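
The guidelines above can be made concrete by treating a test case as structured data and rendering the prompt text from it. The sketch below is illustrative only: `TestCaseSpec`, `findGaps`, and `renderPrompt` are hypothetical names, and the rendered layout merely imitates the standard test cases used in this experiment.

```typescript
interface TestStep {
  action: string;
  verify?: string; // what must be asserted after the action
}

interface TestCaseSpec {
  objective: string;
  preconditions: string[];
  steps: TestStep[];
  expectedResult: string;
}

// Report missing pieces before the test case is handed to the tool.
function findGaps(spec: TestCaseSpec): string[] {
  const gaps: string[] = [];
  if (spec.objective.trim() === '') gaps.push('objective is empty');
  if (spec.expectedResult.trim() === '') gaps.push('expected result is empty');
  if (spec.steps.length === 0) {
    gaps.push('no test steps');
  } else if (!spec.steps.some((s) => s.verify !== undefined && s.verify.trim() !== '')) {
    gaps.push('no step states what to verify');
  }
  return gaps;
}

// Render the spec in the objective / preconditions / steps / expected result layout.
function renderPrompt(spec: TestCaseSpec): string {
  const lines: string[] = [`Objective: ${spec.objective}`, 'Preconditions:'];
  for (const p of spec.preconditions) lines.push(` - ${p}`);
  lines.push('Test Steps:');
  spec.steps.forEach((step, i) => {
    lines.push(`${i + 1}. ${step.action}`);
    if (step.verify) lines.push(`Verify: ${step.verify}`);
  });
  lines.push('Expected Result:', spec.expectedResult);
  return lines.join('\n');
}
```

Running `findGaps` first gives a quick signal that a test case is too vague to produce useful output; only specs with an empty result would be rendered and passed on.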

Post-creation steps:

Review tests: After generating the tests, review them to ensure they align with your expectations.
Rerun GPT-Engineer if needed: Running GPT-Engineer multiple times on the same prompt can sometimes yield better results, as the output may vary with each run.
Update and refactor: Make sure to update outdated libraries, refactor discouraged methods, and optimize helper functions to ensure all tests pass and meet modern coding standards.
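
Part of the review step can be automated, since the incomplete outputs in this experiment all contained a recognizable placeholder comment. Here is a rough sketch, assuming the generated source is available as a string: `findPlaceholderComments` is a hypothetical helper, and its pattern is based only on the phrases quoted in this article, so it will miss placeholders that are worded differently.

```typescript
// Matches line comments such as "// ... Other login tests will follow the same pattern ..."
// or "// Implement each scenario following the same pattern as the previous file...".
const placeholderPattern = /\/\/.*\bfollow(?:ing)?\s+the\s+same\s+pattern\b/i;

interface PlaceholderHit {
  line: number; // 1-based line number in the generated source
  text: string; // the trimmed comment line
}

// Scan generated source line by line and collect placeholder comments.
function findPlaceholderComments(source: string): PlaceholderHit[] {
  const hits: PlaceholderHit[] = [];
  source.split(/\r?\n/).forEach((text, index) => {
    if (placeholderPattern.test(text)) {
      hits.push({ line: index + 1, text: text.trim() });
    }
  });
  return hits;
}
```

A non-empty result is a reasonable trigger for rerunning GPT-Engineer or for completing the missing tests by hand before spending time on a detailed review.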

Conclusion: Is GPT-Engineer worth using for test automation?

While GPT-Engineer can significantly streamline test creation, it requires careful guidance. The more specific you are about what needs to be tested and verified, the better the results will be. For those willing to invest the time in crafting detailed test cases, GPT-Engineer can be a valuable tool. However, for less structured formats like Mermaid diagrams or BDD scenarios, the tool may fall short, requiring more manual intervention to complete the tests.

In conclusion, standard test cases proved to be the most reliable approach for test automation when working with GPT-Engineer. However, success with this tool hinges on how precise and detailed the input is. GPT-Engineer does not add extra assertions on its own, and where the instructions leave gaps it may skip steps or fall back on assumptions, as it did with the login precondition in the checkout tests. As a result, the quality of the generated tests is directly tied to the clarity and thoroughness of the test case definitions.