What Makes a Good Prompt for Automated Testing? Key Insights from GPT-Engineer Research
Recent advancements in artificial intelligence, particularly with tools like GPT-engineer, have introduced new possibilities for generating automated test cases. However, crafting prompts that yield robust and accurate test cases involves more than just technical skill—it requires an understanding of the subtleties of AI-driven prompt design.
This research builds on our previous study of GPT-engineer’s role in quality assurance, specifically its application in creating functional automated tests for UI and API scenarios.
The initial findings highlighted GPT-engineer’s potential to streamline and improve automated testing, with a critical insight: the success of generated test cases hinges on the quality of the prompt.
In this study, we explore what makes an effective prompt, analyzing different structures and strategies to optimize GPT-engineer’s output for UI test cases.
Methodology for evaluating prompt effectiveness
To investigate GPT-engineer's capabilities in generating effective UI tests, this experiment evaluated various prompt structures for creating Cypress-based test cases. Focusing on UI test generation, we designed structured prompts around individual test cases to address the variability observed in previous outcomes.
Three main types of scenarios guided the experiment: happy paths (standard success flows), unhappy paths (error or edge cases), and screen-based prompts focused on specific UI elements. Each prompt was designed to test the AI’s ability to produce useful tests with minimal intervention, with test cases generated from scratch in every trial to ensure consistency.
The goals of this experiment were to answer three primary questions:
- What makes a prompt ‘good’ for generating UI tests with GPT-engineer?
- How detailed does the input need to be for GPT-engineer to produce accurate functional tests?
- Can useful tests be generated without overly explicit instructions, or does the AI require step-by-step guidance?
Guide & analysis: Experiment steps and observations
This section outlines the steps taken during the experiment and the key observations from testing various prompts.
To assess how GPT-engineer responds to different prompt styles, we used multiple structured prompts to generate test cases. This analysis highlights the prompts that produced the best results, as well as those that proved most challenging.
Happy paths vs. unhappy paths
GPT-engineer generally demonstrates an understanding of the distinction between “happy” (standard success flows) and “unhappy” (error or edge cases) paths. When explicitly prompted to cover unhappy paths, the generated test cases were notably promising.
Compared to phrasing such as “positive/negative tests,” using “happy/unhappy paths” in prompts led to more accurate test case generation. For example:
Create automated UI tests for " Swag Labs " with the use of Cypress and Page Object Pattern. Use the latest Cypress version. Use the correct folder/file structures and naming conventions. Make sure to include the necessary configuration files. Create README file with instructions how to set up the project, how to run the tests and also information which test cases have been covered by automated tests. Generate random test data if needed. The tests should cover the following steps: log in, add products to cart, go to checkout, complete checkout, log out. Cover happy and unhappy paths. Check necessary validations and add useful assertions. |
With this prompt, GPT-engineer generated specs covering both happy and unhappy paths. Below is an example of the output:
describe('Login Tests', () => {
beforeEach(() => {
loginPage.visit();
});
it('should log in with valid credentials', () => {
loginPage.login('standard_user', 'secret_sauce');
cy.url().should('include', '/inventory.html');
});
it('should show error message with invalid credentials', () => {
loginPage.login('invalid_user', 'invalid_password');
loginPage.getErrorMessage().should('be.visible').and('contain', 'Username and password do not match');
});
});
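For context, the spec above delegates its actions to a page object. A minimal LoginPage class consistent with those calls might look like the sketch below; the actual generated class was not shown, and the data-test selectors are an assumption based on the public Swag Labs demo site.
// login.page.js - hypothetical sketch of the LoginPage referenced above
class LoginPage {
  // Open the login screen (assumes baseUrl points at the Swag Labs demo)
  visit() {
    cy.visit('/');
  }

  // Fill in credentials and submit the form
  login(username, password) {
    cy.get('[data-test="username"]').type(username);
    cy.get('[data-test="password"]').type(password);
    cy.get('[data-test="login-button"]').click();
  }

  // Return the error container so tests can chain assertions onto it
  getErrorMessage() {
    return cy.get('[data-test="error"]');
  }
}

export default LoginPage;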
Main test scenario
Providing GPT-engineer with a primary test scenario, even without detailed steps or specific cases, already sets a solid foundation for test generation. Simply identifying the main flow that GPT-engineer should cover can lead to promising results.
For instance, the following prompt instructed GPT-engineer to generate UI tests for a successful purchase on "Swag Labs," using Cypress and the Page Object Pattern:
Create automated UI tests for " Swag Labs " with the use of Cypress and Page Object Pattern. Use the latest Cypress version. Use the correct folder/file structures and naming conventions. Make sure to include the necessary configuration files. Create README file with instructions how to set up the project, how to run the tests and also information which test cases have been covered by automated tests. Generate random test data if needed. The tests should cover the main scenario (successful purchase) and additional unhappy paths. |
With this prompt, GPT-engineer was able to produce a comprehensive test case that covers both login and checkout, along with valid and invalid credential checks:
it('should login with valid credentials and complete a purchase', () => {
cy.fixture('users').then((users) => {
const { validUser } = users;
loginPage.fillUsername(validUser.username);
loginPage.fillPassword(validUser.password);
loginPage.submit();
inventoryPage.addItemToCart('Sauce Labs Backpack');
inventoryPage.goToCart();
cartPage.checkout();
checkoutPage.fillFirstName('John');
checkoutPage.fillLastName('Doe');
checkoutPage.fillPostalCode('12345');
checkoutPage.continue();
checkoutPage.finish();
});
});
it('should fail login with invalid credentials', () => {
cy.fixture('users').then((users) => {
const { invalidUser } = users;
loginPage.fillUsername(invalidUser.username);
loginPage.fillPassword(invalidUser.password);
loginPage.submit();
cy.get('.error-message-container').should('be.visible');
});
});
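Both specs read credentials from a fixture. A cypress/fixtures/users.json consistent with the destructured keys might look like this (a sketch using the credentials seen elsewhere in the generated output):
{
  "validUser": { "username": "standard_user", "password": "secret_sauce" },
  "invalidUser": { "username": "invalid_user", "password": "invalid_password" }
}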
While results may vary across applications, GPT-engineer generally identifies the main flows and relevant selectors, providing a solid starting point for automated UI testing.
Covering particular screens
Another approach, though less successful than scenario-based prompts, involves instructing GPT-engineer to focus on specific screens of the application. This method can be useful when individual page functionalities need testing rather than a complete end-to-end flow.
Create positive and negative automated UI tests for "Swag Labs" with the use of Cypress, Typescript and Page Object Pattern. Create README file with instruction how to set up the project, how to run the tests and also information which test cases have been covered by automated tests. Generate random test data if needed. Cover the most important pages of the website – login page, inventory page, cart page.
Using this prompt, GPT-engineer generated individual tests that focused on functionality within each page. Here’s an example of a test case that covers adding and removing items from the cart:
beforeEach(() => {
loginPage.visit();
loginPage.fillUsername('standard_user');
loginPage.fillPassword('secret_sauce');
loginPage.submit();
});
it('should add and remove an item from the cart', () => {
inventoryPage.addItemToCart('Sauce Labs Backpack');
cartPage.visit();
cartPage.getCartItems().should('have.length', 1);
cartPage.removeItemFromCart('Sauce Labs Backpack');
cartPage.getCartItems().should('have.length', 0);
});
While this approach generated functional tests for each page, it is less suited for complex end-to-end scenarios, such as purchase flows. For comprehensive testing, prompts focused on full user journeys tend to yield more cohesive results.
Using explicit test cases with steps
Directly specifying a flow or test case in the prompt proved to be the most successful method for generating accurate tests, a conclusion consistent with earlier research. Here, we aimed to determine how detailed a scenario or case should be to produce effective tests with GPT-engineer. Do we need to provide detailed, step-by-step instructions, objectives, and expected results? Or is a basic outline sufficient?
After multiple trials, it appears that GPT-engineer can deliver well-structured tests using only a straightforward sequence of steps, such as “log in, select products, go to checkout.” For instance:
Create an automated test for the purchase flow on "Swag Labs". Use “Cypress” (version 13.13.3) and follow the test case: The user is able to successfully log in, select products, view them in the cart, provide necessary information during checkout, complete the purchase, and finally log out. A confirmation message regarding the successful purchase is displayed at the end of the process. The test should use the Page Object Pattern. Classes created in the page objects should be referenced in the test code.
The test generated with this prompt was complete and functional, covering all specified steps. Below is a sample of the output:
it('should complete the purchase flow successfully', () => {
loginPage.visit();
loginPage.login('standard_user', 'secret_sauce');
productsPage.add_to_cart('Sauce Labs Backpack');
productsPage.navigate_to_cart();
cartPage.checkout();
checkoutPage.fill_information('John', 'Doe', '12345');
checkoutPage.continue_checkout();
checkoutPage.finish_checkout();
confirmationPage.get_confirmation_message().should('contain', 'THANK YOU FOR YOUR ORDER');
logoutPage.logout();
});
The only minor inconsistency was that, while the prompt mentioned “products,” only one product was added in the test. In other runs, multiple products sometimes appeared, which suggests that specifying exact quantities in the prompt (for example, “add two products to the cart”) can improve consistency.
In another test case:
Create an automated test for "Software Development Company | Netguru". Use Cypress, Typescript and cover the following test case:
The test should use the Page Object Pattern. Classes created in the page objects should be referenced in the test code.
it('should display error messages when submitting an empty estimate request form', () => {
homePage.visit();
homePage.acceptCookies();
homePage.goToEstimateProjectPage();
estimateProjectPage.fillFirstName('John');
estimateProjectPage.fillLastName('Doe');
estimateProjectPage.requestEstimate();
estimateProjectPage.verifyErrorMessages();
});
An alternative is to phrase the scenario as a sequential list of steps rather than as a single test-case description:
Create an automated test for "Software Development Company | Netguru". Use Cypress, Typescript and cover the following steps:
The test should use the Page Object Pattern. Classes created in the page objects should be referenced in the test code.
This type of prompt returns more or less the same results, but it is less compact, which may or may not be beneficial depending on the test complexity. Overall, specific but minimal steps in the prompt yield efficient and usable tests.
Using multiple test cases/scenarios
GPT-engineer can effectively handle multiple test cases when provided with clear instructions that outline the scope of each scenario. Testing this capability, we prompted GPT-engineer to generate UI tests for the purchase flow on “Swag Labs” with Cypress, specifying two distinct scenarios:
Create automated tests for the purchase flow on "Swag Labs". Use “Cypress” (version 13.13.3) and cover the following test cases:
GPT-engineer successfully generated code to cover both scenarios, as shown below:
describe('Purchase Flow Tests', () => {
beforeEach(() => {
loginPage.visit();
});
it('User can complete a purchase', () => {
// Login
loginPage.login('standard_user', 'secret_sauce');
// Add item to cart
inventoryPage.addItemToCart('Sauce Labs Backpack');
// Go to cart
inventoryPage.goToCart();
// Proceed to checkout
cartPage.proceedToCheckout();
// Provide information and continue
checkoutPage.provideInformation('John', 'Doe', '12345');
// Verify purchase completion
confirmationPage.getConfirmationMessage().should('have.text', 'THANK YOU FOR YOUR ORDER');
// Logout
headerPage.logout();
});
it('User cannot proceed without providing personal information', () => {
// Login
loginPage.login('standard_user', 'secret_sauce');
// Add item to cart
inventoryPage.addItemToCart('Sauce Labs Backpack');
// Go to cart
inventoryPage.goToCart();
// Proceed to checkout
cartPage.proceedToCheckout();
// Continue without providing information
checkoutPage.continue();
// Verify error message
cy.get('.error-message-container').should('be.visible');
});
});
This approach confirmed that GPT-engineer can generate cohesive code for multiple scenarios within the same prompt, provided that each test case is clearly defined. This method is particularly effective for creating comprehensive test suites that cover both positive and negative flows.
Specifying the correct framework version
One of the main challenges encountered was ensuring GPT-engineer used the correct version of the testing framework, especially for Cypress. When no version was specified in the prompt, GPT-engineer produced inconsistent results—sometimes using the latest version, but often defaulting to older ones. Even when the latest version appeared to be used, the generated project often had issues like incorrect folder structures, missing configuration files, and outdated naming conventions.
Explicitly requesting the latest version of Cypress in the prompt yielded similarly mixed outcomes:
Create automated test for the purchase flow on "Swag Labs". Use the latest version of “Cypress” and follow the test case: (...)
With this phrasing, the results varied unpredictably. In some instances, an older version of Cypress (e.g., version 9) was used; in others, the latest version (e.g., 13). When the latest version was used, there were frequent issues with project organization: for example, the tool would sometimes use the “integration” folder rather than the correct “e2e” folder, or it would apply outdated naming conventions (e.g., purchase_flow.spec.js instead of purchase_flow.cy.js). Consequently, while the tests might open in the latest Cypress app, they often failed to run due to incorrect file or folder setups.
Even when specifying a particular version, such as:
Create an automated test for the purchase flow on "Swag Labs". Use “Cypress” (version 13.13.3) and follow the test case:
or using instructions like:
Make sure the project is set up in a correct way. Apply the folder structure and naming conventions used in Cypress v 13.13.3
the results remained inconsistent. While one run might successfully apply the expected structure, another run with the same prompt could generate a completely different configuration.
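For reference, the modern conventions the tool often missed are easy to express in a single configuration file. A minimal Cypress 13-style setup might look like the sketch below (the baseUrl assumes the public Swag Labs demo at saucedemo.com):
// cypress.config.js - minimal illustrative configuration for Cypress 13
const { defineConfig } = require('cypress');

module.exports = defineConfig({
  e2e: {
    baseUrl: 'https://www.saucedemo.com',
    // Since Cypress 10, specs live in cypress/e2e and use the .cy.js suffix
    // (not the pre-10 "integration" folder and .spec.js naming)
    specPattern: 'cypress/e2e/**/*.cy.{js,ts}',
  },
});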
Single spec vs. multiple specs
By default, GPT-engineer tends to place all generated tests within a single spec file unless specified otherwise. For instance, if the prompt requests particular test cases, they are often grouped together in one file. However, when instructed to cover individual screens of an application, GPT-engineer typically generates a separate spec file for each screen.
To ensure each test is placed in its own spec file, it’s best to include this explicitly in the prompt. For example:
Create automated tests for " Swag Labs ". Use “Cypress” (version 13.13.3) and cover the following test cases: 1. The user is able to successfully log in, select products, view them in the cart, provide necessary information during checkout, complete the purchase, and finally log out. A confirmation message regarding the successful purchase is displayed at the end of the process. Each test should be placed in a separate spec file. The tests should use the Page Object Pattern. Classes created in the page objects should be |
With this instruction, each test case was successfully placed in a distinct spec file. Below is an example for the fourth test case, where an incorrect login triggers an error message:
describe('Incorrect Login', () => {
it('should display an error message with incorrect credentials', () => {
const loginPage = new LoginPage();
loginPage.visit();
loginPage.login('incorrect_user', 'wrong_password');
cy.get('.error-message-container').should('contain', 'Epic sadface: Username and password do not match any user in this service');
});
});
By clearly specifying file separation in the prompt, GPT-engineer can organize test cases in individual spec files, making it easier to manage and troubleshoot specific scenarios within a larger test suite.
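With Cypress 13 conventions applied, the separated test cases would land in individually named files along these lines (illustrative names, assuming the default e2e folder):
cypress/e2e/successful_purchase.cy.js
cypress/e2e/incorrect_login.cy.js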
Using the page object pattern
To leverage the Page Object Pattern in generated tests, it’s essential to reference it specifically in the prompt. However, merely mentioning the pattern by name may not be enough—GPT-engineer tends to create the page objects but might not apply them directly in the test cases. To ensure proper implementation, it helps to specify that the page object classes should be actively used within the test steps.
Create automated test for the purchase flow on "Swag Labs". Use “Cypress”, Page Object Pattern and follow the test case:
Objective: Verify that a user can log in, select products, add them to the cart, and complete the checkout process successfully.
Precondition:
Test Steps:
1. Navigate to "Swag Labs".
2. Browse through the list of products.
3. Click on the shopping cart icon to view the selected products.
4. Enter First Name, Last Name, and Postal Code in the input fields.
5. Verify that the user is taken to the overview page with a summary of the order.
6. Review the order summary.
7. Click on the menu button on the top left corner.
Expected Result: The user is able to successfully log in, select products, view them in the cart, provide necessary information during checkout, complete the purchase, and finally log out.
In response to this prompt, GPT-engineer generated code covering each screen but didn’t consistently apply the page objects within the test steps:
// Select Products
cy.addProductToCart('Sauce Labs Backpack');
cy.addProductToCart('Sauce Labs Bolt T-Shirt');
cy.get('.shopping_cart_badge').should('have.text', '2');
// View Cart and Checkout
cy.get('.shopping_cart_link').click();
cy.url().should('include', '/cart.html');
cy.get('.cart_item').should('have.length', 2);
cy.get('#checkout').click();
This example shows that, although the structure of the test was correct, GPT-engineer sometimes skipped referencing the page object classes in the test flow. Including an explicit instruction, such as “Ensure that page object classes are referenced in the test code,” can help mitigate this issue and enforce the use of the Page Object Pattern throughout the test script.
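For comparison, the same steps expressed through page objects would look something like the sketch below; InventoryPage and CartPage are hypothetical classes mirroring the method names seen in other generated specs:
// Hypothetical page-object version of the raw cy.* steps above
const inventoryPage = new InventoryPage();
const cartPage = new CartPage();

// Select products
inventoryPage.addItemToCart('Sauce Labs Backpack');
inventoryPage.addItemToCart('Sauce Labs Bolt T-Shirt');
inventoryPage.getCartBadge().should('have.text', '2');

// View cart and proceed to checkout
inventoryPage.goToCart();
cy.url().should('include', '/cart.html');
cartPage.getCartItems().should('have.length', 2);
cartPage.checkout();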
Including specific assertions in tests
By default, GPT-engineer includes basic assertions in generated tests, typically sufficient for standard scenarios. However, if specific assertions are needed, these can be specified in the prompt to achieve more targeted test validations.
Create an automated test for the purchase flow on "Swag Labs". Use “Cypress” (version 13.13.3) and follow the test case: The user is able to successfully log in, select products, view them in the cart, remove one product, proceed to checkout, provide necessary information during checkout, complete the purchase, and finally log out. A confirmation message regarding the successful purchase is displayed at the end of the process. The test should contain the following assertions:
The test should use the Page Object Pattern. Classes created in the page objects should be referenced in the test code.
GPT-engineer was able to incorporate the specified assertions, generating the following code:
// Go to cart and verify items
productsPage.openCart()
cartPage.getCartItemsCount().should('have.length', 2)
// Remove one product from cart
cartPage.removeItemFromCart('Sauce Labs Bike Light')
cartPage.getCartItemsCount().should('have.length', 1)
// Proceed to checkout
cartPage.proceedToCheckout()
checkoutPage.enterFirstName('John')
checkoutPage.enterLastName('Doe')
checkoutPage.enterPostalCode('12345')
checkoutPage.continueCheckout()
checkoutPage.finishCheckout()
// Verify checkout confirmation
checkoutPage.getConfirmationMessage().should('contain.text', 'THANK YOU FOR YOUR ORDER')
})
By clearly specifying the required assertions, GPT-engineer successfully validated the number of items in the cart and checked for a confirmation message on the checkout screen. This approach enables more precise test coverage and enhances the reliability of the test output.
Incorporating hooks in tests
GPT-engineer often includes hooks, like beforeEach, in generated tests, especially when instructed to create tests for specific application screens. However, to ensure hooks are consistently applied, it helps to specify their use explicitly within the prompt.
For example, the following prompt was used to request tests that utilize hooks:
Create positive and negative automated UI tests for "Swag Labs" with the use of Cypress, Typescript and Page Object Pattern. Generate random test data if needed. Cover the most important pages of the website – login page, inventory page, cart page. Use hooks.
Even without specifying the exact content of the hooks, GPT-engineer generated sensible beforeEach hooks for each spec, providing a streamlined setup process. Below is an example of the output:
describe('Inventory Page Tests', () => {
const loginPage = new LoginPage();
const inventoryPage = new InventoryPage();
beforeEach(() => {
loginPage.visit();
loginPage.fillUsername('standard_user');
loginPage.fillPassword('secret_sauce');
loginPage.submit();
cy.url().should('include', '/inventory.html');
});
it('should add an item to the cart', () => {
inventoryPage.addItemToCart('Sauce Labs Backpack');
inventoryPage.getCartBadge().should('contain', '1');
});
it('should remove an item from the cart', () => {
inventoryPage.addItemToCart('Sauce Labs Backpack');
inventoryPage.removeItemFromCart('Sauce Labs Backpack');
inventoryPage.getCartBadge().should('not.exist');
});
});
Challenges encountered
Several challenges emerged during the experiment, many of which mirrored issues observed in previous research. These included:
- Framework version inconsistencies: GPT-engineer sometimes defaulted to outdated framework versions, affecting test compatibility.
- Unpredictable project structure: The generated project structure often varied, with incorrect folder organization or naming conventions, which occasionally required manual adjustments.
- Non-existent selectors: In some cases, GPT-engineer generated selectors that didn’t correspond to actual elements, leading to test failures.
- Unpredictable results: The same prompt could yield different project setups on separate runs, complicating the reproducibility of the test cases.
While some of these issues can be minimized with carefully crafted prompts and minor refactoring, this approach is mostly effective for addressing test content and structure. For broader challenges, such as consistent project configuration, correct dependencies, and framework usage, a more hands-on approach remains necessary to ensure reliable, functional test outcomes.
Overall effectiveness and limitations
Using GPT-engineer for automated test generation is feasible and can yield solid results, but users should not expect fully deployable tests without manual refinement. In most cases, refactoring is necessary, especially concerning the project’s configuration and setup. The tool shows promise but requires user involvement to ensure quality and accuracy.
Crafting effective prompts
This research set out to answer three key questions: What makes a ‘good prompt’ for generating UI tests with GPT-engineer? How complex does the input need to be for useful output? And, can good tests be created without overly explicit instructions?
From a QA perspective, a good prompt generally includes:
- Frameworks and design patterns to use
- Flows, scenarios, or test cases to be covered
- Test steps for each case
- Specific elements to consider (e.g., assertions, hooks, classes)
- Desired test structure
GPT-engineer performs well with minimal instructions. For instance, asking it to simply “log in” is often enough for it to generate the right commands and selectors. Adding complexity to prompts generally enhances results, but concise instructions can also yield valuable tests. Additionally, using multiple test cases in a single prompt often leads to well-organized code covering each scenario.
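Assembled from the prompts used in this experiment, a reusable template might look like this (placeholders in angle brackets are to be filled in per application):
Create automated UI tests for "<application>" with the use of Cypress (version <X.Y.Z>) and the Page Object Pattern. Use the correct folder/file structures and naming conventions. Cover the following test cases: 1. <main scenario and expected result> 2. <unhappy path and expected result> The tests should contain the following assertions: <assertions>. Each test should be placed in a separate spec file. Use hooks. Classes created in the page objects should be referenced in the test code.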
Future research and considerations
This research is far from exhaustive—there are more areas to explore, such as consistently applying the correct frameworks and libraries. For example, in one instance, GPT-engineer generated Python code when Cypress was requested. Moreover, selector accuracy remains a challenge, especially on complex pages, where the tool sometimes generated non-existent selectors. Future research could focus on refining these areas to maximize GPT-engineer’s reliability and utility.