There’s a dangerous idea floating around developer communities: “If AI writes the code, we don’t need as many tests.” The reasoning goes something like this – AI produces correct code most of the time, so testing is less important.

This is exactly backwards. AI-generated code needs more testing, not less. Here’s why, and here’s how to do it without losing the speed advantage that AI gives you.

The fundamental testing paradox

When you write code by hand, you build mental models as you type. You think through edge cases because you manually handle them. You understand the failure modes because you construct the logic step by step. Your brain is a secondary verification system running alongside your fingers.

When AI generates code, that secondary verification doesn’t happen. You read the code, sure – but reading is not the same as constructing. Research in cognitive science suggests that people overestimate their understanding of code they’ve read versus code they’ve written. This is called the “illusion of understanding,” and AI-generated code triggers it constantly.

Tests are your defense against this illusion. They force you to articulate what “correct” means independently of the implementation. If the tests pass, the code works regardless of whether you fully understand every line. If the tests fail, you know exactly where to look.

TDD with AI: the best workflow nobody uses

Test-Driven Development has been around for decades, and it’s been divisive for just as long. But AI changes the calculus completely. TDD with AI is genuinely better than either TDD alone or AI alone.

Here’s the workflow:

Step 1: Write the tests first (yourself)

import pytest

# Import the function under test and the shared error type. The module
# path for calculate_shipping is an assumption; adjust to your layout.
from shipping import calculate_shipping
from src.errors import ShippingError

def test_calculate_shipping_domestic_standard():
    result = calculate_shipping(weight=2.5, country="US", speed="standard")
    assert result.cost == 8.99
    assert result.estimated_days == 5

def test_calculate_shipping_domestic_express():
    result = calculate_shipping(weight=2.5, country="US", speed="express")
    assert result.cost == 15.99
    assert result.estimated_days == 2

def test_calculate_shipping_international():
    result = calculate_shipping(weight=2.5, country="DE", speed="standard")
    assert result.cost == 24.99
    assert result.estimated_days == 10

def test_calculate_shipping_overweight():
    with pytest.raises(ShippingError, match="exceeds maximum"):
        calculate_shipping(weight=150, country="US", speed="standard")

def test_calculate_shipping_unknown_country():
    with pytest.raises(ShippingError, match="unsupported country"):
        calculate_shipping(weight=2.5, country="XX", speed="standard")

These tests encode your business requirements. You define what correct behavior looks like.

Step 2: Give AI the tests and ask it to implement

Here are my tests for a shipping cost calculator. 
Implement the calculate_shipping function that makes 
all tests pass.

[paste tests]

Use our ShippingError class from src/errors.py. 
Rate data should come from our rates config, not 
be hardcoded.

Step 3: Run the tests

If they pass, you have a verified implementation. If they don’t, tell AI which tests failed and why. It’ll iterate.

Why this works so well: The human writes the specification (tests), and the AI writes the implementation. Each side plays to its strength. You bring domain knowledge and edge-case awareness. AI brings speed and boilerplate tolerance. The tests serve as a contract between the two.
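For illustration, an implementation that satisfies those tests might look like the sketch below. The rate table, weight limit, and the ShippingResult/ShippingError definitions are stand-ins: in the real project, per the prompt above, rates come from the rates config and ShippingError from src/errors.py.

```python
from dataclasses import dataclass

class ShippingError(Exception):
    """Stand-in for the project's ShippingError in src/errors.py."""

@dataclass
class ShippingResult:
    cost: float
    estimated_days: int

MAX_WEIGHT_LBS = 100  # illustrative limit; the tests only require it to be under 150

# Hardcoded for the sketch; the prompt directs AI to load this from the rates config
RATES = {
    ("US", "standard"): (8.99, 5),
    ("US", "express"): (15.99, 2),
    ("DE", "standard"): (24.99, 10),
}

def calculate_shipping(weight: float, country: str, speed: str) -> ShippingResult:
    if weight > MAX_WEIGHT_LBS:
        raise ShippingError(f"weight {weight} exceeds maximum of {MAX_WEIGHT_LBS}")
    if country not in {c for c, _ in RATES}:
        raise ShippingError(f"unsupported country: {country}")
    try:
        cost, days = RATES[(country, speed)]
    except KeyError:
        raise ShippingError(f"no {speed} rate for {country}")
    return ShippingResult(cost=cost, estimated_days=days)
```

Whether the AI's actual output matches this sketch matters less than the guarantee: if the suite passes, the behavior you specified is the behavior you got.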

The AI-generated test trap

Here’s where things get tricky. AI is happy to generate tests for you too. But AI-generated tests have a systematic flaw: they tend to test the implementation, not the behavior.

When you write calculate_shipping and then ask AI to “write tests for this function,” it will read the implementation and generate tests that verify the code does what it currently does. This is tautological. If the code has a bug, the tests will verify the bug.

# AI-generated test that just mirrors the implementation
def test_calculate_shipping():
    # This test passes because it matches the code,
    # not because the business logic is correct
    result = calculate_shipping(weight=2.5, country="US", speed="standard")
    assert result.cost == 9.99  # Is 9.99 correct? Who knows.

The test doesn’t verify that $9.99 is the right price. It just verifies that the function returns $9.99. If the implementation incorrectly calculates the cost, this test will pass happily.

The fix: Write your assertions from business requirements, not from running the code. Know what the answer should be before you check what it is.

If you do use AI to generate test scaffolding, always review the assertions against your actual requirements. Keep the boilerplate (setup, teardown, test structure) and replace the assertions with your own expected values.
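Applied to the earlier example, that fix looks like this. The $5.99-base-plus-$1.20-per-pound pricing rule is hypothetical, and the stand-in function exists only so the snippet runs on its own; the point is where the expected value comes from.

```python
from dataclasses import dataclass

@dataclass
class ShippingResult:
    cost: float
    estimated_days: int

# Stand-in so the snippet is self-contained; in your suite this is
# imported from the module under test.
def calculate_shipping(weight, country, speed):
    return ShippingResult(cost=8.99, estimated_days=5)

def test_domestic_standard_matches_pricing_spec():
    # Derive the expected value by hand from the business rule
    # (hypothetical rule: $5.99 base + $1.20/lb, computed in cents to
    # sidestep float rounding) -- not by running the implementation.
    expected_cents = 599 + 120 * 2.5
    result = calculate_shipping(weight=2.5, country="US", speed="standard")
    assert round(result.cost * 100) == expected_cents
```

If the implementation and the hand-derived value disagree, you have found either a bug or a gap in your understanding of the requirements; both are worth knowing about.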

What to test when AI writes the code

Not everything needs the same level of testing. Here’s a priority framework for AI-generated code:

Always test: Business logic

If the code makes decisions that affect users or money, test it thoroughly. AI is particularly prone to subtle logic errors in complex conditional chains. Test every branch.
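As an illustration, take a hypothetical tiered-discount rule (separate from the shipping example). Parametrized tests make it cheap to hit every branch plus the boundary values between tiers, which is exactly where AI-generated conditional chains tend to go wrong.

```python
import pytest

# Hypothetical tiered-discount rule: one branch per tier.
def discount_rate(order_total: float) -> float:
    if order_total >= 500:
        return 0.15
    elif order_total >= 100:
        return 0.10
    elif order_total >= 50:
        return 0.05
    return 0.0

# One case per branch, plus the thresholds where off-by-one errors live.
@pytest.mark.parametrize("total,expected", [
    (0, 0.0),
    (49.99, 0.0),
    (50, 0.05),      # boundary: exactly at the tier threshold
    (99.99, 0.05),
    (100, 0.10),
    (499.99, 0.10),
    (500, 0.15),
])
def test_discount_rate_covers_every_branch(total, expected):
    assert discount_rate(total) == expected
```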

Always test: Integration boundaries

Where your code talks to databases, APIs, file systems, or external services. AI often generates code that works in isolation but breaks when it hits the real dependency. Integration tests catch what unit tests miss.

Always test: Error handling paths

AI tends to generate happy-path code that’s excellent and error-handling code that’s adequate. “Adequate” error handling leads to swallowed exceptions, misleading error messages, and silent failures in production. Test that your errors are informative, not just present.
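One way to do that, sketched below with stand-ins so the snippet runs on its own (in a real suite, ShippingError comes from src/errors.py and calculate_shipping from your shipping module):

```python
import pytest

# Stand-ins so the example is self-contained.
class ShippingError(Exception):
    pass

MAX_WEIGHT_LBS = 100  # illustrative limit

def calculate_shipping(weight, country, speed):
    if weight > MAX_WEIGHT_LBS:
        raise ShippingError(f"weight {weight} lb exceeds maximum of {MAX_WEIGHT_LBS} lb")
    raise NotImplementedError  # happy path omitted; this example targets the error path

def test_overweight_error_is_actionable():
    # Assert not just that an error occurs, but that its message tells
    # the caller what was wrong and what the limit actually is.
    with pytest.raises(ShippingError) as exc_info:
        calculate_shipping(weight=150, country="US", speed="standard")
    message = str(exc_info.value)
    assert "150" in message      # the offending value
    assert "maximum" in message  # the rule that was violated
```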

Spot-check: Pure transformations

Simple data transformations, formatting functions, and mapping logic. AI handles these well. A few representative test cases are usually enough.
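For example, a pure formatting helper (hypothetical) typically needs only a couple of representative cases, not exhaustive coverage:

```python
# Hypothetical pure transformation: deterministic, no branches on external state.
def format_weight(lbs: float) -> str:
    if lbs >= 2000:
        return f"{lbs / 2000:.2f} tons"
    return f"{lbs:.1f} lb"

def test_format_weight_representative_cases():
    assert format_weight(2.5) == "2.5 lb"
    assert format_weight(4000) == "2.00 tons"
```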

Skip testing: Generated boilerplate

If AI generated a standard Express route handler or a React component that just renders props, you probably don’t need bespoke tests. Your integration and end-to-end tests should cover these.

AI as a test generation assistant (done right)

AI is excellent at generating test infrastructure even when you shouldn’t trust its assertions. Use it for:

Test data factories:

Generate a factory function that creates realistic User 
objects for testing. Include edge cases: users with no 
email, users with very long names, users with special 
characters, users in different timezones.
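What comes back from a prompt like that is typically a factory along these lines. The User fields here are hypothetical stand-ins for your actual model:

```python
import uuid
from dataclasses import dataclass
from typing import Optional

# Hypothetical User model; swap in your real one.
@dataclass
class User:
    id: str
    name: str
    email: Optional[str]
    timezone: str

def make_user(**overrides) -> User:
    """Factory with realistic defaults; override any field to build an edge case."""
    defaults = {
        "id": str(uuid.uuid4()),
        "name": "Ada Lovelace",
        "email": "ada@example.com",
        "timezone": "UTC",
    }
    defaults.update(overrides)
    return User(**defaults)

# The edge cases from the prompt become one-liners:
no_email = make_user(email=None)
long_name = make_user(name="X" * 300)
unicode_name = make_user(name="Zoë O'Brien-Nuñez")
tokyo = make_user(timezone="Asia/Tokyo")
```

The factory is infrastructure, not assertions, so this is exactly the kind of output you can take from AI with a quick review rather than a line-by-line audit.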

Test setup and teardown:

Write the test setup for our Stripe webhook handler tests. 
We need a mock Express request/response, a Stripe event 
fixture, and database seeding with a test user and 
subscription.

Coverage gap identification:

Here's my function and my current tests. What scenarios 
am I not covering? Don't write the tests -- just list 
the gaps.

That last prompt is particularly powerful. AI is better at identifying what’s untested than at writing correct tests, because identification requires pattern matching (AI’s strength) while writing correct assertions requires domain knowledge (your strength).

The speed argument, reframed

“But testing slows me down!” Yes, and AI made the code-writing part faster. You can afford to spend some of that saved time on verification.

The math: AI generates a function in 30 seconds that would have taken you 15 minutes. You spend 5 minutes writing tests. Net savings: 9.5 minutes, and you have tests that the hand-written version probably wouldn’t have had.

AI-assisted development doesn’t eliminate the need for rigor. It gives you a time budget to afford rigor you couldn’t justify before.


The developers who treat AI-generated code as “probably correct” will ship bugs they don’t understand. The developers who write tests first and let AI implement against those tests will ship faster and more reliably. Testing isn’t the tax on AI-assisted development. It’s the thing that makes it actually work.