If you gave nine different AI tools the same enterprise design challenge, would they solve the same problems? Would they prioritize the same features? Would any of them create accessible interfaces?

I decided to find out.

As design leaders, we’re seeing AI tools proliferate across our organizations. But most conversations stay surface-level: “AI is transformative” or “AI will replace designers.” I wanted to move past the hype and understand something more fundamental: how do different AI models actually approach complex design problems?

So I created a detailed design brief for “ConnectIQ,” an AI-enhanced contact center platform for companies with 500-5000+ employees, and tested it with nine AI tools:

Text AI (5 models): ChatGPT, Claude, Perplexity, Microsoft Copilot, Gemini
Visual AI (4 tools): Figma Make, Lovable, v0.dev, Claude Artifacts

Same brief. Same requirements. Nine completely different results.

The most critical finding? Every single tool, all nine, failed accessibility.

The Experiment

The Challenge

ConnectIQ needed to serve three user types with conflicting needs:

  • Customer service agents handling 100+ daily interactions
  • Supervisors managing teams of 10-20 agents
  • Operations managers forecasting and optimizing at scale

The design had to balance innovation with enterprise reality: real-time performance (<200ms), security compliance (SOC 2, GDPR, HIPAA), legacy system integration, and change management for thousands of users.

This tests whether AI understands:

  • Multi-persona complexity
  • Enterprise constraints
  • Strategic vs. tactical thinking
  • Innovation within operational limits

The Method

Text AI Phase: I gave five models the same detailed prompt asking for product strategy, information architecture, key features, workflows, design principles, AI integration philosophy, and success metrics. Target: 1,500-2,500 words.

Visual AI Phase: I gave four tools a minimal prompt: “Design the Agent Workspace and Supervisor Dashboard for ConnectIQ.” No mockups, no detailed specs. Just the concept.

Then I evaluated everything systematically: feature depth, enterprise understanding, creativity, usability, and critically, accessibility.

What the Text AI Models Revealed

Each model had a distinct “personality” and approach:

ChatGPT: The Balanced Generalist

ChatGPT delivered 2,500 words of tightly structured analysis. Every section used hierarchical bullets, making it highly scannable. The response balanced strategic thinking with tactical details.

Key strengths: Exceptionally organized (scannable in 5 minutes), technically precise (specific <200ms latency targets), and a well-balanced set of 7 modules covering all bases without bloat.

Unique contributions:

  • ABAC (Attribute-Based Access Control) for fine-grained permissions, showing deeper security thinking (see the sketch after this list)
  • “Evidence-First Cards” UI pattern where AI recommendations include rationale and source links
  • Best adherence to length constraint (2,500 words vs. 1,500-2,500 target)
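
To make the ABAC idea concrete, here is a minimal, hypothetical sketch of an attribute-based permission check in TypeScript. The attribute names and the rule itself are illustrative assumptions, not details from ChatGPT’s response; the point is that access depends on attributes of the user, the resource, and the context rather than on role alone.

```typescript
// Hypothetical ABAC check: the decision combines user, resource, and context
// attributes instead of relying on a fixed role-to-permission mapping.
type User = {
  role: "agent" | "supervisor" | "ops_manager";
  team: string;
  piiClearance: boolean;
};
type Transcript = { ownerTeam: string; containsPii: boolean };
type Context = { withinBusinessHours: boolean };

function canViewTranscript(user: User, t: Transcript, ctx: Context): boolean {
  // Supervisors and ops managers may only open transcripts from their own team.
  const sameTeam = user.team === t.ownerTeam;
  // PII-bearing transcripts additionally require clearance and business hours.
  const piiOk = !t.containsPii || (user.piiClearance && ctx.withinBusinessHours);
  return (user.role === "supervisor" || user.role === "ops_manager") && sameTeam && piiOk;
}
```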

Limitations: Somewhat conservative competitive positioning, standard feature naming (“Agent Workspace,” “Supervision Console”), and less emphasis on human factors like agent wellness.

Claude: The Creative Humanist

Claude wrote 2,800 words in narrative prose rather than bullets. The response introduced novel frameworks and emphasized human-centered design in ways other models missed entirely.

Key strengths: Only model to propose “Agent Wellness Portal” (private space for agents to manage mental health, schedules, growth metrics), memorable feature naming (“Sentiment Shield,” “Knowledge Nexus”), story-driven workflows with specific timings.

Unique contributions:

  • “EQ Layer” concept: tracking human states (agent stress, customer sentiment), not just productivity metrics
  • “Just-in-Time not Just-in-Case” design principle
  • “Human-in-the-Loop Guarantee” as core philosophy (AI suggests, humans approve)

Limitations: Less enterprise-specific detail on compliance and integrations, only 2 workflows instead of 3, fewer modules than competitors (5 vs. 7).

Perplexity: The Research Scholar

Perplexity delivered 2,600 words with academic but accessible writing. The response felt grounded in research about how real enterprise systems work.

Key strengths: Forward-thinking technical architecture (edge processing for latency-sensitive features, only model to mention this), natural language search interface, separated platform KPIs from AI-specific performance metrics (sophisticated measurement thinking).

Unique contributions:

  • Edge computing for AI inference (shows understanding of real-time constraints)
  • “AI authority levels” governance concept (who can adjust AI behavior by role)
  • Explicit AI performance metrics separate from business KPIs

Limitations: Less memorable product positioning than Claude, longer latency targets (300ms vs. 200ms), standard module structure.

Microsoft Copilot: The Technical Architect

Copilot produced 4,000+ words (roughly double the target), reading like a complete PRD ready for development teams. The detail level was exhaustive.

Key strengths: Most implementation-ready (specific costs like $350K, 85% accuracy targets), Monte Carlo staffing simulation, causal inference models for impact analysis, comprehensive security and governance (bias testing, field-level PII encryption).

Unique contributions:

  • Monte Carlo methods for scenario analysis (operations research sophistication; see the sketch after this list)
  • Causal inference models (beyond correlation to actual causation)
  • Most aggressive latency targets (<200ms)
  • Explicit bias testing in model governance
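
To make the Monte Carlo staffing idea concrete, the sketch below simulates many possible hours of contact volume, converts each into a staffing requirement, and reads targets off the resulting distribution. Every number and the arrival model are illustrative assumptions of mine, not Copilot’s actual specification.

```typescript
// Toy Monte Carlo staffing simulation (all parameters are illustrative).
const RUNS = 10_000;
const MEAN_CONTACTS_PER_HOUR = 480; // assumed demand
const VOLUME_STD_DEV = 60;          // assumed hour-to-hour variability
const CONTACTS_PER_AGENT_HOUR = 8;  // assumed handle rate (~7.5 min per contact)

// Sample a normally distributed volume via the Box-Muller transform.
function sampleVolume(): number {
  const u1 = 1 - Math.random();
  const u2 = Math.random();
  const z = Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
  return Math.max(0, MEAN_CONTACTS_PER_HOUR + VOLUME_STD_DEV * z);
}

// For each simulated hour, how many agents would have been needed?
const agentsNeeded = Array.from({ length: RUNS }, () =>
  Math.ceil(sampleVolume() / CONTACTS_PER_AGENT_HOUR)
).sort((a, b) => a - b);

// Staff to the 90th percentile rather than the average to absorb volume spikes.
const p50 = agentsNeeded[Math.floor(RUNS * 0.5)];
const p90 = agentsNeeded[Math.floor(RUNS * 0.9)];
console.log(`Median staffing: ${p50} agents; 90th percentile: ${p90} agents`);
```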

Limitations: Exceeded length by 2x (makes it harder to consume), least creative positioning, overwhelming detail for a strategy document, less focus on human experience.

Gemini: The Clarity Advocate

Gemini delivered 2,400 words with the clearest problem-solution framing. The response was the most stakeholder-friendly, requiring the least editing for presentations.

Key strengths: Simplest value articulation (“reduce effort for humans, increase confidence in decisions”), most explicit AI visibility philosophy (when AI should be visible vs. invisible), intent-centric positioning (vs. workflow-centric competitors), exactly 7 focused KPIs.

Unique contributions:

  • “Graceful degradation” philosophy (AI signals uncertainty clearly; see the sketch after this list)
  • “Reduce effort, increase confidence” as memorable value prop
  • Proactive vs. reactive framing
  • “Non-judgmental interface” principle
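
To show what “graceful degradation” can look like in an interface, here is a small illustrative sketch: the UI gates an AI suggestion on its confidence score and falls back to an explicit “not sure” state instead of presenting a weak guess as fact. The threshold values and names are mine, not Gemini’s.

```typescript
// Illustrative confidence gating: suggest, hedge, or step aside.
type Suggestion = { text: string; confidence: number }; // confidence in [0, 1]

type AssistState =
  | { kind: "suggest"; text: string }   // confident enough to recommend
  | { kind: "tentative"; text: string } // shown, but clearly marked as uncertain
  | { kind: "defer" };                  // AI signals uncertainty and steps back

function degradeGracefully(s: Suggestion): AssistState {
  if (s.confidence >= 0.9) return { kind: "suggest", text: s.text };
  if (s.confidence >= 0.6) return { kind: "tentative", text: `Possibly relevant: ${s.text}` };
  return { kind: "defer" };
}
```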

Limitations: Less technical innovation, fewer memorable concepts than Claude, standard module architecture, and lighter implementation detail.

The Pattern: Strategic vs. Tactical Split

The models fell into clear camps:

Strategic thinkers (Claude, Gemini): Novel frameworks, clear positioning, human-centered emphasis
Balanced (ChatGPT, Perplexity): Equal weight to vision and execution
Tactical implementer (Copilot): Exhaustive specifications, ready for engineering

What they all missed: Accessibility, change management, user research methodology, mobile experience, offline capabilities, internationalization.

This reveals AI’s collective blind spots: what’s underrepresented in training data gets consistently overlooked.

What the Visual AI Tools Revealed

I selected four tools based on their relevance to enterprise design roles: Figma Make (industry standard), Lovable (modern full-stack), v0.dev (enterprise credibility from Vercel), and Claude Artifacts (known for thoughtful outputs).

Figma Make: Industry Standard Polish

Figma Make generated exceptionally polished interfaces with a clean, professional aesthetic. Light theme with teal accents, three-panel layout (queue, conversation, customer context + AI assistant).

Agent workspace showed: customer profile with lifetime value ($12,450), CSAT history (4.8/5.0), Premium tier badge, queue with 5 waiting interactions and priority badges, AI Assistant panel with 94% confidence scores, sentiment detection alert (“Negative Sentiment Detected”), suggested responses with confidence levels, and knowledge base integration.

Supervisor dashboard included: 8 KPI cards (including Active Agents, Avg Handle Time, CSAT, Queue Wait Time, SLA Compliance, First Contact Resolution), real-time team monitoring with individual agent cards showing current status and performance, live alerts panel, and queue health by channel.

Strength: Looks production-ready, like software you’d find at Salesforce or ServiceNow. Every pixel feels carefully considered.

Weakness: Red/orange text on light backgrounds (contrast violations), color-only priority system without icons, “Negative Sentiment” alert with red background. Score: 8/15 on accessibility.

Lovable: Full-Stack Prototyper

Lovable didn’t just generate mockups. It built functional code: React components, TypeScript, routing, and state management. The application actually worked.

Modern teal/violet color scheme with clean three-panel architecture. Agent workspace featured: queue with customer cards and channel badges (Chat, Email, Voice), active conversation with message threading, dual-tab interface (Customer profile + AI Assist), AI suggested responses with 94% confidence and copy/use buttons, and knowledge base articles.

Supervisor dashboard showed: 6 KPI cards with trend indicators, queue health by channel (Voice, Chat, Email) with SLA performance bars, team member list with individual stats and progress bars, interaction volume chart, CSAT trend chart, and alerts panel.

Strength: Functional prototype in hours, not weeks. Bridges design and development. Shows technical depth.

Weakness: Orange/red priority badges (accessibility issues), color-coded progress bars, red alert badges. Score: 8/15 on accessibility.

v0.dev: Power User Interface

v0.dev created a sophisticated dark-themed interface optimized for information density. The design packed maximum features into every screen.

Agent workspace included: queue showing 4 active conversations with priority dots and wait times, multi-channel switching (chat/voice/email), two states generated (active chat with AI detection, live voice call with timer and controls), real-time transcription, sentiment bars with specific percentages (Frustration 78%, Urgency 65%), confidence scores on suggestions (95%), customer lifetime value ($125,000), and bottom performance stats (47 handled, 94% CSAT, 2:34 avg time).

Supervisor dashboard featured: 8 KPI cards with trend indicators, team overview grid with agent cards showing photos and current activity, live activity feed, escalations panel with “Take Over” button, and negative sentiment alerts.

Strength: Most enterprise-ready features, deepest domain knowledge (understood multi-channel, supervisor intervention, real-time monitoring), power user optimization.

Weakness: Severe accessibility violations. Extensive red text on dark backgrounds throughout (urgent priorities, negative sentiment, trend indicators), orange/yellow text, color-only priority system. Score: 3/15 on accessibility. Would fail enterprise audits.

Claude Artifacts: Accessibility Leader

Claude chose a light theme with generous whitespace and simplified information density. The design prioritized clarity and comprehension over feature count.

Agent workspace showed: clean queue with simple cards, focused conversation view, customer profile with key metrics (lifetime value, CSAT, member since, total orders), emoji sentiment indicators (😟 Concerned, Anxiety detected), AI assistant with “Active, Confidence 94%,” recommended actions with rationale, and quick action buttons.

Supervisor dashboard included: 4 main KPI cards with people/clock/star icons, queue status (5 large metrics), team performance trend chart (line graph with smooth curves), and professional data visualization.

Strength: Best accessibility baseline (12/15 score), emoji as non-color indicators, high contrast throughout, clear visual hierarchy, most stakeholder-friendly for presentations.

Weakness: Still not fully WCAG compliant (red “Live Queue” badge, some contrast issues), less feature-rich than competitors, simpler than power users might want.

The Split: Design AI vs. Development AI

The tools revealed a fundamental distinction:

Design-focused (Figma Make, v0.dev, Claude Artifacts):

  • Output: Pixel-perfect mockups
  • Thinking: Screen-by-screen
  • Strength: Visual refinement
  • Use for: Presentations, exploration, aesthetics

Development-focused (Lovable):

  • Output: Working code
  • Thinking: System-level
  • Strength: Functional prototypes
  • Use for: Technical validation, rapid testing

Both valuable. Neither complete. Leaders need both.

The Universal Failure: Accessibility

All four tools made similar mistakes:

  • Red/orange text on dark or light backgrounds (contrast violations)
  • Color-only indicators (priorities, sentiment, status)
  • Missing icons or patterns to supplement color
  • Progress bars using color alone

Impact: Roughly 8% of men have some form of color blindness and would struggle with the color-only indicators. Users with low vision couldn’t read the low-contrast text. Enterprise procurement audits would fail these interfaces.

The pattern was clear: beautiful doesn’t mean accessible.
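
These failures are also easy to catch mechanically. WCAG 2.x defines contrast as a ratio of relative luminances, and AA requires at least 4.5:1 for normal text (3:1 for large text). A small checker like this sketch flags most of the red-on-dark and orange-on-light combinations the tools produced (the example colors are representative, not sampled from the actual outputs).

```typescript
// WCAG 2.x contrast check between two sRGB hex colors (e.g. "#ef4444").
function relativeLuminance(hex: string): number {
  const channels = [1, 3, 5].map((i) => parseInt(hex.slice(i, i + 2), 16) / 255);
  const [r, g, b] = channels.map((c) =>
    c <= 0.03928 ? c / 12.92 : Math.pow((c + 0.055) / 1.055, 2.4)
  );
  return 0.2126 * r + 0.7152 * g + 0.0722 * b;
}

function contrastRatio(fg: string, bg: string): number {
  const [lighter, darker] = [relativeLuminance(fg), relativeLuminance(bg)].sort((a, b) => b - a);
  return (lighter + 0.05) / (darker + 0.05);
}

// A typical "alert red" on a dark panel comes in under 4.5:1 and fails AA.
const ratio = contrastRatio("#ef4444", "#1e1e2e");
console.log(ratio.toFixed(2), ratio >= 4.5 ? "passes AA" : "fails AA for normal text");
```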

The Critical Finding: AI’s Accessibility Crisis

Every single AI tool failed accessibility. All nine.

This wasn’t random. It revealed a systematic gap.

[Image: Comparison showing accessibility violations – red text examples from v0.dev, Figma Make, and Lovable with WCAG contrast ratios]

The Violations

Text AI Models:

  • Zero mentioned WCAG standards
  • Zero mentioned screen readers
  • Zero mentioned colorblindness
  • Zero mentioned keyboard navigation

Visual AI Tools:

  • v0.dev: Extensive red text on dark backgrounds (3/15 score)
  • Lovable: Orange/red priority badges (8/15 score)
  • Figma Make: Red alerts, color-only indicators (8/15 score)
  • Claude: Best of group but still not compliant (12/15 score)

The Impact

In enterprise software, these violations mean:

  • Legal risk: ADA lawsuits, failed Section 508 compliance
  • Excluded users: About 8% of men are colorblind; millions use screen readers
  • Failed procurement: Enterprise RFPs require WCAG AA compliance
  • Expensive remediation: Fixing post-launch costs 10x more

Why This Matters

Accessibility isn’t a nice-to-have. It’s a fundamental requirement. And AI seems to systematically ignore it.

This reveals what’s missing from AI training data: accessibility best practices, inclusive design thinking, and WCAG guidelines.

Until that changes, human designers must be the accessibility advocates. 

What Design Leaders Need to Know

1. Tool Selection is Strategic

Different tools for different purposes:

Use text AI for:

  • Strategic exploration (multiple models for diverse perspectives)
  • Documentation (ChatGPT, Copilot for structure)
  • Research synthesis (Perplexity for grounded insights)

Use design AI for:

  • Visual concepts (Figma Make for polish, v0.dev for features)
  • Stakeholder presentations (clean, professional mockups)
  • Rapid iteration on aesthetics

Use development AI for:

  • Functional prototypes (Lovable for working code)
  • Technical validation (test actual workflows)
  • Bridging design and engineering

Never use AI alone for:

  • Final design decisions
  • Accessibility compliance
  • User research
  • Strategic pivots

2. Multi-Tool Synthesis is Optimal

No single tool was “best.” The optimal approach:

  1. Generate concepts from multiple tools
  2. Identify patterns and divergences
  3. Synthesize best ideas with human judgment
  4. Add what AI missed (especially accessibility)
  5. Validate with real users

3. Accessibility Requires Human Vigilance

Since every AI tool failed:

  • Assume AI outputs are not accessible
  • Audit every interface for WCAG AA compliance
  • Add icons/patterns to supplement color (see the sketch after this list)
  • Test with users who have disabilities
  • Make accessibility non-negotiable
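
For instance, a priority indicator that survives an audit pairs color with an icon and a visible text label, so the meaning never rides on hue alone. This is a hypothetical React/TypeScript sketch in the spirit of the code tools like Lovable generate; the component and class names are illustrative.

```tsx
// Priority badge that never relies on color alone: icon + text label + color.
type Priority = "urgent" | "high" | "normal";

const PRIORITY_META: Record<Priority, { icon: string; label: string; className: string }> = {
  urgent: { icon: "⚠", label: "Urgent", className: "badge badge-urgent" },
  high:   { icon: "▲", label: "High",   className: "badge badge-high" },
  normal: { icon: "●", label: "Normal", className: "badge badge-normal" },
};

export function PriorityBadge({ priority }: { priority: Priority }) {
  const { icon, label, className } = PRIORITY_META[priority];
  return (
    <span className={className}>
      {/* The icon is decorative; the visible text label carries the meaning. */}
      <span aria-hidden="true">{icon}</span> {label}
    </span>
  );
}
```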

4. Skills Design Leaders Need Now

AI Fluency:

  • Prompt engineering (clear constraints yield better outputs)
  • Critical evaluation (know when AI is generic vs. useful)
  • Tool selection strategy (match capabilities to challenges)

Amplified Human Skills:

  • Synthesis across sources (AI generates, you synthesize)
  • Strategic thinking (AI handles tactics, you handle strategy)
  • Accessibility advocacy (AI won’t do this for you)
  • Ethical judgment (evaluate “should we?” not just “can we?”)

Key Takeaways

About AI Tools:

  • Each has distinct strengths (ChatGPT structure, Claude creativity, Perplexity research grounding, Copilot depth, Gemini clarity)
  • Design AI and Development AI serve different purposes
  • Multi-tool synthesis beats single-tool reliance
  • AI amplifies good process but can’t replace it

About Accessibility:

  • All 9 tools failed (systematic, not random)
  • Text AI ignored it entirely
  • Visual AI violated contrast, relied on color-only indicators
  • Human oversight is essential, not optional

About Design Leadership:

  • Synthesis is your superpower (AI generates, you decide)
  • Tool selection is strategic knowledge
  • Accessibility advocacy is your responsibility
  • AI should accelerate, not replace, human-centered design

What’s Next

This experiment represents one data point in a rapidly evolving field. The tools will improve. New capabilities will emerge. But the fundamental insights remain: AI amplifies design leadership when we apply critical judgment, accessibility requires human vigilance, and synthesis across tools creates better outcomes than any single source.

As design leaders, our role is to:

  • Evaluate tools systematically (not follow hype)
  • Guide teams in effective AI integration
  • Advocate for what AI misses (especially accessibility)
  • Maintain human-centered design as a foundation

The question isn’t whether AI will transform design practice. It already has. The question is how we lead through that transformation while maintaining quality, inclusivity, and strategic thinking.

Go Deeper

Want the full analysis?

  • Detailed breakdowns of all 9 AI tools
  • Complete comparison matrices
  • Full prompt strategy and methodology
  • Comprehensive accessibility analysis
  • Extended synthesis and recommendations

Follow me on LinkedIn to find out when I post the full downloadable document.

Let’s Connect

If you’re navigating AI transformation in design leadership or simply love experimenting with new design tools + tech, I’m interested in comparing notes.

Areas of Focus:

  • Systematic (or even quick and dirty) evaluation of AI tools for design teams
  • Building AI-literate design practices that maintain quality and accessibility
  • Evolving design leadership in AI-augmented organizations

I’m particularly interested in conversations with:

  • Design leaders navigating AI transformation
  • Organizations building AI-enhanced enterprise products
  • Teams addressing the accessibility gap in AI-generated designs

This article summarizes an experiment testing 9 AI tools (5 text models, 4 visual tools) on the same enterprise design challenge. The complete analysis, methodology, and findings will be available in a full white paper format soon.