Projects - AirDawg Labs

Multimodal Evaluation

Adversarial Model Failure Testing

$15K

Designed and executed targeted prompts to induce specific failure modes across four frontier LLMs (GPT-5, Claude Sonnet 4.5, and others). Created a comprehensive multimodal evaluation suite testing edge cases, reasoning failures, and adversarial scenarios.

Deliverables:

5,000+ adversarial prompts
Multi-model failure analysis
Verified test cases across 4 LLMs
Detailed failure categorization

Results:

Delivered 2 weeks ahead of schedule
92% prompt success rate
Client extended for Phase 2
45% under market pricing

STEM Reasoning

PhD-Level Math & Science Dataset

$8K

Created expert-level STEM problems across mathematics, physics, chemistry, and biology requiring multi-step reasoning and domain expertise. All problems verified by PhD contributors with ground truth answers and detailed solution paths.

Deliverables:

3,500 PhD-level problems
Verified ground truth answers
Detailed solution explanations
Multi-domain coverage (STEM)

Results:

2× faster than vendor quotes
100% of answers verified for accuracy
Peer-reviewed by domain experts
Repeat client engagement

Finance & Economics

Financial Analysis Benchmark

$4K

Developed complex financial reasoning tasks including valuation models, market research scenarios, and investment analysis problems. Created by finance professionals with CFA-level expertise and verified against industry standards.

Deliverables:

1,200 finance problems
Real market scenarios
Valuation & analysis tasks
Multi-source synthesis required

Results:

Completed in 3 weeks
CFA-level verification
Industry-grade accuracy
40% under market pricing

Software Engineering

Coding Benchmark Suite

$3.5K

Built comprehensive coding challenges including algorithmic problems, system design scenarios, and real-world development tasks. Covered multiple programming languages with complete test suites and reference implementations.

Deliverables:

800 coding challenges
Multi-language support
Complete test coverage
Production-grade scenarios

Results:

Delivered 1.5× faster than the project timeline
100% test pass rate
Client satisfaction: 5/5
Featured in client's product

CLI & DevOps

Terminal Reasoning Tasks

$2.5K

Designed multi-step command-line interface tasks that test system-level reasoning and challenge frontier models on real-world DevOps scenarios. Each task includes Docker environments, reference solutions, and automated test suites.

Deliverables:

300 CLI reasoning tasks
Docker environments included
Automated test suites
Model pass rate below 40%

Results:

Delivered ahead of schedule
Complex multi-step verification
Containerization best practices
Client expanding scope

Our Projects

5

$30K

2×

45%

Ready to Start Your Project?