The Real Problem: Data Trapped in Documents

For healthcare and financial services organizations, contracts are not just legal documents; they are the operational backbone. Yet critical business information remains locked in unstructured formats: PDFs, scanned copies, complex tables, and nested exhibits. Manual review is slow, error-prone, and expensive. Traditional contract lifecycle management (CLM) systems capture only predefined fields, missing the nuanced terms that drive reimbursement rates, vendor discounts, and compliance obligations. The result: missed savings, payment delays, and operational inefficiencies that can cost millions.

Doczy.ai, built by AArete on AWS, directly attacks this problem. Instead of treating documents as flat text, the solution uses a patented hybrid approach that preserves hierarchical structure and semantic meaning. The architecture orchestrates Amazon S3, Lambda, Textract, Bedrock, ECS, CloudWatch, and Secrets Manager to create a fully automated pipeline from document upload to actionable dashboard.
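
As a rough illustration of the first hop in such a pipeline (an S3 upload triggering a Lambda function that hands the document to Textract), here is a minimal sketch. It is not Doczy.ai's code: the function name `handle_upload`, the bucket name, and the `FakeTextract` stand-in are assumptions made for the example. The only AWS call it relies on is `start_document_text_detection`, which a real `boto3.client("textract")` provides.

```python
from urllib.parse import unquote_plus


def handle_upload(event, textract_client):
    """Start an asynchronous Textract job for each object in an S3 event.

    `textract_client` is expected to behave like boto3.client("textract");
    it is passed in so the function can be exercised without AWS access.
    """
    job_ids = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # S3 event notifications URL-encode object keys (spaces become "+").
        key = unquote_plus(record["s3"]["object"]["key"])
        response = textract_client.start_document_text_detection(
            DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
        )
        job_ids.append(response["JobId"])
    return job_ids


class FakeTextract:
    """Hypothetical stand-in client so the sketch runs locally; not part of boto3."""

    def __init__(self):
        self.calls = []

    def start_document_text_detection(self, DocumentLocation):
        self.calls.append(DocumentLocation)
        return {"JobId": f"job-{len(self.calls)}"}


# Hypothetical S3 event with a single uploaded contract.
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "contracts-inbox"},
                "object": {"key": "payer/2024+master+agreement.pdf"}}}
    ]
}
fake_client = FakeTextract()
started_jobs = handle_upload(sample_event, fake_client)
print(started_jobs)
```

Passing the client in as a parameter keeps the handler testable; in a deployed Lambda you would construct the real client once at module load and pass that instead.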

The Architecture: Smart Chunking + Dual Clustering

The core innovation lies in two stages: smart chunking and dual clustering.

Smart Chunking

After Amazon Textract extracts raw text and metadata, a proprietary algorithm goes beyond splitting by paragraph. It uses semantic and keyword search to decompose the text into context-aware chunks, preserving one-to-many relationships (e.g., a single clause that applies to multiple service levels). Sequential identifiers and metadata-driven grouping organize these chunks into field groups, detecting overlaps and removing duplicates.
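
The actual chunking algorithm is proprietary, so the following is only a sketch of the ideas named above: sequential identifiers, hierarchy-preserving metadata, one-to-many relationships, duplicate removal, and grouping into field groups. It uses keyword matching only (no semantic search), and the `Chunk` fields, the `FIELD_GROUP_PATTERNS` map, and the sample clauses are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
import re


@dataclass
class Chunk:
    seq_id: int          # sequential identifier in reading order
    section_path: tuple  # e.g. ("Exhibit B", "2.1"); preserves the hierarchy
    text: str
    applies_to: list = field(default_factory=list)  # one-to-many targets


# Hypothetical keyword map from field group to trigger patterns.
FIELD_GROUP_PATTERNS = {
    "reimbursement": re.compile(r"reimburs|billed charges|fee schedule", re.I),
    "term": re.compile(r"\bterm\b|renew|terminat", re.I),
}


def normalize(text):
    """Lowercase and collapse whitespace so near-identical text compares equal."""
    return re.sub(r"\s+", " ", text).strip().lower()


def deduplicate(chunks):
    """Keep the first occurrence of each normalized text, in sequence order."""
    seen, unique = set(), []
    for chunk in sorted(chunks, key=lambda c: c.seq_id):
        key = normalize(chunk.text)
        if key not in seen:
            seen.add(key)
            unique.append(chunk)
    return unique


def group_by_field(chunks):
    """Assign each chunk to every field group whose pattern it matches."""
    groups = {name: [] for name in FIELD_GROUP_PATTERNS}
    for chunk in chunks:
        for name, pattern in FIELD_GROUP_PATTERNS.items():
            if pattern.search(chunk.text):
                groups[name].append(chunk.seq_id)
    return groups


chunks = [
    Chunk(1, ("Exhibit B", "2.1"), "Provider is reimbursed at 85% of billed charges.",
          applies_to=["Inpatient", "Outpatient"]),
    # Same clause with stray whitespace, as often happens after OCR.
    Chunk(2, ("Exhibit B", "2.1"), "Provider is reimbursed at  85% of billed charges."),
    Chunk(3, ("Section 9",), "Initial term of 12 months, renewing automatically."),
]
unique_chunks = deduplicate(chunks)
field_groups = group_by_field(unique_chunks)
print([c.seq_id for c in unique_chunks])
print(field_groups)
```

Keeping `section_path` and `applies_to` on each chunk is what lets a downstream step answer "which exhibit did this rate come from, and which service levels does it cover" without re-reading the source document.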

Dual Clustering Engine

This is where Doczy.ai differentiates itself. Two lenses analyze every document simultaneously:

  • Semantic clustering: Text is converted into embeddings (numerical representations of meaning). Similar ideas are grouped even when expressed in different words.
  • Structural clustering: Pattern-recognition algorithms identify clause types, formatting conventions, table layouts, and hierarchical organization. A three-level nested exhibit is treated differently than a straightforward schedule.

Projection algorithms then compare the two clusterings side by side and synthesize them into a unified model that captures both meaning and context. This convergence is what drives the reported 99% accuracy rate.
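
The article does not say how the projection step works. One simple way to compare two clusterings of the same chunks, offered here purely as a substitute illustration, is a contingency table plus the Rand index (the fraction of chunk pairs on which both clusterings agree). The label values below are made up.

```python
from collections import Counter


def contingency(semantic_labels, structural_labels):
    """Count how often each (semantic, structural) label pair co-occurs."""
    return Counter(zip(semantic_labels, structural_labels))


def pair_agreement(labels_a, labels_b):
    """Fraction of item pairs on which two clusterings agree (Rand index).

    A pair agrees when both clusterings put the two items together,
    or both put them apart.
    """
    n = len(labels_a)
    agree, total = 0, 0
    for i in range(n):
        for j in range(i + 1, n):
            same_a = labels_a[i] == labels_a[j]
            same_b = labels_b[i] == labels_b[j]
            agree += same_a == same_b
            total += 1
    return agree / total if total else 1.0


# Hypothetical labels for five chunks from the two lenses.
semantic = [0, 0, 1, 1, 1]
structural = ["financial", "financial", "legal", "legal", "financial"]

table = contingency(semantic, structural)
score = pair_agreement(semantic, structural)
print(dict(table))
print(round(score, 2))
```

Cells of the contingency table that hold only a few chunks, such as the single chunk that is semantically in cluster 1 but structurally "financial", are the natural candidates for review, since that is where the two lenses disagree.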

Key Metrics That Matter

Metric                                Value
Documents processed (22 months)       2.5 million (50M pages)
Amazon Bedrock API calls              137 million
Total tokens processed                442 billion
Cumulative client savings             ~$330 million
Manual processing time reduction      97%
Accuracy (vs. rules-based systems)    99% (vs. 55%)

Limitations and Considerations

While impressive, this architecture is not a silver bullet. Smart chunking and dual clustering require significant upfront configuration and domain-specific tuning. Organizations with highly irregular document formats (e.g., handwritten notes, non-standard templates) may see lower accuracy. The reliance on Amazon Bedrock also means costs scale with token usage: 442 billion tokens across 2.5 million documents works out to roughly 177,000 tokens per document, so this may not be a cheap solution for small-scale deployments. Latency can also be a concern when large volumes must be processed in real time.
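
To make the cost point concrete, the reported totals can be turned into per-document averages. The arithmetic below uses only the figures from the metrics table; the price per 1,000 tokens is a placeholder assumption, not an actual Bedrock rate, and should be replaced with the rate for your model and region.

```python
# Figures reported in the article.
documents = 2_500_000
pages = 50_000_000
bedrock_calls = 137_000_000
tokens = 442_000_000_000

pages_per_document = pages / documents
calls_per_document = bedrock_calls / documents
tokens_per_call = tokens / bedrock_calls
tokens_per_document = tokens / documents

# ASSUMPTION: a blended price per 1,000 tokens. This is a placeholder, not an
# actual Bedrock price; substitute the rate for your model and region.
assumed_usd_per_1k_tokens = 0.001

cost_per_document = tokens_per_document / 1000 * assumed_usd_per_1k_tokens

print(f"pages per document:   {pages_per_document:.0f}")
print(f"calls per document:   {calls_per_document:.1f}")
print(f"tokens per call:      {tokens_per_call:,.0f}")
print(f"tokens per document:  {tokens_per_document:,.0f}")
print(f"assumed cost per doc: ${cost_per_document:.2f}")
```

On the reported numbers, the average document is about 20 pages and takes about 55 model calls and about 177,000 tokens. Multiplying those averages by your own document volume and actual token price gives a first-order budget.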

Next Steps for Learning

If you're building a similar document intelligence pipeline, start by experimenting with Amazon Textract for extraction and with Claude or Titan models on Amazon Bedrock. Then focus on your chunking strategy, which is where the most architectural leverage lies. Use metadata to preserve document structure, and validate with a dual approach (semantic plus structural) to catch edge cases.

# Example: Simulating the dual clustering logic in Python
# This is a simplified illustration, not production code.

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
import re

# Simulate extracted chunks from Amazon Textract
chunks = [
    "Provider agrees to reimburse at 85% of billed charges",
    "Term: 12 months, renewable automatically",
    "Confidentiality clause: both parties shall maintain...",
    "Payment terms: net 30 days from invoice date",
    "Termination: 60 days written notice required",
]

# Semantic clustering using embeddings
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(chunks)
semantic_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(embeddings)

# Structural labeling using regex patterns (a simplified, rule-based stand-in for structural clustering)
structural_labels = []
for chunk in chunks:
    if re.search(r'reimburs|payment|invoice', chunk, re.IGNORECASE):
        structural_labels.append(0)  # financial clause
    elif re.search(r'termin|confidential|renew', chunk, re.IGNORECASE):
        structural_labels.append(1)  # legal clause
    else:
        structural_labels.append(2)

# Projection: combine both clusterings
final_labels = []
for sem, struc in zip(semantic_labels, structural_labels):
    # In a real system, this would be a learned mapping
    combined = f"sem{sem}_struc{struc}"
    final_labels.append(combined)

print("Combined cluster labels:", final_labels)
# Output is a list such as ['sem0_struc0', 'sem1_struc1', ...]; the semantic
# cluster numbers are arbitrary and depend on the KMeans fit.

The Business Impact: Beyond Accuracy

Doczy.ai's 97% reduction in manual processing time is not just a cost-saving metric—it fundamentally changes how organizations operate. Health plans can now configure claims systems automatically from contract terms, removing manual data entry and configuration errors. Vendor invoice verification becomes real-time, catching discrepancies before payment. The centralized metadata repository enables continuous contract analysis, identifying opportunities to renegotiate terms or consolidate vendors.

However, adopting such a system requires organizational readiness. Teams must be trained to trust AI outputs, and a feedback loop for edge cases is essential. Doczy.ai uses few-shot and multi-shot prompting and continuously edits its prompts based on real outputs, so accuracy compounds over time. That practice is worth copying in any production AI system.
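
As a sketch of what that feedback loop might look like in code, the snippet below assembles a few-shot prompt from worked examples and folds a reviewed edge case back into the example set. The prompt layout, the JSON field names, and the helper names `build_few_shot_prompt` and `add_correction` are assumptions for illustration; the article does not describe Doczy.ai's prompt format.

```python
def build_few_shot_prompt(instruction, examples, clause):
    """Assemble a prompt with worked examples ahead of the clause to extract."""
    parts = [instruction.strip(), ""]
    for number, (example_clause, example_output) in enumerate(examples, start=1):
        parts.append(f"Example {number}")
        parts.append(f"Clause: {example_clause}")
        parts.append(f"Extracted: {example_output}")
        parts.append("")
    parts.append("Clause: " + clause)
    parts.append("Extracted:")
    return "\n".join(parts)


def add_correction(examples, clause, corrected_output, max_examples=5):
    """Feed a reviewed edge case back in, keeping only the most recent examples."""
    return (examples + [(clause, corrected_output)])[-max_examples:]


# Hypothetical starting example and one human-reviewed correction.
examples = [
    ("Provider is reimbursed at 85% of billed charges.",
     '{"rate_type": "percent_of_charges", "rate": 0.85}'),
]
examples = add_correction(
    examples,
    "Payment is due net 30 days from invoice date.",
    '{"payment_terms_days": 30}',
)
prompt = build_few_shot_prompt(
    "Extract the contract terms as JSON.",
    examples,
    "Either party may terminate with 60 days written notice.",
)
print(prompt)
```

Capping the example list keeps the prompt, and therefore token cost, bounded as corrections accumulate.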

Architectural Best Practices to Steal

  • Use metadata to preserve document hierarchy – don't flatten your chunks.
  • Combine semantic and structural clustering – meaning without structure is brittle.
  • Instrument everything with CloudWatch – monitor token usage, latency, and error rates.
  • Secure secrets early with Secrets Manager – don't let security be an afterthought.
  • Design for continuous improvement – use real outputs to refine prompts and models.
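
The instrumentation advice in the list above could be sketched as follows. The namespace, metric names, and helper names are illustrative choices rather than anything the article specifies, and `FakeCloudWatch` is a hypothetical stand-in so the code runs without AWS access. With a real `boto3.client("cloudwatch")`, the call used is `put_metric_data`. The `inputTokens` and `outputTokens` keys are assumed to mirror the usage data your model call returns.

```python
import time


def record_call_metrics(cloudwatch_client, input_tokens, output_tokens,
                        latency_ms, failed, namespace="DocPipeline"):
    """Publish per-call token, latency, and error metrics.

    `cloudwatch_client` is expected to behave like boto3.client("cloudwatch").
    The namespace and metric names are illustrative choices.
    """
    cloudwatch_client.put_metric_data(
        Namespace=namespace,
        MetricData=[
            {"MetricName": "InputTokens", "Value": input_tokens, "Unit": "Count"},
            {"MetricName": "OutputTokens", "Value": output_tokens, "Unit": "Count"},
            {"MetricName": "LatencyMs", "Value": latency_ms, "Unit": "Milliseconds"},
            {"MetricName": "Errors", "Value": 1 if failed else 0, "Unit": "Count"},
        ],
    )


def timed_call(cloudwatch_client, model_call):
    """Run `model_call`, then record its usage and latency even if it fails."""
    start = time.perf_counter()
    failed, usage = False, {"inputTokens": 0, "outputTokens": 0}
    try:
        usage = model_call()
    except Exception:
        failed = True
        raise
    finally:
        latency_ms = (time.perf_counter() - start) * 1000
        record_call_metrics(cloudwatch_client, usage["inputTokens"],
                            usage["outputTokens"], latency_ms, failed)
    return usage


class FakeCloudWatch:
    """Hypothetical stand-in client so the sketch runs locally; not part of boto3."""

    def __init__(self):
        self.published = []

    def put_metric_data(self, Namespace, MetricData):
        self.published.append((Namespace, MetricData))


fake_cloudwatch = FakeCloudWatch()
# The lambda stands in for a real model invocation that returns token usage.
usage = timed_call(fake_cloudwatch, lambda: {"inputTokens": 3100, "outputTokens": 120})
print(fake_cloudwatch.published[0][0], usage)
```

Recording metrics in the `finally` block means failed calls still show up in the latency and error series, which is usually where the monitoring is most needed.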

Conclusion

Doczy.ai on AWS demonstrates how modern cloud services can solve complex document-heavy business problems. The patented combination of smart chunking, dual clustering, and prompt optimization delivers a reported 99% accuracy at massive scale: 2.5 million documents processed and roughly $330 million in client savings over 22 months. For any organization drowning in unstructured contracts, this architecture offers a proven blueprint.

Start small: pick one contract type, build a pipeline with Textract and Bedrock, and iterate on your chunking strategy. The technology is ready—now it's about execution.
