How to Create Sizing Plans for Custom Models in Microsoft Foundry: Fine-Tuning GPT Models from the Catalog for Specific Use Cases

Microsoft Foundry (also known as Azure AI Foundry) provides a unified platform for discovering, fine-tuning, deploying, and managing AI models. Its extensive Model Catalog includes hundreds of foundation models from OpenAI (GPT family), Microsoft, Meta, Anthropic, and open-source providers. For enterprise projects requiring domain-specific performance, security, or cost optimization, teams often start with a GPT model from the catalog and apply model refinement (fine-tuning) to create a custom model tailored to their use case—such as customer support agents, compliance document analysis, or industry-specific chatbots.

Sizing in this context means capacity planning: estimating and configuring the right compute resources, throughput, latency, and costs for both the fine-tuning job and the production deployment of your custom model. Poor sizing leads to high costs, throttled performance, or underutilized resources. This guide walks through how to create a practical sizing plan, with a focus on GPT-based custom models refined via supervised fine-tuning (SFT), direct preference optimization (DPO), or reinforcement fine-tuning (RFT).

Why Refine GPT Models in Foundry for Specific Use Cases?

Base GPT models (e.g., GPT-4o, GPT-4.1, GPT-4.1-mini) are general-purpose and powerful, but they often underperform on proprietary data, terminology, or edge cases. Fine-tuning in Foundry:

- Improves accuracy and relevance with your own JSONL-formatted conversational data.
- Reduces prompt-engineering effort and token usage (lowering inference costs).
- Supports advanced methods like LoRA for efficient parameter updates.
- Maintains enterprise features: data residency options, prompt caching, and seamless integration with agents or copilots.

 

Fine-tuned models are still fully managed by Microsoft but appear as custom deployments in your Foundry resource. Sizing becomes critical here because deployment options directly impact hourly hosting fees, token pricing, and guaranteed throughput.

 

Step 1: Prepare Your Fine-Tuning Project (Dataset Sizing Considerations)

 

Before any deployment sizing, size your training data correctly:

- Minimum: 10 examples (but aim for hundreds or thousands for meaningful improvement).
- Best practice: start with 50 high-quality, human-curated examples; doubling the dataset size often yields roughly linear quality gains.
- File format: JSONL (UTF-8), <512 MB per file, ≤1 GB total per resource.
- Structure: Chat Completions format (supports vision for multimodal GPT models).
- Impact on sizing: larger datasets increase training time and cost, and may require more epochs or a higher learning-rate multiplier. Use the **Developer** training tier (spot capacity, lowest cost) for experimentation and **Global/Standard** for production-grade jobs.

Generate synthetic data in the Foundry portal (Data tab → Synthetic Data) if labeled data is limited, then validate quality before training.
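The file-format constraints above can be checked programmatically before uploading. Below is a minimal validation sketch; the role set and limits simply mirror the numbers quoted above, and you would extend it for tool-call messages or multimodal content as needed:

```python
import json
import os

MAX_FILE_BYTES = 512 * 1024 * 1024   # <512 MB per file, per the limit above
MIN_EXAMPLES = 10                    # service minimum; aim far higher in practice

# Core chat roles; fine-tuning data may also use tool/function messages,
# which this simple sketch does not cover.
VALID_ROLES = {"system", "user", "assistant"}

def validate_training_file(path: str) -> int:
    """Validate a Chat Completions JSONL file; return the example count."""
    if os.path.getsize(path) >= MAX_FILE_BYTES:
        raise ValueError("file exceeds the 512 MB per-file limit")
    count = 0
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, 1):
            if not line.strip():
                continue
            record = json.loads(line)  # raises on malformed JSON
            messages = record.get("messages")
            if not isinstance(messages, list) or not messages:
                raise ValueError(f"line {lineno}: missing 'messages' list")
            for msg in messages:
                if msg.get("role") not in VALID_ROLES:
                    raise ValueError(f"line {lineno}: unexpected role {msg.get('role')!r}")
                if "content" not in msg:
                    raise ValueError(f"line {lineno}: message missing 'content'")
            count += 1
    if count < MIN_EXAMPLES:
        raise ValueError(f"only {count} examples; minimum is {MIN_EXAMPLES}")
    return count
```

Running this before upload catches malformed records early, instead of waiting for the training job to reject the file.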

Step 2: Run the Fine-Tuning Job

1. In the Foundry portal → Models → select a supported GPT model (e.g., gpt-4o-mini, gpt-4.1 series).

2. Upload training/validation files.

3. Choose customization method (SFT, DPO, or RFT) and training tier (Standard for data residency, Global for cost savings, Developer for evaluation).

4. Set hyperparameters (epochs, learning rate multiplier, batch size) or use defaults.

5. Monitor metrics: training/validation loss, token accuracy, and checkpoints.

 

Training jobs run on managed capacity; quotas apply (max 3–5 simultaneous jobs depending on tier). No manual VM sizing is needed—Foundry abstracts this. Once complete, you receive a fine-tuned model ID (e.g., `gpt-4.1-mini-2025-04-14.ft-xxx`).
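The portal flow above corresponds to a REST call behind the scenes. The sketch below only builds the request shape rather than sending it; the `api-version` string, file ID, and endpoint are placeholder assumptions, so check the current Azure OpenAI fine-tuning reference for exact values:

```python
import json

def build_finetune_request(endpoint, base_model, training_file_id,
                           validation_file_id=None, n_epochs="auto"):
    """Return (url, json_body) for creating a supervised fine-tuning job."""
    # api-version below is an assumption; use the version your resource supports.
    url = f"{endpoint}/openai/fine_tuning/jobs?api-version=2024-10-21"
    body = {
        "model": base_model,                 # e.g. "gpt-4o-mini"
        "training_file": training_file_id,   # ID returned by the file-upload API
        "hyperparameters": {"n_epochs": n_epochs},  # "auto" lets the service pick
    }
    if validation_file_id:
        body["validation_file"] = validation_file_id
    return url, body

# Hypothetical resource endpoint and file ID, for illustration only:
url, body = build_finetune_request(
    "https://my-resource.openai.azure.com", "gpt-4o-mini", "file-abc123")
print(url)
print(json.dumps(body, indent=2))
```

Keeping the request builder separate from the HTTP call makes it easy to review hyperparameters in code review before a (billable) job is submitted.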

 

Step 3: Create Your Deployment Sizing Plan (The Core of Custom Model Sizing)

 

This is where the actual sizing plan takes shape. Fine-tuned GPT models support the same deployment types as base models, but with custom weights:

 

**Deployment Types and When to Use Them**

- **Standard / Global Standard** — Pay-per-token + hourly hosting fee. Good for variable traffic. Global offers cost savings (weights may temporarily leave your geography).

- **Developer Tier** — No hourly fee, ideal for testing/evaluation (no SLA).

- **Provisioned Throughput (PTU)** — Recommended for production. You purchase fixed **Provisioned Throughput Units (PTUs)** for guaranteed capacity, stable latency, and predictable hourly billing. PTUs are shared regionally with base models.

 

**Key Sizing Metrics to Calculate**

- Expected **Requests per Minute (RPM)**

- Average **input tokens** and **output tokens** per request

- Peak vs. average load

- Latency requirements (generations consume more PTU capacity than prompts)

 

PTU-to-throughput conversion varies by model version. For GPT-4o and later models, input and output tokens are weighted differently. Use Microsoft’s guidance or the Azure OpenAI capacity calculator (available in the portal or via docs) to convert your call shape into required PTUs.

 

**Practical Sizing Workflow**

1. **Collect historical or estimated workload data** (from pilot tests with the base GPT model or similar applications).

2. **Run benchmarks** in the Foundry playground or via the official benchmarking tool to measure real tokens-per-minute (TPM) under load.

3. **Calculate PTUs**:

   - Example formula (approximate):  

     PTUs needed ≈ (RPM × (input tokens × input weight + output tokens × output weight)) / TPM per PTU  

     (Exact TPM-per-PTU values are model-specific and listed in the PTU documentation.)

4. **Choose minimum PTU commitment** (e.g., 15 PTU for Global/Data-zone, 50 PTU for Regional in many cases).

5. **Factor in quota** — PTU quota is granted per subscription/region. Check availability in the Azure portal before deployment.

6. **Add headroom** (20–50%) for peaks and future growth.
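The workflow above condenses into a back-of-the-envelope helper. All numeric constants here (`tpm_per_ptu`, the input/output token weights, the default headroom) are hypothetical placeholders; substitute the model-specific values from the PTU documentation or the capacity calculator:

```python
import math

def estimate_ptus(rpm, input_tokens, output_tokens, tpm_per_ptu,
                  input_weight=1.0,
                  output_weight=4.0,   # assumption: generations weigh more than prompts
                  headroom=0.3,        # 30% buffer for peaks and growth (step 6)
                  min_commitment=15):  # minimum purchasable PTUs (step 4)
    """Apply the article's approximate formula, then round up to the
    minimum purchasable commitment."""
    weighted_tpm = rpm * (input_tokens * input_weight + output_tokens * output_weight)
    raw_ptus = weighted_tpm / tpm_per_ptu
    with_headroom = raw_ptus * (1 + headroom)
    return max(min_commitment, math.ceil(with_headroom))

# Illustrative numbers only -- not real model throughput figures:
print(estimate_ptus(rpm=120, input_tokens=500, output_tokens=250,
                    tpm_per_ptu=10_000))  # → 24
```

Note how the minimum commitment acts as a floor: small workloads still pay for the minimum PTU purchase, which is itself an argument for consolidating low-traffic deployments.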

 

For non-PTU deployments, size by setting `sku.capacity` (higher values increase throughput but raise costs). Maximum fine-tuned model deployments per resource is typically 10.
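For illustration, a deployment with an explicit `sku.capacity` might look like the following ARM resource fragment. The `apiVersion`, SKU name, capacity value, and model version here are assumptions to show the shape only; the `ft-xxx` model name is the placeholder ID from Step 2:

```json
{
  "type": "Microsoft.CognitiveServices/accounts/deployments",
  "apiVersion": "2024-10-01",
  "name": "my-resource/my-gpt-custom-support-v1",
  "sku": { "name": "Standard", "capacity": 30 },
  "properties": {
    "model": {
      "format": "OpenAI",
      "name": "gpt-4.1-mini-2025-04-14.ft-xxx",
      "version": "1"
    }
  }
}
```

Managing the deployment as infrastructure-as-code makes capacity changes reviewable and repeatable across environments.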

 

**Example Sizing for a Customer Support Agent Use Case**  

- Workload: 200 RPM, avg. 800 input tokens + 300 output tokens per request.  

- Model: Fine-tuned GPT-4.1-mini.  

- Result: ~X PTUs (use calculator for exact). Deploy as Provisioned Throughput in a supported region (e.g., North Central US).  

- Estimated cost: Fixed hourly PTU rate + any reserved capacity discounts.
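The call shape above translates directly into the raw throughput numbers you would feed into the capacity calculator. This is pure arithmetic, with no model-specific weights assumed:

```python
# Workload from the example: 200 requests/min, 800 input + 300 output tokens each.
rpm = 200
input_tokens, output_tokens = 800, 300

input_tpm = rpm * input_tokens       # input tokens per minute
output_tpm = rpm * output_tokens     # generated tokens per minute
total_tpm = input_tpm + output_tpm   # total unweighted tokens per minute

print(input_tpm, output_tpm, total_tpm)  # → 160000 60000 220000
```

Those three numbers, plus the model version, are all the calculator needs to return a concrete PTU figure.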

 

Step 4: Deploy and Validate the Sized Custom Model

 

1. In the Foundry portal, go to your fine-tuned model → **Deploy**.

2. Select deployment type, name (e.g., `my-gpt-custom-support-v1`), and PTU size (or capacity).

3. For production, enable auto-deployment during fine-tuning if available.

4. Test with real traffic using the Chat Playground or your application code (reference the deployment name in API calls).

5. Monitor via Azure Monitor: token usage, latency, PTU utilization, and errors.
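One detail of step 4 worth showing in code: Azure OpenAI routes requests by deployment name, not by the underlying fine-tuned model ID. The sketch below only builds the request shape; the endpoint and `api-version` are placeholder assumptions:

```python
def build_chat_request(endpoint, deployment, user_text):
    """Return (url, json_body) for a Chat Completions call to a deployment."""
    # The deployment name (not the ft-model ID) appears in the URL path.
    url = (f"{endpoint}/openai/deployments/{deployment}"
           f"/chat/completions?api-version=2024-10-21")  # assumed api-version
    body = {
        "messages": [
            {"role": "system", "content": "You are a customer support agent."},
            {"role": "user", "content": user_text},
        ],
        "max_tokens": 300,  # matches the output budget used in the sizing plan
    }
    return url, body

# Hypothetical endpoint; the deployment name matches the example in step 2:
url, body = build_chat_request(
    "https://my-resource.openai.azure.com",
    "my-gpt-custom-support-v1",
    "How do I reset my password?")
print(url)
```

Capping `max_tokens` at the sized output budget keeps real traffic consistent with the assumptions behind your PTU calculation.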

 

Inactive deployments (>15 days without calls) are auto-deleted to control costs, but the underlying model remains available for redeployment.

 

Step 5: Ongoing Optimization and Scaling

- Scale up/down: update PTU allocation or `sku.capacity` manually (no auto-scaling yet).
- Cost controls: use Azure Reservations for PTU discounts; leverage prompt caching; prefer smaller models (e.g., GPT-4.1-mini) after fine-tuning.
- Multi-region: deploy across regions for global apps (cross-region deployment is supported with the proper permissions).
- Quotas & limits: track maximum training jobs, files, and PTU quota in the portal to avoid being blocked.
- Iterate: use continuous fine-tuning (train on a previous fine-tuned model) and A/B test checkpoints.

Best Practices for GPT Catalog + Refinement Projects

- Start small: fine-tune and deploy in the Developer tier first, measure real metrics, then size PTUs for production.
- Data quality > quantity: poor data can degrade performance; validate with evaluation jobs.
- Combine techniques: use RAG plus fine-tuning for hybrid gains.
- Governance: apply content filters, monitor for drift, and maintain model versions.
- Security: fine-tuned models support the same enterprise controls (private networking, encryption) as base GPT models.

Conclusion

Creating a solid sizing plan in Microsoft Foundry turns a generic GPT model into a high-performing, cost-effective custom solution tailored to your exact use case. By focusing on workload profiling, PTU calculations, and the right deployment type, you avoid over-provisioning while guaranteeing reliable performance.

Whether you’re building an internal agent or a customer-facing product, Foundry’s fine-tuning + sizing workflow gives you full control without managing infrastructure. Start today in the Azure AI Foundry portal (ai.azure.com), explore the Model Catalog, and iterate from base GPT to production-ready custom model.

For the latest PTU calculators, pricing, and region availability, refer to the official Microsoft Foundry documentation. Happy refining!
