MedGemma 1.5: Google’s Open Medical AI Just Got Serious

2026-01-16T00:00:00+00:00

Google just dropped MedGemma 1.5, and it’s a significant upgrade for anyone building clinical AI systems. As someone who’s spent the last two years building production medical AI at Mount Sinai, I want to break down what’s actually new and why it matters.

What’s New in 1.5

MedGemma 1.5 isn’t just an incremental update. It adds entirely new capabilities:

mindmap
  root((MedGemma 1.5))
    3D Imaging
      CT Volumes
      MRI Sequences
    Longitudinal
      Prior Comparisons
      Disease Progression
    Pathology
      Whole Slide Images
      Multi-patch Input
    Documents
      Lab Reports
      EHR Data
    Localization
      Bounding Boxes
      Anatomical Features

Architecture Overview

MedGemma builds on Gemma 3’s decoder-only transformer with a specialized medical image encoder:

flowchart TB
    subgraph Input
        A[Medical Image] --> B[MedSigLIP Encoder]
        C[Text Prompt] --> D[Tokenizer]
    end

    subgraph Processing
        B --> E[256 Image Tokens]
        D --> F[Text Tokens]
        E --> G[Gemma 3 Decoder]
        F --> G
    end

    subgraph Output
        G --> H[Generated Text]
        H --> I[Diagnosis/Report/Analysis]
    end

Key Specs

Specification	Value
Parameters	4B
Context Length	128K tokens
Image Resolution	896 x 896
Image Tokens	256 per image
Output Length	8192 tokens
Architecture	Decoder-only Transformer
Attention	Grouped-Query Attention (GQA)
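
To make these numbers concrete, here is a rough budget for how many images fit in one context window. It is a sketch under stated assumptions: "128K" is taken to mean 131,072 tokens, the full 8192-token output length is reserved up front, any extra tokens the chat template adds around each image are ignored, and `max_images` is a hypothetical helper, not part of any library.

```python
# Back-of-the-envelope context budget using the spec table above.
# Assumptions: "128K" = 131,072 tokens; template overhead per image is ignored.
CONTEXT_TOKENS = 128 * 1024  # 131,072
TOKENS_PER_IMAGE = 256
MAX_OUTPUT_TOKENS = 8192


def max_images(text_prompt_tokens: int, reserve_output: int = MAX_OUTPUT_TOKENS) -> int:
    """Upper bound on images that fit alongside the prompt and reserved output."""
    available = CONTEXT_TOKENS - text_prompt_tokens - reserve_output
    return max(available // TOKENS_PER_IMAGE, 0)


if __name__ == "__main__":
    # Illustrative 500-token text prompt.
    print(max_images(text_prompt_tokens=500))
```

The bound is optimistic. Latency and memory will usually limit the practical image count well before the context window does.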

Performance Deep Dive

Medical Text Reasoning

The text-only benchmarks are mixed. MedGemma 1.5 improves on MedGemma 1 for MedQA and MedMCQA, but scores lower on PubMedQA and slightly lower on MMLU Med:

{
  "title": {
    "text": "Medical QA Benchmarks",
    "left": "center"
  },
  "tooltip": {
    "trigger": "axis"
  },
  "legend": {
    "data": ["Gemma 3 4B", "MedGemma 1 4B", "MedGemma 1.5 4B"],
    "top": "10%"
  },
  "xAxis": {
    "type": "category",
    "data": ["MedQA", "MedMCQA", "PubMedQA", "MMLU Med"]
  },
  "yAxis": {
    "type": "value",
    "min": 40,
    "max": 80
  },
  "series": [
    {
      "name": "Gemma 3 4B",
      "type": "bar",
      "data": [50.7, 45.4, 68.4, 67.2]
    },
    {
      "name": "MedGemma 1 4B",
      "type": "bar",
      "data": [64.4, 55.7, 73.4, 70.0]
    },
    {
      "name": "MedGemma 1.5 4B",
      "type": "bar",
      "data": [69.1, 59.8, 68.2, 69.6]
    }
  ]
}
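
The change from MedGemma 1 to MedGemma 1.5 on each benchmark can be read straight off the chart data. A minimal sketch: the scores are copied from the series above, and `deltas` is a hypothetical helper name.

```python
# Scores copied from the chart data above (percent).
BENCHMARKS = ["MedQA", "MedMCQA", "PubMedQA", "MMLU Med"]
MEDGEMMA_1 = [64.4, 55.7, 73.4, 70.0]
MEDGEMMA_1_5 = [69.1, 59.8, 68.2, 69.6]


def deltas(old, new):
    """Per-benchmark change in percentage points, rounded to one decimal."""
    return {name: round(n - o, 1) for name, o, n in zip(BENCHMARKS, old, new)}


if __name__ == "__main__":
    for name, change in deltas(MEDGEMMA_1, MEDGEMMA_1_5).items():
        print(f"{name:10s} {change:+.1f}")
```

Two of the four benchmarks move down, which is worth knowing if your workload is closer to PubMedQA-style literature questions than to exam-style QA.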

Imaging Performance

The real story is in the imaging benchmarks. MedGemma 1.5 improves on every task in the table, most dramatically on MRI classification and whole-slide pathology:

Task	Gemma 3 4B	MedGemma 1 4B	MedGemma 1.5 4B
CT Classification (7 conditions)	54.5%	58.2%	61.1%
MRI Classification (10 conditions)	51.1%	51.3%	64.7%
WSI Pathology (ROUGE)	2.3	2.2	49.4
EyePACS Fundus	14.4%	64.9%	76.8%
Longitudinal CXR	59.0%	61.1%	65.7%

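Both earlier models scored near zero on whole-slide pathology (2.3 and 2.2 ROUGE), so the jump to 49.4 looks less like an incremental gain and more like a capability that did not exist before.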

Quick Start Code

Here’s how to get started with the pipeline API:

from transformers import pipeline
from PIL import Image
import requests
import torch

# Initialize the pipeline
pipe = pipeline(
    "image-text-to-text",
    model="google/medgemma-1.5-4b-it",
    torch_dtype=torch.bfloat16,
    device="cuda",
)

# Load a chest X-ray
image_url = "https://upload.wikimedia.org/wikipedia/commons/c/c8/Chest_Xray_PA_3-8-2010.png"
image = Image.open(
    requests.get(image_url, headers={"User-Agent": "example"}, stream=True).raw
)

# Create the prompt
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Describe this X-ray"}
        ]
    }
]

# Generate
output = pipe(text=messages, max_new_tokens=2000)
print(output[0]["generated_text"][-1]["content"])

Or using the model directly for more control:

from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

model_id = "google/medgemma-1.5-4b-it"

model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

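# Process inputs (reuses `messages` from the pipeline example above)
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device, dtype=torch.bfloat16)

# Number of prompt tokens, used to strip the prompt from the output
input_len = inputs["input_ids"].shape[-1]

# Generate with control over parameters (greedy decoding)
with torch.inference_mode():
    generation = model.generate(
        **inputs,
        max_new_tokens=2000,
        do_sample=False
    )

# Keep only the newly generated tokens and decode them to text
generated_tokens = generation[0][input_len:]
print(processor.decode(generated_tokens, skip_special_tokens=True))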
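
For the longitudinal use case, the same chat-message structure can carry more than one image. The helper below is a sketch, not a documented MedGemma recipe: `build_comparison_messages` is a hypothetical name, and placing the prior and current study as two image entries in a single user turn, with short text labels, is my assumption about how one might prompt for a comparison.

```python
from typing import Any


def build_comparison_messages(prior: Any, current: Any, question: str) -> list[dict]:
    """Build a single-turn chat message holding two images and a question.

    `prior` and `current` are whatever the processor accepts as an image
    (for example PIL.Image objects). The text labels are illustrative only.
    """
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Prior study:"},
                {"type": "image", "image": prior},
                {"type": "text", "text": "Current study:"},
                {"type": "image", "image": current},
                {"type": "text", "text": question},
            ],
        }
    ]
```

The returned list has the same shape as `messages` in the examples above, so it can be passed to either the pipeline or `apply_chat_template`.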

Integration Architecture

Here’s how I’m thinking about integrating MedGemma into a clinical RAG system:

flowchart LR
    subgraph Input Layer
        A[Clinical Query] --> B{Query Type?}
        I[Medical Image] --> B
    end

    subgraph Processing
        B -->|Text Only| C[Text RAG Pipeline]
        B -->|Image + Text| D[MedGemma 1.5]
        C --> E[Vector Search]
        E --> F[Context Assembly]
        F --> D
    end

    subgraph Output
        D --> G[Clinical Response]
        G --> H[Citation + Evidence]
    end
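
The routing in the diagram can be sketched in a few lines. Everything here is hypothetical scaffolding: `ClinicalQuery`, `answer`, and the `retrieve` and `generate` callables are names I made up for illustration, standing in for a real vector search and a real MedGemma call.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class ClinicalQuery:
    text: str
    image: Optional[object] = None  # e.g. a PIL.Image; None for text-only


def answer(
    query: ClinicalQuery,
    retrieve: Callable[[str], Sequence[str]],
    generate: Callable[[str, Optional[object]], str],
) -> dict:
    """Route a query as in the diagram: text-only goes through retrieval first."""
    if query.image is None:
        # Text-only path: vector search, then context assembly.
        passages = list(retrieve(query.text))
        context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
        prompt = f"Context:\n{context}\n\nQuestion: {query.text}"
    else:
        # Image + text path: send straight to the model.
        passages = []
        prompt = query.text
    response = generate(prompt, query.image)
    return {"response": response, "citations": passages}
```

Passing the retriever and the model in as callables keeps the router testable with stubs, and lets the retrieved passages travel with the response as its citations.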

What This Means for Clinical AI

The Good

Open weights - Unlike GPT-4V or Med-PaLM, you can actually run this locally
4B parameters - Fits on a single GPU, practical for production
3D imaging support - CT/MRI interpretation was a major gap
EHR understanding - Finally, a model that can parse clinical notes
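
The single-GPU claim follows from simple arithmetic on the weights. A rough sketch, assuming exactly 4e9 parameters and counting weights only (activations and the KV cache are excluded, and they grow with context length); the quantized rows are illustrative and not something this post has tested.

```python
# Rough weight-memory estimate for a 4B-parameter model.
# Assumption: exactly 4e9 parameters; activations and KV cache are excluded.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "int8": 1, "int4": 0.5}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Memory needed for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype:9s} {weight_memory_gb(4e9, dtype):5.1f} GB")
```

In bfloat16, as used in the quick start code, the weights come to about 8 GB.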

The Limitations

Important: MedGemma is not intended for direct clinical use. All outputs require independent verification.

Single-image evaluation only (no multi-image comparison in benchmarks)
Not optimized for multi-turn conversations
Prompt-sensitive (more than base Gemma)
English-only evaluation
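
Prompt sensitivity is something you can measure before trusting a single phrasing. A sketch: ask the same question several ways and check how often the answers agree. `generate` is a hypothetical callable wrapping whichever inference path you use, and exact match after lowercasing is a crude assumption that only suits short categorical answers.

```python
from collections import Counter
from typing import Callable, Sequence


def agreement_rate(prompts: Sequence[str], generate: Callable[[str], str]) -> float:
    """Fraction of prompts whose normalized answer matches the most common answer."""
    if not prompts:
        raise ValueError("need at least one prompt")
    answers = [generate(p).strip().lower() for p in prompts]
    _, top_count = Counter(answers).most_common(1)[0]
    return top_count / len(answers)
```

A rate well below 1.0 on paraphrases of the same question is a signal to pin the prompt template and test changes to it like code.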

Comparison with Alternatives

quadrantChart
    title Model Positioning
    x-axis Closed Source --> Open Source
    y-axis General Purpose --> Medical Specialized
    quadrant-1 Best for Research
    quadrant-2 Enterprise Only
    quadrant-3 Limited Medical Use
    quadrant-4 Production Ready
    GPT-4V: [0.2, 0.4]
    Med-PaLM: [0.15, 0.85]
    LLaVA-Med: [0.75, 0.6]
    MedGemma: [0.8, 0.8]
    Gemma-3: [0.85, 0.3]

Next Steps

I’m planning to:

Benchmark against our RAG system - How does MedGemma compare to Mistral-7B for clinical QA?
Test 3D imaging - We have CT data from the MMAP pipeline
Fine-tune for hematology - Our OncoCITE system could benefit from better image understanding

What are you building with MedGemma? I’d love to hear about your use cases in the comments below.

Why I’m Finally Building in Public

2026-01-16T00:00:00+00:00

For years, I’ve built tools that stayed locked in production silos. Internal pipelines. Clinical systems. Things that worked, but that nobody outside my team would ever see.

That changes this year.

The Problem with Building in Private

When you work in clinical AI, there’s a natural tendency toward secrecy. Patient data is sensitive. Institutional knowledge feels proprietary. And honestly, it’s easier to ship fast when you’re not thinking about documentation.

But I’ve started to feel the cost of this approach. Every time I solve a problem, I solve it alone. Every time someone else hits the same wall, they start from scratch. The wheel gets reinvented constantly.

This site is my commitment to building in public. Here’s what you’ll find:

Projects — Production systems I’ve built at Mount Sinai, including multi-agent architectures, GPU-accelerated pipelines, and clinical RAG systems. Where possible, I’ll share code, architectures, and lessons learned.

Publications — My papers on genomic curation, clinical decision support, and AI-generated text detection. All with links to preprints and code.

Blog — Deep dives into problems I’m solving. Not polished tutorials—more like field notes from someone figuring things out in real time.

A Glimpse at What I Build

Here’s an example of the kind of systems I work on—a multi-agent architecture for genomic evidence extraction:

flowchart LR
    A[Literature] --> B[Extraction Agent]
    B --> C[Validation Agent]
    C --> D[Knowledge Graph]
    D --> E[Clinical Query]
    E --> F[Evidence Report]

This is a simplified view of OncoCITE, a system that automatically extracts genomic evidence from scientific papers. I’ll be writing more about the architecture decisions and lessons learned.

What I’m Learning

I’m also using this year to go deeper on model interpretability. I’ve spent years making models work. Now I want to understand why they work—and more importantly, when they don’t.

I’m currently participating in SPAR and other AI safety programs. Expect posts on mechanistic interpretability, feature visualization, and what happens when you actually look inside the black box.

Let’s Connect

If you’re working on multi-agent systems, clinical AI, or interpretability—I’d love to hear from you. The best ideas come from unexpected conversations.

You can reach me at quidwaiali@gmail.com or connect on LinkedIn.

Let’s build something.

Ali Quidwai