IFStruct: Measuring structured-output compliance

Structured output remains a common failure mode for language models, especially when schemas become complex and strings require careful escaping. Constrained generation can enforce syntactic validity, but it cannot by itself make the model choose the right fields, values, or escaped content. Even under a schema constraint, the model's logits still need to meaningfully reflect the user's requested structure.

We built IFStruct, a generative benchmark that tests output validity and schema following. The task is highly learnable for small models: LFM2.5-350M, trained with RL on a dedicated held-out training split, can exceed the performance of far larger models like Qwen3.5-4B and granite-4.0-h-tiny.

Structured output is one of the most common real-world tasks for LLMs. IFStruct targets something existing benchmarks miss: organic user requests present schema requirements in a variety of ways, often with additional formatting constraints.

What IFStruct measures

The intent of IFStruct is to give a narrow signal on the question:

"Can the model produce valid structured output and follow diverse schema requirements?"

Each IFStruct prompt asks the model to generate a handful of instances of a randomized item ("Generate two recipes for blueberry pancakes"), with the schema requirements presented in a variety of ways. The content itself is not assessed, only the structure. This is by design: we aim to measure the ability to follow instructions for output format in isolation, without conflating the signal with generative quality, data extraction ability, or reasoning ability.

The verifier requires that output satisfy a target schema, including the exact fields, types, enums, numeric bounds, and item counts. The requested output format is either JSON or YAML, with additional sampled constraints such as code fencing and the option to allow or disallow extra commentary.

Scoring is binary: a sample passes only if every requested structural constraint is satisfied. The task difficulty is calibrated so that frontier models can achieve near-100% accuracy.
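As a minimal sketch of the all-or-nothing rule, assuming each sample's structural checks are available as named booleans (the check names below are hypothetical, not taken from the IFStruct validator):

```python
from typing import Dict, Iterable


def sample_passes(checks: Dict[str, bool]) -> bool:
    """A sample passes only if every structural check passed."""
    return all(checks.values())


def pass_rate(results: Iterable[bool]) -> float:
    """Percentage of passing samples; 0.0 for an empty set."""
    results = list(results)
    if not results:
        return 0.0
    return 100.0 * sum(results) / len(results)


# Hypothetical per-sample check outcomes
samples = [
    {"parses": True, "item_count": True, "no_extra_fields": True},
    {"parses": True, "item_count": False, "no_extra_fields": True},
    {"parses": True, "item_count": True, "no_extra_fields": True},
    {"parses": False, "item_count": False, "no_extra_fields": False},
]

score = pass_rate(sample_passes(s) for s in samples)
print(f"{score:.2f}")  # 50.00
```

There is no partial credit in this scheme: the second sample fails outright even though two of its three checks pass.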

Generating the dataset

The IFStruct test set was produced generatively by sampling from a set of curated taxonomies to build a diverse set of schemas. An example:

Generate exactly 3 poetry anthologies for a regional writing contest shortlist.
Compose field values as they would appear in a typeset proof copy before printing.
This anthology is for a regional writing contest shortlist.
Each poem text must preserve its line breaks, indentation, and blank lines exactly as the poet laid them out. In the poem texts, include quotation marks around spoken phrases within the poem.

Each poetry anthology for a regional writing contest shortlist should have:
- `anthology_title`: string - Title of the anthology or collection
- `editor_name`: string
- `poems`: a list of 3-4 anthology poems, each with:
- `poems[].poem_title`: string - Title of the poem
- `poems[].poet_name`: string
- `poems[].poem_text`: string - The complete poem with all line breaks, indentation, and stanza gaps preserved
- `poems[].line_count`: integer (between 2 and 40)

Return JSON. Wrap the array in an object with the key `poetry_anthology`.
Follow the requested field set exactly, without adding related fields.
No code block needed - output the JSON directly.

The example above presents the schema as a plain-English bullet list, but that is only one of several presentation styles sampled independently of the schema itself.

Some samples present the required format as an annotated example:

[  // 1 item
  {
    "vendor_name": string,  // Vendor, contractor, or company issuing the invoice
    "invoice_total_usd": number (≥50, ≤25000),
    "tax_total_usd": number (≥0, ≤3000),
    "currency": "usd"|"eur"|"gbp"|"cad",
    "paid_by_bank_transfer_allowed": boolean
  },
  ...
]

Half of the samples were then rewritten by an LLM in human-like prose to emulate a natural chat request, where requirements are folded in as plain English sentences. Multiple writing styles were used in the rewriting step for variety, including “thinking-out-loud”, where multiple revisions and changes of mind occur before the final spec is settled. The intention is to create additional realism and difficulty beyond simply presenting a neat, already-finalized schema to follow.

The full set of presentation styles includes:

  • An ad-hoc description of the schema in a naturally written chat request
  • Bullet points with explicit paths (poems[].poem_text)
  • Raw JSON Schema
  • Annotated structure examples (JSON or YAML)
  • Flat path glossaries
  • ASCII tables
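To make the idea of rendering one schema in several styles concrete, here is a rough sketch of the explicit-path bullet style. The field-spec dictionary layout and the `render_bullets` helper are assumptions made for illustration; the post does not show IFStruct's internal schema representation.

```python
from typing import List


def render_bullets(fields: dict, prefix: str = "") -> List[str]:
    """Render a (hypothetical) field spec as bullets with explicit paths."""
    lines = []
    for name, spec in fields.items():
        path = f"{prefix}{name}"
        if spec["type"] == "list":
            lo, hi = spec["min_items"], spec["max_items"]
            lines.append(f"- `{path}`: a list of {lo}-{hi} {spec['item_name']}, each with:")
            # Nested fields are addressed as parent[].child
            lines.extend(render_bullets(spec["fields"], prefix=f"{path}[]."))
            continue
        text = f"- `{path}`: {spec['type']}"
        if "min" in spec and "max" in spec:
            text += f" (between {spec['min']} and {spec['max']})"
        if "description" in spec:
            text += f" - {spec['description']}"
        lines.append(text)
    return lines


anthology_fields = {
    "anthology_title": {"type": "string", "description": "Title of the anthology or collection"},
    "editor_name": {"type": "string"},
    "poems": {
        "type": "list", "item_name": "anthology poems", "min_items": 3, "max_items": 4,
        "fields": {
            "poem_title": {"type": "string", "description": "Title of the poem"},
            "poet_name": {"type": "string"},
            "poem_text": {"type": "string", "description": "The complete poem with all line breaks, indentation, and stanza gaps preserved"},
            "line_count": {"type": "integer", "min": 2, "max": 40},
        },
    },
}

print("\n".join(render_bullets(anthology_fields)))
```

Because the renderer only reads the spec, the same schema could be passed to other renderers (a path glossary, an ASCII table) without changing what the validator checks.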

Many requests deliberately stress string escaping, asking for multiline free text with embedded quotes, code snippets, file paths, and stack traces to be packed into otherwise valid JSON or YAML, a common real-world breaking point.
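A small illustration of why escaping is a breaking point, using only Python's standard `json` module. The sample text is made up; the point is that a serializer escapes quotes, newlines, tabs, and backslashes, while hand-assembled output that leaves any of them raw no longer parses.

```python
import json

# Free text with embedded quotes, a tab, a blank line, and a Windows path
poem_text = 'She said, "wait"\n\tand the line held.\n\nSaved to C:\\drafts\\poem.txt'

# A serializer escapes everything that needs escaping
encoded = json.dumps({"poem_text": poem_text})
assert json.loads(encoded)["poem_text"] == poem_text  # lossless round trip


def is_valid_json(text: str) -> bool:
    """Return True if the text parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# A raw newline inside a JSON string is an invalid control character
raw_newline = '{"poem_text": "line one\nline two"}'
# An unescaped inner quote ends the string early
raw_quote = '{"poem_text": "She said, "wait""}'

print(is_valid_json(encoded), is_valid_json(raw_newline), is_valid_json(raw_quote))
# True False False
```

A model has no serializer to lean on: it has to emit the escaped form token by token, which is what these prompts probe.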

Validation

The validator extracts the content from a fenced block if there is one and otherwise falls back to parsing the raw response. Fencing affects the result only when the prompt explicitly required or forbade it.
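The extract-then-fall-back behaviour could look roughly like the sketch below. The regular expression, the function names, and the policy labels ("any", "forbidden", "required", or a specific fence label) are assumptions for illustration, not the validator's actual code.

```python
import re
from typing import Optional, Tuple

FENCE = "`" * 3  # a triple-backtick code fence marker
FENCE_RE = re.compile(
    FENCE + r"([A-Za-z0-9_+-]*)[ \t]*\n(.*?)\n?" + FENCE, re.DOTALL
)


def extract_payload(response: str) -> Tuple[str, Optional[str]]:
    """Return (payload, fence_label).

    fence_label is None when no fence was found, "" for an unlabeled
    fence, and the lowercased label (for example "json") otherwise.
    Note: a payload that itself contains a fence marker would be cut short.
    """
    match = FENCE_RE.search(response)
    if match:
        return match.group(2), match.group(1).lower()
    return response.strip(), None


def fence_ok(fence_label: Optional[str], policy: str) -> bool:
    """Check fencing only when the prompt said something about it."""
    if policy == "any":          # prompt was silent: never penalize
        return True
    if policy == "forbidden":    # "output the JSON directly"
        return fence_label is None
    if policy == "required":     # any fence will do
        return fence_label is not None
    return fence_label == policy  # a specific fence such as "json" or "yaml"


response = "Here you go:\n" + FENCE + 'json\n{"a": 1}\n' + FENCE + "\nDone."
payload, label = extract_payload(response)
print(payload, label)
```

Keeping extraction separate from the policy check means an unrequested fence does not fail a sample on its own; the payload is still parsed and judged on structure.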

Two kinds of constraints are checked. The first kind is sampled per prompt, independently of the schema, and governs the shape of the response:

  • Format: JSON or YAML.
  • Top-level shape: a bare array, or a single object wrapping the array under an exact key ("Wrap the array in an object with the key poetry_anthology.").
  • Item count: an exact number ("Generate exactly 3…") or a range ("Generate 2-4…").
  • Code fencing: required in a specific fence (```json, ```yaml), required generically, or explicitly forbidden ("output the JSON directly").
  • Commentary: sometimes allowed, sometimes banned ("Output only the JSON, with no additional text or commentary.").
  • No extra fields: every prompt forbids keys beyond the requested set.
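The top-level shape and item-count checks from this list can be sketched as follows. The helper names are hypothetical, and treating a wrapper object with any additional top-level key as a failure is an assumption based on the "exact key" wording.

```python
from typing import Any, Optional


def unwrap_items(parsed: Any, wrapper_key: Optional[str]) -> Optional[list]:
    """Return the item list, or None if the top-level shape is wrong.

    wrapper_key=None means the prompt asked for a bare array.
    """
    if wrapper_key is None:
        return parsed if isinstance(parsed, list) else None
    # Assumption: the wrapper object must contain exactly the requested key
    if isinstance(parsed, dict) and set(parsed) == {wrapper_key}:
        items = parsed[wrapper_key]
        return items if isinstance(items, list) else None
    return None


def count_ok(items: list, min_items: int, max_items: int) -> bool:
    """An exact count is the special case min_items == max_items."""
    return min_items <= len(items) <= max_items


parsed = {"poetry_anthology": [{"t": 1}, {"t": 2}, {"t": 3}]}
items = unwrap_items(parsed, "poetry_anthology")
print(items is not None and count_ok(items, 3, 3))  # True
```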

The second comes from the schema itself, checked by a recursive walk over every object and nested list:

  • Fields: required fields must be present. Optional ones may be omitted with no penalty. A response fails if it contains any key that the schema did not ask for. The rule is strict on purpose because a model that invents fields does not follow the schema, and for the code consuming the output, an unexpected key is noise at best and a parsing break at worst.
  • Types: strings, integers, numbers, booleans, with a boolean never accepted for a numeric field or the reverse.
  • Valid values: enum values must match one of the allowed literals, and numeric values must fall within the allowed range.
  • Nested structure: nested objects and lists of objects are validated to the same standard, including minimum and maximum lengths on the nested lists.
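A recursive walk of this kind might look like the sketch below. The field-spec layout ("type", "required", "enum", "min", "max", "fields", "min_items", "max_items") is a hypothetical representation chosen for the example, and rejecting integer-valued floats such as 4.0 for integer fields is likewise an assumption.

```python
from typing import Any


def check_value(value: Any, spec: dict) -> bool:
    kind = spec["type"]
    if kind == "object":
        return check_object(value, spec["fields"])
    if kind == "list":
        if not isinstance(value, list):
            return False
        if not spec.get("min_items", 0) <= len(value) <= spec.get("max_items", float("inf")):
            return False
        return all(check_object(item, spec["fields"]) for item in value)
    if kind == "string":
        ok = isinstance(value, str)
    elif kind == "boolean":
        ok = isinstance(value, bool)
    elif kind == "integer":
        # bool is a subclass of int in Python, so exclude it explicitly
        ok = isinstance(value, int) and not isinstance(value, bool)
    elif kind == "number":
        ok = isinstance(value, (int, float)) and not isinstance(value, bool)
    else:
        ok = False
    if not ok:
        return False
    if "enum" in spec and value not in spec["enum"]:
        return False
    if "min" in spec and value < spec["min"]:
        return False
    if "max" in spec and value > spec["max"]:
        return False
    return True


def check_object(obj: Any, fields: dict) -> bool:
    if not isinstance(obj, dict):
        return False
    if set(obj) - set(fields):  # any key the schema did not ask for
        return False
    for name, spec in fields.items():
        if name not in obj:
            if spec.get("required", True):
                return False
            continue  # optional field omitted: no penalty
        if not check_value(obj[name], spec):
            return False
    return True


# The invoice item from the annotated example, in the hypothetical spec layout
invoice_fields = {
    "vendor_name": {"type": "string"},
    "invoice_total_usd": {"type": "number", "min": 50, "max": 25000},
    "tax_total_usd": {"type": "number", "min": 0, "max": 3000},
    "currency": {"type": "string", "enum": ["usd", "eur", "gbp", "cad"]},
    "paid_by_bank_transfer_allowed": {"type": "boolean"},
}
good = {
    "vendor_name": "Acme", "invoice_total_usd": 1200.5, "tax_total_usd": 96,
    "currency": "usd", "paid_by_bank_transfer_allowed": True,
}
print(check_object(good, invoice_fields))                    # True
print(check_object({**good, "notes": "x"}, invoice_fields))  # False
```

The explicit bool exclusion matters in Python because `True` would otherwise satisfy an integer or number check.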

Results

The generative design of the dataset makes training data easy to produce. We trained LFM2.5-350M with GRPO on a held-out training split.
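The post names GRPO but does not describe the training setup, so the following is only the generic group-relative advantage step applied to binary pass/fail rewards, not the authors' training code. The function name, the use of the population standard deviation, and the epsilon value are assumptions.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against its own sampling group.

    Each completion's advantage is its reward minus the group mean,
    divided by the group standard deviation (plus eps for stability).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions for one prompt, scored by a binary structural verifier
mixed = group_relative_advantages([1.0, 0.0, 0.0, 0.0])
uniform = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

print([round(a, 3) for a in mixed])
print(uniform)
```

With a binary reward, a group in which every completion passes (or every one fails) yields zero advantage for all members, so the learning signal comes from prompts where the model succeeds only some of the time.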

The base checkpoint passes just 21.10% of test samples. After RL, the same model scores 44.90%, ahead of far larger models such as Qwen3.5-4B (36.25%) and granite-4.0-h-tiny (38.75%).

| Model | IFStruct | StructEval† | Structured Output Benchmark (Text)* |
|---|---|---|---|
| gemma-4-31B-it | 95.90 | 80.21 | 85.12 |
| gpt-oss-20b | 91.95 | 79.19 | 85.28 |
| Nemotron-3-Nano-30B-A3B | 86.80 | 76.90 | 84.57 |
| Qwen3-8B | 79.75 | 78.06 | 85.18 |
| gemma-4-E4B-it | 76.65 | 75.64 | 81.77 |
| granite-4.1-8b | 68.45 | 74.69 | 83.64 |
| gemma-4-E2B-it | 64.85 | 76.18 | 78.05 |
| LFM2.5-350M + IFStruct RL | 44.90 | 45.99 | 68.00 |
| granite-4.0-h-tiny | 38.75 | 67.87 | 79.50 |
| Qwen3.5-4B | 36.25 | 72.75 | 81.03 |
| Qwen3.5-2B | 33.15 | 69.18 | 80.82 |
| LFM2.5-350M | 21.10 | 43.80 | 56.07 |
| Qwen3.5-0.8B | 15.50 | 49.45 | 70.03 |
| granite-4.0-h-350m | 1.70 | 26.34 | 6.44 |
| gemma-3-270m-it | 0.50 | 31.28 | 20.50 |

* The Structured Output Benchmark was evaluated on the text-only subset, and the score reported is a difficulty-weighted mean of reported metrics. Link: https://huggingface.co/datasets/interfaze-ai/sob
† We evaluate the non-renderable subset of StructEval, which covers JSON, CSV, TOML, XML, and YAML. Link: https://github.com/TIGER-AI-Lab/StructEval

Prior work in this domain targets schema-following and output validity, often in conjunction with instruction-following, data extraction, and reasoning components. On schema conformance, JSONSchemaBench [1] evaluates around 10,000 real-world JSON schemas, mainly to compare constrained-decoding engines, though its harness integration also scores a model's native, unconstrained ability to satisfy a schema. StructEval [2] spans 18 formats and 44 task types with soft, partly content-based scoring. DeepJSONEval [3] stresses deep nesting. Another class of structured-output evals targets accuracy in extraction: Structured Output Benchmark (SOB) [4] across text, image, and audio sources, ExtractBench [5] for document-to-JSON, and LLMStructBench [6] under several API and prompt enforcement settings, with STED [7] offering a soft tree-edit similarity metric in place of strict scoring.

The gap IFStruct fills is to present structured output requirements in diverse and realistic ways, including ad-hoc, naturalistic chat requests. We include constraints that users commonly request, like code fencing or disallowing extra commentary, as well as requests that are adversarial to common failure modes, such as string escaping.

What's next / limitations

The design choice of assessing only the structural requirements, not content correctness or quality, means a respondent can ignore some of the content-related parts of the request (like “output 10 lines of poetry”) and still pass, as long as they have followed the schema. Because of the lack of quality validation, when optimizing for the eval (e.g., during RL), one should add their own quality signal, such as an LLM-judge, as part of the reward function. 
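One way to fold a quality signal into the reward is sketched below. The function name, the gating scheme, and the default weight are all assumptions; `structure_passes` stands in for a structural validator and `quality_score` for whatever judge is used (for example an LLM judge returning a score in [0, 1]).

```python
from typing import Callable


def combined_reward(
    response: str,
    structure_passes: Callable[[str], bool],
    quality_score: Callable[[str], float],
    quality_weight: float = 0.5,
) -> float:
    """Gate on structure, then blend in a content-quality score.

    A structural failure earns 0. A structural pass earns a base reward
    of (1 - quality_weight), plus up to quality_weight for quality.
    """
    if not structure_passes(response):
        return 0.0
    quality = min(1.0, max(0.0, quality_score(response)))  # clamp to [0, 1]
    return (1.0 - quality_weight) + quality_weight * quality


# Stub callables standing in for a real validator and judge
always_valid = lambda response: True
never_valid = lambda response: False
fixed_judge = lambda response: 0.8

print(round(combined_reward("{}", always_valid, fixed_judge), 3))  # 0.9
print(combined_reward("{}", never_valid, fixed_judge))             # 0.0
```

Gating keeps the structural check decisive, while the quality term discourages outputs that satisfy the schema but ignore the content of the request.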

The difficulty of IFStruct requests is calibrated to be discriminative for low- to mid-ability models and saturates at the frontier ability level. Future iterations could increase difficulty by targeting failure modes with extra content-based requirements that are explicitly assessed, or by increasing schema nesting and complexity.

The validator doesn't yet cover everything: regex-shaped string patterns, date and email formats, length bounds, array uniqueness, conditional dependencies, and duplicate key rejection are not enforced, and the prompts do not deliberately stress YAML's scalar coercion gotchas (an unquoted no becomes a boolean, 1.10 becomes 1.1).
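As an illustration of one of these gaps, duplicate key rejection can be added to a JSON parse with the standard library's `object_pairs_hook`. This is a sketch of how such a check might be written, not part of the current validator; the helper and exception names are made up.

```python
import json


class DuplicateKeyError(ValueError):
    """Raised when a JSON object repeats a key."""


def reject_duplicates(pairs):
    """object_pairs_hook that fails on the first repeated key."""
    seen = {}
    for key, value in pairs:
        if key in seen:
            raise DuplicateKeyError(f"duplicate key: {key!r}")
        seen[key] = value
    return seen


def loads_strict(text: str):
    # The hook runs for every object, including nested ones
    return json.loads(text, object_pairs_hook=reject_duplicates)


dup = '{"currency": "usd", "currency": "eur"}'

# The default parser silently keeps the last value
print(json.loads(dup))  # {'currency': 'eur'}

try:
    loads_strict(dup)
except DuplicateKeyError as err:
    print("rejected:", err)
```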

Conclusion

Many production LLM workflows depend on structured output. However, most evaluations either rely on constrained decoding or score formatting compliance and content quality together. This makes it hard to isolate a model’s output validity and schema following.

We're releasing IFStruct as an open-source benchmark that measures structured-output compliance, with the test set available on Hugging Face.

Acknowledgements

Written by Sam Paech, with contributions from Maxime Labonne, Tim Seyde, and Leonie Monigatti.

References

[1] Saibo Geng, Hudson Cooper, Michał Moskal, Samuel Jenkins, Julian Berman, Nathan Ranchin, Robert West, Eric Horvitz, and Harsha Nori. (2025). Generating Structured Outputs from Language Models: Benchmark and Studies (JSONSchemaBench). https://arxiv.org/abs/2501.10868

[2] Li et al. (2025). StructEval: Benchmarking LLMs' Capabilities to Generate Structural Outputs. https://arxiv.org/abs/2505.20139

[3] DeepJSONEval: Benchmarking Complex Nested JSON Data Mining for Large Language Models. (2025). https://arxiv.org/abs/2509.25922

[4] Abhinav Kumar Singh, Harsha Vardhan Khurdula, Yoeven D Khemlani, and Vineet Agarwal. (2026). The Structured Output Benchmark: A Multi-Source Benchmark for Evaluating Structured Output Quality in Large Language Models (SOB). https://arxiv.org/abs/2604.25359

[5] Ferguson et al. (2026). ExtractBench: A Benchmark and Evaluation Methodology for Complex Structured Extraction. https://arxiv.org/abs/2602.12247

[6] Sönke Tenckhoff, Mario Koddenbrock, and Erik Rodner. (2026). LLMStructBench: Benchmarking Large Language Model Structured Data Extraction. https://arxiv.org/abs/2602.14743

[7] Wang et al. (2025). STED and Consistency Scoring: A Framework for Evaluating LLM Structured Output Reliability. https://arxiv.org/abs/2512.23712

[8] Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. (2023). Instruction-Following Evaluation for Large Language Models (IFEval). https://arxiv.org/abs/2311.07911