The language of the boundary

It all started with a simple need: using a generated JSON schema in one of our endpoint operations defined with Open API Spex. What should have taken ten minutes turned into one year of hobby-time coding sessions to build a JSON Schema validator library for Elixir.

The root cause

Three years ago, I was working on an ETL application. Such applications generally define “transformers”, and we were exporting a full JSON Schema listing all possible transformers and their parameters, so our frontend could generate the corresponding UI and validate the pipelines before updating them on the Elixir backend.

That JSON schema was made available to the frontend but it was not yet used in controllers to validate the requests that created or updated pipelines.

We had an operation definition for Open API Spex, and the schema at something like priv/schemas/pipeline.schema.json. Now, how do you merge the two? How would you do something like this:

defmodule PipelineSchema do
  require OpenApiSpex
  alias OpenApiSpex.Schema

  OpenApiSpex.schema(%{
    properties: %{
      name: %Schema{type: :string},
      definition: schema_file("priv/schemas/pipeline.schema.json")
    }
  })
end

How can you define the schema_file/1 function in a way that works with Open API Spex?

Well, you can’t. Open API Spex is based on OpenAPI 3.0, which does not implement JSON Schema but rather a subset/superset of it, and it works only with its own schema definition structs. It cannot consume an arbitrary JSON schema.

By that time, OpenAPI 3.1 was already around, and this version of the spec ditched the subset/superset nonsense to integrate with standard JSON schemas.

So I started to build an alternative OpenAPI library from scratch that would implement the 3.1 spec and JSON Schema validation, with support for arbitrary raw schemas instead of relying on a custom definition.

A few months in, after a couple of rewrites, the codebase had grown significantly but it was still only about JSON schemas, so I extracted it, thinking that if it was flexible enough to build an OpenAPI library on top of, it would probably be useful for someone else as well. And then JSV was born.

Now, instead of doing all of this, I could have used a different technology. But I had the feeling that something was missing. I really wanted to keep our JSON-Schema-based validation, using the exact same rules in the frontend and backend. Validating with something else was never in question.

I now have a clearer view of why it was the right call, and that is what this post is about.

The language of the boundary

Why are you validating stuff anyway?

Let me state the obvious: we need to validate data when it cannot be trusted. You should be able to trust data that is already in your system, so in general untrusted data is external: it comes from inputs, crossing a boundary.

But validation rules are not only a defense mechanism, they establish a contract between parties that have different interests. The validation step is not a wall, but a gate. Your application may define the shape of the data, but both sides of the boundary rely on it. The other side is unknown: doc generators, client generators, client test suites, frontend generators, etc.

The people I work with are Python or TypeScript developers; they don’t use any of the validation libraries that are available for Elixir. They don’t even know they exist. But they do know JSON Schema. So we generally start from there, writing contracts together in a language that everyone knows well, a lingua franca: JSON schemas. From there, everyone is able to work on their own side without any care for how the other side is implemented, as there is no reason to care.

It doesn’t matter what language we write the inside of a black box in but it does matter what comes in and what goes out. […] What we need to do is pin down what’s happening at the boundaries, and observe the boundaries of the program.

Joe Armstrong – The Do’s and Don’ts of Error Handling

Admittedly, JSON Schema is a bit verbose, it has some quirks and it is not pretty. But it is usable in all common stacks, every developer or LLM can read it. To me it is the language of the boundary. What is yours?

The landscape

Every now and then, a new data validation library is announced on Elixir Forum. I can understand why, as I was there too, announcing JSV and hoping for a good reception, when I could just have used ExJsonSchema.

Writing code is fun! Design, architecture, even some bugs can be fun. When I started JSV I had a simple mental model: if the schema says the type is integer, then use is_integer/1. Easy peasy. It looks like a nice problem to solve.

Many of these new libraries are DSL-based. A nice DSL always looks good, and JSV has some macros too. Plus, macros are so easy to write in Elixir that everyone wants to have some fun with them. My first one had a “now I’m a real Elixir developer” feel.

Macros are also very practical to use in Elixir, so it really feels like using a DSL-based validation library is the right call, especially in the early stages of a project where it’s more important to build fast than to lay the foundations of an ecosystem.

Such libraries keep coming, we now have more than a dozen, but I feel like they only solve “the validation on the Elixir side” problem. It is a very important problem to solve and some do it very well, allowing you to use the full Elixir expressiveness we love. But to me it is only half of the problem scope. When another program needs the validation rules, complexity will ring at the door.

The problem they should be solving is a boundary problem, and for this you need a solution that can cross the boundary. This cannot be Elixir code.

Rules as data

Programs do not cross the boundary. But data does, and for this reason, validation rules should not be programs but data. This is why JSON Schema is a very good solution to the untrusted input validation problem.

Rules-as-data can be derived from validation rules written in a DSL, keywords, or structs. Many libraries provide a JSON Schema conversion feature, so you can share the rules with the world. It works well until there is drift. If the library that you are using does not have this feature, you’re in trouble. Maintaining two sets of rules manually, the Elixir code that actually validates and the rules exposed to the world, is just hell.

And even when you can derive the validation rules from validation code, you do not have a single source of truth, one that every other party is based upon. This single source is much easier to achieve when it is expressed simply in maps, lists and names. JSON does not have atoms, tuples, structs, records or funs. It can feel lacking but it is a common ground. Every programming language can represent maps, lists and names. Even most human languages can express the same ideas as a JSON schema. This is a real strength.
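To make that concrete, here is a small sketch (the schema itself is made up): the same plain term that JSV validates with can be encoded and handed to any other stack unchanged.

```elixir
# A made-up schema, expressed only in maps, lists and strings.
schema = %{
  "type" => "object",
  "required" => ["email"],
  "properties" => %{"email" => %{"type" => "string"}}
}

# The same term validates on the Elixir side...
root = JSV.build!(schema)
{:ok, _} = JSV.validate(%{"email" => "hello@example.com"}, root)

# ...and serializes as-is for everyone else: this JSON string is the
# single source of truth any client, doc generator or test suite can use.
json_for_frontend = Jason.encode!(schema)
```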

In Elixir there is always the atom vs. string problem. In code, atoms are preferred by almost all of us, but when loading from a file you want strings. You can parse a JSON file with atom keys, but you don’t want to do that at runtime with dynamically generated schemas, as atoms are never garbage collected.
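A quick illustration of the mismatch (nothing JSV-specific here):

```elixir
# In code, we naturally write atom keys:
inline = %{type: :string}

# But decoding JSON yields string keys:
decoded = Jason.decode!(~s({"type": "string"}))
# decoded == %{"type" => "string"}

# Jason.decode!(json, keys: :atoms) would give atom keys, but doing that
# with dynamically generated schemas creates new atoms at runtime, and
# those are never reclaimed by the VM.
```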

So this translated into a simple decision for JSV: any term that resembles a JSON schema should be usable as a schema.

This was a hard constraint and required building a normalization pipeline so JSV would accept all forms of schemas, including structs thanks to a protocol.

So for instance the same schema can be written as %{type: :string} or %{"type" => "string"}, or a mix of the two:

schemas = [
  %{type: :string, format: :date},
  %{"type" => "string", "format" => "date"}
]

schemas
|> Enum.map(&JSV.build!(&1, formats: true))
|> Enum.map(&JSV.validate!("2020-12-01", &1, cast_formats: true))
[~D[2020-12-01], ~D[2020-12-01]]

As stated above, JSV ships with the traditional defschema macro too:

defmodule SugarSchema do
  use JSV.Schema

  defschema name: string(),
            age: integer()
end

It is a concise way to quickly define schemas. But when you want to use schemas from a shared repo, or a local set of files, you may want to do this instead:

File.write!("/tmp/my-schema.json", """
{
  "type": "object",
  "properties": {
    "name": {"type": "string"},
    "age": {"type": "integer"}
  }
}
""")

defmodule FileBasedSchema do
  use JSV.Schema

  defschema "/tmp/my-schema.json"
            |> File.read!()
            |> Jason.decode!(keys: :atoms)
end

Migrating from one solution to the other is easy, as both structs are equivalent:

data = %{"name" => "Alice", "age" => 123}

[
  JSV.validate!(data, JSV.build!(SugarSchema)),
  JSV.validate!(data, JSV.build!(FileBasedSchema))
]
[
  %SugarSchema{age: 123, name: "Alice"},
  %FileBasedSchema{age: 123, name: "Alice"}
]

The point is that validation rules should remain data even at compile-time. Storing a schema in a module attribute is another simple way to do it.
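For instance, something like this (module and function names are made up) keeps the rules as a plain term that can be exposed to the other side of the boundary, while still validating locally:

```elixir
defmodule PipelineContract do
  # The rules stay plain data in a module attribute...
  @schema %{
    "type" => "object",
    "properties" => %{"name" => %{"type" => "string"}}
  }

  # ...so they can be served as-is, e.g. to a frontend:
  def schema, do: @schema

  # Built on demand here for simplicity; you could also cache the
  # built root to avoid rebuilding on every call.
  def validate(data), do: JSV.validate(data, JSV.build!(@schema))
end
```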

Other JSON Schema libraries may have similar tooling (Exonerate, for instance, lets you compile schemas directly to Elixir code; it’s very cool). Or you can just use Elixir to manipulate your schemas when needed, since it’s just maps, lists and names.

The spec

Okay buddy, JSON Schema is great and all but what dialect should I pick then? Draft 4? Draft 5? Draft 6? Draft 7? Draft 8? Wait there is no Draft 8! They can’t even get versioning right, Draft 8 is actually 2019-09! Should I pick 2019-09? Or 2020-12 maybe?

Look, there are no more JSON Schema versions than Star Wars movies, it’s not that complicated!

Yes, the versioning is a bit confusing, semver could help there. But it’s because JSON Schema is a living standard, maintained over the years by an organization.

You should use the latest spec, 2020-12. If you consider contracts to be important code, they should be maintained and updated to new versions. Plus, 2020-12 is the version used by OpenAPI. Draft 7 is a good one too: it is used by many systems, and schemas that you import from outside will probably be based on that version. JSV implements both; other libraries will most likely implement at least Draft 7.
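If you want to be explicit rather than rely on a library default, you can pin the dialect in the schema itself with the "$schema" keyword; a sketch, assuming JSV honors the declared dialect when building (which is what the spec intends):

```elixir
# Pin the dialect explicitly in the schema.
schema = %{
  "$schema" => "https://json-schema.org/draft/2020-12/schema",
  "type" => "integer"
}

root = JSV.build!(schema)
JSV.validate!(1234, root)

# A Draft 7 schema would declare
# "$schema": "http://json-schema.org/draft-07/schema" instead.
```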

Many applications have simple needs, most schemas will work with any of these versions anyway.

JSON Schema is not a library

There are multiple JSON Schema libraries in the Elixir ecosystem, and there will be more in the future. If you use JSV and I disappear tomorrow without having time to transfer ownership, or if you try another library and like it more, the migration will be straightforward. If your boss asks you to rewrite your Elixir backend in Next.js with Claude Code because “it’s going to be faster to add new features” then you can directly hand over a package of schemas before quitting.

Choose JSON Schema because it’s a spec; the library you use to validate the schemas is an implementation detail.

JSON Schema validates itself

Contracts are important code, and this code should be checked. JSON Schema lets you validate your rules. As a spec for data validation rules expressed as data, it would be awkward if that was not possible!

This is especially interesting if some of the schemas are created dynamically, and it’s quite easy to do: just use the meta-schema (the "$schema" value) as a "$ref":

bad_schema = %{
  # type should be an array or a string like "integer", "string", etc.
  "type" => %{"string" => true, "integer" => true}
}

meta_schema = %{"$ref": "https://json-schema.org/draft/2020-12/schema"}
root = JSV.build!(meta_schema)
JSV.validate!(bad_schema, root)
** (JSV.ValidationError) json schema validation failed

at: "#"
by: "https://json-schema.org/draft/2020-12/schema#"
errors:
  - (allOf) value did not conform to all of the given schemas

    at: "#/type"
    by: "https://json-schema.org/draft/2020-12/meta/validation#/properties/type"
    errors:
      - (anyOf) value did not conform to any of the given schemas

        at: "#/type"
        by: "https://json-schema.org/draft/2020-12/meta/validation#/properties/type/anyOf/1"
        errors:
          - (type) value is not of type array

        at: "#/type"
        by: "https://json-schema.org/draft/2020-12/meta/validation#/$defs/simpleTypes"
        errors:
          - (enum) value must be one of the enum values: "array", "boolean", "integer", "null", "number", "object" or "string"

    at: "#"
    by: "https://json-schema.org/draft/2020-12/meta/validation#"
    errors:
      - (properties) property 'type' did not conform to the property schema

And of course you can validate the meta schema with itself, as it is a regular JSON schema.

JSON Schema is ubiquitous

Maybe the most important aspect is that, being a well-known spec, JSON Schema is used everywhere. Say you want to accept some Claude Code config from an endpoint. Should you define the entire Claude Code configuration schema with a custom set of rules? Someone else already did, so you may want to use their work instead. You just need the URL for it.

schema = %{
  properties: %{
    name: %{type: "string", description: "name of the agent"},
    maxTokens: %{type: "integer"},
    config: %{"$ref": "https://www.schemastore.org/claude-code-settings.json"}
  }
}

resolver_opts = [
  allowed_prefixes: ["https://www.schemastore.org/"],
  cache_dir: "/tmp/schema-cache"
]

root = JSV.build!(schema, resolver: {JSV.Resolver.Httpc, resolver_opts})

data = %{
  "name" => "Agent Foo",
  "maxTokens" => 1.0e6,

  # Channel is not valid here
  "config" => %{"autoUpdatesChannel" => "not a valid channel"}
}

JSV.validate!(data, root)

As expected with an invalid config, it is rejected:

** (JSV.ValidationError) json schema validation failed

at: "#/config/autoUpdatesChannel"
by: "https://www.schemastore.org/claude-code-settings.json#/properties/autoUpdatesChannel"
errors:
  - (enum) value must be one of the enum values: "stable" or "latest"

at: "#/config"
by: "https://www.schemastore.org/claude-code-settings.json#"
errors:
  - (properties) property 'autoUpdatesChannel' did not conform to the property schema

at: "#"
by: "#"
errors:
  - (properties) property 'config' did not conform to the property schema

Because JSON Schema is ubiquitous, you can easily reuse rules defined by third parties when dealing with them. All with a single "$ref"!

The elephant in the room (Ecto)

Some people claim that Ecto is all you need to validate external input, with the benefit of having the same validation system for the external inputs and data that goes into your database.

Ecto validation rules cannot leave the Elixir world, as they are deeply entangled with database concerns. Ecto deals with the persistence boundary: uniqueness, foreign keys, serialization, etc. It is not about the correctness of the input data but rather the correctness of the business rules.

The end

When I started writing JSV I was only solving a simple technical problem: being able to generate and share our validation rules independently of any backend validation tooling, while still being able to integrate them easily where needed.

I am now convinced that this is not only desirable, but also the basis of collaboration between apps and people using different stacks, and JSON Schema is still the best fit for that in my opinion.

So for your next app, pet project or library, I encourage you to ask yourself whether you are validating data that crosses a boundary and, if so, to validate it with something that can be understood by both sides of that boundary.