Update 10: Safety through anonymity (Part 2)

08 May 2026

This week, we continued exploring ways to anonymise data, a topic we first wrote about in a previous post, focusing on a new model from OpenAI called Privacy Filter.

A lot of the data our tools need to work with is messy and sensitive. It may be spread across a range of disconnected places, such as forms, comments, emails, or exported files like PDFs, and it very often contains personal information, along with clues that could be used to reveal a person’s identity.

Before we can safely use AI to analyse or transform this data, we need a reliable and efficient method for anonymising it, and OpenAI’s new model is full of promise.

Unlike many of OpenAI’s other models, it is “open-weight”, which means it can be downloaded and run locally on infrastructure we control. More importantly, it is not just a rule-based detector looking for obvious things like emails and phone numbers. It is designed to take context into account, which matters a lot in real-world education data, where a pupil is not always identified by a neat “first name, last name” pattern. They might be identified by their initials, or by implication through references to things such as class names. Rule-based techniques such as pattern matching struggle with cases like these.
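
To illustrate the limitation, here is a minimal sketch of a rule-based detector. It is not taken from our tooling or from Privacy Filter: the two patterns, the function name `redact_with_rules` and the sample sentence are all invented for this example. It handles the email address and phone number, but leaves the initials and the class name untouched.

```python
import re

# Hypothetical rule-based detector with two illustrative patterns only.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\b0\d{4} ?\d{6}\b"),  # simplified UK mobile format
}


def redact_with_rules(text: str) -> str:
    """Replace every pattern match with a placeholder such as [EMAIL]."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


# Invented sample text, not real pupil data.
sample = (
    "Contact mum at j.smith@example.com or 07700 900123. "
    "J.S. in class 4B struggled with reading this term."
)
redacted = redact_with_rules(sample)
print(redacted)
```

The initials and class name survive because nothing about their shape marks them as personal information. Only the surrounding context does, which is the gap a context-aware model is meant to close.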

Because the model can run locally, we could in principle anonymise data before it leaves a customer’s computer and enters a wider AI workflow. This fits our intended approach well: reduce the risk in the input data through local anonymisation, then allow an externally hosted AI to work with it.
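
The shape of that workflow can be sketched as follows. This is an assumption-laden outline rather than our implementation: `detect_spans_locally` is a hypothetical stand-in for the locally run model (here it only looks up a fixed list of demo terms), and the `Span` type, labels and sample sentence are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int  # index of the first character to redact
    end: int    # index one past the last character to redact
    label: str  # kind of personal information, e.g. "NAME"


def detect_spans_locally(text: str) -> list[Span]:
    """Hypothetical stand-in for a locally run detection model.

    It only finds a fixed list of demo terms; a real model would
    predict spans from context instead.
    """
    demo_terms = {"Jamie Smith": "NAME", "4B": "CLASS"}
    spans = []
    for term, label in demo_terms.items():
        start = text.find(term)
        while start != -1:
            spans.append(Span(start, start + len(term), label))
            start = text.find(term, start + len(term))
    return spans


def redact(text: str, spans: list[Span]) -> str:
    """Replace each span with a placeholder, working right to left
    so that earlier offsets stay valid."""
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        text = text[:span.start] + f"[{span.label}]" + text[span.end:]
    return text


def prepare_for_hosted_ai(text: str) -> str:
    """Anonymise locally; only the returned text should leave the machine."""
    return redact(text, detect_spans_locally(text))


# Invented sample text, not real pupil data.
original = "Jamie Smith in 4B needs extra reading support."
outbound = prepare_for_hosted_ai(original)
print(outbound)
```

The point of the design is the ordering: detection and redaction both happen on the local machine, and only `outbound` is handed to the externally hosted AI.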

Another useful quality is that the model can be fine-tuned. In simple terms, fine-tuning means taking a trained model and adapting it to perform better in a particular use case. This is different from full model training, which starts from scratch or near-scratch and requires huge datasets and biblical amounts of computing power, or “compute” as it is now known.

While the default model already covers common personal information, such as names, addresses and emails, school data has additional personal information types we need to consider. Being able to fine-tune the model to better handle these additional types will make it possible for us to progressively improve the performance of our automated anonymisation method.

Our initial tests have been promising. The model was very good at detecting and anonymising names, dates and phone numbers, though it missed a few instances of initials. This gap is a good candidate for fine-tuning, as initials are commonly used in documents such as Education, Health and Care Plans (EHCPs).

We wish you a fantastic weekend.