Synthetic Data - 3 Things Insurers Need to Understand
One prediction says that by 2024, 60% of the data used to train artificial intelligence systems will be synthetic. Another says that by 2030, synthetic data will completely overshadow real data in AI models. So synthetic data is not a future thing, but a present-day trend that insurers should be thinking about now.
While you may not have heard much about synthetic data, it has been behind some disturbing news stories. Deepfake images, for example, are generated using synthetic data. What this tells us is that, like everything digital, synthetic data is not neutral, but in need of careful scrutiny and governance.
So what is it exactly? Synthetic data is artificially generated data, created using generative AI, that has analytical value. It is not real-world data, but data created by one type of AI for use in another. Its value is that…
“It can be used to replace collected data by preserving or mimicking its properties or to supplement collected data to improve its completeness or to enhance privacy protections.”
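In its simplest form, "mimicking its properties" means fitting a statistical model to real data and sampling new records from it. Here is a minimal sketch in Python, using a deliberately simple multivariate normal model and hypothetical column names, not a production-grade generator:

```python
import numpy as np
import pandas as pd

def synthesise(real: pd.DataFrame, n_rows: int, seed: int = 0) -> pd.DataFrame:
    """Sample synthetic rows that preserve the means and covariance of the
    real (numeric) data. A toy stand-in for a real synthetic-data generator."""
    rng = np.random.default_rng(seed)
    mean = real.mean().to_numpy()
    cov = real.cov().to_numpy()
    samples = rng.multivariate_normal(mean, cov, size=n_rows)
    return pd.DataFrame(samples, columns=real.columns)

# Hypothetical policy data with illustrative fields.
rng = np.random.default_rng(1)
real = pd.DataFrame({
    "age": rng.normal(45, 12, 5000),
    "annual_premium": rng.normal(620, 150, 5000),
})
synthetic = synthesise(real, n_rows=5000)

# The synthetic rows are new, yet their summary statistics track the originals.
print(real.describe().loc[["mean", "std"]])
print(synthetic.describe().loc[["mean", "std"]])
```

A real generator would capture far richer structure than this, but the principle is the same: no row corresponds to a real person, yet the dataset behaves statistically like the one it replaces or supplements.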
A significant application of synthetic data today is in computer vision for autonomous vehicles. The software is trained on synthetically generated images rather than on millions of hours of captured and labelled real-life footage. An autonomous vehicle will therefore be steered only partly by what has actually been observed on real roads. The implications of that need to be understood.
I’m going to look at three aspects of synthetic data that insurers need to understand.
It’s More than Technical
There’s a tendency to look at synthetic data through a technological lens. This is understandable, but also partial. It’s important to also look at synthetic data through a socio-economic lens. That’s because all data comes into existence as part of a social, economic and political process.
Synthetic data is shaped by the scope and depth of real-world data. It is needed because of issues with that real-world data, such as completeness, expense and utility. So before an insurer asks whether it needs synthetic data, it needs to understand how complete its data is, and why that is so. Is the data difficult to come by, and if so, for what reasons? Like it or not, race, gender and poverty are recognised factors in data being difficult to come by.
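That completeness question can, as a starting point, be asked of the data itself. A minimal sketch, assuming a pandas DataFrame with hypothetical column names, showing how missingness can be broken down by a group of interest:

```python
import pandas as pd

# Hypothetical claims extract with illustrative fields.
claims = pd.DataFrame({
    "region": ["urban", "urban", "rural", "rural", "rural"],
    "income": [32000, None, None, 18000, None],
    "claim_amount": [1200, 950, None, 400, 700],
})

# Share of missing values per column, split by group. Large gaps between
# groups are a governance question, not just a data-engineering one.
missing_by_group = claims.drop(columns="region").isna().groupby(claims["region"]).mean()
print(missing_by_group)
```

A table like this doesn't answer the "why", but it shows where the why needs to be asked.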
So in judging the need for synthetic data, the insurer must start by asking why that need exists in the first place, and whether the reasons for it are recognisably fair and non-discriminatory. The danger is that the use of synthetic data could inadvertently overlay, and so mask, ethical problems in the governance of that real-world data.
To explore this in more depth, I'd recommend reading this book.
Fidelity, not Just Utility
Let’s suppose that no such ethical problems exist. The insurer then needs to consider how the introduction of synthetic data influences the performance of its digital decision systems. This is about how the models generating the synthetic data are validated, and about the relationship that validation has with the decisions the system will output.
The problem here can be illustrated through the issue of hallucinations in generative AI (more here). Hallucinations are false information introduced into the output of generative AI in order to give as complete and convincing a response as possible. Yet false information is, in effect, lies. And more and more people are encountering this as they try out generative AI and find that it can not only create false information, but also back up that false information by generating false source references.
Synthetic data may therefore have clear utility for digital decision systems, but that utility needs to be matched in equal measure by managing its fidelity. If you’re introducing synthetic data into a claims decision system, you need a robust review of how the risks of unfair and discriminatory decisions are being managed.
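What might managing fidelity look like in practice? One basic check is to test whether each synthetic column is still distributed like its real counterpart. A minimal sketch using a two-sample Kolmogorov-Smirnov test on illustrative settlement amounts; this would be one of many checks in a proper fidelity review:

```python
import numpy as np
from scipy.stats import ks_2samp

def fidelity_report(real: np.ndarray, synthetic: np.ndarray) -> None:
    """Two-sample KS test: a small p-value flags a column whose synthetic
    distribution has drifted away from the real one."""
    stat, p = ks_2samp(real, synthetic)
    print(f"KS statistic={stat:.3f}, p-value={p:.3f}")

rng = np.random.default_rng(0)
real_settlements = rng.lognormal(mean=7.0, sigma=0.8, size=2000)
good_synth = rng.lognormal(mean=7.0, sigma=0.8, size=2000)
poor_synth = rng.lognormal(mean=7.3, sigma=0.5, size=2000)  # subtly wrong dials

fidelity_report(real_settlements, good_synth)  # close match expected
fidelity_report(real_settlements, poor_synth)  # drift should be flagged
```

Note that aggregate fidelity is not enough on its own: a dataset can match overall and still misrepresent the tails, which is exactly where the next point comes in.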
The Importance of Difference
I chose claims decision systems because they are the part of the insurance lifecycle where the insurer has to perform its side of the contract in pretty precise terms. For the many claims that cluster around the middle and average of loss events, there should be little to worry about. But where the claim type or the nature of the claimant is unusual or different, clear concerns arise from how the dials and levers of the models generating that synthetic data were set.
Consider the Motability fleet, made up of around 650,000 consumers, all of whom have a significant disability. If synthetic data underlies the claims decision system being used for that fleet, fairly obvious questions arise as to how the model generating that synthetic data recognised data associated with people with disabilities, how the introduction of that synthetic data into the wider dataset was shaped by data associated with people with disabilities, and so on.
If this isn’t done properly, the level of unfair and discriminatory decisions being output will rise. Some of you may be thinking that this is something people will just have to live with, and for sure, digital decisions are never going to be perfect. However, to make that judgement, you have to know what you’re living with at the moment and be able to judge and justify how acceptable it is. Otherwise, you’re just winging it.
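Knowing what you’re living with can start with something as simple as measuring decision rates across the groups you care about, and re-measuring them every time synthetic data enters the pipeline. A minimal sketch with a hypothetical decision log and illustrative field names:

```python
import pandas as pd

def approval_rate_gap(decisions: pd.DataFrame, group_col: str, outcome_col: str) -> pd.Series:
    """Claim-approval rate per group. The ratio between the lowest and highest
    rates is a crude disparate-impact indicator: a baseline to track over time."""
    rates = decisions.groupby(group_col)[outcome_col].mean()
    print(f"min/max rate ratio: {rates.min() / rates.max():.2f}")
    return rates

# Hypothetical decision log.
log = pd.DataFrame({
    "disability_flag": [True, True, False, False, False, True, False, False],
    "approved":        [1,    0,    1,     1,     1,     0,    1,     0],
})
print(approval_rate_gap(log, "disability_flag", "approved"))
```

A single ratio is obviously no substitute for a proper fairness review, but without even this baseline, "acceptable" is just a guess.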
An Important Upside
It’s also important to highlight an upside to synthetic data. This comes from its potential for understanding and addressing bias. People can be understandably nervous about disclosing their race, on the basis of what past experience has taught them. Yet for a digital decision system to be assessed for bias, information about race matters.
Synthetic data is one way of bringing more data about race into a dataset, and I’m sure there are lots of clever ways in which this can be done. What’s important here is that synthetic data per se isn’t labelled as good or bad; what matters is the way in which it is created and put to use. As I said earlier, these are as much socio-economic questions as technological ones.
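To make that concrete, here is one hedged sketch of the idea: adding synthetic rows for an under-represented group, sampled from that group’s own distributions. The field names are hypothetical, and sampling each column independently like this ignores correlations, so treat it as an illustration of the approach rather than a recommended method:

```python
import numpy as np
import pandas as pd

def augment_group(df: pd.DataFrame, group_col: str, group: str,
                  n_extra: int, seed: int = 0) -> pd.DataFrame:
    """Add synthetic rows for one under-represented group, drawn from normal
    approximations of that group's own numeric columns. Marginals only: a
    real generator would model the joint structure."""
    rng = np.random.default_rng(seed)
    numeric = df[df[group_col] == group].select_dtypes("number")
    synth = pd.DataFrame({
        col: rng.normal(numeric[col].mean(), numeric[col].std(), n_extra)
        for col in numeric.columns
    })
    synth[group_col] = group
    synth["is_synthetic"] = True  # keep provenance visible for later audits
    return pd.concat([df.assign(is_synthetic=False), synth], ignore_index=True)

# Hypothetical usage with illustrative fields.
data = pd.DataFrame({
    "race": ["a", "a", "a", "a", "b", "b"],
    "claim_amount": [500, 620, 480, 550, 610, 590],
})
augmented = augment_group(data, "race", "b", n_extra=3)
print(augmented["race"].value_counts())
```

The provenance flag matters as much as the generation step: synthetic rows added for bias testing should stay identifiable, so they can be excluded or weighted when the system makes real decisions.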
Synthetic data is a tool, not a solution. Its generation and introduction into a digital decision system need to be carefully managed, for the simple reason that it is not real data. Remember the forecast I mentioned earlier: by 2030, synthetic data will completely overshadow real data in AI models. If the digital decision systems at the heart of insurance by 2030 or earlier are to run on synthetic data, the outcomes they generate need to be robust enough, accurate enough and fair enough for the real world that consumers live in. The future of insurance will be shaped not by synthetic data itself, but by the judgements we make in relation to it.