Data Ethics starts with Rigorous and Critical Due Diligence

Insurers are often referred to as long-established users of large datasets, familiar with the opportunities and challenges they present. Yet it’s a reputation that is going to be seriously tested over the next few years. That’s because trust in this digital era comes not from talk of ‘we have been doing this for ages’, but from evidence that ‘we are doing this right’. Insurers will be challenged on whether they have been thorough enough, careful enough, concerned enough about ‘doing data science right’.

This is not about whether they’ve bought the right hardware and software, or about whether they’ve recruited the best people. I’m sure they’ve done both. It’s about a shift taking place in the standards against which insurers’ use of data and algorithms is being judged. This shift comes from the realisation that some of the outputs being generated by corporate handling of data and algorithms can sometimes be unfair, can sometimes be discriminatory, and more generally, are sometimes of questionable quality. Such concerns are why the UK regulator launched a study of data ethics in regulated firms.

There will of course be lots of good and equitable things that insurers are doing with data and algorithms. Yet relying on that ‘good people doing good things’ angle misses the point. And the point at issue here is that insurers have to be able to show how they are embedding that ‘good people doing good things’ into the DNA of their digital projects, and to be able to prove that their standards match the public’s standards when it comes to the fair and equitable use of data and algorithms.

Insurers are not Trusted with Data

You may question how big an issue this really is. If so, consider independent research published in February 2020 by the UK insurer trade body, the Association of British Insurers. It found that 86% of consumers were concerned about organisations selling or sharing information about them when those organisations didn’t have permission to do so. More than half (55%) remain uncomfortable with this even when they have given permission for their data to be shared. And only 13% gave insurance a score between 8 and 10 (where 10 is ‘trust completely’), when asked to rate the extent to which they trust the sector to use their information and data in their best interests.

These findings amount to what the researchers referred to as a ‘double layered lens of mistrust’ that frames consumer attitudes to the use of their data for insurance purposes. Now, in the past, the sector has responded to such concerns by urging the public to ‘just trust us – we are the professionals’. That approach just does not work. In fact, it exposes the sector to ridicule.

The response needs instead to be premised upon careful and critical analysis of the datasets and algorithms being put to use. If this does not happen, then the sector can expect to face a string of serious confrontations that will make the dual pricing saga look like a picnic. I’m afraid I’m hearing about evidence of practices that make such confrontation more rather than less likely.

Due Diligence on Data

In this third post in a series about ‘data and power’, I’m going to be looking at how the sector approaches its core resource: data. And what I’ll be covering will have direct relevance to how insurers should undertake due diligence on data, whether sourced from within or from data brokers.

And before I start, just in case any doubts remain, consider this recent piece of news about the Information Commissioner’s Office ruling on the practices of data brokers Experian, Equifax and TransUnion. These firms, with whom most insurers work, have to ‘fundamentally change’ how they handle data or face a huge fine. And that’s just in respect of data protection legislation.

Insurers need to have robust procedures in place to ensure that the data they are using, and the way in which they are using it, fall within their regulated and legal responsibilities. This means they need to ask the right questions, understand the purpose of asking those questions, be able to weigh up the veracity and completeness of the answers obtained, and both record and communicate this to interested parties.

This needs to be done at three stages – planning, operations and oversight. And it needs to be done with both internally and externally sourced data, and throughout the lifecycle of that data’s use. So this points to it being more of a cultural norm within data science projects, and less of a compliance tick list exercise.

Simple but Radical Questions

So what are the questions that need to be asked? At their most fundamental, they boil down to these five. Where did this data come from? Who collected it? When was it collected? How was it collected? Why was it collected?

These questions are more radical than you might think. Most data when acquired prompts forward looking questions like ‘what can we use it for’. These five questions turn that around and require you to start by looking backwards, at what some call the data’s biography, and others call its provenance.

The answers these five questions should elicit will provide you with an understanding of the context within which the dataset is situated. And there are two things about that context that are really important to remember. Firstly, that context should highlight the financial, political and social conditions under which the dataset came into being. And those conditions will then make clear why some aspects of this dataset are present and other aspects aren’t. They will make clear how the quality of this dataset compares with the quality standards for your digital projects. And they will make clear the conflicts of interest embedded within the dataset that make it suitable, or not, for the uses you have planned for it.

The second thing to remember about this context is that it shows just how far from raw this data actually is. The data you have before you will invariably be already cooked in some way. In other words, it will have already gone through a variety of filters and processing before reaching you. And the nature of those filters and processing should tell you the extent to which pre-configuration has already been baked into the dataset.
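
One way to make those five questions, and the processing the data has already been through, part of the record for a dataset is to capture the answers in a simple structure. What follows is only a minimal sketch in Python, not a prescribed method: the class name, field names and example values are all hypothetical, chosen purely for illustration.

```python
from dataclasses import dataclass, field
from typing import List

# The five provenance questions, as hypothetical field names.
FIVE_QUESTIONS = [
    "where_from",      # Where did this data come from?
    "who_collected",   # Who collected it?
    "when_collected",  # When was it collected?
    "how_collected",   # How was it collected?
    "why_collected",   # Why was it collected?
]


@dataclass
class ProvenanceRecord:
    """Answers to the five questions for one dataset (illustrative structure only)."""
    dataset_name: str
    where_from: str = ""
    who_collected: str = ""
    when_collected: str = ""
    how_collected: str = ""
    why_collected: str = ""
    # Filters and processing already applied before the data reached you.
    processing_steps: List[str] = field(default_factory=list)

    def unanswered(self) -> List[str]:
        """Return the questions that still have no recorded answer."""
        return [q for q in FIVE_QUESTIONS if not getattr(self, q).strip()]


# Hypothetical example: a dataset with two questions still open.
record = ProvenanceRecord(
    dataset_name="example_broker_feed",
    where_from="Purchased from a data broker",
    when_collected="2019",
    how_collected="Online survey",
    processing_steps=["Deduplicated by supplier", "Incomplete responses dropped"],
)

print(record.unanswered())  # ['who_collected', 'why_collected']
```

A record with open questions is not a failure in itself. It is a prompt for the follow-up that due diligence is supposed to trigger before the dataset is put to use.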

The Stories Written into Numbers

What this means is that when people talk about ‘how the numbers speak for themselves’, they are forgetting to mention that it is a story that has been pre-scripted with a purpose in mind, that has already been edited, and that is already shaped towards a particular conclusion. It is not the numbers that are speaking, but rather the people whose time, resources and interests have baked those numbers into that particular story.

Recognising the context in which a dataset is situated allows the insurer to understand more about the dataset’s strengths and weaknesses. These will in turn allow the insurer to validate it against its regulatory and legal responsibilities. Asking strong questions and demanding exacting answers moves that insurer closer to fulfilling those responsibilities.

The Answer is not Synthetic

Insurers can sometimes turn to the use of synthetic data to address shortfalls in a dataset. Their intention is to raise the quality and accuracy of the dataset, often to meet internal standards. Is that really the right approach though? Data ethics is less about creating more and more accurate systems, and more about creating more and more equitable systems.

If your dataset has gaps in it, then rather than make those gaps go away, explore them. Gaps in data exist for reasons, and drilling down into those reasons can uncover limitations that could have a material impact on your firm meeting its responsibilities. Those gaps may exist because some people within that population do not have data to give, or are difficult to collect data from, or mistrust you enough to not contribute data, or evidence what you’re looking for through other means.
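
A first step in exploring gaps, rather than filling them, is simply to see where they fall. The sketch below is illustrative only: it assumes records held as plain Python dictionaries in which a missing value is `None`, and the function name, field names and sample rows are all hypothetical. It reports the share of missing values for one field within each group, so that an uneven pattern across groups stands out.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def missing_rate_by_group(
    rows: Iterable[Mapping[str, object]],
    group_key: str,
    field_name: str,
) -> Dict[str, float]:
    """Share of rows in each group where field_name is missing (None or absent)."""
    totals: Dict[str, int] = defaultdict(int)
    missing: Dict[str, int] = defaultdict(int)
    for row in rows:
        group = row.get(group_key)
        label = "unknown" if group is None else str(group)
        totals[label] += 1
        if row.get(field_name) is None:
            missing[label] += 1
    return {label: missing[label] / totals[label] for label in totals}


# Hypothetical sample: the field is missing more often in group B than in group A.
sample_rows = [
    {"group": "A", "income": 30000},
    {"group": "A", "income": 42000},
    {"group": "A", "income": None},
    {"group": "A", "income": 28000},
    {"group": "B", "income": None},
    {"group": "B", "income": 35000},
]

print(missing_rate_by_group(sample_rows, "group", "income"))
# {'A': 0.25, 'B': 0.5}
```

The numbers only tell you where to look. Why the gap is larger in one group than another is the question that then needs answering, and that answer lies with the people and circumstances behind the data.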

Such reasons lie behind the ethical issues associated with facial recognition, with inclusion, with bias, with unfairness. And this needs to be recognised less in a data ethics programme, and more in core functions like due diligence, audit and oversight, product design, underwriting, marketing and claims. In fact, this needs to be recognised wherever you use data. By all means use a data ethics programme to kick start this type of thinking, but be sure to transition it quickly into core functions so that it becomes ‘just another part of how we work round here’.

Who asks the Questions?

I want to end with something that is central to the success of this ethical way of thinking about data. It is the question of who should be asking questions such as those five I mentioned earlier. Should it be technical people like data scientists? Or should it be those in charge of underwriting or product design? Perhaps even someone from the customer side of the firm? Well, it needs to be someone who is a little bit of each of these people, with perhaps a little weighting towards the customer side. What it shouldn’t be is just the data science people. Yes, they have a lot of expertise in data, but they also see it through a particular lens, which is not always helpful.

To Sum Up

The challenges that insurers will face around data ethics over the next few years have their origins in how those insurers have acquired and assembled their datasets. A window currently exists for insurers to respond to those emerging challenges, before the FCA lays them out in their market review for everyone to raise their eyebrows over. The best way to respond is to subject those datasets to rigorous and critical due diligence, and to take the brave but necessary decisions to cut out those that fail to meet standards and obligations. This blog has sought to frame what a rigorous and critical due diligence could look like, and to emphasise the importance of embedding this into the culture of ‘how we do data round here’.