Making my own "AI" made me feel dirty

This is part 2 of the story of how I built an algorithm that tries to predict your Body Mass Index (BMI) from just a photo. In the first part I pointed out the things I discovered while researching how the industry creates these models. 

In this second part I want to tell you about all the questionable choices I had to make while creating my own machine learning model. These are the same choices "data scientists" have to make on a daily basis. And I'll be honest: some of the things I did felt scientifically and even morally questionable. 

Choice #1: precision

In the previous article I discussed how a BMI score isn't actually universal. Even though it's commonplace for people to compare their scores, we really shouldn't. For example, when looking at BMI you should also weigh gender and culture. Men tend to have wider faces, which in this case would affect the score. Athletes are another good example. They tend to have a high BMI. This is not because they are overweight, but because they have a lot of muscle mass, which is relatively heavy. So if someone says you have a high BMI, just say you're an athlete.

Did I take any of this nuance into account? Nope. 

I created the same blunt machine learning model that everyone does. I felt I could get away with it a little because I wanted to show people what a normal BMI prediction model actually does. I told myself that to show how bad things are, I should do as the industry does, and not care about any of these... details. Still, it didn't feel good. We're only at the first step, and I've already thrown scientific accuracy out of the window.

Choice #2: what training data should I use?

When I downloaded all those BMI analysis projects on Github, I was surprised to find that a few of them came complete with the photos they were trained on. Specifically, I got one dataset of Chinese celebrities, and one of people who had been arrested in the USA. This is quite incredible, since it means that if you were arrested long ago, and the charges were dropped, your face and data would still live on in this dataset.

I found thousands of mugshots in a Github repository...

...complete with an Excel sheet of all these people's data.

Most of the projects I found simply scraped these photos from the internet. Just as Clearview AI does, the trick is to convince yourself (and others) that because data is on the internet, it's also automatically 'public'. 

This isn't true of course. There are copyright issues, terms and conditions that must be observed, and other laws and regulations that prohibit this. And not least: it can be unethical. For example, I'm pretty sure the people sharing their progress pics on Reddit don't know that their photos are being harvested on a large scale to train machine learning algorithms. Even if "your face could be scraped and used to train machine learning models that could ultimately judge you negatively" was literally in the terms and conditions, only a cynical person would say that people truly consented to this when they clicked on the 'I agree' button.

I decided I wouldn't scrape the data myself, but would use the data that I had downloaded from Github already. Oddly, these projects were given permissive copyright licenses, so I could technically say "your face was available under an open source MIT license". Funny.. in a sad way.

I now have 4044 pictures of Chinese celebrities on my computer

Choice #3: in what ratio should I mix the datasets?

I then started training my model. I had to decide in which ratios I would mix the two datasets. If I trained the algorithm on 90% Chinese celebrities and 10% American mugshots, then the model would become better at judging Chinese faces, and worse at judging American faces. And vice versa.

I found that this choice had a large influence. A model that was trained on mostly Chinese photos would be good at judging Chinese photos, but really really bad at judging Americans.

In what ratio should I mix the datasets?

If I mixed the photos 50/50, then the results would be equally bad for both, but at least it wouldn't be horrible for either culture. So if I wanted to make a 'universal' model that worked for all cultures, I would technically have to mix in photos from all cultures in relevant amounts, while accepting that the resulting model would be a jack of all trades, and a master of none.
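
To make the mixing step concrete, here is a minimal sketch of drawing a training set from two collections in a chosen ratio. The function name, the file names and the mugshot count are all made up for illustration; only the 4044 celebrity photos come from the story above, and this is not the actual training code.

```python
import random

def mix_datasets(dataset_a, dataset_b, share_a, total, seed=0):
    """Draw `total` items, with a fraction `share_a` taken from dataset_a."""
    if not 0.0 <= share_a <= 1.0:
        raise ValueError("share_a must be between 0 and 1")
    count_a = round(total * share_a)
    count_b = total - count_a
    if count_a > len(dataset_a) or count_b > len(dataset_b):
        raise ValueError("not enough photos to reach the requested mix")
    rng = random.Random(seed)  # fixed seed so the mix is reproducible
    mixed = rng.sample(dataset_a, count_a) + rng.sample(dataset_b, count_b)
    rng.shuffle(mixed)
    return mixed

# Hypothetical file names standing in for the two photo collections.
celebrities = [f"celebrity_{i}.jpg" for i in range(4044)]
mugshots = [f"mugshot_{i}.jpg" for i in range(4000)]  # placeholder count

# A 90/10 mix: this model would mostly learn from the celebrity photos.
training_set = mix_datasets(celebrities, mugshots, share_a=0.9, total=1000)
```

Changing `share_a` to 0.5 gives the 50/50 mix; the point is that this single number quietly decides who the model will work for.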

So the lesson here for me was that creating 'universal' detection algorithms comes with strong penalties, and that it should always be preferable to tailor a machine learning model to its intended audience.

But if I look at the market, I rarely see this cultural aspect being taken into account. Most companies that provide "AI As A Service" (AIAAS) have an incentive to keep their products generic, so that they work for most use cases. The price for that is that these services are bound to make lots of mistakes.

I've heard it said that these machine learning predictions are 'good to accurate enough' 60% of the time, wrong 20% of the time, and very wrong another 20% of the time.

It's important to understand that for most companies, an algorithm doesn't have to be accurate for it to be valuable. Let's imagine an insurance agency prefers not to insure people with a BMI over 30. The algorithm judges all applications, and at some point will have judged 100 people to have an unhealthy BMI. In 60% of those cases, the algorithm will have guessed right, and will have saved the company money. However, the other 40 people will have been misjudged, and they won't get offered insurance - or they will need to pay more - despite being a healthy enough weight.
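
The arithmetic of that example is simple enough to spell out. This is only a sketch using the two numbers from the paragraph above (100 flagged applicants, 60% of them flagged correctly); the function name is made up.

```python
def split_flagged(flagged, share_correct):
    """Split the flagged applicants into correctly and wrongly flagged."""
    correct = round(flagged * share_correct)
    return correct, flagged - correct

correctly_flagged, wrongly_flagged = split_flagged(flagged=100, share_correct=0.60)
print(correctly_flagged, "applicants correctly flagged")
print(wrongly_flagged, "applicants wrongly refused or overcharged")
```

Data scientists would call that 60% the 'precision' of the model. The company sees the 60, the other 40 people just see a rejection.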

This is not a graph

Choice #4: should I massage the data?

In this picture you can see the interface I saw when I was training the model. In it you can see two graphs for the 'meh' face ratio, which is the one that relates to the area above your eyebrows. As you can see in the first graph, there are some outliers. These may be rare people with very thin eyebrows, or people with unibrows. Who knows. The point is that these people influence the bar graph below. The model has to take these rare people into account. Because of this, the entire model will be less precise when it comes to measuring 'normal' people.

In politics we say "count all the votes". But in data science it's very common to prune and 'massage' the data a bit, which can mean that you remove some datapoints. Doing so would improve accuracy for the majority of people, but at the cost of making the model totally freak out when it comes across those people with rare eyebrows.

I tend to mentally visualize machine learning models like those old-school games where balls fall through an area full of pegs. Training a model is like carefully placing the pegs in such a way that certain shapes will hopefully fall into the correct buckets below. In this case the idea is that certain (combinations of) facial features will result in the face falling into the correct bucket.

The 'problem' with the outliers is that you have to add a lot of pegs to the board for a type of face that seems relatively rare. And because the face type is so rare, it's also hard to accurately adjust the pegs; there just aren't enough faces to test with. So the option to just leave out those rare faces altogether becomes very seductive.

So we once again find the same accuracy trade-off: trying to make the model work better in one area makes it less capable in another. In this case it's a choice between optimizing for common faces or uncommon faces.


I decided to follow common industry practices, and remove the outliers. If you have rare eyebrows or unusual eyes: I'm sorry.

Removing the people with rare types of eyebrows
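
For readers wondering what 'removing the outliers' can look like in code, here is a minimal sketch using the common interquartile range (IQR) rule. This is an assumption for illustration: the story doesn't say which pruning rule the training tool applied, and the field names and numbers below are invented.

```python
import statistics

def remove_outliers(rows, key, k=1.5):
    """Split rows into (kept, dropped) using the IQR rule on rows[key]."""
    values = [row[key] for row in rows]
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartiles
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    kept = [row for row in rows if low <= row[key] <= high]
    dropped = [row for row in rows if not low <= row[key] <= high]
    return kept, dropped

# Made-up measurements of the ratio for the area above the eyebrows.
faces = [
    {"photo": "a.jpg", "brow_ratio": 0.50},
    {"photo": "b.jpg", "brow_ratio": 0.52},
    {"photo": "c.jpg", "brow_ratio": 0.49},
    {"photo": "d.jpg", "brow_ratio": 0.51},
    {"photo": "e.jpg", "brow_ratio": 0.53},
    {"photo": "f.jpg", "brow_ratio": 0.48},
    {"photo": "g.jpg", "brow_ratio": 0.50},
    {"photo": "h.jpg", "brow_ratio": 0.95},  # the person with rare eyebrows
]

kept, dropped = remove_outliers(faces, key="brow_ratio")
```

Note how little code it takes to make a person disappear from the dataset. The multiplier `k` is yet another choice: a smaller value throws away more people.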

Choice #5: do I feel OK about the correlations?

At its heart, machine learning is just an advanced form of statistics. When you are training a machine learning algorithm, you are letting the system try to find the optimal line through a dataset. 

But that assumes that a line could be drawn that represents the data accurately enough in the first place. 
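
As an illustration of what 'finding the optimal line' means in its simplest form, here is a sketch of an ordinary least-squares fit of BMI against a single face ratio. A real model combines many features at once, so this is a toy version under that assumption, and the numbers are invented.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical, no line can be fitted")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Invented example data: one face ratio per person, and their BMI.
ratios = [0.40, 0.45, 0.50, 0.55]
bmis = [20.0, 24.0, 23.0, 29.0]

slope, intercept = fit_line(ratios, bmis)
predicted_bmi = slope * 0.48 + intercept  # prediction for a new face
```

The function will happily return a line for any cloud of points. Whether that line means anything is a separate question.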

In the image below you see seven graphs for the pruned dataset, in which I removed the pesky outliers. This makes it easier to ascertain for ourselves if it makes sense to claim that certain face ratios correlate with BMI.

The seven ratios, each in relation to BMI

If there were strong correlations, you would see graphs where the shape of the blobs would clearly indicate a rise or fall as the BMI increases. But to my eye things don't seem to be that clear cut.
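
One way to put a number on 'clear cut' is Pearson's correlation coefficient, which runs from -1 to 1, with values near 0 meaning there is no linear relationship. The sketch below is a plain implementation of that standard formula; I'm not claiming the training tool reports it, and the sample values are invented.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return sxy / math.sqrt(sxx * syy)

# Invented data: BMI rises neatly with the first ratio, not with the second.
bmis = [18.0, 22.0, 26.0, 30.0, 34.0]
clear_ratio = [0.40, 0.44, 0.48, 0.52, 0.56]
vague_ratio = [0.50, 0.42, 0.55, 0.42, 0.50]

r_clear = pearson_r(bmis, clear_ratio)
r_vague = pearson_r(bmis, vague_ratio)
```

In this toy data `r_clear` lands at the top of the scale, while `r_vague` sits at zero even though every person in it has a perfectly real face.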

Firstly, let's look at graph number four - the eyebrow one. You can see that an increase in BMI doesn't really lead to a clear shift in the eyebrow ratio. If you were asked to draw a line through it, and it couldn't be horizontal (since that means there is no correlation), how would you draw it?

Secondly, in graphs 2, 3 and 5, you could draw a diagonal line through the datasets, but the original datasets have more of a triangle shape. This means there are a lot of people with the same face ratio while having very different BMI scores. In other words: the correlation is pretty weak. It might work for a part of the population, but certainly not everyone.

The sixth dataset has an odd shape as well. How would you draw a straight line through it? Keep in mind that if you prefer the first one, you are essentially saying there is no correlation - that an increase in BMI isn't really reflected by a change in this face ratio.

Of course, in this case we don't have to justify drawing a line through these datasets, since the machine learning system will do it for us. But I'm not sure a human scientist would have felt very comfortable doing it in the first place. If you are an actual scientist, I'd love to hear your thoughts.

Now it could be that despite these weak correlations, taking all seven of them into account at once results in a more convincing 7-dimensional picture. For each person there may be two or three ratios that, taken together, correlate with BMI more strongly. But I don't have that picture. So I don't know.

For this project I felt that it was 'probably good enough'.

Choice #6: is it accurate enough?

The graph below shows the result of an accuracy test on the final algorithm I created. As you can see, for about 500 photos the model was off by somewhere between 0 and 2 BMI points. Next, for about 230 photos it was off between 2 and 4 BMI points. For about 70 photos it was off between 4 and 6 points, and so forth.

How far off was my model when guessing people's BMI?
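
A graph like this boils down to counting how many predictions fall into each 2-point error band. Here is a minimal sketch of that counting, with made-up BMI values; the function name and the data are only for illustration.

```python
from collections import Counter

def error_histogram(true_bmis, predicted_bmis, bin_width=2):
    """Count predictions per absolute-error band, keyed by the band's lower edge."""
    counts = Counter()
    for actual, predicted in zip(true_bmis, predicted_bmis):
        error = abs(predicted - actual)
        band_start = int(error // bin_width) * bin_width
        counts[band_start] += 1
    return dict(sorted(counts.items()))

# Invented values: real BMI versus what a model guessed.
true_bmis = [20, 25, 30, 22]
predicted_bmis = [21, 28, 30.5, 31]

histogram = error_histogram(true_bmis, predicted_bmis)
print(histogram)  # {0: 2, 2: 1, 8: 1}
```

Here the key 0 stands for 'off by 0 to 2 points', 2 for 'off by 2 to 4 points', and so on.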

Mind you, I was testing the algorithm on the same data it was trained with. This is a condition that already favours the algorithm, since it has 'seen' these photos before. You might expect it to be very accurate. But it's not, and is still able to be wildly off the mark - up to 18 BMI points for some photos. Since the BMI scale here only goes from 14 to 48, that's quite a lot.
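
The usual way to avoid flattering a model like this is to set some photos aside before training and only test on those. That is not what was done for this project; the sketch below just shows what such a 'hold-out' split could look like, with hypothetical file names and an arbitrary 20% test share.

```python
import random

def split_train_test(items, test_share=0.2, seed=0):
    """Shuffle the items and return (train, test) with no overlap between them."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed for a repeatable split
    test_count = round(len(shuffled) * test_share)
    return shuffled[test_count:], shuffled[:test_count]

# Hypothetical photo names.
photos = [f"photo_{i}.jpg" for i in range(1000)]

train_photos, test_photos = split_train_test(photos)
```

Because the model never sees `test_photos` during training, its errors on them say more about how it would treat a stranger's face.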

You could argue the model might improve if trained on more photos, or if I used more 'pegs' to create a larger model. This very valid criticism highlights more of the trade-offs that designers have to make. In my case, I wanted the model to be small enough to be downloadable, so it had to remain relatively small. After some experimentation I settled on a model that was about three megabytes. Any smaller, and the accuracy dropped off quickly. Any bigger, and the accuracy would increase, but very slowly. 

So I once again made a conscious choice, judging the model to be 'accurate enough'.

In conclusion

I hope by now it's clear that, by their very nature, these models can never be perfectly accurate. Not only do these systems pose ethical questions, they also inherently require a designer to make trade-offs.

So the question becomes: how many poor judgements are acceptable? Is it enough to, on average, make fewer mistakes than people? Mind you, that's a weird thing to say in this case. The human equivalent of this 'AI' would be having employees judge customers' BMI scores by looking at their face, while not expressly telling them they are being judged.

I suspect one of the things that makes 'AI' so valuable is not the supposed accuracy of the results, but the advantages in the 'social' realm. In this BMI case, it can enable an organisation to expand into practices that would be deemed unacceptable and creepy if done by humans. Similarly, even if the accuracy were worse than that of judgements by people, the fact that many people are more likely to accept an 'AI' driven "computer says no" would likely be valuable. As long as people feel that the outcome is more neutral or objective because it was generated by a computer program, it allows for what I tend to call the 'evaporation of responsibility'.


In that context, I'd argue it would be healthy to refer to 'AI' as statistics in fancy clothes. This comparison might help us think more critically about these systems. 

Firstly, there is a strong historical parallel. When statistics were invented there was probably a period where people thought it was the coolest thing since sliced bread. Objective data! Insights! But then we learnt to be more critical, and realise that statistics can be the result of selective data sources, massaged data, and so forth. Over time we developed a healthy skepticism, summarized in the famous quote: "There are lies, damned lies, and statistics".

The same type of nuanced thinking is overdue when it comes to 'AI' systems. Just because these use data, that doesn't make them scientific. Just because they use math, that doesn't make them scientific. I'd go so far as to propose that 'scientist' should become a protected term, since all kinds of data cowboys (and cowgirls) are allowed to call themselves 'data scientist' now.

Even if you try to be as honest and accurate as possible, designing a machine learning model demands making trade-offs. In the end these machine learning models offer a simplified model of a very complex reality. When these are used to make predictions about people's lives, they are going to make mistakes. By their nature, simplified models will always fall short.

You might not care so much about the shortfalls of these systems. You might believe you're in the group of people that is normal and well-behaved, so you hope you'll be judged accurately and favourably.

The reality is that it doesn't work like that. These systems aren't just inaccurate, they are unpredictable. It might be that your outlier eyebrows have already caused you to pay higher health insurance premiums.

It's likely the computer has already unfairly said no to you many times. But ask yourself: how would you find out?
