The Unsexy Truth: A 10-Step Guide to Data Labeling That Actually Works
Below is a very terse, opinionated set of steps for data labeling and model evaluation.
I’ve found myself repeating this on a few occasions lately, so I’m publishing a short guide I can link people to.
What’s written applied to traditional ML models well before the infamous “ChatGPT moment,” but labeling has become a far more ubiquitous concern since. This guide omits a ton of detail and nuance, but it’s enough to help build an intuitive foundation.
You might be wondering who I am and why I’m writing on this topic.
From 2016 to 2020, I had the opportunity to architect Magic Leap’s Augmented Reality Cloud, focusing on mapping, localization, and spatial anchors. Later, I worked on Waymo’s Planner Evaluation team with an emphasis on Vulnerable Road User (VRU) safety. In both roles, I wasn’t the lead algorithm or model developer, but I was “that engineer” — IYKYK — who got to collaborate closely with some of the most brilliant minds I’ve ever met.
10-Step Guide
- Subjective Problem Definition: Identify a problem that lacks a clear, objective answer.
- Question Generation: Build a set of simple binary or multiple-choice questions, including an “unsure” option for edge cases.
- Expert Hunt: Gather the world’s top experts (dozens), focusing on those who live and breathe the niche topic.
- Expert Surveys: Have these specialists answer the questions from Step 2.
- Golden Data Creation: Combine expert answers into a “golden dataset.”
- Large-Scale Labeling: Expand to a crowd-sourced pool (hundreds or thousands). Aim for volume.
- Labeler Scoring: Compare the crowd’s labels to your golden data and assign each labeler (not label) a score.
- Weighted Integration: Use those scores to weight the crowd-sourced labels during training, fine-tuning, RLHF, or downstream evaluations (a rough sketch of steps 5–8 follows this list).
- Embrace the Boredom: Labeling is tedious and unsexy. Remind yourself that ‘true ground truth’ may not exist even with the world’s top experts in the same room.
- Repeat the Cycle: Keep iterating until you get lucky and see good results.
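The only real machinery lives in steps 5 through 8. Here is a minimal Python sketch of one way to wire them together, assuming answers arrive as (labeler_id, question_id, answer) tuples; the record format, the majority-vote rule, and the 0.5 cutoff are my illustrative choices, not anything prescribed above.

```python
from collections import Counter, defaultdict

# Illustrative sketch of steps 5-8. The (labeler_id, question_id, answer)
# record format, the majority-vote rule, and the 0.5 cutoff are assumptions
# made for this example, not part of the guide itself.

def build_golden_set(expert_answers):
    """Step 5: collapse expert answers into one golden label per question
    by majority vote; questions without a clear winner are dropped."""
    by_question = defaultdict(list)
    for _expert_id, question_id, answer in expert_answers:
        by_question[question_id].append(answer)

    golden = {}
    for question_id, answers in by_question.items():
        ranked = Counter(answers).most_common()
        if len(ranked) == 1 or ranked[0][1] > ranked[1][1]:
            golden[question_id] = ranked[0][0]
    return golden

def score_labelers(crowd_answers, golden):
    """Step 7: score each crowd labeler by agreement with the golden set."""
    hits, seen = defaultdict(int), defaultdict(int)
    for labeler_id, question_id, answer in crowd_answers:
        if question_id in golden:
            seen[labeler_id] += 1
            hits[labeler_id] += int(answer == golden[question_id])
    return {labeler: hits[labeler] / seen[labeler] for labeler in seen}

def weighted_labels(crowd_answers, scores, min_score=0.5):
    """Step 8: attach each labeler's score to their labels as a weight,
    dropping labels from labelers below the cutoff."""
    return [(question_id, answer, scores[labeler_id])
            for labeler_id, question_id, answer in crowd_answers
            if scores.get(labeler_id, 0.0) >= min_score]
```

The resulting (question, answer, weight) triples can then feed whatever trainer or evaluation harness you use, e.g. as per-example sample weights.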
10-Step Example
- Subjective Problem Definition: Is a particular crosswalk interaction between vehicles and pedestrians risky?
- Question Generation: Present a set of videos showing real or synthetic interactions between a vehicle and a pedestrian (i.e., a VRU), with multiple-choice answers ranging from ‘not risky’ to ‘very risky’.
- Expert Hunt: Gather dozens of high-paid PhDs from the world’s top academic institutions who specialize in the autonomous vehicle industry.
- Expert Surveys: Have them spend half their day labeling data instead of doing ‘intellectual’ work. Maybe host a pizza party to make it less dull.
- Golden Data Creation: Compile their answers into 1,000 golden labels (10 experts × 100 questions).
- Large-Scale Labeling: Use a crowd-sourcing system like Mechanical Turk to scale labels from 1,000 to 100,000.
- Labeler Scoring: Compare the crowd labels to the golden dataset; grade each labeler’s precision and recall (sketched in code after this list).
- Weighted Integration: Develop an internal, proprietary metric and discard labels that fall below a quality threshold.
- Embrace the Boredom: Acknowledge repetitive, unglamorous work and the absence of a perfect ground truth.
- Repeat the Cycle: Keep looping until the model’s behavior is “good enough.”
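For the crosswalk example, steps 7 and 8 reduce to a per-labeler confusion count against the golden set. A rough sketch follows, assuming the multiple-choice answers have been binarized to “risky” vs. “not risky” and reusing the (labeler_id, question_id, answer) records from the earlier sketch; the 0.7/0.6 cutoffs stand in for the internal, proprietary metric and are purely illustrative.

```python
def precision_recall_by_labeler(crowd_answers, golden, positive="risky"):
    """Step 7 of the example: per-labeler precision/recall against the
    golden set, treating 'risky' as the positive class."""
    counts = {}  # labeler_id -> {"tp": ..., "fp": ..., "fn": ...}
    for labeler_id, question_id, answer in crowd_answers:
        truth = golden.get(question_id)
        if truth is None:
            continue  # only grade questions that have a golden label
        c = counts.setdefault(labeler_id, {"tp": 0, "fp": 0, "fn": 0})
        if answer == positive and truth == positive:
            c["tp"] += 1
        elif answer == positive:
            c["fp"] += 1
        elif truth == positive:
            c["fn"] += 1

    scores = {}
    for labeler_id, c in counts.items():
        precision = c["tp"] / (c["tp"] + c["fp"]) if (c["tp"] + c["fp"]) else 0.0
        recall = c["tp"] / (c["tp"] + c["fn"]) if (c["tp"] + c["fn"]) else 0.0
        scores[labeler_id] = (precision, recall)
    return scores

def keep_labelers(scores, min_precision=0.7, min_recall=0.6):
    """Step 8 of the example: discard labelers below the quality threshold;
    only their labels survive into training and evaluation."""
    return {labeler_id for labeler_id, (p, r) in scores.items()
            if p >= min_precision and r >= min_recall}
```

Labels from the surviving labelers are what end up in the 100,000-label training and evaluation pool.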
Nothing here is groundbreaking, and companies like Scale AI are leaning heavily in this direction. I wouldn’t be surprised if we see a marketplace of small but very expensive labeled datasets in the near future.
In the meantime, I hope this helped!