Annotation is often the most arduous part of the artificial intelligence (AI) model training process. That’s particularly true in computer vision, where traditional labeling tools require human annotators to outline each object in a given image. Labeling a single image in the popular COCO+Stuff dataset, for example, takes 19 minutes; at that rate, tagging the whole dataset of 164,000 images would take nearly 52,000 hours.
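The arithmetic behind that estimate is easy to check; the snippet below simply multiplies out the figures quoted above:

```python
# Back-of-the-envelope labeling cost for COCO+Stuff, using the figures above.
minutes_per_image = 19
num_images = 164_000

total_hours = minutes_per_image * num_images / 60
print(f"{total_hours:,.0f} hours")  # ~51,933 hours of annotation work
```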
Fortunately, Google has developed a solution that promises to cut labeling time dramatically. It’s called Fluid Annotation, and it uses machine learning to outline every object and background region in an image and assign each a class label. Google claims it can accelerate the creation of labeled datasets by a factor of three.
The demo’s available on the web here.
Fluid Annotation starts from the output of a pretrained segmentation model (Mask R-CNN), which generates roughly 1,000 image segments with class labels and confidence scores. (Essentially, it associates each pixel of an image with a class label, such as “flower,” “person,” “road,” or “sky.”) The segments with the highest confidence scores are passed on to human workers for labeling.
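To make that pre-processing step concrete, here is a minimal Python sketch of how machine-generated segments might be filtered by confidence before reaching annotators. The `Segment` structure and `select_for_review` helper are illustrative assumptions, not Google’s actual code:

```python
from dataclasses import dataclass, field

@dataclass
class Segment:
    label: str         # predicted class, e.g. "person", "road", "sky"
    confidence: float  # model confidence score for this segment
    mask: list = field(default_factory=list)  # per-pixel mask (placeholder)

def select_for_review(segments, top_k=100):
    """Keep the top_k highest-confidence machine-generated segments,
    which are the ones shown to human annotators first."""
    return sorted(segments, key=lambda s: s.confidence, reverse=True)[:top_k]

# Example: a model emits ~1,000 candidate segments; annotators see the best.
candidates = [Segment("person", 0.97), Segment("flower", 0.42), Segment("sky", 0.88)]
print([s.label for s in select_for_review(candidates, top_k=2)])  # ['person', 'sky']
```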
Annotators can modify the image through a dashboard, choosing what to correct and in which order. They’re able to swap the label of an existing segment with another from an auto-generated shortlist, add a segment to cover a missing object, remove an existing segment, or change the depth order of overlapping segments.
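Those four operations map naturally onto simple edits of a segment list. The sketch below, which reuses the hypothetical `Segment` class from above, is one plausible shape for them rather than Google’s implementation:

```python
def swap_label(segments, i, new_label):
    """Replace a segment's label with one picked from the auto-generated shortlist."""
    segments[i].label = new_label

def add_segment(segments, segment):
    """Add a new segment to cover an object the model missed."""
    segments.append(segment)

def remove_segment(segments, i):
    """Delete a spurious machine-generated segment."""
    del segments[i]

def reorder_depth(segments, i, new_pos):
    """Move a segment within the depth (drawing) order of overlapping segments."""
    segments.insert(new_pos, segments.pop(i))
```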
“Fluid Annotation is a first exploratory step towards making image annotation faster and easier,” Jasper Uijlings and Vittorio Ferrari, senior research scientists at Google’s machine perception division, wrote in a blog post. “In future work we aim to improve the annotation of object boundaries, make the interface faster by including more machine intelligence, and finally extend the interface to handle previously unseen classes for which efficient data collection is needed the most.”
Google’s not the only one applying AI to data annotation.
San Francisco startup Scale employs a combination of human data labelers and machine learning algorithms to sort through raw, unlabeled data streams for clients like Lyft, General Motors, Zoox, Voyage, nuTonomy, and Embark. Supervisely follows a similar model, pairing deep learning with crowd collaboration. And Sweden-based Mapillary maintains a database of street-level images and uses computer vision to analyze the data contained in them.
Firms like DefinedCrowd take a different tack. The three-year-old Seattle-based startup, which describes itself as a “smart” data curation platform, offers a bespoke model-training service to clients in the customer service, automotive, retail, health care, and enterprise sectors.