Data is the most important part of modern machine learning applications, including the perception systems of self-driving cars, which rely on data for training. Cars today are equipped with many sensors. These sensors collect information and feed it into the car's computer, where it must be processed and interpreted in real time so that the car can understand the road ahead while driving. Before that can happen, however, the algorithm on the car's computer must be trained to classify what it sees, which makes the accuracy of data labeling critically important. What follows are some of Annotell's explorations in data annotation.
Annotated data is essential, and it serves two purposes: to train the algorithm on the car's computer to interpret the collected information, and to verify that the computer has in fact learned to interpret that information correctly. Because annotated data underpins both of these purposes, annotation quality is critical: low-quality annotations can ultimately cause the car to misunderstand what is happening on the road.
The process of annotating data always involves human decisions. The first challenge is getting people to agree on how the recorded data should be annotated; writing such annotation guidelines is not as easy as one might think, and it often takes considerable experience to design guidelines that effectively improve quality. The second challenge is performing annotations at scale while following those guidelines.
How can the quality of a dataset be judged?
One way to quantify annotation quality is the precision and recall of the annotated dataset. Consider the kind of annotation in which an object in a camera image (such as an approaching vehicle) is marked with a bounding box. When judging the quality of such a dataset, two questions matter: (i) is every object of interest correctly marked with a bounding box, and (ii) does every bounding box actually contain an object of interest?
In a perfectly annotated dataset, neither of these errors is present. (The figure above shows an example of an incorrect label.) One way to define quality is therefore to measure the extent to which these errors occur in the annotated dataset, for example by calculating the following two ratios (a code sketch follows the list):
The proportion of bounding boxes that actually contain an object of interest. This is called precision. Ideally, precision is 1.
The proportion of objects of interest that are correctly marked with a bounding box. This is called recall. Ideally, recall is 1.
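As a concrete illustration, here is a minimal Python sketch of how precision and recall could be computed for bounding-box annotations against a vetted ground truth. The (x1, y1, x2, y2) box format, the greedy matching, and the 0.5 IoU threshold are illustrative assumptions, not Annotell's actual tooling:

    # Minimal sketch: precision/recall of bounding-box annotations.
    # Boxes are (x1, y1, x2, y2) tuples; names and the 0.5 IoU
    # threshold are illustrative assumptions.

    def iou(a, b):
        """Intersection-over-union of two axis-aligned boxes."""
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        union = area_a + area_b - inter
        return inter / union if union > 0 else 0.0

    def precision_recall(annotated, ground_truth, iou_threshold=0.5):
        """Greedily match each annotated box to at most one true object."""
        matched = set()
        true_positives = 0
        for box in annotated:
            best_i, best_score = None, iou_threshold
            for i, gt in enumerate(ground_truth):
                score = iou(box, gt)
                if i not in matched and score >= best_score:
                    best_i, best_score = i, score
            if best_i is not None:
                matched.add(best_i)
                true_positives += 1
        precision = true_positives / len(annotated) if annotated else 1.0
        recall = true_positives / len(ground_truth) if ground_truth else 1.0
        return precision, recall

    gt = [(10, 10, 50, 50), (60, 60, 100, 100)]
    ann = [(12, 11, 49, 52), (200, 200, 220, 220)]  # one good box, one spurious box
    print(precision_recall(ann, gt))                # -> (0.5, 0.5)

Matching each annotated box to at most one ground-truth object prevents several redundant boxes around the same object from inflating the score.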
However, computing the precision and recall of a dataset exactly would require a manual, critical inspection of every frame in the entire dataset, which can be as expensive as the annotation process itself! To compute precision and recall more efficiently, the Annotell team relies on statistical inference: only a well-chosen subset of all annotations is manually reviewed, and probability theory is used to draw conclusions about the entire dataset.
In more detail, they use Bayesian methods to compute a posterior distribution over the precision and recall of the entire dataset, conditioned on the annotated sub-sample that has been vetted. This provides not only estimates of precision and recall but also a quantification of the uncertainty in those estimates. For example, one can compute a 95% lower bound: a threshold such that, with 95% certainty, precision or recall is no lower than that value.
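The article does not spell out Annotell's exact model, but a standard conjugate choice for a proportion such as precision is a Beta prior combined with a Binomial likelihood over the reviewed sub-sample. A sketch under that assumption:

    # Sketch of a Bayesian lower bound on precision (or recall) from a
    # reviewed sub-sample. The Beta-Binomial model is an assumption; the
    # article does not specify Annotell's actual model.
    from scipy.stats import beta

    def posterior_lower_bound(correct, reviewed,
                              prior_a=1.0, prior_b=1.0, level=0.95):
        """With a Beta(prior_a, prior_b) prior and `correct` successes out
        of `reviewed` vetted annotations, the posterior is
        Beta(prior_a + correct, prior_b + reviewed - correct). Returns the
        value the quantity exceeds with posterior probability `level`."""
        posterior = beta(prior_a + correct, prior_b + reviewed - correct)
        return posterior.ppf(1.0 - level)

    # E.g., 480 of 500 reviewed boxes contain an object of interest:
    # we can be 95% certain that precision is at least ~0.94.
    print(posterior_lower_bound(correct=480, reviewed=500))

Reviewing a larger sub-sample tightens the posterior and raises the lower bound, so the review budget can be traded off directly against the certainty of the quality estimate.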
In this way, Annotell provides a cost-effective tool for measuring annotation quality in terms of precision and recall, together with a measure of certainty about those levels.