Baseline results of the IPN Hand Dataset
Tests are divided into isolated and continuous hand gesture recognition.
If you want your results to be included here, please send the information (method, reference, and raw results) to gibran@ieee.org.
Isolated Hand Gesture Recognition
We segment all testing videos into isolated gesture samples using the manually annotated beginning and ending frames. The learning task is to predict a class label for each gesture sample. As evaluation metrics for this test, we use classification accuracy, i.e., the percentage of correctly labeled samples, and the confusion matrix of the predictions.
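For reference, a minimal sketch of how these two metrics can be computed from per-sample predictions (the function and array names are illustrative, not taken from the benchmark code):

```python
import numpy as np

def evaluate_isolated(y_true, y_pred, num_classes):
    """Classification accuracy (%) and confusion matrix for isolated gestures.

    y_true, y_pred: integer class indices, one entry per segmented gesture sample.
    """
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    accuracy = 100.0 * np.mean(y_true == y_pred)   # percent of correctly labeled samples
    # confusion[i, j] = number of samples of true class i predicted as class j
    confusion = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(confusion, (y_true, y_pred), 1)
    return accuracy, confusion
```

Row-normalizing `confusion` gives the per-class rates shown in confusion-matrix plots such as the one below.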
Ref. | Model | Input | Modality | Results | Inference* |
---|---|---|---|---|---|
[1] | ResNeXt-101 | 32-frames | RGB-Flow | 86.32 % | 50.8 ms |
[1] | ResNeXt-101 | 32-frames | RGB-Seg | 84.77 % | 37.4 ms |
[3] | ResNeXt-101 | 32-frames | RGB | 83.59 % | 27.7 ms |
[2] | C3D | 32-frames | RGB | 77.75 % | 76.2 ms |
[1] | ResNet-50 | 32-frames | RGB-Seg | 75.11 % | 25.9 ms |
[1] | ResNet-50 | 32-frames | RGB-Flow | 74.65 % | 39.9 ms |
[3] | ResNet-50 | 32-frames | RGB | 73.10 % | 18.2 ms |
*Inference time measured on a single NVIDIA GTX 1080 Ti GPU.
Confusion matrix of the best result [1], using the ResNeXt-101 model with 32-frame RGB-Flow input:
Continuous Hand Gesture Recognition
We use the Levenshtein accuracy [3] as the evaluation metric for this test.
This metric employs the Levenshtein distance, which measures the difference between the predicted and ground-truth gesture sequences of a video by counting the number of item-level changes (insertions, deletions, and substitutions) needed to transform one into the other.
For example, if the ground-truth sequence is [1,2,3,4,5,6,7,8,9] and the predicted sequence is [1,2,7,4,5,6,6,7,8,9], the Levenshtein distance is 2 (one substitution and one insertion).
The Levenshtein accuracy is then obtained by normalizing this distance by the number of ground-truth gestures, and averaging over all test videos.
In the example, the accuracy is (1 - 2/9) × 100% = 77.78%.
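A minimal sketch of this metric, using the standard dynamic-programming edit distance (the function names are illustrative):

```python
def levenshtein_distance(pred, target):
    """Item-level edit distance between two gesture sequences."""
    m, n = len(pred), len(target)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i                      # delete all remaining predictions
    for j in range(n + 1):
        dist[0][j] = j                      # insert all remaining targets
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == target[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution
    return dist[m][n]

def levenshtein_accuracy(pred, target):
    """(1 - distance / number of ground-truth gestures) × 100."""
    return (1 - levenshtein_distance(pred, target) / len(target)) * 100

# The example above: distance 2 over 9 ground-truth gestures -> 77.78
print(round(levenshtein_accuracy([1, 2, 7, 4, 5, 6, 6, 7, 8, 9],
                                 [1, 2, 3, 4, 5, 6, 7, 8, 9]), 2))
```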
Ref. | Model | Input | Modality | Results | Inference* |
---|---|---|---|---|---|
[1] | ResNeXt-101 | 32-frames | RGB-Flow | 42.47 % | 53.7 ms |
[1] | ResNet-50 | 32-frames | RGB-Flow | 39.47 % | 43.1 ms |
[1] | ResNeXt-101 | 32-frames | RGB-Seg | 39.01 % | 39.9 ms |
[1] | ResNet-50 | 32-frames | RGB-Seg | 33.27 % | 29.2 ms |
[3] | ResNeXt-101 | 32-frames | RGB | 25.34 % | 30.1 ms |
[3] | ResNet-50 | 32-frames | RGB | 19.78 % | 20.4 ms |
*Inference time measured on a single NVIDIA GTX 1080 Ti GPU.
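As a side note, a hypothetical sketch of how such per-clip latencies could be measured with PyTorch; `measure_inference_ms`, the warm-up count, and the clip shape are assumptions for illustration, not the script used to produce the tables above:

```python
import time
import torch

@torch.no_grad()
def measure_inference_ms(model, clip, warmup=10, runs=100):
    """Average per-clip inference time in milliseconds on a single GPU."""
    model.eval().cuda()
    clip = clip.cuda()            # e.g. a (1, 3, 32, 112, 112) 32-frame clip
    for _ in range(warmup):       # warm-up to stabilize GPU clocks and caches
        model(clip)
    torch.cuda.synchronize()      # flush queued GPU work before starting the clock
    start = time.perf_counter()
    for _ in range(runs):
        model(clip)
    torch.cuda.synchronize()      # make sure all runs actually finished
    return (time.perf_counter() - start) * 1000 / runs
```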
References
[1] G. Benitez-Garcia, et al., IPN Hand: A Video Dataset and Benchmark for Real-Time Continuous Hand Gesture Recognition, in ICPR 2020. [code]
[2] D. Tran, et al., Learning spatiotemporal features with 3D convolutional networks, in ICCV 2015.
[3] O. Köpüklü, et al., Real-time hand gesture detection and classification using convolutional neural networks, in FG 2019. [code]