TY - JOUR
T1 - Detection and Localization of Robotic Tools in Robot-Assisted Surgery Videos Using Deep Neural Networks for Region Proposal and Detection
AU - Sarikaya, Duygu
AU - Corso, Jason J.
AU - Guru, Khurshid A.
N1 - Publisher Copyright:
© 1982-2012 IEEE.
PY - 2017/7
Y1 - 2017/7
N2 - Video understanding of robot-assisted surgery (RAS) videos is an active research area. Modeling the gestures and skill level of surgeons presents an interesting problem. The insights drawn may be applied in effective skill acquisition, objective skill assessment, real-time feedback, and human-robot collaborative surgeries. We propose a solution to the tool detection and localization open problem in RAS video understanding, using a strictly computer vision approach and the recent advances of deep learning. We propose an architecture using multimodal convolutional neural networks for fast detection and localization of tools in RAS videos. To the best of our knowledge, this approach will be the first to incorporate deep neural networks for tool detection and localization in RAS videos. Our architecture applies a region proposal network (RPN) and a multimodal two stream convolutional network for object detection to jointly predict objectness and localization on a fusion of image and temporal motion cues. Our results with an average precision of 91% and a mean computation time of 0.1 s per test frame detection indicate that our study is superior to conventionally used methods for medical imaging while also emphasizing the benefits of using RPN for precision and efficiency. We also introduce a new data set, ATLAS Dione, for RAS video understanding. Our data set provides video data of ten surgeons from Roswell Park Cancer Institute, Buffalo, NY, USA, performing six different surgical tasks on the daVinci Surgical System (dVSS) with annotations of robotic tools per frame.
AB - Video understanding of robot-assisted surgery (RAS) videos is an active research area. Modeling the gestures and skill level of surgeons presents an interesting problem. The insights drawn may be applied in effective skill acquisition, objective skill assessment, real-time feedback, and human-robot collaborative surgeries. We propose a solution to the tool detection and localization open problem in RAS video understanding, using a strictly computer vision approach and the recent advances of deep learning. We propose an architecture using multimodal convolutional neural networks for fast detection and localization of tools in RAS videos. To the best of our knowledge, this approach will be the first to incorporate deep neural networks for tool detection and localization in RAS videos. Our architecture applies a region proposal network (RPN) and a multimodal two stream convolutional network for object detection to jointly predict objectness and localization on a fusion of image and temporal motion cues. Our results with an average precision of 91% and a mean computation time of 0.1 s per test frame detection indicate that our study is superior to conventionally used methods for medical imaging while also emphasizing the benefits of using RPN for precision and efficiency. We also introduce a new data set, ATLAS Dione, for RAS video understanding. Our data set provides video data of ten surgeons from Roswell Park Cancer Institute, Buffalo, NY, USA, performing six different surgical tasks on the daVinci Surgical System (dVSS) with annotations of robotic tools per frame.
KW - Object detection
KW - image classification
KW - laparoscopes
KW - multi-layer neural network
KW - telerobotics
UR - http://www.scopus.com/inward/record.url?scp=85028403958&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85028403958&partnerID=8YFLogxK
U2 - 10.1109/TMI.2017.2665671
DO - 10.1109/TMI.2017.2665671
M3 - Article
C2 - 28186883
AN - SCOPUS:85028403958
SN - 0278-0062
VL - 36
SP - 1542
EP - 1549
JO - IEEE Transactions on Medical Imaging
JF - IEEE Transactions on Medical Imaging
IS - 7
M1 - 7847313
ER -