TY - GEN
T1 - Accessible human-error interactions in AI applications for the blind
AU - Hong, Jonggi
N1 - Publisher Copyright:
© 2018 Association for Computing Machinery.
PY - 2018/10/8
Y1 - 2018/10/8
N2 - People who are blind face challenges when performing everyday tasks that depend heavily on vision, such as identifying objects, clothing, and packages of food, as well as entering text on a smartphone using touchscreen keyboards. To overcome these challenges, they typically rely on other sensory channels such as touch, taste, smell, and hearing. For example, they use a screen reading application such as VoiceOver to identify keys on a touchscreen keyboard, or Braille to recognize everyday objects (e.g., by attaching adhesive Braille labels to them). Machine learning (ML) applications such as automatic speech recognition (ASR) and computer vision can make it easier for this population to carry out such tasks by providing access to the visual world and enabling interactions through preferred modalities. Prior work [2, 34] has shown that speech input is indeed the preferred method of text entry on a mobile device for blind people. Many computer vision applications have also been proposed to enable these users to navigate unfamiliar indoor environments [3, 30], access printed text [5, 25], and identify objects of interest [1, 20, 27] using the built-in cameras on their mobile devices. Although advances in ML promise improved accuracy, ML applications are inherently error prone. ASR systems are approaching a 5.1% word error rate for English [33], and in computer vision, image classifiers achieve a top-5 error rate of only 3.7% [26]. Given that these numbers are averages reported on benchmark datasets, users may face additional challenges when such models are deployed in real-world applications, especially under conditions deviating from those the models were trained on. For example, ASR errors can arise from speaker variation, disfluency, background noise, word ambiguity, and user mistakes [14, 18]. Likewise, object recognition errors, even when models are trained on a limited number of objects, can arise from a lack of discriminative characteristics, lighting conditions, and reflective surfaces; these errors affect blind users even more because of their challenges in photo taking, which can produce background clutter, scale, viewpoints, occlusion, and image quality that differ from the photos taken by sighted users used in training [20, 35].
AB - People who are blind face challenges when performing everyday tasks that depend heavily on vision, such as identifying objects, clothing, and packages of food, as well as entering text on a smartphone using touchscreen keyboards. To overcome these challenges, they typically rely on other sensory channels such as touch, taste, smell, and hearing. For example, they use a screen reading application such as VoiceOver to identify keys on a touchscreen keyboard, or Braille to recognize everyday objects (e.g., by attaching adhesive Braille labels to them). Machine learning (ML) applications such as automatic speech recognition (ASR) and computer vision can make it easier for this population to carry out such tasks by providing access to the visual world and enabling interactions through preferred modalities. Prior work [2, 34] has shown that speech input is indeed the preferred method of text entry on a mobile device for blind people. Many computer vision applications have also been proposed to enable these users to navigate unfamiliar indoor environments [3, 30], access printed text [5, 25], and identify objects of interest [1, 20, 27] using the built-in cameras on their mobile devices. Although advances in ML promise improved accuracy, ML applications are inherently error prone. ASR systems are approaching a 5.1% word error rate for English [33], and in computer vision, image classifiers achieve a top-5 error rate of only 3.7% [26]. Given that these numbers are averages reported on benchmark datasets, users may face additional challenges when such models are deployed in real-world applications, especially under conditions deviating from those the models were trained on. For example, ASR errors can arise from speaker variation, disfluency, background noise, word ambiguity, and user mistakes [14, 18]. Likewise, object recognition errors, even when models are trained on a limited number of objects, can arise from a lack of discriminative characteristics, lighting conditions, and reflective surfaces; these errors affect blind users even more because of their challenges in photo taking, which can produce background clutter, scale, viewpoints, occlusion, and image quality that differ from the photos taken by sighted users used in training [20, 35].
KW - Accessibility
KW - Automatic speech recognizer
KW - Machine learning
KW - Personalized object recognizer
KW - Speech input
UR - http://www.scopus.com/inward/record.url?scp=85058322078&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85058322078&partnerID=8YFLogxK
U2 - 10.1145/3267305.3267321
DO - 10.1145/3267305.3267321
M3 - Conference contribution
AN - SCOPUS:85058322078
T3 - UbiComp/ISWC 2018 - Adjunct Proceedings of the 2018 ACM International Joint Conference on Pervasive and Ubiquitous Computing and Proceedings of the 2018 ACM International Symposium on Wearable Computers
SP - 522
EP - 528
BT - UbiComp/ISWC 2018 - Adjunct Proceedings of the 2018 ACM International Joint Conference on Pervasive and Ubiquitous Computing and Proceedings of the 2018 ACM International Symposium on Wearable Computers
T2 - 2018 Joint ACM International Conference on Pervasive and Ubiquitous Computing, UbiComp 2018 and 2018 ACM International Symposium on Wearable Computers, ISWC 2018
Y2 - 8 October 2018 through 12 October 2018
ER -