Computer Science 5th Year Thesis Presentation

Location:
8102 - Gates & Hillman Centers

Speaker:
JACK PAPARIAN, 5th Year Master's Student
http://www.andrew.cmu.edu/user/jpaparia

When building datasets for supervised machine learning problems, data is often labeled manually by human annotators. In domains like medical imaging, acquiring labels can be prohibitively expensive. Both active learning and crowdsourcing have emerged as ways to frugally label datasets. In active learning, there has been recent interest in algorithms that exploit the data's structure to direct querying. When learning from crowds, one must balance the accuracy and cost of different teachers when gathering labels; weak teachers are assumed to be most accurate when labeling samples from label-homogeneous regions of space. In this thesis, we explore how the data's structure can be leveraged for both of these techniques. The sequential probability ratio test (SPRT) provides the backbone for our contributions. Using the SPRT, we provide a cluster-based active learning algorithm to find a small, homogeneous partitioning of the data. We also use the SPRT to measure the confidence of a weak teacher's label by analyzing its estimates on neighboring labels. The optimality of the SPRT allows the algorithms to inherently minimize the average number of queries required before their termination.

Thesis Committee:
Christopher J. Landmead
Carl Kingsford

Copy of Thesis Document

