Abstract: We study the challenging problem of simultaneously localizing a sequence of instructional diagram queries in a video. This requires understanding not only the individual diagram queries but ...