The purpose of this study is to investigate whether visuo-spatial working memory (WM) is involved in the processing of spatial passages presented auditorily to advanced Chinese learners of Japanese. In the experiments, participants’ verbal WM capacity was treated as an independent variable. Spatial tapping tasks, divided into a simple tapping task and a target tracking task, were used to examine how visuo-spatial WM operates during listening comprehension. Nonspatial and spatial texts served as the listening materials. A free-recall task and a true/false judgment test were used to assess comprehension and memorization of the texts. The results showed that participants with higher verbal WM capacity were better at memorizing the information presented in both texts. Furthermore, an interference effect was observed only for the target tracking task on the spatial text, suggesting that visuo-spatial WM is involved in text comprehension and memorization only when spatial information is involved. The interference effect of the target tracking task and the better performance of participants with higher verbal WM capacity point to the possibility that, during second language listening comprehension, an image is generated only after verbal information processing has been completed, via the functions of the visuo-spatial sketchpad and the central executive, and is then used in parsing and utilization.