To test the hypothesis that phonemic perception is a process in which the perceptual representation of a speech signal is matched to a phonemic representation in long-term memory, performance in identification learning of sinewave analogs of speech sounds was compared under three different instructions. Six sinewave analog stimuli were constructed, each from three time-varying sinusoids, to imitate the formant structures of the vowels /a/, /i/, and /u/ spoken in a female and a male voice. Such stimuli are usually not spontaneously perceived as speech sounds. Participants were randomly assigned to one of three instruction groups: a speech-instruction group with matched labels, who were told to identify the analogs using vowel labels matched to the original speech; a speech-instruction group with mismatched labels, who were told to identify them using mismatched vowel labels; and a nonspeech-instruction group, who were told to identify them as inharmonic nonmusical chords using the arbitrary labels "A", "B", and "C". Performance improved with repeated identification in all groups, but the speech-instruction group with matched labels performed best and the speech-instruction group with mismatched labels performed worst. These results replicated previous findings (Takayama, 2002a). In addition, the voice of the original vowels, or the token of the analog vowels, influenced identification in both the speech-instruction group with mismatched labels and the nonspeech-instruction group, but not in the speech-instruction group with matched labels. It was therefore concluded that matching to phonemic representations in memory may occur at a more abstract, categorical level of representation, rather than at a level where the acoustic details of speech sounds are preserved.
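As a rough illustration of how a sinewave analog built from three time-varying sinusoids might be synthesized, the following Python sketch sums three sinusoids whose frequencies glide linearly over the stimulus. The formant values, the linear glides, the duration, the sampling rate, and the names `HYPOTHETICAL_FORMANTS` and `sinewave_analog` are illustrative assumptions and are not taken from the study.

```python
import math

SAMPLE_RATE = 16000  # Hz; assumed, not reported in the abstract

# Hypothetical (start, end) frequencies in Hz for the three sinusoids
# that stand in for the first three formants of each vowel.
HYPOTHETICAL_FORMANTS = {
    "a": [(800.0, 780.0), (1300.0, 1250.0), (2600.0, 2550.0)],
    "i": [(300.0, 290.0), (2300.0, 2350.0), (3000.0, 3050.0)],
    "u": [(350.0, 340.0), (1200.0, 1150.0), (2400.0, 2350.0)],
}


def sinewave_analog(tracks, duration=0.3, sample_rate=SAMPLE_RATE):
    """Sum one sinusoid per formant track, each gliding linearly in frequency."""
    n = round(duration * sample_rate)
    phases = [0.0] * len(tracks)
    samples = []
    for i in range(n):
        frac = i / (n - 1) if n > 1 else 0.0
        value = 0.0
        for k, (f_start, f_end) in enumerate(tracks):
            # Instantaneous frequency at this sample (linear interpolation).
            freq = f_start + (f_end - f_start) * frac
            # Accumulate phase so the frequency change is continuous.
            phases[k] += 2.0 * math.pi * freq / sample_rate
            value += math.sin(phases[k])
        # Scale by the number of sinusoids to keep samples in [-1, 1].
        samples.append(value / len(tracks))
    return samples


if __name__ == "__main__":
    stimulus = sinewave_analog(HYPOTHETICAL_FORMANTS["a"])
    print(len(stimulus), "samples")
```

Accumulating phase sample by sample, rather than evaluating sin(2πft) with a changing f, avoids discontinuities when the frequency varies over time. A female and a male version of each vowel would, under this sketch, simply use different track values.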