This paper introduces a novel approach to augmented reality (AR) dialogue systems that uses the SIMMC2-Point dataset to incorporate a pointing modality. It employs BART and CLIP models to model multi-modal dialogues that capture spatial and attribute information. Ablation experiments underscore the importance of the pointing modality, a step toward AR dialogue systems that support immersive interaction.