Inspired by the recent work in SpatialVLM, we reproduce similar data synthesis pipelines using openly available models.
We compare our results to alternative annotation pipelines like RAM-Grounded-SAM.
Our repo uses a simple pipeline in docker compose to produce datasets suitable for fine-tuning multimodal models like LLaVA.