Using Generative AI to count objects in images
And other AI vision applications
Ever tried getting a Generative AI to count objects in an image—like “how many runners have purple socks?”—and been amused (or frustrated) by the results?
Same.
For all their sophistication, most LLMs have struggled with practical, structured image analysis, especially when it comes to reliable object counting.
That’s why Moondream 3 feels like a game-changer.
I recently put its preview release through a few real-world tests.
It’s not perfect yet, but it’s the first LLM I’ve seen that makes object counting and structured image understanding genuinely accessible—jumping beyond simple image captions to answer questions like “how many?” and “where?” with actual context.
It represents a tangible improvement in several areas of vision-language reasoning.
In my tests, it managed:
Handling specific queries about objects and their locations within an image
Producing structured outputs such as tables or JSON from visual content
Extracting text from photos and diagrams with improved OCR capabilities
Managing longer, multi-step reasoning tasks, thanks to an increased context window
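To make the "how many?" and "where?" idea concrete, here is a minimal Python sketch of post-processing a structured detection result. The response shape (an `objects` list with `label` and normalized `x_min`/`y_min`/`x_max`/`y_max` fields), the sample data, and the helper names `count_by_label` and `to_pixel_boxes` are all assumptions for illustration, not confirmed Moondream 3 output or API.

```python
import json
from collections import Counter

# Hypothetical response: a JSON string listing detected objects, each with a
# label and a bounding box normalized to the 0..1 range. The field names and
# values are assumptions for illustration, not confirmed Moondream 3 output.
SAMPLE_RESPONSE = """
{
  "objects": [
    {"label": "runner", "x_min": 0.10, "y_min": 0.20, "x_max": 0.30, "y_max": 0.90},
    {"label": "runner", "x_min": 0.40, "y_min": 0.25, "x_max": 0.55, "y_max": 0.88},
    {"label": "purple sock", "x_min": 0.12, "y_min": 0.80, "x_max": 0.18, "y_max": 0.90}
  ]
}
"""


def count_by_label(response_text):
    """Answer "how many?": count detected objects per label."""
    objects = json.loads(response_text)["objects"]
    return Counter(obj["label"] for obj in objects)


def to_pixel_boxes(response_text, width, height):
    """Answer "where?": scale normalized boxes to pixel coordinates."""
    objects = json.loads(response_text)["objects"]
    return [
        {
            "label": obj["label"],
            "box": (
                round(obj["x_min"] * width),
                round(obj["y_min"] * height),
                round(obj["x_max"] * width),
                round(obj["y_max"] * height),
            ),
        }
        for obj in objects
    ]


if __name__ == "__main__":
    print(dict(count_by_label(SAMPLE_RESPONSE)))
    for item in to_pixel_boxes(SAMPLE_RESPONSE, width=1000, height=500):
        print(item)
```

Once the output is structured like this, counting is a dictionary lookup and locations can be drawn straight onto the image, which is much harder to do reliably with a free-text caption.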
Also worth noting are the training and customization options:
Moondream 3 is intended to adapt well, whether through a few prompt examples or traditional fine-tuning.
It’s not a finished solution, but it’s a step forward for those interested in practical uses of vision-language models and structured image analysis.
Curious to try it?
You can explore the Moondream Playground and interact with the model right in your browser.
Prefer local?
Moondream 3 is open and free to download, so you can run it on your own computer.
Looking to build?
It’s also available via the Fal AI platform, so you can access Moondream 3 through APIs and integrate it directly into your own tools and workflows.
I’m still exploring all its quirks, but the jump forward in practical, explainable vision AI is unmistakable.
If you work at the intersection of LLMs and computer vision, check this out—AI models are finally making visual understanding something you can count on.





Couldn't agree more. This genuinely marks a significant leap. The struggle with structured image analysis has been a real bottleneck. What if Moondream 3's capacity for structured outputs could streamline large-scale ecological surveys, counting specific species or tracking changes in biomass over time? The implications are huge.