A team of Stanford students has won a prize at TreeHacks, the largest collegiate hackathon in the United States, with an accessibility-focused AI music system that transforms everyday objects into playable instruments.
The system, called Maestro, was built by Stanford electrical engineering and computer science student Vansh Gadhia and his teammates in 36 hours. Gadhia, a RISE Global Fellow and Chegg.org Global Student Prize Top 50 finalist, shared details of the project on LinkedIn, describing Maestro as “an accessibility-first AI music system that lets you perform almost any instrument in real time using computer vision and everyday objects.”
The project highlights how multimodal AI systems are moving from research environments into real-time creative tools, with potential implications for music education, accessibility, and digital skills development.
Computer vision meets generative music
Rather than requiring a physical instrument, Maestro tracks hand motion and physical markers and maps gestures to notes and instruments instantly. Gadhia wrote: “If you have a broom, you have an instrument.”
Maestro combines MediaPipe and OpenCV for gesture tracking, WebSockets to connect phone and browser inputs, FluidSynth and SoundFonts for real-time MIDI playback, and Demucs plus Basic Pitch for transcription and stem reconstruction. The team integrated the Suno API for music generation and ran the system on NVIDIA DGX Spark hardware to reduce latency between gesture, sound generation, and AI feedback.
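As a rough sketch of how those pieces can fit together (illustrative only, not the team's code), a webcam-to-MIDI loop in Python might capture frames with OpenCV, read a fingertip position from MediaPipe, and trigger notes through pyfluidsynth. The SoundFont file name, the C-major note mapping, and the key handling below are placeholder assumptions:

```python
# Illustrative gesture-to-MIDI loop using the same off-the-shelf libraries.
# The SoundFont path, note mapping, and key handling are assumptions.
import cv2
import fluidsynth  # pyfluidsynth
import mediapipe as mp

synth = fluidsynth.Synth()
synth.start()                          # default audio driver
sfid = synth.sfload("piano.sf2")       # placeholder SoundFont file
synth.program_select(0, sfid, 0, 0)

hands = mp.solutions.hands.Hands(max_num_hands=1, min_detection_confidence=0.6)
cap = cv2.VideoCapture(0)
last_note = None
scale = [60, 62, 64, 65, 67, 69, 71, 72]   # C4..C5, one octave of C major

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if results.multi_hand_landmarks:
        # Landmark 8 is the index fingertip; map its height to a scale degree.
        tip = results.multi_hand_landmarks[0].landmark[8]
        note = scale[min(int(tip.y * len(scale)), len(scale) - 1)]
        if note != last_note:
            if last_note is not None:
                synth.noteoff(0, last_note)
            synth.noteon(0, note, 100)
            last_note = note
    cv2.imshow("gesture-to-midi sketch", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
synth.delete()
```

In Maestro itself, a loop of this kind also has to fuse marker detection on the physical object and WebSocket input from the phone, which is where the latency budget becomes demanding.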
The platform also includes AI song generation and a coaching layer. After a session, the system analyzes posture, timing, and performance data and provides feedback through an AI tutor.
Gadhia wrote that the goal is “to open up music to more people by removing cost and access barriers, and to help others discover and share the joy of playing and creating music without being constrained by owning a traditional instrument.”
Real-time multimodal feedback
The system operates as a multimodal pipeline combining computer vision, hand tracking, audio analysis, and AI agents.
MediaPipe tracks 21 hand landmarks per frame. OpenCV detects markers on the physical object. Strumming gestures are identified through fingertip velocity, while a companion iPhone app detects chord inputs and syncs them with webcam tracking in real time.
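A fingertip-velocity check of that kind can be expressed in a few lines. In this sketch the landmark choice, threshold, and debounce window are illustrative guesses rather than Maestro's actual values:

```python
# Hypothetical strum detector keyed to fingertip speed between frames.
# The threshold and cooldown values are illustrative, not Maestro's.
from collections import deque

class StrumDetector:
    """Flags a strum when the tracked fingertip's frame-to-frame speed spikes."""

    def __init__(self, threshold=0.05, cooldown_frames=5):
        self.threshold = threshold            # normalized units per frame
        self.cooldown_frames = cooldown_frames
        self._cooldown = 0
        self._history = deque(maxlen=2)       # last two (x, y) positions

    def update(self, fingertip):
        """fingertip: a MediaPipe landmark with normalized .x and .y fields."""
        self._history.append((fingertip.x, fingertip.y))
        if self._cooldown > 0:
            self._cooldown -= 1
            return False
        if len(self._history) < 2:
            return False
        (x0, y0), (x1, y1) = self._history
        speed = ((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5
        if speed > self.threshold:
            self._cooldown = self.cooldown_frames   # debounce repeat triggers
            return True
        return False
```

Fed the index-fingertip landmark from a tracking loop like the one above, update() would fire on each strum-like flick, and the chord most recently reported by the phone app could then be played on that trigger.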
In coaching mode, video and audio are analyzed in parallel by separate AI agents. Feedback is returned as structured guidance and delivered via on-screen text and text-to-speech.
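One plausible shape for that parallel analysis, sketched here with asyncio and entirely hypothetical analyzer functions and feedback fields, is shown below:

```python
# Illustrative only: one way to analyze video and audio in parallel and merge
# the results into structured guidance. The analyzer functions and feedback
# fields are hypothetical, not Maestro's actual agents.
import asyncio

async def analyze_video(frames):
    await asyncio.sleep(0)  # placeholder for a vision-model or API call
    return {"posture": "example: keep the wrist relaxed on down-strums"}

async def analyze_audio(samples):
    await asyncio.sleep(0)  # placeholder for an audio-model or API call
    return {"timing": "example: slightly ahead of the beat in bar 3"}

async def coach(frames, samples):
    # Run both analyses concurrently, then merge into one feedback object
    # that can be rendered as on-screen text or passed to text-to-speech.
    video_notes, audio_notes = await asyncio.gather(
        analyze_video(frames), analyze_audio(samples)
    )
    return {**video_notes, **audio_notes}

feedback = asyncio.run(coach(frames=[], samples=[]))
print(feedback)
```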
The team reports that latency was a central challenge. Musical interaction requires an immediate response to feel natural, which put the emphasis on tight coordination across the frontend, backend, and AI models.
From hackathon demo to education tool
TreeHacks, hosted annually by Stanford University, focuses on student-led solutions to real-world challenges. Maestro received second place in the music track, which was supported by Suno, NVIDIA, Perplexity, and Reet Chowdhary.
While developed as a hackathon project, the system points toward broader applications in education and accessibility. By reducing the cost barrier associated with musical instruments and combining performance with AI coaching, tools like Maestro could extend creative learning opportunities beyond formal music classrooms.
The team says it plans to expand object recognition, introduce collaborative modes, and build adaptive practice tracking.
