Thursday, March 26

We're going to train on your data after all • The Register


Microsoft’s GitHub next month plans to begin using customer interaction data – “specifically inputs, outputs, code snippets, and associated context” – to train its AI models.

The code locker’s revised policy applies to Copilot Free, Pro, and Pro+ customers, as of April 24. Copilot Business and Copilot Enterprise users are exempt thanks to the terms of their contracts. Students and teachers who access Copilot will also be spared.

Those affected have the option to opt out in accordance with “established industry practices” – meaning opt-out by default, per US norms, rather than the opt-in commonly required in Europe. To opt out, GitHub users should visit /settings/copilot/features and disable “Allow GitHub to use my data for AI model training” under the Privacy heading.

Mario Rodriguez, GitHub’s chief product officer, would rather you didn’t.

“By participating, you’ll help our models better understand development workflows, deliver more accurate and secure code pattern suggestions, and improve their ability to help you catch potential bugs before they reach production,” he wrote in a blog post.

To excuse its covetous behavior, GitHub in its FAQs notes that Anthropic, JetBrains, and corporate parent Microsoft operate similar opt-out data use policies.

The rationale for the change, according to Rodriguez, is that interaction data makes the company's AI models perform better. Adding interaction data from Microsoft employees has led to meaningful improvements, he claims, such as an increased acceptance rate for AI model suggestions.

The data GitHub wants includes:

  • Model outputs that have been accepted or modified;
  • Model inputs including code snippets shown;
  • Code context surrounding your cursor position;
  • Comments and documentation you’ve written;
  • File names and repo structure;
  • Interactions with Copilot features (e.g. chats); and
  • Feedback (e.g. thumbs up/down ratings).

The policy shift does somewhat change the meaning of GitHub private repositories, which are notionally “only accessible to you, people you explicitly share access with, and, for organization repositories, certain organization members.” These might be more accurately described as “GitHub private* repositories,” with the asterisk to denote the limits of GitHub’s definition of the word “private.”

As the FAQs explain: “If a Copilot user has their settings set to enable model training on their interaction data, code snippets from private repositories can be collected and used for model training while the user is actively engaged with Copilot while working in that repository.”

Recent banter in the GitHub community doesn’t include much enthusiasm for the plan. To judge by emoji votes alone, users have offered 59 thumbs-down votes and just three rocket ships, which we understand signal some measure of excitement.

But among the 39 posts commenting on the change at the time this article was filed, no one other than Martin Woodward, GitHub VP of developer relations, has really endorsed the idea.

User indignation might be somewhat mitigated if GitHub users recognized that OpenAI’s Codex – used in GitHub Copilot – is “a GPT language model fine-tuned on publicly available code from GitHub.” That verbiage shows the data-gorged AI horse is already out of the barn, so to speak.

Shutting the doors at this point won’t change the fact that the AI industry is built on data gathered without first asking for anything like enthusiastic consent. ®


