AI Loophole #1; Your GitHub README.md

Elias Griffin@lemmy.world · edit-2 5 months ago

bamboo@lemm.ee · 5 months ago

Anything you put publicly on the internet in a well-known format is likely to end up in a training set. It hasn’t been decided legally yet, but it’s very likely that training a model will fall under fair use. Commercial solutions go a step further and prevent exact 1:1 reproductions, which would likely settle any ambiguity. You can throw anti-AI licenses on it, but until training is determined to be a violation of copyright, they are literally meaningless.

Also, if you just hope to spam tab with any of the AI code generators and get good results, you won’t. That’s not how those work. Saying something like this just shows the world that you have no idea how to use the tool; it says nothing about the quality of the tool itself. AI is a useful tool, it’s not a magic bullet.

catloaf@lemm.ee · 5 months ago

I think that training models for fair use purposes, like education, not commercialization, will also fall under fair use. But even so, it’s very difficult to prove that someone has trained their model on your data without a license, so as long as it’s available, I’m sure that it’ll be used.

AlexanderESmith@social.alexanderesmith.com · 5 months ago

This “fair use” argument is excellent if used specifically in the context of “education, not commercialization”. Best one I’ve seen yet, actually.

The only problem is that perplexity.ai isn’t marketing itself as educational, or as a commentary on the work, or as parody. They tout themselves as a search engine. They also have paid “pro” and “enterprise” plans. Do you think they’re specifically contextualizing their training data based on which user is asking the question? I absolutely do not.

the_doktor · 5 months ago

And this is why AI needs to be banned from use. People own the things they post / place them under various licenses, and AI coming along and taking what you did is a blatant violation of copyright, ownership, trust, and is just general theft.

I am absolutely angry with the concept of AI and have campaigned against its use and written at length, many times, to every company that believes it’s allowed to scour the internet for training data for its highly flawed, often incorrect, sometimes dangerous AI garbage. To hell with that and to hell with anyone who supports AI.

bamboo@lemm.ee · 5 months ago

It hasn’t been decided in court yet, but it’s likely that AI training won’t be considered a copyright violation, especially if there is a measure in place to prevent exact 1:1 reproductions of the training material.

But even then, how are the questionable choices of some LLM trainers a reason to ban all AI? There are some models that are trained exclusively on material that is explicitly licensed for this purpose. There’s nothing legally or morally dubious about training an LLM if the training material is all properly licensed, right?

Elias Griffin@lemmy.world · edit-2 5 months ago

Sounds like AI or an AI influencer post. The first paragraph is so far off-topic, might as well be talking about sailing. You completely misunderstood what I meant using TabNine. I wrote my own code and obfuscated my own code. Then tried to have AI complete another function using my code.

Nothing you said is relevant in any way, shape, or form.

[EDIT] https://www.tabnine.com/

wizardbeard@lemmy.dbzer0.com · edit-2 5 months ago

My guy, your posts are particularly hard to follow, and you are very very quick to jump to the conclusion that you’re somehow being targeted and under attack. It’s no surprise that people aren’t responding to what you think is appropriate for them to respond to.

You’ve gone out of your way to provide extra info about irrelevant details. Why does the particular flavor of git you use matter at all to this conversation, beyond the fact that you self-host? Why does it matter that you are on GitHub as well, when we are specifically discussing things you believe were sourced from README.md files you have self-hosted?

Meanwhile you don’t give many details or explanation about the core thing you are trying to discuss, seemingly expecting people to be able to just follow your ramblings.

Edit: After having re-read your OP, it’s less messy than I initially thought, but jesus christ man you need to work on arranging your points better. It shouldn’t take reading your main post, a few of your comments, and the main post again to get your point: “AI data scrapers appear to treat readme files as public data regardless of any anti-AI precautions or licensing you’ve tried to apply, and they appear to grab not only from GitHub but also from self-hosted git repositories.”

Chronographs · 5 months ago

Seriously. OP might have a legitimate point but they’re making it with the energy of someone trying to convince me that vole people live in the antiposition of the time cube.

AlexanderESmith@social.alexanderesmith.com · 5 months ago

In fairness, a lot of the more exceptional engineers I’ve worked with couldn’t write their way out of a wet paper bag.

On top of that, even great technical writers are often bad at picking - or sticking with - an appropriate target audience.