Where to start? Text Extraction

Loopedcandle@lemmy.world · 11 months ago

coolkicks@lemmy.world · 11 months ago

Yeah, model training is hard. Like capital H HARD. You need a lot of data, and it needs to be high quality.

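New York is the financial center of the USA, so a model can learn to associate finance with regional wording instead of the job itself. Separating finance jobs from job postings that just happen to be written by someone using New England vernacular is a step you need to go through to make sure your data is high enough quality.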
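One rough way to spot that kind of accidental correlation is to check which words show up almost exclusively under one label. This is only a sketch: `label_skew`, `min_count`, and the toy postings are made up for illustration, and a real check would need proper tokenization and far more data.

```python
from collections import Counter, defaultdict


def label_skew(docs, labels, min_count=2):
    """Map each token to (most common label, share of its documents with that label).

    Hypothetical helper: tokens seen in fewer than `min_count` documents are skipped.
    """
    per_token = defaultdict(Counter)
    for text, label in zip(docs, labels):
        # Count each token once per document so long postings don't dominate.
        for token in set(text.lower().split()):
            per_token[token][label] += 1

    skew = {}
    for token, counts in per_token.items():
        total = sum(counts.values())
        if total >= min_count:
            top_label, top_count = counts.most_common(1)[0]
            skew[token] = (top_label, top_count / total)
    return skew


# Toy postings, invented for this example.
docs = [
    "wicked good analyst role in boston",
    "wicked smart trader wanted",
    "line cook needed in austin",
    "analyst role in chicago",
]
labels = ["finance", "finance", "other", "finance"]

for token, (label, share) in sorted(label_skew(docs, labels).items()):
    print(f"{token}: {label} ({share:.2f})")
```

Here "wicked" lands entirely under "finance" even though it says nothing about the job, which is the kind of thing a model will happily pick up on if you don't clean it out first.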

So if you are just starting, use the 20 Newsgroups dataset from those links. It’s pretty good data with a ton of resources written about it. It’s not fun data, but it’s less likely to fall victim to biases you aren’t expecting.
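If you go the scikit-learn route, getting a first baseline on 20 Newsgroups could look something like this. It's a minimal sketch that assumes scikit-learn is installed; TF-IDF plus naive Bayes is just a common starting point I picked, not the only option, and `load_20newsgroups` and `train_baseline` are names made up for this example.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def load_20newsgroups(subset):
    """Fetch the 'train' or 'test' split (downloads the data on first call)."""
    # Dropping headers, footers and quotes keeps the model from keying on
    # metadata instead of the actual message text.
    data = fetch_20newsgroups(subset=subset, remove=("headers", "footers", "quotes"))
    return data.data, data.target


def train_baseline(texts, labels):
    """Fit a TF-IDF + multinomial naive Bayes pipeline (an assumed baseline)."""
    model = make_pipeline(TfidfVectorizer(), MultinomialNB())
    model.fit(texts, labels)
    return model
```

To try it, load both splits with `load_20newsgroups("train")` and `load_20newsgroups("test")`, fit with `train_baseline`, then call `model.score(test_texts, test_labels)` to get accuracy on the held-out split.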