Q&A: Tal Melenboim on AI’s missing piece: clean, licensed data

Tal Melenboim, the visionary behind VFR.ai and StyleTech.ai, champions the cause of high-quality data as the cornerstone of robust AI systems.

When most of the AI world obsesses over bigger models and wilder benchmarks, Tal Melenboim zeroes in on the real engine: data.

As the founder of VFR.ai and chairman of StyleTech.ai, he’s spent years building the bones of AI—systems that last, not just headlines that flash and fade. “Models get all the hype, but they’re useless without the right data,” he says, cutting through the noise.

In this Q&A, Melenboim explains why the age of cheap, scraped data is ending, how synthetic content sabotages accuracy, and why companies that survive the new wave of lawsuits will be the ones obsessed with sourcing, quality, and verifiability—long before the regulators come knocking.

Who we’re talking to

Tal Melenboim

Tech investor and entrepreneur; founder of VFR.ai; chairman/founder at StyleTech.ai; CTO at Moda Match; multiple U.S. patents; winner of a Climate Solutions competition with StyleTech.ai.

What pulled you toward data infrastructure, specifically in the AI space, rather than the flashier side of models or consumer applications?

From the beginning, I was always more interested in what makes things actually work. The infrastructure, the pipeline, the systems underneath the surface. Models get all the attention, but they’re only as good as the data they’re fed. 

I’ve seen it over and over again: great algorithms fail because the underlying data was broken, biased, or badly organized. Over time, it became clear that if you want to build reliable, scalable AI, the real work starts with getting the data right.

How has the way companies approach training data changed since you started working in this space?

A great deal has changed. In the early days, most teams simply used whatever data they could get. The basic premise was that more data, regardless of its source, would improve outcomes. That view has begun to shift. It’s now widely acknowledged that high-quality data matters more than sheer quantity.

Companies are asking harder questions about where their data originated, how representative it is, and whether they’re authorized to use it. They’re building infrastructure so their pipelines can do more than simply scale: they can audit datasets and trace their origins. They want verification, reliability, and organization.

You’ve said before, “the real value behind AI isn’t the model, it’s the data.” For readers who are new to this space or just assume that bigger models mean better results, can you break that down?

A lot of people think that if you make the model bigger, you’ll automatically get better outcomes. But that’s not how it works. A huge model trained on poor data just learns those problems at scale. It becomes more confident in being wrong. 

On the other hand, a smaller model trained on curated, permissioned, high-integrity data can perform far better, especially in real-world settings. The model is just the engine. The data is the fuel. If the fuel is contaminated, the engine doesn’t matter.

We’re seeing high-profile lawsuits over how models were trained. Do you see that as a temporary bump in the road or something more fundamental?

Let me tell you, this isn’t just a phase. For far too long, the industry chose to disregard a critical structural issue. While no one was looking, companies could get away with building opaque datasets, exploiting copyrighted material, and scraping the internet without authorization. But now there’s pushback from creators, regulators, and the courts.

Organizations need to reevaluate their sourcing strategies from the ground up. That means licensing content appropriately, paying rightful owners, and building training datasets that can withstand legal and ethical scrutiny. It’s not simply about checking a box; it’s about laying a solid foundation for artificial intelligence.

You’ve also mentioned feedback loops, in which AI models learn from content produced by other AI models. What does this look like in practice, and why should anyone care?

It is already happening. A model generates text, which is scraped and added to a dataset, which is then used to train a new model. Over time, layers of synthetic content accumulate on top of one another. It’s like making a copy of a copy.

You lose fidelity, context, and grounding in reality. It may not be noticeable at first, but it gradually leads to degraded performance and erratic behavior. People should care because the more this happens, the more detached these systems become from the real world.
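To make that concrete, here is a minimal sketch in Python of the kind of provenance filter this implies. The `source` field is a hypothetical metadata tag set at ingestion, not a standard; the point is simply that model-generated records are identified and kept out of the next training set.

```python
# Minimal sketch: keep model-generated text out of a training corpus.
# Assumes each record is a dict with a hypothetical "source" field
# ("human", "licensed", "model_generated", ...) set at ingestion time.

def filter_synthetic(records):
    """Yield only records whose provenance is not model-generated."""
    for record in records:
        if record.get("source") == "model_generated":
            continue  # skip synthetic text to avoid copy-of-a-copy drift
        yield record

corpus = [
    {"text": "Hand-written product review...", "source": "human"},
    {"text": "LLM-generated summary...", "source": "model_generated"},
]

clean = list(filter_synthetic(corpus))  # keeps only the human record
```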

What’s the biggest misconception you see AI teams making when they build their data pipelines?

The biggest one is assuming that pipelines are just technical problems to be solved after the fact. They treat data like a commodity and focus all their time and budget on the model. But the truth is, if you don’t invest in the data pipeline early, you end up with brittle systems. 

You can’t trace your data, you can’t audit it, and you definitely can’t fix problems once the model is live. A well-built pipeline needs to include processes for tagging, cleaning, verifying, and managing data at every stage.
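As a rough illustration of that point (not a prescription), here is a sketch of a pipeline that treats tagging, cleaning, and verification as explicit stages. The stage names and checks are illustrative assumptions, not any particular team’s design.

```python
# Rough sketch of a staged data pipeline: each record is cleaned,
# tagged, and verified before it is accepted for training.
from dataclasses import dataclass, field

@dataclass
class Record:
    text: str
    meta: dict = field(default_factory=dict)

def clean(record: Record) -> Record:
    record.text = " ".join(record.text.split())  # normalize whitespace
    return record

def tag(record: Record) -> Record:
    record.meta["length"] = len(record.text)  # example quality tag
    return record

def verify(record: Record) -> bool:
    return record.meta.get("length", 0) > 0  # example check: non-empty

def run_pipeline(records):
    for r in records:
        r = tag(clean(r))
        if verify(r):  # only verified records move on to training
            yield r

accepted = list(run_pipeline([Record("  messy   input  "), Record("")]))
```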

If someone wanted to build an AI model today without scraping the internet, where would they realistically get quality training data?

There are many ways to get there. Licensed data providers will sell you datasets under clear terms. You can also partner with organizations that hold domain-specific data, like medical records, legal documents, or financial transactions.

Crowdsourcing is another route, where users consent to contribute their own data. And synthetic data can be useful in some cases, as long as it is generated from real, high-quality seed data and properly validated. What matters is knowing, and being able to prove, where your data comes from.

You often talk about the need for curated, permissioned, and verifiable datasets. What does that infrastructure actually look like?

It starts with ingesting data through channels where you have clear legal rights. Then you add metadata at the source: who it came from, when it was collected, what it can be used for. You need systems for annotation, review, quality scoring, and bias detection.

You also need to be able to trace a single data point all the way through the pipeline and into a model if necessary. And if something changes, like the license expires, you should be able to pull that data out or stop using it. That level of control is what makes the difference between a risky dataset and a reliable one.
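A minimal sketch of what that record-level control might look like, assuming a per-record license expiry field; the field names here (source_id, license, expires) are illustrative, not a standard schema:

```python
# Sketch: per-record provenance metadata with a revocation check.
from datetime import date

record = {
    "text": "Licensed catalog description...",
    "provenance": {
        "source_id": "vendor-123",      # where the record came from
        "collected": date(2024, 5, 1),  # when it was ingested
        "license": "commercial-train",  # what it may be used for
        "expires": date(2026, 5, 1),    # when the license lapses
    },
}

def usable(rec, today=None):
    """Return True only while the record's license is still in force."""
    today = today or date.today()
    return rec["provenance"]["expires"] > today

if not usable(record):
    pass  # pull the record from active training sets once rights lapse
```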

Where do you see AI data sourcing heading in the next two to three years as these legal battles play out?

I think we’re going to see a lot more structure and transparency. Data marketplaces will mature. We’ll see certifications or standards around training data, especially for sensitive industries. 

Companies will be expected to keep detailed records of where their data came from, how it was licensed, and how it was processed. There will also be a rise in cooperatives or consortia where groups of companies share access to verified data under agreed terms. The old approach of grabbing whatever’s online is going away. It has to.

What happens to smaller AI startups that can’t afford clean, licensed datasets?

It’s a real challenge, but not impossible. Startups may need to focus on narrower domains where they can generate or collect data themselves. That might mean working with open datasets, building partnerships with institutions, or focusing on user-contributed data where consent is clear.

Another option is joining data co-ops or pooled licensing arrangements where multiple startups share access. It requires more effort up front, but it’s also an opportunity to build something truly differentiated, because if your data is better, your model will be too.

Final Takeaways

In this Q&A, Tal Melenboim cuts through model hype and anchors the discussion where AI succeeds or fails: lawful, high‑integrity data.

He tracks the shift from mass scraping to rights-clear sourcing with full lineage, warns of feedback loops as models train on model output (“a copy of a copy”), and argues that teams must design pipelines with tagging, cleaning, versioning, audits, and revocation controls from the outset.

He lays out practical ways to source quality data without scraping, including licensed providers, domain partnerships in regulated fields, consent-based user contributions, and synthetic data validated against strong ground truth.

He also sets expectations for the near future, including standards, certifications, mature data marketplaces, and strict record-keeping.

For startups, the path is focus and cooperation: narrow domains, pooled licenses, and data co‑ops that turn better data into better models. The takeaway is blunt: invest in curated, permissioned, and verifiable datasets now, or ship brittle systems that won’t withstand the scrutiny of courts or customers.

What do you think: are we finally entering the era where data quality matters more than model size? How is your team tackling the challenges of data sourcing, verification, and transparency? Share your thoughts in the comments below, or join the conversation with us on Facebook or Twitter.

Kevin is KnowTechie's founder and executive editor. With over 15 years of blogging experience in the tech industry, Kevin has transformed what was once a passion project into a full-blown tech news publication. Shoot him an email at kevin@knowtechie.com.
