Why Data Science Failed
The Lessons We Learned From the Collapse

Hello everyone and welcome to my newsletter where I discuss real-world skills needed for the top data jobs. 👏
This week I’m discussing the collapse of the data science role in the real-world. 👀
Not a subscriber? Join the informed. Over 200K people read my content monthly.
Thank you. 🎉
Many people reading this headline will think to themselves, what is this guy talking about? There are thousands of openings for data science jobs. Yes, there are; however, most don’t know the abysmal statistics behind the collapse of this role.
Before we go any further, let me give you the number. What do you think the success rate for data science projects in the real world is? The current estimate is 4%. Yep. Only 4% of data science projects ended up with a published, usable machine learning model. Let’s be honest, even if it’s off a little, that number is shockingly bad.
Only 4% of data science projects ended up with a published, usable machine learning model.
Here’s a definition of each role in the applied space. One is more theoretical, the data scientist; the other is more applied, the machine learning engineer.
Data Scientist - Heavy math, many with PhDs, mostly from academia, no real-world data skills and very little to no programming acumen. The statisticians.
Machine Learning Engineer - Very little math. No advanced degrees. Many from other data roles like BI professionals, data engineers or DBAs. The data professionals.
Gartner is a very reputable company that’s been around for a minute. In 2015, Gartner’s Nick Heudecker authored an article that raised a lot of eyebrows. The article was picked up by Forbes. Here’s an excerpt from the Forbes article.

Forbes
My theory on the failure is simple: companies put theory-heavy academics with very little real-world technical acumen into roles they weren’t qualified for. Yes, this is a gross oversimplification and there are many variables to digest; however, the failure rate for business intelligence projects is only 70%. 🙂
Business intelligence is essentially data mining. Models look through data for patterns. The only real difference is the final result. In machine learning, a model is created that can be used instantly to provide a prediction on new data. In data mining, the data professional needs to hand their findings over to the business. The role has always been the domain of the data professional. Honestly, the numbers here aren’t good. The number Gartner gives is around a 30% success rate. While I consider that abysmal, it’s not 4%.

It was one of the greatest titles I had ever seen: Data Scientist: The Sexiest Job of the 21st Century, pronounced the Harvard Business Review. I was working on a business intelligence contract at the time, so I was interested in analytics. The BI Developer was the top predictive analytics role at the time, and this new paradigm was going to quickly replace it.

Business intelligence is very similar to machine learning.
Business intelligence is more about descriptive and diagnostic analytics — understanding what happened and why. It uses historical data to provide insights via dashboards, reports, and visualizations.
Machine Learning is more about predictive and prescriptive analytics — learning patterns from data to predict what will happen or recommend what to do next. The final product from all machine learning engineering projects is a model. A model is simply code that looks for patterns in data.
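A model really is just code that learns a pattern from data. Here’s a minimal sketch of that idea, using scikit-learn as an illustration (the dataset and numbers are made up for the example):

```python
from sklearn.linear_model import LinearRegression

# Tiny hypothetical dataset: hours studied vs. exam score
X = [[1], [2], [3], [4], [5]]   # feature: hours studied
y = [52, 58, 66, 71, 79]        # target: exam score

model = LinearRegression()
model.fit(X, y)                 # the model "learns" the linear pattern

# Predictive analytics: score new, unseen data
prediction = model.predict([[6]])  # estimate for 6 hours of study
print(prediction)
```

That last line is the whole difference between BI and ML in miniature: instead of handing findings to the business, the fitted model can score new data instantly.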
I read the article in its entirety, twice. I knew it was a role I was going to focus on moving forward. However, something was a little off. Here’s how the article defined the new role.
It’s a high-ranking professional with the training and curiosity to make discoveries in the world of big data.
The title wasn’t technical and used the most overused buzzword of the time: big data. In 2010, it was the “it” technology, like machine learning is today.
Big data means a lot of different things to different people. However, in the applied space it means unstructured data. This still made sense to me. These new data scientists had finally uncovered a way to use big data in predictive analytics. I was in awe.
I continued reading: “But thousands of data scientists are already working at both start-ups and well-established companies.” I thought this was strange. I was young and my job was all I had, so I was a voracious reader of all things data. Not only that, I was a consultant, used to doing six-month to one-year stints at companies before moving to new roles, and I had never heard of the data scientist.
Then I arrived at this sentence. “Data scientists realize that they face technical limitations, but they don’t allow that to bog down their search for novel solutions. As they make discoveries, they communicate what they’ve learned and suggest its implications for new business directions.” 💣 Yikes… tick… tock.
This is where it hit me. 🥊 These aren’t technical professionals with data experience, they are business professionals or academics. 💥 Data in the real-world isn’t data in academia. It’s not pre-cleansed or fabricated for toy modeling problems. It’s a horrific mess.
Data in the real-world isn’t data in academia. It’s not pre-cleansed or fabricated for toy modeling problems. It’s a horrific mess.
It’s stored at a hundred places all over sundry drives. It’s stored in relational databases, from different vendors on disparate systems. Worse still, a lot of it was stored in Excel spreadsheets or inside this horrible product from Microsoft called Access. 😃
I thought to myself, I suck at cleaning and massaging data and I’ve been doing it for a decade. How the hell is someone who’s never worked with it going to manage? I didn’t know it at the time, but this would be one of the core reasons for the role’s collapse.


Why do only 4% of machine learning models make it to production? Let’s high-level a few reasons.
Misunderstanding the Problem
Most organizations don’t understand why they need AI. They simply know they need it. This is a recipe for disaster from the very start. Here are the first two questions I ask before moving forward with any discussion about implementing machine learning in any environment.
What are you trying to predict?
Do you have the data?
All machine learning is prediction. If you aren’t trying to predict something, then you don’t need machine learning. All machine learning models are built on data. Do you have the correct data to solve the problem? What if you don’t? Before you cleanse a single attribute, you might need to create a project to collect the data you need. Altering current production environments to collect the data you need at most companies is a separate project and often a painful one.
Poor Data Quality
Machine learning models are finicky. You can’t pass them raw data and expect great results. They require highly cleansed, well-structured, consistent and well-balanced data. The old axiom here is shit in, shit out. If you don’t have the right data, you as the machine learning engineer are responsible for helping the data professional collect the right data.
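To make “highly cleansed” concrete, here’s a small hypothetical pandas sketch of the grunt work involved: inconsistent category labels, numbers stored as text, and missing values all have to be handled before a model ever sees the data. (The column names and values are invented for illustration.)

```python
import pandas as pd

# Hypothetical raw extract: exactly the mess you get in the real world
raw = pd.DataFrame({
    "state":   ["TX", "tx", "Texas", None, "CA"],
    "revenue": ["1,200", "950", None, "2,400", "800"],
})

# Standardize inconsistent category labels
raw["state"] = raw["state"].str.strip().str.upper().replace({"TEXAS": "TX"})

# Fix types: strip thousands separators, convert text to numbers
raw["revenue"] = pd.to_numeric(raw["revenue"].str.replace(",", ""), errors="coerce")

# Handle missing values -- a judgment call, never a default
clean = raw.dropna(subset=["state"]).copy()
clean["revenue"] = clean["revenue"].fillna(clean["revenue"].median())

print(clean)
```

Five rows and two columns already need three distinct fixes. Now scale that to millions of rows across dozens of tables.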
Technical Acumen
This is easily the number one reason why these projects are failing. Organizations put people in roles they are unqualified for. They hire a theoretician who’s never seen a SQL statement and ask them to source data from three disparate vendor databases. How are they going to export the data once they know what data they need? How are they going to combine the data from those databases or export the data they need to a CSV file?
You might be thinking, that’s someone else’s problem. It might be at Microsoft; it’s not at any small to medium-sized shop you’ll ever work at. You can work the entire end-to-end machine learning pipeline or you can’t. Most people can’t.
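Sourcing from disparate databases and exporting to CSV sounds abstract until you’ve done it. Here’s a hedged sketch of the shape of that work, using two in-memory SQLite databases as stand-ins for two vendor systems (all table and column names are invented):

```python
import sqlite3
import pandas as pd

# Stand-ins for two disparate vendor databases (SQLite here for illustration)
crm = sqlite3.connect(":memory:")
erp = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
crm.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Acme"), (2, "Globex")])
erp.execute("CREATE TABLE orders (customer_id INTEGER, total REAL)")
erp.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 250.0), (1, 100.0), (2, 75.0)])

# Pull each source into a DataFrame, then join them in pandas
customers = pd.read_sql("SELECT * FROM customers", crm)
orders = pd.read_sql("SELECT * FROM orders", erp)
combined = customers.merge(orders, left_on="id", right_on="customer_id")

# Export the combined result for the next pipeline stage
combined.to_csv("combined.csv", index=False)
```

In real life the databases are on different networks, from different vendors, with different auth, and the join keys rarely line up this cleanly. If you can’t write the SQL and the glue code, you’re stuck at step one.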
Complexity of Machine Learning Pipeline
There are only four core sections of this pipeline, but each one has its own technical requirements.
You need to source the data. We’ve already discussed this issue. This part is very difficult in the real-world.
Next is data cleansing. Most people have no clue what this entails, yet it’s a learned skill like anything else. To be more accurate, sourcing and cleansing together are around 80% of the role.
Next is modeling. This is the easy part because the models already exist; just pass the clean data to the model. The problem here is that most don’t know which models to use for a given problem.
Lastly, there’s production, and this is a big one. Deployment, monitoring, retraining… all learned skills… that most haven’t learned.
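Strung together, the four stages above can be sketched in a few lines. This is a toy illustration with scikit-learn, with pickle standing in for a real deployment step; every name here is illustrative, and the synthetic dataset papers over the two stages that are hardest in real life:

```python
import pickle
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# 1. Source: in real life the hard part; here, a synthetic dataset
X, y = make_classification(n_samples=200, random_state=42)

# 2. Cleanse: already clean here -- in production, ~80% of the work
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 3. Model: the "easy" part -- the algorithms already exist
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# 4. Production: serialize the model so an application can load and serve it
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

print(round(model.score(X_test, y_test), 2))
```

Notice that stages 1 and 2 are one line each here and stages 3 and 4 look trivial. In a real shop, the ratio is inverted: sourcing and cleansing dominate, and production means monitoring and retraining, not a pickle file.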
Lack of Skilled Professionals
I am constantly amazed by those who profess machine learning skills but can’t answer the most basic questions about the role. I’m not talking about handling difficult technical questions, I’m talking about the most basic questions you can ask an applicant.
What are the two core Pandas data structures? Machine learning engineers use Pandas every single day, all day. This should be as easy as reciting your phone number, yet most don’t know. Forget about questions that are a little more technical, like: why is more data almost always better in machine learning? Good luck with that one. It has to do with the law of large numbers. Finding skilled machine learning engineers is next to impossible.
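For the record, the answer is the Series (one-dimensional) and the DataFrame (two-dimensional):

```python
import pandas as pd

# The two core pandas data structures:
s = pd.Series([10, 20, 30], name="revenue")     # 1-D labeled array
df = pd.DataFrame({"region": ["N", "S", "E"],   # 2-D labeled table;
                   "revenue": [10, 20, 30]})    # each column is a Series

print(type(df["revenue"]))  # a DataFrame column is itself a Series
```

If a candidate can’t produce that answer instantly, they haven’t spent real time in the tooling the role demands.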
Regardless of what the real failure numbers are, the data science role as it’s defined today in the real-world has failed. Here are some things to think about.
The current number of successfully deployed machine learning projects in the real-world is around 4%.
The data science role has fallen off every top job board in the past few years. Most of the jobs are mislabeled. When you see data cleansing or SQL, that’s a job for a machine learning engineer, not a glorified statistician.
The majority of work in machine learning revolves around two things. The first is data sourcing and the second data cleansing. Two skills most data scientists do not have. Modeling has been democratized and it’s the easy part.
The machine learning pipeline is only 4 core items. However, they involve a disparate set of highly technical skills most data scientists do not have.
Poor data quality is the norm, not the exception. Downloading a nicely formatted dataset only happens on toy problems.
I know this angers a lot of people who have spent years in academia believing their hard work will prepare them for a job in the real-world. With a 4% success rate, you can clearly see that wasn’t the case. After billions of dollars of losses it’s time we reevaluate what went wrong and more importantly, how to fix it. 👏
Thanks for reading and have a great day.
I almost forgot. 🙂 Nah, I’m not going to leave you hanging.
Why is more data almost always better in machine learning? 👈
We have an intuition that more observations are better. This is the same intuition behind the idea that if we collect more data, our sample of data will be more representative of the problem domain.
There is a theorem in statistics and probability that supports this intuition that is a pillar of both of these fields and has important implications in applied machine learning. The name of this theorem is the law of large numbers.
The law of large numbers is a theorem from probability and statistics that suggests that the average result from repeating an experiment multiple times will better approximate the true or expected underlying result.
For example, let’s say I flip a coin 5 times and it comes up heads every time. Does that mean the chances of the coin coming up tails on the next flip is now much greater? Nope. It’s still 50/50. The more we flip that coin, the closer we will get to the real answer.
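You can watch the law of large numbers at work with a quick simulation (standard library only; the seed is just for reproducibility):

```python
import random

random.seed(42)  # reproducible flips

# As the number of flips grows, the observed proportion of heads
# converges toward the true probability of 0.5.
for n in [10, 100, 10_000, 1_000_000]:
    heads = sum(random.random() < 0.5 for _ in range(n))
    print(f"{n:>9} flips: proportion of heads = {heads / n:.4f}")
```

With 10 flips the proportion can land far from 0.5; with a million flips it hugs it. That’s exactly why more observations give a model a more representative picture of the problem domain.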

If you do make it to a career in machine learning, you will eventually get this question in a technical interview and this is how you answer it. 👏
I hope you are getting a lot of real-world insight into the top data roles via my newsletter and YouTube channel but I can help even more.
If you are going the machine learning engineer route, here’s a free course on applied statistics. It covers 90% of all the statistics you’ll need in the real world.
Python is the gold standard in machine learning, and after SQL, it’s the second most used language in data engineering. That means you need to know it. Here’s my second free course for you in this article: Python for machine learning. Now, it’s geared towards MLEs; however, the basics are all the same.
Another one! Yep, here’s another free course for you if you’re interested in real-world machine learning. If you follow me, you know what the top model in the real world is for 80% of all machine learning problems. Right: gradient boosters. This is a short course on XGBoost, the most famous gradient booster.
Thank you and stay focused. 🥳