Lessons from My 0th Year as a Data Scientist

Later this week, I’ll graduate from the Master’s in Data Science program at the University of San Francisco alongside an incredible cohort of colleagues. It’s been a doozy of a year, to say the least. Now that it’s coming to an end, I’d like to reflect on some of the things I’ve learned and share them with you, in the hope that it’ll be useful to somebody somewhere who, like me a year ago, is just starting out on their own data journey. Here are, in no particular order, a few of the things I’ve learned over the past 12 months…

Play to Your Strengths

Data Science is a broad field. It’s easy to be overwhelmed by the sheer number of topics that many assume you need to know in order to be a good Data Scientist — linear regression, time series, A/B testing, distributed computing, machine learning, deep learning, databases, SQL, computer science, Bayesian inference, MCMC… The list goes on and on.

I don’t share this opinion. You do not need to know everything there is to know about everything, although it can sometimes feel that way. My advice is to play to your strengths. What do I mean by that? Well, take me for example. I am by all accounts terrible at statistics. All of this talk about estimating parameters and hypothesis tests just goes way over my head. Rather than beat my head against a wall trying to understand what the heck an F-test for significance is, I move on. On to the next thing, the thing that actually makes sense to me.

As a result, I end up knowing a decent amount about a small number of subjects and very little about anything else. This might seem like a bad thing, depending on your point of view. The thing I like about this sort of approach to learning is that the things you come to really grasp, the things that make sense to you, become places of refuge you can fall back to when the going gets tough. The subjects you know will give you confidence and serve as footholds from which you can grow your understanding of new subjects later on.

Be Patient

Of course, sometimes there’s no escape. You really need to understand this one thing which just doesn’t make any sense. Be patient. Take a break and come back to it — it will still be there, I promise. Learning about sequence-to-sequence models for the first time? Or perhaps Attention? Confused? Trust me, so was everybody else the first time they encountered them. And the second time. And the third time. Until… finally, things clicked.

There is never a need to understand something the first time you see it. Or the second time. What’s important is that you understand things in the end. How long it takes is irrelevant. Learn at your own pace and don’t feel bad if it takes you a long time to understand something.

Struggling Is a Good Thing

That being said, it’s important that you don’t give up. This applies to big things and little things alike. For example, last week I spent 48 hours straight trying to set up SSL encryption for a website my friends and I put together. You know what the trick was? Three clicks of a mouse — three harmless clicks which eluded me for 48 hours. The good news? Because I stuck with it, I’ll remember the solution for the rest of my life.

The moral of the story is this: struggling is a good thing. The more you struggle to understand something, the greater the chance is that you’ll remember it going forward. Not only that, but you’ll find that as you struggle to understand just this one thing, you’ll end up learning quite a bit about a bunch of other things. In the example above, I learned a bit about load balancers, ports, and how servers talk to each other behind a firewall. This doesn’t mean you should spend 48 hours banging your head against a wall to understand every little thing. The point is to not give up when you’ve decided that you want to understand this one thing.

Know Your Data

This will sound obvious to any Data Scientist but I assure you it’s not. Knowing your data encompasses a variety of aspects, which depend on what kind of data you’re working with and what you plan to do with it.

Let’s take machine learning for example. Sites like Kaggle provide tens of thousands of datasets for machine learning, most of which are guilty of a shocking lack of documentation. Where did the data come from? When was it collected? For what purpose? Are there tasks for which it should definitely not be used? Will it ever go stale?

Questions like these, it seems, are hardly ever asked, or at least not frequently enough for my liking. While most would agree that documentation is important when it comes to building software, data seems to be regarded as self-evident. I couldn’t disagree more and would encourage you to understand how your data came into existence before you attempt to do anything with it.

Recently, it’s been proposed that every dataset have an accompanying datasheet, which answers the above questions and many more. The paper that introduced the idea, Datasheets for Datasets, describes the concept in detail and is one I highly recommend.
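
To make this concrete, here is a rough sketch of what a lightweight datasheet could look like if you kept it as a small structured file next to the data. The field names below are my own invention for illustration, not the paper’s official template.

```python
# A lightweight, informal datasheet kept alongside the data.
# The field names here are my own invention, not the paper's official template.
import json

datasheet = {
    "name": "question_pairs_sample",          # hypothetical dataset name
    "collected_by": "Example Labs",           # who produced the data
    "collected_on": "2018-03-01",             # when it was collected
    "collection_method": "scraped from a public Q&A forum",
    "intended_use": "training duplicate-question classifiers",
    "should_not_be_used_for": ["user profiling", "medical advice"],
    "known_limitations": "English only; questions skew toward tech topics",
    "goes_stale": "yes, question phrasing drifts over time",
}

# Ship the answers to "where did this data come from?" with the data itself.
with open("datasheet.json", "w") as f:
    json.dump(datasheet, f, indent=2)
```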

Be the Algorithm

When it comes to modeling, knowing your data is not enough. You need to understand what your model does. Often, the best way to do that is to be the algorithm. Take, for example, a Kaggle competition that asks you to decide whether two questions are asking the same thing. Before writing a single line of code, I grabbed 10 random pairs of questions and classified them as similar or not using nothing but my own brain.

Why? I wanted to understand what exactly I, a human, look for when trying to judge if two questions are asking the same thing so that I could understand what my model, a computer, might struggle with or excel at. Were there dead giveaways that I could expect my model to pick up on? Or was the task too difficult to expect even decent results? If I couldn’t perform reasonably well at this task, I wasn’t going to hold my breath that any model could.

I ended up classifying those 10 question pairs at 80% accuracy — just a smidgen under the 81% accuracy that my final model, an RNN, achieved. Afterwards, I repeated the same exercise of classifying questions myself twice more and got 80% each time. This told me that 80% was probably the best accuracy I could hope for given the data I had. In a way, I was the final baseline against which I could evaluate my model’s performance.
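
If you want to try this yourself, the exercise is easy to script. Below is a minimal sketch, assuming the question pairs live in a CSV called train.csv with question1, question2, and is_duplicate columns (the file name and column names are my assumptions for illustration, not necessarily the competition’s exact schema).

```python
# A minimal sketch of the "be the algorithm" exercise: hand-label a small
# random sample and treat your own accuracy as a rough ceiling for any model.
# Assumes a train.csv with question1, question2, and is_duplicate columns;
# these names are assumptions for illustration.
import pandas as pd

pairs = pd.read_csv("train.csv").sample(n=10, random_state=42)

correct = 0
for _, row in pairs.iterrows():
    print("\nQ1:", row["question1"])
    print("Q2:", row["question2"])
    guess = input("Same question? (y/n): ").strip().lower() == "y"
    correct += int(guess == bool(row["is_duplicate"]))

print(f"\nHuman baseline accuracy: {correct / len(pairs):.0%}")
```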

Don’t Stop at 90%

When you’re working on a project and finally, after weeks of effort, manage to get all of your ducks in a row — data, modeling, deployment, what have you — it’s tempting to pack things up and call it a day. After all, you did it, right? You accomplished the task you set out to accomplish.

It’s completely understandable to feel this way and is the reason you’ll find that virtually every personal project on GitHub that you come across is only 80% or 90% complete at best. Don’t stop at 90%. Keep going. No one understands your project better than you and no one is more aware of its shortcomings. Take the time to really polish it off. You’ll feel a greater sense of accomplishment and will have something incredible to share with the rest of the world.

Every Line of Code Adds Technical Debt

I’ll end with what is for me the most surprising lesson I’ve learned this past year, and one which I find myself re-discovering over and over again.

Every line of code adds technical debt.

I can’t emphasize this point enough. Every line of code, by its very existence, adds complexity to your project. Each added line introduces a potential new source of bugs and increases the burden of maintaining a project, not to mention the fact that it’s one more thing you have to think about.

So what should you do? Keep it simple and ruthlessly eliminate excess code. No matter how good you think your code is, I guarantee you it’s not perfect — there is no such thing as perfect code.

That function you wrote that’s 10 lines long? There’s probably a way to make it 5 lines long, though it might take you 3 hours to figure it out. Some might say those 3 hours would be better spent working on something new. Not me. It’s tempting to believe that getting a working solution down on paper is what matters most and that you’ll always be able to refactor later on.

Sadly, this rarely happens in practice. The pressure to move forward and add new features leaves you with little time to fix yesterday’s mess. Moreover, that mess often becomes the foundation upon which new features are built, meaning any changes require revamping multiple components, like trying to build a house on landfill that hasn’t been properly stabilized.
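
To make the earlier ten-lines-to-five-lines point concrete, here is a made-up example. Both the function and its logic are hypothetical, but the flavor of the shrinkage is what I have in mind.

```python
# Before: a hypothetical helper in the ten-line range that normalizes scores.
def normalize_scores(scores):
    total = 0
    for s in scores:
        total += s
    result = []
    for s in scores:
        if total > 0:
            result.append(s / total)
        else:
            result.append(0.0)
    return result


# After: the same behavior in a handful of lines, with no branch buried
# inside the loop. Fewer lines means fewer places for bugs to hide.
def normalize_scores(scores):
    total = sum(scores)
    return [s / total for s in scores] if total > 0 else [0.0] * len(scores)
```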

Final Thoughts

Phew, that was a bit longer than I expected. Are you also a recent graduate about to begin life as a professional Data Scientist? I hope you’ll share what you learned along the way with us. Someone somewhere will find what you have to say useful, I guarantee it.