What I Wish I’d Known about Data Science

So many tools to worry about, so few you’ll actually use

‘Data science’ is a vague term, so treat it accordingly

Data science can cover virtually any quantitative work. Two data scientists at different companies, or even within the same company, could do totally different types of work. The field has gradually been fracturing into more specific job titles, such as data engineer, data analyst, and machine learning engineer, and that specialization will likely continue. Therefore, when you’re talking about data science or applying to jobs, figure out which definition of data science applies in that situation, and make sure that it matches yours. In particular, it’s useful to find out what the deliverables will be in a given role. Will you need to write code that lives in a production system? Will you need to build data pipelines? Will you be producing analyses of offline data, and if so, what kind of analyses? Figuring out what deliverables you’ll be responsible for is often more informative than reading the job description, since job descriptions tend to be written to attract a broad range of candidates rather than to detail what the job will actually entail.

Imposter syndrome is a normal part of the job

Every data scientist experiences imposter syndrome. I’ve found that a meaningful part of the job is navigating it. There are always going to be things you don’t know. As mentioned above, the field is poorly defined, so a vast number of topics could conceivably fall under the definition of ‘data science.’ If you read blogs or Quora, it can feel like you need to be world class at every skill to be a data scientist: a Stanford PhD statistician, a Google-caliber engineer, and a McKinsey-grade business expert, all wrapped in one. The reality is that nobody is perfect at everything. Even if you somehow magically were perfect at every skill, you’d only use a subset of those skills for each project, and you’d lose practice with the ones you didn’t use. All you need to do in order to be a good data scientist is to find a way to use data to be useful. There are lots of different ways to do that. It’s fine to feel imposter syndrome from time to time. Just know it’s normal, and don’t let it get you down. Instead, try to embrace situations where you have something new to learn as exciting growth opportunities, and remember that feeling the next time you encounter someone else who doesn’t know something you do.

You’ll never have to know all the tools

Hadoop, Spark, YARN, Julia, Kafka, Airflow, Scalding, Redshift, Hive, TensorFlow, Kubernetes… there is a seemingly unending number of data science coding languages, frameworks, and tools. When you haven’t worked at a data science job before, it feels like you have to know all of them to be a real data scientist. Every time I heard someone mention a tool I didn’t know in conversation, I used to silently freak out and make a mental note to find a Coursera class on the topic I could binge, stat. Fortunately, you can safely ignore 99% of the data science tools out there. Eventually, your company will have its own set of tools. Everyone at the company will get good at using those tools, and be completely clueless about most of the others. Plus, no good company will care if you’ve used their particular set of tools before. Unless you’re going for a really specialized role, they’ll expect that you can learn their stack on the job. You just need to know enough to pass an interview. Pick a small set of tools that work for you. Get comfortable with them, and don’t worry about branching out too much until you’re at a job.

However, learn your basic tools well

You don’t have to know every tool, but you should go deep on the basic tools you use daily. You’ll never regret learning the boring parts of whatever SQL dialect your company uses, like how to write an optimized query. If you use R, learn the ins and outs of ggplot2 and dplyr. If you use Python, try to really understand pandas, NumPy, and SciPy. I pretended to know Git for months, but always got myself tied up in git-knots. Finally, I broke down and read a great tutorial on the tool. Then, I felt git-invincible. If you find yourself using something regularly, take some time to simply read its manual.
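
To make "the boring parts of SQL" a bit more concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table, column, and index names are made up for illustration, and your company's dialect will have its own way of showing a query plan, so treat this as one possible illustration rather than a recipe. It asks SQLite how it plans to run the same query before and after an index is added.

```python
import sqlite3

# Hypothetical table and column names, chosen only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(i % 100, float(i)) for i in range(10_000)],
)

QUERY = "SELECT SUM(amount) FROM events WHERE user_id = 42"


def query_plan(connection, sql):
    """Return the plan descriptions SQLite reports for a query."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[-1] for row in rows]  # the last column holds the description


# Without an index, SQLite has to scan the whole table.
plan_before = query_plan(conn, QUERY)

# With an index on the filtered column, it can search instead of scan.
conn.execute("CREATE INDEX idx_events_user ON events (user_id)")
plan_after = query_plan(conn, QUERY)

total = conn.execute(QUERY).fetchone()[0]

print("before:", plan_before)
print("after: ", plan_after)
print("total: ", total)
```

The query returns the same answer either way. What changes is how much work the database does to get there, which is exactly the kind of detail that is easy to skip and pays off once your tables get large.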

You’re an expert in a domain, not just methods

Data science came about as a compromise between research science roles and business analyst roles. The former used powerful methods but only indirectly influenced business decisions, while the latter directly influenced business owners but wielded limited tools to do so. Data scientists make the most impact when they combine both sides, mixing deep domain knowledge with the right statistical and engineering tools to make better decisions or useful data products.

In my experience, most data scientists lean too far in the research scientist direction and not far enough in the business analyst direction. They love using fancy techniques, but they underinvest in learning about their domain. They go to machine learning conferences, but rarely attend conferences on, say, marketing or risk. Many data scientists don’t even realize that they have a domain. Any team with accumulated knowledge about what works and what doesn’t has domain knowledge, and you can learn about it from your business partners or by talking to similar teams at other companies. Knowing your domain is half the battle, so invest time there, just like you do for your ‘hard skills.’

The most important skill is critical thinking

A big part of any knowledge work job is determining what’s important and what’s not. You can do the perfect analysis, but if it turns out you were solving the wrong problem or your insight isn’t actionable, it won’t matter. It’s worth actively spending time thinking about the broader context of your work. What are the most important challenges on your team, and why? Is your current roadmap the best way to help your team, or should you shift your plan? The answers to these questions can change over time, so it’s important to check in regularly. I’ve seen a lot of data scientists march down a path for too long simply because of inertia.

What to do as a student to become a data scientist

Take relevant classes — not just technical classes

Of course, statistics and computer science classes will be helpful on the job. However, lots of other classes can be helpful too. Anything that gets you practice thinking critically and making written arguments, such as philosophy, history, or English, can be useful, since that’s a lot of what you do in data science. Social science subjects such as economics or quantitative psychology can be great for gaining experience making causal inferences. A class I think back to often is the persuasive speaking class I took, which I draw on regularly at my job. Take your fair share of technical classes, but learn broadly and follow your interests. My strategy was always to go with great professors over great syllabi. I’d still recommend that to any college student, data science or not.

Practice communication — written, visual, and verbal

Communication skills are wildly important and chronically undervalued in data science. Your impact can only be as good as your communication skills, since you need to persuade others to make decisions or help build products based on your analyses. As a result, many very technical data scientists find their careers quietly limited because they can’t write or speak clearly. Practice in all three forms (written, visual, and verbal) makes a real difference. Take classes with lots of writing, especially if you feel you’re a weak writer or English isn’t your first language. A lot of campuses have writing centers to help you get feedback. That’s a resource to take advantage of while you have it.

Work on real data problems

Kaggle is great for learning about modeling. However, with Kaggle, the hardest part has already been done for you: collecting and cleaning the data, and defining the problem to be solved with it. The best way to prepare for a job as a data scientist is to use real data to answer real questions. The reason is simple: it’s the closest you can get to an actual job without actually having one. Find something you’re interested in and get your own data. With packages like BeautifulSoup, Scrapy, and rvest, scraping data off the Internet is much easier than most beginners realize. Wikipedia and Reddit are good targets if you need inspiration, but the best choice is something that you’re genuinely excited about exploring. Then, ask some questions that interest you and see how well you can answer them. Clean the data, make some graphs and models, and then write up your conclusions somewhere public. It’ll be slow going in the beginning, but that’s because you’re learning. If you can, try to solve real-world problems for people in your community, such as doing statistics work for a school sports team or polling analysis for the school newspaper, in order to get practice with stakeholder management as well.
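
As a rough sketch of the "get your own data, then clean it" step, the snippet below pulls a small table out of HTML and tidies the values. To keep it runnable without installing anything, it uses Python's built-in html.parser instead of BeautifulSoup or Scrapy, which would typically make the parsing shorter. The HTML, team names, and numbers are invented for illustration; in a real project you would download the page from the site you care about.

```python
from html.parser import HTMLParser

# Hypothetical page content; a real project would download this from a site.
SAMPLE_HTML = """
<table>
  <tr><th>Team</th><th>Wins</th></tr>
  <tr><td> Tigers </td><td>12</td></tr>
  <tr><td>Bears</td><td> 9 </td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])          # start a new row
        elif tag in ("td", "th"):
            self._in_cell = True
            self.rows[-1].append("")      # start a new, empty cell

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data     # append text to the current cell


parser = TableParser()
parser.feed(SAMPLE_HTML)

header, *body = parser.rows

# Cleaning step: strip stray whitespace and convert counts to integers.
records = [{"team": team.strip(), "wins": int(wins)} for team, wins in body]

print(header)
print(records)
```

Even in a toy example like this, most of the effort goes into deciding what a clean record should look like, which is the part Kaggle usually does for you.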

Publish your work and get feedback however you can

The only way to get better at anything is to get feedback. Data work is no exception. These days, it’s easy to post notebooks to GitHub or a personal website. If you write about a topic your friends are interested in, you can learn a lot from how they respond. What was compelling about your presentation? What was unclear? Were you able to persuade them of your main argument? Did they get bored reading and not make it to the end? Crucially, make your code available, and try to get code reviews from other students so you can make one another better. If you use a technique from a class you’re taking, you could even show a professor what you’ve done and get some expert feedback while showing some initiative. And, who knows, if one of your analyses goes viral on the Internet, you may even get a job out of it!

Go to events — hackathons, conferences, meetups

To the extent that your geography and budget allow it, try to interact with the outside data science world while you’re a student. Doing so will give you a better understanding of the realities of the field and give you a head start for networking. There are data science meetups and hackathons in most major cities, and in my experience, most people are very friendly to students at them. Conferences usually have dramatically discounted tickets for students. Going with friends can make for a fun field trip together, too!

Be flexible with how you enter the field

Data science is a competitive field. There are a limited number of tech companies with great data science brands, and the battle for their summer internships and entry-level roles is fierce. However, once you have even a small amount of real data science work experience, it’s much easier to get a second job in the field. Data scientists with a few years under their belts, even from little-known companies, often have little trouble getting hired at top companies. Thus, if you want to be a data scientist, and you don’t get an offer right off the bat from one of the famous companies, consider broadening your job search. There are lots of companies with interesting problems to solve.