How to Overcome Challenges When Scaling Data Science Projects
We’re at the tail end of the Information Age.
But agriculture and industry didn’t disappear in the mid-1900s when the Information Age swallowed up the Industrial Age. Similarly, the Information Age will be—has already been?—absorbed into the thrilling new era we’re in now: the Experience Age.
If the Information Age was all about collecting massive amounts of data, the Experience Age is all about analyzing it and discovering how to make it work for us.
And that’s where data scientists like you come in. More and more organizations are forming data science teams to extract insights from constant data streams. The U.S. Bureau of Labor Statistics estimates data science jobs will increase by 35% before 2033.
Compare that with the national average job growth rate of 3% across all industries.
This stat is exciting, but the high demand for data scientists also reflects a huge need for data-driven insights, and every organization wants a giant piece of the pie.
The pressure can feel crushing. How can your team keep up with an organization’s demand for data-driven insights? How can you scale the number of data science projects your team can pull off without using excess resources?
It all comes down to organization. With a well-oiled data governance and sprint-planning system, you can do more than you ever dreamed.
What is Data Governance, and Why is it Important?
Data governance is the system a team uses to manage the lifecycle of the data it gathers. With an effective data governance plan, your team stays organized and follows important state, federal, and global regulations.
You might be asking yourself, “What’s the difference between data governance and data management?” Think of it this way:
- Data governance sets the policies and procedures that govern how you acquire, use, and secure your data.
- Data management is how you then collect, process, store, analyze, and interpret that data.
In other words, data governance sets up the framework for data management and oversees those processes. You can learn more about data governance concepts with DataCamp.
Most organizations have default governance plans on a micro level—one for a certain business tool and another for a separate function. As a data scientist, your job is to streamline data governance into one highly organized, tightly controlled machine.
Does it take a lot of work upfront? Yes. But once you’ve set it up, you’ll be ready to manage bigger datasets and take on more projects. To grasp these concepts quickly, check out DataCamp's Data Governance Fundamentals Cheat Sheet, a handy guide for referencing key concepts and best practices.
Setting Up a Data Governance Framework
The data governance process starts with building a team.
You and the other data scientists at your organization will need to work together to implement a data governance program. The exact titles and responsibilities will vary depending on your organization, but in general, your organization will need to appoint four roles:
- Data steward: Manages the governance program, ensures security, and liaises between the business and the IT team
- Data architect: Designs the system that will process and store data and helps the data steward follow governance policies
- Data custodian: Moves, stores, secures, and oversees the use of the data
- Data analyst: Interprets the data and turns it into actionable insights for the business
Depending on the size of your company, you may need more than one person in each role. Some organizations will also have data administrators or councils that oversee the creation of data governance policies.
Building a comprehensive strategy is crucial, and DataCamp's module on Creating a Data Governance Strategy can provide you with a structured approach to this process.
Outlining a data governance policy
Once you’ve built your team, you can collaborate to define the data governance policies that everyone will follow.
Think about these questions:
- How will your company use and manage the data according to data governance best practices?
- Who will make decisions about the data’s use as technology rapidly evolves?
- How does the organization expect end-users to benefit from the data?
Explore the answers and use them to create your overarching data governance policy. Think of it as the umbrella that safeguards the sub-policies you build around standards, data culture, and security.
Figuring out your standards, security measures, and data culture needs
Now, you’ll need to think about data standards. Data won’t do much for you if it isn’t top-notch quality. What standards should the data meet? How will your team filter out the data that doesn’t meet them? Understanding and ensuring data quality is paramount. Dive deeper into this topic with DataCamp's Introduction to Data Quality Course.
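As a rough illustration of filtering out data that doesn't meet your standards, here is a minimal Python sketch. The `Record` fields and the three rules (a present ID, a plausible age, an email containing "@") are hypothetical placeholders, not rules from any particular governance policy; your team would substitute its own.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    # Hypothetical fields, chosen only for illustration.
    patient_id: Optional[str]
    age: Optional[int]
    email: Optional[str]


def meets_standards(record: Record) -> bool:
    """Return True when a record passes every (hypothetical) quality rule."""
    has_id = bool(record.patient_id)
    valid_age = record.age is not None and 0 <= record.age <= 120
    valid_email = record.email is not None and "@" in record.email
    return has_id and valid_age and valid_email


records = [
    Record("p-001", 42, "a@example.com"),
    Record(None, 35, "b@example.com"),      # missing ID
    Record("p-003", 150, "c@example.com"),  # implausible age
    Record("p-004", 61, "not-an-email"),    # malformed email
]

# Split the data into rows that meet the standards and rows that do not.
clean = [r for r in records if meets_standards(r)]
rejected = [r for r in records if not meets_standards(r)]
print(f"{len(clean)} clean, {len(rejected)} rejected")
```

Keeping the rejected rows, rather than silently dropping them, gives the team something to audit when a standard turns out to be too strict.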
The next item on the agenda is security. Figure out:
- How you’ll classify your data—public, private, confidential, restricted, and so forth
- Who will have access to each classification
- How you’ll encrypt the data to keep it secure from storage to transmission and back again
- An alarm system to notify your team of security violations
- A policy for how you’ll handle any violations
- A testing and auditing schedule to make sure your program runs as intended
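The classification and access items in the list above can be sketched in a few lines of Python. The ordering of the four levels follows the list, but the role-to-clearance mapping and the in-memory `violations` log are assumptions made for illustration; a real program would rely on your organization's access-control and alerting tools.

```python
from enum import IntEnum


class Classification(IntEnum):
    # Ordered from least to most sensitive.
    PUBLIC = 1
    PRIVATE = 2
    CONFIDENTIAL = 3
    RESTRICTED = 4


# Hypothetical mapping: the highest classification each role may read.
CLEARANCE = {
    "data_analyst": Classification.PRIVATE,
    "data_custodian": Classification.CONFIDENTIAL,
    "data_steward": Classification.RESTRICTED,
}

# Stand-in for an alerting system: denied attempts are recorded here.
violations = []


def can_access(role: str, level: Classification) -> bool:
    """Allow access only up to the role's clearance; log denied attempts."""
    max_level = CLEARANCE.get(role, Classification.PUBLIC)  # unknown roles see public data only
    allowed = level <= max_level
    if not allowed:
        violations.append((role, level.name))
    return allowed


print(can_access("data_steward", Classification.RESTRICTED))
print(can_access("data_analyst", Classification.CONFIDENTIAL))
print(violations)
```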
Finally, one of the most important things your data governance team can do is keep the entire organization informed on how the data can help them. Keeping people informed helps create a culture where data is valued and cared for, like the asset it is.
So, how can you make something as seemingly dull as data appeal to your whole organization?
You make it come alive, that’s how.
Show your organization exactly how the data makes their work easier. Host quarterly presentations with charts and visuals showing how data impacts company decisions. Send out informative briefs or monthly newsletters in the same vein. Provide company-wide courses that help employees improve their data literacy.
An organization that cares about its data will help uphold the policies, procedures, and standards you establish. This standard and structure will make it easier for your data science team to take on more projects without sacrificing data quality or security.
And now you’re ready to sprint.
Sprint Planning for Data Science Teams
You’ve built the framework for data to safely and efficiently move through your organization. Now it’s time to see how sprint-like planning can work for you. Sure, sprint planning is part of the scrum project management system used in software development. But it works well for data governance and management, too.
That's because, like software development, data management involves a huge number of moving parts.
First, let’s talk about what a sprint is.
A sprint is a pre-defined timeframe during which your team will work on tasks to meet one key goal. Although sprints can be as long as you want, they’re usually one to four weeks long. Often, this is enough time for your data science team to complete small or medium-sized projects.
Other times, your team will need to run multiple sprints to complete one giant project. You know, the ones where you have to generate a huge dataset and take it through the entire lifecycle from collection to interpretation.
Before the sprint begins, your team will meet to map it out.
That way, when the sprint officially starts, everyone knows exactly what to do during the workday. According to the Scrum methodology, sprint planning meetings should last no more than two hours for each week of the sprint.
Let’s say your healthcare organization has handed you a smaller project. Your team needs to discover why a call-to-action (CTA) button on the company’s homepage performs poorly. The CTA button is urging patients to schedule an important cancer screening.
To solve this problem, you’ll need to:
- Analyze historical data that tells you about the CTA button's target audience
- Come up with one to two new variants of the CTA button
- A/B test the variants against each other and the original
- Collect, process, and analyze the data over a specific period
- Deliver actionable insights to help the marketing team choose the right CTA button
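For the A/B testing step, one common way to compare click-through rates is a two-proportion z-test. The sketch below is one possible approach, not necessarily the method your team would choose, and the click and view counts are invented for illustration.

```python
import math


def two_proportion_z_test(clicks_a, views_a, clicks_b, views_b):
    """Compare two click-through rates; return (z statistic, two-sided p-value)."""
    p_a = clicks_a / views_a
    p_b = clicks_b / views_b
    # Pooled rate under the null hypothesis that both buttons perform the same.
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: original button vs. one new variant.
z, p_value = two_proportion_z_test(clicks_a=120, views_a=4000,
                                   clicks_b=168, views_b=4000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

When several variants are tested against each other and the original, each extra comparison raises the chance of a false positive, so the team may want to apply a multiple-comparison correction before declaring a winner.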
The marketing team would like results within three weeks.
This is a tight timeline for an A/B test, but your team is on it. You will hold a three-hour sprint planning session for your three-week sprint.
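The two-hours-per-week guideline is simple arithmetic, and a tiny Python sketch makes the check explicit. The function name and its default value are illustrative only.

```python
def max_planning_hours(sprint_weeks: float, hours_per_week: float = 2.0) -> float:
    """Upper bound on sprint planning time under the two-hours-per-week guideline."""
    return sprint_weeks * hours_per_week


planned_hours = 3                           # the session length in this scenario
limit = max_planning_hours(sprint_weeks=3)  # three-week sprint
within_limit = planned_hours <= limit
print(limit, within_limit)
```

A three-hour session for a three-week sprint sits comfortably under the six-hour ceiling.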
We’ll use this scenario to show you what sprint planning can look like for a data science team.
1. Identify your sprint goal and time frame
The first two questions your team should ask during sprint planning are:
- What outcome do we want this sprint to deliver?
- What is a realistic timeframe for us to achieve this outcome?
The marketing team gave you the ideal outcome, which is a CTA button that gets twice the clicks the current one does.
And you already know you have three weeks to complete the sprint.
You’re ready to move to the next step.
2. Write your user story
In software development, a user story is a description of the end product from the user's point of view. It’s a creative task written in plain, natural language. The goal is to put the development team in the end user’s shoes and map the points in the story to concrete tasks within a sprint.
You can do something similar for your data science project, too.
Let’s go back to our mock CTA button project. The marketing department wants users to want to click on the CTA button for a free cancer screening. That means you need to put yourself in the shoes of someone encountering the button on the healthcare organization’s website.
Answer questions like these as you write your user story:
- Who is the person who will click the new CTA button?
- What health-related worries keep them up at night?
- Why would they benefit from a cancer screening?
- How does the button convey those benefits clearly and concisely?
- How do the text and graphics surrounding the button make them feel?
- What is it about the button color, font, and copy that compels them to click?
- Why will this person click the CTA button?
You can adapt these questions for your project. Focus on the result and work backward until you understand how it interacts with and benefits the user.
If you can match specific points in your story to tasks in your sprint, that's even better. It’ll help you with the third step of the sprint-like planning journey.
3. Assign sprint tasks to each team member
By now, you should know which tasks you need to do and why you need to do them.
It’s time to assign each member of your data science team a task to complete within the sprint timeframe. You might need to break some tasks up into smaller pieces. You’ll also need to pinpoint dependencies. For example, your data science team wouldn’t be able to test new CTA button variants until they build them.
Consider using task management software to keep track of who is responsible for each task and to ensure that dependencies are clearly identified and managed.
Sketch out an outline of how long each subtask and task should take. Assign sprint hours to each one. Check with your team to make sure the hours feel reasonable and doable.
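If you don't have task management software on hand, a short script can capture the same ideas: who owns each task, how many hours it should take, and what it depends on. The sketch below uses Python's standard-library `graphlib` (Python 3.9+) to order tasks by dependency. The owner names, hour estimates, and the 40-hour capacity are all hypothetical.

```python
from graphlib import TopologicalSorter

# task -> (owner, estimated hours, prerequisite tasks); all values are hypothetical.
tasks = {
    "analyze_history": ("Ana", 16, []),
    "design_variants": ("Ben", 12, ["analyze_history"]),
    "run_ab_test": ("Chloe", 24, ["design_variants"]),
    "analyze_results": ("Ana", 16, ["run_ab_test"]),
    "deliver_insights": ("Ben", 8, ["analyze_results"]),
}

# Order the tasks so that every prerequisite comes before the task that needs it.
graph = {name: set(prereqs) for name, (_, _, prereqs) in tasks.items()}
order = list(TopologicalSorter(graph).static_order())

# Total the estimated hours per person and flag anyone over capacity.
CAPACITY_HOURS = 40  # hypothetical sprint capacity per person
hours_by_owner = {}
for owner, hours, _ in tasks.values():
    hours_by_owner[owner] = hours_by_owner.get(owner, 0) + hours
overloaded = [owner for owner, hours in hours_by_owner.items() if hours > CAPACITY_HOURS]

print(order)
print(hours_by_owner, overloaded)
```

`TopologicalSorter` raises a `CycleError` if two tasks depend on each other, which is a useful early warning that the plan itself needs rethinking.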
Remember that these hours are estimates and that things can and will change. If you’re worried about the timeline, communicate that with the department that requested the project. Negotiate for a longer time estimate if needed. It’s always a good idea to pad a project with extra time. Your team can use it to evaluate the progress and adjust to any issues.
Rushing through a project can cost more time and money down the road—and that’s the very problem sprint planning is meant to help you avoid.
Data Governance and Sprint Planning: The Perfect Match
As data science becomes increasingly popular in this exciting Experience Age, your team will work on dozens of projects simultaneously.
Keep calm and sprint on.
By which we mean:
- Identify estimated due dates for each project
- Divide each project into one or multiple sprints
- Assign different sprints to different sub-teams within your department OR schedule sprints according to which projects are due first
- Hold sprint planning sessions for each sprint/project
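The steps above can be sketched as a small scheduling script: sort projects by due date and estimate how many sprints each one needs. The project names, dates, week estimates, and the three-week sprint length are made up for illustration.

```python
import math
from datetime import date

SPRINT_WEEKS = 3  # hypothetical sprint length

# Hypothetical projects with due dates and rough effort estimates.
projects = [
    {"name": "churn_model", "due": date(2025, 9, 30), "estimated_weeks": 8},
    {"name": "cta_button_test", "due": date(2025, 7, 18), "estimated_weeks": 3},
    {"name": "dashboard_refresh", "due": date(2025, 8, 15), "estimated_weeks": 5},
]


def sprints_needed(estimated_weeks: float, sprint_weeks: int = SPRINT_WEEKS) -> int:
    """Round up: a project that spills into another sprint needs that whole sprint."""
    return math.ceil(estimated_weeks / sprint_weeks)


# Schedule the projects that are due first at the front of the queue.
plan = [
    (p["name"], p["due"], sprints_needed(p["estimated_weeks"]))
    for p in sorted(projects, key=lambda p: p["due"])
]

for name, due, n_sprints in plan:
    print(f"{name}: due {due.isoformat()}, {n_sprints} sprint(s)")
```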
Now you get to watch as you and your team meet goal after satisfying goal. With your data governance framework in place, managing the constant data flow will be a smooth and secure process.
And if any hiccups shake up the journey, you’ll have the policies and procedures in place to handle them without breaking a sweat.
Final Thoughts
As you embark on the journey of scaling your data science projects, remember that continuous learning and adapting are key to success. To further enhance your skills and knowledge, explore DataCamp's comprehensive courses on data governance, such as the Data Governance Concepts Course, and stay ahead in the Experience Age.
John is a digital marketing specialist for global brands like Optimist. He spends most of his time A/B testing different strategies and, in his spare time, argues his findings with his dog, Zeus. You can follow him @J_PMarquez.