Mastering Machine Learning Project Time Estimation: A Step-by-Step Guide

Software project estimation can be a tricky task due to the various uncertainties and risks involved. This is especially true for Machine Learning projects because of how naturally unpredictable ML approaches are. We’re faced with numerous questions, such as: Can we achieve a high level of accuracy in our model predictions? How much data do we need for that? Which model should we use?

Some ML engineers, aware of all these unknowns, try to avoid giving time estimates, but this can damage the relationship with stakeholders, which in turn may reflect negatively on performance reviews. Besides, there are many situations where time estimates are genuinely necessary, including these obvious examples:

Your ML project is just one piece of a larger initiative that involves multiple teams working together;
Your deliverables will be used by external customers and they need to know when to expect them;
There are multiple competing projects to be launched and their priority depends on the time estimates.

This article aims to equip the reader with practical tips on how to handle time estimations given the uncertainties that often come with ML projects.

Time Estimation Framework

In a situation where time estimations are unavoidable, I suggest breaking down this process into the following steps:

Split the project into milestones;
Split the milestones into smaller tasks;
Estimate the effort and risk for each task;
Take external factors into account;
Sum it all up.

I will illustrate all these steps in detail using the following simple project as an example:

We want to tag medicine-related tweets and send them for human moderation;
We already have a system that can classify tweets under the "politics" topic;
The current system was launched 3 years ago, and most of the people who worked on it have since left the company. The only person who is somewhat familiar with the system works on a different team;
We plan to handle both ML and backend development by ourselves, with no frontend, design, or other work required;
We want to run an experiment and check if the number of user reports for medical tweets decreases.

Now let me walk you through the process.

1. Split the Project into Milestones

A milestone can be defined as a substantial task that is significantly different from the others in some way. For example, backend fixes could be seen as a separate milestone, as opposed to ML model training; or writing documentation, as opposed to either of these tasks.

As for our “medicine tagging” project, we can break it down into these milestones:

Label a dataset for ML model training;
Train a model following the same approach as used earlier (logistic regression);
Train a neural network for the same task and compare the quality with the previous model;
Make backend and infrastructure changes to deploy the model and run the experiment;
Run the experiment, analyze the results, and clean up the code after;
Document project work.

Some important remarks:

Risky exploration steps should be treated as separate milestones, just like the "Train a neural network" milestone in the example above;
- Sometimes it's worth running a separate project or a "spike" to try out new approaches so that it doesn’t affect the main project.
People often forget about the final steps of a project like analyzing experiments, cleaning up code, and writing documentation. It can be quite frustrating to realize that we can’t start a new project because of this closure work, which was not taken into account;
Don't worry about having too many milestones. In fact, it's better to have many of them than few.

2. Split the Milestones into Smaller Tasks

Milestones provide us with a high-level vision of the work that needs to be accomplished. Now we need to break them down into manageable tasks that we’ll hand over to engineers. Each task should take no more than 2-3 days of effort, which helps with both estimation and execution: granular tasks are easier to estimate, and as an engineer it feels good to see your progress as you close tickets.

Let's try to break down our earlier milestones:

Label a dataset for ML model training:
1. Put together a small dataset for internal labeling;
2. Label this dataset from multiple perspectives (involving engineers and PMs);
3. Reach an agreement on the labels that differ and create a comprehensive description of the labeling approach;
4. Put together another small dataset and label it in a similar way (involving engineers and PMs);
5. Compare the labels and improve the labeling approach description if needed;
6. Define the stratification approach for the dataset to label (consider language, text length, types of users, subtopics, etc.);
7. Put together a stratified dataset and analyze it;
8. Set up a project on a labeling platform, e.g. Amazon MTurk;
9. Run the project on a small dataset and analyze the results;
10. Make changes to the task description or project settings if needed;
11. Run the labeling project on the full dataset and analyze the results.
Train a model following the same approach as used earlier (logistic regression):
1. Write code for text preprocessing;
2. Write code for logistic regression training;
3. Train the logistic regression model and analyze the results;
4. Share the logistic regression model quality with the stakeholders and get their approval.
Train a neural network for the same task and compare quality:
1. Write code for neural network training;
2. Train the neural network and analyze the results, comparing them with the logistic regression model;
3. Inform stakeholders about the neural network model quality and discuss which approach to proceed with.
Make backend and infrastructure changes to deploy the model and run the experiment:
1. Deploy the ML model following the company’s best practices;
2. Implement the backend logic for the "medicine" topic;
3. Add tests;
4. Set up the experiment and test it;
5. Get a peer review for your PR (preferably from someone on the team which owns the code).
Run the experiment, analyze the results, and clean up the code after:
1. Start the experiment;
2. Analyze the experiment results, share and discuss them with the stakeholders;
3. Clean up the experiment code.
Document project work:
1. Document the data labeling approach;
2. Document the logistic regression training approach and its quality;
3. Document the neural network training approach and its quality, comparing it with logistic regression;
4. Improve documentation for model deployment, if needed;
5. Create a project presentation.

You may notice that the decomposition is quite thorough, which is exactly what we aim for in this stage of the estimation process. Even seemingly obvious tasks like "analyze a dataset," "discuss labels," "get a PR review," or "add tests" are explicitly defined because these small steps are frequently overlooked, yet collectively, they sum up to a significant effort.
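
The "no more than 2-3 days per task" rule is easy to check mechanically once the tasks are written down. Here is a minimal sketch, assuming each task carries a rough day count; the `Task` class, the `oversized_tasks` helper, the day counts, and the last task in the list are all hypothetical and only serve as an illustration:

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    optimistic_days: float


def oversized_tasks(tasks, max_days=3.0):
    """Return the tasks whose estimate exceeds max_days and should be split."""
    return [task for task in tasks if task.optimistic_days > max_days]


# Placeholder day counts, used only to illustrate the check.
tasks = [
    Task("Write code for neural network training", 2.0),
    Task("Train the neural network and analyze the results", 2.0),
    # Hypothetical, deliberately coarse task that should be broken down.
    Task("Build and ship the whole neural network pipeline", 6.0),
]

for task in oversized_tasks(tasks):
    print(f"Split further: {task.name} ({task.optimistic_days} days)")
```

Any task that this check flags is a candidate for further decomposition before moving on to estimation.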

3. Estimate Effort and Risk for Each Task

Now that we have a list of smaller tasks, it should be easy to estimate the time required for each one. However, a common challenge at this stage is that we tend to focus on the most positive scenario, while in reality there might be a lot of risks involved. To account for them, we’ll introduce risk multipliers — numbers greater than one which we will use to make different types of time estimations.

These estimations will fall into the following categories:

Optimistic estimation: no multipliers, this is the initial estimation;
Realistic estimation: moderate multipliers to account for the most basic risks (since most of the projects are delayed by 10% or more, the multipliers should not be less than 1.1);
Conservative estimation: aggressive multipliers which take into account some less common difficulties (the multipliers should not be less than 1.2).

Here is what it could look like for our “medicine tagging” tasks:

Task	Optimistic estimation (days)	Realistic multipliers (explained)	Conservative multipliers (explained)
1.1. Put together a small dataset for internal labeling;	0.5	x1.1 (basic)	x1.2 (basic)
1.2. Label the small dataset from multiple perspectives (involving engineers and PMs);	1.5	x1.5 (dependency on PMs)	x3 (dependency on PMs)
1.3. Reach an agreement on the labels that differ and create a comprehensive description of the labeling approach;	1	x1.5 (dependency on PMs)	x3 (dependency on PMs)
1.4. Put together another small dataset and label it in a similar way (involving engineers and PMs);	1.5	x1.5 (dependency on PMs)	x3 (dependency on PMs)
1.5. Compare the labels and improve the labeling approach description if needed;	0.5	x1.5 (dependency on PMs)	x3 (dependency on PMs)
1.6. Define the stratification approach for the dataset to label (consider language, text length, types of users, subtopics, etc.);	1	x1.5 (discussion and some research required)	x3 (discussion and some research required)
1.7. Put together a stratified dataset and analyze it;	0.5	x1.1 (basic)	x3 (if more iterations are required)
1.8. Set up a project on a labeling platform, e.g. Amazon MTurk;	1	x1.1 (basic)	x2 (if done for the first time)
1.9. Run the project on a small dataset and analyze the results;	1	x1.1 (basic)	x2 (if extra research is required)
1.10. Make changes to the task description or project settings if needed;	0.5	x1.5 (dependency on PMs)	x3 (if significant changes are required)
1.11. Run the labeling project on the full dataset and analyze results;	1	x1.1 (basic)	x2 (if extra research is required)
2.1. Write code for text preprocessing;	1	x1.5 (ML is tricky)	x3 (if multiple iterations are required)
2.2. Write code for logistic regression training;	1	x1.5 (ML is tricky)	x3 (if multiple iterations are required)
2.3. Train the logistic regression model and analyze the results;	1	x1.5 (ML is tricky)	x3 (if multiple iterations are required)
2.4. Share the logistic regression model quality with the stakeholders and get their approval;	1	x1.5 (dependency on PMs)	x1.5 (dependency on PMs)
3.1. Write code for neural network training;	2	x1.5 (ML is tricky)	x2 (not much experience with NNs)
3.2. Train the neural network and analyze the results, comparing them with the logistic regression model;	2	x1.5 (ML is tricky)	x2 (not much experience with NNs)
3.3. Inform stakeholders about the neural network model quality and discuss which approach to proceed with;	1	x1.5 (dependency on PMs)	x1.5 (dependency on PMs)
4.1. Deploy the ML model following the company’s best practices;	2	x1.5 (not much experience)	x3 (not much experience, possible dependency on other teams)
4.2. Implement the backend logic for the “medicine” topic;	2	x1.5 (not much experience)	x3 (not much experience, possible dependency on other teams)
4.3. Add tests;	0.5	x1.1 (basic)	x1.2 (basic)
4.4. Set up the experiment and test it;	1	x1.1 (basic)	x3 (not much experience, possible dependency on other teams)
4.5. Get a peer review for your PR (preferably from someone on the team which owns the code);	0.5	x1.5 (dependency on another team)	x3 (dependency on another team)
5.1. Start the experiment;	0.5	x1.1 (basic)	x1.2 (basic)
5.2. Analyze the experiment results, share and discuss them with the stakeholders;	1	x1.5 (dependency on stakeholders)	x3 (dependency on stakeholders)
5.3. Clean up the experiment code;	1	x1.1 (basic)	x1.2 (basic)
6.1. Document the data labeling approach;	0.5	x1.1 (basic)	x1.2 (basic)
6.2. Document the logistic regression training approach and its quality;	0.5	x1.1 (basic)	x1.2 (basic)
6.3. Document the neural network training approach and its quality, comparing it with logistic regression;	0.5	x1.1 (basic)	x1.2 (basic)
6.4. Improve documentation for model deployment, if needed;	0.5	x1.1 (basic)	x1.2 (basic)
6.5. Create a project presentation.	1.5	x1.1 (basic)	x1.2 (basic)

Some important remarks here:

Use your best judgment when coming up with multipliers, and ask senior developers for help if you are struggling;
ML is always tricky, so be generous with multipliers for ML tasks;
If certain tasks have a dependency on stakeholders, other teams, a manager who’s constantly busy, etc., factor that in when thinking about multipliers;
Engineers working with a certain system for the first time entail additional risks.
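
To make the arithmetic behind the table concrete, here is a minimal sketch that applies the multipliers to the four tasks of the logistic regression milestone (2.1 to 2.4), using the values from the table. The `TaskEstimate` class and the `total_effort` function are hypothetical names chosen for this illustration; a spreadsheet would work just as well:

```python
from dataclasses import dataclass


@dataclass
class TaskEstimate:
    name: str
    optimistic_days: float
    realistic_multiplier: float
    conservative_multiplier: float


def total_effort(tasks):
    """Return (optimistic, realistic, conservative) effort in days."""
    optimistic = sum(t.optimistic_days for t in tasks)
    realistic = sum(t.optimistic_days * t.realistic_multiplier for t in tasks)
    conservative = sum(t.optimistic_days * t.conservative_multiplier for t in tasks)
    return optimistic, realistic, conservative


# Values copied from rows 2.1 to 2.4 of the table.
milestone_2 = [
    TaskEstimate("2.1 Write code for text preprocessing", 1.0, 1.5, 3.0),
    TaskEstimate("2.2 Write code for logistic regression training", 1.0, 1.5, 3.0),
    TaskEstimate("2.3 Train the model and analyze the results", 1.0, 1.5, 3.0),
    TaskEstimate("2.4 Share the model quality with stakeholders", 1.0, 1.5, 1.5),
]

print(total_effort(milestone_2))  # (4.0, 6.0, 10.5)
```

The same function can be run per milestone, which makes it easy to see where most of the risk sits.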

4. Take External Factors into Account

The above is just the “productive engineering time” required for the project, but we also need to consider that engineers don’t focus on project work 100% of their time every day. There are external factors that add "downtime" to the project estimation. These factors include:

Meetings, which usually take 20%+ of the working time;
Other tasks and commitments, like finishing another project, conducting interviews, etc.;
PTO and on-call duties;
Public holidays, especially long ones like Christmas.
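
One detail is easy to get wrong when converting meeting time into a multiplier: if a fraction of each day is lost to meetings, the effort has to be divided by the remaining productive fraction, so 20% of time in meetings inflates the timeline by 25%, not 20%. A minimal sketch, where `overhead_multiplier` is a hypothetical helper name:

```python
def overhead_multiplier(meeting_fraction):
    """Convert the share of time spent in meetings into a timeline multiplier."""
    if not 0 <= meeting_fraction < 1:
        raise ValueError("meeting_fraction must be in [0, 1)")
    # Only (1 - meeting_fraction) of each day is productive.
    return 1 / (1 - meeting_fraction)


print(round(overhead_multiplier(0.20), 2))  # 1.25
print(round(overhead_multiplier(0.50), 2))  # 2.0
```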

5. Sum It All Up

Now that we have all the numbers for our project, let's calculate the total effort required.

Here's a breakdown of the "productive engineering time" estimation:

Optimistic effort: 31 days
Realistic effort: 46 days
Conservative effort: 88 days

Let's assume that engineers spend one out of every five days in meetings. Only 80% of their time is then productive, so each day of effort takes 1 / 0.8 = 1.25 working days, which adds 25% to the estimation. Additionally, during the project, each engineer will have 5 days of PTO and 5 days of on-call duties. Taking all that into account, we get:

Optimistic: 31 * 1.25 + 5 + 5 = 48.75 ≈ 49 days
Realistic: 46 * 1.25 + 5 + 5 = 67.5 ≈ 68 days
Conservative: 88 * 1.25 + 5 + 5 = 120 days

These are the final estimates for one engineer. If we want to make estimations for multiple engineers working on the project, we can try to plan which tasks will be done in parallel. However, this is usually quite hard to predict. A simpler approach is to multiply the estimate by 0.75 for each extra engineer working on the project. For example, if three engineers are working on the project, the realistic estimate will be as follows:

👉 46 * 1.25 * 0.75 * 0.75 + 5 + 5 ≈ 42.3, or 43 working days when rounded up, which can easily be translated into weeks or months, taking upcoming holidays into account.
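
The whole calculation fits into one small function. This is a minimal sketch under the assumptions used in this example (a 1.25 meeting multiplier, 5 days of PTO, 5 days of on-call duties, and a 0.75 factor per extra engineer); `calendar_estimate` and its parameter names are hypothetical:

```python
def calendar_estimate(
    effort_days,
    meeting_multiplier=1.25,
    pto_days=5,
    on_call_days=5,
    engineers=1,
    per_extra_engineer=0.75,
):
    """Turn productive engineering days into working days on the calendar."""
    if engineers < 1:
        raise ValueError("engineers must be at least 1")
    # Each extra engineer shrinks the scaled effort by the given factor.
    scaled = effort_days * meeting_multiplier * per_extra_engineer ** (engineers - 1)
    # PTO and on-call days are added on top, unscaled, as in the example.
    return scaled + pto_days + on_call_days


print(calendar_estimate(31))               # 48.75
print(calendar_estimate(46))               # 67.5
print(calendar_estimate(88))               # 120.0
print(calendar_estimate(46, engineers=3))  # 42.34375
```

The function returns unrounded values on purpose, so that the rounding policy (for example, always rounding up to whole days) stays an explicit decision of the team.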

Conclusion

There you have it: we’ve completed the full estimation cycle for our ML project. One thing you might have noticed is that the realistic estimation is much higher than the optimistic one, and the conservative estimation is higher still. Yet the optimistic estimation is what most people would normally aim for, failing to consider all the implementation details and risks. In my experience, the actual project timeline usually falls somewhere between the realistic and conservative estimations.

Using this time estimation framework has helped reduce anxiety for the engineers on my team and provide more realistic expectations for the stakeholders. I encourage you to give it a try, reflect on it after each completed project, and tweak it so that it works best for your team.
