Scale and optimise your data science investment through cloud migration

Credera Team

The rise of data science and artificial intelligence (AI) has transformed the way organisations conduct business, drive innovation, and create unparalleled value. Research predicts that AI will generate $15 trillion (£11.8 trillion) in value across the global economy by 2030. This staggering potential has led organisations to invest heavily in data science, weaving it into the very fabric of their business operations.

ORGANISATIONS STRUGGLE TO REALISE VALUE

Initial investments in data science often focus on hiring top-notch data scientists who bring statistical rigour to the table. The team often sits within the business and implements high-value predictive models, achieving significant bottom-line results with its first few models.

As these teams grow, experts are increasingly joined by a growing crop of young data science graduates, as evidenced by the rapid expansion of university programmes specialising in the field. However, these data science teams have often outgrown the organisation's data engineering investments and built their own infrastructure to continue delivering value. This leads to reliance on outdated on-premises or infrastructure as a service (IaaS) environments and “wild-west” development practices that create maintainability and scalability bottlenecks. As a result, despite the influx of data science expertise, many organisations struggle to realise the scale of value that they expected.

TEAMS FAIL TO SCALE AND OPTIMISE DATA SCIENCE PROCESSES

While data scientists excel in maths and statistics, few have software and systems engineering backgrounds. This gap makes it hard to scale and optimise data science processes. Inconsistent coding styles and the absence of strict, methodical approaches lead to inefficient development and slow production timelines. For example, migrating to an Apache Spark-based platform can significantly improve scalability, but many data science teams lack the experience to make this transition. Additionally, rigorous automated unit testing, a cornerstone of modern software development, is often missing from data science workflows.
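As a minimal sketch of what software-style unit testing could look like in a data science codebase, the example below tests a small feature-scaling helper. The function `min_max_scale` and its behaviour for constant input are hypothetical choices made for illustration, not part of any specific client codebase.

```python
def min_max_scale(values):
    """Scale numbers to the range [0, 1]; a constant input maps to all zeros."""
    lowest, highest = min(values), max(values)  # raises ValueError if empty
    span = highest - lowest
    if span == 0:
        # Assumption: a constant feature carries no signal, so return zeros.
        return [0.0 for _ in values]
    return [(value - lowest) / span for value in values]


def test_scales_to_unit_interval():
    assert min_max_scale([2, 4, 6]) == [0.0, 0.5, 1.0]


def test_constant_input_does_not_divide_by_zero():
    assert min_max_scale([3, 3, 3]) == [0.0, 0.0, 0.0]


def test_empty_input_is_rejected():
    try:
        min_max_scale([])
    except ValueError:
        return
    raise AssertionError("expected ValueError for empty input")
```

Tests written this way can be collected by a runner such as pytest and executed automatically on every commit, which is the habit that is often missing from notebook-driven work.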

Organisations must optimise their data science investments to continue scaling

To address these challenges and continue scaling, organisations must optimise their data science investments. Through our client experience we see three elements that drive value for organisations.

1. Restructure teams

One key aspect of this optimisation involves restructuring teams. Data science teams often start within individual business units, are later centralised into a centre of excellence, and then decentralise back into the business during the optimisation process. Striking the right balance between centralisation and decentralisation can help organisations maintain agility while ensuring a consistent and scalable approach to data science.

2. Unify data engineering and machine learning with data science teams

Another essential component of optimisation is the unification of data engineering and machine learning development with data science teams. By fostering close collaboration between these groups, organisations can ensure that machine learning (ML) models are production-ready from the outset, reducing time to market and improving overall efficiency.

3. Migrate to a cloud-based platform

Finally, the scalability of cloud-based platforms is crucial for handling the massive data volumes needed to run transformative workloads. By migrating to a cloud-based platform, organisations can leverage the inherent scalability and flexibility of the cloud to handle increasing data volumes and workloads, driving innovation and business value. This migration is typically foundational to other organisational and process-based improvements, and this article will focus on the mindset and approach to succeed in a migration.

Migration Mindset

Migrating data science workloads to the cloud is quite distinct from application migrations. For organisations that have experience migrating data warehouse workloads, data science workloads likewise introduce additional complexity in the following four ways.

Infrastructure complexity: Data science teams typically enjoy more freedom and access to specialised hardware compared to application teams. This enables them to explore, experiment, and iterate on their models and algorithms more rapidly. Despite these differences, data science environments are generally more homogeneous than enterprise application portfolios. There are usually only a few profiles of development environments used by data science teams, making it easier to standardise and streamline the migration process.
Diverse technology stacks: Data science workloads often involve a wide range of tools and frameworks, such as Jupyter Notebooks, TensorFlow, PyTorch, and scikit-learn. Code built on DataFrame tooling, such as pandas or R data frames, is harder to translate than SQL. These diverse technology stacks increase the complexity of migration, as each component may require different configurations or adjustments to work seamlessly on the cloud platform.
Code optimisation: Data science code may not be optimised for performance or scalability due to the focus on experimentation and rapid iteration during model development. Migrating to a cloud platform may necessitate re-architecting or refactoring parts of the code to ensure efficient resource utilisation and complete use of the cloud's capabilities.
Reproducibility: Data science workloads often involve complex machine learning models and algorithms, making it crucial to maintain reproducibility during migration. Validating a migration can therefore be challenging, because even slight changes in data, code, or environment can lead to vastly different inference results.
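To illustrate the reproducibility point, here is a minimal sketch of two habits that make a migrated run easier to compare with the original: recording the settings that influence a run, and using an explicitly seeded random generator. The names `run_record`, `differing_keys`, and `sample_training_rows`, and the fields recorded, are hypothetical; a real project would likely also capture library versions and data snapshots.

```python
import platform
import random
import sys


def run_record(seed, params):
    """Capture the settings that influence a run so two runs can be compared."""
    return {
        "seed": seed,
        "params": dict(params),
        "python": ".".join(str(part) for part in sys.version_info[:3]),
        "os": platform.system(),
    }


def differing_keys(record_a, record_b):
    """List the recorded settings that differ between two runs."""
    keys = set(record_a) | set(record_b)
    return sorted(k for k in keys if record_a.get(k) != record_b.get(k))


def sample_training_rows(row_ids, fraction, seed):
    """Draw a training sample that is repeatable for a given seed."""
    rng = random.Random(seed)  # local generator, so no global state is touched
    sample_size = int(len(row_ids) * fraction)
    return sorted(rng.sample(row_ids, sample_size))
```

When a migrated model disagrees with the original, comparing the two run records with `differing_keys` narrows the search to what actually changed, for example the seed or the Python version, before anyone starts debugging the model itself.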

Despite these differences, there are some core approaches developed for cloud application and warehouse migrations that can be adapted and applied to data science migrations for a smoother and more efficient transition. For example, teams can adapt the 6Rs migration framework to suit the specific needs of data science products. This framework provides a structured way to assess and prioritise workloads for migration, ensuring a systematic and efficient transition to the cloud.

Additionally, leveraging automated data migration and automated code transformation tools can help data science teams migrate their workloads more quickly and seamlessly. By automating the transformation of code to cloud-native frameworks like Apache Spark, teams can significantly reduce manual effort and ensure optimal performance on the cloud platform.

There are six ways to migrate a data product

The 6Rs approach, popularised by AWS, is a framework that helps organisations plan and execute migrations of applications or data products to the cloud or modernised platforms. The 6Rs stand for Rehost, Replatform, Refactor, Rebuild/Repurchase, Retire, and Retain. Here's an explanation of each approach in the context of a data science team:

1. Rehost: Also known as "lift-and-shift," this approach involves moving your ML models, notebooks, and data pipelines as-is from the current environment to the new platform. Minimal changes are made to the codebase, which might involve adjusting configurations or dependencies to ensure compatibility. This is the quickest way to migrate but may not take full advantage of the features or optimisations available in the new environment.

2. Replatform: This "lift-tinker-and-shift" approach involves making slight modifications to the existing ML models or data pipelines to optimise them for the new platform. Examples include adjusting the code to use a different database, changing the data storage format, or modifying the model training process to leverage the new platform's capabilities. Replatforming typically results in improved performance, scalability, and maintainability without a complete overhaul of the codebase.

3. Refactor: In this approach, the data science team re-architects or rewrites the ML models, notebooks, and data pipelines to take full advantage of the new platform's features and capabilities. This might involve adopting new programming paradigms, utilising different ML frameworks, or redesigning data pipelines to leverage cloud-native services. Refactoring requires significant effort but can result in substantial improvements in performance, maintainability, and scalability.

4. Rebuild / Repurchase: This approach involves starting over with the same goal but employing an entirely new approach. Rebuilding means leveraging the full feature set of the cloud platform to build an ML model that significantly outperforms the old model. A special case of a rebuild is a “repurchase”, which involves replacing the current ML models or data pipelines with commercially available off-the-shelf solutions on the new platform.

This approach might be suitable for a data science team if their existing models or pipelines have become too complex or difficult to maintain, or if better alternatives exist on the market. Repurchasing requires careful evaluation of the available options and potential trade-offs in functionality and customisation.

5. Retire: The retire approach involves identifying and decommissioning ML models, notebooks, or data pipelines that are no longer needed, have been replaced, or provide little value. Retiring these assets can help simplify the migration process, reduce maintenance overhead, and focus resources on higher-impact initiatives.

6. Retain: Retaining involves keeping the existing ML models, notebooks, or data pipelines in their current environment, either because they are still functional or due to other constraints such as regulatory requirements or lack of resources. In this case, the data science team may choose to revisit the migration decision later or maintain a hybrid approach, where some assets are migrated and others are retained in the current environment.

By applying the 6Rs framework to their migration strategy, data science teams can determine the most suitable approach for each ML model or data pipeline, ensuring a smooth transition to the new platform with minimal disruption.

Our recommended approach for data science migrations

We recommend a three-step roadmap for accelerating data science migration: Assess and Plan, Mobilise, and Migrate.

PHASE 1 ASSESS AND PLAN: CATALOGUE AND TRIAGE ML MODELS AND OTHER DATA PRODUCT PIPELINES

The planning phase is crucial to the success of any data science migration. A successful migration begins by cataloguing and triaging all data product pipelines, such as those dedicated to ML models, dashboards, tables, and A/B test support. Using guidelines from AWS, teams define the 6Rs approaches and create a decision tree to help triage each data product into a migration approach. They then develop an initial playbook for migration execution, including accelerators such as automated code translators and data migration tools, and create a regression testing and validation plan for migrations, incorporating additional automation where necessary.
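A triage decision tree of this kind could be sketched as follows. The questions, their order, and the attribute names on `DataProduct` are illustrative assumptions only; they are not taken from AWS guidance, and each organisation would define its own criteria during planning.

```python
from dataclasses import dataclass


@dataclass
class DataProduct:
    """Hypothetical catalogue entry; every field is an assumed triage question."""
    name: str
    still_used: bool
    must_stay_in_current_environment: bool
    off_the_shelf_alternative: bool
    meets_business_need: bool
    needs_cloud_native_redesign: bool
    runs_unchanged_on_target: bool


def triage(product: DataProduct) -> str:
    """Walk the assumed decision tree and return one of the 6Rs categories."""
    if not product.still_used:
        return "Retire"
    if product.must_stay_in_current_environment:
        return "Retain"
    if product.off_the_shelf_alternative or not product.meets_business_need:
        return "Rebuild/Repurchase"
    if product.needs_cloud_native_redesign:
        return "Refactor"
    if product.runs_unchanged_on_target:
        return "Rehost"
    # Works on the new platform only after small adjustments.
    return "Replatform"
```

Encoding the tree as code keeps triage decisions consistent across the catalogue and makes it cheap to re-run the triage when the criteria change after pilot migrations.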

PHASE 2 MOBILISE: EXECUTE PILOT MIGRATIONS AND TRAIN DATA SCIENTISTS

During the mobilisation phase, the migration team selects a sample of migrations using a framework that maximises coverage across the 6Rs migration categories, model/pipeline complexity, and business stakeholders and teams. The migration team then:

Executes pilot migrations and updates the playbook details.
Engages expertise from their organisation or external partners and leverages new platform features to build reference implementations.
Implements a validation plan to ensure the quality of data and models is maintained, and defines the future state code repository structure.
Documents lessons learned in the migration playbook as best practices.
Implements new features on the future state platform, such as experiment tracking, model registry, continuous integration and continuous delivery (CI/CD) deployment, and end-to-end data pipeline observability.
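The pilot selection at the start of this phase can be approximated with a simple greedy heuristic: repeatedly pick the candidate that adds the most not-yet-covered attributes (6Rs category, complexity level, owning team). This is one possible sketch rather than a prescribed method; the function `select_pilots` and the tag values are hypothetical.

```python
def select_pilots(candidates, max_pilots):
    """Greedily choose pilots that maximise coverage of tags.

    candidates maps a data product name to a set of tags, for example its
    6Rs category, complexity level, and owning team.
    """
    covered = set()
    chosen = []
    remaining = dict(candidates)
    while remaining and len(chosen) < max_pilots:
        # Most new tags wins; ties are broken alphabetically by name.
        best = max(sorted(remaining), key=lambda name: len(remaining[name] - covered))
        if not remaining[best] - covered:
            break  # nothing left adds coverage
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen


candidates = {
    "churn_model": {"Refactor", "high", "marketing"},
    "sales_dashboard": {"Rehost", "low", "finance"},
    "pricing_pipeline": {"Refactor", "high", "finance"},
    "legacy_report": {"Retire", "low", "marketing"},
}
pilots = select_pilots(candidates, max_pilots=2)
# pilots == ["churn_model", "sales_dashboard"]
```

With two pilots, the sample already spans two migration categories, both complexity levels, and both teams, which is the kind of breadth that makes the playbook useful before the full migration starts.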

Training data scientists on the new approach is essential for successful migration. Pair them with experienced individuals and supplement with formal training on different ways of working and utilising the features of the new platform.

PHASE 3 MIGRATE: DISTRIBUTE WORK TO EXECUTE MIGRATIONS AND RUN MIGRATED MODELS IN PARALLEL

In the final migration phase, the migration team often expands, using the information learned during pilot migrations to update the playbook and the 6Rs triage. The expanded team uses the playbook developed during the mobilise phase as a guide for best practices. By this time, the patterns should be well known, so the migration programme can distribute work across the platform, data engineering, and data science teams. Each migration should use frequent, small commits to break down refactoring into manageable chunks, and use the pull request review process to ensure that proper automated testing is in place to validate results.

Finally, validation involves running each migrated model in parallel with the original for at least two inference cycles (e.g., if the model runs monthly, run in parallel for two months) to ensure a seamless transition and maintain quality.
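A parallel-run check could be automated along the following lines. This is a sketch under stated assumptions: predictions are keyed by a record identifier, numeric outputs are compared within a tolerance, and the default tolerances and the names `compare_cycle` and `ready_to_cut_over` are hypothetical. Only the two-cycle rule comes from the approach described above.

```python
import math


def compare_cycle(legacy, migrated, rel_tol=1e-6, abs_tol=1e-9):
    """Return record ids whose predictions are missing or differ beyond tolerance."""
    mismatches = []
    for key in sorted(set(legacy) | set(migrated)):
        if key not in legacy or key not in migrated:
            mismatches.append(key)  # present on one platform only
        elif not math.isclose(legacy[key], migrated[key],
                              rel_tol=rel_tol, abs_tol=abs_tol):
            mismatches.append(key)
    return mismatches


def ready_to_cut_over(cycles, required_clean_cycles=2):
    """cycles is a list of (legacy, migrated) prediction dicts, oldest first."""
    if len(cycles) < required_clean_cycles:
        return False
    recent = cycles[-required_clean_cycles:]
    return all(not compare_cycle(legacy, migrated) for legacy, migrated in recent)
```

For a monthly model, each entry in `cycles` would hold one month's predictions from both platforms, and the original model would be switched off only once the two most recent months match.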

Scaling your investment

By following this three-step roadmap, businesses can accelerate their data science migration, reducing the time and complexity involved in the process. At Credera, we’ve seen this approach deliver significant value for organisations, including a major energy provider. Through a collaborative partnership, Credera delivered a future state operating model that enables unified ways of working and consolidated technology platforms across a team of more than 30 data scientists, supporting over 60 data products that power the company's core processes.

We’d love to start a conversation about what is next for your organisation’s data investment. To find out more, please get in touch with a member of our team.
