Course Overview
Learn how to create an end-to-end, hardware-accelerated machine learning pipeline for large datasets. Throughout the development process, you’ll use diagnostic tools to identify performance bottlenecks and learn to mitigate common pitfalls.
Prerequisites
- Basic knowledge of a standard data science workflow on tabular data. To gain an adequate understanding, we recommend an introductory article on the subject.
- Knowledge of distributed computing using Dask. To gain an adequate understanding, we recommend the “Get Started” guide from Dask (a short example follows this list).
- Completion of the DLI’s Fundamentals of Accelerated Data Science course or an ability to manipulate data using cuDF and some experience building machine learning models using cuML.
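As a quick refresher on the Dask pattern assumed above, here is a minimal sketch in the spirit of Dask’s “Get Started” guide; the file path and column names are placeholders.

```python
import dask.dataframe as dd

# Lazily read a partitioned CSV dataset (path is a placeholder).
ddf = dd.read_csv("data/2024-*.csv")

# Operations build a task graph; nothing executes yet.
means = ddf.groupby("name")["value"].mean()

# .compute() runs the graph across the available workers.
print(means.compute())
```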
Course Objectives
- Develop and deploy an accelerated end-to-end data processing pipeline for large datasets
- Scale data science workflows using distributed computing
- Perform DataFrame transformations that take advantage of hardware acceleration and avoid hidden slowdowns (see the sketch after this list)
- Enhance machine learning solutions through feature engineering and rapid experimentation
- Improve data processing pipeline performance by optimizing memory management and hardware utilization
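To illustrate the “hidden slowdowns” objective above, the sketch below contrasts a fully GPU-resident cuDF transformation with one that silently copies data back to the host. This is a minimal sketch assuming the RAPIDS cuDF package and a CUDA-capable GPU; the column names are illustrative.

```python
import cudf

# Build a small GPU DataFrame (columns are illustrative).
df = cudf.DataFrame({
    "price": [10.0, 22.5, 13.0, 40.0],
    "qty": [3, 1, 7, 2],
})

# Fast path: vectorized arithmetic runs entirely on the GPU.
df["revenue"] = df["price"] * df["qty"]

# Hidden slowdown: .to_pandas() copies the data to host memory,
# so the same arithmetic now runs on the CPU and pays a transfer cost.
pdf = df.to_pandas()
pdf["revenue"] = pdf["price"] * pdf["qty"]
```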
Outline: Enhancing Data Science Outcomes With Efficient Workflow (EDSOEW)
Introduction
- Meet the instructor.
- Create an account at courses.nvidia.com/join.
Advanced Extract, Transform, and Load (ETL)
- Learn to process large volumes of data efficiently for downstream analysis:
- Discuss current challenges of growing data sizes.
- Perform ETL efficiently on large datasets.
- Discuss hidden slowdowns and learn to perform DataFrame transformations efficiently.
- Discuss diagnostic tools to monitor and optimize hardware utilization.
- Persist data in a way that’s conducive to downstream analytics (a minimal end-to-end sketch follows this list).
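As a hedged sketch of the read, transform, persist pattern this module covers, the following assumes a machine with one or more GPUs and the dask_cudf and dask_cuda packages; paths and column names are placeholders.

```python
import dask_cudf
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

# One worker per local GPU (cluster settings are illustrative).
cluster = LocalCUDACluster()
client = Client(cluster)

# Extract: lazily read a partitioned dataset (path is a placeholder).
ddf = dask_cudf.read_parquet("data/raw/*.parquet")

# Transform: filter and aggregate on the GPU; column names are hypothetical.
ddf = ddf[ddf["amount"] > 0]
summary = ddf.groupby("customer_id").agg({"amount": "sum"})

# Persist: write results in a format friendly to downstream analytics.
summary.to_parquet("data/curated/")
```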
Training on Multiple GPUs With PyTorch Distributed Data Parallel (DDP)
- Learn how to improve data analysis on large datasets:
- Build and compare classification models.
- Perform feature selection based on the predictive power of new and existing features.
- Perform hyperparameter tuning.
- Create embeddings using deep learning and perform clustering on those embeddings (a minimal training sketch follows this list).
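The following is a minimal sketch of the multi-GPU training pattern named in this module’s title, using PyTorch DDP; the model, dataset, and hyperparameters are toy placeholders, and the script is assumed to be launched with torchrun.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Launched with: torchrun --nproc_per_node=<num_gpus> train.py
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Toy embedding model and data; sizes are illustrative.
model = torch.nn.Sequential(
    torch.nn.Linear(32, 64), torch.nn.ReLU(), torch.nn.Linear(64, 8)
).cuda(local_rank)
model = DDP(model, device_ids=[local_rank])

dataset = TensorDataset(torch.randn(1024, 32), torch.randn(1024, 8))
sampler = DistributedSampler(dataset)  # shards the data across processes
loader = DataLoader(dataset, batch_size=64, sampler=sampler)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = torch.nn.MSELoss()

for epoch in range(2):
    sampler.set_epoch(epoch)  # reshuffle shards each epoch
    for x, y in loader:
        x, y = x.cuda(local_rank), y.cuda(local_rank)
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()  # DDP all-reduces gradients across GPUs here
        optimizer.step()

dist.destroy_process_group()
```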
Deployment
- Learn how to deploy and measure the performance of an accelerated data processing pipeline:
- Deploy a data processing pipeline with Triton Inference Server (a client-side sketch follows this list).
- Discuss various tuning parameters to optimize performance.
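For a sense of what calling a deployed pipeline looks like, here is a minimal client-side sketch using Triton’s HTTP client; the model name, tensor names, shapes, and dtypes are hypothetical and must match the deployed model’s configuration.

```python
import numpy as np
import tritonclient.http as httpclient

# Connect to a running Triton Inference Server (URL is illustrative).
client = httpclient.InferenceServerClient(url="localhost:8000")

# Tensor names, shapes, and dtypes must match the model's config.pbtxt;
# the values below are hypothetical.
batch = np.random.rand(4, 32).astype(np.float32)
inputs = [httpclient.InferInput("INPUT0", batch.shape, "FP32")]
inputs[0].set_data_from_numpy(batch)

response = client.infer(model_name="pipeline_model", inputs=inputs)
predictions = response.as_numpy("OUTPUT0")
print(predictions.shape)
```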
Assessment and Q&A