MOOC: Introduction to Machine Learning in Production

Instructors:

  • Andrew Ng – Founder, DeepLearning.AI
  • Robert Crowe – TensorFlow Engineer, Google
  • Laurence Moroney – Lead AI Advocate, Google

Week 1: Overview of the ML Life-cycle and Deployment

ML project life-cycle

  • Scoping
    • Define project
      • Decide on key metrics
  • Data
    • Define data and establish baseline
      • How much silence before/after each clip?
      • How to perform volume normalization?
    • Label and organize data
      • Is the data labeled consistently?
  • Modeling
    • Select and train model
      • ML Model built with
        • Code (algorithm/model)
        • Hyperparameters
        • Data
    • Perform error analysis
      • Can help you identify where to focus effort and what additional data to collect
  • Deployment
    • Deploy in production
    • Monitor & maintain system
    • Key challenges
      • Concept drift and Data drift
        • Data drift
          • Gradual change over time
          • Sudden shock
        • Concept drift: Desired mapping from x to y changes
        • Need to detect changes (see the drift-detection sketch after this outline)
      • Software engineering issues
        • Real-time or batch
        • Cloud vs Edge/Browser
        • Compute resources (CPU/GPU/memory)
        • Latency, throughput (queries per second, QPS)
        • Logging
        • Security and privacy
    • Deployment patterns
      • New product/capability
      • Automate/assist with manual task
      • Replace previous ML system
      • Key ideas:
        • Gradual ramp up with monitoring
        • Rollback
      • Canary deployment
        • Roll out to small fraction (say 5%) of traffic initially
        • Monitor system and ramp up traffic gradually.
      • Blue/green deployment
        • Run the old (blue) and new (green) versions side by side and switch the router's traffic to green
        • Easy to roll back (switch traffic back to blue)
    • Degree of automation
      • Human only > Shadow mode > AI assistance > Partial automation > Full automation
    • Monitoring
      • Monitoring dashboard
        • Software metrics
          • Server load
          • Memory
          • Compute
          • Latency
          • Throughput
        • Input metrics
          • Avg input length
          • Avg input volume
          • Num missing values
          • Avg image brightness
        • Output metrics
          • # times return " " (null)
          • # times user redoes search
          • # times user switches from speech to typing
      • Pipeline monitoring
        • Example: Audio –> voice activity detection –> Speech recognition –> transcript
  • References:
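
The deployment notes above call for detecting data drift but don't show a mechanism. Below is a minimal sketch, not from the course, that compares the training-time distribution of a single input metric (e.g., average input volume) against a recent production window using SciPy's two-sample Kolmogorov–Smirnov test; the arrays here are synthetic placeholders:

```python
import numpy as np
from scipy.stats import ks_2samp

def detect_data_drift(reference, production, alpha=0.01):
    """Flag drift when a feature's production distribution differs from the training-time reference."""
    stat, p_value = ks_2samp(reference, production)
    return {"ks_statistic": stat, "p_value": p_value, "drift": p_value < alpha}

# Placeholder data: training-time values vs. values logged in production last week.
reference = np.random.normal(0.0, 1.0, size=5_000)
production = np.random.normal(0.3, 1.0, size=5_000)   # mean has shifted
print(detect_data_drift(reference, production))
```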

Week 2: Modeling overview

Selecting and Training a Model

  • Model-centric AI development
  • Data-centric AI development
    • Focusing on high-quality data can be a more efficient way to get the system performing well
  • Key challenges
    • An AI system is composed of Code (algorithm/model) and Data
    • Model development is an iterative process
    • Model + Hyperparameters + Data –> Training –> Error analysis –> Model + Hyperparameters + Data
      • Audit performance before moving to production
    • Challenges
      • Doing well on the training set (usually measured by average training error)
      • Doing well on dev/test set
      • Doing well on business metrics/project goals
  • Why low average error isn’t good enough
    • Performance on disproportionately important examples
      • Example: Web search
        • Users are willing to forgive results that aren’t the most accurate for informational and transactional queries, e.g. “Apple pie recipe”, “Latest movies”, “Wireless data plan”, “Diwali festival”
        • For navigational queries (e.g. “Stanford”, “Reddit”, “youtube”), users have a very clear intent and are unforgiving if the search engine doesn’t return the exact answer.
      • Even if an algorithm misses only a few navigational queries, the user-experience impact can be significant.
    • Performance on key slices of the dataset
      • Example: ML for loan approval: Make sure not to discriminate by ethnicity, gender, location, language or other protected attributes
      • Example: Product recommendations from retailers: Be careful to treat fairly all major user, retailer, and product categories.
    • Rare classes
      • Skewed data distribution
      • Accuracy in rare classes
  • Establish a baseline
    • Example: For speech recognition, use human level performance as the comparison and target to match human level performance initially.
    • Unstructured and structured data
      • Examples of unstructured data: Image, Audio, text
    • Ways to establish a baseline
      • Human level performance (HLP)
      • Literature search state-of-art/open source
      • Quick-and-dirty implementation
      • Performance of older system
  • Tips for getting started
    • Getting started on modeling
      • Literature search to see what’s possible (courses, blogs, open-source projects)
    • Deployment constraints when picking a model
      • Should you take into account deployment constraints when picking a model?
      • Yes, if a baseline is already established and the goal is to build and deploy.
      • No (or not necessarily), if purpose is to establish a baseline and determine what is possible and might be worth pursuing.
    • Sanity-check for code and algorithm
      • Try to overfit a small training dataset (even a single batch) before training on a larger one, as in the sketch below.
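
A minimal sketch of that over-fitting sanity check, assuming a PyTorch classifier; the model, data, and hyperparameters are placeholders, not from the course:

```python
import torch
import torch.nn as nn

# Hypothetical tiny classifier and a single small batch of 8 examples.
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
x, y = torch.randn(8, 20), torch.randint(0, 2, (8,))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# If the code and labels are correct, loss on this tiny batch should drop toward zero.
for step in range(500):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
print(f"final loss on tiny batch: {loss.item():.4f}")  # expect ~0 if the pipeline is sound
```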

Error analysis and performance auditing

  • Example error analysis
    • Speech recognition: In this example, try to identify factors that impacted the error output. Factors may include car noise, people noise, low bandwidth.
  • Iterative process of error analysis
    • Identify tags and propose tags in an iterative manner
    • What fraction of errors has that tag?
    • Of all data with that tag, what fraction is misclassified?
    • What fraction of all the data has that tag?
    • How much room for improvement is there on data with that tag?
  • Prioritizing what to work on
    • Decide on the most important categories to work on, based on:
      • How much room for improvement there is.
      • How frequently that category appears.
      • How easy it is to improve accuracy in that category.
      • How important is it to improve in that category.
  • Skewed datasets
    • Examples
      • Medical Diagnosis: 99% of patients don’t have a disease
      • Speech recognition: In wake-word detection, 98.7% of the time the wake word doesn’t occur
    • Combining precision and recall – F1 score (see the sketch after this section)
    • Multi-class metrics
      • Example: cell phone visual inspection; classes: scratch, dent, pit mark, discoloration
  • Performance auditing
    • Check for accuracy, fairness/bias, and other problems.
      • Brainstorm the way the system might go wrong.
        • Performance on subsets of data (e.g., ethnicity, gender)
        • How common are certain errors (e.g., FP, FN)
        • Performance on rare classes
      • Establish metrics to assess performance against these issues on appropriate slices of data
      • Get business/product owner buy-in
    • Example: Speech recognition
      • Brainstorm the ways the system might go wrong
        • Accuracy on different genders and ethnicities
        • Accuracy on different devices
        • Prevalence of rude mis-transcriptions
      • Establish metrics to assess performance against these issues on appropriate slices of data
        • Mean accuracy for different genders and major accents
        • Mean accuracy on different devices
        • Check for prevalence of offensive words in the output
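
For the skewed-dataset discussion above, here is a minimal sketch, using scikit-learn, of why precision, recall, and F1 are more informative than raw accuracy; the labels are a made-up example:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical skewed dataset: 1 positive example out of 100.
y_true = [1] + [0] * 99
y_pred = [0] * 100          # a model that always predicts "negative"

# Accuracy would be 99%, but precision/recall/F1 expose the failure on the rare class.
print(precision_score(y_true, y_pred, zero_division=0))  # 0.0
print(recall_score(y_true, y_pred, zero_division=0))     # 0.0
print(f1_score(y_true, y_pred, zero_division=0))         # 0.0
```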

Data iteration

  • Model-centric view:
    • Take the data you have, and develop a model that does as well as possible on it.
    • Hold the data fixed and iteratively improve the code/model.
  • Data-centric view
    • The quality of the data is paramount. Use tools to improve the data quality; this will allow multiple models to do well.
    • Hold the code fixed and iteratively improve the data.

A useful picture of data augmentation

  • Improving the performance in one area may also improve the performance of the AI system in related parts of the problem space.

Data augmentation

  • Goal
    • Create realistic examples that
      • the algorithm does poorly on, but
      • humans (or other baseline) do well on
  • Checklist
    • Does it sound realistic?
    • Is the x –> y mapping clear? (e.g. can humans recognize speech)
    • Is the algorithm currently doing poorly on it?
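
One way to generate such examples for speech is to mix background noise (e.g., café or car noise) into clean clips at a controlled signal-to-noise ratio. A minimal sketch, assuming 1-D NumPy waveforms at the same sample rate; this is an illustration, not the course's implementation:

```python
import numpy as np

def augment_with_noise(clip, noise, snr_db=10.0):
    """Mix a background-noise clip into a speech clip at a target SNR (in dB)."""
    # Tile or trim the noise so it matches the clip length.
    if len(noise) < len(clip):
        noise = np.tile(noise, int(np.ceil(len(clip) / len(noise))))
    noise = noise[: len(clip)]
    # Scale the noise so the speech-to-noise power ratio matches snr_db.
    clip_power = np.mean(clip ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(clip_power / (noise_power * 10 ** (snr_db / 10)))
    return clip + scale * noise
```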

Can adding data hurt performance?

  • For unstructured data problems, if:
    • The model is large (low bias)
    • The mapping x –> y is clear (e.g., given only the input x, humans can make accurate predictions).
    • Then, adding data rarely hurts accuracy.
  • Photo OCR counterexample with “1” vs “I”: adding many more examples of “I” can cause the model to skew toward “I” on ambiguous edge cases.

Adding features

  • Restaurant recommendation example
    • Since there is a fixed number of people and restaurants, collecting much more data is hard; consider adding features instead (a sketch of deriving one such feature follows this section).
    • Possible features to add
      • Is person vegetarian (based on past orders)?
      • Does restaurant have vegetarian options (based on menu)?
  • Product recommendation approaches
    • Collaborative filtering
      • Recommend based on what other, similar users like.
    • Content-based filtering
      • Recommend restaurants based on information about the restaurant itself (e.g., its menu).
      • Helps with the cold-start problem (can recommend new restaurants with few or no ratings).
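
The vegetarian feature mentioned above has to be derived from raw order history. A minimal sketch of that derivation, assuming a pandas DataFrame of past orders; the table and column names are hypothetical placeholders, not from the course:

```python
import pandas as pd

# Hypothetical order history.
orders = pd.DataFrame({
    "user_id":    [1, 1, 2, 2, 2],
    "dish":       ["salad", "tofu bowl", "burger", "steak", "salad"],
    "vegetarian": [True, True, False, False, True],
})

# Derived feature: fraction of past orders that were vegetarian, per user.
veg_share = orders.groupby("user_id")["vegetarian"].mean().rename("veg_order_share")
users = veg_share.reset_index()
users["is_likely_vegetarian"] = users["veg_order_share"] > 0.8
print(users)
```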

Experiment tracking

  • What to track
    • Algorithm/code version
    • Dataset used
    • Hyperparameters
    • Results
  • Tracking tools
    • Text files
    • Spreadsheet
    • Experiment tracking system
  • Desirable features
    • Information needed to replicate results
    • Experiment results, ideally with summary metrics/analysis
    • Resource monitoring, visualization, model analysis
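
A minimal sketch of the text-file/spreadsheet option: append one row per experiment with the code version, dataset, hyperparameters, and results. The file name and fields are placeholders; a dedicated experiment-tracking system would replace this in practice:

```python
import csv
import json
import subprocess
from datetime import datetime, timezone

def log_experiment(dataset, hyperparams, results, path="experiments.csv"):
    """Append one experiment record: timestamp, code version, dataset, hyperparameters, results."""
    # Code version: current git commit (falls back to "unknown" outside a repo).
    try:
        commit = subprocess.check_output(["git", "rev-parse", "--short", "HEAD"], text=True).strip()
    except Exception:
        commit = "unknown"
    row = [datetime.now(timezone.utc).isoformat(), commit, dataset,
           json.dumps(hyperparams), json.dumps(results)]
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow(row)

# Example usage with placeholder names and numbers.
log_experiment("speech_v3.tfrecord", {"lr": 1e-3, "epochs": 10}, {"dev_wer": 0.182})
```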

From big data to good data

  • Good data
    • Covers important cases (good coverage of inputs x)
    • Is defined consistently (definition of labels y is unambiguous)
    • Has timely feedback from production data (distribution covers data drift and concept drift)
    • Is sized appropriately

References

Week 3: Data Definition and Baseline

Define Data and Establish Baseline

  • Data labeling consistency is important for successful training
  • The way data was prepared will have a huge impact on the model
  • Example: User ID merge
    • User information from multiple sources can be ambiguous when trying to determine whether records represent the same user.
    • Similar ambiguity arises in other labeling problems:
      • Is this a bot/spam account?
      • Is this transaction fraudulent?
      • Is this user looking for a job?
  • Input
    • The quality of the input x can have a big impact
    • What features need to be included?
  • Target label
    • How can we ensure labelers give consistent labels?
  • Major types of data problems
    • Unstructured

      • Humans can label unstructured data
      • Small data
        • Manufacturing visual inspection from 100 training examples
        • Having clean labels is critical
      • Big data
        • Speech recognition from 50 million training examples
        • Emphasis on data process for big data
    • Structured

      • Harder to obtain more data
      • Harder to get humans to label structured data
      • Small data
        • Housing price prediction based on square footage, etc., from 50 training examples
        • Having clean labels is critical
      • Big data
        • Online shopping recommendations for 1 million users
        • Emphasis on data process for big data
  • Small data and label consistency
    • Problems with a large dataset but where there’s a long tail of rare events in the input will have small data challenges too.
  • Improving label consistency
    • Have multiple labelers label the same example (agreement can be quantified, e.g., as in the sketch after this section)
    • When there is disagreement, have the ML engineer (MLE), subject matter expert (SME), and/or labelers discuss the definition of y to reach agreement.
    • Standardize labels
  • Human level performance
    • When the ground truth label is itself assigned by a human, HLP measures how well two humans agree with each other.
    • Improving label consistency will raise measured HLP.
    • This makes it harder for ML to “beat” HLP, but the more consistent labels also raise ML performance, which ultimately benefits the actual application.
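
Label consistency (and hence HLP) can be quantified by having two labelers label the same examples and measuring their agreement. A minimal sketch using Cohen's kappa from scikit-learn; the labels below are made up for illustration:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels from two labelers on the same 10 examples.
labeler_a = ["scratch", "dent", "scratch", "ok", "ok", "dent", "scratch", "ok", "dent", "ok"]
labeler_b = ["scratch", "dent", "ok",      "ok", "ok", "dent", "scratch", "ok", "ok",   "ok"]

# Cohen's kappa corrects raw agreement for chance; values near 1 mean consistent labeling.
print(cohen_kappa_score(labeler_a, labeler_b))
```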

Label and Organize Data

  • Obtaining data

    • Get into this iteration loop as quickly as possible (i.e. don’t spend too much time obtaining data initially)
    • Instead of asking “How long would it take to obtain m examples?”, ask “How much data can we obtain in k days?”
    • Exception: If you have worked on the problem before and from experience you know you need m examples.
    • Brainstorm list of data sources and evaluate how much time/cost will each require. This will help make decisions on which sources to invest time/money to acquire the data.
    • Additional factors to consider: Data quality, privacy, regulatory constraints
    • Labeling data
      • Options: in-house vs. outsourced vs. crowdsourced
      • Having MLEs label data is expensive, but doing this for just a few days is usually fine.
      • Who is qualified to label?
      • Recommender systems – it may be impossible for a person to label well; purchase history may serve as the label instead.
      • Don’t increase data by more than 10x at a time
  • Data pipeline

    • pre-processing scripts
      • Need to be replicable from the dev to the prod environment
    • Tools that can be helpful: TensorFlow Transform, Apache Beam, Airflow
  • Meta-data, data provenance and lineage

    • Data provenance: Where data comes from
    • Lineage: sequence of steps needed to get to end of the pipeline
    • Documentation can help keep track of provenance and lineage
    • TensorFlow Transform can also keep track of these
    • Metadata: Can be useful to troubleshoot data problems
  • Balanced train/dev/test split

    • Small dataset: Balanced (stratified) split means the percentage of positive examples is the same in the train/dev/test splits (see the sketch below).
    • Large dataset: a random split will be representative.
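
A minimal sketch of a balanced (stratified) train/dev/test split using scikit-learn; the toy dataset below has 10% positive examples, and stratification preserves that rate in every split:

```python
from sklearn.model_selection import train_test_split

# Toy skewed dataset: 100 examples, 10 positives.
X = list(range(100))
y = [1] * 10 + [0] * 90

# stratify keeps the positive rate identical across train (60%), dev (20%), and test (20%).
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, stratify=y, random_state=0)
X_dev, X_test, y_dev, y_test = train_test_split(X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0)
print(sum(y_train) / len(y_train), sum(y_dev) / len(y_dev), sum(y_test) / len(y_test))  # all 0.10
```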

Scoping

  • Questions to ask:
    • What project should we work on?
    • What are the metrics for success?
    • What are the resources (data, time, people) needed?
  • Scoping process
    • Brainstorm business problems (not AI problems)
      • Example: What are the top 3 things you wish were working better?
      • Identify what to achieve
    • Brainstorm AI solutions
      • Identify how to achieve
    • Assess the feasibility and value of potential solutions
      • Highly predictable examples:
        • Given past purchases, predict future purchases
        • Given weather, predict shopping mall foot traffic
      • Less predictable examples:
        • Given DNA info, predict heart disease
        • Given social media chatter, predict demand for clothing style
        • Given history of a stock’s price, predict future price of that stock
      • Assessing technical feasibility (unstructured vs. structured data, new vs. existing project):
        • Unstructured data, new project: Can a human, given the same data, perform the task? (HLP) If so, ML might be possible.
        • Unstructured data, existing project: Can a human, given the same data, perform the task (HLP)? The history of the project can also help predict future progress.
        • Structured data, new project: Are predictive features available?
        • Structured data, existing project: Are new predictive features available? Also look at the history of the project.
  • Scoping process (continued)
    • Determine milestones and resourcing
      • Key specifications:
        • ML metrics (accuracy, precision/recall, etc)
        • Software metrics (latency, throughput, etc. given compute resources)
        • Business metrics (revenue, etc.)
        • Resources needed (data, personnel, help from other teams)
        • Timeline
      • If unsure, consider benchmarking to other projects, or building a POC (Proof of Concept) first.

References
