This workshop will go well beyond "Hey, you should use LIME." We will share concrete tips for balancing interpretable and predictive modeling in the wild. Specific goals for the workshop are to:
1) Explain the difference between interpretable and predictive models, and the difference between local and global interpretability.
2) Cover a few concepts about data collection and data preparation related to interpretable vs predictive models.
3) Dive into which algorithms predict well, which interpret well, and which do a bit of both.
4) Share a decision flow chart showing how to approach a project based on the need for interpretability and/or predictive power.
5) Review Python code examples of what it means to navigate between interpretable and predictive models (yes, LIME is included in the examples).
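The workshop's examples use the LIME library itself; as a taste of the underlying idea, here is a minimal, dependency-free sketch of a LIME-style local surrogate: perturb an input, weight the perturbations by their proximity to it, and fit a weighted linear model whose slope serves as the local explanation. The `black_box` function, the kernel `width`, and the one-dimensional setting are illustrative assumptions, not the workshop's code.

```python
import math
import random

# A stand-in for any opaque model we want to explain (here, a simple
# nonlinear function; in practice this would be a trained classifier).
def black_box(x):
    return x ** 2 + 3 * x

def local_linear_explanation(f, x0, n_samples=500, width=0.5, seed=0):
    """Fit a proximity-weighted linear surrogate around x0 (the core LIME idea)."""
    rng = random.Random(seed)
    xs = [x0 + rng.gauss(0, width) for _ in range(n_samples)]
    ys = [f(x) for x in xs]
    # Weight perturbed points by closeness to x0 (Gaussian kernel).
    ws = [math.exp(-((x - x0) ** 2) / (2 * width ** 2)) for x in xs]
    # Closed-form weighted least squares for slope and intercept.
    sw = sum(ws)
    xbar = sum(w * x for w, x in zip(ws, xs)) / sw
    ybar = sum(w * y for w, y in zip(ws, ys)) / sw
    slope = (sum(w * (x - xbar) * (y - ybar) for w, x, y in zip(ws, xs, ys))
             / sum(w * (x - xbar) ** 2 for w, x in zip(ws, xs)))
    return slope, ybar - slope * xbar

# Near x0 = 1 the true local slope of x^2 + 3x is 2*1 + 3 = 5, so the
# surrogate's slope should land close to 5.
slope, intercept = local_linear_explanation(black_box, x0=1.0)
print(slope)
```

The surrogate is faithful only near `x0`; that locality is exactly the local-vs-global interpretability distinction in goal 1 above.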
Perhaps the most pervasive argument for model interpretability today is that model consumers and stakeholders need to trust model recommendations and understand how to incorporate them into their decision making. Without trust and understanding, your model runs a real risk of becoming shelf-ware or being misused. Because of this reality, an understanding of the concepts discussed in this talk is vital to the success of a data scientist.
Many data scientists are familiar with word embedding models such as word2vec, which capture semantic similarity of words in a large corpus. However, word embeddings are limited in their ability to interrogate a corpus alongside other context or over time. Moreover, word embedding models need either significant amounts of data or tuning through transfer learning to handle the domain-specific vocabulary unique to most commercial applications.
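As a concrete illustration of the "semantic similarity" that embedding models capture, here is a toy example using cosine similarity over hand-set vectors. A real word2vec model would learn vectors of 100-300 dimensions from a large corpus; the words and values below are invented purely for illustration.

```python
import math

# Toy 3-dimensional "embeddings" (hand-set for illustration; a trained
# word2vec model would learn these vectors from co-occurrence patterns).
vectors = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

def cosine(u, v):
    """Cosine similarity: the standard closeness measure in embedding space."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Semantically similar words sit close together in embedding space.
print(cosine(vectors["cat"], vectors["dog"]) > cosine(vectors["cat"], vectors["car"]))  # True
```

Exponential family embeddings, discussed next, generalize this vector-space picture beyond words to categorical, count, and continuous data.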
In this talk, Maryam will introduce exponential family embeddings. Developed by Rudolph and Blei, these methods extend the idea of word embeddings to other types of high-dimensional data. She will demonstrate how they can be used for advanced topic modeling on medium-sized datasets that are specialized enough to require significant modification of a word2vec model and that contain more general data types (categorical, count, continuous).
Maryam will discuss how we implemented a dynamic embedding model using TensorFlow and our proprietary corpus of job descriptions. Using both categorical and natural language data associated with jobs, we charted the development of different skill sets over the last three years. Maryam will focus the discussion of results on how tech and data science skill sets have developed, grown, and cross-pollinated other types of jobs over time.
Key takeaways:
(1) Lessons learned from implementing different word embedding methods (from pretrained to custom);
(2) How to map trends from a combination of natural language and structured data;
(3) How data science skills have varied across industries and functions, and over time.
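To make takeaway (2) concrete, here is a minimal, hypothetical sketch of mapping a trend from combined structured data (a posting's year) and natural language data (its description). The postings and the skill term below are invented for illustration and bear no relation to the proprietary corpus described above.

```python
from collections import defaultdict

# Toy job postings: (year, description). A real pipeline would draw on a
# large corpus of job ads and a learned skill vocabulary.
postings = [
    (2016, "seeking analyst with excel and sql"),
    (2016, "software engineer java sql"),
    (2017, "data scientist python sql machine learning"),
    (2017, "marketing manager excel"),
    (2018, "data engineer python spark sql"),
    (2018, "product analyst python sql"),
]

def skill_share(postings, skill):
    """Fraction of postings per year that mention a given skill term."""
    totals, hits = defaultdict(int), defaultdict(int)
    for year, text in postings:
        totals[year] += 1
        if skill in text.split():
            hits[year] += 1
    return {year: hits[year] / totals[year] for year in sorted(totals)}

# In this toy data, python's share of postings rises year over year.
print(skill_share(postings, "python"))  # {2016: 0.0, 2017: 0.5, 2018: 1.0}
```

A dynamic embedding model goes further than such keyword counts: it tracks how a skill's learned vector, and hence its meaning and neighbors, drifts over time.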
In this technical session, we will cover key advancements in the analytical process designed to improve productivity for data scientists. These advancements include:
Analytics accessible to anyone seeking insight from data;
Open source integration;
Predefined best practices and autotuning;
Interpretability of machine learning algorithms;
Ease of deployment of models in batch or in real time;
Maximizing resource efficiency with containers.
Yelp strives to connect people with great local businesses. For the past five years, machine learning has been instrumental in helping us achieve this goal, and today, it powers many of our product features. In our journey to become a machine learning company, we have learned numerous practical lessons: everything from how to phrase a business problem in the language of ML to the subtle issues that can lead our ML algorithms astray. This talk will share several of these lessons, with a focus on some tricky nuances that we learned the hard way. Topics discussed will include subtle feedback loops that can occur in model training, label leakage, and issues with data representation.
Probabilistic programming frameworks get a lot of press, but relatively little attention is paid to the indicators that a problem is a good fit for a Bayesian approach, how to pick a framework, or the right first steps in structuring a probabilistic model. This talk aims to build a structural underpinning for Bayesian modeling and lay out concrete strategies for solving real-world problems.
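As a minimal example of those "right first steps," consider the simplest Bayesian building block: a Beta-Bernoulli model, where conjugacy gives the posterior in closed form. Frameworks like PyMC or Stan generalize this kind of update via sampling when no closed form exists; the prior and data below are illustrative, not from the talk.

```python
# Beta-Bernoulli model: a Beta(a, b) prior over a success rate, updated
# with observed successes and failures. Conjugacy means the posterior is
# again a Beta distribution, with simple count-based parameter updates.
def beta_bernoulli_update(a, b, successes, failures):
    """Return the posterior Beta parameters after observing the data."""
    return a + successes, b + failures

# Weakly informative prior Beta(2, 2); observe 12 conversions in 50 trials.
a_post, b_post = beta_bernoulli_update(2, 2, successes=12, failures=38)

# Posterior mean of a Beta(a, b) is a / (a + b): the prior pseudo-counts
# are pooled with the observed counts.
posterior_mean = a_post / (a_post + b_post)
print(round(posterior_mean, 3))  # -> 0.259
```

Note how the posterior mean (0.259) sits between the prior mean (0.5) and the raw observed rate (0.24): exactly the prior-data trade-off that makes a problem a good fit for a Bayesian approach.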
In this mini workshop, we will walk through a framework for successfully managing data science in the enterprise that covers people, process, and technology. We will step through the key stages of the data science lifecycle, from ideation through to delivery and monitoring, discussing common pitfalls and best practices in each based on Domino’s experience working with leading data science teams. Attendees will be provided with examples of Domino’s Lifecycle Assessment and be guided through an interactive exercise to evaluate the bottlenecks in their own organizations. They will leave with a customized physical artifact that can be used to prioritize investment in hiring, process management, or technology acquisition.
Many organizations have been incorporating quantitative research or data science into key business processes for decades, well before the term "data science" was coined. But even in these organizations, there are many existing workflows and parts of the business where data science has not previously been applied. There is perceived value in adopting data science methods and outputs to drive faster innovation, greater differentiation, and smarter decision making, but the learning curve is daunting. In this discussion, our speakers will delve into the approaches they've taken and the lessons they've learned on the path to instrumenting legacy organizations and workflows with data science, touching on best practices around organizational structure, cultural guidance, technology choices, and more.