Right Code, Right Place, Right Time

I gave a talk at PyData DC 2018 where I tried to articulate some reasons why companies building machine learning products under-invest in engineering and architecture. I’m very interested in feedback, pointers to other resources on this topic, and a general discussion about how to make more effective ML products.

Python Plotting for Exploratory Data Analysis

Plotting is an essential component of data analysis. As a data scientist, I spend a significant amount of my time making simple plots to understand complex data sets (exploratory data analysis) and help others understand them (presentations).

In particular, I make a lot of bar charts (including histograms), line plots (including time series), scatter plots, and density plots from data in Pandas data frames. I often want to facet these on various categorical variables and layer them on a common grid.

To that end, I made pythonplot.com, a brief introduction to Python plotting libraries and a “rosetta stone” comparing how to use them. I also included a comparison to ggplot2, the R plotting library that I and many others consider a gold standard.
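
As a rough illustration of the kind of plot described above (a histogram from a Pandas data frame, faceted on a categorical variable and drawn on a common grid), here is a minimal sketch using pandas and matplotlib. The toy data, the column names `group` and `value`, and the helper `faceted_histograms` are all made up for this example; this is not code from pythonplot.com.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display

import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data: one numeric column and one categorical column.
df = pd.DataFrame(
    {
        "group": ["a"] * 4 + ["b"] * 4,
        "value": [1.0, 2.0, 2.5, 4.0, 3.0, 3.5, 5.0, 6.0],
    }
)


def faceted_histograms(frame, value_col, facet_col, bins=5):
    """Draw one histogram panel per category, sharing both axes."""
    groups = list(frame.groupby(facet_col))
    # Use the full data range so every panel has the same bin edges.
    value_range = (frame[value_col].min(), frame[value_col].max())
    fig, axes = plt.subplots(
        1,
        len(groups),
        sharex=True,
        sharey=True,
        figsize=(4 * len(groups), 3),
        squeeze=False,  # always get a 2-D array of axes, even for one panel
    )
    for ax, (name, sub) in zip(axes[0], groups):
        ax.hist(sub[value_col], bins=bins, range=value_range)
        ax.set_title(f"{facet_col} = {name}")
        ax.set_xlabel(value_col)
    axes[0][0].set_ylabel("count")
    fig.tight_layout()
    return fig, axes[0]


fig, axes = faceted_histograms(df, "value", "group")
```

Calling `fig.savefig("faceted_hist.png")` afterwards would write the figure to disk. Doing the faceting by hand like this takes more lines than a grammar-of-graphics library would need, which is part of why a side-by-side comparison of libraries is useful.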

Using Twitter Data to Gain Insights into E-cigarette Marketing and Locations of Use

While at RTI International, I worked on an exploratory analysis of Twitter discussion of electronic cigarettes. A paper on our work was just published in the Journal of Medical Internet Research: Using Twitter Data to Gain Insights into E-cigarette Marketing and Locations of Use: An Infoveillance Study.¹

Marketing and use of electronic cigarettes (e-cigarettes) and other electronic nicotine delivery devices have increased exponentially in recent years fueled, in part, by marketing and word-of-mouth communications via social media platforms, such as Twitter. … We identified approximately 1.7 million tweets about e-cigarettes between 2008 and 2013, with the majority of these tweets being advertising (93.43%, 1,559,508/1,669,123). Tweets about e-cigarettes increased more than tenfold between 2009 and 2010, suggesting a rapid increase in the popularity of e-cigarettes and marketing efforts. The Twitter handles tweeting most frequently about e-cigarettes were a mixture of e-cigarette brands, affiliate marketers, and resellers of e-cigarette products. Of the 471 e-cigarette tweets mentioning a specific place, most mentioned e-cigarette use in class (39.1%, 184/471) followed by home/room/bed (12.5%, 59/471), school (12.1%, 57/471), in public (8.7%, 41/471), the bathroom (5.7%, 27/471), and at work (4.5%, 21/471).
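
Each percentage in the abstract is a count over a total: for example, 184 of the 471 place-mentioning tweets for "in class", and 1,559,508 of 1,669,123 tweets for advertising. The short sketch below recomputes them as a sanity check. The counts and percentages are the ones quoted in the abstract; the names `reported` and `percent` are made up for this example.

```python
# Counts and reported percentages as quoted in the abstract:
# label -> (count, total, reported percentage, decimal places reported)
reported = {
    "advertising": (1_559_508, 1_669_123, 93.43, 2),
    "in class": (184, 471, 39.1, 1),
    "home/room/bed": (59, 471, 12.5, 1),
    "school": (57, 471, 12.1, 1),
    "in public": (41, 471, 8.7, 1),
    "bathroom": (27, 471, 5.7, 1),
    "at work": (21, 471, 4.5, 1),
}


def percent(count, total, digits):
    """Return count/total as a percentage rounded to `digits` places."""
    return round(100 * count / total, digits)


# Collect any label whose recomputed percentage differs from the reported one.
mismatches = [
    label
    for label, (count, total, pct, digits) in reported.items()
    if percent(count, total, digits) != pct
]
print(mismatches)
```

An empty list means every recomputed value agrees with the percentage given in the abstract.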


  1. I have no idea what “Infoveillance” means.

Nonparametric Latent Dirichlet Allocation

Today is my last day at Qadium. Next week, I am joining the data science team at Distil Networks.

I’ve been privileged to work with Eric Jonas on the data microscopes project for the past 8 months. In particular, I contributed the implementation of Nonparametric Latent Dirichlet Allocation.

I published a collection of notes on nonparametric Bayesian methods and Latent Dirichlet Allocation at dp.tdhopper.com. I hope this will be useful to other students and researchers of these methods.

Profile in Computational Imagination

I recently had the honor of being interviewed by Michael Swenson for his interview series called “Profiles in Computational Imagination”. I talked a bit about my current work, my wandering road to data science, and my love for remote work. You can read it here.