I gave a talk at PyData DC 2018 where I tried to articulate some reasons why companies building machine
learning products under-invest in engineering and architecture.
I’m very interested in feedback, pointers to other resources on this topic, and a general
discussion about how to make more effective ML products.
And the video:
I gave a talk at SciPy 2018 loosely based on my Ansible tutorial. Here are my slides:
And the video:
Plotting is an essential component of data analysis. As a data scientist, I spend a significant amount of my time making simple plots to understand complex data sets (exploratory data analysis) and help others understand them (presentations).
In particular, I make a lot of bar charts (including histograms), line plots (including time series), scatter plots, and density plots from data in Pandas data frames. I often want to facet these on various categorical variables and layer them on a common grid.
To that end, I made pythonplot.com, a brief introduction to Python plotting libraries and a “Rosetta Stone” comparing how to use them. I also included a comparison to ggplot2, the R plotting library that I and many others consider a gold standard.
I recently gave a talk to the Duke Big Data Initiative entitled Dr. Hopper, or How I Quit My Ph.D. and Learned to Love Data Science. The talk was well received, and my slides seemed to resonate in the Twitter data science community.
I’ve started a long-form blog post with the same message, but it’s not done yet. In the meantime, I wanted to share the slides that went along with the talk.
I created a single page website to collect notes on one of my other hobbies: ultralight backpacking. In particular, notes on ultralight gear for the very tall.
While at RTI International, I worked on an exploratory analysis of Twitter discussion of electronic cigarettes. A paper on our work was just published in the Journal of Medical Internet Research: Using Twitter Data to Gain Insights into E-cigarette Marketing and Locations of Use: An Infoveillance Study.
Marketing and use of electronic cigarettes (e-cigarettes) and other electronic nicotine delivery devices have increased exponentially in recent years fueled, in part, by marketing and word-of-mouth communications via social media platforms, such as Twitter. … We identified approximately 1.7 million tweets about e-cigarettes between 2008 and 2013, with the majority of these tweets being advertising (93.43%, 1,559,508/1,669,123). Tweets about e-cigarettes increased more than tenfold between 2009 and 2010, suggesting a rapid increase in the popularity of e-cigarettes and marketing efforts. The Twitter handles tweeting most frequently about e-cigarettes were a mixture of e-cigarette brands, affiliate marketers, and resellers of e-cigarette products. Of the 471 e-cigarette tweets mentioning a specific place, most mentioned e-cigarette use in class (39.1%, 184/471) followed by home/room/bed (12.5%, 59/471), school (12.1%, 57/471), in public (8.7%, 41/471), the bathroom (5.7%, 27/471), and at work (4.5%, 21/471).
Today is my last day at Qadium. Next week, I am joining the data science team at Distil Networks.
I’ve been privileged to work with Eric Jonas on the data microscopes project for the past 8 months. In particular, I contributed the implementation of Nonparametric Latent Dirichlet Allocation.
I published a collection of notes on nonparametric Bayesian methods and Latent Dirichlet Allocation at dp.tdhopper.com. I hope this will be useful to other students and researchers of these methods.
I have published some notes on the Dirichlet distribution, Dirichlet processes, Gibbs sampling for mixture models and nonparametric mixture models, and the Gibbs sampler for nonparametric Latent Dirichlet Allocation.
This is related to my work on a Python implementation of Hierarchical Dirichlet Process Latent Dirichlet Allocation.
I recently had the honor of being interviewed by Michael Swenson for his interview series called “Profiles in Computational Imagination”. I talked a bit about my current work, my wandering road to data science, and my love for remote work. You can read it here.
I gave a talk at the Research Triangle Analysts meetup about PySpark. It wasn’t recorded, but you can see the IPython notebook I presented from.