Technical

Python Plotting for Exploratory Data Analysis

Mon Jun 26, 2017 by Tim Hopper in read, technical

Plotting is an essential component of data analysis. As a data scientist, I spend a significant amount of my time making simple plots to understand complex data sets (exploratory data analysis) and help others understand them (presentations).

In particular, I make a lot of bar charts (including histograms), line plots (including time series), scatter plots, and density plots from data in Pandas data frames. I often want to facet these on various categorical variables and layer them on a common grid.

To that end, I made pythonplot.com, a brief introduction to Python plotting libraries and a “rosetta stone” comparing how to use them. I also included a comparison with ggplot2, the R plotting library that I and many others consider a gold standard.

Build a Real Time Machine Learning System

Wed May 17, 2017 by Tim Hopper in presentation, technical, watch

I gave a talk at the Data Science Conference on building a real-time machine learning system with Kafka, Streamparse, and Storm. You can see the video on YouTube.

Understanding Probabilistic Topic Models By Simulation

Tue Oct 25, 2016 by Tim Hopper in watch, technical, presentation

I gave a talk last week at Research Triangle Analysts on understanding probabilistic topic models (specifically LDA) by using Python for simulation. Here’s the description:

Latent Dirichlet Allocation and related topic models are often presented in the form of complicated equations and confusing diagrams. Tim Hopper presents LDA as a generative model through probabilistic simulation in simple Python. Simulation will help data scientists to understand the model assumptions and limitations and more effectively use black box LDA implementations.

You can watch the video on YouTube.

I gave a shorter version of the talk at PyData NYC 2015.
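
To give a flavor of what “LDA as a generative model” means, here is a minimal sketch of the generative process in plain Python. It is not the code from the talk. The vocabulary, the hyperparameter values, and the function names (`sample_dirichlet`, `sample_categorical`, `generate_corpus`) are all made up for illustration. Each topic is a distribution over words, each document is a mixture over topics, and each word is drawn by first picking a topic and then picking a word from that topic.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(probs, rng):
    """Return an index drawn with the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_corpus(vocab, n_topics, n_docs, doc_length, alpha=0.5, beta=0.5, seed=0):
    """Simulate documents from the LDA generative story (illustrative values only)."""
    rng = random.Random(seed)
    # One distribution over the vocabulary per topic.
    topics = [sample_dirichlet([beta] * len(vocab), rng) for _ in range(n_topics)]
    corpus = []
    for _ in range(n_docs):
        # Each document gets its own mixture over topics.
        theta = sample_dirichlet([alpha] * n_topics, rng)
        words = []
        for _ in range(doc_length):
            z = sample_categorical(theta, rng)      # pick a topic for this word
            w = sample_categorical(topics[z], rng)  # pick a word from that topic
            words.append(vocab[w])
        corpus.append(words)
    return topics, corpus


# Hypothetical toy vocabulary, chosen only to make the output readable.
vocab = ["cat", "dog", "fish", "stock", "bond", "fund"]
topics, corpus = generate_corpus(vocab, n_topics=2, n_docs=3, doc_length=8)
for doc in corpus:
    print(" ".join(doc))
```

Fitting LDA runs this story in reverse: given only the documents, it infers plausible topics and per-document mixtures.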

Using Twitter Data to Gain Insights into E-cigarette Marketing and Locations of Use

Fri Nov 06, 2015 by Tim Hopper in read, technical

When I worked at RTI International, I worked on an exploratory analysis of Twitter discussion of electronic cigarettes. A paper on our work was just published in the Journal of Medical Internet Research: Using Twitter Data to Gain Insights into E-cigarette Marketing and Locations of Use: An Infoveillance Study.¹

Marketing and use of electronic cigarettes (e-cigarettes) and other electronic nicotine delivery devices have increased exponentially in recent years fueled, in part, by marketing and word-of-mouth communications via social media platforms, such as Twitter. … We identified approximately 1.7 million tweets about e-cigarettes between 2008 and 2013, with the majority of these tweets being advertising (93.43%, 1,559,508/1,669,123). Tweets about e-cigarettes increased more than tenfold between 2009 and 2010, suggesting a rapid increase in the popularity of e-cigarettes and marketing efforts. The Twitter handles tweeting most frequently about e-cigarettes were a mixture of e-cigarette brands, affiliate marketers, and resellers of e-cigarette products. Of the 471 e-cigarette tweets mentioning a specific place, most mentioned e-cigarette use in class (39.1%, 184/471) followed by home/room/bed (12.5%, 59/471), school (12.1%, 57/471), in public (8.7%, 41/471), the bathroom (5.7%, 27/471), and at work (4.5%, 21/471).

¹ I have no idea what “Infoveillance” means.

Nonparametric Latent Dirichlet Allocation

Fri Oct 16, 2015 by Tim Hopper in read, technical

Today is my last day at Qadium. Next week, I am joining the data science team at Distil Networks.

I’ve been privileged to work with Eric Jonas on the data microscopes project for the past 8 months. In particular, I contributed the implementation of Nonparametric Latent Dirichlet Allocation.

I published a collection of notes on nonparametric Bayesian methods and Latent Dirichlet Allocation at dp.tdhopper.com. I hope this will be useful to other students and researchers of these methods.

Notes on Dirichlet Processes

Fri Oct 16, 2015 by Tim Hopper in read, technical

I have published some notes on the Dirichlet distribution, Dirichlet processes, Gibbs sampling for mixture models and nonparametric mixture models, and the Gibbs sampler for nonparametric Latent Dirichlet Allocation.

This is related to my work on a Python implementation of Hierarchical Dirichlet Process Latent Dirichlet Allocation.
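
As a small taste of the subject, here is a sketch of the stick-breaking construction, one standard way to describe the mixture weights of a Dirichlet process. It is not taken from the notes. The function name `stick_breaking_weights`, the truncation at a fixed number of sticks, and the concentration value are assumptions made for illustration. Each step breaks off a Beta(1, alpha) fraction of whatever stick remains, so the weights shrink and, together with the leftover piece, sum to one.

```python
import random


def stick_breaking_weights(alpha, n_sticks, seed=0):
    """Return the first n_sticks stick-breaking weights and the unbroken remainder."""
    rng = random.Random(seed)
    weights = []
    remaining = 1.0
    for _ in range(n_sticks):
        # Break off a Beta(1, alpha) fraction of the stick that is left.
        fraction = rng.betavariate(1.0, alpha)
        weights.append(remaining * fraction)
        remaining *= 1.0 - fraction
    return weights, remaining


# Illustrative settings: concentration alpha=2.0, truncated at 10 sticks.
weights, leftover = stick_breaking_weights(alpha=2.0, n_sticks=10)
print([round(w, 3) for w in weights], round(leftover, 3))
```

A smaller `alpha` tends to put most of the mass on the first few weights. A larger `alpha` spreads it over many components.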

Introduction to PySpark

Sat Feb 28, 2015 by Tim Hopper in read, technical, presentation

I gave a talk at the Research Triangle Analysts meetup about PySpark. It wasn’t recorded, but you can see the IPython notebook I presented from.

ShouldIGetAPhD.com

Mon Dec 08, 2014 by Tim Hopper in read, technical

Last year, I published nine interviews with Internet friends about whether an academically minded, 22-year-old college senior should work on a Ph.D. Many people have told me the interviews have been helpful for them or that they’ve emailed them to others.

I decided to make a dedicated website to host the interviews. You can find it at shouldigetaphd.com.

I hope this continues to be a valuable resource. I’d encourage you to share this with anyone you know who is thinking through this question.

Introduction to Scikit-Learn

Mon Jan 21, 2013 by Tim Hopper in presentation, watch, technical

I gave a talk at a recent Research Triangle Analysts meetup on Scikit-learn, the excellent machine learning library for Python. The talk wasn’t recorded, but you can see the IPython notebook that I presented from.

Pickle and Redis

Mon Oct 22, 2012 by Tim Hopper in read, presentation, technical

I gave a talk at PyCarolinas 2012 about using Pickle and Redis to persist data with Python. It wasn’t recorded, but you can see the IPython notebook I presented from.
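
The basic idea can be sketched in a few lines: serialize a Python object to bytes with `pickle` and store those bytes under a key. This is not the code from the notebook. To keep the example runnable without a Redis server, it uses a hypothetical `InMemoryStore` class with `set` and `get` methods standing in for a Redis client. The helper names `save_object` and `load_object` and the key are also made up.

```python
import pickle


class InMemoryStore:
    """Hypothetical stand-in for a Redis client: stores bytes under string keys."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        # Like a key-value store, return None when the key is missing.
        return self._data.get(key)


def save_object(store, key, obj):
    """Pickle obj to bytes and write it under key."""
    store.set(key, pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))


def load_object(store, key):
    """Read bytes from key and unpickle them, or return None if the key is absent."""
    raw = store.get(key)
    return None if raw is None else pickle.loads(raw)


store = InMemoryStore()
record = {"name": "model-v1", "weights": [0.1, 0.2, 0.7]}  # illustrative data
save_object(store, "example:record", record)
restored = load_object(store, "example:record")
print(restored == record)
```

With the redis-py package, a `redis.Redis` client also exposes `set` and `get` and accepts bytes values, so it should be able to take the place of the stand-in. Only unpickle data you trust, since `pickle.loads` can execute arbitrary code.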