Priceonomics has put out a data analysis challenge as part of an effort to promote and recruit for the freelance portion of the Priceonomics Data Studio. The challenge revolves around a fictional company called “TreefortBnB”, which has released a dataset of its treehouse listings. It seems to me like it could be pretty cool to freelance for Priceonomics, as I have enjoyed reading their blog posts in the past, so I decided to solve the puzzle.
The problem and my motivation
The basic goal is to determine the median price of a booking for the hundred most represented cities by number of listings. The technical portion of the application for the freelance gig is restricted to the specific computation mentioned above, but it seemed to me like a somewhat more extended analysis and exposition could make for a nice blog post (an opportunity not to be wasted). So in what follows I’ll show how I would solve the basic challenge using Python’s `pandas` library. If you haven’t seen `pandas` in action before, this should give a good idea of how it can be useful for solving simple analysis problems. After solving the basic problem I’ll show how I would go about poking around the dataset to learn more about TreefortBnB and show off some more `pandas` functionality.
The basic solution
If you want to follow along using Jupyter you can find all the code here. Using IPython I downloaded the data and read it into a `pandas` dataframe (it is conventional to import the module as `pd`) with
!curl https://s3.amazonaws.com/pix-media/Data+for+TreefortBnB+Puzzle.csv > Puzzle.csv
import pandas as pd  # the conventional alias
df = pd.read_csv("./Puzzle.csv")
We can inspect the first few rows by calling `head`
df.head(5)
which in a Jupyter notebook will render nicely as an HTML table:
 | Unique id | City | State | $ Price | # of Reviews
---|---|---|---|---|---
0 | 1 | Portland | OR | 75 | 5
1 | 2 | San Diego | CA | 95 | 3
2 | 3 | New York | NY | 149 | 37
3 | 4 | Los Angeles | CA | 199 | 45
4 | 5 | Denver | CO | 56 | 99
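Before leaning on the “Unique id” column, it is cheap to verify that it lives up to its name (a one-line check using `pandas`’ `is_unique` property):

assert df["Unique id"].is_unique  # raises if any id is repeated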
So we have columns corresponding to a unique id (which, per the check above, really is unique), a city, a state, a price, and the total number of reviews for each listing. Having glimpsed the basic layout of the data, one can plan out a sequence of operations to convert the full table into the desired summary. Here is one way to do this:
basicsummary = df.assign(City = df.City.str.title()) \
.groupby(["City","State"])["$ Price"] \
.describe().unstack().sort_values("count").tail(100)["50%"] \
.sort_values(ascending=False).to_frame() \
.rename(columns = {"50%":"median price ($)"})
basicsummary.head(5)
City | State | median price ($)
---|---|---
Indianapolis | IN | 650
Malibu | CA | 304
Park City | UT | 299
Truckee | NV | 275
Healdsburg | CA | 275
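As a quick sanity check, the summary should contain exactly the hundred cities we asked for:

assert len(basicsummary) == 100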
That first chunk of Python deserves a bit of explanation, which I am happy to provide.
- The first line converts all the city names to title case. This is necessary because the original column contains rows in both title and lower case (e.g. both “New York” and “new york”).
- The second line creates a grouping data structure based on the city and state and pulls out the price in each group (the other columns aren’t needed in this exercise). The grouping is based on both cities and states to avoid lumping together distinct cities that happen to share a name (e.g. both Carmel, California and Carmel, Indiana are present in this dataset).
- The third line is perhaps the most obscure. In it I make use of `pandas`’ useful `describe` method, which returns several statistics about a column, including the total number of entries (as `count`) and the median (as `50%`). The `unstack` method is used to get these statistics into columns, from which point we can pull out the hundred most numerous cities and extract the median price.
- The fourth line sorts the resulting `Series` data structure by price and converts the result to a dataframe.
- The last line simply renames the “50%” column to “median price ($)”. (The same pipeline is spelled out step by step just below.)
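For readers who find the chain hard to parse, here is the same computation unrolled into named steps (a sketch; like the chained version, it assumes a `pandas` vintage in which `describe` on a grouped column returns a stacked `Series`):

titled = df.assign(City=df.City.str.title())           # normalize city-name case
prices = titled.groupby(["City", "State"])["$ Price"]  # one group per (city, state)
stats = prices.describe().unstack()                    # count, mean, 50%, ... as columns
top100 = stats.sort_values("count").tail(100)          # hundred best-represented cities
medians = top100["50%"].sort_values(ascending=False)   # median price, highest first
basicsummary = medians.to_frame().rename(columns={"50%": "median price ($)"})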
These five lines (counted as such by way of the ad-hoc continuation breaks) are not necessarily the most Pythonic, but they more or less reflect how I think about the steps in this exercise. I oftentimes find myself chaining method calls like this when working with `pandas` interactively in Jupyter or vanilla IPython, as it allows me to build up a single operation from many while getting the output by evaluating the statement in one go.
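As an aside, a common alternative to backslash continuations is to wrap the whole chain in parentheses, which makes the same pipeline a little easier to edit line by line:

basicsummary = (
    df.assign(City=df.City.str.title())
      .groupby(["City", "State"])["$ Price"]
      .describe().unstack()
      .sort_values("count").tail(100)["50%"]
      .sort_values(ascending=False).to_frame()
      .rename(columns={"50%": "median price ($)"})
)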
An exploratory analysis and improved solution
The above exercise focused on median prices. This is perhaps a reasonable metric for getting a sense of the typical booking one might expect to make, since it is robust to possible extreme outliers in the data. One of the nice things about using `pandas` interactively, however, is that we don’t have to guess when we can see! As was already mentioned, the `describe` method returns a bunch of statistics which can be useful in developing a rough idea of the distribution of data in a column. We are interested in the distribution of prices in particular:
df["$ Price"].describe().to_frame() #The call to `to_frame` is a bit superfluous (I did it here to get a nice HTML table).
 | $ Price
---|---
count | 42802.000000
mean | 187.117074
std | 263.002109
min | 10.000000
25% | 90.000000
50% | 130.000000
75% | 200.000000
max | 10000.000000
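The “coefficient of variation” discussed next takes one line to compute from this column:

prices = df["$ Price"]
print(prices.std() / prices.mean())  # about 1.4, per the table above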
The ratio of the standard deviation to the mean, also known as the coefficient of variation, comes out to about 1.4 here, which gives a sense that the distribution is somewhat dispersed to the right (prices do not take on negative values). Furthermore we see what looks like an absurd maximum booking price of $10000. Perhaps this is Paris Hilton’s gold and sapphire plated Redwood penthouse? We can get a better idea of what’s going on by looking at a graphical representation of the distribution:
df["$ Price"].hist(bins = 64,log=True)
We find that there are indeed a few listings priced at $10000 (note that the y-axis is on a logarithmic scale), and that they appear to be well separated from the rest of the data. One’s imagination can run wild about the possible sorts of things this could mean about these bookings. However, it is useful to constrain one’s imagination by taking a look at the data. In `pandas` it is easy to select rows by criteria on the columns:
df[df["$ Price"] > 5000]
 | Unique id | City | State | $ Price | # of Reviews
---|---|---|---|---|---
2819 | 2820 | San Francisco | CA | 10000 | 31
5135 | 5136 | San Francisco | CA | 6000 | 11
8328 | 8329 | Chicago | IL | 6500 | 1
8638 | 8639 | Indianapolis | IN | 5500 | 0
8867 | 8868 | Park City | UT | 10000 | 1
10332 | 10333 | Park City | UT | 6500 | 0
13431 | 13432 | Queens | NY | 6500 | 0
17195 | 17196 | Boston | MA | 5225 | 0
20279 | 20280 | New York | NY | 5600 | 0
23817 | 23818 | Park City | UT | 5500 | 0
26232 | 26233 | Park City | UT | 10000 | 0
28626 | 28627 | Miami Beach | FL | 10000 | 0
35136 | 35137 | Miami Beach | FL | 6000 | 0
38261 | 38262 | Miami Beach | FL | 5750 | 0
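One can count the zero-review rows in this slice directly:

print((df[df["$ Price"] > 5000]["# of Reviews"] == 0).sum())  # 10 of the 14 listings above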
By looking at additional columns of these expensive listings we get a clue as to what might be going on: most of them have no reviews. While the challenge didn’t offer any specific hints as to what one was to learn by computing median prices, the spirit of it suggests that the goal was to provide insight into a “typical” TreefortBnB experience. It stands to reason that listings without reviews have possibly never been booked, so even though they are in the dataset they may not be relevant to the larger questions we are interested in. In contrast, one would imagine that listings with many reviews are at least significant to the TreefortBnB user experience, inasmuch as users have written about them. So what does any of this have to do with prices? We can get a sense for this by doing:
df.plot(x="$ Price",y ="# of Reviews",kind='scatter',
logx=True,xlim = [9,11000],ylim = [-10,110])
Notice how there is a clustering of points with zero or one hundred reviews? The points in the zero cluster extend significantly into the high-price regime. It is hard to tell from the plot exactly how important these zero-review points are to our overall problem, but they at least suggest we should look further into it. To start, we can take a look at the same type of scatter plot restricted to the bookings in Indianapolis, which had the highest median price in our initial analysis:
df[df.City == "Indianapolis"].plot(x="$ Price",y ="# of Reviews",kind='scatter',
logx=True,xlim = [9,11000],ylim = [-10,110])
Yikes! It looks like Indianapolis has a bunch of listings with no reviews, and they are probably a big part of why it came out on top in our naive median price analysis. We can get a quantitative view of the larger picture by separating the reviewed from the unreviewed and taking a peek at some statistics:
is_reviewed = df["# of Reviews"] > 0
df.groupby(is_reviewed)["$ Price"].describe().to_frame().unstack()
is_reviewed | count | mean | std | min | 25% | 50% | 75% | max
---|---|---|---|---|---|---|---|---
False | 16000 | 238.025500 | 365.393895 | 10 | 100 | 150 | 250 | 10000
True | 26802 | 156.726252 | 168.202838 | 10 | 85 | 125 | 185 | 10000
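The unreviewed share is easy to compute directly:

print((df["# of Reviews"] == 0).mean())  # 16000 / 42802, about 0.37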
We see that about 40% of the total listings are unreviewed, with their prices shifted to the right (in terms of both mean and median). Furthermore there are some rather suspicious-looking quantiles for these unreviewed listings ($100, $150, $250… what are the odds?!). All of this seems to add credibility to the initial idea that filtering out listings that haven’t been reviewed might give a better picture of what to expect when booking.
With this in mind I set out for a better version of the basic result above. I also thought it could be useful to group locations with the same median price into the same row with a slightly different formatting, which could then be indexed by a price rank. This led me to the following lines of code:
refinedsummary = df.get(df["# of Reviews"] > 0) \
.assign(Location = df.City.str.title()+df.State.apply(lambda s: ' ('+s+')')) \
.groupby("Location")["$ Price"].describe().unstack() \
.sort_values("count").tail(100) \
.reset_index().groupby("50%")["Location"].apply(lambda x: "%s" % ' / '.join(x)) \
.sort_index(ascending=False).to_frame() \
.reset_index().rename(columns = {"50%":"median price ($)"})
refinedsummary.set_index(refinedsummary.index+1,inplace = True)
refinedsummary.index.rename("rank",inplace = True)
refinedsummary.head(15)
rank | median price ($) | Location
---|---|---
1 | 300.0 | Carmel (CA)
2 | 225.0 | Healdsburg (CA) / Malibu (CA)
3 | 200.0 | Incline Village (NV) / Laguna Beach (CA) / Truckee (NV)
4 | 199.0 | Hermosa Beach (CA)
5 | 184.5 | Napa (CA)
6 | 179.0 | Park City (UT)
7 | 177.5 | Sunny Isles Beach (FL)
8 | 175.0 | Sonoma (CA)
9 | 165.0 | New York (NY)
10 | 162.0 | Manhattan Beach (CA)
11 | 157.5 | La Jolla (CA)
12 | 150.0 | Sausalito (CA) / Beverly Hills (CA) / Newport Beach (CA) / Venice (CA) / Boston (MA) / Austin (TX) / San Francisco (CA)
13 | 149.0 | Marina Del Rey (CA)
14 | 148.0 | Mill Valley (CA)
15 | 130.0 | Santa Monica (CA) / Miami Beach (FL)
With this we see that Indianapolis has been pushed out of the top ranks, which are now occupied by somewhat more sensible places for expensive treehouses (to the extent that such a thing exists).
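If you are curious exactly where Indianapolis landed, you can look it up by matching against the `Location` labels built above:

refinedsummary[refinedsummary.Location.str.contains("Indianapolis")]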
There are, of course, other aspects of the data we might want to consider. For example, where should you head if you want the widest selection of treeforts? By now, you may have an idea of how I would go about answering this question…
nlistings = df.get(df["# of Reviews"] > 0) \
.groupby(["City","State"]).count()["$ Price"] \
.sort_values(ascending = False).to_frame() \
.rename(columns = {"$ Price":"total listings"})
nlistings.head(10)
City | State | total listings
---|---|---
New York | NY | 5597
Brooklyn | NY | 3017
San Francisco | CA | 2455
Los Angeles | CA | 1941
Austin | TX | 1495
Chicago | IL | 875
Washington | DC | 836
Miami Beach | FL | 790
Portland | OR | 586
Seattle | WA | 573
At first glance (say, for the first ten seconds) this looks reasonable, although one might wonder why Brooklyn is listed separately rather than being counted as part of New York. But let’s assume that “New York” means every borough except Brooklyn. The data show that there are 3017 reviewed listings in Brooklyn. Are there over three thousand treehouses in Brooklyn? Granted, there are quite a few trees. But treehouses are sufficiently rare that you might make it into the New York Times for building one there. Still, exactly how many is not obvious but…
WARNING!!! YOU ARE ENTERING THE FERMI ESTIMATION ZONE. HOLD ON TO YOUR POWERS OF TWO AND TEN!
… to get things started let’s say that there are about as many trees as there are people. So, a couple million. Now, one expects that a great deal, say 90%, of these trees are going to be found in public parks, where one is likely to have a hard time constructing a treefort, to say nothing of renting it out. We still have hundreds of thousands of trees to work with, though. Now imagine you are walking on a street in Brooklyn and spot a bunch of trees (if you are having a hard time imagining, use Google Street View). Most of the trees that you see line the streets and sidewalks, which also would make for construction difficulties. Let’s be generous, though, and say that 10% of the remaining trees can be found in backyards and other places suitable for treehouse construction. That leaves ten-thousand-ish trees (from tiny saplings to grand oaks) to work with. And yet, the above table states that there are several thousand listings in the same area. Are we to believe that every third tree in Brooklyn has a treehouse on it, and that all of them are for lease through TreefortBnB!?
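(For the spreadsheet-minded, here is the same back-of-the-envelope arithmetic in Python; every number is an assumption from the paragraph above:)

trees = 2_000_000                  # about one tree per Brooklynite
outside_parks = trees * 0.10       # ~90% assumed to be in public parks
backyard = outside_parks * 0.10    # ~10% of the rest in buildable spots
print(backyard)                    # ~20000, i.e. ten-thousand-ish in Fermi terms
# versus ~3000 reviewed Brooklyn listings in the table above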
Of course we don’t believe that, because we know that the data are fake and TreefortBnB does not exist. But I wanted to end with that little Fermi estimation paragraph because it shows how data of this sort usually raise many interesting questions, including questions about the accuracy of the data itself, which is something that freelancers should definitely keep in mind when working with a client’s data.