Decoding data — without using Machine Learning (…well, almost!)

Malav Shah
7 min read · Sep 11, 2020

‘Data is the new oil.’ You’ve probably heard this a thousand times, but it is truer now than ever. Companies and individual businesses collect all kinds of data, ranging from really basic things like ‘How many times do you pick up your phone in a day?’ to things like ‘What do you watch most on Netflix?’. They collect these data points to understand a plethora of things, from what consumers like to how they spend. This helps them target a particular set of customers for their products and increase their revenues (it is all about the $$$).

Many times, these datasets look like just a bunch of information thrown together, but it is important to decode the interesting connections between the data points. This can be particularly helpful if you are looking to start a business but do not know exactly who your target audience is, or if you just want to know something that might help you make a decision, e.g. where to stay or what to buy. Rather than targeting random clusters of people, realizing that they are not your target audience and having to start again, knowing a bit about ‘Who would like what and why?’ will definitely help.

I often come across people who equate decoding data with knowing Machine Learning. I get questions like: ‘Hey, I don’t know Machine Learning, so how do I understand how attribute A connects to attribute B?’ Machine Learning helps, sure, but that shouldn’t deter you from starting to explore data.

My idea behind this article is just that. The only thing you need to start decoding data is a few ‘interesting’ questions that you want answered. Coming up with them is often the hard part, but once you get through it, you have a much better idea of how you can use the data to answer them.

I am going to answer 4 basic questions that help me understand the data better and give me insight into what the data really has to offer. As you will see, all the questions except the last one can be answered without Machine Learning.

For the purpose of this article, I am going to use an Airbnb data set from August 2020 for the city of Austin. Let’s begin!

Q.1 What are the most common amenities offered by Airbnbs in Austin?

All of us want to know what amenities we will get at a particular Airbnb home if we book it. Well, here are some insights. You will more likely than not find these 15 things at the majority of Airbnbs in the city of Austin. Unsurprisingly, most Airbnbs offer Wifi and air conditioning, which is the norm in these modern times. One surprising thing I see in this plot is the ‘friendly workspace’, which seems to be offered by more than half the Airbnbs in Austin (c’mon, who really wants to work when they are on a break? I guess there are workaholics who don’t rest. Is that me? Maybe!). The other amenities are ones you’d expect to see, and the plot confirms exactly that.
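To give a feel for how a plot like this can be built, here is a minimal sketch in plain Python. It assumes each listing stores its amenities as a JSON-style list inside a string in a column called `amenities`; both the column name and the format are my assumptions, and the three rows are toy data, not the Austin data set.

```python
import json
from collections import Counter

# Toy rows standing in for the listings file (hypothetical values).
# Assumption: amenities are stored as a JSON-style list inside a string.
listings = [
    {"id": 1, "amenities": '["Wifi", "Air conditioning", "Kitchen"]'},
    {"id": 2, "amenities": '["Wifi", "Kitchen"]'},
    {"id": 3, "amenities": '["Wifi", "Air conditioning"]'},
]

def amenity_shares(rows):
    """Return (amenity, share of listings offering it), most common first."""
    counts = Counter()
    for row in rows:
        # set() guards against an amenity that is listed twice in one row
        counts.update(set(json.loads(row["amenities"])))
    total = len(rows)
    # Sort by count (descending), then by name so that ties are stable
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(name, n / total) for name, n in ordered]

top = amenity_shares(listings)
for name, share in top[:15]:  # the 15 most common amenities
    print(f"{name}: {share:.0%}")
```

With a real file, the same counting is usually done with a data-frame library, but the idea stays the same: split each listing's amenities, count them, and divide by the number of listings.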

Q.2 Is location a major factor that influences how an Airbnb is priced?

Map of Austin with prices

The yellow dots in the plot show the Airbnbs priced above USD 150, whereas the purple dots show those that cost less than USD 150. Except in a few locations, the two colors overlap heavily, which suggests that although location is a factor, it is certainly not the most important one for the price of an Airbnb. (Surprising, right? We always expect areas like downtown to have only the most expensive Airbnbs. This is exactly why data is so important these days.) We will be able to confirm this when we predict prices for the last question (the Machine Learning that I mentioned in the title).
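The two colors come from nothing more than a price cut-off. A rough sketch of that step, assuming the price is stored as text such as `$1,250.00` and the coordinates live in `latitude` and `longitude` columns (the column names, the text format and the four toy rows are assumptions for illustration):

```python
THRESHOLD = 150.0  # USD, the cut-off used for the map

def parse_price(text):
    """Turn a price string such as '$1,250.00' into a float."""
    return float(text.replace("$", "").replace(",", ""))

# Toy rows (hypothetical values, not taken from the data set).
listings = [
    {"latitude": 30.27, "longitude": -97.74, "price": "$220.00"},
    {"latitude": 30.27, "longitude": -97.75, "price": "$95.00"},
    {"latitude": 30.35, "longitude": -97.70, "price": "$1,250.00"},
    {"latitude": 30.35, "longitude": -97.71, "price": "$60.00"},
]

def split_by_price(rows, threshold=THRESHOLD):
    """Group (longitude, latitude) points by whether the price is above the cut-off."""
    groups = {"above": [], "below": []}
    for row in rows:
        # A price of exactly 150 lands in "below"; that tie-break is my choice.
        key = "above" if parse_price(row["price"]) > threshold else "below"
        groups[key].append((row["longitude"], row["latitude"]))
    return groups

groups = split_by_price(listings)
print(len(groups["above"]), "above,", len(groups["below"]), "below")
```

Each group can then be handed to a scatter plot with its own color. If points from both groups land on top of each other, as they do in the toy rows, price is not cleanly separated by location.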

Q.3 Do people staying at superhosts’ homes actually have better experiences than those who don’t?

To answer this question, it is important to gauge people’s sentiment toward various things. As can be seen from the image on the side, superhosts excel on all the things that make a guest’s experience better (acceptance rate, response rate etc.), and this is backed up by the higher ratings that superhosts get from guests after their stay. The average rating takes into account everything from cleanliness to value for money. Hence, for the most part, people staying at superhosts’ homes do have better experiences, and the superhosts rightly earn that badge.
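A comparison like this boils down to splitting the listings by the superhost flag and averaging each metric per group. Here is a small sketch, assuming the flag is stored as `"t"` or `"f"` in a `host_is_superhost` column and that missing values appear as `None`; the column names and the numbers are made up for illustration and are not the results from the image.

```python
from collections import defaultdict
from statistics import mean

# Toy rows (hypothetical values, not the article's results).
listings = [
    {"host_is_superhost": "t", "review_scores_rating": 98, "host_response_rate": 100},
    {"host_is_superhost": "t", "review_scores_rating": 96, "host_response_rate": 98},
    {"host_is_superhost": "f", "review_scores_rating": 92, "host_response_rate": 90},
    {"host_is_superhost": "f", "review_scores_rating": None, "host_response_rate": 80},
]

def group_means(rows, flag, metrics):
    """Average each metric separately for every value of the flag column."""
    buckets = defaultdict(lambda: defaultdict(list))
    for row in rows:
        for metric in metrics:
            if row[metric] is not None:  # skip missing values
                buckets[row[flag]][metric].append(row[metric])
    return {
        group: {metric: mean(values) for metric, values in per_metric.items()}
        for group, per_metric in buckets.items()
    }

summary = group_means(
    listings, "host_is_superhost", ["review_scores_rating", "host_response_rate"]
)
for group, averages in summary.items():
    print(group, averages)
```

Skipping missing values instead of treating them as zero matters here, because a listing without reviews would otherwise drag its group's average down.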

Q.4 Can we predict whether an Airbnb is expensive (> USD 150) or not, and what are the most important factors that affect its price? (Sorry, one ML question is a must! :P)

This question requires machine learning, so that we can predict the price class of an Airbnb with particular attributes. Hence we train a classification model (basically, a model for problems where we need to know whether something is ‘this or that, here or there’ etc.). In this particular case, we want to know whether the Airbnb will be expensive or cheap. Backtracking just a bit: you might ask, ‘What is a model?’ It is basically something that learns from the data itself and tries to identify patterns and relationships between the different data points.

Before showing you the next image, you need to know two basic things: train data and test data. Train data is the data that the model has seen, i.e. the data from which the model learns to predict. Test data is the data used to check how the model actually performs on unseen data (which is the situation an ML model faces when it is deployed in real life).

Okay, that’s a lot of weird numbers. What do they mean? For the scope of this Medium article, ignore all the numbers except the two accuracy values at the top. The first one means the model correctly predicts whether an Airbnb is expensive or not 90.4% of the time on data it has already seen, while the second means it does so 87% of the time on data it has not seen. This is encouraging, as it means the model is able to generalize and predict well even on unseen data, which is the ultimate goal. There are a variety of other measures, like the ROC curve, used to judge how a model performs, but those technical details are for another in-depth Medium article about how to compare different models.
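To make the train/test idea concrete, here is a tiny sketch in plain Python. The "model" is a deliberately simple stand-in that I picked for illustration (a single cut-off on the number of bedrooms); it is not the model behind the numbers above, and the ten rows are toy data.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and hold out a fraction as test data."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]

def accuracy(rows, cutoff):
    """Share of rows where 'bedrooms >= cutoff' matches the true label."""
    hits = sum((beds >= cutoff) == expensive for beds, expensive in rows)
    return hits / len(rows)

def fit_cutoff(train):
    """Stand-in model: pick the bedroom cut-off that fits the train data best."""
    candidates = sorted({beds for beds, _ in train})
    return max(candidates, key=lambda c: accuracy(train, c))

# Toy data: (number of bedrooms, is the listing above USD 150?)
data = [(1, False), (1, False), (2, False), (2, True), (3, True),
        (3, True), (4, True), (1, False), (2, False), (5, True)]

train, test = train_test_split(data)
cutoff = fit_cutoff(train)            # learned from train data only
train_acc = accuracy(train, cutoff)   # accuracy on data the model has seen
test_acc = accuracy(test, cutoff)     # accuracy on data it has not seen
print(f"train accuracy: {train_acc:.1%}, test accuracy: {test_acc:.1%}")
```

The key point is that the cut-off is chosen using the train rows only, so the test accuracy is an honest estimate of how the rule behaves on listings it never saw.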

For part 2 of the question,

We get this using the parameters of the model that was just trained. The plot says that the number of bedrooms, the minimum number of nights a person can stay, and how accommodating the host is of schedule and other changes a customer might need are a few of the most important factors that contribute to the price. As we saw in Q.2, although location is important, it is certainly not among the top 5 factors. Other things that are unsurprisingly important include the type of property, number of reviews, number of amenities, availability etc.
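Ranking factors like this usually means taking one number per feature from the trained model (for example a coefficient or an importance score; which one depends on the kind of model) and sorting by its size. A minimal sketch, with feature names and weights that are made up purely for illustration and are not the numbers behind the plot:

```python
def rank_factors(weights, top_n=5):
    """Sort factor names by the absolute size of their weight, largest first."""
    ranked = sorted(weights.items(), key=lambda kv: (-abs(kv[1]), kv[0]))
    return ranked[:top_n]

# Hypothetical per-feature weights; NOT taken from the trained model.
weights = {
    "bedrooms": 0.30,
    "minimum_nights": -0.22,
    "number_of_reviews": 0.08,
    "amenity_count": 0.07,
    "availability_365": 0.06,
    "latitude": 0.05,
}

top = rank_factors(weights, top_n=3)
for name, weight in top:
    print(f"{name}: {weight:+.2f}")
```

Sorting by absolute value matters when the weights can be negative: a factor that pushes the price down strongly is just as important as one that pushes it up.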

Conclusion:

These were some of the questions that you could answer for a given data set to get more insight and make sense of it. The point of this article was also to show that not every insight requires machine learning (every analysis except Q.4 was done without Machine Learning, using only basic Python libraries). Hence, not knowing machine learning should not deter you from trying to decode data and get valuable insights from it.

Hope you enjoyed decoding some data with me!

