Introduction to Kaggle and Your First ML Competition
In this blog post, I would like to introduce you to Kaggle and walk you through a basic Kaggle competition. If you are a beginner in machine learning and want to validate your skills and build new ones, Google’s Kaggle is a great environment for that. Kaggle is a platform where you can enter machine learning competitions of varying complexity, try to solve real-world problems and, once you become good at it, even win cash and other prizes. It is also an excellent way to build a name for yourself and grow your portfolio (and to earn the bragging rights that come with winning on Kaggle).
The notebook used in Kaggle can be accessed via this link.
Most people, including me, started on Kaggle with its famous “Titanic Survival Prediction” competition. In this competition, we are given a data set of information about passengers on board the Titanic. In addition to details such as age, sex and ticket class (a rough indicator of social status), we are told whether each passenger survived. This is the training data set. Using it, we need to train a machine learning model that will then predict the survival of passengers in a separate test data set. Our aim, of course, is to get as high a percentage of accurate predictions as possible. This competition is designed to give us an overview of the Kaggle platform and doesn’t include any prizes, but the learning experience is well worth the effort.
OK, with that out of the way, let’s get started.
First of all, remember that this problem is not an easy one to solve, in the sense that the data itself doesn’t contain much signal for a machine to learn from. Let me explain.
We know that women and children were the first to be let onto the lifeboats, so a passenger who was a woman, a child, or both had a good chance of survival. But does that mean all women and children survived? No. Similarly, some upper-class men managed to bully their way onto the lifeboats, so being a man on that ship doesn’t necessarily mean the passenger died. Even for a human looking at a passenger’s information, it’s not easy to predict survival, and it’s just as hard for a machine. I mention this to set your expectations: you are highly unlikely to reach a prediction accuracy higher than 90% on this data.
This is a simple case of binary classification. The output class is either 0 (did not survive) or 1 (survived).
Log into Kaggle, join the competition and get started.
These steps are fairly self-explanatory. You can log in to Kaggle using a variety of login options, and once logged in you should be able to enroll in the Titanic competition. Once enrolled, open the “Code” environment by clicking on “edit”.
Let’s pause here. Before we go any further, I will describe the broad, high-level steps in this exercise. The very first one is to import the data manipulation libraries and the pre-built classification models that you may want to use. There are a good number of pre-built binary classifiers to choose from (thank you, math and computer science PhDs), so we will import a few of them and try them all to see which one works best for us. We will also import some cross-validation tools that are used to tune hyperparameters (I’ll explain this later).
Once you have done that, the next step is to actually download and load the data. This is a fairly simple process: the data-loading functions in the libraries you imported will read the files for you.
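As a rough sketch of the loading step, assuming pandas is the data library and the competition files are CSVs: the snippet below reads a tiny inline sample rather than the real file, so it runs anywhere. The three rows are made up for illustration, and the path in the comment is only where a Kaggle notebook typically exposes the data, so check it against your own notebook.

```python
import io

import pandas as pd

# Tiny made-up sample with a few of the columns found in train.csv.
SAMPLE_CSV = """PassengerId,Survived,Pclass,Sex,Age,Fare
1,0,3,male,22,7.25
2,1,1,female,38,71.28
3,1,3,female,,7.92
"""


def load_passengers(source):
    """Read passenger data from a file path or a file-like object."""
    return pd.read_csv(source)


# In a Kaggle notebook you would likely pass a path such as
# "/kaggle/input/titanic/train.csv" (assumed location; verify it yourself).
train_df = load_passengers(io.StringIO(SAMPLE_CSV))

print(train_df.shape)         # number of rows and columns
print(train_df.isna().sum())  # missing values per column
```

Printing the shape and the missing-value counts right after loading is a cheap way to see how much cleaning the next step will need.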
The next step, and in my opinion the most critical one, is data manipulation and feature engineering. The passenger data provided is messy, incomplete and sometimes redundant for our training. We need to process this data to fill the gaps, drop unnecessary columns and combine related columns to remove redundancies. The quality of this data has a direct impact on the quality of your predictions, so it is worth spending considerable effort to get this right.
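Here is a minimal sketch of what filling gaps, combining columns and dropping columns can look like in pandas. The rows are made up, and the specific choices (median age, a hypothetical FamilySize column, which columns to drop) are my illustrative assumptions rather than the one right recipe.

```python
import pandas as pd

# Made-up rows for illustration; column names follow the Titanic data set.
raw = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, None, 35.0],
    "SibSp": [1, 1, 0, 0],
    "Parch": [0, 0, 0, 2],
    "Cabin": [None, "C85", None, None],
})


def prepare_features(df):
    """Fill gaps, combine related columns, and drop columns we will not use."""
    out = df.copy()
    # Fill gaps: replace missing ages with the median age.
    out["Age"] = out["Age"].fillna(out["Age"].median())
    # Combine related columns: siblings/spouses + parents/children + the passenger.
    out["FamilySize"] = out["SibSp"] + out["Parch"] + 1
    # Encode the text column as a number so a classifier can use it.
    out["Sex"] = out["Sex"].map({"male": 0, "female": 1})
    # Drop columns that are now redundant, mostly empty, or just identifiers.
    return out.drop(columns=["PassengerId", "SibSp", "Parch", "Cabin"])


features = prepare_features(raw)
print(features)
```

Keeping all of this in one function means the exact same processing can later be applied to the test data set.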
The last step is the actual training. Here we use one of the classifiers that we imported earlier. We also use cross-validation to find the right hyperparameters. I said earlier that I would explain hyperparameters, so let me do that now.
Each machine learning model available to us has a number of settings, called hyperparameters, that we choose before training rather than letting the model learn them from the data. These need to be carefully tuned for the model to work best on our problem. Unfortunately, there is no easy way to do this, and no formula that gives the right values directly. Besides, there can be a large number of combinations of parameter values, and it’s not feasible to hand-code and test all of them. This is where “hyperparameter tuning” comes in handy. Using a cross-validation search, we can try different combinations of these parameters automatically and get back the combination with the best results. We can then use these parameters to train our model and to make predictions on the test data set.
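To make the idea concrete, here is a small sketch using scikit-learn’s GridSearchCV. It assumes logistic regression as the classifier and uses synthetic data in place of the prepared Titanic features; the candidate values for C are arbitrary examples, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the prepared features (X) and survival labels (y).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Candidate values to try; these particular values are arbitrary examples.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid=param_grid,
    cv=5,  # 5-fold cross-validation for every combination
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_)           # the winning combination
print(round(search.best_score_, 3))  # its mean cross-validated accuracy

# By default GridSearchCV refits the best combination on all the data,
# so best_estimator_ is ready to make predictions.
model = search.best_estimator_
```

With more than one hyperparameter in the grid, the search tries every combination, so the number of fits grows quickly; keep the grid small to begin with.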
Once the model is trained, we run the test data set through it and submit the results. We do not have the survival data for the test data set, so the only way to see how well the model does is to submit the results.
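A submission for this competition is a CSV file with a PassengerId column and a Survived column. The sketch below builds one with pandas; build_submission is a hypothetical helper name, and the IDs and predictions are made up to stand in for the real test set and the model’s output.

```python
import pandas as pd


def build_submission(passenger_ids, predictions):
    """Pair each test passenger with the model's 0/1 prediction."""
    return pd.DataFrame({
        "PassengerId": list(passenger_ids),
        "Survived": [int(p) for p in predictions],
    })


# Made-up IDs and predictions standing in for the test set and model output.
submission = build_submission([892, 893, 894], [0, 1, 0])

# index=False keeps pandas from writing an extra, unnamed index column.
submission.to_csv("submission.csv", index=False)
print(submission)
```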
No classifier has a distinct advantage over the others on this problem, so let’s try a few of them, each with different hyperparameters. We could also try some ensemble models, but since there isn’t much signal in the data set, there is not a whole lot to be gained by using them. It is still good to learn how to use ensemble classifiers, but for the purposes of this post we are NOT going into them.
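One simple way to try a few classifiers side by side is to score each with the same cross-validation. This is only a sketch: the three classifiers and their settings are my own picks for illustration, and the data is synthetic rather than the Titanic set.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared features and survival labels.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Three non-ensemble classifiers; the settings are arbitrary examples.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=3, random_state=0),
    "k_nearest_neighbors": KNeighborsClassifier(n_neighbors=5),
}

# Mean accuracy over 5 folds for each candidate.
mean_accuracy = {
    name: cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()
    for name, clf in candidates.items()
}

for name, score in sorted(mean_accuracy.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```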
There you go: we have successfully enrolled in a Kaggle competition, trained a model and submitted our results. I hope you learned something from it.
Thank you for your time.