Kaggle is a crowd-sourced platform that attracts, nurtures, trains and challenges data scientists from all around the world to solve data science, machine learning and predictive analysis problems. The experience you get on Kaggle is invaluable in preparing you to understand what goes into finding feasible solutions for big data. Kaggle enables data scientists and other developers to run machine learning contests, write and share code, and host datasets. Kaggle makes the environment competitive by awarding prizes and rankings to winners and participants. It is an especially great place for beginners who are just trying to break into the data science field.
This platform is trusted by some of the largest companies in the world, such as Walmart, Facebook and Winton Capital. On Kaggle, data scientists get exposure and a chance to work on real problems faced by big companies.
To make it more convenient for hosts, Kaggle offers an additional consulting service that can help prepare the data and describe the problem in the best possible format. Hosts receive sole ownership of the winning entry, including all intellectual property, and a royalty-free license to use it any way they want. If you host a competition, participate in the forums regularly once it has launched.
Kaggle gives beginners a chance to take part in solving real world problems. It takes time and consistent practice to master challenging machine learning problems such as image recognition, forecasting, NLP (natural language processing) and sentiment analysis.
This platform uses different scoring mechanisms to rank submissions from different contributors. Consistent practice is the only way to improve and enhance your data science skills. While a competition is ongoing there are discussion boards, and the chosen winner is also interviewed; this gives every contributor a sneak peek into the thought process of knowledgeable and experienced competitors.
A step-by-step action plan will help you navigate this platform: choose a programming language, learn data exploration, start with an easier dataset, begin with the learning competitions, and focus on learning rather than prize money.
Choose a single programming language and stick with it. Two of the most popular programming languages in the Kaggle data community are R and Python. R is the right choice for data analysis, while Python is suitable when you are dealing with statistics code or data integrated with web apps.
The Seaborn library is a popular and highly recommended choice for data exploration among Python users.
Practicing on an easier dataset is recommended because it helps you understand the lay of the land and familiarize yourself with machine learning libraries.
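To make the idea of a scoring mechanism concrete, here is a minimal sketch of three common evaluation metrics and of ranking submissions by one of them. The team names and predictions are made up for illustration, and the metric used by any given competition is stated on its own evaluation page, so treat this as an example rather than Kaggle's actual scoring code.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error, common for regression (lower is better)."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def accuracy(y_true, y_pred):
    """Fraction of labels predicted exactly (higher is better)."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def log_loss(y_true, p_pred, eps=1e-15):
    """Binary log loss; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)


# Hypothetical submissions scored against the same hidden answers.
truth = [1, 0, 1, 1]
submissions = {
    "team_a": [0.9, 0.2, 0.8, 0.6],
    "team_b": [0.6, 0.4, 0.6, 0.6],
}

# Rank from best to worst: for log loss, a smaller score is better.
leaderboard = sorted(submissions, key=lambda name: log_loss(truth, submissions[name]))
```

Because every submission is scored with the same metric on the same hidden answers, the ranking depends only on how the scores compare with each other.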
Start with the “getting started” category of competitions. Kaggle competitions fall into several categories. Featured competitions are posted by government organizations and companies and offer monetary prizes. Research competitions usually offer a small prize or none at all, but are valuable for your resume and career progress. Recruitment competitions are hosted by companies that are looking to hire brilliant data scientists. Getting started competitions are aimed at beginners and provide numerous guiding tutorials and simpler datasets. A research competition is a great choice for a long-term project where you can exercise your data science skills and stimulate your creativity.
Remember to have fun: the sad fact is that a large percentage of Kaggle contributors may never win any competition. A kernel is essentially a short script that shares a solution, explores a concept or showcases a great technique. Forums are a valuable place to ask as many questions as you like. As a beginner you should take on a couple of competitions alone to get the basics right. Close to half of the contributors on Kaggle work individually.
Competitions typically last between two and six months, and contributors are allowed to upload five entries per day (as an individual or as a team). Take an active role in the forums and read the scripts, as this is a good opportunity to learn how other competitors construct features and interpret data. Also, do read blog posts detailing previous competitions. Kaggle has an official blog called No Free Hunch.
Solutions should be new and unique, not copies of work that is available elsewhere.
On Kaggle performance is relative: whatever you come up with will be compared against every other participant and team in the competition. Take your time to thoroughly understand a domain before you even start analyzing the data.
The progression system on Kaggle is specifically designed to cater for different levels of expertise. There are three main categories: Discussions, Kernels and Competitions, each with its own rules of progression and rewards. In each category, there are five performance levels, with the lowest tiers being Novice and then Contributor. Points are temporary as they ‘decay’ over time. The formula for point decay is e^(-t/500), where t is the time elapsed since the points were earned.
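As a rough illustration of the decay factor, the sketch below applies e^(-t/500) to a starting point total. The function name is hypothetical, and measuring t in days is an assumption made for this example; the text above does not state the unit.

```python
import math


def decayed_points(initial_points, t_elapsed, time_constant=500.0):
    """Apply the decay factor e^(-t/500) to a point total.

    t_elapsed is assumed to be in days (an assumption for this sketch).
    """
    return initial_points * math.exp(-t_elapsed / time_constant)


# Time after which points fall to half their original value.
half_life = 500.0 * math.log(2)

fresh = decayed_points(1000, 0)       # no time elapsed, no decay yet
later = decayed_points(1000, 500)     # after one full time constant
```

With this formula the points never reach zero. They fall to about 37% of their original value after 500 units of t, and the half-life is roughly 347 units.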
In Kaggle the kernel is an indispensable tool and the foundation of your work, as it contains the code required for analysis. Every Thursday, the Kaggle team comes together to select the best kernel from the previous fourteen days that uses datasets available on the platform. There are several kernels available to play with, and you can apply various models to find a solution and improve its performance.
You do not necessarily have to participate by creating your own kernel. You can also participate by being an active spectator. Keep up to date by checking out the latest kernels, then comment on and upvote the ones you like. Kaggle has learning resources to help beginners understand what is involved in real-life Kaggle competitions.
Common Kaggle tutorials are Titanic, Digit Recognizer, Bag of Words Meets Bags of Popcorn, Denoising Dirty Documents, San Francisco Crime Classification, Taxi Trajectory Prediction and Facebook Recruiting.
To succeed on Kaggle, start by reading the competition guidelines thoroughly. The second and very crucial step is to understand the performance measures. Step three is to understand the data in detail. Step four is to know your objective. Step five is to set up your own validation environment. Step six is to read the forums. Step seven is to research exhaustively. Step eight is to stay with the basics and apply them rigorously. Step nine is to ensemble models. Step ten is to commit to working on a single project or a select few. Step eleven is to pick the right approach.
Most novices on Kaggle tend to worry excessively about which language to use (R or Python). It is wiser to manually tune the main parameters when experimenting with methods. Experienced Kagglers admit that manual tuning is one of their winning habits.
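Two of the steps above, setting up your own validation environment and ensembling models, can be sketched in a few lines. This is a minimal illustration using only the standard library; the function names and the prediction values are hypothetical, and in practice most competitors would rely on a machine learning library for both.

```python
import random


def k_fold_indices(n_rows, k=5, seed=0):
    """Yield (train_idx, valid_idx) pairs; each row is validated exactly once."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)  # fixed seed keeps the folds reproducible
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        valid = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, valid


def average_ensemble(predictions):
    """Blend several models by averaging their predictions row by row."""
    return [sum(row) / len(row) for row in zip(*predictions)]


# Local validation: score a model on each held-out fold before submitting.
splits = list(k_fold_indices(n_rows=10, k=5))

# Ensembling: made-up predictions from two hypothetical models on three rows.
model_a = [0.2, 0.8, 0.6]
model_b = [0.4, 0.6, 1.0]
blended = average_ensemble([model_a, model_b])
```

Scoring each candidate model on the held-out folds gives an estimate of leaderboard performance without spending one of the daily submissions, and averaging is only the simplest of several ways to combine models.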
In the Kaggle competition titled Don’t Get Kicked, hosted by a chain of dealers known as Carvana, participants were required to predict which cars would go up for sale in a second-hand (pre-owned) auction and which would not be sold.
In the diabetic retinopathy detection competition, hosted by the California Healthcare Foundation, participants were given high-resolution images of the eye and asked to diagnose which images indicated the presence of diabetic retinopathy.
Zillion Pillows (Zillow) is the largest digital inventory and estimation of American homes in the world. With 73 million unique visitors per month, 20 TB of data and 1.2 million statistical and machine learning models that run every night to predict the next Zestimates, it is undoubtedly the best machine learning case study for real estate. Zillow launched its Zillow Prize competition on Kaggle.
Follow the book author for more information at kaggle.com/zusmani