# Using Data Mining Tools & Concepts to Beat the Spread

April 12, 2016  •  Research Paper  •  2,794 Words (12 Pages)


Betting on College Football
Using Data Mining Tools & Concepts to Beat the Spread

Abstract. Spread betting is a popular betting method in many college and professional sports. The bookmaker's goal is to produce a market with roughly half of all bettors on each side of the “line”, the scoring threshold that determines whether a particular bettor wins or loses a particular bet on a given game. The goal of this project is to apply data mining techniques to historical game statistics for college football teams to determine which side of the line to bet for a given game.

Keywords: Data Mining, College Football, Spread Betting, Sports Betting, J48

1   Introduction

Sports betting is a very popular gambling method. In most sporting events, the predicted odds of various opponents winning are not equal. Typically, in a sporting event involving exactly two opponents, the team with the lesser odds of winning is labeled the underdog, while the opponent with the better odds of winning is labeled the favorite. To account for this difference in odds, sports betting typically applies a handicap, known as the point spread, in an attempt to equalize the odds of winning a bet independent of which team the bet is placed on. The term “beating the spread” typically refers to the attempt to win spread bets at a rate that consistently returns a profit. Many individuals have attempted to create simulators and models in an effort to accurately predict the outcome of a spread; few have succeeded in doing so.
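As a concrete illustration of how a point spread settles a bet, consider this minimal sketch (the margin and line values are hypothetical examples, not drawn from the paper's data):

```python
def spread_result(favorite_margin: float, line: float) -> str:
    """Settle a bet on the favorite, given the favorite's final margin of
    victory (negative if the favorite lost) and the point spread (line)."""
    if favorite_margin > line:
        return "favorite covers"   # bets on the favorite win
    if favorite_margin < line:
        return "underdog covers"   # bets on the underdog win
    return "push"                  # tie against the line: stakes refunded

# Hypothetical example: the favorite is laid 6.5 points and wins by 7,
# so it "covers"; winning by only 3 would mean the underdog covers.
print(spread_result(7, 6.5))   # favorite covers
print(spread_result(3, 6.5))   # underdog covers
```

Half-point lines such as 6.5 are common precisely because they make a push impossible, guaranteeing every bet wins or loses.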

Few papers have been published that discuss the application of data mining to sports betting; those that do exist typically discuss the ability to predict the winner of an event but do not specifically consider spread betting. The goal of this project is to investigate the potential usefulness of data mining processes and tools in predicting the outcome of a spread on a future game.

The initial data set will consist of general game information and detailed offensive, defensive and special teams statistics for each team in every game during the 2011 football season. The data set will also contain player injury reports and college football poll rankings for weeks prior to each game.

This project presented a number of obstacles. The biggest challenge, perhaps, was that a substantial portion of the data, including rankings, player injury reports and spread details, was not available in a compact, downloadable format. Manual data collection would not have been feasible due to time constraints; furthermore, manual collection is prone to user error. As such, a significant amount of time was spent developing scripts to programmatically collect the missing data from various websites.

Once the data was collected, another challenge emerged: each game's statistics were paired with the game in which they occurred, as this is how the data is made available on the web. Since the goal of the classifier in this project is to predict the outcome of a future game, same-game statistics could not be used to train the decision tree. Substantial data pre-processing was required to restructure the data so that each game was instead paired only with historical game results and thus no longer associated with same-game statistics. Finally, because of the well-known difficulty of beating the spread, a large number of attributes needed to be collected to achieve higher classification accuracy.
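The restructuring step described above can be sketched as follows: for each game, replace the same-game statistics with averages over each team's earlier games in the season. The record layout and field names here are illustrative assumptions (the paper performed this step with SQL preprocessing rather than Python):

```python
from collections import defaultdict

# Hypothetical per-team game records in chronological order:
# (week, team, points_scored, points_allowed)
games = [
    (1, "A", 28, 14), (1, "B", 10, 24),
    (2, "A", 35, 21), (2, "B", 17, 20),
    (3, "A", 14, 17), (3, "B", 31, 13),
]

history = defaultdict(list)  # team -> (scored, allowed) from prior weeks only
rows = []                    # training rows built from pre-game averages

for week, team, scored, allowed in games:
    prior = history[team]
    if prior:  # a team's first game has no historical stats, so it is skipped
        avg_scored = sum(s for s, _ in prior) / len(prior)
        avg_allowed = sum(a for _, a in prior) / len(prior)
        rows.append((week, team, avg_scored, avg_allowed))
    history[team].append((scored, allowed))  # same-game stats added only now

# Each row now describes a game using statistics strictly from earlier games.
print(rows)
```

The key detail is that a game's own statistics are appended to the history only *after* its training row is built, so no same-game information leaks into the features.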

The general approach taken for this project was as follows. The first step involved collecting the necessary data, some of which required lengthy scripts to extract the data from well-structured web sites via a method known as web scraping. The data then needed to be stored in various tables as part of a relational database (MySQL). Preprocessing was performed to aggregate and restructure the data into the appropriate format. Initial classification results, obtained with several types of classification methods, were used as a baseline. Finally, experiments were performed to analyze the impact of using player injury reports as part of data cleaning, and to assess the impact of adding college football poll data to the data set.

2   Problem Description

The goal of this paper is to employ traditional data mining techniques to create a decision tree that can predict the outcome of a given spread with accuracy greater than 52.38%, which is the percentage of bets that need to be won in order to be profitable, assuming a 10% fee is charged when losing a bet. Since the data set needed is not readily available in a database-friendly format, a large portion of the time spent on this project will be used first to collect the data from various sports websites, and second to verify the accuracy of the data. The remaining scope of this project involves applying data cleaning processes to create aggregate features, using WEKA to create a classification tree predicting the outcome of a game’s point spread, and analyzing injury reports to see whether they can improve the accuracy of the classification tree.
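The break-even rate follows directly from the fee structure: a winning bet returns one unit, while a losing bet costs the stake plus the 10% fee, i.e. 1.1 units. Setting expected value to zero gives p = 1.1/2.1 ≈ 52.38%. A quick check:

```python
def breakeven_rate(fee: float) -> float:
    """Win probability needed to break even when a losing bet costs the
    stake plus a fee (fee=0.10 models the standard 10% vig)."""
    loss = 1.0 + fee            # units lost on a losing bet, per unit staked
    # Zero expected value: p * 1 = (1 - p) * loss  =>  p = loss / (1 + loss)
    return loss / (1.0 + loss)

print(round(breakeven_rate(0.10) * 100, 2))  # 52.38
```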

3   Data Preparation

3.1    Data Details

Since much of the data was not available in a readily downloadable format, the data needed for this project had to be collected from a variety of sources. To qualify as a potential source, a web page needed its HTML arranged in such a fashion that the data could be programmatically extracted from it, a process known as web scraping. To assist with this process, a PHP package called “Simple HTML DOM” was utilized. A MySQL database with the following tables was created to store the data of interest. Once the database had been created, the PHP scripts were modified to programmatically insert the extracted data into the various MySQL tables. Note that most of the raw data was nominal, with a few binary attributes as well (e.g., overtime, win/loss).
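The extract-by-structure idea behind the scraping scripts can be sketched with Python's standard-library HTML parser (the project itself used the PHP “Simple HTML DOM” package; the HTML fragment below is a hypothetical stand-in for a results page):

```python
from html.parser import HTMLParser

class ScoreTableParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_td = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_td = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row:
            self.rows.append(self._row)
        elif tag == "td":
            self._in_td = False

    def handle_data(self, data):
        if self._in_td and data.strip():
            self._row.append(data.strip())

# Hypothetical fragment of a well-structured results page:
html = """
<table>
  <tr><td>2011-09-03</td><td>Team A</td><td>Team B</td><td>28-14</td></tr>
  <tr><td>2011-09-03</td><td>Team C</td><td>Team D</td><td>17-20</td></tr>
</table>
"""
parser = ScoreTableParser()
parser.feed(html)
print(parser.rows)  # one list of cell values per game row
```

Each extracted row can then be mapped onto the columns of the corresponding MySQL table.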

The raw data collected from the various web sites can be categorized as follows:

1. General Game Information (Table name: game_info)

Data Sources:

1. http://www.covers.com (for Game Score & Spread Information)
2. http://www.cfbstats.com (for everything else)

Description: This table contained general information for each game during the 2011 season, including when the game occurred, who played, and what the outcome was.

Contents Summary:

• Date of Game
• Home/Away Team
• Game Score
• Table Size: 16 columns / 800 rows (games)
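A 16-column layout of this kind might correspond to a schema along the following lines; the column names and types here are illustrative assumptions, as the paper does not list the full schema:

```sql
-- Illustrative sketch of the game_info table (abridged; actual columns unknown)
CREATE TABLE game_info (
    game_id     INT PRIMARY KEY,
    game_date   DATE,
    home_team   VARCHAR(64),
    away_team   VARCHAR(64),
    home_score  TINYINT UNSIGNED,
    away_score  TINYINT UNSIGNED,
    spread      DECIMAL(4,1),   -- closing line, e.g. from covers.com
    overtime    BOOLEAN         -- one of the binary attributes noted above
);
```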

2. Detailed Game Statistics (Table name: team_game_stats)

Data Sources:

1. http://www.cfbstats.com

Description: This table contained detailed statistics for each team in each game during the 2011 season. These statistics were taken for both offense and defense. Additional statistics were also recorded for special teams.

...

...