ETC3250 Report

Group members: Youcheng Cao, Jingying Wang, Yi Yu

Introduction:

The data set we have been given is divided into two parts: a training set and a test set. There is also a list of the variables, which includes age, job, marital status, and so on. The purpose of the project is to build a model that predicts the probability that a client will subscribe to a bank term deposit on the basis of these predictors. The main steps we go through are accessing the data, cleaning it, fitting it to different models, comparing the models and choosing the best one.
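As a rough illustration of the data-access step, the sketch below loads the training and test sets and prints a quick overview. The file names, the column layout and the use of Python/pandas are assumptions made for illustration; the report itself does not show its code.

import pandas as pd

# Hypothetical file names; the report does not state how the data is stored.
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# Quick overview of the predictors (age, job, marital, ...) and the response.
print(train.shape, test.shape)
print(train.dtypes)
print(train.head())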

Methodology:

Firstly, we analyse the data by visualising it. The graphs below show that almost all of the variables have an obvious relationship with the probability that a client will subscribe to a bank term deposit (the response variable), with the exception of education; the relationship between education and the response can be seen in the first graph below.

Looking at the percentage bar chart of education against the response, we find that education levels of 0, 4, 6 and 9 years have no obvious impact on the response. We therefore combine these levels and treat them as a single "basic" category (a sketch of this regrouping is given after the figures below).

[pic 1][pic 2][pic 3][pic 4][pic 5]
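The regrouping described above can be sketched as follows, assuming the education levels are recorded as years of schooling in a column named education and the subscription outcome is a yes/no column named y (both column names are assumptions, not taken from the report):

import pandas as pd
import matplotlib.pyplot as plt

train = pd.read_csv("train.csv")  # hypothetical file name

# Percentage bar chart: proportion of clients subscribing within each
# education level.
rates = train.groupby("education")["y"].apply(lambda s: (s == "yes").mean())
rates.plot(kind="bar", ylabel="proportion subscribing")
plt.tight_layout()
plt.show()

# Levels 0, 4, 6 and 9 behave alike, so merge them into one "basic" level.
basic_years = {0, 4, 6, 9}
train["education"] = train["education"].apply(
    lambda v: "basic" if v in basic_years else v
)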

Secondly, we need to clean the data and select the data that is genuinely useful for the model. The main steps are as follows.

(1) Finding the correlations between variables. This is an important step because we cannot reach the right conclusions in the statistical analysis if we fail to do so. To address it, we draw a plot of the correlations between the variables (see the figure and the sketch below). The plot shows that one pair of variables is strongly correlated, meaning that a change in one largely determines a change in the other; the two variables therefore carry essentially the same information for the model, so we decide to remove one of the highly correlated pair.

 [pic 6]
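A sketch of this step, assuming the numeric predictors sit in the same training table used above (the 0.9 cut-off is an illustrative threshold, not a value taken from the report):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

train = pd.read_csv("train.csv")           # hypothetical file name
num = train.select_dtypes(include="number")

# Pairwise correlations between the numeric predictors, shown as a heat map.
corr = num.corr()
plt.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
plt.xticks(range(len(corr)), corr.columns, rotation=90)
plt.yticks(range(len(corr)), corr.columns)
plt.colorbar()
plt.tight_layout()
plt.show()

# Drop one column from each pair whose absolute correlation exceeds 0.9.
upper = corr.abs().where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.9).any()]
train = train.drop(columns=to_drop)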

(2) Selecting data using principal component analysis and k-means

Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. PCA gives us a lower-dimensional picture of the data, a projection of the observations viewed from their most informative viewpoint, which makes it easier to spot outliers. We also use k-means to further select the data that is useful to us: k-means clustering partitions the n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as the prototype of the cluster. To choose the most appropriate value of k we use the Davies-Bouldin index (DBI), a metric for evaluating clustering algorithms, and find that the optimum value is k = 8. Analysing the graph below, we see that the purple dots on the left-hand side are outliers, because they lie away from the PCA line, so we delete them; a sketch of this procedure follows the figure.

[pic 7]
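The following sketch reproduces the idea of this step under stated assumptions: the numeric predictors are standardised, projected onto two principal components, and clustered with k-means over a range of k, keeping the k with the lowest Davies-Bouldin index. The distance rule for flagging outliers is an illustrative stand-in for the visual inspection described above, not the exact rule used in the report.

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import davies_bouldin_score
from sklearn.preprocessing import StandardScaler

train = pd.read_csv("train.csv")                     # hypothetical file name
X = StandardScaler().fit_transform(train.select_dtypes(include="number"))
scores = PCA(n_components=2).fit_transform(X)

# Choose k by the Davies-Bouldin index (lower is better); the report found k = 8.
db = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(scores)
    db[k] = davies_bouldin_score(scores, labels)
best_k = min(db, key=db.get)
print("best k by Davies-Bouldin index:", best_k)

# Flag points lying far from the bulk of the data in PC space as outliers.
dist = (scores ** 2).sum(axis=1) ** 0.5
outliers = dist > dist.mean() + 3 * dist.std()
train_clean = train.loc[~outliers]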

(3) Deleting the data with abnormal attributes

For the variable marital, some values are recorded as "unknown", which indicates that the variable has missing values. There are various ways to deal with missing values: we can either fill them in manually or delete the affected observations. Because our data set is large, filling the missing values with the mean or another special value may introduce significant deviation, and there are not many missing values in the data set in the first place. Therefore, we simply delete the observations that contain the value "unknown".
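A minimal sketch of this rule, assuming the placeholder string is literally "unknown" in the categorical columns of the same training table used above:

import pandas as pd

train = pd.read_csv("train.csv")              # hypothetical file name
cat_cols = train.select_dtypes(include="object").columns

# Keep only the rows that have no "unknown" entry in any categorical column.
mask = ~(train[cat_cols] == "unknown").any(axis=1)
print(f"dropping {int((~mask).sum())} rows containing 'unknown'")
train = train.loc[mask]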

(4) Discretization

We visualise all the data by drawing graphs so that we can find the relationships between the variables and the response, as well as the correlations among the variables. For instance, we found that education levels of 0, 4, 6 and 9 years can be regarded as a single "basic" level, because they have no obvious impact on the response.

We also convert categorical variables to numerical variables so that the data is easier to fit into our models.

When building a model with a tree method, we do not need to create a dummy variable for every qualitative variable as we did for the logistic model. Instead, we convert the qualitative variables into quantitative variables directly in the data matrix, as sketched below.
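The two encodings can be sketched as follows; the response column name y and the generic column handling are assumptions, and the exact encoding used in the report may differ.

import pandas as pd

train = pd.read_csv("train.csv")              # hypothetical file name
cat_cols = train.select_dtypes(include="object").columns.drop("y", errors="ignore")

# Logistic model: expand each qualitative variable into dummy (one-hot)
# columns, dropping one level per variable to avoid redundancy.
X_logit = pd.get_dummies(train, columns=list(cat_cols), drop_first=True)

# Tree methods: map each qualitative variable to integer codes directly
# in the data matrix, with no dummy expansion.
X_tree = train.copy()
for col in cat_cols:
    X_tree[col] = X_tree[col].astype("category").cat.codes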

...
