How to Code Binary and Multi-Class Classification, Explore a Data Set with Boxplots, Cluster with K-Means, and Implement Predictive Modeling for a Clothing E-Retailer
Understanding Mean, Median, and Mode in Describing Data
1) Data can be described, among other things, in terms of central tendency and spread.
- a) Name at least two common measures used for capturing central tendency.
Mean
Median
Mode
- b) Give a one sentence definition of at least two of the measures named in a).
Mean. The mean is simply the average of the numbers: arithmetically, we obtain it by adding up all the values in a data set and dividing the sum by the number of values in the data set.
Median. When the data in a data set are arranged in ascending order, the median is the middle value, separating the lower half of the data from the upper half.
Mode. The mode is the value that is repeated most often in a data set.
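As a quick illustration, all three measures can be computed with Python's standard `statistics` module; the small data set here is made up for the example:

```python
# Central tendency of a small, made-up data set, standard library only.
from statistics import mean, median, mode

data = [1, 2, 2, 3, 5, 8, 13]

avg = mean(data)    # sum of the values divided by their count: 34/7
mid = median(data)  # middle value of the sorted data: 3
top = mode(data)    # most frequently occurring value: 2
print(avg, mid, top)
```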
- c) Name at least two common measures used for capturing spread.
Range
Variance
Standard deviation
d) A boxplot can be used to visualize the distribution of an attribute. Explain how to interpret a boxplot.
A boxplot enables us to examine the distribution of a data set and to compare the level of the scores.
In the first step, the scores are sorted. Secondly, the sorted data is divided into four equal groups (25% of the scores in each subgroup). The cut points between these subdivisions are called the quartiles, and the quartile groups are labeled 1 to 4 starting at the lowest.
If a boxplot is comparatively short, the data set being analyzed shows strong agreement and the values are tightly clustered (Thearling, 2017). In the case of students' exam performance, the students seem to have scored grades within roughly the same range.
If a boxplot is comparatively tall, the data set under analysis has large variation. In the case of students' exam performance, some may have scored high grades while others scored low grades, meaning the separation between the two is very wide.
If one boxplot sits lower or higher than another, the two data sets differ in level. For instance, if the boxplot for women sits lower or higher than the one for men in an election analysis, this could mean more men participated in the elections than women, with the converse holding for the higher boxplot.
When the sections of a boxplot are unequal in size, similar views are represented in the wider part of the scale while more variable opinions are held in the parts of the scale that are narrower (Shmueli, Bruce, Yahav, Patel, & Lichtendahl, 2017). The whiskers can also be used to interpret a boxplot: a longer lower whisker means the scores are spread out towards the lower quartile, and the converse applies for a longer upper whisker.
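The quartile mechanics described above can be sketched in code. The exam scores below are made up for illustration, and the 1.5 × IQR fence is the usual plotting convention for where the whiskers end:

```python
# Five-number summary behind a boxplot, standard library only.
from statistics import quantiles

scores = [45, 52, 56, 58, 61, 63, 67, 70, 74, 82, 95]  # made-up grades

q1, q2, q3 = quantiles(scores, n=4)  # quartile cut points (Q1, median, Q3)
iqr = q3 - q1                        # height of the box

# Whiskers conventionally extend to the furthest points within
# 1.5 * IQR of the box; anything beyond is drawn as an outlier.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [s for s in scores if s < lower_fence or s > upper_fence]
print(q1, q2, q3, outliers)
```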
Interpreting Boxplots for Data Distribution Visualization
Consider the following training data, where the goal is to determine whether a car is manual or automatic.
| Input | Hidden | Output |
|---|---|---|
| size = 1.0 | 0.469 | 0.6 |
| model = 2.0 | 0.523 | |
| engine = 3.0 | 0.572 | |
| | 0.617 | |
In this example manual is encoded as 1 while automatic is encoded as 0.
The output is 0.6, and because this value is closer to 1 than to 0, the neural network predicts the car as manual.
| Input | Hidden | Output |
|---|---|---|
| size = 1.0 | 0.469 | 0.4 |
| model = 2.0 | 0.523 | 0.51 |
| engine = 3.0 | 0.572 | |
| | 0.617 | |
In this example manual is encoded as (1, 0) and automatic as (0, 1).
In the output, the larger of the two node values (0.51 > 0.4) is in the second position and maps to (0, 1), hence the neural network predicts the car is automatic.
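The decoding step can be sketched as follows. Only the two node values come from the worked example; the surrounding glue code is an illustration:

```python
# One-hot decoding for the two-output-node example above.
# Classes are coded: manual -> (1, 0), automatic -> (0, 1).
classes = ["manual", "automatic"]
output_nodes = [0.4, 0.51]  # node values from the worked example

# The predicted class is the position of the larger output value:
# here 0.51 sits in the second position, which maps to (0, 1).
predicted = classes[output_nodes.index(max(output_nodes))]
print(predicted)
```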
b) If we have a categorical but ordered input attribute, let’s say with the possible values {Low, Medium, High}, how would you code that? Why is this a good coding for that attribute?
I would encode Low as 1, Medium as 2 and High as 3.
This is an appropriate coding because the integer values preserve the natural ordering of the attribute: Medium lies between Low and High, and the codes 1 < 2 < 3 mirror that hierarchy, with a larger code indicating a higher level.
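A minimal sketch of such an integer coding; the sample values are invented, and one consistent choice maps Low to 1 and High to 3 so that larger codes mean higher levels:

```python
# Ordinal (integer) coding for an ordered categorical attribute.
order = {"Low": 1, "Medium": 2, "High": 3}

values = ["High", "Low", "Medium", "High"]  # made-up attribute values
encoded = [order[v] for v in values]
print(encoded)

# Because 1 < 2 < 3 mirrors Low < Medium < High, comparisons and
# distances computed on the codes respect the attribute's natural order.
```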
A one-dimensional dataset with ten instances is given below:
{1, 1, 2, 3, 5, 8, 13, 21, 33, 54}
Assume that you have to explore a large dataset of high dimensionality and that you know nothing about the distribution of the data. Describe a method for finding the number of clusters in the dataset using k-means, and furthermore, explain how k-means can be applied to find the dimensions in which the clusters separate (i.e. how can you eliminate dimensions that don’t provide any useful information for clustering the dataset, using k-means).
The gap statistic method can be used to determine the number of clusters in the dataset. The method compares the change in within-cluster dispersion produced by k-means with the dispersion expected under a null reference distribution of the data. According to Thearling (2017), this method gives more precise results than the other methods. The procedure is given below.
- Cluster the observed data, varying the number of clusters over k = 1, ..., K, and compute the corresponding within-cluster dispersion W_k.
- Generate B reference data sets (for example, sampled uniformly over the range of the observed data) and cluster each of them, again varying the number of clusters over k = 1, ..., K, giving dispersions W*_kb. Compute the estimated gap statistic Gap(k) = (1/B) Σ_b log(W*_kb) − log(W_k).
- With l̄ = (1/B) Σ_b log(W*_kb), compute the standard deviation sd_k = [(1/B) Σ_b (log(W*_kb) − l̄)²]^(1/2) and define s_k = sd_k · √(1 + 1/B).
- Choose the number of clusters as the smallest k such that Gap(k) ≥ Gap(k+1) − s_(k+1).
To find the dimensions in which the clusters separate, k-means can be run on the full data and each dimension examined in turn: if the cluster centroids barely differ in a dimension compared with the within-cluster spread in that dimension, that dimension provides no useful information for clustering and can be eliminated before rerunning k-means on the reduced data.
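The steps above can be sketched in code on the one-dimensional data set given earlier. This is a rough illustration only: the k-means routine is a hand-rolled 1-D Lloyd's algorithm rather than a library call, the reference sets are drawn uniformly over the data range, and the choices B = 20 and K = 4 are arbitrary.

```python
# Gap-statistic sketch with a minimal 1-D k-means.
import math
import random

def kmeans_1d(data, k, iters=50):
    """Minimal Lloyd's algorithm in 1-D; returns the within-cluster
    dispersion W_k (sum of squared distances to the cluster means)."""
    centers = random.sample(data, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in data:
            clusters[min(range(k), key=lambda j: abs(x - centers[j]))].append(x)
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return sum(min((x - c) ** 2 for c in centers) for x in data)

def gap_statistic(data, max_k=4, b=20, seed=0):
    random.seed(seed)
    lo, hi = min(data), max(data)
    gaps, s = [], []
    for k in range(1, max_k + 1):
        log_w = math.log(kmeans_1d(data, k))
        # Dispersions of B uniform reference data sets, clustered the same way.
        ref = [math.log(kmeans_1d([random.uniform(lo, hi) for _ in data], k))
               for _ in range(b)]
        l_bar = sum(ref) / b
        sd = math.sqrt(sum((r - l_bar) ** 2 for r in ref) / b)
        gaps.append(l_bar - log_w)                 # Gap(k)
        s.append(sd * math.sqrt(1 + 1 / b))        # s_k
    # Smallest k with Gap(k) >= Gap(k+1) - s_(k+1).
    for k in range(1, max_k):
        if gaps[k - 1] >= gaps[k] - s[k]:
            return k
    return max_k

data = [1, 1, 2, 3, 5, 8, 13, 21, 33, 54]
print(gap_statistic(data))
```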
1) You work at an e-retailer selling primarily clothes. Now you would like to use data mining, more specifically predictive modeling, to select which customers to target with a promotion for a new line of luxury dresses.
- a) Describe how you would model this task, specifically the kind of data set that you would use, i.e., input variables (attributes) and the output variable (target).
I would break down the input variables into three major divisions:
(1) All data related to the new line of luxury dresses in terms of prices, quality, sizes, designs and colors.
(2) All variables related to the different media houses that my company will use for the planned advertising and promotion activity for the luxury dresses.
(3) All variables related to the budgetary allocation for the advertising and promotion activities, with respect to both external advertising cost and internal advertising investment.
I would use the following output variables:
- The change in sales volume of the new luxury dresses over a given period of time.
- The change in revenue resulting from the promotion activity.
- The volume of advertisement placed with the different media houses during the campaign period.
4, b) Give the main advantage (one per technique) of using the following two modeling techniques for the task presented above: i) random forest and ii) decision trees. Are these properties contradictory, i.e., must we choose one of them, or can we (at least to some degree) have both?
The main advantage of random forest is its ability to limit overfitting without increasing errors due to bias and variance (Shmueli et al., 2017). A random forest uses many random subsets of the features and of the data, allowing many decision trees to be used instead of just one, which is more effective.
The main advantage of using a decision tree is that it is easy and fast to interpret, making the visualization process quicker and easier (Roiger, 2017).
The two properties are not strictly contradictory: decision trees and random forests can be used together to some degree, depending on the data, because a random forest is itself a combination of decision trees. When a single decision tree grows very deep, overfitting can occur, requiring a random forest to eliminate that shortcoming, at the cost of some interpretability.
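The relationship can be sketched with a toy example (invented data, and a deliberately trivial one-threshold "tree" to keep it self-contained): a forest is nothing more than many trees fitted to bootstrap samples whose votes are combined, so each individual tree stays interpretable while the majority vote smooths out single-tree quirks.

```python
# A "forest" of one-split decision stumps, each fitted to a bootstrap
# sample of the training data, combined by majority vote.
import random

def fit_stump(points):
    """Return the threshold t for the rule 'x > t' that best fits the
    labelled (x, label) points."""
    best_t, best_acc = None, -1.0
    for t in sorted({x for x, _ in points}):
        acc = sum((x > t) == y for x, y in points) / len(points)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

def forest_predict(x, thresholds):
    votes = sum(x > t for t in thresholds)  # each stump votes 0 or 1
    return votes > len(thresholds) / 2      # majority vote

random.seed(1)
train = [(i, i >= 5) for i in range(10)]    # true rule: label is x >= 5
forest = [fit_stump(random.choices(train, k=len(train)))  # bootstrap
          for _ in range(25)]               # 25 stumps instead of one
```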
- a) How many instances belong to class3? (1E)
- b) How many instances have been predicted as class4? (1E)
3265+2=3267
- c) What is the precision for class6? (1E)
precision = true positives / (true positives + false positives)
= 6 / (6 + 7) = 6/13
Ans = 0.462
- d) What is the recall for class6? (1E)
recall = true positives / (true positives + false negatives)
= 6 / (6 + 48 + 8) = 6/62
Ans = 0.0968
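The two calculations above, restated as code (the counts 6, 7, and 48 + 8 are the class6 figures used in the working):

```python
# Precision and recall for class6 from the confusion-matrix counts.
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

tp, fp, fn = 6, 7, 48 + 8

print(round(precision(tp, fp), 3))  # -> 0.462
print(round(recall(tp, fn), 4))     # -> 0.0968
```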
This model provides lower accuracy than the previous model. However, it may be preferred when there is a large class imbalance, because it will then have better predictive power on the minority classes.
It represents appropriate generalization of the data. In machine learning, generalization refers to how accurately the concepts a model learns from its training data apply to new, previously unseen data. A well-generalized model should accurately represent the raw data it was trained on and should be flexible enough to accommodate new data efficiently (KS & Kamath, 2017).
It represents underfitting. According to Lu, Setiono, & Liu (2017), an underfitted model cannot appropriately represent the modeling (training) data and cannot generalize to new data either. To solve the problem of underfitting we can fit the target variable with an nth-degree polynomial. As we increase the polynomial degree, the training error will tend to decrease. The cross-validation error will also decrease at first, but beyond some degree it starts to rise again, tracing out a convex (U-shaped) curve; the degree at its minimum gives a model more accurate than the underfitted one.
It represents overfitting. Overfitting occurs when a model fits the training data so closely that the fit adversely influences generalization when new data is presented to the model (Ashraf, Ahmad, & Ashraf, 2018). To minimize overfitting, a penalty (regularization) term can be added to the model's cost function, using any of the following approaches.
- a) Lasso regularization – an L1 penalty term (the sum of the absolute values of the coefficients) is added to the existing cost function to reduce overfitting.
- b) Ridge regularization – an L2 penalty term (the sum of the squared coefficients) is added to the existing cost function to reduce the effect of overfitting and allow generalization to new data.
- c) Elastic net – the two penalties above (a & b) are combined in the existing cost function to reduce the effect of overfitting and allow generalization to new data.
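As a minimal numeric sketch of the ridge (b) idea, with made-up one-feature data and no intercept, where the penalized least-squares coefficient has the closed form w = Σxy / (Σx² + λ): increasing the penalty λ shrinks the coefficient toward zero, which is what limits overfitting.

```python
# Ridge (L2) shrinkage in the simplest possible setting: one feature,
# no intercept, so the penalized solution is closed-form.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]  # made-up data, roughly y = 2x

def ridge_weight(xs, ys, lam):
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)   # w = sum(x*y) / (sum(x^2) + lambda)

w_ols = ridge_weight(xs, ys, 0.0)  # ordinary least squares fit
w_reg = ridge_weight(xs, ys, 5.0)  # penalized fit, shrunk toward 0
print(w_ols, w_reg)
```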
References
Ashraf, N., Ahmad, W., & Ashraf, R. (2018). A Comparative Study of Data Mining Algorithms
for High Detection Rate in Intrusion Detection System.
KS, D., & Kamath, A. (2017). Survey on Techniques of Data Mining and its Applications.
Lu, H., Setiono, R., & Liu, H. (2017). Neurorule: A connectionist approach to data mining. arXiv preprint arXiv:1701.01358.
Roiger, R. J. (2017). Data mining: a tutorial-based primer. Chapman and Hall/CRC.
Shmueli, G., Bruce, P. C., Yahav, I., Patel, N. R., & Lichtendahl Jr, K. C. (2017). Data mining for business analytics: concepts, techniques, and applications in R. John Wiley & Sons.
Thearling, K. (2017). An introduction to data mining.