Exploratory Data Analysis And Linear Regression Analysis

The aim of this study was to demonstrate applied knowledge of people, markets, finances, technology and management in the global context of business intelligence practice. Specifically, the assignment was intended to showcase an understanding of the data mining process and of data visualisation through designated data mining and data visualisation tools. The first part of the study covers data mining using RapidMiner, and the second and third parts explore data visualisation using Tableau.

In this part of the report, the analyst used the house price data set and performed several analyses to understand the major factors that influence the price of a house. Specifically, the purpose was to identify the five or six listed variables that primarily determine the price of a house in the selected location. To do this, the analyst performed exploratory data analysis, correlation analysis and chi-square analysis in turn (Roiger, 2017). Once the key variables were identified, the analyst performed linear regression analysis to determine which variable is the most significant predictor of house price.

As mentioned above, the first step was an exploratory data analysis, which gave the analyst a basic understanding of all the variables in the house price data set. RapidMiner was used for this analysis. In RapidMiner, the analyst first added the data set under the Design tab and connected it to the process. Once the process had run, it produced the table below, which summarises the descriptive statistics of each variable in the data set (Jovic, Brkic & Bogunovic, 2014).

The table above shows basic statistics, including the minimum, maximum and mean of each variable. However, this was not sufficient to determine which variables contribute significantly to house price. Therefore, the analyst performed a further study, a correlation analysis, to measure the association of each variable with house price (Naik & Samant, 2016). Using the correlation values, the analyst sought the top five or six variables most strongly associated with house price.
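
The descriptive statistics discussed above (minimum, maximum and mean per variable) can also be reproduced outside RapidMiner. The following is a minimal sketch in Python, not the analyst's actual process: the rows are invented for illustration, and the `describe` helper is a hypothetical name.

```python
import statistics

# Hypothetical sample rows; the values are invented for illustration only.
houses = [
    {"price": 221900, "sqft_living": 1180, "grade": 7},
    {"price": 538000, "sqft_living": 2570, "grade": 7},
    {"price": 180000, "sqft_living": 770, "grade": 6},
    {"price": 604000, "sqft_living": 1960, "grade": 7},
    {"price": 510000, "sqft_living": 1680, "grade": 8},
]


def describe(rows):
    """Return the minimum, maximum and mean of every column in rows."""
    summary = {}
    for column in rows[0]:
        values = [row[column] for row in rows]
        summary[column] = {
            "min": min(values),
            "max": max(values),
            "mean": statistics.mean(values),
        }
    return summary


summary = describe(houses)
for column, stats in summary.items():
    print(column, stats)
```

As the report notes, a summary of this kind describes each variable on its own and says nothing about how strongly a variable is related to price.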

The first step of the correlation analysis was to exclude attributes that were not required. To do so, the analyst used the “Select Attributes” operator to exclude variables such as ID and Date. After excluding those variables, 16 variables remained, and the analyst added the Correlation Matrix operator to compute the correlation values (Gentile et al., 2016). The figure below shows the correlation process performed in RapidMiner.

Exploratory Data Analysis (EDA) using RapidMiner

The matrix below shows the correlation values in ascending order. A correlation coefficient above 0.70, that is, close to 1, indicates that a variable is strongly associated with the dependent variable. In this case, the table shows that “sqft_living” was one such variable, with a correlation above 0.70 with house price. In other words, it is the variable most strongly associated with price. Similarly, the table shows that “grade”, “sqft_above”, “sqft_living15” and “bathrooms” were the other four variables that, together with “sqft_living”, made up the top five variables associated with house price.
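
To make the ranking step concrete, the sketch below computes Pearson correlation coefficients against price and applies the 0.70 threshold mentioned above. It is an illustration only: the data values are invented, the helper names `pearson` and `rank_by_correlation` are hypothetical, and it is assumed (not stated in the report) that the correlation matrix uses the Pearson coefficient.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def rank_by_correlation(data, target):
    """Rank every non-target column by absolute correlation with the target."""
    scores = {
        name: pearson(values, data[target])
        for name, values in data.items()
        if name != target
    }
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Invented values for illustration; not taken from the house price data set.
data = {
    "price": [100000, 200000, 300000, 400000],
    "sqft_living": [1000, 2000, 3000, 4000],
    "bathrooms": [1, 3, 2, 3],
}

ranked = rank_by_correlation(data, "price")
strong = [name for name, r in ranked if abs(r) > 0.70]
for name, r in ranked:
    print(f"{name}: {r:.3f}")
```

Ranking by absolute value keeps strongly negative correlations in view as well, which a simple ascending sort would place at the opposite end of the list.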

The analyst also performed a chi-square test to obtain a second opinion on the five variables identified above. The chi-square statistic is a nonparametric statistical technique used to determine whether a distribution of observed frequencies differs from the theoretically expected frequencies. Chi-square statistics use nominal data, so instead of means and variances, the test uses frequencies (Gentile et al., 2016). The value of the chi-square statistic is given by

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

where $\chi^2$ is the chi-square statistic, $O$ is the observed frequency and $E$ is the expected frequency.
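
As a small worked example of this formula, the sketch below sums (O - E)^2 / E over three categories. The frequencies are invented for illustration and the function name `chi_square` is hypothetical.

```python
def chi_square(observed, expected):
    """Sum (O - E)^2 / E over paired observed and expected frequencies."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Invented frequencies for three categories.
observed = [30, 50, 20]
expected = [25, 50, 25]

# (30-25)^2/25 + (50-50)^2/50 + (20-25)^2/25 = 1 + 0 + 1
statistic = chi_square(observed, expected)
print(statistic)
```

The further the observed frequencies move from the expected ones, the larger the statistic becomes.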

The Weight by Chi Squared Statistic operator calculates the weight of each attribute with respect to the class attribute using the chi-squared statistic. The higher the weight of an attribute, the more relevant it is considered. The chi-square table below shows that “sqft_living”, “sqft_above”, “bathrooms”, “sqft_basement”, “grade” and “sqft_living15” were the highest-weighted variables.
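
The idea of weighting an attribute by its chi-squared statistic can be sketched as follows. This is not RapidMiner's implementation: it assumes, purely for illustration, that both the attribute and the price are split into "low" and "high" at their medians to form a 2x2 contingency table. The data and the name `chi_square_weight` are hypothetical.

```python
from statistics import median


def chi_square_weight(attribute, target):
    """Chi-square statistic of a 2x2 table built by splitting both
    columns at their medians (an assumed discretisation)."""
    a_cut = median(attribute)
    t_cut = median(target)
    table = [[0, 0], [0, 0]]
    for a, t in zip(attribute, target):
        table[int(a > a_cut)][int(t > t_cut)] += 1

    n = len(attribute)
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]

    statistic = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if attribute and target were independent.
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                statistic += (table[i][j] - expected) ** 2 / expected
    return statistic


# Invented values for illustration only.
price = [100, 200, 300, 400, 500, 600, 700, 800]
related = [10, 20, 30, 40, 50, 60, 70, 80]    # rises with price
unrelated = [10, 80, 20, 70, 30, 60, 40, 50]  # no pattern against price

weights = {
    "related": chi_square_weight(related, price),
    "unrelated": chi_square_weight(unrelated, price),
}
print(weights)
```

An attribute whose high and low values line up with high and low prices receives a large weight, while one that is spread evenly across both price groups receives a weight near zero.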

While the analysis above identified the top five variables that contribute most to predicting house price, linear regression analysis was used to quantify the specific contribution of each of these variables. The linear regression process performed in RapidMiner is shown below. In this process, the analyst used the Select Attributes operator to specify the five variables identified through exploratory data analysis. Once these variables were chosen, the analyst used the Set Role operator to define the dependent variable. Finally, the Linear Regression operator was used to produce the results.

The table below shows the linear regression results. The p-value of each of these variables was less than 0.05, which indicates that all of the variables are significant predictors of house price. Specifically, the linear regression equation is as follows:

Correlation Analysis

House price = -646863.75 + 22.82 * sqft_living15 - 80.485 * sqft_above + 111024.92 * grade + 245.42 * sqft_living - 35464.023 * bathrooms

The equation above can be used to estimate the price of a house. The coefficients of sqft_living15, grade and sqft_living are positive, so an increase in any of these three variables raises the predicted house price, holding the other variables constant. Conversely, the coefficients of sqft_above and bathrooms are negative, so an increase in either of these variables lowers the predicted price.
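
The effect of each coefficient can be checked by evaluating the equation directly. The sketch below uses the intercept and coefficients reported above. The example house is invented, and `predict_price` is a hypothetical helper name.

```python
# Intercept and coefficients as reported in the regression equation.
INTERCEPT = -646863.75
COEFFICIENTS = {
    "sqft_living15": 22.82,
    "sqft_above": -80.485,
    "grade": 111024.92,
    "sqft_living": 245.42,
    "bathrooms": -35464.023,
}


def predict_price(house):
    """Evaluate the fitted linear equation for one house."""
    return INTERCEPT + sum(
        coefficient * house[name] for name, coefficient in COEFFICIENTS.items()
    )


# Hypothetical house; the attribute values are invented for illustration.
house = {
    "sqft_living15": 1800,
    "sqft_above": 1500,
    "grade": 7,
    "sqft_living": 2000,
    "bathrooms": 2,
}

base = predict_price(house)

# Change one variable at a time while holding the others constant.
one_more_sqft = predict_price({**house, "sqft_living": house["sqft_living"] + 1})
one_more_bath = predict_price({**house, "bathrooms": house["bathrooms"] + 1})

print(round(base, 3))
print(round(one_more_sqft - base, 3))  # change equals the sqft_living coefficient
print(round(one_more_bath - base, 3))  # change equals the bathrooms coefficient
```

Changing a single input by one unit shifts the prediction by exactly that variable's coefficient, which is what the sign-based interpretation above relies on.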

This table was produced with Tableau Desktop, a software package used to visualise data. In Tableau, the first step in visualising any information as a table or graph is to load the data. Once the data has been loaded, the software itself categorises variables as measures or dimensions. As mentioned in the earlier section of this study, the house price data set contains several variables in addition to house price (Federer & Joubert, 2018). The aim of this visualisation was to display house prices and other relevant data over time (years) using Tableau Desktop. To do so, the analyst used the five chosen variables along with the price data, and defined a monthly time frame over which to visualise all selected variables (Carrell et al., 2017). Bathrooms were shown as a count per room, and the median value was used for the remaining variables.

From this visualisation it can be concluded that each quarter over the period 2014 to 2015 started at a high point and then declined to a lower level. For the grade variable, for example, 50% of houses were graded above 7 and 50% were graded below 7 over this period (Fezarudin, Tan & Saeed, 2017). In general, the median value of each variable marks the level that 50% of houses were above and 50% were below.
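
The monthly median aggregation described above can be sketched outside Tableau as follows. The sales records are invented, the ISO date format (YYYY-MM-DD) is an assumption about how dates are stored, and `monthly_median` is a hypothetical helper name.

```python
from collections import defaultdict
from statistics import median

# Invented sales records; dates are assumed to be ISO formatted strings.
sales = [
    {"date": "2014-05-02", "price": 300000},
    {"date": "2014-05-15", "price": 500000},
    {"date": "2014-05-28", "price": 400000},
    {"date": "2014-06-03", "price": 360000},
    {"date": "2014-06-20", "price": 450000},
]


def monthly_median(rows, column):
    """Group rows by year and month, then take the median of one column."""
    by_month = defaultdict(list)
    for row in rows:
        month = row["date"][:7]  # "YYYY-MM"
        by_month[month].append(row[column])
    return {month: median(values) for month, values in sorted(by_month.items())}


result = monthly_median(sales, "price")
for month, value in result.items():
    print(month, value)
```

Within each month, half of the recorded values lie at or above the median and half at or below it, which matches the 50% interpretation given in the text.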

In this section, unlike the previous one, the analyst used a geo map to represent house prices and other information. To prepare this geo map, the analyst used the latitude and longitude values as geographic dimensions, zip code as a regular dimension and price as a measure over the period 2014 to 2015. The zip code was used to distinguish locations by colour (Stirrup & Ramos, 2017). The visualisation shows that all of the data come from one specific area of the United States of America. Prices vary depending on factors such as “sqft_living”, “sqft_above”, “bathrooms”, “grade” and “sqft_living15”. However, as far as location is concerned, the visualisation did not show any specific trend.

Reference List

Roiger, R.J., 2017. Data mining: a tutorial-based primer. CRC Press.

Kotu, V. & Deshpande, B., 2014. Predictive analytics and data mining: concepts and practice with rapidminer. Morgan Kaufmann.

Naik, A. & Samant, L., 2016. Correlation review of classification algorithm using data mining tool: WEKA, Rapidminer, Tanagra, Orange and Knime. Procedia Computer Science, 85, pp.662-668.

Federer, L.M. & Joubert, D.J., 2018. Providing Library Support for Interactive Scientific and Biomedical Visualizations with Tableau. Journal of eScience Librarianship, 7(1), p.2.

Stirrup, J. & Ramos, R.O., 2017. Advanced Analytics with R and Tableau: Advanced analytics using data classification, unsupervised learning and data visualization.

Carrell, D., Albertson-Junkans, L., Ramaprasan, A., Baer, A. & Cronkite, D., 2017. Interactive Visualization of a Patient’s Electronic Health Information to Assist Manual Chart Review Using Tableau® Software. Journal of Patient-Centered Research and Reviews, 4(3), p.173.

Jovic, A., Brkic, K. & Bogunovic, N., 2014, May. An overview of free software tools for general data mining. In Information and Communication Technology, Electronics and Microelectronics (MIPRO), 2014 37th International Convention on (pp. 1112-1117). IEEE.

Gentile, A.L., Kirstein, S., Paulheim, H. & Bizer, C., 2016, May. Extending RapidMiner with data search and integration capabilities. In International Semantic Web Conference (pp. 167-171). Springer, Cham.

Fezarudin, F.Z., Tan, M.I.I. & Saeed, F.A.Q., 2017, August. Data Visualization for Human Capital and Halal Training in Halal Industry Using Tableau Desktop. In Asian Simulation Conference (pp. 593-604). Springer, Singapore.
