Programming/Kdb/Labs/Exploratory data analysis
From Thalesians Wiki
In this lab we'll make sense of the following data set from the UCI Machine Learning Repository:
- Name: Real estate valuation data set
- Data Set Characteristics: Multivariate
- Attribute Characteristics: Integer, Real
- Associated Tasks: Regression
- Number of Instances: 414
- Number of Attributes: 7
- Missing Values? N/A
- Area: Business
- Date Donated: 2018.08.18
- Number of Web Hits: 111,613
- Original Owner and Donor: Prof. I-Cheng Yeh, Department of Civil Engineering, Tamkang University, Taiwan
- Relevant papers:
  - Yeh, I.C., and Hsu, T.K. (2018). Building real estate valuation models with comparative approach through case-based reasoning. Applied Soft Computing, 65, 260-271.
There are many data sets on UCI that are worth exploring. We picked this one because it is relatively straightforward and clean.
Let's read the data set information:
The market historical data set of real estate valuation is collected from Sindian Dist., New Taipei City, Taiwan. The real estate valuation is a regression problem. The data set was randomly split into the training data set (2/3 samples) and the testing data set (1/3 samples).
This paragraph describes how the original researchers split the data set: two thirds for training and one third for testing. We will split it differently, fifty-fifty, so that each half holds 207 of the 414 instances.
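A fifty-fifty random split can be sketched in q as follows. This is only an illustration, not necessarily the method used later in the lab: the names `n`, `idx`, `half`, `trainIdx` and `testIdx` are our own, and the row count is hard-coded from the data set summary above rather than taken from a loaded table.

```q
/ Number of instances, as listed in the data set summary (assumed, not read from a file)
n:414
/ A negative left argument to ? deals without replacement,
/ so this is a random permutation of 0..n-1
idx:neg[n]?n
/ First half of the shuffled indices for training, the rest for testing
half:n div 2
trainIdx:half#idx
testIdx:half _ idx
count each (trainIdx;testIdx)   / 207 207
```

Given a table `t` holding the data (a hypothetical name), `t trainIdx` and `t testIdx` would then select the two halves.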
Let's read on:
The inputs are as follows:
- X1 = the transaction date (for example, 2013.250 = 2013 March, 2013.500 = 2013 June, etc.)
- X2 = the house age (unit: year)
- X3 = the distance to the nearest MRT station (unit: metre)
- X4 = the number of convenience stores in the living circle on foot (integer)
- X5 = the geographic coordinate, latitude (unit: degree)
- X6 = the geographic coordinate, longitude (unit: degree)
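The fractional encoding of X1 can be unpacked into a year and a month number. The sketch below assumes, based only on the two examples given for X1, that the fraction counts twelfths of a year (0.25 of a year is 3, i.e. March; 0.5 is 6, i.e. June). The description does not say how a fraction of exactly zero should be read, so that case is left uninterpreted here. The names `x1`, `yr` and `mth` are our own.

```q
/ The two sample X1 values from the description above
x1:2013.25 2013.5
/ The whole part is the calendar year
yr:floor x1
/ Scale the fraction to twelfths and round to the nearest integer
/ (assumed convention: 3 means March, 6 means June)
mth:floor 0.5+12*x1-yr
/ Show the original values next to the decoded parts
flip `x1`yr`mth!(x1;yr;mth)
```

For these two values `yr` is `2013 2013` and `mth` is `3 6`.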
The output is as follows:
- Y = house price per unit area (10000 New Taiwan Dollar/Ping, where Ping is a local unit, 1 Ping = 3.3 square metres)
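To make the units of Y concrete, here is a small conversion to New Taiwan Dollars per square metre, using the stated factor of 1 Ping = 3.3 square metres. The Y values below are made up for the arithmetic and are not taken from the data set; the names `y`, `ntdPerPing` and `ntdPerSqm` are our own.

```q
/ Hypothetical Y values, in units of 10000 NTD per Ping
y:33 66f
/ Undo the 10000 scaling to get NTD per Ping
ntdPerPing:10000*y
/ Divide by 3.3 square metres per Ping to get NTD per square metre
ntdPerSqm:ntdPerPing%3.3
```

A Y of 33 therefore corresponds to 330000 NTD per Ping, or about 100000 NTD per square metre.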