Difference between revisions of "Programming/Kdb/Labs/Exploratory data analysis"

From Thalesians Wiki

Revision as of 14:13, 18 June 2021

In this lab we'll make sense of the following data set from the UCI Machine Learning Repository:

  • Name: Real estate valuation data set
  • Data Set Characteristics: Multivariate
  • Attribute Characteristics: Integer, Real
  • Associated Tasks: Regression
  • Number of Instances: 414
  • Number of Attributes: 7
  • Missing Values? N/A
  • Area: Business
  • Date Donated: 2018.08.18
  • Number of Web Hits: 111,613
  • Original Owner and Donor: Prof. I-Cheng Yeh, Department of Civil Engineering, Tamkang University, Taiwan
  • Relevant papers:
    • Yeh, I.C., and Hsu, T.K. (2018). Building real estate valuation models with comparative approach through case-based reasoning. Applied Soft Computing, 65, 260-271.

There are many data sets on UCI that are worth exploring. We picked this one because it is relatively straightforward and clean.

Let's read the data set information:

The market historical data set of real estate valuation is collected from Sindian Dist., New Taipei City, Taiwan. The real estate valuation is a regression problem. The data set was randomly split into the training data set (2/3 samples) and the testing data set (1/3 samples).

This paragraph describes how the original researchers split the data set. We will split it differently: half for training and half for testing.
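
A fifty-fifty split can be sketched in q as below. This is only an illustration: the row count comes from the "Number of Instances" entry above, and the names idx, half, trainIdx and testIdx are our own, not part of the data set.

```q
/ A minimal sketch of a random fifty-fifty split (all names are hypothetical)
n:414                    / number of instances, as listed above
idx:neg[n]?n             / deal: a random permutation of 0..n-1
half:n div 2             / 207
trainIdx:half#idx        / first 207 shuffled row indices
testIdx:half _ idx       / remaining 207 row indices
```

Assuming the data has been loaded into a table t (a hypothetical name), t trainIdx and t testIdx would then give the two halves.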

Let's read on:

The inputs are as follows:

  • X1 = the transaction date (for example, 2013.250 = 2013 March, 2013.500 = 2013 June, etc.)
  • X2 = the house age (unit: year)
  • X3 = the distance to the nearest MRT station (unit: metre)
  • X4 = the number of convenience stores in the living circle on foot (integer)
  • X5 = the geographic coordinate, latitude (unit: degree)
  • X6 = the geographic coordinate, longitude (unit: degree)
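
The encoding of X1 deserves a closer look. Taking the examples in the description at face value (2013.25 is March, 2013.5 is June), the fractional part appears to count twelfths of a year. The sketch below decodes it on that assumption; the sample values and the names x1, yr and m are ours, and how a fraction of zero should be read is not stated in the description.

```q
/ A sketch, assuming the fractional part of X1 counts twelfths of a year
x1:2012.917 2013.25 2013.5    / hypothetical sample values in the X1 format
yr:floor x1                   / calendar year: 2012 2013 2013
m:floor 0.5+12*x1-yr          / nearest twelfth: 11 3 6
```

Adding 0.5 before taking the floor rounds to the nearest whole number, which absorbs the three-decimal rounding in values such as 2012.917.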

The output is as follows:

  • Y = house price per unit area (10000 New Taiwan Dollar/Ping, where Ping is a local unit, 1 Ping = 3.3 square metres)
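
To make the unit of Y more tangible, we can convert it to New Taiwan Dollars per square metre using the conversion factor given above. The values in y below are made up for illustration and are not taken from the data set; the variable names are ours.

```q
/ A sketch of the unit conversion (y holds hypothetical values, not real data)
y:30 45.5 12.2                / prices in 10000 NTD per Ping
pingToSqm:3.3                 / 1 Ping = 3.3 square metres, as stated above
ntdPerSqm:y*10000%pingToSqm   / New Taiwan Dollars per square metre
```

Since q evaluates right to left, the last line computes y*(10000%3.3), which is the same as multiplying by 10000 and then dividing by 3.3.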