Difference between revisions of "Programming/Kdb/Labs/Exploratory data analysis"
Revision as of 08:32, 19 June 2021
Getting hold of data
In this lab we'll make sense of the following data set from the UCI Machine Learning Repository:
- Name: Real estate valuation data set
- Data Set Characteristics: Multivariate
- Attribute Characteristics: Integer, Real
- Associated Tasks: Regression
- Number of Instances: 414
- Number of Attributes: 7
- Missing Values? N/A
- Area: Business
- Date Donated: 2018.08.18
- Number of Web Hits: 111,613
- Original Owner and Donor: Prof. I-Cheng Yeh, Department of Civil Engineering, Tamkang University, Taiwan
- Relevant papers:
- Yeh, I.C., and Hsu, T.K. (2018). Building real estate valuation models with comparative approach through case-based reasoning. Applied Soft Computing, 65, 260-271.
There are many data sets on UCI that are worth exploring. We picked this one because it is relatively straightforward and clean.
Let's read the data set information:
The market historical data set of real estate valuation is collected from Sindian Dist., New Taipei City, Taiwan. The real estate valuation is a regression problem. The data set was randomly split into the training data set (2/3 samples) and the testing data set (1/3 samples).
This paragraph describes how the original researchers split the data set. We will split it differently: half of the samples for training and half for testing.
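As a minimal sketch of what such a fifty-fifty split could look like in q, we can shuffle the row indices and cut the shuffled list in two. The names n, idx, half, trainIdx and testIdx are ours, chosen for illustration. The data is not loaded at this point, so only indices are produced:

```q
n:414                 / number of instances, per the data set description above
idx:neg[n]?n          / "deal": n distinct random draws from til n, i.e. a shuffle
half:n div 2          / 207
trainIdx:half#idx     / first half of the shuffled row indices
testIdx:half _ idx    / remaining half
```

Once the data sits in a table (say, a hypothetical t), indexing it with trainIdx and testIdx would select the two halves. Setting the random seed with \S beforehand makes the split reproducible.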
Let's read on:
The inputs are as follows:
- X1 = the transaction date (for example, 2013.25=2013 March, 2013.500=2013 June, etc.)
- X2 = the house age (unit: year)
- X3 = the distance to the nearest MRT station (unit: metre)
- X4 = the number of convenience stores in the living circle on foot (integer)
- X5 = the geographic coordinate, latitude (unit: degree)
- X6 = the geographic coordinate, longitude (unit: degree)
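The encoding of X1 can be unpacked with a little arithmetic. The sketch below assumes, based only on the two examples given for X1, that the fractional part equals the month number divided by 12. The function name decodeDate is hypothetical. A fraction that rounds to 0 is left as month 0, since the description does not say which month it denotes:

```q
/ Assumption: the fractional part is (month number)%12, as the two examples suggest
decodeDate:{yr:floor x; mm:floor 0.5+12*x-yr; (yr;mm)}
decodeDate 2013.25     / year 2013, month 3 (March)
decodeDate 2013.5      / year 2013, month 6 (June)
```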
The output is as follows:
- Y = house price per unit area (10000 New Taiwan Dollar/Ping, where Ping is a local unit, 1 Ping = 3.3 square metres)
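Since Y is quoted in units of 10000 New Taiwan Dollar per Ping, it may help to see the conversion to New Taiwan Dollar per square metre spelled out. This is a small sketch using only the constants from the description. The names sqmPerPing and ntdPerSqm are ours, and the input value is made up for illustration:

```q
sqmPerPing:3.3                       / 1 Ping = 3.3 square metres, from the description
ntdPerSqm:{10000*x%sqmPerPing}       / Y (10000 NTD per Ping) -> NTD per square metre
ntdPerSqm 33f                        / hypothetical Y; about 100000 NTD per square metre
```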
Downloading the data set and converting it to CSV
The data set can be downloaded from the data folder https://archive.ics.uci.edu/ml/machine-learning-databases/00477/. The data is supplied as an Excel file, Real estate valuation data set.xlsx. To import this data into kdb+/q, we first convert it to the comma-separated values (CSV) format:
- start Excel;
- File > Open the file Real estate valuation data set.xlsx;
- File > Save As, set "Save as type" to "CSV (Comma delimited)", click "Save".
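Before going further, the conversion can be sanity-checked from a q session. The sketch below assumes that the CSV was saved under the same base name in the current working directory and that it has a single header row. The names path, lines and hdr are ours:

```q
/ Assumption: the CSV was saved under the same base name in the current directory
path:hsym`$"Real estate valuation data set.csv"
lines:read0 path            / one string per line of the file
hdr:","vs first lines       / column names from the header row
count hdr                   / number of columns
-1+count lines              / number of data rows, assuming a single header row
```

If the conversion went well, the row count should agree with the 414 instances listed in the description. The column count tells us how many type characters a later load with 0: would need.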