Difference between revisions of "Programming/Kdb/Labs/Exploratory data analysis"
Line 151: | Line 151: | ||
The resulting plot is confusing because the <tt>transaction_date</tt> is bucketed into (floating point) months: | The resulting plot is confusing because the <tt>transaction_date</tt> is bucketed into (floating point) months: | ||
[[File:time_series_plot_of_house_price_per_unit_area.png]] |
Revision as of 08:56, 19 June 2021
Getting hold of data
In this lab we'll make sense of the following data set from the UCI Machine Learning Repository:
- Name: Real estate valuation data set
- Data Set Characteristics: Multivariate
- Attribute Characteristics: Integer, Real
- Associated Tasks: Regression
- Number of Instances: 414
- Number of Attributes: 7
- Missing Values? N/A
- Area: Business
- Date Donated: 2018.08.18
- Number of Web Hits: 111,613
- Original Owner and Donor: Prof. I-Cheng Yeh, Department of Civil Engineering, Tamkang University, Taiwan
- Relevant papers:
- Yeh, I.C., and Hsu, T.K. (2018). Building real estate valuation models with comparative approach through case-based reasoning. Applied Soft Computing, 65, 260-271.
There are many data sets on UCI that are worth exploring. We picked this one because it is relatively straightforward and clean.
Let's read the data set information:
The market historical data set of real estate valuation is collected from Sindian Dist., New Taipei City, Taiwan. The real estate valuation is a regression problem. The data set was randomly split into the training data set (2/3 samples) and the testing data set (1/3 samples).
This paragraph describes how the original researchers split up the data set. We will split it up differently: fifty-fifty.
Let's read on:
The inputs are as follows:
- X1 = the transaction date (for example, 2013.25=2013 March, 2013.500=2013 June, etc.)
- X2 = the house age (unit: year)
- X3 = the distance to the nearest MRT station (unit: metre)
- X4 = the number of convenience stores in the living circle on foot (integer)
- X5 = the geographic coordinate, latitude (unit: degree)
- X6 = the geographic coordinate, longitude (unit: degree)
The output is as follows:
- Y = house price per unit area (10000 New Taiwan Dollar/Ping, where Ping is a local unit, 1 Ping = 3.3 square metres)
Downloading the data set and converting it to CSV
The data set can be downloaded from the data folder https://archive.ics.uci.edu/ml/machine-learning-databases/00477/. The data is supplied in the form of an excel file, Real estate valuation data set.xlsx. In order to export this data to kdb+/q, we convert it to the comma-separated values (CSV) format:
- start Excel;
- File > Open the file Real estate valuation data set.xlsx;
- File > Save As, set "Save as type" to "CSV (Comma delimited)", click "Save".
Opening the data set in kdb+/q
We read in the resulting CSV file as a kdb+/q table:
t:("HFFFHFFF";enlist",")0:`$":S:/dev/bodleian/teaching/kdb-q/Real estate valuation data set.csv"
(Here you need to replace our path with the corresponding path on your machine.)
We have specified the type for each column as "HFFFHFFF". Here "H" stands for short and "F" stands for float.
We can examine the meta data for the resulting table with
meta t
The result:
c | t f a --------------------------------------| ----- No | h X1 transaction date | f X2 house age | f X3 distance to the nearest MRT station| f X4 number of convenience stores | h X5 latitude | f X6 longitude | f Y house price of unit area | f
It's somewhat inconvenient to work in q with table column names containing spaces, so we rename the columns as follows:
t:`no`transaction_date`house_age`distance_to_nearest_mrt`number_of_convenience_stores`latitude`longitude`house_price_per_unit_area xcol t
We check the meta data again:
meta t
The result:
c | t f a ----------------------------| ----- no | h transaction_date | f house_age | f distance_to_nearest_mrt | f number_of_convenience_stores| h latitude | f longitude | f house_price_per_unit_area | f
Producing the time series plot of house_price_per_unit_area
We look at the first 10 rows
select[10] from t
no transaction_date house_age distance_to_nearest_mrt number_of_convenience_s.. -----------------------------------------------------------------------------.. 1 2012.917 32 84.87882 10 .. 2 2012.917 19.5 306.5947 9 .. 3 2013.583 13.3 561.9845 5 .. 4 2013.5 13.3 561.9845 5 .. 5 2012.833 5 390.5684 5 .. 6 2012.667 7.1 2175.03 3 .. 7 2012.667 34.5 623.4731 7 .. 8 2013.417 20.3 287.6025 6 .. 9 2013.5 31.7 5512.038 1 .. 10 2013.417 17.9 1783.18 3 ..
and observe that the data is not sorted by transaction_date. We therefore sort it by transaction_date (in-place, hence `t and not t in the following command):
`transaction_date xasc `t
meta t reveals that the data is now sorted by transaction_date:
meta t
c | t f a ----------------------------| ----- no | h transaction_date | f s house_age | f distance_to_nearest_mrt | f number_of_convenience_stores| h latitude | f longitude | f house_price_per_unit_area | f
We see that the transaction_date now has the "sorted" (s) attribute (a).
Here is our first attempt to produce a time series plot of house_price_per_unit_area:
select transaction_date, house_price_per_unit_area from t
The resulting plot is confusing because the transaction_date is bucketed into (floating point) months: