Difference between revisions of "Programming/Kdb/Labs/Big data in kdb"

From Thalesians Wiki
< Programming‎ | Kdb‎ | Labs
 
(23 intermediate revisions by the same user not shown)
Line 3: Line 3:
Let us create and populate the <tt>trades</tt> table:
Let us create and populate the <tt>trades</tt> table:
<pre>
<pre>
trades : ([] code:`symbol$(); date:`date$(); time:`time$(); price:`float$(); volume:`short$(); mktflag:`symbol$(); comments:());
trades:([]code:`symbol$();date:`date$();time:`time$();price:`float$();volume:`short$();mktflag:`symbol$();comments:());
.[`trades; (); ,; (
.[`trades;();,;(
    (`CLM16; 2016.04.07; 10:20:00.329; 38.3  ; 4h; `E; "NYMEX/CME Group trade");
  (`CLM16;2016.04.07;10:20:00.329;38.3  ;4h;`E;"NYMEX/CME Group trade");
    (`GCM16; 2016.04.07; 12:00:00.055; 1239.7; 6h; `E; "");
  (`GCM16;2016.04.07;12:00:00.055;1239.7;6h;`E;"");
    (`GCM16; 2016.04.07; 12:37:02.106; 1240.5; 1h; `E; "");
  (`GCM16;2016.04.07;12:37:02.106;1240.5;1h;`E;"");
    (`CLM16; 2016.04.07; 13:00:00.128; 38.04 ; 3h; `E; "NYMEX/CME Group trade");
  (`CLM16;2016.04.07;13:00:00.128;38.04 ;3h;`E;"NYMEX/CME Group trade");
    (`VXK16; 2016.04.07; 13:22:05.617; 18.85 ; 5h; `E; "");
  (`VXK16;2016.04.07;13:22:05.617;18.85 ;5h;`E;"");
    (`GCM16; 2016.04.07; 14:35:01.241; 1241.2; 1h; `E; "");
  (`GCM16;2016.04.07;14:35:01.241;1241.2;1h;`E;"");
    (`GCM16; 2016.04.08; 10:13:01.048; 1240f ; 3h; `E; "");
  (`GCM16;2016.04.08;10:13:01.048;1240f ;3h;`E;"");
    (`VXK16; 2016.04.08; 11:34:53.417; 18.53 ; 1h; `E; "Transaction represents a trade in two contract months in the same class");
  (`VXK16;2016.04.08;11:34:53.417;18.53 ;1h;`E;"Transaction represents a trade in two contract months in the same class");
    (`CLM16; 2016.04.08; 12:00:00.227; 40.61 ; 3h; `E; "NYMEX/CME Group trade");
  (`CLM16;2016.04.08;12:00:00.227;40.61 ;3h;`E;"NYMEX/CME Group trade");
    (`VXK16; 2016.04.08; 12:44:00.684; 18.44 ; 2h; `E; "Transaction represents a trade in two contract months in the same class");
  (`VXK16;2016.04.08;12:44:00.684;18.44 ;2h;`E;"Transaction represents a trade in two contract months in the same class");
    (`VXK16; 2016.04.08; 12:45:33.130; 18.49 ; 1h; `E; "Transaction represents a trade in two contract months in the same class");
  (`VXK16;2016.04.08;12:45:33.130;18.49 ;1h;`E;"Transaction represents a trade in two contract months in the same class");
    (`CLM16; 2016.04.08; 15:20:02.000; 40.78 ; 3h; `E; "NYMEX/CME Group trade");
  (`CLM16;2016.04.08;15:20:02.000;40.78 ;3h;`E;"NYMEX/CME Group trade");
    (`CLM16; 2016.04.11; 11:00:00.105; 41.43 ; 2h; `E; "NYMEX/CME Group trade");
  (`CLM16;2016.04.11;11:00:00.105;41.43 ;2h;`E;"NYMEX/CME Group trade");
    (`VXK16; 2016.04.11; 14:00:00.829; 18.35 ; 1h; `E; "");
  (`VXK16;2016.04.11;14:00:00.829;18.35 ;1h;`E;"");
    (`VXK16; 2016.04.11; 15:14:58.775; 19.05 ; 2h; `E; "");
  (`VXK16;2016.04.11;15:14:58.775;19.05 ;2h;`E;"");
    (`GCM16; 2016.04.11; 16:00:00.044; 1257.9; 1h; `E; "");
  (`GCM16;2016.04.11;16:00:00.044;1257.9;1h;`E;"");
    (`GCM16; 2016.04.11; 16:28:34.311; 1258.7; 1h; `E; ""))]
  (`GCM16;2016.04.11;16:28:34.311;1258.7;1h;`E;""))]
</pre>
</pre>


Line 64: Line 64:
or using <tt>load</tt>, in which ase the loaded table will be assigned to a new variable, that variable's name matching the name of the file; in this particular case, <tt>tradesfile</tt>:
or using <tt>load</tt>, in which ase the loaded table will be assigned to a new variable, that variable's name matching the name of the file; in this particular case, <tt>tradesfile</tt>:
<pre>
<pre>
load `$":C:/Users/Paul Bilokon/Documents/tradesfile"
load`$":C:/Users/Paul Bilokon/Documents/tradesfile"
</pre>
</pre>


=Splayed tables=
=Splayed tables=
==Introduction==


Instead of saving the table as a single binary file, we could save it '''splayed''', i.e. so that each column is saved in a separate binary file.
Instead of saving the table as a single binary file, we could save it '''splayed''', i.e. so that each column is saved in a separate binary file.
Line 108: Line 110:
We couldn't save the original <tt>trades</tt> table because the columns of type symbol, namely <tt>code</tt> and <tt>mktflag</tt>, weren't enumerated. We shall now show how to enumerate them. Before proceeding, delete the directory <tt>tradesdir</tt> with all its contents, in case you created that directory by following the steps outlined above, so that we start with a clean slate. To enumerate the symbol columns in <tt>trades</tt>, we use <tt>.Q.en</tt>:
We couldn't save the original <tt>trades</tt> table because the columns of type symbol, namely <tt>code</tt> and <tt>mktflag</tt>, weren't enumerated. We shall now show how to enumerate them. Before proceeding, delete the directory <tt>tradesdir</tt> with all its contents, in case you created that directory by following the steps outlined above, so that we start with a clean slate. To enumerate the symbol columns in <tt>trades</tt>, we use <tt>.Q.en</tt>:
<pre>
<pre>
tradesenum:.Q.en[`$":C:/Users/Paul Bilokon/Documents/tradesdir/";trades]
tradesenum:.Q.en[`$":C:/Users/Paul Bilokon/Documents/data/";trades]
</pre>
This will create the directory <tt>data</tt> with a single binary file in it, named <tt>sym</tt>. That file enumerates the values found in the columns of type symbol, namely <tt>`CLM16</tt>, <tt>`GCM16</tt>, <tt>`VXK16</tt>, and <tt>`E</tt>.
 
Once this is done, we run
<pre>
(`$":C:/Users/Paul Bilokon/Documents/data/tradesdir/")set tradesenum
</pre>
The directory now contains the files with names
* <tt>.d</tt>,
* <font color="red"><tt>code</tt></font>,
* <tt>comments</tt>,
* <tt>comments#</tt>,
* <tt>date</tt>,
* <font color="red"><tt>mktflag</tt></font>,
* <tt>price</tt>,
* <tt>time</tt>, and
* <tt>volume</tt>.
Notice that this time, with the symbol columns enumerated, we were able to save the columns of type <tt>symbol</tt> successfully.
 
Note that, unlike the above command,
<pre>
(`$":C:/Users/Paul Bilokon/Documents/data/tradesdir/")set trades
</pre>
will fail with <tt>'type</tt>, as before.
 
==Splayed tables on disc==
 
Let us now take a look at the individual files in the directory <tt>tradesdir</tt>. The files <tt>date</tt>, <tt>price</tt>, <tt>time</tt>, and <tt>volume</tt> contain the serializations of the columns with the corresponding names.
 
Notice that you can serialize the columns individually. The command
<pre>
(`$":C:/Users/Paul Bilokon/Documents/myprice")set trades[`price]
</pre>
will create a file <tt>myprice</tt>, whose contents are identical to those of <tt>tradesdir/price</tt>.
 
Recall that <tt>trades[`price]</tt> is a list of type float:
<pre>
q)trades[`price]
38.3 1239.7 1240.5 38.04 18.85 1241.2 1240 18.53 40.61 18.44 18.49 40.78 41.43 18.35 19.05 1257.9 1258.7
q)type trades[`price]
9h
</pre>
 
When we run
<pre>
(`$":C:/Users/Paul Bilokon/Documents/mycols")set cols trades
</pre>
</pre>
This will create the directory <tt>tradesdir</tt> with a single binary file in it, named <tt>sym</tt>. That file enumerates the values found in the columns of type symbol, namely <tt>`CLM16</tt>, <tt>`GCM16</tt>, <tt>`VXK16</tt>, and <tt>`E</tt>.
we obtain a file, <tt>mycols</tt>, whose contents match those of <tt>tradesdir/.d</tt>.
This file specifies the ''order'' of the columns in the splayed table. (How else would we know it?)
 
Thus <tt>.d</tt> is the serialization of the list of column names (as symbols) in the table.
 
However, when we run
<pre>
(`$":C:/Users/Paul Bilokon/Documents/mycode")set trades[`code]
</pre>
we obtain the file <tt>mycode</tt>, whose contents differ from those of <tt>tradesdir/code</tt>. This is because all the <tt>symbol</tt>s in the enumerated table are stored in <tt>tradesdir/sym</tt>, whereas <tt>tradesdir/code</tt> contains indices of the <tt>symbol</tt>s within <tt>tradesdir/sym</tt>.
 
We will reproduce <tt>tradesdir/code</tt> by running
<pre>
(`$":C:/Users/Paul Bilokon/Documents/mycode")set tradesenum[`code]
</pre>
 
==Enumerated versus unenumerated columns==
 
This underlines the difference between <tt>trades</tt> and <tt>tradesenum</tt>: <tt>.Q.en</tt> returns a copy of the table with the symbol columns enumerated.
 
We can distinguish between the columns that have been enumerated and those that haven't as follows:
<pre>
q)type trades[`code]
11h
q)type tradesenum[`code]
21h
</pre>
although the results of <tt>meta trades</tt> and <tt>meta tradesenum</tt> look identical.
 
Similarly to <tt>tradesdir/code</tt>, the file <tt>tradesdir/mktflag</tt> contains serialized enumerated strings containing indices of the actual strings within <tt>tradesdir/sym</tt>.
 
Each distinct text string, corresponding to each distinct <tt>symbol</tt> value, is stored in <tt>tradesdir/sym</tt> once. Using the terminology from Java, symbols are '''interned'''.
 
==Columns of strings==
 
The column <tt>comments</tt>, whose <tt>type</tt> is <tt>string</tt> (rather than <tt>symbol</tt>) was serialized by <tt>set</tt> to two files:
* <tt>tradesdir/comments</tt> and
* <tt>tradesdir/comments#</tt>.
 
The latter, <tt>comments#</tt>, contains the actual text found in the strings.
 
Unlike the serializations of the enumerated <tt>symbol</tt> columns, where distinct values are stored only once, <tt>comments#</tt> contains repeats.
 
If you open this file in a binary file viewer, you will see that "NYMEX/CME Group trade", for example, occurs multiple times.
 
Using the terminology from Java, serialized <tt>string</tt>s are not interned, whereas serialized <tt>symbol</tt>s are. The file <tt>tradesdir/comments</tt> indexes into
<pre>
tradesdir/comments#
</pre>
 
==Appending to a splayed table==
 
Now suppose that we have
<pre>
moretrades:([]code:`symbol$();date:`date$();time:`time$();price:`float$();volume:`short$();mktflag:`symbol$();comments:());
.[`moretrades;();,;(
  (`GCM16;2016.04.11;16:29:05.890;1257.8;2h;`E;"");
  (`VXK16;2016.04.11;16:29:38.575;19.49 ;1h;`E;"");
  (`VXK16;2016.04.11;16:29:38.575;19.49 ;1h;`E;"");
  (`CLM16;2016.04.11;16:29:45.105;41.45 ;2h;`E;"NYMEX/CME Group trade");
  (`GCM16;2016.04.11;16:29:50.898;1257.8;1h;`E;"");
  (`GCM16;2016.04.12;10:01:05.890;1257.8;2h;`E;"");
  (`GCM16;2016.04.12;10:02:05.287;1257.8;2h;`E;"");
  (`VXK16;2016.04.12;10:03:38.123;19.49 ;1h;`E;"");
  (`VXK16;2016.04.12;10:03:38.123;19.49 ;1h;`E;"");
  (`CLM16;2016.04.12;10:18:45.230;41.45 ;2h;`E;"NYMEX/CME Group trade");
  (`GCM16;2016.04.12;10:18:50.382;1257.8;1h;`E;""))];
</pre>
 
How can we append them to our splayed table?
 
If we were to do
<pre>
(`$":C:/Users/Paul Bilokon/Documents/data/tradesdir/")set .Q.en[`$":C:/Users/Paul Bilokon/Documents/data/";moretrades]
</pre>
the table would be overwritten. However, we don't want to overwrite it; we want to append to it.
 
What we want is
<pre>
upsert[`$":C:/Users/Paul Bilokon/Documents/data/tradesdir/"].Q.en[`$":C:/Users/Paul Bilokon/Documents/data/";moretrades]
</pre>
 
==Sorting a splayed table==
 
Yes, <tt>upsert</tt> works with tables on disc, not just in memory. <tt>upsert</tt> is not alone in this. For example, we can sort the splayed table on disc:
<pre>
`code xasc`$":C:/Users/Paul Bilokon/Documents/tradesdir/"
</pre>
We confirm that the sorted attribute is now set on the column <tt>code</tt>:
<pre>
q)meta get`$":C:/Users/Paul Bilokon/Documents/tradesdir/"
c      | t f a
--------| -----
code    | s  s
date    | d
time    | t
price  | f
volume  | h
mktflag | s
comments| C
</pre>
 
==Applying attributes to a splayed table==
 
<tt>xasc</tt> is not alone in being able to work with tables on disc. Here is how we can set the grouped (<tt>`g#</tt>) attribute on our table:
<pre>
q)@[`$":C:/Users/Paul Bilokon/Documents/tradesdir/";`mktflag;`g#]
`:tradesdir/
q)meta get`$":C:/Users/Paul Bilokon/Documents/tradesdir/"
c      | t f a
--------| -----
code    | s  s
date    | d
time    | t
price  | f
volume  | h
mktflag | s  g
comments| C
</pre>
 
==Compression==
 
Let us recreate the on-disc version of the table
<pre>
trades:([]code:`symbol$();date:`date$();time:`time$();price:`float$();volume:`short$();mktflag:`symbol$();comments:());
.[`trades;();,;(
  (`CLM16;2016.04.07;10:20:00.329;38.3  ;4h;`E;"NYMEX/CME Group trade");
  (`GCM16;2016.04.07;12:00:00.055;1239.7;6h;`E;"");
  (`GCM16;2016.04.07;12:37:02.106;1240.5;1h;`E;"");
  (`CLM16;2016.04.07;13:00:00.128;38.04 ;3h;`E;"NYMEX/CME Group trade");
  (`VXK16;2016.04.07;13:22:05.617;18.85 ;5h;`E;"");
  (`GCM16;2016.04.07;14:35:01.241;1241.2;1h;`E;"");
  (`GCM16;2016.04.08;10:13:01.048;1240f ;3h;`E;"");
  (`VXK16;2016.04.08;11:34:53.417;18.53 ;1h;`E;"Transaction represents a trade in two contract months in the same class");
  (`CLM16;2016.04.08;12:00:00.227;40.61 ;3h;`E;"NYMEX/CME Group trade");
  (`VXK16;2016.04.08;12:44:00.684;18.44 ;2h;`E;"Transaction represents a trade in two contract months in the same class");
  (`VXK16;2016.04.08;12:45:33.130;18.49 ;1h;`E;"Transaction represents a trade in two contract months in the same class");
  (`CLM16;2016.04.08;15:20:02.000;40.78 ;3h;`E;"NYMEX/CME Group trade");
  (`CLM16;2016.04.11;11:00:00.105;41.43 ;2h;`E;"NYMEX/CME Group trade");
  (`VXK16;2016.04.11;14:00:00.829;18.35 ;1h;`E;"");
  (`VXK16;2016.04.11;15:14:58.775;19.05 ;2h;`E;"");
  (`GCM16;2016.04.11;16:00:00.044;1257.9;1h;`E;"");
  (`GCM16;2016.04.11;16:28:34.311;1258.7;1h;`E;""))]
</pre>
but this time use compression:
<pre>
(`$":C:/Users/Paul Bilokon/Documents/tradesdir/";17;2;9)set .Q.en[`$":C:/Users/Paul Bilokon/Documents/tradesdir/";trades]
</pre>
 
Without compression the table occupied 17,595 bytes; with compression &mdash; only 1,182 bytes.
 
At first the three arguments <tt>17 2 9</tt> look cryptic. They actually represent
* the '''logical block size''': a power of 2 between 12 and 20;
* the '''algorithm''':
** 0 &mdash; none,
** 1 &mdash; q IPC,
** 2 &mdash; gzip,
** 3 &mdash; snappy (since V3.4),
** 4 &mdash; lz4hc (since V3.6);
* the '''compression level''': 0-9 for gzip, 1-12 for lz4hc.
 
To read up on file compression in kdb+/q, see https://code.kx.com/q/kb/file-compression/.
 
==Querying splayed tables on disc==
 
Some (but not all) q-sql queries work with tables on disc, for example:
<pre>
q)select from`$":C:/Users/Paul Bilokon/Documents/tradesdir/"where price>100f
code  date      time        price  volume mktflag comments
------------------------------------------------------------
GCM16 2016.04.07 12:00:00.055 1239.7 6      E      ""
GCM16 2016.04.07 12:37:02.106 1240.5 1      E      ""
GCM16 2016.04.07 14:35:01.241 1241.2 1      E      ""
GCM16 2016.04.08 10:13:01.048 1240  3      E      ""
GCM16 2016.04.11 16:00:00.044 1257.9 1      E      ""
GCM16 2016.04.11 16:28:34.311 1258.7 1      E      ""
</pre>
 
However, something like
<pre>
delete price from`$":tradesdir/"
</pre>
wouldn't work; we would need to manipulate the table in-memory and then save down the modified table to disc, overwriting the original.
 
=Partitioned tables=
 
==Introduction==
 
In addition to splaying a table, we may partition it: for each distinct value <tt>v</tt> of a column that defines the partition, a distinct directory will be created, where the binary files with serialized columns will be stored, but including only those rows with value <tt>v</tt> in the partition column.
 
The standard <tt>.Q.dpft</tt> function performs the partitioning.
 
The name of that function stands for its four parameters:
* directory;
* partition;
* sort field;
* table name.
 
To store the <tt>trades</tt> table, which is, as before,
<pre>
trades:([]code:`symbol$();date:`date$();time:`time$();price:`float$();volume:`short$();mktflag:`symbol$();comments:());
.[`trades;();,;(
  (`CLM16;2016.04.07;10:20:00.329;38.3  ;4h;`E;"NYMEX/CME Group trade");
  (`GCM16;2016.04.07;12:00:00.055;1239.7;6h;`E;"");
  (`GCM16;2016.04.07;12:37:02.106;1240.5;1h;`E;"");
  (`CLM16;2016.04.07;13:00:00.128;38.04 ;3h;`E;"NYMEX/CME Group trade");
  (`VXK16;2016.04.07;13:22:05.617;18.85 ;5h;`E;"");
  (`GCM16;2016.04.07;14:35:01.241;1241.2;1h;`E;"");
  (`GCM16;2016.04.08;10:13:01.048;1240f ;3h;`E;"");
  (`VXK16;2016.04.08;11:34:53.417;18.53 ;1h;`E;"Transaction represents a trade in two contract months in the same class");
  (`CLM16;2016.04.08;12:00:00.227;40.61 ;3h;`E;"NYMEX/CME Group trade");
  (`VXK16;2016.04.08;12:44:00.684;18.44 ;2h;`E;"Transaction represents a trade in two contract months in the same class");
  (`VXK16;2016.04.08;12:45:33.130;18.49 ;1h;`E;"Transaction represents a trade in two contract months in the same class");
  (`CLM16;2016.04.08;15:20:02.000;40.78 ;3h;`E;"NYMEX/CME Group trade");
  (`CLM16;2016.04.11;11:00:00.105;41.43 ;2h;`E;"NYMEX/CME Group trade");
  (`VXK16;2016.04.11;14:00:00.829;18.35 ;1h;`E;"");
  (`VXK16;2016.04.11;15:14:58.775;19.05 ;2h;`E;"");
  (`GCM16;2016.04.11;16:00:00.044;1257.9;1h;`E;"");
  (`GCM16;2016.04.11;16:28:34.311;1258.7;1h;`E;""))]
</pre>
we can execute the following code:
<pre>
{
  `trd set select from trades where date=x;
  .Q.dpft[`$":C:/Users/Paul Bilokon/Documents/mydb";x;`code;`trd];
  delete trd from`.;}each distinct trades[`date]
</pre>
 
We will thus call <tt>.Q.dpft</tt> once for each of the distinct values in the <tt>date</tt> column, resulting in the following calls:
<pre>
.Q.dpft[`$":C:/Users/Paul Bilokon/Documents/mydb"; 2016.04.07; `code; `trd1];
.Q.dpft[`$":C:/Users/Paul Bilokon/Documents/mydb"; 2016.04.08; `code; `trd2];
.Q.dpft[`$":C:/Users/Paul Bilokon/Documents/mydb"; 2016.04.11; `code; `trd3];
</pre>
 
Here <tt>`trd1`trd2`trd3</tt> are, respectively, the names of the three tables obtained by selecting from <tt>trades</tt> the rows where <tt>date</tt> has one of the values, respectively, 2016.04.07, 2016.04.08, 2016.04.11:
<pre>
select from trades where date=2016.04.07;
select from trades where date=2016.04.08;
select from trades where date=2016.04.11;
</pre>
 
We could not implement our calls as follows
<pre>
.Q.dpft[`$":C:/Users/Paul Bilokon/Documents/mydb"; 2016.04.07; `code; select from trades where date=2016.04.07];
.Q.dpft[`$":C:/Users/Paul Bilokon/Documents/mydb"; 2016.04.08; `code; select from trades where date=2016.04.08];
.Q.dpft[`$":C:/Users/Paul Bilokon/Documents/mydb"; 2016.04.11; `code; select from trades where date=2016.04.11];
</pre>
because of this limitation of <tt>.Q.dpft</tt>: it expects, as its fourth argument, the table ''name'', not the actual table.
 
The table is always saved to disc in a directory with that name. This is why we created the temporary table <tt>trd</tt>, with the date we will write to a partition in the global namespace (<tt>`.</tt>) using <tt>`trd set...</tt> rather than, for example, <tt>trd:</tt>. The latter will create the variable <tt>trd</tt> in the local scope of our anonymous function defined in the <tt>{...}</tt> block. We wouldn't be able to reference that variable as <tt>`trd</tt>.
 
At the end of our local function we delete the temporary table from the global namespace:
<pre>
delete trd from `.;
</pre>
 
The following directory structure will be obtained by evaluating our code. Under <tt>C:/Users/Paul Bilokon/Documents/mydb</tt>:
* <tt>2016.04.07</tt>
** <tt>trd</tt>
*** <tt>.d</tt>
*** <tt>code</tt>
*** <tt>comments</tt>
*** <tt>comments#</tt>
*** <tt>date</tt>
*** <tt>mktflag</tt>
*** <tt>price</tt>
*** <tt>time</tt>
*** <tt>volume</tt>
* <tt>2016.04.08</tt>
** <tt>trd</tt>
*** <tt>.d</tt>
*** ...
*** <tt>volume</tt>
* <tt>2016.04.11</tt>
** <tt>trd</tt>
*** <tt>.d</tt>
*** ...
*** <tt>volume</tt>
* <tt>sym</tt>
 
In the directory that we have specified to <tt>.Q.dpft</tt>, we have one subdirectory per each partition (<tt>2016.04.07</tt>, <tt>2016.04.08</tt>, and <tt>2016.04.11</tt>). Inside those directories we have a subdirectory whose name is that of the table that we provided to <tt>.Q.dpft</tt> as an argument: <tt>trd</tt>. Inside <tt>trd</tt>, we have the same files as we encountered when splaying the table, with the exception of <tt>sym</tt>, which now resides two levels above. As before, the file <tt>sym</tt> enumerates the columns of type <tt>symbol</tt>
 
Notice that we didn't have to call <tt>.Q.en</tt> explicitly this time: <tt>.Q.dpft</tt> did this for us.
 
==Sorting and the <tt>`p#</tt> attribute==
 
The table that we have obtained on disc is
<pre>
date      code  time        price  volume mktflag comments
-----------------------------------------------------------------------------------------------------------------------------
2016.04.07 CLM16 10:20:00.329 38.3  4      E      "NYMEX/CME Group trade"
2016.04.07 CLM16 13:00:00.128 38.04  3      E      "NYMEX/CME Group trade"
2016.04.07 GCM16 12:00:00.055 1239.7 6      E      ""
2016.04.07 GCM16 12:37:02.106 1240.5 1      E      ""
2016.04.07 GCM16 14:35:01.241 1241.2 1      E      ""
2016.04.07 VXK16 13:22:05.617 18.85  5      E      ""
2016.04.08 CLM16 12:00:00.227 40.61  3      E      "NYMEX/CME Group trade"
2016.04.08 CLM16 15:20:02.000 40.78  3      E      "NYMEX/CME Group trade"
2016.04.08 GCM16 10:13:01.048 1240  3      E      ""
2016.04.08 VXK16 11:34:53.417 18.53  1      E      "Transaction represents a trade in two contract months in the same class"
2016.04.08 VXK16 12:44:00.684 18.44  2      E      "Transaction represents a trade in two contract months in the same class"
2016.04.08 VXK16 12:45:33.130 18.49  1      E      "Transaction represents a trade in two contract months in the same class"
2016.04.11 CLM16 11:00:00.105 41.43  2      E      "NYMEX/CME Group trade"
2016.04.11 GCM16 16:00:00.044 1257.9 1      E      ""
2016.04.11 GCM16 16:28:34.311 1258.7 1      E      ""
2016.04.11 VXK16 14:00:00.829 18.35  1      E      ""
2016.04.11 VXK16 15:14:58.775 19.05  2      E      ""
</pre>
 
Can you notice any differences? Notice that, within each partition, the table has been sorted by <tt>code</tt> (which we passed as a parameter to <tt>.Q.dpft</tt>).
 
The attribute that was set on that column, however, is not <tt>`s#</tt> (sorted) but <tt>`p#</tt> (parted):
<pre>
q)meta trd
c      | t f a
--------| -----
date    | d
code    | s  p
time    | t
price  | f
volume  | h
mktflag | s
comments| C
</pre>
 
The '''parted''' attribute asserts that all the equal values in the list are adjacent. This speeds up the querying by <tt>code</tt>.
 
==Adding another table==
 
We can add another table to the <tt>mydb</tt> directory.
 
Suppose that we have the following (again, highly contrived) table of quotes:
<pre>
quotes:([]code:`symbol$();date:`date$();time:`time$();bidprice:`float$();bidsize:`short$();askprice:`float$();asksize:`short$();mktflag:`symbol$());
.[`quotes;();,;(
  (`CLM16;2016.04.07;10:15:00.010;38.34 ;86h;38.35 ;3h ;`E);
  (`GCM16;2016.04.07;11:02:16.663;1241.4;22h;1241.5;1h ;`E);
  (`CLM16;2016.04.07;12:05:00.303;38.12 ;7h ;38.13 ;13h;`E);
  (`CLM16;2016.04.07;12:22:00.486;38.11 ;16h;38.12 ;8h ;`E);
  (`GCM16;2016.04.07;13:00:00.205;1238.6;8h ;1238.7;7h ;`E);
  (`CLM16;2016.04.07;15:00:00.051;38.52 ;9h ;38.53 ;18h;`E);
  (`GCM16;2016.04.07;15:20:02.224;1240.9;6h ;1241f ;1h ;`E);
  (`CLM16;2016.04.08;10:53:00.002;40.83 ;6h ;40.84 ;66h;`E);
  (`CLM16;2016.04.08;13:56:30.070;40.54 ;38h;40.56 ;58h;`E);
  (`CLM16;2016.04.08;15:20:02.000;40.77 ;26h;40.79 ;44h;`E);
  (`CLM16;2016.04.08;15:21:43.786;40.76 ;3h ;40.77 ;28h;`E))]
</pre>
 
We run the following code...
<pre>
{
  `qt set select from quotes where date=x;
  .Q.dpft[`$":C:/Users/Paul Bilokon/Documents/mydb";x;`code;`qt];
  delete qt from `.;} each distinct quotes[`date]
</pre>
...and obtain the following directory structure under <tt>C:/Users/Paul Bilokon/Documents/mydb</tt>:
* <tt>2016.04.07</tt>
** <tt>trd</tt>
*** <tt>.d</tt>
*** <tt>code</tt>
*** <tt>comments</tt>
*** <tt>comments#</tt>
*** <tt>date</tt>
*** <tt>mktflag</tt>
*** <tt>price</tt>
*** <tt>time</tt>
*** <tt>volume</tt>
** <tt>qt</tt>
*** <tt>.d</tt>
*** <tt>askprice</tt>
*** <tt>asksize</tt>
*** <tt>bidprice</tt>
*** <tt>bidsize</tt>
*** <tt>code</tt>
*** <tt>date</tt>
*** <tt>mktflag</tt>
*** <tt>time</tt>
* <tt>2016.04.08</tt>
** <tt>trd</tt>
*** <tt>.d</tt>
*** ...
*** <tt>volume</tt>
** <tt>qt</tt>
*** <tt>.d</tt>
*** ...
*** <tt>time</tt>
* <tt>2016.04.11</tt>
** <tt>trd</tt>
*** <tt>.d</tt>
*** ...
*** <tt>volume</tt>
** <tt>qt</tt>
*** <tt>.d</tt>
*** ...
*** <tt>time</tt>
* <tt>sym</tt>
 
==Loading partitioned tables==
 
To load partitioned tables into kdb+, you can pass the name of the directory that contains them as a command line parameter to the q executable (we killed the already opened q instance first before restarting it in this manner):
<pre>
C:\q\w64>q "C:\Users\Paul Bilokon\Documents\mydb"
</pre>
 
==Caveat==
 
However, we are in for a surprise:
<pre>
q)\v
`date`sym`trd
</pre>
 
Where's <tt>qt</tt>? What's the matter?
 
It turns out that, while <tt>trd</tt> has the partitions <tt>2016.04.07</tt>, <tt>2016.04.08</tt>, and <tt>2016.04.11</tt>, <tt>qt</tt> had only <tt>2016.04.07</tt> and <tt>2016.04.08</tt>.
 
Each table in <tt>mydb</tt> must have exactly the same partitions.
 
We remedy things by recreating <tt>quotes</tt> and then storing <tt>quotes</tt> as <tt>qt</tt> using
<pre>
{
  `qt set select from quotes where date=x;
  .Q.dpft[`$":C:/Users/Paul Bilokon/Documents/mydb";x;`code;`qt];
  delete qt from`.;} each 2016.04.07 2016.04.08 2016.04.11
</pre>
 
On quitting kdb+ and then restarting it with
<pre>
C:\q\w64>q "C:\Users\Paul Bilokon\Documents\mydb"
</pre>
we see that now both <tt>trd</tt> and <tt>qt</tt> are present:
<pre>
q)\v
`date`qt`sym`trd
</pre>
 
We can examine the tables by running
<pre>
select from trd
</pre>
and
<pre>
select from qt
</pre>
 
==The importance of current directory==
 
Check (by running <tt>\cd</tt>) that the current directory is
<pre>
"C:\\Users\\Paul Bilokon\\Documents\\mydb"
</pre>
Notice that if we change it, say, with
<pre>
\cd C:/q/w64
</pre>
our partitioned tables stop "working":
<pre>
q)select from trd
'./2016.04.07/trd/code: The system cannot find the path specified.
</pre>
 
Changing the current directory in the presence of partitioned tables wasn't a good move. We fix things by changing it back:
<pre>
\cd C:/Users/Paul Bilokon/Documents/mydb
</pre>
Notice that, two other variables are present in the restarted kdb+ instance: <tt>date</tt> containing the list of the partitions:
<pre>
2016.04.07 2016.04.08 2016.04.11
</pre>
and <tt>sym</tt>, containing the list of enumerated symbols:
<pre>
`CLM16`GCM16`VXK16`E
</pre>
 
==Appending rows to partitioned tables==
 
Now let us suppose we want to append the following rows:
<pre>
morequotes:([]code:`symbol$();date:`date$();time:`time$();bidprice:`float$();bidsize:`short$();askprice:`float$();asksize:`short$();mktflag:`symbol$());
.[`morequotes;();,;(
  (`GCM16;2016.04.07;15:23:21.147;1241.2;4h ;1241.3;1h;`E);
  (`GCM16;2016.04.07;15:33:04.535;1241.7;17h;1241.6;2h;`E))]
</pre>
Na&iuml;vely, we attempt to run
<pre>
qt:morequotes;
.Q.dpft[`$":C:/Users/Paul Bilokon/Documents/mydb";2016.04.07;`code;`qt];
delete qt from`.;
</pre>
 
On restarting the kdb+ instance (we have to do this because we replaced <tt>qt</tt> with our temporary table; we wrote to <tt>qt</tt> image on disc but the in-memory representation is not updated) with
<pre>
C:\q\w64>q "C:\Users\Paul Bilokon\Documents\mydb"
</pre>
we find that we did not append to the <tt>2016.04.07</tt> partition of <tt>qt</tt> but replaced it:
<pre>
select from qt
</pre>
<pre>
date      code  time        bidprice bidsize askprice asksize mktflag
-----------------------------------------------------------------------
2016.04.07 GCM16 15:23:21.147 1241.2  4      1241.3  1      E
2016.04.07 GCM16 15:33:04.535 1241.7  17      1241.6  2      E
2016.04.08 CLM16 10:53:00.002 40.83    6      40.84    66      E
2016.04.08 CLM16 13:56:30.070 40.54    38      40.56    58      E
2016.04.08 CLM16 15:20:02.000 40.77    26      40.79    44      E
2016.04.08 CLM16 15:21:43.786 40.76    3      40.77    28      E
</pre>
 
This exposes another limitation of <tt>.Q.dpft</tt>: since it does not allow us to append to a partition but overwrites it, it requires the entire partition to be in memory for it to be written to a partitioned table. For some very large partitions this is infeasible: they simply won't fit in memory.
 
To ''append'' to a partition on disc, we need a modified version of <tt>.Q.dpft</tt>.
 
Let us examine the code of <tt>.Q.dpft</tt>. We see that it is written in k:
<pre>
q).Q.dpft
k){[d;p;f;t;s]i:<t f;r:+enxs[$;d;;s]`. . `\:t;{[d;t;i;u;x]@[d;x;:;u t[x]i]}[d:par[d;p;t];r;i;]'[(::;`p#)f=!r;!r];@[d;`.d;:;f,r@&~f=r:!r];t}[;;;;`sym]
</pre>
The part of the code that we are interested in is
<pre>
@[d;x;:;u t[x]i]
</pre>
here we replace <tt>:</tt> with <tt>,</tt> and obtain
<pre>
k)dpftappend:{[d;p;f;t;s]i:<t f;r:+.Q.enxs[$;d;;s]`. . `\:t;{[d;t;i;u;x]@[d;x;,;u t[x]i]}[d:.Q.par[d;p;t];r;i;]'[(::;`p#)f=!r;!r];@[d;`.d;:;f,r@&~f=r:!r];t}[;;;;`sym]
</pre>
We can test it by recreating the quotes table, writing it to disc with
<pre>
{
  `qt set select from quotes where date=x;
  .Q.dpft[`$":C:/Users/Paul Bilokon/Documents/mydb";x;`code;`qt];
  delete qt from`.;} each 2016.04.07 2016.04.08 2016.04.11
</pre>
again, then appending <tt>morequotes</tt> with
<pre>
qt:morequotes;
dpftappend[`$":C:/Users/Paul Bilokon/Documents/mydb";2016.04.07;`code;`qt];
delete qt from`.;
</pre>
 
(We remark that this modification of <tt>.Q.dpft</tt> was suggested by Pawan Singh &mdash; see https://groups.google.com/forum/\#!topic/personal-kdbplus/51rxAntpcOc.)
 
This time we have achieved our goal of appending to a partition in <tt>qt</tt>: after restarting again with
<pre>
C:\q\w64>q "C:\Users\Paul Bilokon\Documents\mydb"
</pre>
we check that we have indeed appended to a partition on disc:
<pre>
select from qt
</pre>
<pre>
date      code  time        bidprice bidsize askprice asksize mktflag
-----------------------------------------------------------------------
2016.04.07 CLM16 10:15:00.010 38.34    86      38.35    3      E
2016.04.07 CLM16 12:05:00.303 38.12    7      38.13    13      E
2016.04.07 CLM16 12:22:00.486 38.11    16      38.12    8      E
2016.04.07 CLM16 15:00:00.051 38.52    9      38.53    18      E
2016.04.07 GCM16 11:02:16.663 1241.4  22      1241.5  1      E
2016.04.07 GCM16 13:00:00.205 1238.6  8      1238.7  7      E
2016.04.07 GCM16 15:20:02.224 1240.9  6      1241    1      E
2016.04.07 GCM16 15:23:21.147 1241.2  4      1241.3  1      E
2016.04.07 GCM16 15:33:04.535 1241.7  17      1241.6  2      E
2016.04.08 CLM16 10:53:00.002 40.83    6      40.84    66      E
2016.04.08 CLM16 13:56:30.070 40.54    38      40.56    58      E
2016.04.08 CLM16 15:20:02.000 40.77    26      40.79    44      E
2016.04.08 CLM16 15:21:43.786 40.76    3      40.77    28      E
</pre>
 
==Compression==
 
While it is not possible to pass in compression parameters to <tt>.Q.dpft</tt>, kdb+ can write compressed files by default.
 
This is governed by the zip defaults <tt>.z.zd</tt>. Set this as an integer vector, e.g.
<pre>
.z.zd:17 2 9
</pre>
and all files will be compressed in this way (unless given different parameters).
 
To disable compression by default, set
<pre>
.z.zd:3#0
</pre>
 
==Maintenance==
 
For maintenance of partitioned tables, be sure to examine <tt>dbmaint.q</tt>: https://github.com/KxSystems/kdb/blob/master/utils/dbmaint.md:
<blockquote>
The script <tt>dbmaint.q</tt> contains utility functions for maintenance of partitioned database tables in kdb+. It should not be used for splayed tables which are not partitioned, tables stored as single files or in memory tables.
</blockquote>
 
In particular, this script defines the functions <tt>addcol</tt>, <tt>castcol</tt>, <tt>clearattrcol</tt>, <tt>copycol</tt>, <tt>deletecol</tt>, <tt>findcol</tt>, <tt>fixtable</tt>, <tt>fncol</tt>, <tt>listcols</tt>, <tt>renamecol</tt>, <tt>reordercols</tt>, <tt>setattrcol</tt>, <tt>addtable</tt>, <tt>rentable</tt>. See the documentation for details.
 
Another useful function for working with partitioned tables is <tt>.Q.chk</tt> (fill historical database, hdb). It is run as follows:
<pre>
.Q.chk x
</pre>
where <tt>x</tt> is a path to the historical database. <tt>.Q.chk</tt> fills tables missing from partitions using the most recent partition as a template, and reports which partitions (but not which tables) it is fixing:
<pre>
q).Q.chk[`:hdb]
()
()
,`:/db/2009.01.04
,`:/db/2009.01.03
</pre>
 
Q must have write permission for the hdb area to create the missing tables.

Latest revision as of 22:14, 21 December 2021

Example table

Let us create and populate the trades table:

trades:([]code:`symbol$();date:`date$();time:`time$();price:`float$();volume:`short$();mktflag:`symbol$();comments:());
.[`trades;();,;(
  (`CLM16;2016.04.07;10:20:00.329;38.3  ;4h;`E;"NYMEX/CME Group trade");
  (`GCM16;2016.04.07;12:00:00.055;1239.7;6h;`E;"");
  (`GCM16;2016.04.07;12:37:02.106;1240.5;1h;`E;"");
  (`CLM16;2016.04.07;13:00:00.128;38.04 ;3h;`E;"NYMEX/CME Group trade");
  (`VXK16;2016.04.07;13:22:05.617;18.85 ;5h;`E;"");
  (`GCM16;2016.04.07;14:35:01.241;1241.2;1h;`E;"");
  (`GCM16;2016.04.08;10:13:01.048;1240f ;3h;`E;"");
  (`VXK16;2016.04.08;11:34:53.417;18.53 ;1h;`E;"Transaction represents a trade in two contract months in the same class");
  (`CLM16;2016.04.08;12:00:00.227;40.61 ;3h;`E;"NYMEX/CME Group trade");
  (`VXK16;2016.04.08;12:44:00.684;18.44 ;2h;`E;"Transaction represents a trade in two contract months in the same class");
  (`VXK16;2016.04.08;12:45:33.130;18.49 ;1h;`E;"Transaction represents a trade in two contract months in the same class");
  (`CLM16;2016.04.08;15:20:02.000;40.78 ;3h;`E;"NYMEX/CME Group trade");
  (`CLM16;2016.04.11;11:00:00.105;41.43 ;2h;`E;"NYMEX/CME Group trade");
  (`VXK16;2016.04.11;14:00:00.829;18.35 ;1h;`E;"");
  (`VXK16;2016.04.11;15:14:58.775;19.05 ;2h;`E;"");
  (`GCM16;2016.04.11;16:00:00.044;1257.9;1h;`E;"");
  (`GCM16;2016.04.11;16:28:34.311;1258.7;1h;`E;""))]

Each row represents a trade in a futures contract. The value in the code column identifies that contract. The date and time columns give the dates and times of the trades as reported by the exchange. The price column contains the trade prices per contract. The volume column contains the number of contracts traded. The name of the mktflag column stands for "market flag", which distinguishes the pit trades (in which case the value in the column is `P) from the electronic trades (in which case the value is `E). In our table, which represents a hypothetical subsample from some large data set, all the trades are electronic trades. The comments column contains arbitrary comments on the data rows as strings — not as symbols. In practice, the specific comments in our example ("NYMEX/CME Group trade", "Transaction represents a trade in two contract months in the same class") would be implemented as symbol or numeric flags rather than as strings. We chose to use the string data type to demonstrate its use in a table.

Saving a table as a single binary file

We can use the following command to save this table as a binary file:

`:tradesfile set trades

The resulting file will have 965 bytes and will reside in the current directory, which you can check with

\cd

By default this will be the directory containing the q.exe executable. To save it in a different directory, we could use something like this:

(`$":C:/Users/Paul Bilokon/Documents/tradesfile") set trades

The brackets are needed as

`$":C:/Users/Paul Bilokon/Documents/tradesfile" set trades

is interpreted by q, from right to left, as

`$(":C:/Users/Paul Bilokon/Documents/tradesfile" set trades)

which isn't what we want.

In the case of success the return is the file path as a symbol:

q)(`$":C:/Users/Paul Bilokon/Documents/tradesfile") set trades
`$":C:/Users/Paul Bilokon/Documents/tradesfile"

Loading a table from a single binary file

We can then load the table from a file either using get, in which case we can specify the new table name (by assigning to a variable with that name the result of get):

trades1:get`$":C:/Users/Paul Bilokon/Documents/tradesfile"

or using load, in which ase the loaded table will be assigned to a new variable, that variable's name matching the name of the file; in this particular case, tradesfile:

load`$":C:/Users/Paul Bilokon/Documents/tradesfile"

Splayed tables

Introduction

Instead of saving the table as a single binary file, we could save it splayed, i.e. so that each column is saved in a separate binary file.

Queries often refer to a small subset of such a table's columns. Storing the columns in separate files allows kdb+ to save time by loading only some of them.

To achieve this, we would normally call set with a directory path, rather than a file path, as its first argument.

However, the following will fail with `type:

(`$":C:/Users/Paul Bilokon/Documents/tradesdir/")set trades

Unenumerated columns

set will disallow us to splay the table trades since it contains unenumerated columns of type symbol. (The same error would be thrown if the table were keyed; we would need to remove the key with

()xkey trades

before saving the table.)

We can remove those columns and save the table in one line with

(`$":C:/Users/Paul Bilokon/Documents/tradesdir/")set `code`mktflag _ trades

Notice the traling / in the directory path.

As a result of running this, the directory tradesdir will be created under C:/Users/Paul Bilokon/Documents. It will contain the files with names

  • .d,
  • comments,
  • comments#,
  • date,
  • price,
  • time, and
  • volume.

Enumerating the columns

We couldn't save the original trades table because the columns of type symbol, namely code and mktflag, weren't enumerated. We shall now show how to enumerate them. Before proceeding, delete the directory tradesdir with all its contents, in case you created that directory by following the steps outlined above, so that we start with a clean slate. To enumerate the symbol columns in trades, we use .Q.en:

tradesenum:.Q.en[`$":C:/Users/Paul Bilokon/Documents/data/";trades]

This will create the directory data with a single binary file in it, named sym. That file enumerates the values found in the columns of type symbol, namely `CLM16, `GCM16, `VXK16, and `E.

Once this is done, we run

(`$":C:/Users/Paul Bilokon/Documents/data/tradesdir/")set tradesenum

The directory now contains the files with names

  • .d,
  • code,
  • comments,
  • comments#,
  • date,
  • mktflag,
  • price,
  • time, and
  • volume.

Notice that this time, with the symbol columns enumerated, we were able to save the columns of type symbol successfully.

Note that, unlike the above command,

(`$":C:/Users/Paul Bilokon/Documents/data/tradesdir/")set trades

will fail with 'type, as before.

Splayed tables on disc

Let us now take a look at the individual files in the directory tradesdir. The files date, price, time, and volume contain the serializations of the columns with the corresponding names.

Notice that you can serialize the columns individually. The command

(`$":C:/Users/Paul Bilokon/Documents/myprice")set trades[`price]

will create a file myprice, whose contents are identical to those of tradesdir/price.

Recall that trades[`price] is a list of type float:

q)trades[`price]
38.3 1239.7 1240.5 38.04 18.85 1241.2 1240 18.53 40.61 18.44 18.49 40.78 41.43 18.35 19.05 1257.9 1258.7
q)type trades[`price]
9h

When we run

(`$":C:/Users/Paul Bilokon/Documents/mycols")set cols trades

we obtain a file, mycols, whose contents match those of tradesdir/.d. This file specifies the order of the columns in the splayed table. (How else would we know it?)

Thus .d is the serialization of the list of column names (as symbols) in the table.

However, when we run

(`$":C:/Users/Paul Bilokon/Documents/mycode")set trades[`code]

we obtain the file mycode, whose contents differ from those of tradesdir/code. This is because all the symbols in the enumerated table are stored in tradesdir/sym, whereas tradesdir/code contains indices of the symbols within tradesdir/sym.

We will reproduce tradesdir/code by running

(`$":C:/Users/Paul Bilokon/Documents/mycode")set tradesenum[`code]

Enumerated versus unenumerated columns

This underlines the difference between trades and tradesenum: .Q.en returns a copy of the table with the symbol columns enumerated.

We can distinguish between the columns that have been enumerated and those that haven't as follows:

q)type trades[`code]
11h
q)type tradesenum[`code]
21h

although the results of meta trades and meta tradesenum look identical.

Similarly to tradesdir/code, the file tradesdir/mktflag contains serialized enumerated strings containing indices of the actual strings within tradesdir/sym.

Each distinct text string, corresponding to each distinct symbol value, is stored in tradesdir/sym once. Using the terminology from Java, symbols are interned.

Columns of strings

The column comments, whose type is string (rather than symbol) was serialized by set to two files:

  • tradesdir/comments and
  • tradesdir/comments#.

The latter, comments#, contains the actual text found in the strings.

Unlike the serializations of the enumerated symbol columns, where distinct values are stored only once, comments# contains repeats.

If you open this file in a binary file viewer, you will see that "NYMEX/CME Group trade", for example, occurs multiple times.

Using the terminology from Java, serialized strings are not interned, whereas serialized symbols are. The file tradesdir/comments indexes into

tradesdir/comments#

Appending to a splayed table

Now suppose that we have

moretrades:([]code:`symbol$();date:`date$();time:`time$();price:`float$();volume:`short$();mktflag:`symbol$();comments:());
.[`moretrades;();,;(
  (`GCM16;2016.04.11;16:29:05.890;1257.8;2h;`E;"");
  (`VXK16;2016.04.11;16:29:38.575;19.49 ;1h;`E;"");
  (`VXK16;2016.04.11;16:29:38.575;19.49 ;1h;`E;"");
  (`CLM16;2016.04.11;16:29:45.105;41.45 ;2h;`E;"NYMEX/CME Group trade");
  (`GCM16;2016.04.11;16:29:50.898;1257.8;1h;`E;"");
  (`GCM16;2016.04.12;10:01:05.890;1257.8;2h;`E;"");
  (`GCM16;2016.04.12;10:02:05.287;1257.8;2h;`E;"");
  (`VXK16;2016.04.12;10:03:38.123;19.49 ;1h;`E;"");
  (`VXK16;2016.04.12;10:03:38.123;19.49 ;1h;`E;"");
  (`CLM16;2016.04.12;10:18:45.230;41.45 ;2h;`E;"NYMEX/CME Group trade");
  (`GCM16;2016.04.12;10:18:50.382;1257.8;1h;`E;""))];

How can we append them to our splayed table?

If we were to do

(`$":C:/Users/Paul Bilokon/Documents/data/tradesdir/")set .Q.en[`$":C:/Users/Paul Bilokon/Documents/data/";moretrades]

the table would be overwritten. However, we don't want to overwrite it; we want to append to it.

What we want is

upsert[`$":C:/Users/Paul Bilokon/Documents/data/tradesdir/"].Q.en[`$":C:/Users/Paul Bilokon/Documents/data/";moretrades]

Sorting a splayed table

Yes, upsert works with tables on disc, not just in memory. upsert is not alone in this. For example, we can sort the splayed table on disc:

`code xasc`$":C:/Users/Paul Bilokon/Documents/tradesdir/"

We confirm that the sorted attribute is now set on the column code:

q)meta get`$":C:/Users/Paul Bilokon/Documents/tradesdir/"
c       | t f a
--------| -----
code    | s   s
date    | d
time    | t
price   | f
volume  | h
mktflag | s
comments| C

Applying attributes to a splayed table

xasc is not alone in being able to work with tables on disc. Here is how we can set the grouped (`g#) attribute on our table:

q)@[`$":C:/Users/Paul Bilokon/Documents/tradesdir/";`mktflag;`g#]
`:tradesdir/
q)meta get`$":C:/Users/Paul Bilokon/Documents/tradesdir/"
c       | t f a
--------| -----
code    | s   s
date    | d
time    | t
price   | f
volume  | h
mktflag | s   g
comments| C

Compression

Let us recreate the on-disc version of the table

trades:([]code:`symbol$();date:`date$();time:`time$();price:`float$();volume:`short$();mktflag:`symbol$();comments:());
.[`trades;();,;(
  (`CLM16;2016.04.07;10:20:00.329;38.3  ;4h;`E;"NYMEX/CME Group trade");
  (`GCM16;2016.04.07;12:00:00.055;1239.7;6h;`E;"");
  (`GCM16;2016.04.07;12:37:02.106;1240.5;1h;`E;"");
  (`CLM16;2016.04.07;13:00:00.128;38.04 ;3h;`E;"NYMEX/CME Group trade");
  (`VXK16;2016.04.07;13:22:05.617;18.85 ;5h;`E;"");
  (`GCM16;2016.04.07;14:35:01.241;1241.2;1h;`E;"");
  (`GCM16;2016.04.08;10:13:01.048;1240f ;3h;`E;"");
  (`VXK16;2016.04.08;11:34:53.417;18.53 ;1h;`E;"Transaction represents a trade in two contract months in the same class");
  (`CLM16;2016.04.08;12:00:00.227;40.61 ;3h;`E;"NYMEX/CME Group trade");
  (`VXK16;2016.04.08;12:44:00.684;18.44 ;2h;`E;"Transaction represents a trade in two contract months in the same class");
  (`VXK16;2016.04.08;12:45:33.130;18.49 ;1h;`E;"Transaction represents a trade in two contract months in the same class");
  (`CLM16;2016.04.08;15:20:02.000;40.78 ;3h;`E;"NYMEX/CME Group trade");
  (`CLM16;2016.04.11;11:00:00.105;41.43 ;2h;`E;"NYMEX/CME Group trade");
  (`VXK16;2016.04.11;14:00:00.829;18.35 ;1h;`E;"");
  (`VXK16;2016.04.11;15:14:58.775;19.05 ;2h;`E;"");
  (`GCM16;2016.04.11;16:00:00.044;1257.9;1h;`E;"");
  (`GCM16;2016.04.11;16:28:34.311;1258.7;1h;`E;""))]

but this time use compression:

(`$":C:/Users/Paul Bilokon/Documents/tradesdir/";17;2;9)set .Q.en[`$":C:/Users/Paul Bilokon/Documents/tradesdir/";trades]

Without compression the table occupied 17,595 bytes; with compression — only 1,182 bytes.

At first the three arguments 17 2 9 look cryptic. They actually represent

  • the logical block size: a power of 2 between 12 and 20;
  • the algorithm:
    • 0 — none,
    • 1 — q IPC,
    • 2 — gzip,
    • 3 — snappy (since V3.4),
    • 4 — lz4hc (since V3.6);
  • the compression level: 0-9 for gzip, 1-12 for lz4hc.

To read up on file compression in kdb+/q, see https://code.kx.com/q/kb/file-compression/.

Querying splayed tables on disc

Some (but not all) q-sql queries work with tables on disc, for example:

q)select from`$":C:/Users/Paul Bilokon/Documents/tradesdir/"where price>100f
code  date       time         price  volume mktflag comments
------------------------------------------------------------
GCM16 2016.04.07 12:00:00.055 1239.7 6      E       ""
GCM16 2016.04.07 12:37:02.106 1240.5 1      E       ""
GCM16 2016.04.07 14:35:01.241 1241.2 1      E       ""
GCM16 2016.04.08 10:13:01.048 1240   3      E       ""
GCM16 2016.04.11 16:00:00.044 1257.9 1      E       ""
GCM16 2016.04.11 16:28:34.311 1258.7 1      E       ""

However, something like

delete price from`$":tradesdir/"

wouldn't work; we would need to manipulate the table in-memory and then save down the modified table to disc, overwriting the original.

Partitioned tables

Introduction

In addition to splaying a table, we may partition it: for each distinct value v of a column that defines the partition, a distinct directory will be created, where the binary files with serialized columns will be stored, but including only those rows with value v in the partition column.

The standard .Q.dpft function performs the partitioning.

The name of that function stands for its four parameters:

  • directory;
  • partition;
  • sort field;
  • table name.

To store the trades table, which is, as before,

trades:([]code:`symbol$();date:`date$();time:`time$();price:`float$();volume:`short$();mktflag:`symbol$();comments:());
.[`trades;();,;(
  (`CLM16;2016.04.07;10:20:00.329;38.3  ;4h;`E;"NYMEX/CME Group trade");
  (`GCM16;2016.04.07;12:00:00.055;1239.7;6h;`E;"");
  (`GCM16;2016.04.07;12:37:02.106;1240.5;1h;`E;"");
  (`CLM16;2016.04.07;13:00:00.128;38.04 ;3h;`E;"NYMEX/CME Group trade");
  (`VXK16;2016.04.07;13:22:05.617;18.85 ;5h;`E;"");
  (`GCM16;2016.04.07;14:35:01.241;1241.2;1h;`E;"");
  (`GCM16;2016.04.08;10:13:01.048;1240f ;3h;`E;"");
  (`VXK16;2016.04.08;11:34:53.417;18.53 ;1h;`E;"Transaction represents a trade in two contract months in the same class");
  (`CLM16;2016.04.08;12:00:00.227;40.61 ;3h;`E;"NYMEX/CME Group trade");
  (`VXK16;2016.04.08;12:44:00.684;18.44 ;2h;`E;"Transaction represents a trade in two contract months in the same class");
  (`VXK16;2016.04.08;12:45:33.130;18.49 ;1h;`E;"Transaction represents a trade in two contract months in the same class");
  (`CLM16;2016.04.08;15:20:02.000;40.78 ;3h;`E;"NYMEX/CME Group trade");
  (`CLM16;2016.04.11;11:00:00.105;41.43 ;2h;`E;"NYMEX/CME Group trade");
  (`VXK16;2016.04.11;14:00:00.829;18.35 ;1h;`E;"");
  (`VXK16;2016.04.11;15:14:58.775;19.05 ;2h;`E;"");
  (`GCM16;2016.04.11;16:00:00.044;1257.9;1h;`E;"");
  (`GCM16;2016.04.11;16:28:34.311;1258.7;1h;`E;""))]

we can execute the following code:

{
  `trd set select from trades where date=x;
  .Q.dpft[`$":C:/Users/Paul Bilokon/Documents/mydb";x;`code;`trd];
  delete trd from`.;}each distinct trades[`date]

We will thus call .Q.dpft once for each of the distinct values in the date column, resulting in the following calls:

.Q.dpft[`$":C:/Users/Paul Bilokon/Documents/mydb"; 2016.04.07; `code; `trd1];
.Q.dpft[`$":C:/Users/Paul Bilokon/Documents/mydb"; 2016.04.08; `code; `trd2];
.Q.dpft[`$":C:/Users/Paul Bilokon/Documents/mydb"; 2016.04.11; `code; `trd3];

Here `trd1`trd2`trd3 are, respectively, the names of the three tables obtained by selecting from trades the rows where date has one of the values, respectively, 2016.04.07, 2016.04.08, 2016.04.11:

select from trades where date=2016.04.07;
select from trades where date=2016.04.08;
select from trades where date=2016.04.11;

We could not implement our calls as follows

.Q.dpft[`$":C:/Users/Paul Bilokon/Documents/mydb"; 2016.04.07; `code; select from trades where date=2016.04.07];
.Q.dpft[`$":C:/Users/Paul Bilokon/Documents/mydb"; 2016.04.08; `code; select from trades where date=2016.04.08];
.Q.dpft[`$":C:/Users/Paul Bilokon/Documents/mydb"; 2016.04.11; `code; select from trades where date=2016.04.11];

because of this limitation of .Q.dpft: it expects, as its fourth argument, the table name, not the actual table.

The table is always saved to disc in a directory with that name. This is why we created the temporary table trd, with the date we will write to a partition in the global namespace (`.) using `trd set... rather than, for example, trd:. The latter will create the variable trd in the local scope of our anonymous function defined in the {...} block. We wouldn't be able to reference that variable as `trd.

At the end of our local function we delete the temporary table from the global namespace:

delete trd from `.;

The following directory structure will be obtained by evaluating our code. Under C:/Users/Paul Bilokon/Documents/mydb:

  • 2016.04.07
    • trd
      • .d
      • code
      • comments
      • comments#
      • date
      • mktflag
      • price
      • time
      • volume
  • 2016.04.08
    • trd
      • .d
      • ...
      • volume
  • 2016.04.11
    • trd
      • .d
      • ...
      • volume
  • sym

In the directory that we have specified to .Q.dpft, we have one subdirectory per each partition (2016.04.07, 2016.04.08, and 2016.04.11). Inside those directories we have a subdirectory whose name is that of the table that we provided to .Q.dpft as an argument: trd. Inside trd, we have the same files as we encountered when splaying the table, with the exception of sym, which now resides two levels above. As before, the file sym enumerates the columns of type symbol

Notice that we didn't have to call .Q.en explicitly this time: .Q.dpft did this for us.

Sorting and the `p# attribute

The table that we have obtained on disc is

date       code  time         price  volume mktflag comments
-----------------------------------------------------------------------------------------------------------------------------
2016.04.07 CLM16 10:20:00.329 38.3   4      E       "NYMEX/CME Group trade"
2016.04.07 CLM16 13:00:00.128 38.04  3      E       "NYMEX/CME Group trade"
2016.04.07 GCM16 12:00:00.055 1239.7 6      E       ""
2016.04.07 GCM16 12:37:02.106 1240.5 1      E       ""
2016.04.07 GCM16 14:35:01.241 1241.2 1      E       ""
2016.04.07 VXK16 13:22:05.617 18.85  5      E       ""
2016.04.08 CLM16 12:00:00.227 40.61  3      E       "NYMEX/CME Group trade"
2016.04.08 CLM16 15:20:02.000 40.78  3      E       "NYMEX/CME Group trade"
2016.04.08 GCM16 10:13:01.048 1240   3      E       ""
2016.04.08 VXK16 11:34:53.417 18.53  1      E       "Transaction represents a trade in two contract months in the same class"
2016.04.08 VXK16 12:44:00.684 18.44  2      E       "Transaction represents a trade in two contract months in the same class"
2016.04.08 VXK16 12:45:33.130 18.49  1      E       "Transaction represents a trade in two contract months in the same class"
2016.04.11 CLM16 11:00:00.105 41.43  2      E       "NYMEX/CME Group trade"
2016.04.11 GCM16 16:00:00.044 1257.9 1      E       ""
2016.04.11 GCM16 16:28:34.311 1258.7 1      E       ""
2016.04.11 VXK16 14:00:00.829 18.35  1      E       ""
2016.04.11 VXK16 15:14:58.775 19.05  2      E       ""

Can you notice any differences? Notice that, within each partition, the table has been sorted by code (which we passed as a parameter to .Q.dpft).

The attribute that was set on that column, however, is not `s# (sorted) but `p# (parted):

q)meta trd
c       | t f a
--------| -----
date    | d
code    | s   p
time    | t
price   | f
volume  | h
mktflag | s
comments| C

The parted attribute asserts that all the equal values in the list are adjacent. This speeds up the querying by code.

Adding another table

We can add another table to the mydb directory.

Suppose that we have the following (again, highly contrived) table of quotes:

quotes:([]code:`symbol$();date:`date$();time:`time$();bidprice:`float$();bidsize:`short$();askprice:`float$();asksize:`short$();mktflag:`symbol$());
.[`quotes;();,;(
  (`CLM16;2016.04.07;10:15:00.010;38.34 ;86h;38.35 ;3h ;`E);
  (`GCM16;2016.04.07;11:02:16.663;1241.4;22h;1241.5;1h ;`E);
  (`CLM16;2016.04.07;12:05:00.303;38.12 ;7h ;38.13 ;13h;`E);
  (`CLM16;2016.04.07;12:22:00.486;38.11 ;16h;38.12 ;8h ;`E);
  (`GCM16;2016.04.07;13:00:00.205;1238.6;8h ;1238.7;7h ;`E);
  (`CLM16;2016.04.07;15:00:00.051;38.52 ;9h ;38.53 ;18h;`E);
  (`GCM16;2016.04.07;15:20:02.224;1240.9;6h ;1241f ;1h ;`E);
  (`CLM16;2016.04.08;10:53:00.002;40.83 ;6h ;40.84 ;66h;`E);
  (`CLM16;2016.04.08;13:56:30.070;40.54 ;38h;40.56 ;58h;`E);
  (`CLM16;2016.04.08;15:20:02.000;40.77 ;26h;40.79 ;44h;`E);
  (`CLM16;2016.04.08;15:21:43.786;40.76 ;3h ;40.77 ;28h;`E))]

We run the following code...

{
  `qt set select from quotes where date=x;
  .Q.dpft[`$":C:/Users/Paul Bilokon/Documents/mydb";x;`code;`qt];
  delete qt from `.;} each distinct quotes[`date]

...and obtain the following directory structure under C:/Users/Paul Bilokon/Documents/mydb:

  • 2016.04.07
    • trd
      • .d
      • code
      • comments
      • comments#
      • date
      • mktflag
      • price
      • time
      • volume
    • qt
      • .d
      • askprice
      • asksize
      • bidprice
      • bidsize
      • code
      • date
      • mktflag
      • time
  • 2016.04.08
    • trd
      • .d
      • ...
      • volume
    • qt
      • .d
      • ...
      • time
  • 2016.04.11
    • trd
      • .d
      • ...
      • volume
    • qt
      • .d
      • ...
      • time
  • sym

Loading partitioned tables

To load partitioned tables into kdb+, you can pass the name of the directory that contains them as a command line parameter to the q executable (we killed the already opened q instance first before restarting it in this manner):

C:\q\w64>q "C:\Users\Paul Bilokon\Documents\mydb"

Caveat

However, we are in for a surprise:

q)\v
`date`sym`trd

Where's qt? What's the matter?

It turns out that, while trd has the partitions 2016.04.07, 2016.04.08, and 2016.04.11, qt had only 2016.04.07 and 2016.04.08.

Each table in mydb must have exactly the same partitions.

We remedy things by recreating quotes and then storing quotes as qt using

{
  `qt set select from quotes where date=x;
  .Q.dpft[`$":C:/Users/Paul Bilokon/Documents/mydb";x;`code;`qt];
  delete qt from`.;} each 2016.04.07 2016.04.08 2016.04.11

On quitting kdb+ and then restarting it with

C:\q\w64>q "C:\Users\Paul Bilokon\Documents\mydb"

we see that now both trd and qt are present:

q)\v
`date`qt`sym`trd

We can examine the tables by running

select from trd

and

select from qt

The importance of current directory

Check (by running \cd) that the current directory is

"C:\\Users\\Paul Bilokon\\Documents\\mydb"

Notice that if we change it, say, with

\cd C:/q/w64

our partitioned tables stop "working":

q)select from trd
'./2016.04.07/trd/code: The system cannot find the path specified.

Changing the current directory in the presence of partitioned tables wasn't a good move. We fix things by changing it back:

\cd C:/Users/Paul Bilokon/Documents/mydb

Notice that, two other variables are present in the restarted kdb+ instance: date containing the list of the partitions:

2016.04.07 2016.04.08 2016.04.11

and sym, containing the list of enumerated symbols:

`CLM16`GCM16`VXK16`E

Appending rows to partitioned tables

Now let us suppose we want to append the following rows:

morequotes:([]code:`symbol$();date:`date$();time:`time$();bidprice:`float$();bidsize:`short$();askprice:`float$();asksize:`short$();mktflag:`symbol$());
.[`morequotes;();,;(
  (`GCM16;2016.04.07;15:23:21.147;1241.2;4h ;1241.3;1h;`E);
  (`GCM16;2016.04.07;15:33:04.535;1241.7;17h;1241.6;2h;`E))]

Naïvely, we attempt to run

qt:morequotes;
.Q.dpft[`$":C:/Users/Paul Bilokon/Documents/mydb";2016.04.07;`code;`qt];
delete qt from`.;

On restarting the kdb+ instance (we have to do this because we replaced qt with our temporary table; we wrote to qt image on disc but the in-memory representation is not updated) with

C:\q\w64>q "C:\Users\Paul Bilokon\Documents\mydb"

we find that we did not append to the 2016.04.07 partition of qt but replaced it:

select from qt
date       code  time         bidprice bidsize askprice asksize mktflag
-----------------------------------------------------------------------
2016.04.07 GCM16 15:23:21.147 1241.2   4       1241.3   1       E
2016.04.07 GCM16 15:33:04.535 1241.7   17      1241.6   2       E
2016.04.08 CLM16 10:53:00.002 40.83    6       40.84    66      E
2016.04.08 CLM16 13:56:30.070 40.54    38      40.56    58      E
2016.04.08 CLM16 15:20:02.000 40.77    26      40.79    44      E
2016.04.08 CLM16 15:21:43.786 40.76    3       40.77    28      E

This exposes another limitation of .Q.dpft: since it does not allow us to append to a partition but overwrites it, it requires the entire partition to be in memory for it to be written to a partitioned table. For some very large partitions this is infeasible: they simply won't fit in memory.

To append to a partition on disc, we need a modified version of .Q.dpft.

Let us examine the code of .Q.dpft. We see that it is written in k:

q).Q.dpft
k){[d;p;f;t;s]i:<t f;r:+enxs[$;d;;s]`. . `\:t;{[d;t;i;u;x]@[d;x;:;u t[x]i]}[d:par[d;p;t];r;i;]'[(::;`p#)f=!r;!r];@[d;`.d;:;f,r@&~f=r:!r];t}[;;;;`sym]

The part of the code that we are interested in is

@[d;x;:;u t[x]i]

here we replace : with , and obtain

k)dpftappend:{[d;p;f;t;s]i:<t f;r:+.Q.enxs[$;d;;s]`. . `\:t;{[d;t;i;u;x]@[d;x;,;u t[x]i]}[d:.Q.par[d;p;t];r;i;]'[(::;`p#)f=!r;!r];@[d;`.d;:;f,r@&~f=r:!r];t}[;;;;`sym]

We can test it by recreating the quotes table, writing it to disc with

{
  `qt set select from quotes where date=x;
  .Q.dpft[`$":C:/Users/Paul Bilokon/Documents/mydb";x;`code;`qt];
  delete qt from`.;} each 2016.04.07 2016.04.08 2016.04.11

again, then appending morequotes with

qt:morequotes;
dpftappend[`$":C:/Users/Paul Bilokon/Documents/mydb";2016.04.07;`code;`qt];
delete qt from`.;

(We remark that this modification of .Q.dpft was suggested by Pawan Singh — see https://groups.google.com/forum/\#!topic/personal-kdbplus/51rxAntpcOc.)

This time we have achieved our goal of appending to a partition in qt: after restarting again with

C:\q\w64>q "C:\Users\Paul Bilokon\Documents\mydb"

we check that we have indeed appended to a partition on disc:

select from qt
date       code  time         bidprice bidsize askprice asksize mktflag
-----------------------------------------------------------------------
2016.04.07 CLM16 10:15:00.010 38.34    86      38.35    3       E
2016.04.07 CLM16 12:05:00.303 38.12    7       38.13    13      E
2016.04.07 CLM16 12:22:00.486 38.11    16      38.12    8       E
2016.04.07 CLM16 15:00:00.051 38.52    9       38.53    18      E
2016.04.07 GCM16 11:02:16.663 1241.4   22      1241.5   1       E
2016.04.07 GCM16 13:00:00.205 1238.6   8       1238.7   7       E
2016.04.07 GCM16 15:20:02.224 1240.9   6       1241     1       E
2016.04.07 GCM16 15:23:21.147 1241.2   4       1241.3   1       E
2016.04.07 GCM16 15:33:04.535 1241.7   17      1241.6   2       E
2016.04.08 CLM16 10:53:00.002 40.83    6       40.84    66      E
2016.04.08 CLM16 13:56:30.070 40.54    38      40.56    58      E
2016.04.08 CLM16 15:20:02.000 40.77    26      40.79    44      E
2016.04.08 CLM16 15:21:43.786 40.76    3       40.77    28      E

Compression

While it is not possible to pass in compression parameters to .Q.dpft, kdb+ can write compressed files by default.

This is governed by the zip defaults .z.zd. Set this as an integer vector, e.g.

.z.zd:17 2 9

and all files will be compressed in this way (unless given different parameters).

To disable compression by default, set

.z.zd:3#0

Maintenance

For maintenance of partitioned tables, be sure to examine dbmaint.q: https://github.com/KxSystems/kdb/blob/master/utils/dbmaint.md:

The script dbmaint.q contains utility functions for maintenance of partitioned database tables in kdb+. It should not be used for splayed tables which are not partitioned, tables stored as single files or in memory tables.

In particular, this script defines the functions addcol, castcol, clearattrcol, copycol, deletecol, findcol, fixtable, fncol, listcols, renamecol, reordercols, setattrcol, addtable, rentable. See the documentation for details.

Another useful function for working with partitioned tables is .Q.chk (fill historical database, hdb). It is run as follows:

.Q.chk x

where x is a path to the historical database. .Q.chk fills tables missing from partitions using the most recent partition as a template, and reports which partitions (but not which tables) it is fixing:

q).Q.chk[`:hdb]
()
()
,`:/db/2009.01.04
,`:/db/2009.01.03

Q must have write permission for the hdb area to create the missing tables.