Difference between revisions of "Programming/Kdb/Labs/Big data in kdb"

From Thalesians Wiki
< Programming‎ | Kdb‎ | Labs
Line 307: Line 307:


Without compression the table occupied 17,595 bytes; with compression &mdash; only 1,182 bytes.
Without compression the table occupied 17,595 bytes; with compression &mdash; only 1,182 bytes.
To read up on file compression in kdb+/q, see <span class="plainlinks">https://code.kx.com/q/kb/file-compression/</span>.

Revision as of 17:42, 20 June 2021

Example table

Let us create and populate the trades table:

trades:([]code:`symbol$();date:`date$();time:`time$();price:`float$();volume:`short$();mktflag:`symbol$();comments:());
.[`trades;();,;(
  (`CLM16;2016.04.07;10:20:00.329;38.3  ;4h;`E;"NYMEX/CME Group trade");
  (`GCM16;2016.04.07;12:00:00.055;1239.7;6h;`E;"");
  (`GCM16;2016.04.07;12:37:02.106;1240.5;1h;`E;"");
  (`CLM16;2016.04.07;13:00:00.128;38.04 ;3h;`E;"NYMEX/CME Group trade");
  (`VXK16;2016.04.07;13:22:05.617;18.85 ;5h;`E;"");
  (`GCM16;2016.04.07;14:35:01.241;1241.2;1h;`E;"");
  (`GCM16;2016.04.08;10:13:01.048;1240f ;3h;`E;"");
  (`VXK16;2016.04.08;11:34:53.417;18.53 ;1h;`E;"Transaction represents a trade in two contract months in the same class");
  (`CLM16;2016.04.08;12:00:00.227;40.61 ;3h;`E;"NYMEX/CME Group trade");
  (`VXK16;2016.04.08;12:44:00.684;18.44 ;2h;`E;"Transaction represents a trade in two contract months in the same class");
  (`VXK16;2016.04.08;12:45:33.130;18.49 ;1h;`E;"Transaction represents a trade in two contract months in the same class");
  (`CLM16;2016.04.08;15:20:02.000;40.78 ;3h;`E;"NYMEX/CME Group trade");
  (`CLM16;2016.04.11;11:00:00.105;41.43 ;2h;`E;"NYMEX/CME Group trade");
  (`VXK16;2016.04.11;14:00:00.829;18.35 ;1h;`E;"");
  (`VXK16;2016.04.11;15:14:58.775;19.05 ;2h;`E;"");
  (`GCM16;2016.04.11;16:00:00.044;1257.9;1h;`E;"");
  (`GCM16;2016.04.11;16:28:34.311;1258.7;1h;`E;""))]

Each row represents a trade in a futures contract. The value in the code column identifies that contract. The date and time columns give the dates and times of the trades as reported by the exchange. The price column contains the trade prices per contract. The volume column contains the number of contracts traded. The name of the mktflag column stands for "market flag", which distinguishes the pit trades (in which case the value in the column is `P) from the electronic trades (in which case the value is `E). In our table, which represents a hypothetical subsample from some large data set, all the trades are electronic trades. The comments column contains arbitrary comments on the data rows as strings — not as symbols. In practice, the specific comments in our example ("NYMEX/CME Group trade", "Transaction represents a trade in two contract months in the same class") would be implemented as symbol or numeric flags rather than as strings. We chose to use the string data type to demonstrate its use in a table.

Saving a table as a single binary file

We can use the following command to save this table as a binary file:

`:tradesfile set trades

The resulting file will have 965 bytes and will reside in the current directory, which you can check with

\cd

By default this will be the directory containing the q.exe executable. To save it in a different directory, we could use something like this:

(`$":C:/Users/Paul Bilokon/Documents/tradesfile") set trades

The brackets are needed as

`$":C:/Users/Paul Bilokon/Documents/tradesfile" set trades

is interpreted by q, from right to left, as

`$(":C:/Users/Paul Bilokon/Documents/tradesfile" set trades)

which isn't what we want.

In the case of success the return is the file path as a symbol:

q)(`$":C:/Users/Paul Bilokon/Documents/tradesfile") set trades
`$":C:/Users/Paul Bilokon/Documents/tradesfile"

Loading a table from a single binary file

We can then load the table from a file either using get, in which case we can specify the new table name (by assigning to a variable with that name the result of get):

trades1:get`$":C:/Users/Paul Bilokon/Documents/tradesfile"

or using load, in which ase the loaded table will be assigned to a new variable, that variable's name matching the name of the file; in this particular case, tradesfile:

load`$":C:/Users/Paul Bilokon/Documents/tradesfile"

Splayed tables

Instead of saving the table as a single binary file, we could save it splayed, i.e. so that each column is saved in a separate binary file.

Queries often refer to a small subset of such a table's columns. Storing the columns in separate files allows kdb+ to save time by loading only some of them.

To achieve this, we would normally call set with a directory path, rather than a file path, as its first argument.

However, the following will fail with `type:

(`$":C:/Users/Paul Bilokon/Documents/tradesdir/")set trades

Unenumerated columns

set will disallow us to splay the table trades since it contains unenumerated columns of type symbol. (The same error would be thrown if the table were keyed; we would need to remove the key with

()xkey trades

before saving the table.)

We can remove those columns and save the table in one line with

(`$":C:/Users/Paul Bilokon/Documents/tradesdir/")set `code`mktflag _ trades

Notice the traling / in the directory path.

As a result of running this, the directory tradesdir will be created under C:/Users/Paul Bilokon/Documents. It will contain the files with names

  • .d,
  • comments,
  • comments#,
  • date,
  • price,
  • time, and
  • volume.

Enumerating the columns

We couldn't save the original trades table because the columns of type symbol, namely code and mktflag, weren't enumerated. We shall now show how to enumerate them. Before proceeding, delete the directory tradesdir with all its contents, in case you created that directory by following the steps outlined above, so that we start with a clean slate. To enumerate the symbol columns in trades, we use .Q.en:

tradesenum:.Q.en[`$":C:/Users/Paul Bilokon/Documents/tradesdir/";trades]

This will create the directory tradesdir with a single binary file in it, named sym. That file enumerates the values found in the columns of type symbol, namely `CLM16, `GCM16, `VXK16, and `E.

Once this is done, we run

(`$":C:/Users/Paul Bilokon/Documents/tradesdir/")set tradesenum

The directory now contains the files with names

  • .d,
  • code,
  • comments,
  • comments#,
  • date,
  • mktflag,
  • price,
  • sym,
  • time, and
  • volume.

Notice that this time, with the symbol columns enumerated, we were able to save the columns of type symbol successfully.

Note that, unlike the above command,

(`$":C:/Users/Paul Bilokon/Documents/tradesdir/")set trades

will fail with 'type, as before.

Splayed tables on disc

Let us now take a look at the individual files in the directory tradesdir. The files date, price, time, and volume contain the serializations of the columns with the corresponding names.

Notice that you can serialize the columns individually. The command

(`$":C:/Users/Paul Bilokon/Documents/myprice")set trades[`price]

will create a file myprice, whose contents are identical to those of tradesdir/price.

Recall that trades[`price] is a list of type float:

q)trades[`price]
38.3 1239.7 1240.5 38.04 18.85 1241.2 1240 18.53 40.61 18.44 18.49 40.78 41.43 18.35 19.05 1257.9 1258.7
q)type trades[`price]
9h

When we run

(`$":C:/Users/Paul Bilokon/Documents/mycols")set cols trades

we obtain a file, mycols, whose contents match those of tradesdir/.d. This file specifies the order of the columns in the splayed table. (How else would we know it?)

Thus .d is the serialization of the list of column names (as symbols) in the table.

However, when we run

(`$":C:/Users/Paul Bilokon/Documents/mycode")set trades[`code]

we obtain the file mycode, whose contents differ from those of tradesdir/code. This is because all the symbols in the enumerated table are stored in tradesdir/sym, whereas tradesdir/code contains indices of the symbols within tradesdir/sym.

We will reproduce tradesdir/code by running

(`$":C:/Users/Paul Bilokon/Documents/mycode")set tradesenum[`code]

Enumerated versus unenumerated columns

This underlines the difference between trades and tradesenum: .Q.en returns a copy of the table with the symbol columns enumerated.

We can distinguish between the columns that have been enumerated and those that haven't as follows:

q)type trades[`code]
11h
q)type tradesenum[`code]
21h

although the results of meta trades and meta tradesenum look identical.

Similarly to tradesdir/code, the file tradesdir/mktflag contains serialized enumerated strings containing indices of the actual strings within tradesdir/sym.

Each distinct text string, corresponding to each distinct symbol value, is stored in tradesdir/sym once. Using the terminology from Java, symbols are interned.

Columns of strings

The column comments, whose type is string (rather than symbol) was serialized by set to two files:

  • tradesdir/comments and
  • tradesdir/comments#.

The latter, comments#, contains the actual text found in the strings.

Unlike the serializations of the enumerated symbol columns, where distinct values are stored only once, comments# contains repeats.

If you open this file in a binary file viewer, you will see that "NYMEX/CME Group trade", for example, occurs multiple times.

Using the terminology from Java, serialized strings are not interned, whereas serialized symbols are. The file tradesdir/comments indexes into

tradesdir/comments#

Appending to a splayed table

Now suppose that we have

moretrades:([]code:`symbol$();date:`date$();time:`time$();price:`float$();volume:`short$();mktflag:`symbol$();comments:());
.[`moretrades;();,;(
  (`GCM16;2016.04.11;16:29:05.890;1257.8;2h;`E;"");
  (`VXK16;2016.04.11;16:29:38.575;19.49 ;1h;`E;"");
  (`VXK16;2016.04.11;16:29:38.575;19.49 ;1h;`E;"");
  (`CLM16;2016.04.11;16:29:45.105;41.45 ;2h;`E;"NYMEX/CME Group trade");
  (`GCM16;2016.04.11;16:29:50.898;1257.8;1h;`E;"");
  (`GCM16;2016.04.12;10:01:05.890;1257.8;2h;`E;"");
  (`GCM16;2016.04.12;10:02:05.287;1257.8;2h;`E;"");
  (`VXK16;2016.04.12;10:03:38.123;19.49 ;1h;`E;"");
  (`VXK16;2016.04.12;10:03:38.123;19.49 ;1h;`E;"");
  (`CLM16;2016.04.12;10:18:45.230;41.45 ;2h;`E;"NYMEX/CME Group trade");
  (`GCM16;2016.04.12;10:18:50.382;1257.8;1h;`E;""))];

How can we append them to our splayed table?

If we were to do

(`$":C:/Users/Paul Bilokon/Documents/tradesdir/")set .Q.en[`$":C:/Users/Paul Bilokon/Documents/tradesdir/";moretrades]

the table would be overwritten. However, we don't want to overwrite it; we want to append to it.

What we want is

upsert[`$":C:/Users/Paul Bilokon/Documents/tradesdir/"].Q.en[`$":C:/Users/Paul Bilokon/Documents/tradesdir/";moretrades]

Sorting a splayed table

Yes, upsert works with tables on disc, not just in memory. upsert is not alone in this. For example, we can sort the splayed table on disc:

`code xasc`$":C:/Users/Paul Bilokon/Documents/tradesdir/"

We confirm that the sorted attribute is now set on the column code:

q)meta get`$":C:/Users/Paul Bilokon/Documents/tradesdir/"
c       | t f a
--------| -----
code    | s   s
date    | d
time    | t
price   | f
volume  | h
mktflag | s
comments| C

Applying attributes to a splayed table

xasc is not alone in being able to work with tables on disc. Here is how we can set the grouped (`g#) attribute on our table:

q)@[`$":C:/Users/Paul Bilokon/Documents/tradesdir/";`mktflag;`g#]
`:tradesdir/
q)meta get`$":C:/Users/Paul Bilokon/Documents/tradesdir/"
c       | t f a
--------| -----
code    | s   s
date    | d
time    | t
price   | f
volume  | h
mktflag | s   g
comments| C

Compression

Let us recreate the on-disc version of the table

trades:([]code:`symbol$();date:`date$();time:`time$();price:`float$();volume:`short$();mktflag:`symbol$();comments:());
.[`trades;();,;(
  (`CLM16;2016.04.07;10:20:00.329;38.3  ;4h;`E;"NYMEX/CME Group trade");
  (`GCM16;2016.04.07;12:00:00.055;1239.7;6h;`E;"");
  (`GCM16;2016.04.07;12:37:02.106;1240.5;1h;`E;"");
  (`CLM16;2016.04.07;13:00:00.128;38.04 ;3h;`E;"NYMEX/CME Group trade");
  (`VXK16;2016.04.07;13:22:05.617;18.85 ;5h;`E;"");
  (`GCM16;2016.04.07;14:35:01.241;1241.2;1h;`E;"");
  (`GCM16;2016.04.08;10:13:01.048;1240f ;3h;`E;"");
  (`VXK16;2016.04.08;11:34:53.417;18.53 ;1h;`E;"Transaction represents a trade in two contract months in the same class");
  (`CLM16;2016.04.08;12:00:00.227;40.61 ;3h;`E;"NYMEX/CME Group trade");
  (`VXK16;2016.04.08;12:44:00.684;18.44 ;2h;`E;"Transaction represents a trade in two contract months in the same class");
  (`VXK16;2016.04.08;12:45:33.130;18.49 ;1h;`E;"Transaction represents a trade in two contract months in the same class");
  (`CLM16;2016.04.08;15:20:02.000;40.78 ;3h;`E;"NYMEX/CME Group trade");
  (`CLM16;2016.04.11;11:00:00.105;41.43 ;2h;`E;"NYMEX/CME Group trade");
  (`VXK16;2016.04.11;14:00:00.829;18.35 ;1h;`E;"");
  (`VXK16;2016.04.11;15:14:58.775;19.05 ;2h;`E;"");
  (`GCM16;2016.04.11;16:00:00.044;1257.9;1h;`E;"");
  (`GCM16;2016.04.11;16:28:34.311;1258.7;1h;`E;""))]

but this time use compression:

(`$":C:/Users/Paul Bilokon/Documents/tradesdir/";17;2;9)set .Q.en[`$":C:/Users/Paul Bilokon/Documents/tradesdir/";trades]

Without compression the table occupied 17,595 bytes; with compression — only 1,182 bytes.

To read up on file compression in kdb+/q, see https://code.kx.com/q/kb/file-compression/.