Homemade Time Series With OCaml
December 8, 2019
Sometimes I just build things out of curiosity so that I know how they will work instead of using something pre-made. So in the spirit of GROKing something via building it, I recently created my own time series module, and thought I would share it with others. I hope you enjoy my thoughs and creation of a simple time series module using OCaml.
I'm testing out two different methods of building a time series module. First, I am building a time series module centered around a record type that holds the time series data along with meta-data. Next, I'm going to do the same thing with modules and functors. Having both versions will let me make a comparison on the pros and cons of each implementation.
What I want in a Time Series module
1 - Simplicity (in the implemenation and interface)
I want a very simple interface that anybody can undersand with only a passing glance (no documentation or instruction manual needed). All behavior must be completely obvious and as anticipated. The functionality should be distilled to only what is necessary, but easy to extend. I think an implementation using record types will probably be the simplest because there will less moving parts involved.
2 - Speedy (Enough)
This thing cannot take forever for obvious reasons, but I'm not too concerned with speed. I don't use any seriously large data sets, and if something really fast is needed in the future, I can make it as an extra add on. What I'm most concerned about with speed is being able to quickly tinker with the module in the top level when I'm experimenting with data. So speed really counts when measuring how quickly other people can learn to use the module, and how quickly the module can be used.
3 - Lightweight
I don't want 100 requisite libraries for my time series library to compile. I think this would be annoying and rude to other developers who may use it in the future (including the future me). I really just want this to feel like a plug and play module, where you can load it in the toplevel and it just works right out of the gate with all code being self-contained.
4 - Flexible
I want to be able to track any kind of data. A few that spring to mind are binary, categorical, ordinal, integers, and floats, but this needs to work for everything right off the bat. Although modules in OCaml are generally more flexible, I don't think their flexibility and extendability will provide much value here.
Other Thoughts before building
I want to be able to track my data over a fixed set of arbitrary time periods. I'll keep it simple and assume that I will only be using time intervals of days, hours, minutes, or seconds. I also don't want to force the use of any granular timestamp data when it is not needed. Ex. My time series should not have to care about the time of day a data point was observed if my series is a daily time series.
With the pre-requisites in mind I'm organizing the code into four files:
Utils.ml - contains the basic building blocks of my module including all common types and functions
TimeSeriesWithRecords.ml - the implementation of a time series module using a record type to store all information
TimeSeriesWithModules.ml - the implementation of a time series module using functors to generate modules that store all information
Test.ml - a quick test of both methods with a benchmark comparison
To define the core components, I really only care about capturing a value and a time as a data point. Then storing a sequence of these data points.
First I want to define a clear way to track dates - clearer than what is provided by the Unix module. I though about counting months from 1 to 12, instead of using the default Unix method of counting from 0 to 11, but I think this can get confusing (what if I forget or dont notice). To make things easier, I think it is better to use a type for month that cannot be accidentally misinterpreted, so I will add the following month type, and create a date type that cannot be misintrepreted. I'm also setting up an interval type to note the smallest unit of intended measure in the series. This can be useful if I want a daily series, but I don't always have observations on consecutive days.
type month = | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec type interval = Seconds of int Minutes of int Hours of int Days of int type date = { year : int; month : month; day : int; } type time = { hour : int; min : int; sec : int; }
I'm also setting up a data_point type that will wrap all information together that is relevant to a single observation.
type 'a data_point = { date : date; time : time option; value : 'a option }
These are all stored in the Utils.ml file along with some other functions that are useful, but probably boring to read so I'll leave them at the bottom of the post for anybody interested.
The Record Implementation
Now it is very straightforward to create an ordered collecton of data points. The type 'a time_series not only contains the data, but useful metadata as well.
type 'a time_series = { data : 'a data_point array; sorted : bool; missing_points : bool; interval : interval }
Now data can actually put into a series!
There is still a huge problem, which is that I will be dangerously building the time series myself. I need a helper function to do this that will not make mistakes. But, before I can build a function to create a time series safely, I will need some helper functions to do a few important things:
(1) Check if the series is sorted
I'm going to cheat some here and just assume that if I am going to build a new series, then I am always going to sort the data and set sorted to true. Maybe this means that the sorted record field is unncessary? I don't think so, it clearly tells me that the data is sorted (as long as the safe time series builder functions are used)!
let sort_series data = Array.fast_sort (fun x y -> date_compare x y) data
(2) Check if the series is missing any data points between the earliest and latest data points
let mkdate (d : date) = { Unix.tm_sec = 0; tm_min = 0; tm_hour = 0; tm_mday = d.day; tm_mon = int_of_month d.month; tm_year = d.year; tm_wday = 0; tm_yday = 0; tm_isdst = true } |> Unix.mktime let make_next_date interval current_date = let open Unix in let next_dt = localtime ( (fst @@ mkdate current_date) +. float_of_int (seconds_of_interval interval) ) in { year = next_dt.tm_year; month = month_of_int next_dt.tm_mon; day = next_dt.tm_mday } (* NOTE: This function assumes that data is always sorted *) let is_missing_points time_series = (* TODO: Raise an exception here if the data is not already sorted! *) let rec run prior_date remaining_points = match Array.length remaining_points with | 0 -> false | _ -> let next_date = (Array.get remaining_points 0).date in if next_date <> make_next_date time_series.interval prior_date then true else run next_date (Array.sub remaining_points 1 (Array.length remaining_points - 1)) in run (Array.get time_series.data 0).date (Array.sub time_series.data 1 (Array.length time_series.data - 1))
Finally, all of the necessary functions exist to build a time series safely.
let build_time_series data interval = let data_copy = Array.copy data in sort_series data_copy; let missing_points = is_missing_points { data = data; sorted = true; missing_points = true; interval = interval } in { data = data_copy; sorted = true; missing_points = missing_points; interval = interval }
The Functor / Module Implementation
I want to do the same thing, but with a focus on modules. To get started I can define a module version of the record type type above that will be used to store the time series information.
module type TimeSeries = sig type t val data : t data_point array val sorted : bool val missing_points : bool val interval : interval end
I can also setup a functor to build a TimeSeries module. The functor only needs knowledge of the data and the interval for all other calculations. This can be translated into an input module type and corresponding functor to build a TimeSeries module.
module type TsData = sig type t val data : t data_point array val interval : interval end module Builder (D : TsData) : TimeSeries = struct type t = D.t let data = let data_copy = Array.copy D.data in sort_series data_copy; data_copy let sorted = true let missing_points = is_missing_points { data = data; sorted = true; missing_points = true; interval = D.interval } let interval = D.interval end
Which is better
I'm just doing the minimal testing necessary to see which one I like better. Below there is some common data to test both implementations. I have left some missing values in test_data so that the tests can make sure that both implementations return the series with missing_data = true.
#use "Utils.ml";; #use "TimeSeriesWithRecords.ml";; #use "TimeSeriesWithModules.ml";; let test_data = [| {date = {year = 2019; month = Jan; day = 1}; time = None; value = Some 1.0}; {date = {year = 2019; month = Jan; day = 2}; time = None; value = Some 2.0}; {date = {year = 2019; month = Jan; day = 4}; time = None; value = None}; {date = {year = 2019; month = Jan; day = 6}; time = None; value = Some 3.0}; {date = {year = 2019; month = Jan; day = 8}; time = None; value = Some 3.0} |]
The record test is very simple, only a single line. Looks like everything is working.
# let ts = build_time_series test_data (Days 1);; val ts : float time_series = {data = [|{date = {year = 2019; month = Jan; day = 1}; time = None; value = Some 1.}; {date = {year = 2019; month = Jan; day = 2}; time = None; value = Some 2.}; {date = {year = 2019; month = Jan; day = 4}; time = None; value = None}; {date = {year = 2019; month = Jan; day = 6}; time = None; value = Some 3.}; {date = {year = 2019; month = Jan; day = 8}; time = None; value = Some 3.}|]; sorted = true; missing_points = true; interval = Days 1}
The module implementation requires more effort than the record implementation. The data input module needs to be setup, then passed into a functor to build a TimeSeries module with data.
# module TestTsData = struct type t = float let data = test_data let interval = Days 1 end ;; module TestTsData : sig type t = float val data : t data_point array val interval : interval end # module Test = Builder (TestTsData);; module Test : sig type t = Builder(TestTsData).t val data : t data_point array val sorted : bool val missing_points : bool val interval : interval end # Test.data;; - : Test.t data_point array = [|{date = {year = 2019; month = Jan; day = 1}; time = None; value = Some <abstr>}; {date = {year = 2019; month = Jan; day = 2}; time = None; value = Some <abstr>}; {date = {year = 2019; month = Jan; day = 4}; time = None; value = None}; {date = {year = 2019; month = Jan; day = 6}; time = None; value = Some <abstr>}; {date = {year = 2019; month = Jan; day = 8}; time = None; value = Some <abstr>}|]
This looks great, but the type of value in each data point is abstract. So the setup has to go one step further to make sure that the functor provides a module with a concrete value type. Note that this has to be done separately for each differnt type that is used, which can be non-significant overhead compared to the record implementation.
So it is necessary to setup up a functor for building a float time series, that does the exact same thing as above, but exposes the type of value.
# module FloatBuilder (D : TsData with type t = float) : (TimeSeries with type t = float) = struct type t = D.t let data = let data_copy = Array.copy D.data in sort_series data_copy; data_copy let sorted = true let missing_points = is_missing_points { data = data; sorted = true; missing_points = true; interval = D.interval } let interval = D.interval end ;; module FloatBuilder : functor (D : sig type t = float val data : t data_point array val interval : interval end) -> sig type t = float val data : t data_point array val sorted : bool val missing_points : bool val interval : interval end
Now the output module has concrete data that can be viewed and used!
# module FloatTest = FloatBuilder (TestTsData);; module FloatTest : sig type t = float val data : t data_point array val sorted : bool val missing_points : bool val interval : interval end # FloatTest.data;; - : float data_point array = [|{date = {year = 2019; month = Jan; day = 1}; time = None; value = Some 1.}; {date = {year = 2019; month = Jan; day = 2}; time = None; value = Some 2.}; {date = {year = 2019; month = Jan; day = 4}; time = None; value = None}; {date = {year = 2019; month = Jan; day = 6}; time = None; value = Some 3.}; {date = {year = 2019; month = Jan; day = 8}; time = None; value = Some 3.}|]
A quick benchmark also shows that with my small test series (of whopping size n=5), there is not a significant difference between the runtime of the two implementations. This (pitiful) test does not feel real at all, so this is something that I'll need to re-visit in the future.
let rec call_n_times ?(count = 0) n f = if count < n then (f (); call_n_times ~count:(count + 1) n f) else () let timer n f = let start_time = Unix.gettimeofday () in call_n_times n f; let end_time = Unix.gettimeofday () in end_time -. start_time let build_ts_with_record () = let ts = build_time_series test_data (Days 1) in if ts.sorted then () else () let build_ts_with_module () = let ts = (module FloatBuilder (TestTsData) : TimeSeries) in let module TS = (val ts : TimeSeries) in if TS.sorted then () else () let benchmark n = timer n build_ts_with_record /. timer n build_ts_with_module
# benchmark 10_000;; - : float = 0.92154114381409
Concluding Thoughts
Althought the functor / module implementation is cool, it feels too complicated for the feature set, but this may change in the future as the module grows. The record implementation seems much more preferable as it is more simple in its implementation and especially in its usage. There does not seem to be any difference in effectiveness between the two implementations, so I have to declare the record implementation as the winner.
I still have a few TODOs left to complete in the module. These seem important, but I don't currently need them and they are probably not interesting to read about.
At the top of the TODO list:
(1) Build an actual test with a larger and more complicated data set
(2) Add functionality to fill missing data, using interval to measure where data is missing
(3) Add functionality to drop values from the time series (ex. where value = None)
All files are available here:
Homemade-Time-Series-With-OCaml
Grok-Correlation
Efficient-Functional-Programming
Hello-World