Pentaho Enterprise and Community Editions offer the Scheduling tool to enable jobs to be set to run at certain times on a repeatable basis. At the moment, while the schedule is viewable, there is no built-in method for extracting the schedules for reporting or documentation purposes. A couple of years back I figured out the methods to achieve this and I’d like to share the ETL with the wider Pentaho community.
This ETL uses the Pentaho REST API to get the XML containing the scheduler information, including the emails, the output types, the messages and which reports/jobs/transformations are being used.
In section 1, we set the URL, password and HTTP Accept type header (application/xml) and return the xml in a single string variable.
In section 2, the result XML is sent to XML readers which will split it to gain different contexts of the data
a. ‘Get job header’ fetches the timing information
b. ‘get DayOfWeek Recurrences’ picks out the scheduling recurrences for days of the week (e.g. when set to run Mon, Wed, Fri)
c. ‘job data detail’ extracts the information stored in ‘key/value’ pairs
Section 3 is where ‘key/value’ pairs which are in an indefinite number of rows, are ‘pivoted’ so that the key is now the field name and the ‘value’ is the…value.
After those sections, the rows from each line are joined using the jobname, UNIX timestamps get converted to database timestamps and the data is prepared before finally being inserted to the table pentaho_schedule
Here’s a zip of the whole ktr. It has explanatory notes and the (postgres) DDL for creating the pentaho_schedule table. Converting that to another database should be easy.
Don’t forget to change the database connection to suit your setup.
Andrew Cave
Andrew Cave is a senior data engineer with BizCubed. He has worked in network data, billing, telco credit and debt after a career in the welfare sector. He loves databases. Follow him on LinkedIn