Monday 14 December 2015

Python for the DBA (2) - Data Unloading

I've mentioned before that I like the Python programming language. Not only is it a "clean" language for writing programs in with a rich set of data types and structures for manipulating data, it also has a standard API for database access making it "database neutral". This makes it relatively easy to get data that is in a database into and out of your Python code where you can use the power of Python for whatever kind of data manipulation you need.

To show how easy it is to get data into a Python program I'll show a little utility to unload data from one table in a database to a flat, text file as one record per line with each data field separated by special delimiters.

Assuming the following:
  • Your Python program already has a database connection
    • This is relatively easy, but does vary by specific database driver being used
  • And it has opened the output file to unload to i.e. an open file handle and not the file name
  • And it has the name of the table, the field delimiter characters and the record terminator characters in local variables
Then you can write a simple function to call that will unload the data from that table to the specified file:
def unload_table_to_file (db_conn, table_name, out_data_file, field_separator, record_terminator) :
    # Preparation steps before doing the real work
    # Replace literals of special characters in the separator and terminator
    record_terminator = record_terminator.replace ('\\n', '\n')
    field_separator = field_separator.replace ('\\t', '\t')
    
    # Simplest SQL to get the data, assume column ordering is correct
    select_SQL = "select * from " + table_name
    
    # Now do the work - read in all records and output them
    print ("=== Unloading . . .")
    select_cursor = db_conn.cursor ()
    select_cursor.execute (select_SQL)
    for data_row in select_cursor.fetchall () :
        # Convert Python native data types to simple string representations
        output_fields = []
        for data_field in data_row :
            output_fields.append (str(data_field))
        out_data_file.write (field_separator.join (output_fields) + record_terminator)
    print ("=== Finished")
The main lines are:
  • First make sure that if any "escaped characters" are in the delimiters then they are the real escape characters
    • "\t" means the TAB character, and "\n" means the NEWLINE character
  • We just use a "select * from table" to get all the data into the Python code
    • The String concatenate operator is just the "+" operator i.e. "+" is overloaded depending on data type
  • And we execute this "select" via a Cursor created from the database connection passed in
  • The real work happens in 2 nested loops - for each data row, for each data field in each data row
    • The fields from one record are put into a list (essentially equivalent to an array)
    • This is so we can use a single action at the end to concatenate them all together with each data field value separated by the "field separator"
    • Then we "write" out this concatenated list of fields to the output file
  • The "str" function will return the String representation of a data value e.g. convert an integer to its String representation
    • This assumes that there exists a data type specific "str" method for each possible data type
  • The "join" method of a String variable iterates over a list of data, and returns one String of the list contents concatenated together with the value of the String variable between each one.
I think this shows some of the power and elegance of Python that reading all the data from a table can be done with so few lines of code, and that the code itself is so readable. And this code is flexible enough to deal with different values for the field separator characters and the record terminator characters. You could just do a classic comma separated list (',' as the field separator) or something more complicated to avoid issues where a comma may appear in the data itself e.g. "|__|" as the field separator (vertical bar, underscore, underscore, vertical bar). And this flexibility is all handled by this Python code.

Caveats

This is probably the most simple Python code to achieve the goal of reading all the data in a table and writing out to a flat, text file. There are several things it does not do at all, and other things that could be enhanced with it.
  • It is assumed that all data types being used can be cleanly converted to valid String representations using "str()"
    • This may not be true for some data types, or you may wish to handle some data types differently e.g. Dates
  • Dates should probably be handled differently, to ensure the full date and time values are output
    • You could execute an "ALTER SESSION" SQL statement to set the value of NLS_DATE_FORMAT to something suitable
    • Python has a native Date / Time data type, "datetime", which Oracle DATE columns are mapped to, and you could test for this
  • You could add extra functionality such as debugging / tracing messages (use the "logging" module) or other progress messages
    • You may wish to add extra counters to count the number of records and fields per record

No comments: