Tuesday, 24 November 2015

Use Bind Variables in Application SQL for Oracle

If you don't already know, then you should almost always be using bind variables in all SQL statements used in any applications you write that run against Oracle. Bind variables are placeholders within your SQL statements that get replaced by real data values when the SQL statement is executed i.e. at run time the real data value is "bound" to the corresponding placeholder variable in the SQL statement. So rather than using unique SQL statements with explicit data values embedded in them:
select col1, col2, col3 from mytable where col4 = 'A Value' ;
select col1, col2, col3 from mytable where col4 = 'Another Value' ;
select col1, col2, col3 from mytable where col4 = 'Yet Another Value' ;
use one common SQL statement for all executions instead, binding a different value for each execution:
select col1, col2, col3 from mytable where col4 = :b1 ;
Note that there are a couple of circumstances where it might make sense not to use bind variables e.g. data warehouses running ad-hoc queries with histograms present on many skewed data columns. But in the vast majority of circumstances, on all OLTP type Oracle databases, the applications should be using bind variables.
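
As an illustration, here is roughly how that looks from application code, using the cx_Oracle driver for Python (a minimal sketch - the connection details are placeholders, and the table and column names are just those from the example above):
import cx_Oracle

# Placeholder connection details - substitute your own.
connection = cx_Oracle.connect ("scott", "tiger", "localhost/orcl")
cursor = connection.cursor ()

# One shared SQL statement - only the bound value changes per execution.
sql = "select col1, col2, col3 from mytable where col4 = :b1"
for value in ("A Value", "Another Value", "Yet Another Value") :
    cursor.execute (sql, b1 = value)  # the real data value is bound at run time
    for row in cursor :
        print (row)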

The most commonly mentioned benefit of using bind variables is quicker SQL statement execution, because an existing execution plan in memory is simply reused. There is also an associated reduction in CPU load by avoiding the need to fully parse and validate every SQL statement when executed. However, there are many other benefits that come from using bind variables, which makes the case for using them even stronger. Even if your system does not have high CPU utilisation and performance is adequate today, it is still worth switching to bind variables for the other benefits you gain.

Here are some of the other benefits, in no particular order.

Library Cache

The memory needed for your library cache - the area of the SGA where SQL statements and their execution plans are stored - is significantly reduced. Reusing an existing execution plan for another execution of the same SQL statement, just with different run time data values for the bind variables, means that fewer SQL statements and execution plans in total need to be stored in the library cache. The biggest saving is on the storage for the SQL statement text itself, which is now stored once only in the library cache and reused, rather than once per execution with different embedded run time data values.

Latch Contention

Latch contention internal to Oracle is reduced. Parsing a SQL statement involves actions such as allocating storage slots in the library cache for the SQL statement and the execution plan, and reading parts of the data dictionary cache while checking table and column names, etc. All of this involves latches - internal locks on Oracle data structures to stop them being used or changed by another session. If you don't need to parse and validate a SQL statement from scratch then the number of latches that get locked during the execution of each SQL statement is significantly reduced.

Better SQL Statement Execution Statistics

When a SQL statement is executed Oracle measures and maintains a number of statistics on the execution of that SQL statement, such as total number of executions, CPU time used, disk reads, etc.

When not using bind variables each SQL statement is unique in the library cache. Eventually another new, unique SQL statement will force one of the existing SQL statements to be aged out of the library cache to release its slot. When this happens all of the execution statistics for that aged out SQL statement are lost too.

When using bind variables any existing entry for that SQL statement in the library cache is reused, and the execution statistics for each new execution are added on to those already accumulated. Given that such shared SQL statements stay in the library cache longer, the execution statistics that Oracle keeps accumulate over many executions. And in turn this means that these execution statistics better reflect the true execution profile of all SQL statements being executed against that database.
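
You can see these accumulated execution statistics for yourself in the V$SQLAREA dynamic performance view. A minimal sketch in Python using cx_Oracle, assuming a connection with the privileges to read the V$ views (the connection details are placeholders):
import cx_Oracle

connection = cx_Oracle.connect ("system", "oracle", "localhost/orcl")
cursor = connection.cursor ()

# The top 10 SQL statements by accumulated CPU time, with their statistics.
cursor.execute ("""
    select * from (
        select sql_id, executions, cpu_time, disk_reads, buffer_gets
        from v$sqlarea
        order by cpu_time desc)
    where rownum <= 10""")
for sql_id, executions, cpu_time, disk_reads, buffer_gets in cursor :
    print (sql_id, executions, cpu_time, disk_reads, buffer_gets)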

More Useful AWR Reports

In turn these SQL statement execution statistics are captured and stored in the Automatic Workload Repository (AWR) when the regular snapshots are taken (every hour by default). When using bind variables you get more complete and accurate statistics on total SQL statement execution captured, for the reasons given previously. When not using bind variables the churn of SQL statements in the library cache means that very few are still present when the regular AWR snapshot is taken, and their execution statistics are missing from the AWR.

In the SQL Statistics section of an AWR report the top SQL statements are displayed by their cumulative resource usage such as number of executions, CPU usage, disk reads, etc. The idea is that it is trying to help you identify the SQL statements that are having the most impact on the performance of your system. This means that when not using bind variables you are probably missing most of the SQL statement executions from this part of the AWR report because individually their resource usage is very low.

The AWR report helps you spot such an anomaly by telling you the total SQL execution count covered by the statistics being reported at the start of each table of resource usage by SQL statement:
Captured SQL account for ##.#% of Total [Statistic Name]
If the percentage is too low then it implies that there are a lot of SQL statements being executed that are not being captured and reported on by AWR. Even if it is over 50% it still means that up to 50% of the SQL workload on your system is not being captured and reported on by AWR. And you don't know whether you can ignore this or not because you don't know how many SQL statements are involved, how often they are executed, or how much resource they use while executing.

Not using bind variables is an obvious candidate for causing this "missing" set of SQL statements, but there are other possibilities too. So investigate further either way if this percentage for captured SQL is not over 80%. What is causing so much executed SQL to not be captured by AWR?

What about third party applications?

What if you don't have control over the application, and have no ability to change any of the SQL in it? Well you can still benefit from the use of bind variables by forcing Oracle to rewrite all SQL statements dynamically to use bind variables instead. This feature exists precisely to help with this kind of problem.

There is an initialization parameter "CURSOR_SHARING" which defaults to a value of "EXACT" i.e. an existing cursor in the library cache is only shared and reused if the text of the new SQL statement exactly matches that of the SQL statement that created that cursor originally. If bind variables are not being used, then this means that each SQL statement with an embedded data value will be different and will have a new cursor allocated for it in the library cache.

You can set "CURSOR_SHARING" to "FORCE" which causes Oracle to dynamically rewrite each SQL statement being executed and replace embedded data values by bind variables i.e. Oracle ends up executing the equivalent SQL statement with bind variables. This results in greater reuse of execution plans within the library cache and all of the benefits of using bind variables outlined so far.
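
Before changing the parameter you can get an idea of how much sharing "FORCE" would buy you: V$SQL has a FORCE_MATCHING_SIGNATURE column which has the same value for statements that differ only in their literal values. A minimal sketch (again using cx_Oracle, with placeholder connection details and an arbitrary threshold of 10):
import cx_Oracle

connection = cx_Oracle.connect ("system", "oracle", "localhost/orcl")
cursor = connection.cursor ()

# Groups of statements that would collapse into one shared cursor under FORCE.
cursor.execute ("""
    select force_matching_signature, count(*)
    from v$sql
    where force_matching_signature != 0
    group by force_matching_signature
    having count(*) > 10
    order by count(*) desc""")
for signature, statement_count in cursor :
    print (signature, statement_count)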

Note that there was also a value of "SIMILAR" you could set it to, but this has since been deprecated, so you shouldn't use this setting any more.

Monday, 9 November 2015

Data Driven or Application Driven Development?

I'm the kind of person who believes that Data is really important and needs to be treated and dealt with properly. Of course we also need Applications that use that Data - one without the other is no use at all. But a battle I often have with clients now is that Data needs to be modelled and designed at the same stage as the Application is. Ignore the data model and database design and you will likely end up with a mess that will only come back to haunt you later on. (More on this in other blog posts to come).

I've just read a reasonably interesting blog post from Pythian on being Data Driven where this distinction is mentioned explicitly. At one point one of the panel mentions how some people continue to ignore the importance of Data and its design (my emphasis within this quote):
I recently talked with someone who felt extremely behind the curve and they only recently did things like looking at data design as integral to the design phase of their system development. They had always, of course, reviewed design of the applications before implementing them, but data was always an afterthought. Applications create data, so what?

Now that they’re trying to get better at data, they had to actually explicitly make sure you got a data person in there designing the data, looking at the impact of what this application’s doing with the landscape or the information architecture they have and reviewing that before going off and actually writing code.

That was new for them.
This is the kind of thing I have been on about for years - the Data is as important as the Application and needs to be designed properly up front. In fact, you can argue that the Data is more important in the long run. Why? Because Applications come and go but Data lives forever. Think about it - you often replace one Application suite with another, but you will almost always migrate over your existing Data i.e. you might throw away the Application software but you keep the Data itself.

And yet I continue to see software development organisations adopt Object Oriented and Agile Development methodologies wholesale and treat the Data as an "afterthought". Why, when the Data is at least as important as the Application?

Monday, 26 October 2015

JSON, Python & Trello

JSON is a simple data storage format intended to allow easy data interchange between programs and systems. Data is stored in a text format as objects of "name : value" pairs. And that's about it, other than that objects can be nested (a value can be another whole object) and multiple objects can occur one after another in the data set (a list or array of such objects). This makes it slightly database-like, because you have records (objects) made up of named fields holding data values, with nested records inside other records for any kind of relationship or hierarchy.

The "name : value" storage structure is often termed a Dictionary, where a given name is associated with its value - other terms such as Hash Map are used in other programming languages. This makes a JSON data set an almost perfect match for processing in Python, which has directly corresponding data types: dict for the "name : value" pairs and list for multiple objects of the same type. And Python offers a standard "json" module (library) for importing and exporting such JSON data sets into and out of Python objects (the corresponding methods are called "load" and "dump" for consistency with the naming in the existing "pickle" module). Generally speaking, when you load in a JSON data set it is converted to a corresponding Python dict object, which may in turn contain other embedded list and dict objects according to the structure of the JSON data being loaded in.
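
For example, a JSON object parses straight into a dict and a JSON array into a list ("json.loads" works on a string, "json.load" on an open file):
import json

text = '{"name": "My Board", "lists": [{"name": "To Do"}, {"name": "Done"}]}'
board = json.loads (text)             # JSON object becomes a Python dict
print (board ["name"])                # My Board
print (board ["lists"] [1] ["name"])  # Done - a JSON array becomes a Python list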


A real example of this in use is with Trello, where you can export a Board and its To Do Lists to a local file as a JSON data set. You could then open this file, read in the JSON data set in Python, and process it one way or another. Note that you can also directly access the live data in a Trello List using a suitable URL - Google the Trello API for more on this.

Trello structures its data as follows in a hierarchy where each thing has a name and an internal, normally hidden identifier:
  • A Board is the top level structure - each Board is separate from each other
  • A Board contains multiple Lists
  • A List contains Cards - normally these are the "actions" you want to do
  • A Card can contain a few other subsidiary data items, such as a Checklist, or Labels (categories)
With the data from one Trello Board now read into a Python dict object it becomes very easy to navigate through this and pull out individual members of a list for further processing. An obvious example is to do a simple reformat of the lists for printing purposes - Trello only lets you print the whole board or an individual card, and a board can run to many, many pages when you have lots of lists and cards (items) in it given the default spacing between items.

Trello does not actually store the data in a strict hierarchy, but instead more in a relational format, where child items will contain the identifier ("id") of the parent item they are part of. Thus the top level JSON data for a Trello Board contains data of:
  • name - name of the board
  • id - identifier of this board, a long hexadecimal number
  • desc - description of the board
  • lists - all the Lists in this Board
  • cards - all the individual Cards in all the Lists
  • checklists - all the Checklists used in different Cards
i.e. most things are stored directly linked off a Board, and not stored nested within each other. This is clearly to allow easy moving of Cards between Lists - only the parent identifier within the Card needs to be updated, not the Lists involved or the Board itself.

The main data sets of lists, cards and checklists come into Python as a list (array) of multiple members, each member being a dict. Each data record for these child data sets contains an "id" field for the parent i.e. "idBoard" in a List, and "idList" in a Card (Checklists are slightly different as the Card does contain a list of the Checklist identifiers).
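
One practical consequence of this relational layout is that it is worth building a lookup dict from List identifier to List name before processing the Cards, rather than re-scanning the "lists" data set for every Card. A small sketch, assuming the Board has been loaded into "trello_board" as in the examples below:
# Map each List's "id" to its "name" so each Card's parent is a direct lookup.
list_names = {trello_list ["id"] : trello_list ["name"]
              for trello_list in trello_board ["lists"]}

for trello_card in trello_board ["cards"] :
    print (list_names [trello_card ["idList"]] + ": " + trello_card ["name"])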

Example 1 - Printing the names of the lists in a board

We just need to iterate over the "lists" data set of the Board, printing out the "name" of each List. We can also check if each list is open i.e. not closed. Note that I am using Python 3, hence "print" is a true function now.
import json
import sys

NAME = "name"
ID = "id"
DESC = "desc"
CLOSED = "closed"
LISTS = "lists"

json_file = open (sys.argv [1])
trello_board = json.load (json_file)
print ("Board: " + trello_board [NAME])
print ("Lists:")
# trello_board [LISTS] is a Python list, each member being a Python dict for each List
# Loop through all the Lists i.e. one List at a time
for trello_list in trello_board [LISTS] :
    if (not trello_list [CLOSED]) :
        print (trello_list [NAME])
This assumes that the JSON file name is given as the first argument on the command line to the Python program.

Example 2 - Printing out the cards in one list

Assuming that the second command line argument is the first part of the name of a List in the Board, then we can use this to find that List and then print out just its Cards.
import json
import sys

NAME = "name"
ID = "id"
DESC = "desc"
CLOSED = "closed"
LISTS = "lists"
CARDS = "cards"
CHECKLISTS = "checklists"
IDBOARD = "idBoard"
IDLIST = "idList"
IDCHECKLISTS = "idChecklists"

json_file = open (sys.argv [1])
trello_board = json.load (json_file)
print ("Board: " + trello_board [NAME])

# Loop through all Lists in the Board, comparing its name against the input name
for trello_list in trello_board [LISTS] :
    # Only match on the leading part of the List name
    if (trello_list [NAME].startswith (sys.argv [2]) and not trello_list [CLOSED]) :
        print ("List: " + trello_list [NAME])
        # Loop through all Cards in all Lists, checking the parent ID of the Card
        for trello_card in trello_board [CARDS] :
            if (trello_card [IDLIST] == trello_list [ID] and not trello_card [CLOSED]) :
                print (trello_card [NAME])
                if (trello_card [DESC]) :
                    print (trello_card [DESC])

Note that in reality you would have extra code to check that the command line arguments were present, and better error handling for things like the file not existing.

Also I actually used "textwrap" in Python to word wrap long lines properly in the output, and indent wrapped lines for better readability - I've just used "print" directly in these examples to keep it simple.
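
For reference, that "textwrap" usage is essentially a one line call (a sketch - the width and indent values here are just illustrative):
import textwrap

desc = "A long card description that needs to be word wrapped to a readable width for printing, with wrapped lines indented."
print (textwrap.fill (desc, width = 70, initial_indent = "  ", subsequent_indent = "  "))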


That's it for JSON and Python using Trello as an example. You can load a JSON data set from a file into a corresponding Python data structure with just one method call (json.load), and it is then very easy to traverse that data structure, finding and processing the data elements you want, using the field name as the index into the dictionary of "name : value" pairs in the Python data structure.

Friday, 16 October 2015

Getting My Passion Back

I've been quiet for a long while on this blog because I had been working for a company that just did not get databases at all, and it was a real energy sapper dealing with the consequences of all of the wrong decisions they kept making. I'll spare you the gory details, but the biggest problems of many were the wholesale adoption of Agile Development (Scrum) with no training, support or planned cutover, coupled with the complete abandonment of any documentation, and the decision by the Technical Architect that databases were now trivial and unimportant, so proper, up front database design was not needed any more.

As you can imagine the consequences of such changes were almost catastrophic and very repetitive:-
  • No one knew whether any newly developed piece of software was correct because there was no documentation at all on what it was supposed to be doing in the first place or how it should have been doing it
  • Doing Agile Development with no clear end goal meant that every new Sprint was always refactoring the previous Sprint's work to add in the new but unplanned additional functionality so real productivity kept decreasing while test failure rates increased
  • The resultant database was a complete mess with data all over the various tables, undocumented, with duplicated data fields within the same table, and no indexes whatsoever other than on the primary key fields
  • They used Hibernate to "generate" the SQL queries, but no one knew how to make it generate "good" SQL - one search query joined the main data table to itself 5 times in the same query, and no one in the development team could explain why. 
That's progress for you!!

I've been out of there for some time now, but it left behind a distinctly bad taste about "modern software development" and databases, and put me off doing database work for some time.  Luckily though I've slowly come back round to realising that databases are the right place to store important data, and they should be designed properly and not just thrown together.  And there are still challenges to be met with good database design, and new technologies to learn and use.

Which means both that I should be starting to blog more frequently in the future, and that I'm now looking again for database related work (Oracle based, around Woking in the UK, if anyone needs a good Oracle person).

Wednesday, 26 March 2014

Red Gate Source Control for Oracle

The background to this - I was "given" a free license to Red Gate Source Control for Oracle some months ago as part of some marketing and awareness activity Red Gate were doing. I've been busy with other things, so I've only now got around to trying to understand what the product does and see if it can be of any benefit to me. Hence this review.

Before I start my review I want to make two things clear. The first thing is about how you treat Databases in a software development world. Databases are fundamentally different to Application Software. Whereas Application Software is normally rebuilt each time by compiling it from its source code, Databases are never rebuilt - they are always modified in place. A customer's data cannot be thrown away and replaced by a new empty database. Changes to the database must happen in place, in an incremental way. To put it another way, Application Software evolves between each release, with each release being a next generation that replaces the one before it. Databases though metamorphose between releases, with changes being applied directly to the database so that it changes in place and existing data is retained.

This means that a Database must be maintained through a series of "change scripts" that make incremental changes to the database, and these change scripts must be run in a controlled sequence when upgrading the database from one version of its associated Application Software to another version. I would expect any "Database Source Code Control" tool to help in the production of these "database change scripts", and to help manage their check in to the source code tree.

The second point is that I am only interested in source code control tools that truly track "changes" to source code files. I want to use something that knows what changed in a file, and can use this to do things like merges between different branches in a sensible way. I don't want something that just tracks the final contents of each file, and can only reverse engineer what changed between two versions by working out the differences after the event. I've used Mercurial and I know that it works by tracking changes, and I'm also aware that Git does something similar. These are the main tools that truly track changes (there may be a few others), which means that most other source code control tools do not. Tracking changes to files is a very powerful way of doing source code control, and enables a whole bunch of advanced features around propagating changes between versions and merging changes together.

Red Gate Source Control for Oracle

Before I tried to use the product I thought I would read up on it to understand exactly what it thought it was capable of, and how I would need to configure it to get out of it what I wanted. The first disappointment was the slightly weak product information on their web site, which was of the generic "we can do it" type without being specific about what "it" actually was that it did. But then I guess that is marketing for you.

What I did get from the product web page was that it only works with the source code control tools Subversion (SVN) or Team Foundation Server (TFS) from Microsoft. This is the first major shortcoming for this product. I'm not going to stop using Mercurial, which I find to be really good at what it does.

After that I did manage to find documentation for the product online, via the Support link on the Red Gate web site, which was good enough at describing how to use it. It seems that it will generate a set of SQL script files containing the DDL SQL statements to create the database you are using i.e. it reverse engineers from your current database the SQL that would be needed to recreate it. And having generated these SQL files it will then check them into your source code tree for you. When you change the structure of your database in any way, you can make it do a refresh and it will re-generate these SQL files, and re-upload them to your source code control tool. This means that it will update those SQL files that have changed in some way since the last source code update.

This was pretty much confirmed in a review by Lewis Cunningham, which described using Red Gate Source Control for Oracle and the series of SQL files it generated.

Which brings me to the second major shortcoming of the product - as Lewis notes "It generated create scripts rather than alters". Remember the point I made at the very start: deployed production databases must metamorphose in place, keeping the existing data inside them. This means that databases must be maintained through a series of "change scripts" that alter existing database structures to update them. We do not want just another set of "create" scripts that will create a new empty database. This is confirmed in a comment on Lewis's post by a Red Gate employee:-
I am one of the developers ... You were wondering ... how you might get the alter scripts. ... You can checkout a version of the database from SVN and Schema Compare can compare it to a live database and generate the alters you need to deploy the new version.
He is saying that Red Gate Source Code Control for Oracle will not do the "alters" for you, but Red Gate have another product called Schema Compare that will produce the "alters" for you.

Which leads on to the third shortcoming with this tool. How did my local test database get "changed" in the first place, so that Red Gate Source Control for Oracle could detect this change? If the Red Gate tool did not produce the SQL that I ran to change my local database, then it must have come from somewhere else. Most likely it came from another tool, or from SQL statements I wrote myself manually and ran against the database.

Now, if I've already got the SQL statements that "alter" the database structure, and I am sensible enough to put these into a file before running them against my local test database, then I already have the necessary "change script" that updates my database in place. And I could just check in this "change script" directly to my source code tree, and not bother with Red Gate Source Control for Oracle at all.


To summarise what Red Gate Source Control for Oracle does:
  • Red Gate Source Control for Oracle only works with SVN and TFS
  • It will not produce "alter" SQL scripts to run against a database - only "create" SQL scripts
  • You must manually or otherwise create these "alter" SQL scripts yourself to change your local database
  • It only detects changes to a database after you have already run your own "alter" SQL to change it
As a result of its limitations I didn't get around to using it because it would not work with my preferred source code control tool (Mercurial), and it would not deliver what I really needed anyway (database "change scripts").

The "holy grail" of database development tools for me would be one that helped produce the "alter" SQL statements in the first place, as a series of "change scripts". I could then test these "change scripts" locally in my test database, and if happy with the results check them into the source code tree. Unfortunately this tool from Red Gate does not do this at all, and does something completely different - only generating a complete set of "create" SQL scripts all the time, and only after you have already changed the structure of your local test database somehow.

Furthermore, I'm not sure what this tool really delivers beyond running "DBMS_METADATA.GET_DDL" (or the equivalent of it) in a loop to get the SQL DDL statements for each object in the database. I've already got my own set of scripts that do just this. The only thing Red Gate seems to add is a nice GUI in front of it all, and some integration to two source code control tools.
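
For the record, the core of such a home-grown script is little more than this kind of loop (a minimal sketch using cx_Oracle from Python; the connection details are placeholders, the list of object types is trimmed for brevity, and GET_DDL returns a CLOB that must be read):
import cx_Oracle

connection = cx_Oracle.connect ("scott", "tiger", "localhost/orcl")
objects = connection.cursor ()
ddl = connection.cursor ()

# For each object the user owns, ask DBMS_METADATA for its creation DDL.
objects.execute ("""
    select object_type, object_name
    from user_objects
    where object_type in ('TABLE', 'VIEW', 'SEQUENCE')""")
for object_type, object_name in objects :
    ddl.execute ("select dbms_metadata.get_ddl (:t, :n) from dual",
                 t = object_type, n = object_name)
    clob, = ddl.fetchone ()
    print (clob.read ())  # the SQL DDL statement to recreate this object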

Tuesday, 11 March 2014

Speeding up Imports

There are a number of techniques you can use to speed up an Oracle Import, some of which I'll describe here. This is not any attempt at a comprehensive list, just some of the main techniques I have used that can really speed up some parts of an import. I've seen a standard import come down from 2.5 hours to about 0.75 hours using these techniques.

The first thing to realise is that an exported dump file is a "logical" dump of the database - it contains the SQL statements necessary to recreate all the database structures, and the data records to be loaded into each table. The import works by executing these SQL statements against the target database, thus recreating the contents of the original database that was exported. We can leverage this to our advantage in various ways.

The objects and their corresponding SQL are in a certain order in the exported dump file, and they are imported in a strict sequence by the "impdp" utility. This order is roughly the following for the main objects (it is not a complete or absolute list, as I have simplified some of the details a bit):
  • Sequences
  • Tables - Creation, but no data
  • Table Data
  • Indexes
  • Constraints - Check & Referential
  • Packages, Functions, Procedures
  • Views
  • Triggers
  • Functional and Bitmap Indexes
  • Materialized Views
When you run a full import, each of these sets of objects is done in turn, one after the other. This means that the import as a whole is not really parallelised. There is a PARALLEL option, but this only really affects the "Table Data" section, and allows multiple slave readers of the table data to run INSERT statements i.e. different tables can be loaded at the same time, but that is all. The PARALLEL option does not affect the other sections, such as index builds, which is a shame.


It's all about Disk Bandwidth

Before I get into specific techniques, I want to point out that an Oracle Import is fundamentally a disk I/O bound task. The techniques I'll be describing will be using parallelism of various forms to get more work done at the same time on the system, to reduce the total elapsed time. And this in turn will produce more disk I/O operations running at the same time. So you will ultimately be limited by the system's ability to get data off and on to the disks, rather than being limited by the amount of CPU or Memory it has in it.

Consider how each of the object types in turn is created within the database during the import:
  • The initial table data load is mainly disk I/O - read all the data records from the dump file and INSERT
  • Index creation is mainly disk I/O - read all the data in the table, sort it for the index, write the index to disk
  • Constraints are mainly disk I/O - read all the data in one table and verify value or existence in another table
  • Materialized Views involve executing the SELECT to populate the MV on disk
Yes, some CPU will be used (certainly for data load and index creation), but the CPUs must each be fed with data from disk to keep them busy. So CPU utilisation can only go up by increasing the disk bandwidth used.

In my experience it is the index creation and the referential integrity constraints that take the longest elapsed time during imports, due to their need to always read all the data records in a table.


Divide And Conquer

The main technique I use to speed up an import is to import each object type separately via its own "impdp" command, using the "INCLUDE" option to only do those type of objects. While this does not directly speed up the import itself, it sets the foundation for being able to do so. Now we have a series of individual import commands to import the whole database, we can then sub-divide these up within themselves.

So instead of loading data into all the tables in one command, which does them sequentially one after the other, we can run separate import commands at the same time for each of the largest tables, and run another import command for all the other tables (using the EXCLUDE option to omit the largest tables). Now we have a high level of parallelism within the table data load, which we can fine tune based on the relative sizes of our tables and the disk I/O bandwidth available on the system.
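
As a sketch of the kind of orchestration involved (the dump file name, directory object, table names and credentials here are all hypothetical - adjust them to your own environment):
import subprocess

common = ["impdp", "system/password", "directory=DP_DIR",
          "dumpfile=full.dmp", "content=DATA_ONLY"]

# Start one data load per big table, plus one for all the other tables.
jobs = [
    subprocess.Popen (common + ["tables=APP.BIG_TABLE_1"]),
    subprocess.Popen (common + ["tables=APP.BIG_TABLE_2"]),
    subprocess.Popen (common + ["exclude=TABLE:\"IN ('BIG_TABLE_1','BIG_TABLE_2')\""]),
]
for job in jobs :
    job.wait ()  # block until every import has finished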


Parallelizing Indexes

We could use a similar technique on indexes of doing different tables at the same time, but this can have other issues to deal with e.g. an increase in the amount of temporary tablespace storage used. Also you will probably be limited by the time taken for all of the indexes on the largest table in the database, which would be built serially one after the other.

Instead what I do is use the "SQLFILE" option of impdp with the "INCLUDE=INDEX" option to get the "CREATE INDEX" statements into a single SQL file. Then I edit this file and add a "PARALLEL" clause to the end of each "CREATE INDEX". In fact, import actually puts a "PARALLEL 1" on the end of each "CREATE INDEX" already (at least in the version I am using), so I only need to do a global replace of this with the degree of parallelism I want. You probably want something like the same as the number of disks you have, but you can experiment to find the best value.
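
The global replace itself is trivial to script (a sketch - the file names and the degree of parallelism are illustrative, and exactly the same pattern works for the "ENABLE NOVALIDATE" edit on the constraints script described below):
# Rewrite the impdp SQLFILE output, upgrading each CREATE INDEX to PARALLEL 8.
with open ("create_indexes.sql") as sql_file :
    script = sql_file.read ()

script = script.replace ("PARALLEL 1", "PARALLEL 8")

with open ("create_indexes_parallel.sql", "w") as sql_file :
    sql_file.write (script)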

Once edited I can then create the indexes by running this SQL script from SQL*Plus, after all the tables have had their data loaded in. As each index is being created using parallelism, each will be maximising the resources on the system, and we can just let it build one index at a time. There may not be any benefit from trying to split the indexes into two separate SQL files and running them both at the same time.

Note that you can create and edit this index SQL script as soon as you have the dump file you want to import. You can do this while the table data is being loaded, so that the index SQL script is ready before the tables have finished loading. This way there is no need for a gap between the end of the table data being loaded and the index creation starting.


Skipping Constraints

As mentioned, constraints involve reading all of the data in a table to validate that it is correct against the specified constraint. If we are importing a complete database from another system, and we know for certain that all of the data satisfies all of the constraints in that database, then we can basically skip validating the constraints completely. The speed up from this is quite significant, as we completely avoid enormous amounts of disk I/O for every table in the database, but can still end up with all the constraints in place.

As for indexes I use the SQLFILE option to produce a file containing the SQL for the constraints. I then edit this file and add "NOVALIDATE" to the end of each constraint statement. This means that Oracle will create the constraint, which is what we want, but it will not validate the data i.e. it will assume that the data is valid and not explicitly check it, which is also what we want. In fact each constraint SQL statement ends in the word "ENABLE", so we can do a global replace of this by "ENABLE NOVALIDATE", which is easy.

Again, as for indexes, we can produce and edit this SQL script as soon as we have the dump file, and then run it at the appropriate point in the steps being used to import the database.


Exports

I've discussed the import side of things so far, as that is where you often want the speed up. But you can apply similar principles to the export of the database, using divide and conquer to run multiple exports at the same time. Again this will increase the disk I/O operations occurring at the same time, but will reduce the total elapsed time of the export. Some of the techniques include the following:
  • Do an export of just the metadata i.e. just the SQL to create the objects, using "CONTENT=METADATA_ONLY"
  • Export the data from the biggest tables individually to their own dump files, using the "TABLES" option to list each table
  • Export the data from all the other tables to a dump file, using the "EXCLUDE=TABLE:" option
    • Beware that the "EXCLUDE" option has a slightly weird syntax where you need to embed quotes around the list of table names


Full Database Migration

If you are using export and import to migrate a production database from one system to another, then you can combine all of these techniques together to minimise the total elapsed time before the second database is up and available.
  • Once the metadata export has finished, transfer it to the target system over the network
    • Then create the empty tables, and produce and edit the SQL scripts for indexes and constraints
  • As each table data export finishes, start its transfer over the network to the target system
    • And as each data dump file arrives on the target system start its load
  • Once data load has completed, start creating the indexes, followed by the other steps
This overlaps execution of the export, transfer of the dump files over the network, and the import, keeping both systems busy at the same time.

Monday, 17 February 2014

Packaging Python With My Application

To try and keep a longer story as short as possible I needed to package up the Python run time environment to ship along with a Python based application I had written. And in this case the target platform was Windows, though the solution will also work for Linux or any other platform (however most Linux distributions will already have Python on them). I needed to ship Python itself with my application to guarantee that the application would be able to run (had everything it needed), and to avoid complications of requiring the customer to download and install Python themselves (potential issues over version compatibility).

Through a number of blog posts by other people about different "packaging" techniques (see References later) I came up with the following solution that works. This is not the only method of packaging a Python application, and indeed it is quite surprising how many different techniques there are. But this worked for me, and was what I wanted i.e. including Python itself with my Python based application. One of the neat benefits of this for me is that the whole Python run time I need is only 7.5 MB in size, and the main ZIP file of the run time environment is only 2.5 MB in size, which shows how compressible it all is.

Packaging Python with my application

First I create a directory (folder for Windows people) for my application, and put all my application's Python files in there.

Then I create a sub-directory in this to put Python itself into e.g. Python33_Win.

Into this I put the following files (the particular ".pyd" dynamic libraries needed will vary by application - these are the ones mine needed):
  • python.exe
  • python33.dll
  • python33.zip
  • cx_Oracle.pyd
  • msvcr100.dll
Note that "cx_Oracle.pyd" is needed because my application makes a connection to an Oracle database to do its work. Also "msvcr100.dll" is technically a Microsoft DLL that is needed by programs written in C, which the Python interpreter is. Microsoft allows this DLL to be copied for the purpose of running such C based programs.

The "python33.zip" is something I created, and into which I put the rest of the necessary Python run time files. There are quite a lot of these, all taken from the directory where you installed Python on your own system:
  • All the ".py" files in the top level Python folder
  • The following folders including any sub-folders:-
    • collections
    • concurrent
    • ctypes
    • curses
    • dbm
    • distutils
    • email
    • encodings
    • html
    • http
    • importlib
    • logging
    • pydoc_data
    • site-packages
    • unittest
    • urllib
    • venv
    • wsgiref
    • xml
Then I wrote a wrapper script to run my application via the Python run time included. In this case it is a Windows batch script, and it exists in the folder above my application source code. My application needs two command line arguments provided e.g. user name and password.
MyAppName\Python33_Win\python MyAppName\myappname.py %1 %2
That's it, and it works.

How does it work?

Built into the Python interpreter i.e. into "python.exe", is the functionality to dynamically load into itself the various libraries it needs at run time. Examples of these are the "*.pyd" files explicitly included in the Python directory. However, it also has the functionality to open up a ZIP file and look inside that for the libraries it needs. Thus we can take most of the Python run time environment files, put them into a PYTHON33.ZIP file, and Python will look in there to find the files it needs. Which makes packaging up Python pretty simple.
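
You can see the same mechanism from within Python itself - any ZIP file on "sys.path" is searched when resolving an import (a sketch, assuming the standard library ".py" files have been bundled into "python33.zip" as described above):
import sys

# Make the bundled ZIP file the first place Python looks for modules.
sys.path.insert (0, "python33.zip")

import email  # resolved from inside the ZIP, assuming email/ was bundled there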

The exceptions to this are the "python33.dll" and Microsoft C DLL, and a few PYD files (which are Python dynamic libraries, a bit like Windows DLLs). These will not be found in the ZIP file, as they seem to be needed before Python gets around to opening such a ZIP file.

Further Notes

  • These files and directories are what I needed to make my application work. The particular set needed can be different for each Python based application, depending on what modules you import into your program at run time.
  • I tried using the "modulefinder" module to report on which files were actually being referenced at run time when my application was being run - see the sketch after this list. This helped reduce down the total number of files I needed to include in the ZIP file.
  • The ZIP file is named "python33.zip" because I am using Python version 3.3, and its DLL is named "python33.dll" i.e. it looks for a ZIP file with a matching name of its version.
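
As an illustration of that "modulefinder" step (a sketch - "myappname.py" stands in for your application's top level script):
from modulefinder import ModuleFinder

finder = ModuleFinder ()
finder.run_script ("myappname.py")

# Every module the application would import, and where each was found.
for name, module in sorted (finder.modules.items ()) :
    print (name, module.__file__)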


The main articles I found that helped me and pointed out this way of packaging up Python were:
  • Distributing a Python Embedding Program which states that everything can be put into a ZIP file, except for some special files, and gives an example where those files are listed.
  • How to Distribute Commercial Python Applications which describes the different packaging options available for Python based applications, and concludes that "For complex scenarios, my recommendation is to go with bundling CPython yourself and steer clear of the freezers".
    • It does not however tell you how to "bundle CPython yourself" - it just states that it is possible to do it.