Global Spin: Database-Backed Web Sites

Database-Backed Web Sites

Click here to order from a bookstore near you.

The utility of the Web took a giant leap forward when databases started being used as backend repositories for Web applications. Databases provide a flexible, dynamic way to store information for use on the Web, and existing databases provide treasure troves of data to be presented and extended by Web applications.

Databases, however, are used differently than are Web applications. A database user is usually accessing it through a continuous programmatic interface that makes one SQL query at a time with progress indicators and partial results available to mask the time taken by large queries. Fortunately, the utility of databases in Web application design has led to a number of solutions–both within Perl and within the database systems themselves–that enable databases to be used in a Web-centric fashion. They are used without sacrificing the data integrity, availability, and performance that are their greatest strengths.

Examples of this chapter's principles are offered for MySQL and Oracle, but the same ideas apply to any relational database management system (RDBMS), including commercial systems such as SQL Server, Informix, or DB2, as well as open-source systems such as PostgreSQL.

Accelerating DBI for the Web

When a Web site is using persistent Perl connections, as described in Chapter 10, "Tools for Perl Persistence," improving database access time isn't difficult. The persistent environment provides a new layer of continuity between Web server requests, and this layer can be used to maintain database connections and other cached information about the database.

Performance blocks can be removed from Web applications at many levels:

The database connection architecture can be made more efficient using Apache::DBI.
Database query preparing and execution can be streamlined using placeholders and cached statements.
Perl's internal representation of the resultant data can be reduced and accelerated by using references.

Using Apache::DBI

The Apache::DBI module has few features, but the most important is the simple redefinition of DBI's connect method, which can improve Web application performance immensely. Normally, a database connection is opened by DBI each time the connect method is called, and the connection is closed when the disconnect method is called, usually at the end of a program or module. In Web Common Gateway Interface (CGI) context, this means that database connections are opened and closed every time a CGI program is run, which could be hundreds of times per second. Because a database connection takes up to ten seconds to be established, this means that the database connection time is a thousand times too slow to be acceptable.

In a persistent environment, the same effect occurs. Even though the Web application persists across many requests, connect and disconnect still are being called during each request, and a new database connection is being made and broken each time. Again, this increases the application runtime to an unacceptable level.

Apache::DBI works by caching open database connections based on the database accessed and the database user specified when connecting. The connect method within DBI defers to the same method within Apache::DBI if the environment is persistent, which enables the Apache::DBI connect method to keep track of open connections and to return them to the caller program as needed. The connect method also checks connections before returning them to ensure that a particular connection has not been closed by the database or some external process.

Apache::DBI also overloads the disconnect method to keep open database connections from being closed inadvertently by programs that are written without Apache::DBI in mind. Because of this, no programs need be modified and no tables need be restructured to use Apache::DBI. In fact, it's better to invoke Apache::DBI outside the body of your program code to ensure that it is used by all Web applications across the board. This can be done by including a directive in the httpd.conf configuration for Apache or the startup scripts for mod_perl, VelociGen, and other persistent Perl environments, as shown in Listing 14.1.

Listing 16.1 Load Apache::DBI for all Applications

01 # add one to httpd.conf file for 
02 # Apache-based servers with mod_perl
03 PerlModule Apache::DBI
04 PerlRequire Apache/DBI.pm
05
06 # add one to startup.pl, vep_startup.pl
07 # or similar Perl engine startup scripts
08 use Apache::DBI;
09 require Apache::DBI;
10
11 # add a connect string to startup.pl, etc.
12 # to open a connection when the server starts
13 Apache::DBI->connect_on_init($datasource, $user, $pass);

Because of the unique relationship between DBI and Apache::DBI, the DBI module checks when it is loaded to see if Apache::DBI is already in use. Thus, if Apache::DBI is loaded before all occurrences of DBI in a Web application, database connections are cached regardless of whether an individual script uses Apache::DBI.

Lines 03 and 04 of Listing 14.1 illustrate two possible ways to invoke Apache::DBI at server startup when using mod_perl. The PerlModule or PerlRequire directive in httpd.conf corresponds to a use or require directive in a Perl program, respectively. Alternately, line 08 or 09 could be used in the startup script executed when the Perl engine starts, as long as Apache::DBI is called before DBI in that script. Either method ensures that any connection made by DBI in persistent programs is cached by Apache::DBI.

Additional performance gains can be achieved by creating a database connection when the server starts each Perl engine. This is done by calling the connect_on_init method in the startup script with the same parameters that are used in the Web application, as in line 13 of Listing 14.1. When doing this, make sure that the arguments passed to connect_on_init are exactly the same as those used in the Web application, including preference arguments such as AutoCommit or RaiseError. Otherwise, Apache::DBI assumes that the connections are different, and the initial connection are cached but never used.

Note that Apache::DBI does not require the Apache server to be used. Any Web server that supports a persistent Perl environment (see Chapter 10) can use Apache::DBI by placing the use statement in the Perl engine's startup script.

There is one case in which Apache::DBI is not advised. When site users are accessing the database with individual database user names and passwords, it is better to forego Apache::DBI because the module would cache each user's connection separately. Additionally, Apache::DBI can't share database connections across Perl engines. As a result, each user's connection would be held open and cached across multiple Perl interpreters, and the number of open connections would soon grow to a point at which the database would refuse connections or otherwise behave erratically.

An additional case to watch for is when Apache, mod_perl, and Apache::DBI are used together in a high-transaction environment. Because Apache::DBI can't share database connections across Perl engines, at least one connection is opened for each engine, which translates to one open connection per httpd process started by Apache. On a high-transaction server, this could mean that hundreds of database connections are open at all times. Although this might not cause problems for most database systems, small sites might find that the database begins to behave erratically when Web applications are scaled to higher volumes of traffic.

Preparing Statements with Placeholders

When using DBI for database access, a query generally is passed to the database in two steps with a separate process used to return the result. The query is first sent to the database with a prepare method, which instructs the database to make an execution plan for the query and which returns a statement handle representing the prepared plan. Then, an execute method is called on the statement handle, which instructs the database to execute the plan and return the results to DBI. After DBI has the results, a variety of methods can be called on the statement handle to extract the data for use by the program. This process might seem cumbersome to database developers who are accustomed to instant results returned from a single query entered at the command line or through a graphic interface, but DBI is designed this way to match the steps performed internally by the database when a query is executed. This enables each step of the process to be optimized by the program based on the type and frequency of queries that are expected and the process by which the result will be used.

One way to optimize database-to-Web performance is by using placeholders to stand in for data in an SQL query. When a query is prepared by DBI, the database creates a plan of the best way to access the data needed to fulfill the query. This is another example of the one-user, continuous-access database model that assumes that the user is performing a unique query and waiting for the results. The plan is simply thrown away after it is executed, leaving the database to create another plan from scratch. Unfortunately, a Web application is likely to make a similar query hundreds of times per second, so the database spends much of its time generating the same plan repeatedly.

Placeholders help this situation by giving the database a uniform template of the type of query that will be used in a number of executed statements. The placeholders themselves–demarcated by a question mark–can indicate any value that would ordinarily be assigned to or compared against a data field. For instance, in this SQL query:

Listing 16.

SELECT *
FROM foof
WHERE foo = 'bar'

the SQL subclause WHERE foo = 'bar' could be rewritten as WHERE foo = ?, with the value 'bar' being represented by a placeholder. It could not be written as WHERE ? = 'bar' because foo is a field name, not a value. The database accepts the placeholder in a template and makes a plan around it–thereby creating a more generic plan that then can be reused.

After a template has been sent through the DBI prepare method and an execution plan has been generated by the database, the DBI execute method can be called against the prepared plan that's stored in the DBI statement handle $sth (see Listing 14.2). This can be done as many times as needed without generating a new plan. The values specified by placeholders in the prepared template are passed as arguments to the execute method each time it is called. These values are analyzed by DBI to determine their data type if it hasn't previously been specified.

Listings 14.2 and 14.3 provide a comparison between a standard way of invoking DBI for a stand-alone (or CGI) Perl application and an accelerated example that assumes Apache::DBI is being used in a persistent environment. The listing is part of the forum application in Listing 14.7, which is found in the "MySQL Example" section later this chapter.

Listing 16.2 Standard DBI Example

01 # show all the responses to this message
02 my ($childid, $title, $username, $created);
03 $sth = $dbh->prepare(q{SELECT m.msgid, m.title, u.username,
04                DATE_FORMAT(m.created, '%d %M %Y') as created 
05                FROM messages m, users u
06                WHERE m.authorid = u.userid 
07                AND m.parentid = $msgid
08                ORDER BY msgid DESC});
09 $sth->execute;
10 
11 while (($childid, $title, $username, $created) = $sth->fetchrow_array)
12 {
13   print qq{<li><a href="tree.psp?msgid=$childid">$title</a>};
14   print qq{ by $username on $created</li>\n};
15 }
16 $sth->finish;

Listing 16.3 Accelerated DBI Example

01 # show all the responses to this message
02 my ($childid, $title, $username, $created);
03 $sth = $dbh->prepare_cached(q{SELECT m.msgid, m.title, u.username,
04                DATE_FORMAT(m.created, '%d %M %Y') as created 
05                FROM messages m, users u
06                WHERE m.authorid = u.userid 
07                AND m.parentid = ?
08                ORDER BY msgid DESC});
09 $sth->execute($msgid);
10 $sth->bind_columns(\($childid, $title, $username, $created));
11 
12 while ($sth->fetch)
13 {
14   print qq{<li><a href="tree.psp?msgid=$childid">$title</a>};
15   print qq{ by $username on $created</li>\n};
16 }
17 $sth->finish;

In terms of placeholders, the differences between Listing 14.2 and Listing 14.3 are minor compared to the performance improvements achieved. The statement being prepared is almost identical with only one value in line 07 ($msgid) replaced by a placeholder. The rest of the statement is the same, no matter which arguments are passed to the program, so it can be left unchanged.

The only other placeholder-related change between the two listings is in the execute statement at line 09. The value that was originally used within the statement at line 07 now is being passed as an argument to the execute method, which fills the placeholder in the prepared execution plan with the value provided. DBI also determines the data type of the value provided by checking if the value is a number. Because the value is being passed through a method call instead of within an SQL query string, the value does not have to be quoted, even if it is a string. In this case, $msgid always is a number, and DBI has no difficulty determining its data type and passing the value.

Placeholders can be made more convenient by binding variables to the input placeholders using the bind_param method. The example could be changed as shown in Listing 14.4 by adding a bind_param statement at line 07, which would enable the execute method in line 08 to be called without arguments, as it was originally. The placeholder is specified by number in the first argument, and then by the value to be used. The data type of the value can optionally be provided in case DBI would have a difficult time determining the type automatically. Additional values could be bound by adding a similar statement for each placeholder.

Listing 16.4 DBI Example with bind_param

01 # show all the responses to this message
02 my ($childid, $title, $username, $created);
03 $sth = $dbh->prepare_cached(q{SELECT m.msgid, m.title, u.username,
04                DATE_FORMAT(m.created, '%d %M %Y') as created 
05                FROM messages m, users u
06                WHERE m.authorid = u.userid 
07                AND m.parentid = ?
08                ORDER BY msgid DESC});
09 $sth->bind_param(1, $msgid, SQL_INTEGER);
10 $sth->execute;
11 $sth->bind_columns(\($childid, $title, $username, $created));
12 
13 while ($sth->fetch)
14 {
15   print qq{<li><a href="tree.psp?msgid=$childid">$title</a>};
16   print qq{ by $username on $created</li>\n};
17 }
18 $sth->finish;

It is important to note that not all databases and DBI drivers support placeholders, and some drivers support placeholders differently than others. The examples in this chapter should work with most databases, but check the relevant driver documentation for specific examples of how to use placeholders with a particular database.

Preparing Cached Statements

Cached statements can improve performance for queries that are likely to be used frequently across multiple instances of the same program. Like statements with placeholders, cached statements relieve the database from having to generate execution plans for similar queries.

DBI caches statements only when explicitly instructed to do so by using the prepare_cached method, as seen on line 03 of Listing 14.4. This keeps DBI from caching all statements, regardless of whether they would be reused, by forcing the application developer to specify which statements are likely to be reused.

The example in Listing 14.4 uses prepare_cached to indicate that the execution plan generated for the statement being prepared is applicable to all instances of this statement. This is only true if the statement as prepared is identical each time; if any aspect of the statement changes from one instance of the program to the next, the statement is not reused after it is cached. For example, the statement in Listing 14.2 should not be cached because the value of $msgid changes from one instance of the application to the next as different messages are requested. The statement in Listing 14.4, although it returns the same result, can be cached because it uses a placeholder to represent the value that changes. Because the statement as prepared does not change, the plan created for that statement can be cached and reused.

Thought should be given to instances in which cached statements would be useful. In most cases, a statement should be prepared using prepare_cached only if it can be represented as a single-quoted string, that is, only if no values are interpolated by Perl at runtime. Conversely, a statement should not be cached if it is being composed at runtime. This would occur when a search statement adds a line to the WHERE clause for each search parameter because the potential exists for having too many statements cached overall. However, some interpolated values can be cached if the number of possibilities is severely limited; for example, a query that is chosen from two or three possibilities through an if statement also could be cached because the total number of statements cached overall is small and fixed.

Retrieving Data Sets as References

Although savings in connection and query time improve the efficiency of all Web application database queries, some queries benefit additionally from increased efficiency in retrieving the result data. Large result sets with many rows can be returned more quickly by retrieving the values through references.

A common way to retrieve the results of a database query is by calling the fetchrow_array method on the statement handle, as illustrated in line 11 of Listing 14.2. This method returns one row's worth of data as a list of values that can be assigned to an array or individual variables, as in this example. Because DBI already stores the returned values internally, this requires those values to be copied to the new variables created for each row returned. For small result sets with few rows, the time required to copy the result values is minimal, so this method is still acceptable for queries that return only one row of data. For larger sets of query results, however, the overhead of creating variables and reassigning values can be expensive. In the case of Listing 14.2, the overhead of copying $title, $username, and $created would become noticeable with result sets as small as a few dozen rows. The effect would become more obvious if the much larger body field were also returned each time.

DBI provides methods for retrieving results as references to the existing database driver (DBD) data structures, which removes the extra overhead caused by copying data to the application's own variables. The references are standard Perl structures that can be de-referenced within the application to use the associated values. Take care not to assign the de-referenced values to variables within the while block, however, because it invalidate the savings by performing the copy operation anyway.

Line 13 in Listing 14.4 uses the fetch synonym for the fetchrow_arrayref method, which is the fastest way to retrieve data from a statement handle. (Incidentally, fetchrow_hashref is provided for similar situations, but the performance improvements are not as great due to the overhead of creating the hash to reference.) Although Listing 14.4 doesn't assign the returned references to any values, it does use them inherently as described in the next section, "Binding Result Columns." Because fetch incurs no overhead in terms of returning values or assigning variables, it serves perfectly as a row increment method for the while loop in this case.

Binding Result Columns

For convenience, result values can be bound to variables in a manner similar to binding placeholder values. Binding result columns also can bring a slight performance improvement by removing the overhead of repeated variable creation. DBI provides the bind_columns method, which enables predefined variables to be associated through reference to values returned from a statement handle. By making the connection between returned values and variables only once for a given statement, bind_columns gives the convenience of using custom variables with the performance of using references.

Binding columns is best for data sets that return many rows at a time, although it gives no performance loss in any case. The bind_columns method is best used in combination with returned references (using fetch or fetchrow_arrayref as mentioned in the previous section, "Retrieving Data Sets as References") to provide the best performance for large data sets. A statement that returns just one row would probably not benefit from using bind_columns, however; the additional statements required to file-scope variables, bind them as references, and return them through fetch would give little performance benefit over a simple assignment through a fetchrow_array statement.

In Listing 14.4, line 11 binds the result columns to the $childid, $title, $username, and $created variables. The variables are file-scoped through the my keyword in line 02, and then passed to bind_columns as references in the order in which their values are returned from the query prepared in line 03. With each call to $sth->fetch in line 13, the variables passed in line 11 are assigned the values of the next result row.

Bound results can be useful for substituting data for variables placed in template text, as mentioned in Chapter 13, "Using Templates with Perl Applications." If a program such as Listing 14.4 were used to fill values in a provided template with Perl-style variables embedded, the text could be evaluated by using an eval statement within the while loop, as follows:

Listing 16.

my $template = q{<li><a href="tree.psp?msgid=$childid">$title</a>};
while ($sth->fetch)
{
  eval "print $template";
}

This type of automated variable substitution provides improved performance in a situation where it's most needed. Templates are usually used in situations in which processing them shouldn't incur overhead over a custom-coded application. Therefore, everything that can be done to limit processing time should be done. Luckily, the abstraction afforded by a template provides the opportunity to tune statements such as these without affecting the way the application behaves.

Web-Modeled SQL Statements

Architectural efficiency improvements can go a long way in reducing the time and CPU load a Web application requires, but even the best Web application architecture still grinds to a halt if the SQL queries aren't optimized for Web performance. A poorly written Perl program can make even persistent environments slow by causing each request to reopen files from disk or start programs each time. In addition, a poorly written SQL statement can make a persistent connection to the database slow by causing each query to do a full table scan or transfer much more data than is necessary. Fortunately, there are some simple principles to use when checking SQL statements for Web appropriateness, so it isn't very difficult to improve the performance of a bad SQL statement. Even more fortunate would be the presence of a database programmer or database administrator to write the SQL statement for us.

As mentioned before, the structure of database systems assumes that a single user waits the few seconds (or minutes) required to process a query and return results. This is a reasonable assumption when the user base is a fixed number of known users because a database user usually understands that there is a database involved and knows what to expect from previous queries. Database front-ends also have a number of ways to let the user know that data is on its way; the hourglass spins, partial data is displayed, and database-specific messages are sometimes made available to gauge the remaining query time. Therefore, a performance loss of a dozen seconds per query is no cause for concern; the users wait until prompted to do otherwise.

On the Web, however, those assumptions are all false. The user is accessing a page, not processing a database query, and the same icon (that spinning globe, glowing N, or throbbing dinosaur) is used to indicate page response delays due to any reason. The user might cause a dozen database queries to execute on each page visited without ever knowing that a database is involved at all. No special database messages are displayed, and usually, no partial data is made available for users while other data is delivered. Confronted by a page that just doesn't respond after eight seconds, the user assumes that the page isn't going to respond at all. He or she stops the page from loading and goes elsewhere.

Queries can be improved significantly without modifying the underlying database structures by modifying the queries to suit a Web model more closely. A Web application presents a lot of data in a single page, but it also provides the capability to spread data out through a series of hyperlinked views. Web applications also might need to present data that is timely, but often times, "timely" just as easily can mean twenty minutes as twenty seconds. A Web query also can be improved by restricting the ways that data is accessed. A little flexibility is lost, but users might not realize it is missing if the interface compensates for it and masks the edges.

Inner and Outer Joins

Two of the most common query types–inner and outer joins–are some of the most expensive queries in terms of the time required to plan and process them. Joins are a necessary part of relational databases–the "relational" part, to be precise–but by their very nature, they require two distinct sets of data to be matched in some specified way to return results that are common between them. If done correctly, it can be a powerful way to make connections in data that otherwise would have to be handled programmatically. If done poorly, though, a join or two can slow the response time of a Web database to a crawl.

The basic idea behind a join is that two tables have at least one field in common, so records can be matched up in terms of the correlation of the fields. In the case of the tables in Figure 14.1, each table has a field or set of fields that can be matched up to other fields to provide combined records with information from each table.

***INSERT Figure 14.114hpp01LA

Figure 14.1

Relational Table Schema from VeloMeter.com.

The tables used for the VeloMeter forum contain fields that are used to identify relationships between records. The msgid field in the messages table, for instance, is related to the same field in the text_search table. When a query is prepared against this schema, related fields from both messages and text_search can be included in the same result. In this case, it enables the body of a message from messages to be returned by searching for a keyword in text_search.

Joins are such a common occurrence in relational database design that they're generally taken for granted. It's considered good schema design to separate out data that would otherwise be repeated from record to record. If the name stored in the realname field of the users table in Figure 14.1 were repeated for every associated message in the messages table, for instance, it would waste space by repeating a long string more times than necessary. It also would make it necessary to change every record in the messages table whenever the user made changes or corrections to his or her name. This won't happen often in the case of a name, but for more transient data–for instance, phone numbers or email addresses–it becomes more of an issue.

A join can be expensive because the tables being joined are sometimes searched in their entirety for records that match before the requested records are returned. This process is called a full table scan, and it makes a real dent in performance whenever it has to be performed.

Joins can be improved greatly by improving the design and implementation of the table schema, but some performance gains can be seen by changing the design of a joined query to take advantage of the Web application paradigm. For instance, instead of joining the users and messages tables on the userid field and searching through the username field for a listing of messages created by the user currently logged in, it might be possible to pass the userid field and related information along through a session manager and select from only the messages table based on the parentid field and avoid the join entirely. It's also possible to rely on the hyperlinked nature of a Web application to provide information without selecting it all at once. In a listing of messages with a particular keyword, for instance, it would be possible to join only the text_search and messages tables to provide the titles and creation dates of the returned results with the knowledge that user information from the users table would be provided for each message through a details page.

Queries with joins also can be made faster by making sure that the fields being joined are indexed. A database administrator knows which fields in a table are indexed and which are not, so when given a choice, it's better to join indexed fields. A more detailed discussion of indexed fields is offered in the "Index Joined Fields" section later in this chapter.

Other Expensive Queries

Other SQL keywords–such as UNION and INTERSECT–can turn an otherwise efficient query into a very expensive one. They usually occur in queries that are both complex and that stretch the boundaries of what is possible within the framework of a relational database. One example would be performing aggregations on data from groups of tables that contain overlapping data, which might or might not correspond. For instance, it might be desirable to get a count of the unique email addresses contained in both a table of survey respondents and a table of registered users. The normal approach would be to count the number of unique email addresses in one table, and then add that number to the count from the other table. Unfortunately, email addresses that are present in both tables would be counted twice. One way to get a count of unique addresses could be the query in Listing 14.5.

Listing 16.5 Expensive UNION Query

01 SELECT COUNT(DISTINCT email)
02 FROM
03 (
04   SELECT DISTINCT email
05   FROM users
06   WHERE email IS NOT NULL
07   UNION
08   SELECT DISTINCT emailaddr as email
09   FROM survey
10   WHERE emailaddr IS NOT NULL
11 )

The performance of the query in Listing 14.5 is hampered solely by the UNION keyword. Other components of the query are perfectly fast on their own. The SELECT DISTINCT statements are both as fast as they can be, and the COUNT(DISTINCT email) statement would be very fast if performed on each table individually. When the UNION keyword is used in the subquery, however, it requires the database to compile the entire contents of both queries and then scan the combined contents of those queries before sending them to the outer query. This outer query in turn has to scan the entire result for duplicates before returning the count. These full result scans are the least optimized way to return data and can cause an order of magnitude increase in the time needed to perform the query.

In fact, most queries that require the full result set to be returned cause performance to degrade with results getting worse as the data set increases in size. These queries can be deceptively fast when running them against small data sets, though, so simulating much larger data sets is good practice when testing application performance. Application performance testing is covered in detail in Chapter 15, "Testing Site Performance."

The query in Listing 14.5 could be improved by rewriting it by joining the users and survey tables or by using a subquery. This won't remove the need for a full-table scan in all cases, but the database's execution optimizer is more likely to figure out an optimized plan with either of those options. Higher performance can be gained by additionally indexing the email and emailaddr fields if possible, as discussed in the "Index Joined Fields" section later in this chapter.

On the other hand, aggregate queries processed within the database–using keywords such as MAX, SUM, and GROUP BY–can be much faster overall than retrieving the raw data and aggregating it within Perl. For instance, selecting the average age of registered site users as listed in an age field would take much less time if selected using the AVG keyword within the database than if computed within Perl by selecting all the ages, adding them, and then dividing. This is usually due to the overhead resulting from retrieving large data sets and then processing them a record at a time within Perl. The database server has much greater optimization for arithmetic functions such as these, and sometimes greater processing power to boot. In addition, queries executed within the database can be cached using snapshots and materialized views, as discussed in the "Views Versus Snapshots" section later in this chapter.

Text Searches

The text search is another common query type used in Web applications, with good reason. Web users prefer to have simple interfaces to complete search facilities, and backend search techniques have to be very flexible as a result. Web application users are familiar with the simple interface offered by Web search engines such as Google or AltaVista, so it's necessary to offer them an interface that behaves as expected when given a wide range of possible search terms. It also helps to automate the process of full-text searching because it's not always possible to hand-code keywords for every text item on a Web site, especially when the data being searched is dynamically generated from a database.

Text searching of this type is usually implemented in a database with the LIKE keyword, which enables search fields to be matched against partial text strings. LIKE uses the percent symbol to indicate that arbitrary text could match in its place, as Listing 14.6 illustrates.

Listing 16.6 Full-Text Search Through an Expensive LIKE Query

01 <sql name="search" dbtype="mysql" db="test" action="query">
02 <output>
03 SELECT DISTINCT m.msgid, m.title, 
04   DATE_FORMAT(m.created, '%d %M %Y') as created
05 FROM messages m
06 WHERE m.body LIKE '%$keyword%'
07 </output>
08 </sql>

The query in Listing 14.6, which is a modified part of the VeloMeter.com forum application described in the "MySQL Example" section later in this chapter, matches a provided keyword stored in the $keyword variable against the body of messages stored in the messages table. When the keyword provided is 'bob', for example, line 06 evaluates to WHERE m.body LIKE '%bob%', which scans the entire table for a body field that contains the substring 'bob'. This is a fairly common use of LIKE, which approximates the behavior of a Web search engine pretty closely.

Unfortunately, the need for a full table scan each time the query is run would make the query in Listing 14.6 prohibitive in a Web context with even a few hundred rows in the table. The database can't optimize a LIKE query such as this one the way it would a simple match because every possible substring of each body field has to be checked, and indexing the fields by substring would be prohibitive. Fortunately, the efficiency of queries using LIKE can be improved in a few ways:

The search can be narrowed to field values that start with the search keyword using WHERE m.body LIKE '$keyword%'. This removes the need for a full-table scan, but it also reduces the utility of a full-text search in this context. This kind of query would be better suited to matching names in an address book, where 'John', 'john', and 'John Q.' would all be matched by LIKE 'John%'.
The search can be narrowed further to field values that match the length, if not the case, of the keyword using WHERE m.body LIKE '$keyword'. This improves performance greatly by enabling field indexes to be used more frequently, but completely invalidates the full-text search above. This kind of query would be best for matching values, such as user names, Canadian postal codes, or MS-DOS file names, that have indeterminate case.
The search can be completely replaced with a query that doesn't use LIKE, by modifying the database to include a hash table or other convenience data structure. This method keeps all the utility of the application with very few changes to the query itself, but requires a modification to the structure of the database. These modifications are discussed in greater detail in the "Full-Text Searching" section of this chapter.

Although modifications to SQL queries in Web applications might require more effort than the architectural changes mentioned in the previous section, the work done to optimize the most expensive ten percent of a Web application's queries might free up ninety percent of the database server's resources for additional use. Because database servers are the most difficult Web application resources to scale, the extra work can be worthwhile.

Managing Databases For Web Performance

Although many improvements to database performance can be accomplished within the scope of Web applications, the performance of SQL queries also can be improved by changing the structure of the database involved.

Many databases are designed using optimizations for individual users, with little regard for the special needs a Web application has. Web applications are likely to make frequent, simple requests for small data sets with many similar requests varying only in the specific data requested. The data structures used in a Web application aren't likely to change dramatically, and when they do, it's usually due to a complete application redesign. As a result, flexibility in data storage is not as important as simplicity and performance.

By modifying a complex or slow database schema and adding convenience structures for Web use, it's possible to wring more performance out of a database. It's best if a database administrator with Web experience does the optimization, but any Web application developer with sufficient database knowledge can make basic changes that have a dramatic effect on performance.

Denormalize Data Tables

Normalization is one approach to the design of relational databases that is usually taken to ensure flexibility in the structure of the database that the database server itself does not provide. For instance, the inability of the Oracle database server to drop fields from a table after it contains data is a common reason to implement the schema of an Oracle database in a normalized fashion.

Normalized data structures rely on the relational nature of database systems to build virtual table structures from groups of tables that contain both information and related metadata about the type of information stored. For instance, the tables shown in Figure 14.2 and 14.3 store similar information, but the table in Figure 14.3 has been normalized to provide a more flexible structure for storing data.

***Include Figure 14.214HPP02LA

Figure 14.2

Standard user table structure.

***Include Figure 14.314HPP03LA

Figure 14.3

Normalized user table structure.

The table in Figure 14.2 contains the kind of data that is reasonable within any Web application. The username, email, and realname fields are likely to be encountered over and over when dealing with people in a multiuser system. Each field corresponds to the data stored within it, and the queries used to extract data from this table would be pretty straightforward. The table in Figure 14.3 contains the same data, but that data has been normalized into a structure that remains the same no matter which fields are associated with each record. A record in the virtual table implied by the data now spans multiple real records, and the names of the fields are now stored the same way the data is stored. Note that this enables something that a standard table otherwise would not support. A single "record" in the normalized table can store an arbitrary number of values for each field, as opposed to a single value per field in the standard approach.

Normalized data is very useful for recording arbitrary data structures that would otherwise be encumbered by the spreadsheet-like nature of database tables. If records are likely to contain an arbitrary number of fields with arbitrary types that might change depending on the record itself, normalized tables might be the solution in use.

However, normalized tables are not always desired when providing a database for a Web-based application. Web applications are much more sensitive to performance degradation. Because normalized table structures implement much of the structure of a table independently of the underlying table structure itself, a database server is less likely to be able to optimize a query against normalized tables. In addition, normalized tables are much more likely to require the expensive queries mentioned in the previous section, "Web-Modeled SQL Statements," because the structure of a normalized table is less conducive to SQL queries than the traditional row-and-column style (also called denormalized tables).

Fortunately, Web applications can overcome the same limitations in database design that normalized tables overcome by using program logic to mask some of the irregularities found in a standard table. Benefits of the normalized table in Figure 14.3, for example, could be replicated by modifying the table in Figure 14.2 to include multiple fields for each type of information. The email field could be replaced by enumerated fields such as email1, email2, and email3, each of which is considered an equivalent field by the Web application. The application would then use a WHERE clause such as the following to search for an individual email:

Listing 16.

WHERE email1 = '$email'
OR email2 = '$email'
OR email3 = '$email'

Another way to replicate this feature is by parsing the field before processing INSERT and UPDATE statements and after processing SELECT statements within a Web application. The realname field in Figure 14.2, for instance, could be filled with a comma-delimited string to represent multiple equivalent values. This method enables an arbitrary number of values to be assigned to a single field as long as the total length of the values doesn't exceed the field's maximum. This method should not be used for fields that are searched, however, because of the performance hit that substring queries incur. See the "Text Searches" section earlier in this chapter for more information.

Although restructuring the schema can bring significant performance benefits over normalized data, it is sometimes not possible to represent the source data in a flat structure. This can happen when the database is mainly used by another application, with the Web application used only to summarize some data from the application's data structures. In that case, it can be more helpful to add additional database structures to repurpose the existing data for Web use. See the following section for information about creating snapshots for quick access to data from normalized structures.

Index Joined Fields

When inner and outer joins are necessary, their performance can be helped significantly by choosing an efficient set of fields to join, and then indexing them. These might be data fields that have some meaning within the context of the record, or generated fields that are used specifically because they can be indexed easily. Either way, it's important to make sure that the fields used to join tables are indexed whenever possible. This gives the database as much information as it can get to provide suitable optimizations for a query when an execution plan is being generated.

Without an index on joined fields, the database usually has to perform at least one sort on the table, which requires a full table scan. Although this varies from database to database, an equivalent process usually occurs in any database when joined fields are not indexed. This occurs if the field from either table being joined is not indexed; the greatest performance loss occurs if neither field is indexed, but joining an indexed field to an unindexed one still incurs a significant performance loss. When joining a pair of indexed fields, however, the database can usually optimize the query to scan only those records that are applicable to the query, leaving the bulk of records untouched and saving a great deal of time.

Some fields are not good choices for an index–and therefore a join–because they contain data that is too costly to index on a regular basis. For instance, a 4000-character VARCHAR field that stores message titles would be a bad choice for an index because the size of the index necessary to index such a field would be very large. Additionally, each insert into such a table would require the message title to be added to both the field itself and the index of the field, potentially doubling the overhead of an INSERT or UPDATE query.

Good indexed fields are usually numeric fields with automatically generated values or small text or numeric data fields that are unlikely to change over the life of the record. A phone number field would be a bad choice for a join, for instance, because the number is likely to change and corresponding fields in other tables would have to be located and updated along with it. It's usually best in a Web setting to use automatically generated numbers for record identifiers (or primary keys), and then carry the key values over to related records in other tables as foreign keys. The reason for this is simple; because Web users are likely to have frequent interaction with the database without understanding its structure or having direct access to it, it's always possible that a Web user will want to change any value that is apparent–such as a username or message title–without regard for other fields in other tables that reference that value. By providing a layer of abstraction to the Web user where the database joins arbitrary-numbered fields that are never seen in the Web interface, the application is never limited in terms of the type of data it enables the user to change at any time. Figure 14.4 provides an example of this principle.

***INSERT FIGURE 14.414HPP04LA

Figure 14.4

VeloMeter forum tables revisited.

Figure 14.4 shows the VeloMeter.com forum table example again, this time with indexed fields highlighted and the relationships between joined fields specified explicitly. Note that each joined field has "id" at the end, and the keyword field (which is indexed but not used in a join) does not. This is a style used to help differentiate between primary and foreign keys when mixed among data fields in a query. Each ID field is an autonumbered integer, chosen as such because of the quick indexing and small storage size allowed by integers.

Create Snapshots

If a query is still slow and expensive after optimization, it sometimes can help to create snapshots of the data on a regular basis. Snapshots are copies of the state of a query at a given moment that can be updated automatically by the database server on a scheduled basis. Snapshots provide faster access to data in some circumstances by precomputing the results of a query and storing them in a separate table that can be read from more quickly.

Snapshots are created by providing a SELECT statement to use when creating or updating the snapshot, as well as a set of database directives describing update frequency, storage limits, and other snapshot accounting parameters. This query is saved by the database for use in updating the snapshot at regular intervals. The query can be changed if needed, but otherwise, it's not necessary to reexecute the query whenever a snapshot update is necessary.

Not all databases support snapshots–Oracle, Microsoft SQL Server and Informix do, but MySQL and PostgreSQL don't–but it is possible to create data snapshots using a stand-alone Perl application run by a scheduler. The details of how to create a table using the results of a query vary widely from database to database, but the general steps involved are similar:

Start with the query that is being optimized by the snapshot. If at all possible, write a single query that creates a table based on the output of this query. SQL syntax such as CREATE TABLE foo AS ... enables this, but database implementations vary widely. Other possibilities are insert statements with subqueries to select data or stored procedures that select and insert data programmatically.
If a single query is not possible, write a set of queries that perform the same actions: create a table or clear out the values from an old table, select the data as needed, and insert a result row into the database with the selected values.
Write a simple Perl application that processes the snapshot creation query or queries. If it is possible to create the snapshot in a single query, the program logic should be very simple and program execution time should be only as long as the original SELECT query takes. If a set of queries is necessary, create the table first, and then select the data and insert it a row at a time using a foreach loop and the DBI prepare_cached and bind_columns methods with placeholders. This method takes longer than a single query, but the DBI efficiency methods help keep execution time down.

After the Perl application is written, schedule it using cron or a similar system scheduler so that it runs as often as the data warrants. Be sure not to run the application so often that it is unable to complete the snapshot before the next scheduled snapshot; for instance, if the application takes six minutes to complete a snapshot, scheduling it every five minutes results in unusual behavior. Similarly, make sure the application continues to run in a reasonable time as the data set it's selecting gets larger. An application that takes seconds to run on a small data set might take much longer to run once more data is made available to it.

Views Versus Snapshots

It is important to note the difference between the Oracle view table and a snapshot. The two constructs sometimes can be used interchangeably with standard database applications, but there can be a marked difference in performance between an Oracle view and an analogous snapshot.

A view is a convenience table that links to the underlying data through a SELECT statement. The view is created to provide a layer of abstraction to the underlying tables involved, usually to make subsequent queries easier to write by database users. However, a view is not designed to improve the performance of the query it represents. A query against the view is translated into the equivalent query against the original data tables, which provides a continuously up-to-date result without the need for regular external updates. However, in most cases, a plan has to be created both for the view and the query it represents, which causes extra overhead and reduces performance, even when compared to the original query.

On the other hand, a snapshot actually creates a temporary table and copies the results of the specified query into it. As described in the last section, a snapshot provides a simplified set of fields the way a view does, but it is designed also to improve the performance of SELECT queries run against the snapshot. When accessed, a query against the snapshot only has to have a simple plan created, and no interaction has to take place with the tables that supplied data for the snapshot. The snapshot requires regular updates to maintain a reasonably recent copy of the data, but query performance is improved considerably versus the original query or a corresponding view.

In version 8i and later, Oracle provides "materialized" views, which create their own snapshot-like tables to provide a cached version of the view on a scheduled basis. These views are more like snapshots than the original views, so they are useful in providing the same query optimization that a snapshot would provide. One notable aspect of materialized views is the capability to use them implicitly in queries that would otherwise use the original tables. By specifying ENABLE QUERY REWRITE when the materialized view is created, it is allowed to circumvent the normal plan created by a similar query and provide precomputed values from the materialized view instead. Therefore, a new query doesn't have to be created when a materialized view is used to optimize table performance the way it would if a snapshot were being used. This can come in handy when optimizing performance after a query is already in wide use throughout a Web application.

MySQL Example

Database systems can vary significantly in terms of their support of ANSI SQL and proprietary additions to it.

The examples in this chapter are written for a common database, MySQL. MySQL is likely to be used in simple Web applications because of its ubiquity in low-cost efficient operating systems such as Linux and FreeBSD. A MySQL installation is easy to set up, has mild system requirements, and provides incredible performance for most Web applications. Oracle, another database system found in Web installments, is likely to be found in large-scale information systems that require performance on high-end hardware with complete data reliability and security. Oracle has stricter hardware requirements and a steeper learning curve, but it provides a rock-solid base for mission-critical Web application databases.

MySQL provides impressive speed for Web applications because of its light-weight implementation and its concentration on simple data storage and retrieval. MySQL is open-source software, which has made it more widely available than most other database systems. (As of early 2001, PostgreSQL is starting to look like an open-source challenger to MySQL's speed and ease of use while offering full transaction support, but most installations are still likely to find MySQL first.) MySQL also provides a simplified security model, which enables operating system security to be used in place of a dedicated password scheme and remote connections to the database to take place without a specialized network interface.

However, MySQL gets incredible performance at the expense of important database features, such as transaction processing and rollback. MySQL does not have the capability to gracefully recover from problems that occur when the state of the database is undetermined, for instance, in the middle of a query that adds 100 to the salary field for every record in an employee table. In the case of an error in the midst of such a query, databases that support transactions would simply roll back to the state the database was in before the query started. MySQL, on the other hand, would leave the database partially modified by the query.

Transactions also are important because of the possibility that two queries will act on the same data at the same time, causing one or the other to use incorrect values when inserting into the table. MySQL provides the capability to lock access to a table, but this requires special programming techniques and still doesn't provide the full robustness of a transaction model.

For many Web applications, though, the speed provided by MySQL is more a concern than transaction support. For the example in this section, integrity of the data being stored is not nearly as important as the speed with which it is accessed.

Database-Backed Web Forum

One compelling use for a database-driven Web application is the forum, a place where site viewers can become contributors by adding their own voice to the site's content. Forums are especially useful on sites with changing content that need to stay abreast of recent developments. VeloMeter, an open-source load testing tool, needed a forum for its site to enable users to answer each others questions about installation, new versions, and tips on using VeloMeter effectively.

The forum's setup is simple, which enables flexibility in programming the application as well as ease-of-use for site visitors posting to or reading from the forum. The forum is based around simple text messages that can be searched, viewed, or replied to with a few administrative pages that enable users to identify themselves to the system. The messages are stored in a table called messages, as are users and associated information, as described in Figures 14.1 and 14.4 earlier in this chapter. Any message then can be displayed in the format shown in Figure 14.5. (The program used to display this page is described in greater detail in the next section of this chapter.)

***Insert figure 14.514hpp05.tiffSC***crop

Figure 14.5

The VeloMeter forum displays a message with parents and replies.

The forum enables any number of messages to be posted with effectively infinite levels of replies and threads possible. This design is desirable because it's impossible to tell how many levels of replies a given message will have. So, any arbitrary limit placed on the number of levels for the ease of database design would likely restrict the possible uses of the forum. It also would limit programming possibilities down the road.

Arbitrary levels of replies are made possible by a self-join, which connects rows in the messages table with other rows in the same table in a parent-child relationship. To do this, the table includes a parentid field that references the msgid field of another row. This one added field enables a message to be related (in one direction only) to any number of messages designated as replies to it by joining those two fields. With the right query, a full hierarchy of parents and children could be produced, although that kind of query is generally too expensive for a simple forum application. The self-join also enables a list of parents to be compiled with little effort, which helps users to navigate through all the possible levels of replies.

Cached Display Query with Placeholders

The display query page in Listing 14.7 is one of many Perl Server Pages (PSP) that make up the entire VeloMeter forum application. (PSP pages are described in greater detail with complete implementation instructions in Chapter 12, "Environments For Reducing Development Time.") It's the most commonly accessed page because it displays any message with all its parents and replies. Most links from this page refer back to it, with only the message to display changing from request to request.

Listing 16.7 Forum Message Display (tree.psp)

01 <perl>
02 # determine which message to focus on
03 my $msgid = $QUERY{msgid} || 1;
04 
05 # build a hierarchy for this message's parents
06 my $parents = '';
07 my $dbh = DBI->connect('dbi:mysql:test','','',{RaiseError => 1});
08 my $sth = $dbh->prepare('SELECT title, parentid FROM messages where msgid = ?');
09 $sth->execute($msgid);
10 my ($p_title, $parentid) = $sth->fetchrow_array;
11 $sth->finish;
12 while ($parentid)
13 {
14   $sth->execute($parentid);
15   my $msgid = $parentid;
16   ($p_title, $parentid) = $sth->fetchrow_array;
17   $parents = qq{<a href="tree.psp?msgid=$msgid">$p_title</a> : $parents};
18   $sth->finish;
19 }
20 </perl>
21 
22 <sql name="msg_info" dbtype="mysql" db="test" action="query">
23 <output>
24 SELECT m.title, m.body, 
25   DATE_FORMAT(m.created, '%d %M %Y At %H:%i') as created, 
26   u.username 
27 FROM messages m, users u
28 WHERE m.authorid = u.userid 
29 AND m.msgid = $msgid 
30 </output>
31 </sql>
32
33 <include file="$ENV{DOCUMENT_ROOT}/page.psp" />
34 <template title="Load Testing Forum" section="Forum">
35 
36 <fetch query="msg_info" fetch="title, body, created, username" type="sql">
37 <output>
38 <p>$parents</p>
39 <p><b>$title</b><br />
40 <i>by $username on $created</i></p>
41 <blockquote>$body</blockquote>
42 <p><a href="add_msg.psp?parentid=$msgid">Reply to this message</a> | <a href="add_msg.psp?parentid=1">Post a new message</a> | 
43 <a href="search.psp">Search messages</a> | <a href="register.psp">Register as a user</a></p>
44 </output>
45 </fetch>
46 
47 <h4>Replies:</h4>
48 <ul>
49 <perl>
50 # show all the responses to this message
51 my ($childid, $title, $username, $created);
52 $sth = $dbh->prepare_cached(q{SELECT m.msgid, m.title, u.username,
53                DATE_FORMAT(m.created, '%d %M %Y') as created 
54                FROM messages m, users u
55                WHERE m.authorid = u.userid 
56                AND m.parentid = ?
57                ORDER BY msgid DESC});
58 $sth->execute($msgid);
59 $sth->bind_columns(\($childid, $title, $username, $created));
60 
61 while ($sth->fetch)
62 {
63   print qq{<li><a href="tree.psp?msgid=$childid">$title</a>};
64   print qq{ by $username on $created</li>\n};
65 }
66 $sth->finish;
67 </perl>
68 </ul>
69 
70 </template>

As complex as the page might look at first glance, its structure is simple. The page is broken into three logical sections, each of which is filled from an appropriate set of database information. The first section, queried by lines 05 to 19 of Listing 14.7 and displayed in line 36, consists of a list of links to the current message's parent messages, traced back to the original message used to start the forum. The second section of the page, displayed by lines 22 through 46 of the listing, contains the title and body of the current message with additional information about its creation time and author. The third section of the page, displayed by lines 48 to 66 of the listing, is a list of the replies to date that this message has received, with general information about the date and author of each message. Each section uses a different method to retrieve its contents from the database, and each has been optimized to take advantage of the persistent environment and Web application design.

The first section starts out by connecting to the database in line 07–using Apache::DBI, which has been loaded elsewhere–and preparing a statement in line 08 that retrieves the title and parentid fields from the record corresponding to a given message. The message in question is replaced by a placeholder because this query is going to be used repeatedly in this section. This happens first in line 09 when the query is executed for the current message; the data returned is assigned to the variables $p_title and $parent_id in line 10 using the standard fetchrow_array method provided by DBI. Line 11 makes sure the query is explicitly finished to avoid warning messages or unusual behavior.

Line 12 starts a while loop that is used to build the list of parents starting from the immediate parent of the current message. In each iteration, line 14 executes the query for the parent of the previous message, and then line 16 assigns the new $p_title and $parentid values over the old. The title of the parent message and a link to that message are then added to the beginning of the list (stored in $parents) in line 17, and the process starts over again. The loop continues adding parent messages to the list until the parentid field returns a false value, in this case, zero. At that point, all parents in the chain (back to the original message) are part of the list. Recursive queries like this almost always benefit from using the same prepared query repeatedly; recursive queries tend to have few changes from iteration to iteration, and the changing elements can be represented easily by placeholders. In this case, no matter how many levels of parents are found, the query has to be prepared only once.

The second section of the page in Listing 14.7 is much simpler and requires little optimization beyond the use of Apache::DBI. The result fields returned by the query are more likely to undergo more cosmetic changes because they are the focus of the page. Because of these considerations, the ease of representing the query in tag format outweighed the need to eke out every last drop of performance. The query won't be slow by any means; the joined fields specified in line 28 are both indexed, as is the $msgid field in line 29. The fetch tag used to display the result in lines 36 through 45 is optimized behind the scenes to use bound columns and the fetchrow_arrayref method, so data retrieval won't lack for performance either. The majority of the displayed text in lines 38 to 43 is HTML, which makes the tag-based functions more suited to this part of the page.

The third section of the page is optimized in a few ways, both within the page and across all possible pages. In this case, it's useful to optimize the query as much as possible because it stands to be executed dozens of times per page request with little variation in the structure of the query or the values returned. Line 52 prepares the query using prepare_cached and a placeholder to save the cost of preparing an execution plan for this potentially expensive query even once per page. Line 59 binds the return results to a set of variables to improve readability later on and save the overhead of variable creation and data copying. Lines 61 through 65 then print a list item and link for each message returned by the query, and line 66 finishes off the query after completion.

Individually, the optimizations made to Listing 14.7 might seem insignificant, but these changes can make the difference between overloading a database server (or not) when added across hundreds of requests per second.

Full-Text Searching

A full-text search can be the most valuable part of a forum application because it enables readers to find messages of interest to them among all the messages available in the forum. A full-text search is even more useful in this case than it would be for an entire Web site–a site is likely to have navigational aids based on content grouping, but it would be impractical to offer the same kinds of grouping in a user-contributed forum. Queries to do direct text searching of a database can be prohibitive, however. So, special means are sometimes necessary to provide the ability to search through a large amount of text quickly.

The example in Listing 14.8 implements the simplest of search aids, a hash table containing keywords and the messages in which they can be found. The table itself is very simple, as shown in Figure 14.4 earlier in this chapter. The simplicity enables the hash table to do its job while being optimized as much as possible. This avoids performance penalties that would potentially invalidate the main reason for using a search aid in the first place. The program is designed to be run as a stand-alone application from the command line or a scheduler, but it just as easily could be changed to work within a persistent environment, if needed. In this way, it could be triggered by a system event, such as the addition of a new message.

Listing 16.8 Build a Hash Table (build_hash.pl)

01 #-----------------------------------------
02 # 
03 # build_hash.pperl - search optimization
04 #
05 #-----------------------------------------
06 
07 use 5.6.0;
08 use strict;
09 use warnings;
10 use DBI;
11 
12 # connect to the database
13 my $dbh = DBI->connect('dbi:mysql:test','','',{RaiseError => 1});
14 
15 # clear out the old hash table values
16 my $sth = $dbh->prepare('delete from text_search');
17 $sth->execute;
18 
19 # get all the messages
20 $sth = $dbh->prepare('SELECT msgid, title, body
21                       FROM messages');
22 $sth->execute;
23 my ($msgid, $title, $body);
24 $sth->bind_columns(\($msgid, $title, $body));
25 
26 # pre-cache the insert statement for later
27 my $sti = $dbh->prepare(qq{INSERT INTO text_search
28                            (keyword, msgid, x_count)
29                            VALUES (?,?,?)});
30 
31 # for each message,
32 while ($sth->fetch)
33 {
34   # create a key count hash
35   my %key_count;
36 
37   # break the message into keywords
38   my $keyword;
39   foreach $keyword (split(/[^A-Za-z-']/,$title." ".$body))
40   {
41     # increment the hash entry for that keyword
42     $key_count{lc($keyword)}++ if (length($keyword) > 2);
43   }
44 
45   # insert a row for each key_counthash entry
46   my $q_keyword;
47   foreach $keyword (keys %key_count)
48   {
49     $sti->execute($keyword, $msgid, $key_count{$keyword});
50   }
51 }
52 $sth->finish;
53 
54 $dbh->disconnect;

Lines 13 through 17 of the example connect to the database and remove any previous hash table entries. This is a simplistic way of removing the previous data from such a large table, but it works for the purposes of a small forum like this one. If this were a banking application or a high-traffic news site, the table would probably need to be cleared one group at a time with each group being cleared and reinserted in a single transaction to reduce the possibility of incorrect results being returned by a query due to incomplete hash data.

Lines 20 through 29 prepare both a SELECT query and an INSERT query, passing the resultant plan to $sth and $sti, respectively. The $sti handle is used repeatedly as the program loops through data returned by the $sth handle, so the associated query is prepared only once to reduce the load it places on the database. This becomes increasingly important as the number of messages being indexed increases because the insert query likely is to be executed dozens of times for each message and making a new execution plan for each query could quickly take the lion's share of the program's allotted processing time.

The rest of the program also is simple in design. Line 39 breaks each message into a set of keywords, and then loops through them to produce a count of the number of instances of each keyword in the message. This provides a basic way to weight the relevance of a message to any given keyword, as will be shown in the search display page. Line 42 makes sure that keywords are greater than two characters long to avoid a glut of tiny words; this process could be made more intelligent by specifically weeding out a list of common words such as "if," "the," and "or." Line 49 then inserts a database entry for each keyword in the message, along with the message id and the incidence count generated earlier. This one block is likely to be executed more frequently than any other part of the program, so it is kept as minimal as possible.

The hash table can be regenerated on a schedule by using a system scheduler, such as cron, to invoke the table generation program at regular intervals. It's important to check how long the program takes to finish on a regular basis to avoid the chaos caused when a new instance of the program starts to generate a new hash table before the previous program finishes. In the case of the VeloMeter forum, the hash table is regenerated once an hour. This provides a reasonably timely search because the forum isn't very busy–few new messages are posted over the course of an hour because most visitors are content to read the postings of others.

Note that many database systems come with internal utilities for creating full-text hash tables on a scheduled basis that might be far superior to a cron-scheduled Perl program. See the documentation for your RDBMS to test the possibilities.

After a hash table is available, keyword-based search queries can be used, as in Listing 14.9. This search is as simple as the rest of the forum, but the possibilities for keyword searches are endless. Searches can use various methods to emphasize the potential relevance of a match or the keywords found within the match, and additional processing power can be devoted to finding related keywords as well as those specified, including synonyms and misspelled words. The basis of this simple search, however, is a simple text match between the keywords given and those listed in the database.

Listing 16.9 Search Query Display (search.psp)

01 <include file="$ENV{DOCUMENT_ROOT}/page.psp" />
02 <template title="Search - Load Testing Forum" section="Forum">
03 
04 <form action="search.psp" method="get">
05 <output>
06 <p>Search:&nbsp;
07 <input type="text" name="search" size="30" value="$QUERY{search}" />
08 &nbsp;
09 <input type="submit" value="Search" /></p>
10 </output>
11 </form>
12 
13 <if cond="$QUERY{search}">
14 <perl>
15 # break the search up into keywords
16 my $search_where;
17 foreach $keyword (split(/[^A-Za-z-']/o,$QUERY{search}))
18 {
19   # add an OR for each keyword
20   next unless (length($keyword) > 2);
21   $search_where .= "OR t.keyword = '$keyword'\n";
22 }
23 
24 # replace the first OR with an AND
25 if ($search_where)
26 {
27   substr($search_where,0,2) = 'AND (';
28   $search_where .= ")";
29 }
30 else
31 {
32   $search_where = "AND t.keyword = ''";
33 }
34 </perl>
35 
36 <sql name="search" dbtype="mysql" db="test" action="query">
37 <output>
38 SELECT DISTINCT m.msgid, m.title, 
39   DATE_FORMAT(m.created, '%d %M %Y') as created
40 FROM messages m, text_search t
41 WHERE m.msgid = t.msgid
42 $search_where
43 </output>
44 </sql>
45 <h4>Results:</h4>
46 <ul>
47 <fetch query="search" fetch="msgid, title, created" type="sql">
48 <output>
49 <li><a href="tree.psp?msgid=$msgid">$title</a> from $created</li>
50 </output>
51 </fetch>
52 </ul>
53 </if>
54 </template>

Lines 04 through 11 of the page display a simple search entry box, as is found on most other search engines. The same page is used to display both the search box and the results page to preserve the search keywords as they are entered. This makes correcting typos and adding keywords much easier for the user–without adding too much code.

Line 13 checks whether keywords have been submitted to the form. It then displays the rest of the page only if the form variable search contains any values. If keywords have been submitted, line 17 splits them into a list of individual keywords and composes the WHERE clause of a SELECT query with all keywords searched individually. This particular implementation finds any message with any of the keywords specified, but more advanced queries could be created that give weight to messages that contain more than one of the keywords. Lines 25 through 29 change the query to separate the keyword clause from other clauses, and line 32 covers the case in which no valid keywords are submitted.

The <sql> tag in line 36 prepares the query, including the prepared WHERE clause in line 42, and then the <fetch> tag in line 47 displays a list item for each search result. The results displayed are minimal, to enable quick scanning of all possible results for a descriptive title that might provide more information. This listing also leaves out the author's name to avoid a query that joins three tables, but a quick click to any message listed gives that information and more. The complete list looks something like the result in Figure 14.6, which provides surprisingly useful results considering the sparse simplicity of the search being performed.

***Insert figure 14.614hpp06.tiffSC***crop

Figure 14.6

A full-text search with results.

When a Database is Too Much

A database provides a flexible way to store Web-created and Web-accessible data, but it isn't always necessary to use one for all data presented by a Web site. Databases incur overhead simply by being designed to handle random access to large quantities of codified data. Databases could potentially store any kind of data. Thus, the facilities present in all cases have to be capable of handling queries of arbitrary complexity against any table. This overhead can sometimes be circumvented by using a simpler hierarchical storage method, such as an internal data structure or a file system, which gains speed by eschewing the flexibility of random access.

A perfect example of this principle is the Web site itself. If a database were the perfect repository for text data, the files that make up a Web site would be stored in the database and all Web server access would go directly to it. Instead, however, pages are usually stored as files and managed by the Web server because the Web server is capable of providing text files more quickly and reliably when accessed in Web-fashion. This is because a Web access is usually very specific, referencing a single file in the hierarchy by use of the URL and complete path. It also helps that this is the only way to access a file; there are no facilities for accessing a file based on its last modified date, size, or contents. These restrictions also enable the Web server to cache Web pages without having to replicate the entire structure of the file system; if the Web server needed to cache records in a database, it would be necessary to cache all the relationships between records as well as the indexes used to provide faster access.

When a database causes too much overhead, other resources are available to store the data used by a Web application. User preferences can be stored outside a database, as can document-based data that could be accessed through the file system. Session-based data, which is used to maintain state, also is a good candidate for storage outside the database–at least on a temporary basis.

User Preferences

Although user preferences could be used to customize an experience for every visitor, sites seldom change their format to suit a specific user. The reason is simple–there are too many user interface possibilities to test, and it is more efficient to concentrate on improving the usability of one user interface that customizing the interface for all possible tastes. Also, Web site visitors are likely to see a site only once and might never have the chance to express preferences, which invalidates all the work involved in customizing.

This reasoning shouldn't be ignored when providing some customization for users who are likely to visit a site repeatedly. The temptation might be to provide a whole host of user-defined attributes, such as colors, fonts, and navigation placement, but the truth is that most users don't care about this fine a grain of customization, and those who do won't necessarily make the best choices anyway. This level of customization would almost certainly require a database for implementation, and the overhead of accessing a database–usually a large set of joined tables due to the nature and number of options–for each Web request is extreme.

Customization can be simulated by providing a few versions of the site for different user tastes. This avoids both the problem of overhead and the problem of inexperienced users who don't understand how to customize the environment. By offering the user some choice, a site can accommodate the most likely variations requested by users while limiting those variations to ones that are both compatible with the site's design and aesthetically pleasing.

Templates and multihomed documents can be used to automate the process of providing different user interfaces on the same site. Each variation can be implemented as a template page–with one chosen as the default–and the user then can select which template he or she prefers, perhaps by trying each on a sample page (or the preferences selector itself.) After it is selected, the only user preference that has to be remembered is the name of the template page, which is an easy task for a session manager. Multihomed documents could enable the same to hold true without requiring any user preferences to be saved at all. Different themes could be specified by accessing different directories, which enables the user to pick a theme by bookmarking the resultant URL. (Multihomed documents and their uses are described in more detail in Chapter 17, "Publishing XML for the Future.")

Document-Based Data

Although a database can be a good repository for some document-based data, as mentioned earlier, a file system might be much more efficient when delivering the data through a Web application. This holds true for more than just Web pages; other text data, such as XML files or Word documents, also are good candidates for file-system storage. Because these documents can easily be organized within a hierarchical system–for instance, by year, month, and day for news stories or press releases–storing them through the file system enables faster access as well as easier browsing. With XML files, it's also possible to post-process the files as they are being served. This is covered in Chapter 16, "XML and Content Management," and Chapter 17.

Databases can be good for text searching, though, so it might be worthwhile to store a version of the document-based data in the database for this purpose. It's also possible to store just the URL or another file-based link to the document in the database with any meta-information. This makes retrieval of the document a two-step process when accessing it through the database, but it saves a lot of effort when accessing the file directly. This also gets around the problem of storing binary documents–such as PNG graphics files or MP3 music files–in a database, which can involve a less-than-intuitive process as well.

User State Management

Maintaining user state while an application is being accessed is a major concern when designing programs for the Web. The HTTP access model is stateless by design; it was understood that a document should be served without regard to previous documents accessed or further document access. Unfortunately, this means that Web applications can't rely on the programming environment to provide a complete picture of the user's state. State information usually includes important data such as the last function executed or the current items of a shopping cart, which usually can't be expressed in terms of simple requests and responses from a Web browser.

User state can be recorded in a database, but state information also can be stored in other forms. The most widely known is the cookie, a persistent value that is set by the Web server but stored on the client for a specified period of time. Cookies are good for storing small amounts of data that are not crucial to an application's performance, especially when the data needs to persist from session to session no matter how much time passes between visits. Cookies can be problematic when storing information that needs to be reliably accessible, so it's not advised to use cookies alone to track application state. Also, many browsers are configured to reject cookies entirely, so it's usually necessary to provide alternate state management facilities in case cookies aren't available.

Document-based state management is useful when the application is operating on a text file, which can be stored temporarily while the user acts on it. This is best suited to Web applications with a fixed number of users, such as an intranet application or fee-based service. Document-based state also is suited to applications that are document-centric with a single document being acted on by a number of processes that happen sequentially or recursively. The basic idea is to write applications that derive their entire state by reading the relevant document with all actions operating on the same document. An example is generating an XML purchase order by filling in sections of it through Web forms. Any existing values can be used to prefill form fields, so a user could return to any form in the sequence and change information as necessary. After it is completed, the purchase order document can be reviewed and finalized, which might involve storing a copy in the database for indexing and meta-information.

Session managers also can be used to track user state and provide simple information about the user's activities. A session manager stores an arbitrary set of data associated with a single value–the session ID. Session ID values then can be passed through client-side cookies, as part of the page URL, or by sending hidden form variables with standard queries. When a new page is requested, the session ID is read by the application and all associated state information is recovered quickly from the session management daemon.

Apache::Session

The name Apache::Session is probably a misnomer because the module doesn't require Apache and doesn't provide a complete session manager, either client or server. Rather, Apache::Session is a persistence framework that provides a generic way to store session-type data structures, which enables more specific session managers to be built on top of the Apache::Session architecture.

Apache::Session provides data persistence by connecting to databases (through DBI), flat files (similar to the document-based model described in the "User State Management" section earlier in this chapter), or shared memory. Shared memory provides the best performance in most applications, but it is the most volatile storage method of the three. The module also provides lock management for all the connection types, using the built-in locking facilities of the data source if available or implementing it internally, if necessary. The module also handles ID generation, which otherwise would be a difficult aspect of session management to implement due to the need for values that are both secure and unique.

If the existing session managers do not suit the needs of a specific project, it's possible to subclass Apache::Session directly to implement a custom session management system. Similarly, existing implementations, such as the well-documented MySQL subclass, can be modified to suit a custom configuration, which enables only the additional features to be implemented in custom code. This is done without the need to reinvent useful parts of the existing implementation.

Sidebar: What a Good Database Administrator (DBA) is Worth

As a Webmaster and a Web application programmer, I've generally had to wear as many hats as there are jobs to do. Graphic design, programming, page layout, systems integration, and project management all have to be covered because I'm a one-man team. There's no one else to turn to for assistance, so I've had to learn new technologies as they were incorporated into each application. In time, I started to seek out new technologies and techniques just to see what was possible. It's part of what drives me to learn.

There's one aspect of Web application design that I've never relished learning, though: database design and implementation. Databases are big ugly beasts, and as soon as I think I've learned something fundamental about their personalities, they change into completely different beings (new and improved!), and I have to start from scratch again. It gets worse when I try to keep track of more than one member of the bestiary–Oracle, Informix, and MySQL, for example. At that point, all the quirks of the different systems start to run together. So, it's hard to remember which optimization goes with which version of which RDBMS.

Fortunately, I've found one technology and one person who make the job a lot easier. I've talked about the technology already; DBI makes my job much easier by hiding as many of the implementation differences from me as possible while exploiting them behind the scenes. I fell in love with DBI as soon as it hit release 0.93, and it's been invisibly stabilizing and optimizing my code ever since. Tim Bunce, DBI's lead developer, is to thank for much of the amazing work done on this software, but he's not the person I'm referring to.

The person that makes my job a lot easier is more of an abstraction instantiated by people I've worked with: the database administrator (DBA). I've worked with quite a few, and they're generally the type of people who can look at a nightmarishly long query with self-joins, subqueries, and every expensive construct mentioned in this chapter and visualize not only the execution plan, but also the optimizations necessary to remove 90 percent of the performance-killing inefficiencies. This also is the person who can pick out the two or three changes to a database schema that make subsequent queries ten times faster, and he or she is the person who keeps track of the infinite configuration minutiae that need to be tweaked to keep the beast purring. This isn't the person who fiddled with MySQL the other day and got a sample database working; a good DBA has had years to collect the mountains of implementation trivia that make up the real working knowledge of an RDBMS.

The benefit of a good DBA is immeasurable, but the monetary benefits of a well-tuned database–the queries and schema as well as the system itself–can be calculated as easily as any performance improvements. The "Evaluating the Cost of Down Time" section in Chapter 1 gives a good indication of the money that can be saved by improving performance even a little; an efficient database design can earn the salary of a good DBA ten times over. The DBAs I've known have earned their keep and then some, not just by improving the performance of Web applications, but also by improving the productivity of Web programmers by freeing them from the burden of optimizing both their applications and the underlying data structures they're accessing. In short, love your DBA and your applications will run faster. Believe me–it's worth any expense.

Summary

There are many areas where a database can be optimized for Web use, and the overall utility of using databases as storage areas for dynamic data makes performance a pressing concern for most Web applications. Fortunately, database interactions can be optimized at the application level by using Apache::DBI persistence and DBI's optimization facilities. This can be done at the query level by avoiding expensive query constructs and using the strengths of the Web to simplify queries, or it can be done at the database architecture level with a little care in schema construction. Sometimes, though, the best way to improve the performance of a Web application is by sidestepping the database entirely with session managers, cumulative documents, or cookies.