Global Spin: Problems with Persistence

The performance benefits of persistence are worth nearly any cost. Persistence tends to provide dramatic improvements in performance, with additional benefits as Web applications see more traffic and need to become more robust.

When implementing a persistent solution, it's necessary to watch for pitfalls, however. Specifically, some of the programming techniques commonly found in Perl Common Gateway Interface (CGI) programs can lead to unusual behavior or unanticipated performance loss when transplanted into a persistent environment. Some of these problems can be overcome automatically by choosing the right environment, but stylistic solutions to persistence problems can provide further benefits with little effort and few drawbacks.

The most common pitfall when writing Perl for the Web comes from global variables. The behavior of these variables can be very different between a CGI environment and a persistent environment. Luckily, there is an easy solution to global variable problems built directly into Perl. The my keyword provides a way to limit the scope of variables, and the strict and warnings pragmas provide compile-time and runtime checks to ensure that it is being used correctly.

More difficult than variable issues are subtle problems that cause performance degradation even when the application operates correctly. The use of scoped variables is good Perl programming practice in any situation, but speed killers are specific to persistent Web applications and might not be obvious to programmers coming from a CGI background. The solutions to these performance issues are varied, but it's usually possible to find a solution based on the function being performed.

Caching, which is the backbone of persistent performance benefits, also can have its downside. It's possible to cache too much information; quite simply, the size of any cache is limited to the size of available memory. As a result, Web applications have to choose what information to cache and what information not to cache, and the choice is not always an obvious one. In some cases, the choice isn't even a conscious one: some data structures create unwanted copies of themselves that hang around until they are specifically disposed of. These structures are rare, thankfully, and keeping track of them simply requires a little diligence.

Nested Subroutines and Scoped Variables

The structure of a persistent Perl program is slightly different from a stand-alone program, whether it is created automatically by a system such as Apache::Registry or written specifically for an environment such as FastCGI. One major difference between persistent and single-use environments is the treatment of global variables. Fortunately, the my keyword provides a way to alter the scope of these variables to make them behave consistently in any environment.

Of course, this use of lexical scoping has its own caveats. Because persistence is usually achieved by turning Perl programs into subroutines of other Perl programs, the structure of the compiled program might be noticeably different from that which was originally written. Fortunately, even these errors can be caught by using the strict and warnings pragma modules to check program code for odd behavior at compile time and at runtime. Because these modules are compiled in with the final version of the program, they are likely to catch errors that would slip by a manual inspection of the source code.
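
As a minimal sketch of the kind of check strict performs, the following program compiles two one-line snippets with a string eval and reports what happens. The helper name strict_error and the variable names $undeclared and $declared are made up for this illustration; the exact wording of the error message varies between Perl versions.

```perl
use strict;
use warnings;

# Hypothetical helper: compile a snippet under strict and return the
# error text, or an empty string if the snippet compiled and ran.
sub strict_error {
    my ($snippet) = @_;

    # The string eval compiles the snippet on the spot. strict stops the
    # compilation when it meets a variable that was never declared.
    eval "use strict; $snippet; 1";
    return $@;
}

my $without_my = strict_error('$undeclared = 1');
my $with_my    = strict_error('my $declared = 1');

print "Without my: $without_my";
print "With my:    ", ($with_my eq '' ? "no error\n" : $with_my);
```

The first snippet is rejected with a "Global symbol ... requires explicit package name" error before any of it runs, while the second compiles cleanly because the variable is declared with my.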

The `my` Keyword

Global variables are likely to behave strangely in a persistent context. The scope of most variables defaults to the entire program environment. Thus, many existing Perl CGI or command-line applications contain many global variables. Listing 11.1 is an example of these in a simple program.

Listing 11.1 Global Variable Example

01 #!/usr/bin/perl
02 
03 require 5.6.0;
04 
05 $word = $ARGV[0];
06 count_and_reverse();
07 
08 
09 sub count_and_reverse
10 {
11   print "\nNow evaluating '$word'.\n";
12 
13   foreach $letter (split '', $word)
14   {
15     $count++;
16     $reverse = $letter . $reverse;
17   }
18 
19   print "The word '$word' contains $count letters.\n";
20   print "The reverse of '$word' is $reverse.\n\n";
21 }

Listing 11.1 is a Perl command-line program that takes a single word as an argument, counts the letters in the word, and reverses the word. It's written in a style that separates the program's basic logic from functional blocks, which is common for Perl CGI programs. Line 05 takes the first argument from the command line ($ARGV[0]) and assigns it to the scalar variable $word. Line 06 then calls the count_and_reverse subroutine defined in the remainder of the program, starting with line 09. This subroutine is called in a void context with no arguments. Thus, its only access to program input is through the global $word variable. Line 11 prints this variable in context, and the word is split into component letters in line 13 and used as the basis for a foreach loop. Inside the loop, line 15 increments a counter variable ($count), and line 16 adds the current letter to the beginning of $reverse. The subroutine ends by printing the final letter count and the reverse of the word, as stored in $reverse.

The output of Listing 11.1 should look similar to the following, assuming that a word like "foof" is given as input:

Now evaluating 'foof'.
The word 'foof' contains 4 letters.
The reverse of 'foof' is foof.

This result isn't particularly exciting from a Perl point of view. The important thing to notice about this program is that it operates consistently from one instance to the next and that it provides a correct count and reversal of the input argument. This is analogous to the execution pattern seen with CGI programs, which proceed only once through the main functional block.

In contrast, Listing 11.2 is an analog of how the program might look in a persistent context. For simplicity, the persistent event handler is replaced here with a simple loop operating over the array of provided arguments. Otherwise, the program has the same structure. The only changes are in lines 05 through 09.

Listing 11.2 Global Variables in Persistent Context

01 #!/usr/bin/perl
02 
03 require 5.6.0;
04 
05 foreach (@ARGV)
06 {
07   $word = $_;
08   count_and_reverse();
09 }
10 
11 sub count_and_reverse
12 {
13   print "\nNow evaluating '$word'.\n";
14 
15   foreach $letter (split '', $word)
16   {
17     $count++;
18     $reverse = $letter . $reverse;
19   }
20 
21   print "The word '$word' contains $count letters.\n";
22   print "The reverse of '$word' is $reverse.\n\n";
23 }

Lines 05 through 09 of Listing 11.2 set up a foreach loop that loops through the command-line arguments provided by @ARGV. Line 07 updates the global variable $word with each argument, and line 08 calls the count_and_reverse subroutine. Because $word is being updated before each call to the subroutine, the expected behavior of the program would be identical to that of Listing 11.1. However, providing the program with a list of arguments such as "foof hershey kahlua" would produce a result similar to the following:

Now evaluating 'foof'.
The word 'foof' contains 4 letters.
The reverse of 'foof' is foof.

 
Now evaluating 'hershey'.
The word 'hershey' contains 11 letters.
The reverse of 'hershey' is yehsrehfoof.

 
Now evaluating 'kahlua'.
The word 'kahlua' contains 17 letters.
The reverse of 'kahlua' is aulhakyehsrehfoof.

Whoops! The behavior of this program is different from what you would expect. It evaluates the first argument correctly, but subsequent iterations of the count_and_reverse subroutine produce a count larger than expected and a $reverse value that includes previous values. This happens despite the fact that the value of $word is correct for each iteration of the subroutine.

If the functional part of this program were operating in a true persistent Perl environment rather than this simulation, the results would be the same. Each subsequent call to the program would include leftover values from previous iterations, and the result of the program would be inconsistent. In practice, the inconsistencies are harder to spot than they are in this simple program, and the resulting confusion can be much more damaging to a Web application's effectiveness. In these cases, using the strict and warnings pragma modules (see Listing 11.3 for a usage example) points out the culprits:

Global symbol "$word" requires explicit package name at line 9.
Global symbol "$word" requires explicit package name at line 15.
Global symbol "$letter" requires explicit package name at line 17.
...
Global symbol "$reverse" requires explicit package name at line 24.
Execution of ./listing11-03a.pl aborted due to compilation errors.

For this program, the problem lies in the global variables used inside the count_and_reverse subroutine, as the error messages from the strict pragma point out. These variables are created implicitly the first time the subroutine is called, but no indication is given by the programmer that they should be limited in scope. As a result, they are defined as global in scope (the default), and their values are kept from one iteration of the subroutine to the next. Because no code is used to explicitly initialize the variables before they are used, new values are simply added to old ones for each iteration.

The solution is simple in this case: use the my keyword to confine the scope of the variable to the subroutine. Complete documentation on the use of the my keyword (and its newer sibling, our, which is used to explicitly create global variables) is available in the Perl documentation. For our purposes, it suffices to say that my declares the scope of a variable to be limited to the enclosing block. This enables Perl to remove the variable after the program leaves its defined scope and reinitialize it when it appears again. Listing 11.3 uses the my keyword to fix the persistence problems in Listing 11.2. The differences are in lines 04, 05, 07, 19, 20, and 22.

Listing 11.3 File-Scoped Variables

01 #!/usr/bin/perl
02 
03 require 5.6.0;
04 use strict;
05 use warnings;
06 
07 my $word;
08 
09 foreach (@ARGV)
10 {
11   $word = $_;
12   count_and_reverse();
13 }
14 
15 sub count_and_reverse
16 {
17   print "\nNow evaluating '$word'.\n";
18 
19   my $count = 0;
20   my $reverse = '';
21 
22   foreach my $letter (split '', $word)
23   {
24     $count++;
25     $reverse = $letter . $reverse;
26   }
27 
28   print "The word '$word' contains $count letters.\n";
29   print "The reverse of '$word' is $reverse.\n\n";
30 }

Lines 04 and 05 of Listing 11.3 add the strict and warnings pragmas to require variable scoping and raise warnings at potential trouble spots, respectively. Line 07 has been added to declare the scope of $word to be the outermost block of the program (in this case, the entire file). This won't affect the way the program behaves in this instance, but it becomes important if the program becomes incorporated into yet another program. ("There's always a bigger fish.")

Lines 19 and 20 have a more immediate effect. Line 19 declares the scope of $count and gives it an initial value of zero. It could be argued that either of these would take care of the unusual behavior seen when Listing 11.2 is run, but both are necessary to make sure that the variable has both a known initial value and a defined scope within the program. The former is helpful to Perl when determining how to handle the variable in later interactions, and the latter is necessary to let Perl know when it can free the memory allocated to the variable. If the my keyword wasn't used to confine the variable's scope to the subroutine block, the memory used by $count would not be freed after the subroutine ended. In a Web context where the subroutine (or persistent program) might not receive a request again for hours, this creates unnecessary inefficiency that's easily avoidable.
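
The point about freeing memory can be made visible with a small sketch. It assumes a made-up class, Sketch::Tracker, whose DESTROY method records the moment Perl releases an object; the subroutine names scoped_request and global_request are likewise hypothetical stand-ins for two styles of request handler. A plain scalar such as $count has no DESTROY method, but it is released at the same point as the lexical object here.

```perl
use strict;
use warnings;

# Hypothetical class that records its name when Perl frees it.
package Sketch::Tracker;

our @freed;

sub new {
    my ($class, $name) = @_;
    return bless { name => $name }, $class;
}

sub DESTROY {
    my ($self) = @_;
    push @freed, $self->{name};
}

package main;

our $global_tracker;

# The lexical object is freed as soon as the subroutine returns.
sub scoped_request {
    my $tracker = Sketch::Tracker->new('lexical');
    return;
}

# The package (global) variable keeps its object alive between calls.
sub global_request {
    $global_tracker = Sketch::Tracker->new('global');
    return;
}

scoped_request();
global_request();
print "Freed after both calls: @Sketch::Tracker::freed\n";

# The global object is released only when something explicitly lets go of it.
undef $global_tracker;
print "Freed after undef: @Sketch::Tracker::freed\n";
```

After both calls only the lexical object has been freed; the object held in the global variable stays in memory until it is explicitly released or replaced.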

With these four my declarations (lines 07, 19, 20, and 22), all global variables in the program are accounted for and their scope is limited to the smallest area possible. As a result, the output of this program should be identical to that of the original in Listing 11.1. This principle can be extended to any program developed for use in a persistent environment: Use the my declaration as each variable is initially used to avoid unusual behavior caused by the persistence of the program.

"Variable Will Not Stay Shared" Errors

Global variables aren't always easy to spot, however. In fact, some variables are global only in a persistent context–even when they would not be in a CGI context. This usually is caused by the automatic modifications made to individual CGI-style programs as they are incorporated into a persistent environment. Because each program is redefined as a subroutine, Perl might treat parts of the program differently.

For instance, Listing 11.4 is an example of a CGI-style program that would not seem to have any global variables. In fact, it's a minor modification of the program in Listing 11.3 that was corrected specifically to get rid of global variables. Only the basic structure of a CGI program has been added to account for Web form variable input and simple HTML output. As a result, Listing 11.4 can be used in a CGI environment or a persistent environment.

Listing 11.4 CGI-Style Program with File-Scoped Variables

01 #!/usr/bin/perl
02 
03 require 5.6.0;
04 use strict;
05 use warnings;
06 
07 use CGI;
08 
09 my $word;
10 my $q = CGI->new();
11 
12 print $q->header;
13 print "<pre>\n";
14 
15 foreach ($q->param('word'))
16 {
17   $word = $_;
18   count_and_reverse();
19 }
20 
21 print "</pre>\n";
22 
23 sub count_and_reverse
24 {
25   print "\nNow evaluating '$word'.\n";
26 
27   my $count = 0;
28   my $reverse = '';
29 
30   foreach my $letter (split '', $word)
31   {
32     $count++;
33     $reverse = $letter . $reverse;
34   }
35 
36   print "The word '$word' contains $count letters.\n";
37   print "The reverse of '$word' is $reverse.\n\n";
38 }

In a CGI environment, the program in Listing 11.4 would produce output similar to that of Listing 11.3. Line 07 includes the CGI module to provide form variable processing and simplified HTML output. Line 10 creates a new CGI object $q and limits its scope to this file. Lines 12 and 13 output the basic structure of an HTML response, using a <pre> tag to display the rest of the output as preformatted text. Line 15 has been modified to get a list of input words from the word form variable as processed by the CGI module, but the rest of the foreach loop stays the same, as does the count_and_reverse subroutine. After all the values have been processed, line 21 prints a closing </pre> tag to end the HTML output. Because the core of this program hasn't changed significantly, the result of a CGI request to the program would be consistent with that which is found at the command line.

However, in a persistent context, this program produces a warning, and if the program is called again, it produces inconsistent results. (It would produce inconsistent results without notice if warnings were not enabled, of course.) The warning, which might appear in a number of ways, depending on the environment, should appear essentially like the following:

Variable "$word" will not stay shared at (eval 83) line 25.
Subroutine count_and_reverse redefined at (eval 83) line 24.

In this case, the warning is not as helpful as the messages from the strict pragma in Listing 11.2, and it provides only vague clues as to the problem. Although stated cryptically, the effect is listed in the first warning, and the cause is listed in the second warning. The cause is explored in the next section of this chapter, "Subroutines and Trickery Revisited," but the problem can be addressed solely in terms of effect. In this case, line 25 is the first instance of $word in the subroutine. Thus, it would appear that something unusual is happening to it after the subroutine is called. The warning "will not stay shared" is Perl's way of indicating that the $word declared with my outside the subroutine and the $word used inside the subroutine will stop referring to the same value, which is a situation brought on by persistence trickery that would normally not occur. As a result, the subroutine keeps seeing the $word from the first request, while each later request assigns its input to a fresh $word that the subroutine never sees. In short, Perl understands that we mean the same variable. Thus, it's warning us that it won't be the same.

To overcome this error with a work-alike that keeps the my declaration outside the subroutine, the variable can be explicitly shared inside the subroutine by providing the value we want as an argument and creating a new variable to store it. Listing 11.5 illustrates the solution.

Listing 11.5 Persistent-Style Program with File-Scoped Variables

01 #!/usr/bin/perl
02 
03 require 5.6.0;
04 use strict;
05 use warnings;
06 
07 use CGI;
08 
09 my $word;
10 my $q = CGI->new();
11 
12 print $q->header;
13 print "<pre>\n";
14 
15 foreach ($q->param('word'))
16 {
17   $word = $_;
18   count_and_reverse($word);
19 }
20 
21 print "</pre>\n";
22 
23 sub count_and_reverse
24 {
25   my $word = shift;
26   print "\nNow evaluating '$word'.\n";
27 
28   my $count = 0;
29   my $reverse = '';
30 
31   foreach my $letter (split '', $word)
32   {
33     $count++;
34     $reverse = $letter . $reverse;
35   }
36 
37   print "The word '$word' contains $count letters.\n";
38   print "The reverse of '$word' is $reverse.\n\n";
39 }

Lines 18 and 25 of Listing 11.5 add the structure necessary to explicitly pass the value of $word to the subroutine each time it is called. Line 18 calls count_and_reverse with the file-scoped $word as the argument, and line 25 assigns it to a new variable (also called $word), which is limited in scope to the subroutine itself. This variable could be called something other than $word, but when troubleshooting existing CGI programs, it's usually easier to keep the same variable names if there's little chance of confusion. This avoids potential mismatches.

Again, the general idea behind this specific solution is to explicitly declare the relationship between the variable in the subroutine and the variable in the main program. In this case, the relationship was a simple copy, but other relationships are possible as well. If $word needed to be modified by the subroutine for use with the main program, it might have been better to pass a reference to the variable rather than a copy. In addition, if a variable is likely to be common to many subroutines (such as the CGI object $q in Listing 11.5), it is sometimes less cumbersome to explicitly declare the variable as global using the our keyword instead. Take care with our, though, because it creates a global variable that is shared not only with subsequent instances of this persistent program, but also with all other persistent Perl programs running within the same application engine.
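
The two alternatives mentioned above, passing a reference and declaring a deliberate global with our, might look something like the following sketch. The subroutine names reverse_in_place and note_request and the variable $request_count are invented for this illustration and do not come from the listings.

```perl
use strict;
use warnings;

# Receives a reference to the caller's variable and changes it in place,
# so the main program sees the modified value.
sub reverse_in_place {
    my ($word_ref) = @_;
    $$word_ref = scalar reverse $$word_ref;
    return;
}

# A package variable declared with our is global on purpose. In a
# persistent process it keeps its value from one request to the next.
our $request_count = 0;

sub note_request {
    return ++$request_count;
}

my $word = 'hershey';
reverse_in_place(\$word);
print "Main program now sees '$word'.\n";

note_request() for 1 .. 3;
print "Requests noted so far: $request_count\n";
```

The reference keeps the relationship between the two variables explicit at the call site, while the our variable is convenient but carries the sharing risks described above.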

Subroutines and Trickery Revisited

To understand the mechanism behind the "variable will not stay shared" error, it's important to know how Perl handles nested subroutines. Perl performs its own compile-time trickery by modifying the program to extract subroutines from other subroutines and by promoting them to the top level of their namespace. This is done to provide a simpler list of subroutines with which the Perl compiler has to deal. As a result, the general form of the program in Perl is changed from a nested subroutine like this:

namespace
  sub foo {
    sub bar {
      sub baz {}
    }
  }
  sub bah {}

to a more manageable list of subroutines like this:

namespace
  sub foo {}
  sub bar {}
  sub baz {}
  sub bah {}

Normally this works without a problem because the subroutines are called when global variables are set correctly or when they are passed arguments that can be assigned to variables with local scope. Even in the case in which variables are scoped to the namespace or file by using my, the subroutines inherit that scope because they are a level below the namespace. Thus, the example in Listing 11.4 works fine in a CGI context because the subroutine is run from the main namespace, and $word is scoped to that namespace.

However, when a variable is scoped to one subroutine and used within another, the sharing does not survive. For instance, a variable defined in subroutine foo using my is not reliably shared with subroutine bar because bar is extracted from foo at compile time: bar sees the variable as it existed during the first call to foo, and later calls to foo work with a fresh copy. This makes perfect sense in reverse: a variable scoped within bar should fall out of scope within foo. Because both subroutines are effectively at the same level after the compile-time switch, variables scoped to either cannot be relied on in the other.
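
The effect can be reproduced outside any Web environment with a short sketch. The subroutine names outer and inner are placeholders, and the warning is collected in a made-up array, @warnings_seen, only so that it can be printed alongside the results.

```perl
use strict;
use warnings;

# Collect warnings instead of letting them go to STDERR, so the
# compile-time warning can be printed with the results below.
our @warnings_seen;
BEGIN { $SIG{__WARN__} = sub { push @warnings_seen, $_[0] } }

sub outer {
    my $word = shift;

    # A named subroutine is compiled only once, so it stays bound to the
    # $word that belongs to the first call of outer().
    sub inner { return $word }

    return inner();
}

my $first  = outer('foof');
my $second = outer('hershey');

print "First call saw:  $first\n";
print "Second call saw: $second\n";
print "Warning: $_" for @warnings_seen;
```

Both calls report 'foof': the second call assigns 'hershey' to a new $word, but inner still holds the one from the first call, which is exactly what the "will not stay shared" warning predicts.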

As an example, Listings 11.6 and 11.7 are equivalent programs in Perl. Listing 11.6 embeds the program from Listing 11.5 in a subroutine and calls it repeatedly from within the surrounding program. Listing 11.7 shows equivalent code after the inherent translation at compile time.

Listing 11.6 Nested Subroutine Example

01 #!/usr/bin/perl
02 
03 require 5.6.0;
04 use strict;
05 use warnings;
06 
07 handler('foof','hershey','kahlua');
08 handler('merlin','callie','elfie','xena');
09 
10 sub handler
11 {
12   my $word;
13 
14   foreach (@_)
15   {
16     $word = $_;
17     count_and_reverse($word);
18   }
19 
20   sub count_and_reverse
21   {
22     my $word = shift;
23     print "\nNow evaluating '$word'.\n";
24 
25     my $count = 0;
26     my $reverse = '';
27   
28     foreach my $letter (split '', $word)
29     {
30       $count++;
31       $reverse = $letter . $reverse;
32     }
33   
34     print "The word '$word' contains $count letters.\n";
35     print "The reverse of '$word' is $reverse.\n\n";
36   }
37 }

Listing 11.7 Equivalent Example

01 #!/usr/bin/perl
02 
03 require 5.6.0;
04 use strict;
05 use warnings;
06 
07 handler('foof','hershey','kahlua');
08 handler('merlin','callie','elfie','xena');
09 
10 sub handler
11 {
12   my $word;
13 
14   foreach (@_)
15   {
16     $word = $_;
17     count_and_reverse($word);
18   }
19 }
20 
21 sub count_and_reverse
22 {
23   my $word = shift;
24   print "\nNow evaluating '$word'.\n";
25 
26   my $count = 0;
27   my $reverse = '';
28 
29   foreach my $letter (split '', $word)
30   {
31     $count++;
32     $reverse = $letter . $reverse;
33   }
34   
35   print "The word '$word' contains $count letters.\n";
36   print "The reverse of '$word' is $reverse.\n\n";
37 }

Although the cause of the "variable will not stay shared" warning isn't obvious in Listing 11.6, it should be clearer when seen in the equivalent Listing 11.7. Line 12 of Listing 11.7 limits the scope of $word to the handler subroutine. Thus, the variable would be out of scope when used in the count_and_reverse subroutine. Only by scoping each variable separately and passing its values explicitly can the relationship be defined.

The hint leading to this discovery was that the error message for Listing 11.4 in a persistent environment was "subroutine redefined." As noted before, each program segment of a Web application, which would be a separate program in a CGI context, is incorporated into the larger whole by creating a subroutine with its code. Because a program segment might itself contain subroutines, these subroutines are dealt with in a manner that is similar to the nested subroutine in Listing 11.6. Because of this chain of events, global variables were created where none were anticipated.
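
The wrapping step itself can be imitated in a few lines. The following is only a loose sketch of the idea, not the code of any real persistent environment: the package name Sketch::Wrapped, the subroutine name handler, and the arrays @output and @warnings_seen are all assumptions made for the example.

```perl
use strict;
use warnings;

our @output;          # filled in by the wrapped script
our @warnings_seen;   # warnings raised while the wrapper is compiled
$SIG{__WARN__} = sub { push @warnings_seen, $_[0] };

# A CGI-style script body, written as if it were a standalone file.
my $script = <<'END_SCRIPT';
my $word = shift;
push @main::output, describe();
sub describe { return "word is '$word'" }
END_SCRIPT

# Hypothetical wrapper: the whole script becomes the body of a single
# subroutine, so describe() ends up nested inside handler().
my $wrapped = "package Sketch::Wrapped; sub handler {\n$script}\n1;";
eval $wrapped or die "Could not compile wrapper: $@";

Sketch::Wrapped::handler('foof');      # first "request"
Sketch::Wrapped::handler('hershey');   # second "request"

print "$_\n" for @output;
print "Warning: $_" for @warnings_seen;
```

The script contains no nested subroutines as written, yet after wrapping it behaves like Listing 11.6: the second request still reports the word from the first, and the compile step raises the "will not stay shared" warning.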

Use Warnings to Catch Unusual Behavior

Again, the value of using the strict and warnings pragma modules to catch potential pitfalls can't be overstated. In these cases and many others, odd behavior can be caught, or at least hinted at, by errors and warnings generated by these two modules. As shown in Listing 11.4, sometimes the cause of unusual behavior would be next to impossible to track down without the hints given by the warnings pragma module.

It's also important to keep warnings in use even after the Web application is put into production. It would seem to be better for performance to remove the pragmas when the code is no longer likely to change, but the performance benefits are slight and the danger of unchecked changes to the code is too great in a Web development environment. Simple changes to the environment, such as shifting from CGI to a persistent context, can cause minor changes in the way code is handled by Perl. It's better to find out that problems exist by seeing warnings in development rather than by seeing peculiar behavior from a live site.

Fork, Backticks, and Speed Killers

Global variables and scoping aren't the only problems encountered when translating CGI-style programs for persistent use. Although variable issues can cause unusual behavior, another class of issues affects performance directly. The problem arises when the application engine process needs to fork, or start another process to perform work. The forked subprocess is usually transitory. Thus, it is started and stopped within the span of one request from the Web server. This counteracts the performance benefits of keeping a persistent application engine in the first place.

Fortunately, there are usually ways to get around forked processes within a Web application. In Perl, it's almost always possible to modify the application to use a native Perl function or module rather than an external process. Even when the external process can't be replicated by native Perl code, a persistent interface to it probably can be created without the need to fork additional processes. Sometimes, the necessary changes are difficult to spot within legacy code because the fork isn't explicit, but most cases of forking code fall into a few categories that can be found and replaced with native code fairly easily.

Backticks and Forked Processes

For Web application programmers, forking is most likely to be used implicitly when accessing system utilities or other programs external to the Perl application. The usual method for accessing these programs is to enclose a command in backticks and assign the output to a variable. Listing 11.8 is an example of a simple program that uses the UNIX system utility ls to list files in a directory.

Listing 11.8 Backticks Example

01 #!/usr/bin/perl
02 
03 require 5.6.0;
04 use strict;
05 use warnings;
06 
07 use CGI;
08 
09 my $q = CGI->new;
10 print $q->header, $q->start_html;
11 
12 my $listing = `ls /tmp/perlfiles/`;
13 
14 foreach my $file (split /\n/, $listing)
15 {
16   next unless $file =~ /\.pl$/;
17   chomp($file);
18   print $q->p($file);
19 }
20 
21 print $q->end_html;

Listing 11.8 creates a simple HTML page with a listing of the Perl programs in the /tmp/perlfiles/ directory. Lines 07 through 10 set up the CGI module and start the HTML page. The backticks in line 12 run the ls command in the default shell and return the output, which is assigned to the $listing variable. Line 14 splits the result into individual filenames and loops over the list. Line 16 excludes any filename that doesn't end in .pl, and lines 17 and 18 clean up the filename and print it as an HTML paragraph. At the end, line 21 caps off the HTML page.

Behind the scenes, most of the overhead incurred by this program is caused by line 12. Specifically, the backticks fork a new process to run the command, and Perl hands the command to a command-line shell whenever it contains shell metacharacters. Because ls (and any shell that runs it) is an external program, it has to be started again with each request, regardless of whether the program is running in a persistent context. As a result, one line of code slows the entire program toward CGI speeds. If the command started further programs of its own, the additional overhead of starting them would slow the Web application down further.
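
One way to see that every backtick command runs in a new process is to ask the child for its process ID. This is a small sketch that assumes a UNIX-like system with a standard shell; the helper name child_pid is made up for the example.

```perl
use strict;
use warnings;

# Ask the child process for its own process ID. The single-quote
# delimiter stops Perl from interpolating $$, so the shell expands it.
sub child_pid {
    my $pid = qx'echo $$';
    chomp $pid;
    return $pid;
}

print "Perl process:   $$\n";
print "Backtick child: ", child_pid(), "\n" for 1 .. 2;
```

Each call reports a process ID different from that of the Perl program itself, because a fresh process is created and torn down every time the backticks are evaluated.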

The best way to avoid this kind of speed killer is to use native Perl functions to perform the same task as the system utility. This provides two benefits:

The performance of the Web application is no longer hampered by the external program.
The Web application becomes more portable because it relies less on a specific environment.

Perl provides built-in functions that give equivalent functionality to most system utilities, and more esoteric programs usually have specific Perl interfaces defined to address this problem. Listing 11.9 is an example of how Listing 11.8 could be implemented using only native Perl.

Listing 11.9 Native Perl Example

01 #!/usr/bin/perl
02 
03 require 5.6.0;
04 use strict;
05 use warnings;
06 
07 use CGI;
08 
09 my $q = CGI->new;
10 print $q->header, $q->start_html;
11 
12 opendir DIR, "/tmp/perlfiles/" or die "Can't open directory: $!";
13 
14 while (my $file = readdir DIR)
15 {
16   next unless $file =~ /\.pl$/;
17   chomp($file);
18   print $q->p($file);
19 }
20 
21 closedir DIR;
22
23 print $q->end_html;

The main difference between Listings 11.8 and 11.9 is the function used to open a directory listing. Line 12 of Listing 11.9 uses the opendir function provided by Perl to open a directory in a fashion similar to the open function. Line 14 loops over individual filenames in a slightly different fashion, but the rest of the loop operates identically to Listing 11.8 and the results are the same. Line 21 is added to close the directory handle DIR that was created in line 12. These simple changes relieve the need to start an external process and open up the application to the full performance benefits of persistence.
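
The same native approach can also be packaged as a reusable subroutine. The sketch below is a variation on the listing rather than a replacement for it: it uses a lexical directory handle instead of the bareword DIR, and the subroutine name perl_files_in is invented for the example.

```perl
use strict;
use warnings;

# Return the sorted names of the .pl files in a directory, using only
# Perl built-ins and no external process.
sub perl_files_in {
    my ($dir) = @_;

    opendir(my $dh, $dir) or die "Can't open directory $dir: $!";
    my @files = sort grep { /\.pl$/ } readdir $dh;
    closedir $dh;

    return @files;
}

# /tmp/perlfiles/ is the directory used in the listings; it might not
# exist on other systems, so the sketch checks before reading it.
my $dir = '/tmp/perlfiles/';
if (-d $dir) {
    print "$_\n" for perl_files_in($dir);
}
```

Because the handle is a lexical variable, it goes out of scope with the subroutine, which fits the scoping advice given earlier in the chapter.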

Processes That Fork Unintentionally

Not all external processes are started by an explicit fork keyword or a command enclosed in backticks. Many common Web applications fork processes unintentionally by calling command-line programs indirectly. This can occur in two ways: a file handle might be opened to a forked process by specifying a pipe symbol in the open function, or a module might be used that performs its own fork. The latter case can be avoided only by knowing which modules are all-native (including those that use XS) and which modules use external processes. The former case, however, can be fixed by replacing pipes to external programs with Perl-native functions or modules.
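
As a small illustration of replacing a piped file handle with native code, the following sketch counts the lines in a file. The piped form appears only as a comment for comparison, and the choice of the wc utility and the subroutine name count_lines are assumptions made for the example.

```perl
use strict;
use warnings;

# Forking version, shown only for comparison. The pipe starts the
# external wc program on every call:
#   open(my $pipe, '-|', 'wc', '-l', $path) or die "Can't run wc: $!";

# Native version: read the file in-process and count its lines.
sub count_lines {
    my ($path) = @_;

    open(my $fh, '<', $path) or die "Can't open $path: $!";
    my $lines = 0;
    $lines++ while <$fh>;
    close $fh;

    return $lines;
}

# Count the lines of this script itself, if it was run from a file.
if (-f $0) {
    print "This script has ", count_lines($0), " lines.\n";
}
```

The native version does slightly more work inside Perl, but it never leaves the persistent process, so it avoids the cost of starting a program for each request.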

Listing 11.10 is an example that uses one of the most common speed killers in a persistent Web application. Many programs send email based on Web application events, such as a new user sign-up or a successful e-commerce transaction. Because many of these programs initially are written for CGI or based on CGI applications, this problem often is inherited without realizing it is a problem.

Listing 11.10 Common CGI E-Mailer Using Sendmail

01 #!/usr/bin/perl
02 
03 require 5.6.0;
04 use strict;
05 use warnings;
06 
07 use CGI;
08 
09 my $q = CGI->new;
10 print $q->header, $q->start_html;
11 
12 my $to = $q->param('to');
13 my $from = $q->param('from');
14 my $subject = $q->param('subject');
15 my $body = $q->param('body');
16 
17 open MAIL, "| /usr/sbin/sendmail -t" or die "Can't start sendmail: $!";
18 
19 print MAIL "From: $from\n";
20 print MAIL "To: $to\n";
21 print MAIL "Subject: $subject\n\n";
22 
23 print MAIL "$body\n";
24 
25 close MAIL;
26 
27 print $q->p("From: $from");
28 print $q->p("To: $to");
29 print $q->p("Subject: $subject");
30 
31 print $q->p("$body");
32 
33
34 print $q->end_html;

The problem with Listing 11.10 lies in line 17. Even though no backticks or fork functions are used, the pipe symbol in the open argument calls sendmail as a command-line program and pipes the MAIL file handle to it. The rest of the program is perfectly ordinary. Lines 09 through 15 set up the CGI environment and retrieve the form variables to, from, subject, and body, which are used to determine the contents of the mail message. Lines 19 through 23 print the contents to the sendmail program, and lines 27 through 31 print the same contents to the HTML output. Line 25 closes the sendmail pipe opened in line 17.

Of course, opening a pipe to sendmail involves more than just a pipe. Perl has to fork a new process to handle the command line from line 17 (going through a shell if the command contains shell metacharacters), and that process starts sendmail at the other end of the pipe. Again, this happens for each request, and the resulting overhead slows the Web application down toward CGI speeds. The solution is to avoid starting a shell or an external program in the first place. Luckily this is possible by sending mail through the Net::SMTP module, which opens a socket to the mail server without the need for any new processes. Listing 11.11 incorporates the change.

Listing 11.11 Native Perl E-Mailer Using Net::SMTP

01 #!/usr/bin/perl
02 
03 require 5.6.0;
04 use strict;
05 use warnings;
06 
07 use CGI;
08 use Net::SMTP;
09 
10 my $q = CGI->new;
11 print $q->start_html;
12 
13 my $to = $q->param('to');
14 my $from = $q->param('from');
15 my $subject = $q->param('subject');
16 my $body = $q->param('body');
17 
18 my $smtp = Net::SMTP->new('localhost') or die "Cannot connect to SMTP server\n";
19   
20 $smtp->mail($from);
21 $smtp->to($to);
22 
23 $smtp->data();
24 $smtp->datasend("From: $from\nTo: $to\n");
25 $smtp->datasend("Subject: $subject\n\n");
26 $smtp->datasend("$body\n");
27 $smtp->dataend();
28   
29 $smtp->quit;
30 
31 print $q->p("From: $from");
32 print $q->p("To: $to");
33 print $q->p("Subject: $subject");
34 
35 print $q->p("$body");
36 
37 
38 print $q->end_html;

In this case, the code difference is a little more significant due to the differences between the piped file handle and the Net::SMTP module functions. Line 08 includes the Net::SMTP module itself, and line 18 opens a connection to the SMTP mail server on the local machine. Lines 20-27 pass the data to the server using functions provided by the $smtp connection object, and line 29 finally closes the connection to the server. Because the entire transaction is carried out using native Perl networking libraries and modules, no process-creation overhead is incurred. In addition, using a network connection might provide opportunities for further performance enhancements because the connection can be opened once and kept persistent, depending on the Web application environment.
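The idea of opening the connection once and reusing it could look something like the following. This is only a sketch, assuming an environment that keeps module-level variables alive between requests; the package name My::Mailer, the function smtp_connection, and the optional connection factory argument are all hypothetical names introduced here for illustration.

```perl
use strict;
use warnings;

package My::Mailer;   # hypothetical module name

my $cached;           # survives between requests in a persistent environment

# Returns a live SMTP connection, reusing the previous one when it still
# answers. $connect is an optional code reference that builds a new
# connection; by default it opens one with Net::SMTP.
sub smtp_connection {
    my ($connect) = @_;
    $connect ||= sub {
        require Net::SMTP;
        return Net::SMTP->new('localhost', Timeout => 30);
    };

    # reset() sends RSET to the server; a false result is taken to mean
    # the old connection is no longer usable.
    return $cached if $cached && $cached->reset;

    $cached = $connect->() or die "Cannot connect to SMTP server\n";
    return $cached;
}

package main;

# In a request handler, the connection would be fetched instead of created:
#   my $smtp = My::Mailer::smtp_connection();
#   $smtp->mail($from);
#   ...
# and quit would not be called at the end of the request.
```

Whether this is worthwhile depends on how long the mail server tolerates idle connections, so the liveness check before each reuse is the important part of the sketch.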

Discovering Speed Killers

As mentioned, it's difficult to find the true nature of a performance bottleneck–especially when the cause of the problem is architectural rather than programmatic. Luckily, most speed killers in a Perl Web application environment can be found in the few classes mentioned in this chapter. By following the style guidelines outlined, it's possible to avoid the majority of Web application speed killers.

However, the possibility still exists that a bottleneck is present that doesn't fall into one of these categories. As time goes on, that possibility becomes a certainty as new technologies are incorporated into the Web application framework. As a result, it's necessary to develop a more general way to pinpoint the source of a performance bottleneck–if not necessarily its nature. More information on evaluating the performance of a site and its parts can be found in Chapter 15, "Testing Site Performance."
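As a rough illustration of pinpointing where time goes without knowing why, sections of a program can be wrapped in a timer. The sketch below assumes the Time::HiRes module is available; the helper name timed and the section labels are hypothetical, and the short sleep merely stands in for a slow external call.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Accumulated wall-clock time per labeled section.
my %elapsed;

# Runs a block of code and adds the time it took to the total
# recorded under the given label.
sub timed {
    my ($label, $code) = @_;
    my $start  = time;
    my @result = $code->();
    $elapsed{$label} += time - $start;
    return wantarray ? @result : $result[0];
}

# Two example sections; the select call sleeps for 0.05 seconds.
timed(parse => sub { my $sum = 0; $sum += $_ for 1 .. 1000; $sum });
timed(mail  => sub { select(undef, undef, undef, 0.05) });

# Report the slowest sections first.
for my $label (sort { $elapsed{$b} <=> $elapsed{$a} } keys %elapsed) {
    printf STDERR "%-10s %.4f s\n", $label, $elapsed{$label};
}
```

A section that dominates the report on every request is a candidate for closer inspection, even if its cause does not match any of the classes described in this chapter.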

Caching

Although some persistence problems are noticeable immediately, others can be caused by cumulative effects that won't be noticed until the Web application has been operating persistently for a considerable amount of time. These kinds of problems can be the most difficult to detect because the only obvious signs will be slow performance degradation, with a possibility of server crashes or unusual behavior after a prolonged slowdown. In some cases, though, the problems seem to fix themselves on a periodic basis because their causes are linked to the life cycles of the Web application engines themselves.

Caching provides most of the performance improvements seen by persistent Web applications because caching eliminates the overhead due to loading files off disk and processing them for use in the application. As shown in previous chapters, this holds true for compiling applications, processing simple files, or loading complex files into data structures. As a result, the usual response to any file, connection, or data structure in a persistent environment is to keep it around as long as possible in case some other part of the application might be able to benefit from its presence.

Caching comes at a price, however. The price usually isn't high, but it can be prohibitive if caching is used indiscriminately. Each cached variable, file, or data structure takes up an additional amount of memory, and over time, these cached structures can overwhelm the rest of an application engine and force it either to restart or to slow down the entire Web server machine. Luckily, most Web application environments rely on Perl to make the decision whether to keep a cached structure in memory, so many data structures are cached only if they are specifically marked as important within the application. Unfortunately, some complex data structures can escape Perl's scrutiny and stay in memory indefinitely, regardless of whether they can be used.

The main cause of unwanted caching is Perl's openness when it comes to the assignment of references. Perl cleans up memory based on the current activity of a data structure, but it is altogether too easy to convince Perl that a data structure is still active, even though its only references come from itself. Worse yet, it's sometimes necessary to set up these circular references to model a data structure properly. The solution is to manually walk through the data structure and remove circular references, which luckily is handled by some modules themselves and offered as an object method.

Circular References

In Perl, unanticipated caching is a side effect of the way memory is reclaimed: Perl frees the memory taken up by a variable only when the variable is no longer being used. Perl determines this by keeping tabs on the variable's reference count, which is the number of currently active variables and program sections that refer to it. Perl knows to keep a variable in memory as long as its reference count is not zero, and it can safely remove the variable from memory when the reference count reaches zero because there is no longer any way for the program to access the variable.
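Reference counting can be observed with a small experiment. The class Tracked below is hypothetical and exists only to count how many of its objects Perl has freed, using the DESTROY method that Perl calls when an object's reference count reaches zero.

```perl
use strict;
use warnings;

package Tracked;   # hypothetical class that records when it is freed

my $freed = 0;

sub new     { my ($class, $name) = @_; return bless { name => $name }, $class }
sub DESTROY { $freed++ }
sub freed   { return $freed }

package main;

{
    my $first  = Tracked->new('example');   # reference count is 1
    my $second = $first;                    # reference count is 2
    undef $first;                           # back to 1, object stays in memory
    print "Freed so far: ", Tracked->freed, "\n";           # prints 0
}   # $second goes out of scope, the count reaches 0, DESTROY runs

print "Freed after the block: ", Tracked->freed, "\n";      # prints 1
```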

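The problem with reference counts is that Perl enables references to be assigned programmatically instead of handling them all inherently. In fact, most of Perl's object-oriented programming structure is based on the assignment of variable references. When two data structures hold references to each other, or a structure holds a reference to itself, their reference counts never reach zero, so they stay in memory even after the program can no longer reach them. This kind of caching can be considered a memory leak in a persistent context because the amount of memory used by the program increases over time as each request adds more unused objects to memory.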
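The effect is easy to reproduce. In the following sketch, the class Node and the subroutine handle_request are hypothetical; each simulated request builds a parent and a child that refer to each other, and a counter tracks how many nodes remain in memory.

```perl
use strict;
use warnings;

package Node;   # hypothetical tree node with a link back to its parent

my $live = 0;   # number of Node objects currently in memory

sub new {
    my ($class) = @_;
    $live++;
    return bless { parent => undef, children => [] }, $class;
}

sub add_child {
    my ($self, $child) = @_;
    $child->{parent} = $self;              # the child refers to the parent
    push @{ $self->{children} }, $child;   # and the parent refers to the child
    return $child;
}

sub DESTROY { $live-- }
sub live    { return $live }

package main;

sub handle_request {
    my $root = Node->new;
    $root->add_child(Node->new);
    return;   # $root goes out of scope, but the cycle keeps both nodes alive
}

handle_request() for 1 .. 3;
print "Nodes still in memory: ", Node->live, "\n";   # prints 6
```

In a CGI program the leftover nodes vanish when the process exits; in a persistent environment they accumulate with every request.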

Most Perl modules do not create circular references, though, and many of those that do clean up after themselves. Thus, circular references are not a reason to avoid object-oriented modules as a matter of course. However, modules that create complex hierarchical data structures can be subject to intractable circular references. This is more likely to occur when the module enables individual parts of the hierarchies to be manipulated by offering subclassed objects as containers for other objects.

Example: XML::DOM Objects

One module that creates intractable circular references as a matter of course is the XML::DOM module. XML::DOM provides a legion of subclass objects that can be linked in a containment hierarchy to provide flexible access to any part of an XML document's structure. The XML::DOM::Element object, for instance, represents a tagged element in the document's hierarchy. This object can contain references to other Element objects, as well as references to the XML::DOM::Document object that the Element is part of and to the Element that is its parent. (More details on the XML::DOM module can be found in Chapter 16, "XML and Content Management.")

These connections between objects can easily become too complex for Perl to sort out automatically, which makes it difficult for Perl to determine when the object has gone out of scope. As a result, methods need to be invoked to dispose of objects manually, or the objects will stay in memory long after the request that created them has been fulfilled and closed.
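What such a disposal method does can be sketched with a small stand-in class. Tree::Node below is hypothetical and is not how XML::DOM is implemented; it only shows the general approach of walking the hierarchy and cutting every link, plus one way to make sure the method is reached even when the request fails.

```perl
use strict;
use warnings;

package Tree::Node;   # hypothetical class, not part of any real module

my $live = 0;         # number of nodes currently in memory

sub new {
    my ($class) = @_;
    $live++;
    return bless { parent => undef, children => [] }, $class;
}

sub add_child {
    my ($self, $child) = @_;
    $child->{parent} = $self;
    push @{ $self->{children} }, $child;
    return $child;
}

# Walk the hierarchy and cut every link so that no node refers to another.
sub dispose {
    my ($self) = @_;
    $_->dispose for @{ $self->{children} };
    $self->{children} = [];
    $self->{parent}   = undef;
    return;
}

sub DESTROY { $live-- }
sub live    { return $live }

package main;

sub handle_request {
    my $root = Tree::Node->new;
    $root->add_child(Tree::Node->new)->add_child(Tree::Node->new);

    # Run the real work inside eval so that dispose is reached on errors too.
    my $ok    = eval { 1 };   # placeholder for code that uses the tree
    my $error = $@;
    $root->dispose;
    die $error unless $ok;
    return;
}

handle_request() for 1 .. 3;
print "Nodes still in memory: ", Tree::Node->live, "\n";   # prints 0
```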

Listing 11.12 provides an example of XML::DOM code that could produce unwanted cached objects if the objects were not disposed of manually. It processes the XML file in Listing 11.13 and displays a simple list of nodes and their values.

Listing 11.12 DOM Object Processor Example

01 #!/usr/bin/perl
02 
03 require 5.6.0;
04 use strict;
05 use warnings;
06 
07 use CGI;
08 use XML::DOM;
09 
10 my $q = CGI->new;
11 print $q->start_html;
12 
13 my $parser = XML::DOM::Parser->new;
14 
15 my $doc = $parser->parsefile('/tmp/listing11-13.xml');
16 
17 my $root = $doc->getDocumentElement;
18 
19 foreach my $node ($root->getChildNodes)
20 {
21   next unless $node->getNodeType == 1;   # 1 is ELEMENT_NODE; skip text nodes
22 
23   my $tag_name = $node->getTagName;
24   my $value = $node->getFirstChild->getData;
25 
26   print $q->p("\u$value is a $tag_name.");
27 }
28 
29 print $q->end_html;
30 
31 $doc->dispose;

Listing 11.13 Sample XML File

01 <test>
02 <food>tofu</food>
03 <food>ham</food>
04 <food>stuff</food>
05 </test>

For our purposes, the two most important lines in Listing 11.12 are line 15 and line 31, which create and dispose of the XML::DOM document, respectively. In addition, line 13 creates an XML::DOM::Parser object to parse the XML file into a hierarchy of XML::DOM objects. Lines 17-27 use various object methods from the XML::DOM objects to retrieve and manipulate the data contained in the XML document. These lines show some of the connections made between the objects and their hierarchical relationship. The creation of the main document object in line 15 is the cause of all these objects and connections, however. If the document were not disposed of, each request would generate a new document object from the parsed XML file, and that object would stay in memory despite its last real reference ($doc) going out of scope at the end of the program. Over time, the memory used by these objects would bloat the size of the application engine to fill available memory, slowing the Web server to a crawl and eventually crashing the machine. By calling the dispose method in line 31, the program lets XML::DOM know that it should look through its own set of objects and remove any circular references to ease Perl's cleanup job.

Even though this technique relies on removing a data structure from memory at the end of every request, adding a more useful layer of controlled caching still is possible. The key is to make a master copy of the cached object when it is first processed and then provide copies of the object to be used by the actual program. Because there's always one untouched copy of the object in a separate cache, the programmer doesn't have to worry about modifications to a cached object being carried over to other requests. In addition, the master copy can be stored along with a file date so that the master can be updated if the file changes on disk. This kind of deliberate caching isn't likely to incur nearly the memory penalty of uncontrolled caching, and the resulting object copies behave consistently no matter how often they are used.
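A master-copy cache of this kind might be sketched as follows. The sketch assumes the parsed data can be deep-copied with the dclone function from the Storable module, which should be verified for any particular module's objects; the function name cached_copy, the %cache layout, and the parser callback are hypothetical.

```perl
use strict;
use warnings;
use Storable qw(dclone);

# Hypothetical cache layout: file name => { mtime => ..., master => ... }
my %cache;

# Returns a private deep copy of the parsed file. $parse is a code
# reference that turns a file name into a data structure.
sub cached_copy {
    my ($file, $parse) = @_;

    my $mtime = (stat $file)[9];
    die "Cannot stat $file: $!\n" unless defined $mtime;

    # Parse again only when the file is new or has changed on disk.
    my $entry = $cache{$file};
    if (!$entry || $entry->{mtime} != $mtime) {
        $entry = $cache{$file} = {
            mtime  => $mtime,
            master => $parse->($file),
        };
    }

    # The master is never handed out, so callers cannot modify it.
    return dclone($entry->{master});
}
```

A request would call cached_copy with the file name and a parser callback, then treat the result as its own. If the copied structure contains circular references, each copy still has to be disposed of at the end of the request; only the master is meant to stay in memory.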

Summary

Adapting Perl Web applications for a persistent environment can raise some daunting issues, but persistence problems fall into a few well-defined classes. In addition, the problems usually can be overcome with a combination of good Perl programming style and reliance on the built-in tools Perl provides for checking code for problems at runtime. Common persistence problems include variable scoping issues, which usually can be dealt with by using the my keyword correctly and watching for warnings. Performance problems can occur when external processes cause additional overhead, but these situations can be avoided by using native Perl modules and functions for system interaction. Unwanted caching due to the persistent environment can present its own problems, but the solutions usually can be found within the same modules causing the trouble. One thing is true no matter what the problem or solution: the benefits of persistence are worth the effort.


The `my` Keyword