Saturday, March 21, 2009

Full-Text Search using Oracle Text

This post continues my previous post on full-text search, in which I discussed MySQL's built-in full-text search engine and external open source full-text search engines as options for integrating full-text search features into Java applications. This time I want to share information on building full-text search applications with Oracle Text.

Oracle Text:
Oracle Text is a powerful search technology built into all Oracle Database editions, including the free Express Edition (XE). The development APIs provided by Oracle Text allow software developers to easily implement full-featured content search applications.

Oracle Text is suitable for a wide variety of search-related use cases and storage structures. Application areas for Text include e-business, document and records management as well as issue tracking just to name a few. Retrievable text can reside in a structured form inside the database or in unstructured form either in a local file system or on the Web.

Oracle Text can be used to search structured and unstructured documents, complementing SQL wildcard matching. Oracle Text provides a complete SQL-based search API that consists of custom query operators, DDL syntax extensions, a set of PL/SQL procedures, and database views. The Text API gives the application developer full control over indexing, queries, security, presentation, and software configuration when that is required. Oracle Text is also programming-language agnostic and works equally well for PHP and Java applications.

Setting Up Oracle Text:

Oracle Text is installed with an Oracle Database XE installation by default. With other database editions, you need to install the Oracle Text feature yourself. Once the feature is present, you only need to create a normal database user and grant the CTXAPP role to that user. This role allows the user to execute certain index management procedures.

Indexing Process and Searching

Oracle Text indexes retrievable data items before users are able to find content with search. This is a common approach used to ensure adequate search performance. The Oracle Text indexing process is modeled after a pipeline, where data items retrieved from a data store pass through a series of transformations before their keywords are added to the index. The indexing process is split into multiple phases, where each phase is handled by a separate entity and configurable by the application developer.

Oracle Text has different index types that are suitable for different purposes. For full-text search with large documents, the CONTEXT index is the appropriate index type. The indexing process includes the following phases:

  1. Data Retrieval: Data is simply fetched from a data store, for example, a Web page, database large object, or local file system, and passed as a stream of data to the next phase.
  2. Filtering: The filters are responsible for converting data in different file formats to plain text. The other components in the indexing pipeline only process plain text data and don't know about file formats such as Microsoft Word or Excel.
  3. Sectioning: The sectioner adds metadata about the structure of the original data item.
  4. Lexing: A stream of characters is split into words based on the language of the item.
  5. Indexing: In this final phase, the keywords are added to the actual index.
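
The phases above can be pictured with a small toy model. The sketch below is only an illustration of the pipeline idea and is not how Oracle Text is implemented: the class name ToyIndexPipeline and its methods are made up for this example, the "filter" merely strips HTML-like tags, and the sectioning phase is left out for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy model of a text-indexing pipeline; not Oracle Text's implementation.
class ToyIndexPipeline {
    // keyword -> ids of the documents that contain it (an inverted index)
    private final Map<String, Set<Integer>> index = new TreeMap<>();

    // Filtering: reduce "formatted" input to plain text (here: strip tags).
    static String filter(String raw) {
        return raw.replaceAll("<[^>]*>", " ");
    }

    // Lexing: split plain text into lower-cased word tokens.
    static List<String> lex(String plainText) {
        List<String> tokens = new ArrayList<>();
        for (String t : plainText.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Indexing: record every token of the document in the inverted index.
    void add(int docId, String raw) {
        for (String token : lex(filter(raw))) {
            index.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
        }
    }

    // Lookup: ids of the documents that contain the given word.
    Set<Integer> search(String word) {
        return index.getOrDefault(word.toLowerCase(Locale.ROOT), new TreeSet<>());
    }

    public static void main(String[] args) {
        ToyIndexPipeline pipeline = new ToyIndexPipeline();
        pipeline.add(1, "<p>Oracle Text indexes documents</p>");
        pipeline.add(2, "<b>Full-text</b> search with Oracle");
        System.out.println(pipeline.search("oracle")); // prints [1, 2]
        System.out.println(pipeline.search("search")); // prints [2]
    }
}
```

Because the keywords are collected up front, a search becomes a lookup in the index rather than a scan over every document, which is the performance argument made above.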

Once the index has been built, an application can use plain SQL queries to execute a search entered by an end user.

Searching

The CONTAINS operator is used for searching CONTEXT indexes.
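
As a rough sketch of how this might look from a Java application over JDBC: the table DOCS with columns ID and BODY and the index name DOCS_BODY_IDX are hypothetical names chosen for illustration, and the code assumes a CONTEXT index has already been created on the BODY column.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

// Sketch only: table DOCS(ID, BODY) and index DOCS_BODY_IDX are hypothetical.
class OracleTextSearchSketch {
    // DDL that creates a CONTEXT index on the text column.
    static final String CREATE_INDEX_SQL =
        "CREATE INDEX docs_body_idx ON docs(body) INDEXTYPE IS CTXSYS.CONTEXT";

    // CONTAINS returns a relevance score; a value above 0 means the row matched.
    // The label 1 ties SCORE(1) to this CONTAINS call.
    static final String SEARCH_SQL =
        "SELECT id, SCORE(1) AS relevance FROM docs "
        + "WHERE CONTAINS(body, ?, 1) > 0 ORDER BY SCORE(1) DESC";

    // Runs the search with the user's query bound as a parameter.
    static List<Long> search(Connection conn, String query) throws SQLException {
        List<Long> ids = new ArrayList<>();
        try (PreparedStatement ps = conn.prepareStatement(SEARCH_SQL)) {
            ps.setString(1, query);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    ids.add(rs.getLong("id"));
                }
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        // Without a database at hand, just show the statements involved.
        System.out.println(CREATE_INDEX_SQL);
        System.out.println(SEARCH_SQL);
    }
}
```

Binding the search string as a parameter, rather than concatenating it into the SQL, keeps user input out of the statement text.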

Index Maintenance

Because base table data is replicated by the index, the data needs to be periodically synchronized to the index. Index maintenance procedures can be found in the CTX_DDL PL/SQL package.
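
A minimal sketch of triggering a synchronization from Java, assuming a JDBC connection for a user with the CTXAPP role. CTX_DDL.SYNC_INDEX is the procedure from the package mentioned above; the index name passed in is whatever name the application chose, and how often to call this (a scheduled job, after batch loads, and so on) is left open here.

```java
import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.SQLException;

// Sketch only: the caller supplies the connection and the (hypothetical) index name.
class OracleTextSyncSketch {
    // JDBC escape syntax for calling the PL/SQL procedure with one argument.
    static final String SYNC_CALL = "{call CTX_DDL.SYNC_INDEX(?)}";

    // Asks Oracle Text to bring the index up to date with pending base table changes.
    static void syncIndex(Connection conn, String indexName) throws SQLException {
        try (CallableStatement cs = conn.prepareCall(SYNC_CALL)) {
            cs.setString(1, indexName);
            cs.execute();
        }
    }

    public static void main(String[] args) {
        // Without a database at hand, just show the call that would be issued.
        System.out.println(SYNC_CALL);
    }
}
```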

Summary

Oracle Text allows users to create a full-text index on a single column or multiple columns in a single table, as well as across multiple tables in a database. Details on creating the index, searching, and index maintenance are discussed comprehensively in the OTN Developer article on full text indexing.

References:

1. OTN Developer article on full text indexing

2. OTN Discussion Forum - Topic on multi-table indexing

3. Thread on full-text indexing

Sunday, March 08, 2009

Java frameworks for automating unit testing

What Is Unit Testing?

In computer programming, unit testing is a software design and development method where the programmer gains confidence that individual units of source code are fit for use. A unit test exercises a "unit" of production code in isolation from the full system and checks that the results are as expected. A unit is the smallest testable part of an application. The size of the unit may vary between a class and a package. In the case of Java, a unit usually refers to a single class.

The goal of unit testing is to isolate each part of the program and show that the individual parts are correct. A unit test provides a strict, written contract that the piece of code must satisfy. When unit testing is implemented the right way, it helps programmers become more productive while at the same time increasing the quality of the developed code. It's important to realize that unit testing should be part of the development process, and that code must be designed so it can be tested. In fact, the trend today is to write the unit test code before the code to be tested, to put the focus on the interface and behavior of your Java classes.

Test Driven Development

The importance of unit testing has increased with the advent of extreme programming, a lightweight software methodology proposed to reduce the cost of software development. In this approach, a developer should write unit tests even before writing the actual code.

Whether you will follow a test-driven development approach or not, it's your responsibility, as a developer, to write unit tests to check your own code. Nowadays, unit testing has become an integral part of any professional Java developer toolkit.

In Java, you will need to test every Java class you implement. The question is: which should I write first, the code or the test? You could start by implementing your class and getting it compiled before writing the necessary unit tests. However, a better approach, called the Test-Code-Simplify cycle, suggests following these steps:
  • Write a single test.
  • Compile it. It shouldn't compile because you haven't written the implementation code it calls.
  • Implement just enough code to get the test to compile.
  • Run the test and see it fail.
  • Implement just enough code to get the test to pass.
  • Run the test and see it pass.
  • Refactor for clarity.
  • Repeat.
By following this approach, unit testing will not only help you write robust code with fewer bugs, it will also help you improve your design and reduce the cost and fear of changing working code. Moreover, it will make your development faster and will act as concise, executable documentation of your code.

Benefits of Unit testing

1. Unit tests help to find problems early in the development cycle.
2. Unit testing helps to simplify integration by testing the parts of a program first and then testing the sum of its parts, thus integration testing becomes much easier.
3. Facilitates change by allowing the programmer to refactor code at a later date, and make sure the module still works correctly.
4. Unit testing provides a sort of living documentation of the system. Developers looking to learn what functionality is provided by a unit and how to use it can look at the unit tests to gain a basic understanding of the unit API.
5. When software is developed using a test-driven approach, the unit tests may take the place of formal design. Each unit test can be seen as a design element specifying classes, methods, and observable behavior.

Automating unit tests

Unit testing is commonly automated, but may still be performed manually. A manual approach to unit testing may employ a step-by-step instructional document. Nevertheless, the objective in unit testing is to isolate a unit and validate its correctness. Automation is efficient for achieving this.

Using an automation framework, the developer codes criteria into the test to verify the correctness of the unit. During execution of the test cases, the framework logs those that fail any criterion. Many frameworks will also automatically flag and report in a summary these failed test cases. Depending upon the severity of a failure, the framework may halt subsequent testing.
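
To make this concrete, here is a deliberately tiny runner written for this post. It is not JUnit or any other real framework; the names MiniTestRunner, register, runAll and check are made up for illustration. It shows the three ideas from the paragraph above: criteria coded into the test, failures logged as they occur, and a summary at the end.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy test runner for illustration; real frameworks offer far more.
class MiniTestRunner {
    // Registered tests, kept in registration order.
    private final Map<String, Runnable> tests = new LinkedHashMap<>();

    void register(String name, Runnable body) {
        tests.put(name, body);
    }

    // The criterion a test codes in: fail loudly when the condition is false.
    static void check(boolean condition, String message) {
        if (!condition) {
            throw new AssertionError(message);
        }
    }

    // Runs every test, logs each failure, prints a summary,
    // and returns the names of the failed tests.
    List<String> runAll() {
        List<String> failed = new ArrayList<>();
        for (Map.Entry<String, Runnable> entry : tests.entrySet()) {
            try {
                entry.getValue().run();
            } catch (RuntimeException | AssertionError ex) {
                System.out.println("FAIL " + entry.getKey() + ": " + ex.getMessage());
                failed.add(entry.getKey());
            }
        }
        System.out.println((tests.size() - failed.size()) + " passed, "
            + failed.size() + " failed");
        return failed;
    }

    public static void main(String[] args) {
        MiniTestRunner runner = new MiniTestRunner();
        runner.register("addition", () -> check(1 + 1 == 2, "1 + 1 should be 2"));
        runner.register("deliberate failure", () -> check("a".isEmpty(), "\"a\" is not empty"));
        runner.runAll();
    }
}
```

This runner keeps going after a failure; a framework that halts on severe failures would simply stop the loop at that point.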

As a consequence, unit testing is traditionally a motivator for programmers to create decoupled and cohesive code bodies. This practice promotes healthy habits in software development. Design patterns, unit testing, and refactoring often work together so that the most ideal solution may emerge.

Under the automated approach, to fully realize the effect of isolation, the unit or code body subjected to the unit test is executed within a framework outside of its natural environment, that is, outside of the product or calling context for which it was originally created. Testing in an isolated manner has the benefit of revealing unnecessary dependencies between the code being tested and other units or data spaces in the product. These dependencies can then be eliminated.

Benefits of an Automated Unit Test Suite

First, unit tests find problems early in the development cycle. An automated unit test suite finds problems effectively as early as possible, long before the software reaches a customer, and even before it reaches the QA team. Most of the problems in new code are already uncovered before the developer checks the code into source control.

Second, an automated unit test suite watches over your code in two dimensions: time and space. It watches over your code in the time dimension because once you’ve written a unit test, it guarantees that the code you wrote works now and in the future. It watches over your code in space dimension because unit tests written for other features guarantee that your new code did not break them; likewise it guarantees that code written for other features does not adversely affect the code you wrote for this feature.

Third, developers will be less afraid to change existing code. Over time, software systems become more and more change resistant because developers are reluctant to change old code. This is natural because when changing old code, there is always the risk of breaking it or some other part of the system through a side-effect. The only way to keep adding new features to software and retain the internal quality and clean design over time is by way of refactoring. Without occasional refactoring, the code grows more and more internally tangled until every class knows of every other class. Refactoring and cleaning up existing code is a really scary thing - unless you have automated unit tests.

Fourth, the development process becomes more flexible. Sometimes it may be necessary to fix a problem and to deploy the fix quickly. Despite best efforts, a bug may slip in and an important feature may stop working. The customers cannot purchase products, the users cannot work and your boss is breathing over your shoulder asking you to fix the problem immediately. Releasing quick fixes makes us feel uneasy because we are not certain what side-effects the changes might have. Running the unit tests with the fixes applied saves the day as they should reveal undesirable side-effects. Publishing hotfixes is something we hope we never have to do, and a unit test suite should already decrease the need for such things anyway. But if you ever have to publish a hotfix, a unit test suite improves your chances of doing so without introducing new problems.

Fifth, having a unit test suite improves your project’s truck factor. Truck factor is the minimum number of developers that if hit by a truck the project would grind to a halt. A comprehensive unit test suite improves truck factor because it makes it easier for a developer to take over a piece of code she is not intimately familiar with. A developer can start working on code written by others because the unit tests will guide the developer by pointing it out if she makes an error. Losing a key developer for any reason just before a release is less of a catastrophe if you have the safety net of unit tests.

Sixth, an automated unit test suite reduces the need for manual testing. Some manual testing will always be needed because humans excel at discovering bugs that involve complex data and workflow processes. Writing a unit test for the most complex cases might be so prohibitively time consuming that it is not cost effective any more. The QA team can concentrate on discovering the hard-to-find bugs while the unit tests do most of the mundane testing.

The net effect of the benefits listed above is that software development will become more predictable and repeatable – in a word, a bit more like a real engineering discipline. Once the coding is done, the build process builds and tests the software much like physical products are built on an assembly line. This removes much of the ad-hoc nature in software development which is the underlying reason for many of the problems that plague software projects.

Unit testing frameworks

Unit testing frameworks, which help simplify the process of unit testing, have been developed for a wide variety of languages. It is generally possible to perform unit testing without the support of a specific framework by writing client code that exercises the units under test and uses assertion, exception, or early exit mechanisms to signal failure. This approach is valuable because adopting a framework is itself a non-negligible barrier to getting started with unit testing. However, it is also limited in that many advanced features of a proper framework are missing or must be hand-coded.

Some of the popular Open Source Testing Frameworks available for Java are listed below:

JUnit
JUnit is a regression testing framework written by Erich Gamma and Kent Beck. It is used by the developer who implements unit tests in Java.

JUnit-addons
JUnit-addons is a collection of helper classes for JUnit.

JUnitDoclet
JUnitDoclet lowers the barrier to getting started with JUnit. It generates skeletons of TestCases based on your application source code, and it helps you reorganize tests during refactoring.

JUnitEE
JUnitEE is a simple extension to JUnit which allows standard test cases to be run from within a J2EE application server. It is composed primarily of a servlet which outputs the test results as html.

JUnitPerf
JUnitPerf is a collection of JUnit test decorators used to measure the performance and scalability of functionality contained within existing JUnit tests.

DbUnit
DbUnit is a JUnit extension (also usable with Ant) targeted for database-driven projects that, among other things, puts your database into a known state between test runs. This is an excellent way to avoid the myriad of problems that can occur when one test case corrupts the database and causes subsequent tests to fail or exacerbate the damage.

EasyMock
EasyMock provides Mock Objects for interfaces in JUnit tests by generating them on the fly using Java's proxy mechanism. Due to EasyMock's unique style of recording expectations, most refactorings will not affect the Mock Objects. So EasyMock is a perfect fit for Test-Driven Development.
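
The "proxy mechanism" mentioned here is java.lang.reflect.Proxy from the JDK. The sketch below shows that underlying idea only; it is not EasyMock's API. MailSender, Notifier and the mock() helper are hypothetical names invented for this example.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration of generating a mock on the fly with a JDK dynamic proxy.
class ProxyMockSketch {
    // Hypothetical collaborator that the unit under test depends on.
    interface MailSender {
        boolean send(String to, String body);
    }

    // Hypothetical unit under test.
    static class Notifier {
        private final MailSender sender;

        Notifier(MailSender sender) {
            this.sender = sender;
        }

        boolean notifyUser(String user) {
            return sender.send(user + "@example.com", "Build finished");
        }
    }

    // Creates a mock that records every call and always returns cannedResult.
    // Simplification: methods such as toString() are intercepted too.
    @SuppressWarnings("unchecked")
    static <T> T mock(Class<T> type, List<String> calls, Object cannedResult) {
        InvocationHandler handler = (proxy, method, args) -> {
            calls.add(method.getName() + Arrays.toString(args));
            return cannedResult;
        };
        return (T) Proxy.newProxyInstance(
            type.getClassLoader(), new Class<?>[] {type}, handler);
    }

    public static void main(String[] args) {
        List<String> calls = new ArrayList<>();
        MailSender sender = mock(MailSender.class, calls, Boolean.TRUE);
        boolean delivered = new Notifier(sender).notifyUser("bob");
        System.out.println(delivered); // prints true
        System.out.println(calls);     // prints [send[bob@example.com, Build finished]]
    }
}
```

The test can then assert on the recorded calls without a real mail server, which is what lets the unit run in isolation.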

MockObjects
Mock Objects is a test-first development process for building object-oriented software and a generic unit testing framework that supports that process. The project's first implementation is in Java, largely because that is what its authors had been working in, but also because Java has a stable set of APIs that are suitable for writing Mock Objects.

Mockrunner
Mockrunner is a lightweight framework for unit testing applications in the J2EE environment.

Unitils
Unitils is an open source library aimed at making unit testing easy and maintainable. Unitils builds on existing libraries like DbUnit and EasyMock and integrates with JUnit and TestNG. Unitils provides general assertion utilities, support for database testing, and support for testing with mock objects, and it offers integration with Spring and Hibernate.

TestNG
TestNG is a testing framework inspired by JUnit and NUnit that introduces some new functionality to make it more powerful and easier to use.

XMLUnit
XMLUnit for Java provides two JUnit extension classes, XMLAssert and XMLTestCase, that allow assertions to be made about:

  • The differences between two pieces of XML
  • The outcome of transforming a piece of XML using XSLT
  • The evaluation of an XPath expression on a piece of XML
  • The validity of a piece of XML
  • Individual nodes in a piece of XML that are exposed by DOM Traversal

XMLUnit for Java can also treat HTML content (even badly-formed HTML) as valid XML to allow these assertions to be made about the content of web pages too.

XHTMLUnit
XHTMLUnit builds on JUnit and XMLUnit in order to provide testing and validation of generated XHTML code.

Cactus
Cactus is a simple test framework for unit testing server-side Java code (Servlets, EJBs, Tag Libs, Filters, ...). The intent of Cactus is to lower the cost of writing tests for server-side code. It uses JUnit and extends it.

HttpUnit
HttpUnit is a framework based on JUnit, which allows the implementation of automated test scripts for Web applications. It is best suited for the implementation of automated functional tests, or acceptance tests.

The following are some of the Open Source tools that support unit testing:

Cobertura
Cobertura is a free Java tool that calculates the percentage of code accessed by tests. It can be used to identify which parts of your Java program are lacking test coverage. It is based on jcoverage.

Emma
Emma is an open-source toolkit for measuring and reporting Java code coverage.

Findbugs
Findbugs is a static analysis tool to find bugs in Java programs.

Limitations of unit testing

Testing cannot be expected to catch every error in the program: it is impossible to evaluate all execution paths for all but the most trivial programs. The same is true for unit testing. Additionally, by definition unit testing only tests the functionality of the units themselves. Therefore it will not catch integration errors or broader system-level errors (such as functions performed across multiple units, or non-functional test areas such as performance). Unit testing is more effective if it is used in conjunction with other software testing activities. Like all forms of software testing, unit tests can only show the presence of errors; they cannot show the absence of errors.

Software testing is a combinatorial problem. For example, every boolean decision statement requires at least two tests: one with an outcome of "true" and one with an outcome of "false". As a result, for every line of code written, programmers often need 3 to 5 lines of test code.

To obtain the intended benefits from unit testing, a rigorous sense of discipline is needed throughout the software development process. It is essential to keep careful records not only of the tests that have been performed, but also of all changes that have been made to the source code of this or any other unit in the software. Use of a version control system is essential. If a later version of the unit fails a particular test that it had previously passed, the version-control software can provide a list of the source code changes (if any) that have been applied to the unit since that time.

It is also essential to implement a sustainable process for ensuring that test case failures are reviewed daily and addressed immediately. If such a process is not implemented and ingrained into the team's workflow, the application will evolve out of sync with the unit test suite - increasing false positives and reducing the effectiveness of the test suite.

Summary

The recipe for better software with fewer people is simple: unit test early, unit test often, and refactor when needed. A comprehensive unit test suite that runs together with the daily build is the heartbeat of a software project. It gives a sense of progress and stability. Email notifications of successful builds with unit tests can boost project morale. An email notification of a unit test failure, on the other hand, tells all project developers where they should focus their attention in order to get the problem fixed.

Despite all the benefits that unit tests bring, some amount of manual ad-hoc testing is still needed. The developer needs to run the application and use her best judgment to see if the code really does what it is supposed to. A dedicated QA team is also needed in bigger projects. But with a carefully written unit test suite, the software is self-testing and the need for manual testing and separate QA personnel will be reduced. The benefits of automatic unit tests do outweigh the extra time and effort in writing and maintaining the tests.

Sources:

  • http://java-source.net/open-source/testing-tools
  • http://en.wikipedia.org/wiki/Unit_testing
  • http://www.codeproject.com/KB/architecture/onunittesting.aspx

Thursday, November 20, 2008

Flex Profiling

I am back with my boring technical posts again! This time I have tried to collect and organize some helpful material for anyone looking into Flex Profiling, because the noise coming out from Google searches usually ends up with information of no worth.

Profiling Flash Applications with Flex Builder 3

The Flex Profiler is a new addition to Flex Builder 3 and is a powerful tool that enables you to watch an application as it allocates and clears memory and objects. It connects to your application with a local socket connection.

As the Profiler runs, it takes a snapshot of data every few milliseconds and records the state of the Flash Player at that snapshot, a process referred to as sampling. By parsing the data from sampling, the Profiler can show every operation in your application. The Profiler records the execution time of those operations, as well as the total memory usage of objects in the Flash Player at the time of the snapshot.

The following links provide step-by-step information on how to start profiling applications using Flex Builder 3.
  • http://x-geom.net/blog/?p=48
  • http://blogs.adobe.com/aharui/profiler/ProfilerScenarios.swf
  • http://www.insideria.com/2008/06/profiling-flex-applications-sa.html
  • http://labs.adobe.com/wiki/index.php/Flex_3:Feature_Introductions:_Performance_and_Memory_Profiling
  • http://livedocs.adobe.com/flex/3/html/help.html?content=profiler_4.html
Flex Builder 3 lets you profile both the memory and the performance of Flex applications. The following information should help you get started quickly with both.

1. Flex Memory Profiling

Memory profiling involves examining the memory used—as well as the memory currently in use—by objects in your application. Those objects could be simple classes, such as Strings, or complex visual objects, such as DataGrids. Using memory profiling, you can determine whether an appropriate number of objects exist and whether those objects are using an appropriate amount of memory.

Understanding Flex Memory Management and VM Garbage Collection

Flash Player Memory Allocation

Flash Player is responsible for providing memory for your Flex application at runtime. When you execute a line of code that creates a new instance of the DataGrid class, Flash Player provides a piece of memory for that instance to occupy. Flash Player in turn needs to ask your computer’s operating system for memory to use for this purpose.

The process of asking the operating system for memory is slow, so Flash Player asks for much larger blocks than it needs, and keeps the extra available for the next time the application requests more space. Additionally, Flash Player watches for memory that’s no longer in use, so that it can be reused before asking the operating system for more.

Flash Player Garbage Collection

Garbage collection is a process that reclaims memory no longer in use, so that it can be reused by the application—or, in some cases, given back to the operating system. Garbage collection happens automatically at allocation, which can be confusing to new developers. This means that garbage collection doesn’t occur when memory is no longer in use, but rather when the application asks for more memory. At that point, the process responsible for garbage collection, called the garbage collector, attempts to reclaim available memory for reallocation.

The garbage collector follows a two-part procedure to determine which portions of memory are no longer in use:

1. Reference counting
2. Mark and sweep
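
To show why two passes are useful, here is a toy model written in Java (the language of the other examples on this blog). It is only a sketch of the general technique and makes no claim about how Flash Player implements its collector; the class ToyGcSketch and its methods are invented for this illustration. Reference counting alone misses objects that only reference each other, while mark and sweep finds everything that is unreachable from the roots.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Toy model of the two passes; not Flash Player's actual collector.
class ToyGcSketch {
    static class Obj {
        final String name;
        final List<Obj> refs = new ArrayList<>(); // outgoing references
        int refCount;                              // number of incoming references

        Obj(String name) {
            this.name = name;
        }
    }

    // Adds a reference from one object to another and counts it.
    static void link(Obj from, Obj to) {
        from.refs.add(to);
        to.refCount++;
    }

    // Pass 1: a non-root object that nothing references is garbage.
    static Set<String> collectByRefCount(List<Obj> heap, List<Obj> roots) {
        Set<String> garbage = new TreeSet<>();
        for (Obj o : heap) {
            if (o.refCount == 0 && !roots.contains(o)) {
                garbage.add(o.name);
            }
        }
        return garbage;
    }

    // Pass 2: mark everything reachable from the roots, then sweep the rest.
    static Set<String> collectByMarkAndSweep(List<Obj> heap, List<Obj> roots) {
        Set<Obj> marked = new HashSet<>();
        Deque<Obj> pending = new ArrayDeque<>(roots);
        while (!pending.isEmpty()) {
            Obj o = pending.pop();
            if (marked.add(o)) {
                for (Obj next : o.refs) {
                    pending.push(next);
                }
            }
        }
        Set<String> garbage = new TreeSet<>();
        for (Obj o : heap) {
            if (!marked.contains(o)) {
                garbage.add(o.name);
            }
        }
        return garbage;
    }

    public static void main(String[] args) {
        Obj root = new Obj("root");
        Obj a = new Obj("a");
        Obj b = new Obj("b");
        Obj c = new Obj("c");
        Obj d = new Obj("d");
        link(root, a); // a is reachable from the root
        link(b, c);    // b and c reference only each other
        link(c, b);
        List<Obj> heap = List.of(root, a, b, c, d);
        List<Obj> roots = List.of(root);
        System.out.println(collectByRefCount(heap, roots));     // prints [d]
        System.out.println(collectByMarkAndSweep(heap, roots)); // prints [b, c, d]
    }
}
```

The pair b and c is the classic circular reference: each keeps the other's count at 1, so only the reachability pass can reclaim them.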

The following links will be useful for anyone looking into Flex GC and Flash Player memory design.

1. PPT from Adobe on Memory Management and GC
http://blogs.adobe.com/aharui/GarbageCollection/GCAtomic.ppt

2. Resource Management
http://www.adobe.com/devnet/flashplayer/articles/resource_management.html

3. Article on GC from Adobe Devnet
http://www.adobe.com/devnet/flashplayer/articles/garbage_collection.html

4. Sample Flex app with a straightforward memory leak
http://blogs.adobe.com/aharui/GarbageCollection/MemoryLeakTest.zip

General observations on Flex GC
  • GC is invoked during memory allocation, not asynchronously on a background thread as in the Java JVM
  • IE minimize/maximize/restore operations seem to trigger the GC, releasing the memory
Flex memory tuning tips

1. Memory Leaks caused by Event Listeners

Always remove event listeners that are no longer needed. Each time an event listener is added to an object, it increases the object's reference count, so the reference remains until the event listener is removed. If for some reason you cannot remove the event listener, set the useWeakReference parameter of addEventListener() to true; a weak reference does not increase the reference count.

Problem:

When you call addEventListener() on the TextInput instance, it responds by adding a reference to the object (the one that contains the handleTextChanged method) to a list of objects that need to be notified when this event occurs. When it's time to broadcast the change event, the TextInput instance loops through this list and notifies each object that registered as a listener. In terms of garbage collection, this means that, in certain circumstances, if an object is listening for events it may never be available for garbage collection.

The following example shows a simple case:

var textInput:TextInput = new TextInput();
textInput.addEventListener('change', handleTextChanged);

Solution:

When adding an event listener to a broadcaster, the developer can specify that the event listener should use weak references. This is accomplished by specifying extra parameters for the addEventListener() method:

var textInput:TextInput = new TextInput();
textInput.addEventListener('change', handleTextChanged, false, 0, true);

2. Using Item Renderer

Use item renderers judiciously. An item renderer derived from a Container class comes with a lot of unnecessary overhead. Instead, use a simpler class, such as an ActionScript class derived from UIComponent. This reduces much of that overhead.

3. Using Images

Use images that are small in size, and when there are many of them, prefer not to embed them in the application. As far as image formats are concerned, PNG images are much faster than other image types.

Use BitmapData as much as possible. Use dispose() method of BitmapData to free memory that is used to store the BitmapData object.

4. Bindings

Use Binding only when necessary. Data binding expressions usually take up memory. Prefer assignments to Binding whenever possible.

5. Variables

Accessing local variables is much faster. If a variable needs to be accessed frequently, copy it into a local variable: locals are stored on the stack, and accessing them is much quicker.

You could help the garbage collector by assigning unused variables to null.

6. Instance Creation

For components, use deferred instantiation. This can immensely reduce the startup time. Be wary of creationPolicy="all". Try to avoid removeChild() / addChild() when it would work just as well to reuse an object or just toggle the visible property.

7. Containers

Minimize the use of containers. Try not to nest HBoxes within VBoxes and so on. Nested containers add up to a huge overhead.

8. Type Conversions

Use types for the variables. Avoid implicit type conversions and when unsure of the type use the "as" operator.

9. Repeaters

Repeaters have a property called recycleChildren. Set it to true. When set to true, the repeater reuses the children it has already created instead of creating new ones.

10. Dictionary

Use weak references in the Dictionary object.

11. Modules

Don't unnecessarily load/unload modules. If you need to unload a module, make sure to remove all references pointing to it. In particular if you have an event listener from within the module to something outside the module, that can prevent the module's memory from being reclaimed.

12. NativeWindow

After making a NativeWindow you must call close() before it can be GC'ed. But you must remove references before you can call close(). Also when you open a FileStream object in asynchronous mode, pending event listeners can prevent the FileStream object from being GC'ed.

Sources:
  • http://www.peachpit.com/articles/article.aspx?p=1182473
  • http://www.adobe.com/devnet/flashplayer/articles/resource_management.html
2. Flex Performance Profiling

Performance profiling is used to find aspects of the Flex application that are unresponsive, or where performance can be improved. When profiling for performance, generally one should be looking for methods that are executed very frequently, or methods that take a long time whenever they’re executed. The combination of those two factors usually provides a good indication of where your time should be spent in optimizing or potentially refactoring.

Using the Flex Builder 3 Profiler, one can identify the slowest portions of the application and optimize them. The profiler lets you take performance snapshots that record how much time was spent in each function. This is useful for identifying the areas of code that might benefit from optimization. While the profiler is running, everything is much slower. Mouse or similar events will often seem to take a lot of time; you can ignore these. Investigate your own functions and see whether they are called too often or take too long.

Flex Performance Tuning Tips

Flex Speed Tips: Matt Chotin has shared some great tips to improve Flex performance. Following are a few of them:

* If you are not sure of a value's type in AS3, always use the as operator to cast it before you use it. This avoids VM errors handled with try/catch, which slows execution and is about ten times slower than the as operator.

* Array access is slow if the array is sparse. It may be faster to put nulls in the empty slots, as this speeds things up. Array misses are very slow, up to 20 times slower than finding a valid entry.

* Avoid implicit type conversion. In the player it will convert integers to numbers and back when asked to add integers. You might as well use numbers for everything and convert back to integer at the end.

* Local variable access is faster, so assign variables to local if they are accessed a lot. They will be stored on the stack and access is much quicker.

* Data Binding expressions take up memory and can slow down the application startup. It may be more efficient to do an assignment in code rather than using binding.

* Find a slow computer and run your application. If it runs OK, ship it! Otherwise you can use flash.utils.getTimer():int to get a time value in milliseconds before and after some process to time it.

More information on tuning ActionScript can be found in Matt Chotin's article
  • http://www.adobe.com/devnet/flex/articles/as3_tuning.html

Sources:
  • http://www.peachpit.com/articles/article.aspx?p=1182473&seqNum=3
  • http://www.adobe.com/devnet/flex/articles/as3_tuning.html

Thursday, November 02, 2006

My experiences with apache commons.net library

This time I want to share my experiences (both good and bad) with the wonderful Java network library "Apache Commons Net". It all started with a requirement in my Java project to programmatically transfer MySQL backup files to a remote system over FTP at regular intervals. In my quest to find a suitable open source Java API for FTP operations, I came across the Jakarta Commons Net Project. This project implements the client side Java classes of many basic Internet protocols like FTP, Telnet, SMTP, POP3, and NNTP.

It was easy to get started with the library since the api docs for the main class FTPClient included a couple of examples; the article posted at informit was also very useful. I started writing some working (not so!!!) code using the FTP client class. Till this point, it was all well and good and I was very happy with my progress. But I didn't know that learning to use a network library is not as easy as it seems.

The problem with using a network library is that one needs a good knowledge of the underlying protocol before starting to use the classes. The main reason I’m writing this article is that I ran into a few issues while trying to get the code working, and I don't want my readers to go through those not so comfortable experiences again. Here is the sample code I used to perform an FTP upload:

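import java.io.File;
import java.io.FileInputStream;
import org.apache.commons.net.ftp.FTP;
import org.apache.commons.net.ftp.FTPClient;

FTPClient ftp = new FTPClient();
try
{
    // Open the control connection and authenticate
    ftp.connect("domain/IP address");
    ftp.login("username", "password");
    ftp.changeWorkingDirectory("/root");

    // Switch from the default ASCII type to binary before sending
    ftp.setFileType(FTP.BINARY_FILE_TYPE);

    // Upload the local file under the same name
    File file = new File("something.exe");
    FileInputStream in = new FileInputStream(file);
    ftp.storeFile("something.exe", in);
    in.close();

    ftp.logout();
    ftp.disconnect();
}
catch (Exception ex)
{
    ex.printStackTrace();
}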

The first problem I encountered was that after the FTP copy operation, the copied files differed in size from the original files. In other words, the copied files had become corrupt during the FTP operation. The problem was the file type setting for the FTP transfer. The default file type is ASCII, and when I transferred non-ASCII (binary) files with this setting they became corrupt. After going through a lot of mailing lists and API docs, I was finally able to solve the problem by setting the file type to binary using the setFileType() method of the FTPClient class with the argument BINARY_FILE_TYPE (other file types include ASCII_FILE_TYPE, EBCDIC_FILE_TYPE, IMAGE_FILE_TYPE, LOCAL_FILE_TYPE).
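To see why a text-mode transfer can change the size of a binary file, here is a small simulation of the kind of newline translation that ASCII mode applies (bare LF bytes rewritten as CR LF). It uses only the JDK; the class and method names are hypothetical and this is not Commons Net code, just a sketch of the effect.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical illustration: simulates text-mode newline translation.
class AsciiModeDemo {

    // Rewrites every bare LF (0x0A) as CR LF (0x0D 0x0A), as a text-mode
    // transfer would; an LF that already follows a CR is left alone.
    static byte[] toNetAscii(byte[] data) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int previous = -1;
        for (byte b : data) {
            if (b == '\n' && previous != '\r') {
                out.write('\r');
            }
            out.write(b);
            previous = b;
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Arbitrary "binary" bytes that happen to contain two 0x0A values.
        byte[] original = {0x4D, 0x5A, 0x0A, 0x00, 0x0A, (byte) 0xFF};
        byte[] transferred = toNetAscii(original);
        System.out.println("original size:    " + original.length);
        System.out.println("transferred size: " + transferred.length);
    }
}
```

Any byte in an executable or archive that happens to equal 0x0A gets an extra byte inserted before it, which matches the size difference described above. In binary mode the bytes are sent untouched.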

The next problem I faced was due to my ignorance of the active and passive connection modes of FTP clients, and it was an unusual one. The storeFile() method of the FTPClient class worked fine when I uploaded files to the FTP server from my local machine, but didn’t work once I deployed the code to the production server running on Linux. When I tried to send a file using storeFile() from the Linux machine, the call simply hung and never returned control to the application. The file copy also seemed to stop after creating a 0-byte file on the server. After spending a lot of time trying to come up with reasons for this abnormal behavior (and failing miserably), I searched for the problem on the internet. Thanks to Aaron Johnson’s article on active and passive FTP connections, I was finally able to find a solution.

In short, active mode FTP means that your client connects to the FTP server on the command port and then the FTP server attempts to make a connection back to your client for the data port, which isn’t going to work in a completely locked down environment. Changing to passive mode means that the client opens the connections to the server for both the command port and the data port.

The FTPClient class, as a subclass of SocketClient, has an isConnected() method which, when invoked, returned true. So I could log in, change the working directory and disconnect with no errors thrown. But as soon as storeFile() was called, the request to create the file was sent to the server and a 0-byte file was created there; then, when the FTP server attempted to connect back to the client system (which is firewall protected) for the data port, it never succeeded. Hence the call waited forever and control never returned to the application. The solution was to change the connection mode from active to passive using the enterLocalPassiveMode() method of the FTPClient class, which would have been obvious if I had known a bit more about the FTP protocol.

Thus, enabling passive mode by calling enterLocalPassiveMode() just after connect() solved my problem. This method causes a PASV command to be issued to the server before the opening of every data connection, telling the server to open a data port to which the client will connect to conduct data transfers. The FTPClient will stay in PASSIVE_LOCAL_DATA_CONNECTION_MODE until the mode is changed by calling some other method such as enterLocalActiveMode().
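For readers curious about what happens on the wire: per the FTP specification, the server answers PASV with a 227 reply of the form (h1,h2,h3,h4,p1,p2), and the client then connects to port p1 * 256 + p2 on that host. Commons Net handles this internally; the sketch below only illustrates the arithmetic. The class name and the sample reply are made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: decodes the address carried in a PASV (227) reply.
class PasvReply {
    private static final Pattern ADDRESS =
            Pattern.compile("\\((\\d+),(\\d+),(\\d+),(\\d+),(\\d+),(\\d+)\\)");

    final String host;
    final int port;

    private PasvReply(String host, int port) {
        this.host = host;
        this.port = port;
    }

    static PasvReply parse(String reply) {
        Matcher m = ADDRESS.matcher(reply);
        if (!reply.startsWith("227") || !m.find()) {
            throw new IllegalArgumentException("Not a PASV reply: " + reply);
        }
        String host = m.group(1) + "." + m.group(2) + "."
                + m.group(3) + "." + m.group(4);
        // The port is sent as two numbers: high byte first, then low byte.
        int port = Integer.parseInt(m.group(5)) * 256
                + Integer.parseInt(m.group(6));
        return new PasvReply(host, port);
    }

    public static void main(String[] args) {
        // Sample reply invented for illustration.
        PasvReply r = parse("227 Entering Passive Mode (192,168,1,10,195,80)");
        System.out.println(r.host + ":" + r.port); // 192.168.1.10:50000
    }
}
```

Because the client is the one that dials out to this address, a firewall in front of the client no longer gets in the way of the data connection.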

Thanks to the internet, and to all those people who posted about these issues online; with their help I was able to resolve the problems and benefit from this powerful library from Jakarta.

References:

Commons Mailing List Archives
Aaron Johnson’s Blog

Integrating Lucene with Spring Framework & Hibernate

While looking for integration support for Lucene with Spring Framework & Hibernate, I have come across a full-blown open source Java Search Engine Framework called Compass Framework which is built on top of the Lucene Search Engine and provides seamless integration support to popular development frameworks like Hibernate and Spring Framework.

Why do we need yet another framework for implementing search functionality?

Lucene is a low level API, which means that it can easily cause coupling problems, especially with the domain objects. Coding directly against the Lucene API throughout the application may be a performance killer and can also become a maintenance nightmare in future (as the domain model changes). Looking for other options for integrating Lucene with our Spring based application, I came across two alternatives that exist in the open-source arena:

1. Lucene Spring Modules

One option is using the "Lucene Spring Modules", which is a part of "Spring Modules project" which tries to extend the functionalities of Spring Framework to include other open-source tools. The project is intended to facilitate integration between Spring Framework and other projects without cluttering or expanding the Spring core.

2. Compass Framework

Another option is to use Compass Framework which provides a declarative way to map the domain model to the search engine. Compass provides a high level abstraction on top of the Lucene's low level API which supports a declarative mapping of domain objects. It externalizes all dependencies and coupling in a compass meta data file and thus provides a declarative technique to map the domain objects. Compass also implements fast index operations and optimization which increases the application performance.

Compass Framework provides a module named "Compass::Spring" which is intended to provide closer integration with the Spring Framework. It supports IoC using Spring's Application Context and provides support for the Hibernate Session Factory. Compass claims to support complex applications with bigger domain models easily, and to bring maintenance and performance overhead down to negligible levels. Compass comes with a sample project (the old petclinic sample with additional search functionality using Compass Framework) that demonstrates its integration support with Spring Framework & Hibernate. The product is also quite mature, with elaborate documentation. The current stable version is compass version 1.1M2.

More about Compass Framework

Compass is a first class open source Java Search Engine Framework, bringing the power of Search Engine semantics to your application stack declaratively. Compass is a powerful, transactional Object to Search Engine Mapping (OSEM) Java framework which allows you to declaratively map your Object domain model to the underlying Search Engine, synchronizing data changes between the index and different datasources. Compass provides a high level abstraction on top of the Lucene low level API. Compass also implements fast index operations and optimization, and introduces transaction capabilities to the Search Engine.

In recent versions, Compass provides a Lucene Jdbc Directory implementation, which allows a Lucene index to be stored within a database for both pure Lucene applications and Compass enabled applications. Compass also provides the SpringHibernate Gps Device (configured in the Spring context file using IoC), which uses the Compass OSEM feature (Object to Search Engine Mappings) and the Hibernate ORM feature (Object to Relational Mappings) to provide simple database indexing. All the OSEM mappings are defined in a Compass meta-data file, and the SpringHibernate Gps Device intercepts the Hibernate session factory object to index data transparently. The Gps Device also provides real time mirroring of data changes made through Hibernate, so you don't have to explicitly re-index data after a store/update/delete. The path data travels through the system is: Database -- Hibernate -- Objects -- Compass::Gps -- Compass::Core (Search Engine). Compass returns the ids of the matched objects along with a tag that identifies the class each object belongs to.

Dear readers, don’t forget to read about the origin of the Compass framework as described by its author, Shay Banon, on his blog. It is well written and I bet you will enjoy the narration!

References:

Open Symphony's Page
Shay Banon’s Blog


Wednesday, November 01, 2006

Full Text Search

In this article I have tried to evaluate some of the options for integrating full-text search features in java applications.

MySQL’s built-in Full Text Search engine

From my initial search, I found that MySQL’s built-in Text Search Engine does surprisingly effective full-text searching if the dataset is small. It also has the lowest implementation cost, since the search criteria can be specified as part of the query itself. But as the dataset grows, its efficiency becomes dependent on system resources like CPU and RAM.

Open Source Full Text Search engines

Most of the external full text search engines work by keeping a separate index of the table data, which is updated at frequent intervals (maybe with some amount of caching), so that less time is spent on the database server searching for information. This approach will certainly lessen the load on the database server.
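As a rough illustration of the "separate index" idea, here is a minimal inverted index in plain Java: each term maps to the ids of the rows that contain it, so a lookup never has to scan the table data. The class and method names are hypothetical, and real engines such as Sphinx or Lucene add tokenization rules, ranking and on-disk storage on top of this.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of a search index kept outside the database.
class TinyIndex {
    // term -> ids of the rows whose text contains that term
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Called for each row, for example by a periodic indexing job.
    void add(int rowId, String text) {
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(rowId);
            }
        }
    }

    // Looks up a single term in the index without touching the table data.
    Set<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(), Collections.emptySet());
    }

    public static void main(String[] args) {
        TinyIndex index = new TinyIndex();
        index.add(1, "Full text search with Lucene");
        index.add(2, "MySQL full-text search");
        index.add(3, "Spring and Hibernate");
        System.out.println(index.search("search")); // [1, 2]
        System.out.println(index.search("lucene")); // [1]
    }
}
```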

A complete list of popular full text search engines is available at WikiMedia site

1. Sphinx

Sphinx is a full-text search engine, distributed under GPL version 2. Generally, it's a standalone search engine, meant to provide fast, size-efficient and relevant full-text search functions to other applications. Sphinx was specially designed to integrate well with SQL databases and scripting languages. Currently built-in data source drivers support fetching data either via direct connection to MySQL, PostgreSQL, or from a pipe in a custom XML format.

2. Lucene

Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java with features like Scalable High-Performance Indexing, Powerful Accurate and Efficient Search Algorithms etc.

As a full-text search engine, Lucene needs little introduction. Lucene, an open source project hosted by Apache, aims to produce high-performance full-text indexing and search software. The Java Lucene product itself is a high-performance, high capacity, full-text search tool used by many popular Websites such as the Wikipedia online encyclopedia and TheServerSide.com, as well as in many, many Java applications. It is a fast, reliable tool that has proved its value in countless demanding production environments.

Although Lucene is well known for its full-text indexing, many developers are less aware that it can also provide powerful complementary searching, filtering, and sorting functionalities. Indeed, many searches involve combining full-text searches with filters on different fields or criteria. For example, you may want to search a database of books or articles using a full-text search, but with the possibility to limit the results to certain types of books. Traditionally, this type of criteria-based searching is in the realm of the relational database. However, Lucene offers numerous powerful features that let you efficiently combine full-text searches with criteria-based searches and sorts.
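The books example can be pictured with a few lines of plain Java. This sketch is not the Lucene API; it only shows the shape of a query that combines a keyword match with a criteria filter on another field. The Book class, its fields and the sample titles are invented for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: a keyword match combined with a field filter.
class BookSearch {
    static final class Book {
        final String title;
        final String type;

        Book(String title, String type) {
            this.title = title;
            this.type = type;
        }
    }

    // Returns the titles that contain the keyword AND have the requested type.
    static List<String> search(List<Book> books, String keyword, String type) {
        List<String> hits = new ArrayList<>();
        for (Book b : books) {
            boolean textMatch = b.title.toLowerCase().contains(keyword.toLowerCase());
            boolean typeMatch = b.type.equals(type);
            if (textMatch && typeMatch) {
                hits.add(b.title);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Book> books = Arrays.asList(
                new Book("Searching with Java", "technical"),
                new Book("Searching for Home", "fiction"));
        System.out.println(search(books, "searching", "technical"));
    }
}
```

A search library does the text part through its index rather than a linear scan, but the idea of narrowing full-text hits by a structured criterion is the same.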

Benchmarks


The results of benchmarking the most popular full text search engines (MySQL’s built-in Text Search engine, the Sphinx Text Search engine plug-in for MySQL, and Lucene) are published on the PlanetMySQL site.

Conclusion

Lucene is “the most” popular full text search solution available now for conducting efficient full text searches on a database, compared to MySQL’s built-in Text Search engine and the Sphinx plug-in for MySQL. Lucene is just a Java API, so it integrates seamlessly with other Java programs, unlike Sphinx, which is written in C++. It is a set of tools that allows us to create an index and then search it, so we need to handle index creation, updates and searches manually using the API. But the good news is that Spring & Hibernate support integration of Lucene through various support classes. So go ahead and use it in your Java projects. Have a great time searching inside your applications with Lucene!!!

Reference:

Java World Article on Integrating Lucene

Friday, April 14, 2006

Part III - Open Source Java Frameworks for Web Development

This is the last part of the article on Frameworks. In the previous articles I have given an introduction on the concepts of frameworks and how foundations on modern java frameworks were laid. In this article, I will compare several production quality web frameworks, such as Struts, Spring, and Hibernate and go over basic similarities and underlying concepts.

Basic Concepts

Almost all modern Web-development frameworks follow the Model-View-Controller (MVC) design. Business logic and presentation are separated, and a controller of logic flow coordinates requests from clients and actions taken on the server. This approach has become the de facto standard of Web development. The underlying mechanics of each framework are of course different, but the APIs that developers use to design and implement their Web applications are very similar. The difference also lies in the extensions that each framework provides, such as tag libraries, Java Server Faces, or Java Bean wrappers.

Frameworks use different techniques to coordinate navigation within the Web application, such as an XML configuration file, Java property files, or custom properties. Frameworks also differ in the way the controller module is implemented. For instance, EJBs may instantiate the classes needed in each request, or Java reflection can be used to dynamically invoke the appropriate action class. Frameworks may also differ conceptually. For example, one framework may define the user request and response (and error) scenario, and another may only define a complete flow from one request to multiple responses and subsequent requests.
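A minimal sketch of the reflection approach, using only the JDK: the controller keeps a table from request paths to action class names and instantiates the matching class at request time. All names here (ReflectiveController, Action, HelloAction, the paths and page names) are hypothetical; in a real framework the table would typically be loaded from an XML or properties file.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a controller that picks an action via reflection.
class ReflectiveController {
    // Contract that every action class must implement.
    interface Action {
        String execute(String input);
    }

    static class HelloAction implements Action {
        public String execute(String input) {
            return "hello.jsp?name=" + input;
        }
    }

    // Request path -> action class name (stands in for a configuration file).
    private final Map<String, String> mappings = new HashMap<>();

    void map(String path, String className) {
        mappings.put(path, className);
    }

    // Instantiates the configured action by name and returns the view to show.
    String handle(String path, String input) {
        String className = mappings.get(path);
        if (className == null) {
            return "error.jsp";
        }
        try {
            Action action = (Action) Class.forName(className)
                    .getDeclaredConstructor().newInstance();
            return action.execute(input);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot create " + className, e);
        }
    }

    public static void main(String[] args) {
        ReflectiveController controller = new ReflectiveController();
        controller.map("/hello", HelloAction.class.getName());
        System.out.println(controller.handle("/hello", "world"));
        System.out.println(controller.handle("/unknown", "world"));
    }
}
```

The controller itself never references HelloAction in its dispatch logic, which is what lets new actions be added through configuration alone.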

Java frameworks are similar in the way they structure data flow. After a request, some action takes place on the application server, and some data-populated objects are sent to the JSP layer with the response. Data is then extracted from those objects, which could be simple classes with setter and getter methods, Java beans, value objects, or some collection objects. Modern Java frameworks also simplify a developer's tasks by providing automatic session tracking with easy APIs, database connection pools, and even database call wrappers. Some frameworks either provide hooks into other J2EE technologies, such as JMS (Java Message Service) or JMX, or have these technologies integrated. Server data persistence and logging could also be part of a framework.

Popular Web Frameworks

Apache Struts Framework
The Struts framework is an open-source product for building Web applications based on the model-view-controller (MVC) design paradigm. It uses and extends the Java Servlet API and was originally created by Craig McClanahan. In May 2000, it was donated to the Apache Foundation. It features a powerful custom tag library, tiled displays, form validation, and I18N (internationalization). Also, Struts supports a variety of presentation layers, including JSP, XML/XSLT, JavaServer Faces (JSF), and Velocity, as well as a variety of model layers, including JavaBeans and EJB.

Spring Framework
The Spring Framework is a layered Java/J2EE application framework based on code published in Expert One-on-One J2EE Design and Development. The Spring Framework provides a simple approach to development that does away with numerous properties files and helper classes that litter projects. Key features of the Spring Framework include:

Powerful JavaBeans-based configuration management, applying Inversion-of-Control (IoC) principles.

A core bean factory, usable in any environment, from applets to J2EE containers.

Generic abstraction layer for database transaction management, allowing for pluggable transaction managers, and making it easy to demarcate transactions without dealing with low-level issues.

JDBC abstraction layer with a meaningful exception hierarchy.

Integration with Hibernate, DAO implementation support, and transaction strategies.
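To make the Inversion-of-Control idea concrete, here is a hand-wired sketch in plain Java. It does not use Spring; the wiring done in main() is the part a container such as Spring would perform from its configuration. The class names are invented for this example.

```java
// Hypothetical sketch of Inversion of Control via constructor injection.
class IocDemo {
    interface MessageSource {
        String message();
    }

    static class FixedMessageSource implements MessageSource {
        public String message() {
            return "Hello from an injected bean";
        }
    }

    // Greeter never creates its dependency; it only receives it.
    static class Greeter {
        private final MessageSource source;

        Greeter(MessageSource source) {
            this.source = source;
        }

        String greet() {
            return source.message();
        }
    }

    public static void main(String[] args) {
        // Manual wiring; a container would build and connect these objects.
        MessageSource source = new FixedMessageSource();
        Greeter greeter = new Greeter(source);
        System.out.println(greeter.greet());
    }
}
```

Because Greeter depends only on the interface, a different MessageSource can be swapped in (for a test, say) without touching Greeter.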

Hibernate Framework
Hibernate is an object-relational mapping (ORM) solution for the Java language. It is also open source software, as is Struts, and is distributed under the LGPL. Hibernate was developed by a team of Java software developers around the world. It provides an easy to use framework for mapping an object-oriented domain model to a traditional relational database. It not only takes care of the mapping from Java classes to database tables (and from Java data types to SQL data types), but also provides data query and retrieval facilities and can significantly reduce development time otherwise spent with manual data handling in SQL and JDBC.

Hibernate's goal is to relieve the developer from a significant amount of common data persistence-related programming tasks. Hibernate adapts to the development process, whether it is started with a design from scratch or from a legacy database. Hibernate generates the SQL, relieves the developer from manual result set handling and object conversion, and keeps the application portable to all SQL databases. It provides transparent persistence; the only requirement for a persistent class is a no-argument constructor.
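Why a no-argument constructor? A persistence layer has to create an empty object before it can fill it with column values. The sketch below shows that idea with plain JDK reflection. It is not how Hibernate is implemented internally, only an illustration; the Customer class and the fromRow helper are hypothetical.

```java
import java.lang.reflect.Field;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: building an object from a "row" via reflection.
class RowMapperDemo {
    static class Customer {
        private long id;
        private String name;

        Customer() { } // the no-argument constructor the mapper relies on

        long getId() { return id; }
        String getName() { return name; }
    }

    // Creates an empty instance, then copies each column into the same-named field.
    static <T> T fromRow(Class<T> type, Map<String, Object> row) {
        try {
            T instance = type.getDeclaredConstructor().newInstance();
            for (Map.Entry<String, Object> column : row.entrySet()) {
                Field field = type.getDeclaredField(column.getKey());
                field.setAccessible(true);
                field.set(instance, column.getValue());
            }
            return instance;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot map row to " + type, e);
        }
    }

    public static void main(String[] args) {
        Map<String, Object> row = new HashMap<>();
        row.put("id", 42L);
        row.put("name", "Alice");
        Customer c = fromRow(Customer.class, row);
        System.out.println(c.getId() + " " + c.getName());
    }
}
```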

There are more frameworks than I have described here, of course, both open-source and commercial, such as:

WebWork - http://www.opensymphony.com/webwork/
Tapestry - http://jakarta.apache.org/tapestry/

Many other frameworks were developed in-house by extending existing MVC frameworks.

Enterprise Development Environments

Some of these frameworks became very popular within the Web developer community and the enterprise development space. As these frameworks matured into stable releases, commercial IDE (integrated development environment) toolmakers started to build support for them into their products. Some even went as far as to develop whole products based on the concepts of the framework. For example, BEA WebLogic Workshop is built around the Struts framework. Borland JBuilder has built-in support for Struts and features JSF and JSTL support as well.

The Eclipse platform became a very popular development tool, partly because of its plug-in base and partly because of its Web framework support. Numerous plug-ins to Eclipse or even entire distributions of Eclipse-based IDEs appeared. Many of the plug-ins were designed for Struts framework development, such as MyEclipse (www.myeclipse.org) or M7 (www.m7.com).

As the Web development arena continues to evolve its tools and programming methodologies, so will the Java application frameworks continue to grow. The future seems very bright for the Java Web-development frameworks.

End of Part III

Part II - Java Frameworks (The evolution of Java development)

This is the second part of the article on Frameworks, which illustrates the evolution of Java development. A major part of it has been gleaned from the article “Java Frameworks Take Hold” by Rene Bonvanie.

Java 2 Platform, Enterprise Edition (J2EE) is an incredibly powerful technology. It is designed to be flexible enough to adapt to many different types of applications without requiring developers to invent new approaches.

Start that first project, and the questions come fast and furious. What combination of JavaServer Pages (JSP), Enterprise JavaBeans (EJB), and servlet components should you use to build each part of the system? How will performance be ensured? Is one approach more scalable than another? And finally, how can the choices, once selected, be enforced consistently across a development team?

These questions are at the core of one of the most important discussions in the Java community today. And the mass adoption of Java in internet development projects has resulted in a flood of solutions in the shape of best practices, frameworks, and development tools.

In The Beginning: Design Patterns

The Java community recognized early on that guidelines were necessary to help developers deal with the myriad J2EE-related choices. Gradually, a set of best practices emerged, usually called the J2EE Design Patterns.

J2EE Design Patterns generalize proven, high-quality approaches for frequently encountered design issues with the J2EE application model in a format that all developers can use. Typically, a design pattern is a written description of the problem domain followed by some sample code implementing a solution.

Take, for example, the Web tier in a typical J2EE application. JSP and servlets do a great job at increasing developer productivity when building individual dynamic Web pages but provide little support for managing page-to-page flow. Furthermore, on their own, JSP and servlets do not enforce separation of the Web presentation and business logic.

Here is where design patterns fit in. The basic problem just described is resolved by a pattern called the Model-View-Controller (MVC) design pattern. This pattern specifies a way to build an application so there is a consistent way to control page flow and to separate presentation and business logic layers. The MVC approach naturally builds on JSP and servlets, using the strength of these core specifications.

Next Generation: Frameworks

Developers have gravitated to the J2EE Design Patterns en masse because they represent some of the best-known practices for J2EE application development. Incorporating design patterns into applications promises high-quality, high-performance implementations.

Yet the problem most developers face when working with design patterns is that they are exactly as their name implies: a set of patterns that tend to be academically rigorous but are not easily enforceable or automated on their own. Design patterns are merely coding templates and recommendations that developers are expected to follow, with no guarantees of consistency, understanding, or enforceability. Vendors and developers are now moving to the next generation: developing frameworks based on the J2EE Design Patterns.

At the lowest level, J2EE frameworks automate the easily repeatable coding aspects of the patterns with techniques such as automatic code generation or a metadata-driven approach. At the highest level, J2EE frameworks turn into visual design and declarative programming environments.

An example framework is Apache Struts, implementing the MVC design pattern. It is a popular open source Web-tier framework that originated in an effort to provide a standard implementation of the MVC design pattern. Struts took the major concepts of MVC and created a consistent, reusable metadata layer into which J2EE developers plug the specifics of their applications.

With Struts, J2EE developers no longer have to worry about building the MVC design pattern "plumbing" in every project; rather, they can focus on applying their creative thinking to the presentation layer of the business application itself. Struts—and frameworks in general—bring other benefits too: reduced training costs, faster project delivery, and consistency across application implementations.

One can work across a standard J2EE architecture and find representative frameworks implementing design patterns in each tier. For example, Web-tier frameworks such as Apache Struts are easily combined with business-tier frameworks such as Business Components for Java (BC4J), to write entire applications.

There are many options in the data tier. For instance, BC4J provides a highly scalable implementation of the data access object pattern for persistence. Such business-tier frameworks are also frequently paired with persistence layers such as Oracle9iAS TopLink that help developers map general-purpose business-domain models to data stores such as relational databases.

Frameworks such as Apache Struts and BC4J represent a growing trend in the framework world: implement a set of J2EE Design Patterns and ensure the framework is open and flexible enough to easily plug into other popular frameworks. The goal is to give J2EE developers choice and productivity at the same time.

Making Choices

In the open source and commercial space, dozens of frameworks are emerging. Logical questions follow. What makes a successful framework? How does a developer choose the right framework? Which ones will survive?

One answer is that the surviving, widely adopted frameworks will likely be those that cleanly and elegantly solve architectural problems and significantly increase productivity over straight programming. Leading J2EE frameworks will be judged on quality of implementation, maturity, usability, cost, performance, and reliability.

As the core J2EE specifications evolve to incorporate framework features, the J2EE containers will provide developers with best practices and design consistency already built in. When this happens, developers will focus their selection criteria on the second major area frameworks tend to feature: productivity and ease of use for developers.

Open source Java Frameworks

Below is the list of some popular frameworks in the open source space.

Open Source J2EE Application Frameworks

Spring - Spring is a layered Java/J2EE application framework, based on code published in Expert One-on-One J2EE Design and Development

Jeenius - Jeenius is a framework to simplify the creation of J2EE applications. It has a strong focus on building web-based applications.

Open Source Web Frameworks in Java

Struts - The core of the Struts framework is a flexible control layer based on standard technologies like Java Servlets, JavaBeans, ResourceBundles, and XML, as well as various Jakarta Commons packages. Struts encourages application architectures based on the Model 2 approach, a variation of the classic Model-View-Controller (MVC) design paradigm.

Spring MVC – The MVC framework provided by Spring is quite similar to Struts, but is more powerful and easier to use.

WebWork - WebWork is a web application framework for J2EE. It is based on a concept called "Pull HMVC" (Pull Hierarchical Model View Controller).

Cocoon - Apache Cocoon is a web development framework built around the concepts of separation of concerns and component-based web development. Cocoon implements these concepts around the notion of 'component pipelines', each component on the pipeline specializing on a particular operation. This makes it possible to use a Lego(tm)-like approach in building web solutions, hooking together components into pipelines without any required programming.

Turbine - Turbine is a servlet based framework that allows experienced Java developers to quickly build secure web applications. Turbine is an excellent choice for developing applications that make use of a services-oriented architecture. Some of the functionality provided with Turbine includes a security management system, a scheduling service, XML-defined form validation server, and an XML-RPC service for web services. It is a simple task to create new services particular to your application.

Tapestry - Tapestry is a powerful, open-source, all-Java framework for creating leading edge web applications in Java. Tapestry reconceptualizes web application development in terms of objects, methods and properties instead of URLs and query parameters. Tapestry is an alternative to scripting environments such as JavaServer Pages or Velocity. Tapestry goes far further, providing a complete framework for creating extremely dynamic applications with minimal amounts of coding.

Open Source Persistence Frameworks in Java

Hibernate - Hibernate is a powerful, ultra-high performance object/relational persistence and query service for Java. Hibernate lets you develop persistent objects following common Java idiom - including association, inheritance, polymorphism, composition and the Java collections framework. Extremely fine-grained, richly typed object models are possible. The Hibernate Query Language, designed as a "minimal" object-oriented extension to SQL, provides an elegant bridge between the object and relational worlds. Hibernate is now the most popular ORM solution for Java.

OJB - ObJectRelationalBridge (OJB) is an Object/Relational mapping tool that allows transparent persistence for Java Objects against relational databases.

Ibatis SQL Maps - The SQL Maps framework will help to significantly reduce the amount of Java code that is normally needed to access a relational database. This framework maps JavaBeans to SQL statements using a very simple XML descriptor. Simplicity is the biggest advantage of SQL Maps over other frameworks and object relational mapping tools. To use SQL Maps you need only be familiar with JavaBeans, XML and SQL. There is very little else to learn. There is no complex scheme required to join tables or execute complex queries. Using SQL Maps you have the full power of real SQL at your fingertips. The SQL Maps framework can map nearly any database to any object model and is very tolerant of legacy designs, or even bad designs. This is all achieved without special database tables, peer objects or code generation.

End of Part II

Part I - Introduction to Frameworks

For all of my friends who are already familiar with developing web based applications with Java (J2EE) using JSP and servlets, and would like to start using Java frameworks, here is a three part article that introduces the basics of modern Java frameworks.

The concept of a framework has been kicking around in software development for a long time in one form or another. In its simplest form, a framework is simply a body of tried and tested code that is reused in multiple software development projects. A framework, in general, provides an implementation for the core and unvarying functions and includes mechanisms that allow the developer to plug in various functions or to extend the existing ones.

Frameworks can be classified into three categories based on their scope, as follows:

1. System infrastructure frameworks - These frameworks simplify the development of portable and efficient system infrastructure such as operating system and communication frameworks, and frameworks for user interfaces and language processing tools. System infrastructure frameworks are primarily used internally within a software organization and are not sold to customers directly.

2. Middleware integration frameworks - These frameworks are commonly used to integrate distributed applications and components. Middleware integration frameworks are designed to enhance the ability of software developers to modularize, reuse, and extend their software infrastructure to work seamlessly in a distributed environment. There is a thriving market for Middleware integration frameworks, which are rapidly becoming commodities. Common examples include ORB frameworks, message-oriented middleware, and transactional databases.

3. Enterprise application frameworks - These frameworks address broad application domains (such as telecommunications, avionics, manufacturing, and financial engineering) and are the cornerstone of enterprise business activities. Relative to System infrastructure and Middleware integration frameworks, Enterprise frameworks are expensive to develop and/or purchase. However, Enterprise frameworks can provide a substantial return on investment since they support the development of end-user applications and products directly.

Regardless of their scope, frameworks can also be classified by the techniques used to extend them, which range along a continuum from whitebox frameworks to blackbox frameworks.

1. Whitebox frameworks rely heavily on OO language features like inheritance and dynamic binding to achieve extensibility. Existing functionality is reused and extended by (1) inheriting from framework base classes and (2) overriding pre-defined hook methods using patterns like Template Method. Whitebox frameworks require application developers to have intimate knowledge of the frameworks' internal structure. Although whitebox frameworks are widely used, they tend to produce systems that are tightly coupled to the specific details of the framework's inheritance hierarchies.

2. Blackbox frameworks support extensibility by defining interfaces for components that can be plugged into the framework via object composition. Existing functionality is reused by (1) defining components that conform to a particular interface and (2) integrating these components into the framework using patterns like Strategy and Functor. Blackbox frameworks are structured using object composition and delegation more than inheritance. As a result, blackbox frameworks are generally easier to use and extend than whitebox frameworks. However, blackbox frameworks are more difficult to develop since they require framework developers to define interfaces and hooks that anticipate a wider range of potential use-cases.
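
The two extension styles can be sketched side by side. The example below is only an illustration, written in Python for brevity; the names ReportFramework, UpperCaseReport and ComposedReport are hypothetical and do not come from any real framework. The first class is extended by inheritance and a hook method (Template Method), the second by plugging in a component (Strategy).

```python
from abc import ABC, abstractmethod
from typing import Callable


# Whitebox style: the framework owns the algorithm skeleton and the
# application extends it by subclassing and overriding a hook method.
class ReportFramework(ABC):
    def run(self, rows: list[str]) -> str:
        # Fixed skeleton supplied by the (hypothetical) framework.
        body = "\n".join(self.format_row(r) for r in rows)
        return f"REPORT\n{body}"

    @abstractmethod
    def format_row(self, row: str) -> str:
        """Hook method that application subclasses must override."""


class UpperCaseReport(ReportFramework):
    def format_row(self, row: str) -> str:
        return row.upper()


# Blackbox style: the framework accepts any component that satisfies an
# interface (here, a callable str -> str) and delegates to it.
class ComposedReport:
    def __init__(self, formatter: Callable[[str], str]) -> None:
        self.formatter = formatter

    def run(self, rows: list[str]) -> str:
        body = "\n".join(self.formatter(r) for r in rows)
        return f"REPORT\n{body}"


if __name__ == "__main__":
    rows = ["alpha", "beta"]
    print(UpperCaseReport().run(rows))
    print(ComposedReport(str.title).run(rows))
```

Note that the whitebox version requires the application to know which method to override, while the blackbox version only needs something that matches the expected interface.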

Object-Oriented (OO) Application Frameworks

Object-oriented (OO) application frameworks are a promising technology for reifying proven software designs and implementations in order to reduce the cost and improve the quality of software. An OO application framework is a reusable, "semi-complete" application that can be specialized to produce custom applications. In contrast to earlier OO reuse techniques based on class libraries, frameworks are targeted at particular business units (such as data processing or cellular communications) and application domains (such as user interfaces or persistence).

The primary benefits of OO application frameworks stem from the modularity, reusability, extensibility, and inversion of control they provide to developers, as described below:

Modularity - Frameworks enhance modularity by encapsulating volatile implementation details behind stable interfaces. Framework modularity helps improve software quality by localizing the impact of design and implementation changes. This localization reduces the effort required to understand and maintain existing software.

Reusability - The stable interfaces provided by frameworks enhance reusability by defining generic components that can be reapplied to create new applications. Framework reusability leverages the domain knowledge and prior effort of experienced developers in order to avoid re-creating and re-validating common solutions to recurring application requirements and software design challenges. Reuse of framework components can yield substantial improvements in programmer productivity, as well as enhance the quality, performance, reliability and interoperability of software.

Extensibility - A framework enhances extensibility by providing explicit hook methods that allow applications to extend its stable interfaces. Hook methods systematically decouple the stable interfaces and behaviors of an application domain from the variations required by instantiations of an application in a particular context. Framework extensibility is essential to ensure timely customization of new application services and features.

Inversion of control - The run-time architecture of a framework is characterized by an "inversion of control." This architecture enables canonical application processing steps to be customized by event handler objects that are invoked via the framework's reactive dispatching mechanism. When events occur, the framework's dispatcher reacts by invoking hook methods on pre-registered handler objects, which perform application-specific processing on the events. Inversion of control allows the framework (rather than each application) to determine which set of application-specific methods to invoke in response to external events (such as window messages arriving from end-users or packets arriving on communication ports).
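
Inversion of control is easier to see in a few lines of code. The following is a minimal sketch, assuming a hypothetical Dispatcher class and made-up event names; it is not the API of any particular framework. The application only registers handlers, and the dispatcher decides when they are called.

```python
from typing import Callable

Handler = Callable[[str], str]


class Dispatcher:
    """Hypothetical framework core: it owns the control flow."""

    def __init__(self) -> None:
        self._handlers: dict[str, list[Handler]] = {}

    def register(self, event_type: str, handler: Handler) -> None:
        # The application pre-registers its handler objects here.
        self._handlers.setdefault(event_type, []).append(handler)

    def dispatch(self, event_type: str, payload: str) -> list[str]:
        # The framework, not the application, chooses which handlers run.
        return [handler(payload) for handler in self._handlers.get(event_type, [])]


if __name__ == "__main__":
    dispatcher = Dispatcher()
    dispatcher.register("packet", lambda data: f"parsed {data}")
    dispatcher.register("packet", lambda data: f"logged {data}")
    print(dispatcher.dispatch("packet", "0xCAFE"))
```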

Early object-oriented frameworks (such as MacApp and InterViews) originated in the domain of graphical user interfaces (GUIs). The Microsoft Foundation Classes (MFC) is a contemporary GUI framework that has become the de facto industry standard for creating graphical applications on PC platforms. Although MFC has limitations (such as lack of portability to non-PC platforms), its widespread adoption demonstrates the productivity benefits of reusing common frameworks to develop graphical business applications.

The next generation of OO application frameworks targeted complex business and application domains. At the heart of this effort were the Object Request Broker (ORB) frameworks, which facilitate communication between local and remote objects. ORB frameworks eliminate many tedious, error-prone, and non-portable aspects of creating and managing distributed applications and reusable service components. This enables programmers to develop and deploy complex applications rapidly and robustly, rather than wrestling endlessly with low-level infrastructure concerns. Widely used ORB frameworks include CORBA, DCOM, and Java RMI.

In server-side development, a number of core tasks crop up over and over again. Such tasks can be pulled into a core framework, built and tested once, and reused across multiple projects. Utilizing this opportunity, many frameworks emerged that simplified the development of web based projects. As development of Web-based application servers and their applications expanded, so did the frameworks that supported these technologies. Currently, there are many software frameworks in the enterprise development space especially for the Java J2EE platform.

A good framework enhances the maintainability of software through API consistency, comprehensive documentation, and thorough testing. Some companies invest formally in frameworks, and developers build up a library of components that they use often. Such actions reduce development time while improving delivered software quality, which means that developers can spend more time concentrating on the business-specific problem at hand rather than on the plumbing code behind it. There are also many mature frameworks available in the open source arena. Adopting such stable frameworks is more effective than developing a framework from scratch.

End of Part I

Wednesday, April 12, 2006

Object Oriented Database Management Systems

In today's world, client-server applications that rely on a database on the server as a data store while servicing requests from multiple clients are quite commonplace. Most of these applications use a Relational Database Management System (RDBMS) as their data store while using an object oriented programming language for development. This causes a certain inefficiency, as objects must be mapped to tuples in the database and vice versa instead of the data being stored in a way that is consistent with the programming model. The "impedance mismatch" caused by having to map objects to tables and vice versa has long been accepted as a necessary performance penalty. The following article is aimed at seeking out an alternative that avoids this penalty. This information was gleaned from the article "An Exploration Of Object Oriented Database Management Systems" by Dare Obasanjo.

Overview of OODBMS

An OODBMS is the result of combining object oriented programming principles with database management principles. Object oriented programming concepts such as encapsulation, polymorphism and inheritance are enforced as well as database management concepts such as the ACID properties (Atomicity, Consistency, Isolation and Durability) which lead to system integrity, support for an ad hoc query language and secondary storage management systems which allow for managing very large amounts of data.

The Object Oriented Database Manifesto specifically lists the following features as mandatory for a system to support before it can be called an OODBMS: complex objects, object identity, encapsulation, types and classes, class or type hierarchies, overriding, overloading and late binding, computational completeness, extensibility, persistence, secondary storage management, concurrency, recovery and an ad hoc query facility. An OODBMS is thus a full scale object oriented development environment as well as a database management system. Features that are common in the RDBMS world such as transactions, the ability to handle large amounts of data, indexes, deadlock detection, backup and restoration features and data recovery mechanisms also exist in the OODBMS world.

A primary feature of an OODBMS is that accessing objects in the database is done in a transparent manner such that interaction with persistent objects is no different from interacting with in-memory objects. This is very different from using an RDBMS in that there is no need to interact via a query sub-language like SQL, nor is there a reason to use a Call Level Interface such as ODBC, ADO or JDBC. Database operations typically involve obtaining a database root from the OODBMS, which is usually a data structure like a graph, vector, hash table, or set, and traversing it to obtain objects to create, update or delete from the database.

Comparisons of OODBMSs to RDBMSs


There are concepts in the relational database model that are similar to those in the object database model. A relation or table in a relational database can be considered to be analogous to a class in an object database. A tuple is similar to an instance of a class but is different in that it has attributes but no behaviors. A column in a tuple is similar to a class attribute except that a column can hold only primitive data types while a class attribute can hold data of any type. Finally classes have methods which are computationally complete (meaning that general purpose control and computational structures are provided) while relational databases typically do not have computationally complete programming capabilities although some stored procedure languages come close.

Below is a list of advantages and disadvantages of using an OODBMS over an RDBMS with an object oriented programming language.

Advantages

Composite Objects and Relationships: Objects in an OODBMS can store an arbitrary number of atomic types as well as other objects. It is thus possible to have a large class which holds many medium sized classes which themselves hold many smaller classes, ad infinitum. In a relational database this has to be done either by having one huge table with lots of null fields or via a number of smaller, normalized tables which are linked via foreign keys. Having lots of smaller tables is still a problem since a join has to be performed every time one wants to query data based on the "Has-a" relationship between the entities. Also, an object is a better model of the real world entity than the relational tuples with regard to complex objects. The fact that an OODBMS is better suited to handling complex, interrelated data than an RDBMS means that an OODBMS can outperform an RDBMS by ten to a thousand times depending on the complexity of the data being handled.

Class Hierarchy: Data in the real world usually has hierarchical characteristics. The ever popular Employee example used in most RDBMS texts is easier to describe in an OODBMS than in an RDBMS. An Employee may or may not be a Manager; in an RDBMS this is usually handled by having a type identifier field or by creating another table that uses foreign keys to indicate the relationship between Managers and Employees. In an OODBMS, the Employee class is simply a parent class of the Manager class.

Circumventing the Need for a Query Language: Unlike an RDBMS, an OODBMS does not require a query language for accessing data, since interaction with the database is done by transparently accessing objects. It is still possible to use queries in an OODBMS, however.

No Impedance Mismatch: In a typical application that uses an object oriented programming language and an RDBMS, a significant amount of time is usually spent mapping tables to objects and back. There are also various problems that can occur when the atomic types in the database do not map cleanly to the atomic types in the programming language and vice versa. This "impedance mismatch" is completely avoided when using an OODBMS.


No Primary Keys: The user of an RDBMS has to worry about uniquely identifying tuples by their values and making sure that no two tuples have the same primary key values to avoid error conditions. In an OODBMS, the unique identification of objects is done behind the scenes via OIDs and is completely invisible to the user. Thus there is no limitation on the values that can be stored in an object.

One Data Model: A data model typically should model entities and their relationships, constraints and operations that change the states of the data in the system. With an RDBMS it is not possible to model the dynamic operations or rules that change the state of the data in the system because this is beyond the scope of the database. Thus applications that use RDBMS systems usually have an Entity Relationship diagram to model the static parts of the system and a separate model for the operations and behaviors of entities in the application. With an OODBMS there is no disconnect between the database model and the application model because the entities are just other objects in the system. An entire application can thus be comprehensively modelled in one UML diagram.
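
To make the class hierarchy and impedance mismatch points concrete, here is a minimal sketch of the relational side of the comparison, written in Python with its standard sqlite3 module. The table layout, the "kind" type identifier column and the save/load helpers are assumptions made for this illustration only. The point is the hand-written mapping code in both directions, which an OODBMS is meant to make unnecessary.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Employee:
    emp_id: int
    name: str


@dataclass
class Manager(Employee):
    # In the object model, being a manager is expressed by the class itself.
    pass


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, name TEXT, kind TEXT)"
    )
    return conn


def save(conn: sqlite3.Connection, emp: Employee) -> None:
    # Object -> tuple: the class is flattened into a type identifier column.
    kind = "manager" if isinstance(emp, Manager) else "employee"
    conn.execute(
        "INSERT INTO employee (emp_id, name, kind) VALUES (?, ?, ?)",
        (emp.emp_id, emp.name, kind),
    )


def load(conn: sqlite3.Connection, emp_id: int) -> Employee:
    # Tuple -> object: the type identifier decides which class to rebuild.
    row = conn.execute(
        "SELECT emp_id, name, kind FROM employee WHERE emp_id = ?", (emp_id,)
    ).fetchone()
    cls = Manager if row[2] == "manager" else Employee
    return cls(row[0], row[1])


if __name__ == "__main__":
    conn = make_db()
    save(conn, Manager(1, "Ada"))
    print(load(conn, 1))
```

Every new persistent class or attribute means touching both save and load, which is the mapping effort the advantages above refer to.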

Disadvantages

Schema Changes: In an RDBMS modifying the database schema either by creating, updating or deleting tables is typically independent of the actual application. In an OODBMS based application modifying the schema by creating, updating or modifying a persistent class typically means that changes have to be made to the other classes in the application that interact with instances of that class. This typically means that all schema changes in an OODBMS will involve a system wide recompile. Also updating all the instance objects within the database can take an extended period of time depending on the size of the database.

Language Dependence: An OODBMS is typically tied to a specific language via a specific API. This means that data in an OODBMS is typically only accessible from a specific language using a specific API, which is typically not the case with an RDBMS.

Lack of Ad-Hoc Queries: In an RDBMS, the relational nature of the data allows one to construct ad-hoc queries in which new tables are created by joining existing tables and then queried. Since it is currently not possible to duplicate the semantics of joining two tables by "joining" two classes, there is a loss of flexibility with an OODBMS. Thus the queries that can be performed on the data in an OODBMS are highly dependent on the design of the system.

List of Object Oriented Database Management Systems

Proprietary
ObjectStore
O2
Gemstone
Versant
Ontos
DB/Explorer ODBMS
Poet
Objectivity/DB
EyeDB

Open Source
Ozone
Zope
FramerD
XL2

Conclusion

The gains from using an OODBMS while developing an application in an OO programming language are many. The savings in development time from not having to maintain separate data models, together with the reduced amount of code to write thanks to the absence of impedance mismatch, are very attractive. There is little reason to pick an RDBMS over an OODBMS system for new application development unless there are legacy issues that have to be dealt with.

Technology - Genetic Algorithms

A genetic algorithm (GA) is a search technique used in computer science to find approximate solutions to optimization and search problems. Genetic algorithms are a particular class of evolutionary algorithms that use techniques inspired by evolutionary biology such as inheritance, mutation, natural selection, and recombination (or crossover).

Genetic algorithms were formally introduced in the United States in the 1970s by John Holland at the University of Michigan. The continuing price/performance improvements of computational systems have made them attractive for some types of optimization. In particular, genetic algorithms work very well on mixed (continuous and discrete), combinatorial problems. They are less susceptible to getting 'stuck' at local optima than gradient search methods, but they tend to be computationally expensive.

Operation of a GA

Genetic algorithms are typically implemented as a computer simulation in which a population of abstract representations (called chromosomes) of candidate solutions (called individuals) to an optimization problem evolves toward better solutions. Traditionally, solutions are represented in binary as strings of 0s and 1s, but different encodings are also possible. The evolution starts from a population of completely random individuals and happens in generations. In each generation, the fitness of the whole population is evaluated, multiple individuals are stochastically selected from the current population (based on their fitness), modified (mutated or recombined) to form a new population, which becomes current in the next iteration of the algorithm.

Pseudo-code algorithm

Choose initial population
Repeat
    Evaluate the individual fitnesses of a certain proportion of the population
    Select pairs of best-ranking individuals to reproduce
    Breed new generation through crossover and mutation
Until terminating condition

Initialization - Initially many individual solutions are randomly generated to form an initial population. The population size depends on the nature of the problem, but typically contains several hundreds or thousands of possible solutions. Traditionally, the population is generated randomly, covering the entire range of possible solutions (the search space). Occasionally, the solutions may be "seeded" in areas where optimal solutions are likely to be found.

Selection - During each successive epoch, a proportion of the existing population is selected to breed a new generation. Individual solutions are selected through a fitness-based process, where fitter solutions (as measured by a fitness function) are typically more likely to be selected. Certain selection methods rate the fitness of each solution and preferentially select the best solutions. Other methods rate only a random sample of the population, as this process may be very time-consuming.
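
Two common ways of doing this are sketched below in Python. The function names roulette_select and tournament_select are chosen for this example; the first rates every individual and picks in proportion to fitness, the second rates only a small random sample each time. The sketch assumes fitness values are non-negative and that at least one is positive.

```python
import random


def roulette_select(population, fitness, k, rng=random):
    """Pick k individuals with probability proportional to their fitness."""
    weights = [fitness(individual) for individual in population]
    return rng.choices(population, weights=weights, k=k)


def tournament_select(population, fitness, k, size=3, rng=random):
    """Pick k winners, each the fittest member of a small random sample."""
    return [max(rng.sample(population, size), key=fitness) for _ in range(k)]


if __name__ == "__main__":
    rng = random.Random(42)
    population = ["0001", "0111", "1111", "0000"]
    count_ones = lambda bits: bits.count("1")
    print(roulette_select(population, count_ones, 4, rng))
    print(tournament_select(population, count_ones, 4, size=2, rng=rng))
```

Tournament selection avoids evaluating the whole population for every pick, which matters when the fitness function is expensive.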

Reproduction - The next step is to generate the next generation of solutions from those selected, using the genetic operators crossover (or recombination) and mutation. For each new solution to be produced, a pair of "parent" solutions is selected for breeding from the pool selected previously. Producing a "child" solution through crossover and mutation creates a new solution that typically shares many of the characteristics of its "parents." New parents are selected for each child, and the process continues until a new population of solutions of appropriate size is generated. The result is a next-generation population of chromosomes that differs from the initial generation. Generally the average fitness of the population will have increased, since mostly the best individuals from the previous generation are selected for breeding, along with a small proportion of less fit solutions that help preserve diversity in the population.

Termination - This generational process is repeated until a termination condition has been reached. Common terminating conditions are:
• A solution is found that satisfies minimum criteria
• Fixed number of generations reached
• Allocated budget (computation time/money) reached
• The highest ranking solution's fitness is reaching or has reached a plateau such that successive iterations no longer produce better results
• Manual inspection
• Combinations of the above

The three most important aspects of using genetic algorithms are: (1) definition of the objective function, (2) definition and implementation of the genetic representation, and (3) definition and implementation of the genetic operators. Once these three have been defined, the generic genetic algorithm should work fairly well. Beyond that you can try many different variations to improve performance, find multiple optima (species - if they exist), or parallelize the algorithms.
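
Putting the three pieces together, a complete run can be sketched as follows. This is a toy example under stated assumptions, not a reference implementation: the objective function is "OneMax" (count the 1 bits), the representation is a list of bits, the operators are single-point crossover and bit-flip mutation, selection is by tournament, and all parameter values are arbitrary choices for illustration. The loop stops when a perfect solution is found or a fixed number of generations is reached.

```python
import random


def one_max(bits):
    """Objective function for this toy problem: the number of 1 bits."""
    return sum(bits)


def tournament(population, rng, size=3):
    """Select the fittest of a small random sample."""
    return max(rng.sample(population, size), key=one_max)


def crossover(a, b, rng):
    """Single-point crossover of two equal-length bit lists."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]


def mutate(bits, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [1 - b if rng.random() < rate else b for b in bits]


def evolve(length=20, pop_size=40, generations=200, rate=0.02, seed=0):
    rng = random.Random(seed)
    # Initialization: a completely random population.
    population = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    best = max(population, key=one_max)
    for _ in range(generations):          # termination: generation limit
        if one_max(best) == length:       # termination: target reached
            break
        # Selection and reproduction: two parents per child.
        population = [
            mutate(crossover(tournament(population, rng),
                             tournament(population, rng), rng), rate, rng)
            for _ in range(pop_size)
        ]
        # Remember the best solution seen so far.
        best = max(population + [best], key=one_max)
    return best


if __name__ == "__main__":
    solution = evolve()
    print(one_max(solution), solution)
```

Swapping in a different problem means replacing one_max and, if the encoding changes, the crossover and mutate operators; the loop itself stays the same.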

Problem domains

Problems which appear to be particularly appropriate for solution by genetic algorithms include timetabling and scheduling problems, and many scheduling software packages are based on GAs. GAs have also been applied to engineering. Genetic algorithms are often applied as an approach to solve global optimization problems. Genetic algorithms also have applications in:

• Software engineering
• Code-breaking, using the GA to search large solution spaces of ciphers for the one correct decryption.
• Distributed computer network topologies.
• Electronic circuit design, known as Evolvable hardware.
• File allocation for a distributed system.
• Game Theory Equilibrium Resolution.
• Learning robot behavior using genetic algorithms, and a lot more.

To probe deeper into the world of genetic algorithms, visit
http://cs.felk.cvut.cz/~xobitko/ga/