Thursday, September 13, 2007

My tracks for the second day


It is only two sessions left for day one at JavaZone and I have to make planning arrangements for day III lready. Hopefully, I will also be able to update my first blog entry with comments on my previous encounters at JavaZone 2007 as well.

Ok, we have a lot of nice sessions tomorrow as well, actually we are doing BOFS' until 21PM on Thursday, not bad!!
This is my schedule for September 13.

Time
Talk
Comment
9.00-10:00
Java Persistence 2.0 - What's New, Linda DeMichiel
10:15-11.15
Open source Enterprise Integration with Mule, Tijs Rademakers og Jos Dirksen
11:45-12:45
Deploying Maven in the Enterprise, Kristoffer Moum
13:00-14:00
None

14:15-15:15
How Do I Invoke a Method, and Yes I Am A Senior Developer, Ole Andre Ranvik, Bjørn Bjerkeli This is my own talk actually, I really recommend it, puts some things in perspective (I might be biased here :-)
15:45-16:45
Teaching computers How to Drive, Mary Poppendieck
Mind blowing
17:00-18:00
WS Duck Typing, Arjen Poutsma









Wednesday, September 12, 2007

Jim Coplien talking about "Agile Stuff"


The Agile track is covered pretty well at JavaZone 2007, we have a lot of good speakers doing agile-talks. This is very much thanks to Jessica Hildrum that seems to be very well connected. Jim has been looking into agile issues for the past 12 years and works for a Danish company called Nordija.
Most of the projects that I have been working on lately has been organized in an Agile matter, Jim goes into the some details on explaining some common myths on agile processes and what works and whatnot. The five Myths discussed are:

  • On-Site customer
  • User Stories
  • TDD
  • YAGNI
  • DLS's

Before Jim starts to talk out, I will poin out the ones that I wold tend to focus on, from a developers perspective will be YAGNI and TDD. Jim tells us that it doesn't work, he's go me curious.
Jim clains that TDD will destroy your system, which is a pretty bald claim, nonetheless interesting. He claims that you will end-up building an unmaintainable and procedural architecture, which does not lend itself to refactoring. Well, he is getting my attention. I simply do not agree. I think what Jim thinks about here is that Unit testing at the JUnit level is not really testing of customer requirements, and he is perfectly right. It is tests testing parts of the system implemented using classes to build an object oriented system. This means requirements translated into something computionally feasible. Maybe I am not using TDD properly, but I do not expect that my JUnit tests maps one-to-one to functional-requirements.

What Jim misses out on here, especially with respect to saying that YAGNI will kill you is what Mary Poppendieck states about eliminating waste. I am a strong believer in up-front design, but I am not a strong beleiver in building anticipated structures, especially without tests. I think you should really eliminate waste.




My track for the first day at JavaZone 2007

I have just came home from speakers dinner after talking to a lot of different people that are giving talks at javazone.no. I really didn't have time to figure out what tracks I will visit tomorrow, there are a lot of options, but I am trying to prepare i bit now. I finally was able to meet Tor Nordbye, Carl Quinn and Dick Wall from the JavaPosse, I am really looking forward to be present at the live podcast tomorrow evening.

Here are the shortlist of talks (selected from 7 parallell tracks) that I will probably attend the first day:

Time
Talk
9.00-10.0
Web 2.5 development with Java, Stephan Janssen
10.15-11.15
The State of Ajax, Dion Almaer and Ben Galbraith
13.00-14-00
Five practical solutions to Agile myths, James Coplien
14.15-15.15
Latest in Java Technology From Sun Microsystems , Simon Ritter og Angela M. Caicedo
15.45-16.45
Have to think about this one not sure actually
17.00-18.00
Scripting in Java 6 , Per Mengshoel og Thomas Heiberg
18.15-19.15
JavaPosse RoundUp, Live podcasting from Oslo (You would not want to miss this one)






Kicking of JavaZone 2007

Finaly it has launched, JavaZone 2007 is running with more than 120 speakers from all over the world, and more than 2200 attendants, make this a pretty large conference, at least according to Scandinavian standards. I have been participating the program-commitee myself and it is very fun to see everything come to a start. The program this year is pretty well balanced, Enterprise, Web , Agile and Hard-Core Java speakers from all over the world is here in large numbers.

There are definitely a lot of focus on Agile this year. On other conferences I have been attending during the last couple of years, there is always one or two subjects that stands out, but I don't find it to be as clear this year. There are definitely bias do use less heavyweight infrastructure in web-like deployment-scenarios that it used to be. There is also a lot of focus on Maven. We actually had to reject as much as five talks on Maven, because it was already so well-covered.

I spoke with the Javaposse crew-members yesterday, and my impression is that Maven has a strong position in Scandinavia, and Less in the US.

Regardless if you are interested in Maven or Ajax there should definitely be something for every taste in this years program, hope you will enjoy it, we enjoyed putting it together!




Monday, June 25, 2007

Continuous integration infrastructure, which OS to use

I have been on several projects lately where we have been using continuous integration-servers like CruiseControl and Continuum. This part of the development infrastructure is crucial, unless you are all by yourself (unlikely) you will need some server that will continuously integrate, build and deploy your code. This has been pretty much been a de-facto standard for quite a while now, and is not what I am trying to address here. What I am trying to address is, does the OS used for continuous integration-servers (CI servers) really matter? Yes it does and it should not run on Windows for reasons that should become obvious after reading this article.

It really matters, it is really not the question of which operating-system to use, but which operation system not to use. For some reason, it is not uncommon to configure and set-up CI-servers on Windows early in early development stages. I know several companies that only supports W2K/XP as part of their central operations, or Unix of course. But in case of Unix, the servers are often quite costly, cannot be dedicated for running CI, needs to be ordered and configured centrally and so on. Lately more and more companies are also supporting Linux as part of their infrastructure base, but it is not as widely supported as W2K/XP (I don't know anyone running Vista yet).

It is also quite common to use Windows as development desktop, even in environments that does not rely on windows-specific features. I am primarily developing in Java and for the most I use Windows desktops for development at client sites, not because I want to, but becose this is the company standard at most of my clients where I do contracting. At my home office I run Linux (Ubuntu) and Laptop runs Mac OSX. Given the fact that you will most likely end-up developing code using a Windows box when you are at work, it is also quite possible that it is easier to get access to a spare box for running one or several instances of your preferred CI-server if you go for Windows.

Well, so you go about and install CruiseControl, Continuum or whatever other software that you want to use, on Windows. Well, I have tried the same quite a few times now, and I can tell you, it simply doesn't work on Windows. Windows cannot and should not be used ever to run CI-server, because you will most certainly end up in a situation that some file is locked and that the CI server cannot run its goal because it cannot delete a file locked by some other process.
Another thing is long pathnames. Windows doesn't support pathnames that are longer than 255 characters, and I guarantee you, if you have a deep project structure you will almost certainly run into this problem. A third issue that may become a problem is access. When accessing a windows-box, people has this mental image that they will need to use explorer to actually view and edit contents of directories and files, which is just stupid. People will need to access the box when the test fails on the CI-sever and not locally, which is a quite common case (hard coded paths, locale-issues and network access in integration tests are common things). This means that developers will need to use RemoteDesktop to access the server. Well, only a limited amount on connections are available, and it is heavy on the server to support many simultaneous connections so this will quickly bring you into trouble as well. All these issues makes Windows an no-go for CI servers, it is simply not possible to maintain stable CI servers without file-locking, pathname-problems and locking because of simultaneous access. You will need a full time employee to do the maintenance or go with Unix or Linux.

What should I do?

Well, an option is to take any Unix/Linux box available and use that for continuous integration instead. It will make all the problems mentioned in the previous paragraph go away. File locking is not a problem, pathnames can be as long as you want (almost true) and since everyone is using ssh, you could support a high number of simultaneous users viewing the build-log files using terminal access (which is very lightweight) instead of RemoteDesktop. The short answer is go with Unix based system. But what if that is not possible?
That was the case at previous engagement. Going for Linux was a no-op. And if we wanted to go for Unix, Solaris was the way to go. But we would never get root-access to that box and they have to order it first....*sigh*. Well, first of all.

My experience is that configuring CI-servers are generally best left as a task to the development team, preferably someone that has done this a couple of times before. I don't think it is possible to put these kind of things on order at your system-administration department, those responsible for test and production environments. It is probably outsourced anyway, so you will have to fill out a form before you are even allowed to all them. So much for agile.
Well, so what did we do. We went for Virtualization using VMWare. VMWare gives you the ability to run any number for operating systems, called the guest OS, by using a player. The OS running the player is referred to as the the host OS. Here we were able to use a Windows-box running Ubuntu-Linux (my choice of course :-) as a guest operating system. After the guest operating system has been started it behaves looks an performs like a full-blown standalone OS without all the quirks mentioned earlier. The Virtual image was created with VMWare Workstation and may be executed with either VMWare Workstation or VMWare Player which is free. The process of installing the operation system (Ubuntu Feisty) is identical to what you would do when you install it from scratch. I will not cover the installation details here, but briefly reflect on some additional benefit that may be provided when using VMware when running CI-servers (these issues may also apply to running appliances in general). Note that this also makes Windows boxes a viable alternative as a host OS.

Benefits of running a virtualized CI-server

In my experience, CI servers are not treated as production-goods my most organizations, although they should, This often results in decisions like, hey we need to use that box we gave you last week, can't you just use this one instead? The point is, it should be really easy to change hardware. In VMWare your entire installation is basically just a couple of files in a directory. If you copy and transfer those files to a new box, you can start the guest-OS using the player there instead. The only thing you will have to install on the new box is the VMWare Player.

Staging or upgrading such environments are often also problematic. This is actually production software for your development team. If you want to upgrade the CI server with lets' say a new version of Continuum, you can do so my taking a snapshot of your current image. Copy that image and start it up on your desktop. Here you can do whatever changes that you want to the guest operating system, including upgrading to the newest version of Maven and Continuum. You can even dry test the build on your local machine. If it work, you are all set to offload an updated snapshot to the CI-production server. By using VMWare in this way, it really gives the term "staging" true meaning.
Based upon experience for several years running large projects with CI servers, this is really the best approach. Having a build CI-server going to a stop caused by reasons not related to code or test uncovering actual problems over and over a again will quickly render your CI infrastructure useless, people will stop to trust it, and this is not what you want.








Friday, May 04, 2007

Upgrading Ubuntu from Edgy to Feisty


Last week I upgraded my my Ubuntu installation from Edgy to Feisty. I had been playing around quite a bit to get Compiz and Beryl to work whithout much luck on Edgy. Some time ago I also tried to install the beta of Feisty, but i could not get my Radeon X1800 card working properly, the desktop was a mess. Anyways, I am occasionally using the Synaptic package manager, and not to long ago it asked me if I wanted to upgrade from Edgy Eft, which is 6.10 to Feisty running on 7.0.4. Apart from a few new tools, more adequate support for restricted drivers, like my ATI -card and build-in support for Compiz and Beryl, I was tempted to give it a go.

I have had my Dual screen desktop running 2 x 22" screen monitors running with pristine colors for some time, but withould 3D effects. After upgrading however, I could just see that it was not working properly. It did not render fonts properly in the dual heda mode, but as working fine when using just one monitor. Disappointing, I was really looking forward to getting Compiz and Beryl up and running, instead I was actuallu set back to start. Now I am waiting for the system to download all updates after I reinstalled the entire system from scratch.

So if you are using an ATI Radon X1800/1900 card, do not upgrade to Feisty if you are using a dual-head configuration on you displays. Now I have been wise enough to configure a separate partition where I can engage myself in a feasibility study before I actually see if it is working.

Anyways, after switching from XP 64, which did not work at all!! I am very satisfied with my Ubuntu distro, I really want to recommend it.



Monday, March 19, 2007

Unit testing, a lot of developers don't get it


As a developer I quite often stumble across code that I would like to rewrite. The reasons why I want to rewrite the piece of code, may differ from case to case, reduce coupling and remove duplication are well known reasons. Lately I have been doing a lot of refactoring and one thing that stands out is how much terrible test code it is out there. It seems as if developers think that if they just have written a test for it, and the code coverage if good, then everything is good. This is dead wrong. I have seen people backing away from changing bad code just because they were backed by brittle and hard to understand tests.

Doing refactoring requires that tests are present. In systems using code written less than two or maybe three years ago, it seems to be quite rare that the code is not been backed by tests to some extension. But the way tests are written varies greatly. Having tests is a necessary but not sufficient prerequisite to do effective refactoring. In the latter cases where I have been doing requirements it strikes me as if a lot of developers do not understand that the same principles that they apply when writing system-code, should also be applied when writing test-code. It is all code, and should be treated as such. I have seen by far too much duplication, lengthy pieces of code iterating over collections just to add a conditional and assert for the particular element they where looking for, and by far to much cut-copy paste, ecpecialy in test-fixtures.

I am am doing a Lightening Talk on how to simplify unit testing on Oslo XP Meetup on 26 March. I will discuss small things you can use to write more fluent tests, tests that are easier to read and that requires less code. When I was reading up on different aspects on Unit testing, I came over this great article that was linked from unit.org. It really explains in sufficient details code smells in test-code and how to refactor them. Read it and distribute it to your peer developers.





Sunday, March 04, 2007

Effective pagination using Hibernate

This article presents with an solution on how to page through large amounts of data effectively without taking down your database and without keeping it all in memory.

Last major update 07.09.2008

I have now added references that describes how to retrieve the source-code for the examples. All source-code is released under the Creative Commons License as can be seen from the Licensing details, the source code can be retrieved from http://svitjod.rymdweb.com/repository/com/zenior/sandbox/ which contains the source-code in maven form. To include a dependency to the source code just include:

<repository>

  <id>internal</id>
  <name>Svitjod</name>
  <layout>default</layout>
  <url>http://svitjod.rymdweb.com/repository</url>
  <snapshots>
    <enabled>false</enabled>
  </snapshots>

</repository>

and 

 <dependency>
  <groupId>com.zenior.sandbox</groupId>
  <artifactId>example-domain</artifactId>
  <version>1.0-SNAPSHOT</version>
  <scope>compile</scope>
  </dependency>

 <dependency>
  <groupId>com.zenior.sandbox</groupId>
  <artifactId>hibernate-sandbox</artifactId>
  <version>1.0-SNAPSHOT</version>
  <scope>compile</scope>
  </dependency>

To your project pom.xml. 

If do not use maven, you may just download the source code by looking for files ending with -sources in the maven repository mentioned above.

Abstract

Dealing effectively with large resultsets is important. Batch applications may require to process millions of rows as part of a transaction.  The brute force approach would be read everything up in memory before iterating over theelements. This approach will effectively kill performance and use far to much memory. One common approach when using Hibernate is to use options in the Criteria interface called setFirstResult/setMaxResult. This will generate SQL (at least for Oracle) where the database will actually have to read and throw away data before returning them to the client. Another approach is to use ScrollableResults, but here you will have to keep the connection open during the entire pagination process. By retrieving a collection of all the primary keys for the matching object, you will need to perform an extra query before actually retrieving the data, but it is very easy to generate effective SQL to retrieve using this approach. This is by far the best solution in terms of database-load and memory footprint, and it is easy to implement in a portable manner.

The problem

In order to have a rich domain model, we want relationships between domain objects to be implemented as relationships between the Java classes. If we are going to be able to do that without killing performance, we have be smart in how we deal with relationships having high cardinality.

Most importantly we must iterate through a large resultset without keeping all data in memory. We also want the approach to generate simple SQL, and avoid using esoteric options that make the SQL difficult to optimize or difficult to port. We might also need adjustments to the mapping strategy and relationships to retrieve data effectively.

It is assumed that the reader is familiar with Hibernate and with common ORM issues like n+1 selects.

The example

An order application processing an order with more than 100.000 order lines.We will see how different solutions differ with respect to memory consumption, resource usage in general, effectiveness in terms of execution and differences in how the SQL is generated. The solution will be using Java 6, Hibernate3.2.0GA with annotations, the Spring framework and Oracle XE.

I will test three different approaches that don't use custom code and for each approach say something about execution time, memory consumption and explain the inner details.
As a domain model for the test I have been using a domain-model representing a generic order-domain as shown in figure 1.

Figure 1
The model is simplified but representative.

Test fixture

I have used Hibernate and Springs AbstractTransactionalSpringContextTeststopopulate the test fixture that has been inserted into the database. Idon't want the unit test to include timing for actually setting up the data.Therefore I have executed the fixture-setup by adjusting the test-case slightly so that the fixture is actually committed, by using the following approach:
  1 protected void onSetUpInTransaction() throws Exception {
2 if (!isTestDataPresent()) {
3 createOrderWithOrderLines(1000 * 1000);
4 setComplete();
5 endTransaction();
6 }
7 }
Listing 1
By using the AbstractTransactionalSpringContextTests I can actually force the test itself to commit the data, which may sometimes be useful when inspecting persistent side-effects to data when executing tests. The methods setComplete() and endTransaction() will force the execution to commit after the test is finished. To make it idempotent, I have added a method that checks f the test-data is already there.

The brute-force approach yields the following test-results on my workstation, which is a custom built AMD Athlon 64 X2 4600, 4GB Ram, Dual Raptor 150GB SATA in a Raid 0 setup. The tests have been launched from IntelliJ 6.0.4 and with the Oracle XE instance running on the same machine. The test runner is launched in a spawned process using a maximum of 1 GB RAM (-Xmx1G).


Brute force

In the first approach, we process the entire order without thinking about memory consumption when we retrieve OrderLine instances from Hibernate.
This solution basically uses the one-to-many relational mapping and leaves it up to Hibernate to retrieve the data.

The mapping is defined as follows:
  1   @OneToMany(mappedBy = "order", cascade = CascadeType.ALL, fetch = FetchType.LAZY)
2 public Set getOrderLines() {
3 return orderLines;
4 }
Listing 2
And on the OrderLine side of the bi-directional relationship, we have:
  1  @ManyToOne(fetch = FetchType.EAGER)
2 public Order getOrder() {
3 return order;
4 }

Listing 3

The test that gathers timing and resource consumption information is a simple JUnit-test method:

 

  1 public void testRetrieveOrderBruteForce() {
2 Order order = (Order) hibernateTemplate.get(Order.class, orderId);
3 ResourceMonitor monitor = new ResourceMonitor("order.getOrderLines()");
4 assertEquals(1000*1000,order.getOrderLines().size());
5 System.out.println(monitor.stop());
6 }

Listing 4


Time elapsed
Memory usage delta
79,5s
501000kb
Table 1 Test results brute-force approach

As we can see, it is quite fast to retrieve all the data in just a single select. We have retrieved 1000000 OrderLine instances in close to 80-seconds. The problem with this approach is that it is memory intensive. On most 32-bit systems, there is a practical upper limit of around 2GB heap, which will make it difficult to have more than three such queries running simultaneously [1].
In addition to this, when using a StatefulSession in Hibernate to keep track of updated fields, keeping this much data in memory at the same time will eventually halt your system or make it very slow [2]. Note that making the relationship lazy (which is the default behavior in Hibernate) will not change that fact that everything is fetched using one single select.

Using Criteria search limited by firstResult and maxResult

In order to retrieve the OrderLine side of the relationship using a Criteria query, we clearly cannot have a Hibernate-managed relationship between Order and OrderLine, we will have to manage this relationship ourselves. The most visible consequence of this will be that the Set<OrderLine> property of Order will have to be removed.This is unfortunate because it removes an important aspect of the domain-model, which is the explicit composition that constitutes the relationship.


When dealing with cases like this, I would like to be explicit about the relationship also in the domain of the Order-class, so other than removing the getOrderLines()method from the Order class, I would prefer changing it to something like this shown in listing 5.
  1 public Set<OrderLine> getOrderLines(HibernateTemplate template) {
2 ....
3 }
Listing 5

Here it is still quite clear that Order has a relationship to OrderLine. Still, however, we will have to implement the paging mechanism using a Colllection-implementation, which preserves a pretty decent domain-model with respect to an explicit relationship between Order and OrderLine. This might turn out to be a good thing, because it will probably easier to refactor the getOrderLines method to be extended to take a HibernateTemplate as an argument instead of in lining some other implementation whenever we need to retrieve OrderLine instances belonging to an Order.
An even better approach here would be to de-couple the dependency to Hibernate completely.

I discovered this after implementing the getOrderLines() - method, taking hibernateTemplate as a parameter. The example-domain module that contains the order-domain related classes, did not have a dependency to Hibernate, which I did not want to introduce because I wanted the example-domain to be as lean as possible. In order to compare multiple strategies for paging using the same domain-classes I had to introduce an interface that made plugging different strategies as in listing 6.
 1 public Collection getOrderLines(CollectionBuilder collectionBuilder) {
2 return collectionBuilder.build(this);
3 }


1 public interface CollectionBuilder {
2 /**
3 * Build the collection using information from the given instance.
4 * @param instance
5 * @return
6 */
7 Collection build(Object owner);
8 }

Listing 6

Using the Collection-builder interface, I could easily provide a Hibernate based implementation of the different strategies without introducing a Hibernate dependency into the domain-classes. Here it would be even more visible that the one-to-many collection is owned by an order instance, signified by the owner parameter of the build-method.

To support paging by restricting the underlying resultset Hibernate uses a mechanism by using the Dialect interface. This mechanism varies greatly between different RDBMS'es. The Orcale9Dialect uses a construct in oracle called rownum restrictions, which may be found in the getLimitString-method of the Dialect implementation given in listing 7 (listed only in part).
  1 public String getLimitString(String sql, boolean hasOffset) {
2 ....
10 StringBuffer pagingSelect = new StringBuffer( sql.length()+100 );
11 if (hasOffset) {
12 pagingSelect.append("select * from ( select row_.*, rownum rownum_ from ( ");
13 }
14 else {
15 pagingSelect.append("select * from ( ");
16 }
17 pagingSelect.append(sql);
........
30 }
31
Listing 7

Without careful indexing and dependent upon the underlying RDBMS you might end up with a situation that may force the RDBMS to scan all the data for the entire set of OrderLine items each time in order to return the subset limited by setFirstResult and setMaxResult.
Here is an implementation of a Collection implementation that may page the underlying data  based upon limiting the page size in the way mentioned above:

  1 public class PagingCollection extends AbstractCollection {
2 private HibernateTemplate hibernateTemplate;
3 private HibernateCriteriaSpecification specificationCriteria;
4 private int pageSize = 2000;
5 private Integer collectionSize;
6 private Criteria queryCriteria;
7
8
9
public PagingCollection(HibernateTemplate hibernateTemplate, HibernateCriteriaSpecifiaa criteria, int pageSize)
10 this.specificationCriteria = criteria;
11 this.hibernateTemplate = hibernateTemplate;
12 this.pageSize = pageSize;
13 this.queryCriteria = criteria.populate(createCriteria());
14 }
15
16
public Iterator iterator() {
17 return new PagedIterator(size());
18 }
19
20
public int size() {
21 if (collectionSize == null) {
22 Criteria criteria = specificationCriteria.populate(createCriteria());
23 criteria.setProjection(Projections.countDistinct("id"));
24 collectionSize = (Integer) criteria.uniqueResult();
25 }
26 return collectionSize;
27 }
28
29
private Criteria createCriteria() {
30 Criteria criteria = (Criteria) hibernateTemplate.execute(new HibernateCallback() {
31 public Object doInHibernate(Session session) throws HibernateException, SQLException {
32 return session.createCriteria(specificationCriteria.getPersistentClass());
33 }
34 });
35 return criteria;
36 }
37
38
private class PagedIterator implements Iterator {
39 public boolean lastPage = false;
40 private int offset;
41 private List currentCollection;
42 private Iterator currentIterator;
43 private int size;
44
45
46
public PagedIterator(int size) {
47 this.size = size;
48 this.currentCollection = loadPage(0);
49 this.currentIterator = currentCollection.iterator();
50 }
51
52
protected List loadPage(int offset) {
53 if (offset > size) {
54 throw new NoSuchElementException();
55 }
56 this.offset = offset;
57 queryCriteria.setFirstResult(offset);
58 queryCriteria.setMaxResults(pageSize);
59 currentCollection = queryCriteria.list();
60 if ( (this.offset + pageSize )>= size) {
61 lastPage = true;
62 }
63 return currentCollection;
64 }
65
66
public boolean hasNext() {
67 return size() > 0 && (currentIterator.hasNext() || !lastPage);
68 }
69
70
public Object next() {
71 if (!currentIterator.hasNext()) {
72 unloadPage(currentCollection);
73 currentCollection = loadPage(offset + pageSize);
74 currentIterator = currentCollection.iterator();
75 }
76 return currentIterator.next();
77 }
78
79
protected void unloadPage(List currentCollection) {
80 for (Object o : currentCollection) {
81 hibernateTemplate.evict(o);
82 }
83 }
84
85
public void remove() {
86 throw new UnsupportedOperationException();
87 }
88 }
89 }
Listing 8

The test case looks as follows, note that we have to iterate the entire set if we want to poll everything out from the database. The size() method just returns an aggregated value by issuing a count.
  1  public void testRetrieveOrderLinesUsingPaging() {
2 Order order = (Order) hibernateTemplate.get(Order.class, orderId);
3 Collection orderLines = order.getOrderLines(hibernateTemplate, 50000);
4
5
ResourceMonitor monitor = new ResourceMonitor("order.getOrderLines()");
6 for (Object orderLine : orderLines) {
7 //
8 OrderLine line = (OrderLine) orderLine;
9 }
10 assertEquals(1000 * 1000, orderLines.size());
11 monitor.stop();
12 System.out.println(monitor);
13
14
}
Listing 9

Without tuning the underlying database with respect to query optimization, the test result looks like this:
Time elapsed
Memory usage delta
381s
29791kb
Table 2 Test result using setFirstResult/setMaxResult

Note that the footprint is reduced to below one tenth, but the response-time is not satisfactory. In this case a new query will be issued across each page boundary, in this case each page being 50.000 elements. It turns out that executing the SQL to retrieve one single page is almost as expensive as retrieving all rows in a single select. To understand why, we have to examine the SQL.
 1 SELECT *
2 FROM
3 (SELECT row_.*,
4 rownum rownum_
5 FROM
6 (SELECT *
7 FROM orderline o
8 WHERE o.order_id = order2_.id)
9 FROM orderline this_ LEFT
10 OUTER JOIN customer_order order2_ ON this_.order_id = order2_.id LEFT
11 WHERE this_.order_id = 10)
12 row_
13 WHERE rownum <= 100000)
14 WHERE rownum_ > 50000;
Listing 10

The SQL in listing 10 has been simplified from what Hibernate generates, but the most important attribute is the nested select with the rownum-restriction, which is shown. The query above will retrieve the second page in the select, the rows ranging from 50000 to 100000. Note the nested query where the first (innermost) select will return all rows with a rownum less than 100000. This is followed by a filter on the inner result, returning only the rows counting from 50000 and above. This means that the RDBMS will actually read and throw away most of the rows for each query. For this to be implemented consistently, we also need to make sure that the same select will return the same rows repeatedly. This can only be done if we add an order by clause to the select above (which is not included in this case). See [3] for a discussion on how rownum works in Oracle.

Using this approach for large results will effectively use up all your database resources.

Using ScrollableResults

The approach that uses ScrollableReults relies on the underlying JDBC-driver providing support for some kind of cursor that may allow you to retrieve data in an incremental manner without keeping everything in memory. A ScrollableResults can beretrieved from a Criteria implementation by using one of the Criteria.scroll(..) methods.Support for this may vary. Old versions of the JDBC-drivers in Oracle has implemented this as a client side cache that keeps everything in memory.Whereas newer-versions implements this using server-side cursors.

The implementation that uses the ScrollableResult is straightforward. The CollectionBuilder implementation is shown in Listing 11 and the actual Collection implementation in Listing 12.
 1 public class OrderLineScrolleableResultSetCollectionBuilder implements CollectionBuilder {
2 private HibernateTemplate hibernateTemplate;
3 private int pageSize;
4
5
public OrderLineScrolleableResultSetCollectionBuilder(HibernateTemplate hibernateTemplate, int pageSize) {
6 this.hibernateTemplate = hibernateTemplate;
7 this.pageSize = pageSize;
8 }
9
10
public Collection build(Object instance) {
11 Order order = (Order) instance;
12 return new ScrollableResultsCollection(hibernateTemplate, new HibernateCriteriaSpecification(OrderLine.class, order, "order"), pageSize);
13 }
14 }
15
Listing 11
  1 public class ScrollableResultsCollection extends AbstractCollection {
2 private HibernateTemplate hibernateTemplate;
3 private HibernateCriteriaSpecification specificationCriteria;
4 private int pageSize = 2000;
5 private Integer collectionSize = null;
6 private Criteria queryCriteria;
7
8
9
public ScrollableResultsCollection(HibernateTemplate hibernateTemplate, HibernateCriteriaSpecification criteria, int pageSize) {
10 this.specificationCriteria = criteria;
11 this.hibernateTemplate = hibernateTemplate;
12 this.pageSize = pageSize;
13 this.queryCriteria = criteria.populate(createCriteria());
14 }
15
16
public Iterator iterator() {
17 return new PagedIterator(size());
18 }
19
20
public synchronized int size() {
21 if (collectionSize == null) {
22 Criteria criteria = specificationCriteria.populate(createCriteria());
23 criteria.setProjection(Projections.countDistinct("id"));
24 collectionSize = (Integer) criteria.uniqueResult();
25
26
}
27 return collectionSize;
28 }
29
30
private Criteria createCriteria() {
31 Criteria criteria = (Criteria) hibernateTemplate.execute(new HibernateCallback() {
32 public Object doInHibernate(Session session) throws HibernateException, SQLException {
33 return session.createCriteria(specificationCriteria.getPersistentClass());
34 }
35 });
36 return criteria;
37 }
38
39
private class PagedIterator implements Iterator {
40 public boolean lastPage = false;
41 private int offset;
42 private List currentCollection;
43 private Iterator currentIterator;
44 private int size;
45 private ScrollableResults scrollebleResults;
46
47
48
public PagedIterator(int size) {
49 this.size = size;
50 this.scrollebleResults = queryCriteria.scroll(ScrollMode.FORWARD_ONLY);
51 this.currentCollection = loadPage(0);
52 this.currentIterator = currentCollection.iterator();
53
54
}
55
56
protected List loadPage(int offset) {
57 if (offset > size) {
58 throw new NoSuchElementException();
59 }
60 this.offset = offset;
61 currentCollection = new ArrayList(pageSize);
62 System.out.println("Loading offset ["+offset+"]");
63 if ((this.offset + pageSize) >= size) {
64 lastPage = true;
65 }
66 int scrollSize = lastPage ? size - offset: pageSize;
67 for (int i = 0; i < scrollSize; i++) {
68 if (!scrollebleResults.next()) {
69 throw new NoSuchElementException();
70 }
71 currentCollection.add(scrollebleResults.get(0));
72 }
73
74
currentIterator = currentCollection.iterator();
75 return currentCollection;
76 }
77
78
public boolean hasNext() {
79 return size() > 0 && (currentIterator.hasNext() || !lastPage);
80 }
81
82
public Object next() {
83 if (!currentIterator.hasNext()) {
84 unloadPage(currentCollection);
85 currentCollection = loadPage(offset + pageSize);
86 currentIterator = currentCollection.iterator();
87 }
88 return currentIterator.next();
89 }
90
91
protected void unloadPage(List currentCollection) {
92 for (Object o : currentCollection) {
93 hibernateTemplate.evict(o);
94 }
95 }
96
97
public void remove() {
98 throw new UnsupportedOperationException();
99 }
100 }
101 }
102
Listing 12
The ScrollableResults -interface is basically just a low-level approach in mapping row-based data into domain-objects. The instantiated and mapped objects are retrieved using the ScrollableResult.get(..)method. The get-method retrieves objects index from 0. It may be possible to retrieve several objects from a result-set, because hibernate generates queries that provides data for several objects when using outer-join fetching to avoid multiple roundtrips to the database.

Let us say that we extend the model presented in Figure 1 with a one-to-many relation relation to OrderLineDetail as shown in Figure 2.

Figure 2

In tis case, to get all the data from OrderDetail using one select and outer-join fetching, will make the returned select return four rows in total with the data for OrderLine repeated once for each instance caused by the outer-join with OrderDetail. This means that we lack a layer of abstraction here. We cannot implement this in a generic manner. We will actually have to alter the implementation of the strategy implemented to retrieve OrderLine instances just because we did a small change to the domain-model, adding the OrderDetail -relation.

The ScrollableResults is backed by a ResultSet and needs the connection to be open during the iteration of the query. This will pose some challenges when implementing paging spanning multiple web-requests, typically paging in a web-page, because the ScrollableResult will need to be valid and usable across requests.

Even though the approach is fast and lean on memory as can be seen by looking at the test-results in table 3, these limitations will probably get you in to problems except from in the simplest cases.

Time elapsed
Memory usage delta
89s
26009

Table 3


The solution

The most effective solution is to use the primary key as a paging criterion.This enables us to rely of first class constructs like a between range query which is simple for the RDBMS to optimize, the primary key of the queried entity will most likely be indexed already.
Retrieving data using a range query on the primary key is a two-step process. First one have to retrieve the collection of primary-keys, followed by a step to generate the intervals to properly identify a proper subset of the data,followed by the actual queries against the data.
  1 package com.zenior.sandbox.hibernate.pagination;
2
3
import no.zenior.domain.HibernateCriteriaSpecification;
4 import org.springframework.orm.hibernate3.HibernateTemplate;
5
6
import java.util.AbstractCollection;
7 import java.util.ArrayList;
8 import java.util.Collection;
9 import java.util.Iterator;
10
11
/**
12 * Collection implementation that iterates using a range query for available identifiers
13 * for the queried collection.
14 * </br>
15 *
16 * @author Bjørn Bjerkeli
17 */
18 public class IdIntervalCollection extends AbstractCollection {
19 private static final int DEFAULT_PAGESIZE = 2000;
20 private HibernateTemplate hibernateTemplate;
21 private HibernateCriteriaSpecification finderSpecification;
22 private int pageSize = DEFAULT_PAGESIZE;
23 private IdIntervalCollectionBuilder pager;
24 private int size;
25
26
/**
27 * @param hibernateTemplate
28 * @param finderSpecification
29 */
30 public IdIntervalCollection(HibernateTemplate hibernateTemplate,
31 HibernateCriteriaSpecification finderSpecification) {
32 this(hibernateTemplate, finderSpecification, DEFAULT_PAGESIZE);
33 }
34
35
/**
36 * @param hibernateTemplate
37 * @param finderSpecification
38 * @param pageSize
39 */
40 public IdIntervalCollection(HibernateTemplate hibernateTemplate,
41 HibernateCriteriaSpecification finderSpecification, int pageSize) {
42 this.hibernateTemplate = hibernateTemplate;
43 this.finderSpecification = finderSpecification;
44 this.pageSize = pageSize;
45 initialize();
46 }
47
48
/**
49 * Initialize the instance of BucketIdIntervalPager.
50 */
51 private void initialize() {
52 Collection identifiers = HibernateRepositoryTool.idProjection(hibernateTemplate, finderSpecification);
53 this.pager = new IdIntervalCollectionBuilder(pageSize);
54 for (Object identifier : identifiers) {
55 pager.add((Long) identifier);
56 }
57 this.size = identifiers.size();
58 hibernateTemplate.clear();
59 }
60
61
62
63
public int size() {
64 return size;
65 }
66
67
public Iterator iterator() {
68 return new Iterator() {
69 private Collection currentPage = new ArrayList();
70 private Iterator currentIterator = currentPage.iterator();
71 private Iterator currentIdIntervalIterator = new ArrayList(pager.buildIdIntervals()).iterator();
72 private IdInterval currentIdInterval;
73
74
public void remove() {
75 throw new UnsupportedOperationException();
76 }
77
78
public boolean hasNext() {
79 return ((size > 0) && currentIdIntervalIterator.hasNext())
80 || currentIterator.hasNext();
81 }
82
83
public Object next() {
84 if (!currentIterator.hasNext()) {
85 unloadPage(currentPage);
86 currentIdInterval = (IdInterval) currentIdIntervalIterator.next();
87 loadPage(currentIdInterval);
88 }
89
90
return currentIterator.next();
91 }
92
93
protected void unloadPage(Collection currentPage) {
94 for (Object o : currentPage) {
95 hibernateTemplate.evict(o);
96 }
97 }
98
99
/**
100 * Load the data corresponding to the idInterval.
101 */
102 protected void loadPage(IdInterval currentIdInterval) {
103 finderSpecification.setIdInterval(currentIdInterval);
104 currentPage = HibernateRepositoryTool.find(hibernateTemplate, finderSpecification);
105 currentIterator = currentPage.iterator();
106 }
107 };
108 }
109 }
110
111
Listing 13

The key here is basically to implement the iteration process in two steps, one to retrieve identifiers, and one to actually provide the iterator with an id-interval that is used tospecify a between range-query. Retrieving the set of identifiers is easy, the HibernateCriteriaSpecification that uses Criteria for search has a Projections.id()factory that returns all identifiers for a given Criteria-query. Basedupon the entire set of identifiers we are able to construct buckets of identifiers each representing a logical and proper page of data.
  1  public void testRetrieveOrderLinesUsingPagingIdInterval() {
2 final Order order = (Order) hibernateTemplate.get(Order.class, orderId);
3 Collection orderLines = order.getOrderLines(new IdIntervalPagedCollectionBuilder(hibernateTemplate, 50000));
4
5
ResourceMonitor monitor = new ResourceMonitor("order.getOrderLines()");
6 for (Object orderLine : orderLines) {
7 //
8 OrderLine line = (OrderLine) orderLine;
9 }
10 assertEquals(1000 * 1000, orderLines.size());
11 monitor.stop();
12 System.out.println(monitor);
13
14 }
Listing 14

The IdIntervalPagedCollectionBuilder is the pluggable implementation of the CollectionBuilder interface described in listing 6.
 1 public class IdIntervalPagedCollectionBuilder implements CollectionBuilder {
2 private HibernateTemplate hibernateTemplate;
3 private int pageSize;
4
5
public IdIntervalPagedCollectionBuilder(HibernateTemplate hibernateTemplate, int pageSize) {
6 this.hibernateTemplate = hibernateTemplate;
7 this.pageSize = pageSize;
8 }
9
10
public Collection build(Object instance) {
11 return new IdIntervalCollection(hibernateTemplate,
12 new HibernateCriteriaSpecification(OrderLine.class, instance, "order"), pageSize);
13 }
14 }


Listing 15

The three examples have almost the same amount of code, but rely onquite different approaches. The test-result for the last test is remarkable.
Time elapsed
Memory Usage Delta
99,5s
29402
This approach is almost as fast as the brute-force version. The memory consumption is about one tenth. By selecting the appropriate page-size for this implementation, you may alter the ratio between execution time and memory consumption. This version is also stateless, it does not keep references to resources like the ScrollableResult version does, nor does it strain the database like the version using setFirstResult/setMaxResult.

Limitations

We need a primary key, preferably a surrogate key, that has a natural ordering so that we can implement the paging concept by using a between search for the identifiers. Another limitation is the need to actually retrieve all the identifiers before performing the search, which means two select-statements instead of possibly only one.

The Author

The Author is working as a senior consultant in a Norwegian consulting company, Zenior AS, and is an active member of the Norwegian JUG, javaBin.

Licensing

Creative Commons License
Hibernate paging using id-intervals by Bjørn Bjerkeli is licensed under a Creative Commons Attribution-Share Alike 3.0 Norway License.
Based on a work at svitjod.rymdweb.com.

References

[1] Memory limitation on 32 bit systems, http://www.theserverside.com/discussions/thread.tss?thread_id=19442
[2] Hibernate batch processing http://www.hibernate.org/hib_docs/v3/reference/en/html/batch.html
[3] Oracle rownum http://www.oracle.com/technology/oramag/oracle/06-sep/o56asktom.html