Category Archives: Taxonomy Management

MOSS 2007 for ECM – Can MOSS handle 15 Million documents?

The simple answer to this question is yes, MOSS 2007 can easily handle managing 15 million documents. In fact, MOSS 2007 Enterprise Search can index 50 million documents; see MOSS 2007 Limitations. Channel 9 (MSDN) has an interesting video titled Office SharePoint Server at Microsoft: 12TB and Counting, which I recommend you watch. It’s not going to give you all the answers, but it will help you understand what is involved in managing such a large amount of content.

An extensive amount of initial planning and information architecture would need to be put in place for any ECM system that was to manage 15 million documents. But done right, you can manage that volume too. You will also find that, if you put the time in up front, future operations and maintenance effort will be significantly reduced.

Planning and Information Architecture

To be effective and useful, there is a great deal of planning and information architecture that needs to be thought through. By planning, I am referring to team, governance, policies, procedures, search configuration, infrastructure and so on. With regard to information architecture, be it 1,500 documents, 15,000 documents or 15 million documents, without the appropriate content classification, permissions and ownership, complete failure is imminent.

Please, please, please… do not simply take the structure of a file system and duplicate that in SharePoint. Sure, you immediately benefit from versioning, recycle bin, auditing and other enhanced features, but users will not be able to find files any better than they could on your file system. One of the key benefits of a Document Management System (DMS) such as MOSS is its ability to return relevant search results. You will not obtain this without content classification.

If you read MOSS 2007 Limitations, you will see there are limitations to the number of documents (files) that can be stored in a single Document Library.  It is also a recommendation, or limitation, to store no more than 2,000 documents (files) in a folder.  However, if your user browses to a Document Library folder and it does contain 2,000 documents, they may not be very happy with the results; both performance and having to page through so many documents.  In short, don’t do it to them…
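To put rough numbers on this, here is a minimal Python sketch of the arithmetic for 15 million documents, using the per-library and per-folder figures quoted in this post. The function name and the idea of filling every container right up to its ceiling are my own assumptions for illustration; a real design would group by topic and leave plenty of headroom.

```python
# Figures quoted in this post (recommendations, not necessarily hard limits).
DOCS_PER_LIBRARY = 2_000_000
DOCS_PER_FOLDER = 2_000


def ceil_div(numerator: int, denominator: int) -> int:
    """Integer division that rounds up instead of down."""
    return -(-numerator // denominator)


def minimum_containers(total_docs: int) -> dict:
    """Fewest libraries and folders needed if each one is filled to its limit.

    Hypothetical helper for illustration only; it is not part of any SharePoint API.
    """
    libraries = ceil_div(total_docs, DOCS_PER_LIBRARY)
    folders = ceil_div(total_docs, DOCS_PER_FOLDER)
    return {
        "libraries": libraries,
        "folders": folders,
        "folders_per_library": ceil_div(folders, libraries),
    }


plan = minimum_containers(15_000_000)
print(plan)  # {'libraries': 8, 'folders': 7500, 'folders_per_library': 938}
```

Even in this best case you are looking at thousands of folders, which is exactly why the structure needs to be designed around how people look for content rather than copied from a file share.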

Studies have shown that there are two types of users in your organisation, those who prefer to point-n-click their way to content and those who prefer to search.  It is important to implement a solution that works with both these types of users in mind.  In this type of situation, I would recommend breaking the documents out into topics.

Some of the limitations outlined in MOSS 2007 Limitations may, in fact, not be hard limitations. For example, Microsoft has indicated that a Document Library can store a maximum of 2 million documents (files). I know this is not a hard limit because I have a customer with over 3.3 million documents in a single Document Library. So… there is additional investigative work to be done here.

Limits of MOSS 2007

I’ve been doing some research and testing on MOSS 2007 in preparation for our new (soon to be announced) Records Management framework for MOSS 2007. As with virtually every application, there are limitations as to what it can do, and Microsoft Office SharePoint Server is no exception. However, I think you will see these limitations are such that they will have little impact in virtually any environment.

Microsoft Office SharePoint Server (MOSS) 2007 Limits

Site Collections in a Web Application: 50,000
Sites in a Site Collection: 250,000
Sub-sites nested under a Site: 2,000
Lists on a Site: 2,000
Items in a List: 10,000,000
Documents in a Library: 2,000,000
Documents in a Folder: 2,000
Maximum document file size: 2 GB
Documents in an Index: 50,000,000
Search Scopes: 1,000
User Profiles: 5,000,000
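
If you want to sanity-check a planned deployment against these figures, something like the following Python sketch would do. The dictionary keys and the find_violations function are hypothetical names I have chosen for this example, and the numbers are simply the ones listed above, which, as noted elsewhere on this page, may be recommendations rather than hard limits.

```python
# Limits as listed above; the key names are hypothetical, chosen for this sketch.
# The 2 GB file size limit is left out because it is not a simple count.
MOSS_2007_LIMITS = {
    "site_collections_per_web_application": 50_000,
    "sites_per_site_collection": 250_000,
    "subsites_per_site": 2_000,
    "lists_per_site": 2_000,
    "items_per_list": 10_000_000,
    "documents_per_library": 2_000_000,
    "documents_per_folder": 2_000,
    "documents_per_index": 50_000_000,
    "search_scopes": 1_000,
    "user_profiles": 5_000_000,
}


def find_violations(planned: dict) -> list:
    """Return one message for every planned figure that exceeds the listed limit."""
    violations = []
    for name, value in planned.items():
        if name not in MOSS_2007_LIMITS:
            raise KeyError(f"No limit listed for {name!r}")
        limit = MOSS_2007_LIMITS[name]
        if value > limit:
            violations.append(f"{name}: planned {value:,} exceeds limit {limit:,}")
    return violations


# Example plan: 15 million documents indexed, one oversized library, modest folders.
planned = {
    "documents_per_index": 15_000_000,
    "documents_per_library": 3_300_000,
    "documents_per_folder": 500,
}
violations = find_violations(planned)
for message in violations:
    print(message)
```

Run against that example plan, only the 3.3 million document library is flagged; the 15 million documents sit comfortably within the index figure.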

The semantic web – Why bother?

Of late I’ve been doing some research on how we might take advantage of semantic web concepts and how they apply to SharePoint and other Microsoft technologies. To say I’ve been disappointed is something of an understatement, not so much by Microsoft as by where the whole semantic web is going.

Ubiquitous semantic interoperability on the web (no matter the technology) is like world peace: It’s a goal so grandiose, nebulous and contrary to the fractious realities of distributed networking that it hardly seems worth waiting for.

In most circumstances we can assume that heterogeneous applications will employ different schemas to define semantically equivalent entities, such as customer data records, and that some sweat equity will be needed to define cross-domain data mappings for full interoperability. All of this is fine, but the W3C vision goes further. It seems to refer to some sort of super-magical metadata, description and policy layer that will deliver universal interoperability by making every networked resource automatically and perpetually self-describing on every conceivable level.
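
To make the "sweat equity" concrete, here is a minimal Python sketch of a hand-written cross-domain mapping. Everything in it is invented for illustration: the two applications, their field names and the shared vocabulary are assumptions, not anything taken from SharePoint or a W3C specification.

```python
# Two hypothetical applications describing the same customer with different schemas.
crm_record = {
    "cust_id": "C-1001",
    "full_name": "Ada Lovelace",
    "email_addr": "ada@example.com",
}
billing_record = {
    "CustomerNumber": "C-1001",
    "Name": "Ada Lovelace",
    "Email": "ada@example.com",
}

# Hand-maintained mappings from each source schema to a shared vocabulary.
CRM_TO_COMMON = {"cust_id": "customer_id", "full_name": "name", "email_addr": "email"}
BILLING_TO_COMMON = {"CustomerNumber": "customer_id", "Name": "name", "Email": "email"}


def to_common(record: dict, mapping: dict) -> dict:
    """Rename the fields of a record into the shared vocabulary, dropping unmapped ones."""
    return {mapping[field]: value for field, value in record.items() if field in mapping}


# Once both records are mapped, they can be compared as the same entity.
same_customer = to_common(crm_record, CRM_TO_COMMON) == to_common(billing_record, BILLING_TO_COMMON)
print(same_customer)  # True
```

Each mapping table here has to be written and maintained by a person for every pair of systems, which is the cost the self-describing vision hopes to make disappear.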

Needless to say, this future is going to be slow to arrive. Commercial progress on the Semantic Web front has been glacial at best, with no clear tipping point in sight. It’s been eight years since RDF was ratified by W3C, and more than three years since OWL spread its wings, but neither has achieved breakaway vendor or user adoption.

To be fair, there has been a steady rise in the number of semantics projects and start-ups, and recently there has been a resurgence in industry attention to semantics issues, such as the recent announcement of a Semantic SOA Consortium. Some have even attempted, lamely, to rebrand the Semantic Web as “Web 3.0,” so as to create the impression this is a new initiative, not an old effort straining to stay relevant.

You would expect Microsoft and its partners to be one sector that embraces the Semantic Web, yet they have largely kept their distance. In theory, SharePoint is well placed: it offers search, enterprise content management, enterprise information integration (the Business Data Catalog) and business intelligence functionality out of the box, and Microsoft could potentially use it as its enterprise service bus and data quality management toolset. That could mean that the core users of SharePoint would all benefit from the ability to harmonize divergent ontologies automatically across many heterogeneous environments, with SharePoint at the core of the enterprise.

Unfortunately nearly the entire Microsoft ecosystem seems to be taking a wait-and-see attitude. One big reason for reluctance is that there already are many established tools and approaches for semantic interoperability in the SOA world, and the new W3C-developed approaches haven’t yet demonstrated any significant advantages in development productivity, flexibility or cost.

By that criterion, the Semantic Web has a long way to go, and may not get to first base until early in the next decade, at the very least. Microsoft’s ambitious road map for its SQL Server product includes no mention of the Semantic Web, ontologies, RDF or anything else to that effect.

So far the only mention of semantic interoperability in Microsoft’s strategy is in a new development project code-named Astoria. Project Astoria, which was announced in May at Microsoft’s MIX conference, will support greater SOA-based semantic interoperability on the ADO.NET framework through a new Entity Data Model schema that implements RDF, XML, and URIs. Microsoft has not committed to integrating Astoria with SQL Server or SharePoint, however, nor is it planning to implement any of the W3C’s other Semantic Web specifications. Essentially, Astoria is Microsoft’s trial balloon to see if a Semantic-Web-lite architecture lights any fires in the development community.

The W3C’s Semantic Web initiative indeed could be the seedbed of a new, semantics-enabling SOA, although it could take a lot longer for this dream to be realised fully. It might take another generation or so before we see anything resembling a universal semantic backplane that spans all SOA platforms.

After all, the Utopian hypertext visions articulated by Vannevar Bush in the 1940s and Ted Nelson in the 1960s had to wait till the 1990s, until Tim Berners-Lee nudged something called the World Wide Web into existence.

Taxonomy/Tagging in MOSS 2007

Just to show how far SharePoint has come as an ECM platform, it’s now possible to apply a complete taxonomy and tagging kit over all of the content managed in MOSS 2007.