From Corpora to Matching

Making effective use of the Internet increasingly depends on building better, more intelligent applications and search engines. Here is a brief introduction to how search engines work:

01) Define the corpus, search space/data;
02) Separate the corpus into documents;
03) Generate features for each document;
04) Generate a representation of each document;
05) Study the feature/vector space;
06) Cluster documents;
07) Reduce dimensionality;
08) Accept input queries;
09) Find the cosine of the angle between each document vector and the query vector;
10) Find the sought vector column;
11) Output results to the user in some way.

Each document in a corpus (database) is described by a set of keywords called index terms. We assign weights to index terms according to their relevance (frequency of occurrence, for instance). This is how we create the index that we can then search.

Corpus preparation:
Web pages of interest are analysed and cleaned by removing hypertext tags and any other markup. Pages are then broken down into documents, and each document is scanned for words/terms of interest: those which make a document unique, not standard words.
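As a minimal sketch of this preparation step, the Python below strips markup from a page and keeps only the words that are not standard words. The stop-word list, the function names and the sample page are assumptions made for illustration; a real system would use a proper HTML parser and a far longer stop-word list.

```python
import re
from html import unescape

# Hypothetical stop-word list; a real system would use a much longer one.
STOP_WORDS = {"a", "an", "and", "are", "by", "in", "is", "of", "on", "the", "to"}

def clean_page(html):
    """Remove script/style blocks and all tags, then decode HTML entities."""
    text = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", text)
    return unescape(text)

def extract_terms(text):
    """Lower-case the text, split it into words and drop the stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

# Illustrative page, not taken from any real site.
page = "<html><body><h1>The Wandle</h1><p>Mills on the river &amp; valley.</p></body></html>"
terms = extract_terms(clean_page(page))
print(terms)
```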

Extract terms of interest:
Bear in mind that terms of interest must be invariant, that is, characteristic of a document rather than generic and easy to find in any corpus/document. The idea is to find a signature per document.
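One simple way to approximate this selection, offered here only as an assumption and not as the method a particular search engine uses, is to discard terms that occur in too large a share of the documents. The 50% threshold and the sample documents below are hypothetical.

```python
from collections import Counter

def select_terms(documents, max_doc_fraction=0.5):
    """Keep terms that occur in at most max_doc_fraction of the documents."""
    doc_freq = Counter()
    for terms in documents:
        doc_freq.update(set(terms))  # count each term once per document
    limit = max_doc_fraction * len(documents)
    return sorted(t for t, df in doc_freq.items() if df <= limit)

# Each document is already reduced to a list of cleaned words.
documents = [
    ["river", "mill", "history"],
    ["river", "museum", "history"],
    ["river", "snuff", "mill"],
    ["river", "print", "works"],
]
signature_terms = select_terms(documents)
print(signature_terms)
```

Here "river" appears in every document, so it says nothing about any one of them and is dropped.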

Build term-by-document matrix:
The search space is defined by N dimensions, one per chosen term/feature, and each document is a point in this N-term space. This allows conceptual/semantic searches.

Each document becomes a column vector and each row represents a term. Each row records the frequency of a term across the analysed corpus. At first we simply build the matrix by counting the terms in each document.
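The counting step can be sketched as follows. The function name and the three tiny documents are illustrative assumptions; rows are terms (in alphabetical order) and columns are documents.

```python
def build_term_document_matrix(documents):
    """Return (terms, matrix) where matrix[i][j] counts term i in document j."""
    terms = sorted({t for doc in documents for t in doc})
    row_of = {t: i for i, t in enumerate(terms)}
    matrix = [[0] * len(documents) for _ in terms]
    for j, doc in enumerate(documents):
        for t in doc:
            matrix[row_of[t]][j] += 1
    return terms, matrix

documents = [
    ["mill", "river", "mill"],
    ["museum", "river"],
    ["mill", "museum", "museum"],
]
terms, matrix = build_term_document_matrix(documents)
for term, row in zip(terms, matrix):
    print(term, row)
```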

Compress the matrix:
There are two basic techniques/methods: Compressed Row Storage (scans the matrix row by row) and Compressed Column Storage (scans the matrix column by column). Both use three arrays.
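A minimal sketch of Compressed Row Storage, assuming the usual three arrays: the non-zero values, the column index of each value, and a pointer to where each row starts. The array and function names are chosen here for illustration. Compressed Column Storage is the same idea applied column by column.

```python
def to_crs(matrix):
    """Compress a dense matrix into (values, col_index, row_ptr)."""
    values, col_index, row_ptr = [], [], [0]
    for row in matrix:
        for j, x in enumerate(row):
            if x != 0:
                values.append(x)
                col_index.append(j)
        row_ptr.append(len(values))  # where the next row starts
    return values, col_index, row_ptr

def crs_row(values, col_index, row_ptr, i, n_cols):
    """Rebuild dense row i from the three arrays."""
    row = [0] * n_cols
    for k in range(row_ptr[i], row_ptr[i + 1]):
        row[col_index[k]] = values[k]
    return row

matrix = [[2, 0, 1], [0, 1, 2], [1, 1, 0]]
values, col_index, row_ptr = to_crs(matrix)
print(values, col_index, row_ptr)
```

Only the non-zero counts are stored, which matters because most terms do not occur in most documents.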

Normalise the matrix:
Normalisation implies transforming column vectors to unit vectors, i.e. vectors of unit length.

Unit document vectors contain the frequency of terms; the normalisation is applied because the semantic content of a document is generally determined by the relative frequency of terms.
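As a sketch, each column can be divided by its Euclidean length so that every document vector has length 1. The choice of the Euclidean norm and the function name are assumptions; empty documents are left as zero columns.

```python
import math

def normalise_columns(matrix):
    """Scale every column of the matrix to unit Euclidean length."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    result = [[0.0] * n_cols for _ in range(n_rows)]
    for j in range(n_cols):
        norm = math.sqrt(sum(matrix[i][j] ** 2 for i in range(n_rows)))
        if norm == 0:
            continue  # empty document: leave the column as zeros
        for i in range(n_rows):
            result[i][j] = matrix[i][j] / norm
    return result

counts = [[3, 0], [4, 2]]
unit = normalise_columns(counts)
print(unit)
```

After this step a long document and a short document with the same mix of terms end up with the same vector.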

Singular Value Decomposition:
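This factorises a matrix into three matrices. Two of them hold the singular vectors: the new dimensions. The third is diagonal and holds the singular values, that is, the spread of the corpus along these new dimensions. When the matrix is symmetric and positive semi-definite, as is the term-by-term product of the term-by-document matrix with its own transpose, the two outer matrices are identical and represent the eigenvectors, and the diagonal matrix holds the eigenvalues.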
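A full decomposition is normally left to a numerical library. As a much smaller illustration, and explicitly not a complete Singular Value Decomposition, the sketch below uses power iteration to find only the most important new axis: the dominant eigenvector of the symmetric matrix obtained by multiplying the term-by-document matrix A by its transpose. The square root of the corresponding eigenvalue is the largest singular value of A. The function names and the sample matrix are assumptions.

```python
import math

def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    """Plain dense matrix product of a (n x m) and b (m x p)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def dominant_eigenpair(sym, iterations=200):
    """Power iteration: largest eigenvalue and unit eigenvector of a symmetric matrix."""
    n = len(sym)
    v = [1.0 / math.sqrt(n)] * n  # arbitrary unit starting vector
    for _ in range(iterations):
        w = [sum(sym[i][k] * v[k] for k in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            return 0.0, v
        v = [x / norm for x in w]
    sv = [sum(sym[i][k] * v[k] for k in range(n)) for i in range(n)]
    eigenvalue = sum(v[i] * sv[i] for i in range(n))  # Rayleigh quotient
    return eigenvalue, v

# Term-by-document counts: rows are terms, columns are documents.
a = [[2.0, 0.0, 1.0], [0.0, 1.0, 2.0], [1.0, 1.0, 0.0]]
s = matmul(a, transpose(a))          # symmetric term-by-term matrix
eigenvalue, axis = dominant_eigenpair(s)
singular_value = math.sqrt(eigenvalue)
print(eigenvalue, axis, singular_value)
```

The returned axis is the direction in term space along which the documents are most spread out; the remaining axes would be found the same way after removing this one.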

A geometric interpretation:
The corpus is first formatted and stemmed, and is then stored in a compact term-by-document matrix. Each column of this matrix is then normalised to give the relative weight of each term, that is, the relative frequency of terms in a document.

The term-by-document matrix is then decomposed to calculate eigenvalues and eigenvectors. The eigenvectors represent a new Cartesian coordinate frame spanning the same search space, but they indicate the most important dimensions/axes along which documents mainly lie. The eigenvalues quantify the spread of documents along these new axes/eigenvectors.

Queries:
Queries must be based on the features/terms defined within the term-by-document matrix. Matching in a vector space such as this is implemented by multiplying the query vector against the term-by-document matrix, i.e. matching a query vector q against the documents (columns) of the matrix.
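A minimal sketch of this matching step: the query is written as a vector over the same terms as the matrix rows, and each document column is scored by the cosine of the angle between it and the query. The term list, the counts and the query are illustrative assumptions.

```python
import math

def cosine_scores(matrix, query):
    """Cosine of the angle between the query and each document column."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    q_norm = math.sqrt(sum(x * x for x in query))
    scores = []
    for j in range(n_cols):
        dot = sum(query[i] * matrix[i][j] for i in range(n_rows))
        d_norm = math.sqrt(sum(matrix[i][j] ** 2 for i in range(n_rows)))
        scores.append(dot / (q_norm * d_norm) if q_norm and d_norm else 0.0)
    return scores

terms = ["mill", "museum", "river"]            # one row per term
matrix = [[2, 0, 1], [0, 1, 2], [1, 1, 0]]     # one column per document
query = [1, 0, 1]                              # the query "mill river"
scores = cosine_scores(matrix, query)
ranking = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
print(scores, ranking)
```

If the columns and the query have already been normalised to unit length, the cosine reduces to the plain product of the query vector with the matrix. The document with the highest score is the sought column.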

I am the website administrator of the Wandle industrial museum (http://www.wandle.org). The museum was established in 1983 by local people determined to ensure that the history of the valley was no longer neglected, and to enhance awareness of its heritage for the use and benefit of the community.
