Rating data for a later on sorted select…
28. März 2010
Problem space
Imagine a system where a huge (say 100000 and even more) amount of data records have to be imported to the system every day. The records are stored after parsing as a parsed result as BLOB/CLOB or XML in a database staging table of the import sub system. This table holds also some meta data (like the provider of the file and so on) for each record. This abstract form of storage has been chosen against the traditional relational storage, because there are a lot of different import types where you don’t want to create new relational tables every time a new import type comes up. So the parsed result (the payload if you want) is something like a DynaBean from apache.commons.beanutils library where you can add easily key/value pairs.
There are cases where you know some meta data from the files header. Like how much data records are delivered in one file in example. But mostly you don’t know. Also the data records are mostly unordered. For some reasons it’s necessary to give these records to the business logic in a sorted order (i.e. Top customers are handled in a preferred way and their data is imported before the data of the low prioritized customers).
When the parsed record is stored as BLOB you can’t access the content for ordering of the parsed result anymore when it’s stored in the database table. With CLOB or XML storage this could be done, but it’s ugly and difficult to do in a generic way (i.e. XMLType-Queries in ORACLE).
First Solution which comes to mind
What comes to mind first is to extract the sorting attributes to a additional column in the database table before storing the BLOB/CLOB/XML and use these additional attributes to select the data records in an ordered way.
The disadvantages of this approach are:
- … leads into more and more additional attributes cause of the different import types
- … or distinct record staging tables for each import type
- … meta data and payload data are mixed up
- … record data is stored redundant in the BLOB/CLOB/XML and in the additional attributes columns
- … record tables have to be changed if a different sorting order is needed
- … queries for ordering are very complex, and have to be adjusted every time the attributes change
- … even you get all working, all records have to be read again after parsing for the extra sorting step
Let’s think a bit further
Wouldn’t it be nice to write the query for ordering just once and never touch again? When thinking about this it’s clear that we need a way to rate the attributes in some way during the parsing step so that we could use this rating value later on while selecting the records again for handing them over to the business logic.
The advantages of such an solution would be:
- … every records is rated while parsing, so there’s just a insert on the database table for each record
- … the records don’t have to be updated when giving the record the right position in the order
- … sorting is done only when selecting the records for handing them over to the business logic with a simple order by which will never change again
But how to do this?
The rating process which was coming to my mind for simple int values is a offset relative to the first record of each file. In example we have a record with an attribute priority where the values 10 to 0 stand for very high priority to very low priority. When you now get 5 records one after an other which have the following priority values
- record1 -> priority 7
- record2 -> priority 1
- record3 -> priority 5
- record4 -> priority 10
- record5 -> priority 7
The first record you read from the file has priority 7 which you set as the initial value and give this record a offset 0 (cause 7-7 results to 0). When reading the next record you subtract from the record2 with priority 1 the priority from record1 and get the offset for record2 which is 1-7=-6
So we have now
- record1 -> priority 7, offset 0
- record2 -> priority 1, offset -6
- record3 -> priority 5
- record4 -> priority 10
- record5 -> priority 7
after doing this for all other records (subtracting record1 priority from the actual records priority) we get
- record1 -> priority 7, offset 0
- record2 -> priority 1, offset -6
- record3 -> priority 5, offset -2
- record4 -> priority 10, offset 3
- record5 -> priority 7, offset 0
What you now have is a classification of each record related to the first record of the file. It’s the same like measuring distances from a defined point on a line. The defined point is the zero position. Every other point on the line on the left or right has a positive or negative offset to the defined zero point. If you now select the records with an order by offset desc you get the record in the following order: record4, record5, record1, record3, record2
- record4 -> priority 10, offset 3
- record5 -> priority 7, offset 0
- record1 -> priority 7, offset 0
- record3 -> priority 5, offset -2
- record2 -> priority 1, offset -6
You might recognize that record5 and record1 have the same offset. Wouldn’t it be nice to get them also ordered? Maybe by a second attribute? I will show this in the next paragraph.
More than one attribute…
In the last paragraph we used the priority as the one and only attribute which we want to use in the select for ordering. This attribute has a range from 10 to 0 for priority from very high to very low. So we use a 2 digit value for sorting which could be 10, 09, 08, 07, 06, 05, 04, 03, 02, 01, 00. When we want to use a second attribute for ordering we need more digits to enhance the range of values. If we take a second attribute like ‘year of registration’ (YofReg) which represents the year when the customer was registered. If we want to sort first for the priority and second for YofReg we have to do the same with second attribute as we did with the first. Then we get a number where the leftmost 2 digits represent the priority and the 4 rightmost digits represent the YofReg. So we have a range from 00|0000 to 10|9999. So we just paste the year of registration on the right of the priority and get a new unique number.
Now we can reuse the algorithm from above for rating. In example if the records look like this.
- record1 -> priority 7, YofReg 2002
- record2 -> priority 1, YofReg 2002
- record3 -> priority 5, YofReg 1997
- record4 -> priority 10, YofReg 2005
- record5 -> priority 7, YofReg 2008
Rating example for record 4: offset = 102005 – 072002 = 30003
After rating all records we get
- record1 -> priority 7, YofReg 2002, offset 0
- record2 -> priority 1, YofReg 2002, offset -60000
- record3 -> priority 5, YofReg 1997, offset -20005
- record4 -> priority 10, YofReg 2005, offset 30003
- record5 -> priority 7, YofReg 2008, offset 6
After sorting with the offset descending we get
- record4 -> priority 10, YofReg 2005, offset 30003
- record5 -> priority 7, YofReg 2008, offset 6
- record1 -> priority 7, YofReg 2002, offset 0
- record3 -> priority 5, YofReg 1997, offset -20005
- record2 -> priority 1, YofReg 2002, offset -60000
Voilá! We have sorted the records with the offset which is representing two attributes, but we don’t need both attributes in the query which does the sorting – only the offset. This means we can use still the same query as before! No changes in the selecting query just the rating of the records while parsing has to be enhanced slightly.
Descending/Ascending
In the last paragraphs we just selected in a descending order. If we want to sort also by some attributes in an ascending order this could be done also in the same way. What’s needed is an offset for ascending (ascOffset) and one for descending (descOffset). Both can contain a rating value for more than one attribute. If all offset values are the same the offset value is not used in the select query. the select will look like this
SELECT * from Record WHERE ORDER BY descOffset DESC, ascOffset ASC
And again, this query will never change!
Example:
- r1->priority 7,YofReg 2002,descOffset 0,ascOffset 0
- r2->priority 1,YofReg 2002,descOffset -6,ascOffset 0
- r3->priority 5,YofReg 1997,descOffset -2,ascOffset -5
- r4->priority 10,YofReg 2005,descOffset 3,ascOffset 3
- r5->priority 7,YofReg 2008,descOffset 0,ascOffset 6
After sorting we get
- r1->priority 10,YofReg 2005,descOffset 3,ascOffset 3
- r1->priority 7,YofReg 2002,descOffset 0,ascOffset 0
- r5->priority 7,YofReg 2008,descOffset 0,ascOffset 6
- r3->priority 5,YofReg 1997,descOffset -2,ascOffset -5
- r2->priority 1,YofReg 2002,descOffset -6,ascOffset 0
So record5 is behind record1 because the ascending order of the YofReg
What about ordering attributes which represents text?
Good question. In the above paragraph we just used attributes which represent a number. All these numbers are in the base 10 numbering system. We all know that we can convert number from one numbering system into an other. So from base 10 (decimal) to base 16 (hexadecimal) or vice versa. The example of the hexadecimal system shows that alphabetic letter’s could also represent a value. In example hex A for 10 in the decimal system or hex B for 11 in the decimal system. As you might guess already, that’s the trick. A Text is converted from the 26 based numbering system to the decimal system. And then the algorithm from above works again. Sure, the longer the text is – the number of digits – the bigger is the number! But in general this will work. Also you have to think about which letters are all used in the text. Are the digits 0 to 9 also available, only upper or lower case letters special locale letters, and so on… All these factors will scale up or down the base of the numbering system which has to be used for calculating the number in the decimal numbering system for representing the Text.
Example with the three words ABBA, OTTO and BAAB
For simplicity i choose three easy uppercase words and suppose that the attribute is just 4 digits long. And the four digits of the attribute are filled always – so no spaces just the 26 Characters of the alphabet.
First we calculate the decimal representation.
ABBA = 1 * 260 + 2 * 261 + 2 * 262 + 1 * 263
ABBA = 18981
OTTO = 15 * 260 + 20 * 261 + 20 * 262 + 15 * 263
OTTO = 277695
BAAB = 2 * 260 + 1 * 261 + 1 * 262 + 2 * 263
BAAB = 35856
So we have now the decimal values representing the three text values which we want to sort.
The records are submitted in the following order
- record1 -> name ABBA, offset 18981 – 18981 = 0
- record2 -> name OTTO, offset 277695 – 18981 = 258714
- record3 -> name BAAB, offset 35856 – 18981 = 16875
When sorting descending by the offset we get the record in the following way
- record2 -> name OTTO, offset 277695 – 18981 = 258714
- record3 -> name BAAB, offset 35856 – 18981 = 16875
- record1 -> name ABBA, offset 18981 – 18981 = 0
Not surprisingly this works also! And again we can use the same query with ordering by the offset value.
Some Java code to try it…
I created two simple java classes to test all this. So there is no database in this test so everyone can try it. The values are added to a HashMap with the calculated offset value as the key and the TestObject as the value. The TestObject itself has two attributes which are filed with random numbers. After the test objects are created and added all to the HashMap the keys are sorted with Arrays.sort() which gives you the ordered keys. Then the test objects are read again with the ordered keys and you get the values in the ordered way.
The TestObject class
package com.frevel.offsetsort;
import java.util.Random;
public class TestObject
{
private Integer testNumber1 = null;
private Integer testNumber2 = null;
private Integer sortOffset = null;
public static TestObject createTestObject()
{
return new TestObject(new Integer(new Random().nextInt(100)), new Integer( new Random().nextInt(1000)));
}
private TestObject(Integer testNumber1, Integer testNumber2)
{
super();
this.testNumber1 = testNumber1;
this.testNumber2 = testNumber2;
this.sortOffset = new Integer(String.format("%03d%04d", new Object[]{testNumber1, testNumber2}));
}
public Integer getTestNumber1()
{
return testNumber1;
}
public Integer getTestNumber2()
{
return testNumber2;
}
public Integer getSortOffset()
{
return sortOffset;
}
}
The JUnit test driver class (no real JUnit test)
package com.frevel.offsetsort;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import junit.framework.TestCase;
public class OffsetSort extends TestCase
{
private static final int MAX_ELEMENTS = 100;
public void test()
{
Map elements = new HashMap();
TestObject firstElement = TestObject.createTestObject();
Integer firstOffset = new Integer(firstElement.getSortOffset()
.intValue()
- firstElement.getSortOffset().intValue());// new Integer(0)
elements.put(firstOffset, firstElement);
for (int i = 0; i < MAX_ELEMENTS - 1; i++)
{
TestObject currentElement = TestObject.createTestObject();
Integer currentOffset = new Integer(currentElement.getSortOffset()
.intValue()
- firstElement.getSortOffset().intValue());//* -1 for descending
elements.put(currentOffset, currentElement);
}
Object[] keys = elements.keySet().toArray();
for (int i = 0; i < keys.length; i++)
{
System.out.println(((TestObject) elements.get(keys[i]))
.getSortOffset()
+ "="
+ ((TestObject) elements.get(keys[i])).getTestNumber1()
+ ":"
+ ((TestObject) elements.get(keys[i])).getTestNumber2());
}
System.out.println();
Arrays.sort(keys);
for (int i = 0; i < keys.length; i++)
{
System.out.println("[" + i + ":" + keys[i] + "]"
+ ((TestObject) elements.get(keys[i])).getSortOffset()
+ "="
+ ((TestObject) elements.get(keys[i])).getTestNumber1()
+ ":"
+ ((TestObject) elements.get(keys[i])).getTestNumber2());
}
}
}
… Awaiting comments
So everyone reading this… please sent me comments. Cause i really want to know if i did miss something… Right now – in my problem space – it looks really good with a solution like this. No additional query for reading the records. No sorting in memory (which is really not possible cause of the huge number of data records).
Gefällt mir:
Getaggt mit: ascending, descending, hash function, mass data, offset, Ordering, rating, reduce database access, Sorting
28. März 2010 um 22:01
Congratulation!
Positive Comments:
I think, one off the most importent advantages is,
you only have to create one table for all different records you like to order.
Negative Comments:
Some typo’s. ;o) I will talk to you later about that.
29. März 2010 um 22:19
hi tetanuss,
nicely explained, some typos though, but they don’t disturb the reading too much.
as far as I understood the algorithm, you already know beforehand the data classification (type and range).
furthermore, you also know which attributes you want to sort for (at least at the point where you process the records), right?
given those two assumptions, this looks like an elegant solution.
having to deal with text of varying width needs a bit of care, I guess. wouldn’t you have to rightpad the string with “0″ until the max length of the field?
e.g. comparing
* gad
* gadget
would compute offsets for “gad000″ vs. “gadget”, assuming max field length is 6.
the basic idea of the algorithm could also be used for hashing the data to allow parallel processing.
Regards,
MIB
30. März 2010 um 16:31
Hi MIB,
Yes the types and ranges are specified in a format specification for the records. So it’s easy to classify the data. The attributes which are relavant for the sorting are also know to that time and that’s no problem then, too. Indeed almost everything is specified in this format specification – like a contract –
You’re right with your comment about text when it does not fill all ‘digits’ of the attribut. For text attributes the padding is done with a space which is counted as a zero, so space in the text means 0 * 26 0 in example. I just left this out to keep it a bit more simple.
Using this in parallel should also work, you just need the very first data record for the base object with offset 0 initialized. Indeed we also have such a case where event could could be delivered in a time frame.
Thanks for your comment … I’m just catching the typos