Some more experimental details. For the UniProtKB 2022_04 dataset
there are 17,435,087,503 quads whose predicate is rdf:type.
On disk these consume 6,411,506,834 bytes, i.e. just under 3 bits
of disk usage per quad of this kind.
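A quick sanity check on that figure (my own arithmetic, not part of the store):

```java
// Verifies the "just under 3 bits per quad" claim from the two numbers above.
public class BitsPerQuad {
    static double bitsPerQuad(long bytesOnDisk, long quads) {
        return bytesOnDisk * 8.0 / quads;
    }

    public static void main(String[] args) {
        long quads = 17_435_087_503L;      // rdf:type quads in UniProtKB 2022_04
        long bytes = 6_411_506_834L;       // on-disk size of that partition
        System.out.printf("%.2f bits per quad%n", bitsPerQuad(bytes, quads)); // ~2.94
    }
}
```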
needed here but I coded it that way).
PS. The blocker of having a 100 billion+ triple store running on my 5
On 31/10/2022 15:09, jerven Bolleman wrote:
> Dear RDF4j dev-community,
>
> I have been distracted by writing a write-once/read-many quad store :)
>
> This store is designed with some of the challenges of UniProt in mind.
> It is based around two concepts: sort all the things, and don't mix
> value types. This quad store aims to be good for datasets with up to
> about 4,000 distinct predicates, graphs in the few-hundreds range,
> billions of distinct values, and trillions of triples; datasets that
> change relatively rarely and, when they do, can be regenerated or
> reloaded from scratch.
>
> # Some technical snippets.
>
> ## Sorted lists for values
>
> The store has dictionaries for values, like the vast majority of quad
> stores. The difference is one dictionary per distinct datatype, plus
> one for IRIs. A nuance of these dictionaries is that they are based
> around sorted lists, compressed and memory mapped, and all keys are
> therefore just index positions. These keys are valid for comparison
> operators: e.g. with key 1 for value "a" and key 2 for value "b", key
> comparison (Long.compare) matches SPARQL value comparison.
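A minimal sketch of that idea (class and method names are mine, not the store's): values of one datatype sit in a sorted array, a key is simply the position in that array, and comparing keys therefore agrees with comparing the values themselves.

```java
import java.util.Arrays;

// Sketch of a per-datatype dictionary backed by a sorted list.
// Keys are positions in the sorted array, so key order == value order.
public class SortedDictionary {
    private final String[] sorted;

    public SortedDictionary(String[] values) {
        this.sorted = values.clone();
        Arrays.sort(this.sorted);            // establish the order once
    }

    public long keyOf(String value) {        // key = index in the sorted list
        return Arrays.binarySearch(sorted, value);
    }

    public String valueOf(long key) {
        return sorted[(int) key];
    }

    public static void main(String[] args) {
        SortedDictionary d = new SortedDictionary(new String[] {"b", "a", "c"});
        long ka = d.keyOf("a"), kb = d.keyOf("b");
        // Long.compare on keys matches string comparison on the values
        System.out.println(Long.compare(ka, kb) < 0); // true, since "a" < "b"
    }
}
```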
>
> ## Partitioned triple tables, with graph filters
>
> The quad table, however, is highly partitioned: there is one table per
> combination of
> * whether the subject is a bnode or an IRI,
> * the predicate,
> * whether the object is a bnode, an IRI, or a specific datatype.
>
> e.g.
>
> _:1 :pred_0 <http://example.org/iri> .
> <http://example.org/iri> :pred_0 3 .
> <http://example.org/iri> :pred_0 "lala" .
>
> will be stored in 3 distinct tables. This allows us to completely
> avoid storing the predicate and the kind of subject or object. For now
> the tables are stored in separate files, e.g.
>
> ./pred_0/bnode/iris
> ./pred_0/iri/datatype_xsd_int
> ./pred_0/iri/datatype_xsd_string
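The partitioning rule can be sketched as a small path function (my own naming; the real store's layout code surely differs):

```java
// Sketch: derive the table path for a triple from the subject kind,
// the predicate, and the object kind or datatype.
public class PartitionPath {
    enum SubjectKind { BNODE, IRI }

    static String pathFor(SubjectKind subject, String predicate, String objectKind) {
        return "./" + predicate + "/" + subject.name().toLowerCase() + "/" + objectKind;
    }

    public static void main(String[] args) {
        // _:1 :pred_0 <http://example.org/iri> .
        System.out.println(pathFor(PartitionPath.SubjectKind.BNODE, "pred_0", "iris"));
        // <http://example.org/iri> :pred_0 3 .
        System.out.println(pathFor(PartitionPath.SubjectKind.IRI, "pred_0", "datatype_xsd_int"));
    }
}
```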
>
> Which graphs a triple is in is encoded in a bitset (roaring, for
> compression), and there may be multiple graph bitsets per table.
> All graphs must be identified by an IRI.
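Graph membership per table row can be sketched with a bitset (the store uses RoaringBitmap for compression; `java.util.BitSet` keeps this self-contained):

```java
import java.util.BitSet;

// Sketch: one bitset per graph, where bit i means "row i of this triple
// table is in that graph". The store uses RoaringBitmap instead.
public class GraphMembership {
    public static void main(String[] args) {
        // hypothetical graph :g1 contains rows 0, 2 and 3 of one table
        BitSet inGraphG1 = new BitSet();
        inGraphG1.set(0);
        inGraphG1.set(2);
        inGraphG1.set(3);

        System.out.println(inGraphG1.get(2));        // true: row 2 is in :g1
        System.out.println(inGraphG1.get(1));        // false
        System.out.println(inGraphG1.cardinality()); // 3 triples in :g1
    }
}
```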
>
> ## Inverted indexes using bitsets
> Many values can be stored completely inline in such a representation,
> and we also invert the table, which is very valuable when there is a
> small set of distinct objects, e.g. for a predicate with boolean
> values.
>
> We do
> true -> [:iri1, :iri2, :iri4]
> false -> [:iri1, :iri4, :iri7]
>
> instead of
> :iri1 true
> :iri1 false
> :iri2 true
> :iri4 true
> :iri4 false
> :iri7 false
>
> As all IRIs' string values are addressable by a 63-bit long value
> (positive only), we can turn this into two bitsets, which gives very
> large compression ratios and speed afterwards. Reduction to 2% of the
> input data is possible for quite a large number of datasets. (2/3rds of
> the predicate-value combinations in UniProtKB are compressible this way.)
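The boolean example above can be sketched like this (small ints stand in for the 63-bit subject keys, and `java.util.BitSet` stands in for roaring):

```java
import java.util.BitSet;

// Sketch: invert object -> subjects for a boolean-valued predicate.
// Subjects are assumed already mapped to dictionary keys.
public class InvertedBooleanIndex {
    public static void main(String[] args) {
        BitSet subjectsTrue = new BitSet();
        BitSet subjectsFalse = new BitSet();

        // :iri1 true, :iri1 false, :iri2 true, :iri4 true, :iri4 false, :iri7 false
        subjectsTrue.set(1);  subjectsFalse.set(1);
        subjectsTrue.set(2);
        subjectsTrue.set(4);  subjectsFalse.set(4);
        subjectsFalse.set(7);

        // ?s :pred true -> just enumerate the "true" bitset
        System.out.println(subjectsTrue);   // {1, 2, 4}

        // subjects with both values: a bitwise AND instead of a join
        BitSet both = (BitSet) subjectsTrue.clone();
        both.and(subjectsFalse);
        System.out.println(both);           // {1, 4}
    }
}
```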
>
> ## Join optimization candidates
>
> Considering all triples are stored in subject, object order (or that
> order is cheap to generate), we can also do a MergeJoin by default for
> all patterns that join on a subject variable. BitSet joins might
> in some cases also be possible.
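A merge join over two subject-sorted inputs is a single linear pass; a minimal sketch (my own simplification, ignoring the object columns):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: merge join of two triple patterns on a shared subject variable.
// Both inputs are sorted by subject key, so one linear pass suffices.
public class SubjectMergeJoin {
    static List<Long> mergeJoin(long[] a, long[] b) {
        List<Long> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            int cmp = Long.compare(a[i], b[j]);
            if (cmp == 0) { out.add(a[i]); i++; j++; } // shared subject key
            else if (cmp < 0) i++;                     // advance the smaller side
            else j++;
        }
        return out;
    }

    public static void main(String[] args) {
        long[] pattern1 = {1, 2, 4, 9};   // subject keys matching ?s :p1 ?o
        long[] pattern2 = {2, 3, 4, 10};  // subject keys matching ?s :p2 ?o
        System.out.println(mergeJoin(pattern1, pattern2)); // [2, 4]
    }
}
```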
>
> ## Open work
>
> There is still a lot of work to be done to make it as fast as possible
> and to validate that it really works as it is supposed to.
> * Strings using fewer than nine UTF-8 characters are also inline-value
> candidates, but this is not wired up yet.
> * FSST compression for the IRI dictionary instead of LZ4.
> * Cleanup experiments
> * Document more :(
> * Reduce temporary file size requirements during compression stage (7TB
> for UniProtKB)
>
>
> ## Early results
>
> Early results are encouraging. For the UniProtKB release we need 610 GB
> of disk space: 197 GB for the "quads" and the other 413 GB for the
> values, i.e. roughly 16 bits per triple! This is better than the raw
> RDF/XML compressed with xz --best :)
>
> Loading time (for UniProtKB 2022_04) is currently 59 hours on a 128 core
> machine (first generation EPYC). With 24 hours in preparsing the rdf/xml
> and merge sorting the triples. Another 10 hours in sorting all IRIs, and
> 25 for converting all values in the triple tables down into their long
> identifiers.
>
> In principle the first and last steps are highly parallelizable, and
> the last step might be much faster when moving from LZ4 to FSST[1]
> compression for IRIs and long strings.
>
> I have an in-principle agreement that I am allowed to contribute this
> to RDF4j, but would like to poll whether there is a desire for this and
> what kind of paperwork I need to supply.
>
> Considering it is a larger-than-normal contribution for me, I won't
> make the code available until I am clear that the paperwork will be
> fine, or that making it fine requires it to be open somewhere already.
>
> Regards,
> Jerven
>
>
> [1]
https://github.com/cwida/fsst/
--
*Jerven Tjalling Bolleman*
Principal Software Developer
*SIB | Swiss Institute of Bioinformatics*
1, rue Michel Servet - CH 1211 Geneva 4 - Switzerland
t +41 22 379 58 85
Jerven.Bolleman@sib.swiss - www.sib.swiss
_______________________________________________
rdf4j-dev mailing list
rdf4j-dev@xxxxxxxxxxx
To unsubscribe from this list, visit
https://www.eclipse.org/mailman/listinfo/rdf4j-dev