5.0 -> 5.1

- A small revolution is taking place in MG4J: now most classes handling indices have an IOFactory parameter that makes it possible to open files in alternative filesystems, such as HDFS. Beware--the feature is very pervasive and there might be missing spots. Thanks to Tim Potter for useful discussions and for testing this new feature.
- InputStreamDocumentSequence was not behaving correctly in case of keyboard input (two EOFs were necessary).
- The Maven artifacts did not contain the Velocity templates. Thanks to Andrew MacKinlay for reporting this issue.

4.0.4 -> 5.0

- WARNING: this release has source and binary incompatibilities with previous releases. Watch out.
- nextDocument() now returns DocumentIterator.END_OF_LIST instead of -1 to denote list exhaustion. To avoid confusion and ease the transition, the package prefix of MG4J is now it.unimi.di.*, following the change of name of our department.
- it.unimi.di.mg4j.search.DocumentIterator is now strictly lazy; in particular, it does not implement java.util.Iterator. Please replace calls to DocumentIterator.hasNext() with a check that DocumentIterator.nextDocument() != DocumentIterator.END_OF_LIST, or check whether the semantics of DocumentIterator.mayHaveNext() suit you. The change aligns the behaviour of the two versions of MG4J.
- The plethora of methods that accessed the positions of a term in an IndexIterator have been replaced by the single lazy nextPosition() call, which returns IndexIterator.END_OF_POSITIONS when the positions are exhausted. Some static methods in IndexIterators should help with the transition.
- MG4J is no longer based on gap-based indices. Classical interleaved indices are used for incremental index construction and high-performance indices are still supported for historical reasons, but all new indices are by default built using the new quasi-succinct format.
- DiskBasedIndex.getInstance() now returns an Index instead of a BitStreamIndex.
  Old code should check with a reflective call whether the result is a BitStreamIndex and act accordingly, as now it might be a QuasiSuccinctIndex, too.

4.0.1 -> 4.0.4

- Fixed SimpleParser.parse(MutableString), which was throwing a NullPointerException.
- Now DocumentRankScorer can load score files of any type.

4.0 -> 4.0.1

- We now force the number of documents of a virtual index to be equal to that specified by the resolver. Collections in which the last few documents were not referred to would have generated virtual indices with fewer documents than the standard ones.
- DocumentSequenceImmutableGraph is now part of MG4J. Building graphs out of web documents should be quite easy.
- Fixed a small bug in the equals method of Term.
- Fixed bug in the equals and hashCode methods of Select (before, only Index was taken into account, and not the actual subquery).
- Fixed several small inconsistencies in the Scorer hierarchy.
- Added the SubsetDocumentSequence class to extract a subset of documents from a given sequence.
- The default target for skipping structures is now 1%.
- Now ConsecutiveDocumentIterator has specialized code for non-gapped phrases containing just terms.
- Now Combine loads sizes when compressing positions using interpolative coding.
- We now use bare-bones heaps and array priority queues to increase speed.
- BM25Scorer and BM25FScorer have significantly faster ranking logic.

3.0.1 -> 4.0

- WARNING: This release has minor binary incompatibilities with previous releases, mainly due to the move from the interface it.unimi.dsi.util.LongBigList to the now standard it.unimi.dsi.fastutil.longs.LongBigList. It is part of a parallel release of fastutil, the DSI Utilities, Sux4J, MG4J, WebGraph, etc. that were all modified to fit the new interface, and that prepare the way for our "big" versions, that is, supporting >2^31 entries in arrays (simulated), elements in lists, terms, documents, nodes, etc.
  Please read our (short) "Moving Java to Big Data" document (JavaBig.pdf) for details.
- We now require Java 6.
- WARNING: document iterators will return FALSE, instead of TRUE, for indices for which there are no intervals. The actual intervals returned (if there are any) have not changed, but the placeholder role of TRUE has been taken by FALSE.
- WARNING: The semantics of TermCollectionVisitor.prepare(ReferenceSet) has changed slightly.
- PdfDocumentFactory has been removed. Please use the new Tika-based factory for PDF parsing.
- Backport from the big version of DocumentIterator.END_OF_LIST as a substitute for Integer.MAX_VALUE. Please use it in new code--it also makes transitions to the big version easier.
- Refined semantics for DocumentIterator, with new streamlined implementations based on AbstractDocumentIterator.
- A long-standing bug in skipTo() has been fixed thanks to a very detailed and replicable bug report by Soumen Chakrabarti. If the last posting in a list had an ordinal position that was an exact multiple of the quantum, skipping beyond the pointer contained in the posting would have erroneously returned the last pointer instead of Integer.MAX_VALUE.
- A few serious bugs of the alignment operators have been fixed. They are also significantly faster.
- A new set of classes interfaces with Apache's Tika to provide parsing of Office, RTF, etc. files.
- Many fixes to the remote classes (still experimental!).
- IdentityDocumentFactory was not using the FIELDNAME property.
- The toString() method of LowPass was erroneously printing "<" instead of "~".
- Query is now serializable and AbstractCompositeQuery exposes the component queries.
- The range operator for payloads was broken.

3.0 -> 3.0.1

- MG4J is now distributed under the GNU Lesser General Public License 3.
- MG4J is no longer dependent on COLT or jal, but it requires at least fastutil 6.
- Memory usage during indexing is a bit less tight, due to the new linear probing hash maps in fastutil 6 which could use more RAM.
- SKEWED_GOLOMB is no longer supported for writing.
- When loading offsets in memory, the bit stream used to read them was not properly closed.
- Fixed bug that would cause an error when creating a single empty batch under Windows.
- Fixed bug that was preventing payload indices from working correctly (thanks to Polina Morozova for finding and fixing this bug).
- Combine and subclasses now will work even if the occurrences field of the component indices is not set (thanks to Soumen Chakrabarti for reporting this bug).
- Fixed bug in AlignDocumentIterator that was causing random IllegalStateExceptions (thanks to Roi Blanco for reporting this bug).

2.1.3 -> 3.0

- WARNING: Massive revamp of the DocumentIteratorVisitor subsystem. Now such visitors can return data, much like a QueryIteratorBuildervisitor. It also has a special visit method for MultiTermIndexIterators. You'll have to adapt your previous implementations.
- WARNING: QueryParser instances are required to provide a parse(MutableString) method and two new escape methods that can be used to turn a string into a text token. This feature is fundamental for automatic query generation (thanks to Hugo Zaragoza for pointing out this problem).
- WARNING: To make a few things easier, we now have explicit document iterators representing true and false. Their construction requires a reference index (contrary to what was happening with DocumentIterators.EMPTY_ITERATOR), so the getInstance() methods of most document iterators had to be updated, and DocumentIteratorVisitor instances need to implement two new visit() methods. The iterators are generated by the tokens #TRUE and #FALSE.
- WARNING: Indexing of virtual fields uses much less memory, but batches now have a different content: they represent actual positions in the final virtual document.
  Sizes of each batch represent the known size of a virtual document at the moment when the batch was written. With this change, Paste no longer requires more memory than Concatenate.
- WARNING: A new RemappingDocumentIterator class makes it possible to mix results from different indices with positional operators. Since there is a new Remap query node, all DocumentVisitors will have to be updated.
- WARNING: All deprecated classes have been removed.
- WARNING: The -B option of IndexBuilder is now aligned to Scan--it specifies the basename of a collection to be built at indexing time. It used to be the size of the Combine buffer.
- New classes for efficient document collection construction at indexing time. The architecture is now also very open--you can plug in your own builders.
- Completely restructured size handling for Combine and subclasses. Unless you use Golomb coding, you will not need to load sizes. This is true even of batches of virtual fields, as Paste now by default does not renumber positions, but rather expects them to be already renumbered. The old behaviour can be obtained via a flag.
- We moved to Jetty 6. Also, a few problems with Velocity not finding templates have been fixed.
- New, more intelligent memory handling that should be able to completely avoid out-of-memory errors. There is also a limit on the number of terms per batch that should help with garbage collection.
- Fixed a bug in collection creation: we used to provide the original factory, but this is wrong as we might not be indexing all fields. Now we generate a suitable factory that contains only the indexed fields.
- New important feature: high-performance indices may now have variable quanta depending on the list frequency and density. Indices now sport a .posnumbits file that records how many bits are used to store positions. It is used as a basic statistic to compute the correct quantum.
  You can ask for a percentage of the index to be used for skip towers, and the right quantum for each list will be computed for you. The process is quite empirical, so always look into .stats files to check that you are actually using no more than the percentage requested. In general, old indices will have to be rebuilt before being able to Combine them into an index with variable quanta, but for high-performance indices the tool ComputePosNumBitsPositions can be used to add the missing file.
- Memory mapping of indices now uses the new multiplexed approach implemented in ByteBufferInputStream. This means that we can map into memory essentially every index. Thanks to Valentin Tablan and Ian Roberts for suggesting this approach.
- Now we feature an implementation of the state-of-the-art BM25F ranking function.
- ZipDocumentCollection.getInstance() makes it possible to reliably load ZipDocumentCollection instances even if they are not in the current directory.
- New, nice UTF-8 mathematical symbols for conjunction, disjunction, TRUE and FALSE.
- Fixed problem with too many connections open when using JdbcDocumentCollection.
- A new SUCCINCTSIZES URI key makes it possible to ask for loading sizes into an Elias-Fano compressed list. This will slow down access by two orders of magnitude, but it can be very useful when pasting large indices, as pasting needs to load a large amount of size data.
- EmptyIndexIterator instances are no longer Index-based singletons. This change was necessary to make it possible to run ranking algorithms that need to set the weight or id even of empty iterators. This should cause no problem.
- All document iterators now have a settable weight. The weight can be expressed in standard syntax using braces. Note that weights per se have no meaning--it is up to the scorers to use them.
- Now the metadata-only option of Combine and its implementations generates the file of frequencies.
  This is very useful as it makes it possible to compute the term frequencies for the virtual documents obtained by concatenating all fields--something that is necessary for the correct computation of BM25F.
- Fixed a bug in the grammar: queries such as "(a))" would have been parsed as "(a)" because of a lack of check for EOF (thanks to Hugo Zaragoza for reporting this bug).
- The parser will now accept Unicode characters 0x2227 and 0x2228 (the standard mathematical symbols for conjunction and disjunction) for AND and OR, respectively.
- Following some testing on TREC GOV2, the defaults for MAXPREANCHOR and MAXPOSTANCHOR in HtmlDocumentFactory have been reduced to 8 and 4, respectively.
- Fixed old bug in SemiExternalGammaList; readBits(0) was not called after numLongs estimation, leading to EOFExceptions.
- Document pointers can now be coded in unary.
- Fixed bad bug in PartitionLexically: for high-performance indices, the positions of the last term were not being written.
- HttpFileServer has a settable port.
- New Scorer.getWeights() method to get weights.
- Fixed a bug in TfIdf scorer that would have caused NaNs.
- Query accepts a newline-separated list of titles, besides the usual serialised object.

2.1.2 -> 2.1.3

- URLMPHVirtualDocumentResolver required a sorted list, even if this was not in the class specification. Now you can choose between a sorted list (with reduced space occupancy) or a generic list (thanks to Nuno Cardoso for reporting this bug).
- Fixed problem with VelocityViewServlet (getTemplate() must not be invoked statically on Velocity for things to work properly; thanks to Valentin Villenave for reporting a problem with the Lilypond Snippet Repository which led me to fix this bug).

2.1.1 -> 2.1.2

- AlignDocumentIterator (syntax: ^) makes it possible to align document/interval iterators from different indices. Using this feature MG4J can easily support queries based on semantic tagging.
- Fixed another bug in Snowball stemmers: calling processTerm() with a null argument would have caused an exception.
- Now Scan and IndexBuilder accept parseable objects as sequences. The same happens for the WORDREADER property of some factories, making it possible to create a moderately command-line-configurable FastBufferedReader as WordReader.
- UNICODE_INPUT is now set in SimpleParser.jj, making it possible to write wild Unicode queries again.
- QueryServlet now forces UTF-8 for output.
- We now distribute the javacc-generated files for easier installation.
- More liberal Velocity template-resolution setup, now documented in the HttpQueryServer Javadoc.
- The --skips command-line option is gone. --no-skips disables skips for interleaved indices only. By default, all indices have skips that use about 2% of the index size.
- Fixed bad integer overflow bug when using large heights.
- New -i option for URLMPHVirtualDocumentResolver, mimicking the same option in Sux4J's functions.

2.1 -> 2.1.1

- Major fix: the Snowball stemmers would generate empty strings, and Combine would choke (generating empty indices) on empty strings.
- Removed obsolete PorterStemmerTermProcessor.

2.0.1 -> 2.1

- WARNING: Most utility classes have been moved to dsiutils. Old versions are still here and deprecated, but you'll have some problems when importing this version. Always check which version you're using!
- WARNING: TermMap has been replaced by StringMap (in dsiutils). PrefixMap exists, but the dsiutils signature is completely different from the old one.
- Lots of stemmers coming from Snowball. We actually made some improvements to the Java Snowball compiler to get this working at a reasonable speed.
- New (somewhat experimental) feature: you can get the terms that caused an interval to be emitted.
- Sequential scan was not working for high-performance indices if positions were not read. The problem was evident when combining high-performance indices specifying -cPOSITIONS:NONE.
- Fixed a couple of NullPointerExceptions in index construction (thanks to Marko Srdanovic for reporting these bugs).
- Fixed missing call to super.close() in AbstractIndexClusterIndexReader that was causing spurious warnings.
- Now Query has multiplex on by default.
- Fixed bug in MutableString.subSequence() (thanks to Espen Amble Kolstad for reporting this bug). MutableString is now in dsiutils.
- New QueryExpander interface for modifying queries between parsing and actual resolution. It can be used, for instance, to do term expansion. A simple abstract implementation (AbstractTermExpander) is provided for term expansion. Also, an implementation that multiplexes terms over indices (MultiIndexTermExpander) is provided.
- New allLines() method in LineIterator. LineIterator is now in dsiutils.

2.0 -> 2.0.1

- Can you believe that? Fast.leastSignificantBit() under very peculiar circumstances was returning random data, but apparently this was causing no harm. I don't wanna know.
- Better memory handling: buffer reallocation logic in index construction could cause out-of-memory errors. Now we retry a small reallocation after dumping the content in a temporary file, and record the event so the Scan process can dump the current batch.
- Fixed old minor bug in Combine: term files and global-counts files were not closed, leading to bizarre and spurious too-many-open-files errors.
- Fixed derelativisation when using FileSystemItem.

1.1.3 -> 2.0

- METAWARNING: This release has so many changes and so many new features that we strongly suggest carefully reading all the information below and the manual.
- WARNING: there are performance improvements due to fixed-point computation of Golomb moduli (yes, it *really* slows down things), but unfortunately all indices have to be rebuilt.
- WARNING: virtual fields have changed in a completely incompatible way, and the same happened to AnchorExtractor.
  This was necessary to finally get rid of problems with System.identityHashCode() (see below).
- WARNING: BitStreamIndexIterator will now throw an UnsupportedOperationException when positions or intervals are retrieved on an index without positions. Previously, getting positions would have produced the same effect, but getting intervals would have returned TRUE. This was causing very confusing behaviour with ordered AND, consecutivity, etc., as they were returning false positives.
- WARNING: a great deal of work has gone into making all relevant iterators fully lazy. Please use DocumentIterator.nextDocument() and IntervalIterator.nextInterval(), after reading the related Javadoc documentation. The change has produced significant performance improvements.
- WARNING: IOExceptions are now rethrown by most index-access methods. Previously, they would have been caught and wrapped into RuntimeException, but this behaviour was slightly slowing down methods called very often like nextDocument().
- WARNING: The old sequential reading methods (e.g., readDocumentPointer()) are no longer available (I guess nobody was using them anyway). They are replaced by an IndexReader.nextIterator() method that returns an index iterator on the term after the current one, until exhaustion.
- WARNING: Quanta are now restricted to powers of two.
- Completely new kind of index (high-performance). It uses the Lucene idea of keeping positions in a separate file, and enriches it with MG4J skip structures. It is now the default index type.
- Completely rewritten index reading. Now a Ruby script generates different readers for different combinations of flags, significantly increasing performance due to the reduced logic overhead. A generic class is always available, but for production sites wired index readers are the right choice. The wired, faster class is fetched automagically by reflection if available.
- Completely new, memory-adaptive index construction strategy.
  Just specify a number of *documents* per batch and let MG4J do the rest. Please read the Scan class documentation.
- New payload-based indices. Now it is possible to index dates, integers, or any other payload. By default we supply range queries.
- Significant improvements in performance. System.identityHashCode() turned out to be *deadly* slow, so we dropped reference-based open hash maps and started using brute-force array maps (you need fastutil >= 5.0.7) whenever we have to manipulate very small sets. The gains are surprising, in particular for queries containing frequent terms.
- Even more improvement due to parallel reimplementation of all operators for the special case in which all document iterators are index iterators. In this case all intervals have length 1 and can be retrieved eagerly. In some cases performance is almost doubled.
- New low-level coded-integer skipping methods have further increased performance in certain situations (e.g., phrasal queries containing stopwords).
- Now we use precomputed bit codes for 65536 words, uniformly. This requires 4MiB of memory just for precomputed words, but it almost doubles decoding speed (as the logic is much, much simpler).
- New bulk reading methods for integers in gamma, shifted gamma and delta coding. They make readDocumentPositions() several times faster as most decodings do not require a method call.
- Many fixes to the code involving generics.
- Fixed stupid bugs in PartitionLexically.
- Moved sizes into Index (from BitStreamIndex) and added new SIZES property that makes it possible to specify a global sizes file. This way, it is possible to use BM25 on clusters.
- Major fixes to documentally clustered document iterators.
- Fixed subtle semantic issue in LowPassDocumentIterator: TRUE iterators now make the iterator valid.
- Fixed subtle semantic issue in subclasses of AbstractOrderedIntervalIterator: now TRUE subiterators are considered as always matching (so the actual interval matching is performed just on non-TRUE iterators).
- Fixed bug in ScoreDocumentBoundedSizeQueue that was causing enqueuing of documents with score equal to the minimum.
- Improved implementation of MinimalPerfectHash. By fixing deterministically the perfect hash functions we reduce to virtually zero the trials during the construction (thanks to djam8193ah@hotmail.com for suggesting the idea).
- Fixed old copy-and-paste bug in non-scored requests to QueryEngine: offset was not used at all (but I guess nobody was using that method anyway).
- Completely new support for query expansion. A MultiTermIndexIterator behaves in all respects like an IndexIterator, but it's actually built by merging the index iterators of several terms. The "frequency" is settable so as to solve term-dependency problems in IDF-based ranking schemes. For debugging purposes, + can be used (instead of |) to cause the construction of a MultiTermIndexIterator.
- Brouwerian difference is now supported. It kills all intervals of the minuend that appear in the subtrahend. It can be used to search for terms while forcing the context in which they are found *not* to contain some terms or, more generally, a query. It can also be used to modify index granularity by subtracting 2-element intervals that cross section boundaries.
- ConsecutiveDocumentIterator now supports gaps that can be used to match arbitrary words. This is particularly useful to perform phrasal queries in indices where some terms have not been indexed. Gaps are specifiable using $ instead of a term in the built-in parser.
- New methods to access the front of a subclass of AbstractUnionDocumentIterator, that is, the indices of the component iterators positioned on the current document.
  They are used by all union-based iterators, providing significant performance improvements on large unions.
- New metadata-only mode for Combine and related subclasses. Mainly useful for getting the global sizes, terms, etc. of a cluster.
- The array-writing methods of OutputBitStream now take a long for the bit length/offset, and correspondingly return a long. The old methods are still present, but they are deprecated (just to avoid proliferation).
- Deprecated all minimal perfect hashing constructors using the platform default encoding. They are just an endless cause of problems. There are now constructors with just a filename and an encoding (which can be null to mean the platform encoding, but you have to explicitly ask for it).
- Now all TermMap implementations have a constructor accepting an Iterable.
- New constructors and main method options for minimal perfect hash tables, prefix dictionaries and front-coded lists that support reading gzip'd files.
- Query provides a clearer selection between *no interval selection* and *no intervals*.
- Fixed bug in ImmutableBinaryTrie: prefixes of the first binary string would have generated an empty approximated interval (instead of [0]).
- Fixed bug in writeShiftedGamma()/readShiftedGamma(), and modified test.bsh so that it detects the bug.
- The SPIRE 2006 algorithms are by now obsolete--we have new, provably optimally lazy algorithms. The code reflects this.
- Lots, lots, lots of unit tests.

1.1.2 -> 1.1.3

- New score(digits) method for ResultItem for easier display.
- Now JdbcDocumentCollection works with factories featuring more than one field.
- Reintroduced the JavaBeans Activation Framework in dependencies.
- Fixed lack of calls to close() in some document factories, generating spurious warnings.
- Fixed static fields in QueryServlet.

1.1.1 -> 1.1.2

- Fixed default values of K_1 and B in BM25 scorer following Büttcher & Clarke's paper.
- Fixed interval methods for nonsense calls on the empty interval.
- Dumped jline--we now suggest using rlwrap.
- More sensible hash for intervals. As a consequence, the serialUID had to be bumped.
- Fixed serious bug in OrderedAndDocumentIterator that was dropping several correct intervals (thanks to Fabien Campagne for finding this bug).
- Fixed very old bug in InputBitStream.read(byte[], int)--reads of full length would have caused an ArrayIndexOutOfBoundsException (thanks to Kevin Dorff for finding this bug).
- OrDocumentIterator was using an indirect queue instead of a semi-indirect queue, maybe for historical reasons.
- Complete rewrite of interval operators due to new algorithms, to be included in the revised SPIRE 2006 paper. On TREC data this led to an average 3% increase in speed. Now the algorithms used by MG4J are provably optimally lazy.
- The BulletParser now accepts element-type names with dashes, etc., and moreover correctly parses explicit CDATA sections (thanks to Kevin Dorff for finding these bugs).
- Support for unsigning signed minimal perfect hash maps.
- New Shift-Add-Xor-based signed minimal perfect hashes (even with long signatures). Moreover, now all signed hashes have a main() method generating by default instances of that hash.
- Massive speed improvements in OutputBitStream: finally we write precomputed words for small integers, analogously to what happens in InputBitStream.

1.1 -> 1.1.1

- Better loading of InputBitStream data, working also with multiple class loaders, and serialisability of SelectedInterval (fixed by the Twease people).
- AbstractAggregator was not setting up the equalisation factors when equalisation was not required, resulting in divisions by 0.
- CountScorer is now a DelegatingScorer (as it should have always been).
- The empty-constructor interval selector wasn't really letting out *all* intervals--overlapping intervals would have been discarded.
- Fixed a *very old* bug in the computation of minimal-interval semantics. Now the code is fully aligned with our SPIRE 2006 paper.

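The 1.1.1 -> 1.1.2 entry on OutputBitStream mentions writing precomputed words for small integers. The idea can be sketched as follows. This is an illustration only, not the OutputBitStream code: the class GammaSketch, its table size and the use of '0'/'1' strings instead of packed bits are all assumptions made for readability. It also assumes the convention that the gamma code of x >= 0 is obtained by writing x + 1 in binary, preceded by as many zeros as there are bits after the leading one.

```java
/** Illustrative sketch (hypothetical class): gamma coding with a lookup table for small values. */
final class GammaSketch {
    /** How many small integers get a precomputed code (arbitrary choice for this sketch). */
    static final int TABLE_SIZE = 256;

    /** Codes for 0 .. TABLE_SIZE - 1, computed once at class-load time. */
    private static final String[] PRECOMPUTED = new String[TABLE_SIZE];

    static {
        for (int x = 0; x < TABLE_SIZE; x++) PRECOMPUTED[x] = computeGamma(x);
    }

    private GammaSketch() {}

    /** Computes the gamma code of x >= 0 from scratch, as a string of '0' and '1'. */
    static String computeGamma(int x) {
        if (x < 0) throw new IllegalArgumentException("x must be nonnegative: " + x);
        final long v = x + 1L; // shift so that 0 is encodable
        final int msb = 63 - Long.numberOfLeadingZeros(v); // index of the most significant bit of v
        final StringBuilder code = new StringBuilder();
        for (int i = 0; i < msb; i++) code.append('0'); // unary prefix: msb zeros
        code.append(Long.toBinaryString(v)); // leading one plus the msb lower bits
        return code.toString();
    }

    /** Returns the gamma code of x, using the table when x is small enough. */
    static String encode(int x) {
        if (x >= 0 && x < TABLE_SIZE) return PRECOMPUTED[x]; // fast path: no bit fiddling
        return computeGamma(x); // slow path for large (or invalid) arguments
    }
}
```

With this sketch, GammaSketch.encode(0) is "1" and GammaSketch.encode(3) is "00100". A real bit stream would store each table entry as a packed word plus its length, so that the fast path is a single write.
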
1.0.2 -> 1.1

- IMPORTANT: IndexWriter.close() no longer saves properties automagically--you have to fetch them with IndexWriter.properties().
- Java 5 only.
- Probably the largest rewrite and extension in the history of MG4J. Too many changes, fixes and optimisations to be described here. Almost nothing is backward-compatible.
- We are starting to distribute unit tests with each release. We have actually many more tests, but they are not cast inside JUnit and rather undocumented. You are welcome to donate unit tests.

1.0.1 -> 1.0.2

- Fixed bug in InputStreamDocumentCollection: the document index (and thus the title) was never incremented.
- New parsing factory for the BulletParser: you decide how to parse your names (an idea by Fabien Campagne).
- Now we use 1.26n integers to minimally hash n words. 1.25n is in fact the threshold--you need something larger than that. The change should be fully backward-compatible.
- Now FileLinesCollection returns a Closeable FileLinesIterator.
- BloomFilter no longer implements the nonsensical size() method. add() is more efficient and does not return a value.

1.0.0 -> 1.0.1

- Fixed bug in Paste if the sizes of size lists differ (now we extend to zero).
- The "field" property was not propagated by Combine.
- A missing throws clause in AbstractDocumentCollection's implementation of iterator() was making it impossible to throw exceptions in implementing subclasses.
- New, efficient single-query iterator for JdbcDocumentCollection.
- NULLs do not generate null pointer exceptions any longer in JDBC document collections. They're converted to empty input streams.

0.9.2 -> 1.0.0

- Too much to be written.

0.9.1 -> 0.9.2

- IMPORTANT: To avoid clashes with List, the get() method of TermMap has been changed to getTerm(). We're sorry for this inconvenience.
- Now we support prefixes by means of a PrefixMap. There are easy (ternary search trees) and very sophisticated (semi-external tries) implementations.
  If you have a PrefixMap you can search for things like "foo*" (meaning "starts with foo"), provided that the terms starting with "foo" do not exceed a constant defined in QueryParser.
- Interval has new methods that compare to points.
- Fixed stupid bug in ClarkCormack scorer: we were comparing the document indices, not the scores. Ouch.
- Fixed ScoredDocumentBoundedSizeQueue: now stability is forced by making the order an actual order (not a preorder) so it is possible to get the k-th to (k+j)-th ranked documents in a consistent way. The new version is, unfortunately, completely incompatible with the old one.
- New CachingDocumentIterator: it decorates a DocumentIterator so that you can get its interval iterators several times.
- FastBufferedOutputStream was NOT flushing. The flush() method was inherited, but of course that didn't work. FastBuffered{In,Out}putStream are now deprecated as they have been moved to fastutil.
- OrDocumentIterator would have caused IllegalStateException in some circumstances (the array of underlying iterators was assumed to have null'd positions for empty component iterators, but this wasn't happening).
- InputBitStream is a boolean iterator, and OutputBitStream has a method accepting a boolean iterator. This opens a new world of possibilities 8^).
- New replace() and delete() methods in MutableString for handling more easily deletions or substitution of a class of characters.
- readLine() no longer empties its argument on end of file.

0.9.0 -> 0.9.1

- A couple of missing methods in it.unimi.dsi.mg4j.util.Fast were necessary for WebGraph.

0.8.2 -> 0.9.0

- IMPORTANT: Int2IntArrayMap and Int2LongArrayMap no longer exist: offsets and sizes now are type-specific array lists (and they can be easily generated using fastutil wrappers).
- Fixed stupid, stupid bug in state handling in IndexReader. Sequential reads of an entire index would have thrown an IllegalStateException.
- Changed a few array static methods to faster fastutil counterparts.
- Fixed small glitch in lastIndexOf() semantics--searches from negative offset of the empty string would have returned 0 instead of -1.
- Bunch of new methods in MutableString ((last)indexOfAnyBut, (co)span).
- Golomb read/write methods now support modulus 0, 0 being the only valid argument (and the result returned upon reads).
- Minimal perfect hashes support lists with fewer than 16 elements, by storing them transparently in a vector.
- Signed hashes have an incompatible format (sorry).
- Minimal perfect hashes support optimal weight length computation for sorted term collections.
- New left/right trim methods. Moreover, trim methods preserve looseness and compactness.
- Literally zillions of new features, everywhere.
- Experimental support for multi-index minimal-interval semantics and skipping towers.

0.8.1 -> 0.8.2

- New methods for starting and stopping a progress meter with messages.
- New FastMultiByteArrayInputStream class that can hold 256 PiB (256 PiB = 2^28 GiB) and exposes them as a repositionable stream.

0.8 -> 0.8.1

- Modified imports and class name for compliance with fastutil 3.0.
- Relicensed under the LGPL.

0.7.1 -> 0.8

- New NullInputStream to support new InputBitStream direct array wrapping.
- position() in InputBitStream will always work if the new position is within the current buffer.
- Removed unused buffer in InputBitStream, and made unget buffer allocation on-demand.
- Eliminated finalizers from streams.
- New debugging class.
- Fixed bug in position().
- Now the ProgressMeter gives items/s at the first printout.
- New methods for variable-length nibble coding.
- New methods for zeta coding (a new code!).
- Fixed erroneous serialisation of CRC32SignedMinimalPerfectHash.
- Completely renewed hashing scheme for minimal perfect hashing: supports the empty string and it is faster to compute.
  Moreover, MinimalPerfectHash now has an offline builder that never actually loads the words into RAM, thus allowing to hash very large sets (albeit slowly), and checks a suitable system property to provide optional verbose logging. The serialisation, unfortunately, is incompatible.

0.7 -> 0.7.1

- Removed experimental classes.
- Fixed two bad bugs introduced in 0.7 during optimisation.

0.6 -> 0.7

- IMPORTANT: MG4J now uses the new fastutil package name (i.e., no more fastutil). If you use parts of MG4J that require fastutil, you should upgrade.
- New replace() methods for MutableString that entirely replace the string content. New copy() method to obtain easily a compact copy of a mutable string.
- New RepositionableStream interface to mark streams that can be repositioned by bit streams.
- New FastByteArrayInputStream to read memory blocks as bit streams.
- New unsynchronised FastBufferedReader.
- ProgressMeter count value has now setters and getters.
- Programmable meter quantum for FirstPass.
- Some optimisations.

0.5 -> 0.6

- IMPORTANT: streams and self-delimiting string formats are not binary compatible with previous versions. Please read the docs.
- Fixed bug with serialisation of empty strings and set serialVersionUID.
- Too many additions to be described in a file, but in short: optimised indexOf() family of methods, flexible index construction, QuickSearch fast searches.

0.4 -> 0.5

- MutableString has now a more coherent policy for compactness and looseness. They are preserved by all operations.
- StringBuffer-specific methods have been killed to reduce code duplication. You need to recompile so that Java uses the alternative CharSequence-specific methods.
- Several new methods such as startsWith, endsWith etc.

0.3 -> 0.4

- IMPORTANT: the hash computation functions in MinimalPerfectHash have been changed. Please regenerate your maps.
- MinimalPerfectHash has been reimplemented to use CharSequence, so it is more general.
  Moreover, we have a new SignedMinimalPerfectHash that can be used to avoid false positives.
- New replace methods in mutable strings.
- Now we try to return a reference to this in all mutable string methods.
- Various fixes to documentation.

0.2 -> 0.3

- Introduced new class MutableString.

0.1 -> 0.2

- By mistake writeLongDelta() was really called writeDelta().
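
The 4.0.4 -> 5.0 notes ask callers to replace hasNext() with a check of nextDocument() against END_OF_LIST. A minimal sketch of that loop shape follows. It does not use the MG4J classes: LazyDocIterator, ArrayDocIterator and LazyIterationSketch are hypothetical stand-ins, and the sentinel is assumed to be Integer.MAX_VALUE, in line with the 3.0.1 -> 4.0 note that introduces END_OF_LIST as a substitute for that value.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a strictly lazy document iterator (not the MG4J interface). */
interface LazyDocIterator {
    /** Sentinel returned on exhaustion; the value is an assumption of this sketch. */
    int END_OF_LIST = Integer.MAX_VALUE;

    /** Returns the next document pointer, or END_OF_LIST when no documents are left. */
    int nextDocument();
}

/** Toy implementation backed by an array of increasing document pointers. */
final class ArrayDocIterator implements LazyDocIterator {
    private final int[] docs;
    private int next;

    ArrayDocIterator(int[] docs) {
        this.docs = docs.clone();
    }

    @Override
    public int nextDocument() {
        // Once exhausted, keep returning the sentinel on every further call.
        return next < docs.length ? docs[next++] : END_OF_LIST;
    }
}

final class LazyIterationSketch {
    private LazyIterationSketch() {}

    /** Collects all remaining documents using the sentinel-based loop instead of hasNext(). */
    static List<Integer> drain(LazyDocIterator it) {
        final List<Integer> result = new ArrayList<>();
        int doc;
        // Old style (no longer possible): while (it.hasNext()) { doc = it.next(); ... }
        while ((doc = it.nextDocument()) != LazyDocIterator.END_OF_LIST) result.add(doc);
        return result;
    }
}
```

The sentinel-based loop needs no lookahead, which is what allows an iterator to be strictly lazy: nothing is decoded until the caller actually asks for the next document.
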