FANTABULOUS!

It goes without saying that I am EXTREMELY GRATEFUL! I am deeply grateful to the Ebbe Nielsen Challenge jury members, and especially to GBIF for its efforts, including the establishment of this Challenge, to help the world gain access to biodiversity data. And of course, congratulations to Peter Desmet, Bart Aelterman and Nicolas Noé for their work on Datafable, which very much deserves the first-place award!

New Logo, and New Affiliation

If you look really closely, you might notice a very subtle change to the BioGUID logo (see the upper banner, and the bottom of the API Page; you might need to refresh your browser). The change involves a slight shift in the hue of the blue and green parts, plus the addition of a third (yellow) link in the chain between “Bio” and “GUID”. Also, the background is now white (transparent for the png files). Why did I do this? Well, as I hinted in an earlier post, BioGUID is now formally part of the Global Names Architecture (GNA), and to represent this association, the BioGUID logo color scheme has been updated to match that of the GNA logo. Also, BioGUID’s GitHub space has been moved to the GNA GitHub project. The old GitHub site will remain live, but now consists only of a note redirecting visitors to the new BioGUID GitHub site. Over the next few weeks, I will gradually transfer all the documentation and source code over to the new GitHub repository.

Bug Workaround

I discovered a bug that prevents uploading identifiers with no value entered for “RelationshipType” in the uploaded CSV. It should be easy to fix, but for now, the work-around is to make sure you include a value in the “RelationshipType” column of the uploaded file. If you’re only submitting raw identifiers (with no related identifiers), just use “Congruent” for RelationshipType.
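For what it’s worth, a minimal upload of raw identifiers might look something like the sketch below. The RelationshipType column is the one described above; the other column names here are just illustrative placeholders, so check the actual upload template for the exact headers.

```
Identifier,RelatedIdentifier,RelationshipType
http://example.org/specimen/0001,,Congruent
http://example.org/specimen/0002,,Congruent
```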

Import Performance Tests

Another late night…sigh… I wanted to do some performance tweaking on the bulk data import process, so I set up a series of import dataset CSV files containing 10, 100, 1,000, 10,000, 100,000, and 475,574 records, respectively. Unfortunately, my import routine crashed on the first dataset (10 records). Surely the problem must be related to the new performance monitoring code I added to the routine. Right? Nope. SIX hours of frustrating hair-pulling later, I finally sleuthed out the bug, which was related to a rare circumstance that just HAPPENED to be represented in those first ten random records. (Note to self: no one wins when you try to find an obscure bug in dense code when you’re sleep deprived. No one.) In any case, the 100,000 record batch took about 2 minutes to complete, and the 475,574 record batch took about 15 minutes to complete. Not bad, but not great. More performance tuning is in order, methinks.

Email Addresses Hidden

I noticed that earlier today someone batch-uploaded a dataset that contained several email addresses as identifiers mapped to Agents. The import was successful, and the identifiers were correctly incorporated into the BioGUID.org index. However, email addresses belong to one of the few Identifier Domains classified as “Hidden”, which means they are not displayed in the search output. For obvious reasons, we don’t want to expose email addresses on a web service such as this. However, we might want to use BioGUID as a way of locating people when we already have an email address. For example, if I search for ‘8C466CBE-3F7D-4DC9-8CBD-26DD3F57E212’ (my ZooBank/GNUB UUID), I don’t want my email address included in the results. However, if I search BioGUID.org for “deepreef@bishopmuseum.org”, I see no reason why the other identifiers mapped to me (including my ZooBank/GNUB UUID) shouldn’t be displayed. I’ll give this some thought.
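To make the idea concrete, here is a rough sketch of the rule I have in mind, with entirely hypothetical table and column names (this is not the actual BioGUID schema): hidden-domain identifiers would only appear in the output when the search term itself was an exact match on a hidden-domain identifier.

```sql
-- Sketch only; table/column names are hypothetical.
DECLARE @SearchTerm nvarchar(400) = N'deepreef@bishopmuseum.org';

-- Did the user search on a hidden-domain identifier (e.g., an email address)?
DECLARE @SearchTermIsHidden bit = 0;
IF EXISTS (SELECT 1
           FROM Identifiers AS i
           JOIN IdentifierDomains AS d ON d.IdentifierDomainID = i.IdentifierDomainID
           WHERE i.IdentifierValue = @SearchTerm
             AND d.IsHidden = 1)
    SET @SearchTermIsHidden = 1;

-- Suppress hidden-domain identifiers in the results unless the user
-- already knew one (i.e., searched on it directly).
SELECT r.IdentifierValue
FROM ResultIdentifiers AS r          -- hypothetical: identifiers already matched to the search
JOIN IdentifierDomains AS d ON d.IdentifierDomainID = r.IdentifierDomainID
WHERE d.IsHidden = 0
   OR @SearchTermIsHidden = 1;
```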

GNA Work in Hawaii

This week, Dima Mozzherin and Alexander Myltsev are visiting me to work on the Global Names Architecture, and they are very kindly giving me a crash course in using GitHub correctly. I’ll report more on our meeting soon (and the relevance of our GNA discussions to BioGUID.org), but over the next week or so, I will be moving more and more content and source code from BioGUID.org to GitHub. Watch this space!

Moving on to Bulk Import

OK, it hasn’t been 24 hours, but I’m satisfied now that the memory leak issue really has been solved. So I’m going to start playing with bulk import performance.

Screen Scraping Done

The screen-scraping madness finally stopped! I hope I can now get an actual test on the memory leak issue. As soon as that’s done, I’ve got a bunch of new identifiers to start importing!

Architecture Documentation

Oh, and by the way… while I was waiting for the Tomcat thing to sort itself out, I spent some time creating a new documentation page on the data Architecture. It’s not finished yet, but it’s important to try to get as much documentation online as possible. Nowhere near as much fun as writing Code, but important nonetheless. Oh, and another by the way: I updated the content on GitHub to include SQL scripts for generating all the objects in the four databases used to run BioGUID. You can access them directly here.

Ahem

Hokay… so I was able to track down the source of the runaway Tomcat situation. It turned out to be a screen-scraping exercise to harvest identifier cross-links against another website on the same server (yeah, I’m looking at you, Global Phylogeny of Birds people…. ahem). Actually, I have no problem with them doing it; it just messed up my memory leak test; so I’m going for another 24 hours to see if I really did fix the problem. In any case, all that screen-scraping really tells me is that I need to finish making BioGUID services much more functional, and much more visible, so people can just come and get the cross-linked identifiers, without having to ping them one at a time at a rate of about 20 per second…

Memory Leak seems to be Solved

Well…. there’s good news, and there’s bad news. The good news is that, after some 20 hours (ish), the memory leak problem seems to be solved! (Woo-hoo!) The bad news is that I can’t be sure if it’s really solved, because I have a new (separate) issue with Tomcat eating up all my server CPU. SQL’s also pegged out (as it can be at times when there is heavy usage), but Tomcat is saturating the CPU usage (which almost never happens), so something else seems to be happening now. I don’t think my fixes of the memory leak issue touched Tomcat, so I suspect it’s an unrelated issue. The problem is: I don’t know whether I really solved the memory leak problem, or if the runaway Tomcat thing has just masked it (i.e., prevented it from happening by saturating the server). If only I were a real computer programmer, instead of a taxonomist, I wouldn’t feel like such a noob all the time. Sigh. OK, it will probably be a long night (again)….

Dithering Away

OK, except for taking a break to watch “The Martian” with my son (very good, but the book is better), I spent most of today fixing the orphan Identifiers problem, dithering away on documentation for BioGUID.org, and tracking down the source of the memory leak. I think I figured it out, but I need to leave the system alone for 24 hours to make sure (I don’t want to confound performance stats and memory usage with my efforts to monkey around with the underlying system). So, I’ll leave the BioGUID.org site alone tomorrow to get some “clean” stats, and instead crank out some more documentation (data model, etc.). If all goes well, I’ll be back to tweaking performance and increasing the robustness of the bulk import process, while simultaneously adding more content!

Search Result Categories

By the way, I forgot to mention earlier that I decided to expand the categories of search result matches to three. I used to split “Exact Matches” from “Close Matches”, but “Exact” matches were not really exact. So, now “Exact” matches literally mean exact matches, where the search term exactly equals the identifier. “Close” matches are now what “Exact” matches used to be: that is, if the search term is found verbatim within the Identifier, but it’s not an exact match, then it’s considered a “Close” match. Finally, other matches that somehow relate to the search term, but not closely, are displayed in the new category, “Other Matches”. I hope that makes sense…
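In rough SQL terms (with hypothetical table and column names, not the actual query), the three-way categorization boils down to something like this:

```sql
-- Sketch of the three match categories; schema names are illustrative.
DECLARE @SearchTerm nvarchar(400) = N'12345';               -- placeholder search term
DECLARE @Candidates TABLE (IdentifierValue nvarchar(400));  -- candidates from the full-text search

SELECT c.IdentifierValue,
       CASE
           WHEN c.IdentifierValue = @SearchTerm                THEN 'Exact Match'   -- literally equal
           WHEN c.IdentifierValue LIKE '%' + @SearchTerm + '%' THEN 'Close Match'   -- term found verbatim within
           ELSE                                                     'Other Match'   -- related, but not closely
       END AS MatchCategory
FROM @Candidates AS c;
```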

Orphans No More

The missing object records for the orphan Identifiers have now been completely generated. There were over 150 thousand of them, but because of the way SQL works to find such orphans (i.e., via an outer join with a missing link, or via a “NOT IN” WHERE clause), it actually takes a long time to find them all. At first, I was able to find them in batches of about 10,000 in 10 minutes (also allowing the server an additional half hour between each batch to update full-text indexing across the billion identifiers and half-billion objects). However, the fewer orphans there are, the longer it takes SQL to find them. The last three missing records took 3 hours and 42 minutes to find! Once the missing object records were created, it took an additional 2 hours to apply the referential integrity constraints on the database (again, a billion identifiers cross-linked to a half-billion object records) to prevent any more such orphan identifiers in the future. The site was slowed to a crawl during this time, but the dust finally seems to have settled, so BioGUID should be back up and running again at normal speeds.
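For the curious, the two query shapes mentioned above look roughly like this; the table and column names are illustrative, not the actual BioGUID schema.

```sql
-- Orphaned identifiers via an outer join with a missing link...
SELECT i.IdentifierID
FROM Identifiers AS i
LEFT JOIN Objects AS o ON o.ObjectID = i.ObjectID
WHERE o.ObjectID IS NULL;

-- ...or via a "NOT IN" WHERE clause. Either way, the server has to scan an
-- enormous index, which is why each pass takes so long at this scale.
SELECT i.IdentifierID
FROM Identifiers AS i
WHERE i.ObjectID NOT IN (SELECT o.ObjectID FROM Objects AS o);
```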

Old Data Page

Oops! Well… I uploaded the files for the new data export services, but forgot to update the Data page with the user-interface to allow people to play with the new export services through the website. Fixed now.

Dealing with Orphans

OK, after a brief nap, I’ve got a routine working now to fix all the orphaned identifiers (i.e., identifier records linked to Object records that don’t yet exist in BioGUID.org). As soon as that’s done, I’ll alter the database to prevent it from happening again (rookie mistake on my part). Meanwhile, I just discovered a glitch that was causing the “Exact Matches” vs. “Close Matches” to give funky results when searching for identifiers. It should be fine now.
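The “prevent it from happening again” part is just an ordinary referential integrity constraint; something along these lines (hypothetical table and column names):

```sql
-- Hypothetical sketch: require every Identifier to point at an existing Object,
-- so orphaned identifiers can't be created again.
ALTER TABLE Identifiers WITH CHECK
    ADD CONSTRAINT FK_Identifiers_Objects
    FOREIGN KEY (ObjectID) REFERENCES Objects (ObjectID);
```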

Orphan Identifiers

YIKES! OK, I just discovered an issue where Identifiers might be linked to records that don’t exist in the Object table! This was a serious oversight on my part in establishing the core database infrastructure. It’s nearly 4am here in Hawaii, so I’ll have to deal with this tomorrow. I don’t know how extensive the problem is, but the server may be intermittently offline for the next day or two – depending on how many records are affected.

Back Home!

After returning from our expedition to the land of ample bandwidth (and putting out all manner of fires and catching up on some badly needed sleep), I was finally able to upload the rest of the Code for BioGUID.org. The good news is that the full functionality of the site is finally working! The bad news is that there still seems to be a memory leak associated with bulk uploading datasets. I will continue to research this for the remainder of the weekend. I also need to update all the documentation, and get all the source code loaded onto GitHub. I hope to have this done within the next couple of days.

Tedious

For what it’s worth, it took me several minutes just to upload the text of the previous news item, so you can imagine how difficult it is to manage a server through a remote connection at this speed!

No Joy on Internet Speed

Well… my hopes of having better internet access from the ship have not panned out, so I am still unable to finalize the code on the BioGUID server. There have been some issues with the server itself, but it seems to be working most of the time. The good news is that I will be back in the land of high-speed internet in two days, so I hope to have everything uploaded by October 1.

Memory Leak

So… I forgot to mention — until I can figure out where the memory leak is coming from, the site will likely be intermittently unavailable and/or slow. Sorry about that! Maybe I’ll get lucky and the problem will just go away on its own. Yeah, right….

Gone Diving

After eight consecutive days of deep diving, plus several days of no internet access (not to mention a temporarily broken keyboard on my very expensive laptop…), I’m finally now able to get back on the BioGUID server to finish transferring code for the new data services. Unfortunately, when I logged in to the server, I discovered there has been a pretty serious memory leak issue. This was not evident on the development server, and it might just go away when I get the rest of the code ported (oh, to be that lucky!). However, I’m bracing myself for several tedious hours of troubleshooting through what feels like the equivalent of a 9600-baud modem connection to the internet. Sigh. On a brighter note, I was able to finish up a major NSF proposal due on Tuesday, and just finished uploading it to the NSF website. Of course, I need to do this at 1am local time to avoid competition for the precious few data bits I can send and receive to the internet…

Strange Server Behavior

The server has been acting strangely all day long, and I’m not sure why. I’ve had to restart it several times, and it seems OK for a while, but then it grinds to a halt. I can’t help but think that I might have broken something yesterday during my marathon session to port the new code over to the site via the ship’s internet, but I don’t see any obvious issues, and there are no smoking guns in any of the activity logs. I’ll keep an eye on it and see if the problem continues.

Damn Cache

Damn! I just discovered that the background processing of identifiers associated with new objects (i.e., data objects not already in BioGUID.org) is extremely processor intensive, due to the cache updating and subsequent generation of full-text indexes. Not only does this slow the data import process for batch uploads, but it turns out that it brings the server to its knees. Obviously, we’ll need to re-architect the batch import process (perhaps dividing large batches into smaller batches), but that will have to wait until we return to Hawaii. Until then, expect BioGUID.org to be slow or non-responsive while large datasets (more than a few thousand records) are imported.

Batch Import Testing

I’ve just run a test of the Batch Import feature and it seems to be working correctly. Currently, it requires careful formatting of the submitted CSV file, but when we return to Hawaii we plan to implement a system similar to the IPT system, where content providers can create a metadata file describing the structure of the batch file, and then publish updates on a web server that can be regularly harvested by BioGUID.org. It had been my intention to have most of this implemented before I got on the NOAA ship, but I let it slide because I figured I could finish it on the ship. Little did I realize how painstaking that process would be! To give you an idea how tedious it has been, each keystroke and mouse click takes about 3 seconds through the ship’s internet!

Even MORE Progress!

More Progress! It took the better part of the day, but we finally got most of the code transferred to the production server through the ship’s internet. HUGE thanks to Rob Whitton! We didn’t get everything transferred (the code for two more data export services still needs to be ported), but we at least got the new API file updated, and the Batch Import system seems to be fully transferred and running. Read the API page for more details, or visit the Data page to try it out. We didn’t spend much time making it look pretty, but we’ll try to clean it up a bit over the next few days. And we’ll definitely do a lot of clean-up work after we return to Hawaii at the end of September. Meanwhile, please report any issues to me at deepreef [at] bishopmuseum.org. The internet problems are mostly related to uploading content. Downloading (viewing web pages and receiving email) seems to be working a bit better on the ship.

Slow but Steady Progress

Making slow but steady progress transferring the Code files. Unfortunately, last night I tried to transfer some files but the internet on the ship cut out when the power cut out, and I ended up breaking the BioGUID home page! It’s fixed now, but we still have more files to transfer, so there may be some wonkiness on the site over the course of the day.

More Internet Issues

Crap! The ship’s internet sort of works for email and web pages, but heavier lifting (like transferring Code to the BioGUID server) seems to be blocked. So…. some of the pages got transferred, but a lot of the back-end Code did not. I’m hoping to get this solved within the next few days. My apologies!

SSSLLOOWWW Internet

To give you an idea of the internet situation, it took me more than a minute just to transfer that last news item text…. yikes!

At Sea

Oh, and by the way… I’m on a NOAA ship en route to the remote Northwestern Hawaiian Islands (amid a few hurricanes), trying to upload two large video files through a very slow and tenuous satellite internet connection. Obviously, it would have been smart to have transferred the files before getting on the ship; but alas, the days leading up to this cruise were filled with preparations for the cruise!

And we're back!

No, this site is not dead! (Despite what it might seem based on the nearly 5-month hiatus from updating this newsfeed.) I can’t say I’ve been working on BioGUID.org continuously for the past five months (there was a major expedition and some other travelling, some grant proposals and reports to write, some major issues at the Museum, my daughter home for the summer, etc.) However, an absence of newsfeed posts does NOT reflect an absence of progress! I will be reporting on the progress over the next few days, but for right now I need to transfer the latest Code to the Production site in time for the GBIF Ebbe Nielsen Challenge (Round 2) deadline.

Slow, but Steady

I’m still running batch processes behind the scenes, so I sincerely apologize for the inconsistent performance. Almost all of these resource-intensive processes are related to initializing datasets and generating additional indexes. Once the dust settles, routine content enrichment should be far less resource-intensive, and will have much less of an impact on the overall performance of the system. Until then, thanks for bearing with me!

Background Slowdown

DAMN! I guess I picked the wrong time to run a bunch of background processes! To everyone visiting this site for the first time: I promise that the search results are normally very fast (1-3 seconds)! But I’m moving around 220GB of data right now (alongside the 550GB BioGUID database), so the system is much slower than normal! I’ll try to get it back to the normal stable/fast performance as soon as possible!

Finalist!

Wow!! I am deeply honored, humbled, and excited that BioGUID.org has been selected as one of the finalists for the GBIF Ebbe Nielsen Challenge! A hearty THANK YOU to the jury — this has definitely re-invigorated me in keeping this project moving forward! More to come soon!

GBIF Refresh

The FTP transfer of the new GBIF download is complete, and I’m running some batch processing in the background. Over the next few days there will be periods of time when the server is much slower than normal (including right now). I hope to have this done within the next couple of days, and I will try to confine the processing to the middle of the day in Hawaii (night-time everywhere else in the world!)

Still Making Progress!

Once again, an absence of posts to this news-feed does not reflect an absence of progress. Last week I downloaded a new cut of the GBIF dataset (identifier fields only — thanks again to Tim!), and I spent this weekend processing and indexing it. This time I did the processing on a different server so that I wouldn’t impact performance on the production server. The bad news is that there are still 11 hours left on a 36-hour FTP transfer of the processed file to this server. Sigh…

IT'S WORKING!

HOLY CRAP! It’s WORKING!! No, I don’t just mean that the BioGUID.org website and data service are working; I mean the BioGUID.org CONCEPT is working! Check out the “Update” to Use-Case #1.

Damn Robots

So… I noticed a lot of queries for filenames ending with ‘.jsp’ (obviously robot probes looking for security holes). Out of curiosity, I did a search on ‘.jsp’, and came up with three objects, one of which had a zillion linked identifiers. At first I thought it was an error, but it’s actually correct (all are identifiers associated with the Black-headed Gull, Chroicocephalus ridibundus). However, at least two of the links were broken. The IUCN identifier had been linked to ‘106003240’ (should be ‘22694420’), and the OBIS identifier had been linked to ‘834981’ (should be ‘460266’). I corrected both records, but my concern is: how many other identifiers out there really aren’t so persistent? And… should BioGUID.org track historical identifiers (with re-direction to the correct ones)? Food for thought.

Oops!

I was experimenting with building a full-text index on one of our import databases, and I inadvertently filled the hard drive to the point where queries slowed to a crawl. I’ve solved the problem, but for part of today the search results were very slow.

ZooKeys Identifiers

I just realized that the Journal ZooKeys has its own internal identifiers assigned to articles, with its own Dereference Service. I indexed one record, but will ask Pensoft for the full set of identifiers.

Search Buffer Purge

Oops! I just discovered that I wasn’t properly clearing the search buffer table, and it had grown to nearly 9 million records! I’ve purged it now, so I hope that search performance will improve. NOW I’ll spend some time with family!

Taking a Break

I’m going to leave the system alone for the next couple of days (while I spend time with family), and monitor performance. Next week I’ve got a few million more identifiers to import, then it’s all about developing new services.

Performance Slowdown

Over the past few days as I’ve been running scripts in the background, the search performance slowed to roughly 10 seconds per search (with some searches taking 30-90 seconds). I’ve finished this for now, so searches are back down to 1-2 seconds.

Refreshing Records

I’ve now completed the script to refresh new identifiers from ZooBank/GNUB. I’ll continue testing it over the next few days, then implement it as a general service so any issuer of identifiers can maintain current identifiers indexed in BioGUID.org.

Automatic Updating

One of the things I’m working on is a service to allow automatic updating of identifier lists from external sources. I spent today testing it with GNUB/ZooBank identifiers (still ongoing).

Resuming Development

After a week of working on other projects, I returned to development on BioGUID.org. The system is currently slow because of some background processing and indexing. More to report tomorrow.

Speedy

One last thing before calling it a night: it looks like almost all recent searches took less than a second! Fingers crossed that this speedy behavior continues!

To-Do List

I decided to post my on-going “To-Do” List on the Home Page, so it’s more apparent what has already been done, and what’s planned for the near-term future. If nothing else, it will serve as a reminder to me!

More on Dereference Services

I added support on the identifier Search Results page for additional Dereference Services. The Preferred Dereference Service is the first icon after the identifier, and any other Dereference Services are shown as additional icons.

MY Logo!

BioGUID.org now has a Logo! See the left end of the banner. OK, so I’m not an artist… but it’s reasonable, I think.

Logos

I changed the column name “Logo” to “IdentifierDomainLogo” in the existing APIs. I did this because I am now adding support for DereferenceServiceLogo as well.

Taking a Break

I want to leave the system alone for a few days, so it will actually work the way it’s supposed to. However, next week I’ll start looking into the next big batch of identifiers to index — probably the values of occurrenceID in the GBIF dataset.

Slow but Steady

OK, things seem to be working properly now. Searches are still slow (averaging about 10 seconds) — but that’s much better than it was before (several minutes). No more batch processing for a while…

More Indexing

Residual indexing is still ongoing, so searches will continue to be slow. If this lasts more than another day, I’ll restart the server. Apologies for the inconvenience!

Disk IO Maxed

In case anyone is interested, the specific aspect of the server that is maxed out is disk I/O (not CPU, RAM or Network). This is why I think it’s the internal SQL index refresh that is causing the problem.

Still Indexing

Apparently, simply stopping the batch processing did not unlock the server (most likely due to lingering indexing). For now, I’ll wait to see if it eventually finishes; but for the time being, searches will be slow.

Server Restart

Well… something went awry with my batch process to merge data objects, and it essentially locked up the server. My sincere apologies to anyone who tried to use BioGUID during the past few hours. I’ve stopped the batch processing for now.

Stability

The system is now fairly stable. I’m continuing to run processes to locate and merge data objects based on certain identifier values. These are run in batches of 50,000 identifiers every 30 minutes or so, and result in momentarily slow search performance.

Background Heavy Lifting

One final note before I call it a night: I’m still running scripts in the background to discover duplicate identifiers, so the search feature may be intermittently slow. When the scripts are not running, a search should only take 2-3 seconds at most.

Follow-Up

Just a quick follow-up to the previous post: the download files are intended to be examples of what BioGUID.org can do. Most of these will be developed as dynamic data services, rather than static download files.

API Expansion

I added more information on the API page, describing a new “Export Data Structure” download file format that allows cross-linking identifiers to each other.

Use Cases

I created a new Use Cases page that helps explain the value and function of BioGUID.org through two specific use cases. There are MANY other use cases that will be added in the near future, so check back often!

Still Alive

The absence of new News Items in recent days is not reflective of an absence of progress! The indexing finished a couple of days ago, and I have spent the past two days reviewing content and further developing the website. Details in a moment!

Indexing Redux

Oops! I guess the indexes are not quite done yet. I discovered a glitch that affects the search feature using the index, which is currently updating. I’ll make sure it’s really ready before my next post (within the next few days).

First New Dataset

The first new dataset was just successfully imported! The entire import process took only 540 milliseconds. It only involved 28 new identifiers; but the important thing is the import code works! About a half-million more identifiers in the pipeline…

Two Weeks of Labor

It’s READY! After two weeks of trying to get the indexing of over a billion identifiers for more than half a billion data objects correctly implemented, BioGUID.org is now fully functional! More to follow throughout the rest of today…

Indexing Complete!

The indexing is finally complete! But…. it’s after 3am here in Hawaii, so time for some sleep. More in a few hours…

10 Million per Hour (+20)

We seem to be making steady progress on indexing records at a rate of 10 million objects every hour and 20 minutes or so. When we reach 530 million objects, the indexing will be complete and I’ll start adding new content.

A New Approach to Indexing

After nearly a week, the indexes were still being built. I restarted the process in batches of 10 million records at a time. Based on the current batch rate, I expect the process to be complete in about two days. Fingers crossed!

STILL Indexing!

The indexing process is STILL ongoing (sigh…). There’s not much I can do with the database itself until the indexing is complete, but I am preparing some new datasets for import as soon as it’s done.

Slow Searches

Also, searching will be slow and results will be incomplete until the indexing has finished.

Indexing Ongoing

The full-text indexing of identifiers is still ongoing. I don’t know how much longer it will take, but I’m holding off adding more identifier records until after it completes. I’ll post an update when that happens.

Still Indexing

As of this morning (Hawaii time), the indexes are still being rebuilt, so I apologize for the slow response of the server. There’s no easy way to predict how long this will take, but I’m hoping it will be done by the end of the weekend.

GBIF Complete

All GBIF-issued Occurrence identifiers (gbifID) and their corresponding DarwinCore “Triplet” (institutionCode+collectionCode+catalogNumber) identifiers have now been imported to BioGUID.org. Later we will also import values of occurrenceID.
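For anyone unfamiliar with the “triplet”, it is simply those three collection fields concatenated into a single string. Here is a sketch of how such an identifier can be assembled from a staging table (the colon delimiter and table name are my own assumptions, for illustration only):

```sql
-- Hypothetical sketch: build a DarwinCore "triplet" identifier from its parts.
SELECT o.gbifID,
       CONCAT(o.institutionCode, ':', o.collectionCode, ':', o.catalogNumber) AS TripletIdentifier
FROM GbifOccurrenceStaging AS o;   -- hypothetical staging table for the GBIF dump
```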

ONE BILLION!

We just surpassed one BILLION indexed identifiers in BioGUID.org! It will be another day or so before the internal indexes are fully re-built, and the services functioning as they should. Then the real fun begins!

Re-Indexing

I need to rebuild some indexes on the large tables, which means two things: 1) The rest of the GBIF identifiers won’t be imported for another couple of days; and 2) the web services and web site will probably be extremely slow for a while.

Big Batches Better

I tried importing 100 batches of records with one million records per batch (instead of ten batches with ten million records each), and that proved to be a mistake (the server bogged down after only 31 million records). I need to allow the server some time to breathe, then will finish importing the GBIF records in batches of 10 million records.

Another 100 Million!

Another hundred million records imported while I slept! And it only took three and a half hours. I need to cap the memory used by the database; otherwise it pegs out and brings the server to its knees. About to kick off another hundred million records…

60 Million Records and Growing

The time I spent revising the data model paid off — I just imported 60 million GBIF records in less than two hours. The optimum ratio seems to be batches of 10 million records separated by 10 minutes (to allow indexing to catch up).

Improving Performance and Scalability

I've slightly revised the data model to improve import performance, and also to make the model more scalable for large numbers of data objects (>> 2 billion). This also simplified the model and eliminated the superfluous UUIDs generated by BioGUID.org.

One Million at a Time

After some experimentation, I’ve decided to import GBIF Occurrence records in batches of 1 million records, at an interval of one batch every 30 minutes. At this rate, it will take some time to get all 528 million GBIF records imported, but fortunately this only needs to be done once!
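For the record, that kind of throttled import is conceptually very simple; a rough sketch (with a hypothetical staging table and column names, not the real import code):

```sql
-- Hypothetical sketch of a throttled bulk import: one batch of 1,000,000
-- staged rows every 30 minutes, so that indexing can keep up.
DECLARE @BatchSize int = 1000000;

WHILE EXISTS (SELECT 1 FROM GbifStaging WHERE Imported = 0)
BEGIN
    ;WITH NextBatch AS (
        SELECT TOP (@BatchSize) Imported
        FROM GbifStaging
        WHERE Imported = 0
        ORDER BY StagingID
    )
    UPDATE NextBatch SET Imported = 1;   -- the real routine also inserts into the Identifier/Object tables here

    WAITFOR DELAY '00:30:00';            -- give the server time to breathe before the next batch
END;
```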

Faster Imports!

I can now process 100,000 GBIF records in about 20 seconds. I tried processing a million at a time but it was less efficient, so batches of 100,000 seem to be about the best performance. Later today I’ll figure out how close together I can process the batches.

Slow Imports

The first 200,000 GBIF Occurrence identifiers were successfully imported into BioGUID! Unfortunately, the current process takes about 7 minutes per 100,000 records. I’ll work on improving this performance.

Server Limits!

Unfortunately, I’m bumping up against some server limits in processing GBIF identifiers. While I get that sorted, check out the new FAQ page. It’s still under development, but I’ll add more content later this week.

First Million Records

The first 1 million GBIF records took only about a minute to process. However, as the indexes grew, each successive million records slowed down. I’m now planning to do them in smaller batches.

Too Ambitious

I was a bit ambitious and tried to process the first 100 million GBIF records in one go. After 7 hours and 45 minutes of processing, the server was brought to its knees, and needed to be restarted.

Indexing Glitch

I just discovered a glitch in the indexing search routine that failed to find matches for identifiers without any associated DereferenceServices (e.g., ISSNs). This has been fixed, so searches on ISSNs (and other non-service-based identifiers) should be working fine now.
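I haven’t posted the offending query, but the classic way this kind of bug creeps in is an inner join to the services table, which silently drops any identifier that has no service; switching to an outer join is the usual cure. Purely as an illustration (hypothetical names):

```sql
-- Illustration only: an INNER JOIN here would silently drop identifiers with no
-- DereferenceService (e.g., ISSNs); a LEFT JOIN keeps them in the results.
SELECT i.IdentifierValue, ds.DereferenceServiceName
FROM Identifiers AS i
LEFT JOIN DereferenceServices AS ds
       ON ds.IdentifierDomainID = i.IdentifierDomainID;
```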

GBIF Import

And now the real work begins! After downloading 528 million+ records from GBIF (thanks to Tim Robertson for creating an identifier field dump for me!), I’m now processing the records for importing to BioGUID.org.

Cleaning up Duplicates

I discovered an error in my batch import script that caused an excess of identifiers to be imported (I wasn’t trapping for logical duplicates). The current totals stand at 1,298,749 identifiers linked to 448,502 unique objects, yielding an identifier-to-object ratio of about 2.9.

Submitted!

It’s submitted! With literally one minute to spare, no less! (Phew!) I still need to update the About page and write descriptions of the APIs. And we still need to make the IdentifierDomain submission form work. But we’re getting there!

Online!

OK, the website is functional (still tweaking a bit…). Time to generate the video file describing the site. PowerPoint, don’t fail me now!

Looming Deadline

Only a few hours to go before the deadline, and we’re still tweaking the site. This is going to be close…

First 1.5 Million Identifiers!

More than 1.5 million identifiers assigned to over 800K objects are now in the system, and so far the search process is wicked fast! Kudos to Full-Text Searching!

Batch Import Testing

Now testing the batch import routines. We’re going to use the existing internal identifiers (UUIDs) and cross-linked external identifiers within the Global Names Usage Bank database to seed the BioGUID system and test performance on the data services.

Building a Web Interface

Most of the basic stored procedures and functions are written and have been tested. Time to start building the web interface and web services!

Data Model Development

The data model behind the BioGUID indexing service is nearly complete. The next step is to write the stored procedures to create, edit, and search on the key data objects.
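To give a flavor of what those will look like, here is a minimal sketch of a search procedure over the identifier index. The names and the simple LIKE predicate are purely illustrative; the real procedures will lean on full-text indexing and will be considerably more elaborate.

```sql
-- Minimal sketch of a search stored procedure (hypothetical schema).
CREATE PROCEDURE dbo.SearchIdentifiers
    @SearchTerm nvarchar(400)
AS
BEGIN
    SET NOCOUNT ON;

    SELECT i.IdentifierValue, i.ObjectID
    FROM Identifiers AS i
    WHERE i.IdentifierValue LIKE '%' + @SearchTerm + '%';   -- full-text search in the real version
END;
```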