Working in the GIS field, it’s very common to gather data for a project from various adhoc sources, and often that data is delivered in the form of an Excel spreadsheet. I’ll preface by saying that there’s nothing wrong with delivering data in this way. Excel is a powerful and widely used tool in the business world, and just about everybody who’s ever worked in an office has opened an Excel spreadsheet. But too often in my GIS career I’ve come across projects that are using Excel spreadsheets as a master repository, storing large amounts of tabular data within. It’s not hard to imagine how data integrity can quickly become compromised when people are copying/pasting/emailing/editing/saving spreadsheets, both on their local disk and shared network folders.
So why isn’t Excel a database? It has rows and columns and you can sort data, create formulas, query data with features like VLOOKUP right? While those things may be true, there are several reasons Excel is not a database:
- Excel does not allow columns to contain only one specific primitive datatype – integers, floating point decimals, boolean, string, etc. In a database table, each column or field can be constrained to only allow one datatype. Often there will be a column you want to only contain a specific value, like an integer for an ID number. Excel does not provide any data validation, and will not prevent users from entering an invalid value for for the ID. A database table would throw an error, not allowing text to be used for the integer ID column. A database table can also disallow null or empty values for a column, requiring every row to have a value, useful for ensuring each row has an ID number for example. Excel does not provide this functionality.
- Excel does not allow multiple users to open and edit the same Excel file. This is a very limiting factor experienced by many users who share files on network drives. I have personally run into this issue quite a few times, having to ask a somebody on the team to close a file that they had open so I could make some edits. Inevitably, myself or somebody else would make a copy of the file to work on locally, and then we’d have issues reconciling the edits each person made with the master copy. A database is designed to allow hundreds or thousands of users to concurrently query, view, and edit the data.
- Excel is slooooow. Like really slow. The file size can balloon when there are many rows and columns, many formulas, and special formatting, filters, etc in the worksheets. Sorting worksheets with many rows can be painfully slow, and out of memory errors can happen when the content of the Excel file becomes too large. Databases are designed to maximize performance and offer nearly limitless space, depending on hardware configuration of course. (SQL Server can handle databases of up to 524,272 terabytes!) Excel files should be out of the question when data tables have hundreds of thousands of rows or more or data.
- Excel does not allow you to query or join data from multiple tables. Well, it does kinda…features like PivotTables, VLOOKUP,and HLOOKUP can provide functionality similar to some database queries by summarizing data, searching for matching criteria, etc. I won’t get into how to use these features but they do not quite give the same functionality and flexibility as a database provides. A significant limiting factor of Excel is not being able to link or join tables between files – all of these operations must be done with worksheets in the same Excel file. Where Excel really falls short however, is one to many relationships. This is an important concept in data relationships, and these kinds of table joins cannot easily be performed in Excel:
- Last, but certainly not least, Excel doesn’t provide any data backup or recovery tools. You may have some automatic file backups on your computer’s hard drive or on a shared network drive, but these may be insufficient if an Excel file is destroyed or altered. A database will have built in features to backup data, and sometimes allow the data to be rolled back to a given point in time. Bringing data back online after a hardware failure is generally an easy process with most database software. A database can also keep logs of all changes to the tables, making it possible to undo a specific table update if an error was made. On a related note, databases will offer options to securely access your data as well. Excel files offer little in the way of protecting the data from unauthorized viewers.
To summarize, Excel is a great tool for data analysis, presentation, and mathematical and statistical calculations. It is a not so great tool for storing large amounts of data. It’s often difficult to maintain data integrity using Excel, especially among multiple users. When the content of the file becomes too large, Excel performance drops significantly making viewing and editing the data difficult.
For any readers out there who currently find themselves in a situation where they feel Excel is not quite the right tool for data storage, I would urge you to explore using a database for your project. There are quite a few options out there, both free and paid, for setting up a database. It’s probably a little beyond the scope of this post to discuss all the various database storage and retrieval types and the specific database software, but since I’m a .NET developer I would suggest trying SQL Server Express. It’s easy to set up a local database server and begin experimenting with migrating data from Excel into database tables. You may also already have Microsoft Access installed on your machine as part of Microsoft Office.