Big Data: Dataset Importing Guide
If there's a lot of data that needs to be constantly updated, it can easily become a terrible headache. How do you approach the problem of doing regular data imports? Let us walk you through the process step by step.
Data Sources
First, you need to decide where you will fetch your data from. For example, if it's data about schools you want, the Ministry of Education’s database might be a good place to start looking. But there can be other sources, and here's what you should take into consideration when comparing them.
Price
It's worth noting that paying for data does not automatically guarantee its quality. Sometimes, there are great free and community-maintained data sources available. For example, OpenStreetMap will be good enough for most people looking for map-related data. However, if there are both paid and free alternatives, and a paid database has fewer errors, it makes sense to make the initial investment. It will pay off in the long run, because you won’t need to spend so much time fixing the errors during every import.
How is the data collected?
You can save a lot of time by analyzing data collection methods. Let's take students' grades as an example. In some schools, teachers write those grades on paper, and later some poor soul has to type them all in to store them digitally. In other schools, grades are typed in directly by teachers and can immediately be seen (and contested, if a mistake is made) by students and their parents. The latter method is clearly better, because it creates a self-correcting system, with students and parents doing the verification for you. Analyzing details like that can get you a dataset with few errors instead of one riddled with them.
How many sources do you need?
It's great to have lots of sources, because you can use cross-referencing as an automated system for correcting mistakes. If you have ten sources and in nine of them some movie is titled "Batman & Robin", you can ignore the one that claims it's titled "Batman and Robin". However, those sources tend to change formats, and each time that happens, somebody will have to fix the code. You need balance here – relying on very few sources is error-prone, but maintaining a large number of them gets difficult.
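As a rough illustration of that cross-referencing idea, here is a minimal sketch in Python of resolving a field by majority vote across sources; the titles and the tie-handling rule are just assumptions for the example.

```python
from collections import Counter

def resolve_by_majority(values):
    """Pick the value most sources agree on; flag ties for manual review."""
    counts = Counter(v for v in values if v is not None)
    if not counts:
        return None, "no data"
    ranked = counts.most_common(2)
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None, "tie - needs manual review"
    return ranked[0][0], "ok"

# Nine sources agree on the title, so the tenth one is overruled.
titles = ["Batman & Robin"] * 9 + ["Batman and Robin"]
print(resolve_by_majority(titles))  # ('Batman & Robin', 'ok')
```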
Initial Import
Now that you have selected the sources, developers can start doing their magic. At this stage, the most important thing to remember is that writing an importer from scratch can, and probably will, be a very long process. However, time invested up front in a robust importer means subsequent imports will take less time, so that initial investment pays off.
You will also need to make a lot of decisions at this stage. Sometimes it may be worth not providing some piece of information at all: when obtaining it is too hard, you won't get enough value for your money. So be prepared for questions like "if a place has no zipcode provided, should we look it up using paid Google Maps services, or is it not that important?" or "what should we do if the same book has different authors in two sources?", and so on.
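One way to keep those decisions from getting buried in the importer is to write them down as explicit, named policies that anyone can review. The sketch below is purely illustrative: the field names, policy names, and the "publisher" source are hypothetical.

```python
# Each entry records a decision the team made, not something the code guessed.
FIELD_POLICIES = {
    "zipcode": "leave_empty",        # decided: a paid lookup isn't worth the cost
    "authors": "prefer_publisher",   # decided: when sources disagree, trust the publisher feed
}

def resolve_authors(by_source):
    """`by_source` maps a source name to the author list it reported."""
    if FIELD_POLICIES["authors"] == "prefer_publisher" and "publisher" in by_source:
        return by_source["publisher"]
    # Fall back to any source when the preferred one has nothing.
    return next(iter(by_source.values()), None)

print(resolve_authors({"publisher": ["A. Author"], "library_feed": ["A. Author", "B. Editor"]}))
```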
Finally, be aware that the import itself can take days. Adding millions of interconnected records to a database is time-consuming. But if the importer is written correctly, subsequent imports should be much quicker, since they will only import the changes made to the dataset since the previous run.
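One common way to achieve that is to record when each import finished and ask the source only for rows modified since then. Here is a minimal sketch, assuming an SQLite database (3.24+ for the upsert syntax) and a hypothetical fetch_changed_rows function that queries the external source.

```python
import sqlite3
from datetime import datetime, timezone

def incremental_import(conn, fetch_changed_rows):
    """Import only the rows changed since the last successful run.

    `fetch_changed_rows(since)` is assumed to ask the external source for
    (id, name, updated_at) tuples newer than `since`; adjust to your dataset.
    """
    cur = conn.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS import_runs (finished_at TEXT)")
    cur.execute("CREATE TABLE IF NOT EXISTS records (id TEXT PRIMARY KEY, name TEXT, updated_at TEXT)")
    last_run = cur.execute("SELECT MAX(finished_at) FROM import_runs").fetchone()[0] or "1970-01-01T00:00:00+00:00"

    for record_id, name, updated_at in fetch_changed_rows(since=last_run):
        cur.execute(
            "INSERT INTO records (id, name, updated_at) VALUES (?, ?, ?) "
            "ON CONFLICT(id) DO UPDATE SET name = excluded.name, updated_at = excluded.updated_at",
            (record_id, name, updated_at),
        )

    # Remember when this run finished so the next one can pick up from here.
    cur.execute("INSERT INTO import_runs VALUES (?)", (datetime.now(timezone.utc).isoformat(),))
    conn.commit()
```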
Handling Errors
Unfortunately, there will also be errors to handle. Some of them will be fixed by developers, but problems with invalid, missing, or inconsistent data should be handled by more suitable employees. For example, if there’s no clear information on whether a school is a primary or a secondary school, it's best for somebody knowledgeable about the education system in the country of interest to track that data down. This is extremely important – dividing responsibilities so that each task lands with the most suitable person makes the whole process faster and less frustrating for everyone involved, and the data more reliable.
However, in order to allow other people to fix data-related problems, developers should prepare code that will generate spreadsheets with categories such as "books with price missing", "people with inconsistent age", etc. Once filled, these spreadsheets should go back to the developers, and the provided answers should be saved to the database. The great thing about this solution is that the same errors won’t pop up again and again during every import. Solving them once will be enough – the solutions will be stored and can be used in the future.
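A minimal sketch of that round trip, using CSV files as the "spreadsheets" and a plain dictionary as the stored answers; the file names and the missing-price category are just examples.

```python
import csv

def export_problems(records, path="books_with_price_missing.csv"):
    """Write records needing human attention to a spreadsheet-friendly CSV."""
    problems = [r for r in records if not r.get("price")]
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "title", "price"], extrasaction="ignore")
        writer.writeheader()
        writer.writerows(problems)
    return len(problems)

def load_resolutions(path="books_with_price_missing_filled.csv"):
    """Read the filled-in spreadsheet back; keep the answers for future imports."""
    with open(path, newline="") as f:
        return {row["id"]: row["price"] for row in csv.DictReader(f) if row["price"]}

def apply_resolutions(records, resolutions):
    """Apply stored answers so the same problems don't come back next time."""
    for r in records:
        if not r.get("price") and r["id"] in resolutions:
            r["price"] = resolutions[r["id"]]
    return records
```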
Another important thing is that problems should be reported in bulk, after analyzing all the data sources, but before the import to the database itself. This way, the longest part of the import – adding data to the database – has a higher chance of being successful, since all the problems should be solved before it even begins.
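In code, this usually means a separate validation pass that collects every problem across all records before a single row is written. A simple sketch, with hypothetical field names:

```python
def validate_all(records):
    """Collect every problem up front instead of failing halfway through the import."""
    problems = []
    for r in records:
        if not r.get("title"):
            problems.append((r.get("id"), "missing title"))
        if r.get("price") is not None and r["price"] < 0:
            problems.append((r.get("id"), "negative price"))
    return problems

def run_import(records, write_to_db):
    problems = validate_all(records)
    if problems:
        # Report everything in bulk; the slow database step never starts.
        return {"status": "blocked", "problems": problems}
    write_to_db(records)
    return {"status": "imported", "count": len(records)}
```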
If that's not possible, because there's too much data and it must be split into chunks, there has to be a way of reverting the whole import. Sometimes, the last thing to be imported reveals some problems with everything else, and that's when you want to have the option of going back. Let’s say you're importing a dataset on schools, and it turns out that the last school in it belongs to a previously unseen subcategory for high schools, which means that all the schools labeled as "high school" should have a different age range than you’d thought. A revert button will come in handy in a situation like this.
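One simple way to get that revert button is to tag every row with the id of the import batch that created it, so a whole run can be deleted in one statement. A sketch along those lines, again using SQLite purely as an example:

```python
import sqlite3
import uuid

def import_batch(conn, rows):
    """Tag every inserted row with a batch id so the whole run can be undone."""
    batch_id = str(uuid.uuid4())
    cur = conn.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS schools (id TEXT, name TEXT, batch_id TEXT)")
    cur.executemany(
        "INSERT INTO schools (id, name, batch_id) VALUES (?, ?, ?)",
        [(r["id"], r["name"], batch_id) for r in rows],
    )
    conn.commit()
    return batch_id

def revert_batch(conn, batch_id):
    """Remove everything that a single import batch added."""
    conn.execute("DELETE FROM schools WHERE batch_id = ?", (batch_id,))
    conn.commit()
```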
Proper error handling is probably the most important part. Do it right, and you can reduce the time needed for each import from weeks to days or even hours.
Scheduling Subsequent Imports
Subsequent imports will be quicker than the first one, but that doesn't mean they will be instantaneous, because problems can still happen. New data might appear with previously unseen values, which will have to be handled. Formatting may change – columns can be added, renamed, or removed. When deciding how often imports should run, it's best to assume that those problems will happen. If every import usually takes about two days, scheduling imports every week is probably safe, but not more often than that.
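A cheap safeguard against such format changes is to compare the columns of each incoming file against the columns the importer expects and stop early when they differ. A minimal sketch, with an illustrative expected-column set:

```python
EXPECTED_COLUMNS = {"id", "name", "type", "city"}   # whatever your importer relies on

def check_columns(header_row):
    """Catch format changes before the import starts, not halfway through it."""
    incoming = set(header_row)
    added = incoming - EXPECTED_COLUMNS
    removed = EXPECTED_COLUMNS - incoming
    if added or removed:
        raise ValueError(f"Source format changed: added={sorted(added)}, removed={sorted(removed)}")

# A renamed column shows up as one removal plus one addition.
check_columns(["id", "name", "school_type", "city"])  # raises ValueError
```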
Summary
Importing data is not an easy task, and definitely not one that can be handled by developers alone. This guide should help you make those imports as painless as possible. Choose data sources wisely. Be aware that building importers from scratch takes time. Have somebody help developers with error handling. Finally, don't rush your imports. Keep this advice in mind, and everything will go as smoothly as possible.