When working with APIs to ingest data into databases using Python, efficiency and reliability are key. APIs are often outside of your control: you depend on their uptime, rate limits, and data structures. That means your ingestion pipelines need to be resilient against failures. In this blog I’ll share three tips for building more resilient and efficient ingestion pipelines.
Often when an API call fails, programs simply return the error directly or retry in a naive loop with a fixed number of attempts. This results in more manual intervention by the developer than necessary. Implementing a proper retry policy reduces the number of failures in your program. In Python this is easy with a package like tenacity: decorating a function (your API call) with @retry makes it retry automatically on failure. It has the following useful features:

- stop conditions, such as giving up after a maximum number of attempts (stop_after_attempt)
- wait strategies, including exponential backoff between attempts (wait_exponential)
- retrying only on specific exception types, so a bug in your own code isn’t retried (retry_if_exception_type)
- hooks for logging each retry, so failures stay visible (before_sleep)
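Here is a minimal sketch of what this looks like in practice; the URL and response shape are placeholders:

```python
import requests
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(5),                   # give up after 5 attempts
    wait=wait_exponential(multiplier=1, max=30),  # back off 1s, 2s, 4s, ... capped at 30s
    retry=retry_if_exception_type(requests.exceptions.RequestException),
)
def fetch(url: str) -> dict:
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # HTTPError is a RequestException, so 4xx/5xx are retried too
    return response.json()
```

With this configuration, transient problems like timeouts or server errors are retried with increasing delays, while any other exception fails fast.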
APIs often use pagination when the dataset is too large for a single response or to keep network responses manageable. The straightforward implementation loops over the pages and fetches them one by one, but with hundreds of pages this is very inefficient: you wait for each response before sending the next request. A much faster approach is to make concurrent requests using asyncio, sending all the requests up front and gathering the responses as they come in. This significantly reduces the wait time of the program. For one of my ingestion pipelines it cut the runtime from 30 minutes to 5 minutes! This matters especially when your script runs on expensive cloud resources.
However, beware of sending too many requests at once: you might exceed the request limit of the API and be blocked! Capping the number of in-flight requests, for example with a semaphore, keeps the concurrency under control.
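A sketch of concurrent pagination using aiohttp; the endpoint, the page count, and the concurrency cap of 10 are placeholder assumptions:

```python
import asyncio
import aiohttp

BASE_URL = "https://api.example.com/items"  # placeholder endpoint

async def fetch_page(session: aiohttp.ClientSession, page: int) -> dict:
    # One paginated request; raise_for_status surfaces HTTP errors.
    async with session.get(BASE_URL, params={"page": page}) as response:
        response.raise_for_status()
        return await response.json()

async def fetch_all_pages(total_pages: int, max_concurrency: int = 10) -> list[dict]:
    # The semaphore caps in-flight requests so we stay under the API's rate limit.
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded_fetch(session: aiohttp.ClientSession, page: int) -> dict:
        async with semaphore:
            return await fetch_page(session, page)

    async with aiohttp.ClientSession() as session:
        tasks = [bounded_fetch(session, page) for page in range(1, total_pages + 1)]
        # gather runs all requests concurrently and returns responses in page order.
        return await asyncio.gather(*tasks)

pages = asyncio.run(fetch_all_pages(total_pages=200))  # 200 is a placeholder
```

Instead of paying the full round-trip latency per page, you only wait for the slowest request in each window of ten.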
When ingesting API data into a database, the way you handle inserts is just as important as fetching the data. A common mistake is writing all the data in a single insert: your application uses too much memory and you risk overloading the database. Instead, you should batch your inserts, writing a manageable number of rows (a couple thousand) at a time. This improves throughput and reduces the load on both your application and the database. Finally, combining batching with retry logic ensures that even if a batch write fails, you don’t lose the entire ingestion job: only the affected batch is retried.
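As an illustration, here is a minimal sketch using psycopg2 against PostgreSQL; the table name, columns, connection string, batch size, and the fetch_rows_from_api helper are placeholder assumptions:

```python
from itertools import islice
from typing import Iterable, Iterator

import psycopg2
from psycopg2.extras import execute_values
from tenacity import retry, stop_after_attempt, wait_exponential

BATCH_SIZE = 2_000  # a couple thousand rows per insert

def batches(rows: Iterable[tuple], size: int) -> Iterator[list[tuple]]:
    # Yield successive chunks of `size` rows without holding everything in memory.
    iterator = iter(rows)
    while batch := list(islice(iterator, size)):
        yield batch

@retry(stop=stop_after_attempt(3), wait=wait_exponential(max=10))
def insert_batch(conn, batch: list[tuple]) -> None:
    # If this batch fails, only this batch is retried; earlier commits are already safe.
    try:
        with conn.cursor() as cursor:
            execute_values(cursor, "INSERT INTO items (id, name) VALUES %s", batch)
        conn.commit()
    except Exception:
        conn.rollback()  # reset the failed transaction before tenacity retries
        raise

conn = psycopg2.connect("dbname=ingest")  # placeholder connection string
rows = fetch_rows_from_api()              # hypothetical: rows produced by your API pipeline
for batch in batches(rows, BATCH_SIZE):
    insert_batch(conn, batch)
conn.close()
```

Because each batch is committed independently, a failure partway through only re-runs the batch that failed, not everything already written.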