TIPS to interact efficiently with Databases using Talend
- In the Database Output component choose the "Enable Parallel Execution" (Last option in the Advance Settings)
- Use a reasonable batch-size (The default is 10,000 rows)
- Use the ELT components and Bulk Execute components where ever applicable.
- Choose only the columns you are interested in your data-sets. Avoid using select * from in your queries.
- Choose only those rows that you are interested in by limiting them with a WHERE clause
- Do not perform blocking operations like "Aggregate/Sorting" on the Talend. The DB is more efficient in doing these operations. For Aggregation/Sorting, the entire dataset has to be loaded before the operation is performed. So if the data-set is huge, it could choke the memory and cpu.
- Understand the "Commit" options. Commiting frequently has pros and cons
- Example - You have 20000 rows to be written to the database. Lets assume that we are commiting for every 5000 rows. If there is any error after the second commit (meaning after 5000 + 5000 = 10000 rows), our database will be in a state where rows have been written partially. Unless your code is re-runnable, you could end up in a situation where you need to first clean up before you start again..
- Commiting frequently avoids huge log files on the database.
- Commiting only at the end could have an impact on the memory usage if it is a large batch of rows that you are waiting to commit.
No comments:
Post a Comment