# GAIS Data Import for ClassroomCopilot

This document describes the GAIS (Get All Information from Schools) data import functionality for ClassroomCopilot, which allows you to import publicly available school databases into the Neo4j database.

## Overview

The GAIS data import system is designed to import publicly available educational data from various sources, starting with Edubase All Data. The system follows Neo4j naming conventions and creates a comprehensive graph structure representing schools and their relationships.

## Neo4j Naming Conventions

The import system adheres to the following Neo4j naming conventions:

- **Node Labels**: PascalCase (e.g., `Establishment`, `LocalAuthority`, `EstablishmentType`)
- **Relationships**: `HAS_` prefix followed by the target node label in upper snake case (e.g., `HAS_LOCAL_AUTHORITY`, `HAS_ESTABLISHMENT_TYPE`)
- **Properties**: camelCase (e.g., `establishmentName`, `schoolCapacity`, `numberOfPupils`)

## Data Structure

### Main Nodes

1. **Establishment** - The primary school/educational institution node
   - Properties: URN, name, address, capacity, pupil counts, etc.
   - Unique identifier: `urn` property

2. **LocalAuthority** - Local authority governing the establishment
   - Properties: code, name
   - Relationship: `HAS_LOCAL_AUTHORITY`

3. **EstablishmentType** - Type of educational establishment
   - Properties: code, name
   - Relationship: `HAS_ESTABLISHMENT_TYPE`

4. **EstablishmentTypeGroup** - Group classification of establishment types
   - Properties: code, name
   - Relationship: `HAS_ESTABLISHMENT_TYPE_GROUP`

5. **PhaseOfEducation** - Educational phase (Primary, Secondary, etc.)
   - Properties: code, name
   - Relationship: `HAS_PHASE_OF_EDUCATION`

6. **GenderType** - Gender classification of the establishment
   - Properties: code, name
   - Relationship: `HAS_GENDER_TYPE`

7. **ReligiousCharacter** - Religious character of the establishment
   - Properties: code, name
   - Relationship: `HAS_RELIGIOUS_CHARACTER`

8. **Diocese** - Religious diocese (if applicable)
   - Properties: code, name
   - Relationship: `HAS_DIOCESE`

9. **GovernmentOfficeRegion** - Government office region
   - Properties: code, name
   - Relationship: `HAS_GOVERNMENT_OFFICE_REGION`

10. **DistrictAdministrative** - Administrative district
    - Properties: code, name
    - Relationship: `HAS_DISTRICT_ADMINISTRATIVE`

11. **MSOA** - Middle Super Output Area
    - Properties: code, name
    - Relationship: `HAS_MSOA`

12. **LSOA** - Lower Super Output Area
    - Properties: code, name
    - Relationship: `HAS_LSOA`

13. **Country** - Country of the establishment
    - Properties: name
    - Relationship: `HAS_COUNTRY`

## Usage

### Command Line

You can run the GAIS data import using the startup script:

```bash
# Import GAIS data
./start.sh gais-data

# Or directly with Python
python main.py --mode gais-data
```

### Programmatic Usage

```python
from run.initialization.gais_data import import_gais_data

# Import the data
result = import_gais_data()

if result["success"]:
    print(f"Successfully imported {result['total_rows']} records")
    print(f"Processing time: {result['processing_time']:.2f} seconds")
    print(f"Nodes created: {result['nodes_created']}")
    print(f"Relationships created: {result['relationships_created']}")
else:
    print(f"Import failed: {result['message']}")
```

## Data Sources

### Edubase All Data

The primary data source is the Edubase All Data CSV file, which contains information about all educational establishments in England and Wales.
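
As a rough illustration of working with a CSV of this kind, the sketch below streams rows with Python's standard `csv` module and looks up a record by URN. The column names `URN` and `EstablishmentName` are assumptions about the header, the helper names `iter_rows` and `find_by_urn` are hypothetical, and the sample row is made up; none of this is taken from the importer itself.

```python
import csv
import io
from typing import Dict, Iterable, Iterator, Optional

# Assumed column name; check the real CSV header before relying on it.
URN_COLUMN = "URN"


def iter_rows(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield each CSV row as a dict keyed by the header row."""
    yield from csv.DictReader(lines)


def find_by_urn(lines: Iterable[str], urn: str) -> Optional[Dict[str, str]]:
    """Return the first row whose URN column matches, or None."""
    for row in iter_rows(lines):
        if (row.get(URN_COLUMN) or "").strip() == urn:
            return row
    return None


# Tiny in-memory sample standing in for the real file (made-up data).
sample = io.StringIO("URN,EstablishmentName\n100000,Example Primary School\n")
match = find_by_urn(sample, "100000")
print(match["EstablishmentName"] if match else "not found")
```

For the real file you would pass an open file handle instead of the in-memory sample, opened with `newline=""` as the `csv` module recommends. The correct text encoding depends on how the CSV was exported, so it is worth checking rather than assuming.
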
**File Location**: `run/initialization/import/edubasealldata20250828.csv`

**Data Volume**: Approximately 51,900 records

**Key Fields**:

- URN (Unique Reference Number)
- Establishment details (name, type, status)
- Geographic information (address, coordinates, administrative areas)
- Educational characteristics (phase, gender, religious character)
- Capacity and pupil numbers
- Contact information
- Inspection details

## Data Processing

### Batch Processing

The import system processes data in batches to optimize performance and memory usage:

- **Batch Size**: 1,000 records per batch
- **Processing**: Nodes are created first, then relationships
- **Error Handling**: Individual record failures don't stop the entire import

### Data Validation

The system automatically handles:

- Empty/blank values (excluded from node properties)
- "Not applicable" values (treated as empty)
- Date format conversion (DD-MM-YYYY to ISO format)
- Numeric value parsing
- Duplicate node prevention

### Relationship Creation

Relationships are created using a two-pass approach:

1. **First Pass**: Create all nodes and build a mapping of keys to node objects
2. **Second Pass**: Create relationships between nodes using the mapping

## Performance Considerations

- **Memory Usage**: Data is processed in batches to minimize memory footprint
- **Database Connections**: Uses connection pooling for efficient database access
- **Duplicate Prevention**: Tracks created nodes to avoid duplicates
- **Error Resilience**: Continues processing even if individual records fail

## Future Enhancements

The GAIS import system is designed to be extensible for additional data sources:

1. **Governance Data** - School governance and management information
2. **Links Data** - Relationships between schools and other entities
3. **Groups Data** - Multi-academy trusts and federations
4. **Additional Sources** - Other publicly available educational datasets

## Troubleshooting

### Common Issues

1. **File Not Found**: Ensure the Edubase CSV file is in the correct location
2. **Database Connection**: Verify Neo4j is running and accessible
3. **Memory Issues**: Reduce batch size if processing large datasets
4. **Permission Errors**: Check file permissions for the CSV data file

### Logging

The system provides comprehensive logging:

- Import progress updates
- Error details for failed records
- Performance metrics
- Node and relationship creation counts

### Testing

Use the test script to verify functionality:

```bash
python test_gais_import.py
```

## Data Quality

The import system maintains data quality by:

- Filtering out invalid or empty values
- Converting data types appropriately
- Maintaining referential integrity
- Providing detailed error reporting

## Schema Compatibility

The imported data is compatible with the existing ClassroomCopilot schema and can be integrated with:

- Calendar structures
- User management systems
- Educational content management
- Analytics and reporting tools

## Support

For issues or questions related to the GAIS data import:

1. Check the logs for detailed error information
2. Verify data file format and content
3. Ensure database connectivity and permissions
4. Review the Neo4j schema constraints and indexes
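
When verifying data file content, it can help to reproduce the rules from the Data Validation and Batch Processing sections by hand. The following is a minimal sketch of those rules, not the importer's actual code: the helper names `clean_value`, `to_iso_date`, `clean_row`, and `batched` are hypothetical, and the exact set of strings treated as empty is an assumption.

```python
from datetime import datetime
from typing import Dict, Iterable, Iterator, List, Optional

# Assumption: values that count as "empty" after trimming and lower-casing.
EMPTY_MARKERS = {"", "not applicable"}


def clean_value(raw: Optional[str]) -> Optional[str]:
    """Return the trimmed value, or None for blank / "Not applicable"."""
    if raw is None:
        return None
    value = raw.strip()
    return None if value.lower() in EMPTY_MARKERS else value


def to_iso_date(raw: str) -> str:
    """Convert a DD-MM-YYYY string to ISO 8601 (YYYY-MM-DD)."""
    return datetime.strptime(raw, "%d-%m-%Y").date().isoformat()


def clean_row(row: Dict[str, Optional[str]]) -> Dict[str, str]:
    """Drop empty values so they never become node properties."""
    cleaned: Dict[str, str] = {}
    for key, raw in row.items():
        value = clean_value(raw)
        if value is not None:
            cleaned[key] = value
    return cleaned


def batched(rows: Iterable[dict], size: int = 1000) -> Iterator[List[dict]]:
    """Group rows into lists of at most `size` items."""
    batch: List[dict] = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly shorter, batch.
        yield batch
```

Running a handful of suspect rows through helpers like these makes it easier to tell whether a failed record is a data problem (for example, a date that is not in DD-MM-YYYY form raises `ValueError`) or a database problem.
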