GAIS Data Import for ClassroomCopilot
This document describes the GAIS (Get All Information from Schools) data import functionality for ClassroomCopilot, which allows you to import publicly available school databases into the Neo4j database.
Overview
The GAIS data import system is designed to import publicly available educational data from various sources, starting with Edubase All Data. The system follows Neo4j naming conventions and creates a comprehensive graph structure representing schools and their relationships.
Neo4j Naming Conventions
The import system adheres to the following Neo4j naming conventions:
- Node Labels: PascalCase (e.g., Establishment, LocalAuthority, EstablishmentType)
- Relationships: HAS_ prefix followed by the target node label (e.g., HAS_LOCAL_AUTHORITY, HAS_ESTABLISHMENT_TYPE)
- Properties: camelCase (e.g., establishmentName, schoolCapacity, numberOfPupils)
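The conventions above can be sketched as two small helpers. This is only an illustration: the function names `relationship_type` and `property_name` are hypothetical and are not taken from the ClassroomCopilot code, and the second helper assumes the source headers are PascalCase or all-caps acronyms.

```python
import re


def relationship_type(label: str) -> str:
    """Derive a relationship type from a PascalCase node label.

    'LocalAuthority' -> 'HAS_LOCAL_AUTHORITY'; acronyms such as 'MSOA' stay intact.
    """
    # Insert an underscore wherever a lowercase letter or digit is followed by a capital.
    snake = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", label)
    return "HAS_" + snake.upper()


def property_name(header: str) -> str:
    """Convert a PascalCase header to camelCase; all-caps acronyms become lowercase."""
    if not header:
        return header
    if header.isupper():
        return header.lower()
    return header[0].lower() + header[1:]


print(relationship_type("EstablishmentTypeGroup"))
print(property_name("SchoolCapacity"))
```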
Data Structure
Main Nodes
- Establishment - The primary school/educational institution node
  - Properties: URN, name, address, capacity, pupil counts, etc.
  - Unique identifier: urn property
- LocalAuthority - Local authority governing the establishment
  - Properties: code, name
  - Relationship: HAS_LOCAL_AUTHORITY
- EstablishmentType - Type of educational establishment
  - Properties: code, name
  - Relationship: HAS_ESTABLISHMENT_TYPE
- EstablishmentTypeGroup - Group classification of establishment types
  - Properties: code, name
  - Relationship: HAS_ESTABLISHMENT_TYPE_GROUP
- PhaseOfEducation - Educational phase (Primary, Secondary, etc.)
  - Properties: code, name
  - Relationship: HAS_PHASE_OF_EDUCATION
- GenderType - Gender classification of the establishment
  - Properties: code, name
  - Relationship: HAS_GENDER_TYPE
- ReligiousCharacter - Religious character of the establishment
  - Properties: code, name
  - Relationship: HAS_RELIGIOUS_CHARACTER
- Diocese - Religious diocese (if applicable)
  - Properties: code, name
  - Relationship: HAS_DIOCESE
- GovernmentOfficeRegion - Government office region
  - Properties: code, name
  - Relationship: HAS_GOVERNMENT_OFFICE_REGION
- DistrictAdministrative - Administrative district
  - Properties: code, name
  - Relationship: HAS_DISTRICT_ADMINISTRATIVE
- MSOA - Middle Super Output Area
  - Properties: code, name
  - Relationship: HAS_MSOA
- LSOA - Lower Super Output Area
  - Properties: code, name
  - Relationship: HAS_LSOA
- Country - Country of the establishment
  - Properties: name
  - Relationship: HAS_COUNTRY
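To make the structure above concrete, the sketch below builds a parameterised Cypher statement that links an Establishment to one of its lookup nodes. It is an illustration under assumptions, not the importer's actual query: the helper name `build_link_query`, the relationship direction (Establishment to lookup node) and the use of `code` as the lookup key are all assumed. Country is left out because it is listed with a name property only.

```python
# Allow-list of lookup labels and their relationship types, taken from the list above.
# Labels cannot be passed as Cypher parameters, so they are interpolated into the
# query text; restricting them to this mapping keeps arbitrary text out of the query.
LOOKUP_RELATIONSHIPS = {
    "LocalAuthority": "HAS_LOCAL_AUTHORITY",
    "EstablishmentType": "HAS_ESTABLISHMENT_TYPE",
    "EstablishmentTypeGroup": "HAS_ESTABLISHMENT_TYPE_GROUP",
    "PhaseOfEducation": "HAS_PHASE_OF_EDUCATION",
    "GenderType": "HAS_GENDER_TYPE",
    "ReligiousCharacter": "HAS_RELIGIOUS_CHARACTER",
    "Diocese": "HAS_DIOCESE",
    "GovernmentOfficeRegion": "HAS_GOVERNMENT_OFFICE_REGION",
    "DistrictAdministrative": "HAS_DISTRICT_ADMINISTRATIVE",
    "MSOA": "HAS_MSOA",
    "LSOA": "HAS_LSOA",
}


def build_link_query(target_label: str) -> str:
    """Return Cypher that merges both nodes and the relationship between them.

    Expects the parameters $urn, $code and $name at execution time.
    Raises KeyError for labels outside the allow-list.
    """
    rel_type = LOOKUP_RELATIONSHIPS[target_label]
    return (
        "MERGE (e:Establishment {urn: $urn}) "
        f"MERGE (t:{target_label} {{code: $code}}) "
        "SET t.name = $name "
        f"MERGE (e)-[:{rel_type}]->(t)"
    )


print(build_link_query("LocalAuthority"))
```

Using MERGE rather than CREATE means running the same statement twice does not produce duplicate nodes or relationships.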
Usage
Command Line
You can run the GAIS data import using the startup script:
# Import GAIS data
./start.sh gais-data
# Or directly with Python
python main.py --mode gais-data
Programmatic Usage
from run.initialization.gais_data import import_gais_data

# Import the data
result = import_gais_data()

if result["success"]:
    print(f"Successfully imported {result['total_rows']} records")
    print(f"Processing time: {result['processing_time']:.2f} seconds")
    print(f"Nodes created: {result['nodes_created']}")
    print(f"Relationships created: {result['relationships_created']}")
else:
    print(f"Import failed: {result['message']}")
Data Sources
Edubase All Data
The primary data source is the Edubase All Data CSV file, which contains information about all educational establishments in England and Wales.
File Location: run/initialization/import/edubasealldata20250828.csv
Data Volume: Approximately 51,900 records
Key Fields:
- URN (Unique Reference Number)
- Establishment details (name, type, status)
- Geographic information (address, coordinates, administrative areas)
- Educational characteristics (phase, gender, religious character)
- Capacity and pupil numbers
- Contact information
- Inspection details
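As a rough sketch of how such a file can be read row by row, the example below uses the standard library `csv.DictReader`. The helper name `iter_establishments` is hypothetical, the header names `URN` and `EstablishmentName` are assumed, and the two sample rows are made up for illustration.

```python
import csv
import io
from typing import Dict, Iterable, Iterator


def iter_establishments(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per CSV row, keyed by the header names."""
    for row in csv.DictReader(lines):
        # Skip rows without a URN, since the URN is the unique identifier.
        if (row.get("URN") or "").strip():
            yield row


# Tiny in-memory sample standing in for the real file (values are made up).
sample = io.StringIO(
    "URN,EstablishmentName\n"
    "100000,Example Primary School\n"
    ",Row Without Urn\n"
)
rows = list(iter_establishments(sample))
print(len(rows), rows[0]["EstablishmentName"])
```

With the real file, pass an open file handle instead of the sample, for example `open(path, newline="", encoding=...)` with whichever encoding the downloaded file uses.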
Data Processing
Batch Processing
The import system processes data in batches to optimize performance and memory usage:
- Batch Size: 1,000 records per batch
- Processing: Nodes are created first, then relationships
- Error Handling: Individual record failures don't stop the entire import
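The batching and error-handling behaviour described above might look roughly like this. It is a minimal sketch, not the importer's code: `batched`, `process_in_batches` and `flaky_handler` are hypothetical names, and a real implementation would typically write each batch to Neo4j in one transaction at the marked point.

```python
import logging
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Tuple

logger = logging.getLogger("gais_import_sketch")


def batched(records: Iterable, size: int = 1000) -> Iterator[List]:
    """Yield consecutive lists of at most `size` records."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def process_in_batches(
    records: Iterable, handle_record: Callable, batch_size: int = 1000
) -> Tuple[int, int]:
    """Process every record; count failures instead of aborting the import."""
    succeeded = failed = 0
    for batch in batched(records, batch_size):
        for record in batch:
            try:
                handle_record(record)
                succeeded += 1
            except Exception as exc:  # one bad record must not stop the run
                failed += 1
                logger.warning("Skipping record %r: %s", record, exc)
        # A real importer would commit the batch to the database here.
    return succeeded, failed


def flaky_handler(record: int) -> None:
    """Stand-in handler that rejects every tenth record."""
    if record % 10 == 0:
        raise ValueError("simulated bad record")


succeeded, failed = process_in_batches(range(100), flaky_handler, batch_size=30)
print(succeeded, failed)
```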
Data Validation
The system automatically handles:
- Empty/blank values (excluded from node properties)
- "Not applicable" values (treated as empty)
- Date format conversion (DD-MM-YYYY to ISO 8601, i.e. YYYY-MM-DD)
- Numeric value parsing
- Duplicate node prevention
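The cleaning rules listed above could be implemented along these lines. The function names are hypothetical and the exact set of values treated as empty is an assumption based on the list, so treat this as a sketch rather than the importer's actual logic.

```python
from datetime import datetime
from typing import Dict, Optional, Union

# Values treated as "no data"; compared case-insensitively after stripping.
EMPTY_MARKERS = {"", "not applicable"}


def clean_text(value: Optional[str]) -> Optional[str]:
    """Return the stripped value, or None for blank / 'Not applicable' input."""
    if value is None:
        return None
    stripped = value.strip()
    return None if stripped.lower() in EMPTY_MARKERS else stripped


def parse_date(value: Optional[str]) -> Optional[str]:
    """Convert DD-MM-YYYY to an ISO 8601 date string; None if empty or invalid."""
    cleaned = clean_text(value)
    if cleaned is None:
        return None
    try:
        return datetime.strptime(cleaned, "%d-%m-%Y").date().isoformat()
    except ValueError:
        return None


def parse_number(value: Optional[str]) -> Union[int, float, None]:
    """Parse an integer if possible, otherwise a float; None if neither works."""
    cleaned = clean_text(value)
    if cleaned is None:
        return None
    for convert in (int, float):
        try:
            return convert(cleaned)
        except ValueError:
            continue
    return None


def clean_properties(raw: Dict[str, Optional[str]]) -> Dict[str, str]:
    """Drop empty values so they never become node properties."""
    cleaned = {key: clean_text(value) for key, value in raw.items()}
    return {key: value for key, value in cleaned.items() if value is not None}


print(parse_date("28-08-2025"), parse_number("210"))
```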
Relationship Creation
Relationships are created using a two-pass approach:
- First Pass: Create all nodes and build a mapping of keys to node objects
- Second Pass: Create relationships between nodes using the mapping
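The two passes can be illustrated in memory, without a database. In this sketch the key format `(label, identifier)`, the row field names and the function name `build_graph` are all assumptions made for the example; the sample rows are made up.

```python
from typing import Dict, List, Optional, Tuple

NodeKey = Tuple[str, Optional[str]]  # (label, identifying value)
Relationship = Tuple[NodeKey, str, NodeKey]


def build_graph(rows: List[dict]) -> Tuple[Dict[NodeKey, dict], List[Relationship]]:
    """Create nodes first, then relationships that refer to them by key."""
    nodes: Dict[NodeKey, dict] = {}

    # First pass: create each node once; setdefault prevents duplicates.
    for row in rows:
        nodes.setdefault(
            ("Establishment", row["urn"]),
            {"urn": row["urn"], "establishmentName": row["name"]},
        )
        if row.get("laCode"):
            nodes.setdefault(
                ("LocalAuthority", row["laCode"]),
                {"code": row["laCode"], "name": row.get("laName")},
            )

    # Second pass: link nodes using the mapping built above.
    relationships: List[Relationship] = []
    for row in rows:
        source = ("Establishment", row["urn"])
        target = ("LocalAuthority", row.get("laCode"))
        if target in nodes:
            relationships.append((source, "HAS_LOCAL_AUTHORITY", target))
    return nodes, relationships


sample_rows = [
    {"urn": "100000", "name": "School A", "laCode": "201", "laName": "Example LA"},
    {"urn": "100001", "name": "School B", "laCode": "201", "laName": "Example LA"},
    {"urn": "100002", "name": "School C", "laCode": None},
]
nodes, relationships = build_graph(sample_rows)
print(len(nodes), len(relationships))
```

Separating the passes means every relationship endpoint is known to exist before any relationship is created.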
Performance Considerations
- Memory Usage: Data is processed in batches to minimize memory footprint
- Database Connections: Uses connection pooling for efficient database access
- Duplicate Prevention: Tracks created nodes to avoid duplicates
- Error Resilience: Continues processing even if individual records fail
Future Enhancements
The GAIS import system is designed to be extensible for additional data sources:
- Governance Data - School governance and management information
- Links Data - Relationships between schools and other entities
- Groups Data - Multi-academy trusts and federations
- Additional Sources - Other publicly available educational datasets
Troubleshooting
Common Issues
- File Not Found: Ensure the Edubase CSV file is in the correct location
- Database Connection: Verify Neo4j is running and accessible
- Memory Issues: Reduce batch size if processing large datasets
- Permission Errors: Check file permissions for the CSV data file
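The file-related issues above can be checked before starting an import. The helper below is a hypothetical pre-flight check, not part of the documented tooling; it covers only the file checks, not database connectivity.

```python
import os
from pathlib import Path
from typing import List


def preflight_problems(csv_path: Path) -> List[str]:
    """Return human-readable problems with the data file; empty list if none."""
    problems: List[str] = []
    if not csv_path.is_file():
        problems.append(f"File not found: {csv_path}")
    elif not os.access(csv_path, os.R_OK):
        problems.append(f"File is not readable (check permissions): {csv_path}")
    return problems


# Path taken from the Data Sources section; adjust to your checkout location.
for problem in preflight_problems(
    Path("run/initialization/import/edubasealldata20250828.csv")
):
    print(problem)
```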
Logging
The system provides comprehensive logging:
- Import progress updates
- Error details for failed records
- Performance metrics
- Node and relationship creation counts
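A progress message of the kind listed above could be produced as follows. The logger name, message format and function name are assumptions for illustration; the importer's real log output may look different.

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger("gais_import_sketch")


def log_progress(processed: int, total: int, nodes: int, relationships: int) -> str:
    """Log and return a one-line progress summary."""
    percent = 100 * processed / total if total else 0.0
    message = (
        f"Processed {processed}/{total} rows ({percent:.1f}%), "
        f"nodes={nodes}, relationships={relationships}"
    )
    logger.info(message)
    return message


log_progress(500, 1000, 620, 4800)
```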
Testing
Use the test script to verify functionality:
python test_gais_import.py
Data Quality
The import system maintains data quality by:
- Filtering out invalid or empty values
- Converting data types appropriately
- Maintaining referential integrity
- Providing detailed error reporting
Schema Compatibility
The imported data is compatible with the existing ClassroomCopilot schema and can be integrated with:
- Calendar structures
- User management systems
- Educational content management
- Analytics and reporting tools
Support
For issues or questions related to the GAIS data import:
- Check the logs for detailed error information
- Verify data file format and content
- Ensure database connectivity and permissions
- Review the Neo4j schema constraints and indexes
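When reviewing constraints, a uniqueness constraint on the Establishment urn property is the kind of statement to look for. The sketch below only generates the Cypher text; the constraint naming scheme and the helper `uniqueness_constraint` are assumptions, and the syntax shown is the `FOR ... REQUIRE` form used by recent Neo4j versions, so check it against the version you run.

```python
def uniqueness_constraint(label: str, prop: str) -> str:
    """Build a Cypher statement enforcing a unique property on a label."""
    name = f"{label.lower()}_{prop.lower()}_unique"
    return (
        f"CREATE CONSTRAINT {name} IF NOT EXISTS "
        f"FOR (n:{label}) REQUIRE n.{prop} IS UNIQUE"
    )


print(uniqueness_constraint("Establishment", "urn"))
```

Existing constraints can be listed in Neo4j with `SHOW CONSTRAINTS` for comparison.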